PROPN: proper noun

Definition

A proper noun is a noun that is the name of a specific individual, place, or object. Czech proper nouns are always written with an initial uppercase letter. Note that the names of the days of the week (pondělí, úterý, středa, čtvrtek, pátek, sobota, neděle) and the names of the months (leden, únor, březen, duben, květen, červen, červenec, srpen, září, říjen, listopad, prosinec) are not capitalized (unlike in English) and are not considered proper nouns.

Single-word named entities should be tagged PROPN even if they originate from a common noun (Zajíc, Huť) or an adjective (Veselý, Teplá). Even if they were originally adjectives and inflect according to adjectival paradigms, they behave syntactically as nouns. For instance, Teplá (a river and city in western Bohemia) is originally the feminine form of the adjective teplý “warm”, but as a geographical name it is a noun. It denotes a concrete location (rather than a property of somebody or something) and its feminine gender is fixed (whereas adjectives have forms in all three genders).

Note that names of languages (čeština, angličtina) and adjectives derived from geographical names (český, anglický  “Czech, English”) are written in lowercase and are not tagged PROPN.

Personal names are typically treated as a sequence of proper nouns (one or more given names and one or more surnames). If the name contains prepositions, conjunctions or articles (foreign names and old Czech names), these are tagged as ADP, CONJ and DET, respectively.

Czech (and other Slavic) multi-word named entities have internal syntactic structure, which is preserved in the annotation. The headword is always a noun, and other nouns may be involved. These nouns are tagged either PROPN or NOUN, and possible ambiguities must be resolved individually. Modifying adjectives are never tagged PROPN. Even if an adjective is the first word of a multi-word name, and thus starts with an uppercase letter, it is still tagged ADJ. Similarly, function words in named entities retain their normal tags. These rules are less strict for foreign named entities, where the original part of speech is not apparent to a Czech speaker.

Examples

Conversion from the Prague Dependency Treebank

The PDT set of morphological (part-of-speech) tags does not distinguish between common and proper nouns. However, lemmas in PDT contain additional features that also encode types of named entities. When the PDT annotation is converted to UD, these lemma features are removed, the PROPN tag is used, and the feature cs-feat/NameType is added to the universal features to preserve the type. Only nouns are treated this way. Foreign adjectives are not converted to PROPN, even though they entered Czech as parts of foreign names and their lemmas contain the name type feature.

The following table lists the name types together with the most frequent examples. See http://ufal.mff.cuni.cz/techrep/tr27.pdf, page 8, section 2.1 (Lemma structure) for more details.

Lemma suffix | Name type | Most frequent examples | English equivalents
_;Y | given name | Jan, Jiří, Václav, Petr, Josef | “Jan, Jiří, Václav, Petr, Josef”
_;S | surname | Klaus, Havel, Němec, Jelcin, Svoboda | “Klaus, Havel, Němec, Yeltsin, Svoboda”
_;E | member of a particular nation, inhabitant of a particular territory | Němec, Čech, Srb, Američan, Slovák | “German, Czech, Serbian, American, Slovak”
_;G | geographical name | Praha, ČR, Evropa, Německo, Brno | “Prague, CR, Europe, Germany, Brno”
_;K | company, organization, institution | ODS, OSN, Sparta, ODA, Slavia | “ODS, UN, Sparta, ODA, Slavia”
_;R | product | LN, Mercedes, Tatra, PC, MF | “LN, Mercedes, Tatra, PC, MF”
_;m | other proper name: names of mines, stadiums, guerilla bases etc. | US, PVP, Prix, Rapaport, Tour | “US, PVP, Prix, Rapaport, Tour”
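
The conversion described above can be sketched roughly as follows. This is only an illustration, not the actual conversion script: the function name `convert_noun_lemma` is hypothetical, the mapping from PDT letters to NameType values is an assumption inferred from the table and from the NameType values listed in the statistics on this page, and the example lemmas are illustrative inputs. Other PDT lemma annotations are deliberately left untouched.

```python
import re

# Assumed mapping from PDT name-type letters to UD NameType values
# (inferred from the table above, not taken from the conversion script).
PDT_TO_NAMETYPE = {
    "Y": "Giv",  # given name
    "S": "Sur",  # surname
    "E": "Nat",  # member of a nation, inhabitant of a territory
    "G": "Geo",  # geographical name
    "K": "Com",  # company, organization, institution
    "R": "Pro",  # product
    "m": "Oth",  # other proper name
}

# Matches only the name-type markers listed in the table, e.g. "_;G".
_NAME_MARK = re.compile(r"_;([YSEGKRm])")


def convert_noun_lemma(pdt_lemma):
    """Return (plain lemma, UPOS tag, NameType value or None) for a noun lemma."""
    letters = _NAME_MARK.findall(pdt_lemma)
    name_types = sorted({PDT_TO_NAMETYPE[letter] for letter in letters})
    plain_lemma = _NAME_MARK.sub("", pdt_lemma)
    if name_types:
        # Several name types are joined with commas, as in NameType=Nat,Sur.
        return plain_lemma, "PROPN", ",".join(name_types)
    return plain_lemma, "NOUN", None


print(convert_noun_lemma("Praha_;G"))
print(convert_noun_lemma("Němec_;E_;S"))
print(convert_noun_lemma("hrad"))
```

The sketch assumes its input is already known to be a noun, since only nouns are converted to PROPN.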

Diffs

Prague Dependency Treebank

Articles in foreign names (the, die, le)  are tagged ADJ, not DET. Otherwise, the morphological analysis usually includes the original part of speech of foreign words.



Treebank Statistics (UD_Czech)

There are 15254 PROPN lemmas (26%), 21954 PROPN types (17%) and 84031 PROPN tokens (6%). Out of 17 observed tags, the rank of PROPN is: 2 in number of lemmas, 4 in number of types and 6 in number of tokens.

The 10 most frequent PROPN lemmas: Praha, ČR, Evropa, LN, Jan, Jiří, Německo, Brno, ODS, USA

The 10 most frequent PROPN types: Praha, ČR, Praze, LN, ODS, USA, J, Jiří, Jan, OSN

The 10 most frequent ambiguous lemmas: J (PROPN 422, ADJ 30), M (PROPN 244, NOUN 8, ADJ 1), V (PROPN 210, NUM 23, NOUN 7, ADJ 5), A (PROPN 172, NOUN 8, ADJ 8), York (PROPN 165, ADJ 5), P (PROPN 136, ADJ 4, NOUN 2), S (PROPN 116, ADJ 12, NOUN 2), Washington (PROPN 111, ADJ 1), r (NOUN 55, PROPN 1, ADV 1), F (PROPN 99, NOUN 12, ADJ 10)

The 10 most frequent ambiguous types: J (PROPN 422, ADJ 30, NOUN 3), M (PROPN 244, NOUN 51, X 3, ADJ 1), V (ADP 3736, PROPN 210, NUM 23, NOUN 15, ADJ 6, ADV 2), A (CONJ 1042, PROPN 172, NOUN 93, ADJ 19, X 4), Rusko (PROPN 163, ADJ 3), Německo (PROPN 144, ADJ 2), P (PROPN 136, NOUN 124, ADJ 17, ADP 1), S (ADP 470, PROPN 117, NOUN 38, ADJ 14, X 3), r (NOUN 433, PROPN 1, ADV 1), F (PROPN 99, NOUN 27, ADJ 10)

Morphology

The form / lemma ratio of PROPN is 1.439229 (the average of all parts of speech is 2.195950).
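
The reported ratio appears to be the number of PROPN types divided by the number of PROPN lemmas. That reading is an inference from the figures on this page rather than something the page states; the short check below reproduces both reported values under that assumption. The function name is hypothetical.

```python
def form_lemma_ratio(n_types, n_lemmas):
    """Average number of distinct word forms (types) per lemma."""
    return n_types / n_lemmas


# Type and lemma counts as reported in the statistics on this page.
ud_czech = form_lemma_ratio(21954, 15254)
ud_czech_cac = form_lemma_ratio(4372, 3447)

print(f"UD_Czech:     {ud_czech:.6f}")
print(f"UD_Czech-CAC: {ud_czech_cac:.6f}")
```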

The 1st highest number of forms (11) was observed with the lemma “Čech”: ČECH, ČEŠI, Čech, Čecha, Čechem, Čechovi, Čechy, Čechů, Čechům, Češi, Češích.

The 2nd highest number of forms (10) was observed with the lemma “Jan”: JAN, JANA, Jan, Jana, Janem, Janovi, Janové, Janu, Jany, Janů.

The 3rd highest number of forms (10) was observed with the lemma “Němec”: NĚMCI, NĚMCŮ, NĚMEC, Němce, Němcem, Němci, Němcích, Němců, Němcům, Němec.

PROPN occurs with 9 features: cs-feat/NameType (84031; 100% instances), cs-feat/Negative (84031; 100% instances), cs-feat/Gender (82083; 98% instances), cs-feat/Number (68761; 82% instances), cs-feat/Case (66478; 79% instances), cs-feat/Animacy (48949; 58% instances), cs-feat/Abbr (13042; 16% instances), cs-feat/Foreign (3684; 4% instances), cs-feat/Style (28; 0% instances)

PROPN occurs with 44 feature-value pairs: Abbr=Yes, Animacy=Anim, Animacy=Inan, Case=Acc, Case=Dat, Case=Gen, Case=Ins, Case=Loc, Case=Nom, Case=Voc, Foreign=Foreign, Gender=Fem, Gender=Masc, Gender=Neut, NameType=Com, NameType=Com,Geo, NameType=Com,Giv, NameType=Com,Giv,Sur, NameType=Com,Nat, NameType=Com,Pro, NameType=Com,Sur, NameType=Geo, NameType=Geo,Giv, NameType=Geo,Giv,Sur, NameType=Geo,Oth, NameType=Geo,Pro, NameType=Geo,Sur, NameType=Giv, NameType=Giv,Nat, NameType=Giv,Oth, NameType=Giv,Pro, NameType=Giv,Pro,Sur, NameType=Giv,Sur, NameType=Nat, NameType=Nat,Sur, NameType=Oth, NameType=Pro, NameType=Pro,Sur, NameType=Sur, Negative=Pos, Number=Plur, Number=Sing, Style=Arch, Style=Coll

PROPN occurs with 561 feature combinations. The most frequent feature combination is Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Negative=Pos|Number=Sing (14100 tokens). Examples: Klaus, Havel, Svoboda, Mečiar, Jelcin, John, Zeman, Němec, Novák, Benda
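
A feature combination such as the one above is a string of `Name=Value` pairs joined by `|`, where a single feature may carry several comma-separated values (for example `NameType=Com,Geo`). A minimal sketch of reading such a string into a dictionary follows; the helper name `parse_feats` is hypothetical, and treating `_` as "no features" is an assumption based on the usual CoNLL-U convention.

```python
def parse_feats(feats):
    """Split a feature string into a dict mapping feature name to a list of values."""
    result = {}
    if feats in ("", "_"):  # assumed marker for an empty feature set
        return result
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        result[name] = value.split(",")
    return result


most_frequent = parse_feats(
    "Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Negative=Pos|Number=Sing"
)
print(most_frequent["NameType"])
print(parse_feats("NameType=Com,Geo")["NameType"])
```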

Relations

PROPN nodes are attached to their parents using 24 different relations: cs-dep/nmod (33493; 40% instances), cs-dep/nsubj (14713; 18% instances), cs-dep/name (13595; 16% instances), cs-dep/conj (8174; 10% instances), cs-dep/root (5373; 6% instances), cs-dep/dep (2878; 3% instances), cs-dep/dobj (2688; 3% instances), cs-dep/appos (1361; 2% instances), cs-dep/iobj (712; 1% instances), cs-dep/foreign (431; 1% instances), cs-dep/nsubjpass (351; 0% instances), cs-dep/advcl (155; 0% instances), cs-dep/xcomp (32; 0% instances), cs-dep/vocative (21; 0% instances), cs-dep/cc (18; 0% instances), cs-dep/ccomp (8; 0% instances), cs-dep/amod (6; 0% instances), cs-dep/case (6; 0% instances), cs-dep/acl (5; 0% instances), cs-dep/advmod (4; 0% instances), cs-dep/parataxis (3; 0% instances), cs-dep/csubj (2; 0% instances), cs-dep/csubjpass (1; 0% instances), cs-dep/punct (1; 0% instances)

Parents of PROPN nodes belong to 15 different parts of speech: NOUN (26226; 31% instances), PROPN (26088; 31% instances), VERB (23727; 28% instances), ROOT (5373; 6% instances), ADJ (1701; 2% instances), NUM (378; 0% instances), ADV (304; 0% instances), PRON (178; 0% instances), PART (15; 0% instances), ADP (12; 0% instances), DET (12; 0% instances), SYM (6; 0% instances), PUNCT (5; 0% instances), CONJ (4; 0% instances), INTJ (2; 0% instances)

38215 (45%) PROPN nodes are leaves.

23472 (28%) PROPN nodes have one child.

11511 (14%) PROPN nodes have two children.

10833 (13%) PROPN nodes have three or more children.

The highest child degree of a PROPN node is 57.
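
As a quick consistency check, the four child-degree counts above should add up to the total number of PROPN tokens in UD_Czech (84031), and the percentages should follow from the counts. The sketch below assumes the percentages are rounded to the nearest integer; the variable names are illustrative.

```python
# Counts of PROPN nodes by number of children, as reported above.
child_degree_counts = {
    "leaves": 38215,
    "one child": 23472,
    "two children": 11511,
    "three or more children": 10833,
}

total = sum(child_degree_counts.values())
shares = {
    label: round(100 * count / total)
    for label, count in child_degree_counts.items()
}

print(total)
for label, share in shares.items():
    print(f"{label}: {share}%")
```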

Children of PROPN nodes are attached using 29 different relations: cs-dep/punct (19518; 21% instances), cs-dep/case (16403; 18% instances), cs-dep/name (13613; 15% instances), cs-dep/nmod (13234; 14% instances), cs-dep/conj (8383; 9% instances), cs-dep/amod (5140; 6% instances), cs-dep/cc (3926; 4% instances), cs-dep/dep (3377; 4% instances), cs-dep/acl (1647; 2% instances), cs-dep/nummod (1629; 2% instances), cs-dep/foreign (1570; 2% instances), cs-dep/appos (1312; 1% instances), cs-dep/advmod:emph (1226; 1% instances), cs-dep/xcomp (392; 0% instances), cs-dep/mark (325; 0% instances), cs-dep/det (98; 0% instances), cs-dep/advmod (92; 0% instances), cs-dep/parataxis (80; 0% instances), cs-dep/cop (67; 0% instances), cs-dep/nsubj (62; 0% instances), cs-dep/nummod:gov (40; 0% instances), cs-dep/dobj (10; 0% instances), cs-dep/advcl (9; 0% instances), cs-dep/neg (7; 0% instances), cs-dep/det:numgov (5; 0% instances), cs-dep/aux (3; 0% instances), cs-dep/det:nummod (3; 0% instances), cs-dep/ccomp (2; 0% instances), cs-dep/expl (1; 0% instances)

Children of PROPN nodes belong to 16 different parts of speech: PROPN (26088; 28% instances), PUNCT (19520; 21% instances), ADP (16535; 18% instances), NOUN (12757; 14% instances), ADJ (6470; 7% instances), CONJ (4267; 5% instances), NUM (2556; 3% instances), VERB (2061; 2% instances), ADV (1073; 1% instances), SCONJ (335; 0% instances), PRON (198; 0% instances), PART (175; 0% instances), DET (106; 0% instances), SYM (22; 0% instances), INTJ (8; 0% instances), AUX (3; 0% instances)


Treebank Statistics (UD_Czech-CAC)

There are 3447 PROPN lemmas (12%), 4372 PROPN types (7%) and 9814 PROPN tokens (2%). Out of 16 observed tags, the rank of PROPN is: 4 in number of lemmas, 4 in number of types and 10 in number of tokens.

The 10 most frequent PROPN lemmas: Praha, KSČ, ROH, SSSR, ÚJČ, SSM, ČSAV, ČSSR, Československo, Škoda

The 10 most frequent PROPN types: KSČ, ROH, Praze, SSSR, ÚJČ, SSM, Praha, ČSAV, ČSSR, Škoda

The 10 most frequent ambiguous lemmas: hora (NOUN 24, PROPN 19), VB (PROPN 23, NOUN 4), Vyšehrad (PROPN 8, NOUN 1), KRB (PROPN 6, NOUN 1), Janský (PROPN 5, ADJ 3), most (NOUN 42, PROPN 1), KS (PROPN 3, NOUN 3), MP (PROPN 3, NOUN 1), SRPŠ (PROPN 3, NOUN 1), NVP (NOUN 2, PROPN 2)

The 10 most frequent ambiguous types: Praha (PROPN 104, NOUN 1), Škoda (PROPN 66, NOUN 4), Země (PROPN 29, NOUN 6), VB (PROPN 23, NOUN 4), Slunce (PROPN 13, NOUN 2), Svoboda (PROPN 10, NOUN 1), horách (PROPN 5, NOUN 2), Králík (PROPN 9, NOUN 3), Měsíce (PROPN 9, NOUN 4), Karpaty (PROPN 8, NOUN 1)

Morphology

The form / lemma ratio of PROPN is 1.268349 (the average of all parts of speech is 2.206260).

The 1st highest number of forms (6) was observed with the lemma “Honza”: Honza, Honzou, Honzovi, Honzové, Honzu, Honzy.

The 2nd highest number of forms (6) was observed with the lemma “hora”: Hora, hor, horami, hory, horách, horám.

The 3rd highest number of forms (5) was observed with the lemma “Jan”: Jan, Jana, Janem, Janovi, Janu.

PROPN occurs with 9 features: cs-feat/NameType (9814; 100% instances), cs-feat/Negative (9814; 100% instances), cs-feat/Gender (9803; 100% instances), cs-feat/Number (7864; 80% instances), cs-feat/Case (7810; 80% instances), cs-feat/Animacy (5431; 55% instances), cs-feat/Abbr (1878; 19% instances), cs-feat/Foreign (37; 0% instances), cs-feat/Style (2; 0% instances)

PROPN occurs with 38 feature-value pairs: Abbr=Yes, Animacy=Anim, Animacy=Inan, Case=Acc, Case=Dat, Case=Gen, Case=Ins, Case=Loc, Case=Nom, Case=Voc, Foreign=Foreign, Gender=Fem, Gender=Masc, Gender=Neut, NameType=Com, NameType=Com,Geo, NameType=Com,Giv, NameType=Com,Pro, NameType=Com,Sur, NameType=Geo, NameType=Geo,Giv, NameType=Geo,Oth, NameType=Geo,Sur, NameType=Giv, NameType=Giv,Oth, NameType=Giv,Pro, NameType=Giv,Sur, NameType=Nat, NameType=Nat,Sur, NameType=Oth, NameType=Pro, NameType=Pro,Sur, NameType=Sur, Negative=Pos, Number=Plur, Number=Sing, Style=Arch, Style=Coll

PROPN occurs with 237 feature combinations. The most frequent feature combination is Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Negative=Pos|Number=Sing (1589 tokens). Examples: Fučík, Erben, Horálek, Němec, Lenin, Záveský, Kraus, Huxley, Gottwald, Marx

Relations

PROPN nodes are attached to their parents using 17 different relations: cs-dep/nmod (5024; 51% instances), cs-dep/conj (1545; 16% instances), cs-dep/nsubj (1528; 16% instances), cs-dep/name (847; 9% instances), cs-dep/dobj (280; 3% instances), cs-dep/root (187; 2% instances), cs-dep/dep (164; 2% instances), cs-dep/appos (115; 1% instances), cs-dep/iobj (50; 1% instances), cs-dep/nsubjpass (26; 0% instances), cs-dep/advcl (15; 0% instances), cs-dep/xcomp (12; 0% instances), cs-dep/foreign (9; 0% instances), cs-dep/vocative (9; 0% instances), cs-dep/amod (1; 0% instances), cs-dep/cc (1; 0% instances), cs-dep/csubj (1; 0% instances)

Parents of PROPN nodes belong to 12 different parts of speech: NOUN (4203; 43% instances), PROPN (2633; 27% instances), VERB (2445; 25% instances), ADJ (224; 2% instances), ROOT (187; 2% instances), SYM (38; 0% instances), PRON (36; 0% instances), ADV (29; 0% instances), NUM (13; 0% instances), ADP (2; 0% instances), CONJ (2; 0% instances), DET (2; 0% instances)

4622 (47%) PROPN nodes are leaves.

2993 (30%) PROPN nodes have one child.

1246 (13%) PROPN nodes have two children.

953 (10%) PROPN nodes have three or more children.

The highest child degree of a PROPN node is 97.

Children of PROPN nodes are attached using 22 different relations: cs-dep/case (2280; 22% instances), cs-dep/nmod (1962; 19% instances), cs-dep/conj (1572; 15% instances), cs-dep/punct (1439; 14% instances), cs-dep/name (848; 8% instances), cs-dep/cc (657; 6% instances), cs-dep/amod (634; 6% instances), cs-dep/advmod:emph (186; 2% instances), cs-dep/appos (179; 2% instances), cs-dep/acl (158; 2% instances), cs-dep/xcomp (63; 1% instances), cs-dep/dep (59; 1% instances), cs-dep/mark (47; 0% instances), cs-dep/nummod (34; 0% instances), cs-dep/det (32; 0% instances), cs-dep/foreign (17; 0% instances), cs-dep/advmod (8; 0% instances), cs-dep/cop (8; 0% instances), cs-dep/nsubj (7; 0% instances), cs-dep/dobj (6; 0% instances), cs-dep/parataxis (6; 0% instances), cs-dep/nummod:gov (1; 0% instances)

Children of PROPN nodes belong to 14 different parts of speech: PROPN (2633; 26% instances), ADP (2265; 22% instances), PUNCT (1440; 14% instances), NOUN (1356; 13% instances), ADJ (689; 7% instances), CONJ (675; 7% instances), SYM (584; 6% instances), VERB (186; 2% instances), ADV (164; 2% instances), NUM (67; 1% instances), SCONJ (48; 0% instances), PART (39; 0% instances), DET (32; 0% instances), PRON (25; 0% instances)

