
Tokenization

The low-level tokenization of the Czech UD treebank follows the tokenization of the Prague Dependency Treebank 3.0 (PDT).

Words and Tokens

In Czech there are fused words that correspond to multiple syntactic words. The original PDT data use special part-of-speech tags to identify fused words; however, a fused token is not split in PDT and corresponds to just one node in the dependency tree. (Note: aby and kdyby were split in PDT 1.0, but this practice was abandoned in later versions.)

In contrast, the UD format requires that certain types of fused words be split. We say that there is a multi-word token consisting of several syntactic words, each having its own node in the tree (see also universal tokenization).

Preposition + Personal Pronoun on in the Accusative (něj)

This category covers words that would be tagged by the PDT tag P0-------------. However, no such word occurs in the PDT 3.0 data.

Preposition + Interrogative/Relative Pronoun co in the Accusative

This category covers words that would be tagged by the PDT tag PY-------------. No such word occurs in the PDT 3.0 data but there are a few occurrences in the CAC 2.0 data.

Note: There is another analogously fused word, proč “why”. In contrast to the above, proč has grammaticalized into an interrogative/relative adverb. It is more frequent than the three fusions listed above, but it is not used to replace a prepositional object. We do not split it into pro co.

Participle, Pronoun or Subordinating Conjunction + the Auxiliary být in the 2nd Person Singular (jsi)

Note: This rule does not include the words bys, abys and kdybys. They resemble the words above but bys is an independent form of the auxiliary verb být “to be”, and abys and kdybys are in fact fused words, but they were formed using bys, not jsi.

This category does not have its own tag in PDT. The pronouns ses and sis are tagged as P7.* pronouns in the second person. The pronoun tys can be distinguished from ty by the additional verbal features in its tag (PP-S1--2P-AA--- versus PP-S1--2-------). The conjunction žes is tagged J,-S---2-------, while že is tagged J,-------------. The participles can be distinguished by the value of person: a normal participle such as udělal does not inflect for person (VpYS---XR-AA---), while a participle fused with jsi, i.e. udělals, is tagged as second person (VpYS---2R-AA---). None of these occur in the PDT 3.0 data.
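The tag patterns described in this paragraph can be checked mechanically. The following is only a minimal sketch, not part of any official conversion tool: the function name looks_fused_with_jsi is hypothetical, and the sketch assumes 15-character PDT positional tags in which the 8th position encodes person and the 9th encodes tense.

```python
def looks_fused_with_jsi(tag: str) -> bool:
    """Guess whether a PDT positional tag marks a word fused with 'jsi'.

    Assumption: 15-character PDT tags; index 7 is person, index 8 is tense.
    Only the patterns named in the text above are covered.
    """
    if len(tag) != 15:
        return False
    prefix, person, tense = tag[:2], tag[7], tag[8]
    # Conjunctions (žes), past participles (udělals), reflexives (ses, sis):
    # the plain word carries no second-person value, the fused one does.
    if prefix in ("J,", "Vp", "P7"):
        return person == "2"
    # Personal pronoun: 'ty' is already second person, so 'tys' is
    # recognized by its additional verbal features (here: tense).
    if prefix == "PP":
        return person == "2" and tense != "-"
    return False
```

For example, looks_fused_with_jsi("VpYS---2R-AA---") holds for udělals, while the tag of plain udělal, VpYS---XR-AA---, is rejected.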

Subordinating Conjunction aby or kdyby

Note: It is not clear even to a native speaker what exactly the first word should be (aby, až, kdyby or když); in any case, it is a conjunction. However, it is clear that the second word is a conditional form of být.

Heuristic for transforming the tree if only surface tokens are desired as nodes: attach the fused token (e.g. abychom) to the parent of the first part (aby), with the label of the first part. Tag it as a subordinating conjunction and merge the features of both parts:

3-4   abychom   _      _      _                 _                                            _   _      _   _
3     aby       aby   SCONJ   J,-------------   _                                            7   mark   _   _
4     bychom    být   AUX     Vc-P---1-------   Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin   7   aux    _   _

will be transformed to

3     abychom   aby   SCONJ   J,-P---1-------   Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin   6   mark   _   _
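The heuristic can be sketched in Python as follows. This is an illustration under stated assumptions rather than the conversion actually used for the treebank: the function names are hypothetical, each sentence is given as a list of 10-column CoNLL-U rows without comment lines or empty nodes, the merged node keeps the lemma, UPOS, head and relation of the first part, and the XPOS tags are merged by filling the "-" positions of the first tag from the later ones. Word IDs and heads are renumbered because the sentence becomes shorter.

```python
from typing import Dict, List


def merge_xpos(tags: List[str]) -> str:
    """Fill '-' positions of the first tag with values from later tags."""
    merged = list(tags[0])
    for tag in tags[1:]:
        for k, ch in enumerate(tag):
            if k < len(merged) and merged[k] == "-":
                merged[k] = ch
    return "".join(merged)


def merge_feats(feats_list: List[str]) -> str:
    """Union of FEATS columns; the first value seen for a feature wins."""
    feats: Dict[str, str] = {}
    for feats_str in feats_list:
        if feats_str != "_":
            for pair in feats_str.split("|"):
                name, value = pair.split("=", 1)
                feats.setdefault(name, value)
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


def merge_fused_tokens(rows: List[List[str]]) -> List[List[str]]:
    """Replace every multi-word token by one node (assumed row format:
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)."""
    id_map = {0: 0}  # old word ID -> new node ID; 0 is the artificial root
    out: List[List[str]] = []
    new_id = 0
    i = 0
    while i < len(rows):
        row = rows[i]
        new_id += 1
        if "-" in row[0]:  # range line such as "3-4"
            start, end = (int(x) for x in row[0].split("-"))
            parts = rows[i + 1 : i + 2 + end - start]
            for part in parts:
                id_map[int(part[0])] = new_id
            merged = list(parts[0])  # lemma, UPOS, head, label of first part
            merged[1] = row[1]       # surface form of the fused token
            merged[4] = merge_xpos([p[4] for p in parts])
            merged[5] = merge_feats([p[5] for p in parts])
            i += 1 + len(parts)
        else:
            id_map[int(row[0])] = new_id
            merged = list(row)
            i += 1
        merged[0] = str(new_id)
        out.append(merged)
    # Second pass: heads still hold old IDs, so remap them.
    for node in out:
        if node[6] != "_":
            node[6] = str(id_map[int(node[6])])
    return out
```

Applied to a sentence containing the abychom rows above, the sketch yields the single merged row shown, with the head renumbered from 7 to 6. It does not handle the case where the first part is attached to another part of the same fused token.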

Verb + Conjunction neboť

The word forms in this group can be considered archaic.

There is only one occurrence in the PDT 3.0 data of the word neníť “because it is not” (tagged Vt-S---3P-NA--2).