
Introduction

The Czech UD treebank is based on the Prague Dependency Treebank 3.0 (PDT), created at Charles University in Prague. The treebank consists of 87,913 sentences (1.5 M tokens). Its domain is mainly newswire, but it also includes business and popular-science articles from the 1990s. The treebank is licensed under the terms of CC BY-NC-SA 3.0, and its original (non-UD) version can be downloaded from http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3.

The morphological and syntactic annotation of the Czech UD treebank was created by converting the PDT data. The conversion procedure was designed by Dan Zeman.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column Status
ID Sentence segmentation and (surface) tokenization were done automatically and then hand-corrected; see the PDT documentation. Fused tokens were split into syntactic words automatically during the PDT-to-UD conversion.
FORM Identical to Prague Dependency Treebank 3.0 form.
LEMMA Manual selection from the possibilities provided by morphological analysis: two annotators and then an arbiter. The PDT-to-UD conversion stripped from the lemmas the ID numbers that distinguish homonyms, the semantic tags, and the comments; this information is preserved as attributes in the MISC column.
UPOSTAG Converted automatically from XPOSTAG (via Interset), from the semantic tags in the PDT lemma, and occasionally from other information available in the treebank; human checking of patterns revealed by automatic consistency tests.
XPOSTAG Manual selection from possibilities provided by morphological analysis: two annotators and then an arbiter.
FEATS Converted automatically from XPOSTAG (via Interset), from the semantic tags in the PDT lemma, and occasionally from other information available in the treebank; human checking of patterns revealed by automatic consistency tests.
HEAD Original PDT annotation is manual, done by two independent annotators and then an arbiter. Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.
DEPREL Original PDT annotation is manual, done by two independent annotators and then an arbiter. Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.
DEPS — (currently unused)
MISC Information about token spacing taken from PDT annotation. Lemma / word sense IDs, semantic tags and comments on meaning moved here from the PDT lemma.
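
To make the column layout above concrete, here is a minimal Python sketch of how the ten tab-separated CoNLL-U columns might be read. It is an illustration under stated assumptions, not the conversion or loading code used for the treebank. The sample sentence is invented and its tags are simplified (XPOSTAG is left as "_"). The helper names (parse_pairs, read_sentences) are hypothetical, and the MISC key LId is only an assumed example of how a lemma ID could be stored as an attribute. SpaceAfter=No is the standard UD attribute for token spacing.

```python
from typing import Dict, Iterator, List

# Column names as used in the table above, in CoNLL-U order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_pairs(value: str) -> Dict[str, str]:
    """Parse a FEATS or MISC value such as 'A=1|B=2'; '_' means empty."""
    if value == "_":
        return {}
    pairs = {}
    for item in value.split("|"):
        key, _, val = item.partition("=")
        pairs[key] = val
    return pairs


def read_sentences(text: str) -> Iterator[List[Dict[str, object]]]:
    """Yield each sentence as a list of rows, one per syntactic word."""
    sentence: List[Dict[str, object]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# text = ...'
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}")
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        # Range IDs such as '1-2' mark the surface form of a fused token;
        # only plain integer IDs are syntactic words, so ranges are skipped.
        if not fields[0].isdigit():
            continue
        row["FEATS"] = parse_pairs(fields[5])
        row["MISC"] = parse_pairs(fields[9])
        sentence.append(row)
    if sentence:
        yield sentence


# Invented sample (not taken from the treebank); 'LId' is an assumed key.
SAMPLE_ROWS = [
    ["1-2", "Abych", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "Aby", "aby", "SCONJ", "_", "_", "3", "mark", "_", "_"],
    ["2", "bych", "být", "AUX", "_", "Mood=Cnd|Person=1", "3", "aux", "_", "_"],
    ["3", "přišel", "přijít", "VERB", "_", "Number=Sing", "0", "root", "_",
     "SpaceAfter=No|LId=přijít-1"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = "# text = Abych přišel.\n" + "\n".join(
    "\t".join(row) for row in SAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE):
        for row in sent:
            print(row["ID"], row["FORM"], row["LEMMA"], row["MISC"])
```

The sketch keeps HEAD as a string and drops the fused-token range line. A real application would more likely rely on an existing CoNLL-U library.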

Acknowledgments

We wish to thank all of the contributors to the original PDT annotation effort, including Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Magda Ševčíková, Jan Štěpánek, and Šárka Zikánová.
