
Uppsala Group on Conversion Best Practice and Tools

Aitziber, Barbara, Filip, Giuseppe, Lilian, Natalia, Verginica, Željko

Minimal requirements for a UD treebank

Annotation tools

At least these were mentioned; feel free to expand!

Conversion tools

Search and visualization tools

Tokenization

Almost none of the treebanks distribute the untokenized sentences, which complicates automatic induction of tokenizers from the data. It would be great to include the untokenized text, use the SpaceAfter mechanism of the CoNLL-U format, or at least provide data for training tokenizers privately.
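To illustrate the SpaceAfter option, here is a minimal Python sketch of how the untokenized text could be recovered from a CoNLL-U sentence when `SpaceAfter=No` is recorded in the MISC column. The names `detokenize` and `row` are hypothetical helpers made up for this example, not part of any UD tool. The sketch assumes that multiword token ranges (IDs like `1-2`) carry the surface form and that empty nodes (IDs like `1.1`) contribute no text.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild the surface text of one CoNLL-U sentence from FORM and SpaceAfter=No."""
    pieces = []
    skip_until = 0  # last word ID covered by the most recent multiword token
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty node: no surface text
        if "-" in tok_id:
            # Multiword token: keep its surface form, skip the words it covers.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


def row(tok_id: str, form: str, misc: str = "_") -> str:
    # Hypothetical helper for the example: fills the unused columns with "_".
    return "\t".join([tok_id, form, "_", "_", "_", "_", "_", "_", "_", misc])


sample = "\n".join([
    "# text = Hello, world!",
    row("1", "Hello", "SpaceAfter=No"),
    row("2", ","),
    row("3", "world", "SpaceAfter=No"),
    row("4", "!"),
])
print(detokenize(sample))
```

With this annotation in place, pairs of raw text and token boundaries can be generated directly from a treebank, which is the kind of data a tokenizer would be trained on.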

Parallel treebank

It might be good to expand on the Cairo initiative and have a parallel text that could be annotated, both to help new corpora get started and to gather some UD parallel data. About 10K tokens seems like the right size. The Cairo corpus should be mentioned somewhere on the main UD page.