
Tokenization

Tokenization is derived from the Latvian Treebank tokenization by splitting “words with spaces”. This means:

The current version of the treebank does not use range tokens (multiword token lines with IDs such as 1-2) in its CoNLL-U files.
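As a rough illustration of what the absence of range tokens means in practice, the following Python sketch scans CoNLL-U text for lines whose ID field is a range. The function name `find_range_tokens` and the sample sentence are assumptions made for this example only; the sample is a Spanish contraction, not data from the Latvian treebank. Run over a file that follows the statement above, the function would be expected to return an empty list.

```python
import re

# A range-token ID looks like "1-2": two integers joined by a hyphen.
RANGE_ID = re.compile(r"^\d+-\d+$")


def find_range_tokens(conllu_text):
    """Return (line_number, id, form) for every range-token line.

    Hypothetical helper for illustration; not part of any treebank tooling.
    """
    found = []
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip sentence separators (blank lines) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > 1 and RANGE_ID.match(fields[0]):
            found.append((line_number, fields[0], fields[1]))
    return found


# Illustrative sample only: a Spanish contraction annotated with a range token.
SAMPLE_WITH_RANGE = "\n".join([
    "# text = del mar",
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tmar\tmar\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

# The same kind of sentence without any range-token line.
SAMPLE_WITHOUT_RANGE = "\n".join([
    "# text = el mar",
    "1\tel\tel\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tmar\tmar\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

if __name__ == "__main__":
    print(find_range_tokens(SAMPLE_WITH_RANGE))
    print(find_range_tokens(SAMPLE_WITHOUT_RANGE))
```

Matching only on the ID column keeps the check independent of the other nine fields, and empty-node IDs such as 8.1 are deliberately not counted as range tokens.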

Known discrepancy

The current version of the treebank does not split the “reflexive particle” off from verbs. In Latvian, reflexivity is deeply integrated into the verb inflectional paradigm, and it is very hard to separate it from the grammatical markers that indicate person or tense or that create derived forms.