
Tokenization in UD v2

We need to handle the whole spectrum, from multitoken words (as in Vietnamese, where a single word may span several space-delimited tokens) to multiword tokens (as in Turkish, where a single orthographic token may contain several syntactic words). Ideally, we should also establish more substantive criteria for when to split tokens into words and when to merge tokens into a single word. A relevant paper dealing with the Turkish case is available on this issue. See also the report from the Uppsala meeting: tokenization. A sketch of how this spectrum surfaces in the data format follows below.
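
For concreteness, here is a minimal sketch (not part of the original page) of how the two ends of the spectrum are encoded in the CoNLL-U format: a multiword token gets a range ID such as "1-2" followed by one row per syntactic word, whereas a multitoken word at the Vietnamese end would instead appear as a single word row. The function name split_conllu and the Turkish example ("evdeki" analysed as "evde" + "ki") are illustrative assumptions, not taken from the original page.

    def split_conllu(sentence: str):
        """Separate multiword-token ranges from syntactic word rows."""
        mwt, words = [], []
        for line in sentence.strip().splitlines():
            if not line or line.startswith("#"):
                continue  # skip comment and blank separator lines
            cols = line.split("\t")
            if "-" in cols[0]:
                # Range ID, e.g. "1-2": the surface token spanning words 1..2.
                start, end = (int(i) for i in cols[0].split("-"))
                mwt.append((start, end, cols[1]))
            elif "." not in cols[0]:
                # Plain integer ID: a syntactic word (skip empty nodes like "1.1").
                words.append((int(cols[0]), cols[1]))
        return mwt, words

    filler = "\t_" * 8  # pad out the remaining CoNLL-U columns
    sample = "\n".join([
        "1-2\tevdeki" + filler,   # one orthographic token ...
        "1\tevde" + filler,       # ... analysed as two syntactic words
        "2\tki" + filler,
    ])

    print(split_conllu(sample))
    # ([(1, 2, 'evdeki')], [(1, 'evde'), (2, 'ki')])

Note that the range row carries only the surface form; annotation (lemma, POS, dependency relation) attaches to the word rows, which is what makes the token/word distinction matter for the criteria discussed above.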