Tokenization in UD v2
We need to handle the whole spectrum, from multitoken words (a single word written as several space-separated tokens, as in Vietnamese) to multiword tokens (a single orthographic token containing several syntactic words, as in Turkish). Ideally, we should also establish more explicit criteria for when to split a token into several words and when to merge several tokens into one word. On this issue, there is a relevant paper dealing with the Turkish case. See also the report from the Uppsala meeting: tokenization.
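To make the multiword-token side of the spectrum concrete, the sketch below parses a one-sentence CoNLL-U fragment in which an ID range line (e.g. `1-2`) marks a surface token that splits into several syntactic words. The Spanish example sentence and the helper function name are illustrative assumptions, not part of the UD guidelines themselves; the range-line convention is the one defined by the CoNLL-U format.

```python
# A multiword token in CoNLL-U: the surface token "del" (the line whose
# ID is the range "1-2") is split into two syntactic words, "de" and "el".
# The sample sentence is a made-up Spanish fragment for illustration.
SAMPLE = """\
1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_
1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_
2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_
3\tmundo\tmundo\tNOUN\t_\t_\t0\troot\t_\t_
"""

def words_and_tokens(conllu: str):
    """Return (syntactic_words, surface_tokens) for one CoNLL-U sentence."""
    words, tokens = [], []
    covered = set()  # word IDs already accounted for by a range line
    for line in conllu.splitlines():
        cols = line.split("\t")
        wid, form = cols[0], cols[1]
        if "-" in wid:  # multiword token: one surface token, several words
            start, end = map(int, wid.split("-"))
            covered.update(range(start, end + 1))
            tokens.append(form)
        else:
            words.append(form)
            if int(wid) not in covered:  # plain token = its own word
                tokens.append(form)
    return words, tokens

words, tokens = words_and_tokens(SAMPLE)
print(words)   # ['de', 'el', 'mundo']
print(tokens)  # ['del', 'mundo']
```

The point of the example is that the two lists have different lengths: annotation (tags, dependencies) attaches to the three syntactic words, while the original text contains only two surface tokens. Multitoken words run in the opposite direction and would need the reverse mapping.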