
Tokenization

The tokenization of the UD Finnish treebank follows that of the Turku Dependency Treebank (TDT) with only minor modifications. TDT uses a straightforward whitespace-based tokenization with conventional separation of punctuation. The UD Finnish treebank does not contain multiword tokens.
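As a rough illustration of what whitespace-based tokenization with punctuation separation can look like, here is a minimal Python sketch. It is not the tokenizer used to build TDT or the UD Finnish treebank: the function names `tokenize` and `has_multiword_tokens` are hypothetical, and keeping word-internal hyphens, apostrophes and colons attached to the word is an assumption of this sketch, not a documented rule of the treebank. The second function relies only on the CoNLL-U convention that multiword tokens are marked with an ID range such as `1-2` in the first column.

```python
import re

# Assumption of this sketch: a token is either a run of word characters
# (optionally joined by an internal hyphen, apostrophe or colon) or a
# single punctuation character. The real treebank rules may differ.
_TOKEN_RE = re.compile(r"\w+(?:[-':]\w+)*|[^\w\s]")


def tokenize(text):
    """Split on whitespace, then separate punctuation from each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(_TOKEN_RE.findall(chunk))
    return tokens


def has_multiword_tokens(conllu_text):
    """Return True if any CoNLL-U token line has a range ID such as '1-2'."""
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            return True
    return False


if __name__ == "__main__":
    print(tokenize("Hän asuu Turussa, eikö niin?"))
```

Running `has_multiword_tokens` over a treebank file's contents is one way to check the claim that no multiword tokens occur; for the UD Finnish data it would be expected to return `False`.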