Tokenization

The tokenization in the Swedish UD treebank follows the principles of the Stockholm-Umeå Corpus, Version 2.0 (SUC, 2006), which has become the de facto standard for Swedish tokenization and part-of-speech tagging. Segmentation is straightforward and based on whitespace and punctuation, with one point worth noting:

The Swedish UD treebank does not contain multiword tokens.
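
The absence of multiword tokens can be checked mechanically. In the CoNLL-U format, a multiword token is written on a line whose ID is a range such as `3-4`, so a file without multiword tokens has no such lines. The following is a minimal sketch of that check, not part of the treebank's own tooling; the function name `multiword_token_lines` and the sample sentence are made up for illustration, and the annotation columns are left as `_` placeholders.

```python
import re

# CoNLL-U marks a multiword token with an ID range such as "3-4".
# Plain word IDs ("3") and empty-node IDs ("3.1") do not match.
RANGE_ID = re.compile(r"^\d+-\d+$")


def multiword_token_lines(conllu_text):
    """Return (line_number, form) for every multiword-token line."""
    hits = []
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip sentence separators (blank lines) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if RANGE_ID.match(columns[0]):
            form = columns[1] if len(columns) > 1 else ""
            hits.append((line_number, form))
    return hits


# Hypothetical sample sentence, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = Hon läser boken.",
    "1\tHon\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tläser\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tboken\t_\t_\t_\t_\t_\t_\t_\t_",
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_",
])

if __name__ == "__main__":
    print(multiword_token_lines(SAMPLE))
```

Run over a treebank file read into a string, an empty result is what the statement above would lead one to expect; any returned pair points to a line that needs a closer look.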

References

The Stockholm-Umeå Corpus. Version 2.0. 2006. Stockholm University: Department of Linguistics.