
Tokenization

The tokenization of the UD Korean Treebank follows that of the Korean data distributed for the SPMRL 2013 shared task: a straightforward whitespace-based tokenization with conventional separation of punctuation. Each token can contain one or more morphemes, separated by plus (+) signs.
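
As a rough illustration of the scheme described above, the following Python sketch splits an already tokenized line on whitespace and then splits each token into morphemes on plus signs. The function names `split_morphemes` and `tokenize` are hypothetical, the morpheme strings are placeholders rather than real treebank data, and the sketch assumes that a plus sign never occurs as a literal character inside a morpheme, which may not hold for every token in the actual data.

```python
from __future__ import annotations


def split_morphemes(token: str) -> list[str]:
    """Split one token into its morphemes on '+' signs."""
    # Assumption: '+' is only ever a morpheme separator, never a literal character.
    return token.split("+")


def tokenize(line: str) -> list[list[str]]:
    """Split a line on whitespace, then split each token into morphemes."""
    # Assumption: punctuation is already separated from words by whitespace.
    return [split_morphemes(token) for token in line.split()]


# Placeholder morphemes (m1, m2, ...) stand in for real Korean forms.
sentence = "m1+m2 m3 m4+m5+m6 ."
analysis = tokenize(sentence)
```

Here `analysis` holds one list of morphemes per token, so a token without a plus sign yields a single-element list.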