
Tokenization

The tokenization in the Italian UD treebank is a straightforward segmentation based on whitespace and punctuation. The following special cases deserve to be mentioned:

Multi-word tokens

The Italian UD treebank does not contain multiword tokens.

Fused words

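According to the UD guidelines, the basic units of annotation are syntactic words (not phonological or orthographic words). We therefore systematically split off clitics and articulated prepositions. For example, the articulated preposition "della" is analyzed as "di" + "la", and a verb with an enclitic pronoun such as "vederlo" is separated into the verb and the clitic "lo".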
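The two steps described so far, segmentation on whitespace and punctuation followed by splitting of articulated prepositions, can be illustrated with a minimal Python sketch. This is not the tool used to build the treebank: the function names, the regular expression, and the small lookup table are assumptions made for illustration, and the table covers only a handful of forms. Clitic splitting, elided forms such as "dell'", and ambiguous forms (for instance "dei" as a noun) are deliberately left out, since they need a lexicon or a tagger.

```python
import re

# Illustrative subset only (an assumption of this sketch); a real pipeline
# would need a complete lexicon and some disambiguation.
ARTICULATED_PREPOSITIONS = {
    "del": ("di", "il"),
    "della": ("di", "la"),
    "al": ("a", "il"),
    "alla": ("a", "la"),
    "dal": ("da", "il"),
    "nel": ("in", "il"),
    "nella": ("in", "la"),
    "sul": ("su", "il"),
}

# A run of word characters, or a single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Segment text on whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text)


def split_syntactic_words(text):
    """Tokenize, then expand known articulated prepositions."""
    words = []
    for token in tokenize(text):
        parts = ARTICULATED_PREPOSITIONS.get(token.lower())
        if parts is None:
            words.append(token)
            continue
        parts = list(parts)
        # Preserve sentence-initial capitalization on the first part.
        if token[0].isupper():
            parts[0] = parts[0].capitalize()
        words.extend(parts)
    return words


print(split_syntactic_words("Il libro della ragazza è sul tavolo."))
# ['Il', 'libro', 'di', 'la', 'ragazza', 'è', 'su', 'il', 'tavolo', '.']
```

A lookup table keeps the sketch easy to read, at the cost of coverage: any form missing from the table is passed through unchanged.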

Sentence splitting

Each sentence contains only one root. Splitting is usually performed after a sentence-final period, or after a colon or semicolon when that punctuation mark separates unrelated parts of a sentence. Items in a list may sometimes be rendered as separate sentences.