
Tokenization

French tokenization follows the universal guidelines: contractions are undone (e.g., au becomes two tokens, à le). Otherwise, tokenization is based on whitespace and punctuation, except that multiword expressions with hyphens are not split (e.g., Etats-Unis “United States” and sous-marin “submarine” each remain one token).
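The rules above can be illustrated with a minimal Python sketch. It is not the tool used to produce the treebank; the function name `tokenize`, the regular expression, and the contraction table are assumptions made for illustration. Only the au entry comes from the text above, and aux is added as an analogous case. The sketch keeps every hyphenated sequence as one token and does not handle elision with apostrophes, so it is a simplification of what a real tokenizer would need.

```python
import re

# Hypothetical, deliberately small contraction table.
# Only "au" is given in the guidelines text; "aux" is an assumed analogous entry.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
}

# A token is either a run of word characters, optionally joined by hyphens
# (so "sous-marin" stays whole), or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text on whitespace and punctuation, then undo known contractions."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Lookup is case-insensitive; the expanded forms are emitted in lowercase.
        tokens.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return tokens


print(tokenize("Il va au sous-marin."))
```

Under these assumptions, the contraction au is expanded into à and le, the hyphenated sous-marin is kept as a single token, and the final period becomes a token of its own.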