
Introduction

The Swedish UD treebank is based on the Professional Prose section of Talbanken (Einarsson, 1976), originally annotated by a team led by Ulf Teleman at Lund University according to the MAMBA annotation scheme (Teleman, 1974). It consists of roughly 6,000 sentences and 97,000 tokens taken from a variety of genres, including textbooks, information brochures, and newspaper articles. The syntactic annotation is converted directly from the original MAMBA annotation, while the morphological annotation is based on the reannotation performed when Talbanken was incorporated into the Swedish Treebank (Nivre and Megyesi, 2007). This reannotation, which also involved minor tokenization changes, followed the guidelines of the Stockholm-Umeå Corpus (SUC) 2.0.

Source of annotations

This table summarizes the origins and checking of the various columns of the CoNLL-U data.

Column    Status
ID        Sentence segmentation from Talbanken; tokenization modified to SUC standard for Swedish Treebank.
FORM      Identical to Talbanken except for minor tokenization changes mentioned above.
LEMMA     Produced automatically by the Swedish Language Bank using SALDO; fairly careful human checking.
UPOSTAG   Converted automatically from XPOSTAG + original Talbanken tags; fairly careful human checking.
XPOSTAG   Produced automatically by a tagger trained on SUC; complete manual validation for Swedish Treebank.
FEATS     Converted automatically from XPOSTAG + original Talbanken tags; fairly careful human checking.
HEAD      Automatic conversion of Talbanken; fairly careful human checking.
DEPREL    Automatic conversion of Talbanken; fairly careful human checking.
DEPS      (currently unused)
MISC      (currently unused)
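
To make the column layout concrete, here is a minimal Python sketch that reads CoNLL-U text into per-token dictionaries keyed by the ten column names from the table, then reports which columns are empty throughout. It assumes the standard CoNLL-U conventions (tab-separated fields, `_` for an empty value, `#` comment lines, a blank line between sentences). The function names `parse_conllu` and `unused_columns` are illustrative, and the three-token sample sentence is invented for the example; it is not taken from the treebank, and its tags are only plausible placeholders.

```python
# The ten CoNLL-U columns, in file order (names as used in the table above).
COLUMNS = ("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# Hypothetical three-token sentence; NOT taken from the treebank.
# Tag and feature values are illustrative placeholders only.
_ROWS = [
    ("1", "Hon", "hon", "PRON", "PN", "_", "2", "nsubj", "_", "_"),
    ("2", "läser", "läsa", "VERB", "VB", "Tense=Pres|VerbForm=Fin",
     "0", "root", "_", "_"),
    ("3", ".", ".", "PUNCT", "MAD", "_", "2", "punct", "_", "_"),
]
SAMPLE = ("# text = Hon läser.\n"
          + "\n".join("\t".join(row) for row in _ROWS)
          + "\n\n")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of column->value dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # sentence-level comment, skipped here
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(
                    f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}")
            current.append(dict(zip(COLUMNS, fields)))
    if current:                         # input may lack a trailing blank line
        sentences.append(current)
    return sentences


def unused_columns(sentences):
    """Return the columns whose value is '_' for every token."""
    return [col for col in COLUMNS
            if all(tok[col] == "_" for sent in sentences for tok in sent)]


sentences = parse_conllu(SAMPLE)
print(unused_columns(sentences))
```

Run against a real treebank file, `unused_columns` would be one quick way to check the table's statement that DEPS and MISC are currently unused. The sketch keeps every value as a string and does not handle multiword-token ranges or empty nodes specially.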

Acknowledgments

The new conversion was performed by Joakim Nivre and Aaron Smith at Uppsala University. We thank everyone who has been involved in previous conversion efforts at Växjö University and Uppsala University, including Bengt Dahlqvist, Sofia Gustafson-Capkova, Johan Hall, Anna Sågvall Hein, Beáta Megyesi, Jens Nilsson, and Filip Salomonsson. Special thanks also go to Lars Borin and Markus Forsberg at the Swedish Language Bank for help with the lemmatization.

References