Guidelines for Universal Dependency Annotation
Joakim Nivre and Ryan McDonald
This document describes the annotation guidelines used in the Universal Dependency Treebank Project,
Version 2.0. The aim of the project is to create dependency treebanks with cross-linguistically consistent
annotation by adapting and harmonizing variants of the Stanford typed dependencies (de Marneffe et al.,
2006; de Marneffe and Manning, 2008). This scheme was originally developed for English but has
subsequently been adapted and applied to a number of other languages including Chinese (Chang et al.,
2009), Finnish (Haverinen et al., 2013), Persian (Seraji et al., 2012), and Modern Hebrew (Tsarfaty,
2013). We first give an overview of the modifications to the original Stanford scheme and then provide
a detailed description of each dependency relation and its relation to the original scheme(s). In addition
to the syntactic dependency annotation, the treebanks also contain part-of-speech annotation using the Google
Universal Part-of-Speech Tags (Petrov et al., 2012).[1]
1 Overview of the Annotation Scheme
We assume the Stanford basic dependencies (with punctuation included), where every dependency structure
is a tree spanning all the input tokens, because this is the kind of representation that most available
dependency parsers require.[2]
A sample dependency tree from the French treebank is shown in Figure 1.
Alexandre réside avec sa famille à Tinqueux .
NOUN VERB ADP DET NOUN ADP NOUN P
Relation labels on the arcs: NSUBJ, ADPMOD, ADPOBJ, POSS, ADPMOD, ADPOBJ, P
Figure 1: A sample French sentence.
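The requirement that every dependency structure be a tree spanning all input tokens can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: the function name `is_spanning_tree` and the head-index encoding (1-based heads, 0 for the root) are assumptions made for illustration, and the head indices given for the sentence in Figure 1 are one plausible reading of the arcs, in particular the attachment of "à" to the verb.

```python
from typing import List


def is_spanning_tree(heads: List[int]) -> bool:
    """Return True if the head list encodes a single-rooted tree over all tokens.

    heads[i] is the 1-based index of the head of token i + 1;
    the value 0 marks the root (an encoding assumed for this sketch).
    """
    n = len(heads)
    # Every head must be 0 (root) or the index of an existing token.
    if n == 0 or any(h < 0 or h > n for h in heads):
        return False
    # A tree has exactly one root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Following heads from any token must reach the root without a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


tokens = ["Alexandre", "réside", "avec", "sa", "famille", "à", "Tinqueux", "."]
# Assumed heads: "réside" is the root; both adpositions attach to the verb.
heads = [2, 0, 2, 5, 3, 2, 6, 2]
labels = ["nsubj", "root", "adpmod", "poss", "adpobj", "adpmod", "adpobj", "p"]

assert len(tokens) == len(heads) == len(labels)
print(is_spanning_tree(heads))
```

Because each token carries exactly one head, the single-head property is built into the encoding; the function only needs to verify the single root and the absence of cycles.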
The universal annotation scheme was created by harmonizing available treebanks in slightly different
variants of Stanford dependencies, some developed through manual annotation, some produced through
automatic conversion from other schemes.[3] In the harmonization step, we have first eliminated cases where
the same label was used for different linguistic relations in different languages and, conversely, cases where
one and the same relation was annotated with different labels, both of which could happen accidentally
when the original Stanford scheme was adapted to specific languages. Secondly, we have avoided, as far
as possible, labels that are only used in one or two languages.
In order to satisfy these requirements, a number of language-specific labels have been merged into
more general labels. For example, in analogy with the nn label for (element of a) noun-noun compound,
the German scheme had a label aa for compound adjectives, and the Korean scheme had a label vv
for compound verbs. In the universal scheme, these three labels have been merged into a single label
compmod for modifier in compound. Similarly, the Korean annotation scheme distinguished four different
subtypes of nominal subjects, which have all been merged into the single relation nsubj in the universal
annotation.
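The label merging described above amounts to a many-to-one mapping from language-specific labels to universal ones. A minimal sketch follows; the names `MERGED_LABELS` and `harmonize` are hypothetical and do not come from the project. Only the three compound labels named in the text are included, because the text does not name the four Korean subject subtypes that map to nsubj.

```python
# Language-specific labels merged into a universal label, as named in the text.
# The four Korean nominal-subject subtypes would map to "nsubj" in the same way,
# but their names are not given in the text, so they are left out of this sketch.
MERGED_LABELS = {
    "nn": "compmod",  # (element of a) noun-noun compound
    "aa": "compmod",  # German compound adjectives
    "vv": "compmod",  # Korean compound verbs
}


def harmonize(label: str) -> str:
    """Map a language-specific label to its universal label, if it was merged."""
    return MERGED_LABELS.get(label, label)


print([harmonize(label) for label in ["nn", "aa", "vv", "amod"]])
```

Labels that are not in the mapping pass through unchanged, which keeps the sketch usable for relations that already carry a universal name.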
In addition to harmonizing language-specific labels, we have also renamed relations where the name
would be misleading in the universal context (although quite appropriate for English). For example,
the label prep (for a modifier headed by a preposition) has been renamed to adpmod, to make clear the
relation to other modifier labels and to allow postpositions as well as prepositions. Consequently, pobj
and pcomp have been changed to adpobj and adpcomp. Similarly, npadvmod has been replaced by nmod
(in analogy with amod and advmod). We have also eliminated a few distinctions in the original Stanford
[1] In addition to the universal tags, we also provide language-specific tags when available.
[2] This is in contrast to the collapsed dependencies, where multiple heads are allowed and where some tokens may not
correspond to nodes in the dependency structure.
[3] For a more detailed description of this process, see McDonald et al. (2013).