Annotation Procedure in Building the Prague Czech-English Dependency Treebank
Marie Mikulová and Jan Štěpánek
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
Introduction
A large corpus with a rich linguistic annotation requires an elaborated organization of the annotation process:
- division of the annotation into several phases,
- a system for annotation quality checking,
- ways of evaluating the annotation and the annotators.
Prague Dependency Treebanks:
- Prague Dependency Treebank 2.0 (2006)
- Prague Czech-English Dependency Treebank (2010)
Prague Dependency Treebanks
Prague Czech-English Dependency Treebank (PCEDT)
- texts from the Penn Treebank: mostly economic articles from the Wall Street Journal,
- for the Czech part, the texts were translated into Czech,
- 2,312 documents, 49,000 sentences,
- ready for publication by the end of 2010!
Prague Dependency Treebank 2.0 (PDT 2.0)
- Czech written texts,
- 3,165 documents, 50,000 sentences,
- published in 2006.
System of annotation layers in the Prague Dependency Treebanks
- Word layer: "raw" text, tokens
- Morphological layer: lemmas, tags
- Analytical layer: surface syntax (dependencies, relations)
- Tectogrammatical layer: deep syntax (dependencies, relations in detail)
Tectogrammatical layer in the Prague Dependency Treebanks
An example of a rich linguistic annotation:
- deep syntax: dependencies, relations (70 functors),
- valency and ellipsis,
- grammatemes: semantic counterparts of morphological categories,
- coreference,
- topic-focus articulation, deep word order.
39 different attributes; 8.42 attributes filled in on average per node in PDT 2.0. The annotation manual has more than 1,000 pages.
What can we do?
Three organizational aspects of building a large corpus with a rich annotation:
- division of the annotation into several phases,
- annotation quality checking,
- motivating evaluation of the annotators.
(Slide diagram: annotation rules catch errors in the data and trigger corrections.)
Division of the annotation into several phases
Dividing the annotation process into several steps is desirable for the quality of the output data, even though some phenomena have to be reconsidered repeatedly by different annotators in different phases.
How to divide the annotation when the attached information is mostly very complex? The annotation of one attribute often requires the annotation of another attribute. The solution: a "working value" of an attribute.
Annotation phases on the tectogrammatical layer in the Prague Czech-English Dependency Treebank
1. building the tree structure, including the treatment of ellipsis; assignment of functors and valency frames, links to the lower layers (10 attributes),
2. annotation of subfunctors (a fine-grained classification of functors; 1 attribute),
3. annotation of coreference (4 attributes),
4. annotation of topic-focus articulation, rhematizers and deep word order (3 attributes),
5. annotation of grammatemes, final form of tectogrammatical lemmas (17 attributes),
6. annotation of the remaining phenomena (quotation, named entities etc.).
First phase: 9.2 sentences per hour.
Example of a "working value" in the Prague Czech-English Dependency Treebank
First phase: building the tree structure. For an ellipsis, a new node is added. Each node requires a lemma, and the lemma of an added node signifies the type of the elision: #Gen stands for a general participant, #PersPron for an elided subject, #Cor for a controllee in control constructions, #Rcp for ellipses caused by reciprocity, etc.
BUT: for building the tree structure, the type of the elision is not essential; adding the new node is. The annotator therefore adds the node with a "working value" of the lemma and assigns only its syntactic function. A later phase replaces the placeholder, as the sketch below illustrates.
"Working value": #NewNode
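A minimal illustration of the two-pass idea, not the actual TrEd/PML interface: the first phase stores the placeholder lemma, and a later phase replaces it with the real type of the elision. The plain-hash node representation and the helper name are hypothetical.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Phase 1: the annotator adds a node with the working lemma and only
    # its syntactic function (hypothetical plain-hash representation).
    my $added = { lemma => '#NewNode', functor => 'ACT' };

    # A later phase decides the real type of the elision and replaces the
    # working value, e.g. with #Gen, #PersPron, #Cor or #Rcp.
    sub refine_lemma {
        my ($node, $elision_type) = @_;
        $node->{lemma} = $elision_type if $node->{lemma} eq '#NewNode';
    }

    refine_lemma($added, '#Gen');
    print "$added->{lemma}\t$added->{functor}\n";    # prints "#Gen    ACT"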
Annotation quality checking
Quality checking is expensive for a large corpus with a rich annotation. The usual approach, parallel annotation of the same data, is impossible here: in the first phase of the PCEDT, one annotator annotates 9.2 sentences per hour, so annotating the whole treebank (49,000 sentences) by a single annotator would take 5,326 hours. If an annotator worked 20 hours a week (a half-time job), the whole treebank would take over 5 years (see the calculation below).
Instead: a system for the automatic quality checking of the data. It was developed during the building of PDT 2.0, where the real checking only took place after all the annotation had finished, and the checking and fixing phase was quite complex and time-consuming. Now the checking is fully integrated into the annotation process.
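A quick back-of-the-envelope check of the estimate, using only the figures quoted above:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $sentences = 49_000;                  # size of the treebank
    my $per_hour  = 9.2;                     # first-phase annotation speed
    my $hours     = $sentences / $per_hour;  # total single-annotator effort
    my $years     = $hours / 20 / 52;        # 20 hours a week, 52 weeks a year
    printf "%.0f hours ~ %.1f years\n", $hours, $years;   # 5326 hours ~ 5.1 years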
Annotation quality checking
Design of the automatic checking procedures:
- programmed manually (in Perl),
- based on the annotation rules,
- return a list of erroneous positions in the data,
- run periodically.
The 103 checking procedures improve the quality of the data by fixing the present errors and by providing feedback to the annotators.
Annotation quality checking
Examples of checking procedures:
- coord002: every coordination has at least two members,
- struct001.1: the root of a tree has only a limited set of possible functors: PRED for a predicate, DENOM for a nominal clause, PARTL for an interjection clause etc.,
- struct001.2: no dependent node has the PRED functor.
The implementation of coord002 (a btred macro in Perl):

    #!btred -N -T -t PML_T -e coord()
    package PML_T;
    $NAME = 'coord002';    ## Every coordination has at least two members.
    sub coord {
        writeln("$NAME\tmembers\t" . ThisAddress($this))
            if IsCoord($this)
               and scalar(grep $_->{is_member}, $this->children) < 2;
    }    # coord
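For comparison, a sketch of what struct001.2 could look like in the same btred style; this is not the project's actual code, and the functor attribute name and the treatment of coordinated predicates are assumptions:

    #!btred -N -T -t PML_T -e struct001_2()
    package PML_T;
    $NAME = 'struct001.2';    ## No dependent node has the PRED functor.
    sub struct001_2 {
        # Report a PRED node whose parent is neither the technical root
        # nor a coordination head (coordinated predicates are legitimate).
        writeln("$NAME\tfunctor\t" . ThisAddress($this))
            if $this->{functor} eq 'PRED'
               and $this->parent != $root
               and not IsCoord($this->parent);
    }    # struct001_2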
Evaluation of the annotators
A system for the evaluation of the annotation and the annotators is an integral part of any annotation project. We use three measures:
- inter-annotator agreement,
- error rate,
- performance of the annotators.
Inter-annotator agreement
The structure to be compared is very complex, and an algorithm aligning two tectogrammatical trees is not an easy task. Since there is no "golden" annotation, we simply measure the agreement of all pairs of annotators. As a baseline, we use the output of the automatic procedure with which the annotators start their work.
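A minimal sketch of the pairwise measure, assuming the hard part (the alignment of the two trees) is already done and each annotation is reduced to a map from aligned node identifiers to attribute values; all names and data are illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Agreement of two annotators on one attribute over aligned nodes.
    sub pairwise_agreement {
        my ($x, $y) = @_;    # hash refs: node id => attribute value
        my @shared = grep { exists $y->{$_} } keys %$x;
        return undef unless @shared;
        my $same = grep { $x->{$_} eq $y->{$_} } @shared;
        return $same / @shared;
    }

    my %annot_k = (n1 => 'PRED', n2 => 'ACT', n3 => 'PAT');
    my %annot_z = (n1 => 'PRED', n2 => 'ACT', n3 => 'EFF');
    printf "functor agreement: %.2f\n",
           pairwise_agreement(\%annot_k, \%annot_z);    # 0.67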
Inter-annotator agreement: example (agreement per annotator, sorted within each row)
Overall:   K 94.08%, Ma 94.01%, A 93.83%, O 93.78%, Z 84.58%
Structure: A 88.62%, Ma 88.60%, O 87.92%, K 87.88%, Z 69.28%
Functor:   K 85.70%, Ma 85.67%, O 85.57%, A 85.13%, Z 66.80%
Error rate
Using the list of errors generated by the checking procedures, we count how often the annotators make errors: the number of errors an annotator made is divided by the number of sentences or nodes he or she annotated.
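In code, the measure is a simple normalization; a sketch with illustrative numbers:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Errors per 100 annotated units (sentences or nodes).
    sub error_rate {
        my ($errors, $units) = @_;
        return 100 * $errors / $units;
    }

    printf "%.2f errors per 100 sentences\n", error_rate(37, 1_250);    # 2.96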
Error rate: example. For each annotator (K, O, Ma, A, L, Mi, J), errors per 100 sentences and errors per 100 nodes, compared between December 2007 and July 2009.
Performance of the annotators
In the annotation process, the time the annotators spend working is measured. For each month, we compute each annotator's performance over that month as well as the overall performance.
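The figures reported in the example below follow directly from the measured time; a sketch with illustrative numbers:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Derived performance figures for one annotator (numbers illustrative).
    my ($hours, $sentences) = (120, 1_104);
    my $sents_per_hour = $sentences / $hours;     # 9.2
    my $min_per_sent   = 60 / $sents_per_hour;    # ~6.5
    printf "%.1f sentences per hour, %.1f minutes per sentence\n",
           $sents_per_hour, $min_per_sent;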
Performance of the annotators: example. For each annotator (A, I, J, K, L, Ma, Mi, O): hours worked, sentences annotated, sentences per hour and minutes per sentence.
Conclusion
The organizational aspects of building a large treebank:
- division of the annotation process into several phases,
- a system for checking the correctness of the annotation,
- three ways to evaluate the annotation and the annotators.
We believe that, having published PDT 2.0 with 50,000 sentences and being at the halfway point of the PCEDT project with more than half of the data already annotated (33,500 sentences, 68% of the corpus), our proposals are sufficiently backed by our experience and practice.
Grants: Centrum komputační lingvistiky (Center for Computational Linguistics) LC536; PIRE (NSF, USA); MŠMT KONTAKT; GAČR 405/06/0589; GAUK 22908/2008; EU FP6 EuroMatrix; EU FP7 EuroMatrixPlus FP7-ICT.
Thank you for your attention.