Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics.

Annotation Procedure in Building the Prague Czech-English Dependency Treebank
Marie Mikulová and Jan Štěpánek
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague

Slovko 2009, mikulova@ufal.mff.cuni.cz
Introduction
A large corpus with a rich linguistic annotation calls for an elaborate organization of the annotation process:
- division of the annotation into several phases
- a system for annotation quality checking
- ways of evaluating the annotation and the annotators
Prague Dependency Treebanks: Prague Dependency Treebank 2.0 (2006), Prague Czech-English Dependency Treebank (2010)

Introduction: Prague Dependency Treebanks
Prague Dependency Treebank 2.0 (PDT 2.0)
- Czech written texts
- 3165 documents, 49 431 sentences
- published in 2006
Prague Czech-English Dependency Treebank (PCEDT)
- texts from the Penn Treebank: mostly economic articles from the Wall Street Journal
- for the Czech part, the texts were translated into Czech
- 2312 documents, 49 208 sentences
- ready for publication by the end of 2010

System of annotation layers in Prague Dependency Treebanks
- Word layer: "raw text", tokens
- Morphological layer: lemmas, tags
- Analytical layer: surface syntax; dependencies, relations
- Tectogrammatical layer: deep syntax; dependencies, relations (detailed)

Tectogrammatical layer in Prague Dependency Treebanks as an example of a rich linguistic annotation
- deep syntax; dependencies, relations: 70 functors
- valency and ellipsis
- grammatemes: semantic counterparts of morphological categories
- coreference
- topic-focus articulation, deep word order
39 different attributes; 8.42 attributes are filled on average for a node in PDT 2.0.
The annotation manual has more than 1000 pages.

What can we do?
Three organizational aspects of building a large corpus with a rich annotation:
- division of the annotation into several phases
- annotation quality checking
- motivating evaluation of the annotators

Division of the annotation into several phases
Dividing the annotation process into several steps is desirable for the quality of the output data, even though some phenomena have to be reconsidered repeatedly by different annotators in various phases.
How can the annotation be divided when the attached information is mostly very complex and annotating one attribute requires annotating another?
Answer: a "working value" of an attribute.

Annotation phases on the tectogrammatical layer in the Prague Czech-English Dependency Treebank
1. building a tree structure, including dealing with ellipsis; assignment of functors and valency frames, links to lower layers (10 attributes),
2. annotation of subfunctors (fine-grained classification of functors, 1 attribute),
3. annotation of coreference (4 attributes),
4. annotation of topic-focus articulation, rhematizers and deep word order (3 attributes),
5. annotation of grammatemes, final form of tectogrammatical lemmata (17 attributes),
6. annotation of remaining phenomena (quotation, named entities etc.).
First phase: 9.2 sentences per hour

Example of a "working value" in the Prague Czech-English Dependency Treebank
First phase: building a tree structure.
Ellipsis: a new node is added. Each node requires a lemma, and the lemma of an added node signifies the type of the ellipsis: #Gen stands for a general participant, #PersPron for a subject, #Cor for a controlee in control constructions, #Rcp for ellipsis due to reciprocity, etc.
BUT for building the tree structure, the type of ellipsis is not essential. Only adding the new node is necessary!
The annotator adds a node with the "working value" of the lemma and assigns only its syntactic function.
"Working value": #NewNode
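
The working-value idea can be sketched in a few lines of Python. This is only an illustration, not the project's tooling: the dictionary-based node layout, the function names and the toy lemma "sell" are assumptions made for the example; only #NewNode and the four ellipsis lemmas come from the slide.

```python
WORKING_LEMMA = "#NewNode"
# Lemma types named on the slide; its "etc." means this set is incomplete.
FINAL_LEMMAS = {"#Gen", "#PersPron", "#Cor", "#Rcp"}


def make_node(lemma, functor):
    """Hypothetical node layout: lemma, functor and a list of children."""
    return {"lemma": lemma, "functor": functor, "children": []}


def add_elided_node(parent, functor):
    """Phase 1: add a node for an elided item with a placeholder lemma."""
    node = make_node(WORKING_LEMMA, functor)
    parent["children"].append(node)
    return node


def unresolved(node):
    """Collect all nodes that still carry the working value."""
    found = [node] if node["lemma"] == WORKING_LEMMA else []
    for child in node["children"]:
        found.extend(unresolved(child))
    return found


def resolve(node, lemma):
    """Later phase: replace the working value with the ellipsis type."""
    if lemma not in FINAL_LEMMAS:
        raise ValueError(f"unknown ellipsis lemma: {lemma}")
    node["lemma"] = lemma


root = make_node("sell", "PRED")        # toy predicate node, invented
actor = add_elided_node(root, "ACT")    # phase 1: structure and functor only
print(len(unresolved(root)))            # 1
resolve(actor, "#Gen")                  # a later phase decides the type
print(len(unresolved(root)))            # 0
```

A listing such as `unresolved` also gives a simple way to confirm that no working value survives into the final data.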

Annotation quality checking
Usually: parallel annotation of the same data. For a large corpus with a rich annotation this is too expensive, in practice impossible:
- PCEDT (first phase): one annotator can annotate 9.2 sentences in one hour.
- Annotation of the whole treebank (49,000 sentences) by one annotator would take 5326 hours.
- If an annotator worked 20 hours a week (a half-time job), the whole treebank would take 5 years.
A system for automatic quality checking of the data:
- It was developed during the building of PDT 2.0, where the real checking took place only after all the annotation had finished; the checking and fixing phase was quite complex and time-consuming.
- Now: it is fully integrated into the annotation process.
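
The effort estimate above can be reproduced with a short calculation. The sketch below uses the figures from the slide; the 52 working weeks per year is an assumption added here to turn weeks into years.

```python
SENTENCES = 49_000          # approximate treebank size quoted on the slide
SENTENCES_PER_HOUR = 9.2    # first-phase speed of one annotator
HOURS_PER_WEEK = 20         # half-time job
WEEKS_PER_YEAR = 52         # assumption: no holidays or breaks


def annotation_effort(sentences, per_hour, hours_per_week,
                      weeks_per_year=WEEKS_PER_YEAR):
    """Return the single-annotator effort as (hours, weeks, years)."""
    hours = sentences / per_hour
    weeks = hours / hours_per_week
    years = weeks / weeks_per_year
    return hours, weeks, years


hours, weeks, years = annotation_effort(
    SENTENCES, SENTENCES_PER_HOUR, HOURS_PER_WEEK)
print(f"{hours:.0f} hours, {weeks:.0f} weeks, {years:.1f} years")
# prints "5326 hours, 266 weeks, 5.1 years"
```

Parallel annotation by two annotators would double these figures, which is why it is ruled out for a corpus of this size.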

Annotation quality checking
Design of the automatic checking procedures. The procedures:
- are programmed manually (in Perl),
- are based on annotation rules,
- return a list of erroneous positions in the data,
- are run periodically.
103 checking procedures improve the quality of the data:
- by fixing the present errors,
- by providing feedback to the annotators.

Annotation quality checking
Examples of checking procedures:
- coord002: every coordination has at least two members
- struct001.1: the root of a tree has only a limited set of possible functors: PRED for a predicate, DENOM for a nominative clause, PARTL for an interjection clause etc.
- struct001.2: no dependent node has the PRED functor
The coord002 procedure:
    #!btred -N -T -t PML_T -e coord()
    package PML_T;
    $NAME = 'coord002';
    ## Every coordination has at least two members.
    sub coord {
        writeln("$NAME\tmembers\t" . ThisAddress($this))
            if IsCoord($this)
            and scalar(grep $_->{is_member}, $this->children) < 2;
    } # coord
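
For readers without a btred installation, two of the checks can be imitated on a toy tree in Python. This is a sketch under stated assumptions, not a port of the project's procedures: the `Node` class, the small set of coordination functors and the node identifiers are invented for the illustration, and the set of root functors contains only those named on the slide.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: a small illustrative subset of coordination functors.
COORD_FUNCTORS = {"CONJ", "DISJ", "ADVS"}
# Only the functors named on the slide; the real list is longer ("etc.").
ROOT_FUNCTORS = {"PRED", "DENOM", "PARTL"}


@dataclass
class Node:
    """Hypothetical stand-in for a tectogrammatical node."""
    node_id: str
    functor: str
    is_member: bool = False
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield the node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def check_tree(root):
    """Return a list of (check name, node id) pairs, one per violation."""
    errors = []
    # struct001.1: the root may only carry one of a limited set of functors.
    if root.functor not in ROOT_FUNCTORS:
        errors.append(("struct001.1", root.node_id))
    # coord002: every coordination has at least two members.
    for node in walk(root):
        if node.functor in COORD_FUNCTORS:
            members = [c for c in node.children if c.is_member]
            if len(members) < 2:
                errors.append(("coord002", node.node_id))
    return errors


tree = Node("t1", "PRED", children=[
    Node("t2", "CONJ", children=[Node("t3", "ACT", is_member=True)]),
])
print(check_tree(tree))   # [('coord002', 't2')]
```

As in the Perl original, the check only reports positions; fixing the data is left to the annotators.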

Evaluation of the annotators
- inter-annotator agreement,
- error rate,
- performance of the annotators.
A system for the evaluation of the annotation and the annotators is an integral part of any annotation project.

Inter-annotator agreement
The structure to be compared is very complex. Designing an algorithm that aligns two tectogrammatical trees is not an easy task.
Since there is no "golden" annotation, we just measure the agreement of all the pairs of annotators.
As a baseline, we use the output of the automatic procedure with which the annotators start their work.
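
The pairwise measurement can be illustrated as follows. The sketch assumes the hard part, tree alignment, has already been done, so that each annotation is reduced to a mapping from an aligned node key to one attribute value (here a functor). The exact agreement formula is not given on the slides; the one below (matching values over the union of node keys) is an assumption, and the data are invented.

```python
from itertools import combinations


def pair_agreement(a, b):
    """Share of node keys on which two annotations carry the same value.

    Keys present in only one annotation count as disagreement.
    """
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


def pairwise_agreement(annotations):
    """Agreement for every unordered pair of annotators."""
    return {
        (x, y): pair_agreement(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }


# Invented toy data: node key -> functor, one mapping per annotator.
annotations = {
    "K": {1: "ACT", 2: "PAT", 3: "LOC"},
    "O": {1: "ACT", 2: "PAT", 3: "DIR3"},
    "baseline": {1: "ACT", 2: "ADDR", 3: "DIR1"},  # automatic pre-annotation
}
for pair, score in pairwise_agreement(annotations).items():
    print(pair, f"{score:.2%}")
```

Treating the baseline as one more "annotator" shows how far the human annotations have moved from the automatic starting point.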

Inter-annotator agreement: example
Overall          Structure        Functor
K   94.08%       A   88.62%       K   85.70%
Ma  94.01%       Ma  88.60%       Ma  85.67%
A   93.83%       O   87.92%       O   85.57%
O   93.78%       K   87.88%       A   85.13%
Z   84.58%       Z   69.28%       Z   66.80%

Error rate
Using the list of errors generated by the checking procedures, we count how often the annotators make errors: the number of errors an annotator made is divided by the number of sentences or nodes he or she annotated.
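
A minimal sketch of this computation, scaled to errors per 100 sentences and per 100 nodes. The input format (a list of check name and annotator pairs) and all the counts are assumptions invented for the illustration.

```python
from collections import Counter


def rate_per_100(errors, units):
    """Errors per 100 annotated units (sentences or nodes)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return 100.0 * errors / units


def error_rates(error_list, sentences, nodes):
    """Map each annotator to (errors per 100 sentences, per 100 nodes)."""
    counts = Counter(annotator for _check, annotator in error_list)
    return {
        who: (rate_per_100(counts[who], sentences[who]),
              rate_per_100(counts[who], nodes[who]))
        for who in sentences
    }


# Invented example: output of the checking procedures and workload counts.
errors = [("coord002", "X"), ("struct001.1", "X"), ("coord002", "Y")]
sentences = {"X": 40, "Y": 50}
nodes = {"X": 800, "Y": 1000}

rates = error_rates(errors, sentences, nodes)
for who, (per_sentences, per_nodes) in rates.items():
    print(who, per_sentences, per_nodes)
```

Normalizing by nodes as well as by sentences keeps annotators who happened to receive longer sentences from looking worse than they are.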

Error rate: example
       December 2007                        July 2009
Who    Errors per 100    Errors per 100     Errors per 100    Errors per 100
       sentences         nodes              sentences         nodes
K      29.7851           1.6241             1.5103            0.0806
O      39.6699           2.0624             4.0331            0.2067
Ma     61.4087           3.2707             8.4670            0.4533
A      63.2318           3.3498             6.3583            0.3265
L      -                 -                  15.0668           0.8010
Mi     -                 -                  16.2241           0.8460
J      -                 -                  19.0476           1.0971

Performance of the annotators
In the annotation process, the time the annotators spend working is measured. For each month we compute each annotator's performance over that month as well as the overall performance.

Performance of the annotators: example
Who    Hours     Sentences    Sentences per hour    Minutes per sentence
A      114.25    963          8.4289                7.1184
I      827.00    7006         8.4716                7.0825
J      105.70    1001         9.4702                6.3357
K      107.00    1430         13.3645               4.4895
L      266.41    1716         6.4412                9.3150
Ma     78.00     615          7.8846                7.6098
Mi     169.98    1655         9.7364                6.1624
O      289.02    3211         11.1100               5.4006
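
The two derived columns follow directly from the measured hours and sentence counts. The sketch below recomputes them for two rows of the table; the function name is an assumption made for the example.

```python
def performance(hours, sentences):
    """Return (sentences per hour, minutes per sentence)."""
    if hours <= 0 or sentences <= 0:
        raise ValueError("hours and sentences must be positive")
    per_hour = sentences / hours
    minutes_per_sentence = 60.0 * hours / sentences
    return per_hour, minutes_per_sentence


# Hours and sentence counts taken from the table above.
measured = {"A": (114.25, 963), "K": (107.00, 1430)}

for who, (hours, sentences) in measured.items():
    per_hour, minutes = performance(hours, sentences)
    print(f"{who}: {per_hour:.4f} sentences/hour, {minutes:.4f} min/sentence")
# prints:
# A: 8.4289 sentences/hour, 7.1184 min/sentence
# K: 13.3645 sentences/hour, 4.4895 min/sentence
```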

Conclusion
The organizational aspects of building a large treebank:
- divide the annotation process into several phases,
- use a system for checking the correctness of the annotation,
- use three ways to evaluate the annotation and the annotators.
We believe that, having published PDT 2.0 with 50,000 sentences and being halfway through the PCEDT project with more than half of the data already annotated (33,500 sentences, 68% of the corpus), our proposals are sufficiently backed by our experience and practice.
Grants: Centrum komputační lingvistiky LC 356; PIRE (NSF, USA, 2005-2010); MŠMT KONTAKT (2006-2010); GAČR 405/06/0589 (2006-2008); GAUK 22908/2008; EU FP6 Euromatrix (2006-2008); EU FP7 EuromatrixPlus FP7-ICT-2007-3-231720.

Thank you for your attention. http://ufal.mff.cuni.cz
