Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics.

Presentation transcript:

Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University Prague

Slovko
Introduction
a large corpus with a rich linguistic annotation
an elaborate organization of the annotation process:
  division of the annotation into several phases
  system for annotation quality checking
  ways of evaluating the annotation and the annotators
Prague Dependency Treebanks:
  Prague Dependency Treebank 2.0 (2006)
  Prague Czech-English Dependency Treebank (2010)

Introduction: Prague Dependency Treebanks
Prague Dependency Treebank 2.0 (PDT 2.0)
  Czech written texts
  3165 documents, sentences
  Published in 2006.
Prague Czech-English Dependency Treebank (PCEDT)
  texts from the Penn Treebank: mostly economic articles from the Wall Street Journal
  for the Czech part, the texts were translated into Czech
  2312 documents, sentences
  Ready for publication by the end of 2010!

System of annotation layers in Prague Dependency Treebanks
Word layer: "raw text", tokens
Morphological layer: lemmas, tags
Analytical layer: surface syntax dependencies, relations
Tectogrammatical layer: deep syntax dependencies, relations (detailed)

Tectogrammatical layer in Prague Dependency Treebanks as an example of a rich linguistic annotation
deep syntax dependencies, relations: 70 functors
valency and ellipsis
grammatemes: semantic counterparts of morphological categories
coreference
topic-focus articulation, deep word order
39 different attributes
8.42 attributes filled on average for a node in PDT 2.0
The annotation manual has more than 1000 pages.

What can we do?
Three organizational aspects of building a large corpus with a rich annotation:
  Division of the annotation into several phases
  Annotation quality checking
  Motivating evaluation of the annotators
(Diagram labels: error, rule, correction)

Division of the annotation into several phases
Dividing the annotation process into several steps is desirable for the quality of the output data, even though some phenomena have to be reconsidered repeatedly by different annotators in various phases.
How can the annotation be divided when the attached information is mostly very complex? The annotation of one attribute often requires the annotation of another attribute.
A "working value" of an attribute

Annotation phases on the tectogrammatical layer in the Prague Czech-English Dependency Treebank
1. building a tree structure, including dealing with ellipsis; assignment of functors and valency frames, links to lower layers (10 attributes),
2. annotation of subfunctors (fine-grained classification of functors, 1 attribute),
3. annotation of coreference (4 attributes),
4. annotation of topic-focus articulation, rhematizers and deep word order (3 attributes),
5. annotation of grammatemes, final form of tectogrammatical lemmata (17 attributes),
6. annotation of remaining phenomena (quotation, named entities etc.).
First phase: 9.2 sentences per hour

Example of a "working value" in the Prague Czech-English Dependency Treebank
First phase: building a tree structure.
Ellipsis: a new node is added. Each node requires a lemma. The lemma of an added node signifies the type of the elision: #Gen stands for a general participant, #PersPron for a subject, #Cor for a controllee in control constructions, #Rcp for ellipses due to reciprocity, etc.
BUT for building the tree structure, the type of elision is not essential. Adding a new node is necessary!
The annotator adds a node with the "working value" of the lemma and assigns only its syntactic function.
"Working value": #NewNode

Annotation quality checking
Usually: parallel annotation of the same data. This is expensive; for a large corpus with a rich annotation it is impossible!
PCEDT (first phase): one annotator can annotate 9.2 sentences in one hour. Annotation of the whole treebank (49,000 sentences) by one annotator would take 5326 hours. If an annotator worked 20 hours a week (a half-time job), the whole treebank would take about 5 years.
System for the automatic quality checking of data
It was developed during the building of PDT 2.0. There, the actual checking took place only after all the annotation had finished, and the checking and fixing phase was quite complex and time-consuming.
Now: fully integrated into the annotation process
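
The effort estimate on this slide can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the three input figures come from the slide, while the 52-week working year and the function name `annotation_effort` are assumptions made for the illustration.

```python
# Figures taken from the slide; the 52-week year is an assumption.
SENTENCES = 49_000
SENTENCES_PER_HOUR = 9.2
HOURS_PER_WEEK = 20
WEEKS_PER_YEAR = 52


def annotation_effort(sentences, rate, hours_per_week, weeks_per_year=WEEKS_PER_YEAR):
    """Return (total hours, years) needed by a single annotator."""
    hours = sentences / rate
    years = hours / hours_per_week / weeks_per_year
    return hours, years


hours, years = annotation_effort(SENTENCES, SENTENCES_PER_HOUR, HOURS_PER_WEEK)
print(f"{hours:.0f} hours, about {years:.1f} years")  # 5326 hours, about 5.1 years
```

Doubling the cost for a second, parallel annotator is what makes full parallel annotation impractical at this scale.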

Annotation quality checking
Design of the automatic checking procedures. The procedures:
  are programmed manually (in Perl),
  are based on the annotation rules,
  return a list of erroneous positions in the data,
  are run periodically.
The 103 checking procedures improve the quality of the data:
  by fixing the errors present in the data,
  by providing feedback to the annotators.

Annotation quality checking
Example of the checking procedure
coord002: every coordination has at least two members
struct001.1: the root of a tree has only a limited set of possible functors: PRED for a predicate, DENOM for nominative clause, PARTL for interjection clause etc.
struct001.2: no dependent node has the PRED functor

    #!btred -N -T -t PML_T -e coord()
    package PML_T;
    $NAME='coord002';
    ## Every coordination has at least two members.
    sub coord {
      writeln("$NAME\tmembers\t".ThisAddress($this))
        if IsCoord($this)
        and scalar(grep $_->{is_member},$this->children) < 2;
    } # coord
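
To show how the three rules named on this slide could work together, here is a minimal Python sketch over a toy tree. It does not use btred or the PML data format: the `Node` class, its fields, and `check_tree` are hypothetical, and `ROOT_FUNCTORS` lists only the three functors the slide names before "etc.". The rules are implemented as literally stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a tectogrammatical node (hypothetical, not the PML schema)."""
    node_id: str
    functor: str
    is_coord: bool = False    # True for a coordination node
    is_member: bool = False   # True if the node is a member of a coordination
    children: list = field(default_factory=list)


# Partial set: the slide lists these three and then says "etc.".
ROOT_FUNCTORS = {"PRED", "DENOM", "PARTL"}


def walk(node):
    """Yield the node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def check_tree(root):
    """Return a list of (rule name, node id) pairs, one per rule violation."""
    errors = []
    if root.functor not in ROOT_FUNCTORS:
        errors.append(("struct001.1", root.node_id))
    for node in walk(root):
        if node is not root and node.functor == "PRED":
            errors.append(("struct001.2", node.node_id))
        if node.is_coord and sum(c.is_member for c in node.children) < 2:
            errors.append(("coord002", node.node_id))
    return errors


tree = Node("t1", "PRED", children=[
    Node("t2", "CONJ", is_coord=True, children=[
        Node("t3", "ACT", is_member=True),
        Node("t4", "PAT"),  # not marked as a coordination member
    ]),
])
print(check_tree(tree))  # [('coord002', 't2')]
```

As in the Perl procedure, each check only reports a rule name and a position, so the same list can be used both for fixing the data and for giving feedback to the annotators.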

Evaluation of the annotators
A system for evaluating the annotation and the annotators is an integral part of any annotation project. We measure:
  inter-annotator agreement,
  error rate,
  performance of the annotators.

Inter-annotator agreement
The structure to be compared is very complex. Aligning two tectogrammatical trees is not an easy task. Since there is no "gold" annotation, we just measure the agreement of all pairs of annotators. As a baseline, we use the output of the automatic procedure with which the annotators start their work.

Inter-annotator agreement: Example

Overall        Structure      Functor
K   94.08%     A   88.62%     K   85.70%
Ma  94.01%     Ma  88.60%     Ma  85.67%
A   93.83%     O   87.92%     O   85.57%
O   93.78%     K   87.88%     A   85.13%
Z   84.58%     Z   69.28%     Z   66.80%
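
Measuring the agreement of all pairs of annotators can be sketched as follows. The sketch assumes the hard part, aligning the trees, has already been done, so that each annotation is a mapping from a shared node identifier to the value being compared (here a functor). The function names, the annotator labels `ann1`, `ann2`, `auto`, and the toy data are all hypothetical, and the slides do not say how the per-annotator percentages in the example were aggregated from the pairwise scores.

```python
from itertools import combinations


def agreement(a, b):
    """Share of aligned nodes on which two annotations carry the same value."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[n] == b[n] for n in shared) / len(shared)


def pairwise_agreement(annotations):
    """Map every unordered pair of annotator names to their agreement."""
    return {
        (x, y): agreement(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }


# Toy data: functors per aligned node; "auto" plays the automatic baseline.
annotations = {
    "ann1": {"n1": "PRED", "n2": "ACT", "n3": "PAT", "n4": "LOC"},
    "ann2": {"n1": "PRED", "n2": "ACT", "n3": "ADDR", "n4": "LOC"},
    "auto": {"n1": "PRED", "n2": "PAT", "n3": "PAT", "n4": "TWHEN"},
}
scores = pairwise_agreement(annotations)
for pair, score in scores.items():
    print(pair, f"{score:.2%}")
```

Treating the automatic pre-annotation as one more "annotator" gives the baseline for free: any human pair should agree with each other more than with it.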

Error rate
Using the list of errors generated by the checking procedures, we count how often the annotators make errors: the number of errors an annotator made is divided by the number of sentences or nodes s/he annotated.

Error rate: Example
Periods: December 2007, July 2009
Columns: Who | Errors per 100 sentences | Errors per 100 nodes (for each period)
Rows (annotators): K, O, Ma, A, L, Mi, J
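
The error rate defined on these slides is a simple normalized count. A minimal sketch, assuming the error list from the checking procedures has already been reduced to a per-annotator total; the function name and all numbers below are invented for illustration and are not values from the treebank.

```python
def error_rate(errors, units, per=100):
    """Errors per `per` annotated units (sentences or nodes)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return per * errors / units


# Hypothetical monthly totals for one annotator.
errors_found = 12
sentences_annotated = 400
nodes_annotated = 6000

print(error_rate(errors_found, sentences_annotated))  # 3.0
print(error_rate(errors_found, nodes_annotated))      # 0.2
```

Reporting both normalizations matters because annotators who receive longer sentences handle more nodes per sentence, which inflates the per-sentence figure.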

Performance of the annotators
During the annotation process, the time the annotators spend working is measured. For each month, we compute the annotators' performance over that month as well as their overall performance.

Performance of the annotators: Example
Columns: Who | Hours | Sentences | Sentences per hour | Minutes per sentence
Rows (annotators): A, I, J, K, L, Ma, Mi, O
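
The two derived columns of the performance table follow directly from the measured hours and the number of annotated sentences. The sketch below computes them per month and overall; the log format, the function name, and the figures are hypothetical.

```python
def performance(hours, sentences):
    """Return (sentences per hour, minutes per sentence)."""
    return sentences / hours, 60 * hours / sentences


# Hypothetical work log for one annotator: (month, hours worked, sentences annotated).
log = [("2009-05", 40, 380), ("2009-06", 50, 450)]

for month, hours_worked, sentences_done in log:
    per_hour, minutes = performance(hours_worked, sentences_done)
    print(month, f"{per_hour:.2f} sentences/hour, {minutes:.2f} minutes/sentence")

# Overall performance is computed from the totals, not by averaging monthly rates.
total_hours = sum(h for _, h, _ in log)
total_sentences = sum(s for _, _, s in log)
overall_per_hour, overall_minutes = performance(total_hours, total_sentences)
print("overall", f"{overall_per_hour:.2f} sentences/hour")
```

Computing the overall figure from totals weights each month by the hours actually worked, which a plain average of the monthly rates would not do.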

Conclusion
The organizational aspects of building a large treebank:
  divide the annotation process into several phases,
  use a system for checking the correctness of the annotation,
  evaluate the annotation and the annotators in three ways.
We believe that, having published PDT 2.0 with 50,000 sentences and being halfway through the PCEDT project with more than half of the data already annotated (33,500 sentences, 68% of the corpus), our proposals are sufficiently backed by our experience and practice.
Grants: Centrum komputační lingvistiky LC 356; PIRE (NSF, USA, ); MŠMT KONTAKT ( ) ; GAČR 405/06/0589 ( ); GAUK 22908/2008; EU FP6 Euromatrix ( ); EU FP7 EuromatrixPlus FP7-ICT

Thank you for your attention.