The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.


The Prague (Czech-)English Dependency Treebank
Jan Hajič
Charles University in Prague, Computer Science School, Institute of Formal and Applied Linguistics
Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

June 8, 2009 Dependency Workshop Boulder, CO Czech-English Dependency Treebank
Today...
The family of Prague Dependency Treebanks
–Incl. the Prague (Czech-)English Dependency Treebank
English “Tectogrammatical Representation” (TR)
–Annotation layers
–From Penn Treebank (et al.) to PDT-style English tectogrammatics
–TR annotation of 5 interesting English phenomena
The annotation process
–TrEd, EngVallex and the current status
To take home + pointers

The Family of Prague Dependency Treebanks
Prague Dependency Treebank (Czech)
–2001: version 1.0 (no deep syntax/semantics)
–2006: version 2.0 (w/deep syntax, semantics)
Prague Czech-English Dependency TB 1.0
–2004: automatic annotation
–English: PTB, Czech: 1/3rd of PTB translated
Prague Arabic Dependency Treebank 1.0
–2004: ~ PDT 1.0 (no deep syntax)

The Prague Czech-English Dependency Treebank
Penn Treebank
+ PropBank
+ BBN (co-reference and Named Entities)
+ NP structure (D. Vadas, J. R. Curran, ACL’07)
+ “Czech-like” tectogrammatics
Translation to Czech
–Manual annotation (with auto pre-annotation)
  Morphology, Syntax, Tectogrammatics (TR)

Example: English TR
Words, Dependencies, Sem. function, Valency (predicates), Coref (BBN), Named Entities (BBN)

Layers of Annotation
t-layer
–tectogrammatics
a-layer
–(surface) syntax
m-layer
–Morphology (POS)
w-layer
–words (tokens)
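The four layers above can be pictured as records that point down to the layer beneath them. The following is only a minimal sketch of that idea: the class names, field names (`w_ref`, `m_ref`, `a_refs`) and the toy "will join" fragment are hypothetical illustrations, not the treebank's actual file format, and folding the auxiliary into a single t-layer node is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WToken:
    """w-layer: a raw token."""
    id: str
    form: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag for one w-layer token."""
    id: str
    w_ref: str
    lemma: str
    pos: str


@dataclass
class ANode:
    """a-layer: surface-syntax node built over one m-layer node."""
    id: str
    m_ref: str
    parent: Optional[str]  # id of the governing a-node, None for the root


@dataclass
class TNode:
    """t-layer: tectogrammatical node linked to one or more a-layer nodes."""
    id: str
    a_refs: List[str]
    functor: str


# Toy fragment "will join" (hypothetical ids; assumes the auxiliary is
# absorbed into the t-node of the main verb).
W: Dict[str, WToken] = {"w1": WToken("w1", "will"), "w2": WToken("w2", "join")}
M: Dict[str, MNode] = {
    "m1": MNode("m1", "w1", "will", "MD"),
    "m2": MNode("m2", "w2", "join", "VB"),
}
A: Dict[str, ANode] = {
    "a1": ANode("a1", "m1", parent="a2"),
    "a2": ANode("a2", "m2", parent=None),
}
t_join = TNode("t1", a_refs=["a1", "a2"], functor="PRED")


def surface_forms(t: TNode) -> List[str]:
    """Follow the links t-layer -> a-layer -> m-layer -> w-layer."""
    return [W[M[A[a_id].m_ref].w_ref].form for a_id in t.a_refs]


print(surface_forms(t_join))
```

Keeping each layer as a separate record with explicit references is one way to let every layer stay complete on its own while still being traceable back to the tokens.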

English Surface Syntax
From PTB:
–Form
–POS Tag
–Function label
–(Structure)
Added
–Lemma
–Heads

Head Determination Rules
Exhaustive set of rules
–By J. Eisner + M. Čmejrek/J. Cuřín
–4000 rules (non-terminal based)
  Ex.: (S (NP-SBJ VP .)) → VP
–Additional rules
  Coordination, Apposition
  Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
–J. Robinson (1960s)

Example: Head Determination Rules
[Figure: phrase-structure tree with lexical heads marked on the nodes: (board), (the), (join), (will), (join)]
Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
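The four example rules can be turned into a small head-percolation sketch. This is an illustrative toy under stated assumptions, not the actual rule set by Eisner and Čmejrek/Cuřín: the `Node` class, the priority-list encoding of `HEAD_RULES`, and the fallback to the last child are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str                       # non-terminal or POS tag, e.g. "NP-SBJ"
    word: Optional[str] = None       # set only on leaves
    children: List["Node"] = field(default_factory=list)


# Parent label -> child labels in priority order (toy encoding of the
# four rules on the slide; the real rule set is far larger).
HEAD_RULES: Dict[str, List[str]] = {
    "NP": ["NN"],
    "VP": ["VP", "VB"],
    "S": ["VP"],
}


def base(label: str) -> str:
    """Strip function tags: 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]


def head_child(node: Node) -> Node:
    for wanted in HEAD_RULES.get(base(node.label), []):
        for child in node.children:
            if base(child.label) == wanted:
                return child
    return node.children[-1]         # assumed fallback, not from the slides


def lexical_head(node: Node) -> Node:
    """Percolate down the head children until a leaf is reached."""
    while node.children:
        node = head_child(node)
    return node


def dependencies(node: Node, deps=None) -> List[Tuple[str, str]]:
    """Collect (dependent, head) word pairs for the whole tree."""
    if deps is None:
        deps = []
    if not node.children:
        return deps
    head = head_child(node)
    head_word = lexical_head(node).word
    for child in node.children:
        if child is not head:
            deps.append((lexical_head(child).word, head_word))
        dependencies(child, deps)
    return deps


# (S (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))))
tree = Node("S", children=[
    Node("VP", children=[
        Node("MD", "will"),
        Node("VP", children=[
            Node("VB", "join"),
            Node("NP", children=[Node("DT", "the"), Node("NN", "board")]),
        ]),
    ]),
])

print(dependencies(tree))
```

On this fragment the sketch attaches "will" and "board" to "join" and "the" to "board", which matches the head labels shown on the slide.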

Conversion: Analytic Structure, Functions
Syntactic Function assignment (conversion) Rules
–based on PTB functional tags:
  -SBJ → Sb
  -PRD → Pnom
  -BNF, -DTV, -LGS → Obj
  -ADV, -DIR, -EXT, -LOC, -MNR, -PRP, -PUT, -TMP → Adv
–Ad-hoc rules (if functional tags missing)
–Lemmatization (years → year)
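The tag-to-function table lends itself to a simple lookup. The sketch below encodes exactly the pairs listed on the slide; the function name `analytic_function` and the choice to return `None` when no known tag is present (leaving the ad-hoc rules to the caller) are assumptions for illustration.

```python
from typing import Optional

# PTB functional tag -> PDT-style analytic function, as listed on the slide.
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb",
    "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def analytic_function(ptb_label: str) -> Optional[str]:
    """Map a PTB label such as 'NP-SBJ-1' to an analytic function.

    Returns None when the label carries no known functional tag, in which
    case a caller would fall back to ad-hoc rules (not modelled here).
    """
    for part in ptb_label.split("-")[1:]:   # skip the category itself
        if part in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[part]
    return None


print(analytic_function("NP-SBJ"))
print(analytic_function("PP-LOC"))
print(analytic_function("NP"))
```

The first matching tag wins here; how the real conversion resolves labels that carry several functional tags is not stated on the slide.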

Syntactic Structure, Functions: PTB to P(E)DT
[Figure: Penn Treebank structure (with heads added: (board), (the), (join), (will), (join)) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation; node labels PRED.Fut, PAT)]

English TR I: Predicative Complement
Free (non-valency) modification (of both a noun and a verb)
attribute compl.rf (green arrow to the noun)

English TR II: Which + Relative Clause
We have not answered your question completely, for which we apologize.

English TR III: Coordination

English TR IV: Comparison

English TR V: Restriction (“Exclusion”)
except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

English TR: (manual) annotation
TrEd
–Pre-annotated
–Graphical TR dep. tree is primary
–Text + TR
–Czech translation
Valency (a.k.a. “propbanking”)
–During TR annotation
–Propbank origins and examples
  Linked, displayed

EngVallex (give)

EngVallex Format (admit)

Interannotator Agreement:
–New annotators (lower numbers)
–Annotation “by phenomenon”
–Restarting now

Prague English Dependency Treebank
Availability
–Version 1.0 now (PTB license needed)
  250k words
–Full version (parallel with Czech): late 2010
Size
–Full WSJ portion of PTB (2312 files)
–49208 sentences, tokens
–Now:
–17210 sentences (34.97%), tokens (35.11%)

Czech PDT-style Annotation
All layers
–morphology, syntax, tectogrammatical
So far…
–Automatic (many tools by many authors)
Manual annotation
–In progress (28124 sentences/ words)
–Top-down
  Tectogrammatical first (lower layers automatically)
  … then syntactic structure and morphology

Summary
PDT is/has (a)…
–(Family of) dependency-based treebanking project(s)
  Czech (English, Arabic,...)
–~ 1mil. words
  sufficient size for ML experiments
–4 interlinked layers of annotation
  (token, morphology, syntax, deep syntax/semantics++)
  independent and “full” information at all levels
  interlinked (for the development of parsers/generators)
–Parallel corpus
  Cze Eng -> Machine Translation

Pointers, Acknowledgements
Acknowledgements
–FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
–LC536 (Center for Computational Linguistics)
–GAČR 405/06/0589 (Speech and deep syntax)
–MŠMT: MSM , ME838, ME09008