The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.


1 The Prague (Czech-)English Dependency Treebank
Jan Hajič
Charles University in Prague, Computer Science School, Institute of Formal and Applied Linguistics
Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

2 Today...
June 8, 2009, Dependency Workshop, Boulder, CO
The family of Prague Dependency Treebanks
– Incl. the Prague (Czech-)English Dependency Treebank
English “Tectogrammatical Representation” (TR)
– Annotation layers
– From Penn Treebank (et al.) to PDT-style English tectogrammatics
– TR annotation of 5 interesting English phenomena
The annotation process
– TrEd, EngVallex and the current status
To take home + pointers

3 The Family of Prague Dependency Treebanks
Prague Dependency Treebank (Czech)
– 2001: version 1.0 (no deep syntax/semantics)
– 2006: version 2.0 (w/deep syntax, semantics)
Prague Czech-English Dependency TB 1.0
– 2004: automatic annotation
– English: PTB, Czech: 1/3 of PTB translated
Prague Arabic Dependency Treebank 1.0
– 2004: ~ PDT 1.0 (no deep syntax)

4 The Prague Czech-English Dependency Treebank
Penn Treebank
+ PropBank
+ BBN (co-reference and Named Entities)
+ NP structure (D. Vadas, J. R. Curran, ACL’07)
+ “Czech-like” tectogrammatics
Translation to Czech
– Manual annotation (with auto pre-annotation)
  Morphology, Syntax, Tectogrammatics (TR)

5 Example: English TR
Words
Dependencies
Sem. function
Valency (predicates)
Coref (BBN)
Named Entities (BBN)

6 Layers of Annotation
t-layer
– tectogrammatics
a-layer
– (surface) syntax
m-layer
– Morphology (POS)
w-layer
– words (tokens)
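The four layers above can be pictured as linked records, each pointing down to the layer beneath it. The following is only a minimal sketch and not the treebank's actual file format: the class names, field names, and the `Pred`/`AuxV`/`Obj` values are illustrative assumptions, and grouping the auxiliary "will" under the t-node for "join" is an assumed convention for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WToken:
    """w-layer: a raw token."""
    form: str


@dataclass
class MNode:
    """m-layer: morphology (lemma and POS) for one token."""
    w: WToken
    lemma: str
    pos: str


@dataclass
class ANode:
    """a-layer: surface-syntax node with a function label and a parent."""
    m: MNode
    afun: str
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: tectogrammatical node linked to one or more a-layer nodes."""
    t_lemma: str
    functor: str
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None

    def surface_forms(self) -> List[str]:
        # Follow the links t -> a -> m -> w to recover the covered tokens.
        return [a.m.w.form for a in self.a_nodes]


# Hypothetical fragment "will join (the) board"; all values are illustrative.
a_join = ANode(MNode(WToken("join"), "join", "VB"), afun="Pred")
a_will = ANode(MNode(WToken("will"), "will", "MD"), afun="AuxV", parent=a_join)
a_board = ANode(MNode(WToken("board"), "board", "NN"), afun="Obj", parent=a_join)

t_join = TNode("join", "PRED", a_nodes=[a_will, a_join])
t_board = TNode("board", "PAT", a_nodes=[a_board], parent=t_join)
```

Calling `t_join.surface_forms()` walks from the deepest layer back to the tokens, which is the sense in which the layers are interlinked.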

7 English Surface Syntax
From PTB:
– Form
– POS Tag
– Function label
– (Structure)
Added
– Lemma
– Heads

8 Head Determination Rules
Exhaustive set of rules
– By J. Eisner + M. Čmejrek/J. Cuřín
– 4000 rules (non-terminal based)
  Ex.: (S (NP-SBJ VP .)) → VP
– Additional rules
  Coordination, Apposition
  Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
– J. Robinson (1960s)

9 Example: Head Determination Rules
Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
Tree labels: (board) (the) (join) (will) (join)
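The four rules on this slide can be tried out on a small bracketed tree. This is a toy sketch under stated assumptions, not the 4000-rule set mentioned earlier: the tree encoding, the function name `find_head`, the rule priority for VP (try VB before VP), the rightmost-child fallback, and the example fragment are all illustrative choices.

```python
# Parent label -> child labels to try, in order, when picking the head child.
# Only the four rules shown on the slide are encoded here.
HEAD_RULES = {
    "NP": ["NN"],
    "VP": ["VB", "VP"],
    "S": ["VP"],
}


def base(label):
    """Strip functional tags, e.g. 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]


def find_head(node, deps):
    """Return the head word of `node`; append (dependent, head) pairs to `deps`."""
    label, *children = node
    # Preterminal such as ("NN", "board"): the word is its own head.
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]
    heads = [find_head(child, deps) for child in children]
    labels = [base(child[0]) for child in children]
    head_idx = len(children) - 1  # assumed fallback: rightmost child
    for wanted in HEAD_RULES.get(base(label), []):
        if wanted in labels:
            head_idx = labels.index(wanted)
            break
    # Every non-head child attaches to the head child's word.
    for i, word in enumerate(heads):
        if i != head_idx:
            deps.append((word, heads[head_idx]))
    return heads[head_idx]


# Illustrative fragment: "Vinken will join the board".
tree = ("S",
        ("NP-SBJ", ("NNP", "Vinken")),
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))

deps = []
root = find_head(tree, deps)
```

Here `root` is the head word of the whole sentence and `deps` holds the resulting dependency edges, so the phrase structure has been converted into a dependency tree by head percolation alone.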

10 Conversion: Analytic Structure, Functions
Syntactic Function assignment (conversion)
Rules
– based on PTB functional tags:
  -SBJ → Sb
  -PRD → Pnom
  -BNF → Obj
  -DTV → Obj
  -LGS → Obj
  -ADV → Adv
  -DIR → Adv
  -EXT → Adv
  -LOC → Adv
  -MNR → Adv
  -PRP → Adv
  -PUT → Adv
  -TMP → Adv
– Ad-hoc rules (if functional tags missing)
– Lemmatization (years → year)
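The tag-to-function table above amounts to a simple lookup. A minimal sketch follows; the function name `analytic_function` and the choice to return `None` when no functional tag matches (the case the slide leaves to ad-hoc rules) are assumptions made for illustration.

```python
# PTB functional tag -> analytic function, as listed on the slide.
PTB_TO_AFUN = {
    "SBJ": "Sb",
    "PRD": "Pnom",
    "BNF": "Obj",
    "DTV": "Obj",
    "LGS": "Obj",
    "ADV": "Adv",
    "DIR": "Adv",
    "EXT": "Adv",
    "LOC": "Adv",
    "MNR": "Adv",
    "PRP": "Adv",
    "PUT": "Adv",
    "TMP": "Adv",
}


def analytic_function(ptb_label):
    """Map a PTB label such as 'NP-SBJ-1' to an analytic function, or None."""
    # Skip the category itself; later parts are functional tags or indices.
    for tag in ptb_label.split("-")[1:]:
        if tag in PTB_TO_AFUN:
            return PTB_TO_AFUN[tag]
    return None  # no usable functional tag: left to ad-hoc rules


examples = {label: analytic_function(label)
            for label in ["NP-SBJ", "PP-LOC", "NP-SBJ-1", "NP"]}
```

A label with no functional tag, such as plain `NP`, falls through to `None`, which marks the nodes that would need the ad-hoc rules.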

11 Syntactic Structure, Functions: PTB to P(E)DT
Penn Treebank structure (with heads added): (board) (the) (join) (will) (join)
→ PDT-like Analytic Representation
→ PDT-like Tectogrammatic Representation (automatic pre-annotation): PRED.Fut, PAT

12 English TR I: Predicative Complement
Free (non-valency) modification (of both a noun and a verb)
attribute compl.rf (green arrow to the noun)

13 English TR II: Which + Relative Clause
We have not answered your question completely, for which we apologize.

14 English TR III: Coordination

15 English TR IV: Comparison

16 English TR V: Restriction (“Exclusion”)
except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

17 English TR: (manual) annotation
TrEd
– Pre-annotated
– Graphical TR dep. tree is primary
– Text + TR
– Czech translation
Valency (a.k.a. “propbanking”)
– During TR annotation
– Propbank origins and examples
  Linked, displayed

18 EngVallex (give)

19 EngVallex Format (admit)

20 Interannotator Agreement
2007-2009:
- New annotators (lower numbers)
- Annotation “by phenomenon”
- Restarting now

21 Prague English Dependency Treebank
Availability
– Version 1.0 now (PTB license needed)
  250k words
– Full version (parallel with Czech): late 2010
Size
– Full WSJ portion of PTB (2312 files)
– 49208 sentences, 1253013 tokens
– Now: 17210 sentences (34.97%), 439983 tokens (35.11%)
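The coverage percentages on this slide follow directly from the counts. A short check, using only the numbers given above (the helper name `share` is an illustrative assumption):

```python
def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts taken from the slide: annotated so far vs. full WSJ portion of PTB.
sentence_share = share(17210, 49208)
token_share = share(439983, 1253013)
```

Both values agree with the figures quoted on the slide to two decimal places.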

22 Czech PDT-style Annotation
All layers
– morphology, syntax, tectogrammatical
So far…
– Automatic (many tools by many authors)
Manual annotation
– In progress (28124 sentences/639326 words)
– Top-down
  Tectogrammatical first (lower layers automatically)
  … then syntactic structure and morphology

23 Summary
PDT is/has (a)…
– (Family of) dependency-based treebanking project(s)
  Czech (English, Arabic, ...)
– ~ 1mil. words
  sufficient size for ML experiments
– 4 interlinked layers of annotation
  (token, morphology, syntax, deep syntax/semantics++)
  independent and “full” information at all levels
  interlinked (for the development of parsers/generators)
– Parallel corpus
  Cze Eng -> Machine Translation

24 Pointers, Acknowledgements
http://ufal.mff.cuni.cz/pedt
http://ufal.mff.cuni.cz/pdt2.0
http://ufal.mff.cuni.cz/~pajas/tred
Acknowledgements
– FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
– LC536 (Center for Computational Linguistics)
– GAČR 405/06/0589 (Speech and deep syntax)
– MŠMT: MSM0021620838, ME838, ME09008

