June 6, 2007, 3rd PIRE Meeting: Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank. Lucie Mladová, Silvie Cinková, Kristýna.

1 June 6, 2007, 3rd PIRE Meeting
Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová, Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský

2 Outline:
● Functional Generative Description
● Parallel Treebanks
● PCEDT 2.0 – Project Report
  ○ tectogrammatical level of annotation
  ○ valency treatment
  ○ annotation manual for English
  ○ interannotator agreement

3 Functional Generative Description
● Basic approach for Prague Treebanks
  ○ dependency
  ○ stratificational description of the language
● From structure to function (meaning): 3 layers of annotation
  ○ morphological
  ○ analytical (= surface syntax)
  ○ tectogrammatical (= "deep" syntax, semantics)

4 Functional Generative Description
● Since 1995: Prague Dependency Treebank (PDT) → Czech data (1.0 released by LDC in 2001, 2.0 in 2006)
● The idea of a parallel corpus: English data and their Czech translation: Prague Czech-English Dependency Treebank (PCEDT) (1.0 released by LDC in 2004)

5 The Idea of a Parallel, Syntactically Annotated Corpus
● Build an English corpus in the same formalism as PDT (data resource: Wall Street Journal section of the Penn Treebank)
● Translate it into Czech
● Manually annotate both parts of the corpus
● Train tectogrammar-based machine translation

6 Phrasal vs. Dependency Tree
Example sentence: Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.

7 Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
Example sentence: It may have been painted instead by a Rubens associate.

8 Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
Example sentence: It may have been painted instead of Rubens by a Rubens associate.

9 Tectogrammatical Representation (t-tree)
Contains:
● syntactic dependency and coordination: edges
● semantic relations: tectogrammatical functors
  ○ verb arguments (inner participants)
    ● semantic ACT, PAT
    ● syntactic ADDR, ORIG, EFF
  ○ free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP, ...)
  ○ other: rhematizers, idiomatic expressions, foreign phrases, ...
● valency of the verbs: valency lexicon EngValLex

10 Tectogrammatical Representation (t-tree)
Contains:
● links to the lower layers
● grammatical (and textual) coreference
● topic-focus articulation
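
As a rough illustration of what a t-tree node carries, the sketch below models a tectogrammatical node in Python. The class name `TNode`, its fields, and the helper method are hypothetical and do not reflect the PDT data format or the TrEd tooling. The functor assignments for the sentence from slide 6 are illustrative guesses, not the treebank's actual annotation, and `PRED` (the PDT functor for the main predicate) does not appear on the slides.

```python
from dataclasses import dataclass, field
from typing import List

# Functor inventories as listed on slide 9 (the slide's list of free modifications is partial).
INNER_PARTICIPANTS = {"ACT", "PAT", "ADDR", "ORIG", "EFF"}
FREE_MODIFICATIONS = {"TWHEN", "LOC", "DIR", "MANN", "CAUS", "CPR", "ACMP"}


@dataclass
class TNode:
    """Hypothetical t-layer node: lemma, functor, dependents, links to a-layer tokens."""
    t_lemma: str
    functor: str
    a_links: List[int] = field(default_factory=list)      # assumed encoding of links to the a-layer
    children: List["TNode"] = field(default_factory=list)

    def add(self, child: "TNode") -> "TNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child

    def dependents_with(self, functors: set) -> List["TNode"]:
        """Return the direct dependents whose functor is in the given set."""
        return [c for c in self.children if c.functor in functors]


# Illustrative (not gold-standard) analysis of the slide 6 sentence.
root = TNode("sell", "PRED")
root.add(TNode("Payson", "ACT"))
root.add(TNode("Irises", "PAT"))
root.add(TNode("Bond", "ADDR"))
root.add(TNode("auction", "LOC"))
root.add(TNode("November", "TWHEN"))

arguments = [c.functor for c in root.dependents_with(INNER_PARTICIPANTS)]
modifiers = [c.functor for c in root.dependents_with(FREE_MODIFICATIONS)]
print(arguments)
print(modifiers)
```

Keeping the functor on the dependent node, rather than on the edge, is just one possible design choice for such a sketch.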

11 Building the PCEDT 2.0: the Current Annotation of the English Data
Work with the corpus data:
● input: WSJ texts (PTB), approx. ... sentences (1.2 million words), automatically converted into PDT-like shape (a-layer)
● automatic t-layer processing
● manual annotation running (approx. ... trees annotated)
● meanwhile: Czech section annotation of the t-layer launched
Additional work:
● conversion of the PropBank lexicon into EngValLex (verbs only)
● tools adjustment (TrEd, unified macros for both CZ and ENG annotation)
● interannotator-agreement measuring
● first version of the annotation manual (being revised)
● training of new annotators

12 EngValLex
● adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
● manual correction
● continuous checking during the annotation
● current version contains only verbs
Future work on EngValLex:
● defining surface realizations: morphosyntactic characteristics of the semantic roles
● valency of nouns and adjectives
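
To make the idea of a valency frame concrete, here is a minimal Python sketch of a verb entry with obligatory and optional slots. The classes `ValencySlot` and `ValencyFrame`, the obligatoriness flags, and the frame given for "sell" are assumptions made for illustration; they are not taken from EngValLex, PDT-Vallex, or PropBank.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ValencySlot:
    """One slot of a frame: a functor plus an obligatoriness flag (assumed representation)."""
    functor: str
    obligatory: bool


@dataclass
class ValencyFrame:
    """Hypothetical lexicon entry: a verb lemma and its slots."""
    lemma: str
    slots: List[ValencySlot]

    def missing_obligatory(self, present: Set[str]) -> List[str]:
        """List obligatory functors that the annotated node does not realize."""
        return [s.functor for s in self.slots if s.obligatory and s.functor not in present]


# Invented frame for illustration only.
sell = ValencyFrame(
    lemma="sell",
    slots=[
        ValencySlot("ACT", obligatory=True),
        ValencySlot("PAT", obligatory=True),
        ValencySlot("ADDR", obligatory=False),
    ],
)

complete = sell.missing_obligatory({"ACT", "PAT", "ADDR"})
incomplete = sell.missing_obligatory({"ACT", "ADDR"})
print(complete)
print(incomplete)
```

A check of this kind is one way the "continuous checking during the annotation" could be supported, although the slides do not say how the checking is actually done.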

13 Annotation Manual
= "Annotation of English on the tectogrammatical level: Reference book"
● based on the abbreviated version of the annotation manual for PDT (Czech)
● chapters specific to English data annotation added
● first rough version 1.0.1: April 2007
● revision in progress
● extensions planned (concurrently with the annotation)

14 Interannotator Agreement
● monthly control of the annotation consistency
  ○ approx. 30 trees
● measured:
  ○ structure: agreement in parent node
  ○ functors
● further analysis:
  ○ list of unpaired nodes
  ○ statistics for diverging functors
  ○ elimination of detected annotation divergences at annotator meetings
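
The measures named above can be sketched as follows. This is only one plausible reading of the slide: it assumes that nodes from the two annotators are paired by a shared identifier, that each node is stored as a `(parent_id, functor)` tuple, and that agreement is the share of paired nodes on which the annotators coincide. The function name, the data layout, and the toy data are hypothetical; the project's actual measuring procedure is not described in the slides.

```python
from collections import Counter


def agreement(ann_a, ann_b):
    """Compare two annotations given as {node_id: (parent_id, functor)} dictionaries."""
    paired = sorted(set(ann_a) & set(ann_b))       # nodes present in both annotations
    unpaired = sorted(set(ann_a) ^ set(ann_b))     # nodes present in only one of them
    same_parent = sum(ann_a[n][0] == ann_b[n][0] for n in paired)
    same_functor = sum(ann_a[n][1] == ann_b[n][1] for n in paired)
    # Count which functor pairs the annotators disagreed on.
    diverging = Counter(
        (ann_a[n][1], ann_b[n][1]) for n in paired if ann_a[n][1] != ann_b[n][1]
    )
    total = len(paired)
    return {
        "parent": same_parent / total if total else 0.0,
        "functor": same_functor / total if total else 0.0,
        "unpaired": unpaired,
        "diverging_functors": diverging,
    }


# Toy data, invented for illustration.
annotator_a = {
    "n1": (None, "PRED"), "n2": ("n1", "ACT"), "n3": ("n1", "PAT"),
    "n4": ("n1", "TWHEN"), "n5": ("n1", "LOC"),
}
annotator_b = {
    "n1": (None, "PRED"), "n2": ("n1", "ACT"), "n3": ("n1", "EFF"),
    "n4": ("n3", "TWHEN"), "n6": ("n1", "MANN"),
}

result = agreement(annotator_a, annotator_b)
print(result["parent"], result["functor"])
print(result["unpaired"])
print(result["diverging_functors"])
```

Here the two ratios are computed over paired nodes only; counting unpaired nodes as disagreements would be an equally defensible choice and would give lower figures.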

15 Average Interannotator Agreement

16 Future goals
● annotation expansion
  ○ 500 trees/annotator/month
  ○ increasing (or at least keeping) the interannotator agreement
  ○ training of new annotators
● EngValLex precision
● annotation manual precision and expansion

18 Acknowledgements
The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.
Thank you for your attention!

