
1 Annotation of Grammatemes in the Prague Dependency Treebank 2.0
Magda Razímová, Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
{razimova,zabokrtsky}@ufal.mff.cuni.cz

2 Outline of the talk
LREC 2006, Annotation Science (razimova@ufal.mff.cuni.cz)
- Introduction
- Prague Dependency Treebank 2.0
- Annotation of grammatemes
  - Motivation
  - Grammateme attributes
  - Two-level node hierarchy
  - Examples of grammateme value assignment
- Final remarks

3 Introduction
- grammatemes in the PDT 2.0
  - one type of attribute of the nodes of a deep-syntactic tree
  - capture morphological meanings that are semantically indispensable
  - number for nouns, degree of comparison for adjectives, tense for verbs, etc.
- annotation of grammatemes
  - the last task in the PDT 2.0 annotation procedure
  - can be assigned automatically, building on annotation that is already available:
    - annotation of the same sentence at the lower layers
    - already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)

4 Historical background and development of the PDT project
- mid-1960s: Praguian Functional Generative Description (Petr Sgall et al.)
- 1994: Czech National Corpus
- 1995: PDT started
- 1998: PDT 0.5 pre-release
- 2001: PDT 1.0 released by LDC
  - manual annotation of morphology and surface syntax
- 2006: PDT 2.0 to be released by LDC
  - interlinked morphological, surface-syntactic and complex deep-syntactic annotation
  - including annotation of grammatemes


6 Layers of annotation
- tectogrammatical layer: deep-syntactic dependency tree
- analytical layer: surface-syntactic dependency tree
- morphological layer: m-lemma and m-tag associated with each token
- word layer: original text, segmented on word boundaries
Example (figure): lit.: He-was would went to forest. 'He would have gone to the forest.'

7 Interlinking the layers
- any unit at any layer has an ID that is unique within the PDT
- neighboring layers are connected by top-down pointers
Example (figure): lit.: He-was would went to forest. 'He would have gone to the forest.'
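The interlinking described on this slide can be pictured with a small data-structure sketch. Everything concrete here is an assumption made for illustration: the `Unit` class, the ID strings (`w1`, `m1`, `a1`, `t1`) and the `lower` field are hypothetical and do not reflect the actual PDT file format. Only the idea of unique IDs and top-down pointers between neighboring layers comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    uid: str      # unique ID (the format used here is made up)
    layer: str    # "w", "m", "a" or "t"
    label: str
    lower: list = field(default_factory=list)  # IDs of units one layer down


def surface_tokens(uid, index):
    """Follow top-down pointers from any unit down to the word layer."""
    unit = index[uid]
    if unit.layer == "w":
        return [unit.label]
    tokens = []
    for lower_id in unit.lower:
        tokens.extend(surface_tokens(lower_id, index))
    return tokens


# Three word forms of the complex verb form "byl by šel" (would have gone).
units = []
for i, form in enumerate(["byl", "by", "šel"], start=1):
    units.append(Unit(f"w{i}", "w", form))
    units.append(Unit(f"m{i}", "m", form, [f"w{i}"]))
    units.append(Unit(f"a{i}", "a", form, [f"m{i}"]))
# One t-node stands for the whole verb form and points to all three a-nodes.
units.append(Unit("t1", "t", "byl by šel", ["a1", "a2", "a3"]))

index = {u.uid: u for u in units}
print(surface_tokens("t1", index))
```

Because every pointer goes from a higher layer to the layer directly below it, any t-node can be traced back to the tokens it was built from, which is what the automatic grammateme assignment relies on.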

8 Size of the PDT 2.0 data (i)
- 7,129 manually annotated textual documents
- all documents annotated at the m-layer
  - 16,065 sentences with 1,960,657 tokens
- 75 % of the m-layer data annotated at the a-layer
  - 5,338 documents, 87,980 sentences, 1,504,847 tokens
- 44 % of the m-layer data annotated also at the t-layer
  - 3,168 documents, 49,442 sentences, 833,357 tokens

9 Size of the PDT 2.0 data (ii)
- training data (80 %)
- development test data (10 %)
- evaluation test data (10 %)

10 M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes m-lemma and m-tag)
- positional m-tag: 15 characters
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
Example (figure): lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer. 'Some contours of the problem seem to be clearer after the resurgence by Havel's speech.'
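A positional tag of this kind can be read by indexing into the string. The sketch below covers only the five positions named on the slide; the field names (`pos`, `subpos`, and so on) and the function `parse_mtag` are hypothetical, and the sample tag is the one quoted later in the talk for les (forest).

```python
# Names for the first five positions listed on the slide; the remaining
# ten positions are not described in the talk and are ignored here.
POSITIONS = ["pos", "subpos", "gender", "number", "case"]


def parse_mtag(tag):
    """Split a 15-character positional m-tag into its first five fields."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    # zip stops after the five named positions.
    return dict(zip(POSITIONS, tag))


fields = parse_mtag("NNIS2-----A----")
print(fields["number"], fields["case"])
```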

11 A-layer
- rooted ordered tree with labeled nodes and edges
- a-nodes
  - each token of the m-layer is represented by exactly one a-node
  - labeled with a-lemmas (identical with word forms)
- a-edges
  - represent dependency relations (Sb, Obj, Adv, Atr)
  - represent non-dependency relations (Coord)
  - the analytical function attribute appears as an a-node attribute
Example (figure): 'Some contours of the problem seem to be clearer after the resurgence by Havel's speech.'

12 T-layer
- rooted ordered tree with labeled nodes and edges
- t-nodes
  - complex typed feature structures
  - represent auto-semantic words
  - function words do not have nodes of their own
  - artificially added nodes
- t-edges
  - dependency relations (functor)
  - non-dependency relations (coordination constructions)
  - the functor attribute appears as a t-node attribute
Example (figure): 'Some contours of the problem seem to be clearer after the resurgence by Havel's speech.'

13 Areas of annotation at the t-layer
- tree structure
- t-lemma attribute
- dependency relation (functor and subfunctor)
- topic-focus attributes
- co-reference attributes
- node typing attributes (nodetype and sempos)
- grammateme attributes
Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
lit.: [To] all was handed over a certificate of successful graduation from the course.
'They all received a certificate of successful graduation from this course.'


15 Grammatemes: Motivation
- grammatemes: t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
- semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)

16 Grammateme attributes
15 grammatemes: indeftype, numertype, negation, degcmp, tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness, number, gender, person, politeness

17 Conditioned presence/absence of grammatemes
- obviously, not all grammatemes are relevant for all nodes
  - no tense for dog, no degree of comparison for (he) waits, etc.
- how to formally declare the presence/absence of a given grammateme attribute in a given node? → the need for node typing
- chosen solution: two-level typing
  - 1st level: 8 more general types of nodes; grammatemes are relevant for only one of them (complex nodes)
  - 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech

18 Presence/absence of grammateme values: two-level t-node hierarchy
- 1st level: attribute nodetype
- 2nd level: attribute sempos

19 First level of the hierarchy: attribute nodetype
- 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
- fully automatic annotation
  - use of the tree structure → root
  - use of t-attributes:
    - t-lemma → qcomplex | list
    - functor → atom | coap | dphr | fphr
    - else → complex
Example (figure): Levnější benzín na Východě, dražší na Západě. 'Cheaper gasoline in the East, more expensive one in the West.'
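The decision order on this slide (tree position first, then t-lemma, then functor, otherwise complex) can be written as a short cascade. This is only a sketch of that ordering. The slide does not say which t-lemmas or functors trigger which value, so the contents of `QCOMPLEX_LEMMAS`, `LIST_LEMMAS` and `FUNCTOR_TO_NODETYPE` below are illustrative placeholders and should not be read as the rule set actually used for the PDT 2.0.

```python
# Placeholder trigger sets (assumptions, not taken from the slides).
QCOMPLEX_LEMMAS = {"#Gen", "#Cor"}
LIST_LEMMAS = {"#Idph", "#Forn"}
FUNCTOR_TO_NODETYPE = {
    "RHEM": "atom",
    "CONJ": "coap",
    "DPHR": "dphr",
    "FPHR": "fphr",
}


def assign_nodetype(is_root, t_lemma, functor):
    """Apply the checks in the order given on the slide."""
    if is_root:                      # tree structure
        return "root"
    if t_lemma in QCOMPLEX_LEMMAS:   # t-lemma
        return "qcomplex"
    if t_lemma in LIST_LEMMAS:
        return "list"
    if functor in FUNCTOR_TO_NODETYPE:   # functor
        return FUNCTOR_TO_NODETYPE[functor]
    return "complex"                 # everything else


print(assign_nodetype(False, "benzín", "ACT"))
```

Ordering the checks this way means that complex is the default and only needs no list of its own.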

20 Second level of the hierarchy: attribute sempos
- only complex nodes are grouped into semantic parts of speech
- 19 values of the attribute sempos: n.... | adj.... | adv.... | v....
- fully automatic annotation, using:
  - m-tag
  - t-lemma
  - other t-attributes
- the sempos value delimits the set of relevant grammatemes
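One way to picture how a sempos value delimits the relevant grammatemes is a lookup table plus a check. The table below is a hedged illustration: the talk confirms only that n.denot nodes carry number and that v nodes carry verbmod. The remaining entries, the value `adj.denot`, and the names `RELEVANT` and `irrelevant_grammatemes` are assumptions made for the sketch.

```python
# Hypothetical mapping from sempos to the grammatemes allowed on a node.
RELEVANT = {
    "n.denot": {"number", "gender"},
    "adj.denot": {"degcmp", "negation"},
    "v": {"tense", "aspect", "verbmod", "deontmod",
          "dispmod", "resultative", "iterativeness"},
}


def irrelevant_grammatemes(nodetype, sempos, grammatemes):
    """Return the names of grammatemes that should not be on this node."""
    if nodetype != "complex":
        # Only complex nodes have a sempos and carry grammatemes.
        return set(grammatemes)
    allowed = RELEVANT.get(sempos, set())
    return set(grammatemes) - allowed


print(irrelevant_grammatemes("complex", "n.denot", {"number": "sg", "tense": "sim"}))
```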

21 Values of nodetype and sempos in the PDT 2.0: an overview
[Figure: tables of nodetype values and sempos values]

22 Grammateme value assignment
- n-tred environment for processing the PDT data: http://ufal.mff.cuni.cz/~pajas
- automatic annotation
  - 2000 lines of Perl code
  - crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
  - rules written in a special compact notation: 2000 lines in a text file
  - lexical resources: special-purpose lists of adverbs / verbs
- manual annotation of special problems
  - two annotators working in parallel
  - simplified annotation environment: treebank positions extracted into simple HTML forms

23 Simple HTML-based environment for manual annotation
Example (screenshot): lit.: The difference [you] would have to pay yourself.

24 Automatic vs. manual assignment
- at the t-layer of the PDT 2.0: 1,594,333 grammateme values assigned at 550,947 complex nodes
- manually assigned: 17,520 grammateme values
- inter-annotator agreement: 70-85 %

25 Grammateme assignment and m-tag
- number grammateme: values sg | pl
- assigned automatically using the m-tag
  - e.g. les (forest): m-layer tag NNIS2-----A---- → t-layer number=sg
- manual assignment
  - nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
  - e.g. dveře (door/doors): m-layer always plural; t-layer: annotator decides sg | pl
Example (figure): node n.denot, number=sg; lit.: He-was would went to forest. 'He would have gone to the forest.'
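The split between automatic and manual assignment of number might look roughly like the sketch below. It assumes that the fourth tag position holds `S` for singular and `P` for plural, as the les example suggests. The set `PLURALIA_TANTUM` is a one-item stand-in for the dictionary-derived list mentioned on the slide, the tag given for dveře is illustrative, and returning `None` to mean "leave for the annotator" is a convention invented here.

```python
# Stand-in for the list of plural-only nouns extracted from the dictionary.
PLURALIA_TANTUM = {"dveře"}

NUMBER_FROM_TAG = {"S": "sg", "P": "pl"}


def assign_number(m_lemma, m_tag):
    """Return 'sg' or 'pl', or None when a human annotator has to decide."""
    if m_lemma in PLURALIA_TANTUM:
        # Morphological plural says nothing about how many doors are meant.
        return None
    # Position 4 of the positional tag (index 3) encodes number.
    return NUMBER_FROM_TAG.get(m_tag[3])


print(assign_number("les", "NNIS2-----A----"))
print(assign_number("dveře", "NNFP1-----A----"))
```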

26 Grammateme assignment and tree structure
- mood grammateme verbmod: values ind | imp | cdn
- assigned automatically
  - one-word verbal forms, e.g. jde (goes): m-tag information
  - verbal forms consisting of several word forms (represented by a single node at the t-layer), e.g. byl by šel (would have gone): the corresponding a-layer subtree contains the node by; the m-tag of the node by is used
Example (figure): node v, verbmod=cdn; lit.: He-was would went to forest. 'He would have gone to the forest.'
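A rough sketch of the verbmod rule follows: look at the m-tags of all a-layer nodes that correspond to one t-node, and let a conditional auxiliary decide the mood. It assumes that the second tag position is `c` for the conditional auxiliary by and `i` for imperatives. The slide says only that the m-tag of by is used, so these codes, the sample tags, and the function `assign_verbmod` are assumptions.

```python
def assign_verbmod(a_nodes):
    """a_nodes: (word form, m-tag) pairs of the a-nodes behind one t-node."""
    # Collect the detailed-POS character (position 2) of every word form.
    subpos = {m_tag[1] for _, m_tag in a_nodes}
    if "c" in subpos:   # a conditional auxiliary is part of the verb form
        return "cdn"
    if "i" in subpos:   # imperative form
        return "imp"
    return "ind"


one_word = [("jde", "VB-S---3P-AA---")]
multi_word = [
    ("byl", "VpYS---XR-AA---"),
    ("by", "Vc-X---3-------"),
    ("šel", "VpYS---XR-AA---"),
]
print(assign_verbmod(one_word), assign_verbmod(multi_word))
```

The multi-word case works only because the t-node is linked to all of its a-layer nodes, which is the role of the tree structure named in the slide title.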

27 Grammateme assignment and co-reference
- the grammatemes gender, number and person of relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited from the antecedents")
Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.
'From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.'
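A consumer of the data could resolve an inher value by following the co-reference link to the antecedent. The sketch below shows one possible way to do that. The node dictionaries, the `antecedent` key, the IDs, and the value `neut` are hypothetical. Only the value `inher` and the idea of inheriting from the antecedent come from the slide.

```python
def resolve(node_id, gram, nodes):
    """Follow co-reference links until a concrete grammateme value is found."""
    seen = set()
    while True:
        node = nodes[node_id]
        value = node["gram"].get(gram)
        if value != "inher":
            return value
        if node_id in seen or "antecedent" not in node:
            return None   # co-reference cycle or missing link
        seen.add(node_id)
        node_id = node["antecedent"]


nodes = {
    "t-mleko": {"t_lemma": "mléko",
                "gram": {"gender": "neut", "number": "sg"}},
    "t-ktery": {"t_lemma": "který",
                "gram": {"gender": "inher", "number": "inher"},
                "antecedent": "t-mleko"},
}
print(resolve("t-ktery", "gender", nodes), resolve("t-ktery", "number", nodes))
```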


29 Final remarks
- achievements:
  - two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
  - automatic procedure for assigning the node classification and the grammateme attributes
  - verification of the procedure on large-scale data
- experience:
  - it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make grammateme assignment more or less automatic

30 http://ufal.mff.cuni.cz/pdt2.0/

