
A Statistical Model for Parsing Czech




1 A Statistical Model for Parsing Czech
Daniel Zeman

2 Basic Idea
Read a manually annotated corpus (the treebank). Count the number of times each particular dependency was seen.
p(edge([ve], [dveřích])) = p1
p(edge([v], [dveřích])) = p2
p(edge([ve], [dveře])) = p3
p(edge([ve], [dveřím])) = p4
where likely p1 > p2 > p3 and p4 ≈ 0. (Here p is a relative frequency rather than a true probability.)
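The counting step described above can be sketched as follows; the toy treebank, the edge representation, and the function name are illustrative assumptions, not the actual treebank format.

```python
from collections import Counter

def dependency_relfreq(treebank):
    # Count every (governor, dependent) pair seen in the treebank and
    # return its relative frequency over all edges (a relative
    # frequency rather than a true probability, as the slide notes).
    counts = Counter()
    for sentence_edges in treebank:
        counts.update(sentence_edges)
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}

# Hypothetical toy data: "ve dveřích" (in the door) seen twice,
# "v dveřích" once.
toy = [[("ve", "dveřích")], [("ve", "dveřích")], [("v", "dveřích")]]
p = dependency_relfreq(toy)
```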

3 Building the tree
Simplification: keep a stack of the N best trees. In each step, for each tree on the stack, take the M best edges that can be added and create M×N new trees. Prune the new trees: keep the N best of them. Repeat until all words are attached and the tree is complete.
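A minimal sketch of this beam construction, assuming edge scores are looked up in a dictionary, the first word is the artificial root, and a partial tree is a (score, attached-word set, edge tuple) triple; the names and the tiny floor score for unseen edges are illustrative, not the original implementation.

```python
def build_trees(words, edge_score, N=5, M=5):
    # Beam search: extend each of the N best partial trees with its
    # M best addable edges, then prune back to the N best trees.
    root = words[0]
    beam = [(1.0, frozenset([root]), ())]
    while not all(len(attached) == len(words) for _, attached, _ in beam):
        candidates = []
        for score, attached, edges in beam:
            if len(attached) == len(words):
                candidates.append((score, attached, edges))  # already complete
                continue
            # Score every edge from an attached word to an unattached one.
            options = [(edge_score.get((h, d), 1e-9), h, d)
                       for h in attached for d in words if d not in attached]
            options.sort(reverse=True)
            for s, h, d in options[:M]:  # M best extensions of this tree
                candidates.append((score * s, attached | {d}, edges + ((h, d),)))
        candidates.sort(key=lambda t: t[0], reverse=True)
        beam = candidates[:N]  # prune to the N best trees
    return beam

scores = {("ROOT", "a"): 0.9, ("a", "b"): 0.8, ("ROOT", "b"): 0.1}
beam = build_trees(["ROOT", "a", "b"], scores)
```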

4 What is a word?
Daniel Zeman, 19.8.1998
stan = tent; stanout = stand still, stop; stát = become; volební = election (adjective)
Number of tags: potentially 3000 (i.e. potentially 3.19×10⁶ different edges). In the workshop final training data: 1000; after reduction, 400.
Number of lemmas (dictionary headwords): Hajič's electronic dictionary contains approx. … before derivation (… when distinguishing semantic differences in homonyms) and almost … after derivation. Poldauf's Velký česko-anglický slovník contains … (i.e. 4.6×10⁹ different edges). Siebenschein's two-volume Česko-německý slovník contains … (6.4×10⁹ edges), but how many of these are idiomatic phrases?
Number of word forms: Hajič's dictionary covers roughly …!
Highest known number of tags per one word: 108.

5 Ambiguous tags from dictionary
[solí, NFS7A|NFP2A|VPX3A]
[bílou, AFS41A|AFS71A]
Training: We don't know which lemma is the right one, so we take tags from all possible lemmas (avoiding duplicates). We don't know which tag combination (dependency) is the right one, so we increment the counters of all possible combinations. All combinations together form an occurrence of just one dependency, so each counter is adjusted by only the combination's share of the occurrence (here 1/6).
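The fractional-count update can be sketched as follows; the function name and the dict-based counter are illustrative. The 3 × 2 tag combinations of the slide's example each receive a 1/6 share of the single observed dependency.

```python
from collections import defaultdict

def count_ambiguous(governor_tags, dependent_tags, counts):
    # All tag combinations of one ambiguous dependency together form a
    # single occurrence, so each (governor_tag, dependent_tag) counter
    # is incremented by only its share of that occurrence.
    share = 1.0 / (len(governor_tags) * len(dependent_tags))
    for g in governor_tags:
        for d in dependent_tags:
            counts[(g, d)] += share

counts = defaultdict(float)
# The slide's example: "solí" (3 tags) governing "bílou" (2 tags).
count_ambiguous(["NFS7A", "NFP2A", "VPX3A"], ["AFS41A", "AFS71A"], counts)
```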

6 Constraints and Improvements
The dependencies cannot "cross".
The tag set can be reduced. (Not all information is interesting for parsing, so tags can be merged.) Originally the treebank contained over 1200 different tags; after reduction it is just over 400.
Additional model for valency: how likely is it that a word (a tag) has a particular number of child nodes?
Adjacency: for a dependency X→Y, separate counters are kept for adjacent and non-adjacent words.
Direction: separate counters for the case that X is to the left of Y and for the case that X is to the right of Y.

7 Crossing dependency example
1.8% of dependencies cross
70% of trees have no crossing dependency
90% of trees have one or zero
98% of trees have two or fewer
…, protože doba přenosu více závisí na stavu telefonní linky než na rychlosti přístroje ("…because the transmission time depends more on the state of the telephone line than on the speed of the device")
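The non-crossing constraint can be tested pairwise; here an edge is given as a pair of word positions (governor, dependent), an illustrative encoding. Two edges cross exactly when one endpoint of one edge lies strictly between the endpoints of the other.

```python
def crosses(edge_a, edge_b):
    # Sort each edge's endpoints so the test is direction-insensitive,
    # then check for partial interval overlap (= a crossing).
    (a1, a2), (b1, b2) = sorted(edge_a), sorted(edge_b)
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2
```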

8 Tag reduction examples
Categories to consider: part of speech, detailed part of speech, gender, number, case, gender of possessor, number of possessor, person, tense, grade, negation, voice, var.
Examples of reduction:
negation and degree of comparison do not matter
detailed part of speech often does not matter (classes of adjectives, pronouns, numerals etc.) → some merged, others left intact
some pronouns and numerals are treated as adjectives or nouns
some numerals are treated as adverbs (e.g. "five times")
present, future, and imperative forms of verbs merged together
vocalization of prepositions removed
punctuation split into classes
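A sketch of such a reduction, under the simplifying assumption that a tag is a dict of category names to values (the real tags are positional strings such as VPP1A, and the category names here are illustrative):

```python
def reduce_tag(tag):
    # Drop categories that do not matter for parsing (negation, grade)
    # and merge present, future, and imperative verb forms, as the
    # slide describes.
    reduced = {k: v for k, v in tag.items() if k not in ("negation", "grade")}
    if reduced.get("tense") in ("future", "imperative"):
        reduced["tense"] = "present"
    return reduced
```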

9 Valency examples
R4 (preposition with accusative, e.g. "na")
ZSB (sentence root)
PRCX3 (reflexive pronoun "se")
VPP1A (verb, present tense, plural, 1st person, e.g. "jedeme")
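The valency model from slide 6 (how likely a tag has a given number of children) can be estimated from counts; the tree representation below, a dict from node id to (tag, parent id), is an illustrative assumption.

```python
from collections import Counter, defaultdict

def valency_model(trees):
    # For each tag, count how often a node with that tag had 0, 1, 2, ...
    # children, then normalize to relative frequencies per tag.
    child_counts = defaultdict(Counter)
    for tree in trees:
        n_children = Counter(parent for (_, parent) in tree.values()
                             if parent is not None)
        for node, (tag, _) in tree.items():
            child_counts[tag][n_children[node]] += 1
    return {tag: {n: k / sum(c.values()) for n, k in c.items()}
            for tag, c in child_counts.items()}

# Toy tree: root -> verb -> preposition -> noun (tags from the slide).
tree = {0: ("ZSB", None), 1: ("VPP1A", 0), 2: ("R4", 1), 3: ("NFS4A", 2)}
model = valency_model([tree])
```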

10 Adjacency and edge direction examples
Prepositions are usually adjacent to nouns or to adjectives that are part of a noun phrase.
Adjectives are adjacent to nouns.
Verbs are NOT adjacent to most of their modifiers.
Final punctuation is NOT adjacent to the root.
The root and all prepositions take their modifiers from the right.
Nouns are modified by adjectives from the left, and by other nouns in the genitive case from the right.
Numerals find the counted entity on the right.
Of course, for many other dependencies both cases are equally likely, so this distinction does not help there.
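Splitting each dependency counter by direction and adjacency amounts to computing two extra features from the word positions; a minimal sketch with illustrative names:

```python
def edge_features(head_pos, dep_pos):
    # Direction: is the dependent left or right of its head?
    # Adjacency: are the two words next to each other in the sentence?
    direction = "left" if dep_pos < head_pos else "right"
    adjacency = "adjacent" if abs(head_pos - dep_pos) == 1 else "non-adjacent"
    return direction, adjacency
```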

11 Different sources of tags and lemmas
From the dictionary (ambiguous).
Manually assigned (not ambiguous; the "truth"; not available for testing).
Automatically assigned by a tagger or a lemmatizer (not ambiguous; around 92% accurate tags and 98% accurate lemmas).

12 Results with tags from different sources

13 Lexicalization
A new model (table, register): dependencies are pairs of words rather than tags. Words are either lemmas (dictionary headwords) or word forms. Stay with automatically disambiguated data both for training and parsing. Here the ambiguity is not as high as with tags; the lemmatizer can achieve 98% accuracy.

14 Lexicalization: lemmas vs. word forms
Lemmas: slightly ambiguous; lemmatizer accuracy about 98%.
Word forms are not ambiguous at all.
Forms are sparser (700K possible lemmas, 20M possible forms).
Lemma + tag = form, so tags could be used for backing off from forms. Conversely, if lemmas are used, they should always be combined with tags.
Investigation: how much sparser are forms than lemmas, and lemmas than tags?

15 13481 sentences, 230450 words

16 13481 sentences, 230450 words
In fact, the number of forms is not … but …, because sentence headings such as "#53" should be collapsed to a single "#".

17 How to combine lemmas and tags
Tags are important, not just a back-off. Estimation of the weight from held-out data is postponed to future work. For now, a manually set weight was used.
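Combining the lemma and tag models with a fixed weight amounts to linear interpolation; a minimal sketch, in which the `Word` type, the dict-based models, and the weight value 0.7 are all illustrative assumptions (the slide says the real weight was set manually):

```python
from collections import namedtuple

Word = namedtuple("Word", "lemma tag")

def edge_prob(gov, dep, p_lemma, p_tag, w=0.7):
    # Interpolate the lexical (lemma) model with the tag model.
    # w is a manually chosen weight; 0.7 is a placeholder here.
    pl = p_lemma.get((gov.lemma, dep.lemma), 0.0)
    pt = p_tag.get((gov.tag, dep.tag), 0.0)
    return w * pl + (1 - w) * pt
```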

18 Lemma weight influence

19 Minor improvements
Do not trust low counts: dependencies seen five times or fewer are considered unknown. (The number five was found experimentally.)
Unknown lexical dependencies cannot be treated as impossible. Their share of the whole probability mass does not remain zero; it is donated to the tag dependency instead. → 54%
Broader search beam when building the tree: N = 50 instead of 5. → 55%
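The count-threshold back-off can be sketched as follows; this is a minimal illustration of the idea, not the exact formula from the talk:

```python
def lexical_prob(count, total, tag_prob, threshold=5):
    # Counts of `threshold` or fewer are not trusted (the slide found
    # five experimentally).  Instead of returning zero for such an
    # unknown lexical dependency, its probability share is donated to
    # the tag-dependency estimate.
    if count > threshold:
        return count / total
    return tag_prob
```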

20 Summary of results

