
1 Learning Morphological Disambiguation Rules for Turkish
Deniz Yuret, Ferhan Türe
Koç University, İstanbul

2 Overview
Turkish morphology
The morphological disambiguation task
The Greedy Prepend Algorithm
Training
Evaluation

3 Turkish Morphology
Turkish is an agglutinative language: many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
I will be able to go.
(go) + (able to) + (will) + (I)
git + ebil + ecek + im
Gidebileceğim.

4 Fun with Turkish Morphology
Avrupa – Europe
lı – European
laş – become
tır – make
ama – not able to
dık – we were
larımız – those that
dan – from
mış – were
sınız – you
Avrupalılaştıramadıklarımızdanmışsınız

5 So how long can words be?
uyu – sleep
uyut – make X sleep
uyuttur – have Y make X sleep
uyutturt – have Z have Y make X sleep
uyutturttur – have W have Z have Y make X sleep
uyutturtturt – have Q have W have Z ...
...

6 Morphological Analyzer for Turkish
masalı
masal+Noun+A3sg+Pnon+Acc (= the story)
masal+Noun+A3sg+P3sg+Nom (= his story)
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.

7 Features, IGs and Tags
126 unique features
9129 unique IGs
∞ possible tags
11084 distinct tags observed in a 1M word training corpus
Example: masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
Here masa is the stem and everything after it is the tag; the tag consists of two inflectional groups (IGs), Noun+A3sg+Pnon+Nom and Adj+With, separated by the derivational boundary ^DB.
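To make the tag structure concrete, here is a minimal sketch (not from the slides; the function name and string handling are assumptions for illustration) of splitting an analysis string into its stem and inflectional groups:

# A minimal sketch (assumption, not the authors' code): split a
# morphological analysis into its stem and inflectional groups (IGs).
def split_analysis(analysis):
    """Return (stem, list of IGs); each IG is a list of features."""
    stem, _, tag = analysis.partition("+")        # tag = everything after the stem
    igs = [ig.split("+") for ig in tag.split("^DB+")]
    return stem, igs

stem, igs = split_analysis("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(stem)  # masa
print(igs)   # [['Noun', 'A3sg', 'Pnon', 'Nom'], ['Adj', 'With']]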

8 Why not just do POS tagging?
(figure from Oflazer, 1999)

9 Why not just do POS tagging? Inflectional groups can independently act as heads or modifiers in syntactic dependencies. Full morphological analysis is essential for further syntactic analysis.

10 Morphological disambiguation
Ambiguity rare in English: lives = live+s or life+s
More serious in Turkish:
42.1% of the tokens are ambiguous
1.8 parses per token on average
3.8 parses on average for ambiguous tokens

11 Morphological disambiguation
Task: pick the correct parse given the context
1. masal+Noun+A3sg+Pnon+Acc
2. masal+Noun+A3sg+P3sg+Nom
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
Uzun masalı anlat = Tell the long story (parse 1)
Uzun masalı bitti = His long story ended (parse 2)
Uzun masalı oda = Room with a long table (parse 3)

12 Morphological disambiguation
Task: pick the correct parse given the context
1. masal+Noun+A3sg+Pnon+Acc
2. masal+Noun+A3sg+P3sg+Nom
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
Key Idea: build a separate classifier for each feature.

13 Decision Lists
1. If (W = çok) and (R1 = +DA) Then W has +Det
2. If (L1 = pek) Then W has +Det
3. If (W = +AzI) Then W does not have +Det
4. If (W = çok) Then W does not have +Det
5. If TRUE Then W has +Det
"pek çok alanda" (matched by rule 1)
"pek çok insan" (matched by rule 2)
"insan çok daha" (matched by rule 4)
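To make the rule format above concrete, here is a small hypothetical sketch (not the authors' implementation) of applying such a decision list: a word is described by a set of attribute strings, rules are tried top to bottom, and the first rule whose conditions are all present decides.

# A hypothetical sketch of decision-list application for the +Det
# example above; the attribute encoding (strings like "W=çok") is an
# assumption chosen to mirror the slide's notation.
def apply_decision_list(rules, attributes):
    for conditions, decision in rules:            # first match wins
        if conditions <= attributes:
            return decision
    raise ValueError("a decision list ends with an unconditional rule")

det_rules = [
    ({"W=çok", "R1=+DA"}, True),   # 1. W is "çok" and right neighbor ends in +DA
    ({"L1=pek"}, True),            # 2. left neighbor is "pek"
    ({"W=+AzI"}, False),           # 3. word ends in +AzI
    ({"W=çok"}, False),            # 4. word is "çok"
    (set(), True),                 # 5. default rule (If TRUE ...)
]

# "pek çok insan": rule 1 fails (no R1=+DA), rule 2 fires -> has +Det
print(apply_decision_list(det_rules, {"W=çok", "L1=pek", "R1=insan"}))  # True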

14 Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = Most-Common-Class(data)
  rule = [If TRUE Then default-class]
  while Gain(rule, dlist, data) > 0
    do dlist = prepend(rule, dlist)
       rule = Max-Gain-Rule(dlist, data)
  return dlist
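The pseudocode leaves Gain and Max-Gain-Rule abstract. Below is a rough Python sketch of the same loop under simplifying assumptions of mine (binary classes, instances as sets of attribute strings, candidate rules with a single condition, gain measured as the net reduction of training errors); the real Max-Gain-Rule searches over longer conjunctions of conditions.

# A rough sketch of GPA under simplifying assumptions (mine): binary
# classes, candidate rules with one condition, gain = error reduction.
from collections import Counter

def classify(dlist, attrs):
    for conds, cls in dlist:                      # rules tried top to bottom
        if conds <= attrs:
            return cls

def errors(dlist, data):
    return sum(classify(dlist, attrs) != cls for attrs, cls in data)

def gpa(data, classes=(False, True)):
    default = Counter(cls for _, cls in data).most_common(1)[0][0]
    dlist = [(set(), default)]                    # "If TRUE Then default-class"
    while True:
        base = errors(dlist, data)
        candidates = {(frozenset({attr}), cls)
                      for attrs, _ in data for attr in attrs for cls in classes}
        best_rule, best_gain = None, 0
        for conds, cls in candidates:
            gain = base - errors([(set(conds), cls)] + dlist, data)
            if gain > best_gain:
                best_rule, best_gain = (set(conds), cls), gain
        if best_rule is None:                     # no rule has positive gain
            return dlist
        dlist.insert(0, best_rule)                # prepend the max-gain rule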

15 Training Data
1M words of news material
Semi-automatically disambiguated
126 separate training sets were created, one for each feature
Each training set contains only the instances that have the corresponding feature in at least one of their parses
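A minimal sketch (mine, not the authors' pipeline) of how these per-feature training sets could be assembled: a token contributes an instance to feature f only if f appears in at least one of its candidate parses, and the label records whether f is present in the correct parse.

# A minimal sketch (assumption, not the authors' code): build one
# training set per morphological feature from a disambiguated corpus.
from collections import defaultdict

def features_of(parse):
    """All features in an analysis string, with the stem dropped."""
    return set(parse.replace("^DB+", "+").split("+")[1:])

def build_training_sets(tokens):
    """tokens: iterable of (attributes, candidate_parses, correct_parse)."""
    training = defaultdict(list)
    for attrs, candidates, correct in tokens:
        seen = set().union(*(features_of(p) for p in candidates))
        gold = features_of(correct)
        for feat in seen:                    # feature occurs in >= 1 parse
            training[feat].append((attrs, feat in gold))
    return training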

16 Input attributes
For a five-word window:
The exact word string (e.g. W=Ali'nin)
The lowercase version (e.g. W=ali'nin)
All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
On average 40 features per instance.
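As a hypothetical illustration of this attribute scheme for a single window position (the character-type labels follow the slide; the suffix attributes here are literal strings, whereas the slide's +In, +nIn use meta-characters that abstract over vowel harmony):

# A rough sketch (assumptions mine) of attribute extraction for one
# word of the five-word window, in the W= notation used above.
def word_attributes(word, prefix="W"):
    low = word.lower()
    attrs = {f"{prefix}={word}", f"{prefix}={low}"}
    attrs |= {f"{prefix}=+{low[i:]}" for i in range(1, len(low))}  # all suffixes
    if word[0].isupper():
        attrs.add(f"{prefix}=UPPER-FIRST")
    if any(c.islower() for c in word[1:-1]):
        attrs.add(f"{prefix}=LOWER-MID")
    if "'" in word[1:-1]:
        attrs.add(f"{prefix}=APOS-MID")
    if word[-1].islower():
        attrs.add(f"{prefix}=LOWER-LAST")
    return attrs

print(sorted(word_attributes("Ali'nin")))
# W=Ali'nin, W=ali'nin, W=+n, W=+in, W=+nin, W=+'nin, ...,
# plus UPPER-FIRST, LOWER-MID, APOS-MID, LOWER-LAST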

17 Sample decision lists
+Acc (672 rules; default class 0; top rules shown as "class  pattern"):
1  W=+InI
1  W=+yI
1  W=UPPER0
1  W=+IzI
1  L1=~bu
1  W=~onu
1  R1=+mAK
1  W=~beni
0  W=~günü
1  W=+InlArI
1  W=~onları
0  W=+olAyI
0  W=~sorunu
...
+Prop (3476 rules; default class 1; top rules shown as "class  pattern"):
0  W=STFIRST
0  W==Türk
1  W=STFIRST R1=UCFIRST
0  L1==.
0  W=+AnAl
1  R1==,
0  W=+yAD
1  W=UPPER0
0  W=+lAD
0  W=+AK
1  R1=UPPER
0  W==Milli
1  W=STFIRST R1=UPPER0
...

18 Models for individual features

19 Combining models
Candidate parses:
masal+Noun+A3sg+P3sg+Nom
masal+Noun+A3sg+Pnon+Acc
Decision list results and confidences (only the distinguishing features are necessary):
P3sg = yes (89.53%)
Nom = no (93.92%)
Pnon = no (95.03%)
Acc = yes (89.24%)
score(P3sg+Nom) = 0.8953 x (1 - 0.9392) = 0.0544
score(Pnon+Acc) = (1 - 0.9503) x 0.8924 = 0.0444
The first parse wins.
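A small sketch (mine) of this scoring step: each candidate parse is scored by multiplying, over the distinguishing features it contains, the probability the corresponding per-feature classifier assigns to that feature being present; the numbers below reproduce the arithmetic on the slide.

# A minimal sketch (not the authors' code) of combining the per-feature
# classifier outputs into a score for each candidate parse.
def parse_score(parse_feats, distinguishing, outputs):
    """outputs maps feature -> (predicted_present, confidence)."""
    score = 1.0
    for feat in parse_feats & distinguishing:
        predicted_present, conf = outputs[feat]
        score *= conf if predicted_present else 1.0 - conf
    return score

outputs = {"P3sg": (True, 0.8953), "Nom": (False, 0.9392),
           "Pnon": (False, 0.9503), "Acc": (True, 0.8924)}
distinguishing = {"P3sg", "Nom", "Pnon", "Acc"}

parse1 = {"Noun", "A3sg", "P3sg", "Nom"}   # masal+Noun+A3sg+P3sg+Nom
parse2 = {"Noun", "A3sg", "Pnon", "Acc"}   # masal+Noun+A3sg+Pnon+Acc
print(parse_score(parse1, distinguishing, outputs))  # ~0.0544, the winner
print(parse_score(parse2, distinguishing, outputs))  # ~0.0444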

20 Evaluation
Test corpus: 1000 words, hand tagged
Accuracy: 95.87% (confidence interval: 94.57 - 97.08)
Better than the training data!?
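For reference, the reported interval is roughly what a 95% binomial confidence interval gives for 1000 test tokens; the slide does not state which interval was used, so the quick normal-approximation check below (mine) only agrees approximately.

# A quick sanity check (mine) of the reported confidence interval using
# a normal approximation to the binomial; not from the slides.
import math

p, n = 0.9587, 1000
half = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{100 * (p - half):.2f} - {100 * (p + half):.2f}")
# -> about 94.64 - 97.10, close to the reported 94.57 - 97.08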

21 Other Experiments
Retraining on its own output: 96.03%
Training on unambiguous data only: 82.57%
Forget disambiguation, let's do tagging with a single decision list: 91.23% (10000 rules)

22 Contributions
Learning morphological disambiguation rules with the GPA decision list learner.
Reducing data sparseness and increasing noise tolerance by using separate models for individual output features (cf. error-correcting output codes (ECOC), word sense disambiguation (WSD), etc.).

