Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 27, 2005.


1 Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 27, 2005

2 Course Outline
July 18: Intro to computational morphology; XFST
  Readings:
    Lauri Karttunen, Finite-State Constraints, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    Karttunen and Beesley, 25 Years of Finite-State Morphology
    Chapter 1: Gentle Introduction (B&K)
July 20: Regular expressions; more on XFST
  Readings:
    Chapter 2: Systematic Introduction
    Chapter 3: The XFST interface

3 July 25: More on XFST: Date Parser; Concatenative morphotactics: The LEXC language
  Readings:
    Chapter 4. The LEXC Language
July 27: Constraining non-local dependencies: Flag Diacritics; Complex morphotactics and alternations: Finnish Numerals
  Readings:
    Chapter 5. Flag Diacritics

4 August 1: Non-concatenative morphotactics (reduplication, interdigitation); Realizational morphology
  Readings:
    Chapter 8. Non-Concatenative Morphotactics
    Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
    Lauri Karttunen, Computing with Realizational Morphology, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
August 3: Optimality theory
  Readings:
    Paul Kiparsky, Finnish Noun Inflection, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
    Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

5 Syllabification revisited
define MarkNonDiphthongs [
  [..] -> "." || [HighV | MidV] _ LowV,    # i.a, e.a
                 LowV _ MidV,              # a.e
                 i _ [MidV - e],           # i.o, i.ä
                 u _ [MidV - o],           # u.e
                 y _ [MidV - ö],           # y.e
                 $V i _ e,                 # poiki.en
                 $V u _ o,                 #
                 $V y _ ö,                 #
                 $V [MidV | LowV] _ [u|y] [C C | .#.]];    #
define Syllabify [ C* V+ C* @-> ... "." || _ C V ];
regex FinnWords .o. MarkNonDiphthongs .o. Syllabify;

6 Constraints
[Diagram: Esperanto lexicon network with arcs labeled ge, hund, bon, ne, mal, eg, et, in, o, a, ec, jn; the upper-side tags MF+, +Fem, +Pl are highlighted]
MF%+ => _ ~$[%+Fem] %+Pl ;

7 Constraining by composition
xfst[0]: read lexc < adj-noun-tags.lexc
Root...2, Nouns...2, NounRoots...4, Nmf...5,....
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.
xfst[1]: up gehundino
MF+hund+Noun+Fem+Sg
xfst[1]: regex "MF+" => _ ~$["+Fem"] "+Pl" ;
1.2 Kb, 2 states, 7 arcs, Circular
xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
xfst[1]: up gehundino
xfst[1]: *** Not accepted ***
Fewer words, bigger network: composing the constraint removes gehundino from the language, but the network grows from 45 states and 70 arcs to 61 states and 89 arcs.
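As a rough illustration of what the restriction "MF+" => _ ~$["+Fem"] "+Pl" demands of an analysis string, here is a minimal Python sketch. It is not how xfst compiles the rule; it only checks the same condition on a list of symbols. The function name and the simplification of treating a whole stem as one symbol are assumptions made for this example.

```python
def satisfies_mf_restriction(symbols):
    """Check the restriction "MF+" => _ ~$["+Fem"] "+Pl" on a symbol list.

    Every "MF+" must be followed by a stretch that contains no "+Fem"
    and then by "+Pl"; anything may follow the "+Pl".
    """
    for i, sym in enumerate(symbols):
        if sym != "MF+":
            continue
        licensed = False
        for later in symbols[i + 1:]:
            if later == "+Fem":
                break            # "+Fem" reached before any "+Pl": not licensed
            if later == "+Pl":
                licensed = True  # "+Fem"-free stretch ending in "+Pl"
                break
        if not licensed:
            return False
    return True


# The analysis of "gehundino" from the slide violates the constraint.
print(satisfies_mf_restriction(["MF+", "hund", "+Noun", "+Fem", "+Sg"]))
# A plural, non-feminine analysis with MF+ satisfies it.
print(satisfies_mf_restriction(["MF+", "hund", "+Noun", "+Pl"]))
```

Strings without MF+ pass trivially, which is why composing the constraint only removes words from the MF+ part of the lexicon.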

8 Esperanto with Flags
Multichar_Symbols +Noun +Adj +NSuff +ASuff +Nize +Pl +Sg +Acc
                  MF+ +Aug +Dim +Fem Op+ Neg+
                  @U.MF.Yes@ @U.MF.No@

LEXICON Root
             Nouns ;
             Adjectives ;

LEXICON Nouns
             NounRoots ;
@U.MF.Yes@   Ge ;

LEXICON Ge
MF+:ge       NounRoots ;

LEXICON NounRoots
bird         Nmf ;
hund         Nmf ;
kat          Nmf ;

LEXICON Nmf
+Noun:0      AugDimFem ;

LEXICON AugDimFem
@U.MF.No@    Fem ;
+Dim:et      AugDimFem ;
+Aug:eg      AugDimFem ;
             Nend ;
             Adjend ;

LEXICON Fem
+Fem:in      AugDimFem ;

9 Constraining by flags
xfst[0]: read lexc < esperanto-flags.lexc
xfst[1]: up gehundino
xfst[1]:
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
xfst[1]:
xfst[1]: set obey-flags off
variable obey-flags = off
xfst[1]: up gehundino
MF+hund+Noun+Fem+NSuff+Sg
xfst[1]: set show-flags on
variable show-flags = on
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@
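The last line of the session shows why gehundino is rejected when flags are obeyed: the path sets MF to Yes and then tries to unify it with No. A minimal Python sketch of that check follows. It handles only @U...@ flags, and the function names are made up for this illustration; xfst's own lookup code is not shown in the slides.

```python
import re

# Matches unification flags of the form @U.Feature.Value@ only.
U_FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def split_flags(path):
    """Separate a show-flags string into its surface form and its U flags."""
    flags = U_FLAG.findall(path)      # list of (feature, value) pairs, in order
    surface = U_FLAG.sub("", path)    # flags behave like epsilons on the surface
    return surface, flags


def obeys_flags(flags):
    """Return True if every unification along the path succeeds."""
    settings = {}
    for feature, value in flags:
        # An unset feature takes the value; a set feature must match it.
        if settings.setdefault(feature, value) != value:
            return False
    return True


surface, flags = split_flags("@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@")
print(surface, flags, obeys_flags(flags))
```

With obey-flags off, the lookup skips the second step, which is why the same path is then accepted.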

10 Flags in the sigma
xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize +Noun +Pl +Sg @U.MF.No@ @U.MF.Yes@
Size: 35
@U.MF.Yes@: UNIFY feature 'MF' with value 'Yes'
@U.MF.No@: UNIFY feature 'MF' with value 'No'
2 flag diacritics

11 Eliminating flags
xfst[1]: eliminate flag MF
3.2 Kb. 61 states, 89 arcs, Circular
xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize +Noun +Pl +Sg
Size: 33
The eliminate flag command composes the network with constraint networks that have the same effect as the flag diacritics that are removed. The sigma shrinks from 35 to 33 symbols because the two flag diacritics are gone.

12 Flag Diacritics
Special symbols for encoding features, that is, attribute-value pairs.
Checked at runtime to avoid the cost of compiling them into the structure of the network.
If a check fails, the path is abandoned.

13 Attributes and Values
Epsilon arcs with feature constraints.
@U.Feature.Value@    Unify Feature with Value if possible.
@C.Feature@          Set Feature to the unspecified value.

14 Rules
There can be any number of attributes.
An attribute can have any number of values.
If the value of an attribute is unspecified, it unifies successfully with any given value and is set to that value.
If the value of an attribute is specified, it unifies only with the given value.

15 Actions: Unify, Positive Set
@U.Feature.Value@    Unify Value with the current setting of Feature, if possible. Otherwise fail.
@P.Feature.Value@    Set Feature to Value regardless of the current setting. Always succeeds.

16 More Actions: Negative Set, Clear
@N.Feature.Value@    Set Feature to the complement of Value regardless of the current setting. Always succeeds.
@C.Feature@          Make Feature unspecified. Always succeeds.

17 More Actions: Require
@R.Feature.Value@    Succeed if Feature is set to Value. Otherwise fail.
@R.Feature@          Succeed if Feature has been set to some value. Otherwise fail.

18 More Actions: Equality
@E.Feature1.Feature2@    Succeed if Feature1 has the same value as Feature2. Otherwise fail.
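The six actions can be summarized as a small interpreter that walks the flags on a path and abandons the path at the first failed check. The Python sketch below is one reading of the slide definitions, not the xfst implementation. Three points go beyond what the slides state and are assumptions: a negatively set feature unifies with any value other than the excluded one, @R.Feature@ counts a negative setting as "set", and @E...@ fails when either feature is unspecified.

```python
import re

FLAG = re.compile(r"@([UPNCRE])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flag(flag, state):
    """Apply one flag diacritic to `state`; return False if its check fails.

    `state` maps a feature name to (value, positive). A missing key means
    the feature is unspecified; positive=False means "anything but value".
    """
    m = FLAG.fullmatch(flag)
    if m is None:
        raise ValueError(f"not a flag diacritic: {flag!r}")
    action, feature, arg = m.groups()
    if action in "UPNE" and arg is None:
        raise ValueError(f"{flag!r} needs a second argument")
    current = state.get(feature)

    if action == "P":                      # positive set: always succeeds
        state[feature] = (arg, True)
        return True
    if action == "N":                      # negative set: always succeeds
        state[feature] = (arg, False)
        return True
    if action == "C":                      # clear: always succeeds
        state.pop(feature, None)
        return True
    if action == "R":                      # require
        if arg is None:
            return current is not None     # assumption: negative counts as set
        return current == (arg, True)
    if action == "E":                      # equality (assumption: both must be set)
        return current is not None and current == state.get(arg)

    # action == "U": unify
    if current is None:
        state[feature] = (arg, True)
        return True
    value, positive = current
    if positive:
        return value == arg
    if value == arg:                       # excluded by an earlier negative set
        return False
    state[feature] = (arg, True)
    return True


def check_path(flags):
    """Return True if all flags on the path succeed, applied left to right."""
    state = {}
    return all(apply_flag(flag, state) for flag in flags)


print(check_path(["@U.MF.Yes@", "@U.MF.No@"]))            # unification clash
print(check_path(["@U.MF.Yes@", "@C.MF@", "@U.MF.No@"]))  # clear resets MF first
```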

19 Eliminating flags
The constraints on "@U.FEATURE.VALUE@" have the form
  ~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]
Constraint for eliminating @U.MF.No@:
  ~[?* ["@U.MF.Yes@"]                 # prohibit
       ~$["@P.MF.No@" | "@C.MF@"]     # allow
       "@U.MF.No@" ?*]

20 Finnish Numerals

21 Numbers and Numerals
The mapping from integers 0, 1, 2, 3 … to the corresponding numerals one, two, three … is a regular relation.
Some languages have a very simple numeral system, some are more complicated: seventy-three, soixante-treize, drei-und-siebzig
We can compile transducers that map between the numbers and the corresponding numerals.

22 Number-to-Numeral transducer
Generation: 105 -> hundred five, hundred and five, one hundred and five
Analysis: hundred five -> 105

23 The Goal Ahead: Finnish
Analysis: sadanviiden -> 105+Sg+Gen, hundred and five (Sg Gen)
Generation: 28+Ord+Pl+Gen -> kahdensienkymmenensienkahdeksansien, twenty-eighth (Pl Gen)

24 Finnish Numerals
Compound numerals are written as one word:
  2 × 1000 + 5 × 100 + 3 × 10 + 1 = 2531
  kaksituhattaviisisataakolmekymmentäyksi
Numerals express ordinality, number, and case:
  sata+Sg+Nom (100)  sata       sata+Ord+Sg+Nom (100th)  sadas
  sata+Sg+Gen (100)  sadan      sata+Ord+Sg+Gen (100th)  sadannen
  sata+Pl+Gen (100)  satojen    sata+Ord+Pl+Gen (100th)  sadansien

25 Singular vs. Plural
Numerals generally occur with singular nouns:
  kaksi+Sg+Gen kenkä+Sg+Gen
  kahden kengän omistaja (owner of two shoes)
Sets and public events may be in plural:
  kaksi+Pl+Gen kenkä+Pl+Gen
  kaksien kenkien omistaja (owner of two pairs of shoes)
  kolme+Ord+Pl+Nom olympialainen+Pl+Nom
  kolmannet olympialaiset (third olympic games)
  yksi+Pl+Nom hää+Pl+Nom
  yhdet häät (one wedding)

26 Morphotactics
All parts of compound numerals agree in all respects.
two thousand five hundred (2500):
  kaksi+Sg+Gen tuhat+Sg+Gen viisi+Sg+Gen sata+Sg+Gen
  kahden tuhannen viiden sadan
two ten eighth (28th):
  kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
  kahde ns i en   kymmene ns i en   kahdeksa ns i en

27 Singular nominative is exceptional
Numeral with a noun:
  kaksi+Gen kenkä+Gen      kahden kengän (two shoes)
  kaksi+Nom kenkä+Part     kaksi kenkää (two shoes)
Compound numeral:
  kaksi+Gen tuhat+Gen viisi+Gen sata+Gen kolme+Gen (2503)
  kahden tuhannen viiden sadan kolmen
  kaksi+Nom tuhat+Part viisi+Nom sata+Part kolme+Nom (2503)
  (kaksi tuhatta) + (viisi sataa) + kolme
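The nominative pattern on this slide (multiplier in the nominative, the power of ten in the partitive) can be sketched in a few lines of Python for the singular nominative cardinals 1 to 9999. This is only an arithmetic illustration, not the lexc solution the slides build toward. The function and table names are made up for the example, and the forms for 11 to 19 (unit + "toista") are standard Finnish added here; they do not appear in the slides.

```python
UNITS = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
         "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]

# (value, nominative used when the count is one, partitive used after a multiplier)
POWERS = [(1000, "tuhat", "tuhatta"),
          (100, "sata", "sataa"),
          (10, "kymmenen", "kymmentä")]


def finnish_cardinal(n):
    """Spell out n (1..9999) as a singular nominative cardinal, written as one word."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 are handled in this sketch")
    parts = []
    for value, nominative, partitive in POWERS:
        count, n = divmod(n, value)
        if value == 10 and count == 1 and n > 0:
            parts.append(UNITS[n] + "toista")        # 11..19: unit + "toista"
            n = 0
        elif count == 1:
            parts.append(nominative)                 # sata, tuhat, kymmenen
        elif count > 1:
            parts.append(UNITS[count] + partitive)   # kaksi + tuhatta, viisi + sataa
    if n:
        parts.append(UNITS[n])
    return "".join(parts)


print(finnish_cardinal(2531))
print(finnish_cardinal(2503))
```

Every other case, number, and the ordinals need full agreement across all parts, which is why a table like this does not generalize and the slides turn to a lexicon with flag diacritics instead.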

28 Morphological Alternations
Semiregular stem alternations
  yksi+Sg+Nom  : yksi (one)
  yksi+Sg+Ess  : yhtenä
  yksi+Sg+Gen  : yhden
  yksi+Sg+Part : yhtä
  yksi+Pl+Gen  : yksien
Irregular stem alternations
  yksi+Ord+Sg+Nom : ensimmäinen (first)
Regular suffix alternations
  Vowel harmony:  kolme+Sg+Part : kolmea  vs. neljä+Sg+Part : neljää
  Illative vowel: kolme+Sg+Ill : kolmeen  vs. neljä+Sg+Ill : neljään
  Partitive t:    yksi+Sg+Part : yhtä     vs. neljä+Sg+Part : neljää

29 Solution for Finnish
Numbers/Finnish Transducer: maps a number with morphological tags into an inflected Finnish numeral. Encodes morphotactic constraints.
.o.
lexc source lexicon: looping lexicon with all the forms of all Finnish single numerals concatenated in all possible ways. Composed with morphophonological rules.

30 Example
Numbers/Finnish Transducer:
  2 5 +Ord +Pl +Gen
  kaksi +Ord +Pl +Gen   kymmenen +Ord +Pl +Gen   viisi +Ord +Pl +Gen
.o.
lexc source lexicon:
  kaksi +Pl +Nom   kymmenen +Part   VIISI +Ord +Gen
  kahdet kymmentä viidennen (ungrammatical)
  kaksi +Ord +Pl +Gen   kymmenen +Ord +Pl +Gen   viisi +Ord +Pl +Gen
  kahdensien kymmenensien viidensien

31 Sublexicon for One
LEXICON Yksi
YKSI+Sg:yksi                Nom ;          ! singular nominative
YKSI+Sg:yhde                WeakGrade ;    ! weak stem (most cases)
YKSI+Sg:yhte                StrongGrade ;  ! strong stem (essive, ill.)
YKSI+Sg:yht                 Par ;          ! partitive stem
YKSI:yks                    PlStem1 ;      ! plural stem
YKSI+Ord1+Sg:ensimmäinen    Nom ;          ! singular nominative
YKSI+Ord1+Sg:ensimmäise     AnyGrade ;     ! weak/strong stem
YKSI+Ord1+Sg:ensimmäis      Par ;          ! partitive stem
YKSI+Ord+Sg:yhdes           Nom ;          ! singular nominative
YKSI+Ord+Sg:yhdenne         WeakGrade ;    ! weak stem
YKSI+Ord+Sg:yhdente         StrongGrade ;  ! strong stem
YKSI+Ord+Sg:yhdet           Par ;          ! partitive stem
YKSI+Ord:yhdens             PlStem1 ;      ! plural stem

32 Some sublexicons
LEXICON WeakGrade
            SgGen ;      ! Singular Genitive
            PlNom ;      ! Plural Nominative
            InvarWeak ;  ! Invariant (plural and singular) cases

LEXICON InvarWeak
+Tra:ksi    Next ;       ! Translative into
+Ine:ssA    Next ;       ! Inessive in
+Ela:ltA    Next ;       ! Elative from (inside)
+Ade:llA    Next ;       ! Adessive on
+Abl:ltA    Next ;       ! Ablative from (outside)
+All:lle    Next ;       ! Allative onto
+Abe:ttA    Next ;       ! Abessive without

33 Sample paths for Two
  kaksi+Sg+Nom        kaksi
  kaksi+Sg+Gen        kahde n
  kaksi+Sg+Ess        kahte na
  kaksi+Sg+Par        kah TA
  kaksi+Pl+Gen        kaks i en
  kaksi+Pl+Ill        kaks i Vn
  kaksi+Ord+Sg+Nom    kahde s
  kaksi+Ord1+Sg+Nom   toinen
  kaksi+Ord+Sg+Ill    kahde nte Vn
  kaksi+Ord1+Sg+Ill   toise Vn

34 Morphophonological rules
define BackV [a | o | u];
define FrontV [ä | ö | y];
define Vow [BackV | FrontV | i | e];
define VHarmony [A -> a || BackV ~$[FrontV] _ .o. A -> ä];
define IllativeV [V -> a || a (h) _ ,
                  V -> e || e (h) _ ,
                  … ];
define PartitiveT [T -> 0 || \Vow Vow _ ];
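To see what VHarmony and PartitiveT do to the intermediate forms from the lexicon, here is a rough Python sketch that applies the two rules as string rewrites. It is not a transducer, and several choices are assumptions made for the illustration: the function names, applying PartitiveT before VHarmony, and lower-casing any surviving T to t (the slide does not show the rule that realizes T). IllativeV is left out because its definition is elided on the slide.

```python
BACK = set("aou")
FRONT = set("äöy")
VOWELS = BACK | FRONT | set("ie")


def partitive_t(form):
    """T -> 0 || \\Vow Vow _ : drop T after a single vowel preceded by a non-vowel."""
    out = []
    for i, ch in enumerate(form):
        if (ch == "T" and i >= 2
                and form[i - 1] in VOWELS
                and form[i - 2] not in VOWELS):
            continue                      # T deleted in this context
        out.append(ch)
    return "".join(out)


def vowel_harmony(form):
    """A -> a after a back vowel with no front vowel in between, otherwise A -> ä."""
    out = []
    last = None                           # class of the most recent harmonic vowel
    for ch in form:
        if ch in BACK:
            last = "back"
        elif ch in FRONT:
            last = "front"
        out.append(("a" if last == "back" else "ä") if ch == "A" else ch)
    return "".join(out)


def realize(form):
    """Apply both rules; a remaining T surfaces as t (an assumption of this sketch)."""
    return vowel_harmony(partitive_t(form)).replace("T", "t")


for lexical in ["kolmeTA", "neljäTA", "kahTA", "yhdessA"]:
    print(lexical, "->", realize(lexical))
```

The neutral vowels i and e do not reset the harmony class, which is what lets kolme take the back variant of the suffix.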

35 Example again
Numbers/Finnish Transducer:
  2 5 +Ord +Pl +Gen
  KAKSI +Ord +Pl +Gen   KYMMENEN +Ord +Pl +Gen   VIISI +Ord +Pl +Gen
.o.
lexc source lexicon
.o.
morphophonological rules:
  KAKSI +Pl +Nom   KYMMENEN +Part   VIISI +Ord +Gen   (ungrammatical)
  kahdet kymmentä viidennen
  KAKSI +Ord +Pl +Gen   KYMMENEN +Ord +Pl +Gen   VIISI +Ord +Pl +Gen
  kahdensien kymmenensien viidensien

36 Remaining problems
Special ordinals for yksi (one), kaksi (two): ensimmäinen (1st) vs. kahdeskymmenesyhdes (21st)
  Compose the lexicon with an appropriate filter to eliminate unwanted variants.
No internal tags: 2+Sg+Gen00+Sg+Gen
  Delete them: 0 <- Tag || _ $[\Tag Tag+] .#. ;
Singular nominative as partitive in compounds
  %+Nom -> %+Par // %+Sg %+Nom ~$Tag %+Sg _ ;
Ordinal/Plural/Case agreement
  Flag diacritics!
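The tag-deletion rule removes every tag that still has a non-tag symbol followed by a tag somewhere to its right, so only the final run of tags survives. A minimal Python sketch of that condition on a list of symbols is given below. The helper names and the test for a tag (a symbol starting with "+") are assumptions made for this example.

```python
def is_tag(symbol):
    """Assumption for this sketch: tags are the symbols that start with '+'."""
    return symbol.startswith("+")


def drop_internal_tags(symbols):
    """Mimic 0 <- Tag || _ $[\\Tag Tag+] .#. on a list of symbols.

    A tag is dropped when the material to its right contains a non-tag
    immediately followed by a tag.
    """
    kept = []
    for i, sym in enumerate(symbols):
        rest = symbols[i + 1:]
        later_tagged_part = any(
            not is_tag(a) and is_tag(b) for a, b in zip(rest, rest[1:]))
        if is_tag(sym) and later_tagged_part:
            continue
        kept.append(sym)
    return kept


# 2+Sg+Gen00+Sg+Gen from the slide becomes 200+Sg+Gen.
print(drop_internal_tags(["2", "+Sg", "+Gen", "0", "0", "+Sg", "+Gen"]))
```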

37 Flags for Finnish numerals
@U.Type.Card@ @U.Type.Ord@
@U.Number.Sg@ @U.Number.Pl@
@U.Case.Nom@ @U.Case.Gen@ @U.Case.Par@ @U.Case.Tra@ @U.Case.Ess@ @U.Case.Abe@ @U.Case.Ine@ @U.Case.Ela@ @U.Case.Ill@ @U.Case.Ade@ @U.Case.Abl@ @U.Case.All@ @U.Case.Com@ @U.Case.Ins@
Example: 300+Sg+Gen -> kolmensadan
  3   +Sg +Gen   @U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@   kolmen
  00  +Sg +Gen   @U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@   sadan

38 Conclusion
Mapping from numbers to numerals can be done in a simple and elegant way even for languages with complex morphology.
Necessary for text-to-speech applications.
Tervetuloa kahdensienkymmenensienkahdeksansien olympialaisten avajaisiin!
Welcome to the opening ceremonies of the 28th Olympic Games!

39 Demo!
