Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute August 1, 2005.


1 Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute August 1, 2005

2 August 1: Non-concatenative morphotactics. Reduplication, interdigitation. Realizational morphology. Readings: Chapter 8, “Non-Concatenative Morphotactics”. Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001 (an excerpt). Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003. August 3: Optimality theory. Readings: Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003. Nine Elenbaas and René Kager, "Ternary rhythm and the lapse constraint", Phonology 16, 273-329.

3 Morphotactics Most languages construct words by concatenating morphemes in a strict order: book+s; un+think+ing+ly; uta+ma+na+ca+pjja+samacha+i+wa (Aymara) (utamancapjjasamachiwa = it looks like they are in your house); paris+mut+nngau+niraq+lauq+sima+nngit+junga (Inuktitut) (parimunngauniralauqsimanngittunga = I never said I wanted to go to Paris). Many languages also have non-concatenative processes of word formation: reduplication (Malay) and interdigitation (Arabic).

4 Weakness of traditional finite-state morphotactics Two-level and finite-state morphology have been widely criticized for handling only concatenative morphotactics. “Only restricted infixation and reduplication can be handled adequately with the present system. Some extensions or revisions will be necessary for an adequate description of languages possessing extensive infixation or reduplication.” (Koskenniemi, 1983, p. 27)

5 Interdigitation in Arabic Concatenative: kuutib (“stem”) + a (“suffix”). Non-concatenative stem: ktb (“root”), CVVCVC (“template”) and ui (“vocalization”) combine into kuutib. Informally speaking, the root, template and vocalization morphemes “interdigitate” into a stem.

6 Full-Stem Reduplication in Malay In Malay, the overt plural of bagi (“suitcase”) is bagibagi (orthographically bagi-bagi); the plural of peraturan (“rule”) is peraturanperaturan, etc. To model such pluralization, you need to copy the stem, no matter what it is and no matter how long it is. Such “full-stem reduplication” appears to be far beyond finite-state power: the copy language ww is context-sensitive.

7 A new algorithm: compile-replace Define networks using concatenation, as before, but in such a way that the paths in the network may themselves contain regular expressions. Reapply the compiler to its own output, compiling the regular expression substrings and replacing them with the result of the compilation.

8 A non-linguistic example: before compile-replace [Figure: a network whose path reads ^[ a * ^].] Network containing a regular expression, a*, delimited with ^[ and ^].

9 Non-linguistic example: after compile-replace [Figure: network with arc labels a:0 a *:a 0:a *:0.] Maps every string in the infinite a* language to the regular expression from which the language was compiled.

10 Iteration operator ^n A^2 denotes the concatenation of the language A with itself, equivalent to [A A]. If A = {bagi, pelabuhan}, then A^2 = {bagibagi, bagipelabuhan, pelabuhanbagi, pelabuhanpelabuhan}. Finite-state languages and relations are closed under n-ary concatenation.

11 Solution for Malay Construct the basic lexicon with paths such as Lexical string: b a g i +Noun +Plural Surface string: ^[ { b a g i } ^ 2 ^] Lexical string: p e l a b u h a n +Noun +Plural Surface string: ^[ { p e l a b u h a n } ^ 2 ^] Apply compile-replace on the lower side of the network.

12 Compile-replace: before and after Before Lexical string: b a g i +Noun +Plural Surface string: ^[ { b a g i } ^2 ^] After Lexical string: b a g i +Noun +Plural Surface string: b a g i b a g i The compile-replace operation does not create any ill-formed reduplicates such as pelabuhanbagi.
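The contrast between applying ^2 to the whole stem language and applying it to each path separately can be illustrated with a small Python sketch. This is not xfst and not the compile-replace algorithm itself; the function name compile_replace_path is a hypothetical stand-in for compiling the expression { stem } ^ 2 found on a single path.

```python
from itertools import product

stems = ["bagi", "pelabuhan"]

# A^2 applied to the whole stem language: every stem followed by every stem.
squared_language = {first + second for first, second in product(stems, repeat=2)}


def compile_replace_path(stem: str) -> str:
    """Hypothetical stand-in for compiling the per-path expression { stem } ^ 2."""
    return stem * 2


# ^2 applied inside each path: only true reduplicates are produced.
reduplicated = {stem: compile_replace_path(stem) for stem in stems}

print(sorted(squared_language))
print(reduplicated)
```

Because the squaring happens inside each path, a mixed form such as pelabuhanbagi appears in squared_language but never among the values of reduplicated.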

13 Caveats The Malay solution, the use of ^2 as the reduplication operator, covers only full-stem reduplication with identity between the base and the reduplicate. There are other types of reduplication: partial reduplication and partial non-identity. This is a hot research topic.

14 Partial Reduplication Agta (Assignment 3) Start: Lexical: t a k k i +Pl Surface: ^[ t a k k i +Pl ^] Compose with: .o. C* V C @-> "[" ... "]" %^ 2 || "^[" _ $"+Pl" ; Result: Lexical: t a k k i +Pl Surface: ^[ [ t a k ] ^ 2 k i ^] After compile-replace lower: Lexical: t a k k i +Pl Surface: t a k t a k k i
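As a rough illustration of what the rule and the subsequent compile-replace step achieve together, the Python sketch below doubles the initial C* V C span of a stem. It uses an ordinary regular expression rather than xfst, and the vowel inventory "aeiou" and the function name agta_plural are assumptions made for the example.

```python
import re

VOWELS = "aeiou"  # assumption: a simple five-vowel inventory for the sketch

# C* V C at the start of the stem: any consonants, one vowel, one consonant.
INITIAL_CVC = re.compile(rf"^[^{VOWELS}]*[{VOWELS}][^{VOWELS}]")


def agta_plural(stem: str) -> str:
    """Copy the initial C* V C span in front of the stem (hypothetical helper)."""
    match = INITIAL_CVC.match(stem)
    if match is None:
        # No C* V C span at the start: leave the stem unchanged.
        return stem
    return match.group(0) + stem


print(agta_plural("takki"))
```

For takki the matched span is tak, so the result corresponds to the surface string t a k t a k k i.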

15 Arabic stem interdigitation Two lines of computational work: (1) Kay 1987, Kiraz 1994, 2000: inspired by McCarthy 1981; separate “tiers” for root, pattern, and vocalism; requires a special mechanism for constructing a fourth tier for the stem. (2) Kataja & Koskenniemi 1988, Beesley 1989, 1991, 1996: tripartite representation on a single tier; uses intersection to combine the three components.

16 Interdigitation Formalized as Intersection Let root ktb be formalized as ?* k ?* t ?* b ?*, equivalent to {ktb}/?. Let template CVVCVC be formalized as {CVVCVC}, where C is defined as the union of all consonants and V as the union of all vowels. Let vocalization ui be formalized as [u*i]/\V. Then stems can be formed via finite-state intersection rather than concatenation: {ktb}/? & {CVVCVC} & [u*i]/\V = kuutib. The string kuutib is the only one simultaneously satisfying all the constraints.
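The claim that kuutib is the only string satisfying all three constraints can be checked by brute force. The Python sketch below enumerates the strings of the template language and keeps those that also satisfy the root and vocalization constraints. The consonant and vowel inventories are assumptions (a small sample, not the full Arabic inventory), and the helper names are invented for the example.

```python
import re
from itertools import product

CONSONANTS = "ktbdrsmn"  # assumption: a small sample inventory
VOWELS = "aiu"


def template_language(template: str):
    """Enumerate every string that fits the template, e.g. CVVCVC."""
    slots = [CONSONANTS if slot == "C" else VOWELS for slot in template]
    for letters in product(*slots):
        yield "".join(letters)


def satisfies_root(word: str, root: str) -> bool:
    # ?* k ?* t ?* b ?* : the root letters occur in order, with anything between.
    return re.fullmatch(".*" + ".*".join(root) + ".*", word) is not None


def satisfies_vocalization(word: str, pattern: str) -> bool:
    # [u*i]/\V : ignoring the non-vowels, the remaining vowels match u*i.
    vowels_only = "".join(ch for ch in word if ch in VOWELS)
    return re.fullmatch(pattern, vowels_only) is not None


stems = [
    word
    for word in template_language("CVVCVC")
    if satisfies_root(word, "ktb") and satisfies_vocalization(word, "u*i")
]
print(stems)
```

Enumerating one language and filtering by the other two is only feasible because the template language is finite and small; a finite-state compiler performs the intersection on the networks instead.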

17 ‘merge’, a faster intersection To model the morphotactics of Arabic, you need union, concatenation and intersection. Languages that require only union and concatenation are just special cases. The intersection required for Arabic stems is in fact another special case: {ktb}/? & {CVVCVC} & [u*i]/\V involves just fitting the consonants of the root into the C slots and the vowels of the vocalization into the V slots.

18 Merge Operators .m>. is the “merge to the right” operator and .<m. is the “merge to the left” operator. xfst[0]: list C k t b d r s m n b t ; xfst[0]: list V a i u ; xfst[0]: read regex {ktb} .m>. {CVVCVC} .<m. [u*i] ; xfst[1]: print words kuutib

19 Solution for Arabic Construct the basic lexicon with paths such as Lexical: k t b +Root C V C V C +Template a + +Voc Surface: ^[ k t b .m>. C V C V C .<m. a + ^] Lexical: d r s +Root C V V C V C +Template u * i +Voc Surface: ^[ d r s .m>. C V V C V C .<m. u * i ^] Apply compile-replace on the lower side of the network.

20 Compile-replace: before and after Before Lexical: k t b +Root C V C V C +Template a + +Voc Surface: ^[ k t b .m>. C V C V C .<m. a + ^] After Lexical: k t b +Root C V C V C +Template a + +Voc Surface: k a t a b Alternation rules apply to the interdigitated stems to produce the real surface strings.
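The slot-filling view of merge can be sketched in Python. This is not the xfst merge algorithm, only an illustration of fitting root consonants into C slots and vocalization vowels into V slots. The function names, the vowel inventory, and the decision to treat the vocalization as an ordinary regular expression (so that u*i and a+ stretch to fill the available V slots) are assumptions made for the example.

```python
import re
from itertools import product

VOWELS = "aiu"  # assumption: the vowel inventory used on the slides


def fill_vowels(pattern: str, count: int) -> str:
    """Return the unique vowel string of the given length matching the pattern."""
    matches = [
        "".join(letters)
        for letters in product(VOWELS, repeat=count)
        if re.fullmatch(pattern, "".join(letters))
    ]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} does not fill {count} V slots uniquely")
    return matches[0]


def merge(root: str, template: str, vocalization: str) -> str:
    """Fit root consonants into C slots and vowels into V slots (hypothetical helper)."""
    if len(root) != template.count("C"):
        raise ValueError("root length must equal the number of C slots")
    consonants = iter(root)
    vowels = iter(fill_vowels(vocalization, template.count("V")))
    stem = []
    for slot in template:
        if slot == "C":
            stem.append(next(consonants))
        elif slot == "V":
            stem.append(next(vowels))
        else:
            stem.append(slot)  # any other template symbol is copied literally
    return "".join(stem)


print(merge("ktb", "CVCVC", "a+"))
print(merge("drs", "CVVCVC", "u*i"))
```

With the two lexicon paths shown on the slides, the sketch produces katab for k t b with C V C V C and a +, and duuris for d r s with C V V C V C and u * i.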

21 Summary Flag diacritics make it possible to represent long-distance constraints without blow-up in size. The compile-replace technique allows any finite-state operation to be used in morphotactic description. A special template-filling operation, merge, allows fast interdigitation in cases such as Arabic.

22 Computing with Realizational Morphology Lauri Karttunen

23 Overview A Puzzle Realizational Morphology (is finite-state) Lexical representations Realization rules Morphophonological rules Rules of referral Elsewhere principle (Panini's principle) Discussion

24 A Puzzle The success of computational morphology has not made any impact within paper-and-pencil linguistics. Computational concerns: completeness of coverage, physical size, speed of application, formal power, complexity of algorithms, … Academic concerns: explanation, universal principles, generalizations, theoretical predictions, elegant formalism, … Theoretical issues: tags (+Accusative) vs. features (Case: Accusative); commitment to morphemes?

25 Realizational Morphology Gregory Stump, Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. No morphemes! (No fixed meaning-sound pairs) A rich set of notational conventions designed to capture important linguistic generalizations. Interpretable, precise formalism. Computational implementation in DATR (Finkel & Stump 2002). The good news: Realizational morphology is a finite-state model.

26 Finite-state advantage Casting Stump's system into a regular expression formalism that has a compiler has a fundamental advantage over implementation in systems such as DATR. DATR can be used to generate an inflected surface form from its lexical representation but it is not directly usable for recognition. In contrast, finite-state transducers are bidirectional generator/recognizers. Issues to be addressed: Lexical representations Realization rules (= rules of exponence) Morphophonological rules Rules of referral Rule ordering by general principles

27 Lexical representation A phonological representation A set of morphological properties Lingala verb nakobeti 'I hit you': <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>

28 Realization rule RR_{n,τ,C}(<X,σ>) =def <Y',σ>, where n is the rule block, τ the features realized by the rule, C the category, X the phonological input, σ the features, and Y' the phonological output. Example: RR_{3, {Obj:[Per:2, Num:Sg]}, V}(<X,σ>) =def <koX,σ>. The singular second person object agreement features are realized by prefixing "ko" to the beginning of the current phonological form. The rule appears in Block 3 and applies to verbs.

29 Rule application Realization rules are ordered into blocks by the linguist. Within blocks, the ordering is determined by specificity (Elsewhere rule, Panini's principle). The final output of a realization rule may depend on morphophonological rules: X → Y → Y'

30 Cascade of rule applications <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> → Block 3: RR_{3, {Obj:[Per:2, Num:Sg]}, V}(<X,σ>) =def <koX,σ> → <kobet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> → Block 1: RR_{1, {Sub:[Per:1, Num:Sg]}, V}(<X,σ>) =def <naX,σ> → <nakobet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}> → Block 5: RR_{5, {Tns:Past:Rec}, V}(<X,σ>) =def <Xi,σ> → <nakobeti, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>
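The cascade can be mimicked with ordinary function composition. The Python sketch below is only an illustration of the idea that each realization rule maps a pair of phonological form and feature set to a new pair. The feature encoding as a dictionary and the function names are assumptions, and no finite-state machinery is involved.

```python
from functools import reduce

# Assumption: the property set is encoded as a plain dictionary.
FEATURES = {"Sub": ("1", "Sg"), "Obj": ("2", "Sg"), "Tns": "Past:Rec"}


def rr3_object_2sg(form, features):
    """Block 3: prefix 'ko' when the object agreement features are 2nd person singular."""
    if features.get("Obj") == ("2", "Sg"):
        form = "ko" + form
    return form, features


def rr1_subject_1sg(form, features):
    """Block 1: prefix 'na' when the subject agreement features are 1st person singular."""
    if features.get("Sub") == ("1", "Sg"):
        form = "na" + form
    return form, features


def rr5_recent_past(form, features):
    """Block 5: suffix 'i' when the tense is recent past."""
    if features.get("Tns") == "Past:Rec":
        form = form + "i"
    return form, features


# The rules are applied in the order shown in the cascade.
CASCADE = [rr3_object_2sg, rr1_subject_1sg, rr5_recent_past]


def realize(root, features):
    form, _ = reduce(lambda pair, rule: rule(*pair), CASCADE, (root, features))
    return form


print(realize("bet", FEATURES))
```

Unlike a composed transducer, these functions only generate; they cannot be run in reverse to recover the lexical representation from nakobeti.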

31 Observations The lexical representations of Realizational Morphology constitute a regular language. They can be described by a regular expression. All examples of realization rules given in Stump's book represent regular relations. They can be compiled into finite-state transducers. Because regular relations are closed under composition, the cascade of rule applications yields a single transducer. We can eliminate the features from the surface side once the composition has been done.

32 Literal example [Figure: a transducer path pairing the surface symbols n a k o b e t i with the symbols of the lexical representation < b e t , { Sub : [ Per : 1 , Num : Sg ] … etc.] A path in the lexical transducer for Lingala mapping the surface form nakobeti directly into its lexical representation <bet, {Sub:[Per:1, Num:Sg], Obj:[Per:2, Num:Sg], Tns:Past:Rec}>, and vice versa. In a real application, one would prefer a more parsimonious encoding of the feature structure.

33 Realization rules Stump's realization rules can easily be expressed in the Parc/XRCE regular expression formalism. Example: RR_{3, {Obj:[Per:2, Num:Sg]}, V}(<X,σ>) =def <koX,σ> becomes define R301 [..] -> {ko} || "<" _ $[ObjAgr & $2 & $Sg] Rule R301: "Insert (= rewrite the empty string as) "ko" at the beginning of a phonological form whose object agreement features contain the values 2 and Sg."

34 Morphophonological rules The output of a realization rule may be subject to a morphophonological rule. Stump's morphophonemic rules are simple rewrite rules, easily expressed in the Parc/XRCE regular expression formalism. Stump: "If X = W[vowel_1] and Y = X[vowel_2]Z, then the indicated [vowel_2] is absent from Y'." As a rewrite rule: Vowel -> 0 || Vowel "+" _ ; where "+" marks the place where the suffix is inserted.

35 Rules of referral Realization rules may be defined in terms of other realization rules. The same affix can express more than one bundle of morphological features (syncretism). In Lingala, mo expresses class 4 singular 3rd person agreement for subjects and objects. In the Parc/XRCE regular expression formalism, a rule of referral corresponds to a substitution operation. If R305 is the object agreement rule, the corresponding subject agreement rule is `[R305, Obj, Sub]. It yields a transducer identical to R305 except that the insertion of mo is controlled by subject agreement features.

36 Elsewhere principle While the rule blocks are ordered by the linguist, the realization rules within each block and the morphophonological rules are ordered by specificity. A specific rule takes precedence over a more general rule in cases where both are applicable. This principle is very important for Stump, who calls it "Panini's principle", but he gives no precise definition for it within his formalism. The Elsewhere Principle is an extremely simple notion for realization rules and for symbol-to-symbol morphophonological rules.

37 Specific vs. General Rule A: k -> 0 || Vowel _ Vowel Rule B: k -> v || u _ u

38 Input/Output languages Rule A and Rule B have the same input language: the universal language. Both rules can be applied without failure to any string. If the context is not met, the output is the same as the input. The output languages are not the same. A "successful" application of an obligatory rule removes from the output language the strings to which it has applied. Every string missing from the output language of Rule B is missing from the output language of Rule A, but not vice versa. The output language of Rule A is a proper subset of the output language of Rule B.

39 Output language of Rule A [Figure: output language of Rule A.] Rule A: k -> 0 || Vowel _ Vowel Rule B: k -> v || u _ u

40 Output language of Rule B [Figure: output language of Rule B.] Rule A: k -> 0 || Vowel _ Vowel Rule B: k -> v || u _ u

41 Principled rule ordering The relationship of any two rules A and B that have been compiled into transducers can be determined by the following method: (1) Extract the output languages (a finite-state operation). (2) Check whether one is the proper subset of the other (a finite-state operation). This determination can be done efficiently and without any knowledge of how the rules were expressed.
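The two-step method can be approximated on a bounded sample of strings. The Python sketch below applies Rule A (k -> 0 || Vowel _ Vowel) and Rule B (k -> v || u _ u) to every string up to a fixed length over a tiny alphabet, collects the two output sets, and tests for proper inclusion. The alphabet, the length bound, and the use of regular-expression lookarounds to stand in for the rule contexts are assumptions; a real implementation would work on the compiled transducers rather than on enumerated strings.

```python
import re
from itertools import product

ALPHABET = "akuv"  # assumption: a tiny alphabet with the vowels a and u
VOWEL = "[au]"
MAX_LEN = 4        # assumption: bound on string length for the enumeration


def rule_a(text: str) -> str:
    """Rule A: delete k between any two vowels (contexts checked on the input)."""
    return re.sub(rf"(?<={VOWEL})k(?={VOWEL})", "", text)


def rule_b(text: str) -> str:
    """Rule B: rewrite k as v between two u's (contexts checked on the input)."""
    return re.sub(r"(?<=u)k(?=u)", "v", text)


def all_strings(max_len: int):
    """Yield every string over ALPHABET of length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product(ALPHABET, repeat=length):
            yield "".join(letters)


# Step 1: extract the (bounded) output languages.
output_a = {rule_a(text) for text in all_strings(MAX_LEN)}
output_b = {rule_b(text) for text in all_strings(MAX_LEN)}

# Step 2: check whether one is a proper subset of the other.
a_is_more_general = output_a < output_b
print(a_is_more_general)
```

On this sample the output set of Rule A is properly included in that of Rule B, which identifies Rule B as the more specific rule, the one that should take precedence where both apply.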

42 Discussion It is evident that Realizational Morphology is yet another variant of finite-state morphology. Stump could say: "Your theory is a notational variant of mine but mine is better." There are many examples where notation matters: compare B => A _ C ("B must occur between A and C.") with the equivalent ~[ [~[?* A] B ?*] | [?* B ~[C ?*] ] ]. Stump's convoluted and cumbersome notation takes no advantage of the nice formal and computational properties that it in fact has.

43 Reflections Computational morphology and paper-and-pencil morphology have a curious non-relationship going back at least 30 years. Time after time computational knights have presented themselves at the Court of Linguistics, rushed up to the Princess of Phonology and Morphology in great excitement to deliver the same message.

44 At the Court of Linguistics Knight: "Dear Princess. I have wonderful news for you. You are not like some of your NP-complete sisters. You are regular. You are rational. You are finite-state. Please marry me. Together we can do great things." Princess: "Not interested. You don't understand theory. Go away, you geek." Innocent little boy: "The Princess has no clothes. The Princess has no clothes…"

45 Scheduling (a Princess effect?) 4:50-6:30 MW LSA.306: Introduction to Morphology 4:50-6:30 MW LSA.207: Finite-State Methods in Natural Language Processing

