Presentation is loading. Please wait.

Presentation is loading. Please wait.

Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 18, 2005.

Similar presentations


Presentation on theme: "Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 18, 2005."— Presentation transcript:

1 Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 18, 2005

2 Course Outline July 18: Intro to computational morphology XFST Readings Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993. Karttunen and Beesley, “25 Years of Finite-State Morphology” Chapter 1: “Gentle Introduction” (B&K) July 20: Regular expressions More on XFST Readings Chapter 2: “Systematic Introduction” Chapter 3: “The XFST interface”

3 July 25 Concatenative morphotactics Constraining non-local dependencies Readings Chapter 4. “The LEXC Language” Chapter 5. “Flag Diacritics” July 27 Non-concatenative morphotactics Reduplication, interdigitation Readings Chapter 8. “Non-Concatenative Morphotactics”

4 August 1 Realizational morphology Readings Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt) Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003. August 3 Optimality theory Readings Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp.109-161, CSLI Publications, 2003. Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

5 Getting credit for LSA 207 There will be three assignments, given on each Wednesday. The first two are to be turned in by the following Monday, the last one by the following Friday. You will get credit for the course if you solve at least two of the three assignments. The solutions will involve programming in the xfst scripting language. The problems will be easy to solve if you have attended the class. If you have any problems in doing the assignments, Michael Wagner and I will be happy to help you.

6 Textbook Copies will arrive in the Linguistics Department tomorrow afternoon. You can purchase a copy there tomorrow as soon as the books have arrived. Starting Wednesday, books can Be purchased from our TA, Michael Wagner. The price is $35. With the book comes a software CD for Solaris, Linux, MacOSX and Windows operating systems.

7 LSA 207 Web site http://lsa.dlp.mit.edu/Class/207 You can use this username and password to access materials: Username: LSA207 Password: seunsehi207 Your are free to copy, modify and use the slides for whatever purpose provided that you give appropriate credit to the original source. The readings for Wednesday’s class (“Finite-State Constraints”, “25 Years of Finite-State Morphology” and “Gentle Introduction” (Chapter 1 of B&K book) are posted on the web site).

8 Software The software on the Book CD dates back to the Spring of 2003. For an update, point your browser to http://www.stanford.edu/~laurik/.lsa207/ Please read the README file and the License Agreement before downloading the software. The updated software supports UTF-8 encoded Unicode input/output. The Book version supports only Latin-1 (ISO-8859-1). The XFST application will be available locally on some computers (ask Michael). Check out the web site for the Book: http://www.fsmbook.com/

9 Finite-State Methods in NLP Domains of Application Tokenization Sentence breaking Spelling correction Morphology (analysis/generation) Phonological disambiguation (Speech Recognition) Morphological disambiguation (“Tagging”) Pattern matching (“Named Entity Recognition”) Shallow Parsing Types of Finite-State Systems Classical (non-weighted) automata Weighted (associated with weights in a semi-ring) Binary relations (simple transducers ) N-ary relations (multi-tape transducers)

10 Computational morphology Analysis leaves leaf N Plleave N Plleave V Sg3 Generation hang V Past hangedhung

11 Two challenges Morphotactics Words are composed of smaller elements that must be combined in a certain order: piti-less-ness is English piti-ness-less is not English Phonological alternations The shape of an element may vary depending on the context pity is realized as piti in pitilessness die becomes dy in dying

12 Morphology is regular (=rational) Morphology is regular (=rational) The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation. A regular relation consists of ordered pairs of strings. leaf+N+Pl : leaveshang+V+Past : hung Any finite collection of such pairs is a regular relation. Regular relations are closed under operations such as concatenation, iteration, union, and composition. Complex regular relations can be derived from simple relations.

13 Morphology is finite-state A regular relation can be defined using the metalanguage of regular expressions. [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; A regular expression can be compiled into a finite- state transducer that implements the relation computationally.

14 Compilation [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; Regular expression k t a a w o l r +Progr:i  :g +3rdSg:s +Past:e  :d  :n +Base:  Finite-state transducer final state initial state

15 work+3rdSg --> works k:kk:k t:t a:a w:ww:w o:oo:o l:l r:rr:r +Progr:i  :g +3rdSg:s +Past:e  :d  :n +Base:  Generation

16 talked --> talk+Past k:kk:k t:tt:t a:a a:aa:a w:w o:o l:ll:l r:r +Progr:i  :g +3rdSg:s +Past:e :d:d  :n +Base:  Analysis

17 XFST Demo 1 xfst[0]: regex [{talk} | {walk} | {work}] [% +Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; % xfst xfst[0]: start xfst compile a regular expression apply the result xfst[1]: apply up walked walk+Past xfst[1]: apply down talk+SgGen3 talks

18 Lexical transducer veut vouloir +IndP +SG + P3 Finite-state transducer inflected form citation forminflection codes vouloir+IndP+SG+P3 veut Bidirectional: generation or analysis Compact and fast Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

19 How lexical transducers are made Lexicon FST Rule FSTs Compiler fat +Adj r +Comp fat te Lexical Transducer (a single FST) composition Lexicon Regular Expression Rules Regular Expressions Morphotactics Alternations

20 Sequential Model... Surface form Intermediate form Lexical form fst 1fst 2 fst n Ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers Johnson ‘72 Kaplan & Kay ‘81

21 Discovery and Rediscovery C. Douglas Johnson (1972) showed that –phonological rewrite rules are interpreted in a way that makes them less powerful than they appear –rewrite rules can be modeled by finite transducers –for any two finite transducers applied in a sequence there exists an equivalent single transducer (Schützenberger 1961). Johnson’s result was ignored and forgotten, rediscovered by Ronald M. Kaplan and Martin Kay at Xerox around 1980.

22 Application constraint Phonological rewrite rules are not as powerful as they appear because of the constraint that a rule does not apply to its own output. (Johnson 1972, Kaplan&Kay 1980).

23 Sequential application N -> m / _ p p -> m / m _ k a N p a n k a m p a n k a m m a n

24 Sequential application in detail N:m N ? ? 0 2 1 p m p N m p:m ? ? 0 1 m p m k a N p a n k a m p a n k a m m a n 0 0 0 2 0 0 0 0 0 0 1 0 0 0

25 Composition N:m N ? ? 0 3 1 m p N ? m 2 p:m N m N:m k a N p a n k a m m a n 0 0 0 3 0 0 0

26 Parallel Model Set of parallel of two-level rules (constraints) compiled into finite-state automata interpreted as transducers Koskenniemi ‘83 fst 1 fst 2 fst n... Surface form Lexical form

27 Sequential vs. parallel rules compose intersect FST rule 1 rule 2 rule n... Surface form Lexical form Koskenniemi 1983 Intermediate form... Surface form Lexical form rule 1 rule n rule 1 Chomsky&Halle 1968

28 Rewrite rules Epenthesis Harmony Lowering ? u: t y ? A s ? u: t I y ? A s ? u: t u y ? a s ? o: t u y ? a s Yawelmani Vowel Harmony Kisseberth 1969

29 Two-level constraints ? u: t 0 y ? A s ? o: t u y ? a s Underlying representation controls all three alternations. Epenthesis: Insert u or i (underspecification) Harmony: Rounding next to a round V of the same height. Lowering: Long u always realized as long o.

30 Rewrite Rules vs. Constraints Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated. One approach may be more convenient than the other for particular applications.

31 The Big Picture Language or Relation Regular Expression Finite-State Network describesencodes compiles into a a {a}

32 XFST Demo 2 xfst[1]: apply up apply up> dog dog apply up> panther apply up> apply up> END; xfst[0]: define Cat {cat} | {tiger} | {lion}; defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths.... xfst[0]: xfst[0]: set verbose off xfst[0]: define Dog {dog} | {spaniel} | {poodle}; xfst[0]: regex Cat | Dog ; xfst[1]: define Animal xfst[0]:

33 xfst[0]: regex Cat & Dog; xfst[1]: print net Sigma: a c d e g i l n o p r s t Size: 13, Label Map: Default Net: Flags: deterministic, pruned, minimized, epsilon_free,... s0: (no arcs) xfst[1]: xfst[1]: pop xfst[0]: xfst[0]: regex Animal - Dog; xfst[1]: push Cat xfst[2]: test equivalent 1, (0=NO,1=YES) xfst[2]: clear xfst[0]:

34 Compiling networks from words rlcae v e e t h f a Network xfst[0]: read text clear clever ear ever fat father ^D 432 bytes. 10 states, 12 arcs, 6 paths. read text < file read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;

35 Regular Expression Calculus Symbols Simple symbols vs. symbol pairs Special symbols: ANY, EPSILON Common regular expression operators concatenation, union, intersection, negation, composition Xerox operators contains, restriction, replacement

36 Symbols and Labels Single and multicharacter symbols a, b, c, …, +Adj, +SG, ^Fin Special symbols 0 EPSILON ? ANY Symbols vs. symbol pairs In general, no distinction is made between a the language {“a”} a:a the identity relation { } a

37 Common RE Operators concatenation * + iteration | union & intersection* ~ \ - complementation*, minus*.x. : crossproduct.o. composition * = not applicable to regular relations because the result may not be encodable by a finite-state network.

38 Iteration A* zero or more contatenations of A A+ one or more concatenations of A ?*the universal language/ the universal identity relation ? a:A b:B c:C d:D [a:A | b:B | c:C | d:D | … ]*

39 Negation \A any single symbol that is not in A \? the null language ~A any string that is not in A a \a Sigma: a, ? ~a a a ? ? a a ? ?

40 Crossproduct A.x. BThe relation that maps every string in A to every string in B, and vice versa A:BSame as [A.x. B]. b:y c:0a:x a b c.x. x y[a b c] : [x y]{abc}:{xy}

41 Composition A.o. BThe relation C such that if A maps x to y and B maps y to z, C maps x to z. b:B c:Ca:A b ca b:B c:C d:D {abc}.o. [a:A | b:B | c:C | d:D]*

42 Xerox RE Operators $ containment => restriction -> @-> replacement Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

43 Containmenta? ? a $a [?* a ?*]

44 Restriction ? c b b c ? a c a => b _ c “Any a must be preceded by b and followed by c.” ~[~[?* b] a ?*] & ~[?* a ~[c ?*]] Equivalent expression


Download ppt "Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 18, 2005."

Similar presentations


Ads by Google