1 Morphological analysis. LING 570, Fei Xia. Week 3: 10/14/09
2 Outline: the task; Porter stemmer; FST morphological analyzer (J&M)
3 The task: To break a word down into its component morphemes and build a structured representation. A morpheme is the minimal meaning-bearing unit in a language. Stem: the morpheme that forms the central meaning unit in a word. Affix: prefix, suffix, infix, circumfix. Prefix: e.g., possible → impossible. Suffix: e.g., walk → walking. Infix: e.g., hingi → humingi (Tagalog). Circumfix: e.g., sagen → gesagt (German).
4 Two slightly different tasks. Stemming: Ex: writing → writ + ing (or write + ing). Lemmatization: Ex1: writing → write +V +Prog. Ex2: books → book +N +Pl. Ex3: writes → write +V +3Per +Sg.
6 Language variation. Isolating languages: e.g., Chinese. Morphologically poor languages: e.g., English. Morphologically complex languages: e.g., Turkish.
7 Ways to combine morphemes to form words. Inflection: stem + grammatical morpheme → same class. Ex: help + -ed → helped. Derivation: stem + grammatical morpheme → different class. Ex: civil + -ization → civilization. Compounding: multiple stems. Ex: cabdriver, doghouse. Cliticization: stem + clitic. Ex: they’ll, she’s (*I don’t know who she’s).
9 Porter stemmer. The algorithm was introduced in 1980 by Martin Porter. Purpose: to improve IR. It removes suffixes only. Ex: civilization → civil. It is rule-based and does not require a lexicon.
10 How does it work? The format of rules: (condition) S1 → S2. Ex: (m>1) ZATION → ε. Rules are partially ordered: Step 1a: -s. Step 1b: -ed, -ing. Steps 2-4: derivational suffixes. Step 5: some final fixes. How well does it work? What are the main problems with this kind of approach?
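A minimal sketch of how a rule of the form (condition) S1 → S2 can be applied. It assumes Porter's usual definition of the measure m as the number of vowel-consonant sequences in the stem, written [C](VC)^m[V]. The function names (`is_consonant`, `measure`, `apply_rule`) are hypothetical, and the two rules in the usage lines, (m>0) IZATION → IZE and (m>1) IZE → ε, are given as they are commonly listed for steps 2 and 4 of the published algorithm. This is not a full stemmer.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] counts as a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC sequences in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m

def apply_rule(word, min_measure, s1, s2):
    """Apply the rule (m > min_measure) S1 -> S2; return word unchanged if it does not fire."""
    if s1 and word.endswith(s1):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_measure:
            return stem + s2
    return word

step2 = apply_rule("civilization", 0, "ization", "ize")  # (m>0) IZATION -> IZE
step4 = apply_rule(step2, 1, "ize", "")                  # (m>1) IZE -> (empty)
print(step2, step4)
```

The condition on m is what keeps short words intact: `apply_rule("size", 1, "ize", "")` leaves `size` unchanged because the remaining stem `s` has m = 0.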
13 English morphology. For now, we will focus on inflection only. Affixes: prefixes and suffixes; no infixes or circumfixes. Inflectional: Nouns: -s. Verbs: -s, -ing, -ed (past), -ed (past participle). Adjectives: -er, -est. Derivational: Ex: V + suffix → N: computerize + -ation → computerization; kill + -er → killer. Compounds: pickup, database, heartbroken, etc. Cliticization: ’m, ’ve, ’re, etc.
14 Three components. Lexicon: the list of stems and affixes, with associated features. Ex: book: N; -s: +PL. Morphotactics: Ex: +PL follows a noun. Orthographic rules (spelling rules): to handle exceptions that can be dealt with by rules. Ex1: y → ie: fly + -s → flies. Ex2: ε → e: fox + -s → foxes. Ex2’: ε → e / x^ _ s#
15 An example. Task: foxes → fox +N +PL. Surface: foxes. Intermediate: fox^s#. Lexical: fox +N +PL. Orthographic rules relate the surface and intermediate levels; lexicon + morphotactics relate the intermediate and lexical levels.
17 The lexicon (in general). The role of the lexicon is to associate linguistic information with words of the language. Many words are ambiguous, with more than one entry in the lexicon. The information associated with a word in a lexicon is called a lexical entry.
18 The lexicon (cont). fly: v, +base. fly: n, +sg. fox: n, +sg. fly: (NP, V). fly: (NP, V, NP). Should the following be included in the lexicon? flies: v, +sg +3rd. flies: n, +pl. foxes: n, +pl. flew: v, +past.
19 The lexicon for English noun inflection. fox: n, +sg, +reg → reg-noun. goose: n, +sg, -reg → irreg-sg-noun. geese: n, +pl, -reg → irreg-pl-noun.
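A minimal sketch of how lexicon classes like these can drive the morphotactics for noun inflection. It assumes intermediate forms written with `^` as the morpheme boundary and `#` as the word boundary, and lexical forms written like `fox +N +PL`. The dictionary layout and the function name `analyze` are hypothetical, the `cat` entry is borrowed from the "cats" example, and a real analyzer would encode this as a finite-state transducer rather than as if-statements.

```python
# Lexicon: surface stem -> (noun class, lemma). Hypothetical layout.
LEXICON = {
    "fox": ("reg-noun", "fox"),
    "cat": ("reg-noun", "cat"),
    "goose": ("irreg-sg-noun", "goose"),
    "geese": ("irreg-pl-noun", "goose"),
}

def analyze(intermediate):
    """Map an intermediate form such as 'fox^s#' to a lexical form such as
    'fox +N +PL'. Return None if the lexicon or morphotactics reject it."""
    if not intermediate.endswith("#"):
        return None
    body = intermediate[:-1]                   # drop the word boundary
    stem, boundary, suffix = body.partition("^")
    entry = LEXICON.get(stem)
    if entry is None:
        return None                            # unknown stem
    noun_class, lemma = entry
    if not boundary:                           # bare stem, no suffix
        number = "+PL" if noun_class == "irreg-pl-noun" else "+SG"
        return f"{lemma} +N {number}"
    # Morphotactics: plural -s may only follow a regular noun.
    if suffix == "s" and noun_class == "reg-noun":
        return f"{lemma} +N +PL"
    return None

for form in ["fox^s#", "fox#", "geese#", "goose^s#"]:
    print(form, "->", analyze(form))
```

Listing `geese` with its own class is what lets the analyzer reject `goose^s#` while still returning a plural reading for `geese#`.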
25 So far. Ex: cats. Have the entry “cat: reg-noun” in the lexicon. A path: q0 → q1 → q2. Result: cats → cat s → cat^s#. Ex: civilize. Have the entry “civil: noun1” in the lexicon. Result: civilize → civil^ize#. Remaining issues: cat^s# → cat +N +PL; spelling changes: foxes → fox^s#.
33 Orthographic rules. E insertion: fox → foxes. 1st try: ε → e. “e” is added after -s, -x, -z, etc. before -s. 2nd try: ε → e / (s|x|z) _ s. Problem? Ex: glass → glases. 3rd try: ε → e / (s|x|z)^ _ s#
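The difference between the second and third tries can be sketched with regular expressions. This is only an approximation of the rewrite rules, not a transducer: `second_try` and `third_try` are hypothetical names, and `third_try` additionally strips the `^` and `#` symbols so that the result reads as a plain surface word.

```python
import re

def second_try(form):
    """epsilon -> e / (s|x|z) _ s, with no morpheme boundary in the context."""
    return re.sub(r"([sxz])(?=s)", r"\1e", form)

def third_try(intermediate):
    """epsilon -> e / (s|x|z)^ _ s#, applied to an intermediate form such as
    'fox^s#'; the boundary symbols are removed afterwards."""
    with_e = re.sub(r"([sxz])\^(?=s#)", r"\1^e", intermediate)
    return with_e.replace("^", "").replace("#", "")

print(second_try("glass"))    # the rule fires inside the stem
print(third_try("glass#"))    # no boundary before the final s, so nothing is inserted
print(third_try("glass^s#"))
print(third_try("fox^s#"))
print(third_try("cat^s#"))
```

Requiring the morpheme boundary `^` and the word boundary `#` in the context restricts insertion to the spot where the plural suffix attaches, which is why the stem-internal `ss` in glass is left alone.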
34 Rewrite rules. Format: α → β / γ _ δ (rewrite α as β when it occurs between γ and δ). Rewrite rules can be optional or obligatory. Rewrite rules can be ordered to reduce ambiguity. Under some conditions, these rewrite rules are equivalent to FSTs. One such condition: α is not allowed to match something introduced in the previous rule application.
37 What would the FST accept? (f, f), (fox, fox), (fox#, fox#), (fox^z#, foxz#), (fox^s#, foxes#). It will reject: (fox^s, foxs).
38 Combining lexicon and rules [figure: the lexical, intermediate, and surface levels]
39 Summary of FST morphological analyzer. Three components: lexicon, morphotactics, orthographic rules. Represent the morphotactics as an FST and expand it with the lexicon entries. Represent the orthographic rules as FSTs. Combine all FSTs with operations such as composition. Given the three components, creating and combining the FSTs can be done automatically.
43 Hw5: Q5. Compare Porter’s stemmer with an FST morphological analyzer.
44 Remaining issues. Creating the three components by hand is time consuming. → unsupervised morphological induction. How would a morphological analyzer help a particular application (e.g., IR, MT)?
45 How does the induction work? Start from a simple list of words and their frequencies. Ex: play 27, played 100, walked 40. Try to find the most efficient way to encode the word list. Ex: minimum description length (MDL).
46 General approach. Initialize: start from an initial set of “words” and find the description length of this set. Repeat until convergence: generate a candidate set of new “words” that will each enable a reduction in the description length. Ex: walk, walked, play, played: four words → two words (walk and play) and a suffix (-ed).
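The walk/play example can be made concrete with a rough description-length calculation. The sketch below rests on crude assumptions that are not from the slides: every letter (plus an end-of-entry marker) costs log2(27) bits, every pointer into the morpheme list costs log2(list size) bits, and word frequencies are ignored. The function name `description_length` and the data layout are hypothetical; real MDL-based induction systems use more careful codes.

```python
import math

# Assumed flat code: 26 letters plus one end-of-entry marker.
LETTER_BITS = math.log2(27)

def description_length(analyses):
    """analyses maps each word to a tuple of morphemes.
    Cost = bits to spell out the morpheme list
         + bits for the pointers that rebuild every word from it."""
    morphemes = {m for parts in analyses.values() for m in parts}
    lexicon_bits = sum(len(m) + 1 for m in morphemes) * LETTER_BITS
    pointer_bits = math.log2(len(morphemes)) if len(morphemes) > 1 else 0.0
    corpus_bits = sum(len(parts) for parts in analyses.values()) * pointer_bits
    return lexicon_bits + corpus_bits

# Hypothesis 1: four unanalyzed words.
whole_words = {w: (w,) for w in ["walk", "walked", "play", "played"]}

# Hypothesis 2: two stems and a suffix -ed.
with_suffix = {
    "walk": ("walk",),
    "walked": ("walk", "ed"),
    "play": ("play",),
    "played": ("play", "ed"),
}

print(description_length(whole_words))
print(description_length(with_suffix))
```

Under these assumptions the stem-plus-suffix hypothesis is cheaper, since spelling out three short morphemes costs less than spelling out four full words even after paying for the extra pointers.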