Morphology and Finite-State Transducers by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin.


1 Morphology and Finite-State Transducers by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin

2 Contents Morphology: morphemes, inflection and derivation, allomorphs. Morphological Parsing: finite-state automata, two-level morphology. Finite-State Transducers: rules, combination of FSTs, lexicon-free FSTs. Human Morphological Processing. Exercise.

3 Morphology Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. e.g. talo + ssa + ni + kin Two broad classes of morphemes, stems and affixes: the stem is the ”main morpheme” of the word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin

4 Affixes Affixes add ”additional” meanings. Concatenative morphology uses the following types of affixes: prefixes, e.g. epä- in epä+olennainen suffixes, e.g. –ssa in talo+ssa circumfixes, e.g. German ge- -t in ge+sag+t ([have] said)

5 Non-concatenative Morphology In non-concatenative morphology the stem morpheme is split up. The following types of affixes are used: infixes, e.g. Yurok (California), sepolah (field), se+ge+polah (fields) transfixes, e.g. Hebrew, l+a+m+a+d (he studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught) This type of non-concatenative morphology is called templatic or root-and-pattern morphology.

6 Inflection and Derivation There are two broad classes of ways to form words from morphemes: inflection and derivation.

7 Inflection Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns. talo (singular), talo+t (plural) Inflection is productive. talo, talo+t vs. auto, auto+t vs. metsä, metsä+t The meaning of the resulting word is easily predictable.

8 Derivation Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. e.g. järki, järje+st+ää, järje+st+ö, järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys Not always productive. järki, järje+st+ää vs. metsä, metsä+st+ää vs. talo, talo+st+aa?

9 Allomorphs A group of allomorphs makes up one morpheme class. An allomorph is a special variant of a morpheme. e.g. Finnish illative ending: +Vn, +hVn, +seen, +siin (V stands for a vowel) → talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en

10 Why Allomorphs? Phonological constraints e.g. vowel harmony, talo+ssa vs. metsä+ssä Morphological paradigms e.g. käsi, käde+n vs. kasi, kasi+n, Swedish leta, leta+de vs. heta, het+te Irregularities e.g. cat, cat+s vs. goose, geese Orthographic constraints, i.e. spelling rules e.g. cat, cat+s vs. city, citi+es
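The vowel-harmony case (talo+ssa vs. metsä+ssä) can be illustrated with a minimal sketch. It assumes the usual simplification that a stem containing a back vowel (a, o, u) takes the back-vowel ending and every other stem takes the front-vowel ending; compounds and loanwords are ignored, and the function name `inessive` is made up for this example.

```python
BACK_VOWELS = set("aou")

def inessive(stem: str) -> str:
    """Attach the inessive ending, picking the allomorph by vowel harmony."""
    # Simplifying assumption: any back vowel in the stem selects -ssa,
    # otherwise the front-vowel allomorph -ssä is used.
    has_back_vowel = any(ch in BACK_VOWELS for ch in stem.lower())
    return stem + ("ssa" if has_back_vowel else "ssä")

print(inessive("talo"))   # talossa
print(inessive("metsä"))  # metsässä
```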

11 Morphological Parsing Parsing means taking an input and producing some sort of structure for it. Morphological parsing means breaking down a word form into its constituent morphemes. e.g. talossa → talo +ssa Mapping a word form to its base form is called stemming. e.g. talossa → talo

12 Finite-State Morphological Parsing In order to build a parser we need the following: a lexicon containing the stems and affixes, morphotactics, i.e. the model of morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa, a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s → cities.

13 Finite-State Automaton for Inflection of English Verbs [Figure: automaton with states q0, q1, q2, q3 and arcs labeled reg-verb-stem, irreg-verb-stem, irreg-past-verb-form, preterite (-ed), past-participle (-ed), 3-singular (-s), progressive (-ing)]
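The morphotactics of such an automaton can be written down as a transition table. The sketch below is an assumption about how the arcs connect, reconstructed from the arc labels and the corresponding figure in Jurafsky & Martin: regular stems may take all four endings, irregular stems only -s and -ing, and irregular past forms take no ending. The names `ARCS`, `FINAL` and `accepts` are illustrative.

```python
# (state, morpheme class) -> next state; layout assumed, not taken from the slide
ARCS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-verb-stem"): "q2",
    ("q0", "irreg-past-verb-form"): "q3",
    ("q1", "preterite"): "q3",
    ("q1", "past-participle"): "q3",
    ("q1", "progressive"): "q3",
    ("q1", "3-singular"): "q3",
    ("q2", "progressive"): "q3",
    ("q2", "3-singular"): "q3",
}
FINAL = {"q1", "q2", "q3"}  # a bare stem is already a valid word form

def accepts(morpheme_classes):
    """Return True if the sequence of morpheme classes is a legal verb form."""
    state = "q0"
    for cls in morpheme_classes:
        state = ARCS.get((state, cls))
        if state is None:  # no arc for this class: reject
            return False
    return state in FINAL

print(accepts(["reg-verb-stem", "preterite"]))    # True  (talk + ed)
print(accepts(["irreg-verb-stem", "preterite"]))  # False (*sing + ed)
```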

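14 Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’ [Figure: the automaton with states q0 to q3 expanded letter by letter for the stems talk, test and sing, the irregular forms sang and sung, and the endings -s, -ed and -ing]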

15 Two-Level Morphology Two-level morphology represents a word as a correspondence between a lexical level, which represents a simple concatenation of morphemes making up a word, and the surface level, which represents the actual spelling of the final word. Lexical: s i n g +V +PROG Surface: s i n g i n g

16 Finite-State Transducer A transducer maps between one set of symbols and another; a finite-state transducer does this via a finite automaton. Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. Σ = {a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. Σ = {a:a, b:b, a:c, a:ε, ε:ε, ...} In two-level morphology, we call pairs like a:a default pairs, and refer to them by a single symbol a. An FST can be seen as a recognizer, generator, translator or a set relator.

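17 Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’ [Figure: transducer from q0 to q3 that spells out talk, test and sing letter by letter, with the vowel pairs i:a and i:u on the paths for sang and sung; tag arcs +V:ε, +3SG:s, +PROG:i ε:n ε:g, +PRET:e ε:d, +PSTPCP:e ε:d, and +PRET:ε, +PSTPCP:ε on the irregular paths]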

18 Examples
Lexical form        Surface form
talk +V             talk
sing +V +3SG        sings
test +V +PROG       testing
talk +V +PRET       talked
sing +V +PRET       sang
talk +V +PSTPCP     talked
sing +V +PSTPCP     sung
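The lexical/surface pairs listed above can be reproduced with a small nondeterministic transducer. This is a sketch under assumptions, not the transducer from the slides: arcs are tuples `(source, input, output, target)`, each ending is emitted as a single multi-letter output instead of a chain of ε:x arcs, and the state names and helper functions (`path`, `tokenize`, `generate`) are invented for the example.

```python
EPS = ""  # empty output, standing in for ε

def path(start, pairs, end, prefix):
    """Build a chain of arcs for the in:out pairs, from start to end."""
    arcs, src = [], start
    for i, (a, b) in enumerate(pairs):
        dst = end if i == len(pairs) - 1 else f"{prefix}{i}"
        arcs.append((src, a, b, dst))
        src = dst
    return arcs

ARCS = []
for stem in ("talk", "test"):  # regular stems end in state "reg"
    ARCS += path("q0", [(c, c) for c in stem] + [("+V", EPS)], "reg", stem)
# sing: the plain stem ends in "irr"; sang and sung use i:a and i:u
ARCS += path("q0", [(c, c) for c in "sing"] + [("+V", EPS)], "irr", "sing")
ARCS += path("q0", [("s", "s"), ("i", "a"), ("n", "n"), ("g", "g"),
                    ("+V", EPS), ("+PRET", EPS)], "end", "sang")
ARCS += path("q0", [("s", "s"), ("i", "u"), ("n", "n"), ("g", "g"),
                    ("+V", EPS), ("+PSTPCP", EPS)], "end", "sung")
ARCS += [("reg", "+PRET", "ed", "end"), ("reg", "+PSTPCP", "ed", "end")]
for src in ("reg", "irr"):
    ARCS += [(src, "+3SG", "s", "end"), (src, "+PROG", "ing", "end")]
FINAL = {"reg", "irr", "end"}

def tokenize(lexical):
    """'sing +V +PRET' -> ['s', 'i', 'n', 'g', '+V', '+PRET']"""
    stem, *tags = lexical.split()
    return list(stem) + tags

def generate(tokens):
    """Return every surface string paired with the lexical token sequence."""
    results = set()
    def walk(state, i, out):
        if i == len(tokens):
            if state in FINAL:
                results.add(out)
            return
        for src, a, b, dst in ARCS:
            if src == state and a == tokens[i]:
                walk(dst, i + 1, out + b)
    walk("q0", 0, "")
    return results

print(generate(tokenize("sing +V +PRET")))  # {'sang'}
```

The search tries all three paths that start with s, which is why no lookahead is needed to decide between sing, sang and sung.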

19 Useful FST Operations Inversion: Switch input and output labels, e.g. Σ(T) = {a:b, c:d} ⇒ Σ(inv(T)) = {b:a, d:c}. Intersection: Only sequences of pairs accepted by both transducer T1 and transducer T2 are accepted by transducer T1^T2. Composition: The output of transducer T1 serves as input to T2. This is written T1∘T2 or T2(T1).
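Seen as set relators, the three operations have a direct set-theoretic reading. The sketch below assumes that a transducer is approximated by a finite set of (input, output) pairs, which is not how a real FST toolkit stores it; the function names are illustrative.

```python
def invert(relation):
    """Swap input and output of every pair."""
    return {(out, inp) for inp, out in relation}

def intersect(r1, r2):
    """Keep only the pairs that both relations accept."""
    return r1 & r2

def compose(r1, r2):
    """Feed every output of r1 into r2."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}

T = {("a", "b"), ("c", "d")}
print(invert(T) == {("b", "a"), ("d", "c")})                   # True
print(compose({("a", "b")}, {("b", "c")}) == {("a", "c")})     # True
```

Inversion is what turns a generator (lexical to surface) into a parser (surface to lexical) without any extra work.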

20 Spelling Rules and FSTs
Name                 Description of Rule                                       Example
Consonant doubling   1-letter consonant doubled before -ing/-ed                beg/begging
E deletion           Silent e dropped before -ing and -ed                      make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s              watch/watches
Y replacement        -y changes to -ie before -s, and to -i before -ed         try/tries
K insertion          verbs ending with vowel + -c add -k                       panic/panicked

21 Three levels Add an intermediate level between the lexical and surface levels. Lexical: k i s s +V +3SG Intermediate: k i s s ^ s # Surface: k i s s e s

22 FST for the E-insertion Rule [Figure: transducer with states q0 to q5; arc labels ^:ε, ε:e, s, z, x, # and other]
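The effect of the E-insertion rule on an intermediate form can be sketched with a regular expression. This is a stand-in for the transducer, not a transcription of it: it assumes the rule "insert e between a morpheme boundary ^ preceded by s, x or z and a word-final s", and it does not cover the -ch and -sh contexts. The function name `e_insertion` is hypothetical.

```python
import re

def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'kiss^s#' to its surface spelling."""
    # The boundary ^ becomes e when it follows s, x or z and precedes 's#'.
    with_e = re.sub(r"(?<=[sxz])\^(?=s#)", "e", intermediate)
    # All remaining boundary symbols surface as nothing.
    return with_e.replace("^", "").replace("#", "")

print(e_insertion("kiss^s#"))  # kisses
print(e_insertion("cat^s#"))   # cats
```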

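23 Combination of FSTs (1) Lexical: k i s s +V +3SG Intermediate: k i s s ^ s # Surface: k i s s e s [Figure: the Lexicon-FST maps between the lexical and intermediate levels; Rule1-FST ... RuleN-FST map between the intermediate and surface levels]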

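24 Combination of FSTs (2) Lexical: k i s s +V +3SG Intermediate: k i s s ^ s # Surface: k i s s e s [Figure: the Lexicon-FST maps between the lexical and intermediate levels; Rule1-FST ... RuleN-FST are intersected into a single transducer between the intermediate and surface levels]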

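25 Combination of FSTs (3) Lexical: k i s s +V +3SG Intermediate: k i s s ^ s # Surface: k i s s e s [Figure: Rule1-FST ... RuleN-FST are intersected, and the result is composed with the Lexicon-FST into a single transducer from the lexical to the surface level]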

26 Intersection and Composition For each state qi in transducer T1 and state qj in transducer T2, create a new state qij. Intersection: If T1 transitions from qi to qn and T2 transitions from qj to qm on the same pair a:b, then T1^T2 transitions from qij to qnm on a:b. Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1∘T2 transitions from qij to qnm with the pair a:c.
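The two constructions translate almost word for word into code. The sketch below assumes arcs stored as tuples `(source, input, output, target)` and leaves out start states, final states and any special treatment of ε, all of which a complete implementation would need.

```python
from itertools import product

def intersect(arcs1, arcs2):
    """State (qi, qj) moves to (qn, qm) when both machines read the same pair a:b."""
    return [((p, q), a, b, (p2, q2))
            for (p, a, b, p2), (q, c, d, q2) in product(arcs1, arcs2)
            if (a, b) == (c, d)]

def compose(arcs1, arcs2):
    """An arc a:b in T1 followed by b:c in T2 yields a:c in the composed machine."""
    return [((p, q), a, d, (p2, q2))
            for (p, a, b, p2), (q, c, d, q2) in product(arcs1, arcs2)
            if b == c]

T1 = [(0, "a", "b", 1)]
T2 = [(0, "b", "c", 1)]
print(compose(T1, T2))    # [((0, 0), 'a', 'c', (1, 1))]
print(intersect(T1, T2))  # []
```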

27 Lexicon-Free FSTs Used in information retrieval, e.g. the Porter algorithm, which is based on a series of simple cascaded rewrite rules: ATIONAL → ATE (relational → relate) ING → ε if stem contains vowel (motoring → motor) Errors occur: organization → organ, doing → doe, university → universe
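The two rewrite rules quoted above can be sketched as plain string operations. This covers only those two rules as stated on the slide, not the Porter algorithm itself, which has many more rules and extra conditions on the stem; the function name `strip_suffixes` and the vowel set are assumptions made for the example.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def strip_suffixes(word: str) -> str:
    """Apply the two sample rewrite rules; no lexicon is consulted."""
    # ATIONAL -> ATE
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"
    # ING -> ε, but only if what remains contains a vowel
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word

print(strip_suffixes("relational"))  # relate
print(strip_suffixes("motoring"))    # motor
print(strip_suffixes("sing"))        # sing (the remainder 's' has no vowel)
```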

28 Human Morphological Processing (1) How are multi-morphemic words represented in the minds of human speakers? full-listing hypothesis vs. minimum redundancy hypothesis Experiments: Stanners et al. 1979: a word is recognized faster if it has been seen before (priming): lifting → lift, burned → burn, selective ↛ select, i.e. different representations for inflection and derivation. Marslen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government → govern, department ↛ depart

29 Human Morphological Processing (2) Speech errors: Speakers mix up the order of words... e.g. if you break it, it’ll drop... and also attach affixes to the wrong stems: e.g. it’s not only we who have screw looses (for ”screws loose”) e.g. easy enoughly (for ”easily enough”)

30 Exercise (1/3) Your task is to create a finite-state transducer that can analyze the following Finnish word forms:
Surface form    Lexical form
talo            talo +NOM
taloon          talo +ILL
talomme         talo +NOM +POS1PL
taloomme        talo +ILL +POS1PL
metsä           metsä +NOM
metsään         metsä +ILL
metsämme        metsä +NOM +POS1PL
metsäämme       metsä +ILL +POS1PL

31 Exercise (2/3) The morphological tags have the following meaning: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1st person plural. Take a look at Fig. 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create three separate finite-state transducers that you finally combine into one: a) Create a transducer that operates between the intermediate and surface level. This transducer handles the vowel lengthening that is necessary for the illative form: talo +ILL → talo|on vs. metsä +ILL → metsä|än.

32 Exercise (3/3) b) Create a transducer that operates between the intermediate and surface level. This transducer handles the deletion of n in front of a possessive ending: talo + mme → talo|mme vs. talo|on + mme → talo|o|mme. c) Create a transducer that operates between the lexical and the intermediate level. This transducer maps morphological tags onto endings. d) Combine all the transducers into one. Present your transducers as graphs or tables (cf. Fig. 3.15 in Jurafsky & Martin).

