Presentation is loading. Please wait.

Presentation is loading. Please wait.

Morphology and Finite-State Transducers

Similar presentations


Presentation on theme: "Morphology and Finite-State Transducers"— Presentation transcript:

1 Morphology and Finite-State Transducers
by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin

2 Contents Morphology Morphological Parsing Finite-State Transducers
morphemes, inflection and derivation, allomporphs Morphological Parsing finite-state automata, two-level morphology Finite-State Transducers rules, combination of FSTs, lexicon-free FSTs Human Morphological Processing Exercise

3 Morphology Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. e.g. talo + ssa + ni + kin Two broad classes of morphemes, stems and affixes: the stem is the ”main morpheme” of the word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin

4 Affixes Affixes add ”additional” meanings.
Concatenative morphology uses the following types of affixes: prefixes, e.g. epä- in epä+olennainen suffixes, e.g. –ssa in talo+ssa circumfixes, e.g. German ge- -t in ge+sag+t ([have] said)

5 Non-concatenative Morphology
In non-concatenative morphology the stem morpheme is split up. The following types of affixes are used: infixes, e.g. Californian Jurok, sepolah (field), se+ge+polah (fields) transfixes, e.g. Hebrew, l+a+m+a+d (he studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught) This type of non-concatenative morphology is called templatic or root-and-pattern morphology.

6 Inflection and Derivation
There are two broad classes of ways to form words from morphemes: inflection and derivation.

7 Inflection Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns. talo (singular), talo+t (plural) Inflection is productive. talo, talo+t vs. auto, auto+t vs. metsä, metsä+t The meaning of the resulting word is easily predictable.

8 Derivation Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. e.g. järki, järje+st+ää, järje+st+ö, järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys Not always productive. järki, järje+st+ää vs. metsä, metsä+st+ää vs. talo, talo+st+aa?

9 Allomorphs A group of allomorphs make up one morpheme class. An allomorph is a special variant of a morpheme. e.g. Finnish illative ending: +<vowel_lengthening>n, +h<vowel>n, +seen, +siin  talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en

10 Why Allomorphs? Phonological constraints Morphological paradigms
e.g. vowel harmony, talo+ssa vs. metsä+ssä Morphological paradigms e.g. käsi, käde+n vs. kasi, kasi+n, Swedish leta, leta+de vs. heta, het+te Irregularities e.g. cat, cat+s vs. goose, geese Orthographic constraints, i.e. spelling rules e.g. cat, cat+s vs. city, citi+es

11 Morphological Parsing
Parsing means taking an input and producing some sort of structure for it. Morphological parsing means breaking down a word form into its constituent morphemes. e.g. talossa  talo +ssa Mapping of a word form to its baseform is called stemming. e.g. talossa  talo

12 Finite-State Morphological Parsing
In order to build a parser we need the following: a lexicon containing the stems and affixes, morphotactics, i.e. the model of morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa, a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s  cities.

13 Finite-State Automaton for Inflection of English Verbs
irreg-past-verb-form reg-verb-stem q1 preterite (-ed) q3 q0 past-participle (-ed) reg-verb-stem q2 progressive (-ing) irreg-verb-stem 3-singular (-s)

14 Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’
q1 l k e a t d q3 q0 e s t e d t a s l g k q2 n i i g s n

15 Two-Level Morphology Two-level morphology represents a word as a correspondence between a lexical level, which represents a simple concatenation of morphemes making up a word, and the surface level, which represents the actual spelling of the final word. Lexical s i n g +V +PROG Surface s i n g i n g

16 Finite-State Transducer
A transducer maps between one set of symbols and another; a finite state transducer does this via a finite automaton. Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. ={a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. ={a:a, b:b, a:c, a:, :, ...} In two-level morphology, we call pairs like a:a default pairs, and refer to them by a single symbol a. An FST can be seen as a recognizer, generator, translator or a set relator.

17 Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’
i:u +V: n s i:a g +V: s t +PSTPCP: e +V: +PRET: l k +PRET:e a t :d q3 q0 e s t +PSTPCP:e :d t a s l k +V: :g +PROG:i :n i g n +3SG:s

18 Examples Lexical form Surface form talk +V talk sing +V +3SG sings
test +V +PROG testing talk +V +PRET talked sing +V +PRET sang talk +V +PSTPCP sing +V +PSTPCP sung

19 Useful FST Operations Inversion: Switch input and output labels.
e.g. (T)={a:b, c:d}  (inv(T))={b:a, d:c} Intersection: Only sequences of pairs accepted by both transducerT1 and transducerT2 are accepted by transducer T1^T2. Composition: The output of transducer T1 serves as input to T2. This is marked as T1ºT2 or T2(T1).

20 Spelling Rules and FSTs
Name Description of Rule Example Consonant doubling 1-letter consonant doubled before -ing/-ed beg/begging E deletion Silent e dropped before -ing and –ed make/making E insertion e added after –s, -z, -x, -ch, -sh before -s watch/watches Y replacement -y changes to –ie before -s, and to -i before -ed try/tries K insertion verbs ending with vowel + -c add -k panic/panicked

21 Three levels Add an intermediate level between the lexical and surface levels Lexical k i s s +V +3SG Intermediate k i s s ^ s # Surface k i s s e s

22 FST for the E-insertion Rule
^: q5 other other z, s, x # z, s, x ^: s z, s, x ^: :e q0 q1 q2 q3 s q4 z, x #, other #, other #

23 Combination of FSTs (1) i k s Lexical +V Lexicon-FST i k s #
+3SG Lexical +V Lexicon-FST i k s # Intermediate ^ ... Rule1-FST RuleN-FST i k s e Surface

24 Combination of FSTs (2) i k s Lexical +V Lexicon-FST Intermediate k i
+3SG Lexical +V Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect i k s e Surface

25 Combination of FSTs (3) i k s Lexical +V Compose Lexicon-FST
+3SG Lexical +V Compose Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect i k s e Surface

26 Intersection and Composition
For each state qi in transducer T1 and state qj in transducer T2, create a new state qij. Intersection: For any pair a:b, if T1 transitions from qi to qn, and T2 transitions from qj to qm, T1^T2 transitions from qij to qnm. Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1ºT2 transitions from qij to qnm with the pair a:c.

27 Lexicon-Free FSTs Used in information-retrieval
E.g. the Porter algorithm, which is based on a series of simple cascaded rewrite rules: ATIONAL  ATE (relational  relate) ING   if stem contains vowel (motoring  motor) Errors occur: organization  organ, doing  doe, university  universe

28 Human Morphological Processing (1)
How are multi-morphemic words represented in the minds of human speakers? full-listing hypothesis vs. minimum redundancy hypothesis Experiments: Stanners et al. 1979: a word is recognized faster if it has been seen before (priming): lifting  lift, burned  burn, selective / select, i.e. different representations for inflection and derivation. Marsen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government  govern, department / depart

29 Human Morphological Processing (2)
Speech errors: Speakers mix up the order of words... e.g. if you break it, it’ll drop ... and also attach affixes to the wrong stems: e.g. it’s not only we who have screw looses (for ”screws loose”) e.g. easy enoughly (for ”easily enough”)

30 Excercise (1/3) Your task is to create a finite-state transducer that can analyze the following Finnish word forms: Surface form Lexical form talo talo +NOM taloon talo +ILL talomme talo +NOM +POS1PL taloomme talo +ILL +POS1PL metsä metsä +NOM metsään metsä +ILL metsämme metsä +NOM +POS1PL metsäämme metsä +ILL +POS1PL

31 Exercise (2/3) The morphological tags have the following meaning: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1st person plural. Take a look at Fig 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create three separate finite-state transducers that you finally combine into one: a) Create a transducer that operates between the intermediate and surface level. This transducer handles the vowel lengthening that is necessary for the illative form: talo +ILL  talo|on vs. metsä +ILL  metsä|än.

32 Excercise (3/3) b) Create a transducer that operates between the intermediate and surface level. This transducer handles the deletion of n in front of a possessive ending: talo + mme  talo|mme vs talo|on + mme  talo|o|mme. c) Create a transducer that operates between the lexical and the intermediate level. This transducer maps morphological tags onto endings. d) Combine all the transducers into one. Present your transducers as graphs or tables (cf. Fig in Jurafsky & Martin)


Download ppt "Morphology and Finite-State Transducers"

Similar presentations


Ads by Google