
1 Morphological Parsing (CS 4705)

2 Parsing
Taking a surface input and analyzing its components and underlying structure
Morphological parsing: taking a word or string of words as input and identifying the stems and affixes (and possibly interpreting these)
–E.g.:
  goose → goose +N +SG or goose +V
  geese → goose +N +PL
  gooses → goose +V +3SG
–Bracketing: indecipherable → [in [ [de [cipher] ] able] ]

3 Why ‘parse’ words?
To find stems
–Simple key to word similarity
–Yellow, yellowish, yellows, yellowed, yellowing…
To find affixes and the information they convey
–‘ed’ signals a verb
–‘ish’ an adjective
–‘s’?
Morphological parsing provides information about a word’s semantics and the syntactic role it plays in a sentence

4 Some Practical Applications
For spell-checking
–Is muncheble a legal word?
To identify a word’s part of speech (POS)
–For sentence parsing, for machine translation, …
To identify a word’s stem
–For information retrieval
Why not just list all word forms in a lexicon?

5 What do we need to build a morphological parser?
Lexicon: list of stems and affixes (with corresponding parts of speech)
Morphotactics of the language: model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs
–in → il in context of l (in- + legal → illegal)
Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes

6 Using FSAs to Represent English Plural Nouns
English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-sg-n, irreg-pl-n]
Inputs: cats, geese, goose
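The plural-noun FSA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sub-lexicon contents, the choice of q1 and q2 as final states, and the naive "strip a trailing s" segmentation are not given on the slide, and the helper names `arc_label` and `accepts` are hypothetical.

```python
# Hypothetical sub-lexicons; the slide only names the word classes.
REG_N = {"cat", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

# (state, arc label) -> next state. The topology is an assumption
# reconstructed from the arc labels on the slide.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}
FINAL = {"q1", "q2"}  # assumed: a bare regular noun is also accepted


def arc_label(morpheme):
    """Map a morpheme to the arc label it can travel on, or None."""
    if morpheme == "-s":
        return "plural"
    if morpheme in REG_N:
        return "reg-n"
    if morpheme in IRREG_SG_N:
        return "irreg-sg-n"
    if morpheme in IRREG_PL_N:
        return "irreg-pl-n"
    return None


def accepts(word):
    """True if some segmentation of `word` drives the FSA to a final state."""
    candidates = [(word,)]
    if word.endswith("s"):
        candidates.append((word[:-1], "-s"))  # naive split, no spelling rules
    for morphemes in candidates:
        state = "q0"
        for m in morphemes:
            state = TRANSITIONS.get((state, arc_label(m)))
            if state is None:
                break
        if state in FINAL:
            return True
    return False


for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts(w))
```

With these assumptions the three slide inputs are accepted and gooses is rejected as a noun, because no plural arc leaves the state reached by an irregular singular.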

7 Derivational morphology: adjective fragment
[Figure: FSA with states q0 to q5 and arcs labeled un-, ε, adj-root 1, adj-root 2, -er/-ly/-est (after adj-root 1), -er/-est (after adj-root 2)]
Adj-root 1: clear, happi, real (clearly)
Adj-root 2: big, red (*bigly)

8 FSAs can also represent the Lexicon
Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj-root 2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives
[Figure: letter-level FSA with states q0 to q6 and arcs labeled ε, r, e, d, b, i, g]
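One way to picture the letter-level expansion is to build the sub-lexicon FSA as a trie, with one transition per letter. The sketch below does this for adj-root 2 = {big, red}; the function names and the integer state numbering are hypothetical and do not correspond to the state names in the figure.

```python
def build_letter_fsa(words):
    """Expand a sub-lexicon into a letter-by-letter FSA (a trie)."""
    transitions = {}  # (state, letter) -> next state
    finals = set()
    next_state = 1    # state 0 is the start state
    for word in words:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)  # the end of a stem is an accepting state
    return transitions, finals


def recognize(fsa, word):
    """Follow one transition per letter; accept only in a final state."""
    transitions, finals = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


ADJ_ROOT_2 = build_letter_fsa(["big", "red"])
print(recognize(ADJ_ROOT_2, "red"))  # True
print(recognize(ADJ_ROOT_2, "bed"))  # False
```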

9 But…
Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
–Adding new items to the lexicon means recomputing the whole FSA
–Non-determinism
–Some stems require modification when they acquire affixes
FSAs tell us whether a word is in the language or not, but usually we want to know more:
–What is the stem?
–What are the affixes and what sort are they?
–We used this information to recognize the word: why can’t we store it?

10 Parsing with Finite State Transducers
cats → cat +N +PL (a plural NP)
Kimmo Koskenniemi’s two-level morphology
–Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
–Morphological parsing: find the mapping (transduction) between lexical and surface levels
[Figure: two tapes, lexical: cat+N+PL; surface: cats]

11 Finite State Transducers can represent this mapping
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets
In general, FSTs can be used for
–Translators (Hello:Ciao)
–Parser/generators (Hello:How may I help you?)
–As well as Kimmo-style morphological parsing

12 An FST is a 5-tuple consisting of
–Q: set of states {q0,q1,q2,q3,q4}
–Σ: an alphabet of complex symbols, each an i/o pair i:o s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), so Σ ⊆ I × O
–q0: a start state
–F: a set of final states in Q {q4}
–δ(q, i:o): a transition function mapping Q × Σ to Q
–Quizzical Cow → Emphatic Sheep
[Figure: FST with states q0 to q4 and arcs labeled m:b, o:a, ?:!]
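The 5-tuple can be written down almost literally in code. The sketch below assumes a simple linear chain q0 → q1 → q2 → q3 → q4 for the Quizzical Cow to Emphatic Sheep example, so that it maps "moo?" to "baa!"; the exact arcs of the original figure are not recoverable from the transcript, and the `FST` class and `transduce` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FST:
    states: set    # Q
    alphabet: set  # Sigma: set of (input, output) pairs
    start: str     # q0
    finals: set    # F
    delta: dict    # (state, (input, output)) -> next state


def transduce(fst, text):
    """Read `text` on the input side; return the output string or None."""
    state, out = fst.start, []
    for ch in text:
        matches = [
            (o, nxt)
            for (q, (i, o)), nxt in fst.delta.items()
            if q == state and i == ch
        ]
        if not matches:
            return None  # no arc for this input symbol
        o, state = matches[0]  # this example is deterministic on its input
        out.append(o)
    return "".join(out) if state in fst.finals else None


# Assumed topology: one arc per symbol of "moo?".
COW_TO_SHEEP = FST(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={("m", "b"), ("o", "a"), ("?", "!")},
    start="q0",
    finals={"q4"},
    delta={
        ("q0", ("m", "b")): "q1",
        ("q1", ("o", "a")): "q2",
        ("q2", ("o", "a")): "q3",
        ("q3", ("?", "!")): "q4",
    },
)

print(transduce(COW_TO_SHEEP, "moo?"))  # baa!
```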

13 FST for a 2-level Lexicon
E.g.
Reg-n: c a t
Irreg-pl-n: g o:e o:e s e
Irreg-sg-n: g o o s e
[Figure: FST with states q0 to q7 and arcs labeled c:c, a:a, t:t and g, o:e, s, e]

14 FST for English Nominal Inflection
[Figure: FST with states q0 to q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:ε, +SG:#, +PL:#, +PL:^s#. Example tapes: lexical c a t +N +PL, surface c a t s]
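The mapping this transducer computes, from the lexical tape to the intermediate tape, can be approximated with table lookups. This is a function-level sketch rather than an explicit state machine: the stem table and the `generate` helper are hypothetical, and the only parts taken from the slides are the arc labels (+N:ε, +SG:#, +PL:#, +PL:^s#) and the goose/geese lexicon entry.

```python
# Hypothetical two-level lexicon: lexical stem -> [(noun class, surface stem)].
STEMS = {
    "cat": [("reg-n", "cat")],
    "fox": [("reg-n", "fox")],
    "goose": [("irreg-n-sg", "goose"), ("irreg-n-pl", "geese")],
}

# Output of the tag arcs for each noun class. "+N" always maps to
# the empty string, so only the number tag contributes symbols.
TAG_ARCS = {
    "reg-n": {("+N", "+SG"): "#", ("+N", "+PL"): "^s#"},
    "irreg-n-sg": {("+N", "+SG"): "#"},
    "irreg-n-pl": {("+N", "+PL"): "#"},
}


def generate(lexical):
    """Map a lexical form such as 'cat+N+PL' to its intermediate forms."""
    stem, *tags = lexical.split("+")
    tags = tuple("+" + t for t in tags)
    results = []
    for noun_class, surface_stem in STEMS.get(stem, []):
        suffix = TAG_ARCS[noun_class].get(tags)
        if suffix is not None:
            results.append(surface_stem + suffix)
    return results


print(generate("cat+N+PL"))    # ['cat^s#']
print(generate("goose+N+PL"))  # ['geese#']
```

The ^ and # symbols are morpheme and word boundaries on the intermediate tape; a separate spelling-rule step is expected to remove them.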

15 Useful Operations on Transducers
Cascade: running 2+ FSTs in sequence
Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
Composition: apply FST2 transition function to result of FST1 transition function
Inversion: exchanging the input and output alphabets (recognize and generate with same FST)
cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
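Composition and inversion are easiest to see at the level of the relations that transducers define. The sketch below represents a (finite) transduction as a plain set of (input, output) string pairs, which is an illustrative simplification and not how a toolkit stores an FST; the two example relations are hypothetical, built from the cat and fox forms used elsewhere in these slides.

```python
def invert(relation):
    """Inversion: swap the input and output side of every pair."""
    return {(o, i) for (i, o) in relation}


def compose(r1, r2):
    """Composition: feed each output of r1 into r2 as input."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


# Lexical -> intermediate, and intermediate -> surface.
LEX_TO_INTERMEDIATE = {
    ("cat+N+SG", "cat#"),
    ("cat+N+PL", "cat^s#"),
    ("fox+N+PL", "fox^s#"),
}
INTERMEDIATE_TO_SURFACE = {
    ("cat#", "cat"),
    ("cat^s#", "cats"),
    ("fox^s#", "foxes"),
}

# Running the two in sequence (a cascade) is equivalent to one
# composed relation from lexical forms to surface forms.
GENERATOR = compose(LEX_TO_INTERMEDIATE, INTERMEDIATE_TO_SURFACE)

# Inverting the generator gives a parser: surface -> lexical.
PARSER = invert(GENERATOR)

print(sorted(GENERATOR))
print(dict(PARSER)["foxes"])  # fox+N+PL
```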

16 Orthographic Rules and FSTs
Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
Lexical: fox+N+PL
Intermediate: fox^s#
Surface: foxes
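As a rough illustration of the intermediate-to-surface step, the ‘e’ insertion rule can be written as a regular-expression rewrite, which is the same kind of rule an FST would encode. The trigger set (x, s, z, ch, sh) is an assumption based on the usual statement of the rule, and `intermediate_to_surface` is a hypothetical name; only this one spelling rule is handled.

```python
import re


def intermediate_to_surface(form):
    """Apply 'e' insertion, then delete the ^ and # boundary symbols."""
    # Insert e between a sibilant-final stem and the plural s:
    # fox^s# -> foxes#
    form = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", form)
    # Boundary symbols never appear on the surface tape.
    return form.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "watch^s#"]:
    print(form, "->", intermediate_to_surface(form))
```

Here fox^s# surfaces as foxes, while cat^s# is left without an inserted e and surfaces as cats.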

17 Porter Stemmer
Porter Stemmer (1980)
Used for tasks in which you only care about the stem
–IR, modeling given/new distinction, topic detection, document similarity
Lexicon-free morphological analysis
Cascades rewrite rules (e.g. misunderstanding → misunderstand → understand → …)
Easily implemented as an FST with rules, e.g.
–ATIONAL → ATE
–ING → ε
Not perfect…
–Doing → doe

18 Policy → police
Does stemming help?
–IR, little
–Topic detection, more
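The idea of a lexicon-free cascade of rewrite rules can be sketched with just the two rules named on the slide. This is not the Porter algorithm itself: the real stemmer has many more rules and extra conditions on the remaining stem, and the only condition kept here (the stem must contain a vowel before ING is stripped) is a simplifying assumption. The function names are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def ational_to_ate(word):
    """ATIONAL -> ATE, e.g. relational -> relate."""
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"
    return word


def strip_ing(word):
    """ING -> empty string, but only if the remaining stem has a vowel."""
    stem = word[: -len("ing")]
    if word.endswith("ing") and VOWEL.search(stem):
        return stem
    return word


# Rules are applied in order; each one sees the previous rule's output.
CASCADE = [ational_to_ate, strip_ing]


def stem(word):
    word = word.lower()
    for rule in CASCADE:
        word = rule(word)
    return word


for w in ["relational", "motoring", "sing"]:
    print(w, "->", stem(w))
```

Because no lexicon is consulted, rules of this kind will sometimes merge unrelated words or produce non-words, which is the sort of error the slide points to.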

19 Summing Up
FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
Next time:
–Read Ch 5:1-8
HW1 assigned (read the assignment)

