Morphology: Parsing Words

Lecture 3 Morphology: Parsing Words CS 4705

What is morphology?
The study of how words are composed from smaller, meaning-bearing units (morphemes)
Stems: children, undoubtedly
Affixes (prefixes, suffixes, circumfixes, infixes)
  Prefix: immaterial
  Suffix: trying
  Circumfix: gesagt
  Infix: absobl**dylutely
Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems

Agglutinative (e.g. Turkish, Japanese) vs. analytic (e.g. Mandarin) vs. inflectional systems (e.g. English, Latin, Russian)

(English) Inflectional Morphology
Word stem + grammatical morpheme
Usually produces a word of the same class
Usually serves a syntactic function (e.g. agreement)
  like --> likes or liked
  bird --> birds
Nominal morphology
  Plural forms: -s or -es
  Irregular forms (goose/geese)
  Mass vs. count nouns (fish/fish, or s?)
  Possessives (cat’s, cats’)

Verbal inflection
Main verbs (sleep, like, fear) are relatively regular: -s, -ing, -ed
And productive: -ed extends to new verbs (instant-messaged, faxed, homered)
But some are not regular: eat/ate/eaten, catch/caught/caught
Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
  Be: am/is/are/were/was/been/being
Irregular verbs are few (~250) but frequently occurring
So English inflectional morphology is fairly easy to model, with some special cases

(English) Derivational Morphology
Word stem + grammatical morpheme
Usually produces a word of a different class
More complicated than inflectional
E.g. verbs --> nouns
  -ize verbs --> -ation nouns
  generalize, realize --> generalization, realization
E.g. verbs, nouns --> adjectives
  embrace, pity --> embraceable, pitiable
  care, wit --> careless, witless

Example: adjective --> adverb
happy --> happily
But “rules” have many exceptions
Less productive: *evidence-less, *concern-less, *go-able, *sleep-able
Meanings of derived terms are harder to predict by rule: clueless, careless, nerveless

Parsing
Taking a surface input and identifying its components and underlying structure
Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships
Stem and features:
  goose --> goose +N +SG or goose +V
  geese --> goose +N +PL
  gooses --> goose +V +3SG
Bracketing: indecipherable --> [in [[de [cipher]] able]]
  Cipher: from Middle French cifre, ultimately from Arabic ṣifr (zero, nothing)
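The stem-and-features view of parsing can be sketched as a lookup from surface forms to analyses. This is only a toy illustration: the table holds just the goose/geese/gooses analyses from the slide, and the names ANALYSES and parse_word are made up for this sketch.

```python
# Toy analysis table: only the three surface forms from the slide.
# A real parser would derive these from a lexicon and rules, not list them.
ANALYSES = {
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PL"],
    "gooses": ["goose +V +3SG"],
}


def parse_word(surface):
    """Return every stem-plus-features analysis listed for a surface form."""
    return ANALYSES.get(surface.lower(), [])


for word in ["goose", "geese", "gooses"]:
    print(word, "-->", parse_word(word))
```

Note that goose comes back with two analyses: a morphological parser has to return all of them and leave disambiguation to later stages.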

Why parse words?
For spell-checking
  Is muncheble a legal word?
To identify a word’s part-of-speech (POS)
  For sentence parsing, for machine translation, …
To identify a word’s stem
  For information retrieval
Why not just list all word forms in a lexicon?

What do we need to build a morphological parser?
Lexicon: list of stems and affixes (with corresponding POS)
Morphotactics of the language: model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs
  in --> il in the context of l (in- + legal)

Using FSAs to Represent Morphotactic Models (given a lexicon)
English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-pl-n, irreg-sg-n]
Inputs: cats, goose, geese
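A morphotactic FSA of this kind can be sketched in a few lines. The arc layout below is an assumption, since the figure itself did not survive: reg-n leads from q0 to q1, plural (-s) from q1 to q2, and both irregular classes go straight from q0 to q2, with q1 and q2 accepting. The small word lists and the naive "strip a final s" segmentation are likewise made up for the illustration.

```python
# Hypothetical mini-lexicon, one set per morphotactic class.
REG_N = {"cat", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

# Assumed arc layout: (state, arc label) -> next state.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}
FINAL = {"q1", "q2"}


def classify(morpheme):
    """Map a morpheme to the arc label it can travel on, or None."""
    if morpheme == "-s":
        return "plural"
    if morpheme in REG_N:
        return "reg-n"
    if morpheme in IRREG_SG_N:
        return "irreg-sg-n"
    if morpheme in IRREG_PL_N:
        return "irreg-pl-n"
    return None


def segmentations(word):
    """Candidate splits: the whole word, or stem plus plural -s."""
    yield [word]
    if len(word) > 1 and word.endswith("s"):
        yield [word[:-1], "-s"]


def accepts(word):
    """True if some segmentation of the word drives the FSA to a final state."""
    for morphemes in segmentations(word):
        state = "q0"
        for morpheme in morphemes:
            state = TRANSITIONS.get((state, classify(morpheme)))
            if state is None:
                break
        if state in FINAL:
            return True
    return False


for word in ["cats", "goose", "geese", "gooses"]:
    print(word, accepts(word))
```

The sketch ignores spelling changes (it would not handle foxes), which is the job of the orthographic rules discussed later.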

Derivational morphology: adjective fragment
[Figure: FSA with states q0 to q5 and arcs labeled un-, adj-root1, adj-root2, "-er, -ly, -est" and "-er, -est"]
What will happen if we use only the FSA defined by the purple nodes?
  It will allow unbig, unred, …
Solution: define classes of adjective stems
  Adj-root1: clear, happy, real
  Adj-root2: big, red
NFSA: easier and more intuitive to define
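Splitting adjective stems into classes can be sketched as a small nondeterministic FSA over morphemes. The state names follow the figure, but the exact arc layout is an assumption: un- and "-er, -ly, -est" attach only to adj-root1, while adj-root2 takes only "-er, -est". Input is given as pre-segmented morphemes so that spelling changes (happier, bigger) stay out of the picture.

```python
ADJ_ROOT1 = {"clear", "happy", "real"}
ADJ_ROOT2 = {"big", "red"}

# Assumed arcs: (state, label) -> set of next states (nondeterministic form).
TRANSITIONS = {
    ("q0", "un-"): {"q1"},
    ("q0", "adj-root1"): {"q2"},
    ("q1", "adj-root1"): {"q2"},
    ("q2", "-er"): {"q3"},
    ("q2", "-est"): {"q3"},
    ("q2", "-ly"): {"q3"},
    ("q0", "adj-root2"): {"q5"},
    ("q5", "-er"): {"q4"},
    ("q5", "-est"): {"q4"},
}
FINAL = {"q2", "q3", "q4", "q5"}


def label(morpheme):
    """Stems travel on their class label; affixes travel as themselves."""
    if morpheme in ADJ_ROOT1:
        return "adj-root1"
    if morpheme in ADJ_ROOT2:
        return "adj-root2"
    return morpheme


def accepts(morphemes):
    """Track the set of reachable states while consuming the morphemes."""
    states = {"q0"}
    for morpheme in morphemes:
        lab = label(morpheme)
        states = set().union(*(TRANSITIONS.get((s, lab), set()) for s in states))
    return bool(states & FINAL)


print(accepts(["un-", "clear"]))  # un- is allowed on adj-root1
print(accepts(["un-", "big"]))    # un- is not allowed on adj-root2
```

With a single adj-root class in place of the two classes, the second call would be accepted, which is exactly the unbig problem described above.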

FSAs can also represent the Lexicon
Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red --> r e d) to get a recognizer for adjectives
[Figure: letter-level FSA with states q0 to q7 and arcs labeled un-, r, e, d, b, i, g, "-er, -est"]
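Expanding a sub-lexicon into letters amounts to building a letter-by-letter automaton. One simple way to sketch this, assuming a plain trie rather than a minimized FSA, is shown below for adj_root2; build_trie and recognizes are illustrative names, not part of any toolkit.

```python
END = None  # sentinel key marking "a stem ends here" (never a letter)


def build_trie(words):
    """Expand each stem into a path of single-letter arcs."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def recognizes(trie, word):
    """Follow one arc per letter; accept only if a stem ends exactly here."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


adj_root2 = build_trie(["big", "red"])
print(recognizes(adj_root2, "red"))
print(recognizes(adj_root2, "re"))
```

Each nested dictionary plays the role of a state, and each key is a letter arc, so the structure grows with the lexicon in the way the next slide warns about.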

But…
Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
  Adding new items to the lexicon means recomputing the whole FSA (determinizing and minimizing it each time)
  Non-determinism
FSAs tell us whether a word is in the language or not, but usually we want to know more:
  What is the stem?
  What are the affixes and what sort are they?
  We used this information to recognize the word: can we get it back?

Parsing with Finite State Transducers
cats --> cat +N +PL
Koskenniemi’s two-level morphology
  Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
  Morphological parsing: find the mapping (transduction) between lexical and surface levels
    Lexical: c a t +N +PL
    Surface: c a t s

Finite State Transducers can represent this mapping
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets
In general, FSTs can be used for
  Translators (Hello:Ciao)
  Parsers/generators (Hello:How may I help you?)
  As well as Kimmo-style morphological parsing

FST is a 5-tuple consisting of
Q: a set of states {q0, q1, q2, q3, q4}
Σ: an alphabet of complex symbols, each an i/o pair such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet); Σ ⊆ I × O
q0: a start state
F: a set of final states in Q {q4}
δ(q, i:o): a transition function mapping Q × Σ to Q
Example: Emphatic Sheep --> Quizzical Cow
[Figure: FST with states q0 to q4 and arcs labeled b:m, a:o, a:o, a:o, !:?]
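The 5-tuple can be written out directly for the sheep-to-cow example. The arc layout is an assumption read off the labels: b:m from q0 to q1, a:o from q1 to q2 and from q2 to q3, an a:o loop on q3, and !:? from q3 to q4. The function name transduce is made up for this sketch.

```python
# delta: (state, input symbol) -> (output symbol, next state)
DELTA = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),  # assumed loop: extra a's become extra o's
    ("q3", "!"): ("?", "q4"),
}
START = "q0"
FINAL = {"q4"}


def transduce(text):
    """Map sheep talk to cow talk, or return None if the input is rejected."""
    state = START
    output = []
    for symbol in text:
        step = DELTA.get((state, symbol))
        if step is None:
            return None
        out_symbol, state = step
        output.append(out_symbol)
    return "".join(output) if state in FINAL else None


print(transduce("baa!"))
print(transduce("baaaa!"))
```

Because every arc here consumes exactly one input symbol and has at most one continuation, a simple left-to-right walk suffices; FSTs with epsilon arcs or ambiguity need a search instead.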

FST for a 2-level Lexicon
[Figure: letter-level FST with states q0 to q7; e.g. the path c:c a:a t:t]
Reg-n:       c a t
Irreg-pl-n:  g o:e o:e s e
Irreg-sg-n:  g o o s e
NB: by convention, a:a is written just a

FST for English Nominal Inflection
[Figure: FST with states q0 to q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:, +SG:-#, +PL:^s#, +PL:-s#; example tapes c a t +N +PL and c a t s]

Combining (via cascade or composition) this FST with the FSTs for each noun type replaces e.g. reg-n with every regular noun representation in the lexicon (cf. J&M p. 76)
[Figure: expanded FST from q0 to q7, e.g. with the reg-noun-stem cat spelled out]
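The effect of the combined lexicon-plus-inflection transducer can be imitated with plain functions. The sketch below is not a real FST: it only mimics the three paths (regular, irregular singular, irregular plural) and assumes that +N maps to nothing, +SG to the word boundary #, and regular +PL to ^s#. The word lists and the name to_intermediate are hypothetical.

```python
REG_N = {"cat", "fox"}
IRREG_SG_N = {"goose"}
IRREG_PL_N = {"goose": "geese"}  # lexical stem -> irregular plural spelling


def to_intermediate(lexical):
    """Map e.g. 'cat +N +PL' to the intermediate string 'cat^s#'."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+SG"] and (stem in REG_N or stem in IRREG_SG_N):
        return stem + "#"
    if tags == ["+N", "+PL"]:
        if stem in REG_N:
            return stem + "^s#"  # morpheme boundary, plural s, word boundary
        if stem in IRREG_PL_N:
            return IRREG_PL_N[stem] + "#"
    return None  # not licensed by this toy lexicon


for lexical in ["cat +N +PL", "goose +N +SG", "goose +N +PL"]:
    print(lexical, "-->", to_intermediate(lexical))
```

The ^ and # symbols are left in place on purpose: they are what the orthographic-rule transducers operate on in the next step.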

Orthographic Rules and FSTs
Define additional FSTs to implement rules such as consonant doubling (beg --> begging), ‘e’ deletion (make --> making), ‘e’ insertion (watch --> watches), etc.
Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #
Surface:       f o x e s
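The ‘e’ insertion rule can be approximated with a regular-expression rewrite from the intermediate level to the surface level. This is a stand-in for the rule FST, not the FST itself, and the triggering contexts (x, s, z, ch, sh before ^s#) are an assumption based on the usual statement of the rule.

```python
import re


def e_insertion(intermediate):
    """Insert e at the morpheme boundary before plural s, then drop ^ and #."""
    # x, s, z, ch or sh, then the boundary ^, then "s#": the ^ becomes e.
    with_e = re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1e", intermediate)
    # Erase any remaining boundary symbols to get the surface spelling.
    return with_e.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "watch^s#", "geese#"]:
    print(form, "-->", e_insertion(form))
```

Forms that do not match the context pass through unchanged apart from losing their boundary symbols, which mirrors how a rule transducer copies its input wherever the rule does not apply.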

Note: These FSTs can be used for generation or recognition by simply exchanging the input and output alphabets
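Swapping the two sides of every arc is all it takes to turn a transducer around. The sketch below shows this on the sheep-to-cow machine, stored as (source, input, output, destination) arcs; the arc list, the helper names invert and run, and the default start and final states are assumptions made for the illustration.

```python
# Arcs as (source state, input symbol, output symbol, destination state).
ARCS = [
    ("q0", "b", "m", "q1"),
    ("q1", "a", "o", "q2"),
    ("q2", "a", "o", "q3"),
    ("q3", "a", "o", "q3"),
    ("q3", "!", "?", "q4"),
]


def invert(arcs):
    """Exchange input and output on every arc."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]


def run(arcs, text, start="q0", final=("q4",)):
    """Return all outputs for text, following every matching arc."""
    paths = [(start, "")]
    for symbol in text:
        paths = [
            (dst, produced + out)
            for state, produced in paths
            for src, inp, out, dst in arcs
            if src == state and inp == symbol
        ]
    return [produced for state, produced in paths if state in final]


print(run(ARCS, "baaa!"))
print(run(invert(ARCS), "mooo?"))
```

The same run function serves both directions; only the arc list changes, which is why one morphological FST can be used for both parsing and generation.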

Summing Up
FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
The key is to provide an FST for each of multiple levels of representation and then to combine those FSTs using a variety of operators (cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley)
Other (older) approaches are still widely used, e.g. the rule-based Porter Stemmer described in J&M appendix B
Next time: Read Ch 4

Word Classes AKA morphological classes, parts-of-speech
Closed vs. open (function vs. content) class words
  Closed: pronoun, preposition, conjunction, determiner, …
  Open: noun, verb, adverb, adjective, …

How do people represent words?
Hypotheses:
  Full listing hypothesis: words listed
  Minimum redundancy hypothesis: morphemes listed; morphemes must be accessed and combined
Experimental evidence:
  Priming experiments (does seeing/hearing one word facilitate recognition of another?) suggest neither
  Regularly inflected forms prime the stem, but derived forms do not
  But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)

Speech errors suggest affixes must be represented separately in the mental lexicon
easy enoughly
