Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011.

1 Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011

2 Roadmap Motivation: Representing words A little (mostly English) Morphology Stemming FSTs & Morphology Stemming Morphological analysis FSTs & Phonology

3 Words Goal: Compact representation of all surface forms in a language

7 Lexicon Goal: Compact representation of all surface forms in a language. Enumeration: impractical for morphologically rich languages; descriptively unsatisfying for most languages. Orthographic variation: fly + er → flier. Morphological variation: saw + s → saws; fish + s → fish; goose + s → geese. Phonological variation: dog + s → dog + /z/; fox + s → fox + /IH Z/.

8 Morphological Parsing Goal: Take a surface word form and generate a linguistic structure of component morphemes. A morpheme is the minimal meaning-bearing unit in a language. Stem: the morpheme that forms the central meaning unit in a word. Affix: prefix, suffix, infix, circumfix. Prefix: e.g., possible → impossible. Suffix: e.g., walk → walking. Infix: e.g., hingi → humingi (Tagalog). Circumfix: e.g., sagen → gesagt (German).

12 Combining Morphemes Inflection: stem + gram. morpheme → same class. E.g.: help + ed → helped. Derivation: stem + gram. morpheme → new class. E.g. walk + er → walker (N). Compounding: multiple stems → new word. E.g. doghouse, catwalk, … Clitics: stem + clitic. I + ll → I’ll; he + is → he’s.

16 Inflectional Morphology (Mostly English) Relatively simple inflectional system: nouns, verbs, some adjectives. Noun inflection: only plural, possessive. Non-English??? Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x. Possessive: sg, irreg pl: +’s; reg pl, after s, z: ’ (apostrophe only).
          Regular            Irregular
Singular  cat    thrush      goose  ox
Plural    cats   thrushes    geese  oxen

19 Verb Inflectional Morphology Classes: main (eat, hit), modal (can, should), primary (be, have). Only main, primary inflected. Regular verbs: forms predictable from stem, productive. Irregular verbs: only about 250, but very frequent.
Form        Regular verbs
Stem        walk     merge    try     map
-s form     walks    merges   tries   maps
-ing part   walking  merging  trying  mapping
past (-ed)  walked   merged   tried   mapped
Irregular verbs (stem, -s form, -ing part, past, past part where distinct):
eat    eats     eating    ate     eaten
catch  catches  catching  caught
cut    cuts     cutting   cut

22 Derivational Morphology Relatively complex, common in English. Nominalization: Verb or Adj + affix → Noun. Adjectives: Verb or Noun + affix → Adj.
Suffix  Base         Derived Noun
-ation  computerize  computerization
-ee     appoint      appointee
-er     kill         killer
-ness   fuzzy        fuzziness
Suffix  Base         Derived Adjective
-al     computation  computational
-able   embrace      embraceable
-less   clue         clueless

26 Cliticization Clitics: between affix and word. Like an affix: short, reduced. Like a word: act as pronouns, articles, conjunctions, verbs. In English: presence is (mostly) unambiguous, marked by the apostrophe (’); meaning is often ambiguous, e.g. he’s. More complex in other languages, e.g. Arabic: can prefix (proclitic) article, preposition, conjunction; no markers. Removal of such clitics is often referred to as light stemming.

31 Stemming Simple type of morphological analysis. Commonly used in information retrieval (IR). Supports matching using base form, e.g. television, televised, televising → televise. Typically improves retrieval of short documents – why? Most popular: Porter stemmer (snowball.tartarus.org). Task: given surface form, produce base form; typically removes suffixes. Model: rule cascade. No lexicon!

39 Porter Stemmer Rule cascade. Rule form: (condition) PATT1 → PATT2. E.g. (stem contains vowel) ING → ε; ATIONAL → ATE. Rule partial order: Step 1a: -s; Step 1b: -ed, -ing; Steps 2-4: derivational suffixes; Step 5: cleanup. Pros: simple, fast, buildable for a variety of languages. Cons: overaggressive and underaggressive; limited in application.
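
The rule-cascade idea can be sketched in a few lines of Python. This is a heavily reduced illustration, not the Porter stemmer itself: the three steps and their rules are a hypothetical subset chosen to mirror the examples on the slide, and the names `STEPS`, `apply_step`, and `stem` are invented here.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def always(stem):
    return True

def has_vowel(stem):
    # Condition from the slide: "stem contains vowel".
    return bool(VOWEL.search(stem))

# Each rule is (condition on the remaining stem, suffix pattern, replacement).
# Assumption: a tiny illustrative subset, not the real Porter rule set.
STEPS = [
    [(always, "sses", "ss"), (always, "ss", "ss"), (always, "s", "")],  # step 1a
    [(has_vowel, "ing", ""), (has_vowel, "ed", "")],                    # step 1b
    [(always, "ational", "ate")],                                       # a later step
]

def apply_step(word, rules):
    """Apply the first rule whose suffix matches; the condition may veto it."""
    for condition, suffix, replacement in rules:
        if word.endswith(suffix):
            remaining = word[: -len(suffix)]
            return remaining + replacement if condition(remaining) else word
    return word

def stem(word):
    """Run the word through every step of the cascade, in order."""
    word = word.lower()
    for rules in STEPS:
        word = apply_step(word, rules)
    return word

for w in ["walking", "sing", "relational", "cats", "glass"]:
    print(w, "->", stem(w))
```

Note that no lexicon is consulted: `sing` is left alone only because the would-be stem `s` contains no vowel, which is the kind of condition the cascade relies on.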

42 FST Morphological Analysis Focus on English morphology. FSA acceptor: cats → yes; foxes → yes; childs → no. FST morphological analyzer: fox + N + pl → fox^s#. FST for orthographic rules: fox^s# → foxes#.

45 Morphological Analysis Components Lexicon: list of stems and affixes, e.g. cat: N; -s: Pl. Morphotactics: model of morpheme ordering; association with classes, affix ordering, e.g. Pl follows N. Orthographic rules: spelling rules; changes when morphemes combine, e.g. y → ie in try + s.

49 Example Goal: foxes → fox + N + Pl. Surface: foxes. (Orthographic rules) Intermediate: fox s. (Lexicon + morphotactics) Lexical: fox + N + Pl.

50 Multiple Levels Generation and Analysis. Generation: fox + N + Pl → fox^s#; fox^s# → foxes#. Analysis: foxes# → fox^s#; fox^s# → fox + N + Pl.

53 The Lexicon Repository for words. Simplest would be enumeration: impractical (at least) for many languages. Includes stems, affixes, some morphotactics, e.g. cat: N, +sg; fly: v, +base. What about flies: v, +sg, +3rd? Common model of morphotactics: FSA.

56 Basic Noun Lexicon (J&M, CH3) As an FSA
reg-noun  irreg-pl-noun  irreg-sg-noun  plural
fox       geese          goose          -s
cat       sheep          sheep
dog       mice           mouse
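
A minimal sketch of how this lexicon could drive a morphotactic FSA over morpheme sequences. The state names and the transition table are assumptions (the slide's figure is not in the transcript); the word classes come from the table above.

```python
# Word classes from the Basic Noun Lexicon table.
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"-s"},
}

# Assumed morphotactics: a regular noun may take the plural affix,
# irregular singular and plural forms may not.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}

def classes_of(morpheme):
    """All lexicon classes a morpheme belongs to (sheep has two)."""
    return [cls for cls, members in LEXICON.items() if morpheme in members]

def accepts(morphemes):
    """Nondeterministic recognition over a list of morphemes."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(state, cls)]
            for state in states
            for cls in classes_of(morpheme)
            if (state, cls) in TRANSITIONS
        }
        if not states:
            return False
    return bool(states & ACCEPTING)

print(accepts(["cat", "-s"]))    # regular noun + plural
print(accepts(["goose", "-s"]))  # irregular singular cannot take -s
```

The acceptor only answers yes or no; it works on already-segmented morphemes and says nothing about spelling, which is left to the orthographic rules.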

58 FSA Lexicon with Words What’s up with the ‘s’ arc? Orthographic rules will fix ‘es’

59 Lexicon for English Verbs Verbs and classes:
reg-v-stem  irreg-v-stem  irreg-past-v-form  past  past-part  pres-part  3sg
walk        cut           caught             -ed   -ed        -ing       -s
fry         speak         ate
talk        sing          eaten
impeach                   sang

61 FSA for Derivational Morphology Complex….

64 FSAs for Morphotactics We have: stems (and stem class identities), e.g. cat: reg-noun, goose: irreg-noun; walk: reg-verb-stem, cut: irreg-verb-stem. Affixes (by form and class), e.g. –s: Plural; –ed: past, past-part. Morphotactic FSAs: accept combinations of stems & affixes in the language; reject otherwise.

68 Recognition vs Analysis/Generation Can validate a morphological sequence. Recognition not usually main goal. Analysis: given a surface form, produce component morphemes. Generation: given some morphological structure, produce full surface form. Requires translation from one form to another: FSTs.

69 Multilevel Tape Machines [Figure: Lexicon FST, followed by orthographic-rule transducers FST1 … FSTn]

70 Noun Morphology FSA Remember:

71 Schematic FST cat + N + Pl → cat^s#. Map morph features to empty string if there is no corresponding output.

75 Updating the Lexicon Need words, not just classes, as FST: fox → fox. Need: geese → goose + N + Pl. Assume f:f written as f.
reg-noun  irreg-pl-noun    irreg-sg-noun
fox       g o:e o:e s e    g o o s e
cat       sheep            sheep
aardvark  m o:i u:ε s:c e  mouse
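
To make the pair notation concrete, here is a small sketch that reads an entry such as `g o:e o:e s e` and recovers both sides of the mapping. The helper names `parse_pairs` and `sides` are invented for illustration, and the convention that the left symbol is lexical and the right symbol is surface is an assumption consistent with geese → goose above.

```python
EPS = "ε"  # the empty-string symbol used in the table

def parse_pairs(spec):
    """Turn 'g o:e o:e s e' into (lexical, surface) symbol pairs.

    A bare symbol such as 'g' abbreviates the identity pair 'g:g'.
    """
    pairs = []
    for token in spec.split():
        lexical, sep, surface = token.partition(":")
        if not sep:
            surface = lexical
        pairs.append((lexical, surface))
    return pairs

def sides(spec):
    """Return (lexical string, surface string), dropping ε on either side."""
    pairs = parse_pairs(spec)
    lexical = "".join(l for l, _ in pairs if l != EPS)
    surface = "".join(s for _, s in pairs if s != EPS)
    return lexical, surface

print(sides("g o:e o:e s e"))    # goose / geese
print(sides("m o:i u:ε s:c e"))  # mouse / mice
```

Reading the pairs left to right gives generation (goose to geese); reading them right to left gives analysis, which is why a single entry serves both directions.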

76 Integrating the Lexicon Replace classes with stems

80 Adding Orthographic Rules Current transducer concatenates morphemes. Should work for cats, aardvarks, mice, … but foxs? Problem: spelling changes at morpheme boundaries. Many such rules: consonant doubling before –ing, –ed; e-deletion: silent e dropped before –ing, –ed; y replacement: y → ie before –s, i before –ed; etc. Approach: transducers for orthographic rules.

85 Creating an Orthographic Rule Goal: correct e-insertion in plurals, e.g. fox^s# → foxes. Approach 1: ε → e yields foxes, but also cates, doges, etc. Only apply in context: after s, z, x, etc., before s. Approach 2: ε → e / (s|z|x) _ s. Issue? glass → glases. Approach 3: ε → e / (s|z|x) ^ _ s #

86 Rewrite Rules Format: a → b / c _ d. Rewrite rules can be optional or obligatory. Rewrite rules can be ordered to reduce ambiguity. Under some conditions, rewrite rules are equivalent to FSTs. a is not allowed to match something introduced in a prior rule application.

87 E-insertion Rule Transducer ε → e / (s|z|x) ^ _ s #. Input (intermediate level): …(s|z|x)^s#. Output (surface level): …(s|z|x)es#
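
The context-sensitive rule can be approximated with a regular-expression substitution. This is a sketch of the rewrite rule only, not of the transducer: the function names are invented, and stripping the `^` boundary marker afterwards is an added assumption about how the intermediate tape becomes a surface string.

```python
import re

def e_insertion(intermediate):
    """Apply ε → e / (s|z|x) ^ _ s # to an intermediate-level string.

    The lookbehind and lookahead encode the left and right context, so an
    'e' is inserted only right after a morpheme boundary that follows
    s, z, or x and precedes the word-final 's#'.
    """
    return re.sub(r"(?<=[szx])\^(?=s#)", "^e", intermediate)

def to_surface(intermediate):
    """Run the rule, then drop the morpheme-boundary marker."""
    return e_insertion(intermediate).replace("^", "")

for form in ["fox^s#", "cat^s#", "glass#", "glass^s#"]:
    print(form, "->", to_surface(form))
```

Because the boundary marker is part of the context, `glass#` passes through unchanged, which is the problem Approach 2 had without `^`.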

93 Using the E-insertion FST (fox, fox): q0, q0, q0, q1, accept. (fox#, fox#): q0, q0, q0, q1, q0, accept. (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept. (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject. (fox^z#, foxz#)?
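
The traces above can be reproduced with a small simulator. The transition table below is an assumption: the transducer's diagram is not in the transcript, so the table was reconstructed from the state sequences on this slide and from the usual textbook presentation of the e-insertion transducer. Treat it as a sketch to experiment with, not as the slide's exact machine.

```python
EPS = "ε"

def pairs(spec):
    """Parse 'f o x ^:ε ε:e s #' into (intermediate, surface) symbol pairs."""
    out = []
    for token in spec.split():
        upper, sep, lower = token.partition(":")
        if not sep:
            lower = upper
        out.append((upper.replace(EPS, ""), lower.replace(EPS, "")))
    return out

def label(pair):
    """Map a symbol pair to the arc label it can travel on (or None)."""
    if pair == ("^", ""):
        return "^:ε"
    if pair == ("", "e"):
        return "ε:e"
    upper, lower = pair
    if upper != lower or not upper or upper == "^":
        return None
    if upper in ("#", "s"):
        return upper
    return "zx" if upper in ("z", "x") else "other"

# Assumed transition table: state -> {arc label -> next state}.
DELTA = {
    0: {"s": 1, "zx": 1, "^:ε": 0, "#": 0, "other": 0},
    1: {"s": 1, "zx": 1, "^:ε": 2, "#": 0, "other": 0},
    2: {"s": 5, "zx": 1, "^:ε": 0, "ε:e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "zx": 1, "^:ε": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}

def trace(spec):
    """Return (list of visited states, accepted?) for an aligned pair string."""
    state, path = 0, [0]
    for pair in pairs(spec):
        state = DELTA[state].get(label(pair))
        if state is None:
            return path, False
        path.append(state)
    return path, state in ACCEPTING

print(trace("f o x ^:ε ε:e s #"))  # (fox^s#, foxes#)
print(trace("f o x ^:ε s"))        # (fox^s, foxs)
```

Under this assumed table, the open case (fox^z#, foxz#) is accepted via q0, q0, q0, q1, q2, q1, q0, since the rule only constrains what happens before a word-final s.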

95 What will it accept? (f,f); (fox#,fox#); (fox^s#,foxes#); (fox^z#,foxz#). Goal: write rules that capture only those constraints; let all other input pass through.

96 Combining FST Lexicon & Rules Two-level morphological system: ‘Cascade’ Transducer from Lexicon to Intermediate Rule transducers from Intermediate to Surface

102 Generation & Parsing Generation: given lexicon tape, cascade to produce surface form: fox + N + PL → foxes#. Parsing: given surface form, generate analysis: foxes# → fox + N + PL or → fox + V + 3Sg. How can we disambiguate? We can’t here – need outside information. What about ‘assess’? Need same sort of search as NFAs.
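
The ambiguity of foxes can be seen with a toy cascade. This sketch does not invert a transducer; it parses by generating every lexical form and keeping those whose surface form matches, which is only workable for a tiny lexicon. The lexicon, the feature names `Sg` and `Base`, and the function names are hypothetical.

```python
import re

# Hypothetical toy lexicon: stem -> parts of speech.
LEXICON = {"fox": {"N", "V"}, "cat": {"N"}, "walk": {"V"}}

# Lexical features -> intermediate-level suffix (^ marks a morpheme boundary).
SUFFIX = {("N", "Sg"): "", ("N", "Pl"): "^s", ("V", "Base"): "", ("V", "3Sg"): "^s"}

def generate(stem, pos, feat):
    """Lexical -> intermediate -> surface, using the e-insertion rule."""
    intermediate = stem + SUFFIX[(pos, feat)] + "#"
    with_e = re.sub(r"(?<=[szx])\^(?=s#)", "^e", intermediate)
    return with_e.replace("^", "")

def parse(surface):
    """Generate-and-test: every lexical analysis that yields this surface form."""
    return sorted(
        f"{stem} + {pos} + {feat}"
        for stem, parts in LEXICON.items()
        for (pos, feat) in SUFFIX
        if pos in parts and generate(stem, pos, feat) == surface
    )

print(generate("fox", "N", "Pl"))
print(parse("foxes#"))
```

Both analyses come back for `foxes#`, and nothing in the word itself chooses between them, which is why outside information (for example, the surrounding context) is needed.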

103 FST Morphological Analysis Summary: Main components Lexicon Morphotactics Orthographic rules Morphotactics as FSTs, expanded with FST Lexicon Orthographic rules as FSTs Combine FSTs, e.g. in cascade

106 Issues What do you think of creating all the rules for a language by hand? Time-consuming, complicated. Proposed approach: unsupervised morphology induction. Potentially useful for many applications: IR, MT.

108 Unsupervised Morphology Start from tokenized text (or word frequencies): talk 60, talked 120, walked 40, walk 30. Treat as coding/compression problem: find most compact representation of lexicon. Popular model: MDL (Minimum Description Length). Smallest total encoding: weighted combination of lexicon size & ‘rules’.

112 Approach Generate initial model: base set of words, compute MDL length. Iterate: generate a new set of words + some model to create a smaller description size. E.g. for talk, talked, walk, walked: 4 words; 2 words (talk, walk) + 1 affix (-ed) + combination info; 2 words (t, w) + 2 affixes (alk, -ed) + combination info.
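
A crude way to compare the three candidate models is to count the characters each one has to store. This is only a proxy: a real MDL computation measures bits and also charges for the combination info, which this sketch deliberately ignores. The function and model names are invented for illustration.

```python
def description_length(stems, affixes):
    """Character count of the stored stems and affixes.

    Assumption: the cost of the 'combination info' (which stem takes
    which affix) is left out, so this understates the true length.
    """
    return sum(len(s) for s in stems) + sum(len(a) for a in affixes)

MODELS = {
    "4 words": (["talk", "talked", "walk", "walked"], []),
    "2 stems + 1 affix": (["talk", "walk"], ["ed"]),
    "2 stems + 2 affixes": (["t", "w"], ["alk", "ed"]),
}

for name, (stems, affixes) in MODELS.items():
    print(f"{name}: {description_length(stems, affixes)} characters")
```

By this count the third model is the smallest, yet splitting off `alk` is not a sensible morphological analysis; the combination info and the weighting mentioned on the previous slide are what keep a full MDL objective from rewarding such splits automatically.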

113 Homework #3

