
1 Morphology & FSTs Shallow Processing Techniques for NLP Ling570 October 17, 2011

2 Roadmap Two-level morphology summary Unsupervised morphology

3 Combining FST Lexicon & Rules Two-level morphological system ('cascade'): a transducer from Lexicon to Intermediate, then rule transducers from Intermediate to Surface

4 Integrating the Lexicon Replace classes with stems

5 Using the E-insertion FST (fox, fox): q0,q0,q0,q1, accept; (fox#, fox#): q0,q0,q0,q1,q0, accept; (fox^s#, foxes#): q0,q0,q0,q1,q2,q3,q4,q0, accept; (fox^s, foxs): q0,q0,q0,q1,q2,q5, reject; (fox^z#, foxz#)?
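
To see how such a trace can be checked mechanically, here is a minimal sketch of a two-tape acceptance routine over an explicit transition table. The table is a hypothetical fragment covering only the arcs needed for the fox examples above (state names mirror the trace); it is not the full E-insertion transducer, and in the real rule q5 is not simply a dead end.

```python
# Sketch: check whether a (lexical, surface) pair is accepted by a transducer
# given as an explicit transition table. Only the arcs needed for the 'fox'
# traces above are included; this is NOT the complete E-insertion rule.
EPS = ""  # epsilon

# (state, lexical symbol, surface symbol) -> next state
TRANS = {
    ("q0", "f", "f"): "q0",
    ("q0", "o", "o"): "q0",
    ("q0", "x", "x"): "q1",   # x (like s, z) may trigger e-insertion
    ("q1", "#", "#"): "q0",
    ("q1", "^", EPS): "q2",   # morpheme boundary deleted on the surface
    ("q2", EPS, "e"): "q3",   # insert 'e' on the surface tape
    ("q3", "s", "s"): "q4",
    ("q4", "#", "#"): "q0",
    ("q2", "s", "s"): "q5",   # path without insertion: ends up rejected
}
FINALS = {"q0", "q1"}

def accepts(state, lex, surf):
    """True if (lex, surf) can be consumed from `state`, ending in a final state."""
    if not lex and not surf:
        return state in FINALS
    if lex and surf:                       # consume one symbol on each tape
        nxt = TRANS.get((state, lex[0], surf[0]))
        if nxt and accepts(nxt, lex[1:], surf[1:]):
            return True
    if surf:                               # epsilon on the lexical tape (insertion)
        nxt = TRANS.get((state, EPS, surf[0]))
        if nxt and accepts(nxt, lex, surf[1:]):
            return True
    if lex:                                # epsilon on the surface tape (deletion)
        nxt = TRANS.get((state, lex[0], EPS))
        if nxt and accepts(nxt, lex[1:], surf):
            return True
    return False

print(accepts("q0", "fox", "fox"))        # True:  q0,q0,q0,q1
print(accepts("q0", "fox^s#", "foxes#"))  # True:  e-insertion path via q2,q3,q4
print(accepts("q0", "fox^s", "foxs"))     # False: stuck in non-final q5
```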

6 Issues What do you think of creating all the rules for a language by hand? Time-consuming, complicated

7 Issues What do you think of creating all the rules for a language by hand? Time-consuming, complicated Proposed approach: Unsupervised morphology induction

8 Issues What do you think of creating all the rules for a language by hand? Time-consuming, complicated Proposed approach: Unsupervised morphology induction Potentially useful for many applications: IR, MT

9 Unsupervised Morphology Start from tokenized text (or word frequencies): talk 60, talked 120, walked 40, walk 30

10 Unsupervised Morphology Start from tokenized text (or word frequencies): talk 60, talked 120, walked 40, walk 30 Treat as a coding/compression problem: find the most compact representation of the lexicon Popular model: MDL (Minimum Description Length), i.e. the smallest total encoding, a weighted combination of lexicon size & 'rules'

11 Approach Generate initial model: Base set of words, compute MDL length

12 Approach Generate initial model: Base set of words, compute MDL length Iterate: Generate a new set of words + some model to create a smaller description size

13 Approach Generate initial model: Base set of words, compute MDL length Iterate: Generate a new set of words + some model to create a smaller description size E.g. for talk, talked, walk, walked 4 words

14 Approach Generate initial model: Base set of words, compute MDL length Iterate: Generate a new set of words + some model to create a smaller description size E.g. for talk, talked, walk, walked: 4 words; or 2 words (talk, walk) + 1 affix (-ed) + combination info; or 2 words (t, w) + 2 affixes (-alk, -ed) + combination info
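
As a toy illustration of this comparison (counting characters as a crude proxy for description length; real MDL systems such as Goldsmith's Linguistica use probabilistic bit costs, so this is only the intuition):

```python
# Compare a plain word list against a factored stems + suffixes model,
# using character counts as a stand-in for real code lengths.
words = ["talk", "talked", "walk", "walked"]

# Model A: the lexicon is just the four whole words.
cost_a = sum(len(w) for w in words)

# Model B: two stems and one suffix set, plus one "which suffix" flag per
# word (charged here as 1 unit each).
stems, suffixes = ["talk", "walk"], ["", "ed"]
cost_b = sum(len(s) for s in stems) + sum(len(x) for x in suffixes) + len(words)

print(cost_a, cost_b)   # 20 vs. 14: the factored description is smaller
```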

15 Successful Applications Inducing word classes (e.g. N,V) by affix patterns Unsupervised morphological analysis for MT Word segmentation in CJK Word text/sound segmentation in English

16 Unit #1 Summary

17 Formal Languages Formal Languages and Grammars Chomsky hierarchy Languages and the grammars that accept/generate them

18 Formal Languages Formal Languages and Grammars Chomsky hierarchy Languages and the grammars that accept/generate them Equivalences Regular languages Regular grammars Regular expressions Finite State Automata

19 Finite-State Automata & Transducers Finite-State Automata: Deterministic & non-deterministic automata Equivalence and conversion Probabilistic & weighted FSAs

20 Finite-State Automata & Transducers Finite-State Automata: Deterministic & non-deterministic automata Equivalence and conversion Probabilistic & weighted FSAs Packages and operations: Carmel

21 Finite-State Automata & Transducers Finite-State Automata: Deterministic & non-deterministic automata Equivalence and conversion Probabilistic & weighted FSAs Packages and operations: Carmel FSTs & regular relations Closures and equivalences Composition, inversion

22 FSA/FST Applications Range of applications: Parsing Translation Tokenization…

23 FSA/FST Applications Range of applications: Parsing Translation Tokenization… Morphology: Lexicon: cat: N, +Sg; -s: Pl Morphotactics: N+PL Orthographic rules: fox + s → foxes Parsing & Generation

24 Implementation Tokenizers FSA acceptors FST acceptors/translators Orthographic rule as FST

25 Language Modeling

26 Roadmap Motivation: LM applications N-grams Training and Testing Evaluation: Perplexity

27 Predicting Words Given a sequence of words, the next word is (somewhat) predictable: I’d like to place a collect …..

28 Predicting Words Given a sequence of words, the next word is (somewhat) predictable: I’d like to place a collect ….. Ngram models: Predict next word given previous N Language models (LMs): Statistical models of word sequences

29 Predicting Words Given a sequence of words, the next word is (somewhat) predictable: I’d like to place a collect ….. Ngram models: Predict next word given previous N Language models (LMs): Statistical models of word sequences Approach: Build model of word sequences from corpus Given alternative sequences, select the most probable

30 N-gram LM Applications Used in Speech recognition Spelling correction Augmentative communication Part-of-speech tagging Machine translation Information retrieval

31 Terminology Corpus (pl. corpora): Online collection of text or speech E.g. Brown corpus: 1M words, balanced text collection E.g. Switchboard: 240 hrs of speech; ~3M words

32 Terminology Corpus (pl. corpora): Online collection of text or speech E.g. Brown corpus: 1M words, balanced text collection E.g. Switchboard: 240 hrs of speech; ~3M words Wordform: Full inflected or derived form of a word: cats, glottalized

33 Terminology Corpus (pl. corpora): Online collection of text or speech E.g. Brown corpus: 1M words, balanced text collection E.g. Switchboard: 240 hrs of speech; ~3M words Wordform: Full inflected or derived form of a word: cats, glottalized Word types: # of distinct words in corpus

34 Terminology Corpus (pl. corpora): Online collection of text or speech E.g. Brown corpus: 1M words, balanced text collection E.g. Switchboard: 240 hrs of speech; ~3M words Wordform: Full inflected or derived form of a word: cats, glottalized Word types: # of distinct words in corpus Word tokens: total # of words in corpus

35 Corpus Counts Estimate probabilities by counts in large collections of text/speech Should we count: Wordform vs lemma? Case? Punctuation? Disfluency? Type vs token?

36 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars.

37 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars. Word types (excluding punct):

38 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars. Word types (excluding punct): 14 Word tokens (excluding punct):

39 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars. Word types (excluding punct): 14 Word tokens (excluding punct): 16. I do uh main- mainly business data processing: Utterance (spoken “sentence” equivalent)

40 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars. Word types (excluding punct): 14 Word tokens (excluding punct): 16. I do uh main- mainly business data processing: Utterance (spoken “sentence” equivalent) What about: Disfluencies main-: fragment uh: filler (aka filled pause)

41 Words, Counts and Prediction They picnicked by the pool, then lay back on the grass and looked at the stars. Word types (excluding punct): 14 Word tokens (excluding punct): 16. I do uh main- mainly business data processing: Utterance (spoken “sentence” equivalent) What about: Disfluencies main-: fragment uh: filler (aka filled pause) Keep, depending on app.: can help prediction; uh vs um
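
As a quick sanity check on these counts, a few lines of Python reproduce them; lowercasing and stripping punctuation is one of several reasonable normalization choices (cf. the counting questions above):

```python
import re

s = ("They picnicked by the pool, then lay back on the grass "
     "and looked at the stars.")
tokens = re.findall(r"[a-z'-]+", s.lower())   # drop punctuation, fold case
print(len(tokens), len(set(tokens)))          # 16 tokens, 14 types ('the' x3)
```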

42 LM Task Training: Given a corpus of text, learn probabilities of word sequences

43 LM Task Training: Given a corpus of text, learn probabilities of word sequences Testing: Given trained LM and new text, determine sequence probabilities, or Select most probable sequence among alternatives

44 LM Task Training: Given a corpus of text, learn probabilities of word sequences Testing: Given trained LM and new text, determine sequence probabilities, or Select most probable sequence among alternatives LM types: Basic, Class-based, Structured

45 Word Prediction Goal: Given some history, what is probability of some next word? Formally, P(w|h) e.g. P(call|I’d like to place a collect)

46 Word Prediction Goal: Given some history, what is probability of some next word? Formally, P(w|h) e.g. P(call|I’d like to place a collect) How can we compute?

47 Word Prediction Goal: Given some history, what is probability of some next word? Formally, P(w|h) e.g. P(call|I’d like to place a collect) How can we compute? Relative frequency in a corpus C(I’d like to place a collect call)/C(I’d like to place a collect) Issues?

48 Word Prediction Goal: Given some history, what is probability of some next word? Formally, P(w|h) e.g. P(call|I’d like to place a collect) How can we compute? Relative frequency in a corpus C(I’d like to place a collect call)/C(I’d like to place a collect) Issues? Zero counts: language is productive! Joint probability of a word sequence of length N: count of that sequence relative to the count of all sequences of length N
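
A minimal sketch of the relative-frequency estimate above, over a toy in-memory corpus (the corpus string is invented for illustration, and crude substring counting stands in for proper tokenization; even so, any longer history quickly runs into the zero-count problem):

```python
# Estimate P(call | I'd like to place a collect) as C(history + word) / C(history).
corpus = ("i'd like to place a collect call to seattle . "
          "i'd like to place a collect call . "
          "i'd like to place a call on my card . ")

history, word = "i'd like to place a collect", "call"
c_hw = corpus.count(history + " " + word)   # count of the full sequence
c_h = corpus.count(history)                 # count of the history alone
print(c_hw / c_h)                           # 2/2 = 1.0 in this tiny corpus
```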

49 Word Sequence Probability Notation: P(Xi = the) written as P(the) P(w1 w2 w3 … wn) =

50 Word Sequence Probability Notation: P(Xi = the) written as P(the) P(w1 w2 w3 … wn) = Compute probability of word sequence by chain rule Links to word prediction by history

51 Word Sequence Probability Notation: P(Xi = the) written as P(the) P(w1 w2 w3 … wn) = Compute probability of word sequence by chain rule Links to word prediction by history Issues?

52 Word Sequence Probability Notation: P(Xi = the) written as P(the) P(w1 w2 w3 … wn) = Compute probability of word sequence by chain rule Links to word prediction by history Issues? Potentially infinite history

53 Word Sequence Probability Notation: P(Xi = the) written as P(the) P(w1 w2 w3 … wn) = Compute probability of word sequence by chain rule Links to word prediction by history Issues? Potentially infinite history Language infinitely productive
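
For reference, the chain-rule decomposition behind this slide is P(w1 w2 … wn) = P(w1) * P(w2|w1) * P(w3|w1 w2) * … * P(wn|w1 … wn-1): each factor conditions on the entire preceding history, which is exactly where the "potentially infinite history" problem arises.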

54 Markov Assumptions Exact computation requires too much data

55 Markov Assumptions Exact computation requires too much data Approximate probability given all prior words Assume finite history

56 Markov Assumptions Exact computation requires too much data Approximate probability given all prior words Assume finite history Unigram: Probability of word in isolation (0th order) Bigram: Probability of word given 1 previous (first-order Markov) Trigram: Probability of word given 2 previous

57 Markov Assumptions Exact computation requires too much data Approximate probability given all prior words Assume finite history Unigram: Probability of word in isolation (0th order) Bigram: Probability of word given 1 previous (first-order Markov) Trigram: Probability of word given 2 previous N-gram approximation; bigram sequence
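
Written out, the bigram (first-order Markov) assumption replaces each chain-rule factor P(wi|w1 … wi-1) with P(wi|wi-1), so P(w1 w2 … wn) ≈ P(w1) * P(w2|w1) * … * P(wn|wn-1); the general N-gram case conditions on the previous N-1 words instead.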

58 Unigram Models P(w1 w2 … wn) ≈

59 Unigram Models P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn) Training: Estimate P(w) given corpus

60 Unigram Models P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn) Training: Estimate P(w) given corpus Relative frequency:

61 Unigram Models P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn) Training: Estimate P(w) given corpus Relative frequency: P(w) = C(w)/N, N = # tokens in corpus How many parameters?

62 Unigram Models P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn) Training: Estimate P(w) given corpus Relative frequency: P(w) = C(w)/N, N = # tokens in corpus How many parameters? Testing: For sentence s, compute P(s) Model with PFA: Input symbols? Probabilities on arcs? States?
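
A minimal unigram sketch along these lines, using the (lowercased) toy "I am Sam" corpus from the example slide further below; the MLE is P(w) = C(w)/N and a sentence is scored as the product of its word probabilities:

```python
from collections import Counter
from math import prod

corpus = "i am sam sam i am i do not like green eggs and ham".split()
counts, N = Counter(corpus), len(corpus)

def p_unigram(w):
    return counts[w] / N                 # 0 for unseen words: smoothing needed

def p_sentence(s):
    return prod(p_unigram(w) for w in s.split())

print(p_sentence("i am sam"))            # (3/14) * (2/14) * (2/14)
```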

63 Bigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)

64 Bigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn) Training: Relative frequency:

65 Bigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn) Training: Relative frequency: P(wi|wi-1) = C(wi-1 wi)/C(wi-1) How many parameters?

66 Bigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn) Training: Relative frequency: P(wi|wi-1) = C(wi-1 wi)/C(wi-1) How many parameters? Testing: For sentence s, compute P(s) Model with PFA: Input symbols? Probabilities on arcs? States?
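
A matching bigram sketch, padding each sentence with BOS/EOS markers and using the relative-frequency estimate above (again on the toy "I am Sam" corpus; an illustration, not a reference implementation):

```python
from collections import Counter
from math import prod

sentences = ["i am sam", "sam i am", "i do not like green eggs and ham"]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["BOS"] + s.split() + ["EOS"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]   # C(prev w) / C(prev)

def p_sentence(s):
    toks = ["BOS"] + s.split() + ["EOS"]
    return prod(p_bigram(w, prev) for prev, w in zip(toks, toks[1:]))

print(p_bigram("i", "BOS"))     # 2/3: two of the three sentences start with 'i'
print(p_bigram("am", "i"))      # 2/3
print(p_sentence("i am sam"))   # (2/3)*(2/3)*(1/2)*(1/2) ≈ 0.11
```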

67 Trigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)

68 Trigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn) Training: P(wi|wi-2, wi-1)

69 Trigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn) Training: P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi)/C(wi-2 wi-1) How many parameters?

70 Trigram Models P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn) Training: P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi)/C(wi-2 wi-1) How many parameters? How many states?

71 An Example (Speech and Language Processing, Jurafsky and Martin): I am Sam / Sam I am / I do not like green eggs and ham
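
Working the bigram estimates P(wi|wi-1) = C(wi-1 wi)/C(wi-1) through this three-sentence corpus (with BOS/EOS padding as above) gives, for example: P(I|BOS) = 2/3, P(Sam|BOS) = 1/3, P(am|I) = 2/3, P(Sam|am) = 1/2, P(EOS|Sam) = 1/2, P(do|I) = 1/3.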

72 Recap Ngrams: # FSA states:

73 Recap Ngrams: # FSA states: |V|^(n-1) # Model parameters:

74 Recap Ngrams: # FSA states: |V|^(n-1) # Model parameters: |V|^n Issues:

75 Recap Ngrams: # FSA states: |V|^(n-1) # Model parameters: |V|^n Issues: Data sparseness, out-of-vocabulary elements (OOV) → smoothing Mismatches between training & test data Other Language Models
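
To make these sizes concrete with an illustrative (hypothetical) vocabulary of |V| = 20,000 word types: a bigram model has 20,000^2 = 4 × 10^8 possible parameters and a trigram model 20,000^3 = 8 × 10^12, far more than any corpus supports with direct counts, hence the need for smoothing and OOV handling.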

