Morphology & FSTs Shallow Processing Techniques for NLP Ling570 October 17, 2011.

Roadmap
Two-level morphology summary
Unsupervised morphology

Combining FST Lexicon & Rules
Two-level morphological system: a ‘cascade’
  Transducer from Lexicon to Intermediate
  Rule transducers from Intermediate to Surface

Integrating the Lexicon
Replace classes with stems

Using the E-insertion FST
(fox, fox): q0, q0, q0, q1, accept
(fox#, fox#): q0, q0, q0, q1, q0, accept
(fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept
(fox^s, foxs): q0, q0, q0, q1, q2, q5, reject
(fox^z#, foxz#): ?
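
The effect of the E-insertion rule on intermediate forms can be sketched in a few lines of Python. This is not the transducer itself: it is a regular-expression stand-in, assuming the rule inserts e after x, s, or z at a morpheme boundary (^) when followed by s and the word boundary (#). The function name apply_e_insertion is hypothetical, and the word boundary is kept in the output to match the pairs on the slide.

```python
import re

def apply_e_insertion(intermediate: str) -> str:
    """Map an intermediate form (with ^ and #) to a surface form (with #).

    Assumed rule: insert 'e' after x, s, or z at a morpheme boundary
    when the next symbols are 's#'. All other boundaries are just deleted.
    """
    # Replace the boundary with 'e' only in the context {x,s,z} ^ __ s #
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1e", intermediate)
    # Any morpheme boundary that did not trigger the rule is removed.
    return with_e.replace("^", "")

for form in ["fox#", "fox^s#", "cat^s#", "fox^z#"]:
    print(form, "->", apply_e_insertion(form))
```

A regex substitution only generates surface forms. Unlike the FST, it does not reject ill-formed pairs such as (fox^s#, foxs#), and it cannot be inverted for parsing.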

Issues
What do you think of creating all the rules for a language by hand?
  Time-consuming, complicated
Proposed approach: unsupervised morphology induction
  Potentially useful for many applications: IR, MT

Unsupervised Morphology
Start from tokenized text (or word frequencies):
  talk 60, talked 120, walked 40, walk 30
Treat as a coding/compression problem
  Find the most compact representation of the lexicon
  Popular model: MDL (Minimum Description Length)
  Smallest total encoding: weighted combination of lexicon size & ‘rules’

Approach
Generate initial model: base set of words, compute MDL length
Iterate: generate a new set of words + some model to create a smaller description size
E.g. for talk, talked, walk, walked:
  4 words
  2 words (talk, walk) + 1 affix (-ed) + combination info
  2 words (t, w) + 2 affixes (alk, -ed) + combination info
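
To make the comparison concrete, here is a toy sketch of scoring the three candidate models for talk, talked, walk, walked. The cost function is an assumption made for illustration, not real MDL, which measures code length in bits. It charges one unit per character stored in the lexicon and one unit (pointer_cost) per morpheme used in the combination info. All names are hypothetical.

```python
def description_length(stems, affixes, analyses, pointer_cost=1):
    """Toy cost: characters stored in the lexicon plus one pointer per morpheme use."""
    lexicon_cost = sum(len(m) for m in stems + affixes)
    combination_cost = pointer_cost * sum(len(a) for a in analyses)
    return lexicon_cost + combination_cost

words = ["talk", "talked", "walk", "walked"]

# Each model: (stems, affixes, how every word is assembled from morphemes)
models = {
    "4 words": (
        words, [],
        [("talk",), ("talked",), ("walk",), ("walked",)],
    ),
    "talk/walk + -ed": (
        ["talk", "walk"], ["ed"],
        [("talk",), ("talk", "ed"), ("walk",), ("walk", "ed")],
    ),
    "t/w + alk, -ed": (
        ["t", "w"], ["alk", "ed"],
        [("t", "alk"), ("t", "alk", "ed"), ("w", "alk"), ("w", "alk", "ed")],
    ),
}

lengths = {}
for name, (stems, affixes, analyses) in models.items():
    # Every model must still regenerate exactly the observed words.
    assert sorted("".join(a) for a in analyses) == sorted(words)
    lengths[name] = description_length(stems, affixes, analyses)

best = min(lengths, key=lengths.get)
print(lengths, "best:", best)
```

Under this toy cost the over-segmented model (t, w + alk, -ed) has the smallest lexicon but pays more for combination info. That is the trade-off the weighted combination is meant to capture.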

Successful Applications
Inducing word classes (e.g. N, V) by affix patterns
Unsupervised morphological analysis for MT
Word segmentation in CJK
Word text/sound segmentation in English

Unit #1 Summary

Formal Languages
Formal languages and grammars
  Chomsky hierarchy
  Languages and the grammars that accept/generate them
Equivalences
  Regular languages
  Regular grammars
  Regular expressions
  Finite-state automata

Finite-State Automata & Transducers
Finite-state automata:
  Deterministic & non-deterministic automata
    Equivalence and conversion
  Probabilistic & weighted FSAs
Packages and operations: Carmel
FSTs & regular relations
  Closures and equivalences
  Composition, inversion

FSA/FST Applications
Range of applications: parsing, translation, tokenization…
Morphology:
  Lexicon: cat: N, +Sg; -s: Pl
  Morphotactics: N+PL
  Orthographic rules: fox + s → foxes
  Parsing & generation

Implementation
Tokenizers
FSA acceptors
FST acceptors/translators
Orthographic rule as FST

Language Modeling

Roadmap
Motivation: LM applications
N-grams
Training and testing
Evaluation: perplexity

Predicting Words
Given a sequence of words, the next word is (somewhat) predictable:
  I’d like to place a collect …
N-gram models: predict the next word given the previous N-1 words
Language models (LMs): statistical models of word sequences
Approach:
  Build a model of word sequences from a corpus
  Given alternative sequences, select the most probable

N-gram LM Applications
Used in:
  Speech recognition
  Spelling correction
  Augmentative communication
  Part-of-speech tagging
  Machine translation
  Information retrieval

Terminology
Corpus (pl. corpora): online collection of text or speech
  E.g. Brown corpus: 1M words, balanced text collection
  E.g. Switchboard: 240 hrs of speech; ~3M words
Wordform: full inflected or derived form of a word: cats, glottalized
Word types: # of distinct words in corpus
Word tokens: total # of words in corpus

Corpus Counts
Estimate probabilities by counts in large collections of text/speech
Should we count:
  Wordform vs lemma?
  Case? Punctuation? Disfluency?
  Type vs token?

Words, Counts and Prediction
They picnicked by the pool, then lay back on the grass and looked at the stars.
  Word types (excluding punct): 14
  Word tokens (excluding punct): 16
I do uh main- mainly business data processing
  Utterance (spoken “sentence” equivalent)
  What about disfluencies?
    main-: fragment
    uh: filler (aka filled pause)
  Keep, depending on app.: can help prediction; uh vs um
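
The type and token counts for the first sentence can be checked with a short script. This is a minimal sketch that assumes a token is any maximal run of letters (so punctuation is dropped) and that case is left as-is. Lowercasing would not change the counts here, since "They" and "the" remain distinct.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Tokens: every running word, punctuation excluded.
tokens = re.findall(r"[A-Za-z]+", sentence)
# Types: the distinct wordforms among those tokens.
types = set(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

The two counts differ only because "the" occurs three times.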

LM Task
Training: given a corpus of text, learn probabilities of word sequences
Testing: given a trained LM and new text,
  determine sequence probabilities, or
  select the most probable sequence among alternatives
LM types: basic, class-based, structured

Word Prediction
Goal: given some history, what is the probability of some next word?
  Formally, P(w|h), e.g. P(call|I’d like to place a collect)
How can we compute it?
  Relative frequency in a corpus:
  C(I’d like to place a collect call) / C(I’d like to place a collect)
Issues?
  Zero counts: language is productive!
  Joint word sequence probability of length N:
    count of all sequences of length N & count of that sequence
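
The relative-frequency estimate C(h w)/C(h) can be sketched directly. The tiny corpus below is invented for illustration, and the function names are hypothetical. An unseen history returns None, which is the zero-count problem in miniature.

```python
def count_sequence(tokens, seq):
    """Count how often the word sequence seq occurs contiguously in tokens."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return None  # history never seen: the estimate is undefined
    return count_sequence(tokens, history + [word]) / history_count

# Invented toy corpus (an assumption, not data from the slides).
corpus = ("i would like to place a collect call please "
          "i would like to place a collect call now "
          "i would like to place a call").split()

print(relative_frequency(corpus, ["to", "place", "a"], "collect"))
print(relative_frequency(corpus, ["make", "a", "collect"], "call"))
```

The longer the history, the more often the denominator is zero, even in a large corpus.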

Word Sequence Probability
Notation: $P(X_i = \text{the})$ written as $P(\text{the})$
$P(w_1 w_2 w_3 \ldots w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})$
  Compute probability of word sequence by chain rule
  Links to word prediction by history
Issues?
  Potentially infinite history
  Language infinitely productive

Markov Assumptions
Exact computation requires too much data
Approximate probability given all prior words
  Assume finite history
  Unigram: probability of word in isolation (0th order)
  Bigram: probability of word given 1 previous word
    First-order Markov
  Trigram: probability of word given 2 previous words
N-gram approximation: $P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \ldots w_{n-1})$
Bigram sequence: $P(w_1 \ldots w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$, where $w_0$ is the sentence-start marker

Unigram Models
$P(w_1 w_2 \ldots w_n) \approx P(w_1) \cdot P(w_2) \cdots P(w_n)$
Training:
  Estimate $P(w)$ given corpus
  Relative frequency: $P(w) = C(w)/N$, N = # tokens in corpus
  How many parameters?
Testing: for sentence s, compute $P(s)$
Model with PFA:
  Input symbols? Probabilities on arcs? States?
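
A minimal sketch of unigram training and testing under the relative-frequency estimate above. The six-word corpus is invented for illustration, and the function names are hypothetical. No smoothing is applied, so any unseen word drives the sentence probability to zero.

```python
from collections import Counter
from math import prod

def train_unigram(tokens):
    """Relative-frequency estimate: P(w) = C(w) / N."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def sentence_probability(model, sentence_tokens):
    """P(s) as the product of independent word probabilities."""
    # Unseen words get probability 0.0 (no smoothing in this sketch).
    return prod(model.get(word, 0.0) for word in sentence_tokens)

corpus = "the cat sat on the mat".split()  # invented toy corpus
model = train_unigram(corpus)

print(model["the"])
print(sentence_probability(model, ["the", "cat"]))
print(sentence_probability(model, ["the", "dog"]))
```

The model stores one probability per word type, which bears on the "how many parameters?" question.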

Bigram Models
$P(w_1 w_2 \ldots w_n) = P(\text{BOS}\; w_1 w_2 \ldots w_n\; \text{EOS})$
$\approx P(\text{BOS}) \cdot P(w_1 \mid \text{BOS}) \cdot P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) \cdot P(\text{EOS} \mid w_n)$
Training:
  Relative frequency: $P(w_i \mid w_{i-1}) = C(w_{i-1} w_i)/C(w_{i-1})$
  How many parameters?
Testing: for sentence s, compute $P(s)$
Model with PFA:
  Input symbols? Probabilities on arcs? States?

Trigram Models
$P(w_1 w_2 \ldots w_n) = P(\text{BOS}\; w_1 w_2 \ldots w_n\; \text{EOS})$
$\approx P(\text{BOS}) \cdot P(w_1 \mid \text{BOS}) \cdot P(w_2 \mid \text{BOS}, w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1}) \cdot P(\text{EOS} \mid w_{n-1}, w_n)$
Training: $P(w_i \mid w_{i-2}, w_{i-1}) = C(w_{i-2} w_{i-1} w_i)/C(w_{i-2} w_{i-1})$
How many parameters?
How many states?

An Example (Speech and Language Processing, Jurafsky and Martin)
I am Sam
Sam I am
I do not like green eggs and ham
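
The bigram relative-frequency estimate, C(previous word, word) / C(previous word), can be tried on these three sentences. This is a sketch under stated assumptions: BOS and EOS are written as the strings "<s>" and "</s>", P(BOS) is taken to be 1, and the function names are hypothetical.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed boundary markers

def train_bigram(sentences):
    """Estimate P(w | prev) = C(prev w) / C(prev) from whitespace-tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        context_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return {(prev, word): count / context_counts[prev]
            for (prev, word), count in bigram_counts.items()}

def sentence_probability(model, sentence):
    """Product of bigram probabilities over the padded sentence; P(BOS) assumed to be 1."""
    tokens = [BOS] + sentence.split() + [EOS]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        probability *= model.get((prev, word), 0.0)  # unseen bigram -> 0 (no smoothing)
    return probability

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
model = train_bigram(corpus)

print(model[(BOS, "I")], model[("I", "am")], model[("am", "Sam")])
print(sentence_probability(model, "I am Sam"))
print(sentence_probability(model, "Sam am I"))
```

A word order never seen in training, such as "Sam am I", gets probability zero because some bigram has count zero.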

Recap
N-grams:
  # FSA states: $|V|^{n-1}$
  # Model parameters: $|V|^n$
Issues:
  Data sparseness, out-of-vocabulary elements (OOV) → smoothing
  Mismatches between training & test data
  Other language models
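
To get a feel for how fast these quantities grow, the two formulas can be evaluated for a sample vocabulary. The vocabulary size of 20,000 is an arbitrary assumption for illustration, and ngram_sizes is a hypothetical helper.

```python
def ngram_sizes(vocab_size: int, n: int) -> tuple[int, int]:
    """Return (# FSA states, # model parameters) for an n-gram model: |V|^(n-1), |V|^n."""
    return vocab_size ** (n - 1), vocab_size ** n

V = 20_000  # assumed vocabulary size, for illustration only
for n in (1, 2, 3):
    states, parameters = ngram_sizes(V, n)
    print(f"n={n}: states={states:,} parameters={parameters:,}")
```

Most of those parameters will never be observed in any realistic training corpus, which is the data-sparseness issue listed above.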