Language Modelling. María Fernández Pajares. Verarbeitung gesprochener Sprache (Spoken Language Processing).

Index: 1. Introduction 2. Regular grammars 3. Stochastic languages 4. N-gram models 5. Perplexity

Introduction: language models. What is a language model? It is a method for defining the structure of a language, in order to restrict the search to the most probable sequences of linguistic units. Language models are useful for applications with complex syntax and/or semantics. A good LM should accept correct sentences (with high probability) and reject wrong word sequences (or assign them low probability). CLASSIC MODELS: N-grams and stochastic grammars.

Introduction: general scheme of a recognition system (block diagram: signal → measurement of parameters → comparison against models → decision rule → text, using acoustic and grammar models).

Introduction: measuring the difficulty of a task. Difficulty is determined by the real flexibility of the admitted language. Perplexity: the average number of options at each point. There are finer measures that also take into account the difficulty of the words or of the acoustic models. Speech recognizers seek the word sequence W that is most likely to have produced the acoustic evidence A. Speech recognition involves acoustic processing, acoustic modelling, language modelling, and search.

Language models (LMs) assign a probability estimate P(W) to word sequences W = {w1, ..., wn}, subject to the constraint that the probabilities of all word sequences sum to 1 (∑_W P(W) = 1). Language models help guide and constrain the search among alternative word hypotheses during recognition. For huge vocabularies, the acoustic models and the language model are integrated into a single hidden Markov macro-model covering the whole language.

Introduction: dimensions of problem difficulty: connectivity, number of speakers, and vocabulary size and language complexity (plus noise and robustness).

Introduction: MODELS BASED ON GRAMMARS * They represent language restrictions in a natural way. * They allow modelling dependencies of whatever length is required. * Defining these models is very difficult for tasks involving languages close to natural language (pseudo-natural). * Integration with the acoustic model is not very natural.

Introduction: kinds of grammars. Consider a grammar G = (N, Σ, P, S). Chomsky hierarchy: Type 0: no restrictions on the rules → too complex to be useful. Type 1: context-sensitive rules → still too complex. Type 2: context-free → used in experimental systems. Type 3: regular, or finite-state.

Grammars and automata. Every kind of grammar is associated with a kind of automaton that recognizes it: Type 0 (unrestricted): Turing machine. Type 1 (context-sensitive): linear bounded automaton. Type 2 (context-free): pushdown automaton. Type 3 (regular): finite-state automaton.

Regular grammars. A regular grammar is any right-linear or left-linear grammar. Examples: see the illustrative grammar below. Regular grammars generate exactly the regular languages: the class of languages generated by regular grammars coincides with the regular languages.
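
As an illustrative example (not from the original slides), the right-linear grammar with rules S → aA, A → bS, S → ε generates the regular language (ab)*: the rules S → aA and A → bS together emit one 'ab' pair, and the derivation can stop after any complete pair, or immediately, giving the empty string.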

Search space (figure omitted in the transcript).

An example (figure omitted in the transcript):

Grammars and stochastic languages. Add a probability to each of the production rules. A stochastic grammar is a pair (G, p), where G is a grammar and p is a function p: P → [0, 1] with the property that ∑_{r ∈ P_A} p(r) = 1, where P_A is the set of grammar rules whose antecedent (left-hand side) is A. A stochastic language over an alphabet Σ is a pair (L, p), where L is a set of strings over Σ and p assigns each string in L a probability, with ∑_{x ∈ L} p(x) = 1.
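
For instance (illustrative, not from the original slides): if a nonterminal A has exactly the rules A → aA and A → b, then setting p(A → aA) = 0.7 and p(A → b) = 0.3 satisfies the condition, because the probabilities of the rules whose antecedent is A sum to 1.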

Example (figure omitted in the transcript).

N-gram models. P(W) can be broken down with the chain rule and then approximated by conditioning each word on only the previous n-1 words: P(W) = P(w1) · P(w2 | w1) · ... · P(wn | w1 ... wn-1) ≈ ∏_i P(wi | wi-n+1 ... wi-1). When n = 2: bigrams. When n = 3: trigrams.
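
As a rough illustration (not part of the original slides), here is a minimal Python sketch of the bigram approximation; the probability table bigram_prob and the start symbol are made-up assumptions:

    # Minimal sketch of the bigram approximation P(W) ~ prod_i P(w_i | w_{i-1}).
    # The probability table below is a toy example.
    bigram_prob = {
        ("<s>", "the"): 0.5,
        ("the", "big"): 0.2,
        ("big", "dog"): 0.3,
    }

    def sentence_probability(words, bigram_prob):
        prob = 1.0
        previous = "<s>"  # sentence-start symbol
        for w in words:
            # Unseen bigrams get probability 0 here; smoothing (later slides) addresses this.
            prob *= bigram_prob.get((previous, w), 0.0)
            previous = w
        return prob

    print(sentence_probability(["the", "big", "dog"], bigram_prob))  # 0.5 * 0.2 * 0.3 = 0.03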

Example: suppose acoustic decoding assigns similar probabilities to the phrases "the pig dog" and "the big dog". If P(pig | the) = P(big | the), then the choice between them depends on the word "dog": P(the pig dog) = P(the) · P(pig | the) · P(dog | the pig), and P(the big dog) = P(the) · P(big | the) · P(dog | the big). Since P(dog | the big) > P(dog | the pig), the language model helps decode the sentence correctly. Problems: the number of parameters, and hence the number of training samples needed, grows quickly with n; for a vocabulary of V words there are on the order of V unigram, V² bigram, and V³ trigram parameters.
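
As a rough numeric illustration with made-up values: if P(dog | the big) = 0.3 and P(dog | the pig) = 0.01 while the other factors are equal, then P(the big dog) is 30 times larger than P(the pig dog), so the language model steers the recognizer towards "big".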

Advantages:
* Probabilities are based on data; parameters are determined automatically from corpora.
* They incorporate local syntax, semantics, and pragmatics.
* Many languages have a strong tendency toward standard word order and are thus substantially local.
* Relatively easy to integrate into forward search methods such as Viterbi (bigram) or A*.
Disadvantages:
* Unable to incorporate long-distance constraints.
* Not well suited for languages with flexible word order.
* Cannot easily accommodate new vocabulary items, alternative domains, or dynamic changes (e.g., discourse).
* Not as good as humans at identifying and correcting recognizer errors, or at predicting following words (or letters).
* Do not capture meaning for speech understanding.

Estimation of the probabilities. Suppose the N-gram model has been modelled with a finite automaton (unigram, bigram w1w2, and trigram w1w2w3 automata; diagrams omitted in the transcript). That is, we have a training sample on which an N-gram model, represented as a finite automaton, has been estimated. q denotes a state of the automaton, and c(q) is the total number of events (N-grams) observed in the sample while the model is in state q.

C(w|q) is the number of times that the word w has been observed in the sample while the model was in state q. P(w|q) is the probability of observing the word w conditioned on the state q. Two further quantities are the set of words observed in the sample when the model is in state q, and the total vocabulary of the language to be modelled. For example, in a bigram model: P(w|q) = C(w|q) / c(q). This approach assigns probability 0 to events that have not been seen, which causes coverage problems; the solution is to smooth the model, for example with flat, linear, non-linear, back-off, or syntactic back-off smoothing.
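
A minimal Python sketch of these count-based estimates, with add-one (Laplace) smoothing shown as one simple member of the smoothing families listed above (not necessarily the method the slides have in mind); the toy corpus is a made-up assumption:

    from collections import Counter

    # Toy corpus; in practice the counts come from a large training sample.
    corpus = ["the", "big", "dog", "the", "big", "cat"]
    vocab = set(corpus)

    bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(w | q): word w observed after history q
    history_counts = Counter(corpus[:-1])              # c(q): events observed in state q

    def p_mle(w, q):
        # Maximum-likelihood estimate P(w | q) = C(w | q) / c(q); unseen events get 0.
        return bigram_counts[(q, w)] / history_counts[q] if history_counts[q] else 0.0

    def p_laplace(w, q):
        # Add-one smoothing: no event keeps probability 0, at the cost of some probability mass.
        return (bigram_counts[(q, w)] + 1) / (history_counts[q] + len(vocab))

    print(p_mle("dog", "big"), p_laplace("cat", "dog"))  # 0.5 and 0.2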

Bigrams are easily incorporated into Viterbi search. Trigrams have been used for large-vocabulary recognition since the mid-1970s and remain the dominant language model. IBM trigram example: (table omitted in the transcript).

Methods for estimating the probability of unseen N-grams: n-gram performance can be improved by clustering words. Hard clustering puts a word into a single cluster; soft clustering allows a word to belong to multiple clusters. Clusters can be created manually or automatically: manually created clusters have worked well for small domains, while automatic clusters have been created bottom-up or top-down.
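
One concrete formulation is the class-based bigram of Brown et al. (cited in the bibliography), which factorizes the probability through word classes: P(wi | wi-1) = P(wi | ci) · P(ci | ci-1), where ci is the class of word wi; under hard clustering each word is assigned to exactly one class.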

PERPLEXITY: the average number of options. Quantifying LM complexity: one LM is better than another if it can predict an n-word test corpus W with higher probability. For LMs representable by the chain rule, comparisons are usually based on the average per-word log probability, LP = -(1/n) log2 P(W). A more intuitive representation of LP is the perplexity, PP = 2^LP (a uniform LM will have PP equal to the vocabulary size). PP is often interpreted as an average branching factor.
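
A minimal Python sketch of this computation, assuming a conditional-probability function p(w, previous) such as the smoothed estimate sketched earlier (the function name and interface are assumptions, not from the slides):

    import math

    def perplexity(words, p):
        # Average per-word log probability LP, then PP = 2**LP.
        logprob = 0.0
        previous = "<s>"
        for w in words:
            logprob += math.log2(p(w, previous))
            previous = w
        lp = -logprob / len(words)
        return 2 ** lp

    # Sanity check: a uniform model over a vocabulary of size V gives PP = V.
    V = 1000
    print(perplexity(["a", "b", "c"], lambda w, prev: 1.0 / V))  # approximately 1000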

Perplexity examples (table omitted in the transcript).

Bibliography:
P. Brown et al., "Class-Based n-gram Models of Natural Language", Computational Linguistics, 1992.
R. Lau, "Adaptive Statistical Language Modelling", S.M. Thesis, MIT.
M. McCandless, "Automatic Acquisition of Language Models for Speech Recognition", S.M. Thesis, MIT.
L. R. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, 1993.
Google