Statistical techniques in NLP Vasileios Hatzivassiloglou University of Texas at Dallas

Learning
Central to statistical NLP
In most cases, supervised methods are used, with a separate training set
Unsupervised methods (clustering) recalculate the entire model on new data

Parameterized models
Assume that the observed (training) data D is described by a given distribution
This distribution, possibly with some parameters Θ, is our model
We want to maximize the likelihood function P(D|Θ), or equivalently its logarithm log P(D|Θ)

Maximum likelihood estimation
Find the Θ that maximizes P(D|Θ)
Example: binomial distribution with D successes in N trials, P(D|m) = C(N,D) m^D (1−m)^(N−D)
Maximizing over m gives m = D/N
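As a quick illustration (a Python sketch, not part of the original slides; the counts D = 37, N = 100 are made up), the binomial likelihood below is largest exactly at m = D/N:

```python
from math import comb

def binomial_likelihood(m, D, N):
    """Likelihood of observing D successes in N trials with success probability m."""
    return comb(N, D) * m**D * (1 - m) ** (N - D)

D, N = 37, 100
m_mle = D / N                                   # closed-form maximum likelihood estimate
grid = [i / 1000 for i in range(1, 1000)]       # quick numerical check of the maximum
m_grid = max(grid, key=lambda m: binomial_likelihood(m, D, N))
print(m_mle, m_grid)                            # both 0.37
```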

Smoothing
MLE assigns zero probability to unseen events
Example: trigrams in part-of-speech tagging (23% unseen)
Solution: smoothing (small probabilities for unseen data)
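The slide does not name a particular smoothing scheme; the sketch below uses add-one (Laplace) smoothing, the simplest option, on an invented set of tag counts:

```python
from collections import Counter

def laplace_prob(counts, item, vocab_size, total):
    """Add-one (Laplace) smoothing: every event, seen or not, gets a non-zero probability."""
    return (counts[item] + 1) / (total + vocab_size)

tags = ["DT", "NN", "VB", "DT", "NN", "IN", "DT", "NN"]   # toy observed tags
counts, total = Counter(tags), len(tags)
vocab = {"DT", "NN", "VB", "IN", "JJ"}                    # "JJ" never observed

print(laplace_prob(counts, "NN", len(vocab), total))  # seen:   (3 + 1) / (8 + 5)
print(laplace_prob(counts, "JJ", len(vocab), total))  # unseen: (0 + 1) / (8 + 5), no longer zero
```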

Bayesian learning
Bayes decision rule: choose the Θ that maximizes P(Θ|D) (minimum error rate)
It is often impossible or hard to calculate P(Θ|D) directly
Use Bayes' rule: P(Θ|D) = P(D|Θ) P(Θ) / P(D)
Naïve Bayes: assume the features of D are independent given the class, so P(D|Θ) factors into a product of per-feature probabilities
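A minimal Naïve Bayes sketch (the toy word counts are invented here, loosely echoing the bank example on the next slide), with add-one smoothing and the decision rule applied in log space:

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(docs):
    """docs: list of (words, label). Returns log-priors and smoothed log-likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    log_prior = {c: log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {c: {w: log((word_counts[c][w] + 1) /
                           (sum(word_counts[c].values()) + len(vocab)))
                    for w in vocab}
                for c in class_counts}
    return log_prior, log_like

def classify(words, log_prior, log_like):
    """Choose the class maximizing log P(c) + sum of log P(w | c) over the words."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_like[c].get(w, 0.0) for w in words))

docs = [("deposit money bank interest".split(), "bank/money"),
        ("river bank water fishing".split(), "bank/river")]
print(classify("money in the bank".split(), *train_naive_bayes(docs)))  # bank/money
```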

Examples
Gale et al. (1992): 90% sense disambiguation accuracy (choosing between “bank/money” and “bank/river”)
Hindle and Rooth (1991): prepositional phrase attachment
– He ate pasta with cheese
– He ate pasta with a fork
Both rely on observable features (nearby words, the verb)

Markov models
A stochastic process follows a sequence of states over time with some transition probabilities
If the process is stationary and has limited memory, we have a Markov chain
The model can be fully visible, or have hidden states (HMM)
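A small sketch of the limited-memory assumption, with made-up part-of-speech transition probabilities: the probability of a state sequence factors into one term per transition.

```python
# Probability of a state sequence under a first-order Markov chain:
# P(s1, ..., sn) = P(s1) * product over i of P(s_i | s_{i-1})
init = {"DT": 0.6, "NN": 0.3, "VB": 0.1}                   # toy initial probabilities
trans = {"DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},       # toy transition probabilities
         "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
         "VB": {"DT": 0.50, "NN": 0.40, "VB": 0.10}}

def chain_prob(states):
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]        # only the immediately preceding state matters
    return p

print(chain_prob(["DT", "NN", "VB", "DT", "NN"]))   # 0.6 * 0.9 * 0.6 * 0.5 * 0.9
```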

Example: N-gram language models
The result for a word depends only on the word and a limited number of neighbors
Part-of-speech tagging: maximize P(t_1, …, t_n | w_1, …, w_n)
With Bayes' rule, the chain rule, and independence assumptions, this reduces to maximizing the product of P(w_i | t_i) · P(t_i | t_{i-1}) over i
Use an HMM for automatically adjusting back-off smoothing

Example: Speech recognition
Need to find the correct sequence of words given the acoustic signal
A language model (N-gram) accounts for dependencies between words
An acoustic model maps from the visible (phonemes) to the hidden (words) level
An HMM combines both
The Viterbi algorithm finds the optimal solution
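A compact sketch of Viterbi decoding for a toy HMM; the states, transitions, emissions, and observations are invented, and a real recognizer would score phoneme observations in log space:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence (HMM decoding)."""
    # V[t][s] = (best probability of a path ending in state s at time t, best predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
    # backtrack from the best final state
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ("N", "V")
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}
print(viterbi(["fish", "sleep", "fish"], states, start, trans, emit))
```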

Expectation-Maximization (EM)
In general, we can iteratively estimate complex models with hidden parameters
Define a quality function Q as the expected log-likelihood of the complete data given the current parameters
Estimate Q from an initial choice for Θ
Choose a new Θ that maximizes Q
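A classic illustration of this loop (not from the slides) is estimating the biases of two coins when we do not know which coin produced each trial: the E-step computes each trial's responsibility, and the M-step re-estimates the biases from the expected counts.

```python
from math import comb

def em_two_coins(trials, theta_a=0.6, theta_b=0.5, iters=20):
    """EM for two coins of unknown bias; each trial is (heads, flips) from an unknown coin."""
    for _ in range(iters):
        heads_a = flips_a = heads_b = flips_b = 0.0
        for h, n in trials:
            # E-step: posterior probability that this trial came from coin A
            like_a = comb(n, h) * theta_a**h * (1 - theta_a) ** (n - h)
            like_b = comb(n, h) * theta_b**h * (1 - theta_b) ** (n - h)
            r = like_a / (like_a + like_b)
            # accumulate expected counts for the M-step
            heads_a += r * h
            flips_a += r * n
            heads_b += (1 - r) * h
            flips_b += (1 - r) * n
        # M-step: re-estimate the biases from the expected counts
        theta_a, theta_b = heads_a / flips_a, heads_b / flips_b
    return theta_a, theta_b

trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
print(em_two_coins(trials))   # the two estimates separate toward roughly 0.8 and 0.5
```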

Example: PCFG parsing
Probabilistic context-free grammars
The likelihood of each rule (e.g., VP → V or VP → V NP) is a basic parameter
The combined probability of the entire tree gives the quality function
The inside-outside algorithm (the analogue of forward-backward for PCFGs) gives the solution
Lexicalization (Collins, 1996, 1997)
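A sketch of the basic quantity involved: the probability of a parse tree is the product of the probabilities of the rules it uses. The toy grammar and its probabilities below are invented:

```python
# Toy rule probabilities; in a real PCFG these would be estimated (e.g., by inside-outside
# or from a treebank), and each left-hand side's probabilities must sum to one.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")):  0.4,
    ("VP", ("V",)):       0.6,
    ("NP", ("she",)):     0.3,
    ("NP", ("pasta",)):   0.7,
    ("V",  ("ate",)):     1.0,
}

def tree_prob(tree):
    """tree = (label, children); a child is either a terminal string or another tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

tree = ("S", [("NP", ["she"]),
              ("VP", [("V", ["ate"]), ("NP", ["pasta"])])])
print(tree_prob(tree))   # 1.0 * 0.3 * 0.4 * 1.0 * 0.7 = 0.084
```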

Example: Machine Translation
The noisy channel model (Brown et al., 1991)
– Input in one language (e.g., English) is garbled into another (e.g., French)
– Estimate the probability of each word or phrase generating words or phrases in the other language, and how many of them (fertility)
A similar approach: transliteration (Knight, 1998)
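A toy sketch of the noisy-channel decision rule, choosing the English string e that maximizes P(f|e)·P(e); the phrase tables and numbers are invented, and fertility is ignored:

```python
# Noisy-channel decoding: choose the English string e maximizing P(f | e) * P(e).
lang_model  = {"the cat": 0.040, "cat the": 0.001}          # P(e): language model
trans_model = {("le chat", "the cat"): 0.3,                 # P(f | e): translation model
               ("le chat", "cat the"): 0.3}

def decode(f, candidates):
    return max(candidates,
               key=lambda e: trans_model.get((f, e), 0.0) * lang_model.get(e, 0.0))

print(decode("le chat", ["the cat", "cat the"]))   # the language model breaks the tie
```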

Linear regression
Predict the output as a linear combination of the input variables
Choose weights that minimize the sum of squared residual errors (least squares)
Can be computed efficiently via a matrix decomposition and inversion
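A sketch with NumPy's least-squares solver (which performs the decomposition internally) on synthetic data:

```python
import numpy as np

# Least-squares fit: choose weights w minimizing ||Xw - y||^2.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # bias column + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)                # noisy synthetic targets

w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # close to [1, 2, -3]
```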

Log-linear regression
The ideal output is 0 or 1
Because the distribution changes from normal to binomial, a transformed least-squares fit is not accurate
Solution: use an intermediate predictor η, the log-odds (logit) of the output
Can be approximated with iteratively reweighted least squares
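A sketch of iteratively reweighted least squares for the 0/1 case (logistic regression) on synthetic data; a production fit would add convergence checks and regularization:

```python
import numpy as np

def irls_logistic(X, y, iters=10):
    """Fit a logistic model by iteratively reweighted least squares (Newton's method)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # current predicted probabilities
        weights = np.clip(p * (1 - p), 1e-9, None)    # weights of the working LS problem
        z = X @ w + (y - p) / weights                 # working response
        w = np.linalg.solve(X.T @ (weights[:, None] * X), X.T @ (weights * z))
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_w = np.array([-1.0, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
print(irls_logistic(X, y))   # roughly [-1, 2]
```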

Examples
Text categorization for information retrieval (Yang, 1998)
Many types of sentence/word classification
– cue words (Passonneau and Litman, 1993)
– prosodic features (Pan and McKeown, 1999)

Singular-value decomposition
A technique for reducing dimensionality; data points are projected into a lower-dimensional space
Given a matrix A (n × m), find matrices T (n × k), S (k × k), and D (k × m) so that their product approximates A
S contains the top k singular values of A
Projection is achieved by multiplying T^T and A
Application: Latent Semantic Indexing
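A sketch with NumPy's SVD on a tiny invented term-document matrix, keeping the top k = 2 singular values and projecting the documents with T^T·A:

```python
import numpy as np

A = np.array([[2., 0., 1., 0.],      # toy term-document counts (4 terms x 4 documents)
              [1., 1., 0., 0.],
              [0., 2., 0., 3.],
              [0., 1., 0., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
T, S, D = U[:, :k], np.diag(s[:k]), Vt[:k, :]   # n x k, k x k, k x m
print(np.round(T @ S @ D, 2))                   # rank-2 approximation of A
docs_k = T.T @ A                                # documents projected into k dimensions
print(docs_k.shape)                             # (2, 4)
```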

Methods without an explicit probability model
Use empirical techniques to directly produce output without calculating a model
Decision trees: each node is associated with a decision on one of the input features
The tree is built incrementally by choosing the features with the most discriminatory power
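A sketch of the "most discriminatory feature" criterion using information gain; the toy features (capitalization and previous tag) and labels are invented:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label). Entropy reduction from splitting on feature."""
    labels = [y for _, y in examples]
    split = {}
    for x, y in examples:
        split.setdefault(x[feature], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

examples = [({"cap": True,  "prev": "DT"}, "NNP"),
            ({"cap": True,  "prev": "IN"}, "NNP"),
            ({"cap": False, "prev": "DT"}, "NN"),
            ({"cap": False, "prev": "IN"}, "NN")]
best = max(["cap", "prev"], key=lambda f: information_gain(examples, f))
print(best)   # "cap" separates the labels perfectly, so it is chosen first
```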

Variations on decision trees
Shrinking to prevent over-training
Decision lists (Yarowsky, 1997) use only the top feature, e.g., for accent restoration

Rule induction
Similar to decision trees, but the rules are allowed to vary and contain different operators
Examples: RIPPER (Cohen, 1996), transformation-based learning (Brill, 1996), genetic algorithms (Siegel, 1998)

Methods without an explicit model
k-Nearest Neighbor classification
Neural networks
Genetic algorithms

Support vector machines
Find the hyperplane that maximizes the distance to the nearest training points (the support vectors)
Non-linear transformation: from the original space to a separable space via a kernel function
Applications: text categorization (Joachims, 1997), OCR (Burges and Vapnik, 1996), speech recognition (Schmidt, 1996)
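A sketch using scikit-learn's SVC (the library choice is an assumption, not from the slides; any SVM implementation would do). The XOR-style data has no linear separator in the original 2-D space, but an RBF kernel makes it separable:

```python
from sklearn.svm import SVC

# Toy XOR-like data: not linearly separable in the original space,
# but an RBF kernel maps it to a space where a separating hyperplane exists.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = SVC(kernel="rbf", gamma="scale", C=10.0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.1], [0.1, 0.2]]))   # expected: [1, 0]
print(len(clf.support_vectors_))               # the points that define the margin
```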

Classification issues
Two or many classes
Classifier confidence; probability of membership in each class
Training / test set distributions
Balance of training data across classes

When to use each method?
Probabilistic models depend on distributional assumptions
Linear models (and SVD) assume a normal data distribution; generalized linear models assume a Poisson, binomial, or negative binomial distribution
Markov models capture limited dependencies
Rule-based models allow for multi-way classification more easily than linear/log-linear ones
For many applications, it is important to get a confidence estimate