SI485i : NLP Set 12 Features and Prediction

What is NLP, really?
Many of our tasks boil down to finding intelligent features of language. We do lots of machine learning over features.
NLP researchers also use linguistic insights, deep language processing, and semantics. But in practice, semantics and deep language processing end up being shoved into feature representations.

What is a feature?

Naïve Bayes had features
Bigram model: the features are the two-word phrases “the dog”, “dog ran”, “ran out”.
Each feature has a numeric value, such as how many times each bigram was seen.
You calculated probabilities for each feature. These are the feature weights.
P(d | Dickens) = P(“the dog” | Dickens) * P(“dog ran” | Dickens) * P(“ran out” | Dickens)
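A minimal sketch of this bigram-feature view, assuming made-up per-author bigram probabilities as the feature weights; the function names and numbers are illustrative, not the course's lab code:

```python
import math
from collections import Counter

def bigram_features(text):
    """Extract bigram features and their counts from a string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

# Illustrative (made-up) per-author bigram probabilities, i.e. the feature weights.
p_bigram_given_dickens = {
    ("the", "dog"): 0.002,
    ("dog", "ran"): 0.001,
    ("ran", "out"): 0.003,
}

def log_prob_doc_given_author(doc, bigram_probs, unseen=1e-6):
    """Sum log P(bigram | author) over every bigram occurrence in the document."""
    score = 0.0
    for bigram, count in bigram_features(doc).items():
        score += count * math.log(bigram_probs.get(bigram, unseen))
    return score

print(log_prob_doc_given_author("the dog ran out", p_bigram_given_dickens))
```

Working in log space avoids underflow, but it is the same product of feature probabilities as on the slide.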

What is a feature-based model?
Predicting class c is dependent solely on the features f taken from your data d.
In author prediction:
Class c: “Dickens”
Data d: a document
Features f: the n-grams you extract
In sentiment classification:
Class c: “negative sentiment”
Data d: a tweet
Features f: the words

Features appear everywhere
Distributional learning: “drink” is represented by a vector of feature counts.
The words in the grammatical object make up a vector of counts.
The words are the features and the counts/PMI scores are the weights.

Feature-Based Model
Decision 1: what features do I use?
Decision 2: how do I weight the features?
Decision 3: how do I combine the weights to make a prediction?
Decisions 2 and 3 often go hand in hand. The “model” is typically defined by how decisions 2 and 3 are made. Finding “features” is a separate task.

Feature-Based Model
Naïve Bayes model:
Decision 1: features are n-grams (or other features too!)
Decision 2: weight features using MLE: P(n-gram | class)
Decision 3: multiply the weights together
Vector-based distributional model (see the sketch below):
Decision 1: features are words, syntax, etc.
Decision 2: weight features with PMI scores
Decision 3: put the features in a vector, and use cosine similarity
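As a concrete illustration of the vector-based decisions, the sketch below builds PMI-weighted feature vectors from invented co-occurrence counts and compares two words with cosine similarity. The vocabulary and counts are made up for the example:

```python
import math

# Invented co-occurrence counts: word -> {context feature: count}
counts = {
    "drink":   {"water": 40, "coffee": 30, "beer": 20, "code": 1},
    "sip":     {"water": 10, "coffee": 15, "beer": 5,  "code": 0},
    "compile": {"water": 0,  "coffee": 1,  "beer": 0,  "code": 25},
}

total = sum(c for row in counts.values() for c in row.values())
word_totals = {w: sum(row.values()) for w, row in counts.items()}
feat_totals = {}
for row in counts.values():
    for f, c in row.items():
        feat_totals[f] = feat_totals.get(f, 0) + c

def pmi_vector(word):
    """Decision 2: weight each feature by (positive) PMI instead of raw counts."""
    vec = {}
    for f, c in counts[word].items():
        if c == 0:
            continue
        pmi = math.log((c * total) / (word_totals[word] * feat_totals[f]))
        vec[f] = max(pmi, 0.0)
    return vec

def cosine(u, v):
    """Decision 3: combine the weights with cosine similarity."""
    dot = sum(u[f] * v.get(f, 0.0) for f in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(pmi_vector("drink"), pmi_vector("sip")))      # similar words: high score
print(cosine(pmi_vector("drink"), pmi_vector("compile")))  # dissimilar words: low score
```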

MaxEnt Model
An important classifier in NLP… an exponential model.
This is not Naïve Bayes, but it does calculate probabilities.
Features are functions f_i(c,d) over the class and the data.
Feature weights are the lambdas λ_i, one per feature.
P(c | d) = exp( Σ_i λ_i f_i(c,d) ) / Σ_c′ exp( Σ_i λ_i f_i(c′,d) )
Don’t be frightened. This is easier than it looks.
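A minimal sketch of that probability computation, assuming binary indicator features and hand-picked lambda weights; the feature names, classes, and numbers are invented for illustration:

```python
import math

def maxent_probs(feature_values, lambdas, classes):
    """P(c|d) = exp(sum_i lambda_i * f_i(c,d)) / sum_c' exp(sum_i lambda_i * f_i(c',d))."""
    scores = {}
    for c in classes:
        scores[c] = math.exp(sum(lambdas[name] * value
                                 for name, value in feature_values(c).items()))
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Invented example: classify a document as "dickens" vs "austen".
classes = ["dickens", "austen"]
lambdas = {"has_bigram_the_dog&dickens": 1.2,
           "has_bigram_the_dog&austen": -0.3,
           "doc_len_long&dickens": 0.5,
           "doc_len_long&austen": 0.1}

def feature_values(c):
    """f_i(c,d): indicator features that fire jointly on the class and the document."""
    return {f"has_bigram_the_dog&{c}": 1.0,   # the document contains "the dog"
            f"doc_len_long&{c}": 1.0}          # the document is long

print(maxent_probs(feature_values, lambdas, classes))
```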

Naïve Bayes is like MaxEnt
Naïve Bayes is an exponential model too.
You know this definition: P(c | d) ∝ P(c) * Π_i P(f_i | c).
Just add exp(log(x)): this equals exp( log P(c) + Σ_i log P(f_i | c) ).
The lambdas are the log(P(x)) terms. The f(c,d) is the seen feature!
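To make the correspondence concrete, this small sketch (with made-up probabilities) computes a Naïve Bayes score both ways and shows the two forms agree:

```python
import math

# Made-up Naive Bayes parameters for one class.
p_class = 0.4
p_feature_given_class = {"the dog": 0.002, "dog ran": 0.001, "ran out": 0.003}
seen_features = ["the dog", "dog ran", "ran out"]

# Standard Naive Bayes: a product of probabilities.
product_form = p_class
for f in seen_features:
    product_form *= p_feature_given_class[f]

# Exponential form: the lambdas are the log-probabilities, and each seen
# feature contributes its lambda once (f(c,d) = 1 when the feature is seen).
lambdas = {f: math.log(p) for f, p in p_feature_given_class.items()}
exponential_form = math.exp(math.log(p_class) + sum(lambdas[f] for f in seen_features))

print(product_form, exponential_form)  # the two numbers match (up to float rounding)
```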

MaxEnt
So Naïve Bayes is just features with weights. The weights are probabilities.
MaxEnt: “stop requiring the weights to be probabilities.”
Learn the best weights for P(c | d), the weights that optimize your guesses of c.
How? Not this semester…
Hint: take derivatives with respect to the lambdas and find the maximum.
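For the curious, the hint amounts to gradient ascent on the conditional log-likelihood: the derivative for each lambda is the observed feature count minus the count the model expects. A toy sketch, with invented features and data, just to show the shape of the update:

```python
import math

# Toy training data: each example is (features fired, true class).
# Weights are joint (feature, class) lambdas, as on the MaxEnt slide.
data = [({"long": 1.0}, "dickens"), ({"long": 0.0}, "austen")]
classes = ["dickens", "austen"]
lambdas = {("long", c): 0.0 for c in classes}

def probs(feats):
    scores = {c: math.exp(sum(lambdas[(f, c)] * v for f, v in feats.items())) for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

for step in range(100):
    grad = {k: 0.0 for k in lambdas}
    for feats, true_c in data:
        p = probs(feats)
        for f, v in feats.items():
            grad[(f, true_c)] += v          # observed feature count
            for c in classes:
                grad[(f, c)] -= p[c] * v    # expected count under the model
    for k in lambdas:
        lambdas[k] += 0.5 * grad[k]         # gradient ascent step

print(lambdas)  # "long" ends up pushing toward "dickens" in this toy setup
```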

MaxEnt vs Naïve Bayes
Naïve Bayes:
Trained to maximize the joint likelihood of data and classes: P(c,d).
Features assumed to supply independent evidence.
Feature weights can be set independently.
MaxEnt:
Trained to maximize the conditional likelihood of classes: P(c|d).
Feature weights take feature dependence into account.
Feature weights must be mutually estimated.

MaxEnt vs Naïve Bayes
Naïve Bayes:
Trained to maximize the joint likelihood of data and classes: P(c,d).
P(c|d) = P(c,d) / P(d)
So it learns the entire joint model P(c,d) even though we only care about P(c|d).
MaxEnt:
Trained to maximize the conditional likelihood of classes: P(c|d).
P(c|d) is modeled directly by the exponential formula above.
So it learns directly what we care about. It’s hard to learn P(c,d) correctly… so don’t try.

What should you use?
MaxEnt usually outperforms Naïve Bayes. Why?
MaxEnt learns better weights for the features.
Naïve Bayes makes too many independence assumptions about the features, so the model is too general.
On the other hand, MaxEnt learns “optimal” weights, so they may be too specific to your training set and not work in the wild!
Use MaxEnt, or at least try both to see which is best for your task.
Several implementations are available online. Weka is a popular, easy-to-use option: darmstadt.de/fileadmin/user_upload/GROUP_WERC/LKE/tutorials/ML-tutorial-5-6.pdf
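If you want to try both on a text task, a library such as scikit-learn (used here as a stand-in for the Weka tool the slide mentions) makes the comparison a few lines. The tiny dataset is invented and far too small to be meaningful; it only shows the workflow:

```python
# Assumes scikit-learn is installed; LogisticRegression plays the role of MaxEnt here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

docs = ["the dog ran out into the fog",                 # invented "dickens"-ish examples
        "it was the best of dogs, the worst of dogs",
        "she walked out to the garden party",           # invented "austen"-ish examples
        "the party admired the garden greatly"]
labels = ["dickens", "dickens", "austen", "austen"]

vectorizer = CountVectorizer(ngram_range=(1, 2))        # unigram + bigram features
X = vectorizer.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)
maxent = LogisticRegression(max_iter=1000).fit(X, labels)

test = vectorizer.transform(["the dog walked out to the garden"])
print("Naive Bayes:", nb.predict(test), nb.predict_proba(test))
print("MaxEnt:     ", maxent.predict(test), maxent.predict_proba(test))
```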

Exercise on Features
Deep down here by the dark water lived old Gollum, a small, slimy creature. He was dark as darkness, except for two big, round, pale eyes in his thin face. He lived on a slimy island of rock in the middle of the lake. Bilbo could not see him, but Gollum was watching him now from the distance, with his pale eyes like telescopes.
The word “Bilbo” is a person. What features would help a computer identify it as a person token?
Classes: person, location, organization, none
Data: the text, specifically the word “Bilbo”
Features: ???
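One possible answer sketch; the feature set below is a guess at what a reasonable classifier might use, not the official solution. Capitalization, suffix shape, and the surrounding words can all be read straight off the token and its context:

```python
def token_features(tokens, i):
    """Possible features for classifying tokens[i] as person/location/organization/none."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "word=" + word.lower(): 1,
        "is_capitalized": int(word[0].isupper()),
        "prev_word=" + prev_word.lower(): 1,   # e.g. "said" or "Mr." before a name
        "next_word=" + next_word.lower(): 1,   # e.g. "could" or "said" after a name
        "suffix3=" + word[-3:].lower(): 1,
        "prev_is_sentence_start": int(prev_word == "<START>" or prev_word.endswith(".")),
    }

tokens = "Bilbo could not see him , but Gollum was watching him now".split()
print(token_features(tokens, 0))   # features for the token "Bilbo"
```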

Sequence Models
This exercise brings us to sequence models. Sometimes classifying one word helps classify the next word (Markov chains!).
“Bilbo Baggins said …”
If your classifier thought Bilbo was a name, then use that as a feature when you try to classify Baggins. This will boost the chance that you also label Baggins as a name.
Feature = “was the previous word a name?”
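A tiny, hand-wired sketch of the idea (the scoring rule is invented for illustration, not a trained model): classify left to right and feed each predicted label in as a feature for the next token.

```python
def classify(word, prev_label):
    """Toy scoring rule: capitalization suggests a name, and a preceding
    name label boosts the score (the 'previous word was a name' feature)."""
    score = 0.0
    if word[0].isupper():
        score += 1.0
    if prev_label == "NAME":
        score += 0.8
    return "NAME" if score >= 1.0 else "OTHER"

def tag_sequence(tokens):
    labels = []
    prev_label = "OTHER"
    for word in tokens:
        prev_label = classify(word, prev_label)
        labels.append(prev_label)
    return labels

print(tag_sequence("Bilbo Baggins said hello to Gandalf".split()))
# ['NAME', 'NAME', 'OTHER', 'OTHER', 'OTHER', 'NAME']
```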

Sequence Models
We don’t have time to cover sequence models. See your textbook.
These are very influential and appear in several places:
Speech recognition
Named entity recognition (labeling names, as in our exercise)
Information extraction