NLP Document and Sequence Models. Computational models of how natural languages work; these are sometimes called Language Models or Grammars.

Presentation transcript:

NLP Document and Sequence Models

Computational models of how natural languages work. These are sometimes called Language Models or Grammars. Three main types (among many others):
1. Document models, or "topic" models
2. Sequence models: Markov models, HMMs, others
3. Context-free grammar models

Computational models of how natural languages work. Most of the models I will show you are:
- Probabilistic models
- Graphical models
- Generative models
In other words, they are essentially Bayes Nets. In addition, many (but not all) are:
- Latent variable models
This means that some variables in the model are not observed in data and must be inferred (like the hidden states in an HMM).

Topic Models

Three documents with the word "play" (numbers & colors → topic assignments)

Example: topics from an educational corpus (TASA): 37K docs, 26K words, 1,700 topics, e.g.:
- PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
- PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
- TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
- JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
- HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
- STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Polysemy: the same word appears with high probability in several of the topics above, e.g., PLAY (theater vs. sports), COURT (sports vs. law), CHARACTERS (printing vs. theater), and EVIDENCE (law vs. science).

Topic Modeling Techniques. The two most common techniques are:
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
Commonly-used software packages:
- Mallet, a Java toolkit for various NLP-related tasks like document classification; it includes a widely-used implementation of LDA
- Stanford Topic Modeling Toolbox
- A list of implementations for various topic modeling techniques

The LDA Model (plate diagram with parameters α, θ, topics z_n, and words w_n). For each document:
- Choose θ ~ Dirichlet(α)
- For each of the N words w_n:
  - Choose a topic z_n ~ Multinomial(θ)
  - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
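A minimal sketch of this generative process in Python (the vocabulary, number of topics, and topic-word probabilities β are toy values invented for illustration; in real LDA the β and θ values are learned, not fixed by hand):

import numpy as np

rng = np.random.default_rng(0)

vocab = ["play", "stage", "actor", "team", "game", "score"]   # toy vocabulary (assumed)
num_topics = 2
alpha = np.ones(num_topics) * 0.5                              # Dirichlet prior over topic proportions
# beta[k] = word distribution for topic k (hand-set here; LDA would estimate these)
beta = np.array([[0.40, 0.30, 0.25, 0.02, 0.02, 0.01],
                 [0.20, 0.01, 0.01, 0.30, 0.28, 0.20]])

def generate_document(num_words=10):
    theta = rng.dirichlet(alpha)               # this document's topic proportions
    words, topics = [], []
    for _ in range(num_words):
        z = rng.choice(num_topics, p=theta)    # choose a topic for this position
        w = rng.choice(len(vocab), p=beta[z])  # choose a word from that topic
        topics.append(int(z))
        words.append(vocab[w])
    return words, topics

print(generate_document())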

Applications:
- Visualization and exploration of a corpus
- Tracking news stories as they evolve
- Pre-processing of a corpus for document classification tasks

(slide from tutorial by David Blei, KDD 2011)

Microsoft’s Twahpic System

LDA/pLSA for Text Classification. Topic models are easy to incorporate into text classification:
1. Train a topic model using a big corpus
2. Decode the topic model (find the best topic/cluster for each word) on a training set
3. Train a classifier using the topic/cluster as a feature
4. On a test document, first decode the topic model, then make a prediction with the classifier
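A sketch of these four steps, assuming gensim for the topic model and scikit-learn for the classifier (the library choice and the tiny training data are assumptions for illustration, not something the slides prescribe; here whole-document topic proportions serve as the features):

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.linear_model import LogisticRegression

train_docs = [["referee", "goal", "match"], ["senate", "vote", "bill"]]   # toy tokenized documents
train_labels = ["sports", "politics"]

# 1. Train a topic model on a (normally much bigger) corpus
dictionary = Dictionary(train_docs)
bows = [dictionary.doc2bow(doc) for doc in train_docs]
lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=2, random_state=0)

def topic_features(doc):
    # 2. "Decode" the topic model: topic proportions become the feature vector
    bow = dictionary.doc2bow(doc)
    dense = [0.0] * lda.num_topics
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = prob
    return dense

# 3. Train a classifier on the topic features
clf = LogisticRegression().fit([topic_features(d) for d in train_docs], train_labels)

# 4. On a test document, decode the topic model, then predict
print(clf.predict([topic_features(["goal", "match", "referee"])]))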

Why use a topic model for classification?
- Topic models help handle polysemy and synonymy: the count for a topic in a document can be much more informative than the count of individual words belonging to that topic.
- Topic models help combat data sparsity: you can control the number of topics, and at a reasonable choice for this number you'll observe the topics many times in training data (unlike individual words, which may be very sparse).

Synonymy and Polysemy (example from Lillian Lee)
- Synonymy: documents appear dissimilar if you compare words, but are related. E.g., a training document with "auto engine bonnet tyres lorry boot" vs. one with "car emissions hood make model trunk" describe the same topic with different vocabulary.
- Polysemy: documents appear similar if you compare words, but are not truly related. E.g., "car emissions hood make model trunk" vs. "make hidden Markov model emissions normalize" share the words "make", "model", and "emissions" used with different meanings.

Sequence Models. Sequence models are models that can predict the likelihood of a sequence of text (e.g., a sentence), sometimes using latent state variables. Let x_1, …, x_N be the words in a sentence; P(x_1, …, x_N) is the likelihood of the sentence. We'll look at two types of generative sequence models:
- N-gram models, which are slightly fancy versions of Markov models
- Hidden Markov Models, which you've seen before

What’s a sequence model for?
- Speech recognition: often the acoustic model will be confused between several possible words for a given sound. Speech recognition systems choose between these possibilities by selecting the one with the highest probability, according to a sequence model.
- Machine translation: often the translation model will be confused between several possible translations of a given phrase. The system chooses between these possibilities by selecting the one with the highest probability, according to a sequence model.
- Many other applications: handwriting recognition, spelling correction, optical character recognition, …

Example Language Model Application. Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Straightforward model: estimate P(words | acoustic signal) directly. But this can be hard to train effectively.

Example Language Model Application. Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Traditional solution: Bayes' Rule:
P(words | acoustics) = P(acoustics | words) × P(words) / P(acoustics)
- P(acoustics): ignore; it doesn't matter for picking a good text
- P(acoustics | words): Acoustic Model (easier to train)
- P(words): Language Model

Example Language Model Application. In this decomposition, P(acoustics) is ignored, P(acoustics | words) is the acoustic model (easier to train), and the language model term P(words) is exactly a sequence model.
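A tiny illustration of this rescoring idea in Python (the candidate transcriptions, acoustic-model scores, and language-model table are all invented for the example; a real system would get these from trained models):

import math

# Hypothetical candidate transcriptions with acoustic-model log-probabilities
# log P(acoustics | words); the numbers are made up for illustration.
candidates = {
    "recognize speech": -12.3,
    "wreck a nice beach": -11.8,
}

def lm_logprob(sentence):
    # Stand-in for a real language model P(words); here a tiny hand-set table.
    table = {"recognize speech": 1e-6, "wreck a nice beach": 1e-10}
    return math.log(table[sentence])

# Bayes' rule: pick argmax over words of P(acoustics | words) * P(words);
# P(acoustics) is the same for every candidate, so it can be ignored.
best = max(candidates, key=lambda words: candidates[words] + lm_logprob(words))
print(best)   # the language model tips the choice toward "recognize speech"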

N-gram Models
P(w_1, …, w_T) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_2, w_1) × P(w_4 | w_3, w_2, w_1) × … × P(w_T | w_T-1, …, w_1)
N-gram models make the assumption that the next word depends only on the previous N-1 words. For example, a 2-gram model (usually called a bigram model):
P(w_1, …, w_T) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_2) × P(w_4 | w_3) × … × P(w_T | w_T-1)
Notice: this is a Markov model, where the states are words.
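A minimal maximum-likelihood bigram model in Python (the two-sentence corpus and the <s>/</s> boundary markers are assumptions for illustration; real n-gram models add smoothing, as discussed in the language-modeling references above):

from collections import Counter

corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "sleeps", "</s>"]]          # toy training sentences

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])                              # history (context) counts
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(w, prev):
    # Maximum-likelihood estimate of P(w | prev); real systems smooth these counts.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    p = 1.0
    for prev, w in zip(["<s>"] + words, words + ["</s>"]):
        p *= p_bigram(w, prev)
    return p

print(sentence_prob(["the", "dog", "barks"]))   # 0.5: "the" -> "dog" is the only uncertain step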

N-gram Tools
- IRSTLM, a C++ n-gram toolkit often used for speech recognition
- Berkeleylm, another n-gram toolkit, in Java
- Moses, a machine translation toolkit
- CMU Sphinx, an open-source speech recognition toolkit

Some HMM Tools
- jHMM, a Java implementation that's relatively easy to use
- An MPI implementation of HMMs, for training large HMMs in a distributed environment

(HMM Demo)
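The live demo isn't reproducible in a transcript; as a stand-in, here is a minimal sketch of HMM decoding (the Viterbi algorithm) in Python. The two states, the transition/emission probabilities, and the two-word vocabulary are all invented for illustration:

import numpy as np

# Hypothetical 2-state HMM (states could be, e.g., coarse part-of-speech tags).
states = ["NOUN", "VERB"]
start = np.log([0.7, 0.3])
trans = np.log([[0.4, 0.6],     # P(next state | NOUN)
                [0.8, 0.2]])    # P(next state | VERB)
vocab = {"dogs": 0, "bark": 1}
emit = np.log([[0.8, 0.2],      # P(word | NOUN)
               [0.3, 0.7]])     # P(word | VERB)

def viterbi(words):
    obs = [vocab[w] for w in words]
    V = start + emit[:, obs[0]]                 # log-prob of best path ending in each state
    back = []
    for o in obs[1:]:
        scores = V[:, None] + trans             # scores[i, j]: best path ending in i, then i -> j
        back.append(scores.argmax(axis=0))      # remember the best predecessor of each state
        V = scores.max(axis=0) + emit[:, o]
    path = [int(V.argmax())]
    for b in reversed(back):                    # follow backpointers to recover the state sequence
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["dogs", "bark"]))   # expected: ['NOUN', 'VERB']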

Conditional Random Fields. CRFs, like HMMs, are latent-variable sequence models. However, CRFs are discriminative instead of generative: they cannot tell you P(x_1, …, x_N, z_1, …, z_N), and they cannot tell you P(x_1, …, x_N), but they are often better than HMMs at predicting P(z_1, …, z_N | x_1, …, x_N). For this reason, they are often used as sequence labelers.

Example Sequence Labeling Task (slide from Jenny Rose Finkel, ACL 2005)

Other common sequence-labeling tasks: Tokenization/segmentation, where each character in the string is assigned a label:
SSNNNNNNNNNNNSSNSNNSSNNNNNSSNNNSS
“Tokenization isn’t always easy.”
Chinese word segmentation (Pi-Chuan Chang, Michel Galley and Chris Manning, 2008).

Other common sequence-labeling tasks: Part-of-speech tagging:
“/“ Tokenization/N is/V n’t/Adv always/Adv easy/Adj ./. ”/”
Relation extraction:
Drug/O giant/O Pfizer/B-arg Inc./I-arg has/O reached/B-rel an/I-rel agreement/I-rel to/I-rel buy/I-rel the/O private/O biotechnology/O firm/O Rinat/B-arg Neuroscience/I-arg Corp./I-arg ,/O the/O companies/O announced/O Thursday/O
→ buy(Pfizer Inc., Rinat Neuroscience Corp.)
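A sketch of a CRF sequence labeler for a tagging task like the ones above, using the sklearn-crfsuite library (the library choice, the feature set, and the tiny training data are assumptions for illustration, not something the slides specify):

import sklearn_crfsuite

def word_features(sent, i):
    # Feature dictionary for the i-th word: CRFs consume arbitrary overlapping features like these.
    w = sent[i]
    return {
        "word.lower": w.lower(),
        "suffix3": w[-3:],
        "is_title": w.istitle(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<s>",
    }

train_sents = [["Pfizer", "buys", "Rinat"], ["companies", "announced", "deals"]]
train_tags = [["NNP", "VBZ", "NNP"], ["NNS", "VBD", "NNS"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

test = ["Pfizer", "announced", "deals"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))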

ReVerb demo

Some CRF implementations and tools:
- a package with a variety of NLP tools, including CRFs
- a CRF toolkit
- a variety of NLP tools, including several CRF models