Machine Learning for Sequential Data
Thomas G. Dietterich
Department of Computer Science, Oregon State University, Corvallis, Oregon 97331

Outline
• Sequential Supervised Learning
• Research Issues
• Methods for Sequential Supervised Learning
• Concluding Remarks

Some Example Learning Problems
• Cellular Telephone Fraud
• Part-of-Speech Tagging
• Information Extraction from the Web
• Hyphenation for Word Processing

Cellular Telephone Fraud
• Given the sequence of recent telephone calls, can we determine which calls (if any) are fraudulent?

Part-of-Speech Tagging
• Given an English sentence, can we assign a part of speech to each word?
• "Do you want fries with that?"

Information Extraction from the Web
  Srinivasan Seshan (Carnegie Mellon University)
  Making Virtual Worlds Real
  Tuesday, June 4, :00 PM, 322 Sieg
  Research Seminar
• Each token is labeled with its field: name ("Srinivasan Seshan"), affiliation ("Carnegie Mellon University"), title ("Making Virtual Worlds Real"), date ("Tuesday, June 4"), time (":00 PM"), location ("322 Sieg"), event-type ("Research Seminar")

Hyphenation
• "Porcupine" → "Por-cu-pine"

Sequential Supervised Learning (SSL)
• Given: A set of training examples of the form (X_i, Y_i), where X_i = ⟨x_{i,1}, …, x_{i,T_i}⟩ and Y_i = ⟨y_{i,1}, …, y_{i,T_i}⟩ are sequences of length T_i
• Find: A function f for predicting new sequences: Y = f(X).
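To make the problem statement concrete, here is a minimal Python sketch (not from the talk) of what an SSL training set and a learned predictor f might look like. The per-token majority-vote baseline and all names are illustrative assumptions, not a real SSL method.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Sequence, Tuple

# One training example: an input sequence X_i and a label sequence Y_i
# of the same length T_i.  The element types are placeholders.
Example = Tuple[Sequence[str], Sequence[str]]

def train_ssl(examples: List[Example]) -> Callable[[Sequence[str]], List[str]]:
    """Hypothetical trainer returning a function f with Y = f(X).
    This baseline ignores sequence structure entirely: it predicts each
    token's most frequent training label (real SSL methods do much better)."""
    counts = defaultdict(Counter)
    for X, Y in examples:
        for x, y in zip(X, Y):
            counts[x][y] += 1
    overall = Counter(y for _, Y in examples for y in Y).most_common(1)[0][0]
    def f(X: Sequence[str]) -> List[str]:
        return [counts[x].most_common(1)[0][0] if x in counts else overall
                for x in X]
    return f

f = train_ssl([("Do you want fries with that".split(),
                ["verb", "pron", "verb", "noun", "prep", "pron"])])
print(f("Do you want".split()))   # ['verb', 'pron', 'verb']
```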

Examples as Sequential Supervised Learning

Domain                  | Input X_i            | Output Y_i
Telephone Fraud         | sequence of calls    | sequence of labels {ok, fraud}
Part-of-Speech Tagging  | sequence of words    | sequence of parts of speech
Information Extraction  | sequence of tokens   | sequence of field labels {name, …}
Hyphenation             | sequence of letters  | sequence of {0, 1} (1 = hyphen ok)

Two Kinds of Relationships
• Relationships between the x_t's and y_t's. Example: "Friday" is usually a "date"
• Relationships among the y_t's. Example: "name" is usually followed by "affiliation"
• SSL can (and should) exploit both kinds of information

Two Other Tasks that are Not SSL
• Sequence Classification
• Time-Series Prediction

Sequence Classification
• Given an input sequence, assign one label to the entire sequence
• Example: Recognize a person from their handwriting. Input sequence: sequence of pen strokes. Output label: name of the person

Time-Series Prediction
• Given a sequence ⟨y_1, …, y_t⟩, predict y_{t+1}
• Example: Predict the unemployment rate for next month based on the history of unemployment rates

Key Differences
• In SSL, there is one label y_{i,t} for each input x_{i,t}
• In SSL, we are given the entire X sequence before we need to predict any of the y_t values
• In SSL, we do not have any of the true y values when we predict y_{t+1}

Outline
• Sequential Supervised Learning
• Research Issues
• Methods for Sequential Supervised Learning
• Concluding Remarks

Research Issues for SSL
• Loss Functions: How do we measure performance?
• Feature Selection and Long-Distance Interactions: How do we model relationships among the y_t's, especially long-distance effects?
• Computational Cost: How do we make it efficient?

Basic Loss Functions
• Count the number of entire sequences Y_i correctly predicted (i.e., every y_{i,t} must be right)
• Count the number of individual labels y_{i,t} correctly predicted
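A small Python sketch of the two basic measures, written here as accuracies; the function names are mine, not from the talk.

```python
def whole_sequence_accuracy(true_seqs, pred_seqs):
    """Fraction of sequences Y_i predicted entirely correctly
    (every y_{i,t} must be right)."""
    correct = sum(list(Y) == list(P) for Y, P in zip(true_seqs, pred_seqs))
    return correct / len(true_seqs)

def per_label_accuracy(true_seqs, pred_seqs):
    """Fraction of individual labels y_{i,t} predicted correctly."""
    total = sum(len(Y) for Y in true_seqs)
    correct = sum(y == p
                  for Y, P in zip(true_seqs, pred_seqs)
                  for y, p in zip(Y, P))
    return correct / total
```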

More Complex Loss Functions
  (figure: a timeline of phone calls with rows for the true labels, the phone calls, and the loss)
• Loss is computed for the first "fraudulent" prediction

More Complex Loss Functions (2)
• Hyphenation: false positives are very bad; we need at least one correct hyphen near the middle of the word

Hyphenation Loss
• Perfect: "qual-i-fi-ca-tion"
• Very good: "quali-fi-cation"
• OK: "quali-fication", "qualifi-cation"
• Worse: "qual-ification", "qualifica-tion"
• Very bad: "qua-lification", "qualificatio-n"

Feature Selection and Long-Distance Effects
• Any solution to SSL must employ some form of divide-and-conquer
• How do we determine the information relevant for predicting y_t?

Long-Distance Effects
• Consider the text-to-speech problem: "photograph" → / / vs. "photography" → / /
• The letter "y" changes the pronunciation of all the vowels in the word!

Standard Feature Selection Methods
• Wrapper methods with forward selection or backward elimination
• Optimize feature weights
• Measures of feature influence
• Fit simple models to test for relevance

Wrapper Methods
• Stepwise regression
• Wrapper methods (Kohavi, et al.)
• Problem: very inefficient with large numbers of possible features

Optimizing the Feature Weights
• Start with all features in the model
• Encourage the learning algorithm to remove irrelevant features
• Problem: there are too many possible features; we can't include them all in the model

Measures of Feature Influence
• Importance of single features: mutual information, correlation
• Importance of feature subsets: schema racing (Moore, et al.), RELIEFF (Kononenko, et al.)
• Question: Will subset methods scale to thousands of features?

Fitting Simple Models
• Fit simple models using all of the features, then analyze the resulting model to determine feature importance: belief networks and Markov blanket analysis, L1 support vector machines
• Prediction: these will be the most practical methods

Outline
• Sequential Supervised Learning
• Research Issues
• Methods for Sequential Supervised Learning
• Concluding Remarks

Methods for Sequential Supervised Learning
• Sliding Windows
• Recurrent Sliding Windows
• Hidden Markov Models and company:
  - Maximum Entropy Markov Models
  - Input-Output HMMs
  - Conditional Random Fields

Sliding Windows
  ___ Do you want fries with that ___
  (___, Do, you) → verb
  (Do, you, want) → pron
  (you, want, fries) → verb
  (want, fries, with) → noun
  (fries, with, that) → prep
  (with, that, ___) → pron
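A sketch of how the window construction above could be implemented; the window width and the padding symbol are parameters I chose for illustration.

```python
def sliding_windows(words, tags, half_width=1, pad="___"):
    """Turn one labeled sentence into (window, center-label) training pairs,
    converting SSL into ordinary supervised learning."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    window_size = 2 * half_width + 1
    return [(tuple(padded[t:t + window_size]), tags[t])
            for t in range(len(words))]

pairs = sliding_windows("Do you want fries with that".split(),
                        ["verb", "pron", "verb", "noun", "prep", "pron"])
# pairs[0] == (('___', 'Do', 'you'), 'verb')
# pairs[1] == (('Do', 'you', 'want'), 'pron')
```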

Properties of Sliding Windows
• Converts SSL to ordinary supervised learning
• Only captures the relationship between (part of) X and y_t; does not explicitly model relations among the y_t's
• Does not capture long-distance interactions
• Assumes each window is independent

Recurrent Sliding Windows
  ___ Do you want fries with that ___
  (___, Do, you), prev = ___ → verb
  (Do, you, want), prev = verb → pron
  (you, want, fries), prev = pron → verb
  (want, fries, with), prev = verb → noun
  (fries, with, that), prev = noun → prep
  (with, that, ___), prev = prep → pron

Recurrent Sliding Windows
• Key idea: include y_t as an input feature when computing y_{t+1}
• During training: use the correct value of y_t, or train iteratively (especially recurrent neural networks)
• During evaluation: use the predicted value of y_t
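A left-to-right prediction loop in this style might look like the following sketch; `classifier` stands for any previously trained classifier mapping a feature tuple to a label (an assumed placeholder, not a specific library call).

```python
def recurrent_predict(words, classifier, half_width=1, pad="___"):
    """Predict labels left to right, feeding the previous *predicted*
    label back in as an extra input feature."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    window_size = 2 * half_width + 1
    preds, prev_label = [], pad
    for t in range(len(words)):
        features = tuple(padded[t:t + window_size]) + (prev_label,)
        prev_label = classifier(features)   # predicted y_t becomes an input at t+1
        preds.append(prev_label)
    return preds
```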

Properties of Recurrent Sliding Windows
• Captures relationships among the y's, but only in one direction!
• Results on text-to-speech:

Method           | Direction  | Words | Letters
sliding window   | none       | 12.5% | 69.6%
recurrent s. w.  | left-right | 17.0% | 67.9%
recurrent s. w.  | right-left | 24.4% | 74.2%

Hidden Markov Models
  (graphical model: y_1 → y_2 → y_3 → …, with each y_t → x_t)
• The y_t's are generated as a Markov chain
• The x_t's are generated independently (as in naïve Bayes or Gaussian classifiers)
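A short sketch of this generative story, assuming the parameters are stored as plain dictionaries start_p[y], trans_p[y][y'], and emit_p[y][x] (names of my choosing).

```python
import random

def sample_hmm(start_p, trans_p, emit_p, length):
    """Generate (X, Y) from the HMM's generative story: the y_t's follow a
    Markov chain, and each x_t is drawn independently given its y_t."""
    def draw(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]
    Y = [draw(start_p)]
    for _ in range(length - 1):
        Y.append(draw(trans_p[Y[-1]]))          # y_{t+1} ~ P(. | y_t)
    X = [draw(emit_p[y]) for y in Y]            # x_t ~ P(. | y_t), independently
    return X, Y
```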

Hidden Markov Models (2)
• Models both the x_t ↔ y_t relationships and the y_t ↔ y_{t+1} relationships
• Does not handle long-distance effects: everything must be captured by the current label y_t
• Does not permit rich X ↔ y_t relationships: unlike the sliding window, we can't use several x_t's to predict y_t

Using HMMs
• Training: extremely simple, because the y_t's are known on the training set
• Execution: dynamic programming methods
  - If the loss function depends on the whole sequence, use the Viterbi algorithm: argmax_Y P(Y | X)
  - If the loss function depends on individual y_t predictions, use the forward-backward algorithm: argmax_{y_t} P(y_t | X)
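A minimal Viterbi sketch for argmax_Y P(Y | X), assuming the same dictionary-of-dictionaries parameter tables as in the sampling sketch above; events never seen in training fall back to a small floor probability (my simplification, not part of the talk).

```python
import math

def viterbi(x_seq, states, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely label sequence for an HMM, computed in log space."""
    def lp(p):
        return math.log(p if p > 0 else floor)

    # Initialization: best score of each state after the first observation.
    V = [{y: lp(start_p.get(y, 0)) + lp(emit_p.get(y, {}).get(x_seq[0], 0))
          for y in states}]
    back = [{}]
    # Recursion: extend the best path into each state at each time step.
    for t in range(1, len(x_seq)):
        V.append({}); back.append({})
        for y in states:
            prev, score = max(((yp, V[t - 1][yp] + lp(trans_p.get(yp, {}).get(y, 0)))
                               for yp in states), key=lambda pair: pair[1])
            V[t][y] = score + lp(emit_p.get(y, {}).get(x_seq[t], 0))
            back[t][y] = prev
    # Backtrace the best final state to recover the full path.
    path = [max(V[-1], key=V[-1].get)]
    for t in range(len(x_seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```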

HMM Alternatives: Maximum Entropy Markov Models
  (graphical model: each label y_t depends on the previous label and on the input, i.e., arrows x_t → y_t and y_t → y_{t+1})

MEMM Properties
• Permits complex X ↔ y_t relationships by employing a sparse maximum entropy model of P(y_{t+1} | X, y_t):
  P(y_{t+1} | X, y_t) ∝ exp( Σ_b λ_b f_b(X, y_t, y_{t+1}) )
  where each f_b is a boolean feature
• Training can be expensive (gradient descent or iterative scaling)
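A sketch of the locally normalized next-label distribution; the feature functions, weights, and argument order are illustrative assumptions rather than a fixed API.

```python
import math

def memm_local_distribution(weights, features, X, t, y_prev, labels):
    """P(y_{t+1} | X, y_t) proportional to exp(sum_b lambda_b * f_b(X, t, y_prev, y_next)),
    normalized over the candidate labels at this single position."""
    scores = {y: sum(w * f(X, t, y_prev, y) for w, f in zip(weights, features))
              for y in labels}
    m = max(scores.values())                       # subtract max for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exp_scores.values())                   # local (per-step) normalizer
    return {y: v / Z for y, v in exp_scores.items()}
```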

HMM Alternatives (2): Input-Output HMM
  (graphical model: hidden states h_1 → h_2 → h_3 → … form a Markov chain; each h_t receives input x_t and emits output y_t)

IOHMM Properties
• Hidden states permit "memory" of long-distance effects (beyond what is captured by the class labels)
• As with the MEMM, arbitrary features of the input X can be used to predict y_t

Label Bias Problem
• Forward models that are normalized at each step exhibit a problem
• Consider a domain with only two sequences: "rib" → "111" and "rob" → "222"
• Consider what happens when an MEMM sees the sequence "rib"

Label Bias Problem (2)
• After "r", both labels 1 and 2 have the same probability
• After "i", label 2 must still send all of its probability forward, even though it was expecting "o"
• Result: both output strings "111" and "222" have the same probability
  (finite-state picture: from the start, "r" leads to two branches, one expecting "i" then "b", the other expecting "o" then "b")
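A tiny hand-built numerical illustration of this effect (the raw scores below are made up; what matters is that each label has a single successor, so local normalization ignores the observation):

```python
def local_conditional(prev_label, obs):
    """MEMM-style locally normalized next-label distribution for the
    "rib"/"rob" example.  Each label has exactly one successor, so however
    strongly the observation disagrees, normalization gives it probability 1."""
    raw = 5.0 if (prev_label, obs) in {("1", "i"), ("2", "o")} else 0.1
    successor = {"1": "1", "2": "2"}[prev_label]
    return {successor: raw / raw}                  # always 1.0

def sequence_probability(labels, observations):
    p = 0.5                                        # after "r", labels 1 and 2 tie
    for prev, nxt, obs in zip(labels, labels[1:], observations[1:]):
        p *= local_conditional(prev, obs).get(nxt, 0.0)
    return p

print(sequence_probability("111", "rib"))   # 0.5
print(sequence_probability("222", "rib"))   # 0.5 -- the "i" made no difference
```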

Conditional Random Fields
  (graphical model: the labels y_1, y_2, y_3, … form an undirected chain, with each y_t also connected to the inputs)
• The y_t's form a Markov random field conditioned on X

Representing the CRF Parameters
• Each undirected arc y_t ↔ y_{t+1} represents a potential function:
  M(y_t, y_{t+1} | X) = exp[ Σ_a λ_a f_a(y_t, y_{t+1}, X) + Σ_b μ_b g_b(y_t, X) ]
  where the f_a and g_b are arbitrary boolean features

Using CRFs
• P(Y | X) ∝ M(y_1, y_2 | X) · M(y_2, y_3 | X) · … · M(y_{T-1}, y_T | X)
• Training: gradient descent or iterative scaling
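A brute-force sketch of these quantities; the feature functions, their extra position argument t, and the explicit enumeration over all label sequences are simplifications of mine for illustration.

```python
import math
from itertools import product

def potential(y_prev, y_next, X, t, edge_feats, node_feats, lam, mu):
    """Edge potential M(y_t, y_{t+1} | X) built from weighted boolean features,
    mirroring the formula on the previous slide."""
    s = sum(l * f(y_prev, y_next, X, t) for l, f in zip(lam, edge_feats))
    s += sum(m * g(y_prev, X, t) for m, g in zip(mu, node_feats))
    return math.exp(s)

def crf_probability(Y, X, labels, **params):
    """P(Y | X): product of edge potentials along the chain, divided by
    the sum of that product over every possible label sequence."""
    def unnormalized(ys):
        score = 1.0
        for t in range(len(ys) - 1):
            score *= potential(ys[t], ys[t + 1], X, t, **params)
        return score
    Z = sum(unnormalized(ys) for ys in product(labels, repeat=len(X)))
    return unnormalized(Y) / Z
```

The enumeration of label sequences is exponential in the sequence length; a practical implementation would compute the normalizer with the forward algorithm, just as HMM inference uses dynamic programming.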

CRFs on Part-of-Speech Tagging
• Error rates (in percent) comparing HMM, MEMM, and CRF under three conditions: baseline, with spelling features, and with spelling features on out-of-vocabulary (OOV) words
• Lafferty, McCallum & Pereira (2001)

Summary of Methods
• Comparison of sliding windows (SW), recurrent sliding windows (RSW), HMMs, MEMMs, IOHMMs, and CRFs on six issues: modeling x_t ↔ y_t relationships, modeling y_t ↔ y_{t+1} relationships, handling long-distance effects, supporting rich X ↔ y_t features, computational efficiency, and freedom from the label bias problem

Loss Functions and Training
• Kakade, Teh & Roweis (2002) show that if the loss function depends only on errors of individual y_t's, then MEMMs, IOHMMs, and CRFs should be trained to maximize the likelihood P(y_t | X) instead of P(Y | X) or P(X, Y)

Concluding Remarks
• Many applications of pattern recognition can be formalized as Sequential Supervised Learning
• Many methods have been developed specifically for SSL, but none is perfect
• Similar issues arise in other complex learning problems (e.g., spatial and relational data)