
CIS 700 Advanced Machine Learning Structured Machine Learning:   Theory and Applications in Natural Language Processing Shyam Upadhyay Department of Computer and Information Science University of Pennsylvania

Reminder: Google form for filling in preferences for paper presentation. Deadline: 19th September. Fill in 4 papers, each from a different section. No class on Thursday (reading assignment on MEMM, CRF).

Today's Plan: Ingredients of Structured Prediction, Structured Prediction Formulation, Multiclass Classification, HMM Sequence Labeling, Dependency Parsing, Structured Perceptron.

Ingredients of a Structured Prediction Problem: Input, Output, Feature Extractor, Inference (also called "Decoding"), Loss.

Ingredients of a Structured Prediction Problem, mapped to Illinois-SL: Input (IInstance in SL), Output (IStructure in SL), Feature Extractor (AbstractFeatureGenerator in SL), Inference and Loss (AbstractInferenceSolver in SL).

Multiclass Classification (Toy). N training examples; we need to predict the label from M different classes. Winner takes all: maintain M different weight vectors, score an example using each of the weight vectors, and predict the class whose score is highest.

Questions: What is Φ(x, y)? How do we write the score function as w · Φ(x, y)? How do we do inference?

Solution: stack the M class weight vectors into a single weight vector w, and let Φ(x, y) place the features of x in the block corresponding to class y (zeros elsewhere). Confirm that w · Φ(x, y) = w_y · x, the winner-takes-all score.

See it in Code
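A minimal Python sketch of this construction (illustrative code, not the Illinois-SL implementation shown in class; all function names here are made up):

```python
import numpy as np

def phi(x, y, num_classes):
    """Joint feature map: place the feature vector x in the block for class y."""
    d = len(x)
    feats = np.zeros(d * num_classes)
    feats[y * d:(y + 1) * d] = x
    return feats

def score(w, x, y, num_classes):
    """Linear score w . phi(x, y); equals w_y . x for the stacked weight vector."""
    return w.dot(phi(x, y, num_classes))

def predict(w, x, num_classes):
    """Inference: winner takes all over the M classes."""
    return max(range(num_classes), key=lambda y: score(w, x, y, num_classes))
```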

Short Quiz: How would you implement the error-correcting code approach to multiclass classification? How do the definitions of Φ(x, y) and the output space change? How does the definition of the weight vector change? How does inference change? Also: write binary classification as structured prediction, trying to avoid redundant weights.

Sequence Tagging. Example: "The cat sat on the mat ." → DT NN VBD IN DT NN . Naïve approach: local inference, predicting each tag independently (this actually works pretty well for POS tagging).
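A hedged sketch of the naïve local approach (illustrative names; `features(sentence, i, tag)` is an assumed helper, not course code): each position is tagged by an independent multiclass decision that never looks at the neighboring tags.

```python
def tag_locally(w, sentence, tagset, features):
    """Naive local inference: pick the best tag for each word independently.

    `features(sentence, i, tag)` is assumed to return a feature vector for
    assigning `tag` at position i; neighboring tags are never consulted.
    """
    return [max(tagset, key=lambda t: w.dot(features(sentence, i, t)))
            for i in range(len(sentence))]
```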

Sequence Tagging with an HMM (Roth 1999, Collins 2002). How do we write this as a structured prediction problem?

Questions: What is Φ(x, y)? How do we write the score function as w · Φ(x, y)? It should respect the HMM model. How do we do inference?

Solution – Weight Vector

Solution – Feature Vector. Confirm that w · Φ(x, y) recovers the HMM log score.
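Written out (a sketch of the standard construction from Collins 2002, so the "confirm that" step can be checked):

\[
\log P(\mathbf{x}, \mathbf{y})
 \;=\; \sum_{i=1}^{n} \log P(y_i \mid y_{i-1})
 \;+\; \sum_{i=1}^{n} \log P(x_i \mid y_i)
 \;=\; \mathbf{w} \cdot \Phi(\mathbf{x}, \mathbf{y}),
\]

where Φ(x, y) counts each transition (y_{i-1}, y_i) and each emission (y_i, x_i) occurring in the pair, and w stacks the corresponding log transition and log emission probabilities; the log score is linear in those counts.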

Inference in HMM. Greedy: choose the current position's tag so that it maximizes the score so far. Viterbi: use dynamic programming to incrementally compute the score of the best tag prefix ending in each tag. Sampling: MCMC. (HW) HMM with Greedy and MCMC.

See it in Code
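A compact Python sketch of Viterbi decoding (illustrative, not the class code; `local_score` is an assumed callable, which for an HMM would return log P(tag | prev_tag) + log P(word_i | tag)):

```python
def viterbi(sentence, tagset, local_score, start="<s>"):
    """Viterbi decoding for a first-order sequence model."""
    n = len(sentence)
    tagset = list(tagset)
    best = [{} for _ in range(n)]   # best[i][t] = best score of a prefix ending in tag t
    back = [{} for _ in range(n)]   # backpointers
    for t in tagset:
        best[0][t] = local_score(sentence, 0, start, t)
    for i in range(1, n):
        for t in tagset:
            prev = max(tagset, key=lambda p: best[i - 1][p] + local_score(sentence, i, p, t))
            best[i][t] = best[i - 1][prev] + local_score(sentence, i, prev, t)
            back[i][t] = prev
    # Recover the best tag sequence by following backpointers from the best final tag.
    last = max(tagset, key=lambda t: best[n - 1][t])
    tags = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        tags.append(last)
    return list(reversed(tags))
```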

Short Quiz – Naïve Bayes. Recall Naïve Bayes classification: predict the class y that maximizes P(y) ∏_i P(x_i | y). How do we formulate this as structured prediction? Assume there are only two classes.

(HW) Implement Naïve Bayes in Illinois-SL. Assume there are only two classes.

Dependency Parsing Borrowed from Graham Neubig’s slides

Typed and Untyped Dependencies. Before we proceed, convince yourself that a dependency parse is a (directed) tree. Borrowed from Graham Neubig's slides.

Learning Problem Setup. INPUT: a sentence with N words. OUTPUT: a directed tree representing the dependency relations. Given input x, there is a fixed number of legal candidate trees. Search Space: find the highest-scoring dependency tree from the space of all dependency trees over N words. How big is the search space? An exponential number of candidates!

Questions: What is Φ(x, y)? How do we write the score function as w · Φ(x, y)? Unlike multiclass, we cannot learn a different model for each tree. How do we do inference?

Decompose the Score. Learn a model to score edge (i, j) of a candidate tree: S[i][j] = score of word i having word j as its parent. The score of a dependency tree is the sum of the scores of its edges. Can you think of features for an edge? The notation s(i, j) here is somewhat misleading: you also have access to the input x, so the correct notation is s(i, j, x).

Finding the Highest-Scoring Tree. Cast inference as a directed maximum spanning tree problem: compute a matrix S of edge scores and run the Chu-Liu-Edmonds algorithm (as a black box). This lets us solve argmax_y w · Φ(x, y) exactly.
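A small sketch of the edge-factored scoring (illustrative, not the assignment code; `edge_features` is an assumed helper, and index 0 is assumed to be an artificial ROOT token). Chu-Liu-Edmonds itself is treated as a black box.

```python
import numpy as np

def edge_score_matrix(w, sentence, edge_features):
    """S[i][j] = w . edge_features(sentence, i, j): score of word j being the parent of word i."""
    n = len(sentence)
    S = np.full((n, n), -np.inf)
    for i in range(1, n):              # ROOT (index 0) has no parent
        for j in range(n):
            if i != j:
                S[i][j] = w.dot(edge_features(sentence, i, j))
    return S

def tree_score(S, heads):
    """Score of a candidate tree: sum of its edge scores, given heads[i] = parent of word i."""
    return sum(S[i][heads[i]] for i in range(1, len(heads)))

# Inference: pass S to a directed maximum-spanning-tree solver (Chu-Liu-Edmonds), used as a black box.
```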

Our First Structured Learning Algorithm. So far we have only identified the ingredients we need; how do we learn a weight vector? Structured Perceptron (Collins 2002): a structured version of the binary Perceptron. Mistake-driven, just like the Perceptron, and virtually identical.

Binary Perceptron
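The algorithm on this slide appeared as a figure; here is a minimal Python sketch of the standard mistake-driven binary perceptron, for comparison with the structured version on the next slide (labels assumed to be +1/-1).

```python
import numpy as np

def binary_perceptron(examples, num_features, epochs=10):
    """Mistake-driven binary perceptron: update only when the current w errs."""
    w = np.zeros(num_features)
    for _ in range(epochs):
        for x, y in examples:          # y in {+1, -1}
            if y * w.dot(x) <= 0:      # mistake (or tie)
                w += y * x
    return w
```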

Structured Perceptron
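This slide's algorithm box was also a figure; a sketch of Collins' structured perceptron, written against hypothetical `phi` (joint feature map) and `inference` (argmax over structures) callables to highlight that it is virtually identical to the binary version, with inference replacing the sign check.

```python
import numpy as np

def structured_perceptron(examples, phi, inference, num_features, epochs=10):
    """Structured perceptron (Collins 2002).

    `phi(x, y)` is the joint feature vector; `inference(w, x)` returns
    argmax_y w . phi(x, y) over the candidate structures for input x.
    """
    w = np.zeros(num_features)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = inference(w, x)
            if y_pred != y_gold:                       # mistake-driven update
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```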

Short Quiz: Why do we not have a bias term? Do we see all possible structures during training? AbstractInferenceSolver: where was this used? Learning requires inference over structures, and such inference can prove costly for large search spaces. Think about improvements.

Averaged Structured Perceptron. Remember that we do not want to use only one weight vector. Why? Naïve way of averaging: maintain a list of weight vectors seen during training, maintain counts of how many examples each vector "survived", and compute the weighted average at the end. Drawbacks? Better way? I only want to maintain O(1) weight vectors and make updates only when necessary.

Averaging. Say we make the i-th update at time c_i, and the weight vector after the i-th update is w_i. The algorithm stops at time c_T, and the last mistake was made at time c_n. What is the weighted average?
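One way to write the answer (a sketch, under the convention that w_i is in use from time c_i until the next update at c_{i+1}, and w_n survives until the end c_T; the initial zero vector contributes nothing):

\[
\bar{\mathbf{w}}
 \;=\; \frac{1}{c_T}\left[\,\sum_{i=1}^{n-1} (c_{i+1} - c_i)\,\mathbf{w}_i \;+\; (c_T - c_n)\,\mathbf{w}_n\right].
\]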

Averaged Structured Perceptron
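The slide's algorithm box did not survive the transcript; below is a sketch of the O(1)-extra-storage version hinted at on the previous slide, which keeps only a running counter-weighted sum of updates alongside w (the averaging trick described in Hal Daumé's notes). Names are illustrative.

```python
import numpy as np

def averaged_structured_perceptron(examples, phi, inference, num_features, epochs=10):
    """Averaged structured perceptron with O(1) extra weight vectors.

    Keeps w plus an accumulator u of counter-weighted updates, so the
    average over all time steps is recovered at the end as w - u / c.
    """
    w = np.zeros(num_features)
    u = np.zeros(num_features)   # accumulates c * (update) at each mistake
    c = 1                        # example counter (one tick per example seen)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = inference(w, x)
            if y_pred != y_gold:
                delta = phi(x, y_gold) - phi(x, y_pred)
                w += delta
                u += c * delta
            c += 1
    return w - u / c             # averaged weight vector
```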

What We Learned Today: ingredients for structured prediction, toy formulations, and our first learning algorithm for structured prediction.

HW for Today's Lecture. Required Reading: Ming-Wei Chang's thesis, Chapter 2 (most of today's lecture); Hal Daumé's thesis, Chapter 2 (structured perceptron); M. Collins, "Discriminative Training for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms," EMNLP 2002. Optional Reading: L. Huang, S. Fayong, and Y. Guo, "Structured Perceptron with Inexact Search," NAACL 2012. Try the implementation exercises given in the slides.