1 Machine Learning in Natural Language: More on Discriminative Models
Dan Roth, University of Illinois, Urbana-Champaign


2 How to Compare?
Generalization (since the representation is the same): how many examples are needed to get to a given level of accuracy?
Efficiency: how long does it take to learn a hypothesis and to evaluate it (per example)?
Robustness; adaptation to a new domain, …

3 Sentence Representation
S = "I don't know whether to laugh or cry"
Define a set of features: features are relations that hold in the sentence.
Map the sentence to its feature-based representation; this representation captures some of the information in the sentence.
Use the feature-based representation as an example for your learning algorithm.

4 Sentence Representation
S = "I don't know whether to laugh or cry"
Define a set of features: features are relations that hold in the sentence.
Conceptually, there are two steps in coming up with a feature-based representation:
1. What information sources are available? Sensors: words, order of words, properties (?) of words.
2. What features should be constructed based on these? And why are they needed?
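A minimal sketch of what such a feature map could look like, assuming whitespace tokenization and a small illustrative set of relational features (the function and feature names here are hypothetical, not taken from the lecture):

```python
def extract_features(sentence, target="whether"):
    """Map a sentence to a sparse, feature-based representation.

    Features are relations that hold in the sentence: which words occur,
    which word bigrams occur (order of words), and the immediate context
    of a target word. All names are illustrative.
    """
    words = sentence.lower().split()
    features = set()
    for w in words:                              # the words themselves
        features.add("word=" + w)
    for w1, w2 in zip(words, words[1:]):         # order of words
        features.add("bigram=" + w1 + "_" + w2)
    if target in words:                          # local context of the target word
        i = words.index(target)
        if i > 0:
            features.add("word_before=" + words[i - 1])
        if i + 1 < len(words):
            features.add("word_after=" + words[i + 1])
    return features

print(sorted(extract_features("I don't know whether to laugh or cry")))
```

Only the features that are active in the sentence are generated, which is what makes the representation sparse.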

5 Embedding
[Figure: the weather / whether examples are mapped into a new feature space in which the new discriminator is functionally simpler.]

6 Domain Characteristics
The number of potential features is very large.
The instance space is sparse.
Decisions depend on a small set of features (the target function is sparse).
We want to learn from a number of examples that is small relative to the dimensionality.

7 Generalization
Dominated by the sparseness of the function space: most features are irrelevant.
The number of examples required by multiplicative algorithms (e.g., Winnow) depends mostly on the number of relevant features (generalization bounds depend on ||w||).
A lesser issue, sparseness of the feature space, gives an advantage to additive algorithms (e.g., Perceptron), whose generalization depends on ||x|| (Kivinen & Warmuth, 1995).

8 Mistake bounds for "10 out of 100 out of n"
Target function: at least 10 out of a fixed set of 100 variables are active; the dimensionality is n.
[Plot: number of mistakes to convergence vs. n, the total number of variables (dimensionality), for Perceptron/SVMs and for Winnow.]
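As a hedged sketch of the standard bounds behind this comparison (the forms below are the textbook statements, not numbers read off the slide), for a target that depends on k relevant variables out of n Boolean attributes:

$$ M_{\text{Winnow}} = O\!\big(k^{2}\log n\big), \qquad M_{\text{Perceptron}} \le \Big(\frac{R}{\gamma}\Big)^{2} = O\!\big(k^{2}\, n\big), $$

since $R^{2} = \max_x \lVert x \rVert^{2}$ grows linearly with the dimensionality for Boolean examples. With k = 100 fixed, the Winnow bound grows only logarithmically in n while the Perceptron/SVM bound grows linearly in n, which is the point of the plot.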

9 Efficiency
Dominated by the size of the feature space.
Most features are functions (e.g., n-grams) of raw attributes.
Additive algorithms allow the use of kernels: there is no need to explicitly generate the complex features, and the method can be more efficient since the work is done in the original feature space.

10 SNoW
Update rule: multiplicative / additive / NB (+ regularization).
Feature space: infinite attribute space; examples are of variable size and list only their active features, which are determined in a data-driven way.
Multi-class learner: several approaches are possible.
Makes possible: generation of many complex/relational types of features, of which only a small fraction is actually represented; computationally efficient (on-line!).
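A minimal, hypothetical sketch of the kind of sparse, infinite-attribute update this architecture supports; this is a generic Winnow-style learner written for illustration, not SNoW's actual implementation:

```python
class SparseWinnow:
    """Winnow-style learner over an open-ended (infinite attribute) feature space.

    Weights are stored only for features seen in a mistake (all others keep the
    default weight of 1.0), and each update touches only the active features of
    the current example. Illustrative sketch only, not SNoW's actual code.
    """
    def __init__(self, alpha=1.5, threshold=2.0):
        self.alpha = alpha          # promotion / demotion factor
        self.threshold = threshold
        self.w = {}                 # feature name -> weight, created lazily

    def score(self, active_features):
        return sum(self.w.get(f, 1.0) for f in active_features)

    def predict(self, active_features):
        return 1 if self.score(active_features) >= self.threshold else -1

    def update(self, active_features, label):
        if self.predict(active_features) != label:
            factor = self.alpha if label == 1 else 1.0 / self.alpha
            for f in active_features:   # promote or demote the active features only
                self.w[f] = self.w.get(f, 1.0) * factor

learner = SparseWinnow()
learner.update({"word=whether", "word_before=know"}, label=1)
```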

11 Other Issues in Classification
Other methods are used broadly today in NLP: SVM, AdaBoost, multiclass classification.
Dealing with lack of data: semi-supervised learning; missing data.

12 Embedding
[Figure, repeated from slide 5: the weather / whether examples are mapped into a new feature space in which the new discriminator is functionally simpler.]

13 Kernel Based Methods
A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
Computing the weight vector is done in the original space.
Notice: this pertains only to efficiency; generalization is still relative to the real dimensionality.
This is the main trick in SVMs (the algorithm itself is different), although many applications actually use linear kernels.

14 Kernel Based Methods
Let I be the set t_1, t_2, t_3, … of monomials (conjunctions) over the feature space x_1, x_2, …, x_n. Then we can write a linear function over this new feature space:
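A hedged reconstruction of the formula the slide displays at this point (it follows directly from the text): a linear threshold function over the monomial features,

$$ f(x) \;=\; \Theta\!\Big(\sum_{t_i \in I} w_i\, t_i(x)\Big), $$

where $t_i(x) \in \{0,1\}$ indicates whether the conjunction $t_i$ is satisfied by $x$, and $\Theta(\cdot)$ is the threshold function.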

15 Kernel Based Methods
Great increase in expressivity.
We can run Perceptron (and Winnow), but the convergence bound may suffer exponential growth.
An exponential number of monomials are true in each example, and we would also have to keep a very large number of weights.

16 The Kernel Trick (1)
Consider the value of w_t, the weight of a monomial t, as used in the prediction.
Each previous mistake on an example z makes an additive contribution of +/-1 to w_t, iff t(z) = 1.
The value of w_t is therefore determined by the mistakes made on examples that satisfied t.

17 The Kernel Trick (2)
P: the set of examples on which we promoted.
D: the set of examples on which we demoted.
M = P ∪ D.

18 The Kernel Trick (3)
P: the set of examples on which we promoted; D: the set of examples on which we demoted; M = P ∪ D.
Write S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D. Reordering the sum over monomials:
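A hedged reconstruction of the reordering step the slide shows at this point, using the definitions above (w_t is the accumulated ±1 contribution of the mistakes that satisfied t):

$$ f(x) = \Theta\!\Big(\sum_{t\in I} w_t\, t(x)\Big) = \Theta\!\Big(\sum_{t\in I}\Big(\sum_{z\in M} S(z)\, t(z)\Big)\, t(x)\Big) = \Theta\!\Big(\sum_{z\in M} S(z) \sum_{t\in I} t(z)\, t(x)\Big). $$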

19 The Kernel Trick (4)
S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D.
A mistake on z contributes the value +/-1 to all monomials satisfied by z.
The total contribution of z to the sum is therefore equal to the number of monomials that satisfy both x and z.
Define a dot product in the t-space, and we get the standard notation:
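A hedged reconstruction of the resulting kernel form (the "standard notation" the slide refers to):

$$ K(x,z) \;=\; \sum_{t\in I} t(x)\, t(z), \qquad f(x) \;=\; \Theta\!\Big(\sum_{z\in M} S(z)\, K(x,z)\Big). $$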

20 Kernel Based Methods
What does this representation give us?
We can view this kernel as an inner product (a measure of similarity) between x and z in the t-space.
But K(x,z) can be computed in the original space, without explicitly writing down the t-representation of x and z.

21 Kernel Based Methods
Consider the space of all 3^n monomials (allowing both positive and negative literals). Then the kernel has a simple closed form (reconstructed below), where same(x,z) is the number of features that have the same value in both x and z.
Example: take n = 2, x = (0,0), z = (0,1), …
Other kernels can be used.
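A hedged reconstruction of the slide's formula; this is the standard closed form for the monomial kernel:

$$ K(x,z) \;=\; \sum_{t\in I} t(x)\, t(z) \;=\; 2^{\,\mathrm{same}(x,z)}, $$

since a monomial is satisfied by both x and z exactly when it mentions only features on which they agree, and each such feature can either be included (with the agreeing literal) or left out. Worked example: for n = 2, x = (0,0), z = (0,1), the two examples agree only on the first feature, so same(x,z) = 1 and K(x,z) = 2^1 = 2 (the empty monomial and the monomial ¬x_1).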

22 Dual Representation Implementation
Simply run Perceptron in an on-line mode, but keep track of the set M.
Keeping the set M allows us to keep track of S(z).
Rather than remembering the weight vector w, remember the set M (P and D): all the examples on which we made mistakes.
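A minimal sketch of this dual (kernel) Perceptron, assuming binary ±1 labels, examples given as tuples of 0/1 attributes, and the 2^same(x,z) monomial kernel from the previous slide; the function names are illustrative, not from the lecture:

```python
def monomial_kernel(x, z):
    """K(x, z) = 2 ** same(x, z): the number of monomials satisfied by both x and z."""
    same = sum(1 for xi, zi in zip(x, z) if xi == zi)
    return 2 ** same

def dual_perceptron(stream, kernel=monomial_kernel):
    """On-line Perceptron in dual form: store the mistakes M, not the weight vector."""
    mistakes = []                                  # the set M, as (example, S(z)) pairs
    for x, y in stream:                            # y is +1 or -1
        score = sum(s * kernel(x, z) for z, s in mistakes)
        if (1 if score >= 0 else -1) != y:         # mistake: promotion (+1) or demotion (-1)
            mistakes.append((x, y))
    return mistakes

def predict(mistakes, x, kernel=monomial_kernel):
    return 1 if sum(s * kernel(x, z) for z, s in mistakes) >= 0 else -1

# Tiny usage example over 2-attribute Boolean examples:
M = dual_perceptron([((0, 0), -1), ((0, 1), 1), ((1, 1), 1)])
print(predict(M, (1, 0)))
```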

23 Summary – Kernel Based Methods I
A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
Computing the weight vector can still be done in the original feature space.
Notice: this pertains only to efficiency. The classifier is identical to the one you would get by blowing up the feature space, and generalization is still relative to the real dimensionality.
This is the main trick in SVMs (the algorithm itself is different), although most applications actually use linear kernels.

24 Efficiency-Generalization Tradeoff
There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier.
For example, using such kernels, the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions.
In addition, computing with kernels depends strongly on the number of examples; it turns out that sometimes working in the blown-up space is more efficient than using kernels.
Next: kernel methods in NLP.

25 Other Issues in Classification
Other methods are used broadly today in NLP: SVM, AdaBoost, multiclass classification.
Dealing with lack of data: semi-supervised learning.
Missing data: EM.