John Blitzer, Natural Language Computing Group

This is an NLP summer school. Why should I care about machine learning? ACL 2008: 50 of 96 full papers mention learning or statistics in their titles; 4 of 4 outstanding papers propose new learning or statistical inference methods.

Input: a product review. Output: a label (Positive or Negative). Example review for Running with Scissors: A Memoir. Title: Horrible book, horrible. This book was horrible. I read half of it, suffering from a headache the entire time, and eventually I lit it on fire. One less copy in the world... don't waste your money. I wish I had the time spent reading this book back so I could use it for better purposes. This book wasted my life.

From the MSRA Machine Learning Group

Ranking: the input is an un-ranked list of documents; the output is a ranked list.

Machine translation. Input: English sentence ("The national track & field championships concluded"). Output: Chinese sentence (全国田径冠军赛结束).

1) Supervised Learning [2.5 hrs]
2) Semi-supervised learning [3 hrs]
3) Learning bounds for domain adaptation [30 mins]

1) Notation and Definitions [5 mins]
2) Generative Models [25 mins]
3) Discriminative Models [55 mins]
4) Machine Learning Examples [15 mins]

Training data: labeled pairs (x, y). Use the training data to learn a function; then use this function to label unlabeled testing data.
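As a minimal formalization of this setup (the notation here is the usual one, not necessarily the slide's):

\[
\text{Given } \{(x_i, y_i)\}_{i=1}^{N},\ x_i \in \mathcal{X},\ y_i \in \mathcal{Y},
\qquad \text{learn } f : \mathcal{X} \to \mathcal{Y},
\qquad \hat{y} = f(x) \text{ for an unlabeled test input } x.
\]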

Feature vectors: counts of words and bigrams such as horrible, read_half, waste, excellent, loved_it.

A graphical model encodes a multivariate probability distribution: nodes indicate random variables, and edges indicate conditional dependencies.

Naïve Bayes model for the sentiment example: a label node with prior p(y = -1), and word feature nodes conditioned on the label, e.g. p(horrible | -1), p(read_half | -1), p(waste | -1).
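A sketch of the joint distribution this graph encodes, in the standard Naïve Bayes form (the exact symbols on the slide are not preserved):

\[
p(x, y) = p(y) \prod_{j} p(x_j \mid y),
\]

where y is the sentiment label and the x_j are word features such as horrible, read_half, and waste.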

Given an unlabeled instance, how can we find its label? Just choose the most probable label y.
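Concretely, combining Bayes' rule with the factorization above (a standard derivation, notation assumed):

\[
\hat{y} = \arg\max_{y}\; p(y \mid x) = \arg\max_{y}\; p(y) \prod_{j} p(x_j \mid y).
\]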

Back to labeled training data:...

Query classification. Input query: "自然语言处理" (natural language processing). Output classes: Travel, Technology, News, Entertainment, ... Training and testing are the same as in the binary case.

Why set parameters to counts?
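The usual answer is maximum likelihood estimation: for a Naïve Bayes model over word counts, the likelihood is maximized exactly by relative frequencies (standard result; notation assumed):

\[
\hat{p}(y) = \frac{\operatorname{count}(y)}{N},
\qquad
\hat{p}(w \mid y) = \frac{\operatorname{count}(w, y)}{\sum_{w'} \operatorname{count}(w', y)}.
\]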

Predicting broken traffic lights. Lights are broken: both lights are always red. Lights are working: one is red and one is green.

Now, suppose both lights are red. What will our model predict? We get the wrong answer. Is there a better model? The MLE generative model is not the best model!
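To see how this happens, here is an illustrative calculation with an assumed prior (the numbers are mine, not from the slides). Suppose working lights are observed six times as often as broken ones. The MLE Naïve Bayes model treats the two lights as independent given the state, so p(red | working) = 1/2 for each light, and:

\[
p(\text{working}) \cdot p(\text{red} \mid \text{working})^2 = \tfrac{6}{7} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{6}{28},
\qquad
p(\text{broken}) \cdot p(\text{red} \mid \text{broken})^2 = \tfrac{1}{7} \cdot 1 \cdot 1 = \tfrac{4}{28}.
\]

The model predicts "working" (6/28 > 4/28), even though two red lights can only occur when the lights are broken. The independence assumption is violated, so the MLE generative model fails.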

We can introduce more dependencies, but this can explode the parameter space. Discriminative models minimize error directly (next).
Further reading:
K. Toutanova. Competitive Generative Models with Structure Learning for NLP Classification Tasks. EMNLP.
A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naïve Bayes. NIPS 2002.

We will focus on linear models, and on minimizing the training error.
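Under the usual conventions (not necessarily the slide's exact notation), a linear model and its training error look like this:

\[
\operatorname{score}(x) = w \cdot x,
\qquad \hat{y} = \operatorname{sign}(w \cdot x),
\qquad
\operatorname{err}(w) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[\, y_i\,(w \cdot x_i) \le 0 \,\big].
\]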

0-1 loss (error): NP-hard to minimize over all data points.
Exp loss, exp(-score): minimized by AdaBoost.
Hinge loss: minimized by support vector machines.
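Writing the losses as functions of the signed margin $m = y\,(w \cdot x)$ makes the comparison concrete (the hinge-loss formula is the standard one; the slide only names it):

\[
\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0],
\qquad
\ell_{\exp}(m) = e^{-m},
\qquad
\ell_{\text{hinge}}(m) = \max(0,\, 1 - m).
\]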

In NLP, a feature can be a weak learner. Sentiment example:

Input: training sample $\{(x_i, y_i)\}_{i=1}^{N}$
(1) Initialize weights $D_1(i) = 1/N$
(2) For t = 1 … T: train a weak hypothesis $h_t$ to minimize weighted error $\epsilon_t$ on $D_t$; set $\alpha_t$ [later]; update $D_{t+1}(i) \propto D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))$
(3) Output model $f(x) = \operatorname{sign}\big(\sum_t \alpha_t h_t(x)\big)$
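As a concrete illustration, here is a short, runnable sketch of this loop in Python, using word-presence decision stumps as weak learners on toy sentiment data (the data, vocabulary, and helper names are mine, not from the tutorial):

```python
import math

# Toy sentiment data: (set of word features, label in {+1, -1}). Illustrative only.
DOCS = [
    ({"excellent", "book", "the_plot", "riveting"}, +1),
    ({"excellent", "read"}, +1),
    ({"terrible", "the_plot", "boring", "opaque"}, -1),
    ({"awful", "book", "couldnt_follow", "the_plot"}, -1),
]
VOCAB = sorted(set().union(*(words for words, _ in DOCS)))


def stump(word, sign):
    """Weak learner: predict `sign` if `word` is present, else -sign."""
    return lambda words: sign if word in words else -sign


def adaboost(docs, rounds=5):
    n = len(docs)
    dist = [1.0 / n] * n                      # (1) initialize weights D_1
    model = []                                # list of (alpha_t, h_t)
    for _ in range(rounds):                   # (2) boosting rounds
        # Pick the stump with the smallest weighted error on D_t.
        best_h, best_err = None, 1.0
        for word in VOCAB:
            for sign in (+1, -1):
                h = stump(word, sign)
                err = sum(d for d, (x, y) in zip(dist, docs) if h(x) != y)
                if err < best_err:
                    best_h, best_err = h, err
        eps = max(min(best_err, 1 - 1e-9), 1e-9)
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_t (derived below)
        model.append((alpha, best_h))
        # Update and renormalize D_{t+1}.
        dist = [d * math.exp(-alpha * y * best_h(x))
                for d, (x, y) in zip(dist, docs)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return model


def predict(model, words):
    # (3) output: sign of the weighted vote.
    return 1 if sum(a * h(words) for a, h in model) >= 0 else -1


model = adaboost(DOCS)
print(predict(model, {"excellent", "the_plot"}))   # expected +1
print(predict(model, {"awful", "boring"}))         # expected -1
```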

Boosting rounds on the sentiment examples (+ / -): "Excellent book. The_plot was riveting", "Excellent read", "Terrible: The_plot was boring and opaque", "Awful book. Couldn't follow the_plot".

Bound on training error [Freund & Schapire 1995]: the training error is at most $\prod_t Z_t$, where $Z_t$ is the normalizer of the weight update at round t. We greedily minimize error by minimizing $Z_t$ at each round.

For proofs and a more complete discussion: Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning Journal, 1998.

We chose $\alpha_t$ to minimize $Z_t$. Was that the right choice? Setting $\partial Z_t / \partial \alpha_t = 0$ gives $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$. Plugging in our solution for $\alpha_t$, we have $Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$.
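Filling in the intermediate step of this derivation (standard, assuming weak hypotheses with outputs in $\{-1, +1\}$ and $\epsilon_t$ the weighted error of $h_t$):

\[
Z_t = \sum_i D_t(i)\, e^{-\alpha_t\, y_i\, h_t(x_i)} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t},
\]

and setting the derivative with respect to $\alpha_t$ to zero yields the $\alpha_t$ and $Z_t$ above.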

What happens when an example is mislabeled or an outlier? Exp loss exponentially penalizes incorrect scores; hinge loss linearly penalizes incorrect scores.

Linearly separable data vs. non-separable data (illustration of positive and negative points in the plane).

Lots of separating hyperplanes. Which should we choose? Choose the hyperplane with the largest margin.

Constraint: the score of the correct label must be greater than the margin, $y_i\,(w \cdot x_i) \ge \gamma$ with $\|w\| \le 1$. Why do we fix the norm of w to be less than 1? Because scaling the weight vector doesn't change the optimal hyperplane.

Equivalently: minimize the norm of the weight vector, with a fixed margin for each example: $\min_w \tfrac{1}{2}\|w\|^2$ subject to $y_i\,(w \cdot x_i) \ge 1$ for all i.

We can't satisfy the margin constraints, but some hyperplanes are better than others.

Add slack variables to the optimization: allow margin constraints to be violated, but minimize the violation as much as possible.
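In the standard soft-margin form (the constant C and the exact notation are the usual ones, not necessarily the slide's):

\[
\min_{w,\, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0 .
\]

Eliminating the slack variables gives the unconstrained hinge-loss objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i\,(w \cdot x_i))$, which is where the non-differentiable max on the next slide comes from.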

The max creates a non-differentiable point, but there is a subgradient. Subgradient of the hinge loss $\max(0,\, 1 - y\,(w \cdot x))$ with respect to w: $-y\,x$ if $y\,(w \cdot x) < 1$, and 0 otherwise.

Subgradient descent is like gradient descent: also guaranteed to converge, but slow. Pegasos [Shalev-Shwartz and Singer 2007]: sub-gradient descent on a randomly selected subset of examples, with a convergence bound relating the objective after T iterations to the best objective value (linear convergence).
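A compact sketch of a Pegasos-style update in Python (the regularization constant, step-size schedule, and toy data are my choices; this follows the commonly published form of the algorithm, not necessarily the exact variant on the slides):

```python
import random

def pegasos(examples, lam=0.1, iterations=1000, seed=0):
    """Stochastic sub-gradient descent on the SVM objective
    (lam/2)*||w||^2 + (1/N) * sum_i max(0, 1 - y_i * <w, x_i>).
    `examples` is a list of (feature_dict, label) with label in {+1, -1}."""
    rng = random.Random(seed)
    w = {}
    for t in range(1, iterations + 1):
        x, y = rng.choice(examples)           # pick one random example
        eta = 1.0 / (lam * t)                 # decaying step size
        score = sum(w.get(f, 0.0) * v for f, v in x.items())
        # Sub-gradient of the regularizer: shrink w towards zero.
        for f in w:
            w[f] *= (1.0 - eta * lam)
        # Sub-gradient of the hinge loss: only active on margin violations.
        if y * score < 1.0:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + eta * y * v
    return w

# Tiny illustrative dataset (made up): word-count features, +1/-1 labels.
data = [
    ({"excellent": 1, "plot": 1}, +1),
    ({"excellent": 1, "read": 1}, +1),
    ({"terrible": 1, "boring": 1}, -1),
    ({"awful": 1, "plot": 1}, -1),
]
w = pegasos(data)
print(sorted(w.items(), key=lambda kv: -kv[1])[:3])  # most positive weights
```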

We've been looking at binary classification, but most NLP problems aren't binary; multiclass linear classifiers give piece-wise linear decision boundaries. We showed 2-dimensional examples, but NLP is typically very high dimensional. Joachims [2000] discusses linear models in high-dimensional spaces.

Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in our original space. But for NLP, we already have a high-dimensional representation! Optimization with non-linear kernels is often super-linear in the number of examples.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press.
Dan Klein and Ben Taskar. Max Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 Tutorial.
Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology.

SVMs with slack are noise tolerant. AdaBoost has no explicit regularization; it must resort to early stopping. AdaBoost easily extends to non-linear models. Non-linear optimization for SVMs is super-linear in the number of examples. This can be important for examples with hundreds or thousands of features.

Logistic regression: also known as Maximum Entropy. A probabilistic discriminative model which directly models p(y | x). A good general machine learning book on discriminative learning and more: Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
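As a sketch, the standard multiclass form (the feature function and notation are the usual ones, not necessarily the slide's):

\[
p(y \mid x; w) = \frac{\exp\big(w \cdot f(x, y)\big)}{\sum_{y'} \exp\big(w \cdot f(x, y')\big)},
\]

trained by maximizing the conditional log-likelihood $\sum_i \log p(y_i \mid x_i; w)$, usually with a regularizer on w.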


Good features for this model?
(1) How many words are shared between the query and the web page?
(2) What is the PageRank of the web page?
(3) Other ideas?

Loss for a query and a pair of documents: scores for documents of different ranks must be separated by a margin. MSRA Web Search and Mining Group.
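A sketch of such a pairwise loss in the usual form (notation mine): for a query q with a document d+ ranked above a document d-,

\[
\ell(q, d^{+}, d^{-}) = \max\big(0,\; 1 - (\operatorname{score}(q, d^{+}) - \operatorname{score}(q, d^{-}))\big),
\]

which is zero only when the higher-ranked document outscores the lower-ranked one by at least a margin of 1.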