Machine Learning with Discriminative Methods Lecture 18 – Structured Prediction CS 790-134 Spring 2015 Alex Berg.



Today’s class: structured prediction, and a discussion of the next reading, on deep learning.

Structured Prediction: some examples from Ben Taskar (UPenn and University of Washington)…

Handwriting Recognition. Input x is a sequence of letter images (spelling the word “brace”); output y is the letter sequence. Sequential structure.

Object Segmentation. Input x is an image; output y is a labeling of its regions. Spatial structure.

Natural Language Parsing. Input x is a sentence (“The screen was a sea of red”); output y is its parse tree. Recursive structure.

Bilingual Word Alignment. Input x is a sentence pair: “What is the anticipated cost of collecting fees under the new proposal?” / “En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?” Output y is the word-to-word alignment between the two sentences. Combinatorial structure.

Protein Structure and Disulfide Bridges. Input x is an amino-acid sequence (protein 1IMT: AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK); output y is the pattern of disulfide bridges between its cysteines.

Local Prediction. Classify using local information only; this ignores correlations and constraints! (Predicting each letter of “brace” independently can yield “breac”.)

Local Prediction. Each image region is labeled independently: building, tree, shrub, ground.

Structured Prediction. Use local information, but also exploit correlations among outputs (jointly decoding the letters recovers “brace” where independent prediction gave “breac”).

Structured Prediction. The region labels (building, tree, shrub, ground) are predicted jointly, so neighboring labels can inform each other.

Formulating the problem. There is a rich history here, and we are skipping to a post-modern, reductionist viewpoint (see the first 5 sections of the Nowozin reading…). In particular we are avoiding a probabilistic formulation, but keep in mind that the following technique also works for a probabilistic model, replacing the argmax with maximum likelihood and using sampling where appropriate…
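The slide’s formulas did not survive transcription; a standard reconstruction of the prediction rule in this reductionist view is an argmax of a score over the (structured) output space:

```latex
\hat{y} \;=\; \operatorname*{argmax}_{y \,\in\, \mathcal{Y}(x)} \; f_\theta(x, y)
```

Here \(\mathcal{Y}(x)\) is the set of legal structured outputs for input \(x\), and \(f_\theta\) is the score function to be learned.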

Discriminative modeling for structured prediction. We have already seen one example: multiclass classification! There, the number of weight vectors θ_i equals the number of unique outputs y. This may be too flexible when the space of outputs is large…
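To make the multiclass-as-structured-prediction view concrete, here is a minimal sketch: one parameter vector per class, and prediction by argmax over the finite set of outputs. The class names and weight values are toy assumptions, not the slide’s example.

```python
# Multiclass classification as the simplest structured predictor:
# f(x, y) = thetas[y] . x, one weight vector thetas[y] per class,
# prediction by argmax over the (small, enumerable) output space.

def predict(thetas, x):
    """Return argmax_y f(x, y) over the classes in `thetas`."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    return max(thetas, key=lambda y: dot(thetas[y], x))
```

With structured outputs the same argmax formulation survives, but the output space is too large to enumerate this way.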

Discriminative modeling for structured prediction. Can we simplify f? Write it as a weighted sum of n component functions, f(x, y) = Σ_{i=1..n} θ_i φ_i(x, y), with n much smaller than the number of possible outputs (which is exponential in the length of y). The component functions φ_i are “features” and can be anything; we learn the coefficients θ_i.
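A toy sketch of this linear model in Python. The particular feature templates (letter/tag emission counts and tag/tag transition counts, echoing the handwriting example) and the sparse-dict weight representation are illustrative assumptions, not the slide’s lost example.

```python
# Linear structured score f(x, y) = sum_i theta_i * phi_i(x, y),
# with features phi indexed by hashable keys and stored sparsely.

def phi(x, y):
    """Joint feature counts for an input string x and a tag string y:
    (letter, tag) emission counts and (tag, tag) transition counts."""
    feats = {}
    for xi, yi in zip(x, y):
        feats[("emit", xi, yi)] = feats.get(("emit", xi, yi), 0) + 1
    for a, b in zip(y, y[1:]):
        feats[("trans", a, b)] = feats.get(("trans", a, b), 0) + 1
    return feats

def score(theta, x, y):
    """f(x, y) = theta . phi(x, y); theta is a sparse dict of coefficients."""
    return sum(theta.get(k, 0.0) * v for k, v in phi(x, y).items())
```

Note that the number of feature templates stays small even though the number of possible tag strings y grows exponentially with the length of x.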

Optimization for discriminative learning of structured prediction models. Let’s look at perceptrons first. (The slide reproduces the perceptron update rule and its structured generalization, the structured perceptron, from Wikipedia; the boxed algorithms did not survive transcription.)
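A minimal sketch of the structured-perceptron idea (not the slide’s exact boxed algorithm). The joint feature map `phi` and the candidate generator `candidates` are assumed, user-supplied pieces; real systems replace the exhaustive argmax with dynamic programming.

```python
# Structured perceptron: predict with the current weights, and on a
# mistake add the gold output's features and subtract the prediction's.

def structured_perceptron(data, candidates, phi, epochs=10):
    theta = {}
    def score(x, y):
        return sum(theta.get(k, 0.0) * v for k, v in phi(x, y).items())
    for _ in range(epochs):
        for x, y_true in data:
            # Argmax over candidate outputs under the current weights.
            y_hat = max(candidates(x), key=lambda y: score(x, y))
            if y_hat != y_true:
                # Update: reward the gold features, penalize the guess.
                for k, v in phi(x, y_true).items():
                    theta[k] = theta.get(k, 0.0) + v
                for k, v in phi(x, y_hat).items():
                    theta[k] = theta.get(k, 0.0) - v
    return theta
```

This is exactly the binary perceptron update generalized so that "the label" is an entire structured output.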

Optimization for discriminative learning of structured prediction models. More generally, we will do empirical risk minimization again: parameterize f by θ, and find the parameters θ that minimize a regularization penalty plus the loss of f_θ on the training data. The possibly annoying bit is that to compute the empirical estimate of the loss we need to compute argmax_y f_θ(x, y). There is lots of recent work on approximating this step; as long as training and test are done the same way, it is usually ok! For linear f, the parameters are the θ_i’s, and optimizing them is a convex problem (for reasonable choices of loss and regularization, e.g. hinge loss and L2 on the θ’s…).
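The argmax inside the loss computation is the expensive step flagged above. For chain-structured features, as in the handwriting example, it can be computed exactly with Viterbi dynamic programming. A sketch, where the `emit` and `trans` score tables (the chain decomposition of f) are illustrative assumptions:

```python
# Exact argmax_y f(x, y) for a chain-structured score
#   f(x, y) = sum_t emit[(x[t], y[t])] + sum_t trans[(y[t], y[t+1])]
# via Viterbi dynamic programming.

def viterbi(x, labels, emit, trans):
    # best[t][s]: score of the best label prefix ending in state s at position t
    best = [{s: emit.get((x[0], s), 0.0) for s in labels}]
    back = []
    for t in range(1, len(x)):
        best.append({})
        back.append({})
        for s in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + trans.get((p, s), 0.0))
            back[t - 1][s] = prev
            best[t][s] = (best[t - 1][prev] + trans.get((prev, s), 0.0)
                          + emit.get((x[t], s), 0.0))
    # Trace back from the best final state to recover the full sequence.
    last = max(labels, key=lambda s: best[-1][s])
    y = [last]
    for t in range(len(x) - 2, -1, -1):
        y.append(back[t][y[-1]])
    return list(reversed(y))
```

The cost is linear in the sequence length and quadratic in the number of labels, versus exponential for brute-force enumeration.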

What is the loss for structured prediction? Ideally we want to minimize the task loss on each training example (how wrong the predicted structure is), but that objective is discontinuous in θ, so we will settle for minimizing a convex surrogate instead. (The slide’s equations did not survive transcription.)
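As a reconstruction of the lost equations, consistent with the max-margin approach of Taskar et al. referenced later: the task loss is a structured error Δ, and the convex surrogate is the structured hinge loss (margin rescaling).

```latex
% Task loss we would like to minimize on example (x, y),
% e.g. Hamming distance between label sequences:
\Delta\big(y, \hat{y}\big), \qquad
\hat{y} = \operatorname*{argmax}_{y'} f_\theta(x, y')

% Convex surrogate: the structured hinge loss
\ell(\theta; x, y) \;=\; \max_{y'} \Big[ f_\theta(x, y') + \Delta(y, y') \Big] \;-\; f_\theta(x, y)
```

This upper-bounds the task loss and reduces to the ordinary multiclass hinge loss when Δ is 0/1.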

Can get probabilistic models too: Conditional Random Fields (CRFs). Boxed algorithms from Jason Eisner (NLP/ML at JHU), a proponent of both empirical risk minimization and probabilistic models. CRFs (from ~2000) had a huge impact on the natural language processing and computer vision communities: by modeling P(y|x) directly, without modeling p(x), p(y), and p(x|y) as intermediate steps, they save much computational and sampling complexity. Some of Ben Taskar’s work (from ~2004) addressed max-margin structured prediction with probabilistic models, combining these ideas with the margin loss on the previous slide.
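The boxed CRF equations did not survive transcription; the standard form of the model, stated here as a reconstruction, is a log-linear conditional distribution over structured outputs using the same joint features as before:

```latex
P_\theta(y \mid x) \;=\;
\frac{\exp\!\big(\theta^\top \phi(x, y)\big)}
     {\sum_{y'} \exp\!\big(\theta^\top \phi(x, y')\big)}
```

Training minimizes the negative conditional log-likelihood \(\sum_j -\log P_\theta(y_j \mid x_j)\) (typically with L2 regularization); the normalizer is computed by dynamic programming for chain structures, mirroring the Viterbi argmax used in the non-probabilistic formulation.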

For next class Read the deep learning tutorial slides linked on the course web page: lxmls.pdf