Statistical Machine Learning: The Basic Approach and Current Research Challenges
Shai Ben-David, CS497, February 2007

A High Level Agenda: "The purpose of science is to find meaningful simplicity in the midst of disorderly complexity." (Herbert Simon)

Representative learning tasks: medical research; detection of fraudulent activity (credit card transactions, intrusion detection, stock market manipulation); analysis of genome functionality; spam detection; spatial prediction of landslide hazards.

Common to all such tasks: We wish to develop algorithms that detect meaningful regularities in large, complex data sets. We focus on data that is too complex for humans to identify its meaningful regularities. We consider the task of finding such regularities from random samples of the data population. We should derive conclusions in a timely manner; computational efficiency is essential.

Different types of learning tasks: Classification prediction – we wish to classify data points into categories, and we are given already-classified samples as our training input. For example: training a spam filter; medical diagnosis (patient info → high/low risk); stock market prediction (predict tomorrow's market trend from companies' performance data).

Other Learning Tasks: Clustering – grouping data into representative collections, a fundamental tool for data analysis. Examples: clustering customers for targeted marketing; clustering pixels to detect objects in images; clustering web pages for content similarity.

Differences from Classical Statistics We are interested in hypothesis generation rather than hypothesis testing. We wish to make no prior assumptions about the structure of our data. We develop algorithms for automated generation of hypotheses. We are concerned with computational efficiency.

Learning Theory: The Fundamental Dilemma. [Figure: data points (x, y) and a fitted curve y = f(x).] Good models should enable prediction of new data. There is a tradeoff between accuracy and simplicity.

A Fundamental Dilemma of Science: Model Complexity vs. Prediction Accuracy. [Figure: with limited data, prediction accuracy trades off against the complexity of the possible models/representations.]

Problem Outline: We are interested in (automated) hypothesis generation, rather than traditional hypothesis testing. First obstacle: the danger of overfitting. First solution: consider only a limited set of candidate hypotheses.

Empirical Risk Minimization Paradigm Choose a Hypothesis Class H of subsets of X. For an input sample S, find some h in H that fits S well. For a new point x, predict a label according to its membership in h.
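To make the ERM paradigm concrete, here is a minimal sketch (not from the slides) that runs ERM over a toy finite hypothesis class of threshold functions on the real line; the class, sample, and helper names are illustrative assumptions.

```python
# Empirical Risk Minimization over a toy finite hypothesis class:
# thresholds on the real line, h_t(x) = 1 if x >= t else 0.

def training_error(h, sample):
    """Fraction of labeled points (x, y) in the sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypothesis_class, sample):
    """Return the hypothesis in the class with the smallest training error."""
    return min(hypothesis_class, key=lambda h: training_error(h, sample))

# Illustrative hypothesis class: thresholds at 0.0, 0.1, ..., 1.0.
H = [lambda x, t=t / 10: 1 if x >= t else 0 for t in range(11)]

S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]   # labeled training sample
h_star = erm(H, S)
print(training_error(h_star, S))                # 0.0 on this separable sample
```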

The Mathematical Justification: Assume both the training sample S and the test point (x, l) are generated i.i.d. by the same distribution over X × {0,1}. If H is not too rich (in some formal sense), then for every h in H, the training error of h on the sample S is a good estimate of its probability of success on the new x. In other words, there is no overfitting.

The Mathematical Justification – Formally: If S is sampled i.i.d. by some probability P over X × {0,1}, then with probability > 1 − δ, for all h in H,

Er_P(h) ≤ Er_S(h) + √( (VCdim(H) + ln(1/δ)) / |S| )   (up to constants),

where Er_S(h) is the training error, Er_P(h) is the expected test error, and the square-root term is the complexity term.
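As a rough numerical companion to this bound, the sketch below plugs VC-dimension, sample size, and confidence δ into one common textbook form of the complexity term; the exact constants differ between statements of the theorem, so treat the particular formula as an illustrative assumption.

```python
import math

def vc_generalization_gap(vc_dim, sample_size, delta):
    """One common form of the gap between training and test error:
    sqrt((d * (ln(2m/d) + 1) + ln(4/delta)) / m), with d = VCdim(H), m = |S|."""
    m, d = sample_size, vc_dim
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)

# Half-spaces in R^10 have VC-dimension 11; with 10,000 examples and delta = 0.05,
# the training error estimates the test error to within roughly this gap:
print(vc_generalization_gap(vc_dim=11, sample_size=10_000, delta=0.05))
```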

The Types of Errors to be Considered. [Diagram: the class H, the best regressor for P, the best h in H for P, and the training-error minimizer.] Total error = approximation error (the gap between the best possible regressor for P and the best h in H) + estimation error (the gap between the best h in H and the training-error minimizer).

The Model Selection Problem: Expanding H will lower the approximation error, BUT it will increase the estimation error (lower statistical soundness).

Yet another problem – Computational Complexity: Once we have a large enough training sample, how much computation is required to search for a good (that is, empirically good) hypothesis?

The Computational Problem: Given a class H of subsets of R^n. Input: a finite set S of {0,1}-labeled points in R^n. Output: some hypothesis h in H that maximizes the number of correctly labeled points of S.

Hardness-of-Approximation Results: For each of the following classes, approximating the best agreement rate achievable by an h in H (on a given input sample S) up to some constant ratio is NP-hard: monomials, monotone monomials, half-spaces, balls, axis-aligned rectangles, and constant-width threshold neural networks (results of Ben-David-Eiron-Long and Bartlett-Ben-David).

The Types of Errors to be Considered (with computation). [Diagram: the class H, the best regressor for D, the best h in H for D, the training-error minimizer, and the output of the learning algorithm.] Total error = approximation error + estimation error + computational error (the gap between the empirical-risk minimizer and the hypothesis the algorithm actually outputs).

Our hypothesis class should balance several requirements: Expressiveness – being able to capture the structure of our learning task. Statistical compactness – having low combinatorial complexity. Computational manageability – existence of efficient ERM algorithms.

Concrete learning paradigm – linear separators. The predictor h: h(x) = sign( Σ_i w_i x_i + b ), where w is the weight vector of the hyperplane h, and x = (x_1, ..., x_i, ..., x_n) is the example to classify.
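As a direct transcription of this predictor into code, here is a small sketch; the weight vector and bias values are arbitrary illustrative choices, not anything prescribed by the slides.

```python
def predict(w, b, x):
    """Linear separator: sign of <w, x> + b, returned as +1 / -1."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if activation >= 0 else -1

w = [2.0, -1.0]       # weight vector of the hyperplane (example values)
b = 0.5               # bias term
print(predict(w, b, [1.0, 3.0]))   # -1: this point falls on the negative side
```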

Potential problem – data may not be linearly separable

The SVM Paradigm: Choose an embedding of the domain X into some high-dimensional Euclidean space, so that the data sample becomes (almost) linearly separable. Find a large-margin data-separating hyperplane in this image space, and use it for prediction. Important gain: when the data is separable, finding such a hyperplane is computationally feasible.

The SVM Idea: an Example. [Figure: one-dimensional data embedded by x → (x, x²), turning a sample that is not separable on the line into a linearly separable one in the plane.]
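A sketch of the example's effect in code, under the assumption (consistent with the figure) that the positive points lie in a middle interval of the line: no single threshold separates them, but after the embedding x → (x, x²) a horizontal line in the plane does. The data values are made up for illustration.

```python
# One-dimensional data: positives in the middle, negatives on both sides,
# so no single threshold on the line separates the two classes.
points = [(-2.0, -1), (-1.5, -1), (-0.5, +1), (0.3, +1), (1.6, -1), (2.2, -1)]

def embed(x):
    """The embedding used in the slide's example: x -> (x, x^2)."""
    return (x, x * x)

# After embedding, the hyperplane x2 = 1.0 separates the classes:
# positives have x^2 < 1, negatives have x^2 > 1.
for x, label in points:
    x1, x2 = embed(x)
    predicted = +1 if x2 < 1.0 else -1
    assert predicted == label
print("linearly separable after embedding")
```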

Controlling Computational Complexity: Potentially the embeddings may require very high Euclidean dimension. How can we search for hyperplanes efficiently? The Kernel Trick: use algorithms that depend only on the inner products of sample points.
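One way to see why depending only on inner products helps is the following sketch, which assumes the quadratic kernel K(x, z) = ⟨x, z⟩² as an example: it can be computed directly in the original space, yet it equals the inner product under an explicit degree-2 embedding that is never constructed.

```python
import itertools

def quadratic_kernel(x, z):
    """K(x, z) = <x, z>^2, computed directly in the original space."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def explicit_embedding(x):
    """The degree-2 embedding whose inner product the kernel reproduces:
    all pairwise products x_i * x_j."""
    return [a * b for a, b in itertools.product(x, x)]

x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = quadratic_kernel(x, z)
rhs = sum(a * b for a, b in zip(explicit_embedding(x), explicit_embedding(z)))
print(lhs, rhs)   # both equal <x, z>^2 = 4.5^2 = 20.25
```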

Kernel-Based Algorithms: Rather than define the embedding explicitly, define just the matrix of inner products in the range space: the m × m matrix with entries K(x_i, x_j), for i, j = 1, ..., m. Mercer's Theorem: if this matrix is symmetric and positive semi-definite, then it is the inner-product matrix with respect to some embedding.
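A sketch of building such an inner-product (Gram) matrix from a sample and checking Mercer's condition numerically; the Gaussian (RBF) kernel and the sample points are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel: exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

sample = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.0]])
m = len(sample)

# The m x m kernel matrix K with entries K(x_i, x_j).
K = np.array([[rbf_kernel(sample[i], sample[j]) for j in range(m)]
              for i in range(m)])

# Mercer's condition: K is symmetric and positive semi-definite.
symmetric = np.allclose(K, K.T)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)
print(symmetric, psd)   # True True
```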

Support Vector Machines (SVMs): Input: a sample (x_1, y_1), ..., (x_m, y_m) and a kernel matrix K. Output: a good separating hyperplane.

A Potential Problem: Generalization. VC-dimension bounds: the VC-dimension of the class of half-spaces in R^n is n + 1. Can we guarantee low dimension of the embedding's range? Margin bounds: regardless of the Euclidean dimension, generalization can be bounded as a function of the margins of the hypothesis hyperplane. Can one guarantee the existence of a large-margin separation?

The Margins of a Sample: margin = max over separating hyperplanes h of min over sample points x_i of ⟨w_h, x_i⟩, where w_h is the weight vector of the hyperplane h.
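As an illustration, the sketch below computes the inner quantity of this expression, i.e. the margin attained by one fixed separating hyperplane on a sample, assuming the hyperplane passes through the origin and is described by its weight vector (normalized to unit length); the data and weight vector are made up.

```python
import numpy as np

def margin(w, sample):
    """Smallest distance from any sample point to the hyperplane <w, x> = 0,
    assuming w is normalized and the hyperplane separates the sample."""
    w = np.asarray(w) / np.linalg.norm(w)
    return min(abs(np.dot(w, x)) for x, _ in sample)

sample = [(np.array([1.0, 2.0]), +1), (np.array([2.0, 1.5]), +1),
          (np.array([-1.0, -1.0]), -1), (np.array([-2.0, -0.5]), -1)]
w = [1.0, 1.0]   # a separating direction for this toy sample
print(margin(w, sample))   # distance of the closest point to the hyperplane
```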

Summary of SVM learning: 1. The user chooses a kernel matrix – a measure of similarity between input points. 2. Upon viewing the training data, the algorithm finds a linear separator that maximizes the margins (in the high-dimensional feature space).
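For readers who want to try these two steps, here is a sketch using scikit-learn's SVC (an off-the-shelf SVM implementation, not something specified in the slides); the kernel choice, parameters, and toy data are illustrative assumptions.

```python
from sklearn.svm import SVC

# Step 1: the user chooses a kernel (here, Gaussian/RBF with gamma = 0.5).
clf = SVC(kernel="rbf", gamma=0.5, C=1.0)

# Step 2: the algorithm finds a maximum-margin separator in the feature space.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
y = [-1, -1, +1, +1]
clf.fit(X, y)

# Points near each cluster should be assigned to that cluster's class.
print(clf.predict([[0.1, 0.0], [0.9, 1.1]]))   # expected: [-1  1]
```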

How are the basic requirements met? Expressiveness – by allowing all types of kernels there is (potentially) high expressive power. Statistical compactness – only if we are lucky and the algorithm finds a good separator with large margins. Computational manageability – it turns out that the search for a large-margin classifier can be done in time polynomial in the input size.