Learning with General Similarity Functions Maria-Florina Balcan.

Presentation transcript:

Learning with General Similarity Functions Maria-Florina Balcan

2-Minute Version
Generic classification problem. Problem: the pixel representation is not so good.
Powerful technique: use a kernel, a special kind of similarity function K(·,·). But the standard theory is in terms of implicit mappings.
Our work [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]: develop a theory that views K as a measure of similarity, with general sufficient conditions for K to be useful for learning.

3 Kernel Methods
Prominent method for supervised classification today: the learning algorithm interacts with the data via a similarity function.
What is a kernel? A kernel K is a legal definition of dot product, i.e., there exists an implicit mapping Φ such that K(x,y) = Φ(x)·Φ(y). E.g., K(x,y) = (x·y + 1)^d: Φ maps the n-dimensional space to an n^d-dimensional space.
Why do kernels matter? Many algorithms interact with data only via dot products, so if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional Φ-space.

4 Example
E.g., for n=2, d=2, the kernel K(x,y) = (x·y)^d corresponds to mapping the original space (x₁, x₂) to the Φ-space (z₁, z₂, z₃). [Figure: data that is not linearly separable in the original space becomes linearly separable in the Φ-space.]
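To make the implicit mapping concrete, here is a minimal Python sketch (numpy only; the function names are mine, not from the talk) checking that the degree-2 kernel K(x,y) = (x·y)² on R² equals the dot product of the explicit features φ(x) = (x₁², x₂², √2·x₁x₂):

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2 on R^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2),
    chosen so that K(x, y) = phi(x) . phi(y)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(poly2_kernel(x, y), phi(x) @ phi(y))
print("K(x, y) =", poly2_kernel(x, y))
```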

5 Generalize Well if Good Margin
If the data is linearly separable by a large margin in the Φ-space, then good sample complexity. If the margin is γ in the Φ-space (with |Φ(x)| ≤ 1), then a sample size of only Õ(1/γ²) is needed to get confidence in generalization.
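For reference, a standard textbook form of this margin bound, not spelled out on the slide (assuming |Φ(x)| ≤ 1 and a separator consistent with the sample at margin γ): with probability at least 1−δ, a sample of size m = Õ((1/ε)(1/γ² + log(1/δ))) suffices for the large-margin separator to have error at most ε, which matches the Õ(1/γ²) dependence quoted above.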

6 Kernel Methods
Prominent method for supervised classification today: a significant percentage of papers at ICML, NIPS, and COLT. Very useful in practice for dealing with many different types of data.

7 Limitations of the Current Theory
Existing theory: in terms of margins in implicit spaces; difficult to think about, not great for intuition.
In practice: kernels are constructed by viewing them as measures of similarity, yet the kernel requirement rules out many natural similarity functions.
Better theoretical explanation?

8 Better Theoretical Framework
Existing theory: in terms of margins in implicit spaces; difficult to think about, not great for intuition.
In practice: kernels are constructed by viewing them as measures of similarity, yet the kernel requirement rules out natural similarity functions.
Better theoretical explanation? Yes! We provide a more general and intuitive theory that formalizes the intuition that a good kernel is a good measure of similarity. [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]

9 More General Similarity Functions
We provide a notion of a good similarity function that:
1) is simpler, in terms of natural direct quantities: no implicit high-dimensional spaces, no requirement that K(x,y) = Φ(x)·Φ(y);
2) is broad: includes the usual notion of a good kernel (one with a large margin separator in the Φ-space);
3) allows one to learn classes that have no good kernels.
[Diagram contrasting good kernels (large margin separator in the Φ-space), the first-attempt notion, and the main notion, all implying that K can be used to learn well.]

10 A First Attempt
P is a distribution over labeled examples (x, ℓ(x)). Goal: output a classification rule that is good for P.
Definition: K is (ε, γ)-good for P if a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ,
i.e., the average similarity to points of the same label exceeds the average similarity to points of the opposite label by a gap γ. In words: K is good if most x are on average more similar to points y of their own type than to points y of the other type.

11 A First Attempt
Definition: K is (ε, γ)-good for P if a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
Example: K(x,y) ≥ 0.2 when ℓ(x) = ℓ(y), and K(x,y) is random in {−1, 1} when ℓ(x) ≠ ℓ(y); the cross-label average is 0, so the gap is at least 0.2.
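As a sanity check of the definition, the following Python sketch (the helper name, the toy similarity, and the data are illustrative assumptions, not from the talk) estimates each point's average similarity to same-label and opposite-label points and reports the fraction that fail the gap condition for a given γ:

```python
import numpy as np

def fraction_violating_gap(K, points, labels, gamma):
    """Fraction of points x for which the empirical version of
    E_y[K(x,y) | same label] >= E_y[K(x,y) | other label] + gamma fails."""
    n, violations = len(points), 0
    for i in range(n):
        same = [K(points[i], points[j]) for j in range(n)
                if j != i and labels[j] == labels[i]]
        diff = [K(points[i], points[j]) for j in range(n)
                if labels[j] != labels[i]]
        if np.mean(same) < np.mean(diff) + gamma:
            violations += 1
    return violations / n

# Toy similarity mimicking the slide's example: 0.2 for same-label pairs,
# random +/-1 (mean zero) for cross-label pairs.
rng = np.random.default_rng(1)
def K(x, y):
    return 0.2 if x[1] == y[1] else rng.choice([-1.0, 1.0])

points = [(i, i % 2) for i in range(200)]   # (id, label) pairs
labels = [p[1] for p in points]
print("violation rate at gamma = 0.1:",
      fraction_violating_gap(K, points, labels, 0.1))
```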

12 A First Attempt
Definition: K is (ε, γ)-good for P if a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
Algorithm: Draw sets S⁺, S⁻ of positive and negative examples. Classify x based on its average similarity to S⁺ versus to S⁻.

13 A First Attempt
Definition: K is (ε, γ)-good for P if a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
Algorithm: Draw sets S⁺, S⁻ of positive and negative examples. Classify x based on its average similarity to S⁺ versus to S⁻.
Theorem: If |S⁺| and |S⁻| are Ω((1/γ²) ln(1/(δε′))), then with probability ≥ 1−δ, the error is ≤ ε + ε′.
Proof sketch: For a fixed good x, the probability of error w.r.t. x (over the draw of S⁺, S⁻) is at most δε′ [Hoeffding]. Hence there is at most a δ chance that the error rate over the good points is ≥ ε′, so the overall error rate is ≤ ε + ε′.
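A minimal Python sketch of this first-attempt rule (numpy only; names and toy data are mine): classify a new point by comparing its average similarity to a positive sample S⁺ versus a negative sample S⁻.

```python
import numpy as np

def average_similarity_classifier(K, S_plus, S_minus):
    """h(x) = sign( mean_{y in S+} K(x,y) - mean_{y in S-} K(x,y) )."""
    def h(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return 1 if avg_pos >= avg_neg else -1
    return h

# Toy usage with K(x, y) = x . y on two Gaussian blobs.
rng = np.random.default_rng(0)
S_plus = rng.normal(loc=[+1.0, +1.0], size=(50, 2))
S_minus = rng.normal(loc=[-1.0, -1.0], size=(50, 2))
h = average_similarity_classifier(np.dot, S_plus, S_minus)
print(h(np.array([0.8, 1.2])), h(np.array([-1.0, -0.5])))   # expect 1 and -1
```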

14 A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
The similarity function K(x,y) = x·y can fail this definition even when the data has a large margin separator. In the 30° example shown, a point's average similarity to its own label is ½·1 + ½·(−½) = ¼, which is less than its average similarity of ½ to the typical point of the opposite label.

15 A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ.
Broaden: ∃ a non-negligible set R of "reasonable" points s.t. most x are on average more similar to the y ∈ R of their own label than to the y ∈ R of the other label [even if we do not know R in advance].

16 Broader Definition
K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (possibly probabilistic) s.t. a 1−ε fraction of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x), R(y)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x), R(y)] + γ,
and at least a τ prob. mass of reasonable positives and negatives.
Property: Draw a set S = {y₁, …, y_d} of landmarks and re-represent the data via F: x → F(x) = [K(x,y₁), …, K(x,y_d)] ∈ R^d. If there are enough landmarks (d = Õ(1/(γ²τ))), then with high probability F(P) has a good L₁ large-margin linear separator, e.g. w = [0, 0, 1/n⁺, 1/n⁺, 0, 0, 0, −1/n⁻, 0, 0].

17 Broader Definition
K is (ε, γ, τ)-good if ∃ a set R of "reasonable" y (possibly probabilistic) s.t. a 1−ε fraction of x satisfy
E_{y~P}[K(x,y) | ℓ(y)=ℓ(x), R(y)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x), R(y)] + γ,
and at least a τ prob. mass of reasonable positives and negatives.
Algorithm: Draw a set S = {y₁, …, y_d} of landmarks and re-represent the data via F(x) = [K(x,y₁), …, K(x,y_d)] ∈ R^d. Then take a new set of labeled examples, project it to this space, and run a good L₁ linear separator algorithm. Sample sizes: d_u = Õ(1/(γ²τ)) unlabeled landmarks and d_l = O((1/(γ²ε_acc)) ln(d_u)) labeled examples.
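Here is a hedged Python sketch of the landmark-based algorithm, using scikit-learn's L1-regularized logistic regression as a stand-in for "a good L₁ linear separator algorithm"; the similarity function, data, and parameter choices are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(K, X, landmarks):
    """Map each x to F(x) = [K(x, y_1), ..., K(x, y_d)] over the landmarks."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Illustrative setup: a (possibly non-PSD) similarity and two Gaussian blobs.
rng = np.random.default_rng(0)
K = lambda x, y: np.tanh(np.dot(x, y))          # need not be a valid kernel
X = np.vstack([rng.normal(loc=[+1, +1], size=(100, 2)),
               rng.normal(loc=[-1, -1], size=(100, 2))])
y = np.array([1] * 100 + [-1] * 100)

landmarks = X[rng.choice(len(X), size=30, replace=False)]   # unlabeled landmarks
F = landmark_features(K, X, landmarks)                      # re-represented data

# L1-regularized linear separator in the landmark space.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(F, y)
print("training accuracy:", clf.score(F, y))
```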

18 Kernels versus Similarity Functions
Main technical contribution.
Theorem: If K is a good kernel, then K is also a good similarity function (but γ gets squared): if K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense.
[Diagram: our work relates good kernels to good similarities.]

19 Kernels versus Similarity Functions
Theorem: If K is a good kernel, then K is also a good similarity function (but γ gets squared).
Can also show a strict separation: for any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
[Diagram: good similarities are strictly more general than good kernels.]

20 Kernels versus Similarity Functions
Can also show a strict separation.
Theorem: For any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
In principle, one should be able to learn from O(ε⁻¹ log(|C|/δ)) labeled examples.
Claim 1: one can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound (assuming D is not too concentrated).
Claim 2: there is no (ε, γ)-good kernel in hinge loss, even for ε = 1/2 and γ = |C|⁻¹ᐟ², so the margin-based sample complexity is d = Ω(1/γ²) = Ω(|C|).

21 Learning with Multiple Similarity Functions
Let K₁, …, K_r be similarity functions s.t. some (unknown) convex combination of them is (ε, γ)-good.
Algorithm: Draw a set S = {y₁, …, y_d} of landmarks and concatenate the features: F(x) = [K₁(x,y₁), …, K_r(x,y₁), …, K₁(x,y_d), …, K_r(x,y_d)].
Guarantee: Whp the induced distribution F(P) in R^{dr} has a separator of error ≤ ε + δ at a large L₁ margin. Sample complexity only increases by a log(r) factor!
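A sketch of this multiple-similarity extension in Python (it reuses the landmark idea above; the three example similarity functions and all names are illustrative assumptions): concatenate, for each landmark, the values of all r similarity functions, then learn an L₁ linear separator on the result as before.

```python
import numpy as np

def multi_similarity_features(Ks, X, landmarks):
    """F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_d), ..., K_r(x,y_d)]."""
    return np.array([[K(x, y) for y in landmarks for K in Ks] for x in X])

# Example with r = 3 candidate similarity functions.
Ks = [
    lambda x, y: np.dot(x, y),                       # linear
    lambda x, y: (np.dot(x, y) + 1) ** 2,            # polynomial
    lambda x, y: np.exp(-np.sum((x - y) ** 2)),      # Gaussian-style
]
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
landmarks = X[:5]
F = multi_similarity_features(Ks, X, landmarks)
print(F.shape)   # (20, 5 * 3) = (20, d * r)
```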

Conclusions
A theory of learning with similarity functions that provides a formal way of understanding good kernels as good similarity functions. Our algorithms work for similarity functions that aren't necessarily PSD (or even symmetric).
Algorithmic implications: one can use non-PSD similarities directly; there is no need to "transform" them into PSD functions and plug them into an SVM. E.g., Liao and Noble, Journal of Computational Biology.

Conclusions
A theory of learning with similarity functions that provides a formal way of understanding good kernels as good similarity functions. Our algorithms work for similarity functions that aren't necessarily PSD (or even symmetric).
Open questions: analyze other notions of good similarity functions.


25 Similarity Functions for Classification
Algorithmic implications: one can use non-PSD similarities directly; there is no need to "transform" them into PSD functions and plug them into an SVM (e.g., Liao and Noble, Journal of Computational Biology). This gives justification to the rule of re-representing the data via similarities to landmarks and learning a linear separator, as in the algorithm above. We also show that anything learnable with an SVM is learnable this way!