A Theory of Learning and Clustering via Similarity Functions


1 A Theory of Learning and Clustering via Similarity Functions
Maria-Florina Balcan Carnegie Mellon University

2 Main Research Directions
New theoretical frameworks and algorithms for key problems in: (1) Machine Learning; (2) Algorithmic Game Theory, in particular algorithms for pricing problems (revenue maximization).

3 Outline of the Talk
Background: important aspects in Machine Learning today. Learning with similarity functions: supervised learning (kernel methods and their limitations; a new framework of general similarity functions) and unsupervised learning (a new framework for clustering). Other work and future directions.

4 Machine Learning
Background. Machine learning is the method of choice for a wide range of applications: spam detection, image classification, document categorization, speech recognition, protein classification, branch prediction, and fraud detection. What is common across these applications?

5 Machine Learning
Background. All of these applications (spam detection, image classification, document categorization, speech recognition, protein classification, branch prediction, fraud detection) are instances of supervised learning: our generic classification problem.

6 Example: Supervised Classification
Background. Decide which emails are spam and which are important (supervised classification: spam vs. not spam). Take a sample S of emails, labeled according to whether they were or weren't spam. Goal: use the emails seen so far to produce a good prediction rule for future data.

7 Example: Supervised Classification
Background. Represent each message by features (e.g., keywords, spelling, etc.), giving labeled examples. Given such data, some reasonable rules might be: predict SPAM if unknown AND (money OR pills); or predict SPAM if 2*money + 3*pills - 5*known > 0. More generally, we can think of linear separators as nice geometric decision surfaces; this data is linearly separable.
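To make the second toy rule concrete, here is a minimal Python sketch (my own illustration, not from the talk; the interpretation that "known" flags a known sender is an assumption):

    def predict_spam(features):
        # The slide's toy linear rule: SPAM when 2*money + 3*pills - 5*known > 0,
        # where money/pills count keyword occurrences and known is 1 for a known sender.
        score = 2 * features["money"] + 3 * features["pills"] - 5 * features["known"]
        return "SPAM" if score > 0 else "NOT SPAM"

    print(predict_spam({"money": 1, "pills": 1, "known": 0}))  # -> SPAM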

8 Two Main Aspects in Machine Learning
Background. (1) Algorithm design: how to optimize? Automatically generate rules that do well on the observed data, e.g., our best algorithms for learning linear separators. (2) Confidence bounds and generalization guarantees: confidence that a rule will be effective on future data. Both are well understood for supervised learning.

9 What if Not Linearly Separable?
Background. Problem: the data is not linearly separable in the most natural feature representation. Example: no good linear separator in pixel representation. Solutions: Classic: "learn a more complex class of functions". Modern: "use a kernel" (the prominent method today).

10 Learning with Similarity Functions
Contributions. Kernels are a special kind of similarity function K(x,y): the prominent method today, but with a difficult theory. My work: methods for more general similarity functions, with a more tangible, direct theory; this also helps in the design of good kernels for new learning tasks. [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, COLT 2008] [Balcan-Blum-Srebro, MLJ 2008] (Described in a few minutes.)

11 One Other Major Aspect in Machine Learning
Background. Where do we get the data, and what type of data do we use in learning? Traditional methods learn from labeled examples only. Modern applications have lots of unlabeled data, while labeled data is rare or expensive: web page and document classification, OCR, image classification, biology classification problems.

12 Incorporating Unlabeled Data in the Learning Process
Background and contributions. Areas of significant importance and activity: semi-supervised learning (using cheap unlabeled data in addition to labeled data) and active learning (the algorithm interactively asks for labels of informative examples). A unified theoretical understanding was lacking. My work: foundational theoretical understanding; analysis of practically successful existing algorithms as well as new ones. [BB, COLT 2005] [BB, book chapter, "Semi-Supervised Learning", 2006] [BBY, NIPS 2004] [BBL, ICML 2006] [BBL, JCSS 2008] [BBZ, COLT 2007] [BHW, COLT 2008]

13 Outline of the Talk
Background: important aspects in Machine Learning today. Learning with similarity functions: supervised learning (kernel methods and their limitations; a new framework of general similarity functions) and unsupervised learning (a new framework for clustering). Other work and future directions.

15 Kernel Methods
Kernels. The prominent method for supervised classification today. What is a kernel? A kernel K is a legal definition of a dot product: there exists an implicit mapping φ such that K(x,y) = φ(x)·φ(y). E.g., K(x,y) = (x·y + 1)^d maps an n-dimensional space into an n^d-dimensional φ-space. Why do kernels matter? Many algorithms interact with data only via dot products, so if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional φ-space.
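To make the last point concrete, here is a minimal Python sketch (my own illustration, not part of the talk) of a dual perceptron that touches the data only through K, here the polynomial kernel from this slide:

    import numpy as np

    def poly_kernel(x, y, d=2):
        # K(x, y) = (x . y + 1)^d, the polynomial kernel mentioned above
        return (np.dot(x, y) + 1) ** d

    def kernel_perceptron(X, y, K=poly_kernel, epochs=10):
        # Dual perceptron: the data enters only through K(X[j], X[i]), so the
        # algorithm implicitly works in the high-dimensional phi-space.
        n = len(X)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in range(n):
                score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
                if y[i] * score <= 0:      # mistake: update the dual weight
                    alpha[i] += 1
        return alpha

Swapping in any other kernel (or, as argued later in the talk, a more general similarity function) changes the implicit space without changing the algorithm.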

16 Example
Kernels. K(x,y) = (x·y)^d. E.g., for n=2, d=2, the kernel maps the original space, with coordinates (x1, x2), into a three-dimensional φ-space with coordinates (z1, z2, z3). [Figure: the same data shown in the original space and in the φ-space.]
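For this n = 2, d = 2 case the implicit map can be written out explicitly; a short derivation (a standard identity, added here for concreteness) is:

    K(x,y) = (x \cdot y)^2 = (x_1 y_1 + x_2 y_2)^2
           = x_1^2 y_1^2 + 2\, x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
           = \phi(x) \cdot \phi(y),
    \qquad \phi(x) = \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\bigr).

So the three φ-space coordinates z1, z2, z3 in the figure can be taken to be x1², x2², and √2·x1·x2.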

17 Generalize Well if Good Margin
Kernels. If the data is linearly separable with a margin in the φ-space, then we get good sample complexity: if the margin is γ in the φ-space (with |φ(x)| ≤ 1), then a sample of size only Õ(1/γ²) is needed for confidence in generalization. (An example of a generalization bound.)

18 Kernel Methods
Kernels. The prominent method for supervised classification today: lots of books and workshops, and a significant percentage of ICML, NIPS, and COLT papers (as noted at the ICML 2007 business meeting).

19 Limitations of the Current Theory
Kernels. In practice, kernels are constructed by viewing them as measures of similarity. The existing theory, however, is stated in terms of margins in implicit spaces: difficult to think about and not great for intuition. Moreover, the kernel requirement rules out many natural similarity functions.

20 Better Theoretical Framework
Kernels. Can we do better? Yes: we provide a more general and intuitive theory that formalizes the intuition that a good kernel is a good measure of similarity, while dropping the requirements that make the existing margin-in-implicit-space theory hard to use and that rule out natural similarity functions. [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]

21 More General Similarity Functions
New framework. We provide a notion of a good similarity function that: (1) is simpler, stated in terms of natural direct quantities (no implicit high-dimensional spaces, no requirement that K(x,y) = φ(x)·φ(y)); (2) guarantees that K can be used to learn well; and (3) is broad: it includes the usual notion of a good kernel (a large-margin separator in the φ-space).

22 A First Attempt
New framework. Let P be a distribution over labeled examples (x, l(x)); the goal is to output a classification rule that is good for P. Intuitively, K is good if most x are on average more similar to points y of their own type than to points y of the other type. Formally, K is (ε,γ)-good for P if a 1-ε probability mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ,
i.e., the average similarity to points of the same label exceeds the average similarity to points of the opposite label by a gap of at least γ.

23 A First Attempt: Example
New framework. K is (ε,γ)-good for P if a 1-ε probability mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Example: K(x,y) ≥ 0.2 whenever l(x) = l(y), and K(x,y) is random in {-1,1} whenever l(x) ≠ l(y). [Figure: same-label similarities clustered around values such as 0.3-0.5, opposite-label similarities spread over [-1,1].]

24 A First Attempt: Algorithm
New framework. K is (ε,γ)-good for P if a 1-ε probability mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Algorithm: draw sets S+ and S- of positive and negative examples, then classify x based on its average similarity to S+ versus to S-.
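A minimal Python sketch of this classifier (my own illustration; K is assumed to be any callable similarity, and S_plus, S_minus are the drawn example sets):

    import numpy as np

    def make_average_similarity_classifier(S_plus, S_minus, K):
        # Classify x by comparing its average similarity to the drawn positives
        # versus the drawn negatives, as in the algorithm above.
        def predict(x):
            avg_pos = np.mean([K(x, y) for y in S_plus])
            avg_neg = np.mean([K(x, y) for y in S_minus])
            return 1 if avg_pos >= avg_neg else -1
        return predict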

25 A First Attempt: Guarantee
New framework. K is (ε,γ)-good for P if a 1-ε probability mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Theorem: if |S+| and |S-| are Ω((1/γ²) ln(1/δ')), then with probability ≥ 1-δ the error is ≤ ε + δ'. Proof idea: for a fixed good x, the probability of error with respect to x (over the draw of S+ and S-) is small [Hoeffding], so the expected error rate is small.

26 A First Attempt: Not Broad Enough
New framework. The condition E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ is not broad enough. Counterexample: with the similarity function K(x,y) = x·y, the data can have a large-margin separator and yet not satisfy the definition, because a point can be on average more similar to the typical point of the opposite label (average similarity 1/2) than to points of its own label (average 1/2·1 + 1/2·(-1/2) = 1/4). [Figure: positive and negative regions on the circle with 30° angles marked.]

27 A First Attempt: Not Broad Enough
New framework. E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Broaden the definition: there exists a non-negligible set R such that most x are on average more similar to points y ∈ R of the same label than to points y ∈ R of the other label (even if we do not know R in advance). [Figure: the previous example with a region R highlighted.]

28 Broader Definition
New framework. Ask that there exist a set R of "reasonable" points y (membership in R may be probabilistic) such that almost all x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ,
with at least a non-negligible probability mass of reasonable positives and negatives. Property: draw a set S = {y1, ..., yd} of landmarks and re-represent the data as x → F(x) = [K(x,y1), ..., K(x,yd)] ∈ R^d. If enough landmarks are drawn (d scaling with 1/γ², up to logarithmic factors and the mass of R), then with high probability there exists a linear separator of small error in this space, e.g. w = [0, 0, 1/n+, 1/n+, 0, 0, 0, -1/n-, 0, 0].

29 Broader Definition: Algorithm
New framework. Ask that there exist a set R of "reasonable" points y (membership may be probabilistic) such that almost all x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ,
with at least a non-negligible probability mass of reasonable positives and negatives. Algorithm: draw a set S = {y1, ..., yd} of landmarks and re-represent the data as x → F(x) = [K(x,y1), ..., K(x,yd)] ∈ R^d; then take a new set of labeled examples, project them into this space, and run a linear separator algorithm.
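A sketch of the landmark mapping in Python (my own illustration; the dot-product similarity and the synthetic data are assumptions made only so the snippet runs end to end):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def landmark_features(X, landmarks, K):
        # Re-represent each x as F(x) = [K(x, y1), ..., K(x, yd)].
        return np.array([[K(x, y) for y in landmarks] for x in X])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                   # synthetic data (assumption)
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # synthetic labels (assumption)
    landmarks = X[rng.choice(len(X), size=20, replace=False)]
    F = landmark_features(X, landmarks, lambda a, b: float(a @ b))
    clf = LogisticRegression().fit(F, y)            # linear separator in landmark space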

30 Kernels versus Similarity Functions
New framework. Main technical contributions: good similarity functions are strictly more general than good kernels. Theorem: if K is a good kernel, then K is also a good similarity function (but the margin γ gets squared). We can also show a strict separation, using Fourier analysis. [These are the technically hardest parts.]

31 Similarity Functions for Classification
Summary, Part I. Conceptual contributions. Before: a difficult theory in terms of implicit spaces, not helpful for intuition, and limiting. After our work: a much more intuitive theory with no implicit spaces, which formalizes a common intuition and is provably more general. Algorithmic implications: we can use non-PSD similarities directly, with no need to "transform" them into PSD functions and plug them into an SVM (e.g., Liao and Noble, Journal of Computational Biology).

32 Outline of the Talk
Background: important aspects in Machine Learning today. Learning with similarity functions: supervised learning (kernel methods and their limitations; a new framework of general similarity functions) and unsupervised learning (a new framework for clustering) [Balcan-Blum-Vempala, STOC 2008]. Other work and future directions.

33 What if Only Unlabeled Examples Are Available?
Clustering. S is a set of n objects (e.g., documents), and there exists a ground-truth clustering by topic (e.g., [sports], [fashion]): each x has a label l(x) in {1,...,t}. Goal: a clustering h of low error, where err(h) = min over relabelings σ of Pr_{x~S}[σ(h(x)) ≠ l(x)]. Problem: we have unlabeled data only. But we do have a similarity function!

34 What if Only Unlabeled Examples Are Available?
Clustering. Protocol: there exists a ground-truth clustering for S, i.e., each x in S has a label l(x) in {1,...,t} (e.g., [sports], [fashion]), and the similarity function K has to be related to the ground truth. Input: S and a similarity function K. Output: a clustering of small error.

35 What if Only Unlabeled Examples Are Available?
Clustering. Fundamental question: what natural properties of a similarity function are sufficient to allow one to cluster well?

36 Contrast with Standard Approaches
Clustering. Approximation algorithms: input is a graph or an embedding into R^d; algorithms are scored by approximation ratios; the analysis optimizes various criteria over edges. Mixture models: input is an embedding into R^d; algorithms are scored by error rate; strong probabilistic assumptions. Our approach [Balcan-Blum-Vempala, STOC 2008]: input is a graph or similarity information; algorithms are scored by error rate; discriminative, not generative, with no strong probabilistic assumptions. Much better suited to settings where the input graph or similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.

37 Condition that Trivially Works
Clustering. What natural properties of a similarity function are sufficient to allow one to cluster well? A condition that trivially works: K(x,y) > 0 for all x, y with l(x) = l(y), and K(x,y) < 0 for all x, y with l(x) ≠ l(y).

38 Clustering
What natural properties of a similarity function are sufficient to allow one to cluster well? Candidate: all x are more similar to all y in their own cluster than to any z in any other cluster. Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data. Example: documents about sports (soccer, tennis) and fashion (Lacoste, Gucci) with K(x,x') = 1 within a subtopic, 0.5 within sports or within fashion, and 0 across; both the two-cluster and the four-cluster solutions satisfy the property.

39 Relax Our Goals
Clustering. Produce a hierarchical clustering such that the correct answer is approximately some pruning of it.

40 Obtain a Rich, General Model
Clustering. 1. Produce a hierarchical clustering such that the correct answer is approximately some pruning of it (e.g., "all topics" splits into sports and fashion, which split into soccer, tennis and Lacoste, Gucci). 2. Produce a list of clusterings such that at least one has low error. Trade off the strength of the assumption against the size of the list.

41 Examples of Properties and Algorithms
Clustering. Strict separation property: all x are more similar to all y in their own cluster than to any z in any other cluster; sufficient for hierarchical clustering via the single-linkage algorithm. Stability property: for all clusters C, C' and all A ⊆ C, A' ⊆ C', neither A nor A' is more attracted to the other than to the rest of its own cluster; sufficient for hierarchical clustering via the average-linkage algorithm.
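A minimal SciPy sketch of the two linkage algorithms named above (my own illustration; converting similarities to dissimilarities via S.max() - S is an assumption, not something specified in the talk):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def hierarchy_from_similarity(S, method="single"):
        # Build a hierarchy from a symmetric similarity matrix S:
        # method="single" for strict separation, method="average" for stability.
        D = S.max() - S                    # turn similarities into dissimilarities
        np.fill_diagonal(D, 0.0)
        iu = np.triu_indices_from(D, k=1)  # condensed distance vector for linkage()
        return linkage(D[iu], method=method)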

42 Examples of Properties and Algorithms
Clustering. Average attraction property: E_{x'~C(x)}[K(x,x')] > E_{x'~C'}[K(x,x')] + γ for all clusters C' ≠ C(x). Not sufficient for hierarchical clustering, but one can produce a small list of clusterings (sampling-based algorithm). Stability of large subsets property: for all clusters C, C' and all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn, neither A nor A' is more attracted to the other than to the rest of its own cluster; sufficient for hierarchical clustering, with the hierarchy found by a multi-stage learning-based algorithm.

43 Stability of Large Subsets Property
Clustering. For all clusters C, C', all A ⊂ C and A' ⊆ C' with |A| + |A'| ≥ sn: K(A, C - A) > K(A, A'). Algorithm: (1) generate a list L of candidate clusters using the average-attraction algorithm, ensuring that every ground-truth cluster is f-close to some cluster in L; (2) for every pair (C, C0) in L such that all three parts (the intersection and the two differences) are large: if K(C ∩ C0, C \ C0) ≥ K(C ∩ C0, C0 \ C), throw out C0, else throw out C; (3) clean up and hook the surviving clusters into a tree.
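A rough Python sketch of the pairwise pruning in step (2) (my own reading of the slide; clusters are index sets, S is the similarity matrix, and min_size stands in for the "all three parts are large" threshold):

    import numpy as np

    def avg_sim(S, A, B):
        # Average similarity K(A, B) between two index sets under similarity matrix S.
        return S[np.ix_(sorted(A), sorted(B))].mean()

    def prune_candidates(candidates, S, min_size):
        # For overlapping candidates whose intersection and both differences are
        # large, keep the cluster the intersection is more attracted to.
        live = [set(c) for c in candidates]
        changed = True
        while changed:
            changed = False
            for C in live:
                for C0 in live:
                    if C is C0:
                        continue
                    inter, d1, d2 = C & C0, C - C0, C0 - C
                    if min(len(inter), len(d1), len(d2)) >= min_size:
                        loser = C0 if avg_sim(S, inter, d1) >= avg_sim(S, inter, d2) else C
                        live.remove(loser)
                        changed = True
                        break
                if changed:
                    break
        return live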

44 Similarity Functions for Clustering, Summary
Summary, Part II. Main conceptual contributions: minimal conditions on K to be useful for clustering; a relaxed objective (hierarchy or list) to get a robust theory; a general model that parallels PAC, SLT, and learning with kernels and similarity functions in supervised classification. This is the first framework for analyzing clustering algorithms that makes no strong probabilistic assumptions about the underlying data distribution; we achieve it by relaxing the goal from outputting the correct clustering to outputting a tree such that the correct answer is approximately some pruning of it, or more generally a list of clusterings. Technically most difficult aspects: algorithms for stability of large subsets and for ν-strict separation; algorithms and analysis for the inductive setting, e.g., showing that sampling preserves stability (regularity-based arguments).

45 Similarity Functions, Overall Summary
Supervised classification: generalize and simplify the existing theory of kernels. [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, COLT 2008] [Balcan-Blum-Srebro, MLJ 2008] Unsupervised learning: the first clustering model for analyzing accuracy without strong probabilistic assumptions. [Balcan-Blum-Vempala, STOC 2008]

46 Mechanism Design and Pricing Problems
Other Research Directions

47 Mechanism Design and Pricing Problems
Other research directions. A generic reduction from incentive-compatible auction design to standard algorithm design [BBHM, FOCS 2005] [BBHM, JCSS 2008]. Approximation and online algorithms for pricing problems: revenue maximization in combinatorial auctions, both for single-minded customers and for customers with general valuation functions [BB, EC 2006] [BB, TCS 2007] [BBCH, WINE 2007] [BBM, EC 2008].

48 New Frameworks for Machine Learning
Other research directions. Kernels, margins, random projections, and feature selection [BBV, ALT 2004] [BBV, MLJ 2006]. Incorporating unlabeled data in the learning process: semi-supervised learning (a unified theoretical framework [BB, COLT 2005] [BB, book chapter, 2006]; co-training [BBY, NIPS 2004]) and active learning (agnostic active learning [BBL, ICML 2006] [BBL, JCSS 2008]; margin-based active learning [BBZ, COLT 2007]).

49 Future Directions
Connections between computer science and economics: use machine learning to automate aspects of mechanism design and to analyze complex systems. New frameworks and algorithms for machine learning: interactive learning; similarity functions for learning and clustering (learn a good similarity from data on related problems; other navigational structures, e.g., a small DAG; other notions of "useful" and other types of feedback). Machine learning for other areas of computer science.


