Analysis of greedy active learning Sanjoy Dasgupta UC San Diego

Standard learning model. Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: need m to be roughly d/ε, in the realizable case.

Active learning. Unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

Can adaptive querying help? Simple hypothesis class: threshold functions on the real line, h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. Start with m ≈ 1/ε unlabeled points. Binary search – need just log m labels, from which the rest can be inferred! An exponential improvement in sample complexity.
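To make the threshold example concrete, here is a minimal sketch (not part of the slides; the function names are illustrative): binary search over the sorted pool recovers the labels of all m points with roughly log2(m) label queries.

import random

def binary_search_active_learner(points, query_label):
    # Infer the labels of all points under a threshold classifier h_w(x) = 1(x >= w)
    # using about log2(m) label queries, via binary search on the sorted points.
    xs = sorted(points)
    lo, hi = 0, len(xs)              # invariant: xs[:lo] are labeled 0, xs[hi:] are labeled 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == 1:   # threshold lies at or below xs[mid]
            hi = mid
        else:                           # threshold lies above xs[mid]
            lo = mid + 1
    labels = {x: int(i >= lo) for i, x in enumerate(xs)}
    return labels, queries

random.seed(0)
pts = [random.random() for _ in range(1000)]
w = 0.37                                            # hidden target threshold
labels, q = binary_search_active_learner(pts, lambda x: int(x >= w))
assert all(labels[x] == int(x >= w) for x in pts)
print("labels recovered with", q, "queries")        # about log2(1000), roughly 10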

Binary search. [Figure: query tree with internal nodes X1?, X6?, X8? and leaves labeled by data points.] m data points: there are effectively m+1 different hypotheses. The query tree has m+1 leaves and depth ≈ log m. Question: Is this a general phenomenon? For other hypothesis classes, is a generalized binary search possible?

Bad news – I. H = {linear separators in R^1}: active learning reduces sample complexity from m to log m. But H = {linear separators in R^2}: there are some target hypotheses for which all m labels need to be queried! (No matter how benign the input distribution.) In this case, learning to accuracy ε requires 1/ε labels…

The benefit of averaging. For linear separators in R^2: in the worst case over target hypotheses, active learning offers no improvement in sample complexity. But there is a query tree in which the depths of the O(m^2) target hypotheses are spread almost evenly over [log m, m]. The average depth is just log m. Question: does active learning help only in a Bayesian model?

Degrees of Bayesian-ity. Prior π over hypotheses. Pseudo-Bayesian model: the prior is used only to count queries. Bayesian model: the prior is used for counting queries and also for the generalization bound. Different stopping criteria – suppose the remaining version space consists of a high π-mass region and a low π-mass region.

Effective hypothesis class. Fix a hypothesis class H of VC dimension d < ∞, and a set of unlabeled examples x_1, x_2, …, x_m, where m ≥ d/ε. Sauer's lemma: H can label these points in at most m^d different ways… the effective hypothesis class H_eff = {(h(x_1), h(x_2), …, h(x_m)) : h ∈ H} has size |H_eff| ≤ m^d. Goal (in the realizable case): pick the element of H_eff which is consistent with all the hidden labels, while asking for just a small subset of these labels.
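For finite pools, H_eff can be enumerated directly as the set of distinct label vectors the hypotheses induce on the pool; a small illustrative sketch (not from the slides):

def effective_hypothesis_class(hypotheses, unlabeled_points):
    # H_eff: the set of distinct label vectors the hypotheses induce on the pool.
    return {tuple(int(h(x)) for x in unlabeled_points) for h in hypotheses}

# Example: threshold functions on 5 points induce m + 1 = 6 distinct labelings.
points = [0.1, 0.3, 0.5, 0.7, 0.9]
H = [lambda x, w=w: x >= w for w in [i / 100.0 for i in range(101)]]
print(len(effective_hypothesis_class(H, points)))   # 6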

Model of an active learner. Query tree: [Figure: query tree with internal nodes X1?, X6?, X8?, X3? and leaves h1, h2, h3, h5, h6.] Each leaf is annotated with an element of H_eff. Weights π over H_eff. Goal: a tree T of small average depth, Q(T, π) = Σ_h π(h) · depth(h) (can also use random coin flips at internal nodes). Question: in this averaged model, can we always find a tree of depth o(m)?
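Q(T, π) is just the π-weighted expected leaf depth. A small sketch with an illustrative tree data structure (names are mine, not from the slides):

from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class Node:
    # Internal nodes hold a query index; leaves hold a hypothesis name.
    query: Optional[int] = None
    left: Optional["Node"] = None    # subtree for label 0
    right: Optional["Node"] = None   # subtree for label 1
    hypothesis: Optional[str] = None # set only at leaves

def average_depth(tree: Node, pi: Dict[str, float], depth: int = 0) -> float:
    # Q(T, pi) = sum over leaves h of pi(h) * depth(h).
    if tree.hypothesis is not None:
        return pi.get(tree.hypothesis, 0.0) * depth
    return (average_depth(tree.left, pi, depth + 1) +
            average_depth(tree.right, pi, depth + 1))

# Example: a depth-2 tree over three hypotheses with a uniform prior.
leaf = lambda h: Node(hypothesis=h)
tree = Node(query=1, left=leaf("h1"), right=Node(query=2, left=leaf("h2"), right=leaf("h3")))
print(average_depth(tree, {"h1": 1/3, "h2": 1/3, "h3": 1/3}))   # (1 + 2 + 2) / 3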

Bad news – II. Pick any d > 0 and m ≥ 2d. There is an input space of size m and a hypothesis class H of VC dimension d such that (for uniform π) any active learning strategy requires ≥ m/8 queries on average. Choose: Input space = any {x_1, …, x_m}; H = all concepts which are positive on exactly d inputs.
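The hypothesis class in this construction can be written down explicitly; an illustrative sketch (names are mine):

from itertools import combinations

def positive_on_exactly_d(m, d):
    # All labelings of m points that are positive on exactly d of them (VC dimension d).
    return {tuple(1 if i in pos else 0 for i in range(m))
            for pos in combinations(range(m), d)}

H = positive_on_exactly_d(m=8, d=2)
print(len(H))   # C(8, 2) = 28 hypotheses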

A revised goal. Depending on the choice of π, the hypothesis class, and perhaps the input distribution, the average number of labels needed by an optimal active learner is somewhere in the range [d log m, m]. Ideal case: d log m (perfect binary search). Worst case: m (randomly chosen labels). (Within constants.) Is there a generic active learning strategy which always achieves close to the optimal number of queries, no matter what it might be?

Heuristics for active learning. A common strategy in many heuristics – the greedy strategy: after seeing t labels, the remaining version space is some H_t; always choose the point which most evenly divides H_t, according to π-mass. For instance, Tong-Koller (2000), linear separators: π ∝ volume. Question: How good is this greedy scheme? And how does its performance depend on the choice of π?
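A minimal sketch of the greedy rule, assuming the finite-pool representation above (version space as a set of labelings, π as a weight dictionary); the helper names are illustrative:

def greedy_query(version_space, pi, asked):
    # Pick the unqueried point whose split of the version space is most balanced in pi-mass.
    # version_space: set of labelings (tuples of 0/1 over the m points)
    # pi:            dict mapping labeling -> prior weight
    # asked:         set of point indices already queried
    total = sum(pi[h] for h in version_space)
    m = len(next(iter(version_space)))
    best_i, best_balance = None, -1.0
    for i in range(m):
        if i in asked:
            continue
        mass_pos = sum(pi[h] for h in version_space if h[i] == 1)
        balance = min(mass_pos, total - mass_pos)   # larger means a more even split
        if balance > best_balance:
            best_i, best_balance = i, balance
    return best_i

def update_version_space(version_space, i, label):
    # Keep only the labelings consistent with the observed label of point i.
    return {h for h in version_space if h[i] == label}

Repeatedly calling greedy_query and update_version_space until a single labeling remains traces out one root-to-leaf path of the greedy tree T_G.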

Greedy active learning. Choose any π. How does the greedy query tree T_G compare to the optimal tree T*? Upper bound: Q(T_G, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)). Example: for uniform π, the approximation ratio is log |H_eff| ≤ d log m. Lower bounds. [1] Uniform π: we have an example in which Q(T_G, π) ≥ Q(T*, π) · Ω(log |H_eff| / log log |H_eff|). [2] Non-uniform π: an example where π ranges between 1/2 and 1/2^n, and Q(T_G, π) ≥ Q(T*, π) · Ω(n).

Sub-optimality of greedy scheme. [1] The case of uniform π. There are simple examples in which the greedy scheme uses Ω(log n / log log n) times the optimal number of labels. (a) The hypothesis class consists of several clusters. (b) Each cluster is efficiently searchable. (c) But first the version space must be narrowed down to one of these clusters: an inefficient process. [Invoke this construction recursively.] The optimal strategy reduces entropy only gradually at first, then ramps it up later – an over-eager greedy scheme is fooled.

Sub-optimality, cont'd. [2] The case of general π. For any n ≥ 2: there is a hypothesis class H of size 2n+1 and distribution π over H such that: (a) π ranges from 1/2 to 1/2^(n+1); (b) the optimal expected number of queries is < 3; (c) the greedy strategy uses ≥ n/2 queries on average. [Figure: H = {h_0, h_11, …, h_1n, h_21, …, h_2n}, with π(h) proportional to area.]

Sub-optimality, cont'd. Three types of queries: (i) Is the target some h_1i? (ii) Is it some h_2i? (iii) Is it h_1j or h_2j?

Upper bound: overview. Upper bound: Q(T_G, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)). If the optimal tree is short, then either: there is a query which (in expectation) cuts off a good chunk of the version space, or: some particular hypothesis has high weight. At least in the first case, the greedy scheme gets off to a good start [cf. Johnson's argument for set cover].

Quality of a query. Need a notion of query quality which can only decrease with time. If S is a version space, and query x_i splits it into S^+, S^-, we'll say that "x_i shrinks (S, π)" by 2 π(S^+) π(S^-) / π(S). Claim: If x_i shrinks (H_eff, π) by Δ, then it shrinks (S, π) by at most Δ for any S ⊆ H_eff.
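The shrinkage quantity is easy to compute in the same representation as the greedy sketch above; a small illustrative sketch using the formula just stated:

def shrinkage(version_space, pi, i):
    # Shrinkage of query x_i on version space S: 2 * pi(S+) * pi(S-) / pi(S),
    # where S+/S- are the labelings in S that assign x_i the label 1/0.
    mass = lambda hs: sum(pi[h] for h in hs)
    s_pos = {h for h in version_space if h[i] == 1}
    s_neg = {h for h in version_space if h[i] == 0}
    total = mass(version_space)
    return 0.0 if total == 0 else 2 * mass(s_pos) * mass(s_neg) / total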

When is the optimal tree short? Claim: Pick any S ⊆ H_eff, and any tree T whose leaves include all of S. Then there must be a query which shrinks (S, π_S) by at least (1 – CP(π_S)) / Q(T, π_S). Here: π_S is π restricted to S, and CP(ν) = Σ_h ν(h)^2 (the collision probability).
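And the collision probability from the claim, in the same representation (illustrative sketch):

def collision_probability(version_space, pi):
    # CP(pi_S) = sum over h in S of pi_S(h)^2, where pi_S is pi renormalized over S.
    total = sum(pi[h] for h in version_space)
    return sum((pi[h] / total) ** 2 for h in version_space)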

Main argument. If the optimal tree has small average depth, then there are two possible cases. Case one: there is some query which shrinks the version space significantly. In this case, the greedy strategy will find such a query and clear progress will be made. The resulting subtrees, considered together, will also require few queries.

Proof, cont'd. Case two: some classifier h* has very high π-mass. In this case, the version space might shrink by just an insignificant amount in one round. But: in roughly the number of queries that the optimal strategy requires for target h*, the greedy strategy will either eliminate h* or declare it to be the answer. In the former case, by the time h* is eliminated, the version space will have shrunk significantly. These two cases form the basis of an inductive argument.

An open problem. Just about the only positive result in active learning: [FSST97] Query by committee: if the data distribution is uniform over the unit sphere, one can learn homogeneous linear separators using just O(d log 1/ε) labels. But the minute we allow non-homogeneous hyperplanes, the query complexity increases to 1/ε… What's going on?