1 Analysis of greedy active learning Sanjoy Dasgupta UC San Diego

2 Standard learning model Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞. VC theory: need m to be roughly d/ε, in the realizable case.

3 Active learning Unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

4 Can adaptive querying help? Simple hypothesis class: threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. Start with m ≈ 1/ε unlabeled points. Binary search: need just log m labels, from which the rest can be inferred! An exponential improvement in sample complexity.
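A minimal sketch of this scheme in Python (the function and oracle names are hypothetical, for illustration): sort the pool, binary-search for the leftmost positive point, and infer every other label from the threshold structure.

import random

def learn_threshold(xs, query_label):
    # Active learning of h_w(x) = 1(x >= w) by binary search.
    # xs: unlabeled pool; query_label(x) returns the 0/1 label of x.
    # Uses about log2(m) label queries; all other labels are inferred.
    xs = sorted(xs)
    lo, hi = 0, len(xs)          # invariant: xs[:lo] negative, xs[hi:] positive
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:
            hi = mid             # leftmost positive is at or before mid
        else:
            lo = mid + 1         # leftmost positive is after mid
    return xs[lo] if lo < len(xs) else float("inf")

# Example: hidden threshold 0.7 over m = 1000 points; ~10 queries suffice.
pool = [random.random() for _ in range(1000)]
asked = []
def oracle(x):
    asked.append(x)
    return int(x >= 0.7)
w = learn_threshold(pool, oracle)
print(f"{len(asked)} label queries; learned threshold {w:.3f}")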

5 Binary search [Figure: binary-search query tree over the sorted points, with internal nodes x_1?, x_6?, x_8?] m data points: there are effectively m+1 different hypotheses. The query tree has m+1 leaves and depth ≈ log m. Question: Is this a general phenomenon? For other hypothesis classes, is a generalized binary search possible?

6 Bad news – I H = {linear separators in R^1}: active learning reduces sample complexity from m to log m. But H = {linear separators in R^2}: there are some target hypotheses for which all m labels need to be queried! (No matter how benign the input distribution.) In this case: learning to accuracy ε requires 1/ε labels…
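The construction usually given for this claim (a sketch under that assumption; the slide itself does not spell it out): place the m points evenly on the unit circle. For each point there is a halfspace containing it alone, so every "exactly one point positive" labeling is realizable, and telling these apart from "all negative" can force a learner to query every point.

import math

m = 12
pts = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
       for i in range(m)]

# For a point p on the circle, the halfspace {x : <p, x> > b} with
# cos(2*pi/m) < b < 1 contains p and none of the other points.
b = (math.cos(2 * math.pi / m) + 1) / 2
for i, (px, py) in enumerate(pts):
    positives = [j for j, (qx, qy) in enumerate(pts) if px * qx + py * qy > b]
    assert positives == [i]      # each point is isolated by some separator
print(f"all {m} single-positive labelings are realizable by linear separators")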

7 The benefit of averaging For linear separators in R^2: In the worst case over target hypotheses, active learning offers no improvement in sample complexity. But there is a query tree in which the depths of the O(m^2) target hypotheses are spread almost evenly over [log m, m]. The average depth is just log m. Question: does active learning help only in a Bayesian model?

8 Degrees of Bayesian-ity Prior π over hypotheses. Pseudo-Bayesian model: the prior is used only to count queries. Bayesian model: the prior is used for counting queries and also for the generalization bound. The two imply different stopping criteria: suppose the remaining version space is [figure: one hypothesis of high π-mass alongside several of low π-mass].

9 Effective hypothesis class Fix a hypothesis class H of VC dimension d < ∞, and a set of unlabeled examples x_1, x_2, …, x_m, where m ≥ d/ε. Sauer’s lemma: H can label these points in at most m^d different ways… the effective hypothesis class H_eff = {(h(x_1), h(x_2), …, h(x_m)) : h ∈ H} has size |H_eff| ≤ m^d. Goal (in the realizable case): pick the element of H_eff which is consistent with all the hidden labels, while asking for just a small subset of these labels.
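A quick numeric check of this count (a sketch, using thresholds so that d = 1; the exact Sauer bound is the sum of C(m, i) for i ≤ d, of which m^d is the usual shorthand):

import math
import random

m, d = 20, 1
xs = sorted(random.random() for _ in range(m))

# Thresholds h_w(x) = 1(x >= w) have VC dimension 1.  Sweeping w across
# the sample (plus the all-zero labeling) realizes every labeling of xs
# that H can produce.
H_eff = {tuple(int(x >= w) for x in xs) for w in xs}
H_eff.add((0,) * m)                                  # w beyond the largest point

sauer = sum(math.comb(m, i) for i in range(d + 1))   # = m + 1 here
print(f"|H_eff| = {len(H_eff)}, Sauer bound = {sauer}")
assert len(H_eff) <= sauer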

10 Model of an active learner Query tree: [figure: a query tree with internal nodes x_1?, x_3?, x_6?, x_8? and leaves h_1, h_2, h_3, h_5, h_6]. Each leaf is annotated with an element of H_eff. Weights π over H_eff. Goal: a tree T of small average depth, Q(T, π) = Σ_h π(h) · depth_T(h). (Can also use random coin flips at internal nodes.) Question: in this averaged model, can we always find a tree of depth o(m)?
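A minimal data-structure sketch of such a tree and its π-averaged depth (names hypothetical; the example tree is illustrative, not the exact one in the figure):

from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    h: str                          # an element of H_eff

@dataclass
class Node:
    query: str                      # which point's label to request
    neg: Union["Node", Leaf]        # followed if the answer is 0
    pos: Union["Node", Leaf]        # followed if the answer is 1

def avg_depth(tree, pi, depth=0):
    # Q(T, pi) = sum over h of pi(h) * depth_T(h)
    if isinstance(tree, Leaf):
        return pi[tree.h] * depth
    return (avg_depth(tree.neg, pi, depth + 1)
            + avg_depth(tree.pos, pi, depth + 1))

T = Node("x6", Node("x1", Leaf("h1"), Leaf("h2")),
               Node("x8", Leaf("h3"), Leaf("h5")))
pi = {h: 0.25 for h in ("h1", "h2", "h3", "h5")}
print("Q(T, pi) =", avg_depth(T, pi))    # 2.0 for this balanced tree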

11 Bad news – II Pick any d > 0 and m ≥ 2d. There is an input space of size m and a hypothesis class H of VC dimension d such that (for uniform π) any active learning strategy requires ≥ m/8 queries on average. Choose: input space = any {x_1, …, x_m}; H = all concepts which are positive on exactly d inputs.
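A quick computation (a sketch) of why this class is hard: under the uniform prior, every possible query splits the version space very unevenly, so the likely answer barely shrinks it, consistent with the slide's ≥ m/8 average-query lower bound.

import math

m, d = 16, 2
total = math.comb(m, d)               # |H| = C(m, d)

# Any query x_i splits H into hypotheses positive on x_i, C(m-1, d-1)
# of them, and hypotheses negative on x_i, C(m-1, d) of them.
pos = math.comb(m - 1, d - 1)
neg = math.comb(m - 1, d)
print(f"positive fraction = {pos / total:.3f}")   # = d/m
print(f"negative fraction = {neg / total:.3f}")   # = 1 - d/m
# A negative answer (by far the likelier one) removes only a d/m
# sliver of the version space.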

12 A revised goal Depending on the choice of π, the hypothesis class, and perhaps the input distribution, the average number of labels needed by an optimal active learner is somewhere in the range [d log m, m] (within constants):
Ideal case: d log m – perfect binary search.
Worst case: m – randomly chosen labels.
Is there a generic active learning strategy which always achieves close to the optimal number of queries, no matter what it might be?

13 Heuristics for active learning A common strategy in many heuristics: Greedy strategy. After seeing t labels, the remaining version space is some H_t. Always choose the point which most evenly divides H_t, according to π-mass. For instance, Tong-Koller (2000), for linear separators: π ∝ volume. Question: How good is this greedy scheme? And how does its performance depend on the choice of π?
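A sketch of this greedy scheme (all names hypothetical): keep the current version space, and at each step query the point whose answer splits the remaining π-mass most evenly.

def greedy_active_learn(points, hypotheses, pi, query_label):
    # hypotheses: dict h -> tuple of h's labels on `points`
    # pi:         dict h -> prior mass of h
    # query_label(i) returns the true label of points[i]
    version_space = set(hypotheses)
    asked = {}
    while len(version_space) > 1:
        mass = sum(pi[h] for h in version_space)
        def imbalance(i):
            # distance of the query's positive mass from an even split
            pos = sum(pi[h] for h in version_space if hypotheses[h][i])
            return abs(2 * pos - mass)
        i = min((j for j in range(len(points)) if j not in asked),
                key=imbalance)
        asked[i] = query_label(i)
        version_space = {h for h in version_space
                         if hypotheses[h][i] == asked[i]}
    return version_space.pop(), len(asked)

# Example: the 9 thresholds on 8 points; greedy recovers binary search.
pts = list(range(8))
hyps = {f"h{w}": tuple(int(x >= w) for x in pts) for w in range(9)}
pi = {h: 1 / 9 for h in hyps}
target = hyps["h5"]
h, n = greedy_active_learn(pts, hyps, pi, lambda i: target[i])
print(f"found {h} with {n} label queries")    # about log2(9) queries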

14 Greedy active learning Choose any π. How does the greedy query tree T_G compare to the optimal tree T*? Upper bound. Q(T_G, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)). Example: for uniform π, the approximation ratio is log |H_eff| ≤ d log m. Lower bounds. [1] Uniform π: we have an example in which Q(T_G, π) ≥ Q(T*, π) · Ω(log |H_eff| / log log |H_eff|). [2] Non-uniform π: an example where π ranges between 1/2 and 1/2^n, and Q(T_G, π) ≥ Q(T*, π) · Ω(n).

15 Sub-optimality of greedy scheme [1] The case of uniform π. There are simple examples in which the greedy scheme uses Ω(log n / log log n) times the optimal number of labels. (a) The hypothesis class consists of several clusters. (b) Each cluster is efficiently searchable. (c) But first the version space must be narrowed down to one of these clusters: an inefficient process. [Invoke this construction recursively.] The optimal strategy reduces entropy only gradually at first, then ramps it up later; an over-eager greedy scheme is fooled.

16 Sub-optimality, cont’d [2] The case of general π. For any n ≥ 2: there is a hypothesis class H of size 2n+1 and a distribution π over H such that: (a) π ranges from 1/2 to 1/2^(n+1); (b) the optimal expected number of queries is < 3; (c) the greedy strategy uses ≥ n/2 queries on average. [Figure: H = {h_0, h_11, …, h_1n, h_21, …, h_2n}, drawn with π(h) proportional to area.]

17 Sub-optimality, cont’d Three types of queries: (i) Is the target some h_1i? (ii) Is it some h_2i? (iii) Is it h_1j or h_2j?

18 Upper bound: overview Upper bound. Q(T_G, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)). If the optimal tree is short, then either: there is a query which (in expectation) cuts off a good chunk of the version space, or: some particular hypothesis has high weight. At least in the first case, the greedy scheme gets off to a good start [cf. Johnson’s argument for set cover].

19 Quality of a query Need a notion of query quality which can only decrease with time. If S is a version space, and query x_i splits it into S+, S-, we’ll say that “x_i shrinks (S, π)” by 2 π(S+) π(S-) / π(S). Claim: If x_i shrinks (H_eff, π) by Δ, then it shrinks (S, π) by at most Δ for any S ⊆ H_eff.
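In code (a sketch, helper names hypothetical), with a brute-force check of the claim on random subsets; the claim holds because 2ab/(a+b) is nondecreasing in each of a = π(S+) and b = π(S-), and both can only fall when S shrinks.

import random

def shrinkage(S, pi, label_at_x):
    # how much the query shrinks (S, pi): 2 pi(S+) pi(S-) / pi(S)
    pos = sum(pi[h] for h in S if label_at_x[h])
    neg = sum(pi[h] for h in S if not label_at_x[h])
    return 2 * pos * neg / (pos + neg)

H = range(10)
pi = {h: random.uniform(0.1, 1.0) for h in H}
label_at_x = {h: random.random() < 0.5 for h in H}   # the query's labels

full = shrinkage(set(H), pi, label_at_x)
for _ in range(1000):
    S = {h for h in H if random.random() < 0.7} or set(H)
    assert shrinkage(S, pi, label_at_x) <= full + 1e-12
print(f"claim verified on 1000 random subsets; Delta = {full:.3f}")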

20 When is the optimal tree short? Claim: Pick any S ⊆ H_eff, and any tree T whose leaves include all of S. Then there must be a query which shrinks (S, π_S) by at least: (1 – CP(π_S)) / Q(T, π_S). Here: π_S is π restricted to S (and renormalized), and CP(ν) = Σ_h ν(h)^2 is the collision probability.
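A numeric check of this claim in the binary-search setting (a sketch): uniform π over the n+1 thresholds, where CP(π) = 1/(n+1) and a balanced tree has Q ≈ log2(n+1).

import math

n = 15
total = n + 1                        # hypotheses h_w, w = 0..n
pi = 1.0 / total                     # uniform prior

CP = total * pi ** 2                 # collision probability = 1/(n+1)
Q = math.ceil(math.log2(total))      # average depth of a balanced tree
bound = (1 - CP) / Q                 # some query must shrink this much

# Query x_i splits S into (i+1) positive and (n-i) negative hypotheses
# (thresholds h_w(x_i) = 1 iff i >= w), so its shrinkage is:
best = max(2 * ((i + 1) * pi) * ((n - i) * pi) for i in range(n))
print(f"best shrinkage {best:.3f} >= guaranteed bound {bound:.3f}")
assert best >= bound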

21 Main argument If the optimal tree has small average depth, then there are two possible cases. Case one: there is some query which shrinks the version space significantly. In this case, the greedy strategy will find such a query and clear progress will be made. The resulting subtrees, considered together, will also require few queries.

22 Proof, cont’d Case two: some classifier h* has very high π-mass. In this case, the version space might shrink by just an insignificant amount in one round. But: in roughly the number of queries that the optimal strategy requires for target h*, the greedy strategy will either eliminate h* or declare it to be the answer. In the former case, by the time h* is eliminated, the version space will have shrunk significantly. These two cases form the basis of an inductive argument.

23 An open problem Just about the only positive result in active learning: [FSST97] Query by committee: if the data distribution is uniform over the unit sphere, can learn homogeneous linear separators using just O(d log 1/ε) labels. But the minute we allow non-homogeneous hyperplanes, the query complexity increases to 1/ε… What’s going on?

