1 Modern Topics in Learning Theory. Maria-Florina Balcan, 04/19/2006

2 Modern Topics in Learning Theory
–Semi-Supervised Learning
–Active Learning
–Kernels and Similarity Functions
–Tighter Data-Dependent Bounds

3 Semi-Supervised Learning
A hot topic in machine learning in recent years. Many applications have lots of unlabeled data, but labeled data is rare or expensive:
–web page and document classification
–OCR, image classification

4 Combining Labeled and Unlabeled Data
Several methods have been developed to use unlabeled data to improve performance, e.g.:
–Transductive SVM [J98]
–Co-training [BM98, BBY04]
–Graph-based methods [BC01, ZGL03, BLRR04]
–Augmented PAC model for SSL [BB05, BB06]

5 Can We Extend the PAC Model to Deal with Unlabeled Data?
The PAC model is a nice, standard model for learning from labeled data. Goal: extend it naturally to the case of learning from both labeled and unlabeled data.
–Different algorithms are based on different assumptions about how the data should behave.
–Question: how can we capture many of the assumptions typically used?

6 Example of a “Typical” Assumption
The separator goes through low-density regions of the space (large margin).
–assume we are looking for a linear separator
–belief: there should exist one with large separation
[Figure: labeled points with separators found from labeled data only, by SVM, and by Transductive SVM]

7 Another Example
Agreement between two parts: co-training.
–Examples contain two sufficient sets of features, i.e. an example is $x = \langle x_1, x_2 \rangle$, and the belief is that the two parts of the example are consistent, i.e. $\exists\, c_1, c_2$ such that $c_1(x_1) = c_2(x_2) = c^*(x)$.
–For example, if we want to classify web pages: [Figure: the “My Advisor” / “Prof. Avrim Blum” web page example, with $x_1$ = link info, $x_2$ = text info, and $x = \langle x_1, x_2 \rangle$ = link info & text info]

8 Co-Training [BM98]
[Figure: labels propagating between the two views, $X_1$ (link info) and $X_2$ (text info)]
Works by using unlabeled data to propagate learned information; a sketch of the loop follows.
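To make the propagation idea concrete, here is a minimal co-training sketch in Python. The helpers `train` (fit a classifier to one view) and `confident` (return a label only when the hypothesis is sure, else None) are assumptions of this sketch, not part of [BM98]; the actual algorithm adds only the most confidently labeled examples each round.

```python
def co_train(train, confident, labeled, unlabeled, rounds=10):
    """Minimal co-training sketch: each view's hypothesis labels the
    unlabeled pairs it is confident about, growing the shared training
    set that the other view learns from on the next round.

    labeled:   list of ((x1, x2), y) two-view examples
    unlabeled: list of (x1, x2) two-view examples
    train(examples) -> hypothesis h with h(x) -> label
    confident(h, x) -> label, or None if h is unsure about x
    """
    h1 = h2 = None
    for _ in range(rounds):
        h1 = train([(x1, y) for (x1, _x2), y in labeled])  # view 1
        h2 = train([(x2, y) for (_x1, x2), y in labeled])  # view 2
        still_unlabeled = []
        for x1, x2 in unlabeled:
            y = confident(h1, x1)          # let view 1 try first,
            if y is None:
                y = confident(h2, x2)      # then view 2
            if y is None:
                still_unlabeled.append((x1, x2))
            else:
                labeled.append(((x1, x2), y))  # propagate the label
        unlabeled = still_unlabeled
    return h1, h2
```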

9 Proposed Model [BB05, BB06]
Augment the notion of a concept class C with a notion of compatibility $\chi$ between a concept and the data distribution. “Learn C” becomes “learn $(C, \chi)$” (i.e. learn class C under compatibility notion $\chi$).
This expresses relationships that one hopes the target function and the underlying distribution will possess.
Idea: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C.

10 Proposed Model, cont.
Idea: use unlabeled data and our belief to reduce size(C) down to the size of the set of highly compatible functions in C in our sample complexity bounds.
We want to analyze how much unlabeled data is needed to uniformly estimate compatibilities well, so we require that the degree of compatibility be something that can be estimated from a finite sample.

11 Proposed Model, cont.
Augment the notion of a concept class C with a notion of compatibility $\chi$ between a concept and the data distribution. Require that the degree of compatibility be something that can be estimated from a finite sample. Require $\chi$ to be an expectation over individual examples:
–$\chi(h, D) = \mathbb{E}_{x \sim D}[\chi(h, x)]$, the compatibility of h with D, where $\chi(h, x) \in [0, 1]$
–$\mathrm{err}_{unl}(h) = 1 - \chi(h, D)$, the incompatibility of h with D (the unlabeled error rate of h)
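Because $\chi$ is an expectation over individual examples, the unlabeled error rate can be estimated by a simple average over an unlabeled sample. A minimal sketch (the function name is illustrative, not from [BB05]):

```python
def estimated_unlabeled_error(chi, h, unlabeled):
    """Empirical estimate of err_unl(h) = 1 - chi(h, D) from a finite
    unlabeled sample; chi(h, x) must return a value in [0, 1]."""
    avg_compat = sum(chi(h, x) for x in unlabeled) / len(unlabeled)
    return 1.0 - avg_compat
```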

12 Margins, Compatibility
Margins: the belief is that there should exist a large-margin separator.
Incompatibility of h and D (the unlabeled error rate of h): the probability mass within distance $\gamma$ of h.
This can be written as an expectation over individual examples, $\chi(h, D) = \mathbb{E}_{x \sim D}[\chi(h, x)]$, where:
–$\chi(h, x) = 0$ if $\mathrm{dist}(x, h) \le \gamma$
–$\chi(h, x) = 1$ if $\mathrm{dist}(x, h) > \gamma$
[Figure: a highly compatible large-margin separator]

13 Margins, Compatibility
Margins: the belief is that there should exist a large-margin separator.
If we do not want to commit to $\gamma$ in advance, we can instead define $\chi(h, x)$ to be a smooth function of $\mathrm{dist}(x, h)$; a sketch of both variants follows.
An illegal notion of compatibility: the largest $\gamma$ such that D has probability mass exactly zero within distance $\gamma$ of h. This is illegal because it is not an expectation over individual examples and cannot be estimated from a finite sample.
[Figure: a highly compatible large-margin separator]
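As a concrete instance for linear separators, here is a Python sketch of both margin-based notions. The 0/1 version follows slide 12's definition; the smooth version is only one natural choice (the slide's actual formula did not survive the transcript), clipping the normalized distance at 1.

```python
import numpy as np

def margin_chi_01(w, b, x, gamma):
    """0/1 margin compatibility (slide 12): 1 if x lies at distance
    >= gamma from the hyperplane w.x + b = 0, else 0."""
    dist = abs(np.dot(w, x) + b) / np.linalg.norm(w)
    return 1.0 if dist >= gamma else 0.0

def margin_chi_smooth(w, b, x, gamma0):
    """A smooth variant (an assumed example, not the slide's formula):
    compatibility grows linearly with distance, capped at 1."""
    dist = abs(np.dot(w, x) + b) / np.linalg.norm(w)
    return min(dist / gamma0, 1.0)
```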

14 Co-Training, Compatibility
Co-training: examples come as pairs $\langle x_1, x_2 \rangle$ and the goal is to learn a pair of functions $\langle h_1, h_2 \rangle$. The hope is that the two parts of the example are consistent.
A legal (and natural) notion of compatibility:
–the compatibility of $\langle h_1, h_2 \rangle$ and D is the probability that the two views agree: $\chi(\langle h_1, h_2 \rangle, D) = \Pr_{\langle x_1, x_2 \rangle \sim D}[h_1(x_1) = h_2(x_2)]$
–which can be written as an expectation over examples, with $\chi(\langle h_1, h_2 \rangle, \langle x_1, x_2 \rangle) = 1$ if $h_1(x_1) = h_2(x_2)$ and 0 otherwise (see the sketch below)
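In code, this compatibility notion is just the agreement rate of the pair over unlabeled two-view examples; a small sketch:

```python
def cotrain_chi(h1, h2, x1, x2):
    """Per-example compatibility of <h1, h2>: 1 if the two views'
    predictions agree on this example, else 0."""
    return 1.0 if h1(x1) == h2(x2) else 0.0

def cotrain_compatibility(h1, h2, unlabeled_pairs):
    """chi(<h1, h2>, D), estimated as the agreement rate over a finite
    sample of unlabeled (x1, x2) pairs."""
    n = len(unlabeled_pairs)
    return sum(cotrain_chi(h1, h2, x1, x2) for x1, x2 in unlabeled_pairs) / n
```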

15 Examples of Results in Our Model: Sample Complexity, Uniform Convergence Bounds
Finite hypothesis spaces, doubly realizable case. Define $C_{D,\chi}(\epsilon) = \{h \in C : \mathrm{err}_{unl}(h) \le \epsilon\}$.
Theorem (see the sketch below). The bound on the number of labeled examples is a measure of the helpfulness of D with respect to $\chi$:
–a helpful distribution is one in which $C_{D,\chi}(\epsilon)$ is small
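The theorem statement itself appears as an image in the original slides. As a hedged reconstruction, the doubly realizable bound of [BB05] has roughly the following form; the constants and log factors here are recalled rather than transcribed, so check the paper before citing.

```latex
% Form of the [BB05] doubly realizable finite-class bound (recalled,
% not transcribed from the slide; constants/log factors may differ).
\textbf{Theorem (sketch).} Suppose $c^* \in C$ with $\mathrm{err}(c^*) = 0$
and $\mathrm{err}_{unl}(c^*) = 0$. If we see
\[
  m_u \ge \frac{1}{\epsilon}\left[\ln|C| + \ln\frac{2}{\delta}\right]
  \text{ unlabeled and }
  m_l \ge \frac{1}{\epsilon}\left[\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\right]
  \text{ labeled examples,}
\]
then with probability at least $1 - \delta$, every $h \in C$ that is
consistent with the labeled data and has zero estimated unlabeled error
satisfies $\mathrm{err}(h) \le \epsilon$.
```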

16 Semi-Supervised Learning: Natural Formalization ($\mathrm{PAC}_\chi$)
We say an algorithm “$\mathrm{PAC}_\chi$-learns” if it runs in polynomial time using labeled and unlabeled samples polynomial in the respective bounds.
E.g., one can think of $\ln|C|$ as the number of bits to describe the target without knowing D, and $\ln|C_{D,\chi}(\epsilon)|$ as the number of bits to describe the target knowing a good approximation to D, given the assumption that the target has a low unlabeled error rate.

17 Examples of Results in Our Model: Sample Complexity, Uniform Convergence Bounds
Finite hypothesis spaces, $c^*$ not fully compatible: Theorem [statement shown as a figure on the slide].

18 Examples of Results in Our Model: Sample Complexity, Uniform Convergence Bounds
Infinite hypothesis spaces. Assume $\chi(h, x) \in \{0, 1\}$ and let $\chi(C) = \{\chi_h : h \in C\}$, where $\chi_h(x) = \chi(h, x)$.
$C[m, D]$ denotes the expected number of splits of m points drawn from D with concepts in C.
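Written out, the quantity just described is (a direct transcription of the verbal definition into the notation above):

```latex
% C[m, D]: the expected number of distinct labelings ("splits") of m
% random points from D realized by concepts in C.
\[
  C[m, D] \;=\; \mathbb{E}_{(x_1, \ldots, x_m) \sim D^m}
  \Big[\, \big|\{\, (h(x_1), \ldots, h(x_m)) : h \in C \,\}\big| \,\Big]
\]
```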

19 Examples of Results in Our Model: Sample Complexity, Uniform Convergence Bounds
For $S \subseteq X$, denote by $U_S$ the uniform distribution over S, and by $C[m, U_S]$ the expected number of splits of m points from $U_S$ with concepts in C. Assume $\mathrm{err}(c^*) = 0$ and $\mathrm{err}_{unl}(c^*) = 0$.
Theorem [statement shown as a figure on the slide]. Here the number of labeled examples depends on the unlabeled sample. This is useful since one can imagine the learning algorithm performing some calculations over the unlabeled data and then deciding how many labeled examples to purchase.

20 Examples of Results in Our Model: Sample Complexity, $\epsilon$-Cover-Based Bounds
For algorithms that behave in a specific way:
–first use the unlabeled data to choose a representative set of compatible hypotheses
–then use the labeled sample to choose among these
Theorem [statement shown as a figure on the slide]. This can result in a much better bound than uniform convergence!

21 Implications of Our Analysis: Ways in Which Unlabeled Data Can Help
–If the target is highly compatible with D and we have enough unlabeled data to estimate $\chi$ over all $h \in C$, then we can reduce the search space (from C down to just those $h \in C$ whose estimated unlabeled error rate is low).
–By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as annealed VC-entropy or the size of the smallest $\epsilon$-cover).
–If D is nice, so that the set of compatible $h \in C$ has a small $\epsilon$-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the $1/\epsilon$ needed just to verify a good hypothesis.

22 Modern Topics in Learning Theory
–Semi-Supervised Learning
–Active Learning
–Kernels and Similarity Functions
–Data-Dependent Bounds

23 Active Learning
Unlabeled data is cheap and easy to obtain; labeled data is (much) more expensive. The learner has the ability to choose specific examples to be labeled:
–the learner works harder, in order to use fewer labeled examples.

24 Membership Queries
The learner “constructs” the examples. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters:
–the synthetic instances created were incomprehensible to humans, so the queried labels were unreliable.

25 A PAC-like Model [CAL92]
Underlying distribution P on the (x, y) data (agnostic setting). The learner has two abilities:
–draw an unlabeled sample from the distribution
–ask for the label of one of these samples
Special case: assume the data is separable, i.e. some concept $h \in C$ labels all points perfectly (realizable setting).

26 Can Adaptive Querying Help? [CAL92, D04]
Consider threshold functions on the real line:
–Start with $1/\epsilon$ unlabeled points.
–Binary search: need just $\log(1/\epsilon)$ labels, from which the rest can be inferred. Output a consistent hypothesis.
[Figure: a threshold w on the line, with $-$ on one side and $+$ on the other]
An exponential improvement in labeled sample complexity; a sketch of the binary search follows.
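A minimal Python sketch of the binary search (the helper name `query_label` is illustrative). It spends O(log n) label queries on n sorted unlabeled points and infers the labels of the rest:

```python
def active_learn_threshold(points, query_label):
    """Binary search for a threshold on the real line.

    points: unlabeled sample (will be sorted); query_label(x) -> 0 or 1,
    assumed realizable: 0 left of the threshold, 1 right of it.
    Returns the first positive point, using O(log n) queries instead of n.
    """
    xs = sorted(points)
    if query_label(xs[0]) == 1:      # everything is already positive
        return xs[0]
    if query_label(xs[-1]) == 0:     # everything is negative
        return None
    lo, hi = 0, len(xs) - 1          # invariant: label(lo)=0, label(hi)=1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 0:
            lo = mid
        else:
            hi = mid
    return xs[hi]                    # labels switch between xs[lo], xs[hi]
```

Any threshold placed between xs[lo] and xs[hi] is then consistent with all $1/\epsilon$ points, even though only $O(\log(1/\epsilon))$ of them were labeled.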

27 Region of Uncertainty [CAL92]
Current version space: the part of C consistent with the labels so far.
“Region of uncertainty” = the part of the data space about which there is still some uncertainty (i.e. disagreement within the version space).
Example: the data lies on a circle in $\mathbb{R}^2$ and the hypotheses are linear separators.
[Figure: the current version space and the corresponding region of uncertainty in data space]

28 Region of Uncertainty [CAL92]
Algorithm: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
[Figure: the version space and the region of uncertainty in data space shrinking after queries]
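For a finite hypothesis class, the scheme can be sketched in a few lines of Python. This is a simplification: in general the version space is represented implicitly rather than as an explicit list.

```python
import random

def cal_selective_sampling(hypotheses, unlabeled, query_label, rng=random):
    """CAL-style sketch: query only points in the region of uncertainty
    (where hypotheses consistent so far disagree), then prune the
    version space with the returned label.

    hypotheses: finite list of h(x) -> {0, 1}; unlabeled: list of points.
    """
    version_space = list(hypotheses)
    pool = list(unlabeled)
    rng.shuffle(pool)                  # pick uncertain points at random
    labels_spent = 0
    for x in pool:
        preds = {h(x) for h in version_space}
        if len(preds) > 1:             # x lies in the region of uncertainty
            y = query_label(x)         # only these points cost a label
            labels_spent += 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space, labels_spent
```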

29 Region of Uncertainty [CAL92]
The number of labels needed depends on C and also on P. Example: C = linear separators in $\mathbb{R}^d$, D = uniform distribution over the unit sphere:
–need only $d^{3/2} \log(1/\epsilon)$ labels to find a hypothesis with error rate $< \epsilon$
–supervised learning: $d/\epsilon$ labels
An exponential improvement in sample complexity. For a robust version of CAL'92, see BBL'06.

