
2 Techniques For Exploiting Unlabeled Data Mugizi Rwebangira Thesis Proposal May 11, 2007 Committee: Avrim Blum, CMU (Co-Chair) John Lafferty, CMU (Co-Chair) William Cohen, CMU Xiaojin (Jerry) Zhu, Wisconsin

3 2 Motivation Supervised Machine Learning: Labeled Examples {(x_i, y_i)} → induction → Model x → y. Algorithms: SVM, Neural Nets, Decision Trees, etc. Problems: document classification, image classification, protein sequence determination.

4 3 Motivation In recent years, there has been growing interest in techniques for using unlabeled data: More data is being collected than ever before. Labeling examples can be expensive and/or require human intervention.

5 4 Examples Proteins: sequence can be easily determined; structure determination is a hard problem. Web Pages: can be easily crawled on the web; labeling requires human intervention. Images: abundantly available (digital cameras); labeling requires humans (CAPTCHAs).

6 5 Motivation Semi-Supervised Machine Learning: Labeled Examples {(x_i, y_i)} + Unlabeled Examples {x_i} → Model x → y.

7 6 Motivation [figure: a few labeled + and − examples among unlabeled points]

8 7 However… Techniques for using unlabeled data are not as well developed as supervised techniques: in particular, techniques for adapting supervised algorithms to semi-supervised algorithms, and best practices for using unlabeled data.

9 8 Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Proposed Work and Time Line

10 9 Graph Mincut (Blum & Chawla, 2001)

11 10 Construct an (unweighted) Graph

12 11 Add auxiliary “super-nodes”

13 12 Obtain s-t mincut

14 13 Classification

15 14 Problem Plain mincut can give very unbalanced cuts.

16 15 Solution Add random weights to the edges, run plain mincut and obtain a classification, and repeat this process several times. Finally, for each unlabeled example, take a majority vote over the runs. (A minimal code sketch follows.)
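A minimal sketch of this randomized-mincut procedure, assuming a NetworkX graph whose labeled examples are already attached to auxiliary super-nodes named "v+" and "v-"; the node names, trial count, and perturbation scale are illustrative assumptions, not the thesis implementation.

```python
import random
import networkx as nx

def randomized_mincut(G, source="v+", sink="v-", trials=20, seed=0):
    rng = random.Random(seed)
    votes = {u: 0 for u in G.nodes if u not in (source, sink)}
    for _ in range(trials):
        H = G.copy()
        for u, v in H.edges:
            # Perturb each (originally unit-weight) edge with a small random weight.
            H[u][v]["capacity"] = 1.0 + 0.1 * rng.random()
        _, (s_side, _t_side) = nx.minimum_cut(H, source, sink)
        for u in votes:
            votes[u] += 1 if u in s_side else -1
    # Majority vote over the runs decides each unlabeled example's label.
    return {u: (+1 if votes[u] >= 0 else -1) for u in votes}
```

The magnitude of the vote also gives a natural confidence (“margin”) for each prediction, as discussed in the summary slide.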

17 16 Before adding random weights

18 17 After adding random weights

19 18 PAC-Bayes PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance. In this case each distinct cut corresponds to a different hypothesis, so the average of these cuts will be less likely to overfit than any single cut.

20 19 Markov Random Fields Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph. This corresponds to a Markov Random Field model. We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.

21 20 How to construct the graph? k-NN: the graph may not have small balanced cuts; how to learn k? Connect all points within distance δ: can have disconnected components; how to learn δ? Minimum Spanning Tree: no parameters to learn; gives a connected, sparse graph; seems to work well on most datasets. (A sketch of the MST construction follows.)
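A small sketch of the minimum-spanning-tree graph construction, assuming Euclidean distances between examples; the function and variable names are mine, not from the thesis.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_graph(X):
    D = squareform(pdist(X))          # dense pairwise Euclidean distances
    T = minimum_spanning_tree(D)      # sparse MST (stored as an upper triangle)
    W = T.toarray()
    W = W + W.T                       # symmetrize to get an undirected graph
    return (W > 0).astype(float)      # unweighted adjacency matrix of the MST
```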

22 21 Experiments ONE vs. TWO: 1128 examples (8 x 8 array of integers, Euclidean distance). ODD vs. EVEN: 4000 examples (16 x 16 array of integers, Euclidean distance). PC vs. MAC: 1943 examples (20 Newsgroups dataset, TFIDF distance).

23 22 ONE vs. TWO

24 23 ODD vs. EVEN

25 24 PC vs. MAC

26 25 Summary We can apply PAC sample-complexity analysis and interpret it in terms of Markov Random Fields. Randomization helps plain mincut achieve performance comparable to Gaussian Fields. There is an intuitive interpretation for the confidence of a prediction in terms of the “margin” of the vote. “Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M. R. Rwebangira, R. Reddy, ICML 2004

27 26 Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Proposed Work and Time Line

28 27 Gaussian Fields (Zhu, Ghahramani & Lafferty) This algorithm minimizes the functional ξ(f) = Σ_ij w_ij (f_i − f_j)², where w_ij is the similarity between examples i and j, and f_i and f_j are the predictions for examples i and j. (A sketch of the closed-form solution follows.)
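For reference, a sketch of the standard harmonic-function solution to this objective from Zhu, Ghahramani & Lafferty: the predictions are clamped to the labels on the labeled points and solved for on the unlabeled ones. The dense-matrix formulation and names are my simplifications.

```python
import numpy as np

def gaussian_fields(W, labeled_idx, y_labeled):
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian D - W
    u = np.setdiff1d(np.arange(n), labeled_idx)    # indices of unlabeled examples
    # The minimizer satisfies L_uu f_u = -L_ul y_l on the unlabeled block.
    f_u = np.linalg.solve(L[np.ix_(u, u)],
                          -L[np.ix_(u, labeled_idx)] @ y_labeled)
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    f[u] = f_u
    return f
```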

29 28 Locally Constant (Kernel regression) [figure: scatter of (x, y) data points with a locally constant fit]

30 29 Locally Linear [figure: scatter of (x, y) data points with a locally linear fit]

31 30 Local Linear Regression This algorithm minimizes the functional ξ(β) = Σ_i w_i (y_i − β^T X_xi)², where w_i is the similarity between example i and the query point x, and β is the coefficient vector of the local linear fit at x. (A short sketch follows.)
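A short sketch of local linear regression at a single query point x0. The Gaussian kernel for the weights w_i and the bandwidth are assumptions of this sketch; the local design rows are [1, x_i − x0], so the fitted intercept beta[0] is the prediction at x0.

```python
import numpy as np

def local_linear_predict(X, y, x0, bandwidth=1.0):
    diffs = X - x0
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bandwidth ** 2))
    A = np.hstack([np.ones((X.shape[0], 1)), diffs])  # local design matrix
    # Weighted least squares: minimize sum_i w_i (y_i - beta^T a_i)^2.
    beta = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (w * y))
    return beta[0]
```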

32 31 Problem Develop a local linear version of Gaussian Fields, or equivalently a semi-supervised version of Local Linear Regression: Local Linear Semi-supervised Regression (LLSR).

33 32 Local Linear Semi-supervised Regression [figure: neighboring points x_i and x_j with local fits β_i and β_j; the penalty between them is (β_i0 − X_ji^T β_j)²]

34 33 Local Linear Semi-supervised Regression This algorithm minimizes the functional ξ(β) = Σ_ij w_ij (β_i0 − X_ji^T β_j)², where w_ij is the similarity between x_i and x_j.

35 34 Synthetic Data: Doppler The Doppler function y = (1/x) sin(15/x), with noise variance σ² = 0.1. (A data-generation sketch follows.)
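A minimal generator for this synthetic data; the sampling interval and sample size are my assumptions, only the function and noise variance come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.1, 1.0, size=200))          # assumed sampling range
y = (1.0 / x) * np.sin(15.0 / x) + rng.normal(0.0, np.sqrt(0.1), size=x.shape)
```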

36 35 Experimental Results: DOPPLER Weighted Kernel Regression, LOOCV MSE = 6.54, MSE = 25.7

37 36 Experimental Results: DOPPLER Local Linear Regression, LOOCV MSE = 80.8, MSE = 14.4

38 37 Experimental Results: DOPPLER LLSR, LOOCV MSE = 2.00, MSE = 7.99

39 38 PROBLEM: RUNNING TIME If the number of examples is n and the dimension of the examples is d, then we have to invert an n(d+1) x n(d+1) matrix. This is prohibitively expensive, especially if d is large.

40 39 PROPOSED WORK: Improving Running Time Sparsification: ignore examples which are far away, so as to get a sparser matrix to invert. Iterative methods for solving linear systems: for a matrix equation Ax = b, we can obtain successive approximations x_1, x_2, …, x_k; this can be significantly faster if the matrix A is sparse. (A sketch appears below.)
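An illustrative sketch combining both ideas: sparsify the system matrix by dropping small (far-away) entries, then solve Ax = b iteratively with conjugate gradients. This assumes the system is symmetric positive definite; the actual LLSR system matrix is not constructed here, and the threshold is a placeholder.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

def sparse_iterative_solve(A_dense, b, threshold=1e-3):
    # Sparsification: zero out entries below the threshold before solving.
    A_sparse = csr_matrix(np.where(np.abs(A_dense) > threshold, A_dense, 0.0))
    x, info = cg(A_sparse, b)       # info == 0 indicates the iteration converged
    return x
```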

41 40 PROPOSED WORK: Improving Running Time Power series: use the identity (I − A)⁻¹ = I + A + A² + A³ + …, so that y′ = (Q + γΔ)⁻¹Py = Q⁻¹Py + (−γQ⁻¹Δ)Q⁻¹Py + (−γQ⁻¹Δ)²Q⁻¹Py + … A few terms may be sufficient to get a good approximation: compute the supervised answer first, then “smooth” the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm. (A sketch follows.)
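A sketch of this power-series idea: start from the supervised solution Q⁻¹Py and repeatedly apply the operator −γQ⁻¹Δ, summing the terms. Here Q, Δ, P and γ are taken as given, and convergence implicitly assumes the operator's spectral radius is below 1.

```python
import numpy as np

def power_series_solution(Q, Delta, P, y, gamma, terms=5):
    term = np.linalg.solve(Q, P @ y)          # supervised answer Q^{-1} P y
    M = -gamma * np.linalg.solve(Q, Delta)    # the operator -gamma Q^{-1} Delta
    approx = term.copy()
    for _ in range(terms - 1):
        term = M @ term                        # next term of the series
        approx += term                         # each term "smooths" the answer
    return approx
```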

42 41 PROPOSED WORK: Experimental Evaluation Comparison against other proposed semi-supervised regression algorithms. Evaluation on a large variety of data sets, especially high dimensional ones.

43 42 Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Proposed Work and Time Line

44 43 Kernels K(x,y) = Φ(x)·Φ(y). Allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found. A kernel must satisfy strict mathematical conditions: 1. Continuous 2. Symmetric 3. Positive semi-definite. (A small sample-level check is sketched below.)
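A small, sample-level sanity check (not a proof) of whether a similarity function behaves like a kernel: test symmetry and positive semi-definiteness of its Gram matrix on a handful of examples. The function and tolerance are illustrative assumptions.

```python
import numpy as np

def looks_like_kernel(sim, examples, tol=1e-8):
    n = len(examples)
    G = np.array([[sim(examples[i], examples[j]) for j in range(n)]
                  for i in range(n)])
    symmetric = np.allclose(G, G.T, atol=tol)
    # PSD check via the eigenvalues of the symmetrized Gram matrix.
    psd = np.linalg.eigvalsh((G + G.T) / 2).min() >= -tol
    return symmetric and psd
```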

45 44 Generic Similarity Functions What if the best similarity function in a given domain does not satisfy the properties of a kernel? Two options: 1. Use a kernel with inferior performance. 2. Try to “coerce” the similarity function into a kernel by building a kernel that has similar behavior. There is another way …

46 45 The Balcan-Blum approach Recently Balcan and Blum initiated the theory of learning with generic similarity functions. They gave a general definition of a good similarity function for learning and showed that the popular large margin kernels are a special case of their definition. They also gave an algorithm for learning with good similarity functions. Their approach makes use of unlabeled data…

47 46 The Balcan-Blum approach The algorithm is very simple. Suppose S(x,y) is our similarity function. Then: 1. Draw d examples {x_1, x_2, x_3, …, x_d} uniformly at random from the data set. 2. For each example x compute the mapping x → (S(x,x_1), S(x,x_2), S(x,x_3), …, S(x,x_d)). (A sketch of this mapping follows.)
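A sketch of the mapping step: each example becomes the vector of its similarities to d randomly drawn "landmark" examples, and a standard linear learner (e.g. Winnow) is then trained on the mapped data. The function and variable names are mine.

```python
import random
import numpy as np

def similarity_features(sim, data, d, seed=0):
    rng = random.Random(seed)
    landmarks = rng.sample(list(data), d)      # d examples drawn at random
    # Map every example x to (S(x, x_1), ..., S(x, x_d)).
    mapped = np.array([[sim(x, l) for l in landmarks] for x in data])
    return mapped, landmarks
```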

48 47 Synthetic Data: Circle

49 48 Experimental Results: Circle

50 49 PROPOSED WORK Two main application areas: 1. Domains which have expert defined similarity functions that are not kernels (protein homology). 2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification). Overall goal: Investigate the practical applicability of this theory and find out what is needed to make it work on real problems.

51 50 PROPOSED WORK: Protein Homology The Smith-Waterman score is the best-performing measure of similarity, but it does not satisfy the kernel properties. Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel. Can we achieve better performance by using the SW score directly?

52 51 PROPOSED WORK: Text Classification The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred. The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful. Within this framework we have strong motivation for investigating “domain specific” similarity functions, e.g. “edit distance” between documents instead of cosine similarity. Can we achieve better performance than current techniques using “domain specific” similarity functions?

53 52 PROPOSED WORK: Domain Specific Similarity Functions As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach. What are the “best practice” principles for designing domain specific similarity functions? In what circumstances are domain specific similarity functions likely to be most useful? We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.

54 53 Proposed Work and Time Line Summer 2007: (1) Speeding up LLSR; (2) Learning with similarity in the protein homology and text classification domains. Fall 2007: (1) Comparison of LLSR with other semi-supervised regression algorithms; (2) Investigate principles of domain specific similarity functions. Spring 2008: Start writing thesis. Summer 2008: Finish writing thesis.

55 54 Back Up Slides

56 55 References “Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M. R. Rwebangira, R. Reddy, ICML 2004

57 56 My Work Practical techniques for using unlabeled data and generic similarity functions to “kernelize” the winnow algorithm. Techniques for extending Local Linear Regression to the semi-supervised setting Techniques for improving graph mincut algorithms for semi-supervised classification

58 57 Problem There may be several minimum cuts in the graph. Indeed, there are potentially exponentially many minimum cuts in the graph.

59 58 Real Data: CO2 Source: World Watch Institute. Carbon dioxide concentration in the atmosphere over the last two centuries.

60 59 Experimental Results: CO2 Weighted Kernel Regression, MSE = 660

61 60 Experimental Results: CO2 Local Linear Regression, MSE = 144

62 61 Experimental Results: CO2 LLSR, MSE = 97.4

63 62 Winnow A linear separator algorithm, first proposed by Littlestone. We are particularly interested in winnow because 1.It is known to be able to effectively learn in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant. 2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
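A sketch of Winnow (Littlestone) for boolean features and labels in {−1, +1}: weights start at 1 and are updated multiplicatively only on mistakes, which is what makes it robust to many irrelevant attributes. The promotion factor and threshold are the usual textbook choices, not values from the thesis.

```python
import numpy as np

def winnow_train(X, y, alpha=2.0, epochs=5):
    n_features = X.shape[1]
    w = np.ones(n_features)
    theta = n_features / 2.0                   # standard threshold n/2
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= theta else -1
            if pred != yi:
                # Promote weights of active features on false negatives,
                # demote them on false positives; inactive features unchanged.
                w *= alpha ** (yi * xi)
    return w, theta
```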

64 63 Synthetic Data: Blobs and Lines Can we create a data set that needs BOTH the original and the new features to do well? To answer this we create a data set we will call “Blobs and Lines”. We generate the data in the following way: 1. We select k points to be the centers of our “blobs” and assign them labels in {−1,+1}. 2. We flip a coin. 3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x. 4. If tails, we pick one of the centers, flip r bits, set x equal to that, and set the label to the label of the center. (A generator sketch follows.)
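A sketch of the "Blobs and Lines" generator following the steps above; the default values of k, d and r, and the fair coin, are assumptions of this sketch.

```python
import numpy as np

def blobs_and_lines(n, k=4, d=20, r=2, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.integers(0, 2, size=(k, d))          # blob centers
    center_labels = rng.choice([-1, +1], size=k)
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:                          # heads: "lines" part
            x = rng.integers(0, 2, size=d)
            label = +1 if x[0] == 1 else -1             # label = first coordinate
        else:                                           # tails: "blobs" part
            j = int(rng.integers(k))
            x = centers[j].copy()
            flipped = rng.choice(d, size=r, replace=False)
            x[flipped] = 1 - x[flipped]                 # flip r random bits
            label = int(center_labels[j])
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```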

65 64 Synthetic Data: Blobs and Lines [figure: scatter of + and − labeled examples forming blobs and lines]

66 65 Experimental Results: Blobs and Lines

