Semi-supervised Learning (presentation transcript)

1 Semi-supervised Learning
Rong Jin

2 Spectrum of Learning Problems

3 What is Semi-supervised Learning?
Learning from a mixture of labeled and unlabeled examples. Labeled data $L = \{(x^l_i, y_i)\}_{i=1}^{n_l}$; unlabeled data $U = \{x^u_i\}_{i=1}^{n_u}$. Goal: learn $f(x): X \to Y$. Total number of examples: $N = n_l + n_u$.

4 Why Semi-supervised Learning?
Labeling is expensive and difficult, and labels can be unreliable (e.g., segmentation applications may need multiple experts). Unlabeled examples are easy to obtain in large numbers (e.g., web pages, text documents).

5 Semi-supervised Learning Problems
Classification: transductive (predict labels of the unlabeled data) or inductive (learn a classification function). Clustering (constrained clustering). Ranking (semi-supervised ranking). Almost every learning problem has a semi-supervised counterpart.

6 Why Unlabeled Could be Helpful
Clustering assumption: unlabeled data help decide the decision boundary. Manifold assumption: unlabeled data help decide the decision function $f(x)$.

7 Clustering Assumption
?

8 Clustering Assumption
Suggest a simple algorithm for semi-supervised learning. Two approaches: (a) cluster the data first and label each cluster by its dominant class; (b) use the unlabeled data to choose an appropriate decision boundary. Points with the same label are connected through high-density regions, thereby defining a cluster; clusters are separated by low-density regions.

9 Manifold Assumption Regularize the classification function f(x)
Graph representation. Vertex: a training example (labeled or unlabeled); edge: connects similar examples. If labeled examples $x_i$ and $x_j$ are connected, then $|f(x_i) - f(x_j)|$ should be small. Regularize the classification function $f(x)$.

10 Manifold Assumption
Graph representation. Vertex: a training example (labeled or unlabeled); edge: connects similar examples. Manifold assumption: the data lie on a low-dimensional manifold, and the classification function f(x) should "follow" the data manifold.

11 Statistical View
Generative model for classification: $\Pr(X, Y \mid \theta) = \Pr(Y \mid \theta)\,\Pr(X \mid Y, \theta)$ (graphical model: $\theta \to Y \to X$).

12 Statistical View Generative model for classification
Unlabeled data help estimate $\theta$ (clustering assumption). With $\Pr(X, Y \mid \theta) = \Pr(Y \mid \theta)\,\Pr(X \mid Y, \theta)$, the unlabeled data enter through the marginal likelihood $\Pr(X_u \mid \theta) = \prod_{i=1}^{n_u} \sum_{y=1}^{K} \Pr(x^u_i, y \mid \theta)$.
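To make the generative view concrete, here is a minimal sketch (my own illustration, not from the slides) of a two-class, one-dimensional Gaussian mixture fit by EM, where the unlabeled points enter exactly through the marginal likelihood above. The function name and all parameters are hypothetical.

```python
import numpy as np

def semi_supervised_em(x_l, y_l, x_u, n_iter=50):
    """x_l: labeled 1-D points, y_l: labels in {0, 1}, x_u: unlabeled 1-D points."""
    # Initialize class priors, means, and variances from the labeled data alone.
    pi = np.array([np.mean(y_l == k) for k in (0, 1)])
    mu = np.array([x_l[y_l == k].mean() for k in (0, 1)])
    var = np.array([x_l[y_l == k].var() + 1e-6 for k in (0, 1)])
    x_all = np.concatenate([x_l, x_u])
    resp_l = np.eye(2)[y_l]              # labeled points keep fixed one-hot responsibilities
    for _ in range(n_iter):
        # E-step: Pr(y | x, theta) for the unlabeled points.
        dens = np.stack([pi[k] / np.sqrt(2 * np.pi * var[k])
                         * np.exp(-(x_u - mu[k]) ** 2 / (2 * var[k]))
                         for k in (0, 1)], axis=1)
        resp_u = dens / dens.sum(axis=1, keepdims=True)
        resp = np.vstack([resp_l, resp_u])
        # M-step: re-estimate theta from labeled + unlabeled responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / nk.sum()
        mu = (resp * x_all[:, None]).sum(axis=0) / nk
        var = (resp * (x_all[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var
```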

13 Statistical View
Discriminative model for classification: $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X, \theta)$ (graphical model: $X$ depends on $\mu$; $Y$ depends on $X$ and $\theta$).

14 Statistical View Discriminative model for classification
Unlabeled data help regularize $\theta$ via a prior $\Pr(\theta \mid X)$ (manifold assumption). $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X, \theta)$, with $\Pr(\theta \mid X) \propto \exp\!\big(-\sum_{i,j=1}^{n_u} w_{i,j}\,[f(x_i) - f(x_j)]^2\big)$.

15 Semi-supervised Learning Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

16 Label Propagation: Key Idea
A decision boundary based on the labeled examples alone cannot take the layout of the data points into account. How can the data distribution be incorporated into the prediction of class labels?

17 Label Propagation: Key Idea
Connect the data points that are close to each other

18 Label Propagation: Key Idea
Connect the data points that are close to each other Propagate the class labels over the connected graph

19 Label Propagation: Key Idea
Connect the data points that are close to each other. Propagate the class labels over the connected graph. Note that this differs from K-nearest-neighbor classification.

20 Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$: $W_{i,j} = 1$ if $x_i$ and $x_j$ are connected, 0 otherwise. Similarity matrix $W \in \mathbb{R}_+^{N \times N}$: $W_{i,j}$ is the similarity between $x_i$ and $x_j$.

21 Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$: $W_{i,j} = 1$ if $x_i$ and $x_j$ are connected, 0 otherwise. Similarity matrix $W \in \mathbb{R}_+^{N \times N}$: $W_{i,j}$ is the similarity between $x_i$ and $x_j$. Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j \neq i} W_{i,j}$.
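To make the three matrices concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that builds a Gaussian similarity matrix, a k-nearest-neighbor adjacency matrix, and the degree matrix; `sigma` and `k` are assumed parameters.

```python
import numpy as np

def build_graph(X, sigma=1.0, k=5):
    # Pairwise squared Euclidean distances between all N examples.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Similarity matrix: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal.
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Adjacency matrix: A_ij = 1 if x_j is among the k nearest neighbors of x_i.
    order = np.argsort(sq_dists, axis=1)
    A = np.zeros_like(W)
    for i in range(X.shape[0]):
        A[i, order[i, 1:k + 1]] = 1.0   # skip column 0 (the point itself)
    A = np.maximum(A, A.T)              # symmetrize
    # Degree matrix: D = diag(d_1, ..., d_N) with d_i = sum_j W_ij.
    D = np.diag(W.sum(axis=1))
    return W, A, D
```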

22 Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$. Label information: $y^l = (y^l_1, y^l_2, \ldots, y^l_{n_l}) \in \{-1, +1\}^{n_l}$ is known; $y^u = (y^u_1, y^u_2, \ldots, y^u_{n_u}) \in \{-1, +1\}^{n_u}$ is unknown.

23 Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$. Label information: $y^l = (y^l_1, y^l_2, \ldots, y^l_{n_l}) \in \{-1, +1\}^{n_l}$ is known; the full label vector is $y = (y^l, y^u)$.

24 Label Propagation Initial class assignments
Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$: $\hat{y}_i = \pm 1$ if $x_i$ is labeled, 0 if unlabeled. Predicted class assignments: first predict the confidence scores $f \in \mathbb{R}^N$, then predict the class assignments $y \in \{-1, +1\}^N$ with $y_i = +1$ if $f_i > 0$, $-1$ otherwise.

25 Label Propagation Initial class assignments
Initial class assignments $\hat{y}$: $\hat{y}_i = \pm 1$ if $x_i$ is labeled, 0 if unlabeled. Predicted class assignments: first predict the confidence scores $f = (f_1, \ldots, f_N)$, then predict the class assignments $y \in \{-1, +1\}^N$ with $y_i = +1$ if $f_i > 0$, $-1$ otherwise.

26 Label Propagation (II)
One round of propagation: $f_i = \hat{y}_i$ if $x_i$ is labeled, $\sum_{j=1}^{N} W_{i,j}\,\hat{y}_j$ otherwise (weighted KNN). With a weight $\alpha$ for each propagation step: $f^{(1)} = \hat{y} + \alpha W \hat{y}$.

27 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$. How to generate any number of iterations? $f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$.

28 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for any number of iterations: $f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$.

29 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for an infinite number of iterations: $f^{(\infty)} = \hat{y} + \sum_{i=1}^{\infty} \alpha^i W^i \hat{y}$.

30 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for an infinite number of iterations (matrix inverse): $f^{(\infty)} = (I - \alpha \bar{W})^{-1} \hat{y}$, with the normalized similarity matrix $\bar{W} = D^{-1/2} W D^{-1/2}$.
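A hedged sketch of the propagation just described (my own illustration): starting from the initial assignments, either run a finite number of propagation rounds or jump directly to the closed form; the decay parameter `alpha` is assumed to lie in (0, 1).

```python
import numpy as np

def label_propagation(W, y_hat, alpha=0.5, n_iter=None):
    """W: similarity matrix; y_hat: +1/-1 for labeled points, 0 for unlabeled."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W_bar = D_inv_sqrt @ W @ D_inv_sqrt               # normalized similarity matrix
    if n_iter is None:
        # Infinite number of rounds: closed form f = (I - alpha * W_bar)^(-1) y_hat.
        f = np.linalg.solve(np.eye(len(y_hat)) - alpha * W_bar, y_hat.astype(float))
    else:
        f = y_hat.astype(float)
        for _ in range(n_iter):                       # finite number of propagation rounds
            f = y_hat + alpha * W_bar @ f             # f_k = y_hat + sum_i alpha^i W^i y_hat
    return np.where(f > 0, 1, -1), f                  # predicted classes and confidence scores
```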

31 Local and Global Consistency [Zhou et al., NIPS 03]
Local consistency: Like KNN Global consistency: Beyond KNN

32 Summary: Construct a graph using pairwise similarities
Propagate class labels along the graph: $f = (I - \alpha \bar{W})^{-1} \hat{y}$. Key parameters: $\alpha$, the decay of propagation, and $W$, the similarity matrix. Computational complexity: the matrix inverse is $O(N^3)$; Cholesky decomposition or clustering of the graph can reduce the cost.

33 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

34 Application: Text Classification [Zhou et al., NIPS 03]
20-newsgroups: autos, motorcycles, baseball, and hockey under rec. Pre-processing: stemming, removal of stopwords and rare words, headers skipped. #Docs: 3970, #Words: 8014. (Figure: results comparing SVM, KNN, and label propagation.)

35 Application: Image Retrieval [Wang et al., ACM MM 2004]
5,000 images; relevance feedback collected on the top 20 ranked images. Treated as a classification problem: relevant or not, with f(x) giving the degree of relevance. Learning the relevance function f(x): supervised learning (SVM) vs. label propagation. (Figure: retrieval results comparing label propagation and SVM.)

36 Semi-supervised Learning Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

37 Graph Partition Classification as graph partitioning
Search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut. (Figure: two candidate partitions, with graph cut = 1 and graph cut = 2.)

38 Graph Partitioning Classification as graph partitioning
Search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut. (Figure: a partition with graph cut = 1.)

39 Min-cuts [Blum and Chawla, ICML 2001]
Additional nodes: V+ (source) and V− (sink), with infinite-weight edges connecting the labeled examples to the source and sink. High computational cost. (Figure: a cut with graph cut = 1, showing the source V+ and sink V−.)

40 Harmonic Function [Zhu et al., ICML 2003]
Weight matrix $W$: $w_{i,j} \ge 0$ is the similarity between $x_i$ and $x_j$. Membership vector $f = (f_1, \ldots, f_N)$: $f_i = +1$ if $x_i \in A$, $-1$ if $x_i \in B$.

41 Harmonic Function (cont’d)
Graph cut: $C(f) = \frac{1}{4}\sum_{i,j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top (D - W) f$. Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with diagonal elements $d_i = \sum_{j \neq i} W_{i,j}$.

42 Harmonic Function (cont’d)
Graph Laplacian $L = D - W$: captures the pairwise relationships among data points and the manifold geometry of the data. Graph cut: $C(f) = \frac{1}{4}\sum_{i,j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top L f$.

43 Harmonic Function
Consistent with the graph structure and with the labeled data: $\min_{f \in \{-1,+1\}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i$ for all labeled examples. Challenge: the discrete space makes this a combinatorial optimization problem.

44 Harmonic Function
Relaxation: replace $\{-1, +1\}$ with continuous real numbers, $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; i = 1, \ldots, n_l$, then convert the continuous $f$ back to binary labels.

45 How to handle a large number of unlabeled data points
Harmonic function: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; i = 1, \ldots, n_l$. Partition the Laplacian and the solution as $L = \begin{pmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{pmatrix}$, $f = (f_l, f_u)$; the unlabeled part has the closed form $f_u = -L_{uu}^{-1} L_{ul}\, y_l$. How can a large number of unlabeled data points be handled?
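As a small sketch of the closed form above (my own illustration, not the authors' code): order the examples so the labeled ones come first, partition the Laplacian into blocks, and obtain the unlabeled scores with one linear solve, which avoids forming the explicit inverse.

```python
import numpy as np

def harmonic_solution(W, y_l):
    """W: (N x N) similarity matrix, labeled examples first; y_l: labels in {-1, +1}."""
    n_l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                # graph Laplacian
    L_uu = L[n_l:, n_l:]
    L_ul = L[n_l:, :n_l]
    # Solve L_uu f_u = -L_ul y_l instead of inverting L_uu.
    f_u = -np.linalg.solve(L_uu, L_ul @ y_l)
    return np.where(f_u > 0, 1, -1), f_u     # predicted classes and scores for unlabeled data
```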

46 Harmonic Function
Local propagation: $f_u = -L_{uu}^{-1} L_{ul}\, y_l$.

47 Harmonic Function
$f_u = -L_{uu}^{-1} L_{ul}\, y_l$. Sound familiar? Local propagation vs. global propagation.

48 Spectral Graph Transducer [Joachims, 2003]
Soften the hard constraints: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$.

49 Spectral Graph Transducer [Joachims, 2003]
Soften the hard constraints: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$. Solved as a constrained eigenvector problem.
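Joachims' method solves a constrained eigenvector problem; as a simpler illustration of the same softened objective (an assumption of mine, not Joachims' actual algorithm), the unconstrained quadratic can be minimized with a single linear solve by setting its gradient to zero.

```python
import numpy as np

def soft_label_solution(W, y_l, c=1.0):
    """W: similarity matrix with labeled examples first; y_l: labels in {-1, +1}."""
    N = W.shape[0]
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    S = np.zeros((N, N))
    S[:n_l, :n_l] = np.eye(n_l)                      # selects the labeled entries
    y = np.zeros(N)
    y[:n_l] = y_l
    # Gradient of (1/4) f'Lf + c * sum_i (f_i - y_i)^2 set to zero:
    # (1/2) L f + 2c S f = 2c S y.
    f = np.linalg.solve(0.5 * L + 2 * c * S, 2 * c * S @ y)
    return f
```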

50 Manifold Regularization [Belkin, 2006]
Starting from $\min_{f \in \mathbb{R}^N} \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$: replace the squared error with a general loss function for misclassification and regularize the norm of the classifier.

51 Manifold Regularization [Belkin, 2006]
Loss function: $\ell(f(x_i), y_i)$. Manifold regularization: $\min_{f \in \mathcal{H}_K} \sum_{i=1}^{n_l} \ell(f(x_i), y_i) + \gamma_A \|f\|^2_{\mathcal{H}_K} + \gamma_I\, f^\top L f$.
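A hedged sketch of this objective using a linear model $f(x) = w^\top x$ and squared loss instead of the RKHS form on the slide (my simplification, not the Belkin et al. algorithm): the norm penalty becomes $\gamma_A \|w\|^2$ and the graph penalty becomes $\gamma_I (Xw)^\top L (Xw)$, so the minimizer is one linear system.

```python
import numpy as np

def manifold_regularized_linear(X, y_l, W, gamma_A=0.1, gamma_I=0.1):
    """X: all N examples (labeled first); y_l: labels of the first n_l; W: similarity matrix."""
    n_l = len(y_l)
    X_l = X[:n_l]
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian over all N points
    d = X.shape[1]
    # (X_l'X_l + gamma_A I + gamma_I X'LX) w = X_l' y_l
    A = X_l.T @ X_l + gamma_A * np.eye(d) + gamma_I * X.T @ L @ X
    w = np.linalg.solve(A, X_l.T @ y_l)
    return w                                   # predict new points with sign(x @ w)
```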

52 Summary Construct a graph using pairwise similarity
Key quantity: the graph Laplacian, which captures the geometry of the graph. The decision boundary is consistent with both the graph structure and the labeled examples. Parameters: the trade-off constants (e.g., $c$, $\gamma_A$, $\gamma_I$) and the similarity measure.

53 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

54 Application: Text Classification
20-newsgroups: autos, motorcycles, baseball, and hockey under rec. Pre-processing: stemming, removal of stopwords and rare words, headers skipped. #Docs: 3970, #Words: 8014. (Figure: results comparing SVM, KNN, label propagation, and the harmonic function.)

55 Application: Text Classification
PRBEP: precision-recall break-even point.

56 Application: Text Classification
Improvement in PRBEP by SGT

57 Semi-supervised Classification Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

58 Transductive SVM Support vector machine
Classification margin Maximum classification margin Decision boundary given a small number of labeled examples

59 Transductive SVM Decision boundary given a small number of labeled examples How to change decision boundary given both labeled and unlabeled examples ?

60 Transductive SVM Decision boundary given a small number of labeled examples; move the decision boundary to a region of low local density.

61 Transductive SVM
Classification margin $\omega(X, y; f)$, where $f(x)$ is the classification function.

62 Transductive SVM
Supervised learning maximizes the margin over the labeled data: $f^* = \arg\max_{f \in \mathcal{H}_K} \omega(X_l, y_l; f)$.

63 Transductive SVM
Semi-supervised learning optimizes over both $f(x)$ and the labels $y_u$ of the unlabeled data: $f^* = \arg\max_{f \in \mathcal{H}_K,\, y_u} \omega(X_{l+u}, y_l \cup y_u; f)$.

64 Transductive SVM Decision boundary given a small number of labeled examples; move the decision boundary to a place with low local density. Classification results follow. How to formulate this idea?

65 Transductive SVM: Formulation
A binary variable for the label of each unlabeled example. Transductive SVM: the original SVM objective plus margin constraints for the unlabeled data.

66 Computational Issue
No longer a convex optimization problem; commonly handled by alternating optimization (fix the labels $y_u$ and optimize $f$, then fix $f$ and update $y_u$).
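A rough sketch of the alternating idea (a simplified self-labeling loop of my own, not Joachims' exact SVM-light TSVM; assumes scikit-learn): train on the labeled data, guess labels for the unlabeled data, and retrain while gradually increasing the weight given to the guessed labels.

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_alternating(X_l, y_l, X_u, n_rounds=5):
    clf = SVC(kernel="linear", C=1.0).fit(X_l, y_l)
    X_all = np.vstack([X_l, X_u])
    for t in range(1, n_rounds + 1):
        y_u = clf.predict(X_u)                         # current guess for the unlabeled labels
        y_all = np.concatenate([y_l, y_u])
        # Ramp up the influence of the unlabeled examples round by round.
        weights = np.concatenate([np.ones(len(y_l)),
                                  np.full(len(y_u), t / n_rounds)])
        clf = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=weights)
    return clf
```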

67 Summary Based on maximum margin principle
The classification margin is decided by both the labeled examples and the class labels assigned to the unlabeled data. High computational cost. Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM.

68 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

69 Text Classification by TSVM
10 categories from the Reuters collection; 3,299 test documents; 1,000 informative words selected by the mutual information (MI) criterion.

70 Semi-supervised Classification Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

71 Co-training [Blum & Mitchell, 1998]
Classify web pages into a student category and a professor category. Two views of a web page: content ("I am currently a second-year Ph.D. student ...") and hyperlinks ("My advisor is ...", "Students: ...").

72 Co-training for Semi-Supervised Learning

73 Co-training for Semi-Supervised Learning
It is easier to classify this web page using hyperlinks It is easy to classify the type of this web page based on its content

74 Co-training Two representation for each web page
Content representation: (doctoral, student, computer, university, ...). Hyperlink representation: inlinks: Prof. Cheng; outlinks: Prof. Cheng.

75 Co-training Train a content-based classifier

76 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified

77 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

78 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

79 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

80 Co-training Assume two views of objects Key idea
Two sufficient representations (views). Key idea: augment the training examples of one view by exploiting the classifier of the other view (a sketch follows below). Extends to multiple views. Problem: how to find equivalent views.
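A minimal co-training sketch (my own illustration, assuming scikit-learn): `X1_*` and `X2_*` are the two feature views (e.g., content and hyperlink count features for MultinomialNB), and `k` is the number of confidently classified examples each view labels per round.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, k=5):
    unlabeled = np.arange(len(X1_u))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        if len(unlabeled) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            if len(unlabeled) == 0:
                break
            # Each view labels the unlabeled examples it is most confident about.
            proba = clf.predict_proba(X_view[unlabeled])
            pick = unlabeled[np.argsort(-proba.max(axis=1))[:k]]
            X1_l = np.vstack([X1_l, X1_u[pick]])
            X2_l = np.vstack([X2_l, X2_u[pick]])
            y_l = np.concatenate([y_l, clf.predict(X_view[pick])])
            unlabeled = np.setdiff1d(unlabeled, pick)
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```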

81 Active Learning Active learning
Select the most informative examples, in contrast to passive learning. Key question: which examples are informative? Uncertainty principle: the most informative example is the one that is most uncertain to classify, so we need a measure of classification uncertainty.

82 Active Learning Simple but very effective approaches
Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty corresponds to the largest degree of disagreement among them. SVM-based approach: classification uncertainty corresponds to the distance to the decision boundary (see the sketch below). Simple but very effective approaches.
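A small sketch of the SVM-based criterion (my own illustration, assuming scikit-learn): the most informative unlabeled example is the one closest to the current decision boundary.

```python
import numpy as np
from sklearn.svm import SVC

def query_most_uncertain(X_l, y_l, X_u):
    clf = SVC(kernel="linear", C=1.0).fit(X_l, y_l)
    # |decision_function| is proportional to the distance from the boundary.
    margins = np.abs(clf.decision_function(X_u))
    return int(np.argmin(margins))          # index of the unlabeled example to label next
```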

83 Semi-supervised Clustering
Clustering data into two clusters

84 Semi-supervised Clustering
(Figure: must-link and cannot-link pairs.) Clustering data into two clusters. Side information: must-links vs. cannot-links.

85 Semi-supervised Clustering
Also called constrained clustering Two types of approaches Restricted data partitions Distance metric learning approaches

86 Restricted Data Partition
Require data partitions to be consistent with the given links. Links as hard constraints: e.g., constrained K-means (Wagstaff et al., 2001). Links as soft constraints: e.g., Metric Pairwise Constraints K-means (Basu et al., 2004). A soft-constraint sketch follows below.
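A rough sketch in the spirit of the soft-constraint variants cited above (my own simplification, not an exact reimplementation): each point is assigned to the cluster minimizing squared distance plus a penalty `w` for every must-link or cannot-link it would violate given the current assignments.

```python
import numpy as np

def constrained_kmeans(X, k, must, cannot, w=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i in range(len(X)):
            cost = ((centers - X[i]) ** 2).sum(axis=1)      # squared distance to each center
            for (a, b) in must:                              # must-link violated if clusters differ
                if i in (a, b):
                    j = b if i == a else a
                    cost += w * (np.arange(k) != labels[j])
            for (a, b) in cannot:                            # cannot-link violated if clusters match
                if i in (a, b):
                    j = b if i == a else a
                    cost += w * (np.arange(k) == labels[j])
            labels[i] = int(np.argmin(cost))
        for c in range(k):                                   # update cluster centers
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```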

87 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link Yes

88 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link Yes

89 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link No

90 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 0.

91 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 0.

92 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 1.

93 Distance Metric Learning
Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link and shorten the distance for a must-link. Then apply K-means with pairwise distances measured by the learned metric (a sketch follows below). (Figure: data transformed by the learned distance metric, with must-links and cannot-links shown.)
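A hedged sketch of one simple way to realize this idea (an RCA-style whitening of my own choosing, not necessarily the algorithm behind the figure): estimate a covariance from the must-link difference vectors, use its inverse as a Mahalanobis metric, and cluster the transformed data so must-linked points move closer together.

```python
import numpy as np
from scipy.linalg import sqrtm

def metric_from_must_links(X, must, eps=1e-6):
    """X: data matrix; must: list of (i, j) index pairs that should be in the same cluster."""
    diffs = np.array([X[a] - X[b] for (a, b) in must])
    C = diffs.T @ diffs / len(must) + eps * np.eye(X.shape[1])   # must-link covariance
    M = np.linalg.inv(C)                     # Mahalanobis matrix
    T = np.real(sqrtm(M))                    # linear map so that dist_M(x, x') = ||Tx - Tx'||
    return X @ T                             # run K-means on the transformed data
```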

94 Example of Distance Metric Learning
(Figures: 2D data projection using the Euclidean distance metric vs. the learned distance metric; solid lines are must-links, dotted lines are cannot-links.)

95 BoostCluster [Liu, Jin & Jain, 2007]
General framework for semi-supervised clustering: improves any given unsupervised clustering algorithm with pairwise constraints. Key challenges: How to influence an arbitrary clustering algorithm with side information? Encode the constraints into the data representation. How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance.

96 BoostCluster
Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm. (Block diagram components: data, pairwise constraints, kernel matrix, new data representation, clustering algorithm, results, final results.)

97 BoostCluster
Find the best data representation that encodes the unsatisfied pairwise constraints.

98 BoostCluster
Obtain the clustering results given the new data representation.

99 BoostCluster
Update the kernel matrix with the clustering results.

100 BoostCluster
Run the procedure iteratively.

101 BoostCluster
Compute the final clustering result.

102 Summary Clustering data under given pairwise constraints
Must-links vs. cannot-links. Two types of approaches: restricted data partitions (with either soft or hard constraints) and distance metric learning. Question: how to acquire the links/constraints? Manual assignment, or derived from side information (hyperlinks, citations, user logs, etc.), which may be noisy and unreliable.

103 Application: Document Clustering [Basu et al., 2004]
300 docs from three topics (atheism, baseball, space) of 20-newsgroups; 3,251 unique words after stemming and removal of stopwords and rare words. Evaluation metric: Normalized Mutual Information (NMI). KMeans-x-x: different variants of constrained clustering algorithms.
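NMI, the evaluation metric used here, compares a clustering against the true classes; with scikit-learn it is a one-liner (the label arrays below are hypothetical, for illustration only).

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]       # hypothetical ground-truth topics
cluster_ids = [0, 0, 1, 2, 2, 2]       # hypothetical clustering output
print(normalized_mutual_info_score(true_labels, cluster_ids))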

104 Kernel Learning Kernel plays central role in machine learning
Kernel functions can be learned from data Kernel alignment, multiple kernel learning, non-parametric learning, … Kernel learning is suitable for IR Similarity measure is key to IR Kernel learning allows us to identify the optimal similarity measure automatically

105 Transfer Learning Different document categories are correlated
We should be able to borrow information of one class to the training of another class Key question: what to transfer between classes? Representation, model priors, similarity measure …

