Semi-supervised Learning (presentation transcript)

1 Semi-supervised Learning
Rong Jin

2 Spectrum of Learning Problems

3 What is Semi-supervised Learning?
Learning from a mixture of labeled and unlabeled examples. Labeled data $L = \{(x^l_i, y_i)\}_{i=1}^{n_l}$; unlabeled data $U = \{x^u_i\}_{i=1}^{n_u}$. Goal: learn $f(x): X \to Y$. Total number of examples: $N = n_l + n_u$.

4 Why Semi-supervised Learning?
Labeling is expensive and difficult, and labels can be unreliable (e.g., segmentation applications may need multiple experts). Unlabeled examples are easy to obtain in large numbers (e.g., web pages, text documents).

5 Semi-supervised Learning Problems
Classification: transductive (predict labels of the unlabeled data) or inductive (learn a classification function). Clustering (constrained clustering). Ranking (semi-supervised ranking). Almost every learning problem has a semi-supervised counterpart.

6 Why Unlabeled Could be Helpful
Clustering assumption: unlabeled data help decide the decision boundary. Manifold assumption: unlabeled data help decide the decision function $f(x)$.

7 Clustering Assumption
?

8 Clustering Assumption
Suggest a simple algorithm for semi-supervised learning. Two approaches: (a) cluster the data first and label each cluster by its dominant class; (b) use the unlabeled data to choose an appropriate decision boundary. Points with the same label are connected through high-density regions, thereby defining a cluster; clusters are separated by low-density regions.

9 Manifold Assumption Regularize the classification function f(x)
Graph representation. Vertex: a training example (labeled or unlabeled); edge: connects similar examples. If labeled examples $x_i$ and $x_j$ are connected, then $|f(x_i) - f(x_j)|$ should be small. Regularize the classification function $f(x)$.

10 Manifold Assumption
Graph representation. Vertex: a training example (labeled or unlabeled); edge: connects similar examples. Manifold assumption: the data lie on a low-dimensional manifold, and the classification function f(x) should "follow" the data manifold.

11 Statistical View
Generative model for classification: $\Pr(X, Y \mid \theta) = \Pr(Y \mid \theta)\,\Pr(X \mid Y, \theta)$ (graphical model: $\theta \to Y \to X$).

12 Statistical View Generative model for classification
Unlabeled data help estimate $\theta$ (clustering assumption). With $\Pr(X, Y \mid \theta) = \Pr(Y \mid \theta)\,\Pr(X \mid Y, \theta)$, the unlabeled data enter through the marginal likelihood $\Pr(X_u \mid \theta) = \prod_{i=1}^{n_u} \sum_{y=1}^{K} \Pr(x^u_i, y \mid \theta)$.
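To make the generative view concrete, here is a minimal sketch (my own illustration, not from the slides) of a two-class, one-dimensional Gaussian mixture fit by EM, where the unlabeled points enter exactly through the marginal likelihood above. The function name and all parameters are hypothetical.

```python
import numpy as np

def semi_supervised_em(x_l, y_l, x_u, n_iter=50):
    """x_l: labeled 1-D points, y_l: labels in {0, 1}, x_u: unlabeled 1-D points."""
    # Initialize class priors, means, and variances from the labeled data alone.
    pi = np.array([np.mean(y_l == k) for k in (0, 1)])
    mu = np.array([x_l[y_l == k].mean() for k in (0, 1)])
    var = np.array([x_l[y_l == k].var() + 1e-6 for k in (0, 1)])
    x_all = np.concatenate([x_l, x_u])
    resp_l = np.eye(2)[y_l]              # labeled points keep fixed one-hot responsibilities
    for _ in range(n_iter):
        # E-step: Pr(y | x, theta) for the unlabeled points.
        dens = np.stack([pi[k] / np.sqrt(2 * np.pi * var[k])
                         * np.exp(-(x_u - mu[k]) ** 2 / (2 * var[k]))
                         for k in (0, 1)], axis=1)
        resp_u = dens / dens.sum(axis=1, keepdims=True)
        resp = np.vstack([resp_l, resp_u])
        # M-step: re-estimate theta from labeled + unlabeled responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / nk.sum()
        mu = (resp * x_all[:, None]).sum(axis=0) / nk
        var = (resp * (x_all[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var
```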

13 Statistical View
Discriminative model for classification: $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X, \theta)$ (graphical model: $X$ depends on $\mu$; $Y$ depends on $X$ and $\theta$).

14 Statistical View Discriminative model for classification
Unlabeled data help regularize $\theta$ via a prior $\Pr(\theta \mid X)$ (manifold assumption). $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X, \theta)$, with $\Pr(\theta \mid X) \propto \exp\!\big(-\sum_{i,j=1}^{n_u} w_{i,j}\,[f(x_i) - f(x_j)]^2\big)$.

15 Semi-supervised Learning Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

16 Label Propagation: Key Idea
A decision boundary based on the labeled examples alone cannot take the layout of the data points into account. How can the data distribution be incorporated into the prediction of class labels?

17 Label Propagation: Key Idea
Connect the data points that are close to each other

18 Label Propagation: Key Idea
Connect the data points that are close to each other Propagate the class labels over the connected graph

19 Label Propagation: Key Idea
Connect the data points that are close to each other. Propagate the class labels over the connected graph. Note that this differs from K-nearest-neighbor classification.

20 Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$: $W_{i,j} = 1$ if $x_i$ and $x_j$ are connected, 0 otherwise. Similarity matrix $W \in \mathbb{R}_+^{N \times N}$: $W_{i,j}$ is the similarity between $x_i$ and $x_j$.

21 Label Propagation: Representation
Adjacency matrix $W \in \{0, 1\}^{N \times N}$: $W_{i,j} = 1$ if $x_i$ and $x_j$ are connected, 0 otherwise. Similarity matrix $W \in \mathbb{R}_+^{N \times N}$: $W_{i,j}$ is the similarity between $x_i$ and $x_j$. Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j \neq i} W_{i,j}$.
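To make the three matrices concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that builds a Gaussian similarity matrix, a k-nearest-neighbor adjacency matrix, and the degree matrix; `sigma` and `k` are assumed parameters.

```python
import numpy as np

def build_graph(X, sigma=1.0, k=5):
    # Pairwise squared Euclidean distances between all N examples.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Similarity matrix: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal.
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Adjacency matrix: A_ij = 1 if x_j is among the k nearest neighbors of x_i.
    order = np.argsort(sq_dists, axis=1)
    A = np.zeros_like(W)
    for i in range(X.shape[0]):
        A[i, order[i, 1:k + 1]] = 1.0   # skip column 0 (the point itself)
    A = np.maximum(A, A.T)              # symmetrize
    # Degree matrix: D = diag(d_1, ..., d_N) with d_i = sum_j W_ij.
    D = np.diag(W.sum(axis=1))
    return W, A, D
```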

22 Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$. Label information: $y^l = (y^l_1, y^l_2, \ldots, y^l_{n_l}) \in \{-1, +1\}^{n_l}$ is known; $y^u = (y^u_1, y^u_2, \ldots, y^u_{n_u}) \in \{-1, +1\}^{n_u}$ is unknown.

23 Label Propagation: Representation
Given: similarity matrix $W \in \mathbb{R}_+^{N \times N}$. Label information: $y^l = (y^l_1, y^l_2, \ldots, y^l_{n_l}) \in \{-1, +1\}^{n_l}$ is known; the full label vector is $y = (y^l, y^u)$.

24 Label Propagation Initial class assignments
Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$: $\hat{y}_i = \pm 1$ if $x_i$ is labeled, 0 if unlabeled. Predicted class assignments: first predict the confidence scores $f \in \mathbb{R}^N$, then predict the class assignments $y \in \{-1, +1\}^N$ with $y_i = +1$ if $f_i > 0$, $-1$ otherwise.

25 Label Propagation Initial class assignments
Initial class assignments $\hat{y}$: $\hat{y}_i = \pm 1$ if $x_i$ is labeled, 0 if unlabeled. Predicted class assignments: first predict the confidence scores $f = (f_1, \ldots, f_N)$, then predict the class assignments $y \in \{-1, +1\}^N$ with $y_i = +1$ if $f_i > 0$, $-1$ otherwise.

26 Label Propagation (II)
One round of propagation: $f_i = \hat{y}_i$ if $x_i$ is labeled, $\sum_{j=1}^{N} W_{i,j}\,\hat{y}_j$ otherwise (weighted KNN). With a weight $\alpha$ for each propagation step: $f^{(1)} = \hat{y} + \alpha W \hat{y}$.

27 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)} = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$. How to generate any number of iterations? $f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$.

28 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for any number of iterations: $f^{(k)} = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$.

29 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for an infinite number of iterations: $f^{(\infty)} = \hat{y} + \sum_{i=1}^{\infty} \alpha^i W^i \hat{y}$.

30 Label Propagation (II)
Two rounds of propagation: $f^{(2)} = \hat{y} + \alpha W f^{(1)}$. Result for an infinite number of iterations (matrix inverse): $f^{(\infty)} = (I - \alpha \bar{W})^{-1} \hat{y}$, with the normalized similarity matrix $\bar{W} = D^{-1/2} W D^{-1/2}$.
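A hedged sketch of the propagation just described (my own illustration): starting from the initial assignments, either run a finite number of propagation rounds or jump directly to the closed form; the decay parameter `alpha` is assumed to lie in (0, 1).

```python
import numpy as np

def label_propagation(W, y_hat, alpha=0.5, n_iter=None):
    """W: similarity matrix; y_hat: +1/-1 for labeled points, 0 for unlabeled."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W_bar = D_inv_sqrt @ W @ D_inv_sqrt               # normalized similarity matrix
    if n_iter is None:
        # Infinite number of rounds: closed form f = (I - alpha * W_bar)^(-1) y_hat.
        f = np.linalg.solve(np.eye(len(y_hat)) - alpha * W_bar, y_hat.astype(float))
    else:
        f = y_hat.astype(float)
        for _ in range(n_iter):                       # finite number of propagation rounds
            f = y_hat + alpha * W_bar @ f             # f_k = y_hat + sum_i alpha^i W^i y_hat
    return np.where(f > 0, 1, -1), f                  # predicted classes and confidence scores
```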

31 Local and Global Consistency [Zhou et al., NIPS 03]
Local consistency: Like KNN Global consistency: Beyond KNN

32 Summary: Construct a graph using pairwise similarities
Propagate class labels along the graph: $f = (I - \alpha \bar{W})^{-1} \hat{y}$. Key parameters: $\alpha$, the decay of propagation, and $W$, the similarity matrix. Computational complexity: the matrix inverse is $O(N^3)$; Cholesky decomposition or clustering of the graph can reduce the cost.

33 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

34 Application: Text Classification [Zhou et al., NIPS 03]
20-newsgroups: autos, motorcycles, baseball, and hockey under rec. Pre-processing: stemming, removal of stopwords and rare words, headers skipped. #Docs: 3970, #Words: 8014. (Figure: results comparing SVM, KNN, and label propagation.)

35 Application: Image Retrieval [Wang et al., ACM MM 2004]
5,000 images; relevance feedback collected on the top 20 ranked images. Treated as a classification problem: relevant or not, with f(x) giving the degree of relevance. Learning the relevance function f(x): supervised learning (SVM) vs. label propagation. (Figure: retrieval results comparing label propagation and SVM.)

36 Semi-supervised Learning Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

37 Graph Partition Classification as graph partitioning
Search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut. (Figure: two candidate partitions, with graph cut = 1 and graph cut = 2.)

38 Graph Partitioning Classification as graph partitioning
Search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut. (Figure: a partition with graph cut = 1.)

39 Min-cuts [Blum and Chawla, ICML 2001]
Additional nodes: V+ (source) and V− (sink), with infinite-weight edges connecting the labeled examples to the source and sink. High computational cost. (Figure: a cut with graph cut = 1, showing the source V+ and sink V−.)

40 Harmonic Function [Zhu et al., ICML 2003]
Weight matrix $W$: $w_{i,j} \ge 0$ is the similarity between $x_i$ and $x_j$. Membership vector $f = (f_1, \ldots, f_N)$: $f_i = +1$ if $x_i \in A$, $-1$ if $x_i \in B$.

41 Harmonic Function (cont’d)
Graph cut: $C(f) = \frac{1}{4}\sum_{i,j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top (D - W) f$. Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with diagonal elements $d_i = \sum_{j \neq i} W_{i,j}$.

42 Harmonic Function (cont’d)
Graph Laplacian $L = D - W$: captures the pairwise relationships among data points and the manifold geometry of the data. Graph cut: $C(f) = \frac{1}{4}\sum_{i,j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4} f^\top L f$.

43 Harmonic Function
Consistent with the graph structure and with the labeled data: $\min_{f \in \{-1,+1\}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i$ for all labeled examples. Challenge: the discrete space makes this a combinatorial optimization problem.

44 Harmonic Function
Relaxation: replace $\{-1, +1\}$ with continuous real numbers, $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; i = 1, \ldots, n_l$, then convert the continuous $f$ back to binary labels.

45 How to handle a large number of unlabeled data points
Harmonic function: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; i = 1, \ldots, n_l$. Partition the Laplacian and the solution as $L = \begin{pmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{pmatrix}$, $f = (f_l, f_u)$; the unlabeled part has the closed form $f_u = -L_{uu}^{-1} L_{ul}\, y_l$. How can a large number of unlabeled data points be handled?
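As a small sketch of the closed form above (my own illustration, not the authors' code): order the examples so the labeled ones come first, partition the Laplacian into blocks, and obtain the unlabeled scores with one linear solve, which avoids forming the explicit inverse.

```python
import numpy as np

def harmonic_solution(W, y_l):
    """W: (N x N) similarity matrix, labeled examples first; y_l: labels in {-1, +1}."""
    n_l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                # graph Laplacian
    L_uu = L[n_l:, n_l:]
    L_ul = L[n_l:, :n_l]
    # Solve L_uu f_u = -L_ul y_l instead of inverting L_uu.
    f_u = -np.linalg.solve(L_uu, L_ul @ y_l)
    return np.where(f_u > 0, 1, -1), f_u     # predicted classes and scores for unlabeled data
```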

46 Harmonic Function
Local propagation: $f_u = -L_{uu}^{-1} L_{ul}\, y_l$.

47 Harmonic Function
$f_u = -L_{uu}^{-1} L_{ul}\, y_l$. Sound familiar? Local propagation vs. global propagation.

48 Spectral Graph Transducer [Joachims, 2003]
Soften the hard constraints: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$.

49 Spectral Graph Transducer [Joachims, 2003]
Soften the hard constraints: $\min_{f \in \mathbb{R}^N} C(f) = \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$. Solved as a constrained eigenvector problem.
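Joachims' method solves a constrained eigenvector problem; as a simpler illustration of the same softened objective (an assumption of mine, not Joachims' actual algorithm), the unconstrained quadratic can be minimized with a single linear solve by setting its gradient to zero.

```python
import numpy as np

def soft_label_solution(W, y_l, c=1.0):
    """W: similarity matrix with labeled examples first; y_l: labels in {-1, +1}."""
    N = W.shape[0]
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    S = np.zeros((N, N))
    S[:n_l, :n_l] = np.eye(n_l)                      # selects the labeled entries
    y = np.zeros(N)
    y[:n_l] = y_l
    # Gradient of (1/4) f'Lf + c * sum_i (f_i - y_i)^2 set to zero:
    # (1/2) L f + 2c S f = 2c S y.
    f = np.linalg.solve(0.5 * L + 2 * c * S, 2 * c * S @ y)
    return f
```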

50 Manifold Regularization [Belkin, 2006]
Starting from $\min_{f \in \mathbb{R}^N} \frac{1}{4} f^\top L f + c \sum_{i=1}^{n_l} (f_i - y_i)^2$: replace the squared error with a general loss function for misclassification and regularize the norm of the classifier.

51 Manifold Regularization [Belkin, 2006]
Loss function: $\ell(f(x_i), y_i)$. Manifold regularization: $\min_{f \in \mathcal{H}_K} \sum_{i=1}^{n_l} \ell(f(x_i), y_i) + \gamma_A \|f\|^2_{\mathcal{H}_K} + \gamma_I\, f^\top L f$.
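A hedged sketch of this objective using a linear model $f(x) = w^\top x$ and squared loss instead of the RKHS form on the slide (my simplification, not the Belkin et al. algorithm): the norm penalty becomes $\gamma_A \|w\|^2$ and the graph penalty becomes $\gamma_I (Xw)^\top L (Xw)$, so the minimizer is one linear system.

```python
import numpy as np

def manifold_regularized_linear(X, y_l, W, gamma_A=0.1, gamma_I=0.1):
    """X: all N examples (labeled first); y_l: labels of the first n_l; W: similarity matrix."""
    n_l = len(y_l)
    X_l = X[:n_l]
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian over all N points
    d = X.shape[1]
    # (X_l'X_l + gamma_A I + gamma_I X'LX) w = X_l' y_l
    A = X_l.T @ X_l + gamma_A * np.eye(d) + gamma_I * X.T @ L @ X
    w = np.linalg.solve(A, X_l.T @ y_l)
    return w                                   # predict new points with sign(x @ w)
```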

52 Summary Construct a graph using pairwise similarity
Key quantity: the graph Laplacian, which captures the geometry of the graph. The decision boundary is consistent with both the graph structure and the labeled examples. Parameters: the trade-off constants (e.g., $c$, $\gamma_A$, $\gamma_I$) and the similarity measure.

53 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

54 Application: Text Classification
20-newsgroups: autos, motorcycles, baseball, and hockey under rec. Pre-processing: stemming, removal of stopwords and rare words, headers skipped. #Docs: 3970, #Words: 8014. (Figure: results comparing SVM, KNN, label propagation, and the harmonic function.)

55 Application: Text Classification
PRBEP: precision-recall break-even point.

56 Application: Text Classification
Improvement in PRBEP by SGT

57 Semi-supervised Classification Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

58 Transductive SVM Support vector machine
Classification margin Maximum classification margin Decision boundary given a small number of labeled examples

59 Transductive SVM Decision boundary given a small number of labeled examples How to change decision boundary given both labeled and unlabeled examples ?

60 Transductive SVM Decision boundary given a small number of labeled examples; move the decision boundary to a region of low local density.

61 Transductive SVM
Classification margin $\omega(X, y; f)$, where $f(x)$ is the classification function.

62 Transductive SVM
Supervised learning maximizes the margin over the labeled data: $f^* = \arg\max_{f \in \mathcal{H}_K} \omega(X_l, y_l; f)$.

63 Transductive SVM
Semi-supervised learning optimizes over both $f(x)$ and the labels $y_u$ of the unlabeled data: $f^* = \arg\max_{f \in \mathcal{H}_K,\, y_u} \omega(X_{l+u}, y_l \cup y_u; f)$.

64 Transductive SVM Decision boundary given a small number of labeled examples; move the decision boundary to a place with low local density. Classification results follow. How to formulate this idea?

65 Transductive SVM: Formulation
A binary variable for the label of each unlabeled example. Transductive SVM: the original SVM objective plus margin constraints for the unlabeled data.

66 Computational Issue
No longer a convex optimization problem; commonly handled by alternating optimization (fix the labels $y_u$ and optimize $f$, then fix $f$ and update $y_u$).
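A rough sketch of the alternating idea (a simplified self-labeling loop of my own, not Joachims' exact SVM-light TSVM; assumes scikit-learn): train on the labeled data, guess labels for the unlabeled data, and retrain while gradually increasing the weight given to the guessed labels.

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_alternating(X_l, y_l, X_u, n_rounds=5):
    clf = SVC(kernel="linear", C=1.0).fit(X_l, y_l)
    X_all = np.vstack([X_l, X_u])
    for t in range(1, n_rounds + 1):
        y_u = clf.predict(X_u)                         # current guess for the unlabeled labels
        y_all = np.concatenate([y_l, y_u])
        # Ramp up the influence of the unlabeled examples round by round.
        weights = np.concatenate([np.ones(len(y_l)),
                                  np.full(len(y_u), t / n_rounds)])
        clf = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=weights)
    return clf
```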

67 Summary Based on maximum margin principle
The classification margin is decided by both the labeled examples and the class labels assigned to the unlabeled data. High computational cost. Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM.

68 Questions
Is this approach transductive or inductive? Does it rely on the cluster assumption or the manifold assumption? (Transductive: predict classes for the unlabeled data. Inductive: learn a classification function.)

69 Text Classification by TSVM
10 categories from the Reuters collection; 3,299 test documents; 1,000 informative words selected by the mutual information (MI) criterion.

70 Semi-supervised Classification Algorithms
Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

71 Co-training [Blum & Mitchell, 1998]
Classify web pages into a student category and a professor category. Two views of a web page: content ("I am currently a second-year Ph.D. student ...") and hyperlinks ("My advisor is ...", "Students: ...").

72 Co-training for Semi-Supervised Learning

73 Co-training for Semi-Supervised Learning
It is easier to classify this web page using hyperlinks It is easy to classify the type of this web page based on its content

74 Co-training Two representation for each web page
Content representation: (doctoral, student, computer, university, ...). Hyperlink representation: inlinks: Prof. Cheng; outlinks: Prof. Cheng.

75 Co-training Train a content-based classifier

76 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified

77 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

78 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

79 Co-training Train a content-based classifier using labeled examples
Label the unlabeled examples that are confidently classified Train a hyperlink-based classifier

80 Co-training Assume two views of objects Key idea
Two sufficient representations (views). Key idea: augment the training examples of one view by exploiting the classifier of the other view (a sketch follows below). Extends to multiple views. Problem: how to find equivalent views.
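A minimal co-training sketch (my own illustration, assuming scikit-learn): `X1_*` and `X2_*` are the two feature views (e.g., content and hyperlink count features for MultinomialNB), and `k` is the number of confidently classified examples each view labels per round.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, k=5):
    unlabeled = np.arange(len(X1_u))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        if len(unlabeled) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            if len(unlabeled) == 0:
                break
            # Each view labels the unlabeled examples it is most confident about.
            proba = clf.predict_proba(X_view[unlabeled])
            pick = unlabeled[np.argsort(-proba.max(axis=1))[:k]]
            X1_l = np.vstack([X1_l, X1_u[pick]])
            X2_l = np.vstack([X2_l, X2_u[pick]])
            y_l = np.concatenate([y_l, clf.predict(X_view[pick])])
            unlabeled = np.setdiff1d(unlabeled, pick)
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```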

81 Active Learning Active learning
Select the most informative examples, in contrast to passive learning. Key question: which examples are informative? Uncertainty principle: the most informative example is the one that is most uncertain to classify, so we need a measure of classification uncertainty.

82 Active Learning Simple but very effective approaches
Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty corresponds to the largest degree of disagreement among them. SVM-based approach: classification uncertainty corresponds to the distance to the decision boundary (see the sketch below). Simple but very effective approaches.
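A small sketch of the SVM-based criterion (my own illustration, assuming scikit-learn): the most informative unlabeled example is the one closest to the current decision boundary.

```python
import numpy as np
from sklearn.svm import SVC

def query_most_uncertain(X_l, y_l, X_u):
    clf = SVC(kernel="linear", C=1.0).fit(X_l, y_l)
    # |decision_function| is proportional to the distance from the boundary.
    margins = np.abs(clf.decision_function(X_u))
    return int(np.argmin(margins))          # index of the unlabeled example to label next
```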

83 Semi-supervised Clustering
Clustering data into two clusters

84 Semi-supervised Clustering
(Figure: must-link and cannot-link pairs.) Clustering data into two clusters. Side information: must-links vs. cannot-links.

85 Semi-supervised Clustering
Also called constrained clustering Two types of approaches Restricted data partitions Distance metric learning approaches

86 Restricted Data Partition
Require data partitions to be consistent with the given links. Links as hard constraints: e.g., constrained K-means (Wagstaff et al., 2001). Links as soft constraints: e.g., Metric Pairwise Constraints K-means (Basu et al., 2004). A soft-constraint sketch follows below.
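A rough sketch in the spirit of the soft-constraint variants cited above (my own simplification, not an exact reimplementation): each point is assigned to the cluster minimizing squared distance plus a penalty `w` for every must-link or cannot-link it would violate given the current assignments.

```python
import numpy as np

def constrained_kmeans(X, k, must, cannot, w=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i in range(len(X)):
            cost = ((centers - X[i]) ** 2).sum(axis=1)      # squared distance to each center
            for (a, b) in must:                              # must-link violated if clusters differ
                if i in (a, b):
                    j = b if i == a else a
                    cost += w * (np.arange(k) != labels[j])
            for (a, b) in cannot:                            # cannot-link violated if clusters match
                if i in (a, b):
                    j = b if i == a else a
                    cost += w * (np.arange(k) == labels[j])
            labels[i] = int(np.argmin(cost))
        for c in range(k):                                   # update cluster centers
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```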

87 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link Yes

88 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link Yes

89 Restricted Data Partition
Hard constraints Cluster memberships must obey the link constraints must link cannot link No

90 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 0.

91 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 0.

92 Restricted Data Partition
Soft constraints: penalize a data clustering if it violates some links. Must link, cannot link; penalty = 1.

93 Distance Metric Learning
Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link and shorten the distance for a must-link. Then apply K-means with pairwise distances measured by the learned metric (a sketch follows below). (Figure: data transformed by the learned distance metric, with must-links and cannot-links shown.)
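A hedged sketch of one simple way to realize this idea (an RCA-style whitening of my own choosing, not necessarily the algorithm behind the figure): estimate a covariance from the must-link difference vectors, use its inverse as a Mahalanobis metric, and cluster the transformed data so must-linked points move closer together.

```python
import numpy as np
from scipy.linalg import sqrtm

def metric_from_must_links(X, must, eps=1e-6):
    """X: data matrix; must: list of (i, j) index pairs that should be in the same cluster."""
    diffs = np.array([X[a] - X[b] for (a, b) in must])
    C = diffs.T @ diffs / len(must) + eps * np.eye(X.shape[1])   # must-link covariance
    M = np.linalg.inv(C)                     # Mahalanobis matrix
    T = np.real(sqrtm(M))                    # linear map so that dist_M(x, x') = ||Tx - Tx'||
    return X @ T                             # run K-means on the transformed data
```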

94 Example of Distance Metric Learning
(Figures: 2D data projection using the Euclidean distance metric vs. the learned distance metric; solid lines are must-links, dotted lines are cannot-links.)

95 BoostCluster [Liu, Jin & Jain, 2007]
General framework for semi-supervised clustering: improves any given unsupervised clustering algorithm with pairwise constraints. Key challenges: How to influence an arbitrary clustering algorithm with side information? Encode the constraints into the data representation. How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance.

96 BoostCluster
Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm. (Block diagram components: data, pairwise constraints, kernel matrix, new data representation, clustering algorithm, results, final results.)

97 BoostCluster
Find the best data representation that encodes the unsatisfied pairwise constraints.

98 BoostCluster
Obtain the clustering results given the new data representation.

99 BoostCluster
Update the kernel matrix with the clustering results.

100 BoostCluster
Run the procedure iteratively.

101 BoostCluster
Compute the final clustering result.

102 Summary Clustering data under given pairwise constraints
Must-links vs. cannot-links. Two types of approaches: restricted data partitions (with either soft or hard constraints) and distance metric learning. Question: how to acquire the links/constraints? Manual assignment, or derived from side information (hyperlinks, citations, user logs, etc.), which may be noisy and unreliable.

103 Application: Document Clustering [Basu et al., 2004]
300 docs from three topics (atheism, baseball, space) of 20-newsgroups; 3,251 unique words after stemming and removal of stopwords and rare words. Evaluation metric: Normalized Mutual Information (NMI). KMeans-x-x: different variants of constrained clustering algorithms.
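NMI, the evaluation metric used here, compares a clustering against the true classes; with scikit-learn it is a one-liner (the label arrays below are hypothetical, for illustration only).

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]       # hypothetical ground-truth topics
cluster_ids = [0, 0, 1, 2, 2, 2]       # hypothetical clustering output
print(normalized_mutual_info_score(true_labels, cluster_ids))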

104 Kernel Learning Kernel plays central role in machine learning
Kernel functions can be learned from data Kernel alignment, multiple kernel learning, non-parametric learning, … Kernel learning is suitable for IR Similarity measure is key to IR Kernel learning allows us to identify the optimal similarity measure automatically

105 Transfer Learning Different document categories are correlated
We should be able to borrow information of one class to the training of another class Key question: what to transfer between classes? Representation, model priors, similarity measure …

