Semi-Supervised Learning


1 Semi-Supervised Learning
Jia-Bin Huang, Virginia Tech, ECE-5424G / CS-5824, Spring 2019

2 Administrative HW 4 due April 10

3 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

4 Problem motivation
[Table: four users (Alice, Bob, Carol, Dave) rate five movies (Love at last, Romance forever, Cute puppies of love, Nonstop car chases, Swords vs. karate) on a 0-5 scale, with "?" marking unrated entries; each movie also has features x_1 (romance) and x_2 (action) with values such as 0.9, 1.0, 0.01, and 0.99. Most individual cell values were lost in extraction.]

5 Problem motivation
Now suppose the user parameters are given, e.g. θ^(1) = [0; 5; 0], θ^(2) = [0; 5; 0], θ^(3) = [0; 0; 5], θ^(4) = [0; 0; 5]. From the observed ratings we can then infer the movie features: x^(1) = ?
[Same ratings table as above, with the feature columns x_1 (romance) and x_2 (action) now the unknowns; cell values lost in extraction.]

6 Optimization algorithm
Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , to learn ๐‘ฅ (๐‘–) : min ๐‘ฅ (๐‘–) ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , to learn ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) : min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

7 Collaborative filtering
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š (and movie ratings), Can estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข Can estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š

8 Collaborative filtering optimization objective
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข min ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ๐‘—=1 ๐‘› ๐‘ข ๐‘–:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

9 Collaborative filtering optimization objective
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข min ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ๐‘—=1 ๐‘› ๐‘ข ๐‘–:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2 Minimize ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š and ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข simultaneously ๐ฝ= ๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

10 Collaborative filtering optimization objective
๐ฝ( ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข )= 1 2 ๐‘Ÿ ๐‘–,๐‘— =1 (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— 2 + ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 + ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

11 Collaborative filtering algorithm
Initialize ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข to small random values Minimize ๐ฝ( ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ) using gradient descent (or an advanced optimization algorithm). For every ๐‘—= 1โ‹ฏ ๐‘› ๐‘ข , ๐‘–=1, โ‹ฏ, ๐‘› ๐‘š : ๐‘ฅ ๐‘˜ ๐‘— โ‰” ๐‘ฅ ๐‘˜ ๐‘— โˆ’๐›ผ ๐‘—:๐‘Ÿ ๐‘–,๐‘— =1 ( ๐œƒ ๐‘— โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ) ๐œƒ ๐‘˜ ๐‘– +๐œ† ๐‘ฅ ๐‘˜ (๐‘–) ๐œƒ ๐‘˜ ๐‘— โ‰” ๐œƒ ๐‘˜ ๐‘— โˆ’๐›ผ ๐‘–:๐‘Ÿ ๐‘–,๐‘— =1 ( ๐œƒ ๐‘— โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ) ๐‘ฅ ๐‘˜ ๐‘– +๐œ† ๐œƒ ๐‘˜ (๐‘—) For a user with parameter ๐œƒ and movie with (learned) feature ๐‘ฅ, predict a star rating of ๐œƒ โŠค ๐‘ฅ

12 Collaborative filtering
[Ratings table as before: five movies, four users, with "?" marking the ratings to predict; cell values lost in extraction.]

13 Collaborative filtering
Predicted ratings: stack the movie features as rows of X = [(x^(1))^⊤; (x^(2))^⊤; ⋮; (x^(n_m))^⊤] and the user parameters as rows of Θ = [(θ^(1))^⊤; (θ^(2))^⊤; ⋮; (θ^(n_u))^⊤]. The full prediction matrix is then Y = X Θ^⊤. This is low-rank matrix factorization.
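As a quick sanity check on the shapes, a sketch (sizes and names are assumptions for illustration) of forming Y = X Θ^⊤ and confirming the result is low-rank:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, n_users, n_features = 5, 4, 2
X = rng.normal(size=(n_movies, n_features))      # row i is (x^(i))^T
Theta = rng.normal(size=(n_users, n_features))   # row j is (theta^(j))^T
Y_pred = X @ Theta.T                             # Y_pred[i, j] = (theta^(j))^T x^(i)
print(Y_pred.shape)                              # one prediction per (movie, user) pair
```

Because Y_pred is a product of an (n_movies × n_features) and an (n_features × n_users) matrix, its rank is at most n_features, hence "low-rank".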

14 Finding related movies/products
For each product i, we learn a feature vector x^(i) ∈ R^n (x_1: romance, x_2: action, x_3: comedy, ...). How do we find movies j related to movie i? If ‖x^(i) − x^(j)‖ is small, movies i and j are "similar".
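A small sketch of this nearest-neighbor lookup; the feature values below are invented to mimic the romance/action example:

```python
import numpy as np

def most_similar(X, i, k=2):
    """Return indices of the k movies whose feature vectors are closest to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    d[i] = np.inf                          # exclude the movie itself
    return np.argsort(d)[:k]

# Hypothetical learned features: columns = (romance, action)
X = np.array([[0.9, 0.0],    # Love at last
              [1.0, 0.01],   # Romance forever
              [0.99, 0.0],   # Cute puppies of love
              [0.1, 1.0],    # Nonstop car chases
              [0.0, 0.9]])   # Swords vs. karate
print(most_similar(X, 0))    # the two other romance-heavy movies
```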

15 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

16 Users who have not rated any movies
[Ratings table as before, extended with a fifth user, Eve, who has not rated any movie.]
In the joint objective (1/2) Σ_{(i,j): r(i,j)=1} ((θ^(j))^⊤ x^(i) − y^(i,j))² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (θ_k^(j))² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))², no data term involves Eve, so minimization drives θ^(5) = [0; 0].

17 Users who have not rated any movies
(Same table and objective as the previous slide: with θ^(5) = [0; 0], every predicted rating (θ^(5))^⊤ x^(i) for Eve is 0, which is unhelpful.)

18 Mean normalization
Subtract each movie's mean rating μ_i from its observed ratings, then learn θ^(j), x^(i) on the normalized data. For user j, on movie i predict: (θ^(j))^⊤ x^(i) + μ_i. For user 5 (Eve), θ^(5) = [0; 0], so the prediction falls back to the movie's mean: (θ^(5))^⊤ x^(i) + μ_i = μ_i.

19 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

20 Review: Supervised Learning
K-nearest neighbors / Linear regression / Naïve Bayes / Logistic regression / Support vector machines / Neural networks

21 Review: Unsupervised Learning
Clustering (K-means) / Expectation maximization / Dimensionality reduction / Anomaly detection / Recommendation systems

22 Advanced Topics
Semi-supervised learning / Probabilistic graphical models / Generative models / Sequence prediction models / Deep reinforcement learning

23 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

24 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

25 Classic Paradigm Insufficient Nowadays
Modern applications generate massive amounts of raw data, of which only a tiny fraction can be annotated by human experts: protein sequences, billions of webpages, images.

26 Semi-supervised Learning

27 Active Learning

28 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

29 Semi-supervised Learning Problem Formulation
Labeled data ๐‘† ๐‘™ = ๐‘ฅ 1 , ๐‘ฆ 1 , ๐‘ฅ 2 , ๐‘ฆ 2 , โ‹ฏ, ๐‘ฅ ๐‘š ๐‘™ , ๐‘ฆ ๐‘š ๐‘™ Unlabeled data ๐‘† ๐‘ข = ๐‘ฅ 1 , ๐‘ฆ 1 , ๐‘ฅ 2 , ๐‘ฆ 2 , โ‹ฏ, ๐‘ฅ ๐‘š ๐‘ข , ๐‘ฆ ๐‘š ๐‘ข Goal: Learn a hypothesis โ„Ž ๐œƒ (e.g., a classifier) that has small error

30 Combining labeled and unlabeled data - Classical methods
Transductive SVM [Joachims '99] / Co-training [Blum and Mitchell '98] / Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]

31 Transductive SVM
The separator goes through low-density regions of the space (large margin).

32 SVM vs. Transductive SVM
SVM inputs: x_l^(i), y_l^(i)
min_θ Σ_{j=1}^{n} θ_j²  s.t. y_l^(i) θ^⊤ x_l^(i) ≥ 1
Transductive SVM inputs: x_l^(i), y_l^(i), x_u^(i) (the unlabeled labels y_u^(i) are optimization variables)
min_{θ, y_u} Σ_{j=1}^{n} θ_j²  s.t. y_l^(i) θ^⊤ x_l^(i) ≥ 1,  y_u^(i) θ^⊤ x_u^(i) ≥ 1,  y_u^(i) ∈ {−1, 1}

33 Transductive SVMs
First maximize the margin over the labeled points. Use the resulting separator to assign initial labels to the unlabeled points. Then try flipping labels of unlabeled points to see whether doing so increases the margin.
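A rough sketch of those three steps in NumPy. To stay self-contained it replaces the SVM solver with plain hinge-loss subgradient descent, and instead of testing individual flips it simply relabels all unlabeled points after each refit; every name and the toy data are invented for illustration:

```python
import numpy as np

def fit_hinge(X, y, lam=0.01, alpha=0.1, iters=200):
    """Linear classifier by subgradient descent on the hinge loss (SVM stand-in)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)
        viol = margins < 1                                # points inside the margin
        grad = -(y[viol, None] * X[viol]).sum(axis=0) + lam * theta
        theta -= alpha * grad / len(y)
    return theta

def transductive_fit(Xl, yl, Xu, rounds=5):
    theta = fit_hinge(Xl, yl)              # 1. maximize margin on labeled points
    yu = np.sign(Xu @ theta)               # 2. initial labels for unlabeled points
    yu[yu == 0] = 1
    for _ in range(rounds):                # 3. refit on everything, then relabel
        theta = fit_hinge(np.vstack([Xl, Xu]), np.concatenate([yl, yu]))
        yu = np.sign(Xu @ theta)
        yu[yu == 0] = 1
    return theta, yu

# Two labeled points and four unlabeled points in two separable clusters
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]])
yl = np.array([1.0, -1.0])
Xu = np.array([[3.0, 1.0], [2.5, -0.5], [-3.0, 1.0], [-2.5, -0.5]])
theta, yu = transductive_fit(Xl, yl, Xu)
```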

34 Deep Semi-supervised Learning

35 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

36 Stochastic Perturbations / Π-Model
Realistic perturbations x → x̃ of data points x ∈ D_UL should not significantly change the output of h_θ(x).
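The Π-model objective can be sketched in a few lines: make two stochastic perturbations of the same unlabeled input and penalize any disagreement between the model's outputs. The toy linear-softmax "network" and Gaussian-noise perturbations below are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(W, x, rng, noise=0.1):
    """Pi-model style loss: ||h(x + eps1) - h(x + eps2)||^2, no label needed."""
    p1 = softmax((x + rng.normal(scale=noise, size=x.shape)) @ W)
    p2 = softmax((x + rng.normal(scale=noise, size=x.shape)) @ W)
    return float(np.sum((p1 - p2) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # toy linear "network": 3 features -> 2 classes
x = rng.normal(size=(4, 3))        # a batch of unlabeled points
loss = consistency_loss(W, x, rng)
```

In a real Π-model this term is added to the supervised loss and the perturbations are realistic data augmentations (crops, flips, dropout) rather than Gaussian noise.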

37 Temporal Ensembling

38 Mean Teacher

39 Virtual Adversarial Training

40 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

41 EntMin
Encourages more confident predictions on unlabeled data.
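In code, entropy minimization is just an extra loss term on the unlabeled predictions; a minimal sketch (names invented for illustration):

```python
import numpy as np

def entropy_loss(p, eps=1e-12):
    """Mean prediction entropy; minimizing it pushes the model toward
    confident (near one-hot) outputs on unlabeled data."""
    return float(-np.mean(np.sum(p * np.log(p + eps), axis=-1)))

confident = np.array([[0.99, 0.01]])   # near one-hot prediction: low entropy
uncertain = np.array([[0.5, 0.5]])     # maximally unsure: high entropy
print(entropy_loss(confident), entropy_loss(uncertain))
```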

42 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

43 Comparison

44 Varying number of labels

45 Class mismatch between labeled and unlabeled datasets hurts performance

46 Lessons
Standardized architecture + equal budget for tuning hyperparameters. Unlabeled data from a different class distribution is not that useful. Most methods don't work well in the very low labeled-data regime. Transferring a pre-trained ImageNet model produces lower error rates. These conclusions are based on small datasets, though.

