Semi-Supervised Learning


1 Semi-Supervised Learning
Jia-Bin Huang, Virginia Tech, ECE-5424G / CS-5824, Spring 2019

2 Administrative HW 4 due April 10

3 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

4 Problem motivation
[Table: four users (Alice, Bob, Carol, Dave) rate five movies (Love at last, Romance forever, Cute puppies of love, Nonstop car chases, Swords vs. karate) on a 0-5 scale, with "?" marking unrated entries; each movie also has features x_1 (romance) and x_2 (action) with values such as 0.9, 1.0, 0.01, and 0.99. Most individual cell values were lost in extraction.]

5 Problem motivation
Now suppose the user parameters are given, e.g. θ^(1) = [0; 5; 0], θ^(2) = [0; 5; 0], θ^(3) = [0; 0; 5], θ^(4) = [0; 0; 5]. From the observed ratings we can then infer the movie features: x^(1) = ?
[Same ratings table as above, with the feature columns x_1 (romance) and x_2 (action) now the unknowns; cell values lost in extraction.]

6 Optimization algorithm
Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , to learn ๐‘ฅ (๐‘–) : min ๐‘ฅ (๐‘–) ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , to learn ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) : min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

7 Collaborative filtering
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š (and movie ratings), Can estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข Can estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š

8 Collaborative filtering optimization objective
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข min ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ๐‘—=1 ๐‘› ๐‘ข ๐‘–:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

9 Collaborative filtering optimization objective
Given ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , estimate ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข min ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ๐‘—=1 ๐‘› ๐‘ข ๐‘–:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 Given ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข , estimate ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š min ๐‘ฅ (1) , ๐‘ฅ (2) , โ‹ฏ, ๐‘ฅ ( ๐‘› ๐‘š ) ๐‘–=1 ๐‘› ๐‘š ๐‘—:๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2 Minimize ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š and ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข simultaneously ๐ฝ= ๐‘Ÿ ๐‘–,๐‘— = (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

10 Collaborative filtering optimization objective
๐ฝ( ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข )= 1 2 ๐‘Ÿ ๐‘–,๐‘— =1 (๐œƒ ๐‘— ) โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— 2 + ๐œ† 2 ๐‘—=1 ๐‘› ๐‘ข ๐‘˜=1 ๐‘› ๐œƒ ๐‘˜ ๐‘— 2 + ๐œ† 2 ๐‘–=1 ๐‘› ๐‘š ๐‘˜=1 ๐‘› ๐‘ฅ ๐‘˜ (๐‘–) 2

11 Collaborative filtering algorithm
Initialize ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข to small random values Minimize ๐ฝ( ๐‘ฅ 1 , ๐‘ฅ 2 , โ‹ฏ, ๐‘ฅ ๐‘› ๐‘š , ๐œƒ 1 , ๐œƒ 2 , โ‹ฏ, ๐œƒ ๐‘› ๐‘ข ) using gradient descent (or an advanced optimization algorithm). For every ๐‘—= 1โ‹ฏ ๐‘› ๐‘ข , ๐‘–=1, โ‹ฏ, ๐‘› ๐‘š : ๐‘ฅ ๐‘˜ ๐‘— โ‰” ๐‘ฅ ๐‘˜ ๐‘— โˆ’๐›ผ ๐‘—:๐‘Ÿ ๐‘–,๐‘— =1 ( ๐œƒ ๐‘— โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ) ๐œƒ ๐‘˜ ๐‘– +๐œ† ๐‘ฅ ๐‘˜ (๐‘–) ๐œƒ ๐‘˜ ๐‘— โ‰” ๐œƒ ๐‘˜ ๐‘— โˆ’๐›ผ ๐‘–:๐‘Ÿ ๐‘–,๐‘— =1 ( ๐œƒ ๐‘— โŠค ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘–,๐‘— ) ๐‘ฅ ๐‘˜ ๐‘– +๐œ† ๐œƒ ๐‘˜ (๐‘—) For a user with parameter ๐œƒ and movie with (learned) feature ๐‘ฅ, predict a star rating of ๐œƒ โŠค ๐‘ฅ

12 Collaborative filtering
[Ratings table as before: five movies, four users, with "?" marking the ratings to predict; cell values lost in extraction.]

13 Collaborative filtering
Predicted ratings: stack the movie features as rows of X = [(x^(1))^⊤; (x^(2))^⊤; ⋮; (x^(n_m))^⊤] and the user parameters as rows of Θ = [(θ^(1))^⊤; (θ^(2))^⊤; ⋮; (θ^(n_u))^⊤]. The full prediction matrix is then Y = X Θ^⊤. This is low-rank matrix factorization.
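As a quick sanity check on the shapes, a sketch (sizes and names are assumptions for illustration) of forming Y = X Θ^⊤ and confirming the result is low-rank:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, n_users, n_features = 5, 4, 2
X = rng.normal(size=(n_movies, n_features))      # row i is (x^(i))^T
Theta = rng.normal(size=(n_users, n_features))   # row j is (theta^(j))^T
Y_pred = X @ Theta.T                             # Y_pred[i, j] = (theta^(j))^T x^(i)
print(Y_pred.shape)                              # one prediction per (movie, user) pair
```

Because Y_pred is a product of an (n_movies × n_features) and an (n_features × n_users) matrix, its rank is at most n_features, hence "low-rank".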

14 Finding related movies/products
For each product i, we learn a feature vector x^(i) ∈ R^n (x_1: romance, x_2: action, x_3: comedy, ...). How do we find movies j related to movie i? If ‖x^(i) − x^(j)‖ is small, movies i and j are "similar".
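A small sketch of this nearest-neighbor lookup; the feature values below are invented to mimic the romance/action example:

```python
import numpy as np

def most_similar(X, i, k=2):
    """Return indices of the k movies whose feature vectors are closest to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    d[i] = np.inf                          # exclude the movie itself
    return np.argsort(d)[:k]

# Hypothetical learned features: columns = (romance, action)
X = np.array([[0.9, 0.0],    # Love at last
              [1.0, 0.01],   # Romance forever
              [0.99, 0.0],   # Cute puppies of love
              [0.1, 1.0],    # Nonstop car chases
              [0.0, 0.9]])   # Swords vs. karate
print(most_similar(X, 0))    # the two other romance-heavy movies
```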

15 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

16 Users who have not rated any movies
[Ratings table as before, extended with a fifth user, Eve, who has not rated any movie.]
In the joint objective (1/2) Σ_{(i,j): r(i,j)=1} ((θ^(j))^⊤ x^(i) − y^(i,j))² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (θ_k^(j))² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))², no data term involves Eve, so minimization drives θ^(5) = [0; 0].

17 Users who have not rated any movies
(Same table and objective as the previous slide: with θ^(5) = [0; 0], every predicted rating (θ^(5))^⊤ x^(i) for Eve is 0, which is unhelpful.)

18 Mean normalization
Subtract each movie's mean rating μ_i from its observed ratings, then learn θ^(j), x^(i) on the normalized data. For user j, on movie i predict: (θ^(j))^⊤ x^(i) + μ_i. For user 5 (Eve), θ^(5) = [0; 0], so the prediction falls back to the movie's mean: (θ^(5))^⊤ x^(i) + μ_i = μ_i.

19 Recommender Systems
Motivation / Problem formulation / Content-based recommendations / Collaborative filtering / Mean normalization

20 Review: Supervised Learning
K-nearest neighbors / Linear regression / Naïve Bayes / Logistic regression / Support vector machines / Neural networks

21 Review: Unsupervised Learning
Clustering (K-means) / Expectation maximization / Dimensionality reduction / Anomaly detection / Recommendation systems

22 Advanced Topics
Semi-supervised learning / Probabilistic graphical models / Generative models / Sequence prediction models / Deep reinforcement learning

23 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

24 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

25 Classic Paradigm Insufficient Nowadays
Modern applications generate massive amounts of raw data, of which only a tiny fraction can be annotated by human experts: protein sequences, billions of webpages, images.

26 Semi-supervised Learning

27 Active Learning

28 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

29 Semi-supervised Learning Problem Formulation
Labeled data ๐‘† ๐‘™ = ๐‘ฅ 1 , ๐‘ฆ 1 , ๐‘ฅ 2 , ๐‘ฆ 2 , โ‹ฏ, ๐‘ฅ ๐‘š ๐‘™ , ๐‘ฆ ๐‘š ๐‘™ Unlabeled data ๐‘† ๐‘ข = ๐‘ฅ 1 , ๐‘ฆ 1 , ๐‘ฅ 2 , ๐‘ฆ 2 , โ‹ฏ, ๐‘ฅ ๐‘š ๐‘ข , ๐‘ฆ ๐‘š ๐‘ข Goal: Learn a hypothesis โ„Ž ๐œƒ (e.g., a classifier) that has small error

30 Combining labeled and unlabeled data - Classical methods
Transductive SVM [Joachims '99] / Co-training [Blum and Mitchell '98] / Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]

31 Transductive SVM
The separator goes through low-density regions of the space (large margin).

32 SVM vs. Transductive SVM
SVM inputs: x_l^(i), y_l^(i)
min_θ Σ_{j=1}^{n} θ_j²  s.t. y_l^(i) θ^⊤ x_l^(i) ≥ 1
Transductive SVM inputs: x_l^(i), y_l^(i), x_u^(i) (the unlabeled labels y_u^(i) are optimization variables)
min_{θ, y_u} Σ_{j=1}^{n} θ_j²  s.t. y_l^(i) θ^⊤ x_l^(i) ≥ 1,  y_u^(i) θ^⊤ x_u^(i) ≥ 1,  y_u^(i) ∈ {−1, 1}

33 Transductive SVMs
First maximize the margin over the labeled points. Use the resulting separator to assign initial labels to the unlabeled points. Then try flipping labels of unlabeled points to see whether doing so increases the margin.
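A rough sketch of those three steps in NumPy. To stay self-contained it replaces the SVM solver with plain hinge-loss subgradient descent, and instead of testing individual flips it simply relabels all unlabeled points after each refit; every name and the toy data are invented for illustration:

```python
import numpy as np

def fit_hinge(X, y, lam=0.01, alpha=0.1, iters=200):
    """Linear classifier by subgradient descent on the hinge loss (SVM stand-in)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)
        viol = margins < 1                                # points inside the margin
        grad = -(y[viol, None] * X[viol]).sum(axis=0) + lam * theta
        theta -= alpha * grad / len(y)
    return theta

def transductive_fit(Xl, yl, Xu, rounds=5):
    theta = fit_hinge(Xl, yl)              # 1. maximize margin on labeled points
    yu = np.sign(Xu @ theta)               # 2. initial labels for unlabeled points
    yu[yu == 0] = 1
    for _ in range(rounds):                # 3. refit on everything, then relabel
        theta = fit_hinge(np.vstack([Xl, Xu]), np.concatenate([yl, yu]))
        yu = np.sign(Xu @ theta)
        yu[yu == 0] = 1
    return theta, yu

# Two labeled points and four unlabeled points in two separable clusters
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]])
yl = np.array([1.0, -1.0])
Xu = np.array([[3.0, 1.0], [2.5, -0.5], [-3.0, 1.0], [-2.5, -0.5]])
theta, yu = transductive_fit(Xl, yl, Xu)
```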

34 Deep Semi-supervised Learning

35 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

36 Stochastic Perturbations / Π-Model
Realistic perturbations x → x̃ of data points x ∈ D_UL should not significantly change the output of h_θ(x).
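The Π-model objective can be sketched in a few lines: make two stochastic perturbations of the same unlabeled input and penalize any disagreement between the model's outputs. The toy linear-softmax "network" and Gaussian-noise perturbations below are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(W, x, rng, noise=0.1):
    """Pi-model style loss: ||h(x + eps1) - h(x + eps2)||^2, no label needed."""
    p1 = softmax((x + rng.normal(scale=noise, size=x.shape)) @ W)
    p2 = softmax((x + rng.normal(scale=noise, size=x.shape)) @ W)
    return float(np.sum((p1 - p2) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))        # toy linear "network": 3 features -> 2 classes
x = rng.normal(size=(4, 3))        # a batch of unlabeled points
loss = consistency_loss(W, x, rng)
```

In a real Π-model this term is added to the supervised loss and the perturbations are realistic data augmentations (crops, flips, dropout) rather than Gaussian noise.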

37 Temporal Ensembling

38 Mean Teacher

39 Virtual Adversarial Training

40 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

41 EntMin
Encourages more confident predictions on unlabeled data.
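In code, entropy minimization is just an extra loss term on the unlabeled predictions; a minimal sketch (names invented for illustration):

```python
import numpy as np

def entropy_loss(p, eps=1e-12):
    """Mean prediction entropy; minimizing it pushes the model toward
    confident (near one-hot) outputs on unlabeled data."""
    return float(-np.mean(np.sum(p * np.log(p + eps), axis=-1)))

confident = np.array([[0.99, 0.01]])   # near one-hot prediction: low entropy
uncertain = np.array([[0.5, 0.5]])     # maximally unsure: high entropy
print(entropy_loss(confident), entropy_loss(uncertain))
```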

42 Semi-supervised Learning
Motivation / Problem formulation / Consistency regularization / Entropy-based method / Pseudo-labeling

43 Comparison

44 Varying number of labels

45 Class mismatch between labeled and unlabeled datasets hurts performance

46 Lessons
Standardized architecture + equal budget for tuning hyperparameters. Unlabeled data from a different class distribution is not that useful. Most methods don't work well in the very low labeled-data regime. Transferring a pre-trained ImageNet model produces lower error rates. These conclusions are based on small datasets, though.

