Semi-Supervised Learning


Semi-Supervised Learning. Jia-Bin Huang, Virginia Tech. ECE-5424G / CS-5824, Spring 2019.

Administrative HW 4 due April 10

Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization

Problem motivation. Predicting movie ratings from hand-designed features $x_1$ (romance) and $x_2$ (action):

Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action)
Love at last         | 5         | 5       | 0         | 0        | 0.9             | 0
Romance forever      | 5         | ?       | ?         | 0        | 1.0             | 0.01
Cute puppies of love | ?         | 4       | 0         | ?        | 0.99            | 0
Nonstop car chases   | 0         | 0       | 5         | 4        | 0.1             | 1.0
Swords vs. karate    | 0         | 0       | 5         | ?        | 0               | 0.9

Problem motivation (collaborative filtering). Now suppose the features are unknown (the $x_1$, $x_2$ columns above are all "?"), but each user tells us their preferences: $\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}$, $\theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}$, $\theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}$, $\theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}$. From the observed ratings we can infer a feature vector for each movie, e.g., $x^{(1)} = \begin{bmatrix} ? \\ ? \\ ? \end{bmatrix}$ for "Love at last".

Optimization algorithm. Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn $x^{(i)}$:
$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn all of $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$:
$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering. Given $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$. Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, we can estimate $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$.

Collaborative filtering optimization objective. Given $x^{(1)}, \dots, x^{(n_m)}$, estimate $\theta^{(1)}, \dots, \theta^{(n_u)}$:
$$\min_{\theta^{(1)}, \dots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$
Given $\theta^{(1)}, \dots, \theta^{(n_u)}$, estimate $x^{(1)}, \dots, x^{(n_m)}$:
$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering optimization objective. Instead of alternating between the two problems above, minimize over $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously:
$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering optimization objective.
$$J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
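A minimal NumPy sketch of this joint objective, assuming a ratings matrix Y and an indicator matrix R with R[i, j] = 1 where user j rated movie i (all names are illustrative, not from the slides):

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Joint objective J over all movie features X and user parameters Theta.

    X     : (n_m, n) rows are x^(i)
    Theta : (n_u, n) rows are theta^(j)
    Y     : (n_m, n_u) observed ratings y^(i,j)
    R     : (n_m, n_u) indicator, R[i, j] = 1 iff user j rated movie i
    lam   : regularization strength lambda
    """
    E = (X @ Theta.T - Y) * R  # prediction errors, zeroed where no rating exists
    return (0.5 * np.sum(E ** 2)
            + 0.5 * lam * np.sum(Theta ** 2)
            + 0.5 * lam * np.sum(X ** 2))
```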

Collaborative filtering algorithm.
1. Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values.
2. Minimize $J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $i = 1, \dots, n_m$, $j = 1, \dots, n_u$, and $k = 1, \dots, n$:
$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$
3. For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$.
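A hedged NumPy sketch of these simultaneous updates on a toy problem (the helper names and the random toy data are illustrative only; a real system would use an off-the-shelf optimizer):

```python
import numpy as np

def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One full-batch gradient-descent step on J for all x^(i) and theta^(j)."""
    E = (X @ Theta.T - Y) * R                             # errors on rated entries only
    X_new = X - alpha * (E @ Theta + lam * X)             # update every x_k^(i)
    Theta_new = Theta - alpha * (E.T @ X + lam * Theta)   # update every theta_k^(j)
    return X_new, Theta_new

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2                                     # toy sizes: movies, users, features
Y = rng.integers(0, 6, (n_m, n_u)).astype(float)          # fake ratings, for illustration only
R = (rng.random((n_m, n_u)) > 0.3).astype(float)          # which ratings are observed
X = 0.1 * rng.standard_normal((n_m, n))                   # small random initialization
Theta = 0.1 * rng.standard_normal((n_u, n))
for _ in range(500):
    X, Theta = cofi_gradient_step(X, Theta, Y, R, lam=0.1, alpha=0.01)
prediction = X @ Theta.T                                  # predicted star rating theta^T x for every pair
```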

Collaborative filtering. (Ratings table as above, showing only Alice, Bob, Carol, and Dave's ratings of the five movies; the feature values are no longer given and must be learned.)

Collaborative filtering as low-rank matrix factorization. Stack the movie features and user parameters as rows:
$$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(n_m)})^\top \end{bmatrix}, \qquad \Theta = \begin{bmatrix} (\theta^{(1)})^\top \\ (\theta^{(2)})^\top \\ \vdots \\ (\theta^{(n_u)})^\top \end{bmatrix}$$
The matrix of predicted ratings is $Y = X \Theta^\top$; this is called low-rank matrix factorization.

Finding related movies/products. For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$ ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...). How do we find movies $j$ related to movie $i$? If $\| x^{(i)} - x^{(j)} \|$ is small, movies $i$ and $j$ are "similar".
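A small sketch of that similarity search, assuming X holds the learned feature vectors as rows (an illustrative helper, not from the slides):

```python
import numpy as np

def most_related(X, i, k=5):
    """Indices of the k movies whose feature vectors are closest to x^(i)."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                          # exclude movie i itself
    return np.argsort(dists)[:k]
```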

Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization

Users who have not rated any movies. Add a fifth user, Eve, who has not rated any movie. In the objective
$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
no term has $r(i,5)=1$, so only the regularizer acts on $\theta^{(5)}$ and we end up with $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, i.e., every predicted rating $(\theta^{(5)})^\top x^{(i)} = 0$.


Mean normalization. Subtract each movie's mean rating $\mu_i$ from its observed ratings, then learn $\theta^{(j)}, x^{(i)}$ on the normalized data. For user $j$ on movie $i$, predict $(\theta^{(j)})^\top x^{(i)} + \mu_i$. For user 5 (Eve), $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so the prediction $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$: a brand-new user is predicted each movie's average rating.
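A minimal sketch of mean normalization under the same toy setup (names are illustrative); it makes the fallback prediction for a user with no ratings equal to each movie's mean rating:

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = np.maximum(R.sum(axis=1), 1)        # avoid dividing by zero
    mu = (Y * R).sum(axis=1) / counts            # per-movie mean rating mu_i
    Y_norm = (Y - mu[:, None]) * R               # normalized ratings; unrated cells stay 0
    return Y_norm, mu

# Train X, Theta on Y_norm, then predict theta^(j)^T x^(i) + mu_i.
# For Eve (theta^(5) = 0) every prediction reduces to mu_i.
```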

Recommender Systems: Motivation, Problem formulation, Content-based recommendations, Collaborative filtering, Mean normalization

Review: Supervised Learning. K nearest neighbor, Linear Regression, Naïve Bayes, Logistic Regression, Support Vector Machines, Neural Networks.

Review: Unsupervised Learning. Clustering (K-Means), Expectation Maximization, Dimensionality Reduction, Anomaly Detection, Recommendation Systems.

Advanced Topics. Semi-supervised learning, probabilistic graphical models, generative models, sequence prediction models, deep reinforcement learning.

Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling


The Classic Paradigm is Insufficient Nowadays. Modern applications generate massive amounts of raw data, but only a tiny fraction can be annotated by human experts: protein sequences, billions of webpages, images.

Semi-supervised Learning

Active Learning

Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling

Semi-supervised Learning: Problem Formulation. Labeled data $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m_l)}, y^{(m_l)}) \}$. Unlabeled data $S_u = \{ x^{(1)}, x^{(2)}, \dots, x^{(m_u)} \}$ (no labels). Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error.

Combining labeled and unlabeled data, classical methods: Transductive SVM [Joachims '99], Co-training [Blum and Mitchell '98], Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03].

Transductive SVM. The separator should go through low-density regions of the space, i.e., keep a large margin with respect to both the labeled and the unlabeled points.

SVM vs. Transductive SVM.
SVM. Inputs: labeled data $(x_l^{(i)}, y_l^{(i)})$. Solve
$$\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)}\, \theta^\top x_l^{(i)} \ge 1$$
Transductive SVM. Inputs: labeled data $(x_l^{(i)}, y_l^{(i)})$ and unlabeled data $x_u^{(i)}$; the unknown labels $y_u^{(i)}$ become optimization variables. Solve
$$\min_{\theta,\, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)}\, \theta^\top x_l^{(i)} \ge 1, \quad y_u^{(i)}\, \theta^\top x_u^{(i)} \ge 1, \quad y_u^{(i)} \in \{-1, 1\}$$

Transductive SVMs. First maximize the margin over the labeled points. Use this separator to assign initial labels to the unlabeled points. Then try flipping the labels of unlabeled points to see if doing so can increase the margin.
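A hedged scikit-learn sketch of that procedure (a simplified self-training-style heuristic in the spirit of the slide, not the actual optimization from Joachims '99; all names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

def transductive_svm(X_l, y_l, X_u, n_rounds=10, C=1.0):
    """Label unlabeled points with a labeled-only SVM, then retrain on
    labeled + pseudo-labeled data, letting unlabeled labels flip between rounds."""
    clf = LinearSVC(C=C).fit(X_l, y_l)   # 1) maximize margin over labeled points only
    y_u = clf.predict(X_u)               # 2) initial labels for unlabeled points
    for _ in range(n_rounds):            # 3) keep the labeling the margin prefers
        clf = LinearSVC(C=C).fit(np.vstack([X_l, X_u]),
                                 np.concatenate([y_l, y_u]))
        y_new = clf.predict(X_u)
        if np.array_equal(y_new, y_u):   # labels stopped flipping
            break
        y_u = y_new
    return clf, y_u
```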

Deep Semi-supervised Learning

Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling

Stochastic Perturbations / Π-Model. Realistic perturbations $x \to \tilde{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$.
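A minimal PyTorch-style sketch of that consistency loss, assuming `model` returns class logits and `augment` applies a random realistic perturbation (both are placeholders, not from the slides):

```python
import torch
import torch.nn.functional as F

def pi_model_consistency_loss(model, x_unlabeled, augment, weight=1.0):
    """Pi-model: two stochastic perturbations of the same unlabeled input
    should yield (nearly) the same predicted class distribution."""
    p1 = F.softmax(model(augment(x_unlabeled)), dim=1)
    p2 = F.softmax(model(augment(x_unlabeled)), dim=1)
    return weight * F.mse_loss(p1, p2)   # add to the supervised loss on labeled data
```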

Temporal Ensembling

Mean Teacher

Virtual Adversarial Training

Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling

EntMin (entropy minimization): encourages more confident predictions on unlabeled data.
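A minimal PyTorch-style sketch of the entropy-minimization term, added to the supervised loss with some weight (names are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits_unlabeled, weight=1.0):
    """Penalize high-entropy (uncertain) predictions on unlabeled examples."""
    log_p = F.log_softmax(logits_unlabeled, dim=1)
    p = log_p.exp()
    entropy = -(p * log_p).sum(dim=1)    # predictive entropy per unlabeled example
    return weight * entropy.mean()
```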

Semi-supervised Learning: Motivation, Problem formulation, Consistency regularization, Entropy-based method, Pseudo-labeling
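Pseudo-labeling appears in the outline but is not expanded on a slide of its own; as a reference point, here is a hedged sketch of the usual recipe (the confidence threshold and loss weight are illustrative choices, not from the slides):

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, x_unlabeled, threshold=0.95, weight=1.0):
    """Treat the model's confident predictions on unlabeled data as if they
    were true labels, and train on them with a standard classification loss."""
    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=1)
        confidence, pseudo_y = probs.max(dim=1)
        mask = (confidence >= threshold).float()   # keep only confident predictions
    per_example = F.cross_entropy(model(x_unlabeled), pseudo_y, reduction="none")
    return weight * (mask * per_example).mean()
```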

Comparison

Varying number of labels

Class mismatch between the labeled and unlabeled datasets hurts performance.

Lessons.
- Use a standardized architecture and an equal budget for tuning hyperparameters, so methods are compared fairly.
- Unlabeled data from a different class distribution is not that useful.
- Most methods don't work well in the very low labeled-data regime.
- Transferring from a model pre-trained on ImageNet produces a lower error rate.
- These conclusions are based on small datasets, though.