Self-taught Learning: Transfer Learning from Unlabeled Data. Rajat Raina, Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot, Andrew Y. Ng. Stanford University.
The “one learning algorithm” hypothesis. There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities. Example: ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learns to “see” (Roe et al., 1992; Hawkins & Blakeslee, 2004). If we could find this one learning algorithm, we would be done. (Finally!)
Finding a deep learning algorithm. If the brain really is one learning algorithm, it would suffice to just: find a learning algorithm for a single layer, and show that it can build a small number of layers. We evaluate our algorithms against biology (e.g., sparse RBMs for V2: poster yesterday, Lee et al.) and on applications (this talk).
Supervised learning. [Figure: train and test images labeled Cars and Motorcycles.] Supervised learning algorithms may not work well with limited labeled data.
Learning in humans. Your brain has 10^14 synapses (connections). You will live for 10^9 seconds. If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds, or 10^5 bits per second. Human learning is largely unsupervised, and uses readily available unlabeled data. (Geoffrey Hinton, personal communication)
Recent history of machine learning. 20 years ago: Supervised learning. 10 years ago: Semi-supervised learning. 10 years ago: Transfer learning. Next: Self-taught learning? [Figure labels: Cars, Motorcycles, Bus, Tractor, Aircraft, Helicopter, Natural scenes.]
Self-taught learning. We are given labeled examples and unlabeled examples. The unlabeled and labeled data need not share labels y, and need not share a generative distribution. Advantage: Such unlabeled data is often easy to obtain.
A self-taught learning algorithm. Overview: Represent each labeled or unlabeled input x as a sparse linear combination of “basis vectors”, for example x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411.
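A minimal sketch of what "sparse linear combination" means in code. The basis vectors here are random placeholders (hypothetical, not learnt bases), and the dimension and number of bases are arbitrary choices for illustration; only the three indices and weights come from the slide's example.

```python
import random

# Hypothetical toy setup: 500 random basis vectors of dimension 4.
random.seed(0)
DIM = 4
bases = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(500)]

# Sparse activations: only three of the 500 coefficients are non-zero,
# using the indices and weights from the slide's example.
activations = {87: 0.8, 376: 0.3, 411: 0.5}

def reconstruct(activations, bases):
    """Return x = sum_j a_j * b_j over the non-zero activations only."""
    dim = len(bases[0])
    x = [0.0] * dim
    for j, a_j in activations.items():
        for d in range(dim):
            x[d] += a_j * bases[j][d]
    return x

x = reconstruct(activations, bases)
```

Storing only the non-zero coefficients keeps the representation compact: the input is described by three numbers and three indices rather than by all 500 activations.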
A self-taught learning algorithm. Key steps:
1. Learn good bases using unlabeled data.
2. Use these learnt bases to construct “higher-level” features for the labeled data.
3. Apply a standard supervised learning algorithm on these features.
Example: x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411.
Learning the bases: sparse coding. Given only unlabeled data, we find good bases b using sparse coding, minimizing the sum of a reconstruction error term and a sparsity penalty term. [Details: An extra normalization constraint on the bases b is required.] (Efficient algorithms: Lee et al., NIPS 2006)
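A sketch of the objective that the "reconstruction error" and "sparsity penalty" labels refer to, written in the notation of the accompanying paper (Raina et al., ICML 2007). The symbols x_u^{(i)} (the i-th unlabeled input), a^{(i)} (its activations), k (the number of unlabeled examples), and beta (the sparsity weight) are assumptions taken from that paper and do not appear on the slide.

```latex
\min_{b,\,a}\;
\underbrace{\sum_{i=1}^{k} \Big\| x_u^{(i)} - \sum_{j} a_j^{(i)} b_j \Big\|_2^2}_{\text{reconstruction error}}
\;+\;
\underbrace{\beta \sum_{i=1}^{k} \big\| a^{(i)} \big\|_1}_{\text{sparsity penalty}}
\qquad \text{s.t.}\quad \| b_j \|_2 \le 1 \;\; \text{for all } j
```

The norm constraint on each b_j is what keeps the problem well posed: without it, the sparsity penalty could be driven toward zero by scaling the bases up and the activations down.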
Constructing features. Using the learnt bases b, compute features for the examples x_l from the classification task by solving a minimization that combines a reconstruction error term and a sparsity penalty term. The resulting activations are the features, e.g., x_l = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411. Finally, learn a classifier using a standard supervised learning algorithm (e.g., SVM) over these features.
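The feature computation can be sketched as follows. This assumes the L1-penalized least-squares form used in the accompanying paper (Raina et al., ICML 2007), with a hypothetical sparsity weight `beta`, and it uses ISTA (iterative soft-thresholding), a generic solver swapped in for illustration rather than the efficient algorithms of Lee et al., NIPS 2006. The bases and input are toy values, not learnt ones.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal operator of t * |z|)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sparse_features(x, bases, beta=0.1, n_iter=2000):
    """ISTA for: min_a ||x - sum_j a_j * b_j||^2 + beta * ||a||_1."""
    dim, n_bases = len(x), len(bases)
    # Lipschitz bound for the gradient: 2 * ||B||_F^2 >= 2 * sigma_max(B)^2.
    lipschitz = 2.0 * sum(v * v for b in bases for v in b)
    a = [0.0] * n_bases
    for _ in range(n_iter):
        # Residual r = x - sum_j a_j * b_j.
        r = list(x)
        for j, b in enumerate(bases):
            if a[j] != 0.0:
                for d in range(dim):
                    r[d] -= a[j] * b[d]
        # Gradient step on the reconstruction term, then shrinkage.
        a = [
            soft_threshold(
                a[j] + 2.0 * sum(b[d] * r[d] for d in range(dim)) / lipschitz,
                beta / lipschitz,
            )
            for j, b in enumerate(bases)
        ]
    return a

# Hypothetical toy bases: three unit vectors in R^2.
bases = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x_l = [1.2, 1.6]  # equals 2.0 * bases[2]
features = sparse_features(x_l, bases)
```

In this toy case the input lies along the third basis, so the solver is expected to place all of the weight on that one activation and set the other two to zero. The resulting activation vector is what would be passed to the supervised learner in the final step.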
Image classification. [Figure: a large image (platypus from the Caltech101 dataset) and a visualization of its features.]
Image classification (15 labeled images per class). Accuracy by feature type: Baseline 16%; PCA 37%; Sparse coding 47% (36.0% error reduction). Other reported results: Fei-Fei et al., 2004: 16%; Berg et al., 2005: 17%; Holub et al., 2005: 40%; Serre et al., 2005: 35%; Berg et al., 2005: 48%; Zhang et al., 2006: 59%; Lazebnik et al., 2006: 56%.
Character recognition. Domains: digits, handwritten English, English font.
Handwritten English classification (20 labeled images per handwritten character; bases learnt on digits): Raw 54.8%; PCA 54.8%; Sparse coding 58.5% (8.2% error reduction).
English font classification (20 labeled images per font character; bases learnt on handwritten English): Raw 17.9%; PCA 14.5%; Sparse coding 16.6%; Sparse coding + Raw 20.2% (2.8% error reduction).
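The "error reduction" figures are not defined on the slides. The numbers for character recognition are consistent with reading them as the relative drop in error rate against the best baseline, which the following sketch checks. The function name is hypothetical, and this reading is an inference from the reported numbers rather than a stated definition.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percent reduction in error rate when accuracy moves from baseline to new.

    Both arguments are accuracies in percent (0 to 100).
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# Handwritten English: best baseline 54.8% -> sparse coding 58.5%.
handwritten = relative_error_reduction(54.8, 58.5)
# English font: raw 17.9% -> sparse coding + raw 20.2%.
font = relative_error_reduction(17.9, 20.2)
print(round(handwritten, 1), round(font, 1))  # prints: 8.2 2.8
```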
Text classification. Domains: Reuters newswire, webpages, UseNet articles.
Webpage classification (2 labeled documents per class; bases learnt on Reuters newswire): Raw words 62.8%; PCA 63.3%; Sparse coding 64.3% (4.0% error reduction).
UseNet classification (2 labeled documents per class; bases learnt on Reuters newswire): Raw words 61.3%; PCA 60.7%; Sparse coding 63.8% (6.5% error reduction).
Audio classification. (Details: Grosse et al., UAI 2007)
Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker; bases learnt on different dialects): Spectrogram 38.5%; MFCCs 43.8%; Sparse coding 48.7% (8.7% error reduction).
Musical genre classification (5 labels, 18 seconds per genre; bases learnt on different genres, songs): Spectrogram 48.4%; MFCCs 54.0%; Music-specific model 49.3%; Sparse coding 56.6% (5.7% error reduction).
Sparse deep belief networks. New: Sparse RBM, with visible layer v, hidden layer h, and parameters W, b, c. (Details: Lee et al., NIPS 2007. Poster yesterday.)
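A rough sketch of the two ingredients a sparse RBM combines: the usual RBM hidden-unit activation, and a penalty that pushes each hidden unit's average activation toward a small target. The slide does not say which of b and c is the hidden bias; this sketch assumes b holds the hidden biases (the visible biases c are not needed for this computation). The penalty form, the `target` value, and the toy weights are illustrative assumptions, not the exact formulation of Lee et al.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for each hidden unit j; W[i][j] links v_i to h_j."""
    return [
        sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
        for j in range(len(b))
    ]

def sparsity_penalty(batch, W, b, target=0.1):
    """Sum over hidden units of (target - mean activation)^2."""
    n_hidden = len(b)
    mean_act = [0.0] * n_hidden
    for v in batch:
        for j, p in enumerate(hidden_probs(v, W, b)):
            mean_act[j] += p / len(batch)
    return sum((target - m) ** 2 for m in mean_act)

# Hypothetical toy model: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b = [0.0, -1.0]
batch = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
penalty = sparsity_penalty(batch, W, b)
```

During training, a term like this would be added to the RBM objective so that each hidden unit is active for only a small fraction of inputs.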
Sparse deep belief networks. Image classification (Caltech101 dataset): 1-layer sparse DBN 44.5%; 2-layer sparse DBN 46.6% (3.2% error reduction). (Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary. Self-taught learning: Unlabeled data does not share the labels of the classification task. Use unlabeled data to discover features. Use sparse coding to construct an easy-to-classify, “higher-level” representation. [Figure: labeled images of cars and motorcycles, unlabeled images, and an input written as 0.8, 0.3, and 0.5 times three bases.]
Related work. Weston et al., ICML 2006: Make stronger assumptions on the unlabeled data. Ando & Zhang, JMLR 2005: For natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.