Learning Data Representations with “Partial Supervision” Ariadna Quattoni

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Semi-Supervised Learning. "Raw" feature space X, output space Y; core task: learn a function from X to Y. Classical setting: a small labeled dataset plus a large unlabeled dataset. Partial supervision setting: a small labeled dataset plus a large partially labeled dataset.

Semi-Supervised Learning, classical setting: use the unlabeled dataset to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier.

Semi-Supervised Learning, partial supervision setting: use the unlabeled dataset plus the partial supervision to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier.

Why is “learning representations” useful?  Infer the intrinsic dimensionality of the data.  Learn the “relevant” dimensions.  Infer the hidden structure.

Example: Hidden Structure. 20 symbols, 4 topics. Generate a datapoint:  Choose a topic T.  Sample a subset of 3 symbols from T. (Figure: data covariance.)

Example: Hidden Structure.  Number of latent dimensions = 4.  Map each x to the topic that generated it.  Function: z = θ x, where θ is the projection matrix, x the datapoint, and z the latent (topic) representation.
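
As an illustration of this example, here is a small NumPy sketch (the split of the 20 symbols into 4 topics of 5 symbols each is an assumption made for the sketch): it generates datapoints by the process above and shows that a simple topic-indicator projection matrix θ recovers the generating topic.

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics = 20, 4
symbols_per_topic = np.split(np.arange(n_symbols), n_topics)  # assume each topic owns 5 symbols

def sample_datapoint():
    """Choose a topic T, then sample 3 symbols from T (bag-of-symbols vector)."""
    t = rng.integers(n_topics)
    x = np.zeros(n_symbols)
    x[rng.choice(symbols_per_topic[t], size=3, replace=False)] = 1.0
    return x, t

X, topics = zip(*(sample_datapoint() for _ in range(5000)))
X = np.array(X)

# The data covariance shows one block of correlated symbols per topic.
cov = np.cov(X, rowvar=False)

# A projection matrix theta (n_topics x n_symbols) mapping each x to a topic
# indicator: z = theta @ x is largest at the generating topic.
theta = np.zeros((n_topics, n_symbols))
for t, idx in enumerate(symbols_per_topic):
    theta[t, idx] = 1.0

z = X @ theta.T                                 # latent representation of each datapoint
recovered = z.argmax(axis=1)                    # topic assignment from the projection
print((recovered == np.array(topics)).mean())   # 1.0: every point maps to its topic
```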

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Classical Setting: Principal Component Analysis.  Rows of θ act as a ‘basis’: each example is approximated by a linear combination of the basis vectors (topics T1, T2, T3, T4).  An example generated this way has low reconstruction error.

Minimum Error Formulation. Approximate the high-dimensional x with a low-dimensional x' expressed in an orthonormal basis. Error: J = (1/N) Σ_n ||x_n − x'_n||². Solution: the basis is given by the top eigenvectors of the data covariance, and the distortion J equals the sum of the discarded eigenvalues.
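
A minimal NumPy sketch of this minimum-error view (the data here is synthetic and the subspace dimension h is arbitrary): project onto the top h eigenvectors of the data covariance and check that the average distortion equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))  # correlated data
X -= X.mean(axis=0)                                          # center

S = X.T @ X / len(X)                        # data covariance
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

h = 3                                       # dimension of the subspace
U = eigvecs[:, :h]                          # orthonormal basis: top-h eigenvectors

Z = X @ U                                   # low-dimensional representation x'
X_hat = Z @ U.T                             # approximation of x in the original space

distortion = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(np.isclose(distortion, eigvals[h:].sum()))  # True: distortion = discarded eigenvalues
```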

Principal Component Analysis: 2D example (projection and error).  PCA produces uncorrelated variables.  Dimensions are cut according to their variance.  For dimensionality reduction to help, the original variables must be correlated.

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Structure Learning: Partial Supervision Setting [Ando & Zhang, JMLR 2005]. Use the unlabeled dataset plus the partial supervision to create auxiliary tasks.

Partial Supervision Setting  Unlabeled data + partial supervision:  Images with associated natural language captions.  Video sequences with associated speech.  Documents with associated keywords.  How could the partial supervision help?  It is a hint for discovering important features.  Use the partial supervision to define "auxiliary tasks".  Discover feature groupings that are useful for these tasks. Sometimes 'auxiliary tasks' can be defined from unlabeled data alone, e.g. auxiliary tasks for word tagging that predict substructures of the input.

Auxiliary Tasks. Core task: is this a computer vision or a machine learning article? Partial supervision: machine learning papers carry keywords such as "machine learning, dimensionality reduction" or "linear embedding, spectral methods, distance learning"; computer vision papers carry keywords such as "object recognition, shape matching, stereo". Mask the occurrences of the keywords in the documents; an auxiliary task is then, for example, to predict the keyword "object recognition" from the document content.

Auxiliary Tasks

Structure Learning (hypothesis-space view).  Learning with no prior knowledge: the hypothesis is learned from the examples alone.  Learning with prior knowledge: prior knowledge narrows the search for the best hypothesis.  Learning from auxiliary tasks: hypotheses learned for related tasks indicate where the best hypothesis lies.

Learning Good Hypothesis Spaces. Class of linear predictors: f_t(x) = w_t^T x + v_t^T θ x, where θ is an h-by-d matrix of structural parameters shared across tasks, and w_t, v_t are problem-specific parameters. Goal: find the problem-specific parameters and the shared θ that minimize the joint loss, i.e. the sum over tasks of the loss on each training set.

Algorithm Step 1: Train classifiers for auxiliary tasks.

Algorithm Step 2: PCA on the classifier coefficients. Take the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors. This gives a linear subspace of dimension h: a good low-dimensional approximation to the space of coefficients.

Algorithm Step 3: Training on the core task. Project the data, z = θ x, and train the core classifier on z. This is equivalent to training the core task in the original d-dimensional space with the constraint that the parameter vector lies in the learned subspace.
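
A rough sketch of the three steps, with scikit-learn logistic regression standing in for whatever linear classifiers are used for the auxiliary tasks; X_unlabeled, aux_labels, X_labeled, y_labeled and the subspace dimension h are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_structure(X_unlabeled, aux_labels, h):
    """Steps 1-2: train one linear classifier per auxiliary task, then run PCA
    on the matrix of their coefficient vectors to get the shared projection theta."""
    W = np.column_stack([
        LogisticRegression(max_iter=1000).fit(X_unlabeled, y).coef_.ravel()
        for y in aux_labels                      # one binary label vector per auxiliary task
    ])                                           # W is d x m (features x auxiliary tasks)
    Wc = W - W.mean(axis=1, keepdims=True)       # center the coefficient vectors
    cov = Wc @ Wc.T / W.shape[1]                 # covariance of the coefficient vectors
    eigvals, eigvecs = np.linalg.eigh(cov)
    theta = eigvecs[:, ::-1][:, :h].T            # top-h eigenvectors -> h x d projection
    return theta

def train_core_task(X_labeled, y_labeled, theta):
    """Step 3: project the labeled data with theta and train the core classifier there."""
    Z = X_labeled @ theta.T                      # h-dimensional representation
    return LogisticRegression(max_iter=1000).fit(Z, y_labeled)
```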

Example. An object = { letter, letter, letter }; e.g. the object "abC".

Example The same object seen in a different font A b c

Example The same object seen in a different font A B c

Example The same object seen in a different font a bC

Example.  6 letters (topics), 5 fonts per letter (symbols): 30 symbols, hence 30 features.  20 words (objects), e.g. "ABC", "ADE", "BCF", "ABD".  Auxiliary task: recognize the object. (Figure: example renderings such as "acE".)

PCA on the data cannot recover the latent structure. (Figure: covariance of the data.)

PCA on the coefficients can recover the latent structure. (Figure: the matrix W of auxiliary-task parameters, with rows indexed by features, i.e. fonts, and columns by auxiliary tasks; e.g. one column holds the parameters for the object "BCD", and groups of rows correspond to topics, i.e. letters.)

PCA on the coefficients can recover the latent structure: each block of correlated variables corresponds to a latent topic. (Figure: covariance of W, with both axes indexed by features, i.e. fonts.)

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

News domain. Dataset: news images from the Reuters web site. Problem: predicting news topics (e.g. figure skating, ice hockey, Golden Globes, Grammys) from images.

Learning visual representations using images with captions. Example captions:
 The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.
 Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.
 Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8. Lewis was one of 12 miners who died in the Sago Mine.
 Senior Hamas leader Khaled Meshaal (2nd-R) is surrounded by his bodyguards after a news conference in Cairo on February 8.
 Jim Scherr, the US Olympic Committee's chief executive officer, seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet.
 U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.
Auxiliary task: predict "team" from image content.

Learning visual topics. The word 'games' and the word 'Demonstrations' each contain visual topics (e.g. medals, people, pavement). Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.

Experiments Results

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Chunking.  Named entity chunking: "Jane lives in New York and works for Bank of New York." (PER, LOC, ORG).  Syntactic chunking: "But economists in Europe failed to predict that …" (NP, VP, SBAR, PP).  Data points: word occurrences.  Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside.

Example input vector representation. For "… lives in New York …", the input vector x for the word "in" has 1s at the features curr-"in", left-"lives", right-"New"; the vector for "New" has 1s at curr-"New", left-"in", right-"York". These are high-dimensional vectors in which most entries are 0.
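
A small sketch of how such sparse window vectors might be built; the feature naming follows the slide, while the vocabulary construction and the SciPy sparse format are choices made for the sketch.

```python
from scipy.sparse import lil_matrix

def window_features(tokens, feature_index):
    """One row per token; binary indicators for the current, left and right words."""
    X = lil_matrix((len(tokens), len(feature_index)))
    for i, word in enumerate(tokens):
        feats = [f'curr-"{word}"']
        if i > 0:
            feats.append(f'left-"{tokens[i - 1]}"')
        if i + 1 < len(tokens):
            feats.append(f'right-"{tokens[i + 1]}"')
        for f in feats:
            if f in feature_index:              # unknown features are simply dropped
                X[i, feature_index[f]] = 1
    return X.tocsr()

tokens = ["Jane", "lives", "in", "New", "York"]
feature_index = {f: j for j, f in enumerate(
    [f'{pos}-"{w}"' for w in tokens for pos in ("curr", "left", "right")])}
X = window_features(tokens, feature_index)
print(X[2])   # row for "in": 1s at curr-"in", left-"lives", right-"New"
```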

Algorithmic Procedure
1. Create m auxiliary problems.
2. Assign auxiliary labels to the unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.
Predictor: f(x) = w^T x + v^T θ x, i.e. θ x provides additional features.

Example auxiliary problems: Is the current word "New"? Is the current word "day"? Is the current word "IBM"? Is the current word "computer"? Split the feature vector into Φ1 (current-word features) and Φ2 (left- and right-word features): predict Φ1 from Φ2, compute the shared θ, and add θ Φ2 as new features.
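
A sketch of this kind of auxiliary-problem construction: each auxiliary label is "is the current word w?" for some frequent word w, and the auxiliary classifier sees only the context features (X_context, current_words and frequent_words are placeholders).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def auxiliary_coefficients(X_context, current_words, frequent_words):
    """One auxiliary problem per frequent word w: predict "is the current word w?"
    from the context features alone (the current-word features are masked out)."""
    cols = []
    for w in frequent_words:
        y = np.array([cw == w for cw in current_words], dtype=int)
        if 0 < y.sum() < len(y):                       # skip degenerate problems
            cols.append(LogisticRegression(max_iter=1000)
                        .fit(X_context, y).coef_.ravel())
    return np.column_stack(cols)                       # d x m auxiliary coefficients

# The shared theta comes from a PCA/SVD of this matrix; theta @ x_context is then
# appended to the original feature vector of the core (tagging) task.
```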

Experiments (CoNLL-03 named entity).  4 classes: LOC, ORG, PER, MISC.  Labeled data: news documents; 204K words (English), 206K words (German).  Unlabeled data: 27M words (English), 35M words (German).  Features: a slight modification of ZJ03. Words, POS, character types, 4 characters at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.  No gazetteer, no hand-crafted resources.

Auxiliary problems. # of aux. problems: 1000. Auxiliary labels and the features used for learning the auxiliary problems:
 Previous words: all features but the previous words.
 Current words: all features but the current words.
 Next words: all features but the next words.
300 auxiliary problems.

Syntactic chunking results (CoNLL-00), F-measure comparison:
 supervised: baseline.
 ASO-semi: + unlabeled data.
 Co/self oracle: + unlabeled data.
 KM01: SVM combination.
 CM03: perceptron in two layers.
 ZDJ02: regularized Winnow.
 ZDJ02+: + full parser (ESG) output (+0.79%).
ASO-semi exceeds the previous best systems.

Other experiments. Confirmed effectiveness on:  POS tagging.  Text categorization (2 standard corpora).

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Notation. A collection of m tasks, each with its own training set; the joint sparse approximation learns a d-by-m coefficient matrix W whose k-th column is the linear classifier for task k and whose j-th row collects the coefficients of feature j across all tasks.

Single Task Sparse Approximation  Consider learning a single sparse linear classifier of the form f(x) = w^T x.  We want a few features with non-zero coefficients.  Recent work suggests using L1 regularization: minimize the classification error plus λ ||w||_1, where the L1 term penalizes non-sparse solutions.  Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
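
For reference, a single-task sparse linear classifier with an L1 penalty can be obtained with standard tools; the data below is synthetic and the regularization strength C is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))
y = (X[:, 0] - X[:, 3] > 0).astype(int)     # only features 0 and 3 matter

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(np.flatnonzero(clf.coef_))            # only a few non-zero coefficients survive
```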

Joint Sparse Approximation  Setting: a collection of related classification tasks.  Objective: minimize the sum over tasks of the average loss on each training set D_k, plus a regularization term that penalizes solutions that utilize too many features.

Joint Regularization Penalty  How do we penalize solutions that use too many features? In the coefficient matrix W, each row holds the coefficients for one feature across tasks and each column holds the coefficients for one classifier. Penalizing the number of rows with non-zero entries directly would lead to a hard combinatorial problem.

Joint Regularization Penalty  We will use an L1-∞ norm [Tropp 2006]: R(W) = Σ_j max_k |W_jk|.  This norm combines: an L1 norm over the per-row maximum absolute values, which promotes sparsity (use few features), and an L∞ norm on each row, which promotes non-sparsity within a row (share features across tasks).  The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
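
The L1-∞ penalty itself is simple to compute: take the maximum absolute value within each row of W (one row per feature, one column per task) and sum these maxima. A NumPy sketch with a toy W:

```python
import numpy as np

W = np.array([[0.0, 0.0, 0.0],     # feature 1: unused by every task
              [0.9, -0.3, 0.5],    # feature 2: shared across the three tasks
              [0.0, 0.0, 0.2]])    # feature 3: used by one task only

l1_inf = np.sum(np.max(np.abs(W), axis=1))   # sum over features of the per-row max
print(l1_inf)                                # 0.0 + 0.9 + 0.2 = 1.1
```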

Joint Sparse Approximation  Using the L1-∞ norm we can rewrite our objective function as the sum over tasks of the average loss on each training set, plus λ Σ_j max_k |W_jk|.  For the hinge loss, the optimization problem can be expressed as a linear program.  For any convex loss this is a convex objective.

Joint Sparse Approximation  Linear program formulation (hinge loss): introduce one variable t_j per feature for the row maxima and slack variables ξ_ik for the hinge losses, and minimize Σ_k (1/|D_k|) Σ_i ξ_ik + λ Σ_j t_j.  Max value constraints: W_jk ≤ t_j and -W_jk ≤ t_j for all j, k.  Slack variable constraints: y_ik (W_k^T x_ik) ≥ 1 - ξ_ik and ξ_ik ≥ 0.

An efficient training algorithm  The LP formulation can be optimized using standard LP solvers.  However, the LP formulation is feasible for small problems but becomes intractable for larger data-sets with thousands of examples and dimensions, and we might want a more general optimization algorithm that can handle arbitrary convex losses.  We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints; its total cost scales efficiently with the number of examples and dimensions.
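
As a rough stand-in for that algorithm, the same objective can be minimized with plain (much slower) subgradient descent on the hinge loss plus the L1-∞ penalty; this sketch deliberately uses that simpler approach rather than the projection method from the paper, and all shapes and hyperparameters are placeholders.

```python
import numpy as np

def joint_l1_inf_subgradient(tasks, d, lam=0.1, lr=0.01, iters=500):
    """tasks: list of (X_k, y_k) with y_k in {-1, +1}. Returns a d x m matrix W
    trained by subgradient descent on hinge loss + lam * ||W||_{1,inf}."""
    m = len(tasks)
    W = np.zeros((d, m))
    for _ in range(iters):
        G = np.zeros_like(W)
        for k, (X, y) in enumerate(tasks):
            margins = y * (X @ W[:, k])
            active = margins < 1                               # hinge-loss subgradient
            G[:, k] = -(X[active] * y[active, None]).sum(0) / len(y)
        # subgradient of the L1-inf penalty: sign of the max-magnitude entry per row
        rows = np.arange(d)
        cols = np.argmax(np.abs(W), axis=1)
        P = np.zeros_like(W)
        P[rows, cols] = np.sign(W[rows, cols])
        W -= lr * (G + lam * P)
    return W
```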

Outline  Motivation: Low dimensional representations.  Principal Component Analysis.  Structural Learning.  Vision Applications.  NLP Applications.  Joint Sparsity.  Vision Applications.

Topics: Super Bowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.  Learn a representation using labeled data from 9 topics: learn the matrix W using our transfer algorithm.  Define the set of relevant features R to be the rows of W with a non-zero maximum absolute value.  Train a classifier for the 10th held-out topic using the relevant features R only.
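
A sketch of this protocol under that reading of "relevant features" (rows of W whose maximum absolute value is non-zero); W, the training data and the tolerance are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_to_held_out_topic(W, X_train, y_train, tol=1e-6):
    """W: d x 9 coefficient matrix learned on the nine source topics.
    Keep only the rows (features) that some source classifier actually uses,
    then train the held-out topic's classifier on those features alone."""
    relevant = np.max(np.abs(W), axis=1) > tol          # the set R of relevant features
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, relevant], y_train)
    return clf, relevant
```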

Results

Future Directions  Joint sparsity regularization to control inference time.  Learning representations for ranking problems.