1 Learning Data Representations with “Partial Supervision” Ariadna Quattoni

2 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.


4 Semi-Supervised Learning
Core task: learn a function from the "raw" feature space X to the output space Y.
Classical setting: a small labeled dataset plus a large unlabeled dataset.
Partial supervision setting: a small labeled dataset plus a large partially labeled dataset.

5 Semi-Supervised Learning: Classical Setting
Use the unlabeled dataset to learn a representation (dimensionality reduction); then use the labeled dataset to train a classifier on that representation.

6 Semi-Supervised Learning: Partial Supervision Setting
Use the unlabeled dataset plus the partial supervision to learn a representation (dimensionality reduction); then use the labeled dataset to train a classifier on that representation.

7 Why is "learning representations" useful?
- Infer the intrinsic dimensionality of the data.
- Learn the "relevant" dimensions.
- Infer the hidden structure.

8 Example: Hidden Structure
- Setup: 20 symbols and 4 topics; each topic covers a subset of the symbols.
- To generate a data point: choose a topic T, then sample 3 symbols from T.
- The slide shows the covariance matrix of the generated data.

9 Example: Hidden Structure
- Number of latent dimensions = 4.
- Map each data point x to the topic that generated it.
- The map is linear: the latent representation is z = θx, where θ is a projection matrix taking a data point to its topic vector (a sketch of the generative process follows).
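A minimal sketch of this generative process. It assumes, for concreteness, that the 20 symbols split evenly into 4 disjoint topics of 5 symbols each; the slides do not specify the grouping, so treat that choice as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics, n_points = 20, 4, 5000
# Assumed grouping: 4 disjoint topics of 5 symbols each (not given on the slides).
topics = np.split(np.arange(n_symbols), n_topics)

X = np.zeros((n_points, n_symbols))
for i in range(n_points):
    t = rng.integers(n_topics)                              # choose a topic T
    chosen = rng.choice(topics[t], size=3, replace=False)   # sample 3 symbols from T
    X[i, chosen] = 1.0

# Symbols drawn from the same topic co-occur, so the data covariance
# shows one correlated block per topic.
cov = np.cov(X, rowvar=False)
print(np.round(cov[:5, :5], 2))
```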

10 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

11 Classical Setting: Principal Component Analysis
- The rows of θ serve as a 'basis' for the data.
- The slide illustrates an example generated by the topics T1–T4 expressed in this basis.
- Objective: low reconstruction error.

12 Minimum Error Formulation
- Approximate the high-dimensional x with a low-dimensional x' = θx.
- Error: J = (1/N) Σ_n ‖x_n − θᵀθ x_n‖², with θ constrained to be an orthonormal basis (θθᵀ = I).
- Solution: the rows of θ are the top eigenvectors of the data covariance C = (1/N) Σ_n x_n x_nᵀ (for centered data).
- Distortion: the sum of the eigenvalues of the discarded directions (a code sketch follows).
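A compact sketch of this recipe via eigendecomposition of the data covariance; the function and variable names are mine, not from the slides:

```python
import numpy as np

def pca_basis(X, h):
    """Top-h principal directions of the data matrix X (n x d)."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # d x d data covariance
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    theta = eigvecs[:, order[:h]].T         # h x d projection matrix (orthonormal rows)
    distortion = eigvals[order[h:]].sum()   # sum of the discarded eigenvalues
    return theta, distortion

# Low-dimensional code: z = theta @ x; reconstruction: x_hat = theta.T @ z.
```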

13 Principal Component Analysis: 2D Example
- Project the data onto the leading direction; the projection error is the distance from each point to that direction.
- The projected variables are uncorrelated.
- Cut dimensions according to their variance.
- For the reduction to pay off, the original variables must be correlated.

14 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

15 Structural Learning: Partial Supervision Setting [Ando & Zhang, JMLR 2005]
Pipeline: unlabeled dataset + partial supervision → create auxiliary tasks → structural learning.

16 Partial Supervision Setting
- Unlabeled data + partial supervision, for example:
  - images with associated natural-language captions;
  - video sequences with associated speech;
  - documents with keywords.
- How could the partial supervision help?
  - It is a hint for discovering important features.
  - Use the partial supervision to define "auxiliary tasks".
  - Discover feature groupings that are useful for these tasks.
- Sometimes auxiliary tasks can be defined from unlabeled data alone; e.g. an auxiliary task for word tagging is predicting substructures of the input.

17 Auxiliary Tasks
- Machine learning papers carry keywords such as "machine learning, dimensionality reduction" or "linear embedding, spectral methods, distance learning"; computer vision papers carry keywords such as "object recognition, shape matching, stereo".
- Mask the occurrences of the keywords in the documents.
- Auxiliary task: predict the keyword "object recognition" from the document content.
- Core task: is it a vision or a machine learning article? (A sketch of the auxiliary-task construction follows.)
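A hedged sketch of how such auxiliary problems could be built from keyworded documents; the helper, its name, and the simplification of treating each keyword as a single token are mine, not from the talk:

```python
# Turn documents with keyword lists into auxiliary binary problems by masking keyword
# occurrences in the text and predicting, from the masked content, whether a keyword applied.
def make_auxiliary_examples(documents, all_keywords):
    """documents: list of (tokens, doc_keywords) pairs; all_keywords: the keywords used as tasks."""
    examples = []
    for tokens, doc_keywords in documents:
        masked = [t for t in tokens if t not in all_keywords]   # hide every keyword occurrence
        for kw in all_keywords:
            label = int(kw in doc_keywords)                     # auxiliary label for task "kw"
            examples.append((masked, kw, label))
    return examples
```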

18 Auxiliary Tasks

19 Structural Learning
- Learning with no prior knowledge: the best hypothesis is learned from examples over the full hypothesis space.
- Learning with prior knowledge, here learning from auxiliary tasks: the hypotheses learned for related tasks narrow down the search for the best hypothesis.

20 Learning Good Hypothesis Spaces
- Class of linear predictors: f_l(x) = w_l·x + v_l·(θx), where θ is an h by d matrix of structural parameters shared across problems, and w_l, v_l are problem-specific parameters.
- Goal: find the problem-specific parameters and the shared θ that minimize the joint loss, i.e. the sum over problems of the loss on each training set (written out below).
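The slide's formulas did not survive the transcript; the joint objective in Ando & Zhang's structural-learning framework is usually written as below. This is my rendering of the standard formulation, with L the loss, m auxiliary problems, and n_l examples for problem l:

```latex
\min_{\theta,\;\{w_\ell,\,v_\ell\}} \;
\sum_{\ell=1}^{m} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell}
  L\bigl(w_\ell^{\top} x_{\ell i} + v_\ell^{\top} \theta\, x_{\ell i},\; y_{\ell i}\bigr)
  + \lambda_\ell \lVert w_\ell \rVert^{2}
\qquad \text{s.t. } \theta\,\theta^{\top} = I_{h\times h}
```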

21 Algorithm Step 1: Train classifiers for auxiliary tasks.

22 Algorithm Step 2: PCA on the classifier coefficients
Take the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors. They span a linear subspace of dimension h that is a good low-dimensional approximation to the space of coefficients, and they form the rows of the projection matrix θ.

23 Algorithm Step 3: Training on the core task
Project the data, z = θx, and train the core-task classifier on the projected features. This is equivalent to training the core task in the original d-dimensional space with the constraint that the parameter vector lies in the row space of θ (a sketch of all three steps follows).
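Putting steps 1–3 together, a minimal sketch. Scikit-learn logistic regression is my stand-in for the auxiliary and core learners, which the slides do not fix, and binary auxiliary labels are assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def structural_learning(X_aux, Y_aux, X_core, y_core, h):
    """X_aux: n x d features with auxiliary labels Y_aux (n x m, one column per task);
    X_core, y_core: the small labeled set for the core task; h: latent dimension."""
    # Step 1: one linear classifier per auxiliary task; stack coefficients into W (m x d).
    W = np.vstack([
        LogisticRegression(max_iter=1000).fit(X_aux, Y_aux[:, j]).coef_[0]
        for j in range(Y_aux.shape[1])
    ])
    # Step 2: PCA on the classifier coefficients -- the top-h eigenvectors of their
    # covariance give the h x d projection matrix theta.
    eigvals, eigvecs = np.linalg.eigh(np.cov(W, rowvar=False))
    theta = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T
    # Step 3: project the labeled data and train the core classifier on z = theta x.
    core_clf = LogisticRegression(max_iter=1000).fit(X_core @ theta.T, y_core)
    return theta, core_clf
```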

24 Example
An object = { letter, letter, letter }; for instance, the object "abC".

25 Example: the same object seen in a different font: "A b c".

26 Example: the same object seen in a different font: "A B c".

27 Example: the same object seen in a different font: "a bC".

28 Example
- 6 letters (topics), 5 fonts per letter (symbols): 30 symbols → 30 features; 20 words (objects) such as "ABC", "ADE", "BCF", "ABD".
- A data point such as "acE" is encoded as a sparse 30-dimensional indicator vector over the symbols.
- Auxiliary task: recognize the object (word).

29 PCA on the data cannot recover the latent structure (the slide shows the covariance matrix of the data).

30 PCA on the coefficients can recover the latent structure
The coefficient matrix W has one row per auxiliary task (e.g. the parameters for the object "BCD") and one column per feature (the fonts); the latent structure behind it is the topics, i.e. the letters.

31 PCA on the coefficients can recover the latent structure
In the covariance of W (features × features, i.e. fonts × fonts), each block of correlated variables corresponds to a latent topic.

32 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

33 News domain
Dataset: news images from the Reuters website, covering topics such as figure skating, ice hockey, the Golden Globes, and the Grammys.
Problem: predicting news topics from images.

34 Learning visual representations using images with captions
Example captions:
- "The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics."
- "Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad."
- "Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine."
- "Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006."
- "Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet."
- "U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival."
Auxiliary task: predict the word "team" from the image content.

35 Learning visual topics
- The word 'games' might contain the visual topics: medals, people, pavement.
- The word 'demonstrations' might likewise contain visual topics such as people.
- Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.

36 Experiments: Results.

37 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

38 Chunking
- Named entity chunking: "Jane lives in New York and works for Bank of New York." with entity labels such as PER, LOC, and ORG.
- Syntactic chunking: "But economists in Europe failed to predict that …" with chunk labels such as NP, VP, SBAR, and PP.
- Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside.

39 Example input vector representation
For the text "… lives in New York …": the word "in" has the indicator features curr-"in", left-"lives", and right-"New" set to 1; the word "New" has curr-"New", left-"in", and right-"York" set to 1.
The input vectors X are high-dimensional and most entries are 0 (a small illustration follows).
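A small illustration of this sparse indicator encoding; scikit-learn's DictVectorizer is my choice of tooling, not something the slides prescribe:

```python
from sklearn.feature_extraction import DictVectorizer

def word_features(tokens, i):
    """Indicator features for the word at position i (current / left / right word)."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {f"curr-{tokens[i]}": 1, f"left-{left}": 1, f"right-{right}": 1}

tokens = "Jane lives in New York".split()
vec = DictVectorizer()
X = vec.fit_transform([word_features(tokens, i) for i in range(len(tokens))])
print(X.shape)   # sparse matrix: one row per word occurrence, almost all entries zero
```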

40 Algorithmic Procedure
1. Create m auxiliary problems.
2. Assign auxiliary labels to unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.
Predictor: the projected features θx are added as additional features.

41 Example auxiliary problems
- Is the current word "New"? Is the current word "day"? Is the current word "IBM"? Is the current word "computer"? …
- Split the feature vector into two views: φ1 (the current-word features) and φ2 (the left- and right-word features). Predict φ1 from φ2, compute the shared θ, and add θφ2 as new features.
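A minimal sketch of how such auxiliary labels can be generated from unlabeled text alone; the function and feature names are mine:

```python
# Auxiliary problems of the form "is the current word w?", labeled automatically from
# unlabeled text and predicted from context-only features (phi2).
def auxiliary_problems(sentences, target_words):
    """sentences: list of token lists; target_words: e.g. ["New", "day", "IBM", "computer"]."""
    rows = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            phi2 = {                                     # no current-word feature here
                f"left-{tokens[i - 1] if i > 0 else '<s>'}": 1,
                f"right-{tokens[i + 1] if i + 1 < len(tokens) else '</s>'}": 1,
            }
            labels = {w: int(tok == w) for w in target_words}   # one auxiliary label per word
            rows.append((phi2, labels))
    return rows
```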

42 Experiments (CoNLL-03 named entity)
- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of ZJ03. Words, POS, character types, 4 characters at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
- No gazetteer. No hand-crafted resources.

43 Auxiliary problems
Auxiliary labels and the features used to learn them:
- previous words, learned from all features but the previous words (1000 problems);
- current words, learned from all features but the current words (1000 problems);
- next words, learned from all features but the next words (1000 problems).
In total, 3 × 1000 auxiliary problems.

44 Syntactic chunking results (CoNLL-00)
Method (description): F-measure
- supervised (baseline): 93.60
- ASO-semi (+ unlabeled data): 94.39 (+0.79 over the baseline)
- Co/self oracle (+ unlabeled data): 93.66
- KM01 (SVM combination): 93.91
- CM03 (perceptron in two layers): 93.74
- ZDJ02 (regularized Winnow): 93.57
- ZDJ02+ (+ full parser (ESG) output): 94.17
ASO-semi exceeds the previous best systems.

45 Other experiments
Confirmed effectiveness on POS tagging and text categorization (2 standard corpora).

46 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

47 Notation: a collection of tasks and the joint sparse approximation problem over them.

48 Single Task Sparse Approximation
- Consider learning a single sparse linear classifier of the form f(x) = w·x.
- We want only a few features with non-zero coefficients.
- Recent work suggests using L1 regularization: the objective combines the classification error with an L1 penalty, which penalizes non-sparse solutions (a small illustration follows).
- Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
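A quick illustration of this effect on my own toy example, not from the talk: with an L1 penalty most learned coefficients come out exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 200 features, only 10 informative: an L1 penalty should zero out most weights.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10, random_state=0)

clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
print('non-zero coefficients:', np.count_nonzero(clf.coef_))
```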

49 Joint Sparse Approximation
Setting: several related classification tasks, learned jointly. The objective combines the average loss on the training set of each task k with a regularization term that penalizes solutions that utilize too many features.

50 Joint Regularization Penalty
- How do we penalize solutions that use too many features?
- Arrange the coefficients in a matrix W: a row collects the coefficients for one feature (e.g. feature 2) across classifiers, and a column collects the coefficients for one classifier (e.g. classifier 2).
- Penalizing the number of non-zero rows directly would lead to a hard combinatorial problem.

51 Joint Regularization Penalty
- We will use an L1-∞ norm [Tropp 2006].
- This norm combines two norms:
  - an L1 norm over the per-feature maxima of the absolute coefficients across tasks, which promotes sparsity: use few features;
  - an L∞ norm on each row, which promotes non-sparsity within the row: share features.
- The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems (see the snippet below).
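In matrix form (rows indexed by features, columns by tasks, as on the previous slide), the penalty is simply the sum over rows of each row's maximum absolute value; a small Python check of my own:

```python
import numpy as np

def l1_inf_penalty(W):
    """L1-inf norm of a d x m coefficient matrix W (rows = features, columns = tasks):
    the sum over features of the maximum absolute coefficient across tasks."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.0, 0.0, 0.0],     # unused feature: contributes nothing
              [0.5, -2.0, 1.0]])   # shared feature: contributes max |.| = 2.0 once
print(l1_inf_penalty(W))           # 2.0
```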

52 Joint Sparse Approximation
- Using the L1-∞ norm we can rewrite our objective function as the sum of the average training losses plus an L1-∞ penalty on W (written out below).
- For the hinge loss, the optimization problem can be expressed as a linear program.
- For any convex loss, this is a convex objective.
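The slide's equation did not survive the transcript; the objective described here is usually written as follows. This is my rendering, with W_jk the coefficient of feature j for task k and w_k the k-th column of W:

```latex
\min_{W}\;
\sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k}
  L\bigl(w_k^{\top} x_i^{k},\, y_i^{k}\bigr)
\; + \; \lambda \sum_{j=1}^{d} \max_{k}\, \lvert W_{jk} \rvert
```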

53 Joint Sparse Approximation
Linear program formulation (hinge loss), with two groups of constraints (a reconstruction follows):
- max-value constraints bounding each coefficient's absolute value by a per-feature variable, and
- slack-variable constraints enforcing the hinge-loss margins and non-negativity.
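For the hinge loss, a standard way to write this as a linear program is sketched below; treat it as a hedged reconstruction, since the slide's exact constraint notation was lost:

```latex
\min_{W,\,\xi,\,t}\;
\sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \xi_{ki} \;+\; \lambda \sum_{j=1}^{d} t_j
\quad\text{s.t.}\quad
y_i^{k}\, w_k^{\top} x_i^{k} \;\ge\; 1 - \xi_{ki},\qquad
\xi_{ki} \ge 0,\qquad
-t_j \;\le\; W_{jk} \;\le\; t_j \quad \forall j,k
```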

54 An efficient training algorithm
- The LP formulation can be optimized using standard LP solvers.
- It is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
- We might also want a more general optimization algorithm that can handle arbitrary convex losses.
- We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
- The total cost is on the order of the expression given on the slide.

55 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

56 Dataset topics: Super Bowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.
- Learn a representation using labeled data from 9 of the topics: learn the matrix W using our transfer algorithm.
- Define the set of relevant features R to be the features with a non-zero coefficient for at least one of the 9 topics (sketch below).
- Train a classifier for the 10th, held-out topic using the relevant features R only.
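A short sketch of the feature-selection step in this protocol; the tolerance, names, and commented usage are mine:

```python
import numpy as np

# W (d x 9) is assumed to be the matrix learned jointly on the 9 training topics with
# the L1-inf penalty; keep only the features used by at least one topic.
def relevant_features(W, tol=1e-8):
    return np.where(np.abs(W).max(axis=1) > tol)[0]

# R = relevant_features(W)
# The held-out topic's classifier is then trained on X_heldout[:, R] only.
```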

57 Results

58 Future Directions
- Joint sparsity regularization to control inference time.
- Learning representations for ranking problems.

