1 Learning Data Representations with “Partial Supervision” Ariadna Quattoni

2 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.


4 Semi-Supervised Learning
Core task: learn a function from the "raw" feature space X to the output space Y.
Classical setting: a small labeled dataset plus a large unlabeled dataset.
Partial supervision setting: a small labeled dataset plus a large partially labeled dataset.

5 Semi-Supervised Learning: Classical Setting
Use the unlabeled dataset to learn a representation (dimensionality reduction); then use the labeled dataset to train a classifier on that representation.

6 Semi-Supervised Learning: Partial Supervision Setting
Use the unlabeled dataset plus the partial supervision to learn a representation (dimensionality reduction); then use the labeled dataset to train a classifier on that representation.

7 Why is "learning representations" useful?
- Infer the intrinsic dimensionality of the data.
- Learn the "relevant" dimensions.
- Infer the hidden structure.

8 Example: Hidden Structure
- Setup: 20 symbols and 4 topics; each topic covers a subset of the symbols.
- To generate a data point: choose a topic T, then sample 3 symbols from T.
- The slide shows the covariance matrix of the generated data.

9 Example: Hidden Structure
- Number of latent dimensions = 4.
- Map each data point x to the topic that generated it.
- The map is linear: the latent representation is z = θx, where θ is a projection matrix taking a data point to its topic vector (a sketch of the generative process follows).
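A minimal sketch of this generative process. It assumes, for concreteness, that the 20 symbols split evenly into 4 disjoint topics of 5 symbols each; the slides do not specify the grouping, so treat that choice as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics, n_points = 20, 4, 5000
# Assumed grouping: 4 disjoint topics of 5 symbols each (not given on the slides).
topics = np.split(np.arange(n_symbols), n_topics)

X = np.zeros((n_points, n_symbols))
for i in range(n_points):
    t = rng.integers(n_topics)                              # choose a topic T
    chosen = rng.choice(topics[t], size=3, replace=False)   # sample 3 symbols from T
    X[i, chosen] = 1.0

# Symbols drawn from the same topic co-occur, so the data covariance
# shows one correlated block per topic.
cov = np.cov(X, rowvar=False)
print(np.round(cov[:5, :5], 2))
```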

10 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

11 Classical Setting: Principal Component Analysis
- The rows of θ serve as a 'basis' for the data.
- The slide illustrates an example generated by the topics T1–T4 expressed in this basis.
- Objective: low reconstruction error.

12 Minimum Error Formulation
- Approximate the high-dimensional x with a low-dimensional x' = θx.
- Error: J = (1/N) Σ_n ‖x_n − θᵀθ x_n‖², with θ constrained to be an orthonormal basis (θθᵀ = I).
- Solution: the rows of θ are the top eigenvectors of the data covariance C = (1/N) Σ_n x_n x_nᵀ (for centered data).
- Distortion: the sum of the eigenvalues of the discarded directions (a code sketch follows).
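A compact sketch of this recipe via eigendecomposition of the data covariance; the function and variable names are mine, not from the slides:

```python
import numpy as np

def pca_basis(X, h):
    """Top-h principal directions of the data matrix X (n x d)."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # d x d data covariance
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    theta = eigvecs[:, order[:h]].T         # h x d projection matrix (orthonormal rows)
    distortion = eigvals[order[h:]].sum()   # sum of the discarded eigenvalues
    return theta, distortion

# Low-dimensional code: z = theta @ x; reconstruction: x_hat = theta.T @ z.
```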

13 Principal Component Analysis: 2D Example
- Project the data onto the leading direction; the projection error is the distance from each point to that direction.
- The projected variables are uncorrelated.
- Cut dimensions according to their variance.
- For the reduction to pay off, the original variables must be correlated.

14 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

15 Structural Learning: Partial Supervision Setting [Ando & Zhang, JMLR 2005]
Pipeline: unlabeled dataset + partial supervision → create auxiliary tasks → structural learning.

16 Partial Supervision Setting
- Unlabeled data + partial supervision, for example:
  - images with associated natural-language captions;
  - video sequences with associated speech;
  - documents with keywords.
- How could the partial supervision help?
  - It is a hint for discovering important features.
  - Use the partial supervision to define "auxiliary tasks".
  - Discover feature groupings that are useful for these tasks.
- Sometimes auxiliary tasks can be defined from unlabeled data alone; e.g. an auxiliary task for word tagging is predicting substructures of the input.

17 Auxiliary Tasks
- Machine learning papers carry keywords such as "machine learning, dimensionality reduction" or "linear embedding, spectral methods, distance learning"; computer vision papers carry keywords such as "object recognition, shape matching, stereo".
- Mask the occurrences of the keywords in the documents.
- Auxiliary task: predict the keyword "object recognition" from the document content.
- Core task: is it a vision or a machine learning article? (A sketch of the auxiliary-task construction follows.)
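A hedged sketch of how such auxiliary problems could be built from keyworded documents; the helper, its name, and the simplification of treating each keyword as a single token are mine, not from the talk:

```python
# Turn documents with keyword lists into auxiliary binary problems by masking keyword
# occurrences in the text and predicting, from the masked content, whether a keyword applied.
def make_auxiliary_examples(documents, all_keywords):
    """documents: list of (tokens, doc_keywords) pairs; all_keywords: the keywords used as tasks."""
    examples = []
    for tokens, doc_keywords in documents:
        masked = [t for t in tokens if t not in all_keywords]   # hide every keyword occurrence
        for kw in all_keywords:
            label = int(kw in doc_keywords)                     # auxiliary label for task "kw"
            examples.append((masked, kw, label))
    return examples
```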

18 Auxiliary Tasks

19 Structural Learning
- Learning with no prior knowledge: the best hypothesis is learned from examples over the full hypothesis space.
- Learning with prior knowledge, here learning from auxiliary tasks: the hypotheses learned for related tasks narrow down the search for the best hypothesis.

20 Learning Good Hypothesis Spaces
- Class of linear predictors: f_l(x) = w_l·x + v_l·(θx), where θ is an h by d matrix of structural parameters shared across problems, and w_l, v_l are problem-specific parameters.
- Goal: find the problem-specific parameters and the shared θ that minimize the joint loss, i.e. the sum over problems of the loss on each training set (written out below).
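The slide's formulas did not survive the transcript; the joint objective in Ando & Zhang's structural-learning framework is usually written as below. This is my rendering of the standard formulation, with L the loss, m auxiliary problems, and n_l examples for problem l:

```latex
\min_{\theta,\;\{w_\ell,\,v_\ell\}} \;
\sum_{\ell=1}^{m} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell}
  L\bigl(w_\ell^{\top} x_{\ell i} + v_\ell^{\top} \theta\, x_{\ell i},\; y_{\ell i}\bigr)
  + \lambda_\ell \lVert w_\ell \rVert^{2}
\qquad \text{s.t. } \theta\,\theta^{\top} = I_{h\times h}
```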

21 Algorithm Step 1: Train classifiers for auxiliary tasks.

22 Algorithm Step 2: PCA on the classifier coefficients
Take the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors. They span a linear subspace of dimension h that is a good low-dimensional approximation to the space of coefficients, and they form the rows of the projection matrix θ.

23 Algorithm Step 3: Training on the core task
Project the data, z = θx, and train the core-task classifier on the projected features. This is equivalent to training the core task in the original d-dimensional space with the constraint that the parameter vector lies in the row space of θ (a sketch of all three steps follows).
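Putting steps 1–3 together, a minimal sketch. Scikit-learn logistic regression is my stand-in for the auxiliary and core learners, which the slides do not fix, and binary auxiliary labels are assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def structural_learning(X_aux, Y_aux, X_core, y_core, h):
    """X_aux: n x d features with auxiliary labels Y_aux (n x m, one column per task);
    X_core, y_core: the small labeled set for the core task; h: latent dimension."""
    # Step 1: one linear classifier per auxiliary task; stack coefficients into W (m x d).
    W = np.vstack([
        LogisticRegression(max_iter=1000).fit(X_aux, Y_aux[:, j]).coef_[0]
        for j in range(Y_aux.shape[1])
    ])
    # Step 2: PCA on the classifier coefficients -- the top-h eigenvectors of their
    # covariance give the h x d projection matrix theta.
    eigvals, eigvecs = np.linalg.eigh(np.cov(W, rowvar=False))
    theta = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T
    # Step 3: project the labeled data and train the core classifier on z = theta x.
    core_clf = LogisticRegression(max_iter=1000).fit(X_core @ theta.T, y_core)
    return theta, core_clf
```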

24 Example
An object = { letter, letter, letter }; for instance, the object "abC".

25 Example: the same object seen in a different font: "A b c".

26 Example: the same object seen in a different font: "A B c".

27 Example: the same object seen in a different font: "a bC".

28 Example
- 6 letters (topics), 5 fonts per letter (symbols): 30 symbols → 30 features; 20 words (objects) such as "ABC", "ADE", "BCF", "ABD".
- A data point such as "acE" is encoded as a sparse 30-dimensional indicator vector over the symbols.
- Auxiliary task: recognize the object (word).

29 PCA on the data cannot recover the latent structure (the slide shows the covariance matrix of the data).

30 PCA on the coefficients can recover the latent structure
The coefficient matrix W has one row per auxiliary task (e.g. the parameters for the object "BCD") and one column per feature (the fonts); the latent structure behind it is the topics, i.e. the letters.

31 PCA on the coefficients can recover the latent structure
In the covariance of W (features × features, i.e. fonts × fonts), each block of correlated variables corresponds to a latent topic.

32 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

33 News domain
Dataset: news images from the Reuters website, covering topics such as figure skating, ice hockey, the Golden Globes, and the Grammys.
Problem: predicting news topics from images.

34 Learning visual representations using images with captions
Example captions:
- "The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics."
- "Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad."
- "Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine."
- "Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006."
- "Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet."
- "U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival."
Auxiliary task: predict the word "team" from the image content.

35 Learning visual topics
- The word 'games' might contain the visual topics: medals, people, pavement.
- The word 'demonstrations' might likewise contain visual topics such as people.
- Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.

36 Experiments: Results.

37 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

38 Chunking
- Named entity chunking: "Jane lives in New York and works for Bank of New York." with entity labels such as PER, LOC, and ORG.
- Syntactic chunking: "But economists in Europe failed to predict that …" with chunk labels such as NP, VP, SBAR, and PP.
- Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside.

39 Example input vector representation
For the text "… lives in New York …": the word "in" has the indicator features curr-"in", left-"lives", and right-"New" set to 1; the word "New" has curr-"New", left-"in", and right-"York" set to 1.
The input vectors X are high-dimensional and most entries are 0 (a small illustration follows).
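A small illustration of this sparse indicator encoding; scikit-learn's DictVectorizer is my choice of tooling, not something the slides prescribe:

```python
from sklearn.feature_extraction import DictVectorizer

def word_features(tokens, i):
    """Indicator features for the word at position i (current / left / right word)."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {f"curr-{tokens[i]}": 1, f"left-{left}": 1, f"right-{right}": 1}

tokens = "Jane lives in New York".split()
vec = DictVectorizer()
X = vec.fit_transform([word_features(tokens, i) for i in range(len(tokens))])
print(X.shape)   # sparse matrix: one row per word occurrence, almost all entries zero
```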

40 Algorithmic Procedure
1. Create m auxiliary problems.
2. Assign auxiliary labels to unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.
Predictor: the projected features θx are added as additional features.

41 Example auxiliary problems
- Is the current word "New"? Is the current word "day"? Is the current word "IBM"? Is the current word "computer"? …
- Split the feature vector into two views: φ1 (the current-word features) and φ2 (the left- and right-word features). Predict φ1 from φ2, compute the shared θ, and add θφ2 as new features.
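A minimal sketch of how such auxiliary labels can be generated from unlabeled text alone; the function and feature names are mine:

```python
# Auxiliary problems of the form "is the current word w?", labeled automatically from
# unlabeled text and predicted from context-only features (phi2).
def auxiliary_problems(sentences, target_words):
    """sentences: list of token lists; target_words: e.g. ["New", "day", "IBM", "computer"]."""
    rows = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            phi2 = {                                     # no current-word feature here
                f"left-{tokens[i - 1] if i > 0 else '<s>'}": 1,
                f"right-{tokens[i + 1] if i + 1 < len(tokens) else '</s>'}": 1,
            }
            labels = {w: int(tok == w) for w in target_words}   # one auxiliary label per word
            rows.append((phi2, labels))
    return rows
```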

42 Experiments (CoNLL-03 named entity)
- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of ZJ03. Words, POS, character types, 4 characters at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
- No gazetteer. No hand-crafted resources.

43 Auxiliary problems
Auxiliary labels and the features used to learn them:
- previous words, learned from all features but the previous words (1000 problems);
- current words, learned from all features but the current words (1000 problems);
- next words, learned from all features but the next words (1000 problems).
In total, 3 × 1000 auxiliary problems.

44 Syntactic chunking results (CoNLL-00)
Method (description): F-measure
- supervised (baseline): 93.60
- ASO-semi (+ unlabeled data): 94.39 (+0.79 over the baseline)
- Co/self oracle (+ unlabeled data): 93.66
- KM01 (SVM combination): 93.91
- CM03 (perceptron in two layers): 93.74
- ZDJ02 (regularized Winnow): 93.57
- ZDJ02+ (+ full parser (ESG) output): 94.17
ASO-semi exceeds the previous best systems.

45 Other experiments
Confirmed effectiveness on POS tagging and text categorization (2 standard corpora).

46 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

47 Notation: a collection of tasks and the joint sparse approximation problem over them.

48 Single Task Sparse Approximation
- Consider learning a single sparse linear classifier of the form f(x) = w·x.
- We want only a few features with non-zero coefficients.
- Recent work suggests using L1 regularization: the objective combines the classification error with an L1 penalty, which penalizes non-sparse solutions (a small illustration follows).
- Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
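A quick illustration of this effect on my own toy example, not from the talk: with an L1 penalty most learned coefficients come out exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 200 features, only 10 informative: an L1 penalty should zero out most weights.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10, random_state=0)

clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
print('non-zero coefficients:', np.count_nonzero(clf.coef_))
```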

49 Joint Sparse Approximation
Setting: several related classification tasks, learned jointly. The objective combines the average loss on the training set of each task k with a regularization term that penalizes solutions that utilize too many features.

50 Joint Regularization Penalty
- How do we penalize solutions that use too many features?
- Arrange the coefficients in a matrix W: a row collects the coefficients for one feature (e.g. feature 2) across classifiers, and a column collects the coefficients for one classifier (e.g. classifier 2).
- Penalizing the number of non-zero rows directly would lead to a hard combinatorial problem.

51 Joint Regularization Penalty
- We will use an L1-∞ norm [Tropp 2006].
- This norm combines two norms:
  - an L1 norm over the per-feature maxima of the absolute coefficients across tasks, which promotes sparsity: use few features;
  - an L∞ norm on each row, which promotes non-sparsity within the row: share features.
- The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems (see the snippet below).
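In matrix form (rows indexed by features, columns by tasks, as on the previous slide), the penalty is simply the sum over rows of each row's maximum absolute value; a small Python check of my own:

```python
import numpy as np

def l1_inf_penalty(W):
    """L1-inf norm of a d x m coefficient matrix W (rows = features, columns = tasks):
    the sum over features of the maximum absolute coefficient across tasks."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.0, 0.0, 0.0],     # unused feature: contributes nothing
              [0.5, -2.0, 1.0]])   # shared feature: contributes max |.| = 2.0 once
print(l1_inf_penalty(W))           # 2.0
```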

52 Joint Sparse Approximation
- Using the L1-∞ norm we can rewrite our objective function as the sum of the average training losses plus an L1-∞ penalty on W (written out below).
- For the hinge loss, the optimization problem can be expressed as a linear program.
- For any convex loss, this is a convex objective.
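The slide's equation did not survive the transcript; the objective described here is usually written as follows. This is my rendering, with W_jk the coefficient of feature j for task k and w_k the k-th column of W:

```latex
\min_{W}\;
\sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k}
  L\bigl(w_k^{\top} x_i^{k},\, y_i^{k}\bigr)
\; + \; \lambda \sum_{j=1}^{d} \max_{k}\, \lvert W_{jk} \rvert
```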

53 Joint Sparse Approximation
Linear program formulation (hinge loss), with two groups of constraints (a reconstruction follows):
- max-value constraints bounding each coefficient's absolute value by a per-feature variable, and
- slack-variable constraints enforcing the hinge-loss margins and non-negativity.
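For the hinge loss, a standard way to write this as a linear program is sketched below; treat it as a hedged reconstruction, since the slide's exact constraint notation was lost:

```latex
\min_{W,\,\xi,\,t}\;
\sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \xi_{ki} \;+\; \lambda \sum_{j=1}^{d} t_j
\quad\text{s.t.}\quad
y_i^{k}\, w_k^{\top} x_i^{k} \;\ge\; 1 - \xi_{ki},\qquad
\xi_{ki} \ge 0,\qquad
-t_j \;\le\; W_{jk} \;\le\; t_j \quad \forall j,k
```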

54 An efficient training algorithm
- The LP formulation can be optimized using standard LP solvers.
- It is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
- We might also want a more general optimization algorithm that can handle arbitrary convex losses.
- We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
- The total cost is on the order of the expression given on the slide.

55 Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.

56 Dataset topics: Super Bowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.
- Learn a representation using labeled data from 9 of the topics: learn the matrix W using our transfer algorithm.
- Define the set of relevant features R to be the features with a non-zero coefficient for at least one of the 9 topics (sketch below).
- Train a classifier for the 10th, held-out topic using the relevant features R only.
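A short sketch of the feature-selection step in this protocol; the tolerance, names, and commented usage are mine:

```python
import numpy as np

# W (d x 9) is assumed to be the matrix learned jointly on the 9 training topics with
# the L1-inf penalty; keep only the features used by at least one topic.
def relevant_features(W, tol=1e-8):
    return np.where(np.abs(W).max(axis=1) > tol)[0]

# R = relevant_features(W)
# The held-out topic's classifier is then trained on X_heldout[:, R] only.
```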

57 Results

58 Future Directions
- Joint sparsity regularization to control inference time.
- Learning representations for ranking problems.

