
1
Discriminative Methods with Structure
Simon Lacoste-Julien, UC Berkeley
Joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan
March 21, 2008

2
« Discriminative method »
Decision-theoretic framework:
Loss: $\ell(y, \hat{y})$
Decision function: $h_w : \mathcal{X} \to \mathcal{Y}$
Risk: $R(w) = \mathbb{E}\left[\ell(y, h_w(x))\right]$
Contrast function: a tractable surrogate objective minimized in place of the risk

3
« with structure » on outputs:
Handwriting recognition: input is an image of a handwritten word (e.g. “brace”), output is the character sequence; the space of possible outputs is huge!
Machine translation: input “Ce n'est pas un autre problème de classification.” → output “This is not another classification problem.”

4
« with structure » on inputs:
text documents → latent variable model → new representation → classification

5
Structure on outputs: Discriminative Word Alignment project (joint work with Ben Taskar, Dan Klein and Mike Jordan)

6
Word Alignment
x (English): What is the anticipated cost of collecting fees under the new proposal?
y (French): En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
(After tokenization, “des” splits into “de les”.) An alignment links English words j to French words k.
Key step in most machine translation systems.

7
Overview
Review of large-margin word alignment [Taskar et al., EMNLP 05]
Two new extensions to the basic model: fertility features; first-order interactions via quadratic assignment
Results on the Hansards dataset

8
Feature-Based Alignment. Features for a candidate link between English word j and French word k:
Association: MI = 3.2, Dice = 4.1
Lexical pair: ID(proposal, proposition) = 1
Position in sentence: AbsDist = 5, RelDist = 0.3
Orthography: ExactMatch = 0, Similarity = 0.8
Resources: PairInDictionary
Other models: IBM2, IBM4
(figure: example sentence pair with a candidate edge j–k)
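To make the feature-based view concrete, here is a minimal Python sketch of what a per-edge feature map could look like; the helper tables (mi, dice, in_dictionary) and the exact feature names are illustrative assumptions, not the system's actual implementation.

```python
# Hypothetical per-edge feature map for word alignment; names and
# helper score tables are illustrative assumptions.
def edge_features(en_words, fr_words, j, k, mi, dice, in_dictionary):
    """Features for a candidate link: English word j <-> French word k."""
    e, f = en_words[j], fr_words[k]
    return {
        "mi": mi[e, f],                              # association: mutual information
        "dice": dice[e, f],                          # association: Dice coefficient
        f"pair={e},{f}": 1.0,                        # lexical pair identity
        "abs_dist": abs(j - k),                      # position in sentence
        "rel_dist": abs(j / len(en_words) - k / len(fr_words)),
        "exact_match": float(e == f),                # orthography
        "in_dict": float((e, f) in in_dictionary),   # external resource
    }
```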

9
Scoring Whole Alignments: the score of an alignment decomposes over its edges, $s(x, z) = \sum_{j,k} \left(w \cdot f(x, j, k)\right) z_{jk}$, where $z_{jk} = 1$ if English word j is linked to French word k. (figure: example sentence pair)

10
Prediction as a Linear Program
$\max_z \sum_{j,k} s_{jk}\, z_{jk}$ subject to the degree constraints $\sum_k z_{jk} \le 1$ and $\sum_j z_{jk} \le 1$, with the integrality constraint $z_{jk} \in \{0,1\}$ relaxed to $0 \le z_{jk} \le 1$.
The relaxation is still guaranteed to have integral solutions y (the bipartite matching polytope has integral vertices).
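As a sanity check of the LP view, a small sketch using scipy; it assumes a precomputed score matrix s[j, k] = w · f(x, j, k), encodes only the degree constraints, and relies on the integrality of the matching polytope to recover a 0/1 solution.

```python
import numpy as np
from scipy.optimize import linprog

def predict_alignment(s):
    """LP prediction: maximize sum of edge scores under degree constraints."""
    m, n = s.shape
    c = -s.ravel()                       # linprog minimizes, so negate scores
    A, b = [], []
    for j in range(m):                   # English word j takes <= 1 link
        row = np.zeros(m * n)
        row[j * n:(j + 1) * n] = 1.0
        A.append(row); b.append(1.0)
    for k in range(n):                   # French word k takes <= 1 link
        row = np.zeros(m * n)
        row[k::n] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=(0.0, 1.0), method="highs")
    return res.x.reshape(m, n).round().astype(int)  # integral at the optimum

z = predict_alignment(np.array([[2.0, 0.1], [0.3, 1.5], [0.2, 0.4]]))
```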

11
Learning w
Supervised training data: $\{(x_i, y_i)\}$
Training methods: maximum likelihood/entropy; perceptron; maximum margin

12
Maximum Likelihood/Entropy
Probabilistic approach: $P(y \mid x) \propto \exp(w \cdot f(x, y))$, normalized over all alignments.
Problem: the denominator (partition function) is #P-complete to compute [Valiant 79; Jerrum & Sinclair 93], so we can't find the maximum-likelihood parameters exactly.

13
(Averaged) Perceptron
Perceptron for structured output [Collins 2002]. For each example $(x_i, y_i)$:
Predict: $\hat{y} = \arg\max_y w \cdot f(x_i, y)$
Update: $w \leftarrow w + f(x_i, y_i) - f(x_i, \hat{y})$
Output averaged parameters: $\bar{w} = \frac{1}{T} \sum_t w_t$
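A compact sketch of the averaged structured perceptron loop; feat and argmax_y (e.g. the LP decoder above) are placeholders the caller must supply.

```python
import numpy as np

def averaged_perceptron(data, feat, argmax_y, dim, epochs=10):
    """Collins-style structured perceptron with parameter averaging."""
    w, w_sum, t = np.zeros(dim), np.zeros(dim), 0
    for _ in range(epochs):
        for x, y in data:
            y_hat = argmax_y(w, x)                    # predict with current weights
            if not np.array_equal(y_hat, y):
                w += feat(x, y) - feat(x, y_hat)      # standard perceptron update
            w_sum += w                                # accumulate for averaging
            t += 1
    return w_sum / t                                  # averaged parameters
```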

14
Large Margin Estimation
Require the true alignment to score higher than every other alignment by a margin that grows with the loss: $w \cdot f(x_i, y_i) \ge w \cdot f(x_i, y) + \ell(y_i, y)$ for all $y$.
Equivalent min-max formulation [Taskar et al. 04, 05]: the inner maximization of (other score + loss) is a simple LP.

15
Min-max formulation → QP
Applying LP duality to the inner maximization turns min-max into a single minimization: a QP of polynomial size, solvable with off-the-shelf software (Mosek).
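For concreteness, a hedged reconstruction of the duality step (slack terms omitted for brevity; the notation $F_i$, $A_i$, $b_i$ for the stacked edge features and matching polytope is assumed, not taken from the slide):

```latex
% Inner problem: loss-augmented LP over the matching polytope
% Z_i = { z : A_i z <= b_i, z >= 0 }; LP duality replaces the max by its dual:
\max_{z \in Z_i} \; (F_i^\top w + \ell_i)^\top z
  \;=\; \min_{\lambda_i \ge 0} \; b_i^\top \lambda_i
  \quad \text{s.t.} \quad A_i^\top \lambda_i \ \ge\ F_i^\top w + \ell_i .
% Substituting the dual into the margin constraint gives one joint QP:
\min_{w,\;\lambda \ge 0} \; \tfrac{1}{2}\|w\|^2
  \quad \text{s.t.} \quad
  w^\top f(x_i, y_i) \ \ge\ b_i^\top \lambda_i ,
  \qquad A_i^\top \lambda_i \ \ge\ F_i^\top w + \ell_i \quad \forall i .
```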

16
Experimental Setup
French–English Canadian Hansards corpus
Word-level aligned: 200 sentence pairs (training), 37 (validation), 247 (test)
Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM models
Learn using large margin
Evaluate alignment quality using standard AER (Alignment Error Rate) [similar to F1]

17
Old Results (200 train / 247 test split)
Model                        AER    Prec / Rec
IBM Model 4 (intersected)    6.5    98% / 88%
Basic                        8.2    93% / 90%
+ Model 4                    5.1    98% / 92%

18
Improving the basic model. We would like to model:
Fertility: alignments are not necessarily 1-to-1.
First-order interactions: alignments are mostly locally diagonal; we would like the score of $z_{jk}$ to depend on its neighbors.
Strategy: add extensions while keeping the prediction model an LP.

19
Modeling Fertility
Relax the degree constraints to allow multiple links per word, paying a fertility penalty for each extra link.
Example of a node feature: for word w, the fraction of times it had fertility > k on the training set.
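One plausible way to write the fertility-extended LP (an assumption about the exact parameterization; the paper's penalty structure may differ):

```latex
% Let d_j >= 0 count extra links for English word j, allowed up to D - 1
% and charged a per-unit penalty c_j (and symmetrically for French words):
\max_{z,\, d}\ \sum_{j,k} s_{jk}\, z_{jk} - \sum_j c_j\, d_j
\quad \text{s.t.} \quad
\sum_k z_{jk} \le 1 + d_j, \qquad 0 \le d_j \le D - 1, \qquad 0 \le z_{jk} \le 1 .
```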

21
Fertility Results (200 train / 247 test split)
Model                        AER    Prec / Rec
IBM Model 4 (intersected)    6.5    98% / 88%
Basic                        8.2    93% / 90%
+ Model 4                    5.1    98% / 92%
+ Model 4 + fertility        4.9    96% / 94%

22
Fertility example (figure: sure, possible, and predicted alignments)

23
Modeling First-Order Effects
Restrict the quadratic terms to three local patterns: monotonicity, local inversion, local fertility.
We want products of neighboring edge variables such as $z_{jk}\, z_{j+1,k+1}$ in the objective; the relaxation replaces each product $q = z_1 z_2$ with an auxiliary variable constrained by $q \le z_1$, $q \le z_2$, $q \ge z_1 + z_2 - 1$.

24
Integer program: this is a quadratic assignment problem, NP-complete in general; on real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables).
Interestingly, on our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and we get the same AER when using the relaxation!
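The standard trick for keeping such quadratic terms in a linear program is one auxiliary variable per product; a fragment using PuLP as a stand-in modeling layer (the authors used Mosek) looks like this:

```python
import pulp

# Linearize q = z1 * z2 for two 0/1 edge variables. The three inequalities
# are exact for binary z; dropping integrality gives the LP relaxation.
prob = pulp.LpProblem("qap_fragment", pulp.LpMaximize)
z1 = pulp.LpVariable("z1", 0, 1)
z2 = pulp.LpVariable("z2", 0, 1)
q = pulp.LpVariable("q", 0, 1)       # stands in for the product z1 * z2
prob += q <= z1                      # q can fire only if z1 fires
prob += q <= z2                      # ... and only if z2 fires
prob += q >= z1 + z2 - 1             # q must fire if both fire
```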

28
New Results (200 train / 247 test split)
Model                                  AER    Prec / Rec
IBM Model 4 (intersected)              6.5    98% / 88%
Basic                                  8.2    93% / 90%
+ Model 4                              5.1    98% / 92%
Basic + fertility + qap                6.1    94% / 93%
+ fertility + qap + Model 4            4.3    96% / 95%
+ fertility + qap + Model 4 + Liang    3.8    97% / 96%

29
Fert + qap example

31
Conclusions
Feature-based word alignment with efficient algorithms for supervised learning
Exploits unsupervised data via features and other models
Surprisingly accurate with simple features
Including the fertility model and first-order interactions gives a 38% AER reduction over intersected Model 4, the lowest published AER on this dataset
High-recall alignments → promising for MT

32
Structure on inputs: discLDA project (work in progress) (joint work with Fei Sha and Mike Jordan)

33
Unsupervised dimensionality reduction:
text documents → latent variable model → new representation → classification

34
Analogy: PCA vs. FDA (figure: scatter of two classes with the PCA and FDA directions)

35
Goal: supervised dimensionality reduction:
text documents → latent variable model with supervised information → new representation → classification

36
Review: LDA model
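For reference, the standard LDA generative process (the slide's figure is not preserved in this transcript):

```latex
% LDA: per-document topic proportions, per-token topic, then word.
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d \sim \mathrm{Mult}(\theta_d), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}\!\left(\phi_{z_{dn}}\right).
```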

37
Discriminative version of LDA
Ultimately, we want to learn the topics discriminatively → but that gives a high-dimensional non-convex objective, hard to optimize!
Instead, we propose to learn a class-dependent linear transformation $T^y$ of common topic proportions $\theta$.
New generative model: topics are drawn from $T^y \theta$. Equivalently, a transformation of the topic matrix $\Phi$, since $\Phi\, (T^y \theta) = (\Phi\, T^y)\, \theta$.
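Putting the pieces together, a hedged reconstruction of the discLDA generative process, where each class y owns a fixed linear map $T^y$ applied to the shared proportions $\theta$:

```latex
% discLDA: the class label reshapes the topic proportions through T^y.
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
z_{dn} \mid \theta_d, y_d \sim \mathrm{Mult}\!\left(T^{y_d}\, \theta_d\right), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Mult}\!\left(\phi_{z_{dn}}\right).
```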

38
Simplex Geometry (figure: documents of the two classes embedded in the word simplex (w1, w2, w3) and in the lower-dimensional topic simplex)

39
Interpretation 1: the structure of T splits the topics into shared topics (used by all classes) and class-specific topics.

40
Interpretation 2: fold T into the generative model by adding a new latent variable u: draw $u \sim \mathrm{Mult}(\theta_d)$, then $z \mid u, y \sim \mathrm{Mult}\!\left(T^y_{\cdot u}\right)$ (the u-th column of $T^y$).

41
Compare with the Author-Topic model [Rosen-Zvi et al. 2004] (figure: AT and discLDA plate diagrams side by side)

42
Inference and learning

43
Learning
For fixed T, learn $\Phi$ by sampling (z, u) [Rao-Blackwellized Gibbs sampling].
For fixed $\Phi$, update T using stochastic gradient ascent on the conditional log-likelihood, in an online fashion:
get an approximate gradient using Monte Carlo EM
use the Harmonic Mean estimator to estimate the likelihood terms
Currently, results are noisy…
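The objective driving the T-update, written out (a reconstruction consistent with the description above):

```latex
% Stochastic gradient ascent on the conditional log-likelihood of the
% labels, with Phi held fixed:
\max_{T}\; \sum_{d} \log p\!\left(y_d \mid w_d;\, T, \Phi\right)
  \;=\; \sum_{d} \log
  \frac{p(w_d \mid y_d;\, T, \Phi)\, p(y_d)}
       {\sum_{y'} p(w_d \mid y';\, T, \Phi)\, p(y')} .
```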

44
Inference (dimensionality reduction)
Given learned T and $\Phi$: estimate $p(w_d \mid y)$ using the Harmonic Mean estimator, then compute the posterior over topics by marginalizing over y to get the new representation of the document.

45
Preliminary Experiments

46
20 Newsgroups dataset
Used a fixed T, hence 110 topics total.
Get the reduced representation → train a linear SVM on it.
11k training documents, 7.5k test documents, vocabulary: 50k.
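A minimal sketch of this evaluation pipeline; reduce is a placeholder for the inference step that maps documents to their reduced representations, and the classifier mirrors the slide's linear SVM.

```python
from sklearn.svm import LinearSVC

def evaluate(reduce, train_docs, train_y, test_docs, test_y):
    """Train a linear SVM on reduced representations, report test accuracy."""
    clf = LinearSVC(C=1.0)
    clf.fit(reduce(train_docs), train_y)          # train on reduced features
    return clf.score(reduce(test_docs), test_y)   # accuracy on held-out docs
```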

47
Classification results
discLDA + SVM: 20% error
LDA + SVM: 25% error
discLDA predictions: 20% error

48
Newsgroup embedding (LDA)

49
Newsgroup embedding (discLDA)

50
using tSNE (on discLDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

51
using tSNE (on LDA); thanks to Laurens van der Maaten [Hinton's group] for the figure!

52
Learned topics

53
Another embedding: NIPS papers vs. Psychology abstracts (figures: LDA embedding vs. discLDA embedding)

54
13 scenes dataset [Fei-Fei 2005]: train: 100 images per category; test: 2558 images

55
Vocabulary (visual words)

56
Topics

57
Conclusion
A fixed transformation T enables topic sharing and exploration.
We get a reduced representation which preserves predictive power.
Noisy gradient estimates still work; this is work in progress, and we will probably try a variational approach instead.
