1 Section 6: Approximation-aware Training

2 Outline: Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

4 Modern NLP. (Diagram: Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization together make up modern NLP.)

5 Machine Learning for NLP. Linguistics inspires the structures we want to predict.

6 Machine Learning for NLP. Our model defines a score for each structure. (Figure: example structures with scores such as p_θ(·) = 0.50, 0.25, 0.10, 0.01.)

7 Machine Learning for NLP. Our model defines a score for each structure (e.g. p_θ(·) = 0.50, 0.25, 0.10, 0.01 for different candidate structures); it also tells us what to optimize.
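These scores come from the factor-graph models introduced earlier in the tutorial; the formula is not spelled out on this slide, so the following is a standard reconstruction of the globally normalized distribution the model defines over structures:

```latex
p_\theta(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{\alpha} \psi_\alpha(y_\alpha, x),
\qquad
Z(x) \;=\; \sum_{y'} \prod_{\alpha} \psi_\alpha(y'_\alpha, x).
```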

8 Machine Learning for NLP. Learning tunes the parameters of the model: given training instances {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, find the best model parameters θ.

11 Machine Learning for NLP. Inference finds the best structure for a new sentence: given a new sentence x_new, search over the set of all possible structures (often exponential in the size of x_new) and return the highest-scoring structure y*. (Inference is usually called as a subroutine in learning.)

12 Machine Learning for NLP. Inference finds the best structure for a new sentence: given a new sentence x_new, search over the set of all possible structures (often exponential in the size of x_new) and return the Minimum Bayes Risk (MBR) structure y*. (Inference is usually called as a subroutine in learning.)
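The MBR decision rule referred to here is standard, though its equation appears only as an image in the deck; the usual form is:

```latex
y^{*} \;=\; \operatorname*{argmin}_{\hat{y}} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x_{\text{new}})} \big[ \ell(\hat{y}, y) \big]
\;=\; \operatorname*{argmin}_{\hat{y}} \; \sum_{y} p_\theta(y \mid x_{\text{new}}) \, \ell(\hat{y}, y).
```

Under a Hamming-style loss that decomposes over variables, this reduces to picking each variable's most probable value under its marginal, which is why the (approximate) marginals from BP are exactly what the decoder consumes.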

13 Machine Learning for NLP. Inference finds the best structure for a new sentence: given a new sentence x_new, search over the set of all possible structures (often exponential in the size of x_new) and return the Minimum Bayes Risk (MBR) structure y*. Depending on the model, this search ranges from easy (polynomial time) to NP-hard. (Inference is usually called as a subroutine in learning.)

14 Modern NLP. Linguistics inspires the structures we want to predict. Our model defines a score for each structure, and it also tells us what to optimize. Learning tunes the parameters of the model. Inference finds the best structure for a new sentence (and is usually called as a subroutine in learning). (Diagram: Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization together make up modern NLP.)

15 An Abstraction for Modeling. (Figure: a factor graph over boolean variables, shown as a grid of checked and unchecked boxes.) Now we can work at this level of abstraction.

16 Training. Thus far, we’ve seen how to compute (approximate) marginals, given a factor graph… but where do the potential tables ψ_α come from?
– Some have a fixed structure (e.g. Exactly1, CKYTree).
– Others could be trained ahead of time (e.g. TrigramHMM).
– For the rest, we define them parametrically and learn the parameters!
Two ways to learn:
1. Standard CRF training (very simple; often yields state-of-the-art results).
2. ERMA (less simple; but takes approximations and the loss function into account).

17 Standard CRF Parameterization. Define each potential function in terms of a fixed set of feature functions of the observed variables and the predicted variables.
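The defining equation is an image in the original slide; the standard log-linear parameterization it presumably shows is:

```latex
\psi_\alpha(y_\alpha, x) \;=\; \exp\Big( \sum_{k} \theta_k \, f_k(y_\alpha, x) \Big)
\;=\; \exp\big( \theta \cdot f_\alpha(y_\alpha, x) \big),
```

where each f_k is a fixed feature function of the predicted variables y_α touched by factor α and of the observed variables x, and θ is the shared parameter vector to be learned.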

18 Standard CRF Parameterization. Define each potential function in terms of a fixed set of feature functions. (Figure: a chain factor graph for the sentence “time flies like an arrow” with predicted tags n, v, p, d, n; unary factors ψ_1, ψ_3, ψ_5, ψ_7, ψ_9 and transition factors ψ_2, ψ_4, ψ_6, ψ_8.)

19 Standard CRF Parameterization. Define each potential function in terms of a fixed set of feature functions. (Figure: the same factor graph for “time flies like an arrow”, now extended with constituent variables np, vp, pp, s and factors ψ_10, ψ_11, ψ_12, ψ_13.)
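To make the parameterization concrete, here is a small hedged sketch in Python for the tagging example on these slides; the tag set, feature names, and weight values are invented for illustration and are not from the tutorial's software.

```python
import math

TAGS = ["n", "v", "p", "d"]          # toy tag set from the example sentence

# Hypothetical learned weights theta, keyed by feature name.
theta = {
    "transition:n-v": 1.2,
    "transition:v-p": 0.8,
    "word=flies,tag=v": 0.5,
}

def features(prev_tag, tag, word):
    """Fixed feature functions for one transition factor (binary features)."""
    return [f"transition:{prev_tag}-{tag}", f"word={word},tag={tag}"]

def potential(prev_tag, tag, word):
    """psi_alpha(y_alpha, x) = exp(theta . f_alpha(y_alpha, x))."""
    score = sum(theta.get(f, 0.0) for f in features(prev_tag, tag, word))
    return math.exp(score)

# Potential table for the factor linking the tags of "time" and "flies".
table = {(a, b): potential(a, b, word="flies") for a in TAGS for b in TAGS}
print(table[("n", "v")])   # exp(1.2 + 0.5): noun->verb transition plus word feature
```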

20 What is Training? That’s easy: Training = picking good model parameters! But how do we know if the model parameters are any “good”?

21 Conditional Log-likelihood Training.
1. Choose model such that the derivative in #3 is easy.
2. Choose objective: assign high probability to the things we observe and low probability to everything else.
3. Compute the derivative by hand using the chain rule.
4. Replace exact inference by approximate inference.

22 Conditional Log-likelihood Training.
1. Choose model such that the derivative in #3 is easy.
2. Choose objective: assign high probability to the things we observe and low probability to everything else.
3. Compute the derivative by hand using the chain rule.
4. Replace exact inference by approximate inference.
We can approximate the factor marginals by the (normalized) factor beliefs from BP!
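The derivative in step 3 has the familiar "observed minus expected features" form, and the expectation is exactly where the factor marginals enter (a standard reconstruction; the slide shows it as an image):

```latex
\frac{\partial}{\partial \theta} \log p_\theta(y \mid x)
\;=\; \sum_{\alpha} f_\alpha(y_\alpha, x)
\;-\; \sum_{\alpha} \sum_{y'_\alpha} p_\theta(y'_\alpha \mid x) \, f_\alpha(y'_\alpha, x).
```

Step 4 replaces the factor marginals p_θ(y'_α | x) with BP's normalized factor beliefs b_α(y'_α).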

23 Stochastic Gradient Descent.
Input: training data {(x^(i), y^(i)) : 1 ≤ i ≤ N} and initial model parameters θ.
Output: trained model parameters θ.
Algorithm: while not converged:
– Sample a training example (x^(i), y^(i)).
– Compute the gradient of log p_θ(y^(i) | x^(i)) with respect to the model parameters θ.
– Take a (small) step in the direction of the gradient.
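A minimal sketch of this algorithm in Python, assuming a hypothetical `model` object with `init_params` and `log_prob_gradient` methods (these names are placeholders, not the tutorial's actual API):

```python
import random

def sgd_train(model, data, step_size=0.1, num_epochs=10):
    """Stochastic gradient ascent on log p_theta(y | x), as on the slide."""
    theta = dict(model.init_params())
    for _ in range(num_epochs):                           # stands in for "while not converged"
        for x_i, y_i in random.sample(data, len(data)):   # visit training examples in random order
            # Gradient of log p_theta(y_i | x_i) w.r.t. theta
            # (inference runs inside this call).
            grad = model.log_prob_gradient(theta, x_i, y_i)
            # Take a (small) step in the direction of the gradient.
            for k, g in grad.items():
                theta[k] = theta.get(k, 0.0) + step_size * g
    return theta
```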

24 What’s wrong with the usual approach? If you add too many factors, your predictions might get worse! The model might be richer, but we replace the true marginals with approximate marginals (e.g. beliefs computed by BP). Approximate inference can cause gradients for structured learning to go awry (Kulesza & Pereira, 2008).

25 What’s wrong with the usual approach? Mistakes made by standard CRF training:
1. Using BP (approximate).
2. Not taking the loss function into account.
3. Should be doing MBR decoding.
The result is a big pile of approximations… which has tunable parameters. Treat it like a neural net, and run backprop!

26 Modern NLP. Linguistics inspires the structures we want to predict. Our model defines a score for each structure, and it also tells us what to optimize. Learning tunes the parameters of the model. Inference finds the best structure for a new sentence (and is usually called as a subroutine in learning). (Diagram: Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization together make up modern NLP.)

27 Empirical Risk Minimization.
1. Given training data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), …}.
2. Choose each of these:
– Decision function (examples: linear regression, logistic regression, neural network).
– Loss function (examples: mean-squared error, cross entropy).

28 Empirical Risk Minimization.
1. Given training data.
2. Choose each of these: a decision function and a loss function.
3. Define the goal: minimize the average loss on the training data.
4. Train with SGD (take small steps opposite the gradient).
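The goal in step 3 and the update in step 4 are shown as images on the slide; in standard notation, with decision function h_θ and loss ℓ, they are:

```latex
\theta^{*} \;=\; \operatorname*{argmin}_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( h_\theta(x^{(i)}),\, y^{(i)} \big),
\qquad
\theta \;\leftarrow\; \theta \;-\; \eta \, \nabla_\theta \, \ell\big( h_\theta(x^{(i)}),\, y^{(i)} \big).
```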

30 Conditional Log-likelihood Training.
1. Choose model such that the derivative in #3 is easy.
2. Choose objective: assign high probability to the things we observe and low probability to everything else.
3. Compute the derivative by hand using the chain rule.
4. Replace true inference by approximate inference.

31 What went wrong? How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!

32–40 Error Back-Propagation. (Animated figure from Stoyanov & Eisner, 2012, built up step by step across these slides.)

41 Error Back-Propagation. (Figure from Stoyanov & Eisner, 2012: the BP computation graph runs from the parameters θ, through message products such as μ(y_1 → y_2) = μ(y_3 → y_1) · μ(y_4 → y_1), up to output marginals such as P(y_3 = noun | x).)

42 Error Back-Propagation: applying the chain rule of differentiation over and over.
– Forward pass: regular computation (inference + decoding) in the model, remembering intermediate quantities.
– Backward pass: replay the forward pass in reverse, computing gradients.
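A toy illustration of this recipe on a scalar computation (deliberately much simpler than the BP computation graph): the forward pass remembers its intermediates, and the backward pass replays them in reverse while applying the chain rule.

```python
import math

def forward(theta, x):
    """Forward pass: compute and remember intermediate quantities."""
    a = theta * x          # intermediate 1
    b = math.tanh(a)       # intermediate 2
    loss = (b - 1.0) ** 2  # final loss
    return loss, (x, a, b)

def backward(intermediates):
    """Backward pass: replay the forward pass in reverse, computing gradients."""
    x, a, b = intermediates
    d_b = 2.0 * (b - 1.0)          # d loss / d b
    d_a = d_b * (1.0 - b ** 2)     # d loss / d a   (tanh'(a) = 1 - tanh(a)^2)
    d_theta = d_a * x              # d loss / d theta
    return d_theta

loss, saved = forward(theta=0.5, x=2.0)
print(loss, backward(saved))
```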

43 Background: Backprop Through Time (Robinson & Fallside, 1987; Werbos, 1988; Mozer, 1995). Given a recurrent neural network, BPTT:
1. Unrolls the computation over time (the figure unrolls a recurrent cell with shared parameters a, inputs x_1, …, x_4, hidden states b_1, b_2, b_3, and output y_4).
2. Runs backprop through the resulting feed-forward network.
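Below is a hedged sketch of BPTT for a scalar version of the recurrent net in the figure (state b_t = tanh(a · b_{t-1} + x_t), squared-error loss on the last state); the parameter and variable names follow the figure loosely and the numbers are made up.

```python
import numpy as np

def bptt(a, xs, target):
    """Unroll b_t = tanh(a * b_{t-1} + x_t) over time, then backprop through it."""
    # 1. Unroll the computation over time (forward pass, remembering states).
    bs = [0.0]                               # b_0
    for x in xs:
        bs.append(np.tanh(a * bs[-1] + x))
    loss = 0.5 * (bs[-1] - target) ** 2

    # 2. Run backprop through the resulting feed-forward network (reverse order).
    d_b = bs[-1] - target
    d_a = 0.0
    for t in range(len(xs), 0, -1):
        d_pre = d_b * (1.0 - bs[t] ** 2)     # through tanh at step t
        d_a += d_pre * bs[t - 1]             # gradient contribution to the shared weight a
        d_b = d_pre * a                      # pass gradient back to b_{t-1}
    return loss, d_a

print(bptt(a=0.5, xs=[1.0, -0.5, 0.25], target=1.0))
```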

44 What went wrong? How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!

45 ERMA: Empirical Risk Minimization under Approximations (Stoyanov, Ropson, & Eisner, 2011). Apply backprop through time to loopy BP: unroll the BP computation graph, which includes inference, decoding, the loss, and all the approximations along the way.

46 ERMA (Stoyanov, Ropson, & Eisner, 2011). Key idea: open up the black box!
1. Choose the model to be the computation with all its approximations.
2. Choose the objective to likewise include the approximations.
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network).
4. Make no approximations! (Our gradient is exact.)
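The sketch below illustrates this recipe under simplifying assumptions: a tiny loopy pairwise model, a fixed number of BP iterations unrolled as ordinary tensor operations, and reverse-mode autodiff (PyTorch here, purely for illustration; the ERMA paper predates such toolkits and derives the backward pass by hand).

```python
import torch

# Toy pairwise model: 3 binary variables connected in a cycle (so BP is loopy).
theta = torch.randn(3, 2, requires_grad=True)                   # unary log-potentials
edges = [(0, 1), (1, 2), (2, 0)]
W = {e: torch.randn(2, 2, requires_grad=True) for e in edges}   # pairwise log-potentials
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def pair_pot(psi, i, j):
    """Pairwise potential as a matrix indexed [value of i, value of j]."""
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].t()

def beliefs_after_bp(T=5):
    """Unroll T iterations of loopy sum-product BP as differentiable tensor ops."""
    phi = theta.exp()
    psi = {e: W[e].exp() for e in edges}
    msgs = {(i, j): torch.ones(2) for i in neighbors for j in neighbors[i]}
    for _ in range(T):
        new = {}
        for i in neighbors:
            for j in neighbors[i]:
                prod = phi[i]
                for k in neighbors[i]:
                    if k != j:
                        prod = prod * msgs[(k, i)]
                m = pair_pot(psi, i, j).t() @ prod     # sum over the values of i
                new[(i, j)] = m / m.sum()              # normalize for stability
        msgs = new
    beliefs = []
    for i in neighbors:
        b = phi[i]
        for k in neighbors[i]:
            b = b * msgs[(k, i)]
        beliefs.append(b / b.sum())
    return torch.stack(beliefs)

# Loss on the beliefs (stand-in for decoding + loss); backprop gives its exact gradient.
gold = torch.tensor([0, 1, 1])
loss = torch.nn.functional.nll_loss(beliefs_after_bp().log(), gold)
loss.backward()
print(theta.grad)   # gradient of the loss of the *approximate* inference pipeline
```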

47 ERMA: Empirical Risk Minimization. Key idea: open up the black box! The black box here includes the Minimum Bayes Risk (MBR) decoder. (Stoyanov, Ropson, & Eisner, 2011)
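Schematically, then, the objective optimized by ERMA is the empirical risk of the whole approximate pipeline (a reconstruction of the slide's figure, which appears only as an image):

```latex
\theta^{*} \;=\; \operatorname*{argmin}_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
\ell\big( \underbrace{h_\theta(x^{(i)})}_{\text{approximate BP $\to$ MBR decoding}},\; y^{(i)} \big),
```

where h_θ runs approximate BP inference followed by MBR decoding; since every stage is differentiable (or replaced by a differentiable surrogate, e.g. a softened decoder for a non-differentiable loss), backprop computes the gradient of this objective exactly.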

48 Approximation-aware Learning (Gormley, Dredze, & Eisner, 2015). What if we’re using structured BP instead of regular BP? No problem, the same approach still applies! The only difference is that we embed dynamic programming algorithms inside our computation graph. Key idea: open up the black box!
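As a hedged illustration of a dynamic program living inside the computation graph (again in PyTorch for convenience; the 2015 paper derives these gradients by hand within its own BP implementation), here is the forward algorithm for a linear chain written entirely with differentiable tensor operations:

```python
import torch

def forward_algorithm(unary, transition):
    """Log-partition of a linear-chain model, computed by the forward DP.

    unary:      (T, K) log-potentials for each position and tag
    transition: (K, K) log-potentials for tag bigrams
    Every step is an ordinary tensor op, so the whole DP is differentiable.
    """
    alpha = unary[0]                                   # (K,)
    for t in range(1, unary.shape[0]):
        # alpha_t(k) = logsumexp_j [ alpha_{t-1}(j) + transition(j, k) ] + unary_t(k)
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transition, dim=0) + unary[t]
    return torch.logsumexp(alpha, dim=0)               # log Z

T, K = 5, 4
unary = torch.randn(T, K, requires_grad=True)
transition = torch.randn(K, K, requires_grad=True)
log_Z = forward_algorithm(unary, transition)
log_Z.backward()
# d log Z / d unary equals the marginal tag probabilities at each position.
print(unary.grad)
```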

49 Connection to Deep Learning (Gormley, Yu, & Dredze, in submission). (Figure: a log-linear classifier of the form p(y | x) ∝ exp(Θ_y · f(x)).)

50 Empirical Risk Minimization under Approximations (ERMA). The 2x2 comparison in the figure (from Stoyanov & Eisner, 2012) classifies training methods by whether they are loss-aware and approximation-aware: MLE is neither; SVM^struct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], and softmax-margin [Gimpel & Smith, 2010] are loss-aware but not approximation-aware; ERMA is both.

