2 Structured Belief Propagation for NLP Matthew R. Gormley & Jason Eisner ACL '14 Tutorial, June 22, 2014. For the latest version of these slides, please visit:

3 Language has a lot going on at once: structured representations of utterances, structured knowledge of the language. Many interacting parts for BP to reason about!

4 Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you! 1. Models: Factor graphs can express interactions among linguistic structures. 2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations. 3. Intuitions: What's going on here? Can we trust BP's estimates? 4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors. 5. Tweaked Algorithm: Finish in fewer steps and make the steps faster. 6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions. 7. Software: Build the model you want!

12 Section 1: Introduction Modeling with Factor Graphs 12

13 Sampling from a Joint Distribution [Figure: a chain-shaped factor graph over tags X0–X5 (factors ψ0–ψ9) for the sentence "time flies like an arrow", with six sample taggings: Sample 1: n v p d n; Sample 2: n n v d n; Sample 3: n v p d n; Sample 4: v n p d n; Sample 5: v n v d n; Sample 6: n v p d n] A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

14 Sampling from a Joint Distribution [Figure: a grid-shaped factor graph over variables X1–X7 with factors ψ1–ψ12, and four samples (assignments to the grid) drawn from it] A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

15 Sampling from a Joint Distribution [Figure: the factor graph now includes word variables W1–W5 as well as tags X0–X5, so each sample is a (tag sequence, word sequence) pair; different samples can have different words, e.g. "time flies like an arrow" tagged n v p d n or n n v d n, and other samples using the words "flies fly with their wings" and "with you time will see"] A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

16 Factors have local opinions (≥ 0) [Figure: the tagging factor graph over tags X0–X5, words W1–W5, and factors ψ0–ψ9] Each black box looks at some of the tags Xi and words Wi. Note: We chose to reuse the same factors at different positions in the sentence.

17 Factors have local opinions (≥ 0) [Figure: the same factor graph with a specific assignment: tags n v p d n over the words "time flies like an arrow"] Each black box looks at some of the tags Xi and words Wi. p( n, v, p, d, n, time, flies, like, an, arrow ) = ?

18 Global probability = product of local opinions [Figure: the same assignment, with each factor's numeric opinion shown] Each black box looks at some of the tags Xi and words Wi. p( n, v, p, d, n, time, flies, like, an, arrow ) = (4 * 8 * 5 * 3 * …) Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.

19 Markov Random Field (MRF) [Figure: the same factor graph and assignment] p( n, v, p, d, n, time, flies, like, an, arrow ) = (4 * 8 * 5 * 3 * …) Joint distribution over tags Xi and words Wi. The individual factors aren't necessarily probabilities.
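
To make the normalization concrete, here is a minimal sketch (not from the slides; the tag set and potential values are made up) that multiplies local factor opinions into an unnormalized score and computes Z by brute-force enumeration:

    import itertools

    # Toy MRF over two adjacent tags.  The potentials are invented for illustration.
    TAGS = ["n", "v", "d"]
    psi_tag = {"n": 4.0, "v": 8.0, "d": 0.1}         # a unary factor's opinion of each tag
    psi_pair = {("n", "v"): 5.0, ("v", "n"): 3.0}    # a binary factor's opinion (default 1.0)

    def score(t1, t2):
        """Unnormalized probability: the product of the local opinions."""
        return psi_tag[t1] * psi_tag[t2] * psi_pair.get((t1, t2), 1.0)

    # Z sums the scores of all joint assignments, so that score / Z sums to 1.
    Z = sum(score(t1, t2) for t1, t2 in itertools.product(TAGS, repeat=2))
    print(score("n", "v") / Z)    # normalized probability of the assignment (n, v)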

20 Hidden Markov Model [Figure: the same chain, drawn as an HMM with tags n v p d n over "time flies like an arrow"] But sometimes we choose to make them probabilities. Constrain each row of a factor to sum to one. Now Z = 1. p( n, v, p, d, n, time, flies, like, an, arrow ) = (.3 * .8 * .2 * .5 * …)

21 Markov Random Field (MRF) [Figure: the factor-graph version again] p( n, v, p, d, n, time, flies, like, an, arrow ) = (4 * 8 * 5 * 3 * …) Joint distribution over tags Xi and words Wi.

22 Conditional Random Field (CRF) [Figure: the same graph, but with the words observed] Conditional distribution over tags Xi given words wi. The factors and Z are now specific to the sentence w. p( n, v, p, d, n | time, flies, like, an, arrow ) = (4 * 8 * 5 * 3 * …)

23 How General Are Factor Graphs? Factor graphs can be used to describe – Markov Random Fields (undirected graphical models) i.e., log-linear models over a tuple of variables – Conditional Random Fields – Bayesian Networks (directed graphical models) Inference treats all of these interchangeably. – Convert your model to a factor graph first. – Pearl (1988) gave key strategies for exact inference: Belief propagation, for inference on acyclic graphs Junction tree algorithm, for making any graph acyclic (by merging variables and factors: blows up the runtime)

24 Object-Oriented Analogy What is a sample? A datum: an immutable object that describes a linguistic structure. What is the sample space? The class of all possible sample objects. What is a random variable? An accessor method of the class, e.g., one that returns a certain field. – Will give different values when called on different random samples.

    class Tagging {
        int n;                                         // length of sentence
        Word[] w;                                      // array of n words (values w_i)
        Tag[] t;                                       // array of n tags (values t_i)
        Word W(int i) { return w[i]; }                 // random var W_i
        Tag T(int i) { return t[i]; }                  // random var T_i
        String S(int i) { return suffix(w[i], 3); }    // random var S_i (3-character suffix of w_i)
    }

Random variable W_5 takes value w_5 == "arrow" in this sample.

25 Object-Oriented Analogy What is a sample? A datum: an immutable object that describes a linguistic structure. What is the sample space? The class of all possible sample objects. What is a random variable? An accessor method of the class, e.g., one that returns a certain field. A model is represented by a different object. What is a factor of the model? A method of the model that computes a number ≥ 0 from a sample, based on the sample's values of a few random variables, and parameters stored in the model. What probability does the model assign to a sample? A product of its factors (rescaled). E.g., uprob(tagging) / Z(). How do you find the scaling factor? Add up the unnormalized probabilities of all possible samples. If the result Z != 1, divide the probabilities by that Z.

    class TaggingModel {
        float transition(Tagging tagging, int i) {   // tag-tag bigram factor
            return tparam[tagging.T(i-1)][tagging.T(i)];
        }
        float emission(Tagging tagging, int i) {     // tag-word bigram factor
            return eparam[tagging.T(i)][tagging.W(i)];
        }
        float uprob(Tagging tagging) {               // unnormalized probability = product of factors
            float p = 1;
            for (int i = 1; i <= tagging.n; i++) {
                p *= transition(tagging, i) * emission(tagging, i);
            }
            return p;
        }
    }

26 Modeling with Factor Graphs Factor graphs can be used to model many linguistic structures. Here we highlight a few example NLP tasks. – People have used BP for all of these. We’ll describe how variables and factors were used to describe structures and the interactions among their parts. 26

27 Annotating a Tree 27 Given: a sentence and unlabeled parse tree. nv p d n time like flies anarrow np vp pp s

28 Annotating a Tree Given: a sentence and unlabeled parse tree. Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals. [Figure: variables X1–X9 connected by factors ψ1–ψ13 over "time flies like an arrow"]

29 Annotating a Tree Given: a sentence and unlabeled parse tree. Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals. [Figure: the same graph with an assignment: tags n v p d n and nonterminals np, pp, vp, s]

30 Annotating a Tree Given: a sentence and unlabeled parse tree. Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals. We could add a linear chain structure between tags. (This creates cycles!) [Figure: the same graph with chain factors added between adjacent tags]

31 Constituency Parsing What if we needed to predict the tree structure too? Use more variables: Predict the nonterminal of each substring, or ∅ if it's not a constituent. [Figure: the annotated tree plus extra span variables, currently all ∅]

32 Constituency Parsing What if we needed to predict the tree structure too? Use more variables: Predict the nonterminal of each substring, or ∅ if it's not a constituent. But nothing prevents non-tree structures. [Figure: one of the extra span variables takes the value s, creating a structure that is not a tree]

34 Constituency Parsing What if we needed to predict the tree structure too? Use more variables: Predict the nonterminal of each substring, or ∅ if it's not a constituent. But nothing prevents non-tree structures. Add a factor which multiplies in 1 if the variables form a tree and 0 otherwise. [Figure: a global tree-constraint factor attached to all the span variables]

35 Constituency Parsing What if we needed to predict the tree structure too? Use more variables: Predict the nonterminal of each substring, or ∅ if it's not a constituent. But nothing prevents non-tree structures. Add a factor which multiplies in 1 if the variables form a tree and 0 otherwise. [Figure: with the tree-constraint factor, the offending span variable returns to ∅]

36 Constituency Parsing Variables: – Constituent type (or ∅ ) for each of O(n 2 ) substrings Interactions: – Constituents must describe a binary tree – Tag bigrams – Nonterminal triples (parent, left-child, right-child) [these factors not shown] 36 Example Task: (Naradowsky, Vieira, & Smith, 2012) nv p d n time like flies anarrow np vp pp s n ψ1ψ1 ψ2ψ2 v ψ3ψ3 ψ4ψ4 p ψ5ψ5 ψ6ψ6 d ψ7ψ7 ψ8ψ8 n ψ9ψ9 time like flies anarrow np ψ 10 vp ψ 12 pp ψ 11 s ψ 13 ∅ ψ 10 ∅ ∅ ∅ ∅ ∅

37 Dependency Parsing Variables: – POS tag for each word – Syntactic label (or ∅ ) for each of O(n 2 ) possible directed arcs Interactions: – Arcs must form a tree – Discourage (or forbid) crossing edges – Features on edge pairs that share a vertex 37 (Smith & Eisner, 2008) Example Task: time like flies anarrow *Figure from Burkett & Klein (2012) Learn to discourage a verb from having 2 objects, etc. Learn to encourage specific multi-arc constructions

38 Joint CCG Parsing and Supertagging Variables: – Spans – Labels on non- terminals – Supertags on pre- terminals Interactions: – Spans must form a tree – Triples of labels: parent, left-child, and right-child – Adjacent tags 38 (Auli & Lopez, 2011) Example Task:

39 39 Figure thanks to Markus Dreyer Example task: Transliteration or Back-Transliteration Variables (string): – English and Japanese orthographic strings – English and Japanese phonological strings Interactions: – All pairs of strings could be relevant

40 Variables (string): – Inflected forms of the same verb Interactions: – Between pairs of entries in the table (e.g. infinitive form affects present- singular) 40 (Dreyer & Eisner, 2009) Example task: Morphological Paradigms

41 Word Alignment / Phrase Extraction Variables (boolean): – For each (Chinese phrase, English phrase) pair, are they linked? Interactions: – Word fertilities – Few “jumps” (discontinuities) – Syntactic reorderings – “ITG contraint” on alignment – Phrases are disjoint (?) 41 (Burkett & Klein, 2012) Application:

42 Congressional Voting 42 (Stoyanov & Eisner, 2012) Application:

43 Semantic Role Labeling with Latent Syntax Variables: – Semantic predicate sense – Semantic dependency arcs – Labels of semantic arcs – Latent syntactic dependency arcs Interactions: – Pairs of syntactic and semantic dependencies – Syntactic dependency arcs must form a tree (Naradowsky, Riedel, & Smith, 2012) (Gormley, Mitchell, Van Durme, & Dredze, 2014) Application: [Figure: the example sentence "The barista made coffee" with semantic arcs (arg0, arg1) and boolean variables R i,j / L i,j for every possible right-pointing or left-pointing syntactic dependency arc]

44 Joint NER & Sentiment Analysis Variables: – Named entity spans – Sentiment directed toward each entity Interactions: – Words and entities – Entities and sentiment 44 (Mitchell, Aguilar, Wilson, & Van Durme, 2013) Application: love I MarkTwain PERSON POSITIVE

45 Variable-centric view of the world: When we deeply understand language, what representations (type and token) does that understanding comprise?

46 [Figure: a web of interrelated representations over N tokens: lexicon (word types), semantics, sentences, discourse, context, resources; linked by relations such as entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation] To recover variables, model and exploit their correlations.

47 Section 2: Belief Propagation Basics 47

50 Factor Graph Notation 50 Variables: Factors: Joint Distribution
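
The equations on this slide are images and do not survive in the transcript; in standard factor-graph notation (which is what the slide depicts) the joint distribution is

    p(\mathbf{x}) \;=\; \frac{1}{Z} \prod_{\alpha} \psi_{\alpha}(\mathbf{x}_{\alpha}),
    \qquad
    Z \;=\; \sum_{\mathbf{x}} \prod_{\alpha} \psi_{\alpha}(\mathbf{x}_{\alpha})

where each variable Xi ranges over a finite set of values, each factor ψα is a nonnegative function (a tensor) of the subset of variables xα it touches, and Z rescales the product of factors into a probability distribution.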

51 Factors are Tensors 51 Factors: s vp pp

52 Inference Given a factor graph, two common tasks … – Compute the most likely joint assignment, x * = argmax x p(X=x) – Compute the marginal distribution of variable X i : p(X i =x i ) for each value x i Both consider all joint assignments. Both are NP-Hard in general. So, we turn to approximations. 52 p(X i =x i ) = sum of p(X=x) over joint assignments with X i =x i

53 Marginals by Sampling on Factor Graph 53 time like flies anarrow X1X1 ψ2ψ2 X2X2 ψ4ψ4 X3X3 ψ6ψ6 X4X4 ψ8ψ8 X5X5 ψ1ψ1 ψ3ψ3 ψ5ψ5 ψ7ψ7 ψ9ψ9 ψ0ψ0 X0X0 nv p d n Sample 6: vn v d n Sample 5: vn p d n Sample 4: nv p d n Sample 3: nn v d n Sample 2: nv p d n Sample 1: Suppose we took many samples from the distribution over taggings:

54 Marginals by Sampling on Factor Graph 54 time like flies anarrow X1X1 ψ2ψ2 X2X2 ψ4ψ4 X3X3 ψ6ψ6 X4X4 ψ8ψ8 X5X5 ψ1ψ1 ψ3ψ3 ψ5ψ5 ψ7ψ7 ψ9ψ9 ψ0ψ0 X0X0 nv p d n Sample 6: vn v d n Sample 5: vn p d n Sample 4: nv p d n Sample 3: nn v d n Sample 2: nv p d n Sample 1: The marginal p(X i = x i ) gives the probability that variable X i takes value x i in a random sample

55 Marginals by Sampling on Factor Graph [Figure: the same factor graph and the six sample taggings] Estimate the marginals as relative frequencies over the samples, e.g. X1: n 4/6, v 2/6; X2: n 3/6, v 3/6; X3: p 4/6, v 2/6; X4: d 6/6; X5: n 6/6.
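
A small sketch of this estimate (the sample taggings are read off the slide; the helper function is mine):

    from collections import Counter

    # The six sample taggings of "time flies like an arrow" shown on the slides.
    samples = [
        ["n", "v", "p", "d", "n"],   # Sample 1
        ["n", "n", "v", "d", "n"],   # Sample 2
        ["n", "v", "p", "d", "n"],   # Sample 3
        ["v", "n", "p", "d", "n"],   # Sample 4
        ["v", "n", "v", "d", "n"],   # Sample 5
        ["n", "v", "p", "d", "n"],   # Sample 6
    ]

    def estimated_marginal(i):
        """Relative frequency of each tag at position i (0-based) across the samples."""
        counts = Counter(sample[i] for sample in samples)
        return {tag: count / len(samples) for tag, count in counts.items()}

    print(estimated_marginal(0))   # {'n': 0.666..., 'v': 0.333...}, i.e. 4/6 and 2/6 for X1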

56 Sampling one joint assignment is also NP-hard in general. – In practice: Use MCMC (e.g., Gibbs sampling) as an anytime algorithm. – So draw an approximate sample fast, or run longer for a “good” sample. Sampling finds the high-probability values x i efficiently. But it takes too many samples to see the low-probability ones. – How do you find p(“The quick brown fox …”) under a language model? Draw random sentences to see how often you get it? Takes a long time. Or multiply factors (trigram probabilities)? That’s what BP would do. 56 How do we get marginals without sampling? That’s what Belief Propagation is all about! Why not just sample?

57 Great Ideas in ML: Message Passing Count the soldiers [Figure: a line of soldiers; each passes along "N behind you" / "N before you" counts (1–5 behind you, 1–5 before you, "there's 1 of me")] adapted from MacKay (2003) textbook

58 Great Ideas in ML: Message Passing Count the soldiers: a soldier hears "2 before you" and "3 behind you", and there's 1 of him. Belief: Must be 2 + 1 + 3 = 6 of us (only see my incoming messages). adapted from MacKay (2003) textbook

59 Great Ideas in ML: Message Passing Count the soldiers: the next soldier hears "1 before you" and "4 behind you", and there's 1 of him. Belief: Must be 1 + 1 + 4 = 6 of us (only see my incoming messages); the previous soldier's belief was 2 + 1 + 3 = 6 of us. adapted from MacKay (2003) textbook

60 Great Ideas in ML: Message Passing 7 here 3 here 11 here (= 7+3+1) 1 of me Each soldier receives reports from all branches of tree 60 adapted from MacKay (2003) textbook

61 Great Ideas in ML: Message Passing 3 here 7 here (= 3+3+1) Each soldier receives reports from all branches of tree 61 adapted from MacKay (2003) textbook

62 Great Ideas in ML: Message Passing 7 here 3 here 11 here (= 7+3+1) Each soldier receives reports from all branches of tree 62 adapted from MacKay (2003) textbook

63 Great Ideas in ML: Message Passing 7 here 3 here Belief: Must be 14 of us Each soldier receives reports from all branches of tree 63 adapted from MacKay (2003) textbook

64 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 7 here 3 here Belief: Must be 14 of us wouldn't work correctly with a 'loopy' (cyclic) graph 64 adapted from MacKay (2003) textbook

65 Message Passing in Belief Propagation 65 X Ψ … … … … My other factors think I’m a noun But my other variables and I think you’re a verb v 1 n 6 a 3 v 6 n 1 a 3 v 6 n 1 a 9 Both of these messages judge the possible values of variable X. Their product = belief at X = product of all 3 messages to X.

66 Sum-Product Belief Propagation 66 Beliefs Messages VariablesFactors X2X2 ψ1ψ1 X1X1 X3X3 X1X1 ψ2ψ2 ψ3ψ3 ψ1ψ1 X1X1 ψ2ψ2 ψ3ψ3 ψ1ψ1 X2X2 ψ1ψ1 X1X1 X3X3

67 X1X1 ψ2ψ2 ψ3ψ3 ψ1ψ1 Sum-Product Belief Propagation 67 v.4 n 6 p 0 Variable Belief

68 X1X1 ψ2ψ2 ψ3ψ3 ψ1ψ1 Sum-Product Belief Propagation 68 Variable Message

69 Sum-Product Belief Propagation 69 Factor Belief ψ1ψ1 X1X1 X3X3 vn p d 0.17 n 91

70 Sum-Product Belief Propagation 70 Factor Belief ψ1ψ1 X1X1 X3X3 vn p d 0.17 n 91

71 Sum-Product Belief Propagation 71 Factor Message ψ1ψ1 X1X1 X3X3

72 Sum-Product Belief Propagation Factor Message 72 ψ1ψ1 X1X1 X3X3 matrix-vector product (for a binary factor)
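
The message equations themselves appear only as images on these slides. As a hedged sketch (variable and factor names are mine, the potentials invented), the two sum-product updates for a binary factor look like this; the factor-to-variable message is literally a matrix-vector product:

    import numpy as np

    # psi[x_i, x_j]: a binary factor's opinion of each (x_i, x_j) pair, tag order [v, n, p].
    psi = np.array([[1.0, 6.0, 3.0],
                    [8.0, 4.0, 2.0],
                    [0.5, 0.1, 7.0]])

    def msg_var_to_factor(msgs_from_other_factors):
        """Variable -> factor: multiply the messages arriving from all *other* factors."""
        return np.prod(np.stack(msgs_from_other_factors), axis=0)

    def msg_factor_to_var(msg_from_other_var):
        """Factor -> variable: sum out the other variable, weighting by its incoming message."""
        return psi @ msg_from_other_var          # matrix-vector product

    incoming = msg_var_to_factor([np.array([1.0, 6.0, 3.0]), np.array([0.2, 0.2, 0.6])])
    out = msg_factor_to_var(incoming)
    print(out / out.sum())                       # messages are usually renormalized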

73 Sum-Product Belief Propagation Input: a factor graph with no cycles. Output: exact marginals for each variable and factor. Algorithm: 1. Initialize the messages to the uniform distribution. 2. Choose a root node. 3. Send messages from the leaves to the root. 4. Send messages from the root to the leaves. 5. Compute the beliefs (unnormalized marginals). 6. Normalize beliefs and return the exact marginals.

76 CRF Tagging Model [Figure: a three-variable chain X1 X2 X3 over the words "find preferred tags"; each position could take several tags: could be verb or noun, could be adjective or verb, could be noun or verb]

77 CRF Tagging by Belief Propagation [Figure: the chain over "find preferred tags" with an α (forward) message and a β (backward) message arriving at the middle variable, transition potentials over the tags v/n/a, and a belief such as v 1.8, n 0, a 4.2 obtained by multiplying the incoming messages] Forward-backward is a message passing algorithm. It's the simplest case of belief propagation. Forward algorithm = message passing (matrix-vector products). Backward algorithm = message passing (matrix-vector products).
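
A compact sketch of that claim (the transition and per-position scores below are illustrative, not the slide's numbers): the forward and backward passes are nothing but repeated matrix-vector products.

    import numpy as np

    # Three positions ("find preferred tags"), tag order [v, n, a].
    trans = np.array([[0., 2., 1.],     # trans[prev, cur]: opinion of each tag bigram
                      [2., 1., 0.],
                      [0., 3., 1.]])
    emit = [np.array([7., 2., 1.]),     # per-position unigram scores
            np.array([3., 1., 6.]),
            np.array([2., 1., 7.])]

    alpha = [emit[0]]                   # forward messages (here alpha folds in the unigram score)
    for t in range(1, 3):
        alpha.append((alpha[-1] @ trans) * emit[t])   # matrix-vector product, then pointwise

    beta = [np.ones(3)]                 # backward messages, built right to left
    for t in range(2, 0, -1):
        beta.insert(0, trans @ (emit[t] * beta[0]))

    Z = alpha[-1].sum()                 # total weight of all paths
    marginal_2 = alpha[1] * beta[1] / Z # marginal of the middle position
    print(marginal_2, marginal_2.sum()) # a proper distribution: sums to 1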

78 So Let's Review Forward-Backward … [Figure: the chain X1 X2 X3 over "find preferred tags" again; each position could be verb/noun, adjective/verb, or noun/verb]

79 [Figure: a trellis over "find preferred tags": each position can take the values v, n, a, between START and END] Show the possible values for each variable.

80 So Let's Review Forward-Backward … [Figure: the same trellis with one path highlighted] Show the possible values for each variable; one possible assignment.

81 So Let's Review Forward-Backward … [Figure: the same trellis] Show the possible values for each variable; one possible assignment; and what the 7 factors think of it …

82 Viterbi Algorithm: Most Probable Assignment [Figure: the trellis over "find preferred tags"] So p(v a n) = (1/Z) * product of 7 numbers, the numbers associated with the edges and nodes of the path: ψ{0,1}(START, v), ψ{1,2}(v, a), ψ{2,3}(a, n), ψ{3,4}(n, END), ψ{1}(v), ψ{2}(a), ψ{3}(n). Most probable assignment = path with highest product.

83 Viterbi Algorithm: Most Probable Assignment [Figure: the same trellis] So p(v a n) = (1/Z) * weight of one path, where the path weight is the product ψ{0,1}(START, v) ψ{1,2}(v, a) ψ{2,3}(a, n) ψ{3,4}(n, END) ψ{1}(v) ψ{2}(a) ψ{3}(n).

84 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] So p(v a n) = (1/Z) * weight of one path. Marginal probability p(X2 = a) = (1/Z) * total weight of all paths through a.

85 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] So p(v a n) = (1/Z) * weight of one path. Marginal probability p(X2 = n) = (1/Z) * total weight of all paths through n.

86 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] So p(v a n) = (1/Z) * weight of one path. Marginal probability p(X2 = v) = (1/Z) * total weight of all paths through v.

87 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] So p(v a n) = (1/Z) * weight of one path. Marginal probability p(X2 = n) = (1/Z) * total weight of all paths through n.

88 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] α2(n) = total weight of these path prefixes (found by dynamic programming: matrix-vector products).

89 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] β2(n) = total weight of these path suffixes (found by dynamic programming: matrix-vector products).

90 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] α2(n) = total weight of these path prefixes; β2(n) = total weight of these path suffixes. Product of sums: (a + b + c)(x + y + z) = ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths.

91 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] Total weight of all paths through n = α2(n) ψ{2}(n) β2(n) = "belief that X2 = n". Oops! The weight of a path through a state also includes a weight at that state. So α(n)∙β(n) isn't enough. The extra weight is the opinion of the unigram factor at this variable.

92 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] Total weight of all paths through v = α2(v) ψ{2}(v) β2(v) = "belief that X2 = v" (and similarly for the belief that X2 = n).

93 Forward-Backward Algorithm: Finds Marginals [Figure: the trellis] Total weight of all paths through a = α2(a) ψ{2}(a) β2(a) = "belief that X2 = a". Beliefs at X2: v 1.8, n 0, a 4.2; their sum = Z (total probability of all paths); divide by Z = 6 to get the marginal probs v 0.3, n 0, a 0.7.
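
A two-line numeric sketch of this combination (the α, ψ, β values below are placeholders chosen so the beliefs come out to the slide's 1.8, 0, 4.2 and Z = 6):

    import numpy as np

    alpha2 = np.array([0.9, 1.0, 2.1])   # prefix weights for X2 = v, n, a (placeholders)
    psi2   = np.array([2.0, 0.0, 1.0])   # the unigram factor's opinion at position 2
    beta2  = np.array([1.0, 0.5, 2.0])   # suffix weights

    belief = alpha2 * psi2 * beta2       # -> [1.8, 0.0, 4.2], the beliefs at X2
    Z = belief.sum()                     # -> 6.0, the total weight of all paths
    print(belief / Z)                    # -> [0.3, 0.0, 0.7], the marginal distribution of X2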

94 (Acyclic) Belief Propagation [Figure: the tree-shaped factor graph over "time flies like an arrow"] In a factor graph with no cycles: 1. Pick any node to serve as the root. 2. Send messages from the leaves to the root. 3. Send messages from the root to the leaves. A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges.

95 (Acyclic) Belief Propagation [Figure: the same tree-shaped factor graph, with the other message-passing sweep shown] In a factor graph with no cycles: 1. Pick any node to serve as the root. 2. Send messages from the leaves to the root. 3. Send messages from the root to the leaves. A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges.

96 Acyclic BP as Dynamic Programming 96 X1X1 ψ1ψ1 X2X2 ψ3ψ3 X3X3 ψ5ψ5 X4X4 ψ7ψ7 X5X5 ψ9ψ9 time like flies anarrow X6X6 ψ 10 ψ 12 XiXi ψ 14 X9X9 ψ 13 ψ 11 F G H Figure adapted from Burkett & Klein (2012) Subproblem: Inference using just the factors in subgraph H

97 Acyclic BP as Dynamic Programming 97 X1X1 X2X2 X3X3 X4X4 ψ7ψ7 X5X5 ψ9ψ9 time like flies anarrow X6X6 ψ 10 XiXi X9X9 ψ 11 H Subproblem: Inference using just the factors in subgraph H The marginal of X i in that smaller model is the message sent to X i from subgraph H Message to a variable

98 Acyclic BP as Dynamic Programming 98 X1X1 X2X2 X3X3 ψ5ψ5 X4X4 X5X5 time like flies anarrow X6X6 XiXi ψ 14 X9X9 G Subproblem: Inference using just the factors in subgraph H The marginal of X i in that smaller model is the message sent to X i from subgraph H Message to a variable

99 Acyclic BP as Dynamic Programming 99 X1X1 ψ1ψ1 X2X2 ψ3ψ3 X3X3 X4X4 X5X5 time like flies anarrow X6X6 X8X8 ψ 12 XiXi X9X9 ψ 13 F Subproblem: Inference using just the factors in subgraph H The marginal of X i in that smaller model is the message sent to X i from subgraph H Message to a variable

100 Acyclic BP as Dynamic Programming 100 X1X1 ψ1ψ1 X2X2 ψ3ψ3 X3X3 ψ5ψ5 X4X4 ψ7ψ7 X5X5 ψ9ψ9 time like flies anarrow X6X6 ψ 10 ψ 12 XiXi ψ 14 X9X9 ψ 13 ψ 11 F H Subproblem: Inference using just the factors in subgraph F  H The marginal of X i in that smaller model is the message sent by X i out of subgraph F  H Message from a variable

101 Acyclic BP as Dynamic Programming [Figure: the tree-shaped factor graph over "time flies like an arrow"] If you want the marginal p_i(x_i) where X_i has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs. Each subgraph is obtained by cutting some edge of the tree. The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

108 Loopy Belief Propagation What if our graph has cycles? [Figure: the factor graph over "time flies like an arrow" with extra factors ψ2, ψ4, ψ6, ψ8 creating cycles] Messages from different subgraphs are no longer independent! – Dynamic programming can't help. It's now #P-hard in general to compute the exact marginals. But we can still run BP -- it's a local algorithm so it doesn't "see the cycles."

109 What can go wrong with loopy BP? 109 F All 4 factors on cycle enforce equality F F F

110 What can go wrong with loopy BP? 110 T All 4 factors on cycle enforce equality T T T This factor says upper variable is twice as likely to be true as false (and that’s the true marginal!)

111 What can go wrong with loopy BP? [Figure: the cycle of four variables again; all 4 factors on the cycle enforce equality, and the unary factor says the upper variable is twice as likely to be T as F (and that's the true marginal!)] BP incorrectly treats the returning message as separate evidence that the variable is T: it multiplies the two messages as if they were independent, but they don't actually come from independent parts of the graph; one influenced the other (via a cycle). Messages loop around and around: 2, 4, 8, 16, 32, ... More and more convinced that these variables are T! So beliefs converge to the marginal distribution (1, 0) rather than (2/3, 1/3). This is an extreme example. Often in practice, the cyclic influences are weak (as cycles are long or include at least one weak correlation).
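
A cartoon of that feedback loop (not a full BP implementation; because the equality factors just copy each message along, every lap around the cycle multiplies in the same 2:1 evidence again):

    import numpy as np

    unary = np.array([2.0, 1.0])      # the one real factor's opinion: [T, F] = 2 : 1
    msg = np.array([0.5, 0.5])        # normalized message circulating around the cycle

    for lap in range(8):
        msg = msg * unary             # each pass through the top variable re-multiplies the evidence
        msg = msg / msg.sum()
        print(lap, msg)               # marches toward (1, 0) instead of the true marginal (2/3, 1/3)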

112 Kathy says so Bob says so Charlie says so Alice says so Your prior doesn’t think Obama owns it. But everyone’s saying he does. Under a Naïve Bayes model, you therefore believe it. What can go wrong with loopy BP? 112 … Obama owns it T 1 F 99 T 2 F 1 T 2 F 1 T 2 F 1 T 2 F 1 T 2048 F 99 A lie told often enough becomes truth. -- Lenin A rumor is circulating that Obama secretly owns an insurance company. (Obamacare is actually designed to maximize his profit.)

113 Kathy says so Bob says so Charlie says so Alice says so Better model... Rush can influence conversation. – Now there are 2 ways to explain why everyone’s repeating the story: it’s true, or Rush said it was. – The model favors one solution (probably Rush). – Yet BP has 2 stable solutions. Each solution is self- reinforcing around cycles; no impetus to switch. What can go wrong with loopy BP? 113 … Obama owns it T 1 F 99 T ??? F Rush says so A lie told often enough becomes truth. -- Lenin If everyone blames Obama, then no one has to blame Rush. But if no one blames Rush, then everyone has to continue to blame Obama (to explain the gossip). T 1 F 24 Actually 4 ways: but “both” has a low prior and “neither” has a low likelihood, so only 2 good ways.

114 Loopy Belief Propagation Algorithm Run the BP update equations on a cyclic graph – Hope it “works” anyway (good approximation) Though we multiply messages that aren’t independent No interpretation as dynamic programming – If largest element of a message gets very big or small, Divide the message by a constant to prevent over/underflow Can update messages in any order – Stop when the normalized messages converge Compute beliefs from final messages – Return normalized beliefs as approximate marginals 114 e.g., Murphy, Weiss & Jordan (1999)

115 Loopy Belief Propagation Input: a factor graph with cycles. Output: approximate marginals for each variable and factor. Algorithm: 1. Initialize the messages to the uniform distribution. 2. Send messages until convergence. Normalize them when they grow too large. 3. Compute the beliefs (unnormalized marginals). 4. Normalize beliefs and return the approximate marginals.
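
A generic sketch of that loop (the factor-graph representation and all names here are mine; every variable shares one small domain for simplicity):

    import numpy as np

    def loopy_bp(variables, domain_size, factors, iters=50):
        """Loopy BP sketch.  'factors' maps a name to (list of variable names, numpy table)."""
        # Start every message as the uniform distribution.
        msgs = {(f, v, d): np.ones(domain_size) / domain_size
                for f, (vs, _) in factors.items() for v in vs for d in ("f2v", "v2f")}

        for _ in range(iters):                      # send messages until (hopefully) converged
            for f, (vs, table) in factors.items():
                for i, v in enumerate(vs):
                    # variable -> factor: product of the messages from v's *other* factors
                    m = np.ones(domain_size)
                    for g, (ws, _) in factors.items():
                        if g != f and v in ws:
                            m = m * msgs[(g, v, "f2v")]
                    msgs[(f, v, "v2f")] = m / m.sum()     # normalize to prevent under/overflow

                    # factor -> variable: weight the table by the other variables' incoming
                    # messages, then sum those variables out
                    t = table.astype(float).copy()
                    for j, w in enumerate(vs):
                        if j != i:
                            shape = [1] * len(vs)
                            shape[j] = domain_size
                            t = t * msgs[(f, w, "v2f")].reshape(shape)
                    m = t.sum(axis=tuple(j for j in range(len(vs)) if j != i))
                    msgs[(f, v, "f2v")] = m / m.sum()

        # belief at each variable = product of all incoming factor -> variable messages
        beliefs = {}
        for v in variables:
            b = np.ones(domain_size)
            for f, (vs, _) in factors.items():
                if v in vs:
                    b = b * msgs[(f, v, "f2v")]
            beliefs[v] = b / b.sum()                # approximate marginal
        return beliefs

    # Tiny example: a unary preference on X1 plus one pairwise factor tying X1 and X2.
    factors = {"unary1": (["X1"], np.array([2.0, 1.0])),
               "pair":   (["X1", "X2"], np.array([[3.0, 1.0], [1.0, 3.0]]))}
    print(loopy_bp(["X1", "X2"], 2, factors))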

116 Section 3: Belief Propagation Q&A Methods like BP and in what sense they work 116

119 Q&A Q: Forward-backward is to the Viterbi algorithm as sum-product BP is to __________ ? A: max-product BP

120 Max-product Belief Propagation Sum-product BP can be used to compute the marginals, p_i(X_i). Max-product BP can be used to compute the most likely assignment, X* = argmax_X p(X).

121 Max-product Belief Propagation Change the sum to a max: Max-product BP computes max-marginals. – The max-marginal b_i(x_i) is the (unnormalized) probability of the MAP assignment under the constraint X_i = x_i. – For an acyclic graph, the MAP assignment (assuming there are no ties) is given by x_i* = argmax over x_i of b_i(x_i).
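
A minimal sketch of the changed update (potentials and messages invented): replace the sum over the neighboring variable with a max, then read the MAP value off the max-marginal belief.

    import numpy as np

    psi = np.array([[1.0, 6.0],          # psi[x_i, x_j], made-up values
                    [8.0, 4.0]])
    msg_from_j = np.array([0.3, 0.7])

    sum_msg = (psi * msg_from_j).sum(axis=1)   # sum-product message to X_i
    max_msg = (psi * msg_from_j).max(axis=1)   # max-product message: the sum becomes a max
    print(sum_msg, max_msg)

    # With the other incoming messages multiplied in, the max-marginal belief b_i
    # picks out the MAP value of X_i (on an acyclic graph, assuming no ties).
    belief_i = max_msg * np.array([2.0, 1.0])
    print("MAP value of X_i:", int(belief_i.argmax()))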

123 Deterministic Annealing Motivation: Smoothly transition from sum-product to max-product. Add an inverse temperature parameter to each factor: raise each factor to the power 1/T to get the annealed joint distribution. Send messages as usual for sum-product BP. Anneal T from 1 to 0: T = 1 gives sum-product; T → 0 gives max-product.
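
A cartoon of the annealed update (made-up numbers): each factor and incoming message is raised to the power 1/T before the usual sum-product step.

    import numpy as np

    psi = np.array([[1.0, 6.0],
                    [8.0, 4.0]])
    msg_from_j = np.array([0.3, 0.7])

    for T in [1.0, 0.5, 0.1, 0.01]:
        m = ((psi ** (1.0 / T)) * (msg_from_j ** (1.0 / T))).sum(axis=1)
        print(T, m / m.sum())   # T = 1 is ordinary sum-product; as T -> 0 the
                                # messages sharpen toward the maximizing values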

124 Q&A Q: This feels like Arc Consistency… Any relation? A: Yes, BP is doing (with probabilities) what people were doing in AI long before.

125 =  From Arc Consistency to BP Goal: Find a satisfying assignment Algorithm: Arc Consistency 1.Pick a constraint 2.Reduce domains to satisfy the constraint 3.Repeat until convergence ,1, 32,1, 32,1, X, Y, U, T ∈ { 1, 2, 3 } X  Y Y = U T  U X < T XY T U 32,1,   Propagation completely solved the problem! Note: These steps can occur in somewhat arbitrary order Slide thanks to Rina Dechter (modified)

126 =  From Arc Consistency to BP Goal: Find a satisfying assignment Algorithm: Arc Consistency 1.Pick a constraint 2.Reduce domains to satisfy the constraint 3.Repeat until convergence ,1, 32,1, 32,1, X, Y, U, T ∈ { 1, 2, 3 } X  Y Y = U T  U X < T XY T U 32,1,   Propagation completely solved the problem! Note: These steps can occur in somewhat arbitrary order Slide thanks to Rina Dechter (modified) Arc Consistency is a special case of Belief Propagation.

127  From Arc Consistency to BP Solve the same problem with BP Constraints become “hard” factors with only 1’s or 0’s Send messages until convergence ,1, 32,1, 32,1, X, Y, U, T ∈ { 1, 2, 3 } X  Y Y = U T  U X < T XY T U 32,1,   Slide thanks to Rina Dechter (modified) =

134  From Arc Consistency to BP ,1, 32,1, 32,1, X, Y, U, T ∈ { 1, 2, 3 } X  Y Y = U T  U X < T XY T U 32,1,   Slide thanks to Rina Dechter (modified) = Solve the same problem with BP Constraints become “hard” factors with only 1’s or 0’s Send messages until convergence Loopy BP will converge to the equivalent solution!

135  From Arc Consistency to BP ,1, 32,1, 32,1, XY T U 32,1,   Slide thanks to Rina Dechter (modified) = Loopy BP will converge to the equivalent solution! Takeaways: Arc Consistency is a special case of Belief Propagation. Arc Consistency will only rule out impossible values. BP rules out those same values (belief = 0).
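
A tiny sketch of that takeaway, using the one clearly legible constraint on the slide (X < T over domains {1, 2, 3}); the rest of the setup is mine:

    import numpy as np

    domain = [1, 2, 3]
    # The hard constraint X < T becomes a 0/1 factor.
    psi = np.array([[1.0 if x < t else 0.0 for t in domain] for x in domain])

    msg_from_T = np.ones(3)            # uniform: T's domain has not been pruned yet
    msg_to_X = psi @ msg_from_T        # sum-product message from the constraint factor to X
    print(dict(zip(domain, msg_to_X))) # {1: 2.0, 2: 1.0, 3: 0.0} -- X = 3 gets belief 0,
                                       # exactly the value arc consistency would prune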

136 Q&A Q: Is BP totally divorced from sampling? A: Gibbs Sampling is also a kind of message passing algorithm.

137 From Gibbs Sampling to Particle BP to BP Message Representation: A. Belief Propagation: full distribution B. Gibbs sampling: single particle C. Particle BP: multiple particles [Table: # of particles: Gibbs Sampling 1, Particle BP k, BP +∞]

138 From Gibbs Sampling to Particle BP to BP [Figure: a chain W – ψ2 – X – ψ3 – Y, where each variable ranges over a few candidate words such as meant/man/mean, to/too/two, type/tight/taipei]

139 From Gibbs Sampling to Particle BP to BP [Figure: the W – X – Y chain with one current value per variable] Approach 1: Gibbs Sampling. For each variable, resample the value by conditioning on all the other variables. – Called the "full conditional" distribution. – Computationally easy because we really only need to condition on the Markov Blanket. We can view the computation of the full conditional in terms of message passing. – Message puts all its probability mass on the current particle (i.e. current value).
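
A sketch of one Gibbs resampling step seen as message passing (the candidate words come from the slide's figure; the factor scores and helper names are invented):

    import random

    X_VALUES = ["to", "too", "two"]
    current = {"W": "meant", "Y": "type"}     # the Markov blanket's current particles

    def psi2(w, x):                           # factor between W and X (made-up scores)
        return 3.0 if (w, x) == ("meant", "to") else 1.0

    def psi3(x, y):                           # factor between X and Y (made-up scores)
        return 2.0 if (x, y) == ("to", "type") else 1.0

    # Each neighbor's "message" is a point mass on its current value, so the full
    # conditional for X is just the product of its factors with W and Y pinned.
    scores = [psi2(current["W"], x) * psi3(x, current["Y"]) for x in X_VALUES]
    probs = [s / sum(scores) for s in scores]
    new_x = random.choices(X_VALUES, weights=probs)[0]   # one Gibbs update of X
    print(probs, new_x)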

142 From Gibbs Sampling to Particle BP to BP 142 W ψ2ψ2 X ψ3ψ3 Y … … meant to type man too tight meant two taipei mean to type too tight to

143 From Gibbs Sampling to Particle BP to BP [Figure: several copies of the chain, one per sampler] Approach 2: Multiple Gibbs Samplers. Run each Gibbs Sampler independently. Full conditionals computed independently. – k separate messages that are each a pointmass distribution.

144 From Gibbs Sampling to Particle BP to BP 144 [Figure: factor graph W –ψ2– X –ψ3– Y with k current particles per variable] Approach 3: Gibbs Sampling w/Averaging. Keep k samples for each variable. Resample from the average of the full conditionals for each possible pair of variables – Message is a uniform distribution over current particles

147 From Gibbs Sampling to Particle BP to BP 147 [Figure: factor graph W –ψ2– X –ψ3– Y with weighted particles per variable, e.g. meant/man/mean, type/tight/taipei] Approach 4: Particle BP. Similar in spirit to Gibbs Sampling w/Averaging. Messages are a weighted distribution over k particles (Ihler & McAllester, 2009)

148 From Gibbs Sampling to Particle BP to BP 148 [Figure: factor graph W –ψ2– X –ψ3– Y] Approach 5: BP. In Particle BP, as the number of particles goes to +∞, the estimated messages approach the true BP messages. Belief propagation represents messages as the full distribution – This assumes we can store the whole distribution compactly (Ihler & McAllester, 2009)

149 From Gibbs Sampling to Particle BP to BP 149 Message Representation: A. Belief Propagation: full distribution; B. Gibbs sampling: single particle; C. Particle BP: multiple particles. Number of particles: Gibbs Sampling = 1, Particle BP = k, BP = +∞

150 From Gibbs Sampling to Particle BP to BP 150 Tension between approaches… Sampling values or combinations of values: quickly gets a good estimate of the frequent cases; may take a long time to estimate probabilities of infrequent cases; may take a long time to draw a sample (mixing time); exact if you run forever. Enumerating each value and computing its probability exactly: has to spend time on all values, but spends only O(1) time on each value (doesn’t sample frequent values over and over while waiting for infrequent ones); runtime is more predictable; lets you trade off exactness for greater speed (brute force exactly enumerates exponentially many assignments; BP approximates this by enumerating local configurations).

151 Background: Convergence When BP is run on a tree-shaped factor graph, the beliefs converge to the marginals of the distribution after two passes. 151

152 Q&A 152 Q: How long does loopy BP take to converge? A:It might never converge. Could oscillate. ψ1ψ1 ψ2ψ2 ψ1ψ1 ψ2ψ2 ψ2ψ2 ψ2ψ2

153 Q&A 153 Q: When loopy BP converges, does it always get the same answer? A:No. Sensitive to initialization and update order. ψ1ψ1 ψ2ψ2 ψ1ψ1 ψ2ψ2 ψ2ψ2 ψ2ψ2 ψ1ψ1 ψ2ψ2 ψ1ψ1 ψ2ψ2 ψ2ψ2 ψ2ψ2

154 Q&A 154 Q: Are there convergent variants of loopy BP? A: Yes. It's actually trying to minimize a certain differentiable function of the beliefs, so you could just minimize that function directly.

155 Q&A 155 Q: But does that function have a unique minimum? A: No, and you'll only be able to find a local minimum in practice. So you're still dependent on initialization.

156 Q&A 156 Q: If you could find the global minimum, would its beliefs give the marginals of the true distribution? A:No. We’ve found the bottom!!

157 Q&A 157 Q: Is it finding the marginals of some other distribution? A: No, just a collection of beliefs. Might not be globally consistent in the sense of all being views of the same elephant. *Cartoon by G. Renee Guzlas

158 Q&A 158 Q: Does the global minimum give beliefs that are at least locally consistent? A: Yes. A variable belief and a factor belief are locally consistent if the marginal of the factor’s belief equals the variable’s belief. [Figure: example variable belief over X1 and factor belief over (X1, X2) for factor ψα, shown as small tables over tags p, d, n, v]

159 Q&A 159 Q: In what sense are the beliefs at the global minimum any good? A:They are the global minimum of the Bethe Free Energy. We’ve found the bottom!!

160 Q&A 160 Q: When loopy BP converges, in what sense are the beliefs any good? A:They are a local minimum of the Bethe Free Energy.

161 Q&A 161 Q: Why would you want to minimize the Bethe Free Energy? A: 1)It’s easy to minimize* because it’s a sum of functions on the individual beliefs. [*] Though we can’t just minimize each function separately – we need message passing to keep the beliefs locally consistent. 2)On an acyclic factor graph, it measures KL divergence between beliefs and true marginals, and so is minimized when beliefs = marginals. (For a loopy graph, we close our eyes and hope it still works.)

162 Section 4: Incorporating Structure into Factors and Variables 162

163 Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you! 1.Models: Factor graphs can express interactions among linguistic structures. 2.Algorithm: BP estimates the global effect of these interactions on each variable, using local computations. 3.Intuitions: What’s going on here? Can we trust BP’s estimates? 4.Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors. 5.Tweaked Algorithm: Finish in fewer steps and make the steps faster. 6.Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions. 7.Software: Build the model you want! 163

165 BP for Coordination of Algorithms 165 T ψ2ψ2 T ψ4ψ4 T F F F la blanca casa T ψ2ψ2 T ψ4ψ4 T F F F the house white TT T TT T TT T

166 Sending Messages: Computational Complexity 166 [Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3, and a factor ψ1 with neighboring variables X1, X2, X3] From Variables (variable → factor): O(d*k), where d = # of neighboring factors and k = # possible values for X_i. To Variables (factor → variable): O(d*k^d), where d = # of neighboring variables and k = maximum # possible values for a neighboring variable.
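A rough sketch of the two message computations behind these complexities (the `incoming` dictionary, `.vars`, `.domain`, and `.potential` are hypothetical names used only for illustration): the variable-to-factor message multiplies d incoming messages of size k, while the factor-to-variable message sums over all joint assignments of the factor's other neighbors.

```python
import itertools
from collections import defaultdict

def msg_var_to_factor(var, target_factor, incoming):
    """Variable -> factor message: pointwise product of the messages from the
    variable's OTHER neighboring factors.  With d neighboring factors and k
    values, this is O(d*k)."""
    out = defaultdict(lambda: 1.0)
    for factor, msg in incoming[var].items():     # incoming[var]: factor -> {value: prob}
        if factor is target_factor:
            continue
        for value, p in msg.items():
            out[value] *= p
    return dict(out)

def msg_factor_to_var(factor, target_var, incoming):
    """Factor -> variable message: sum over joint assignments of the factor's
    OTHER neighboring variables.  With d neighboring variables and k values
    each, a dense potential table costs O(d*k^d)."""
    others = [v for v in factor.vars if v is not target_var]
    out = defaultdict(float)
    for x_t in target_var.domain:
        for combo in itertools.product(*(v.domain for v in others)):
            assignment = dict(zip(others, combo))
            assignment[target_var] = x_t
            p = factor.potential(tuple(assignment[v] for v in factor.vars))
            for v in others:                      # incoming[factor]: variable -> {value: prob}
                p *= incoming[factor][v][assignment[v]]
            out[x_t] += p
    return dict(out)
```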

168 INCORPORATING STRUCTURE INTO FACTORS 168

169 Unlabeled Constituency Parsing 169 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F F F F F (Naradowsky, Vieira, & Smith, 2012)

170 Unlabeled Constituency Parsing 170 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F F F F F (Naradowsky, Vieira, & Smith, 2012)

171 Unlabeled Constituency Parsing 171 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F F F F T (Naradowsky, Vieira, & Smith, 2012)

172 Unlabeled Constituency Parsing 172 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F F F F F (Naradowsky, Vieira, & Smith, 2012)

173 Unlabeled Constituency Parsing 173 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F F F F F (Naradowsky, Vieira, & Smith, 2012)

174 Unlabeled Constituency Parsing 174 T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 F ψ 12 T ψ 11 T ψ 13 Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. F ψ 10 F T F F F (Naradowsky, Vieira, & Smith, 2012)

175 Unlabeled Constituency Parsing 175 [Figure: span variables over the sentence “time flies like an arrow” with unary factors ψ1–ψ13] Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. Sending a message to a variable from its unary factors takes only O(d*k^d) time where k=2 and d=1. (Naradowsky, Vieira, & Smith, 2012)

176 Unlabeled Constituency Parsing 176 [Figure: span variables over the sentence “time flies like an arrow” with unary factors ψ1–ψ13] Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. But nothing prevents non-tree structures. Sending a message to a variable from its unary factors takes only O(d*k^d) time where k=2 and d=1. (Naradowsky, Vieira, & Smith, 2012)

177 Unlabeled Constituency Parsing 177 time like flies anarrow Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. But nothing prevents non-tree structures. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. But nothing prevents non-tree structures. Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise. T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 F ψ 10 F T F F F (Naradowsky, Vieira, & Smith, 2012)

178 Unlabeled Constituency Parsing 178 time like flies anarrow Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. But nothing prevents non-tree structures. Given: a sentence. Predict: unlabeled parse. We could predict whether each span is present T or not F. But nothing prevents non-tree structures. Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise. T ψ1ψ1 ψ2ψ2 T ψ3ψ3 ψ4ψ4 T ψ5ψ5 ψ6ψ6 T ψ7ψ7 ψ8ψ8 T ψ9ψ9 time like flies anarrow T ψ 10 T ψ 12 T ψ 11 T ψ 13 F ψ 10 F F F F F (Naradowsky, Vieira, & Smith, 2012)

179 Unlabeled Constituency Parsing 179 [Figure: span variables over the sentence “time flies like an arrow” with unary factors and a CKYTree factor connected to every span variable] Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise. How long does it take to send a message to a variable from the CKYTree factor? For the given sentence, O(d*k^d) time where k=2 and d=15. For a length-n sentence, this will be O(2^(n*n)). But we know an algorithm (inside-outside) to compute all the marginals in O(n^3)… So can’t we do better? (Naradowsky, Vieira, & Smith, 2012)

180 Example: The Exactly1 Factor 180 [Figure: binary variables X1, X2, X3, X4 with unary factors ψ1–ψ4, all connected to a global factor ψE1] (Smith & Eisner, 2008) Variables: d binary variables X1, …, Xd. Global Factor: Exactly1(X1, …, Xd) = 1 if exactly one of the d binary variables Xi is on, 0 otherwise.

186 Messages: The Exactly1 Factor 186 From Variables (variable → factor): O(d*2), d = # of neighboring factors. To Variables (factor → variable): O(d*2^d), d = # of neighboring variables.

187 Messages: The Exactly1 Factor 187 From Variables (variable → factor): O(d*2), d = # of neighboring factors – Fast! To Variables (factor → variable): O(d*2^d), d = # of neighboring variables.

188 Messages: The Exactly1 Factor 188 To Variables X1X1 X2X2 X3X3 X4X4 ψ E1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ1ψ1 O(d*2 d ) d = # of neighboring variables

189 Messages: The Exactly1 Factor 189 To Variables X1X1 X2X2 X3X3 X4X4 ψ E1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ1ψ1 But the outgoing messages from the Exactly1 factor are defined as a sum over the 2 d possible assignments to X 1, …, X d. Conveniently, ψ E1 (x a ) is 0 for all but d values – so the sum is sparse! So we can compute all the outgoing messages from ψ E1 in O(d) time! O(d*2 d ) d = # of neighboring variables

190 Messages: The Exactly1 Factor 190 To Variables X1X1 X2X2 X3X3 X4X4 ψ E1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ1ψ1 But the outgoing messages from the Exactly1 factor are defined as a sum over the 2 d possible assignments to X 1, …, X d. Conveniently, ψ E1 (x a ) is 0 for all but d values – so the sum is sparse! So we can compute all the outgoing messages from ψ E1 in O(d) time! O(d*2 d ) d = # of neighboring variables Fast!

191 Messages: The Exactly1 Factor 191 X1X1 X2X2 X3X3 X4X4 ψ E1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ1ψ1 (Smith & Eisner, 2008) A factor has a belief about each of its variables. An outgoing message from a factor is the factor's belief with the incoming message divided out. We can compute the Exactly1 factor’s beliefs about each of its variables efficiently. (Each of the parenthesized terms needs to be computed only once for all the variables.)
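A sketch of that O(d) computation, assuming the incoming messages are strictly positive pairs (mu(OFF), mu(ON)); the function name and data layout are invented for illustration, not from Smith & Eisner (2008):

```python
def exactly1_outgoing(incoming):
    """Outgoing messages from an Exactly1 factor over d binary variables.

    `incoming[i]` = (mu_i(OFF), mu_i(ON)) is the message from variable X_i
    (assumed strictly positive so the divisions below are safe).  Naively the
    factor sums over 2^d assignments, but only the d one-hot assignments have
    nonzero potential, and two shared quantities make the whole set of
    outgoing messages O(d) time.
    """
    prod_off = 1.0
    for off, _ in incoming:
        prod_off *= off                                   # prod_j mu_j(OFF)
    ratio_sum = sum(on / off for off, on in incoming)     # sum_j mu_j(ON)/mu_j(OFF)

    outgoing = []
    for off, on in incoming:
        # X_i = ON: all the other variables must be OFF.
        m_on = prod_off / off
        # X_i = OFF: exactly one of the other variables is ON.
        m_off = (prod_off / off) * (ratio_sum - on / off)
        outgoing.append((m_off, m_on))
    return outgoing
```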

192 Example: The CKYTree Factor [Figure: span variables S01, S12, S23, S34, S02, S13, S24, S03, S14, S04 over a four-word sentence (“the barista made coffee”), all connected to a CKYTree factor] Variables: O(n^2) binary variables S_ij. Global Factor: CKYTree(S01, S12, …, S04) = 1 if the span variables form a constituency tree, 0 otherwise. (Naradowsky, Vieira, & Smith, 2012)

193 Messages: The CKYTree Factor 193 From VariablesTo Variables the made barista coffee S 01 S 12 S 23 S 34 S 02 S 13 S 24 S 03 S 14 S the made barista coffee S 01 S 12 S 23 S 34 S 02 S 13 S 24 S 03 S 14 S 04 O(d*2 d ) d = # of neighboring variables O(d*2) d = # of neighboring factors

194 Messages: The CKYTree Factor 194 From VariablesTo Variables the made barista coffee S 01 S 12 S 23 S 34 S 02 S 13 S 24 S 03 S 14 S the made barista coffee S 01 S 12 S 23 S 34 S 02 S 13 S 24 S 03 S 14 S 04 O(d*2 d ) d = # of neighboring variables O(d*2) d = # of neighboring factors Fast!

195 Messages: The CKYTree Factor 195 To Variables the made barista coffee S 01 S 12 S 23 S 34 S 02 S 13 S 24 S 03 S 14 S 04 O(d*2 d ) d = # of neighboring variables

196 Messages: The CKYTree Factor 196 To Variables: O(d*2^d), d = # of neighboring variables. But the outgoing messages from the CKYTree factor are defined as a sum over the O(2^(n*n)) possible assignments to {S_ij}. ψ_CKYTree(x_α) is 1 for exponentially many values in the sum – but they all correspond to trees! With inside-outside we can compute all the outgoing messages from CKYTree in O(n^3) time!

197 Messages: The CKYTree Factor 197 To Variables: O(d*2^d), d = # of neighboring variables. But the outgoing messages from the CKYTree factor are defined as a sum over the O(2^(n*n)) possible assignments to {S_ij}. ψ_CKYTree(x_α) is 1 for exponentially many values in the sum – but they all correspond to trees! With inside-outside we can compute all the outgoing messages from CKYTree in O(n^3) time! Fast!

198 Example: The CKYTree Factor [Figure: span variables S01 … S04 over a four-word sentence (“the barista made coffee”) connected to a CKYTree factor] For a length-n sentence, define an anchored weighted context free grammar (WCFG). Each span is weighted by the ratio of the incoming messages from the corresponding span variable. Run the inside-outside algorithm on the sentence a_1, a_2, …, a_n with the anchored WCFG. (Naradowsky, Vieira, & Smith, 2012)
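A hedged sketch of this recipe for unlabeled bracketing, assuming every span (i, j) has a variable, incoming messages are strictly positive, and spans are scored only by the message ratio (no grammar rule scores); the function and argument names are invented for illustration:

```python
def ckytree_outgoing(n, mu_T, mu_F):
    """Outgoing messages from a CKYTree factor over span variables S_ij (0 <= i < j <= n).

    Each span is weighted by the ratio of its incoming messages, inside-outside sums
    over all binary bracketings in O(n^3), and the resulting span marginals are turned
    back into messages by dividing out the incoming message.
    """
    w = {(i, j): mu_T[(i, j)] / mu_F[(i, j)]
         for i in range(n) for j in range(i + 1, n + 1)}

    # Inside pass: beta[(i,j)] = total weight of all subtrees spanning (i, j).
    beta = {(i, i + 1): w[(i, i + 1)] for i in range(n)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            beta[(i, j)] = w[(i, j)] * sum(beta[(i, k)] * beta[(k, j)]
                                           for k in range(i + 1, j))

    # Outside pass: alpha[(i,j)] = total weight of everything outside span (i, j).
    alpha = {span: 0.0 for span in beta}
    alpha[(0, n)] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width
            out = alpha[(i, j)] * w[(i, j)]
            for k in range(i + 1, j):
                alpha[(i, k)] += out * beta[(k, j)]
                alpha[(k, j)] += out * beta[(i, k)]

    Z = beta[(0, n)]
    messages = {}
    for (i, j) in beta:
        p_T = alpha[(i, j)] * beta[(i, j)] / Z    # marginal prob. the span is a constituent
        # Factor belief is (p_T, 1 - p_T); divide out the incoming message.
        messages[(i, j)] = (p_T / mu_T[(i, j)], (1.0 - p_T) / mu_F[(i, j)])
    return messages
```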

199 Example: The TrigramHMM Factor 199 (Smith & Eisner, 2008) time like flies anarrow X1X1 X2X2 X3X3 X4X4 X5X5 W1W1 W2W2 W3W3 W4W4 W5W5 Factors can compactly encode the preferences of an entire sub- model. Consider the joint distribution of a trigram HMM over 5 variables: – It’s traditionally defined as a Bayes Network – But we can represent it as a (loopy) factor graph – We could even pack all those factors into a single TrigramHMM factor (Smith & Eisner, 2008)

200 Example: The TrigramHMM Factor 200 (Smith & Eisner, 2008) time like flies anarrow X1X1 ψ1ψ1 ψ2ψ2 X2X2 ψ3ψ3 ψ4ψ4 X3X3 ψ5ψ5 ψ6ψ6 X4X4 ψ7ψ7 ψ8ψ8 X5X5 ψ9ψ9 ψ 10 ψ 11 ψ 12 Factors can compactly encode the preferences of an entire sub- model. Consider the joint distribution of a trigram HMM over 5 variables: – It’s traditionally defined as a Bayes Network – But we can represent it as a (loopy) factor graph – We could even pack all those factors into a single TrigramHMM factor (Smith & Eisner, 2008)

202 Example: The TrigramHMM Factor Factors can compactly encode the preferences of an entire sub- model. Consider the joint distribution of a trigram HMM over 5 variables: – It’s traditionally defined as a Bayes Network – But we can represent it as a (loopy) factor graph – We could even pack all those factors into a single TrigramHMM factor (Smith & Eisner, 2008) 202 (Smith & Eisner, 2008) time like flies anarrow X1X1 X2X2 X3X3 X4X4 X5X5

203 Example: The TrigramHMM Factor 203 (Smith & Eisner, 2008) time like flies anarrow X1X1 X2X2 X3X3 X4X4 X5X5 Variables: n discrete variables X 1, …, X n Global Factor: p(X 1, …, X n ) according to a trigram HMM model TrigramHMM (X 1, …, X n ) =

204 Example: The TrigramHMM Factor 204 (Smith & Eisner, 2008) time like flies anarrow X1X1 X2X2 X3X3 X4X4 X5X5 Variables: n discrete variables X 1, …, X n Global Factor: p(X 1, …, X n ) according to a trigram HMM model TrigramHMM (X 1, …, X n ) = Compute outgoing messages efficiently with the standard trigram HMM dynamic programming algorithm (junction tree)!

205 Combinatorial Factors Usually, it takes O(k^n) time to compute outgoing messages from a factor over n variables with k possible values each. But not always: – Factors like Exactly1 with only polynomially many nonzeroes in the potential table. – Factors like CKYTree with exponentially many nonzeroes but in a special pattern. – Factors like TrigramHMM (Smith & Eisner 2008) with all nonzeroes but which factor further. 205

206 Example NLP constraint factors: – Projective and non-projective dependency parse constraint (Smith & Eisner, 2008) – CCG parse constraint (Auli & Lopez, 2011) – Labeled and unlabeled constituency parse constraint (Naradowsky, Vieira, & Smith, 2012) – Inversion transduction grammar (ITG) constraint (Burkett & Klein, 2012) Combinatorial Factors Factor graphs can encode structural constraints on many variables via constraint factors. 206

207 Combinatorial Optimization within Max-Product Max-product BP computes max-marginals. – The max-marginal b i (x i ) is the (unnormalized) probability of the MAP assignment under the constraint X i = x i. Duchi et al. (2006) define factors, over many variables, for which efficient combinatorial optimization algorithms exist. – Bipartite matching: max-marginals can be computed with standard max-flow algorithm and the Floyd- Warshall all-pairs shortest-paths algorithm. – Minimum cuts: max-marginals can be computed with a min-cut algorithm. Similar to sum-product case: the combinatorial algorithms are embedded within the standard loopy BP algorithm. 207 (Duchi, Tarlow, Elidan, & Koller, 2006)

208 Additional Resources See NAACL 2012 / ACL 2013 tutorial by Burkett & Klein “Variational Inference in Structured NLP Models” for… – An alternative approach to efficient marginal inference for NLP: Structured Mean Field – Also, includes Structured BP 208

209 Sending Messages: Computational Complexity 209 [Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3, and a factor ψ1 with neighboring variables X1, X2, X3] From Variables (variable → factor): O(d*k), where d = # of neighboring factors and k = # possible values for X_i. To Variables (factor → variable): O(d*k^d), where d = # of neighboring variables and k = maximum # possible values for a neighboring variable.

211 INCORPORATING STRUCTURE INTO VARIABLES 211

212 String-Valued Variables Consider two examples from Section 1: 212 Variables (string): – English and Japanese orthographic strings – English and Japanese phonological strings Interactions: – All pairs of strings could be relevant Variables (string): – Inflected forms of the same verb Interactions: – Between pairs of entries in the table (e.g. infinitive form affects present-singular)

213 Graphical Models over Strings 213 ψ1ψ1 X2X2 X1X1 (Dreyer & Eisner, 2009) ring rangrung ring rang 712 rung 813 Most of our problems so far: – Used discrete variables – Over a small finite set of string values – Examples: POS tagging Labeled constituency parsing Dependency parsing We use tensors (e.g. vectors, matrices) to represent the messages and factors

214 Graphical Models over Strings 214 (Dreyer & Eisner, 2009) [Figure: two string variables X1, X2 joined by a factor ψ1, with a small potential table over ring/rang/rung and a much larger one over a whole dictionary (aardvark … zymurgy)] What happens as the # of possible values for a variable, k, increases? We can still keep the computational complexity down by including only low-arity factors (i.e. small d). Time Complexity: var. → fac. is O(d*k); fac. → var. is O(d*k^d).

215 Graphical Models over Strings 215 ψ1ψ1 X2X2 X1X1 (Dreyer & Eisner, 2009) ring rangrung ring rang 712 rung 813 ψ1ψ1 X2X2 X1X1 aardvark … rang ring rung … aardvar k … rang ring rung … But what if the domain of a variable is Σ *, the infinite set of all possible strings? How can we represent a distribution over one or more infinite sets?

216 Graphical Models over Strings 216 ψ1ψ1 X2X2 X1X1 (Dreyer & Eisner, 2009) ring rangrung ring rang 712 rung 813 ψ1ψ1 X2X2 X1X1 aardvark … rang ring rung … aardvar k … rang ring rung … ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a Finite State Machines let us represent something infinite in finite space!

217 Graphical Models over Strings 217 (Dreyer & Eisner, 2009) ψ1ψ1 X2X2 X1X1 aardvark … rang ring rung … aardvar k … rang ring rung … ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a Finite State Machines let us represent something infinite in finite space! messages and beliefs are Weighted Finite State Acceptors (WFSA) factors are Weighted Finite State Transducers (WFST)

218 Graphical Models over Strings 218 (Dreyer & Eisner, 2009) ψ1ψ1 X2X2 X1X1 aardvark … rang ring rung … aardvar k … rang ring rung … ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a Finite State Machines let us represent something infinite in finite space! messages and beliefs are Weighted Finite State Acceptors (WFSA) factors are Weighted Finite State Transducers (WFST) That solves the problem of representation. But how do we manage the problem of computation? (We still need to compute messages and beliefs.)

219 Graphical Models over Strings 219 (Dreyer & Eisner, 2009) ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a ψ1ψ1 X2X2 r in g u e ε e e s e h a r in g u e ε s e h a ψ1ψ1 ψ1ψ1 r in g u e ε e e s e h a All the message and belief computations simply reuse standard FSM dynamic programming algorithms.

220 Graphical Models over Strings 220 (Dreyer & Eisner, 2009) [Figure: two WFSA messages arriving at variable X2 and their pointwise product] The pointwise product of two WFSAs is… …their intersection. Compute the product of (possibly many) messages μ_{α→i} (each of which is a WFSA) via WFSA intersection
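A minimal sketch of WFSA intersection as pointwise product, assuming epsilon-free acceptors stored as plain dictionaries (this representation is invented for illustration, not any particular FSM library's API):

```python
from collections import defaultdict

def wfsa_intersect(A, B):
    """Pointwise product of two messages represented as epsilon-free WFSAs.

    Each WFSA is a dict with 'start', 'final' (state -> stop weight) and 'arcs'
    (a list of (src, label, weight, dst)).  The product machine runs the two
    acceptors in lockstep on matching labels and multiplies weights, so its
    weight on any string x is A(x) * B(x).
    """
    arcs_a, arcs_b = defaultdict(list), defaultdict(list)
    for arc in A['arcs']:
        arcs_a[arc[0]].append(arc)
    for arc in B['arcs']:
        arcs_b[arc[0]].append(arc)

    start = (A['start'], B['start'])
    product = {'start': start, 'final': {}, 'arcs': []}
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in A['final'] and qb in B['final']:
            product['final'][state] = A['final'][qa] * B['final'][qb]
        for (_, label_a, wa, da) in arcs_a[qa]:
            for (_, label_b, wb, db) in arcs_b[qb]:
                if label_a == label_b:
                    dst = (da, db)
                    product['arcs'].append((state, label_a, wa * wb, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return product
```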

221 Graphical Models over Strings 221 (Dreyer & Eisner, 2009) ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a Compute marginalized product of WFSA message μ k  α and WFST factor ψ α, with: domain ( compose (ψ α, μ k  α )) –compose : produces a new WFST with a distribution over (X i, X j ) –domain : marginalizes over X j to obtain a WFSA over X i only

222 Graphical Models over Strings 222 (Dreyer & Eisner, 2009) ψ1ψ1 X2X2 X1X1 r in g u e ε e e s e h a s in g r a n g u a e ε ε a r s a u r in g u e ε s e h a ψ1ψ1 X2X2 r in g u e ε e e s e h a r in g u e ε s e h a ψ1ψ1 ψ1ψ1 r in g u e ε e e s e h a All the message and belief computations simply reuse standard FSM dynamic programming algorithms.

223 The usual NLP toolbox 223 WFSA: weighted finite state automata WFST: weighted finite state transducer k -tape WFSM: weighted finite state machine jointly mapping between k strings They each assign a score to a set of strings. We can interpret them as factors in a graphical model. The only difference is the arity of the factor.

224 WFSA: weighted finite state automata WFST: weighted finite state transducer k-tape WFSM: weighted finite state machine jointly mapping between k strings WFSA as a Factor Graph 224 [Figure: variable X1 with a unary WFSA factor ψ1; e.g. x1 = “brechen”, ψ1(x1) = 4.25] A WFSA is a function which maps a string to a score.

225 WFSA: weighted finite state automata WFST: weighted finite state transducer k-tape WFSM: weighted finite state machine jointly mapping between k strings WFST as a Factor Graph 225 (Dreyer, Smith, & Eisner, 2008) [Figure: variables X1, X2 joined by a binary WFST factor ψ1; e.g. x1 = “brechen”, x2 = “bracht”, shown with an arc-by-arc alignment of the two strings] A WFST is a function that maps a pair of strings to a score.

226 k-tape WFSM as a Factor Graph 226 WFSA: weighted finite state automata WFST: weighted finite state transducer k-tape WFSM: weighted finite state machine jointly mapping between k strings [Figure: variables X1, X2, X3, X4 all joined by a single 4-tape WFSM factor ψ1 whose arcs jointly edit all four strings, defining ψ1(x1, x2, x3, x4)] A k-tape WFSM is a function that maps k strings to a score. What's wrong with a 100-tape WFSM for jointly modeling the 100 distinct forms of a Polish verb? – Each arc represents a 100-way edit operation – Too many arcs!

227 Factor Graphs over Multiple Strings P(x 1, x 2, x 3, x 4 ) = 1/Z ψ 1 (x 1, x 2 ) ψ 2 (x 1, x 3 ) ψ 3 (x 1, x 4 ) ψ 4 (x 2, x 3 ) ψ 5 (x 3, x 4 ) 227 ψ1ψ1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ5ψ5 X2X2 X1X1 X4X4 X3X3 (Dreyer & Eisner, 2009) Instead, just build factor graphs with WFST factors (i.e. factors of arity 2)

228 Factor Graphs over Multiple Strings P(x 1, x 2, x 3, x 4 ) = 1/Z ψ 1 (x 1, x 2 ) ψ 2 (x 1, x 3 ) ψ 3 (x 1, x 4 ) ψ 4 (x 2, x 3 ) ψ 5 (x 3, x 4 ) 228 ψ1ψ1 ψ2ψ2 ψ3ψ3 ψ4ψ4 ψ5ψ5 X2X2 X1X1 X4X4 X3X3 (Dreyer & Eisner, 2009) Instead, just build factor graphs with WFST factors (i.e. factors of arity 2)

229 BP for Coordination of Algorithms 229 T ψ2ψ2 T ψ4ψ4 T F F F la blanca casa T ψ2ψ2 T ψ4ψ4 T F F F the house white TT T TT T TT T

230 Section 5: What if even BP is slow? Computing fewer messages Computing them faster 230

231 Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you! 1.Models: Factor graphs can express interactions among linguistic structures. 2.Algorithm: BP estimates the global effect of these interactions on each variable, using local computations. 3.Intuitions: What’s going on here? Can we trust BP’s estimates? 4.Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors. 5.Tweaked Algorithm: Finish in fewer steps and make the steps faster. 6.Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions. 7.Software: Build the model you want! 231

233 Loopy Belief Propagation Algorithm 1. For every directed edge, initialize its message to the uniform distribution. 2. Repeat until all normalized beliefs converge: a. Pick a directed edge u → v. b. Update its message: recompute u → v from its “parent” messages v′ → u for v′ ≠ v. More efficient if u has high degree (e.g., CKYTree): Compute all outgoing messages u → … at once, based on all incoming messages … → u.

234 Loopy Belief Propagation Algorithm 1. For every directed edge, initialize its message to the uniform distribution. 2. Repeat until all normalized beliefs converge: a. Pick a directed edge u → v. b. Update its message: recompute u → v from its “parent” messages v′ → u for v′ ≠ v. Which edge do we pick and recompute? A “stale” edge?
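A sketch of this loop driven by a queue of stale directed edges, where `compute_message` is a placeholder for the variable-to-factor or factor-to-variable updates and should treat a `None` entry as a uniform message (all names here are illustrative):

```python
from collections import deque

def loopy_bp(edges, compute_message, tol=1e-6, max_updates=100000):
    """Asynchronous loopy BP over directed edges (u, v).

    `compute_message(u, v, messages)` recomputes the message u -> v from the
    current messages into u (its antecedents), returning a {value: prob} dict.
    An edge goes back on the queue only when one of its antecedents changes,
    so fresh messages are not recomputed needlessly.
    """
    messages = {e: None for e in edges}           # None = uniform initialization
    queue, queued = deque(edges), set(edges)
    updates = 0
    while queue and updates < max_updates:
        edge = queue.popleft()
        queued.discard(edge)
        u, v = edge
        new_msg = compute_message(u, v, messages)
        old = messages[edge]
        changed = old is None or max(
            abs(new_msg[x] - old.get(x, 0.0)) for x in new_msg) > tol
        messages[edge] = new_msg
        updates += 1
        if changed:
            # Every edge v -> w (w != u) is now stale: one of its antecedents changed.
            for (a, b) in messages:
                if a == v and b != u and (a, b) not in queued:
                    queue.append((a, b))
                    queued.add((a, b))
    return messages
```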

235 Message Passing in Belief Propagation 235 X Ψ … … … … My other factors think I’m a noun But my other variables and I think you’re a verb v 1 n 6 a 3 v 6 n 1 a 3

236 Stale Messages 236 X Ψ … … … … We update this message from its antecedents. Now it’s “fresh.” Don’t need to update it again. antecedents

237 Stale Messages 237 X Ψ … … … … antecedents But it again becomes “stale” – out of sync with its antecedents – if they change. Then we do need to revisit. We update this message from its antecedents. Now it’s “fresh.” Don’t need to update it again. The edge is very stale if its antecedents have changed a lot since its last update. Especially in a way that might make this edge change a lot.

238 Stale Messages 238 Ψ … … … … For a high-degree node that likes to update all its outgoing messages at once … We say that the whole node is very stale if its incoming messages have changed a lot.

239 Stale Messages 239 Ψ … … … … For a high-degree node that likes to update all its outgoing messages at once … We say that the whole node is very stale if its incoming messages have changed a lot.

240 Maintain a Queue of Stale Messages to Update [Figure: factor graph for “time flies like an arrow” with variables X1–X9] Initially all messages are uniform. Messages from variables are actually fresh (in sync with their uniform antecedents).

241 Maintain a Queue of Stale Messages to Update [Figure: factor graph for “time flies like an arrow” with variables X1–X9]

242 Maintain a Queue of Stale Messages to Update [Figure: factor graph for “time flies like an arrow” with variables X1–X9]

243 A Bad Update Order! time like flies anarrow X1X1 X2X2 X3X3 X4X4 X5X5 X0X0

244 Acyclic Belief Propagation 244 In a graph with no cycles: 1.Send messages from the leaves to the root. 2.Send messages from the root to the leaves. Each outgoing message is sent only after all its incoming messages have been received. In a graph with no cycles: 1.Send messages from the leaves to the root. 2.Send messages from the root to the leaves. Each outgoing message is sent only after all its incoming messages have been received. X1X1 X2X2 X3X3 X4X4 X5X5 time like flies anarrow X6X6 X8X8 X7X7 X9X9

245 Acyclic Belief Propagation 245 In a graph with no cycles: 1.Send messages from the leaves to the root. 2.Send messages from the root to the leaves. Each outgoing message is sent only after all its incoming messages have been received. In a graph with no cycles: 1.Send messages from the leaves to the root. 2.Send messages from the root to the leaves. Each outgoing message is sent only after all its incoming messages have been received. X1X1 X2X2 X3X3 X4X4 X5X5 time like flies anarrow X6X6 X8X8 X7X7 X9X9

246 Don’t update a message if its parents will get a big update. Otherwise, will have to re-update. Loopy Belief Propagation Asynchronous – Pick a directed edge: update its message – Or, pick a vertex: update all its outgoing messages at once 246 In what order do we send messages for Loopy BP? X1X1 X2X2 X3X3 X4X4 X5X5 time like flies anarrow X6X6 X8X8 X7X7 X9X9 Size. Send big updates first. Forces other messages to wait for them. Topology. Use graph structure. E.g., in an acyclic graph, a message can wait for all updates before sending. Wait for your parents

247 Message Scheduling Synchronous (SBP) – Compute all the messages – Send all the messages Asynchronous (ABP) – Pick an edge: compute and send that message Tree-based Reparameterization (TRP) – Successively update embedded spanning trees (Wainwright et al., 2001) – Choose spanning trees such that each edge is included in at least one Residual BP (RBP) – Pick the edge whose message would change the most if sent: compute and send that message (Elidan et al., 2006) 247 Figure from (Elidan, McGraw, & Koller, 2006) The order in which messages are sent has a significant effect on convergence

248 Message Scheduling Synchronous (SBP) – Compute all the messages – Send all the messages Asynchronous (ABP) – Pick an edge: compute and send that message Residual BP (RBP) – Pick the edge whose message would change the most if sent: compute and send that message (Elidan et al., 2006) Tree-based Reparameterization (TRP) – Successively update embedded spanning trees (Wainwright et al., 2001) – Choose spanning trees such that each edge is included in at least one Convergence rates: 248 The order in which messages are sent has a significant effect on convergence Figure from (Elidan, McGraw, & Koller, 2006)

249 Message Scheduling 249 (Jiang, Moon, Daumé III, & Eisner, 2013) Even better dynamic scheduling is possible by learning the heuristics for selecting the next message by reinforcement learning (RLBP).

250 Computing Variable Beliefs 250 X ring.4 rang 6 rung 0 Suppose… – X i is a discrete variable – Each incoming messages is a Multinomial Pointwise product is easy when the variable’s domain is small and discrete

251 Computing Variable Beliefs Suppose… – X i is a real-valued variable – Each incoming message is a Gaussian The pointwise product of n Gaussians is… …a Gaussian! 251 X
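Concretely (a standard identity, stated here for completeness): the product of Gaussian densities is again an unnormalized Gaussian whose precision is the sum of the precisions and whose mean is the precision-weighted average.

```latex
\prod_{i=1}^{n} \mathcal{N}(x;\, \mu_i, \sigma_i^2)
  \;\propto\; \mathcal{N}\!\left(x;\; \mu_\ast,\, \sigma_\ast^2\right),
\qquad
\frac{1}{\sigma_\ast^2} = \sum_{i=1}^{n} \frac{1}{\sigma_i^2},
\qquad
\mu_\ast = \sigma_\ast^2 \sum_{i=1}^{n} \frac{\mu_i}{\sigma_i^2}.
```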

252 Computing Variable Beliefs Suppose… – X i is a real-valued variable – Each incoming messages is a mixture of k Gaussians The pointwise product explodes! 252 X p(x) = p 1 (x) p 2 (x)…p n (x) ( 0.3 q 1,1 (x) q 1,2 (x)) ( 0.5 q 2,1 (x) q 2,2 (x))

253 Computing Variable Beliefs 253 X Suppose… – X i is a string-valued variable (i.e. its domain is the set of all strings) – Each incoming messages is a FSA The pointwise product explodes!

254 Example: String-valued Variables Messages can grow larger when sent through a transducer factor Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 254 X2X2 X1X1 ψ2ψ2 a a ε a a a a (Dreyer & Eisner, 2009) ψ1ψ1 a a

255 Example: String-valued Variables Messages can grow larger when sent through a transducer factor Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 255 X2X2 X1X1 ψ2ψ2 a ε a a a a (Dreyer & Eisner, 2009) ψ1ψ1 a a a a

256 Example: String-valued Variables Messages can grow larger when sent through a transducer factor Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 256 X2X2 X1X1 ψ2ψ2 ε a a a (Dreyer & Eisner, 2009) ψ1ψ1 a a a a a a a

257 Example: String-valued Variables Messages can grow larger when sent through a transducer factor Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 257 X2X2 X1X1 ψ2ψ2 ε a a a (Dreyer & Eisner, 2009) ψ1ψ1 a a a a a a a a

258 Example: String-valued Variables Messages can grow larger when sent through a transducer factor Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 258 X2X2 X1X1 ψ2ψ2 ε a a a (Dreyer & Eisner, 2009) ψ1ψ1 a a a a a a a a a a a a a a a a a

259 Example: String-valued Variables Messages can grow larger when sent through a transducer factor. Repeatedly sending messages through a transducer can cause them to grow to unbounded size! 259 [Figure: variables X1, X2 with WFST factors ψ1, ψ2 and a WFSA message that grows each round] (Dreyer & Eisner, 2009) The domain of these variables is infinite (i.e. Σ*); a WFSA’s representation is finite – but the size of the representation can grow. In cases where the domain of each variable is small and finite this is not an issue.

260 Message Approximations Three approaches to dealing with complex messages: 1.Particle Belief Propagation (see Section 3) 2.Message pruning 3.Expectation propagation 260

261 For real variables, try a mixture of K Gaussians: – E.g., true product is a mixture of K d Gaussians – Prune back: Randomly keep just K of them – Chosen in proportion to weight in full mixture – Gibbs sampling to efficiently choose them – What if incoming messages are not Gaussian mixtures? – Could be anything sent by the factors … – Can extend technique to this case. Message Pruning Problem: Product of d messages = complex distribution. – Solution: Approximate with a simpler distribution. – For speed, compute approximation without computing full product. 261 (Sudderth et al., 2002 –“Nonparametric BP”) X

262 Message Pruning 262 (Dreyer & Eisner, 2009) Problem: Product of d messages = complex distribution. – Solution: Approximate with a simpler distribution. – For speed, compute approximation without computing full product. For string variables, use a small finite set: – Each message µ_i gives positive probability to … – … every word in a 50,000 word vocabulary – … every string in ∑* (using a weighted FSA) – Prune back to a list L of a few “good” strings – Each message adds its own K best strings to L – For each x ∈ L, let µ(x) = ∏_i µ_i(x) – each message scores x – For each x ∉ L, let µ(x) = 0

263 Problem: Product of d messages = complex distribution. – Solution: Approximate with a simpler distribution. – For speed, compute approximation without computing full product. EP provides four special advantages over pruning: 1.General recipe that can be used in many settings. 2.Efficient. Uses approximations that are very fast. 3.Conservative. Unlike pruning, never forces b(x) to 0. Never kills off a value x that had been possible. 4.Adaptive. Approximates µ(x) more carefully if x is favored by the other messages. Tries to be accurate on the most “plausible” values. Expectation Propagation (EP) 263 (Minka, 2001)

264 Expectation Propagation (EP) X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple!

265 Expectation Propagation (EP) 265 Key idea: Approximate variable X’s incoming messages µ. We force them to have a simple parametric form: µ(x) = exp (θ ∙ f(x)) “log-linear model” (unnormalized), where f(x) extracts a feature vector from the value x. For each variable X, we’ll choose a feature function f. So by storing a few parameters θ, we’ve defined µ(x) for all x. Now the messages are super-easy to multiply: µ_1(x) µ_2(x) = exp (θ_1 ∙ f(x)) exp (θ_2 ∙ f(x)) = exp ((θ_1 + θ_2) ∙ f(x)). Represent a message by its parameter vector θ. To multiply messages, just add their θ vectors! So beliefs and outgoing messages also have this simple form. Maybe unnormalizable, e.g., initial message θ=0 is uniform “distribution”

266 Expectation Propagation X1X1 X2X2 X3X3 X4X4 X7X7 exponential-family approximations inside X5X5 Form of messages/beliefs at X 3 ? – Always µ(x)=exp (θ∙f(x)) If x is real: – Gaussian: Take f(x) = (x,x 2 ) If x is string: – Globally normalized trigram model: Take f(x) = (count of aaa, count of aab, … count of zzz) If x is discrete: – Arbitrary discrete distribution (can exactly represent original message, so we get ordinary BP) – Coarsened discrete distribution, based on features of x Can’t use mixture models, or other models that use latent variables to define µ(x) = ∑ y p(x, y)

267 Expectation Propagation X1X1 X2X2 X3X3 X4X4 X7X7 exponential-family approximations inside X5X5 Each message to X 3 is µ(x) = exp (θ ∙ f(x)) for some θ. We only store θ. To take a product of such messages, just add their θ – Easily compute belief at X 3 (sum of incoming θ vectors) – Then easily compute each outgoing message (belief minus one incoming θ) All very easy …
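A tiny sketch of that arithmetic, assuming each incoming message is stored only as its parameter vector θ (the numpy representation is just for illustration):

```python
import numpy as np

def ep_belief(theta_msgs):
    """Belief at a variable whose incoming messages are exponential-family,
    mu(x) = exp(theta . f(x)): multiplying messages is just adding their
    parameter vectors."""
    return np.sum(theta_msgs, axis=0)

def ep_outgoing(theta_msgs, i):
    """Outgoing message toward neighbor i: the belief with that neighbor's
    incoming parameters subtracted back out ("belief minus one incoming theta")."""
    return ep_belief(theta_msgs) - theta_msgs[i]
```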

268 Expectation Propagation X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 But what about messages from factors? – Like the message M 4. – This is not exponential family! Uh-oh! – It’s just whatever the factor happens to send. This is where we need to approximate, by µ 4. M4M4 µ4µ4

269 Expectation Propagation 269 [Figure: variable X3 with incoming messages µ1, µ2, µ3 and a factor message M4 approximated by µ4; blue = arbitrary distribution, green = simple distribution exp (θ ∙ f(x))] The belief at x “should” be p(x) = µ_1(x) ∙ µ_2(x) ∙ µ_3(x) ∙ M_4(x). But we’ll be using b(x) = µ_1(x) ∙ µ_2(x) ∙ µ_3(x) ∙ µ_4(x). Choose the simple distribution b that minimizes KL(p || b). That is, choose b that assigns high probability to samples from p. Find b’s params θ in closed form – or follow the gradient: E_{x~p}[f(x)] – E_{x~b}[f(x)]. Then, work backward from belief b to message µ_4. – Take the θ vector of b and subtract off the θ vectors of µ_1, µ_2, µ_3. – This chooses µ_4 to preserve the belief well.
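For completeness, the projection step can be written as moment matching (a standard derivation, not taken verbatim from the slides): minimizing KL(p || b) over an exponential-family b sets the expected features equal.

```latex
b(x) = \exp\big(\theta_b \cdot f(x)\big), \qquad
\nabla_{\theta_b}\, \mathrm{KL}(p \,\|\, b)
  = \mathbb{E}_{x\sim b}[f(x)] - \mathbb{E}_{x\sim p}[f(x)] = 0
\;\iff\;
\mathbb{E}_{x\sim b}[f(x)] = \mathbb{E}_{x\sim p}[f(x)].
```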

270 Expectation Propagation 270 (Hall & Klein, 2012) Example: Factored PCFGs Task: Constituency parsing, with factored annotations – Lexical annotations – Parent annotations – Latent annotations Approach: – Sentence specific approximation is an anchored grammar: q(A  B C, i, j, k) – Sending messages is equivalent to marginalizing out the annotations

271 Section 6: Approximation-aware Training 271

272 Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you! 1.Models: Factor graphs can express interactions among linguistic structures. 2.Algorithm: BP estimates the global effect of these interactions on each variable, using local computations. 3.Intuitions: What’s going on here? Can we trust BP’s estimates? 4.Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors. 5.Tweaked Algorithm: Finish in fewer steps and make the steps faster. 6.Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions. 7.Software: Build the model you want! 272

274 Training Thus far, we’ve seen how to compute (approximate) marginals, given a factor graph… …but where do the potential tables ψ_α come from? – Some have a fixed structure (e.g. Exactly1, CKYTree) – Others could be trained ahead of time (e.g. TrigramHMM) – For the rest, we define them parametrically and learn the parameters! 274 Two ways to learn: 1. Standard CRF Training (very simple; often yields state-of-the-art results) 2. ERMA (less simple; but takes approximations and loss function into account)

275 Standard CRF Parameterization Define each potential function in terms of a fixed set of feature functions: 275 Observed variables Predicted variables
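The formula on this slide was an image; a hedged reconstruction of the standard log-linear parameterization it refers to, with θ the shared parameter vector and f_α a feature function of the factor's predicted variables y_α and the observed input x, is:

```latex
\psi_\alpha(y_\alpha, x) \;=\; \exp\big(\theta \cdot f_\alpha(y_\alpha, x)\big),
\qquad
p_\theta(y \mid x) \;=\; \frac{1}{Z(x)} \prod_\alpha \psi_\alpha(y_\alpha, x).
```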

276 Standard CRF Parameterization Define each potential function in terms of a fixed set of feature functions: 276 timeflies like an arrow n ψ2ψ2 v ψ4ψ4 p ψ6ψ6 d ψ8ψ8 n ψ1ψ1 ψ3ψ3 ψ5ψ5 ψ7ψ7 ψ9ψ9

277 Standard CRF Parameterization Define each potential function in terms of a fixed set of feature functions: 277 n ψ1ψ1 ψ2ψ2 v ψ3ψ3 ψ4ψ4 p ψ5ψ5 ψ6ψ6 d ψ7ψ7 ψ8ψ8 n ψ9ψ9 time like flies anarrow np ψ 10 vp ψ 12 pp ψ 11 s ψ 13

278 What is Training? That’s easy: Training = picking good model parameters! But how do we know if the model parameters are any “good”? 278

279 Standard CRF Training Given labeled training examples: Maximize Conditional Log-likelihood: 279

280 Standard CRF Training Given labeled training examples: Maximize Conditional Log-likelihood: 280

281 Standard CRF Training Given labeled training examples: Maximize Conditional Log-likelihood: 281 We can approximate the factor marginals by the factor beliefs from BP!
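The gradient being followed here (a standard CRF identity, with BP's factor beliefs standing in for the exact factor marginals) is observed feature counts minus expected feature counts:

```latex
\frac{\partial}{\partial\theta} \log p_\theta(y^{(i)} \mid x^{(i)})
  \;=\; \sum_\alpha f_\alpha\big(y^{(i)}_\alpha, x^{(i)}\big)
  \;-\; \sum_\alpha \sum_{y_\alpha} p_\theta\big(y_\alpha \mid x^{(i)}\big)\, f_\alpha\big(y_\alpha, x^{(i)}\big).
```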

282 Stochastic Gradient Descent Input: – Training data, {(x (i), y (i) ) : 1 ≤ i ≤ N } – Initial model parameters, θ Output: – Trained model parameters, θ. Algorithm: While not converged: – Sample a training example (x (i), y (i) ) – Compute the gradient of log(p θ (y (i) | x (i) )) with respect to our model parameters θ. – Take a (small) step in the direction of the gradient. 282 (Stoyanov, Ropson, & Eisner, 2011)
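A minimal sketch of this loop (the `gradient_fn` callback and the sparse dict representation of θ are illustrative assumptions, e.g. a gradient computed with BP beliefs in place of exact marginals):

```python
import random

def sgd_train(data, theta, gradient_fn, lr=0.1, epochs=10):
    """Stochastic gradient ascent on the (approximate) conditional log-likelihood.

    `data` is a list of (x, y) pairs; `gradient_fn(x, y, theta)` returns the
    gradient for one example as a sparse {feature: value} dict.
    """
    for _ in range(epochs):
        for x, y in random.sample(data, len(data)):   # shuffled pass over the data
            g = gradient_fn(x, y, theta)
            for k, v in g.items():
                theta[k] = theta.get(k, 0.0) + lr * v  # small step in the gradient direction
    return theta
```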

283 What’s wrong with the usual approach? If you add too many features, your predictions might get worse! – Log-linear models used to remove features to avoid this overfitting – How do we fix it now? Regularization! If you add too many factors, your predictions might get worse! – The model might be better, but we replace the true marginals with approximate marginals (e.g. beliefs computed by BP) – But approximate inference can cause gradients for structured learning to go awry! (Kulesza & Pereira, 2008). 283

284 What’s wrong with the usual approach? Mistakes made by Standard CRF Training: 1.Using BP (approximate) 2.Not taking loss function into account 3.Should be doing MBR decoding Big pile of approximations… …which has tunable parameters. Treat it like a neural net, and run backprop! 284

285 Error Back-Propagation 285 Slide from (Stoyanov & Eisner, 2012)

294 Error Back-Propagation 294 y3y3 P(y 3 = noun |x) μ(y 1  y 2 )=μ(y 3  y 1 )*μ(y 4  y 1 ) ϴ Slide from (Stoyanov & Eisner, 2012)

295 Error Back-Propagation Applying the chain rule of derivation over and over. Forward pass: – Regular computation (inference + decoding) in the model (+ remember intermediate quantities). Backward pass: – Replay the forward pass in reverse computing gradients. 295

296 Empirical Risk Minimization as a Computational Expression Graph 296 Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). Decoder Module: takes marginals as input, and outputs a prediction. Loss Module: takes the prediction as input, and outputs a loss for the training example. Forward Pass: beliefs from model parameters → an output from beliefs → loss from an output. (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012)

297 Empirical Risk Minimization as a Computational Expression Graph 297 Decoder Module: takes marginals as input, and outputs a prediction. Loss Module: takes the prediction as input, and outputs a loss for the training example. Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012) Forward Pass

298 Empirical Risk Minimization under Approximations (ERMA) Input: – Training data, {(x (i), y (i) ) : 1 ≤ i ≤ N } – Initial model parameters, θ – Decision function (aka. decoder), f θ (x) – Loss function, L Output: – Trained model parameters, θ. Algorithm: While not converged: – Sample a training example (x (i), y (i) ) – Compute the gradient of L(f θ (x (i) ), y (i) ) with respect to our model parameters θ. – Take a (small) step in the direction of the gradient. 298 (Stoyanov, Ropson, & Eisner, 2011)

299 Empirical Risk Minimization under Approximations Input: – Training data, {(x (i), y (i) ) : 1 ≤ i ≤ N } – Initial model parameters, θ – Decision function (aka. decoder), f θ (x) – Loss function, L Output: – Trained model parameters, θ. Algorithm: While not converged: – Sample a training example (x (i), y (i) ) – Compute the gradient of L(f θ (x (i) ), y (i) ) with respect to our model parameters θ. – Take a (small) step in the direction of the gradient. 299 This section is about how to (efficiently) compute this gradient, by treating inference, decoding, and the loss function as a differentiable black-box. Figure from (Stoyanov & Eisner, 2012)

300 The Chain Rule 300 Version 1 and Version 2 of the chain rule (see below). Key idea: 1. Represent inference, decoding, and the loss function as a computational expression graph. 2. Then repeatedly apply the chain rule to compute the partial derivatives.
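The two versions on the slide were images; a hedged reconstruction of the two standard forms of the chain rule being contrasted (expanding the derivative through intermediate quantities versus the reverse-mode adjoint form propagated backward through the graph) is:

```latex
% Version 1: expand through intermediate quantities u_k.
\frac{\partial L}{\partial \theta_j}
  \;=\; \sum_{k} \frac{\partial L}{\partial u_k}\,\frac{\partial u_k}{\partial \theta_j}
\qquad\qquad
% Version 2: adjoint (reverse-mode) form, accumulating "d loss / d quantity" backward.
\bar{u}_k \;\equiv\; \frac{\partial L}{\partial u_k},
\qquad
\bar{\theta}_j \;=\; \sum_{k} \bar{u}_k\, \frac{\partial u_k}{\partial \theta_j}.
```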

301 Module-based AD The underlying idea goes by various names: – Described by Bottou & Gallinari (1991), as “A Framework for the Cooperation of Learning Algorithms” – Automatic Differentiation in the reverse mode – Backpropagation is a special case for Neural Network training. Define a set of modules, connected in a feed-forward topology (i.e. computational expression graph). Each module must define the following: – Input variables – Output variables – Forward pass: function mapping input variables to output variables – Backward pass: function mapping the adjoint of the output variables to the adjoint of the input variables. The forward pass computes the goal. The backward pass computes the partial derivative of the goal with respect to each parameter in the computational expression graph. 301 (Bottou & Gallinari, 1991)
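A sketch of such a module interface and of chaining the backward passes in reverse (the class and function names are illustrative, not Pacaya's or ERMA's API):

```python
class Module:
    """One node in the computational expression graph: concrete modules
    (loopy BP, decoder, loss) implement both passes."""

    def forward(self, inputs):
        """Map input variables to output variables; cache what backward() needs."""
        raise NotImplementedError

    def backward(self, output_adjoint):
        """Map d(goal)/d(outputs) to d(goal)/d(inputs) via the local Jacobian."""
        raise NotImplementedError


def run_backprop(modules, x):
    """Run the forward pass through the module chain, then replay it in reverse
    to obtain the adjoint of the original inputs (and, inside each module, of
    its own parameters)."""
    value = x
    for m in modules:
        value = m.forward(value)
    adjoint = 1.0                       # d(loss)/d(loss)
    for m in reversed(modules):
        adjoint = m.backward(adjoint)
    return value, adjoint
```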

302 Empirical Risk Minimization as a Computational Expression Graph 302 Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). Decoder Module: takes marginals as input, and outputs a prediction. Loss Module: takes the prediction as input, and outputs a loss for the training example. Forward Pass: beliefs from model parameters → an output from beliefs → loss from an output. Backward Pass: d loss / d output → d loss / d beliefs (by chain rule) → d loss / d model params (by chain rule). (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012)

303 Empirical Risk Minimization as a Computational Expression Graph 303 Decoder Module: takes marginals as input, and outputs a prediction. Loss Module: takes the prediction as input, and outputs a loss for the training example. Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012) Forward Pass Backward Pass

304 Empirical Risk Minimization as a Computational Expression Graph 304 Decoder Module: takes marginals as input, and outputs a prediction. Loss Module: takes the prediction as input, and outputs a loss for the training example. Loopy BP Module: takes the parameters as input, and outputs marginals (beliefs). (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012) Forward Pass Backward Pass

305 Loopy BP as a Computational Expression Graph 305 … … … …

306 Loopy BP as a Computational Expression Graph 306 … … … … We obtain a feed-forward (acyclic) topology for the graph by “unrolling” the message passing algorithm. This amounts to indexing each message with a timestamp.

307 Empirical Risk Minimization under Approximations (ERMA) 307 [2×2 table: Loss Aware × Approximation Aware] Neither: MLE. Loss-aware but not approximation-aware: SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], Softmax-margin [Gimpel & Smith, 2010]. Both loss-aware and approximation-aware: ERMA. Figure from (Stoyanov & Eisner, 2012)

308 Example: Congressional Voting 308 (Stoyanov & Eisner, 2012) Application: Task: predict representatives’ votes based on debates Novel training method: – Empirical Risk Minimization under Approximations (ERMA) – Loss-aware – Approximation-aware Findings: – On highly loopy graphs, significantly improves over (strong) loss-aware baseline

310 Section 7: Software 310

311 Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you! 1.Models: Factor graphs can express interactions among linguistic structures. 2.Algorithm: BP estimates the global effect of these interactions on each variable, using local computations. 3.Intuitions: What’s going on here? Can we trust BP’s estimates? 4.Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors. 5.Tweaked Algorithm: Finish in fewer steps and make the steps faster. 6.Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions. 7.Software: Build the model you want! 311

313 Pacaya 313 Features: structured loopy BP over factor graphs with discrete variables and structured constraint factors (e.g., a projective dependency tree constraint factor); coming soon: ERMA training with backpropagation through structured factors (Gormley, Dredze, & Eisner, in prep.). Language: Java. Authors: Gormley, Mitchell, & Wolfe. URL: (Gormley, Mitchell, Van Durme, & Dredze, 2014) (Gormley, Dredze, & Eisner, in prep.)

314 ERMA 314 Features: ERMA performs inference and training on CRFs and MRFs with arbitrary model structure over discrete variables. The training regime, Empirical Risk Minimization under Approximations, is loss-aware and approximation-aware. ERMA can optimize several loss functions, such as accuracy, MSE, and F-score. Language: Java. Authors: Stoyanov, Ropson, & Eisner. URL: https://sites.google.com/site/ermasoftware/ (Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012)

315 Graphical Models Libraries
Factorie (McCallum, Schultz, & Singh, 2012) is a Scala library allowing modular specification of inference, learning, and optimization methods. Inference algorithms include belief propagation and MCMC. Learning settings include maximum likelihood learning, maximum margin learning, learning with approximate inference, SampleRank, and pseudo-likelihood.
LibDAI (Mooij, 2010) is a C++ library that supports inference, but not learning, via Loopy BP, Fractional BP, Tree-Reweighted BP, (Double-loop) Generalized BP, variants of Loop-Corrected Belief Propagation, Conditioned Belief Propagation, and Tree Expectation Propagation.
OpenGM2 (Andres, Beier, & Kappes, 2012) provides a C++ template library for discrete factor graphs with support for learning and inference (including tie-ins to all LibDAI inference algorithms).
FastInf (Jaimovich, Meshi, McGraw, & Elidan) is an efficient approximate inference library in C++.
Infer.NET (Minka et al., 2012) is a .NET framework for graphical models with support for Expectation Propagation and Variational Message Passing.

316 References 316

317 M. Auli and A. Lopez, “A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 470–480. M. Auli and A. Lopez, “Training a Log-Linear Parser with Loss Functions via Softmax-Margin,” in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK., 2011, pp. 333–343. Y. Bengio, “Training a neural network with a financial criterion rather than a prediction criterion,” in Decision Technologies for Financial Engineering: Proceedings of the Fourth International Conference on Neural Networks in the Capital Markets (NNCM’96), World Scientific Publishing, 1997, pp. 36–48. D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Athena Scientific, L. Bottou and P. Gallinari, “A Framework for the Cooperation of Learning Algorithms,” in Advances in Neural Information Processing Systems, vol. 3, D. Touretzky and R. Lippmann, Eds. Denver: Morgan Kaufmann, R. Bunescu and R. J. Mooney, “Collective information extraction with relational Markov networks,” 2004, p. 438–es. C. Burfoot, S. Bird, and T. Baldwin, “Collective Classification of Congressional Floor-Debate Transcripts,” presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Techologies, 2011, pp. 1506–1515. D. Burkett and D. Klein, “Fast Inference in Phrase Extraction Models with Belief Propagation,” presented at the Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 29–38. T. Cohn and P. Blunsom, “Semantic Role Labelling with Tree Conditional Random Fields,” presented at the Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005, pp. 169–

318 F. Cromierès and S. Kurohashi, “An Alignment Algorithm Using Belief Propagation and a Structure-Based Distortion Model,” in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Athens, Greece, 2009, pp. 166–174. M. Dreyer, “A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings,” Johns Hopkins University, Baltimore, MD, USA, M. Dreyer and J. Eisner, “Graphical Models over Multiple Strings,” presented at the Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 101–110. M. Dreyer and J. Eisner, “Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model,” presented at the Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 616–627. J. Duchi, D. Tarlow, G. Elidan, and D. Koller, “Using Combinatorial Optimization within Max-Product Belief Propagation,” Advances in neural information processing systems, G. Durrett, D. Hall, and D. Klein, “Decentralized Entity-Level Modeling for Coreference Resolution,” presented at the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 114–124. G. Elidan, I. McGraw, and D. Koller, “Residual belief propagation: Informed scheduling for asynchronous message passing,” in Proceedings of the Twenty-second Conference on Uncertainty in AI (UAI, K. Gimpel and N. A. Smith, “Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 2010, pp. 733–736. J. Gonzalez, Y. Low, and C. Guestrin, “Residual splash for optimally parallelizing belief propagation,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 177–184. D. Hall and D. Klein, “Training Factored PCFGs with Expectation Propagation,” presented at the Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1146–

319 T. Heskes, “Stable fixed points of loopy belief propagation are minima of the Bethe free energy,” Advances in neural information processing systems, vol. 15, pp. 359–366, A. T. Ihler, J. W. Fisher III, A. S. Willsky, and D. M. Chickering, “Loopy belief propagation: convergence and effects of message errors.,” Journal of Machine Learning Research, vol. 6, no. 5, A. T. Ihler and D. A. Mcallester, “Particle belief propagation,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 256–263. J. Jancsary, J. Matiasek, and H. Trost, “Revealing the Structure of Medical Dictations with Conditional Random Fields,” presented at the Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008, pp. 1–10. J. Jiang, T. Moon, H. Daumé III, and J. Eisner, “Prioritized Asynchronous Belief Propagation,” in ICML Workshop on Inferning, A. Kazantseva and S. Szpakowicz, “Linear Text Segmentation Using Affinity Propagation,” presented at the Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 284–293. T. Koo and M. Collins, “Hidden-Variable Models for Discriminative Reranking,” presented at the Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, pp. 507–514. A. Kulesza and F. Pereira, “Structured Learning with Approximate Inference.,” in NIPS, 2007, vol. 20, pp. 785– 792. J. Lee, J. Naradowsky, and D. A. Smith, “A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 885–894. S. Lee, “Structured Discriminative Model For Dialog State Tracking,” presented at the Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 442–

320 X. Liu, M. Zhou, X. Zhou, Z. Fu, and F. Wei, “Joint Inference of Named Entity Recognition and Normalization for Tweets,” presented at the Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2012, pp. 526–535. D. J. C. MacKay, J. S. Yedidia, W. T. Freeman, and Y. Weiss, “A Conversation about the Bethe Free Energy and Sum-Product,” MERL, TR , A. Martins, N. Smith, E. Xing, P. Aguiar, and M. Figueiredo, “Turbo Parsers: Dependency Parsing by Approximate Variational Inference,” presented at the Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010, pp. 34–44. D. McAllester, M. Collins, and F. Pereira, “Case-Factor Diagrams for Structured Probabilistic Modeling,” in In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI’04), T. Minka, “Divergence measures and message passing,” Technical report, Microsoft Research, T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Uncertainty in Artificial Intelligence, 2001, vol. 17, pp. 362–369. M. Mitchell, J. Aguilar, T. Wilson, and B. Van Durme, “Open Domain Targeted Sentiment,” presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1643– K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999, pp. 467–475. T. Nakagawa, K. Inui, and S. Kurohashi, “Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables,” presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 786–794. J. Naradowsky, S. Riedel, and D. Smith, “Improving NLP through Marginalization of Hidden Syntactic Structure,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 810–820. J. Naradowsky, T. Vieira, and D. A. Smith, Grammarless Parsing for Joint Inference. Mumbai, India, J. Niehues and S. Vogel, “Discriminative Word Alignment via Alignment Matrix Modeling,” presented at the Proceedings of the Third Workshop on Statistical Machine Translation, 2008, pp. 18–

321 J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, X. Pitkow, Y. Ahmadian, and K. D. Miller, “Learning unbelievable probabilities,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 738–746. V. Qazvinian and D. R. Radev, “Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.,” presented at the Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 555–564. H. Ren, W. Xu, Y. Zhang, and Y. Yan, “Dialog State Tracking using Conditional Random Fields,” presented at the Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 457–461. D. Roth and W. Yih, “Probabilistic Reasoning for Entity & Relation Recognition,” presented at the COLING 2002: The 19th International Conference on Computational Linguistics, A. Rudnick, C. Liu, and M. Gasser, “HLTDI: CL-WSD Using Markov Random Fields for SemEval-2013 Task 10,” presented at the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 171–177. T. Sato, “Inside-Outside Probability Computation for Belief Propagation.,” in IJCAI, 2007, pp. 2605–2610. D. A. Smith and J. Eisner, “Dependency Parsing by Belief Propagation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Honolulu, 2008, pp. 145–156. V. Stoyanov and J. Eisner, “Fast and Accurate Prediction via Evidence-Specific MRF Structure,” in ICML Workshop on Inferning: Interactions between Inference and Learning, Edinburgh, V. Stoyanov and J. Eisner, “Minimum-Risk Training of Approximate CRF-Based NLP Systems,” in Proceedings of NAACL-HLT, 2012, pp. 120–

322 V. Stoyanov, A. Ropson, and J. Eisner, “Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, 2011, vol. 15, pp. 725–733. E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” MIT, Technical Report 2551, E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” in In Proceedings of CVPR, E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” Communications of the ACM, vol. 53, no. 10, pp. 95–103, C. Sutton and A. McCallum, “Collective Segmentation and Labeling of Distant Entities in Information Extraction,” in ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, C. Sutton and A. McCallum, “Piecewise Training of Undirected Models,” in Conference on Uncertainty in Artificial Intelligence (UAI), C. Sutton and A. McCallum, “Improved dynamic schedules for belief propagation,” UAI, M. J. Wainwright, T. Jaakkola, and A. S. Willsky, “Tree-based reparameterization for approximate inference on loopy graphs.,” in NIPS, 2001, pp. 1001–1008. Z. Wang, S. Li, F. Kong, and G. Zhou, “Collective Personal Profile Summarization with Social Networks,” presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 715–725. Y. Watanabe, M. Asahara, and Y. Matsumoto, “A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields,” presented at the Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP- CoNLL), 2007, pp. 649–

323 Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 736–744, J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Bethe free energy, Kikuchi approximations, and belief propagation algorithms,” MERL, TR , J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, Jul J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,” in NIPS, 2000, vol. 13, pp. 689– 695. J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” Exploring artificial intelligence in the new millennium, vol. 8, pp. 236–239, J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms,” MERL, TR , J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” Information Theory, IEEE Transactions on, vol. 51, no. 7, pp. 2282–2312,

