2 1 Dependency Parsing by Belief Propagation. David A. Smith (JHU → UMass Amherst) and Jason Eisner (Johns Hopkins University)

3 2 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

5 4 Word Dependency Parsing (slide adapted from Yuji Matsumoto). Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September. Part-of-speech tagging yields the POS-tagged sentence: PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP. Word dependency parsing then yields the word-dependency-parsed sentence, with arc labels such as SUBJ, ROOT, S-COMP, SPEC, MOD, COMP.

6 5 What does parsing have to do with belief propagation? Loopy belief propagation.

7 6 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

8 7 Great ideas in NLP: Log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972). In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … Each choice depends on a limited part of the history. But which dependencies to allow? What if they’re all worthwhile? p(D | A,B,C)? … or p(D | A,B) * p(C | A,B,D)?

9 8 Great ideas in NLP: Log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972). In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … Which dependencies to allow? (given limited training data) Solution: log-linear (max-entropy) modeling: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * … Throw them all in! Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
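
To make the factorization concrete, here is a minimal sketch (my own toy example, not from the talk) of a log-linear model over four binary variables A, B, C, D: the probability of an assignment is the product of arbitrary, overlapping factor potentials, divided by the normalizer Z. All potential values below are made up.

```python
import itertools

# Toy potentials (made up): each factor scores some subset of the variables.
factors = [
    (("A",),          lambda a: 2.0 if a else 1.0),
    (("B", "A"),      lambda b, a: 3.0 if b == a else 0.5),
    (("C", "B"),      lambda c, b: 2.0 if (c and b) else 1.0),
    (("D", "A", "C"), lambda d, a, c: 4.0 if (d and a and c) else 1.0),
]

def unnormalized(assign):
    """Product of all factor values on one assignment (a dict var -> bool)."""
    score = 1.0
    for vars_, phi in factors:
        score *= phi(*(assign[v] for v in vars_))
    return score

# Z sums the factor product over all 2^4 assignments (brute force is fine here).
variables = ["A", "B", "C", "D"]
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([False, True], repeat=len(variables)))

assign = {"A": True, "B": True, "C": False, "D": True}
print(unnormalized(assign) / Z)   # p(A=1, B=1, C=0, D=1)
```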

10 9 How about structured outputs? Log-linear models are great for n-way classification. Also good for predicting sequences (find preferred tags, e.g. v a n), but to allow fast dynamic programming, only use n-gram features. Also good for dependency parsing (find preferred links), but to allow fast dynamic programming or MST parsing, only use single-edge features.

11 10 How about structured outputs? Find preferred links, but to allow fast dynamic programming or MST parsing, only use single-edge features.

12 11 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”). Is this a good edge? Yes, lots of green...

13 12 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”). Is this a good edge? jasný → den (“bright day”)

14 13 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C. Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”)

15 14 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C. Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N

16 15 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C. Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N; A → N preceding conjunction

17 16 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C. How about this competing edge? Not as good, lots of red...

18 17 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C. How about this competing edge? jasný → hodiny (“bright clocks”)... undertrained...

19 18 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí třin. How about this competing edge? jasný → hodiny (“bright clocks”)... undertrained... jasn → hodi (“bright clock,” stems only)

20 19 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí třin. How about this competing edge? jasný → hodiny (“bright clocks”)... undertrained... jasn → hodi (“bright clock,” stems only); A plural → N singular

21 20 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí třin. How about this competing edge? jasný → hodiny (“bright clocks”)... undertrained... jasn → hodi (“bright clock,” stems only); A plural → N singular; A → N where N follows a conjunction

22 21 Edge-Factored Parsers (McDonald et al. 2005). Byl jasný studený dubnový den a hodiny odbíjely třináctou (“It was a bright cold day in April and the clocks were striking thirteen”), with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí třin. Which edge is better? “bright day” or “bright clocks”?

23 22 Edge-Factored Parsers (McDonald et al. 2005). Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with max total score.

24 23 Edge-Factored Parsers (McDonald et al. 2005). Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector; standard algorithms find the valid parse with max total score. Hard constraints: can’t have both (one parent per word); can’t have both (no crossing links); can’t have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges.
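
As a concrete illustration of the scoring rule, here is a small sketch with toy features and toy weights of my own (not the actual feature set of McDonald et al. 2005): the score of a candidate edge is the dot product of the weight vector θ with a sparse feature vector extracted from the head word, the dependent word, their stems, and their POS tags.

```python
# Each word is a (form, stem, pos) triple; feature templates and weights
# are invented for illustration only.
def edge_features(head, dep):
    return [
        f"form: {head[0]} ~ {dep[0]}",
        f"stem: {head[1]} ~ {dep[1]}",
        f"pos: {head[2]} ~ {dep[2]}",
        f"pos-form: {head[2]} ~ {dep[0]}",
    ]

theta = {                       # sparse weight vector; unseen features score 0
    "stem: den ~ jasn": 2.0,    # "bright day": frequently seen, good
    "pos: N ~ A": 1.5,          # noun head with adjective dependent: good
    "stem: hodi ~ jasn": -0.5,  # "bright clocks": undertrained, slightly bad
}

def edge_score(head, dep):
    return sum(theta.get(f, 0.0) for f in edge_features(head, dep))

den    = ("den", "den", "N")
hodiny = ("hodiny", "hodi", "N")
jasny  = ("jasný", "jasn", "A")
# The den/jasný edge outscores the hodiny/jasný edge (3.5 vs. 1.0).
print(edge_score(den, jasny), edge_score(hodiny, jasny))
```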

25 24 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

26 25 Finding Highest-Scoring Parse. The cat in the hat wore a stovepipe. (ROOT) Convert to a context-free grammar (CFG), then use dynamic programming; each subtree is a linguistic constituent (here a noun phrase). Let’s vertically stretch this graph drawing.

27 26 Finding Highest-Scoring Parse. Each subtree is a linguistic constituent (here a noun phrase). Convert to a context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n^3), but unfortunately O(n^5) in this case: to score the “cat → wore” link, it is not enough to know this is an NP; we must know it’s rooted at “cat”, so we expand the nonterminal set by O(n): {NP_the, NP_cat, NP_hat, ...}, and CKY’s “grammar constant” is no longer constant.

28 27 Finding Highest-Scoring Parse. Convert to a context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n^3), but unfortunately O(n^5) in this case. Solution: use a different decomposition (Eisner 1996), back to O(n^3).

29 28 Spans vs. constituents Two kinds of substring. »Constituent of the tree: links to the rest only through its headword (root). »Span of the tree: links to the rest only through its endwords. The cat in the hat wore a stovepipe. ROOT

30 Decomposing a tree into spans. The cat in the hat wore a stovepipe. (ROOT) The tree is built from spans that overlap only at their endwords, e.g. “cat in the hat wore” = “cat in” + “in the hat” + “hat wore”.

31 30 Finding Highest-Scoring Parse. Convert to a context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n^3), but unfortunately O(n^5) in this case. Solution: use a different decomposition (Eisner 1996), back to O(n^3). We can play the usual tricks for dynamic programming parsing: further refining the constituents or spans; allowing the probability model to keep track of even more internal information; A*, best-first, coarse-to-fine; training by EM etc., which require “outside” probabilities of constituents, spans, or links.
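
The following is a sketch of that O(n^3) decomposition in the first-order (edge-factored) setting, assuming a matrix of precomputed arc scores. It is my own compact rendering of the Eisner (1996) dynamic program: it returns only the best score (backpointers omitted), and like the basic algorithm it allows the artificial ROOT to take multiple children.

```python
import numpy as np

def eisner_best_score(score):
    """Projective first-order DP (Eisner 1996), O(n^3).
    score[h][m] = score of the arc h -> m; token 0 is the artificial ROOT.
    Returns the total score of the best projective tree (no backpointers)."""
    n = score.shape[0]
    # C[s][t][d]: best complete span s..t; I[s][t][d]: best incomplete span.
    # d = 0: head at the right end (arcs point left); d = 1: head at the left end.
    C = np.zeros((n, n, 2))
    I = np.full((n, n, 2), float("-inf"))
    for k in range(1, n):                      # span width
        for s in range(n - k):
            t = s + k
            # Join two smaller complete halves, then add the arc between s and t.
            join = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = join + score[t][s]    # arc t -> s
            I[s][t][1] = join + score[s][t]    # arc s -> t
            # Absorb an incomplete span into a larger complete span.
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
    return C[0][n - 1][1]                      # ROOT covers the whole sentence

# Toy 4-token example (ROOT + 3 words) with made-up arc scores.
S = np.array([[0., 5., 1., 1.],
              [0., 0., 4., 0.],
              [0., 2., 0., 3.],
              [0., 0., 2., 0.]])
print(eisner_best_score(S))    # 12.0: the chain ROOT -> 1 -> 2 -> 3
```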

32 31 Hard Constraints on Valid Trees. Score of an edge e = θ · features(e), where θ is our current weight vector; standard algorithms find the valid parse with max total score. Can’t have both (one parent per word); can’t have both (no crossing links); can’t have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges.

33 32 Non-Projective Parses. Can’t have both (no crossing links): the “projectivity” restriction. Do we really want it? Example: I ’ll give a talk tomorrow on bootstrapping (ROOT). The subtree rooted at “talk” is a discontiguous noun phrase.

34 33 Non-Projective Parses. Occasional non-projectivity in English: I ’ll give a talk tomorrow on bootstrapping (ROOT). Frequent non-projectivity in Latin, etc.: ista meam norit gloria canitiem (ROOT), glossed that-NOM my-ACC may-know glory-NOM going-gray-ACC, i.e. “That glory may-know my going-gray” (it shall last till I go gray).

35 34 Finding highest-scoring non-projective tree. Consider the sentence “John saw Mary” (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective, in time O(n^2). (Figure: a weighted graph over root, John, saw, Mary, and its maximum-weight spanning tree; slide thanks to Dragomir Radev.) Every node selects its best parent; if there are cycles, contract them and repeat.
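
Below is a small sketch (my own code, not a full implementation) of just the step the slide describes: every non-root node greedily selects its best parent, and we check whether the result already forms a tree or contains a cycle. The contract-and-repeat part of Chu-Liu-Edmonds is omitted, and the edge weights are made up, loosely in the spirit of the John/saw/Mary example.

```python
import numpy as np

def greedy_parents(score):
    """First step of Chu-Liu-Edmonds only: each non-root node picks its
    highest-scoring parent. Returns (parent array, cycle or None); the
    contraction-and-recursion that full CLE performs on a cycle is omitted."""
    n = score.shape[0]                     # node 0 is the root
    parent = np.zeros(n, dtype=int)
    for m in range(1, n):
        cand = score[:, m].copy()
        cand[m] = -np.inf                  # no self-loops
        parent[m] = int(np.argmax(cand))
    for start in range(1, n):              # walk up; a revisited node = cycle
        seen, v = set(), start
        while v != 0 and v not in seen:
            seen.add(v)
            v = int(parent[v])
        if v != 0:                         # extract the cycle containing v
            cycle, u = {v}, int(parent[v])
            while u != v:
                cycle.add(u)
                u = int(parent[u])
            return parent, cycle
    return parent, None                    # already a spanning tree

# Made-up scores over (root, John, saw, Mary); -inf marks impossible edges.
S = np.array([[-np.inf,  9., 10.,  9.],
              [-np.inf, -np.inf, 30., 11.],
              [-np.inf, 20., -np.inf, 30.],
              [-np.inf,  0.,  3., -np.inf]])
print(greedy_parents(S))   # John and saw pick each other: a cycle {1, 2}
```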

36 35 Summing over all non-projective trees. Finding the highest-scoring non-projective tree: the Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree, which may be non-projective, in time O(n^2). How about the total weight Z of all trees? How about outside probabilities or gradients? These can be found in time O(n^3) by matrix determinants and inverses (Smith & Smith, 2007). (slide thanks to Dragomir Radev)

37 36 Graph Theory to the Rescue! Tutte’s Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G, without row and column r, is equal to the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need! O(n^3) time!

38 37 Building the Kirchhoff (Laplacian) Matrix: negate edge scores; sum columns (children) onto the diagonal; strike the root row/column; take the determinant. N.B.: This allows multiple children of root, but see Koo et al. 2007.
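
Here is a sketch of that recipe in code, with toy weights of my own; the construction follows the slide (negate edge scores off the diagonal, put column sums on the diagonal, strike the root row and column, take the determinant), together with a brute-force enumeration that checks the determinant really equals the total weight of all trees rooted at node 0.

```python
import numpy as np
from itertools import product

def matrix_tree_Z(w):
    """w[h, m] > 0: weight of edge h -> m; node 0 is the root.
    Returns the total weight of all directed spanning trees rooted at 0."""
    n = w.shape[0]
    L = np.zeros((n, n))
    for m in range(1, n):            # one column per non-root word
        for h in range(n):
            if h != m:
                L[h, m] = -w[h, m]   # negated edge score
                L[m, m] += w[h, m]   # column sum on the diagonal
    return np.linalg.det(L[1:, 1:])  # strike root row/column, take determinant

def brute_force_Z(w):
    """Check: enumerate every parent assignment and keep the legal trees."""
    n, total = w.shape[0], 0.0
    for parents in product(range(n), repeat=n - 1):   # parent of words 1..n-1
        if any(h == m for m, h in enumerate(parents, start=1)):
            continue
        ok = True
        for m in range(1, n):        # every word must reach the root (no cycles)
            seen, v = set(), m
            while v != 0 and v not in seen:
                seen.add(v)
                v = parents[v - 1]
            ok = ok and (v == 0)
        if ok:
            total += np.prod([w[h, m] for m, h in enumerate(parents, start=1)])
    return total

W = np.random.default_rng(0).uniform(0.5, 2.0, size=(4, 4))   # toy weights
print(matrix_tree_Z(W), brute_force_Z(W))                     # the two agree
```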

39 38 Why Should This Work? Chu-Liu-Edmonds analogy: Every node selects best parent If cycles, contract and recur Clear for 1x1 matrix; use induction Undirected case; special root cases for directed

40 39 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

41 40 Exactly Finding the Best Parse. With arbitrary features, runtime blows up. Projective parsing: O(n^3) by dynamic programming; non-projective: O(n^2) by minimum spanning tree. But to allow fast dynamic programming or MST parsing, only use single-edge features (find preferred links). With richer features: O(n^4) for grandparents; O(n^5) for grandparents + sibling bigrams; O(n^3 g^6) for POS trigrams; …; O(2^n) for non-adjacent sibling pairs; and NP-hard for any of the above features plus soft penalties for crossing links, or pretty much anything else!

42 41 Let’s reclaim our freedom (again!): this paper in a nutshell. Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * … Throw in any factors we want! (log-linear model) How could we find the best parse? Integer linear programming (Riedel et al., 2006), but it doesn’t give us probabilities when training or parsing. MCMC, but it may be slow to mix, with a high rejection rate because of the hard TREE constraint. Greedy hill-climbing (McDonald & Pereira 2006). None of these exploit the tree structure of parses as the first-order methods do.

43 42 Let’s reclaim our freedom (again!): this paper in a nutshell. Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * … Throw in any factors we want! (log-linear model) Let local factors negotiate via “belief propagation”: links (and tags) reinforce or suppress one another. Certain global factors are OK too; each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside). Each iteration takes total time O(n^2) or O(n^3), and converges to a pretty good (but approximate) global parse.

44 43 Let’s reclaim our freedom (again!): this paper in a nutshell (new!). An analogy: training with many features uses iterative scaling, where each weight in turn is influenced by others, we iterate to achieve globally optimal weights, and to train a distribution over trees we use dynamic programming to compute the normalizer Z; decoding with many features uses belief propagation, where each variable in turn is influenced by others, we iterate to achieve locally consistent beliefs, and to decode a distribution over trees we use dynamic programming to compute messages.

45 44 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

46 45 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging. Observed input sentence (shaded): … find preferred tags … A possible tagging (i.e., an assignment to the remaining variables): v v v.

47 46 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging. Observed input sentence (shaded): … find preferred tags … Another possible tagging: v a n.

48 47 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). A “binary” factor measures the compatibility of 2 adjacent tags, e.g. the table with rows v: (0, 2, 1), n: (2, 1, 0), a: (0, 3, 1) over columns v, n, a. The model reuses the same parameters at this position.

49 48 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). A “unary” factor evaluates this tag; its values depend on the corresponding word, e.g. v 0.2, n …, a 0 (can’t be an adjective).

50 49 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). A “unary” factor evaluates this tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence).

51 50 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). A “unary” factor evaluates this tag; there is a different unary factor at each position, e.g. (v 0.3, n 0.02, a 0) and (v 0.3, n 0, a 0.1).

52 51 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). p(v a n) is proportional to the product of all factors’ values on v a n: the binary factors with rows v: (0, 2, 1), n: (2, 1, 0), a: (0, 3, 1), and the unary factors (v 0.3, n 0.02, a 0), (v 0.3, n 0, a 0.1), (v 0.2, n …, a 0).

53 52 Local factors in a graphical model. First, a familiar example: a Conditional Random Field (CRF) for POS tagging (… find preferred tags …). p(v a n) is proportional to the product of all factors’ values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 …
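
As a sanity check on that product, here is a toy re-creation in code. The factor tables are only partly legible in this transcript, so the word-to-table assignment and one missing unary entry below are my own reconstruction, chosen so that the product for the tagging v a n comes out to the slide’s 1 * 3 * 0.3 * 0.1 * 0.2.

```python
# Binary factor (compatibility of two adjacent tags), rows = left tag,
# columns = right tag, values copied from the slides.
binary = {
    ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
    ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
    ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
}
# Unary factors, one per word; the assignment of tables to words and the
# n-value for "tags" are assumptions made for this sketch.
unary = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
    "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
}

def unnormalized(words, tags):
    """Product of all unary and binary factor values on one tagging."""
    score = 1.0
    for w, t in zip(words, tags):
        score *= unary[w][t]
    for left, right in zip(tags, tags[1:]):
        score *= binary[(left, right)]
    return score

words = ["find", "preferred", "tags"]
print(unnormalized(words, ("v", "a", "n")))   # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
```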

54 53 Local factors in a graphical model. First, a familiar example: a CRF for POS tagging. Now let’s do dependency parsing! O(n^2) boolean variables for the possible links (… find preferred links …).

55 54 Local factors in a graphical model. Now let’s do dependency parsing! O(n^2) boolean variables for the possible links (… find preferred links …). A possible parse, encoded as an assignment to these variables: t f f t f f.

56 55 Local factors in a graphical model. Now let’s do dependency parsing! O(n^2) boolean variables for the possible links (… find preferred links …). Another possible parse: f f t f t f.

57 56 Local factors in a graphical model. Now let’s do dependency parsing! O(n^2) boolean variables for the possible links (… find preferred links …). An illegal parse, e.g. f t t t f f: it contains a cycle.

58 57 Local factors in a graphical model. Now let’s do dependency parsing! O(n^2) boolean variables for the possible links (… find preferred links …). Another illegal parse, e.g. t t t t f f: a word with multiple parents.

59 58 Local factors for parsing (… find preferred links …). So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation, e.g. tables like (t 2, f 1), (t 1, f 2), (t 1, f 6), (t 1, f 3), (t 1, f 8): as before, the goodness of a link can depend on the entire observed input context, and some other links aren’t as good given this input sentence. But what if the best assignment isn’t a tree??

60 59 Global factors for parsing (… find preferred links …). So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation, plus a global TREE factor to require that the links form a legal tree. This is a “hard constraint”: the factor is either 0 or 1, e.g. ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0.

61 60 Global factors for parsing (… find preferred links …). Unary factors to evaluate each link in isolation, plus a global TREE factor (a hard 0/1 constraint) with 64 entries over these 6 link variables; the assignment shown (t f f t f f) satisfies it: we’re legal! We can optionally require the tree to be projective (no crossing links). So far, this is equivalent to edge-factored parsing (McDonald et al. 2005). Note: McDonald et al. (2005) don’t loop through this table to consider exponentially many trees one at a time; they use combinatorial algorithms, and so should we!
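
To make the hard constraint concrete, here is a small sketch (my own illustration) of the TREE factor as a 0/1 function of a link assignment: it returns 1 exactly when the selected links give every word one parent, contain no cycle, and leave the ROOT parentless. As the slide notes, one never actually loops over all configurations; combinatorial algorithms handle this factor.

```python
def tree_factor(links, n):
    """links: set of (head, child) pairs over nodes 0..n-1, with 0 = ROOT.
    Returns 1 if the links form a legal dependency tree, else 0."""
    parents = {}
    for h, m in links:
        if m == 0 or m in parents:        # ROOT has no parent; one parent each
            return 0
        parents[m] = h
    if len(parents) != n - 1:             # every word needs exactly one parent
        return 0
    for m in range(1, n):                 # no cycles: every word reaches ROOT
        seen, v = set(), m
        while v != 0 and v not in seen:
            seen.add(v)
            v = parents[v]
        if v != 0:
            return 0
    return 1

print(tree_factor({(0, 1), (1, 2), (1, 3)}, 4))   # a legal tree -> 1
print(tree_factor({(0, 1), (2, 3), (3, 2)}, 4))   # contains a cycle -> 0
```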

62 61 Local factors for parsing (… find preferred links …). Unary factors to evaluate each link in isolation; the global TREE factor; and second-order effects, factors on 2 variables, e.g. a grandparent factor with values (f,f) 1, (f,t) 1, (t,f) 1, (t,t) 3.

63 62 Local factors for parsing (… find preferred links …). Second-order effects, factors on 2 variables: grandparent; no-cross, e.g. a factor between two crossing links with values (f,f) 1, (f,t) 1, (t,f) 1, (t,t) 0.2.

64 63 Local factors for parsing (… find preferred links …). So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation; a global TREE factor to require that the links form a legal tree (a “hard constraint”: the factor is either 0 or 1); and second-order effects, factors on 2 variables: grandparent, no-cross, siblings, hidden POS tags, subcategorization, …

65 64 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

66 65 Good to have lots of features, but … nice model, shame about the NP-hardness. Can we approximate? Machine learning to the rescue! The ML community has given a lot to NLP; in the 2000’s, NLP has been giving back to ML, mainly techniques for joint prediction of structures. (Much earlier, speech recognition had HMMs, EM, smoothing …)

67 66 Great Ideas in ML: Message Passing (adapted from MacKay (2003) textbook). Count the soldiers: soldiers standing in a line pass messages forward (“1 before you”, “2 before you”, …) and backward (“1 behind you”, “2 behind you”, …), and each soldier contributes “there’s 1 of me”.

68 67 Great Ideas in ML: Message Passing (adapted from MacKay (2003) textbook). Count the soldiers: a soldier who only sees the incoming messages “2 before you” and “3 behind you”, plus “there’s 1 of me”, forms the belief: must be 2 + 1 + 3 = 6 of us.

69 68 Great Ideas in ML: Message Passing (adapted from MacKay (2003) textbook). Count the soldiers: the next soldier hears “1 before you” and “4 behind you”, plus “there’s 1 of me”; belief: must be 1 + 1 + 4 = 6 of us.

70 69 Great Ideas in ML: Message Passing 69 7 here 3 here 11 here (= 7+3+1) 1 of me Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

71 70 Great Ideas in ML: Message Passing 70 3 here 7 here (= 3+3+1) Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

72 71 Great Ideas in ML: Message Passing 71 7 here 3 here 11 here (= 7+3+1) Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

73 72 Great Ideas in ML: Message Passing 72 7 here 3 here Belief: Must be 14 of us Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

74 73 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 73 7 here 3 here Belief: Must be 14 of us wouldn’t work correctly with a “loopy” (cyclic) graph adapted from MacKay (2003) textbook
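
In code, the soldier-counting trick is just sum-product message passing on a tree in which every message is a partial count; the toy tree below is my own example.

```python
from collections import defaultdict

edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]   # an arbitrary 6-node tree
neighbors = defaultdict(list)
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

def message(src, dst):
    """'How many of us are on my side of the (src, dst) edge?' =
    1 (for src itself) plus the reports from src's other branches."""
    return 1 + sum(message(nb, src) for nb in neighbors[src] if nb != dst)

def belief(node):
    """1 (for the node itself) plus all incoming messages = total count."""
    return 1 + sum(message(nb, node) for nb in neighbors[node])

print([belief(v) for v in range(6)])   # every node concludes: 6 of us
# On a "loopy" (cyclic) graph this recursion would never terminate,
# which is exactly why loopy BP only gives an approximation.
```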

75 74 Great ideas in ML: Forward-Backward. In the CRF (… find preferred tags …), message passing = forward-backward. The α and β messages are vectors over tags, e.g. α = (v 2, n 1, a 7) and β = (v 3, n 1, a 6) at one position; multiplying α, the unary factor (v 0.3, n 0, a 0.1), and β gives the belief (v 1.8, n 0, a 4.2). The binary factor with rows v: (0, 2, 1), n: (2, 1, 0), a: (0, 3, 1) converts messages between adjacent positions; other messages shown include (v 7, n 2, a 1) and (v 3, n 6, a 1).
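
Here is a compact sketch of forward-backward as message passing on a 3-word chain, using the binary factor from the slides and toy unary factors (the unary values are partly assumed, as above): α flows left to right, β right to left, and the belief at each position is α × unary × β, normalized into tag marginals.

```python
import numpy as np

# Tag order everywhere: v, n, a.
T = np.array([[0., 2., 1.],       # binary factor from the slides:
              [2., 1., 0.],       # rows = left tag, columns = right tag
              [0., 3., 1.]])
U = np.array([[0.3, 0.02, 0.0],   # unary factor at each of the 3 positions
              [0.3, 0.0,  0.1],   # (toy values, partly assumed)
              [0.2, 0.2,  0.0]])

n = U.shape[0]
alpha = np.zeros((n, 3))
beta = np.zeros((n, 3))
alpha[0] = 1.0
beta[-1] = 1.0
for i in range(1, n):                       # forward messages
    alpha[i] = (alpha[i - 1] * U[i - 1]) @ T
for i in range(n - 2, -1, -1):              # backward messages
    beta[i] = T @ (U[i + 1] * beta[i + 1])

belief = alpha * U * beta                   # unnormalized tag marginals
print(np.round(belief / belief.sum(axis=1, keepdims=True), 3))
```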

76 75 Great ideas in ML: Forward-Backward (… find preferred tags …). Extend the CRF to a “skip chain” to capture a non-local factor: more influences on the belief. The extra incoming message (v 3, n 1, a 6) now multiplies into the belief along with α = (v 2, n 1, a 7), the unary factor (v 0.3, n 0, a 0.1), and β = (v 3, n 1, a 6), giving (v 5.4, n 0, a 25.2).

77 76 Great ideas in ML: Forward-Backward (… find preferred tags …). Extend the CRF to a “skip chain” to capture a non-local factor: more influences on the belief, but the graph becomes loopy. The belief is now (v 5.4, n 0, a 25.2), computed from red messages that are not independent. Not independent? Pretend they are!

78 77 Two great tastes that taste great together You got dynamic programming in my belief propagation! You got belief propagation in my dynamic programming! Upcoming attractions …

79 78 Loopy Belief Propagation for Parsing (… find preferred links …). The sentence tells word 3, “Please be a verb.” Word 3 tells the 3 → 7 link, “Sorry, then you probably don’t exist.” The 3 → 7 link tells the TREE factor, “You’ll have to find another parent for 7.” The TREE factor tells the 10 → 7 link, “You’re on!” The 10 → 7 link tells 10, “Could you please be a noun?” …

80 79  Higher-order factors (e.g., Grandparent) induce loops Let’s watch a loop around one triangle … Strong links are suppressing or promoting other links … 79 Loopy Belief Propagation for Parsing find preferred links … …

81 80 Loopy Belief Propagation for Parsing (… find preferred links …). Higher-order factors (e.g., Grandparent) induce loops; let’s watch a loop around one triangle. How did we compute the outgoing message to the green link? “Does the TREE factor (the 0/1 table over link configurations) think that the green link is probably t, given the messages it receives from all the other links?”

82 81 Loopy Belief Propagation for Parsing (… find preferred links …). How did we compute the outgoing message to the green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?” But this is the outside probability of the green link! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n^3) time by inside-outside. Non-projective: total O(n^3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).

83 82 Loopy Belief Propagation for Parsing. How did we compute the outgoing message to the green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?” But this is the outside probability of the green link! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n^3) time by inside-outside. Non-projective: total O(n^3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007). Belief propagation assumes incoming messages to TREE are independent, so outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
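
The non-projective case can be sketched as follows (my own reconstruction of the standard matrix-tree marginal computation in the spirit of Smith & Smith 2007, with made-up edge weights): one inverse of the root-struck Kirchhoff matrix yields the marginal probability of every edge at once, which is the kind of quantity the TREE factor needs to send out, in O(n^3) total.

```python
import numpy as np

def edge_marginals(w):
    """w[h, m]: weight of edge h -> m; node 0 is the root.
    Returns mu[h, m] = probability that edge h -> m appears in a tree drawn
    with probability proportional to the product of its edge weights."""
    n = w.shape[0]
    L = np.zeros((n, n))
    for m in range(1, n):                  # Kirchhoff matrix, as before
        for h in range(n):
            if h != m:
                L[h, m] -= w[h, m]
                L[m, m] += w[h, m]
    Linv = np.linalg.inv(L[1:, 1:])        # strike root row/col, invert once
    mu = np.zeros((n, n))
    for m in range(1, n):
        mu[0, m] = w[0, m] * Linv[m - 1, m - 1]
        for h in range(1, n):
            if h != m:
                mu[h, m] = w[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return mu

W = np.random.default_rng(1).uniform(0.5, 2.0, size=(4, 4))   # toy weights
mu = edge_marginals(W)
print(np.round(mu, 3))
print(mu.sum(axis=0))   # each word's parent probabilities sum to 1
```

In the projective case, the corresponding outgoing quantities come from inside-outside probabilities instead of a matrix inverse.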

84 83 Some connections … Parser stacking (Nivre & McDonald 2008; Martins et al. 2008). Global constraints in arc consistency: the ALLDIFFERENT constraint (Régin 1994). Matching constraint in max-product BP: used for computer vision (Duchi et al., 2006); could be used for machine translation. As far as we know, our parser is the first use of global constraints in sum-product BP.

85 84 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

86 85 Runtimes for each factor type, per iteration (see paper). The runtimes are additive, not multiplicative!
Tree: degree O(n^2), runtime O(n^3), count 1
Proj. Tree: degree O(n^2), runtime O(n^3), count 1
Individual links: degree 1, runtime O(1), count O(n^2)
Grandparent: degree 2, runtime O(1), count O(n^3)
Sibling pairs: degree 2, runtime O(1), count O(n^3)
Sibling bigrams: degree O(n), runtime O(n^2), count O(n), total O(n^3)
NoCross: degree O(n), count O(n^2), total O(n^3)
Tag: degree 1, runtime O(g), count O(n)
TagLink: degree 3, runtime O(g^2), count O(n^2)
TagTrigram: degree O(n), runtime O(ng^3), count 1, total O(n)
TOTAL: O(n^3) per iteration

87 86 Runtimes for each factor type, per iteration (see paper): the same table as on the previous slide; additive, not multiplicative. Each “global” factor coordinates an unbounded # of variables, so standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

88 87 Experimental Details. Decoding: run several iterations of belief propagation; get final beliefs at link variables; feed them into a first-order parser. This gives the Minimum Bayes Risk tree (minimizes expected error). Training: BP computes beliefs about each factor, too, which gives us gradients for maximum conditional likelihood (as in the forward-backward algorithm). Features used in experiments: first-order, individual links just as in McDonald et al. 2005; higher-order, Grandparent, Sibling bigrams, NoCross.

89 88 Dependency Accuracy. The extra, higher-order features help! (non-projective parsing)
              Danish  Dutch  English
Tree+Link      85.5    87.3    88.6
+NoCross       86.1    88.3    89.1
+Grandparent   86.1    88.6    89.4
+ChildSeq      86.5    88.5    90.1

90 89 Dependency Accuracy. The extra, higher-order features help! (non-projective parsing)
              Danish  Dutch  English
Tree+Link      85.5    87.3    88.6
+NoCross       86.1    88.3    89.1
+Grandparent   86.1    88.6    89.4
+ChildSeq      86.5    88.5    90.1
Best projective parse with all factors (exact, but slow): 86.0, 84.5, 90.2
+hill-climbing (doesn’t fix enough edges): 86.1, 87.6, 90.2

91 90 Time vs. Projective Search Error. (Plots: BP iterations compared with the O(n^4) DP and with the O(n^5) DP.)

92 91 Runtime: BP vs. DP. (Plots: vs. the O(n^4) DP and vs. the O(n^5) DP.)

93 92 Outline. Edge-factored parsing (old): dependency parses; scoring the competing parses: edge features; finding the best parse. Higher-order parsing (new!): throwing in more features: graphical models; finding the best parse: belief propagation; experiments. Conclusions.

94 93 Freedom Regained: this paper in a nutshell. Output probability is defined as a product of local and global factors: throw in any factors we want! (log-linear model) Each factor must be fast, but they run independently. Let local factors negotiate via “belief propagation”: each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming. Each iteration takes total time O(n^3) or even O(n^2); see paper (compare reranking or stacking). Converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

95 94 Future Opportunities Efficiently modeling more hidden structure  POS tags, link roles, secondary links (DAG-shaped parses) Beyond dependencies  Constituency parsing, traces, lattice parsing Beyond parsing  Alignment, translation  Bipartite matching and network flow  Joint decoding of parsing and other tasks (IE, MT, reasoning...) Beyond text  Image tracking and retrieval  Social networks

96 95 thank you

