
1 David A. Smith (JHU → UMass Amherst), Jason Eisner (Johns Hopkins University): Dependency Parsing by Belief Propagation

2 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

3 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

4 Word Dependency Parsing (slide adapted from Yuji Matsumoto). Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September. Part-of-speech tagging gives the POS-tagged sentence: He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./. Word dependency parsing then gives the dependency-parsed sentence, with arcs labeled with relations such as SUBJ, ROOT, S-COMP, SPEC, MOD, COMP.

5 What does parsing have to do with belief propagation? The phrase "loopy belief propagation" itself, shown as a dependency parse.

6 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

7 Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972) In the beginning, we used generative models. p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … each choice depends on a limited part of the history, but which dependencies to allow? what if they're all worthwhile? p(D | A,B,C)? … p(D | A,B) * p(C | A,B,D)?

8 Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972) In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … but which dependencies to allow? (given limited training data) Solution: Log-linear (max-entropy) modeling. Throw them all in! (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …  Features may interact in arbitrary ways  Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
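To make the (1/Z) * Φ(A) * Φ(B,A) * … formula concrete, here is a toy Python sketch (not from the slides; the binary features and their weights are invented) of a log-linear model over four binary variables, where each factor contributes exp(weight) when its feature fires and Z sums over all assignments:

import itertools, math

# Toy log-linear (max-entropy) model over four binary variables A, B, C, D.
# Every factor is Phi(...) = exp(weight) when its feature fires; weights are made up.
weights = {
    ("A",): 0.5, ("B", "A"): 1.2, ("C", "A"): -0.3,
    ("C", "B"): 0.7, ("D", "A", "B"): 0.1, ("D", "B", "C"): 0.4,
}

def unnormalized_score(assign):
    """Product of all factor values on one assignment (exp of the summed fired weights)."""
    total = 0.0
    for variables, w in weights.items():
        if all(assign[v] for v in variables):   # a simple "all these variables are 1" feature
            total += w
    return math.exp(total)

assignments = [dict(zip("ABCD", bits)) for bits in itertools.product([0, 1], repeat=4)]
Z = sum(unnormalized_score(a) for a in assignments)              # the normalizer
p = {tuple(a.values()): unnormalized_score(a) / Z for a in assignments}
print(sum(p.values()))   # 1.0: a proper distribution no matter how the features overlap

Unlike the generative chain p(A) * p(B | A) * …, nothing requires these factors to form consistent conditional distributions; the normalizer Z absorbs any overlap, which is exactly the freedom the slide is pointing at.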

9 How about structured outputs? Log-linear models are great for n-way classification. Also good for predicting sequences (find preferred tags: v a n), but to allow fast dynamic programming, only use n-gram features. Also good for dependency parsing (…find preferred links…), but to allow fast dynamic programming or MST parsing, only use single-edge features.

10 How about structured outputs? but to allow fast dynamic programming or MST parsing, only use single-edge features …find preferred links…

11 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou. "It was a bright cold day in April and the clocks were striking thirteen." Is this a good edge? yes, lots of green...

12 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou. "It was a bright cold day in April and the clocks were striking thirteen." Is this a good edge? jasný → den ("bright day")

13 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C). "It was a bright cold day in April and the clocks were striking thirteen." Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN")

14 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C). "It was a bright cold day in April and the clocks were striking thirteen." Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN"); A → N

15 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C). "It was a bright cold day in April and the clocks were striking thirteen." Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN"); A → N; A → N preceding conjunction

16 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C). "It was a bright cold day in April and the clocks were striking thirteen." How about this competing edge? not as good, lots of red...

17 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C). "It was a bright cold day in April and the clocks were striking thirteen." How about this competing edge? jasný → hodiny ("bright clocks")... undertrained...

18 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C; stems: byl jasn stud dubn den a hodi odbí třin). "It was a bright cold day in April and the clocks were striking thirteen." How about this competing edge? jasný → hodiny ("bright clocks")... undertrained...; jasn → hodi ("bright clock," stems only)

19 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C; stems: byl jasn stud dubn den a hodi odbí třin). "It was a bright cold day in April and the clocks were striking thirteen." How about this competing edge? jasný → hodiny ("bright clocks")... undertrained...; jasn → hodi ("bright clock," stems only); A plural → N singular

20 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C; stems: byl jasn stud dubn den a hodi odbí třin). "It was a bright cold day in April and the clocks were striking thirteen." How about this competing edge? jasný → hodiny ("bright clocks")... undertrained...; jasn → hodi ("bright clock," stems only); A plural → N singular; A → N where N follows a conjunction

21 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C; stems: byl jasn stud dubn den a hodi odbí třin). "It was a bright cold day in April and the clocks were striking thirteen." Which edge is better? "bright day" or "bright clocks"?

22 Edge-Factored Parsers (McDonald et al. 2005) Byl jasný studený dubnový den a hodiny odbíjely třináctou (POS tags: V A A A N J N V C; stems: byl jasn stud dubn den a hodi odbí třin). Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algos → valid parse with max total score.
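As a purely illustrative sketch of "score of an edge e = θ · features(e)": the feature templates and weights below are invented in the spirit of McDonald et al. (2005), not taken from the slides (diacritics dropped in the strings):

from collections import Counter

def edge_features(words, tags, stems, head, child):
    """A hypothetical sparse feature vector for one candidate edge head -> child."""
    direction = "R" if head < child else "L"
    return Counter({
        f"word:{words[head]}->{words[child]}": 1,
        f"stem:{stems[head]}->{stems[child]}": 1,
        f"tag:{tags[head]}->{tags[child]}": 1,
        f"tag+dir:{tags[head]}->{tags[child]}:{direction}": 1,
    })

def edge_score(theta, feats):
    """Score of an edge e = theta . features(e)."""
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())

words = ["Byl", "jasny", "studeny", "dubnovy", "den", "a", "hodiny", "odbijely", "trinactou"]
tags  = ["V", "A", "A", "A", "N", "J", "N", "V", "C"]
stems = ["byl", "jasn", "stud", "dubn", "den", "a", "hodi", "odbi", "trin"]
theta = {"tag:A->N": 1.5, "word:jasny->den": 2.0, "stem:jasn->hodi": -0.5}   # made-up weights

print(edge_score(theta, edge_features(words, tags, stems, 1, 4)))   # jasny -> den    : 3.5
print(edge_score(theta, edge_features(words, tags, stems, 1, 6)))   # jasny -> hodiny : 1.0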

23 Edge-Factored Parsers (McDonald et al. 2005) Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algos → valid parse with max total score. Can't have both (one parent per word); can't have both (no crossing links); can't have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges.

24 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

25 Finding Highest-Scoring Parse The cat in the hat wore a stovepipe. ROOT Convert to context-free grammar (CFG) Then use dynamic programming each subtree is a linguistic constituent (here a noun phrase) The cat in the hat wore a stovepipe ROOT let’s vertically stretch this graph drawing

26 Finding Highest-Scoring Parse. Each subtree is a linguistic constituent (here a noun phrase). The cat in the hat wore a stovepipe ROOT. Convert to context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n³); unfortunately, it is O(n⁵) in this case: to score the "cat → wore" link, it is not enough to know this is an NP; we must know it is rooted at "cat", so we expand the nonterminal set by O(n): {NP_the, NP_cat, NP_hat, ...}, and CKY's "grammar constant" is no longer constant.

27 Finding Highest-Scoring Parse. Each subtree is a linguistic constituent (here a noun phrase). The cat in the hat wore a stovepipe ROOT. Convert to context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n³); unfortunately, it is O(n⁵) in this case. Solution: use a different decomposition (Eisner 1996). Back to O(n³).
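To make the O(n³) figure concrete, here is a minimal sketch of an Eisner-style first-order projective dynamic program. It returns only the maximum total edge score (no back-pointers or labels), and the example scores are invented; it is not the implementation behind the slides.

import numpy as np

def eisner_max_score(score):
    """score[h, m] = score of edge h -> m, with index 0 = ROOT.
    Returns the max total score of a projective dependency tree rooted at 0."""
    n = score.shape[0]
    # C = complete spans, I = incomplete spans; last index 0 = head at right end,
    # 1 = head at left end.
    C = np.full((n, n, 2), -np.inf)
    I = np.full((n, n, 2), -np.inf)
    for i in range(n):
        C[i, i, 0] = C[i, i, 1] = 0.0
    for width in range(1, n):
        for s in range(n - width):
            t = s + width
            # Join two complete halves with a new edge (incomplete spans).
            best = max(C[s, r, 1] + C[r + 1, t, 0] for r in range(s, t))
            I[s, t, 0] = best + score[t, s]   # edge t -> s
            I[s, t, 1] = best + score[s, t]   # edge s -> t
            # Extend an incomplete span into a complete one.
            C[s, t, 0] = max(C[s, r, 0] + I[r, t, 0] for r in range(s, t))
            C[s, t, 1] = max(I[s, r, 1] + C[r, t, 1] for r in range(s + 1, t + 1))
    return C[0, n - 1, 1]

# ROOT + 3 words, made-up edge scores; the best projective tree here scores 20.0.
scores = np.array([[0.0, 2.0, 9.0, 1.0],
                   [0.0, 0.0, 4.0, 1.0],
                   [0.0, 6.0, 0.0, 5.0],
                   [0.0, 1.0, 2.0, 0.0]])
print(eisner_max_score(scores))

The three nested loops (span width, start point, split point) are what give the O(n³) bound.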

28 Spans vs. constituents Two kinds of substring. »Constituent of the tree: links to the rest only through its headword (root). »Span of the tree: links to the rest only through its endwords. The cat in the hat wore a stovepipe. ROOT

29 Decomposing a tree into spans. The cat in the hat wore a stovepipe. ROOT. Shown as an animation: the tree splits off the span "cat in the hat wore" (leaving "The cat ... wore a stovepipe. ROOT"); "cat in the hat wore" = "cat in" + "in the hat wore"; "in the hat wore" = "in the hat" + "hat wore". Adjacent spans overlap at a shared endword.

30 Finding Highest-Scoring Parse. Convert to context-free grammar (CFG), then use dynamic programming. The CKY algorithm for CFG parsing is O(n³); unfortunately, it is O(n⁵) in this case. Solution: use a different decomposition (Eisner 1996); back to O(n³). Can play the usual tricks for dynamic programming parsing:  Further refining the constituents or spans (allow the prob. model to keep track of even more internal information)  A*, best-first, coarse-to-fine  Training by EM etc., which require "outside" probabilities of constituents, spans, or links.

31 Hard Constraints on Valid Trees. Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algos → valid parse with max total score. Can't have both (one parent per word); can't have both (no crossing links); can't have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges.

32 Non-Projective Parses. The "projectivity" restriction (can't have both: no crossing links): do we really want it? I 'll give a talk tomorrow on bootstrapping ROOT. The subtree rooted at "talk" is a discontiguous noun phrase.

33 Non-Projective Parses. Occasional non-projectivity in English: I 'll give a talk tomorrow on bootstrapping ROOT. Frequent non-projectivity in Latin, etc.: ista meam norit gloria canitiem ROOT (that-NOM my-ACC may-know glory-NOM going-gray-ACC, i.e. "That glory may-know my going-gray": it shall last till I go gray).

34 Finding highest-scoring non-projective tree. Consider the sentence "John saw Mary" (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective. It can be found in time O(n²): every node selects its best parent; if there are cycles, contract them and repeat. (slide thanks to Dragomir Radev)
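A sketch of just the two steps named on the slide, every node selecting its best parent and then checking for a cycle; the contract-and-repeat part of Chu-Liu-Edmonds is omitted, and the scores are illustrative stand-ins for the example:

import numpy as np

# Greedy first step of Chu-Liu-Edmonds: every non-root node selects its
# best-scoring parent.  (The full algorithm would contract any cycle and repeat.)
def greedy_parents(score):
    """score[h, m] = weight of edge h -> m; node 0 is the root."""
    s = score.astype(float)
    np.fill_diagonal(s, -np.inf)      # no self-loops
    parent = s.argmax(axis=0)         # best head for each word (column)
    parent[0] = -1                    # the root keeps no parent
    return parent

def find_cycle(parent):
    """Return the nodes of one cycle among the greedy choices, or None."""
    n = len(parent)
    color = [0] * n                   # 0 = unvisited, 1 = on current path, 2 = done
    color[0] = 2
    for start in range(1, n):
        path, node = [], start
        while color[node] == 0:
            color[node] = 1
            path.append(node)
            node = parent[node]
        if color[node] == 1:          # walked back into the current path: a cycle
            return path[path.index(node):]
        for v in path:
            color[v] = 2
    return None

# Nodes: 0 = root, 1 = John, 2 = saw, 3 = Mary; made-up edge weights.
S = np.array([[0,  9, 10,  9],
              [0,  0, 20,  3],
              [0, 30,  0, 30],
              [0, 11,  0,  0]])
par = greedy_parents(S)
print(par, find_cycle(par))   # greedy picks John <-> saw, a cycle to be contracted

Full Chu-Liu-Edmonds would now collapse that cycle into a single node, rescore its incoming and outgoing edges, and recurse.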

35 Summing over all non-projective trees. (Finding the highest-scoring non-projective tree: for "John saw Mary", the Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree, which may be non-projective, in time O(n²).) How about the total weight Z of all trees? How about outside probabilities or gradients? These can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007). (slide thanks to Dragomir Radev)

36 Graph Theory to the Rescue! Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, without row and column r, is equal to the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need! O(n³) time!

37 Building the Kirchhoff (Laplacian) Matrix: negate edge scores, sum columns (children), strike the root row/col., take the determinant. N.B.: This allows multiple children of the root, but see Koo et al.
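A minimal sketch of that construction with NumPy (my own indexing conventions: edge scores are treated as log-potentials and node 0 is the root whose row and column get struck):

import numpy as np

def log_partition(scores):
    """scores[h, m] = log-potential of edge h -> m, with node 0 = ROOT.
    Returns log Z, the log of the summed weight of all spanning trees rooted at 0
    (Tutte's Matrix-Tree Theorem)."""
    A = np.exp(scores)                              # edge weights
    np.fill_diagonal(A, 0.0)                        # no self-loops
    n = A.shape[0]
    L = -A.copy()                                   # off-diagonal: negated edge scores
    L[np.arange(n), np.arange(n)] = A.sum(axis=0)   # diagonal: column sums (into each child)
    sign, logdet = np.linalg.slogdet(L[1:, 1:])     # strike root row/col, take determinant
    return logdet

# Tiny check: with all edge weights 1 over ROOT + 2 words there are 3 spanning trees.
print(np.exp(log_partition(np.zeros((3, 3)))))      # -> 3.0

Gradients of this log Z give the edge marginals / outside probabilities mentioned on the previous slide (Smith & Smith, 2007).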

38 Why Should This Work? Chu-Liu-Edmonds analogy: Every node selects best parent If cycles, contract and recur Clear for 1x1 matrix; use induction Undirected case; special root cases for directed

39 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

40 Exactly Finding the Best Parse. With arbitrary features, runtime blows up.  Projective parsing: O(n³) by dynamic programming  Non-projective: O(n²) by minimum spanning tree ... but to allow fast dynamic programming or MST parsing, only use single-edge features (…find preferred links…). With richer features: O(n⁴) grandparents; O(n⁵) grandp. + sibling bigrams; O(n³g⁶) POS trigrams; O(2ⁿ) sibling pairs (non-adjacent); NP-hard: any of the above features, soft penalties for crossing links, pretty much anything else!

41 Let's reclaim our freedom (again!) This paper in a nutshell: output probability is a product of local factors, (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …  Throw in any factors we want! (log-linear model) How could we find the best parse?  Integer linear programming (Riedel et al., 2006): doesn't give us probabilities when training or parsing  MCMC: slow to mix? high rejection rate because of the hard TREE constraint?  Greedy hill-climbing (McDonald & Pereira 2006). None of these exploit the tree structure of parses as the first-order methods do.

42 Let's reclaim our freedom (again!) This paper in a nutshell: output probability is a product of local factors, (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …  Throw in any factors we want! (log-linear model) Let local factors negotiate via "belief propagation": links (and tags) reinforce or suppress one another.  Each iteration takes total time O(n²) or O(n³). Converges to a pretty good (but approx.) global parse. Certain global factors are OK too: each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside).

43 Let's reclaim our freedom (again!) This paper in a nutshell (New!): training vs. decoding with many features.
Training with many features | Decoding with many features
Iterative scaling | Belief propagation
Each weight in turn is influenced by others | Each variable in turn is influenced by others
Iterate to achieve globally optimal weights | Iterate to achieve locally consistent beliefs
To train a distrib. over trees, use dynamic programming to compute the normalizer Z | To decode a distrib. over trees, use dynamic programming to compute messages

44 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

45 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … Observed input sentence (shaded); a possible tagging (i.e., assignment to the remaining variables): v v v.

46 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … Observed input sentence (shaded); another possible tagging: v a n.

47 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … "Binary" factor that measures compatibility of 2 adjacent tags; the model reuses the same parameters at this position. Factor values (tag pair → score): (v,v)=0, (v,n)=2, (v,a)=1; (n,v)=2, (n,n)=1, (n,a)=0; (a,v)=0, (a,n)=3, (a,a)=1.

48 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … "Unary" factor evaluates this tag; its values depend on the corresponding word: v 0.2 … a 0 ("find" can't be an adjective).

49 Local factors in a graphical model First, a familiar example  Conditional Random Field (CRF) for POS tagging … … find preferred tags v 0.2 n a 0 “Unary” factor evaluates this tag Its values depend on corresponding word (could be made to depend on entire observed sentence)

50 Local factors in a graphical model First, a familiar example  Conditional Random Field (CRF) for POS tagging … … find preferred tags v 0.2 n a 0 “Unary” factor evaluates this tag Different unary factor at each position v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1

51 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … p(v a n) is proportional to the product of all factors' values on "v a n" (the binary tag-pair factors and the per-word unary factors from the previous slides).

52 Local factors in a graphical model. First, a familiar example: Conditional Random Field (CRF) for POS tagging. … find preferred tags … p(v a n) is proportional to the product of all factors' values on "v a n" = … 1 * 3 * 0.3 * 0.1 * 0.2 …
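A brute-force sketch of that product-of-factors computation for the three-word example. The binary table is the one from slide 47; the unary numbers are copied loosely from the slides, but where the transcript is incomplete (and for the word-to-table pairing) they are my own stand-ins, so the printed product will not match the slide's 1 * 3 * 0.3 * 0.1 * 0.2 exactly:

import itertools

# Tiny chain CRF over tags {v, n, a} for the words "find preferred tags".
TAGS = ["v", "n", "a"]
binary = {("v","v"): 0, ("v","n"): 2, ("v","a"): 1,
          ("n","v"): 2, ("n","n"): 1, ("n","a"): 0,
          ("a","v"): 0, ("a","n"): 3, ("a","a"): 1}
unary = [                                   # one unary factor per word position
    {"v": 0.2, "n": 0.2, "a": 0.0},         # "find"      (n value assumed)
    {"v": 0.3, "n": 0.0, "a": 0.1},         # "preferred" (pairing assumed)
    {"v": 0.3, "n": 0.02, "a": 0.0},        # "tags"      (pairing assumed)
]

def unnormalized(tagging):
    """Product of every factor's value on one tag assignment."""
    score = 1.0
    for i, t in enumerate(tagging):
        score *= unary[i][t]
    for t1, t2 in zip(tagging, tagging[1:]):
        score *= binary[(t1, t2)]
    return score

Z = sum(unnormalized(t) for t in itertools.product(TAGS, repeat=3))
print(unnormalized(("v", "a", "n")), unnormalized(("v", "a", "n")) / Z)

Enumerating all 3³ taggings is fine at this size; the later slides replace the enumeration with message passing.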

53 Local factors in a graphical model. First, a familiar example: CRF for POS tagging (v a n). Now let's do dependency parsing!  O(n²) boolean variables for the possible links. find preferred links … …

54 Local factors in a graphical model. First, a familiar example: CRF for POS tagging (v a n). Now let's do dependency parsing!  O(n²) boolean variables for the possible links. find preferred links … … A possible parse, encoded as an assignment to these vars: t f f t f f.

55 Local factors in a graphical model. First, a familiar example: CRF for POS tagging (v a n). Now let's do dependency parsing!  O(n²) boolean variables for the possible links. find preferred links … … A possible parse, encoded as an assignment to these vars; another possible parse: f f t f t f.

56 Local factors in a graphical model. First, a familiar example: CRF for POS tagging (v a n). Now let's do dependency parsing!  O(n²) boolean variables for the possible links. find preferred links … … A possible parse, another possible parse, and an illegal parse (it contains a cycle): f t t t f f.

57 Local factors in a graphical model. First, a familiar example: CRF for POS tagging (v a n). Now let's do dependency parsing!  O(n²) boolean variables for the possible links. find preferred links … … A possible parse, another possible parse, an illegal parse (cycle), and another illegal parse (a word with multiple parents): t t t t f f.

58 Local factors for parsing. find preferred links … … So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation (e.g., t 2 / f 1, t 1 / f 2, t 1 / f 2, t 1 / f 6, t 1 / f 3, t 1 / f 8). As before, the goodness of this link can depend on the entire observed input context; some other links aren't as good given this input sentence. But what if the best assignment isn't a tree??

59 Global factors for parsing. find preferred links … … So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation. A global TREE factor to require that the links form a legal tree: this is a "hard constraint"; the factor is either 0 or 1 (ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0).

60 Global factors for parsing. find preferred links … … So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation. A global TREE factor to require that the links form a legal tree: this is a "hard constraint"; the factor is either 0 or 1 (64 entries for 6 links; the assignment t f f t f f is legal). Optionally require the tree to be projective (no crossing links). So far, this is equivalent to edge-factored parsing (McDonald et al. 2005). Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time; they use combinatorial algorithms, and so should we!

61 Local factors for parsing. find preferred links … … So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation. A global TREE factor to require that the links form a legal tree (a "hard constraint": the factor is either 0 or 1). Second-order effects: factors on 2 variables, e.g. grandparent, with values such as (f,f)=1, (f,t)=1, (t,f)=1, (t,t)=3.

62 Local factors for parsing. find preferred links … … So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation. A global TREE factor to require that the links form a legal tree (a "hard constraint": the factor is either 0 or 1). Second-order effects: factors on 2 variables, e.g. grandparent, no-cross (here involving the link over "by"), with values such as (f,f)=1, (f,t)=1, (t,f)=1, (t,t)=0.2.

63 Local factors for parsing find preferred links … … by  So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree  this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables  grandparent  no-cross  siblings  hidden POS tags  subcategorization ……

64 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

65 Good to have lots of features, but … nice model, shame about the NP-hardness.  Can we approximate? Machine learning to the rescue!  The ML community has given a lot to NLP.  In the 2000's, NLP has been giving back to ML, mainly techniques for joint prediction of structures. (Much earlier, speech recognition had HMMs, EM, smoothing …)

66 Great Ideas in ML: Message Passing 66 3 behind you 2 behind you 1 behind you 4 behind you 5 behind you 1 before you 2 before you there’s 1 of me 3 before you 4 before you 5 before you adapted from MacKay (2003) textbook Count the soldiers

67 Great Ideas in ML: Message Passing 67 3 behind you 2 before you there’s 1 of me Belief: Must be = 6 of us only see my incoming messages 231 Count the soldiers adapted from MacKay (2003) textbook

68 Belief: Must be = 6 of us 231 Great Ideas in ML: Message Passing 68 4 behind you 1 before you there’s 1 of me only see my incoming messages Belief: Must be = 6 of us 141 Count the soldiers adapted from MacKay (2003) textbook

69 Great Ideas in ML: Message Passing 69 7 here 3 here 11 here (= 7+3+1) 1 of me Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

70 Great Ideas in ML: Message Passing 70 3 here 7 here (= 3+3+1) Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

71 Great Ideas in ML: Message Passing 71 7 here 3 here 11 here (= 7+3+1) Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

72 Great Ideas in ML: Message Passing 72 7 here 3 here Belief: Must be 14 of us Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook

73 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 73 7 here 3 here Belief: Must be 14 of us wouldn’t work correctly with a “loopy” (cyclic) graph adapted from MacKay (2003) textbook

74 Great ideas in ML: Forward-Backward. … find preferred tags … In the CRF, message passing = forward-backward: an α message and a β message arrive at each tag variable, and the belief there is α × (unary factor) × β. For example, α = (v 2, n 1, a 7), unary = (v 0.3, n 0, a 0.1), β = (v 3, n 1, a 6) give the belief (v 1.8, n 0, a 4.2); the messages themselves are computed through the binary tag-pair factors (the v/n/a tables shown earlier), e.g. (v 7, n 2, a 1) and (v 3, n 6, a 1).
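A runnable sketch of that computation, on the same toy tables as the brute-force example above (unary values partly assumed; messages are left unnormalized, as on the slide):

import numpy as np

# Forward-backward on a 3-word chain CRF over tags (v, n, a).
T = np.array([[0, 2, 1],      # binary factor: row = left tag, column = right tag
              [2, 1, 0],
              [0, 3, 1]], dtype=float)
U = np.array([[0.2, 0.2, 0.0],    # unary for "find"      (middle value assumed)
              [0.3, 0.0, 0.1],    # unary for "preferred" (pairing assumed)
              [0.3, 0.02, 0.0]])  # unary for "tags"      (pairing assumed)

n = U.shape[0]
alpha = np.zeros_like(U)          # alpha[i] = message arriving at position i from the left
beta = np.zeros_like(U)           # beta[i]  = message arriving at position i from the right
alpha[0] = 1.0
beta[-1] = 1.0
for i in range(1, n):
    alpha[i] = (alpha[i - 1] * U[i - 1]) @ T
for i in range(n - 2, -1, -1):
    beta[i] = T @ (U[i + 1] * beta[i + 1])

belief = alpha * U * beta                              # belief = alpha x unary x beta
print(belief / belief.sum(axis=1, keepdims=True))      # per-position tag posteriors

Each row of the normalized belief matrix is the posterior over (v, n, a) at that position; every row sums to the same Z before normalization, a quick check that the messages were combined correctly.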

75 Great ideas in ML: Forward-Backward. … find preferred tags … Extend the CRF to a "skip chain" to capture a non-local factor: more influences on the belief. With an extra incoming message (v 3, n 1, a 6), the belief at the same variable becomes α × unary × β × (skip message), e.g. (v 5.4, n 0, a 25.2).

76 Great ideas in ML: Forward-Backward. … find preferred tags … Extend the CRF to a "skip chain" to capture a non-local factor: more influences on the belief (v 5.4, n 0, a 25.2), but the graph becomes loopy. The red messages are not independent? Pretend they are!

77 Two great tastes that taste great together You got dynamic programming in my belief propagation! You got belief propagation in my dynamic programming! Upcoming attractions …

78 Loopy Belief Propagation for Parsing. find preferred links … … The sentence tells word 3, "Please be a verb". Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist". The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7". The TREE factor tells the 10 → 7 link, "You're on!" The 10 → 7 link tells word 10, "Could you please be a noun?" …

79 Loopy Belief Propagation for Parsing. find preferred links … …  Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle … Strong links are suppressing or promoting other links …

80 Loopy Belief Propagation for Parsing. find preferred links … …  Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle …  How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?" (TREE factor: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0.)

81 Loopy Belief Propagation for Parsing. find preferred links … …  How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?" But this is the outside probability of the green link! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n³) time by inside-outside. Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).

82 Loopy Belief Propagation for Parsing.  How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?" But this is the outside probability of the green link! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n³) time by inside-outside. Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007). Belief propagation assumes the incoming messages to TREE are independent, so the outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
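For the non-projective case, here is a sketch of how a TREE factor could produce all of those quantities at once: edge marginals from the inverse of the root-struck Kirchhoff matrix, in the spirit of Smith & Smith (2007). The indexing conventions and the closed-form expression are my own reconstruction, not code from the paper:

import numpy as np

def edge_marginals(scores):
    """scores[h, m] = log-potential of edge h -> m, with node 0 = ROOT.
    Returns M[h, m] = P(edge h -> m appears in the tree) under the
    distribution over spanning trees rooted at 0."""
    A = np.exp(scores)
    np.fill_diagonal(A, 0.0)                        # no self-loops
    n = A.shape[0]
    L = -A.copy()
    L[np.arange(n), np.arange(n)] = A.sum(axis=0)   # Laplacian: diag = column sums
    B = np.linalg.inv(L[1:, 1:])                    # inverse of the root-struck minor
    M = np.zeros_like(A)
    for m in range(1, n):
        M[0, m] = A[0, m] * B[m - 1, m - 1]                       # edges out of ROOT
        for h in range(1, n):
            if h != m:
                M[h, m] = A[h, m] * (B[m - 1, m - 1] - B[m - 1, h - 1])
    return M

# Sanity check: every non-root word has exactly one parent in every tree,
# so the marginals of the edges pointing at it should sum to 1.
S = np.log(np.array([[1.0, 2.0, 1.0],
                     [1.0, 1.0, 3.0],
                     [1.0, 2.0, 1.0]]))
print(edge_marginals(S).sum(axis=0))   # approx. [0. 1. 1.]

In BP, the factor's outgoing message to a link would then be this marginal divided by that link's incoming message, the usual belief-to-message step.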

83 Some connections … Parser stacking (Nivre & McDonald 2008; Martins et al. 2008). Global constraints in arc consistency:  ALLDIFFERENT constraint (Régin 1994). Matching constraint in max-product BP:  for computer vision (Duchi et al., 2006);  could be used for machine translation. As far as we know, our parser is the first use of global constraints in sum-product BP.

84 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

85 Runtimes for each factor type (see paper); the columns are factor type, degree, runtime, count, total per iteration, and the totals are additive, not multiplicative! Tree: O(n²), O(n³), 1. Proj. Tree: O(n²), O(n³), 1. Individual links: 1, O(1), O(n²). Grandparent: 2, O(1), O(n³). Sibling pairs: 2, O(1), O(n³). Sibling bigrams: O(n), O(n²), O(n), O(n³). NoCross: O(n), O(n²), O(n³). Tag: 1, O(g), O(n). TagLink: 3, O(g²), O(n²). TagTrigram: O(n), O(ng³), 1, O(n). TOTAL: O(n³) per iteration.

86 Runtimes for each factor type (see paper): the same table as on the previous slide (totals per iteration are additive, not multiplicative!). Each "global" factor coordinates an unbounded # of variables; standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

87 Experimental Details. Decoding:  run several iterations of belief propagation;  get the final beliefs at the link variables;  feed them into a first-order parser;  this gives the Min Bayes Risk tree (minimizes expected error). Training:  BP computes beliefs about each factor, too …  … which gives us gradients for max conditional likelihood (as in the forward-backward algorithm). Features used in experiments:  First-order: individual links, just as in McDonald et al. (2005);  Higher-order: Grandparent, Sibling bigrams, NoCross.
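A self-contained sketch of the Min Bayes Risk step (the edge beliefs below are invented, and the toy decoder brute-forces every tree of a three-word sentence rather than calling a real first-order parser as the slide describes):

import itertools
import numpy as np

def all_trees(n):
    """Yield parent tuples (index 0 = ROOT) that form a tree over n+1 nodes."""
    for parents in itertools.product(range(n + 1), repeat=n):
        par = (None,) + parents                 # par[m] = head of word m
        ok = True
        for m in range(1, n + 1):               # every word must reach ROOT, no cycles
            seen, node = set(), m
            while node != 0 and node not in seen:
                seen.add(node)
                node = par[node]
            ok &= (node == 0)
        if ok:
            yield par

def mbr_tree(edge_beliefs):
    """Pick the tree whose edges carry the most total belief mass,
    i.e. the tree with the highest expected number of correct edges."""
    n = edge_beliefs.shape[0] - 1
    return max(all_trees(n),
               key=lambda par: sum(edge_beliefs[par[m], m] for m in range(1, n + 1)))

b = np.array([[0.0, 0.9, 0.1, 0.3],     # made-up beliefs: rows = heads (0 = ROOT)
              [0.0, 0.0, 0.7, 0.1],
              [0.0, 0.1, 0.0, 0.6],
              [0.0, 0.0, 0.2, 0.0]])
print(mbr_tree(b))                      # -> (None, 0, 1, 2): ROOT -> w1 -> w2 -> w3

Picking the tree whose edges carry the most total belief mass maximizes the expected number of correct edges, which is the "minimizes expected error" claim on the slide; in practice the beliefs are simply handed to the projective DP or to Chu-Liu-Edmonds as edge scores.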

88 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing). Results table: rows Tree+Link, NoCross, Grandparent, ChildSeq; columns Danish, Dutch, English.

89 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing). Results table: rows Tree+Link, NoCross, Grandparent, ChildSeq; columns Danish, Dutch, English. Also shown: the best projective parse with all factors (exact, slow) and hill-climbing (doesn't fix enough edges).

90 Time vs. Projective Search Error: plots of search error against BP iterations, compared with the O(n⁴) DP and with the O(n⁵) DP.

91 Runtime: BP vs. DP (one plot vs. the O(n⁴) DP, one vs. the O(n⁵) DP).

92 Outline Edge-factored parsing  Dependency parses  Scoring the competing parses: Edge features  Finding the best parse Higher-order parsing  Throwing in more features: Graphical models  Finding the best parse: Belief propagation  Experiments Conclusions New! Old

93 Freedom Regained. This paper in a nutshell: output probability is defined as a product of local and global factors.  Throw in any factors we want! (log-linear model)  Each factor must be fast, but they run independently. Let local factors negotiate via "belief propagation":  each bit of syntactic structure is influenced by others;  some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming;  each iteration takes total time O(n³) or even O(n²); see paper (compare reranking or stacking). Converges to a pretty good (but approximate) global parse:  fast parsing for formerly intractable or slow models;  extra features of these models really do help accuracy.

94 Future Opportunities Efficiently modeling more hidden structure  POS tags, link roles, secondary links (DAG-shaped parses) Beyond dependencies  Constituency parsing, traces, lattice parsing Beyond parsing  Alignment, translation  Bipartite matching and network flow  Joint decoding of parsing and other tasks (IE, MT, reasoning...) Beyond text  Image tracking and retrieval  Social networks

95 thank you