
Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.




1 Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012

2 Roadmap
Collins & Duffy, 2001, Tree Kernels for Parsing:
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing

3-6 Motivation: Parsing
Parsing task: given a natural language sentence, extract its syntactic structure; specifically, generate a corresponding parse tree.
Approaches:
- "Classical" approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY
- Probabilistic approach: build a large treebank of parsed sentences; learn production probabilities; use probabilistic versions of the standard algorithms; pick the highest-probability parse (a toy scoring sketch follows)
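To make the "pick the highest-probability parse" step concrete, here is a minimal sketch (not from the lecture) of scoring a tree under a toy PCFG: a parse's probability is the product of the probabilities of the productions it uses. The rule probabilities and the example tree below are hypothetical.

# Minimal PCFG scoring sketch: a parse tree's probability is the product
# of the probabilities of the CFG productions used in it.
# The grammar and tree below are toy examples, not from the lecture data.

toy_rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "N")): 0.4,
    ("NP", ("N",)): 0.6,
    ("VP", ("V", "NP")): 1.0,
}

def tree_prob(node):
    """node = (label, [children]) for internal nodes, (label, word) for preterminals."""
    label, children = node
    if isinstance(children, str):          # preterminal -> word; lexical probabilities ignored here
        return 1.0
    rhs = tuple(child[0] for child in children)
    p = toy_rule_probs[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

tree = ("S", [("NP", [("N", "John")]),
              ("VP", [("V", "ate"), ("NP", [("DT", "an"), ("N", "apple")])])])
print(tree_prob(tree))                     # 1.0 * 0.6 * 1.0 * 0.4 = 0.24

A PCFG parser would compute such scores for all candidate trees (efficiently, with probabilistic CKY) and return the highest-scoring one.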

7-10 Parsing Issues
Main issues:
- Robustness: get a reasonable parse for any input
- Ambiguity: select the best parse from the alternatives
"Classic" approach:
- Hand-coded grammars are often fragile
- No obvious mechanism to select among alternative parses
Probabilistic approach:
- Fairly good robustness: any input gets at least a low-probability parse
- Selects by probability, but the decisions are local; hard to capture more global structure

11-14 Approach: Parsing by Reranking
Intuition:
- Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser
- For training, identify the gold-standard (sentence, parse) pairs
- Create a vector representation of each parse tree, identifying (more global) parse-tree features
- Train a reranker to rank the gold standard highest
- Apply it to rerank the candidate parses for a new sentence

15-20 Parsing as Reranking, Formally
- Training data pairs: {(s_i, t_i)}, where s_i is a sentence and t_i is its parse tree
- C(s_i) = {x_ij}: candidate parses for s_i; w.l.o.g., x_i1 is the correct parse for s_i
- h(x_ij): feature-vector representation of x_ij
- Training: learn a scoring function over candidates (the slide's formula is reconstructed below)
- Decoding: compute the highest-scoring candidate (see below)
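The formulas on these slides were images and did not survive the transcript. In the standard linear reranking setup (a reconstruction, so treat the exact notation as an assumption rather than the slide's wording), each candidate is scored with a weight vector w:

f(x) = w . h(x)

Training learns w from the pairs {(s_i, t_i)}; decoding picks the highest-scoring candidate for a new sentence s:

x* = argmax_{x in C(s)} f(x) = argmax_{x in C(s)} w . h(x)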

21-26 Parsing as Reranking: Training
- Consider the hard-margin SVM model: minimize ||w||^2 subject to constraints
- What constraints? Here, ranking constraints: the correct parse must outrank all other candidates
- Formally, the correct parse's score must exceed each competitor's by a margin (see the reconstruction below)
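The constraint formula itself is missing from the transcript; the usual hard-margin ranking constraints (reconstructed, not the slide's exact notation) are:

for all i, for all j >= 2:   w . h(x_i1) - w . h(x_ij) >= 1

i.e. minimize ||w||^2 subject to the correct parse beating every other candidate by a margin of at least 1.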

27-32 Reformulating with α
- Training learns dual weights α_ij, one per incorrect candidate, such that w is a weighted sum of feature-vector differences (formula reconstructed below)
- Note: just like the SVM equation, with a different constraint
- Parse scoring: after substituting this w, and then applying the kernel trick, scoring is expressed entirely through a kernel K (see below)
- Note: with a suitable kernel K, we never need the explicit h(x) vectors
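The missing formulas, written in the standard dual form (a reconstruction; the slides' exact notation may differ):

w = Σ_{i,j} α_ij (h(x_i1) - h(x_ij))

Substituting into f(x) = w . h(x):

f(x) = Σ_{i,j} α_ij (h(x_i1) . h(x) - h(x_ij) . h(x))

and, with the kernel trick K(x, y) = h(x) . h(y):

f(x) = Σ_{i,j} α_ij (K(x_i1, x) - K(x_ij, x))

so only kernel evaluations, never the explicit feature vectors, are required.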

33-36 Parsing as Reranking: Perceptron Algorithm
- Similar to the SVM, it learns a separating hyperplane, modeled with a weight vector w
- Uses a simple iterative procedure based on correcting errors made by the current model
Algorithm (dual form):
- Initialize α_ij = 0
- For i = 1,…,n; for j = 2,…,|C(s_i)|:
  - If f(x_i1) > f(x_ij): continue
  - Else: α_ij += 1
(f is computed with the current α values; a code sketch follows.)
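A compact Python sketch of the dual (kernelized) ranking perceptron described on these slides. The kernel here is a plain dot product over toy feature vectors standing in for the tree kernel, and all data and function names are assumptions for illustration only.

# Dual ranking perceptron sketch. candidates[i][0] is the correct parse for
# sentence i; candidates[i][j] (j >= 1) are the competitors. Each parse is
# represented directly by a toy feature vector, and K is a plain dot product
# standing in for the tree kernel.

def K(x, y):
    return sum(a * b for a, b in zip(x, y))

def score(x, alpha, candidates):
    # f(x) = sum_ij alpha_ij * (K(x_i1, x) - K(x_ij, x))
    total = 0.0
    for i, cands in enumerate(candidates):
        correct = cands[0]
        for j in range(1, len(cands)):
            if alpha[i][j]:
                total += alpha[i][j] * (K(correct, x) - K(cands[j], x))
    return total

def train(candidates, epochs=5):
    alpha = [[0.0] * len(cands) for cands in candidates]
    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                # Error: a competitor scores at least as high as the correct parse.
                if score(cands[0], alpha, candidates) <= score(cands[j], alpha, candidates):
                    alpha[i][j] += 1.0
    return alpha

# Toy data: two "sentences", each with the correct parse followed by two competitors.
candidates = [
    [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 1.0)],
    [(1.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
print(train(candidates))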

37-40 Defining the Kernel
- So we have a model: a framework for training and a framework for decoding
- But we still need to define the kernel K
- We need K: X × X → R, where X is the space of parse trees and K is a similarity function between them

41-44 What's in a Kernel?
What are good attributes of a kernel?
- Captures similarity between instances; here, between parse trees
- Captures more global parse information than a PCFG
- Is computable tractably, even over complex, large trees

45-46 Tree Kernel Proposal
Idea:
- PCFG models learn MLE probabilities on rewrite rules (NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N), which are local to one parent-and-children level
- The new measure incorporates all tree fragments of a parse: it captures higher-order, longer-distance dependencies, tracking counts of individual rules and much more

47 Tree Fragment Example
(Figure, not reproduced: fragments of the NP over 'apple'; the set shown is not exhaustive.)

48-49 Tree Representation
Tree fragments: any subgraph with more than one node, with the restriction that it must include full (not partial) rule productions.
Parse-tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)), where n is the number of distinct tree fragments in the training data and h_i(T) is the number of occurrences of the i-th tree fragment in the current tree.

50-52 Tree Representation
Pros:
- Fairly intuitive model
- Natural inner-product interpretation
- Captures long- and short-range dependencies
Cons:
- Size: the number of subtrees is exponential in the size of the tree (an illustration follows)
- Direct computation of the inner product is intractable
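A quick illustration of why the fragment count explodes (my own back-of-the-envelope example, not from the slides): let f(n) be the number of fragments rooted at node n. Then f(n) = 1 for a preterminal, and otherwise f(n) = Π_j (1 + f(ch_j(n))), since each child slot can either stop at the child's label or continue with any of that child's own fragments. For a binary-branching tree this roughly squares at every level: 1, 4, 25, 676, 458329, … so even modest trees have an astronomical number of fragments, and the explicit vector h(T) can never be built.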

53-56 Key Challenge
Efficient computation: find a kernel that can compute similarity efficiently, in terms of common subtrees.
- Pure enumeration is clearly intractable
- Instead, compute recursively over subtrees, using a polynomial-time process

57 Counting Common Subtrees
C(n1, n2): the number of common subtrees rooted at n1 and n2.
(Worked example figure, due to F. Xia, not reproduced; the example trees appear to cover the phrase 'a red apple'.)

58-62 Calculating C(n1, n2)
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = 1
- Else: recurse over the children (the formula for this case is reconstructed below), where nc(n1) is the number of children of n1 (the same production implies the same number of children for n2) and ch(n1, j) is the j-th child of n1
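The recursive case on these slides was a formula image that did not survive the export. The standard Collins & Duffy recursion (reconstructed here, so treat it as a paraphrase rather than the slide itself) is

C(n1, n2) = Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))

Below is a short Python sketch of C, together with the full kernel K(T1, T2) = Σ_{n1, n2} C(n1, n2) that the following slides build up to; the tree encoding and function names are my own, purely for illustration.

from functools import lru_cache

# A node is (label, children), where children is a tuple of nodes;
# a preterminal is (POS, (word,)) with the word stored as a plain string.

def production(node):
    label, children = node
    return (label,) + tuple(c if isinstance(c, str) else c[0] for c in children)

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1])

@lru_cache(maxsize=None)
def C(n1, n2):
    """Number of common tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):        # same production => same number of children
        result *= 1 + C(c1, c2)
    return result

def nodes(tree):
    """All rule-rooted (non-word) nodes of a tree, root included."""
    out = [tree]
    for c in tree[1]:
        if not isinstance(c, str):
            out.extend(nodes(c))
    return out

def kernel(T1, T2):
    # K(T1, T2) = sum over all node pairs of C(n1, n2); memoization in C
    # keeps the whole computation polynomial in the tree sizes.
    return sum(C(n1, n2) for n1 in nodes(T1) for n2 in nodes(T2))

# Toy check: the NP "an apple" shares 6 fragments with itself
# (4 rooted at NP, plus DT -> an and N -> apple).
T = ("NP", (("DT", ("an",)), ("N", ("apple",))))
print(kernel(T, T))                          # 6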

63-66 Components of the Tree Representation
- Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
- Consider two trees T1 and T2, with N1 and N2 nodes respectively
- Define I_i(n) = 1 if the i-th subtree is rooted at n, and 0 otherwise
- Then each h_i(T) is a sum of these indicators over the nodes of T, and the kernel becomes a sum over node pairs (see the reconstruction below)
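The missing derivation, in its standard form (reconstructed; verify against the paper for exact notation):

h_i(T1) = Σ_{n1 in N1} I_i(n1)   and   h_i(T2) = Σ_{n2 in N2} I_i(n2)

so

K(T1, T2) = h(T1) . h(T2) = Σ_i h_i(T1) h_i(T2)
          = Σ_{n1 in N1} Σ_{n2 in N2} Σ_i I_i(n1) I_i(n2)
          = Σ_{n1 in N1} Σ_{n2 in N2} C(n1, n2),

since C(n1, n2) = Σ_i I_i(n1) I_i(n2) is exactly the number of fragments rooted at both n1 and n2.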

67-71 Tree Kernel Computation
- Running time: K(T1, T2) can be computed in O(N1 · N2) time, based on the recursive computation of C (as in the sketch above)
- Remaining issues:
  - K(T1, T2) depends on the sizes of T1 and T2
  - K(T1, T1) >> K(T1, T2) for T1 != T2 (on the order of 10^6 vs. 10^2): the kernel is very 'peaked'

72-75 Improvements
Managing tree size:
- Normalize (like cosine similarity; see below)
- Downweight large trees:
  - Restrict depth: just threshold it
  - Rescale with a weight λ
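The normalization formula is not in the transcript; the usual cosine-style version (an assumption consistent with the "like cosine similarity" hint above) is

K'(T1, T2) = K(T1, T2) / sqrt( K(T1, T1) * K(T2, T2) )

which makes every tree's self-similarity equal to 1 and removes the dependence on absolute tree size.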

76-77 Rescaling
Given two subtrees rooted at n1 and n2, and a weight 0 < λ <= 1:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions are the same and n1 and n2 are preterminals: C(n1, n2) = λ
- Else: the recursive case is rescaled by λ as well (see below)
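The rescaled recursive case, again reconstructed from the standard formulation rather than taken from the slide, is

C(n1, n2) = λ * Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))

This is equivalent to downweighting each fragment by λ^s, where s is the number of rule productions the fragment contains, so large fragments contribute far less to the kernel.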

78 Case Study

79-81 Parsing Experiment
Data: Penn Treebank ATIS corpus segment
- Training: 800 sentences, top 20 parses each
- Development: 200 sentences
- Test: 336 sentences; select the best candidate from the top 100 parses
Classifier: voted perceptron (kernelized like an SVM, but more computationally tractable)
Evaluation: 10 runs, average parse score reported

82 Experimental Results
- Baseline system: 74% (average parse score)
- The tree-kernel reranker gives a substantial improvement: 6% absolute

83 Summary
- Parsing as a reranking problem
- Tree kernel: computes similarity between trees based on shared fragments; efficient recursive computation procedure
- Yields improved performance on the parsing task

