Tree Kernels for Parsing (Collins & Duffy, 2001)
Advanced Statistical Methods in NLP, Ling 572, February 28, 2012

Roadmap (Collins & Duffy, 2001)
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing
Motivation: Parsing
- Parsing task: given a natural language sentence, extract its syntactic structure
  - Specifically, generate a corresponding parse tree
- Approaches:
  - "Classical" approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY
  - Probabilistic approach: build a large treebank of parsed sentences, learn production probabilities, parse with probabilistic versions of the standard algorithms, and pick the highest-probability parse
Parsing Issues
- Main issues:
  - Robustness: get a reasonable parse for any input
  - Ambiguity: select the best parse among alternatives
- "Classical" approach:
  - Hand-coded grammars are often fragile
  - No obvious mechanism to select among alternatives
- Probabilistic approach:
  - Fairly robust: assigns some (possibly small) probability to any parse
  - Selects by probability, but decisions are local
  - Hard to capture more global structure
Approach: Parsing by Reranking
- Intuition:
  - Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser
  - For training, identify (gold-standard parse, sentence) pairs
  - Create a vector representation of each parse tree, identifying (more global) parse-tree features
  - Train a reranker to rank the gold standard highest
  - Apply it to rerank the candidate parses for a new sentence
Parsing as Reranking, Formally
- Training data pairs: {(s_i, t_i)}, where s_i is a sentence and t_i is a parse tree
- C(s_i) = {x_ij}: candidate parses for s_i; wlog, x_i1 is the correct parse for s_i
- h(x_ij): feature vector representation of x_ij
- Training: learn a scoring function f(x) = w · h(x) that ranks the correct parse highest
- Decoding: compute x* = argmax_{x ∈ C(s)} w · h(x)
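Decoding is just an argmax over the candidate set: given any scoring function f, return the highest-scoring candidate. A minimal sketch (the names `rerank` and `score` are illustrative, not from the paper):

```python
def rerank(candidates, score):
    """Return the candidate parse with the highest model score f(x)."""
    return max(candidates, key=score)

# Toy usage: three candidate "parses" scored by a lookup table.
scores = {"parse_a": 0.2, "parse_b": 0.9, "parse_c": 0.5}
best = rerank(list(scores), score=lambda x: scores[x])
# best == "parse_b"
```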
Parsing as Reranking: Training
- Consider the hard-margin SVM model: minimize ||w||^2 subject to constraints
- What constraints? Here, ranking constraints:
  - Specifically, the correct parse outranks all other candidates
  - Formally: for all i, for all j ≥ 2: w · h(x_i1) − w · h(x_ij) ≥ 1
Reformulating with α
- Training learns α_ij such that w = Σ_{i, j≥2} α_ij (h(x_i1) − h(x_ij))
- Note: just like the SVM equation, with a different constraint
- Parse scoring: f(x) = w · h(x)
- After substitution, we have f(x) = Σ_{i, j≥2} α_ij (h(x_i1) − h(x_ij)) · h(x)
- After the kernel trick, we have f(x) = Σ_{i, j≥2} α_ij (K(x_i1, x) − K(x_ij, x))
- Note: with a suitable kernel K, we never need the explicit h(x)s
Parsing as Reranking: Perceptron Algorithm
- Similar to the SVM, learns a separating hyperplane
  - Modeled with weight vector w
  - Using a simple iterative procedure, based on correcting errors in the current model
- Algorithm:
  - Initialize all α_ij = 0
  - For i = 1,…,n; for j = 2,…,|C(s_i)|:
    - If f(x_i1) > f(x_ij): continue; else: α_ij += 1
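The update above can be sketched as a kernelized ranking perceptron: everything is expressed through the kernel K, so h(x) is never built explicitly. This is a minimal sketch, assuming candidates[i][0] is the gold parse; the function names and data layout are mine, not the paper's:

```python
def train_ranking_perceptron(candidates, K, epochs=5):
    """candidates[i]: candidate list for sentence i; candidates[i][0] is gold."""
    alpha = {}  # alpha[(i, j)]: correction count for candidate j of sentence i

    def f(x):
        # Dual-form score: f(x) = sum_ij alpha_ij * (K(x_i1, x) - K(x_ij, x))
        return sum(a * (K(candidates[i][0], x) - K(candidates[i][j], x))
                   for (i, j), a in alpha.items())

    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                # If gold does not outscore candidate j, correct the model.
                if not f(cands[0]) > f(cands[j]):
                    alpha[(i, j)] = alpha.get((i, j), 0) + 1
    return alpha, f

# Toy usage: explicit 2-d feature vectors with a dot-product kernel.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

data = [[(2, 0), (0, 1)], [(1, 1), (0, 2)]]  # gold parse first in each list
alpha, f = train_ranking_perceptron(data, dot)
# After training, the gold parse outscores its competitor in both lists.
```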
Defining the Kernel
- So we have a model: a framework for training and a framework for decoding
- But we still need to define a kernel K
- We need K: X × X → ℝ
- Recall that X is the space of parse trees, and K is a similarity function over it
What's in a Kernel?
- What are good attributes of a kernel?
  - Captures similarity between instances (here, between parse trees)
  - Captures more global parse information than a PCFG
  - Computable tractably, even over complex, large trees
Tree Kernel Proposal
- Idea: PCFG models learn MLE probabilities on rewrite rules
  - e.g. NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N
  - Local to single parent:children levels
- New measure incorporates all tree fragments in a parse
  - Captures higher-order, longer-distance dependencies
  - Tracks counts of individual rules, plus much more
Tree Fragment Example
[figure: fragments of an NP over 'apple'; not exhaustive]
Tree Representation
- Tree fragments: any subgraph with more than one node
  - Restriction: must include full (not partial) rule productions
- Parse tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
  - n: number of distinct tree fragments in the training data
  - h_i(T): number of occurrences of the i-th tree fragment in the current tree
Tree Representation
- Pros:
  - Fairly intuitive model
  - Natural inner-product interpretation
  - Captures long- and short-range dependencies
- Cons:
  - Size: the number of subtrees is exponential in the size of the tree
  - Direct computation of the inner product is intractable
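The blow-up is easy to see by counting: for a node n with children c_1..c_k, each fragment rooted at n either cuts a child off at its bare label or continues with one of that child's rooted fragments, giving F(n) = Π_k (1 + F(c_k)). For branching trees this grows doubly exponentially in depth. A small sketch, assuming trees encoded as nested tuples ('Label', child, ...) with strings as words (the encoding is mine):

```python
def rooted_fragments(t):
    """Count fragments rooted at node t (full productions only)."""
    if all(isinstance(c, str) for c in t[1:]):  # preterminal: only POS -> word
        return 1
    n = 1
    for child in t[1:]:
        # each child is either truncated to its bare label (1 way)
        # or expanded into any fragment rooted at that child
        n *= 1 + rooted_fragments(child)
    return n

def full_binary_tree(depth):
    """Full binary tree over dummy labels, for illustration."""
    if depth == 0:
        return ('X', 'w')
    sub = full_binary_tree(depth - 1)
    return ('X', sub, sub)

counts = [rooted_fragments(full_binary_tree(d)) for d in range(4)]
# counts == [1, 4, 25, 676]: doubly exponential in depth
```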
Key Challenge
- Efficient computation: find a kernel that computes similarity, in terms of common subtrees, efficiently
- Pure enumeration is clearly intractable
- Instead, compute recursively over subtrees, using a polynomial-time process
Counting Common Subtrees
- C(n1, n2): number of common subtrees rooted at n1 and n2
[figure: example trees over 'a red apple'; due to F. Xia]
Calculating C(n1,n2)
- Given two subtrees rooted at n1 and n2:
  - If the productions at n1 and n2 are different: C(n1,n2) = 0
  - If the productions are the same, and n1 and n2 are preterminals: C(n1,n2) = 1
  - Else: C(n1,n2) = Π_{j=1}^{nc(n1)} (1 + C(ch(n1,j), ch(n2,j)))
- nc(n1): number of children of n1. (What about n2? Same production implies the same number of children.)
- ch(n1,j): the j-th child of n1
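The recursion translates directly into code. A minimal sketch of C(n1,n2) and the resulting kernel, assuming trees encoded as nested tuples ('Label', child, ...) with strings as words (the encoding is mine, not the paper's):

```python
def label(t):
    return t if isinstance(t, str) else t[0]

def production(t):
    # the CFG rule applied at t, e.g. ('NP', ('DT', 'N'))
    return (t[0], tuple(label(c) for c in t[1:]))

def is_preterminal(t):
    return all(isinstance(c, str) for c in t[1:])

def nodes(t):
    # all internal (nonterminal/preterminal) nodes of t
    if isinstance(t, str):
        return []
    result = [t]
    for c in t[1:]:
        result.extend(nodes(c))
    return result

def C(n1, n2):
    """Number of common fragments rooted at the node pair (n1, n2)."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    total = 1
    for c1, c2 in zip(n1[1:], n2[1:]):  # same production => aligned children
        total *= 1 + C(c1, c2)
    return total

def tree_kernel(t1, t2):
    # K(T1, T2) = sum over node pairs of C(n1, n2)
    return sum(C(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ('NP', ('DT', 'the'), ('N', 'apple'))
t2 = ('NP', ('DT', 'a'), ('N', 'apple'))
# tree_kernel(t1, t1) == 6 (all six fragments of t1)
# tree_kernel(t1, t2) == 3 (only fragments not mentioning the determiner's word)
```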
Components of Tree Representation
- Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
- Consider two trees T_1, T_2 with node sets N_1, N_2, respectively
- Define I_i(n) = 1 if the i-th fragment is rooted at n, 0 otherwise
- Then h_i(T_1) = Σ_{n1 ∈ N_1} I_i(n1), and likewise for T_2
- So h(T_1) · h(T_2) = Σ_i h_i(T_1) h_i(T_2) = Σ_{n1 ∈ N_1} Σ_{n2 ∈ N_2} Σ_i I_i(n1) I_i(n2) = Σ_{n1 ∈ N_1} Σ_{n2 ∈ N_2} C(n1, n2)
Tree Kernel Computation
- K(T_1, T_2) = h(T_1) · h(T_2) = Σ_{n1 ∈ N_1} Σ_{n2 ∈ N_2} C(n1, n2)
- Running time: O(|N_1| |N_2|), based on the recursive computation of C
- Remaining issues:
  - K(T_1, T_2) depends on the sizes of T_1 and T_2
  - K(T_1, T_1) >> K(T_1, T_2) for T_1 ≠ T_2: the kernel is very 'peaked'
Improvements
- Managing tree size:
  - Normalize (like cosine similarity): K'(T_1, T_2) = K(T_1, T_2) / √(K(T_1, T_1) K(T_2, T_2))
- Downweighting large trees:
  - Restrict depth: just threshold
  - Rescale with a weight λ
Rescaling
- Given two subtrees rooted at n1 and n2, with 0 < λ ≤ 1:
  - If the productions at n1 and n2 are different: C(n1,n2) = 0
  - If the productions are the same, and n1 and n2 are preterminals: C(n1,n2) = λ
  - Else: C(n1,n2) = λ Π_{j=1}^{nc(n1)} (1 + C(ch(n1,j), ch(n2,j)))
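Both fixes drop into the same recursion: multiply each level's contribution by λ, then normalize the resulting kernel like cosine similarity. A self-contained sketch, again assuming trees encoded as nested tuples with strings as words (encoding mine):

```python
import math

def label(t):
    return t if isinstance(t, str) else t[0]

def production(t):
    return (t[0], tuple(label(c) for c in t[1:]))

def is_preterminal(t):
    return all(isinstance(c, str) for c in t[1:])

def nodes(t):
    if isinstance(t, str):
        return []
    out = [t]
    for c in t[1:]:
        out.extend(nodes(c))
    return out

def C(n1, n2, lam):
    """Damped count of common fragments rooted at (n1, n2); 0 < lam <= 1."""
    if production(n1) != production(n2):
        return 0.0
    if is_preterminal(n1):
        return lam
    total = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        total *= 1 + C(c1, c2, lam)
    return total

def kernel(t1, t2, lam=0.5):
    return sum(C(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

def normalized_kernel(t1, t2, lam=0.5):
    # cosine-style normalization removes the dependence on tree size
    return kernel(t1, t2, lam) / math.sqrt(kernel(t1, t1, lam) * kernel(t2, t2, lam))

t = ('NP', ('DT', 'the'), ('N', 'apple'))
# normalized_kernel(t, t) == 1.0; lam < 1 shrinks the raw kernel (2.125 vs 6.0 at lam=0.5)
```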
Case Study
Parsing Experiment
- Data: Penn Treebank ATIS corpus segment
  - Training: 800 sentences, top 20 parses each
  - Development: 200 sentences
  - Test: 336 sentences; select the best candidate from the top 100 parses
- Classifier: voted perceptron
  - Kernelized like an SVM, but more computationally tractable
- Evaluation: 10 runs; average parse score reported
Experimental Results
- Baseline system: 74%
- Substantial improvement from reranking: 6% absolute (to 80%)
Summary
- Parsing as a reranking problem
- Tree kernel: computes similarity between trees based on their fragments
  - Efficient recursive computation procedure
- Yields improved performance on the parsing task