
Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.




1 Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012

2 Roadmap
Collins & Duffy, 2001, Tree Kernels for Parsing:
- Motivation
- Parsing as reranking
- Tree kernels for similarity
- Case study: Penn Treebank parsing

3-6 Motivation: Parsing
Parsing task: given a natural language sentence, extract its syntactic structure; specifically, generate a corresponding parse tree.
Approaches:
- "Classical" approach: hand-write CFG productions; parse with a standard algorithm, e.g. CKY
- Probabilistic approach: build a large treebank of parsed sentences; learn production probabilities; use probabilistic versions of the standard algorithms; pick the highest-probability parse (a toy scoring sketch follows)
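To make the "pick the highest-probability parse" step concrete, here is a minimal sketch (not from the lecture) of scoring a tree under a toy PCFG: a parse's probability is the product of the probabilities of the productions it uses. The rule probabilities and the example tree below are hypothetical.

# Minimal PCFG scoring sketch: a parse tree's probability is the product
# of the probabilities of the CFG productions used in it.
# The grammar and tree below are toy examples, not from the lecture data.

toy_rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "N")): 0.4,
    ("NP", ("N",)): 0.6,
    ("VP", ("V", "NP")): 1.0,
}

def tree_prob(node):
    """node = (label, [children]) for internal nodes, (label, word) for preterminals."""
    label, children = node
    if isinstance(children, str):          # preterminal -> word; lexical probabilities ignored here
        return 1.0
    rhs = tuple(child[0] for child in children)
    p = toy_rule_probs[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

tree = ("S", [("NP", [("N", "John")]),
              ("VP", [("V", "ate"), ("NP", [("DT", "an"), ("N", "apple")])])])
print(tree_prob(tree))                     # 1.0 * 0.6 * 1.0 * 0.4 = 0.24

A PCFG parser would compute such scores for all candidate trees (efficiently, with probabilistic CKY) and return the highest-scoring one.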

7-10 Parsing Issues
Main issues:
- Robustness: get a reasonable parse for any input
- Ambiguity: select the best parse from the alternatives
"Classic" approach:
- Hand-coded grammars are often fragile
- No obvious mechanism to select among alternative parses
Probabilistic approach:
- Fairly good robustness: any input gets at least a low-probability parse
- Selects by probability, but the decisions are local; hard to capture more global structure

11-14 Approach: Parsing by Reranking
Intuition:
- Identify a collection of candidate parses for each sentence, e.g. the output of a PCFG parser
- For training, identify the gold-standard (sentence, parse) pairs
- Create a vector representation of each parse tree, identifying (more global) parse-tree features
- Train a reranker to rank the gold standard highest
- Apply it to rerank the candidate parses for a new sentence

15-20 Parsing as Reranking, Formally
- Training data pairs: {(s_i, t_i)}, where s_i is a sentence and t_i is its parse tree
- C(s_i) = {x_ij}: candidate parses for s_i; w.l.o.g., x_i1 is the correct parse for s_i
- h(x_ij): feature-vector representation of x_ij
- Training: learn a scoring function over candidates (the slide's formula is reconstructed below)
- Decoding: compute the highest-scoring candidate (see below)
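The formulas on these slides were images and did not survive the transcript. In the standard linear reranking setup (a reconstruction, so treat the exact notation as an assumption rather than the slide's wording), each candidate is scored with a weight vector w:

f(x) = w . h(x)

Training learns w from the pairs {(s_i, t_i)}; decoding picks the highest-scoring candidate for a new sentence s:

x* = argmax_{x in C(s)} f(x) = argmax_{x in C(s)} w . h(x)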

21-26 Parsing as Reranking: Training
- Consider the hard-margin SVM model: minimize ||w||^2 subject to constraints
- What constraints? Here, ranking constraints: the correct parse must outrank all other candidates
- Formally, the correct parse's score must exceed each competitor's by a margin (see the reconstruction below)
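The constraint formula itself is missing from the transcript; the usual hard-margin ranking constraints (reconstructed, not the slide's exact notation) are:

for all i, for all j >= 2:   w . h(x_i1) - w . h(x_ij) >= 1

i.e. minimize ||w||^2 subject to the correct parse beating every other candidate by a margin of at least 1.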

27-32 Reformulating with α
- Training learns dual weights α_ij, one per incorrect candidate, such that w is a weighted sum of feature-vector differences (formula reconstructed below)
- Note: just like the SVM equation, with a different constraint
- Parse scoring: after substituting this w, and then applying the kernel trick, scoring is expressed entirely through a kernel K (see below)
- Note: with a suitable kernel K, we never need the explicit h(x) vectors
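The missing formulas, written in the standard dual form (a reconstruction; the slides' exact notation may differ):

w = Σ_{i,j} α_ij (h(x_i1) - h(x_ij))

Substituting into f(x) = w . h(x):

f(x) = Σ_{i,j} α_ij (h(x_i1) . h(x) - h(x_ij) . h(x))

and, with the kernel trick K(x, y) = h(x) . h(y):

f(x) = Σ_{i,j} α_ij (K(x_i1, x) - K(x_ij, x))

so only kernel evaluations, never the explicit feature vectors, are required.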

33-36 Parsing as Reranking: Perceptron Algorithm
- Similar to the SVM, it learns a separating hyperplane, modeled with a weight vector w
- Uses a simple iterative procedure based on correcting errors made by the current model
Algorithm (dual form):
- Initialize α_ij = 0
- For i = 1,…,n; for j = 2,…,|C(s_i)|:
  - If f(x_i1) > f(x_ij): continue
  - Else: α_ij += 1
(f is computed with the current α values; a code sketch follows.)
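A compact Python sketch of the dual (kernelized) ranking perceptron described on these slides. The kernel here is a plain dot product over toy feature vectors standing in for the tree kernel, and all data and function names are assumptions for illustration only.

# Dual ranking perceptron sketch. candidates[i][0] is the correct parse for
# sentence i; candidates[i][j] (j >= 1) are the competitors. Each parse is
# represented directly by a toy feature vector, and K is a plain dot product
# standing in for the tree kernel.

def K(x, y):
    return sum(a * b for a, b in zip(x, y))

def score(x, alpha, candidates):
    # f(x) = sum_ij alpha_ij * (K(x_i1, x) - K(x_ij, x))
    total = 0.0
    for i, cands in enumerate(candidates):
        correct = cands[0]
        for j in range(1, len(cands)):
            if alpha[i][j]:
                total += alpha[i][j] * (K(correct, x) - K(cands[j], x))
    return total

def train(candidates, epochs=5):
    alpha = [[0.0] * len(cands) for cands in candidates]
    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                # Error: a competitor scores at least as high as the correct parse.
                if score(cands[0], alpha, candidates) <= score(cands[j], alpha, candidates):
                    alpha[i][j] += 1.0
    return alpha

# Toy data: two "sentences", each with the correct parse followed by two competitors.
candidates = [
    [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 1.0)],
    [(1.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
print(train(candidates))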

37-40 Defining the Kernel
- So we have a model: a framework for training and a framework for decoding
- But we still need to define the kernel K
- We need K: X × X → R, where X is the space of parse trees and K is a similarity function between them

41-44 What's in a Kernel?
What are good attributes of a kernel?
- Captures similarity between instances; here, between parse trees
- Captures more global parse information than a PCFG
- Is computable tractably, even over complex, large trees

45-46 Tree Kernel Proposal
Idea:
- PCFG models learn MLE probabilities on rewrite rules (NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N), which are local to one parent-and-children level
- The new measure incorporates all tree fragments of a parse: it captures higher-order, longer-distance dependencies, tracking counts of individual rules and much more

47 Tree Fragment Example
(Figure, not reproduced: fragments of the NP over 'apple'; the set shown is not exhaustive.)

48-49 Tree Representation
Tree fragments: any subgraph with more than one node, with the restriction that it must include full (not partial) rule productions.
Parse-tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)), where n is the number of distinct tree fragments in the training data and h_i(T) is the number of occurrences of the i-th tree fragment in the current tree.

50-52 Tree Representation
Pros:
- Fairly intuitive model
- Natural inner-product interpretation
- Captures long- and short-range dependencies
Cons:
- Size: the number of subtrees is exponential in the size of the tree (an illustration follows)
- Direct computation of the inner product is intractable
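A quick illustration of why the fragment count explodes (my own back-of-the-envelope example, not from the slides): let f(n) be the number of fragments rooted at node n. Then f(n) = 1 for a preterminal, and otherwise f(n) = Π_j (1 + f(ch_j(n))), since each child slot can either stop at the child's label or continue with any of that child's own fragments. For a binary-branching tree this roughly squares at every level: 1, 4, 25, 676, 458329, … so even modest trees have an astronomical number of fragments, and the explicit vector h(T) can never be built.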

53-56 Key Challenge
Efficient computation: find a kernel that can compute similarity efficiently, in terms of common subtrees.
- Pure enumeration is clearly intractable
- Instead, compute recursively over subtrees, using a polynomial-time process

57 Counting Common Subtrees
C(n1, n2): the number of common subtrees rooted at n1 and n2.
(Worked example figure, due to F. Xia, not reproduced; the example trees appear to cover the phrase 'a red apple'.)

58-62 Calculating C(n1, n2)
Given two subtrees rooted at n1 and n2:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions at n1 and n2 are the same, and n1 and n2 are preterminals: C(n1, n2) = 1
- Else: recurse over the children (the formula for this case is reconstructed below), where nc(n1) is the number of children of n1 (the same production implies the same number of children for n2) and ch(n1, j) is the j-th child of n1
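The recursive case on these slides was a formula image that did not survive the export. The standard Collins & Duffy recursion (reconstructed here, so treat it as a paraphrase rather than the slide itself) is

C(n1, n2) = Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))

Below is a short Python sketch of C, together with the full kernel K(T1, T2) = Σ_{n1, n2} C(n1, n2) that the following slides build up to; the tree encoding and function names are my own, purely for illustration.

from functools import lru_cache

# A node is (label, children), where children is a tuple of nodes;
# a preterminal is (POS, (word,)) with the word stored as a plain string.

def production(node):
    label, children = node
    return (label,) + tuple(c if isinstance(c, str) else c[0] for c in children)

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1])

@lru_cache(maxsize=None)
def C(n1, n2):
    """Number of common tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):        # same production => same number of children
        result *= 1 + C(c1, c2)
    return result

def nodes(tree):
    """All rule-rooted (non-word) nodes of a tree, root included."""
    out = [tree]
    for c in tree[1]:
        if not isinstance(c, str):
            out.extend(nodes(c))
    return out

def kernel(T1, T2):
    # K(T1, T2) = sum over all node pairs of C(n1, n2); memoization in C
    # keeps the whole computation polynomial in the tree sizes.
    return sum(C(n1, n2) for n1 in nodes(T1) for n2 in nodes(T2))

# Toy check: the NP "an apple" shares 6 fragments with itself
# (4 rooted at NP, plus DT -> an and N -> apple).
T = ("NP", (("DT", ("an",)), ("N", ("apple",))))
print(kernel(T, T))                          # 6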

63-66 Components of the Tree Representation
- Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T))
- Consider two trees T1 and T2, with N1 and N2 nodes respectively
- Define I_i(n) = 1 if the i-th subtree is rooted at n, and 0 otherwise
- Then each h_i(T) is a sum of these indicators over the nodes of T, and the kernel becomes a sum over node pairs (see the reconstruction below)
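The missing derivation, in its standard form (reconstructed; verify against the paper for exact notation):

h_i(T1) = Σ_{n1 in N1} I_i(n1)   and   h_i(T2) = Σ_{n2 in N2} I_i(n2)

so

K(T1, T2) = h(T1) . h(T2) = Σ_i h_i(T1) h_i(T2)
          = Σ_{n1 in N1} Σ_{n2 in N2} Σ_i I_i(n1) I_i(n2)
          = Σ_{n1 in N1} Σ_{n2 in N2} C(n1, n2),

since C(n1, n2) = Σ_i I_i(n1) I_i(n2) is exactly the number of fragments rooted at both n1 and n2.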

67-71 Tree Kernel Computation
- Running time: K(T1, T2) can be computed in O(N1 · N2) time, based on the recursive computation of C (as in the sketch above)
- Remaining issues:
  - K(T1, T2) depends on the sizes of T1 and T2
  - K(T1, T1) >> K(T1, T2) for T1 != T2 (on the order of 10^6 vs. 10^2): the kernel is very 'peaked'

72-75 Improvements
Managing tree size:
- Normalize (like cosine similarity; see below)
- Downweight large trees:
  - Restrict depth: just threshold it
  - Rescale with a weight λ
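The normalization formula is not in the transcript; the usual cosine-style version (an assumption consistent with the "like cosine similarity" hint above) is

K'(T1, T2) = K(T1, T2) / sqrt( K(T1, T1) * K(T2, T2) )

which makes every tree's self-similarity equal to 1 and removes the dependence on absolute tree size.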

76-77 Rescaling
Given two subtrees rooted at n1 and n2, and a weight 0 < λ <= 1:
- If the productions at n1 and n2 are different: C(n1, n2) = 0
- If the productions are the same and n1 and n2 are preterminals: C(n1, n2) = λ
- Else: the recursive case is rescaled by λ as well (see below)
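The rescaled recursive case, again reconstructed from the standard formulation rather than taken from the slide, is

C(n1, n2) = λ * Π_{j=1..nc(n1)} (1 + C(ch(n1, j), ch(n2, j)))

This is equivalent to downweighting each fragment by λ^s, where s is the number of rule productions the fragment contains, so large fragments contribute far less to the kernel.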

78 Case Study

79-81 Parsing Experiment
Data: Penn Treebank ATIS corpus segment
- Training: 800 sentences, top 20 parses each
- Development: 200 sentences
- Test: 336 sentences; select the best candidate from the top 100 parses
Classifier: voted perceptron (kernelized like an SVM, but more computationally tractable)
Evaluation: 10 runs, average parse score reported

82 Experimental Results
- Baseline system: 74% (average parse score)
- The tree-kernel reranker gives a substantial improvement: 6% absolute

83 Summary
- Parsing as a reranking problem
- Tree kernel: computes similarity between trees based on shared fragments; efficient recursive computation procedure
- Yields improved performance on the parsing task

