Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.

Roadmap Collins & Duffy, 2001 Tree Kernels for Parsing: Motivation Parsing as reranking Tree kernels for similarity Case study: Penn Treebank parsing 2

Motivation: Parsing Parsing task: Given a natural language sentence, extract its syntactic structure Specifically, generate a corresponding parse tree 3

Motivation: Parsing Parsing task: Given a natural language sentence, extract its syntactic structure Specifically, generate a corresponding parse tree Approaches: 4

Motivation: Parsing Parsing task: Given a natural language sentence, extract its syntactic structure Specifically, generate a corresponding parse tree Approaches: “Classical” approach: Hand-write CFG productions; use standard alg, e.g. CKY 5

Motivation: Parsing Parsing task: Given a natural language sentence, extract its syntactic structure Specifically, generate a corresponding parse tree Approaches: “Classical” approach: Hand-write CFG productions; use standard alg, e.g. CKY Probabilistic approach: Build large treebank of parsed sentences Learn production probabilities Use probabilistic versions of standard alg Pick highest probability parse 6
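
As a concrete (invented) illustration of the probabilistic approach, a PCFG scores a parse as the product of the probabilities of the productions it uses, and the parser returns the highest-scoring candidate. The grammar fragment and probabilities below are made up for illustration; in practice they are estimated from a treebank:

```python
from math import log

# Hypothetical rule probabilities P(rhs | lhs), estimated from a treebank in practice.
rule_logprob = {
    ("S", ("NP", "VP")): log(1.0),
    ("NP", ("DT", "N")): log(0.4),
    ("NP", ("N",)): log(0.3),
    ("VP", ("V", "NP")): log(0.6),
}

def tree_logprob(tree):
    """Log-probability of a parse tree: sum of log P(rule) over its productions.

    A tree is (label, children); a preterminal's single child is a word string.
    Lexical probabilities P(word | tag) are omitted to keep the sketch short.
    """
    label, children = tree
    if all(isinstance(c, str) for c in children):  # preterminal node over a word
        return 0.0
    rhs = tuple(c[0] for c in children)
    # Unseen rules get log-probability -inf (probability 0).
    return rule_logprob.get((label, rhs), float("-inf")) + sum(tree_logprob(c) for c in children)
```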

Parsing Issues Main issues: 7

Parsing Issues Main issues: Robustness: get reasonable parse for any input Ambiguity: select best parse given alternatives “Classic” approach: 8

Parsing Issues Main issues: Robustness: get reasonable parse for any input Ambiguity: select best parse given alternatives “Classic” approach: Hand-coded grammars often fragile No obvious mechanism to select among alternatives Probabilistic approach: 9

Parsing Issues Main issues: Robustness: get reasonable parse for any input Ambiguity: select best parse given alternatives “Classic” approach: Hand-coded grammars often fragile No obvious mechanism to select among alternatives Probabilistic approach: Fairly good robustness: assigns some (possibly small) probability to any parse Select by probability, but decisions are local Hard to capture more global structure 10

Approach: Parsing by Reranking Intuition: Identify collection of candidate parses for each sentence e.g. output of PCFG parser For training, identify gold standard parse, sentence pair 11

Approach: Parsing by Reranking Intuition: Identify collection of candidate parses for each sentence e.g. output of PCFG parser For training, identify gold standard parse, sentence pair Create a parse tree vector representation Identify (more global) parse tree features 12

Approach: Parsing by Reranking Intuition: Identify collection of candidate parses for each sentence e.g. output of PCFG parser For training, identify gold standard parse, sentence pair Create a parse tree vector representation Identify (more global) parse tree features Train a reranker to rank gold standard highest 13

Approach: Parsing by Reranking Intuition: Identify collection of candidate parses for each sentence e.g. output of PCFG parser For training, identify gold standard parse, sentence pair Create a parse tree vector representation Identify (more global) parse tree features Train a reranker to rank gold standard highest Apply to rerank candidate parses for new sentence 14

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree 15

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree 16

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree C(s_i) = {x_ij}: Candidate parses for s_i wlog, x_i1 is the correct parse for s_i 17

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree C(s_i) = {x_ij}: Candidate parses for s_i wlog, x_i1 is the correct parse for s_i h(x_ij): feature vector representation of x_ij 18

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree C(s_i) = {x_ij}: Candidate parses for s_i wlog, x_i1 is the correct parse for s_i h(x_ij): feature vector representation of x_ij Training: Learn a weight vector w 19

Parsing as Reranking, Formally Training data pairs: {(s_i, t_i)} where s_i is a sentence, t_i is a parse tree C(s_i) = {x_ij}: Candidate parses for s_i wlog, x_i1 is the correct parse for s_i h(x_ij): feature vector representation of x_ij Training: Learn a weight vector w Decoding: Compute t(s) = argmax_{x in C(s)} w · h(x) 20
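
Decoding is then just an argmax over the candidate list. A minimal sketch, assuming an explicit feature map h (the kernel trick introduced below removes the need for it); all names here are illustrative, not from the slides:

```python
import numpy as np

def rerank(candidates, h, w):
    """Return the candidate parse with the highest linear score w · h(x)."""
    # candidates: candidate parse trees for one sentence (e.g., a PCFG's n-best list)
    # h: function mapping a parse tree to its feature vector (numpy array)
    # w: learned weight vector
    scores = [float(np.dot(w, h(x))) for x in candidates]
    return candidates[int(np.argmax(scores))]
```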

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints 21

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints What constraints? 22

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints What constraints? Here, ranking constraints: Specifically, correct parse outranks all other candidates 23

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints What constraints? Here, ranking constraints: Specifically, correct parse outranks all other candidates Formally, 24

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints What constraints? Here, ranking constraints: Specifically, correct parse outranks all other candidates Formally, 25

Parsing as Reranking: Training Consider the hard-margin SVM model: Minimize ||w||^2 subject to constraints What constraints? Here, ranking constraints: Specifically, correct parse outranks all other candidates Formally, 26
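
The equation that followed "Formally," on these slides was an image and did not survive transcription; it is presumably the standard hard-margin ranking formulation, reconstructed here rather than copied from the slides:

```latex
\min_{w} \ \|w\|^2
\quad \text{subject to} \quad
w \cdot h(x_{i1}) \;\ge\; w \cdot h(x_{ij}) + 1
\qquad \text{for all } i \text{ and all } j \ge 2 .
```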

Reformulating with α Training learns α_ij, such that 27

Reformulating with α Training learns α_ij, such that Note: just like SVM equation, w/different constraint 28

Reformulating with α Training learns α_ij, such that Note: just like SVM equation, w/different constraint Parse scoring: 29

Reformulating with α Training learns α_ij, such that Note: just like SVM equation, w/different constraint Parse scoring: After substitution, we have 30

Reformulating with α Training learns α_ij, such that Note: just like SVM equation, w/different constraint Parse scoring: After substitution, we have After the kernel trick, we have 31

Reformulating with α Training learns α_ij, such that Note: just like SVM equation, w/different constraint Parse scoring: After substitution, we have After the kernel trick, we have Note: With a suitable kernel K, don’t need h(x)s 32
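
The equations elided above (images on the original slides) are presumably the standard dual form: w = Σ_{i,j} α_ij (h(x_i1) − h(x_ij)), so a parse is scored as f(x) = w · h(x) = Σ_{i,j} α_ij (h(x_i1) · h(x) − h(x_ij) · h(x)), and with the kernel trick f(x) = Σ_{i,j} α_ij (K(x_i1, x) − K(x_ij, x)). A minimal Python sketch of that scorer; the data layout and names (train_pairs, alpha, K) are assumptions for illustration:

```python
def score(x, train_pairs, alpha, K):
    """Dual-form parse score: f(x) = sum_ij alpha_ij * (K(x_i1, x) - K(x_ij, x))."""
    # train_pairs: one (correct_parse, other_candidate) pair per (i, j)
    # alpha: the matching dual weights alpha_ij
    # K: a kernel function over parse trees, e.g., the tree kernel defined later
    return sum(a * (K(x_i1, x) - K(x_ij, x))
               for a, (x_i1, x_ij) in zip(alpha, train_pairs))
```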

Parsing as reranking: Perceptron algorithm Similar to SVM, learns separating hyperplane 33

Parsing as reranking: Perceptron algorithm Similar to SVM, learns separating hyperplane Modeled with weight vector w Using simple iterative procedure Based on correcting errors in current model 34

Parsing as reranking: Perceptron algorithm Similar to SVM, learns separating hyperplane Modeled with weight vector w Using simple iterative procedure Based on correcting errors in current model Initialize α_ij = 0 35

Parsing as reranking: Perceptron algorithm Similar to SVM, learns separating hyperplane Modeled with weight vector w Using simple iterative procedure Based on correcting errors in current model Initialize α_ij = 0 For i = 1,…,n; for each candidate j = 2,…,|C(s_i)|: If f(x_i1) > f(x_ij): continue; else: α_ij += 1 36
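
A minimal sketch of that update loop, reusing the score() sketch above; the data layout (each candidate list stores the correct parse first) is an assumption for illustration:

```python
def train_perceptron(training, K, epochs=1):
    """Kernelized perceptron for reranking: alpha_ij += 1 whenever the correct parse
    is not ranked above candidate j under the current model."""
    # training: list of candidate lists; by convention candidates[0] is the correct parse.
    pairs, alpha = [], []
    index = {}  # (i, j) -> position in pairs/alpha
    for _ in range(epochs):
        for i, candidates in enumerate(training):
            correct = candidates[0]
            for j, other in enumerate(candidates[1:], start=2):
                # Correct parse already outranks this candidate: no update needed.
                if score(correct, pairs, alpha, K) > score(other, pairs, alpha, K):
                    continue
                if (i, j) not in index:
                    index[(i, j)] = len(pairs)
                    pairs.append((correct, other))
                    alpha.append(0.0)
                alpha[index[(i, j)]] += 1.0
    return pairs, alpha
```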

Defining the Kernel So, we have a model: Framework for training Framework for decoding 37

Defining the Kernel So, we have a model: Framework for training Framework for decoding But need to define a kernel K 38

Defining the Kernel So, we have a model: Framework for training Framework for decoding But need to define a kernel K We need: K: X × X → R 39

Defining the Kernel So, we have a model: Framework for training Framework for decoding But need to define a kernel K We need: K: X × X → R Recall that each x in X is a parse tree, and K is a similarity function 40

What’s in a Kernel? What are good attributes of a kernel? 41

What’s in a Kernel? What are good attributes of a kernel? Capture similarity between instances Here, between parse trees 42

What’s in a Kernel? What are good attributes of a kernel? Capture similarity between instances Here, between parse trees Capture more global parse information than PCFG 43

What’s in a Kernel? What are good attributes of a kernel? Capture similarity between instances Here, between parse trees Capture more global parse information than PCFG Computable tractably, even over complex, large trees 44

Tree Kernel Proposal Idea: PCFG models learn MLE probabilities on rewrite rules NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N Local to parent:children levels 45

Tree Kernel Proposal Idea: PCFG models learn MLE probabilities on rewrite rules NP → N vs. NP → DT N vs. NP → PN vs. NP → DT JJ N Local to parent:children levels New measure incorporates all tree fragments in parse Captures higher-order, longer-distance dependencies Track counts of individual rules + much more 46

Tree Fragment Example Fragments of NP over ‘apple’ Not exhaustive 47

Tree Representation Tree fragments: Any subgraph with more than one node Restriction: Must include full (not partial) rule productions Parse tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) n: number of distinct tree fragments in training data h_i(T): # of occurrences of the i-th tree fragment in current tree 48

Tree Representation Tree fragments: Any subgraph with more than one node Restriction: Must include full (not partial) rule productions Parse tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) n: number of distinct tree fragments in training data h_i(T): # of occurrences of the i-th tree fragment in current tree 49

Tree Representation Pros: 50

Tree Representation Pros: Fairly intuitive model Natural inner product interpretation Captures long- and short-range dependencies Cons: 51

Tree Representation Pros: Fairly intuitive model Natural inner product interpretation Captures long- and short-range dependencies Cons: Size!!!: # subtrees exponential in size of tree Direct computation of inner product intractable 52

Key Challenge 53

Key Challenge Efficient computation: Find a kernel that can compute similarity efficiently In terms of common subtrees 54

Key Challenge Efficient computation: Find a kernel that can compute similarity efficiently In terms of common subtrees Pure enumeration clearly intractable 55

Key Challenge Efficient computation: Find a kernel that can compute similarity efficiently In terms of common subtrees Pure enumeration clearly intractable Compute recursively over subtrees Using a polynomial process 56

Counting Common Subtrees Example: C(n1,n2): number of common subtrees rooted at n1 and n2 (example trees over the phrase ‘a red apple’) Due to F. Xia 57

Calculating C(n1,n2) Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, 58

Calculating C(n1,n2) Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, 59

Calculating C(n1,n2) Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, C(n1,n2) =1 Else: 60

Calculating C(n1,n2) Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, C(n1,n2) =1 Else: nc(n1): # children of n1: What about n2? 61

Calculating C(n1,n2) Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, C(n1,n2) = 1 Else: nc(n1): # children of n1: What about n2? same production → same # children ch(n1,j): j-th child of n1, so C(n1,n2) = ∏_{j=1..nc(n1)} (1 + C(ch(n1,j), ch(n2,j))) 62
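
A direct transcription of that recursion into Python, assuming the same (label, children) tree encoding used in the earlier sketches (any node whose children are all word strings is treated as a preterminal):

```python
def count_common_subtrees(n1, n2):
    """C(n1, n2): number of common tree fragments rooted at nodes n1 and n2."""
    def production(node):
        label, children = node
        return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

    def is_preterminal(node):
        return all(isinstance(c, str) for c in node[1])

    if production(n1) != production(n2):   # different rules: no common fragments
        return 0
    if is_preterminal(n1):                 # same preterminal rule: exactly one fragment
        return 1
    result = 1
    # Same production implies the same number of children, so zip is safe.
    for c1, c2 in zip(n1[1], n2[1]):
        result *= 1 + count_common_subtrees(c1, c2)
    return result
```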

Components of Tree Representation Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) Consider 2 trees: T_1, T_2 Number of nodes: N_1, N_2, respectively 63

Components of Tree Representation Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) Consider 2 trees: T_1, T_2 Number of nodes: N_1, N_2, respectively Define: I_i(n) = 1 if the i-th subtree is rooted at n, 0 otherwise 64

Components of Tree Representation Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) Consider 2 trees: T_1, T_2 Number of nodes: N_1, N_2, respectively Define: I_i(n) = 1 if the i-th subtree is rooted at n, 0 otherwise Then, h_i(T_1) = Σ_{n1 in T_1} I_i(n1) 65

Components of Tree Representation Tree representation: h(T) = (h_1(T), h_2(T), …, h_n(T)) Consider 2 trees: T_1, T_2 Number of nodes: N_1, N_2, respectively Define: I_i(n) = 1 if the i-th subtree is rooted at n, 0 otherwise Then, h_i(T_1) = Σ_{n1 in T_1} I_i(n1), and K(T_1,T_2) = h(T_1) · h(T_2) = Σ_i h_i(T_1) h_i(T_2) = Σ_{n1 in T_1} Σ_{n2 in T_2} Σ_i I_i(n1) I_i(n2) = Σ_{n1 in T_1} Σ_{n2 in T_2} C(n1,n2) 66
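
The double sum over node pairs translates directly into code; this sketch reuses count_common_subtrees from above (memoization or the dynamic-programming table from the paper would make it faster in practice):

```python
def tree_kernel(t1, t2):
    """K(T1, T2) = sum over node pairs (n1, n2) of C(n1, n2)."""
    def nodes(tree):
        # Collect all (label, children) nodes; bare word strings are not nodes.
        if isinstance(tree, str):
            return []
        collected = [tree]
        for child in tree[1]:
            collected.extend(nodes(child))
        return collected

    return sum(count_common_subtrees(n1, n2)
               for n1 in nodes(t1) for n2 in nodes(t2))
```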

Tree Kernel Computation 67

Tree Kernel Computation Running time: K(T_1,T_2) is O(N_1 N_2): Based on recursive computation of C 68

Tree Kernel Computation Running time: K(T_1,T_2) is O(N_1 N_2): Based on recursive computation of C Remaining Issues: 69

Tree Kernel Computation Running time: K(T_1,T_2) is O(N_1 N_2): Based on recursive computation of C Remaining Issues: K(T_1,T_2) depends on size of T_1 and T_2 70

Tree Kernel Computation Running time: K(T_1,T_2) is O(N_1 N_2): Based on recursive computation of C Remaining Issues: K(T_1,T_2) depends on size of T_1 and T_2 K(T_1,T_1) >> K(T_1,T_2) for T_1 ≠ T_2: Very ‘peaked’ 71

Improvements Managing tree size: 72

Improvements Managing tree size: Normalize!! (like cosine similarity) 73

Improvements Managing tree size: Normalize!! (like cosine similarity) Downweight large trees: 74

Improvements Managing tree size: Normalize!! (like cosine similarity) Downweight large trees: Restrict depth: just threshold Rescale with weight λ 75
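
One way to realize the normalization the slide suggests is the usual cosine-style form K'(T_1,T_2) = K(T_1,T_2) / sqrt(K(T_1,T_1) · K(T_2,T_2)); a sketch using the tree_kernel function above:

```python
from math import sqrt

def normalized_tree_kernel(t1, t2):
    """Cosine-style normalization so kernel values are comparable across tree sizes."""
    return tree_kernel(t1, t2) / sqrt(tree_kernel(t1, t1) * tree_kernel(t2, t2))
```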

Rescaling Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, C(n1,n2) = λ Else: C(n1,n2) = λ ∏_{j=1..nc(n1)} (1 + C(ch(n1,j), ch(n2,j))), where 0 < λ <= 1 76

Rescaling Given two subtrees rooted at n1 and n2 If productions at n1 and n2 are different, C(n1,n2) = 0 If productions at n1 and n2 are the same, And n1 and n2 are preterminals, C(n1,n2) = λ Else: C(n1,n2) = λ ∏_{j=1..nc(n1)} (1 + C(ch(n1,j), ch(n2,j))), where 0 < λ <= 1 77
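
The λ-rescaled recursion changes only the base case and the leading factor; here is the earlier sketch with the extra parameter (the default value 0.5 is arbitrary, chosen only for illustration):

```python
def count_common_subtrees_scaled(n1, n2, lam=0.5):
    """C(n1, n2) with each additional level of depth downweighted by lam (0 < lam <= 1)."""
    def production(node):
        label, children = node
        return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

    if production(n1) != production(n2):
        return 0
    if all(isinstance(c, str) for c in n1[1]):   # preterminal
        return lam
    result = lam
    for c1, c2 in zip(n1[1], n2[1]):
        result *= 1 + count_common_subtrees_scaled(c1, c2, lam)
    return result
```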

Case Study 78

Parsing Experiment Data: Penn Treebank ATIS corpus segment Training: 800 sentences Top 20 parses Development: 200 sentences Test: 336 sentences Select best candidate from top 100 parses 79

Parsing Experiment Data: Penn Treebank ATIS corpus segment Training: 800 sentences Top 20 parses Development: 200 sentences Test: 336 sentences Select best candidate from top 100 parses Classifier: Voted perceptron Kernelized like SVM, more computationally tractable 80

Parsing Experiment Data: Penn Treebank ATIS corpus segment Training: 800 sentences Top 20 parses Development: 200 sentences Test: 336 sentences Select best candidate from top 100 parses Classifier: Voted perceptron Kernelized like SVM, more computationally tractable Evaluation: 10 runs, average parse score reported 81

Experimental Results Baseline system: 74% average parse score Substantial improvement from reranking: 6% absolute 82

Summary Parsing as reranking problem Tree Kernel: Computes similarity between trees based on fragments Efficient recursive computation procedure Yields improved performance on parsing task 83