1 CS 388: Natural Language Processing: Neural Shift-Reduce Dependency Parsing
Raymond J. Mooney, University of Texas at Austin

2 Shift-Reduce Parser Deterministically builds a parse incrementally, bottom-up and left to right, without backtracking. Maintains a buffer of input words and a stack of constructed constituents. Performs a sequence of operations/actions: Shift: push the next word in the buffer onto the stack. Reduce: replace the top elements of the stack with a constituent composed of them.
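A minimal sketch of this control loop in Python (illustrative only; the try_reduce callback is a hypothetical stand-in for the grammar):

```python
# Generic shift-reduce loop (a sketch; `try_reduce` applies one grammar rule
# to the top of the stack and returns True, or returns False if none applies).
def shift_reduce_parse(words, try_reduce):
    buffer = list(words)  # remaining input, leftmost word first
    stack = []            # partially built constituents
    while buffer or len(stack) > 1:
        if try_reduce(stack):            # Reduce when a rule matches
            continue
        if not buffer:                   # impasse: no rule applies, no input left
            raise ValueError("parse failed")
        stack.append(buffer.pop(0))      # Shift the next word onto the stack
    return stack[0]
```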

3 Sample Parse of “Bob eats pasta”
Buffer: [Bob, eats, pasta]
Stack: []

4 Sample Parse of “Bob eats pasta”
Action: Shift
Buffer: [eats, pasta]
Stack: [Bob]

5 Sample Parse of “Bob eats pasta”
Action: Reduce(Bob → NP)
Buffer: [eats, pasta]
Stack: [(NP Bob)]

6 Sample Parse of “Bob eats pasta”
Action: Shift
Buffer: [pasta]
Stack: [eats, (NP Bob)]

7 Sample Parse of “Bob eats pasta”
Action: Reduce(eats → VB)
Buffer: [pasta]
Stack: [(VB eats), (NP Bob)]

8 Sample Parse of “Bob eats pasta”
Action: Shift
Buffer: []
Stack: [pasta, (VB eats), (NP Bob)]

9 Sample Parse of “Bob eats pasta”
Action: Reduce(pasta → NP)
Buffer: []
Stack: [(NP pasta), (VB eats), (NP Bob)]

10 Sample Parse of “Bob eats pasta”
Action: Reduce(VB NP → VP)
Buffer: []
Stack: [(VP (VB eats) (NP pasta)), (NP Bob)]

11 Sample Parse of “Bob eats pasta”
Action: Reduce(NP VP → S)
Buffer: []
Stack: [(S (NP Bob) (VP (VB eats) (NP pasta)))]
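As an illustration, the toy rule set below reproduces this derivation using the shift_reduce_parse sketch from slide 2 (the rules and the tuple representation are hypothetical, chosen only to match the trace):

```python
# Toy rules that reproduce the trace above (hypothetical representation:
# a constituent is a (label, children) tuple; a bare string is a word).
RULES = {("Bob",): "NP", ("eats",): "VB", ("pasta",): "NP",
         ("VB", "NP"): "VP", ("NP", "VP"): "S"}

def label(item):
    return item[0] if isinstance(item, tuple) else item

def try_reduce(stack):
    for n in (2, 1):  # try the longest rule first
        if len(stack) >= n and tuple(map(label, stack[-n:])) in RULES:
            children = stack[-n:]
            del stack[-n:]
            stack.append((RULES[tuple(map(label, children))], children))
            return True
    return False

print(shift_reduce_parse(["Bob", "eats", "pasta"], try_reduce))
# ('S', [('NP', ['Bob']), ('VP', [('VB', ['eats']), ('NP', ['pasta'])])])
```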

12 Shift-Reduce Parsing Must use “look-ahead” on the next words in the buffer to pick the correct action. Originally introduced to parse programming languages, which are deterministic context-free languages (DCFLs). Use for NLP requires heuristics to pick an action at each step, which (due to ambiguity) could be wrong, resulting in a “garden path.” Can back up when an impasse is reached in order to search for a parse.

13 Shift-Reduce Dependency Parser
Easily adapted to dependency parsing by using reduce operators that introduce dependency arcs. In addition to a stack and buffer, maintains the set of dependency arcs created so far.

14 Arc-Standard System (Nivre, 2004)
Buffer b = [b1, b2, …, bn]
Stack s = [s1, s2, …, sm]
Arcs A = {label(wi, wj), …}
Configuration c = (s, b, A)
Initial config: ([ROOT], [w1, w2, …, wn], {})
Final config: ([ROOT], [], {label(wi, wj), …})

15 Arc-Standard Actions
Shift: move the next buffer word b1 onto the stack.
LeftArc(label): add the arc label(s1, s2), making s2 a dependent of the stack top s1, and pop s2 off the stack (precondition: s2 is not ROOT).
RightArc(label): add the arc label(s2, s1), making the stack top s1 a dependent of s2, and pop s1 off the stack.
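A minimal sketch of these three transitions in Python (assumed representation: the stack top is stack[-1], and arcs are (label, head, dependent) triples):

```python
# Arc-standard transitions (a sketch; stack top is stack[-1], arcs are
# (label, head, dependent) triples).
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))      # move b1 from the buffer onto the stack

def left_arc(stack, buffer, arcs, label):
    s1, s2 = stack[-1], stack[-2]
    arcs.add((label, s1, s2))        # s2 becomes a dependent of s1
    del stack[-2]                    # pop s2

def right_arc(stack, buffer, arcs, label):
    s1, s2 = stack[-1], stack[-2]
    arcs.add((label, s2, s1))        # s1 becomes a dependent of s2
    stack.pop()                      # pop s1
```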

16 Sample Parse of “He has good control”
Buffer: [He, has, good, control]
Stack (top first): [ROOT]
Arcs: {}

17 Sample Parse of “He has good control”
Action: Shift
Buffer: [has, good, control]
Stack: [He, ROOT]
Arcs: {}

18 Sample Parse of “He has good control”
Action: Shift
Buffer: [good, control]
Stack: [has, He, ROOT]
Arcs: {}

19 Sample Parse of “He has good control”
Action: LeftArc(nsubj)
Buffer: [good, control]
Stack: [has, ROOT]
Arcs: {nsubj(has, He)}

20 Sample Parse of “He has good control”
Action: Shift
Buffer: [control]
Stack: [good, has, ROOT]
Arcs: {nsubj(has, He)}

21 Sample Parse of “He has good control”
Action: Shift
Buffer: []
Stack: [control, good, has, ROOT]
Arcs: {nsubj(has, He)}

22 Sample Parse of “He has good control”
Action: LeftArc(amod)
Buffer: []
Stack: [control, has, ROOT]
Arcs: {nsubj(has, He), amod(control, good)}

23 Sample Parse of “He has good control”
Action: RightArc(dobj)
Buffer: []
Stack: [has, ROOT]
Arcs: {nsubj(has, He), amod(control, good), dobj(has, control)}

24 Sample Parse of “He has good control”
Action: RightArc(root)
Buffer: []
Stack: [ROOT]
Arcs: {nsubj(has, He), amod(control, good), dobj(has, control), root(ROOT, has)}
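Replaying this trace with the transition sketch from slide 15:

```python
stack, buffer, arcs = ["ROOT"], ["He", "has", "good", "control"], set()
shift(stack, buffer, arcs)              # push He
shift(stack, buffer, arcs)              # push has
left_arc(stack, buffer, arcs, "nsubj")  # nsubj(has, He)
shift(stack, buffer, arcs)              # push good
shift(stack, buffer, arcs)              # push control
left_arc(stack, buffer, arcs, "amod")   # amod(control, good)
right_arc(stack, buffer, arcs, "dobj")  # dobj(has, control)
right_arc(stack, buffer, arcs, "root")  # root(ROOT, has)
print(stack, arcs)                      # ['ROOT'] and the four arcs above
```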

25 Stanford Neural Dependency Parser (Chen and Manning, 2014)
Train a neural net to choose the best shift-reduce parser action to take at each step. Uses features (words, POS tags, arc labels) extracted from the current stack, buffer, and arcs as context. History (through the citation trail): Neural shift-reduce parser (Mayberry & Miikkulainen, 1999); Decision-tree shift-reduce parser (Hermjakob & Mooney, 1997); Simple learned shift-reduce parser (Simmons & Yu, 1992)

26 Neural Architecture
[Figure: feed-forward network for parse-action classification]

27 Context Features Used (lc = left-child, rc = right-child)
The top 3 words on the stack and buffer: s1, s2, s3; b1, b2, b3. The first and second leftmost/rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), for i = 1, 2. The leftmost-of-leftmost and rightmost-of-rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), for i = 1, 2. Also include the POS tag and parent arc label (where available) for these same items; a sketch of assembling these features appears below.
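A sketch of the resulting 48-element feature list (18 word + 18 POS + 12 arc-label features, as in Chen and Manning); the configuration accessors used here (stack_item, buffer_item, lc, rc, word, pos, dep_label) are hypothetical helpers assumed to return a padding value for empty slots:

```python
# Sketch of the 48-element feature list: 18 word features + 18 POS features
# + 12 arc-label features. `c` is a parser configuration; its accessors are
# hypothetical and assumed to return None for missing positions.
def extract_features(c):
    # 6 core positions: the top 3 items of the stack and of the buffer
    core = [c.stack_item(i) for i in (1, 2, 3)] + \
           [c.buffer_item(i) for i in (1, 2, 3)]
    # 12 child positions of s1 and s2
    kids = []
    for i in (1, 2):
        s = c.stack_item(i)
        kids += [c.lc(s, 1), c.rc(s, 1), c.lc(s, 2), c.rc(s, 2),
                 c.lc(c.lc(s, 1), 1), c.rc(c.rc(s, 1), 1)]
    words = [c.word(n) for n in core + kids]         # 18 word features
    tags = [c.pos(n) for n in core + kids]           # 18 POS features
    labels = [c.dep_label(n) for n in kids]          # 12 arc-label features
    return words + tags + labels
```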

28 Input Embeddings Instead of using one-hot input encodings, words and POS tags are “embedded” as 50-dimensional vectors of input features. Embedding POS tags is unusual since there are relatively few of them; however, it allows similar tags (e.g., NN and NNS) to have similar embeddings and thereby behave similarly.
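A sketch of the embedding lookup under these assumptions (one shared matrix is shown for brevity; the actual model uses separate embedding matrices for words, POS tags, and arc labels, and V is an assumed vocabulary size):

```python
import numpy as np

# Sketch: look up a 50-d embedding for each of the 48 features and concatenate.
d, V = 50, 10000                       # embedding size; assumed vocabulary size
E = np.random.randn(V, d) * 0.01       # embedding matrix, learned in training

def embed(feature_ids):                # feature_ids: 48 integer indices into E
    return E[feature_ids].reshape(-1)  # -> a 48 * 50 = 2400-dimensional input
```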

29 Cube Activation Function
An alternative non-linearity, g(x) = x³, used in place of the usual sigmoid or tanh activation functions. Allows the hidden layer to model product terms xixjxk of any three different input elements. Based on previous empirical results, capturing interactions of three elements seems important for shift-reduce dependency parsing.
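A sketch of the resulting feed-forward scorer with the cube activation (shapes and names are illustrative, not the paper's exact parameterization):

```python
import numpy as np

# Sketch of the feed-forward scorer with the cube activation g(x) = x**3.
def score_actions(x, W1, b1, W2):
    h = (W1 @ x + b1) ** 3               # cubing mixes triples x_i * x_j * x_k
    z = W2 @ h                           # one logit per parser action
    z = z - z.max()                      # stabilize the softmax
    return np.exp(z) / np.exp(z).sum()   # probability of each action
```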

30 Training Data Automatically construct dependency parses from treebank phrase-structure parse trees. Compute the correct sequence of “oracle” shift-reduce parse actions (transitions ti) at each step from the gold-standard parse trees. Determine the correct action sequence by using a “shortest stack” oracle, which always prefers LeftArc over Shift (sketched below).
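A sketch of such an oracle for the arc-standard system, assuming hypothetical helpers that expose the gold tree:

```python
# "Shortest stack" oracle for the arc-standard system (a sketch).
# gold[w] = (parent, label) from the gold tree; done(w) is True once all of
# w's gold dependents are attached. Both are hypothetical helpers.
def oracle_action(stack, buffer, gold, done):
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]
        if s2 in gold and gold[s2][0] == s1:               # LeftArc whenever legal
            return ("LeftArc", gold[s2][1])
        if s1 in gold and gold[s1][0] == s2 and done(s1):  # RightArc once s1 is finished
            return ("RightArc", gold[s1][1])
    return ("Shift", None)
```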

31 Training Algorithm The training objective is to minimize the cross-entropy loss plus an L2-regularization term: L(θ) = −Σi log p(ti) + (λ/2)‖θ‖². Initialize word embeddings to precomputed values such as Word2Vec. Use AdaGrad with dropout to compute model parameters that approximately minimize this objective.

32 Evaluation Metrics for Dependency Parsing
Unlabeled Attachment Score (UAS): the % of tokens for which the system predicts the correct parent. Labeled Attachment Score (LAS): the % of tokens for which the system predicts the correct parent with the correct arc label.
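A minimal sketch of both metrics, assuming each token's parse is a (head, label) pair:

```python
# Sketch: UAS and LAS over one corpus; pred and gold are parallel lists of
# (head, label) pairs, one per token.
def attachment_scores(pred, gold):
    n = len(gold)
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / n  # correct head
    las = sum(p == g for p, g in zip(pred, gold)) / n        # correct head + label
    return uas, las
```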

33 Sample Results on Penn WSJ Treebank

34 Conclusions Shift-reduce parsing is an efficient and effective alternative to standard PCFG parsing, and is particularly effective for dependency parsing. It models the deterministic, left-to-right processing that seems to characterize human parsing (and is therefore subject to garden paths). Neural methods for selecting parse actions give state-of-the-art results.

