1
Dependency Parsing: Past, Present, and Future
Good afternoon, everyone. I am Zhenghua Li. We are from Soochow University, China. Thank you very much for attending our tutorial, and we hope our talk is helpful. If you have any questions, please do not hesitate to discuss them with us. Wenliang Chen, Zhenghua Li, Min Zhang {wlchen, zhli13, Soochow University, China
2
Recent Events of Dependency Parsing
Shared tasks CoNLL 2006/2007 shared tasks on multilingual dependency parsing CoNLL 2008/2009 shared tasks on joint parsing of syntactic and semantic dependencies SANCL 2012 shared task organized by Google (parsing the web) SemEval 2014/2015: broad-coverage semantic dependency parsing (SDP) Dependency parsing has made tremendous progress during the past decade. Since 2006, researchers have organized several parsing-related shared tasks. Training and evaluation data are provided for many languages, which makes it easier to study and compare different parsing methods across languages. The CoNLL 2006 and 2007 shared tasks focus on multilingual dependency parsing; CoNLL 2008 and 2009 focus on joint syntactic and semantic parsing, although the results show that the top-scoring systems mostly adopt a pipelined framework. In 2012, Petrov and McDonald organized a shared task on web parsing, trying to find a solution for parsing non-canonical texts from the web. The 2014/2015 SemEval tasks focus on broad-coverage semantic dependency parsing.
3
Recent Events of Dependency Parsing
Tutorials COLING-ACL06: “Dependency Parsing” by Joakim Nivre and Sandra Kübler NAACL10: “Recent Advances in Dependency Parsing” by Qin Iris Wang and Yue Zhang IJCNLP13: Ours EACL14: “Recent Advances in Dependency Parsing” by Ryan McDonald and Joakim Nivre ACL14: “Syntactic Processing Using Global Discriminative Learning and Beam-Search Decoding” by Yue Zhang, Meishan Zhang, and Ting Liu Books “Dependency Parsing” by Sandra Kübler, Joakim Nivre, and Ryan McDonald, 2009 You can also refer to several previous tutorials and the excellent book by Kübler, Nivre, and McDonald. Given that EACL, ACL, and COLING all held tutorials on dependency parsing this year, we can see that dependency parsing is still an important and active topic in the NLP field.
4
Outline Part A: dependency parsing and supervised approaches
Part B: semi-supervised dependency parsing Part C: Parsing the web and domain adaptation Part D: Multilingual dependency parsing Part E: Conclusion and open problems Our tutorial is composed of five parts. In Part A, we introduce the fundamentals, basic concepts, and supervised approaches to dependency parsing. In Part B, we discuss a few techniques for semi-supervised dependency parsing. In Part C, we discuss recent advances in parsing web data and the domain adaptation problem. In Part D, we survey recent work on multilingual dependency parsing. Finally, we conclude and discuss some open problems in Part E.
5
Part A: Dependency Parsing and Supervised approaches
Here is Part A.
6
Dependency tree A dependency tree is a tree structure composed of the input words and meets a few constraints: Single-head Connected Acyclic Dependency grammar represents syntax using dependency tree structures. Then what is a dependency tree? Given a sentence, for example, "I ate the fish with a fork", a dependency tree is a tree structure composed of the input words. In other words, the input words are nodes in the tree, while the syntactic dependencies are represented by directed arcs. The label on each arc shows the syntactic relation of the dependency. $ is a pseudo word that points to the sentence root. Formally, three constraints are required. Single-head means that each word has one and only one head. Connected: from the pseudo word $0, following the directed arcs, you can reach all the words in the sentence. Acyclic: ignoring the arc directions, there is no cycle. Actually, the connected and acyclic constraints are partly redundant: single-head and connected together imply acyclic, and similarly, single-head and acyclic together imply connected. Moreover, there are two types of dependency trees, namely projective and non-projective trees. A projective tree means that you can draw the tree without crossing arcs. Here "projection" means that the input words are kept in their sequential order and all arcs are drawn on the same side. In contrast, a non-projective tree means that it is impossible to draw the tree without crossing arcs. The goal of dependency parsing is to build a dependency tree reflecting the syntactic structure of the sentence. (Note: add a new tree structure example, the kind with "ate" as the root.) I1 ate2 the3 fish4 with5 a6 fork7 root $0 subj pmod obj det
7
Projective dependency trees
Informally, “projective” means the tree does not contain any crossing arcs. There are two kinds of dependency trees, depending on whether there exist crossing arcs in the tree. If a tree does not contain crossing arcs, it is called a projective dependency tree. For most English sentences, the syntax can be captured by projective trees. Current research mainly focuses on projective dependency parsing, where the outputs of the parsers are projective trees. I1 ate2 the3 fish4 with5 a6 fork7 root $0 subj pmod obj det
8
Non-projective dependency trees
A non-projective dependency tree contains crossing arcs. A hearing is scheduled on the issue $ today However, for some English sentences, it is better to use non-projective trees to represent their syntax. Here is an example. A non-projective tree means that the tree contains crossing arcs. As we can see, the arc from “hearing” to “on” crosses two other arcs. For some languages, such as English and Chinese, the proportion of non-projective phenomena is low. In contrast, some highly inflected languages such as Czech, Dutch, and Turkish frequently require non-projective structures to capture the correct syntax of the sentences. Later, we also briefly introduce some work on non-projective dependency parsing. Example from “Dependency Parsing” by Kübler, Nivre, and McDonald, 2009
9
Dependency Tree The basic unit is a link (dependency, arc) from the head to the modifier. Modifier (dependent, child) Head (governor, parent) Label (relation type) eat2 fish4 The basic unit of a dependency tree is a single dependency, represented by a link from the head to the modifier. Here is an example. “eat” is the head, also called governor or parent; “fish” is the modifier, also called dependent or child; “obj”, short for object, is the label on the link, representing the syntactic relation of this dependency. obj
10
Dependency Tree A bilingual example I1 eat2 the3 fish4 with5 a6 fork7
root $0 subj pmod obj det 我1 用2 叉子3 吃4 鱼5 vv This slide shows an English-Chinese bilingual example. We show the dependency trees for both sides and mark the word alignments with dashed lines. We can see that although the word order differs between English and Chinese, the syntactic dependencies remain the same. (Note: add examples of constituent parses; the advantages of dependency grammar?)
11
Evaluation Metrics Unlabeled attachment score (UAS)
The percent of words that have the correct heads Labeled attachment score (LAS) The percent of words that have the correct heads and labels. Root Accuracy (RA) Complete Match rate (CM) Here, we introduce the standard evaluation metrics for dependency parsing. As we know, in a dependency tree, each word has one and only one head. Therefore, we can measure the performance of a dependency parser by counting the percentage of words that have the correct head. The most widely used metric is the unlabeled attachment score, abbreviated as UAS, which is the percent of words that have the correct heads. Labeled attachment score, abbreviated as LAS, is the percent of words that have the correct heads and syntactic labels. In other words, it is required that the parser not only assigns the correct head to a word, but also assigns the correct label to the dependency from the head to the word. The remaining two are sentence-level metrics. Root accuracy, usually abbreviated as RA, is the percent of sentences that have the correct root. Complete match rate, abbreviated as CM, is the percent of sentences that have completely correct parse trees. I1 ate2 the3 fish4 with5 a6 fork7 root $0 subj pmod obj det
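To make the word-level metrics concrete, here is a minimal sketch of how UAS and LAS can be computed; the data format, labels, and the example trees below are hypothetical and only for illustration (real evaluation scripts often also exclude punctuation):

```python
def attachment_scores(gold, pred):
    """Compute UAS and LAS over a list of sentences (a sketch).

    Each sentence is a list of (head_index, label) tuples, one per word.
    """
    total = correct_head = correct_both = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:          # correct head -> counts for UAS
                correct_head += 1
                if g_label == p_label:    # correct head and label -> counts for LAS
                    correct_both += 1
    return correct_head / total, correct_both / total  # (UAS, LAS)

# Hypothetical example: "He does it here"; heads are 1-based, 0 is the pseudo word $0
gold = [[(2, "subj"), (0, "root"), (2, "obj"), (2, "adv")]]
pred = [[(2, "subj"), (0, "root"), (2, "obj"), (3, "adv")]]
print(attachment_scores(gold, pred))      # (0.75, 0.75): one of four heads is wrong
```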
12
Formalism of Dependency Parsing
He1 does2 it3 here4 $0 Formally, given an input sentence X = x1 x2 … xn, the goal of dependency parsing is to find the best-scoring (optimal) tree from a huge search space φ(X), which contains all plausible trees over X. A dependency tree is a set of arcs; each arc is denoted as (h, m), meaning an arc from the head xh to the modifier xm, with syntactic relation l. X: the input sentence; (h, m): a link from the head xh to the modifier xm; Y: a candidate tree; φ(X): the set of all possible dependency trees over X
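In LaTeX notation, a sketch of the formulation just described (the exact symbols on the original slide may differ slightly):

```latex
X = x_1 x_2 \dots x_n, \qquad
Y^{*} = \operatorname*{argmax}_{Y \in \phi(X)} \mathrm{Score}(X, Y), \qquad
Y = \{ (h, m, l) \;:\; x_h \xrightarrow{\;l\;} x_m \}
```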
13
Supervised Approaches for Dependency Parsing
Graph-based Transition-based Hybrid (ensemble) Other methods During the past decade, a few important approaches have been proposed and developed for dependency parsing. Among them, the graph-based and transition-based methods are the two mainstream approaches. Researchers have also proposed hybrid methods to combine the advantages of the two. We will also briefly introduce two other methods which have recently been applied to dependency parsing.
14
Graph-based Dependency Parsing
Find the highest scoring tree from a complete dependency graph. He1 does2 it3 here4 $0 He1 does2 it3 here4 $0 The graph-based parsing method models the problem as finding the highest scoring tree from a complete dependency graph. The left figure shows the complete dependency graph for the input sentence “he does it here”. In the graph, every word pair has a directed link. The goal of the dependency parser is to search this graph and produce a legal dependency tree with the highest score.
15
Two problems The search problem The learning problem
Given the score of each link, how to find Y*? The learning problem Given an input sentence, how to determine the score of each link? w ∙ f How to learn w using a treebank (supervised learning)? For the graph-based method, there are two problems. The first one is the search problem: given the score of each link, how do we find the highest scoring tree in the complete graph? The second one is the learning problem: given an input sentence, how do we determine the score of each link? Under a feature-based representation, a link is represented by a feature vector, and the score of the link is the dot product of the weight vector and the feature vector. The learning problem is then how to learn the feature weight vector from an annotated treebank. Now let's discuss the two problems. He1 does2 it3 here4 $0 He1 does2 it3 here4 $0
16
First-order as an example
The first-order graph-based method assumes that the dependencies in a tree are independent of each other (arc-factorization) Before discussing the search and learning problems, we would like to introduce the basic graph-based method, which is called the first-order model. The first-order graph-based model makes a strong independence assumption: the arcs in a tree are independent of each other. This method is also called the arc-factorization method. In other words, the score of a dependency is not affected by other dependencies. Therefore, the score of a tree is the sum of the scores of all the dependencies in the tree. He1 does2 it3 here4 $0
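Under this assumption, the tree score decomposes over single arcs; a sketch in LaTeX notation:

```latex
\mathrm{Score}(X, Y) \;=\; \sum_{(h,m) \in Y} \mathrm{score}(h, m)
                     \;=\; \sum_{(h,m) \in Y} \mathbf{w} \cdot \mathbf{f}(X, h, m)
```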
17
Search problem for first-order model
Eisner (2000) described a dynamic programming based decoding algorithm for bilexical grammars. McDonald+ (2005) applied this algorithm to the search problem of the first-order model. Time complexity O(n^3)
18
Eisner algorithm data structure
Basic data structure Incomplete spans does2 here4 does2 it3 $0 does2 The basic data structure of the Eisner algorithm is the span. A span represents a partial tree structure over a range of input words, where the leftmost or rightmost boundary word is the root. There are two kinds of spans. The first kind is called incomplete spans. Here is an incomplete span for the parse tree in the bottom right. “does” is the root of this partial tree, and there is a link from “does” to “here”. This span indicates that all the words between “does” and “here” have got their heads. “Incomplete” means that the other boundary word, “here”, may still take more children; in other words, the children of “here” are incomplete, and other children can be attached in future operations. Similarly, we can find other incomplete spans for the parse tree. However, this incomplete span is illegal, since there is no link from “he” to “it” in the parse tree. He1 does2 He1 does2 it3 here4 $0 He1 it3
19
Eisner algorithm data structure
Basic data structure Complete spans does2 here4 does2 it3 $0 here4 The other kind is called complete spans. Here is a complete span. “does” is the root of the span. This span indicates that all the words in the span, except the root word “does”, have got their heads and children. “Complete” means that the other boundary word, “here”, has got all its children. Similarly, we can get other complete spans for the parse tree in the bottom right. However, the last span is illegal, since “does” has other children outside the span. He1 does2 He1 does2 it3 here4 $0 $0 does2
20
Eisner algorithm operations
Basic Operations s t + s r r+1 t s t Then, we introduce the basic operations of the Eisner algorithm. The Eisner algorithm works in a bottom-up fashion. Each operation combines two adjacent spans into a larger span. Since there are many ways to build a span, the highest scoring combination is chosen and stored in the corresponding span. There are four basic operations in the Eisner algorithm. In the first operation, the left span contains the right-side descendants of s and the right span contains the left-side descendants of t. These two spans are combined into incomplete spans, with either s being the head of t, or s being the child of t. In the next operation, the left span contains the right-side descendants of s and the left-side descendants of r, while the right span contains the right-side descendants of r. These two spans are combined into a complete span, which contains all the descendants of r. The last operation is similar to the above one; the difference is that the direction is opposite. s r + s t r t r t s r + s t
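To make the dynamic program concrete, below is a minimal sketch of first-order (arc-factored, unlabeled) Eisner decoding in Python. It is an illustrative reconstruction, not the authors' implementation: the score table, the array layout, and the fact that the single-root constraint is not enforced are all simplifying assumptions.

```python
import numpy as np

def eisner_decode(score):
    """First-order projective decoding (a sketch), O(n^3).

    score[h, m] is the score of an arc from head h to modifier m;
    word 0 is the pseudo-root $0. Returns the head of each word 1..n.
    """
    n = score.shape[0]
    # C: complete spans, I: incomplete spans; direction 1 = head at the
    # left boundary s, direction 0 = head at the right boundary t.
    C = np.zeros((n, n, 2)); I = np.zeros((n, n, 2))
    C_bp = np.zeros((n, n, 2), dtype=int); I_bp = np.zeros((n, n, 2), dtype=int)

    for width in range(1, n):
        for s in range(n - width):
            t = s + width
            # Incomplete spans: join two complete spans and add one new arc.
            splits = [C[s, r, 1] + C[r + 1, t, 0] for r in range(s, t)]
            r = s + int(np.argmax(splits))
            I[s, t, 1] = max(splits) + score[s, t]; I_bp[s, t, 1] = r   # arc s -> t
            I[s, t, 0] = max(splits) + score[t, s]; I_bp[s, t, 0] = r   # arc t -> s
            # Complete spans: extend an incomplete span with a complete one.
            left = [C[s, r, 0] + I[r, t, 0] for r in range(s, t)]
            C[s, t, 0] = max(left); C_bp[s, t, 0] = s + int(np.argmax(left))
            right = [I[s, r, 1] + C[r, t, 1] for r in range(s + 1, t + 1)]
            C[s, t, 1] = max(right); C_bp[s, t, 1] = s + 1 + int(np.argmax(right))

    heads = [0] * n
    def backtrack(s, t, d, complete):
        if s == t:
            return
        if not complete:                       # incomplete span: record its arc
            heads[t if d == 1 else s] = s if d == 1 else t
            r = I_bp[s, t, d]
            backtrack(s, r, 1, True); backtrack(r + 1, t, 0, True)
        elif d == 1:                           # C[s,t,->] = I[s,r,->] + C[r,t,->]
            r = C_bp[s, t, d]
            backtrack(s, r, 1, False); backtrack(r, t, 1, True)
        else:                                  # C[s,t,<-] = C[s,r,<-] + I[r,t,<-]
            r = C_bp[s, t, d]
            backtrack(s, r, 0, True); backtrack(r, t, 0, False)

    backtrack(0, n - 1, 1, True)               # best parse: the complete span from $0 to the last word
    return heads[1:]
```

For the running example "he does it here", the score table would be filled with the arc scores shown on the following slides, and eisner_decode(score) would return the head of each word.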
21
Eisner algorithm Initialization: complete spans (width=1) $0 He1 does2
Now we show how the Eisner algorithm works for the example sentence “he does it here”. In the initialization step, we build two complete spans for each word. The score of each span is zero. $0 He1 does2 it3 here4
22
Eisner algorithm Build incomplete span (width = 2) s=4.1 s=0.1 s=0.2
Then, we build incomplete spans containing two words. The two spans are combined into an incomplete span. Assume that the score of the link from $ to he is 0.1. Then the span score is 0.1. Here is an incomplete span from he to does. The span score is equal to the score of the link from he to does, which is 0.2. Similarly, we can build the incomplete span from does to he, with a score of 4.1. In a similar way, we build all the incomplete spans. s=4.1 s=0.1 s=0.2 s=0.1 s=0.2 s=3.2 s=0.1 $0 He1 does2 it3 here4
23
Eisner algorithm Build complete spans (width = 2) s=4.1 s=0.1 s=0.2
Then, we build the complete spans containing two words. The two spans are combined into a complete span. Since this operation does not introduce any new link, the score of the resulting span is the sum of the two component spans, which is 0.1. Here is a complete span from he to does. Then we build the complete span from does to he, with a score of 4.1. In a similar way, we build all the complete spans. s=4.1 s=0.1 s=0.2 s=0.1 s=0.2 s=3.2 s=0.1 $0 He1 does2 it3 here4 s=0.1 s=0.2
24
Eisner algorithm ? Build incomplete span (width = 3)
score(does→here) = 3.3 Then we consider building incomplete spans containing three words. Since it would be tedious to show how to create all the spans, we only take the incomplete span from “does” to “here” as an example. The question is how to build this incomplete span from the smaller spans created earlier. Here, we assume that the score of the link from “does” to “here” is 3.3. According to the operations defined in the Eisner algorithm, there are two ways to build the span. The first way is to combine the complete span for the single word “does” and the complete span from “here” to “it”. The second way is to combine the complete span from “does” to “it” and the complete span for the single word “here”. The question becomes: which way should be chosen? $0 He1 does2 it3 here4 s=0.2 s=3.2
25
Eisner algorithm Build incomplete spans (width = 3) + + s=0.2+3.3=3.5
Using the first way, the resulting span has a score of 3.5; using the second way, the score is 6.5. Therefore, the algorithm chooses the highest scoring combination to create the span. does2 it3 here4 s=3.2 + s=3.2+3.3=6.5 does2 it3 here4 score(does→here)=3.3
26
Eisner algorithm Build complete spans (width = 3)
Build incomplete spans (width = 4) … The best parse is stored in the complete span from “$” to “here”: C(0,4) In a similar manner, we build complete spans of width 3, then incomplete spans of width 4, and so on. Finally, the best parse is stored in the complete span from $ to here, denoted as C(0,4). Through backtracking, we can retrieve the parse tree. Backtracking He1 does2 it3 here4 $0
27
Eisner algorithm The search path for the correct parse tree Complete!
Here, we show the path to create the gold-standard parse tree. First, we build incomplete spans containing two words. These two incomplete spans build two dependencies. Then the incomplete spans grow into complete spans. Then we build incomplete spans containing three words, which create another two dependencies. Then we build the complete span from “does” to “here”. Then we combine the two spans and get a complete span from “$” to “here”, which covers all the words of the sentence. He1 does2 it3 here4 $0 $0 He1 does2 it3 here4
28
The learning problem score(2,4) = w∙f(2,4)
Given an input sentence, how to determine the score of each link? score(2,4) = ? He1 does2 it3 here4 $0 Then we introduce the learning problem: given an input sentence, how do we determine the score of each link? For example, what is the score of the link from “does” to “here”? Discriminative models use a feature-based representation, which means that a link is represented as a feature vector. Then, the score of the link is the dot product of the feature weight vector w and the feature vector f. Feature based representation: a link is represented as a feature vector f(2,4) score(2,4) = w∙f(2,4)
29
Features for one dependency
Example from slides of Rush and Petrov (2012) Here is an example of the feature vector for a link. We can see that state-of-the-art parsers combine lexical words, part-of-speech tags, link distance, and link direction to compose rich features.
30
How to learn w? (supervised)
Use a treebank Each sentence has a manually annotated dependency tree. Online training (Collins, 2002; Crammer and Singer, 2001; Crammer+, 2003) Initialize w = 0 Go through the treebank for a few (10) iterations. Use one instance to update the weight vector. Then the question becomes: how to learn w? Supervised methods use a treebank to learn w. A treebank contains a set of sentences, each manually annotated with its syntactic dependency structure. Online training is most widely used in current dependency parsers. In the initialization step, the algorithm sets the weight vector to zero. Then, it goes through the treebank for a few iterations, for example, 10 iterations. At each step, the online learning algorithm uses one instance to update the weight vector.
31
Online learning w Treebank Gold-standard parse Y+
He1 does2 it3 here4 $0 This slide shows how the online algorithm uses one instance to update the feature weight vector. Here is a treebank. Each time, the training algorithm selects one example from the treebank. Suppose the sentence is “he does it here”. Then the training algorithm uses the current model to parse the sentence. Suppose this is the 1-best parse returned by the model; we denote it as Y-. Here is the gold-standard parse, denoted as Y+. Then, the algorithm updates the feature weight vector using this formula. Roughly speaking, the formula raises the weights of correct features and lowers the weights of mistaken features. He1 does2 it3 here4 $0 He1 does2 it3 here4 $0 Gold-standard parse Y+ 1-best parse Y- with w(k) w(k+1) = w(k) + f(X,Y+) – f(X,Y-)
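The update just described is the structured perceptron. Below is a minimal sketch in Python, where treebank, feature_fn, and decode are assumed placeholders (feature_fn extracts the sparse feature vector f(X, Y); decode returns the 1-best tree under the current weights, e.g. an Eisner decoder); averaged perceptron, MIRA, and PA differ mainly in how the update step is computed:

```python
from collections import defaultdict

def train_online(treebank, feature_fn, decode, num_iters=10):
    """Structured perceptron training for a graph-based parser (a sketch)."""
    w = defaultdict(float)                       # initialize w = 0
    for _ in range(num_iters):                   # go through the treebank a few times
        for sentence, gold_tree in treebank:
            pred_tree = decode(sentence, w)      # 1-best parse Y- with current w(k)
            if pred_tree != gold_tree:
                # w(k+1) = w(k) + f(X, Y+) - f(X, Y-)
                for feat, val in feature_fn(sentence, gold_tree).items():
                    w[feat] += val
                for feat, val in feature_fn(sentence, pred_tree).items():
                    w[feat] -= val
    return w
```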
32
Quick summarization The search problem The learning problem
Dynamic programming based decoding algorithms (the Eisner algorithm) to find the best parse tree. The learning problem Online training algorithms to learn the feature weight vector w under the supervision of a treebank. Let's give a quick summary of the first-order graph-based method. For the search problem, we can use the dynamic programming based Eisner algorithm to find the best parse tree. For the learning problem, we can use online training algorithms to learn the feature weight vector w under the supervision of a treebank.
33
Recent advances (graph-based method)
Explore features of more non-local subtrees Second-order (McDonald & Pereira 06; Carreras 07) Third-order (Koo+, 2010) Higher-order with beam search (Zhang & McDonald 12) During the past few years, researchers have proposed more complex graph-based models. These models weaken the independence assumptions made in the first-order model, so that more non-local features can be included. Meanwhile, such higher-order models also require more complex dynamic programming based decoding algorithms to find the optimal parse trees. --- In the figure below, we show different kinds of subtrees used in current parsers. At the beginning, the parser only uses single dependencies, which is the first-order model of McDonald et al. Then, subtrees of adjacent sibling dependencies are included, which is the second-order model of McDonald and Pereira. Then, Carreras proposed a more complex second-order model that also includes grandparent-child subtrees. Koo et al. proposed the third-order model. Zhang and McDonald proposed to use beam search instead of dynamic programming for decoding; in that way, arbitrary non-local features can be included in the model.
34
Transition-based Dependency Parsing
Gradually build a tree by applying a sequence of transition actions. (Yamada and Matsumoto, 2003; Nivre, 2003) The score of the tree is equal to the sum of the scores of the actions. The other mainstream approach is the transition-based method. Transition-based parsers treat the problem as finding the highest scoring action sequence that builds a legal dependency tree. The score of a dependency tree is the sum of the scores of all actions, where a_i is the action adopted in step i, c_i denotes the partial results built so far, and the tree is the one built by the action sequence.
35
Transition-based Dependency Parsing
The goal of a transition-based dependency parser is to find the highest scoring action sequence that builds a legal tree.
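A sketch of this objective in LaTeX notation (the symbol names follow the speaker notes above; the formula on the original slide may be written slightly differently):

```latex
\mathrm{Score}(X, A) \;=\; \sum_{i} \mathrm{score}(a_i, c_i)
  \;=\; \sum_{i} \mathbf{w} \cdot \mathbf{f}(a_i, c_i), \qquad
A^{*} \;=\; \operatorname*{argmax}_{A} \mathrm{Score}(X, A)
```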
36
Two problems for Transition-based DP
The search problem Assuming the model parameters (feature weights) are known, how to find the optimal action sequence that leads to a legal tree for an input sentence? Greedy search as an example The learning problem How to learn the feature weights? Global online training Before 2008: locally trained classifiers Similarly, transition-based dependency parsing also has two problems to solve.
37
Components of Transition-based DP
A stack to store the processed words and the partial trees A queue to store the unseen input words Transition actions Gradually build a dependency tree according to the contexts (history) in the stack and the queue. A transition-based system is composed of three basic components, a stack, a queue, and an action set. The stack records the processed words and the partial structures produced so far. The queue stores the unseen input words which will be processed later in a left-to-right order. The action set defines a few actions that may be applied to the stack and the queue. The actions gradually build dependency trees according to the configurations in the stack and the queue. Stack Input queue He1 does2 $0 it3 here4 Which action should be applied?
38
An arc-eager transition-based parser
Four actions Shift Left-arc Right-arc Reduce Here we give an example of an arc-eager parser. The stack is on the left side, while the input queue is on the right side. In the stack, we can see that “he” and “does” are the processed words, and a dependency from does to he has been built. The unseen input words are in the queue. Based on the configuration of the stack and the queue, the parser decides which action should be taken. In this way, the transition-based parser builds a complete parse after a chain of actions. Let us suppose the parser makes the right decision, which is Right-arc. After applying the action, a right arc from does to it is added, and the word “it” is pushed onto the stack. Compared with graph-based parsers, one advantage of transition-based parsers is that they can explore arbitrarily rich features from the partial trees built so far in the stack. Correspondingly, the disadvantage is that they use inexact decoding algorithms, which can only explore a limited search space. Stack Input queue He1 does2 $0 it3 here4
39
An arc-eager transition-based parser
Stack Input queue $0 He1 does2 it3 here4 Here we give an example of an arc-eager parser. At the beginning, the stack only contains a pseudo word, $, and the queue contains the input sentence. Based on the current configuration, the parser decides to apply Shift, and the front word of the queue is shifted onto the stack. Shift $0 He1 does2 it3 here4
40
An arc-eager transition-based parser
$0 He1 does2 it3 here4 Then, the parser decides to take a Left-arc action, which adds a dependency from does to he, and he is popped off the stack. Left-arc $0 does2 it3 here4 He1
41
An arc-eager transition-based parser
$0 does2 it3 here4 Then, the parser takes a Right-arc action, which adds a dependency from $ to does and pushes does onto the stack. He1 Right-arc $0 does2 it3 here4 He1
42
An arc-eager transition-based parser
$0 does2 it3 here4 Then another Right-arc, adding a dependency from does to it and pushing it onto the stack. He1 Right-arc $0 does2 it3 here4 He1
43
An arc-eager transition-based parser
$0 does2 it3 here4 Since the word it has found its head and has no children, the parser takes a Reduce action, which pops it off the stack. He1 Reduce $0 does2 here4 He1 it3
44
An arc-eager transition-based parser
$0 does2 here4 Then, the parser takes a Right-arc action, which adds a dependency from does to here. He1 it3 Right-arc $0 does2 here4 He1 it3
45
An arc-eager transition-based parser
$0 does2 here4 Then, a Reduce action pops here off the stack. He1 it3 Reduce $0 does2 He1 it3 here4
46
An arc-eager transition-based parser
$0 does2 Then, a Reduce action pops does off the stack. Now, the stack only contains the pseudo word $, and the queue is empty. Therefore, the process is complete. He1 it3 here4 Reduce $0 does2 Complete! He1 it3 here4
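The whole walkthrough can be summarized in a few lines of code. Below is a minimal, unlabeled sketch of the arc-eager transition system; choose_action is an assumed placeholder (an oracle or a trained classifier), and the action preconditions and dependency labels are omitted for brevity:

```python
def arc_eager_parse(n, choose_action):
    """Parse a sentence of n words with an arc-eager system (a sketch).

    Words are indexed 1..n; index 0 is the pseudo word $0.
    choose_action(stack, queue, heads) returns one of
    "SHIFT", "LEFT-ARC", "RIGHT-ARC", "REDUCE".
    """
    stack, queue, heads = [0], list(range(1, n + 1)), {}
    while queue or len(stack) > 1:
        action = choose_action(stack, queue, heads)
        if action == "SHIFT":            # push the front of the queue onto the stack
            stack.append(queue.pop(0))
        elif action == "LEFT-ARC":       # arc: queue front -> stack top; pop the stack
            heads[stack.pop()] = queue[0]
        elif action == "RIGHT-ARC":      # arc: stack top -> queue front; shift it
            heads[queue[0]] = stack[-1]
            stack.append(queue.pop(0))
        elif action == "REDUCE":         # stack top already has a head; pop it
            assert stack[-1] in heads
            stack.pop()
    return heads                         # heads[m] is the head of word m
```

For the example sentence above (n = 4), the action sequence Shift, Left-arc, Right-arc, Right-arc, Reduce, Right-arc, Reduce, Reduce produces heads = {1: 2, 2: 0, 3: 2, 4: 2}, matching the tree built in the walkthrough.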
47
Recent advances (transition-based)
Explore richer features using the partially built structures (Zhang and Nivre, 2011). Enlarge the search space during decoding Beam search (Duan+, 2007; Zhang and Nivre, 2011) Beam search with state merging via dynamic programming (Huang and Sagae, 2010) Dynamic programming based decoding (Kuhlmann+, 2011) Better online learning Dynamic oracles (Goldberg and Nivre, 2014) Recently, researchers have made efforts in three directions to improve transition-based parsing. The first is to explore richer features based on partially built trees. The second is to enlarge the search space during decoding. At the beginning, greedy search was adopted and the decoder always kept only the top-scoring hypothesis, so the search space was quite limited. Then, researchers proposed to use beam search to explore more hypotheses. After that, Huang and Sagae combined beam search with state merging to further enlarge the search space. Kuhlmann et al. proposed a dynamic programming based decoding algorithm for transition-based parsers. (time complexity?) The third direction is to use dynamic oracles during online learning, which covers a line of work by Yoav Goldberg et al. (Note: dynamic oracle action sequence?)
48
Other methods Easy-first non-directional dependency parsing (Goldberg and Elhadad, 2010) Iteratively select easiest (highest-scoring) pair of neighbors to build a dependency. Constituent-based dependency parsing (Sun and Wan, 2013) Method 1: convert the outputs of a constituent parser into dependency structures. Method 2: convert dependency trees into context- free-grammar structures. Recently, Goldberg and Elhadad propose a novel easy-first framework for dependency parsing. The main idea is to iteratively select the highest-scoring pair of neighbors to form a dependency at each step until a complete tree is reached. Another avenue is to use constituent-based parsing models to do dependency parsing. There are two choices. The first one trains a constituent parser using phrase-structure data. For evaluation, it transform the outputs of the constituent parser into dependency trees using hand-crafted head-finding rules. The second method transforms the dependency structures into pseudo phrase structures and then train a constituent parser using the pseudo phrase-structure treebank. For evaluation, the results of the pseudo constituent parser can be directly converted into dependency trees. ---- Similar to transition-based parsing, the easy-first method can capture arbitrary features from previously built subtrees at both the left and right sides. Also, it is very efficient with time complexity of O(nlogn) references
49
Ensemble methods Different dependency parsers have different advantages. The graph-based MSTParser performs better on long- distance dependencies. The transition-based MaltParser performs better on short- distance dependencies. Ensemble methods try to leverage the complementary strengths of different parsing approaches. Re-parsing (Sagae and Lavie, 2006; Surdeanu and Manning, 2010) Stacking (McDonald and Nivre, 2011; Martins+, 2008) Ensemble methods try to combine the advantages of different parsing models. For example, results show that the graph-based MSTParser get higher accuracy on long dependencies, while transition-based Maltparser works better on short dependencies. We would like to introduce two strategies for model integration.
50
Ensemble via re-parsing
Sagae and Lavie (2006); Surdeanu and Manning (2010) Separately train M different parsers For a test sentence, the M parsers produce M parses. Combine the M parses to build a partial dependency graph. Reparse to find the best result from the dependency graph using the standard MST parsing algorithm. The re-parsing method works in the following way. First, separately train M different parsers (for example, one MSTParser and two MaltParsers with different parameter settings). For a test sentence, the M parsers produce M parses. Combine the M parses to build a partial dependency graph. Finally, reparse the graph to find the best result using the standard Eisner algorithm.
51
Ensemble via re-parsing
Sagae and Lavie (2006); Surdeanu and Manning (2010) In the re-parsing method, first, the outputs of several different parsers are combined. The dependency graph in the bottom left includes only the dependencies proposed by the individual parsers. The dependencies of the individual parsers are equally weighted, and the arc weights are the occurrence counts. Then, the standard Eisner algorithm is used to find the best parse. ----- Surdeanu and Manning also tried other, more complex weighting strategies but did not see any improvements. (Pronunciation notes: Sagae "sa-gee"; Surdeanu "sur-de-ya-nu".) Example from the slides of Wang and Zhang (2010)
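A minimal sketch of re-parsing with equal arc weighting; the tree format and the decode function are assumptions (decode stands for any MST or Eisner-style decoder over an arc-score table, such as the eisner_decode sketch shown earlier):

```python
from collections import Counter

def reparse_ensemble(parses, n, decode):
    """Combine M predicted trees by re-parsing (a sketch).

    parses: list of M trees, each a dict {modifier: head} over words 1..n
            (0 is the pseudo word $0).
    decode: finds the best tree given an (n+1) x (n+1) arc-score table.
    """
    votes = Counter()
    for tree in parses:
        for m, h in tree.items():
            votes[(h, m)] += 1       # how many base parsers propose arc h -> m
    # Equal weighting: the score of an arc is simply its vote count;
    # arcs proposed by no parser get score 0.
    score = [[votes[(h, m)] for m in range(n + 1)] for h in range(n + 1)]
    return decode(score)
```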
52
Ensemble via stacking Nivre and McDonald (2008); Martins+ (2008)
Combine the graph-based and transition-based parsers. Use one parser to guide or help the other one. Train the level-1 parser first (Parser1) Let the level-2 parser (Parser2) consult Parser1 during both training and test. Two directions MSTmalt (MaltParser for level-1; MSTParser as level-2) MaltMST (the reverse) The second method for parser ensemble is called stacking. Nivre and McDonald (2008) and Martins+ (2008) tried to combine graph-based and transition-based parsers using stacking. The main idea is to use one parser to guide the other one during both training and evaluation. First, train the level-1 parser; we call it Parser1. Then, train the level-2 parser, Parser2, which is made to consult Parser1 during both training and test. There are two directions: the MSTParser guided by the MaltParser, or the MaltParser guided by the MSTParser.
53
Non-projective dependency parsing
Pseudo-projective (Nivre and Nilsson, 2005) Graph-based methods (McDonald+, 2005, 2006; Pitler, 2014) Transition-based methods (Nivre, 2009) Then, we talk about non-projective dependency parsing. Compared with projective dependency parsing, non-projective parsing has attracted less attention. There are three major directions. A hearing is scheduled on the issue $ today Example from “Dependency Parsing” by Kübler, Nivre, and McDonald, 2009
54
Non-projective dependency parsing
Non-projectivity in natural languages Table from the invited talk by Prof. Joakim Nivre: Beyond MaltParser - Advances in Transition-Based Dependency Parsing
55
Non-projective dependency parsing
Pseudo-projective (Nivre and Nilsson, 2005) Pre-processing Convert non-projective trees into projective ones, with dependency labels encoding the transformation. Projective dependency parsing for both training and test. Post-processing Recover the non-projective trees from the projective outputs with the help of the labels. w1 w2 w3 $0 a b c|b w1 w2 w3 $0 a b c Nivre and Nilsson proposed the pseudo-projective method, which is composed of three steps: first, convert non-projective trees into projective ones with labels encoding the transformation; second, apply projective dependency parsing for both training and test; third, recover the non-projective trees from the labels. Here is an example. The left side is a non-projective tree, which is converted into a projective tree using some rules. We can see that the head of w1 is originally w3 and is changed to w2 after the conversion. The dependency relation of w1 becomes “c|b”, which encodes that the dependency relation of the true head of w1 is b. In this way, the non-projective tree can be recovered.
56
Non-projective dependency parsing
Graph-based methods First-order (McDonald+, 2005) Chu-Liu-Edmonds decoding algorithm, O(n^2) Second-order (McDonald and Pereira, 2006) Greedy approximate decoding, O(n^3) Third-order (Pitler, 2014) Dynamic programming decoding, O(n^4) McDonald+ proposed non-projective parsing models under the graph-based framework. For the first-order model, they use the Chu-Liu-Edmonds maximum spanning tree algorithm, with time complexity O(n^2). For the second-order model, the decoding problem becomes NP-hard; therefore, they propose a greedy approximate decoding procedure with time complexity O(n^3). Pitler recently proposed a dynamic programming decoding algorithm for third-order non-projective dependency parsing, with time complexity O(n^4).
57
Non-projective dependency parsing
Transition-based methods Extended with a SWAP action (Nivre, 2009) swap Nivre (2009) extended existing transition-based systems and proposed a transition system for non-projective parsing with an extra swap action. Below is an example. Suppose we reach the state at the top, where the left-side stack contains the processed words, and the input is in the right-side queue. We know that there should be a dependency from meeting to “for”. However, we cannot produce this dependency, because according to the definition of this transition system, we can only connect the top two adjacent words in the stack. Here the swap action can help: we can see that after two swap actions, “for” becomes adjacent to “meeting”. In this way, non-projective dependencies can be handled. swap Example from Wang and Zhang (2010)
58
Probabilistic dependency parsing
Log-linear models (CRF) Projective: inside-outside (Paskin, 2001) Non-projective: matrix-tree theorem (Smith and Smith, 2007; McDonald and Satta, 2007; Koo+, 2007) Generative models Spectral learning (Luque+, 2012; Dhillon+, 2012) To date, most dependency parsers adopt non-probabilistic linear models. However, there is some work on probabilistic models. For projective log-linear models, we can use the inside-outside algorithm to compute the marginal probabilities, which can then be used to compute the gradients of the feature weights. For non-projective log-linear models, several researchers developed algorithms to compute the marginal probabilities based on the matrix-tree theorem. Recently, there has also been some work on generative dependency parsing using spectral learning.
59
Improving parsing efficiency
Coarse-to-fine parsing Fast feature generation Parallel techniques Recently, researchers have tried to improve parsing efficiency, because real-world systems require fast syntactic analysis of large data. The first technique is coarse-to-fine parsing, the second is fast feature generation, and the third is parallel techniques for fast parsing.
60
Coarse-to-fine parsing
Use a coarse and fast model to prune the search space for more complex models. Charniak and Johnson (2005) apply this method to fast n-best constituent parsing. Use a CRF-based first-order dependency parser to prune the search space before third-order parsing (Koo and Collins, 2010) Vine pruning (Rush and Petrov, 2012) Zero-order => first-order => second-order => third-order The main idea of coarse-to-fine parsing is to use a coarse and fast model to prune the search space for more complex models. Rush and Petrov further propose a cascaded pruning framework: first, a zero-order model with linear complexity prunes the search space; then a first-order model with cubic time complexity prunes it further; and so on.
61
Fast feature generation
When only considering positive features, current graph-based dependency parsers usually incorporate tens of millions of different features. Feature generation based on standard hash tables needs ~90% of the total parsing time (Bohnet, 2010) Feature string generation Feature index mapping Feature generation consists of two steps: feature string generation and feature index mapping.
62
Feature strings for one dependency
Example from slides of Rush and Petrov (2012) Even for one dependency, we get nearly one hundred different features.
63
Feature-index mapping
Map different feature strings into different indices (feature space). Feature weight Each feature in the feature space has a weight. [went, VERB, As, IN] → 1023 [VERB, As, IN, left] → 526
64
Fast feature generation
Bohnet (2010) proposes a hash kernel method for fast feature generation. Map each feature string into an index using a hash function with pure numerical calculation. Hash collision is OK. Qian+ (2010) propose a 2D trie structure for fast feature generation. Complex feature templates are extensions of simple ones. Given a feature string, the hash function maps it into an index. Because the operation only needs pure numerical calculation, it is very fast. As you may have noticed, this method may map different features into the same index, which is known as hash collision. However, the method ignores this and allows different features to correspond to the same index, which means that some different features are treated as the same. Since the number of features is huge, experiments show that such approximation does not hurt parsing accuracy. One advantage of this method is that it can incorporate negative features which do not occur in the correct parses.
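A minimal sketch of the hash-kernel idea for feature indexing; the function names and the table size are illustrative assumptions, and a real implementation would use a fixed, fast string hash (Python's built-in hash is randomized per process):

```python
TABLE_SIZE = 1 << 22     # length of the dense weight array (illustrative choice)

def hashed_indices(feature_strings):
    """Map each feature string directly to an index; collisions are tolerated."""
    return [hash(s) % TABLE_SIZE for s in feature_strings]

def arc_score(feature_strings, w):
    """Score one dependency: sum the weights of its hashed features.
    w is a dense weight array of length TABLE_SIZE."""
    return sum(w[i] for i in hashed_indices(feature_strings))
```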
65
Parallel techniques Parsing efficiency can be largely improved by exploiting multiple CPU cores via multi-thread programming. Graph-based parser (Bohnet, 2010) Parallel feature generation Parallel decoding algorithms Transition-based parser (Hatori+, 2011) Parallel beam-search decoding Both parsers are publicly available. The graph-based parser of Bohnet implements parallel feature generation and parallel decoding, while the transition-based parser of Hatori+ implements parallel beam-search decoding.
66
Supervised dependency parsing
Quick summarization Graph-based Parser ensemble: reparsing, stacking Transition-based Now let me give a short summary. We introduced the two mainstream supervised approaches and two other recently adopted approaches. We also introduced two parser ensemble methods. After that, we covered work on non-projective parsing and probabilistic models. Finally, we summarized current techniques for improving parsing efficiency. Next, we would like to introduce another topic, joint morphological analysis and dependency parsing, which has attracted more attention in the past three years. Easy-first constituent-based Supervised dependency parsing Non-projective Probabilistic models Parsing efficiency
67
Joint morphological analysis and dependency parsing
Motivation Due to the intrinsic difficulty of NLP, a cascaded framework is commonly adopted. Morphology (lexicon) → Syntax → Semantics Two problems Error propagation Fail to explore the connections between different levels of processing tasks, which may help those related tasks. Given an input sentence, we first apply morphological analysis, such as word segmentation and part-of-speech tagging. Then, we do syntactic analysis based on the results of the previous step. Finally, semantic analysis is done. There are two problems with this pipeline framework. First, errors in lower-level tasks badly influence higher-level tasks. Second, it fails to explore the connections and interactions between tasks, which may help those related tasks.
68
Pipeline example: Chinese POS tagging and dependency parsing
SUB VMOD PMOD P ROOT Dependency Parsing Here we take Chinese POS tagging and dependency parsing as an example. In the pipeline method, given an input sentence, we first apply POS tagging and produce a POS tag sequence for the sentence. The tag sequence may contain some errors, which are marked in grey. Then, based on the POS tag sequence, we apply dependency parsing. Here is an example; the POS tag errors are marked in grey. We can see that the verb 效力 is wrongly tagged as a noun. This error will influence the parsing process. NN # NT P PU POS tagging 欧文1 现在2 效力3 于4 利物浦队5 。6 $0 Owen now plays for in Liverpool . Words
69
Joint Chinese POS tagging and dependency parsing
SUB VMOD PMOD P ROOT Joint Parsing and Tagging In the joint method, given the input sentence, we do not produce a single tag sequence. Instead, we construct a POS tag lattice, which contains all POS tagging ambiguities. Then, we conduct joint parsing and tagging, and produce an optimal POS tag sequence and parse tree simultaneously. The idea of joint POS tagging and dependency parsing is to keep the POS tagging ambiguity until the parsing phase, and resolve such ambiguities during parsing. NN # NT P PU POS tag lattice VV JJ NR 欧文1 现在2 效力3 于4 利物浦队5 。6 $0 Owen now plays for in Liverpool . Words
70
Graph-based joint POS tagging and dependency parsing (Li+, 2011)
Formally, the pipeline method Step 1: POS tagging Step 2: dependency parsing Formally, in the pipeline method, the first step finds a highest-scoring POS tag sequence, where φ1(X) is the search space containing all tag sequences for X. The second step then finds a highest-scoring parse tree, where φ2(X) is the search space containing all legal parse trees for X. SUB VMOD PMOD P ROOT NN # NT P PU 欧文1 现在2 效力3 于4 利物浦队5 。6 $0 Owen now plays for in Liverpool .
71
Graph-based joint POS tagging and dependency parsing (Li+, 2011)
The joint method tries to solve the two tasks simultaneously. The joint method tries to solve the two tasks simultaneously. We can see that the search space is the product of φ1 and φ2, which is quite huge. SUB VMOD PMOD P ROOT JJ NR NR JJ VV VV # NN NT NN P NN PU 欧文1 现在2 效力3 于4 利物浦队5 。6 $0 Owen now plays for in Liverpool .
72
Graph-based joint POS tagging and dependency parsing (Li+, 2011)
Under the joint model, the score of a joint result is the sum of two components: the POS tagging score and the syntactic score. The POS tagging features and the syntactic features can therefore interact with each other in order to find an optimal joint solution.
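A sketch of the pipeline versus joint objectives in LaTeX notation, reconstructed from the descriptions above (the decomposition into a tagging score and a syntactic score follows the speaker notes; the exact notation of Li+ (2011) may differ):

```latex
% Pipeline: two separate argmax problems
T^{*} = \operatorname*{argmax}_{T \in \phi_1(X)} \mathrm{Score}_{pos}(X, T), \qquad
Y^{*} = \operatorname*{argmax}_{Y \in \phi_2(X)} \mathrm{Score}_{syn}(X, T^{*}, Y)

% Joint: one argmax over the product search space
(T^{*}, Y^{*}) = \operatorname*{argmax}_{(T, Y) \in \phi_1(X) \times \phi_2(X)}
  \mathrm{Score}_{pos}(X, T) + \mathrm{Score}_{syn}(X, T, Y)
```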
73
Graph-based joint POS tagging and dependency parsing (Li+, 2011)
The search problem Given the feature weights wjoint, how to efficiently find the optimal joint result from a huge search space? Dynamic programming based decoding algorithms: direct extension of the decoding algorithms for dependency parsing Under the graph-based framework, the joint method needs to solve two problems: the search problem and the learning problem. Given the feature weights w, how can we efficiently find the optimal joint result in a huge search space? The answer is that we can directly extend the dynamic programming based decoding algorithms for dependency parsing.
74
Dynamic programming based decoding algorithms (Li+, 2011)
Product of two dynamic programming based decoding algorithms Augment partial parses (spans) with POS tags. Time complexity O(n^3 q^4) (q = 1.4) Both POS tagging and dependency parsing have dynamic programming based decoding algorithms. The proposed decoding algorithm is essentially the product of the two dynamic programming algorithms: the idea is to augment the partial parses (spans) with POS tags.
75
The learning problem (Li+, 2011)
How to learn the feature weights wjoint? Online training Averaged perceptron (AP) Margin infused relaxed algorithm (MIRA) Passive-aggressive algorithm (PA) For the learning problem, we can adopt online training algorithms, such as AP, MIRA, and PA, to learn the feature weights.
76
Separately passive-aggressive (SPA) learning (Li+, 2012)
Use separate update steps for the POS tagging features and syntactic features. Can better balance the discriminative power of both tagging and parsing features. Leads to better tagging and parsing accuracy. We further propose a new training algorithm, named separately passive-aggressive (SPA) learning. The idea is to use separate update steps for the POS tagging features and the syntactic features, which can better balance the discriminative power of the two feature sets. Results show that SPA leads to better tagging and parsing accuracy.
77
Results of graph-based joint models
Here are the results on Chinese Treebank 5 (CTB5). We can see that compared with parsing with gold-standard POS tags, the parsing accuracy of the pipeline method drops by about 7%, and the joint method improves parsing accuracy over the pipeline by about 1%. On the other hand, the joint method also improves tagging accuracy by about 0.7%. On Chinese Data: CTB5
78
Transition-based joint Chinese POS tagging and dependency parsing
A direct extension of transition-based dependency parsing by adding a tagging action (Hatori+, 2011; Bohnet and Nivre, 2012) An arc-standard version (Hatori+, 2011) Shift(t): shift a word in the queue into the stack and assign tag t to it. (SH) Reduce-left (RL) Reduce-right (RR) Researchers have also proposed joint POS tagging and dependency parsing under the transition-based framework. The idea is to add a tagging action to the transition system.
79
Transition-based joint Chinese POS tagging and dependency parsing
An example from the slides of Hatori+ (2011) (the stack and queue contents of some steps are omitted in this excerpt):
#  Act.     Stack S | Queue Q
-           我/?? 想/?? 把/?? …
1  SH(PN)   我/PN | 想/?? 把/?? 这/?? …
2  SH(VV)   我/PN 想/VV | 把/?? 这/?? 个/?? …
3  SH(BA)   我/PN 想/VV 把/BA | 这/?? 个/?? 句子/?? …
4  SH(DT)   我/PN 想/VV 把/BA 这/DT | 个/?? 句子/?? 翻译/?? …
5  SH(M)    我/PN 想/VV 把/BA 这/DT 个/M | 句子/?? 翻译/?? 成/?? …
6  RL
7  SH(NN)   翻译/?? 成/?? 英语/?? $
8  RR
9
10          成/?? 英语/?? $
Here is an example from Hatori+ (2011).
80
Other work on joint morphological analysis and parsing
Easy-first joint Chinese POS tagging and dependency parsing (Ma+, 2012) Transition-based joint Chinese word segmentation, POS tagging, and dependency parsing (Hatori+, 2012; Li and Zhou, 2012) Joint Chinese word segmentation, POS tagging, and phrase-structure parsing Qian and Liu (2012): integrating separately-trained models at the decoding phase. Zhang+ (2013): a character-level transition-based system Here is some other work on joint morphological analysis and parsing, including easy-first and transition-based approaches, and joint word segmentation, POS tagging, and phrase-structure parsing.
81
Other work on joint morphological analysis and parsing
Dual decomposition (DD) (Rush+, 2010) Integrating different NLP subtasks at the test phase a phrase-structure parser and a dependency parser a phrase-structure parser and a POS tagger Loopy belief propagation (LBP) (Lee+, 2011) Joint morphological disambiguation and dependency parsing for morphologically-rich languages including Latin, Czech, Ancient Greek, and Hungarian. Comparison of DD and LBP (Auli and Lopez, 2011) Joint CCG supertagging and parsing Researchers have also proposed using dual decomposition and loopy belief propagation for joint morphological analysis and parsing.
82
References Michael Auli and Adam Lopez. A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. In Proceedings of ACL 2011, pp470–480. Bernd Bohnet. Top Accuracy and Fast Dependency Parsing is not a Contradiction. In Proceedings of COLING 2010, pp89–97. Bernd Bohnet and Joakim Nivre. A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing. In Proceedings of EMNLP 2012, pp1455–1465. Xavier Carreras. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pp957–961. Eugene Charniak and Mark Johnson. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. In Proceedings of ACL 2005, pp173–180. Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP 2002. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online Passive-Aggressive Algorithms. In Proceedings of NIPS 2003.
83
References Koby Crammer and Yoram Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research, 2001 (3). Paramveer S. Dhillon, Jordan Rodu, Michael Collins, Dean P. Foster, and Lyle H. Ungar. Spectral Dependency Parsing with Latent Variables. In Proceedings of EMNLP 2012. Xiangyu Duan, Jun Zhao, and Bo Xu. Probabilistic parsing action models for multilingual dependency parsing. In Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007, pp940–946. Jason Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING 1996, pp340–345. Jason Eisner. Bilexical Grammars and Their Cubic-Time Parsing Algorithms. In Advances in Probabilistic and Other Parsing Technologies, 2000, pp29–62. Yoav Goldberg and Michael Elhadad. An Efficient Algorithm for Easy-First Non-Directional Dependency Parsing. In Proceedings of NAACL 2010. Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. Incremental Joint POS Tagging and Dependency Parsing in Chinese. In Proceedings of IJCNLP 2011, pp1216–1224. Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese. In Proceedings of ACL 2012, pp1045–1053.
84
References Liang Huang and Kenji Sagae. Dynamic Programming for Linear-time Incremental Parsing. In Proceedings of ACL 2010, pp1077–1086. Terry Koo and Michael Collins. Efficient third-order dependency parsers. In Proceedings of ACL 2010, pp1–11. Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. Structured prediction models via the matrix-tree theorem. In Proceedings of EMNLP 2007. Marco Kuhlmann, Carlos Gomez-Rodriguez, and Giorgio Satta. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of ACL 2011. John Lee, Jason Naradowsky, and David A. Smith. A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing. In Proceedings of ACL 2011, pp885–894. Zhenghua Li, Min Zhang, Wanxiang Che, and Ting Liu. A Separately Passive-Aggressive Training Algorithm for Joint POS Tagging and Dependency Parsing. In Proceedings of COLING 2012, pp1681–1698. Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li. Joint Models for Chinese POS Tagging and Dependency Parsing. In Proceedings of EMNLP 2011, pp1180–1191.
85
References Yoav Goldberg and Joakim Nivre. Training Deterministic Parsers with Non-Deterministic Oracles. In TACL. Zhongguo Li and Guodong Zhou. Unified Dependency Parsing of Chinese Morphological and Syntactic Structures. In Proceedings of EMNLP 2012, pp1445–1454. Franco M. Luque, Ariadna Quattoni, Borja Balle, and Xavier Carreras. Spectral learning for non-deterministic dependency parsing. In Proceedings of EACL 2012. Ji Ma, Tong Xiao, Jingbo Zhu, and Feiliang Ren. Easy-First Chinese POS Tagging and Dependency Parsing. In Proceedings of COLING 2012, pp1731–1746. Andre F. T. Martins, Dipanjan Das, Noah A. Smith, and Eric P. Xing. Stacking Dependency Parsers. In Proceedings of EMNLP 2008, pp157–166. Ryan McDonald, Koby Crammer, and Fernando Pereira. Online large-margin training of dependency parsers. In Proceedings of ACL 2005, pp91–98. Ryan McDonald and Fernando Pereira. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL 2006, pp81–88. Ryan McDonald and Giorgio Satta. On the complexity of non-projective data-driven dependency parsing. In Proceedings of IWPT 2007. Ryan McDonald and Joakim Nivre. Analyzing and Integrating Dependency Parsers. Computational Linguistics, 37(1), pp197–230.
86
References Joakim Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of IWPT 2003. Joakim Nivre and Jens Nilsson. Pseudo-Projective Dependency Parsing. In Proceedings of ACL 2005, pp99–106. Joakim Nivre. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of ACL-IJCNLP 2009. Mark A. Paskin. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical report No. UCB/CSD, University of California, Berkeley, 2001. Emily Pitler. A Crossing-Sensitive Third-Order Factorization for Dependency Parsing. In TACL 2014. Xian Qian and Yang Liu. Joint Chinese Word Segmentation, POS Tagging and Parsing. In Proceedings of EMNLP 2012, pp501–511. Xian Qian, Qi Zhang, Xuanjing Huang, and Lide Wu. 2D Trie for Fast Parsing. In Proceedings of COLING 2010. Alexander M. Rush and Slav Petrov. Vine pruning for efficient multi-pass dependency parsing. In Proceedings of NAACL 2012.
87
References Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing. In Proceedings of EMNLP 2010, pp1–11. Kenji Sagae and Alon Lavie. Parser Combination by Reparsing. In Proceedings of NAACL 2006, pp129–132. David A. Smith and Noah A. Smith. Probabilistic models of nonprojective dependency trees. In Proceedings of EMNLP 2007. Weiwei Sun and Xiaojun Wan. Data-driven, PCFG-based and Pseudo-PCFG-based Models for Dependency Parsing. In TACL 2013. Mihai Surdeanu and Christopher D. Manning. Ensemble Models for Dependency Parsing: Cheap and Good? In Proceedings of NAACL 2010. Hiroyasu Yamada and Yuji Matsumoto. Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, pp195–206. Hao Zhang and Ryan McDonald. Generalized Higher-Order Dependency Parsing with Cube Pruning. In Proceedings of EMNLP 2012, pp320–331. Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. Chinese Parsing Exploiting Characters. In Proceedings of ACL 2013. Yue Zhang and Joakim Nivre. Transition-based Dependency Parsing with Rich Non-local Features. In Proceedings of ACL 2011 (short papers), pp188–193.
88
End of Part A (Note: summarize the content covered with a single figure.)
Supervised parsing (transition-based, graph-based, …) Joint (transition-based, graph-based) (Note: could include some experimental results: Nivre and McDonald (stacked learning); Joint (English, Chinese).)