Learning with Probabilistic Features for Improved Pipeline Models
Razvan C. Bunescu
Electrical Engineering and Computer Science, Ohio University, Athens, OH
EMNLP, October 2008
Introduction
NLP systems often depend on the output of other NLP systems: POS Tagging → Syntactic Parsing → Semantic Role Labeling, Named Entity Recognition, Question Answering.
Traditional Pipeline Model: M1
POS Tagging → Syntactic Parsing. The best annotation from one stage is used in subsequent stages.
Problem: errors propagate between pipeline stages!
Probabilistic Pipeline Model: M2
POS Tagging → Syntactic Parsing. All possible annotations from one stage are used in subsequent stages, as probabilistic features.
Problem: Z(x) has exponential cardinality!
Probabilistic Pipeline Model: M2
M2 replaces each feature with its expectation over the upstream annotations:
  φ_i(x, y) = Σ_{z ∈ Z(x)} p(z|x) · φ_i(x, y, z)
When the original φ_i's are count features, it can be shown that this expectation has a feature-wise formulation:
  φ_i(x, y) = Σ_{F ∈ F_i} p(F|x)
where F is an instance of feature i, i.e. the actual evidence used from example (x, y, z), and F_i is the set of all instances of feature i in (x, y, z), across all annotations z ∈ Z(x).
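The feature-wise formulation above can be checked on a toy example (the tag set, scoring function, and distribution below are invented for illustration, not the paper's models): for a count feature, the brute-force expectation over all annotations z equals the sum of per-instance probabilities p(F|x).

```python
import itertools

TAGS = ["RB", "VBD", "DT"]
N = 3  # toy sentence length

BASE = {"RB": 0.2, "VBD": 0.5, "DT": 0.3}

def score(z):
    # Arbitrary unnormalized score; boosting VBD at position 1 keeps p non-uniform.
    s = 1.0
    for i, t in enumerate(z):
        s *= BASE[t] * (1.5 if i == 1 and t == "VBD" else 1.0)
    return s

seqs = list(itertools.product(TAGS, repeat=N))
Z = sum(score(z) for z in seqs)
p = {z: score(z) / Z for z in seqs}  # p(z|x)

def phi(z):
    # Count feature: number of adjacent (RB, VBD) tag pairs in z.
    return sum(1 for i in range(N - 1) if z[i] == "RB" and z[i + 1] == "VBD")

# Expectation of the count feature, by brute force over Z(x) ...
expectation = sum(p[z] * phi(z) for z in seqs)

# ... equals the sum over feature instances F (adjacent position pairs) of p(F|x).
instance_sum = sum(
    sum(p[z] for z in seqs if z[i] == "RB" and z[i + 1] == "VBD")
    for i in range(N - 1)
)

assert abs(expectation - instance_sum) < 1e-12
```

The point of the identity is that the right-hand side never enumerates Z(x): it only needs one marginal p(F|x) per feature instance.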
Example: POS → Dependency Parsing
The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
Feature i = RB → VBD (a VBD head with an RB modifier).
The set of feature instances F_i contains one instance per ordered word pair, with probabilities such as 0.91, 0.01, 0.1, …
In total there are N(N-1) feature instances in F_i.
Example: POS → Dependency Parsing
1) Feature i = RB → VBD uses a limited amount of evidence: the set of feature instances F_i has cardinality N(N-1).
2) Computing p(F|x) takes O(N|P|^2) time using a constrained version of the forward-backward algorithm.
Therefore, computing φ_i takes O(N^3 |P|^2) time.
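The constrained computation can be sketched on a toy chain model (the emission and transition potentials below are random placeholders, not the paper's CRF): fixing the tags at two positions during the forward pass and renormalizing yields p(t_u = a, t_v = b | x). Each constrained pass is O(N|P|^2), so covering all N(N-1) position pairs gives the O(N^3 |P|^2) total stated above.

```python
import itertools
import random

random.seed(0)
N, P = 5, 3                         # sentence length, tag-set size
emit = [[random.random() for _ in range(P)] for _ in range(N)]
trans = [[random.random() for _ in range(P)] for _ in range(P)]

def partition(fixed=None):
    """Forward pass over tag sequences; `fixed` maps position -> required tag."""
    fixed = fixed or {}
    def allowed(i, t):
        return i not in fixed or fixed[i] == t
    alpha = [emit[0][t] if allowed(0, t) else 0.0 for t in range(P)]
    for i in range(1, N):
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(P)) * emit[i][t]
            if allowed(i, t) else 0.0
            for t in range(P)
        ]
    return sum(alpha)

# p(t_u = a, t_v = b | x) as a ratio of constrained to unconstrained sums.
u, a, v, b = 1, 0, 3, 2
p_pair = partition({u: a, v: b}) / partition()

# Brute-force check over all P**N tag sequences.
def seq_score(z):
    s = emit[0][z[0]]
    for i in range(1, N):
        s *= trans[z[i - 1]][z[i]] * emit[i][z[i]]
    return s

Z = sum(seq_score(z) for z in itertools.product(range(P), repeat=N))
Zc = sum(seq_score(z) for z in itertools.product(range(P), repeat=N)
         if z[u] == a and z[v] == b)
assert abs(p_pair - Zc / Z) < 1e-9
```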
Probabilistic Pipeline Model: M2
POS Tagging → Syntactic Parsing. All possible annotations from one stage are used in subsequent stages, in polynomial time.
In general, the time complexity of computing φ_i depends on the complexity of the evidence used by feature i.
Probabilistic Pipeline Model: M3
POS Tagging → Syntactic Parsing. The best annotation from one stage is used in subsequent stages, together with its probabilistic confidence:
  φ_i(x, y) = Σ_{F ∈ F_i(ẑ)} p(F|x)
where F_i(ẑ) is the set of instances of feature i using only the best annotation ẑ.
Probabilistic Pipeline Model: M3
Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features. More efficient than M2, but less accurate.
Example: POS → Dependency Parsing, showing features generated by the template t_i → t_j and their probabilities:
x: The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
y: DT_1 NNS_2 RB_3 VBD_4 EX_5 MD_6 VB_7 NNS_8 IN_9 DT_10 NN_11
Probabilistic Pipeline Models
[Side-by-side pipeline diagrams of Model M2 and Model M3.]
Two Applications
1) Dependency Parsing: POS Tagging → Syntactic Parsing
2) Named Entity Recognition: POS Tagging → Syntactic Parsing → Named Entity Recognition
1) Dependency Parsing
Use MSTParser [McDonald et al. 2005]:
– The score of a dependency tree is the sum of its edge scores.
– Feature templates use words and POS tags at positions u and v and their neighbors u±1 and v±1.
Use a CRF [Lafferty et al. 2001] POS tagger:
– Compute probabilistic features using a constrained forward-backward procedure.
– Example: feature t_i → t_j has probability p(t_i, t_j | x); constrain the state transitions to pass through tags t_i and t_j.
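Edge-factored scoring in the MSTParser style can be sketched as follows (the feature templates and weights are invented for illustration; the parser's real templates are richer and include the neighbor words and tags mentioned above): a tree's score is the sum of its edge scores, each a dot product of a weight vector with edge features.

```python
# Hypothetical feature templates on an edge (head u, modifier v).
def edge_features(words, tags, u, v):
    return {
        f"head_word={words[u]}": 1.0,
        f"head_tag={tags[u]}": 1.0,
        f"mod_tag={tags[v]}": 1.0,
        f"tag_pair={tags[u]},{tags[v]}": 1.0,
        f"dist={abs(u - v)}": 1.0,
    }

# Invented weights for the sketch; unlisted features score zero.
weights = {"tag_pair=VBD,RB": 1.2, "head_tag=VBD": 0.4, "dist=1": 0.3}

def edge_score(words, tags, u, v):
    feats = edge_features(words, tags, u, v)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

words = ["sailors", "mistakenly", "thought"]
tags = ["NNS", "RB", "VBD"]
tree = [(2, 0), (2, 1)]  # thought -> sailors, thought -> mistakenly

# Edge-factored tree score: s(x, y) = sum of edge scores over arcs in y.
score = sum(edge_score(words, tags, u, v) for u, v in tree)
```

Because the score decomposes over edges, the highest-scoring tree can be found with maximum-spanning-tree inference, which is what makes the per-arc probabilities in the later slides meaningful.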
1) Dependency Parsing
Two approximations of model M2:
– Model M2': consider the POS tags independent:
  p(t_i = RB, t_j = VBD | x) ≈ p(t_i = RB | x) · p(t_j = VBD | x)
  and ignore tags with low marginal probability: p(t) < 1/(β|P|).
– Model M2'': like M2', but use constrained forward-backward to compute the joint marginal probabilities when the tag chunks are less than 4 tokens apart.
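The M2' approximation can be sketched with made-up numbers (β, the tag-set size, and the marginals below are illustrative, not measured values): prune tags whose marginal falls below 1/(β|P|), then multiply the surviving unigram marginals in place of the joint.

```python
beta = 2
P_size = 45                       # e.g. the Penn Treebank tag-set size
threshold = 1.0 / (beta * P_size)

# Hypothetical per-position tag marginals p(t | x) from the upstream tagger.
marginals = {
    3: {"RB": 0.91, "NN": 0.05, "JJ": 0.04},
    4: {"VBD": 0.88, "VBN": 0.09, "NN": 0.03},
}

def pruned(pos):
    # Ignore tags whose marginal probability falls below 1/(beta * |P|).
    return {t: p for t, p in marginals[pos].items() if p >= threshold}

# M2' pair probability: product of the (pruned) unigram marginals.
p_pair = pruned(3).get("RB", 0.0) * pruned(4).get("VBD", 0.0)
```

Larger β keeps more low-probability tags, trading speed for a closer match to the exact joint marginals.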
1) Dependency Parsing: Results
Train MSTParser on sections 2-21 of the Penn WSJ Treebank using gold POS tags. Test MSTParser on section 23, using POS tags from the CRF tagger.
Absolute error reduction of “only” 0.19%:
– But the POS tagger has a very high accuracy of 96.25%.
– Expect more substantial improvement when upstream stages in the pipeline are less accurate.
[Table: accuracy of M1, M2'(β=1), M2'(β=2), M2'(β=4), and M2''(β=4).]
2) Named Entity Recognition
Model NER as a sequence tagging problem using CRFs:
x: The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
z1: DT_1 NNS_2 RB_3 VBD_4 EX_5 MD_6 VB_7 NNS_8 IN_9 DT_10 NN_11
z2: the dependency tree of x
y: O I O O O O O O O O O
Flat features: unigrams, bigrams and trigrams that extend either left or right: sailors, the sailors, sailors RB, sailors RB thought, …
Tree features: unigrams, bigrams and trigrams that extend in any direction in the undirected dependency tree: sailors→thought, sailors→thought→RB, NNS→thought→RB, …
Named Entity Recognition: Model M2
POS Tagging → Syntactic Parsing → Named Entity Recognition, with probabilistic features computed over both upstream annotations.
Example feature NNS_2 ← thought_4 → RB_3: a tree trigram whose probability depends on both the POS tags and the dependency arcs involved.
Named Entity Recognition: Model M3'
M3' is an approximation of M3 in which confidence scores are computed as follows:
– Consider POS tagging and dependency parsing independent.
– Consider POS tags independent.
– Consider dependency arcs independent.
– Example feature NNS_2 ← thought_4 → RB_3:
  p(F|x) ≈ p(NNS_2|x) · p(RB_3|x) · p(4 → 2|x) · p(4 → 3|x)
Need to compute the arc marginals p(u → v | x).
Probabilistic Dependency Features
To compute probabilistic POS features, we used a constrained version of the forward-backward algorithm. To compute probabilistic dependency features, we use a constrained version of Eisner's algorithm:
– Compute normalized scores n(u → v | x) using the softmax function.
– Transform the scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002].
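The two calibration steps can be sketched as follows (the arc scores and 0/1 "arc correct" labels are invented; isotonic regression is implemented here with the standard pool-adjacent-violators algorithm, one common way to realize the Zadrozny & Elkan calibration step):

```python
import math

def softmax_normalize(scores):
    # Normalized edge scores n(u -> v | x); max-shift for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def isotonic_fit(ys):
    """Pool-adjacent-violators: non-decreasing least-squares fit to ys,
    assumed sorted by the underlying normalized score."""
    merged = []                          # blocks of [mean, count]
    for y in ys:
        merged.append([float(y), 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            y2, c2 = merged.pop()
            y1, c1 = merged.pop()
            merged.append([(y1 * c1 + y2 * c2) / (c1 + c2), c1 + c2])
    out = []
    for y, c in merged:
        out.extend([y] * c)
    return out

# Softmax over candidate-arc scores for one modifier (scores are invented).
n = softmax_normalize([2.0, 0.5, -1.0])

# Calibration data: 0/1 "arc was correct" labels, sorted by normalized score.
labels = [0, 1, 0, 0, 1, 1]
fit = isotonic_fit(labels)
assert all(a <= b for a, b in zip(fit, fit[1:]))   # monotone probabilities
```

The isotonic fit maps each normalized score to a calibrated probability while preserving the ranking the parser already induces.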
Named Entity Recognition: Results
Implemented the CRF models in MALLET [McCallum, 2002].
Trained and tested on the standard split of the ACE corpus (674 training documents, 97 testing documents).
The POS tagger and MSTParser were trained on sections 2-21 of the WSJ Treebank; isotonic regression for MSTParser on section …
[Table: area under the precision-recall curve for models M1 and M3' with Tree, Flat, and Tree+Flat features.]
Named Entity Recognition: Results
[Precision-recall curves: M3' (probabilistic) vs. M1 (traditional), using tree features.]
Conclusions & Related Work
A general method for improving the communication between consecutive stages in pipeline models:
– based on computing expectations for count features, an effective method for associating probabilities with output substructures.
– adds polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.
Can be seen as complementary to the sampling approach of [Finkel et al.]:
– approximate vs. exact in polynomial time.
– used in testing vs. used in training and testing.
Future Work
1) Try the full model M2 or its approximation M2' on NER.
2) Extend the model to pipeline graphs containing cycles.
Questions?