Learning with Probabilistic Features for Improved Pipeline Models
Razvan C. Bunescu
Electrical Engineering and Computer Science, Ohio University, Athens, OH
EMNLP, October 2008
Introduction
NLP systems often depend on the output of other NLP systems: POS Tagging → Syntactic Parsing → Semantic Role Labeling, Named Entity Recognition, Question Answering.
Traditional Pipeline Model: M1
POS Tagging → Syntactic Parsing. The best annotation from one stage is used in subsequent stages.
Problem: errors propagate between pipeline stages!
Probabilistic Pipeline Model: M2
POS Tagging → Syntactic Parsing. All possible annotations from one stage are used in subsequent stages, as probabilistic features.
Problem: Z(x) has exponential cardinality!
Probabilistic Pipeline Model: M2
M2 replaces each feature with its expectation over the upstream annotations:
  φ_i(x, y) = Σ_{z ∈ Z(x)} p(z|x) · φ_i(x, y, z)
When the original φ_i's are count features, it can be shown that this expectation has a feature-wise formulation:
  φ_i(x, y) = Σ_{F ∈ F_i} p(F|x)
where F is an instance of feature i, i.e. the actual evidence used from example (x, y, z), and F_i is the set of all instances of feature i in (x, y, z), across all annotations z ∈ Z(x).
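The feature-wise formulation above can be checked on a toy example (the tag set, scoring function, and distribution below are invented for illustration, not the paper's models): for a count feature, the brute-force expectation over all annotations z equals the sum of per-instance probabilities p(F|x).

```python
import itertools

TAGS = ["RB", "VBD", "DT"]
N = 3  # toy sentence length

BASE = {"RB": 0.2, "VBD": 0.5, "DT": 0.3}

def score(z):
    # Arbitrary unnormalized score; boosting VBD at position 1 keeps p non-uniform.
    s = 1.0
    for i, t in enumerate(z):
        s *= BASE[t] * (1.5 if i == 1 and t == "VBD" else 1.0)
    return s

seqs = list(itertools.product(TAGS, repeat=N))
Z = sum(score(z) for z in seqs)
p = {z: score(z) / Z for z in seqs}  # p(z|x)

def phi(z):
    # Count feature: number of adjacent (RB, VBD) tag pairs in z.
    return sum(1 for i in range(N - 1) if z[i] == "RB" and z[i + 1] == "VBD")

# Expectation of the count feature, by brute force over Z(x) ...
expectation = sum(p[z] * phi(z) for z in seqs)

# ... equals the sum over feature instances F (adjacent position pairs) of p(F|x).
instance_sum = sum(
    sum(p[z] for z in seqs if z[i] == "RB" and z[i + 1] == "VBD")
    for i in range(N - 1)
)

assert abs(expectation - instance_sum) < 1e-12
```

The point of the identity is that the right-hand side never enumerates Z(x): it only needs one marginal p(F|x) per feature instance.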
Example: POS → Dependency Parsing
The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
Feature i = RB → VBD (a VBD head with an RB modifier).
The set of feature instances F_i contains one instance per ordered word pair, with probabilities such as 0.91, 0.01, 0.1, …
In total there are N(N-1) feature instances in F_i.
Example: POS → Dependency Parsing
1) Feature i = RB → VBD uses a limited amount of evidence: the set of feature instances F_i has cardinality N(N-1).
2) Computing p(F|x) takes O(N|P|^2) time using a constrained version of the forward-backward algorithm.
Therefore, computing φ_i takes O(N^3 |P|^2) time.
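The constrained computation can be sketched on a toy chain model (the emission and transition potentials below are random placeholders, not the paper's CRF): fixing the tags at two positions during the forward pass and renormalizing yields p(t_u = a, t_v = b | x). Each constrained pass is O(N|P|^2), so covering all N(N-1) position pairs gives the O(N^3 |P|^2) total stated above.

```python
import itertools
import random

random.seed(0)
N, P = 5, 3                         # sentence length, tag-set size
emit = [[random.random() for _ in range(P)] for _ in range(N)]
trans = [[random.random() for _ in range(P)] for _ in range(P)]

def partition(fixed=None):
    """Forward pass over tag sequences; `fixed` maps position -> required tag."""
    fixed = fixed or {}
    def allowed(i, t):
        return i not in fixed or fixed[i] == t
    alpha = [emit[0][t] if allowed(0, t) else 0.0 for t in range(P)]
    for i in range(1, N):
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(P)) * emit[i][t]
            if allowed(i, t) else 0.0
            for t in range(P)
        ]
    return sum(alpha)

# p(t_u = a, t_v = b | x) as a ratio of constrained to unconstrained sums.
u, a, v, b = 1, 0, 3, 2
p_pair = partition({u: a, v: b}) / partition()

# Brute-force check over all P**N tag sequences.
def seq_score(z):
    s = emit[0][z[0]]
    for i in range(1, N):
        s *= trans[z[i - 1]][z[i]] * emit[i][z[i]]
    return s

Z = sum(seq_score(z) for z in itertools.product(range(P), repeat=N))
Zc = sum(seq_score(z) for z in itertools.product(range(P), repeat=N)
         if z[u] == a and z[v] == b)
assert abs(p_pair - Zc / Z) < 1e-9
```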
Probabilistic Pipeline Model: M2
POS Tagging → Syntactic Parsing. All possible annotations from one stage are used in subsequent stages, in polynomial time.
In general, the time complexity of computing φ_i depends on the complexity of the evidence used by feature i.
Probabilistic Pipeline Model: M3
POS Tagging → Syntactic Parsing. The best annotation from one stage is used in subsequent stages, together with its probabilistic confidence:
  φ_i(x, y) = Σ_{F ∈ F_i(ẑ)} p(F|x)
where F_i(ẑ) is the set of instances of feature i using only the best annotation ẑ.
Probabilistic Pipeline Model: M3
Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features. More efficient than M2, but less accurate.
Example: POS → Dependency Parsing, showing features generated by the template t_i → t_j and their probabilities:
x: The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
y: DT_1 NNS_2 RB_3 VBD_4 EX_5 MD_6 VB_7 NNS_8 IN_9 DT_10 NN_11
Probabilistic Pipeline Models
[Side-by-side pipeline diagrams of Model M2 and Model M3.]
Two Applications
1) Dependency Parsing: POS Tagging → Syntactic Parsing
2) Named Entity Recognition: POS Tagging → Syntactic Parsing → Named Entity Recognition
1) Dependency Parsing
Use MSTParser [McDonald et al. 2005]:
– The score of a dependency tree is the sum of its edge scores.
– Feature templates use words and POS tags at positions u and v and their neighbors u±1 and v±1.
Use a CRF [Lafferty et al. 2001] POS tagger:
– Compute probabilistic features using a constrained forward-backward procedure.
– Example: feature t_i → t_j has probability p(t_i, t_j | x); constrain the state transitions to pass through tags t_i and t_j.
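Edge-factored scoring in the MSTParser style can be sketched as follows (the feature templates and weights are invented for illustration; the parser's real templates are richer and include the neighbor words and tags mentioned above): a tree's score is the sum of its edge scores, each a dot product of a weight vector with edge features.

```python
# Hypothetical feature templates on an edge (head u, modifier v).
def edge_features(words, tags, u, v):
    return {
        f"head_word={words[u]}": 1.0,
        f"head_tag={tags[u]}": 1.0,
        f"mod_tag={tags[v]}": 1.0,
        f"tag_pair={tags[u]},{tags[v]}": 1.0,
        f"dist={abs(u - v)}": 1.0,
    }

# Invented weights for the sketch; unlisted features score zero.
weights = {"tag_pair=VBD,RB": 1.2, "head_tag=VBD": 0.4, "dist=1": 0.3}

def edge_score(words, tags, u, v):
    feats = edge_features(words, tags, u, v)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

words = ["sailors", "mistakenly", "thought"]
tags = ["NNS", "RB", "VBD"]
tree = [(2, 0), (2, 1)]  # thought -> sailors, thought -> mistakenly

# Edge-factored tree score: s(x, y) = sum of edge scores over arcs in y.
score = sum(edge_score(words, tags, u, v) for u, v in tree)
```

Because the score decomposes over edges, the highest-scoring tree can be found with maximum-spanning-tree inference, which is what makes the per-arc probabilities in the later slides meaningful.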
1) Dependency Parsing
Two approximations of model M2:
– Model M2': consider the POS tags independent:
  p(t_i = RB, t_j = VBD | x) ≈ p(t_i = RB | x) · p(t_j = VBD | x)
  and ignore tags with low marginal probability: p(t) < 1/(β|P|).
– Model M2'': like M2', but use constrained forward-backward to compute the joint marginal probabilities when the tag chunks are less than 4 tokens apart.
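The M2' approximation can be sketched with made-up numbers (β, the tag-set size, and the marginals below are illustrative, not measured values): prune tags whose marginal falls below 1/(β|P|), then multiply the surviving unigram marginals in place of the joint.

```python
beta = 2
P_size = 45                       # e.g. the Penn Treebank tag-set size
threshold = 1.0 / (beta * P_size)

# Hypothetical per-position tag marginals p(t | x) from the upstream tagger.
marginals = {
    3: {"RB": 0.91, "NN": 0.05, "JJ": 0.04},
    4: {"VBD": 0.88, "VBN": 0.09, "NN": 0.03},
}

def pruned(pos):
    # Ignore tags whose marginal probability falls below 1/(beta * |P|).
    return {t: p for t, p in marginals[pos].items() if p >= threshold}

# M2' pair probability: product of the (pruned) unigram marginals.
p_pair = pruned(3).get("RB", 0.0) * pruned(4).get("VBD", 0.0)
```

Larger β keeps more low-probability tags, trading speed for a closer match to the exact joint marginals.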
1) Dependency Parsing: Results
Train MSTParser on sections 2-21 of the Penn WSJ Treebank using gold POS tags. Test MSTParser on section 23, using POS tags from the CRF tagger.
Absolute error reduction of “only” 0.19%:
– But the POS tagger has a very high accuracy of 96.25%.
– Expect more substantial improvement when upstream stages in the pipeline are less accurate.
[Table: accuracy of M1, M2'(β=1), M2'(β=2), M2'(β=4), and M2''(β=4).]
2) Named Entity Recognition
Model NER as a sequence tagging problem using CRFs:
x: The_1 sailors_2 mistakenly_3 thought_4 there_5 must_6 be_7 diamonds_8 in_9 the_10 soil_11
z1: DT_1 NNS_2 RB_3 VBD_4 EX_5 MD_6 VB_7 NNS_8 IN_9 DT_10 NN_11
z2: the dependency tree of x
y: O I O O O O O O O O O
Flat features: unigrams, bigrams and trigrams that extend either left or right: sailors, the sailors, sailors RB, sailors RB thought, …
Tree features: unigrams, bigrams and trigrams that extend in any direction in the undirected dependency tree: sailors→thought, sailors→thought→RB, NNS→thought→RB, …
Named Entity Recognition: Model M2
POS Tagging → Syntactic Parsing → Named Entity Recognition, with probabilistic features computed over both upstream annotations.
Example feature NNS_2 ← thought_4 → RB_3: a tree trigram whose probability depends on both the POS tags and the dependency arcs involved.
Named Entity Recognition: Model M3'
M3' is an approximation of M3 in which confidence scores are computed as follows:
– Consider POS tagging and dependency parsing independent.
– Consider POS tags independent.
– Consider dependency arcs independent.
– Example feature NNS_2 ← thought_4 → RB_3:
  p(F|x) ≈ p(NNS_2|x) · p(RB_3|x) · p(4 → 2|x) · p(4 → 3|x)
Need to compute the arc marginals p(u → v | x).
Probabilistic Dependency Features
To compute probabilistic POS features, we used a constrained version of the forward-backward algorithm. To compute probabilistic dependency features, we use a constrained version of Eisner's algorithm:
– Compute normalized scores n(u → v | x) using the softmax function.
– Transform the scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002].
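The two calibration steps can be sketched as follows (the arc scores and 0/1 "arc correct" labels are invented; isotonic regression is implemented here with the standard pool-adjacent-violators algorithm, one common way to realize the Zadrozny & Elkan calibration step):

```python
import math

def softmax_normalize(scores):
    # Normalized edge scores n(u -> v | x); max-shift for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def isotonic_fit(ys):
    """Pool-adjacent-violators: non-decreasing least-squares fit to ys,
    assumed sorted by the underlying normalized score."""
    merged = []                          # blocks of [mean, count]
    for y in ys:
        merged.append([float(y), 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            y2, c2 = merged.pop()
            y1, c1 = merged.pop()
            merged.append([(y1 * c1 + y2 * c2) / (c1 + c2), c1 + c2])
    out = []
    for y, c in merged:
        out.extend([y] * c)
    return out

# Softmax over candidate-arc scores for one modifier (scores are invented).
n = softmax_normalize([2.0, 0.5, -1.0])

# Calibration data: 0/1 "arc was correct" labels, sorted by normalized score.
labels = [0, 1, 0, 0, 1, 1]
fit = isotonic_fit(labels)
assert all(a <= b for a, b in zip(fit, fit[1:]))   # monotone probabilities
```

The isotonic fit maps each normalized score to a calibrated probability while preserving the ranking the parser already induces.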
Named Entity Recognition: Results
Implemented the CRF models in MALLET [McCallum, 2002].
Trained and tested on the standard split of the ACE corpus (674 training documents, 97 testing documents).
The POS tagger and MSTParser were trained on sections 2-21 of the WSJ Treebank; isotonic regression for MSTParser on section …
[Table: area under the precision-recall curve for models M1 and M3' with Tree, Flat, and Tree+Flat features.]
Named Entity Recognition: Results
[Precision-recall curves: M3' (probabilistic) vs. M1 (traditional), using tree features.]
Conclusions & Related Work
A general method for improving the communication between consecutive stages in pipeline models:
– based on computing expectations for count features, an effective method for associating probabilities with output substructures.
– adds polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.
Can be seen as complementary to the sampling approach of [Finkel et al.]:
– approximate vs. exact in polynomial time.
– used in testing vs. used in training and testing.
Future Work
1) Try the full model M2 or its approximation M2' on NER.
2) Extend the model to pipeline graphs containing cycles.
Questions?