
Drexel – 4/22/13 1/39 Treebank Analysis Using Derivation Trees Seth Kulick

Drexel – 4/22/13 2/39 Overview
- Background
  - Treebank: a collection of sentences annotated (by people)
  - The role of treebanks in Natural Language Processing (NLP)
- The Problem: automatically discovering internal inconsistency in treebanks
- Our approach
  - System output
- Adaptation for parser evaluation/inter-annotator agreement
- Future work

Drexel – 4/22/13 3/39 Treebank Example
[Tree diagram: parse of "The wholesale price index stood at 90.1": (S (NP-SBJ The (NML wholesale price) index) (VP stood (PP at (NP 90.1))))]
- Each sentence has a tree with syntactic information
- The annotation is done (ideally) in conformity with a set of annotation guidelines
- Treebank annotation is hard and error-prone

Drexel – 4/22/13 4/39 The role of treebanks in NLP
- Penn Treebank (1994): 1M words from the Wall Street Journal; the "statistical revolution in NLP"
- Training and evaluation material for parsers
- Parsers as a machine learning problem
  - Input: new sentence; output: tree for that sentence
  - Used for downstream processing
- Many approaches to parsing
  - Break down full trees into smaller parts that can be used to assign structure to a new sentence
  - Rely on consistency of annotation
- What does it mean for a treebank to be inconsistent?

Drexel – 4/22/13 5/39 Annotation inconsistency: Example 1
- Two instances of "The wholesale price index" in a treebank:
  (NP The (NML wholesale price) index)
  (NP The wholesale price index)
- Suppose a parser encountered "The wholesale price index" in new data. How would it know what to do? (A small illustration follows.)
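To make the comparison concrete, here is a minimal sketch (not from the talk; the nested-tuple tree encoding is an assumption made purely for illustration) of how the same word string can carry two different bracketings:

```python
# Minimal sketch: two bracketings of the same string, encoded as
# nested tuples of the form (label, child, child, ...), where each
# child is either a sub-tree tuple or a word string.
tree_a = ("NP", "The", ("NML", "wholesale", "price"), "index")
tree_b = ("NP", "The", "wholesale", "price", "index")

def leaves(tree):
    """Return the list of words covered by a tree."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

# Same words, different structure: exactly the situation that leaves
# a parser trained on this treebank with conflicting evidence.
assert leaves(tree_a) == leaves(tree_b)
assert tree_a != tree_b
print(leaves(tree_a))  # ['The', 'wholesale', 'price', 'index']
```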

Drexel – 4/22/13 6/39 Annotation inconsistency: Example 2
- Two instances of "foreign trade exports" in a treebank:
  (NP (NML foreign trade) exports)
  (NP foreign trade exports)

Drexel – 4/22/13 7/39 Annotation inconsistency: Example 3
- Two instances of "than average" in a treebank:
  (PP than (NP average))
  (PP than (ADJP average))

Drexel – 4/22/13 8/39 The Problem – How to detect inconsistencies
- A million words of such trees: what to compare, and how to find inconsistencies?
[Tree diagrams: two full sentence trees containing "The wholesale price index" as the NP-SBJ, one with the NML bracketing and one without]

Drexel – 4/22/13 9/39 The Problem – How to detect inconsistencies
- After approximately 20 years of making treebanks, this is still an open question.
- Usual approach: hand-crafted searches for structural problems (e.g. two SBJs for a verb, unexpected unary branches, etc.)
- But we want a way to automatically discover inconsistencies.
  - Dickinson & Meurers (TLT 2003…)
  - Kulick et al. (ACL 2011, LREC 2012, NAACL 2013)
- Also relevant for treebanks meant for linguistic research, not as training material for parsers.

Drexel – 4/22/13 10/39 The Importance of the Problem
- Lots of treebanks
  - Penn Treebank: 1M words (DARPA)
  - OntoNotes: 1.3M words (DARPA)
  - Arabic Treebank: 600K words (DARPA)
  - Penn Parsed Corpora of Historical English: 1.8M words (NSF)
- Treebank construction is expensive, annotation especially
  - Need faster and better treebank construction
  - NSF Linguistics/Robust Intelligence (Kroch & Kulick)
- Treebank analysis overlaps with key concerns
  - How to increase annotation speed
  - How to determine features for parsers

Drexel – 4/22/13 11/39 Our approach
- Similarities to parsing work: break down full trees to compare parts of them
  - Adapt some ideas from parsing research, in particular Tree Adjoining Grammar (Joshi et al…)
- Basic idea:
  - Decompose each tree into smaller chunks of structure
  - A "derivation tree" relates these chunks together
  - Compare strings based on their derivation tree fragments
- Two advantages:
  - Group inconsistencies by structural properties
  - Abstract away from interfering surface properties

Drexel – 4/22/13 12/39 Treebank Decomposition
[Tree diagram: the parse of "The wholesale price index stood at 90.1" again, before decomposition]

Drexel – 4/22/13 13/39 Treebank Decomposition
[Tree diagram: the same parse, split into chunks]
- Decomposition is done by heuristics controlling the descent through the tree.
- The separate chunks (elementary trees) become the nodes of the "derivation tree". (A rough sketch follows.)
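The talk does not spell out the decomposition heuristics, but a rough sketch of the idea, reusing the nested-tuple encoding from the earlier example, might look like the following (the flat chunk encoding is an illustrative assumption, not the system's actual elementary-tree format):

```python
def decompose(tree, chunks=None):
    """Split a tree into flat chunks, one per internal node.

    Each chunk keeps a node label and its immediate children, with
    non-terminal children reduced to labeled slots (marked "*").
    These chunks are stand-ins for elementary trees and would become
    the nodes of the derivation tree.
    """
    if chunks is None:
        chunks = []
    if isinstance(tree, str):
        return chunks
    label, children = tree[0], tree[1:]
    chunks.append((label, tuple(
        child if isinstance(child, str) else child[0] + "*"
        for child in children)))
    for child in children:
        decompose(child, chunks)
    return chunks

print(decompose(("NP", "The", ("NML", "wholesale", "price"), "index")))
# [('NP', ('The', 'NML*', 'index')), ('NML', ('wholesale', 'price'))]
```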

Drexel – 4/22/13 14/39 Derivation Tree
[Diagram: derivation tree for the example sentence]
- Each node is an elementary tree.

Drexel – 4/22/13 15/39 Derivation Tree
[Diagram: the derivation tree fragment for "the wholesale price index" highlighted]

Drexel – 4/22/13 16/39 Treebank Example
[Tree diagram: a second instance, (S (NP-SBJ The wholesale price index) (VP was NP)), without the NML bracketing]

Drexel – 4/22/13 17/39 Treebank Example
[Tree diagram: the same second instance, marked up for decomposition]

Drexel – 4/22/13 18/39 Derivation Tree
[Diagram: another derivation tree fragment for "the wholesale price index", this time without the NML structure]

Drexel – 4/22/13 19/39 The Basic Idea
- "Nucleus": a sequence of words being examined for consistency of annotation
- Collect all the instances of that nucleus
  - Each instance has an associated derivation tree fragment
  - This yields a set of all derivation tree fragments for the nucleus
- If there is more than one derivation tree fragment, flag the nucleus as inconsistent (see the sketch below)
- Simplifying the problem
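A minimal sketch of that loop (assuming each instance has already been paired with some hashable encoding of its derivation tree fragment; the string encodings below are purely illustrative):

```python
from collections import defaultdict

def find_inconsistent(instances):
    """Collect the set of derivation tree fragments seen for each
    nucleus and flag any nucleus annotated in more than one way."""
    fragments_by_nucleus = defaultdict(set)
    for nucleus, fragment in instances:
        fragments_by_nucleus[nucleus].add(fragment)
    return {nucleus: frags
            for nucleus, frags in fragments_by_nucleus.items()
            if len(frags) > 1}

# Toy usage: the nucleus seen with two different fragments is flagged;
# the nucleus seen only once is not.
flagged = find_inconsistent([
    (("the", "wholesale", "price", "index"), "NP(the NML(w p) index)"),
    (("the", "wholesale", "price", "index"), "NP(the w p index)"),
    (("than", "average"), "PP(than NP(average))"),
])
print(sorted(flagged))  # [('the', 'wholesale', 'price', 'index')]
```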

Drexel – 4/22/13 20/39 Derivation Tree Fragments
- Two different derivation tree fragments for "the wholesale price index": flag as inconsistent

Drexel – 4/22/13 21/39 Organization by Annotation Structures
- Great advantage of this approach:
  - For any set of instances of a nucleus, we have the derivation tree fragments for each instance
  - Group nuclei by the set of derivation tree fragments used across all the instances of that nucleus (sketched below)
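A sketch of the grouping step, continuing the toy encoding above (an assumption here: in the real system the fragments would be abstracted away from the particular words first, so that differently worded nuclei can share an annotation type):

```python
from collections import defaultdict

def group_by_annotation_type(fragments_by_nucleus):
    """Group inconsistent nuclei by the set of fragment shapes they
    use, so that nuclei inconsistent in the same structural way are
    reported together rather than one word sequence at a time."""
    by_type = defaultdict(list)
    for nucleus, fragments in fragments_by_nucleus.items():
        by_type[frozenset(fragments)].append(nucleus)
    return by_type

# Toy usage: two nuclei sharing the same pair of word-abstracted
# fragment shapes fall into a single annotation type.
groups = group_by_annotation_type({
    ("the", "wholesale", "price", "index"): {"NP(x NML(y z) w)", "NP(x y z w)"},
    ("the", "economic", "growth", "rate"): {"NP(x NML(y z) w)", "NP(x y z w)"},
})
for frag_set, nuclei in groups.items():
    print(len(nuclei), "nuclei with fragment set", sorted(frag_set))
```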

Drexel – 4/22/13 22/39 Another Annotation Inconsistency
- Two instances of "the economic growth rate":
  (NP the (NML economic growth) rate)
  (NP the economic growth rate)
- Inconsistent, just like "the wholesale price index"

Drexel – 4/22/13 23/39 Derivation Tree Fragments
- Aside from the words, exactly the same as the derivation tree fragments for "the wholesale price index"

Drexel – 4/22/13 24/39 Organization by Annotation Structures
- Why this is good: e.g. "a short interest position", "the economic growth rate", "the real estate industry" are all annotated inconsistently in the same way
- We care more about different types of consistencies/inconsistencies than about individual cases of words
- Very helpful for identifying the sorts of errors annotators might be making, or problematic areas for a parser

Drexel – 4/22/13 25/39 System Overview
- Compute the derivation tree for each sentence
- Identify strings to compare, the "nuclei" (sketched below)
  - For now, use all strings that are constituents somewhere in the corpus (this can't be right in the long run)
- Get the derivation tree fragments for each nucleus
  - Include nuclei with more than one derivation tree fragment
  - Sort by the set of derivation tree fragments used for a nucleus
- Software: getting in shape for release
  - Derivation tree and elementary tree extraction: Java
  - Everything else: MySQL and Python (trees stored in MySQL)
  - Output: static HTML for now
- Used for current treebanking
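As a sketch of the interim nucleus-identification step named above (assuming the nested-tuple trees from the earlier examples; `leaves` is repeated here so the snippet stands alone):

```python
def leaves(tree):
    """Return the list of words covered by a tree."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

def constituent_strings(tree, found=None):
    """Collect every word sequence exactly covered by some node.

    Per the slide, these strings serve as the candidate nuclei for
    now, even though that choice is acknowledged as provisional.
    """
    if found is None:
        found = set()
    if isinstance(tree, str):
        return found
    found.add(tuple(leaves(tree)))
    for child in tree[1:]:
        constituent_strings(child, found)
    return found

tree = ("NP", "The", ("NML", "wholesale", "price"), "index")
print(sorted(constituent_strings(tree)))
# [('The', 'wholesale', 'price', 'index'), ('wholesale', 'price')]
```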

Drexel – 4/22/13 26/39 Why not just compare subtrees?
- Two instances of "The wholesale price index" in a treebank:
  (NP The (NML wholesale price) index)
  (NP The wholesale price index)
- Why bother decomposing?
  - Adjunction/coordination
  - Partial constituents/treebank simplifications

Drexel – 4/22/13 27/39 Adjunction
- "of the class" modifying "the teacher":
  (NP (NP the teacher) (PP of (NP the class)))
- An adjunction structure for modification
- Suppose we want to look at all instances of "the teacher of the class"

Drexel – 4/22/13 28/39 Adjunction
- "that I took" modifying "the class":
  (NP (NP the teacher) (PP of (NP (NP the class) (SBAR that I took))))
- There is no subtree covering just "the teacher of the class"

Drexel – 4/22/13 29/39 Adjunction in Derivation Tree Fragments
- Nucleus: "the teacher of the class"
- Elementary trees a1-a5 and b1-b5 are compared
- No interference from b6 (and below)

Drexel – 4/22/13 30/39 Arabic Treebank example
- Nucleus: "summit Sharm el-Sheikh"
[Tree diagrams: three instances; in one, "Sharm el-Sheikh" is a nested NP inside the "summit" NP, in another the NP is flat, and in a third "NP Egypt" is adjoined; labels mark where the annotation is different and where it is the same]

Drexel – 4/22/13 31/39 Arabic Treebank example
- Nucleus: "summit Sharm el-Sheikh"
[Diagram: the corresponding derivation tree fragments]

Drexel – 4/22/13 32/39 Further Abstraction from Treebank
- Two instances of nucleus "one place"; it is not a constituent in one of them
[Tree diagrams: the two instances, with the non-constituent occurrence marked "not a constituent here"]

Drexel – 4/22/13 33/39 Further Abstraction from Treebank
- Change #b2 to not have the QP
- The more abstract representation is also used for parsing

Drexel – 4/22/13 34/39 (Partial) Evaluation
- Arabic Treebank, 598K words (Maamouri et al.)

  # Nuclei | # Reported | # Non-duplicate | # Annotation Types
  54,496   | 9,984      | 4,272           | 1,911

  - The first 10 annotation types include 266 nuclei, all correctly identified
  - Recall of inconsistencies is hard to measure
- OntoNotes English newswire, 525K words

  # Nuclei | # Reported | # Non-duplicate | # Annotation Types
  30,497   | 3,609      | 3,012           | 1,186

  - Evaluation so far: successful, not perfect

Drexel – 4/22/13 35/39 System Output
- Similar words grouped together

Drexel – 4/22/13 36/39 Parser Analysis/Inter-Annotator Agreement
- The method of finding nuclei is in general a problem
  - Reliance on identical strings of words ("later", "l8r")
  - A nucleus must occur at least once as a constituent
- But in one situation it is completely natural: comparing two sets of annotations over the same sentences (Kulick et al., NAACL 2013; a sketch follows)
- Parser evaluation: output of the parser compared to the treebank gold standard
- Inter-annotator agreement evaluation: two annotators working on the same sentences
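A hedged sketch of that two-annotation comparison (the dictionaries mapping each aligned span to its derivation tree fragment are assumed to come from the same extraction step as before; the span encoding is illustrative):

```python
def compare_annotations(fragments_a, fragments_b):
    """Compare two annotations of the same sentences.

    fragments_a / fragments_b map a nucleus, identified by its
    sentence index and token span, to the derivation tree fragment
    it receives under each annotation. Because both annotations
    cover identical text, the spans align exactly, and any nucleus
    with differing fragments is a point of disagreement.
    """
    disagreements = {}
    for nucleus, frag_a in fragments_a.items():
        frag_b = fragments_b.get(nucleus)
        if frag_b is not None and frag_b != frag_a:
            disagreements[nucleus] = (frag_a, frag_b)
    return disagreements

# Toy usage: sentence 0, tokens 3-5, annotated differently by the
# two annotators (or by a parser vs. the gold treebank).
print(compare_annotations(
    {(0, 3, 5): "PP(than NP(average))"},
    {(0, 3, 5): "PP(than ADJP(average))"},
))
```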

Drexel – 4/22/13 37/39 Example of Inter-annotator (dis)agreement
- Two annotators on the same sentence
- What to compare? This was the original starting point of the work.

Drexel – 4/22/13 38/39 IAA evaluation
- Evaluation on a 4,270-word pre-release subset of the Google English Web Treebank (Bies et al., 2012)

  Inconsistency Type | # Found | # Accurate
  Function tags only | 5       | 3
  POS tags only      | 18      | 13
  Structural         |         |

- Evaluation on an 82,701-word pre-release supplement of Modern British English (Kroch & Santorini, in prep.)
  - 1,532 inconsistency types with 2,194 nuclei
  - The first type has 88 nuclei, the second has 37
  - The first 20 types (375 nuclei) are all true instances of inconsistent annotation

Drexel – 4/22/13 39/39 Future work
- Clustering based on words and derivation tree fragments
  - Problems of multiple spellings: web text, historical corpora
- Order output based on some notion of what's more likely to be an error
- Dependency work
- Parsing work based on derivation trees