Conversion of Penn Treebank Data to Text. Penn TreeBank Project “A Bank of Linguistic Trees” (as of 11/1992) University of Pennsylvania, LINC Laboratory.

Presentation transcript:

Conversion of Penn Treebank Data to Text

Penn TreeBank Project
“A Bank of Linguistic Trees” (as of 11/1992)
University of Pennsylvania, LINC Laboratory
4.5 million words of American English
Annotation of naturally-occurring text for linguistic structure

Tree Linguistic Components
Tokenization
–Treatment of punctuation, words, etc. as separate tokens
–Children’s → Children ’s
Part-of-speech (POS) tagging
–Text first assigned POS tags automatically
–Human annotators correct first-pass POS tags
Bracketing
–(Fidditch, a deterministic parser (Hindle 1983, 1989))
–Two-stage parsing process made explicit with brackets
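The tokenization step above can be illustrated with a minimal Python sketch. This is not the Treebank project's tokenizer: the function name `split_clitics` and the clitic list are assumptions made for illustration, and the punctuation rule is deliberately naive (it would break up an abbreviation such as S.P.C.A.).

```python
import re

# Hypothetical clitic pattern: an apostrophe (straight or curly) followed by
# a short suffix, directly attached to a preceding word character.
CLITIC = re.compile(r"(?<=\w)(['’](?:s|re|ve|ll|d|m))\b", re.IGNORECASE)

def split_clitics(text):
    """Split clitics and sentence punctuation off as separate tokens."""
    # Put a space in front of the clitic: Children’s -> Children ’s
    spaced = CLITIC.sub(r" \1", text)
    # Naive rule: surround common punctuation marks with spaces.
    spaced = re.sub(r"([,.;:?!])", r" \1 ", spaced)
    return spaced.split()

print(split_clitics("Children’s toys, mostly."))
```

Contractions such as n't are not handled here; a real tokenizer needs a longer rule set.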

Penn TreeBank: Brown Corpus (as of 11/1992)
POS Tags (Tokens): 1,172,041
Skeletal Parsing (Tokens): 1,172,041

You know you’re in trouble when …
Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”

Tree Conversion: Clean Case
( END_OF_TEXT_UNIT )
( (`` ``) (S (NP (PRP I) ) (VP (VBP leave) (NP (DT this) (NN church) ) (PP (IN with) (NP (DT a) (NN feeling) (SBAR (IN that) (S (NP (DT a) (JJ great) (NN weight) ) (AUX (VBZ has) ) (VP (VBN been) (VP (VBN lifted) (PP (IN off) (NP (PRP$ my) (NN heart) )))))))))) (, ,) (S (NP (PRP I) ) (AUX (VBP have) ) (VP (VP (VBN left) (NP (PRP$ my) (NN grudge) ) (PP (IN at) (NP (DT the) (NN altar) ))) (CC and) (VP (VBN forgiven) (NP (PRP$ my) (NN neighbor) ))))) ('' '') (. .) )
( END_OF_TEXT_UNIT )
cb08_42
``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
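A conversion of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used in the presentation: the names `leaves` and `to_text`, the attachment rules, and the shortened sample tree are all hypothetical. The sketch collects every `(TAG word)` leaf and then joins the words, attaching punctuation the way the output line above does.

```python
import re

# A token is an opening paren, a closing paren, or a run of other non-space characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")
PARENS = {"(", ")"}

def leaves(bracketed):
    """Return (tag, word) pairs for every leaf of the form '(TAG word)'."""
    toks = TOKEN.findall(bracketed)
    pairs = []
    for i in range(len(toks) - 3):
        a, tag, word, b = toks[i:i + 4]
        if a == "(" and b == ")" and tag not in PARENS and word not in PARENS:
            pairs.append((tag, word))
    return pairs

def to_text(pairs):
    """Join leaf words; the attachment rules below are illustrative assumptions."""
    out = ""
    for tag, word in pairs:
        attach_left = tag in {",", ".", ":", "''", "POS"}
        attach_to_prev = out.endswith(("``", "$"))
        if out and not attach_left and not attach_to_prev:
            out += " "
        out += word
    return out

# Shortened sample in the same bracketing style (not a full corpus sentence).
sample = "( (`` ``) (S (NP (PRP I) ) (VP (VBP leave) (NP (DT this) (NN church) ))) ('' '') (. .) )"
print(to_text(leaves(sample)))
```

Markers such as `( END_OF_TEXT_UNIT )` contain a single item between the parentheses, so the four-token window never treats them as leaves.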

Tree Conversion: Problematic Case
( (S (NP (PRP He) ) (VP (VBD reported) (SBAR (IN that) (S (NP (NP (DT the) (NN city) ) (POS 's) (NNS contributions) (PP (IN for) (NP (NN animal) (NN care) ))) (VP (VBD included) (NP (NP ($ $) (CD 67,000) (PP (TO to) (NP (NP (DT the) (NNS Women) ) (POS 's) (NN S.P.C.A.) ))) (: ;) (: ;) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB pay) (NP (NP (CD six) (NNS policemen) ) (VP (VBN assigned) (PP (IN as) (NP (NN dog) (NNS catchers) ))))))) (CC and) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB investigate) (NP (NN dog) (NNS bites) )))))))))) (. .) )
( END_OF_TEXT_UNIT )
ca09_46
He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
(NP (DT the) (NNS Women) ) (POS 's) (NN S.P.C.A.) ))) (: ;) (: ;) (NP (NP ($ $) (CD 15,000) ) (S (NP (-NONE- T) ) (AUX (TO to) ) (VP (VB pay) (NP (NP (CD six) (NNS policemen) )
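The doubled `(: ;) (: ;)` leaf is the kind of error a simple check can flag. The sketch below is a hypothetical cleanup pass, not part of the original conversion: it assumes the leaves are already available as (tag, word) pairs, and the name `drop_repeated_punct` and the tag set are illustrative choices.

```python
# Punctuation tags considered for the duplicate check (an assumed, partial set).
PUNCT_TAGS = {",", ".", ":", "``", "''"}

def drop_repeated_punct(pairs):
    """Drop a punctuation leaf that exactly repeats the previous leaf.

    Returns the cleaned list and the original indices that were dropped,
    so a person can review each case instead of trusting the rule blindly.
    """
    cleaned, flagged = [], []
    for i, pair in enumerate(pairs):
        if cleaned and pair == cleaned[-1] and pair[0] in PUNCT_TAGS:
            flagged.append(i)
            continue
        cleaned.append(pair)
    return cleaned, flagged

pairs = [("NN", "S.P.C.A."), (":", ";"), (":", ";"), ("$", "$"), ("CD", "15,000")]
cleaned, flagged = drop_repeated_punct(pairs)
print(cleaned, flagged)
```

Returning the flagged positions keeps the decision reviewable, since some repeated punctuation in source text may be intentional.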

Summary of Problems Encountered
Typing Errors
–Punctuation duplication in data
Special notation for delimiter characters
–RRB, LRB, RSB, LSB, RCB, LCB
Special Null Elements
–( -NONE- ) * 0 T NIL **
Conventions for final output need to consider these lessons
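Two of these problems, bracket notation and null elements, can be handled in one pass over the leaves. The following Python sketch is an assumption-laden illustration: it takes (tag, word) pairs as input, assumes the bracket names may appear either bare (RRB) or wrapped in hyphens (-RRB-), and identifies null elements by the -NONE- tag alone. The function name `normalize_leaves` is hypothetical.

```python
# Bracket names from the list above, mapped to the characters they stand for.
BRACKETS = {
    "LRB": "(", "RRB": ")",
    "LSB": "[", "RSB": "]",
    "LCB": "{", "RCB": "}",
}

def normalize_leaves(pairs):
    """Drop null elements and restore bracket characters in leaf words."""
    result = []
    for tag, word in pairs:
        if tag == "-NONE-":
            # Null element (trace or empty category): contributes no text.
            continue
        # Assumption: a bracket name may be written bare or as -NAME-.
        result.append((tag, BRACKETS.get(word.strip("-"), word)))
    return result

pairs = [("-LRB-", "-LRB-"), ("NN", "dog"), ("-RRB-", "-RRB-"),
         ("-NONE-", "T"), ("TO", "to")]
print(normalize_leaves(pairs))
```

The tag is left unchanged on purpose, so later steps can still tell that a parenthesis came from a bracket notation token.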

Future Recommendations
Put POS tree data into a proper database
–Increases confidence in correctness of data
–Minimizes error
Spend more effort upfront *once* to clean data
SQL queries more reusable than (write-only) Perl scripts
–Due to random graduate student ability
If DB option not available
–Avoid duplication of data in final output
–Avoid text delimiters that exist as data tokens (“ ‘, \s )
–Use thoughtful labeling conventions
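The database recommendation could look roughly like the sketch below, which uses Python's standard sqlite3 module. The schema is an assumption for illustration only: the table name `token`, its columns, and the helper names are hypothetical, and the table stores only the leaves (tag and word per position), not the full tree structure.

```python
import sqlite3

def build_db(sentences):
    """Load {sentence_id: [(tag, word), ...]} into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE token (
               sent_id  TEXT    NOT NULL,
               position INTEGER NOT NULL,
               tag      TEXT    NOT NULL,
               word     TEXT    NOT NULL,
               PRIMARY KEY (sent_id, position)
           )"""
    )
    rows = [
        (sent_id, i, tag, word)
        for sent_id, pairs in sentences.items()
        for i, (tag, word) in enumerate(pairs)
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)
    return conn

def sentence_text(conn, sent_id):
    """Rebuild a sentence as space-separated words, skipping null elements."""
    cur = conn.execute(
        "SELECT word FROM token WHERE sent_id = ? AND tag <> '-NONE-' "
        "ORDER BY position",
        (sent_id,),
    )
    return " ".join(word for (word,) in cur)

conn = build_db({"ca09_46": [("PRP", "He"), ("VBD", "reported"), ("-NONE-", "T")]})
print(sentence_text(conn, "ca09_46"))
```

Because each token is stored once with its sentence ID and position, the text and the tags are never duplicated, and the field boundaries come from the schema rather than from delimiter characters that might also occur as data.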