LING/C SC 581: Advanced Computational Linguistics Lecture Notes, Feb 5th


Today's Topics
A note on Java
Tregex homework discussion
Treebanks and statistical parsers
Homework
– One more exercise on tregex
– Install the Bikel-Collins Parser

Java and tregex on OSX
If you're running Mavericks (10.9), Java 7 runs by default, but the latest tregex (3.5.1) requires Java 8. Launching it under Java 7 fails with:
– Unsupported major.minor version 52.0
Solution:
– Install Java 8 (JRE) directly from Oracle
– It installs in /Library/Internet\ Plug-Ins/
– Modify the path to java in run-tregex-gui.command as follows:
#!/bin/sh
/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -mx300m -cp `dirname $0`/stanford-tregex.jar edu.stanford.nlp.trees.tregex.gui.TregexGUI
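The number in "Unsupported major.minor version 52.0" is the class-file format version, and major version 52 corresponds to Java 8. As a rough illustration, the following Python sketch reads that version from the first eight bytes of a .class file. The function names are made up for this note and are not part of tregex or the JDK.

```python
import struct


def class_file_version(header: bytes) -> tuple[int, int]:
    """Return (major, minor) from the first 8 bytes of a Java .class file."""
    # Layout: u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version, big-endian.
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


def required_java(major: int) -> int:
    """Map a class-file major version to a Java release (holds for Java 5 and later)."""
    return major - 44  # 51 -> Java 7, 52 -> Java 8


if __name__ == "__main__":
    # Synthetic header standing in for a class compiled for Java 8.
    header = struct.pack(">IHH", 0xCAFEBABE, 0, 52)
    major, minor = class_file_version(header)
    print(f"class file version {major}.{minor} needs Java {required_java(major)} or later")
```

To inspect a real file, pass `open(path, "rb").read(8)` as the header.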

Homework Discussion
A useful command-line tool: diff
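One way to use the same idea from a script: if the sentence IDs matched by two tregex queries are saved as lists, a diff shows which matches a tighter query lost. The sketch below uses Python's standard difflib. The two lists are invented for illustration and are not actual query output.

```python
import difflib


def diff_matches(old: list[str], new: list[str]) -> list[str]:
    """Return unified-diff lines showing matches lost (-) or gained (+)."""
    return list(
        difflib.unified_diff(old, new, fromfile="query1", tofile="query2", lineterm="")
    )


if __name__ == "__main__":
    # Hypothetical match lists, one sentence ID per entry.
    query1 = ["wsj_0267.mrg-30", "wsj_0591.mrg-21"]
    query2 = ["wsj_0267.mrg-30"]
    for line in diff_matches(query1, query2):
        print(line)
```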

Homework Discussion
Page 268 of PRSGUID1.PDF: the functional tag -CLF indicates a true cleft

Homework Discussion
Wish:
– every construction was marked with its own tag
So:
– -CLF looks easy …

Homework Discussion
Types: that is sometimes WHNP

Homework Discussion
There are no SQ-CLF and SINV-CLF … Gapping?

Homework Discussion
Search conservatively…
– 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP <
– 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WHNP/))
Example (wsj_0267.mrg-30): It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1.

Homework Discussion
Search conservatively…
– 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP <
– 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WH(ADV|NP)/))
Example (wsj_0591.mrg-21): It is partly for this reason that the exchange last week began *-3 trading in its own stock `` basket '' product that *T*-2 allows big investors to buy or sell all 500 stocks in the Standard & Poor 's index in a single trade.
– 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /IN|(WH(ADV|NP))/))

Homework Discussion
Example (wsj_1154.mrg-4): It isn't every day that we hear a Violetta who *T*-1 can sing the first act's high-flying music with all the little notes perfectly pitched *-2 and neatly stitched *-2 together.
No promised temporal trace …

Homework Discussion
– 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WH(ADV|NP)-[0-9]+/))
– 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WH(ADV|NP).*-.*[0-9]+/))
Example (wsj_0267.mrg-30): It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1.
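The jump from 41 to 57 matches comes from the label regex alone. A small Python sketch makes the difference visible, assuming (as the Tregex documentation describes) that a /regex/ node description matches when the regex is found anywhere in the node label. The sample labels are chosen for illustration.

```python
import re

# The two node-label regexes from the queries above.
STRICT = re.compile(r"WH(ADV|NP)-[0-9]+")
LOOSE = re.compile(r"WH(ADV|NP).*-.*[0-9]+")


def matching(pattern: re.Pattern, labels: list[str]) -> list[str]:
    """Return the labels in which the pattern is found (substring search)."""
    return [label for label in labels if pattern.search(label)]


# Illustrative node labels, not drawn from a particular tree.
LABELS = ["WHNP-1", "WHADVP-2", "WHNP", "NP-SBJ-1"]

if __name__ == "__main__":
    print("strict:", matching(STRICT, LABELS))
    print("loose: ", matching(LOOSE, LABELS))
```

The strict pattern requires the index to follow WHNP or WHADV immediately, so it skips WHADVP-2, where a P intervenes. The loose pattern allows extra characters before and after the hyphen and therefore also accepts WHADVP labels.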

Homework Discussion
– 43: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WH(ADV|NP).*-.*([0-9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/))))
– 39: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < < /WH(ADV|NP).*-.*([0-9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/#1%i))))
Example (wsj_1655.mrg-4): Still, it was in Argentine editions that his countrymen first read his story of Pascal Duarte, a field worker who *T*-1 stabbed his mother to death and has no regrets as he awaits his end in a prison cell *T*-4
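In the second query, #2%i binds capture group 2 of the WH label to a variable i, and #1%i requires capture group 1 of the trace to carry the same value, so the trace must be coindexed with the WH phrase. The Python sketch below mimics only that agreement check with ordinary capture groups. The helper name is hypothetical and the tree-structure part of the query is not modeled.

```python
import re

WH = re.compile(r"WH(ADV|NP).*-.*([0-9]+)")  # group 2 holds the index digit(s)
TRACE = re.compile(r"\*T.*([0-9]+)")         # group 1 holds the index digit(s)


def indices_agree(wh_label: str, trace_label: str) -> bool:
    """True when the WH label and the trace capture the same index string."""
    wh = WH.search(wh_label)
    trace = TRACE.search(trace_label)
    if wh is None or trace is None:
        return False
    return wh.group(2) == trace.group(1)


if __name__ == "__main__":
    print(indices_agree("WHADVP-4", "*T*-4"))  # coindexed pair
    print(indices_agree("WHNP-1", "*T*-4"))    # different indices
```

Because `.*` is greedy, the final group captures only the last digit of a multi-digit index under these regex semantics, which is worth keeping in mind when indices reach 10 or more.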

Homework Discussion Wh-clefts

Homework Discussion
Wh-clefts:
– 28: /SBAR-NOM/ << /.*-PRD/
Example (wsj_0415.mrg-5): Who that winner will be *T*-1 is highly uncertain.

Relevance of Treebanks
Statistical parsers typically construct syntactic phrase structure
– they’re trained on Treebank corpora like the Penn Treebank
Note: some use dependency graphs, not trees

Parsers trained on the Treebank
They don’t recover fully annotated trees:
– not trained using nodes with indices or empty (-NONE-) nodes
– not trained using functional tags, e.g. -SBJ
Therefore they don’t fully parse
Example (Stanford parser): no SBAR node in … a movie to see

Parsers trained on the Treebank SBAR can be forced by the presence of an overt relative pronoun, but note there is no subject gap:

Parsers trained on the Treebank
Probabilities are estimated from frequency information for each node given its surrounding context (e.g. the parent node, or the word that heads the node)
Still, these systems have enormous problems with prepositional phrase (PP) attachment
Example (borrowed from Igor Malioutov):
– A boy with a telescope kissed Mary on the lips
– Mary was kissed by a boy with a telescope on the lips
The PP with a telescope should adjoin to the noun phrase (NP) a boy
The PP on the lips should adjoin to the verb phrase (VP) headed by kiss

Active/passive sentences
Examples using the Stanford Parser: both the active and the passive sentence are parsed incorrectly

Active/passive sentences
Examples:
– X on the lips modifies Mary
– X on the lips modifies telescope

Homework Exercise
Use tregex to find out how many passive sentences there are in the WSJ section of the Treebank
Report your search formula and frequency count
The passive construction (according to the Bracketing Guidelines)
– Note: the by-phrase containing the logical subject (LGS) is optional

Treebank Rules
Just how many rules are there in the WSJ treebank?
What’s the most common POS tag?
What’s the most common syntax rule?
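For the POS tag question, one simple approach is to count preterminal nodes of the form (TAG word) in the bracketed trees. The sketch below does this with a regular expression. The sample tree is made up, and skipping -NONE- nodes is a choice made here rather than something the slides prescribe.

```python
import re
from collections import Counter

# A preterminal is "(TAG word)" with no nested brackets inside.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def count_tags(bracketed: str) -> Counter:
    """Count POS tags in a bracketed tree string, ignoring empty (-NONE-) nodes."""
    return Counter(
        tag for tag, _word in PRETERMINAL.findall(bracketed) if tag != "-NONE-"
    )


# Illustrative tree, not taken from the WSJ corpus.
SAMPLE = "(S (NP-SBJ (DT The) (NN parser)) (VP (VBZ builds) (NP (DT a) (NN tree))) (. .))"

if __name__ == "__main__":
    counts = count_tags(SAMPLE)
    print(sum(counts.values()), "tags in total")
    print(counts.most_common(3))
```

Run over the concatenated .mrg files, `sum(counts.values())` gives a total tag count and `most_common` answers the frequency question.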

Treebank Total # of tags: 1,253,013

Treebank
Table: Category | Frequency | Percentage, with rows for N and V (values not preserved in this transcript)

Treebank Total # of rules: 978,873 # of different rules: 31,338

Treebank

Total # of rules: 978,873 # of different rules: 17,554

Treebank

Today’s Topic
Let’s segue from Treebank search to stochastic parsers trained on the WSJ Penn Treebank
Examples:
Berkeley Parser – arser/parser.html
Stanford Parser
These are all trained on the Treebank. We’ll play with Bikel’s implementation of Collins’s Parser …

Using the Treebank
What is the grammar of the Treebank?
– We can extract the phrase structure rules used,
– count the frequency of each rule, and
– construct a stochastic parser
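The three steps above can be sketched in a few lines of Python: read bracketed trees, collect one rule per internal node, and turn counts into relative-frequency estimates P(A → β) = count(A → β) / count(A). This is a minimal sketch and not the procedure any particular parser uses. It assumes every tree has a labelled root, whereas real .mrg files wrap each tree in an extra unlabelled pair of parentheses that would need stripping first.

```python
import re
from collections import Counter


def parse(bracketed):
    """Parse "(LABEL child ...)" into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def rules(tree):
    """Yield (lhs, rhs) for every node that has subtree children."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:                      # preterminals (TAG word) are skipped
        yield (label, tuple(c[0] for c in subtrees))
        for c in subtrees:
            yield from rules(c)


def rule_probabilities(trees):
    """Relative-frequency estimate for each rule given its left-hand side."""
    counts = Counter(r for t in trees for r in rules(parse(t)))
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {r: n / lhs_totals[r[0]] for r, n in counts.items()}


if __name__ == "__main__":
    # Two toy trees, invented for illustration.
    toy = [
        "(S (NP (DT the) (NN boy)) (VP (VBD kissed) (NP (NNP Mary))))",
        "(S (NP (NNP Mary)) (VP (VBD smiled)))",
    ]
    for (lhs, rhs), p in sorted(rule_probabilities(toy).items()):
        print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```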

Using the Treebank
Breakthrough in parsing accuracy with lexicalized trees
– think of expanding the nonterminal names to include head information and the words that are at the leaves of the subtrees
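To make "expanding the nonterminal names" concrete, the sketch below annotates each node with the word that heads it, so S becomes S[kissed] and NP becomes NP[boy]. The head table is a toy one invented for this example and is not Collins' actual head-finding rules. Trees are written directly as (label, children) tuples.

```python
# Toy head table (hypothetical): for each parent, the child labels that can be its head.
HEAD_CHILD = {"S": ["VP"], "VP": ["VBD"], "NP": ["NN", "NNP"]}


def lexicalize(tree):
    """Return (annotated_tree, head_word) with labels rewritten as LABEL[head]."""
    label, children = tree
    if isinstance(children, str):            # preterminal: (TAG, word)
        return (f"{label}[{children}]", children), children
    results = [lexicalize(child) for child in children]
    head = results[-1][1]                    # fallback: head of the rightmost child
    for child, (_annotated, word) in zip(children, results):
        if child[0] in HEAD_CHILD.get(label, []):
            head = word                      # first child listed in the head table wins
            break
    return (f"{label}[{head}]", [annotated for annotated, _ in results]), head


# Illustrative tree for "the boy kissed Mary".
TREE = ("S", [
    ("NP", [("DT", "the"), ("NN", "boy")]),
    ("VP", [("VBD", "kissed"), ("NP", [("NNP", "Mary")])]),
])

if __name__ == "__main__":
    annotated, head = lexicalize(TREE)
    print(annotated)
```

With labels like VP[kissed], rule counts become sensitive to the head word, at the cost of much sparser statistics.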

Bikel Collins Parser
Java re-implementation of Collins’ parser
Paper
– Daniel M. Bikel. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4)
Software
– at-parser (page no longer exists)

Bikel Collins
Download and install Dan Bikel’s parser
– dbp.zip (on course homepage)
File: install.sh
– Java code
– but at this point I think Windows won’t work because of the shell script (.sh)
– maybe after files are extracted?

Bikel Collins
Download and install the POS tagger MXPOST
The parser doesn’t actually need a separate tagger…

Bikel Collins
Training the parser with the WSJ PTB
See guide – userguide/guide.pdf
directory: TREEBANK_3/parsed/mrg/wsj
chapters 02-21: create one single .mrg file
events: wsj obj.gz
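Creating the single .mrg training file amounts to concatenating every .mrg file in sections 02 through 21. A minimal Python sketch follows, assuming each section is a two-digit subdirectory of the wsj directory. The function name and the output file name in the usage note are placeholders, not names the parser requires.

```python
from pathlib import Path


def concat_sections(wsj_dir, out_path, first=2, last=21):
    """Concatenate all .mrg files in sections first..last into one file.

    Returns the number of files written. Sections are assumed to be
    two-digit subdirectories (02, 03, ...) of wsj_dir.
    """
    wsj_dir = Path(wsj_dir)
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for section in range(first, last + 1):
            section_dir = wsj_dir / f"{section:02d}"
            for mrg in sorted(section_dir.glob("*.mrg")):
                out.write(mrg.read_text(encoding="utf-8"))
                out.write("\n")
                written += 1
    return written
```

For example, `concat_sections("TREEBANK_3/parsed/mrg/wsj", "wsj-02-21.mrg")` would write the combined training file to the current directory.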

Bikel Collins Settings:

Bikel Collins
Parsing
– Command
– Input file format (sentences)

Bikel Collins Verify the trainer and parser work on your machine

Bikel Collins File: bin/parse is a shell script that sets up program parameters and calls java

Bikel Collins

File: bin/train is another shell script

Bikel Collins Relevant WSJ PTB files