LING/C SC/PSYC 438/538 Lecture 27 Sandiway Fong

Administrivia
2nd Reminder – 538 Presentations: send me your choices if you haven’t already.

english26.pl
[Background: chapter 12 of JM contains many grammar rules]
Subject of passive in by-phrase
– the sandwich was eaten by John
Questions
– Did John eat the sandwich?
– Is the sandwich eaten by John?
– Was John eating the sandwich?
– Who ate the sandwich?
– What did John eat? (do-support)
– Which sandwich did John eat?
– Why did John eat the sandwich?

english26.pl
Displacement rule:
– was John eating the sandwich? is analyzed as was John [was] eating the sandwich, i.e. the fronted auxiliary is linked to a gap inside the VP
– schematically: Ax [ NP ] [ VP Ax ]

english26.pl
Yes-no question (without aux inversion): blocked by the {Ending=root} constraint.

english26.pl
Passives and progressives: nested constructions
– [Passive [Progressive] ]
– [Progressive [Passive] ]

english26.pl
Nesting order forced by rule chaining:
– progressive → passive → VP_nptrace
– passive → VP_nptrace
– passive → progressive …

Homework 5
English grammar programming: use english26.pl. Add rules to handle the following sentences.
Part 1: Raising verbs
– John seems to be happy
– It seems that John is happy
– It seems that John be happy
– John seems is happy
Part 2: PP attachment ambiguity
– I saw the boy with a telescope is ambiguous between two readings
– your grammar should produce both parses
Part 3: Recursion and relative clauses
– I recognized the man
– I recognized the man who recognized you
– I recognized the man who recognized the woman who you recognized
Explain your parse trees. Submit your grammar and runs.

Why can’t computers use English?
Context:
– a linguist’s view: a list of examples that are hard for computers to do
– a computational linguist’s view (mine): these actually aren’t very hard at all... armed with some DCG technology, we can easily write a grammar that makes the distinctions outlined in the pamphlet
– you could easily write a grammar for these examples
Online parsers: the Berkeley parser and the Stanford parser, trained on the Penn Treebank

If computers are so smart, why can't they use simple English?
Consider, for instance, the four letters read; they can be pronounced as either reed or red. How does the machine know in each case which is the correct pronunciation? Suppose it comes across the following sentences:
(1) The girls will read the paper. (reed)
(2) The girls have read the paper. (red)
We might program the machine to pronounce read as reed if it comes right after will, and red if it comes right after have. But then sentences (3) through (5) would cause trouble.
(3) Will the girls read the paper? (reed)
(4) Have any men of good will read the paper? (red)
(5) Have the executors of the will read the paper? (red)
How can we program the machine to make this come out right?

If computers are so smart, why can't they use simple English?
(6) Have the girls who will be on vacation next week read the paper yet? (red)
(7) Please have the girls read the paper. (reed)
(8) Have the girls read the paper? (red)
Sentence (6) contains both have and will before read, and both of them are auxiliary verbs. But will modifies be, and have modifies read. In order to match up the verbs with their auxiliaries, the machine needs to know that the girls who will be on vacation next week is a separate phrase inside the sentence. In sentence (7), have is not an auxiliary verb at all, but a main verb that means something like 'cause' or 'bring about'. To get the pronunciation right, the machine would have to be able to recognize the difference between a command like (7) and the very similar question in (8), which requires the pronunciation red.

Example (5) Have the executors of the will read the paper? (red)

Treebanks
Treebank
– a corpus of sentences
– each sentence has been parsed: POS tags assigned, plus labels for phrases
A treebank is also a grammar
– we can extract the rules, and also frequency counts
A consistently labeled treebank might be called a “grammatical theory”
Most popular treebank: the Penn Treebank
– available on CD from the UA Library (search the catalog)
– particularly the Wall Street Journal (WSJ) section: 50,000 sentences
– used for training stochastic context-free grammars (PCFGs)
– results: around the 90% mark on bracketed precision/recall
– also contains traces and indices (typically not used for PCFGs)
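
The slide's point that a treebank "is also a grammar" can be made concrete: each internal node of a parsed sentence yields one context-free rule, and tallying nodes gives the frequency counts a PCFG needs. A minimal illustration in Python (the bracket reader and rule counter below are my own sketch, not course-supplied code):

```python
from collections import Counter

def tokenize(s):
    return s.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    """Read one (label child ...) bracketing; returns (label, children),
    where children are subtrees or plain word strings."""
    assert tokens.pop(0) == '('
    label = tokens.pop(0)
    children = []
    while tokens[0] != ')':
        children.append(parse(tokens) if tokens[0] == '(' else tokens.pop(0))
    tokens.pop(0)  # discard ')'
    return (label, children)

def count_rules(tree, counts):
    """One CFG rule per internal node: label -> child labels."""
    label, children = tree
    if all(isinstance(c, tuple) for c in children):  # skip preterminals
        counts[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            count_rules(c, counts)

counts = Counter()
count_rules(parse(tokenize("(S (NP (DT the) (NN board)) (VP (VBD met)))")), counts)
for (lhs, rhs), n in sorted(counts.items()):
    print(lhs, '->', ' '.join(rhs), n)
```

Run over all ~50,000 WSJ trees, the relative frequencies of these counts per left-hand side are exactly the PCFG rule probabilities.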

Penn Treebank
What is in it? (v3) Four parsed sections:
– one million words of 1989 Wall Street Journal (WSJ) material
– ATIS-3 sample
– Switchboard
– Brown Corpus
Example: wsj_001.mrg
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
In the NLP literature, “Penn Treebank” usually refers to the WSJ section only.
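
The sentence shown under the tree is just the terminal string of the bracketing; converting .mrg data back to text can be approximated by collecting the (TAG word) leaf pairs. A rough regex sketch (it assumes tags and words never contain spaces or parentheses, which holds for ordinary WSJ leaves):

```python
import re

def leaves(mrg):
    """Collect terminal words from a Penn-style bracketing,
    skipping empty categories tagged -NONE-."""
    return [w for tag, w in re.findall(r'\(([^\s()]+) ([^\s()]+)\)', mrg)
            if tag != '-NONE-']

tree = ("( (S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
        "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .)))")
print(' '.join(leaves(tree)))   # Pierre Vinken will join the board .
```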

Penn Treebank
What is in it?
– Part-of-speech (POS) labels on words, numbers and punctuation using the 48-tag Penn tagset (a simplification of the 1982 Francis & Kucera Brown corpus tagset), e.g. NN, VB, IN, JJ
– Constituents identified and labeled with syntactic categories, e.g. S, NP, VP, PP
– Additional sublabels to facilitate predicate-argument extraction, e.g. -SBJ, -CLR, -TMP
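
Because PCFG work typically trains on the bare syntactic categories, the function sublabels and coindexation are usually stripped during preprocessing. A small sketch of that label cleanup (the exact splitting convention here is my assumption about the label syntax):

```python
def bare_label(label):
    """NP-SBJ-1 -> NP, PP-CLR -> PP, NP=2 -> NP; leave
    leading-hyphen labels such as -NONE- untouched."""
    if label.startswith('-'):
        return label
    return label.split('-')[0].split('=')[0]

print(bare_label('NP-SBJ-1'), bare_label('PP-CLR'), bare_label('-NONE-'))
```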

Penn Treebank
The WSJ section of the Penn Treebank has become the standard training corpus and testbed for statistical NLP.
Other Penn treebanks: Arabic, Chinese and Korean.
Other formalisms:
– Combinatory Categorial Grammar (CCG) treebank
– Dependency grammar
– a list of about 50 treebanks in 29 languages

Penn Treebank
The formalism chosen (sorta) matters:
– Penn Treebank includes empty categories (ECs), including traces
– CCG has slash categories
– Dependency grammar-based treebanks don’t (and also don’t have node labels)
Example: wsj_100.mrg (excerpt; continued on the next slide)
( (S (NP-SBJ (NNP Nekoosa) ) (VP (VBZ has) (VP (VBN given) (NP (DT the) (NN offer) ) (NP (NP (DT a) (JJ public) (JJ cold) (NN shoulder) ) (, ,) (NP (NP (DT a) (NN reaction) ) (SBAR (WHNP-2 (-NONE- 0) ) (S (NP-SBJ (NNP Mr.) (NNP Hahn) ) (VP (VBZ has) (RB n't) (VP (VBN faced) (NP (-NONE- *T*-2) ) …

Penn Treebank
Example: wsj_100.mrg
( (S (NP-SBJ (NNP Nekoosa) ) (VP (VBZ has) (VP (VBN given) (NP (DT the) (NN offer) ) (NP (NP (DT a) (JJ public) (JJ cold) (NN shoulder) ) (, ,) (NP (NP (DT a) (NN reaction) ) (SBAR (WHNP-2 (-NONE- 0) ) (S (NP-SBJ (NNP Mr.) (NNP Hahn) ) (VP (VBZ has) (RB n't) (VP (VBN faced) (NP (-NONE- *T*-2) (PP-LOC (IN in) (NP (NP (PRP$ his) (CD 18) (JJR earlier) (NNS acquisitions) ) (, ,) (SBAR (WHNP-3 (DT all) (WHPP (IN of) (WHNP (WDT which) ))) (S (NP-SBJ-1 (-NONE- *T*-3) ) (VP (VBD were) (VP (VBN negotiated) (NP (-NONE- *-1) ) (PP-LOC (IN behind) (NP (DT the) (NNS scenes) )))))))))))))))) (. .) ))

Penn Treebank
The formalism chosen (sorta) matters:
– Penn Treebank includes empty categories, including traces
It is standard in the statistical NLP literature to first discard all the empty category information
– both for training and evaluation
– some exceptions: Collins Model 3; post-processing to re-insert ECs
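
Discarding the empty-category information, as described above, comes down to deleting every -NONE- leaf and then pruning any node left without children. A sketch over (label, children) tuples (this in-memory tree encoding is mine, not the Treebank's distribution format):

```python
def strip_empty(tree):
    """Drop -NONE- preterminals; prune nodes left with no children."""
    label, children = tree
    kept = []
    for c in children:
        if isinstance(c, str):            # a terminal word: keep
            kept.append(c)
        elif c[0] == '-NONE-':            # empty category: drop
            continue
        else:
            sub = strip_empty(c)
            if sub is not None:
                kept.append(sub)
    return (label, kept) if kept else None

# (SBAR (WHNP-2 (-NONE- 0)) (S ...)) loses the whole WHNP-2 node:
t = ('SBAR', [('WHNP-2', [('-NONE-', ['0'])]),
              ('S', [('NP-SBJ', [('NNP', ['Mr.']), ('NNP', ['Hahn'])])])])
print(strip_empty(t))
```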

Penn Treebank
How is it used?
– one million words of 1989 Wall Street Journal (WSJ) material
– nearly 50,000 sentences (49,208), divided into 25 sections (0–24)
– sections 2–21 contain 39,832 sentences
– section 23 (2,416 sentences) is held out for evaluation
Standard practice: train on sections 2–21, evaluate on section 23.
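
The train/evaluation split above is usually realized by filing each tree according to the two-digit section number embedded in its filename; a sketch assuming the usual wsj_SSNN.mrg naming:

```python
def section(fname):
    """wsj_2354.mrg -> 23 (assumes the standard naming scheme)."""
    return int(fname.split('_')[1][:2])

TRAIN = set(range(2, 22))   # sections 02-21: training
EVAL = {23}                 # section 23: held-out evaluation

print(section('wsj_0231.mrg') in TRAIN)   # True
print(section('wsj_2354.mrg') in EVAL)    # True
```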

Treebank Software
Tgrep2 by Doug Rohde
– download and install for Linux (pre-compiled; works without compilation on your Linux if you’re lucky)
– for Mac OS X, just re-compile (you will also need the DRUtils library)
– described in the textbook
– works on the command line
Java package: Tregex from Stanford
– broadly compatible with Tgrep2
– regex.shtml
– jar file (should run on all platforms)
– has a graphical user interface
– file run-tregex-gui.bat (batch file for Windows): in that file, set max memory to 500m (or larger) to use with the entire treebank
Also TIGERsearch
– stuttgart.de/projekte/TIGER/TIGERSearch/
– Windows explicitly supported