CS 730: Text Mining for Social Media & Collaboratively Generated Content Lecture 3: Parsing and Chunking.


1 CS 730: Text Mining for Social Media & Collaboratively Generated Content
Lecture 3: Parsing and Chunking

2 “Big picture” of the course
Language models (word, n-gram, …); classification and sequence models; WSD, part-of-speech tagging; syntactic parsing and tagging. Next week: semantics. The week after: information extraction + text mining intro. Then fall break. Part II: social media (research papers start).

3 Today’s Lecture Plan
Phrase chunking
Syntactic parsing
Machine learning-based syntactic parsing

4 Phrase Chunking

5 Why Chunking/Parsing?

6 Phrase Structure (continued)

7 Types of Phrases
Phrases: classify by part of speech of the main word, or by syntactic role (subject and predicate; noun phrase and verb phrase). In "The young cats drink milk," "the young cats" is a noun phrase and the subject; "drink milk" is a verb phrase and the predicate. The main word is the head of the phrase: "cats" in "the young cats".
Verb complements and modifiers. Types of complements: noun phrases, adjective phrases, prepositional phrases, particles:
- noun phrase: I served a brownie.
- adjective phrase: I remained very rich.
- prepositional phrase: I looked at Fred.
- particle: He looked up the number.
Clauses; clausal complements: I dreamt that I won a million brownies.
Tenses: simple past, present, future; progressive, perfect:
- simple present: John bakes cookies.
- present progressive: John is baking cookies.
- present perfect: John has baked cookies.
Active vs. passive:
- active: Bernie ate the banana.
- passive: The banana was eaten by Bernie.

8 Noun Phrase Structure
Left modifiers: determiner, quantifier, adjective, noun: the five shiny tin cans
Right modifiers: prepositional phrases and apposition:
- prepositional phrase: the man in the moon
- apposition: Scott, the Arctic explorer
Relative clauses: the man who ate the popcorn; the popcorn which the man ate; the man who is eating the popcorn; the tourist who was eaten by a lion
Reduced relative clauses: the man eating the popcorn; the man eaten by a lion

9 Attachment Ambiguities

10 Preliminaries: Constraint Grammars/CFGs

11 CFG (applying rewrite rules)

12 Preliminaries: CFG (continued)

13 Parsing

14 Human parsing

15 Chunking (from Abney 1994)
“I begin with an intuition: when I read a sentence, I read it a chunk at a time.” This breaks up into something like: [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]
Chunks correspond to prosodic patterns: the strongest stresses in the sentence fall one to a chunk, and pauses are most likely to fall between chunks.
A typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template. A simple context-free grammar is often adequate to describe the structure of chunks.

16 Chunking (continued)
Text chunking subsumes a range of tasks. The simplest is finding noun groups or base NPs: non-recursive noun phrases up to the head (for English). More ambitious systems may add additional chunk types, such as verb groups.
Seek a complete partitioning of the sentence into chunks of different types:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $1.8 billion ] [PP in ] [NP September ] .
(Steve Abney, Parsing by Chunks)
The chunks are non-recursive structures which can be handled by finite-state methods (rather than requiring full CFGs).
Why do text chunking? Full parsing is expensive and not very robust. Partial parsing is much faster, more robust, and sufficient for many applications (IE, QA). It can also serve as a possible first step for full parsing.

17 Chunking: Rule-based
Quite high performance on NP chunking can be obtained with a small number of regular expressions (a minimal sketch follows below). With a larger rule set, using Constraint Grammar rules, Voutilainen reports recall of 98%+ with precision of 95-98% for noun chunks.
(Atro Voutilainen, NPtool, a Detector of English Noun Phrases, WVLC 93.)
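Below is a minimal illustrative sketch of the regular-expression idea; the pattern, tag set, and helper are simplified inventions, not Voutilainen's NPtool.

```python
import re

# Minimal sketch: find base noun groups by matching a regular expression
# over the POS-tag sequence (Penn-style tags assumed; a real system would
# use far richer patterns and also handle tags like NNP).
NP_PATTERN = re.compile(r"(DT\s)?(JJ\s)*(NNS?\s?)+")

def np_chunks(tagged):
    """tagged: list of (word, POS) pairs; returns noun-group word lists."""
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")            # token index of match start
        end = start + m.group(0).strip().count(" ") + 1
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

print(np_chunks([("the", "DT"), ("young", "JJ"), ("cats", "NNS"),
                 ("drink", "VBP"), ("milk", "NN")]))
# -> [['the', 'young', 'cats'], ['milk']]
```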

18 Why can chunking be difficult?
Two major sources of error (and these are also error sources for simple finite-state patterns for baseNP): participles and conjunction.
Whether a participle is part of a noun phrase will depend on the particular choice of words:
- He enjoys writing letters.
- He sells writing paper.
and is sometimes genuinely ambiguous:
- He enjoys baking potatoes.
- He has broken bottles in the basement.
The rules for conjoined NPs are complicated by the bracketing rules of the Penn Treebank. Conjoined prenominal nouns are generally treated as part of a single baseNP: "brick and mortar university" (with "brick and mortar" modifying "university"). Conjoined heads with shared modifiers are also treated as a single baseNP: "ripe apples and bananas". If the modifier is not shared, there are two baseNPs, as in "ripe apples and cinnamon" ([ripe apples] [cinnamon]). Modifier sharing, however, is hard for people to judge and is not consistently annotated.

19 Transformation-Based Learning for Chunking
Ramshaw & Marcus, Text Chunking using Transformation-Based Learning, WVLC 1995.
Adapted Brill's TBL method for POS tagging. One-level NP chunking is restated as a word tagging task, using 3 tags:
- I (inside a baseNP)
- O (outside a baseNP)
- B (the start of a baseNP which immediately follows another baseNP)
Initial tags are assigned based on the most likely tag for a given part of speech. The contexts for TBL rules: words, part-of-speech assignments, and prior IOB tags. (A small encoding sketch follows below.)
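As a concrete illustration, here is a small sketch of that IOB encoding; the helper and spans are made up for the example.

```python
# Sketch of the Ramshaw & Marcus IOB encoding: I = inside a baseNP,
# O = outside, B = first word of a baseNP that immediately follows another.
def to_iob(words, spans):
    """spans: list of non-overlapping (start, end) baseNP spans, end exclusive."""
    tags = ["O"] * len(words)
    prev_end = None
    for start, end in sorted(spans):
        tags[start] = "B" if start == prev_end else "I"
        for i in range(start + 1, end):
            tags[i] = "I"
        prev_end = end
    return list(zip(words, tags))

# baseNPs: [He] ... [the current account deficit]
print(to_iob(["He", "reckons", "the", "current", "account", "deficit"],
             [(0, 1), (2, 6)]))
# -> [('He', 'I'), ('reckons', 'O'), ('the', 'I'), ('current', 'I'),
#     ('account', 'I'), ('deficit', 'I')]
```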

20 TBL-based Chunking (2)
Ramshaw & Marcus, Text Chunking using Transformation-Based Learning, WVLC 1995.
Results can be scored based on the correct assignment of tags, or on recall and precision of complete baseNPs. The latter is normally used as the metric, since it corresponds to the actual objective; different tag sets can be used as an intermediate representation.
They obtained about 92% recall and precision for baseNPs, using 200K words of training. Without lexical information: 90.5% recall and precision.

21 Chunking: Classification-based
Classification task: NP or not NP?
The best performance on the base NP and chunking tasks was obtained using a Support Vector Machine method: Kudo & Matsumoto obtained an accuracy of 94.22% with the small data set of Ramshaw and Marcus, and 95.77% by training on almost the entire Penn Treebank. A schematic sketch of the classification setup follows below.
(Taku Kudo and Yuji Matsumoto, Chunking with Support Vector Machines, Proc. NAACL 01.)
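Kudo & Matsumoto's actual system uses pairwise SVMs with polynomial kernels; the snippet below is only a schematic sketch of per-token classification, assuming scikit-learn is available, with invented features and toy data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(sent, i):
    """Window features for token i of a [(word, POS), ...] sentence."""
    word, pos = sent[i]
    return {"w0": word.lower(), "p0": pos,
            "p-1": sent[i - 1][1] if i > 0 else "<S>",
            "p+1": sent[i + 1][1] if i + 1 < len(sent) else "</S>"}

# Toy training data: sentences with gold IOB tags (invented example).
train = [([("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"), ("deficit", "NN")],
          ["I", "O", "I", "I"])]
X = [token_features(s, i) for s, _ in train for i in range(len(s))]
y = [tag for _, tags in train for tag in tags]

chunker = make_pipeline(DictVectorizer(), LinearSVC())
chunker.fit(X, y)
test = [("the", "DT"), ("account", "NN")]
print(chunker.predict([token_features(test, i) for i in range(len(test))]))
```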

22 Hand-tuning vs. Machine Learning
BaseNP chunking is a task for which people (with some linguistics training) can write quite good rules quickly. This raises the practical question of whether we should be using machine learning at all. If there is already a large relevant resource, it makes sense to learn from it. However, if we have to develop a chunker for a new language, is it cheaper to annotate some data or to write the rules directly?
Ngai and Yarowsky addressed this question. They also considered selecting the data to be annotated: traditional training is based on sequential text annotation (we just annotate a series of documents in sequence). Can we do better?
(Ngai, G. and D. Yarowsky, Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking. ACL 2000.)

23 Active Learning
(Ngai & Yarowsky, ACL 2000)
Instead of annotating training examples sequentially, choose good examples. Usually, choose examples “on the boundary,” i.e., those for which the classifier has low confidence. This very often allows training to converge much faster than sequential/batch learning. Drawback: requires a user in the loop.

24 Active Learning (continued)
Ngai & Yarowsky, ACL 2000

25 Rule Writing vs. Active Learning
Ngai & Yarowsky, ACL 2000

26 Rule Writing vs. Annotation Learning
Ngai & Yarowsky, ACL 2000
Annotation:
- Can continue indefinitely
- Can combine efforts of multiple annotators
- More consistent results
- Accuracy can be improved by better learning algorithms
Rule writing:
- Must keep rule interactions in mind
- Difficult to combine rules from different experts
- Requires more skill
- Accuracy limited by the set of rules (will never improve)

27 The parsing problem
[Diagram: test sentences plus a grammar go into a PARSER; the output trees are compared against the correct test trees by a scorer to measure accuracy. Recent parsers are quite accurate (Eisner, Collins, Charniak, etc.).]
That's why parsing is hard; here's what a parser does. We feed it some sentences. For each sentence, it consults its grammar and produces a tree whose leaves are the sentence's words and whose structure is intended to be helpful. If you want to see how good the parser is, feed it some test sentences and, using your favorite scoring method, compare the output trees to an answer key of trees produced by a human. Suffice it to say that the recent crop of parsers is quite accurate: most sentences are parsed correctly, or with one error. (Eisner is me, and while today's talk isn't really about parsing ...)

28 Applications of parsing (1/2)
Machine translation (Alshawi 1996, Wu 1997, ...): English to Chinese via tree operations.
Speech synthesis from parses (Prevost 1996): The government plans to raise income tax. / The government plans to raise income tax the imagination.
Speech recognition using parsing (Chelba et al. 1998): Put the file in the folder. / Put the file and the folder.
So why is accurate parsing a good thing? Many hard natural language problems become somewhat easier once you can parse the input.
(MT) If you want to translate sentences, it helps to parse them first, because the manipulations that turn English into Chinese are more easily defined or learned over trees than over sentences.
(SS) If you want your machine to read text to you over the phone, say in Rex Harrison's voice with perfect elocution, then it had better be careful with the apparently similar strings above: they have very different intonation because they have very different parses. Computing intonation from the parse is much easier.
(SR) It has recently been demonstrated that parsing helps in speech recognition. Commercial speech recognizers will mix up the two sentences above, because they sound the same and each is locally well-formed within a three-word window. But the first transcription is much more likely for syntactic reasons. A recognizer that parses as it goes will prefer the first transcription.

29 Applications of parsing (2/2)
Grammar checking (Microsoft).
Indexing for information retrieval (Woods 1997): ... washing a car with a hose → vehicle maintenance.
Information extraction (Hobbs 1996): NY Times archive → database query.
(GC) Microsoft is doing a lot of work on parsing, in part to improve the grammar checker in MS Word.
(IR) Bill Woods at Sun has an information retrieval system that uses light parsing to get really great results in user tests. If your document says “washing a car with a hose,” it will pick out this phrase and construct a bunch of ways to index it, including “vehicle maintenance.”
(IE) Finally, there's so much information out there on the web. If we parse it, we should be able to read all kinds of useful information off the parse trees, both in general and for specific queries. Informative queries are easier to write over parse trees than over strings.

30 Parsing for the Turing Test
Most linguistic properties are defined over trees. One needs to parse to see subtle distinctions. E.g.:
- Sara dislikes criticism of her. (her ≠ Sara)
- Sara dislikes criticism of her by anyone. (her ≠ Sara)
- Sara dislikes anyone's criticism of her. (her = Sara or her ≠ Sara)
Finally, if you want to really understand language, deeply, you have to parse. Almost everything linguists care about is in the trees, not the words. People make subtle distinctions in language, and eventually machines must too. What do I mean by subtle distinctions? Look at the first sentence (stress "her" when reading it): “her” might refer to Sara's mom, or Sara's favorite student, but it can't refer to Sara; we'd have to say “herself.”

31 What makes a good grammar
Conjunctions must match:
- *I ate a hamburger and on the stove.
- *I ate a cold hot dog and well burned.
- *I ate the hot dog slowly and a hamburger.

32 Vanilla CFG not sufficient for NL
Number agreement: *a men
DET selection: *a apple
Tense, mood, etc. agreement
For now, let's see what it would take to parse English with a vanilla CFG.

33 Parsing re-defined

34 Revised CFG

35 In: cats scratch people with claws

36 Soundness and Completeness in Parsing

37 Top-Down Parsing
Top-down parsing is goal-directed. A top-down parser starts with a list of constituents to be built. It rewrites the goals in the goal list by matching one against the LHS of a grammar rule and expanding it with the RHS, attempting to match the sentence to be derived.
If a goal can be rewritten in several ways, then there is a choice of which rule to apply (a search problem). Can use depth-first or breadth-first search, and goal ordering.

38 Simple Top-down parsing algorithm
Start with the initial state ((S) 1) and no backup states.
1. Select the current state: take the first state off the possibilities list and call it C. If the possibilities list is empty, the algorithm fails (no successful parse is possible).
2. If C consists of an empty symbol list and the word position is at the end of the sentence, the algorithm succeeds.
3. Otherwise, generate the next possible states:
   - If the first symbol on the symbol list of C is a lexical symbol, and the next word in the sentence can be in that class, create a new state by removing the first symbol from the symbol list and updating the word position, and add it to the possibilities list.
   - Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new state for each rule in the grammar that can rewrite that nonterminal symbol, and add them all to the possibilities list.
(A runnable sketch of this loop follows below.)
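A minimal sketch, assuming a toy grammar and lexicon; depth-first, so new states go on the front of the possibilities list.

```python
# Toy grammar/lexicon for "the dogs cried"; state = (symbol list, position).
GRAMMAR = {"S": [["NP", "VP"]],
           "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
           "VP": [["V"], ["V", "NP"]]}
LEXICON = {"the": {"ART"}, "dogs": {"N", "V"}, "cried": {"V"}}

def top_down_parse(words):
    possibilities = [(("S",), 0)]
    while possibilities:
        symbols, pos = possibilities.pop(0)          # step 1: current state C
        if not symbols:
            if pos == len(words):                    # step 2: success test
                return True
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                         # step 3: expand non-terminal
            new_states = [(tuple(rhs) + rest, pos) for rhs in GRAMMAR[first]]
            possibilities = new_states + possibilities
        elif pos < len(words) and first in LEXICON.get(words[pos], set()):
            possibilities = [(rest, pos + 1)] + possibilities   # match a word
    return False                                     # possibilities exhausted

print(top_down_parse(["the", "dogs", "cried"]))      # -> True
```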

39 Top-down as search
For a depth-first strategy, the possibilities list is a stack: step 1 always takes the first element off the list, and step 3 always puts the new states on the front of the list, yielding a last-in first-out (LIFO) strategy.
In contrast, in a breadth-first strategy the possibilities list is manipulated as a queue: step 3 adds the new states onto the end of the list, rather than the beginning, yielding a first-in first-out (FIFO) strategy.

40 Top-down example
Grammar: same CFG as before.
Lexicon: cried: V; dogs: N, V; the: ART
Input: The/1 dogs/2 cried/3
A typical parse state: ((N VP) 2): the parser needs to find an N followed by a VP, starting at position 2.

41 Parsing “The dogs cried”
Step | Current State | Backup States | Comment
1 | ((S) 1) | | initial position
2 | ((NP VP) 1) | | rewriting S by rule 1
3 | ((ART N VP) 1) | ((ART ADJ N VP) 1) | rewriting NP by rules 2 & 3
4 | ((N VP) 2) | ((ART ADJ N VP) 1) | matching ART with "the"
5 | ((VP) 3) | | matching N with "dogs"
6 | ((V) 3) | ((V NP) 3) | rewriting VP by rules 5-8
7 | the parse succeeds as V is matched to "cried", leaving an empty grammatical symbol list with an empty sentence

42 Problems with Top-down
- Left-recursive rules cause the parser to loop without consuming input.
- A top-down parser will do badly if there are many different rules for the same LHS. Consider: 600 rules for S, 599 of which start with NP but one of which starts with V, and the sentence starts with V.
- Useless work: expands things that are possible top-down but not actually in the input.
- Top-down parsers do well if there is useful grammar-driven control: search is directed by the grammar.
- Top-down is hopeless for rewriting parts of speech (preterminals) with words (terminals). In practice that is always done bottom-up, as lexical lookup.
- Repeated work: anywhere there is common substructure.

43 Bottom-up Parsing
Bottom-up parsing is data-directed. The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, that sequence may be replaced by the LHS of the rule. Parsing is finished when the goal list contains just the start category.
If the RHS of several rules match the goal list, there is a choice of which rule to apply (a search problem). Can use depth-first or breadth-first search, and goal ordering.
The standard presentation is as shift-reduce parsing (a minimal sketch follows below).
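A minimal shift-reduce sketch, with a toy grammar and exhaustive depth-first search over the shift/reduce choice points (assuming no cyclic unary rules).

```python
RULES = [("S", ("NP", "VP")), ("NP", ("ART", "N")), ("VP", ("V",)),
         ("ART", ("the",)), ("N", ("dogs",)), ("V", ("cried",))]

def shift_reduce(words):
    agenda = [((), tuple(words))]        # state = (stack, remaining input)
    while agenda:
        stack, rest = agenda.pop()       # depth-first over choice points
        if stack == ("S",) and not rest:
            return True                  # goal list reduced to the start category
        for lhs, rhs in RULES:           # reduce: a RHS matches the stack top
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                agenda.append((stack[:-len(rhs)] + (lhs,), rest))
        if rest:                         # shift the next input word
            agenda.append((stack + (rest[0],), rest[1:]))
    return False

print(shift_reduce(["the", "dogs", "cried"]))   # -> True
```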

44 Problems with Bottom-up Parsing
- Unable to deal with empty categories: a termination problem, unless rewriting empties as constituents is somehow restricted (but then it's generally incomplete).
- Useless work: builds constituents that are locally possible but globally impossible.
- Inefficient when there is great lexical ambiguity (grammar-driven control might help here). Conversely, it is data-directed: it attempts to parse the words that are there.
- Repeated work: anywhere there is common substructure.
Both TD (LL) and BU (LR) parsers can (and frequently do) do work exponential in the sentence length on NLP problems.

45 Dynamic Programming for Parsing
Systematically fill in tables of solutions to sub-problems: store subtrees for each of the various constituents in the input as they are discovered.
Examples: the Cocke-Kasami-Younger (CKY) algorithm, Earley's algorithm, and chart parsing.

46 CKY algorithm (BU), recognizer version
Input: a string of n words.
Output: yes/no (since it's only a recognizer).
Data structure: an n x n table; rows labeled 0 to n-1, columns labeled 1 to n; cell [i,j] lists the constituents found between i and j. (A sketch follows below.)
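A direct sketch of that recognizer; the grammar is assumed to be in CNF, and the toy rules are illustrative.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """lexical: {word: categories}; binary: {(B, C): categories A with A -> B C}."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for j in range(1, n + 1):                      # fill columns left to right
        table[j - 1][j] = set(lexical.get(words[j - 1], ()))
        for i in range(j - 2, -1, -1):             # wider spans [i, j]
            for k in range(i + 1, j):              # split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binary.get((b, c), set())
    return start in table[0][n]

lexical = {"the": {"Det"}, "dogs": {"N"}, "cried": {"VP"}}
binary = {("Det", "N"): {"NP"}, ("NP", "VP"): {"S"}}
print(cky_recognize(["the", "dogs", "cried"], lexical, binary))   # -> True
```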

47 Miniature Grammar

48 CKY Example

49 CKY Algorithm

50-55 CKY: Fill last column after “Houston”
[Chart-filling animation frames; the figures are not preserved in the transcript.]

56 CKY Algorithm: Additional Information
More formal algorithm analysis/description: [link not preserved]. Online demo: [link not preserved].

57 Feature-Augmented CFGs
Motivation: Agreement. Most verbs in English can appear in two forms in the present tense:
- The form used for third-person singular subjects (the flight does), called 3sg; it has a final -s.
- The form used for all other kinds of subjects (all the flights do, I do), call it non-3sg; it usually does not have a final -s.
Sentences in which the subject does not agree with the verb are ungrammatical:
- *[What flight] leave in the morning?
- *Does [NP you] have a flight from Boston to Fort Worth?
- *Do [NP this flight] stop in Dallas?

58 Agreement (continued)
Rule for yes-no questions: S → Aux NP VP
We could replace this with two rules:
S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP
We also have to add rules for the lexicon:
3sgAux → does | has | can | ...
Non3sgAux → do | have | can | ...
We also need rules for 3sgNP and Non3sgNP: make two copies of each NP rule.
The problem with this method of dealing with number agreement is that it doubles the size of the grammar.

59 Other Agreement Issues
Head nouns and determiners also have to agree:
- this flight / *this flights
- those flights / *those flight
There are similar problems in languages like German or French, which have gender agreement.
Solutions: proliferate rules (lots and lots of CFG rules), or…

60 Feature-Augmented CFGs
Number agreement feature: [NUMBER SG]
Add a feature-value pair to capture person: [NUMBER SG, PERSON 3]
Encode the grammatical category of the constituent as well: [CAT NP, NUMBER SG, PERSON 3] represents the 3sgNP category of noun phrases.
The corresponding plural version of this structure has [NUMBER PL].

61 Features (continued)

62 Features in Grammar
CFG rules with constraint features. Example, number agreement:
S → NP VP
[NP NUMBER] = [VP NUMBER]
Feature augmentation changes CFGs: rules are no longer blind concatenation of non-terminals. The constraints can be used as filters in Earley's algorithm. (A small sketch follows below.)
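A small sketch of the constraint-as-filter idea; the feature representation here is a simplified stand-in for full unification.

```python
def unify_value(v1, v2):
    """None = underspecified; returns the unified value, or 'FAIL'."""
    if v1 is None: return v2
    if v2 is None: return v1
    return v1 if v1 == v2 else "FAIL"

def apply_s_rule(np, vp):
    """S -> NP VP with the constraint [NP NUMBER] = [VP NUMBER]."""
    number = unify_value(np.get("NUMBER"), vp.get("NUMBER"))
    if number == "FAIL":
        return None                       # agreement violation: rule blocked
    return {"CAT": "S", "NUMBER": number}

flight = {"CAT": "NP", "NUMBER": "SG"}
print(apply_s_rule(flight, {"CAT": "VP", "NUMBER": "SG"}))  # {'CAT': 'S', 'NUMBER': 'SG'}
print(apply_s_rule(flight, {"CAT": "VP", "NUMBER": "PL"}))  # None (*this flight leave)
```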

63 Feature Augmentation Key Ideas
The elements of context-free grammar rules have feature-based constraints associated with them: a shift from atomic grammatical categories to more complex categories with properties. The constraints associated with individual rules can refer to, and manipulate, the feature structures associated with the parts of the rule to which they are attached.

64 Dependency Grammars
Example: I gave him my address
All links are between lexical (word) nodes. About 35 syntactic and semantic relations.

65 Dependency Grammars (continued)
Advantages: handles free word order.
Implementations: the Link parser (Sleator), freely available.

66 Learning to Parse

67 Parsing in the early 1990s
- Parsers produced detailed, linguistically rich representations.
- Parsers had uneven and usually rather poor coverage; e.g., 30% of sentences received no analysis.
- Even quite simple sentences had many possible analyses; parsers either had no method to choose between them or a very ad hoc treatment of parse preferences.
- Parsers could not be learned from data.
- Parser performance usually wasn't or couldn't be assessed quantitatively, and the performance of different parsers was often incommensurable.

68-69 Our bane: Ambiguity
- John saw Mary. / Typhoid Mary. / Phillips screwdriver Mary. (Note how rare rules interact.)
- I see a bird. Is this 4 nouns, parsed like “city park scavenger bird”? (Rare parts of speech, plus systematic ambiguity in noun sequences; the dictionary definition “the official seat, center of authority, jurisdiction, or office of a bishop” is for the noun see.)
- Time | flies like an arrow (NP VP)
- Fruit flies | like a banana (NP VP)
- Time | reactions like this one (V[stem] NP)
- Time reactions | like a chemist (S PP), or is it just an NP?

70 May 2007 example…

71 How to solve this combinatorial explosion of ambiguity?
First try parsing without any weird rules, throwing them in only if needed.
Better: every rule has a weight; a tree's weight is the total weight of all its rules; pick the overall lightest parse of the sentence.
Can we pick the weights automatically? Yes: statistical parsing.

72 Statistical parsing
Over the last 12 years statistical parsing has succeeded wonderfully! NLP researchers have produced a range of (often free, open source) statistical parsers, which can parse any sentence and often get most of it correct. These parsers are now a commodity component, and they are still improving.

73 Learning to Parse: A Taste
Penn Treebank project (about 1M words).

74 Using a Treebank as Grammar
The treebank contains 4,500 different rules for expanding VP: there are separate rules for PP sequences of any length, and for every possible arrangement of verb arguments.

75 Naïve Treebank Grammar
17,500 distinct rule types.

76 Spoken Language Syntax
- Utterances (vs. sentences).
- Much higher rate of pronouns.
- Repair phenomena (~40% of sentences): use of the words uh and um, word repetitions, restarts, and word fragments (“uh” is the most common word).

77 Treebanks for Speech: LDC Switchboard

78 Statistical parsing applications
- High precision question answering systems (Pasca and Harabagiu, SIGIR 2001)
- Improving biological named entity extraction (Finkel et al., JNLPBA 2004)
- Syntactically based sentence compression (Lin and Wilbur, Inf. Retr. 2007)
- Extracting people's opinions about products (Bloom et al., NAACL 2007)
- Improved interaction in computer games (Gorniak and Roy, AAAI 2005)
- Helping linguists find data (Resnik et al., BLS 2005)

79 Probabilistic CKY

80-102 Weighted CKY: “time flies like an arrow”
These slides build one chart, adding an entry at a time. Grammar, with rule weights:
1 S → NP VP      1 VP → V NP      1 NP → Det N      0 PP → P NP
6 S → Vst NP     2 VP → VP PP     2 NP → NP PP
2 S → S PP                        3 NP → NP NP
Chart cell [i,j] holds the constituents spanning words i+1..j, each with the weight of its best derivation:
[0,1] time: NP 3, Vst 3    [0,2]: NP 10, S 8, S 13    [0,5]: NP 24, S 22, S 27
[1,2] flies: NP 4, VP 4    [1,5]: NP 18, S 21, VP 18
[2,3] like: P 2, V 5       [2,5]: PP 12, VP 16
[3,4] an: Det 1            [3,5]: NP 10
[4,5] arrow: N 8
For example, VP 16 in [2,5] is VP → V NP with weight 1 + 5 + 10.

103-107 Follow backpointers
From the lightest S over the whole sentence (S 22), backpointers recover the parse: S → NP VP, then VP → VP PP, then PP → P NP, then NP → Det N, i.e. [S [NP time] [VP [VP flies] [PP [P like] [NP [Det an] [N arrow]]]]].

108-111 Which entries do we need?
An entry that is heavier than another entry of the same category over the same span (e.g., S 13 alongside S 8 in [0,2]) is not worth keeping, since it just breeds worse options higher in the chart.

112-113 Keep only best-in-class! (and backpointers so you can recover the parse)
Discarding the inferior stock removes S 13 and S 27 from the chart above.
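A sketch of the weighted CKY pass behind these slides, using the toy grammar and lexicon above; each cell keeps only the lightest weight per category, which is exactly the best-in-class pruning.

```python
GRAMMAR = [(1, "S", ("NP", "VP")), (6, "S", ("Vst", "NP")), (2, "S", ("S", "PP")),
           (1, "VP", ("V", "NP")), (2, "VP", ("VP", "PP")), (1, "NP", ("Det", "N")),
           (2, "NP", ("NP", "PP")), (3, "NP", ("NP", "NP")), (0, "PP", ("P", "NP"))]
LEXICON = {"time": {"NP": 3, "Vst": 3}, "flies": {"NP": 4, "VP": 4},
           "like": {"P": 2, "V": 5}, "an": {"Det": 1}, "arrow": {"N": 8}}

def weighted_cky(words):
    n = len(words)
    best = [[dict() for _ in range(n + 1)] for _ in range(n)]
    for j in range(1, n + 1):
        best[j - 1][j] = dict(LEXICON[words[j - 1]])
        for i in range(j - 2, -1, -1):
            cell = best[i][j]
            for k in range(i + 1, j):
                for w, lhs, (b, c) in GRAMMAR:
                    if b in best[i][k] and c in best[k][j]:
                        total = w + best[i][k][b] + best[k][j][c]
                        if total < cell.get(lhs, float("inf")):
                            cell[lhs] = total      # keep only best-in-class
    return best[0][n]

print(weighted_cky("time flies like an arrow".split()))   # -> {'S': 22, 'NP': 24}
```

With backpointers stored alongside each best weight, the S entry of weight 22 unwinds to the tree shown in slides 103-107.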

114 Probabilistic Trees
Instead of the lightest-weight tree, take the highest-probability tree. Given any tree, a generator should have some probability of producing it! Just like using n-grams to choose among strings.
What is the probability of this tree?
[S [NP time] [VP [VP flies] [PP [P like] [NP [Det an] [N arrow]]]]]

115 Probabilistic or stochastic context-free grammars (PCFGs)
G = (T, N, S, R, P):
- T is the set of terminals
- N is the set of nonterminals (for NLP, we usually distinguish a set P ⊆ N of preterminals, which always rewrite as terminals)
- S is the start symbol (one of the nonterminals)
- R is the set of rules/productions of the form X → γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly empty)
- P(R) gives the probability of each rule
A grammar G generates a language model L.

116 The probability of trees and strings
P(t): the probability of a tree is the product of the probabilities of the rules used to generate it.
P(w1n): the probability of a string is the sum of the probabilities of the trees which have that string as their yield:
P(w1n) = Σj P(w1n, tj) = Σj P(tj), where tj ranges over the parses of w1n.

117 A Simple PCFG (in CNF)
S → NP VP 1.0      NP → NP PP 0.4
PP → P NP 1.0      NP → astronomers 0.1
VP → V NP 0.7      NP → ears 0.18
VP → VP PP 0.3     NP → saw 0.04
P → with 1.0       NP → stars 0.18
V → saw 1.0        NP → telescope 0.1

118-119 [Figures: the two parse trees t1 and t2 for “astronomers saw stars with ears”; not preserved in the transcript]

120 Tree and String Probabilities
w15 = astronomers saw stars with ears
P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
P(w15) = P(t1) + P(t2) = 0.0015876

121 Chomsky Normal Form
All rules are of the form X → Y Z or X → w. A transformation to this form doesn't change the generative capacity of CFGs. With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform.
- Unaries/empties are removed recursively.
- N-ary rules introduce new nonterminals: VP → V NP PP becomes VP → V @VP-V and @VP-V → NP PP.
In practice it's a pain: reconstructing n-aries is easy, reconstructing unaries can be trickier. But it makes parsing easier/more efficient. (A binarization sketch follows below.)
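A sketch of that n-ary-to-binary step, with simplified naming for the bookkeeping symbols.

```python
def binarize(lhs, rhs):
    """Split an n-ary rule into binary rules via intermediate @-symbols."""
    rules, parent = [], lhs
    rhs = tuple(rhs)
    while len(rhs) > 2:
        new = "@%s-%s" % (parent.lstrip("@"), rhs[0])   # e.g. @VP-V
        rules.append((parent, (rhs[0], new)))
        parent, rhs = new, rhs[1:]
    rules.append((parent, rhs))
    return rules

print(binarize("VP", ("V", "NP", "PP")))
# -> [('VP', ('V', '@VP-V')), ('@VP-V', ('NP', 'PP'))]
```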

122 Treebank binarization
Pipeline: n-ary trees in the treebank → TreeAnnotations.annotateTree → binary trees → lexicon and grammar → parsing.
When you do this assignment, you should correctly implement CKY parsing first.

123 An example: before binarization…
[ROOT [S [NP [N cats]] [VP [V scratch] [NP [N people]] [PP [P with] [NP [N claws]]]]]]

124 After binarization…
[ROOT [S [NP [N cats]] [@S->_NP [VP [V scratch] [@VP->_V [NP [N people]] [@VP->_V_NP [PP [P with] [@PP->_P [NP [N claws]]]]]]]]]]

125 Probabilistic Trees
Instead of the lightest-weight tree, take the highest-probability tree. Given any tree, your assignment 1 generator would have some probability of producing it! Just like using n-grams to choose among strings. What is the probability of the tree above?

126 Chain rule: One node at a time
The probability of a tree given S decomposes one node at a time. For [S [NP time] [VP [VP flies] [PP [P like] [NP [Det an] [N arrow]]]]]:
p(tree | S) = p(S → NP VP | S) × p(NP → time | partial tree so far) × p(VP → VP PP | partial tree so far) × p(VP → flies | partial tree so far) × …
Each factor conditions on the entire partial tree generated so far.

127 Chain rule + backoff
The same chain-rule decomposition, but each factor backs off from the entire partial tree to just the nonterminal being expanded; these are the PCFG independence assumptions.

128 Simplified notation
p(tree | S) = p(S → NP VP | S) × p(NP → time | NP) × p(VP → VP PP | VP) × p(VP → flies | VP) × …

129 Already have a CKY alg for weights …
w(tree) = w(S → NP VP) + w(NP → time) + w(VP → VP PP) + w(VP → flies) + …
Just let w(X → Y Z) = -log p(X → Y Z | X). Then the lightest tree has the highest probability.

130 Pruning for Speed
Heuristically throw away constituents that probably won't make it into the best complete parse, using probabilities to decide which ones. So probabilities are useful for speed as well as accuracy! Both safe and unsafe methods exist:
- Throw x away if p(x) is below some threshold (and lower the threshold if we don't get a parse).
- Throw x away if p(x) < p(y)/100 for some y that spans the same set of words.
- Throw x away if p(x)·q(x) is small, where q(x) is an estimate of the probability of all rules needed to combine x with the other words in the sentence.

131 Agenda (“Best-First”) Parsing
Explore best options first; you should get some good parses early on, so grab one and go!
Prioritize constituents (and dotted constituents): whenever we build something, give it a priority reflecting how likely we think it is to make it into the highest-prob parse. Usually related to the log prob. of that constituent; might also hack in the constituent's context, length, etc. If priorities are defined carefully, we obtain an A* algorithm.
Put each constituent on a priority queue (heap); repeatedly pop and process the best constituent. CKY style: combine with previously popped neighbors. Earley style: scan/predict/attach as usual. (A sketch follows below.)
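A schematic agenda-driven recognizer in the CKY style: uniform-cost, so the priority is simply the best weight so far; the toy CNF grammar and weighted rules are assumptions for illustration.

```python
import heapq

def agenda_parse(words, lexical, binary, start="S"):
    """lexical: {word: {cat: weight}}; binary: {(B, C): [(A, rule_weight)]}."""
    agenda, chart = [], {}                         # chart: (cat, i, j) -> weight
    for i, w in enumerate(words):
        for cat, wt in lexical.get(w, {}).items():
            heapq.heappush(agenda, (wt, cat, i, i + 1))
    while agenda:
        wt, cat, i, j = heapq.heappop(agenda)      # pop the best constituent
        if (cat, i, j) in chart:
            continue                               # already derived more cheaply
        chart[(cat, i, j)] = wt
        if (cat, i, j) == (start, 0, len(words)):
            return wt                              # grab the first full parse & go
        for (c2, k, l), wt2 in list(chart.items()):
            if k == j:                             # popped neighbor on the right
                for a, rw in binary.get((cat, c2), ()):
                    heapq.heappush(agenda, (wt + wt2 + rw, a, i, l))
            if l == i:                             # popped neighbor on the left
                for a, rw in binary.get((c2, cat), ()):
                    heapq.heappush(agenda, (wt2 + wt + rw, a, k, j))
    return None

lexical = {"the": {"Det": 1}, "dogs": {"N": 1}, "cried": {"VP": 1}}
binary = {("Det", "N"): [("NP", 1)], ("NP", "VP"): [("S", 1)]}
print(agenda_parse("the dogs cried".split(), lexical, binary))   # -> 5
```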

132 Preprocessing
First “tag” the input with parts of speech: guess the correct preterminal for each word using HMMs, and then allow only one part of speech per word. This eliminates a lot of crazy constituents! But if you tagged wrong, you could be hosed.
Raise the stakes: what if the tag says not just “verb” but “transitive verb,” or “verb with a direct object and 2 PPs attached” (“supertagging”)?
Safer to allow a few possible tags per word, not just one.

133 How good are PCFGs?
- Robust (they usually admit everything, but with low probability).
- Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a sentence. But not so good, because the independence assumptions are too strong.
- They give a probabilistic language model, but in the simple case it performs worse than a trigram model.
- The problem seems to be that it lacks the lexicalization of a trigram model.

134 Lexicalization
Lexical heads are important for certain classes of ambiguities (e.g., PP attachment). Lexicalizing the grammar creates a much larger grammar:
- sophisticated smoothing needed
- smarter parsing algorithms needed
- more data needed
How necessary is lexicalization? Bilexical vs. monolexical selection; closed- vs. open-class lexicalization.

135 Lexicalized Parsing
- peel the apple on the towel (ambiguous)
- put the apple on the towel (on attaches to put; is the other reading even possible?)
- put the apple on the towel in the box
  VP[head=put] → V[head=put] NP PP
  VP[head=put] → V[head=put] NP PP[head=on]
- study the apple on the towel (study dislikes on; how can the PCFG express this?)
  VP[head=study] → VP[head=study] PP[head=on]
- study it on the towel (it dislikes on even more: a PP can't attach to a pronoun)

136 Lexicalized Parsing (continued)
- the plan that Natasha would swallow (ambiguous between the content of the plan and a relative clause)
- the plan that Natasha would snooze (snooze dislikes a direct object (plan))
- the plan that Natasha would make (make likes a direct object (plan))
- the pill that Natasha would swallow (pill can't express a content clause the way plan does; pill is a probable direct object for swallow)
How can we express these distinctions in a CFG or PCFG?

137 Putting words into PCFGs
A PCFG uses the actual words only to determine the probabilities of parts of speech (the preterminals). In many cases we need to know about words to choose a parse. The head word of a phrase gives a good representation of the phrase's structure and meaning:
- Attachment ambiguities: The astronomer saw the moon with the telescope.
- Coordination: the dogs in the house and the cats
- Subcategorization frames: put versus like

138 (Head) Lexicalization
put takes both an NP and a PP:
- Sue put [the book]NP [on the table]PP
- *Sue put [the book]NP
- *Sue put [on the table]PP
like usually takes an NP and not a PP:
- Sue likes [the book]NP
- *Sue likes [on the table]PP
We can't tell this if we just have a VP with a verb, but we can if we know which verb it is.

139 (Head) Lexicalization
Collins 1997, Charniak 1997: put the properties of words into a PCFG by annotating each nonterminal with its head word:
[S-walked [NP-Sue Sue] [VP-walked [V-walked walked] [PP-into [P-into into] [NP-store [DT-the the] [NP-store store]]]]]

140 Lexicalization sharpens probabilities: rule expansion
E.g., probability of different verbal complement frames (often called “subcategorizations”); cells lost in the transcript are marked “…”:

Local Tree      come    take    think   want
VP → V          9.5%    2.6%    4.6%    5.7%
VP → V NP       1.1%    32.1%   0.2%    13.9%
VP → V PP       34.5%   3.1%    7.1%    0.3%
VP → V SBAR     6.6%    …       73.0%   …
VP → V S        2.2%    1.3%    4.8%    70.8%
VP → V NP S     0.1%    …       0.0%    …
VP → V PRT NP   …       5.8%    …       …
VP → V PRT PP   6.1%    1.5%    …       …

141 Lexicalization sharpens probabilities: Predicting heads
“Bilexical probabilities”:
p(prices | n-plural) = .013
p(prices | n-plural, NP) = .013
p(prices | n-plural, NP, S) = .025
p(prices | n-plural, NP, S, v-past) = .052
p(prices | n-plural, NP, S, v-past, fell) = .146

142 Naïve Lexicalized Parsing
Can, in principle, use CKY on lexicalized PCFGs: O(Rn³) time and O(Sn²) memory. But R = rV² and S = sV, so the result is completely impractical (why?). Memory: 10K rules × 50K words × (40 words)² × 8 bytes ≈ 6TB.
Can modify CKY to exploit lexical sparsity: lexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary word. Result: O(rn⁵) time and O(sn³) memory: 10K rules × (40 words)³ × 8 bytes ≈ 5GB.

143 Charniak (1997) linear interpolation/shrinkage
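The figure for this slide is not in the transcript. As a generic illustration of the shrinkage idea: Charniak (1997) linearly interpolates a sparse, highly conditioned estimate with progressively less-conditioned ones. The backoff chain and lambda values below are invented for illustration, not Charniak's.

```python
def interpolate(estimates, lambdas):
    """Linear interpolation: estimates ordered most-specific first, e.g.
    [p(rule | VP, head=gave), p(rule | VP, head-tag=VBD), p(rule | VP)]."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(lambdas, estimates))

# A zero count at the most specific level is smoothed by coarser levels:
print(interpolate([0.0, 0.05, 0.15], [0.6, 0.3, 0.1]))   # -> 0.03
```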

144 Charniak (1997) shrinkage example

145 Lexicalized Parsing was seen as the breakthrough of the late 90s
Eugene Charniak, 2000 JHU workshop: “To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter: p(VP → V NP NP) = 0.00151, p(VP → V NP NP | said) = 0.00001, p(VP → V NP NP | gave) = 0.01980.”
Michael Collins, 2003 COLT tutorial: “Lexicalized Probabilistic Context-Free Grammars … perform vastly better than PCFGs (88% vs. 73% accuracy)”

146 Lexicalized parsing results (Labeled Constituent Precision/Recall F1)
[Results table and demo link not preserved in the transcript.]

147 Sparseness & the Penn Treebank
The Penn Treebank (1 million words of parsed English WSJ) has been a key resource, because of the widespread reliance on supervised learning. But 1 million words is like nothing: 965,000 constituents, but only 66 WHADJP, of which only 6 aren't how much or how many. Yet there is an infinite space of these: how clever/original/incompetent (at risk assessment and evaluation) … Most of the probabilities that you would like to compute, you can't compute.

148 Sparseness & the Penn Treebank (2)
Many parse preferences depend on bilexical statistics: likelihoods of relationships between pairs of words (compound nouns, PP attachments, …). These are extremely sparse, even on topics central to the WSJ:
- stocks plummeted: 2 occurrences
- stocks stabilized: 1 occurrence
- stocks skyrocketed: 0 occurrences
- stocks discussed: 0 occurrences
So far there has been very modest success in augmenting the Penn Treebank with extra unannotated materials or using semantic classes, once there is more than a little annotated training data. Cf. Charniak 1997, Charniak 2000; but see McClosky et al. 2006.

149 Motivating discriminative parsing
In discriminative models, it is easy to incorporate different kinds of features: often just about anything that seems linguistically interesting. In generative models, it's often difficult, and the model suffers because of false independence assumptions. This ability to add informative features is the real power of discriminative models for NLP.

150 Discriminative Parsers
- Discriminative dependency parsing: not as computationally hard (tiny grammar constant); explored considerably recently, e.g. McDonald et al. 2005.
- Make parser action decisions discriminatively, e.g. with a shift-reduce parser.
- Dynamic-program phrase structure parsing: resource intensive! Most work is on sentences of length ≤ 15, and the need to be able to dynamic-program limits the feature types you can use.
- Post-processing: parse reranking. Just work with the output of a k-best generative parser, using distribution-free methods or probabilistic model methods.

151 Charniak and Johnson (ACL 2005): Coarse-to-fine n-best parsing and MaxEnt discriminative reranking
Builds a maxent discriminative reranker over parses produced by (a slightly bug-fixed and improved version of) Charniak (2000).
- Gets the 50 best parses from the Charniak (2000) parser; doing this exploits the “coarse-to-fine” idea to heuristically find good candidates.
- The maxent model for reranking uses heads, etc., as in the generative model, but also nice linguistic features: conjunct parallelism, right-branching preference, and heaviness (length) of constituents factored in.
- Gets 91% LP/LR F1 (on all sentences! up to 80 words).

152 Readings for Next Week
FSNLP Chapters 11 and 12

