Phrase Finding with “Request-and-Answer” – William W. Cohen

PHRASE-FINDING: THE STATISTICS

“Phraseness” 1 – based on BLRT
Define p_i = k_i / n_i, p = (k_1 + k_2) / (n_1 + n_2), and L(p, k, n) = p^k (1 - p)^(n-k).
Phrase x y: W_1 = x ^ W_2 = y
  k_1 = C(W_1 = x ^ W_2 = y): how often bigram x y occurs in corpus C
  n_1 = C(W_1 = x): how often word x occurs in corpus C
  k_2 = C(W_1 ≠ x ^ W_2 = y): how often y occurs in C after a non-x
  n_2 = C(W_1 ≠ x): how often a non-x occurs in C
Does y occur at the same frequency after x as in other positions?
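The slide lists the counts and the likelihood L(p, k, n) but not how they are combined. A minimal sketch, assuming the standard binomial log-likelihood-ratio test form that this notation suggests; the helper names `log_L` and `blrt` and the example counts are illustrative, not from the slides:

```python
import math

def log_L(p, k, n):
    # log of the binomial likelihood L(p, k, n) = p^k (1-p)^(n-k)
    if p == 0.0 or p == 1.0:
        # degenerate rates: likelihood is 1 only if the counts agree exactly
        return 0.0 if (k == 0 or k == n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1 - p)

def blrt(k1, n1, k2, n2):
    # Binomial log-likelihood ratio test: how much better do two separate
    # rates (p1, p2) explain the counts than a single shared rate p?
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                - log_L(p, k1, n1) - log_L(p, k2, n2))

# Phraseness counts, as defined on the slide (illustrative numbers):
# k1 = C(W1=x ^ W2=y), n1 = C(W1=x), k2 = C(W1!=x ^ W2=y), n2 = C(W1!=x)
print(blrt(k1=50, n1=200, k2=30, n2=99_800))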

“Informativeness” 1 – based on BLRT
Define p_i = k_i / n_i, p = (k_1 + k_2) / (n_1 + n_2), and L(p, k, n) = p^k (1 - p)^(n-k).
Phrase x y: W_1 = x ^ W_2 = y, and two corpora, C (foreground) and B (background)
  k_1 = C(W_1 = x ^ W_2 = y): how often bigram x y occurs in corpus C
  n_1 = C(W_1 = * ^ W_2 = *): how many bigrams are in corpus C
  k_2 = B(W_1 = x ^ W_2 = y): how often x y occurs in the background corpus B
  n_2 = B(W_1 = * ^ W_2 = *): how many bigrams are in the background corpus B
Does x y occur at the same frequency in both corpora?
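Informativeness plugs different counts into the same test. As a continuation of the sketch above (it reuses that sketch's illustrative `blrt` helper; all counts are made up):

```python
# k1 = C(xy), n1 = total bigrams in C, k2 = B(xy), n2 = total bigrams in B
informativeness = blrt(k1=50, n1=1_000_000, k2=40, n2=50_000_000)
```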

The breakdown: what makes a good phrase
– To compare distributions, use KL-divergence; here, the “pointwise KL divergence” contributed by a single phrase
– Combined: the difference between the foreground bigram model and the background unigram model
– Bigram model: P(x y) = P(x) P(y|x)
– Unigram model: P(x y) = P(x) P(y)
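A hedged sketch of pointwise-KL scoring consistent with the models named on the slide. The exact form of the combined score is my reading of “foreground bigram model vs. background unigram model”, and all probability estimates and names are illustrative:

```python
import math

def pkl(p, q):
    # pointwise KL contribution of one event: p * log(p / q)
    return p * math.log(p / q)

def phrase_scores(p_xy_fg, p_x_fg, p_y_fg, p_xy_bg, p_x_bg, p_y_bg):
    # phraseness: foreground bigram model vs. foreground unigram model
    phraseness = pkl(p_xy_fg, p_x_fg * p_y_fg)
    # informativeness: foreground bigram model vs. background bigram model
    informativeness = pkl(p_xy_fg, p_xy_bg)
    # combined: foreground bigram model vs. background unigram model
    combined = pkl(p_xy_fg, p_x_bg * p_y_bg)
    return phraseness, informativeness, combined

print(phrase_scores(2e-5, 1e-3, 4e-4, 5e-7, 8e-4, 6e-4))
```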

Pointwise KL, combined

Why phrase-finding?
– Phrases are where the standard supervised “bag of words” representation starts to break down.
– There is no supervised data, so it’s hard to see what’s “right” and why.
– It’s a nice example of using unsupervised signals to solve a task that could be formulated as supervised learning.
– It’s a nice level of complexity, if you want to do it in a scalable way.

PHRASE-FINDING: THE ALGORITHM

Implementation: the request-and-answer pattern
– Main data structure: tables of key-value pairs
    the key is a phrase x y
    the value is a mapping from attribute names (like phraseness, freq-in-B, etc.) to numeric values
– Keys and values are just strings
– We’ll operate mostly by sending messages to this data structure and getting results back, or else by streaming through the whole table
– For really big data, we’d also need tables where the key is a word and the value is the set of attributes of that word (freq-in-B, freq-in-C, …)
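One way such records might be serialized for stream-and-sort processing: a tab-separated key followed by comma-separated attribute=value pairs. The exact format and helper names below are illustrative, not from the slides:

```python
def format_record(key, attrs):
    # e.g. "found new\tfreq-in-B=1,freq-in-C=5"
    return key + "\t" + ",".join(f"{a}={v}" for a, v in sorted(attrs.items()))

def parse_record(line):
    key, rest = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in rest.split(",")) if rest else {}
    return key, attrs

line = format_record("found new", {"freq-in-C": 5, "freq-in-B": 1})
print(parse_record(line))  # ('found new', {'freq-in-B': '1', 'freq-in-C': '5'})
```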

Generating and scoring phrases: 1
– Stream through the foreground corpus and count events “W_1 = x ^ W_2 = y” the same way we do in training naive Bayes: stream-and-sort and accumulate deltas (a “sum-reduce”)
    Don’t generate boring phrases (e.g., ones that cross a sentence boundary, contain a stopword, …)
– Then stream through the output and convert to (phrase, attributes-of-phrase) records with one attribute: freq-in-C=n
– Stream through the foreground corpus and count events “W_1 = x” in a (memory-based) hashtable
– This is enough* to compute phraseness:
    ψ_p(x y) = f(freq-in-C(x), freq-in-C(y), freq-in-C(x y))
  …so you can do that with a scan through the phrase table that adds an extra attribute (holding the word frequencies in memory).
  * actually you also need the total number of words and the total number of phrases
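A minimal stream-and-sort sketch of this counting step, with an in-memory sort standing in for an external sort; the stopword list, tokenizer, and function names are illustrative, not the course's actual choices:

```python
import re
from itertools import groupby

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative, not the course list

def bigram_events(sentences):
    # map step: emit one event per non-boring bigram within a sentence
    for sent in sentences:
        words = re.findall(r"[a-z]+", sent.lower())
        for x, y in zip(words, words[1:]):
            if x not in STOPWORDS and y not in STOPWORDS:
                yield f"{x} {y}"

def sum_reduce(sorted_events):
    # reduce step: after sorting, identical keys are adjacent, so sum them
    for key, group in groupby(sorted_events):
        yield key, sum(1 for _ in group)

corpus = ["The new world order was found.", "A brave new world indeed."]
events = sorted(bigram_events(corpus))        # stand-in for an external sort
for phrase, n in sum_reduce(events):
    print(f"{phrase}\tfreq-in-C={n}")          # phrase record with one attribute
```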

Generating and scoring phrases: 2
– Stream through the background corpus, count events “W_1 = x ^ W_2 = y”, and convert to (phrase, attributes-of-phrase) records with one attribute: freq-in-B=n
– Sort the two phrase tables (freq-in-B and freq-in-C) together and run the output through another tool that appends together all the attributes associated with the same key, so each phrase record now carries both freq-in-C and freq-in-B
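A hedged sketch of that combining tool: after the external sort, records with the same phrase key are adjacent, so a single pass can merge their attribute maps. The record format follows the illustrative sketch above and is an assumption, not the course's exact format:

```python
from itertools import groupby

def combine_sorted_records(lines):
    # lines are sorted "phrase\tattr=val,..." records from both tables;
    # merge all attributes that share the same phrase key
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        attrs = {}
        for _, rest in group:
            attrs.update(kv.split("=", 1) for kv in rest.split(","))
        yield key + "\t" + ",".join(f"{a}={v}" for a, v in sorted(attrs.items()))

merged = combine_sorted_records(sorted([
    "new world\tfreq-in-C=2",
    "new world\tfreq-in-B=75",
    "world order\tfreq-in-C=1",
]))
for rec in merged:
    print(rec)   # e.g. "new world\tfreq-in-B=75,freq-in-C=2"
```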

Generating and scoring phrases: 3
– Scan through the phrase table one more time and add the informativeness attribute and the overall quality attribute
Summary, assuming the word vocabulary n_W is small:
– Scan foreground corpus C for phrases: O(n_C), producing m_C phrase records (of course m_C << n_C)
– Compute phraseness: O(m_C)
– Scan background corpus B for phrases: O(n_B), producing m_B phrase records
– Sort together and combine records: O(m log m), with m = m_B + m_C
– Compute informativeness and combined quality: O(m)
This assumes the word counts fit in memory.

LARGE-VOCABULARY PHRASE FINDING

Ramping it up – keeping word counts out of memory
Goal: records for xy with attributes freq-in-B, freq-in-C, freq-of-x-in-C, freq-of-y-in-C, …
Assume I have built phrase tables and word tables…. How do I incorporate the word attributes into the phrase records?
– For each phrase xy, request the necessary word frequencies:
    Print “x ~ request=freq-in-C,from=xy”
    Print “y ~ request=freq-in-C,from=xy”
– Sort all the word requests in with the word tables
– Scan through the result and generate the answers: for each word w with attributes a_1=n_1, a_2=n_2, …
    Print “xy ~ request=freq-in-C,from=w” (carrying the requested count for w back to the phrase xy)
– Sort the answers in with the xy records
– Scan through and augment the xy records appropriately
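A hedged end-to-end sketch of this request-and-answer pattern, using in-memory sorting as a stand-in for the external sorts; the message format, attribute names, and example counts are illustrative, not the course's exact conventions:

```python
from itertools import groupby

word_table = {"new": 1200, "world": 800, "order": 300}        # word -> freq-in-C
phrase_keys = ["new world", "world order"]

# Step 1: each phrase prints one request per word it needs
requests = []
for xy in phrase_keys:
    for w in xy.split():
        requests.append((w, f"request=freq-in-C,from={xy}"))

# Step 2: sort requests in with the word records and answer them;
# each answer is keyed by the requesting phrase so it sorts next to it later
stream = sorted(requests + [(w, f"freq-in-C={n}") for w, n in word_table.items()])
answers = []
for w, group in groupby(stream, key=lambda kv: kv[0]):
    msgs = [m for _, m in group]
    freq = next(m.split("=")[1] for m in msgs if m.startswith("freq-in-C="))
    for m in msgs:
        if m.startswith("request="):
            xy = m.split("from=")[1]
            answers.append((xy, f"freq-of-{w}-in-C={freq}"))

# Step 3: sort answers in with the phrase records and augment the phrase records
phrase_stream = sorted(answers + [(xy, "freq-in-C=2") for xy in phrase_keys])
for xy, group in groupby(phrase_stream, key=lambda kv: kv[0]):
    print(xy, ",".join(m for _, m in group))
```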

Generating and scoring phrases: 3
Summary:
1. Scan foreground corpus C for phrases and words: O(n_C), producing m_C phrase records and v_C word records
2. Scan the phrase records, producing word-frequency requests: O(m_C), producing 2 m_C requests
3. Sort the requests in with the word records: O((2 m_C + v_C) log(2 m_C + v_C)) = O(m_C log m_C), since v_C < m_C
4. Scan through and answer the requests: O(m_C)
5. Sort the answers in with the phrase records: O(m_C log m_C)
6. Repeat 1-5 for the background corpus: O(n_B + m_B log m_B)
7. Combine the two phrase tables: O(m log m), with m = m_B + m_C
8. Compute all the statistics: O(m)

SOME MORE INTERESTING WORK WITH PHRASES

More cool work with phrases
Turney: “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews.” ACL ’02.
Task: review classification (65-85% accurate, depending on domain)
– Identify candidate phrases (e.g., adjective-noun bigrams, using POS tags)
– Figure out the semantic orientation of each phrase using “pointwise mutual information”, and aggregate:
    SO(phrase) = PMI(phrase, 'excellent') − PMI(phrase, 'poor')
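A hedged sketch of the semantic-orientation calculation from co-occurrence counts. Turney estimated these counts with web search hits; here they are plain variables, and all counts are illustrative:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    # pointwise mutual information from raw co-occurrence counts
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_phrase, hits_excellent, hits_poor, total):
    # SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')
    return (pmi(hits_near_excellent, hits_phrase, hits_excellent, total)
            - pmi(hits_near_poor, hits_phrase, hits_poor, total))

# e.g., a phrase that co-occurs with 'excellent' more than with 'poor'
print(semantic_orientation(120, 40, 5_000, 2_000_000, 1_500_000, 10**9))
```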

“Answering Subcognitive Turing Test Questions: A Reply to French” - Turney

More from Turney
(table of examples grouped by association strength – LOW, HIGHER, HIGHEST – not reproduced in this transcript)

More cool work with phrases
Locating Complex Named Entities in Web Text. Doug Downey, Matthew Broadhead, and Oren Etzioni, IJCAI 2007.
Task: identify complex named entities like “Proctor and Gamble”, “War of 1812”, “Dumb and Dumber”, “Secretary of State William Cohen”, …
Formulation: decide whether or not to merge nearby sequences of capitalized words a x b, using a variant of a collocation score c_k (the formula from the slide is not reproduced in this transcript). For k=1, c_k is PMI (without the log); for k=2, c_k is “Symmetric Conditional Probability”.
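Since the slide's formula is missing here, the following is only a hedged reconstruction consistent with the two special cases it names: a score of the form c_k(x, y) = P(x y)^k / (P(x) P(y)) reduces to PMI-without-the-log at k=1 and to Symmetric Conditional Probability at k=2. The reconstruction, the counts, and the comparison below are assumptions, not taken from the paper:

```python
def c_k(count_xy, count_x, count_y, total, k):
    # c_k(x, y) = P(x y)^k / (P(x) * P(y)); k=1 is PMI without the log,
    # k=2 is Symmetric Conditional Probability
    p_xy = count_xy / total
    return p_xy ** k / ((count_x / total) * (count_y / total))

# compare a plausible merge of adjacent capitalized sequences with an implausible one
good = c_k(count_xy=900, count_x=1_000, count_y=1_200, total=10**8, k=2)
bad = c_k(count_xy=5, count_x=50_000, count_y=40_000, total=10**8, k=2)
print(good, bad)   # the plausible merge scores several orders of magnitude higher
```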

Downey et al results

Summary
– Even more on stream-and-sort and naïve Bayes
    the request-and-answer pattern
– Another problem: “meaningful” phrase finding
    statistics for identifying phrases (or, more generally, correlations and differences)
    also using foreground and background corpora
– Implementing phrase finding efficiently
    using request-and-answer
– Some other phrase-related problems
    semantic orientation
    complex named entity recognition