Towards a Query Optimizer for Text-Centric Tasks Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano Presenter: Avinandan Sengupta

Session Outline: Text-Centric Tasks, Methods Employed, A More Disciplined Approach, Proposed Algorithm, Experimental Setup, Results, Conclusion

Scenario I (Task 1: Information Extraction): Constructing a table of disease outbreaks from a newspaper archive (sample tuples shown on the slide).

Scenario II (Task 2: Content Summary Construction): Tabulating the number of times an organization's name appears on a particular web site. Sample counts from the slide:
Word        Frequency
Samsung     2900
Nokia       2500
Blackberry  2000
Apple       …

Scenario III (Task 3: Focused Resource Discovery): Discovering pages about Botany on the Internet.

Text-Centric Tasks. Types: Information Extraction, Content Summary Construction, Focused Resource Discovery.

Performing Text-Centric Tasks

Recall – In Text-Centric Tasks: the recall of a strategy is the fraction of the set of tokens that the document processor P can extract from the entire corpus that the strategy actually extracts from the documents it retrieves and processes.

General flow: Start → Document Retrieval (retrieve documents from the corpus) → optional Document Classifier (is the document relevant?) → Document Processor (token extraction) → check whether Recall ≥ Target Recall; if yes, done; otherwise retrieve more documents.

What are the available methods for retrieval? Execution strategies fall into two families: crawl-based (Scan, Filtered Scan) and query-based (Iterative Set Expansion (ISE), Automatic Query Generation (AQG)).

Execution Time – Generic Model: the time a strategy spends over the corpus combines the cost of issuing queries (t_Q per query), retrieving documents (t_R per document), optionally filtering them with a classifier (t_F per document), and processing them with the document processor (t_P per document). The specific strategies below instantiate this model.

Execution Time – Simplified

Scan (SC): Time(SC, D) = |D_retr| · (t_R + t_P)

Filtered Scan (FS): Time(FS, D) = |D_retr| · (t_R + t_F + C_σ · t_P), where C_σ is the selectivity of the classifier C, i.e., the fraction of database documents that C judges useful; the classifier is trained one time, offline.

Iterative Set Expansion (ISE): Time(ISE, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P)

Automatic Query Generation (AQG): Time(AQG, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P)
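To make the four execution-time formulas above concrete, here is a minimal Python sketch; the parameter names mirror the symbols on the slides (t_R, t_P, t_F, t_Q, C_σ), and the numeric values in the example are illustrative placeholders, not measured costs.

```python
# Minimal sketch of the execution-time formulas from the preceding slides.
# All numeric inputs are illustrative placeholders, not measured values.

def time_scan(d_retr, t_r, t_p):
    # Time(SC, D) = |D_retr| * (t_R + t_P)
    return d_retr * (t_r + t_p)

def time_filtered_scan(d_retr, t_r, t_f, c_sigma, t_p):
    # Time(FS, D) = |D_retr| * (t_R + t_F + C_sigma * t_P)
    # c_sigma: classifier selectivity (fraction of documents judged useful)
    return d_retr * (t_r + t_f + c_sigma * t_p)

def time_query_based(q_sent, t_q, d_retr, t_r, t_p):
    # Time(ISE, D) and Time(AQG, D) share the same form:
    # |Q_sent| * t_Q + |D_retr| * (t_R + t_P)
    return q_sent * t_q + d_retr * (t_r + t_p)

if __name__ == "__main__":
    # Illustrative parameters (seconds per document/query).
    print(time_scan(d_retr=100_000, t_r=0.1, t_p=0.5))
    print(time_filtered_scan(d_retr=100_000, t_r=0.1, t_f=0.01, c_sigma=0.2, t_p=0.5))
    print(time_query_based(q_sent=500, t_q=0.2, d_retr=20_000, t_r=0.1, t_p=0.5))
```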

Which strategy to use? Text-centric tasks can be served either by crawling or by querying; traditionally, the strategy is selected based on heuristics and intuition.

A More Disciplined Approach

Can we do better? Define execution models for Scan, Filtered Scan, ISE, and AQG; estimate their costs; select the appropriate technique based on cost; and revisit the selection as execution proceeds.

Formalizing the problem: Given a target recall value τ, the goal is to identify an execution strategy S among S_1, ..., S_n such that Recall(S, D) ≥ τ and Time(S, D) ≤ Time(S_j, D) for every S_j with Recall(S_j, D) ≥ τ.
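A minimal sketch of this selection rule: among the strategies whose estimated recall reaches τ, choose the one with the smallest estimated execution time. The (recall, time) estimates in the example are illustrative placeholders that would come from the cost models developed on the following slides.

```python
def choose_strategy(estimates, tau):
    """estimates: dict mapping strategy name -> (estimated_recall, estimated_time).
    Returns the cheapest strategy whose estimated recall is at least tau,
    or None if no strategy is expected to reach the target recall."""
    feasible = {s: t for s, (r, t) in estimates.items() if r >= tau}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)

# Illustrative estimates only; in practice these come from the cost models.
estimates = {
    "Scan": (1.00, 50_000.0),
    "Filtered Scan": (0.85, 12_000.0),
    "ISE": (0.70, 3_000.0),
    "AQG": (0.90, 6_000.0),
}
print(choose_strategy(estimates, tau=0.80))  # -> "AQG"
```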

Degrees:
g(d), the degree of a document d: the number of distinct tokens extracted from d using P.
g(t), the degree of a token t: the number of distinct documents in D from which P can extract t.
g(q), the degree of a query q: the number of documents from D retrieved by query q.
(The corpus D is split into D_useful and D_useless documents.)

Cost of Scan. Time(SC, D) = |D_retr| · (t_R + t_P). SC retrieves documents in no particular order and never retrieves the same document twice, so SC is effectively sampling multiple tokens in parallel, without replacement, from the finite population D. The probability of observing a token t exactly k times in a sample of size S follows the hypergeometric distribution.

Cost of Scan. The probability that token t does not appear in a sample of S documents is the number of ways to select S documents from the |D| - g(t) documents in which the token does not appear, divided by the number of ways to select S documents from all |D| documents: Pr{t not seen} = C(|D| - g(t), S) / C(|D|, S). Hence the probability that token t appears in at least one sampled document is 1 - C(|D| - g(t), S) / C(|D|, S), and summing this quantity over all tokens gives the expected number of tokens retrieved after processing S documents.

Cost of Scan. We do not know the exact g(t) for each token, but we know the form of the token degree distribution (a power law). Thus, using estimates for the probabilities Pr{g(t) = i}, the expected number of tokens retrieved after processing S documents is
E[|Tokens_retr|] = |Tokens| · Σ_{i ≥ 1} Pr{g(t) = i} · [1 - ((|D| - i)! (|D| - S)!) / ((|D| - i - S)! |D|!)].
Inverting this expression gives the estimated number of documents that must be retrieved and processed to achieve a target recall.
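As a concrete illustration of this expectation, here is a minimal Python sketch, assuming a truncated power-law token degree distribution and illustrative corpus sizes (none of these numbers come from the paper); log-gamma keeps the factorial ratios numerically stable. Increasing S until the expected token count reaches τ · |Tokens| gives the estimated sample size for a target recall.

```python
import math

def log_choose(n, k):
    # log of the binomial coefficient C(n, k); -inf when the choice is impossible
    if k < 0 or k > n:
        return float("-inf")
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def prob_token_seen(g, D, S):
    # Pr{a token with degree g appears in a sample of S documents}
    #   = 1 - C(|D| - g, S) / C(|D|, S)
    return 1.0 - math.exp(log_choose(D - g, S) - log_choose(D, S))

def expected_tokens(num_tokens, D, S, beta, max_degree=1000):
    # Token degrees assumed to follow a (truncated) power law: Pr{g(t) = i} ~ i^-beta
    weights = [i ** (-beta) for i in range(1, max_degree + 1)]
    z = sum(weights)
    return num_tokens * sum(
        (w / z) * prob_token_seen(i, D, S)
        for i, w in enumerate(weights, start=1)
    )

# Illustrative numbers only.
D, num_tokens, beta = 100_000, 20_000, 2.0
for S in (1_000, 10_000, 50_000):
    print(S, round(expected_tokens(num_tokens, D, S, beta), 1))
```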

Cost of Filtered Scan. Two classifier properties matter: the classifier selectivity C_σ and the classifier recall C_r, the fraction of useful documents in D that are also classified as useful by the classifier. A uniform recall is assumed across tokens, so on average each token appears in C_r · g(t) of the processed documents.

Cost of Filtered Scan. From this, the estimated number of documents that must be retrieved to achieve a target recall follows. When C_σ is high, almost all documents in D are processed by P, and the behavior of Filtered Scan tends towards that of Scan.

Cost of ISE – Random Graph Model. A random graph is a collection of points (vertices) with lines (edges) connecting pairs of them at random. The presence or absence of an edge between two vertices is independent of the presence or absence of any other edge, so each edge may be considered to be present with independent probability p.

Cost of ISE – Querying Graph. The querying graph is a bipartite graph (V, E) with
V = {tokens t} ∪ {documents d},
E_1 = {edges d → t such that token t can be extracted from d},
E_2 = {edges t → d such that a query with t retrieves document d},
E = E_1 ∪ E_2.

Cost of ISE – With Generating Functions. Let pd_k be the probability that a randomly chosen document d contains k tokens, and pt_k the probability that a randomly chosen token t retrieves k documents. The degree distribution of a randomly chosen document is then generated by Gd_1(x) = Σ_k pd_k x^k, and the degree distribution of a randomly chosen token by Gt_1(x) = Σ_k pt_k x^k.

Cost of ISE – With Generating Functions. Corresponding generating functions (shown on the slide) describe the degree distribution of a document chosen by following a random edge and of a token chosen by following a random edge.

Cost of ISE – Properties of Generating Functions: the analysis below uses the Power, Composition, and Moments properties of generating functions.

Cost of ISE – Evaluation. Consider that ISE has sent a set Q of tokens as queries; these tokens were discovered by following random edges on the querying graph, and the degree distribution of these tokens is Gt_1(x). By the Power property, the distribution of the total number of retrieved documents (the documents pointed to by these tokens) is Gd_2(x) = [Gt_1(x)]^|Q|, i.e., |D_retr| is a random variable whose distribution is given by Gd_2(x). Documents are retrieved by following random edges on the graph, hence the degree distribution of these documents is described by Gd_1(x). Recall that Time(ISE, D) = |Q_sent| · t_Q + |D_retr| · (t_R + t_P).

Cost of ISE – Evaluation. By the Composition property, we obtain the distribution of the total number of tokens |Tokens_retr| retrieved from the D_retr documents. Using the Moments property, we then obtain the expected values of |D_retr| and |Tokens_retr| after ISE sends Q queries, and from these the number of queries |Q_sent| that Iterative Set Expansion must send to reach the target recall τ.
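The Power, Composition, and Moments properties used in this evaluation can be checked numerically by representing a probability generating function by its pmf coefficients. The sketch below is illustrative only: the token and document degree distributions are made up, not the distributions estimated for the tasks in the paper.

```python
import numpy as np

# Probability generating functions represented as coefficient arrays:
# coeffs[k] = Pr{X = k}.  The distributions below are illustrative only.

def mean(coeffs):
    # Moments property: E[X] = G'(1) = sum_k k * Pr{X = k}
    return sum(k * p for k, p in enumerate(coeffs))

def power(coeffs, m):
    # Power property: the sum of m independent copies of X has
    # generating function G(x)^m (repeated convolution of the pmf).
    out = np.array([1.0])
    for _ in range(m):
        out = np.convolve(out, coeffs)
    return out

def compose(outer, inner):
    # Composition property: if N ~ outer and each of N iid draws ~ inner,
    # the total has generating function G_outer(G_inner(x)).
    out = np.array([0.0])
    g_power = np.array([1.0])            # inner(x)^k, starting with k = 0
    for p_k in outer:
        padded = np.zeros(max(len(out), len(g_power)))
        padded[: len(out)] += out
        padded[: len(g_power)] += p_k * g_power
        out = padded
        g_power = np.convolve(g_power, inner)
    return out

# Illustrative token-degree pmf: a token retrieves 1, 2, or 3 documents.
gt = np.array([0.0, 0.5, 0.3, 0.2])
q = 5  # queries sent
docs_after_q_queries = power(gt, q)
print(mean(docs_after_q_queries), q * mean(gt))          # equal, by the Power property

# Illustrative document-degree pmf: a document yields 0, 1, or 2 tokens.
gd = np.array([0.2, 0.5, 0.3])
tokens_total = compose(docs_after_q_queries, gd)
print(mean(tokens_total), mean(docs_after_q_queries) * mean(gd))  # Composition + Moments
```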

Cost of AQG

Algorithms

Global Optimization

Local Optimization

Probability, Distributions, Parameter Estimation

Scan – Parameter Estimation. Cost estimation for Scan relies on the characteristics of the token and document degree distributions. After retrieving and processing a few documents, we can estimate the distribution parameters from the frequencies of the initially extracted tokens and documents. Specifically, we can use a maximum likelihood fit to estimate the parameters of the document degree distribution. For example, the document degrees for Task 1 tend to follow a power-law distribution with probability mass function Pr{g(d) = k} = k^(-β) / ζ(β), where ζ(β) is the Riemann zeta function (serving as a normalizing factor). The goal is to estimate the most likely value of β for a given sample of document degrees g(d_1), ..., g(d_s), i.e., to use MLE to identify the value of β that maximizes the likelihood function L(β) = Π_{i=1..s} g(d_i)^(-β) / ζ(β).

Scan – Parameter Estimation. To find the maximum, we set the derivative of the log-likelihood with respect to β to zero; since the resulting equation has no closed-form solution, the value of β is estimated by numeric approximation.
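A minimal sketch of this numeric MLE, assuming SciPy and NumPy are available; the Zipf-distributed sample below stands in for an observed sample of document degrees and is not data from the experiments.

```python
import math
import numpy as np
from scipy.special import zeta
from scipy.optimize import minimize_scalar

def neg_log_likelihood(beta, degrees):
    # Power-law pmf: Pr{g(d) = k} = k^(-beta) / zeta(beta), beta > 1.
    # Negative log-likelihood over the observed sample of document degrees.
    return beta * np.sum(np.log(degrees)) + len(degrees) * math.log(zeta(beta, 1))

def fit_beta(degrees):
    # Numeric approximation of the MLE for beta, as described on the slide.
    result = minimize_scalar(
        neg_log_likelihood,
        args=(np.asarray(degrees, dtype=float),),
        bounds=(1.01, 10.0),
        method="bounded",
    )
    return result.x

# Illustrative sample of document degrees (not data from the paper).
rng = np.random.default_rng(0)
sample = rng.zipf(2.5, size=5000)
print(fit_beta(sample))  # should land close to 2.5
```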

Scan – Token Distribution Estimation. To maximize the corresponding likelihood, we take the logarithm, eliminate the factorials using Stirling's approximation, and equate the derivative to zero to find the maximum.

Filtered Scan – Parameter Estimation

ISE – Parameter Estimation

AQG – Parameter Estimation

Experimental Setting and Results

Details of the Experiments: tuple extraction from New York Times archives (Task 1); categorized word frequency computation for Usenet newsgroups (Task 2); document retrieval on Botany from the Internet (Task 3).

Task 1a, 1b – Information Extraction
Document Processor: Snowball. Task 1a: extracting a Disease-Outbreaks relation, tuple (DiseaseName, Country); Task 1b: extracting a Headquarters relation, tuple (Organization, Location).
Token: a single tuple of the target relation. Document: a news article from The New York Times archive.
Corpus: newspaper articles from The New York Times published in 1995 (NYT95) and 1996 (NYT96); NYT95 documents are used for training, NYT96 documents for evaluation of the alternative execution strategies.
NYT96 features: 182,531 documents; 16,921 tokens (Task 1a); 605 tokens (Task 1b).
Document Classifier: RIPPER. g(d): power-law distribution; g(t): power-law distribution.

Task 1a, 1b – Information Extraction
FS: rule-based classifier (RIPPER), trained with 500 useful and 1,500 not-useful documents from the NYT95 data set.
AQG: 2,000 documents from the NYT95 data set used as a training set to create the queries required by Automatic Query Generation.
ISE: queries constructed by combining the attributes of each tuple with the AND operator (e.g., the tuple (typhus, Belize) yields the query [typhus AND Belize]).
ISE/AQG: maximum number of returned documents: 100.

Task 2 – Content Summary Construction: extracting words and their frequency from newsgroup messages.
Document Processor: simple tokenizer. Token: a word together with its frequency. Document: a Usenet message.
Corpus: the 20 Newsgroups data set from the UCI KDD Archive (20,000 messages). g(d): lognormal distribution; g(t): power-law distribution.
FS: not applicable (all documents are useful).
ISE: queries are constructed using words that appear in previously retrieved documents.
AQG modus operandi: separate documents into topics based on the high-level name of the newsgroup (comp, sci); train a rule-based classifier using RIPPER, which creates rules to assign documents into categories; the final queries contain the antecedents of the rules, across all categories.
ISE/AQG: a maximum number of returned documents is imposed.

Task 3 – Focused Resource Discovery: retrieving documents about Botany from the Internet.
Document Processor: multinomial Naïve Bayes classifier. Token: the URL of a page about Botany. Document: a Web page.
Corpus: 800,000 pages, of which 12,000 are relevant to Botany. g(d): lognormal distribution; g(t): power-law distribution.
ISE/AQG: a maximum number of returned documents is imposed. The AQG modus operandi is described on the later Task 3 slides.

Task 3 – Database Building
Retrieve 8,000 pages listed in the Open Directory under: Top -> Science -> Biology -> Botany
Select 1,000 documents as training documents
Create a multinomial Naive Bayes classifier that decides whether a Web page is about Botany
For each of the downloaded Botany pages:
– extract backlinks with Google
– classify the retrieved pages
– for each page classified as "Botany", repeat the backlink extraction until none of the backlinks is classified under Botany
A sketch of this expansion loop follows below.
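A minimal sketch of the backlink-expansion loop, assuming two hypothetical stand-ins: get_backlinks for the Google backlink lookup and is_botany for the Naive Bayes classifier described above.

```python
from collections import deque

def expand_botany_set(seed_pages, get_backlinks, is_botany):
    """Backlink-expansion sketch of the procedure above.
    get_backlinks(url) and is_botany(url) are hypothetical stand-ins for
    the Google backlink lookup and the Naive Bayes page classifier."""
    botany = set(seed_pages)
    frontier = deque(seed_pages)
    seen = set(seed_pages)
    while frontier:
        page = frontier.popleft()
        for back in get_backlinks(page):
            if back in seen:
                continue
            seen.add(back)
            if is_botany(back):           # classify the retrieved page
                botany.add(back)
                frontier.append(back)     # keep expanding from Botany pages only
    return botany
```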

Task 3 – Database Attributes
Around 12,000 pages on Botany, pointed to by approximately 32,000 useless documents.
Augmenting the useless documents:
– picked 10 more random topics from the third level of the Open Directory hierarchy
– downloaded all the Web pages listed under these topics, for a total of approximately 100,000 pages
Final data set:
– total: around 800,000 pages
– 12,000 of these are relevant to Botany

Task 3 – Modus Operandi
SC: a classifier decides whether each retrieved page belongs to the category of choice.
FS: a focused crawler starts from a few Botany Web pages and visits a Web page only when at least one of the documents that points to it is useful.
AQG: train a RIPPER classifier using the training set and create a set of rules that assign documents into the Botany category.

Evaluation – Model Accuracy (plots for Tasks 1a, 1b, 2, and 3)

Evaluation – Global vs. Actual (plots for Tasks 1a, 1b, 2, and 3)

Evaluation – Global vs. Local (plots for Tasks 1a, 1b, 2, and 3)

Conclusion
Introduced a rigorous cost model for several query- and crawl-based execution strategies that underlie the implementation of many text-centric tasks.
Developed principled cost estimation approaches for the introduced models.
Analyzed the models to predict the execution time and output completeness of important query- and crawl-based algorithms and to select a strategy accordingly; until now these algorithms were only evaluated empirically, with limited theoretical justification.
Demonstrated that the suggested modeling can be successfully used to create optimizers for text-centric tasks.
Showed that the optimizers help build efficient execution plans that achieve a target recall, resulting in executions that can be orders of magnitude faster than alternative choices.

References
Generating functions
Sampling from a finite population: http://blog.data-miners.com/2008/05/agent-problem-sampling-from-finite.html
Random graphs with arbitrary degree distributions and their applications
Probability distributions
MLE

Thanks!

Backup Slides

Probability, Distributions and Estimations

Distributions, Models, and Likelihood

MLE. Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: applied to a data set under a given statistical model, it provides estimates for the model's parameters.

Zipf's Law. Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

Power Law. An example power-law graph, used to demonstrate ranking of popularity: to the right is the long tail, and to the left are the few items that dominate (also known as the 80–20 rule).

Hypergeometric Distribution. A discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.
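For concreteness, a small sketch using SciPy's hypergeometric distribution with made-up counts, mirroring the quantity used in the Scan analysis (the probability that a token of degree g(t) appears in a sample of S documents).

```python
from scipy.stats import hypergeom

# Illustrative numbers: a corpus of |D| = 10,000 documents, a token that
# appears in g(t) = 40 of them, and a Scan sample of S = 1,000 documents.
D, g_t, S = 10_000, 40, 1_000

# Pr{token observed k times in the sample} follows the hypergeometric distribution.
print(hypergeom.pmf(0, D, g_t, S))        # probability the token is missed entirely
print(1 - hypergeom.pmf(0, D, g_t, S))    # probability it appears at least once
```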

Binomial Distribution. Describes the number of successes in a sequence of n draws with replacement.