To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University

2 Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text.
Example input: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Information Extraction System (e.g., NYU’s Proteus)
Output: Disease Outbreaks in The New York Times
  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire
Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan

3 Text-Centric Task II: Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.
Example input: "Friday June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain"
  Word        Frequency
  Starbucks   102
  consumer    215
  soccer      1295
  …           …
Content Summary Extractor
Content Summary of Forbes.com
  Word        Frequency
  Starbucks   103
  consumer    216
  soccer      1295
  …           …

4 Text-Centric Task III: Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying, …).
Web Page Classifier
Output: Web Pages about Botany
  URL
  http://biology.about.com/
  http://www.amjbot.org/
  http://www.sysbot.org/
  http://www.botany.ubc.ca/

5 An Abstract View of Text-Centric Tasks
Pipeline: Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
  Task                     Token
  Information Extraction   Relation Tuple
  Database Selection       Word (+Frequency)
  Focused Crawling         Web Page about a Topic
For the rest of the talk

6 Executing a Text-Centric Task
Pipeline: Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
Similar to the relational world, there are two major execution paradigms:
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
  - Indexes are only “approximate”: the index is on keywords, not on tokens of interest
  - Choice of execution plan affects output completeness (not only speed)
→ The underlying data distribution dictates what is best

7 Execution Plan Characteristics
Pipeline: Text Database → Extraction System → Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
Execution plans have two main characteristics:
  - Execution time
  - Recall (fraction of tokens retrieved)
Question: How do we choose the fastest execution plan for reaching a target recall?
“What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?”

8 Outline
  - Description and analysis of crawl- and query-based plans
      - Scan (crawl-based)
      - Filtered Scan (crawl-based)
      - Iterative Set Expansion (query-based, i.e., index-based)
      - Automatic Query Generation (query-based, i.e., index-based)
  - Optimization strategy
  - Experimental results and conclusions

9 Scan
Pipeline: Text Database → Extraction System → Output Tokens
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tokens
Scan retrieves and processes documents sequentially (until reaching target recall).
Execution time = |Retrieved Docs| · (R + P)
  R: time for retrieving a document
  P: time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).

10 Estimating Recall of Scan
Modeling Scan for token t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents?
  - This is a “sampling without replacement” process
  - After retrieving S documents, the frequency of token t follows a hypergeometric distribution
  - Recall for token t is the probability that the frequency of t in the S documents is > 0
Figure: probability of seeing token t after retrieving S documents, where g(t) = frequency of token t
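
The formula on this slide did not survive in the transcript. Under the sampling-without-replacement model described here, it can be written out as follows. This is a sketch that assumes g(t) counts the documents containing t and that |D| denotes the total number of documents in the database; k is a symbol introduced here for the observed frequency.

```latex
\Pr\{\text{freq}(t) = k \text{ after } S \text{ documents}\}
  = \frac{\binom{g(t)}{k}\,\binom{|D| - g(t)}{S - k}}{\binom{|D|}{S}},
\qquad
\text{Recall}(t, S) = 1 - \frac{\binom{|D| - g(t)}{S}}{\binom{|D|}{S}}
```

The second expression is one minus the probability that none of the S sampled documents contains t.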

11 Estimating Recall of Scan
Modeling Scan: multiple “sampling without replacement” processes, one for each token
  - Overall recall is the average recall across tokens
  → We can compute the number of documents required to reach the target recall
Execution time = |Retrieved Docs| · (R + P)
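
The calculation outlined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, g(t) is assumed to be the number of documents containing token t, and the document count is found by binary search because recall never decreases as more documents are scanned.

```python
from math import comb


def token_recall(db_size: int, g_t: int, s: int) -> float:
    """Probability that a token occurring in g_t documents is seen at least
    once after scanning s of db_size documents (hypergeometric model)."""
    s = min(s, db_size)
    # comb(n, k) is 0 when k > n, so recall becomes 1.0 once s > db_size - g_t.
    return 1.0 - comb(db_size - g_t, s) / comb(db_size, s)


def scan_recall(db_size: int, token_freqs: list[int], s: int) -> float:
    """Overall recall: average of the per-token recall values."""
    return sum(token_recall(db_size, g, s) for g in token_freqs) / len(token_freqs)


def docs_for_target_recall(db_size: int, token_freqs: list[int], target: float) -> int:
    """Smallest number of scanned documents whose expected recall >= target."""
    lo, hi = 0, db_size
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(db_size, token_freqs, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


def scan_time(num_docs: int, r: float, p: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P)."""
    return num_docs * (r + p)
```

For example, `scan_time(docs_for_target_recall(db_size, freqs, 0.1), r, p)` would give the predicted time for Scan to reach 10% recall, given per-document retrieval and processing times `r` and `p`.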

12 Scan and Filtered Scan
Pipeline: Text Database → Classifier → Extraction System → Output Tokens
  1. Retrieve docs from database
  2. Filter documents
  3. Process documents
  4. Extract output tokens
Scan retrieves and processes all documents (until reaching target recall).
Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| * (R + F + P)
  R: time for retrieving a document
  F: time for filtering a document
  P: time for processing a document
  σ: classifier selectivity (σ ≤ 1)
Question: How many documents does (Filtered) Scan retrieve to reach target recall?

13 Estimating Recall of Filtered Scan
Modeling Filtered Scan: analysis similar to Scan
Main difference: the classifier rejects documents and
  - Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity)
  - Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall)
Documents rejected by the classifier decrease the effective database size.
Tokens in rejected documents have lower effective token frequency.
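
The adjustment described above can be sketched by reusing the hypergeometric recall formula with the effective quantities. This is an illustration under stated assumptions rather than the paper's exact model: the function name is hypothetical, `s` is taken to be the number of documents that pass the classifier and are processed, and the effective sizes are rounded to integers so that binomial coefficients can be used.

```python
from math import comb


def filtered_scan_token_recall(db_size: int, g_t: int, s: int,
                               selectivity: float, classifier_recall: float) -> float:
    """Per-token recall of Filtered Scan after processing s accepted documents.

    selectivity (sigma): fraction of documents the classifier accepts.
    classifier_recall (r): fraction of token-bearing documents it accepts.
    """
    eff_db = round(selectivity * db_size)               # sigma * |D|
    eff_g = min(round(classifier_recall * g_t), eff_db)  # r * g(t)
    if eff_db == 0 or eff_g == 0:
        return 0.0  # the token never survives the filter
    s = min(s, eff_db)
    return 1.0 - comb(eff_db - eff_g, s) / comb(eff_db, s)
```

With `selectivity=1.0` and `classifier_recall=1.0` the function reduces to the plain Scan model.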

14 Outline
  - Description and analysis of crawl- and query-based plans
      - Scan (crawl-based)
      - Filtered Scan (crawl-based)
      - Iterative Set Expansion (query-based)
      - Automatic Query Generation (query-based)
  - Optimization strategy
  - Experimental results and conclusions

15 Iterative Set Expansion
Pipeline: Query Generation → Text Database → Extraction System → Output Tokens
  1. Query database with seed tokens (e.g., [Ebola AND Zaire])
  2. Process retrieved documents
  3. Extract tokens from docs
  4. Augment seed tokens with new tokens
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?

16 Querying Graph
The querying graph is a bipartite graph, containing tokens and documents.
  - Each token (transformed to a keyword query) retrieves documents
  - Documents contain tokens
Figure: bipartite graph with tokens t1 to t5 on one side and documents d1 to d5 on the other

17 Using Querying Graph for Analysis
We need to compute the:
  - Number of documents retrieved after sending Q tokens as queries (estimates time)
  - Number of tokens that appear in the retrieved documents (estimates recall)
To estimate these we need to compute the:
  - Degree distribution of the tokens discovered by retrieving documents
  - Degree distribution of the documents retrieved by the tokens
(Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
Elegant analysis framework based on generating functions (details in the paper)
Figure: bipartite graph with tokens t1 to t5 and documents d1 to d5
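
The remark about high-degree nodes being easier to discover corresponds to a standard result for random graphs, shown here as a sketch. The notation p_k (the fraction of tokens, or documents, with degree k) is introduced for illustration and is not taken from the slides; the paper's generating-function analysis may present this differently.

```latex
\Pr\{\text{degree} = k \mid \text{node reached by following a random edge}\}
  = \frac{k\, p_k}{\sum_{j} j\, p_j}
```

A node with degree k has k edges leading to it, so it is k times as likely to be reached as a node with degree 1.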

18 Recall Limit: Reachability Graph
Example: t1 retrieves document d1, which contains t2, so the reachability graph has an edge from t1 to t2.
Upper recall limit: determined by the size of the biggest connected component of the reachability graph.
Figure: querying graph (tokens t1 to t5, documents d1 to d5) and the reachability graph over tokens t1 to t5 derived from it
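
The recall limit can be illustrated by simulating Iterative Set Expansion on a small querying graph. The sketch below is hypothetical: the function names and the toy graph are invented for illustration (the edges of the slide's figure are not recoverable from the transcript), and each token is assumed to be issued as exactly one query.

```python
from collections import deque


def iterative_set_expansion(seed_tokens, query_results, doc_tokens):
    """Breadth-first expansion over the querying graph.

    query_results: token -> documents retrieved when the token is a query.
    doc_tokens:    document -> tokens extracted from that document.
    Returns (tokens discovered, documents retrieved, number of queries sent).
    """
    seen_tokens = set(seed_tokens)
    seen_docs = set()
    queue = deque(seed_tokens)
    queries = 0
    while queue:
        token = queue.popleft()
        queries += 1
        for doc in query_results.get(token, ()):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            for t in doc_tokens.get(doc, ()):
                if t not in seen_tokens:
                    seen_tokens.add(t)
                    queue.append(t)
    return seen_tokens, seen_docs, queries


def ise_time(num_docs, num_queries, r, p, q):
    """Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return num_docs * (r + p) + num_queries * q


# Hypothetical toy graph: t4 and t5 are not reachable from t1.
QUERY_RESULTS = {"t1": ["d1"], "t2": ["d2"], "t3": [], "t4": ["d4"], "t5": ["d5"]}
DOC_TOKENS = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d4": ["t4", "t5"], "d5": ["t5"]}
```

Starting from seed `t1` in this toy graph, the expansion stops after discovering `t1`, `t2`, and `t3`, so recall cannot exceed 3 of 5 tokens no matter how many queries are sent.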

19 Automatic Query Generation
  - Iterative Set Expansion has a recall limitation due to the iterative nature of query generation
  - Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents with tokens

20 Automatic Query Generation
Pipeline: Offline Query Generation → Text Database → Extraction System → Output Tokens
  1. Generate queries that tend to retrieve documents with tokens
  2. Query database
  3. Process retrieved documents
  4. Extract tokens from docs
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query

21 Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs and has precision p(q):
  - p(q)·g(q) useful docs
  - [1-p(q)]·g(q) useless docs
We compute the total number of useful (and useless) documents retrieved.
Analysis similar to Filtered Scan:
  - Effective database size is |D_useful|
  - Sample size S is the number of useful documents retrieved
Figure: query q against the text database returns p(q)·g(q) useful docs and (1-p(q))·g(q) useless docs
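
The bookkeeping above can be sketched as follows. This is a simplified illustration with hypothetical function names: it assumes the result sets of different queries do not overlap (the slide does not say how overlap is handled), and it assumes g(t) counts the useful documents that contain token t.

```python
from math import comb


def aqg_docs_retrieved(queries):
    """queries: list of (g_q, p_q) pairs, i.e. result size and precision.
    Returns (useful, useless) document totals, assuming disjoint results."""
    useful = sum(p * g for g, p in queries)
    useless = sum((1.0 - p) * g for g, p in queries)
    return useful, useless


def aqg_token_recall(num_useful_in_db: int, g_t: int, useful_retrieved: float) -> float:
    """Per-token recall with effective database size |D_useful| and
    sample size S = number of useful documents retrieved."""
    s = min(round(useful_retrieved), num_useful_in_db)
    return 1.0 - comb(num_useful_in_db - g_t, s) / comb(num_useful_in_db, s)
```

The useless documents do not contribute to recall, but they still count toward |Retrieved Docs| in the execution-time formula.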

22 Outline
  - Description and analysis of crawl- and query-based plans
  - Optimization strategy
  - Experimental results and conclusions

23 Summary of Cost Analysis
Our analysis so far:
  - Takes as input a target recall
  - Gives as output the time for each plan to reach the target recall (time = infinity if the plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
  - Token degree distribution
  - Document degree distribution
Next, we show how to estimate degree distributions on-the-fly
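
Given the per-plan time estimates described above, choosing a plan reduces to taking the minimum. The sketch below uses a hypothetical function name and takes the estimates as a plain dictionary; how each estimate is computed is left to the cost models.

```python
import math


def choose_plan(estimated_times):
    """Return the plan with the smallest predicted time for the target recall,
    or None when every plan has infinite time (no plan can reach the target).

    estimated_times: plan name -> predicted time, math.inf if unreachable.
    """
    plan, best_time = min(estimated_times.items(), key=lambda item: item[1])
    return None if math.isinf(best_time) else plan
```

A call such as `choose_plan({"Scan": 100.0, "Iterative Set Expansion": math.inf})` would select Scan; the numbers here are made up for illustration.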

24 Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families.
Can characterize distributions with only a few parameters!
Distributions by task (table columns on the slide: Task, Document Distribution, Token Distribution):
  - Information Extraction: Power-law
  - Content Summary Construction: Lognormal; Power-law (Zipf)
  - Focused Resource Discovery: Uniform
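
As an example of describing a distribution with a single parameter, the sketch below estimates a power-law exponent from observed degrees. This is not taken from the paper: it uses the well-known maximum-likelihood estimator for a continuous power law as an approximation for integer degrees, and both the function name and the `x_min` cutoff are assumptions made for illustration.

```python
import math


def power_law_exponent(samples, x_min=1.0):
    """Continuous-approximation MLE for alpha in p(x) ~ x^(-alpha), x >= x_min:
    alpha = 1 + n / sum(ln(x_i / x_min))."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum
```

The estimate can be recomputed as more degrees are observed, which matches the idea of refining parameters while the task runs.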

25 Parameter Estimation
Naïve solution for parameter estimation:
  - Start with a separate “parameter-estimation” phase
  - Perform random sampling on the database
  - Stop when cross-validation indicates high confidence
We can do better than this!
  - No need for a separate sampling phase
  - Sampling is equivalent to executing the task
  → Piggyback parameter estimation into execution

26 On-the-fly Parameter Estimation
  1. Pick the most promising execution plan for the target recall, assuming “default” parameter values
  2. Start executing the task
  3. Update parameter estimates during execution
  4. Switch plan if the updated statistics indicate so
Important:
  - Only Scan acts as “random sampling”
  - All other execution plans need parameter adjustment (see paper)
Figure: correct (but unknown) distribution compared with the initial, default estimation and the updated estimation

27 Outline
  - Description and analysis of crawl- and query-based plans
  - Optimization strategy
  - Experimental results and conclusions

28 Correctness of Theoretical Analysis
Figure legend: solid lines show actual time; dotted lines show predicted time with correct parameters
Task: Disease Outbreaks
  - Snowball IE system
  - 182,531 documents from NYT
  - 16,921 tokens

29 Experimental Results (Information Extraction)
Figure legend: solid lines show actual time; the green line shows time with the optimizer
(Results similar in other experiments; see paper)

30 Conclusions
  - Common execution plans for multiple text-centric tasks
  - Analytic models for predicting execution time and recall of various crawl- and query-based plans
  - Techniques for on-the-fly parameter estimation
  - Optimization framework picks on-the-fly the fastest plan for target recall

31 Future Work
  - Incorporate precision and recall of extraction system in framework
  - Create non-parametric optimization (i.e., no assumption about distribution families)
  - Examine other text-centric tasks and analyze new execution plans
  - Create adaptive, “next-K” optimizer

32 Thank you!
Prior work by task (table columns on the slide: Task, Filtered Scan, Iterative Set Expansion, Automatic Query Generation):
  - Information Extraction: Grishman et al., J. of Biomed. Inf. 2002; Agichtein and Gravano, ICDE 2003
  - Content Summary Construction: -; Callan et al., SIGMOD 1999; Ipeirotis and Gravano, VLDB 2002
  - Focused Resource Discovery: Chakrabarti et al., WWW 1999; -; Cohen and Singer, AAAI WIBIS 1996

33 Overflow Slides

34 Experimental Results (IE, Headquarters)
Task: Company Headquarters
  - Snowball IE system
  - 182,531 documents from NYT
  - 16,921 tokens

35 Experimental Results (Content Summaries)
Task: Content Summary Extraction
  - 19,997 documents from 20newsgroups
  - 120,024 tokens

36 Experimental Results (Content Summaries)
Task: Content Summary Extraction
  - 19,997 documents from 20newsgroups
  - 120,024 tokens
ISE is a cheap plan for low target recall but becomes the most expensive for high target recall

37 Experimental Results (Content Summaries)
Task: Content Summary Extraction
  - 19,997 documents from 20newsgroups
  - 120,024 tokens
Underestimated recall for AQG, switched to ISE

38 Experimental Results (Information Extraction)
OPTIMIZED is faster than the “best plan”: it overestimated Filtered Scan recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan

39 Focused Resource Discovery
  - 800,000 web pages
  - 12,000 tokens
