
Slide 1: Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Panagiotis G. Ipeirotis, Luis Gravano
Computer Science Department, Columbia University

Slide 2: Distributed Search? Why?
"Surface" Web vs. "Hidden" Web
"Surface" Web:
– Link structure
– Crawlable
– Documents indexed by search engines
"Hidden" Web:
– No link structure
– Documents "hidden" in databases
– Documents not indexed by search engines
– Need to query each collection individually

Slide 3: Hidden Web: Examples

Database            | Query            | Matches (database) | Matches (Google)
PubMed              | diabetes         | 178,975            | 119
U.S. Patents        | wireless network | 16,741             | 0
Library of Congress | visa regulations | >10,000            | 0
…                   | …                | …                  | …

PubMed search: [diabetes] → 178,975 matches (PubMed is at http://www.ncbi.nlm.nih.gov/PubMed)
Google search: [diabetes site:www.ncbi.nlm.nih.gov] → 119 matches

Slide 4: Distributed Search: Challenges
A metasearcher sits in front of Hidden-Web databases (e.g., Library of Congress, PubMed, ESPN), each described by a content summary (vocabulary and word frequencies), for example:
– kidneys 220,000, stones 40,000, …
– kidneys 5, stones 40, …
– kidneys 20, stones 950, …
The metasearcher must:
– Select good databases for the query
– Evaluate the query at these databases
– Merge the results from the databases

Slide 5: Database Selection Problems
1. How to extract content summaries?
2. How to use the extracted content summaries?
Diagram: a metasearcher receives the query [cancer] and must choose among Web databases with content summaries such as:
– Web Database 1: basketball 4, cancer 4,532, cpu 23
– Web Database 2: basketball 4, cancer 60,298, cpu 0
– Web Database 3: basketball 6,340, cancer 2, cpu 0

Slide 6: Extracting Content Summaries from Web Databases
No direct access to the remote documents other than by querying.
Resort to query-based document sampling:
– Send queries to the database
– Retrieve a document sample
– Use the sample to create an approximate content summary

Slide 7: "Random" Query-Based Sampling (Callan et al., SIGMOD'99, TOIS 2001)
– Pick a word and send it as a query to the database
– Retrieve the top-k documents returned (e.g., k = 4)
– Repeat until "enough" (e.g., 300) documents are retrieved
– Use word frequencies in the sample to create the content summary

Word       | Frequency in Sample
cancer     | 150 (out of 300)
aids       | 114 (out of 300)
heart      | 98 (out of 300)
…          | …
basketball | 2 (out of 300)
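A minimal sketch of this sampling loop in Python, under some assumptions: search_database(word, k) is a hypothetical helper that returns (doc_id, text) pairs for the top-k matches, and new query words are drawn from the documents already sampled (whether to draw from a dictionary or from the sample itself is a design choice that varies across implementations).

```python
import random
from collections import Counter

def random_query_sampling(search_database, seed_words, k=4, target_docs=300,
                          max_queries=1000):
    """Query-based sampling sketch: send single-word queries, keep the top-k
    documents per query, stop once the sample is large enough."""
    sample = {}                        # doc_id -> set of words in that document
    candidate_words = list(seed_words)
    queries_sent = 0
    while len(sample) < target_docs and candidate_words and queries_sent < max_queries:
        word = random.choice(candidate_words)          # pick a query word
        queries_sent += 1
        for doc_id, text in search_database(word, k):  # top-k matching documents
            if doc_id not in sample:
                words = set(text.lower().split())
                sample[doc_id] = words
                candidate_words.extend(words)          # learn new candidate query words
    # document frequency of each word within the sample
    summary = Counter()
    for words in sample.values():
        summary.update(words)
    return summary, len(sample)
```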

Slide 8: Random Sampling: Problems
– No actual word frequencies computed for content summaries, only a "ranking" of words
– Many words missing from content summaries (many rare words)
– Many queries return very few or no matches
[Plot: number of documents vs. word rank, illustrating Zipf's law: many words appear in only one or two documents]

Slide 9: Our Technique: Focused Probing
1. Train document classifiers; find representative words for each category
2. Use the classifier rules to derive a topically focused sample from the database
3. Estimate actual document frequencies for all discovered words

Slide 10: Focused Probing: Training (SIGMOD 2001)
– Start with a predefined topic hierarchy and preclassified documents
– Train a document classifier for each node
– Extract rules from the classifiers, e.g.:
  Root node: ibm AND computers → Computers; lung AND cancer → Health; …
  Health node: angina → Heart; hepatitis AND liver → Hepatitis; …

Slide 11: Focused Probing: Sampling
Sampling proceeds in rounds; in each round, the rules associated with one node of the hierarchy are turned into queries for the database.
– Transform each rule into a query
– For each query: send it to the database, record the number of matches, and retrieve the top-k matching documents
– At the end of a round: analyze the matches for each category and choose the category to focus on next
Output:
– A representative document sample
– Actual frequencies for some "important" words
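A rough sketch of the round-based probing, with its assumptions spelled out: probe_database(query, k) is a hypothetical helper returning the reported number of matches plus the top-k documents, rules_for_node maps each hierarchy node to (query, child_category) rules extracted from its classifier, and the focus-selection criterion is simplified to "follow the category with the most matches".

```python
def focused_probing(probe_database, children, rules_for_node, root,
                    k=4, max_rounds=10):
    """Focused-probing sketch. In each round, the rules of the current node are
    turned into queries; the next round focuses on the best-matching category.
    children[node]       -> child categories of `node`
    rules_for_node[node] -> list of (query, child_category) classifier rules
    probe_database(q, k) -> (number_of_matches, top_k_documents)
    """
    sample = []          # retrieved documents (the topically focused sample)
    match_counts = {}    # query -> reported number of matches (used later for
                         # frequency estimation)
    node = root
    for _ in range(max_rounds):
        per_category = {c: 0 for c in children.get(node, [])}
        for query, category in rules_for_node.get(node, []):
            num_matches, docs = probe_database(query, k)
            match_counts[query] = num_matches
            per_category[category] = per_category.get(category, 0) + num_matches
            sample.extend(docs)
        if not per_category:
            break                                        # reached a leaf category
        node = max(per_category, key=per_category.get)   # focus on the densest topic
    return sample, match_counts
```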

Slide 12: Sample Frequencies and Actual Frequencies
In the sample:
– "liver" appears in 200 out of 300 documents
– "kidney" appears in 100 out of 300 documents
– "hepatitis" appears in 30 out of 300 documents
What are the document frequencies in the actual database? We can exploit the number of matches reported for one-word queries:
– Query "liver" returned 140,000 matches
– Query "hepatitis" returned 20,000 matches
– "kidney" was not a query probe…

Slide 13: Adjusting Document Frequencies
– We know the ranking r of words according to their document frequency in the sample
– We know the absolute document frequency f of some words, from the one-word queries
– Mandelbrot's formula empirically connects word frequency f and rank r
– We use curve fitting to estimate the absolute frequency of all words in the sample
[Plot: frequency f vs. rank r, with the fitted curve]
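A minimal sketch of this estimation step, assuming the standard Zipf–Mandelbrot form f(r) = P (r + p)^(-B) for the rank–frequency curve and scipy.optimize.curve_fit for the fitting; the parameterization and helper names are illustrative rather than the paper's exact estimator. The known (rank, frequency) points come from the one-word query probes, and the fitted curve extrapolates absolute frequencies for the remaining words (e.g., "kidney" above).

```python
import numpy as np
from scipy.optimize import curve_fit

def estimate_actual_frequencies(sample_freqs, probe_matches):
    """sample_freqs:  word -> document frequency in the sample
    probe_matches:    word -> matches reported for that one-word probe query
    Returns word -> estimated document frequency in the full database."""
    # rank words by their frequency in the sample (rank 1 = most frequent)
    ranked = sorted(sample_freqs, key=sample_freqs.get, reverse=True)
    rank = {w: i + 1 for i, w in enumerate(ranked)}

    def mandelbrot(r, P, p, B):          # Zipf-Mandelbrot: f(r) = P * (r + p)^(-B)
        return P * (r + p) ** (-B)

    # fit the curve through the (rank, actual frequency) points we do know;
    # needs at least as many known words as there are parameters (three)
    known = [w for w in ranked if w in probe_matches]
    xs = np.array([rank[w] for w in known], dtype=float)
    ys = np.array([probe_matches[w] for w in known], dtype=float)
    (P, p, B), _ = curve_fit(mandelbrot, xs, ys, p0=(ys.max(), 1.0, 1.0), maxfev=10000)

    # extrapolate an absolute frequency for every word in the sample
    return {w: float(mandelbrot(rank[w], P, p, B)) for w in ranked}
```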

Slide 14: Actual PubMed Content Summary
Extracted automatically:
– ~27,500 words in the extracted content summary
– Fewer than 200 queries sent
– At most 4 documents retrieved per query

PubMed content summary:
Number of Documents: 3,868,552
Category: Health, Diseases

Word       | Frequency
cancer     | 1,398,178
aids       | 106,512
heart      | 281,506
hepatitis  | 23,481
…          | …
basketball | 907
cpu        | 487

The extracted content summary accurately represents the size, contents, and classification of the database.

Slide 15: Focused Probing: Contributions
– Focuses database sampling on dense topic areas
– Estimates absolute document frequencies of words
– Classifies databases along the way; this classification is useful for database selection

Slide 16: Database Selection Problems
1. How to extract content summaries?
2. How to use the extracted content summaries?
Diagram: the metasearcher receives the query [cancer] and must choose among the Web databases with their content summaries:
– Web Database 1: basketball 4, cancer 4,532, cpu 23
– Web Database 2: basketball 4, cancer 60,298, cpu 0
– Web Database 3: basketball 6,340, cancer 2, cpu 0

Slide 17: Database Selection and Extracted Content Summaries
– Database selection algorithms assume complete content summaries
– Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf's law)
– Queries with undiscovered words are problematic
Database classification helps:
– Similar topics ↔ similar content summaries
– Extracted content summaries complement each other

Slide 18: Content Summaries for Categories: Example
– Cancerlit contains "metastasis", which was not found during sampling
– CancerBacup contains "diabetes", which was not found during sampling
– The Cancer category content summary contains both

Slide 19: Hierarchical DB Selection: Outline
– Create aggregated content summaries for categories (see the aggregation sketch below)
– Hierarchically direct queries using the categories
– Category content summaries are more complete than database content summaries
– Various traversal techniques are possible
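A minimal sketch of the aggregation step, assuming each database under a category already has an extracted summary mapping words to (estimated) document frequencies; summing frequencies is just one plausible merging rule, and the choice of merging technique is itself listed as future work on slide 25.

```python
from collections import Counter

def category_content_summary(db_summaries):
    """db_summaries: iterable of dicts (word -> estimated document frequency),
    one per database classified under the category.
    Returns the aggregated category summary: a word discovered in any one
    database contributes to the category summary, so the individual
    summaries complement each other."""
    merged = Counter()
    for summary in db_summaries:
        merged.update(summary)
    return dict(merged)
```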

Slide 20: Hierarchical DB Selection: Example
To select D databases (sketched below):
– Use a "flat" DB selection algorithm to score the categories
– Proceed to the category with the highest score
– Repeat until the category is a leaf, or the category has fewer than D databases
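A sketch of this traversal, with assumptions spelled out: flat_score stands in for any "flat" selection algorithm (e.g., a CORI-style scorer), the dictionaries describing the hierarchy are hypothetical, and the stopping rule ("leaf, or fewer than D databases") is read here as "do not descend into a category that has fewer than D databases".

```python
def hierarchical_db_selection(query, root, children, category_summary,
                              db_summaries, flat_score, d=3):
    """children[c]         -> child categories of category c ([] for a leaf)
    category_summary[c]    -> aggregated content summary of category c
    db_summaries[c]        -> {db_name: content_summary} for databases under c
    flat_score(summary, q) -> score assigned by a "flat" selection algorithm
    Returns the names of the d selected databases."""
    node = root
    while children.get(node):
        # score the child categories with the flat algorithm
        best_child = max(children[node],
                         key=lambda c: flat_score(category_summary[c], query))
        if len(db_summaries.get(best_child, {})) < d:
            break              # too few databases below: stay at the current category
        node = best_child      # descend into the highest-scoring category
    # finally, score the databases under the chosen category and return the top d
    ranked = sorted(db_summaries.get(node, {}),
                    key=lambda name: flat_score(db_summaries[node][name], query),
                    reverse=True)
    return ranked[:d]
```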

Slide 21: Experiments: Content Summary Extraction
Focused Probing compared to Random Sampling:
– Better vocabulary coverage
– Better word ranking
– More efficient for the same sample size: retrieves the same number of documents using fewer queries (topic detection helps)
– More effective for the same sample size: each retrieved document "represents" many unretrieved ones, so "on-topic" sampling helps and "off-topic" documents are ignored, yielding a better sample
[Diagram: actual vs. sampled word rankings (aids, basketball, cancer, heart, …, pneumonia) for the two techniques]
More results in the paper! 4 types of classifiers (SVM, Ripper, C4.5, Bayes), frequency estimation, different data sets…

Slide 22: Experiments: Database Selection
Data set and workload:
– 50 real Web databases
– 50 TREC Web Track queries
Metric: Precision @ 15
– For each query, pick 3 databases
– Retrieve 5 documents from each database
– Return the 15 documents to the user
– Mark "relevant" and "irrelevant" documents
Good database selection algorithms choose databases with relevant documents.
[Diagram: a query goes through database selection to databases such as the Library of Congress (LoC)]
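For concreteness, a tiny sketch of the metric itself, assuming the relevance judgments are available as a set of relevant document IDs.

```python
def precision_at_15(returned_doc_ids, relevant_doc_ids):
    """Precision @ 15: fraction of the 15 returned documents judged relevant."""
    top = returned_doc_ids[:15]
    return sum(1 for doc in top if doc in relevant_doc_ids) / 15.0

# Example: if 3 of the 15 returned documents are marked relevant, P@15 = 0.2.
```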

Slide 23: Experiments: Precision of Database Selection Algorithms

Content Summaries | Hierarchical | Flat
Focused Probing   | 0.27         | 0.17
Random Sampling   | –            | 0.18

Hierarchical database selection improves precision drastically:
– Category content summaries are more complete
– Topic-based database clustering helps
Best result for centralized search: ~0.35 (not an option for the Hidden Web!)
More results in the paper! (different flat selection algorithms, more content summary extraction algorithms…)

Slide 24: Contributions
– Technique for extracting content summaries from completely autonomous Hidden-Web databases
– Technique for estimating frequencies: makes it possible to distinguish large from small databases
– Hierarchical database selection exploits classification, drastically improving the precision of distributed search
– Content summary extraction implemented and available for download at http://sdarts.cs.columbia.edu

Slide 25: Future Work
– Different techniques for merging content summaries when creating category content summaries
– Effect of frequency estimation on database selection
– Different hierarchy "traversing" algorithms for hierarchical database selection

