Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. Panagiotis G. Ipeirotis, Luis Gravano. Computer Science Department, Columbia University.

Presentation transcript:

Slide 1: Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Panagiotis G. Ipeirotis, Luis Gravano
Computer Science Department, Columbia University

Slide 2 (6/10/2015, Columbia University): Distributed Search? Why? “Surface” Web vs. “Hidden” Web
“Surface” Web
– Link structure
– Crawlable
– Documents indexed by search engines
“Hidden” Web
– No link structure
– Documents “hidden” in databases
– Documents not indexed by search engines
– Need to query each collection individually

Slide 3: Hidden Web: Examples
Database | Query | Matches | Google
PubMed | diabetes | 178,975 | 119
U.S. Patents | wireless network | 16,741 | 0
Library of Congress | visa regulations | >10,000 | 0
… | … | … | …
PubMed search: [diabetes] 178,975 matches
PubMed is at …
Google search: [diabetes site: …] 119 matches

Slide 4: Distributed Search: Challenges
[Figure: a Metasearcher in front of Hidden Web databases (Library of Congress, PubMed, ESPN). The metasearcher holds content summaries of the databases (vocabulary, word frequencies), with entries such as “kidneys 220,000, stones 40,…”, “kidneys 5, stones 40”, and “kidneys 20, stones …”.]
– Select good databases for query
– Evaluate query at these databases
– Merge results from databases

Slide 5: Database Selection Problems
1. How to extract content summaries?
2. How to use the extracted content summaries?
[Figure: the Metasearcher receives the query “cancer” and holds one content summary per Web database.]
Database | basketball | cancer | cpu
Web Database 1 | 4 | 4,532 | 23
Web Database 2 | 4 | 60,298 | 0
Web Database 3 | 6,340 | 2 | 0

Slide 6: Extracting Content Summaries from Web Databases
– No direct access to remote documents other than by querying
– Resort to query-based document sampling:
  – Send queries to database
  – Retrieve document sample
  – Use sample to create approximate content summary

Slide 7: “Random” Query-Based Sampling
1. Pick a word and send it as a query to database
2. Retrieve top-k documents returned (e.g., k=4)
3. Repeat until “enough” (e.g., 300) documents are retrieved
4. Use word frequencies in sample to create content summary
Word | Frequency in Sample
cancer | 150 (out of 300)
aids | 114 (out of 300)
heart | 98 (out of 300)
… | …
basketball | 2 (out of 300)
Callan et al., SIGMOD’99, TOIS 2001
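
The four sampling steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the cited authors' implementation: `ToyDatabase`, its `search` method, and the seed word are hypothetical stand-ins for a remote database that can only be reached through queries, and the choice of the next query word (uniformly from words seen so far) is an assumption.

```python
import random
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for a remote text database (query access only)."""

    def __init__(self, documents):
        self.documents = documents  # list of whitespace-tokenized strings

    def search(self, word, k):
        """Return ids of up to k documents containing the word (no ranking model)."""
        hits = [i for i, text in enumerate(self.documents) if word in text.split()]
        return hits[:k]


def query_based_sample(db, seed_word, k=4, target=300, max_queries=1000, rng=None):
    """Sample documents by one-word queries and build a sample-based summary."""
    rng = rng or random.Random(0)
    sample = {}               # doc id -> text
    vocabulary = [seed_word]  # words available as future queries, in discovery order
    tried = set()
    for _ in range(max_queries):
        if len(sample) >= target:
            break
        candidates = [w for w in vocabulary if w not in tried]
        if not candidates:
            break
        word = rng.choice(candidates)          # step 1: pick a word
        tried.add(word)
        for doc_id in db.search(word, k):      # step 2: retrieve top-k documents
            if doc_id in sample:
                continue
            sample[doc_id] = db.documents[doc_id]
            for w in sorted(set(db.documents[doc_id].split())):
                if w not in vocabulary:
                    vocabulary.append(w)
    # Step 4: document frequency of each word within the sample.
    summary = Counter()
    for text in sample.values():
        summary.update(set(text.split()))
    return summary, len(sample)
```

The returned counts are frequencies within the sample only, which is exactly the limitation the next slide raises.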

Slide 8: Random Sampling: Problems
– No actual word frequencies computed for content summaries, only a “ranking” of words
– Many words missing from content summaries (many rare words)
– Many queries return very few or no matches
[Figure: number of documents vs. word rank (Zipf’s law); many words appear in only one or two documents]

Slide 9: Our Technique: Focused Probing
1. Train document classifiers
   – Find representative words for each category
2. Use classifier rules to derive a topically-focused sample from database
3. Estimate actual document frequencies for all discovered words

Slide 10: Focused Probing: Training
– Start with a predefined topic hierarchy and preclassified documents
– Train document classifiers for each node
– Extract rules from classifiers:
  Root:
    ibm AND computers → Computers
    lung AND cancer → Health
    …
  Health:
    angina → Heart
    hepatitis AND liver → Hepatitis
    …
SIGMOD 2001

Slide 11: Focused Probing: Sampling
Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries for the database
– Transform each rule into a query
– For each query:
  – Send to database
  – Record number of matches
  – Retrieve top-k matching documents
– At the end of round:
  – Analyze matches for each category
  – Choose category to focus on
Output:
– Representative document sample
– Actual frequencies for some “important” words
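
One way to picture a probing round is the sketch below. It is a simplified illustration under stated assumptions, not the authors' implementation: `ToyDatabase`, the rule format `(terms, category)`, and the single-word rule for "liver" are hypothetical, and "choose category to focus on" is reduced to picking the category with the most matches, whereas the slide does not say how the matches are analyzed.

```python
from collections import defaultdict


class ToyDatabase:
    """Hypothetical database answering conjunctive (AND) keyword queries."""

    def __init__(self, documents):
        self.documents = [set(d.split()) for d in documents]

    def query(self, terms, k):
        """Return (number of matches, ids of the first k matching documents)."""
        hits = [i for i, words in enumerate(self.documents)
                if all(t in words for t in terms)]
        return len(hits), hits[:k]


def probing_round(db, rules, k=4):
    """Send every rule of one hierarchy node as a query and pick a focus category."""
    matches_per_category = defaultdict(int)
    sample_ids = set()
    single_word_counts = {}   # actual match counts for one-word probes
    for terms, category in rules:
        n_matches, top = db.query(terms, k)
        matches_per_category[category] += n_matches
        sample_ids.update(top)
        if len(terms) == 1:
            single_word_counts[terms[0]] = n_matches
    focus = None
    if matches_per_category and max(matches_per_category.values()) > 0:
        # Assumed simplification: focus on the category with the most matches.
        focus = max(matches_per_category, key=matches_per_category.get)
    return focus, sample_ids, single_word_counts


def focused_probe(db, rules_by_node, root="Root", k=4):
    """Descend the (acyclic) topic hierarchy one round at a time."""
    node, path = root, [root]
    sample, counts = set(), {}
    while node in rules_by_node:
        focus, ids, c = probing_round(db, rules_by_node[node], k)
        sample |= ids
        counts.update(c)
        if focus is None:
            break
        node = focus
        path.append(node)
    return path, sample, counts
```

The path doubles as a classification of the database, and `counts` holds the actual match counts that the later frequency-estimation step relies on.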

Slide 12: Sample Frequencies and Actual Frequencies
– “liver” appears in 200 out of 300 documents in sample
– “kidney” appears in 100 out of 300 documents in sample
– “hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
Can exploit number of matches from one-word queries:
– Query “liver” returned 140,000 matches
– Query “hepatitis” returned 20,000 matches
– “kidney” was not a query probe…

Slide 13: Adjusting Document Frequencies
– We know ranking r of words according to document frequency in sample
– We know absolute document frequency f of some words from one-word queries
– Mandelbrot’s formula empirically connects word frequency f and ranking r
– We use curve-fitting to estimate the absolute frequency of all words in sample
[Figure: frequency f plotted against rank r]
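
Mandelbrot's formula is commonly written f = P (r + p)^(-B), with P, p and B as parameters. The sketch below fits it to the (rank, frequency) pairs known from one-word queries and then estimates frequencies for other ranks. The fitting procedure is an assumption made for illustration (a small grid over p, with ordinary least squares in log-log space for P and B); the slide only says that curve-fitting is used, and the function names are hypothetical.

```python
import math


def fit_mandelbrot(known, p_grid=(0.0, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Fit f = P * (r + p) ** (-B) to known (rank, frequency) pairs.

    Ranks are assumed to be >= 1 and frequencies > 0.
    Returns the tuple (P, p, B) with the smallest squared error in log space.
    """
    best = None
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mean_y = sum(ys) / n
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # all ranks identical: the line is undetermined
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    if best is None:
        raise ValueError("need at least two distinct ranks to fit")
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at the given sample rank."""
    return P * (rank + p) ** (-B)
```

With only two known words, every p on the grid fits exactly, so three or more probed words are needed before the choice of p means anything. A word such as “kidney”, which was never sent as a probe, gets its estimate from its sample rank alone.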

Slide 14: Actual PubMed Content Summary
– Extracted automatically
– ~27,500 words in extracted content summary
– Fewer than 200 queries sent
– At most 4 documents retrieved per query
PubMed content summary
Number of Documents: 3,868,552
category: Health, Diseases
…
cancer 1,398,178
aids 106,512
heart 281,506
hepatitis 23,481
…
basketball 907
cpu 487
The extracted content summary accurately represents size, contents, and classification of the database

Slide 15: Focused Probing: Contributions
– Focuses database sampling on dense topic areas
– Estimates absolute document frequencies of words
– Classifies databases along the way
  – Classification useful for database selection

Slide 16: Database Selection Problems
1. How to extract content summaries?
2. How to use the extracted content summaries?
[Figure: the Metasearcher receives the query “cancer” and holds one content summary per Web database.]
Database | basketball | cancer | cpu
Web Database 1 | 4 | 4,532 | 23
Web Database 2 | 4 | 60,298 | 0
Web Database 3 | 6,340 | 2 | 0

Slide 17: Database Selection and Extracted Content Summaries
– Database selection algorithms assume complete content summaries
– Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf's law)
– Queries with undiscovered words are problematic
Database Classification Helps:
– Similar topics ↔ Similar content summaries
– Extracted content summaries complement each other

Slide 18: Content Summaries for Categories: Example
– Cancerlit contains “metastasis”, not found during sampling
– CancerBacup contains “diabetes”, not found during sampling
– Cancer category content summary contains both
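
A category content summary that "contains both" words can be pictured as a merge of the summaries of the databases classified under that category. The sketch below simply adds up sizes and document frequencies. That merge rule is an assumption for illustration, since the slides leave the exact merging technique open, and the summary layout and all numbers are made up.

```python
from collections import Counter


def merge_summaries(summaries):
    """Merge database summaries of the form {'size': int, 'df': {word: doc_freq}}."""
    total_df = Counter()
    total_size = 0
    for summary in summaries:
        total_df.update(summary["df"])   # add document frequencies word by word
        total_size += summary["size"]
    return {"size": total_size, "df": dict(total_df)}


# Hypothetical extracted summaries: each one misses a word the other one found.
cancerlit = {"size": 1000, "df": {"cancer": 900, "diabetes": 40}}
cancerbacup = {"size": 500, "df": {"cancer": 450, "metastasis": 30}}
cancer_category = merge_summaries([cancerlit, cancerbacup])
```

A query for “metastasis” finds no evidence in the Cancerlit summary alone, but the category summary now carries the word.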

Slide 19: Hierarchical DB Selection: Outline
– Create aggregated content summaries for categories
– Hierarchically direct queries using categories
– Category content summaries are more complete than database content summaries
– Various traversal techniques possible

Slide 20: Hierarchical DB Selection: Example
To select D databases:
– Use a “flat” DB selection algorithm to score categories
– Proceed to category with highest score
– Repeat until category is a leaf, or category has fewer than D databases
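
The traversal on this slide can be sketched as below, under several assumptions that are not taken from the slides: the `score` function is a made-up flat scorer (sum of document-frequency fractions of the query words), category summaries are built by summing database summaries, and the descent stops before entering a category with fewer than D databases, after which the databases under the current category are ranked with the same flat scorer.

```python
def score(summary, query_words):
    """Hypothetical flat score: sum of df/size over the query words."""
    if summary["size"] == 0:
        return 0.0
    return sum(summary["df"].get(w, 0) / summary["size"] for w in query_words)


def databases_under(category):
    """All database names classified under a category or its descendants."""
    names = list(category.get("databases", []))
    for child in category.get("children", []):
        names.extend(databases_under(child))
    return names


def category_summary(category, db_summaries):
    """Aggregate summary of a category (assumed: plain sums of sizes and counts)."""
    df, size = {}, 0
    for name in databases_under(category):
        size += db_summaries[name]["size"]
        for word, count in db_summaries[name]["df"].items():
            df[word] = df.get(word, 0) + count
    return {"size": size, "df": df}


def hierarchical_select(root, db_summaries, query_words, d):
    """Descend to the best-scoring category, then pick the top d databases in it."""
    node = root
    while node.get("children"):
        best = max(node["children"],
                   key=lambda c: score(category_summary(c, db_summaries), query_words))
        if len(databases_under(best)) < d:
            break          # too few databases below: stay at the current category
        node = best
    ranked = sorted(databases_under(node),
                    key=lambda name: score(db_summaries[name], query_words),
                    reverse=True)
    return ranked[:d]
```

Any flat selection algorithm could be plugged in for `score`; the hierarchy only decides which databases the flat ranking gets to see.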

Slide 21: Experiments: Content Summary Extraction
Focused Probing compared to Random Sampling:
– Better vocabulary coverage
– Better word ranking
– More efficient for same sample size
– More effective for same sample size
Slide annotations:
– Retrieves same number of documents using fewer queries
– Topic detection helps
– Ignores “off-topic” documents
– Better sample: each retrieved document “represents” many unretrieved, so “on-topic” sampling helps
[Figure: actual vs. sample word lists. Actual: aids, basketball, cancer, heart, …, pneumonia; Sample: aids, basketball, cancer, heart, …, pneumonia. Actual: cancer, pneumonia, aids, heart, …, basketball; Sample: aids, basketball, cancer, heart, …, pneumonia.]
More results in the paper! 4 types of classifiers (SVM, Ripper, C4.5, Bayes), frequency estimation, different data sets…

Slide 22: Experiments: Database Selection
Data set and workload:
– 50 real Web databases
– 50 TREC Web Track queries
Metric (out of 15 returned documents):
– For each query pick 3 databases
– Retrieve 5 documents from each database
– Return 15 documents to user
– Mark “relevant” and “irrelevant” documents
[Figure: Query → Database Selection → selected databases (e.g., LoC)]
Good database selection algorithms choose databases with relevant documents
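
The evaluation protocol above (3 databases, 5 documents each, 15 documents judged) amounts to a precision computation, which the following sketch spells out. The function name and the fixed denominator of 15 are assumptions; the slide's own formula did not survive in the transcript.

```python
def precision_of_selection(results_per_database, relevant_ids, per_db=5):
    """Fraction of returned documents that are relevant.

    results_per_database: one ranked list of document ids per selected database.
    The denominator is fixed at (number of databases * per_db), i.e. 15 for
    3 databases and 5 documents each, even if a database returns fewer documents.
    """
    returned = []
    for ranked_docs in results_per_database:
        returned.extend(ranked_docs[:per_db])
    denominator = len(results_per_database) * per_db
    if denominator == 0:
        return 0.0
    return sum(1 for doc in returned if doc in relevant_ids) / denominator
```

Averaging this value over the 50 queries would give one precision figure per selection algorithm.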

Slide 23: Experiments: Precision of Database Selection Algorithms
Content summary extraction | Hierarchical | Flat
Focused Probing | (not in transcript) | (not in transcript)
Random Sampling | - | 0.18
– Hierarchical database selection improves precision drastically
  – Category content summaries more complete
  – Topic-based database clustering helps
– Best result for centralized search ~ 0.35
  – Not an option for Hidden Web!
More results in the paper! (different flat selection algorithms, more content summary extraction algorithms…)

Slide 24: Contributions
– Technique for extracting content summaries from completely autonomous Hidden-Web databases
– Technique for estimating frequencies: possible to distinguish large from small databases
– Hierarchical database selection exploits classification, drastically improving the precision of distributed search
Content summary extraction implemented and available for download at:

Slide 25: Future Work
– Different techniques for merging content summaries for category content summary creation
– Effect of frequency estimation on database selection
– Different hierarchy “traversing” algorithms for hierarchical database selection