Divide and Conquer: Challenges in Scaling Federated Search. Presented by Abe Lederman, President and CTO, Deep Web Technologies, LLC. Search Engine Meeting, 24 April 2006, Boston, MA
SEARCH ALL OF THESE SOURCES ONE AT A TIME
OR SEARCH THEM ALL AT ONCE
Finding the Gold Hidden in the World Wide Web: “Google-type” search engines “pan” the surface web for gold; “Deep Web” search engines go mining for gold
Challenges Overview: Managing a large number of sources; Searching a large number of sources in parallel; Organizing and ranking the results returned
Challenges of Managing Thousands of Data Sources: Locate Reliable Sources; Categorize Sources by Content; Configure Sources for Searching; Maintain Sources
Challenges in Searching Thousands of Sources: Automatically Select Sources to Search; Retrieve Results from Cache; Perform Many Searches in Parallel; Bring Back Best Results
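The "Perform Many Searches in Parallel" item can be illustrated with a minimal sketch using a thread pool from the Python standard library. The `search_source` function is a hypothetical stand-in for a real source connector, and the worker count is an arbitrary assumption; nothing here is taken from the presenter's system.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_source(source_name, query):
    # Hypothetical connector: a real one would send the query to the
    # remote source and parse its response. Here we return a fake hit.
    return [f"{source_name}: result for '{query}'"]


def search_all(sources, query, max_workers=8):
    """Query every source concurrently and collect results per source."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the source it belongs to.
        futures = {pool.submit(search_source, s, query): s for s in sources}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # A failing source should not sink the whole search.
                results[name] = []
    return results
```

Threads suit this workload because each search spends most of its time waiting on a remote source rather than computing.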
Source Selection Optimizer [diagram labels: Search Conductor; Source Selection Optimizer; Source Descriptions; Previous Results]
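The diagram names two inputs to the optimizer: source descriptions and previous results. One way such inputs could be combined is sketched below. The scoring rule (keyword overlap plus a small bonus for past hits), the `0.1` history weight, and all names are assumptions for illustration, not the optimizer described in the talk.

```python
def select_sources(query, descriptions, previous_hits, top_n=2):
    """Pick the top_n sources most likely to answer the query.

    descriptions:  {source name: free-text description}   (assumed input)
    previous_hits: {source name: count of good past results} (assumed input)
    """
    terms = set(query.lower().split())
    scores = {}
    for name, description in descriptions.items():
        # How many query terms appear in the source description.
        overlap = len(terms & set(description.lower().split()))
        # Small bonus for sources that returned good results before.
        history = previous_hits.get(name, 0)
        scores[name] = overlap + 0.1 * history
    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return ranked[:top_n]


descriptions = {
    "PhysDB": "physics research papers and preprints",
    "BioDB": "biology genomics research papers",
    "NewsDB": "general news articles",
}
previous_hits = {"PhysDB": 12, "BioDB": 3}
selected = select_sources("physics papers", descriptions, previous_hits)
```

With these made-up inputs, `selected` is `["PhysDB", "BioDB"]`: the news source matches no query term and has no history, so it is not searched.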
Caching of Search Results: Reduces the load (cost) of accessing sources. CHALLENGES: Requires a large database; Need to determine how often to update the cache; Works best with lots of users doing similar searches
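The "how often to update the cache" question is commonly handled with a time-to-live (TTL) per entry. The sketch below is an in-memory illustration under that assumption; the class name, the one-hour default, and the query normalization are hypothetical, and a production cache at the scale the slide describes would sit on a database instead.

```python
import time


class ResultCache:
    """In-memory cache of search results with a per-entry time-to-live."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}

    def _key(self, source, query):
        # Normalize case and whitespace so near-identical queries
        # from different users share one cache entry.
        return (source, " ".join(query.lower().split()))

    def get(self, source, query):
        key = self._key(source, query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            # Entry is stale: drop it so the caller re-queries the source.
            del self._store[key]
            return None
        return results

    def put(self, source, query, results):
        self._store[self._key(source, query)] = (self.clock(), results)
```

The normalization step reflects the slide's last point: the cache pays off only when many users issue similar searches, so queries that differ trivially should map to the same entry.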
We Address Scalability Through a Grid-Based Solution: Uses open standards (Web Services, WSDL, SOAP, XML); Runs on distributed nodes; Is platform independent (Java based); Is very flexible, providing a framework for integration of various filtering and analysis tools
Distributing the Workload as Grid Services
Search Conductor [flowchart labels: Select sources to search; Perform Search; Enough good results?; Can I get more results from “good” sources?; Get Next Results; Deliver results to user; branches marked YES/NO]
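One plausible reading of the Search Conductor flowchart is a loop: search the selected sources, stop once there are enough good results, and otherwise ask only the sources that produced good results for their next page. The sketch below follows that assumed reading. The callback signatures, the `enough` threshold, and the `max_rounds` cap are hypothetical and not taken from the talk.

```python
def conduct_search(sources, fetch_page, is_good, enough=5, max_rounds=3):
    """Collect good results, returning to productive sources for more.

    fetch_page(source, page) -> list of results (empty when exhausted)
    is_good(result)          -> True if the result is worth keeping
    Both callbacks are assumed interfaces for this sketch.
    """
    good = []
    active = list(sources)
    page = 0
    while active and len(good) < enough and page < max_rounds:
        still_productive = []
        for source in active:
            hits = [r for r in fetch_page(source, page) if is_good(r)]
            good.extend(hits)
            if hits:
                # This is a "good" source: ask it for more next round.
                still_productive.append(source)
        active = still_productive
        page += 1
    return good[:enough]


# Toy data: source A keeps yielding good results, source B never does.
pages = {"A": [[3, 4], [5, 6]], "B": [[0], [0]]}


def fetch(source, page):
    return pages[source][page] if page < len(pages[source]) else []


top = conduct_search(["A", "B"], fetch, lambda r: r > 2, enough=3)
```

Here `top` is `[3, 4, 5]`: the first round yields two good results from A, B is dropped as unproductive, and a second request to A supplies the rest.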
Searching a large number of sources can lead to a flood of results
Challenges in Organizing and Ranking Results: Multi-tier Relevance Ranking; User-driven Ranking; Clustering of Results
Multi-tier Relevance Ranking: QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet; MetaRank – Ranks results utilizing custom algorithms applied to metadata; DeepRank – Downloads and indexes full-text documents. HEAVY LIFTING REQUIRED!
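The first tier is described only as ranking by occurrence of search terms in title, author, and snippet. A minimal sketch in that spirit follows. The field weights and function names are assumptions, and this is not the actual QuickRank algorithm, whose details the slide does not give.

```python
import re

# Assumed weights: a match in the title counts more than one in the snippet.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "snippet": 1.0}


def quick_score(result, query):
    """Weighted count of query-term occurrences across the three fields."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"\w+", result.get(field, "").lower())
        score += weight * sum(words.count(term) for term in terms)
    return score


def quick_rank(results, query):
    """Return results sorted from highest to lowest score."""
    return sorted(results, key=lambda r: quick_score(r, query), reverse=True)


results = [
    {"title": "Grid computing", "author": "Smith",
     "snippet": "about federated search"},
    {"title": "Federated search at scale", "author": "Lee",
     "snippet": "federated search on a grid"},
]
ranked = quick_rank(results, "federated search")
```

A scorer like this needs only the fields already present in a result list, which is why it is cheap compared with a tier that downloads and indexes full text.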
User-driven Ranking: Credibility of source; Date range; Document length; Document type; Geographic proximity; Popularity of document; Reading level; Relevance. Desired: Blending (weighting) of above criteria
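The desired blending of criteria can be sketched as a weighted average, with weights chosen by the user. This assumes each criterion has already been normalized to a score between 0 and 1; the data layout and function names are hypothetical.

```python
def blended_score(criterion_scores, weights):
    """Weighted average of per-criterion scores (each assumed in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * criterion_scores.get(criterion, 0.0)
        for criterion, weight in weights.items()
    )
    return weighted / total_weight


def rank_by_preferences(docs, weights):
    """Order documents by the user's blend of criteria, best first."""
    return sorted(
        docs,
        key=lambda d: blended_score(d["scores"], weights),
        reverse=True,
    )


docs = [
    {"id": "a", "scores": {"relevance": 0.9, "credibility": 0.2}},
    {"id": "b", "scores": {"relevance": 0.6, "credibility": 0.9}},
]
balanced = rank_by_preferences(docs, {"relevance": 1, "credibility": 1})
relevance_only = rank_by_preferences(docs, {"relevance": 1, "credibility": 0})
```

With equal weights document `b` ranks first; when the user cares only about relevance, `a` does. The same result set is reordered without re-querying any source.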
Clustering
A Grand Challenge for Federated Search. Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.
Knowledge Diffusion in Action [diagram labels: Global Discovery Search Portal; Math Community, Biology Community, Physics Community; Math Databases, Biology Databases, Physics Databases, each with Research Papers, Correspondence, Conferences; Mathematician’s Scientific Discovery; Biology Researcher’s Scientific Discovery; Physics Scientific Discovery]
Grid of Grids: Each circle = a portal with sources; End result is thousands of sources in 2 hops; Scaling to the Next Level
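The "thousands of sources in 2 hops" claim is a matter of fan-out: if a top-level portal federates other portals (first hop) and each of those searches its own sources (second hop), the reach is the product of the two. The sketch below counts reachable sources in such a nested structure. The dictionary layout and the figures of 40 portals with 50 sources each are illustrative assumptions, not numbers from the talk.

```python
def reachable_sources(portal):
    """Count sources reachable through a portal and the portals it federates."""
    own = len(portal.get("sources", []))
    nested = sum(reachable_sources(p) for p in portal.get("portals", []))
    return own + nested


# Illustrative grid of grids: one top-level portal federating 40 portals,
# each of which searches 50 sources of its own.
top_portal = {
    "portals": [
        {"sources": [f"portal{i}-source{j}" for j in range(50)]}
        for i in range(40)
    ]
}
total = reachable_sources(top_portal)
```

With these assumed figures `total` is 2000, although no single portal has to manage more than 50 connections.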
Abe Lederman, 122 Longview Drive, Los Alamos, NM. Thank You!