1 Federated Search of Text Search Engines in Uncooperative Environments
Luo Si, Language Technology Institute, School of Computer Science, Carnegie Mellon University
Advisor: Jamie Callan (Carnegie Mellon University)

2 © Luo Si, July 2004
Outline:
 Introduction: an introduction to federated search
 Research Problems: the state of the art and our contributions
 Demo: a prototype system for a real-world application

3 Outline:
 Introduction: an introduction to federated search
 Research Problems: the state of the art and our contributions
 Demo: a prototype system for a real-world application

4 Introduction: Visible Web vs. Hidden Web
Visible Web: information that can be copied (crawled) and accessed by conventional search engines such as Google or AltaVista.
Hidden Web: information hidden from conventional engines, which cannot index it (promptly):
 - no arbitrary crawl of the data is allowed (e.g., the ACM library)
 - updated too frequently to be crawled (e.g., buy.com)
The Hidden Web is valuable and is searched by federated search:
 - larger than the Visible Web (2-50 times)
 - created by professionals
 - on the Web: uncooperative information sources
Federated search is also a feature used by search engines such as www.find.com to compete with Google.

5 Introduction: Components of a Federated Search System
(1) Resource Representation: describe each information source (Engine 1 ... Engine N)
(2) Resource Selection: choose which engines to search for a given query
(3) Results Merging: merge the returned lists into a single ranking
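The three components above can be sketched as one driver loop. All function names here are hypothetical stand-ins; the concrete methods behind them (QBS, ReDDE, SSL merging) are described on the following slides, and resource representations are assumed to have been built offline.

```python
def federated_search(query, engines, select_sources, merge_results, k_sources=2):
    """Minimal sketch of the federated search pipeline.
    engines: source name -> (hypothetical) search function returning
    (doc_id, score) pairs. select_sources and merge_results stand in for
    the resource selection and results merging components."""
    # (2) Resource Selection: rank the sources, keep the most promising few
    selected = select_sources(query, engines)[:k_sources]
    # Query only the selected engines
    result_lists = {s: engines[s](query) for s in selected}
    # (3) Results Merging: combine the per-engine lists into one ranking
    return merge_results(result_lists)
```

Resource representation (component 1) happens before query time, which is why it does not appear in the per-query loop.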

6 Introduction: Modeling Federated Search
Real-world federated search applications do not provide enough relevance judgments or enough experimental control, so thorough simulation is required.
Modeling federated search in research environments:
 - TREC testbeds with about 100 information sources
 - normal or moderately skewed size testbeds: Trec123 and Trec4_Kmeans
 - skewed testbeds: Representative (large sources with the same relevant-document density), Relevant (large sources with higher relevant-document density), Nonrelevant (large sources with lower relevant-document density)
 - multiple types of search engines, to reflect an uncooperative environment

7 Outline:
 Introduction
 Research Problems: the state of the art and our contributions
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
 Demo

8 Research Problems (Resource Representation)
Previous research on resource representation:
 - Resource descriptions of words and their occurrences: Query-Based Sampling (Callan, 1999) sends probe queries and collects the sampled documents.
 - Information source size estimation: the Capture-Recapture model (Liu and Yu, 1999), which requires a large number of interactions with the information sources.
 - Centralized sample database: collect the documents obtained by Query-Based Sampling (QBS). Used for query expansion (Ogilvie & Callan, 2001) without much success, but utilized successfully for other problems throughout our new research.
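A runnable sketch of Query-Based Sampling, under stated assumptions: the `run_query(term, k)` interface to the uncooperative engine is a hypothetical wrapper, and the stopping parameters are illustrative.

```python
import random

def query_based_sampling(run_query, seed_term, target_docs=300,
                         docs_per_query=4, max_queries=500):
    """Sketch of Query-Based Sampling (Callan, 1999): learn a resource
    description by issuing single-term probe queries and keeping the
    returned documents. run_query(term, k) -> [(doc_id, term_list)] is a
    hypothetical interface to the remote engine."""
    sampled = {}                 # doc_id -> list of terms
    vocabulary = [seed_term]     # pool of candidate probe terms
    for _ in range(max_queries):
        if len(sampled) >= target_docs:
            break
        term = random.choice(vocabulary)          # next probe query
        for doc_id, terms in run_query(term, docs_per_query):
            if doc_id not in sampled:
                sampled[doc_id] = terms
                vocabulary.extend(terms)          # grow the query pool
    return sampled
```

The sampled documents double as this source's contribution to the centralized sample database used later for selection and merging.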

9 Research Problems (Resource Representation)
A new information source size estimation algorithm: the Sample-Resample model (Si and Callan, 2003). Estimate the document frequency (df) of a term in the sampled documents, get the term's total df in the source with a resample query, and scale the number of sampled documents to estimate the source size.
Experiments (measure: absolute error ratio between estimated size and actual size):

                    Trec123   Trec123-10Col
 Capture-Recapture  0.729     0.943
 Sample-Resample    0.232     0.299
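The estimate itself is one line; a hedged sketch (in practice the estimate would be averaged over several probe terms):

```python
def sample_resample_estimate(df_sample, n_sample, df_source):
    """Sample-Resample size estimate (Si and Callan, 2003).
    df_sample: document frequency of a probe term within the n_sample
    documents obtained by query-based sampling.
    df_source: document frequency the engine itself reports for the same
    term when it is resubmitted ("resampled") to the full source.
    Assuming the term is equally dense in sample and source,
        df_sample / n_sample ~= df_source / N,
    the source size N is estimated as:"""
    return df_source * n_sample / df_sample

def absolute_error_ratio(estimated_size, actual_size):
    """The evaluation measure on this slide: |estimated - actual| / actual."""
    return abs(estimated_size - actual_size) / actual_size
```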

10 Outline:
 Introduction
 Research Problems: the state of the art and our contributions
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
 Demo

11 Research Problems (Resource Selection)
Previous research on resource selection:
The goal of resource selection for information source recommendation is high recall: select the (few) information sources that hold the most relevant documents.
"Big document" approaches treat information sources as big documents and rank them by their similarity to the user query (examples: CVV, CORI, and KL-divergence). They lose document boundaries and do not optimize for the high-recall goal.
New: Relevant Document Distribution Estimation (ReDDE) resource selection estimates the percentage of relevant documents in each source and ranks the sources accordingly. ("Relevant Document Distribution Estimation Method for Resource Selection", Luo Si & Jamie Callan, SIGIR '03)

12 Research Problems (Resource Selection)
The Relevant Document Distribution Estimation (ReDDE) algorithm:
 - Rank the sampled documents on the centralized sample database; this simulates a ranking on the (imagined) centralized complete database.
 - Each sampled document stands in for a number of source documents given by its source scale factor: the estimated source size divided by the number of sampled documents.
 - Assume "everything at the top is (equally) relevant" in the simulated complete ranking, and accumulate the estimated number of relevant documents per source.
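A simplified, runnable sketch of the ReDDE scoring step under the assumptions on this slide; the `ratio` depth-threshold parameter and the data layout are illustrative assumptions.

```python
def redde_scores(centralized_ranking, source_of, est_size, n_sampled, ratio=0.003):
    """Sketch of ReDDE (Si & Callan, SIGIR '03).
    centralized_ranking: doc ids from the centralized sample DB, best first.
    source_of[doc_id] -> source name; est_size / n_sampled are per source.
    Each sampled doc stands for est_size[s] / n_sampled[s] documents of its
    source (the source scale factor); every document above a depth threshold
    in the simulated complete ranking is treated as equally relevant."""
    threshold = ratio * sum(est_size.values())   # depth in the complete ranking
    rel = {s: 0.0 for s in est_size}
    position = 0.0                               # simulated rank in the complete DB
    for doc_id in centralized_ranking:
        if position >= threshold:
            break
        s = source_of[doc_id]
        scale = est_size[s] / n_sampled[s]
        rel[s] += scale                          # estimated relevant docs from s
        position += scale
    total = sum(rel.values()) or 1.0
    return {s: r / total for s, r in rel.items()}   # normalized distribution
```

Sources are then ranked by these normalized scores, which estimate each source's share of the relevant documents.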

13 Research Problems (Resource Selection)
Experiments (measure: how closely the evaluated source ranking matches the desired ranking).

14 Outline:
 Introduction
 Research Problems: the state of the art and our contributions
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
 Future Research

15 Research Problems (Results Merging)
Goal of results merging: make the different result lists comparable and merge them into a single list. Difficulties: information sources may use different retrieval algorithms and have different corpus statistics.
Previous research on results merging:
 - Some methods download all returned documents and calculate comparable scores, at large communication and computation cost.
 - Some methods use heuristic score combination, e.g., the CORI method.
New: Semi-Supervised Learning (SSL) merging (Si & Callan, 2002, 2003). The basic idea is to approximate centralized document scores by linear regression, estimating linear models from the overlap documents that appear both in the centralized sample database and in the individual ranked lists.

16 Research Problems (Results Merging)
SSL results merging (cont.):
 - In resource representation: build representations by QBS and collapse the sampled documents into a centralized sample database (CSDB).
 - In resource selection: rank the sources and calculate centralized scores for the documents in the centralized sample database.
 - In results merging: find the overlap documents, build the linear models, and estimate centralized scores for all returned documents to produce the final results.
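The three steps above reduce to a small amount of code. A minimal sketch, assuming every selected source has at least two overlap documents; the document ids and the closed-form least-squares fit are illustrative.

```python
def fit_linear(pairs):
    """Least-squares fit c ~= a*s + b from (source_score, centralized_score)
    pairs; closed-form simple linear regression."""
    n = len(pairs)
    sx = sum(s for s, _ in pairs)
    sy = sum(c for _, c in pairs)
    sxx = sum(s * s for s, _ in pairs)
    sxy = sum(s * c for s, c in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def ssl_merge(result_lists, centralized_scores):
    """Sketch of SSL merging (Si & Callan, 2002, 2003): for each source,
    fit a linear model on documents that appear both in its result list and
    in the centralized sample DB ranking, then map all of its scores into
    the centralized score space and merge into one list."""
    merged = []
    for source, ranked in result_lists.items():      # ranked: [(doc_id, score)]
        overlap = [(score, centralized_scores[d])
                   for d, score in ranked if d in centralized_scores]
        a, b = fit_linear(overlap)                   # needs >= 2 overlap docs
        merged.extend((d, a * score + b) for d, score in ranked)
    return sorted(merged, key=lambda ds: ds[1], reverse=True)
```

Because every score is mapped into the same centralized score space, documents from engines with incompatible scoring schemes become directly comparable.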

17 Research Problems (Results Merging)
Experiments on the Trec123 and Trec4_Kmeans testbeds with 10 sources selected.
"Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)
"A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)

18 Outline:
 Introduction
 Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
 Demo

19 Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization framework: integrate and adjust the individual components of federated search, rather than simply combining individually effective components, to achieve the globally desired results for different applications.
High recall vs. high precision:
 - High recall: select the sources that contain as many relevant documents as possible (information source recommendation).
 - High precision: select the sources that return many relevant documents in the top part of the final ranked list (federated document retrieval).
The two goals are correlated but NOT identical; previous research does not distinguish them.

20 Research Problems (Unified Utility Framework)
The Unified Utility Maximization (UUM) framework formalizes federated search as a mathematical optimization problem with respect to the goals of different applications.
Example: for document retrieval with a high-precision goal, maximize the number of relevant documents in the top part of the ranked list, subject to the number of sources to select, retrieving a fixed number of documents from each selected source.

21 Research Problems (Unified Utility Framework)
Resource selection for federated document retrieval under UUM has no simple closed-form solution; it is solved by dynamic programming. A variant selects a variable number of documents from each selected source, subject to the total number of documents to retrieve.
"Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
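A sketch of a dynamic program for the variable-document variant, assuming the per-source gain curves gain[i][k] (estimated relevant documents among the top k results of source i) have already been computed from the sampled documents; the exact objective in the CIKM '04 paper differs in detail.

```python
def uum_allocate(gain, total_docs):
    """Sketch of utility-maximizing document allocation by dynamic
    programming, in the spirit of the UUM framework (Si & Callan, CIKM '04).
    gain[i][k]: estimated relevant documents among the top k results of
    source i (gain[i][0] == 0). Chooses how many documents to retrieve from
    each source so that exactly total_docs documents are retrieved and the
    estimated number of relevant documents is maximized."""
    n = len(gain)
    NEG = float("-inf")
    best = [NEG] * (total_docs + 1)     # best[t]: max gain using t docs so far
    best[0] = 0.0
    choice = [[0] * (total_docs + 1) for _ in range(n)]
    for i in range(n):
        new = [NEG] * (total_docs + 1)
        for t in range(total_docs + 1):
            if best[t] == NEG:
                continue
            for k in range(len(gain[i])):
                if t + k <= total_docs and best[t] + gain[i][k] > new[t + k]:
                    new[t + k] = best[t] + gain[i][k]
                    choice[i][t + k] = k        # take k docs from source i
        best = new
    alloc, t = [0] * n, total_docs              # backtrack the allocation
    for i in range(n - 1, -1, -1):
        alloc[i] = choice[i][t]
        t -= alloc[i]
    return alloc, best[total_docs]
```

Fixing the per-source retrieval depth to a constant recovers the fixed-number-of-documents setting on the previous slide.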

22 Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval on the Trec123 and Representative testbeds, with 3 and 10 sources selected and results merged by SSL.

23 Demo
FedStats project: cooperative work with Jamie Callan, Thi Nhu Truong, and Lawrence Yau.

24 Demo
Results-merging experiments on FedStats for CORI and SSL.

25 Conclusion
Federated search has been a hot research topic over the last decade, but most previous research is tied to the "big document" approach. The new research advances the state of the art:
 - a more theoretically solid foundation
 - more empirically effective methods
 - better modeling of real-world applications
It bridges from cool research to a practical tool.

