1 Federated Search of Text Search Engines in Uncooperative Environments
Luo Si
Language Technology Institute, School of Computer Science, Carnegie Mellon University
Thesis Committee: Jamie Callan (Carnegie Mellon University, Chair), Jaime Carbonell (Carnegie Mellon University), Yiming Yang (Carnegie Mellon University), Luis Gravano (Columbia University)

2 Outline
- Introduction: introduction to federated search
- Research Problems: the state of the art and preliminary research
- Future Research: dissertation research and expected contribution

3 Outline
- Introduction: introduction to federated search
- Research Problems: the state of the art and preliminary research
- Future Research: dissertation research and expected contribution

4 Introduction
Visible Web vs. Hidden Web
- Visible Web: information that can be copied (crawled) and accessed by conventional search engines such as Google or AltaVista.
- Hidden Web: information that can NOT be copied and (promptly) indexed by conventional engines, for example because:
  - No arbitrary crawl of the data is allowed (e.g., ACM library)
  - The data is updated too frequently to be crawled (e.g., buy.com)
The Hidden Web is contained in (hidden) information sources that provide text search engines to access the hidden information.

5 Introduction
Federated Search Environments:
- Small companies: probably cooperative information sources
- Big companies (organizations): probably uncooperative information sources
- Web: uncooperative information sources
The Hidden Web is:
- Larger than the Visible Web (2-50 times, Sherman 2001)
- Created by professionals
It is valuable content that can be searched by federated search.

6 Introduction
Components of a Federated Search System (diagram over Engine 1 ... Engine N):
(1) Resource Representation
(2) Resource Selection
(3) Results Merging

7 Introduction
Solutions of Federated Search
Browsing model: organize sources into a hierarchy and navigate manually (example from Invisible-web.net).

8 Introduction
Solutions of Federated Search
Information source recommendation: recommend information sources for users' text queries
- Useful when users want to browse the selected sources
- Contains the resource representation and resource selection components
Federated document retrieval: search the selected sources and merge the individual ranked lists
- The most complete solution
- Contains all of resource representation, resource selection, and results merging

9 Introduction
Modeling Federated Search
Application in the real world
- FedStats project: a web site connecting dozens of government agencies with uncooperative search engines
- Previously used a centralized solution (ad-hoc retrieval), but suffered badly from missing new information and broken links
- Requires a federated search solution: a prototype federated search solution for FedStats is under development at Carnegie Mellon University
- A good candidate for evaluating federated search algorithms
- But there are not enough relevance judgments and not enough control, so thorough simulation is also required

10 Introduction
Modeling Federated Search
TREC data
- Large text corpora, thorough queries and relevance judgments
- Often divided into O(100) information sources
- Professional, well-organized contents
Simulation with TREC news/government data
- Most commonly used, many baselines (Lu et al., 1996)(Callan, 2000)
- Simulates the environments of large companies or domain-specific Hidden Web sources
- Normal or moderately skewed size testbeds: Trec123 or Trec4_Kmeans
- Skewed testbeds: Representative (large source with the same relevant-doc density), Relevant (large source with higher relevant-doc density), Nonrelevant (large source with lower relevant-doc density)

11 Introduction
Modeling Federated Search
Simulating multiple types of search engines
- INQUERY: Bayesian inference network with the Okapi term formula; doc score range [0.4, 1]
- Language Model: generation probabilities of the query given the docs; doc score range [-60, -30] (log of the probabilities)
- Vector Space Model: SMART "lnc.ltc" weighting; doc score range [0.0, 1.0]
Federated search metrics
- Information source size estimation: error rate of the source size estimate
- Information source recommendation: High-Recall, select the information sources with the most relevant docs
- Federated doc retrieval: High-Precision at the top-ranked docs

12 Outline
- Introduction
- Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
- Future Research

13 Outline
- Introduction
- Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
- Future Research

14 Research Problems (Resource Representation)
Previous Research on Resource Representation
Resource descriptions of words and their occurrences
- STARTS protocol (Gravano et al., 1997): a cooperative protocol
- Query-Based Sampling (Callan et al., 1999): send random queries and analyze the returned docs; good for uncooperative environments
Centralized sample database: collect the docs returned by Query-Based Sampling (QBS)
- For query expansion (Ogilvie & Callan, 2001), not very successful
- Successfully utilized for other problems throughout this proposal
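
A minimal sketch of query-based sampling under assumed interfaces: `search(query, k)` stands in for an uncooperative source's search API (a hypothetical wrapper, not a real library call), and the sample grows by drawing later query terms from the documents already retrieved.

```python
import random
from collections import Counter

def query_based_sampling(search, seed_terms, num_queries=80, docs_per_query=4):
    """Build a resource description (term counts) plus a document sample."""
    sampled_docs = {}              # doc_id -> text
    vocabulary = Counter()         # term -> occurrences in the sample
    query_pool = list(seed_terms)  # start from a few seed terms

    for _ in range(num_queries):
        query = random.choice(query_pool)
        for doc_id, text in search(query, k=docs_per_query):
            if doc_id in sampled_docs:
                continue
            sampled_docs[doc_id] = text
            terms = text.lower().split()
            vocabulary.update(terms)
            query_pool.extend(terms)   # later queries come from sampled text
    return vocabulary, sampled_docs
```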

15 Research Problems (Resource Representation)
Previous Research on Resource Representation
Information source size estimation: important for resource selection and provides users with useful information
- Capture-Recapture model (Liu and Yu, 1999): use two sets of independent queries and analyze the overlap of the returned doc ids; but it requires a large number of interactions with the information sources
New information source size estimation algorithm
- Sample-Resample model (Si and Callan, 2003)
  Assumption: the search engine indicates the number of docs matching a one-term query
  Strategy: estimate the df of a term in the sampled docs; get the total df by resampling the query from the source; scale the number of sampled docs to estimate the source size
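
The scaling step can be written down directly. A hedged sketch of the Sample-Resample estimate, assuming `source_df(term)` stands in for the count of matching docs that the search engine reports for a one-term query:

```python
def sample_resample_size(sampled_docs, source_df, resample_terms):
    """Estimate source size: the fraction of sampled docs containing a term is
    assumed to match the fraction of all source docs containing it."""
    n_sample = len(sampled_docs)
    estimates = []
    for term in resample_terms:
        df_sample = sum(1 for text in sampled_docs.values()
                        if term in text.lower().split())
        if df_sample == 0:
            continue                    # term never seen in the sample; skip it
        df_source = source_df(term)     # df reported by the source's engine
        estimates.append(df_source * n_sample / df_sample)
    return sum(estimates) / len(estimates) if estimates else None
```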

16 Research Problems (Resource Representation)
Experiment Methodology
Methods are allowed the same number of transactions with a source.
Two scenarios compare the Capture-Recapture and Sample-Resample methods:
- Scenario 1 (component-level study): methods can not utilize data from Query-Based Sampling (QBS)
- Scenario 2 (combined with other components): methods can utilize data already acquired by QBS (80 sample queries acquiring 300 docs)
(Slide table: queries issued and documents downloaded by Capture-Recapture under each scenario and by Sample-Resample.)

17 Research Problems (Resource Representation)
Experiments
Measure: absolute error ratio (AER) between the estimated source size and the actual source size.
To conduct the component-level study:
- Capture-Recapture: about 385 queries (transactions)
- Sample-Resample: 80 queries and 300 docs for sampling + 5 resample queries = 385 transactions
Trec123-10Col collapses every 10th source of Trec123.

Average AER (lower is better):
                    Trec123   Trec123-10Col
Capture-Recapture   0.729     0.943
Sample-Resample     0.232     0.299
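
For concreteness, the error measure on this slide can be read as the following (an assumed but standard definition of absolute error ratio):

```python
def absolute_error_ratio(estimated_size, actual_size):
    # |estimate - truth| relative to the truth; the slide averages this over sources
    return abs(estimated_size - actual_size) / actual_size
```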

18 Outline
- Introduction
- Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
- Future Research

19 Research Problems (Resource Selection)
Goal of resource selection for information source recommendation
- High-Recall: select the (few) information sources that have the most relevant documents
Previous research: resource selection algorithms that need training data
- Decision-Theoretic Framework (DTF) (Nottelmann & Fuhr, 1999, 2003): incurs large human judgment costs
- Lightweight probes (Hawking & Thistlewaite, 1999): acquire training data in an online manner, with large communication costs

20 Research Problems (Resource Selection)
Previous Research on Resource Selection
"Big document" resource selection approach: treat information sources as big documents and rank them by their similarity to the user query
- Cue Validity Variance (CVV) (Yuwono & Lee, 1997)
- CORI (Bayesian inference network) (Callan, 1995)
- KL-divergence (Xu & Croft, 1999): calculate the KL divergence between the term distributions of the information sources and the user query
CORI and KL are the state of the art (French et al., 1999)(Craswell et al., 2000).
But the "big document" approach loses doc boundaries and does not optimize the goal of High-Recall.
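
To make the "big document" idea concrete, here is a hedged sketch of KL-divergence-style source ranking under assumed inputs (per-source term counts gathered by sampling). Ranking sources by the query's log-likelihood under each smoothed source language model is rank-equivalent to minimizing the KL divergence for a maximum-likelihood query model; the smoothing choice is an assumption, not the published formulation.

```python
import math
from collections import Counter

def kl_rank_sources(query_terms, source_vocabs, mu=1000):
    """source_vocabs: {source_id: Counter of term counts from sampled docs}."""
    background = Counter()
    for vocab in source_vocabs.values():
        background.update(vocab)
    bg_total = sum(background.values())

    scores = {}
    for source_id, vocab in source_vocabs.items():
        total = sum(vocab.values())
        score = 0.0
        for term in query_terms:
            p_bg = (background[term] + 1) / (bg_total + len(background))
            p_src = (vocab[term] + mu * p_bg) / (total + mu)  # Dirichlet smoothing
            score += math.log(p_src)                          # query log-likelihood
        scores[source_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```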

21 Research Problems (Resource Selection)
Previous Research on Resource Selection
Methods that turn away from the "big document" resource selection approach
- bGlOSS (Gravano et al., 1994) and vGlOSS (Gravano et al., 1999): consider the goodness of each doc in the sources, but use strong assumptions to calculate doc goodness
Thought: resource selection algorithms for information source recommendation need to optimize High-Recall, i.e., including the most relevant docs
Our strategy: estimate the percentage of relevant docs in each source and rank the sources accordingly — RElevant Document Distribution Estimation (ReDDE) resource selection

22 Research Problems (Resource Selection)
Relevant Document Distribution Estimation (ReDDE) Algorithm
- Source scale factor: estimated source size / number of sampled docs
- Assumption: "everything at the top is (equally) relevant" in the ranking on the Centralized Complete DB (CCDB)
- Problem: estimate the doc ranking on the Centralized Complete DB

23 Research Problems (Resource Selection)
ReDDE Algorithm (cont.)
(Diagram: Engine 1 ... Engine N → Resource Representation → Centralized Sample DB → Resource Selection, with CSDB and CCDB rankings and a threshold.)
- In resource representation: build representations by QBS and collapse the sampled docs into a centralized sample DB (CSDB)
- In resource selection: construct the ranking on the CCDB from the ranking on the CSDB, up to a threshold
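
A hedged sketch of the ReDDE selection rule under the assumptions stated on slide 22: each sampled doc in the centralized (CSDB) ranking stands in for `scale` docs of its source, and every doc estimated to fall above a cutoff of the CCDB ranking is counted as (equally) relevant. Names and the cutoff parameter are illustrative.

```python
def redde_rank_sources(csdb_ranking, doc_source, sample_size, est_source_size,
                       cutoff_ratio=0.003):
    """csdb_ranking: doc ids ranked by a centralized retrieval run over the CSDB;
    doc_source: doc id -> source id; sizes are per-source dicts."""
    scale = {s: est_source_size[s] / sample_size[s] for s in est_source_size}
    cutoff = cutoff_ratio * sum(est_source_size.values())

    rel_count = {s: 0.0 for s in est_source_size}
    ccdb_position = 0.0
    for doc_id in csdb_ranking:
        if ccdb_position > cutoff:
            break                       # past the "top" of the estimated CCDB ranking
        s = doc_source[doc_id]
        rel_count[s] += scale[s]        # each sampled doc stands for scale[s] docs
        ccdb_position += scale[s]       # advance the estimated CCDB rank
    return sorted(rel_count, key=rel_count.get, reverse=True)
```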

24 Research Problems (Resource Selection)
Experiments on testbeds with uniform or moderately skewed source sizes
(Figures: recall of the evaluated ranking against the desired ranking.)

25 Research Problems (Resource Selection)
Experiments on testbeds with skewed source sizes

26 Outline
- Introduction
- Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
- Future Research

27 Research Problems (Results Merging)
Goal of Results Merging: make the different result lists comparable and merge them into a single list
Difficulties:
- Information sources may use different retrieval algorithms
- Information sources have different corpus statistics
Previous Research on Results Merging
The most accurate methods directly calculate comparable scores:
- Use the same retrieval algorithm and the same corpus statistics (Viles & French, 1997)(Xu and Callan, 1998); needs source cooperation
- Download the retrieved docs and recalculate their scores (Kirsch, 1997); large communication and computation costs

28 Research Problems (Results Merging)
Previous Research on Results Merging
Methods that approximate comparable scores
- Round Robin (Voorhees et al., 1997): only uses source rank and doc rank information; fast but less effective
- CORI merging formula (Callan et al., 1995): a linear combination of doc scores and source scores
  - Works in uncooperative environments; effective but needs improvement
  - Uses a linear transformation, a hint for other methods
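
A hedged sketch of CORI-style merging: doc scores and source scores are min-max normalized and then combined linearly. The 0.4/1.4 constants follow the commonly cited form of the heuristic and should be treated as an assumption here, not as the exact formula used in these experiments.

```python
def _minmax(x, lo, hi):
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def cori_merge(result_lists, source_scores):
    """result_lists: {source_id: [(doc_id, doc_score), ...]};
    source_scores: {source_id: resource-selection score}."""
    s_lo, s_hi = min(source_scores.values()), max(source_scores.values())
    merged = []
    for source_id, docs in result_lists.items():
        if not docs:
            continue
        c = _minmax(source_scores[source_id], s_lo, s_hi)
        d_lo = min(score for _, score in docs)
        d_hi = max(score for _, score in docs)
        for doc_id, score in docs:
            d = _minmax(score, d_lo, d_hi)
            merged.append((doc_id, source_id, (d + 0.4 * d * c) / 1.4))
    return sorted(merged, key=lambda item: item[2], reverse=True)
```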

29 Research Problems (Results Merging)
Thought: previous algorithms either calculate the centralized scores directly or mimic their effect. Can we estimate the centralized scores effectively and efficiently?
Semi-Supervised Learning (SSL) Merging (Si & Callan, 2002, 2003)
- Some docs exist both in the centralized sample DB and in the retrieved docs
- A linear transformation maps source-specific doc scores to source-independent scores on the centralized sample DB
- Training pairs come from the centralized sample DB and the individual ranked lists when long ranked lists are available; only a minimum number of docs are downloaded when only short ranked lists are available

30 Research Problems (Results Merging)
SSL Results Merging (cont.)
(Diagram: Engine 1 ... Engine N → Resource Representation → Centralized Sample DB → Resource Selection → CSDB ranking, overlap docs → final results.)
- In resource representation: build representations by QBS and collapse the sampled docs into a centralized sample DB
- In resource selection: rank the sources and calculate centralized scores for the docs in the centralized sample DB
- In results merging: find the overlap docs, build linear models, and estimate centralized scores for all docs
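
A hedged sketch of the SSL merge step under assumed inputs: `centralized_scores` holds the scores from running the query against the centralized sample DB, and a per-source least-squares line fitted on the overlap docs maps every returned score onto that comparable scale.

```python
def ssl_merge(result_lists, centralized_scores):
    """result_lists: {source_id: [(doc_id, source_score), ...]};
    centralized_scores: {doc_id: score from retrieval on the sample DB}."""
    merged = []
    for source_id, docs in result_lists.items():
        pairs = [(s, centralized_scores[d]) for d, s in docs
                 if d in centralized_scores]
        if len(pairs) < 2:
            continue                     # too few overlap docs to fit a line
        # least-squares fit of centralized = a * source_score + b
        n = len(pairs)
        sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
        sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
        denom = n * sxx - sx * sx
        a = (n * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / n
        for doc_id, score in docs:
            merged.append((doc_id, source_id, a * score + b))
    return sorted(merged, key=lambda item: item[2], reverse=True)
```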

31 Research Problems (Results Merging)
Experiments on Trec123 and Trec4-kmeans, with 3 and 10 sources selected and 50 docs retrieved from each source; SSL downloads the minimum number of docs needed for training.
(Results tables omitted.)

32 Outline
- Introduction
- Research Problems: the state of the art and preliminary research
  - Resource Representation
  - Resource Selection
  - Results Merging
  - A Unified Framework
- Future Research

33 Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization Framework
Integrate and adjust the individual components of federated search to get the globally desired results for different applications, rather than simply combining individually effective components.
High-Recall vs. High-Precision
- High-Recall: select sources that contain as many relevant docs as possible (information source recommendation)
- High-Precision: select sources that return many relevant docs at the top of their ranked lists (federated document retrieval)
They are correlated but NOT identical; previous research does NOT distinguish them.

34 Research Problems (Unified Utility Framework)
UUM Framework
(Diagram: Engine 1 ... Engine N → Resource Representation → Centralized Sample DB → Resource Selection, with plots of probability of relevance vs. centralized score and of centralized doc score vs. doc rank.)
- In resource representation: build the representations and the CSDB; build a logistic model on the CSDB that maps centralized doc scores to probabilities of relevance
- In resource selection: use piecewise interpolation over the centralized-score-vs-rank curve to estimate centralized scores for all docs, then calculate the probability of relevance for every doc in every available source (the probability of relevance of the j-th doc from the i-th source)
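
A hedged sketch of the two estimated mappings described above, with assumed functional forms: piecewise-linear interpolation from doc rank to centralized score, anchored at the sampled docs whose centralized scores are known, and a logistic curve from centralized score to probability of relevance whose coefficients are assumed to have been fit on the CSDB.

```python
import math

def interpolate_score(rank, anchor_points):
    """anchor_points: [(rank, centralized_score), ...] sorted by rank."""
    if rank <= anchor_points[0][0]:
        return anchor_points[0][1]
    for (r1, s1), (r2, s2) in zip(anchor_points, anchor_points[1:]):
        if r1 <= rank <= r2:
            t = (rank - r1) / (r2 - r1)
            return s1 + t * (s2 - s1)
    return anchor_points[-1][1]          # beyond the last anchor: flat tail

def prob_of_relevance(centralized_score, a, b):
    # logistic model with coefficients a, b learned on the centralized sample DB
    return 1.0 / (1.0 + math.exp(-(a * centralized_score + b)))
```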

35 Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM): basic framework
- A selection decision indicates the number of docs to retrieve from each source
- The estimated probabilities of relevance for all docs are uncertain; their distribution is conditioned on all the available resource information (resource descriptions and centralized retrieval scores)
- A utility function gives the utility gained by making a selection when the estimated probabilities of relevance are correct
- The desired solution maximizes the expected utility over that distribution; it is approximated by maximizing the utility at the MAP estimate of the probabilities of relevance

36 Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM): resource selection for information source recommendation
High-Recall goal: select sources that contain as many relevant docs as possible
- Utility: the number of relevant docs in the selected sources, given the number of sources to select
- Solution: rank the sources by the number of relevant docs they contain
Called the Unified Utility Maximization framework for High-Recall (UUM/HR)
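
A minimal sketch of the UUM/HR selection rule, assuming the per-doc probabilities of relevance from the previous step are available: the expected number of relevant docs in a source is the sum of its docs' probabilities, and the top sources by that expectation are selected.

```python
def uum_high_recall(prob_of_rel, num_sources_to_select):
    """prob_of_rel: {source_id: [p_1, p_2, ...]} probabilities for all docs."""
    expected_rel = {s: sum(ps) for s, ps in prob_of_rel.items()}
    ranked = sorted(expected_rel, key=expected_rel.get, reverse=True)
    return ranked[:num_sources_to_select]
```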

37 Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM): resource selection for federated document retrieval
High-Precision goal: select sources that return many relevant docs in the top part of their ranked lists
- Utility: the number of relevant docs in the top part of each source, given the number of sources to select and a fixed number of docs retrieved from each
- Solution: rank the sources by the number of relevant docs in their top part
Called the Unified Utility Maximization framework for High-Precision with Fixed Length (UUM/HP-FL)
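
A minimal sketch under the same assumptions, differing from UUM/HR only in that the expectation is taken over the fixed-length top part that will actually be retrieved from each selected source.

```python
def uum_high_precision_fixed(prob_of_rel, num_sources_to_select, docs_per_source):
    """prob_of_rel lists are assumed ordered by each source's estimated ranking."""
    expected_top = {s: sum(ps[:docs_per_source]) for s, ps in prob_of_rel.items()}
    ranked = sorted(expected_top, key=expected_top.get, reverse=True)
    return ranked[:num_sources_to_select]
```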

38 Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM): resource selection for federated document retrieval
A variant selects a variable number of docs from the selected sources
- Utility: the number of relevant docs retrieved, given the total number of documents to select, with a variable number of docs retrieved from each source
- Solution: no simple closed form; solved by dynamic programming
Called the Unified Utility Maximization framework for High-Precision with Variable Length (UUM/HP-VL)
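
A hedged sketch of a dynamic program of the kind this slide refers to: allocate a total retrieval budget across sources so that the summed expected relevant docs in the retrieved prefixes is maximized. The exact constraints of UUM/HP-VL (for example, a cap on the number of selected sources) are omitted here for brevity.

```python
def uum_high_precision_variable(prob_of_rel, total_docs):
    # prefix[s][d]: expected relevant docs when d docs are taken from source s
    prefix = {}
    for s, probs in prob_of_rel.items():
        row = [0.0]
        for p in probs:
            row.append(row[-1] + p)
        prefix[s] = row

    NEG = float("-inf")
    best = [0.0] + [NEG] * total_docs        # best[b]: max utility using exactly b docs
    alloc = [{} for _ in range(total_docs + 1)]
    for s, row in prefix.items():
        for b in range(total_docs, -1, -1):  # visit budgets downward so each source
            if best[b] == NEG:               # is added at most once per solution
                continue
            max_d = min(len(row) - 1, total_docs - b)
            for d in range(1, max_d + 1):
                if best[b] + row[d] > best[b + d]:
                    best[b + d] = best[b] + row[d]
                    new_alloc = dict(alloc[b])
                    new_alloc[s] = d
                    alloc[b + d] = new_alloc
    b_star = max(range(total_docs + 1), key=lambda b: best[b])
    return alloc[b_star]                     # {source_id: docs to retrieve from it}
```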

39 Research Problems (Unified Utility Framework)
Experiments: resource selection for information source recommendation

40 Research Problems (Unified Utility Framework)
Experiments: resource selection for information source recommendation

41 Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval, on the Trec123 and Representative testbeds, with 3 and 10 sources selected; results merged with SSL.
(Results tables omitted.)

42 Outline
- Introduction: introduction to federated search
- Research Problems: the state of the art and preliminary research
- Future Research: dissertation research and expected contribution

43 Future Research (Dissertation Research)
Purpose
- More experiments to study the effectiveness of federated search algorithms
- Extend the proposed federated search algorithms to better simulate operational environments
Areas: information source size estimation, resource selection, results merging, unified utility maximization framework

44 Future Research (Dissertation Research)
Information Source Size Estimation
More experiments to study the Sample-Resample algorithm
- Effects of more resample queries: will they improve estimation accuracy?
- Resample query characteristics: which is better, low df or high df?
- Sample-Resample estimation of larger sources (e.g., 300,000 docs)
Sample-Resample without available document frequency information
- The basic Sample-Resample algorithm needs document frequency information from the sources, which may not be available in operational environments
- Estimate document frequencies from the overlap of the sampled docs and the results retrieved from the source

45 Future Research (Dissertation Research)
Resource Selection for Information Source Recommendation
- High-Recall measures the total number of relevant docs contained in the information sources, but users may only care about the top-ranked docs in every source
- A High-Precision variant for source recommendation; the UUM/HP algorithms are candidate solutions
- Source retrieval effectiveness may also need to be considered; discussed later together with new research on the unified utility maximization framework

46 Future Research (Dissertation Research)
Results Merging
Semi-Supervised Learning (SSL) with only rank information
- The basic SSL algorithm transforms source-specific doc scores into source-independent scores, but sometimes only the doc ranking is available (e.g., most search engines in FedStats do NOT return doc scores)
- Extend SSL by generating pseudo doc scores from doc rank information
Study the difference between the SSL algorithm and a desired merging algorithm
- Compare the results-merging effectiveness of the SSL algorithm with an algorithm that merges using the actual centralized doc scores
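
One simple possibility for the rank-to-score step, shown only as an illustrative assumption (the proposal leaves the exact mapping open): a fixed linear decay turns ranks into pseudo scores that can then be fed to the same SSL linear-model fit used for real scores.

```python
def pseudo_scores_from_ranks(ranked_doc_ids):
    """Map a ranked list of doc ids to (doc_id, pseudo_score) pairs."""
    n = len(ranked_doc_ids)
    return [(doc_id, 1.0 - rank / n) for rank, doc_id in enumerate(ranked_doc_ids)]
```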

47 Future Research (Dissertation Research)
Unified Utility Maximization Framework
Weighted High-Precision criterion
- The current High-Precision criterion assigns equal weights to the top-ranked docs, but different top-ranked docs make different contributions (e.g., the 1st vs. the 500th), and users pay different amounts of attention to them
- Partial trec_eval results illustrate how precision falls with depth:
  At 5 docs: 0.3640; at 10 docs: 0.3360; at 15 docs: 0.3253; at 20 docs: 0.3140; at 30 docs: 0.2780; at 100 docs: 0.1666; at 200 docs: 0.0833; at 500 docs: 0.0333
→ a new weighted High-Precision goal in the UUM framework
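
A plausible weighted High-Precision utility, given only as an illustrative assumption since the proposal leaves the exact weighting open: discount each rank position, for example with a 1/log2(rank+1)-style weight, and sum the weighted probabilities of relevance over the retrieved prefix.

```python
import math

def weighted_high_precision_utility(probs_in_rank_order, depth):
    # r = 0 is the top-ranked doc; deeper ranks contribute less to the utility
    return sum(p / math.log2(r + 2)
               for r, p in enumerate(probs_in_rank_order[:depth]))
```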

48 Future Research (Dissertation Research)
Unified Utility Maximization Framework
Incorporate the impact of source retrieval effectiveness into the framework
- Current solutions do not consider source retrieval effectiveness, but bad search engines may not return any relevant docs even when they contain many
- Important in operational environments (e.g., the PubMed system uses less effective unranked retrieval)
- Idea: measure source retrieval effectiveness by its agreement with centralized retrieval results; one possibility is a noise model of effectiveness

49 Future Research (Expected Contribution)
Expected Contribution
Propose more theoretically solid and effective solutions to the full range of federated search problems
- Sample-Resample source size estimation vs. Capture-Recapture
- RElevant Document Distribution Estimation (ReDDE) resource selection vs. the "big document" approach
- Semi-Supervised Learning (SSL) results merging vs. the CORI formula

50 Future Research (Expected Contribution)
Expected Contribution
Propose the Unified Utility Maximization Framework to integrate the separate solutions
- The first probabilistic framework to integrate the different components together
- Allows a better opportunity to utilize the available information (e.g., the information in the centralized sample database)
- Enables configuring the individual components globally for the desired overall results, rather than simply combining them together

51 Future Research (Expected Contribution)
Expected Contribution
Federated search has been a hot research area over the last decade
- Most previous research is tied to the "big document" approach
The new research advances the state of the art:
- A more theoretically solid foundation
- More empirically effective
- Better models of real-world applications
A bridge from cool research to practical tool.

52 Future Research (Schedule)
July 2004 – Aug. 2004
- Analyze and develop a federated search testbed with TREC Web data
Sep. 2004 – Dec. 2004
- Experiments to study the behavior of the Sample-Resample algorithm
- A new Sample-Resample source size estimation algorithm without available document frequency information
Jan. 2005 – Apr. 2005
- SSL algorithm without returned document scores
- Influence of component accuracy on the overall results of the federated search task

53 Future Research (Schedule)
May 2005 – Aug. 2005
- Utility maximization framework with a weighted High-Precision goal
- Utility maximization framework with consideration of source retrieval effectiveness
Sep. 2005 – Dec. 2005
- Analyze the results, summarize, and write up the thesis

