Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections
MS Thesis Defense
Bhaumik Chokshi
Committee Members: Prof. Subbarao Kambhampati (Chair), Prof. Yi Chen, Prof. Hasan Davulcu

My MS Work
- Collection Selection: ROSCO
- Query Processing over Incomplete Autonomous Databases: QPIAD
- Handling Query Imprecision and Data Incompleteness: QUIC

Multi-Source Information Retrieval
In the multi-source information retrieval problem, searching every information source is not efficient. The retrieval system must choose a single collection or a subset of collections to call to answer a given query.

Overlapping Collections
Many real-world collections have significant overlap. For example, multiple bibliography collections (e.g., ACM DL, IEEE, DBLP) may store some of the same papers, and multiple news archives (e.g., New York Times, Washington Post) may store very similar news stories.
[Venn diagram: overlapping bibliography collections CSB, IEEE, ACM, DBLP, Science]
Collection selection therefore has to consider two questions:
- How likely is it that a given collection has documents relevant to the query?
- Will a collection provide novel results given the collections already selected?

Related Work
- Most collection selection approaches do not consider overlap. Existing systems like CORI and ReDDE try to create a representative for each collection based on term and document frequency information.
- ReDDE uses collection samples to estimate the relevance of each collection. The same samples can be used to estimate overlap among collections.
- 16.6% of the documents in runs submitted to the TREC 2004 terabyte track were redundant. [Bernstein and Zobel, 2005]
- Coverage and overlap statistics have been used in the context of relational data sources. [Nie and Kambhampati, 2004] Overlap among tuples can be identified in a much more straightforward way than overlap among text documents.

Challenges Involved
- Need for query-specific overlap: two collections may have low overlap as a whole but high overlap for a particular set of queries.
- Offline vs. online overlap assessment: an offline approach can store statistics for general keywords and map an incoming query to these keywords to obtain relevance and overlap statistics; an online approach can use collection samples to estimate relevance and overlap statistics at query time.
- Efficiently determining true overlap between collections: true overlap can be estimated using result-to-result comparison across collections.

Context of This Work
- COSCO takes overlap into account while determining the collection order, but it does so offline. Samples built for the collections can instead be used to estimate overlap statistics online, which can yield a better estimate because it is specific to the query at hand.
- COSCO estimates overlap using bag similarity over result-set documents. True overlap between collections can be obtained using result-to-result comparison.
- COSCO was not evaluated on TREC data.

Contributions
- ROSCO, an online approach that estimates overlap statistics from the samples of the collections.
- A comparison of offline (COSCO) and online (ROSCO) approaches to statistics estimation for text retrieval from overlapping collections.

Outline
- COSCO and ROSCO Architecture
- ROSCO Approach
- Empirical Evaluation
- Other Contributions
- Conclusion

COSCO Architecture

ROSCO Architecture

Outline
- COSCO and ROSCO Architecture
- ROSCO Approach
- Empirical Evaluation
- Other Contributions
- Conclusion

ROSCO (Offline Component): Collection Representation Through Query-Based Sampling
[Diagram: training queries are issued to collections C1 and C2 to build samples S1 and S2, which are merged into a union of samples.]
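As a rough illustration, here is a minimal sketch of query-based sampling under assumed interfaces: search_fn (the collection's search API) and query_pool (the training queries) are hypothetical names, and the thesis's exact sampling procedure may differ.

```python
import random

def query_based_sample(search_fn, query_pool, sample_size, results_per_query=4):
    """Build a sample of a collection by issuing training queries to its
    search interface and keeping the returned documents.

    search_fn(query) -> list of (doc_id, text) pairs from the collection;
    query_pool is a pool of candidate training queries. Both names are
    illustrative assumptions, not the thesis's actual interface.
    """
    sample = {}
    queries = list(query_pool)
    random.shuffle(queries)
    for q in queries:
        for doc_id, text in search_fn(q)[:results_per_query]:
            sample[doc_id] = text  # de-duplicate by document id
        if len(sample) >= sample_size:
            break
    return sample

# The union of samples is then simply the merge of the per-collection
# samples, e.g. union_sample = {**sample_c1, **sample_c2}.
```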

ROSCO (Offline Component): Collection Size Estimation
[Diagram: random queries are issued both to each collection C_i and to its sample S_i; the number of documents returned from the collection and the number returned from the sample feed the size estimates.]
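This resembles ReDDE-style sample-resample estimation. The sketch below assumes the size estimate scales the sample size by the ratio of collection hits to sample hits, averaged over the probe queries; that formula is an inference from the diagram labels, not a statement from the thesis.

```python
def estimate_collection_size(collection_hits, sample_hits, sample_size):
    """Sample-resample style size estimate (assumption based on the
    diagram). For each random probe query q, collection_hits[q] is the
    number of documents the full collection reports and sample_hits[q]
    is the number its sample returns; scaling the sample size by their
    ratio gives one estimate, and we average over probes. All argument
    names are illustrative."""
    estimates = [
        sample_size * collection_hits[q] / sample_hits[q]
        for q in collection_hits
        if sample_hits.get(q, 0) > 0
    ]
    return sum(estimates) / len(estimates) if estimates else None
```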

ROSCO (Offline Component): Grainy Hash Vector
[Diagram: each sampled document is hashed into a grainy hash vector (GHV) made up of n small hashes of w bits each.]
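A minimal sketch of one plausible GHV construction: chunk the document and keep the low-order bits of each chunk's hash. The chunking and hashing details are assumptions; the 32 × 2-bit layout and the zero-mismatch duplicate test match the experimental setup described later.

```python
import hashlib

GRAINS, BITS_PER_GRAIN = 32, 2   # 32 grains x 2 bits = 64-bit GHV, as in the experiments

def grainy_hash_vector(text, grains=GRAINS, bits=BITS_PER_GRAIN):
    """Split the document into `grains` roughly equal word chunks and
    keep `bits` low-order bits of each chunk's hash. This chunk-based
    construction is one plausible reading of the GHV slide, not
    necessarily the thesis's exact recipe."""
    words = text.split()
    chunk = max(1, len(words) // grains)
    ghv = []
    for i in range(grains):
        piece = " ".join(words[i * chunk:(i + 1) * chunk])
        h = int(hashlib.md5(piece.encode()).hexdigest(), 16)
        ghv.append(h & ((1 << bits) - 1))   # keep the low-order bits
    return ghv

def is_duplicate(ghv_a, ghv_b, max_mismatches=0):
    """Treat two documents as duplicates when at most `max_mismatches`
    grains differ (0 for exact duplicates, matching the setup used in
    the experiments)."""
    mismatches = sum(a != b for a, b in zip(ghv_a, ghv_b))
    return mismatches <= max_mismatches
```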

ROSCO (Online Component): Assessing Relevance
[Diagram: the incoming query is run against the union of samples; using the size estimates, the top-k relevant sample documents are determined for each collection, yielding the top-k documents per collection.]
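A minimal sketch of this step, assuming the union sample is keyed by (collection, document) and score_fn is some relevance scorer (e.g., a tf-idf cosine score); both names are illustrative, and the size-estimate scaling is omitted for brevity.

```python
from collections import defaultdict

def topk_per_collection(score_fn, union_sample, query, k):
    """Rank the union-of-samples documents against the query and keep
    the top-k scoring sample documents for each collection.

    union_sample maps (collection_id, doc_id) -> text; score_fn(query,
    text) is an assumed relevance scorer. Both names are illustrative.
    """
    ranked = defaultdict(list)
    for (collection_id, doc_id), text in union_sample.items():
        ranked[collection_id].append((score_fn(query, text), doc_id))
    # Sort each collection's sample documents by score, keep the top k.
    return {c: sorted(docs, reverse=True)[:k] for c, docs in ranked.items()}
```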

ROSCO (Online Component): Assessing Overlap and Combining with Relevance
[Diagram: using the size estimates, the number of relevant new documents is estimated for each collection by comparing the GHVs of its top-k documents against the GHVs of documents from the collections selected so far; the collection with the maximum number of new relevant documents is selected next.]
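A minimal sketch of the greedy, overlap-aware ordering this diagram describes, reusing the zero-mismatch GHV test from the sketch above; the data-structure names are illustrative, and the size-estimate scaling of the novelty counts is again omitted for brevity.

```python
def select_collections(topk, ghvs, num_to_select, max_mismatches=0):
    """Greedy overlap-aware collection ordering: at each step pick the
    collection whose top-k documents add the most *new* documents,
    i.e. documents whose GHVs do not duplicate anything already covered
    by the collections chosen so far. topk[c] is the list of top-k doc
    ids for collection c and ghvs[d] the GHV of document d; both names
    are illustrative."""
    def is_dup(g1, g2):
        # Same grain-mismatch test as in the GHV sketch above.
        return sum(a != b for a, b in zip(g1, g2)) <= max_mismatches

    selected, seen = [], []   # seen = GHVs of documents already covered
    remaining = set(topk)
    while remaining and len(selected) < num_to_select:
        def novelty(c):
            return sum(not any(is_dup(ghvs[d], g) for g in seen)
                       for d in topk[c])
        best = max(remaining, key=novelty)
        selected.append(best)
        seen.extend(ghvs[d] for d in topk[best])
        remaining.remove(best)
    return selected
```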

Comparison of ROSCO and COSCO
COSCO:
- An offline method for estimating coverage and overlap statistics.
- Obtains an estimate for a query by using statistics for the corresponding frequent item sets. For example, statistics for "data mining integration" can be obtained by combining statistics for "data mining" and "data integration".
- This way of computing statistics can produce estimates that differ substantially from the actual statistics.
ROSCO:
- An online method for estimating coverage and overlap statistics.
- Obtains an estimate by sending the query to the samples, which can give a better estimate for the particular query at hand.
- The success of this approach depends on the quality of the samples; it can sometimes be hard to obtain a good sample of a collection.

Outline
- ROSCO and COSCO Architecture
- ROSCO Approach
- Empirical Evaluation
- Other Contributions
- Conclusion

Empirical Evaluation
- Can ROSCO perform better than approaches that do not consider overlap, in an environment of overlapping text collections?
- How do ROSCO and COSCO compare in the presence of overlap among collections?

Testbed Creation
Test data:
- TREC Genomics data.
- 50 queries with their relevance judgments.
Testbed creation:
- 100 disjoint clusters built from 200,000 documents to create topic-specific collections.
- uniform-50cols: 50 collections. Each of the 200,000 documents is randomly assigned to 10 different collections, for a total of 2 million documents.
- skewed-100cols: 100 collections. Each of the 100 clusters is randomly assigned to 10 different collections, for a total of 2 million documents. Because each cluster is assigned to multiple collections, topic-specific overlap among collections is more prominent in this testbed than in uniform-50cols.

Collection Size and Relevance Statistics
[Charts: collection size and relevance statistics for Testbed 1 (uniform-50cols) and Testbed 2 (skewed-100cols).]

Collection Overlap Statistics
[Charts: collection overlap statistics for uniform-50cols and skewed-100cols.]

Tested Methods
- COSCO, ReDDE, and ROSCO.
- Greedy Ideal, for establishing a performance bound.
Setting up COSCO:
- 40 training queries to each collection.
Setting up ROSCO and ReDDE:
- Training queries: 25 queries for each collection.
- Sample size: 10% of the actual collections.
- 10 size estimates.
- Duplicate detection: a GHV containing 32 hashes of 2 bits each (64 bits total).
- Mismatches allowed: 0, i.e., only exact duplicates are flagged.
Evaluation:
- Recall after each collection is called (central evaluation and TREC evaluation).
- Processing time.

Greedy Ideal
This method greedily maximizes percentage recall assuming oracular information. It is used to establish a performance bound and serves as the baseline ranking method in the evaluation.

Experimental Results (Central Evaluation)
- 10 evaluation queries, distinct from the training queries; 5-fold cross validation.
- Evaluation metric: R_k = (Σ_{i=1..k} Rel(m_i)) / (Σ_{i=1..k} Rel(b_i)), where m_1..m_k is the ranking produced by a particular method, b_1..b_k is the ranking produced by the baseline method (Greedy Ideal), and Rel(c) is the number of relevant documents in collection c.
- For both testbeds, ROSCO performs better than ReDDE and COSCO by 7-8% in terms of the recall metric R.
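A minimal sketch of the metric under the reconstruction above; the per-collection relevant-document counts and the ranking lists are illustrative names.

```python
def recall_metric(method_ranking, baseline_ranking, relevant_counts, k):
    """R_k: relevant documents covered by the method's top-k collections,
    divided by those covered by the baseline's top-k collections.
    relevant_counts[c] is the number of relevant documents in collection
    c; all argument names are illustrative."""
    def covered(ranking):
        return sum(relevant_counts[c] for c in ranking[:k])
    return covered(method_ranking) / covered(baseline_ranking)
```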

Experimental Results (TREC Evaluation)
- For both testbeds, ROSCO performs better than ReDDE and COSCO in terms of the recall metric R.
- Because the skewed-100cols testbed is created from topic-specific clusters, ROSCO shows more improvement over the other approaches there than on the uniform-50cols testbed.

Experimental Results (Processing Cost)
- Processing time for ReDDE and ROSCO is higher than for COSCO.
- However, ReDDE and ROSCO call fewer collections for the same amount of recall.

Summary of Experimental Results
- Evaluated ROSCO, ReDDE, and COSCO on two different testbeds with overlapping collections.
- ROSCO improves over ReDDE and COSCO by 7-8% in the central evaluations on both testbeds, and in the TREC evaluation by 3-5% on uniform-50cols and 8-10% on skewed-100cols.
- Processing time for ReDDE and ROSCO is higher than for COSCO, but they call fewer collections for the same amount of recall.

Outline
- ROSCO and COSCO Architecture
- ROSCO Approach
- Empirical Evaluation
- Other Contributions
- Conclusion

Other Contributions (QPIAD Project)
F-measure based query rewriting for incomplete autonomous web databases.
Given a query Q: (Body Style = Convt), retrieve all relevant tuples.

Base relation:
Id | Make    | Model   | Year | Body
---+---------+---------+------+------
1  | Audi    | A4      | 2001 | Convt
2  | BMW     | Z4      | 2002 | Convt
3  | Porsche | Boxster | 2005 | Convt
4  | BMW     | Z4      | 2003 | NULL
5  | Honda   | Civic   | 2004 | NULL
6  | Toyota  | Camry   | 2002 | Sedan
7  | Audi    | A4      | 2006 | NULL

Certain answers (Body = Convt):
Id | Make    | Model   | Year | Body
---+---------+---------+------+------
1  | Audi    | A4      | 2001 | Convt
2  | BMW     | Z4      | 2002 | Convt
3  | Porsche | Boxster | 2005 | Convt

Using the AFD Model ~> Body style, the query is rewritten as Q1': Model = A4, Q2': Model = Z4, Q3': Model = Boxster; the rewritten queries are re-ordered based on estimated precision, and the top-K rewritten queries are selected.

Ranked relevant uncertain answers:
Id | Make | Model | Year | Body | Confidence
---+------+-------+------+------+-----------
4  | BMW  | Z4    | 2003 | NULL | 0.7
7  | Audi | A4    | 2006 | NULL | 0.3

Other Contributions (QPIAD Project)
F-measure based query rewriting for incomplete autonomous web databases.
- Sources may impose resource limitations on the number of queries we can issue.
- Therefore, we should select only the top-K rewritten queries while ensuring a proper balance between precision and recall.
- Solution: use F-measure based selection with a configurable alpha parameter, where P is the estimated precision and R the estimated recall (based on P and estimated selectivity). With α = 1, precision and recall are weighted equally; with α > 1, recall is weighted more heavily (P < R is tolerated).
Co-author on the VLDB 2007 research paper.
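A minimal sketch of this selection step, assuming the standard alpha-parameterized F-measure F_α = (1 + α)·P·R / (α·P + R), which matches the alpha behavior stated on the slide but is not quoted from the thesis; the rewrite tuples and their estimates are made-up illustrations.

```python
def f_measure(precision, recall, alpha=1.0):
    """Alpha-weighted F-measure: alpha = 1 weighs precision and recall
    equally; alpha > 1 shifts the weight toward recall. This standard
    parameterization is an assumption, chosen to match the slide."""
    if precision + recall == 0:
        return 0.0
    return (1 + alpha) * precision * recall / (alpha * precision + recall)

def select_top_k_rewrites(rewrites, k, alpha=1.0):
    """Pick the top-K rewritten queries by F-measure, given each
    rewrite's estimated precision and estimated recall. `rewrites` is a
    list of (query, est_precision, est_recall) tuples (illustrative)."""
    scored = sorted(rewrites,
                    key=lambda r: f_measure(r[1], r[2], alpha),
                    reverse=True)
    return [q for q, _, _ in scored[:k]]

# Example: favor recall (alpha = 2) when the source allows few queries.
rewrites = [("Model=A4", 0.9, 0.2),
            ("Model=Z4", 0.6, 0.5),
            ("Model=Boxster", 0.4, 0.7)]
print(select_top_k_rewrites(rewrites, k=2, alpha=2.0))
```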

Other Contributions (QUIC Project)
Handling unconstrained attributes in the presence of query imprecision and data incompleteness.
- Given a query Q: Model = Civic, an Accord with Sedan body style may be more relevant than a Civic with Coupe body style.
- Tuples matching the user query can be ranked based on unconstrained attributes. [Surajit Chaudhuri, Gautam Das, Vagelis Hristidis and Gerhard Weikum, 2004]
- In the absence of a query log, relevance for unconstrained attributes can be approximated from the database itself.
- A user study (10 queries, 13 users) showed that the approach considering unconstrained attributes performs better than the one ignoring unconstrained attributes.
Co-author on the CIDR 2007 demo paper.

Outline
- ROSCO and COSCO Architecture
- ROSCO Approach
- Empirical Evaluation
- Other Contributions
- Conclusion

Conclusion
- ROSCO, an online method for overlap estimation.
- A comparison of offline and online approaches for text retrieval in an environment composed of overlapping collections.
- The empirical evaluation shows that the online method for overlap estimation performs better than both the offline method and a method that does not consider overlap among collections.
- Co-author on two other works appearing in CIDR 2007 and VLDB 2007.