A Logistic Regression Approach to Distributed IR
Ray R. Larson, School of Information Management & Systems, University of California, Berkeley
ray@sherlock.berkeley.edu

The Problem

Hundreds or thousands of servers host databases that range widely in content, topic, and format:
- Broadcast search is expensive, both in bandwidth and in the processing of too many irrelevant results.
- How do we select the "best" databases to search: what to search first, and which to search next?
- There are topical and domain constraints on the search selections.
- Database contents vary (metadata only, full text, ...).

Distributed IR Tasks

- Resource Description: how to collect metadata about digital libraries and their collections or databases.
- Resource Selection: how to select relevant digital library collections or databases from a large number of databases.
- Distributed Search: how to perform parallel or sequential searching over the selected digital library databases.
- Data Fusion: how to merge query results from different digital libraries, with their different search engines, differing record structures, etc.

Our Approach Using Z39.50

[Architecture diagram: a MetaSearch server issues Z39.50 Explain and SCAN queries over the Internet to the search engines of individual databases (DB1-DB6) and maps the results into a distributed index; queries are routed through a hierarchy of general servers, meta-topical servers, replicated servers, and individual database servers.]

MetaSearch

A new approach to building metasearch based on Z39.50. Instead of using broadcast search, we use two Z39.50 services:
- identification of database metadata using Z39.50 Explain, and
- extraction of distributed indexes using Z39.50 SCAN,
with "collection documents" created from the extracted index contents.

Evaluation questions:
- How efficiently can we build distributed indexes?
- How effectively can we choose databases using the index?
- How effective is merging search results from multiple sources?
- Do hierarchies of servers (general / meta-topical / individual) work?

Data Harvesting and Collection Document Creation

For all servers (possibly a topical subset):
- Get Explain information to find which indexes are supported, along with the collection statistics.
- For each index, use SCAN to extract terms and frequency information, and add term + frequency + source index + database metadata to the metasearch XML "collection document".
- Index the collection documents for retrieval by the ranking algorithm described in the next section (see the sketch after this list).
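
As a rough illustration of this harvesting loop, the sketch below builds one XML collection document for a remote database. It is a minimal sketch, not the actual implementation: the `client` object and its `connect()`, `explain()`, and `scan()` methods stand in for a real Z39.50 client library, and the XML element names are invented for this example.

```python
from xml.etree import ElementTree as ET

def harvest_collection_document(client, host, port, database):
    """Build an XML 'collection document' for one remote database
    from Z39.50 Explain metadata and SCAN term lists."""
    conn = client.connect(host, port, database)   # hypothetical Z39.50 client API
    explain = conn.explain()                      # supported indexes + collection stats
    root = ET.Element("collectionDocument",
                      host=host, port=str(port), database=database)
    for index in explain.indexes:
        idx_el = ET.SubElement(root, "index", name=index.name)
        # SCAN walks the remote index's term list; each entry carries a
        # term and its frequency in that database.
        for term, freq in conn.scan(index.name):
            ET.SubElement(idx_el, "term", freq=str(freq)).text = term
    return ET.ElementTree(root)
```

The resulting collection documents can then be indexed like ordinary documents, so that selecting databases reduces to ranking collection documents.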

Probabilistic Retrieval Using Logistic Regression

We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. Logistic regression over a sample set of documents is used to determine the values of the coefficients. At retrieval time, the probability of relevance for a particular query Q and collection C is estimated from six X attribute measures:

- X1: average absolute query frequency
- X2: query length
- X3: average absolute collection frequency
- X4: collection length
- X5: average inverse collection frequency
- X6: inverse collection frequency

where N is the number of collections and M is the number of terms in common between the query and the collection document.

The probabilities are actually calculated as log odds and then converted; that is, the estimate takes the standard logistic form

    log O(R | Q, C) = c0 + c1*X1 + c2*X2 + c3*X3 + c4*X4 + c5*X5 + c6*X6

    P(R | Q, C) = exp(log O(R | Q, C)) / (1 + exp(log O(R | Q, C)))

The ci coefficients were estimated separately for three query types; during retrieval, the length of the query was used to differentiate these.
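
A minimal sketch of scoring in this form is shown below. The coefficient values are illustrative placeholders (the fitted c_i are not preserved in this transcript), the query-type names simply mirror the Title / Long / Very Long query sets used in the evaluation, and feature extraction is assumed to happen elsewhere.

```python
import math

# Placeholder coefficients c0..c6 for the three query-length classes;
# the real values were fit by logistic regression on a sample of the data.
COEFFS = {
    "title":     [-3.7, 1.2, -0.5, 0.9, -0.3, 0.6, 0.4],
    "long":      [-3.2, 1.0, -0.4, 0.8, -0.2, 0.5, 0.3],
    "very_long": [-2.9, 0.9, -0.3, 0.7, -0.2, 0.4, 0.3],
}

def relevance_probability(features, query_type):
    """P(R | Q, C) from the six X measures via the log-odds form above."""
    c = COEFFS[query_type]
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], features))
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

def rank_collections(collections, query_type):
    """Rank collection documents by estimated probability of relevance.
    `collections` is a list of dicts with keys 'name' and 'X' (six floats)."""
    scored = [(relevance_probability(col["X"], query_type), col["name"])
              for col in collections]
    return sorted(scored, reverse=True)
```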

Test Database Characteristics

TREC Disk   Source         Size (MB)   Size (docs)
1           WSJ (86-89)        270        98,732
1           AP (89)            259        84,678
1           ZIFF               245        75,180
1           FR (89)            262        25,960
2           WSJ (90-92)        247        74,520
2           AP (88)            241        79,919
2           ZIFF               178        56,920
2           FR (88)            211        19,860
3           AP (90)            242        78,321
3           SJMN (91)          290        90,257
3           PAT                245         6,711
Totals                       2,690       691,058

TREC Disk   Source         Num DBs       Total DBs
1           WSJ (86-89)        29         Disk 1: 67
1           AP (89)            12
1           ZIFF               14
1           FR (89)            12
2           WSJ (90-92)        22         Disk 2: 54
2           AP (88)            11
2           ZIFF               11 (1 dup)
2           FR (88)            10
3           AP (90)            12         Disk 3: 116
3           SJMN (91)          12
3           PAT                92
Totals                        237 - 1 duplicate = 236

We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (using the same contents as in the evaluations by Powell & French and by Callan). The query set was TREC queries 51-150. Collection relevance information was based on whether any documents in the collection were relevant according to the relevance judgements for TREC queries 51-150. The relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (with the full data).

Distributed Retrieval Testing and Results

- Tested using the collection representatives as harvested over the network, together with the TIPSTER relevance judgements.
- Tested by comparing our approach to known algorithms for ranking collections.
- Preliminary results were measured against reported results for the Ideal and CORI algorithms, and against the optimal "relevance based ranking" (MAX).
- The recall analog measures how many of the relevant documents occurred in the top n databases, averaged over the queries.

Effectiveness Measures

Assume each database has some merit for a given query q. Given a baseline ranking B and an estimated (test) ranking E for the evaluated system, let db_bi and db_ei denote the database in the i-th ranked position of rankings B and E, and let Bi = merit(q, db_bi) and Ei = merit(q, db_ei). From these we can define recall analogs and a precision analog over the top n ranked databases; one common formulation, consistent with the measures of Powell & French, is

    R^_n = (E1 + ... + En) / (B1 + ... + Bn)

    P^_n = |{ db_ei : i <= n and merit(q, db_ei) > 0 }| / n

as in the sketch below.
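
A minimal sketch of these two measures, assuming merit lists already ordered by each ranking (the example merit values are invented):

```python
def recall_analog(B, E, n):
    """R^_n: merit accumulated in the top n of the estimated ranking E,
    relative to the top n of the baseline ranking B. Assumes B is sorted
    by decreasing merit and sum(B[:n]) > 0."""
    return sum(E[:n]) / sum(B[:n])

def precision_analog(E, n):
    """P^_n: fraction of the top n estimated databases with nonzero merit."""
    return sum(1 for merit in E[:n] if merit > 0) / n

# Example, with merit = number of relevant documents in each database:
B = [9, 7, 4, 1, 0]   # optimal baseline ordering
E = [7, 0, 9, 1, 4]   # ordering produced by the system under test
print(recall_analog(B, E, 3))     # (7 + 0 + 9) / (9 + 7 + 4) = 0.8
print(precision_analog(E, 3))     # 2 of the top 3 have nonzero merit
```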

Results and Discussion

[Figures: the recall analog R^ plotted against the number of collections in the ranking, for Title, Long, and Very Long queries, comparing the MAX, Prob, CORI, and Ideal(0) rankings.]

The figures summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking, and the Y axis, R^, is a recall analog that measures the proportion of the total possible relevant documents accumulated in the top N databases, averaged across all of the queries. The MAX line is the optimal result, obtained by ranking the collections in order of the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm, and CORI is an implementation of Callan's inference net approach. The Prob line is the logistic regression method described above. For Title queries the described method performs slightly better than the CORI algorithm for up to about 100 collections, where CORI exceeds it. For Long queries our method is virtually identical to CORI, and CORI performs better for Very Long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.

Planned Extensions

Post-process the indexes (especially geographic names, etc.) for special types of data, e.g. to create "geographical coverage" indexes.

Application and Further Research

The method described here is being applied to two distributed systems of servers in the UK. The first, the Distributed Archives Hub, will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. The second, MerseyLibraries.org, is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index that provides efficient distributed search over the various servers. The basic model is the one shown in the architecture diagram above: individual database servers are harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials. We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection-document statistics may provide a better ranking of results, and thus more effective routing of queries.

Acknowledgements

This research was sponsored at U.C. Berkeley and the University of Liverpool by the National Science Foundation and the Joint Information Systems Committee (UK) under the International Digital Libraries Program, award #IIS-99755164. James French and Allison Powell kindly provided the CORI and Ideal(0) results used in the evaluation.
