
1 Efficient Result-set Merging Across Thousands of Hosts
Simulating an Internet-scale GIR application with the GOV2 Test Collection
Christopher Fallen, Arctic Region Supercomputing Center, University of Alaska Fairbanks

2 Overview
Internet-scale GIR simulation for the TREC 2006 Terabyte Track
Using the distribution of relevance judgments from the TREC 2004–2006 Terabyte Tracks to limit result-set length
Collection ranking

3 TREC 2006 Terabyte Track
GOV2 test collection is the publicly available text from http://*.gov in early 2004
– 426 GB of text
– 25 million documents
Retrieve ranked result sets for 50 topics
– Top results from each group are placed in the relevance pool
– 140,000 relevance judgments (qrels) for 149 topics (three years of TREC Terabyte Tracks)
{qrels} := {(topic, document, relevance)}

4 A GIR simulation using GOV2
Suppose each host provides an Index/Search (I/S) service
– The query processor (QP) must merge ranked results from every responding host
Partition GOV2 by hostname and create one index per host
– 17,000 hostnames
– {Mean, median} size = {1,500, 34} documents
– Standard deviation = 18,000 documents
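The partition-by-hostname step above can be sketched as follows. This is a minimal illustration, not the deck's actual GOV2 pipeline; the `(url, text)` document format and the function name are assumptions.

```python
from urllib.parse import urlparse

def partition_by_host(documents):
    """Group (url, text) pairs by hostname, one bucket per future index.
    A minimal sketch of the slide's partitioning step; the real GOV2
    document format and indexing pipeline are not shown in the deck."""
    partitions = {}
    for url, text in documents:
        host = urlparse(url).hostname  # e.g. 'www.nasa.gov'
        partitions.setdefault(host, []).append((url, text))
    return partitions

# toy documents (invented) standing in for GOV2 pages
docs = [
    ("http://www.nasa.gov/a.html", "apollo"),
    ("http://www.nasa.gov/b.html", "mars"),
    ("http://www.usgs.gov/c.html", "quake"),
]
parts = partition_by_host(docs)
# two hosts of unequal size, mirroring the skewed distribution on the next slide
```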

5 Ranked distribution of GOV2 host sizes

6 Consequences of search-by-host for result-set merging
Hundreds of ranked lists (responding hosts) per topic
– Many short lists
– Several long lists
– Large bandwidth requirement for each query
Merging algorithms must be robust with respect to long and short result sets and to the large number of result sets

7 “Logistic Regression” Merge
Find the logistic-curve parameters that best fit the relevance score-at-rank data
Parameter estimation needs several results per result set
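The curve-fitting idea on this slide can be sketched with a crude grid search in place of a proper regression solver. The logistic form, the parameter grids, and the toy score-at-rank data are all assumptions for illustration, not the deck's actual fit.

```python
import math

def logistic(rank, a, b):
    # Modeled relevance score at a given rank: a decreasing logistic curve
    return 1.0 / (1.0 + math.exp(a * (rank - b)))

def fit_logistic(points):
    """Grid-search (a, b) minimizing squared error against observed
    (rank, score) pairs. A crude stand-in for the deck's regression fit."""
    best_err, best_a, best_b = float("inf"), 1.0, 0.0
    for a in (x / 10.0 for x in range(1, 31)):   # slope candidates 0.1..3.0
        for b in range(0, 51):                   # midpoint candidates 0..50
            err = sum((logistic(r, a, b) - s) ** 2 for r, s in points)
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b

# toy score-at-rank observations from one host's results (invented data)
points = [(1, 0.95), (2, 0.90), (5, 0.70), (10, 0.40), (20, 0.10)]
a, b = fit_logistic(points)
comparable = logistic(3, a, b)  # mapped score for rank 3, usable across hosts
```

Once each host's curve is fitted, scores from different hosts live on a common scale and the merged list can simply be sorted by mapped score, which is why the fit needs several results per result set.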

8 Distribution of Relevance Judgments
For most topics, there are many more non-relevant qrels than relevant qrels
Across all topics, the number of non-relevant, relevant, and very-relevant qrels is strongly correlated with host size
The hosts that contain relevant qrels also contain non-relevant qrels
– But the relevant documents are probably near the top of each host’s result set!

9 (# relevant)/(# non-relevant) qrels

10 Skimming the top n documents from each host
Is there a simple functional relationship between the number of likely relevant documents in a host (that retrieves any documents at all) and the size of the host?
A proportional model of relevance for each relevance score r is simple…
(# r qrels from Host) = C_r(topic) * |Host|

11 Skimming the top n documents from each host
… and the constant of proportionality C_r(topic) can be measured from TREC Terabyte Track data, then averaged over topics to get a single topic-averaged constant
Does the model adequately describe the data?
– A posteriori: Does the model describe TREC data?
– A priori: Can the model be used to truncate result sets based on host size without affecting IR performance?
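Measuring C_r(topic) from judgment data, as these two slides describe, can be sketched like this. The `(topic, host, relevance)` triples are a simplified stand-in for TREC qrels with documents already mapped to hosts; the function name and data are illustrative.

```python
from statistics import mean

def estimate_C(qrels, host_sizes, relevance_level):
    """Estimate C_r(topic) in (# r qrels from Host) = C_r(topic) * |Host|,
    then average over topics. qrels: (topic, host, relevance) triples --
    a simplified stand-in for the TREC (topic, document, relevance)
    judgments with each document mapped to its host."""
    counts = {}  # (topic, host) -> # qrels at this relevance level
    for topic, host, rel in qrels:
        if rel == relevance_level:
            counts[(topic, host)] = counts.get((topic, host), 0) + 1
    per_topic = {}  # topic -> list of count/|Host| ratios across hosts
    for (topic, host), n in counts.items():
        per_topic.setdefault(topic, []).append(n / host_sizes[host])
    # C_r(topic) = mean ratio across its hosts; return the topic average
    return mean(mean(ratios) for ratios in per_topic.values())

# toy judgments (invented): relevance 1 = relevant
qrels = [("t1", "h1", 1), ("t1", "h1", 1), ("t1", "h2", 1), ("t2", "h1", 1)]
host_sizes = {"h1": 100, "h2": 50}
C_rel = estimate_C(qrels, host_sizes, relevance_level=1)
```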

12 Proportional relevance model
Two-way ANOVA applied to the host-by-topic table of the values (# r qrels from Host)/|Host| = C_r
For the relevant and very-relevant qrels only, the total variance lies largely between topics, not within topics
– C_rel varies little across hosts for a fixed topic
– The standard deviation of C_rel across topics is larger than its mean, so the standard deviation can be used as a conservative estimate and the mean as an aggressive estimate

13 Proportional relevance model
Select the top-performing topics from one run of the TREC 2006 Terabyte Track
Truncate the result set from each host using the aggressive (topic-averaged) estimate of .0005
Merge the truncated result sets and compare the IR performance of the merged list with the list merged from the full result sets
– No statistically significant difference in P@20 performance was observed, even after discarding more than 30% of the results from one set
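The truncation rule and the P@20 comparison can be sketched as follows. The minimum-list-length floor, the data structures, and the toy data are assumptions; only the constant .0005 comes from the slide.

```python
def truncate_by_host_size(results_by_host, host_sizes, C=0.0005, floor=1):
    """Keep only the top C * |Host| results from each host's ranked list
    (the slide's 'aggressive' proportional-relevance truncation; the
    floor of at least one result per responding host is an assumption)."""
    out = {}
    for host, ranked_docs in results_by_host.items():
        k = max(floor, int(C * host_sizes[host]))
        out[host] = ranked_docs[:k]
    return out

def precision_at_k(ranked_docs, relevant, k=20):
    """P@k: fraction of the top k retrieved documents that are relevant."""
    top = ranked_docs[:k]
    return sum(1 for d in top if d in relevant) / k

# toy run (invented): one large host, one small host
results = {"big": [f"b{i}" for i in range(10)], "small": ["s0", "s1"]}
host_sizes = {"big": 10000, "small": 50}
trunc = truncate_by_host_size(results, host_sizes)
p20 = precision_at_k(["r1", "x1", "r2", "x2"], {"r1", "r2"}, k=4)
```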

14 Relevance-ranking hosts
Assume that web documents are grouped non-arbitrarily by hostname according to content
Then many orderings or rankings are possible on (host, document) pairs
– Dictionary order
– Round-robin
Truncating the ranked lists of hosts may lead to increased search efficiency with a negligible IR performance penalty
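Of the orderings listed above, round-robin can be sketched in a few lines: take each host's top document in turn, then each host's second document, and so on. This is one plausible reading of the slide's "round-robin" ordering, not a confirmed implementation.

```python
from itertools import chain, zip_longest

def round_robin(host_lists):
    """Interleave per-host ranked lists into one ordering on
    (host, document) pairs: host 1's top doc, host 2's top doc, ...,
    then every host's second doc, skipping exhausted hosts."""
    SKIP = object()  # sentinel so short lists drop out cleanly
    interleaved = chain.from_iterable(zip_longest(*host_lists, fillvalue=SKIP))
    return [doc for doc in interleaved if doc is not SKIP]
```

For example, `round_robin([[1, 2], [3], [4, 5, 6]])` yields `[1, 3, 4, 2, 5, 6]`: each host contributes its best remaining document per pass.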

15 Retrieve from only 1:5 hosts?

16 Future work
Collection ranking
– Maximum document relevance score
– Minimum residual of the query projection into a reduced collection term-document subspace

