Retrieval Evaluation Modern Information Retrieval, Chapter 3


1 Retrieval Evaluation Modern Information Retrieval, Chapter 3
Ricardo Baeza-Yates, Berthier Ribeiro-Neto

2 Outline
Introduction
Retrieval Performance Evaluation
  Recall and precision
  Alternative measures
Reference Collections
  TREC Collection
  CACM & ISI Collection
  CF Collection

3 Introduction
Types of evaluation:
Functional analysis: a functional analysis phase and an error analysis phase, checking that the system works properly with no errors.
Performance evaluation: response time and space required.
Retrieval performance evaluation: the evaluation of how precise the answer set is.

4 Retrieval Performance Evaluation
[Figure: a batch query is run by the IR system against the document collection. R denotes the set of relevant documents, A the answer set (sorted by relevance), and Ra the relevant documents in the answer set.]
Recall = |Ra| / |R|
Precision = |Ra| / |A|

5 Precision and Recall
Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved, i.e. the fraction of the hits that are relevant.
Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents, i.e. the fraction of the relevant documents that are hits.
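A minimal Python sketch (not from the chapter; the helper name precision_recall is hypothetical) showing how the two ratios can be computed directly from an answer set and a relevant set:

# Precision and recall for one query, given the retrieved ids A and the relevant ids R.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    ra = set(retrieved) & set(relevant)              # relevant documents in the answer set
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 10 retrieved documents are relevant, out of 8 relevant in total.
p, r = precision_recall({"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"},
                        {"d2", "d4", "d6", "d8", "d11", "d12", "d13", "d14"})
print(p, r)   # 0.4, 0.5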

6 Precision and Recall
[Figure: retrieved documents versus the relevant documents in the document space, illustrating the four combinations: low precision / low recall, high precision / low recall, low precision / high recall, and high precision / high recall.]

7 Precision and Recall
[Figure: the retrieved documents A and the relevant documents R within the information space; RA is their intersection.]
Recall = |RA| / |R|
Precision = |RA| / |A|
The user is not usually given the answer set A at once. The documents in A are sorted by degree of relevance (a ranking), which the user examines. Recall and precision therefore vary as the user proceeds with the examination of the answer set A. This is discussed further in the evaluation lecture.

8 Precision and Recall Trade Off
Increasing the number of documents retrieved is likely to retrieve more of the relevant documents, and thus increase recall (up to 100%), but it typically also retrieves more non-relevant documents, and thus decreases precision.

9 Precision versus recall curve
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking for query q (relevant documents marked with *):
1. d123*  2. d84  3. d56*  4. d6  5. d8  6. d9*  7. d511  8. d129  9. d187  10. d25*  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 100% at R = 10%; P = 66% at R = 20%; P = 50% at R = 30%
The curve is usually based on 11 standard recall levels: 0%, 10%, ..., 100%
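A minimal Python sketch (the helper recall_precision_points is a hypothetical illustration, not from the book) that reproduces the precision/recall points above from the ranking and Rq:

def recall_precision_points(ranking, relevant):
    """Yield (recall, precision) each time a relevant document appears in the ranking."""
    relevant = set(relevant)
    found = 0
    points = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
for recall, precision in recall_precision_points(ranking, Rq):
    print(f"P = {precision:.0%} at R = {recall:.0%}")
# P = 100% at R = 10%, P = 67% at R = 20% (the slide rounds to 66%), P = 50% at R = 30%, ...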

10 Precision versus recall curve
Figure 3.2 shows the precision versus recall curve for a single query.

11 Average Over Multiple Queries
P(r) = ( Σ Pi(r) ) / Nq, where the sum runs over i = 1, ..., Nq
P(r) = average precision at recall level r
Nq = number of queries used
Pi(r) = the precision at recall level r for the i-th query
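A minimal sketch of the averaging step, assuming each query's interpolated precision is already available at the 11 standard recall levels; the two example curves below are made up for illustration:

def average_curve(per_query_curves):
    """per_query_curves: Nq lists, each holding Pi(r) at the 11 standard recall levels."""
    nq = len(per_query_curves)
    return [sum(curve[j] for curve in per_query_curves) / nq for j in range(11)]

# Two hypothetical queries:
q1 = [1.0, 1.0, 0.66, 0.5, 0.4, 0.33, 0.33, 0.2, 0.2, 0.2, 0.2]
q2 = [0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.3, 0.3, 0.1, 0.1, 0.1]
print(average_curve([q1, q2]))   # element-wise mean of the two curves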

12 Interpolated precision
Rq = {d3, d56, d129}
Ranking for query q (relevant documents marked with *):
1. d123  2. d84  3. d56*  4. d6  5. d8  6. d9  7. d511  8. d129*  9. d187  10. d25  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 33% at R = 33%; P = 25% at R = 66%; P = 20% at R = 100%
Interpolation rule: P(rj) = max P(r) for rj ≤ r ≤ rj+1

13 Interpolated precision
Let rj, j{0, 1, 2, …, 10}, be a reference to the j-th standard recall level P(rj)=max rj≦ r≦ rj+1P(r) R=30%, P3(r)~P4(r)=33% R=40%, P4(r)~P5(r) R=50%, P5(r)~P6(r) R=60%, P6(r)~P7(r)=25%
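A minimal sketch of 11-point interpolation. It uses the common "ceiling" rule (the interpolated precision at level rj is the highest precision observed at any recall at or above rj), which reproduces the example values; this is a sketch under that assumption, not necessarily the book's exact formulation:

def interpolate_11pt(points):
    """points: observed (recall, precision) pairs for one query."""
    levels = [j / 10 for j in range(11)]              # 0.0, 0.1, ..., 1.0
    interpolated = []
    for rj in levels:
        candidates = [p for r, p in points if r >= rj]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

# The relevant documents of slide 12 appear at ranks 3, 8 and 15 (|Rq| = 3):
observed = [(1/3, 1/3), (2/3, 2/8), (3/3, 3/15)]
print([round(p, 2) for p in interpolate_11pt(observed)])
# [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]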

14 Average recall vs. precision figure: used to compare the retrieval performance of several algorithms

15 Single Value Summaries
Average precision versus recall figures compare retrieval algorithms over a set of example queries.
Sometimes we need to compare the performance on individual queries; for this, a single value summary (such as an average precision figure) is needed.
The single value should be interpreted as a summary of the corresponding precision versus recall curve.

16 Single Value Summaries
Average Precision at Seen Relevant Documents: average the precision figures obtained after each new relevant document is observed.
Example (Figure 3.2): (1 + 0.66 + 0.5 + 0.4 + 0.33) / 5 ≈ 0.57
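A minimal sketch (not from the book; the helper name is hypothetical) of this measure for the Figure 3.2 ranking, where the relevant documents appear at ranks 1, 3, 6, 10 and 15:

def avg_precision_at_seen_relevant(ranking, relevant):
    """Record precision each time a new relevant document is seen, then average."""
    relevant = set(relevant)
    found, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(round(avg_precision_at_seen_relevant(ranking, Rq), 2))
# 0.58 (the slide's 0.57 comes from rounding each precision to two decimals first)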

17 Single Value Summaries
R-Precision: the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query (i.e. the total number in Rq).
Figure 3.2: R = 10, R-precision = 0.4
Figure 3.3: R = 3, R-precision = 0.33
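A minimal sketch (hypothetical helper, not from the book) that reproduces the Figure 3.2 value of 0.4:

def r_precision(ranking, relevant):
    """Precision at position R, where R = |Rq| is the number of relevant documents."""
    relevant = set(relevant)
    R = len(relevant)
    return sum(1 for doc in ranking[:R] if doc in relevant) / R

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
print(r_precision(ranking, Rq))   # 0.4 (4 of the top 10 documents are relevant)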

18 Single Value Summaries
Precision Histograms Use R-precision measures to compare the retrieval history of two algorithms through visual inspection RPA/B(i)=RPA(i)-RPB(i)
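A minimal sketch (not from the book) of the per-query differences that would be plotted as histogram bars; positive values favour algorithm A, negative values favour B. The R-precision values below are made up:

def precision_histogram(rp_a, rp_b):
    """rp_a, rp_b: lists of R-precision values, one entry per query i."""
    return [round(a - b, 2) for a, b in zip(rp_a, rp_b)]

print(precision_histogram([0.4, 0.3, 0.5, 0.2, 0.6],
                          [0.3, 0.4, 0.5, 0.1, 0.2]))
# [0.1, -0.1, 0.0, 0.1, 0.4]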

19 Summary Table Statistics
A summary table reports statistics such as:
the number of queries used in the task,
the total number of documents retrieved by all queries,
the total number of relevant documents effectively retrieved when all queries are considered,
the total number of relevant documents retrieved by all queries, …

20 Precision and Recall
Limitations of recall and precision:
Estimating maximum recall requires knowledge of all the relevant documents in the collection.
Measures which quantify the informativeness of the retrieval process might now be more appropriate.
Recall and precision are easy to define only when a linear ordering of the retrieved documents is enforced.

21 Alternative Measures
The Harmonic Mean: F(j) = 2 / (1/r(j) + 1/P(j)), where r(j) and P(j) are the recall and precision at the j-th document in the ranking.
The E Measure: E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j))
b = 1: E(j) = 1 - F(j)
b > 1: more interested in precision
b < 1: more interested in recall
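A minimal sketch of the two combined measures, assuming the formulas as reconstructed above (F as the harmonic mean of recall and precision, E as the parameterised complement):

def f_measure(recall, precision):
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)

def e_measure(recall, precision, b=1.0):
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b * b) / (b * b / recall + 1 / precision)

r, p = 0.5, 0.4
print(round(f_measure(r, p), 3))           # 0.444
print(round(e_measure(r, p, b=1.0), 3))    # 0.556, i.e. 1 - F when b = 1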

22 User-Oriented Measures
Let U be the set of relevant documents known to the user, Rk the retrieved relevant documents that were known to the user, and Ru the retrieved relevant documents previously unknown to the user.
Coverage = |Rk| / |U|: the fraction of the documents known to the user to be relevant which have actually been retrieved.
Novelty = |Ru| / (|Ru| + |Rk|): the fraction of the relevant documents retrieved which were previously unknown to the user.
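A minimal sketch (not from the book; names and example data are hypothetical) computing coverage and novelty from the retrieved set, the relevant set, and the documents the user already knew to be relevant:

def coverage_novelty(retrieved, relevant, known_to_user):
    retrieved_relevant = set(retrieved) & set(relevant)
    rk = retrieved_relevant & set(known_to_user)      # known relevant docs retrieved
    ru = retrieved_relevant - set(known_to_user)      # previously unknown relevant docs retrieved
    coverage = len(rk) / len(known_to_user) if known_to_user else 0.0
    novelty = len(ru) / len(retrieved_relevant) if retrieved_relevant else 0.0
    return coverage, novelty

# The user already knew d1 and d2 to be relevant; the system retrieved d1, d3 and d7.
print(coverage_novelty(retrieved={"d1", "d3", "d7"},
                       relevant={"d1", "d2", "d3", "d4"},
                       known_to_user={"d1", "d2"}))   # (0.5, 0.5)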

23 Reference Collections
Reference collections have been used throughout the years for the evaluation of IR systems.
Reference test collections: TIPSTER/TREC; CACM and ISI; Cystic Fibrosis (small collections with known relevant documents).

24 TREC (Text REtrieval Conference)
Research in IR lacks a solid formal framework as a basic foundation.
It also lacks robust and consistent testbeds and benchmarks.

25 TREC (Text REtrieval Conference)
Initiated under the National Institute of Standards and Technology (NIST).
Goals: providing a large test collection, uniform scoring procedures, and a forum for comparing results.
1st TREC conference in 1992; 7th TREC conference in 1998.
Document collection: test collections, example information requests (topics), and the relevant documents for each topic.
The benchmark tasks.

26 The Document Collection
The TREC collection has been growing steadily over the years: TREC-3 used 2 gigabytes, TREC-6 used 5.8 gigabytes.
The collection can be obtained for a small fee and is distributed on 6 CDs of compressed text.
Documents from the sub-collections are tagged with SGML to allow easy parsing.

27 TREC1-6 Documents

28 The Document Collection
Example: a TREC document from the Wall Street Journal (WSJ) sub-collection:
<doc>
<docno>WSJ …</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad…
</text>
</doc>
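A toy Python sketch of why the SGML tagging makes parsing easy; it is keyed only to the markup shown above (real TREC tooling and tag conventions may differ), and extract_field is a hypothetical helper:

import re

def extract_field(doc_text, tag):
    """Return the text between <tag> and </tag>, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc_text, re.DOTALL)
    return match.group(1).strip() if match else None

wsj_doc = """<doc>
<docno>WSJ …</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<text>American Telephone & Telegraph Co. introduced the first of a new generation…</text>
</doc>"""
print(extract_field(wsj_doc, "docno"))   # WSJ …
print(extract_field(wsj_doc, "hl"))      # the headline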

29 The Example Information Requests (Topics)
Example information requests can be used for testing a new ranking algorithm.
A request is a description of an information need in natural language.
Example: topic numbered 168
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: …
<nar> Narrative: A …
</top>

30 The Relevant Documents for each Example Information Request
There are more than 350 topics.
The set of relevant documents for each example information request (topic) is obtained from a pool of possibly relevant documents.
Pooling method: the pool is created by taking the top k documents (k = 100) in the rankings generated by the various participating IR systems. The documents in the pool are then shown to human assessors, who ultimately decide on the relevance of each document.
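A minimal sketch of the pooling idea (not from the book's text; the helper name is hypothetical): the pool for a topic is the union of the top k documents from each participating run.

def build_pool(runs, k=100):
    """runs: list of ranked document-id lists, one per participating IR system."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool            # shown to human assessors for relevance judgement

# Example with two tiny hypothetical runs and k = 3:
print(sorted(build_pool([["d1", "d2", "d3", "d4"], ["d3", "d5", "d6"]], k=3)))
# ['d1', 'd2', 'd3', 'd5', 'd6']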

31 The (Benchmark) Tasks at the TREC Conferences
Ad hoc task: the system receives new requests and executes them on a pre-specified document collection.
Routing task: fixed requests, but changing document collections;
the first document collection is used for training and tuning the retrieval algorithm,
the second document collection is used for testing the tuned retrieval algorithm.
Routing is similar to filtering, but a ranking is still produced.

32 Other tasks (ref. p. 90)
Chinese
Filtering
Interactive
NLP (natural language processing)
Cross languages
High precision
Spoken document retrieval
Query task (TREC-7)

33 TREC

34 Evaluation Measures at the TREC Conferences
At the TREC conferences, four basic types of evaluation measures are used (ref. p. 91):
Summary table statistics
Recall-precision averages
Document level averages
Average precision histogram

35 The CACM Collection
A small collection of computer science literature.
Text of the documents.
Structured subfields:
word stems from the title and abstract sections,
categories,
direct references between articles: a list of pairs of documents [da, db],
bibliographic coupling connections: a list of triples [d1, d2, ncited],
the number of co-citations for each pair of articles: [d1, d2, nciting].
A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns.

36 The ISI Collection
ISI is a test collection of 1,460 documents selected from a previous collection at the Institute of Scientific Information (ISI).
The main purpose of the ISI collection is to support investigation of similarities based on terms and on cross-citation patterns.

37 CFC Collection
1,239 documents indexed with the term "cystic fibrosis" in the National Library of Medicine's MEDLINE.
Each document record is composed of: MEDLINE accession number, author, title, source, major subjects, minor subjects, abstract, references, and citations.

38 CFC Collection
100 information requests with extensive relevance judgements: 4 separate relevance scores for each request.
Scores were provided by human experts and by a medical bibliographer.
Each score is 0 (not relevant), 1 (marginally relevant), or 2 (strongly relevant).

39 CFC Collection
A small and nice collection for experimentation.
The number of information requests is large relative to the collection size.
Good relevance judgements.
The collection is available for online access.

