
1 26-01-2012 | Gertjan van Noord 2014 | Zoekmachines Lecture 5: Evaluation

2 Retrieval performance
General performance of a system: speed, security, usability, …
Retrieval performance and evaluation: is the system presenting documents related to the query? Is the user satisfied? Does the result answer the information need?

3 User Queries
Same words, different intent:
- "Check my cash"
- "Cash my check"
Different words, same intent:
- "Can I open an interest-bearing account?"
- "open savings account"
- "start account for saving"
Gap between users' language and official terminology:
- "daylight lamp"
- "random reader", "edentifier", "log in machine", "card reader"

4 (image-only slide; no text captured)

5 What is relevancy?
What if you have to search through a 2-page doc to find what you need? What if the doc is 35 pages? Or 235?
What if you need to click through once to get to the answer? Or 2 times? 3 times?
Is relevancy a characteristic of a single result, or of a result set?
What is the effect of an irrelevant result in an otherwise good result set?
Determining relevancy is complex!

6 Translation of info need
Each information need has to be translated into the "language" of the IR system.
(Diagram with the labels: reality, document, info need, query, relevance.)

7 General retrieval evaluation: batch mode (automatic) testing
Test set consisting of:
- a set of documents
- a set of queries
- a file with the relevant document numbers for each query (human evaluation!)
Experimental test sets (among others): ADI, CACM, Cranfield, TREC test sets

8 Example CACM data files

query.text:
.I 1
.W
What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?

cacm.all:
.I 1410
.T
Interarrival Statistics for Time Sharing Systems
.W
The optimization of time-shared system performance requires the description of the stochastic processes governing the user inputs and the program activity. This paper provides a statistical description of the user input process in the SDC-ARPA general-purpose […]

qrels.text:
01 1410 0 0
01 1572 0 0
01 1605 0 0
01 2020 0 0
01 2358 0 0
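A minimal Python sketch of how a file like qrels.text above could be read, assuming the first column is the query id and the second the id of a relevant document (the remaining columns are ignored); the file name and column interpretation are assumptions based on the example, not stated on the slide.

from collections import defaultdict

def read_qrels(path="qrels.text"):
    # map each query id to the set of its relevant document ids
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                query_id, doc_id = int(fields[0]), int(fields[1])
                relevant[query_id].add(doc_id)
    return relevant

# e.g. read_qrels()[1] would then contain 1410, 1572, 1605, 2020, 2358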

9 Basic performance measures: precision, recall, F-score
(Diagram of the document collection split into the regions a, b, c, d; see the contingency table on the next slide.)

10 Contingency table of results

           +Rel    -Rel
  +Ans      a       b      a+b
  -Ans      c       d      c+d
           a+c     b+d      N
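A note on how these counts are usually turned into the measures of the previous slide (standard definitions, not spelled out on the slide itself): precision = a / (a + b), recall = a / (a + c), and the generality of a query (used in the next exercise) is (a + c) / N.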

11 Exercise
Test set: 10,000 documents in the database; for query Q there are 50 relevant docs available.
Result set for query Q: 100 documents, of which 20 are relevant.
What is the recall? 20/50 = .4
What is the precision? 20/100 = .2
What is the generality? 50/10,000 = .005
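A small Python sketch that reproduces the exercise numbers (the variable names are just for illustration):

relevant_in_db = 50        # relevant documents for query Q in the database
database_size = 10_000     # total number of documents in the database
retrieved = 100            # size of the result set for Q
relevant_retrieved = 20    # relevant documents in the result set

recall = relevant_retrieved / relevant_in_db    # 20/50    = 0.4
precision = relevant_retrieved / retrieved      # 20/100   = 0.2
generality = relevant_in_db / database_size     # 50/10000 = 0.005
print(recall, precision, generality)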

12 Harmonic mean F
Difficult to compare systems if we have two numbers. If one system has P=0.4 and R=0.6 and another has P=0.35 and R=0.65 – which one is better?
Combine both numbers: harmonic mean.
You go to school by bike: 10 kilometers. In the morning, you bike 30 km/h. In the afternoon, you bike 20 km/h. What is your average speed?

13 Harmonic mean F
Difficult to compare systems if we have two numbers. If one system has P=0.4 and R=0.6 and another has P=0.35 and R=0.65 – which one is better?
Combine both numbers: harmonic mean.
You go to school by bike: 10 kilometers. In the morning, you bike 30 km/h. In the afternoon, you bike 20 km/h. What is your average speed?
20 + 30 minutes for 20 km, so 24 km/h!
Harmonic mean: (2 * v1 * v2) / (v1 + v2)

14 Harmonic mean F
If precision is higher, the F-score is higher too.
If recall is higher, the F-score is higher too.
The F-score is maximal when precision AND recall are both high.

15 Harmonic mean F
What if P=0.1 and R=0.9?
What if P=0.4 and R=0.6?
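A small Python sketch of the harmonic mean, applied to the bike example of slide 13 and to the precision/recall pairs of slides 12 and 15 (the function name is just for illustration):

def harmonic_mean(x, y):
    return 2 * x * y / (x + y)

print(harmonic_mean(30, 20))      # average speed: 24 km/h
print(harmonic_mean(0.4, 0.6))    # F = 0.48
print(harmonic_mean(0.35, 0.65))  # F = 0.455
print(harmonic_mean(0.1, 0.9))    # F = 0.18, punished for the low precision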

16 Retrieval performance
Goal: retrieval of all relevant docs first.
(Graph: recall, up to 100%, against the number of retrieved docs, up to N, with curves labeled ideal, parabolic, random and perverse.)

17 Until now: ordering of results is ignored

18 Recall and precision on rank x
Set of 80 docs, 4 relevant for query Q.
Ordered answer set for Q: NRNNNNNRN..NRN…NRN……N

  Rel. doc   rank   recall   prec
  d1           2
  d2           8
  d3          30
  d4          40

Is recall always rising? Is precision always falling?

19 Recall and precision changes
80 docs, 4 relevant for query Q. Answer set: NRNNNNNRN..NRN…NRN……N

  doc   rank   recall   prec
  d1      2     .25      .5
  d2      8     .50      .25
  d3     30     .75      .1
  d4     40    1.00      .1

Is recall always rising? It rises or stays equal.
Is precision always falling? It falls or stays equal, but it can also rise: for example if d3 had rank 9.
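A Python sketch that recomputes the table above: 4 relevant documents, found at ranks 2, 8, 30 and 40 (variable names are illustrative):

relevant_ranks = [2, 8, 30, 40]
total_relevant = 4

for seen, rank in enumerate(relevant_ranks, start=1):
    recall = seen / total_relevant   # relevant docs seen so far / all relevant docs
    precision = seen / rank          # relevant docs seen so far / docs seen so far
    print(rank, recall, round(precision, 2))
# prints 2 0.25 0.5, then 8 0.5 0.25, then 30 0.75 0.1, then 40 1.0 0.1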

20 Recall-precision graph
(Graph with recall on one axis and precision on the other, both up to 100%.)
Precision on different recall levels:
- for 1 query
- averaged over queries
- for comparing systems

21 R/P graph: comparing 2 systems

22 Interpolation
Interpolation: if a higher recall level has a higher precision, use that precision for the lower recall level as well, so the curve has no spikes.
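A minimal Python sketch of this interpolation rule: walking from the highest recall level down, each precision value is replaced by the maximum seen so far (function name and data layout are assumptions for illustration):

def interpolate(points):
    # points: list of (recall, precision) pairs, sorted by increasing recall
    interpolated, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    return list(reversed(interpolated))

# example with the values of slide 19
print(interpolate([(0.25, 0.5), (0.5, 0.25), (0.75, 0.1), (1.0, 0.1)]))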

23 Single value summaries for ordered result lists (1)
11-pt average precision: averaging the (interpolated) precision values on the 11 recall levels (0%-100%).
3-pt average precision: the same at recall levels 20%, 50%, 80%.
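A Python sketch of 11-point average precision under the interpolation rule of the previous slide: for each recall level 0.0, 0.1, …, 1.0 take the highest precision at that recall or beyond, then average the 11 values (function name is illustrative):

def eleven_point_average(points):
    # points: (recall, precision) pairs measured at the relevant documents
    values = []
    for level in [i / 10 for i in range(11)]:
        candidates = [p for r, p in points if r >= level]
        values.append(max(candidates) if candidates else 0.0)
    return sum(values) / len(values)

# with the values of slide 19 this gives 0.25
print(eleven_point_average([(0.25, 0.5), (0.5, 0.25), (0.75, 0.1), (1.0, 0.1)]))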

24 Single value summaries for ordered result lists (2)
p@n: precision at a document cut-off value (n = 5, 10, 20, 50, …); the usual measure for web retrieval (why?)
r@n: recall at a document cut-off value
R-precision: precision on rank R, where R = total number of relevant docs for the query (why??)
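A Python sketch of p@n and R-precision on a ranked list of 0/1 relevance labels, using the example of slide 18 (relevant documents at ranks 2, 8, 30 and 40 in a list of 80); the function names are illustrative:

def precision_at(labels, n):
    return sum(labels[:n]) / n

def r_precision(labels, total_relevant):
    # precision on rank R, with R = number of relevant docs for the query
    return precision_at(labels, total_relevant)

labels = [1 if rank in (2, 8, 30, 40) else 0 for rank in range(1, 81)]
print(precision_at(labels, 10))   # p@10 = 2/10 = 0.2
print(r_precision(labels, 4))     # precision at rank R=4 is 1/4 = 0.25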

25 Single value summaries for ordered result lists (3)
Average precision: the average of the precision measured on the ranks of the relevant docs seen for a query (non-interpolated).
MAP: the mean of the average precisions over a set of queries.

26 Example average precision

  doc   rank    r      p
  d1      2    .25    .50
  d2      8    .50    .25
  d3      9    .75    .33
  d4     40   1.00    .10
                     -----
                      1.18

Average precision = 1.18 / 4 = 0.295
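A Python sketch that recomputes this example: precision is measured at the ranks of the relevant documents (2, 8, 9, 40) and then averaged; MAP would average this value over a set of queries (function names are illustrative):

def average_precision(relevant_ranks):
    precisions = [(seen + 1) / rank for seen, rank in enumerate(relevant_ranks)]
    return sum(precisions) / len(precisions)

def mean_average_precision(ranks_per_query):
    return sum(average_precision(r) for r in ranks_per_query) / len(ranks_per_query)

# about 0.296 without rounding; the slide rounds .33 and gets 0.295
print(average_precision([2, 8, 9, 40]))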

27 What do these numbers really mean?
How do we know the user was happy with the result? A click?
What determines whether you click or not? The snippet, term highlighting.
What determines whether you click back and try another result?

28 Annotator agreement
To be sure that judgements are reliable, more than one judge should rate each document.
A common measure for the agreement between judgements is the kappa statistic.
Of course there will always be some agreement by chance. This expected chance agreement is taken into account by kappa.

29 Kappa measure

kappa = (P(A) - P(E)) / (1 - P(E))

P(A): the proportion of agreement cases
P(E): the expected proportion of agreement (by chance)
For more than 2 judges, kappa is calculated between each pair and the outcomes are averaged.
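A Python sketch of this kappa computation for two judges and binary relevance judgements. Estimating P(E) from the pooled marginals of both judges is one common choice and an assumption here; the judgement lists in the call are made up:

def kappa(judge1, judge2):
    # judge1, judge2: lists of 0/1 relevance judgements for the same documents
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)   # pooled P(relevant)
    p_expected = p_rel ** 2 + (1 - p_rel) ** 2      # expected chance agreement
    return (p_agree - p_expected) / (1 - p_expected)

print(kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0]))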

30 Kappa example: page 152, table 8.2

31 Kappa interpretation
For binary relevance decisions, the agreement is generally not higher than fair.

   = 1      complete agreement
   > 0.8    good agreement
   > 0.67   fair agreement
   < 0.67   dubious
     0      just chance
   < 0      worse than chance

