
1 Evaluation Measures

2 Evaluation?
Effectiveness? For whom? For what?
Efficiency? Time? Computational cost? Cost of missed information? Too much info?

3 Studies of Retrieval Effectiveness
The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 (hundreds of docs)
SMART System, Gerald Salton, Cornell University, 1964-1988 (thousands of docs)
TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992- (millions of docs, 100k to 7.5M per set; training and test queries, 150 each)

4 What can we measure?
Algorithm (Efficiency)
–Speed of algorithm
–Update potential of indexing scheme
–Size of storage required
–Potential for distribution & parallelism
User Experience (Effectiveness)
–How many of all relevant docs were found
–How many were missed
–How many errors in selection
–How many must be scanned before the good ones are reached

5 Measures based on relevance
The document set is partitioned by whether a document was retrieved and whether it is relevant:
                retrieved    not retrieved
relevant           RR             NR
not relevant       RN             NN

6 Relevance
System always correct!! Who judges relevance? Inter-rater reliability
Early evaluations
–Done by a panel of experts
–1-2000 abstracts of docs
TREC experiments
–Done automatically
–Thousands of docs
–Pooling + people

7 Defining the universe of relevant docs
Manual inspection
Manual exhaustive search
Pooling (TREC)
–Relevant set is the union of multiple techniques
Sampling
–Take a random sample
–Inspect it
–Estimate from the sample for the whole set

8 Defining the relevant docs in a retrieved set (hit list)
Panel of judges
Individual users
Automatic detection techniques
–Vocabulary overlap with known relevant docs
–Metadata

9 Estimates of Recall
The pooling method used by TREC depends on the quality of the set of nominated documents. Are there relevant documents not in the pool?

10 Measures based on relevance
                retrieved    not retrieved
relevant           RR             NR
not relevant       RN             NN

11 Measures based on relevance
recall = RR / (RR + NR)
precision = RR / (RR + RN)
fallout = RN / (RN + NN)
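
A minimal Python sketch (added here, not part of the original slides) of these three ratios; the parameter names mirror the RR/RN/NR/NN cells above, and the example counts are made up.

def relevance_measures(rr, rn, nr, nn):
    """Recall, precision, and fallout from the four contingency counts.

    rr: retrieved and relevant        nr: not retrieved but relevant
    rn: retrieved but not relevant    nn: not retrieved and not relevant
    """
    recall = rr / (rr + nr) if (rr + nr) else 0.0
    precision = rr / (rr + rn) if (rr + rn) else 0.0
    fallout = rn / (rn + nn) if (rn + nn) else 0.0
    return recall, precision, fallout

# Hypothetical counts: 20 relevant and 180 non-relevant docs, 30 retrieved.
print(relevance_measures(rr=10, rn=20, nr=10, nn=160))  # (0.5, 0.333..., 0.111...)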

12 Relevance Evaluation Measures
Recall and precision
Single-valued measures
–Macro and micro averages
–R-precision
–E-measure
–Swets' measure

13 Recall and Precision
Recall – proportion of relevant docs retrieved
Precision – proportion of retrieved docs that are relevant

14 Formula (what do we have to work with?)
R_q = number of docs in the whole data set relevant to query q
R_r = number of docs in the hit set (retrieved docs) that are relevant
R_h = number of docs in the hit set
Recall = R_r / R_q
Precision = R_r / R_h

15 Recall = R_r / R_q, Precision = R_r / R_h
[Venn diagram: the collection contains the relevant set R_q and the hit set R_h; their overlap is R_r.]

16 Recall and Precision: an example
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d6, d8, d9, d84, d123}
R_r = R_q ∩ R_h = {d9, d123}
Recall = 2/10 = 0.2
Precision = 2/5 = 0.4
What does that tell us?
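
A small Python sketch (added, not from the slides) that reproduces this example with set operations; the document IDs are written as strings.

# Slide-16 example with Python sets.
R_q = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}  # relevant docs
R_h = {"d6", "d8", "d9", "d84", "d123"}                                     # hit set

R_r = R_q & R_h                      # relevant docs that were retrieved
recall = len(R_r) / len(R_q)         # 2 / 10
precision = len(R_r) / len(R_h)      # 2 / 5
print(sorted(R_r), recall, precision)  # ['d123', 'd9'] 0.2 0.4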

17 Recall-Precision Graphs
[Axes: precision on the y-axis, recall on the x-axis, up to 100%.]

18 Typical recall-precision graph
[Precision (y-axis) vs. recall (x-axis); two curves, one labelled "narrow, specific query" and one labelled "broad, general query".]

19 Recall-precision after retrieval of n documents
n   relevant   recall   precision
1   yes        0.2      1.0
2   yes        0.4      1.0
3   no         0.4      0.67
4   yes        0.6      0.75
5   no         0.6      0.60
6   yes        0.8      0.67
7   no         0.8      0.57
8   no         0.8      0.50
9   no         0.8      0.44
10  no         0.8      0.40
11  no         0.8      0.36
12  no         0.8      0.33
13  yes        1.0      0.38
14  no         1.0      0.36
SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
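
The table can be regenerated from the yes/no judgments alone. A short Python sketch (added, not from the slides), using the 14 judgments above and the known total of 5 relevant documents:

# Recompute recall and precision after each of the first 14 retrieved docs.
judgments = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]   # ranks 1..14
TOTAL_RELEVANT = 5   # relevant docs in the whole collection for this query

found = 0
for n, relevant in enumerate(judgments, start=1):
    if relevant:
        found += 1
    label = "yes" if relevant else "no"
    print(f"{n:2d}  {label:3s}  recall {found / TOTAL_RELEVANT:.1f}  "
          f"precision {found / n:.2f}")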

20 Recall-precision graph
[The (recall, precision) points from the table on the previous slide plotted in retrieval order, with ranks 1-6, 12, 13, and 200 labelled.]

21 Recall-precision graph
[Two recall-precision curves plotted together.]
The red system appears better than the black, but is the difference statistically significant?

22 Consider Rank
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3} (ranked hit list)
R_r = {d123, d56, d9, d25, d3}
Over the whole hit list: Recall = 5/10 = 0.5, Precision = 5/15 = 0.33
What happens as we go through the hits?

23 Standard Recall Levels
Plot precision at recall = 0%, 10%, ..., 100%.
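
The slides do not spell out the interpolation rule; the sketch below (added, not from the slides) assumes the usual convention that precision at recall level r is the maximum precision observed at any recall >= r, and reuses the (recall, precision) points from the slide-19 table.

def standard_recall_precision(points):
    """points: (recall, precision) pairs observed while scanning the hit list.
    Returns interpolated precision at the 11 standard recall levels."""
    levels = [i / 10 for i in range(11)]
    result = []
    for r in levels:
        candidates = [p for (rec, p) in points if rec >= r]
        result.append((r, max(candidates) if candidates else 0.0))
    return result

points = [(0.2, 1.0), (0.4, 1.0), (0.4, 0.67), (0.6, 0.75), (0.6, 0.60),
          (0.8, 0.67), (0.8, 0.57), (0.8, 0.50), (0.8, 0.44), (0.8, 0.40),
          (0.8, 0.36), (0.8, 0.33), (1.0, 0.38), (1.0, 0.36)]
for level, p in standard_recall_precision(points):
    print(f"recall {level:.1f}: precision {p:.2f}")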

24 Consider two queries
Query 1: R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
         R_h = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}
Query 2: R_q = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
         R_h = {d12, d84, d56, d6, d8, d3, d511, d129, d44, d89}

25 Comparison of query results

26 P-R for Multiple Queries
For each recall level, average the precision over all queries:
P_avg(r) = (1/N_q) * sum over i of P_i(r)
N_q is the number of queries; P_i(r) is the precision at recall level r for the ith query.
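
A minimal sketch of this averaging (added, not from the slides). The first row reuses the interpolated values computed above for the slide-19 query; the second row is hypothetical.

# Each row: one query's interpolated precision at recall 0.0, 0.1, ..., 1.0.
per_query = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38],
    [1.0, 0.80, 0.80, 0.60, 0.60, 0.50, 0.40, 0.40, 0.30, 0.25, 0.20],
]
n_q = len(per_query)
averaged = [sum(q[k] for q in per_query) / n_q for k in range(11)]
print([round(p, 2) for p in averaged])   # averaged P-R curve over the queries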

27 Comparison of two systems

28 Macro and Micro Averaging
Micro – average over each point
Macro – average of averages per query
An example is sketched below.
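
A sketch of the distinction (added, not from the slides), under one common reading: macro-averaging means averaging the per-query precisions, micro-averaging means pooling the counts across queries and taking a single ratio. The per-query counts are hypothetical.

# (relevant docs retrieved, docs retrieved) for three hypothetical queries.
queries = [(2, 5), (9, 10), (1, 20)]

# Macro: average the per-query precisions.
macro = sum(rel / ret for rel, ret in queries) / len(queries)

# Micro: pool the counts over all queries, then take one ratio.
micro = sum(rel for rel, _ in queries) / sum(ret for _, ret in queries)

print(f"macro = {macro:.2f}, micro = {micro:.2f}")  # macro = 0.45, micro = 0.34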

29 Statistical tests
Suppose that a search is carried out on systems i and j.
System i is superior to system j if recall(i) >= recall(j) and precision(i) >= precision(j).

30 Statistical tests
The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not apply to this data.
The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.
The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes; it makes no assumption of normality but assumes independent samples.
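
A sketch of how the paired t-test and the Wilcoxon signed-rank test could be run on per-query scores for two systems, assuming SciPy is available; the score lists are made-up examples, not real system results.

from scipy import stats

# Hypothetical per-query scores (e.g. average precision) for the same queries.
system_a = [0.45, 0.30, 0.62, 0.51, 0.28, 0.70, 0.44, 0.39, 0.55, 0.61]
system_b = [0.40, 0.33, 0.58, 0.45, 0.25, 0.66, 0.42, 0.35, 0.50, 0.57]

t_stat, t_p = stats.ttest_rel(system_a, system_b)    # paired t-test
w_stat, w_p = stats.wilcoxon(system_a, system_b)     # Wilcoxon signed-rank test
print(f"t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")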

31 II. Single-Value Measures
E-measure and F1 measure
Swets' measure
Expected search length
etc.

32 E Measure & F1 Measure
Weighted average of recall and precision
Can increase the importance of recall or of precision
F = 1 - E (bigger is better)
Beta is often 1; what does increasing beta do?
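
The slide does not show the formula; the usual van Rijsbergen definition is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with E = 1 - F, where beta > 1 weights recall more heavily. A small sketch under that assumption, reusing the precision and recall from the slide-16 example:

def f_measure(precision, recall, beta=1.0):
    """Van Rijsbergen F measure; E = 1 - F.  beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.4, 0.2                      # slide-16 example
print(f_measure(p, r))               # F1 ≈ 0.267
print(f_measure(p, r, beta=2.0))     # F2 ≈ 0.222 (recall weighted more, and recall is the weaker value here)
print(1 - f_measure(p, r))           # E  ≈ 0.733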

33 Normalized Recall
Recall is normalized against all relevant documents. Suppose there are N documents in the collection, of which n are relevant, and these n documents are retrieved at ranks i_1, i_2, ..., i_n. Normalized recall is calculated as:
R_norm = 1 - [sum over j of (i_j - j)] / [n * (N - n)]

34 Normalized recall measure (J. Allan)
[Plot of recall against the ranks of the retrieved documents, showing the ideal, actual, and worst-case rank curves.]

35 Normalized recall (J. Allan)
Normalized recall = (area between actual and worst) / (area between best and worst)
R_norm = 1 - [sum from i=1 to n of (r_i - i)] / [n(N - n)]

36 Example: N = 200, n = 5, relevant docs retrieved at ranks 1, 3, 5, 10, 14
Sum of (r_i - i) = (1-1) + (3-2) + (5-3) + (10-4) + (14-5) = 18
n(N - n) = 5 * 195 = 975
R_norm = 1 - 18/975 ≈ 0.98
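
A one-function sketch (added, not from the slides) that reproduces this calculation:

def normalized_recall(ranks, N):
    """ranks: ranks at which the n relevant docs were retrieved; N: collection size."""
    n = len(ranks)
    deviation = sum(r - i for i, r in enumerate(sorted(ranks), start=1))
    return 1 - deviation / (n * (N - n))

print(normalized_recall([1, 3, 5, 10, 14], N=200))  # ≈ 0.9815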

37 Expected Search Length
Expected number of non-relevant docs examined before the required relevant doc(s) are found.
Assume weak or no ordering of the output.
L_q = j + i * s / (r + 1)
s is the number of relevant docs still required from the level in which they are found; r and i are the numbers of relevant and non-relevant docs in that level; j is the number of non-relevant docs in the levels examined before it.
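
A sketch of this calculation (added, not from the slides), reading the slide's formula as Cooper's expected search length over the levels of a weak ordering: j counts the non-relevant docs in levels read completely, and i, r, s refer to the level in which the required relevant docs are completed. The example ranking is hypothetical.

def expected_search_length(levels, need):
    """levels: (relevant_count, nonrelevant_count) per level of the weak ordering.
    need: number of relevant docs the user wants.  Returns L_q = j + i*s/(r+1)."""
    j = 0                      # non-relevant docs in levels examined completely
    remaining = need           # relevant docs still required
    for r, i in levels:
        if r >= remaining:     # the required docs are completed in this level
            return j + i * remaining / (r + 1)
        remaining -= r         # consume the whole level and move on
        j += i
    raise ValueError("not enough relevant documents in the ranking")

# Hypothetical ranking: level 1 has 1 relevant + 2 non-relevant docs,
# level 2 has 2 relevant + 3 non-relevant.  The user wants 2 relevant docs.
print(expected_search_length([(1, 2), (2, 3)], need=2))  # 2 + 3*1/3 = 3.0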

38 Problems with testing
Determining relevant docs
Setting up test questions
Comparing results
Understanding the relevance of the results

