Evaluation Measures
C. Watters, cs6403
Evaluation?
– Effectiveness? For whom? For what?
– Efficiency? Time? Computational cost?
– Cost of missed information? Too much information?
Studies of Retrieval Effectiveness
– The Cranfield Experiments: Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957–1968 (hundreds of docs)
– SMART System: Gerald Salton, Cornell University (thousands of docs)
– TREC: Donna Harman, National Institute of Standards and Technology (NIST) (millions of docs, 100k to 7.5M per set; training and test queries, 150 each)
What can we measure?
Algorithm (Efficiency)
– Speed of algorithm
– Update potential of indexing scheme
– Size of storage required
– Potential for distribution & parallelism
User experience (Effectiveness)
– How many of all relevant docs were found
– How many were missed
– How many errors in selection
– How many need to be scanned before good ones are found
Measures based on relevance
The doc set splits four ways by retrieval and relevance:
– RR: retrieved, relevant
– RN: retrieved, not relevant
– NR: not retrieved, relevant
– NN: not retrieved, not relevant
Relevance
The system is always correct!! So who judges relevance? Inter-rater reliability matters.
Early evaluations
– Done by a panel of experts on abstracts of docs
TREC experiments
– Done automatically over thousands of docs
– Pooling plus human judges
Defining the universe of relevant docs
– Manual inspection
– Manual exhaustive search
– Pooling (TREC): the relevant set is the union of the results of multiple techniques
– Sampling: take a random sample, inspect it, and estimate from the sample for the whole set
Defining the relevant docs in a retrieved set (hit list)
– Panel of judges
– Individual users
– Automatic detection techniques: vocabulary overlap with known relevant docs; metadata
Estimates of Recall
The pooling system used by TREC depends on the quality of the set of nominated documents: are there relevant documents not in the pool?
Measures based on relevance
recall = RR / (RR + NR)
precision = RR / (RR + RN)
fallout = RN / (RN + NN)
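The three measures follow directly from the four counts; a minimal sketch (function names and the example counts are illustrative, not from the slides):

```python
def recall(RR, NR):
    # fraction of all relevant docs that were retrieved
    return RR / (RR + NR)

def precision(RR, RN):
    # fraction of retrieved docs that are relevant
    return RR / (RR + RN)

def fallout(RN, NN):
    # fraction of all non-relevant docs that were retrieved
    return RN / (RN + NN)

# Hypothetical counts: 8 retrieved-relevant, 12 retrieved-not-relevant,
# 2 not-retrieved-relevant, 978 not-retrieved-not-relevant.
print(recall(8, 2))      # 0.8
print(precision(8, 12))  # 0.4
print(fallout(12, 978))
```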
Relevance Evaluation Measures
– Recall and precision
– Single-valued measures: macro and micro averages, R-precision, E-measure, Swets' measure
Recall and Precision
– Recall: proportion of relevant docs that are retrieved
– Precision: proportion of retrieved docs that are relevant
Formulas (what do we have to work with?)
R_q = number of docs in the whole data set relevant to query q
R_r = number of docs in the hit set (retrieved docs) that are relevant
R_h = number of docs in the hit set
Recall = R_r / R_q
Precision = R_r / R_h
[Venn diagram: the collection contains the relevant set R_q and the hit set R_h; their intersection is R_r]
Recall = R_r / R_q
Precision = R_r / R_h
Example
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d6, d8, d9, d84, d123}
R_r = R_q ∩ R_h = {d9, d123}
Recall = 2/10 = 0.2
Precision = 2/5 = 0.4
What does that tell us?
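The worked example above can be checked with a few lines of Python, treating document IDs as plain strings:

```python
# Relevant set and hit list from the slide example
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
Rh = {"d6", "d8", "d9", "d84", "d123"}

Rr = Rq & Rh                   # retrieved relevant docs: {d9, d123}
recall = len(Rr) / len(Rq)     # 2/10 = 0.2
precision = len(Rr) / len(Rh)  # 2/5 = 0.4
print(sorted(Rr), recall, precision)
```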
Recall-Precision Graphs
[Graph: precision (y-axis) vs. recall (x-axis), both ranging to 100%]
Typical recall-precision graph
[Graph: precision vs. recall curves for a narrow, specific query and a broad, general query]
Recall-precision after retrieval of n documents
SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.

n   relevant   recall   precision
1   yes        0.2      1.00
2   yes        0.4      1.00
3   no         0.4      0.67
4   yes        0.6      0.75
5   no         0.6      0.60
6   yes        0.8      0.67
7   no         0.8      0.57
8   no         0.8      0.50
9   no         0.8      0.44
10  no         0.8      0.40
11  no         0.8      0.36
12  no         0.8      0.33
13  yes        1.0      0.38
14  no         1.0      0.36
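The recall and precision columns can be regenerated from the yes/no judgments alone; a sketch assuming the relevance flags shown on the slide (function name is illustrative):

```python
def pr_at_each_rank(relevant_flags, total_relevant):
    """Recall and precision after each of the first n retrieved docs."""
    rows, hits = [], 0
    for n, rel in enumerate(relevant_flags, start=1):
        hits += rel
        rows.append((n, hits / total_relevant, hits / n))
    return rows

# Judgments for the 14 retrieved docs; 5 relevant in the collection
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
for n, r, p in pr_at_each_rank(flags, 5):
    print(n, round(r, 2), round(p, 2))
```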
Recall-precision graph
[Graph: precision vs. recall plotted from the data above]
Recall-precision graph
[Graph: recall-precision curves for two systems] The red system appears better than the black, but is the difference statistically significant?
Consider Rank
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h (ranked) = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3}
R_r = {d123, d56, d9, d25, d3} (at ranks 1, 3, 6, 10, 15)
Recall = 5/10 = 0.5
Precision = 5/15 = 0.33
What happens as we go through the hits?
Standard Recall Levels
Plot precision at recall = 0%, 10%, …, 100%.
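One common convention (the interpolation used in TREC-style evaluations) takes the precision at recall level r to be the maximum precision observed at any recall >= r. A sketch, using the (recall, precision) points from the 14-document example above:

```python
def interpolated_precision(pr_points, levels=None):
    """Precision at standard recall levels 0%, 10%, ..., 100%,
    interpolated as max precision at any recall >= r."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    out = []
    for r in levels:
        candidates = [p for rec, p in pr_points if rec >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

# (recall, precision) at the ranks where relevant docs appeared
points = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 2/3), (1.0, 5/13)]
print(interpolated_precision(points))
```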
Consider two queries
Query 1: R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}, R_h = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}
Query 2: R_q = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}, R_h = {d12, d84, d56, d6, d8, d3, d511, d129, d44, d89}
Comparison of query results
[Graph: recall-precision curves for the two queries]
P-R for Multiple Queries
For each recall level, average the precision over all queries:
P(r) = (1 / N_q) · Σ_{i=1}^{N_q} P_i(r)
where N_q is the number of queries and P_i(r) is the precision at recall level r for the i-th query.
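Averaging across queries at each recall level is then elementwise; a minimal sketch with two hypothetical queries' precision values:

```python
def avg_precision_at_levels(per_query_precisions):
    """Average P_i(r) over N_q queries at each recall level r."""
    n_q = len(per_query_precisions)
    return [sum(ps) / n_q for ps in zip(*per_query_precisions)]

# Hypothetical interpolated precision at recall 0%, 50%, 100% for two queries
q1 = [1.0, 0.75, 0.5]
q2 = [1.0, 0.25, 0.0]
print(avg_precision_at_levels([q1, q2]))  # [1.0, 0.5, 0.25]
```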
Comparison of two systems
[Graph: averaged recall-precision curves for two systems]
Macro and Micro Averaging
– Micro: pool the counts from all queries and average over each point
– Macro: average of the per-query averages
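A sketch of the difference, using hypothetical (relevant-retrieved, total-retrieved) counts per query; micro-averaging pools the counts first, macro-averaging averages the per-query ratios:

```python
def macro_precision(per_query):
    # average of the per-query precision values
    return sum(rr / rh for rr, rh in per_query) / len(per_query)

def micro_precision(per_query):
    # pool the counts over all queries, then take one ratio
    total_rr = sum(rr for rr, _ in per_query)
    total_rh = sum(rh for _, rh in per_query)
    return total_rr / total_rh

# (relevant retrieved, total retrieved) for three hypothetical queries;
# the large third query dominates the micro average but not the macro
queries = [(9, 10), (1, 10), (5, 80)]
print(macro_precision(queries))  # (0.9 + 0.1 + 0.0625) / 3
print(micro_precision(queries))  # 15 / 100 = 0.15
```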
Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if:
recall(i) >= recall(j) and precision(i) >= precision(j)
Statistical tests
– The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not apply to this data.
– The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.
– The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes; it makes no assumption of normality but assumes independent samples.
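The sign test, for instance, reduces a paired comparison to an exact binomial tail probability; a self-contained sketch (the per-query scores are made-up values for illustration):

```python
from math import comb

def sign_test_p(xs, ys):
    """Two-sided sign test on paired samples: count the sign of each
    difference and compute an exact binomial p-value (ties dropped)."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)          # wins for system x
    m = min(k, n - k)
    tail = sum(comb(n, i) for i in range(m + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical average precision per query for two systems
sys_a = [0.42, 0.51, 0.38, 0.60, 0.44, 0.55, 0.47, 0.52]
sys_b = [0.40, 0.45, 0.39, 0.50, 0.41, 0.49, 0.43, 0.48]
print(sign_test_p(sys_a, sys_b))  # 7 wins out of 8: p = 18/256
```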
II. Single-Value Measures
– E-measure & F1 measure
– Swets' measure
– Expected search length
– etc.
E Measure & F1 Measure
A weighted combination of recall and precision that can increase the importance of either:
F_β = (1 + β²) · P · R / (β² · P + R)
E = 1 - F (so for F, bigger is better)
β is often 1, giving F1, the harmonic mean of P and R; increasing β weights recall more heavily.
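A sketch of both measures, assuming the standard definition F_β = (1 + β²)PR / (β²P + R) with E = 1 - F:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    # E = 1 - F, so smaller is better
    return 1 - f_measure(precision, recall, beta)

print(f_measure(0.4, 0.2))          # F1: harmonic mean of 0.4 and 0.2
print(f_measure(0.4, 0.2, beta=2))  # weights recall more heavily
```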
Normalized Recall
Recall is normalized against all relevant documents. Suppose there are N documents in the collection, of which n are relevant, and the relevant documents are ranked at positions i_1, i_2, …, i_n. Normalized recall is calculated as:
R_norm = 1 - Σ_{j=1}^{n} (i_j - j) / (n · (N - n))
Normalized recall measure (J. Allan)
[Graph: recall vs. ranks of retrieved documents, showing the ideal, actual, and worst rankings]
Normalized recall = (area between actual and worst) / (area between best and worst)
R_norm = 1 - Σ_{i=1}^{n} (r_i - i) / (n(N - n))
Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
Σ(r_i - i) = 0 + 1 + 2 + 6 + 9 = 18
R_norm = 1 - 18 / (5 × 195) = 1 - 18/975 ≈ 0.982
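The slide's example can be verified directly (function name is illustrative):

```python
def normalized_recall(ranks, N):
    """R_norm = 1 - sum(r_i - i) / (n (N - n)), where ranks are the
    positions of the n relevant docs in the output (ideal ranks: 1..n)."""
    n = len(ranks)
    return 1 - sum(r - i for i, r in enumerate(sorted(ranks), start=1)) / (n * (N - n))

# N = 200 docs, n = 5 relevant, found at ranks 1, 3, 5, 10, 14
print(normalized_recall([1, 3, 5, 10, 14], 200))  # 1 - 18/975 ≈ 0.982
```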
Expected Search Length
The number of non-relevant docs examined before the required relevant doc(s) are found. Assume a weak ordering (ties grouped into levels) or no ordering:
L_q = j + i · s / (r + 1)
where s is the number of relevant docs still required from the final level, j is the number of non-relevant docs in the levels examined before the final level, i is the number of non-relevant docs in the final level, and r is the number of relevant docs in the final level.
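A sketch of the computation, assuming the ranking is given as (relevant, non-relevant) counts per level; the example levels are made up for illustration:

```python
def expected_search_length(levels, s):
    """Expected search length under weak ordering.
    levels: list of (relevant, nonrelevant) counts per rank level;
    s: number of relevant docs the user needs.
    Returns the expected number of NON-relevant docs examined."""
    j = 0  # non-relevant docs in levels fully passed through
    for r, i in levels:
        if s <= r:  # the search ends inside this level
            return j + i * s / (r + 1)
        s -= r
        j += i
    raise ValueError("not enough relevant documents in the ranking")

# Hypothetical ranking: level 1 has 1 relevant + 2 non-relevant docs,
# level 2 has 2 relevant + 3 non-relevant; the user needs 2 relevant docs.
print(expected_search_length([(1, 2), (2, 3)], 2))  # 2 + 3*1/(2+1) = 3.0
```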
Problems with testing
– Determining relevant docs
– Setting up test questions
– Comparing results
– Understanding relevance of the results