Evaluation of IR Systems



Presentation on theme: "Evaluation of IR Systems"— Presentation transcript:

1 Evaluation of IR Systems
Adapted from lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)

2 This lecture
Result summaries: making our results presentable and usable to the user. What do we show about an item in the result set to highlight the hit (cf. cheap information extraction)? How do we determine and compute what we plan to show? How do we help with convenient navigation?
Evaluating a search engine: how do we know if our results are any good? How do we know that the result set we determined is really good? How do we compare various design choices for our search engine? Benchmarks, precision and recall; qualitative and quantitative issues.

3 Result Summaries Having ranked the documents matching a query, we wish to present a results list. Most commonly, a list of document titles plus a short summary, aka “10 blue links”.

4 Summaries The title is typically automatically extracted from document metadata. What about the summaries? This description is crucial: the user can identify good/relevant hits based on it. Two basic kinds: A static summary of a document is always the same, regardless of the query that hit the doc. A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query.

5 Static summaries In typical systems, the static summary is a subset of the document. Simplest heuristic: the first 50 (or so; this can be varied) words of the document, with the summary cached at indexing time. More sophisticated: extract from each document a set of "key" sentences, using simple NLP heuristics to score each sentence; the summary is made up of the top-scoring sentences. Most sophisticated: NLP is used to synthesize a summary. This is seldom used in IR (cf. text summarization work), but it makes sense for news stories (topic sentence, lead), or for an abstract built from non-redundant body sentences matching the title.

6 Dynamic summaries Present one or more "windows" within the document that contain several of the query terms ("KWIC" snippets: KeyWord In Context presentation), generated in conjunction with scoring. If the query is found as a phrase, show all or some occurrences of the phrase in the doc; if not, show doc windows that contain multiple query terms. The summary itself gives the entire content of the window: all terms, not only the query terms. Pro: answer extraction; titles, author names, etc. get dragged in as other details due to proximity. Con: boundary problem (AS co-author or Einstein); sentence vs. paragraph issue: co-occurrence granularity vs. proximity. Requires access to the cached document text.

7 Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context windows surrounding hits. If we cache the documents at index time, then we can find windows in them, cueing from hits found in the positional index. E.g., the positional index says "the query is a phrase in position 4378", so we go to this position in the cached document and stream out the content. Most often, only a fixed-size prefix of the doc is cached. Note: the cached copy can be outdated.

8 Dynamic summaries Producing good dynamic summaries is a tricky optimization problem. The real estate for the summary is normally small and fixed. We want snippets to be long enough to be useful, linguistically well formed, and maximally informative about the doc. But users really like snippets, even if they complicate IR system design. (LexisNexis cluster-label work: emphasis on semantic match through abstraction (set, stem), while redeeming readability by reinstating the original fragment.)

9 Alternative results presentations?
An active area of HCI research. One alternative copies the idea of Apple's Cover Flow for search results. For the moment, though, the search engine market has really coalesced around one dominant style (consistency has strong benefits – Jakob Nielsen).

10 Evaluating search engines
What are we evaluating? Satisfaction of the information need, not resource utilization (human productivity, "correctness", vs. machine performance, "efficiency"). Why is it difficult to get a handle on this? Different perspectives; the semantics are not available (DR vs. IR); the role of benchmarks. How do we quantify relevance? Per query, using numbers, over different queries, … Binary judgements are the standard; beyond that, multi-valued relevance, a user-preference relation over a subset, …

11 Measures for a search engine
How fast does it index? e.g., number of bytes per hour. How fast does it search? e.g., latency as a function of queries per second. What is the cost per query? in dollars. All of the preceding criteria are measurable: we can quantify speed / size / money. However, the key measure for a search engine is user happiness.

12 Data Retrieval vs Information Retrieval
DR performance evaluation (after establishing correctness): response time, index space. IR performance evaluation: how relevant is the answer set? How happy are the users? (This is required to establish "functional correctness", e.g., through benchmarks.)

13 Measures for a search engine
What is user happiness? Factors include: speed of response, size of index, an uncluttered UI, and, most important, relevance (actually, maybe even more important: it's free). None of these alone is sufficient: blindingly fast but useless answers won't make a user happy. How can we quantify user happiness?

14 Measuring user happiness: Who is the user?
Web search engine: searcher. Success: the searcher finds what was looked for. Measure: rate of return to this search engine. Web search engine: advertiser. Success: the searcher clicks on an ad. Measure: clickthrough rate. Ecommerce: buyer. Success: the buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers. Ecommerce: seller. Success: the seller sells something. Measure: profit per item sold. Enterprise: CEO. Success: employees are more productive (because of effective search). Measure: profit of the company. Happiness is in the eye of the beholder.

15 Happiness: elusive to measure
Most common proxy: relevance of search results. Standard methodology in IR: relevance measurement requires three elements: (1) a benchmark document collection, (2) a benchmark suite of queries, (3) an assessment of relevance for each query-document pair. Some work uses binary relevance, other work multi-valued relevance (or partial orders). The lack of a formal semantics for documents and the query (the fundamental issue of IR vs. DR) comes back to haunt us: a human arbitrator abstracts the semantics via relevance judgements. (Medline dataset: 1033 abstracts: med.all, med.qry, med.rel)

16 Evaluating an IR system
Note: the information need is translated into a query, and relevance is assessed relative to the information need, not the query. E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing heart attack risks than white wine. Query: wine red white heart attack effective. You evaluate whether the doc addresses the information need, not whether it has these words. The problem is exacerbated by ambiguous short queries: Jaguar (animal, football team, car), Saturn (planet, car), Cardinals (church, baseball team, bird). Instead of displaying result sets by interleaving documents satisfying the various senses (as is done in the Google and Yahoo search engines), the Clusty (Vivisimo) search engine effectively clusters documents on the basis of word sense, thereby enabling quicker convergence to relevant documents.

17 Evaluating an IR system
Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine." Query q: [red wine white wine heart attack]. Consider document d′: "At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving." d′ is an excellent match for query q, yet d′ is not relevant to the information need i.

18 Difficulties with gauging Relevancy
Relevancy, from a human standpoint, is: subjective (depends upon a specific user's judgment); situational (relates to the user's current needs); cognitive (depends on human perception and behavior); dynamic (changes over time). Examples: "number of words" vs. "page limit", where the user's need is to determine equivalence but what is retrieved is "How to limit the number of words in an MS Word page"; IICAI/NLDB, where the user opts to change to IJCAI/VLDB; Googling "Nano" and getting Amazon results (polysemy/synonymy). User happiness can only be measured by relevance to an information need, not by relevance to queries. Our terminology is sloppy in these slides and in IIR: we talk about query-document relevance judgments even though we mean information-need-document relevance judgments.

19 Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Reuters and other benchmark doc collections are also used. "Retrieval tasks" are sometimes specified as queries. Human experts mark, for each query and for each doc, Relevant or Nonrelevant (or at least for the subset of docs that some system returned for that query). Settings: binary unranked retrieval, small result set (precision and recall values); binary unranked retrieval, large result set (precision-recall variation over the result set); binary ranked retrieval (comparison with the expected ranking, penalizing errors early on); graded retrieval research (multi-valued grading, user preference relation / partial ordering).

20 Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant).
Contingency table:
Retrieved & Relevant: tp (true positive). Retrieved & Nonrelevant: fp (false positive).
Not retrieved & Relevant: fn (false negative). Not retrieved & Nonrelevant: tn (true negative).
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
In the Web context, precision can in theory be computed from the retrieved documents, but recall cannot be computed unless we look at the whole dataset! (The terminology crosses {true, false} with {positive, negative}: positive/negative is what the system computed, i.e., retrieved or not; true/false is whether that decision matches reality.)
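As a minimal sketch, the two measures can be computed directly from the contingency counts above (the function names are illustrative, not from any particular library):

```python
def precision(tp: int, fp: int) -> float:
    """P = tp / (tp + fp): fraction of retrieved docs that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """R = tp / (tp + fn): fraction of relevant docs that are retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

# Counts taken from the "F: Example" slide later in the deck:
print(precision(tp=20, fp=40))  # 0.333... = 1/3
print(recall(tp=20, fn=60))     # 0.25     = 1/4
```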

21 Precision and Recall
Answer = the result set (returned by the search engine); Relevant docs = the documents relevant to the query (sort of unknown in the Web context).

22 Precision and Recall in Practice
Precision: the ability to retrieve top-ranked documents that are mostly relevant; the fraction of the retrieved documents that are relevant. Recall: the ability of the search to find all of the relevant items in the corpus; the fraction of the relevant documents that are retrieved. Notation: R = relevant documents in the database, A = answer set / retrieved documents, RA = R ∩ A. Recall = |RA| / |R| (fraction of relevant documents that are retrieved); Precision = |RA| / |A| (fraction of the answer set that is relevant). For generic web search, high precision is good (because high recall causes information overload). For legal or intelligence search, high recall is good, because the cost of missing a precedent is high or even catastrophic.

23 Accuracy Why do we use complex measures like precision, recall, etc?
Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). Why is accuracy not a useful measure for web information retrieval? Accuracy only makes sense if the set of relevant documents and the set of irrelevant documents are of comparable size. For the lopsided distributions typical of the web, where almost all documents are nonrelevant to a given query, accuracy does not provide a correct assessment: a system can score very high accuracy while retrieving none of the (small set of) relevant documents we actually care about.
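A small illustration of this point, with made-up counts for a web-style skewed collection: a system that retrieves nothing gets near-perfect accuracy while its recall is zero.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of relevant/nonrelevant decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# 10 relevant docs, 1,000,000 nonrelevant docs, and a "search engine"
# that simply retrieves nothing at all:
tp, fp, fn, tn = 0, 0, 10, 1_000_000
print(accuracy(tp, fp, fn, tn))  # ~0.99999  -- looks excellent
print(tp / (tp + fn))            # recall = 0.0 -- useless to the user
```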

24 Why not just use accuracy?
How to build a highly "accurate" search engine on a low budget…. People doing information retrieval want to find something and have a certain tolerance for junk. Snoogle.com: Search for anything, and the answer is "0 matching results found." (A search engine that returns nothing is almost always "correct" on the web, but useless.)

25 Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved. In a good system, precision decreases as either the number of docs retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation: the precision-recall tradeoff. The converse is also (usually) true: it's easy to get high precision for very low recall.

26 Trade-offs
(Precision vs. recall trade-off graph, both axes running from 0 to 1.) One extreme returns relevant documents but misses many useful ones too; the other returns most relevant documents but includes a lot of junk; the ideal is high precision at high recall. This abstract characterization provides a trade-off in search engine design. Discussed later: given the result set for a query, the typical variation of precision and recall.

27 Difficulties in using precision/recall
Should average over large document collection/query ensembles. Need human relevance assessments, and people aren't reliable assessors. Assessments have to be binary (what about nuanced assessments?). Heavily skewed by collection/authorship: results may not translate from one domain to another. If the entire set of relevant docs for a query is known but we want to see how it is distributed over the ranked answer set, we need measures that consider fractional recalls. If the entire set of relevant docs for a query is NOT known, measure precision at a fixed number of documents. (Cf. XML query language evaluation: unified web model query language evaluation.)

28 A combined measure: F
A combined measure that assesses the precision/recall tradeoff is the F measure, the harmonic mean of precision and recall: F = 2PR / (P + R). The harmonic mean is a conservative average (see C. J. van Rijsbergen, Information Retrieval). Given that there is a clear trade-off between precision and recall, we want a system that does well on both axes combined; the F measure provides such a metric.

29 Aka E Measure (parameterized F Measure)
Variants of the F measure allow weighting the emphasis on precision vs. recall: F_β = (1 + β²) P R / (β² P + R). The value of β controls the trade-off: β = 1 weights precision and recall equally (E = F1); β > 1 weights recall more; β < 1 weights precision more. Two other commonly used F_β measures are F_2, which weights recall twice as much as precision, and F_0.5, which weights precision twice as much as recall. (Is van Rijsbergen's effectiveness measure E different? It is closely related: E = 1 − F.)
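A sketch of the weighted F measure described on this and the previous slide, using the standard F_β form (β = 1 gives F1); the example values reuse the P = 1/3, R = 1/4 figures from the upcoming example slide:

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r.
    beta = 1: equal weight (F1); beta > 1: favour recall; beta < 1: favour precision."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(f_measure(1/3, 1/4))             # F1   = 2/7 ≈ 0.286
print(f_measure(1/3, 1/4, beta=2.0))   # F2   -- recall weighted more
print(f_measure(1/3, 1/4, beta=0.5))   # F0.5 -- precision weighted more
```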

30 F1 and other averages
This graph illustrates different ways of combining precision and recall, and how the F measure compares to those choices. The x-axis is precision (with recall fixed at 70%); the y-axis shows various combinations such as max(P, R), the arithmetic mean AM(P, R), and the F measure / harmonic mean HM(P, R). All lines meet at (70%, 70%).

31 F: Example
                 relevant    not relevant        total
retrieved              20              40           60
not retrieved          60       1,000,000    1,000,060
total                  80       1,000,040    1,000,120
P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
F1 = 2PR/(P + R) = 2 × (1/3) × (1/4) / (1/3 + 1/4) = 2/7 ≈ 0.29

32 Exercise
Compute precision, recall and F1 for this result set:
                 relevant    not relevant
retrieved              18               2
not retrieved          82   1,000,000,000

33 Breakeven Point Breakeven point is the point where precision equals recall. Alternative single measure of IR effectiveness. How do you compute it?

34 Evaluating ranked results
Precision/recall/F are measures for unranked sets. We can easily turn set measures into measures of ranked lists: just compute the set measure for each "prefix" (the top 1, top 2, top 3, top 4, etc. results). Doing this for precision and recall gives you a precision-recall curve.

35 Computing Recall/Precision Points: An Example
Let the total # of relevant docs = 6, and check each new recall point as we move down the ranked result set (14 docs are retrieved in this example):
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5; P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is never retrieved, so we never reach 100% recall. This shows the variation of recall and precision over ranked retrieval (P and R move in opposite directions), using binary relevance judgements to evaluate the effectiveness of ranked retrieval (though not necessarily how good the ranking scheme itself is). Higher precision early on is good for web search engines.
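A sketch that reproduces the recall/precision points above from a ranked list of binary judgments; the particular ranking below is hypothetical, chosen so that the relevant hits fall at ranks 1, 2, 4, 6 and 13 as in the example.

```python
def recall_precision_points(ranking, total_relevant):
    """For each relevant hit, return (recall, precision) of the prefix ending there."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

ranked = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # 14 retrieved docs
for r, p in recall_precision_points(ranked, total_relevant=6):
    print(f"R = {r:.3f}   P = {p:.3f}")
# R=0.167 P=1.000, R=0.333 P=1.000, R=0.500 P=0.750,
# R=0.667 P=0.667, R=0.833 P=0.385
```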

36 A precision-recall curve
This figure is not for the previous slide's table, though it has the same shape. For a specific query, as we move down the result set: precision of an initial segment goes up when we encounter a relevant document (segment with positive slope) and goes down when we encounter an irrelevant document (vertical drop, since recall stays the same). Correspondingly, recall of an initial segment goes up when we encounter a relevant document and remains unchanged when we encounter an irrelevant document.

37 Averaging over queries
Sec. 8.4. A precision-recall graph for one query isn't a very sensible thing to look at. You need to average performance over a whole bunch of queries. But there's a technical issue: precision-recall calculations place only some points on the graph. How do you determine a value (interpolate) between the points?

38 Interpolated precision
Sec. 8.4. Idea: if locally precision increases with increasing recall, then you should get to count that. So you take the maximum of the precisions at all recall levels to the right of the given recall value.

39 Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} (r0 = 0.0, r1 = 0.1, …, r10 = 1.0). The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level at or above the j-th level: P_interp(rj) = max over r ≥ rj of P(r).
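A sketch of this rule, interpreting it as "maximum known precision at any recall at or above r_j"; the observed (recall, precision) points reuse the earlier ranked-retrieval example.

```python
def eleven_point_interpolated(points):
    """points: observed (recall, precision) pairs for one query.
    Returns interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    interpolated = []
    for j in range(11):
        r_j = j / 10
        candidates = [p for r, p in points if r >= r_j]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

points = [(1/6, 1/1), (2/6, 2/2), (3/6, 3/4), (4/6, 4/6), (5/6, 5/13)]
print([round(p, 2) for p in eleven_point_interpolated(points)])
# [1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.38, 0.38, 0.0, 0.0]
```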

40 Interpolated precision-recall curve
For a specific query, as we move down the result set: precision of an initial segment goes up when we encounter a relevant document and goes down (a vertical drop) when we encounter an irrelevant document, while recall goes up on a relevant document and remains unchanged on an irrelevant one. The interpolated curve replaces each precision value by the maximum precision at any equal or higher recall, giving a non-increasing step curve.

41 Evaluation Metrics (cont’d)
Graphs are good, but people want summary measures! Precision at a fixed retrieval level, i.e., precision-at-k: the precision of the top k results. Perhaps appropriate for web search, since all people want are good matches on the first one or two results pages; but it averages badly and has an arbitrary parameter k. 11-point interpolated average precision: the standard measure in the early TREC competitions. You take the interpolated precision at 11 recall levels varying from 0 to 1 by tenths (the value for recall 0 is always interpolated) and average them; this evaluates performance at all recall levels. Both summarize the PR graph for a query with a single number: (1) precision-at-k emphasizes initial hits, because people browse only the initial pages; (2) 11-point average precision emphasizes performance over the whole result set. (11-point precision is given in terms of percentages.)

42 Typical (good) 11 point precisions
SabIR/Cornell 8A1 11-pt precision from TREC 8 (1999). Given a query and the set of relevant documents, 11-point precision is calculated at recalls of 0% (interpolated), 10%, 20%, …, 100%.

43 11-point interpolated average precision
Recall:                  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Interpolated precision:  1.00  0.67  0.63  0.55  0.45  0.41  0.36  0.29  0.13  0.10  0.08
11-point average: ≈ 0.425
How can precision at recall 0.0 be > 0? It is interpolated: if the first hit is relevant, precision = 1 at recall = 1 / (# of relevant docs), and that maximum carries back to recall 0.

44 11 point precisions
Given a query and the set of relevant documents, if the recall never goes beyond, say, 60%, then the precision at all recall points beyond 60% is deemed 0% (rationale: you would have to access the entire dataset to get all the relevant documents). If the set of relevant documents is small (say 6), then the precision values at the fixed recall points are determined by interpolation, using the maximum of the values associated with the left and right neighboring points for which recall is available. (If precision is available only for recall points 33%, 60%, and 75%, and is 50%, 35%, and 25% respectively, then the 11-pt precisions are: 50%, 50%, 50%, 50%, 50%, 50%, 35%, 35%, 0%, 0%, 0%.) For multiple queries, average over the queries. (See MIR for a good description.)

45 Receiver Operating Characteristics (ROC) Curve
True positive rate = tp/(tp+fn) = recall = sensitivity. False positive rate = fp/(tn+fp); related to precision (fpr = 0 ↔ p = 1). Why is the blue line "worthless"? ROC curves are used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel. Differences from a gains chart: the y-axis shows the percentage of true positives, the x-axis the percentage of false positives.

46 Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1). Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query. That is, there are easy information needs and hard ones.

47 Mean average precision (MAP)
Average precision (AP) for a query: the average of the precision values at each (of the k top) relevant documents retrieved. This weights early appearance of a relevant document over later appearance. MAP for a query collection is the mean (arithmetic average) of AP over the queries: macro-averaging, so each query counts equally. It gives one number to characterize the performance of a "ranking" search engine, and it is sensitive to the ranking.

48 Average Precision Comparing two search engines on the same query

49 Mean Average Precision (MAP)
Summarize rankings from multiple queries by taking the mean (average) of average precision. The most commonly used measure in research papers. Assumes the user is interested in finding many relevant documents for each query. Requires many binary relevance judgments in the text collection.

50 MAP Comparing a search engine’s performance on two different queries

51 Summarize a Ranking: MAP
Given that n docs are retrieved: compute the precision at the rank where each (new) relevant document is retrieved, giving p(1), …, p(k) if we have k relevant docs. E.g., if the first relevant doc is at rank 2, then p(1) = 1/2. If a relevant document never gets retrieved, we take the precision corresponding to that relevant doc to be zero. Then average over all the relevant documents: average precision = (p(1) + … + p(k))/k.
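A sketch of this computation, plus MAP over several queries; the ranked judgment lists here are hypothetical.

```python
def average_precision(ranking, total_relevant):
    """Mean of the precisions at the ranks of the relevant docs;
    relevant docs that are never retrieved contribute precision 0."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    precisions += [0.0] * (total_relevant - hits)  # never-retrieved relevant docs
    return sum(precisions) / total_relevant

def mean_average_precision(queries):
    """queries: list of (ranking, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

q1 = ([1, 0, 1, 0, 0], 2)  # AP = (1/1 + 2/3) / 2 ≈ 0.833
q2 = ([0, 1, 0, 0, 1], 3)  # AP = (1/2 + 2/5 + 0) / 3 = 0.300
print(mean_average_precision([q1, q2]))  # ≈ 0.567
```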

52 (cont'd)
This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document. Mean Average Precision (MAP) = the arithmetic mean of average precision over a set of topics; gMAP = the geometric mean of average precision over a set of topics (more affected by difficult topics).

53 Discounted Cumulative Gain
A popular measure for evaluating web search and related tasks. Two assumptions: (1) highly relevant documents are more useful than marginally relevant documents; (2) the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined. The benchmark provides graded (ranked) relevance assessments.

54 Discounted Cumulative Gain
Uses graded relevance as a measure of the usefulness, or gain, from examining a document. Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks. The typical discount is 1/log(rank): with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.

55 Summarize a Ranking: DCG
What if relevance judgments are on a scale of [1, r], r > 2? Cumulative Gain (CG) at rank n: let the ratings of the n documents be r1, r2, …, rn (in ranked order); CG = r1 + r2 + … + rn. Discounted Cumulative Gain (DCG) at rank n: DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n). We may use any base b for the logarithm; for rank positions up to b, do not discount.
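A sketch of DCG exactly as formulated above (the first rating undiscounted, rating i divided by log2(i) for i ≥ 2); the ratings reuse the 0-3 graded example on the DCG Example slide below.

```python
import math

def dcg(ratings):
    """DCG at rank n = r1 + r2/log2(2) + r3/log2(3) + ... + rn/log2(n)."""
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

ratings = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(ratings), 2))  # ≈ 9.61, matching the running DCG in the example
```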

56 Discounted Cumulative Gain
DCG is the total gain accumulated at a particular rank p: DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i). An alternative formulation, DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(1 + i), is used by some web search companies; it puts the emphasis on retrieving highly relevant documents, so a search engine that places a highly relevant document at a low rank is penalized heavily.

57 DCG Example
10 ranked documents judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0.
Discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0.
DCG (running total): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61.
(log2(3) = log10(3) / log10(2) = 1.59; log2(6) = log2(2) + log2(3) = 2.59. The running DCG is 3, then 3 + 2 = 5, then 5 + 1.89 = 6.89, ….)
Normalized DCG: DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking; this makes averaging easier for queries with different numbers of relevant documents.

58 Summarize a Ranking: NDCG
Normalized Discounted Cumulative Gain (NDCG) at rank n: normalize the DCG at rank n by the DCG value at rank n of the ideal ranking. The ideal ranking first returns the documents with the highest relevance level, then the next highest relevance level, etc. NDCG is now quite popular in evaluating web search. For the example above: perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0; ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88; actual DCG: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61; NDCG values (divide actual by ideal): 1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88. NDCG ≤ 1 at any rank position.
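A sketch of NDCG that normalizes by the DCG of the ideal (descending-sorted) ranking, reproducing the final value above; dcg() is the same helper as in the earlier DCG sketch.

```python
import math

def dcg(ratings):
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

def ndcg(ratings):
    """NDCG = DCG of the actual ranking / DCG of the ideal ranking."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]  # graded example from the DCG slide
print(round(ndcg(ranking), 2))            # ≈ 0.88, the last NDCG value above
```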

59 NDCG - Example
Four documents d1, d2, d3, d4 with graded relevance ri; the slide's table compares the ground-truth ordering with two ranking functions. NDCG(ground truth) = 1.00, NDCG(ranking function 1) = 1.00, NDCG(ranking function 2) = 0.9203. Permuting among equally relevant documents is fine (it does not change NDCG), which is why ranking function 1 still scores 1.00.

60 NDCG (at 4) - Example
Graded ranking/ordering with gains (4, 2, 0, 1):
DCG = 4 + 2/log(2) + 0/log(3) + 1/log(4) = 6.5
Ideal ordering (4, 2, 1, 0): IDCG = 4 + 2/log(2) + 1/log(3) + 0/log(4) = 6.63
NDCG = DCG/IDCG = 6.5/6.63 = 0.98

61 R-Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents. Example: R = # of relevant docs = 6, and 4 of the top 6 results are relevant, so R-Precision = 4/6 = 0.67.
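A sketch of R-precision; the ranked judgment list is hypothetical, chosen so that 4 of the top 6 results are relevant as in the example.

```python
def r_precision(ranking, num_relevant):
    """Precision over the top R positions, where R = total # of relevant docs."""
    top = ranking[:num_relevant]
    return sum(top) / num_relevant

ranked = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1]      # 6 relevant docs in total
print(r_precision(ranked, num_relevant=6))   # 4/6 ≈ 0.67
```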

62 Test Collections

63 Creating Test Collections for IR Evaluation

64 What we need for a benchmark
A collection of documents: the documents must be representative of the documents we expect to see in reality. A collection of information needs (which we will often, incorrectly, refer to as queries): the information needs must be representative of the information needs we expect to see in reality. Human relevance assessments: we need to hire/pay "judges" or assessors to do this, which is expensive and time-consuming, and the judges must be representative of the users we expect to see in reality.

65 Standard relevance benchmark: Cranfield
Pioneering: the first testbed allowing precise quantitative measures of information retrieval effectiveness. Late 1950s, UK. 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all query-document pairs. Too small and too untypical for serious IR evaluation today.

66 Standard relevance benchmark: TREC
TREC = Text REtrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST). TREC is actually a set of several different relevance benchmarks. Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999: 1.89 million documents, mainly newswire articles, and 450 information needs. No exhaustive relevance judgments (too expensive). Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned by some system that was entered in the TREC evaluation for which the information need was developed.

67 Standard relevance benchmarks: Others
GOV2: another TREC/NIST collection; 25 million web pages; used to be the largest collection that is easily available, but still 3 orders of magnitude smaller than what Google/Yahoo/MSN index. NTCIR: East Asian language and cross-language information retrieval. Cross Language Evaluation Forum (CLEF): this evaluation series has concentrated on European languages and cross-language information retrieval. Many others.

68 Validity of relevance assessments
Relevance assessments are only usable if they are consistent. If they are not consistent, then there is no "truth" and experiments are not repeatable. How can we measure this consistency or agreement among judges? → the Kappa measure.

69 Kappa measure for inter-judge (dis)agreement
An agreement measure among judges, designed for categorical judgments, that corrects for chance agreement. P(A) = proportion of the time the judges agree; P(E) = the agreement we would expect by chance. Kappa = [P(A) − P(E)] / [1 − P(E)]. Kappa = 0 for chance-level agreement, 1 for total agreement.

70 Kappa Measure: Example
Number of docs (out of 400 judged):
Judge 1: Relevant, Judge 2: Relevant: 300
Judge 1: Relevant, Judge 2: Nonrelevant: 20
Judge 1: Nonrelevant, Judge 2: Relevant: 10
Judge 1: Nonrelevant, Judge 2: Nonrelevant: 70
What are P(A) and P(E)?

71 Kappa Example
P(A) = 370/400 = 0.925
P(nonrelevant) = (80 + 90)/800 = 0.2125
P(relevant) = (320 + 310)/800 = 0.7878
P(E) = 0.2125² + 0.7878² = 0.665
Kappa = (0.925 − 0.665)/(1 − 0.665) = 0.776
Kappa > 0.8: good agreement; 0.67 < Kappa < 0.8: "tentative conclusions" (Carletta '96); what counts as acceptable depends on the purpose of the study. For more than 2 judges: average pairwise kappas.

72 Kappa Example : Alternative view
Both judges score nonrelevant by chance: 80/400 × 90/400 = 0.045. Both judges score relevant by chance: 320/400 × 310/400 = 0.62. Probability that both judges agree by chance: 0.045 + 0.62 = 0.665. Probability that the judges disagree by chance: 320/400 × 90/400 + 310/400 × 80/400 = 0.18 + 0.155 = 0.335. P(E) = 0.665 / (0.665 + 0.335) = 0.665.
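A sketch that reproduces the kappa computation from the 2×2 judgment counts above (argument order: both relevant, Judge 1 only, Judge 2 only, both nonrelevant); the pooled-marginal estimate of chance agreement follows the slides.

```python
def kappa(both_rel, j1_only, j2_only, both_non):
    """Kappa = (P(A) - P(E)) / (1 - P(E)) for two judges making binary calls."""
    total = both_rel + j1_only + j2_only + both_non
    p_agree = (both_rel + both_non) / total
    # Pool the two judges' marginals to estimate P(relevant), then P(E).
    p_rel = ((both_rel + j1_only) + (both_rel + j2_only)) / (2 * total)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))  # ≈ 0.776, as in the example
```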

73 Evaluation at large search engines
Search engines have test collections of queries and hand-ranked results. Recall is difficult to measure on the web, so search engines often use precision at top k (e.g., k = 10), or measures that reward you more for getting rank 1 right than for getting rank 10 right, such as NDCG (Normalized Discounted Cumulative Gain). Search engines also use non-relevance-based measures: clickthrough on the first result (not very reliable for a single clickthrough, since you may realize after clicking that the summary was misleading and the document is nonrelevant, but pretty reliable in the aggregate), studies of user behavior in the lab, and A/B testing.

74 A/B testing Purpose: Test a single innovation
Prerequisite: you have a large search engine up and running. Have most users use the old system, and divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" measure like clickthrough on the first result. Now we can directly see if the innovation improves user happiness. Probably the evaluation methodology that large search engines trust most.

75 SKIP DETAILS

76 Other Evaluation Measures
Adapted from slides attributed to Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

77 Fallout Rate
Problems with both precision and recall: the number of irrelevant documents in the collection is not taken into account; recall is undefined when there is no relevant document in the collection; precision is undefined when no document is retrieved. Fallout addresses the first point: fallout = (# of nonrelevant documents retrieved) / (total # of nonrelevant documents in the collection).

78 Subjective Relevance Measure
Novelty ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware; measures the ability to find new information on a topic. Coverage ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search; relevant when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).

79 Other Factors to Consider
User effort: the work required from the user in formulating queries, conducting the search, and screening the output. Response time: the time interval between receipt of a user query and the presentation of system responses. Form of presentation: the influence of the search output format on the user's ability to utilize the retrieved materials. Collection coverage: the extent to which any/all relevant items are included in the document corpus.

80 Early Test Collections
Previous experiments were based on the SMART collection, which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)
Collection Name    Number of Documents    Number of Queries    Raw Size (MBytes)
CACM               3,…
CISI               1,…
CRAN               1,…
MED                1,…
TIME               …
Different researchers used different test collections and evaluation techniques.

81 Critique of pure relevance
Relevance vs. marginal relevance: a document can be redundant even if it is highly relevant, e.g., duplicates, or the same information from different sources. Marginal relevance is a better measure of utility for the user. Using facts/entities as evaluation units measures true relevance more directly, but makes it harder to create the evaluation set.

