1 7CCSMWAL Algorithmic Issues in the WWW

2 Evaluation in Information Retrieval
See Chapter 8 of Intro to IR

3 Ideas We want to compare the performance of binary classifier systems, which, for example, divide documents into ‘useful’ and ‘not-useful’. This is an old problem: various treatments were tested for their ability to improve a medical condition, and the treatment is evaluated as ‘improved’ or ‘not-improved’ for each patient. The true outcome is usually not clear-cut; for instance, we determine whether a person has been cured of hypertension based on a blood pressure measurement taken after administering a drug.

4 True and false positives True positive rate (TPR) = true positives / relevant docs. False positive rate (FPR) = false positives / non-relevant docs. This is the simplest case: we know the true answer (which documents are relevant) and we look at how the classifier performed.
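As a concrete illustration of these two rates, here is a minimal Python sketch; the function name tpr_fpr and the counts in the usage example are assumptions for illustration only, not from the slides.

def tpr_fpr(tp, fp, num_relevant, num_nonrelevant):
    """True positive rate and false positive rate from raw counts."""
    tpr = tp / num_relevant      # fraction of relevant docs correctly flagged
    fpr = fp / num_nonrelevant   # fraction of non-relevant docs wrongly flagged
    return tpr, fpr

# Hypothetical counts: 30 relevant docs (20 found), 70 non-relevant docs (10 flagged)
print(tpr_fpr(tp=20, fp=10, num_relevant=30, num_nonrelevant=70))  # (0.666..., 0.142...)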

5 ROC space Plot TPR (true positive rate) against FPR (false positive rate). The top left-hand corner is perfect, the bottom right-hand corner is completely wrong, and the dotted red diagonal line corresponds to guessing randomly.
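A minimal plotting sketch of ROC space, assuming matplotlib is available; the three classifier points A, B, C are made up purely for illustration.

import matplotlib.pyplot as plt

# Hypothetical classifiers as (FPR, TPR) points in ROC space
points = {"A": (0.1, 0.8), "B": (0.4, 0.6), "C": (0.7, 0.3)}

plt.plot([0, 1], [0, 1], "r--", label="random guessing")  # diagonal: TPR equals FPR
for name, (fpr, tpr) in points.items():
    plt.scatter(fpr, tpr)                                  # each classifier is one point
    plt.annotate(name, (fpr, tpr))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC space")
plt.legend()
plt.show()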

6 Test collection method
To measure the information retrieval effectiveness of something (an algorithm, IR system, or search engine) we need a test collection consisting of: (1) a document collection; (2) a test suite of information needs, expressible as queries; (3) a set of relevance judgments, usually given as a binary assessment of either relevant or non-relevant for each query-document pair.
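A minimal sketch of how such a test collection could be represented in Python; the class and field names (TestCollection, docs, queries, qrels) and the toy data are my own assumptions, not a standard format.

from dataclasses import dataclass

@dataclass
class TestCollection:
    docs: dict     # doc_id -> document text
    queries: dict  # query_id -> query string (derived from an information need)
    qrels: dict    # (query_id, doc_id) -> True if relevant, False if non-relevant

collection = TestCollection(
    docs={"d1": "red wine and heart health", "d2": "white wine tasting notes"},
    queries={"q1": "wine red white heart attack effective"},
    qrels={("q1", "d1"): True, ("q1", "d2"): False},
)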

7 Information need The information need is translated into a query
Relevance is assessed relative to the information need, NOT the query. E.g., information need: I’m looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. E.g., query: wine red white heart attack effective. We evaluate whether the document addresses the information need, not whether the document has the query words. D = “red wine at heart attack prices” [does not address the need].

8 Evaluating an IR system
Precision: fraction of retrieved docs that are relevant. Recall: fraction of relevant docs that are retrieved. False negatives: relevant docs judged as non-relevant by the IR system. Consider the first row sum (tp + fp, the retrieved docs) and the first column sum (tp + fn, the relevant docs).

                Relevant               Non-relevant
Retrieved       tp (true positives)    fp (false positives)
Not retrieved   fn (false negatives)   tn (true negatives)

Precision P = tp / (tp + fp)
Recall R = tp / (tp + fn)
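A minimal sketch of these two formulas in Python; the function name precision_recall and the counts in the usage example are assumptions for illustration.

def precision_recall(tp, fp, fn):
    """Precision and recall from the contingency table counts."""
    precision = tp / (tp + fp)  # fraction of retrieved docs that are relevant
    recall = tp / (tp + fn)     # fraction of relevant docs that are retrieved
    return precision, recall

# Hypothetical counts: 8 relevant retrieved, 12 non-relevant retrieved, 2 relevant missed
print(precision_recall(tp=8, fp=12, fn=2))  # (0.4, 0.8)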

9 Precision / Recall Precision and recall trade off against one another
You can get high recall (but low precision) by retrieving all docs for any query, but then precision is nearly zero. Recall is a non-decreasing function of the number of docs retrieved. In a good system, precision decreases as the number of docs retrieved increases. This is not a theorem, but a result with strong empirical confirmation.

10 Ranked or Unranked Unranked: classify results as ‘useful’ or ‘not-useful’, e.g., Boolean retrieval: these documents are relevant, these documents are not relevant. Ranked: the results are presented in order from ‘most useful’ to ‘least useful’, e.g., search engine output based on a relevance measure.

11 Evaluation of unranked retrieval sets
Meaning? The result is an unordered set of documents returned. F measure: a combined measure that assesses the precision/recall tradeoff:
F = (β² + 1) P R / (β² P + R), where β² = (1 − α) / α and α ∈ [0, 1], so β² ∈ [0, ∞]
Usually we use the balanced F1 measure, i.e., β = 1:
F1 = 2 P R / (P + R)
with Precision P = tp / (tp + fp) and Recall R = tp / (tp + fn)
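A minimal sketch of the F measure in Python, written in terms of β rather than α; the function name f_measure and the example values P = 0.4, R = 0.8 are assumptions for illustration.

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Balanced F1 for hypothetical values P = 0.4, R = 0.8
print(f_measure(0.4, 0.8))            # 2*0.4*0.8 / (0.4 + 0.8) = 0.533...
print(f_measure(0.4, 0.8, beta=2.0))  # beta > 1 weights recall more heavily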

12 Evaluating ranked results
An IR system (e.g., a search engine) returns a ranked list of documents d1, d2, d3, d4, ..., dn. Compute the precision Pi and recall Ri values for every prefix {d1, d2, ..., di} of the list (i = 1, ..., n). Precision-recall curve: plot y = precision (P = tp / (tp + fp)) against x = recall (R = tp / (tp + fn)). If system 1 has fewer false positives at a given recall R than system 2, its precision curve is higher at that value of R.
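A minimal sketch of computing (Ri, Pi) for every prefix of a ranked list; the function name prefix_precision_recall and the toy ranking in the usage example are assumptions for illustration.

def prefix_precision_recall(ranked_doc_ids, relevant_ids):
    """Return (recall_i, precision_i) after each prefix d1..di of a ranked list."""
    relevant_ids = set(relevant_ids)
    points, tp = [], 0
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            tp += 1
        points.append((tp / len(relevant_ids), tp / i))  # (R_i, P_i)
    return points

# Toy ranking with 2 relevant docs ("a" and "d") among 4 retrieved
print(prefix_precision_recall(["a", "b", "c", "d"], {"a", "d"}))
# [(0.5, 1.0), (0.5, 0.5), (0.5, 0.333...), (1.0, 0.5)]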

13 Algorithmic aspects Precision-recall curve: plot Pi vs Ri for i = 1 to n and join the points by lines. Ri is non-decreasing in i: it increases only when a newly retrieved doc is relevant. Precision varies: if di is a relevant doc, Pi ≥ Pi-1; if di is not a relevant doc, Pi < Pi-1. This means the plot is not monotone decreasing (it has a sawtooth shape), so we use the upper envelope.
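One way to read "upper envelope" is as a running maximum of precision taken from the highest-recall end of the point list; this is a minimal sketch under that reading, with the helper name upper_envelope and the toy points my own.

def upper_envelope(points):
    """Replace each precision by the best precision at any recall >= its own recall.

    points: list of (recall, precision) pairs, sorted by recall.
    """
    env, best = [], 0.0
    for recall, precision in reversed(points):  # sweep from highest recall downwards
        best = max(best, precision)
        env.append((recall, best))
    return list(reversed(env))

# Sawtooth points from a toy ranking: precision dips, then recovers at higher recall
print(upper_envelope([(0.5, 1.0), (0.5, 0.5), (1.0, 0.66)]))
# [(0.5, 1.0), (0.5, 0.66), (1.0, 0.66)]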

14 A precision-recall curve What was the relevance of the first 5 documents?
[Figure: a sawtooth original precision-recall curve and its interpolated curve.]

15 Interpolated precision
Remove the jiggles in the precision-recall curve. Interpolated precision pinterp(r) is defined as the highest precision found at any recall level r' ≥ r: pinterp(r) = max over r' ≥ r of p(r'), where p(r') is the precision at recall value r'. The justification is that almost anyone would be prepared to look at a few more documents if it increased the percentage of the viewed set that was relevant, i.e., the precision of the larger set is higher.
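A minimal sketch of pinterp(r) in Python; the function name interpolated_precision is mine, and the usage example reuses the (R, P) points computed in the lecture example that follows.

def interpolated_precision(points, r):
    """pinterp(r): max precision over all (recall, precision) points with recall >= r."""
    candidates = [p for rec, p in points if rec >= r]
    return max(candidates) if candidates else 0.0

# (recall, precision) points from the lecture example
pts = [(1/3, 1/3), (2/3, 0.25), (1.0, 0.2)]
print(interpolated_precision(pts, 0.5))  # 0.25: best precision at recall >= 0.5
print(interpolated_precision(pts, 0.2))  # 0.333...: the recall-1/3 point dominates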

16 Eleven-point interpolated precision
The interpolated precision is measured at standard recall values, i.e., the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0. Interpolation algorithm: let P[ j ] be the interpolated precision at recall level j/10. Set pinterp(1.0) = 0 if not all relevant docs are retrieved. For j = 0 to 10 do P[ j ] = pinterp(j/10).
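A minimal sketch of the eleven-point interpolation; the function name eleven_point_interpolated is mine, and the (R, P) points are the ones from the lecture example that follows.

def eleven_point_interpolated(points):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0."""
    def p_interp(r):
        candidates = [p for rec, p in points if rec >= r]
        # no candidate means recall r is never reached, i.e. not all relevant docs
        # are retrieved, so the interpolated precision is 0
        return max(candidates) if candidates else 0.0
    return [p_interp(j / 10) for j in range(11)]

pts = [(1/3, 1/3), (2/3, 0.25), (1.0, 0.2)]
print(eleven_point_interpolated(pts))
# [0.333, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]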

17 Example For a query, the IR system retrieves the set of docs ranked in the order d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3. Suppose there are 3 relevant docs: d3, d56, d129.

Rank:       1     2    3    4   5   6   7     8     9     10   11   12   13    14    15
Retrieved:  d123  d84  d56  d6  d8  d9  d511  d129  d187  d25  d38  d48  d250  d113  d3
Relevant & retrieved: at ranks 3 (d56), 8 (d129), and 15 (d3)

18 Example To plot the upper envelope of the interpolated precision-recall graph, we only need to consider the precision each time a new relevant doc is retrieved. When d56 is retrieved (rank 3): P = 1/3 ≈ 0.33, R = 1/3 ≈ 0.33. When d129 is retrieved (rank 8): P = 2/8 = 0.25, R = 2/3 ≈ 0.67. When d3 is retrieved (rank 15): P = 3/15 = 0.2, R = 3/3 = 1.
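These three (P, R) points can be recomputed with a few lines of Python; only the loop structure is mine, the ranking and relevant set are exactly as on the previous slide.

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d56", "d129"}

tp = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        tp += 1
        # precision and recall at each rank where a new relevant doc appears
        print(doc, "P = %.2f" % (tp / rank), "R = %.2f" % (tp / len(relevant)))
# d56 P = 0.33 R = 0.33
# d129 P = 0.25 R = 0.67
# d3 P = 0.20 R = 1.00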

19 Example
[Figure: interpolated precision-recall curve for the example, precision (%) vs. recall (%), through the points (R, P) = (0.33, 0.33), (0.67, 0.25), (1, 0.2).]

20 What does it all mean? Precision P = tp / (tp + fp), Recall R = tp / (tp + fn). Ideally we retrieve all the relevant documents first, followed by all the irrelevant ones. If we have T true (relevant) docs followed by F false (irrelevant) docs, then at the k-th step (k ≤ T): tp = k, fp = 0, tp + fn = T, so P = k/k = 1 and R = k/T. The curve is a horizontal line through 1 on the precision axis (y-axis).
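A minimal sketch of this ideal ranking (all T relevant docs first, then F irrelevant ones); the function name ideal_curve and the values T = 3, F = 2 are assumptions for illustration.

def ideal_curve(num_relevant, num_irrelevant):
    """(recall, precision) points when all relevant docs are ranked first."""
    ranking = [True] * num_relevant + [False] * num_irrelevant
    points, tp = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            tp += 1
        points.append((tp / num_relevant, tp / k))
    return points

print(ideal_curve(3, 2))
# [(0.333, 1.0), (0.667, 1.0), (1.0, 1.0), (1.0, 0.75), (1.0, 0.6)]
# precision stays at 1 while k <= T, i.e. a horizontal line at P = 1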

21 Compare Stem and No-Stem Systems
Example: compare a system that uses stemming with one that does not. The curve closest to the upper right-hand corner of the graph indicates the best performance. [IBM data]

