7CCSMWAL Algorithmic Issues in the WWW

Evaluation in Information Retrieval
See Chapter 8 of Introduction to Information Retrieval.

Ideas
We want to compare the performance of binary classifier systems which, e.g., divide documents into 'useful' and 'not useful'.
This is an old problem: various treatments were tested for their ability to improve a medical condition, with the treatment evaluated as 'improved' or 'not improved' for each patient. The true outcome is usually not clear-cut; for instance, whether a person has been cured of hypertension is determined from a blood pressure measurement taken after administering a drug.

True and false positives
True positive rate (TPR) = true positives / relevant docs
False positive rate (FPR) = false positives / non-relevant docs
This is the simplest case: we know the true answer (which documents are relevant) and we look at how the classifier performed.
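As a minimal sketch (not from the slides; the function and variable names are illustrative assumptions), the two rates can be computed directly from the four confusion-matrix counts:

```python
def tpr_fpr(tp, fp, fn, tn):
    """True positive rate and false positive rate from confusion-matrix counts."""
    relevant = tp + fn           # all relevant (positive) documents
    non_relevant = fp + tn       # all non-relevant (negative) documents
    tpr = tp / relevant if relevant else 0.0
    fpr = fp / non_relevant if non_relevant else 0.0
    return tpr, fpr

# Hypothetical example: 40 relevant docs (30 retrieved), 960 non-relevant docs (20 retrieved)
print(tpr_fpr(tp=30, fp=20, fn=10, tn=940))   # (0.75, 0.0208...)
```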

ROC space
See http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Plot TPR (true positive rate) against FPR (false positive rate).
The top left-hand corner is perfect classification; the bottom right-hand corner is completely wrong. The diagonal (the dotted red line in the plot) corresponds to random guessing.

Test collection method
To measure the information retrieval effectiveness of something (an algorithm, an IR system, or a search engine) we need a test collection consisting of:
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, usually given as a binary assessment of either relevant or non-relevant for each query-document pair

Information need
The information need is translated into a query. Relevance is assessed relative to the information need, NOT the underlying query.
E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
E.g., query: wine red white heart attack effective
We evaluate whether the document addresses the information need, not whether the document contains the query words.
D = "red wine at heart attack prices" contains the query words but does not address the information need.

Evaluating an IR system
Precision: fraction of retrieved docs that are relevant.
Recall: fraction of relevant docs that are retrieved.
False negatives: relevant docs judged as non-relevant by the IR system.
Consider the first row sum and first column sum of the contingency table:

                Relevant               Non-relevant
Retrieved       tp (true positives)    fp (false positives)
Not retrieved   fn (false negatives)   tn (true negatives)

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
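A minimal sketch (not part of the slides; names are illustrative) of these two formulas:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from contingency-table counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: 30 relevant retrieved, 20 non-relevant retrieved, 10 relevant missed
print(precision_recall(tp=30, fp=20, fn=10))   # (0.6, 0.75)
```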

Precision / Recall
Precision and recall trade off against one another.
You can get high recall (but low precision) by retrieving all docs for every query, but then precision is nearly zero.
Recall is a non-decreasing function of the number of docs retrieved.
In a good system, precision decreases as the number of docs retrieved increases. This is not a theorem, but a result with strong empirical confirmation.

Ranked or unranked
Unranked: classify results as 'useful' or 'not useful', e.g., Boolean retrieval: these documents are relevant, these documents are not relevant.
Ranked: the results are presented in order from 'most useful' to 'least useful', e.g., search engine output based on a relevance measure.

Evaluation of unranked retrieval sets
Meaning? An unordered set of documents is returned.
F measure: a combined measure that assesses the precision/recall trade-off:
F = 1 / (α/P + (1 − α)/R) = (β² + 1) P R / (β² P + R)
where β² = (1 − α) / α and α ∈ [0, 1], so β² ∈ [0, ∞].
Usually we use the balanced F1 measure (β = 1):
F1 = 2 P R / (P + R)
with precision P = tp / (tp + fp) and recall R = tp / (tp + fn).
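A small sketch (an assumption, not from the slides) of the weighted F measure; with beta = 1 it reduces to the balanced F1:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.6, 0.75))           # balanced F1 = 2*0.6*0.75 / (0.6 + 0.75) = 0.667
print(f_measure(0.6, 0.75, beta=2))   # F2 weights recall more heavily
```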

Evaluating ranked results
An IR system (e.g., a search engine) returns a ranked list of documents d1, d2, d3, d4, ..., dn.
Compute the precision Pi and recall Ri values for every prefix {d1, d2, ..., di} of the list (i = 1, ..., n).
Precision-recall curve: plot y = precision (P = tp / (tp + fp)) against x = recall (R = tp / (tp + fn)).
If system 1 has fewer false positives than system 2 at a given recall R, its precision curve is higher at that value of R.
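A sketch (with illustrative names, not from the slides) that computes the (recall, precision) point for every prefix of a ranked list, given the set of relevant doc IDs:

```python
def precision_recall_points(ranked_docs, relevant):
    """(recall, precision) for each prefix of a ranked result list."""
    points = []
    hits = 0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))   # (recall Ri, precision Pi)
    return points
```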

Algorithmic aspects
Precision-recall curve: plot Pi vs. Ri for i = 1 to n and join the points by lines.
Ri is non-decreasing in i, because more docs are retrieved.
Precision varies: if di is a relevant doc, Pi > Pi−1 (unless Pi−1 is already 1); if di is not a relevant doc, Pi < Pi−1.
This means the plot is not monotone decreasing, so we use the upper envelope.

A precision-recall curve
[Figure: a sawtooth precision-recall curve (original curve) together with its interpolated curve. Question posed on the slide: what was the relevance of the first 5 documents?]

Interpolated precision
To remove the jiggles in the precision-recall curve, the interpolated precision pinterp(r) is defined as the highest precision found at any recall level r' ≥ r:
pinterp(r) = max { p(r') : r' ≥ r }
where p(r') is the precision at recall value r'.
The justification is that almost anyone would be prepared to look at a few more documents if it increased the percentage of the viewed set that was relevant, i.e., if the precision of the larger set is higher.
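A sketch (illustrative, not from the slides) of interpolated precision over a list of (recall, precision) points, such as the ones produced by the precision_recall_points helper above:

```python
def interpolated_precision(points, r):
    """Highest precision at any recall level r' >= r (0 if no such point exists)."""
    return max((p for recall, p in points if recall >= r), default=0.0)
```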

Eleven-point interpolated precision
The interpolated precision is measured at standard recall values, i.e., the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0.
Interpolation algorithm: let P[j] be the precision at recall level j/10. Set pinterp(1.0) = 0 if not all relevant docs are retrieved. For j = 0 to 10 do P[j] = pinterp(j/10).
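A sketch (assuming the helpers above) of the eleven-point computation; because interpolated_precision returns 0 when no point reaches recall r, the rule pinterp(1.0) = 0 for incomplete retrieval is handled automatically:

```python
def eleven_point_interpolated(points):
    """Interpolated precision at the recall levels 0.0, 0.1, ..., 1.0."""
    return [interpolated_precision(points, j / 10) for j in range(11)]
```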

Example
For a query, the IR system retrieves the docs ranked in the order d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3.
Suppose there are 3 relevant docs: d3, d56, d129.

Rank:       1     2    3    4   5   6   7     8     9     10   11   12   13    14    15
Retrieved:  d123  d84  d56  d6  d8  d9  d511  d129  d187  d25  d38  d48  d250  d113  d3
Relevant docs are retrieved at ranks 3 (d56), 8 (d129), and 15 (d3).

Example
To plot the interpolated precision-recall graph (upper envelope), we only need to consider the precision at the points where a new relevant doc is retrieved:
When d56 is retrieved (rank 3): P = 1/3 ≈ 0.33, R = 1/3 ≈ 0.33
When d129 is retrieved (rank 8): P = 2/8 = 0.25, R = 2/3 ≈ 0.67
When d3 is retrieved (rank 15): P = 3/15 = 0.2, R = 3/3 = 1
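A quick usage sketch (assuming the precision_recall_points helper above) that reproduces these numbers:

```python
ranked = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
          "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d56", "d129"}

points = precision_recall_points(ranked, relevant)
for rank in (3, 8, 15):                        # ranks where a relevant doc appears
    recall, precision = points[rank - 1]
    print(rank, round(recall, 2), round(precision, 2))
# 3 0.33 0.33
# 8 0.67 0.25
# 15 1.0 0.2
```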

Example
[Figure: precision-recall plot of the three points (R, P) = (0.33, 0.33), (0.67, 0.25), (1, 0.2), with precision (%) on the y-axis and recall (%) on the x-axis.]

What does it all mean?
Precision P = tp / (tp + fp), recall R = tp / (tp + fn).
Ideally we want to retrieve all the relevant documents first, followed by all the irrelevant ones.
If we have T relevant (true) docs followed by F irrelevant (false) docs, then at the k-th step (k ≤ T): tp = k, fp = 0 and tp + fn = T, so P = k/k = 1 and R = k/T.
The precision-recall curve is then a horizontal line through 1 on the precision (y) axis.

Example: compare Stem and No-Stem systems
The curve closest to the upper right-hand corner of the graph indicates the best performance. [IBM data]