1 Retrieval Evaluation

2 Brief Review In computer science, implementations are often evaluated in terms of time and space complexity. With large document sets or large content types, such performance evaluations remain valid. In information retrieval, however, we also care about retrieval performance evaluation, that is, how well the retrieved documents match the goal.

3 Retrieval Performance Evaluation We discussed overall system evaluation previously:
–Traditional vs. berry-picking models of retrieval activity
–Metrics include time to complete task, user satisfaction, user errors, time to learn system
But how can we compare how well different algorithms do at retrieving documents?

4 Precision and Recall Consider a document collection, a query and its results, and a task and its relevant documents.
[Diagram: the document collection, containing the set of relevant documents |R|, the set of retrieved documents (the answer set) |A|, and their intersection, the relevant documents in the answer set |Ra|]

5 Precision Precision – the percentage of retrieved documents that are relevant = |Ra| / |A|
[Same collection diagram as slide 4]

6 Recall Recall – the percentage of relevant documents that are retrieved = |Ra| / |R|
[Same collection diagram as slide 4]
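To make the two definitions concrete, here is a minimal sketch in Python (the function and document names are illustrative, not from the slides) that computes precision and recall from a set of relevant documents and a retrieved answer set:

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall for one query.

    relevant  -- set of documents judged relevant (R)
    retrieved -- set of documents returned by the system (A)
    """
    ra = relevant & retrieved                                     # relevant documents in the answer set (Ra)
    precision = len(ra) / len(retrieved) if retrieved else 0.0    # |Ra| / |A|
    recall = len(ra) / len(relevant) if relevant else 0.0         # |Ra| / |R|
    return precision, recall

# Example: 10 relevant documents, 15 retrieved, 5 of the retrieved are relevant
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
retrieved = {"d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
             "d19", "d4", "d29", "d33", "d48", "d54", "d1"}
print(precision_recall(relevant, retrieved))   # (0.333..., 0.5)
```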

7 Precision/Recall Trade-Off We can guarantee 100% recall by returning all documents in the collection …
–Obviously, this is a bad idea!
We can get high precision by returning only documents that we are sure of.
–Maybe a bad idea
So, retrieval algorithms are characterized by their precision/recall curve.

8 Plotting Precision/Recall Curve 11-Level Precision/Recall Graph
–Plot precision at 0%, 10%, 20%, …, 100% recall.
–Normally averages over a set of standard queries are used: P_avg(r) = Σ_i P_i(r) / N_q, where P_i(r) is the precision at recall level r for query i and N_q is the number of queries.
Example (using one query):
Relevant documents: R_q = {d1, d2, d3, d4, d5, d6, d7, d8, d9, d10}
Ordered ranking by retrieval algorithm: A_q = {d10, d27, d7, d44, d35, d3, d73, d82, d19, d4, d29, d33, d48, d54, d1}

9 Plotting Precision/Recall Curve Example (second query):
Relevant documents: R_q = {d1, d7, d82}
Ordered ranking by retrieval algorithm: A_q = {d10, d27, d7, d44, d35, d3, d73, d82, d19, d4, d29, d33, d48, d54, d1}
With only three relevant documents, recall levels such as 10% or 20% are never hit exactly, so we need to interpolate.
Now plot the average of the precision values across the different queries.
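A minimal sketch of how the 11-level interpolated curve could be computed for one query (the function name and structure are my own, assuming the standard interpolation rule that precision at recall level r is the maximum precision observed at any recall ≥ r):

```python
def eleven_point_curve(relevant, ranking):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for a single query."""
    total_relevant = len(relevant)
    hits = 0
    observed = []                          # (recall, precision) each time a relevant document is seen
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # interpolated precision at level r = max precision at any recall >= r
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

# Second query from the slide: d7, d82, and d1 appear at ranks 3, 8, and 15
curve = eleven_point_curve({"d1", "d7", "d82"},
                           ["d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
                            "d19", "d4", "d29", "d33", "d48", "d54", "d1"])
```

Averaging these per-query lists level by level over the N_q queries gives the P_avg(r) values that are plotted.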

10 Single Value Summaries We would like to be able to compare performance on specific queries to determine anomalous behavior. We could record the precision at a specific recall level. A simple idea is to use the precision at the position of the first relevant document. Why would this be a bad idea? Why might it work?
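As a small illustration of that simple idea (the helper name is mine), the precision at the first relevant document is just 1 divided by the rank at which that document appears:

```python
def precision_at_first_relevant(relevant, ranking):
    """Precision when the first relevant document is reached (the reciprocal of its rank)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0   # no relevant document retrieved
```

A single early hit makes this score high even if every later relevant document is missed, which is one reason it can be misleading on its own.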

11 Average Precision at Seen Relevant Documents Does just what it says …
Example: A_q = {d10, d27, d7, d44, d35, d3, d73, d82, d19, d4, d29, d33, d48, d54, d1}, with relevant documents R_q = {d1, …, d10} as before.
The relevant documents appear at ranks 1, 3, 6, 10, and 15, giving precisions of 1, 0.66, 0.5, 0.4, and 0.33.
Compute the average: (1 + 0.66 + 0.5 + 0.4 + 0.33) / 5 ≈ 0.58
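A minimal sketch of this computation (my own function name; note that, as on the slide, it divides by the number of relevant documents actually seen in the ranking rather than by all |R| relevant documents):

```python
def avg_precision_at_seen_relevant(relevant, ranking):
    """Average of the precision values recorded each time a relevant document is seen."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
           "d19", "d4", "d29", "d33", "d48", "d54", "d1"]
relevant = {f"d{i}" for i in range(1, 11)}
print(avg_precision_at_seen_relevant(relevant, ranking))   # (1 + 2/3 + 1/2 + 2/5 + 1/3) / 5 ≈ 0.58
```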

12 R-Precision Idea: compute the precision at the R-th position in the ranking, where R is the number of relevant documents. So if there are 10 (or R) relevant documents, report the precision among the top 10 (or R) ranked documents.
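A small sketch of R-precision under the same assumptions about inputs as above (the function name is illustrative):

```python
def r_precision(relevant, ranking):
    """Precision within the top R ranked documents, where R = number of relevant documents."""
    R = len(relevant)
    top_R = ranking[:R]
    return sum(1 for doc in top_R if doc in relevant) / R

# Running example: 4 of the top 10 documents (d10, d7, d3, d4) are relevant
print(r_precision({f"d{i}" for i in range(1, 11)},
                  ["d10", "d27", "d7", "d44", "d35", "d3", "d73", "d82",
                   "d19", "d4", "d29", "d33", "d48", "d54", "d1"]))   # 0.4
```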

13 Precision Histograms Create a histogram of the difference between the R-precision values of two retrieval algorithms for a set of queries.
[Bar chart: per-query R-precision difference, y-axis from –0.5 to 1.0, for queries 1–8]
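A sketch of the quantity being plotted (the algorithm labels and dictionary layout are assumptions of mine): for each query, take the difference between the R-precision of algorithm A and that of algorithm B; positive bars favour A and negative bars favour B.

```python
def r_precision_differences(rp_algorithm_a, rp_algorithm_b):
    """Per-query R-precision differences to be drawn as histogram bars.

    rp_algorithm_a, rp_algorithm_b -- dicts mapping query id -> R-precision
    """
    return {q: rp_algorithm_a[q] - rp_algorithm_b[q] for q in rp_algorithm_a}
```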

14 Problems with Precision and Recall Proper estimation of recall requires detailed knowledge of all documents in the collection.
–When is this available?
–Is this possible on the web?
Precision and recall are both calculations related to the number of relevant documents returned.
–A single value might be preferred.
Precision and recall are valuable for batch processing.
–What about for interactive systems?
Precision and recall are computed assuming a linear ordering of the returned documents.
–What about when unordered sets are returned?

15 F Measure Most commonly used single-value metric. Combines precision and recall:
F(j) = 2 / ( 1/r(j) + 1/P(j) )
Where:
r(j) = recall at the j-th document in the ranking
P(j) = precision at the j-th document in the ranking
The value ranges from 0 (when no relevant documents have been retrieved) to 1 (when every document ranked up to j is relevant and all relevant documents appear among them).
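A minimal sketch of the F measure at a given rank, reusing the precision/recall style from earlier (the function name is mine):

```python
def f_measure(recall_at_j, precision_at_j):
    """Harmonic mean of recall and precision at rank j: F(j) = 2 / (1/r(j) + 1/P(j))."""
    if recall_at_j == 0.0 or precision_at_j == 0.0:
        return 0.0
    return 2.0 / (1.0 / recall_at_j + 1.0 / precision_at_j)

print(f_measure(0.5, 0.333))   # about 0.4 for the running example at rank 15
```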

16 E Measure We may prefer to emphasize either recall or precision. The E measure allows a comparison that includes that bias. Another combination of precision and recall:
E(j) = 1 – (1 + b^2) / ( b^2/r(j) + 1/P(j) )
Where:
r(j) = recall at the j-th document in the ranking
P(j) = precision at the j-th document in the ranking
E(j) is the complement of F(j) when b = 1. With the formula as written, values of b greater than 1 weight recall more heavily, while values of b smaller than 1 weight precision more heavily.
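A companion sketch for the E measure (again, my own function name), matching the formula above:

```python
def e_measure(recall_at_j, precision_at_j, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / r(j) + 1 / P(j)); larger b shifts the emphasis toward recall."""
    if recall_at_j == 0.0 or precision_at_j == 0.0:
        return 1.0   # worst case: nothing relevant retrieved
    return 1.0 - (1.0 + b * b) / (b * b / recall_at_j + 1.0 / precision_at_j)

print(e_measure(0.5, 0.333))        # complement of F(j): about 0.6
print(e_measure(0.5, 0.333, b=2))   # recall weighted more heavily
```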

17 User-Oriented Measures Users have expectations when searching for information. How can we compare retrieval algorithms in a way that accounts for those expectations?
[Diagram: document collection with the relevant documents |R|, the retrieved documents |A|, the relevant documents known to the user |U|, the relevant documents known to the user that are retrieved |Rk|, and the relevant documents not known to the user that are retrieved |Ru|]

18 User-Oriented Measures Coverage = |Rk| / |U|
Novelty = |Ru| / ( |Ru| + |Rk| )
Relative recall and recall effort instead use the number of relevant documents the user expected to find.
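A small sketch of coverage and novelty under the set definitions above (the function and argument names are mine):

```python
def coverage_and_novelty(known_relevant, retrieved, all_relevant):
    """Coverage = |Rk| / |U|, Novelty = |Ru| / (|Ru| + |Rk|).

    known_relevant -- relevant documents the user already knows about (U)
    retrieved      -- documents returned by the system (A)
    all_relevant   -- all relevant documents (R)
    """
    rk = known_relevant & retrieved                      # known relevant documents that are retrieved
    ru = (all_relevant - known_relevant) & retrieved     # previously unknown relevant documents retrieved
    coverage = len(rk) / len(known_relevant) if known_relevant else 0.0
    novelty = len(ru) / (len(ru) + len(rk)) if (ru or rk) else 0.0
    return coverage, novelty
```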

19 Evaluating Interactive Systems Empirical data involving human users is time-consuming to gather, and it is difficult to draw universal conclusions from it. Evaluation metrics for user interfaces:
–Time required to learn the system
–Time to achieve goals on benchmark tasks
–Error rates
–Retention of the use of the interface over time
–User satisfaction

