
1 Evaluation LBSC 708A/CMSC 838L Session 6, October 16, 2001 Philip Resnik and Douglas W. Oard

2 Agenda  Philip Resnik –Questions –Evaluation fundamentals –User-centered strategies Doug Oard –System-centered strategies –Lessons From TREC

3 Basis for Selection Visceral (9) –“I’ll know it when I see it” (expected answer) Conscious (6) –Might be true for known item retrieval Formalized (0) –Used in the TREC interactive evaluations Compromised (3) –I can’t imagine a case where this would be right

4 Muddiest Points Clustering (4) –How are clusters formed? (what are heuristics?) –What meaning do clusters have? How starfields and contours are used (3) Projection to fewer dimensions (2) –How are the dimensions selected? [FastMap] Summarization (2) –Distinction between salience and selectivity Why did we devote an entire class to HCI?

5 Single-Link Clustering Pick any document as the first cluster “seed” Add the most similar document to each cluster –Cosine measure, Okapi BM25, etc. –Start a new cluster if none are within a threshold –Adding the same document joins two clusters Optionally check large clusters for splits –Recluster largest clusters with a tighter threshold
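
One way to realize the thresholded, chain-building behaviour described above is to link every pair of documents whose cosine similarity clears the threshold and take connected components; this Python sketch (the function name, TF-IDF input, and the 0.3 threshold are my assumptions, not the slide's) captures the "joining two clusters" step with a union-find:

    import numpy as np

    def single_link_clusters(vectors, threshold=0.3):
        """Thresholded single-link clustering: two documents end up in the same
        cluster whenever a chain of pairwise cosine similarities >= threshold
        connects them. vectors: (n_docs, n_terms) array of e.g. TF-IDF weights."""
        X = np.asarray(vectors, dtype=float)
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        unit = X / norms
        sims = unit @ unit.T                      # pairwise cosine similarity

        # union-find: adding a shared document merges two clusters
        parent = list(range(len(X)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        def union(i, j):
            parent[find(i)] = find(j)

        n = len(X)
        for i in range(n):
            for j in range(i + 1, n):
                if sims[i, j] >= threshold:
                    union(i, j)

        clusters = {}
        for i in range(n):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())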

6 FastMap Projection Select any point (document) at random Find the furthest point (least similar document) Use coordinates of two points to define a line –A line defines an “orthogonal hyperplane” Project all points to the orthogonal hyperplane –Project to the line, then subtract from original point Repeat for desired number of dimensions
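
A compact sketch of the projection step, following the FastMap formulation (Faloutsos and Lin) that the slide summarizes; it works from a pairwise distance matrix (for documents, e.g. 1 − cosine similarity), and the pivot-selection heuristic here is a simplification:

    import numpy as np

    def fastmap(dist, k):
        """Project n objects into k dimensions given an (n, n) symmetric
        distance matrix. Returns an (n, k) array of coordinates."""
        n = dist.shape[0]
        d2 = dist.astype(float) ** 2            # work with squared distances
        coords = np.zeros((n, k))
        for dim in range(k):
            # pivot heuristic: a random point, then its farthest point, then that one's farthest
            a = np.random.randint(n)
            b = int(np.argmax(d2[a]))
            a = int(np.argmax(d2[b]))
            dab2 = d2[a, b]
            if dab2 == 0:                       # all remaining distances are zero
                break
            # project every point onto the line through pivots a and b (cosine law)
            x = (d2[a] + dab2 - d2[b]) / (2.0 * np.sqrt(dab2))
            coords[:, dim] = x
            # residual squared distances in the hyperplane orthogonal to that line
            d2 = d2 - (x[:, None] - x[None, :]) ** 2
            d2 = np.clip(d2, 0.0, None)         # guard against negative rounding error
        return coords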

7 Visualizing FastMap

8 Evaluation Criteria Effectiveness –User-machine system, ranked list quality, … Efficiency –Retrieval time, indexing time, index size Usability –Learnability, novice use, expert use

9 IR Effectiveness Evaluation User-centered strategy –Given several users, and at least 2 retrieval systems –Have each user try the same task on both systems –Measure which system works the “best” System-centered strategy –Given documents, queries, and relevance judgments –Try several variations on the retrieval system –Measure which ranks more good docs near the top

10 Measures of Effectiveness Good measures of effectiveness should –Capture some aspect of what the user wants –Have predictive value for other situations Different queries, different document collection –Be easily replicated by other researchers –Be expressed as a single number Allows two systems to be easily compared No IR measures of effectiveness are that good!

11 Comparing Alternative Approaches Achieve a meaningful improvement –An application-specific judgment call Achieve reliable improvement in unseen cases –Can be verified using statistical tests

12 Evolution of Evaluation in Language Technology Evaluation by inspection of examples Evaluation by demonstration Evaluation by improvised demonstration Evaluation on data using a figure of merit Evaluation on test data Evaluation on common test data Evaluation on common, unseen test data

13 Resnik’s Ten Commandments of Language Technology Evaluation Thou shalt define pertinent evaluation metrics Thou shalt define replicable evaluation metrics Thou shalt report all relevant system parameters Thou shalt establish upper bounds on performance Thou shalt establish lower bounds on performance Thou shalt test differences for statistical significance Thou shalt not mingle training data with test data

14 Statistical Significance Tests How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?
Experiment 1   Query:    1     2     3     4     5     6     7    Average
  System A            0.20  0.21  0.22  0.19  0.17  0.20  0.21       0.20
  System B            0.40  0.41  0.42  0.39  0.37  0.40  0.41       0.40
Experiment 2   Query:    1     2     3     4     5     6     7    Average
  System A            0.02  0.39  0.16  0.58  0.04  0.09  0.12       0.20
  System B            0.76  0.07  0.37  0.21  0.02  0.91  0.46       0.40

15 Estimating from Samples Population: mean μ Sample: mean X̄ The sample mean is a point estimator of μ – unbiased: the expected value of X̄ equals μ – consistent: X̄ gets closer to μ as the sample size N gets bigger

16 Sampling, continued Population: mean μ Samples: means X̄1, X̄2, X̄3, … Problem: you can’t do all possible experiments. Statistics is the science of inference from the experiments you do.

17 Sampling and Experimentation Population: mean μ Samples: means X̄1, X̄2, X̄3, … Most experiments will produce a result somewhere near the true value. Some won’t. So how can you infer anything from experimental results?

18 Hypothesis Testing Paradigm Choose some experimental variable X Establish a null hypothesis H0 Identify distribution of X assuming H0 is true Do experiment to get an empirical value x Ask: “What’s the probability I would have gotten this value of x if H0 were true?” If this probability p is low, reject H0

19 Hypothesis Testing Example In a southern state, 50% of eligible voters are African American. In a particular court case, 4 of 80 potential jurors were African American. Was the pool of potential jurors chosen fairly?

20 Hypothesis Testing Example, cont’d “Experiment”: choosing a jury X: the number of African American jurors H0 (null hypothesis): the jury pool distribution reflects the population Expected distribution of X: binomial (each eligible voter equally likely to be chosen) Intuition: when N=80, μ (the expected value of X) is 40, and 95% of outcomes cluster near the middle of the 0–80 range Probability of getting x ≤ 4 when N=80: 0.0000000000000000014 The null hypothesis is vanishingly unlikely!
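
Under these assumptions the tail probability on the slide can be reproduced directly; a minimal Python sketch (the scipy call and the p = 0.5 population fraction follow the slide's setup, the rest is mine):

    from scipy.stats import binom

    # P(X <= 4) when 80 jurors are drawn and each draw is African American
    # with probability 0.5, as in the slide's southern-state example
    p_value = binom.cdf(4, 80, 0.5)
    print(p_value)   # ~1.4e-18, matching the slide's 0.0000000000000000014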

21 Back to IR Evaluation! How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?
Experiment 1   Query:    1     2     3     4     5     6     7    Average
  System A            0.20  0.21  0.22  0.19  0.17  0.20  0.21       0.20
  System B            0.40  0.41  0.42  0.39  0.37  0.40  0.41       0.40
Experiment 2   Query:    1     2     3     4     5     6     7    Average
  System A            0.02  0.39  0.16  0.58  0.04  0.09  0.12       0.20
  System B            0.76  0.07  0.37  0.21  0.02  0.91  0.46       0.40

22 Sign Test Compare some figure of merit for each topic –Note which system produces the bigger value Null hypothesis: either system is equally likely to produce the bigger value for any query Value X to measure: how often B gives you a bigger value than A Compute the probability of the x you got –Any statistics package contains the formula for this Treat p < 0.05 as “statistically significant” –If p < 0.05, this result probably didn’t happen by chance –But the results still need to pass the “meaningful” test!
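
As a concrete illustration, here is a hedged Python sketch of the sign test applied to the Experiment 1 scores above; it uses scipy's binomtest (binom_test in older scipy releases), and the two-sided alternative is my assumption, not something the slide specifies:

    from scipy.stats import binomtest

    # per-topic scores for the two systems (Experiment 1 from the earlier slide)
    a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
    b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]

    wins_b = sum(1 for x, y in zip(a, b) if y > x)   # topics where B beats A
    ties = sum(1 for x, y in zip(a, b) if y == x)    # ties are conventionally dropped
    n = len(a) - ties

    result = binomtest(wins_b, n=n, p=0.5, alternative='two-sided')
    print(result.pvalue)   # 7 wins out of 7 gives p ~ 0.016, so reject H0 at 0.05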

23 Student’s t test Null hypothesis: the two sets of scores (averages avg1 and avg2; e.g., 0.20 0.21 0.22 0.19 0.17 0.20 0.21 versus 0.40 0.41 0.42 0.39 0.37 0.40 0.41) were sampled from the same underlying distribution Measurable value: t = look it up if you’re interested! Note: this test is only valid if you assume the two distributions are normal and have the same variance. Also note: doesn’t assume the values are paired.
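
A quick sketch of this unpaired test in Python, using the same hypothetical scores and scipy's ttest_ind with equal_var=True to match the equal-variance assumption noted above:

    from scipy.stats import ttest_ind

    a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
    b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]

    # classic two-sample Student's t test: assumes both samples are normal with
    # equal variance, and does not pair values by topic
    t, p = ttest_ind(a, b, equal_var=True)
    print(t, p)   # reject the null hypothesis if p < 0.05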

24 Paired t test More powerful than the sign test –If the assumptions are satisfied Compute the figure of merit difference –On a topic by topic basis for enough topics to approximate a normal distribution Assume that the topics are independent Compute the probability of the outcome you got –Microsoft Excel includes a TTEST function Treat p < 0.05 as “statistically significant”
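
For the paired version, scipy's ttest_rel computes the topic-by-topic differences internally (Excel's TTEST with type=1 gives the same paired p-value); the scores below reuse the hypothetical example from the earlier slides:

    from scipy.stats import ttest_rel

    a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]   # system A, topics 1-7
    b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]   # system B, same topics

    # paired t test on the per-topic differences; assumes the topics are
    # independent and the differences are roughly normally distributed
    t, p = ttest_rel(a, b)
    print(t, p)   # treat p < 0.05 as statistically significant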

25 User Studies The only way to account for interface issues –By studying the interface component –By studying the complete system Formative tests –Provide a basis for system development Summative tests –Designed to assess performance

26 Questionnaires Collect demographic data –For example, computer experience –Basis for interpreting results Subjective self-assessment –Which did they think was more effective? –Often at variance with objective results! Preference –Which interface did they prefer? Why?

27 Qualitative User Studies Observe user behavior –Instrumented software, eye trackers, etc. –Face and keyboard cameras –Think-aloud protocols –Interviews and focus groups Organize the data –For example, group it into overlapping categories Look for patterns and themes Develop a “grounded theory”

28 Quantitative User Studies Select independent variable(s) –e.g., what info to display in selection interface Select dependent variable(s) –e.g., time to find a known relevant document Run subjects in different orders –Average out learning and fatigue effects Compute statistical significance –p<0.05 is sufficient to reject the null hypothesis that the independent variable has no effect

29 Agenda Philip Resnik –Questions –Evaluation fundamentals –User-centered strategies  Doug Oard –System-centered strategies –Lessons From TREC

30 Which is the Best Rank Order? [Figure: eight ranked lists, labeled a–h, with relevant documents (R) appearing at different positions and depths in each list]

31 IR Test Collections Documents –Representative quantity –Representative sources and topics Topics (formalized) –Used to form queries (compromised) Relevance judgments –For each document, with respect to each topic –This is the expensive part!

32 Some Assumptions Unchanging, known queries –The same queries are used by each system Binary relevance –Every document is either relevant or it is not Unchanging, known relevance –The relevance of each doc to each query is known But only used for evaluation, not retrieval! Focus on effectiveness, not efficiency

33 Set-Based MOE Precision –How much of what was found is relevant? Often of interest, particularly for interactive searching Recall –How much of what is relevant was found? Particularly important for law and patent searches Fallout –How much of what was irrelevant was rejected? Useful when different size collections are compared

34 The Contingency Table
  Doc \ Action      Retrieved               Not Retrieved
  Relevant          Relevant Retrieved      Relevant Rejected
  Not relevant      Irrelevant Retrieved    Irrelevant Rejected
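
To make the table concrete, here is a small Python helper (the function name and the example counts are my own, not from the slides) that derives the set-based measures from the four cells:

    def set_measures(rel_ret, rel_rej, irr_ret, irr_rej):
        """Set-based measures from the four contingency-table cells."""
        precision = rel_ret / (rel_ret + irr_ret) if (rel_ret + irr_ret) else 0.0
        recall    = rel_ret / (rel_ret + rel_rej) if (rel_ret + rel_rej) else 0.0
        fallout   = irr_ret / (irr_ret + irr_rej) if (irr_ret + irr_rej) else 0.0
        return precision, recall, fallout

    # hypothetical counts: 4 relevant retrieved, 1 relevant rejected,
    # 6 irrelevant retrieved, 89 irrelevant rejected
    print(set_measures(rel_ret=4, rel_rej=1, irr_ret=6, irr_rej=89))
    # -> (0.4, 0.8, ~0.063)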

35 Single-Figure Set-Based Measures Balanced F-measure –Harmonic mean of recall and precision –Weakness: what if no relevant documents exist? Utility functions –Reward relevant retrieved, penalize non-relevant retrieved –For example, 3R⁺ − 2N⁺ –Weakness: hard to normalize, so hard to average
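
A sketch of the two single-figure measures named above; the 3/−2 weights come from the slide's example, while the helper names and sample numbers are mine:

    def f1(precision, recall):
        """Balanced F-measure: harmonic mean of precision and recall."""
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    def utility(rel_retrieved, nonrel_retrieved, reward=3, penalty=2):
        """Linear utility in the slide's form: 3*R+ - 2*N+."""
        return reward * rel_retrieved - penalty * nonrel_retrieved

    print(f1(0.4, 0.8))     # ~0.533
    print(utility(4, 6))    # 3*4 - 2*6 = 0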

36 MOE for Ranked Retrieval Start with the first relevant doc on the list –Compute recall and precision at that point Move down the list, 1 relevant doc at a time –Computing recall and precision at each point Plot precision for every value of recall –Interpolate with a nonincreasing step function Repeat for several queries Average the plots at every point
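
The recipe above can be expressed as a short Python sketch for a single query (the function name and the example ranking are assumptions for illustration, not part of the slides):

    def interpolated_pr_curve(ranked_rel, num_relevant):
        """ranked_rel: booleans down the ranked list, True where the document is
        relevant. Returns (recall, precision) points at each relevant document,
        with precision interpolated so it is non-increasing as recall grows."""
        points, found = [], 0
        for rank, is_rel in enumerate(ranked_rel, start=1):
            if is_rel:
                found += 1
                points.append((found / num_relevant, found / rank))
        # interpolation: precision at recall r = max precision at any recall >= r
        best, interpolated = 0.0, []
        for recall, precision in reversed(points):
            best = max(best, precision)
            interpolated.append((recall, best))
        return list(reversed(interpolated))

    # hypothetical ranked list with relevant docs at ranks 1, 3, 6, 10 (4 relevant)
    rels = [True, False, True, False, False, True, False, False, False, True]
    print(interpolated_pr_curve(rels, num_relevant=4))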

37 The Precision-Recall Curve [Figure: a precision-recall curve built from a ranked list of 10 documents (Doc = 10) with Relevant = 4 and Not relevant = 6, updating the contingency-table counts (Relevant Retrieved, Relevant Rejected, Irrelevant Retrieved, Irrelevant Rejected) as each relevant document (R) is reached]

38 Single-Number Measures Mean precision at a fixed number of documents –Precision at 10 docs is the “AltaVista measure” Mean precision at a fixed recall level –Adjusts for the total number of relevant docs Mean Average Precision (MAP) –Average of precision at recall=0.0, 0.1, …, 1.0 –Area under the precision/recall curve Mean breakeven point –Value at which precision = recall
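
A hedged sketch of two of these summaries for one query: precision at 10 documents and the 11-point average precision described above (the example ranking and names are my own):

    def single_number_measures(ranked_rel, num_relevant):
        """ranked_rel: booleans down the ranked list; num_relevant: total relevant."""
        prec_at_10 = sum(ranked_rel[:10]) / 10          # the "AltaVista measure"
        # precision/recall at each relevant document retrieved
        points, found = [], 0
        for rank, is_rel in enumerate(ranked_rel, start=1):
            if is_rel:
                found += 1
                points.append((found / num_relevant, found / rank))
        # 11-point average: interpolated precision at recall 0.0, 0.1, ..., 1.0
        avg_11pt = sum(max((p for r, p in points if r >= level), default=0.0)
                       for level in (i / 10 for i in range(11))) / 11
        return prec_at_10, avg_11pt

    rels = [True, False, True, False, False, True, False, False, False, True]
    print(single_number_measures(rels, num_relevant=4))   # (0.4, ~0.65)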

39 [Figure: a precision-recall curve annotated with precision at recall = 0.1, precision at 10 docs, average precision, and the breakeven point]

40 Single-Figure MOE Weaknesses Mean precision at 10 documents –Averaged over very different numbers of relevant docs Mean precision at constant recall –A specific recall fraction is rarely the user’s goal Mean breakeven point –Nobody ever searches at the breakeven point Mean Average Precision –Users typically operate near an extreme of the curve So the average is not very informative

41 Why Mean Average Precision? It is easy to trade between recall and precision –Adding related query terms improves recall But naive query expansion techniques kill precision –Limiting matches by part-of-speech helps precision But it almost always hurts recall Comparisons should give some weight to both –Average precision is a principled way to do this Rewards improvements in either factor

42 Visualizing Average Precision [Figure: precision at each relevant document in an example ranked list; the resulting average precision is 0.353]

43 Visualizing Mean Average Precision [Figure: average precision (0.0–1.0) plotted topic by topic; the mean over topics is the MAP]

44 How Much is Enough? The maximum average precision is 1.0 –But inter-rater reliability is 0.8 or less –So 0.8 is a practical upper bound at every point –Precision near 0.8 is sometimes seen at low recall Improvement should be meaningful –Rule of thumb: 5% is noticeable, 10% is important Is the improvement statistically significant? –Typically requires at least 25 topics (50 is better)

45 Obtaining Relevance Judgments Exhaustive assessment can be too expensive –TREC has 50 queries for >1 million docs each year Random sampling won’t work either –If relevant docs are rare, none may be found! IR systems can help focus the sample –Each system finds some relevant documents –Different systems find different relevant documents –Together, enough systems will find most of them

46 Pooled Assessment Methodology Each system submits top 1000 documents Top 100 documents for each are judged –All are placed in a single pool –Duplicates are eliminated –Placed in an arbitrary order to avoid bias Evaluated by the person that wrote the query Assume unevaluated documents not relevant –Overlap evaluation shows diminishing returns Compute average precision over all 1000 docs
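
A minimal sketch of the pooling step (system names and document IDs are invented, and TREC's actual infrastructure is of course more involved):

    import random

    def build_pool(runs, depth=100):
        """runs: dict mapping system name -> ranked list of document IDs.
        Returns the deduplicated pool of each system's top `depth` documents,
        shuffled so assessors see them in an arbitrary order."""
        pool = set()
        for ranking in runs.values():
            pool.update(ranking[:depth])        # duplicates are eliminated by the set
        pool = list(pool)
        random.shuffle(pool)                    # arbitrary order to avoid bias
        return pool

    runs = {
        'system_a': ['d3', 'd7', 'd1', 'd9'],   # hypothetical rankings
        'system_b': ['d7', 'd2', 'd3', 'd8'],
    }
    print(build_pool(runs, depth=3))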

47 TREC Overview Documents are typically distributed in April Topics are distributed in early June –Queries are formed from topics using standard rules Top 1000 selections are due in mid-August Relevance judgments available in October Results presented in late November each year

48 Lessons From TREC Incomplete judgments are useful –If the sample is unbiased with respect to the systems tested Different relevance judgments change absolute scores –But rarely change comparative advantages when averaged Evaluation technology is predictive –Results transfer to operational settings Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

49 Effects of Incomplete Judgments Additional relevant documents are: –roughly uniform across systems –highly skewed across topics Systems that don’t contribute to pool get comparable results

50 Inter-Judge Agreement

51 Effect of Different Judgments

52 Net Effect of Different Judges Mean Kendall τ between system rankings produced from different qrel sets: 0.938 Similar results held for –Different query sets –Different evaluation measures –Different assessor types –Single opinion vs. group opinion judgments
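
Kendall's τ between two system rankings is easy to compute with scipy; the rankings below are invented purely to illustrate the kind of value the slide reports, they are not TREC data:

    from scipy.stats import kendalltau

    # positions of the same eight runs when ranked under two different qrel sets
    ranking_qrels_1 = [1, 2, 3, 4, 5, 6, 7, 8]
    ranking_qrels_2 = [1, 2, 3, 4, 5, 7, 6, 8]

    tau, p = kendalltau(ranking_qrels_1, ranking_qrels_2)
    print(tau)   # one adjacent swap out of 28 pairs gives tau ~ 0.93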

53 Query-Averaged Recall-Precision

54 Single-Topic Recall-Precision

55 What Query Averaging Hides

56 Single Number Summary Scores Precision(X): # relevant in top X / X Recall(Y): # relevant in top Y / R Average precision: Avg_r(Prec(rank of r)) R-Precision: Prec(R) Recall at 0.5 precision –use Prec(10) if precision < 0.5 in the top 10 Rank of first relevant (expected search length)
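
For reference, a small Python sketch of three of these summaries for a single ranked list: the non-interpolated average precision, R-precision, and rank of first relevant (the example list and names are hypothetical):

    def summary_scores(ranked_rel, num_relevant):
        """ranked_rel: booleans down the ranked list; num_relevant: R, the total
        number of relevant documents for the topic."""
        ranks_of_rel = [rank for rank, is_rel in enumerate(ranked_rel, start=1) if is_rel]
        # average precision: mean of precision at the rank of each relevant doc
        avg_prec = sum((i + 1) / rank for i, rank in enumerate(ranks_of_rel)) / num_relevant
        # R-precision: precision in the top R documents
        r_prec = sum(ranked_rel[:num_relevant]) / num_relevant
        # rank of first relevant document (related to expected search length)
        first_rel = ranks_of_rel[0] if ranks_of_rel else None
        return avg_prec, r_prec, first_rel

    rels = [True, False, True, False, False, True, False, False, False, True]
    print(summary_scores(rels, num_relevant=4))   # (~0.642, 0.5, 1)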

57 Document Level Measures Example: precision at 10 documents Easily interpreted Less sensitive to ranking –Only differences that cross cut-off affect results May better reflect actual applications But less useful for tuning a system Don’t average well –Working in different parts of recall-precision curve –Some queries don’t have that many documents

58 Example: Number Relevant at 30 Documents

59 Average Precision Less easily interpreted Sensitive –Changing one rank will change average precision –Documents closest to the top have the greatest effect Stable –Small change in rank makes small change in score

60 Runs Ranked by Different Measures Ranked by measure averaged over 50 topics

61 Correlations Between Rankings Kendall’s τ computed between pairs of rankings

62 Two Minute Papers If I demonstrated a new retrieval technique that achieved a statistically significant improvement in average precision on the TREC collection, what would be the most serious limitation to consider when interpreting that result? What was the muddiest point in today’s lecture?

