Advanced Information Retrieval

Presentation on theme: "Advanced Information Retrieval"— Presentation transcript:

1 Advanced Information Retrieval
Meeting #3

2 Performance Evaluation
How do you evaluate the performance of an information retrieval system? Or compare two different systems?

3 Relevance for IR
A measurement of the outcome of a search; the judgment on what should or should not be retrieved. There is no simple answer to what is and is not relevant: relevance is difficult to define and subjective, depending on knowledge, needs, time, situation, etc. It is the central concept of information retrieval.

4 Relevance to What?
Information needs? Problems? Requests? Queries? The final test of relevance is whether users find the information useful, whether they can use the information to solve the problems they have, and whether it fills the information gap they perceive.

5 Relevance Judgment
The user's judgment: how well the retrieved documents satisfy the user's information needs, and how useful they are. Documents that are related but not useful are still not relevant.
The intermediary's judgment: how likely is the user to judge the information as useful? How important will the user consider the information?
The system's judgment: how well the retrieved documents match the query.

6 What is the goal of an IR system?
A “good” IR system is able to extract meaningful (relevant) information while withholding non-relevant information. Why is this difficult? What are we testing?

7 What are the components of relevance? Some criteria…
Depth/scope, accuracy/validity, content novelty, document novelty, tangibility, ability to understand, recency, external validation, effectiveness, access, clarity, source quality, etc.

8 Determining relevance?
Subjective in nature; may be determined by:
The user who posed the retrieval problem: realistic, but based on many personal factors; relates to the problem/information need.
An external judge: relates to the statement/query; assumes independence of judgments.
Should relevance be binary or n-ary?

9 Precision vs. Recall Which one is more important? Depends on the task!
A generic web search engine: precision! An index of court cases, where all legal precedents are needed: recall!

10 Relationship of R and P
Theoretically, R and P do not depend on each other. Practically, high recall is achieved at the expense of precision, and high precision is achieved at the expense of recall. When will P = 0? Only when none of the retrieved documents is relevant. When will P = 1? Only when every retrieved document is relevant.
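As a concrete illustration of these edge cases, here is a minimal Python sketch of precision and recall computed over sets of document identifiers (function and variable names are illustrative, not from the slides):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# P = 0: none of the retrieved documents is relevant.
print(precision(["d1", "d2"], ["d3"]))    # 0.0
# P = 1: every retrieved document is relevant.
print(precision(["d3"], ["d3", "d4"]))    # 1.0
```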

11 Ideal Retrieval Systems
An ideal IR system would have P = 1 and R = 1 for every query. Is this possible? Why? If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. In practice none of these conditions holds, so there is no ideal information retrieval system.

12 Precision vs. Recall Inversely related
As recall increases, precision decreases. [Figure: precision plotted against recall, showing the inverse trade-off.]

13 Evaluation of IR Systems
Using recall and precision: conduct query searches, try many different queries, and compare the resulting recall and precision. Recall and precision need to be considered together. Results vary depending on the test data and queries. Recall and precision measure only one aspect of system performance: high recall and high precision are desirable, but not necessarily the most important thing the user considers.

14 Precision vs. Recall
Precision and recall depend on the size of the selected set, which in turn depends on the user interface, the user, and the user's task. Boolean system: assume all retrieved documents are presented and viewed by the user. Ranked system: depends on the number of documents the user views from the ranked list.

15 Implementing Precision & Recall
Common method: measure precision at several levels of recall, or measure precision at increasing sizes of the "selected" set. For each query, calculate precision at 11 levels of recall (0%, 10%, ..., 100%), average across all queries, and plot the precision vs. recall curve.
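A sketch of that per-query computation in Python; the interpolation rule assumed here is the standard one, where the precision at recall level r is the highest precision observed at any recall ≥ r, and all names are illustrative:

```python
def eleven_point_interpolated(ranked, relevant):
    """Interpolated precision at recall = 0.0, 0.1, ..., 1.0 for one query.

    ranked   -- document ids in the order the system returned them
    relevant -- set of document ids judged relevant for this query
    """
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    hits, points = 0, []                  # (recall, precision) at each relevant hit
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / k))
    # Interpolation: precision at level r is the best precision at any recall >= r.
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Averaging across queries is then an element-wise mean of these 11-value lists.
```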

16 Data quality consideration
Coverage of the database; completeness and accuracy of the data. Indexing methods and indexing quality: indexing types, currency of indexing (is it updated often?), and index sizes.

17 Precision vs. Recall Average precision across all user queries
Rong Jin, Alex G. Hauptmann, and ChengXiang Zhai. "Title language model for information retrieval." In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

18 Precision & Recall for a single query
14 documents: 1, 2, 4, 6, and 13 are relevant. [Table: precision and recall after the i-th document retrieved.]

19 Precision & Recall for a Single Query
14 documents: 1, 2, 4, 6, and 13 are relevant (see the worked computation below).
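The precision/recall table from these two slides did not survive extraction; the sketch below recomputes it under the stated setup, assuming that documents 1, 2, 4, 6, and 13 are the only relevant documents:

```python
# Worked version of the slides' example: 14 documents are retrieved and the
# ones at ranks 1, 2, 4, 6 and 13 are (assumed to be all of) the relevant ones.
relevant_ranks = {1, 2, 4, 6, 13}
total_relevant = len(relevant_ranks)

hits = 0
for i in range(1, 15):                    # i-th document retrieved
    if i in relevant_ranks:
        hits += 1
    p, r = hits / i, hits / total_relevant
    print(f"rank {i:2d}  precision {p:.2f}  recall {r:.2f}")
# e.g. at rank 4: precision = 3/4 = 0.75, recall = 3/5 = 0.60
```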

20 Example: How to show that a search engine is better than other search engines?
At each level of recall, the precision of your system is higher than that of the other systems. Three-point average: r = .25, .5, .75. 11-point average: r = 0.0, .1, .2, .3, .4, ..., .9, 1.0. Or: comparing the first 10, 20, 30, ... items returned by the search engines, your system always has more relevant documents than the other systems.

21 Use fixed interval levels of Recall to compare Precision
         System 1   System 2   System 3
R=.25      0.6        0.7        0.9
R=.50      0.5        0.4        0.7
R=.75      0.2        0.3        0.4

22 Use fixed intervals of the number of retrieved documents to compare Precision
                Number of relevant documents retrieved (System A)
                Query 1    Query 2    Query 3    Average precision
N=10               4          5          6            0.5
N=20               4          5         16            0.42
N=30               5         17          5            0.3
N=40               8          6         24            0.32
N=50              10          6         25            0.27
(N = number of documents retrieved)
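A small sketch that reproduces the averages in the table above; the per-query counts are the slide's illustrative numbers, and the placement of the displaced cell in the N=50 row is inferred from the stated average:

```python
# Precision at fixed cutoffs N, averaged over the three queries of System A.
relevant_at_n = {           # relevant documents found in the top N, per query
    10: [4, 5, 6],
    20: [4, 5, 16],
    30: [5, 17, 5],
    40: [8, 6, 24],
    50: [10, 6, 25],
}

for n, counts in relevant_at_n.items():
    avg_precision = sum(c / n for c in counts) / len(counts)
    print(f"N={n}  average precision {avg_precision:.2f}")
# N=10 -> 0.50, N=20 -> 0.42, N=30 -> 0.30, N=40 -> 0.32, N=50 -> 0.27
```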

23 Use Precision and Recall to Evaluate IR Systems
              1          2          3          4          5
System A   0.9/0.1    0.7/0.4    0.45/0.5   0.3/0.6    0.1/0.8
System B   0.8/0.2    0.5/0.3    0.4/0.5    0.3/0.7    0.2/0.8
System C   0.9/0.4    0.7/0.6    0.5/0.7    0.3/0.8    0.2/0.9
(each entry is a precision/recall pair)

24 P-R diagram
[Figure: precision-recall curves for Systems A, B, and C, with precision (0.1 to 1.0) on the vertical axis and recall (0.1 to 1.0) on the horizontal axis.]
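One hypothetical way to re-draw this diagram from the precision/recall pairs on the previous slide, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

# (precision, recall) pairs from the table on the previous slide
systems = {
    "System A": [(0.9, 0.1), (0.7, 0.4), (0.45, 0.5), (0.3, 0.6), (0.1, 0.8)],
    "System B": [(0.8, 0.2), (0.5, 0.3), (0.4, 0.5), (0.3, 0.7), (0.2, 0.8)],
    "System C": [(0.9, 0.4), (0.7, 0.6), (0.5, 0.7), (0.3, 0.8), (0.2, 0.9)],
}

for name, pairs in systems.items():
    precision = [p for p, _ in pairs]
    recall = [r for _, r in pairs]
    plt.plot(recall, precision, marker="o", label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```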

25 Problems with Recall/Precision
Poor match with user needs; limited usefulness of recall; does not handle interactivity well; how is recall computed? Relevance is not utility; averages ignore individual differences in queries.

26

27 Interface Consideration
User-friendly interface: how long does it take for a user to learn the advanced features? How well can the user explore or interact with the query output? How easy is it to customize output displays?

28 User-Centered IR Evaluation
More user-oriented measures: satisfaction, informativeness. Other types of measures: time, cost-benefit, error rate, task analysis. Evaluation of user characteristics, of the interface, and of the process or interaction.

29 User satisfaction The final test is the user!
User satisfaction is more important than precision and recall. Measuring user satisfaction: surveys, usage statistics, user experiments.

30

31 Retrieval Effectiveness
Designing an information retrieval system involves many decisions: manual or automatic indexing? Natural language or a controlled vocabulary? Which stoplists? Which stemming methods? What query syntax? And so on. How do we know which of these methods is most effective? Is everything a matter of judgment?

32 Studies of Retrieval Effectiveness
• The Cranfield Experiments: Cyril W. Cleverdon, Cranfield College of Aeronautics
• The SMART System: Gerald Salton, Cornell University
• TREC: Donna Harman, National Institute of Standards and Technology (NIST)

33 Cranfield Experiments (Example)
Comparative efficiency of indexing systems (Universal Decimal Classification, an alphabetical subject index, a special facet classification, and the Uniterm system of co-ordinate indexing). Four indexes were prepared manually for each document, in three batches of 6,000 documents: 18,000 documents in total, each indexed four times. The documents were reports and papers in aeronautics. The indexes for testing were prepared on index cards and other cards, with very careful control of the indexing procedures.

34 Cranfield Experiments (continued)
Searching:
• 1,200 test questions, each satisfied by at least one document
• Reviewed by an expert panel
• Searches carried out by 3 expert librarians
• Two rounds of searching to develop the testing methodology
• Subsidiary experiments at the English Electric Whetstone Laboratory and Western Reserve University

35 The Cranfield Data
The Cranfield data was made widely available and used by other researchers.
• Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing.
• Spärck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, the definition of test corpora, etc.

36 The Cranfield Experiments 1950s/1960s
Time Lag. The interval between the demand being made and the answer being given. Presentation. The physical form of the output. User effort. The effort, intellectual or physical, demanded of the user. Recall. The ability of the system to present all relevant documents. Precision. The ability of the system to withhold non-relevant documents.

37 Cranfield Experiments -- Analysis
Cleverdon introduced recall and precision, based on the concept of relevance. [Figure: the region occupied by practical systems plotted on recall (%) vs. precision (%) axes.]

38 Why Not the Others?
According to Cleverdon: time lag is a function of hardware; presentation is successful if the user can read and understand the list of references returned; and user effort can be measured with a straightforward examination of a small number of cases.

39 In Reality
The user task needs to be considered carefully. Cleverdon was focusing on batch interfaces, but interactive browsing interfaces are very significant (Turpin & Hersh). In interactive systems, user effort and presentation are very important.

40 In Spite of That
Precision and recall have been extensively evaluated; usability, not so much.

41 Why Not Usability?
Usability requires a user study: every new feature needs a new study (expensive), and results have high variance because of many confounding factors. Offline analysis of accuracy, once a dataset is available, is easy to control, repeatable, automatic, and free. And if the system isn't accurate, it isn't going to be usable.

42 Measures
• From IR: (user) precision, aspectual recall
• From experimental psychology: quantitative measures (time, number of errors, ...) and qualitative measures (user opinions)
Example evaluation measures:
                  System viewpoint     User viewpoint
Effectiveness     recall/precision     quality of solution
Efficiency        retrieval time       task completion time
Satisfaction      preference           confidence

43 Another view… “The omission of the user from the traditional IR model, whether it is made explicit or not, stems directly from the user’s absence from the Cranfield experiment”. (Harter and Hert, 1997)

44 The TREC Era
Text REtrieval Conference (TREC), sponsored and hosted by NIST. Begun in 1992, with participants from academia, industry, and government. Provides standard test collections and queries; relevance judging and data analysis are done at NIST.

45 TREC Goals… to encourage research in information retrieval based on large test collections; to increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas; to speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; to increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems

46 TREC databases
About 5 gigabytes of text; sources include WSJ, AP, Computer Selects, Federal Register, SJ Mercury, FT, Congressional Record, FBIS, and the LA Times. Simple SGML tagging; no correction of errors in the text.

47 TREC Experiments
1. NIST provides the text corpus on CD-ROM; each participant builds an index using its own technology.
2. NIST provides 50 natural-language topic statements; the participant converts them to queries (automatically or manually).
3. The participant runs the searches and returns up to 1,000 hits to NIST, which analyzes them for recall and precision (all TREC participants use rank-based methods of searching).
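For step 3, runs are conventionally submitted as plain-text files in the trec_eval format, one line per retrieved document; a minimal sketch (the topic number, document numbers, and run tag below are hypothetical):

```python
# Writing a run file in the conventional trec_eval format:
#   <topic> Q0 <docno> <rank> <score> <run_tag>
results = {                 # hypothetical ranked results per topic: (docno, score)
    "301": [("FT911-3032", 12.7), ("LA010189-0018", 11.9)],
}

with open("myrun.txt", "w") as out:
    for topic, docs in results.items():
        for rank, (docno, score) in enumerate(docs[:1000], start=1):
            out.write(f"{topic} Q0 {docno} {rank} {score} my_run\n")
```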

48 TREC Evaluation
Summary table statistics: number of topics, documents, relevant documents retrieved, and relevant documents available.
Recall-precision averages: average precision at 11 levels of recall.
Document-level averages: average precision at specified document cutoff values.
Average precision histogram: a single measure for each topic.
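The single measure reported per topic is typically non-interpolated average precision; a sketch of that computation with illustrative names (mean average precision is then the mean of this value across topics):

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision for one topic: the sum of precision@k
    at every rank k where a relevant document appears, divided by the total
    number of relevant documents for the topic."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

# Example: relevant documents at ranks 1 and 3 of a three-item ranking.
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))   # (1/1 + 2/3) / 2 ≈ 0.83
```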

49 Evaluation of Web Search Engines
The dynamic nature of the database; differences between databases (content, indexing); operational systems; recall is "practically unknowable"; generally relatively little overlap.

50 What IR still lacks…
A wider range of tasks and evaluation suites, especially for interfaces; appropriate measures for interactive evaluation; standard tests for standard users; flexibility to match users' needs and outcomes; mechanisms to study process and strategy as well as outcomes.

