Evaluation of IR systems – Jane Reid, AMSc IRIC, QMUL, 16/10/01

1 Evaluation of IR systems
Jane Reid
jane@dcs.qmul.ac.uk

2 Lecture plan
Background
System-centred evaluation
User-centred evaluation

3 The changing face of evaluation
Originally...
– Batch IR systems
– Small, textual collections
– Queries formulated by searchers
Today...
– Interactive IR systems
– Large collections of different or mixed media
– Queries formulated by end-users

4 Elements of evaluation
When we evaluate, we need to establish:
– Methodology
– Criterion
– Measure
– Tool
– Method of data analysis

5 System-centred evaluation
(Comparative) evaluation of the technical performance of IR system(s)
Methodology = non-interactive experiment
Criterion = relevance
Measure = effectiveness
Tool = test collection
Method of data analysis = recall / precision

6 Relevance
Relevant = “having significant and demonstrable bearing on the matter at hand”
Underlying assumptions:
– Objectivity
– Topicality
– Binary nature
– Independence

7 Effectiveness
Effectiveness = the ability of the IR system to retrieve relevant documents and suppress non-relevant documents

8 Test collection
Components:
– Document collection
– Queries / requests
– Relevance judgements

9 Test collection creation
Manual method:
– Every document judged against every query by one of several judges
Pooling method (see the sketch below):
– Queries run against several IR systems first
– Results pooled, and the top proportion chosen for judging
– Only the top documents are judged
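The pooling method can be illustrated with a short sketch. Everything here is an assumption made for the example (the rankings structure, the pool depth k, and the judge callback); it is not the tooling of any particular test collection.

```python
# Minimal sketch of pooling, under the assumptions stated above.
def build_pool(rankings, k=100):
    """rankings: {system_name: [doc_id, ...]} ranked best-first, for one query.
    Returns the set of documents that will actually be judged."""
    pool = set()
    for ranked_docs in rankings.values():
        pool.update(ranked_docs[:k])          # top-k from each system
    return pool

def judge_pool(pool, judge):
    """judge(doc_id) -> True/False. Documents outside the pool are,
    by convention, treated as non-relevant."""
    return {doc_id: judge(doc_id) for doc_id in pool}
```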

10 Recall / precision [1]
[Venn diagram: the document collection contains two overlapping sets, the retrieved documents and the relevant documents; their intersection is the set of documents that are both retrieved and relevant.]

11 Recall / precision [2]
Recall = the proportion of relevant documents that are retrieved, i.e. number of relevant documents retrieved / total number of relevant documents
Precision = the proportion of retrieved documents that are relevant, i.e. number of relevant documents retrieved / number of documents retrieved
(A minimal sketch of both measures follows below.)
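As a concrete illustration of these two definitions, the following sketch assumes the retrieved and relevant documents are available as sets of document ids.

```python
# Minimal sketch of recall and precision over sets of document ids.
def recall(retrieved, relevant):
    # relevant documents retrieved / total number of relevant documents
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    # relevant documents retrieved / number of documents retrieved
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

# Example: 4 of the 10 retrieved documents are relevant; 8 documents are relevant in total.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
relevant = {2, 4, 6, 8, 11, 12, 13, 14}
print(recall(retrieved, relevant))     # 0.5
print(precision(retrieved, relevant))  # 0.4
```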

12 How to use a test collection
For each system / system version:
– For each query in the test collection:
  Run the query against the system to obtain a ranking
  Use the ranking and the relevance judgements to calculate recall/precision (r/p) pairs at each recall point
  Interpolate to standard recall points if necessary
– Average the r/p values across all queries, in table or graph form
Produce an r/p graph for all systems
(A sketch of the per-query computation and the averaging step follows below.)
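The following sketch illustrates the per-query computation and the averaging step. The data structures (a ranked list of document ids and a set of relevant ids) are assumptions for the example, and the per-query curves are assumed to have already been interpolated to common recall points (see the interpolation sketch after the next slide).

```python
# Illustrative sketch only: r/p pairs for one query, then averaging over queries.
def rp_pairs(ranking, relevant):
    """(recall, precision) at each rank where a relevant document is retrieved."""
    pairs, hits = [], 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            pairs.append((hits / len(relevant), hits / rank))
    return pairs

def average_curves(curves):
    """curves: one precision list per query, all at the same standard recall points."""
    n = len(curves)
    return [sum(curve[i] for curve in curves) / n for i in range(len(curves[0]))]
```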

13 Interpolation
[Graph: a recall/precision curve showing the observed values and the interpolated value at each standard recall point.]
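A common interpolation rule (the one used in TREC-style evaluation) takes the interpolated precision at recall level r to be the maximum precision observed at any recall level greater than or equal to r. A minimal sketch, reusing rp_pairs() from the previous example:

```python
# Interpolated precision at r = max precision at any recall >= r.
def interpolate(pairs, recall_points=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                      0.6, 0.7, 0.8, 0.9, 1.0)):
    return [max((p for r, p in pairs if r >= rp), default=0.0)
            for rp in recall_points]
```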

14 Averaging [1]
Precision at each recall level, for two queries and their average:

Recall   Query 1   Query 2   Average
0.1      0.8       0.6       0.7
0.2      0.8       0.5       0.65
0.3      0.6       0.4       0.5
0.4      0.6       0.3       0.45
0.5      0.4       0.25      0.325
0.6      0.4       0.2       0.3
0.7      0.3       0.15      0.225
0.8      0.3       0.1       0.2
0.9      0.2       0.05      0.125
1.0      0.2       0.05      0.125

15 Averaging [2]
[Graph of the averaged recall/precision values.]

16 Comparison of systems
[Graph: recall/precision curves for the compared systems plotted together.]

17 Examples of test collections [1]
TREC (Text REtrieval Conference)
– Started in 1992, run by the National Institute of Standards and Technology (NIST)
– Components:
  Huge document collection (several GB), taken from the Wall Street Journal, Financial Times, etc.
  New documents, topics (i.e. requests, including description and narrative fields) and relevance judgements (performed by retired civil servants) each year

18 Examples of test collections [2]
– Participants:
  Industrial, commercial and academic
  Must submit results of retrieval tasks to the TREC conference each November
– “Tracks”:
  Ad hoc + routing (filtering)
  Also: interactive, cross-lingual, Web, spoken document, short queries, ...

19 Examples of test collections [3]
CIS
– 1239 documents about cystic fibrosis from NLM’s MEDLINE collection
– Fields: author, title, source, major and minor subjects, abstracts, references and citations
– 100 queries, developed by relevance judges

20 Examples of test collections [4]
– Unusual features:
  4 judges per document per query (3 experts, 1 medical bibliographer)
  3 levels of relevance (0-2)
  Combined relevances on a scale of 0-8

21 Examples of test collections [5]
CACM
– 3204 articles on computer science from CACM, 1958-1979
– Fields: author, date, word stems for titles and abstracts, categories, direct referencing, bibliographic coupling, number of co-citations for each pair of articles
– 52 queries, each with 2 Boolean formulations

22 Examples of test collections [6]
– Unusual features:
  Citation links to other documents, so often used for hypertext-type experiments

23 User-centred evaluation
Evaluation of the interface / interaction
Methodology = interactive experiment, ethnographic study, ...
Many different criteria, measures, tools and methods of data analysis
– No standard user-centred methodology
– Elements often borrowed from other areas, e.g. HCI, experimental psychology

24 User-centred issues: layers model
[Diagram of the layers model.]

25 Test collection
Advantages:
– Cheap and easy for the evaluator
– Cross-system comparison possible
Limitations:
– Static requests / queries
– Objective, topical relevance judgements made by domain experts
– Does not evaluate interaction

26 Different document types
Multi-media documents
– Images
  Topical relevance
  Non-topical relevance
– Speech
  Recognition
  Retrieval
Structured collections

27 Interaction [1]
Data characteristics
– Size of documents
– Size of collection
System characteristics
– Retrieval effectiveness
– Functionality
– Interface features

28 Interaction [2]
User
– Domain expertise
– System expertise
– Task
– Subjects vs real users
Contextual
– Social and environmental factors

29 Strategy
System characteristics
– Type of access (query-based, browsing, mixed)
– Functional visibility
Search characteristics
– Topic focus
– Tactics and search strategy
User characteristics
– Mental/cognitive models

30 Tasks
Real
Simulated
– Past real
– Fictitious

31 Learning
System
– Dynamic weighting of terms/documents
– Case-based retrieval
– User modelling
User
– Evolving information needs
– Learning about the domain/collection/system
– Sociological view

32 Measures [1]
From IR
– Evaluation of results
  Aspectual recall/precision
  Pertinence
  Utility

33 Measures [2]
From information science/HCI
– Evaluation of results
  Task performance
– Evaluation of process
  Quantitative: time, number of errors
  Qualitative: usability
– Evaluation of overall quality of experience
  User satisfaction

34 Tools [1]
From information science/HCI
– Before the session
  Cognitive walkthroughs
  Interviews/questionnaires
– During the session
  Observation
  Think-aloud protocols

35 Tools [2]
– After the session
  Interviews/questionnaires
  Focus groups

36 Large-scale experiments
Interactive TREC
OKAPI

37 User-centred evaluation [1]
What is to be evaluated?
– e.g. an IR system using a new underlying model
Why do we want to evaluate?
– e.g. functionality, usability
How will we evaluate?
– e.g. effectiveness, efficiency, satisfaction

38 User-centred evaluation [2]
Example evaluation measures:

                 Functionality       Usability
Effectiveness    recall/precision    quality of solution
Efficiency       retrieval time      task completion time
Satisfaction     preference          confidence

39 Experimental design process
Formulate research hypothesis
Formulate experimental hypotheses
Design experiment(s)
Conduct pilot test and experiment(s)
Analyse data
Evaluate experimental hypotheses

40 Simple experimental design [1]
Controlled experiment in a laboratory setting
One group of participants
Each participant performs one or more tasks
– Pre-defined tasks vs “real” tasks

41 Simple experimental design [2]
Example data gathered at task stages:
– Stage 1: Formulate information need
– Stage 2: Gather information
  Task completion time
  Information-seeking behaviour
  – Use of observation, recording, think-aloud protocols

42 Simple experimental design [3]
Example data (continued):
– Stage 3: Use information
  Confidence
  – Use of questionnaires and interviews using Likert scales / semantic differentials
– Stage 4: Assess information
  Quality of solution
  – Independent assessment of task output

43 Simple experimental design [4]
Analysis:
– Mostly qualitative, with summary statistics
– Common-sense interpretation of results
– Use of pre-defined benchmarks

44 Complex experimental design [1]
Other controlled experiments:
– Within-subject, e.g. longitudinal study
– Between-subject
  Comparative study looking at the effect of:
  – System type, e.g. variations in the algorithm used
  – Task type
  – User characteristics, e.g. domain knowledge, general computer literacy, system knowledge
  Comparison with a control group

45 Complex experimental design [2]
Other controlled experiments (continued):
– Mixed within-subject / between-subject
  Examine the effect of the interaction of variables
Analysis:
– Quantitative:
  Summary statistics
  Significance testing (see the sketch below)
– Qualitative
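As one illustration of quantitative analysis, per-query (or per-participant) scores from two system variants can be compared with a paired significance test. The sketch below uses made-up scores and SciPy’s paired t-test; it is an assumed example, not a prescribed analysis.

```python
# Illustrative only: paired t-test on per-query scores for two system variants.
from scipy import stats

system_a = [0.70, 0.65, 0.50, 0.45, 0.33, 0.30]  # e.g. average precision per query (fabricated)
system_b = [0.60, 0.50, 0.40, 0.30, 0.25, 0.20]

t_stat, p_value = stats.ttest_rel(system_a, system_b)  # paired (within-subject) comparison
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the difference is unlikely to be due to chance.
```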

46 Complex experimental design [3]
Operational / ethno-methodological experiments
– Evaluation, in a “semi-real” or “real” setting, of the “acceptability” of the system
Analysis:
– Mostly qualitative

47 Complex experimental design [4]
Case studies
– Detailed evaluation using a single participant or a small number of participants
– Possible to examine cognitive and affective issues
Analysis:
– Mostly qualitative

48 Summary
System-centred evaluation
– Uses the test collection methodology, with recall and precision
– Good for evaluating technical performance
User-centred evaluation
– No standard methodology
– Good for evaluating the interface / interaction
Usually necessary to use a combination of both

