
Slide 1: CS 430: Information Discovery, Lecture 20: The User in the Loop

Slide 2: Course Administration

Final examination:
Date: Tuesday, 15-MAY
Start time: 3:00 PM
Finish time: 5:30 PM
Room: KL B11

Slide 3: Course Administration

Assignment 3. Not acceptable: "I recommend that the company use the XYZ commercial software package."

A common question: What file structure is suitable for fielded searching?

Slide 4: Inverted File (Basic)

Inverted file: a list of the words in a set of documents and the documents in which they appear.

Word     Documents
abacus   3, 19, 22
actor    2, 19, 29
aspen    5
atoll    11, 34

Stop words are removed before building the index. (From Lecture 3.)
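
As an illustration (not from the lecture), here is a minimal sketch in Python of building such a basic inverted file; the whitespace tokenizer and the tiny stop-word list are placeholder assumptions:

```python
# Minimal sketch of a basic inverted file: word -> sorted list of document
# IDs. The tokenizer and stop-word list are simplifying assumptions, not
# the lecture's actual implementation.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and"}  # assumed stop-word list

def build_inverted_file(documents):
    """documents: dict mapping document ID -> document text."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:  # stop words removed before indexing
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# build_inverted_file({3: "the abacus", 19: "abacus and actor"})
# yields {"abacus": [3, 19], "actor": [19]}
```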

Slide 5: Inverted File (Enhanced)

Word     Postings   Document   Location
abacus   4          3          94
                    19         7
                    19         212
                    22         56
actor    3          2          66
                    19         213
                    29         45
aspen    1          5          43
atoll    3          11         3
                    11         70
                    34         40

(From Lecture 3.)

Slide 6: Inverted File (Enhanced)

Word     Postings   Document   Location   Field
abacus   4          3          94         normal
                    19         7          title
                    19         212        normal
                    22         56         subject
actor    3          2          66         title
                    19         213        normal
                    29         45         normal
aspen    1          5          43         list
atoll    3          11         3          normal
                    11         70         normal
                    34         40         footnote
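
A sketch of how these enhanced entries might be represented, assuming one (document, location, field) record per occurrence; the names here are illustrative, not InQuery's actual structures:

```python
# Sketch of an enhanced inverted file: each occurrence of a word is stored
# as a (document, location, field) posting. Names are illustrative only.
from collections import defaultdict
from typing import NamedTuple

class Posting(NamedTuple):
    document: int  # document identifier
    location: int  # word position within the document
    field: str     # e.g. "normal", "title", "subject", "list", "footnote"

def build_enhanced_index(occurrences):
    """occurrences: iterable of (word, document, location, field) tuples."""
    index = defaultdict(list)
    for word, document, location, field in occurrences:
        index[word].append(Posting(document, location, field))
    return index

index = build_enhanced_index([
    ("abacus", 3, 94, "normal"), ("abacus", 19, 7, "title"),
    ("abacus", 19, 212, "normal"), ("abacus", 22, 56, "subject"),
])
assert len(index["abacus"]) == 4  # matches the postings count in the table
```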

Slide 7: The Human in the Loop

[Diagram: the user searches the index and is returned hits; the user browses the repository and is returned objects.]

Slide 8: Evaluation of Usability

Observing users (user protocols)
Focus groups
Measurements: effectiveness in carrying out tasks, speed
Expert review
Competitive analysis

Slide 9: See the paper by Croft, Cook, and Wilder in the CS 430 readings.

Slide 10: THOMAS

The documents: Full text of all legislation introduced in Congress since 1989, and the text of the Congressional Record.

Indexes: Bills are indexed by title, bill number, and the text of the bill. The Congressional Record is indexed by title, document identifier, date, speaker, and page number.

Search system: InQuery, developed by the University of Massachusetts; available commercially from Sovereign Hill Software.

Slide 11: Weighting a Single-Word Query

The more instances of the word in a document, the more relevant the document is considered. Occurrences of the term in the title are weighted most heavily (weight x 20).
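
A hedged sketch of that weighting rule, reusing the Posting record from the enhanced-index sketch above; InQuery's real formula is more involved, so this only shows the shape of the idea:

```python
# Illustrative single-word weighting: every occurrence contributes 1, except
# occurrences in the title, which contribute 20 (the "weight x 20" above).
# This is a simplification, not InQuery's actual ranking formula.
TITLE_BOOST = 20

def single_word_score(postings):
    """postings: the Posting records for one word in one document."""
    return sum(TITLE_BOOST if p.field == "title" else 1 for p in postings)
```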

Slide 12: Weighting Multiple-Word Queries

Documents are ranked in four tiers (see the sketch after this list):
1. Documents containing the search terms as a phrase, i.e., adjacent to each other.
2. The search terms occur near, but not next to, each other, and not necessarily in the order entered.
3. All search terms appear, but singly, not in proximity to each other.
4. Documents contain fewer than all of the words.
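
A sketch that assigns a document to one of these four tiers, given each query term's word positions in that document; the proximity window of 5 words is an assumed parameter, as the slide does not specify one:

```python
# Sketch of the four-tier ranking above. `positions` maps each query term to
# the set of word positions at which it occurs in one document.
def match_tier(query_terms, positions, window=5):
    if any(not positions.get(term) for term in query_terms):
        return 4  # the document contains fewer than all of the words
    # Tier 1: the terms appear adjacent to each other, in query order.
    for start in positions[query_terms[0]]:
        if all(start + i in positions[t]
               for i, t in enumerate(query_terms[1:], start=1)):
            return 1
    # Tier 2: all terms occur within `window` words of each other, any order.
    occurrences = sorted((p, t) for t in query_terms for p in positions[t])
    for i, (base, _) in enumerate(occurrences):
        nearby = {t for p, t in occurrences[i:] if p - base <= window}
        if nearby == set(query_terms):
            return 2
    return 3  # all terms appear singly, not in proximity

# match_tier(["death", "penalty"], {"death": {10}, "penalty": {11}}) -> 1
```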

Slide 13: Language Problems

InQuery considers documents containing no instances of any form of the search words to have no relevance: a search for "capital punishment" does not find legislation about the "death penalty".

If there are no highly relevant documents, InQuery returns poorly relevant ones: a search for "elderly black Americans" returned a bill on "black bears" as most relevant, followed by bills relating to "black colleges and universities". (There were no bills in any way related to "elderly black Americans".)

Slide 14: Advanced Features

Ranked output: combines evidence from the text of the document and from the corpus as a whole.
Passage-based retrieval: the probability of relevance is based both on the entire content of a document and on the best matching passage in the document (a sketch follows this list).
Simple and complex queries: e.g., simple word-based queries, Boolean queries, phrase-based queries, or a combination.
Field-based retrieval: e.g., by bill number and type.
Flexible and efficient indexing: incorporates a variety of document structures (e.g., HTML, MARC).
Tools for query processing and query expansion.
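
To make the passage-based idea concrete, a minimal sketch that blends a whole-document score with the best fixed-size passage score; the passage size, the 0.5 mixing weight, and `score_fn` itself are assumptions for illustration, not details given on the slide:

```python
# Sketch of passage-based retrieval as described above: relevance combines
# evidence from the entire document with its best-matching passage. The
# fixed-size passages and the 0.5 mixing weight are illustrative assumptions.
def passage_based_score(score_fn, doc_words, query, passage_size=200, mix=0.5):
    """score_fn(words, query) -> float is any whole-text scoring function."""
    passages = [doc_words[i:i + passage_size]
                for i in range(0, len(doc_words), passage_size)] or [doc_words]
    best_passage = max(score_fn(p, query) for p in passages)
    return mix * score_fn(doc_words, query) + (1 - mix) * best_passage
```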

Slide 15: Queries

Number of words in queries:

Words   Unique Queries
1       5,767
2       9,646
3       6,905
4       2,240
5       656
6       87
7       19
8       1
Total   25,321

Slide 16: D-Lib Working Group on Metrics

A DARPA-funded attempt to develop a TREC-like approach to digital libraries (1997).

"This Working Group is aimed at developing a consensus on an appropriate set of metrics to evaluate and compare the effectiveness of digital libraries and component technologies in a distributed environment. Initial emphasis will be on (a) information discovery with a human in the loop, and (b) retrieval in a heterogeneous world."

Very little progress was made. See: http://www.dlib.org/metrics/public/index.html

Slide 17: MIRA

Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications. A European study, 1996-99, chaired by Keith van Rijsbergen, Glasgow University.

Expertise:
Multimedia Information Retrieval
Information Retrieval
Human Computer Interaction
Case Based Reasoning
Natural Language Processing

Slide 18: MIRA Starting Point

Information retrieval techniques are beginning to be used in complex, goal- and task-oriented systems whose main objectives are not just the retrieval of information. New, original research in IR is being blocked or hampered by the lack of a broader framework for evaluation.

Slide 19: MIRA Aims

Bring the user back into the evaluation process.
Understand the changing nature of IR tasks and their evaluation.
'Evaluate' traditional evaluation methodologies.
Consider how evaluation can be prescriptive of IR design.
Move towards a balanced approach (system versus user).
Understand how interaction affects evaluation.
Support the move from static to dynamic evaluation.
Understand how new media affect evaluation.
Make evaluation methods more practical for smaller groups.
Spawn new projects to develop new evaluation frameworks.

Slide 20: MIRA Approaches

Develop methods and tools for evaluating interactive IR; possibly the most important activity of all.
User tasks: study real users and their overall goals.
Improve user interfaces to widen the set of users.
Develop a design for a multimedia test collection.
Organize collaborative projects. (TREC was organized as a competition.)
Pool tools and data.

