Jhu-hlt-2004 © n.j. belkin 1 Information Retrieval: A Quick Overview Nicholas J. Belkin


1 Information Retrieval: A Quick Overview
Nicholas J. Belkin
nick@belkin.rutgers.edu
http://scils.rutgers.edu/~belkin/belkin.html
jhu-hlt-2004 © n.j. belkin

2 The IR Situation
A person (the user) recognizes that her/his knowledge is inadequate for resolving some problem or achieving some goal (a problematic situation)
In order to resolve the problematic situation, the user has recourse to some knowledge resource external to her/himself

3 The IR Situation (2)
The user engages with the knowledge resource through some intermediary
The three components (user, knowledge resource, intermediary) and their interactions with one another together constitute the information retrieval system

4 IR Systems
The goal of an IR system is that the user’s problematic situation is appropriately resolved
This goal is accomplished by facilitating effective interaction of the user with appropriate information objects (elements of the knowledge resource)

5 Relevance
An indicator, or measure, of the appropriateness of an information object to a user’s problematic situation
Topical relevance: the information object is about the same topic as the problematic situation
Situational relevance: the information object is useful in resolving the problematic situation

6 What IR Systems Try to Do
Predict, on the basis of some information about the user and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with at any particular time

7 How IR Systems Try to Do This
Represent the user’s information problem (the query)
Represent (surrogate) and organize (classify) the contents of the knowledge resource
Compare query to surrogates (predict relevance)
Present results to the user for interaction/judgment

8 How IR Differs from DBMS
No “right” answer
Probabilistic (predictive), not determinative
Unstructured, or only partially structured, information (e.g. text, images)

9 Why IR is Difficult
People cannot specify what they don’t know (Anomalous State of Knowledge), so representation of the information problem is inherently uncertain
Information objects can be about many things, so representation of aboutness is inherently incomplete

10 Why IR is Difficult (2)
Relevance is a relation between the person and the information object(s), and depends on the user’s interpretation, so prediction of relevance (or appropriateness) is inherently uncertain

11 Evaluation of IR Systems
Traditional goal of IR is to retrieve all and only the relevant IOs (information objects) in response to a query
“All” is measured by recall: the proportion of relevant IOs in the collection which are retrieved
“Only” is measured by precision: the proportion of retrieved IOs which are relevant
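The two measures on this slide can be sketched in a few lines. This is a minimal illustration, assuming binary relevance judgments and set-based retrieval; the function name `precision_recall` and the document identifiers are hypothetical, not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers of the IOs the system returned
    relevant:  identifiers of the IOs judged relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant IOs that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 IOs retrieved, 3 relevant in the collection, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3: half of what was retrieved is relevant, and two thirds of what is relevant was retrieved.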

12 Other Functions of IR Systems
IR is concerned not only with supporting “specified searching”
People engage in many kinds of interactions with IR systems, e.g. “browsing”, “evaluating”, “comparing”, “extracting”
People have many different IR-related tasks, e.g. question-answering, finding one or a few “good” IOs, constructing a “useful” portal

13 Other Evaluation Measures
To evaluate IR support for different tasks, different measures are required
Relevance may not be the only criterion according to which measures are constructed
Support for different kinds of behaviors may require different kinds of measures

14 Evaluation of What?
Effectiveness
  – recall, precision, accuracy of answer, “satisfaction”
Usability
  – learnability, error rates
Performance
  – time, cognitive effort

15 Evaluation Problems
Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations
Evaluating interactive IR requires human subjects; the normal mode of evaluation is comparison between two systems (no gold standard or benchmarks); cannot compare a subject’s searching on the same task in two systems
Major tradeoffs between number of subjects and number of tasks, and between realism and control

16 A Traditional View of IR (you’ll see this again)

17 IR as Support for Interaction with Information
[Figure: diagram whose labels appear three times along a “Time” axis. Labels: USER (goals, tasks, knowledge, problem, uses); INTERACTION (judgment, use, search, interpretation, modification); INFORMATION (type, medium, mode, level); COMPARISON, REPRESENTATION, PRESENTATION, VISUALIZATION, NAVIGATION. Framing label: Overall goals, environment, situation.]

18 The User as the Central Actor in the IR System
The goal of IR is to help the user resolve the problematic situation
This is done by supporting interaction with appropriate IOs
The user is the only actor in the system that can judge appropriateness
The user’s interactions determine the type of support provided

19 Interaction as the Central Process of IR
Accepting the user as the central actor implies accepting the user’s interactions with information as the central process
All other IR processes can be interpreted as being in support of the user’s current (or future) interactions with information
This suggests specific IR system design choices and problems

20 How Interaction Has Been Accounted For
Relevance feedback
  – Automatically moving the initial query toward the “ideal” query
  – Term reweighting and query expansion
Support for query modification
  – Display of “good” and “bad” terms
  – Thesauri
  – Inter-document relations
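One common way to move a query toward the “ideal” query is the Rocchio formula, which the slide does not name; it is used here only as an illustration of term reweighting and query expansion. The sketch assumes queries and documents are term-to-weight dictionaries, and the parameter values `alpha`, `beta`, `gamma` are illustrative defaults, not values from the slides.

```python
def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted, expanded query as a term -> weight dict.

    query and each document are dicts mapping term -> weight.
    """
    terms = set(query)
    for doc in relevant_docs + nonrelevant_docs:
        terms.update(doc)

    def centroid(docs, term):
        # Mean weight of `term` over a set of judged documents.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * centroid(relevant_docs, t)
             - gamma * centroid(nonrelevant_docs, t))
        if w > 0:  # negative weights are dropped, a common simplification
            new_query[t] = w
    return new_query


# Hypothetical judgments: two IOs marked relevant, one marked non-relevant.
q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "habitat": 1.0}]
nonrel = [{"jaguar": 1.0, "car": 1.0}]
new_q = rocchio(q, rel, nonrel)
print(sorted(new_q.items()))
```

The original term is reweighted, terms from the relevant IOs (“cat”, “habitat”) are added, and “car”, which occurs only in the non-relevant IO, receives a negative weight and is dropped.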

21 Personalization in IR
Taking account of user goals, situation, context for
  – tailoring the interaction
  – tailoring the retrieval results
TREC HARD track is a first attempt at evaluating use of context

22 IR Models
Exact match models
  – String matching
  – Boolean
Best match (partial match) models
  – Vector space
  – Probabilistic
  – Logic (plausible inference)
  – Language modeling

23 Exact Match IR
Goal of exact match (EM) IR is to retrieve the set of information objects which match the user’s query specification
Assumptions of EM IR
  – IOs are completely representable
  – Information problems are specifiable
  – Relevance is determinable and binary

24 Exact Match IR
Retrieves IOs that contain a specified string or Boolean combination of strings
Supported by inverted file organization (or signatures)
Enhanced by wild-cards, proximity searching
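The inverted file organization mentioned above can be sketched as a mapping from each term to the set of IOs containing it, with Boolean operators implemented as set operations. This is a minimal sketch under simplifying assumptions (whitespace tokenization, lowercasing, no wild-cards or proximity); the function names and the tiny collection are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Ids of documents containing every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Ids of documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))


docs = {
    1: "information retrieval systems",
    2: "database management systems",
    3: "interactive information seeking",
}
index = build_inverted_index(docs)
print(boolean_and(index, "information", "systems"))
print(boolean_or(index, "retrieval", "database"))
```

The result of each query is an unordered set: an IO either matches or it does not, so there is nothing to rank by.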

25 Exact Match IR
Advantages
  – Efficient
  – Boolean queries capture some aspects of information problem structure
Disadvantages
  – Not effective
  – Difficult to write effective queries
  – No inherent document ranking

26 Best Match IR
All types are based on the assumption that IR is an uncertain process
Models differ by what they ascribe the uncertainty to, and by how they respond to that uncertainty

27 Vector Space IR
Words represent concepts or topics
These can be construed as dimensions of a “concept space”
IOs are about the topics represented by their words
IOs can be represented as vectors in the concept space
Queries can be specified and represented as are IOs

28 Vector Space IR
Goal of IR is to present the user with IOs most similar to query, in order of similarity
Similarity is defined as closeness in the concept (vector) space
Uncertainty in IR is in the degree of match between IO and query, and arises from uncertainty in the representation of each
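Ranking by closeness in the vector space can be sketched with raw term-frequency vectors and cosine similarity. Both choices are assumptions made for illustration; the slides do not fix a weighting scheme or a similarity function, and real systems typically use weights such as tf-idf. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a bag of words: term -> raw frequency."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (similarity, doc_id) pairs, most similar first."""
    q = tf_vector(query)  # the query is represented exactly as the IOs are
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)


docs = {
    "d1": "interactive information retrieval",
    "d2": "database management systems",
    "d3": "information retrieval evaluation with users",
}
ranking = rank("information retrieval", docs)
for similarity, doc_id in ranking:
    print(doc_id, round(similarity, 3))
```

Both d1 and d3 contain the two query terms, but d1 ranks first because the query terms make up a larger share of its vector; d2 shares no terms and scores zero.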

29 Vector Space Model
Advantages
  – Straightforward ranking
  – Simple query formulation (bag of words)
  – Intuitively appealing
  – Effective
Disadvantages
  – Unstructured queries
  – Effective calculations and parameters must be empirically determined

30 Probabilistic Model
Uncertainty in IR arises from uncertainty in the relevance relationship, in the representation of the information problem, and in the representation of IOs
Result of these uncertainties can be represented as probabilities of relevance of an IO to an information problem, given the available evidence

31 Probabilistic IR
Goal of IR is to present to the user the IOs in order of their probability of relevance to the information problem (the Probability Ranking Principle)

32 Probabilistic IR
Advantages
  – Straightforward relevance ranking
  – Simple query formulation
  – Sound mathematical/theoretical model
  – Effective
Disadvantages
  – Unrealistic assumptions (term independence)
  – Probabilities difficult to estimate
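As one possible illustration of ranking by probability of relevance, the sketch below uses a binary independence model with log-odds term weights. This particular model is an assumption chosen for the example; the slides do not commit to it. The probability estimates are invented for illustration, which is exactly the part the slide flags as difficult in practice.

```python
import math


def term_weight(p, q):
    """Log-odds weight of a term.

    p: estimated P(term present | IO is relevant)
    q: estimated P(term present | IO is not relevant)
    """
    return math.log((p * (1 - q)) / (q * (1 - p)))


def score(doc_terms, query_terms, estimates):
    """Sum the weights of query terms present in the IO.

    Adding per-term weights relies on the term independence assumption.
    """
    return sum(term_weight(*estimates[t])
               for t in query_terms
               if t in doc_terms and t in estimates)


# Hypothetical estimates (p, q) for each query term.
estimates = {"retrieval": (0.8, 0.2), "information": (0.6, 0.5)}
bim_docs = {
    "d1": {"information", "retrieval", "evaluation"},
    "d2": {"information", "theory"},
    "d3": {"database", "systems"},
}
query = ["information", "retrieval"]
bim_ranking = sorted(bim_docs, key=lambda d: score(bim_docs[d], query, estimates), reverse=True)
print(bim_ranking)
```

Under these assumptions, presenting IOs in decreasing order of score is equivalent to presenting them in decreasing order of estimated probability of relevance. “retrieval” discriminates strongly between relevant and non-relevant IOs in the invented estimates, so it dominates the ranking.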

33 Plausible Inference IR
Uncertainty in IR arises from uncertainty in the relevance relationship, in the representation of the information problem, and in the representation of IOs
This implies that IR can be no more than a process of plausible inference of relevance of an IO to an information problem

34 Plausible Inference IR
In the logical implicature version, the IO and the information problem should be represented in a logical formalism which allows plausible inference
In the multiple sources of evidence version, as much evidence as possible about the relationship between the IO and the information problem should be used to estimate probability of relevance (induction)

35 Plausible Inference IR
In the logic version, the goal of IR is to present to the user those IOs from which the query is most plausibly inferred, in order of plausibility
In the sources of evidence version, the goal of IR is to present to the user those IOs which are believed most likely to be relevant, in order of strength of belief

36 Plausible Inference IR
Advantages
  – Relevance ranking
  – Strong formalisms
  – Structured queries possible
  – Effective (multiple sources of evidence)
Disadvantages
  – Complex, difficult to implement
  – Weight for evidence empirically determined

37 Language Modeling for IR
Assumes that IOs and expressions of information problems are of the same type
Uncertainty in IR is due to uncertainty in representations of IOs and information problems
Goal is to present to the user IOs in order of the probability of the IO being generated by the language model of the information problem (or vice versa), or by the similarity of the language model of the IO to that of the information problem

38 Language Modeling for IR
Most common type is statistical unigram model, based on observed word frequencies, smoothed in various ways
The Kullback-Leibler distance is a measure of the distance between two probability distributions:
$$KL(\{p_i\},\{q_i\}) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
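A smoothed unigram model and the Kullback-Leibler measure can be sketched as follows. The smoothing used here is linear interpolation with a collection-wide background model (often called Jelinek-Mercer smoothing), chosen only as one of the “various ways”; the mixing weight `lam`, the function names, and the toy collection are assumptions for illustration. The sketch also assumes every query term occurs somewhere in the collection.

```python
import math
from collections import Counter


def unigram_model(tokens, background, lam=0.5):
    """Smoothed unigram model over the background vocabulary.

    Mixes the text's relative frequencies with the background probabilities.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (1 - lam) * counts[t] / total + lam * background[t] for t in background}


def kl_divergence(p, q):
    """KL({p_i}, {q_i}) = sum_i p_i * log2(p_i / q_i), skipping zero p_i."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)


def query_log_likelihood(query_tokens, model):
    """Log2 probability of the query being generated by the model."""
    return sum(math.log2(model[t]) for t in query_tokens)


lm_docs = {
    "d1": "information retrieval is interactive".split(),
    "d2": "language models assign probabilities".split(),
}
all_tokens = [t for tokens in lm_docs.values() for t in tokens]
background = {t: c / len(all_tokens) for t, c in Counter(all_tokens).items()}
models = {d: unigram_model(tokens, background) for d, tokens in lm_docs.items()}

query = "information retrieval".split()
best = max(models, key=lambda d: query_log_likelihood(query, models[d]))
print(best, kl_divergence(models["d1"], models["d2"]))
```

Smoothing is what keeps d2 from assigning zero probability to the query terms it never contains. Note that the KL measure is not symmetric in general, so the order of its two arguments matters.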

39 Advantages of Language Modeling
Attempts to do away with the concept of relevance
Computationally tractable, intuitively appealing

40 Problems with Language Modeling
Assumption of equivalence between IO and information problem representation is unrealistic
Very simple models of language
Choosing a method of smoothing is difficult and, in general, ad hoc

41 Problems in Best Match IR
For most best match IR models to work well, queries should be long
  – bag of words approach depends upon many words in order to disambiguate meaning
Reasons for retrieval and ranking are not easily understood

42 Overcoming Problems in Best Match IR
Enhance short queries through query expansion based on pseudo-relevance feedback or other methods
Default exact match searching for short queries
Encourage longer queries/problem statements through interface design
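Query expansion based on pseudo-relevance feedback can be sketched as: run the short query, assume the top-ranked IOs are relevant without asking the user, and add their most frequent new terms to the query. The selection rule (raw frequency), the cut-offs `k` and `n_terms`, and the sample ranking are assumptions made for illustration.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, k=2, n_terms=2):
    """Expand a query using the top-k ranked documents.

    ranked_docs: token lists, best-ranked first, from an initial retrieval run.
    The top k are treated as relevant without any user judgment.
    """
    counts = Counter()
    for tokens in ranked_docs[:k]:
        counts.update(t for t in tokens if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion


# Hypothetical initial ranking for the one-word query "jaguar".
ranked = [
    ["jaguar", "cat", "habitat", "cat"],
    ["jaguar", "cat", "prey"],
    ["jaguar", "car", "dealer"],
]
expanded = expand_query(["jaguar"], ranked, k=2, n_terms=1)
print(expanded)
```

The added term helps disambiguate the one-word query, but only if the top-ranked IOs really were relevant; if the initial ranking is poor, expansion can push the query in the wrong direction.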

43 Some Takeaway Messages
IR supports a human activity
IR is inherently interactive, and the IR system inevitably involves the user as the central actor
Representation and comparison techniques for text-based IR seem to have plateaued
Improved IR will come from improved support for all types of interactions with information, and especially with personalization
Big research issue: how to represent and use situation and context

