
Slide 1: Basics of Information Retrieval
Steven O. Kimbrough, 3 August 1996 (IR960803.ppt)

Slide 2: The Information Retrieval Problem: Executive Summary, 1
The basic information retrieval (IR) problem
– Given a task at hand and a collection of documents, find all information relevant to the task
The IR problem is
– Universally present
– Often not recognized at all
– Often incorrectly thought to be solved
– Fundamentally very difficult
Good solutions
– Require ranked retrieval
– Do not require user knowledge of all the relevant search terms

Slide 3: The Information Retrieval Problem: Executive Summary, 2
Improved search engines/methods are becoming available
– Web search engines, Verity, Lotus Notes, etc.
Little is known about the performance of standard methods
– And what IS known is very discouraging
Less is known about newer methods
The Internet exacerbates the problem--and is delivering some newer methods
Expect good IR to be on the agenda for organizations wanting to operate at peak efficiency

Slide 4: Example of IR at Work: Litigation Support
A large construction company is being sued over performance on a major project
– Find all documents relevant to defending against the suit
Assume
– All 50,000+ documents are on line
– Unlimited availability of a powerful full-text retrieval system
» Search on given words
» Search on Boolean combinations of words
» Proximity searching
– Lawyers are experienced, highly skilled and incented, and are supported by first-rate assistants
What percentage of the relevant documents are actually found?

Slide 5: The Information Retrieval Problem: Basics of IR, 1
The corporate memory problem
– Information pertinent to the task at hand has passed through the organization.
» Was this information captured?
» If so, can it be effectively retrieved and brought to bear on the present task?
» If so, can it be understood and interpreted?
The information retrieval (IR) problem
– Narrower than corporate memory
» Focused on documents
– Broader than corporate memory
» Individuals and organizations
Our interest: the broadest sense(s)
– “the IR problem” or “the corporate memory problem” or “the organizational memory problem”

Slide 6: The Information Retrieval Problem: Basics of IR, 2
The IR problem is very hard
– And will remain so
Why? Many reasons, including:
– Documents are not (very) structured
» Compare: database searches vs document base searches
– Language is not (very) coöperative
» DNA: microbiology or Digital Equipment Corporation’s Network Architecture?
» free rider: game theory or urban transportation systems?
» corporate memory or organizational memory?
Physical access vs logical access
– Physical: relatively easy
– Logical: terribly difficult
– Internet?

Slide 7: The Information Retrieval Problem: Basics of IR, 3
Kinds of information searches
– Framework from David Blair, “Search Exhaustivity and Data Base Size as a Framework for Text Retrieval Systems”
Distinctions
– Large vs small (document) data bases
– Exhaustive vs sample searches
– Content vs context searches
And... so...

Slide 8: The Information Retrieval Problem: Basics of IR, 4
Resulting framework of searches, ordered by difficulty
– Large, exhaustive, content
– Large, exhaustive, context
– Large, sample, content
– Large, sample, context
– Small, exhaustive, content
– Small, exhaustive, context
– Small, sample, content
– Small, sample, context

Slide 9: The Information Retrieval Problem: Basic IR Technology, 1
Recall: IR is a hard problem
– And it is a real problem in real organizations for real people
Examples from Blair
– “The Management of Information: Basic Distinctions”
Example from the Coast Guard
– We know we have the document, but have no idea where it is
» Retrieve from the Congressional Record

Slide 10: The Information Retrieval Problem: Basic IR Technology, 2
Your basic IR technology
– Full text or keyword retrieval, with
– Boolean combinations and
– Location indicators
Full text--has everything
– Or does it?
Keyword indexing
– Requires work
Boolean combination of words
– Usual boolean operators: AND, OR, NOT
– This is a logically complete set

Slide 11: The Information Retrieval Problem: Basic IR Technology, 3
Examples of boolean search (see the sketch after this slide)
– computer AND network
– (corporate OR organizational) AND memory
Boolean formulæ:
– Natural for many queries
– Fundamentally difficult for many people
» Disjunctive normal form
» Conjunctive normal form
– Fundamentally deterministic
» And that’s a problem
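The slide's two example queries reduce to set operations over an inverted index. A minimal runnable sketch follows; the three documents and their contents are made up for illustration.

```python
# Minimal boolean retrieval sketch over a hypothetical mini-corpus.
docs = {
    1: "the computer network failed",
    2: "corporate memory and organizational learning",
    3: "organizational memory in the computer age",
}

# Inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# computer AND network  -> set intersection
print(index["computer"] & index["network"])                               # {1}
# (corporate OR organizational) AND memory  -> union, then intersection
print((index["corporate"] | index["organizational"]) & index["memory"])  # {2, 3}
# memory AND NOT corporate  -> difference against the whole collection
print(index["memory"] - index["corporate"])                              # {3}
```

Note that every such query is deterministic: a document either satisfies the formula or it does not, with no ranking among the matches, which is exactly the problem the slide flags.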

Slide 12: The Information Retrieval Problem: Basic IR Technology, 4
Recall vs precision
– Everything: U
– What you want: A + B
– What you get: B + C
– Recall: B/(A + B)
– Precision: B/(B + C)
[Venn diagram: the universe U, with the wanted set A + B overlapping the retrieved set B + C in region B]
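In code, the two measures are just set ratios. The document-id sets here are hypothetical:

```python
# Recall and precision from the Venn diagram's regions.
relevant  = {1, 2, 3, 4, 5}    # A + B: what you want
retrieved = {4, 5, 6, 7}       # B + C: what you get
hits = relevant & retrieved    # B: relevant documents actually retrieved

recall    = len(hits) / len(relevant)    # B / (A + B)
precision = len(hits) / len(retrieved)   # B / (B + C)
print(recall, precision)                 # 0.4 0.5
```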

Slide 13: The Information Retrieval Problem: Basic IR Technology, 5
Recall our 8-way framework
– Large, exhaustive, content
– Large, exhaustive, context
– Large, sample, content
– Large, sample, context
– Small, exhaustive, content
– Small, exhaustive, context
– Small, sample, content
– Small, sample, context
When and where and how does the recall vs precision distinction matter?

Slide 14: The Information Retrieval Problem: Basic IR Technology, 6
So, how well does full text retrieval work? Hard to tell
– Recall the recall vs precision diagram
– How do we find A?
– Few good studies
– The Blair & Maron STAIRS study (1985), “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System”, is about the best
STAIRS study
– BART
– Litigation support
– Results: bad news

Slide 15: The Information Retrieval Problem: IR Theory
Why should IR be such a difficult problem?
“We all know why we’re here.”
Zipf word distributions (illustrated in the sketch after this slide)
Scale is the problem
Concept: futility point(s)
Demise of the library model
Collection partitioning
IR as communication
Importance of context
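Zipf's observation, that a word's frequency is roughly inversely proportional to its frequency rank, is easy to check on any sizable text. A minimal sketch (the file name corpus.txt is a placeholder for whatever collection is at hand):

```python
from collections import Counter

# Zipf check: if frequency ~ 1/rank, then rank * frequency
# should stay roughly constant down the ranked word list.
words = open("corpus.txt").read().lower().split()
for rank, (word, count) in enumerate(Counter(words).most_common(10), start=1):
    print(f"{rank:>4}  {word:<15} {count:>8} {rank * count:>10}")
```

The practical upshot for IR: a few words appear everywhere (and so discriminate nothing), while most words are vanishingly rare, and this skew persists at every collection size.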

Slide 16: The Information Retrieval Problem: IR Theory, Example
“Operated on this morning. Diagnosis not yet complete but results seem satisfactory and already exceed expectations.”

Slide 17: The Information Retrieval Problem: Requirements for IR Systems
Standard features
– Full text or keyword searches
– Boolean searches
– Partial matches
– Positional searches
Ranked retrieval
– Note: distinguished from database queries
Sensitivity to semantic latency
Context sensitivity
Ability to exploit partial structuring
– SGML, time lines, causal models, etc.

Slide 18: The Information Retrieval Problem: Non-Basic IR Technologies
Ranking algorithms
Latent semantic indexing
Genetic searching
Faceted indexing
...for now, ranking algorithms only

Slide 19: Retrieval Algorithms/Approaches
Three categories of approach:
– Boolean
– Vector space
– Probabilistic
Boolean
– Find matches on words and combinations of words
– The standard approach
– Known not to work well
Vector space
– Using the indexing, place documents and queries in a hyperspace and measure distance (see the sketch after this slide)
– Example: DCB algorithm (follows)
Probabilistic
– Like the vector space approach, but impose a probability distribution on the elements in the space
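A minimal sketch of the vector space idea: documents and the query become keyword-count vectors, and "distance" is measured by cosine similarity. The three-term space and the vectors are hypothetical, and this illustrates the general approach, not the DCB algorithm itself.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction in word space."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Hypothetical keyword space: [network, memory, corporate]
doc_vectors = {"d1": [2, 0, 0], "d2": [0, 1, 1], "d3": [1, 1, 0]}
query = [1, 1, 0]  # query terms: network, memory

ranked = sorted(doc_vectors, key=lambda d: cosine(doc_vectors[d], query), reverse=True)
print(ranked)  # ['d3', 'd1', 'd2']
```

Unlike a boolean match, every document gets a score, so the output is a ranking rather than a yes/no partition of the collection.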

Slide 20: The Information Retrieval Problem: The DCB Ranking Algorithm, 1
Needed for good IR:
– Ranked retrieval
– Sensitivity to semantic latency
– (and other things)
DCB: A vector space approach
– Ranking based on location of documents in “word space”
How does it work?

Slide 21: The Information Retrieval Problem: The DCB Ranking Algorithm, 2
Consider K, an array of 1s and 0s
– Rows: keywords
– Columns: documents
– Entries
» 1 if the document has the keyword
» 0 else
Example, to illustrate:
[matrix figure not reproduced in the transcript]
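Since the slide's example matrix did not survive the transcript, the K used in the sketches that follow is a made-up stand-in: three keywords by four documents.

```python
import numpy as np

# Hypothetical K: rows = keywords, columns = documents,
# K[i][j] = 1 if document j contains keyword i, else 0.
K = np.array([
    [1, 0, 1, 0],   # keyword 1 appears in documents 1 and 3
    [1, 1, 0, 0],   # keyword 2 appears in documents 1 and 2
    [0, 1, 1, 1],   # keyword 3 appears in documents 2, 3 and 4
])
```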

Slide 22: The Information Retrieval Problem: The DCB Ranking Algorithm, 3
Obtain L by multiplying K by its transpose:
[matrix figure not reproduced in the transcript]
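Continuing with the hypothetical K above:

```python
# L = K K^T is keyword-by-keyword: L[i][k] counts the documents
# in which keywords i and k co-occur (the diagonal is each
# keyword's document frequency).
L = K @ K.T
print(L)
# [[2 1 1]
#  [1 2 1]
#  [1 1 3]]
```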

Slide 23: The Information Retrieval Problem: The DCB Ranking Algorithm, 4
Obtain M by multiplying L by K:
[matrix figure not reproduced in the transcript]
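And the final product, again on the hypothetical K:

```python
# M = L K is keyword-by-document: a document is credited not only
# for containing a keyword but for containing keywords that
# co-occur with it elsewhere in the collection.
M = L @ K
print(M)
# [[3 2 3 1]
#  [3 3 2 1]
#  [2 4 4 3]]
```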

Slide 24: The Information Retrieval Problem: The DCB Ranking Algorithm, 5
Intuition: Look at how the top-left element of M is obtained (by multiplying the top row of L by the left-most column of K):
[matrix figure not reproduced in the transcript]
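Written out from the definitions of L and M above, the general element makes the intuition explicit:

$$M_{ij} \;=\; \sum_{k} L_{ik}\,K_{kj} \;=\; \sum_{k} \Big( \sum_{d} K_{id}\,K_{kd} \Big) K_{kj}$$

That is, keyword i's score for document j sums, over every keyword k that document j contains, the number of documents in which i and k co-occur. A document can therefore score well against a query term it never mentions, provided it is full of that term's usual company of words, which is one reading of the "sensitivity to semantic latency" claimed on slide 20.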

Slide 25: The Information Retrieval Problem: The DCB Ranking Algorithm, 6
The DCB algorithm for document retrieval ranks the documents in an (ex ante) plausible manner.
Does it actually produce good rankings?
– From our experience to date, yes
– Experimental studies are much needed
» Initial study on Laughlin photos is very encouraging (cf. Hoque et al. 1995)
And there is more....

Slide 26: The Information Retrieval Problem: The DCB Ranking Algorithm, 7: Resource Location
Thinking more generally, the DCB algorithm for document retrieval ranks the documents by a sort of similarity of association
K is a matrix of primary links (keywords to documents)
DCB measures overall association for these primary links
K need not be just keywords to documents
Think of K as generally indicating primary links (1s) between individual objects, e.g.
– people
– meetings
– issues

Slide 27: The Information Retrieval Problem: The DCB Ranking Algorithm, 8
Think of K as generally indicating primary links (1s and 0s) between individual objects, e.g.
– people
– meetings
– issues
– museum artifacts
– keywords
Then K will (typically) be square
But the DCB algorithm works the same way
Interpretation of M is essentially the same
L has a useful interpretation as well
Now an example

Slide 28: The Information Retrieval Problem: The DCB Ranking Algorithm, 9
A resource location (mini)example
Interpretation of K
– K(1): a person
– K(2): a person
– K(3): an issue
– K(4): an issue
– K(5): a meeting
– K(6): a meeting
The problem (Coast Guard):
– Given a particular letter of inquiry, which CG employees have useful knowledge for the question at hand?
Idea:
– The question identifies an issue. Find the employees most closely associated with that issue.

Slide 29: The Information Retrieval Problem: The DCB Ranking Algorithm, 10
A new K:
[matrix figure not reproduced in the transcript]

Slide 30: The Information Retrieval Problem: The DCB Ranking Algorithm, 11
A new L:
[matrix figure not reproduced in the transcript]

Slide 31: The Information Retrieval Problem: The DCB Ranking Algorithm, 12
And a new M:
[matrix figure not reproduced in the transcript; rows labeled Person, columns labeled Issue]
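The three matrix figures for slides 29-31 are not in the transcript, so the following is a hypothetical reconstruction of the six-object setup from slide 28. The specific link pattern (who raised which issue, who attended which meeting) is invented for illustration; K is now square and symmetric, with a 1 marking a direct link between two objects.

```python
import numpy as np

labels = ["person1", "person2", "issue1", "issue2", "meeting1", "meeting2"]
# Hypothetical primary links between the six objects.
K = np.array([
    #         p1 p2 i1 i2 m1 m2
    [0, 0, 1, 0, 1, 0],   # person1: works on issue1, attended meeting1
    [0, 0, 0, 1, 1, 1],   # person2: works on issue2, attended both meetings
    [1, 0, 0, 0, 1, 0],   # issue1: discussed at meeting1
    [0, 1, 0, 0, 0, 1],   # issue2: discussed at meeting2
    [1, 1, 1, 0, 0, 0],   # meeting1
    [0, 1, 0, 1, 0, 0],   # meeting2
])

L = K @ K.T   # how many direct links each pair of objects shares
M = L @ K     # overall association scores

# The person x issue block of M answers the Coast Guard question:
# who is most closely associated with the issue a letter raises?
i = labels.index("issue1")
for p in (0, 1):
    print(labels[p], M[p, i])   # person1 3, person2 1
```

As expected, person1 (directly linked to issue1 and to the meeting where it was discussed) outranks person2, whose only tie to issue1 runs through shared meetings.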

Slide 32: The Information Retrieval Problem: The DCB Ranking Algorithm, 13
DCB
– Ranked retrieval by similarity of association
– Works for documents
– Works for arbitrary associations between arbitrary objects
Future work
– Much is required to explore DCB systematically
– Experiments
– Tweaks
– Comparison with alternatives
– &c.

Slide 33: Experimental IR Test: PIRS and Laughlin
PIRS: Picture Indexing and Retrieval System
– Developed as part of the Coast Guard KSS project
Ranking based on
– DCB applied to
– Text associated with pictures
Clarence Laughlin archive
– The Historic New Orleans Collection
– Photographer and writer
Test
– 390 photos plus Collection records on them
– Parse and feed to DCB algorithm
Very promising results
– Given 3 ranked photos, 83% of subjects agreed on 2 or 3 of the implicit 3 pairwise rankings (50% expected)

Slide 34: The Information Retrieval Problem: Representative Commercial Products
grep et al.
Verity--Topic
Excalibur
Lotus Notes
AppleSearch
Web search engines
– Yahoo
– Lycos
– WebCrawler
– etc.

Slide 35: The Information Retrieval Problem: Commercial Products
Ranking often is provided
– But usually in a very limited way
– AppleSearch: *****, ****, ..., *
Retrieval and ranking algorithms typically not disclosed
– Be wary
Quality of retrieval not known
– And when known, not disclosed
Some systems require extensive maintenance and “hand holding”
Batch update of indexing is the rule

Slide 36: Useful References
David Blair, “Search Exhaustivity and Data Base Size as a Framework for Text Retrieval Systems (or, All You Wanted to Know about Document Retrieval but Were Afraid to Ask)”
David Blair (1984). “The Management of Information: Basic Distinctions”
David Blair and M. E. Maron (1985). “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System”
David Blair and Steven O. Kimbrough (1994). “Exemplary Documents”
Michael D. Gordon and Robert K. Lindsay (1994). “Toward Discovery Support Systems: A Replication, Re-examination, and Extension of Swanson’s Work on Literature-Based Discovery of a Connection Between Raynaud’s and Fish Oil”
Abeer Y. Hoque et al. (1995). “Report on an Experiment on Picture Retrieval Using the DCB Algorithm, the PIRS Software, and Pictures from the Clarence Laughlin Archives at The Historic New Orleans Collection”
Steven O. Kimbrough, Stephen E. Kirk, and Jim R. Oliver (1995). “On Relevance and Two Aspects of the Organizational Memory Problem”

Slide 37: This page unintentionally left blank

