2003.10.16 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003

1 2003.10.16 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003 http://www.sims.berkeley.edu/academics/courses/is202/f03/ SIMS 202: Information Organization and Retrieval Lecture 16: Intro to Information Retrieval

2 2003.10.16 - SLIDE 2IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

3 2003.10.16 - SLIDE 3IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

4 2003.10.16 - SLIDE 4IS 202 – FALL 2003 Review: Information Overload “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman) “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
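The two figures in the Varian & Lyman quote can be checked against each other. The sketch below assumes decimal units (1 GB = 1000 MB), which the slide does not state; under that assumption the quote implies a world population of about six billion.

```python
# Figures quoted on the slide (Varian & Lyman).
total_gigabytes = 1.5e9       # yearly production of content, in gigabytes
per_person_megabytes = 250    # quoted share per person

# Assumption (not stated on the slide): decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000

total_megabytes = total_gigabytes * MB_PER_GB
implied_population = total_megabytes / per_person_megabytes

print(f"{implied_population:.1e} people")
```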

5 2003.10.16 - SLIDE 5IS 202 – FALL 2003 Course Outline Organization –Overview –Categorization –Knowledge Representation –Metadata Introduction –Controlled Vocabularies Introduction –Thesaurus Design and Construction –Multimedia Information Organization and Retrieval –Metadata for Media –Database Design –XML Retrieval –Introduction to Search Process –Boolean Queries and Text Processing –Statistical Properties of Text and Vector Representation –Probabilistic Ranking and Relevance Feedback –Evaluation –Web Search Issues and Architecture –Interfaces for Information Retrieval

6 2003.10.16 - SLIDE 6IS 202 – FALL 2003 Key Issues In This Course How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them –Organizing How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs –Retrieving

7 2003.10.16 - SLIDE 7IS 202 – FALL 2003 Key Issues Creation Utilization Searching Active Inactive Semi-Active Retention/ Mining Disposition Discard Using Creating Authoring Modifying Organizing Indexing Storing Retrieval Distribution Networking Accessing Filtering

8 2003.10.16 - SLIDE 8IS 202 – FALL 2003 Modern IR Textbook Topics

9 2003.10.16 - SLIDE 9IS 202 – FALL 2003 More Detailed View

10 2003.10.16 - SLIDE 10IS 202 – FALL 2003 What We’ll Cover A Lot A Little

11 2003.10.16 - SLIDE 11IS 202 – FALL 2003 IR Topics for 202 The Search Process Information Retrieval Models –Boolean, Vector, and Probabilistic Content Analysis/Zipf Distributions Evaluation of IR Systems –Precision/Recall –Relevance –User Studies Web-Specific Issues User Interface Issues Special Kinds of Search

12 2003.10.16 - SLIDE 12IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

13 2003.10.16 - SLIDE 13IS 202 – FALL 2003 The Standard Retrieval Interaction Model

14 2003.10.16 - SLIDE 14IS 202 – FALL 2003 Standard Model of IR Assumptions: –The goal is maximizing precision and recall simultaneously –The information need remains static –The value is in the resulting document set
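The slide names precision and recall without defining them. As a minimal sketch using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved), with made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

The two measures pull against each other: returning the whole collection pushes recall to 1 while precision collapses, which is why the standard model speaks of maximizing both at once.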

15 2003.10.16 - SLIDE 15IS 202 – FALL 2003 Problems with Standard Model Users learn during the search process: –Scanning titles of retrieved documents –Reading retrieved documents –Viewing lists of related topics/thesaurus terms –Navigating hyperlinks Some users don’t like long (apparently) disorganized lists of documents

16 2003.10.16 - SLIDE 16IS 202 – FALL 2003 IR is an Iterative Process Repositories Workspace Goals

17 2003.10.16 - SLIDE 17IS 202 – FALL 2003 IR is a Dialog The exchange doesn’t end with first answer Users can recognize elements of a useful answer, even when incomplete Questions and understanding change as the process continues

18 2003.10.16 - SLIDE 18IS 202 – FALL 2003 Bates’ “Berry-Picking” Model Standard IR model –Assumes the information need remains the same throughout the search process Berry-picking model –Interesting information is scattered like berries among bushes –The query is continually shifting

19 2003.10.16 - SLIDE 19IS 202 – FALL 2003 Berry-Picking Model Q0 Q1 Q2 Q3 Q4 Q5 A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

20 2003.10.16 - SLIDE 20IS 202 – FALL 2003 Berry-Picking Model (cont.) The query is continually shifting New information may yield new ideas and new directions The information need –Is not satisfied by a single, final retrieved set –Is satisfied by a series of selections and bits of information found along the way
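One way to picture the berry-picking loop is as a search whose query changes after every step and whose output is the accumulated selections rather than a single final result set. The sketch below is illustrative only: `berry_pick`, `TOY_INDEX`, and the three helper functions are hypothetical names, not part of Bates' model.

```python
def berry_pick(initial_query, search, select, reformulate, max_steps=5):
    """Run an evolving search and return everything kept along the way."""
    basket = []                                  # selections accumulated across all queries
    query = initial_query
    for _ in range(max_steps):
        results = search(query)
        picked = select(results, basket)
        basket.extend(picked)
        new_query = reformulate(query, picked)   # new information may shift the query
        if new_query is None or new_query == query:
            break
        query = new_query
    return basket


# Toy data: each query returns (document, idea it suggests) pairs.
TOY_INDEX = {
    "online search": [("doc-a", "search tactics"), ("doc-b", None)],
    "search tactics": [("doc-c", "browsing")],
    "browsing": [("doc-d", None)],
}

def search(query):
    return TOY_INDEX.get(query, [])

def select(results, basket):
    # Keep every result that is not already in the basket.
    return [r for r in results if r not in basket]

def reformulate(query, picked):
    # Follow the first new idea suggested by a picked document, if any.
    for _doc, idea in picked:
        if idea is not None:
            return idea
    return None

basket = berry_pick("online search", search, select, reformulate)
docs = [doc for doc, _idea in basket]
print(docs)
```

Here the need is met by four documents gathered over three different queries, none of which would have retrieved all four on its own.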

21 2003.10.16 - SLIDE 21IS 202 – FALL 2003 Information Seeking Behavior Two parts of a process: –Search and retrieval –Analysis and synthesis of search results This is a fuzzy area –We will look (briefly) at some different working theories

22 2003.10.16 - SLIDE 22IS 202 – FALL 2003 Search Tactics and Strategies Search Tactics –Bates 1979 Search Strategies –Bates 1989 –O’Day and Jeffries 1993

23 2003.10.16 - SLIDE 23IS 202 – FALL 2003 Tactics vs. Strategies Tactic: short term goals and maneuvers –Operators, actions Strategy: overall planning –Link a sequence of operators together to achieve some end

24 2003.10.16 - SLIDE 24IS 202 – FALL 2003 Information Search Tactics Monitoring tactics –Keep search on track Source-level tactics –Navigate to and within sources Term and Search Formulation tactics –Designing search formulation –Selection and revision of specific terms within search formulation

25 2003.10.16 - SLIDE 25IS 202 – FALL 2003 Monitoring Tactics (Strategy-Level) Check –Compare original goal with current state Weigh –Make a cost/benefit analysis of current or anticipated actions Pattern –Recognize common strategies Correct Errors Record –Keep track of (incomplete) paths

26 2003.10.16 - SLIDE 26IS 202 – FALL 2003 Source-Level Tactics “Bibble”: – Look for a pre-defined result set E.g., a good link page on web Survey: –Look ahead, review available options E.g., don’t simply use the first term or first source that comes to mind Cut: –Eliminate large proportion of search domain E.g., search on rarest term first

27 2003.10.16 - SLIDE 27IS 202 – FALL 2003 Search Formulation Tactics Specify –Use as specific terms as possible Exhaust –Use all possible elements in a query Reduce –Subtract elements from a query Parallel –Use synonyms and parallel terms Pinpoint –Reducing parallel terms and refocusing query Block –To reject or block some terms, even at the cost of losing some relevant documents
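Several of these tactics can be read as edits to a Boolean query. The sketch below assumes a query is an AND of concepts, each concept an OR of terms, plus a set of blocked (NOT) terms; the `Query` class and the sample documents are hypothetical, and Specify (choosing the most specific term) is left to the searcher.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    concepts: list = field(default_factory=list)   # AND of concepts; each concept is a set of OR-ed terms
    blocked: set = field(default_factory=set)      # terms that reject a document outright

    def exhaust(self, *terms):            # add another query element
        self.concepts.append(set(terms))

    def reduce(self, index):              # subtract an element from the query
        del self.concepts[index]

    def parallel(self, index, *terms):    # broaden a concept with synonyms
        self.concepts[index].update(terms)

    def pinpoint(self, index, *terms):    # drop parallel terms to refocus
        self.concepts[index].difference_update(terms)

    def block(self, *terms):              # reject documents containing these terms
        self.blocked.update(terms)

    def matches(self, doc_terms):
        doc_terms = set(doc_terms)
        if doc_terms & self.blocked:
            return False
        return all(concept & doc_terms for concept in self.concepts)


docs = {
    "d1": {"lunar", "excursion", "module"},
    "d2": {"moon", "excursion"},
    "d3": {"lunar", "excursion", "tourism"},
}

def run(query):
    return sorted(name for name, terms in docs.items() if query.matches(terms))

q = Query()
q.exhaust("lunar")
q.exhaust("excursion")
first = run(q)            # both elements required
q.parallel(0, "moon")
broader = run(q)          # synonym added to the first concept
q.block("tourism")
blocked = run(q)          # d3 is lost, relevant or not
print(first, broader, blocked)
```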

28 2003.10.16 - SLIDE 28IS 202 – FALL 2003 Term Tactics Move around the thesaurus –Superordinate, subordinate, coordinate –Neighbor (semantic or alphabetic) –Trace – pull out terms from information already seen as part of search (titles, etc.) –Morphological and other spelling variants –Antonyms (contrary)

29 2003.10.16 - SLIDE 29IS 202 – FALL 2003 Additional Considerations (Bates 79) More detail is needed about short-term cost/benefit decision rule strategies When to stop? –How to judge when enough information has been gathered? –How to decide when to give up an unsuccessful search? –When to stop searching in one source and move to another?

30 2003.10.16 - SLIDE 30IS 202 – FALL 2003 Implications Search interfaces should make it easy to store intermediate results Interfaces should make it easy to follow trails with unanticipated results (and find your way back) This all makes evaluation of the search, the interface and the search process more difficult

31 2003.10.16 - SLIDE 31IS 202 – FALL 2003 Later in the course: –More on Search Process and Strategies –User interfaces to improve IR process –Incorporation of Content Analysis into better systems More Later…

32 2003.10.16 - SLIDE 32IS 202 – FALL 2003 Restricted Form of the IR Problem The system has available only pre-existing, “canned” text passages Its response is limited to selecting from these passages and presenting them to the user It must select, say, 10 or 20 passages out of millions or billions!
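The restricted problem amounts to scoring every stored passage against the query and keeping only the best few. A minimal sketch follows; the scoring function (count of shared words) is an assumption chosen for brevity, not a method the slide proposes.

```python
import heapq


def top_k(query, passages, k=10):
    """Return the k passages sharing the most words with the query."""
    query_terms = set(query.lower().split())

    def score(passage):
        # Toy score: number of distinct query words that appear in the passage.
        return len(query_terms & set(passage.lower().split()))

    # nlargest keeps only k candidates at a time, so `passages` can be a
    # very large iterable without being sorted in full.
    return heapq.nlargest(k, passages, key=score)


passages = [
    "the lunar excursion module landed",
    "lunar cycles and tides",
    "cooking with berries",
]
best = top_k("lunar excursion", passages, k=2)
print(best)
```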

33 2003.10.16 - SLIDE 33IS 202 – FALL 2003 Information Retrieval Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries This set of assumptions underlies the field of Information Retrieval

34 2003.10.16 - SLIDE 34IS 202 – FALL 2003 Relevance (Introduction) In what ways can a document be relevant to a query? –Answer precise question precisely Who is buried in Grant’s tomb? Grant. –Partially answer question Where is Danville? Near Walnut Creek. Where is Dublin? –Suggest a source for more information. What is lymphedema? Look in this Medical Dictionary. –Give background information –Remind the user of other knowledge –Others...

35 2003.10.16 - SLIDE 35IS 202 – FALL 2003 Relevance “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’ know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.” »Saracevic, 1975 p. 324

36 2003.10.16 - SLIDE 36IS 202 – FALL 2003 Define your own relevance Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor Where… From Saracevic, 1975 and Schamber 1990

37 2003.10.16 - SLIDE 37IS 202 – FALL 2003 A. Gages Measure Degree Extent Judgement Estimate Appraisal Relation

38 2003.10.16 - SLIDE 38IS 202 – FALL 2003 B. Aspect Utility Matching Informativeness Satisfaction Appropriateness Usefulness Correspondence

39 2003.10.16 - SLIDE 39IS 202 – FALL 2003 C. Object judged Document Document representation Reference Textual form Information provided Fact Article

40 2003.10.16 - SLIDE 40IS 202 – FALL 2003 D. Frame of reference Question Question representation Research stage Information need Information used Point of view request

41 2003.10.16 - SLIDE 41IS 202 – FALL 2003 E. Assessor Requester Intermediary Expert User Person Judge Information specialist

42 2003.10.16 - SLIDE 42IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments (view from 100,000 Ft.) Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

43 2003.10.16 - SLIDE 43IS 202 – FALL 2003 Visions of IR Systems Rev. John Wilkins, 1600’s : The Philosophic Language and tables Wilhelm Ostwald and Paul Otlet, 1910’s: The “monographic principle” and Universal Classification Emanuel Goldberg, 1920’s - 1940’s H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937) Vannevar Bush, “As we may think.” Atlantic Monthly, 1945.

44 2003.10.16 - SLIDE 44IS 202 – FALL 2003 Card-Based IR Systems Uniterm (Casey, Perry, Berry, Kent: 1958) –Developed and used from the mid 1940’s. Example cards (document numbers posted under each term): EXCURSION 43821 90 241 52 63 34 25 66 17 58 49 130 281 92 83 44 75 86 57 88 119 640 122 93 104 115 146 97 158 139 870 342 157 178 199 207 248 269 298 LUNAR 12457 110 181 12 73 44 15 46 7 28 39 430 241 42 113 74 85 76 17 78 79 820 761 602 233 134 95 136 37 118 109 901 982 194 165 127 198 179 377 288 407
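A Uniterm search compares the cards for two terms and keeps the document numbers that appear on both. The sketch below does this with the numbers shown on the slide's two cards. It leaves out the leading values 43821 and 12457, on the assumption that they are card labels or extraction residue rather than document numbers; that reading is a guess.

```python
# Document numbers posted on each card, as listed on the slide.
excursion = {
    90, 241, 52, 63, 34, 25, 66, 17, 58, 49,
    130, 281, 92, 83, 44, 75, 86, 57, 88, 119,
    640, 122, 93, 104, 115, 146, 97, 158, 139,
    870, 342, 157, 178, 199, 207, 248, 269, 298,
}
lunar = {
    110, 181, 12, 73, 44, 15, 46, 7, 28, 39,
    430, 241, 42, 113, 74, 85, 76, 17, 78, 79,
    820, 761, 602, 233, 134, 95, 136, 37, 118, 109,
    901, 982, 194, 165, 127, 198, 179, 377, 288, 407,
}

# Documents indexed under both LUNAR and EXCURSION.
common = sorted(excursion & lunar)
print(common)  # [17, 44, 241]
```

In the first rows of each card the numbers run in order of final digit (0 through 9), which is what lets a person scan two cards column by column instead of comparing every pair.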

45 2003.10.16 - SLIDE 45IS 202 – FALL 2003 Card Systems Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948 Lunar Excursion
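Optical coincidence cards give each document a fixed position on every term card, with a hole punched where the document is indexed under that term. Stacking the cards and looking for light is a logical AND. A minimal sketch using integers as cards follows; the document numbers are made up for illustration.

```python
def punch(doc_numbers):
    """Return a card (an int bitmask) with one hole per document number."""
    card = 0
    for n in doc_numbers:
        card |= 1 << n
    return card


def visible_holes(first, *others):
    """Stack the cards and list the positions where light passes through all of them."""
    stacked = first
    for card in others:
        stacked &= card
    return [n for n in range(stacked.bit_length()) if (stacked >> n) & 1]


# Hypothetical postings for two term cards.
lunar = punch([3, 17, 44, 90])
excursion = punch([17, 44, 52])
print(visible_holes(lunar, excursion))
```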

46 2003.10.16 - SLIDE 46IS 202 – FALL 2003 Card Systems Zatocode (edge-notched cards) Mooers, 1951 Document 1 Title: lksd ksdj sjd sjsjfkl Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe Document 200 Title: Xksd Lunar sjd sjsjfkl Author: Jones, R. Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe Document 34 Title: lksd ksdj sjd Lunar Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

47 2003.10.16 - SLIDE 47IS 202 – FALL 2003 Computer-Based Systems Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms would take approximately 41,700 hours –Due to the need to move and shift the text in core memory while carrying out the comparisons 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
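Bagley's estimate can be broken down into a cost per comparison. The sketch below assumes the 41,700 hours covers one sequential pass that examines every index term of every record, which the slide implies but does not spell out.

```python
records = 50_000_000        # item records to search
terms_per_record = 30       # index terms per record
hours = 41_700              # Bagley's estimated search time

seconds = hours * 3600
seconds_per_record = seconds / records
seconds_per_term = seconds_per_record / terms_per_record
years = hours / (24 * 365)  # running around the clock

print(f"{seconds_per_record:.2f} s per record")
print(f"{seconds_per_term:.2f} s per index term")
print(f"{years:.2f} years of continuous running")
```

That works out to about three seconds per record, or roughly a tenth of a second per index term, and nearly five years for a single search.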

48 2003.10.16 - SLIDE 48IS 202 – FALL 2003 Historical Milestones in IR Research 1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering (Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Rocchio, Salton) 1972 Statistical Weighting (Sparck Jones) 1975 2-Poisson Model (Harter, Bookstein, Swanson) 1976 Relevance Weighting (Robertson, Sparck Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft)

49 2003.10.16 - SLIDE 49IS 202 – FALL 2003 Historical Milestones in IR Research (cont.) 1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton, Yu) 1985 Generalized Vector Space Model (Wong, Raghavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.) 1990 Latent Semantic Indexing (Dumais, Deerwester) 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok)

50 2003.10.16 - SLIDE 50IS 202 – FALL 2003 Boolean IR Systems Synthex at SDC, 1960 Project MAC at MIT, 1963 (interactive) BOLD at SDC, 1964 (Harold Borko) 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment) SDC began production for a commercial service in 1967 – ORBIT NASA-RECON (1966) becomes DIALOG 1972 Data Central/Mead introduced LEXIS – Full text Online catalogs – late 1970’s and 1980’s

51 2003.10.16 - SLIDE 51IS 202 – FALL 2003 The Internet and the WWW Gopher, Archie, Veronica, WAIS Tim Berners-Lee, 1991 creates WWW at CERN – originally hypertext only Web-crawler Lycos Alta Vista Inktomi Google (and many others)

52 2003.10.16 - SLIDE 52IS 202 – FALL 2003 Information Retrieval – Historical View Research: Boolean model, statistics of language (1950’s); Vector space model, probabilistic indexing, relevance feedback (1960’s); Probabilistic querying (1970’s); Fuzzy set/logic, evidential reasoning (1980’s); Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s) Industry: DIALOG, Lexis-Nexis, STAIRS (Boolean based); Information industry (O($B)); Verity TOPIC (fuzzy logic); Internet search engines (O($100B?)) (vector space, probabilistic)

53 2003.10.16 - SLIDE 53IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

54 2003.10.16 - SLIDE 54IS 202 – FALL 2003 Discussion: Joe Hall on MIR Why does there have to be such a schism between computer-centered and human-centered IR? Would it not be more wise to approach IR from both directions simultaneously? How do *you* find information on a regular basis? Is Google your first-order attack? What do you do when Google wouldn't return anything useful... for example, if Kate was looking for information on music from "The The" or "Peaches"? What are some useful, domain-specific tools out there that you use (like IMDB, or The All Music Guide)?

55 2003.10.16 - SLIDE 55IS 202 – FALL 2003 Discussion: Joe Hall on MIR What would a Venn diagram of Information Retrieval and Information Organization look like? With systems like Google that rely on a very simplistic ranking system, complex Information Organization seems not necessary for certain types of information. There seems to be an OI/IR trade-off here... that is, the more organized your information, the less sophisticated a retrieval system needs to be.

56 2003.10.16 - SLIDE 56IS 202 – FALL 2003 Paul Laskowski on Berlin How many people can participate in a group memory? I would happily share my 202-related emails with my phone project group (Go MonkeyBots!!!), but I might want to be more selective when writing to the entire class – there might be strange people here I haven't met yet. Can a group memory benefit from some notion of social distance and privacy?

57 2003.10.16 - SLIDE 57IS 202 – FALL 2003 Paul Laskowski on Berlin TeamInfo demonstrates that separating discussions into categories is difficult, and expensive to maintain. Part of the problem is that categories are always evolving. Is there a way to exploit references, keywords, or shared language among emails to automatically infer a structure in subject space?

58 2003.10.16 - SLIDE 58IS 202 – FALL 2003 David Schlossberg on Munro While the article points out that we lack knowledge in social navigation, it implies we also lack technology to make this social navigation possible. Are improvements in social navigation limited by current technology? If so, what innovations are needed to make those improvements? What are the limits of Technology to solve these problems?

59 2003.10.16 - SLIDE 59IS 202 – FALL 2003 David Schlossberg on Munro What information domains lend themselves best to social navigation? Which domains are not well suited for social navigation? Another way of thinking about this is where would you like to see changes in interaction or information retrieval with your computer? For instance, the article mentions that chatting could be much more natural with avatars or virtual spaces.

60 2003.10.16 - SLIDE 60IS 202 – FALL 2003 David Schlossberg on Munro One example of existing social navigation is how Google does its ranking based on how people previously chose from the search results. What other examples of social navigation of information space already exist either on the Internet or in the physical world?

61 2003.10.16 - SLIDE 61IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

62 2003.10.16 - SLIDE 62IS 202 – FALL 2003 Next Time Project Presentations (no readings)

