
1 IS 202 – Fall 2004 (2004.09.07, Slide 1)
SIMS 202: Information Organization and Retrieval
Lecture 3: Intro to Information Retrieval
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/

2 Lecture Overview
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

3 Lecture Overview
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

4 Review: Information Overload
“The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)
“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

5 Key Issues In This Course
How to describe information resources or information-bearing objects so that they may be effectively used by those who need them
–Organizing
How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs
–Retrieving

6 Key Issues
[Information life-cycle diagram: Creation (creating, authoring, modifying, organizing, indexing, storing); Distribution (networking, accessing, filtering); Utilization/Searching (using); Retention/Mining (active, semi-active, inactive); Disposition (discard)]

7 IR Topics for 202
The Search Process
Information Retrieval Models
–Boolean, Vector, and Probabilistic
Web-Specific Issues
Content Analysis/Zipf Distributions
Evaluation of IR Systems
–Precision/Recall
–Relevance
–User Studies
User Interface Issues
Special Kinds of Search

8 Lecture Overview
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

9 Web Search Questions
What do people search for?
How do people use search engines?
–How often do people find what they are looking for?
–How difficult is it for people to find what they are looking for?
How can search engines be improved?

10 What Do People Search for on the Web?
Study by Spink et al., Oct 98
–www.shef.ac.uk/~is/publications/infres/paper53.html
–Survey on Excite, 13 questions
–Data for 316 surveys
(If you are interested in this, Amanda Spink has a new book entitled “Web Search: Public Searching On the Web”)

11 What Do People Search for on the Web?
Topics:
Genealogy/Public Figure: 12%
Computer related: 12%
Business: 12%
Entertainment: 8%
Medical: 8%
Politics & Government: 7%
News: 7%
Hobbies: 6%
General info/surfing: 6%
Science: 6%
Travel: 5%
Arts/education/shopping/images: 14%
Something is missing…

12 What Do People Search for on the Web?
50,000 queries from Excite, 1997. Most frequent terms:
4660 sex
3129 yahoo
2191 internal site admin check from kho
1520 chat
1498 porn
1315 horoscopes
1284 pokemon
1283 SiteScope test
1223 hotmail
1163 games
1151 mp3
1140 weather
1127 www.yahoo.com
1110 maps
1036 yahoo.com
983 ebay
980 recipes
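Raw counts like these come straight from tallying individual terms across a query log. A minimal sketch of that tally in Python (the toy log below is invented for illustration, not taken from the Excite data):

```python
from collections import Counter

def top_terms(queries, n=5):
    """Tally individual terms across a query log; return the n most frequent."""
    counts = Counter()
    for q in queries:
        counts.update(q.lower().split())
    return counts.most_common(n)

# Toy log standing in for the 1997 Excite sample (queries invented).
log = ["yahoo mail", "weather maps", "yahoo games", "weather", "yahoo"]
print(top_terms(log, 2))  # → [('yahoo', 3), ('weather', 2)]
```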

13 Why Do These Differ?
Self-reporting survey
The nature of language
–Only a few ways to say certain things
–Many different ways to express most concepts: UFO, flying saucer, space ship, satellite
How many ways are there to talk about history?

14 What is on the Web?
List of 31,928,892 terms from analysis of 49,602,191 web pages. Most frequent terms (count, term):
65002930 the; 62789720 a; 60857930 to; 57248022 of; 54078359 and; 52928506 in; 50686940 s; 49986064 for; 45999001 on; 42205245 this; 41203451 is; 39779377 by; 35439894 with; 35284151 or; 34446866 at; 33528897 all; 31583607 are; 30998255 from; 30755410 e; 30080013 you; 29669506 be; 29417504 that; 28542378 not; 28162417 an; 28110383 as; 28076530 home; 27650474 it; 27572533 i; 24548796 have; 24420453 if; 24376758 new; 24171603 t; 23951805 your; 23875218 page; 22292805 about; 22265579 com; 22107392 information
Source: http://elib.cs.berkeley.edu/docfreq/index.html
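The long tail behind these top terms is what gives such lists their Zipf shape: frequency falls off roughly as 1/rank, so rank times frequency stays roughly constant. A small sketch with synthetic counts that follow the law exactly (real web counts only approximate it):

```python
def zipf_products(freqs):
    """rank * frequency for each count (rank 1 = most frequent).
    Zipf's law predicts these products are roughly constant."""
    ranked = sorted(freqs, reverse=True)
    return [rank * f for rank, f in enumerate(ranked, start=1)]

# Synthetic counts that follow Zipf's law exactly: f(rank) = 1200 / rank.
counts = [1200, 600, 400, 300, 240]
print(zipf_products(counts))  # → [1200, 1200, 1200, 1200, 1200]
```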

15 Intranet Queries (Aug 2000)
3351 bearfacts; 3349 telebears; 1909 extension; 1874 schedule+of+classes; 1780 bearlink; 1737 bear+facts; 1468 decal; 1443 infobears; 1227 calendar; 989 career+center; 974 campus+map; 920 academic+calendar; 840 map; 773 bookstore; 741 class+pass; 738 housing; 721 tele-bears; 716 directory; 667 schedule; 627 recipes; 602 transcripts; 582 tuition; 577 seti; 563 registrar; 550 info+bears; 543 class+schedule; 470 financial+aid

16 Intranet Queries
Summary of sample data from 3 weeks of UCB queries:
–13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
–6.7% Schedule of classes or final exams (6222)
–5.4% Summer Session (5041)
–3.2% Extension (2932)
–3.1% Academic Calendar (2846)
–2.4% Directories (2202)
–1.7% Career Center (1588)
–1.7% Housing (1583)
–1.5% Map (1393)
Average query length over last 4 months: 1.8 words
This suggests what is difficult to find from the home page
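The 1.8-words-per-query figure is simply a mean over the log. A sketch of that computation, with a tiny invented sample standing in for the UCB query log:

```python
def average_query_length(queries):
    """Mean number of whitespace-separated words per query."""
    return sum(len(q.split()) for q in queries) / len(queries)

# Tiny invented sample; the real log joined words with '+' as on the slide.
log = ["schedule of classes", "map", "career center", "telebears"]
print(average_query_length(log))  # → 1.75
```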

17 Queries as Zeitgeist
From: http://www.google.com/press/zeitgeist.html

18 How DO People Search?
Different approaches for different tasks
Models of the search process attempt to summarize how people interact with information resources when seeking information
–Standard IR model
–Alternative models

19 The Standard Retrieval Interaction Model

20 Standard Model of IR
Assumptions:
–The goal is maximizing precision and recall simultaneously
–The information need remains static
–The value is in the resulting document set
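Precision and recall, the two quantities the standard model tries to maximize, can be computed directly from a retrieved set and a relevant set. A minimal sketch (the document IDs are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved docs that are relevant;
    recall = fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them relevant; 6 relevant docs exist in total.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p, round(r, 3))  # → 0.5 0.333
```

Note the tension the slide points to: returning more documents tends to raise recall while lowering precision, so maximizing both simultaneously is a genuine trade-off.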

21 Problems with the Standard Model
Users learn during the search process:
–Scanning titles of retrieved documents
–Reading retrieved documents
–Viewing lists of related topics/thesaurus terms
–Navigating hyperlinks
Some users don't like long and (apparently) disorganized lists of documents

22 IR is an Iterative Process
[Diagram: Goals/Needs, Workspace, and Repositories/Resources connected in an iterative loop]

23 IR is a Dialog
The exchange doesn't end with the first answer
Users can recognize elements of a useful answer, even when incomplete
Questions and understanding change as the process continues

24 Bates' “Berry-Picking” Model
Standard IR model
–Assumes the information need remains the same throughout the search process
Berry-picking model
–Interesting information is scattered like berries among bushes
–The query is continually shifting

25 Berry-Picking Model
[Diagram: a searcher's path through a sequence of shifting queries Q0 through Q5]
A sketch of a searcher “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

26 Berry-Picking Model (cont.)
The query is continually shifting
New information may yield new ideas and new directions
The information need
–Is not satisfied by a single, final retrieved set
–Is satisfied by a series of selections and bits of information found along the way

27 Information Seeking Behavior
Two parts of a process:
–Search and retrieval
–Analysis and synthesis of search results
This is a fuzzy area
–We will look briefly at some different working theories

28 Search Tactics and Strategies
Search Tactics
–Bates 1979
Search Strategies
–Bates 1989
–O'Day and Jeffries 1993

29 Tactics vs. Strategies
Tactic: short-term goals and maneuvers
–Operators, actions
Strategy: overall planning
–Link a sequence of operators together to achieve some end

30 Information Search Tactics
Monitoring tactics
–Keep the search on track
Source-level tactics
–Navigate to and within sources
Term and search formulation tactics
–Designing the search formulation
–Selecting and revising specific terms within the search formulation

31 Monitoring Tactics (Strategy-Level)
Check
–Compare original goal with current state
Weigh
–Make a cost/benefit analysis of current or anticipated actions
Pattern
–Recognize common strategies
Correct Errors
Record
–Keep track of (incomplete) paths

32 Source-Level Tactics
“Bibble”
–Look for a pre-defined result set, e.g., a good link page on the web
Survey
–Look ahead and review available options, e.g., don't simply use the first term or first source that comes to mind
Cut
–Eliminate a large proportion of the search domain, e.g., search on the rarest term first
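The Cut tactic has a direct analogue in how Boolean systems intersect posting lists: starting from the rarest term shrinks the candidate set fastest. A sketch with an invented toy index:

```python
def intersect_rarest_first(postings):
    """AND together posting lists, starting with the rarest term so the
    candidate set shrinks as fast as possible (the 'Cut' tactic)."""
    ordered = sorted(postings.values(), key=len)
    result = set(ordered[0])
    for plist in ordered[1:]:
        result &= set(plist)
        if not result:  # empty intersection: no point scanning further lists
            break
    return result

# Invented toy index: term -> list of document IDs containing it.
index = {
    "the":    [1, 2, 3, 4, 5, 6, 7, 8],
    "lunar":  [2, 4, 7],
    "module": [4, 7, 9],
}
print(sorted(intersect_rarest_first(index)))  # → [4, 7]
```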

33 Search Formulation Tactics
Specify
–Use terms that are as specific as possible
Exhaust
–Use all possible elements in a query
Reduce
–Subtract elements from a query
Parallel
–Use synonyms and parallel terms
Pinpoint
–Reduce parallel terms and refocus the query
Block
–Reject or block some terms, even at the cost of losing some relevant documents

34 Term Tactics
Move around a thesaurus
–Superordinate, subordinate, coordinate
–Neighbor (semantic or alphabetic)
–Trace: pull out terms from information already seen as part of the search (titles, etc.)
–Morphological and other spelling variants
–Antonyms (contrary)

35 Additional Considerations (Bates 79)
More detail is needed about short-term cost/benefit decision-rule strategies
When to stop?
–How to judge when enough information has been gathered?
–How to decide when to give up an unsuccessful search?
–When to stop searching in one source and move to another?

36 Implications
Search interfaces should make it easy to store intermediate results
Interfaces should make it easy to follow trails with unanticipated results (and find your way back)
This all makes evaluation of the search, the interface, and the search process more difficult

37 More Later…
Later in the course:
–More on the search process and strategies
–User interfaces to improve the IR process
–Incorporation of content analysis into better systems

38 Restricted Form of the IR Problem
The system has available only pre-existing, “canned” text passages
Its response is limited to selecting from these passages and presenting them to the user
It must select, say, 10 or 20 passages out of millions or billions!

39 Information Retrieval
Revised task statement: Build a system that retrieves documents that users are likely to find relevant to their queries
This set of assumptions underlies the field of Information Retrieval

40 Relevance (Introduction)
In what ways can a document be relevant to a query?
–Answer a precise question precisely: Who is buried in Grant's tomb? Grant? Or no one?
–Partially answer a question: Where is Danville? Near Walnut Creek. Where is Dublin?
–Suggest a source for more information: What is lymphedema? Look in this medical dictionary.
–Give background information
–Remind the user of other knowledge
–Others...

41 Relevance
“Intuitively, we understand quite well what relevance means. It is a primitive ‘y’know’ concept, as is information, for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.” (Saracevic, 1975, p. 324)

42 Define Your Own Relevance
Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor, where…
From Saracevic, 1975 and Schamber, 1990

43 A. Gages
Measure
Degree
Extent
Judgement
Estimate
Appraisal
Relation

44 B. Aspect
Utility
Matching
Informativeness
Satisfaction
Appropriateness
Usefulness
Correspondence

45 C. Object Judged
Document
Document representation
Reference
Textual form
Information provided
Fact
Article

46 D. Frame of Reference
Question
Question representation
Research stage
Information need
Information used
Point of view
Request

47 E. Assessor
Requester
Intermediary
Expert
User
Person
Judge
Information specialist

48 Lecture Overview
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

49 Visions of IR Systems
Rev. John Wilkins, 1600's: The Philosophic Language and tables
Wilhelm Ostwald and Paul Otlet, 1910's: The “monographic principle” and Universal Classification
Emanuel Goldberg, 1920's - 1940's
H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937)
Vannevar Bush, “As we may think.” Atlantic Monthly, 1945

50 Card-Based IR Systems
Uniterm (Casey, Perry, Berry, Kent: 1958)
–Developed and used from the mid 1940's
[Illustration: Uniterm cards for the terms EXCURSION and LUNAR, each listing the accession numbers of documents indexed under that term]
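What a searcher did with Uniterm cards, matching document numbers across two cards by hand, is exactly a set intersection. A sketch of that coordination step (the card contents below are invented, not read off the slide's card images):

```python
# Each dict entry plays the role of one Uniterm card: the set of accession
# numbers of documents indexed under that term (numbers invented here).
cards = {
    "EXCURSION": {3, 17, 25, 52, 58, 90},
    "LUNAR":     {7, 12, 17, 44, 58, 73},
}

def coordinate(*terms):
    """Documents indexed under every given term, i.e. the intersection a
    searcher computed by matching numbers across cards by eye."""
    result = cards[terms[0]].copy()
    for t in terms[1:]:
        result &= cards[t]
    return result

print(sorted(coordinate("EXCURSION", "LUNAR")))  # → [17, 58]
```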

51 Card Systems
Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948
[Illustration: coincidence cards for “Lunar” and “Excursion”]

52 Card Systems
Zatocode (edge-notched cards), Mooers, 1951
[Illustration: three sample document cards (title, author, abstract), two of which contain the term “Lunar”, encoded as notches on the card edges]

53 Computer-Based Systems
Bagley's 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
–Due to the need to move and shift the text in core memory while carrying out the comparisons
1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC

54 Historical Milestones in IR Research
1958 Statistical Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Rocchio, Salton)
1972 Statistical Weighting (Sparck-Jones)
1975 2-Poisson Model (Harter, Bookstein, Swanson)
1976 Relevance Weighting (Robertson, Sparck-Jones)
1980 Fuzzy sets (Bookstein)
1981 Probability without training (Croft)
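Of these milestones, the 1965 vector space model is easy to make concrete: documents and queries become term-weight vectors, and a document is scored by the cosine of the angle between its vector and the query's. A minimal sketch with invented weights:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors (dicts term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented term weights; in practice these would be tf-idf values.
doc = {"lunar": 2.0, "excursion": 1.0, "module": 1.0}
query = {"lunar": 1.0, "module": 1.0}
print(round(cosine(doc, query), 3))  # → 0.866
```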

55 Historical Milestones in IR Research (cont.)
1983 Linear Regression (Fox)
1983 Probabilistic Dependence (Salton, Yu)
1985 Generalized Vector Space Model (Wong, Raghavan)
1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
1990 Latent Semantic Indexing (Dumais, Deerwester)
1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
1992 TREC (Harman)
1992 Inference networks (Turtle, Croft)
1994 Neural networks (Kwok)

56 Boolean IR Systems
Synthex at SDC, 1960
Project MAC at MIT, 1963 (interactive)
BOLD at SDC, 1964 (Harold Borko)
1964 New York World's Fair – Becker and Hayes produced a system to answer questions (based on airline reservation equipment)
SDC began production for a commercial service in 1967 – ORBIT
NASA-RECON (1966) becomes DIALOG
1972 Data Central/Mead introduced LEXIS – full text
Online catalogs – late 1970's and 1980's

57 The Internet and the WWW
Gopher, Archie, Veronica, WAIS
Tim Berners-Lee creates the WWW at CERN, 1991 – originally hypertext only
Web-crawler
Lycos
Alta Vista
Inktomi
Google (and many others)

58 Information Retrieval – Historical View
Research:
–Boolean model, statistics of language (1950's)
–Vector space model, probabilistic indexing, relevance feedback (1960's)
–Probabilistic querying (1970's)
–Fuzzy set/logic, evidential reasoning (1980's)
–Regression, neural nets, inference networks, latent semantic indexing, TREC (1990's)
Industry:
–DIALOG, Lexis-Nexis, STAIRS (Boolean based)
–Information industry (O($B))
–Verity TOPIC (fuzzy logic)
–Internet search engines (O($100B??)) (vector space, probabilistic)

59 Lecture Overview
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

60 Mano Marks on MIR
The authors make a distinction between data retrieval and information retrieval. What is that distinction?
When would data retrieval be more appropriate than information retrieval? When would information retrieval be more appropriate?
In this context, what is data? What is information?

61 Melissa Chan on Bates
Bates published this berry-picking article in 1989, stating that real-life queries tend to shift and evolve as a user retrieves information. How do Bates's search strategies of footnote chasing, citation searching, journal run, area scanning, subject searches, and author searches parallel a research search on the Internet/online libraries today? Which methods do you use more frequently?
Online libraries 15 years later: would you need to redesign Berkeley's online library to fit the search methods listed by Bates? Does the current design limit or expand your ability to "berry pick" among the library collections? See http://melvyl.cdlib.org/

62 Irina Lib on Berlin
The authors of TeamInfo put a lot of effort into organizing information into categories to minimize searching. With Google advocating the "search, not sort" approach to e-mail, do you think this approach works for a group memory system? Do you think it works well for individual systems?
TeamInfo was tested on a relatively small, homogeneous group of people. Do you think a system such as TeamInfo would work well for larger, more heterogeneous groups? What problems, if any, would arise?

63 Jen King on Munro
What are the possible flaws of using social navigation (“navigation towards a cluster of people or navigation because other people have looked at something”) as a theoretical framework for design? One suggestion: if we base a design on how an aggregate of people appear to use something, we will inevitably exclude some portion of the audience who don't conform to the norm (Amazon.com recommendations are a possible example of this phenomenon).
Non-verbal cues are an important element of human communication. Could social navigation help provide the contextual cues that non-verbal communication provides, helping individuals comprehend information?

64 Jen King on Munro
The central point of social navigation made in the reading is a shift from thinking about computers as external objects humans act upon to a ubiquitous computing environment where humans are engaged with computers in many contexts, both individually and as part of a social group. The authors note that an alternate design possibility includes a “move away from ‘dead’ information spaces we see on the Internet today and in every way possible open up the spaces for seeing other users — both directly and indirectly” (p. 6); in other words, creating a “virtual reality” where the presence of other people (and not merely unidirectional web pages) defines the environment.
Have you encountered any computerized social environments that you thought worked well? If not, how did they fail? Do you agree that interacting directly with other users online is the future of information spaces?

65 Next Time
Boolean Queries and Text Processing
Readings (note: slight rearrangement of the web site and readings)
–(Background) MIR Ch. 2 and Ch. 4
–How to Use Controlled Vocabularies More Effectively in Online Searching (Bates)
–Improving Full-Text Precision on Short Queries using Simple Constraints (Hearst)

