Lecture 15: Intro to Information Retrieval

1 Lecture 15: Intro to Information Retrieval
SIMS 202: Information Organization and Retrieval Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 IS 202 – FALL 2002

2 Lecture Overview
- Review
  - Database Design
  - Normalization
  - Web-enabled Databases
- Introduction to Information Retrieval
- The Information Seeking Process
- Information Retrieval History and Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

3 Models (1) Conceptual Model Logical Internal Model requirements
External requirements Application 1 Application 2 Application 3 Application 4 Internal Model IS 202 – FALL 2002

4 Database System Life Cycle
1. Design
2. Physical Creation
3. Conversion
4. Integration
5. Operations
6. Growth, Change, & Maintenance

5 Normal Forms First Normal Form (1NF) Second Normal Form (2NF)
Third Normal Form (3NF) Boyce-Codd Normal Form (BCNF) Fourth Normal Form (4NF) Fifth Normal Form (5NF) IS 202 – FALL 2002

6 Normalization
- Unnormalized relations
- First normal form: atomic values only; functional dependency of nonkey attributes on the primary key
- Second normal form: full functional dependency of nonkey attributes on the primary key
- Third normal form: no transitive dependency between nonkey attributes
- Boyce-Codd and higher normal forms: all determinants are candidate keys; single multivalued dependency
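
The normal forms on this slide are all stated in terms of functional dependencies. As a minimal sketch (the `holds` helper and the photo table are hypothetical, not taken from the course database), a dependency X -> Y can be tested on sample rows by checking that equal X values never come with different Y values:

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent holds in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # The first value seen for a key is remembered; any later mismatch breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample relation with primary key photo_id.
rows = [
    {"photo_id": 1, "photographer": "Lee", "city": "Oakland"},
    {"photo_id": 2, "photographer": "Lee", "city": "Oakland"},
    {"photo_id": 3, "photographer": "Kim", "city": "Berkeley"},
]

print(holds(rows, ["photo_id"], ["photographer"]))  # True
print(holds(rows, ["photographer"], ["city"]))      # True
print(holds(rows, ["city"], ["photo_id"]))          # False
```

In this toy data, city depends on photographer, which depends on photo_id. That is the kind of transitive dependency between nonkey attributes that third normal form rules out. A check on sample rows can only refute a dependency; it cannot prove that one holds in general.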

7 Dynamic Web Applications 2
Server database CGI DBMS Web Internet Files Clients IS 202 – FALL 2002

8 Server Interfaces Database Web Server Web Application Server Web DB
HTML JavaScript DHTML CGI APIs ColdFusion PHP Perl Java ASP SQL ODBC Native DB interfaces JDBC Native DB Interfaces Adapted from John P. Ashenfelter, Choosing a Database for Your Web Site

9 Photo Browser The current photo browser uses a combination of
Javascript for expandable hierarchies Database in MS Access ColdFusion to search the database when one of the facets is selected The database design for the photo database currently looks like… IS 202 – FALL 2002

10 Photo Browser ER IS 202 – FALL 2002

11 Photo Database
Let's look at the photo database in the Access interface:
- Multi-facet queries
- Queries for multiple descriptors in the same facet (harder)

13 Review: Information Overload
“The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman) “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden) IS 202 – FALL 2002

14 Course Outline Organization Retrieval Overview The search process
Categorization Metadata and markup Metadata for multimedia Photo Project Controlled vocabularies, classification, thesauri Information design Thesaurus design Database design Retrieval The search process Content analysis Tokenization, Zipf’s law, lexical associations IR implementation Term weighting and document ranking Vector space model User interfaces Overviews, query specification, providing context IS 202 – FALL 2002

15 Key Issues In This Course
- Organizing: how to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them
- Retrieving: how to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs

16 Key Issues Creation Utilization Searching Active Retention/ Mining
Inactive Semi-Active Retention/ Mining Disposition Discard Using Creating Authoring Modifying Organizing Indexing Storing Retrieval Distribution Networking Accessing Filtering IS 202 – FALL 2002

17 IR Textbook Topics IS 202 – FALL 2002

18 More Detailed View IS 202 – FALL 2002

19 A Lot A Little What We’ll Cover IS 202 – FALL 2002

20 IR Topics for 202 The Search Process Information Retrieval Models
Content Analysis/Zipf Distributions Evaluation of IR Systems Precision/Recall Relevance User Studies System and Implementation Issues Web-Specific Issues User Interface Issues Special Kinds of Search IS 202 – FALL 2002

22 The Standard Retrieval Interaction Model
IS 202 – FALL 2002

23 Standard Model of IR
Assumptions:
- Maximizing precision and recall simultaneously
- The information need remains static
- The value is in the resulting document set
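
The first assumption names precision and recall without defining them. The sketch below uses the usual set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document IDs are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents judged relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Retrieving more documents tends to raise recall and lower precision, which is why maximizing both at once is listed as an assumption of the model and not a given.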

24 Problems with Standard Model
- Users learn during the search process:
  - Scanning titles of retrieved documents
  - Reading retrieved documents
  - Viewing lists of related topics/thesaurus terms
  - Navigating hyperlinks
- Some users don’t like long disorganized lists of documents

25 IR is an Iterative Process
Repositories Workspace Goals IS 202 – FALL 2002

26 IR is a Dialog
- The exchange doesn’t end with first answer
- User can recognize elements of a useful answer
- Questions and understanding change as the process continues

27 Bates’ “Berry-Picking” Model
- Standard IR model
  - Assumes the information need remains the same throughout the search process
- Berry-picking model
  - Interesting information is scattered like berries among bushes
  - The query is continually shifting

28 Berry-Picking Model
[Figure: successive queries labeled Q0 through Q5]
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

29 Berry-Picking Model (cont.)
- The query is continually shifting
- New information may yield new ideas and new directions
- The information need
  - Is not satisfied by a single, final retrieved set
  - Is satisfied by a series of selections and bits of information found along the way

30 Information Seeking Behavior
- Two parts of a process:
  - Search and retrieval
  - Analysis and synthesis of search results
- This is a fuzzy area
- We will look at several different working theories

31 Search Tactics and Strategies
- Search Tactics: Bates 1979
- Search Strategies: Bates 1989; O’Day and Jeffries 1993

32 Tactics vs. Strategies
- Tactic: short-term goals and maneuvers
  - Operators, actions
- Strategy: overall planning
  - Link a sequence of operators together to achieve some end

33 Information Search Tactics
- Monitoring tactics: keep search on track
- Source-level tactics: navigate to and within sources
- Term and search formulation tactics
  - Designing search formulation
  - Selection and revision of specific terms within search formulation

34 Monitoring Tactics (Strategy-Level)
- Check: compare original goal with current state
- Weigh: make a cost/benefit analysis of current or anticipated actions
- Pattern: recognize common strategies
- Correct: correct errors
- Record: keep track of (incomplete) paths

35 Source-Level Tactics
- “Bibble”: look for a pre-defined result set
  - E.g., a good link page on web
- Survey: look ahead, review available options
  - E.g., don’t simply use the first term or first source that comes to mind
- Cut: eliminate large proportion of search domain
  - E.g., search on rarest term first
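
The Cut example (search on the rarest term first) has a direct counterpart in how a Boolean AND can be evaluated over an index. This is a minimal sketch under assumed names: `index` is a hypothetical mapping from term to the set of document numbers containing it, and the data is made up.

```python
def boolean_and(index, terms):
    """Intersect the postings of all terms, starting with the rarest term."""
    # Sorting by postings size means the smallest set cuts the search domain first.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
        if not result:  # nothing left to match, so stop early
            break
    return result


# Hypothetical toy index.
index = {
    "lunar": {1, 34, 200, 298, 407},
    "excursion": {34, 298},
    "module": {34, 200, 298, 512, 640, 777},
}
print(sorted(boolean_and(index, ["module", "lunar", "excursion"])))  # [34, 298]
```

Starting from "excursion", the candidate set never grows beyond two documents, however long the other postings are.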

36 Source-Level Tactics (cont.)
- Stretch: use source in unintended way
  - E.g., use patents to find addresses
- Scaffold: take an indirect route to goal
  - E.g., when looking for references to obscure poet, look up contemporaries
- Cleave: binary search in an ordered file
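
Cleave is the one tactic here that is literally an algorithm. As a small sketch, Python's standard `bisect` module performs the binary search; the `cleave` wrapper and the list of headings are hypothetical.

```python
from bisect import bisect_left


def cleave(sorted_entries, target):
    """Return the index of target in a sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_entries, target)  # halves the remaining range at each step
    if i < len(sorted_entries) and sorted_entries[i] == target:
        return i
    return -1


# Hypothetical alphabetically ordered file of headings.
headings = ["archive", "catalog", "index", "lexicon", "thesaurus"]
print(cleave(headings, "lexicon"))   # 3
print(cleave(headings, "metadata"))  # -1
```

The same halving idea is what a person does when opening a printed dictionary near the middle and narrowing from there.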

37 Search Formulation Tactics
- Specify: use terms that are as specific as possible
- Exhaust: use all possible elements in a query
- Reduce: subtract elements from a query
- Parallel: use synonyms and parallel terms
- Pinpoint: reduce parallel terms and refocus query
- Block: reject or block some terms, even at the cost of losing some relevant documents
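
One way to read these tactics is as edits to a Boolean query: Exhaust and Reduce add or drop ANDed elements, Parallel and Pinpoint add or drop ORed synonyms, and Block adds a NOT. The sketch below illustrates that reading only; the `search` function, its parameters, and the toy index are assumptions, not a system from the lecture.

```python
def search(index, all_of, any_of=(), none_of=()):
    """Return documents with every all_of term, at least one any_of term (if given), and no none_of term."""
    result = set().union(*index.values())  # start from every indexed document
    for term in all_of:                    # ANDed elements
        result &= index.get(term, set())
    if any_of:                             # ORed parallel terms
        result &= set().union(*(index.get(t, set()) for t in any_of))
    for term in none_of:                   # blocked terms
        result -= index.get(term, set())
    return sorted(result)


# Hypothetical toy index: term -> document numbers.
index = {"lunar": {1, 2, 3}, "moon": {3, 4}, "excursion": {1, 3, 4}, "fiction": {4}}

print(search(index, ["lunar", "excursion"]))                          # Exhaust: [1, 3]
print(search(index, ["excursion"]))                                   # Reduce: [1, 3, 4]
print(search(index, ["excursion"], any_of=["lunar", "moon"]))         # Parallel: [1, 3, 4]
print(search(index, ["excursion"], any_of=["lunar", "moon"],
             none_of=["fiction"]))                                    # Block: [1, 3]
```

Blocking "fiction" drops document 4 even though it matches the other terms, which is the loss of possibly relevant documents that the Block tactic accepts.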

38 Term Tactics
- Move around the thesaurus
  - Superordinate, subordinate, coordinate
  - Neighbor (semantic or alphabetic)
- Trace: pull out terms from information already seen as part of search (titles, etc.)
- Morphological and other spelling variants
- Antonyms (contrary)

39 Additional Considerations (Bates 79)
- Add a Sort tactic!
- More detail is needed about short-term cost/benefit decision rule strategies
- When to stop?
  - How to judge when enough information has been gathered?
  - How to decide when to give up an unsuccessful search?
  - When to stop searching in one source and move to another?

40 Implications
- Interfaces should make it easy to store intermediate results
- Interfaces should make it easy to follow trails with unanticipated results
- Makes evaluation more difficult

41 More Later… Later in the course: More on Search Process and Strategies
User interfaces to improve IR process Incorporation of Content Analysis into better systems IS 202 – FALL 2002

42 Restricted Form of the IR Problem
- The system has available only pre-existing, “canned” text passages
- Its response is limited to selecting from these passages and presenting them to the user
- It must select, say, 10 or 20 passages out of millions or billions!

43 Information Retrieval
- Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries
- This set of assumptions underlies the field of Information Retrieval

45 Visions of IR Systems Paul Otlet, 1930’s
Emanuel Goldberg, 1920’s ’s H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopedie Francaise), 1937. Vannevar Bush, “As we may think.” Atlantic Monthly, 1945. IS 202 – FALL 2002

46 Card-Based IR Systems
- Uniterm (Casey, Perry, Berry, Kent: 1958), developed and used from the mid-1940’s
[Figure: sample Uniterm cards headed EXCURSION and LUNAR; the numbers 298 and 407 survive from the card contents]

47 Card Systems
- Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948
[Figure: cards for the terms Lunar and Excursion]
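
Optical coincidence cards work by giving each term its own card, with a hole punched at the position assigned to every document indexed under that term. Stacking cards and holding them to the light shows only the positions punched on all of them. In modern terms that is a bitwise AND. The sketch below models a card as an integer bitmask; the function names and document numbers are illustrative only.

```python
def punch(doc_numbers):
    """Return a card as an int bitmask with one bit (hole) per document number."""
    card = 0
    for n in doc_numbers:
        card |= 1 << n
    return card


def coincidence(*cards):
    """Stack cards and return the document numbers punched on every card."""
    if not cards:
        return []
    stacked = cards[0]
    for card in cards[1:]:
        stacked &= card  # light passes only where both cards have a hole
    return [n for n in range(stacked.bit_length()) if (stacked >> n) & 1]


# Hypothetical term cards.
lunar = punch([34, 200])
excursion = punch([34, 150])
print(coincidence(lunar, excursion))  # [34]
```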

48 Card Systems
- Zatocode (edge-notched cards), Mooers, 1951
[Figure: three sample document records with placeholder text]
Document 34
  Title: lksd ksdj sjd Lunar
  Author: Smith, J.
  Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe
Document 1
  Title: lksd ksdj sjd sjsjfkl
  Author: Smith, J.
  Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe
Document 200
  Title: Xksd Lunar sjd sjsjfkl
  Author: Jones, R.
  Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe

49 Computer-Based Systems
- Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
  - Due to the need to move and shift the text in core memory while carrying out the comparisons
- 1957: Desk Set with Katharine Hepburn and Spencer Tracy (EMERAC)
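
To put Bagley's estimate in perspective, the three figures on the slide can be turned into per-record costs. The breakdown below is plain arithmetic on those figures. Spreading the time evenly over index terms is an assumption made here for illustration, not something taken from the thesis.

```python
def bagley_breakdown(records=50_000_000, terms_per_record=30, hours=41_700):
    """Convert the estimated total search time into per-record and per-term costs."""
    seconds = hours * 3600
    per_record = seconds / records              # seconds spent on each item record
    per_term = per_record / terms_per_record    # assumes time is spread evenly over terms
    years = hours / (24 * 365)                  # continuous running time in years
    return seconds, per_record, per_term, years


seconds, per_record, per_term, years = bagley_breakdown()
print(seconds)               # 150120000
print(f"{per_record:.2f}")   # 3.00
print(f"{per_term:.3f}")     # 0.100
print(f"{years:.1f}")        # 4.8
```

That is about three seconds per record, or nearly five years of uninterrupted machine time for a single pass over the collection.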

50 Historical Milestones in IR Research
1958 Statistical Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Rocchio, Salton)
1972 Statistical Weighting (Sparck Jones)
     Poisson Model (Harter, Bookstein, Swanson)
1976 Relevance Weighting (Robertson, Sparck Jones)
1980 Fuzzy sets (Bookstein)
1981 Probability without training (Croft)

51 Historical Milestones in IR Research (cont.)
- Linear Regression (Fox)
- Probabilistic Dependence (Salton, Yu)
- Generalized Vector Space Model (Wong, Raghavan)
- Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
- Latent Semantic Indexing (Dumais, Deerwester)
- Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
- TREC (Harman)
- Inference networks (Turtle, Croft)
- Neural networks (Kwok)

52 Development of Bibliographic Databases
- Chemical Abstracts Service first produced “Chemical Titles” by computer in 1961
- Index Medicus from the National Library of Medicine soon followed with the creation of the MEDLARS database in 1961
- By 1970, most secondary publications (indexes, abstract journals, etc.) were produced by machine

53 Boolean IR Systems Synthex at SDC, 1960
Project MAC at MIT, 1963 (interactive) BOLD at SDC, 1964 (Harold Borko) 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment) SDC began production for a commercial service in 1967 – ORBIT NASA-RECON (1966) becomes DIALOG 1972 Data Central/Mead introduced LEXIS – Full text Online catalogs – late 1970’s and 1980’s IS 202 – FALL 2002

54 Experimental IR systems
Probabilistic indexing – Maron and Kuhns, 1960 SMART – Gerard Salton at Cornell – Vector space model, 1970’s SIRE at Syracuse I3R – Croft TREC – 1992 IS 202 – FALL 2002

55 The Internet and the WWW
- Gopher, Archie, Veronica, WAIS
- Tim Berners-Lee, 1991, creates WWW at CERN (originally hypertext only)
- Web-crawler
- Lycos
- Alta Vista
- Inktomi
- Google

56 Information Retrieval – Historical View
Research
- Boolean model, statistics of language (1950’s)
- Vector space model, probabilistic indexing, relevance feedback (1960’s)
- Probabilistic querying (1970’s)
- Fuzzy set/logic, evidential reasoning (1980’s)
- Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)
Industry
- DIALOG, Lexis-Nexis, STAIRS (Boolean based)
- Information industry (O($B))
- Verity TOPIC (fuzzy logic)
- Internet search engines (O($100B?)) (vector space, probabilistic)

57 Research Sources in Information Retrieval
- ACM Transactions on Information Systems
- Am. Society for Information Science Journal
- Document Analysis and IR Proceedings (Las Vegas)
- Information Processing and Management (Pergamon)
- Journal of Documentation
- SIGIR Conference Proceedings
- TREC Conference Proceedings

58 Research Systems Software
INQUERY (Croft/U. Mass) OKAPI (Robertson) PRISE (Harman/NIST) SMART (Buckley/Cornell) CHESHIRE (Larson/Berkeley) IS 202 – FALL 2002

59 Structure of an IR System
[Diagram: Information Storage and Retrieval System, adapted from Soergel, p. 19]
- Search line: interest profiles & queries -> formulating query in terms of descriptors -> storage of profiles -> Store 1: profiles/search requests
- Storage line: documents & data -> indexing (descriptive and subject) -> storage of documents -> Store 2: document representations
- Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
- Comparison/matching of Store 1 against Store 2 -> potentially relevant documents
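
The two lines of this diagram can be sketched in a few lines of code. A thesaurus maps lead-in vocabulary to descriptors, the storage line indexes documents into Store 2, and the search line formulates a query in the same descriptors before matching. All names and data below are hypothetical, and exact descriptor matching stands in for whatever comparison a real system would use.

```python
# Hypothetical thesaurus: lead-in vocabulary -> descriptor in the indexing language.
THESAURUS = {"moon": "lunar", "lunar": "lunar", "trip": "excursion", "excursion": "excursion"}


def to_descriptors(text):
    """Apply the rules of the game: map free text to the descriptors it mentions."""
    return {THESAURUS[w] for w in text.lower().split() if w in THESAURUS}


def index_documents(documents):
    """Storage line: build Store 2, the document representations."""
    return {doc_id: to_descriptors(text) for doc_id, text in documents.items()}


def match(query, store2):
    """Search line: formulate the query in descriptors, then compare it with Store 2."""
    wanted = to_descriptors(query)
    if not wanted:
        return []
    return sorted(doc_id for doc_id, descriptors in store2.items() if wanted <= descriptors)


documents = {1: "Moon trip planning", 2: "Lunar geology survey", 3: "Excursion to the moon"}
store2 = index_documents(documents)
print(match("lunar excursion", store2))  # [1, 3]
```

Document 1 never contains the words "lunar" or "excursion", yet it matches because queries and documents pass through the same thesaurus. That shared vocabulary is the point of the "rules of the game" box in the diagram.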

63 Relevance (Introduction)
In what ways can a document be relevant to a query?
- Answer precise question precisely
  - Who is buried in Grant’s tomb? Grant.
- Partially answer question
  - Where is Danville? Near Walnut Creek.
- Suggest a source for more information
  - What is lymphedema? Look in this medical dictionary.
- Give background information
- Remind the user of other knowledge
- Others...

64 Next Time
- Boolean Search Logic
- Preparing information for search
- Lexical analysis
- Using Lexis-Nexis (Assignment 8)

