2003.10.16 - SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003


SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003 SIMS 202: Information Organization and Retrieval Lecture 16: Intro to Information Retrieval

SLIDE 2IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 3IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 4IS 202 – FALL 2003 Review: Information Overload “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman) “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

SLIDE 5IS 202 – FALL 2003 Course Outline Organization –Overview –Categorization –Knowledge Representation –Metadata Introduction –Controlled Vocabularies Introduction –Thesaurus Design and Construction –Multimedia Information Organization and Retrieval –Metadata for Media –Database Design –XML Retrieval –Introduction to Search Process –Boolean Queries and Text Processing –Statistical Properties of Text and Vector Representation –Probabilistic Ranking and Relevance Feedback –Evaluation –Web Search Issues and Architecture –Interfaces for Information Retrieval

SLIDE 6IS 202 – FALL 2003 Key Issues In This Course How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them –Organizing How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs –Retrieving

SLIDE 7IS 202 – FALL 2003 Key Issues [Diagram labels: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard; Using; Creating; Authoring; Modifying; Organizing; Indexing; Storing; Retrieval; Distribution; Networking; Accessing; Filtering]

SLIDE 8IS 202 – FALL 2003 Modern IR Textbook Topics

SLIDE 9IS 202 – FALL 2003 More Detailed View

SLIDE 10IS 202 – FALL 2003 What We’ll Cover A Lot A Little

SLIDE 11IS 202 – FALL 2003 IR Topics for 202 The Search Process Information Retrieval Models –Boolean, Vector, and Probabilistic Content Analysis/Zipf Distributions Evaluation of IR Systems –Precision/Recall –Relevance –User Studies Web-Specific Issues User Interface Issues Special Kinds of Search

SLIDE 12IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 13IS 202 – FALL 2003 The Standard Retrieval Interaction Model

SLIDE 14IS 202 – FALL 2003 Standard Model of IR Assumptions: –The goal is maximizing precision and recall simultaneously –The information need remains static –The value is in the resulting document set
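The slide names precision and recall without defining them. As a minimal sketch using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved), the two measures can be computed for one query as follows. The function name and the document IDs are hypothetical and only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: iterable of document IDs the system returned
    relevant:  iterable of document IDs judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, but usually lowers precision, which is one reason the assumption of maximizing both simultaneously is hard to satisfy in practice.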

SLIDE 15IS 202 – FALL 2003 Problems with Standard Model Users learn during the search process: –Scanning titles of retrieved documents –Reading retrieved documents –Viewing lists of related topics/thesaurus terms –Navigating hyperlinks Some users don’t like long (apparently) disorganized lists of documents

SLIDE 16IS 202 – FALL 2003 IR is an Iterative Process Repositories Workspace Goals

SLIDE 17IS 202 – FALL 2003 IR is a Dialog The exchange doesn’t end with the first answer Users can recognize elements of a useful answer, even when incomplete Questions and understanding change as the process continues

SLIDE 18IS 202 – FALL 2003 Bates’ “Berry-Picking” Model Standard IR model –Assumes the information need remains the same throughout the search process Berry-picking model –Interesting information is scattered like berries among bushes –The query is continually shifting

SLIDE 19IS 202 – FALL 2003 Berry-Picking Model Q0 Q1 Q2 Q3 Q4 Q5 A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

SLIDE 20IS 202 – FALL 2003 Berry-Picking Model (cont.) The query is continually shifting New information may yield new ideas and new directions The information need –Is not satisfied by a single, final retrieved set –Is satisfied by a series of selections and bits of information found along the way
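One way to make the contrast with the standard model concrete is a loop in which the query is rewritten after every step and the useful items are accumulated along the way instead of being taken from a single final result set. The sketch below is an illustration under stated assumptions, not Bates' own formalization: the function names, the toy notes collection, and the rule "follow the lead suggested by the first item kept" are all hypothetical.

```python
def berry_pick(search, pick, revise, first_query, max_steps=6):
    """Run a search session in which the query shifts after each step.

    Returns (basket, trail): everything kept along the way, and the
    sequence of queries Q0, Q1, ... that the searcher issued.
    """
    basket, trail = [], [first_query]
    query = first_query
    for _ in range(max_steps):
        kept = pick(search(query), basket)  # select a few "berries"
        basket.extend(kept)
        query = revise(query, kept)         # new information reshapes the query
        if query is None:                   # nothing new to follow up on
            break
        trail.append(query)
    return basket, trail


# Toy collection: document -> (text, lead suggested by reading it).
NOTES = {
    "a": ("information seeking", "browsing"),
    "b": ("browsing behavior", "citation chasing"),
    "c": ("citation chasing", None),
}


def search(query):
    return [doc for doc, (text, _) in NOTES.items() if query in text]


def pick(results, basket):
    return [doc for doc in results if doc not in basket]


def revise(query, kept):
    return NOTES[kept[0]][1] if kept else None


basket, trail = berry_pick(search, pick, revise, "seeking")
print(trail)   # the shifting queries
print(basket)  # selections gathered along the way
```

No single query in the trail retrieves everything in the basket, which is the point of the model: the need is met by the series of selections.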

SLIDE 21IS 202 – FALL 2003 Information Seeking Behavior Two parts of a process: –Search and retrieval –Analysis and synthesis of search results This is a fuzzy area –We will look (briefly) at some different working theories

SLIDE 22IS 202 – FALL 2003 Search Tactics and Strategies Search Tactics –Bates 1979 Search Strategies –Bates 1989 –O’Day and Jeffries 1993

SLIDE 23IS 202 – FALL 2003 Tactics vs. Strategies Tactic: short term goals and maneuvers –Operators, actions Strategy: overall planning –Link a sequence of operators together to achieve some end

SLIDE 24IS 202 – FALL 2003 Information Search Tactics Monitoring tactics –Keep search on track Source-level tactics –Navigate to and within sources Term and Search Formulation tactics –Designing search formulation –Selection and revision of specific terms within search formulation

SLIDE 25IS 202 – FALL 2003 Monitoring Tactics (Strategy-Level) Check –Compare original goal with current state Weigh –Make a cost/benefit analysis of current or anticipated actions Pattern –Recognize common strategies Correct Errors Record –Keep track of (incomplete) paths

SLIDE 26IS 202 – FALL 2003 Source-Level Tactics “Bibble”: – Look for a pre-defined result set E.g., a good link page on web Survey: –Look ahead, review available options E.g., don’t simply use the first term or first source that comes to mind Cut: –Eliminate large proportion of search domain E.g., search on rarest term first
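The "Cut" tactic has a close analogue in how a conjunctive (AND) query might be evaluated against an inverted index: starting from the rarest term keeps the working set as small as possible. The following is a minimal sketch, assuming a hypothetical index that maps each term to the set of document numbers containing it; the index contents are invented for illustration.

```python
def and_query(index, terms):
    """Intersect posting sets, rarest term first (the "Cut" tactic)."""
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])      # start from the smallest posting set
    for posting in postings[1:]:
        result &= posting
        if not result:             # nothing left to eliminate
            break
    return result


# Hypothetical index: term -> document numbers.
INDEX = {
    "module": {1, 34, 77, 90, 200, 512, 600},
    "excursion": {34, 77, 200, 512},
    "lunar": {34, 200},
}

print(sorted(and_query(INDEX, ["module", "excursion", "lunar"])))
```

The order in which the searcher typed the terms does not matter here; the sort by posting-set size is what applies the tactic.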

SLIDE 27IS 202 – FALL 2003 Search Formulation Tactics Specify –Use as specific terms as possible Exhaust –Use all possible elements in a query Reduce –Subtract elements from a query Parallel –Use synonyms and parallel terms Pinpoint –Reducing parallel terms and refocusing query Block –To reject or block some terms, even at the cost of losing some relevant documents
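Five of these tactics (Exhaust, Reduce, Parallel, Pinpoint, Block) can be read as edits to a Boolean query, which the sketch below tries to illustrate. It assumes a query is a set of required concepts, each matched by any one of its parallel terms, plus a set of blocked terms. The class, its method names, and the example terms are hypothetical, and "Specify" is left to the searcher's choice of terms.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    concepts: dict = field(default_factory=dict)  # concept -> set of parallel terms
    blocked: set = field(default_factory=set)     # terms that reject a document

    def exhaust(self, concept, term):
        """Add another required element to the query."""
        self.concepts[concept] = {term}

    def reduce(self, concept):
        """Subtract an element from the query."""
        self.concepts.pop(concept, None)

    def parallel(self, concept, *terms):
        """Widen one element with synonyms or parallel terms."""
        self.concepts[concept].update(terms)

    def pinpoint(self, concept, term):
        """Cut the parallel terms of one element back to a single term."""
        self.concepts[concept] = {term}

    def block(self, term):
        """Reject documents containing this term, even relevant ones."""
        self.blocked.add(term)

    def matches(self, doc_terms):
        doc_terms = set(doc_terms)
        if doc_terms & self.blocked:
            return False
        return all(doc_terms & group for group in self.concepts.values())


q = Query()
q.exhaust("vehicle", "car")
q.exhaust("fuel", "electric")
print(q.matches({"automobile", "electric"}))     # misses: no parallel term yet
q.parallel("vehicle", "automobile")
print(q.matches({"automobile", "electric"}))     # now matches
q.block("hybrid")
print(q.matches({"car", "electric", "hybrid"}))  # rejected by the blocked term
```

Exhaust and Block narrow the result set, while Reduce and Parallel widen it, so a sequence of these calls is one way to picture a strategy built from tactics.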

SLIDE 28IS 202 – FALL 2003 Term Tactics Move around the thesaurus –Superordinate, subordinate, coordinate –Neighbor (semantic or alphabetic) –Trace – pull out terms from information already seen as part of search (titles, etc.) –Morphological and other spelling variants –Antonyms (contrary)
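The first group of term tactics (superordinate, subordinate, coordinate) amounts to following links in a thesaurus. A minimal sketch, assuming a hypothetical thesaurus stored as a dictionary of broader and narrower terms; the vocabulary is invented for illustration.

```python
# Hypothetical thesaurus fragment: term -> broader and narrower terms.
THESAURUS = {
    "mammal": {"broader": ["animal"], "narrower": ["cat", "dog"]},
    "dog": {"broader": ["mammal"], "narrower": ["poodle"]},
    "cat": {"broader": ["mammal"], "narrower": []},
}


def superordinate(term):
    """Move up: broader terms."""
    return list(THESAURUS.get(term, {}).get("broader", []))


def subordinate(term):
    """Move down: narrower terms."""
    return list(THESAURUS.get(term, {}).get("narrower", []))


def coordinate(term):
    """Move sideways: other terms that share a broader term."""
    siblings = []
    for parent in superordinate(term):
        for child in subordinate(parent):
            if child != term and child not in siblings:
                siblings.append(child)
    return siblings


print(superordinate("dog"), subordinate("dog"), coordinate("dog"))
```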

SLIDE 29IS 202 – FALL 2003 Additional Considerations (Bates 79) More detail is needed about short-term cost/benefit decision rule strategies When to stop? –How to judge when enough information has been gathered? –How to decide when to give up an unsuccessful search? –When to stop searching in one source and move to another?

SLIDE 30IS 202 – FALL 2003 Implications Search interfaces should make it easy to store intermediate results Interfaces should make it easy to follow trails with unanticipated results (and find your way back) This all makes evaluation of the search, the interface and the search process more difficult

SLIDE 31IS 202 – FALL 2003 Later in the course: –More on Search Process and Strategies –User interfaces to improve IR process –Incorporation of Content Analysis into better systems More Later…

SLIDE 32IS 202 – FALL 2003 Restricted Form of the IR Problem The system has available only pre-existing, “canned” text passages Its response is limited to selecting from these passages and presenting them to the user It must select, say, 10 or 20 passages out of millions or billions!
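The restricted problem can be sketched as scoring every canned passage against the query and keeping only the best few. The scoring rule below (count of query-term occurrences) is a deliberately crude placeholder chosen for illustration, not a method from the lecture; the function name and passages are hypothetical. `heapq.nlargest` is used so that only k candidates are held at a time instead of sorting the whole collection.

```python
import heapq


def top_k(passages, query_terms, k=10):
    """Select the k best-scoring passages from a fixed collection."""
    wanted = {term.lower() for term in query_terms}

    def score(passage):
        # Placeholder score: how many words of the passage are query terms.
        return sum(1 for word in passage.lower().split() if word in wanted)

    # (score, -position) so that ties go to the earlier passage.
    scored = ((score(p), -i) for i, p in enumerate(passages))
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [passages[-neg_i] for _, neg_i in best]


PASSAGES = [
    "lunar excursion module",
    "lunar orbit",
    "cooking pasta",
]
print(top_k(PASSAGES, ["lunar", "excursion"], k=2))
```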

SLIDE 33IS 202 – FALL 2003 Information Retrieval Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries This set of assumptions underlies the field of Information Retrieval

SLIDE 34IS 202 – FALL 2003 Relevance (Introduction) In what ways can a document be relevant to a query? –Answer precise question precisely Who is buried in Grant’s tomb? Grant. –Partially answer question Where is Danville? Near Walnut Creek. Where is Dublin? –Suggest a source for more information. What is lymphedema? Look in this Medical Dictionary. –Give background information –Remind the user of other knowledge –Others...

SLIDE 35IS 202 – FALL 2003 Relevance “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’ know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.” (Saracevic, 1975, p. 324)

SLIDE 36IS 202 – FALL 2003 Define your own relevance Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor Where… From Saracevic, 1975 and Schamber 1990

SLIDE 37IS 202 – FALL 2003 A. Gages Measure Degree Extent Judgement Estimate Appraisal Relation

SLIDE 38IS 202 – FALL 2003 B. Aspect Utility Matching Informativeness Satisfaction Appropriateness Usefulness Correspondence

SLIDE 39IS 202 – FALL 2003 C. Object judged Document Document representation Reference Textual form Information provided Fact Article

SLIDE 40IS 202 – FALL 2003 D. Frame of reference Question Question representation Research stage Information need Information used Point of view request

SLIDE 41IS 202 – FALL 2003 E. Assessor Requester Intermediary Expert User Person Judge Information specialist

SLIDE 42IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments (view from 100,000 Ft.) Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 43IS 202 – FALL 2003 Visions of IR Systems Rev. John Wilkins, 1600’s : The Philosophic Language and tables Wilhelm Ostwald and Paul Otlet, 1910’s: The “monographic principle” and Universal Classification Emanuel Goldberg, 1920’s ’s H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937) Vannevar Bush, “As we may think.” Atlantic Monthly, 1945.

SLIDE 44IS 202 – FALL 2003 Card-Based IR Systems Uniterm (Casey, Perry, Berry, Kent: 1958) –Developed and used from the mid 1940’s [Card labels: EXCURSION, LUNAR]

SLIDE 45IS 202 – FALL 2003 Card Systems Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948 [Card labels: Lunar, Excursion]
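Optical coincidence cards keep one card per index term, with a hole punched at the position assigned to each document posted under that term. Stacking the cards for several terms and holding them to the light shows the documents that carry all of those terms. A minimal sketch of that idea, modeling each card as an integer bitmask and the stacking as a bitwise AND; the function names and document numbers are hypothetical.

```python
def punch(doc_numbers):
    """Make a term card: bit n is a hole meaning document n has this term."""
    card = 0
    for n in doc_numbers:
        card |= 1 << n
    return card


def coincidence(*cards):
    """Stack cards and report the positions where light passes through all."""
    light = cards[0]
    for card in cards[1:]:
        light &= card
    return [n for n in range(light.bit_length()) if (light >> n) & 1]


# Hypothetical postings for the two terms shown on the slide's cards.
lunar = punch([12, 34, 200])
excursion = punch([34, 77, 200])
print(coincidence(lunar, excursion))
```

The same intersection of per-term posting lists is what a Boolean AND query performs in later computer-based systems.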

SLIDE 46IS 202 – FALL 2003 Card Systems Zatocode (edge-notched cards) Mooers, 1951 Document 1 Title: lksd ksdj sjd sjsjfkl Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe Document 200 Title: Xksd Lunar sjd sjsjfkl Author: Jones, R. Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe Document 34 Title: lksd ksdj sjd Lunar Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

SLIDE 47IS 202 – FALL 2003 Computer-Based Systems Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms would take approximately 41,700 hours –Due to the need to move and shift the text in core memory while carrying out the comparisons 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
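To get a feel for Bagley's estimate, the figures on the slide can be turned into per-record and per-term rates. This is only back-of-the-envelope arithmetic, assuming the 41,700 hours is read as one sequential pass in which every index term of every record is compared once; that reading is an assumption, not something the slide states.

```python
def implied_rates(records, terms_per_record, total_hours):
    """Return (seconds per record, seconds per index term, years of running)."""
    total_seconds = total_hours * 3600
    seconds_per_record = total_seconds / records
    seconds_per_term = seconds_per_record / terms_per_record
    years = total_hours / (24 * 365)
    return seconds_per_record, seconds_per_term, years


per_record, per_term, years = implied_rates(50_000_000, 30, 41_700)
print(f"{per_record:.2f} s per record, {per_term:.2f} s per term, {years:.1f} years")
```

Under that reading the estimate works out to roughly three seconds per record, about a tenth of a second per index term, and a little under five years of continuous running for a single search.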

SLIDE 48IS 202 – FALL 2003 Historical Milestones in IR Research 1958 Statistical Language Properties (Luhn); 1960 Probabilistic Indexing (Maron & Kuhns); 1961 Term association and clustering (Doyle); 1965 Vector Space Model (Salton); 1968 Query expansion (Rocchio, Salton); 1972 Statistical Weighting (Sparck Jones); Poisson Model (Harter, Bookstein, Swanson); 1976 Relevance Weighting (Robertson, Sparck Jones); 1980 Fuzzy sets (Bookstein); 1981 Probability without training (Croft)

SLIDE 49IS 202 – FALL 2003 Historical Milestones in IR Research (cont.) 1983 Linear Regression (Fox); 1983 Probabilistic Dependence (Salton, Yu); 1985 Generalized Vector Space Model (Wong, Raghavan); 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.); 1990 Latent Semantic Indexing (Dumais, Deerwester); 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr); 1992 TREC (Harman); 1992 Inference networks (Turtle, Croft); 1994 Neural networks (Kwok)

SLIDE 50IS 202 – FALL 2003 Boolean IR Systems Synthex at SDC, 1960 Project MAC at MIT, 1963 (interactive) BOLD at SDC, 1964 (Harold Borko) 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment) SDC began production for a commercial service in 1967 – ORBIT NASA-RECON (1966) becomes DIALOG 1972 Data Central/Mead introduced LEXIS – Full text Online catalogs – late 1970’s and 1980’s

SLIDE 51IS 202 – FALL 2003 The Internet and the WWW Gopher, Archie, Veronica, WAIS; Tim Berners-Lee, 1991 creates WWW at CERN – originally hypertext only; Web-crawler; Lycos; Alta Vista; Inktomi; Google (and many others)

SLIDE 52IS 202 – FALL 2003 Information Retrieval – Historical View Research: Boolean model, statistics of language (1950’s); Vector space model, probabilistic indexing, relevance feedback (1960’s); Probabilistic querying (1970’s); Fuzzy set/logic, evidential reasoning (1980’s); Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s) Industry: DIALOG, Lexis-Nexis, STAIRS (Boolean based); Information industry (O($B)); Verity TOPIC (fuzzy logic); Internet search engines (O($100B?)) (vector space, probabilistic)

SLIDE 53IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 54IS 202 – FALL 2003 Discussion: Joe Hall on MIR Why does there have to be such a schism between computer-centered and human-centered IR? Would it not be more wise to approach IR from both directions simultaneously? How do *you* find information on a regular basis? Is Google your first-order attack? What do you do when Google wouldn't return anything useful... for example, if Kate was looking for information on music from "The The" or "Peaches"? What are some useful, domain-specific tools out there that you use (like IMDB, or The All Music Guide)?

SLIDE 55IS 202 – FALL 2003 Discussion: Joe Hall on MIR What would a Venn diagram of Information Retrieval and Information Organization look like? With systems like Google that rely on a very simplistic ranking system, complex Information Organization seems unnecessary for certain types of information. There seems to be an IO/IR trade-off here... that is, the more organized your information, the less sophisticated a retrieval system needs to be.

SLIDE 56IS 202 – FALL 2003 Paul Laskowski on Berlin How many people can participate in a group memory? I would happily share my 202-related emails with my phone project group (Go MonkeyBots!!!), but I might want to be more selective when writing to the entire class – there might be strange people here I haven't met yet. Can a group memory benefit from some notion of social distance and privacy?

SLIDE 57IS 202 – FALL 2003 Paul Laskowski on Berlin TeamInfo demonstrates that separating discussions into categories is difficult, and expensive to maintain. Part of the problem is that categories are always evolving. Is there a way to exploit references, keywords, or shared language among emails to automatically infer a structure in subject space?

SLIDE 58IS 202 – FALL 2003 David Schlossberg on Munro While the article points out that we lack knowledge in social navigation, it implies we also lack the technology to make this social navigation possible. Are improvements in social navigation limited by current technology? If so, what innovations are needed to make those improvements? What are the limits of technology in solving these problems?

SLIDE 59IS 202 – FALL 2003 David Schlossberg on Munro What information domains lend themselves best to social navigation? Which domains are not well suited for social navigation? Another way of thinking about this is where would you like to see changes in interaction or information retrieval with your computer? For instance, the article mentions that chatting could be much more natural with avatars or virtual spaces.

SLIDE 60IS 202 – FALL 2003 David Schlossberg on Munro One example of existing social navigation is how Google does its ranking based on how people previously chose from the search results. What other examples of social navigation of information space already exist either on the Internet or in the physical world?

SLIDE 61IS 202 – FALL 2003 Lecture Overview Review –MPEG-7 Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Prep for Presentations –MMM Status, Web interface, Flamenco Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

SLIDE 62IS 202 – FALL 2003 Next Time Project Presentations (no readings)