Lecture 1: Introduction and History

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

INSTRUCTOR: DR.NICK EVANGELOPOULOS PRESENTED BY: QIUXIA WU CHAPTER 2 Information retrieval DSCI 5240.
Modern Information Retrieval Chapter 1: Introduction
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Search Engines and Information Retrieval
SLIDE 1CARL Presentation The Future of Search Ray R. Larson University of California, Berkeley School of Information.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Modern Information Retrieval Chapter 1: Introduction
SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
Intelligent Information Retrieval CS 336 Lisa Ballesteros Spring 2006.
1 CS 430: Information Discovery Lecture 10 Cranfield and TREC.
ISP 433/533 Week 8 IR in libraries. Goal Universal Access to Information Vannevar Bush 1945 article Memex A memex is a device in which an individual stores.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SLIDE 1IS 202 – FALL 2004 Lecture 13: Midterm Review Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am -
8/28/97Information Organization and Retrieval Metadata and Data Structures University of California, Berkeley School of Information Management and Systems.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
Information Retrieval in Practice
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
Lecture 15: Intro to Information Retrieval
SLIDE 1IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.
11/21/2000Information Organization and Retrieval Thesaurus Design and Development University of California, Berkeley School of Information Management and.
SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
SLIDE 1IS 202 – FALL 2004 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004
September 7, 2000Information Organization and Retrieval Introduction to Information Retrieval Ray Larson & Marti Hearst University of California, Berkeley.
SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
Modern Information Retrieval Lecture 1: Introduction.
Search Engines and Information Retrieval Chapter 1.
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
Information Retrieval and Web Search Lecture 1. Course overview Instructor: Rada Mihalcea Class web page:
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2006.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Lecture 1: Overview of IR Maya Ramanath. Who hasn’t used Google? Why did Google return these results first ? Can we improve on it? Is this a good result.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Information in the Digital Environment Information Seeking Models Dr. Dania Bilal IS 530 Spring 2005.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Conceptual structures in modern information retrieval Claudio Carpineto Fondazione Ugo Bordoni
1 Information Retrieval LECTURE 1 : Introduction.
Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea.
Information Retrieval
Search and Retrieval: Finding Out About Prof. Marti Hearst SIMS 202, Lecture 18.
User Errors in Formulating Queries and IR Techniques to Overcome Them Birger Larsen Information Interaction and Information Architecture Royal School of.
Introduction to Information Retrieval. What is IR? Sit down before fact as a little child, be prepared to give up every conceived notion, follow humbly.
SIMS 202, Marti Hearst Final Review Prof. Marti Hearst SIMS 202.
Information Retrieval in Practice
Sampath Jayarathna Cal Poly Pomona
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Lecture 1: Introduction and the Boolean Model Information Retrieval
Information Retrieval (in Practice)
What is Information Retrieval (IR)?
Information Retrieval and Web Search
Information Retrieval and Web Search
CSCE 561 Information Retrieval System Models
WIRED Week 2 Syllabus Update Readings Overview.
Murat Açar - Zeynep Çipiloğlu Yıldız
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introduction to Information Retrieval
Lecture 8 Information Retrieval Introduction
Introduction to Search Engines
Presentation transcript:

Lecture 1: Introduction and History Principles of Information Retrieval Prof. Ray Larson University of California, Berkeley School of Information http://courses.sims.berkeley.edu/i240/s11/ IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

Introduction to Course Course Contents Assignments Readings and Discussion Hands-On use of IR systems Participation in Mini-TREC IR Evaluation Term paper Grading Readings Web Site: http://courses.sims.berkeley.edu/i240/s11/ IS 240 – Spring 2011

Purposes of the Course To impart a basic theoretical understanding of IR models Boolean Vector Space Probabilistic (including Language Models) To examine major application areas of IR including: Web Search Text categorization and clustering Cross language retrieval Text summarization Digital Libraries To understand how IR performance is measured: Recall/Precision Statistical significance Gain hands-on experience with IR systems IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

Introduction Goal of IR is to retrieve all and only the “relevant” documents in a collection for a particular user with a particular need for information Relevance is a central concept in IR theory How does an IR system work when the “collection” is all documents available on the Web? Web search engines have been stress-testing the traditional IR models (and inventing new ways of ranking) IS 240 – Spring 2011

Information Retrieval The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user’s information need Examples are: Internet search engines (Google, Yahoo! web search, etc.) Digital library catalogues (MELVYL, GLADYS) Some application areas within IR Cross language retrieval Speech/broadcast retrieval Text categorization Text summarization Structured Document Element retrieval (XML) Subject to objective testing and evaluation hundreds of queries millions of documents (the TREC set and conference) IS 240 – Spring 2011

Origins Communication theory revisited Problems with transmission of meaning Conduit metaphor vs. Toolmakers Paradigm Source Decoding Encoding Destination Message Channel Noise Storage Source Decoding (Retrieval/Reading) Encoding (writing/indexing) Destination Message IS 240 – Spring 2011

Human Communication Theory? Destination Noise Source Decoding Encoding Message Channel IS 240 – Spring 2011

Toolmakers’ Paradigm IS 240 – Spring 2011

Information Theory Better called “Technical Communication Theory” Communication may be over time and space Destination Noise Source Decoding Encoding Message Channel Storage Source Decoding (Retrieval/ Reading) Encoding (Writing/ Indexing) Destination Message IS 240 – Spring 2011

Structure of an IR System Interest profiles & Queries Documents & data Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language Storage Line Potentially Relevant Comparison/ Matching Store1: Profiles/ Search requests Store2: Document representations (Descriptive and Subject) Formulating query in terms of descriptors Storage of profiles Information Storage and Retrieval System Search Line Adapted from Soergel, p. 19 IS 240 – Spring 2011

Components of an IR System Documents Index Records and Document Surrogates Indexing Process Authoritative Indexing Rules severe information loss Query Specification Process User’s Information Need Retrieval Rules Retrieval Process Query List of Documents Relevant to User’s Information Need Fredric C. Gey 9 IS 240 – Spring 2011

Conceptual View of Routing Retrieval Detection Engine Document Stream IS 240 – Spring 2011

Conceptual View of Ad-Hoc Retrieval Q1 Q2 Q3 Qn Q. Q4 Collection Q. Q5 Q. Q6 Q. Q7 Q9 Q8 ‘Fixed’ collection size, can be instrumented IS 240 – Spring 2011

Review: Information Overload “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman) “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden) “So much has already been written about everything that you can’t find anything about it.” (James Thurber, 1961) IS 240 – Spring 2011

IR Topics from (old) 202 The Search Process Information Retrieval Models Boolean, Vector, and Probabilistic Content Analysis/Zipf Distributions Evaluation of IR Systems Precision/Recall Relevance User Studies Web-Specific Issues XML Retrieval Issues User Interface Issues Special Kinds of Search IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

The Standard Retrieval Interaction Model IS 240 – Spring 2011

Standard Model of IR Assumptions: The goal is maximizing precision and recall simultaneously The information need remains static The value is in the resulting document set IS 240 – Spring 2011

Problems with Standard Model Users learn during the search process: Scanning titles of retrieved documents Reading retrieved documents Viewing lists of related topics/thesaurus terms Navigating hyperlinks Some users don’t like long (apparently) disorganized lists of documents IS 240 – Spring 2011

IR is an Iterative Process Repositories Workspace Goals IS 240 – Spring 2011

IR is a Dialog The exchange doesn’t end with first answer Users can recognize elements of a useful answer, even when incomplete Questions and understanding changes as the process continues IS 240 – Spring 2011

Bates’ “Berry-Picking” Model Standard IR model Assumes the information need remains the same throughout the search process Berry-picking model Interesting information is scattered like berries among bushes The query is continually shifting IS 240 – Spring 2011

Berry-Picking Model Q2 Q4 Q3 Q1 Q5 Q0 A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89) Q2 Q4 Q3 Q1 Q5 Q0 IS 240 – Spring 2011

Berry-Picking Model (cont.) The query is continually shifting New information may yield new ideas and new directions The information need Is not satisfied by a single, final retrieved set Is satisfied by a series of selections and bits of information found along the way IS 240 – Spring 2011

Restricted Form of the IR Problem The system has available only pre-existing, “canned” text passages Its response is limited to selecting from these passages and presenting them to the user It must select, say, 10 or 20 passages out of millions or billions! IS 240 – Spring 2011

Information Retrieval Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries This set of assumptions underlies the field of Information Retrieval IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

IR History Overview Information Retrieval History Origins and Early “IR” Modern Roots in the scientific “Information Explosion” following WWII Non-Computer IR (mid 1950’s) Interest in computer-based IR from mid 1950’s Modern IR – Large-scale evaluations, Web-based search and Search Engines -- 1990’s IS 240 – Spring 2011

Origins Very early history of content representation Sumerian tokens and “envelopes” Alexandria - pinakes Indices IS 240 – Spring 2011

Origins Biblical Indexes and Concordances 1247 – Hugo de St. Caro – employed 500 Monks to create keyword concordance to the Bible Journal Indexes (Royal Society, 1600’s) “Information Explosion” following WWII Cranfield Studies of indexing languages and information retrieval IS 240 – Spring 2011

Visions of IR Systems Rev. John Wilkins, 1600’s : The Philosophical Language and tables Wilhelm Ostwald and Paul Otlet, 1910’s: The “monographic principle” and Universal Classification Emanuel Goldberg, 1920’s - 1940’s H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937) Vannevar Bush, “As we may think.” Atlantic Monthly, 1945. Term “Information Retrieval” coined by Calvin Mooers. 1952 IS 240 – Spring 2011

Card-Based IR Systems Uniterm (Casey, Perry, Berry, Kent: 1958) Developed and used from mid 1940’s) EXCURSION 43821 90 241 52 63 34 25 66 17 58 49 130 281 92 83 44 75 86 57 88 119 640 122 93 104 115 146 97 158 139 870 342 157 178 199 207 248 269 298 LUNAR 12457 110 181 12 73 44 15 46 7 28 39 430 241 42 113 74 85 76 17 78 79 820 761 602 233 134 95 136 37 118 109 901 982 194 165 127 198 179 377 288 407 IS 240 – Spring 2011

Card Systems Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948 Lunar Excursion IS 240 – Spring 2011

Card Systems Zatocode (edge-notched cards) Mooers, 1951 Document 34 Title: lksd ksdj sjd Lunar Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe Document 1 Title: lksd ksdj sjd sjsjfkl Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe Document 200 Title: Xksd Lunar sjd sjsjfkl Author: Jones, R. Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe IS 240 – Spring 2011

Computer-Based Systems Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms would take approximately 41,700 hours Due to the need to move and shift the text in core memory while carrying out the comparisons 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC IS 240 – Spring 2011

Development of Bibliographic Databases Chemical Abstracts Service first produced “Chemical Titles” by computer in 1961. Index Medicus from the National Library of Medicine soon followed with the creation of the MEDLARS database in 1961. By 1970 Most secondary publications (indexes, abstract journals, etc) were produced by machine IS 240 – Spring 2011

Boolean IR Systems Synthex at SDC, 1960 Project MAC at MIT, 1963 (interactive) BOLD at SDC, 1964 (Harold Borko) 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment) SDC began production for a commercial service in 1967 – ORBIT NASA-RECON (1966) becomes DIALOG 1972 Data Central/Mead introduced LEXIS – Full text of legal information Online catalogs – late 1970’s and 1980’s IS 240 – Spring 2011

Experimental IR systems Probabilistic indexing – Maron and Kuhns, 1960 SMART – Gerard Salton at Cornell – Vector space model, 1970’s SIRE at Syracuse I3R – Croft Cheshire I (1990) TREC – 1992 Inquery Cheshire II (1994) MG (1995?) Lemur (2000?) IS 240 – Spring 2011

The Internet and the WWW Gopher, Archie, Veronica, WAIS Tim Berners-Lee, 1991 creates WWW at CERN – originally hypertext only Web-crawler Lycos Alta Vista Inktomi Google (and many others) IS 240 – Spring 2011

Historical Milestones in IR Research 1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering (Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Roccio, Salton) 1972 Statistical Weighting (Sparck-Jones) 1975 2-Poisson Model (Harter, Bookstein, Swanson) 1976 Relevance Weighting (Robertson, Sparck-Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft) IS 240 – Spring 2011

Historical Milestones in IR Research (cont.) 1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton, Yu) 1985 Generalized Vector Space Model (Wong, Rhagavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.) 1990 Latent Semantic Indexing (Dumais, Deerwester) 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok) 1998 Language Models (Ponte, Croft) IS 240 – Spring 2011

Information Retrieval – Historical View Research Industry Boolean model, statistics of language (1950’s) Vector space model, probablistic indexing, relevance feedback (1960’s) Probabilistic querying (1970’s) Fuzzy set/logic, evidential reasoning (1980’s) Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s) DIALOG, Lexus-Nexus, STAIRS (Boolean based) Information industry (O($B)) Verity TOPIC (fuzzy logic) Internet search engines (O($100B?)) (vector space, probabilistic) IS 240 – Spring 2011

Research Sources in Information Retrieval ACM Transactions on Information Systems Am. Society for Information Science Journal Document Analysis and IR Proceedings (Las Vegas) Information Processing and Management (Pergammon) Journal of Documentation SIGIR Conference Proceedings TREC Conference Proceedings Much of this literature is now available online IS 240 – Spring 2011

Research Systems Software INQUERY (Croft) OKAPI (Robertson) PRISE (Harman) http://potomac.ncsl.nist.gov/prise SMART (Buckley) MG (Witten, Moffat) CHESHIRE (Larson) http://cheshire.berkeley.edu LEMUR toolkit Lucene Others IS 240 – Spring 2011

Lecture Overview Introduction to the Course (re)Introduction to Information Retrieval The Information Seeking Process Information Retrieval History and Developments Discussion Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey IS 240 – Spring 2011

Next Time Basic Concepts in IR Readings Joyce & Needham “The Thesaurus Approach to Information Retrieval” (in Readings book) Luhn “The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Texts” (in Readings) Doyle “Indexing and Abstracting by Association, Pt I” (in Readings) IS 240 – Spring 2011