9/4/2001Information Organization and Retrieval Introduction to Information Retrieval University of California, Berkeley School of Information Management.

Presentation transcript:

9/4/2001 Information Organization and Retrieval
Introduction to Information Retrieval
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson

Review: Information Overload
“The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)
“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.
To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley

Information Life Cycle
[Figure: life-cycle diagram. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard.]
Note: This version of the life cycle is based on the report of a conference on the Social Aspects of Digital Libraries held at UCLA. - C. Borgman, PI

Authoring/Modifying
Converting Data+Information+Knowledge to New Information.
Creating information from observation, thought.
Editing and Publication.
Gatekeeping

Organizing/Indexing
Collecting and Integrating information.
Affects Data, Information and Metadata.
“Metadata” Describes data and information.
  – More on this later.
Organizing Information.
  – Types of organization?
Indexing

Storing/Retrieving
Information Storage
  – How and where is information stored?
Retrieving Information.
  – How is information recovered from storage?
  – How to find needed information
  – Linked with Accessing/Filtering stage

Distribution/Networking
Transmission of information
  – How is information transmitted?
Networks vs Broadcast.

Accessing/Filtering
Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)

Using/Creating
Using Information.
Transformation of Information to Knowledge.
Knowledge to New Data and New Information.

Key issues in this course
How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs.
  – Retrieving
How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them.
  – Organizing

Key Issues
[Figure: the Information Life Cycle diagram, repeated]

This Week
Introduction to IR
  – Modern IR textbook topics
The Information Seeking Process

Textbook Topics

More Detailed View

What We’ll Cover
[Figure legend: A Lot, A Little]

Search and Retrieval
Outline of Part I of SIMS 202
The Search Process
Information Retrieval Models
Content Analysis/Zipf Distributions
Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
System and Implementation Issues
Web-Specific Issues
User Interface Issues
Special Kinds of Search

What is an Information Need?

The Standard Retrieval Interaction Model

Standard Model
Assumptions:
  – Maximizing precision and recall simultaneously
  – The information need remains static
  – The value is in the resulting document set
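
The first assumption refers to precision and recall, which the outline lists under evaluation. As a reminder only, here is a minimal Python sketch using the usual definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / all relevant). The function name and the document ids are invented for illustration and are not from the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three of them relevant,
# out of six relevant documents in the whole collection.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(p, r)
```

Returning more documents tends to raise recall and lower precision, which is why maximizing both at once is an assumption worth questioning.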

Problem with Standard Model:
Users learn during the search process:
  – Scanning titles of retrieved documents
  – Reading retrieved documents
  – Viewing lists of related topics/thesaurus terms
  – Navigating hyperlinks
Some users don’t like long disorganized lists of documents

IR is an Iterative Process
[Figure labels: Repositories, Workspace, Goals]

IR is a Dialog
  – The exchange doesn’t end with the first answer
  – User can recognize elements of a useful answer
  – Questions and understanding change as the process continues.

“Berry-Picking” as an Information Seeking Strategy (Bates 90)
Standard IR model
  – assumes the information need remains the same throughout the search process
Berry-picking model
  – interesting information is scattered like berries among bushes
  – the query is continually shifting

A sketch of a searcher…
“moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
[Figure labels: successive queries Q0, Q1, Q2, Q3, Q4, Q5]

Berry-picking model (cont.)
The query is continually shifting
New information may yield new ideas and new directions
The information need
  – is not satisfied by a single, final retrieved set
  – is satisfied by a series of selections and bits of information found along the way.


Information Seeking Behavior
Two parts of a process:
  – search and retrieval
  – analysis and synthesis of search results
This is a fuzzy area; we will look at several different working theories.

Search Tactics and Strategies
Search Tactics
  – Bates 79
Search Strategies
  – Bates 89
  – O’Day and Jeffries 93

Tactics vs. Strategies
Tactic: short term goals and maneuvers
  – operators, actions
Strategy: overall planning
  – link a sequence of operators together to achieve some end

Information Search Tactics (after Bates 79)
Monitoring tactics
  – keep search on track
Source-level tactics
  – navigate to and within sources
Term and Search Formulation tactics
  – designing search formulation
  – selection and revision of specific terms within search formulation

Term Tactics
Move around the thesaurus
  – superordinate, subordinate, coordinate
  – neighbor (semantic or alphabetic)
  – trace: pull out terms from information already seen as part of search (titles, etc.)
  – morphological and other spelling variants
  – antonyms (contrary)

Source-level Tactics
“Bibble”:
  – look for a pre-defined result set
  – e.g., a good link page on web
Survey:
  – look ahead, review available options
  – e.g., don’t simply use the first term or first source that comes to mind
Cut:
  – eliminate large proportion of search domain
  – e.g., search on rarest term first

Source-level Tactics (cont.)
Stretch
  – use source in unintended way
  – e.g., use patents to find addresses
Scaffold
  – take an indirect route to goal
  – e.g., when looking for references to obscure poet, look up contemporaries
Cleave
  – binary search in an ordered file
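
The Cleave tactic is described as binary search in an ordered file. A minimal Python sketch of that idea, using the standard library's `bisect` module on a sorted list; the function name `cleave` and the list of names are hypothetical, chosen only to mirror the slide.

```python
from bisect import bisect_left


def cleave(sorted_entries, target):
    """Binary search: return the index of target in a sorted list, or -1 if absent."""
    i = bisect_left(sorted_entries, target)  # leftmost position where target could go
    if i < len(sorted_entries) and sorted_entries[i] == target:
        return i
    return -1


# Hypothetical ordered file: an alphabetized list of names.
names = ["Auden", "Bates", "Hearst", "Larson", "Luhn", "Salton"]
print(cleave(names, "Luhn"))
print(cleave(names, "Maron"))
```

Each probe halves the remaining range, which is the same habit a person uses when opening a printed index near the middle and narrowing from there.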

Monitoring Tactics (strategy-level)
Check
  – compare original goal with current state
Weigh
  – make a cost/benefit analysis of current or anticipated actions
Pattern
  – recognize common strategies
Correct Errors
Record
  – keep track of (incomplete) paths

Additional Considerations (Bates 79)
Add a Sort tactic!
More detail is needed about short-term cost/benefit decision rule strategies
When to stop?
  – How to judge when enough information has been gathered?
  – How to decide when to give up an unsuccessful search?
  – When to stop searching in one source and move to another?

Implications
Interfaces should make it easy to store intermediate results
Interfaces should make it easy to follow trails with unanticipated results
Makes evaluation more difficult.

Later in the course:
  – More on Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems

Restricted Form of the IR Problem
The system has available only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and presenting them to the user.
It must select, say, 10 or 20 passages out of millions or billions!

Information Retrieval
Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.
This set of assumptions underlies the field of Information Retrieval.

Some IR History
  – Roots in the scientific “Information Explosion” following WWII
  – Interest in computer-based IR from mid 1950’s
      H.P. Luhn at IBM (1958)
      Probabilistic models at Rand (Maron & Kuhns) (1960)
      Boolean system development at Lockheed (‘60s)
      Vector Space Model (Salton at Cornell 1965)
      Statistical Weighting methods and theoretical advances (‘70s)
      Refinements and Advances in application (‘80s)
      User Interfaces, Large-scale testing and application (‘90s)

Structure of an IR System
[Figure: Information Storage and Retrieval System. Adapted from Soergel, p. 19.
Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests.
Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations.
Comparison/Matching; Potentially Relevant Documents.
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]


Relevance (introduction)
In what ways can a document be relevant to a query?
  – Answer precise question precisely.
      Who is buried in Grant’s Tomb? Grant.
  – Partially answer question.
      Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
      What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others...

Query Languages
A way to express the question (information need)
Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)

Simple query language: Boolean
  – Terms + Connectors (or operators)
  – terms
      words
      normalized (stemmed) words
      phrases
      thesaurus terms
  – connectors
      AND
      OR
      NOT

Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)

Boolean Queries
(Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
[Table: rows Cat, Dog, Collar, Leash; each column marks with x the terms present in one satisfying combination. Column positions are not recoverable.]

Boolean Queries
(Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
[Table: rows Cat, Dog, Collar, Leash; each column marks with x the terms present in one failing combination. Column positions are not recoverable.]
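
The two tables above can be regenerated by brute force. The following Python sketch enumerates every non-empty combination of the four terms and sorts each into "works" or "fails" for the query (Cat OR Dog) AND (Collar OR Leash). The function names are illustrative, not from the lecture.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")


def matches(present):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for a set of terms in a document."""
    return (("Cat" in present or "Dog" in present)
            and ("Collar" in present or "Leash" in present))


def split_combinations():
    """Partition all non-empty term combinations into satisfying and failing lists."""
    works, fails = [], []
    for flags in product([False, True], repeat=len(TERMS)):
        present = {term for term, flag in zip(TERMS, flags) if flag}
        if not present:
            continue  # a document with none of the terms is not interesting here
        (works if matches(present) else fails).append(present)
    return works, fails


works, fails = split_combinations()
print(len(works), "combinations work;", len(fails), "do not")
```

A combination fails exactly when it draws all of its terms from only one side of the AND, for example Cat and Dog together with neither Collar nor Leash.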

Boolean Logic
[Figure: diagram of sets A and B]

Boolean Queries
  – Usually expressed as INFIX operators in IR
      ((a AND b) OR (c AND b))
  – NOT is UNARY PREFIX operator
      ((a AND b) OR (c AND (NOT b)))
  – AND and OR can be n-ary operators
      (a AND b AND c AND d)
  – Some rules (De Morgan revisited)
      NOT(a) AND NOT(b) = NOT(a OR b)
      NOT(a) OR NOT(b) = NOT(a AND b)
      NOT(NOT(a)) = a
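
The three rules can be checked exhaustively, since each variable takes only two values. A small Python sketch (the function name is illustrative) that tries every truth assignment:

```python
from itertools import product


def check_rules():
    """Return True if the three identities hold for every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):   # NOT(a) AND NOT(b) = NOT(a OR b)
            return False
        if ((not a) or (not b)) != (not (a and b)):   # NOT(a) OR NOT(b) = NOT(a AND b)
            return False
        if (not (not a)) != a:                        # NOT(NOT(a)) = a
            return False
    return True


print(check_rules())
```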

Boolean Logic
[Figure: diagram of three terms t1, t2, t3 containing documents D1 to D11 and divided into eight regions m1 to m8. Each region is labeled as a combination of t1, t2, t3; the negation marks that distinguish the eight combinations are not recoverable.]

Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure labels: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query (C = Cracks, B = Beams, W = Width measurement, P = Prestressed concrete):
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
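
The relaxed query is the OR of every three-concept AND, that is, "at least three of the four concepts". A minimal Python sketch of both forms, assuming a document is represented as a set of concept labels; the lowercase labels and function names are illustrative.

```python
from itertools import combinations

CONCEPTS = ("cracks", "beams", "width_measurement", "prestressed_concrete")


def formal_match(doc_terms):
    """Formal query: all four concepts must be present."""
    return all(c in doc_terms for c in CONCEPTS)


def relaxed_match(doc_terms, k=3):
    """Relaxed query: some k-subset of the concepts is fully present (an OR of ANDs)."""
    return any(all(c in doc_terms for c in combo)
               for combo in combinations(CONCEPTS, k))


doc = {"cracks", "beams", "prestressed_concrete"}  # hypothetical document
print(formal_match(doc), relaxed_match(doc))
```

With k=3 the generator produces exactly the four conjunctions written out in the relaxed query.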

Pseudo-Boolean Queries
A new notation, from web search
  – +cat dog +collar leash
Does not mean the same thing!
Need a way to group combinations.
Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
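
To see why the plus notation does not mean the same thing as the grouped Boolean query, here is a Python sketch under one common reading, which is an assumption: a term prefixed with + is required, and unprefixed terms are optional (they might influence ranking but not matching). Actual engines differed, and this sketch ignores quoted phrases.

```python
def parse_plus_query(query):
    """Split a '+term term' query into (required, optional) lowercase term lists."""
    required, optional = [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, optional


def plus_match(query, doc_terms):
    """Assumed semantics: a document matches if it contains every required term."""
    required, _optional = parse_plus_query(query)
    return all(term in doc_terms for term in required)


def grouped_boolean(doc_terms):
    """(cat OR dog) AND (collar OR leash), for comparison."""
    return (("cat" in doc_terms or "dog" in doc_terms)
            and ("collar" in doc_terms or "leash" in doc_terms))


doc = {"dog", "leash"}  # hypothetical document
print(plus_match("+cat dog +collar leash", doc), grouped_boolean(doc))
```

A document about a dog and a leash satisfies the grouped Boolean query but not the plus query, because the plus notation has no way to say "either of these two".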

[Figure labels: Information need, Index, Pre-process, Parse, Collections, Rank, Query, text input]

Result Sets
Run a query, get a result set
Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
Example: Dialog query
  (Redford AND Newman) -> S documents
  (S1 AND Sundance) -> S2 898 documents
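
The second choice, running the reformulated query against an earlier result set, can be sketched with a toy inverted index in Python. This is not how Dialog was implemented; the four documents, the function names, and the `within` parameter are all invented for illustration.

```python
def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def and_query(index, terms, within=None):
    """AND the terms together; if `within` is given, restrict to that earlier result set."""
    postings = [index.get(term.lower(), set()) for term in terms]
    result = set.intersection(*postings) if postings else set()
    return result if within is None else result & within


# Hypothetical four-document collection.
DOCS = {
    1: "redford newman sundance",
    2: "redford newman",
    3: "redford sundance",
    4: "newman",
}
INDEX = build_index(DOCS)
s1 = and_query(INDEX, ["Redford", "Newman"])       # first query, whole collection
s2 = and_query(INDEX, ["Sundance"], within=s1)     # refinement, run on S1 only
print(sorted(s1), sorted(s2))
```

Document 3 mentions Sundance but is excluded from S2 because it was never in S1.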

[Figure labels: Information need, Index, Pre-process, Parse, Collections, Rank, Query, text input, Reformulated Query, Re-Rank]

Ordering of Retrieved Documents
Pure Boolean has no ordering
In practice:
  – order chronologically
  – order by total number of “hits” on query terms
      What if one term has more hits than others?
      Is it better to have one of each term or many of one term?
Fancier methods have been investigated
  – p-norm is most famous
      usually impractical to implement
      usually hard for user to understand
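
The question about one of each term versus many of one term is easy to make concrete. This Python sketch scores two invented documents in both ways: by total hits, and by the number of distinct query terms hit. The function names are illustrative, and query terms are assumed to be lowercase.

```python
from collections import Counter


def total_hits(doc_text, query_terms):
    """Total occurrences of all query terms in the document."""
    counts = Counter(doc_text.lower().split())
    return sum(counts[term] for term in query_terms)


def distinct_terms_hit(doc_text, query_terms):
    """Number of different query terms that occur at least once."""
    words = set(doc_text.lower().split())
    return sum(1 for term in query_terms if term in words)


query = ["cat", "dog"]
doc_a = "cat cat cat cat"   # many hits on one term
doc_b = "cat dog"           # one hit on each term
print(total_hits(doc_a, query), distinct_terms_hit(doc_a, query))
print(total_hits(doc_b, query), distinct_terms_hit(doc_b, query))
```

Ordering by total hits puts document A first; ordering by distinct terms puts document B first. Neither is obviously right, which is the point of the slide's question.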

Boolean
Advantages
  – simple queries are easy to understand
  – relatively easy to implement
Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
Dominant language in commercial systems until the WWW

Faceted Boolean Query
Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
  – each facet expresses a topic
      (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)

Faceted Boolean Query
Query still fails if one facet missing
Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match
Problem: Facets still undifferentiated
Alternative: assign weights to facets
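
Coordination level ranking can be sketched in a few lines of Python. Each facet is a set of alternative terms, a document's score is the number of facets it satisfies, and results are ordered by that score. The facets follow the rain forest example; the documents, function names, and the tie-break by document id are assumptions made for illustration.

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]


def coordination_level(doc_terms, facets):
    """Count the facets for which the document contains at least one alternative."""
    return sum(1 for facet in facets if any(term in doc_terms for term in facet))


def quorum_rank(docs, facets):
    """Return ids of documents satisfying at least one facet, best score first."""
    scored = [(coordination_level(terms, facets), doc_id)
              for doc_id, terms in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))  # ties: by id
    return [doc_id for score, doc_id in ranked if score > 0]


# Hypothetical documents, each represented as a set of index terms.
DOCS = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}
print(quorum_rank(DOCS, FACETS))
```

A strict faceted Boolean query would return only d1. Weighting the facets, the last alternative mentioned, would replace the count with a weighted sum.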

Proximity Searches
Proximity: terms occur within K positions of one another
  – pen w/5 paper
A “Near” function can be more vague
  – near(pen, paper)
Sometimes order can be specified
Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
Phrase Variants
  – “retrieval of information” “information retrieval”
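
A proximity operator such as pen w/5 paper can be sketched in Python by comparing token positions. The assumption here is that "within K positions" means the two token indices differ by at most K; real systems count positions in slightly different ways. The function name and sample sentences are invented.

```python
def within_k(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and some occurrence of b are at most k positions apart.

    With ordered=True, a must come before b.
    """
    positions_a = [i for i, tok in enumerate(tokens) if tok == a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == b]
    for i in positions_a:
        for j in positions_b:
            distance = j - i
            if ordered and distance <= 0:
                continue  # b does not follow a
            if 0 < abs(distance) <= k:
                return True
    return False


near_text = "put pen to paper".split()
far_text = "the pen lay on a sheet of lined paper".split()
print(within_k(near_text, "pen", "paper", 5))
print(within_k(far_text, "pen", "paper", 5))
```

The `ordered` flag corresponds to the remark that order can sometimes be specified.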

Filters
Filters: Reduce set of candidate docs
Often specified simultaneously with query
Usually restrictions on metadata
  – restrict by:
      date range
      internet domain (.edu, .com, .berkeley.edu)
      author
      size
      limit number of documents returned
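
Metadata filters of this kind can be sketched as a function that drops candidates failing any restriction. In the Python below, the `Doc` record, its field names, and the sample documents are all hypothetical; the restrictions mirror the list above (date range, internet domain, author, size, and a cap on the number returned).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    host: str
    author: str
    published: date
    size_bytes: int


def in_domain(host, domain):
    """True if host equals the domain or is a subdomain of it ('.edu' matches 'x.edu')."""
    suffix = domain.lstrip(".").lower()
    host = host.lower()
    return host == suffix or host.endswith("." + suffix)


def apply_filters(docs, domain=None, author=None, start=None, end=None,
                  max_size=None, limit=None):
    """Keep documents that pass every restriction that was supplied."""
    kept = []
    for d in docs:
        if domain is not None and not in_domain(d.host, domain):
            continue
        if author is not None and d.author != author:
            continue
        if start is not None and d.published < start:
            continue
        if end is not None and d.published > end:
            continue
        if max_size is not None and d.size_bytes > max_size:
            continue
        kept.append(d)
    return kept if limit is None else kept[:limit]


DOCS = [
    Doc("d1", "www.berkeley.edu", "Hearst", date(2000, 9, 7), 12_000),
    Doc("d2", "example.com", "Larson", date(2001, 9, 4), 48_000),
    Doc("d3", "sims.berkeley.edu", "Larson", date(2001, 10, 23), 5_000),
]
print([d.doc_id for d in apply_filters(DOCS, domain=".berkeley.edu", author="Larson")])
```

Because filters only shrink the candidate set, they can be applied before or after the query terms are matched without changing which documents come back, as long as any limit is applied last.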

Next
Statistical Properties of Text
Preparing information for search: Lexical analysis
Introduction to the Vector Space model of IR.