2002.10.24 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

1 2002.10.24 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/ SIMS 202: Information Organization and Retrieval Lecture 16: Boolean Information Retrieval

2 2002.10.24 - SLIDE 2IS 202 – FALL 2002 Lecture Overview Review –Introduction to Information Retrieval –The Information Seeking Process –History of IR Research IR System Structure (revisited) Central Concepts in IR Boolean Logic Boolean IR Systems Credit for some of the slides in this lecture goes to Marti Hearst

4 2002.10.24 - SLIDE 4IS 202 – FALL 2002 IR Topics for 202 The Search Process Information Retrieval Models Content Analysis/Zipf Distributions Evaluation of IR Systems –Precision/Recall –Relevance –User Studies System and Implementation Issues Web-Specific Issues User Interface Issues Special Kinds of Search

5 2002.10.24 - SLIDE 5IS 202 – FALL 2002 IR is an Iterative Process Repositories Workspace Goals

6 2002.10.24 - SLIDE 6IS 202 – FALL 2002 Berry-Picking Model Q0 Q1 Q2 Q3 Q4 Q5 A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

7 2002.10.24 - SLIDE 7IS 202 – FALL 2002 Restricted Form of the IR Problem The system has available only pre-existing, “canned” text passages. Its response is limited to selecting from these passages and presenting them to the user. It must select, say, 10 or 20 passages out of millions or billions!

8 2002.10.24 - SLIDE 8IS 202 – FALL 2002 Information Retrieval Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries. This set of assumptions underlies the field of Information Retrieval.

9 2002.10.24 - SLIDE 9IS 202 – FALL 2002 Card-Based IR Systems Uniterm (Casey, Perry, Berry, Kent: 1958) –Developed and used from the mid-1940s [Two sample Uniterm cards; each lists document numbers posted in ten columns by final digit (0 to 9), read here row by row] EXCURSION 43821: 90 241 52 63 34 25 66 17 58 49 130 281 92 83 44 75 86 57 88 119 640 122 93 104 115 146 97 158 139 870 342 157 178 199 207 248 269 298. LUNAR 12457: 110 181 12 73 44 15 46 7 28 39 430 241 42 113 74 85 76 17 78 79 820 761 602 233 134 95 136 37 118 109 901 982 194 165 127 198 179 377 288 407.

10 2002.10.24 - SLIDE 10IS 202 – FALL 2002 Card Systems Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948 [Illustration; card labels: Lunar, Excursion]

11 2002.10.24 - SLIDE 11IS 202 – FALL 2002 Card Systems Zatocode (edge-notched cards) Mooers, 1951 Document 1 Title: lksd ksdj sjd sjsjfkl Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe Document 200 Title: Xksd Lunar sjd sjsjfkl Author: Jones, R. Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe Document 34 Title: lksd ksdj sjd Lunar Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

12 2002.10.24 - SLIDE 12IS 202 – FALL 2002 Computer-Based Systems Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms would take approximately 41,700 hours –Due to the need to move and shift the text in core memory while carrying out the comparisons 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC

13 2002.10.24 - SLIDE 13IS 202 – FALL 2002 Historical Milestones in IR Research 1958 Statistical Language Properties (Luhn); 1960 Probabilistic Indexing (Maron & Kuhns); 1961 Term association and clustering (Doyle); 1965 Vector Space Model (Salton); 1968 Query expansion (Rocchio, Salton); 1972 Statistical Weighting (Sparck Jones); 1975 2-Poisson Model (Harter, Bookstein, Swanson); 1976 Relevance Weighting (Robertson, Sparck Jones); 1980 Fuzzy sets (Bookstein); 1981 Probability without training (Croft)

14 2002.10.24 - SLIDE 14IS 202 – FALL 2002 Historical Milestones in IR Research (cont.) 1983 Linear Regression (Fox); 1983 Probabilistic Dependence (Salton, Yu); 1985 Generalized Vector Space Model (Wong, Raghavan); 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.); 1990 Latent Semantic Indexing (Dumais, Deerwester); 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr); 1992 TREC (Harman); 1992 Inference networks (Turtle, Croft); 1994 Neural networks (Kwok)

16 2002.10.24 - SLIDE 16IS 202 – FALL 2002 Structure of an IR System [Diagram: Information Storage and Retrieval System, adapted from Soergel, p. 19] Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests. Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations. Comparison/Matching of the two stores yields Potentially Relevant Documents. Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).

21 2002.10.24 - SLIDE 21IS 202 – FALL 2002 Central Concepts in IR Documents Queries Collections Evaluation Relevance

22 2002.10.24 - SLIDE 22IS 202 – FALL 2002 Documents What do we mean by a document? –Full document? –Document surrogates? –Pages? Buckland (JASIS, Sept. 1997) “What is a Document” Are IR systems better called Document Retrieval systems? A document is a representation of some aggregation of information, treated as a unit

23 2002.10.24 - SLIDE 23IS 202 – FALL 2002 Collection A collection is some physical or logical aggregation of documents –A database –A Library –An index? –Others?

24 2002.10.24 - SLIDE 24IS 202 – FALL 2002 Queries A query is some expression of a user’s information needs Can take many forms –Natural language description of need –Formal query in a query language Queries may not be accurate expressions of the information need –Differences between conversation with a person and formal query expression

25 2002.10.24 - SLIDE 25IS 202 – FALL 2002 Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

26 2002.10.24 - SLIDE 26IS 202 – FALL 2002 Why Evaluate? Determine if the system is desirable Make comparative assessments Others?

27 2002.10.24 - SLIDE 27IS 202 – FALL 2002 What to Evaluate? How much of the information need was satisfied How much was learned about a topic Incidental learning –How much was learned about the collection –How much was learned about other topics How inviting the system is

28 2002.10.24 - SLIDE 28IS 202 – FALL 2002 What to Evaluate? What can be measured that reflects users’ ability to use system? (Cleverdon 66) –Coverage of Information –Form of Presentation –Effort required/Ease of Use –Time and Space Efficiency –Recall: proportion of relevant material actually retrieved –Precision: proportion of retrieved material actually relevant (Recall and Precision are grouped on the slide as “effectiveness”)
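The two effectiveness measures on this slide can be sketched in a few lines. This is a minimal illustration, not part of the original lecture: the function name `precision_recall` and the document IDs are made up for the example, and relevance is assumed to be a simple yes/no judgment per document.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    # Precision: share of the retrieved material that is actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant material that was actually retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 3 relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).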

29 2002.10.24 - SLIDE 29IS 202 – FALL 2002 Relevance (introduction) In what ways can a document be relevant to a query? –Answer precise question precisely (Who is buried in Grant’s tomb? Grant) –Partially answer question (Where is Danville? Near Walnut Creek) –Suggest a source for more information (What is lymphedema? Look in this Medical Dictionary…) –Give background information –Remind the user of other knowledge –Others...

30 2002.10.24 - SLIDE 30IS 202 – FALL 2002 Relevance “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’ know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.” (Saracevic, 1975, p. 324)

31 2002.10.24 - SLIDE 31IS 202 – FALL 2002 Relevance How relevant is the document –for this user, for this information need. Subjective, but Measurable to some extent –How often do people agree a document is relevant to a query? How well does it answer the question? –Complete answer? Partial? –Background Information? –Hints for further exploration?

32 2002.10.24 - SLIDE 32IS 202 – FALL 2002 Relevance Research and Thought Review to 1975 by Saracevic Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990 Special Issue of JASIS on relevance (April 1994, 45(3))

33 2002.10.24 - SLIDE 33IS 202 – FALL 2002 Saracevic Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process –Systems view –Destinations view –Subject Literature view –Subject Knowledge view –Pertinence –Pragmatic view

34 2002.10.24 - SLIDE 34IS 202 – FALL 2002 Define your own relevance Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor Where… From Saracevic, 1975 and Schamber 1990

35 2002.10.24 - SLIDE 35IS 202 – FALL 2002 A. Gages Measure Degree Extent Judgement Estimate Appraisal Relation

36 2002.10.24 - SLIDE 36IS 202 – FALL 2002 B. Aspect Utility Matching Informativeness Satisfaction Appropriateness Usefulness Correspondence

37 2002.10.24 - SLIDE 37IS 202 – FALL 2002 C. Object judged Document Document representation Reference Textual form Information provided Fact Article

38 2002.10.24 - SLIDE 38IS 202 – FALL 2002 D. Frame of reference Question Question representation Research stage Information need Information used Point of view request

39 2002.10.24 - SLIDE 39IS 202 – FALL 2002 E. Assessor Requester Intermediary Expert User Person Judge Information specialist

40 2002.10.24 - SLIDE 40IS 202 – FALL 2002 Schamber, Eisenberg and Nilan “Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems.” Systems-oriented relevance: Topicality User-Oriented relevance Relevance as a multi-dimensional concept

41 2002.10.24 - SLIDE 41IS 202 – FALL 2002 Schamber, et al. Conclusions “Relevance is a multidimensional concept whose meaning is largely dependent on users’ perceptions of information and their own information need situations. Relevance is a dynamic concept that depends on users’ judgements of the quality of the relationship between information and information need at a certain point in time. Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user’s perspective.”

42 2002.10.24 - SLIDE 42IS 202 – FALL 2002 Froehlich Centrality and inadequacy of Topicality as the basis for relevance Suggestions for a synthesis of views

43 2002.10.24 - SLIDE 43IS 202 – FALL 2002 Janes’ View Topicality Pertinence Relevance Utility Satisfaction

45 2002.10.24 - SLIDE 45IS 202 – FALL 2002 Query Languages A way to express the question (information need) Types: –Boolean –Natural Language –Stylized Natural Language –Form-Based (GUI)

46 2002.10.24 - SLIDE 46IS 202 – FALL 2002 Simple query language: Boolean –Terms + Connectors (or operators) –terms: words, normalized (stemmed) words, phrases, thesaurus terms –connectors: AND, OR, NOT

47 2002.10.24 - SLIDE 47IS 202 – FALL 2002 Boolean Queries Cat; Cat OR Dog; Cat AND Dog; (Cat AND Dog); (Cat AND Dog) OR Collar; (Cat AND Dog) OR (Collar AND Leash); (Cat OR Dog) AND (Collar OR Leash)
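One way to read these queries is as set operations over an inverted index, where each term maps to the set of documents containing it. The sketch below assumes a tiny made-up index (the document numbers are invented for illustration); AND becomes set intersection and OR becomes set union.

```python
# Hypothetical inverted index: term -> set of document numbers containing it.
index = {
    "cat": {1, 2, 5},
    "dog": {2, 3, 5},
    "collar": {3, 5, 6},
    "leash": {4, 5},
}


def docs(term):
    """Return the posting set for a term (empty if the term is not indexed)."""
    return index.get(term, set())


cat_or_dog = docs("cat") | docs("dog")            # Cat OR Dog
cat_and_dog = docs("cat") & docs("dog")           # Cat AND Dog
q5 = (docs("cat") & docs("dog")) | docs("collar")  # (Cat AND Dog) OR Collar
q7 = (docs("cat") | docs("dog")) & (docs("collar") | docs("leash"))

print(sorted(cat_or_dog), sorted(cat_and_dog), sorted(q5), sorted(q7))
```

The parentheses matter: the last two queries use the same four terms but return different sets because the grouping differs.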

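48 2002.10.24 - SLIDE 48IS 202 – FALL 2002 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –Each of the following combinations works: [table with rows Cat, Dog, Collar, Leash and one column per combination, an x marking each term present; column alignment lost in extraction]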

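49 2002.10.24 - SLIDE 49IS 202 – FALL 2002 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –None of the following combinations work: [table with rows Cat, Dog, Collar, Leash and one column per combination, an x marking each term present; column alignment lost in extraction]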
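The working and failing combinations for (Cat OR Dog) AND (Collar OR Leash) can be enumerated mechanically. The sketch below is an illustration, not the slide's own table: it tries every one of the 16 possible presence/absence patterns for the four terms and sorts them into the two groups.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")


def matches(doc_terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for a set of present terms."""
    return (("cat" in doc_terms or "dog" in doc_terms)
            and ("collar" in doc_terms or "leash" in doc_terms))


# Every presence/absence pattern for the four terms (2**4 = 16 patterns).
combos = [
    frozenset(term for term, present in zip(TERMS, flags) if present)
    for flags in product([False, True], repeat=len(TERMS))
]
working = [c for c in combos if matches(c)]
failing = [c for c in combos if not matches(c)]

print(len(working), "combinations work;", len(failing), "do not")
```

A combination works only when it takes at least one term from each parenthesized group, so a document containing both Cat and Dog but neither Collar nor Leash still fails.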

50 2002.10.24 - SLIDE 50IS 202 – FALL 2002 Boolean Logic [Venn diagram of two sets, A and B]

51 2002.10.24 - SLIDE 51IS 202 – FALL 2002 Boolean Queries –Usually expressed as INFIX operators in IR: ((a AND b) OR (c AND b)) –NOT is a UNARY PREFIX operator: ((a AND b) OR (c AND (NOT b))) –AND and OR can be n-ary operators: (a AND b AND c AND d) –Some rules (De Morgan revisited): NOT(a) AND NOT(b) = NOT(a OR b); NOT(a) OR NOT(b) = NOT(a AND b); NOT(NOT(a)) = a
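The three rules on this slide can be checked exhaustively, since each variable has only two possible values. The helper name `holds_for_all_inputs` below is invented for this sketch; it simply tries every truth assignment.

```python
from itertools import product


def holds_for_all_inputs(law, arity):
    """Return True if `law` is true for every assignment of `arity` Booleans."""
    return all(law(*values) for values in product([False, True], repeat=arity))


# NOT(a) AND NOT(b) = NOT(a OR b)
de_morgan_and = holds_for_all_inputs(
    lambda a, b: ((not a) and (not b)) == (not (a or b)), 2)
# NOT(a) OR NOT(b) = NOT(a AND b)
de_morgan_or = holds_for_all_inputs(
    lambda a, b: ((not a) or (not b)) == (not (a and b)), 2)
# NOT(NOT(a)) = a
double_negation = holds_for_all_inputs(lambda a: (not (not a)) == a, 1)

print(de_morgan_and, de_morgan_or, double_negation)
```

For comparison, a non-identity such as `a AND b = a OR b` fails the same check, because it breaks when exactly one of the two values is true.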

52 2002.10.24 - SLIDE 52IS 202 – FALL 2002 Boolean Logic [Venn diagram: three terms t1, t2, t3 partition documents D1 to D11 into eight regions m1 to m8. Each region is a minterm, a conjunction of t1, t2, t3 in which each term is either present or negated. The negation bars and the assignment of documents to regions did not survive extraction.]

53 2002.10.24 - SLIDE 53IS 202 – FALL 2002 Boolean Searching “Measurement of the width of cracks in prestressed concrete beams” Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete [Venn diagram of four sets: Cracks, Beams, Width measurement, Prestressed concrete] Relaxed Query (C = cracks, B = beams, W = width measurement, P = prestressed concrete): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
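The relaxed query on this slide amounts to "any three of the four concepts". A minimal sketch of that reading follows; the function names and the sample documents are assumptions made for illustration, and the parameter `k` generalizes the slide's fixed value of three.

```python
from itertools import combinations

CONCEPTS = ("cracks", "beams", "width_measurement", "prestressed_concrete")


def formal_match(doc_terms):
    """Formal query: all four concepts joined by AND."""
    return all(concept in doc_terms for concept in CONCEPTS)


def relaxed_match(doc_terms, k=3):
    """Relaxed query: OR over every AND-group of k concepts."""
    return any(
        all(concept in doc_terms for concept in group)
        for group in combinations(CONCEPTS, k)
    )


# Hypothetical indexed documents.
d1 = {"cracks", "beams", "width_measurement", "prestressed_concrete"}
d2 = {"cracks", "beams", "prestressed_concrete"}
d3 = {"cracks", "beams"}

for name, doc in (("d1", d1), ("d2", d2), ("d3", d3)):
    print(name, formal_match(doc), relaxed_match(doc))
```

With four concepts and `k=3`, `combinations` yields exactly the four AND-groups written out on the slide.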

54 2002.10.24 - SLIDE 54IS 202 – FALL 2002 Pseudo-Boolean Queries A new notation, from web search –+cat dog +collar leash Does not mean the same thing! Need a way to group combinations. Phrases: –“stray cat” AND “frayed collar” –+“stray cat” +“frayed collar”
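To see why the plus notation differs from a grouped Boolean query, here is a rough sketch under one common interpretation, which is an assumption and not stated on the slide: a term prefixed with `+` is required, and an unprefixed term is optional and only raises the score. The function names are hypothetical, and quoted phrases are not handled.

```python
def parse_plus_query(query):
    """Split a '+term term' query into (required, optional) term sets."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])   # '+cat' -> required term 'cat'
        else:
            optional.add(token)       # bare term -> optional
    return required, optional


def score(doc_terms, required, optional):
    """Return None if a required term is missing, else count optional hits."""
    if not required <= doc_terms:
        return None
    return len(optional & doc_terms)


required, optional = parse_plus_query("+cat dog +collar leash")

# {dog, leash} satisfies (Cat OR Dog) AND (Collar OR Leash),
# but is rejected here because 'cat' and 'collar' are required.
print(score({"dog", "leash"}, required, optional))
print(score({"cat", "collar", "dog"}, required, optional))
```

Under this reading there is no way to say "cat or dog, but at least one of them", which is the grouping problem the slide points out.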

55 2002.10.24 - SLIDE 55IS 202 – FALL 2002 Another View of IR [Flow diagram; box labels: Information need, text input, Parse, Query, Rank, Collections, Pre-process, Index]

56 2002.10.24 - SLIDE 56IS 202 – FALL 2002 Result Sets Run a query, get a result set Two choices –Reformulate query, run on entire collection –Reformulate query, run on result set Example: Dialog query (Redford AND Newman) -> S1 1450 documents (S1 AND Sundance) ->S2 898 documents
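The second choice on this slide, running the reformulated query against an earlier result set, can be sketched with named sets in the style of the Dialog example. The index contents below are invented, so the set sizes do not match the slide's 1450 and 898 documents.

```python
# Hypothetical inverted index: term -> set of document numbers.
index = {
    "redford": {1, 2, 3, 4},
    "newman": {2, 3, 4, 5},
    "sundance": {3, 4, 6},
}

# (Redford AND Newman) -> S1
s1 = index["redford"] & index["newman"]
# (S1 AND Sundance) -> S2, computed against the saved result set only
s2 = s1 & index["sundance"]

# Running the full reformulated query on the entire collection gives the same set.
full_rerun = index["redford"] & index["newman"] & index["sundance"]

print(sorted(s1), sorted(s2), s2 == full_rerun)
```

For a pure AND refinement the two choices return the same documents; reusing S1 just avoids repeating the earlier work.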

57 [Flow diagram; box labels: Information need, text input, Parse, Query, Reformulated Query, Rank, Re-Rank, Collections, Pre-process, Index]

58 2002.10.24 - SLIDE 58IS 202 – FALL 2002 Feedback Queries [Flow diagram; box labels: Information need, text input, Parse, Query, Reformulated Query, Rank, Re-Rank, Collections, Pre-process, Index]

59 2002.10.24 - SLIDE 59IS 202 – FALL 2002 Ordering of Retrieved Documents Pure Boolean has no ordering In practice: –order chronologically –order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one of each term or many of one term? Fancier methods have been investigated –p-norm is most famous usually impractical to implement usually hard for user to understand
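Ordering by total number of hits can be sketched as follows. The documents and the helper `total_hits` are made up for the example, and the query is treated as an OR of its terms so that every document with at least one hit is retrieved.

```python
from collections import Counter

# Hypothetical documents as whitespace-separated text.
documents = {
    "d1": "cat cat cat cat",
    "d2": "cat dog",
    "d3": "dog bird",
}
query_terms = ["cat", "dog"]


def total_hits(text, terms):
    """Count all occurrences of the query terms in the text."""
    counts = Counter(text.split())
    return sum(counts[term] for term in terms)


hits = {doc_id: total_hits(text, query_terms) for doc_id, text in documents.items()}
# Keep documents with at least one hit; sort by hits descending, then by ID.
ranked = sorted((d for d in hits if hits[d] > 0), key=lambda d: (-hits[d], d))

print(ranked, hits)
```

This illustrates the slide's question: d1 contains only "cat" but ranks above d2, which contains one occurrence of each query term.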

60 2002.10.24 - SLIDE 60IS 202 – FALL 2002 Boolean Advantages –simple queries are easy to understand –relatively easy to implement Disadvantages –difficult to specify what is wanted –too much returned, or too little –ordering not well determined Dominant language in commercial systems until the WWW

61 2002.10.24 - SLIDE 61IS 202 – FALL 2002 Faceted Boolean Query Strategy: break query into facets (polysemous with earlier meaning of facets) –conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) –each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)

62 2002.10.24 - SLIDE 62IS 202 – FALL 2002 Faceted Boolean Query Query still fails if one facet missing Alternative: Coordination level ranking –Order results in terms of how many facets (disjuncts) are satisfied –Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: assign weights to facets
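Coordination level ranking can be sketched by counting how many facets a document satisfies, where a facet is satisfied if any one of its alternative terms is present. The facets below reuse the rain forest example from the lecture in lowercase; the documents and the function name `coordination_level` are assumptions for illustration.

```python
# Each facet is a set of alternative terms (a disjunction).
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents, each represented by its set of index terms.
documents = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"river"},
}


def coordination_level(doc_terms, facets):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in facets if facet & doc_terms)


levels = {d: coordination_level(terms, facets) for d, terms in documents.items()}

# Strict faceted Boolean query: every facet must be satisfied.
strict = [d for d in documents if levels[d] == len(facets)]
# Coordination level ranking: order by number of facets satisfied.
ranked = sorted((d for d in documents if levels[d] > 0),
                key=lambda d: (-levels[d], d))

print(strict, ranked)
```

The strict query returns only d1, while the ranking still surfaces d2 and d3 below it. Adding a per-facet weight to the sum would be one way to address the slide's point that facets are otherwise undifferentiated.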

63 2002.10.24 - SLIDE 63IS 202 – FALL 2002 Proximity Searches Proximity: terms occur within K positions of one another –pen w/5 paper A “Near” function can be more vague –near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations –“United Nations” “Bill Clinton” Phrase Variants –“retrieval of information” “information retrieval”
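A proximity test such as `pen w/5 paper` can be sketched over token positions. The function names here are hypothetical, and the exact counting rules of a `w/K` connector vary between systems; this sketch assumes "within K positions" means the position difference is at most K.

```python
def positions(tokens, term):
    """Return every position at which `term` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of `a` and `b` lie within k positions.

    With ordered=True, `a` must come before `b`.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            distance = j - i
            if ordered and 0 < distance <= k:
                return True
            if not ordered and 0 < abs(distance) <= k:
                return True
    return False


tokens = "she put the pen down on the paper".split()

print(within(tokens, "pen", "paper", 5))                 # pen w/5 paper
print(within(tokens, "pen", "paper", 3))                 # too far apart
print(within(tokens, "paper", "pen", 5, ordered=True))   # wrong order
```

In this sentence "pen" is at position 3 and "paper" at position 7, so they are four positions apart. A phrase match is the special case of an ordered search with `k=1`.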

64 2002.10.24 - SLIDE 64IS 202 – FALL 2002 Filters Filters: Reduce set of candidate docs Often specified simultaneously with query Usually restrictions on metadata –restrict by: date range, internet domain (.edu, .com, .berkeley.edu), author, size –limit number of documents returned
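Filters of this kind can be sketched as metadata checks applied to the candidate set. The `DocMeta` record, its field names, and the sample values are all assumptions for illustration; a real system would take these from its own document metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocMeta:
    doc_id: str
    published: date
    host: str
    author: str


def apply_filters(candidates, start=None, end=None, domain=None,
                  author=None, limit=None):
    """Keep candidates passing every supplied restriction, up to `limit`."""
    kept = []
    for doc in candidates:
        if start is not None and doc.published < start:
            continue  # before the date range
        if end is not None and doc.published > end:
            continue  # after the date range
        if domain is not None and not doc.host.endswith(domain):
            continue  # outside the internet domain
        if author is not None and doc.author != author:
            continue  # different author
        kept.append(doc)
    return kept if limit is None else kept[:limit]


# Hypothetical candidate documents.
candidates = [
    DocMeta("d1", date(2001, 5, 1), "www.sims.berkeley.edu", "Larson"),
    DocMeta("d2", date(1998, 3, 9), "www.example.com", "Smith"),
    DocMeta("d3", date(2002, 1, 15), "www.berkeley.edu", "Davis"),
]

recent_berkeley = apply_filters(candidates, start=date(2000, 1, 1),
                                domain=".berkeley.edu")
print([doc.doc_id for doc in recent_berkeley])
```

Because the filters only look at metadata, they can be applied before or after the Boolean match without changing which documents survive.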

66 2002.10.24 - SLIDE 66IS 202 – FALL 2002 Boolean Systems Most of the commercial database search systems that pre-date the WWW are based on Boolean search –Dialog, Lexis-Nexis, etc. Most Online Library Catalogs are Boolean systems –E.g. MELVYL Database systems use Boolean logic for searching Many of the search engines sold for intranet search of web sites are Boolean

67 2002.10.24 - SLIDE 67IS 202 – FALL 2002 Why Boolean? Easy to implement Efficient searching across very large databases Easy to explain results –“Has to have all of the words…”

68 2002.10.24 - SLIDE 68IS 202 – FALL 2002 Assignment 8 Using Lexis-Nexis

69 2002.10.24 - SLIDE 69IS 202 – FALL 2002 Next Time Statistical Properties of Text Building access to IR systems -- indexes
