The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik.


1 The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik

2 Agenda Questions General model for detection The “bag of words” representation Boolean “free text” retrieval Proximity operators Controlled vocabulary retrieval Automating controlled vocabulary Retrieval versus filtering

3 But First... Rate the textbook reading: –Was it easy to understand? –How long did it take you to read?

4 Retrieval System Model [Diagram: stages Source Selection, Query Formulation, Search (IR System), Selection, Examination, Delivery; items passed between stages: Query, Ranked List, Document; feedback loops: Query Reformulation and Relevance Feedback, Source Reselection; stage labels: Nominate, Choose, Predict]

5 Search Goal Choose the same documents a human would –Without human intervention (less work) –Faster than a human could (less time) –As accurately as possible (less accuracy) Humans start with an information need –Machines start with a query Humans match documents to information needs –Machines match document & query representations

6 Search Component Model [Diagram: Information Need → Query Formulation → Query → Representation Function → Query Representation (query processing); Document → Representation Function → Document Representation (document processing); Query Representation and Document Representation → Comparison Function → Retrieval Status Value; Human Judgment → Utility]

7 Detection Component Model “Retrieval status value” is an estimate of utility –Utility: what the user would pay for the document A co-design problem –Document representation function –Query representation function –Comparison function Boolean “free text” retrieval is one way of allocating functionality to each function

8 “Bag of Words” Representation Bag = multiset: keeps track of members and counts The quick brown fox jumped over the lazy dog’s back → {back, brown, dog, fox, jumped, lazy, over, quick, ’s, the, the} A “term” is any lexical item that you choose –A fixed-length sequence of characters (an “n-gram”) –A word (delimited by “white space” or punctuation) –“Root form” of each word (destroyed → destroy) –“Stem” of each word (destroyed → destr) –A phrase (e.g., phrases listed in a dictionary) Counts can be recorded in any consistent order
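The multiset idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name bag_of_words and the tokenizing regular expression are assumptions, chosen so that the possessive 's becomes its own term as in the slide's example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (term -> count) of the terms in text.

    Assumption: terms are lowercased words, and the possessive 's is
    split off as a separate term, as in the slide's example.
    """
    normalized = text.lower().replace("’", "'")
    tokens = re.findall(r"'s|[a-z]+", normalized)
    return Counter(tokens)

bag = bag_of_words("The quick brown fox jumped over the lazy dog's back")
# A bag keeps duplicates, so "the" appears twice when the members are listed.
print(sorted(bag.elements()))
print(bag["the"])
```

A Counter records members and their counts without regard to word order, which is exactly what the bag representation discards.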

9 Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Terms: the quick brown fox over lazy dog back now is time for all good men to come jump aid of their party ’s
[Table: each term is either an Indexed Term, with a 0/1 entry for Document 1 and Document 2, or an entry on the Stopword List; cell values: 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1]

10 Boolean “Free Text” Retrieval Limit the bag of words to “absent” and “present” –“Boolean” values, represented as 0 and 1 Represent terms as a “bag of documents” –Same representation, but rows rather than columns Combine the rows using “Boolean operators” –AND, OR, NOT Any document with a 1 remaining is “detected”

11 Boolean Operators
A  B  A OR B  A AND B  A NOT B  NOT B
0  0    0        0        0       1
0  1    1        0        0       0
1  0    1        0        1       1
1  1    1        1        0       0
A NOT B = A AND NOT B
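Combining term rows with Boolean operators can be sketched in Python, assuming each term's row is stored as the set of documents that contain it. The two documents are the ones from the bag-of-words example; the stopword list, the crude tokenizer, and the helper names are assumptions for illustration, and no stemming is applied (so the indexed term is jumped rather than jump).

```python
import re

DOCS = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
STOPWORDS = {"the", "is", "for", "to", "of", "s"}  # assumed stopword list

def build_index(docs):
    """Map each indexed term to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOPWORDS:
                index.setdefault(term, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_DOCS = set(DOCS)

def row(term):
    """Return the "bag of documents" for a term (empty if not indexed)."""
    return INDEX.get(term, set())

print(row("fox") & row("dog"))    # fox AND dog
print(row("fox") | row("party"))  # fox OR party
print(row("fox") - row("quick"))  # fox NOT quick (= fox AND NOT quick)
print(ALL_DOCS - row("fox"))      # NOT fox
```

Set intersection, union, and difference play the roles of AND, OR, and NOT; any document left in the resulting set is "detected".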

12 Boolean Free Text Example
Terms: quick brown fox over lazy dog back now time all good men come jump aid their party
[Table: term-by-document incidence matrix for Doc 1 through Doc 8; cell values:]
Doc 1, Doc 2: 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1
Doc 3, Doc 4: 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1
Doc 5, Doc 6: 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1
Doc 7, Doc 8: 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0
dog AND fox –Doc 3, Doc 5
dog NOT fox –Empty
fox NOT dog –Doc 7
dog OR fox –Doc 3, Doc 5, Doc 7
good AND party –Doc 6, Doc 8
good AND party NOT over –Doc 6

13 Why Boolean Retrieval Works Boolean operators approximate natural language –Find documents about a good party that is not over AND can discover relationships between concepts –good party OR can discover alternate terminology –excellent party NOT can discover alternate meanings –Democratic party

14 The Perfect Query Paradox Every information need has a perfect doc set –If not, there would be no sense doing retrieval Almost every document set has a perfect query –AND every word to get a query for document 1 –Repeat for each document in the set –OR every document query to get the set query But users find Boolean query formulation hard –They get too much, too little, useless stuff, …
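The construction on this slide (AND every word of a document, then OR the per-document queries) can be sketched as follows. The three-document collection and all function names are hypothetical, made up only to show the mechanics.

```python
def doc_query(terms):
    """AND together every term of one document."""
    return " AND ".join(sorted(terms))

def perfect_query(target_docs):
    """OR together the per-document conjunctions."""
    return " OR ".join("(" + doc_query(t) + ")" for t in target_docs.values())

def matches(collection, target_docs):
    """Evaluate the constructed query: a document is detected if it
    contains every term of at least one target document."""
    return {
        doc_id
        for doc_id, terms in collection.items()
        if any(target <= terms for target in target_docs.values())
    }

# Hypothetical collection: document id -> set of indexed terms.
collection = {
    1: {"quick", "brown", "fox"},
    2: {"good", "party"},
    3: {"lazy", "dog"},
}
wanted = {doc_id: collection[doc_id] for doc_id in (1, 3)}

print(perfect_query(wanted))
print(matches(collection, wanted))
```

The "almost" in "almost every document set" shows up in matches: if some unwanted document contained every term of a wanted one, it would be detected as well, and the constructed query would no longer be perfect.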

15 Why Boolean Retrieval Fails Natural language is way more complex –She saw the man on the hill with a telescope AND “discovers” nonexistent relationships –Terms in different paragraphs, chapters, … Guessing terminology for OR is hard –good, nice, excellent, outstanding, awesome, … Guessing terms to exclude is even harder! –Democratic party, party to a lawsuit, …

16 Proximity Operators More precise versions of AND –“NEAR n” allows at most n-1 intervening terms –“WITH” requires terms to be adjacent and in order Easy to implement, but less efficient –Store a list of positions for each word in each doc Stopwords become very important! –Perform normal Boolean computations Treat WITH and NEAR like AND with an extra constraint

17 Proximity Operator Example
time AND come –Doc 2
time (NEAR 2) come –Empty
quick (NEAR 2) fox –Doc 1
quick WITH fox –Empty
Term    Doc 1    Doc 2
quick   1 (2)    0
brown   1 (3)    0
fox     1 (4)    0
over    1 (6)    0
lazy    1 (8)    0
dog     1 (9)    0
back    1 (10)   0
now     0        1 (1)
time    0        1 (4)
all     0        1 (6)
good    0        1 (7)
men     0        1 (8)
come    0        1 (10)
jump    1 (5)    0
aid     0        1 (13)
their   0        1 (15)
party   0        1 (16)
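A positional index makes these operators straightforward to sketch in Python. Several choices here are assumptions rather than anything the slides specify: positions are 1-based and count every word including stopwords, the possessive 's is dropped, NEAR is treated as order-independent, and all helper names are illustrative.

```python
import re

DOCS = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}

def positions(text):
    """Map each word to its list of 1-based positions in the text."""
    index = {}
    # Drop the possessive 's so that "dog's" occupies a single position.
    words = re.findall(r"[a-z]+", text.lower().replace("'s", ""))
    for pos, word in enumerate(words, start=1):
        index.setdefault(word, []).append(pos)
    return index

def near(index, a, b, n):
    """NEAR n: a and b occur with at most n-1 intervening terms."""
    return any(abs(pa - pb) <= n
               for pa in index.get(a, ())
               for pb in index.get(b, ()))

def with_(index, a, b):
    """WITH: b immediately follows a."""
    return any(pa + 1 in index.get(b, ()) for pa in index.get(a, ()))

POS = {doc_id: positions(text) for doc_id, text in DOCS.items()}

def search(predicate):
    """Return the documents whose positional index satisfies predicate."""
    return {doc_id for doc_id, index in POS.items() if predicate(index)}

print(search(lambda ix: "time" in ix and "come" in ix))  # time AND come
print(search(lambda ix: near(ix, "time", "come", 2)))    # time (NEAR 2) come
print(search(lambda ix: near(ix, "quick", "fox", 2)))    # quick (NEAR 2) fox
print(search(lambda ix: with_(ix, "quick", "fox")))      # quick WITH fox
```

Because distances are measured in word positions, removing stopwords before numbering would change which pairs count as near, which is one reading of why stopwords become important for proximity search.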

18 Concept Retrieval Goal: retrieve using “concepts,” not just words –Some words have many meanings (e.g., bank) This is a bigger problem for large diverse collections –Some meanings are associated with many words Especially when shades of meaning are unimportant This is the holy grail of information retrieval –Everyone agrees that it is a good idea –But every known approach has some limitations

19 Controlled Vocabulary Retrieval A straightforward concept retrieval approach –Works equally well for non-text materials –Index terms are a form of meta-data Assign a unique “descriptor” to each concept –Can be done by hand for collections of limited scope –In theory, descriptors are unambiguous Assign some descriptors to each document –Practical for valuable collections of limited size Use Boolean retrieval based on descriptors

20 Controlled Vocabulary Example
Document 1: The quick brown fox jumped over the lazy dog’s back. [Canine] [Fox]
Document 2: Now is the time for all good men to come to the aid of their party. [Political action] [Volunteerism]
Descriptor        Doc 1  Doc 2
Canine            1      0
Fox               1      0
Political action  0      1
Volunteerism      0      1
Canine AND Fox –Doc 1
Canine AND Political action –Empty
Canine OR Political action –Doc 1, Doc 2

21 Thesaurus Design Thesauri contain descriptors and relationships –Broader term (IS-A), narrower term, used for, … Indexers select descriptors for each document –Thesaurus must match the document collection Searchers select descriptors for each query –Thesaurus must match information needs Indexers must anticipate searchers’ info needs –Or searchers must discern indexers’ perspective –Or thesaurus itself must be accessible/browsable

22 Challenges Thesaurus design is expensive –Shifting concepts generate continuing expense Manual indexing is even more expensive –And consistent indexing is very expensive User needs are often difficult to anticipate –Challenge for thesaurus designers and indexers End users find thesauri hard to use –Co-design problem with query formulation

23 Applications When implied concepts must be captured –Political action, volunteerism, … When terminology selection is impractical –Searching foreign language materials When no words are present –Photos w/o captions, videos w/o transcripts, … When user needs are easily anticipated –Weather reports, yellow pages*, … *But cf. Bill Woods’ classic example of the paraphrase problem: “car washing” vs. “automobile cleaning”

24 Yahoo

25 Machine Assisted Indexing Goal: Automatically suggest descriptors –Better consistency with lower cost Chosen by a rule-based expert system –Design thesaurus by hand in the usual way –Design an expert system to process text String matching, proximity operators, … –Write rules for each thesaurus/collection/language –Try it out and fine tune the rules by hand

26 Machine Assisted Indexing Example
Access Innovations system:
//TEXT: science
IF (all caps)
    USE research policy
    USE community program
ENDIF
IF (near “Technology” AND with “Development”)
    USE community development
    USE development aid
ENDIF
near: within 250 words
with: in the same sentence
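To make the rule concrete, here is one possible Python reading of it. This is a sketch under several assumptions, not the Access Innovations implementation: the rule is taken to fire on each occurrence of the word "science", "near" means within 250 words on either side, "with" means in the same sentence, matching of "Technology" and "Development" ignores case, and sentences are split naively on end punctuation. The function name suggest_descriptors is hypothetical.

```python
import re

def suggest_descriptors(text, trigger="science", near_window=250):
    """Suggest descriptors for each occurrence of the trigger word."""
    suggestions = set()
    sentences = re.split(r"(?<=[.!?])\s+", text)
    words = []  # (word, sentence number) for every word in the text
    for s_no, sentence in enumerate(sentences):
        words.extend((w, s_no) for w in re.findall(r"[A-Za-z]+", sentence))
    for i, (word, s_no) in enumerate(words):
        if word.lower() != trigger:
            continue
        # Rule 1: the trigger is written in all capitals.
        if word.isupper():
            suggestions.update({"research policy", "community program"})
        # Rule 2: "technology" within near_window words AND
        # "development" in the same sentence as the trigger.
        lo, hi = max(0, i - near_window), i + near_window + 1
        near_tech = any(w.lower() == "technology" for w, _ in words[lo:hi])
        with_dev = any(w.lower() == "development" and s == s_no
                       for w, s in words)
        if near_tech and with_dev:
            suggestions.update({"community development", "development aid"})
    return suggestions

print(suggest_descriptors("SCIENCE funding rose."))
print(suggest_descriptors(
    "Science and Development go together. Technology helps."))
```

The output is a set of suggested descriptors for a human indexer to accept or reject, in keeping with the "machine assisted" framing.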

27 Text Categorization Goal: fully automatic descriptor assignment Machine learning approach –Assign descriptors manually for a “training set” –Design a learning algorithm to find and use patterns Bayesian classifier, neural network, genetic algorithm, … –Present new documents System assigns descriptors like those in training set

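28 Supervised Learning [Diagram: labelled training examples, each a vector of values for features f1 … fN with a class label (v1 … vN with class Cv; w1 … wN with class Cw), are fed to a Learner, which produces a Classifier; a new example x1 … xN with unknown class Cx is given to the Classifier, which outputs Cw]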
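As one concrete instance of the learner/classifier split, here is a minimal multinomial naive Bayes sketch in plain Python (a Bayesian classifier is one of the options the previous slide lists). The training sentences and their descriptors are made up for illustration, and whitespace tokenization with add-one smoothing is an assumption, not something the lecture prescribes.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Learner: collect word counts per descriptor from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Classifier: return the descriptor with the highest log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training set: (document text, descriptor).
training_set = [
    ("the quick brown fox", "Fox"),
    ("the fox jumped over the lazy dog", "Fox"),
    ("good men come to the aid of their party", "Political action"),
    ("the party needs good men", "Political action"),
]
model = train(training_set)
print(classify(model, "a quick fox"))
print(classify(model, "aid the party"))
```

Real text categorization would assign several descriptors per document and use far more training data; this sketch assigns a single best label to keep the mechanics visible.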

29 Retrieval vs. Filtering Retrospective retrieval: relatively static collection; constant flow of queries Information filtering: relatively static profile (query); constant stream of new documents Examples: –Yahoo categorization of new Web pages (could also be viewed as an ongoing indexing task) –Personalized newspaper

30 Case Study: Individual Inc. First of the personalized newspapers (original delivery mechanism: 8am fax) Core technology: SMART + extended Boolean Key insights: –Targeted, industry-specific marketing –Large staff of non-technical domain specialists –“Building block” Boolean profiles –Nightly update of profiles based on data stream e.g. (OJ or “orange juice”) and not Simpson –Inexpensive detection and selection, more costly examination/delivery.
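The filtering setting, with a fixed profile and a stream of new documents, can be sketched in Python using the slide's example profile. The incoming headlines are invented for illustration, and the function names and the simple word matching are assumptions rather than a description of Individual Inc.'s system.

```python
import re

def matches_profile(text):
    """Standing profile: (OJ or "orange juice") and not Simpson."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    wants = "oj" in words or "orange juice" in lowered
    return wants and "simpson" not in words

def filter_stream(stream):
    """Yield only the newly arriving documents that match the profile."""
    for doc in stream:
        if matches_profile(doc):
            yield doc

# Hypothetical stream of incoming documents.
incoming = [
    "Orange juice futures rise after Florida frost.",
    "OJ Simpson trial resumes.",
    "Coffee prices fall.",
]
print(list(filter_stream(incoming)))
```

The profile stays constant while documents flow past it, the mirror image of retrospective retrieval, where the collection stays put and the queries change.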

31 Things to Do This Week Homework 1 –Due next week Do the readings Note reading list changes

32 One Minute Paper Brief answers, no names, online –In your opinion, what is the most important positive and most important negative characteristic of Boolean retrieval? Please provide exactly one of each. –What was the muddiest point in today’s lecture? –What was the most interesting point in today’s lecture? I’ll summarize the answers next class

