INFM 700: Session 8, Search (Part I): Introduction to Information Retrieval. Paul Jacobs, The iSchool, University of Maryland. Wednesday, April 11, 2012.


1 INFM 700: Session 8 Search (Part I) Introduction to Information Retrieval Paul Jacobs The iSchool University of Maryland Wednesday, April 11, 2012 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

2 iSchool Goals for Search Sessions Understand the basic issues in information retrieval (searching primarily unstructured text) Know the techniques generally used by modern search engines Learn how to recognize, explain, and predict search engine behavior and results based on an understanding of the basic algorithms Learn how search engines can be used most effectively in information architecture

3 Today’s Topics Introduction to Information Retrieval Keywords, inverted indices, and Boolean retrieval The vector space model, ranked retrieval Major issues Some additional tricks Examples: web search and site search IR Intro Boolean Vector Space Issues & Tricks

4 Levels of Structure Different types of data Structured data Semi-structured data Unstructured data How do you provide access to unstructured data? Manually develop an organization system (add structure) Provide search capabilities

5 What is search? Search is query-based access How is this different from browsing? Things one can search on: Content Metadata Organization systems Labels …

6 Some Key Concepts Different search paradigms Boolean, “keyword” “Natural language” or “free text” (full text) search Current search engines are primarily full text and statistical The fundamental challenge: words & concepts The basic method: weighting and context Other tricks (there are many!) Structuring Popularity and importance (of pages, documents) Metadata and thesauri User feedback

7 Some Context “The fact of the matter is that there really hasn’t been much progress in the basic science of how to search since the seventies” – Tim Bray (now at Google), “On Search” “Search is a problem that is about five percent solved” – Udi Manber, VP of Engineering, Google Note: John Battelle, “The Search”; John Battelle’s Search Blog; Danny Sullivan’s “Search Engine Watch”

8 The Central Problem in IR [Figure: the searcher expresses concepts as a query; authors express concepts as documents.] Do these represent the same concepts?

9 Architecture of IR Systems [Figure: Offline, documents pass through a representation function to produce a document representation, which is stored in the index. Online, the query passes through a representation function to produce a query representation. A comparison function matches the query representation against the index and returns hits.]

10 How do we represent text? Remember: computers don’t “understand” documents or queries Simple, yet effective approach: “bag of words” Treat all the words in a document as index terms Assign a “weight” to each term based on “importance” Disregard order, structure, meaning, etc. of the words Assumptions Term occurrence is independent (of other terms) Document relevance is independent (of other documents) “Words” can be defined

11 What’s a word? 天主教教宗若望保祿二世因感冒再度住進醫院。 這是他今年第二度因同樣的病因住院。 وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982. Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России. भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है 日米連合で台頭中国に対処 … アーミテージ前副長官提言 조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안 에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다.

12 Sample Document McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … “Bag of Words”: 14 × McDonald’s; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …
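
A minimal sketch of how a "bag of words" like the one on this slide could be computed. The tokenizer (lowercase, split on anything that is not a letter or apostrophe) and the small stopword set are assumptions made for illustration, not the method used to produce the slide's counts, and the text is only an excerpt of the article.

```python
import re
from collections import Counter

def bag_of_words(text, stopwords=frozenset()):
    """Count terms, ignoring word order, structure, and meaning."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Excerpt of the sample document (headline plus first sentence fragment).
excerpt = (
    "McDonald's slims down spuds. Fast-food chain to reduce certain types "
    "of fat in its french fries with new cooking oil. McDonald's Corp. is "
    "cutting the amount of bad fat in its french fries nearly in half."
)

# Illustrative stopword set (an assumption, not taken from the slides).
bag = bag_of_words(excerpt, stopwords={"to", "of", "in", "its", "with", "is", "the"})

for term, count in bag.most_common(5):
    print(count, "x", term)
```

The resulting counter discards everything except which terms occur and how often, which is exactly the information the later term-weighting slides build on.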

13 Why does “bag of words” work (at all)? Words alone tell us a lot about content! Words are our main tool for describing concepts Words in context are especially powerful Getting beyond words is hard Structure usually (but not always) can be guessed from content “355 back correction Dow pulls signaling” “blind Venetian” vs. “Venetian blind”

14 Boolean Retrieval Users express queries as a Boolean (logical) expression “terms” (usually words or phrases) joined by AND, OR, NOT Can be arbitrarily nested Difference between “term” and “keyword”? Retrieval is based on the notion of sets Any given query divides the collection into two sets: retrieved, not-retrieved (complement) Pure Boolean systems do not define an ordering of the results (no ranking)

15 AND/OR/NOT [Figure: Venn diagram of sets A, B, and C within the set of all documents]

16 Logic Tables

A OR B:
      B=0  B=1
A=0    0    1
A=1    1    1

A AND B:
      B=0  B=1
A=0    0    0
A=1    0    1

A NOT B (= A AND NOT B):
      B=0  B=1
A=0    0    0
A=1    1    0

NOT B:
B=0  B=1
 1    0

17 Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: the, is, for, to, of

Term     Document 1  Document 2
aid          0           1
all          0           1
back         1           0
brown        1           0
come         0           1
dog          1           0
fox          1           0
good         0           1
jump         1           0
lazy         1           0
men          0           1
now          0           1
over         1           0
party        0           1
quick        1           0
their        0           1
time         0           1

18 Boolean View of a Collection Each column represents the view of a particular document: What terms are contained in this document? Each row represents the view of a particular term: What documents contain this term? To execute a query, pick out rows corresponding to query terms and then apply logic table of corresponding Boolean operator

19 Sample Queries

Term        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog           0      0      1      0      1      0      0      0
fox           0      0      1      0      1      0      1      0
dog ∧ fox     0      0      1      0      1      0      0      0
dog ∨ fox     0      0      1      0      1      0      1      0
dog ¬ fox     0      0      0      0      0      0      0      0
fox ¬ dog     0      0      0      0      0      0      1      0

dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7

Term        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good          0      1      0      1      0      1      0      1
party         0      0      0      0      0      1      0      1
g ∧ p         0      0      0      0      0      1      0      1
over          1      0      1      0      1      0      1      1
g ∧ p ¬ o     0      0      0      0      0      1      0      0

good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6
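
The sample queries can be reproduced by applying the logic tables position by position to the term rows. The sketch below assumes each row is stored as a list of 0/1 flags for Doc 1 through Doc 8, as read from this slide; the helper names (`and_`, `or_`, `not_`, `docs`) are hypothetical.

```python
# Term rows over Doc 1..Doc 8, as read from the sample-queries slide.
rows = {
    "dog":   [0, 0, 1, 0, 1, 0, 0, 0],
    "fox":   [0, 0, 1, 0, 1, 0, 1, 0],
    "good":  [0, 1, 0, 1, 0, 1, 0, 1],
    "party": [0, 0, 0, 0, 0, 1, 0, 1],
    "over":  [1, 0, 1, 0, 1, 0, 1, 1],
}

def and_(a, b):
    """Position-wise AND of two rows."""
    return [x & y for x, y in zip(a, b)]

def or_(a, b):
    """Position-wise OR of two rows."""
    return [x | y for x, y in zip(a, b)]

def not_(a):
    """Position-wise complement of a row."""
    return [1 - x for x in a]

def docs(row):
    """Convert a 0/1 row back to 1-based document numbers."""
    return [i + 1 for i, bit in enumerate(row) if bit]

dog, fox = rows["dog"], rows["fox"]
good, party, over = rows["good"], rows["party"], rows["over"]

print("dog AND fox:", docs(and_(dog, fox)))
print("dog OR fox:", docs(or_(dog, fox)))
print("dog NOT fox:", docs(and_(dog, not_(fox))))
print("fox NOT dog:", docs(and_(fox, not_(dog))))
print("good AND party NOT over:", docs(and_(and_(good, party), not_(over))))
```

Note that "A NOT B" is evaluated as A AND (NOT B), matching the logic tables slide.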

20 Inverted Index

Term     Postings
aid      4, 8
all      2, 4, 6
back     1, 3, 7
brown    1, 3, 5, 7
come     2, 4, 6, 8
dog      3, 5
fox      3, 5, 7
good     2, 4, 6, 8
jump     3
lazy     1, 3, 5, 7
men      2, 4, 8
now      2, 6, 8
over     1, 3, 5, 7, 8
party    6, 8
quick    1, 3
their    1, 5, 7
time     2, 4, 6
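
A minimal sketch of how an inverted index might be built, using the two documents and the stopword list from the "Representing Documents" slide rather than the eight-document collection shown here. The tokenizer is an assumption: it lowercases, drops possessive 's, and does no stemming, so "jumped" is indexed as-is instead of as "jump".

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "for", "to", "of"}  # stopword list from the slides

def tokenize(text):
    """Lowercase, drop possessive 's, keep alphabetic runs (no stemming)."""
    text = re.sub(r"['’]s\b", "", text.lower())
    return re.findall(r"[a-z]+", text)

def build_index(docs):
    """Map each non-stopword term to a sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep postings sorted
        for term in sorted(set(tokenize(docs[doc_id])) - STOPWORDS):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
index = build_index(docs)

for term in sorted(index):
    print(term, index[term])
```

Keeping each postings list sorted by document id is what makes the linear-time merges used in Boolean retrieval possible.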

21 Boolean Retrieval To execute a Boolean query: build the query syntax tree; for each clause, look up the postings; traverse the postings and apply the Boolean operator. Example: ( fox OR dog ) AND quick. [Figure: syntax tree with AND at the root, whose children are quick and OR; OR has children fox and dog.] Postings: fox → 3, 5, 7; dog → 3, 5; OR = union → 3, 5, 7. Efficiency analysis: postings traversal is linear (assuming sorted postings); start with the shortest posting first.
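
The postings traversal can be sketched as a pair of linear merges over sorted lists. The postings for fox, dog, and over are the ones confirmed by the sample-queries slide; "over" stands in for "quick" in the final AND, which is a substitution made for this illustration, and the function names are hypothetical.

```python
def intersect(p1, p2):
    """AND: walk two sorted postings lists once, keeping shared ids."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: walk two sorted postings lists once, keeping every id once."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])  # at most one of these tails is non-empty
    out.extend(p2[j:])
    return out

index = {"dog": [3, 5], "fox": [3, 5, 7], "over": [1, 3, 5, 7, 8]}

fox_or_dog = union(index["fox"], index["dog"])
result = intersect(fox_or_dog, index["over"])
print("(fox OR dog):", fox_or_dog)
print("(fox OR dog) AND over:", result)
```

Each merge advances at least one pointer per step, so the cost is linear in the combined length of the two lists. For a chain of ANDs, intersecting the shortest lists first keeps the intermediate results small.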

22 Why Boolean Retrieval Works Boolean operators approximate concepts How so? AND can identify relationships between concepts (e.g., interest rate, web design) OR can identify alternate terminology (e.g., interest percentage, HTML layout, etc.) NOT can filter alternate meanings (e.g., conflict AND interest AND NOT rate, NOT spider)

23 Why Boolean Retrieval Fails It’s really hard to come up with the “right” queries Casual searchers have difficulty with the logic Some concepts are just hard to express, e.g. “corporate mergers & acquisitions” – IBM acquired Lotus Relevance is not absolute, some documents are more relevant, or more helpful, than others

24 Ranked Retrieval in the Vector Space Model Order documents by how likely they are to be relevant to the information need Estimate relevance(q, d_i) Sort documents by relevance Display sorted results, usually one screen at a time How do we estimate relevance? Assume that document d is relevant to query q if they share terms in common Replace relevance(q, d_i) with sim(q, d_i) (similarity) Compute similarity of vector representations

25 Vector Representation “Bags of words” can be represented as vectors Why? Computational efficiency, ease of manipulation Geometric metaphor: “arrows” A vector is a set of values recorded in any consistent order “The quick brown fox jumped over the lazy dog’s back” → [ 1 1 1 1 1 1 1 1 2 ] 1st position corresponds to “back” 2nd position corresponds to “brown” 3rd position corresponds to “dog” 4th position corresponds to “fox” 5th position corresponds to “jump” 6th position corresponds to “lazy” 7th position corresponds to “over” 8th position corresponds to “quick” 9th position corresponds to “the”
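
A small sketch of the mapping from text to vector, assuming the nine-term vocabulary listed on this slide in the same order. The `NORMALIZE` table is a hypothetical stand-in for a real stemmer, covering only the two word forms that appear in the example sentence.

```python
from collections import Counter

# Vocabulary in the slide's (alphabetical) position order.
VOCAB = ["back", "brown", "dog", "fox", "jump", "lazy", "over", "quick", "the"]

# Hypothetical stand-in for stemming, just enough for this sentence.
NORMALIZE = {"jumped": "jump", "dog's": "dog"}

def to_vector(text, vocab=VOCAB):
    """Return term counts for `text`, one value per vocabulary position."""
    counts = Counter()
    for raw in text.lower().replace(".", "").split():
        counts[NORMALIZE.get(raw, raw)] += 1
    return [counts[term] for term in vocab]

vector = to_vector("The quick brown fox jumped over the lazy dog's back.")
print(vector)
```

Because every document is projected onto the same ordered vocabulary, any two vectors can be compared position by position, which is what the similarity computation relies on.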

26 Vector Space Model Assumption: Documents that are “close together” in vector space “talk about” the same things [Figure: document vectors d1 to d5 plotted in a space with term axes t1, t2, and t3; θ and φ mark angles between vectors] Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

27 Similarity Metric How about |d1 - d2|? Instead of Euclidean distance, use the “angle” between the vectors It all boils down to the inner product (dot product) of vectors
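
The slide's formula is not preserved in this transcript. The standard cosine measure, which fits the description of an angle computed from an inner product, can be written as follows. The symbols w_{i,q} and w_{i,j} (weight of term i in the query and in document j) are notation assumed here for illustration.

```latex
\[
\mathrm{sim}(q, d_j) \;=\; \cos\theta
\;=\; \frac{\vec{q} \cdot \vec{d_j}}{\lvert \vec{q} \rvert \, \lvert \vec{d_j} \rvert}
\;=\; \frac{\sum_{i} w_{i,q}\, w_{i,j}}
           {\sqrt{\sum_{i} w_{i,q}^{2}} \; \sqrt{\sum_{i} w_{i,j}^{2}}}
\]
```

The numerator is the inner product; dividing by the two vector lengths makes the score depend on direction only, not on how long the document is.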

28 Components of Similarity The “inner product” (aka dot product) is the key to the similarity function The denominator handles document length normalization Example:
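
A minimal sketch of a cosine similarity function, assuming both vectors are plain lists of term weights over the same vocabulary. The sample vectors are made up for illustration and are not the example from the original slide.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

query = [1, 1, 0]
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]  # same term proportions, ten times longer
unrelated = [0, 0, 3]

print(cosine(query, short_doc))
print(cosine(query, long_doc))
print(cosine(query, unrelated))
```

The short and long documents get essentially the same score because the denominator cancels out document length, while the document sharing no terms with the query scores zero.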

29 Term Weighting Term weights consist of two components Local: how important is the term in this doc? Global: how important is the term in the collection? Here’s the intuition: Terms that appear often in a document should get high weights Terms that appear in many documents should get low weights How do we capture this mathematically? Term frequency (local) Inverse document frequency (global)

30 TF.IDF Term Weighting $w_{i,j} = \mathrm{tf}_{i,j} \cdot \log(N / n_i)$, where $w_{i,j}$ is the weight assigned to term i in document j, $\mathrm{tf}_{i,j}$ is the number of occurrences of term i in document j, $N$ is the number of documents in the entire collection, and $n_i$ is the number of documents with term i

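31 TF.IDF Example

Term frequencies (tf) per document and inverse document frequency (idf); "-" means the term does not occur:

Term          Doc 1  Doc 2  Doc 3  Doc 4   idf
complicated     -      -      5      2     0.301
contaminated    4      1      3      -     0.125
fallout         5      -      4      3     0.125
information     6      3      3      2     0.000
interesting     -      1      -      -     0.602
nuclear         3      -      7      -     0.301
retrieval       -      6      1      4     0.125
siberia         2      -      -      -     0.602

Inverted index with (document, tf) postings:

Term          idf    Postings
complicated   0.301  (3,5) (4,2)
contaminated  0.125  (1,4) (2,1) (3,3)
fallout       0.125  (1,5) (3,4) (4,3)
information   0.000  (1,6) (2,3) (3,3) (4,2)
interesting   0.602  (2,1)
nuclear       0.301  (1,3) (3,7)
retrieval     0.125  (2,6) (3,1) (4,4)
siberia       0.602  (1,2)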
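
The idf column can be checked with a few lines of code. This sketch assumes a four-document collection, a base-10 logarithm (which reproduces 0.301, 0.125, and 0.602), and term counts as they are read from this slide; the dictionary layout and function names are illustrative.

```python
import math

N = 4  # documents in the example collection

# tf[term][doc] = occurrences of term in doc, as read from the slide.
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}

def idf(term):
    """log10(N / n_i), where n_i is the number of documents with the term."""
    return math.log10(N / len(tf[term]))

def weight(term, doc):
    """tf.idf weight of `term` in `doc` (0 when the term is absent)."""
    return tf[term].get(doc, 0) * idf(term)

for term in sorted(tf):
    print(f"{term:13s} idf={idf(term):.3f}")
```

A term such as "information" that appears in every document gets idf 0, so it contributes nothing to any score no matter how often it occurs.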

32 Document Scoring Algorithm Initialize accumulators to hold document scores. For each query term t in the user’s query: fetch t’s postings; for each document in the postings, score[doc] += w_{t,d} × w_{t,q}. Apply length normalization to the scores at the end. Return the top N documents.
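
The scoring loop can be sketched term-at-a-time as below. Several choices are assumptions for illustration: tf.idf weights with a base-10 logarithm, a query weight equal to the term's idf (each query term counted once), and normalization by document vector length only, since the query length is the same for every document and does not change the ranking. The postings reuse the counts from the TF.IDF example slide.

```python
import math
from collections import defaultdict

# postings[term][doc] = tf, as read from the TF.IDF example slide.
postings = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
NUM_DOCS = 4

def rank(query_terms, postings, num_docs, top_n=10):
    """Return up to top_n (doc, score) pairs, best first."""
    idf = {t: math.log10(num_docs / len(p)) for t, p in postings.items()}

    # Document vector lengths; a real system would precompute these at index time.
    length_sq = defaultdict(float)
    for term, plist in postings.items():
        for doc, tf in plist.items():
            length_sq[doc] += (tf * idf[term]) ** 2

    # One accumulator per document that matches at least one query term.
    acc = defaultdict(float)
    for term in query_terms:
        for doc, tf in postings.get(term, {}).items():
            w_td = tf * idf[term]  # document-side weight
            w_tq = idf[term]       # query-side weight (query tf = 1)
            acc[doc] += w_td * w_tq

    # Length-normalize at the end, then sort by descending score.
    scored = [(doc, s / math.sqrt(length_sq[doc]))
              for doc, s in acc.items() if length_sq[doc] > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

results = rank(["nuclear", "fallout"], postings, NUM_DOCS)
for doc, score in results:
    print(f"Doc {doc}: {score:.3f}")
```

Only documents that appear in some query term's postings ever get an accumulator, which is why the inverted index makes ranked retrieval practical on large collections.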

33 Summary thus far… Represent documents (and queries) as “bags of words” (terms) Derive term weights based on frequency Use weighted term vectors for each document, query Compute a vector-based similarity score Display sorted, ranked results

34 Issues and Tricks What’s a word/term? We can ignore words (“stop words”), combine (phrases), split up (“stem”) words Other special treatment (e.g. names, categories) Query formulation/suggestion Type of information need Popularity Based on link analysis/page rank Based on click through, other Structuring and tagging (e.g., “best bets”)

35 Issues and Tricks (cont’d) Thesaurus/query expansion Based on meaning, conceptual relationships Based on decomposition/type User feedback/“More like this” Clustering/grouping of results

36 Morphological Variation Handling morphology: related concepts have different forms Inflectional morphology: same part of speech (dogs = dog + PLURAL; broke = break + PAST) Derivational morphology: different parts of speech (destruction = destroy + ion; researcher = research + er) Different morphological processes: Prefixing Suffixing Infixing Reduplication

37 Stemming Dealing with morphological variation: index stems instead of words Stem: a word equivalence class that preserves the central concept How much to stem? organization → organize → organ? resubmission → resubmit/submission → submit? reconstructionism?
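
The "how much to stem?" question shows up quickly in even a toy stemmer. The sketch below is a deliberately crude suffix stripper written for this illustration; the suffix list and the minimum stem length are assumptions, and it is not the Porter stemmer or any stemmer cited in these slides.

```python
# Hypothetical suffix list, longest suffixes first so they are tried first.
SUFFIXES = ["ization", "ation", "ize", "ion", "ing", "ed", "er", "s"]

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["organization", "organize", "organ", "dogs",
          "researcher", "destruction", "broke"]:
    print(w, "->", crude_stem(w))
```

Here "organization", "organize", and "organ" all collapse to the same stem, which merges unrelated concepts, while "destruction" is not conflated with "destroy" and the irregular form "broke" is left untouched. That is the trade-off between over-stemming and under-stemming that the slide raises.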

38 Does Stemming Work? Generally, yes! (in English) Helps more for longer queries, fewer results Lots of work done in this area But used very sparingly in web search – why? Donna Harman (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15. Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993. David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84. And others…

39 Beyond Words… Stemming/tokenization = specific instance of a general problem: what is it? Other units of indexing Concepts (e.g., from WordNet) Named entities Relations …

40 Recap Introduction to Information Retrieval Boolean retrieval Ranked retrieval – term weighting, the vector space model Advanced methods, things to think about Next time: Deploying search engines

