Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and.

Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book  Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze http://informationretrieval.org

CSE 8337 Spring 2009 2 CSE 8337 Outline Introduction Simple Text Processing Boolean Queries Web Searching/Crawling Indexes Vector Space Model Matching Evaluation

CSE 8337 Spring 2009 3 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query: Find all documents about “data mining”.

CSE 8337 Spring 2009 4 Motivation IR: representation, storage, organization of, and access to information items Focus is on the user information need User information need (example): Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament. Emphasis is on the retrieval of information (not data)

CSE 8337 Spring 2009 5 DB vs IR Records (tuples) vs. documents Well defined results vs. fuzzy results DB grew out of files and traditional business systesm IR grew out of library science and need to categorize/group/access books/articles

CSE 8337 Spring 2009 6 Unstructured data Typically refers to free text Allows Keyword queries including operators More sophisticated “concept” queries e.g., find all web pages dealing with drug abuse Classic model for searching text documents

CSE 8337 Spring 2009 7 Semi-structured data In fact almost no data is “unstructured” E.g., this slide has distinctly identified zones such as the Title and Bullets Facilitates “semi-structured” search such as Title contains data AND Bullets contain search … to say nothing of linguistic structure

CSE 8337 Spring 2009 8 DB vs IR (cont’d)  Data retrieval  which docs contain a set of keywords?  Well defined semantics  a single erroneous object implies failure!  Information retrieval  information about a subject or topic  semantics is frequently loose  small errors are tolerated  IR system:  interpret contents of information items  generate a ranking which reflects relevance  notion of relevance is most important

CSE 8337 Spring 2009 9 Motivation  IR in the last 20 years:  classification and categorization  systems and languages  user interfaces and visualization  Still, area was seen as of narrow interest  Advent of the Web changed this perception once and for all  universal repository of knowledge  free (low cost) universal access  no central editorial board  many problems though: IR seen as key to finding the solutions!

CSE 8337 Spring 2009 10 Unstructured (text) vs. structured (database) data in 1996

CSE 8337 Spring 2009 11 Unstructured (text) vs. structured (database) data in 2006

CSE 8337 Spring 2009 12 Basic Concepts  The User Task  Retrieval  information or data  purposeful  Browsing  glancing around  Feedback Retrieval Browsing Database Response Feedback

CSE 8337 Spring 2009 13 Basic Concepts Logical view of the documents structure Accents spacing stopwords Noun groups stemming Manual indexing Docs structureFull textIndex terms

CSE 8337 Spring 2009 14 User Interface Text Operations Query Operations Indexing Searching Ranking Index Text query user need user feedback ranked docs retrieved docs logical view inverted file DB Manager Module Text Database / WWW Text The Retrieval Process

CSE 8337 Spring 2009 15 Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to user’s information need and helps him complete a task

CSE 8337 Spring 2009 16 Fuzzy Sets and Logic Fuzzy Set: Set membership function is a real valued function with output in the range [0,1]. f(x): Probability x is in F. 1-f(x): Probability x is not in F. EX: T = {x | x is a person and x is tall} Let f(x) be the probability that x is tall Here f is the membership function

CSE 8337 Spring 2009 17 Fuzzy Sets

CSE 8337 Spring 2009 18 IR is Fuzzy SimpleFuzzy Not Relevant Relevant

CSE 8337 Spring 2009 19 Information Retrieval Metrics Similarity: measure of how close a query is to a document. Documents which are “close enough” are retrieved. Metrics: Precision = |Relevant and Retrieved| |Retrieved| Recall = |Relevant and Retrieved| |Relevant|

CSE 8337 Spring 2009 20 IR Query Result Measures IR

CSE 8337 Spring 2009 22 Text Processing TOC Simple Text Storage String Matching String-to-String Correction (Approximate matching)

CSE 8337 Spring 2009 23 Text storage EBCDIC/ASCII Array of character Linked list of character Trees- B Tree, Trie Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp 420-424.

CSE 8337 Spring 2009 24 Pattern Matching(Recognition) Pattern Matching: finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, time series analysis.

CSE 8337 Spring 2009 25 Similarity Measures Determine similarity between two objects. Similarity characteristics: Alternatively, distance measures measure how unlike or dissimilar objects are.

CSE 8337 Spring 2009 26 String Matching Problem Input: Pattern – length m Text string – length n Find one (next, all) occurrences of string in pattern Ex: String: 00110011011110010100100111 Pattern: 011010

CSE 8337 Spring 2009 27 String Matching Algorithms Brute Force Knuth-Morris Pratt Boyer Moore

CSE 8337 Spring 2009 28 Brute Force String Matching Brute Force Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html Space O(m+n) Time O(mn) 00110011011110010100100111 011010

CSE 8337 Spring 2009 29 FSR

CSE 8337 Spring 2009 30 Creating FSR Create FSM: Construct the “correct” spine. Add a default “failure bus” to state 0. Add a default “initial bus” to state 1. For each state, decide its attachments to failure bus, initial bus, or other failure links.

CSE 8337 Spring 2009 31 Knuth-Morris-Pratt Apply FSM to string by processing characters one at a time. Accepting state is reached when pattern is found. Space O(m+n) Time O(m+n) Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html

CSE 8337 Spring 2009 32 Boyer-Moore Scan pattern from right to left Skip many positions on illegal character string. O(mn) Expected time better than KMP Expected behavior better Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html

CSE 8337 Spring 2009 33 String-to-String Correction Measure of similarity between strings Can be used to determine how to convert from one string to another Cost to convert one to the other Transformations Match: Current characters in both strings are the same Delete: Delete current character in input string Insert: Insert current character in target string into string

CSE 8337 Spring 2009 34 Distance Between Strings

CSE 8337 Spring 2009 35 Approximate String Matching Find patterns “close to” the string Fuzzy matching Applications: Spelling checkers IR Define similarity (distance) between string and pattern

CSE 8337 Spring 2009 37 Keyword Based Queries Basic Queries Single word Multiple words Context Queries Phrase Proximity

CSE 8337 Spring 2009 38 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.

CSE 8337 Spring 2009 39 Boolean Retrieval with Inverted Indices Primitive keyword: Retrieve containing documents using the inverted index. OR: Recursively retrieve e 1 and e 2 and take union of results. AND: Recursively retrieve e 1 and e 2 and take intersection of results. BUT: Recursively retrieve e 1 and e 2 and take set difference of results.

CSE 8337 Spring 2009 40 Term-document incidence 1 if play contains word, 0 otherwise Brutus AND Caesar but NOT Calpurnia

CSE 8337 Spring 2009 41 Incidence vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. 110100 AND 110111 AND 101111 = 100100. http://www.rhymezone.com/shakespeare/

CSE 8337 Spring 2009 42 Inverted index For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Brutus Calpurnia Caesar 12358132134 248163264128 1316 What happens if the word Caesar is added to document 14?

CSE 8337 Spring 2009 43 Inverted index Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers Brutus Calpurnia Caesar 248163264128 2358132134 1316 1 Dictionary Postings lists Sorted by docID (more later on why). Posting

CSE 8337 Spring 2009 44 Inverted index construction Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend romancountryman Indexer Inverted index. friend roman countryman 24 2 13 16 1 More on these later. Documents to be indexed. Friends, Romans, countrymen.

CSE 8337 Spring 2009 45 Sequence of (Modified token, Document ID) pairs. I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Doc 2 Indexer steps

CSE 8337 Spring 2009 46 Sort by terms. Core indexing step.

CSE 8337 Spring 2009 47 Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.

CSE 8337 Spring 2009 48 The result is split into a Dictionary file and a Postings file.

CSE 8337 Spring 2009 49 Where do we pay in storage? Pointers Terms Will quantify the storage, later.

CSE 8337 Spring 2009 50 The index we just built How do we process a query? Later - what kinds of queries can we process? Today’s focus

CSE 8337 Spring 2009 51 Query processing: AND Consider processing the query: Brutus AND Caesar Locate Brutus in the Dictionary; Retrieve its postings. Locate Caesar in the Dictionary; Retrieve its postings. “Merge” the two postings: 128 34 248163264123581321 Brutus Caesar

CSE 8337 Spring 2009 52 34 12824816 3264 12 3 581321 The merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 128 34 248163264123581321 Brutus Caesar 2 8 If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

CSE 8337 Spring 2009 53 Example: WestLaw http://www.westlaw.com/ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM /3 = within 3 words, /S = in same sentence

CSE 8337 Spring 2009 54 Boolean queries: More general merges Exercise: Adapt the merge for the queries: Brutus AND NOT Caesar Brutus OR NOT Caesar Can we still run through the merge in time O(x+y)? What can we achieve?

CSE 8337 Spring 2009 55 Merging What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in “linear” time? Linear in what? Can we do better?

CSE 8337 Spring 2009 56 Query optimization What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together. Brutus Calpurnia Caesar 12358162134 248163264128 1316 Query: Brutus AND Calpurnia AND Caesar

CSE 8337 Spring 2009 57 Query optimization example Process in order of increasing freq: start with smallest set, then keep cutting further. Brutus Calpurnia Caesar 12358132134 248163264128 1316 This is why we kept freq in dictionary Execute the query as (Caesar AND Brutus) AND Calpurnia.

CSE 8337 Spring 2009 58 More general optimization e.g., (madding OR crowd) AND (ignoble OR strife) Get freq’s for all terms. Estimate the size of each OR by the sum of its freq’s (conservative). Process in increasing order of OR sizes.

CSE 8337 Spring 2009 59 Exercise Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

CSE 8337 Spring 2009 60 Phrasal Queries Retrieve documents with a specific phrase (ordered list of contiguous words) “information theory” May allow intervening stop words and/or stemming. “buy camera” matches: “buy a camera” “buying the cameras” etc.

CSE 8337 Spring 2009 61 Phrasal Retrieval with Inverted Indices Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase.

CSE 8337 Spring 2009 62 Phrasal Search Algorithm 1. Find set of documents D in which all keywords (k 1 …k m ) in phrase occur (using AND query processing). 2. Intitialize empty set, R, of retrieved documents. 3. For each document, d, in D do 4. Get array, P i, of positions of occurrences for each k i in d 5. Find shortest array P s of the P i ’s 6. For each position p of keyword k s in P s do 7. For each keyword k i except k s do 8. Use binary search to find a position (p – s + i ) in the array P i 1. If correct position for every keyword found, add d to R 2. Return R

CSE 8337 Spring 2009 63 Proximity Queries List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words match “…dogs will begin the race…” May also perform stemming and/or not count stop words.

CSE 8337 Spring 2009 64 Proximity Retrieval with Inverted Index Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During binary search for positions of remaining keywords, find closest position of k i to p and check that it is within maximum allowed distance.

CSE 8337 Spring 2009 65 Pattern Matching Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

CSE 8337 Spring 2009 66 Simple Patterns Prefixes: Pattern that matches start of word. “anti” matches “antiquity”, “antibody”, etc. Suffixes: Pattern that matches end of word: “ix” matches “fix”, “matrix”, etc. Substrings: Pattern that matches arbitrary subsequence of characters. “rapt” matches “enrapture”, “velociraptor” etc. Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. “tin” to “tix” matches “tip”, “tire”, “title”, etc.

CSE 8337 Spring 2009 67 Allowing Errors What if query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: Edit distance (cost of insert/delete/match) Longest Common Subsequence (LCS) Allow proximity search with bound on string similarity.

CSE 8337 Spring 2009 68 Longest Common Subsequence (LCS) Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: “misspell” to “mispell” is 7 “misspelled” to “misinterpretted” is 7 “mis…p…e…ed”

CSE 8337 Spring 2009 69 Regular Expressions Language for composing complex patterns from simpler ones. An individual character is a regex. Union: If e 1 and e 2 are regexes, then (e 1 | e 2 ) is a regex that matches whatever either e 1 or e 2 matches. Concatenation: If e 1 and e 2 are regexes, then e 1 e 2 is a regex that matches a string that consists of a substring that matches e 1 immediately followed by a substring that matches e 2 Repetition: If e 1 is a regex, then e 1 * is a regex that matches a sequence of zero or more strings that match e 1

CSE 8337 Spring 2009 70 Regular Expression Examples (u|e)nabl(e|ing) matches unable unabling enable enabling (un|en)*able matches able unable unenable enununenable

CSE 8337 Spring 2009 71 Enhanced Regex’s (Perl) Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for specific range of number of occurrences: {min,max}. A{1,5} One to five A’s. A{5,} Five or more A’s A{5} Exactly five A’s

CSE 8337 Spring 2009 72 Perl Regex Examples U.S. phone number with optional area code: /\b($\d{3}$\s?)?\d{3}-\d{4}\b/ Email address: /\b\S+@\S+(\.com|\.edu|\.gov|\.org|\.net)\b/ Note: Packages available to support Perl regex’s in Java

CSE 8337 Spring 2009 73 Structural Queries Assumes documents have structure that can be exploited in search. Structure could be: Fixed set of fields, e.g. title, author, abstract, etc. Hierarchical (recursive) tree structure: chapter titlesectiontitlesection titlesubsection chapter book

CSE 8337 Spring 2009 74 Queries with Structure Allow queries for text appearing in specific fields: “nuclear fusion” appearing in a chapter title SFQL: Relational database query language SQL enhanced with “full text” search. Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950

CSE 8337 Spring 2009 75 Ranking search results Boolean queries give inclusion or exclusion of docs. Often we want to rank/group results Need to measure proximity from query to each doc. Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

CSE 8337 Spring 2009 76 The web and its challenges Unusual and diverse documents Unusual and diverse users, queries, information needs Beyond terms, exploit ideas from social networks link analysis, clickstreams... How do search engines work? And how can we make them better?

CSE 8337 Spring 2009 77 More sophisticated information retrieval Cross-language information retrieval Question answering Summarization Text mining …

CSE 8337 Spring 2009 78 Perl Regex’s Character classes: \w (word char) Any alpha-numeric (not: \W) \d (digit char) Any digit (not: \D) \s (space char) Any whitespace (not: \S). (wildcard) Anything Anchor points: \b (boundary) Word boundary ^ Beginning of string $ End of string

Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and.

Similar presentations

Presentation on theme: "Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and.

Similar presentations

Presentation on theme: "Information Retrieval CSE 8337 (Part A) Spring 2009 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and."— Presentation transcript:

Similar presentations

About project

Feedback