
Information Retrieval, CSE 8337 (Part A), Spring 2009. Some material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Data Mining: Introductory and Advanced Topics by Margaret H. Dunham; Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.

CSE 8337 Outline: Introduction; Simple Text Processing; Boolean Queries; Web Searching/Crawling; Indexes; Vector Space Model; Matching; Evaluation.

Information Retrieval. Information Retrieval (IR): retrieving desired information from textual data. Examples: library science, digital libraries, web search engines. Traditionally keyword based. Sample query: Find all documents about “data mining”.

Motivation. IR: representation, storage, organization of, and access to information items. Focus is on the user information need. User information need (example): find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament. Emphasis is on the retrieval of information (not data).

DB vs IR. Records (tuples) vs. documents. Well-defined results vs. fuzzy results. DB grew out of files and traditional business systems. IR grew out of library science and the need to categorize/group/access books/articles.

Unstructured data. Typically refers to free text. Allows keyword queries including operators, and more sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse. Classic model for searching text documents.

Semi-structured data. In fact almost no data is “unstructured”. E.g., this slide has distinctly identified zones such as the Title and Bullets. Facilitates “semi-structured” search such as: Title contains data AND Bullets contain search. … to say nothing of linguistic structure.

DB vs IR (cont’d). Data retrieval: which docs contain a set of keywords? Well-defined semantics; a single erroneous object implies failure! Information retrieval: information about a subject or topic; semantics is frequently loose; small errors are tolerated. IR system: interprets contents of information items; generates a ranking which reflects relevance; the notion of relevance is most important.

Motivation. IR in the last 20 years: classification and categorization; systems and languages; user interfaces and visualization. Still, the area was seen as of narrow interest. The advent of the Web changed this perception once and for all: universal repository of knowledge; free (low cost) universal access; no central editorial board; many problems though: IR seen as key to finding the solutions!

Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2006

Basic Concepts: The User Task. Retrieval: information or data; purposeful. Browsing: glancing around. Feedback. (Diagram labels: Retrieval; Browsing; Database; Response; Feedback.)

Basic Concepts: Logical view of the documents. (Diagram labels: structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; docs; structure; full text; index terms.)

The Retrieval Process. (Diagram labels: User Interface; Text Operations; Query Operations; Indexing; Searching; Ranking; Index; Text; query; user need; user feedback; ranked docs; retrieved docs; logical view; inverted file; DB Manager Module; Text Database / WWW.)

Basic assumptions of Information Retrieval. Collection: fixed set of documents. Goal: retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

Fuzzy Sets and Logic. Fuzzy set: the set membership function is a real-valued function with output in the range [0,1]. f(x): probability x is in F. 1-f(x): probability x is not in F. Example: T = {x | x is a person and x is tall}. Let f(x) be the probability that x is tall; here f is the membership function.

Fuzzy Sets

IR is Fuzzy. (Diagram labels: Simple; Fuzzy; Not Relevant; Relevant.)

Information Retrieval Metrics. Similarity: a measure of how close a query is to a document. Documents which are “close enough” are retrieved. Metrics: Precision = |Relevant and Retrieved| / |Retrieved|; Recall = |Relevant and Retrieved| / |Relevant|.
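
The two metrics above can be computed directly from sets of document IDs. The following is a minimal sketch; the function name and the example IDs are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Assumption: report 0.0 when a denominator would be zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant docs, 5 retrieved docs, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9})
print(p, r)  # 0.6 0.75
```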

IR Query Result Measures IR

CSE 8337 Outline: Introduction; Simple Text Processing; Boolean Queries; Web Searching/Crawling; Indexes; Vector Space Model; Matching; Evaluation.

Text Processing TOC: Simple Text Storage; String Matching; String-to-String Correction (Approximate matching).

Text storage. EBCDIC/ASCII. Array of characters. Linked list of characters. Trees: B-tree, trie. Reference: Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp

Pattern Matching (Recognition). Pattern matching finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, and time series analysis.

Similarity Measures. Determine similarity between two objects. Similarity characteristics: [formulas not preserved in this transcript]. Alternatively, distance measures measure how unlike or dissimilar objects are.

String Matching Problem. Input: pattern of length m; text string of length n. Find one (next, all) occurrences of the pattern in the text string. Example string and pattern: [not preserved in this transcript].

String Matching Algorithms: Brute Force; Knuth-Morris-Pratt; Boyer-Moore.

Brute Force String Matching. Space O(m+n); time O(mn). Source: Handbook of Algorithms and Data Structures.
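
The code shown on the original slide is not preserved in this transcript. As a stand-in, here is a minimal sketch of brute-force matching: try every alignment of the pattern against the text and compare character by character, which gives the O(mn) worst-case time quoted above. The function name and example strings are hypothetical.

```python
def brute_force_search(text, pattern):
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):       # each possible alignment
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1                       # extend the match one character
        if k == m:                       # whole pattern matched
            matches.append(start)
    return matches


print(brute_force_search("abracadabra", "abra"))  # [0, 7]
```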

FSR

Creating FSR. Create FSM: Construct the “correct” spine. Add a default “failure bus” to state 0. Add a default “initial bus” to state 1. For each state, decide its attachments to failure bus, initial bus, or other failure links.

Knuth-Morris-Pratt. Apply the FSM to the string by processing characters one at a time. The accepting state is reached when the pattern is found. Space O(m+n); time O(m+n). Source: Handbook of Algorithms and Data Structures.
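
A common way to realize this machine in code is a failure table: for each pattern position it records where to resume after a mismatch, playing the role of the failure links. The sketch below uses that standard formulation rather than the slide's bus construction; function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]              # follow failure links
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return every index at which pattern occurs in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0                   # k = current state (chars matched)
    for i, ch in enumerate(text):        # each text character is read once
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):            # accepting state reached
            matches.append(i - k + 1)
            k = fail[k - 1]              # keep going to find overlaps
    return matches


print(kmp_search("abracadabra", "abra"))  # [0, 7]
```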

Boyer-Moore. Scan the pattern from right to left. Skip many positions when the text contains a character that does not occur in the pattern. Worst-case time O(mn); expected time better than KMP. Source: Handbook of Algorithms and Data Structures.
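
The sketch below illustrates the skipping idea with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table. It is not the full Boyer-Moore algorithm (there is no good-suffix rule), and the function name is hypothetical.

```python
def horspool_search(text, pattern):
    """Return every index at which pattern occurs in text
    (Horspool variant: bad-character shifts only)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], the distance from its rightmost
    # occurrence to the end of the pattern. Characters absent from the
    # table allow a full shift of m positions.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        k = m - 1
        while k >= 0 and text[pos + k] == pattern[k]:
            k -= 1                       # compare right to left
        if k < 0:
            matches.append(pos)
        # Shift based on the text character under the pattern's last slot.
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_search("abracadabra", "abra"))  # [0, 7]
```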

String-to-String Correction. Measure of similarity between strings. Can be used to determine how to convert one string into another; the distance is the cost to convert one to the other. Transformations: Match: current characters in both strings are the same. Delete: delete the current character in the input string. Insert: insert the current character of the target string into the input string.
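
A dynamic-programming sketch of this distance, using only the three transformations listed. The unit costs for insert and delete and the zero cost for match are assumptions made here for illustration; the slides do not state costs.

```python
def edit_distance(source, target, insert_cost=1, delete_cost=1):
    """Minimum cost to turn source into target using match (free),
    delete and insert. Costs are illustrative assumptions."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost     # delete every source character
    for j in range(1, cols):
        dist[0][j] = j * insert_cost     # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(dist[i - 1][j] + delete_cost,
                       dist[i][j - 1] + insert_cost)
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])  # match costs nothing
            dist[i][j] = best
    return dist[rows - 1][cols - 1]


print(edit_distance("misspell", "mispell"))  # 1 (delete one "s")
print(edit_distance("cat", "cut"))           # 2 (delete "a", insert "u")
```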

Distance Between Strings

Approximate String Matching. Find patterns “close to” the string (fuzzy matching). Applications: spelling checkers, IR. Define similarity (distance) between string and pattern.

CSE 8337 Outline: Introduction; Simple Text Processing; Boolean Queries; Web Searching/Crawling; Indexes; Vector Space Model; Matching; Evaluation.

Keyword Based Queries. Basic queries: single word; multiple words. Context queries: phrase; proximity.

Boolean Queries. Keywords combined with Boolean operators: OR: (e1 OR e2); AND: (e1 AND e2); BUT: (e1 BUT e2), which satisfies e1 but not e2. Negation is only allowed using BUT, so that the inverted index can be used efficiently by filtering another efficiently retrievable set. Naïve users have trouble with Boolean logic.

Boolean Retrieval with Inverted Indices. Primitive keyword: retrieve containing documents using the inverted index. OR: recursively retrieve e1 and e2 and take the union of results. AND: recursively retrieve e1 and e2 and take the intersection of results. BUT: recursively retrieve e1 and e2 and take the set difference of results.
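
The recursive evaluation described above can be sketched with Python sets. The toy index, the tuple-based query representation, and the function name are all hypothetical choices for illustration.

```python
# Hypothetical toy inverted index: term -> set of document IDs.
index = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 4, 5, 6},
    "calpurnia": {2},
}


def evaluate(query):
    """query is either a keyword string or a tuple (op, left, right)."""
    if isinstance(query, str):           # primitive keyword: index lookup
        return index.get(query, set())
    op, left, right = query
    a, b = evaluate(left), evaluate(right)
    if op == "OR":
        return a | b                     # union
    if op == "AND":
        return a & b                     # intersection
    if op == "BUT":
        return a - b                     # set difference
    raise ValueError(f"unknown operator: {op}")


result = evaluate(("BUT", ("AND", "brutus", "caesar"), "calpurnia"))
print(sorted(result))  # [1, 4]
```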

Term-document incidence. Entry is 1 if play contains word, 0 otherwise. Query: Brutus AND Caesar but NOT Calpurnia.

Incidence vectors. So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
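
The bit vectors from the original slide are not preserved in this transcript, so the sketch below uses hypothetical vectors over a six-document collection. Python integers serve as the 0/1 vectors; the mask keeps the complement within the collection size.

```python
N_DOCS = 6  # hypothetical collection size

# Hypothetical incidence vectors: the leftmost bit is document 0.
incidence = {
    "brutus":    0b110100,
    "caesar":    0b110111,
    "calpurnia": 0b010000,
}

mask = (1 << N_DOCS) - 1                          # 0b111111
not_calpurnia = ~incidence["calpurnia"] & mask    # complement within N_DOCS bits
answer = incidence["brutus"] & incidence["caesar"] & not_calpurnia

print(format(answer, f"0{N_DOCS}b"))  # 100100

# Translate set bits back to document numbers (leftmost bit = document 0).
docs = [i for i in range(N_DOCS) if (answer >> (N_DOCS - 1 - i)) & 1]
print(docs)  # [0, 3]
```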

Inverted index. For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? (Diagram: postings for Brutus, Calpurnia, Caesar.) What happens if the word Caesar is added to document 14?

Inverted index. Linked lists generally preferred to arrays: dynamic space allocation; insertion of terms into documents easy; space overhead of pointers. (Diagram labels: Dictionary with Brutus, Calpurnia, Caesar; Postings lists; Posting.) Sorted by docID (more later on why).

Inverted index construction. Documents to be indexed (“Friends, Romans, countrymen.”) → Tokenizer → token stream (Friends, Romans, Countrymen) → Linguistic modules → modified tokens (friend, roman, countryman) → Indexer → inverted index (friend, roman, countryman). More on these later.

Indexer steps. Sequence of (modified token, document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Sort by terms. Core indexing step.

Multiple term entries in a single document are merged. Frequency information is added. Why frequency? Will discuss later.

The result is split into a Dictionary file and a Postings file.
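
The indexer steps (emit pairs, sort, merge with frequencies, split into dictionary and postings) can be sketched on the two example documents. This is an in-memory illustration under stated assumptions: lowercasing stands in for the linguistic modules, the regular-expression tokenizer is a simplification, and the dictionary here stores document frequency per term.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok, doc_id)
             for doc_id, text in docs.items()
             for tok in re.findall(r"[a-z']+", text.lower())]
    # Step 2: sort by term, then by docID (the core indexing step).
    pairs.sort()
    # Step 3: merge repeated (term, docID) entries and count frequency.
    postings = {}
    for term, doc_id in pairs:
        plist = postings.setdefault(term, [])
        if plist and plist[-1][0] == doc_id:
            plist[-1] = (doc_id, plist[-1][1] + 1)
        else:
            plist.append((doc_id, 1))
    # Step 4: split into a dictionary (term -> number of docs) and postings.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings


dictionary, postings = build_index(docs)
print(postings["caesar"])    # [(1, 1), (2, 2)]
print(dictionary["brutus"])  # 2
```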

Where do we pay in storage? Pointers; terms. Will quantify the storage later.

The index we just built. How do we process a query? Later: what kinds of queries can we process? (Callout: Today’s focus.)

Query processing: AND. Consider processing the query: Brutus AND Caesar. Locate Brutus in the Dictionary; retrieve its postings. Locate Caesar in the Dictionary; retrieve its postings. “Merge” the two postings. (Diagram: postings for Brutus and Caesar.)

The merge. Walk through the two postings simultaneously, in time linear in the total number of postings entries. (Diagram: postings for Brutus and Caesar; merged result 2, 8.) If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
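
A minimal sketch of the merge, with two pointers advancing through sorted postings. The postings lists below are hypothetical, chosen so that the intersection is 2 and 8; the function name is also hypothetical.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x+y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:               # docID in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:              # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]      # hypothetical postings
caesar = [1, 2, 3, 5, 8, 13, 21, 34]     # hypothetical postings
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is why sorted postings give the linear bound.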

Example: WestLaw. Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). Tens of terabytes of data; 700,000 users. Majority of users still use Boolean queries. Example query: What is the statute of limitations in cases involving the federal tort claims act? Expressed as: LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM, where /3 = within 3 words and /S = in same sentence.

Boolean queries: More general merges. Exercise: adapt the merge for the queries Brutus AND NOT Caesar and Brutus OR NOT Caesar. Can we still run through the merge in time O(x+y)? What can we achieve?
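
One possible answer to the first half of the exercise, sketched under the same sorted-postings assumption: AND NOT can still be done in one linear pass. OR NOT is different in kind, because its result includes documents that appear in neither list, so it depends on the whole collection rather than on x+y. The function name and data are hypothetical.

```python
def and_not(p1, p2):
    """docIDs that are in p1 but not in p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])         # p1[i] cannot occur in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                       # excluded: present in both lists
            j += 1
        else:
            j += 1                       # catch up in p2
    return answer


print(and_not([2, 4, 8, 16], [1, 2, 3, 5, 8, 13]))  # [4, 16]
```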

Merging. What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in “linear” time? Linear in what? Can we do better?

Query optimization. What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together. Query: Brutus AND Calpurnia AND Caesar. (Diagram: postings for Brutus, Calpurnia, Caesar.)

Query optimization example. Process in order of increasing freq: start with the smallest set, then keep cutting further. This is why we kept freq in the dictionary. (Diagram: postings for Brutus, Calpurnia, Caesar.) Execute the query as (Caesar AND Brutus) AND Calpurnia.

More general optimization. E.g., (madding OR crowd) AND (ignoble OR strife). Get freqs for all terms. Estimate the size of each OR by the sum of its freqs (conservative). Process in increasing order of OR sizes.
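
Both ordering heuristics reduce to a sort on frequency. The sketch below uses made-up frequencies (the slide's postings are not preserved), and the function names are hypothetical.

```python
def and_order(terms, freq):
    """Order the terms of a conjunctive query by increasing frequency."""
    return sorted(terms, key=lambda t: freq[t])


def or_group_order(groups, freq):
    """Order OR-groups by the sum of their term frequencies,
    a conservative upper bound on each group's result size."""
    return sorted(groups, key=lambda g: sum(freq[t] for t in g))


# Hypothetical document frequencies, for illustration only.
freq = {"brutus": 7, "calpurnia": 8, "caesar": 2,
        "madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}

print(and_order(["brutus", "calpurnia", "caesar"], freq))
# ['caesar', 'brutus', 'calpurnia']
print(or_group_order([("madding", "crowd"), ("ignoble", "strife")], freq))
# [('ignoble', 'strife'), ('madding', 'crowd')]
```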

Exercise. Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes).

Phrasal Queries. Retrieve documents with a specific phrase (ordered list of contiguous words), e.g. “information theory”. May allow intervening stop words and/or stemming: “buy camera” matches “buy a camera”, “buying the cameras”, etc.

Phrasal Retrieval with Inverted Indices. Must have an inverted index that also stores positions of each keyword in a document. Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. Best to start contiguity check with the least common word in the phrase.

Phrasal Search Algorithm
1. Find set of documents D in which all keywords (k1…km) in phrase occur (using AND query processing).
2. Initialize empty set, R, of retrieved documents.
3. For each document, d, in D do
4.   Get array, Pi, of positions of occurrences for each ki in d
5.   Find shortest array Ps of the Pi’s
6.   For each position p of keyword ks in Ps do
7.     For each keyword ki except ks do
8.       Use binary search to find a position (p - s + i) in the array Pi
9.     If correct position for every keyword found, add d to R
10. Return R
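
The steps above can be sketched in Python with a positional index of the form term -> {docID: sorted positions}. That index layout, the toy data, and the function names are assumptions for illustration; `bisect_left` from the standard library provides the binary search.

```python
from bisect import bisect_left


def contains(sorted_positions, target):
    """Binary search for target in a sorted list of positions."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target


def phrase_search(phrase, pos_index):
    words = phrase.split()
    # Step 1: documents in which every keyword occurs.
    doc_sets = [set(pos_index.get(w, {})) for w in words]
    candidates = set.intersection(*doc_sets) if doc_sets else set()
    results = set()                                        # step 2
    for d in candidates:                                   # step 3
        positions = [pos_index[w][d] for w in words]       # step 4
        # Step 5: index s of the keyword with the fewest occurrences.
        s = min(range(len(words)), key=lambda i: len(positions[i]))
        for p in positions[s]:                             # step 6
            # Steps 7-8: keyword i must sit at position p - s + i.
            if all(contains(positions[i], p - s + i)
                   for i in range(len(words)) if i != s):
                results.add(d)                             # step 9
                break
    return results                                         # step 10


# Hypothetical positional index: doc 1 is "information theory is fun",
# doc 2 is "theory of information".
pos_index = {
    "information": {1: [0], 2: [2]},
    "theory": {1: [1], 2: [0]},
}
print(phrase_search("information theory", pos_index))  # {1}
```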

Proximity Queries. List of words with specific maximal distance constraints between terms. Example: “dogs” and “race” within 4 words matches “…dogs will begin the race…”. May also perform stemming and/or not count stop words.

Proximity Retrieval with Inverted Index. Use an approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints. During binary search for positions of remaining keywords, find the closest position of ki to p and check that it is within the maximum allowed distance.
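
A sketch of the closest-position check for two keywords within one document. It assumes that "within d words" means the absolute difference of word positions is at most d; the slides do not pin down the exact definition. Function names and positions are hypothetical.

```python
from bisect import bisect_left


def closest(sorted_positions, p):
    """Position in a non-empty sorted list that is nearest to p."""
    i = bisect_left(sorted_positions, p)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = sorted_positions[max(i - 1, 0): i + 1]
    return min(neighbours, key=lambda q: abs(q - p))


def within(positions_a, positions_b, max_distance):
    """True if some occurrence of A lies within max_distance of some B."""
    if not positions_a or not positions_b:
        return False
    # Iterate over the shorter list, binary search in the longer one.
    short, long_ = sorted((positions_a, positions_b), key=len)
    return any(abs(closest(long_, p) - p) <= max_distance for p in short)


# "dogs will begin the race": dogs at position 0, race at position 4.
print(within([0], [4], 4))  # True
print(within([0], [9], 4))  # False
```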

Pattern Matching. Allow queries that match strings rather than word tokens. Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

Simple Patterns. Prefixes: pattern that matches the start of a word; “anti” matches “antiquity”, “antibody”, etc. Suffixes: pattern that matches the end of a word; “ix” matches “fix”, “matrix”, etc. Substrings: pattern that matches an arbitrary contiguous sequence of characters within a word; “rapt” matches “enrapture”, “velociraptor”, etc. Ranges: pair of strings that matches any word lexicographically (alphabetically) between them; “tin” to “tix” matches “tip”, “tire”, “title”, etc.
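
The four pattern types can be tried out on a small sorted vocabulary. In this sketch, prefix and range queries use binary search over the sorted word list, while suffix and substring queries fall back to a linear scan, which is exactly the inefficiency that more specialized structures are meant to avoid. The vocabulary and function names are hypothetical, and the prefix upper bound assumes no word contains the character U+FFFF.

```python
from bisect import bisect_left, bisect_right

vocab = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tip", "tire", "title", "velociraptor"])


def prefix_matches(prefix):
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")  # just past the last word with this prefix
    return vocab[lo:hi]


def range_matches(low, high):
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]


def suffix_matches(suffix):
    return [w for w in vocab if w.endswith(suffix)]   # linear scan


def substring_matches(sub):
    return [w for w in vocab if sub in w]             # linear scan


print(prefix_matches("anti"))       # ['antibody', 'antiquity']
print(suffix_matches("ix"))         # ['fix', 'matrix']
print(substring_matches("rapt"))    # ['enrapture', 'velociraptor']
print(range_matches("tin", "tix"))  # ['tip', 'tire', 'title']
```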

Allowing Errors. What if the query or document contains typos or misspellings? Judge similarity of words (or arbitrary strings) using: edit distance (cost of insert/delete/match); Longest Common Subsequence (LCS). Allow proximity search with a bound on string similarity.

Longest Common Subsequence (LCS). Length of the longest subsequence of characters shared by two strings. A subsequence of a string is obtained by deleting zero or more characters. Examples: “misspell” to “mispell” is 7; “misspelled” to “misinterpretted” is 7 (“mis…p…e…ed”).
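
The LCS length can be computed with the standard dynamic program, keeping only one previous row of the table. This is a sketch with a hypothetical function name; it reproduces both examples above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)            # LCS lengths for the previous prefix of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a char from a or b
        prev = curr
    return prev[-1]


print(lcs_length("misspell", "mispell"))            # 7
print(lcs_length("misspelled", "misinterpretted"))  # 7
```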

Regular Expressions. Language for composing complex patterns from simpler ones. An individual character is a regex. Union: if e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches. Concatenation: if e1 and e2 are regexes, then e1 e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2. Repetition: if e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1.

Regular Expression Examples. (u|e)nabl(e|ing) matches: unable, unabling, enable, enabling. (un|en)*able matches: able, unable, unenable, enununenable.
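
These two patterns can be checked with Python's `re` module, whose syntax agrees with the notation above for union, concatenation and repetition. `fullmatch` is used so that the whole word must match; the variable names and the non-matching test words are hypothetical.

```python
import re

first = re.compile(r"(u|e)nabl(e|ing)")
second = re.compile(r"(un|en)*able")

for word in ["unable", "unabling", "enable", "enabling"]:
    print(word, bool(first.fullmatch(word)))    # True for all four

for word in ["able", "unable", "unenable", "enununenable"]:
    print(word, bool(second.fullmatch(word)))   # True for all four

print(bool(first.fullmatch("disable")))   # False
print(bool(second.fullmatch("disable")))  # False
```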

Enhanced Regex’s (Perl). Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for a specific range of number of occurrences, {min,max}: A{1,5} is one to five A’s; A{5,} is five or more A’s; A{5} is exactly five A’s.

Perl Regex Examples. U.S. phone number with optional area code: /\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/. address: [pattern not preserved in this transcript]. Note: packages available to support Perl regex’s in Java.
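
The phone-number pattern can be tried in Python, whose `re` module accepts the same syntax for this expression. One observation worth hedging: because `\b` needs a word character on one side, the leading `\b` cannot match directly before "(" at the start of a string or after a space, so in those cases the pattern as written matches only the seven-digit part. The `PHONE_VARIANT` pattern below is a suggested adjustment made here, not something from the slides; all sample strings are made up.

```python
import re

# Pattern from the slide, without the Perl delimiters.
PHONE = re.compile(r"\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b")

print(PHONE.search("call 555-1234 now").group(0))  # 555-1234
print(PHONE.search("(214) 555-1234").group(0))     # 555-1234

# Hypothetical variant: move the word boundary after the optional area code.
PHONE_VARIANT = re.compile(r"(\(\d{3}\)\s?)?\b\d{3}-\d{4}\b")

print(PHONE_VARIANT.search("(214) 555-1234").group(0))  # (214) 555-1234
```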

Structural Queries. Assumes documents have structure that can be exploited in search. Structure could be: a fixed set of fields, e.g. title, author, abstract, etc.; or a hierarchical (recursive) tree structure. (Diagram labels: book; chapter; title; section; subsection.)

Queries with Structure. Allow queries for text appearing in specific fields: “nuclear fusion” appearing in a chapter title. SFQL: relational database query language SQL enhanced with “full text” search. Example: Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950.

Ranking search results. Boolean queries give inclusion or exclusion of docs. Often we want to rank/group results. Need to measure proximity from query to each doc. Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

The web and its challenges. Unusual and diverse documents. Unusual and diverse users, queries, information needs. Beyond terms, exploit ideas from social networks: link analysis, clickstreams... How do search engines work? And how can we make them better?

More sophisticated information retrieval: cross-language information retrieval; question answering; summarization; text mining; …

Perl Regex’s. Character classes: \w (word char): any alpha-numeric (not: \W); \d (digit char): any digit (not: \D); \s (space char): any whitespace (not: \S); . (wildcard): anything. Anchor points: \b (boundary): word boundary; ^: beginning of string; $: end of string.