Lecture 1: Introduction and the Boolean Model Information Retrieval

Lecture 1: Introduction and the Boolean Model Information Retrieval
By Dr. Huda Abdulaali Introduction to Information Retrieval. Manning et al.,

nature . . . that satisfies an information need from within large
INTRODUCTION Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections

Information retrieval (IR) is finding material (usually documents)
INTRODUCTION Information retrieval (IR) is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections (usually stored on computers). Document Collection: text units we have built an IR system. Usually documents But could be book chapters paragraphs scenes of a movie turns in a conversation...

Structured vs Unstructured Data
INTRODUCTION Structured vs Unstructured Data structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations; unstructured data is essentially the opposite.

Information Needs and Relevance
INTRODUCTION Information Needs and Relevance An information need is the topic about which the user desires to know more about. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need.

INTRODUCTION The field of IR also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents.

INTRODUCTION In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently In Enterprise and institutional search e.g company’s documentation, patents, research articles often domain-specific, centralized storage; dedicated machines for search In personal information retrieval, operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders

A short history of IR

Boolean Retrieval The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most adopted one. It is used by many IR systems to this day In the Boolean retrieval model we can pose any query in the form of a Boolean expression of term i.e., one in which terms are combined with the operators and, or, and not. Shakespeare example

An index term is either present(1) or absent(0) in the document
Basic Assumption of Boolean Model An index term is either present(1) or absent(0) in the document All index terms provide equal evidence with respect to information needs. Queries are Boolean combinations of index terms. X AND Y: represents doc that contains both X and Y X OR Y: represents doc that contains either X or Y NOT X: represents the doc that do not contain X

An example information retrieval problem
Brutus AND Caesar AND NOT Calpurnia fat book that many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and to read through all the text, The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as GREPPING through text

GREP is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. GREP was originally developed for the Unix operating system, but later available for all Unix-like systems.

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words. 2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans near countrymen with GREP, where near might be defined as “within 5 words” or “within the same sentence.” 3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance The result is a binary term-document incidence matrix the information retrieval literature normally speaks of terms (NOT WORDS) The result is a vector for each term Retrieval model can be categorize as Boolean retrieval model Vector space model Probabilistic model Model based on belief net

result of binary term-document
An example information retrieval problem result of binary term-document incidence matrix Main idea: record for each document whether it contains each word out of all the different words Shakespeare used. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): To answer the query Brutus and Caesar and not Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise and: and and = The answers for this query are thus Antony and Cleopatra and Hamlet

A first take at building an inverted index
The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance.

Within a document collection, we assume that each document has a unique docID serial number, known as the document identifier (docID). Then, collect the documents to be indexed. The core indexing step is sorting this list so that the terms are alphabetical.

Multiple occurrences of the same term from the same document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings The dictionary records some statistics, such as the number of document documents which contain each term The postings are secondarily sorted by docID. This provides the basis for efficient query processing.

Lecture 1: Introduction and the Boolean Model Information Retrieval

Similar presentations

Presentation on theme: "Lecture 1: Introduction and the Boolean Model Information Retrieval"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 1: Introduction and the Boolean Model Information Retrieval

Similar presentations

Presentation on theme: "Lecture 1: Introduction and the Boolean Model Information Retrieval"— Presentation transcript:

Similar presentations

About project

Feedback