Lecture 1: Introduction and the Boolean Model Information Retrieval


Lecture 1: Introduction and the Boolean Model Information Retrieval By Dr. Huda Abdulaali 2017-2018 Introduction to Information Retrieval. Manning et al.,

INTRODUCTION Information retrieval (IR) is finding material . . . of an unstructured nature . . . that satisfies an information need from within large collections . . . .

INTRODUCTION Information retrieval (IR) is finding material (usually documents) of an unstructured nature . . . that satisfies an information need from within large collections (usually stored on computers). Document collection: the text units over which we have built an IR system. These are usually documents, but could be book chapters, paragraphs, scenes of a movie, turns in a conversation...

INTRODUCTION Structured vs Unstructured Data. Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable by simple, straightforward search engine algorithms or other search operations. Unstructured data is essentially the opposite.

INTRODUCTION Information Needs and Relevance. An information need is the topic about which the user desires to know more. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need.

INTRODUCTION The field of IR also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents.

INTRODUCTION In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues include gathering documents for indexing and building systems that work efficiently at this scale. In enterprise and institutional search (e.g., a company's documentation, patents, research articles), collections are often domain-specific, with centralized storage and dedicated machines for search. In personal information retrieval, operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). Email programs usually provide not only search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders.

A short history of IR

Boolean Retrieval The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is used by many IR systems to this day. In the Boolean retrieval model we can pose any query in the form of a Boolean expression of terms, i.e., one in which terms are combined with the operators AND, OR, and NOT. The running example is Shakespeare's plays.

Basic Assumption of Boolean Model An index term is either present (1) or absent (0) in the document. All index terms provide equal evidence with respect to information needs. Queries are Boolean combinations of index terms. X AND Y: represents the docs that contain both X and Y. X OR Y: represents the docs that contain either X or Y (or both). NOT X: represents the docs that do not contain X.
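The three operators can be read as set operations over document identifiers. The following is a minimal sketch under that reading; the toy collection `DOCS` and the helper `docs_with` are hypothetical names introduced only for illustration.

```python
# Hypothetical toy collection: docID -> set of index terms present in the document.
DOCS = {
    1: {"brutus", "caesar"},
    2: {"caesar", "calpurnia"},
    3: {"brutus"},
}
ALL_IDS = set(DOCS)


def docs_with(term):
    """Return the set of docIDs whose document contains the term."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}


x, y = docs_with("brutus"), docs_with("caesar")

both = x & y                                # X AND Y: docs containing both terms
either = x | y                              # X OR Y: docs containing at least one term
without = ALL_IDS - docs_with("calpurnia")  # NOT X: docs that do not contain the term

print(both, either, without)
```

Note that NOT needs the full set of docIDs to complement against, which is one reason NOT is usually combined with AND in practice.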

An example information retrieval problem: Brutus AND Caesar AND NOT Calpurnia. A fat book that many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text.
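A linear scan of this kind might look like the sketch below. The texts in `PLAYS` are short hypothetical placeholders, not Shakespeare's actual wording, and `linear_scan` is an illustrative name; the point is only that every document is read in full for every query.

```python
import re

# Hypothetical stand-ins for the plays; real texts would be far longer.
PLAYS = {
    "Play A": "Brutus greets Caesar.",
    "Play B": "Caesar and Calpurnia meet Brutus.",
    "Play C": "Brutus walks alone.",
}


def linear_scan(plays, required, forbidden):
    """Read every text from start to finish and keep the titles that
    contain all required words and none of the forbidden ones."""
    hits = []
    for title, text in plays.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        has_required = all(w in words for w in required)
        has_forbidden = any(w in words for w in forbidden)
        if has_required and not has_forbidden:
            hits.append(title)
    return hits


matches = linear_scan(PLAYS, ["brutus", "caesar"], ["calpurnia"])
print(matches)
```

The cost grows with the total size of the collection on each query, which is what motivates building an index in advance.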

An example information retrieval problem GREP is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. GREP was originally developed for the Unix operating system, but later became available for all Unix-like systems.

An example information retrieval problem For many purposes a linear scan is not enough. We need more:
1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with GREP, where NEAR might be defined as "within 5 words" or "within the same sentence."
3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words.
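To make the NEAR operator in point 2 concrete, here is a small sketch that takes "within 5 words" literally, counting the distance between token positions. The function name `near`, the parameter `k`, and the simple tokenizer are assumptions made for illustration; real systems define proximity in various ways.

```python
import re


def near(text, first, second, k=5):
    """Return True if some occurrence of `first` lies within k token
    positions of some occurrence of `second`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_first = [i for i, tok in enumerate(tokens) if tok == first]
    pos_second = [i for i, tok in enumerate(tokens) if tok == second]
    return any(abs(i - j) <= k for i in pos_first for j in pos_second)


close = near("Friends, Romans, countrymen, lend me your ears", "romans", "countrymen")
far = near("romans one two three four five six countrymen", "romans", "countrymen")
print(close, far)
```

This check depends on word positions rather than on lines, which is why a line-oriented regular expression tool is a poor fit for it.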

An example information retrieval problem The way to avoid linearly scanning the texts for each query is to index the documents in advance. The result is a binary term-document incidence matrix. The information retrieval literature normally speaks of terms (not words) as the indexed units. Reading the matrix by rows gives a vector for each term. Retrieval models can be categorized as: the Boolean retrieval model, the vector space model, the probabilistic model, and models based on belief nets.

An example information retrieval problem Result: a binary term-document incidence matrix. Main idea: record for each document whether it contains each word out of all the different words Shakespeare used. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
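The matrix definition can be sketched in a few lines. The three-document collection `DOCS` below is a hypothetical stand-in for the plays, and whitespace splitting is an assumed, deliberately naive tokenizer.

```python
# Hypothetical toy collection standing in for the plays (one column per document).
DOCS = {
    "d1": "brutus caesar",
    "d2": "caesar calpurnia",
    "d3": "brutus",
}


def incidence_matrix(docs):
    """Build a binary term-document incidence matrix.

    Returns the column order and a mapping term -> row of 0/1 values,
    where row[j] is 1 if the document in column j contains the term."""
    columns = list(docs)
    doc_terms = {d: set(text.split()) for d, text in docs.items()}
    vocabulary = sorted(set().union(*doc_terms.values()))
    matrix = {
        term: [1 if term in doc_terms[d] else 0 for d in columns]
        for term in vocabulary
    }
    return columns, matrix


columns, matrix = incidence_matrix(DOCS)
for term, row in matrix.items():
    print(f"{term:10s} {row}")
```

Each row is the vector for one term; each column describes one document.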

An example information retrieval problem To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100. The answers for this query are thus Antony and Cleopatra and Hamlet.
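The same arithmetic can be checked with integer bit operations. The bit vectors come from the slide; the column order in `PLAYS` is an assumption taken from the incidence-matrix figure in the Manning et al. textbook, since the figure itself did not survive in this transcript.

```python
# Assumed column order (leftmost bit first), following the textbook figure.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # its complement over six columns is 101111

# Python integers are unbounded, so mask the complement to six bits.
mask = (1 << len(PLAYS)) - 1
result = brutus & caesar & (~calpurnia & mask)

bits = format(result, "06b")
answers = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, answers)
```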

A first take at building an inverted index The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance.

A first take at building an inverted index Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). First, collect the documents to be indexed. Each document is then turned into a list of (term, docID) pairs. The core indexing step is sorting this list so that the terms are alphabetical.

A first take at building an inverted index Multiple occurrences of the same term from the same document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings. The dictionary records some statistics, such as the number of documents which contain each term. The postings are secondarily sorted by docID. This provides the basis for efficient query processing.
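The sort-based construction described on these slides can be sketched as follows. The two short documents, the lowercasing, and the whitespace tokenizer are assumptions made for illustration; `build_index` is a hypothetical name, not an established API.

```python
from itertools import groupby

# Hypothetical toy collection: docID -> text.
DOCS = {
    1: "i did enact julius caesar",
    2: "so let it be with caesar caesar was ambitious",
}


def build_index(docs):
    """Build an inverted index by sorting (term, docID) pairs."""
    # 1. Turn every document into (term, docID) pairs.
    pairs = [(token, doc_id)
             for doc_id, text in docs.items()
             for token in text.lower().split()]
    # 2. Core step: sort by term, then by docID.
    pairs.sort()
    # 3. Merge multiple occurrences of a term within the same document.
    unique_pairs = [pair for pair, _ in groupby(pairs)]
    # 4. Group by term and split into dictionary and postings.
    dictionary, postings = {}, {}
    for term, group in groupby(unique_pairs, key=lambda pair: pair[0]):
        doc_ids = [doc_id for _, doc_id in group]
        postings[term] = doc_ids        # already sorted by docID
        dictionary[term] = len(doc_ids)  # number of documents containing the term
    return dictionary, postings


dictionary, postings = build_index(DOCS)
print(dictionary["caesar"], postings["caesar"])
```

Looking up a term is then a single dictionary access followed by reading its postings list, instead of a scan over every document.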