Indexing. Efficient Retrieval

Presentation transcript:

Indexing

Efficient Retrieval

Documents x terms matrix:

         t1    t2   ...   tj   ...   tm     nf
    d1   w11   w12  ...   w1j  ...   w1m    1/|d1|
    d2   w21   w22  ...   w2j  ...   w2m    1/|d2|
    ...
    di   wi1   wi2  ...   wij  ...   wim    1/|di|
    ...
    dn   wn1   wn2  ...   wnj  ...   wnm    1/|dn|

wij is the weight of the term tj in document di. Most of the wij's are zero.

Naïve Retrieval

Given the query q = (q1, q2, ..., qj, ..., qm), nf = 1/|q|. How do we evaluate q (i.e., compute the similarity between q and each document)?

Method 1: Compare q with each document.
Data structure for documents: di: ((t1, wi1), (t2, wi2), ..., (tj, wij), ..., (tm, wim), 1/|di|)
–Only terms with positive weights are stored.
–Terms are kept in alphabetical order.
Data structure for the query: q: ((t1, q1), (t2, q2), ..., (tj, qj), ..., (tm, qm), 1/|q|)

Naïve Retrieval

Method 1: Compare q with all the documents (cont.)

Algorithm:
  initialize all sim(q, di) = 0;
  for each document di (i = 1, ..., n) {
      for each term tj (j = 1, ..., m)
          if tj appears in both q and di
              sim(q, di) += qj · wij;
      sim(q, di) = sim(q, di) · (1/|q|) · (1/|di|);
  }
  rank the documents in descending order of similarity and show the top k to the user.
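A minimal sketch of Method 1 in Python. The document and query layouts follow the slides (only positive-weight terms stored, plus the 1/|d| factor); the function and variable names are illustrative, not from the slides:

```python
# Naive Method 1: score the query against every document in turn.
def naive_retrieval(query_terms, query_nf, docs, k):
    """query_terms: list of (term, weight); query_nf: 1/|q|;
    docs: list of (term_weight_pairs, nf) tuples, one per document.
    Returns the top-k (doc_index, similarity) pairs."""
    q = dict(query_terms)                      # term -> query weight
    sims = []
    for i, (term_weights, nf) in enumerate(docs):
        sim = 0.0
        for term, w in term_weights:           # every stored entry of every
            if term in q:                      # document is scanned
                sim += q[term] * w
        sims.append((i, sim * query_nf * nf))
    sims.sort(key=lambda p: p[1], reverse=True)
    return sims[:k]
```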

Indexing

Method 1 is not efficient: all the zero entries of the documents x terms matrix are accessed.

Method 2: Use an inverted index file.
Several data structures:
For each term tj, an inverted list is created with all the documents that contain tj:
  I(tj) = { (d1, w1j), (d2, w2j), ..., (di, wij), ..., (dn, wnj) }
di is the identifier of the i-th document; only non-zero entries are kept.

Indexing

Method 2: Use of an inverted index file (cont.)
Several data structures:
–The normalization factors of the documents are pre-computed and stored in a vector: nf[i] stores 1/|di|.
–A hash table is created for all the terms in the collection: tj points to I(tj).
Inverted lists are typically stored on disk; the number of distinct terms is typically very high.
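A hedged sketch of these structures in Python: a dict plays the role of the hash table from term to inverted list, and nf holds the pre-computed 1/|di| factors (the {term: weight} input format is an assumption for illustration):

```python
import math

def build_inverted_index(docs):
    """docs: list of {term: weight} dicts, one per document.
    Returns (index, nf): index maps term -> I(term) as [(doc_id, weight), ...],
    and nf[i] holds the pre-computed 1/|d_i|."""
    index = {}                                   # hash table: t_j -> I(t_j)
    nf = []
    for i, doc in enumerate(docs):
        nf.append(1.0 / math.sqrt(sum(w * w for w in doc.values())))
        for term, w in doc.items():              # only non-zero entries
            index.setdefault(term, []).append((i, w))
    return index, nf
```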

Inverted file creation

(Figure: the dictionary file and the pointers file.)

Retrieval with inverted lists

Algorithm:
  initialize all sim(q, di) = 0;
  for each term tj in q {
      find I(tj) using the hash table;
      for each (di, wij) in I(tj)
          sim(q, di) += qj · wij;
  }
  for each document di
      sim(q, di) = sim(q, di) · (1/|q|) · nf[i];
  rank the documents in descending order of similarity and show the top k to the user.
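A minimal Python sketch of this algorithm, assuming index and nf as built by the build_inverted_index sketch above (the accumulator dict means untouched documents simply never appear, matching the observations on the next slide):

```python
import math

def retrieve(query, index, nf, k):
    """query: {term: weight}; index and nf as built by build_inverted_index.
    Returns the top-k (doc_id, similarity) pairs."""
    q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))
    sims = {}                                  # documents never touched stay out
    for term, q_w in query.items():
        for doc_id, w in index.get(term, []):  # look up I(t_j) in the hash table
            sims[doc_id] = sims.get(doc_id, 0.0) + q_w * w
    ranked = sorted(((d, s * q_nf * nf[d]) for d, s in sims.items()),
                    key=lambda p: p[1], reverse=True)
    return ranked[:k]
```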

Retrieval with inverted lists

Observations about Method 2:
–If a document d contains no term of the query q, then d is not involved in the evaluation of q;
–Only the non-zero entries of the documents x terms matrix are used in query evaluation;
–The similarities of several documents are accumulated simultaneously, term by term.

Retrieval with inverted lists

Example (Method 2):
q = { (t1, 1), (t3, 1) }, 1/|q| = 1/√2 ≈ 0.71
d1 = { (t1, 2), (t2, 1), (t3, 1) }, nf[1] = 1/√6 ≈ 0.41
d2 = { (t2, 2), (t3, 1), (t4, 1) }, nf[2] = 1/√6 ≈ 0.41
d3 = { (t1, 1), (t3, 1), (t4, 1) }, nf[3] = 1/√3 ≈ 0.58
d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) }, nf[4] = 1/√13 ≈ 0.28
d5 = { (t2, 2), (t4, 1), (t5, 2) }, nf[5] = 1/√9 ≈ 0.33

I(t1) = { (d1, 2), (d3, 1), (d4, 2) }
I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }
I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }
I(t4) = { (d2, 1), (d3, 1), (d4, 1), (d5, 1) }
I(t5) = { (d5, 2) }

Retrieval with inverted lists

After processing t1:
  sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1, sim(q, d4) = 2, sim(q, d5) = 0
After processing t3:
  sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2, sim(q, d4) = 4, sim(q, d5) = 0
After normalization:
  sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82, sim(q, d4) = .78, sim(q, d5) = 0
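The short self-contained script below runs the slide's example end to end and reproduces these numbers (layout and names are illustrative):

```python
import math

docs = [{"t1": 2, "t2": 1, "t3": 1},           # d1
        {"t2": 2, "t3": 1, "t4": 1},           # d2
        {"t1": 1, "t3": 1, "t4": 1},           # d3
        {"t1": 2, "t2": 1, "t3": 2, "t4": 2},  # d4
        {"t2": 2, "t4": 1, "t5": 2}]           # d5
query = {"t1": 1, "t3": 1}

nf = [1.0 / math.sqrt(sum(w * w for w in d.values())) for d in docs]
q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))

index = {}                                     # the I(t_j) lists
for i, d in enumerate(docs):
    for t, w in d.items():
        index.setdefault(t, []).append((i, w))

sims = [0.0] * len(docs)
for t, q_w in query.items():                   # only I(t1) and I(t3) are scanned
    for i, w in index.get(t, []):
        sims[i] += q_w * w

for i, s in enumerate(sims):
    print(f"sim(q, d{i+1}) = {s * q_nf * nf[i]:.2f}")
# -> 0.87, 0.29, 0.82, 0.78, 0.00, matching the slide.
```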

Efficiency x flexibility

Storing the final weights is good for efficiency but bad for flexibility:
–recomputation is necessary if the tf or idf formulas change.
Flexibility can be improved by storing the raw tf and df information, but efficiency is worse.
A compromise exists:
–store the tf weights in the index;
–apply the idf weights to the tf weights of the query terms instead of to the tf weights of the document terms.
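One way to read the compromise, as a hedged sketch: the index stores raw tf only, and idf is folded into the query-side weights at query time, so changing the idf formula never forces a re-index. The log(N/df) formula below is one common idf choice, not necessarily the one the slides assume:

```python
import math

def query_weights(query_tf, df, n_docs):
    """Fold idf into the query-term tf weights at query time.
    query_tf: {term: tf in the query}; df: {term: document frequency}."""
    return {t: tf * math.log(n_docs / df[t])   # tf_q * idf; index keeps raw tf
            for t, tf in query_tf.items() if t in df}
```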

Inverted indexes

The inverted index is the main structure for indexing.
Main idea:
–invert the documents into one big index file.
Basic steps:
–create a "dictionary" of all the tokens in the collection;
–for each token, list all the documents in which the token occurs;
–process the structure to avoid redundancy.

Inverted indexes

An inverted file is composed of lists, one per term, each row holding the documents that contain that term; it corresponds to the transpose of the documents x terms matrix.

Inverted index file creation

Documents are analyzed to extract tokens; each token is saved together with its doc-ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
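A minimal sketch of this extraction step for the two example documents; the lowercasing and the simple regex tokenizer are assumptions for illustration, not prescribed by the slides:

```python
import re

docs = {1: "Now is the time for all good men to come to the aid of their country",
        2: "It was a dark and stormy night in the country manor. "
           "The time was past midnight"}

# Emit one (token, doc_id) pair per occurrence, then sort for the next step.
pairs = [(tok, doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[a-z]+", text.lower())]
pairs.sort()
print(pairs[:4])   # [('a', 2), ('aid', 1), ('all', 1), ('and', 2)]
```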

Inverted index file creation

After all the documents have been analyzed, the index file is sorted alphabetically.

Inverted index file creation

Multiple entries of a term for a single document can be merged, and term frequency information is compiled.

Inverted index file creation

Then the file can be split into: a "Dictionary" file and a "Pointers" file.

Inverted index files

Inverted index files allow faster access to individual terms. For each term, a list is obtained with:
–the document identifier (doc-ID);
–the term frequency in the document;
–the position of the term in the document.
These lists can be used to answer Boolean queries:
  country -> d1, d2
  manor -> d2
  country AND manor -> d2
They can also be employed in ranking algorithms.
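A hedged sketch of the AND operation as a linear merge of two posting lists, assuming the doc-IDs in each list are kept in sorted order (the function name is illustrative):

```python
def boolean_and(postings_a, postings_b):
    """Intersect two posting lists of doc-IDs kept in sorted order."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i]); i += 1; j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

print(boolean_and(["d1", "d2"], ["d2"]))   # country AND manor -> ['d2']
```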

Use of inverted files

(Figure: the dictionary and pointers files.)

Query: "time" AND "dark"
–"time" appears in 2 docs in the dictionary -> IDs 1 and 2 in the pointers file;
–"dark" appears in 1 doc in the dictionary -> ID 2 in the pointers file.
So only doc 2 satisfies the query.