Dictionaries and Tolerant retrieval

Slides:



Advertisements
Similar presentations
Hashing and Indexing John Ortiz.
Advertisements

B+-trees. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B = n pages I/O complexity:
Lecture 4: Dictionaries and tolerant retrieval
CpSc 881: Information Retrieval. 2 Dictionaries The dictionary is the data structure for storing the term vocabulary. Term vocabulary: the data Dictionary:
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 3 8/30/2010.
CS276A Information Retrieval
PrasadL05TolerantIR1 Tolerant IR Adapted from Lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan Tolerant Retrieval.
An Introduction to IR Lecture 3 Dictionaries and Tolerant retrieval 1.
1 ITCS 6265 Lecture 3 Dictionaries and Tolerant retrieval.
Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
CES 514 Data Mining Feb 17, 2010 Lecture 3: The dictionary and tolerant retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval.
1 Lecture 8: Data structures for databases II Jose M. Peña
BTrees & Bitmap Indexes
B+-tree and Hashing.
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 19: B-Trees: Data Structures for Disk.
Information Retrieval IR 5. Plan Last lecture Index construction This lecture Parametric and field searches Zones in documents Wild card queries Scoring.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
1 Geometric index structures April 15, 2004 Based on GUW Chapter , [Arge01] Sections 1, 2.1 (persistent B- trees), 3-4 (static versions.
Chair of Software Engineering Einführung in die Programmierung Introduction to Programming Prof. Dr. Bertrand Meyer Exercise Session 10.
Index Compression Lecture 4. Recap: lecture 3 Stemming, tokenization etc. Faster postings merges Phrase queries.
1 INF 2914 Information Retrieval and Web Search Lecture 9: Query Processing These slides are adapted from Stanford’s class CS276 / LING 286 Information.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
CS347 Lecture 2 April 9, 2001 ©Prabhakar Raghavan.
§6 B+ Trees 【 Definition 】 A B+ tree of order M is a tree with the following structural properties: (1) The root is either a leaf or has between 2 and.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
PrasadL05TolerantIR1 Tolerant IR Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Information Retrieval Techniques MS(CS) Lecture 5 AIR UNIVERSITY MULTAN CAMPUS.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4210: Information Retrieval and Search Engines Lecture 3: Dictionaries.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Dictionaries and Tolerant retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval Related to Chapter 3:
1 CPS216: Advanced Database Systems Notes 04: Operators for Data Access Shivnath Babu.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 3: tolerant retrieval.
File Processing : Index and Hash 2015, Spring Pusan National University Ki-Joune Li.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval Related to Chapter 3:
CSED101 INTRODUCTION TO COMPUTING TREE 2 Hwanjo Yu.
Index tuning-- B+tree. overview Overview of tree-structured index Indexed sequential access method (ISAM) B+tree.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Evidence from Content INST 734 Module 2 Doug Oard.
Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 3: Dictionaries and tolerant retrieval.
1 i206: Lecture 16: Data Structures for Disk; Advanced Trees Marti Hearst Spring 2012.
Week 9 - Monday.  What did we talk about last time?  Practiced with red-black trees  AVL trees  Balanced add.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Skip Pointers, Dictionaries and tolerant retrieval.
Information Retrieval On the use of the Inverted Lists.
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
Introduction to Information Retrieval Introduction to Information Retrieval Lectures 4-6: Skip Pointers, Dictionaries and tolerant retrieval.
An Introduction to IR Lecture 3 Dictionaries and Tolerant retrieval 1.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 2 1 Oct 2002.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Lectures 5: Dictionaries and tolerant retrieval
Why indexing? For efficient searching of a document
Information Retrieval Christopher Manning and Prabhakar Raghavan
COMP9319: Web Data Compression and Search
Query processing: optimizations
Dictionary data structures for the Inverted Index
Dictionary data structures for the Inverted Index
Modified from Stanford CS276 slides
Lecture 3: Dictionaries and tolerant retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Dictionary data structures for the Inverted Index
Lecture 3: Dictionaries and tolerant retrieval
Tolerant IR Adapted from Lectures by Prabhakar Raghavan (Google) and
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

Dictionaries and Tolerant retrieval

This lecture Dictionary data structures “Tolerant” retrieval Wild-card queries Spelling correction Soundex

Dictionary data structures for inverted indexes The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

A naïve dictionary An array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?

Dictionary data structures Two main choices: Hash table Tree Some IR systems use hashes, some trees

Hashes Each vocabulary term is hashed to an integer Pros: Cons: (We assume you’ve seen hashtables before) Pros: Lookup is faster than for a tree: O(1) Cons: No easy way to find minor variants: judgment/judgement No prefix search [tolerant retrieval] If vocabulary keeps going, need to occasionally do the expensive operation of rehashing everything

Tree: binary tree Root a-m n-z a-hu hy-m n-sh si-z huygens sickle zygot aardvark

Tree: B-tree a-hu n-z hy-m Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].

Trees Simplest: binary tree More usual: B+-trees Trees require a standard ordering of characters and hence strings … but we standardly have one Pros: Solves the prefix problem (terms starting with hyp) Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mitigate the rebalancing problem

Wild-card queries

Wild-card queries: * mon*: find all docs containing any word beginning “mon”. Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo *mon: find words ending in “mon”: harder Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w < non. Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ?

Query processing At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. E.g., consider the query: se*ate AND fil*er This may result in the execution of many Boolean AND queries.

B-trees handle *’s at the end of a query term How can we handle *’s in the middle of query term? co*tion We could look up co* AND *tion in a B-tree and intersect the two term sets Expensive The solution: transform wild-card queries so that the *’s occur at the end This gives rise to the Permuterm Index.

Permuterm index For term hello, index under: Queries: hello$, ello$h, llo$he, lo$hel, o$hell where $ is a special symbol. Queries: X lookup on X$ X* lookup on X*$ *X lookup on X$* *X* lookup on X* X*Y lookup on Y$X* Query = hel*o X=hel, Y=o Lookup o$hel*

Permuterm query processing Rotate query wild-card to the right Now use B-tree lookup as before. Permuterm problem: ≈ quadruples lexicon size Empirical observation for English.

Bigram (k-gram) indexes Enumerate all k-grams (sequence of k chars) occurring in any term e.g., from text “April is the cruelest month” we get the 2-grams (bigrams) $ is a special word boundary symbol Maintain a second inverted index from bigrams to dictionary terms that match each bigram. $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru, ue,el,le,es,st,t$, $m,mo,on,nt,h$

Bigram index example The k-gram index finds terms based on a query consisting of k-grams $m mace madden mo among amortize on among around

Processing n-gram wild-cards Query mon* can now be run as $m AND mo AND on Gets terms that match AND version of our wildcard query. But we’d enumerate moon. Must post-filter these terms against query. Surviving enumerated terms are then looked up in the term-document inverted index. Fast, space efficient (compared to permuterm).

Processing wild-card queries As before, we must execute a Boolean query for each enumerated, filtered term. Wild-cards can result in expensive query execution (very large disjunctions…) pyth* AND prog* If you encourage “laziness” people will respond! Does Google allow wildcard queries? Search Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.

B+-tree Records must be ordered over an attribute Queries: exact match and range queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”

B+-tree:properties Insert/delete at log F (N/B) cost; keep tree height-balanced. (F = fanout) Two types of nodes: index nodes and data nodes; each node is 1 page (disk based method)

Index node 57 81 95 to keys to keys to keys to keys < 57 57£ k<81 81£k<95 95£

Data node 57 81 95 with key 57 with key 81 To record with key 85 From non-leaf node to next leaf in sequence 57 81 95 with key 57 with key 81 To record with key 85

EX: B+ Tree of order 3. (a) Initial tree 60 20 , 40 80 5,10 20 40,50 Index level 20 , 40 80 5,10 20 40,50 60 80,100 Data level

Query Example Root 100 Range[32, 160] 120 150 180 30 3 5 11 120 130 30 35 100 101 110 180 200 150 156 179