Dictionary data structures for the Inverted Index


Dictionary data structures for the Inverted Index (Prof. Paolo Ferragina, Algoritmi per "Information Retrieval")

This lecture (Ch. 3): Dictionary data structures: exact search, prefix search. "Tolerant" retrieval: edit-distance queries, wild-card queries, spelling correction, Soundex.

Sec. 3.1 Basics The dictionary data structure stores the term vocabulary, but… in what data structure?

Sec. 3.1 A naïve dictionary An array of struct: char[20] (the term, 20 bytes), int (4/8 bytes), Postings* (4/8 bytes). How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
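For concreteness, a minimal sketch of such an entry (field names such as doc_freq are illustrative; the slide only specifies a char[20], an int and a Postings* pointer):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DictEntry:
    term: str                 # the slide's fixed char[20] field
    doc_freq: int             # the slide's int field (4/8 bytes)
    postings: List[int] = field(default_factory=list)   # stands in for the Postings* pointer

# The naive dictionary is then a (sorted) array of such entries,
# looked up by scanning or by binary search on `term`.
vocabulary: List[DictEntry] = []
```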

Sec. 3.1 Dictionary data structures Two main choices: hash table, or tree/trie. Some IR systems use hashes, some use trees/tries.

Hashing with chaining

Key issue: a good hash function Basic assumption: uniform hashing. Avg #keys per slot = n * (1/m) = n/m = α (the load factor), with m = Θ(n).
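A minimal sketch of hashing with chaining under these assumptions (Python's built-in hash stands in for the hash function; a real dictionary would keep m = Θ(n) by resizing):

```python
class ChainedHashTable:
    """Hashing with chaining: m slots, each holding a list (chain) of key/value pairs."""
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        self.slots[hash(key) % self.m].append((key, value))

    def lookup(self, key):
        # Under uniform hashing the expected chain length is the load factor alpha = n/m.
        for k, v in self.slots[hash(key) % self.m]:
            if k == key:
                return v
        return None

table = ChainedHashTable(m=8)
table.insert("paolo", [1, 5, 7])   # term -> postings list
print(table.lookup("paolo"))       # [1, 5, 7]
```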

In practice A trivial hash function for an integer key is "mod a prime" (but what if the key is a string?). A widely used string hash is MurmurHash: the current version (v3) yields a 32-bit or 128-bit hash value.
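For illustration, a generic polynomial string hash of the kind such slides allude to (a textbook stand-in, not MurmurHash; the constants are arbitrary):

```python
def poly_hash(s: str, p: int = 2**31 - 1, base: int = 131) -> int:
    """Interpret the string as a number in the given base, modulo a prime p."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % p
    return h

# slot index in a table of m cells:
m = 1 << 20
print(poly_hash("syzygy") % m)
```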

Another “provably good” hash Let L = bit length of the string and m = table size. Split the key k into r ≈ L / log2(m) pieces k0, k1, ..., kr of ≈ log2(m) bits each, and pick coefficients a0, a1, ..., ar, each ai selected at random in [0, m). Then h(k) = (Σi ai * ki mod p) mod m for a prime p; m itself is not necessarily prime.
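A sketch of this construction; the exact chunking and the choice of the intermediate prime p are assumptions on details the slide leaves implicit:

```python
import random

def make_chunked_hash(m: int, max_bits: int = 512, seed: int = 0):
    """One member of the family h(k) = (sum_i a_i * k_i mod p) mod m, where the key
    is split into pieces k_i of about log2(m) bits and each a_i is random in [0, m)."""
    chunk_bits = max(1, m.bit_length() - 1)     # ~ log2(m) bits per piece
    r = max_bits // chunk_bits + 1              # r ~ L / log2(m) pieces
    p = (1 << 61) - 1                           # a large prime for the intermediate modulus
    rng = random.Random(seed)
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(key: bytes) -> int:
        k = int.from_bytes(key, "big")
        total, i = 0, 0
        while k:
            total += a[i] * (k & ((1 << chunk_bits) - 1))
            k >>= chunk_bits
            i += 1
        return (total % p) % m

    return h

h = make_chunked_hash(m=1 << 20)
print(h(b"syzygy"))     # a slot in [0, 2^20)
```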

Sec. 3.1 Hashes Each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree, O(1) on average (turned into worst-case O(1) by Cuckoo hashing). Cons: only exact search (e.g. judgment vs. judgement); if the vocabulary keeps growing, you occasionally need to do the expensive rehashing (Cuckoo hashing «limits» those rehashes).

Cuckoo Hashing 2 hash tables, and 2 random choices where an item can be stored.

A running example (figures: keys A–F stored across the two tables; inserting a new key G may evict the current occupant of one of its two cells, which then moves to its alternate cell).

A running example (continued): we reverse the traversed edges, which keeps track of the other possible position of each moved key. View the tables as a random (bipartite) graph: node = cell, edge = key.

1 cycle, not a problem (figure: keys forming a single cycle among the cells). If #cells > 2 * #keys, then there is a low probability of cycling (but <50% space occupancy).

The double cycle (!!!) (figure: two joined cycles, on which insertion cannot succeed). If #cells > 2 * #keys, then a low probability of 1 cycle implies a low probability of 2 joined cycles.
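A minimal sketch of cuckoo insertion with two tables and a bounded eviction path (the two hash functions are simulated with Python's hash plus random seeds; max_kicks and the give-up-on-failure policy are simplifications of a real rehash):

```python
import random

class CuckooHash:
    def __init__(self, m, max_kicks=32):
        self.m, self.max_kicks = m, max_kicks
        self.tables = [[None] * m, [None] * m]           # two tables of m cells each
        self.seeds = (random.random(), random.random())  # two independent hash functions

    def _pos(self, which, key):
        return hash((self.seeds[which], key)) % self.m

    def lookup(self, key):
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        t = 0
        for _ in range(self.max_kicks):
            pos = self._pos(t, key)
            key, self.tables[t][pos] = self.tables[t][pos], key  # place key, evict occupant
            if key is None:
                return True
            t = 1 - t             # the evicted key goes to its cell in the other table
        return False              # likely a cycle: a real implementation rehashes here

t = CuckooHash(m=16)
for key in "ABCDEFG":
    t.insert(key)
print(t.lookup("G"), t.lookup("Z"))   # typically: True False
```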

Extensions for speed/space More than 2 hashes (choices) per key: a very different analysis (hypergraphs instead of graphs), higher memory utilization (3 choices: 90%, 4 choices: about 97%). Or 2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins. Insertion then sees a tree of possible evict+insert paths; the trade-offs are more insert time (and random access) on one side, more memory ...but more local accesses on the other.

Prefix search

Prefix-string Search Given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P over them. Ex. pattern P = pa, Dict = {abaco, box, paolo, patrizio, pippo, zoo}; the answers are paolo and patrizio. A simple baseline appears below; the next slides speed it up.
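As a point of comparison (not on the slides), a sorted array plus binary search already answers prefix queries; the trie of the next slide improves on this:

```python
from bisect import bisect_left

def prefix_search(sorted_dict, P):
    """Return all strings in the sorted dictionary that start with prefix P."""
    lo = bisect_left(sorted_dict, P)
    hi = bisect_left(sorted_dict, P + "\uffff")   # first position past every P-prefixed string
    return sorted_dict[lo:hi]

D = sorted(["abaco", "box", "paolo", "patrizio", "pippo", "zoo"])
print(prefix_search(D, "pa"))    # ['paolo', 'patrizio']
```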

Trie: speeding-up searches (figure: a compacted trie over {systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo}). Pro: O(p) search time = a single root-to-node path scan. Cons: space for edge labels, node labels and the tree structure.

Sec. 3.1 Tries Many variants and implementations exist. Pros: they solve the prefix problem. Cons: slower, O(p) time with many cache misses, and from 10 to 60 (or even more) bytes per node.
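A sketch of a plain (uncompacted) trie supporting the O(p) prefix walk; the compacted trie of the figure additionally stores string labels on edges to save nodes:

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        """Walk the prefix in O(p) node steps, then enumerate the whole subtrie."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out, stack = [], [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_word:
                out.append(s)
            stack.extend((child, s + ch) for ch, child in n.children.items())
        return out

t = Trie(["abaco", "box", "paolo", "patrizio", "pippo", "zoo"])
print(sorted(t.starts_with("pa")))   # ['paolo', 'patrizio']
```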

2-level indexing (figure: a compacted trie built on a sample of the strings, e.g. {systile, szaibelyite}, kept in internal memory; the full bucketed list systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo kept on disk). 2 advantages: search ≈ typically 1 I/O + in-memory computation; space ≈ a trie built over a subset of the strings. A disadvantage: trade-off ≈ speed vs. space (because of the bucket size).

Front-coding: squeezing the dict Each string of a sorted list is stored as the length of the prefix it shares with the previous string, followed by the remaining suffix; e.g. systile, syzygetic, syzygial, syzygy share prefixes of length 2, 5 and 5 with their predecessors, so only systile, zygetic, ial, y need to be spelled out.
Original URL list: http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html http://checkmate.com/All/Natural/Washcloth.html ...
Front-coded: 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html 25 Essence_Oils.html 25 Mineral_Bath_Crystals.html 38 Salt.html 33 Cream.html ... Gzip may be much better...
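A small sketch of front-coding and its decoding; representing each entry as a (shared-prefix length, suffix) pair is one common layout:

```python
def front_code(sorted_strings):
    """Encode each string as (shared-prefix length with the previous string, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(coded):
    out, prev = [], ""
    for lcp, suffix in coded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_code(words))   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
```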

2-level indexing with front-coded buckets (figure: a compacted trie built on a sample of the strings, e.g. {systile, szaibelyite}, kept in internal memory; on disk, the buckets are stored front-coded, e.g. systile, zygetic, ial, y, szaibelyite, czecin, omo with their shared-prefix lengths). 2 advantages: search ≈ typically 1 I/O + in-memory computation; space ≈ a trie built over a subset of the strings, plus front-coding over the buckets. A disadvantage: trade-off ≈ speed vs. space (because of the bucket size). A rough lookup sketch follows.
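A rough sketch of the two-level lookup under these assumptions: here the in-memory level is just the sorted list of each bucket's first string rather than a compacted trie, and buckets are kept uncompressed; fetching one bucket stands in for the single disk I/O:

```python
from bisect import bisect_right

class TwoLevelDict:
    def __init__(self, sorted_strings, bucket_size=128):
        # Disk level: fixed-size buckets of consecutive strings (front-coded in practice).
        self.buckets = [sorted_strings[i:i + bucket_size]
                        for i in range(0, len(sorted_strings), bucket_size)]
        # Memory level: one sampled string per bucket (its first string).
        self.sample = [b[0] for b in self.buckets]

    def lookup(self, term):
        i = max(0, bisect_right(self.sample, term) - 1)  # pick the candidate bucket in memory
        return term in self.buckets[i]                   # ~1 "I/O": scan only that bucket

D = sorted(["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"])
d = TwoLevelDict(D, bucket_size=3)
print(d.lookup("syzygy"), d.lookup("sygyzy"))   # True False
```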

Spelling correction

Sec. 3.3 Spell correction Two principal uses: correcting document(s) being indexed, and correcting queries to retrieve the "right" answers. Two main flavors: isolated word (check each word on its own for misspelling; but what about from → form?) and context-sensitive, which is more effective (look at the surrounding words, e.g., "I flew form Heathrow to Narita").

Sec. 3.3.2 Isolated word correction Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for it: a standard lexicon, such as Webster's English Dictionary or a hand-maintained "industry-specific" lexicon; or the lexicon of the indexed corpus, e.g. all words on the web, all names, acronyms etc. (including the mis-spellings), with mining algorithms to derive the possible corrections.

Sec. 3.3.2 Isolated word correction Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What's "closest"? We'll study several measures: edit distance (Levenshtein distance), weighted edit distance, n-gram overlap.

Sec. 3.3.4 Brute-force check of ED Given query Q, enumerate all character sequences within a preset (weighted) edit distance (e.g., 2), intersect this set with the list of "correct" words, and show the terms found to the user as suggestions. How does the time complexity grow with the number of errors allowed and the string length?
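A sketch of the enumeration for edit distance 1 over a lowercase alphabet (distance 2 would apply edits1 to every result again, which is exactly where the blow-up asked about above comes from):

```python
import string

def edits1(q: str):
    """All strings within (unweighted) edit distance 1 of q."""
    letters = string.ascii_lowercase
    splits = [(q[:i], q[i:]) for i in range(len(q) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggestions(q, lexicon):
    return edits1(q) & lexicon        # intersect the candidates with the "correct" words

print(suggestions("dof", {"dog", "of", "do", "doff", "cat"}))
# e.g. {'dog', 'of', 'do', 'doff'}
```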

Sec. 3.3.3 Edit distance Given two strings S1 and S2, the minimum number of operations to convert one into the other. Operations are typically character-level: insert, delete, replace (possibly, transposition). E.g., the edit distance from dof to dog is 1; from cat to act it is 2 (just 1 with transposition); from cat to dog it is 3. Generally found by dynamic programming.

DynProg for Edit Distance Let E(i,j) = edit distance to transform P[1..i] into T[1..j]. Consider the sequence of ops (insert, delete, substitute, match), ordered by the position involved in T. If P[i] = T[j] then the last op is a match, and thus it is not counted; otherwise the last op is subst(P[i], T[j]), or ins(T[j]), or del(P[i]). Base cases: E(i,0) = i, E(0,j) = j. Recurrence: E(i,j) = E(i-1,j-1) if P[i] = T[j]; E(i,j) = min{ E(i,j-1), E(i-1,j), E(i-1,j-1) } + 1 if P[i] ≠ T[j].

Example (figure: the dynamic-programming table E for a short pattern P against a text T, filled row by row; each cell adds +1 to the minimum of its left, upper and upper-left neighbors on a mismatch).
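The recurrence above, written out as a small dynamic program (indices are shifted by one because Python strings are 0-based):

```python
def edit_distance(P: str, T: str) -> int:
    """Classic O(|P|*|T|) DP; E[i][j] = edit distance between P[:i] and T[:j]."""
    m, n = len(P), len(T)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i                       # delete all of P[:i]
    for j in range(n + 1):
        E[0][j] = j                       # insert all of T[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if P[i - 1] == T[j - 1]:
                E[i][j] = E[i - 1][j - 1]                 # match: not counted
            else:
                E[i][j] = 1 + min(E[i][j - 1],            # insert T[j]
                                  E[i - 1][j],            # delete P[i]
                                  E[i - 1][j - 1])        # substitute
    return E[m][n]

print(edit_distance("cat", "dog"))   # 3
print(edit_distance("dof", "dog"))   # 1
```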

Sec. 3.3.3 Weighted edit distance As above, but the weight of an operation depends on the character(s) involved. Meant to capture keyboard errors, e.g. m is more likely to be mis-typed as n than as q, so replacing m by n has a smaller cost than replacing m by q. Requires a weight matrix as input; modify the DP to handle the weights, as in the sketch below.
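One way to thread weights through the same DP, assuming per-operation cost functions supplied by the caller (the keyboard-style costs below are made-up placeholders, not real confusion statistics):

```python
def weighted_edit_distance(P, T, sub_cost, ins_cost, del_cost):
    """Same DP as before, but each operation is charged a character-dependent weight."""
    m, n = len(P), len(T)
    E = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = E[i - 1][0] + del_cost(P[i - 1])
    for j in range(1, n + 1):
        E[0][j] = E[0][j - 1] + ins_cost(T[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = min(E[i - 1][j - 1] + sub_cost(P[i - 1], T[j - 1]),  # 0 if chars match
                          E[i - 1][j] + del_cost(P[i - 1]),
                          E[i][j - 1] + ins_cost(T[j - 1]))
    return E[m][n]

# Toy weights: replacing 'm' by 'n' (adjacent keys) is cheaper than any other substitution.
sub = lambda a, b: 0 if a == b else (0.5 if {a, b} == {"m", "n"} else 1.0)
print(weighted_edit_distance("mat", "nat", sub, lambda c: 1.0, lambda c: 1.0))  # 0.5
```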