Dictionary data structures for the Inverted Index

1 Dictionary data structures for the Inverted Index
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

2 This lecture
Ch. 3
Dictionary data structures: exact search, prefix search.
"Tolerant" retrieval: edit-distance queries, wild-card queries, spelling correction, Soundex.

3 Basics
Sec. 3.1
The dictionary data structure stores the term vocabulary, but… in what data structure?

4 A naïve dictionary
Sec. 3.1
An array of structs: char[20] (the term, 20 bytes), int (document frequency, 4/8 bytes), Postings* (pointer to the postings list, 4/8 bytes).
How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?

5 Dictionary data structures
Sec. 3.1
Two main choices: hash table, or tree/trie. Some IR systems use hashes, some use trees/tries.

6 Hashing with chaining
Prof. Paolo Ferragina, Univ. Pisa

7 Key issue: a good hash function
Basic assumption: uniform hashing.
Avg #keys per slot = n · (1/m) = n/m = α (the load factor), with m = Θ(n).
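As a concrete illustration, here is a minimal sketch of hashing with chaining (class and method names are my own, not from the slides): each of the m slots holds a list of colliding keys, and the load factor α = n/m is the average chain length.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m                               # number of slots
        self.slots = [[] for _ in range(m)]      # each slot holds a chain of (key, value)

    def _h(self, key):
        return hash(key) % self.m                # Python's built-in hash as a stand-in

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)          # overwrite an existing key
                return
        chain.append((key, value))

    def lookup(self, key):
        for k, v in self.slots[self._h(key)]:    # scan the chain: O(1 + alpha) expected
            if k == key:
                return v
        return None

    def load_factor(self):                       # alpha = n / m
        return sum(len(c) for c in self.slots) / self.m
```

With uniform hashing, a lookup costs O(1 + α) expected time, which is O(1) when m = Θ(n).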

8 In practice
A trivial hash function: interpret the key as an integer and reduce it modulo a prime (but what if the key is a string?).
A widely used practical choice is MurmurHash; the current version (MurmurHash3) yields a 32-bit or 128-bit hash value.
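A sketch of such a trivial hash for strings (the prime p and the base 256 are arbitrary choices of mine): treat the string as a number in base 256 and reduce it modulo p via Horner's rule.

```python
def trivial_string_hash(s, p=1_000_003):
    # Horner's rule: interpret s as a base-256 number, reduced mod a prime p.
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % p
    return h
```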

9 Another "provably good" hash
Let L = bit length of the string and m = table size. Split the key into r ≈ L / log2 m chunks k0, k1, …, kr of ≈ log2 m bits each, and pick coefficients a0, a1, …, ar, with each ai selected at random in [0, m).
The table size m need not be prime: compute (Σi ai·ki mod p) mod m, for a prime p.
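A sketch of this scheme (the prime p, the seed, and the chunk-count cap are my own choices): the key's bits are split into chunks of ≈ log2 m bits, each chunk is multiplied by a random coefficient ai ∈ [0, m), and the sum is reduced mod p and then mod m.

```python
import random

def make_chunk_hash(m, p=2_147_483_647, max_chunks=64, seed=42):
    rnd = random.Random(seed)
    a = [rnd.randrange(m) for _ in range(max_chunks)]  # each a_i random in [0, m)
    bits = max(1, (m - 1).bit_length())                # chunk width ~ log2 m
    mask = (1 << bits) - 1

    def h(key: bytes) -> int:
        x = int.from_bytes(key, "big")
        total, i = 0, 0
        while x or i == 0:                             # peel off chunks k_0, k_1, ...
            total += a[i] * (x & mask)
            x >>= bits
            i += 1
        return (total % p) % m
    return h
```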

10 Hashes
Sec. 3.1
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1) on average; it can be turned into worst-case O(1) by Cuckoo hashing.
Cons: only exact search is supported (e.g. judgment vs judgement); if the vocabulary keeps growing, you occasionally need to do an expensive rehashing (Cuckoo hashing «limits» those rehashes).

11 Cuckoo Hashing
2 hash tables, and 2 random choices of where an item can be stored.

12 A running example
(figure: keys A, B, C, D, E, F placed in the two tables)

13 A running example
(figure: keys A, B, C, D, E, F in the two tables, continued)

14 A running example
(figure: a new key G is inserted among A, B, C, D, E, F)

15 A running example
(figure: keys E, G, B, C, F, A, D after the evictions)
We reverse the traversed edges ⇒ each key keeps track of its other position. Random (bipartite) graph: node = cell, edge = key.

16 1 cycle, not a problem
(figure: a single cycle among keys A, B, D, F, G)
If #cells > 2 · #keys, then there is low probability of cycling (but <50% space occupancy).

17 The double cycle (!!!)
(figure: two joined cycles among keys A–G, so the insertion cannot terminate)
If #cells > 2 · #keys, then low probability of 1 cycle ⇒ low probability of 2 joined cycles.
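The insertion procedure illustrated by these slides can be sketched as follows (names are my own; the two hash functions are derived from salted blake2b digests, and on a suspected cycle the insert simply gives up, where a real implementation would rehash):

```python
import hashlib

class CuckooHash:
    def __init__(self, m):
        self.m = m
        self.t = [[None] * m, [None] * m]           # the 2 tables

    def _h(self, which, key):
        # two independent-looking hash functions, salted by the table index
        d = hashlib.blake2b(key.encode(), digest_size=8,
                            salt=bytes([which])).digest()
        return int.from_bytes(d, "big") % self.m

    def lookup(self, key):
        # a key can live in exactly 2 cells: O(1) worst-case lookup
        return (self.t[0][self._h(0, key)] == key or
                self.t[1][self._h(1, key)] == key)

    def insert(self, key, max_kicks=32):
        which = 0
        for _ in range(max_kicks):                  # walk the eviction path
            pos = self._h(which, key)
            key, self.t[which][pos] = self.t[which][pos], key
            if key is None:                         # found a free cell
                return True
            which = 1 - which                       # evicted key tries its other table
        return False                                # cycle suspected: rehash needed
```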

18 Extensions for speed/space
More than 2 hashes (choices) per key: a very different analysis, hypergraphs instead of graphs. Higher memory utilization (3 choices: 90%; 4 choices: about 97%), but more insert time (and random accesses).
2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins. Insertion sees a tree of possible evict+insert paths; more memory, ...but more local.

19 Prefix search

20 Prefix-string Search
Given a dictionary D of K strings, of total length N, store them in a way that efficiently supports prefix searches for a pattern P over them.
Ex. pattern P = pa, Dict = {abaco, box, paolo, patrizio, pippo, zoo}
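When the dictionary fits in memory as a sorted array, a prefix search is just two binary searches: all strings starting with P occupy a contiguous range. A minimal sketch (assuming no string contains the sentinel character U+FFFF):

```python
from bisect import bisect_left

def prefix_range(sorted_dict, p):
    # all strings prefixed by p form a contiguous run in sorted order
    lo = bisect_left(sorted_dict, p)
    hi = bisect_left(sorted_dict, p + "\uffff")  # sentinel past every p-prefixed string
    return sorted_dict[lo:hi]

# the dictionary from the slide's example
D = sorted(["abaco", "box", "paolo", "patrizio", "pippo", "zoo"])
```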

21 Trie: speeding-up searches
(figure: a compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo)
Pro: O(p) search time = a path scan. Cons: space for edge labels + node labels + the tree structure.

22 Tries
Sec. 3.1
Many variants and implementations exist.
Pros: solves the prefix problem.
Cons: slower, O(p) time with many cache misses; from 10 to 60 (or even more) bytes per node.
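A minimal uncompressed-trie sketch (the node layout is my own; a production trie would compact unary paths and pack nodes to reduce the per-node overhead mentioned above):

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}          # char -> TrieNode
        self.is_word = False

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def trie_prefix_search(root, prefix):
    node = root
    for ch in prefix:               # O(p) walk down the prefix path
        if ch not in node.children:
            return []
        node = node.children[ch]
    out, stack = [], [(node, prefix)]
    while stack:                    # collect every word below the prefix node
        n, s = stack.pop()
        if n.is_word:
            out.append(s)
        for ch, child in n.children.items():
            stack.append((child, s + ch))
    return sorted(out)
```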

23 2-level indexing
Two advantages: search ≈ typically 1 I/O + in-memory comparisons; space ≈ a trie built over a sample (subset) of the strings.
(figure: internal memory holds a compacted trie on the sample {systile, szaibelyite}; disk holds the buckets of strings: systile, syzygetic, syzigial, syzygygy, szaibelyite, szczecin, szomo)
A disadvantage: a speed-vs-space trade-off (because of the bucket size).

24 Front-coding: squeezing the dict
Store each string as the length of the prefix it shares with the previous string, plus the remaining suffix: systile, syzygetic, syzygial, syzygy → systile, (2)zygetic, (5)ial, (5)y. Savings of ≈45% are typical.
(figure: a front-coded URL list, e.g. 33 Applied.html, 34 roma.html, 38 1.html, 38 tic_Art.html, 34 yate.html, 35 er_Soap.html, 35 urvedic_Soap.html, 33 Bath_Salt_Bulk.html, 42 s.html, 25 Essence_Oils.html, 25 Mineral_Bath_Crystals.html, 38 Salt.html, 33 Cream.html, ...)
Gzip may be much better...
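The scheme above can be sketched directly (function names are mine): encoding replaces each string with (shared-prefix length with its predecessor, remaining suffix), and decoding rebuilds each string from the previous one.

```python
def front_encode(sorted_strings):
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0                      # length of the common prefix with prev
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))   # store only the non-shared suffix
        prev = s
    return out

def front_decode(encoded):
    out, prev = [], ""
    for lcp, suffix in encoded:
        s = prev[:lcp] + suffix      # rebuild from the previous string
        out.append(s)
        prev = s
    return out
```

Note that decoding a bucket requires scanning it from the start, which is why front-coding is applied per bucket in the 2-level scheme.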

25 2-level indexing
Two advantages: search ≈ typically 1 I/O + in-memory comparisons; space ≈ a trie built over a subset of the strings, plus front-coding over the buckets.
(figure: internal memory holds a compacted trie on the sample {systile, szaibelyite}; disk holds the front-coded buckets: …70systile 92zygetic 85ial 65y szaibelyite 82czecin 92omo…)
A disadvantage: a speed-vs-space trade-off (because of the bucket size).

26 Spelling correction

27 Spell correction
Sec. 3.3
Two principal uses: correcting document(s) being indexed, and correcting queries to retrieve the "right" answers.
Two main flavors: isolated word (check each word on its own for misspelling; but what about from → form?) and context-sensitive, which is more effective (look at the surrounding words, e.g., "I flew form Heathrow to Narita").

28 Isolated word correction
Sec.
Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for it:
A standard lexicon, such as Webster's English Dictionary, or a hand-maintained "industry-specific" lexicon.
The lexicon of the indexed corpus: e.g., all words on the web, all names, acronyms, etc. (including the mis-spellings), with mining algorithms to derive the possible corrections.

29 Isolated word correction
Sec.
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What's "closest"? We'll study several measures: edit distance (Levenshtein distance), weighted edit distance, n-gram overlap.

30 Brute-force check of ED
Sec.
Given query Q, enumerate all character sequences within a preset (weighted) edit distance (e.g., 2).
Intersect this set with the list of "correct" words, and show the terms found to the user as suggestions.
How does the time complexity grow with the number of errors allowed and the string length?
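To make the complexity question concrete, here is a sketch of the brute-force enumeration for edit distance 1 (in the style popularized by Norvig's spelling corrector; names are mine). For a query of length n over an alphabet of size s it already generates about n + (2n+1)·s candidate strings, and allowing distance 2 roughly squares that.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(q):
    # every string at edit distance <= 1 from q (insert / delete / replace)
    splits   = [(q[:i], q[i:]) for i in range(len(q) + 1)]
    deletes  = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts  = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts

def suggest(q, lexicon):
    # intersect the candidate set with the lexicon of "correct" words
    return sorted(edits1(q) & set(lexicon))
```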

31 Edit distance
Sec.
Given two strings S1 and S2, the minimum number of operations to convert one into the other. Operations are typically character-level: insert, delete, replace (possibly also transposition).
E.g., the edit distance from dof to dog is 1; from cat to act it is 2 (just 1 with transposition); from cat to dog it is 3.
Generally found by dynamic programming.

32 DynProg for Edit Distance
Let E(i,j) = edit distance to transform P1..i into T1..j.
Consider the sequence of ops (ins, del, subst, match), ordered by the position involved in T.
If P[i]=T[j] then the last op is a match, and thus it is not counted; otherwise the last op is subst(P[i],T[j]) or ins(T[j]) or del(P[i]).
E(i,0)=i, E(0,j)=j
E(i,j) = E(i–1,j–1) if P[i]=T[j]
E(i,j) = 1 + min{E(i,j–1), E(i–1,j), E(i–1,j–1)} if P[i]≠T[j]
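The recurrence above translates line-for-line into code:

```python
def edit_distance(P, T):
    n, m = len(P), len(T)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i                                # delete all of P[1..i]
    for j in range(m + 1):
        E[0][j] = j                                # insert all of T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if P[i - 1] == T[j - 1]:
                E[i][j] = E[i - 1][j - 1]          # match: not counted
            else:
                E[i][j] = 1 + min(E[i][j - 1],     # ins(T[j])
                                  E[i - 1][j],     # del(P[i])
                                  E[i - 1][j - 1]) # subst(P[i], T[j])
    return E[n][m]
```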

33 Example
(figure: the DP matrix for a pattern P against a text T of length 6, with +1 marking the non-matching transitions)

34 Weighted edit distance
Sec.
As above, but the weight of an operation depends on the character(s) involved. Meant to capture keyboard errors, e.g. m is more likely to be mis-typed as n than as q; therefore replacing m by n has a smaller cost than replacing m by q.
Requires a weight matrix as input; modify the DP to handle the weights.
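A sketch of the weighted variant (the specific costs below are illustrative assumptions of mine, not from the slides): the DP is unchanged except that the substitution term consults a per-character-pair cost, so m→n can be made cheaper than m→q.

```python
def weighted_edit_distance(P, T, sub_cost, ins_cost=1.0, del_cost=1.0):
    n, m = len(P), len(T)
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        E[i][0] = E[i - 1][0] + del_cost
    for j in range(1, m + 1):
        E[0][j] = E[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if P[i-1] == T[j-1] else sub_cost(P[i-1], T[j-1])
            E[i][j] = min(E[i][j - 1] + ins_cost,      # ins(T[j])
                          E[i - 1][j] + del_cost,      # del(P[i])
                          E[i - 1][j - 1] + sub)       # match or weighted subst
    return E[n][m]

# hypothetical keyboard-aware cost: confusing m and n is half price
def keyboard_cost(a, b):
    return 0.5 if {a, b} == {"m", "n"} else 1.0
```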

