Introduction to Information Retrieval: Dictionary indexing. Adapted from Christopher Manning and Prabhakar Raghavan.


1 Introduction to Information Retrieval: Dictionary indexing (adapted from Christopher Manning and Prabhakar Raghavan)

2 String Search: Given a dictionary D of K strings, of total length N, store them so that we can efficiently support searches for a pattern P over them. Hashing?

3 Hashing with chaining

4 Key issue: a good hash function. Basic assumption: uniform hashing. Avg #keys per slot = n * (1/m) = n/m = α (load factor)

5 Search cost; m = Θ(n)
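
Slides 3 to 5 can be illustrated with a minimal sketch of hashing with chaining. The class name `ChainedHashTable` and its methods are hypothetical, and Python's built-in `hash` stands in for the hash function h, which the slides do not specify.

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (one list per slot)."""

    def __init__(self, m=8):
        self.m = m                                # number of slots
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: Python's built-in hash plays the role of h.
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._slot(key)]
        if key not in chain:                      # keep each key once
            chain.append(key)

    def search(self, key):
        # Only one chain is scanned, so the cost grows with its length.
        return key in self.slots[self._slot(key)]

    def load_factor(self):
        # alpha = n / m, the average number of keys per slot.
        n = sum(len(chain) for chain in self.slots)
        return n / self.m
```

Under the uniform hashing assumption a chain holds α keys on average, so keeping m proportional to n keeps the expected search cost constant.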

6 In practice we use simple hash functions: prime

7 Do "provably good" hashes exist? The key k is split into pieces k_0, k_1, k_2, ..., k_r, each of ≈ log_2 m bits, so r ≈ log_2 U / log_2 m. The vector a = (a_0, a_1, a_2, ..., a_r) has each a_i selected at random in [0, m). U = universe of keys; m = table size, a prime. Primality of m is not strictly necessary: use (... mod p) mod m.
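
A minimal sketch of such a randomly drawn hash function follows. The transcript does not show the formula itself, so the sketch assumes the usual dot-product form h_a(k) = (sum of a_i * k_i) mod m with m prime; the function name `make_universal_hash` and the parameter `key_bits` (the bit length of keys, about log_2 U) are hypothetical.

```python
import random

def make_universal_hash(m, key_bits, rng=random):
    """Draw a = (a_0..a_r) at random in [0, m) and return h_a.

    Assumes m is a prime >= 2 and keys are non-negative integers
    of at most key_bits bits.
    """
    piece_bits = m.bit_length() - 1           # floor(log2 m): every piece is < m
    pieces = -(-key_bits // piece_bits)       # r + 1 pieces, rounded up
    a = [rng.randrange(m) for _ in range(pieces)]
    mask = (1 << piece_bits) - 1

    def h(k):
        total = 0
        for a_i in a:                         # dot product of a with the pieces of k
            total += a_i * (k & mask)
            k >>= piece_bits
        return total % m

    return h
```

Drawing a fresh vector a (for example after too many collisions) yields a different function from the same family.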

8 Cuckoo Hashing: 2 hash tables, and 2 random choices where an item can be stored. [Figure: cells holding A, B, C, E, D]

9 A running example. [Figure: cells hold A, B, C, E, D; key F arrives]

10 A running example. [Figure: cells hold A, B, F, C, E, D]

11 A running example. [Figure: cells hold A, B, F, C, E, D; key G arrives]

12 A running example. [Figure: cells hold E, G, B, F, C, A, D]
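
The running example can be mimicked with a minimal sketch of cuckoo insertion. Everything named here is hypothetical: the class `CuckooHashTable`, the salted use of Python's built-in `hash` as the two hash functions, the eviction limit `max_kicks`, and the choice to grow the tables on a rebuild (the slides do not say how a failed insertion is handled).

```python
import random

class CuckooHashTable:
    """Two tables, two candidate cells per key, eviction on collision."""

    def __init__(self, size=11, max_kicks=32, seed=0):
        self.size = size
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self._reset()

    def _reset(self):
        self.tables = [[None] * self.size, [None] * self.size]
        # Assumption: salting the built-in hash gives two "random" functions.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _pos(self, i, key):
        return hash((self.salts[i], key)) % self.size

    def contains(self, key):
        # At most two cells are probed.
        return any(self.tables[i][self._pos(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        i = 0
        for _ in range(self.max_kicks):
            p = self._pos(i, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[i][p] = self.tables[i][p], key
            if key is None:
                return
            i = 1 - i                 # the evicted key tries its other table
        self._rebuild(key)

    def _rebuild(self, pending):
        # Too many evictions: grow, draw new hashes, reinsert everything.
        keys = [k for t in self.tables for k in t if k is not None]
        self.size = 2 * self.size + 1
        self._reset()
        for k in keys + [pending]:
            self.insert(k)
```

Inserting the keys A to G in order reproduces the kind of eviction chain shown in the figures, although the exact cell layout depends on the hash functions drawn.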

13 Cuckoo Hashing Examples. Random (bipartite) graph: node = cell, edge = key. [Figure: cells holding A, B, C, E, D, with keys F and G to insert]

14 Natural Extensions
- More than 2 hashes (choices) per key.
  - Very different: hypergraphs instead of graphs.
  - Higher memory utilization:
    - 3 choices: 90+% in experiments
    - 4 choices: about 97%
  - but more insert time (and random access)
- 2 hashes + bins of B-size.
  - Balanced allocation and tightly O(1)-size bins
  - Insertion sees a tree of possible evict+ins paths
  - more memory... but more local

15 Minimal Ordered Perfect Hashing: m = 1.25 n (n = 12 ⇒ m = 15). The hash functions h_1 and h_2 are not perfect.

16 h(t) = [ g(h_1(t)) + g(h_2(t)) ] mod n. The computed h is perfect, and no strings need to be stored. Space is negligible for h_1 and h_2, and m log n for g.

17 How to construct it: Term = edge, its vertices are given by h_1 and h_2. Start with all g(v) = 0; then assign g() by difference with the known h(). Acyclic ⇒ ok. Not acyclic ⇒ regenerate the hashes.
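
The construction on slides 15 to 17 can be sketched as follows. The names `build_moph`, `c`, and `max_tries` are hypothetical, and salted calls to Python's built-in `hash` stand in for h_1 and h_2. The sketch defaults to a ratio m/n larger than the 1.25 on the slide, on the assumption that a sparser random graph is acyclic more often, so fewer regenerations are needed.

```python
import random

def build_moph(terms, c=2.1, seed=0, max_tries=1000):
    """Return h with h(terms[i]) == i, or raise after max_tries attempts."""
    n = len(terms)
    m = int(c * n) + 1                          # number of graph vertices
    rng = random.Random(seed)
    for _ in range(max_tries):
        s1, s2 = rng.getrandbits(64), rng.getrandbits(64)
        h1 = lambda t, s=s1: hash((s, t)) % m   # assumption: salted built-in hash
        h2 = lambda t, s=s2: hash((s, t)) % m
        g = _assign_g(terms, h1, h2, m)
        if g is not None:                       # graph was acyclic
            return lambda t, g=g, h1=h1, h2=h2: (g[h1(t)] + g[h2(t)]) % n
    raise RuntimeError("no acyclic graph found; increase c or max_tries")

def _assign_g(terms, h1, h2, m):
    n = len(terms)
    adj = [[] for _ in range(m)]
    for i, t in enumerate(terms):               # term i is the edge (h1, h2)
        u, v = h1(t), h2(t)
        if u == v:
            return None                         # self-loop counts as a cycle
        adj[u].append((v, i))
        adj[v].append((u, i))
    g = [None] * m
    for root in range(m):
        if g[root] is not None:
            continue
        g[root] = 0                             # free choice per component
        stack = [(root, -1)]
        while stack:
            u, via = stack.pop()
            for v, i in adj[u]:
                if i == via:
                    continue                    # do not walk back along the same edge
                if g[v] is not None:
                    return None                 # cycle: regenerate the hashes
                g[v] = (i - g[u]) % n           # so that g[u] + g[v] = i (mod n)
                stack.append((v, i))
    return g
```

Only the table g is kept with the two hash functions; the terms themselves are not stored, which matches the space claim on slide 16.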

18 Prefix Search: Given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P over them.

19 Array of strings (pointers…): systile, syzygetic, syzygial, syzygy. Search = O(|P| * log_2 K) time, O(log_2 K) I/Os. Space = N + 4K bytes. I/O = cache misses (esp. range search).
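
A minimal sketch of prefix search over a sorted array, using two binary searches that compare only the first |P| characters of each string. The function names are hypothetical.

```python
def _boundary(strings, p, strict):
    """First index whose |p|-character head is >= p (or > p if strict)."""
    lo, hi = 0, len(strings)
    while lo < hi:
        mid = (lo + hi) // 2
        head = strings[mid][:len(p)]          # O(|P|) work per comparison
        if head < p or (strict and head == p):
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_range(strings, p):
    """Half-open index range [lo, hi) of the sorted strings prefixed by p."""
    return _boundary(strings, p, False), _boundary(strings, p, True)
```

For example, on the sorted list systile, syzygetic, syzygial, syzygy, the call `prefix_range(D, "syzyg")` returns the range covering the last three strings.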

20 Reduce I/Os: Force some locality. ….systilesyzygeticsyzygialsyzygy…. (sorted order + linear storage). 2 advantages: (1) save random I/Os in the last binary-search steps; (2) I/O-scan in reporting range results. How do we reduce space storage?

21 Space + I/O reduction: Bucketing. ….7systile 9syzygetic 8syzygial 6syzygy | 11szaibelyite 8szczecin 9szomo…. (each string is preceded by its length; | separates buckets of size B). 2 further advantages: Search = O(log_2 b) I/Os, where b ≈ N/B. Space = (N + K) + 4 * b bytes.

22 Space reduction: Front-coding

Sorted strings:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...

Front-coded (shared-prefix length, remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html...

Front-coding → 45%

….systile syzygetic syzygial syzygy…. (shared-prefix lengths 2, 5, 5)

Gzip may be much better...
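
A minimal sketch of front-coding and its inverse, where each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names `front_encode` and `front_decode` are hypothetical.

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the strings by reusing the prefix of the previous one."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

Decoding is inherently sequential, since each string depends on the one before it; this is one reason to restart the coding at bucket boundaries.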

23 Solution #1: Bucketing + FC. ….7 0 systile 9 2 zygetic 8 5 ial 6 5 y | 11 0 szaibelyite 8 2 czecin 9 2 omo…. (each entry: string length, shared-prefix length, suffix; | separates buckets of size B). Search = O(log_2 b) I/Os, where b ≈ N/B. Space ≈ (FC(D) + K) + 4 * b bytes (not really FC(D); depends on D's size).

24 Trie: speeding-up searches. [Figure: trie over the dictionary with numeric node labels and edge labels s, y, z, stile, zyg, etic, ial, ygy, aibelyite, czecin, omo] Pro: O(p) search time. Cons: edge + node labels and tree structure.
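
A minimal sketch of prefix search on a trie. For simplicity it uses one character per edge, whereas the figure on the slide uses multi-character edge labels; the class name `Trie` and the empty-string end marker are assumptions of this sketch.

```python
class Trie:
    """Uncompacted trie: nested dicts, one character per edge."""

    def __init__(self):
        self.root = {}

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = True                   # empty key marks the end of a string

    def with_prefix(self, p):
        node = self.root
        for ch in p:                      # O(p) descent, independent of K
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [(node, p)]
        while stack:                      # collect every string below the node
            cur, s = stack.pop()
            for ch, child in cur.items():
                if ch == "":
                    out.append(s)
                else:
                    stack.append((child, s + ch))
        return sorted(out)
```

The nested dictionaries make the "Cons" on the slide concrete: every node carries its own labels and pointers, which costs far more space than the plain sorted array.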

25 Solution #2: 2-level indexing. Internal memory: CT (compacted trie) on a sample of the strings (systile, szaibelyite). Disk: front-coded buckets ….7 0 systile 9 2 zygetic 8 5 ial 6 5 y | 11 0 szaibelyite 8 2 czecin 9 2 omo…. 2 advantages: Search ≈ typically 1 I/O; Space ≈ front-coding over buckets. 2 disadvantages: Sampling rate ≈ lengths of sampled strings; Trade-off ≈ speed vs space (because of bucket size).
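
A minimal sketch of the two-level idea, under simplifying assumptions: a sorted in-memory list of the first string of each bucket stands in for the trie on the sample, buckets are plain Python lists rather than front-coded disk blocks, and buckets are cut by number of strings rather than by B bytes. The names `build_two_level` and `lookup` are hypothetical.

```python
from bisect import bisect_right

def build_two_level(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample their first strings."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [b[0] for b in buckets]      # kept in internal memory
    return sample, buckets

def lookup(sample, buckets, s):
    """Locate the single candidate bucket in memory, then scan only it."""
    j = bisect_right(sample, s) - 1       # last bucket whose first string is <= s
    if j < 0:
        return False                      # s precedes every stored string
    return s in buckets[j]                # one bucket read: the "1 I/O" of the slide
```

A larger `bucket_size` shrinks the in-memory sample but lengthens each bucket scan, which is the speed versus space trade-off listed among the disadvantages.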

