Published by Taliyah Arrowsmith. Modified over 2 years ago.

1
Adapting Hash Table Design for Real-Life Dataset
Sándor Juhász, Ákos Dudás
Department of Automation and Applied Informatics
Budapest University of Technology and Economics, Hungary
IADIS Multi Conference on Computer Science and Information Systems 2009

2
Contents
1. Data transformation
2. Hash tables:
   - types and variations of hash tables
   - refined definitions
3. Inputs and hash functions
4. Performance of open hash tables
5. Performance of bucket tables
6. Summary

3
Data transformation – reason for hash tables
Data transformation converts data from a source format into a destination format.
Hash tables are beneficial because they:
- allow data transformation in nearly one step;
- are fast, yet compact in memory;
- have many existing implementations;
- are easy to implement and easy to customize. Or are they? We will see…

4
Data transformation – reason for hash tables
In the particular case:
- web log processing: recorded activity of Internet users across hundreds of web portals
- each user has a unique ID on the website
- the ID is long: 40 hexadecimal digits (20 bytes)
- the ID is used very frequently
Goal: transform the IDs to 4 bytes to save memory and storage space.
- important constraint: the transformation must retain the uniqueness of the values
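
The transformation described on this slide can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the class name `IdCompressor` and the policy of handing out sequential numbers are assumptions, and Python's built-in `dict` stands in for the custom hash tables the talk goes on to compare.

```python
class IdCompressor:
    """Maps long hexadecimal IDs to small integers that fit in 4 bytes.

    Hypothetical helper: uniqueness is retained because every distinct
    long ID receives its own sequence number.
    """

    MAX_SHORT_ID = 2**32 - 1  # largest value representable in 4 bytes

    def __init__(self):
        self._table = {}  # long ID (20 raw bytes) -> short ID (int)

    def compress(self, long_id: str) -> int:
        key = bytes.fromhex(long_id)  # 40 hex digits -> 20 bytes
        short_id = self._table.get(key)
        if short_id is None:
            # First time this ID is seen: assign the next free number.
            short_id = len(self._table)
            if short_id > self.MAX_SHORT_ID:
                raise OverflowError("more distinct IDs than fit in 4 bytes")
            self._table[key] = short_id
        return short_id
```

Each call costs one hash table lookup (plus one insertion for a new ID), which is the "nearly one step" property the previous slide refers to.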

5
Hash tables in general – refined definitions
Open hashing: “… all items are stored within the hash table” (NIST)
Proposed definition: “… only one item can be assigned to any slot of the hash table”
- more permissive
[Figure: a slot is either empty, holds the item itself, or holds a ptr to the item.]
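
A minimal sketch of an open hash table under the proposed definition, where each slot holds at most one item. It uses linear probing, stores items directly in the slots (an "item table"), and omits deletion. The class name, the growth policy, and the 0.75 load limit are illustrative assumptions, not details from the talk.

```python
class LinearProbingTable:
    """Open hash table: every slot is either None or one (key, value) item."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity
        self._count = 0

    def _probe(self, key):
        # Start at the home slot and step to the next slot on each collision.
        n = len(self._slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def put(self, key, value):
        # Keep the table at most 75% full so a probe always finds a free slot.
        if self._count + 1 > 0.75 * len(self._slots):
            self._grow()
        for idx in self._probe(key):
            slot = self._slots[idx]
            if slot is None:
                self._slots[idx] = (key, value)
                self._count += 1
                return
            if slot[0] == key:
                self._slots[idx] = (key, value)  # overwrite existing key
                return

    def get(self, key, default=None):
        for idx in self._probe(key):
            slot = self._slots[idx]
            if slot is None:
                return default  # an empty slot ends the search path
            if slot[0] == key:
                return slot[1]
        return default

    def _grow(self):
        # Double the capacity and reinsert every stored item.
        old = self._slots
        self._slots = [None] * (2 * len(old))
        self._count = 0
        for slot in old:
            if slot is not None:
                self.put(*slot)
```

A pointer-table variant would store a reference to a separately allocated item in each slot instead of the item itself; in Python both look the same, so the distinction only becomes visible in a language with explicit memory layout.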

6
Hash tables in general – refined definitions
Chaining: “… linked lists handle collision in a hash table” (NIST)
Proposed definition: “… allowing more items to be assigned to any slot of the hash table”
- more permissive
- may also use an array, not just a linked list
- better called bucket hashing
[Figure: a slot holds a ptr to a bucket; the bucket is either a chain of items or an array of items.]
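
A minimal sketch of bucket hashing with array buckets, matching the proposed definition that several items may be assigned to one slot. The class name and the fixed bucket count are assumptions made for illustration; Python lists play the role of the arrays.

```python
class ArrayBucketTable:
    """Bucket hash table: each slot refers to an array (list) of items."""

    def __init__(self, num_buckets=16):
        # None marks a slot without a bucket; otherwise a list of (key, value).
        self._buckets = [None] * num_buckets

    def put(self, key, value):
        idx = hash(key) % len(self._buckets)
        bucket = self._buckets[idx]
        if bucket is None:
            self._buckets[idx] = [(key, value)]
            return
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))  # collision: the bucket simply grows

    def get(self, key, default=None):
        bucket = self._buckets[hash(key) % len(self._buckets)]
        if bucket is not None:
            for existing_key, value in bucket:
                if existing_key == key:
                    return value
        return default
```

A collision never spills into neighboring slots here: the cost of a bad hash function is a longer scan inside one bucket rather than a longer probe sequence through the table.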

7
Types and variations of hash tables
hash tables
- open hash
  - linear probing: item table, pointer table
  - linear double hashing: item table, pointer table
  - quadratic quotient: item table, pointer table
- bucket hash
  - linked lists: item table, pointer table
  - arrays
Reasons for the difference:
- length of search path
- number of indirections
- memory alignment
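
To make the open-hash variants concrete, the sketch below generates the slot order visited by linear probing and by a textbook form of double hashing. The function names and the step formula `1 + (h2 % (table_size - 1))` are assumptions chosen for illustration; the talk does not state its exact formulas, and the quadratic quotient method is left out for that reason.

```python
def linear_probe_sequence(h, table_size):
    """Slots visited by linear probing: the home slot, then its neighbors."""
    return [(h + i) % table_size for i in range(table_size)]


def double_hash_sequence(h1, h2, table_size):
    """Slots visited by double hashing; table_size is assumed to be prime.

    The step is derived from a second hash value and is never zero, so with
    a prime table size the sequence reaches every slot exactly once.
    """
    step = 1 + (h2 % (table_size - 1))
    return [(h1 + i * step) % table_size for i in range(table_size)]


print(linear_probe_sequence(3, 7))     # [3, 4, 5, 6, 0, 1, 2]
print(double_hash_sequence(3, 10, 7))  # [3, 1, 6, 4, 2, 0, 5]
```

Linear probing touches adjacent slots, which tend to share a cache line, while double hashing jumps across the table; this relates to the memory alignment point listed above.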

8
Types and variations of hash tables – item table with linear probing
[Figure: structure of the hash table (slots holding key | value, some empty) and alignment of the hash table in memory (items 1, 2, 3, 4, … stored consecutively across cache lines).]

9
Types and variations of hash tables – item table with linear double hashing/quadratic quotient
[Figure: structure of the hash table (slots holding key | value, some empty) and alignment of the hash table in memory (items spread over several cache lines).]

10
Types and variations of hash tables – pointer table with array
[Figure: structure of the hash table (each slot holds a ptr to a bucket; a bucket stores its length followed by its key | value items) and alignment of the hash table in memory (buckets 1 … n, each a length field followed by its items, laid out across cache lines).]

11
Types and variations of hash tables – pointer table with list
[Figure: structure of the hash table (each slot holds a ptr to a chain of key | value | ptr items) and alignment of the hash table in memory (items belonging to different buckets interleaved across cache lines).]

12
Types and variations of hash tables – item table with list
[Figure: structure of the hash table (key | value | ptr items chained by ptr) and alignment of the hash table in memory (items belonging to different buckets interleaved across cache lines).]

13
Inputs and hash functions
First point of optimization: the hash function
General purpose hash functions
- Custom hash function
- FNV (Fowler/Noll/Vo), in widespread use
- Jenkins hash function
The output distributions of the general purpose hash functions on real-life input are unknown.
[Figure: three input distributions: uniform, “bumpy”, real-life.]
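
For reference, one member of the FNV family can be sketched in a few lines. The slide does not say which FNV variant or output width was used, so the choice of 32-bit FNV-1a below is an assumption; the offset basis and prime are the published constants for that variant.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # published 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # published 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result within 32 bits
    return h
```

A slot index for a 20-byte ID would then be `fnv1a_32(bytes.fromhex(long_id)) % table_size`, where `long_id` and `table_size` are hypothetical names for the 40-digit ID string and the number of slots.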

14
Performance of open hash tables – uniform and “bumpy” inputs
- “bumpy” input distribution: results similar to the uniform input
- best: linear probing, with the Jenkins or the custom hash function

15
Performance of open hash tables – real-life input
- with an inferior hash function: not able to operate
- best: linear probing with the Jenkins hash function

16
Performance of open hash tables – real-life input
The step counts of the algorithms do not account for the observed difference in performance, but the number of L2 cache misses does.
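
The gap between step count and cache behavior can be illustrated with a small calculation. The sizes below are assumptions, not figures from the talk: a 64-byte cache line and a 24-byte item (a 20-byte key plus a 4-byte value, following the ID sizes given earlier), with the table starting at a cache line boundary.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size
ITEM_BYTES = 24        # assumed item size: 20-byte key + 4-byte value


def cache_lines_touched(slot_indices, item_bytes=ITEM_BYTES,
                        line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by the given table slots."""
    lines = set()
    for idx in slot_indices:
        first_byte = idx * item_bytes
        last_byte = first_byte + item_bytes - 1
        # An item may straddle a cache line boundary, so record every line.
        lines.update(range(first_byte // line_bytes,
                           last_byte // line_bytes + 1))
    return len(lines)


# Two search paths of equal length (4 steps) starting at slot 100.
adjacent_path = [100 + i for i in range(4)]        # linear probing
scattered_path = [100 + 37 * i for i in range(4)]  # jumps of 37 slots

print(cache_lines_touched(adjacent_path))   # 2
print(cache_lines_touched(scattered_path))  # 4
```

Both paths take the same number of steps, yet the scattered one touches twice as many cache lines, so a step count alone cannot show a difference that a cache miss count would.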

17
Performance of bucket hash tables – uniform and “bumpy” inputs
- “bumpy” input distribution: results similar to the uniform input
- best: item-table-with-list, with the Jenkins or the custom hash function

18
Performance of bucket hash tables – real-life input
- best: pointer-table-with-array, because it is less sensitive to the hash function

19
Performance of bucket hash tables – real-life input
The step counts of the algorithms do not account for the observed difference in performance, but the number of L2 cache misses does.

20
Summary
- Significance of hashing in fast data transformation
- New definitions for hash table types
- Introduction of additional hash structures with various memory layouts
- Multiple inputs and hash functions
- Robustness criterion
- Open hash tables: linear probing is the fastest; no variant can handle real-life input with an inferior hash function
- Bucket hash tables: arrays are favorable because they are robust and not sensitive to the hash function
- Verified using real-life input

21
Questions?
Adapting Hash Table Design for Real-Life Dataset
Sándor Juhász
