By Dominik Seifert B97902122. Overview Data Alignment Hashtable MurMurHash Function “Stupid Parallel Hashing” Lookup The Complete Algorithm The Little.

1 by Dominik Seifert B97902122

2 Overview
- Data Alignment
- Hashtable
  - MurmurHash Function
  - “Stupid Parallel Hashing”
- Lookup
- The Complete Algorithm
- The Little Things

3 Data Alignment (1/3)
- The Alignment Trap
- x86 supports unaligned memory accesses, but GPUs don’t!
- Word-sized pointers must always be word-size-aligned!

4 Data Alignment (2/3)
- Copy all words (corpus and query separately) into a new array consisting of 4-byte chunks
- This improves memory access patterns
- It allows us to always consider 4 bytes at a time
- It needs more space, but space is not a constraint here
- Keep the old offsets and translate them to new offsets (counted in 4-byte chunks, using integer division) with:
  AlignedWordOffset = OrigWordOffset / 4 + WordIndex
- What is the size of the i’th string?
  strlen(i’th string) == offset(i+1) - offset(i) - 1
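The length formula on this slide can be checked with a few lines of Python. This is a minimal sketch, not the author's code: it assumes that every word is followed by a single NUL terminator (which would explain the "- 1") and that the offset after the last word equals the total size. The names `word_lengths`, `data`, and `offsets` are hypothetical.

```python
def word_lengths(offsets, total_size):
    """Return strlen of every word, given the byte offset of each word.

    Assumption: each word is followed by exactly one NUL terminator byte,
    and the "offset" after the last word is the total size of the data.
    """
    ends = offsets[1:] + [total_size]
    return [end - start - 1 for start, end in zip(offsets, ends)]


data = b"ab\x00cde\x00fg\x00"   # three NUL-terminated words, 10 bytes
offsets = [0, 3, 7]              # byte offset of each word in data
lengths = word_lengths(offsets, len(data))
print(lengths)                   # [2, 3, 2]
```

Because each length depends only on two neighbouring offsets, every word can compute its own length independently, which suits one thread per word.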

5 Data Alignment (3/3)
- AlignedWordOffset = OrigWordOffset / 4 + WordIndex
- NewSize = 4 x (TotalSize / 4 + WordCount)
- Example:
  TotalSize = 10
  WordCount = 3
  NewSize = 4 x (10 / 4 + 3) = 4 x 5 = 20
- Figure: original string (10 bytes) and aligned string (5 x 4 = 20 bytes)
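The two formulas can be tried out on the 10-byte example with a short sequential sketch. It is only an illustration under stated assumptions: words are NUL-terminated, padding bytes are zero, and the helper names (`aligned_offset`, `align_words`) are hypothetical rather than taken from the slides.

```python
CHUNK = 4  # bytes per aligned chunk


def aligned_offset(orig_offset, word_index):
    """New offset of a word, counted in 4-byte chunks (integer division)."""
    return orig_offset // CHUNK + word_index


def align_words(data, offsets):
    """Copy NUL-terminated words into a zero-filled, 4-byte-aligned buffer."""
    new_size = CHUNK * (len(data) // CHUNK + len(offsets))
    aligned = bytearray(new_size)            # padding bytes stay zero
    ends = offsets[1:] + [len(data)]
    chunk_offsets = []
    for i, (start, end) in enumerate(zip(offsets, ends)):
        dst = CHUNK * aligned_offset(start, i)   # byte position in new buffer
        aligned[dst:dst + (end - start)] = data[start:end]
        chunk_offsets.append(dst // CHUNK)
    return aligned, chunk_offsets


data = b"ab\x00cde\x00fg\x00"     # TotalSize = 10, WordCount = 3
aligned, chunk_offsets = align_words(data, [0, 3, 7])
print(len(aligned), chunk_offsets)  # 20 [0, 1, 3]
```

The formula gives each word at least as many chunks as it needs, so words never overlap, and every new offset depends only on the word's own old offset and index. No prefix sum over the word lengths is required, at the cost of some unused padding.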

6 Hashtable Motivation and overview
- A hash is an index into an array that contains a value
- Hashtables are perfect for exact matching
- Simple
- Build time: O(1) per word (expected)
- Lookup time: O(1) per word (expected)
- Databases typically use hashtables if they don’t need to support range queries
- Trees are more work, slower, and much harder to parallelize
- Idea:
  - Build a hashtable of all corpus words
  - Search for every query word
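The idea of this slide can be sketched sequentially as follows. This is an illustration only: `zlib.crc32` stands in for the hash function, the bucket lists replace whatever layout the GPU version uses, and `build_table` and `find` are hypothetical names.

```python
import zlib


def build_table(corpus, num_buckets):
    """Hash every corpus word into a bucket; each bucket holds word indices."""
    buckets = [[] for _ in range(num_buckets)]
    for index, word in enumerate(corpus):
        buckets[zlib.crc32(word.encode()) % num_buckets].append(index)
    return buckets


def find(word, corpus, buckets):
    """Return the corpus index of word, or -1 if it does not occur."""
    for index in buckets[zlib.crc32(word.encode()) % len(buckets)]:
        if corpus[index] == word:   # exact match, not just an equal hash
            return index
    return -1


corpus = ["apple", "banana", "cherry"]
buckets = build_table(corpus, num_buckets=7)
matches = [find(q, corpus, buckets) for q in ["banana", "durian"]]
largest_bucket = max(len(b) for b in buckets)
print(matches)  # [1, -1]
```

The value `largest_bucket` is the quantity the later slides call the largest bucket size: the maximum number of words that share one slot.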

7 Hashtable MurmurHash Function (1/2)
- Simple: only a few lines (available online)
- Fast: always considers 4 bytes at a time
- Conflict-resilient: very few strings have the same hash
- I simplified it slightly for my case: the 6 lines that handle strings whose size is not divisible by 4 were removed (since all my aligned string sizes are divisible by 4)
- Largest bucket size for the corpus (found through trial and error): 4
- A hashtable of the query strings has a largest bucket size of 6
- Inverting the lookup (hashing the query strings and searching for the corpus words) was slower!

8 Hashtable MurmurHash Function (2/2)
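The code shown on this slide is not part of the transcript. As a stand-in, here is a Python sketch of the public 32-bit MurmurHash2 algorithm with the tail handling left out, as slide 7 describes. The seed value, the little-endian chunk reads, and the function name `murmur2_aligned` are assumptions, and whether the author hashed the padded or the original length is not stated in the slides.

```python
MASK32 = 0xFFFFFFFF
M = 0x5BD1E995   # MurmurHash2 multiplication constant
R = 24           # MurmurHash2 shift constant


def murmur2_aligned(data, seed=0):
    """32-bit MurmurHash2 for inputs whose length is a multiple of 4."""
    if len(data) % 4 != 0:
        raise ValueError("input length must be a multiple of 4")
    h = (seed ^ len(data)) & MASK32
    for pos in range(0, len(data), 4):
        # Read one 4-byte chunk as a little-endian 32-bit integer.
        k = int.from_bytes(data[pos:pos + 4], "little")
        k = (k * M) & MASK32
        k ^= k >> R
        k = (k * M) & MASK32
        h = (h * M) & MASK32
        h ^= k
    # No tail handling: there are never 1 to 3 leftover bytes.
    h ^= h >> 13
    h = (h * M) & MASK32
    h ^= h >> 15
    return h


hash_a = murmur2_aligned(b"ab\x00\x00")   # a word padded to 4 bytes
hash_b = murmur2_aligned(b"cde\x00")
```

The masks with `MASK32` emulate the wrap-around of unsigned 32-bit arithmetic, which C and CUDA provide implicitly.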

9 Hashtable Stupid Parallel Hashing (1/2)
- No space optimization constraint
- Available space: about 900 MB (not counting the space required for input and output)
- Outline:
  - Create H layers, each about 900/H MB in size (the layer size should be a prime number!)
  - A layer is an array that maps a hash to a word index
  - For each layer L: place all previously conflicting words in L
  - Number of layers = largest bucket size: 4
- Conflicting parallel writes = race condition
- CUDA C Programming Guide, section 4.1: "If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined."
- One thread will always succeed!!!
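The layered scheme can be modelled sequentially as below. This is a sketch under assumptions, not the author's kernel: it assumes that a word detects a lost race by reading its slot back after all writes have finished, and it lets the last writer win, whereas on the GPU the winning thread is undefined. The names `build_layers`, `hashes`, and `leftover` are hypothetical.

```python
EMPTY = -1


def build_layers(hashes, num_slots, num_layers):
    """Place word indices into layers; losers of a slot move to the next layer."""
    layers = [[EMPTY] * num_slots for _ in range(num_layers)]
    pending = list(range(len(hashes)))       # indices of words still unplaced
    for layer in layers:
        # Phase 1: every pending word writes its index without synchronization.
        # Here the last writer wins; on a GPU the winner is undefined.
        for i in pending:
            layer[hashes[i] % num_slots] = i
        # Phase 2: each word reads its slot back; losers stay pending.
        pending = [i for i in pending if layer[hashes[i] % num_slots] != i]
    return layers, pending


hashes = [5, 12, 5, 3, 19]   # with 7 slots, four words collide in slot 5
layers, leftover = build_layers(hashes, num_slots=7, num_layers=4)
placed = sorted(i for layer in layers for i in layer if i != EMPTY)
print(placed, leftover)      # [0, 1, 2, 3, 4] []
```

Each pass places exactly one word per contested slot, so a bucket with k colliding words needs k layers. That is why the number of layers equals the largest bucket size.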

10 Hashtable Stupid Parallel Hashing (2/2)
- Note: Rows = Layers, Columns = Buckets
- Figure: the input followed by Layer 1, Layer 2, and Layer 3
- Legend: occupied / conflicted, occupied, empty (-1)

11 Lookup
- Problems:
  - Slowest kernel!
  - Needs too many registers!
  - Did not benefit from shared memory (shm), although it should have!

12 The Complete Algorithm
1. Align words into 4-byte chunks
2. Compute hashes of all corpus words
3. For each hashtable layer L (4 in total): place all previously conflicting words in L
   - Use templates to determine the layer number
4. Look up the index of every query word in every layer L until the word stored there matches or the current layer has no entry for that hash
- Four kernels (the list is not included in this transcript)
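Step 4 can be sketched sequentially like this. The example is an assumption-laden illustration: `toy_hash` is a deliberately weak stand-in for MurmurHash so that collisions are easy to see, the layers are built by hand, and `lookup` is a hypothetical name.

```python
EMPTY = -1


def toy_hash(word):
    """Stand-in hash for illustration only (the slides use MurmurHash)."""
    return len(word)


def lookup(query, corpus, layers):
    """Return the corpus index of query, or -1 if it is not in any layer."""
    h = toy_hash(query)
    for layer in layers:
        index = layer[h % len(layer)]
        if index == EMPTY:
            return -1          # no word with this hash reached this layer
        if corpus[index] == query:
            return index       # the stored word matches the query
    return -1                  # every layer held a different word


corpus = ["cat", "dog", "horse"]
# Hand-built layers with 4 slots: "cat" and "dog" both hash to slot 3,
# so "dog" occupies layer 0 and "cat" was pushed down to layer 1.
layers = [
    [EMPTY, 2, EMPTY, 1],
    [EMPTY, EMPTY, EMPTY, 0],
]
results = [lookup(q, corpus, layers) for q in ["cat", "horse", "cow", "ab"]]
print(results)  # [0, 2, -1, -1]
```

The early return on an empty slot is safe because a slot is filled in layer k+1 only if it was contested in layer k, so an empty slot means no deeper layer holds that hash either.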

13 The Little Things (1/2)
- A previous presenter inspired this idea:
  - Init: allocate and memset all arrays (using maximum sizes)
  - Cleanup: free all arrays

14 The Little Things (2/2)
- Compare Words:
  - I did not really use shared memory
  - It did not improve performance, even though it should have due to better load balancing: every thread reads roughly the average word size, vs. some threads reading only 1 byte and some reading 100 bytes
  - I did not investigate further since the code was already very fast

15 References
- MurmurHash: https://sites.google.com/site/murmurhash/MurmurHash2.cpp?attredirects=0
- Real-time Parallel Hashing on the GPU, ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009), by Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta
  - I took some ideas from it but did not implement it at all
  - Needs atomicAdd

