Fast and Robust Hashing for Database Operators

Presentation transcript:

Fast and Robust Hashing for Database Operators
Kaan Kara, Gustavo Alonso, 31.08.2016

Welcome everyone. Our research is about accelerating database operators using FPGAs. Since hashing is an essential part of many operators, in this work we focus on how we can do fast and robust hashing using an FPGA as part of a heterogeneous architecture.

Motivation
Hash tables and hashing are very often used in:
- Query processing in databases
- Key-value stores
- Load-distributing middleware
The choice of hash function impacts performance:
- Speed: easily computable hash functions give higher hashing throughput.
- Robustness: a balanced distribution of hash values means fewer collisions and O(1) look-up, insertion, and deletion.
There is a trade-off between the two.

Our motivation comes from a recent work from the database community, which showed the importance of choosing the right hash function depending on the application. Hash tables and hashing are often used in applications such as query processing, key-value stores, or load distribution. The choice of the hash function affects performance because of two properties. The first one is how fast the hash function can be computed: easily computable hash functions lead to a higher hashing throughput. The second one is robustness: more robust hash functions produce a balanced distribution of hash values, causing fewer collisions and guaranteeing O(1) look-up, insertion, and deletion performance. Unfortunately, there is a trade-off between the speed and the robustness of a hash function. In other words, more robust hash functions are complex and take longer to compute.
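To make the two ends of this trade-off concrete, here is a minimal C++ sketch (not from the paper) comparing a trivially cheap bucket function, modulo over a power-of-two table, with a murmur3-style finalizer. The bucket count and the example keys, which differ only in their upper hex digits, are illustrative assumptions: modulo sends all of them to one bucket, while the murmur-style variant has no such systematic bias.

```cpp
#include <cstdint>
#include <cstdio>

// Fast but fragile: modulo over a power-of-two table is a single bit mask,
// but keys that share their low bits all land in the same bucket.
static uint32_t modulo_bucket(uint32_t key, uint32_t buckets) {
    return key & (buckets - 1);  // buckets must be a power of two
}

// More robust: a murmur3-style finalizer. More arithmetic per key, but it
// mixes every input bit into every output bit, so structured key sets still
// spread out over the table.
static uint32_t murmur_bucket(uint32_t key, uint32_t buckets) {
    uint32_t h = key;
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h & (buckets - 1);
}

int main() {
    // Keys that differ only in their upper hex digits: key & (1024 - 1) keeps
    // just the shared low bits, so modulo sends every one of them to bucket
    // 0x111, while the finalizer's output depends on all 32 key bits.
    const uint32_t keys[] = {0x21111111u, 0x31111111u, 0x41111111u};
    for (uint32_t key : keys)
        printf("key 0x%08x -> modulo bucket %u, murmur bucket %u\n",
               key, modulo_bucket(key, 1024), murmur_bucket(key, 1024));
    return 0;
}
```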

Trade-Off: Fast vs. Robust Hashing
Key distributions (example keys):
- Linear: 0x0000_0001, 0x0000_0002, 0x0000_0003, …
- Random: 0x0001_1AF0, 0x2E4F_5929, 0x82FA_C&B1, 0x186C_BA1F
- Grid: 0x1111_1111, 0x1111_1112, 0x1111_1113, 0x111E_14E1
- Reverse grid: 0x2111_1111, 0x3111_1111, 0x1E41_E111
Benchmark: inserting 1.5 million keys into an empty hash table until it is 70% full.
[Figures: raw hashing throughput and average probes per insert, per hash function and key distribution]

We performed a set of micro-benchmarks to show this trade-off. For that, we use 4 different key distributions, each representing specific data types. Linear keys represent indexes; then we have randomly distributed keys. The third and fourth distributions we call grid and reverse grid, which resemble strings or address patterns. The first micro-benchmark I will show you is the raw hashing throughput for different hash functions. The bars indicate how fast each hash function can be computed. For example, modulo, being a very simple arithmetic operation, achieves the highest throughput. The second micro-benchmark shows the average number of probes needed when inserting a value into a hash table; the bars basically indicate how many hash value collisions occur during this process. We see that modulo and multiply-shift, being non-robust hash functions, produce many colliding values depending on the key distribution. The other hash functions, on the other hand, behave in a robust way and are not affected by the key distribution. This shows the importance of the robustness property. This is the speed and robustness trade-off. In this work, we try to break it by implementing robust hash functions on an FPGA as part of a heterogeneous platform.
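The sketch below is a rough re-creation of the probe-count micro-benchmark, not the authors' code: the key count is scaled down, the table is a power of two filled to 50% rather than 70%, the bucket function is the modulo/mask variant, collisions are resolved by linear probing, and the key generators are my own approximations of the linear, random, and reverse-grid distributions. It only illustrates why probe counts depend on the key distribution when the hash is not robust.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Insert keys into an open-addressing table with linear probing, hashing with
// a plain bit mask (modulo over a power-of-two table), and return the average
// number of probes per insert.
static double avg_probes(const std::vector<uint32_t>& keys, uint32_t buckets) {
    std::vector<bool> occupied(buckets, false);
    const uint32_t mask = buckets - 1;            // buckets must be a power of two
    uint64_t probes = 0;
    for (uint32_t key : keys) {
        uint32_t slot = key & mask;
        ++probes;                                 // first probe
        while (occupied[slot]) { slot = (slot + 1) & mask; ++probes; }
        occupied[slot] = true;
    }
    return double(probes) / keys.size();
}

int main() {
    const uint32_t n = 1u << 16;        // keys to insert (scaled down from 1.5M)
    const uint32_t buckets = 1u << 17;  // 50% fill here; the slide used 70%
    std::mt19937 rng(42);

    std::vector<uint32_t> linear, random_keys, reverse_grid;
    for (uint32_t i = 0; i < n; ++i) {
        linear.push_back(i + 1);                   // 1, 2, 3, ...
        random_keys.push_back(rng());              // uniform random
        reverse_grid.push_back((i << 8) | 0x11u);  // only the upper digits vary
    }

    printf("linear:       %.2f probes/insert\n", avg_probes(linear, buckets));
    printf("random:       %.2f probes/insert\n", avg_probes(random_keys, buckets));
    printf("reverse grid: %.2f probes/insert\n", avg_probes(reverse_grid, buckets));
    return 0;
}
```

With a robust hash (for example the murmur-style finalizer from the previous sketch) substituted for the bit mask, the three distributions behave almost identically, which is the robustness property the slide is pointing at.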

Target Platform: Intel Xeon+FPGA
- Accelerator function units, written in an HDL
- Able to access the entire main memory
- QPI provides cache coherency

Our target platform is the Intel Xeon+FPGA, a two-socket machine. On one socket there is a Xeon CPU with 96 GB of main memory; on the other there is a Stratix V FPGA. We implement our accelerators as so-called accelerator function units, written in an HDL. The accelerators are able to access the entire main memory in a cache-coherent way via QPI, which provides 6.5 GB/s of bandwidth for combined reads and writes.

Acknowledgement: We thank Intel for their generous donation of the Xeon+FPGA.

Hardware Hashing
- Simple tabulation
- Murmur
[Figures: RTL block diagrams of the hash units; hashing throughput, CPU vs. FPGA]

We implemented simple tabulation and murmur hashing on the FPGA. I will not go into the implementation details, which you can read in the paper; I would like to focus more on the performance delivered by the FPGA. In this figure we have again the raw hashing throughput, this time with the FPGA results. The measured throughput for FPGA hashing on the target platform reaches that of multiply-shift on the CPU. But bear in mind that the current implementation is completely memory bound, at 6.5 GB/s. The implementation, currently clocked at 200 MHz, is actually capable of delivering 1600 million keys per second; if the bandwidth were 25.6 GB/s for combined read and write channels, this is the throughput we would measure.
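For readers unfamiliar with the first scheme, here is a minimal software sketch of simple tabulation hashing. The scheme itself (one random table per key byte, results XORed) is standard; the table sizes, seeding, and 32-bit key width here are assumptions and not details of the paper's RTL.

```cpp
#include <cstdint>
#include <random>

// Simple tabulation hashing: split the key into bytes, look each byte up in
// its own table of random words, and XOR the results.
struct TabulationHash {
    uint32_t table[4][256];  // one 256-entry table per key byte

    explicit TabulationHash(uint32_t seed) {
        std::mt19937 rng(seed);
        for (auto& t : table)
            for (auto& entry : t) entry = rng();
    }

    uint32_t operator()(uint32_t key) const {
        // The four lookups are independent of one another, which is what
        // makes the scheme attractive for a pipelined hardware implementation.
        return table[0][key & 0xFF] ^
               table[1][(key >> 8) & 0xFF] ^
               table[2][(key >> 16) & 0xFF] ^
               table[3][(key >> 24) & 0xFF];
    }
};
```

Per key this costs only four table lookups and three XORs, yet the family has strong theoretical guarantees on collision behavior, which is the robustness side of the trade-off.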

Hybrid Hash Table
Benchmark: 1.5 million keys, hash table at a 70% fill rate.
[Figure: hash table build times for CPU-only hashing vs. the hybrid table with FPGA hashing]

In a second step, we integrated the FPGA hashing into a hybrid hash table. We do this because we would like to show that our FPGA hashing can be integrated into complete applications, as we plan to use it during query processing in the future. When a value needs to be inserted, updated, read, or deleted, its key is first hashed on the FPGA, and the hash value is then used in software to perform the look-up. This kind of hybrid processing is made possible by the shared-memory architecture of the Xeon+FPGA: no batching or extra data copying has to be performed, so the acceleration comes without further overhead. In this figure we present the hash table build times, averaged over the 4 key distributions we presented. The combination of robustness and high throughput in the FPGA hashing gives us the best result, an improvement of 21% compared to the best CPU hashing.
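The sketch below only illustrates this division of labor as I understand it from the talk: the table's insert and find take a precomputed hash alongside the key, so any hash source can feed it. In the real system the hash arrives from the FPGA through shared memory; here a CPU function (for example the tabulation hash above) stands in for it, and the open-addressing layout with linear probing is my assumption, not the authors' design.

```cpp
#include <cstdint>
#include <vector>

// A hash table that never computes hashes itself: callers pass in a hash
// value produced elsewhere (the FPGA in the original work).
struct HybridHashTable {
    struct Slot { uint32_t key; uint32_t value; bool used; };
    std::vector<Slot> slots;  // value-initialized, so every slot starts unused

    explicit HybridHashTable(size_t capacity) : slots(capacity) {}

    void insert(uint32_t key, uint32_t value, uint32_t precomputed_hash) {
        size_t slot = precomputed_hash % slots.size();
        while (slots[slot].used && slots[slot].key != key)
            slot = (slot + 1) % slots.size();      // linear probing
        slots[slot] = {key, value, true};
    }

    bool find(uint32_t key, uint32_t precomputed_hash, uint32_t& value_out) const {
        size_t slot = precomputed_hash % slots.size();
        while (slots[slot].used) {
            if (slots[slot].key == key) { value_out = slots[slot].value; return true; }
            slot = (slot + 1) % slots.size();
        }
        return false;
    }
};
```

A caller would do something like `uint32_t h = tab(key); table.insert(key, value, h);`, with the hash coming back from the accelerator instead of `tab` in the real system.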

Thank you for your attention! Visit our poster tomorrow for questions.

Contact information and credits
ETH Zurich, Systems Group
Universitätsstrasse 6, 8092 Zurich
systems.ethz.ch
© ETH Zurich, August 2016
Kaan Kara, Gustavo Alonso, 31.08.2016