

Presentation on theme: "BlueDBM: An Appliance for Big Data Analytics"— Presentation transcript:

1 BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*. *MIT Computer Science and Artificial Intelligence Laboratory, +Quanta Research Cambridge. ISCA 2015, Portland, OR, June 15, 2015. BlueDBM stands for Blue DataBase Machine, and it refers to the architecture as well as the prototype system. This work is funded by Quanta, Samsung and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.

2 Big data analytics Analysis of previously unimaginable amounts of data can provide deep insight. Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC). Analyzing a personal genome can determine predisposition to diseases. Social network chatter analysis can identify political revolutions before newspapers do. Scientific datasets can be mined to extract accurate models. Big data analytics is likely to be the biggest economic driver for the IT industry for the next decade.

3 A currently popular solution: RAM Cloud
Cluster of machines with large DRAM capacity and a fast interconnect. + Fastest as long as data fits in DRAM. - Power hungry and expensive. - Performance drops when data doesn't fit in DRAM. Flash-based solutions may be a better alternative: + Faster than disk, cheaper than DRAM. + Lower power consumption than both. - Legacy storage access interface is a burden. - Slower than DRAM. What if enough DRAM isn't affordable? If everything fits in DRAM you're in luck; everything will be fast. But what if you can't afford to maintain a DRAM pool large enough to accommodate your data?

4 Related work
Use of flash: SSDs, FusionIO, Pure Storage, ZetaScale; SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]. Networks: QuickSAN [ISCA 2013]; Hadoop/Spark on InfiniBand RDMA [SC 2012]. Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs.

5 Latency profile of distributed flash-based analytics
Distributed processing involves many system components: flash device access, storage software (OS, FTL, ...), the network interface (10GbE, InfiniBand, ...), and the actual processing. [Figure: latency breakdown — flash access 75 μs (50~100 μs), storage software 100 μs (100~1000 μs), network 20 μs (20~1000 μs), plus processing.] Latency is additive because data needs to be copied to and from DRAM.

6 Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead: near-storage processing, cross-layer optimization of the flash management software*, a dedicated storage area network, and accelerators. [Figure: the same latency breakdown, with the storage software and network overheads reduced to < 20 μs.] This is difficult to explore using flash packaged as off-the-shelf SSDs.
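The additivity argument is easy to make concrete. The Python sketch below just sums the representative per-stage latencies from the figure; the exact optimized values are an assumption, and only the < 20 μs ballpark comes from the slide.

```python
# Representative per-access latencies from the slide, in microseconds.
baseline_us = {
    "flash access":     75,
    "storage software": 100,   # OS, FTL, ... (can reach ~1000 us)
    "network":          20,    # can reach ~1000 us under load
    "processing":       100,
}

# Assumed effect of the architectural changes (near-storage processing,
# cross-layer flash management, dedicated storage network): the software
# and network terms drop to roughly 20 us; the other terms are unchanged.
optimized_us = dict(baseline_us, **{"storage software": 20, "network": 20})

def end_to_end(stages):
    """Latency is additive: data is copied through every stage in turn."""
    return sum(stages.values())

print("baseline :", end_to_end(baseline_us), "us per access")
print("optimized:", end_to_end(optimized_us), "us per access")
```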

7 Custom flash card had to be built
[Figure: the custom flash card — an Artix-7 FPGA manages four flash buses (Bus 0~3) of the flash array (chips mounted on both sides) and the network ports, and attaches to the VC707 through an HPC FMC port.]

8 BlueDBM: Platform with near-storage processing and inter-controller networks
20 24-core Xeon servers; 20 BlueDBM storage devices with 1 TB of flash storage each; 4× 20 Gbps controller network; Xilinx VC707 FPGA; 2 GB/s PCIe to the host.

9 BlueDBM Storage Device
[Photos: one of the two racks (10 nodes each) and a BlueDBM storage device.] The cluster again: 20 24-core Xeon servers; 20 BlueDBM storage devices with 1 TB of flash each; 4× 20 Gbps controller network; Xilinx VC707; 2 GB/s PCIe.

10 BlueDBM node architecture
Flash device: lightweight flash management with very low overhead (adds almost no latency) and ECC support. Network interface: custom network protocol with low latency and high bandwidth; 4× 20 Gbps links at 0.5 μs latency; virtual channels with flow control. An in-storage processor sits between the flash controllers and the network. Host interface: the host server connects over PCIe, and software has very low-level access to the flash storage, so high-level information can be used for low-level management; the FTL is implemented inside the file system. (No time to go into the gritty details: all three components involve a lot of engineering and ideas; we plan separate publications on each, or we can discuss details after the talk.)

11 BlueDBM software view
User space: hardware-assisted applications, the Connectal proxy (generated by Connectal*), and the file system. Kernel space: block device driver and Connectal (by Quanta). FPGA: flash controller, hardware accelerator, accelerator manager, Connectal wrapper, and network interface, backed by NAND flash. BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal).
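To make the two access paths concrete, here is a minimal host-side sketch in Python. The mount point and the `AcceleratorProxy` stub are hypothetical illustrations, not part of the Connectal-generated API; the point is only that ordinary file I/O goes through the block device driver unchanged, while an accelerated application would instead hand requests to the hardware through a Connectal-generated proxy.

```python
import os

# Path 1: the generic file system interface.
# Hypothetical mount point for a BlueDBM-backed file system.
FS_PATH = "/mnt/bluedbm/dataset.bin"

def read_chunk(offset, length):
    """Ordinary POSIX I/O: the request flows through the file system,
    block device driver, and flash controller, so legacy software runs
    unmodified on BlueDBM storage."""
    with open(FS_PATH, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Path 2: the accelerator-specific interface (shape only).
# A real application would use a proxy class generated by Connectal;
# this stand-in just shows where such a call would sit.
class AcceleratorProxy:
    def submit_query(self, payload: bytes) -> None:
        raise NotImplementedError("generated by Connectal in a real build")

if __name__ == "__main__":
    if os.path.exists(FS_PATH):
        print(len(read_chunk(0, 4096)), "bytes read via the file system path")
```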

12 Power consumption is low
Storage device: VC707 30 W, flash boards (×2) 10 W, total 40 W (a very conservative estimate). Node: storage device 40 W, Xeon server 200+ W, total 240+ W, so the storage device is roughly 20% of node power. A GPU-based accelerator would double the power.

13 Applications
Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems. BlueCache, an accelerated memcached*: a dedicated network and accelerated caching system with larger capacity. Graph analytics: benefits of lower-latency access into distributed flash for computation on large graphs. (* Results obtained since the paper submission.)

14 Content-based image retrieval
Takes a query image and returns similar images in a dataset of tens of millions of pictures. Image similarity is determined by measuring the distance between the histograms of each image; the histogram is generated using RGB, HSV, "edgeness", etc. (Better algorithms are available!) This demonstrates faster flash with accelerators as a replacement for DRAM-based systems.
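As a rough software reference for the similarity measure, the sketch below computes a concatenated RGB histogram per image and ranks a dataset by L1 distance to the query. It is a minimal illustration of histogram-based retrieval, not the exact feature pipeline from the talk (the HSV and "edgeness" channels are omitted), and numpy plus in-memory images are assumed.

```python
import numpy as np

def rgb_histogram(image, bins=16):
    """Concatenated per-channel color histogram, normalized to sum to 1.
    `image` is an H x W x 3 uint8 array."""
    feats = []
    for ch in range(3):
        h, _ = np.histogram(image[..., ch], bins=bins, range=(0, 256))
        feats.append(h)
    hist = np.concatenate(feats).astype(np.float64)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    """L1 distance between histograms; smaller means more similar."""
    return np.abs(h1 - h2).sum()

def rank_dataset(query_img, dataset_imgs, top_k=5):
    """Brute-force scan: compare the query histogram against every image."""
    q = rgb_histogram(query_img)
    scored = [(histogram_distance(q, rgb_histogram(img)), i)
              for i, img in enumerate(dataset_imgs)]
    return sorted(scored)[:top_k]
```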

15 Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Diagram: data streams from flash through the flash controller into the FPGA, where a Sobel filter and a histogram generator produce per-image histograms; a comparator scores them against the query histogram, and the results are returned to software.]

16 Image query performance without sampling
[Chart: query performance of BlueDBM + FPGA, BlueDBM + CPU (where the CPU becomes the bottleneck), and an off-the-shelf M.2 SSD.] Faster flash with acceleration can perform at DRAM speed.

17 Sampling to improve performance
Intelligent sampling methods (e.g., locality-sensitive hashing) improve performance by dramatically reducing the search space, but they introduce a random access pattern: the data accesses corresponding to a single hash-table entry result in many random accesses. [Diagram: a locality-sensitive hash table whose entries point to scattered locations in the data.] This again targets faster flash with accelerators as a replacement for DRAM-based systems.
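A minimal random-hyperplane LSH sketch shows why sampling both shrinks the search space and randomizes the access pattern: a query touches only one bucket, but the members of that bucket live at arbitrary offsets in storage. The bit width and the in-memory bucket table are assumptions made purely for illustration.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hashes feature vectors (e.g., image histograms) into buckets so that
    similar vectors tend to collide; only one bucket is scanned per query."""
    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}          # hash key -> list of (index, vector)

    def _key(self, v):
        bits = (self.planes @ v) > 0
        return bits.tobytes()

    def insert(self, idx, v):
        self.buckets.setdefault(self._key(v), []).append((idx, v))

    def query(self, v, top_k=5):
        # Candidates share a bucket but live at arbitrary offsets in storage,
        # so fetching them is a burst of small random reads.
        cands = self.buckets.get(self._key(v), [])
        scored = sorted((np.abs(u - v).sum(), i) for i, u in cands)
        return scored[:top_k]
```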

18 Image query performance with sampling
A disk-based system cannot take advantage of the reduced search space.

19 memcached service
A distributed in-memory key-value store that caches DB results indexed by query strings. Accessed via socket communication. Uses system DRAM for caching (~256 GB). Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, ... [Diagram: browser/mobile apps send web requests to application servers, which issue memcached requests to memcached servers and return data in the response.] Networking contributes ~90% of the overhead.
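For reference, the cache-aside pattern behind memcached looks roughly like the Python sketch below. A plain dict stands in for the memcached client, and the 5 ms database call is an assumed placeholder for the slow path that is paid on every miss.

```python
import time

cache = {}                      # stands in for a memcached client (get/set)

def query_database(key):
    """Placeholder for the backing database; assumed to be the slow path."""
    time.sleep(0.005)           # e.g., ~5 ms per miss
    return f"row-for-{key}"

def get(key):
    """Cache-aside lookup: serve from cache on a hit, otherwise fall back
    to the database and populate the cache."""
    value = cache.get(key)
    if value is None:           # miss: pay the database penalty
        value = query_database(key)
        cache[key] = value
    return value
```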

20 BlueCache: Accelerated memcached service (Shuotao Xu)
A memcached server implemented in hardware: hashing and flash management are implemented in the FPGA; each node has 1 TB of hardware-managed flash cache; the hardware server is accessed via local PCIe; and the hardware devices are connected by a direct inter-controller network. [Diagram: each node pairs a web server (over PCIe) with a BlueCache accelerator containing the flash controller, network interface, and 1 TB of flash; the accelerators are linked by the inter-controller network.] This provides a dedicated network and an accelerated caching system with larger capacity.

21 Effect of architecture modification (no flash, only DRAM)
[Chart: 11× performance improvement.] PCIe DMA and the inter-controller network reduce access overhead; FPGA acceleration of memcached is effective.

22 High cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
Key size = 64 bytes, value size = 8 KB, 5 ms penalty per cache miss; no cache misses are assumed for BlueCache. BlueCache starts performing better at about a 5% miss rate, so a "sweet spot" for large flash caches exists.
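The trade-off can be sketched with a one-line effective-access-time model. The 5 ms miss penalty comes from the slide; the DRAM and flash service times below are placeholder assumptions, chosen only to illustrate why a modest miss rate makes a small DRAM cache lose to an all-hit flash cache.

```python
def effective_access_time(hit_time_us, miss_rate, miss_penalty_us):
    """Average time per request for a cache with the given hit time and miss penalty."""
    return (1 - miss_rate) * hit_time_us + miss_rate * miss_penalty_us

MISS_PENALTY_US = 5000.0   # 5 ms per miss, from the slide
DRAM_HIT_US = 50.0         # assumed DRAM-memcached service time (placeholder)
FLASH_HIT_US = 200.0       # assumed BlueCache flash service time (placeholder)

for miss_pct in (0, 1, 2, 5, 10):
    dram = effective_access_time(DRAM_HIT_US, miss_pct / 100, MISS_PENALTY_US)
    print(f"{miss_pct:>2}% misses: small-DRAM cache ~{dram:6.0f} us/request vs "
          f"all-hit flash cache ~{FLASH_HIT_US:.0f} us/request")
```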

23 Graph traversal
A very latency-bound problem, because the next node to visit often cannot be predicted, so it is beneficial to reduce latency by moving computation closer to the data. [Diagram: in-storage processors attached to Flash 1/2/3 serve Hosts 1/2/3.] This demonstrates the benefit of lower-latency access into distributed flash for computation on large graphs.
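The latency-bound nature of traversal is visible in even the simplest breadth-first search: each step needs the adjacency lists of nodes discovered in the previous step, so the storage round-trip sits on the critical path. The `fetch_neighbors` stub and its sleep are hypothetical stand-ins for a read from distributed flash (or a request to an in-storage processor).

```python
from collections import deque
import time

# Toy adjacency data standing in for a large graph kept in distributed flash.
GRAPH = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: [0]}

def fetch_neighbors(node_id):
    """Stand-in for reading one adjacency list from distributed flash.
    The sleep models the per-read latency that near-storage processing
    tries to keep off the network/software stack."""
    time.sleep(0.0001)              # assumed ~100 us per dependent read
    return GRAPH.get(node_id, [])

def bfs(start):
    """Breadth-first traversal: each fetch depends on the previous result,
    so access latency (not bandwidth) bounds the traversal rate."""
    visited = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for nbr in fetch_neighbors(node):   # dependent random access
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return visited

print(bfs(0))
```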

24 Graph traversal performance
[Chart: traversal performance of DRAM-based and flash-based configurations. * The fast BlueDBM network was used even for the separate-network configuration, for fairness.] A flash-based system can achieve comparable performance with a much smaller cluster, so the cost of hosting the graph goes down.

25 Other potential applications
Genomics; deep machine learning; complex graph analytics; platform acceleration (Spark, MATLAB, SciDB, ...). Suggestions and collaboration are welcome!

26 Conclusion Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on big data. Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks. Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration. Thank you.

27

28 Near-Data Accelerator is Preferable
Traditional approach: on each motherboard the flash, DRAM, accelerator FPGA, and NIC all hang off the CPU, so hardware and software latencies are additive. BlueDBM: the accelerator FPGA sits next to the flash and the network interface on each motherboard, so data can be processed near storage without first crossing the CPU and DRAM. [Diagram comparing the two organizations across two motherboards.]

29 [Photo: the storage device hardware — a Virtex-7 VC707 board (PCIe, DRAM) connected to the Artix-7 flash card with its flash chips and network ports, plus a network cable.]

