GraFBoost: Using accelerated flash storage for external graph analytics
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind
MIT CSAIL

Large Graphs are Found Everywhere in Nature
Human neural network, structure of the internet, social networks: TB to 100s of TB in size!
Image sources: 1) Connectomics graph of the brain: "Connectomics and Graph Theory", Jon Clayden, UCL. 2) (Part of) the internet: Criteo Engineering Blog. 3) The graph of a social network: Griff's Graphs.

Storage for Graph Analytics
Graphs are extremely sparse, irregular in structure, and terabytes in size.
Storing them in TB of DRAM is expensive ($$$): $8000/TB, 200W.
Our goal ($): $500/TB, 10W.

Random Access Challenge in Flash Storage
                     Flash         DRAM
Bandwidth            3.6 GB/s      ~100 GB/s
Access granularity   8192 bytes    128 bytes
For many applications, performance is not bound entirely by memory bandwidth. In fact, most software struggles to saturate 3 GB/s.
Software is not optimized for random access. It wastes performance by not using most of each fetched page: using 8 bytes of an 8192-byte page uses 1/1024 of the bandwidth!
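The 1/1024 figure can be checked with a few lines of arithmetic. This is only a sketch using the two numbers quoted on the slide (3.6 GB/s and 8192-byte pages); the function name and the decimal GB-to-MB conversion are assumptions made for illustration.

```python
PAGE_BYTES = 8192       # flash access granularity quoted on the slide
FLASH_BW_GBPS = 3.6     # flash bandwidth quoted on the slide, in GB/s


def useful_bandwidth(bytes_used_per_page, page_bytes=PAGE_BYTES,
                     raw_gbps=FLASH_BW_GBPS):
    """Bandwidth (GB/s) spent on bytes the program actually uses."""
    return raw_gbps * bytes_used_per_page / page_bytes


# Fraction of each fetched page that is useful when only 8 bytes are needed.
fraction = 8 / PAGE_BYTES

# Effective bandwidth in MB/s (assuming 1 GB = 1000 MB).
effective_mbps = useful_bandwidth(8) * 1000

print(f"useful fraction: 1/{round(1 / fraction)}")
print(f"effective bandwidth: {effective_mbps:.2f} MB/s")
```

Under these assumptions, fine-grained random access leaves only a few MB/s of useful bandwidth out of 3.6 GB/s.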

Two Pillars for Flash in GraFBoost
Flash storage in GraFBoost rests on two pillars:
Sort-Reduce Algorithm: sequentialize random access
Hardware Acceleration: reduce the overhead of sort-reduce (system resources, power consumption, performance)

Vertex-Centric Programming Model
A popular model for efficient parallelism and distribution.
The "vertex program" only sees neighbors.
The vertex program is executed on one or more "active vertices".
The algorithm is organized into disjoint iterations, or supersteps. In each one, the vertex program is executed on all active vertices and marks some vertices as active for the next iteration.
[Figure: example graph with numeric labels, some vertex values still ∞, and the active vertices highlighted]

Algorithmic Representation of a Vertex Program Iteration
for each v_src in ActiveList do
    for each e(v_src, v_dst) in G do
        ev = edge_program(v_src.val, e.weight)
        v_dst.next_val = vertex_update(v_dst.next_val, ev)    (random read-modify update into vertex data)
    end for
end for
The two for-loops represent the two sources of fine-grained random access. The sort-reduce algorithm solves this issue.
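The iteration above can be sketched in Python as follows. This is an illustrative in-memory version, not the GraFBoost implementation: the adjacency-list layout, the `run_iteration` name, and the rule that a vertex becomes active when its value changes are all assumptions. Shortest paths is used as the example, with `edge_program` adding the edge weight and `vertex_update` taking the minimum.

```python
import math


def run_iteration(graph, values, active, edge_program, vertex_update):
    """Run one superstep over the active vertices.

    graph maps each source vertex to a list of (destination, weight) pairs.
    Returns the new vertex values and the active set for the next superstep.
    """
    next_vals = dict(values)
    next_active = set()
    for src in active:                            # outer loop: active vertices
        for dst, weight in graph.get(src, []):    # inner loop: outgoing edges
            ev = edge_program(values[src], weight)
            new_val = vertex_update(next_vals[dst], ev)   # random read-modify-write
            if new_val != next_vals[dst]:         # assumed activation rule
                next_vals[dst] = new_val
                next_active.add(dst)
    return next_vals, next_active


# Hypothetical three-vertex graph; vertex 0 is the source.
graph = {0: [(1, 5), (2, 2)], 1: [], 2: [(1, 1)]}
values = {0: 0, 1: math.inf, 2: math.inf}
active = {0}

while active:
    values, active = run_iteration(
        graph, values, active,
        edge_program=lambda val, weight: val + weight,
        vertex_update=min,
    )

print(values)
```

The write to `next_vals[dst]` is the access that becomes expensive on flash, because consecutive destinations can land anywhere in the vertex array.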

General Problem of Irregular Array Updates
Updating an array x with a stream of update requests xs and an update function f:
for each <idx, arg> in xs:
    x[idx] = f(x[idx], arg)
Each array value update reads a value from a location, performs some computation on it with an argument, and writes it back.
[Figure: update requests xs applied through f as random updates into X]
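As a baseline, the update loop can be written directly in Python. The data below is made up for illustration, and addition stands in for the update function f.

```python
def apply_updates(x, xs, f):
    """Apply each (idx, arg) request in arrival order: x[idx] = f(x[idx], arg)."""
    for idx, arg in xs:
        x[idx] = f(x[idx], arg)   # read, compute, write back at a random location
    return x


x = [0] * 4
xs = [(3, 7), (1, 3), (3, 5), (0, 2), (1, 1)]   # hypothetical update stream
apply_updates(x, xs, lambda old, arg: old + arg)
print(x)
```

Each request touches whatever location its index names, so on page-granular storage every update may fetch a different page.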

Solution Part One - Sort
Sort xs according to index, so that the random updates into X become sequential updates from the sorted xs.
Much better than naïve random updates.
However, terabyte graphs can generate terabyte logs, so sorting adds significant overhead.
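A small sketch of why sorting helps: it counts how often consecutive updates land on different pages before and after sorting by index. The page size of four entries, the helper name, and the update stream are all made up for illustration.

```python
def page_switches(xs, entries_per_page):
    """Count how often two consecutive updates fall on different pages."""
    pages = [idx // entries_per_page for idx, _ in xs]
    return sum(1 for a, b in zip(pages, pages[1:]) if a != b)


# Hypothetical update stream that alternates between two distant regions.
xs = [(9, 1), (0, 1), (8, 1), (1, 1), (10, 1), (2, 1)]

# sorted() is stable, so updates to the same index keep their arrival order.
sorted_xs = sorted(xs, key=lambda update: update[0])

unsorted_switches = page_switches(xs, entries_per_page=4)
sorted_switches = page_switches(sorted_xs, entries_per_page=4)
print(unsorted_switches, sorted_switches)
```

After sorting, every page is visited once and in order, which is the access pattern flash handles well.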

Solution Part Two - Reduce
An associative update function f can be interleaved with the sort, e.g., (A + B) + C = A + (B + C).
[Figure: runs of xs are merged step by step; entries with the same index are combined during each merge, so later merges handle fewer entries (reduced overhead)]
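Interleaving the reduction with the sort can be sketched as a merge sort whose merge step combines entries with equal indices. This is a minimal in-memory illustration, not the GraFBoost code; the function names and the input pairs are hypothetical, and it assumes f is associative (addition here).

```python
def merge_reduce(left, right, f):
    """Merge two index-sorted runs of (idx, arg), combining equal indices with f."""
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        # Take from the left run on ties so the merge stays stable.
        if j >= len(right) or (i < len(left) and left[i][0] <= right[j][0]):
            item = left[i]
            i += 1
        else:
            item = right[j]
            j += 1
        if out and out[-1][0] == item[0]:
            # Same index as the previous output entry: reduce instead of append.
            out[-1] = (item[0], f(out[-1][1], item[1]))
        else:
            out.append(item)
    return out


def sort_reduce(xs, f):
    """Merge sort in which every merge also reduces duplicate indices."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return merge_reduce(sort_reduce(xs[:mid], f), sort_reduce(xs[mid:], f), f)


xs = [(3, 7), (1, 3), (1, 8), (3, 5), (2, 8), (3, 9), (1, 2)]   # hypothetical
result = sort_reduce(xs, lambda a, b: a + b)
print(result)
```

Seven requests shrink to three entries, one per distinct index, and the shrinking happens during the merges rather than after a full sort.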

Big Benefits from Interleaving Reduction
[Chart: ratio of data copied at each sort phase; annotated "90%"]

Accelerated GraFBoost Architecture
An in-storage accelerator reduces data movement and cost.
[Architecture diagram labels]
Software host (server/PC/laptop): Edge Program, Accelerator-Aware Flash Management
FPGA Sort-Reduce Accelerator: Wire-Speed On-chip Sorter, Multirate 16-to-1 Merge-Sorter, Multirate Aggregator
Flash: Edge Property, Edge Data, Vertex Data, Active Vertices, Update Log (xs), Partially Sort-Reduced Files
Other labels: Vertex Value, 1GB DRAM
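The combination of a 16-to-1 merge-sorter and an aggregator can be mimicked in software with a k-way merge followed by grouping on the index. This is only a functional sketch using the Python standard library; it says nothing about how the hardware achieves its rate, and the function name and the sixteen runs are made up for illustration.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import itemgetter


def kway_merge_reduce(runs, f):
    """Merge index-sorted runs of (idx, arg) and combine equal indices with f."""
    merged = heapq.merge(*runs, key=itemgetter(0))           # k-way merge by index
    for idx, group in groupby(merged, key=itemgetter(0)):    # aggregate per index
        yield idx, reduce(f, (arg for _, arg in group))


# Sixteen hypothetical sorted runs; each index 0..7 appears in four of them.
runs = [[(k, 1) for k in range(r % 4, 8, 4)] for r in range(16)]
merged = list(kway_merge_reduce(runs, lambda a, b: a + b))
print(merged)
```

Merging many runs at once reduces the number of passes over storage compared with repeated two-way merges.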

Evaluated Graph Analytic Systems
Category        Systems
In-memory       GraphLab (IN)
Semi-External   FlashGraph (SE1), X-Stream (SE2)
External        GraphChi (EX), GraFSoft, GraFBoost, GraFBoost2 (projected system with 2x memory bandwidth)
If we had 4x memory bandwidth, even the in-memory sort-reducer could perform at wire speed, and considering that would only be four memory cards at 40GB/s, that wouldn't be too outrageous either. But for now, we have decided to take the conservative estimate.
References:
"Distributed GraphLab: a framework for machine learning and data mining in the cloud," VLDB 2012
"FlashGraph: Processing billion-node graphs on an array of commodity SSDs," FAST 2015
"X-Stream: edge-centric graph processing using streaming partitions," SOSP 2013
"GraphChi: Large-scale graph computation on just a PC," USENIX 2012

Evaluation Environment
All software experiments: 32-core Xeon, 128 GB RAM + 5x 0.5TB PCIe flash
Second platform: 4-core i5, 4 GB RAM + Virtex 7 FPGA, 1TB custom flash, 1GB on-board RAM
Price labels on the slide: $400, $8000, $1000, $???

Evaluation Result Overview
                 In-memory   Semi-External   External   GraFBoost
Large graphs     Fail        Fail            Slow       Fast
Medium graphs    Fail        Fast            Slow       Fast
Small graphs     Fast        Fast            Slow       Fast
GraFBoost has very low resource requirements: memory, CPU, power.

Results with a Large Graph: Synthetic Scale 32 Kronecker Graph
Graph: 0.5 TB in text, 4 billion vertices.
GraphLab (IN): out of memory. GraphChi (EX): did not finish.
Performance is normalized against a purely software implementation.
[Chart callouts: "1.7x", "2.8x", "10x", "Algorithms!", "Even the software version", "GraFSoft"]

Results with a Large Graph: Web Data Commons Web Crawl
Graph: 2 TB in text, 3 billion vertices.
GraphLab (IN): out of memory. GraphChi (EX): did not finish.
Only GraFBoost succeeds in both graphs. GraFBoost can run still larger graphs!
[Chart label: GraFSoft]

Results with Smaller Graphs: Breadth-First Search
Graph sizes: 0.03 TB (0.04 billion vertices), 0.09 TB (0.3 billion vertices), 0.3 TB (1 billion vertices).
[Chart callouts: "Fastest!", "Slowest" (three times); label: GraFSoft]
Since I've shown that GraFBoost is the only system that maintains high performance, or indeed any performance, on large graphs, I'd like to show that it also does pretty well on small graphs.

Results with a Medium Graph: Against an In-Memory Cluster
Synthesized Kronecker Scale 28: 0.09 TB in text, 0.3 billion vertices.
[Chart label: GraFSoft]

GraFBoost Reduces Resource Requirements
[Chart residue. Values: 720, 16, 160, 80, 40, 2. Labels: External Analytics + Hardware Acceleration; External Analytics; Hardware Acceleration]

Future Work
Open-source GraFBoost: cleaning up code for users!
Acceleration using Amazon F1.
Commercial accelerated storage: collaborating with Samsung.
More applications using sort-reduce: bioinformatics collaboration with Barcelona Supercomputing Center.

Thank You
Flash Storage: Sort-Reduce Algorithm, Hardware Acceleration

BlueDBM Cluster Architecture
[Cluster diagram: several nodes, each a host server (24-core) with an FPGA (VC707), 1GB RAM and minFlash. Labels: 1 TB, PCIe, 4GB/s, FMC, Ethernet, 10Gbps network ×8. Callout: uniform latency of 100 µs!]
Each FPGA is equipped with 1GB of DRAM. The high fan-out network links create a completely separate mesh network from the host's Ethernet network, which allows the in-storage accelerator very low-latency access to remote flash. Using a hardware implementation of a low-overhead transport-layer protocol, each network hop takes about 0.5 µs.

BlueDBM Storage Device
[Photos: the BlueDBM cluster; a BlueDBM storage device]
