Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind

Presentation transcript:

GraFBoost: Using accelerated flash storage for external graph analytics
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind, MIT CSAIL

Large Graphs are Found Everywhere in Nature
Human neural networks, the structure of the internet, social networks: TB to 100s of TB in size!
1) Connectomics graph of the brain - Source: "Connectomics and Graph Theory", Jon Clayden, UCL
2) (Part of) the internet - Source: Criteo Engineering Blog
3) The graph of a social network - Source: Griff's Graphs

Storage for Graph Analytics
Graphs are extremely sparse, irregular in structure, and terabytes in size.
Holding them in TB of DRAM costs $$$: $8000/TB, 200W.
Our goal: $500/TB, 10W using flash.

Random Access Challenge in Flash Storage
Flash vs. DRAM: bandwidth 3.6 GB/s vs. ~100 GB/s; access granularity 8192 bytes vs. 128 bytes.
For many applications, performance is not bound entirely by memory bandwidth; in fact, most software struggles to saturate even 3 GB/s.
Software not optimized for random access wastes performance by not using most of each fetched page: using 8 bytes of an 8192-byte page uses only 1/1024 of the bandwidth!

Two Pillars for Flash in GraFBoost
Sort-Reduce Algorithm: sequentialize random access to flash storage.
Hardware Acceleration: reduce the overhead of sort-reduce in system resources, power consumption, and performance.

Vertex-Centric Programming Model
Popular model for efficient parallelism and distribution: a "vertex program" only sees its neighbors.
The algorithm is organized into disjoint iterations, or supersteps, where the vertex program is executed on all "active vertices" and marks some vertices as active for the next iteration.
(Figure: example graph with tentative vertex values and the current set of active vertices.)
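To make this concrete, here is a minimal C++ sketch of a vertex program for single-source shortest paths in this model; the function names edge_program and vertex_update follow the next slide, but the types and code are illustrative assumptions, not GraFBoost's actual API.

```cpp
// Hypothetical sketch (not GraFBoost's API): a vertex program for
// single-source shortest paths, expressed as the two user functions
// the vertex-centric model requires.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>

using Value = uint64_t;                                  // per-vertex value (distance)
static const Value INF = std::numeric_limits<Value>::max();

// edge_program: combine the source vertex's value with the edge weight.
Value edge_program(Value src_val, Value edge_weight) {
    return (src_val == INF) ? INF : src_val + edge_weight;
}

// vertex_update: merge a propagated value into the destination vertex.
// Must be associative and commutative for sort-reduce to apply (min is).
Value vertex_update(Value current, Value incoming) {
    return std::min(current, incoming);
}

int main() {
    // One relaxation: a source at distance 5, an edge of weight 3, and a
    // destination currently at distance 10 -> the destination becomes 8
    // and would be marked active for the next superstep.
    Value dst = 10;
    dst = vertex_update(dst, edge_program(5, 3));
    std::cout << "new distance = " << dst << "\n";       // prints 8
}
```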

Algorithmic Representation of a Vertex Program Iteration
for each v_src in ActiveList do
    for each e(v_src, v_dst) in G do
        ev = edge_program(v_src.val, e.weight)
        v_dst.next_val = vertex_update(v_dst.next_val, ev)
    end for
end for
The inner update is a random read-modify-write into vertex data. The two for-loops represent the two sources of fine-grained random access; the sort-reduce algorithm solves this issue.
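Below is a self-contained C++ sketch of this iteration with SSSP plugged in as the vertex program; it illustrates the loop structure and its two random-access points, not GraFBoost's implementation (the names superstep, graph, and next_val are assumptions).

```cpp
// Minimal sketch of one vertex-program iteration (superstep) as written on
// the slide, using SSSP as the plugged-in vertex program.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct Edge { uint32_t dst; uint64_t weight; };
static const uint64_t INF = std::numeric_limits<uint64_t>::max();

uint64_t edge_program(uint64_t src_val, uint64_t w) { return src_val == INF ? INF : src_val + w; }
uint64_t vertex_update(uint64_t cur, uint64_t ev)   { return std::min(cur, ev); }

void superstep(const std::vector<std::vector<Edge>>& graph,   // adjacency lists
               const std::vector<uint32_t>& active_list,      // active vertices
               const std::vector<uint64_t>& val,              // current vertex values
               std::vector<uint64_t>& next_val) {             // values for next iteration
    for (uint32_t v_src : active_list) {                      // random access source #1:
        for (const Edge& e : graph[v_src]) {                  //   edge lists of active vertices
            uint64_t ev = edge_program(val[v_src], e.weight);
            // Random access source #2: fine-grained read-modify-write into
            // vertex data indexed by e.dst -- this is what sort-reduce removes.
            next_val[e.dst] = vertex_update(next_val[e.dst], ev);
        }
    }
}

int main() {
    std::vector<std::vector<Edge>> graph = {{{1, 2}, {2, 7}}, {{2, 3}}, {}};
    std::vector<uint64_t> val  = {0, INF, INF};
    std::vector<uint64_t> next = val;
    superstep(graph, {0}, val, next);    // after this, next = {0, 2, 7}
}
```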

General Problem of Irregular Array Updates
Updating an array x with a stream of update requests xs and an update function f:
for each <idx, arg> in xs: x[idx] = f(x[idx], arg)
Each array update reads a value from a location, performs some computation on it with an argument, and writes it back, producing random updates into x.
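A minimal sketch of this general pattern, assuming a simple in-memory array and a summing update function; the names x, xs, and f follow the slide, the rest is illustrative.

```cpp
// Naive irregular array update: every <idx, arg> may touch a different,
// data-dependent location of x -- random access when x lives in flash.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Update = std::pair<size_t, uint64_t>;   // <idx, arg>

template <typename F>
void irregular_update(std::vector<uint64_t>& x,
                      const std::vector<Update>& xs, F f) {
    for (const auto& [idx, arg] : xs) {
        // Read-modify-write at a data-dependent location.
        x[idx] = f(x[idx], arg);
    }
}

int main() {
    std::vector<uint64_t> x(8, 0);
    std::vector<Update> xs = {{3, 7}, {1, 3}, {3, 5}, {6, 2}};
    irregular_update(x, xs, [](uint64_t a, uint64_t b) { return a + b; });
    // x[3] == 12, x[1] == 3, x[6] == 2
}
```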

Solution Part One - Sort
Sort xs according to index, turning random updates into sequential updates of x.
Much better than naive random updates, but terabyte graphs can generate terabyte update logs, so the sorting overhead is significant.
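A sketch of the sort step under the same assumptions as above: sorting the update log by index turns the sweep over x into a sequential one. GraFBoost does this as an external sort because xs itself can be terabytes; this in-memory version only illustrates the idea.

```cpp
// Sort the update log xs by index, then apply it as a sequential sweep.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Update = std::pair<size_t, uint64_t>;   // <idx, arg>

void sorted_update(std::vector<uint64_t>& x, std::vector<Update> xs) {
    // After sorting, consecutive updates touch consecutive (or identical)
    // locations of x, so each page of x is fetched once.
    std::sort(xs.begin(), xs.end(),
              [](const Update& a, const Update& b) { return a.first < b.first; });
    for (const auto& [idx, arg] : xs) {
        x[idx] += arg;                        // sequential sweep over x
    }
}

int main() {
    std::vector<uint64_t> x(8, 0);
    sorted_update(x, {{3, 7}, {1, 3}, {3, 5}, {6, 2}});
}
```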

Solution Part Two - Reduce
If the update function f is associative, e.g., (A + B) + C = A + (B + C), reduction can be interleaved with the sort: whenever entries with the same index meet during a merge, they are combined immediately, shrinking the data and reducing overhead.
(Figure: key-value pairs of xs being merged, with entries sharing an index collapsed at each merge step.)
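Below is a sketch of a single merge step with interleaved reduction, assuming the update function is addition; entries that share an index are combined as soon as they meet, so every merge pass shrinks the data. GraFBoost performs this inside an external merge sort; this code only shows one in-memory merge.

```cpp
// Merge two index-sorted runs of the update log, summing entries that
// share an index ("reduce" interleaved with the merge).
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Update = std::pair<size_t, uint64_t>;   // <idx, arg>, sorted by idx

std::vector<Update> merge_reduce(const std::vector<Update>& a,
                                 const std::vector<Update>& b) {
    std::vector<Update> out;
    size_t i = 0, j = 0;
    auto emit = [&out](const Update& u) {
        if (!out.empty() && out.back().first == u.first)
            out.back().second += u.second;    // reduce in place (associative f)
        else
            out.push_back(u);
    };
    while (i < a.size() || j < b.size()) {
        if (j == b.size() || (i < a.size() && a[i].first <= b[j].first))
            emit(a[i++]);
        else
            emit(b[j++]);
    }
    return out;
}

int main() {
    // Index 3 appears three times across the two runs and collapses to a
    // single entry after one merge step.
    std::vector<Update> run1 = {{1, 3}, {3, 7}, {3, 5}};
    std::vector<Update> run2 = {{1, 8}, {2, 8}, {3, 9}};
    auto merged = merge_reduce(run1, run2);   // {1,11} {2,8} {3,21}
}
```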

Big Benefits from Interleaving Reduction
(Chart: ratio of data copied at each sort phase, with and without interleaved reduction; labels up to 90%.)

Accelerated GraFBoost Architecture
The in-storage accelerator reduces data movement and cost.
(Diagram: a software host (server/PC/laptop) with the edge program and vertex values; an FPGA sort-reduce accelerator containing a wire-speed on-chip sorter, a multirate 16-to-1 merge-sorter, a multirate aggregator, accelerator-aware flash management, and 1GB DRAM; flash holding edge data and edge properties, vertex data, active vertices, the update log (xs), and partially sort-reduced files.)
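As a rough software analogue of what the 16-to-1 merge-sorter and aggregator do, here is a k-way merge that reduces equal indices on the fly; this is purely illustrative, since the real component is a hardware pipeline streaming runs from flash at wire speed.

```cpp
// Software analogue of the accelerator's k-way merge with reduction:
// merge up to 16 index-sorted runs in one pass, combining equal indices.
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

using Update = std::pair<size_t, uint64_t>;   // <idx, arg>, runs sorted by idx

std::vector<Update> kway_merge_reduce(const std::vector<std::vector<Update>>& runs) {
    using Head = std::pair<Update, size_t>;   // (element, which run it came from)
    auto cmp = [](const Head& a, const Head& b) { return a.first.first > b.first.first; };
    std::priority_queue<Head, std::vector<Head>, decltype(cmp)> heap(cmp);

    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<Update> out;
    while (!heap.empty()) {
        auto [u, r] = heap.top();
        heap.pop();
        if (!out.empty() && out.back().first == u.first)
            out.back().second += u.second;    // aggregator: reduce equal indices
        else
            out.push_back(u);
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}

int main() {
    std::vector<std::vector<Update>> runs = {
        {{1, 3}, {3, 7}}, {{2, 8}, {3, 9}}, {{1, 2}, {5, 4}}};
    auto merged = kway_merge_reduce(runs);    // {1,5} {2,8} {3,16} {5,4}
}
```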

Evaluated Graph Analytic Systems
In-memory: GraphLab (IN)
Semi-external: FlashGraph (SE1), X-Stream (SE2)
External: GraphChi (EX)
Plus GraFBoost's own systems: GraFSoft (the software-only version), GraFBoost, and GraFBoost2, a projected system with 2x memory bandwidth.
If we had 4x memory bandwidth, even the in-memory sort-reducer could perform at wire speed, and since that would only be four memory cards at 40 GB/s, it wouldn't be too outrageous either; but for now we take the conservative estimate.
"Distributed GraphLab: a framework for machine learning and data mining in the cloud," VLDB 2012
"FlashGraph: Processing billion-node graphs on an array of commodity SSDs," FAST 2015
"X-Stream: edge-centric graph processing using streaming partitions," SOSP 2013
"GraphChi: Large-scale graph computation on just a PC," USENIX 2012

Evaluation Environment
All software experiments: 32-core Xeon, 128 GB RAM + 5x 0.5TB PCIe flash ($8000 + $1000).
GraFBoost: 4-core i5 with 4 GB RAM ($400) + Virtex 7 FPGA, 1TB custom flash, 1GB on-board RAM ($???).

Evaluation Result Overview
                In-memory  Semi-External  External  GraFBoost
Large graphs:   Fail       Fail           Slow      Fast
Medium graphs:  Fail       Fast           Slow      Fast
Small graphs:   Fast       Fast           Slow      Fast
GraFBoost also has very low resource requirements: memory, CPU, power.

Results with a Large Graph: Synthetic Scale-32 Kronecker Graph
0.5 TB in text, 4 billion vertices. GraphLab (IN) ran out of memory; GraphChi (EX) did not finish.
(Chart: performance across algorithms, normalized against the purely software implementation GraFSoft, with speedup labels of 1.7x, 2.8x, and 10x.)

Results with a Large Graph: Web Data Commons Web Crawl
2 TB in text, 3 billion vertices. GraphLab (IN) ran out of memory; GraphChi (EX) did not finish.
Only GraFBoost succeeds on both graphs, and GraFBoost can run still larger graphs!
(Chart includes the software version, GraFSoft.)

Results with Smaller Graphs: Breadth-First Search
Graphs of 0.03 TB (0.04 billion vertices), 0.09 TB (0.3 billion vertices), and 0.3 TB (1 billion vertices).
GraFBoost is the only system that maintains high performance, or indeed any performance, on large graphs, and it also does well on smaller graphs.
(Chart: per-system BFS performance with labels "Fastest!" and "Slowest"; includes the software version GraFSoft.)

Results with a Medium Graph: Against an In-Memory Cluster
Synthesized Kronecker scale-28 graph, 0.09 TB in text, 0.3 billion vertices.
(Chart includes the software version, GraFSoft.)

GraFBoost Reduces Resource Requirements
(Chart: resource requirements of external analytics alone vs. external analytics + hardware acceleration; bar labels include 720, 160, 80, 40, 16, and 2.)

Future Work
Open-source GraFBoost: cleaning up code for users!
Acceleration using Amazon F1.
Commercial accelerated storage: collaborating with Samsung.
More applications using sort-reduce: bioinformatics collaboration with Barcelona Supercomputing Center.

Thank You
Flash Storage, Sort-Reduce Algorithm, Hardware Acceleration.

BlueDBM Cluster Architecture
Each node pairs a host server (24-core) with an FPGA (VC707) and a 1TB minFlash board (attached via FMC), connected to the host over PCIe at 4 GB/s; hosts are also connected by Ethernet.
Each FPGA is equipped with 1GB of DRAM. The high fan-out network links (8x 10Gbps) create a completely separate mesh network from the host's Ethernet network, which gives the in-storage accelerator very low-latency access to remote flash. Using a hardware implementation of a low-overhead transport-layer protocol, each network hop takes about 0.5 µs. Uniform latency of 100 µs!

(Photos: the BlueDBM cluster and a BlueDBM storage device.)