Mars: A MapReduce Framework on Graphics Processors. Bingsheng He, Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.), Naga K. Govindaraju, Tuyong Wang.


Mars: A MapReduce Framework on Graphics Processors
Bingsheng He (1), Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.), Naga K. Govindaraju (Microsoft Corp.), Tuyong Wang (Sina Corp.)
(1) Currently at Microsoft Research Asia

Overview: Motivation, Design, Implementation, Evaluation, Conclusion


Graphics Processing Units
- Massively multi-threaded co-processors
  - 240 streaming processors on the NVIDIA GTX 280
  - ~1 TFLOPS of peak performance
- High-bandwidth memory
  - 142 GB/s to 1 GB of GDDR3 memory on the GTX 280
  - 10+x the peak bandwidth of main memory

Graphics Processing Units (Cont.)
- High-latency GDDR memory
  - ~200 clock cycles of latency
  - Latency hidden by a large number of concurrent threads (>8K), with low context-switch overhead
- Better architectural support for memory
  - Inter-processor communication through a local memory
  - Coalesced access
- High-speed bus to main memory
  - Current: PCI Express (4 GB/s)

GPGPU
- Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
- FFT [Moreland 03, Horn 06]
- Matrix operations [Jiang 05]
- Database applications
  - Basic operators [Naga 04]
  - Sorting [Govindaraju 06]
  - Join [He 08]

GPGPU Programming
- "Assembly languages": DirectX, OpenGL; graphics rendering pipelines
- "C/C++": NVIDIA CUDA, ATI CAL, or Brook+; different programming models; low portability across hardware vendors (NVIDIA GPU code cannot run on AMD GPUs)
- "Functional language"?

MapReduce
Goal: make GPGPU programming much easier, so the programmer need not worry about hardware details, while the framework harnesses the high parallelism and high computational capability of GPUs.

MapReduce Functions
- Processes lots of data to produce other data
- Input and output: sets of records in the form of key/value pairs
- The programmer specifies two functions:
  - map(in_key, in_value) -> emit list(intermediate_key, intermediate_value)
  - reduce(out_key, list(intermediate_value)) -> emit list(out_key, out_value)
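The two function signatures can be illustrated with the canonical word-count example (not taken from the slides); a minimal sequential C++ sketch:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: (in_key, in_value) -> list of (intermediate_key, intermediate_value).
// Here the input value is a line of text and each emitted pair is (word, 1).
std::vector<std::pair<std::string, int>> map_func(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) out.emplace_back(word, 1);
    return out;
}

// reduce: (out_key, list(intermediate_value)) -> (out_key, out_value).
// For word count, the reduction is simply a sum of the grouped counts.
std::pair<std::string, int> reduce_func(const std::string& key,
                                        const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return {key, sum};
}
```

The runtime's job (on CPU clusters as on the GPU) is everything between the two calls: grouping the intermediate pairs by key and feeding each group to reduce.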

MapReduce Workflow (workflow diagram)

MapReduce outside Google
- Hadoop [Apache project]
- MapReduce on multi-core CPUs: Phoenix [HPCA'07, Ranger et al.]
- MapReduce on Cell [07, Kruijf et al.]
- Merge [ASPLOS'08, Linderman et al.]
- MapReduce on GPUs [STMCS'08, Catanzaro et al.]


MapReduce on GPU
Software stack, top to bottom: applications (web analysis, data mining) run on the MapReduce framework (Mars), which is built on GPGPU languages (CUDA, Brook+) and rendering APIs (DirectX), on top of the GPU drivers.

MapReduce on Multi-core CPU (Phoenix [HPCA'07])
Workflow: Input -> Split -> Map -> Partition -> Reduce -> Merge -> Output

Limitations on GPUs
- Rely on the CPU to allocate memory
  - How to support variable-length data?
  - How to allocate the output buffer on the GPU?
- Lack of lock support
  - How to synchronize to avoid write conflicts?

Data Structure for Mars
Records are stored in three arrays: a key array (Key1, Key2, Key3, ...), a value array (Value1, Value2, Value3, ...), and an array of index entries (Index entry 1, Index entry 2, Index entry 3, ...). Each index entry records the offset and size of its record's key and value, so variable-length records are supported.
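A minimal sequential C++ sketch of this layout (struct and field names are illustrative, not the actual Mars source):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A Mars-style record directory: keys and values live in two flat byte
// arrays, and a fixed-size index entry per record stores the offset and size
// of its key and value. Records can therefore have any length, while the
// index itself stays fixed-width and easy to process in parallel.
struct IndexEntry {
    uint32_t keyOffset, keySize;
    uint32_t valOffset, valSize;
};

struct RecordStore {
    std::vector<char> keys, values;
    std::vector<IndexEntry> index;

    void add(const std::string& k, const std::string& v) {
        IndexEntry e{(uint32_t)keys.size(), (uint32_t)k.size(),
                     (uint32_t)values.size(), (uint32_t)v.size()};
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
        index.push_back(e);
    }

    std::string key(size_t i) const {
        const IndexEntry& e = index[i];
        return std::string(keys.data() + e.keyOffset, e.keySize);
    }
    std::string value(size_t i) const {
        const IndexEntry& e = index[i];
        return std::string(values.data() + e.valOffset, e.valSize);
    }
};
```

Because every index entry has the same size, thread i can locate record i with one array lookup, with no pointer chasing through variable-length data.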

Lock-free scheme for result output
Basic idea: calculate the offset of each thread in the output buffer.

Lock-free scheme example
Pick the odd numbers out of the array [1, 3, 2, 3, 4, 7, 9, 8], using the map function as a filter that keeps all odd numbers.

Lock-free scheme example (cont.)
Four threads T1-T4 each own two elements of [1, 3, 2, 3, 4, 7, 9, 8].
Step 1: Histogram: each thread counts its odd numbers: T1=2, T2=1, T3=1, T4=1.
Step 2: Prefix sum over the counts gives the write offsets [0, 2, 3, 4] and the total output size (5).

Lock-free scheme example (cont.)
Step 3: Allocate: using the total from the prefix sum over the histogram (5), allocate the output buffer on the GPU.

Lock-free scheme example (cont.)
Step 4: Computation: each thread writes its odd numbers starting at its own offset from the prefix sum, producing [1, 3, 3, 7, 9] with no write conflicts.

Lock-free scheme
1. Histogram on key sizes, value sizes, and record count.
2. Prefix sum on key sizes, value sizes, and record count.
3. Allocate the output buffer in GPU memory.
4. Perform the computation.
This avoids write conflicts and allocates the output buffer exactly once.
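The four steps can be sketched sequentially in C++, using the odd-number filter as the running example (each iteration of the outer loops stands in for one GPU thread; the real Mars code runs these phases as CUDA kernels):

```cpp
#include <cassert>
#include <vector>

// Lock-free output scheme: histogram -> prefix sum -> allocate -> compute.
// "Thread" t owns a contiguous chunk of the input (size assumed divisible).
std::vector<int> filter_odds(const std::vector<int>& in, int threads) {
    int chunk = (int)in.size() / threads;

    // Step 1: Histogram -- each thread counts how many items it will emit.
    std::vector<int> counts(threads, 0);
    for (int t = 0; t < threads; ++t)
        for (int i = t * chunk; i < (t + 1) * chunk; ++i)
            if (in[i] % 2 != 0) ++counts[t];

    // Step 2: Exclusive prefix sum gives each thread its write offset.
    std::vector<int> offsets(threads, 0);
    for (int t = 1; t < threads; ++t)
        offsets[t] = offsets[t - 1] + counts[t - 1];

    // Step 3: Allocate the output buffer exactly once, at its exact size.
    std::vector<int> out(offsets[threads - 1] + counts[threads - 1]);

    // Step 4: Computation -- each thread writes at its own offset,
    // so no two threads ever touch the same slot: no locks needed.
    for (int t = 0; t < threads; ++t) {
        int pos = offsets[t];
        for (int i = t * chunk; i < (t + 1) * chunk; ++i)
            if (in[i] % 2 != 0) out[pos++] = in[i];
    }
    return out;
}
```

On the slide's input, `filter_odds({1, 3, 2, 3, 4, 7, 9, 8}, 4)` yields [1, 3, 3, 7, 9]. The price of lock freedom is running the map (or a counting version of it) twice: once to size the output, once to produce it.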

Mars Workflow
Input -> MapCount -> Prefix sum -> allocate intermediate buffer on GPU -> Map -> Sort and Group -> ReduceCount -> Prefix sum -> allocate output buffer on GPU -> Reduce -> Output

Mars Workflow: Map Only
Input -> MapCount -> Prefix sum -> allocate intermediate buffer on GPU -> Map -> Output
Map only, without grouping and reduce.

Mars Workflow: Without Reduce
Input -> MapCount -> Prefix sum -> allocate intermediate buffer on GPU -> Map -> Sort and Group -> Output
Map and grouping, without reduce.

APIs of Mars
- User-defined: MapCount, Map, Compare (optional), ReduceCount (optional), Reduce (optional)
- Runtime-provided: AddMapInput, MapReduce, EmitInterCount, EmitIntermediate, EmitCount (optional), Emit (optional)
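A rough, hypothetical C++ sketch of how a user-defined MapCount/Map pair interacts with the two emit functions, using string match as the example. The real Mars signatures and buffer management differ; this only illustrates the two-pass contract (MapCount reports sizes, Map writes data):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for Mars's runtime emit buffers (the real framework keeps these
// on the GPU). EmitInterCount accumulates sizes during the MapCount pass;
// EmitIntermediate writes records during the Map pass.
struct Emitter {
    size_t keyBytes = 0, valBytes = 0, records = 0;       // pass-1 totals
    std::vector<std::pair<std::string, std::string>> inter; // pass-2 output

    void EmitInterCount(size_t ks, size_t vs) {
        keyBytes += ks; valBytes += vs; ++records;
    }
    void EmitIntermediate(const std::string& k, const std::string& v) {
        inter.emplace_back(k, v);
    }
};

// User-defined pair for string match. MapCount only sizes the output...
void MapCount(Emitter& e, const std::string& line, const std::string& target) {
    size_t pos = line.find(target);
    if (pos != std::string::npos)
        e.EmitInterCount(target.size(), std::to_string(pos).size());
}

// ...and Map, run after the runtime allocates the buffers, emits the data.
void Map(Emitter& e, const std::string& line, const std::string& target) {
    size_t pos = line.find(target);
    if (pos != std::string::npos)
        e.EmitIntermediate(target, std::to_string(pos));
}
```

The same pattern repeats for ReduceCount/Reduce via EmitCount/Emit: every data-producing function has a counting twin so the runtime can allocate exactly once.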


Mars-GPU: NVIDIA CUDA; each map instance or reduce instance is a GPU thread.
Mars-CPU: the operating system's thread APIs; each map instance or reduce instance is a CPU thread.

Optimizations Based on CUDA Features
- Coalesced access: multiple accesses to consecutive memory addresses are combined into one transfer.
- Built-in vector types (int4, char4, etc.): multiple small data items are fetched in one memory request.
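The vector-type idea can be illustrated off-GPU: one word-sized access fetches four chars at once, in the spirit of CUDA's built-in char4 (a hypothetical helper, not Mars code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Fetch four adjacent chars through a single 4-byte load instead of four
// separate byte loads -- the CPU-side analogue of reading a char4 on the GPU.
uint32_t load_char4(const char* p) {
    uint32_t w;
    std::memcpy(&w, p, sizeof(w));  // one word-sized memory access
    return w;
}
```

On the GPU, issuing one int4/char4 request per thread instead of several scalar ones reduces the number of memory transactions, which is where the slide's up-to-2x speedup comes from.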


Experimental Setup
Comparison: CPU: Phoenix, Mars-CPU; GPU: Mars-GPU.
                     CPU (P4 Quad)    GPU (NV GTX 8800)
Processors (Hz):     2.66 GHz x 4     1.35 GHz x 128
Cache size:          8 MB             256 KB
Bandwidth (GB/sec):
OS: Fedora Core

Applications
- String Match (SM): find the position of a string in a file. [S: 32 MB, M: 64 MB, L: 128 MB]
- Inverted Index (II): build an inverted index for links in HTML files. [S: 16 MB, M: 32 MB, L: 64 MB]
- Similarity Score (SS): compute the pair-wise similarity score for a set of documents. [S: 512x128, M: 1024x128, L: 2048x128]

Applications (Cont.)
- Matrix Multiplication (MM): multiply two matrices. [S: 512x512, M: 1024x1024, L: 2048x2048]
- Page View Rank (PVR): count the number of distinct page views from web logs. [S: 32 MB, M: 64 MB, L: 96 MB]
- Page View Count (PVC): find the top-10 hot pages in the web log. [S: 32 MB, M: 64 MB, L: 96 MB]

Effect of Coalesced Access
Coalesced access achieves a speedup of 1.2-2x.

Effect of Built-In Data Types
Built-in data types achieve a speedup of up to 2x.

Time Breakdown of Mars-GPU
The GPU accelerates the computation phases of MapReduce.

Mars-GPU vs. Phoenix on Quad-core CPU
The speedup is – times with various data sizes.

Mars-GPU vs. Mars-CPU
The GPU accelerates MapReduce by up to 7 times.

Mars-CPU vs. Phoenix
Mars-CPU is 1-5 times as fast as Phoenix.


Conclusion
A MapReduce framework on GPUs:
- Ease of GPU application development
- Performance acceleration
Want a copy of Mars?

Discussion
- A uniform co-processing framework between the CPU and the GPU
- High-performance computation routines: index serving, data mining (ongoing)
- Power-consumption benchmarking of the GPU: the GPU is a test bed for the future CPU
- ...