
1 Mars: A MapReduce Framework on Graphics Processors
Bingsheng He¹, Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.)
Naga K. Govindaraju (Microsoft Corp.)
Tuyong Wang (Sina Corp.)
¹ Currently in Microsoft Research Asia

2 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

3 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

4 Graphics Processing Units
Massively multi-threaded co-processors
– 240 streaming processors on the NVIDIA GTX 280
– ~1 TFLOPS of peak performance
High-bandwidth memory
– 10+× the peak bandwidth of main memory
– 142 GB/s, 1 GB GDDR3 memory on the GTX 280

5 Graphics Processing Units (Cont.)
High-latency GDDR memory
– ~200 clock cycles of latency
– Latency hidden by a large number of concurrent threads (>8K)
– Low context-switch overhead
Better architectural support for memory
– Inter-processor communication through local memory
– Coalesced access
High-speed bus to main memory
– Currently PCI Express (4 GB/sec)

6 GPGPU
Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
FFT [Moreland 03, Horn 06]
Matrix operations [Jiang 05]
Folding@home, SETI@home
Database applications
– Basic operators [Naga 04]
– Sorting [Govindaraju 06]
– Join [He 08]

7 GPGPU Programming
"Assembly languages"
– DirectX, OpenGL
– Graphics rendering pipelines
"C/C++"
– NVIDIA CUDA, ATI CAL, or Brook+
– Different programming models
– Low portability across hardware vendors: NVIDIA GPU code cannot run on AMD GPUs
"Functional language"?

8 MapReduce
Makes GPGPU programming much easier: no need to worry about hardware details.
Harnesses the high parallelism and high computational capability of GPUs.

9 MapReduce Functions
Processes lots of data to produce other data.
Input & output: a set of records in the form of key/value pairs.
The programmer specifies two functions:
– map(in_key, in_value) -> emit list(intermediate_key, intermediate_value)
– reduce(out_key, list(intermediate_value)) -> emit list(out_key, out_value)
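To make these signatures concrete, here is a minimal word-count sketch of the two user-defined functions in C++. The word-count application and the emit-by-returned-vector convention are illustrative only, not part of the slides.

```cpp
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map(in_key, in_value) -> list(intermediate_key, intermediate_value)
std::vector<KV> map(const std::string& /*in_key*/, const std::string& in_value) {
    std::vector<KV> out;
    size_t pos = 0;
    while (pos < in_value.size()) {
        size_t end = in_value.find(' ', pos);
        if (end == std::string::npos) end = in_value.size();
        if (end > pos) out.emplace_back(in_value.substr(pos, end - pos), 1); // emit (word, 1)
        pos = end + 1;
    }
    return out;
}

// reduce(out_key, list(intermediate_value)) -> list(out_key, out_value)
KV reduce(const std::string& out_key, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;  // total occurrences of this word
    return {out_key, sum};
}
```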

10 MapReduce Workflow
[Workflow diagram from http://labs.google.com/papers/mapreduce.html]

11 MapReduce outside Google
Hadoop [Apache project]
MapReduce on multicore CPUs: Phoenix [HPCA'07, Ranger et al.]
MapReduce on Cell ['07, Kruijf et al.]
Merge [ASPLOS'08, Linderman et al.]
MapReduce on GPUs [STMCS'08, Catanzaro et al.]

12 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

13 MapReduce on GPU
Software stack (top to bottom):
– Applications: Web Analysis, Data Mining
– MapReduce (Mars)
– GPGPU languages (CUDA, Brook+)
– Rendering APIs (DirectX)
– GPU drivers

14 MapReduce on Multi-core CPU (Phoenix [HPCA'07])
Workflow: Input → Split → Map → Partition → Reduce → Merge → Output

15 Limitations on GPUs
Rely on the CPU to allocate memory
– How to support variable-length data?
– How to allocate output buffers on the GPU?
Lack of lock support
– How to synchronize to avoid write conflicts?

16 Data Structure for Mars
A record = an index entry + its key + its value.
Keys are packed into one array (Key1 Key2 Key3 …), values into another (Value1 Value2 Value3 …), and a directory of index entries (Index entry1 Index entry2 Index entry3 …) locates them.
An index entry = <key offset, key size, value offset, value size>.
This supports variable-length records!
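In code, this layout might look like the following sketch (struct and field names are ours):

```cpp
// A sketch of the Mars record layout: keys and values live in two flat
// arrays, and a directory of fixed-size index entries locates each record.
struct IndexEntry {
    int keyOffset;  // byte offset of the key in the key array
    int keySize;    // key length in bytes
    int valOffset;  // byte offset of the value in the value array
    int valSize;    // value length in bytes
};

struct RecordSet {
    char*       keys;   // all keys, packed back to back
    char*       vals;   // all values, packed back to back
    IndexEntry* index;  // one fixed-size entry per record
    int         count;  // number of records
};
```

Because the index entries are fixed-size, any thread can locate any record in constant time, while the keys and values themselves may have arbitrary lengths.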

17 Lock-free scheme for result output
Basic idea: calculate each thread's write offset into the output buffer before any thread writes.

18 Lock-free scheme example
Pick out the odd numbers from the array [1, 3, 2, 3, 4, 7, 9, 8].
The map function acts as a filter: it keeps all odd numbers. (A runnable sketch of the whole scheme follows slide 22.)

19 Lock-free scheme example
The input [1, 3, 2, 3, 4, 7, 9, 8] is assigned to threads T1–T4 in strided fashion: T1 = {1, 4}, T2 = {3, 7}, T3 = {2, 9}, T4 = {3, 8}.
Step 1: Histogram – odd numbers per thread: 1, 2, 1, 1.
Step 2: Prefix sum – write offsets per thread: 0, 1, 3, 4 (5 results in total).

20 Lock-free scheme example (Cont.)
Step 3: Allocate – the prefix-sum total (5) gives the exact output size, so the output buffer is allocated on the GPU exactly once.

21 Lock-free scheme example (Cont.)
Step 4: Computation – each thread rescans its elements and writes its odd numbers starting at its own offset (0, 1, 3, 4). The five results 1, 3, 3, 7, 9 fill the buffer with no write conflicts.

22 Lock-free scheme
1. Histogram on key sizes, value sizes, and record count.
2. Prefix sum on key sizes, value sizes, and record count.
3. Allocate the output buffer in GPU memory.
4. Perform the computation.
This avoids write conflicts and allocates the output buffer exactly once.
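Below is a small, runnable CUDA sketch of this scheme applied to the odd-number filter from slides 18–21. Kernel and variable names are ours, and the prefix sum runs on the host for brevity (Mars performs it on the GPU).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Step 1: each thread counts the odd numbers among its strided elements.
__global__ void countKernel(const int* in, int n, int* counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int c = 0;
    for (int i = t; i < n; i += stride)
        if (in[i] % 2 != 0) ++c;
    counts[t] = c;
}

// Step 4: each thread rescans its elements and writes its odd numbers
// starting at its precomputed offset -- no two threads share a slot.
__global__ void writeKernel(const int* in, int n, const int* offsets, int* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int o = offsets[t];
    for (int i = t; i < n; i += stride)
        if (in[i] % 2 != 0) out[o++] = in[i];
}

int main() {
    const int n = 8, threads = 4;
    int h_in[n] = {1, 3, 2, 3, 4, 7, 9, 8};
    int *d_in, *d_counts, *d_offsets, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_counts, threads * sizeof(int));
    cudaMalloc(&d_offsets, threads * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    countKernel<<<1, threads>>>(d_in, n, d_counts);   // histogram: 1, 2, 1, 1

    // Step 2: exclusive prefix sum of the per-thread counts.
    int h_counts[threads], h_offsets[threads], total = 0;
    cudaMemcpy(h_counts, d_counts, sizeof(h_counts), cudaMemcpyDeviceToHost);
    for (int i = 0; i < threads; ++i) { h_offsets[i] = total; total += h_counts[i]; }
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

    cudaMalloc(&d_out, total * sizeof(int));          // Step 3: allocate exactly once

    writeKernel<<<1, threads>>>(d_in, n, d_offsets, d_out);

    int h_out[n];
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < total; ++i) printf("%d ", h_out[i]);  // prints: 1 3 7 9 3
    printf("\n");
    return 0;
}
```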

23 Mars Workflow
Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output

24 Mars Workflow – Map Only
Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Output
Map only, with no grouping and no reduce stage.

25 Mars Workflow – Without Reduce
Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → Output
Map and grouping, with no reduce stage.

26 APIs of Mars
User-defined: MapCount, Map, Compare (optional), ReduceCount (optional), Reduce (optional)
Runtime-provided: AddMapInput, MapReduce, EmitInterCount, EmitIntermediate, EmitCount (optional), Emit (optional)
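For illustration, here is a sketch of a trivial user-defined MapCount/Map pair that emits one (key, 1) pair per input record. Only the names MapCount, Map, EmitInterCount, and EmitIntermediate come from the slide; all signatures here are our assumptions, not the actual Mars API.

```cpp
// Sketch only: the runtime-provided emit functions are declared, not defined.
void EmitInterCount(int keySize, int valSize);
void EmitIntermediate(const void* key, int keySize, const void* val, int valSize);

// Pass 1 (MapCount): declare the sizes Map will emit, writing no data.
// The runtime prefix-sums these counts and preallocates the buffer.
void MapCount(const void* /*key*/, int keySize, const void* /*val*/, int /*valSize*/) {
    EmitInterCount(keySize, sizeof(int));  // one (key, 1) pair per record
}

// Pass 2 (Map): emit the actual data into the preallocated buffer.
void Map(const void* key, int keySize, const void* /*val*/, int /*valSize*/) {
    int one = 1;
    EmitIntermediate(key, keySize, &one, sizeof(int));
}
```

The contract is that MapCount and Map agree exactly on the emitted sizes: the offsets the runtime computes from MapCount's counts determine where Map's output lands.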

27 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

28 Mars-GPU and Mars-CPU
Mars-GPU: built on NVIDIA CUDA; each map or reduce instance is a GPU thread.
Mars-CPU: built on the operating system's thread APIs; each map or reduce instance is a CPU thread.

29 Optimization using CUDA features
Coalesced access
– Multiple accesses to consecutive memory addresses are combined into one transfer.
Built-in vector types (int4, char4, etc.)
– Multiple small data items are fetched in one memory request.
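As a small illustration of these two points (our example, not Mars code), the two CUDA kernels below count spaces in a text buffer. In both, consecutive threads touch consecutive addresses (coalesced); the second fetches four bytes per memory request using the built-in char4 type.

```cpp
// One 1-byte item per thread per request.
__global__ void countSpacesScalar(const char* text, int n, int* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && text[i] == ' ')
        atomicAdd(result, 1);
}

// Same work with the built-in char4 vector type: each thread fetches
// four consecutive bytes in a single memory request.
__global__ void countSpacesVec4(const char4* text, int n4, int* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        char4 c = text[i];
        int local = (c.x == ' ') + (c.y == ' ') + (c.z == ' ') + (c.w == ' ');
        if (local) atomicAdd(result, local);
    }
}
```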

30 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

31 Experimental Setup
Comparison:
– CPU: Phoenix, Mars-CPU
– GPU: Mars-GPU

                    CPU (P4 Quad)   GPU (NV GTX 8800)
Processors          2.66 GHz × 4    1.35 GHz × 128
Cache size          8 MB            256 KB
Bandwidth (GB/sec)  10.4            86.4

OS: Fedora Core 7.0

32 Applications
String Match (SM): find the position of a string in a file. [S: 32MB, M: 64MB, L: 128MB]
Inverted Index (II): build an inverted index for links in HTML files. [S: 16MB, M: 32MB, L: 64MB]
Similarity Score (SS): compute the pair-wise similarity scores for a set of documents. [S: 512x128, M: 1024x128, L: 2048x128]

33 Applications (Cont.)
Matrix Multiplication (MM): multiply two matrices. [S: 512x512, M: 1024x1024, L: 2048x2048]
Page View Rank (PVR): count the number of distinct page views in web logs. [S: 32MB, M: 64MB, L: 96MB]
Page View Count (PVC): find the top-10 hot pages in the web log. [S: 32MB, M: 64MB, L: 96MB]

34 Effect of Coalesced Access
Coalesced access achieves a speedup of 1.2–2X.

35 Effect of Built-In Data Types
Built-in data types achieve a speedup of up to 2X.

36 Time Breakdown of Mars-GPU
The GPU accelerates the computation in MapReduce.

37 Mars-GPU vs. Phoenix on a Quad-core CPU
The speedup is 1.5–16X across the various data sizes.

38 Mars-GPU vs. Mars-CPU
The GPU accelerates MapReduce by up to 7X.

39 Mars-CPU vs. Phoenix
Mars-CPU is 1–5X as fast as Phoenix.

40 Overview: Motivation – Design – Implementation – Evaluation – Conclusion

41 Conclusion
MapReduce framework on GPUs
– Ease of GPU application development
– Performance acceleration
Want a copy of Mars? http://www.cse.ust.hk/gpuqp/Mars.html

42 Discussion
A uniform co-processing framework between the CPU and the GPU
High-performance computation routines
– Index serving
– Data mining (ongoing)
Power-consumption benchmarking of the GPU
– The GPU is a test bed for the future CPU.
…

