
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission.


1 Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^

* Georgia Institute of Technology
# NVIDIA Research
^ NEC Laboratories America

Sponsors: National Science Foundation, LogicBlox Inc., IBM, and NVIDIA

2 The General Purpose GPU

Execution flow: ① Input Data, ② Launch Kernel, ③ Execute, ④ Result

CPU (multi-core): 2-10 cores, main memory ~128 GB
GPU: ~1500 cores, GPU memory ~6 GB
Interconnect: PCI-E

- The GPU is a many-core co-processor
  - 10s to 100s of cores
  - 1000s to 10,000s of concurrent threads
  - CUDA and OpenCL are the dominant programming models
- Well suited for data-parallel applications
  - Molecular dynamics, options pricing, ray tracing, etc.
- Commodity: led by NVIDIA, AMD, and Intel

3 Enterprise: Amazon EC2 GPU Instance

Amazon EC2 GPU Instances

Element: Characteristics
OS: CentOS 5.5
CPU: 2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93GHz)
GPU: 2 x NVIDIA Tesla "Fermi" M2050 GPU; NVIDIA GPU driver and CUDA toolkit 3.1
Memory: 22 GB
Storage: 1690 GB
I/O: 10 GigE
Price: $2.10/hour

[Figure: NVIDIA Tesla]

4 Data Warehousing Applications on GPUs

The good
- Lots of potential data parallelism
- If the data fits in GPU memory, speedups of 2x-27x have been shown

The bad
- Very large data sets (will not even fit in host memory)
- I/O bound (the GPU has no disk)
- PCI data transfer takes 15-90% of the total time *

Example table:
Order  Price  Discount
0      10     10%
1      20     20%
2      10     15%
3      51     14%
4      33     13%
5      22     10%
...

* B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.

5 This Work

Goal: Demonstrate the benefits of Kernel Fusion and Kernel Fission in enabling large data warehousing applications on GPUs

Assumptions
- In-memory system
  - Host memory, not GPU memory
- Not simple OLTP (Online Transaction Processing) queries
  - Focus on data analysis instead of data entry/retrieval

6 Two Optimizations for Data Movement

CPU (multi-core): 2-10 cores, main memory ~128 GB
GPU: ~1500 cores, GPU memory ~6 GB
PCI-E: ~16 GB/s (this is the problem!)

Our solutions are:
- Kernel Fusion: aggregate computation to reuse data
- Kernel Fission: overlap computation with PCI transfer

7 Relational Algebra (RA) Operators

RA operators are the building blocks of database applications.

UNION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  union x y -> {(3,a), (4,a), (2,b), (0,a)}

INTERSECTION
  x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}
  intersection x y -> {(2,b)}

PRODUCT
  x = {(3,a), (4,a)}, y = {(True,2)}
  product x y -> {(3,a,True,2), (4,a,True,2)}

DIFFERENCE
  x = {(3,a), (4,a), (2,b)}, y = {(4,a), (3,a)}
  difference x y -> {(2,b)}

JOIN
  x = {(2,b), (3,a), (4,a)}, y = {(2,f), (3,c)}
  join x y -> {(3,a,c), (2,b,f)}

PROJECTION
  x = {(3,True,a), (4,True,a), (2,False,b)}
  project [0,2] x -> {(3,a), (4,a), (2,b)}

SELECT
  x = {(3,True,a), (4,True,a), (2,False,b)}
  select [field.0==2] x -> {(2,False,b)}
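The operator examples on this slide can be reproduced with a minimal Python sketch. It assumes relations are modeled as sets of tuples and that JOIN matches on the first field, which is what the slide's example suggests. The function names are illustrative only and say nothing about how the operators are implemented on the GPU.

```python
# Relations are modeled as Python sets of tuples. This is an assumption made
# for illustration; it is not the GPU data layout used in the presentation.

def union(x, y):
    return x | y

def intersection(x, y):
    return x & y

def difference(x, y):
    return x - y

def product(x, y):
    # Cartesian product: concatenate every tuple of x with every tuple of y.
    return {a + b for a in x for b in y}

def join(x, y):
    # Join on the first field (assumed key); the key appears once in the output.
    return {a + b[1:] for a in x for b in y if a[0] == b[0]}

def project(fields, x):
    # Keep only the listed field positions of each tuple.
    return {tuple(t[i] for i in fields) for t in x}

def select(pred, x):
    # Keep only the tuples that satisfy the predicate.
    return {t for t in x if pred(t)}

x = {(3, "a"), (4, "a"), (2, "b")}
y = {(0, "a"), (2, "b")}
z = {(3, True, "a"), (4, True, "a"), (2, False, "b")}

print(union(x, y))
print(join({(2, "b"), (3, "a"), (4, "a")}, {(2, "f"), (3, "c")}))
print(select(lambda t: t[0] == 2, z))
```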

8 Common RA Combinations of TPC-H

9 Experimental Environment

A sequence of SELECTs is used to demonstrate the benefits of Kernel Fusion/Fission.

CPU: 2 quad-core Xeon E5520 @ 2.27GHz
Memory: 48 GB
GPU: 1 Tesla C2070 (6GB GDDR5 memory)
OS: Ubuntu 10.04 Server
GCC: 4.4.3
NVCC: 4.0

10 PCI Bandwidth vs. GPU Computation Capacity

PCI Bandwidth < GPU Computation Capacity (1 SELECT)

11 Kernel Fusion

Example with A1 = (1, 2, 3), A2 = (4, 5, 6), A3 = (2, 4, 6):

Unfused: Kernel A computes A1 + A2 = (5, 7, 9); Kernel B subtracts A3 from that to give Result = (3, 3, 3).
Fused: a single Fused Kernel A, B reads A1, A2, and A3 and produces Result = (3, 3, 3).
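The A1, A2, A3 example on this slide can be mimicked in a few lines of Python. This is only a sketch: it assumes the digits on the slide are element-wise array values, and the function names `kernel_a`, `kernel_b`, and `fused_kernel` are hypothetical stand-ins for GPU kernels. The point it illustrates is that the fused version never materializes the intermediate array.

```python
def kernel_a(a1, a2):
    # Unfused step 1: element-wise add, writing a full temporary array.
    return [p + q for p, q in zip(a1, a2)]

def kernel_b(temp, a3):
    # Unfused step 2: re-reads the temporary array and subtracts A3.
    return [t - r for t, r in zip(temp, a3)]

def fused_kernel(a1, a2, a3):
    # Fused: one pass over the inputs; the intermediate sum exists only
    # as a per-element value and is never stored as an array.
    return [(p + q) - r for p, q, r in zip(a1, a2, a3)]

a1, a2, a3 = [1, 2, 3], [4, 5, 6], [2, 4, 6]

temp = kernel_a(a1, a2)            # intermediate array
unfused = kernel_b(temp, a3)       # final result via two kernels
fused = fused_kernel(a1, a2, a3)   # final result via one kernel

print(temp, unfused, fused)
```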

12 Benefits of Kernel Fusion: Reduce Data Footprint (1)

- Spatial locality
- Temporal locality
- Traverse the data only ONCE

[Diagram: A1, A2, A3, temp, and Result in GPU memory]

13 Benefits of Kernel Fusion: Reduce Data Footprint (2)

- Reduce data transfer
- Memory efficiency

[Diagram: input1, result1, input2, result2 moving between CPU MEM and GPU MEM; Kernel A and Kernel B with Temp versus Fused Kernel A, B in GPU MEM]

14 Benefits of Kernel Fusion: Enlarge Optimization Scope

- Eliminate common stages
- Enable more optimizations

Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation, ...

[Diagram: Kernel A and Kernel B versus Fused Kernel A, B]

15 Examples of Kernel Fusion

[Figure panels: Original 1 SELECT; Fused 2 SELECTs]
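As a rough illustration of what fusing two SELECTs could look like, the Python sketch below assumes each SELECT consists of a filter stage followed by a gather stage (the stage names appear on the breakdown slide). The function names, the price values, and the two predicates are hypothetical. In the fused version both predicates are evaluated in one filter pass, so the intermediate result of the first SELECT is never built.

```python
def select(values, pred):
    # Filter stage: mark which elements satisfy the predicate.
    flags = [pred(v) for v in values]
    # Gather stage: compact the marked elements into a dense output.
    return [v for v, keep in zip(values, flags) if keep]

def two_selects(values, p1, p2):
    # Unfused: the first SELECT materializes an intermediate array.
    temp = select(values, p1)
    return select(temp, p2)

def fused_selects(values, p1, p2):
    # Fused: one filter pass with both predicates, then a single gather.
    flags = [p1(v) and p2(v) for v in values]
    return [v for v, keep in zip(values, flags) if keep]

def above_15(v):
    return v > 15

def below_40(v):
    return v < 40

prices = [10, 20, 10, 51, 33, 22]  # hypothetical input column

print(two_selects(prices, above_15, below_40))
print(fused_selects(prices, above_15, below_40))
```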

16 Kernel Fusion: Overall Performance

[Figure panels: Including PCI; Excluding PCI. Annotations: 1.80x speedup; PCI-e noise]

17 Kernel Fusion: Execution Time Breakdown

[Figure annotations: Not needed; Faster filter and gather]

18 Kernel Fusion: Sensitivity

- Fusing more kernels is better
- A lower selection rate is better

19 Kernel Fission: CUDA Streams

- Commands (kernel or memcpy) in different CUDA streams can run in parallel
- Commands in the same CUDA stream run sequentially
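The effect of these two stream rules can be sketched with a small timing model in Python. This is not CUDA code and not the authors' implementation. It assumes the input is split into chunks, each chunk is handled in its own stream as host-to-device copy, kernel, then device-to-host copy, and that the two copy directions and the kernel each use a separate hardware resource. The function names and durations are hypothetical.

```python
def serial_time(chunks):
    # No overlap: every command of every chunk runs back to back.
    return sum(h2d + kernel + d2h for h2d, kernel, d2h in chunks)

def pipelined_time(chunks):
    # resource_free[s] is when resource s becomes free
    # (0 = host-to-device copy, 1 = kernel execution, 2 = device-to-host copy).
    resource_free = [0.0, 0.0, 0.0]
    for stages in chunks:
        ready = 0.0  # when the previous command of this stream finished
        for s, duration in enumerate(stages):
            # Same stream: wait for the previous command of this chunk.
            # Different streams: wait only until the shared resource is free.
            start = max(ready, resource_free[s])
            resource_free[s] = start + duration
            ready = resource_free[s]
    return resource_free[2]

# Four chunks with hypothetical durations (copy in, kernel, copy out).
chunks = [(2.0, 1.0, 1.0)] * 4

print(serial_time(chunks))     # all commands serialized
print(pipelined_time(chunks))  # transfers overlapped with computation
```

With these made-up durations the host-to-device copy is the bottleneck, so the pipelined total approaches the sum of the copy times rather than the sum of all commands.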

20 Kernel Fission: Stream Pool

Stream Pool is a library that abstracts away the details of CUDA streams.

API: Comment
getAvailableStream(): Get an available stream
setStreamCommand(): Assign a command to a specific stream
startStreams(): Start the execution
selectWait(): Assign point-to-point synchronization between two specific streams
terminate(): End the execution immediately

21 Kernel Fission: Different Ways to Use CUDA Streams

Running two kernels concurrently is not always beneficial.

[Figure annotation: "small" uses half the resources of "big"]

22 Example of Kernel Fission

[Figure annotation: 1.37x speedup]

23 Kernel Fusion + Kernel Fission

[Figure annotations: 1.41x serial; 1.31x fusion only; 1.10x fission only]

24 Real Queries: Q1

[Figure: query plan]

Overall speedup: 1.26x

25 Real Queries: Q21

[Figure: query plan]

Overall speedup: 1.13x

26 Conclusions

- Two data movement optimizations (Kernel Fusion and Kernel Fission) save memory transfer time and speed up computation for data warehousing applications.
- Kernel Fusion
  - Does not need to dump intermediate temporary data
  - Enlarges the optimization scope
- Kernel Fission works like double buffering: it overlaps data transfer with GPU computation

27 Thank You

Questions?

