Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.

1 A Translation System for Enabling Data Mining Applications on GPUs
Wenjing Ma, Gagan Agrawal
The Ohio State University
ICS 2009

2 Motivation and Overview
- Two popular trends
  - Data-intensive computing
  - GPU programming
- Seems like a good match
- Can we ease the use of GPGPUs?
  - Domain-specific programming tool
  - Can exploit common programming structure
  - Enable good speedups

3 Context
- Many years of work on compiler and runtime support for data-intensive applications
  - Clusters, SMPs, clusters of SMPs
  - FREERIDE and language front-ends
- Similar to map-reduce, but …
  - Predates it and performs better!
  - Recent work on (clusters of) multi-cores, incorporating RSTM
- GPUs
  - C and Matlab front-ends
  - Clusters of GPUs; multi-core and GPUs

4 Outline
- Background
- GPU computing
- Parallel data mining
- Challenges of data mining on GPUs
- Architecture of the system
  - Sequential code analysis
  - Generation of CUDA programs
  - Optimization techniques
- Experimental results
  - k-means, EM, PCA
- Related and future work

5 Background - GPU Computing
- Many-core architectures and accelerators are becoming more popular
- GPUs are inexpensive and fast
- CUDA is a high-level language for GPU programming

6 CUDA Programming
- Significant improvement over the use of graphics libraries
- But the programmer still needs to:
  - Have detailed knowledge of the GPU architecture and learn a new language
  - Specify the grid configuration
  - Deal with memory allocation and movement
  - Explicitly manage the memory hierarchy

7 Parallel Data Mining
Common structure of data mining applications (FREERIDE):

    /* outer sequential loop */
    while () {
        /* reduction loop */
        foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
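
The reduction loop above can be sketched in plain C. This is only an illustration of the pattern, not FREERIDE's API: the `Update` type, the binning rule inside `process`, and the choice of `+` as `op` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of process(e): an index into the reduction
 * object and the value to fold into that slot. */
typedef struct {
    size_t i;
    double val;
} Update;

/* Hypothetical process(): bins an element by its integer part,
 * clamped to [0, num_bins - 1], and contributes the element itself. */
static Update process(double e, size_t num_bins) {
    Update u;
    assert(num_bins > 0);
    if (e < 0.0) {
        u.i = 0;
    } else if (e >= (double)num_bins) {
        u.i = num_bins - 1;
    } else {
        u.i = (size_t)e;
    }
    u.val = e;
    return u;
}

/* One pass of the reduction loop: Reduc(i) = Reduc(i) op val, with op = +.
 * The outer sequential loop would call this once per iteration. */
void reduction_pass(const double *data, size_t n,
                    double *reduc, size_t num_bins) {
    for (size_t k = 0; k < n; ++k) {
        Update u = process(data[k], num_bins);
        reduc[u.i] += u.val;
    }
}
```

Because every update goes through an associative and commutative `op`, the elements can be processed in any order, which is what makes the loop a candidate for parallel execution.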

8 Porting to GPUs
- High-level parallelization is straightforward
- Details of data movement
- Impact of thread count on reduction time
- Use of shared memory

9 Architecture of the System
[Diagram: components listed in pipeline order]
- User input: variable information, reduction functions, optional functions
- Code analyzer (in LLVM) and variable analyzer: produce the variable access pattern and combination operations
- Code generator: produces the host program (grid configuration and kernel invocation) and the kernel functions
- Executable

10 User Input
- A sequential reduction function
- Optional functions (initialization function, combination function, …)
- Value of each variable or size of each array
- Variables to be used in the reduction function

11 Analysis of Sequential Code
- Get the access features of each variable
- Determine the data to be replicated
- Get the operator for global combination
- Determine the variables for shared memory

12 Memory Allocation and Copy
- Need a copy for each thread
- Copy the updates back to host memory after the kernel reduction function returns
[Figure: per-thread layout across threads T0 to T63, with labels A, B, and C]

13 Extracting Variable Access Information
[Diagram: variable analyzer]
- Inputs: IR from LLVM, argument list, user input
- Extract variables to be written
- Extract read-only variables
- Extract temporary variables

14 Generating CUDA Code and C++/C Code Invoking the Kernel Function
- Memory allocation and copy
- Thread grid configuration (block number and thread number)
- Global function
- Kernel reduction function
- Global combination

15 Global Combination
- Assume the updates from all threads are combined by summation or multiplication
- An automatically generated global combination function is invoked by one thread
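
A minimal C sketch of such a combination step, assuming each thread's private copy of the reduction object is stored contiguously, one copy after another. The function name, the `CombineOp` enum, and the memory layout are assumptions for illustration, not the code the system generates.

```c
#include <assert.h>
#include <stddef.h>

/* The two combination operators named on the slide. */
typedef enum { COMBINE_SUM, COMBINE_PRODUCT } CombineOp;

/* Folds num_copies private copies, each of length len, into result.
 * Copy t is assumed to occupy copies[t * len .. t * len + len - 1].
 * The loop is serial, matching a combination run by a single thread. */
void global_combine(const float *copies, size_t num_copies, size_t len,
                    CombineOp op, float *result) {
    assert(num_copies > 0);
    for (size_t j = 0; j < len; ++j) {
        float acc = copies[j]; /* start from copy 0 */
        for (size_t t = 1; t < num_copies; ++t) {
            float v = copies[t * len + j];
            acc = (op == COMBINE_SUM) ? acc + v : acc * v;
        }
        result[j] = acc;
    }
}
```

For three copies `{1, 2}`, `{3, 4}`, `{5, 6}`, summation gives `{9, 12}` and multiplication gives `{15, 48}`.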

16 Kernel Reduction Function
- Generated from the original sequential code
- Divide the main loop by block_number and thread_number
- Replace the access offsets with appropriate indices
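
One way to picture the loop division is the plain C emulation below. It assumes a contiguous, equal-sized chunk per thread; the slides do not say whether the generated code uses contiguous or strided chunks, and `thread_range` and `emulated_kernel_sum` are hypothetical names. In an actual CUDA kernel, `block` and `thread` would come from `blockIdx.x` and `threadIdx.x`, and all threads would run concurrently rather than in nested loops.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of main-loop iterations for one thread. */
typedef struct {
    size_t begin;
    size_t end;
} Range;

/* Splits n iterations evenly over block_number * thread_number threads. */
Range thread_range(size_t n, size_t block_number, size_t thread_number,
                   size_t block, size_t thread) {
    Range r;
    size_t total, chunk, id;
    assert(block < block_number && thread < thread_number);
    total = block_number * thread_number;
    chunk = (n + total - 1) / total;      /* ceiling division */
    id = block * thread_number + thread;  /* global thread index */
    r.begin = id * chunk;
    if (r.begin > n) r.begin = n;
    r.end = r.begin + chunk;
    if (r.end > n) r.end = n;
    return r;
}

/* Sequential emulation of the kernel: each thread sums its own chunk
 * and writes to its own slot, so the original access offset is
 * replaced by the global thread index. */
void emulated_kernel_sum(const float *data, size_t n,
                         size_t block_number, size_t thread_number,
                         float *per_thread_sum) {
    for (size_t b = 0; b < block_number; ++b) {
        for (size_t t = 0; t < thread_number; ++t) {
            Range r = thread_range(n, block_number, thread_number, b, t);
            float acc = 0.0f;
            for (size_t k = r.begin; k < r.end; ++k) {
                acc += data[k];
            }
            per_thread_sum[b * thread_number + t] = acc;
        }
    }
}
```

The per-thread results written here are the private copies that a global combination step would later fold into one value.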

17 Optimizations
- Using shared memory
- Providing user-specified initialization and combination functions
- Specifying variables that are allocated once

18 Dealing with Shared Memory
- Size = length * sizeof(type) * thread_info
  - length: size of the array
  - type: char, int, or float
  - thread_info: whether it's copied to each thread
- Mark each array as shared until the size exceeds the limit of shared memory
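
The size rule can be sketched in C as follows. Two points are assumptions rather than statements from the slides: `thread_info` is taken to be the number of threads per block when the array is replicated per thread and 1 otherwise, and marking stops at the first array that no longer fits. `ArrayInfo`, `shared_size`, and `mark_shared` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t length;    /* number of elements in the array */
    size_t elem_size; /* sizeof(type), e.g. sizeof(float) */
    int per_thread;   /* nonzero if each thread gets its own copy */
    int in_shared;    /* output: set to 1 if placed in shared memory */
} ArrayInfo;

/* Size = length * sizeof(type) * thread_info (thread_info assumed as above). */
size_t shared_size(const ArrayInfo *a, size_t threads_per_block) {
    size_t thread_info = a->per_thread ? threads_per_block : 1;
    return a->length * a->elem_size * thread_info;
}

/* Marks arrays as shared, in the given order, until the next one would
 * exceed the limit. Returns the number of bytes of shared memory used. */
size_t mark_shared(ArrayInfo *arrays, size_t count,
                   size_t threads_per_block, size_t limit) {
    size_t used = 0;
    size_t i = 0;
    for (; i < count; ++i) {
        size_t need = shared_size(&arrays[i], threads_per_block);
        if (need > limit - used) {
            break; /* this array and all later ones stay out */
        }
        arrays[i].in_shared = 1;
        used += need;
    }
    for (; i < count; ++i) {
        arrays[i].in_shared = 0;
    }
    return used;
}
```

With 64 threads per block and an example limit of 16384 bytes, a per-thread array of 20 four-byte elements needs 5120 bytes, a shared array of 1000 four-byte elements needs 4000 bytes, and a per-thread array of 40 four-byte elements would need 10240 bytes, so only the first two fit.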

19 Shared Memory Layout Strategies
- No sorting
- Greedy sorting
- Write-first sorting

20 No Sorting
[Figure: arrays A, B, C, and D placed in shared memory without sorting]

21 Greedy Sorting
[Figure: arrays A, B, C, and D placed in shared memory with greedy sorting]

22 Other Optimizations
- Reducing memory allocation and copy overhead
  - Arrays shared by multiple iterations can be allocated and copied only once
- User-defined combination function

23 Applications
- K-means clustering
- EM clustering
- PCA

24 Experimental Results: Speedup of k-means

25 Speedup of k-means on GeForce 9800X2

26 Speedup of EM

27 Speedup of PCA

28 Related Work
- OpenMP to CUDA (Purdue)
- Domain-specific operators to CUDA (NEC)
- CUDA-lite etc. (Illinois)
- Various application studies

29 Conclusions
- Automatic CUDA code generation and optimization is feasible
- Restricting to a domain / communication style helps
- Interesting new compiler optimizations

