Presentation transcript:

Slide 1: Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Chi-Keung (CK) Luk, Technology Pathfinding and Innovation, Software Solutions and Services Group, Intel
Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology
Hyesoon Kim, School of Computer Science, College of Computing, Georgia Institute of Technology

Slide 2 (MICRO '09): Heterogeneous Architectures

Heterogeneous architectures are increasingly popular:
- Intel Core2 + Nvidia GPU (the platform used in this work)
- IBM's Cell processor
- NHM + Larrabee

Slide 3: Software Challenge

[Figure: a CPU + GPU system; the CPU has four cores (Core-0 to Core-3), the GPU is a SIMD device.]

The Mapping Problem: map computations to processing elements (PEs) so as to optimize an objective function, which could be:
- Performance
- Energy
- Performance / Energy

Slide 4: Existing Solutions to the Mapping Problem

The programmer performs the mapping manually and statically. Examples:
- IBM XL compiler extension that supports OpenMP on the Cell
- Intel CTG's ExoCHI/Merge framework for programming the CPU and GPU

Disadvantages:
- Labor intensive
- Not adaptable to changes in the runtime environment

Slide 5: Outline

- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions

Slide 6: Case Study: Matrix Multiplication

Heterogeneous machine used:
- CPU: dual-socket quad-core (max = 8 cores)
- GPU: Nvidia GTX-8800

Three configurations tested:
1. Small problem size, max CPU cores used
2. Big problem size, max CPU cores used
3. Big problem size, fewer CPU cores used

In each configuration: perform cooperative matrix multiplication, varying the distribution of work over the CPU and GPU.

Slide 7: Cooperative Matrix Multiplication

[Figure: C = A x B. A is split row-wise into A1 and A2; C1 = A1 x B is computed on the CPU and C2 = A2 x B on the GPU.]
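The row-wise split in the figure can be sketched in plain Python (a toy illustration only, not the Qilin implementation; `matmul`, `cooperative_matmul`, and the 50/50 split below are hypothetical names and choices):

```python
def matmul(a, b):
    """Plain row-by-column matrix multiply over lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def cooperative_matmul(a, b, beta):
    """Give the first beta fraction of A's rows to one worker (the 'CPU')
    and the rest to another (the 'GPU'); stacking C1 and C2 rebuilds C."""
    cut = round(beta * len(a))
    c1 = matmul(a[:cut], b)   # CPU's share: C1 = A1 x B
    c2 = matmul(a[cut:], b)   # GPU's share: C2 = A2 x B
    return c1 + c2
```

However the rows are split, concatenating the two partial results reproduces the full product, which is what makes the varying CPU/GPU work distribution in the case study possible.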

Slide 8: Cooperative Matrix Multiplication Results

- Configuration 1: matrix dimension = 1000, #CPU cores = 8
- Configuration 2: matrix dimension = 6000, #CPU cores = 8
- Configuration 3: matrix dimension = 6000, #CPU cores = 2

Lessons learned: the optimal PE mapping depends on the application, the input size, and the hardware resources available. We therefore need an automatic and dynamic technique that takes all these factors into account.

Our contribution: ADAPTIVE MAPPING

Slide 9: Adaptive Mapping

A technique to automatically find a near-optimal mapping for the given program, problem size, and hardware. Each configuration involves one training run and many reference runs:
- Training run: find the execution-time projections of the CPU and the GPU for the given configuration
- Reference run: compute the near-optimal distribution of work for the current problem size

Slide 10: Training Run

[Figure: the training input of size N_t is divided into parts N_1,1 ... N_1,m, run with kernel K on the CPU (times T_C(N_1,1) ... T_C(N_1,m)), and parts N_2,1 ... N_2,m, run with K on the GPU (times T_G(N_2,1) ... T_G(N_2,m)). Curve fitting over the (input size, runtime) points yields T'_C(N) and T'_G(N), which are stored in a database.]

T'_C(N) = projected time to execute the kernel on a problem of size N on the CPU = a_c + b_c * N
T'_G(N) = projected time to execute the kernel on a problem of size N on the GPU = a_g + b_g * N
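The curve-fitting step can be sketched as an ordinary least-squares fit of the linear model t = a + b*N (a minimal sketch; the slide only states the linear form, so the function name and the choice of plain least squares are assumptions):

```python
def fit_linear(samples):
    """Least-squares fit of runtime ~ a + b*N from (size, time) training pairs."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope b
    a = (sy - b * sx) / n                          # intercept a
    return a, b
```

Fitting the CPU timings would give (a_c, b_c) and the GPU timings (a_g, b_g); these coefficients are what the training run stores in the database for later reference runs.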

Slide 11: Reference Run

β = fraction of work mapped to the CPU
p = number of CPU cores
N = problem size
T'_β(N) = projected time to execute βN work on the CPU and (1-β)N work on the GPU
        = max( (p/(p-1)) * T'_C(βN), T'_G((1-β)N) )

Once N is fixed to the actual problem size N_r, we find the β that minimizes T'_β(N_r) by considering where the two curves (p/(p-1)) * T'_C(βN_r) and T'_G((1-β)N_r) intersect. There are three possible cases (see next slide).

Slide 12: Three Possible Cases of β

- Case i: the CPU and GPU curves intersect at β <= 0. T'_β is minimized by mapping all work to the GPU.
- Case ii: the two curves intersect at β >= 1. T'_β is minimized by mapping all work to the CPU.
- Case iii: the two curves intersect at some 0 < β < 1. T'_β is minimized by mapping that fraction β_min of the work to the CPU.
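With the linear models from the training run, solving for the intersection of the two curves and clamping to [0, 1] covers all three cases (a sketch under the slides' formulas; `best_split` is a hypothetical helper, not the Qilin API):

```python
def best_split(ac, bc, ag, bg, p, n):
    """Return the CPU fraction beta minimizing
    max((p/(p-1)) * T'_C(beta*n), T'_G((1-beta)*n)),
    where T'_C(N) = ac + bc*N and T'_G(N) = ag + bg*N."""
    k = p / (p - 1)  # the slide's p/(p-1) factor scaling the CPU curve
    # The CPU curve rises with beta and the GPU curve falls with beta,
    # so the max of the two is minimized where they intersect:
    beta = (ag + bg * n - k * ac) / (n * (k * bc + bg))
    # Cases i and ii: an intersection outside [0, 1] clamps to all-GPU (0)
    # or all-CPU (1); case iii keeps the interior value beta_min.
    return min(max(beta, 0.0), 1.0)
```

Because both projections are straight lines in β, the minimization costs a constant-time formula at each reference run rather than a search.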

Slide 13: Outline

- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions

Slide 14: Prototype Implementation

Adaptive mapping could be implemented as:
- Off-line optimization for static compilation
- On-line optimization for dynamic compilation

Our prototype is a dynamic compilation system called Qilin:
- Qilin API: both stream-based and thread-based
- Dynamic code generation:
  - Generates TBB source code for the CPU
  - Generates CUDA source code for the GPU
  - Generates glue code to copy data back and forth between the CPU and GPU, to stage computations onto the GPU to satisfy its memory limitation, and to divide work according to Adaptive Mapping

[Figure: a C++ application calls the Qilin API; the Qilin system targets the CPU and GPU.]
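The staging part of the glue code can be illustrated by splitting the GPU's share of the work into chunks small enough to fit in device memory (a hypothetical sketch; `stage_chunks` and its simple byte-based accounting are assumptions, not Qilin's actual interface):

```python
def stage_chunks(total_items, item_bytes, gpu_mem_bytes):
    """Split work items into (start, end) stages whose footprint fits in
    the given GPU memory budget; each stage is processed in turn."""
    per_stage = max(1, gpu_mem_bytes // item_bytes)  # items per stage (at least 1)
    return [(i, min(i + per_stage, total_items))
            for i in range(0, total_items, per_stage)]
```

Each (start, end) stage would then be copied to the GPU, computed, and copied back before the next stage begins, so problem sizes larger than the 768 MB of device memory can still run.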

Slide 15: Heterogeneous PC Used

                     CPU                         GPU
Architecture         Intel Core2 Quad            Nvidia 8800 GTX
Core clock           2.4 GHz                     1.35 GHz
Number of cores      8 (on 2 sockets)            128 stream processors
Memory size          4 GB                        768 MB
Memory bandwidth     8 GB/s                      86.4 GB/s
Threading API        Intel TBB                   Nvidia CUDA
Compiler             ICC 10.1                    NVCC 1.1
OS                   32-bit Linux Fedora Core 6

Slide 16: Benchmarks (financial, image processing, scientific)

Name            Description                                                          Source
Binomial        American option pricing                                              CUDA SDK
BlackScholes    European option pricing                                              CUDA SDK
Convolve        2D separable image convolution                                       CUDA SDK
MatrixMultiply  Dense matrix multiplication                                          CUDA SDK
Linear          Linear image filter: output pixel is the average of a 9-pixel square Intel's Merge
Sepia           Modify RGB values to artificially age images                         Merge
Smithwat        Compute the scoring matrix for a pair of DNA sequences               Merge
Svm             Kernel from an SVM-based face classifier                             Merge

Slide 17: Performance of Adaptive Mapping

Adaptive mapping achieves 94% of the speedup of manual mapping. (Note: the chart's y-axis is on a logarithmic scale.)

Slide 18: Energy Consumption

Adaptive mapping is nearly as good as manual mapping in energy consumption. (Total system power measured by an Extech power analyzer.)

Slide 19: Distribution of Computations

                Manual mapping        Adaptive mapping
                CPU       GPU         CPU       GPU
Binomial        10%       90%         10.5%     89.5%
BlackScholes    40%       60%         46.5%     53.5%
Convolve        40%       60%         36.3%     63.7%
MatrixMultiply  40%       60%         45.5%     54.5%
Linear          60%       40%         50.8%     49.2%
Sepia           80%       20%         76.2%     23.8%
Smithwat        60%       40%         59.3%     40.7%
Svm             10%       90%         14.3%     85.7%

Adaptive mapping and manual mapping produce similar distributions.

Slide 20: Related Work

Hardware:
- Kumar et al. demonstrate advantages of heterogeneous over homogeneous CMPs in terms of power and throughput
- Similar observations from Hill and Marty
=> Both studies point out the importance of the mapping problem

Software:
- GPGPU: Brook, Accelerator, Peakstream, RapidMind, Brook+, CUDA (all GPU-only)
- Intel's TBB and Ct (currently CPU-only)
- IBM's OpenMP extension for Cell and Intel's ExoCHI/Merge use both the CPU and GPU, but are based on static manual mapping
- OpenCL: the initial specification does not appear to include any automatic mapping technique

Autotuning:
- Generates many variants of a computation kernel and benchmarks each variant on the target platform
- Adaptive mapping can be regarded as an autotuning technique that tunes the distribution of work on heterogeneous platforms

Slide 21: Conclusions

- Adaptive mapping automates the mapping of computations to heterogeneous multicores
- Encouraging results:
  - Performance and energy consumption close to manual mapping
  - Adapts to changes in input size and in hardware and software configurations (see our paper)
- Applicable to other heterogeneous systems, e.g. OpenCL or Ct on NHM + Larrabee
- Future work: extend it to handle irregular computations
- Adaptive mapping could be an important technique in the multicore software stack

Slide 22: Acknowledgments

- Michael Linderman, Jamison Collins, and Hong Wang, for sharing their Merge benchmarks
- Geoff Lowney and Mark Abel, for supporting this work
- Geoff Lowney and Robert Cohn, for suggestions and feedback

Slide 23: (no content; backup slides follow)

Slide 24: Impact of Training Input Size

[Chart: performance versus training input size as a percentage of the reference input size; the y-axis is on a logarithmic scale.]

Most of the performance benefit of Adaptive Mapping is preserved when the training input size is at least 30% of the reference input size.

Slide 25: Adapting to Hardware Changes (1)

Using a less powerful GPU (GTX8800 with 128 cores => GTS8800 with 96 cores): adaptive mapping automatically recovers part of the GPU's performance loss from the CPU. (Chart compares against the original result.)

Slide 26: Adapting to Hardware Changes (2)

Using a less powerful CPU (8 cores => 2 cores): adaptive mapping shifts most of the work to the GPU. (Chart compares against the original result.)

Slide 27: Adapting to Software Changes

Using a different compiler on the CPU (ICC => GCC, for both the serial and parallel cases): GCC does not use SSE-x as well as ICC does, so adaptive mapping biases the distribution toward the GPU. (Chart compares against the original result.)

