
1 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Bogazici University, Istanbul, Turkey. Presented by: Dr. Abu Asaduzzaman, Assistant Professor in Computer Architecture and Director of CAPPLab, Department of Electrical Engineering and Computer Science (EECS), Wichita State University (WSU), USA. June 2, 2014

2 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Outline► ■Introduction  Single-Core to Multicore Architectures ■Performance Improvement  Simultaneous Multithreading (SMT)  (SMT enabled) Multicore CPU with GPUs ■Energy-Efficient Computing  Dynamic GPU Selection ■CAPPLab  “People First”  Resources  Research Grants/Activities ■Discussion QUESTIONS? Any time, please!

3 Thank you! ■Prof. Dr. Can Ozturan  Chair, ComE Department  Bogazici University, Istanbul, Turkey ■Prof. Dr. Bayram Yildirim  Alumnus, Bogazici University  IME Department  Wichita State University ■Many more…

4 Introduction Some Important “Laws” ■Moore’s law ■Amdahl’s law vs. Gustafson’s law ■Law of diminishing returns ■Koomey's law ■(Juggling)  http://www.youtube.com/watch?v=PqBlA9kU8ZE  http://www.youtube.com/watch?v=S0d3fK9ZHUI

5 Introduction Moore’s Law ■The number of transistors on integrated circuits doubles approximately every 18 months.

6 Introduction Amdahl’s law vs. Gustafson’s law ■The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. ■Computations involving arbitrarily large data sets can be parallelized.

7 Introduction Law of diminishing returns ■In all productive processes, adding more of one factor of production, while holding all others constant, will at some point yield lower per-unit returns.

8 Introduction Koomey's law ■The number of computations per joule of energy dissipated has been doubling approximately every 1.57 years. This trend has been remarkably stable since the 1950s.

9 Introduction Single-Core to Multicore Architecture ■History of Computing  The word “computer” first recorded in 1613 (this is not the beginning)  Von Neumann architecture (1945) – shared data/instruction memory  Harvard architecture (1944) – separate data memory and instruction memory ■Single-Core Processors  In most modern processors: split CL1 (I1, D1), unified CL2, …  Intel Pentium 4, AMD Athlon Classic, … ■Popular Programming Languages  C, …

10 Introduction (Single-Core to) Multicore Architecture (Courtesy: Jernej Barbič, Carnegie Mellon University)  Input  Process/Store  Output  Multi-tasking  Time sharing (Juggling!)  Cache not shown

11 Introduction Single-Core → “Core” (a single core; Courtesy: Jernej Barbič, Carnegie Mellon University) A thread is a running “process”

12 Introduction Major Steps to Execute an Instruction (68000 CPU and memory block diagram: data registers D7–D0, address registers A7–A0, PC, IR, SR, ALU, decoder/control unit) 1: Instruction Fetch (I.F.) → 2: Instruction Decode (I.D.) → 3: Operand Fetch (O.F.) → 4: Instruction Execution (I.E.) → 5: Write Back (W.B.)

13 Introduction Thread 1: Integer (INT) Operation (Pipelining Technique) (pipeline diagram: 1: Instruction Fetch → 2: Instruction Decode → 3: Operand(s) Fetch → 4: Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → 5: Result Write Back) Thread 1: Integer Operation

14 Introduction Thread 2: Floating Point (FP) Operation (Pipelining Technique) (pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 2: Floating Point Operation

15 Introduction Threads 1 and 2: INT and FP Operations (Pipelining Technique) (pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 2: Floating Point Operation. POSSIBLE?

16 Performance Threads 1 and 2: INT and FP Operations (Pipelining Technique) (pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 2: Floating Point Operation. POSSIBLE?

17 Performance Improvement Threads 1 and 3: Integer Operations (pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 3: Integer Operation. POSSIBLE?

18 Performance Improvement Threads 1 and 3: Integer Operations (Multicore) (two pipelines, Core 1 and Core 2, each: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 3: Integer Operation. POSSIBLE?

19 Performance Improvement Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) (two pipelines, Core 1 and Core 2, each: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 2: Floating Point Operation; Thread 3: Integer Operation; Thread 4: Floating Point Operation. POSSIBLE?

20 More Performance? Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) (two pipelines, Core 1 and Core 2, each: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → Result Write Back) Thread 1: Integer Operation; Thread 2: Floating Point Operation; Thread 3: Integer Operation; Thread 4: Floating Point Operation. POSSIBLE?

21 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Outline► ■Introduction  Single-Core to Multicore Architectures ■Performance Improvement  Simultaneous Multithreading (SMT)  (SMT enabled) Multicore CPU with GPUs ■Energy-Efficient Computing  Dynamic GPU Selection ■CAPPLab  “People First”  Resources  Research Grants/Activities ■Discussion

22 SMT enabled Multicore CPU with Manycore GPU for Ultimate Performance! Parallel/Concurrent Computing Parallel Processing – It is not fun!  Let’s play a game: paying the lunch bill together  Started with $30; spent $29 ($27 + $2)  Where did $1 go? (Friends A, B, and C pay $10 each before eating; the total bill is $25, so $5 is returned; $2 goes to the tip and $1 back to each friend; each friend has spent $9, for a total spent of $27.)

23 Performance Improvement Simultaneous Multithreading (SMT) ■Thread  A running program (or code segment) is a process  Process → processes / threads ■Simultaneous Multithreading (SMT)  Multiple threads running on a single processor at the same time  Multiple threads running on multiple processors at the same time ■Multicore programming language support  OpenMP, Open MPI, CUDA, … (C)

24 Performance Improvement Simultaneous Multithreading (SMT) ■Example: Generating/Managing Multiple Threads  OpenMP, Open MPI, … in C (a sketch follows below)
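Slide 24's example code is not reproduced in this transcript. As a stand-in, a minimal OpenMP-in-C sketch of generating and managing multiple threads (the thread count and output text are illustrative, not taken from the slides) could look like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork a team of threads; every thread runs the block below. */
    #pragma omp parallel num_threads(4)
    {
        int tid  = omp_get_thread_num();   /* this thread's ID        */
        int nthr = omp_get_num_threads();  /* total number of threads */
        printf("Hello from thread %d of %d\n", tid, nthr);
    }   /* implicit barrier: all threads join here */
    return 0;
}

It would be built with OpenMP enabled, e.g. gcc -fopenmp; an Open MPI program follows a similar pattern but uses MPI_Init/MPI_Comm_rank across processes instead of compiler pragmas.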

25 Performance Improvement Identify Challenges ■Sequential data-independent problems  C[] ← A[] + B[] ♦C[5] ← A[5] + B[5]  A’[] ← A[] ♦A’[5] ← A[5]  → SMT capable multicore processor; CUDA/GPU Technology (Core 1, Core 2) (a CUDA sketch follows below)
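For the data-independent case above (C[] ← A[] + B[]), every element can be computed by its own thread with no communication. A minimal CUDA/C kernel sketch (the name vecAdd and the float type are illustrative assumptions):

/* One GPU thread computes one element: C[i] = A[i] + B[i].
 * No thread needs another thread's result, so no communication
 * or synchronization is required (data-independent problem).   */
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n)              /* guard: the last block may be partial */
        C[i] = A[i] + B[i];
}

The same pattern covers A'[] ← A[]: each thread copies or transforms only its own element.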

26 Performance Improvement ■CUDA/GPU Programming ■GP-GPU Card  A GPU card with 16 streaming multiprocessors (SMs)  Inside each SM: 32 cores, 64KB shared memory, 32K 32-bit registers, 2 schedulers, 4 special function units ■CUDA – GPGPU Programming Platform

27 Performance Improvement CPU-GPU Technology ■Tasks/data exchange mechanism  Serial computations – CPU  Parallel computations – GPU

28 Performance Improvement GPGPU/CUDA Technology ■The host (CPU) executes a kernel on the GPU in 4 steps (Step 1) CPU allocates GPU memory and copies data to the GPU CUDA API: cudaMalloc(), cudaMemcpy()

29 Performance Improvement GPGPU/CUDA Technology ■The host (CPU) executes a kernel on the GPU in 4 steps (Step 2) CPU sends function parameters and instructions to the GPU CUDA API: myFunc<<<grid, block>>>(parameters)

30 Performance Improvement GPGPU/CUDA Technology ■The host (CPU) executes a kernel on the GPU in 4 steps (Step 3) GPU executes the instructions as scheduled in warps (Step 4) Results are copied back to host memory (RAM) using cudaMemcpy() (a host-side sketch of the four steps follows below)
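Putting slides 28-30 together, a hedged host-side sketch of the four steps (allocate/copy, launch, execute, copy back) might look like the following; it assumes the hypothetical vecAdd kernel sketched earlier is in the same .cu file and omits error checking for brevity:

#include <cuda_runtime.h>

void run_vec_add_on_gpu(const float *hA, const float *hB, float *hC, int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *dA, *dB, *dC;

    /* Step 1: allocate GPU (device) memory and copy input data to it. */
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* Step 2: send parameters and instructions, i.e. launch the kernel. */
    int block = 256;
    int grid  = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(dA, dB, dC, n);

    /* Step 3: the GPU executes the kernel, scheduled in warps.        */
    /* Step 4: copy the results back to host memory (RAM); this call   */
    /*         also waits for the kernel to finish.                    */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}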

31 Performance Improvement Case Study 1 (data-independent computation without GPU/CUDA) ■Matrix Multiplication (figure panels: Matrices, Systems)

32 Performance Improvement Case Study 1 (data-independent computation without GPU/CUDA) ■Matrix Multiplication (figure panels: Execution Time, Power Consumption)

33 Performance Improvement Case Study 2 (data-dependent computation without GPU/CUDA) ■Heat Transfer on a 2D Surface (figure panels: Execution Time, Power Consumption)

34 Performance Improvement Case Study 3 (data-dependent computation with GPU/CUDA) ■Fast Effective Lightning Strike Simulation  The lack of lightning strike protection for composite materials limits their use in many applications.

35 Performance Improvement Case Study 3 (data-dependent computation with GPU/CUDA) ■Fast Effective Lightning Strike Simulation ■Laplace’s Equation ■Simulation  CPU Only  CPU/GPU w/o shared memory  CPU/GPU with shared memory (a kernel sketch follows below)
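The slides do not include the simulation kernel itself. A minimal sketch of one Jacobi-style finite-difference update for Laplace's equation (the method named in Case Study 4), in the CPU/GPU-without-shared-memory style, could look as follows; the grid size N, the row-major layout, and the kernel name are assumptions:

/* One Jacobi iteration for Laplace's equation on an N x N grid:
 * each interior point becomes the average of its four neighbors.
 * The host launches this repeatedly, swapping u and u_new.       */
__global__ void laplaceStep(const float *u, float *u_new, int N)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */

    if (i > 0 && i < N - 1 && j > 0 && j < N - 1)
        u_new[i * N + j] = 0.25f * (u[(i - 1) * N + j] + u[(i + 1) * N + j]
                                  + u[i * N + (j - 1)] + u[i * N + (j + 1)]);
}

The CPU-only variant would run the same update in nested loops; the shared-memory variant additionally stages each tile of u into on-chip shared memory (see the sketch near the CPU-to-GPU memory mapping slide below).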

36 Performance Improvement Case Study 4 (MATLAB vs. GPU/CUDA) ■Different simulation models  Traditional sequential program  CUDA program (no shared memory)  CUDA program (with shared memory)  Traditional sequential MATLAB  Parallel MATLAB  CUDA/C parallel programming of the finite-difference-based Laplace’s equation demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4K x 4K problem with reasonable accuracy.

37 Performance Improvement Identify More Challenges ■Sequential data-independent problems  C[] ← A[] + B[] ♦C[5] ← A[5] + B[5]  A’[] ← A[] ♦A’[5] ← A[5]  → SMT capable multicore processor; CUDA/GPU Technology ■Sequential data-dependent problems  B’[] ← B[] ♦B’[5] ← {B[4], B[5], B[6]}  Communication needed ♦between Core 1 and Core 2 (a stencil sketch follows below)
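To make the data-dependent case concrete: B'[5] needs B[4], B[5], and B[6], so neighboring elements owned by different cores must be visible before the update. One common way to sketch this (an illustration, not necessarily the regrouping scheme the deck develops; a 3-point average stands in for the actual operation) is to write into a separate output array and rely on the barrier at the end of the parallel loop:

#include <omp.h>

/* 3-point stencil: Bp[i] depends on B[i-1], B[i], B[i+1].
 * Writing to the separate array Bp keeps every read of B valid even
 * though neighboring elements are handled by different threads; the
 * implicit barrier at the end of the parallel for is the point where
 * the threads/cores "communicate" before Bp is used as the next B.  */
void stencil3(const float *B, float *Bp, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        Bp[i] = (B[i - 1] + B[i] + B[i + 1]) / 3.0f;
}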

38 Performance Improvement Develop Solutions ■Task Regrouping  Create threads ■Data Regrouping  Regroup data → data for each thread  Threads with G2s first  Then, threads with G1s (Step 2 of 5: CPU copies data to GPU; CUDA API: cudaMemcpy())

39 Performance Improvement Assess the Solutions ■What is the Key? ■Synchronization  With synchronization  Without synchronization ♦Speed vs. accuracy  Threads with G2s first  Then, threads with G1s (Step 2 of 5: CPU copies data to GPU; CUDA API: cudaMemcpy())

40 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Outline► ■Introduction  Single-Core to Multicore Architectures ■Performance Improvement  Simultaneous Multithreading (SMT)  (SMT enabled) Multicore CPU with GP-GPU ■Energy-Efficient Computing  Dynamic GPU Selection ■CAPPLab  “People First”  Resources  Research Grants/Activities ■Discussion

41 Energy-Efficient Computing Kansas’ Unique Challenge ■Climate and Energy  Protect the environment from harm due to climate change  Save natural energy

42 Energy-Efficient Computing “Power” Analysis ■CPU with multiple GPUs  GPU usages vary ■Power Requirements  NVIDIA GTX 460 (336-core) – 160W [1]  Tesla C2075 (448-core) – 235W [2]  Intel Core i7 860 (4-core, 8-thread) – 150-245W [3, 4] ■Dynamic GPU Selection  Depending on ♦the “tasks”/threads ♦GPU usages

43 Energy-Efficient Computing CPU-to-GPU Memory Mapping ■GPU Shared Memory  Improves performance  CPU to GPU global memory  GPU global to shared memory ■Data Regrouping  CPU to GPU global memory (a shared-memory sketch follows below)
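A minimal sketch of the global-to-shared staging described here (the tile size, names, and the simple smoothing operation are illustrative assumptions): each thread block copies the slice of global memory it will reuse into on-chip shared memory, synchronizes, and then computes from the fast copy.

#define TILE 256   /* threads per block; launch the kernel with 256 threads */

/* Stage one tile of global memory into on-chip shared memory, then let
 * each thread read its neighbors from the fast shared copy instead of
 * from global memory.  __syncthreads() ensures the whole tile is loaded
 * before any thread reads it; block-edge elements are passed through
 * unchanged to keep the sketch short.                                   */
__global__ void smoothShared(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index      */
    int t = threadIdx.x;                            /* index within tile */

    if (i < n)
        tile[t] = in[i];        /* global memory -> shared memory        */
    __syncthreads();            /* barrier for all threads in the block  */

    if (i < n) {
        if (t > 0 && t < blockDim.x - 1 && i < n - 1)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
        else
            out[i] = tile[t];   /* no halo handling at block edges */
    }
}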

44 Teaching Low-Power HPC Systems Integrate Research into Education ■CS 794 – Multicore Architectures Programming  Multicore Architecture  Simultaneous Multithreading  Parallel Programming  Moore’s law  Amdahl’s law  Gustafson’s law  Law of diminishing returns  Koomey's law

45 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Outline► ■Introduction  Single-Core to Multicore Architectures ■Performance Improvement  Simultaneous Multithreading (SMT)  (SMT enabled) Multicore CPU with GP-GPU ■Energy-Efficient Computing  Dynamic GPU Selection ■CAPPLab  “People First”  Resources  Research Grants/Activities ■Discussion

46 WSU CAPPLab CAPPLab ■Computer Architecture & Parallel Programming Laboratory (CAPPLab)  Physical location: 245 Jabara Hall, Wichita State University  URL: http://www.cs.wichita.edu/~capplab/  E-mail: capplab@cs.wichita.edu; Abu.Asaduzzaman@wichita.edu  Tel: +1-316-WSU-3927 ■Key Objectives  Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields.  Teach advanced-level computer systems & architecture, parallel programming, and related courses.

47 WSU CAPPLab “People First” ■Students  Kishore Konda Chidella, PhD Student  Mark P Allen, MS Student  Chok M. Yip, MS Student  Deepthi Gummadi, MS Student ■Collaborators  Mr. John Metrow, Director of WSU HiPeCC  Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL)  Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT)  Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech)  Dr. Henry Neeman, University of Oklahoma (OU)

48 WSU CAPPLab Resources ■Hardware  3 CUDA Servers – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory  2 CUDA PCs – CPU: Xeon E5506, …  Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3, Kepler card) via remote access to WSU (HiPeCC)  2 CUDA enabled laptops  More … ■Software  CUDA, OpenMP, and Open MPI (C/C++ support)  MATLAB, VisualSim, CodeWarrior, more (as needed)

49 WSU CAPPLab Scholarly Activities ■WSU became a “CUDA Teaching Center” for 2012-13  Grants from NSF, NVIDIA, M2SYS, Wiktronics  Teaching Computer Architecture and Parallel Programming ■Publications  Journal: 21 published; 3 under preparation  Conference: 57 published; 2 under review; 6 under preparation  Book Chapter: 1 published; 1 under preparation ■Outreach  USD 259 Wichita Public Schools  Wichita Area Technical and Community Colleges  Open to collaborate

50 WSU CAPPLab Research Grants/Activities ■Grants  WSU: ORCA  NSF – KS NSF EPSCoR First Award  M2SYS-WSU Biometric Cloud Computing Research Grant  Teaching (Hardware/Financial) Award from NVIDIA  Teaching (Hardware/Financial) Award from Xilinx ■Proposals  NSF: CAREER (working/pending)  NASA: EPSCoR (working/pending)  U.S.: Army, Air Force, DoD, DoE  Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology

51 Bogazici University; Istanbul, Turkey; 2014 “SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!” Thank You! QUESTIONS? Contact: Abu Asaduzzaman E-mail: abuasaduzzaman@ieee.org Phone: +1-316-978-5261 http://www.cs.wichita.edu/~capplab/

