MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014.



Gummadi 2 About Me  Deepthi Gummadi  MS in Computer Networking with Thesis  LaTeX programmer at CAPPLab since Fall 2013  Publications:  “New CPU-to-GPU Memory Mapping Technique,” in IEEE SouthEast Conference.  “The Impact of Thread Synchronization and Data Parallelism on Multicore Game Programming,” accepted at IEEE ICIEV.  “Feasibility Study of Spider-Web Multicore/Manycore Network Architectures,” in preparation.  “Investigating Impact of Data Parallelism on Computer Game Engine,” under review at the IJCVSP Journal, 2014.

Gummadi 3 Committee Members  Dr. Abu Asaduzzaman, EECS Dept.  Dr. Ramazan Asmatulu, ME Dept.  Dr. Zheng Chen, EECS Dept.

Gummadi 4 “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Outline ►  Introduction  Motivation  Problem Statement  Proposal  Evaluation  Experimental Results  Conclusions  Future Work Q U E S T I O N S ? Any time, please.

Gummadi 5 Introduction Central Processing Unit (CPU) Technology  Interprets and executes program instructions. What is new about CPUs?  Initially, processors executed instructions in a purely sequential structure.  Around the millennium, clock-speed scaling stalled and designs turned to parallelism.  Currently, we have multicore on-chip CPUs. CPU Speed Chart

Gummadi 6 Cache Memory Organization  Why do we use cache memory?  Several memory layers:  Lower-level caches – faster, close to the cores performing computations.  Higher-level caches – slower, used for larger storage. Intel 4-core processor

Gummadi 7 NVIDIA Graphic Processing Unit  Parallel Processing Architecture  Components  Streaming Multiprocessors  Warp Schedulers  Execution pipelines  Registers  Memory Organization  Shared memory  Global memory GPU Memory Organization

Gummadi 8 CPU and GPU  CPU: low latency; cache memory; optimized for MIMD.  GPU: high throughput, moderate latency; shared memory; optimized for SIMD. CPU and GPU work together to be more efficient.

Gummadi 9 CPU-GPU Computing Workflow Step 1: CPU allocates GPU memory and copies the data to it. cudaMalloc() cudaMemcpy()

Gummadi 10 CPU-GPU Computing Workflow Step 2: CPU sends function parameters and instructions to GPU.

Gummadi 11 CPU-GPU Computing Workflow Step 3: GPU executes the instructions based on received commands.

Gummadi 12 CPU-GPU Computing Workflow Step 4: After execution, the results are copied back from GPU DRAM to CPU memory.
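The four workflow steps above can be sketched with standard CUDA runtime calls. This is an illustrative, uncompiled sketch only: the kernel name `laplaceKernel`, the problem size, and the launch configuration are assumptions, not the thesis code.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Hypothetical per-element kernel; the real computation is the Laplace solver. */
__global__ void laplaceKernel(float *d_grid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * n) {
        /* per-element work would go here */
    }
}

int main(void) {
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *h_grid = (float *)malloc(bytes);
    float *d_grid;

    cudaMalloc(&d_grid, bytes);                                /* Step 1: allocate GPU memory   */
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice); /* Step 1: copy data to GPU      */

    laplaceKernel<<<(n * n + 255) / 256, 256>>>(d_grid, n);    /* Steps 2-3: send parameters,
                                                                  GPU executes the instructions */

    cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost); /* Step 4: retrieve results      */
    cudaFree(d_grid);
    free(h_grid);
    return 0;
}
```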

Gummadi 13 Motivation ■ Data-level parallelism  Spatial data partitioning  Temporal data partitioning  Spatial instruction partitioning  Temporal instruction partitioning Two Parallelization Strategies

Gummadi 14 Motivation ■ Parallelism and optimization techniques simplify CUDA programming. ■ From the developer’s view, the memory is unified.

Gummadi 15 Problem Statement The traditional CPU-to-GPU global memory mapping technique is not well suited to GPU shared memory.

Gummadi 16 Outline ►  Introduction  Motivation  Problem Statement  Proposal  Evaluation  Experimental Results  Conclusions  Future Work Q U E S T I O N S ? Any time, please. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”

Gummadi 17 Proposal Proposed CPU-to-GPU memory mapping to improve GPU shared memory performance

Gummadi 18 Proposed Technique Major Steps: Step 1: Start Step 2: Analyze problems; determine input parameters. Step 3: Analyze GPU card parameters/characteristics. Step 4: Analyze CPU and GPU memory organizations. Step 5: Determine the number of computations and the number of threads. Step 6: Identify/Partition the data-blocks for each thread. Step 7: Copy/Regroup CPU data-blocks to GPU global memory. Step 8: Stop

Gummadi 19 Proposed Technique Traditional Mapping ■ Data is copied directly from CPU to GPU global memory. ■ Each thread retrieves its data from scattered global memory blocks. ■ It is difficult to stage the data into GPU shared memory. Proposed Mapping ■ Data is regrouped and then copied from CPU to GPU global memory. ■ Each thread retrieves its data from consecutive global memory blocks. ■ It is easy to stage the data into GPU shared memory.

Gummadi 20 Evaluation System Parameters:  CPU: dual processors at 2.13 GHz.  Fermi card: 14 SMs, 32 CUDA cores per SM.  Kepler card: 13 SMs, 192 CUDA cores per SM.

Gummadi 21 Evaluation  Memory sizes of the CPU and GPU cards.  Input parameters are the numbers of rows and columns; the output parameter is execution time.

Gummadi 22 Evaluation Electric charge distribution modeled by Laplace’s equation for a 2D problem (finite-difference approximation): ϵ x(i,j) ( Φ i+1,j − Φ i,j ) /dx + ϵ y(i,j) ( Φ i,j+1 − Φ i,j ) /dy + ϵ x(i-1,j) ( Φ i-1,j − Φ i,j ) /dx + ϵ y(i,j-1) ( Φ i,j-1 − Φ i,j ) /dy = 0, where Φ = electric potential; ϵ = medium permittivity; dx, dy = spatial grid sizes; Φ i,j = electric potential defined at lattice point (i, j); ϵ x(i,j), ϵ y(i,j) = effective x- and y-direction permittivities defined at the edges of the element cell (i, j).

Gummadi 23 Evaluation For a uniform material the permittivity is the same everywhere, so the equation reduces to ( Φ i+1,j − Φ i,j ) /dx + ( Φ i,j+1 − Φ i,j ) /dy + ( Φ i-1,j − Φ i,j ) /dx + ( Φ i,j-1 − Φ i,j ) /dy = 0

Gummadi 24 Outline ►  Introduction  Motivation  Problem Statement  Proposal  Evaluation  Experimental Results  Conclusions  Future Work Q U E S T I O N S ? Any time, please. “IMPROVING GPU PERFORMANCE BY REGROUPING CPU- MEMORY DATA”

Gummadi 25 Experimental Results ■ Conducted a study of electric charge distribution using Laplace’s equation. ■ Implemented three versions:  CPU only.  GPU with shared memory.  GPU without shared memory. ■ Inputs/Outputs  Problem size (n for an N×N matrix)  Execution time

Gummadi 26 Experimental Results N n,m = 1/5 (N n,m-1 + N n,m+1 + N n,m + N n-1,m + N n+1,m ), where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8. Validation of our CUDA/C code: both the CPU/C and CUDA/C programs produce the same values.


Gummadi 28 Experimental Results Impact of GPU shared memory  As the number of threads increases, the processing time decreases (up to 8×8 threads).  Beyond 8×8 threads, the GPU with shared memory shows better performance.

Gummadi 29 Experimental Results Impact of the Number of Threads  For a fixed shared memory size, GPU processing time decreases as the number of threads increases (up to 16×16).  Beyond 16×16 threads, the Kepler card shows better performance.

Gummadi 30 Experimental Results Impact of amount of shared memory  As the size of GPU shared memory increases, the processing time decreases.

Gummadi 31 Experimental Results Impact of the proposed data regrouping technique  With data regrouping and shared memory, the processing time decreases as the number of threads increases.  Between the GPU versions with and without shared memory, the shared-memory version performs better at higher thread counts.

Gummadi 32 Conclusions  Fast, effective analysis of complex systems requires high-performance computation.  NVIDIA CUDA CPU/GPU computing proves its potential for such computations.  Traditional memory mapping follows the CPU’s locality principle, so the data each thread needs does not fit contiguously in GPU shared memory.  It is beneficial to keep data in GPU shared memory rather than in GPU global memory.

Gummadi 33 Conclusions  To overcome this problem, we proposed a new CPU-to-GPU memory mapping to improve performance.  Implemented three different versions.  Results indicate that the proposed CPU-to-GPU memory mapping technique decreases overall execution time by more than 75%.

Gummadi 34 Future Extensions ■ Modeling and simulation of nanocomposites: nanocomposites require a large number of computations at high speed. ■ Aircraft applications: high-performance computation is required to study mixtures of composite materials.

Gummadi 35 “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Questions?

Gummadi 36 “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” Thank you Contact: Deepthi Gummadi