Weekly Report - Reduction
Ph.D. Student: Leo Lee
Date: Oct. 30, 2009

Outline
– Introduction
– 7 implementations
– Work plan

Parallel Reduction
A common and important data-parallel primitive, and a good example for learning optimization:
– Easy to implement, but hard to make highly efficient;
– NVIDIA provides 7 versions of the kernel for computing the sum of an array;
– I study them one by one.

Parallel Reduction
To handle large arrays, the algorithm needs to use multiple thread blocks:
– Each block reduces a portion of the array.
How do we communicate partial results between thread blocks?
– There is no global synchronization across blocks: it would be expensive to build and could cause deadlock;
– Instead, decompose the computation into multiple kernel launches, as sketched below.
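A minimal host-side sketch of this decomposition (the kernel name reduce1, the block size 256, and the assumption that n stays a multiple of the block size are mine, not from the report):

// Host-side sketch of the multi-kernel decomposition: each pass reduces
// the array to one partial sum per block; the implicit synchronization
// between kernel launches replaces the missing global synchronization.
int *reduceLarge(int *d_in, int *d_out, unsigned int n)
{
    const unsigned int threads = 256;  // assumed block size
    while (n > 1) {
        unsigned int blocks = (n + threads - 1) / threads;
        reduce1<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
        // The per-block partial sums become the input of the next pass.
        int *tmp = d_in; d_in = d_out; d_out = tmp;
        n = blocks;
    }
    return d_in;  // element 0 holds the final sum
}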

Optimization Goal
Reach GPU peak performance:
– GFLOP/s for compute-bound kernels;
– Bandwidth for memory-bound kernels.
Reductions have low arithmetic intensity:
– 1 flop per element loaded;
– Try to achieve peak bandwidth!
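For reference, the bandwidth numbers in the tables below can be derived from the element count and the kernel time; a small helper along these lines, assuming 4-byte elements:

// Effective bandwidth of a reduction pass: bytes read / elapsed time.
// With one 4-byte load per element, a 4M-element pass reads about 16.8 MB.
double effectiveBandwidthGBs(unsigned int numElements, double timeMs)
{
    double bytes = (double)numElements * 4.0;   // 4-byte elements assumed
    return bytes / (timeMs * 1.0e6);            // bytes / (ms * 1e6) = GB/s
}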

Reduction 1: Interleaved Addressing
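A minimal sketch of the interleaved-addressing kernel, in the spirit of NVIDIA's first version (the name reduce1 and int elements are assumptions):

// Reduction 1 sketch: interleaved addressing with a modulo test.
// The (tid % (2*s)) == 0 branch is highly divergent, and % is slow.
__global__ void reduce1(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];          // each thread loads one element
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // one partial sum per block
}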

Hardware
My computer: NVIDIA GeForce 8500GT.

Performance for 4M-element reduction
NVIDIA's results, and the same code on my computer.
(Table columns: Kernel, Time (ms), Bandwidth (GB/s).)

Reduction 2: Interleaved Addressing
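Following NVIDIA's sequence, the second version keeps interleaved addressing but replaces the divergent modulo branch with a strided index, at the cost of shared-memory bank conflicts; a sketch (names are assumptions):

// Reduction 2 sketch: interleaved addressing with a strided index.
// Active threads are now contiguous (no divergence from the modulo test),
// but the strided shared-memory access pattern causes bank conflicts.
__global__ void reduce2(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}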

Performance for 4M-element reduction
NVIDIA's results, and results on my computer.
(Table columns: Kernel, Time (ms), Bandwidth (GB/s), Step speedup, Cumulative speedup.)

Reduction 3: Sequential Addressing
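A sketch of the sequential-addressing variant (reversed loop with thread-ID-based indexing; names are assumptions):

// Reduction 3 sketch: sequential addressing.
// Threads with tid < s read sdata[tid + s]; the accesses are contiguous,
// so there are no bank conflicts. Note that half of the threads do
// nothing on the very first loop iteration.
__global__ void reduce3(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}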

Performance for 4M-element reduction
NVIDIA's results, and results on my computer.
(Table columns: Kernel, Time (ms), Bandwidth (GB/s), Step speedup, Cumulative speedup.)

Reduction 4: First Add During Load
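A sketch of the first-add-during-load variant; each block now covers 2 * blockDim.x elements, so the grid is halved (names are assumptions):

// Reduction 4 sketch: perform the first addition while loading from
// global memory, so each block processes 2 * blockDim.x elements.
__global__ void reduce4(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];  // first add during load
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}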

Performance for 4M-element reduction
NVIDIA's results, and results on my computer.
(Table columns: Kernel, Time (ms), Bandwidth (GB/s), Step speedup, Cumulative speedup.)

Instruction Bottleneck
Address arithmetic and loop overhead:
– At 17 GB/s we are far from bandwidth bound;
– The remaining cost is ancillary instructions that are not loads, stores, or arithmetic for the core computation.
Strategy: unroll loops.
– When s <= 32, only one warp is left;
– Instructions are SIMD-synchronous within a warp, so the if (tid < s) test saves no work;
– Unroll the last 6 iterations.

Reduction 5: Unroll the last Warp
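A sketch of the last-warp unrolling; the volatile qualifier keeps the compiler from caching shared-memory operands in registers (names are assumptions, and blockDim.x >= 64 is assumed):

// Reduction 5 sketch: once only one warp (32 threads) is left, the last
// 6 iterations run without __syncthreads() or the (tid < s) test, relying
// on SIMD-synchronous execution within a warp.
__device__ void warpReduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void reduce5(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) warpReduce(sdata, tid);           // unroll the last 6 iterations
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}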

Performance for 4M-element reduction
NVIDIA's results, and results on my computer.
(Table columns: Kernel, Time (ms), Bandwidth (GB/s), Step speedup, Cumulative speedup.)

Further Optimization
– Complete unrolling;
– Multiple adds per thread.
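For reference, a sketch combining both ideas in the style of NVIDIA's final kernel: the block size becomes a template parameter, so the whole reduction tree unrolls at compile time, and each thread sums many elements while loading (names such as reduce7 are assumptions; n is assumed to be a multiple of 2 * blockSize):

// Completely unrolled warp reduction: every blockSize test is resolved
// at compile time, so no runtime branches remain.
template <unsigned int blockSize>
__device__ void warpReduceT(volatile int *sdata, unsigned int tid)
{
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >=  8) sdata[tid] += sdata[tid + 4];
    if (blockSize >=  4) sdata[tid] += sdata[tid + 2];
    if (blockSize >=  2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    sdata[tid] = 0;
    while (i < n) {                       // multiple adds per thread
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    if (tid < 32) warpReduceT<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Example launch (block size chosen at compile time):
//   reduce7<256><<<blocks, 256, 256 * sizeof(int)>>>(d_in, d_out, n);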

Other Work
– Read two papers about matrix multiplication;
– Began reading books on parallel computing.

Work Plan
– Learn the last two reduction algorithms;
– Re-read the CUDA programming guide.

Thanks