Optimizing and Auto-Tuning Belief Propagation on the GPU Scott Grauer-Gray and Dr. John Cavazos Computer and Information Sciences, University of Delaware.

Presentation transcript:

Optimizing and Auto-Tuning Belief Propagation on the GPU Scott Grauer-Gray and Dr. John Cavazos Computer and Information Sciences, University of Delaware

GPUs/CUDA The GPU has evolved from specialized graphics hardware into a massively parallel processor. GPUs offer more FLOPS and greater memory bandwidth than CPUs. NVIDIA encourages the use of GPUs for general-purpose computing.

CUDA Programming Model Each thread has a unique ID within a grid of thread blocks. In theory, each thread processes the instructions of the kernel in parallel. [Figure: CUDA programming model]
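As a minimal sketch of this model (the kernel name, array, and launch configuration below are illustrative, not taken from this work), each thread derives its global ID from its block and thread indices and operates on one element:

    // Minimal CUDA sketch: each thread computes a unique global ID and handles one element.
    __global__ void scaleArray(float *data, int n, float factor)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID within the grid of thread blocks
        if (idx < n)
            data[idx] *= factor;  // in principle, all threads execute this in parallel
    }

    // Example launch: enough 256-thread blocks to cover n elements
    // scaleArray<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);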

Belief Propagation General-purpose iterative algorithm for inference, with applications in various disciplines. Used for the stereo vision problem in this work. Graphical representations are the input to belief propagation implementations.

Stereo Vision Problem Input: two images of a scene and a disparity space. Goal: retrieve the disparity between corresponding objects in the two images. Output: a disparity at each pixel of the input image. [Figure, left to right: Tsukuba stereo set and resultant disparity map]

Belief Propagation for Stereo Vision Stereo algorithms that incorporate belief propagation generate accurate results, but the algorithm has shortcomings: longer run-time than other stereo algorithms and high storage requirements.

Belief Propagation for Stereo Vision Previously implemented on the GPU using the graphics API and using the CUDA architecture. These implementations show a decrease in running time versus CPU implementations. In this work, we optimize belief propagation on the GPU.

Belief Propagation for Stereo Vision Consists of the following steps: 1. Compute the data cost for each pixel at each disparity. 2. Iteratively compute the data costs at each succeeding level of the hierarchy.

Belief Propagation for Stereo Vision 3. For each level in the hierarchy: a. Compute the message values to send to neighboring pixels; repeat for i iterations, alternating between the two pixel sets. b. If not at the bottom level, copy the message values at each pixel to the corresponding block of pixels in the next level. [Figure: message values m_u, m_d, m_l, m_r passed between neighboring pixels]
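A hedged host-side sketch of this schedule (the kernel names, the messages/dataCosts arrays, and the checkerboard parameter are assumptions for illustration, not the authors' code):

    // Step 3 schedule: iterate from the coarsest level down to the bottom level.
    for (int level = numLevels - 1; level >= 0; level--)
    {
        // 3a: compute messages for i iterations, alternating between the two pixel sets
        for (int iter = 0; iter < numIterations; iter++)
        {
            int pixelSet = iter % 2;  // checkerboard set updated this iteration
            computeMessagesKernel<<<grid[level], block>>>(messages[level], dataCosts[level], pixelSet);
        }
        // 3b: if not at the bottom level, copy each pixel's messages to its block of pixels in the next level
        if (level > 0)
            copyMessagesToNextLevelKernel<<<grid[level - 1], block>>>(messages[level], messages[level - 1]);
    }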

Belief Propagation for Stereo Vision 4. Retrieve disparity estimates at each pixel.

Optimizing Belief Propagation for Stereo Vision First step: generate an initial implementation on the GPU. It consists of the five steps/sub-steps above, each parallelized on the GPU using CUDA. 70% of the running time is spent in the kernel of step 3a, so optimizations are performed on this kernel. [Figure: message values m_u, m_d, m_l, m_r passed between neighboring pixels]

Primary Aspects of Dominant Kernel Many calculations are performed on a structure known as the message array. The array length equals the number of values in the disparity space. The array is stored in local memory in the initial implementation. Accesses to local memory on the GPU are slow, so... might converting local memory accesses to register and/or shared memory accesses improve the running time? [Figure: message array]

Memory Hierarchy on the GPU Register memory and local memory are per thread. Shared memory is per thread block. There is a limited supply of low-latency register and shared memory per multiprocessor. We utilize each form of memory in the following implementations.
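A brief, hedged illustration of how these storage classes appear in CUDA code (the kernel, names, and sizes are made up for this example):

    #define N 16                 // example array length
    #define THREAD_BLK_SIZE 128  // example thread-block size

    __global__ void storageClassesExample(float *out)
    {
        float rVal = 0.0f;                              // scalar: normally kept in a per-thread register
        float perThreadArr[N];                          // per-thread array: may be placed in slow local memory
        __shared__ float blockArr[N * THREAD_BLK_SIZE]; // shared memory: visible to the whole thread block

        int tid = threadIdx.x;
        for (int i = 0; i < N; i++)
        {
            perThreadArr[i] = (float)i;
            blockArr[i * THREAD_BLK_SIZE + tid] = perThreadArr[i];
            rVal += blockArr[i * THREAD_BLK_SIZE + tid];
        }
        out[blockIdx.x * blockDim.x + tid] = rVal;
    }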

Code from initial implementation:

    float m[N];
    float currMin = INFINITY;
    for (int i = 0; i < N; i++)
    {
        m[i] = dataC[INDX_D_i] + neigh1[INDX_N1_i] + neigh2[INDX_N2_i] + neigh3[INDX_N3_i];
        if (m[i] < currMin)
            currMin = m[i];
    }
    ...

Code from shared memory implementation:

    __shared__ float m_shared[N * THREAD_BLK_SIZE];
    float currMin = INFINITY;
    for (int i = 0; i < N; i++)
    {
        m_shared[i_currThread] = dataC[INDX_D_i] + neigh1[INDX_N1_i] + neigh2[INDX_N2_i] + neigh3[INDX_N3_i];
        if (m_shared[i_currThread] < currMin)
            currMin = m_shared[i_currThread];
    }
    ...

Code from register memory implementation (the loop is fully unrolled with compile-time-constant indices so the array can be kept in registers):

    float m[N];
    float currMin = INFINITY;

    m[0] = dataC[INDX_D_0] + neigh1[INDX_N1_0] + neigh2[INDX_N2_0] + neigh3[INDX_N3_0];
    if (m[0] < currMin)
        currMin = m[0];
    ...
    m[N-1] = dataC[INDX_D_(N-1)] + neigh1[INDX_N1_(N-1)] + neigh2[INDX_N2_(N-1)] + neigh3[INDX_N3_(N-1)];
    if (m[N-1] < currMin)
        currMin = m[N-1];
    ...

Results: Initial and Optimized Implementations
Input: Tsukuba stereo set, 15 values in disparity space. Results on GTX 285:

Storage / loop unroll setting | Num local loads | Num local stores | Num regs | Occup. | Total running time | Speedup over init. CUDA imp.
Initial CUDA imp.             | ...             | ...              | ...      | ...    | ... ms             | -
Shared memory imp.            | ...             | ...              | ...      | ...    | ... ms             | 1.85
Register memory imp.          | ...             | ...              | ...      | ...    | ... ms             | 1.93

Initial vs. Optimized Results Placing the message array in registers or shared memory improves the results. But... the input Tsukuba stereo set has a relatively small disparity space. Research questions: Might the programmer want to split data between register and shared memory? Might the programmer want the option to place data in local memory? Our affirmative response to these questions is the motivation for the following implementation...

Hybrid Implementation Combines the register/shared/local memory options for data storage. The programmer can choose the amount of message array data placed in register, shared, and local memory. This greatly increases the flexibility of the optimization options... [Figure: message array split across registers, shared memory, and local memory]
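A hedged kernel-side sketch of such a split (the split points NUM_REG_VALS and NUM_SHARED_VALS, and the simplified indexing, are assumptions; this is not the authors' code):

    // Hybrid storage sketch: first NUM_REG_VALS message values in registers (via unrolled,
    // compile-time-constant indices), the next NUM_SHARED_VALS in shared memory, the rest in local memory.
    float m_reg[NUM_REG_VALS];
    __shared__ float m_shared[NUM_SHARED_VALS * THREAD_BLK_SIZE];
    float m_local[N - NUM_REG_VALS - NUM_SHARED_VALS];
    float currMin = INFINITY;

    // register portion: written with unrolled, constant indices (as in the register implementation)
    // shared-memory portion:
    for (int i = NUM_REG_VALS; i < NUM_REG_VALS + NUM_SHARED_VALS; i++)
    {
        int s = (i - NUM_REG_VALS) * THREAD_BLK_SIZE + threadIdx.x;
        m_shared[s] = dataC[i] + neigh1[i] + neigh2[i] + neigh3[i];
        if (m_shared[s] < currMin) currMin = m_shared[s];
    }
    // local-memory portion:
    for (int i = NUM_REG_VALS + NUM_SHARED_VALS; i < N; i++)
    {
        int l = i - NUM_REG_VALS - NUM_SHARED_VALS;
        m_local[l] = dataC[i] + neigh1[i] + neigh2[i] + neigh3[i];
        if (m_local[l] < currMin) currMin = m_local[l];
    }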

Hybrid Implementation Input: `Cones' stereo set, disparity space of 65 values. Experiments use: 1. Shared/local memory w/ 32 x 4 thread block 2. Shared/local memory w/ 32 x 2 thread block 3. Shared/register memory w/ 32 x 4 thread block 4. Shared/register memory w/ 32 x 2 thread block [Figure: message array splits among shared memory, registers, and local memory for each configuration]

Hybrid Implementation Results Results on GTX 285:

Auto-Tuning for Optimization Goal: develop an auto-tuning framework to optimize the algorithm across architectures and inputs. Hybrid configuration results: the running time decreases when moving to a configuration that allows greater occupancy, then increases until the parameters allow a further increase in occupancy... We believe it is possible to retrieve a near-optimal configuration using a single carefully-generated configuration at each selected occupancy.

Auto-Tuning Framework 3. Determine and set the number of values to be stored in shared memory. 4. Place remaining values in register memory, up to MAX_REG_VALUES total values in registers. 5. Place any remaining values in local memory. [Figure: message array split across registers, shared memory, and local memory at each step]
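A hedged host-side sketch of how steps 3-5 might be expressed; apart from MAX_REG_VALUES, the names and partitioning logic are assumptions for illustration:

    #define MAX_REG_VALUES 50

    struct MessageArrayPartition { int numInShared; int numInRegisters; int numInLocal; };

    // Partition the message array for one candidate thread-block configuration.
    MessageArrayPartition partitionMessageArray(int numDispVals, size_t sharedMemPerBlock, int threadsPerBlock)
    {
        MessageArrayPartition p;

        // Step 3: how many values fit in shared memory for this block size
        int sharedCapacity = (int)(sharedMemPerBlock / (threadsPerBlock * sizeof(float)));
        p.numInShared = (numDispVals < sharedCapacity) ? numDispVals : sharedCapacity;

        // Step 4: remaining values go to registers, up to MAX_REG_VALUES
        int remaining = numDispVals - p.numInShared;
        p.numInRegisters = (remaining < MAX_REG_VALUES) ? remaining : MAX_REG_VALUES;

        // Step 5: anything left over spills to local memory
        p.numInLocal = remaining - p.numInRegisters;
        return p;
    }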

Auto-Tuning Experiments on GTX 285 Parameter values: NUM_REG_VALUES_INITIAL = 12, MAX_REG_VALUES = 50. Results are relative to the initial implementation.

Experiments across GPUs GTX 285: 240 processors Tesla C870: 128 processors GTX 470 (Fermi): 448 processors

Results Across GPUs
Hybrid implementation experiment results:

GPU (config.)              | Thread block dims | Num vals in reg / shared / local mem in message array | Total running time | Speedup over init. CUDA imp.
GTX 285                    | 32 x ...          | ...                                                    | ... ms             | 1.76
Tesla C870                 | 32 x ...          | ...                                                    | ... ms             | 1.43
GTX 470 (favor L1 cache)   | 32 x ...          | ...                                                    | ... ms             | 1.00
GTX 470 (favor shared mem) | 32 x ...          | ...                                                    | ... ms             | 1.39

Auto-Tuning Result on Each GPU
Optimal configuration on each GPU as obtained via auto-tuning. Auto-tuning generates results similar to the hybrid implementation experiments.

GPU (config.)              | Max occupancy | Running time | Speedup over init. CUDA imp. | % difference in speedup from opt. config. in hybrid imp. experiments
GTX 285                    | ...           | ... ms       | ...                          | ... %
Tesla C870                 | ...           | ... ms       | ...                          | ... %
GTX 470 (favor L1 cache)   | ...           | ... ms       | ...                          | ... %
GTX 470 (favor shared mem) | ...           | ... ms       | ...                          | ... %

Conclusions Improved the running time of the CUDA implementation by taking advantage of the storage options in the memory hierarchy. Different GPU architectures require different optimizations, which calls for a `general' framework in which it is possible to optimize depending on the architecture.