Review student: Fan Bai | Instructor: Dr. Sushil Prasad | 2012.03.21
Andrew Nere, Atif Hashmi, and Mikko Lipasti, University of Wisconsin–Madison. IPDPS 2011.

Presentation transcript:

Review student: Fan Bai. Instructor: Dr. Sushil Prasad.
Andrew Nere, Atif Hashmi, and Mikko Lipasti, University of Wisconsin–Madison. IPDPS 2011.

• Purpose
• Background
• Why it can be parallelized
• Mapping to CUDA
• Optimization methods
• Experimental results

Utilize Nvidia GPUs to accelerate a neocortex-inspired learning algorithm.

• The neocortex is the part of the brain that is unique to mammals and is largely responsible for executive processing skills such as mathematics, music, language, vision, and perception.
• The neocortex comprises around 77% of the entire human brain.
• For a typical adult, the neocortex is estimated to contain around 11.5 billion neurons and 360 trillion synapses (connections between neurons).

• Hierarchical and regular structure
• Composed of cortical columns
  – Neuroscientist Vernon Mountcastle was the first to observe the structural uniformity of the neocortex. He proposed that the neocortex is composed of millions of nearly identical functional units, which he termed cortical columns because of their seemingly column-shaped organization.

• Neuroscientists Hubel and Mountcastle further classified cortical columns into hypercolumns and minicolumns.

• Minicolumns – groups of neurons
  – Represent unique features
  – Share a common receptive field
• Hypercolumns – groups of minicolumns
  – Functional unit of the neocortex
• Connectivity
  – Lateral
  – Feedforward (bottom-up)
  – Feedback (top-down)

• "Compute Unified Device Architecture"
• Hardware
  – Streaming Multiprocessors (SMs)
  – Shared memory (16-48 KB)
  – DRAM (1-6 GB)
• Programming framework
  – Threads: 1000s
  – CTAs: groups of threads (Cooperative Thread Arrays)
  – Kernel: a group of CTAs
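To make this thread/CTA/kernel hierarchy concrete, here is a minimal, hedged CUDA sketch (illustrative only, not from the paper; scaleKernel and its parameters are assumed names): a kernel launch creates a grid of CTAs, and the threads within each CTA cooperate through on-chip shared memory and a barrier.

```cuda
// Minimal sketch of the CUDA execution hierarchy (illustrative, not the
// paper's code): a kernel launch creates a grid of CTAs (thread blocks);
// each CTA runs on one SM and its threads share fast on-chip memory.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n) {
    __shared__ float tile[256];                       // per-CTA shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    tile[threadIdx.x] = (idx < n) ? data[idx] : 0.0f; // global -> shared
    __syncthreads();                                  // barrier within the CTA
    if (idx < n)
        data[idx] = tile[threadIdx.x] * factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    int threadsPerCTA = 256;                          // threads per CTA
    int numCTAs = (n + threadsPerCTA - 1) / threadsPerCTA;
    scaleKernel<<<numCTAs, threadsPerCTA>>>(d_data, 2.0f, n); // grid of CTAs
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```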

• Problem: multiple kernel launch overhead
  – 1-2.5% of execution time
  – No CTA-to-CTA communication
• Problem: GPGPU resources underutilized
  – Convergence is a key part of the model/algorithm
  – Performance benefits diminish: 50x speedup for large layers, but >10x SLOWDOWN for small layers
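For concreteness, below is a hedged sketch of the baseline strategy that incurs this overhead (levelKernel and runHierarchy are hypothetical stand-ins, not the paper's code): one kernel launch per hierarchy level, with a host-side synchronization between launches because CTAs from different launches cannot communicate.

```cuda
// Hedged sketch of the baseline: one kernel launch per hierarchy level.
// 'levelKernel' is a hypothetical stand-in for the per-hypercolumn work.
#include <cuda_runtime.h>

__global__ void levelKernel(const float *in, float *out) {
    // one CTA per hypercolumn; computation elided in this sketch
}

void runHierarchy(float **levelIn, float **levelOut, int numLevels,
                  int ctasPerLevel, int threadsPerCTA) {
    for (int level = 0; level < numLevels; ++level) {
        levelKernel<<<ctasPerLevel, threadsPerCTA>>>(levelIn[level],
                                                     levelOut[level]);
        // The host must wait before launching the next level, since CTAs
        // in different launches cannot communicate; every iteration pays
        // the kernel-launch overhead measured above (1-2.5%).
        cudaDeviceSynchronize();
    }
}
```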

 We can see that 1-2.5% of the total execution time for a hierarchy is spent on the additional kernel launch overhead, with smaller cortical networks suffering from larger overhead.

(1) 50x speedup for large layers
(2) >10x SLOWDOWN for small layers

Solution 1: Pipeline cortical network execution
(1) Single kernel with 1 hypercolumn per CTA
(2) Double buffer maintains dependencies: a double buffer between hierarchy levels guarantees that producer-consumer relationships are enforced
(3) Trade-offs:
  – Improves resource utilization
  – Multiple kernel launches are still needed to fully propagate activations through the hierarchy
  – Increases storage overhead
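A hedged sketch of the double-buffering idea (pipelineStep and runPipelined are illustrative assumptions, not the paper's code): each step reads activations from one buffer and writes to the other, so a producer never overwrites data its consumer has not yet read; the buffers swap roles between launches.

```cuda
// Hedged sketch of double-buffered pipelining (illustrative names).
#include <cuda_runtime.h>

__global__ void pipelineStep(const float *bufRead, float *bufWrite) {
    // one CTA per hypercolumn: read inputs from bufRead, compute,
    // write output activations into bufWrite (computation elided)
}

void runPipelined(float *bufA, float *bufB, int numSteps,
                  int numHypercolumns, int threadsPerCTA) {
    float *readBuf = bufA, *writeBuf = bufB;
    for (int step = 0; step < numSteps; ++step) {
        pipelineStep<<<numHypercolumns, threadsPerCTA>>>(readBuf, writeBuf);
        cudaDeviceSynchronize();
        // Swap buffer roles: this step's outputs feed the next step.
        float *tmp = readBuf; readBuf = writeBuf; writeBuf = tmp;
    }
}
```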

Ideally we would like to execute the entire cortical architecture on the GPU concurrently, reducing the overhead to a single kernel launch. However, a limitation of the CUDA architecture is that there is no guarantee as to the order in which CTAs are scheduled. We instead create a software work-queue to explicitly orchestrate the order in which hypercolumns are executed. The work-queue is managed directly in the GPU's global memory space, as in Figure 9.

(1) Single kernel launch: a single CUDA kernel is launched with only as many CTAs as can concurrently fit across all of the SMs in the GPGPU, as determined by the occupancy calculator (Figure 9 shows 2 concurrent CTAs per Streaming Multiprocessor (SM)).
(2) Each CTA uses an atomic primitive to gain a unique index into the work-queue (solid blue arrows 'A' and 'C'). The work-queue contains each hypercolumn's ID in the cortical network and is organized to execute hypercolumns in order from the bottom of the hierarchy to the top. If all input activations are available, the hypercolumn can calculate its output activations (in Figure 9, HC0's inputs are ready, while HC9 must wait for its inputs to be produced by HC0).

Once a hypercolumn has calculated its output activations, they are written back to global memory. The dashed red arrow (B) in the figure depicts how HC0 indicates to HC9 that all input activations are available via an atomic increment of the flag. Finally, the CTA atomically indexes into the work-queue again to execute another hypercolumn, until the work-queue is empty.

(3) Concurrent CTAs execute the entire cortical network
  – Doesn't rely on the CTA scheduler
  – CUDA disclaimer: relies on CTA-to-CTA communication, which CUDA does not officially guarantee
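Putting the pieces together, here is a hedged sketch of the work-queue scheme (all names are illustrative assumptions; the slides do not show the real implementation). The kernel is launched with only as many CTAs as fit concurrently on the GPU; each CTA atomically grabs a hypercolumn ID, spins on a ready flag in global memory until its inputs are available, computes, then atomically increments its consumer's flag. For simplicity, the sketch assumes each hypercolumn feeds a single consumer.

```cuda
// Hedged sketch of a persistent-CTA work-queue (illustrative, not the
// paper's code). Launch with only as many CTAs as fit concurrently, e.g.:
//   workQueueKernel<<<numConcurrentCTAs, threadsPerCTA>>>(...);
#include <cuda_runtime.h>

__global__ void workQueueKernel(const int *queue, int queueLen,
                                int *nextIndex,           // atomic cursor
                                volatile int *readyCount, // inputs produced
                                const int *inputsNeeded,  // inputs required
                                const int *consumerOf)    // downstream HC id
{
    __shared__ int hc;
    while (true) {
        if (threadIdx.x == 0) {
            // Atomically grab a unique queue index (solid arrows 'A'/'C').
            int i = atomicAdd(nextIndex, 1);
            hc = (i < queueLen) ? queue[i] : -1;
        }
        __syncthreads();
        if (hc < 0) return;  // queue drained: the CTA exits

        // Spin until all input activations for this hypercolumn are ready;
        // safe only because the queue is ordered bottom-up and the CTAs
        // holding the producers are guaranteed to be running concurrently.
        if (threadIdx.x == 0)
            while (readyCount[hc] < inputsNeeded[hc]) { /* busy-wait */ }
        __syncthreads();

        // ... compute this hypercolumn's output activations and write
        //     them back to global memory (elided) ...

        if (threadIdx.x == 0) {
            int consumer = consumerOf[hc];
            if (consumer >= 0)   // signal the consumer (dashed arrow 'B')
                atomicAdd((int *)&readyCount[consumer], 1);
        }
        __syncthreads();  // keep hc stable before the next iteration
    }
}
```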

Problem: synchronization and workload imbalance
Solution:
  – Key algorithmic optimizations
  – Profiling / distributing cortical networks on multi-GPU systems
  – Provide insight into Nvidia GPU architectures

• Cortical network algorithm well suited to GPGPUs
  – 34x speedup baseline / 39x with optimizations
  – Synchronization overhead / workload imbalance combated with algorithmic changes
• Fermi vs. GTX 280 architecture
  – Application sensitive (32 vs. 128 threads)
  – Improved GigaThread CTA scheduler in Fermi
• Multi-GPU implementation
  – Online profiling / deployment
  – 60x speedup vs. serial