CS6068 Parallel Computing - Fall 2015
Lecture Week 5 - Sept 28
Topics: Latency, Layout Design, Occupancy
Challenging Parallel Design Patterns: Sparse Matrix Operations, Sieves, N-body Problems, Breadth-First Graph Traversals

Latency versus Bandwidth
- Latency is the time to complete a task, e.g., load from memory and perform a FLOP.
- Bandwidth is throughput, the number of tasks completed per unit time, e.g., peak global-to-shared-memory load rate, or graphics frames per second.
- GPUs optimize for bandwidth and employ techniques to hide latency.
- Reasons latency lags bandwidth over time:
  - Moore's Law favors bandwidth over latency: smaller, faster transistors, but they communicate over longer wires, which limits latency.
  - Increasing bandwidth with queues hurts latency.

Techniques for Performance: Hiding Latency and Little's Law
- Little's Law: number of useful bytes in flight = average latency x bandwidth.
- Tiling for coalesced reads and utilization of shared memory.
- __syncthreads() barriers and block density.
- Occupancy: latency hiding achieved by maximizing the number of active blocks per SM.
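To make Little's Law concrete, here is a small back-of-the-envelope calculation in Python. The latency and bandwidth figures are assumptions chosen only for illustration (roughly Fermi-class numbers), not values given in the lecture:

# Illustrative Little's Law estimate (assumed numbers, not from the slides)
latency_s = 400e-9        # assumed average global-memory latency: ~400 ns
bandwidth = 150e9         # assumed peak memory bandwidth: ~150 GB/s
bytes_in_flight = latency_s * bandwidth
print(int(bytes_in_flight))   # ~60000 bytes must be in flight to hide latency
# At 4 bytes per float that is ~15,000 outstanding loads, i.e. hundreds of
# active warps across the GPU if each thread has one load outstanding.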

Considerations for Sizing Tiles
Need to make sure that:
1) Tiles are large enough to get speedups by lowering average latency.
2) Tiles are small enough to fit within the shared-memory/register limits.
3) There are many tiles, so that the thread blocks running on each SM keep the memory bus full.

Sizing Tiles: How Big Should We Make a Tile?
1. If we choose a larger tile, does that require more or less total main-memory bandwidth?
2. If we choose a larger tile, is it easier or harder to make use of shared memory?
3. If we choose a larger tile, do we wait longer or shorter to synchronize?
4. If we choose a larger tile, can we hide latency more or less easily?

Warps and Stalls
- A grid is composed of blocks, which are completely independent.
- A block is composed of threads, which can communicate within their own block.
- Instructions are issued within an SM on a per-warp basis; generally 32 threads form a warp.
- If an operand is not ready, the warp stalls.
- The SM context-switches between warps when one stalls, and for performance that context switch must be very fast.

Fast Context Switching
- Registers and shared memory are allocated for an entire block for as long as that block is active.
- Once a block is active, it stays active until all threads in that block have completed.
- Context switching is very fast because registers and shared memory do not need to be saved and restored.
- Goal: have enough transactions in flight to saturate the memory bus; latency is better hidden when more transactions are in flight.

Occupancy
Occupancy is an easy, if somewhat imprecise, measure of how well we have saturated the memory pipeline for latency hiding.
In a CUDA program we measure occupancy as the ratio of active warps to the maximum number of active warps the SM supports.
Resources are allocated for the entire block and are thus potential occupancy limiters: register usage, shared-memory usage, and block size.

Occupancy Examples
Fermi streaming multiprocessor (SM) specs: 1536 max threads = 48 active warps of 32 threads each.
Example 1: Suppose the SM has 48K of shared memory and the kernel uses 32 bytes of shared memory per thread. 48K/32 = 1536, so we can actively schedule 1536 threads, or 48 warps. Max occupancy limit = 1.
Example 2: Suppose the SM has 16K of shared memory and the kernel uses 32 bytes of shared memory per thread. 16K/32 = 512, so we can actively schedule only 512 threads, or 16 warps. Max occupancy limit = 16/48 ≈ 0.333.
The CUDA Toolkit includes an occupancy calculator spreadsheet (occupancy_calculator.xls).
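The two examples above can be reproduced with a minimal Python sketch of the shared-memory-limited occupancy calculation. This is a simplified, assumed model; the real CUDA occupancy calculator also accounts for registers, block size, and per-SM block limits:

# Simplified occupancy estimate limited by shared memory only (assumed model)
def occupancy_limit(shared_mem_per_sm, bytes_per_thread,
                    max_threads_per_sm=1536, warp_size=32):
    # threads the shared-memory budget can support on one SM
    schedulable = min(shared_mem_per_sm // bytes_per_thread, max_threads_per_sm)
    active_warps = schedulable // warp_size
    max_warps = max_threads_per_sm // warp_size
    return active_warps / float(max_warps)

print(occupancy_limit(48 * 1024, 32))   # Example 1: 1.0
print(occupancy_limit(16 * 1024, 32))   # Example 2: ~0.333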

Bandwidth versus Latency: Dense versus Sparse Matrices

A Classic Problem: Dense N-Body
- Compute an NxN matrix of force vectors: N objects, each with its own parameters (3D location, mass, charge, etc.).
- Goal: compute all N^2 source/destination forces.
- Solution 1: partition threads using PxP tiles. Q: how do we minimize the ratio of global-memory to shared-memory fetches, and hence the average latency?
- Solution 2: partition space across P threads, exploiting spatial locality.
- Privatization principle: avoid write conflicts.
- What is the impact of the partition on bandwidth, shared memory, and load imbalance? (A tiled sketch follows below.)
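A rough NumPy sketch of Solution 1's tiling idea is shown below. It is an illustration only: the tile size P, the softened gravitational-style force law, and the array names are assumptions, and on a GPU each tile of source bodies would be staged in shared memory by a thread block:

import numpy as np

def nbody_forces_tiled(pos, mass, P=256, eps=1e-9):
    # pos: (N, 3) float positions, mass: (N,) masses (assumed layout)
    N = pos.shape[0]
    forces = np.zeros_like(pos)
    for j0 in range(0, N, P):                 # one tile of P source bodies,
        src_p = pos[j0:j0 + P]                # analogous to a thread block
        src_m = mass[j0:j0 + P]               # staging the tile in shared memory
        diff = src_p[None, :, :] - pos[:, None, :]        # (N, P, 3) separations
        dist3 = (np.sum(diff**2, axis=2) + eps) ** 1.5    # softened distance^3
        forces += np.sum(src_m[None, :, None] * diff / dist3[:, :, None], axis=1)
    return forces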

Sparse Matrix / Dense Vector Products
A format for sparse matrices using 3 dense arrays: the Compressed Sparse Row (CSR) format.
- Value: the nonzero data, stored in one array.
- Column: the column index of each value.
- Rowptr: pointers to the location of the first entry of each row.
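A minimal Python sketch of the CSR sparse-matrix/dense-vector product follows, processing one row per loop iteration (the "thread per row" granularity discussed below). The small 3x3 example matrix is an assumption, not data from the slides:

import numpy as np

def csr_spmv(value, column, rowptr, x):
    # y[r] = sum of value[j] * x[column[j]] over row r's slice of the arrays
    y = np.zeros(len(rowptr) - 1)
    for r in range(len(y)):                      # "thread per row" granularity
        lo, hi = rowptr[r], rowptr[r + 1]
        y[r] = np.dot(value[lo:hi], x[column[lo:hi]])
    return y

# Example (assumed): [[3, 0, 1], [0, 0, 0], [0, 2, 4]] times [1, 1, 1]
value  = np.array([3.0, 1.0, 2.0, 4.0])
column = np.array([0, 2, 1, 2])
rowptr = np.array([0, 2, 2, 4])
print(csr_spmv(value, column, rowptr, np.ones(3)))   # -> [4., 0., 6.]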

Segmented Scan
- Sparse-matrix multiplication requires a modified scan, called a segmented scan, in which a special symbol marks segment boundaries within the array and the scan is done per segment.
- Apply the scan operation to each segment independently.
- Reduces to an ordinary scan if we treat the segment symbol as an annihilation value.
- Work an example using both inclusive and exclusive scans.
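A minimal sequential sketch of a segmented inclusive sum-scan is shown below, using a flag array to mark the start of each segment. The flag representation and names are assumptions; a parallel version would run the usual scan tree over combined (flag, value) pairs:

def segmented_inclusive_scan(values, flags):
    # flags[i] == 1 marks the start of a new segment
    out, running = [], 0
    for v, f in zip(values, flags):
        running = v if f else running + v    # a flag "annihilates" the prefix
        out.append(running)
    return out

# Segments: [3, 1] | [4, 1, 5] | [9]
print(segmented_inclusive_scan([3, 1, 4, 1, 5, 9],
                               [1, 0, 1, 0, 0, 1]))   # -> [3, 4, 4, 5, 10, 9]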

Dealing with Sparse Matrices
- Apply the dense-array CSR format.
- Apply segmented scan operations.
- Granularity considerations: thread per row vs. thread per element.
- Load imbalance.
- Hybrid approach?

A Classic Iterative Refinement: the Sieve of Eratosthenes (Sequential Algorithm)

Sieve of Eratosthenes - Python code

import numpy

def findSmallest(primes, p):          # helper function: finds smallest prime >= p
    for i in range(p, len(primes)):
        if primes[i] == 1:
            return i

def sieve(n):                          # return the number of primes <= n
    primes = numpy.ones((n+1,), dtype=numpy.int)
    primes[0] = primes[1] = 0
    k = 2
    while (k*k <= n):
        mult = k*k
        while mult <= n:
            primes[mult] = 0
            mult = mult + k
        k = findSmallest(primes, k+1)
    return sum(primes)

Timing the sequential sieve (interactive session):

>>> import time, numpy
>>> for x in range(15, 27):
...     st = time.clock(); s = sieve(2**x); fin = time.clock()
...     print x, s, fin-st

(results table: x, sum, time)

Complexity of the Sieve of Eratosthenes
Work and step complexity: W(n) = O(n log log n), S(n) = O(#primes less than √n) = O(√n / log n).
Proof sketch:
- Allocating the array takes O(n) work.
- Sieving with each prime p takes work n/p.
- Finding the next smallest prime p takes work O(p), which is < n/p since p ≤ √n.
- The Prime Number Theorem says that if a random integer is selected from range(n), the probability that it is prime is ~ 1/log n; thus the x-th smallest prime is ~ x log x.
- So the total work is bounded by O(n) plus n times the sum of the reciprocals of the first √n/log n primes. Since the x-th prime is ~ x log x, and from calculus ∫ dx/(x log x) = log log x + C, that sum is O(log log n).
- Hence W(n) is O(n log log n).

Designing an Optimized Parallel Solution for the Sieve
Important issues to consider:
- What should a thread do? Focus on the divisor or the quotient?
- Is tiling possible? Where should a thread start? What are the dependencies?
- Can thread blocks make use of coalesced DRAM access and shared memory?
- Is compaction possible? (A tiled sketch follows below.)
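One possible starting point, sketched sequentially in NumPy, is shown below. It only illustrates the tiling idea with an assumed tile size and is not the optimized GPU solution the slide asks for: precompute the primes up to sqrt(n), then let each tile (which would map to a thread block) mark the multiples that fall inside its own range:

import numpy as np

def sieve_tiled(n, tile=4096):
    # primes up to sqrt(n), found with an ordinary sequential sieve
    limit = int(n ** 0.5) + 1
    base = np.ones(limit + 1, dtype=np.int64); base[:2] = 0
    for k in range(2, int(limit ** 0.5) + 1):
        if base[k]:
            base[k*k::k] = 0
    small_primes = np.nonzero(base)[0]

    count = 0
    for lo in range(2, n + 1, tile):               # each tile ~ one thread block
        hi = min(lo + tile, n + 1)
        flags = np.ones(hi - lo, dtype=np.int64)   # tile fits in "shared memory"
        for p in small_primes:
            start = max(p * p, ((lo + p - 1) // p) * p)   # first multiple in tile
            if start < hi:
                flags[start - lo::p] = 0
        count += int(flags.sum())
    return count

print(sieve_tiled(2**20))   # should match sieve(2**20) from the earlier slide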

Structure of Data Impacts Design: Parallel Graph Traversal
- Examples: WWW, Facebook, Tor.
- Application: visit every node once.
- Breadth-first traversal: visit nodes level-by-level, synchronously.
- Variety of graphs: small depth, large depth, small-world depth.

Design of BFS
Goal: compute the hop distance of every node from a given source node.
1st try: thread-per-edge method.
- Work complexity = ?
- Step complexity = ?
- Control iterations, race conditions, finishing conditions.
Next time: make this more efficient. (A level-synchronous sketch follows below.)
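A minimal sequential sketch of the level-synchronous approach is shown below, using an assumed adjacency-list (dict) representation. On the GPU, each frontier would be expanded by many threads at once, which is exactly where the race conditions and finishing conditions above come in:

def bfs_hop_distances(adj, source):
    # adj: dict mapping node -> list of neighbors (assumed representation)
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:                      # one "control iteration" per level
        level += 1
        next_frontier = []
        for u in frontier:               # on a GPU: one thread per node or edge
            for v in adj[u]:
                if v not in dist:        # racy when done in parallel
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier         # finishing condition: empty frontier
    return dist

graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # assumed example graph
print(bfs_hop_distances(graph, 0))       # -> {0: 0, 1: 1, 2: 1, 3: 2}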