CS 179: GPU Programming Lecture 10.

Topics
– Non-numerical algorithms: parallel breadth-first search (BFS)
– Texture memory

GPUs – good for many numerical calculations… What about “non-numerical” problems?

Graph Algorithms

Graph Algorithms

A graph G(V, E) consists of:
– Vertices
– Edges (defined by pairs of vertices)

Complex data structures! How to store? How to work around? Are graph algorithms parallelizable?

Breadth-First Search* Given source vertex S: Find min. #edges to reach every vertex from S *variation

Breadth-First Search* Given source vertex S: Find min. #edges to reach every vertex from S (Assume source is vertex 0) [Figure: example graph with vertices labeled by BFS level: 1, 1, 2, 2, 3] *variation

Breadth-First Search* Given source vertex S: Find min. #edges to reach every vertex from S (Assume source is vertex 0)

Sequential pseudocode:
let Q be a queue
Q.enqueue(source)
label source as discovered
source.value <- 0
while Q is not empty:
    v <- Q.dequeue()
    for all edges from v to w in G.adjacentEdges(v):
        if w is not labeled as discovered:
            Q.enqueue(w)
            label w as discovered, w.value <- v.value + 1

*variation

Breadth-First Search* Given source vertex S: Find min. #edges to reach every vertex from S (Assume source is vertex 0)

Sequential pseudocode:
let Q be a queue
Q.enqueue(source)
label source as discovered
source.value <- 0
while Q is not empty:
    v <- Q.dequeue()
    for all edges from v to w in G.adjacentEdges(v):
        if w is not labeled as discovered:
            Q.enqueue(w)
            label w as discovered, w.value <- v.value + 1

Runtime: O( |V| + |E| ). How do we parallelize?
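As a concrete reference point, the sequential pseudocode above translates directly into C++; this is a minimal sketch that uses -1 in the cost array as the "not yet discovered" label instead of a separate set:

```cpp
#include <queue>
#include <vector>

// Queue-based BFS: returns the min. #edges from source to every vertex
// (-1 for unreachable vertices). adj is an ordinary adjacency list.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> cost(adj.size(), -1); // -1 == "not discovered"
    std::queue<int> q;
    cost[source] = 0;
    q.push(source);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int w : adj[v]) {
            if (cost[w] == -1) {       // first time we reach w
                cost[w] = cost[v] + 1; // one more edge than its parent
                q.push(w);
            }
        }
    }
    return cost;
}
```

Each vertex is enqueued once and each edge scanned once, which is where the O(|V| + |E|) runtime comes from.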

Representing Graphs

"Adjacency matrix": A is a |V| x |V| matrix with Aij = 1 if vertices i, j are adjacent, 0 otherwise. O(V²) space.

"Adjacency list": adjacent vertices noted for each vertex. O(V + E) space.

Representing Graphs

"Adjacency matrix": A is a |V| x |V| matrix with Aij = 1 if vertices i, j are adjacent, 0 otherwise. O(V²) space <- hard to fit, more copy overhead.

"Adjacency list": adjacent vertices noted for each vertex. O(V + E) space <- easy to fit, less copy overhead.

Representing Graphs: "Compact Adjacency List"

Array Ea: adjacent vertices to vertex 0, then vertex 1, then … (size: O(E))
Array Va: delimiters into Ea, one per vertex (size: O(V))

[Figure: example Va array (2, 4, 8, 9, 11) delimiting each vertex's neighbors within Ea]

We have a data structure, but still a big problem!
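A minimal C++ sketch of building the compact adjacency list (the same layout as the CSR sparse-matrix format) from an ordinary adjacency list. As a common variant of the slide's scheme, Va here gets |V|+1 entries so that Va[i+1] always delimits the end of vertex i's neighbor range, with no special case for the last vertex:

```cpp
#include <vector>

// Compact adjacency list: Ea[Va[i] .. Va[i+1]-1] are the neighbors of
// vertex i. Va has |V|+1 entries; Ea has |E| entries (each directed edge
// appears once).
struct CompactGraph {
    std::vector<int> Va; // delimiters, size O(V)
    std::vector<int> Ea; // concatenated neighbor lists, size O(E)
};

CompactGraph toCompact(const std::vector<std::vector<int>>& adj) {
    CompactGraph g;
    g.Va.push_back(0); // vertex 0's neighbors start at Ea[0]
    for (const auto& neighbors : adj) {
        for (int w : neighbors) g.Ea.push_back(w);
        g.Va.push_back((int)g.Ea.size()); // end of this vertex's range
    }
    return g;
}
```

Two flat arrays copy to the GPU in two cudaMemcpy calls, unlike a pointer-per-vertex adjacency list.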

Breadth-First Search* Given source vertex S: Find min. #edges to reach every vertex from S (Assume source is vertex 0)

Sequential pseudocode:
let Q be a queue
Q.enqueue(source)
label source as discovered
source.value <- 0
while Q is not empty:
    v <- Q.dequeue()
    for all edges from v to w in G.adjacentEdges(v):
        if w is not labeled as discovered:
            Q.enqueue(w)
            label w as discovered, w.value <- v.value + 1

How to "parallelize" when there's a queue?

Breadth-First Search*

Sequential pseudocode:
let Q be a queue
Q.enqueue(source)
label source as discovered
source.value <- 0
while Q is not empty:
    v <- Q.dequeue()
    for all edges from v to w in G.adjacentEdges(v):
        if w is not labeled as discovered:
            Q.enqueue(w)
            label w as discovered, w.value <- v.value + 1

Why do we use a queue?

BFS Order Here, vertex #s are possible BFS order "Breadth-first-tree" by Alexander Drichel - Own work. Licensed under CC BY 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Breadth-first-tree.svg#/media/File:Breadth-first-tree.svg

BFS Order Permutation within ovals preserves BFS! "Breadth-first-tree" by Alexander Drichel - Own work. Licensed under CC BY 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Breadth-first-tree.svg#/media/File:Breadth-first-tree.svg

BFS Order Queue replaceable if layers preserved! Permutation within ovals preserves BFS! "Breadth-first-tree" by Alexander Drichel - Own work. Licensed under CC BY 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Breadth-first-tree.svg#/media/File:Breadth-first-tree.svg

Alternate BFS algorithm

Construct arrays of size |V|:
– "Frontier" (denote F): Boolean array indicating vertices "to be visited" (at beginning of iteration)
– "Visited" (denote X): Boolean array indicating already-visited vertices
– "Cost" (denote C): Integer array indicating #edges to reach each vertex

Goal: Populate C

Alternate BFS algorithm

New sequential pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    for 0 ≤ i < |Va|:
        if F[i] is true:
            F[i] <- false
            X[i] <- true
            for all neighbors j of i:
                if X[j] is false:
                    C[j] <- C[i] + 1
                    F[j] <- true

Alternate BFS algorithm

New sequential pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    for 0 ≤ i < |Va|:
        if F[i] is true:
            F[i] <- false
            X[i] <- true
            for all j in Ea[Va[i] .. Va[i+1]-1]:
                if X[j] is false:
                    C[j] <- C[i] + 1
                    F[j] <- true

Alternate BFS algorithm

New sequential pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    for 0 ≤ i < |Va|:
        if F[i] is true:
            F[i] <- false
            X[i] <- true
            for all j in Ea[Va[i] .. Va[i+1]-1]:
                if X[j] is false:
                    C[j] <- C[i] + 1
                    F[j] <- true

Parallelizable! (The loop over i has no queue; each frontier vertex can be handled independently.)
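A sequential C++ sketch of the frontier-array algorithm, assuming Va has |V|+1 entries so that Va[i+1] delimits vertex i's neighbor range. The snapshot of F makes each sweep process exactly one BFS level, which mirrors running one GPU kernel launch per level:

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Frontier-based BFS over a compact adjacency list (Va, Ea).
// Returns the cost array C; INT_MAX plays the role of "infinity".
std::vector<int> frontierBFS(const std::vector<int>& Va,
                             const std::vector<int>& Ea, int source) {
    int n = (int)Va.size() - 1;
    std::vector<bool> F(n, false), X(n, false);
    std::vector<int> C(n, INT_MAX);
    F[source] = true;
    C[source] = 0;
    while (std::find(F.begin(), F.end(), true) != F.end()) {
        // Snapshot the frontier so each sweep handles exactly one BFS
        // level -- on the GPU this is one kernel launch per level.
        std::vector<bool> cur = F;
        std::fill(F.begin(), F.end(), false);
        for (int i = 0; i < n; ++i)
            if (cur[i]) X[i] = true;      // mark the whole level visited
        for (int i = 0; i < n; ++i) {     // the parallelizable loop
            if (!cur[i]) continue;
            for (int k = Va[i]; k < Va[i + 1]; ++k) {
                int j = Ea[k];
                if (!X[j]) {
                    C[j] = C[i] + 1;      // all writers in a level agree
                    F[j] = true;
                }
            }
        }
    }
    return C;
}
```

Marking the whole level visited before scanning neighbors keeps edges between two same-level frontier vertices from bumping a cost by one extra level.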

GPU-accelerated BFS

CPU-side pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    call GPU kernel( F, X, C, Va, Ea )

GPU-side kernel pseudocode:
if F[threadId] is true:
    F[threadId] <- false
    X[threadId] <- true
    for all j in Ea[Va[threadId] .. Va[threadId + 1] - 1]:
        if X[j] is false:
            C[j] <- C[threadId] + 1
            F[j] <- true

(Can represent boolean values as integers)

GPU-accelerated BFS

CPU-side pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    call GPU kernel( F, X, C, Va, Ea )

GPU-side kernel pseudocode:
if F[threadId] is true:
    F[threadId] <- false
    X[threadId] <- true
    for all j in Ea[Va[threadId] .. Va[threadId + 1] - 1]:
        if X[j] is false:
            C[j] <- C[threadId] + 1
            F[j] <- true

(Can represent boolean values as integers)

Unsafe operation? Multiple threads can write C[j] and F[j] at the same time.

GPU-accelerated BFS

CPU-side pseudocode:
Input: Va, Ea, source (graph in "compact adjacency list" format)
Create frontier (F), visited array (X), cost array (C)
F <- (all false)
X <- (all false)
C <- (all infinity)
F[source] <- true
C[source] <- 0
while F is not all false:
    call GPU kernel( F, X, C, Va, Ea )

GPU-side kernel pseudocode:
if F[threadId] is true:
    F[threadId] <- false
    X[threadId] <- true
    for all j in Ea[Va[threadId] .. Va[threadId + 1] - 1]:
        if X[j] is false:
            C[j] <- C[threadId] + 1
            F[j] <- true

(Can represent boolean values as integers)

Safe! No ambiguity! Every frontier thread in a given launch writes the same value (current level + 1) to C[j], so the race is benign.

Summary Tricky algorithms need drastic measures! Resources “Accelerating Large Graph Algorithms on the GPU Using CUDA” (Harish, Narayanan)

Texture Memory

“Ordinary” Memory Hierarchy http://www.imm.dtu.dk/~beda/SciComp/caches.png

GPU Memory Lots of types! Global memory Shared memory Constant memory

GPU Memory

Lots of types! Global memory, shared memory, constant memory, …

Must keep in mind: coalesced access, divergence, bank conflicts, random serialized access, … and size!

Hardware vs. Abstraction

Hardware vs. Abstraction Names refer to manner of access on device memory: “Global memory” “Constant memory” “Texture memory”

Review: Constant Memory Read-only access 64 KB available, 8 KB cache – small! Not “const”! Write to region with cudaMemcpyToSymbol()

Review: Constant Memory Broadcast reads to half-warps! When all threads need same data: Save reads! Downside: When all threads need different data: Extremely slow!

Review: Constant Memory Example application: Gaussian impulse response (from HW 1): Not changed Accessed simultaneously by threads in warp

Texture Memory (and co-stars) Another type of memory system, featuring: Spatially-cached read-only access Avoid coalescing worries Interpolation (Other) fixed-function capabilities Graphics interoperability

Fixed Functions Like GPUs in the old days! Still important/useful for certain things

Traditional Caching When reading, cache “nearby elements” (i.e. cache line) Memory is linear! Applies to CPU, GPU L1/L2 cache, etc

Traditional Caching

Traditional Caching 2D array manipulations: One direction goes “against the grain” of caching E.g. if array is stored row-major, traveling along “y-direction” is sub-optimal!
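The "against the grain" effect is visible in the index arithmetic alone; a tiny illustration of row-major addressing:

```cpp
// Row-major indexing: element (row y, column x) of a width-wide 2D array
// lives at linear offset y * width + x. Stepping in x moves to the next
// address (cache-friendly); stepping in y jumps `width` elements at a
// time, so a y-direction traversal keeps missing the cache line.
int rowMajorIndex(int y, int x, int width) { return y * width + x; }
```

For width = 8, consecutive x values map to consecutive offsets, while consecutive y values are 8 elements apart.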

Texture-Memory Caching Can cache “spatially!” (2D, 3D) Specify dimensions (1D, 2D, 3D) on creation 1D applications: Interpolation, clipping (later) Caching when e.g. coalesced access is infeasible

Texture Memory “Memory is just an unshaped bucket of bits” (CUDA Handbook) Need texture reference in order to: Interpret data Deliver to registers

Texture References “Bound” to regions of memory Specify (depending on situation): Access dimensions (1D, 2D, 3D) Interpolation behavior “Clamping” behavior Normalization …

Interpolation Can “read between the lines!” http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html
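A rough C++ sketch of what a linearly-filtered 1D texture fetch computes for free in hardware. The real texture unit uses a low-precision fixed-point blend weight; this sketch uses floats, and handles no clamping, so x is assumed to lie in [0, N-1):

```cpp
#include <cmath>
#include <vector>

// Linear interpolation between the two texels bracketing coordinate x:
// a fetch at x = 1.25 returns 75% of texel 1 plus 25% of texel 2.
float fetchLinear(const std::vector<float>& texels, float x) {
    int i = (int)std::floor(x);       // left neighbor
    float frac = x - (float)i;        // blend weight toward the right
    return texels[i] * (1.0f - frac) + texels[i + 1] * frac;
}
```

On the GPU this blend costs nothing extra: it happens inside the texture unit during the read.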

Clamping Seamlessly handle reads beyond region! http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html

“CUDA Arrays” So far, we’ve used standard linear arrays “CUDA arrays”: Different addressing calculation Contiguous addresses have 2D/3D locality! Not pointer-addressable (Designed specifically for texturing) There’s a CS 24 set about this or something

Texture Memory Texture reference can be attached to: Ordinary device-memory array “CUDA array” Many more capabilities

Texturing Example (2D) http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf

Texturing Example (2D) http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf

Texturing Example (2D) http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf

Texturing Example (2D)