Shared Memory Accesses

Presentation on theme: "Shared Memory Accesses"— Presentation transcript:

1 Shared Memory Accesses
©Sudhakar Yalamanchili unless otherwise noted

2 Objectives
Understand how memory access patterns in a warp can affect performance
Develop an intuition about how to avoid performance-degrading access patterns to shared memory

3 Reading
CUDA Programming Guide
CUDA: Best Practices

4 Avoiding Shared Memory Bank Conflicts
Be cognizant of addressing patterns within a warp
Accesses to different addresses in the same bank are serialized
Accesses to the same address in the same bank are broadcast (see the behavior for compute capability 5.0 and 6.0)
Check the behavior of shared memory as a function of compute capability: CUDA Programming Guide, Section H

5 Bank Conflicts
[Figure: four warp access patterns to shared memory banks; three are conflict free, one is a 2-way conflict in which two threads access the same bank.]

6 Interleaved Memory Organization
[Figure: timing of two pipelined memory accesses, each with access time τ followed by reading the module output; words 0-7 interleaved across Banks 0-3.]
Memory is organized into multiple, concurrent banks
Word-level interleaving across banks
A single address generates multiple, concurrent accesses
Well matched to cache-line access patterns

7 Sequential Bank Operation
[Figure: an n-bit address split into m lower-order bits (bank select) and n-m higher-order bits (word within bank); timing of two back-to-back accesses, each taking access time τ plus the time to read the module output.]

8 Concurrent Bank Operation
[Figure: overlapped accesses to independent banks, each with access time τ.]
Each bank can be addressed independently
Multiple sources of addresses (e.g., the threads of a warp)
Differences with interleaved memory:
Flexibility in addressing
Requires greater address bandwidth
Separate controllers and memory buses
Support for non-blocking caches with multiple outstanding misses

9 Data Skewing for Concurrent Access
How can we guarantee that data can be accessed in parallel? Avoid bank conflicts
Storage scheme: a set of rules that determine, for each array element, the module address and the location within that module
Design a storage scheme to ensure concurrent access
d-ordered n vector: the ith element is in module (d·i + C) mod M
[Figure: a 3-ordered 8-vector with C = 2 laid out across memory modules.]

10 Conflict Free Access
Conflict-free access to the elements of the vector if M >= N·gcd(M, d)
Multi-dimensional arrays are treated as arrays of 1-D vectors
Conflict-free access for various patterns in a matrix requires:
M >= N·gcd(M, δ1) for columns
M >= N·gcd(M, δ2) for rows
M >= N·gcd(M, δ1 + δ2) for forward diagonals
M >= N·gcd(M, δ1 - δ2) for backward diagonals

11 Conflict Free Access
Implications for M = N = an even number?
For non-power-of-two values of M, indexing and address computation must be efficient
Vectors that are accessed are scrambled; unscrambling them is a non-trivial performance issue
Data dependencies can still reduce bandwidth far below O(M)

12 Avoiding Bank Conflicts
Many banks:

int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];

Even with 128 banks, since 512 is a multiple of 128, the column walk over i conflicts on word accesses
Solutions:
Software: loop interchange
Software: adjust the array size to a prime number ("array padding")
Hardware: a prime number of banks (e.g., 17)
Data skewing

13 Questions?
