EE 193: Parallel Computing
Fall 2017, Tufts University. Instructor: Joel Grodstein. Lecture 7: Matrix multiply
Goals
Primary goal: understand block-matrix algorithms.
Matrix multiplication
Matrix multiplication is the poster child for parallel computation. Some uses:
- Computer graphics (e.g., linear algebra for transforming points between coordinate spaces)
- Anything else that uses linear algebra
- Neural nets (perhaps the biggest use of GPGPUs)
And that's good, because matrix multiplication is easy, right? It's easy to get it correct, but not so easy to make it fast.
Matrix times vector
Let's start simple: a (4x4 matrix) * (4x1 vector). Write the equations, then write the code.

    [A00 A01 A02 A03]   [x0]   [y0]
    [A10 A11 A12 A13] * [x1] = [y1]
    [A20 A21 A22 A23]   [x2]   [y2]
    [A30 A31 A32 A33]   [x3]   [y3]

    y_r = \sum_{k=0}^{N-1} A_{r,k} x_k

    for r=0…N-1
      for k=0…N-1
        y[r] += A[r,k]*x[k];

If N were big, what do you think about the temporal and spatial locality of this code?
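As a concrete reference, here is a minimal runnable sketch of that loop in C++. The flat row-major storage and the function name are my own choices, not from the slides:

    #include <vector>
    #include <cstddef>

    // y = A*x for an N x N row-major matrix A and a length-N vector x.
    void matvec(const std::vector<float>& A, const std::vector<float>& x,
                std::vector<float>& y, std::size_t N) {
        for (std::size_t r = 0; r < N; ++r) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                sum += A[r*N + k] * x[k];   // row-major: A[r,k] is contiguous in k
            y[r] = sum;
        }
    }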
Matrix times vector

    for r=0…N-1
      for k=0…N-1
        y[r] += A[r,k]*x[k];

Temporal locality: each element of A is used only once; each element of x is used N times. That reuse only helps if x fits into cache (hopefully easy).
Spatial locality: the inner loop accesses memory in row-major order.
Data for HSX and Pascal
Look at the Haswell server's (HSX) compute and memory:

                          Haswell server (HSX)      Nvidia Pascal P100
    # cores               18                        3840
    Die area              660 mm2 (22nm)            610 mm2 (16nm)
    Frequency             2.3 GHz normal            1.3 GHz
    Max DRAM BW           100 GB/s                  720 GB/s
    LLC size              2.5 MB/core (L3)          4 MB L2/chip
    LLC-1 size            256 KB/core (L2)          64 KB/SM (64 cores)
    Registers/core        180 (2 threads/core)      500
    Power                 165 watts                 300 watts
    Company market cap    $160B                     $90B
Xeon compute-to-memory ratio

    2.3 Gcyc/sec * 1.5 flop/(cyc*core) * 18 cores * 2 operands/flop ≈ 125 G operands/sec
    Max DRAM bandwidth = 100 GB/sec * (1 float / 4 B) = 25 G floats/sec

Note that 125 > 25. So if we are reading our data from main memory (not from caches), then we cannot keep all the FPUs in all cores busy.
Note also that 125/25 ≈ 5. If every piece of data gets used once (from memory) and four more times (from cache), then we can keep all the FPUs in all cores busy.
Now let's look at a consequence of this.
Compute/memory ratio

    for r=0…N-1
      for k=0…N-1
        y[r] += A[r,k]*x[k];

How many computations are there? N² FMACs.
How many floats are read/written? N² + 2N.
So the compute/memory ratio is roughly 1. We want every piece of data to be reused 4 times, but here each is used only once, period. So a Xeon cannot get data in from memory fast enough to keep the FPUs busy. Even perfect caches cannot fix this!
What does it all mean?
Big-data problems do not fit in your registers. They're unlikely to fit in L1; they may fit in L2… or L3… or main memory…
If they fit into L1, all of this is irrelevant, since the BW and latency from L1 to the register file are so good.
If they only fit into main memory: a chain is no stronger than its weakest link. A cache can buffer, reuse, and even out the flow of data, but if the compute/memory ratio is < 5, then HSX cannot use all of its compute. And if average DRAM bandwidth is the weak link, there is no way to fix that.
A slightly different problem

    for r=0…N-1
      for k=0…N-1
        y[r] += A[r,k]*x[k];

Related problem: what if A is fixed, and we want to stream in lots of x vectors? Consider the work for one vector x producing one vector y, assuming A is already in cache.
How many computations are there? N² FMACs.
How many floats are read/written? 2N.
The compute/memory ratio is N/2. Since we need a ratio of at least 5, HSX can deal with this for N ≥ 10. Note that computer graphics typically has only N=4, but the graphics pipeline has more than just one matrix multiply.
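A minimal sketch of that streaming pattern in C++ (my own names; ys must be pre-sized to match xs, and A is assumed to stay cache-resident across calls):

    #include <vector>
    #include <cstddef>

    // Stream many vectors through one fixed N x N row-major matrix A.
    // A is read over and over, so after the first pass it should be served
    // from cache; each new vector then costs only 2N floats of DRAM traffic.
    void transform_stream(const std::vector<float>& A,
                          const std::vector<std::vector<float>>& xs,
                          std::vector<std::vector<float>>& ys,
                          std::size_t N) {
        for (std::size_t v = 0; v < xs.size(); ++v)
            for (std::size_t r = 0; r < N; ++r) {
                float sum = 0.0f;
                for (std::size_t k = 0; k < N; ++k)
                    sum += A[r*N + k] * xs[v][k];
                ys[v][r] = sum;
            }
    }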
More on matrix*vector
Keep looking at the same problem: A is fixed, and we stream in lots of x vectors. In this situation, do we have any need for caches?
- We need someplace to store the A matrix.
- Our code may not access the various x vectors exactly in address-sequential order; a cache can store the ones we skip until we use them.
Matrix times matrix
Let's move on to a bigger problem: matrix*matrix, C = A*B.

    C_{r,c} = \sum_{k=0}^{N-1} A_{r,k} B_{k,c}

    for r=0…N-1
      for c=0…N-1
        for k=0…N-1
          C[r,c] += A[r,k]*B[k,c];
Matrix times matrix

    for r=0…N-1
      for c=0…N-1
        for k=0…N-1
          C[r,c] += A[r,k]*B[k,c];

How many computations are there? N³ FMACs.
How many floats are read/written? 3N².
The compute/memory ratio is N/3. So, in principle, this should be easy!
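For reference, a self-contained C++ version of that triple loop, a minimal sketch with my own function name and flat row-major arrays:

    #include <vector>
    #include <cstddef>

    // C += A*B for N x N row-major matrices, in the r,c,k order above.
    void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, std::size_t N) {
        for (std::size_t r = 0; r < N; ++r)
            for (std::size_t c = 0; c < N; ++c) {
                float sum = C[r*N + c];
                for (std::size_t k = 0; k < N; ++k)
                    sum += A[r*N + k] * B[k*N + c];  // B access is strided (column order)
                C[r*N + c] = sum;
            }
    }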
Matrix access patterns

    for r=0…N-1
      for c=0…N-1
        for k=0…N-1
          C[r,c] += A[r,k]*B[k,c];

Access pattern for A? Row major, repeating each row N times; the row will mostly be accessed from a small, fast cache.
Access pattern for B? Column major, repeating the entire matrix N times; we access it from main memory, and slowly.
Access pattern for C? Row major, repeating each element N times; extremely fast access.
Can we fix this?
Our trick: swap the 2nd and 3rd loops.

    before:                            after:
    for r=0…N-1                        for r=0…N-1
      for c=0…N-1                        for k=0…N-1
        for k=0…N-1                        for c=0…N-1
          C[r,c] += A[r,k]*B[k,c];           C[r,c] += A[r,k]*B[k,c];

Any change in the end-result computation? We still execute the final line with the same r,c,k triplets, just in a different order; and adding the same numbers in a different order doesn't change the result (up to floating-point rounding).
In-class exercise
Draw matrix multiply for 2x2 matrices on the board. Divide the class into two groups: one group works on the original loop ordering, the other on the swapped ordering. Each group manually runs the code and notes which A*B products got summed into which locations in C. At the end, compare to see if both groups added the same product pairs.
Matrix access patterns

    for r=0…N-1
      for k=0…N-1
        for c=0…N-1
          C[r,c] += A[r,k]*B[k,c];

Access pattern for A? Row major, repeating each element N times; extremely fast access.
Access pattern for B? Row major, repeating the entire matrix N times; but if we cannot fit B into the L3, we must still keep reloading it from memory.
Access pattern for C? Row major, repeating each row N times; the row will mostly be accessed from a small, fast cache.
Can we fix this?
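Here is the swapped ordering as runnable C++, a sketch with my own names; note that hoisting A[r,k] out of the inner loop makes the reuse explicit:

    #include <vector>
    #include <cstddef>

    // C += A*B in r,k,c order: the inner loop now walks B and C row-major.
    void matmul_rkc(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N) {
        for (std::size_t r = 0; r < N; ++r)
            for (std::size_t k = 0; k < N; ++k) {
                float a = A[r*N + k];             // reused across the whole inner loop
                for (std::size_t c = 0; c < N; ++c)
                    C[r*N + c] += a * B[k*N + c]; // both streams are contiguous
            }
    }

On typical hardware this version runs considerably faster than the r,c,k version for large N, even though it performs exactly the same N³ FMACs.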
(Animation across four slides: with r and k fixed, the inner loop scales row k of B by the single element A[r,k] and accumulates it into row r of C; the frames step k along row r of A, then advance r to the next row.)
Let's go parallel

    for r=0…N-1
      for k=0…N-1
        for c=0…N-1
          C[r,c] += A[r,k]*B[k,c];

Assign each thread just a few r values. We want to assign each thread to one core, and have each thread only access as much memory as can fit in L2. Does it work?
- Each thread now accesses only its own rows of A and C.
- But each thread still accesses every element of B, once per value of r.
Conclusion: each thread must be able to hold all of B in its cache. Other than that, it parallelizes quite well.
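A minimal sketch of that row partitioning with C++ threads. The band-splitting scheme and all names are mine; it assumes the r,k,c loop order from above:

    #include <algorithm>
    #include <thread>
    #include <vector>
    #include <cstddef>

    // Each thread computes a contiguous band of rows of C. No two threads
    // ever write the same C[r,c], so no locking is needed.
    void matmul_by_rows(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, std::size_t N, unsigned n_threads) {
        std::vector<std::thread> pool;
        std::size_t band = (N + n_threads - 1) / n_threads;
        for (unsigned t = 0; t < n_threads; ++t) {
            std::size_t r0 = t * band, r1 = std::min(N, r0 + band);
            pool.emplace_back([&, r0, r1] {
                for (std::size_t r = r0; r < r1; ++r)
                    for (std::size_t k = 0; k < N; ++k)
                        for (std::size_t c = 0; c < N; ++c)
                            C[r*N + c] += A[r*N + k] * B[k*N + c];
            });
        }
        for (auto& th : pool) th.join();
    }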
Block-matrix algorithms
What happens if B doesn't fit into our relevant cache? Is this a real concern? YES. Matrices can be big, we want to be able to do linear algebra on small CPUs, and GPUs have small caches. Let's change our algorithm to work even on small caches and big matrices.
Baby step
Consider the two programs:

    for a=0..1                     for i=0..7
      for b=0..3                     do something with i;
        i = 4*a + b;
        do something with i;

Do they get the same results? Unless the "do something" is very tricky, yes they do.
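The same split in C++, just to make the index arithmetic concrete (a throwaway sketch):

    #include <cstdio>

    int main() {
        // Split i = 0..7 into a block index a and an offset b: i = 4*a + b.
        for (int a = 0; a <= 1; ++a)
            for (int b = 0; b <= 3; ++b)
                std::printf("%d ", 4*a + b);   // prints 0 1 2 3 4 5 6 7
        return 0;
    }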
Interpretation: matrix blocks
Break the 4x4 matrix into four 2x2 blocks:

    [A00 A01 | A02 A03]     Block 0,0 = rows 0-1, cols 0-1
    [A10 A11 | A12 A13]     Block 0,1 = rows 0-1, cols 2-3
    [--------+--------]     Block 1,0 = rows 2-3, cols 0-1
    [A20 A21 | A22 A23]     Block 1,1 = rows 2-3, cols 2-3
    [A30 A31 | A32 A33]
Now split each of r, k, c into a block index and an offset within the block. With Ni the block size and Nb = N/Ni the number of blocks per dimension:

    for rb=0…Nb-1            // pick the row
      for ri=0..Ni-1
        for kb=0..Nb-1       // pick k
          for ki=0..Ni-1
            for cb=0..Nb-1   // pick the column
              for ci=0..Ni-1
                r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci;
                C[r,c] += A[r,k]*B[k,c];

Do we believe that this has the same result as before? Yes. But it is getting a bit confusing.
Now swap some loops, so that the three block loops are outermost and the three offset loops are innermost:
    for rb=0…Nb-1            // pick the block
      for kb=0..Nb-1
        for cb=0..Nb-1
          for ri=0..Ni-1     // pick the offset within a block
            for ki=0..Ni-1
              for ci=0..Ni-1
                r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci;
                C[r,c] += A[r,k]*B[k,c];

What? Doesn't it only take two numbers to specify a block?
Yes, it only takes two numbers to specify a block. But A, B and C each use a different index pair: the inner loops touch A's block (rb,kb), B's block (kb,cb), and C's block (rb,cb). So the three outer loops pick all three blocks at once.
Why isn't this a free lunch?

    for each triplet of blocks specified by rb, cb, kb
      for each element combination ri, ci, ki within the blocks
        r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci;
        C[r,c] += A[r,k]*B[k,c];

Why is this useful? We only work on three blocks at a time, and we can pick the block size so that three blocks fit into cache.
Why isn't it a free lunch?
- Instead of traversing entire rows of the matrix, we only traverse rows of the block → we take less advantage of spatial locality.
- The cache turns over after each block triplet → our compute/memory ratio is really Ni/3 (block size over 3) rather than N/3. Basically, we're not getting as much temporal reuse.
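Putting it together, here's a runnable blocked version in C++. This is a sketch under the assumption that N is a multiple of the block size Ni; the function and variable names are mine:

    #include <vector>
    #include <cstddef>

    // Blocked C += A*B for N x N row-major matrices. Ni is the block size;
    // pick it so three Ni x Ni blocks fit in the target cache (e.g., L2).
    void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, std::size_t N, std::size_t Ni) {
        const std::size_t Nb = N / Ni;                   // blocks per dimension
        for (std::size_t rb = 0; rb < Nb; ++rb)
          for (std::size_t kb = 0; kb < Nb; ++kb)
            for (std::size_t cb = 0; cb < Nb; ++cb)      // pick the three blocks
              for (std::size_t ri = 0; ri < Ni; ++ri)
                for (std::size_t ki = 0; ki < Ni; ++ki) {
                    std::size_t r = rb*Ni + ri, k = kb*Ni + ki;
                    float a = A[r*N + k];                // reused across the ci loop
                    for (std::size_t ci = 0; ci < Ni; ++ci) {
                        std::size_t c = cb*Ni + ci;
                        C[r*N + c] += a * B[k*N + c];    // contiguous in ci
                    }
                }
    }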
Look at blocks another way
(Figure sequence: multiplying two 16x16 matrices, computing C one element at a time: row 4 · column 0, then row 4 · column 1, then row 5 · column 0, and so on until all of block 1,0 of C is done.)
Interpretation: block[1,0] of C = (block-row 1 of A) ∙ (block-column 0 of B). The output block depends only on those input blocks, and we reused the same rows/columns over and over.
Block-matrix and threads
We have a block-matrix algorithm, and we even understand it. It should make good use of caches, and thus be quite fast. Next problems: ordering the inner and outer loops, and multithreading.
Simplifying assumptions:
- Pick the block size so that a block fits into the L2 (or lower) cache.
- Use the L3 (i.e., the ring cache) to get "some" reuse of blocks without going to main memory.
    for rb/kb/cb …           // pick the block
      for ri/ki/ci …         // pick the offset within a block
        C[r,c] += A[r,k]*B[k,c];

Does it matter what order we put the three inner loops? After all, blocks fit into the L2. Probably: it would be nice to get a lot of hits from the L1, to make sure our cores don't wait for memory.
Does it matter what order we put the three outer (block) loops? After all, there's enough computation per block that the initial block-loading time doesn't matter. Perhaps; it depends on a lot of factors, but it's easy enough to pick a good loop ordering, and that lets us reload blocks from the L3 rather than from main memory. What order to use? rb/kb/cb looks good, for the same reasons as before.
How many threads should we use? Start with one thread per block triplet (i.e., one thread per rb/kb/cb triplet); the three inner loops go inside the thread. We can change that if we want to.
    for rb=0…Nb-1
      for kb=0..Nb-1
        for cb=0..Nb-1                      // pick the block
          thread (thread_func, rb, kb, cb); // the inner loops go in the thread

If we have 1Kx1K matrices and 64x64 blocks, how many threads are there? 1K/64 = 16, so Nb = 16, and there are Nb³ = 16³ = 4K threads.
What do you think about the code above? That's way too many threads to launch all at once!
    for rb=0…Nb-1
      for kb=0..Nb-1
        for cb=0..Nb-1
          thread (thread_func, rb, kb, cb);  // pick the block

Another issue: do we need a critical section to avoid multiple threads writing the same data? Yes: all the threads with the same rb and cb try to write the same locations of the output matrix. How many threads are fighting each other? Nb (i.e., all the different values of kb).
If we use a critical section to protect "C[r,c] += A[r,k]*B[k,c]", how will that affect performance? Basically, no two threads can run at the same time any more.
If we instead make each thread keep its own private work area and then merge them, how will that work? Each thread needs an entire block as its own work space, which is potentially a lot of space. And the merging is (#threads)*Ni² flops, which is non-trivial.
Any ideas? What if we only launch a few threads at a time; with 4 cores, maybe 4 at a time?

    do {
      launch 4 threads;                // 4 consecutive rb/kb/cb triplets
      wait for them to finish;
    } until you've gone through every one of the Nb³ triplets (in Nb³/4 groups)

Does it fix the problem of having too many threads at once? Yes.
Do we still need a critical section? Probably not, since we won't have two threads with the same rb and cb (but different kb) running at the same time.
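A sketch of that launch-in-groups idea with C++ threads; the group size, names, and flat array layout are mine. Note the comment on when the no-locking argument holds:

    #include <thread>
    #include <vector>
    #include <cstddef>

    // One thread per block triplet, but only 'group' threads alive at a time
    // (e.g., one per core). With cb as the fastest-varying index, two triplets
    // that write the same C block (same rb,cb, different kb) are Nb apart in
    // launch order, so no lock is needed as long as group <= Nb.
    void matmul_grouped(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, std::size_t N, std::size_t Ni,
                        std::size_t group) {
        const std::size_t Nb = N / Ni;
        std::vector<std::thread> pool;
        for (std::size_t rb = 0; rb < Nb; ++rb)
          for (std::size_t kb = 0; kb < Nb; ++kb)
            for (std::size_t cb = 0; cb < Nb; ++cb) {
                pool.emplace_back([&, rb, kb, cb] {
                    for (std::size_t ri = 0; ri < Ni; ++ri)
                      for (std::size_t ki = 0; ki < Ni; ++ki)
                        for (std::size_t ci = 0; ci < Ni; ++ci) {
                            std::size_t r = rb*Ni+ri, k = kb*Ni+ki, c = cb*Ni+ci;
                            C[r*N + c] += A[r*N + k] * B[k*N + c];
                        }
                });
                if (pool.size() == group) {
                    for (auto& t : pool) t.join();  // wait for the slowest
                    pool.clear();
                }
            }
        for (auto& t : pool) t.join();              // any leftover partial group
    }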
Results on Norbert

    for kb=0…Nb-1
      for rb=0..Nb-1
        for cb=0..Nb-1
          spawn a thread to compute C[r,c] += A[r,k]*B[k,c]
            (for every ri,ci,ki in the three blocks)
          if (we've spawned as many threads as cores)
            wait for them all to finish;

Very slow. Why? With 1Kx1K matrices and 32x32 blocks, Nb = 32. How many threads? 32x32x32 = 32K. Launching a thread takes .01 ms, so that's .3 s total just in launches.
Another problem, too: how many groups of 32 threads are there (there are 32 cores)? 1024. Different threads go at different rates, so we have 1024 iterations of "wait for the slowest thread to finish." Hugely inefficient.
What to do?
What are your thoughts on how to fix this? We launched too many threads, and we kept making the fastest thread wait for the slowest. You get to take your best shot in HW #3, but you can brainstorm together here first.
Brainstorming
Brainstorm ideas and write some metacode on the board. Questions to ask:
- You don't want to launch a million threads. Roughly how many threads do you want to launch?
- Can you divide up the work between threads so that you don't need a critical section?
- You must initialize the output matrix to 0 before threads start summing partial products into it. Can you distribute this work among the threads as well?
SMT
On to the next problem. We've assumed 1 thread/core until now; any problem with that? SMT is often a good thing: it allows one thread to compute while the other thread is stalled. But if we write the 1st thread really well, perhaps it will never stall anyway? Let's try to use hyperthreading: can we spawn 2 threads/core instead of 1?
C++ threads can really suck
Let's say we divvy up the work on a block triplet between two threads in the same core. But wait, we have a big problem: the C++ thread library does not let you assign threads to cores; it lets the O/S do its best. There's no guarantee that our two threads, working on the same block triplet, will be assigned to the same core. In fact, even if we only have 1 thread/core, a stupid enough O/S might put all the threads on one core.
What to do about it
An interesting final project would be to see how well the O/S does, or how much better you can do yourself. There are low-level routines (mostly O/S dependent) that let you find out where threads went and assign threads to cores; see, e.g., pthread_setaffinity_np on Linux or SetThreadAffinityMask on Windows.
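For example, on Linux a std::thread can be pinned to a core through its native handle. A minimal sketch (Linux-specific; error handling omitted, helper name is mine):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <cstdio>

    // Ask the O/S to keep a std::thread on one specific core (Linux only).
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core, &cpuset);
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    }

    int main() {
        std::thread t([] { std::puts("running on a pinned core"); });
        pin_to_core(t, 0);   // pin this thread to core 0
        t.join();
        return 0;
    }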