MEMOCode 2007 Design Contest – MIT Submission N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan.


Resources
Five insufficiently busy grad students
Three weeks
  –Nine man weeks used
Bluespec expertise
  –Easy parameterization/Fast Concurrency
The promise of food

Basic Facts
Matrix Multiply is embarrassingly parallel
  –More multipliers and adders should help
Matrices are too large to be stored in FPGA memory
Time was short, design needed to be partitioned to make use of all designers
  –Latency insensitive methodology

Outline
The Problem
Partitioning the Computation
Architectural Overview
Implementation
Results
Things We Wish We Could Do

The Standard N^3 Algorithm
    for(int i=0; i < N; i++)
      for(int j=0; j < N; j++)
        for(int k=0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];

and blocking is well understood…
Split: each loop becomes a block loop (step K) and an offset loop within the block
    for(int ib = 0; ib < N; ib+=K)
      for(int io = 0; io < K; io++)
        for(int jb = 0; jb < N; jb+=K)
          for(int jo = 0; jo < K; jo++)
            for(int kb = 0; kb < N; kb+=K)
              for(int k = 0; k < K; k++)
                c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];
Swap: move the block loops outermost (reduces memory traffic)
    for(int ib = 0; ib < N; ib+=K)
      for(int jb = 0; jb < N; jb+=K)
        for(int kb = 0; kb < N; kb+=K)
          /* Kernel: one K x K block product */
          for(int io = 0; io < K; io++)
            for(int jo = 0; jo < K; jo++)
              for(int k = 0; k < K; k++)
                c[ib+io][jb+jo] += a[ib+io][kb+k] * b[kb+k][jb+jo];
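As a check on the split-and-swap transformation, the sketch below compares a blocked multiply against the plain triple loop. It is an illustration only, not the contest code: the function names, the flat row-major layout, the `int` element type, and the requirement that the block size divide the matrix size are all assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Reference triple loop: c += a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: the three outer loops select one bs x bs block of each
   matrix; the three inner loops are the kernel that works on those blocks.
   Assumption of this sketch: bs divides n. */
void matmul_blocked(size_t n, size_t bs, const int *a, const int *b, int *c)
{
    assert(bs > 0 && n % bs == 0);
    for (size_t ib = 0; ib < n; ib += bs)
        for (size_t jb = 0; jb < n; jb += bs)
            for (size_t kb = 0; kb < n; kb += bs)
                /* kernel: c block (ib, jb) += a block (ib, kb) * b block (kb, jb) */
                for (size_t io = 0; io < bs; io++)
                    for (size_t jo = 0; jo < bs; jo++)
                        for (size_t ko = 0; ko < bs; ko++)
                            c[(ib + io) * n + (jb + jo)] +=
                                a[(ib + io) * n + (kb + ko)] *
                                b[(kb + ko) * n + (jb + jo)];
}
```

Calling both functions on the same inputs with zero-initialized outputs should give identical results, since blocking only reorders the same multiply-adds.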

Outline
The Problem
Partitioning the Computation
Architectural Overview
Implementation
Results
Things We Wish We Could Do

Hardware Facts
If we accelerate the computation, DRAM access becomes the bottleneck
CPU has slow access to DRAM
  –HW can directly access DRAM via PLB (Processor Local Bus)

Hardware Facts
CPU to HW memory bandwidth bound at 150MB/sec
  –Software overhead in data orchestration, probably only 50% of this bandwidth can be used
Memory Bus supports 800MB/sec
  –Direct interface can provide up to a 5x improvement over software transfer
Special hardware may not be complicated because memory access patterns are simple
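The bandwidth figures on this slide can be turned into a ratio directly. The helper below is a sketch with a hypothetical name; it only divides the memory-bus figure by the usable share of the CPU-path figure.

```c
#include <assert.h>

/* Ratio of the direct DRAM path to the CPU-mediated path.
   usable_fraction is the share of the CPU path's peak bandwidth that
   software can actually use (1.0 = all of it). */
double direct_speedup(double bus_mb_s, double cpu_mb_s, double usable_fraction)
{
    assert(cpu_mb_s > 0.0 && usable_fraction > 0.0 && usable_fraction <= 1.0);
    return bus_mb_s / (cpu_mb_s * usable_fraction);
}
```

With the slide's numbers, `direct_speedup(800.0, 150.0, 1.0)` is about 5.3, which lines up with the "up to 5x" figure. If only half of the 150MB/sec is usable, as the slide estimates, the same ratio is about 10.7.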

High Level Architecture
[Block diagram: three Func Units, CPU, PLB, DRAM, Interconnection Logic]

Architecture
[Block diagram: three Func Units, Controller, Feeder, CPU, PLB, Switch, PLB Master, DRAM]

Software Example (C = A x B)
[Block diagram: three Func Units, Controller, Feeder, CPU, PLB, Switch, PLB Master, DRAM holding A, B, and C; instruction labels: Ld A 0, Ld B 0, St C 0, MAC 0]
In reality – the execution of several blocks will be overlapped

Outline
The Problem
Partitioning the Computation
Architectural Overview
Implementation
Results
Things We Wish We Could Do

Functional Unit - Design
Instructions:
  –Load operand (memory)
  –Store operand (memory)
  –Zero (C = 0)
  –Multiply-Add-Accumulate (C += A*B)
Two FSMs (Read/Write and Compute)
  –Allows overlapping of Instructions

Functional Unit – Algorithm
Take algo & unroll P loop iterations
Adder Tree of P
  –Crit. path grows logarithmically
Can pipeline
  –Complicated because of parameterization
    for(int i = 0; i < K; i++)
      for(int j = 0; j < K; j++)
        for(int k = 0; k < K; k++)
          c[i][j] += a[i][k] * b[k][j];
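A software model may make the adder-tree point concrete. In the sketch below the innermost loop is unrolled by `P`, the `P` products are formed independently, and a balanced tree sums them in log2(P) levels. The names, the unroll factor of 4, the `int` element type, and the assumption that the column of `b` is available as a contiguous array are all choices made for this illustration, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define P 4  /* hypothetical unroll factor; a power of two in this sketch */

/* Sum P values with a balanced tree, in place. Each pass of the outer loop
   is one tree level, so the level count (critical path) is log2(P). */
int adder_tree(int v[P], int *levels)
{
    *levels = 0;
    for (size_t stride = 1; stride < P; stride *= 2) {
        for (size_t i = 0; i + stride < P; i += 2 * stride)
            v[i] += v[i + stride];
        (*levels)++;
    }
    return v[0];
}

/* One output element: c_ij += sum over k of a_row[k] * b_col[k],
   consuming P terms per step. Assumption: P divides k_len. */
int dot_unrolled(size_t k_len, const int *a_row, const int *b_col, int c_ij)
{
    assert(k_len % P == 0);
    for (size_t k = 0; k < k_len; k += P) {
        int prod[P], levels;
        for (size_t p = 0; p < P; p++)      /* P multipliers side by side */
            prod[p] = a_row[k + p] * b_col[k + p];
        c_ij += adder_tree(prod, &levels);  /* tree of P-1 adders, then accumulate */
    }
    return c_ij;
}
```

Doubling `P` adds one level to the tree, which is the logarithmic growth in critical path that the slide mentions.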

Functional Unit – Algorithm
Different algorithm
  –reorder multiplies
  –writes c[i][j] multiple times
Unroll by P
  –same # of adders and multipliers
  –shorter critical path
Pipelining is easy
  –2 stages
    for(int i = 0; i < K; i++)
      for(int j = 0; j < K; j++)
        for(int k = 0; k < K; k++)
          c[j][k] += a[i][k] * b[j][i];
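The effect of the reordering can be sketched in software: once the reduction index is the outermost loop, every update inside it targets a different element of c, so `P` updates need `P` independent multiply-add units and no tree between them. The sketch below uses its own index naming (reduction index `k` outermost), which differs from the slide's, and the function names, block size, unroll factor, and `int` element type are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define KB 4  /* hypothetical block size */
#define P  2  /* hypothetical unroll factor; must divide KB */

/* Standard order: k is innermost, so all KB products of one (i, j) pair
   are summed into the same element of c. */
void kernel_standard(const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < KB; i++)
        for (size_t j = 0; j < KB; j++)
            for (size_t k = 0; k < KB; k++)
                c[i * KB + j] += a[i * KB + k] * b[k * KB + j];
}

/* Reordered: k is outermost. For a fixed k the P updates in the innermost
   loop touch P distinct elements of c, so each is a single multiply-add.
   Every element of c is written KB times in total. */
void kernel_reordered(const int *a, const int *b, int *c)
{
    assert(KB % P == 0);
    for (size_t k = 0; k < KB; k++)
        for (size_t i = 0; i < KB; i++)
            for (size_t j = 0; j < KB; j += P)
                for (size_t p = 0; p < P; p++)  /* P independent MAC units */
                    c[i * KB + j + p] += a[i * KB + k] * b[k * KB + j + p];
}
```

Both kernels perform the same set of multiply-adds, so on zero-initialized outputs they should agree element for element. The cost of the reordering is the repeated writes to c.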

FU Microarchitecture

Memory Bus Master (PLB)
32-bit bus interface
16-word burst transfers
  –Amortize bus setup costs
DRAM may refresh during transfer
  –Added burst buffer for rapid recovery
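A simple model shows why longer bursts amortize the setup cost. The sketch assumes a fixed number of setup cycles per transfer and one word per cycle once the burst is running. Both are simplifications made here, and the 4-cycle setup used in the note below is a hypothetical value, not a measured PLB figure.

```c
#include <assert.h>

/* Fraction of bus cycles that carry data for one burst, assuming a fixed
   setup cost and one word per cycle afterwards (assumptions of this sketch). */
double burst_efficiency(unsigned burst_words, unsigned setup_cycles)
{
    assert(burst_words > 0);
    return (double)burst_words / (double)(burst_words + setup_cycles);
}
```

Under the hypothetical 4-cycle setup, a single-word transfer uses 20% of the cycles for data, while a 16-word burst uses 80%.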

Memory Bus Master (PLB)
Half of critical path through bus arbiter
  –Beyond our control
Substantial retiming needed
  –Register pushing
  –State decoupling
Need fine-grained control over scheduling

Outline
The Problem
Partitioning the Computation
Architectural Overview
Implementation
Results
Things We Wish We Could Do

Design Parameters
Architecture: Number of functional units
Functional Unit: degree of parallelism, matrix size
Memory Bus (PLB) Master: matrix memory layout, matrix size
Switch: Number of functional units
Algorithm Generator: Block size

Final Results
100MHz
1 Functional Unit
  –64^2 subblocks – 8 Complex Multiplies
Lines of code – 10K total
  –Unit Testing Framework – 1.5K
  –C Code – 2K
  –BSV – 5.5K
  –Multiple FU implementations 1K
  –Additional Unused Hardware 1K
More than 3 GOps/Sec

Performance
Size      Time (µs)
64^2      799
128^2     5120
256^2     45300
512^2     332000
1024^2    2710000
125x
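The table can be converted to a throughput figure under one stated assumption: that each complex multiply-accumulate is counted as 8 real operations (4 multiplies and 4 adds). The slides do not spell out their operation count, so the constant and the function name below are assumptions of this sketch.

```c
#include <assert.h>

/* Assumption of this sketch: one complex multiply-accumulate counts as
   8 real operations (4 multiplies + 4 adds). */
#define OPS_PER_COMPLEX_MAC 8.0

/* Throughput in GOps/sec for an n x n multiply (n^3 complex MACs)
   that took time_us microseconds. */
double gops_per_sec(double n, double time_us)
{
    assert(time_us > 0.0);
    double ops = OPS_PER_COMPLEX_MAC * n * n * n;
    return ops / (time_us * 1e-6) / 1e9;
}
```

With that counting, the 1024^2 row works out to about 3.2 GOps/sec and the 64^2 row to about 2.6 GOps/sec, which is in line with the "More than 3 GOps/Sec" figure for the larger sizes.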

Things we would have done with more time
We believe we could have obtained 10 billion ops per second
32-bit PLB -> 64-bit PLB
  –Double memory bandwidth, fairly simple improvement
Multiple Clock Domains
  –implemented, but had trouble synthesizing in EDK
Play with # of FUs / registers per FU
  –HW parameterized for this
Explore alternative machine organization
Algorithmic Exploration

Fin
