Mihai Budiu Microsoft Research – Silicon Valley joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University.


1 Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, Microsoft Research – Silicon Valley. Joint work with Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein, Carnegie Mellon University. May 10, 2005.

2 2 Outline: Intro: Problems of current architectures; Compiling Application-Specific Hardware; ASH Evaluation; Conclusions. [Figure: performance chart, tick label 1000]

3 3 Resources: We do not worry about not having hardware resources. We worry about being able to use hardware resources. [Intel]

4 4 Complexity: ALUs. Cannot rely on global signals (the clock is a global signal). [Figure labels: gate 5ps, wire 20ps]

5 5 Complexity: ALUs. Cannot rely on global signals (the clock is a global signal). [Figure labels: gate 5ps, wire 20ps] Automatic translation C → HW. Simple, short, unidirectional interconnect. No interpretation. Distributed control, asynchronous. Simple hw, mostly idle.

6 6 Our Proposal: Application-Specific Hardware. ASH addresses these problems. ASH is not a panacea. ASH is complementary to the CPU. [Figure labels: ASH (high-ILP computation), CPU (low-ILP computation + OS + VM), Memory, $]

7 7 Outline Problems of current architectures CASH: Compiling Application-Specific Hardware ASH Evaluation Conclusions

8 8 Application-Specific Hardware: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw

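9 9 Computation = Dataflow. Program: x = a & 7; ... y = x >> 2; IR: node & (inputs a, 7) feeding node >> (inputs x, 2). Circuits: &7 feeding >>2. No interpretation. Correspondence (Program / IR / Circuits): Operations / Nodes / Pipeline stages; Variables / Def-use edges / Channels (wires).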
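
The program-to-IR mapping on this slide can be sketched in C. This is only an illustration of the graph structure: the `Node` type, the `NodeKind` names, and the `eval` and `run_example` functions are hypothetical and are not CASH's actual IR. A software evaluator also walks the graph at run time, which is exactly the interpretation that ASH avoids by turning each node into a circuit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node kinds for a tiny dataflow IR (not CASH's real IR). */
typedef enum { NODE_INPUT, NODE_CONST, NODE_AND, NODE_SHR } NodeKind;

typedef struct Node {
    NodeKind kind;
    uint32_t value;            /* payload for NODE_INPUT and NODE_CONST */
    const struct Node *lhs;    /* def-use edge to the first operand */
    const struct Node *rhs;    /* def-use edge to the second operand */
} Node;

/* Evaluate a node by pulling values along its def-use edges. */
static uint32_t eval(const Node *n) {
    assert(n != 0);
    switch (n->kind) {
    case NODE_INPUT:
    case NODE_CONST: return n->value;
    case NODE_AND:   return eval(n->lhs) & eval(n->rhs);
    case NODE_SHR:   return eval(n->lhs) >> eval(n->rhs);
    }
    return 0; /* not reached for valid node kinds */
}

/* Build the graph for "x = a & 7; y = x >> 2;" and return y. */
static uint32_t run_example(uint32_t a_value) {
    Node a     = { NODE_INPUT, a_value, 0, 0 };
    Node seven = { NODE_CONST, 7, 0, 0 };
    Node two   = { NODE_CONST, 2, 0, 0 };
    Node x     = { NODE_AND, 0, &a, &seven };   /* x = a & 7  */
    Node y     = { NODE_SHR, 0, &x, &two };     /* y = x >> 2 */
    return eval(&y);
}
```

Each variable in the source becomes an edge between two nodes, so `x` exists only as the pointer from the shift node to the AND node.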

10 10 Basic Computation= Pipeline Stage data valid ack latch +
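
A rough software model of one such stage, assuming a single latch and an adder as the operation. The `Stage` type and the `stage_offer` and `stage_take` functions are hypothetical names; the return values stand in for the ack and valid signals on the slide, and real asynchronous handshakes are signal transitions rather than function calls.

```c
#include <assert.h>
#include <stdbool.h>

/* One pipeline stage: an operation (here "+") followed by a latch. */
typedef struct {
    bool valid;  /* true while the latch holds a result not yet consumed */
    int  latch;  /* latched output data */
} Stage;

/* Producer side: offer two operands. Returns true (ack) only if the
 * latch was free, in which case the sum is computed and latched. */
static bool stage_offer(Stage *s, int a, int b) {
    assert(s != 0);
    if (s->valid) return false;   /* latch still occupied: no ack */
    s->latch = a + b;
    s->valid = true;
    return true;
}

/* Consumer side: take the result if one is valid. Taking it frees the
 * latch so the stage can acknowledge the next input. */
static bool stage_take(Stage *s, int *data_out) {
    assert(s != 0 && data_out != 0);
    if (!s->valid) return false;
    *data_out = s->latch;
    s->valid = false;
    return true;
}
```

The stage needs no clock: it makes progress whenever its producer has data and its consumer has drained the latch.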

11 11 Asynchronous Computation. [Figure: adder stages with data, valid, and ack signals and a latch; handshake animation steps 1 through 8]

12 12 Distributed Control Logic +- ack rdy global FSM short, local wires

13 13 MUX: Forward Branches. if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, !, feeding a mux that produces y; critical path marked] Conditionals ⇒ Speculation. SSA = no arbitration.
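
The conversion of a forward branch into speculation plus a multiplexer can be sketched in C as follows. The function names `mux2`, `branchy`, and `speculative` are made up for this illustration. Speculation is safe here only because neither arm has side effects.

```c
#include <assert.h>

/* 2-to-1 multiplexer: selects one of two already-computed values. */
static int mux2(int pred, int if_true, int if_false) {
    return pred ? if_true : if_false;
}

/* Source form from the slide: only one arm executes. */
static int branchy(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Speculative form: both arms are computed unconditionally (in hardware
 * they run in parallel) and the predicate only selects the result. */
static int speculative(int x, int b) {
    int pred = x > 0;   /* comparison node */
    int neg  = -x;      /* "then" arm      */
    int prod = b * x;   /* "else" arm      */
    return mux2(pred, neg, prod);
}
```

Because each arm writes its own value and the mux produces the single definition of `y`, no arbitration between writers is needed.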

14 14 Control Flow ⇒ Data Flow. Node types: Merge (label), Gateway, Split (branch). [Figure labels: data, predicate, p, !]

15 15 Loops: int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: dataflow graph with nodes i, +1, < 100, *, +, sum, initial values 0, !, and ret]
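
The loop on this slide, written once as plain C and once in a form closer to the dataflow graph, where the loop-carried values `i` and `sum` are explicit state and the predicate decides whether the back edge is taken. `LoopState`, `loop_step`, and `sum_squares_dataflow` are hypothetical names used only for this sketch.

```c
#include <assert.h>

/* The loop from the slide, with the bound made a parameter. */
static int sum_squares(int limit) {
    int sum = 0, i;
    for (i = 0; i < limit; i++) sum += i * i;
    return sum;
}

/* Loop-carried values: one per cycle in the dataflow graph. */
typedef struct { int i; int sum; } LoopState;

/* One trip around the loop. Returns the predicate: 1 if the body ran
 * and the back edge is taken, 0 if the loop exits. */
static int loop_step(LoopState *s, int limit) {
    assert(s != 0);
    int pred = s->i < limit;       /* the "<" node */
    if (pred) {
        s->sum += s->i * s->i;     /* "*" feeding "+" into sum */
        s->i += 1;                 /* the "+1" node */
    }
    return pred;
}

static int sum_squares_dataflow(int limit) {
    LoopState s = { 0, 0 };
    while (loop_step(&s, limit)) { /* keep taking the back edge */ }
    return s.sum;                  /* the "ret" node */
}
```

For the slide's bound of 100, both versions return 328350.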

16 16 Pipelining: int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; Pipelined multiplier (8 stages). [Figure: loop dataflow graph with nodes i, + 1, <= 100, *, +, sum; animation step 1]

17 17 Pipelining. [Figure: same graph, animation step 2]

18 18 Pipelining. [Figure: same graph, animation step 3]

19 19 Pipelining. [Figure: same graph, animation step 4]

20 20 Pipelining. [Figure: same graph, animation step 5; tokens i=1 and i=0 in flight]

21 21 Pipelining. [Figure: same graph, animation step 6; tokens i=1 and i=0 in flight]

22 22 Pipelining: i's loop; sum's loop; long-latency pipe; predicate. [Figure: loop dataflow graph, animation step 7]

23 23 Pipelining: The predicate ack edge is on the critical path. [Figure: loop dataflow graph with i's loop, sum's loop, and the critical path marked]

24 24 Pipeline balancing: decoupling FIFO. [Figure: loop dataflow graph with i's loop and sum's loop, animation step 7]

25 25 Pipeline balancing: decoupling FIFO. [Figure: loop dataflow graph with i's loop, sum's loop, and the critical path marked]

26 26 Procedures Caller Callee Call Argument Return Continuation

27 27 Memory Access. [Figure: LD, ST, LD operations connected through a pipelined arbitrated network to a Monolithic Memory] Local communication, global structures. Future work: fragment this!

28 28 Outline Problems of current architectures Compiling ASH ASH Evaluation Conclusions

29 29 Evaluating ASH. Toolflow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools) → ASIC. 180nm std. cell library, 2V (~1999 technology). Mediabench kernels (1 hot function/benchmark). ModelSim (Verilog simulation) → performance numbers. [Figure label: Mem]

30 30 Compile Time. Toolflow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC. Figures shown on the slide: 200 lines; 20 seconds; 10 seconds; 20 minutes; 1 hour. [Figure label: Mem]

31 31 ASH Area (mm²). Reference points: P4: 217; minimal RISC core. [Figure: area chart]

32 32 ASH vs 600MHz CPU [4-wide OOO, 0.18 μm]

33 33 Bottleneck: Memory Protocol. Enabling dependent operations requires a round-trip to memory. Exploring novel memory access protocols. [Figure labels: LD, ST, LSQ, Memory]

34 34 Power (mW). Reference points: DSP: 110; μP: 4000; Xeon [+cache]: 67000.

35 35 Energy-delay

36 36 Energy Efficiency (op/nJ)

37 37 Energy Efficiency [Operations/nJ]. [Figure: log-scale axis from 0.01 to 1000; categories: general-purpose DSP, dedicated hardware, ASH media kernels, FPGA, microprocessors, asynchronous μP; annotation: 1000x]

38 38 Outline Problems of current architectures Compiling ASH Evaluation Related work, Conclusions

39 39 Bibliography: "Dataflow: A Complement to Superscalar", Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein, ISPASS 2005. "Spatial Computation", Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein, ASPLOS 2004. "C to Asynchronous Dataflow Circuits: An End-to-End Toolflow", Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein, IWLS 2004. "Optimizing Memory Accesses For Spatial Computation", Mihai Budiu and Seth Copen Goldstein, CGO 2003. "Compiling Application-Specific Hardware", Mihai Budiu and Seth Copen Goldstein, FPL 2002.

40 40 Related Work: Optimizing compilers; high-level synthesis; reconfigurable computing; dataflow machines; asynchronous circuits; spatial computation. We target an extreme point in the design space: no interpretation, fully distributed computation and control.

41 41 ASH Design Point: Design an ASIC in a day. Fully automatic synthesis to layout. Fully distributed control and computation (spatial computation) – Replicate computation to simplify wires. Energy/op rivals custom ASIC. Performance rivals superscalar. E × t 100 times better than any processor.
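
The energy-delay metric used on this slide is the product of the energy consumed and the execution time, so a lower value is better. A minimal sketch of the comparison follows; the function names are hypothetical, and any inputs passed to them are placeholders rather than measurements from the talk.

```c
#include <assert.h>

/* Energy-delay product: energy (joules) times execution time (seconds). */
static double energy_delay(double energy_j, double time_s) {
    assert(energy_j >= 0.0 && time_s >= 0.0);
    return energy_j * time_s;
}

/* How many times smaller design A's energy-delay product is than
 * design B's. A result of 100.0 means "100 times better". */
static double energy_delay_advantage(double e_a, double t_a,
                                     double e_b, double t_b) {
    double a = energy_delay(e_a, t_a);
    assert(a > 0.0);   /* avoid dividing by zero */
    return energy_delay(e_b, t_b) / a;
}
```

A design can therefore win on this metric while running slower, as long as its energy saving outweighs the extra time.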

42 42 Conclusions: Spatial computation strengths (Feature: Advantages). No interpretation: energy efficiency, speed. Spatial layout: short wires, no contention. Asynchronous: low power, scalable. Distributed: no global signals. Automatic compilation: designer productivity.

43 43 Backup Slides: Absolute performance; Control logic; Exceptions; Leniency; Normalized area; ASH weaknesses; Splitting memory; Recursive calls; Leakage; Why not compare to…; Targeting FPGAs

44 44 Absolute Performance. [Figure: chart with CPU range marked]

45 Pipeline Stage. [Figure: stage circuit with elements labeled Reg, C, and =; signals rdy in, ack out, rdy out, ack in, data in, data out]

46 46 Exceptions: Strictly speaking, C has no exceptions. In practice, it is hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: the PC is a single point of execution control. [Figure labels: ASH (high-ILP computation), CPU (low-ILP computation + OS + VM + exceptions), Memory, $$$]

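47 47 Critical Paths: if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, !, producing y]

48 48 Lenient Operations: if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, !, producing y] Solves the problem of unbalanced paths.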

49 49 Normalized Area

50 50 ASH Weaknesses: Both branch and join not free. Static dataflow (no re-issue of same instr). Memory is far. Fully static: no branch prediction, no dynamic unrolling, no register renaming. Calls/returns not lenient.

51 51 Branch Prediction: for (i=0; i < N; i++) { ... if (exception) break; } Annotations: Predicted not taken: effectively a no-op for the CPU! Predicted taken. Result available before inputs. [Figure: dataflow graph with nodes i, + 1, <, &, !, exception; ASH critical path and CPU critical path marked]

52 52 Memory Partitioning: MIT RAW project: Babb FCCM 99, Barua HiPC 00, Lee ASPLOS 00. Stanford SpC: Semeria DAC 01, TVLSI 02. Illinois FlexRAM: Fraguela PPoPP 03. Hand annotations: #pragma.

53 53 Recursion. [Figure labels: recursive call, save live values, restore live values, stack]
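
The idea of saving live values on a stack around a recursive call can be illustrated in software with an explicit stack. This is a sketch of the general technique, not the circuit that CASH generates; `factorial_explicit_stack` and `STACK_CAPACITY` are names invented for the example.

```c
#include <assert.h>

#define STACK_CAPACITY 64

/* Factorial with the recursion replaced by an explicit stack.
 * The only value live across the recursive call is n. */
static unsigned long factorial_explicit_stack(unsigned n) {
    unsigned stack[STACK_CAPACITY];
    int top = 0;
    unsigned long result = 1;

    assert(n < STACK_CAPACITY);   /* keep the pushes within the array */

    /* "Call" phase: before each recursive call, save the live value. */
    while (n > 1) {
        stack[top++] = n;
        n -= 1;
    }

    /* "Return" phase: after each return, restore the saved value and
     * finish the computation that was waiting on the callee. */
    while (top > 0) {
        unsigned saved = stack[--top];
        result *= saved;
    }
    return result;   /* wraps around for large n (unsigned arithmetic) */
}
```

Only values that are live across the call need to be saved, which is why the analysis of live values matters for the cost of recursion.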

54 54 Leakage Power: P_s = k · Area · e^(-V_T). Employ circuit-level techniques. Cut the power supply of idle circuit portions: most of the circuit is idle most of the time; strong locality of activity.

55 55 Why Not Compare To… In-order processor: worse in all metrics than superscalar, except power; we beat it in all metrics, including performance. DSP: we expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels). ASIC: no available tool-flow supports C to the same degree. Asynchronous ASIC: we compared with a Balsa synthesis system; we are 15 times better in E × t compared to the resulting ASIC. Async processor: we are 350 times better in E × t than Amulet (scaled to 0.18 μm).

56 56 Why not target FPGA: Do not support asynchronous circuits. Very inefficient in area, power, delay. Too fine-grained for datapath circuits. We are designing an async FPGA.

