Mihai Budiu Microsoft Research – Silicon Valley Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University Spatial Computation.

Spatial Computation: Computing without General-Purpose Processors
Mihai Budiu, Microsoft Research – Silicon Valley
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein, Carnegie Mellon University

2 Outline
Intro: Problems of current architectures
Compiling Application-Specific Hardware
ASH Evaluation
Conclusions
[Figure: Performance]

3 Resources
We do not worry about not having hardware resources
We worry about being able to use hardware resources [Intel]

4 Complexity
Cannot rely on global signals (clock is a global signal)
[Figure labels: ALUs; gate 5ps, wire 20ps]

5 Complexity
Cannot rely on global signals (clock is a global signal)
[Figure labels: ALUs; gate 5ps, wire 20ps]
Automatic translation C → HW
Simple, short, unidirectional interconnect
No interpretation
Distributed control, Asynchronous
Simple hw, mostly idle

6 Our Proposal: Application-Specific Hardware
ASH addresses these problems
ASH is not a panacea
ASH complementary to CPU
[Diagram labels: High-ILP computation; Low-ILP computation + OS + VM; CPU; ASH; Memory; $]

7 Paper Content
Automatic translation of C to hardware dataflow machines
High-level comparison of dataflow and superscalar
Circuit-level evaluation: power, performance, area

8 Outline
Problems of current architectures
CASH: Compiling Application-Specific Hardware
ASH Evaluation
Conclusions

9 Application-Specific Hardware
[Flow: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw]

10 Computation = Dataflow
Program: x = a & 7; ... y = x >> 2;
[Diagram: a and 7 feed an & node producing x; x and 2 feed a >> node]
Program        IR              Circuits
Operations     Nodes           Pipeline stages
Variables      Def-use edges   Channels (wires)
No interpretation
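As a rough illustration of the mapping on this slide, the two-statement program can be written as a tiny dataflow graph in which every operation is a node and every def-use edge is a pointer (a wire in hardware). The `struct node`, the `OP_*` names, `eval_node`, and `compute_y` are hypothetical names for this sketch; they are not taken from the CASH compiler, and a software evaluator like this one only mimics what the circuit would compute.

```c
#include <stddef.h>

/* Hypothetical node kinds for this sketch; not the actual CASH IR. */
enum op { OP_INPUT, OP_CONST, OP_AND, OP_SHR };

struct node {
    enum op op;
    int value;               /* payload for OP_INPUT and OP_CONST */
    const struct node *lhs;  /* def-use edges: producers of the operands */
    const struct node *rhs;
};

/* Evaluate a node by pulling values along its incoming edges. */
int eval_node(const struct node *n)
{
    switch (n->op) {
    case OP_INPUT:
    case OP_CONST: return n->value;
    case OP_AND:   return eval_node(n->lhs) & eval_node(n->rhs);
    case OP_SHR:   return eval_node(n->lhs) >> eval_node(n->rhs);
    }
    return 0; /* not reached for well-formed graphs */
}

/* Graph for: x = a & 7; y = x >> 2; */
int compute_y(int a)
{
    struct node in_a = { OP_INPUT, a, NULL, NULL };
    struct node c7   = { OP_CONST, 7, NULL, NULL };
    struct node c2   = { OP_CONST, 2, NULL, NULL };
    struct node x    = { OP_AND, 0, &in_a, &c7 };
    struct node y    = { OP_SHR, 0, &x, &c2 };
    return eval_node(&y);
}
```

For example, `compute_y(13)` evaluates `13 & 7 = 5` and then `5 >> 2 = 1`. The variable `x` never exists as a storage location here: it is only the edge between the two operation nodes, which is the point of the Variables → Def-use edges → Channels row.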

11 Basic Computation = Pipeline Stage
[Diagram labels: +, latch, data, valid, ack]

12 Distributed Control Logic
[Diagram labels: +, -, ack, rdy; global FSM; short, local wires]
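The handshake named on slides 11 and 12 (data, valid/rdy, ack, latch) can be sketched in software as a one-entry buffer with back-pressure. This is a minimal model under stated assumptions, not the circuit used by ASH: `struct stage`, `stage_offer`, and `stage_take` are hypothetical names, and real asynchronous handshakes are signal transitions rather than function calls.

```c
#include <stdbool.h>

/* One pipeline stage: an operation followed by an output latch.
   All names are hypothetical and exist only for this sketch. */
struct stage {
    int  latch;  /* data held for the consumer */
    bool valid;  /* true while the latch holds unconsumed data */
};

/* Producer side: returns true (ack) if the stage accepted the operands.
   A stage whose latch is still full refuses new inputs. */
bool stage_offer(struct stage *s, int a, int b)
{
    if (s->valid)
        return false;   /* no ack: back-pressure on the producer */
    s->latch = a + b;   /* the "+" operation from the slide */
    s->valid = true;    /* signal valid data to the consumer */
    return true;
}

/* Consumer side: returns true and writes *out if data was available.
   Taking the value empties the latch, which acts as the consumer's ack. */
bool stage_take(struct stage *s, int *out)
{
    if (!s->valid)
        return false;
    *out = s->latch;
    s->valid = false;
    return true;
}
```

Each stage decides locally whether it can accept or deliver data, using only the state of its own latch. Nothing in the sketch consults a shared controller, which is the property the slide contrasts with a global FSM.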

13 MUX: Forward Branches
if (x > 0) y = -x; else y = b*x;
[Diagram labels: x, b, 0, y, *, -, >, !]
Conditionals ⇒ Speculation
SSA = no arbitration
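The branch-to-multiplexer idea on this slide can be sketched directly in C: both arms are computed unconditionally and the predicate only selects which result becomes `y`. The function name `select_y` is hypothetical, and the sketch assumes, as in the slide's example, that neither arm has side effects, so evaluating both is safe.

```c
/* Speculative form of: if (x > 0) y = -x; else y = b*x;
   Both arms run regardless of the predicate. */
int select_y(int x, int b)
{
    int neg  = -x;          /* "then" arm, always computed */
    int prod = b * x;       /* "else" arm, always computed */
    int p    = x > 0;       /* predicate */
    return p ? neg : prod;  /* the mux picks exactly one producer for y */
}
```

Because each value has a single definition (SSA), the mux is the only place where the two candidate values for `y` meet, and the predicate resolves that choice without any arbitration between writers.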

14 Memory Access
[Diagram labels: LD, ST, LD; pipelined arbitrated network; Monolithic Memory; local communication; global structures]
Future work: fragment this!

15 Outline
Problems of current architectures
Compiling ASH
ASH Evaluation
Conclusions

16 Evaluating ASH
[Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools) → ASIC; Mem]
Mediabench kernels (1 hot function/benchmark)
180nm std. cell library, 2V (~1999 technology)
ModelSim (Verilog simulation) → performance numbers

17 Compile Time
[Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC; Mem]
[Annotations, in slide order: 20 seconds, 10 seconds, 20 minutes, 1 hour, 200 lines]

18 ASH Area
[Chart; reference marks: P4: 217; minimal RISC core]

19 ASH vs 600MHz CPU [0.18 µm]

20 Bottleneck: Memory Protocol
[Diagram labels: LD, ST, LSQ, Memory]
Enabling dependent operations requires round-trip to memory.
Limit study: round trip zero time ⇒ up to 5x speed-up.
Exploring novel memory access protocols.

21 Power
[Chart; reference marks: DSP 110; µP 4000; Xeon [+cache] 67000]

22 Energy-delay vs. Wattch

23 Energy Efficiency
[Chart: Energy Efficiency [Operations/nJ], logarithmic axis from 0.01 to 1000; categories: Microprocessors, General-purpose DSP, Asynchronous µP, FPGA, Dedicated hardware, ASH media kernels; annotation: 1000x]

24 Outline
Problems of current architectures
Compiling ASH
Evaluation
Related work, Conclusions

25 Related Work
Optimizing compilers
High-level synthesis
Reconfigurable computing
Dataflow machines
Asynchronous circuits
Spatial computation
We target an extreme point in the design space: no interpretation, fully distributed computation and control

26 ASH Design Point
Design an ASIC in a day
Fully automatic synthesis to layout
Fully distributed control and computation (spatial computation)
–Replicate computation to simplify wires
Energy/op rivals custom ASIC
Performance rivals superscalar
E × t 100 times better than any processor

27 Conclusions: Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity

28 Backup Slides
Absolute performance
Control logic
Exceptions
Leniency
Normalized area
Loops
ASH weaknesses
Splitting memory
Recursive calls
Leakage
Why not compare to…
Targeting FPGAs

29 Absolute Performance

30 Pipeline Stage
[Diagram labels: rdy in, ack out, rdy out, ack in, data in, data out, Reg, C, =]

31 Exceptions
Strictly speaking, C has no exceptions
In practice, exceptions are hard to accommodate in hardware implementations
An advantage of software flexibility: PC is single point of execution control
[Diagram labels: High-ILP computation; Low-ILP computation + OS + VM + exceptions; CPU; ASH; Memory; $$$]

32 Critical Paths
if (x > 0) y = -x; else y = b*x;
[Diagram labels: x, b, 0, y, *, -, >, !]

33 Lenient Operations
if (x > 0) y = -x; else y = b*x;
[Diagram labels: x, b, 0, y, *, -, >, !]
Solves the problem of unbalanced paths
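One way to see why leniency helps with unbalanced paths is to compare when a multiplexer's output becomes available. The sketch below assumes a simple timing model that is not stated on the slide: each input has an arrival time, a strict mux waits for all inputs, and a lenient mux waits only for the predicate and the input it selects. The function names and the latencies used in the example are hypothetical.

```c
/* Larger of two arrival times. */
int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: output is ready once the predicate and BOTH data inputs arrive. */
int strict_mux_ready(int t_pred, int t_then, int t_else)
{
    return max2(t_pred, max2(t_then, t_else));
}

/* Lenient mux (assumed model): output is ready once the predicate
   and the SELECTED data input arrive; the other input is ignored. */
int lenient_mux_ready(int pred, int t_pred, int t_then, int t_else)
{
    return max2(t_pred, pred ? t_then : t_else);
}
```

With assumed arrival times of 1 for the comparison, 1 for the negation, and 4 for the multiplication, a strict mux is ready at time 4 whichever arm is chosen. Under the same assumptions a lenient mux is ready at time 1 when `x > 0` selects the negation, so the slow multiplier sits on the critical path only when its result is actually needed.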

34 Normalized Area

35 Control Flow ⇒ Data Flow
[Diagram labels: Merge (label); Gateway; Split (branch); data, predicate, p, !]

36 Loops
int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;
[Diagram labels: i, 0, +1, < 100, *, +, sum, 0, !, ret]
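The loop on this slide can be read as two loop-carried values, `i` and `sum`, that circulate around the graph until the predicate `i < 100` steers them to the exit. The sketch below mimics that reading in plain C; `struct tokens` and `sum_of_squares` are hypothetical names, and the sequential `while` loop only stands in for the merge and steering nodes of the dataflow graph.

```c
/* The loop-carried values, modeled as one pair of circulating tokens. */
struct tokens { int i; int sum; };

int sum_of_squares(int limit)
{
    struct tokens t = { 0, 0 };    /* initial values enter the loop once */
    while (t.i < limit) {          /* predicate: take the back edge or exit */
        struct tokens next;
        next.sum = t.sum + t.i * t.i;  /* the * and + nodes */
        next.i   = t.i + 1;            /* the +1 node */
        t = next;                      /* tokens travel along the back edge */
    }
    return t.sum;                  /* the ret node consumes the final sum */
}
```

Calling `sum_of_squares(100)` reproduces the slide's program and returns 328350, the sum of the squares of 0 through 99. Both `next` fields are computed from the old token values only, which mirrors the fact that the `i` and `sum` updates are independent nodes in the graph.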

37 ASH Weaknesses
Both branch and join not free
Static dataflow (no re-issue of same instr)
Memory is far
Fully static
–No branch prediction
–No dynamic unrolling
–No register renaming
Calls/returns not lenient

38 Branch Prediction
for (i=0; i < N; i++) {... if (exception) break; }
[Diagram labels: i, +, <, 1, &, !, exception; ASH crit path; CPU crit path]
Predicted not taken. Effectively a noop for CPU!
Predicted taken.
Result available before inputs

39 Memory Partitioning
MIT RAW project: Babb FCCM 99, Barua HiPC 00, Lee ASPLOS 00
Stanford SpC: Semeria DAC 01, TVLSI 02
Illinois FlexRAM: Fraguela PPoPP 03
Hand-annotations #pragma

40 Recursion
[Diagram labels: recursive call; save live values; restore live values; stack]

41 Leakage Power
$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
Employ circuit-level techniques
Cut power supply of idle circuit portions
–most of the circuit is idle most of the time
–strong locality of activity

42 Why Not Compare To…
In-order processor
–Worse in all metrics than superscalar, except power
–We beat it in all metrics, including performance
DSP
–We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
ASIC
–No available tool-flow supports C to the same degree
Asynchronous ASIC
–We compared with a Balsa synthesis system
–We are 15 times better in E×t compared to the resulting ASIC
Async processor
–We are 350 times better in E×t than Amulet (scaled to 0.18 µm)

43 Compared to Next Talk
Engine [180nm]   Performance [MIPS]   E/instruction [pJ]
SNAP/LE          28                   24
SNAP/LE          240                  218
ASH              1100                 20

44 Why not target FPGA
Do not support asynchronous circuits
Very inefficient in area, power, delay
Too fine-grained for datapath circuits
We are designing an async FPGA