
1 Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu (mihaib@cs.cmu.edu), Carnegie Mellon University. Presented May 17, 2004.

2 Spatial Computation: a computation model based on application-specific hardware, no interpretation, and minimal resource sharing.

3 Research Scope. Object: future architectures. Tool: compilers. Evaluation: simulators.

4 Three Spatial Computation Projects: virtual reconfigurable hardware, nanoFabrics, and Application-Specific Hardware (ASH): a C compiler targeting reconfigurable/custom hardware. Publications: [FPGA 99] [ISCA 99] [IEEE Computer 00] [Euro-Par 00] [2 licenses] [ISCA 01] [ASAP 03] [Chapter 03] [FCCM 01] [FPL 02] [FPL 02a] [CGO 03] [IWLS 04] [MSP 04] [3 submitted].

5 Main Results of My Research (1). Developed the DIL compiler: completely replaces the CAD tool-chain; 700 times faster than commercial tools; new optimizations (BitValue, place-and-route); streaming kernels execute 20-300 times faster than on a µP.

6 Main Results of My Research (2): nanoFabrics. Identified strengths & limitations of nanodevices. Proposed a new reconfigurable architecture & HLL → HW compilation for spatial computation. Studied first-order properties of spatial computation.

7 Main Results of My Research (3): Application-Specific Hardware (ASH), a compiler-synthesized architecture. Fast prototyping: automatic from ANSI C → HW. High performance: sustained > 0.8 GOPS [180nm]. Low power: energy/op 100-1000× better than a µP.

8 8 Related Work Nanotechnology Dataflow machines High-level synthesis Reconfigurable computing Computer architecture Embedded systems Asynchronous circuits Compilation

9 Outline. Research overview; problems of current architectures; compiling Application-Specific Hardware; ASH evaluation; new compiler optimizations; conclusions.

10 Resources. We do not worry about lacking hardware resources; we worry about being able to use them. [Intel]

11 Complexity. We cannot rely on global signals (the clock is a global signal). Delay: wire 10ps, gate 5ps.

12 Instruction-Set Architecture. The ISA, the interface between software and hardware, is very rigid to change (e.g. x86 vs. Itanium).

13 Our Proposal. ASH addresses these problems, but is not a panacea: ASH is "complementary" to the CPU. The CPU handles low-ILP computation + OS + VM; ASH handles high-ILP computation; both share the memory hierarchy.

14 What's New? Investigate a new computational model: the source is full ANSI C, the result is an asynchronous circuit. Build spatial dataflow hardware: no resource limitations. New compiler algorithms. End-to-end results: C to structural Verilog in seconds, high performance, excellent power efficiency.

15 Outline. Research overview; problems of current architectures; CASH: Compiling ASH (program representation, compiling C programs); ASH evaluation; new compiler optimizations; conclusions.

16 Application-Specific Hardware. C program → Compiler → Dataflow IR → HW backend → reconfigurable/custom hardware; the IR sits at the SW/HW boundary where the ISA used to be.

17 Application-Specific Hardware. The same flow retargets software: C program → Compiler → Dataflow IR → SW backend [predication] → CPU.

18 Key: Intermediate Representation. Traditionally: a CFG with SSA, def-use chains, and may-dependences. Our dataflow IR: SSA + predication + speculation; uniform for scalars and memory; explicitly encodes may-depend edges; executable; precise semantics; close to the asynchronous target.

19 Computation = Dataflow. Operations ⇒ functional units; variables ⇒ wires; no interpretation. Programs become circuits: x = a & 7; ... y = x >> 2;

20 Basic Computation: a functional unit (here an adder) with data, valid, and ack signals and an output latch.

21 Asynchronous Computation: data values flow through the adders under the data/valid/ack handshake, latched stage by stage.

22 Distributed Control Logic: asynchronous control with short, local wires (rdy/ack handshakes) instead of a global FSM.

23 Outline. Research overview; problems of current architectures; CASH: Compiling ASH (program representation, compiling C programs); ASH evaluation; new compiler optimizations; conclusions.

24 MUX: Forward Branches. if (x > 0) y = -x; else y = b*x; Conditionals ⇒ speculation: both arms compute and a multiplexer selects y. SSA = no arbitration. The multiply sits on the critical path.
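The speculation on this slide can be mimicked in plain C. This is my own hedged illustration (the function name is hypothetical, not compiler output): both branch arms are computed unconditionally, and the predicate drives a final select, just as the MUX does in the circuit.

```c
#include <assert.h>

/* Sketch of speculation: both branch arms execute unconditionally,
   and the predicate x > 0 acts as the MUX select. */
static int mux_speculated(int x, int b) {
    int then_arm = -x;      /* computed even when not selected */
    int else_arm = b * x;   /* computed even when not selected */
    return (x > 0) ? then_arm : else_arm;
}
```

Because both arms always run, there is no control-flow arbitration, matching the "SSA = no arbitration" point on the slide.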

25 Control Flow ⇒ Data Flow. Merge (label): data steered by a predicate. Split (branch): data routed by predicate p.

26 Loops. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; The loop becomes a dataflow cycle (i: +1 and < 100; sum: * and +) feeding a return; it can be pipelined.
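As a sanity check of the loop on this slide, here it is run sequentially on a host (a sketch; in ASH the same loop becomes a pipelined dataflow cycle):

```c
#include <assert.h>

/* The slide's loop: sum of i*i for i = 0..99. */
static int sum_of_squares(void) {
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;   /* 99*100*199/6 = 328350 */
}
```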

27 Predication and Side-Effects. A Load takes an address, a predicate, and a token, and produces data and a token to memory. No speculation: tokens sequence side-effects.

28 Memory Access. Loads (LD) and stores (ST) reach a monolithic memory through a pipelined, arbitrated network: local communication vs. global structures. Future work: fragment this!

29 CASH Optimizations. SSA-based: unreachable/dead code, GCSE, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining. Memory: dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling. Boolean: Espresso CAD tool, bitwidth analysis.

30 Outline. Research overview; problems of current architectures; compiling ASH; evaluation: CASH vs. clocked designs; new compiler optimizations; conclusions.

31 Evaluating ASH. Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC. 180nm std. cell library, 2V, ~1999 technology. Benchmarks: Mediabench kernels (1 hot function per benchmark). Performance numbers from ModelSim (Verilog simulation).

32 ASH Area. (Chart: normalized area; the P4 is 217; a minimal RISC core shown for comparison.)

33 ASH vs 600MHz CPU [.18 µm]

34 Bottleneck: Memory Protocol. Token release to dependents requires a round-trip to memory (through the LSQ). Limit study: a zero-time round trip ⇒ up to 6x speed-up. Exploring a protocol for in-order data delivery & fast token release.

35 Power: DSP 110, µP 4000, Xeon [+cache] 67000.

36 Energy Efficiency [Operations/nJ] (log scale, 0.01 to 1000): microprocessors, asynchronous µP, general-purpose DSP, dedicated hardware, ASH media kernels. ASH is ~1000x more efficient than microprocessors.

37 Outline. Research overview; nanotechnology and architecture; compiling ASH; ASH evaluation; new compiler optimizations (BitValue dataflow analysis, optimizing memory accesses, SIDE: static instantiation, dynamic evaluation); conclusions.

38 Detecting Constant Bits. b = a >> 4; the top four bits of b are constant 0000.

39 Detecting Useless Bits. b = a >> 4; the low four bits of a are don't-care bits (XXXX): they never reach b.

40 BitValue Dataflow Analysis. A dataflow analysis on a lattice of bit values {U, 0, 1, X}; in practice vectors of ≤ 32 bits. Forward ⇒ generalizes constant propagation; backward ⇒ generalizes dead-code elimination; the transfer functions are non-trivial. Example: b = a >> 4 gives b = 0000… (constant bits) and marks the low bits of a as X (don't care).
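A minimal sketch of the forward direction, under a simplification of my own: track only which bits are known to be zero (the real BitValue lattice also carries 1, X, and U). The transfer function for a logical right shift, written by hand for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified forward transfer function for b = a >> k (logical shift,
   0 < k < 32): a bit of b is known zero if the corresponding bit of a
   was known zero, and the k bits shifted in at the top are always zero. */
static uint32_t known_zero_after_shr(uint32_t known_zero_a, unsigned k) {
    uint32_t surviving  = known_zero_a >> k;    /* known-zero status moves down */
    uint32_t shifted_in = ~(UINT32_MAX >> k);   /* top k bits become known zero */
    return surviving | shifted_in;
}
```

Even with nothing known about the input (mask 0), the analysis discovers that the top four bits of a >> 4 are constant zero, matching the slide.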

41 BitValue on C Programs. (Chart: % of useless bits in int arithmetic for Mediabench, SpecInt95, and SpecInt2K; around 27%.)

42 Outline [...]. New compiler optimizations: BitValue dataflow analysis; memory access optimization; Static Instantiation, Dynamic Evaluation; conclusions.

43 Meaning of Token Edges. [The token graph is maintained transitively reduced.] A direct edge between *p=… and …=*q means maybe-dependent with no intervening memory operation; no path means independent.

44 Dead Code Elimination: a store *p=… whose predicate is false is removed.

45 ≈ PRE. Two predicated loads …=*p (p1) and …=*p (p2) merge into a single load …=*p (p1 ∨ p2). This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
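In CFG terms the transformation looks roughly like this (my own sketch; the function names are hypothetical, and the predicates become if conditions):

```c
#include <assert.h>

/* Before: two loads of *p, each under its own predicate. */
static int pre_before(const int *p, int p1, int p2) {
    int x = 0, y = 0;
    if (p1) x = *p;
    if (p2) y = *p;
    return x + y;
}

/* After: one load guarded by the disjunction p1 || p2, lifted to a
   block dominating both original uses. */
static int pre_after(const int *p, int p1, int p2) {
    int t = 0;
    if (p1 || p2) t = *p;              /* single load */
    return (p1 ? t : 0) + (p2 ? t : 0);
}
```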

46 Register Promotion. Given a store *p=… (p1) followed by a load …=*p (p2), the stored value is forwarded and the load's predicate becomes (p2 ∧ ¬p1): the load is executed only if the store is not.

47 Register Promotion (2). When p2 ⇒ p1, the load's predicate (p2 ∧ ¬p1) is false and the load becomes dead... i.e., when the store dominates the load in the CFG.
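The loop-level effect of register promotion, as a hedged C sketch (my own before/after, not CASH output):

```c
#include <assert.h>

/* Before: *p is loaded and stored on every iteration. */
static void accumulate_mem(int *p, int n) {
    for (int i = 0; i < n; i++)
        *p += i;                 /* load + store each iteration */
}

/* After promotion: *p lives in a register; one load before the loop,
   one store after it.  Each in-loop store dominates the next load,
   so the in-loop loads and stores are dead. */
static void accumulate_reg(int *p, int n) {
    int r = *p;                  /* single load */
    for (int i = 0; i < n; i++)
        r += i;
    *p = r;                      /* single store */
}
```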

48 Outline [...]. New compiler optimizations: BitValue dataflow analysis; memory access optimization; a SIDE dish: dataflow analysis; Static Instantiation, Dynamic Evaluation; conclusions.

49 Availability Dataflow Analysis. y = a*b; ... if (x) { ... = a*b; } The second a*b is available in y and can be reused.

50 Dataflow Analysis Is Conservative. if (x) { ... y = a*b; } ... ... = a*b; Here y holds a*b only when x is true, so static analysis cannot prove availability (y?).

51 Static Instantiation, Dynamic Evaluation. flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b;
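The slide's transformation, packaged as a runnable function (the name is mine): the flag is instantiated statically, and at run time it decides whether a*b is reused or recomputed, so no availability is lost even though static analysis was conservative.

```c
#include <assert.h>
#include <stdbool.h>

/* SIDE: static instantiation (the flag and both expressions exist in
   the code), dynamic evaluation (the flag picks reuse vs. recompute). */
static int side_reuse(bool x, int a, int b) {
    bool flag = false;
    int y = 0;
    if (x) {
        y = a * b;       /* value may be available... */
        flag = true;
    }
    return flag ? y : a * b;   /* ...reuse it if it is */
}
```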

52 SIDE Register Promotion Effect. (Chart: % reduction in loads and stores.)

53 Outline. Research overview; problems of current architectures; compiling ASH; ASH evaluation; new compiler optimizations; future work & conclusions.

54 Future Work. Optimizations for area/speed/power; memory partitioning; concurrency; compiler-guided layout; explore extensible ISAs; hybridization with superscalar mechanisms; reconfigurable hardware support for ASH; formal verification.

55 How far can you go? Grand Vision: Certified Circuit Generation. Translation validation: input ≡ output. Preserve input properties: e.g., C programs cannot deadlock; type-safe programs cannot crash. Debug, test, and verify only at the source level. HLL → IR → IR opt → Verilog → gates → layout, formally validated.

56 Conclusions: spatial computation strengths

Feature               | Advantages
----------------------|-----------------------------
No interpretation     | Energy efficiency, speed
Spatial layout        | Short wires, no contention
Asynchronous          | Low power, scalable
Distributed           | No global signals
Automatic compilation | Design productivity, no ISA

57 Backup Slides. Reconfigurable hardware; critical paths; software pipelining; control logic; more on PipeRench; ASH vs...; ASH weaknesses; exceptions; research methodology; normalized area; why C?; splitting memory; more performance; recursive calls; nanotech and architecture.

58 Reconfigurable Hardware: universal gates and/or storage elements; interconnection network; programmable switches.

59 Main RH Ingredient: the RAM Cell. A switch is controlled by a 1-bit RAM cell; a universal gate is a RAM addressed by its inputs (a0, a1), whose stored contents (e.g. the pattern 0001) implement the gate's truth table (here an AND).

60 Pipeline Stage: a register (Reg) with handshake signals (rdy in, ack out, data in; rdy out, ack in, data out) and control logic (C).

61 Critical Paths. if (x > 0) y = -x; else y = b*x; (the multiplier arm dominates the critical path).

62 Lenient Operations solve the problem of unbalanced paths. if (x > 0) y = -x; else y = b*x; The select can emit y as soon as the chosen arm's value arrives, without waiting for the other arm.

63 Pipelining (step 1). The loop int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; with a pipelined multiplier (8 stages).

64 Pipelining (step 2).

65 Pipelining (step 3).

66 Pipelining (step 4).

67 Pipelining (step 5): iterations i=0 and i=1 are in flight simultaneously.

68 Pipelining (step 6).

69 Pipelining (step 7): i's loop runs ahead of sum's loop, which contains the long-latency pipe; a predicate links them.

70 Pipelining: the predicate ack edge is on the critical path.

71 Pipeline Balancing: a decoupling FIFO between i's loop and sum's loop.

72 Pipeline Balancing: the critical path with the decoupling FIFO in place.

73 Process: 0.18 µm, 6 Al metal layers. Area: 49 mm². Clock: 60MHz I/O, 120MHz internal. Power: < 4W. Stripes: 16 physical, 256 virtual. Compiler functional on first silicon. Licensed by two companies.

74 Hardware Virtualization. Overlap configuration with computation: page configurations in and out of the hardware while computing.

75 PipeRench Hardware: rows of ALUs separated by interconnection networks; registers carry the data flow.

76 Mapping Computation: the register network itself is used for computation (concat, <<, >>, substr, &~, bit-shuffling).

77 Compiler-Controlled Clock: registers and network run under a slow clock or a fast clock.

78 Time-Multiplexing Wires. One channel is available for two wires: one computes in even cycles, the other in odd cycles.

79 Compilation Times (sec on PII/400)

80 Compilation Speed (PII/400)

81 Placed Circuit Utilization

82 PipeRench Performance: speed-up vs. 300MHz UltraSparc.

83 PipeRench Compiler Role. Classical optimizations; partial evaluation; data width inference (~ type inference); module generation (~ macro expansion); placement (~ VLIW scheduling); routing (~ irregular register allocation); network link multiplexing (~ spilling); clock-cycle management; technology mapping (~ instruction selection); code generation.

84 HLL to HW. Prior work: high-level synthesis (behavioral HDL → synchronous hardware); reconfigurable computing (C subsets → hardware configuration, i.e. spatial computation); asynchronous circuits (concurrent language → asynchronous hardware). This research: C → asynchronous spatial hardware.

85 CASH vs High-Level Synthesis. CASH is the only existing tool to translate complete ANSI C to hardware. CASH generates asynchronous circuits. CASH does not treat C as an HDL: no annotations required, no reactivity model, and it does not handle non-C features, e.g. concurrency.

86 ASH Weaknesses. Low efficiency for low-ILP code; does not adapt at runtime; monolithic memory; resource waste; not flexible; no support for exceptions.

87 ASH Weaknesses (2). Both branch and join are not free. Static dataflow (no re-issue of the same instruction). Memory is "far". Fully static: no branch prediction, no dynamic unrolling, no register renaming. Calls/returns are not lenient.

88 Branch Prediction. for (i=0; i < N; i++) { ... if (exception) break; } A CPU predicts the exit not taken, making the check effectively a noop: its result is available before its inputs. In ASH the check (& !exception) remains on the critical path.

89 Research Methodology. A constraint space with axes X (e.g. power) and Y (e.g. cost): the state of the art evolves incrementally within "reasonable limits"; new solutions jump outside them.

90 Exceptions. Strictly speaking, C has no exceptions. In practice it is hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: the PC is a single point of execution control. (In the hybrid system, the CPU carries low-ILP computation + OS + VM + exceptions; ASH carries high-ILP computation; memory is shared.)

91 Why C? Huge installed base. Embedded specifications are written in C. Small and simple language: can leverage existing tools; simpler compiler. Techniques generally applicable. Not a toy language.

92 Performance

93 Parallelism Profile

94 Normalized Area

95 Memory Partitioning. MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00. Stanford SpC: Semeria DAC '01, TVLSI '02. Berkeley CCured: Necula POPL '02. Illinois FlexRAM: Fraguela PPoPP '03. Hand-annotations: #pragma.

96 Memory Complexity: LSQ + RAM (addr, data).

97 Recursion: a recursive call saves live values on a stack before the call and restores them on return.

98 Nanotechnology and Architecture

99 Nanotechnology Implications: new devices, new manufacturing, new architectures, new compilers (my work).

100 CAEN. Study the computer architecture implications of Chemically-Assembled Electronic Nanotechnology (contrast with lithography). (Figure: a molecular gate with VDD, two inputs, and an output.)

101 No Complex Irregular Structures

102 Regular Substrate: ~10^11 gates.

103 High Defect Rate

104 Paradigm Shift: from executable (complex fixed chip + program) to configuration (dense, regular structure + configuration), in the presence of defects.

105 New Computer Architecture

CMOS                   → Self-assembled circuits
Transistor             → New molecular devices
Custom hardware        → Reconfigurable hardware
Yield (defect) control → Defect tolerance through reconfiguration
Synchronous circuits   → Asynchronous computation
Microprocessors        → App-specific hardware + CPU

106 Exploiting Nanotechnology. Nanotechnology: + cheap, + high-density, + low-power, - unreliable. Computer architecture: + vast body of knowledge, - expensive, - high-power. Reconfigurable computing: + defect-tolerant, + high performance, - low density.

107 Research Convergence. As feature size decreases, deep sub-micron CMOS and chemically-assembled electronic nanotechnology converge on the same systems research issues (my work).

108 Venues

