Mihai Budiu Microsoft Research – Silicon Valley Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University Spatial Computation.

Presentation transcript:

1 Spatial Computation: Computing without General-Purpose Processors
Mihai Budiu, Microsoft Research – Silicon Valley
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein, Carnegie Mellon University

2 Outline
Intro: Problems of current architectures
Compiling Application-Specific Hardware
ASH Evaluation
Conclusions

3 Resources
We do not worry about not having hardware resources.
We worry about being able to use hardware resources.
[Figure credit: Intel]

4 Complexity
Cannot rely on global signals (clock is a global signal)
[Figure: ALUs; 5ps gate, 20ps wire]

5 Complexity
Cannot rely on global signals (clock is a global signal)
[Figure: ALUs; 5ps gate, 20ps wire]
Automatic translation C → HW
Simple, short, unidirectional interconnect
No interpretation
Distributed control, asynchronous
Simple hw, mostly idle

6 Our Proposal: Application-Specific Hardware
ASH addresses these problems
ASH is not a panacea
ASH complementary to CPU
[Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation) sharing a cache ($) and memory]

7 Paper Content
Automatic translation of C to hardware dataflow machines
High-level comparison of dataflow and superscalar
Circuit-level evaluation: power, performance, area

8 Outline
Problems of current architectures
CASH: Compiling Application-Specific Hardware
ASH Evaluation
Conclusions

9 Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw

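10 Computation = Dataflow
Program: x = a & 7; ... y = x >> 2;
[Figure: the program, its IR graph (a and 7 feed &, producing x; x and 2 feed >>), and the resulting circuit stages "&7" and ">>2"]
No interpretation
Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)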
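
The mapping on this slide can be illustrated with a minimal C sketch. It assumes a hypothetical two-node graph representation; the names `Node`, `eval_node`, and `run_example` are illustrative and are not taken from the CASH compiler. Each operation becomes a node, and the variable x becomes the edge between the two nodes.

```c
#include <stdint.h>

/* Hypothetical node kinds for the two operations in the slide's example. */
typedef enum { OP_AND, OP_SHR } OpKind;

/* One dataflow node: an operation with a constant second operand. */
typedef struct {
    OpKind   op;
    uint32_t constant;
} Node;

/* Evaluate one node on the value arriving over its input edge. */
uint32_t eval_node(Node n, uint32_t in)
{
    switch (n.op) {
    case OP_AND: return in & n.constant;
    case OP_SHR: return in >> n.constant;
    }
    return 0; /* not reached for valid node kinds */
}

/* The program `x = a & 7; y = x >> 2;` as a chain of two nodes.
 * The def-use edge for x is the value passed from node 0 to node 1. */
uint32_t run_example(uint32_t a)
{
    const Node graph[2] = { { OP_AND, 7u }, { OP_SHR, 2u } };
    uint32_t x = eval_node(graph[0], a);
    uint32_t y = eval_node(graph[1], x);
    return y;
}
```

Calling `run_example(13)` evaluates 13 & 7 = 5 and then 5 >> 2 = 1. In hardware each node would be its own pipeline stage rather than a function call; the sketch only shows the graph structure.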

11 Basic Computation = Pipeline Stage
[Figure: an adder (+) feeding a latch; signals: data, valid, ack]

12 Distributed Control Logic
[Figure: + and - stages with ack and rdy signals; labels: global FSM; short, local wires]
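
A minimal software sketch of the handshake these two slides describe, under stated assumptions: the stage is an adder with an output latch, and the `Stage` type and the `stage_offer` / `stage_take` functions are hypothetical names introduced here. Real asynchronous stages use wire-level rdy/ack signalling; this model only captures the rule that a stage accepts new operands once its previous result has been consumed, with no global controller involved.

```c
#include <stdbool.h>

/* Hypothetical model of one stage: an adder followed by an output latch. */
typedef struct {
    int  latch;   /* output register */
    bool full;    /* true while the latch holds an unconsumed result */
} Stage;

/* Producer side: offer operands a and b with valid asserted.
 * Returns the ack signal: true if the stage accepted the operands. */
bool stage_offer(Stage *s, int a, int b)
{
    if (s->full)
        return false;      /* previous result not yet consumed: no ack */
    s->latch = a + b;      /* compute and latch the sum */
    s->full  = true;       /* output is now valid */
    return true;
}

/* Consumer side: take the result if one is valid.
 * Returns the valid signal; a successful take frees the latch. */
bool stage_take(Stage *s, int *out)
{
    if (!s->full)
        return false;
    *out = s->latch;
    s->full = false;
    return true;
}
```

Each stage decides locally whether to accept data, which is the sense in which the control logic is distributed.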

13 MUX: Forward Branches
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; nodes *, -, >, !; a mux producing y]
Conditionals ⇒ Speculation
SSA = no arbitration
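
The conditional on this slide can be sketched in C as a multiplexer over speculatively computed arms. The function names `mux2` and `forward_branch` are hypothetical. Both arms are always evaluated, and the predicate only selects which result becomes y. Because each value has a single producer (SSA), the mux never has to arbitrate between competing writers.

```c
/* Two-input multiplexer: the predicate selects one of two ready values. */
int mux2(int pred, int if_true, int if_false)
{
    return pred ? if_true : if_false;
}

/* `if (x > 0) y = -x; else y = b*x;` with both arms computed up front. */
int forward_branch(int x, int b)
{
    int pred = x > 0;   /* predicate node */
    int neg  = -x;      /* "then" arm, evaluated regardless of pred */
    int prod = b * x;   /* "else" arm, evaluated regardless of pred */
    return mux2(pred, neg, prod);
}
```

For example, `forward_branch(3, 5)` yields -3 and `forward_branch(-2, 5)` yields -10. Speculation is safe here because neither arm has side effects.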

14 Memory Access
[Figure: LD, ST, LD operations connected to a monolithic memory through a pipelined, arbitrated network; labels: local communication, global structures]
Future work: fragment this!

15 Outline
Problems of current architectures
Compiling ASH
ASH Evaluation
Conclusions

16 Evaluating ASH
Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC
180nm std. cell library, 2V (~1999 technology)
Mediabench kernels (1 hot function/benchmark)
ModelSim (Verilog simulation) → performance numbers
[Figure labels: Mem, commercial tools]

17 Compile Time
Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC
[Figure annotations: 200 lines; 20 seconds, 10 seconds, 20 minutes, 1 hour; Mem]

18 ASH Area
[Figure: ASH area chart; reference labels: P4: 217, minimal RISC core]

19 ASH vs 600MHz CPU [.18 μm]

20 Bottleneck: Memory Protocol
[Figure: LD and ST operations, LSQ, Memory]
Enabling dependent operations requires round-trip to memory.
Limit study: round trip zero time ⇒ up to 5x speed-up.
Exploring novel memory access protocols.

21 Power
[Figure: power chart; reference labels: DSP 110, μP 4000, Xeon [+cache] 67000]

22 Energy-delay vs. Wattch

23 Energy Efficiency
[Figure: Energy Efficiency [Operations/nJ]; categories: General-purpose DSP, Dedicated hardware, ASH media kernels, FPGA, Microprocessors, Asynchronous μP; annotation: 1000x]

24 Outline
Problems of current architectures
Compiling ASH
Evaluation
Related work, Conclusions

25 Related Work
Optimizing compilers
High-level synthesis
Reconfigurable computing
Dataflow machines
Asynchronous circuits
Spatial computation
We target an extreme point in the design space: no interpretation, fully distributed computation and control

26 ASH Design Point
Design an ASIC in a day
Fully automatic synthesis to layout
Fully distributed control and computation (spatial computation)
–Replicate computation to simplify wires
Energy/op rivals custom ASIC
Performance rivals superscalar
E × t 100 times better than any processor

27 Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity

28 Backup Slides
Absolute performance
Control logic
Exceptions
Leniency
Normalized area
Loops
ASH weaknesses
Splitting memory
Recursive calls
Leakage
Why not compare to…
Targeting FPGAs

29 Absolute Performance

30 Pipeline Stage
[Figure labels: =, rdy in, ack out, rdy out, ack in, data in, data out, Reg, C]

31 Exceptions
Strictly speaking, C has no exceptions
In practice hard to accommodate exceptions in hardware implementations
An advantage of software flexibility: PC is single point of execution control
[Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation) sharing caches ($$$) and memory]

32 Critical Paths
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; nodes *, -, >, !; a mux producing y]

33 Lenient Operations
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; nodes *, -, >, !; a mux producing y]
Solves the problem of unbalanced paths
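
A minimal timing sketch of why leniency helps with unbalanced paths. It assumes that "lenient" means an operation may produce its output as soon as the inputs that determine it have arrived, without waiting for the rest. The `MuxTiming` type, the function names, and the time units are hypothetical, chosen only to illustrate the comparison.

```c
/* Arrival times, in arbitrary units, of a 2-input mux's signals. */
typedef struct {
    int t_pred;    /* predicate, e.g. x > 0 */
    int t_true;    /* "then" input, e.g. -x  */
    int t_false;   /* "else" input, e.g. b*x */
} MuxTiming;

int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Strict mux: waits for every input before producing its output. */
int strict_ready(MuxTiming m)
{
    return max_int(m.t_pred, max_int(m.t_true, m.t_false));
}

/* Lenient mux: waits only for the predicate and the selected input. */
int lenient_ready(MuxTiming m, int pred)
{
    return max_int(m.t_pred, pred ? m.t_true : m.t_false);
}
```

With a fast negation (time 2) and a slow multiply (time 10), a strict mux is always ready at time 10, while a lenient mux is ready at time 2 whenever the predicate selects the negation.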

34 Normalized Area

35 Control Flow ⇒ Data Flow
[Figure: three node types: Merge (label), Gateway, Split (branch); signal labels: data, predicate, p, !]

36 Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Figure: dataflow graph of the loop; nodes: i, +1, <, *, +, sum, 0, !, ret]
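
The loop on this slide can be sketched as values circulating around a graph, one function call per trip around the back edge. This is an illustrative software model under stated assumptions, not the circuit CASH generates; `LoopState`, `loop_step`, and `run_loop` are hypothetical names.

```c
/* Loop-carried values that circulate on the back edges of the graph. */
typedef struct {
    int i;
    int sum;
} LoopState;

/* One trip around the loop body.
 * Returns 1 if the values go around the back edge again,
 * 0 if the predicate is false and the loop exits toward "ret". */
int loop_step(LoopState *s)
{
    int pred = s->i < 100;     /* "<" node */
    if (!pred)
        return 0;
    int sq = s->i * s->i;      /* "*" node */
    s->sum = s->sum + sq;      /* "+" node feeding sum */
    s->i   = s->i + 1;         /* "+1" node feeding i */
    return 1;
}

/* Inject the initial values i = 0 and sum = 0, then iterate to the exit. */
int run_loop(void)
{
    LoopState s = { 0, 0 };
    while (loop_step(&s))
        ;
    return s.sum;
}
```

`run_loop()` returns 328350, the sum of i*i for i from 0 to 99.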

37 ASH Weaknesses
Both branch and join not free
Static dataflow (no re-issue of same instr)
Memory is far
Fully static
–No branch prediction
–No dynamic unrolling
–No register renaming
Calls/returns not lenient

38 Branch Prediction
for (i=0; i < N; i++) { ... if (exception) break; }
[Figure: loop graph with nodes i, +, <, 1, &, !, exception; marked paths: ASH crit path, CPU crit path]
[Figure annotations: Predicted not taken. Effectively a noop for CPU! Predicted taken. Result available before inputs.]

39 Memory Partitioning
MIT RAW project: Babb FCCM 99, Barua HiPC 00, Lee ASPLOS 00
Stanford SpC: Semeria DAC 01, TVLSI 02
Illinois FlexRAM: Fraguela PPoPP 03
Hand-annotations: #pragma

40 Recursion
[Figure: a recursive call; save live values to a stack before the call, restore live values after it]
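
The save/restore scheme on this slide can be sketched in C by replacing native recursion with an explicit stack. This is a sketch under stated assumptions: the factorial function, the name `fact_explicit_stack`, and the `STACK_MAX` bound are all chosen for illustration and do not come from the slides.

```c
#define STACK_MAX 64  /* hypothetical bound on recursion depth */

/* Computes n! without native recursion. The live value n is saved on an
 * explicit stack before each "call" and restored after each "return".
 * The result fits in a 64-bit unsigned long only for n <= 20. */
unsigned long fact_explicit_stack(unsigned n)
{
    unsigned stack[STACK_MAX];
    int top = 0;

    /* Call phase: save the live value, then "call" with n - 1. */
    while (n > 1 && top < STACK_MAX) {
        stack[top++] = n;
        n = n - 1;
    }

    unsigned long result = 1;  /* base case: 0! = 1! = 1 */

    /* Return phase: restore each saved value and finish its multiply. */
    while (top > 0) {
        unsigned saved = stack[--top];
        result *= saved;
    }
    return result;
}
```

For example, `fact_explicit_stack(5)` returns 120. In a spatial implementation the stack would live in memory, since a single circuit instance has only one set of wires for its live values.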

41 Leakage Power
$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
Employ circuit-level techniques
Cut power supply of idle circuit portions
–most of the circuit is idle most of the time
–strong locality of activity

42 Why Not Compare To…
In-order processor
–Worse in all metrics than superscalar, except power
–We beat it in all metrics, including performance
DSP
–We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
ASIC
–No available tool-flow supports C to the same degree
Asynchronous ASIC
–We compared with a Balsa synthesis system
–We are 15 times better in Et compared to resulting ASIC
Async processor
–We are 350 times better in Et than Amulet (scaled to .18)

43 Compared to Next Talk
Columns: Engine [180nm], Performance [MIPS], E/instruction [pJ]
Rows: SNAP/LE 2824, SNAP/LE, ASH

44 Why not target FPGA
Do not support asynchronous circuits
Very inefficient in area, power, delay
Too fine-grained for datapath circuits
We are designing an async FPGA