Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry.


1 Evaluating The Raw Microprocessor: Scalability and Versatility. Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.

2 Could processors be even more general purpose? A square inch of silicon gets more powerful every generation. Custom chip: Video/3D Graphics, Network, Encryption, Wireless/Cell Phone, Digital Camera, MP3 Player, Automotive. “General Purpose” microprocessor: Spec, Office. Why can custom chips run these apps?

3 Custom Chips: Efficient Extraction of Parallelism
- 10’s, 100’s or 1000’s of parallel operators (GP micro: 3-8)
- 10’s or 100’s of parallel memory ports (GP micro: 2)
- 10’s or 100’s of parallel I/O ops (GP micro: 1)
Customized placement and routing of operators & operands: - High locality - Minimum control - Operands routed over wires, not through register files → Area and power efficient. But not general purpose! Can’t run GCC.

4 The Raw Goal. Create an architecture that scales to 100’s-1000’s of functional units and memory ports by exploiting custom-chip-like features, in particular application-specific routing of operands, while being “general purpose”: - Run ILP-based sequential programs - Support standard general-purpose abstractions like context switching, caching and instruction virtualization. [IEEE Micro, “Billion Transistor” Issue, 1997]

5 Un-buildable Super-Wide Issue GP Control Wide Fetch (16 inst) Unified Load/Store Queue PC RF ALU Bypass Net

6 Area and Frequency Scalability Problems. [Figure: N ALUs with bypass net and RF; scaling labels ~N^3 and ~N^2.] Ex: Itanium 2. Without modification, frequency decreases linearly or worse.

7 Operand Routing is Global ALU Bypass Net RF >> +

8 Idea: Exploit Locality ALU Bypass Net RF

9 ALU RF Bypass Net Idea: Exploit Locality

10 ALU RF Replace the crossbar with a point-to-point, pipelined, routed network.

11 ALU RF >> + Replace the crossbar with a point-to-point, pipelined, routed network.

12 Operand Transport Scaling – Bandwidth and Area
                Un-pipelined crossbar    Point-to-point routed mesh network
ALUs            N                        N
Bisection BW    ~N^(1/2) (single value on the slide)
Local BW        ~N^(1/2)                 ~N
Area            ~N^2                     ~N
Scales as 2-D VLSI. If we want to keep our ALUs busy, we had better map communicating instructions nearby so communication is local.
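The area row of this slide can be made concrete with a small sketch. It uses only the slide's asymptotic forms (crossbar area ~N^2, mesh area ~N) and drops all constant factors, which is an assumption: the function names are hypothetical and the numbers are relative units, not measured silicon area.

```python
def crossbar_area(n_alus: int) -> int:
    """Relative area of an un-pipelined crossbar, taken as N^2 (constants dropped)."""
    return n_alus ** 2


def mesh_area(n_alus: int) -> int:
    """Relative area of a point-to-point routed mesh, taken as N (constants dropped)."""
    return n_alus


def area_ratio(n_alus: int) -> float:
    """How many times larger the crossbar is than the mesh under this model."""
    return crossbar_area(n_alus) / mesh_area(n_alus)
```

Under this model the gap between the two interconnects grows linearly with the ALU count, so `area_ratio(16)` is 16 and `area_ratio(1024)` is 1024.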

13 Operand Transport Scaling - Latency. Time for operand to travel between instructions mapped to different ALUs.
                             Un-pipelined crossbar    Point-to-point routed mesh network
Non-local placement          ~N                       ~N^(1/2)
Locality-driven placement    ~N                       ~1
If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
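The mesh latency figures can be checked with a short sketch. It assumes a square 2-D mesh, counts latency in Manhattan hops, and models "non-local placement" as a uniformly random pair of distinct tiles; all three are illustrative assumptions, and the helper names are hypothetical.

```python
import itertools
import math


def mesh_coords(n_tiles: int) -> list[tuple[int, int]]:
    """Coordinates of a square mesh; n_tiles must be a perfect square."""
    side = math.isqrt(n_tiles)
    if side * side != n_tiles:
        raise ValueError("n_tiles must be a perfect square")
    return [(x, y) for y in range(side) for x in range(side)]


def hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Manhattan distance between two tiles on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_hops_random_placement(n_tiles: int) -> float:
    """Average hop count over all pairs of distinct tiles."""
    pairs = list(itertools.combinations(mesh_coords(n_tiles), 2))
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


def hops_local_placement() -> int:
    """Communicating instructions placed on adjacent tiles: always one hop."""
    return hops((0, 0), (1, 0))
```

In this model the random-placement average works out to (2/3)·N^(1/2) hops, so quadrupling the tile count doubles it, while the locality-driven case stays at one hop regardless of N.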

14 Distribute the Register File ALU RF

15 ALU RF Control Wide Fetch (16 inst) Unified Load/Store Queue PC SCALABLE

16 More Scalability Problems Control Wide Fetch (16 inst) Unified Load/Store Queue PC

17 Distribute the rest. [Figure: Control, Wide Fetch (16 inst), Unified Load/Store Queue, and PC, alongside 16 ALU/RF units, each with its own I$, PC, and D$.] [ISCA99]

18 Tiles! [Figure: 16 tiles, each with ALU, RF, I$, PC, and D$.]

19 Tiles!

20 Tiled Processor Architectures - composed of a replicated tile - all signals registered at tile boundaries - NO global signals - wire delay problem much easier - easy scalability story. Easier to tune the frequency. Easier to verify. Easier to do the physical design.

21 Raw Compute Internals. [Figure: 16 tiles, each with ALU, RF, I$, PC, and D$; pipeline labels IF, RF, D, ATL, M1, M2, FP, E, U; register labels r24, r25, r26, r27.]

22 We could not find this type of network in Patterson & Hennessy. - optimizes time for delivery of scalar operands between functional units - we conceptualized this idea into the term “scalar operand network” or SON. [HPCA 2003 – “Scalar Operand Networks”]
Operand delivery time: - CMP: 15-100 cycles - iWarp: 12 cycles - Raw: 3 cycles - Alpha 21264: 1 cycle - Superscalar: 0 cycles
[Figure: 16 tiles, each with ALU, RF, I$, PC, and D$; annotations “scalable” and “Intended for use as SON”.]
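To put the cycle counts on this slide in context, here is a minimal sketch of how operand latency might grow with distance on a tiled mesh. The slide gives only the single figure "Raw: 3 cycles"; treating that as the adjacent-tile case and charging one extra cycle per additional hop are both assumptions made for illustration, and the function and dictionary names are hypothetical.

```python
# Operand delivery times as listed on the slide, as (min, max) cycles.
SLIDE_LATENCY_CYCLES = {
    "CMP": (15, 100),
    "iWarp": (12, 12),
    "Raw": (3, 3),
    "Alpha 21264": (1, 1),
    "Superscalar": (0, 0),
}


def raw_operand_latency(hops: int, neighbor_cycles: int = 3,
                        cycles_per_extra_hop: int = 1) -> int:
    """Estimated cycles to deliver an operand across `hops` mesh hops.

    neighbor_cycles is the slide's 3-cycle figure, ASSUMED here to be the
    one-hop case; cycles_per_extra_hop is an ASSUMED per-hop cost.
    """
    if hops < 1:
        raise ValueError("operands sent to another tile travel at least one hop")
    return neighbor_cycles + (hops - 1) * cycles_per_extra_hop
```

With these assumed parameters, a corner-to-corner transfer on a 4x4 mesh (6 hops) would cost 8 cycles, still below the 12-cycle iWarp figure listed on the slide.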

23 Evaluation of Raw - holistic approach - design a complete architecture - design and build the processor and enclosing system - build the compilers - used the chip in real systems - head-to-head versus Intel Chip in same litho generation

24 Raw: 180 nm ASIC (IBM SA-27E), 16 tiles. Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V. Frequency competitive with IBM-implemented PowerPCs in same process. Power: 18 W (vpenta). Critical path: ≈ single-ported 32 KB SRAM + 14-bit mux + flip-flop.

25 Raw Chips October 02

26 Raw motherboard Support Chipset implemented in FPGA (vs. custom ASICs for P3)

27 Comparison to Pentium 3. Self-comparisons hide architectural and compiler inefficiency. What’s hard: normalization between processors is very tricky, especially for academic projects versus indu$try. - ASIC cannot attain the same frequencies. Our solution, to stay honest: - Pick the closest Intel processor implementation - Don’t scale any numbers in any way. People can now compare to P3 and by extension to Raw.

28
Parameter                       IBM SA-27E (Raw)    Intel P858 (P3)    Favors
Litho                           180 nm (both)                          -
Metal Layers                    Cu 6                Al 6               Raw
Wire sizing                     No                  Yes                Intel
Dielectric k                    4.1                 3.55               Intel
FO1 Delay                       23 ps               11 ps              Intel
Design Style                    Std Cell ASIC       Full custom        Intel
Voltage Tweak                   0 %                 10 %               Intel
Initial Freq (MHz)              425                 500-733            -
Presumed Ave. Chip Freq (MHz)   425                 600                -
Pins                            1100                190                Raw
Die Area                        331 mm^2            106 mm^2           Raw
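A few of the rows on this slide are easier to weigh as ratios. The sketch below divides the Raw (IBM SA-27E) value by the P3 (Intel P858) value for three parameters taken from the slide; the dictionary keys and function name are hypothetical, and the ratios are plain arithmetic rather than figures quoted by the authors.

```python
# (Raw value, P3 value) pairs copied from the slide's parameter table.
PROCESS_PARAMS = {
    "fo1_delay_ps": (23, 11),
    "pins": (1100, 190),
    "die_area_mm2": (331, 106),
}


def raw_to_p3_ratio(name: str) -> float:
    """Raw value divided by P3 value for the named parameter."""
    raw, p3 = PROCESS_PARAMS[name]
    return raw / p3
```

Rounded to two places, the FO1 delay ratio is about 2.09 (gates in the ASIC process are roughly twice as slow), the pin ratio about 5.79, and the die area ratio about 3.12.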

29 Methodology - HW. Intel: Pentium III Coppermine 600 MHz in a Dell Precision 410, stocked with 2-2-2 PC100 DRAM. Raw: validated cycle-accurate simulator - matches RTL for Raw chip to the precise cycle for all 200,000+ lines of test code. Simulator used so we could: - normalize motherboard + DRAM timings - replace (research) software i-caching system with conventional hardware i-cache.

30 Methodology - SW. When applicable:
- normalize compiler: P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse; Raw: gcc 3.3 -O3 (non-parallelizing)
- normalize stdio/stdlib: P3 & Raw: Newlib 1.9.0 w/ Deionizer
P3: Intel Performance Primitives, LAPACK/BLAS with SSE for linear algebra routines.
Raw: rawcc - home brew parallelizing compiler; Streamit - home brew parallelizing compiler; gcc 3.3 + snippets of inline assembly for some parallel apps.

31 Performance Survey

32 Sources of Speedup vs. P3 or 1 Tile
Factor                           Approx. Upper Bound on Speedup
Tile Parallelism                 16x
Streaming I/O Bandwidth          60x
Streaming v. cache thrashing     15x

33 Future Work: Raw supercomputing fabric. Emulator of a 1K-tile Raw chip circa 2010 … Ultimate test of scaling.

34 Related Work: AsTrO Taxonomy. Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined? Transport (Static/Dynamic): Are operand routes predetermined? Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
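The three yes/no questions of the taxonomy map naturally onto a small data structure. The sketch below is one possible encoding, not something from the talk: the class names, the one-letter codes, and the enumeration order are all hypothetical, and no specific architecture is classified here.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    """Whether a decision is predetermined (static) or made at run time (dynamic)."""
    STATIC = "S"
    DYNAMIC = "D"


@dataclass(frozen=True)
class AsTrO:
    """One point in the Assignment / Transport / Ordering design space."""
    assignment: Policy  # is instruction assignment to ALUs predetermined?
    transport: Policy   # are operand routes predetermined?
    ordering: Policy    # is per-node instruction execution order predetermined?

    def code(self) -> str:
        """Three-letter shorthand, e.g. 'SSS' when all three are static."""
        return self.assignment.value + self.transport.value + self.ordering.value


def all_classes() -> list[AsTrO]:
    """Enumerate all 2 x 2 x 2 = 8 combinations of the three axes."""
    return [AsTrO(a, t, o) for a in Policy for t in Policy for o in Policy]
```

Calling `[c.code() for c in all_classes()]` lists the eight classes from `SSS` to `DDD`, which is the space in which the talk places Raw and the related designs.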

35 How Raw relates to other distributed microprocessors using the AsTrO taxonomy. [Figure: tree branching on Assignment, Transport, and Ordering, each Static or Dynamic; entries: RawDyn [00], Raw [97], Scale [04], GRID [01], WaveScalar [03], ILDP [00], OOO-Superscalar.]

36 Conclusions. VLSI-scalable microprocessors are possible. Constant factors are beginning to give way to asymptotics:
- 16 ALU Raw – Oct 2002
- 64 ALU Raw – Now
- 1,024 ALU Raw – 2010
- 32,768 ALU Raw – If Moore’s Law makes it to 2 nm
There is an opportunity to make processors more “versatile”, i.e., steal applications from custom chips. Tiled Processor Architectures are a promising approach and merit further research.


38 Embedded system: 1020 Element Microphone Array

