Evaluating The Raw Microprocessor: Scalability and Versatility. Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.

Could processors be even more general purpose? Square inch of silicon Gets more powerful every generation Custom Chip “General Purpose” Microprocessor Video/3D Graphics Network Encryption Wireless/Cell Phone Digital Camera MP3 Player Automotive Why can custom chips run these apps? Spec Office

Custom Chips: Efficient Extraction of Parallelism
- 10's, 100's or 1000's of parallel operators
- 10's or 100's of parallel memory ports
- 10's or 100's of parallel I/O ops
(GP Micro, slide labels: 3-8, 2, 1)
But, not general purpose! Can't run GCC.
Customized placement and routing of operators & operands
- High locality
- Minimum Control
- Operands routed over wires, not thru register files
=> Area and Power Efficient

The Raw Goal Create an architecture that: Scales to 100’s-1000’s of functional units, memory ports by exploiting custom-chip like features - in particular, application-specific routing of operands … while being “general purpose”: Run ILP-based sequential programs Support standard General Purpose Abstractions - like context switching, caching and instruction virtualization [IEEE Micro, “Billion Transistor” Issue, 1997]

Un-buildable Super-Wide Issue GP Control Wide Fetch (16 inst) Unified Load/Store Queue PC RF ALU Bypass Net

Area and Frequency Scalability Problems. N ALUs. [Figure: ALU, Bypass Net, RF, annotated ~N^3 and ~N^2] Ex: Itanium 2. Without modification, freq decreases linearly or worse.

Operand Routing is Global ALU Bypass Net RF >> +

Idea: Exploit Locality ALU Bypass Net RF

ALU RF Replace the crossbar with a point-to-point, pipelined, routed network.

ALU RF >> + Replace the crossbar with a point-to-point, pipelined, routed network.

Operand Transport Scaling – Bandwidth and Area
Quantity | Un-pipelined crossbar | Point-to-Point Routed Mesh Network
ALUs | N | N
Bisection BW | ~N^(1/2)
Local BW | ~N^(1/2) | ~N
Area | ~N^2 | ~N
Scales as 2-D VLSI.
If we want to keep our ALUs busy, we better map communicating instructions nearby so communication is local.
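The area and bisection figures for the mesh can be checked by simple counting. The sketch below assumes a square side x side mesh with one link between each pair of neighbouring tiles, and models crossbar area as one crosspoint per input/output pair. The function name `transport_scaling` and these counting conventions are illustrative assumptions, not something taken from the talk.

```python
import math

def transport_scaling(n_alus):
    """Count interconnect structures for N ALUs (N assumed to be a perfect square)."""
    side = math.isqrt(n_alus)
    if side * side != n_alus:
        raise ValueError("this sketch only handles square meshes")
    return {
        # Full crossbar: one crosspoint per (source, destination) pair, ~N^2.
        "crossbar_crosspoints": n_alus * n_alus,
        # Mesh: horizontal plus vertical nearest-neighbour links, ~N.
        "mesh_links": 2 * side * (side - 1),
        # Links cut when the mesh is split down the middle, ~N^(1/2).
        "mesh_bisection_links": side,
    }

for n in (16, 64, 1024):
    print(n, transport_scaling(n))
```

Going from 16 to 1024 ALUs multiplies the crosspoint count by 4096, while the mesh link count grows roughly in proportion to the number of tiles and the bisection only by a factor of 8, which is why the slide stresses keeping communication local.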

Operand Transport Scaling - Latency
Time for operand to travel between instructions mapped to different ALUs.
Placement | Un-pipelined crossbar | Point-to-Point Routed Mesh Network
Non-local Placement | ~N | ~N^(1/2)
Locality Driven Placement | ~N | ~1
If we want to make sure that a latency-bound program doesn't slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
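The effect of placement on mesh latency can be illustrated by counting hops. The sketch below is a toy model under stated assumptions: tiles sit on a square grid, operand latency is proportional to Manhattan hop count, "non-local placement" means producer and consumer land on arbitrary distinct tiles, and "locality driven placement" means a dependence chain is laid out along a serpentine path. All function names are hypothetical, and this is not the placement algorithm used by the Raw compilers.

```python
from itertools import product

def manhattan(a, b):
    """Hop count between two tiles on a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mean_hops_all_pairs(side):
    """Average hops when producer and consumer sit on arbitrary distinct tiles."""
    tiles = list(product(range(side), repeat=2))
    dists = [manhattan(a, b) for a in tiles for b in tiles if a != b]
    return sum(dists) / len(dists)

def serpentine(side):
    """Visit every tile so that consecutive tiles are always mesh neighbours."""
    order = []
    for row in range(side):
        cols = range(side) if row % 2 == 0 else range(side - 1, -1, -1)
        order.extend((row, col) for col in cols)
    return order

def max_hops_chain(side):
    """Worst-case hops between consecutive instructions placed along the serpentine."""
    path = serpentine(side)
    return max(manhattan(a, b) for a, b in zip(path, path[1:]))

for side in (2, 4, 8, 16):
    print(side * side, round(mean_hops_all_pairs(side), 2), max_hops_chain(side))
```

In this model the average distance for arbitrary placement grows with the mesh side, that is, with N^(1/2), while the chain placed along the serpentine never needs more than one hop regardless of mesh size.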

Distribute the Register File ALU RF

ALU RF Control Wide Fetch (16 inst) Unified Load/Store Queue PC SCALABLE

More Scalability Problems Control Wide Fetch (16 inst) Unified Load/Store Queue PC

Distribute the rest. [Figure: ALU, RF, Control, Wide Fetch (16 inst), Unified Load/Store Queue, PC; 16 tiles, each with its own I$, PC, D$] [ISCA99]

Tiles! [Figure: 16 tiles, each with ALU, RF, I$, PC, D$]

Tiled Processor Architectures - composed of a replicated tile -all signals registered at tile boundaries -NO global signals -wire delay problem much easier - easy scalability story Easier to Tune the Frequency Easier to Verify Easier to do the Physical Design

Raw Compute Internals [Figure: 16 tiles, each with ALU, RF, I$, PC, D$; pipeline labels IF, RF, D, ATL, M1, M2, FP, E, U; registers r24, r25, r26, r27]

We could not find this type of network in Patterson & Hennessy.
- optimizes time for delivery of scalar operands between functional units
- we conceptualized this idea into the term “scalar operand network” or SON
- CMP: 15-100 cycles
- iWarp: 12 cycles
- Raw: 3 cycles
- Alpha 21264: 1 cycle
- Superscalar: 0 cycles
(slide annotations: “scalable”, “Intended for use as SON”)
HPCA 2003 – “Scalar Operand Networks”

Evaluation of Raw - holistic approach - design a complete architecture - design and build the processor and enclosing system - build the compilers - used the chip in real systems - head-to-head versus Intel Chip in same litho generation

Raw 180 nm ASIC (IBM SA-27E) 16 tiles Core Frequency: 425 MHz @ 1.8 V 500 MHz @ 2.2 V Frequency competitive with IBM-implemented PowerPCs in same process. 18 W (vpenta) Critical Path: ≈ Single-Ported 32 KB SRAM + 14-bit Mux. + Flip Flop

Raw Chips October 02

Raw motherboard Support Chipset implemented in FPGA (vs. custom ASICs for P3)

Comparison to Pentium 3. Self-comparisons hide architectural and compiler inefficiency. What’s hard: Normalization between processors is very tricky, especially academic projects versus indu$try. - ASIC cannot attain the same frequencies. Honest: Our solution: - Pick closest Intel processor implementation - Don’t scale any numbers in any way. People can now compare to P3 and by extension to Raw.

Parameter | IBM SA-27E (Raw) | Intel P858 (P3) | Favors
Litho | 180 nm | 180 nm | -
Metal Layers | Cu 6 | Al 6 | Raw
Wire sizing | No | Yes | Intel
Dielectric k | 4.1 | 3.55 | Intel
FO1 Delay | 23 ps | 11 ps | Intel
Design Style | Std Cell ASIC | Full custom | Intel
Voltage Tweak | 0 % | 10 % | Intel
Initial Freq (MHz) | 425 | 500-733 | -
Presumed Ave. Chip Freq (MHz) | 425 | 600 | -
Pins | 1100 | 190 | Raw
Die Area | 331 mm^2 | 106 mm^2 | Raw

Methodology - HW Intel: Pentium III Coppermine 600 MHz Dell Precision 410, stocked with 2-2-2 PC100 DRAM Raw: Validated Cycle-Accurate Simulator - Matches RTL for Raw Chip to the precise cycle for all 200,000+ lines of test code Simulator used so we could: - Normalize motherboard + DRAM timings - replace (research) software i-caching system with conventional hardware i-cache.

Methodology - SW
When applicable
- normalize compiler:
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non parallelizing)
- normalize stdio/stdlib:
  P3 & Raw: Newlib 1.9.0 w/ Deionizer
P3: Intel Performance Primitives, LAPACK/BLAS with SSE for linear algebra routines
Raw:
- rawcc - home brew parallelizing compiler
- Streamit - home brew parallelizing compiler
- gcc 3.3 + snippets of inline assembly for some parallel apps

Performance Survey

Sources of Speedup vs. P3 or 1 Tile
Factor | Approx. Upper Bound on Speedup
Tile Parallelism | 16x
Streaming I/O Bandwidth | 60x
Streaming v. cache thrashing | 15x

Future Work: Raw supercomputing fabric. Emulator of a 1K-tile Raw chip, circa 2010 … Ultimate test of scaling

Related Work: AsTrO Taxonomy
- Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
- Transport (Static/Dynamic): Are operand routes predetermined?
- Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
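The three AsTrO questions map naturally onto a small record type. The sketch below is one possible encoding: the names `Binding`, `AsTrO` and `classify`, and the three-letter code such as "SSS", are illustrative assumptions rather than notation taken from the slides.

```python
from dataclasses import dataclass
from enum import Enum

class Binding(Enum):
    STATIC = "S"    # decided before execution
    DYNAMIC = "D"   # decided during execution

@dataclass(frozen=True)
class AsTrO:
    assignment: Binding  # which ALU runs each instruction
    transport: Binding   # which route each operand takes
    ordering: Binding    # execution order of instructions on a node

    def code(self):
        """Three-letter summary in Assignment/Transport/Ordering order."""
        return self.assignment.value + self.transport.value + self.ordering.value

def classify(assignment_predetermined, routes_predetermined, order_predetermined):
    """Answer the three taxonomy questions with booleans (True means predetermined)."""
    def pick(fixed):
        return Binding.STATIC if fixed else Binding.DYNAMIC
    return AsTrO(pick(assignment_predetermined),
                 pick(routes_predetermined),
                 pick(order_predetermined))

# A hypothetical design where everything is fixed ahead of time.
print(classify(True, True, True).code())     # SSS
# A hypothetical design where everything is decided at run time.
print(classify(False, False, False).code())  # DDD
```

With three binary axes the taxonomy has eight possible classes, so any distributed microprocessor design can be placed by answering the three questions independently.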

How Raw relates to other distributed microprocessors using AsTrO taxonomy. [Figure: classification tree over Assignment, Transport and Ordering, each Static or Dynamic. Designs shown: RawDyn [00], Raw [97], Scale [04], GRID [01], WaveScalar [03], ILDP [00], OOO-Superscalar]

Conclusions VLSI Scalable microprocessors are possible. Constant factors are beginning to give way to asymptotics: - 16 ALU Raw – Oct 2002 - 64 ALU Raw – Now - 1,024 ALU Raw- 2010 - 32,768 ALU Raw – If Moore’s Law makes it to 2 nm There is an opportunity to make processors more “versatile” i.e., steal applications from custom chips. Tiled Processor Architectures are a promising approach and merit further research.

Embedded system: 1020 Element Microphone Array
