Scalar Operand Networks for Tiled Microprocessors Michael Taylor Raw Architecture Project MIT CSAIL (now at UCSD)


1 Scalar Operand Networks for Tiled Microprocessors Michael Taylor Raw Architecture Project MIT CSAIL (now at UCSD)

2 Until about three years ago, computer architects used the N-way superscalar (or VLIW) to encapsulate the ideal for a parallel processor: nearly "perfect," but not attainable.
                                 Superscalar
  "PE"->"PE" communication       Free
  Exploitation of parallelism    Implicit (hw scheduler or compiler)
  Clean semantics                Yes
  Scalable                       No
  Power efficient                No

3 What's great about superscalar microprocessors? It's the networks! Fast, low-latency, tightly coupled networks (0-1 cycles of latency, no occupancy).
- For lack of a better name, let's call them Scalar Operand Networks (SONs).
- Can we combine the benefits of superscalar communication with multicore scalability? Can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled … network interfaces" – Jose Duato, OCIN, Dec 6, 2006)
  mul $2,$3,$4
  add $6,$5,$2

4 The industry shift toward multicore: attainable, but hardly ideal.
                                 Superscalar   Multicore
  "PE"->"PE" communication       Free          Expensive
  Exploitation of parallelism    Implicit      Explicit
  Clean semantics                Yes           No
  Scalable                       No            Yes
  Power efficient                No            Yes

5 What we'd like is neither superscalar nor multicore: superscalars have fast networks and great usability; multicore has great scalability and efficiency.
                                 Superscalar   Multicore
  "PE"->"PE" communication       Free          Expensive
  Exploitation of parallelism    Implicit      Explicit
  Clean semantics                Yes           No
  Scalable                       No            Yes
  Power efficient                No            Yes

6 Why communication is expensive on multicore. [Diagram: a message from Multiprocessor Node 1 to Multiprocessor Node 2 pays send overhead (send occupancy + send latency), a transport cost, and receive overhead (receive occupancy + receive latency).]

7 Multiprocessor SON operand routing, send side (Multiprocessor Node 1). [Diagram: the send occupancy covers the launch sequence that assembles the message (destination node name, sequence number, value); the send latency covers commit latency and network injection.]

8 Multiprocessor SON operand routing, receive side (Multiprocessor Node 2): receive occupancy and receive latency come from the receive sequence, demultiplexing, and branch mispredictions, on top of the injection cost. Shared-memory multiprocessors have similar overheads: store instructions, commit latency, and spin locks (plus the attendant branch mispredictions).

9 Defining a figure of merit for scalar operand networks: the 5-tuple <Send Occupancy, Send Latency, Network Hop Latency, Receive Latency, Receive Occupancy>. Tip: the ordering follows the timing of a message from sender to receiver. We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks…
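
To make the 5-tuple concrete, here is a minimal C sketch of the resulting cost model. The struct and field names are invented for illustration, and the example cycle counts are made up rather than measured; only the five components and their ordering come from the slide.

    #include <stdio.h>

    /* 5-tuple <SO, SL, NHL, RL, RO>, ordered as the operand travels
       from sender to receiver. All values are in cycles. */
    struct son_tuple {
        int send_occupancy;    /* cycles the sender is busy launching   */
        int send_latency;      /* cycles before the message is injected */
        int hop_latency;       /* cycles per network hop                */
        int receive_latency;   /* cycles from arrival to usability      */
        int receive_occupancy; /* cycles the receiver is busy accepting */
    };

    /* End-to-end cost of one operand traveling `hops` hops. */
    int operand_cost(struct son_tuple t, int hops)
    {
        return t.send_occupancy + t.send_latency
             + t.hop_latency * hops
             + t.receive_latency + t.receive_occupancy;
    }

    int main(void)
    {
        /* Hypothetical examples: an idealized tightly coupled SON versus
           a message-passing-style network with software overheads. */
        struct son_tuple tight = { 0, 0, 1, 0, 0 };
        struct son_tuple loose = { 3, 2, 1, 2, 3 };

        printf("tight SON, 3 hops: %d cycles\n", operand_cost(tight, 3));
        printf("loose net, 3 hops: %d cycles\n", operand_cost(loose, 3));
        return 0;
    }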

10 Impact of occupancy ("o" = send occupancy + receive occupancy): if o * "surface area" > "volume", it is not worth offloading the work; the overhead is too high (the parallelism is too fine-grained). Impact of latency: the lower the latency, the less independent work the sender needs to stay busy while waiting for the answer; with too much latency it is not worth offloading, because the sender could have done the work faster itself (not enough parallelism to hide the latency). [Diagram: Proc 0 sits with nothing to do while waiting on Proc 1.]
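
A small C sketch of the two rules of thumb above, with invented function names, treating "surface area" as the number of operands communicated and "volume" as the cycles of offloaded computation (an interpretation for illustration, not a definition from the talk):

    #include <stdbool.h>

    /* Occupancy rule: offloading a chunk of work only pays off if the
     * occupancy charged per communicated operand ("surface area" of the
     * chunk) is small relative to the computation it contains ("volume").
     * Everything is in cycles. */
    bool worth_offloading(int occupancy_per_operand, /* o = so + ro    */
                          int operands_communicated, /* "surface area" */
                          int compute_cycles)        /* "volume"       */
    {
        return occupancy_per_operand * operands_communicated < compute_cycles;
    }

    /* Latency rule: offloading only helps if the sender has enough
     * independent work to overlap with the wait for the answer. */
    bool latency_hidden(int round_trip_latency, int independent_work_cycles)
    {
        return independent_work_cycles >= round_trip_latency;
    }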

11 The interesting region. [Chart: Power4 (on-chip) and a superscalar (not scalable) are plotted for reference.]

12 Tiled Microprocessors (or "Tiled Multicore")
                                 Superscalar   Multicore   Tiled Multicore
  PE-PE communication            Free          Expensive   Cheap
  Exploitation of parallelism    Implicit      Explicit    Both
  Scalable                       No            Yes         Yes
  Power efficient                No            Yes         Yes (w/ scalable SON)

13
                                 Superscalar   Multicore   Tiled Multicore
  ALU-ALU communication          Free          Expensive   Cheap
  Exploitation of parallelism    Implicit      Explicit    Both
  Scalable                       No            Yes         Yes
  Power efficient                No            Yes         Yes

14 Transforming from multicore or superscalar to tiled. [Diagram: starting from a CMP/multicore, add a scalable SON; starting from a superscalar, add scalability; both paths arrive at a tiled design.]

15 The interesting region. [Chart: as before, now with Raw (tiled) and "Famous Brand 2" added alongside Power4 (on-chip) and the non-scalable superscalar.]

16 Scalability problems in wide-issue microprocessors. [Block diagram: a single PC, control, wide fetch (16 inst), unified load/store queue, monolithic RF, ALUs, and bypass network.]

17 Area and frequency scalability problems: for N ALUs, the centralized register file grows roughly as ~N^3 and the bypass network as ~N^2 (ex: Itanium 2). Without modification, frequency decreases linearly or worse.
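
As a quick illustration of why these growth rates hurt, the sketch below just evaluates the quoted asymptotic trends for a few issue widths; the units are arbitrary and only the exponents come from the slide.

    #include <math.h>
    #include <stdio.h>

    /* Asymptotic trends quoted on the slide: register file area grows
     * roughly as N^3 and the bypass network as N^2 with issue width N.
     * Constant factors are dropped, so only the growth is meaningful. */
    int main(void)
    {
        for (int n = 2; n <= 64; n *= 2) {
            double rf_area     = pow(n, 3);   /* ~N^3 (arbitrary units) */
            double bypass_area = pow(n, 2);   /* ~N^2 (arbitrary units) */
            printf("N=%2d  RF ~ %8.0f   bypass ~ %6.0f\n",
                   n, rf_area, bypass_area);
        }
        return 0;
    }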

18 Operand routing is global. [Diagram: an operand produced by one ALU (>>) must cross the full bypass network and RF to reach the consuming ALU (+).]

19 Idea: make operand routing local. [Diagram: ALUs, bypass net, RF.]

20 Idea: exploit locality. [Diagram: ALUs and RFs arranged so bypass paths between communicating ALUs stay short.]

21 Replace the crossbar with a point-to-point, pipelined, routed scalar operand network. [Diagram: ALU/RF pairs connected by the routed network.]

22 Replace the crossbar with a point-to-point, pipelined, routed scalar operand network. [Diagram: the producer (>>) and consumer (+) instructions now communicate over the routed network.]

23 Operand transport scaling: bandwidth and area, for N ALUs and N^(1/2) bisection bandwidth.
              Un-pipelined crossbar bypass       Point-to-point routed mesh network
              (as in conventional superscalar)   (scales as 2-D VLSI)
  Local BW    ~ N^(1/2)                          ~ N
  Area        ~ N^2                              ~ N
We can route more operands per unit time if we are able to map communicating instructions nearby.

24 Operand transport scaling: latency. Time for an operand to travel between instructions mapped to different ALUs.
                               Un-pipelined crossbar   Point-to-point routed mesh network
  Non-local placement          ~ N                     ~ N^(1/2)
  Locality-driven placement    ~ N                     ~ 1
There is a latency bonus if we map communicating instructions nearby so communication is local.
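
A similar sketch for the two transport tables above: it evaluates the crossbar and mesh growth rates for a few ALU counts, again dropping all constant factors, so the numbers are comparable only within a row.

    #include <math.h>
    #include <stdio.h>

    /* Evaluates the asymptotic rows of the bandwidth/area and latency
     * tables above for a few ALU counts N. Only the growth rates come
     * from the slides. */
    int main(void)
    {
        for (int n = 4; n <= 256; n *= 4) {
            printf("N = %d\n", n);
            printf("  local bandwidth: crossbar ~ %6.1f   mesh ~ %6.1f\n",
                   sqrt((double)n), (double)n);
            printf("  area:            crossbar ~ %6.1f   mesh ~ %6.1f\n",
                   (double)n * n, (double)n);
            printf("  latency (locality-driven placement): crossbar ~ %3d   mesh ~ 1\n",
                   n);
        }
        return 0;
    }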

25 Distribute the register file. [Diagram: each ALU gets its own RF.]

26 [Diagram: the distributed ALU/RF array with its routed network is now SCALABLE; the centralized control, wide fetch (16 inst), unified load/store queue, and single PC remain.]

27 More scalability problems remain: the centralized control, wide fetch (16 inst), unified load/store queue, and single PC.

28 Distribute the rest: Raw, a fully tiled microprocessor. [Diagram: the centralized control, wide fetch (16 inst), unified load/store queue, and single PC are replaced by 16 tiles, each with its own I$, PC, and D$ alongside its ALU and RF.]

29 Tiles! [Diagram: the 4x4 array of tiles, each containing ALU, RF, I$, PC, and D$.]

30 Tiles!

31 Tiled Microprocessors
- fast inter-tile communication through the SON
- easy to scale (same reasons as multicore)

32 Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

33 Raw Microprocessor: a tiled, scalable microprocessor with point-to-point pipelined networks; 16 tiles, 16 issue. Each 4 mm x 4 mm tile has a MIPS-style compute processor (single-issue 8-stage pipe, 32b FPU, 32K D cache and I cache) and 4 on-chip networks: two for operands, one for cache misses, one for message passing.

34 Raw microprocessor components. [Diagram: the compute processor contains a fetch unit, instruction cache, data cache, functional units, and the execution core bypass paths (the intra-tile SON). A switch processor, with its own instruction cache and crossbar, implements the static router that carries the inter-tile SON. Two dynamic routers ("GDN" and "MDN") form the generalized transport networks, serving untrusted and trusted clients respectively, and inter-tile network links connect neighboring tiles.]

35 Raw compute processor internals. [Pipeline diagram: compute pipeline stages IF, RF, D, A/TL, M1, M2, FP, E, U; registers r24-r27 are mapped to the network ports, so an instruction such as fadd r24, r25, r26 reads its operands from and writes its result to the SON directly.]

36 Tile-Tile Communication add $25,$1,$2

37 Tile-Tile Communication add $25,$1,$2 Route $P->$E

38 Tile-Tile Communication add $25,$1,$2 Route $P->$E Route $W->$P

39 Tile-Tile Communication add $25,$1,$2 sub $20,$1,$25 Route $P->$E Route $W->$P
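
The four slides above build up one operand transfer: a producer tile computes add $25,$1,$2 and the result hops through the static routers (Route $P->$E on the producer's switch, Route $W->$P on the consumer's) into the consumer's sub $20,$1,$25. The C sketch below paraphrases the same flow; raw_send_east and raw_recv_west are hypothetical intrinsics invented here for illustration, whereas on the real machine the transfer is expressed simply by writing and reading network-mapped registers while the routes run on the switch processors.

    /* Hypothetical intrinsics standing in for network-mapped register
     * accesses; these names are invented, not real Raw or compiler APIs. */
    extern void raw_send_east(int value);  /* write the outgoing network port  */
    extern int  raw_recv_west(void);       /* read the incoming network port   */

    /* Producer tile: compute and inject the result into the SON. */
    void producer_tile(int r1, int r2)
    {
        int result = r1 + r2;       /* add $25,$1,$2              */
        raw_send_east(result);      /* operand leaves via the static router */
    }

    /* Consumer tile: the operand arrives from the west and feeds the sub. */
    int consumer_tile(int r1)
    {
        int operand = raw_recv_west();
        return r1 - operand;        /* sub $20,$1,$25             */
    }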

40 Compilation. RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
Source:
  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ....
After renaming into single-assignment form, ready for placement onto tiles:
  pval5=seed.0*6.0
  pval4=pval5+2.0
  tmp3.6=pval4/3.0
  tmp3=tmp3.6
  v3.10=tmp3.6-v2.7
  v3=v3.10
  v2.4=v2
  pval3=seed.0*v2.4
  tmp2.5=pval3+2.0
  tmp2=tmp2.5
  pval6=tmp1.3-tmp2.5
  v2.7=pval6*5.0
  v2=v2.7
  seed.0=seed
  pval1=seed.0*3.0
  pval0=pval1+2.0
  tmp0.1=pval0/2.0
  tmp0=tmp0.1
  v1.2=v1
  pval2=seed.0*v1.2
  tmp1.3=pval2+2.0
  tmp1=tmp1.3
  pval7=tmp1.3+tmp2.5
  v1.8=pval7*3.0
  v1=v1.8
  v0.9=tmp0.1-v1.8
  v0=v0.9

41 One cycle in the life of a tiled micro. [Diagram: the tile array concurrently runs httpd, a 4-way automatically parallelized C program, and a 2-thread MPI app, with a direct I/O stream feeding the scalar operand network, memory traffic, and idle tiles sleeping ("Zzz...").] An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application…

42 One streaming application on Raw. [Diagram: the 16 tiles (Tile 0 through Tile 15) with the application's communication pattern overlaid.] The traffic patterns are very different from RawCC-style parallelization.

43 Auto-parallelization approach #2: StreamIt language + compiler. [Diagram: the original stream graph, with splitters and joiners feeding parallel FIRFilter / Vec Mult / Magnitude / Detector pipelines, and the graph after fusion, with the filters fused into fewer pipelines.]

44 End results: auto-parallelized by MIT StreamIt to 8 tiles. [Diagram: the fused FIRFilter / Vec Mult / Magnitude / Detector pipelines and the joiner mapped onto the tiles.]

45 AsTrO taxonomy: classifying SON diversity along three axes.
- Assignment (Static/Dynamic): is instruction assignment to ALUs predetermined?
- Transport (Static/Dynamic): are operand routes predetermined?
- Ordering (Static/Dynamic): is the execution order of the instructions assigned to a node predetermined?
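
One way to read the taxonomy: each SON answers three independent static-vs-dynamic questions. A minimal C sketch (the type and field names are invented here), with Raw's static SON filled in as an example, since all three of its decisions are made at compile time:

    /* AsTrO taxonomy: is each decision made at compile time (static)
     * or at run time (dynamic)? */
    enum astro { STATIC, DYNAMIC };

    struct son_class {
        enum astro assignment; /* which ALU executes each instruction?      */
        enum astro transport;  /* which route does each operand take?       */
        enum astro ordering;   /* in what order does a node run its insts?  */
    };

    /* Raw's static SON fixes all three decisions at compile time. */
    static const struct son_class raw_static = { STATIC, STATIC, STATIC };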

46 Microprocessor SON diversity using the AsTrO taxonomy. [Tree diagram: the Static/Dynamic choices for Assignment, Transport, and Ordering classify Raw, RawDyn, Scale, TRIPS, ILDP, and WaveScalar.]

47 Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

48 Raw Chips October 02

49 Raw: 16 tiles (16 issue); 180 nm ASIC (IBM SA-27E); ~100 million transistors; 1 million gates; 3-4 years of development; 1.5 years of testing; 200K lines of test code. Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V, competitive with IBM-implemented PowerPCs in the same process. 18 W average power.

50 Raw motherboard. Support chipset implemented in FPGA.

51

52

53 A Scalable Microprocessor in Action [Taylor et al, ISCA ’04]

54 Conclusions. Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON). These SONs can be characterized by a 5-tuple and the AsTrO classification. The 180 nm 16-issue Raw prototype demonstrates that the approach is feasible; 64+-issue is possible in today's VLSI processes. Multicore machines could benefit by adding an inter-node SON for cheap communication.


