A tiled processor architecture prototype: the Raw microprocessor (October 2002)


1 A tiled processor architecture prototype: the Raw microprocessor (October 2002)

2 Tiled Processor Architecture (TPA)
[Tile diagram: SMEM and SWITCH (with its own PC) next to DMEM, IMEM, REG, PC, FPU, ALU]
- Short, programmable wires
- Lots of ALUs and registers
- Lower power
- Programmable; supports ILP and streams

3 A Prototype TPA: The Raw Microprocessor
[Figure: the Raw chip as a grid of tiles with off-chip streams (disk stream, Video1, RDRAM, packet stream); a Raw tile contains SMEM, SWITCH (PC), DMEM, IMEM, REG, PC, FPU, ALU; the Raw switch contains its own PC and SMEM]
- Software-scheduled interconnects (can use static or dynamic routing, but the compiler determines instruction placement and routes)
[Billion-transistor IEEE Computer issue, '97]

4 Tight integration of interconnect
[Figure: the tile pipeline (IF, RF, D, ATL, M1, M2, FP, E, U, TV, F4, WB) with a 0-cycle "local bypass network"; registers r24-r27 are mapped onto the input FIFOs from the static router and the output FIFOs to the static router]
- Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks

5 How to "program the wires"
[Figure: two neighbouring tiles, each with a software-controlled crossbar in its switch]
- The producing tile (Tile 11) executes fmul r24, r3, r4, writing its result into a network-mapped register, and its switch executes route P->E
- The consuming tile (Tile 10) has its switch execute route W->P, and its processor reads the operand with fadd r5, r3, r25

6 The result of orchestrating the wires
[Figure: the tile array running a mix of workloads at once: an ILP computation spread across several tiles, a custom datapath pipeline, memory tiles, httpd, a C program, an MPI program, and idle ("zzzz") tiles]

7 Perspective
- We have replaced the bypass paths, ALU-register bus, FPU-integer bus, register-cache bus, cache-memory bus, etc.
- ... with a general, point-to-point, routed interconnect called a scalar operand network (SON)
- A fundamentally new kind of network, optimized for both scalar and stream transport

8 Programming models and software for tiled processor architectures
- Conventional scalar programs (C, C++, Java), or: how to do ILP
- Stream programs

9 Scalar (ILP) program mapping
E.g., start with a C program; several transformations later [Lee, Amarasinghe et al., "Space-time scheduling", ASPLOS '98]:
  v2.4 = v2
  seed.0 = seed
  v1.2 = v1
  pval1 = seed.0 * 3.0
  pval0 = pval1 + 2.0
  tmp0.1 = pval0 / 2.0
  pval2 = seed.0 * v1.2
  tmp1.3 = pval2 + 2.0
  pval3 = seed.0 * v2.4
  tmp2.5 = pval3 + 2.0
  pval5 = seed.0 * 6.0
  pval4 = pval5 + 2.0
  tmp3.6 = pval4 / 3.0
  pval6 = tmp1.3 - tmp2.5
  v2.7 = pval6 * 5.0
  pval7 = tmp1.3 + tmp2.5
  v1.8 = pval7 * 3.0
  v0.9 = tmp0.1 - v1.8
  v3.10 = tmp3.6 - v2.7
  tmp2 = tmp2.5
  v1 = v1.8
  tmp1 = tmp1.3
  v0 = v0.9
  tmp0 = tmp0.1
  v3 = v3.10
  tmp3 = tmp3.6
  v2 = v2.7
Existing languages will work.
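For reference, here is a minimal C sketch of the kind of source the compiler could start from. It is a hypothetical reconstruction read directly off the three-address code above (variable names kept, everything else assumed), not code taken from the talk:

  /* Hypothetical source fragment; every live-out value is written back,
     mirroring the tmp and v copy-out statements above. */
  void kernel(float seed, float *v0, float *v1, float *v2, float *v3,
              float *tmp0, float *tmp1, float *tmp2, float *tmp3)
  {
      *tmp0 = (seed * 3.0f + 2.0f) / 2.0f;
      *tmp1 = seed * *v1 + 2.0f;
      *tmp2 = seed * *v2 + 2.0f;
      *tmp3 = (seed * 6.0f + 2.0f) / 3.0f;
      *v2 = (*tmp1 - *tmp2) * 5.0f;
      *v1 = (*tmp1 + *tmp2) * 3.0f;
      *v0 = *tmp0 - *v1;
      *v3 = *tmp3 - *v2;
  }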

10 Scalar program mapping
[Figure: the same program shown twice: on the left, its data-dependence graph with one node per statement; on the right, the program code from the previous slide]

11 Program graph clustering
[Figure: the same data-dependence graph, with the nodes grouped into clusters of statements that will be kept together]

12 Placement
[Figure: the clusters assigned to tiles: the seed/tmp0 chain on Tile1, the pval5/tmp3/v3 chain on Tile2, the v2.4/tmp2/v2 chain on Tile3, and the v1.2/tmp1/v1/v0 chain on Tile4]

13 Routing: processor code and switch code per tile

Tile1 processor: seed.0=seed; send(seed.0); pval1=seed.0*3.0; pval0=pval1+2.0; tmp0.1=pval0/2.0; send(tmp0.1); tmp0=tmp0.1
Tile1 switch: route(t,E,S); route(t,E)

Tile2 processor: seed.0=recv(); pval5=seed.0*6.0; pval4=pval5+2.0; tmp3.6=pval4/3.0; tmp3=tmp3.6; v2.7=recv(); v3.10=tmp3.6-v2.7; v3=v3.10
Tile2 switch: route(W,S,t); route(W,S); route(S,t)

Tile3 processor: v2.4=v2; seed.0=recv(); pval3=seed.0*v2.4; tmp2.5=pval3+2.0; tmp2=tmp2.5; send(tmp2.5); tmp1.3=recv(); pval6=tmp1.3-tmp2.5; v2.7=pval6*5.0; send(v2.7); v2=v2.7
Tile3 switch: route(N,t); route(t,E); route(E,t); route(t,E)

Tile4 processor: v1.2=v1; seed.0=recv(); pval2=seed.0*v1.2; tmp1.3=pval2+2.0; send(tmp1.3); tmp1=tmp1.3; tmp2.5=recv(); pval7=tmp1.3+tmp2.5; v1.8=pval7*3.0; v1=v1.8; tmp0.1=recv(); v0.9=tmp0.1-v1.8; v0=v0.9
Tile4 switch: route(N,t); route(t,W); route(W,t); route(N,t); route(W,N)
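One way to read the send/recv and route instructions above (a mental model on my part, not code from the talk) is as blocking operations on small hardware FIFOs between a processor and its switch: a sender stalls while the FIFO is full and a receiver stalls while it is empty, which is what lets the compiler fix the communication order at compile time without explicit synchronization. A minimal C sketch of that behaviour, with invented names and an arbitrary depth:

  #include <stdint.h>

  #define FIFO_DEPTH 4   /* illustrative depth, not a Raw parameter */

  typedef struct {
      volatile float    slots[FIFO_DEPTH];
      volatile uint32_t head;   /* next slot to read  */
      volatile uint32_t tail;   /* next slot to write */
  } fifo_t;

  /* send(): blocks while the FIFO is full (backpressure stalls the sender). */
  static void fifo_send(fifo_t *f, float v)
  {
      while (f->tail - f->head == FIFO_DEPTH)
          ;                                   /* spin until a slot frees up */
      f->slots[f->tail % FIFO_DEPTH] = v;
      f->tail++;
  }

  /* recv(): blocks while the FIFO is empty (the consumer waits for its operand). */
  static float fifo_recv(fifo_t *f)
  {
      while (f->tail == f->head)
          ;                                   /* spin until an operand arrives */
      float v = f->slots[f->head % FIFO_DEPTH];
      f->head++;
      return v;
  }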

14 Instruction Scheduling
[Figure: the per-tile processor and switch instruction streams from the previous slide laid out along a time axis, showing how the computation on Tile1-Tile4 overlaps with the route instructions on their switches]

15 Raw die photo: 0.18 micron process, 16 tiles, 425 MHz, 18 W (vpenta). Of course, a custom IC designed by an industrial design team could do much better.

16 Raw motherboard

17 Generation 1
- Focus on computation: programmable cores, domain-specific cores, L1 caches
- Reuse level raised from standard cells (SC) to IP blocks
- Communication: straightforward buses + bridges (heterogeneous)
- Data is communicated via external memory under synchronization control of a programmable core
- Example: Nexperia (0.18 µm / 8M, 1.8 V / 4.5 W, 75 clock domains, 35 M transistors)
[Block diagram: MPEG, MBS + VIP, MMI + AICP, 1394, MSP, M-PI, MIPS, TriMedia VLIW, T-PI, conditional access, CAB]

18 Conventional architectures (Nexperia)
[Figure: a symmetric multiprocessor [Culler]: processors with caches, accelerators (Acc), and memory on a bus-based interconnect, managed by an (RT)OS; alongside it, a task graph with jobs (job1, job2) made of tasks with inputs (in1, in2) and an output, running at f_h = 16 MHz]
- Rationale: driven by flexibility (applications are unknown at design time)
- Dynamic load balancing via flexible binding of Kahn processes to processors (see the sketch below)
- Extension of well-known computer architectures (shared memory, caches, ...), adopting a general-purpose view and using existing skills
- Key issue: cache coherency and memory consistency
- Performance analysis via simulation
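To make "flexible binding of Kahn processes" concrete, here is a minimal, runnable C sketch of one Kahn-style task: it communicates only through FIFO channels, so an RTOS is free to bind (or rebind) it to any processor without changing its behaviour. The fifo_t type, its helpers, and the task body are invented for illustration; a real implementation would block on empty/full channels.

  #include <stdio.h>

  #define CAP 16
  typedef struct { int buf[CAP]; int head, tail; } fifo_t;

  static int  fifo_read (fifo_t *f)        { return f->buf[f->head++ % CAP]; } /* blocking in a real system */
  static void fifo_write(fifo_t *f, int v) { f->buf[f->tail++ % CAP] = v; }    /* blocks when full in a real system */

  /* One task of the task graph: reads a token from each input, emits one output. */
  static void task_f(fifo_t *in1, fifo_t *in2, fifo_t *out)
  {
      int a = fifo_read(in1);
      int b = fifo_read(in2);
      fifo_write(out, a + b);               /* placeholder computation */
  }

  int main(void)
  {
      fifo_t in1 = {0}, in2 = {0}, out = {0};
      fifo_write(&in1, 3);
      fifo_write(&in2, 4);
      task_f(&in1, &in2, &out);             /* the RTOS could run this firing on any CPU */
      printf("%d\n", fifo_read(&out));      /* prints 7 */
      return 0;
  }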

19 Problems (1): timing, events
[Figure: a CPU and coprocessors C_1, C_2, ..., C_20 sharing an external memory; numbered arrows (1-4) trace a transfer through memory; the software stack spans the drivers / SD level, the TM sync level, and the application level, with producer and consumer rates f_A and f_B]
- Classic approach: processors communicate via SDRAM under synchronisation control of the CPU
- P1: extra claims on a scarce resource (bandwidth) -> point-to-point communication
- P2: lots of events exchanged with the higher SW levels -> start when data is available

20 Problems (2): timing and processor stalls
- Processor stalls, e.g. 60%, with large variation: miss penalty (best case, average case, worst case), miss rate, and unpredictability at every arbiter (caches, memory, busses)
- Programming effort? "Easy to program, hard to tune"
- Cost?
[Figure: Task B accessing the L1$ and a way of the L2$, missing to SDRAM: 3% miss rate, BC = 8 cc, AC = 20 cc, WC = 3000 cc]
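A rough sanity check of these numbers, assuming (this is not stated on the slide) about one cache access per instruction and one base cycle per access:

  \[
    \text{stall cycles per access} \approx \text{miss rate} \times \text{average penalty}
      = 0.03 \times 20\,\text{cc} = 0.6\,\text{cc},
    \qquad
    \frac{0.6\,\text{cc}}{1\,\text{cc}} \approx 60\%\ \text{stall overhead}.
  \]

The same 3% miss rate against the 3000 cc worst-case penalty gives 90 stall cycles per access on the unlucky runs, which is exactly the "large variation" the slide warns about.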

21 A typical video flow graph [Janssen, Dec. 02]

22 Problems (3): end-to-end timing?
- Interaction between multiple local arbiters
[Figure: TM-CPUs (each with I$ and D$), an STC, and an ASIP sharing a DDR SDRAM]

23 Problems (4): compositionality
- Multiple independent applications active simultaneously
[Figure: the same platform (TM-CPUs with I$ and D$, STC, ASIP, DDR SDRAM) with several applications mapped onto it at once]

24 Generation 1: Problem summary
[Figure: processors with I$ and D$ sharing a DDR SDRAM]
- Timing: events (coarse-level sync) are latency critical; a 3% miss rate with a 20 cc penalty -> 60% stalls; end-to-end timing behavior of the application; interaction of local arbiters
- Composition of several applications sharing resources (virtualization)
- Power: 2x power dissipation
- Area: expensive caches
- NRE cost (>20 M$ per the ITRS, due to SW): verification by simulation

25 Towards a solution
- Distributed systems: tiles will become very much autonomous; timing via GALS (globally asynchronous, locally synchronous) techniques
- For performance and predictability reasons we want to decouple communication from computation: tiles run independently from other tiles and from communication actions
- Add a communication assist (CA): it acts as an initiator of a communication action, arbitrates the access to the memory, and can stall the processor
- This way communication and computation concerns are separated (see the sketch below)
[Figure: a tile containing computation, local memory, and a CA, with master and slave ports]
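As an illustration of how a CA separates the two concerns, here is a minimal C sketch of the programming view it might offer. The ca_job_t fields and the ca_post/ca_wait calls are invented names for this sketch, not an actual CA register map or driver API:

  #include <stdint.h>

  /* Hypothetical job descriptor handed to the CA. */
  typedef struct {
      uint32_t remote_addr;   /* source in another tile or in SDRAM        */
      uint32_t local_addr;    /* destination in this tile's local memory   */
      uint32_t bytes;         /* transfer size                             */
  } ca_job_t;

  /* Stubs: in hardware these would write the descriptor to the CA and be
     held up by its completion signal.                                      */
  static void ca_post(const ca_job_t *job) { (void)job; /* CA starts the transfer */ }
  static void ca_wait(const ca_job_t *job) { (void)job; /* CA stalls the CPU here  */ }

  /* The processor posts the fetch, keeps computing on data that is already
     local, and is stalled by the CA only when it actually needs the new
     data: communication (CA) and computation (CPU) are separate concerns.  */
  void process_block(uint32_t in_remote, uint32_t in_local, uint32_t n)
  {
      ca_job_t fetch = { in_remote, in_local, n };
      ca_post(&fetch);
      /* ... compute on previously fetched data here ... */
      ca_wait(&fetch);
      /* ... now compute on the data at in_local ...      */
  }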

26 Gen. 2 Architecture
[Figure: processor clusters (processor + MEM + CA, with a stall signal from the CA to the processor), memory-only clusters (MEM + CA), and an SDRAM controller cluster, all connected to a network on chip]
- Cluster/tile = computation + local memory; heterogeneous: CPU, DSP, ASIPs, ASICs, memory only, IO
- Clusters are autonomous
- The communication is done via an on-chip network [Culler][Hijdra]
- "A generic scalable multiprocessor architecture = a collection of essentially complete computers, including one or more processors and memory, communicating through a general-purpose high-performance scalable interconnect and a communication assist." [C&S, pp. 51]

