Lecture 18: Introduction to Multiprocessors


1 Lecture 18: Introduction to Multiprocessors
Prepared and presented by: Kurt Keutzer with thanks for materials from Kunle Olukotun, Stanford; David Patterson, UC Berkeley

2 Why Multiprocessors?
Needs
- Relentless demand for higher performance: servers, networks
- Commercial desire for product differentiation
Opportunities
- Silicon capability
- Ubiquitous computers

3 Exploiting (Program) Parallelism
Levels of parallelism: Instruction, Loop, Thread, Process
Grain size (instructions): 1, 10, 100, 1K, 10K, 100K, 1M
The goal of any high-performance architecture is to exploit program parallelism, and parallelism exists at multiple levels. At the lowest level is the parallelism between instructions in a basic block. Above that is the loop-level parallelism between loop iterations. Above this is thread-level parallelism that comes from parallelizing a single application either manually or automatically. At the highest level is the parallelism between separate processes in a multiprogramming environment. The granularity of parallelism typically increases as the level of parallelism increases. Superscalar architectures concentrate on exploiting ILP and to some extent LLP, but these techniques are complex and do not scale well in advanced semiconductor technologies. So this begs the question...

4 Exploiting (Program) Parallelism - 2
Levels of parallelism: Bit, Instruction, Loop, Thread, Process
Grain size (instructions): 1, 10, 100, 1K, 10K, 100K, 1M

5 Need for Parallel Computing
Diminishing returns from ILP
- Limited ILP in programs
- ILP increasingly expensive to exploit
Peak performance increases linearly with more processors
- Amdahl's law applies
Adding processors is inexpensive
- But most people add memory as well
(Figure: performance vs. die area for the P+M, 2P+M, and 2P+2M configurations)
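As a reminder of the bound behind "Amdahl's law applies" (the standard formulation, not taken from the slide): if a fraction f of a program parallelizes perfectly across N processors,

    \mathrm{Speedup}(N) = \frac{1}{(1-f) + f/N} \;\le\; \frac{1}{1-f}

so, for example, f = 0.9 caps the speedup at 10x no matter how many processors (and how much memory) are added.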

6 What to do with a billion transistors?
Technology changes the cost and performance of computer elements in a non-uniform manner
- Logic and arithmetic are becoming plentiful and cheap
- Wires are becoming slow and scarce
This changes the tradeoffs between alternative architectures
- Superscalar doesn't scale well: global control and data
So what will the architectures of the future be?
(Figure: 1998-2007 projection - 64x the area, 4x the speed, but slower wires; cross-chip communication grows from 1 clk to 3 (10, 16, 20?) clks)

7 Elements of a multiprocessing system
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurable: computation, interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
[After E. V. Krishnamurty, Chapter 5]

8 Use, Granularity
General purpose - attempting to improve general-purpose computation (e.g. SPEC benchmarks) by means of multiprocessing
Special purpose - attempting to improve a specific application or class of applications by means of multiprocessing
Granularity - scope and capability of a processing element (PE):
- NAND gate
- ALU with registers
- Execution unit with local memory
- RISC R10000 processor

9 Topology
Topology - method of interconnection of processors:
- Bus
- Full-crossbar switch
- Mesh
- N-cube
- Torus
- Perfect shuffle, m-shuffle
- Cube-connected cycles
- Fat-trees

10 Coupling
Relationship of communication among processors:
- Shared clock (pipelined)
- Shared registers (VLIW)
- Shared memory (SMM)
- Shared network

11 Control/Data
Way in which data and control are organized:
- Control - how the instruction stream is managed (e.g. sequential instruction fetch)
- Data - how the data is accessed (e.g. numbered memory addresses)
Multithreaded control flow - explicit constructs (fork, join, wait) program the control flow; a central controller
Dataflow model - instructions execute as soon as operands are ready; the program structures the flow of data; decentralized control
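A minimal sketch of the multithreaded control-flow style in C with POSIX threads; the worker body is hypothetical, and pthread_create/pthread_join stand in for the slide's fork/join/wait constructs:

    #include <pthread.h>
    #include <stdio.h>

    /* an explicit thread of control, managed by fork/join-style constructs */
    static void *worker(void *arg) {
        printf("worker %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L); /* "fork" */
        pthread_join(t, NULL);                        /* "join"/"wait" */
        return 0;
    }

In the dataflow model there is no such explicit sequencing: an instruction fires whenever its operands arrive, so control is decentralized in the program graph rather than held by a central controller.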

12 Task allocation and routing
Way in which tasks are scheduled and managed:
- Static - allocation of tasks onto processing elements is pre-determined before runtime
- Dynamic - hardware/software support allocates tasks to processors at runtime
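A toy illustration of the distinction in C (the names and the 4-PE count are hypothetical): static allocation bakes the task-to-PE mapping in before runtime, while dynamic allocation resolves it at runtime, here with a shared atomic counter standing in for the hardware/software support:

    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_PES 4

    /* Static: the task-to-PE mapping is fixed before runtime. */
    static int static_owner(int task_id) { return task_id % NUM_PES; }

    /* Dynamic: PEs claim the next task from a shared counter at runtime. */
    static atomic_int next_task;
    static int claim_task(void) { return atomic_fetch_add(&next_task, 1); }

    int main(void) {
        printf("task 7 statically owned by PE %d\n", static_owner(7));
        printf("dynamically claimed task %d\n", claim_task());
        return 0;
    }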

13 Reconfiguration
Restructuring of computational elements:
- Computation: reconfigurable - reconfiguration at compile time; dynamically reconfigurable - restructuring of computational elements at runtime
- Interconnection scheme: switching network - software controlled; reconfigurable fabric

14 Programmer’s model
How is parallelism expressed by the user?
Expressive power
- Process-level parallelism: shared-memory, message-passing
- Operator-level parallelism
- Bit-level parallelism
Formal guarantees
- Deadlock-free
- Livelock-free
Support for other real-time notions
Exception handling

15 Parallel Programming Models
Message Passing
- Fork thread: typically one per node
- Explicit communication: send messages - send(tid, tag, message), receive(tid, tag, message)
- Synchronization: block on messages (implicit sync), barriers
Shared Memory (address space)
- Fork thread: typically one per node
- Implicit communication: using the shared address space - loads and stores
- Synchronization: atomic memory operations, barriers
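As a concrete (hypothetical) instance of the message-passing column, here is a producer/consumer handoff written with MPI, whose MPI_Send/MPI_Recv correspond to the slide's send(tid, tag, message)/receive(tid, tag, message); the blocking receive is the implicit synchronization, and MPI_Barrier is the barrier:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0,     /* explicit communication */
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0,     /* blocks: implicit sync  */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Barrier(MPI_COMM_WORLD);               /* barrier synchronization */
        MPI_Finalize();
        return 0;
    }

Run with at least two ranks, e.g. mpirun -np 2 ./a.out.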

16 Message Passing Multicomputers
Computers (nodes) connected by a network
- Fast network interface: send, receive, barrier
- Nodes no different from a regular PC or workstation
Cluster conventional workstations or PCs with a fast network: cluster computing
Examples: Berkeley NOW, IBM SP2
(Diagram: nodes, each with a processor P and memory M, attached to a network)

17 Shared-Memory Multiprocessors
Several processors share one address space
- Conceptually a shared memory
- Often implemented just like a multicomputer: the address space is distributed over private memories
Communication is implicit: read and write accesses to shared memory locations
Synchronization via shared memory locations
- Spin waiting for non-zero
- Barriers
(Diagram: conceptual model - processors sharing one memory over a network - vs. actual implementation - each processor with a private memory over a network)
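A minimal shared-memory sketch of "spin waiting for non-zero", using C11 atomics and POSIX threads (variable names are illustrative; the atomic store/load pair provides the ordering that makes the plain data store visible):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data;                        /* communicated implicitly through memory */
    atomic_int flag;                 /* synchronization variable               */

    static void *writer(void *arg) {
        (void)arg;
        data = 5;                    /* implicit communication: a store...     */
        atomic_store(&flag, 1);      /* ...then mark it ready                  */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        while (atomic_load(&flag) == 0)   /* spin waiting for non-zero */
            ;
        printf("read %d\n", data);        /* load of a shared location */
        pthread_join(t, NULL);
        return 0;
    }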

18 Cache Coherence - A Quick Overview
With caches, action is required to prevent access to stale data
- Processor 1 may read old data from its cache instead of new data in memory, or
- Processor 3 may read old data from memory rather than new data in Processor 2's cache
Solutions
- No caching of shared data: Cray T3D, T3E, IBM RP3, BBN Butterfly
- Cache coherence protocol: keep track of copies; notify (update or invalidate) on writes
(Diagram: P1..PN, each with a cache, on a network; memory holds A:3; access sequence P1: Rd(A), Rd(A), P2: Wr(A,5), P3: Rd(A))
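To make the hazard concrete, here is how the slide's access sequence plays out, assuming a write-invalidate protocol (one of the two notification choices the slide lists):

    Initially:  memory holds A = 3; no cache has a copy
    P1: Rd(A)   -> miss; A = 3 loaded into P1's cache
    P1: Rd(A)   -> hit in P1's cache (still 3)
    P2: Wr(A,5) -> protocol invalidates P1's copy; P2's cache holds A = 5
    P3: Rd(A)   -> miss; data supplied from P2's cache (5), not stale memory (3)

Without the protocol, P1 would keep reading 3 from its cache and P3 would read 3 from memory.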

19 Implementation issues
Underlying hardware implementation
- Bit-slice
- Board assembly
- Integration in a single integrated circuit
Exploitation of new technologies
- DRAM integration on IC
- Low-swing chip-level interconnect

20 Performance objectives
- Speed
- Power
- Cost
- Ease of programming/time to market/time to money
- In-field flexibility
Methods of measurement
- Modeling
- Emulation
- Simulation: transaction, instruction-set, hardware

21 Flynn’s Taxonomy of Multiprocessing
- Single-instruction single-datastream (SISD) machines
- Single-instruction multiple-datastream (SIMD) machines
- Multiple-instruction single-datastream (MISD) machines
- Multiple-instruction multiple-datastream (MIMD) machines
Examples?

22 Examples
- Single-instruction single-datastream (SISD) machines: non-pipelined uniprocessors
- Single-instruction multiple-datastream (SIMD) machines: vector processors (VIRAM)
- Multiple-instruction single-datastream (MISD) machines: network processors (Intel IXP1200)
- Multiple-instruction multiple-datastream (MIMD) machines: networks of workstations (NOW)

23 Predominant Approaches
- Pipelining: ubiquitous
- Much academic research focused on performance improvements of ``dusty decks'': Illiac 4 - speed-up of Fortran; SUIF, Flash - speed-up of C
- Niche market in high-performance computing: Cray
- Commercial support for high-end servers: shared-memory multiprocessors for the server market
- Commercial exploitation of silicon capability: general purpose - superscalar, VLIW; special purpose - VLIW for DSP, media processors, network processors; reconfigurable computing

24 C62x Pipeline Operation
Pipeline phases
- Fetch: PG Program Address Generate, PS Program Address Send, PW Program Access Ready Wait, PR Program Fetch Packet Receive
- Decode: DP Instruction Dispatch, DC Instruction Decode
- Execute: E1-E5, Execute 1 through Execute 5
Single-cycle throughput; the phases operate in lock step
(Diagram: execute packets 1-7, each flowing through PG PS PW PR DP DC E1 E2 E3 E4 E5 in successive cycles)
[19] You have your typical fetch - excuse me - fetch, decode, and execute phases of the pipeline. And then those are broken up into smaller subsets to allow for our 5 nanosecond cycle time. Most of our instructions are single-cycle execution.

25 Superscalar: PowerPC 604 and Pentium Pro
Both: in-order issue, out-of-order execution, in-order commit

26 IA-64 aka EPIC aka VLIW
Compiler schedules instructions
- Encodes dependencies explicitly: saves having the hardware repeatedly rediscover them
Supports speculation
- Speculative load
- Branch prediction
Really need to make communication explicit too
- Still has global registers and global instruction issue
(Diagram: instruction cache feeding global instruction issue and a global register file)

27 Philips Trimedia Processor

28 TMS320C6201 Revision 2
(Block diagram: C6201 CPU megamodule - program fetch, instruction dispatch, instruction decode, control registers, control logic, interrupts, emulation/test; data path 1 with units D1 M1 S1 L1 and the A register file; data path 2 with units L2 S2 M2 D2 and the B register file; program cache/program memory - 32-bit address, 256-bit data, 512K bits RAM; data memory - 32-bit address, 8-/16-/32-bit data, 512K bits RAM; external memory interface; 4-channel DMA; host port interface; 2 timers; 2 multi-channel buffered serial ports (T1/E1); power down)
[12] Here we have a picture, or a block diagram, of the fixed-point architecture. In the green we have the VLIW, or Very Long Instruction Word, architecture, giving us a performance of 1600 MIPS, running at 200 megahertz, allowing for a 5 nanosecond cycle time. Inside the CPU you'll see that we have eight functional units performing operations each cycle. These are eight independent functional units, each running off a 32-bit instruction word. It is a load-store architecture running off 3.3 volts for our I/Os and 2.5 volts for the CPU. The CPU supports 16-bit multiplies as well as 32- and 40-bit arithmetic. Inside the CPU we have dual datapaths with eight independent functional units, as I mentioned before. For the peripherals of the device, we have 1 megabit of memory divided between your program memory and your data memory. We also have a 32-bit external memory interface, or EMIF, supporting SDRAM, SRAM, or synchronous burst SRAM. We also have a four-channel DMA supporting boot loading. We have a 16-bit host port interface. We have two multi-channel serial ports and two 32-bit timers.

29 TMS320C6701 DSP Block Diagram
(Block diagram: 'C67x floating-point CPU core - program fetch, instruction dispatch, instruction decode, control registers, control logic, interrupts, test/emulation; data paths 1 and 2 with the A and B register files and units L1 S1 M1 D1 / D2 M2 S2 L2; program cache/program memory - 32-bit address, 256-bit data, 512K bits RAM; data memory - 32-bit address, 8-/16-/32-bit data, 512K bits RAM; external memory interface; 4-channel DMA; host port interface; 2 timers; 2 multi-channel buffered serial ports (T1/E1); power down)
[28] Here we have a block diagram of the C67, or the floating-point DSP...

30 TMS320C67x CPU Core
(Diagram: 'C67x floating-point CPU core - program fetch, control registers, instruction dispatch, instruction decode, control logic, test/emulation, interrupts; data paths 1 and 2 with the A and B register files and units L1 S1 M1 D1 / D2 M2 S2 L2)
Floating-point capabilities: arithmetic logic unit, auxiliary logic unit, multiplier unit
[30] Now to go into more detail about the changes between the 62x CPU and the 67x CPU. In this slide I have tried to blow up the diagram to show you where we have added floating-point capability. We've added floating-point capability to six of the eight total functional units: the ALU, or arithmetic logic unit, the auxiliary logic unit, and the multiplier all support floating point. The D-unit, or address calculation unit, doesn't care what kind of data it's looking at, so it didn't need floating-point capability. So we've only added floating-point capability to six of the eight functional units.

31 Single-Chip Multiprocessors (CMP)
Build a multiprocessor on a single chip
- Linear increase in peak performance
- Advantage of fast interaction between processors
Fine-grain threads
- Make communication and synchronization very fast (1 cycle)
- Break the problem into smaller pieces
Memory bandwidth
- Makes more effective use of limited memory bandwidth
Programming model
- Need parallel programs
(Diagram: four processors, each with a private cache, sharing a cache and memory)

32 Intel IXP1200 Network Processor
(Block diagram: SDRAM controller, SRAM, PCI interface, StrongARM (SA) core with mini-dcache, icache, and scratchpad, six microengines, hash engine, IX bus)
- 6 micro-engines: RISC engines, 4 contexts/engine, 24 threads total
- IX bus interface: packet I/O; connects IXPs; scalable
- StrongARM: less critical tasks
- Hash engine: level 2 lookups
- PCI interface
126 mm², 6.5 million transistors, 432 pins, BGA package
StrongARM SA-1100 in a 0.35 micron process; IXP1200 in 0.28 micron with three metal layers
6 RISC engines, 4 contexts each -> 24 threads
The IX bus interface connects to the SRAM, SDRAM, PCI interface, and other companion IXP1200s; the architecture is designed to be scalable
We did not touch the StrongARM here
We removed the use of the hash engine since we are dealing with IP packets

33 IXP1200 MicroEngine
(Diagram: ALU fed by 64 GPRs (A-bank), 64 GPRs (B-bank), 32 SRAM and 32 SDRAM read transfer registers; write transfer registers to SRAM and SDRAM)
- 32-bit RISC instruction set; all instructions execute in a single cycle
- Multithreading support for 4 threads; maximum switching overhead of 1 cycle
- 128 32-bit GPRs in two banks of 64
- Programmable 1KB instruction store (not shown in diagram)
- 32-bit transfer registers
- Command bus arbiter and FIFO (not shown in diagram)
Multithreading support for 4 threads
- Usually a zero-overhead context swap: fill with a deferred instruction (like a branch delay slot)
- 1 cycle if a thread polls for another thread but does not find one
- Threads explicitly say they are going to sleep until a specified signal event occurs, or signal a swap when using other IXP1200 resources (e.g. SRAM, SDRAM, PCI, HASH)
GPRs
- 32 registers per thread are exclusive (relative addressing); absolute addressing allows sharing between the threads
Transfer registers
- 8 SRAM and 8 SDRAM read, 8 SRAM and 8 SDRAM write per thread (relative addressing), or all are visible using absolute addressing
The command bus arbiter and FIFO manage accesses on the IXP1200 bus

34 IXP1200 Instruction Set
Powerful ALU instructions: can manipulate a word and part of a word quite effectively
Swap-thread on memory reference hides memory latency:
    sram[read, r0, base1, offset, 1], ctx_swap
Can use an "intelligent" DMA-like controller to copy packets to/from memory:
    sdram[t_fifo_wr, --, pkt_bffr, offset, 8]
Exposed branch behavior: can fill variable branch slots; can select a static prediction on a per-branch basis
ARM:
    mov r1, r0, lsl #16
    mov r1, r1, r0, asr #16
    add r0, r1, r0, asr #16
IXP1200:
    ld_field_w_clr[temp, 1100, accum]
    alu_shf[accum, temp, +, accum, <<16]

35 UCB: Processor with DRAM (PIM) - IRAM, VIRAM
Put the processor and the main memory on a single chip
- Much lower memory latency
- Much higher memory bandwidth
But need to build systems with more than one chip
64Mb SDRAM chip
- Internal: K subarrays, 4 bits per subarray each 10ns -> 51.2 Gb/s
- External: 8 bits at 10ns -> 800Mb/s
Area equivalences: 1 integer processor ~ 100KBytes DRAM; 1 FP processor ~ 500KBytes DRAM; 1 vector unit ~ 1 MByte DRAM

36 IRAM Vision Statement
Microprocessor & DRAM on a single chip:
- On-chip memory latency 5-10X, bandwidth X
- Improve energy efficiency 2X-4X (no off-chip bus)
- Serial I/O 5-10X v. buses
- Smaller board area/volume
- Adjustable memory size/width
$B for separate lines for logic and memory; single chip: either the processor in a DRAM fab or the memory in a logic fab
(Diagram: today's split - processor with caches from a logic fab, DRAM on a bus from a memory fab - vs. a single chip combining Proc, $, and DRAM)

37 Potential Multimedia Architecture
"New" model: VSIW = Very Short Instruction Word!
- Compact: describe N operations with 1 short instruction
- Predictable (real-time) performance vs. statistical performance (cache)
- Multimedia ready: choose N*64b, 2N*32b, 4N*16b
- Easy to get high performance
- Compiler technology already developed, for sale! Don't have to write all programs in assembly language
Why MPP? Best potential performance! Few successes
Operate on vectors of registers; it's easier to vectorize than parallelize
Scales well: more hardware and a slower clock rate
Crazy research
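A sketch of what "describe N operations with 1 short instruction" means at the source level; this is plain C (my example, not from the slide), and a vectorizing compiler for such a machine would map each strip of the loop to a single vector add over N 64-bit elements (or 2N 32-bit / 4N 16-bit elements for the narrower multimedia types):

    #include <stddef.h>
    #include <stdint.h>

    /* c[i] = a[i] + b[i]: one short vector instruction per N elements,
     * instead of N scalar adds plus loop overhead */
    void vadd64(int64_t *c, const int64_t *a, const int64_t *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }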

38 Revive Vector (= VSIW) Architecture!
Objection -> answer:
- Cost: ≈ $1M each? -> Single-chip CMOS MPU/IRAM
- Low latency, high BW memory system? -> IRAM
- Code density? -> Much smaller than VLIW
- Compilers? -> For sale, mature (>20 years)
- Performance? -> Easy to scale speed with technology
- Power/Energy? -> Parallel to save energy, keep performance
- Limited to scientific applications? -> Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
- Supercomputer industry dead? -> Very attractive to scale; new class of applications
Before, vector machines had a lousy scalar processor; a modest CPU will do well on many programs, and vectors do great on others.

39 V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 16MB
(Diagram: 2-way superscalar vector instruction processor with 16K I-cache and 16K D-cache; vector registers and +, x, ÷, and load/store pipes at 4 x 64b, 8 x 32b, or 16 x 16b, each 2-way; load/store queue; 4 x 64b paths to a memory crossbar switch connecting banks of DRAM (M); serial I/O)
1Gbit technology
Put in perspective: 10X of a Cray T90 today

40 Tentative VIRAM-1 Floorplan
- 0.18 µm DRAM, MB in 16 banks x 256b
- 0.18 µm, 5-metal logic
- ≈ 200 MHz MIPS IV, K I$, 16K D$
- ≈ MHz FP/int. vector units
- Die: ≈ 20x20 mm; transistors: ≈ M; power: ≈ 2 Watts
(Floorplan: memory (128 Mbits / 16 MBytes) above and below 4 vector pipes/lanes, CPU+$, ring-based switch, and I/O)
Floor plan showing memory in purple, crossbar in blue (need to match the vector unit, not the maximum memory system), vector units in pink, CPU in orange, I/O in yellow. How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology: MIPS - mm² for 1 FPU (mul, add, misc); IBM Power3 - 48 mm² for 2 FPUs (2 mul/add units); HAL SPARC III - 40 mm² for 2 FPUs (2 multiply, add units).

41 Tentative VIRAM-"0.25" Floorplan
Demonstrate scalability via a 2nd layout (automatic from the 1st)
- 8 MB in 2 banks x 256b, 32 subbanks
- ≈ 200 MHz CPU, 8K I$, 8K D$
- 1 ≈ 200 MHz FP/int. vector unit
- Die: ≈ 5 x 20 mm; transistors: ≈ 70M; power: ≈ 0.5 Watts
(Floorplan: CPU+$, 1 vector unit, memory (32 Mb / 4 MB))
(Table: kernel GOPS for V-1 vs. V-0.25 on Comp, iDCT, Clr.Conv, Convol, FP Matrix)

42 Stanford: Hydra Design
Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Separate read and write busses
- Data speculation support

43 Mescal Architecture
Scott Weber, University of California at Berkeley
GSRC annual review

44 Outline
- Architecture rationale and motivation
- Architecture goals
- Architecture template
- Processing elements
- Multiprocessor architecture
- Communication architecture
Architecture rationale: Tensilica experience -> configurability for scalar -> configurability for EPIC -> configurability for MP
Architecture goals: configuration in functional units, retiming structures, EPIC platform, shared-memory/message-passing MP network
Processing elements: EPIC-type configurable processors (FUs, registers, caches, memory)

45 Architectural Rationale and Motivation
Configurable processors have shown orders-of-magnitude performance improvements
- Tensilica has shown ~2x to ~50x performance improvements
- Specialized functional units
- Memory configurations
Tensilica matches the architecture with software development tools
(Diagram: a base PE - FU, RegFile, Memory, ICache - next to a configured PE that sets memory parameters and adds DCT and Huffman (HUF) blocks for a JPEG app)
Tensilica experience shows that configurability is good, but it quickly slows for a scalar architecture. Exploiting memory and compute bandwidth is easy if one is given the ability to do so; Tensilica limits the user by forcing one to store state or perform a prefetch-like instruction (in the background) - bandwidth is the real issue. Software development tools must exist for any configuration that is realizable in our template (the level of support may vary). Concurrency can hide some memory latency, but it fails quickly as miss penalties increase rapidly, especially in an MP.

46 Architectural Rationale and Motivation
In order to continue this performance-improvement trend
- Architectural features which exploit more concurrency are required
- Heterogeneous configurations need to be made possible
- Software development tools must support the new configuration options
(Diagram: a configured PE - FU, RegFile, Memory, ICache, DCT, HUF - begins to look like a VLIW; concurrent processes are required to continue the performance-improvement trend; a generic mesh may not suit the application's topology; hence configurable VLIW PEs and network topology)

47 Architecture Goals
Provide a template for the exploration of a range of architectures
- Retarget the compiler and simulator to the architecture
- Enable the compiler to exploit the architecture
Concurrency
- Multiple instructions per processing element
- Multiple threads per and across processing elements
- Multiple processes per and across processing elements
Support for efficient computation
- Special-purpose functional units, intelligent memory, processing elements
Support for efficient communication
- Configurable network topology
- Combined shared memory and message passing
The template supports a wide range of architectures and must support compiler and simulator retargetability; the static-dynamic interface is moved upward toward the compiler. Concurrency exists at all levels. Efficient computation and communication make or break the architecture; communication is probably the most important unless it can be hidden by running different contexts.

48 Architecture Template
Prototyping template for an array of processing elements
- Configure the processing element for efficient computation
- Configure the memory elements for efficient retiming
- Configure the network topology for efficient communication
(Diagram: configure the PE - FU, RegFile, Memory, ICache, plus DCT and HUF blocks; configure the memory elements; configure the PEs and network to match the application)
The prototyping architecture allows for constrained refinement. Estimators and feedback construct the architecture - as hints to the system programmer or as an automatic process. A common (compatible) interface with coprocessors makes refinement fit into the programmer's model. What is the exact communication protocol? Too early to say at this point.

49 Range of Architectures
(Diagram: scalar configuration - one FU with a register file, memory system, and instruction cache)
Build sequence: Scalar Configuration -> EPIC Configuration -> EPIC with special FUs -> Mesh of HPL-PD PEs -> Customized PEs, network
Supports a family of architectures; plan to extend the family with the micro-architectural features presented. This is a flavor of the architectures that we can currently support - context swapping, split register files, shared memory blocks. This is generic in the sense that it is not a specific architecture.

50 Range of Architectures
(Diagram: EPIC configuration - a PE with multiple FUs sharing a register file, memory system, and instruction cache)

51 Range of Architectures
(Diagram: EPIC with special FUs - FFT, DCT, and DES units alongside the FU, register file, memory system, and instruction cache)

52 Range of Architectures
(Diagram: mesh of HPL-PD PEs - the configured PE replicated across multiple PEs)

53 Range of Architectures
(Diagram: customized PEs and network)

54 Range of Architectures (Future)
(Diagram: IXP1200 Network Processor (Intel) - SDRAM controller, microengines, PCI interface, SRAM, SA core with mini-dcache, icache, and scratchpad, IX bus, hash engine)
- Template support for such an architecture
- Prototype the architecture
- Software development tools generated: generate compiler, generate simulator
In the future, we want to support really wacky architectures such as the IXP1200. In our framework, the automation of tools will allow architects to explore options more efficiently. There is currently no high-level environment for this processor - only an optimizing assembler and strong visualization tools.

55 The RAW Architecture
Slides prepared by Manish Vachhrajani

56 Outline
RAW architecture
- Overview
- Features
- Benefits and disadvantages
Compiling for RAW
- Structure of the compiler
- Basic block compilation
- Other techniques

57 RAW Machine Overview
- Scalable architecture without global interconnect
- Constructed from replicated tiles
- Each tile has a µP and a switch
- Interconnect via a static and a dynamic network

58 RAW Tiles
- Simple 5-stage pipelined µP with a local PC (MIMD)
- Can contain configurable logic
- Per-tile IMEM and DMEM, unlike other modern architectures
- µP contains instructions to send and receive data
(Diagram: tile with IMEM, DMEM, PC, REGS, and configurable logic (CL), next to a switch with its own SMEM and PC)

59 RAW Tiles (cont.)
Tiles have local switches, implemented with a stripped-down µP
Static network
- Fast, easy to implement
- Need to know data transfers, source, and destination at compile time
Dynamic network
- Much slower and more complex
- Allows for messages whose route is not known at compile time

60 Configurable Hardware in RAW
- Each tile contains its own configurable hardware
- Each tile has several ALUs and logic gates that can operate at bit/byte/word levels
- Configurable interconnect to wire components together
- Coarser than FPGA-based implementations

61 Benefits of RAW
Scalable
- Each tile is simple and replicated
- No global wiring, so it will scale even if wire delay doesn't
- Short wires and simple tiles allow higher clock rates
Can target many forms of parallelism
Ease of design
- Replication reduces design overhead
- Tiles are relatively simple designs; simplicity makes verification easier

62 Disadvantages of RAW
Complex compilation
- Full space-time compilation
- Distributed memory system: need sophisticated memory analysis to resolve "static references"
Software complexity
- Low-level code is complex and difficult to examine and write by hand
Code size?

63 Traditional Operations on RAW
- How does one exploit the RAW architecture across function calls, especially in libraries?
- Can we easily maintain portability with different tile counts?
- Memory protection and OS services
- Context switch overhead
- Load on the dynamic network for memory protection and virtual memory?

64 Compiling for RAW machines
- Determine available parallelism
- Determine placement of memory items
- Discover memory constraints: dependencies between parallel threads
- Disambiguate memory references to allow for static access to data elements
- Trade off memory dependence and parallelism

65 Compiling for RAW (cont.)
- Generate route instructions for switches (static network only)
- Generate message handlers for dynamic events: speculative execution, unpredictable memory references
- The optimal partitioning algorithm is NP-complete

66 Structure of RAWCC
Compilation flow: source language -> build CFG -> traditional dataflow optimizations -> MAPS system -> space-time scheduler -> RAW executable
- MAPS system: partition data to increase static accesses
- Space-time scheduler: partition instructions to allow parallel execution; allocate data to tiles to minimize communication overhead

67 The MAPS System
Manages memory to generate static promotions of data structures
- For loop accesses to arrays, uses modulo unrolling (see the sketch below)
- For data structures, uses the SPAN analysis package to identify potential references and partition memory
- Structures can be split across processing elements
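A sketch of modulo unrolling in C, assuming a hypothetical array a[] low-order interleaved across 4 memory banks (a[i] lives in bank i mod 4): unrolling by the bank count makes every reference in the body hit one compile-time-known bank, turning it into a static access:

    #define NBANKS 4   /* assumed: a[] interleaved across 4 tiles/banks */

    void scale(int *a, int n) {
        int i;
        /* after unrolling by NBANKS, each body reference has a fixed bank */
        for (i = 0; i + NBANKS <= n; i += NBANKS) {
            a[i]     *= 2;   /* always bank 0 */
            a[i + 1] *= 2;   /* always bank 1 */
            a[i + 2] *= 2;   /* always bank 2 */
            a[i + 3] *= 2;   /* always bank 3 */
        }
        for (; i < n; i++)   /* remainder iterations */
            a[i] *= 2;
    }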

68 Space-Time Scheduler
For basic blocks:
- Maps instructions to processors
- Maps scalar data to processors
- Generates communication instructions
- Schedules computation and communication
For the overall CFG, performs control localization

69 Basic Block Orchestrator
- All values are copied from their home tile to the tiles that work on the data
- Within a block, all accesses are local
- At the end of a block, values are copied back to their home tiles
Pipeline: Initial Code Transformation -> Instruction Partitioner -> Global Data Partitioner -> Data & Ins. Placer -> Comm Code Generator -> Event Scheduler

70 Initial Code Transformation
- Convert the block to static single assignment (SSA) form: removes false dependencies; analogous to register renaming
- Live-on-entry and live-on-exit variables are marked with dummy instructions: allows overlap of "stitch" code with useful work
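A small C illustration of why the renaming matters (my example; RAWCC works on an intermediate form, not C source): reusing x orders the two chains, while SSA-style names make them independent so the partitioner can spread them across tiles:

    /* before renaming: the reassignment of x creates a false
     * (anti/output) dependence between the two computation chains */
    int before(int a, int b, int c, int d) {
        int x, y, z;
        x = a + b;
        y = x * 2;
        x = c + d;      /* must be ordered after y = x * 2 */
        z = x * 3;
        return y + z;
    }

    /* after SSA-style renaming: the chains are independent */
    int after(int a, int b, int c, int d) {
        int x1, x2, y, z;
        x1 = a + b;
        y  = x1 * 2;
        x2 = c + d;     /* no ordering against the first chain */
        z  = x2 * 3;
        return y + z;
    }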

71 Instruction Partitioner
Partitions the instruction stream into multiple streams, one for each tile
- Clustering: partition instructions to minimize runtime, considering only communication
- Merging: reduces the cluster count to match the tile count; uses a heuristic-based algorithm to achieve good balance and low communication overhead

72 Global Data Partitioner
Partitions global data for assignment to home locations (local data is copied at the start of a basic block)
- Summarize the instruction stream's data-access pattern with affinity
- Map instructions and data to virtual processors: map instructions, optimally place data based on affinity, then remap instructions with data-placement knowledge
- Repeat until a local minimum is reached (sketched below)
- Only real data are mapped, not dummies formed in the initial code transformation
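A compilable skeleton of that iteration (all function bodies are stubs standing in for RAWCC's real passes; "affinity" summarizes how often each instruction stream touches each data item):

    #include <stdbool.h>

    /* stub: bind each datum to the virtual PE whose instructions
     * touch it most (its affinity) */
    static void place_data_by_affinity(void) { }

    /* stub: re-place instructions now that data homes are known */
    static void remap_instructions(void) { }

    /* stub: did the last round lower the placement cost? */
    static bool cost_improved(void) { return false; }

    void global_data_partition(void) {
        do {
            place_data_by_affinity();
            remap_instructions();
        } while (cost_improved());   /* iterate until a local minimum */
    }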

73 Data and Instruction Placer
- Places data items onto physical tiles, driven by the static data items
- Places instructions onto tiles: uses the data information to determine cost; takes into account an actual model of the communication network
- Uses a swap-based greedy allocation

74 Event Scheduler
- Schedules routing instructions as well as computation instructions in a basic block
- Schedules instructions using a greedy, list-based scheduler
- The switch schedule is guaranteed to be deadlock-free
- Allows tolerance of dynamic events

75 Control Flow
Control localization
- Certain branches are enveloped in macro instructions, and the surrounding blocks merged
- Allows a branch to occur on only one tile
Global branching
- Done through target broadcast and local branching

76 Performance
- RAW achieves anywhere from 1.5 to 9 times speedup, depending on application and tile count
- Applications tested were particularly well suited to RAW; heavily dependent integer programs may do poorly (encryption, etc.)
- Performance depends on the compiler's ability to statically schedule and localize memory accesses

77 Future Work
- Use multisequential execution to run multiple applications simultaneously: allow static communication between threads known at compile time; minimize dynamic overhead otherwise
- Target ILP across branches more aggressively
- Explore configurability vs. parallelism in RAW

78 Reconfigurable processors
Adapt the processor to the application
- Special function units
- Special wiring between function units
Builds on FPGA technology
- FPGAs are inefficient: a multiplier built from an FPGA is about 100x larger and 10x slower than a custom multiplier
- Need to raise the granularity: configure ALUs, or whole processors
Memory and communication are usually the bottleneck
- Not addressed by configuring a lot of ALUs
Programming model
- Difficult to program (Verilog)

79 SCORE Stream Computation Organized for Reconfigurable Execution
Eylon Caspi Michael Chu André DeHon Randy Huang Joseph Yeh John Wawrzynek Nicholas Weaver

80 Opportunity
High-throughput, regular operations can be mapped spatially onto an FPGA-like (programmable, spatial) compute substrate, achieving higher performance (throughput per unit area) than conventional programmable devices (e.g. processors).

81 Problem
Only have raw devices
- Solutions non-portable
- Solutions do not scale to new hardware
- Device resources exposed to the developer
- Little or no abstraction of implementations
- Composition of subcomponents hard/ad hoc
- No unifying computational model or run-time environment

82 Introduce: SCORE
Compute model that
- virtualizes RC hardware resources
- supports automatic scaling
- supports dynamic program requirements efficiently
- provides compositional semantics
- defines a runtime environment for programs

83 Viewpoint SCORE (or something like it) is a necessary condition to enable automatic exploitation of new RC hardware as it becomes available. Automatic exploitation is essential to making RC a long-term viable computing solution.

84 Outline
- Opportunity
- Problem
- Review: related work, enabling hardware
- Model: execution, programmer
- Preliminary results
- Challenges and questions ahead

85 …borrows heavily from...
- RC, RTR
- P+FPGA
- Dataflow
- Streaming dataflow
- Multiprocessors
- Operating systems
(see working paper)
Tried to steal all the good ideas :-) and build a coherent model that exploits the strengths of RC.

86 Enabling Hardware
- High-speed computational arrays [250MHz, HSRA, FPGA'99]
- Large, on-chip memories [2Mbit, VLSI Symp. '99] (allow microsecond reconfiguration)
- Processor and FPGA hybrids [GARP, NAPA, Triscend, etc.]

87 BRASS Architecture

88 Array Model

89 Platform Vision
Hardware capacity scales up with each generation: faster devices, more computation, more memory
With SCORE, old programs should run on new hardware and exploit the additional capacity automatically

90 Example: SCORE Execution

91 Spatial Implementation

92 Serial Implementation

93 Summary: Elements of a multiprocessing system
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurable: computation, interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
[After E. V. Krishnamurty, Chapter 5]

94 Conclusions
- Portions of multi/parallel processing have become successful: pipelining ubiquitous; superscalar ubiquitous; VLIW successful in DSP and multimedia - GPP?
- Silicon capability is re-invigorating multiprocessor research: GPP - Flash, Hydra, RAW; SPP - Intel IXP1200, IRAM/VIRAM, Mescal
- Reconfigurable computing has found a niche in wireless communications
- The problem of programming models, languages, computational models, etc. for multiprocessors is still largely unsolved

