Tiled Processing Systems



Presentation on theme: "Tiled Processing Systems"— Presentation transcript:

1 Tiled Processing Systems
Shervin Vakili, October 21, 2007. All materials are copyright of their respective authors, as listed in the references.

2 Contents
Why parallel processing
Fundamental MP design decisions
Design space of SoC architectures
Tiled processors
M.I.T. Raw processor
Field Programmable Function Array
Performance analysis for data-intensive applications

3 Why parallel processing
Performance drive: diminishing returns for exploiting ILP and OLP
Multiple processors fit easily on a chip
Cost effective: just connect existing processors or processor cores
Low power: parallelism may allow lowering Vdd (a worked sketch follows below)
However: parallel programming is hard
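The low-power argument follows from the dynamic-power relation P ≈ C·Vdd²·f: splitting work over two slower cores can permit a lower supply voltage. The C sketch below only illustrates the arithmetic, with assumed voltage and frequency values; it is not a claim about any specific chip.

```c
/* Illustrative only: dynamic power scales roughly as P ~ C * Vdd^2 * f.
 * The capacitance, voltage and frequency values are assumptions. */
#include <stdio.h>

static double dyn_power(double c, double vdd, double f) {
    return c * vdd * vdd * f;   /* switched capacitance * Vdd^2 * clock frequency */
}

int main(void) {
    double C = 1.0;                                  /* normalized capacitance per core */
    double p_single = dyn_power(C, 1.2, 1.0);        /* one core at full speed, 1.2 V   */
    double p_dual   = 2.0 * dyn_power(C, 0.9, 0.5);  /* two cores at half speed, 0.9 V  */
    printf("single: %.2f  dual: %.2f  dual/single = %.2f\n",
           p_single, p_dual, p_dual / p_single);     /* ~0.56: same throughput, less power */
    return 0;
}
```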

4 Which parallelism are we talking about?
Classification: Flynn categories [9]
SISD (Single Instruction, Single Data): uniprocessors
MISD (Multiple Instruction, Single Data): stream-based processing
SIMD (Single Instruction, Multiple Data = DLP): simple programming model, low overhead; examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), vector machines, Cell architecture (Sony)
MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D, SGI Origin, multi-core Pentiums, and many more
A minimal SIMD vs. MIMD sketch follows below.
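To make the SIMD/MIMD distinction concrete, here is a small plain-C sketch (none of the machines above are modeled): the vector add is the SIMD/DLP pattern, one operation over many data elements, while two POSIX threads running different functions stand in for MIMD. Compile with -pthread.

```c
/* SIMD vs. MIMD, sketched in plain C (illustrative only). */
#include <pthread.h>
#include <stdio.h>

#define N 8

/* SIMD / data-level parallelism: one operation applied across a whole vector.
 * On a real SIMD machine all lanes would execute this add in lock-step. */
static void vector_add(const int *a, const int *b, int *c) {
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}

/* MIMD: independent instruction streams working on independent data. */
static void *task_sum(void *arg)  { int *d = arg; d[0] = d[1] + d[2]; return NULL; }
static void *task_prod(void *arg) { int *d = arg; d[0] = d[1] * d[2]; return NULL; }

int main(void) {
    int a[N] = {1,2,3,4,5,6,7,8}, b[N] = {8,7,6,5,4,3,2,1}, c[N];
    vector_add(a, b, c);                       /* SIMD-style kernel */

    int x[3] = {0, 3, 4}, y[3] = {0, 3, 4};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task_sum,  x);   /* two different programs ... */
    pthread_create(&t2, NULL, task_prod, y);   /* ... running at the same time */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("c[0]=%d  sum=%d  prod=%d\n", c[0], x[0], y[0]);
    return 0;
}
```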

5 Fundamental MP design decisions
We have already discussed:
  Shared memory versus message passing (a toy comparison of the two follows below)
  Coherence, consistency and synchronization issues
Other extremely important decisions:
  Processing units: homogeneous versus heterogeneous? Generic versus application-specific?
  Interconnect: bus versus network? Type (topology) of network?
  What types of parallelism to support?
  Focus on performance, power or cost?
  Memory organization?
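As a quick reminder of the first design axis, the sketch below contrasts the two communication models in one toy POSIX program: two threads sharing a counter behind a mutex (shared memory) versus copying a word through a pipe (message passing). Illustrative only; compile with -pthread.

```c
/* Shared memory vs. message passing in one toy program (POSIX, illustrative). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* --- shared memory: both threads update the same variable, guarded by a lock --- */
static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                        /* shared state needs synchronization */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("shared-memory counter = %d\n", counter);

    /* --- message passing: no shared state, data is copied through a channel --- */
    int fd[2];
    if (pipe(fd) != 0) return 1;
    int msg = 42;
    write(fd[1], &msg, sizeof msg);              /* "send"    */
    int received = 0;
    read(fd[0], &received, sizeof received);     /* "receive" */
    printf("message received = %d\n", received);
    return 0;
}
```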

6 SMP: Symmetric Multi-Processor [9]
Memory: centralized, with uniform access time (UMA), a bus interconnect and shared I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: several processors, each with one or more cache levels, sharing main memory and the I/O system over a bus]

7 DSM: Distributed Shared Memory [9]
Nonuniform access time (NUMA) and a scalable interconnect (distributed memory)
Example: Cray T3E: 480 MB/sec per link, 3 links per node, memory on node, switch based, up to 2048 nodes, $30M to $50M
[Figure: processor + cache + memory nodes connected through an interconnection network to main memory and the I/O system]

8 Interconnection Network: Independent Memory [9]
Appropriate for a message-passing scheme
[Figure: processor + cache + memory nodes connected through an interconnection network to the I/O system, with no shared main memory; same Cray T3E annotations as the previous slide]

9 Homogeneous or Heterogeneous
Homogeneous:
  Replication effect
  The chip is memory dominated anyway
  Somewhat less performance than a tuned heterogeneous design
  Advantages: scalability, degradability (e.g. Intel Core Solo)

10 Homogeneous or Heterogeneous
Heterogeneous:
  Better fit to the application domain
  Most modern systems are heterogeneous

11 MP vs. SoC
An SoC (System on Chip) is a multi-IP system. The IPs can be:
  Custom hardware
  General-purpose or DSP processors
  Coprocessors
  Memory blocks
  Reconfigurable matrices
  I/O protocol cores
Multi-processor systems can be categorized as SoCs (MPSoC).

12 Design Space of SoC Architectures (R-SOC) [7]
The design space in the figure is organized by granularity:
  Fine grain (FPGA): island topology, hierarchical topology
  Multi granularity (heterogeneous): processor + coprocessor (fine-grain or coarse-grain coprocessor), tile-based architectures
  Coarse grain (systolic): mesh topology (tiled processors), linear topology, hierarchical topology
Example architectures placed in the figure: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA, Altera Stratix, Altera Apex, Altera Cyclone, Chameleon, REMARC, Morphosys, Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC, aSoC, E-FPFA, RAW, AsAP, CHESS, MATRIX, KressArray, Systolix PulseDSP, Systolic Ring, RaPiD, PipeRench, DART, FPFA.

13 Tiled Processor
Homogeneous multi-processor systems
Generally with a 2-D structure, which maps well onto a 2-D die
Uses simple processors for each tile
Advantages: scalability, potential degradability, fault tolerance
Disadvantage: less efficient than heterogeneous structures

14 M.I.T. Raw Processor
M.I.T. Raw Architecture Workstation (Raw) architecture
Raw processor tile array
What's in a Raw tile?
Inside the compute processor
Raw's networking and routing resources
Raw inter-processor communication
M.I.T. Raw novel features

15 M.I.T. Raw Architecture Workstation (Raw) Architecture
Composed of a replicated processor tile [8]
8-stage pipelined, MIPS-like 32-bit processor per tile [7]
Static and dynamic routers
Any tile output can be routed off the edge of the chip to the I/O pins
Chip bandwidth (16-tile version): 14 channels, each 32 bits wide

16 RAW Architecture [8] Divide the silicon into an array of identical, programmable tiles.

17 Raw Processor Tile [8]
[Figure: each tile contains a compute processor, routers, and the on-chip networks]

18 Inside the Compute Processor [8]
Local bypass network
Input FIFOs from the static router, output FIFOs to the static router
The networks are integrated directly into the bypass paths
[Figure: compute-processor pipeline with stages IF, D, RF, E / M1, M2 / A, TL, TV / F4 (FPU), WB]

19 Tiles Static Communication [8]

20 Raw's Static Network
Consists of two tightly coupled sub-networks (a scheduling sketch follows below):
Tile interconnection network
  For operands and streams between tiles
  Controlled by the 16 tiles' static router processors
  Used to route operands among local and remote ALUs, and to route data streams among tiles, DRAM and I/O ports
Local bypass network
  For operands and streams within a tile
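The key idea of the static network is that the compiler fixes the routes ahead of time, so no headers travel with the data. The C sketch below models a single static router replaying a compile-time schedule of one-word moves per cycle; the port names, schedule format and data values are illustrative assumptions, not Raw's actual switch instruction set.

```c
/* Sketch of a statically scheduled router: a fixed per-cycle schedule of moves
 * (decided "at compile time") is replayed by the router, with no packet headers.
 * Port names, schedule format and data values are illustrative assumptions. */
#include <stdio.h>

enum port { NORTH, SOUTH, EAST, WEST, PROC, NPORTS };

struct move { enum port from, to; };            /* one routing decision per cycle */

static const struct move schedule[] = {
    { PROC,  EAST  },   /* cycle 0: send the local ALU result east           */
    { WEST,  PROC  },   /* cycle 1: deliver an incoming operand to the ALU   */
    { NORTH, SOUTH },   /* cycle 2: pass a through-stream straight downwards */
};

int main(void) {
    int channel[NPORTS] = { 7, 3, 0, 11, 42 };  /* toy word sitting on each port */
    for (unsigned cyc = 0; cyc < sizeof schedule / sizeof schedule[0]; cyc++) {
        const struct move *m = &schedule[cyc];
        channel[m->to] = channel[m->from];      /* one word moves per cycle */
        printf("cycle %u: port %d -> port %d (value %d)\n",
               cyc, (int)m->from, (int)m->to, channel[m->to]);
    }
    return 0;
}
```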

21 Raw's Dynamic Network
Messages: insert a header, then fewer than 32 data words; the message worms through the network; inter-message ordering is not guaranteed (a message-building sketch follows below)
Enables MPI-style programming
Two dynamic networks: Raw's memory network and Raw's general network
General network: user-level messaging; can interrupt a tile when a message arrives; lower performance, for coarse-grained applications
Used for communication that is not compile-time predictable, among tiles and possibly with I/O devices
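In contrast to the static network, a dynamic message carries its own routing information. The sketch below just packs a header word (destination and length) in front of up to 31 payload words; the header layout and the helper make_msg are assumptions for illustration, not Raw's actual message encoding.

```c
/* Sketch of a dynamic-network message: one header word followed by fewer than
 * 32 payload words. The header packing below is an assumed layout, for
 * illustration only; it is not Raw's actual encoding. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAYLOAD 31                      /* "< 32 data words" per the slide */

struct dyn_msg {
    uint32_t header;                        /* packed destination tile + length */
    uint32_t payload[MAX_PAYLOAD];
    uint32_t len;
};

static int make_msg(struct dyn_msg *m, uint32_t dest_tile,
                    const uint32_t *words, uint32_t len) {
    if (len > MAX_PAYLOAD) return -1;       /* longer transfers need several messages */
    m->header = (dest_tile << 8) | len;     /* assumed packing: tile id | word count */
    memcpy(m->payload, words, len * sizeof *words);
    m->len = len;
    return 0;
}

int main(void) {
    uint32_t data[4] = { 1, 2, 3, 4 };
    struct dyn_msg m;
    if (make_msg(&m, 5 /* destination tile */, data, 4) == 0)
        printf("header = 0x%08x, %u payload words\n",
               (unsigned)m.header, (unsigned)m.len);
    return 0;
}
```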

22 M.I.T. Raw Novel Features
Dynamic and static network routers
Scalability of Raw chips: fabricated Raw chips can be placed in an array to further increase system computing performance
Specifies a homogeneous 2-D array of very simple processors
Local bypass network
First, Raw implements fine-grain communication between large numbers of replicated processing elements and is thereby able to exploit huge amounts of fine-grain parallelism in applications, when this parallelism exists. Second, it exposes the complete details of the underlying hardware architecture to the software system (be it the software CAD system, the application software, or the compiler), so the software can carefully orchestrate the execution of the application by applying techniques such as pipelining, synchronization, and conflict elimination for shared resources through static scheduling and routing.

23 Field Programmable Function Array (Chameleon)
An FPFA consists of interconnected processor tiles
Multiple processes can coexist in parallel on different tiles
Within a tile, multiple data streams can be processed in parallel
Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit
Field-Programmable Function Arrays (FPFAs) are reminiscent of FPGAs, but have a matrix of ALUs and lookup tables [7] instead of Configurable Logic Blocks (CLBs). Basically the FPFA is a low-power, reconfigurable accelerator for an application-specific domain. Low power is mainly achieved by exploiting locality of reference; high performance is obtained by exploiting parallelism. Figure 1 shows an FPFA with 25 tiles; each tile has five ALUs. The ALUs on a processor tile are tightly interconnected and are designed to execute the (highly regular) inner loops of an application domain. ALUs on the same tile share a control unit and a communication unit. The ALUs use the locality-of-reference principle extensively: an ALU loads its operands from neighboring ALU outputs, or from (input) values stored in lookup tables or local registers. [7]

24 Field Programmable Function Array
The FPFA concept has a number of advantages:
The FPFA has a highly regular organisation: it requires the design and replication of a single processor tile, and hence design and verification are rather straightforward. The verification of the software might be less trivial; therefore, for less demanding applications a general-purpose processor core is used in combination with an FPFA.
Its scalability stands in contrast to the dedicated chips designed nowadays: in FPFAs there is no need for a redesign in order to exploit all the benefits of a next-generation CMOS process or the next generation of a standard.
The FPFA can do media-processing tasks such as compression/decompression efficiently. Multimedia applications can, for example, benefit from such energy-efficient compression by saving (energy-wasting) network bandwidth.

25 Field Programmable Function Array
Processor tiles:
A FPFA processor tile (Figure 1) consists of five identical blocks, which share a control unit and a communication unit. An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers. Because of the locality-of-reference principle, each ALU has two local memories. A crossbar switch makes flexible routing between the ALUs, registers and memories possible; Figure 7 shows the crossbar interconnect between five blocks, which enables an ALU to write back to any register or memory within a tile.
Five blocks per processor tile seems reasonable. With five blocks there are ten memories available, which is convenient for the FFT algorithm (six inputs and four outputs). It also gives the ability to use 5×16 = 80-bit wide numbers, which enables floating-point numbers (although some additional hardware is required). Some algorithms, like the FIR filter, can benefit substantially from additional ALUs: with five ALUs, a five-tap FIR filter can be implemented efficiently (a plain-C sketch of the computation follows below). The fifth ALU can also be used for complex address calculations and other control purposes. [7]

26 Performance Analysis for Data-Intensive Applications [1]
Three data-intensive radar sub-systems:
  Corner turn: a matrix transpose operation (a blocked-transpose sketch follows below); the matrix size is larger than Imagine's SRF (128 KB) and Raw's internal memories (2 MB), but smaller than VIRAM's on-chip memory (13 MB)
  Beam steering: directs a phased-array radar without physically rotating the antenna
  Coherent side-lobe canceller (CSLC): consists of FFTs, a weight application (multiplication) stage, and IFFTs
Implemented on:
  Processors In Memory (PIM)
  Stream processors
  Tiled processors
  PowerPC
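For reference, the corner turn is nothing more than a matrix transpose; what makes it interesting is that it is almost pure memory traffic. The sketch below is a generic blocked transpose in C, with assumed matrix and block sizes; it is not the benchmark code used in [1].

```c
/* Corner turn = matrix transpose. A blocked transpose keeps accesses inside
 * small tiles to reduce cache/DRAM traffic. Matrix and block sizes are
 * illustrative assumptions; this is not the benchmark code from [1]. */
#include <stdio.h>
#include <stdlib.h>

#define N 512           /* matrix is N x N           */
#define B 32            /* transpose in B x B blocks */

static void corner_turn(const float *src, float *dst) {
    for (int bi = 0; bi < N; bi += B)
        for (int bj = 0; bj < N; bj += B)
            for (int i = bi; i < bi + B; i++)
                for (int j = bj; j < bj + B; j++)
                    dst[j * N + i] = src[i * N + j];   /* dst = transpose(src) */
}

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    if (!a || !b) return 1;
    for (int i = 0; i < N * N; i++) a[i] = (float)i;
    corner_turn(a, b);
    printf("a[1][0] = %.0f, b[0][1] = %.0f\n", a[1 * N + 0], b[0 * N + 1]);
    free(a);
    free(b);
    return 0;
}
```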

27 Vector Intelligent RAM (VIRAM, Berkeley) [6]
Merges DRAM with a vector processor (mixed logic-DRAM CMOS process)
Scalar MIPS processor core plus vector unit: 1.6 GFLOPS
4 floating-point ALUs, 8 32-bit integer ALUs, 16 16-bit ALUs
12.8 GB/s peak memory access
13 MB on-chip DRAM
15 x 18 mm die, IBM foundry; chips fabbed in Q1 '03

28 Imagine Streaming Processor (Stanford)
300 MHz VLIW SIMD machine
28 16-bit GOPS, 14 GFLOPS
128 kB streaming register file
8 ALU clusters, 6 ALUs per cluster, 84-95% ALU utilization typical
256 x 32-bit local register file
Streaming memory buffers: re-order DRAM accesses and expose data locality
Intra-cluster ALU bandwidth: … GB/s; DRAM bandwidth: … GB/s
16 x 16 mm die, TI foundry

29 MIT Raw
16 tiles of MIPS R4000-class processors @ 300 MHz; 4.6 GOPS or GFLOPS
4 communication networks: 2 static networks (38.3 GB/s) and 2 dynamic networks
14 external ports (I/O or DRAM), 33.5 GB/s
C and ASM; gcc-based compiler
18.2 x 18.2 mm die, IBM foundry
Fully scalable architecture

30 Experimental Results [1]
[Tables: processor parameters; experimental results (×10^3 cycles); speedup compared with a PowerPC with AltiVec]

31 References
[1] J. Suh, E. G. Kim, S. P. Crago, L. Srinivasan, and M. C. French, "A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels," Proc. of the International Symposium on Computer Architecture, Jun.
[2] M. B. Taylor, "The Raw processor specification," comprehensive specification for the Raw processor, Cambridge, MA, continuously updated, 2003.
[3] D. Wentzlaff and M. B. Taylor, "The Raw architecture: signal processing on a scalable composable computation fabric," High Performance Embedded Computing Workshop, 2001.
[4] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, and B. Greenwald, "Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams," Proc. of the International Symposium on Computer Architecture, Jun. 2004.
[5] M.I.T. Raw architecture workstation website.
[6] Berkeley Intelligent RAM website.
[7] "Reconfigurable computation and communication architectures."
[8] J. W. Webb, "Processor architectures at a glance: M.I.T. Raw vs. UC Davis AsAP," course presentation.
[9] H. Corporaal, "Multi-Processor," course presentation.

32 Appendix
Chess
UC Davis Asynchronous Array of simple Processors (AsAP)

33 Chess
HP Labs, Bristol, England
2-D array, similar to MATRIX
Contains more "FPGA-like" routing resources
No reported software or application results
Does not support incremental compilation

34 Chess Interconnect
More like an FPGA
Takes advantage of near-neighbor connectivity

35 Chess Basic Block
Switchbox memory can be used as storage
ALU core for computation

36 Chess Statistics
Uses metrics to evaluate computational power
Efficient multiplies due to the embedded ALU
Process independent

37 UC Davis Asynchronous Array of simple Processors (AsAP) Architecture
Composed of a replicated processor tile
9-stage pipelined, reduced-complexity DSP processor
Four-nearest-neighbor inter-processor communication
Each processor tile can operate at a different frequency than its neighbors
Off-chip access to the I/O pins must be reached by routing to boundary processors
Chip bandwidth: single channels are 16 bits wide
The array topology of AsAP is well suited to applications composed of a series of independent tasks; each task can be assigned to one or more processors
16-bit fixed-point, single-issue CPU with ALU and MAC
Small instruction/data memories
Hardware address generation
Local programmable clock oscillator
~1.1 mm2 per processor in 0.18 µm CMOS
1 GHz targeted operation

38 Asynchronous Array of simple Processors [8]
AsAP limits inter-processor communication to the four nearest neighbors. Many algorithms have a linear data flow, so this is more than adequate. For algorithms requiring more complex communication, some processors may be allocated to route or forward data; because the processors are fairly small, this is not too big a loss.
All inter-processor communication is asynchronous. Each processor contains its own clock oscillator. Data is synchronized across clock domains using dual-clock FIFOs, which also buffer data in case of a rate mismatch between producer and consumer.
Stalling: if a processor attempts to read from an empty FIFO or write to a full FIFO, it stalls until the instruction can safely complete. Hence inter-processor communication is handled automatically in hardware (a software model of this behaviour follows below).
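The stall-on-empty/stall-on-full behaviour is exactly what a bounded blocking FIFO provides. The sketch below models those semantics in software with a mutex and condition variables (compile with -pthread); in AsAP itself this is a hardware dual-clock FIFO crossing two clock domains, which plain C cannot capture.

```c
/* Software model of the stall behaviour described above: a bounded FIFO where
 * reading from an empty FIFO or writing to a full FIFO blocks until it can
 * safely complete. AsAP does this in hardware with dual-clock FIFOs; this C
 * version only models the semantics, not the clock-domain crossing. */
#include <pthread.h>
#include <stdio.h>

#define DEPTH 32                 /* matches the 32-word FIFOs mentioned above */

struct fifo {
    int buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t m;
    pthread_cond_t not_empty, not_full;
};

static struct fifo f = { .m         = PTHREAD_MUTEX_INITIALIZER,
                         .not_empty = PTHREAD_COND_INITIALIZER,
                         .not_full  = PTHREAD_COND_INITIALIZER };

static void fifo_write(struct fifo *q, int word) {
    pthread_mutex_lock(&q->m);
    while (q->count == DEPTH)                    /* full: the producer "stalls"  */
        pthread_cond_wait(&q->not_full, &q->m);
    q->buf[q->tail] = word;
    q->tail = (q->tail + 1) % DEPTH;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->m);
}

static int fifo_read(struct fifo *q) {
    pthread_mutex_lock(&q->m);
    while (q->count == 0)                        /* empty: the consumer "stalls" */
        pthread_cond_wait(&q->not_empty, &q->m);
    int word = q->buf[q->head];
    q->head = (q->head + 1) % DEPTH;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->m);
    return word;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) fifo_write(&f, i);    /* "upstream processor"   */
    return NULL;
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    int last = 0;
    for (int i = 0; i < 100; i++) last = fifo_read(&f); /* "downstream processor" */
    pthread_join(p, NULL);
    printf("last word read: %d\n", last);
    return 0;
}
```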

39 What's in an AsAP tile?
16-bit fixed-point datapath, single-issue CPU; instructions are 32 bits wide
ALU, MAC
Small instruction/data memories: 64-entry instruction memory and a 128-word data memory
Hardware address generation: each processor has 4 address generators that calculate addresses for data memory
Local programmable clock oscillator
2 input and 1 output dual-clock FIFOs, 16 bits wide and 32 words deep
~1.1 mm2 per processor in 0.18 µm CMOS
800 MHz targeted operation

40 AsAP Single Processor Tile [8]

41 AsAP Inter-processor Communication
Each processor output is hard-wired to its four nearest neighbors' input multiplexers
The input multiplexers are configured at power-up
As an input FIFO fills up, the sourcing neighbor can be halted by asserting the corresponding hold signal [8]

42 AsAP Contributions
Provides parallel execution of independent tasks by providing many parallel, independent processing engines
AsAP specifies a homogeneous 2-D array of very simple processors (single-issue pipelined CPUs)
Independent tasks are mapped across processors and executed in parallel, allowing efficient exploitation of application-level parallelism (a mapping sketch follows below)
To overcome the drawbacks of sequential execution and large memory hierarchies, many independent processing engines are provided, each with its own memory and code space. The AsAP architecture does that by defining a 2-D array of processors. Because each processor performs only a small subset of the application, it can be simpler than most modern CPUs: a simple single-issue processor following a traditional RISC architecture. Most DSP systems are already partitioned into a cascade of blocks, and these blocks map easily onto AsAP processors. Because each processing element is a complete CPU, the coding process simply consists of defining each block in software. The result is an efficient exploitation of application-level parallelism: each block of the algorithm executes in parallel with the other blocks, on independent data. This effectively creates a pipeline, similar to what might be built into an ASIC solution.

