The Raw Architecture A Concrete Perspective Michael Bedford Taylor Raw Architecture Group Laboratory for Computer Science Massachusetts Institute of Technology

Outline: Raw Architectural Overview (Compute Processor, On-chip networks); Architectural Usage hints; Raw exploration avenues

The Raw Architecture: divide the silicon into an array of identical, programmable tiles. Prototype: 4x4. Architecture, simulator & fabric: {4,8,16,32} x {4,8,16,32} tiles.

The Raw Tile: an 8-stage, 32-bit, MIPS-style, single-issue, in-order compute processor; a 4-stage 32-bit pipelined FPU; 32 KB DCache; 32 KB IMem; routers and wires for the four on-chip mesh networks.

Raw's four on-chip mesh networks: 8 32-bit channels, registered at the input, so the longest wire is the length of a tile.

Raw's I/O and Memory System: routes on any network that go off the edge of the chip appear on the pins as Gb/s-class channels. Direct connection of computation to I/O and DRAM streams. [Figure: the Raw chip's ports connected to DRAM, a Raw chipset, PCI x 2, and D/A devices.]

Raw Architectural Overview Compute Processor On-chip networks Raw exploration avenues Outline Architectural Usage hints

Raw compute processor ISA: MIPS-flavoured, with similar instructions plus some improvements; all but the divide instructions are fully pipelined. Latencies: ALU 1 cycle, MUL 2 cycles, LOAD 3 cycles, FPU 4 cycles; similar to the Pentium III.

Bit-oriented instructions: PopCount; Count Leading Zeros; Rotate and Mask / Rotate, Mask and Insert (like PowerPC, but with more flexible masks). A bit-smashing Swiss army knife; outputs are available one cycle after the inputs.
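
As a rough illustration only, here is a hedged C sketch of what a rotate-mask-insert style operation computes; the function names, operand order, and rotate direction are assumptions, not the actual Raw encoding.

  #include <stdint.h>

  /* Illustrative helper, not a Raw mnemonic: rotate a 32-bit word left. */
  static uint32_t rotl32(uint32_t x, unsigned sh)
  {
      sh &= 31u;
      return sh ? (x << sh) | (x >> (32u - sh)) : x;
  }

  /* "Rotate, mask and insert" (assumed semantics): rotate rs, then replace
     the bits of rd selected by mask with the rotated value.
     e.g. rotate_mask_insert(0, 0xABCD1234, 8, 0x0000FF00) == 0x00003400 */
  static uint32_t rotate_mask_insert(uint32_t rd, uint32_t rs,
                                     unsigned sh, uint32_t mask)
  {
      return (rd & ~mask) | (rotl32(rs, sh) & mask);
  }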

Branches & Divides. Branches: 1 branch per cycle, no delay slots, a static branch prediction bit (e.g., bne+, bne-), 3-cycle mispredict penalty. Divides: Integer Divide 42 cycles, Floating Point Divide 12 cycles; these are the only non-pipelined instructions in the ISA. They use the FD and HI/LO registers with full/empty bits. The Proc, IntDiv, and FPDiv state machines run independently; the processor will only stall if you read MFFD, MFLO, or MFHI before the dividers have finished.

bnea: "bnea $2, $3, addr" branches if not equal and, meanwhile, adds SPR BR_INCR (which is callee-saved). This minor tweak reduces some loop overheads by 50%. Example, invalidating the cache four lines per iteration:

  li     $4, kCacheSize - (kCacheLineSize << 2)   # loop bound
  li     $5, 0                                    # running line index
  mtsri  BR_INCR, (kCacheLineSize << 2)           # per-iteration stride
___cache_invalidate_loop:
  tagswi $0, $5, 0                                # tag-index-based cache ops
  tagswi $0, $5, 1
  tagswi $0, $5, 2
  tagswi $0, $5, 3
  bnea+  $5, $4, ___cache_invalidate_loop         # add BR_INCR to $5, loop while != $4

Data Cache: 32 KB per tile, 2-way set-associative, allocate-on-write, write-back. Tile caches are not coherent; the user manages coherence. There is an extensive collection of cache-management instructions for coherence management: tag-index-based for wide address-range operations, and address-based for short ranges.
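
As a concrete but hedged sketch of user-managed coherence in C: raw_flush_range and raw_invalidate_range below are hypothetical wrappers standing in for the real cache-management instructions described in the Raw Spec, not actual library calls.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical wrappers around the address-based cache-management ops. */
  extern void raw_flush_range(const void *addr, size_t len);      /* write back dirty lines */
  extern void raw_invalidate_range(const void *addr, size_t len); /* discard cached copies  */

  /* Before another tile (or an I/O device) reads this buffer from DRAM,
     write our dirty lines back -- the hardware will not do it for us. */
  void publish(const uint32_t *buf, size_t n)
  {
      raw_flush_range(buf, n * sizeof *buf);
  }

  /* Before reading a buffer that another tile produced, drop any stale
     lines we may still be caching. */
  void prepare_to_read(const uint32_t *buf, size_t n)
  {
      raw_invalidate_range(buf, n * sizeof *buf);
  }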

More instructions. Many other instructions, including: writing to the instruction and switch memories; packaging messages; managing interrupts; managing I/O and off-chip memory; adding an immediate value to the top 16 bits of a word (useful for $gp-relative calculations). See the Raw Spec, available from the Raw website.

Outline: Raw Architectural Overview (Compute Processor, On-chip networks); Architectural Usage hints; Raw exploration avenues

Raw's four on-chip networks: the Memory Dynamic Network ("mdn"), the General Dynamic Network ("gdn"), and the Static Network x 2.

Raw's Static Network: the highest-performance solution, for predictable communication patterns. Use it to route operands among local and remote ALUs, and to route data streams among tiles, DRAM, and I/O ports.

Raw's Static Network consists of two tightly coupled sub-networks: the tile interconnection network, for operands & streams between tiles, controlled by the 16 tiles' static router processors; and the local bypass network, for operands & streams within a tile.

Between-tile operand transport. The routes programmed into the static router ICaches guarantee in-order delivery of operands between tiles at a rate of 2 words/direction/cycle. [Figure: two neighboring tiles; each compute pipeline connects through FIFOs to its static router crossbar, and each static router ICache holds instructions such as "JZ P, label / route P->E,S" and "JZ W, label / route W->P".]

Operand transport among the functional units and the static router: a 0-cycle "local bypass network". Registers r24-r27 are mapped to the input FIFOs from the static router and to the output FIFOs to the static router, so instructions can read and write the network directly, e.g. "lb r25, 0x341(r26)". [Figure: the compute pipeline with the register-mapped input and output FIFOs.]

Raw static router ISA: branches, nops, and up to 13 routes per cycle; routes support multicast.

  move   $1, $csto          route $csto->$cWo, $csto->$cEo
  beqzd- $1, $1, cleanup    route $csto->$cWo
  bnezd+ $1, $1, cleanup    route $csto->$cWo, $cWi->$cNo

Compilation. Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands & streams between tiles. Source:

  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ...

[Figure: the renamed instruction stream partitioned into per-tile groups, one group per line below.]

  pval5=seed.0*6.0  pval4=pval5+2.0  tmp3.6=pval4/3.0  tmp3=tmp3.6  v3.10=tmp3.6-v2.7  v3=v3.10
  v2.4=v2  pval3=seed.0*v2.4  tmp2.5=pval3+2.0  tmp2=tmp2.5  pval6=tmp1.3-tmp2.5  v2.7=pval6*5.0  v2=v2.7
  seed.0=seed  pval1=seed.0*3.0  pval0=pval1+2.0  tmp0.1=pval0/2.0  tmp0=tmp0.1
  v1.2=v1  pval2=seed.0*v1.2  tmp1.3=pval2+2.0  tmp1=tmp1.3  pval7=tmp1.3+tmp2.5  v1.8=pval7*3.0  v1=v1.8  v0.9=tmp0.1-v1.8  v0=v0.9

Raw's Memory Network ("mdn"): dynamic, dimension-ordered, wormhole-routed. Insert a header and fewer than 32 data words; the message worms through the network. Inter-message ordering is not guaranteed. Trusted clients and stringent usage rules: cache misses, I/O devices, memory DMA traffic, the operating system, and expert users. Talk to us if you'd like to use it.

Raw's General Network ("gdn"): dynamic, dimension-ordered, wormhole-routed. Insert a header and fewer than 32 data words; the message worms through the network. Inter-message ordering is not guaranteed. User-level messaging; a tile can be interrupted when a message arrives. Lower performance, intended for coarse-grained apps and for communication that is not predictable at compile time, among tiles and possibly with I/O devices.
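
A hedged C sketch of coarse-grained user-level messaging over the gdn: gdn_send and gdn_receive are hypothetical wrappers (the real header format and message-packaging instructions are in the Raw Spec), and the tile numbering is assumed.

  #include <stdint.h>

  /* Hypothetical gdn wrappers, not a real API. */
  extern void     gdn_send(uint32_t dest_tile, const uint32_t *words, uint32_t len);
  extern uint32_t gdn_receive(uint32_t *words, uint32_t max_len); /* returns word count */

  enum { NUM_WORKERS = 4 };

  /* Master tile: hand each worker a (start, count) descriptor, then collect
     one partial result per worker.  Coarse-grained exchanges like this are a
     good match for the gdn's per-message overhead. */
  uint32_t master(uint32_t total)
  {
      uint32_t chunk = total / NUM_WORKERS;
      for (uint32_t w = 0; w < NUM_WORKERS; w++) {
          uint32_t msg[2] = { w * chunk, chunk };
          gdn_send(/*dest_tile=*/w + 1, msg, 2);
      }
      uint32_t sum = 0;
      for (uint32_t w = 0; w < NUM_WORKERS; w++) {
          uint32_t reply[1];
          (void)gdn_receive(reply, 1);   /* replies may arrive in any order */
          sum += reply[0];
      }
      return sum;
  }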

A Raw System in Action. [Figure: the tile array simultaneously running an httpd, a 4-way automatically parallelized C program, a 2-thread MPI app, direct I/O streams into the static network, memory traffic, and idle ("Zzz...") tiles.] Note that an application uses only as many tiles as needed to exploit the parallelism intrinsic to that application.

Outline: Raw Architectural Overview (Compute Processor, On-chip networks); Architectural Usage hints; Raw exploration avenues

Usage hint #1: If a program operates on linear arrays of data and spends many cycles pulling them in through cache misses, stream the data in and out of the DRAMs via the static network instead of through the cache.
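
A hedged C sketch contrasting the two approaches: csti_read is a hypothetical accessor for the register-mapped static-network input (the real binding comes from the Raw toolchain), and it assumes the DRAM port has already been configured to stream the array into this tile.

  #include <stdint.h>

  /* Hypothetical accessor: on Raw the static-network ports are register-
     mapped, so a read like this compiles down to a plain register move. */
  extern uint32_t csti_read(void);

  /* Cache-based version: every miss stalls the tile. */
  uint32_t sum_cached(const uint32_t *a, int n)
  {
      uint32_t s = 0;
      for (int i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  /* Streaming version: the array arrives over the static network,
     roughly one word per cycle, with no cache misses. */
  uint32_t sum_streamed(int n)
  {
      uint32_t s = 0;
      for (int i = 0; i < n; i++)
          s += csti_read();
      return s;
  }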

Usage hint #2: If the tiles are all stalled and the DRAM ports are busy, use more DRAM ports! Up to 16 DRAM banks in, 16 DRAM banks out.

Usage hint #3: If the computation is fine-grained and the code does not use the static network, reconsider: the static network is the lowest-latency, lowest-occupancy communication mechanism on Raw, and using it will greatly reduce communication overhead.

Usage hint #4: If tiles receive data from the static network and store it to local data memory before using it, remember that the static network guarantees in-order arrival of data: use the data directly as it comes off the network to save the latency and occupancy overhead.
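
A hedged C sketch of this hint; as in the earlier sketch, csti_read is a hypothetical accessor for the register-mapped static-network input.

  #include <stdint.h>

  /* Hypothetical accessor for the register-mapped static-network input. */
  extern uint32_t csti_read(void);

  /* Avoid: buffering network data to local memory before using it. */
  void scale_buffered(uint32_t *tmp, uint32_t *out, int n, uint32_t k)
  {
      for (int i = 0; i < n; i++) tmp[i] = csti_read(); /* extra stores...  */
      for (int i = 0; i < n; i++) out[i] = tmp[i] * k;  /* ...and reloads   */
  }

  /* Prefer: operate on operands directly as they arrive, in order. */
  void scale_streamed(uint32_t *out, int n, uint32_t k)
  {
      for (int i = 0; i < n; i++)
          out[i] = csti_read() * k;
  }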

Outline: Raw Architectural Overview (Compute Processor, On-chip networks); Architectural Usage hints; Raw exploration avenues

Looking into the future: just stamp out more tiles! A 16-issue Raw in 180 nm becomes a 64-issue Raw in 90 nm, while the longest wire stays at 1 cycle. Longest wire, frequency, and design and verification complexity are all independent of issue width, and the architecture is backwards compatible.

Exploration Avenues for External Users:
  Prototype

Raw Prototype

Raw ASIC, IBM SA-27E (0.15 um, 6-layer Cu):
  18.2 mm x 18.2 mm
  16 FLOPs/ops per cycle
  208 operand routes / cycle
  2048 KB L1 SRAM
  1657-pin CCGA package
  1080 HSTL core-speed signal I/Os
  Worst case (temp, Vdd, process): 225 MHz
  3.6 peak GFLOPS (without FMAC)
  230 Gb/s on-chip bisection bandwidth
  201 Gb/s off-chip I/O

Exploration Avenues for External Users:
  Prototype: + real-world experience, fast; + practical apps; - static configuration (# tiles, I/O & memory ratio)
  Simulator

Exploration Avenues for External Users:
  Prototype: + real-world experience, fast; + practical apps; - static configuration (# tiles, I/O & memory ratio)
  Simulator: - "magic", slow simulation speed; + scalability exploration (64 tiles is feasible in 90 nm); + lots of flexibility in I/O and memory experimentation
  Raw Fabric

Raw chips gluelessly connect to form larger virtual chips, up to 32x32 tiles. [Figure: a 4x4 array of 16 Raw chips; this 16-chip array would approximate a 256-tile Raw built in a 45 nm process.]

Exploration Avenues for External Users:
  Prototype: + real-world experience, fast; + practical apps; - static configuration (# tiles, I/O & memory ratio)
  Simulator: - "magic", slow simulation speed; + scalability exploration (64 tiles is feasible in 90 nm); + lots of flexibility in I/O and memory experimentation
  Raw Fabric: + the positive aspects of both; - not an exact simulation of the future; - $$$, not available yet

Outline: Raw Architectural Overview (Compute Processor, On-chip networks); Architectural Usage hints; Raw exploration avenues; End comments

The ultimate guide is the Raw Specification; you definitely need to read it to get the best performance. Chapters 1-14, the "Raw Architecture Manual": read these to understand Raw's components, how they work, and how to program them; in a few cases, later sections supersede earlier ones, and the constants should be ignored because they have changed over time. Chapter 15, "Prototype Specific Data": updated continuously, an ultra-dense presentation of the constants and the implementation.