The CRAY-1 Computer System
Richard M. Russell
Presented by Andrew Waterman
ECE259 Spring 2008

Background
CRAY-1 by no means the first vector machine
– 1960s: Westinghouse Solomon, ILLIAC IV
– 1974: CDC STAR-100
"I never, ever want to be a pioneer" --Cray
– STAR-100, ILLIAC IV: who's this Amdahl dude?
1972: Cray Research formed after a spat with CDC
– Seymour Cray wanted to start from scratch on the 8600; CDC brass, not so much
1976: first CRAY-1 deployed at Livermore

CRAY-1 Hardware

Look Ma, No ASICs!

CRAY-1 Architecture
5-ton vector uniprocessor
Word size = 64 bits
80 MHz clock
8 MB RAM in 16 banks at 20 MHz
– f_cpu / f_mem = 4 (!!)
Fairly RISCy
– 16- or 32-bit instructions
– Load/store; register-register operations

Scalar Operation and Octal Annoyance
10₈ (= 8) A-registers for 24-bit address calculations
– B-registers serve as backing store for the A-registers
10₈ (= 8) S-registers for source/dest of scalar integer/FP insns
– T is to S as B is to A
11₈ (= 9) pipelined scalar FUs
– Address: add, mult
– Integer: add, shift, logic, pop count
– FP: add, mult, reciprocal

Scalar Operation
Protection without virtual memory
– Base & limit address regs
– Ld $dest, $addr actually loads from $base + $addr
– Program killed if $base + $addr >= $limit
A handful of registers for interrupts, exceptions, etc.
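A minimal C sketch of this base/limit check, under assumed semantics; the register widths, the fault handling, and names like prot_regs and translate are illustrative, not the paper's:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed model of base/limit protection: every program address is
     * relocated by the base register and checked against the limit. */
    typedef struct { uint32_t base, limit; } prot_regs;

    static uint32_t translate(prot_regs r, uint32_t addr) {
        uint32_t phys = r.base + addr;
        if (phys >= r.limit) {            /* out of bounds: job is killed */
            fprintf(stderr, "address fault at %u\n", (unsigned)addr);
            exit(1);
        }
        return phys;
    }

    int main(void) {
        prot_regs r = { 0x1000, 0x2000 };              /* example base/limit */
        printf("%#x\n", (unsigned)translate(r, 0x10)); /* prints 0x1010 */
        translate(r, 0x5000);                          /* faults and exits */
        return 0;
    }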

OS and Front End
cos (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
– Packaged with CAL (assembler)
– ...and CFT (Fortran compiler); more later
Command-line interface and job submission via a separate front-end computer, e.g. a VAX

Vector Operation (Finally!)
8 × 64-word V-registers
Vector Length (VL) register
– Indicates # of ops performed by vector insns
– Set from the contents of an A-register
Vector Mask (VM) register
– Indicates which elements in a vector to operate on
– Set by vector test insns (e.g. VM[i] := (V_k[i] == 0))
6 vector FUs
– Integer add, shift, bitwise logic
– FP via scalar FPU: add, mult, reciprocal
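A small C model of the VL/VM semantics above; the register names follow the slide, while the helper names (vtest_eqz, vadd_masked) and the bitmask layout are my own illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define VLEN 64
    typedef int64_t vreg[VLEN];

    int VL;            /* Vector Length register, set from an A-register */
    uint64_t VM;       /* Vector Mask register, one bit per element */

    /* vector test from the slide: VM[i] := (Vk[i] == 0) */
    void vtest_eqz(const vreg vk) {
        VM = 0;
        for (int i = 0; i < VL; i++)
            if (vk[i] == 0) VM |= 1ull << i;
    }

    /* masked vector add: only elements selected by VM are written */
    void vadd_masked(vreg vd, const vreg va, const vreg vb) {
        for (int i = 0; i < VL; i++)
            if (VM & (1ull << i)) vd[i] = va[i] + vb[i];
    }

    int main(void) {
        vreg a = {0, 5, 0, 7}, b = {1, 1, 1, 1}, d = {0};
        VL = 4;                /* operate on the first 4 elements only */
        vtest_eqz(a);          /* VM selects elements where a[i] == 0 */
        vadd_masked(d, a, b);
        for (int i = 0; i < VL; i++)
            printf("%lld ", (long long)d[i]);   /* prints: 1 0 1 0 */
        printf("\n");
        return 0;
    }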

Vector Load/Store Architecture
Big departure from the STAR-100: register-register ops
CRAY-1 memory bandwidth == 80 Mword/s == 1 word/cycle
– If all 2-source insns were memory-memory, IPC would be at most 1/3: two operand reads + one result write = three words per op at one word/cycle (and that assumes no bank conflicts!)
– Solution: the RISC approach (load into registers, compute register-to-register)
Combined with chaining (next), can sustain >> 1 flop/cycle

Chaining
Pipeline bypass meets vectors
Consider the SAXPY vector expression a*X + Y
– Slow approach: compute a*X (64 mults), then compute (a*X) + Y (64 adds)
– Total latency: 128 + mult latency + add latency, since in the CRAY-1 all FUs are pipelined
– But... there is no fundamental serialization requirement
– As soon as a*X[0] is computed, the add a*X[0] + Y[0] can begin
– Total latency: 64 + mult latency + add latency (a speedup of almost 2)
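For reference, the computation under discussion as a plain C loop; chaining is what lets the hardware overlap the multiply and add across successive iterations of exactly this loop:

    /* SAXPY: Y := a*X + Y. With chaining, the vector add for element i
     * starts as soon as a*x[i] leaves the pipelined multiplier, instead
     * of waiting for all n multiplies to finish. */
    void saxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }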

Chaining Example
Assume: 8-element vectors, single-cycle ops
  mul.ds $v2, $v3, $s1
  add.d  $v1, $v2, $v1
Without chaining (the adds wait for the whole multiply):
  m m m m m m m m
                  a a a a a a a a
With chaining (each add trails its multiply by one cycle):
  m m m m m m m m
    a a a a a a a a
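The corresponding cycle counts, as a toy C calculation under the same single-cycle, one-element-per-cycle assumptions:

    #include <stdio.h>

    int main(void) {
        int n = 8;               /* elements per vector, as above */
        int unchained = n + n;   /* adds start after all 8 mults: 16 cycles */
        int chained   = n + 1;   /* adds trail the mults by 1 cycle: 9 cycles */
        printf("without chaining: %d cycles\n", unchained);
        printf("with chaining:    %d cycles\n", chained);
        return 0;
    }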

Vector Startup Times
For vector ops to be worth using at all, startup overhead must be small
CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on the FUs
– Result: vector performance > scalar performance for as few as four elements per vector
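A break-even sketch in C: assume a vector op pays a fixed startup S plus one cycle per element, while the scalar loop pays C cycles per element. S = 10 and C = 4 are illustrative constants chosen so the crossover lands at four elements, matching the slide; they are not CRAY-1 measurements:

    #include <stdio.h>

    #define S 10    /* assumed vector startup cycles (illustrative) */
    #define C 4     /* assumed scalar cycles per element (illustrative) */

    int main(void) {
        for (int n = 1; n <= 8; n++) {
            int vec = S + n, scalar = C * n;
            printf("n=%d: vector=%d scalar=%d -> %s\n", n, vec, scalar,
                   vec < scalar ? "vector wins" : "scalar wins");
        }
        return 0;
    }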

Cray Fortran Compiler (CFT)
Important insight: hand-coding assembly sucks
The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
– Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
– Exploit this: make the compiler do the heavy lifting
CFT is pretty good for branchless inner loops
– ...but doesn't even attempt to vectorize code with IFs
– So any use of the Vector Mask register must be hand-coded
Upshot: a good start, but not quite there
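Two inner loops that illustrate the boundary (written in C for brevity; CFT's input was of course Fortran):

    /* Branchless inner loop: the embarrassingly-parallel case that even
     * 1970s compiler technology could vectorize. */
    void vectorizable(int n, double *c, const double *a, const double *b) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* The IF defeats CFT: a loop like this stayed scalar unless the
     * Vector Mask register was driven from hand-written CAL. */
    void left_scalar(int n, double *c, const double *a) {
        for (int i = 0; i < n; i++)
            if (a[i] > 0.0)
                c[i] = a[i] * a[i];
    }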

Analysis
Extremely fast computer for 1976
Thought experiment: what if the CRAY-1's parameters had scaled with Moore's Law? (32 years at one doubling every ~18 months == 21 doublings)
– 200,000 transistors => 400 billion transistors
– 8 MB main memory => 16 TB main memory
– 80 MHz clock => petahertz? (if only)
For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
– I'm not the only one: it was commercially phenomenal
However, its design techniques (discrete logic) are totally unscalable

Questions?
Richard M. Russell
Presented by Andrew Waterman
ECE259 Spring 2008
