The CRAY-1 Computer System Richard M. Russell Presented by Andrew Waterman ECE259 Spring 2008.


1 The CRAY-1 Computer System Richard M. Russell Presented by Andrew Waterman ECE259 Spring 2008

2 Background
CRAY-1 by no means first vector machine
–1960s: Westinghouse Solomon/ILLIAC IV
–1974: CDC STAR 100
“I never, ever want to be a pioneer” --Cray
–STAR 100, ILLIAC IV: who's this Amdahl dude?
1972: Cray Research formed after spat with CDC
–Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much
1976: first CRAY-1 deployed at Los Alamos

3 CRAY-1 Hardware

4 Look Ma, No ASICs!

5 CRAY-1 Architecture
5-ton, vector uniprocessor
Word size = 64 bits
80 MHz clock
8MB RAM in 16 banks @ 20 MHz
–f_cpu/f_mem = 4 (!!)
Fairly RISCy
16- or 32-bit instructions
–Load/store; register-register operations

6 Scalar Operation and Octal Annoyance
10₈ A-registers for 24-bit address calculations
100₈ B-registers serve as backing store for A-registers
10₈ S-registers for source/dest of scalar integer/FP insns
T is to S as B is to A
11₈ pipelined scalar FUs
–Address add, mult
–Integer add, shift, logic, pop count
–FP add, mult, reciprocal

7 Scalar Operation
Protection without virtual memory
–Base & limit address regs
Ld $dest,$addr actually loads from $base+$addr
Program killed if $base+$addr >= $limit
A handful of registers for interrupts, exceptions, etc.
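The base-and-limit check on this slide can be sketched in a few lines of Python. This is only an illustration of the rule as stated (load from $base+$addr, fault when $base+$addr >= $limit); the names `relocated_load` and `ProtectionFault` are hypothetical and do not come from the CRAY-1 documentation.

```python
class ProtectionFault(Exception):
    """Raised when a relocated address falls outside the program's region."""


def relocated_load(memory, base, limit, addr):
    # Hypothetical helper: every program-visible address is offset by `base`.
    physical = base + addr
    # Per the slide, the access is illegal once base + addr reaches `limit`.
    # Negative addresses are rejected only because Python lists would wrap them.
    if addr < 0 or physical >= limit:
        raise ProtectionFault(f"address {addr} maps outside [{base}, {limit})")
    return memory[physical]


# Toy memory of 32 words; the program owns physical words 8..15.
memory = list(range(100, 132))
value = relocated_load(memory, base=8, limit=16, addr=3)  # reads memory[11]

try:
    relocated_load(memory, base=8, limit=16, addr=8)  # 8 + 8 >= 16
    faulted = False
except ProtectionFault:
    faulted = True
```

The program never sees physical addresses, so relocating it only requires changing the two registers.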

8 OS and Front End
COS (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
–Packaged with CAL (assembler)
–...and CFT (Fortran compiler), more later
Command-line interface and job submission via separate front-end computer, e.g. VAX

9 Vector Operation (Finally!)
8 x 64-word V-registers
Vector Length Register
–Indicates # ops performed by vector insns
–Set from contents of an A-register
Vector Mask Register
–Indicates which elements in vector to operate on
–Set by vector test insns (e.g. VM[i] := ($V_k[i] == 0))
6 Vector FUs
–integer add, shift, bitwise logic
–FP via scalar FPU: add, mult, reciprocal
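A rough Python model of how the Vector Length and Vector Mask registers interact is given below. The test rule VM[i] := (V_k[i] == 0) is taken from the slide; the merge-style select that consumes the mask is an assumption chosen for illustration, and both function names are hypothetical.

```python
VREG_WORDS = 64  # each V-register holds 64 words


def vector_test_zero(vk, vl):
    """VM[i] := (vk[i] == 0) for the first vl elements; other mask bits stay clear."""
    return [i < vl and vk[i] == 0 for i in range(VREG_WORDS)]


def masked_merge(vj, vk, vm, vl):
    """Assumed merge: take vj[i] where VM[i] is set, else vk[i], for vl elements."""
    return [vj[i] if vm[i] else vk[i] for i in range(vl)]


vk = [0, 5, 0, 7] + [0] * (VREG_WORDS - 4)
ones = [1] * VREG_WORDS

vl = 4                               # only four elements take part
vm = vector_test_zero(vk, vl)        # mask set at positions 0 and 2
merged = masked_merge(ones, vk, vm, vl)  # zeros in vk replaced by ones
```

Elements beyond the vector length are never touched, which is why the trailing zeros in `vk` do not set mask bits.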

10 Vector Load/Store Architecture
Big departure from STAR 100: register-register ops
CRAY-1 memory bandwidth == 80Mword/s == 1word/cycle
–If all 2-source insns are memory-memory, then IPC=1/3! (and that assumes no bank conflicts!)
–Solution: the RISC approach
Combined with chaining (next), can sustain > 1 flop/cycle
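The 1/3 figure follows from counting memory words per element: a two-source memory-memory operation reads two words and writes one, and the memory delivers one word per cycle. A small sketch of that arithmetic, with `elements_per_cycle` as a hypothetical helper name:

```python
from fractions import Fraction


def elements_per_cycle(words_per_cycle, sources, results=1):
    # Each element-wise operation moves `sources` reads plus `results`
    # writes through memory, so bandwidth is shared among all of them.
    return Fraction(words_per_cycle, sources + results)


# Memory-memory, two sources, one result, 1 word/cycle of bandwidth.
mem_mem_rate = elements_per_cycle(1, sources=2)
```

With register-register operations the operands come from V-registers, so memory bandwidth stops being the per-operation bottleneck.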

11 Chaining
Pipeline bypass meets vectors
Consider SAXPY vector expression a*X+Y
–Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds)
Total latency: 128+mult latency+add latency
–since, in CRAY-1, all FUs are pipelined
–But... no fundamental serialization requirement
As soon as a*X[0] is computed, can compute a*X[0]+Y[0]
Total latency: 64+mult latency+add latency (speedup of almost 2)
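The two latency formulas on this slide can be compared numerically. In the sketch below the pipeline depths (7 cycles for multiply, 6 for add) are assumed placeholder values, not figures from the slides, and the function names are hypothetical.

```python
def unchained_cycles(n, mult_latency, add_latency):
    # All n multiplies issue first, then all n adds: 2n issue slots
    # plus one pass through each pipeline.
    return 2 * n + mult_latency + add_latency


def chained_cycles(n, mult_latency, add_latency):
    # Adds start as soon as the first product emerges, so the two
    # streams overlap and only n issue slots are exposed.
    return n + mult_latency + add_latency


N = 64               # full V-register
MULT_LATENCY = 7     # assumed pipeline depth
ADD_LATENCY = 6      # assumed pipeline depth

slow = unchained_cycles(N, MULT_LATENCY, ADD_LATENCY)
fast = chained_cycles(N, MULT_LATENCY, ADD_LATENCY)
speedup = slow / fast
```

Under these assumed latencies the speedup comes out a little above 1.8, consistent with "almost 2"; it approaches 2 as the vector length grows relative to the pipeline depths.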

12 Chaining Example
Assume: 8-element vectors, single-cycle ops
mul.ds $v2,$v3,$s1
add.d $v1,$v2,$v1
Without chaining:
m m m m m m m m a a a a a a a a
With chaining:
m m m m m m m m
  a a a a a a a a
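The timing picture for this example can be generated programmatically. The sketch below assumes, as the slide does, 8-element vectors and single-cycle operations, and it assumes the chained add stream starts one cycle after the multiply stream; `timeline` is a hypothetical helper.

```python
def timeline(n, chained):
    # The add stream starts one cycle after the first multiply when chained,
    # or only after all n multiplies have issued when not chained.
    offset = 1 if chained else n
    width = offset + n
    mul_row = ["m"] * n + [" "] * (width - n)
    add_row = [" "] * offset + ["a"] * n
    return " ".join(mul_row).rstrip(), " ".join(add_row).rstrip(), width


for chained in (False, True):
    mul_row, add_row, cycles = timeline(8, chained)
    label = "With chaining" if chained else "Without chaining"
    print(f"{label} ({cycles} cycles):")
    print(mul_row)
    print(add_row)
```

Each column is one cycle, so the row widths show the multiply and add streams overlapping almost completely in the chained case.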

13 Vector Startup Times
For vector ops to be efficient enough to justify their use, startup overhead must be small
CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs
–Result: vector performance > scalar performance for as few as four elements/vector

14 Cray Fortran Compiler (CFT)
Important insight: hand-coding assembly sucks
The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
–Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
–Exploit this: make the compiler do the heavy lifting
CFT is pretty good for branchless inner loops...
...but doesn't even attempt to vectorize code with IFs
–So any use of the Vector Mask register must be hand-coded
Upshot: a good start, but not quite there

15 Analysis
Extremely fast computer for 1976
Thought experiment: what if CRAY-1's parameters scaled with Moore's Law? (32 years == 21 doublings)
–200,000 transistors => 400 billion transistors
–8MB main memory => 16TB main memory
–80 MHz clock => petahertz? (if only)
For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
–I'm not the only one: it was commercially phenomenal
However, design techniques (discrete logic) are totally unscalable
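The scaling arithmetic in this thought experiment is easy to check. The sketch below applies 21 doublings (roughly one every 18 months over 32 years) to the three starting figures from the slide and treats 8MB as 8 x 2^20 bytes, which is an assumption about the unit.

```python
DOUBLINGS = 21
factor = 2 ** DOUBLINGS                 # 2,097,152

transistors = 200_000 * factor          # about 4.2e11, i.e. roughly 400 billion
memory_bytes = 8 * 2**20 * factor       # 2**44 bytes, i.e. 16 TiB
clock_hz = 80e6 * factor                # about 1.7e14 Hz

print(f"transistors: {transistors:.3g}")
print(f"memory: {memory_bytes / 2**40:.0f} TiB")
print(f"clock: {clock_hz / 1e12:.0f} THz")
```

The transistor and memory figures match the slide; the scaled clock lands in the hundreds of terahertz, a bit short of a petahertz, which fits the slide's tongue-in-cheek question mark.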

16 Questions?
