Better answers The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K Shubu Mukherjee, Ph.D. Principal Hardware Engineer.

Presentation transcript:

1 The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K
Shubu Mukherjee, Ph.D., Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation, Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)

2 Alpha Microprocessor Roadmap
[Figure: timeline of first system ship dates, 1998 to 2003, with axes labeled "Higher Performance" and "Lower Cost". Parts shown: 21264 EV6, 21264 EV67, 21264 EV68, 21364 EV7, 21364 EV78, 21464 EV8. Process nodes shown: 0.35 µm, 0.28 µm, 0.18 µm, 0.125 µm.]

3 Alpha 21264 Microprocessor
• Architectural Features
  - First "Out-of-Order" Alpha
  - Four-wide superscalar
  - …
• Performance
  - World's Fastest Microprocessor (www.spec.org, 11/17/99)
  - 39 SPECint95, 68 SPECfp95 @ 700 MHz
  - Intel Pentium III @ 733 MHz delivers 36 SPECint95, 30 SPECfp95

4 Alpha Microprocessor Roadmap
[Figure: timeline of first system ship dates, 1998 to 2003, with axes labeled "Higher Performance" and "Lower Cost". Parts shown: 21264 EV6, 21264 EV67, 21264 EV68, 21364 EV7, 21364 EV78, 21464 EV8. Process nodes shown: 0.35 µm, 0.28 µm, 0.18 µm, 0.125 µm.]

5 Alpha 21364 Goals
• Leadership single stream performance
  - Higher operating frequency
  - Integrated memory interface
• Leadership multiprocessor performance
  - Integrated system / multiprocessor interface

6 Alpha 21364 Features
• System-on-a-Chip
  - Alpha 21264 core with enhancements
  - Integrated L2 Cache
  - Integrated memory controller
  - Integrated network interface
• Fault-Tolerance
  - Support for lock-step operation to enable high-availability systems

7 21364 Chip Block Diagram
[Figure: 21264 core with 64K Icache and 64K Dcache; 16 L1 miss buffers; 16 L1 victim buffers; 16 L2 victim buffers; L2 cache; memory controller with RAMBUS ports; network interface with N, S, E, W and I/O ports; Address In and Address Out paths.]

8 21364 Core
[Figure: pipeline diagram with stages numbered 0 to 6 and labeled FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Labels: branch predictors; next-line address; L1 instruction cache, 64KB 2-set; Int reg map; FP reg map; Int issue queue (20); FP issue queue (15); Reg File (80), shown twice; Reg File (72); Exec and Addr units; FP ADD, Div/Sqrt, FP MUL; L1 data cache, 64KB 2-set; victim buffer; miss address; L2 cache, 1.5MB 6-set. Annotations: 4 instructions / cycle; 80 in-flight instructions plus 32 loads and 32 stores.]

9 Integrated L2 Cache
• 1.5 MB
• 6-way set associative
• 16 GB/s total read/write bandwidth
• 16 victim buffers for L1 -> L2
• 16 victim buffers for L2 -> memory
• ECC SECDED code
• 12 ns load-to-use latency
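The capacity and associativity above fix the size of each way (1.5 MB / 6 = 256 KB), and the set count follows once a line size is chosen. The slides do not state a line size, so the 64-byte value below is an assumption for illustration only. The latency conversion likewise assumes a clock of exactly 1000 MHz, where the technology slide only says "1000+ MHz".

```python
# Back-of-the-envelope L2 geometry from the slide's figures.
CACHE_BYTES = 3 * 512 * 1024   # 1.5 MB, from the slide
WAYS = 6                       # 6-way set associative, from the slide
LINE_BYTES = 64                # ASSUMPTION: line size is not given in the slides
CLOCK_MHZ = 1000               # ASSUMPTION: the slides say "1000+ MHz"


def num_sets(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets = capacity / (associativity * line size)."""
    sets, remainder = divmod(cache_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def latency_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles."""
    return latency_ns * clock_mhz / 1000.0


sets = num_sets(CACHE_BYTES, WAYS, LINE_BYTES)
cycles = latency_cycles(12, CLOCK_MHZ)
print(f"{sets} sets, {cycles:.0f}-cycle load-to-use latency")
```

Because each way is a power-of-two size, any power-of-two line size yields a power-of-two set count, so the unusual 6-way organization does not complicate set indexing.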

10 Integrated Memory Controller
• Direct RAMbus
  - High data capacity per pin
  - 800 MHz operation
  - 30 ns CAS latency pin to pin
• 6 GB/sec read or write bandwidth
• 100s of open pages
• Directory based cache coherence
• ECC SECDED

11 Integrated Network Interface
• Direct processor-to-processor interconnect
• 10 GB/second per processor
• 15 ns processor-to-processor latency
• Out-of-order network with adaptive routing
• Asynchronous clocking between processors
• 3 GB/second I/O interface per processor
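The slides name adaptive routing but do not describe the algorithm. The following is a minimal sketch of one common form, minimal adaptive routing, under assumptions that are not in the slides: a 2D torus topology (suggested only by the N, S, E, W ports on the chip block diagram), output-queue depth as the congestion measure, and "N" meaning increasing y. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def productive_ports(src: Coord, dst: Coord, width: int, height: int) -> List[str]:
    """Ports that move a packet closer to dst on a width x height torus (assumed)."""
    ports = []
    dx = (dst[0] - src[0]) % width    # hops needed going east
    dy = (dst[1] - src[1]) % height   # hops needed going north
    if dx:
        # Take the shorter way around the ring; a tie is broken toward east.
        ports.append("E" if dx <= width - dx else "W")
    if dy:
        ports.append("N" if dy <= height - dy else "S")
    return ports


def choose_port(src: Coord, dst: Coord, width: int, height: int,
                queue_depth: Dict[str, int]) -> str:
    """Pick the least congested productive port, or LOCAL on arrival."""
    ports = productive_ports(src, dst, width, height)
    if not ports:
        return "LOCAL"
    # ASSUMPTION: congestion is measured by output-queue depth.
    return min(ports, key=lambda p: queue_depth.get(p, 0))


# Hypothetical 4x4 torus; the east queue is busier than the south queue.
port = choose_port((0, 0), (1, 3), 4, 4, {"E": 5, "S": 2})
print(port)
```

Since two packets between the same pair of nodes may leave on different ports depending on load, they can arrive in a different order than they were sent, which is consistent with the slide's "out-of-order network" wording.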

12 21364 System Block Diagram
[Figure: twelve 21364 ("364") nodes, each labeled with memory (M) and I/O (IO).]

13 Alpha 21364 Technology
• 0.18 µm CMOS
• 1000+ MHz
• 100 Watts @ 1.5 volts
• 3.5 cm²
• 6 Layer Metal
• 100 million transistors
  - 8 million logic
  - 92 million RAM

14 Alpha 21364 Status
• 70 SPECint95 (estimated)
• 120 SPECfp95 (estimated)
• RTL model running
• Tapeout: Summer 2000
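As a rough check, these estimates can be set against the 21264 figures quoted earlier in the deck (39 SPECint95 and 68 SPECfp95 at 700 MHz). The sketch below assumes the 21364 runs at exactly 1000 MHz, although the technology slide says only "1000+ MHz", so the per-clock figures are upper bounds rather than measured values.

```python
# Figures taken from the slides; the 21364 numbers are the deck's own estimates.
EV6_MHZ, EV6_INT, EV6_FP = 700, 39, 68
EV7_MHZ, EV7_INT, EV7_FP = 1000, 70, 120   # ASSUMPTION: exactly 1000 MHz


def ratio(new: float, old: float) -> float:
    """Relative improvement of new over old."""
    return new / old


clock_gain = ratio(EV7_MHZ, EV6_MHZ)
int_gain = ratio(EV7_INT, EV6_INT)
fp_gain = ratio(EV7_FP, EV6_FP)

# Whatever the clock does not explain must come from elsewhere in the design.
int_per_clock = int_gain / clock_gain
fp_per_clock = fp_gain / clock_gain

print(f"clock x{clock_gain:.2f}, int x{int_gain:.2f}, fp x{fp_gain:.2f}")
print(f"per-clock gain: int x{int_per_clock:.2f}, fp x{fp_per_clock:.2f}")
```

Under that assumption the projected gain exceeds the clock ratio by roughly a quarter. Since the 21364 reuses the 21264 core, the remainder would presumably come from the integrated L2 cache and memory controller.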

15 21364 Summary: System on a Chip
• Integrated L2 cache and memory controller
  - Outstanding single processor performance
• Integrated network interface
  - High performance multi-processor systems
  - Scales to large number of processors

16 Alpha Microprocessor Overview
[Figure: timeline of first system ship dates, 1998 to 2003, with axes labeled "Higher Performance" and "Lower Cost". Parts shown: 21264 EV6, 21264 EV67, 21264 EV68, 21364 EV7, 21364 EV78, 21464 EV8. Process nodes shown: 0.35 µm, 0.28 µm, 0.18 µm, 0.125 µm.]

17 Alpha 21464 Goals
• Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
• Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal change/cost)
  - Integrated system / multiprocessor interface (like 21364)

18 Alpha 21464 Technology Overview
• Leading edge process technology: 1.2-2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - Low-k dielectrics
• Chip characteristics
  - ~1.2 V Vdd
  - ~250 million transistors

19 Alpha 21464 Architecture Overview
• Enhanced out-of-order execution
• 8-wide superscalar
• Large on-chip L2 cache
• Direct RAMBUS interface
• On-chip router for system interconnect
• Glueless, directory-based, ccNUMA
  - For up to 512-way multiprocessing
• 4-way simultaneous multithreading (SMT)

20 Instruction Issue: reduced function unit utilization due to dependencies
[Figure: function unit usage vs. time]

21 Superscalar Issue: superscalar leads to more performance, but lower utilization
[Figure: function unit usage vs. time]

22 Predicated Issue: adds to function unit utilization, but results are thrown away
[Figure: function unit usage vs. time]

23 Chip Multiprocessor: limited utilization when only running one thread
[Figure: function unit usage vs. time]

24 Fine Grained Multithreading: intra-thread dependencies still limit performance
[Figure: function unit usage vs. time]

25 Simultaneous Multithreading: maximum utilization of function units by independent operations
[Figure: function unit usage vs. time]
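The contrast among these issue policies can be illustrated with a toy issue-slot model. This is a sketch, not the deck's own analysis: the per-cycle counts of ready instructions are made-up data, unissued instructions are not carried over to the next cycle, and the 4-wide issue width is chosen for illustration rather than taken from the 21464's 8-wide design.

```python
from typing import List

WIDTH = 4  # issue slots per cycle (illustrative)

# ASSUMPTION: made-up counts of ready (independent) instructions per cycle.
THREADS: List[List[int]] = [
    [2, 0, 1, 3],   # thread 0
    [1, 2, 0, 1],   # thread 1
    [0, 1, 2, 2],   # thread 2
    [1, 1, 1, 2],   # thread 3
]


def superscalar(ready: List[int], width: int) -> float:
    """One thread owns all the slots; dependencies leave many of them empty."""
    used = sum(min(r, width) for r in ready)
    return used / (width * len(ready))


def fine_grained(threads: List[List[int]], width: int) -> float:
    """One thread issues per cycle, chosen round-robin."""
    cycles = len(threads[0])
    used = sum(min(threads[c % len(threads)][c], width) for c in range(cycles))
    return used / (width * cycles)


def smt(threads: List[List[int]], width: int) -> float:
    """All threads compete for the same slots in every cycle."""
    cycles = len(threads[0])
    used = sum(min(sum(t[c] for t in threads), width) for c in range(cycles))
    return used / (width * cycles)


print("superscalar :", superscalar(THREADS[0], WIDTH))
print("fine-grained:", fine_grained(THREADS, WIDTH))
print("SMT         :", smt(THREADS, WIDTH))
```

With this data, fine-grained multithreading is still bounded by the instruction-level parallelism of whichever thread holds the cycle, while SMT fills the leftover slots from the other threads.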

26 Basic Out-of-order Pipeline
[Figure: stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache; annotation "Thread-blind".]

27 SMT Pipeline
[Figure: stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache.]

28 Changes for SMT
• Basic pipeline: unchanged
• Replicated resources
  - Program counters
  - Register maps
• Shared resources
  - Register file (size increased)
  - Instruction queue
  - First and second level caches
  - Translation buffers
  - Branch predictor
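The split between replicated and shared resources can be sketched as a data structure. This is an illustrative model only: the class names, the physical register count, and the renaming policy are assumptions, not details from the slides. Only the PC and register map are per thread; the register file and instruction queue are shared.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ThreadContext:
    """Replicated per thread: a program counter and a register map."""
    pc: int = 0
    reg_map: Dict[int, int] = field(default_factory=dict)  # arch reg -> phys reg


@dataclass
class SMTCore:
    """Shared by all threads: one physical register file, one instruction queue."""
    num_threads: int = 4      # 4-way SMT, per the architecture overview slide
    num_phys_regs: int = 512  # ASSUMPTION: size is not given in the slides
    contexts: List[ThreadContext] = field(init=False, default_factory=list)
    free_regs: List[int] = field(init=False, default_factory=list)
    queue: List[Tuple[int, int, int]] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        self.contexts = [ThreadContext() for _ in range(self.num_threads)]
        self.free_regs = list(range(self.num_phys_regs))

    def rename(self, tid: int, arch_dest: int) -> int:
        """Map a thread's architectural destination to a shared physical register."""
        phys = self.free_regs.pop(0)          # allocate from the shared pool
        ctx = self.contexts[tid]
        ctx.reg_map[arch_dest] = phys         # update this thread's map only
        self.queue.append((tid, arch_dest, phys))
        ctx.pc += 4                           # Alpha instructions are 4 bytes
        return phys


core = SMTCore()
p0 = core.rename(0, 1)   # thread 0 writes architectural register 1
p1 = core.rename(1, 1)   # thread 1 writes architectural register 1
print(p0, p1)
```

The same architectural register in two threads lands in two different physical registers, so stages after the map can work on physical register numbers alone. That is one reading of why the basic pipeline can stay unchanged.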

29 Multiprogrammed workload

30 Decomposed SPEC95 Applications

31 Multithreaded Applications

32 Architectural Abstraction
• 1 Processor with 4 Thread Processing Units (TPUs)
• Shared hardware resources
[Figure: TPU0, TPU1, TPU2, TPU3 above shared Icache, TLB, Dcache, and Scache.]

33 21464 System Block Diagram
[Figure: nine EV8 nodes, each labeled with memory (M) and I/O (IO); the labels 0, 1, 2, 3 also appear.]

34 Alpha 21464 Summary
• Leadership single stream performance
  - Higher operating frequency / better technology
  - New microarchitecture
  - Integrated memory interface (like 21364)
• Leadership multiprocessor performance
  - Simultaneous Multithreading (with minimal changes/cost)
  - Integrated system / multiprocessor interface (like 21364)

35 Summary
• Alpha 21364
  - Reuses 21264 microprocessor core
  - System on a chip
• Alpha 21464
  - New microarchitecture
  - System on a chip
  - Simultaneous Multithreading
Maintain Performance Lead Beyond Y2K

36 My Current Research: Beyond 21464?
• The Truth Project (w/ Joel Emer)
  - Examines different microarchitectural issues
• The Multinet Project (w/ Rick Kessler)
  - Tightly-coupled multiprocessor networks
• The Reliant Project (w/ Steve Reinhardt)
  - Self-Checking Microprocessors using SMT, ISCA submission
• Asim (w/ VSSAD Labs)
  - Performance Model for Alphas beyond 21464

