Alpha 21364: A Scalable Single-chip SMP

Alpha 21364: A Scalable Single-chip SMP
Peter Bannon, Senior Consulting Engineer
Compaq Computer Corporation, Shrewsbury, MA
Microprocessor Forum, Oct 13 & 14

Thank you, Lindley. It's a pleasure to have an opportunity to talk about Alpha and the 21364 in particular.

Outline
- Alpha vs. IA-64
- Alpha Roadmap
- Alpha 21164PC Update
- Alpha 21264 Update
- Alpha 21364
- Conclusion

Today I would like to spend some time comparing Alpha and its design philosophy against IA-64. I'll then give a quick overview of the Alpha roadmap to position our current and future processors. After an update on the 21164PC family, I'll review the current status of the 21264 and try to provide some motivation for the 21364 design. The bulk of the time will be spent reviewing the design of the 21364. Lastly, I'll have some conclusions.

IA-64 vs. Alpha Philosophy

EPIC:
- Smart compiler and a dumb machine
- Compiler creates a record of execution; machine plays the record
- Stalls when the compiler is wrong
- Focus on vector programs; compiler transforms scalar to vector
- What about: function calls, indirection, dynamic linking, C++, Java/JIT?

Alpha:
- Smart compiler, smart machine, and GREAT circuit design
- Compiler creates a record of execution; machine exploits additional information available at runtime
- Works across barriers to compile-time analysis
- Focus on scalar programs; add resources for vector
- Amdahl's law

Let's look at a familiar example. Last year Intel began to introduce the IA-64 architecture. IA-64 was positioned as building on the best ideas of RISC and VLIW. Both RISC and VLIW focus on pushing difficult or infrequent operations out of the hardware and into the compiler. Doing this requires a smart compiler and allows for a dumb machine design. In a classic VLIW design the compiler schedules the code with full knowledge of the machine architecture. The hardware executes the instructions exactly the way the compiler planned, stalling or even breaking when the compiler is wrong.

The compiler can do a good job when it is provided with accurate profiling data that can be used to create the correct static schedule. For example, vector programs often behave this way. However, large modern programs inhibit the compiler's ability to determine the correct static schedule. Function calls out of a basic block or outside the compilation unit block optimizations. Calls to shared libraries (DLLs, for example) also prevent optimization. Use of indirection to reference both data and code also reduces the compiler's ability to find instruction-level parallelism.

We looked at VLIW in 1993 and 1994 and rejected it for four reasons: (1) VLIW did not fundamentally remove a difficult part of the hardware design; (2) many of the compiler techniques associated with VLIW can also be applied to RISC machines; (3) there are many roadblocks to full analysis of a program by the compiler, such as function calls outside the compilation, indirection, and dynamic linking; (4) there is more to learn at runtime, and we should not ignore this.

Alpha uses smart compilers, smart machines, and GREAT circuit design to produce the highest level of performance, with a focus on scalar performance. The compiler schedules groups of instructions much like a VLIW compiler would, taking advantage of profiling data, prefetching, speculation, and predication. The processor then uses additional information at runtime to expose additional parallelism as the program executes. Resources are added where needed to support vector programs. Amdahl's law assures us that this is the correct approach: you must have a fast scalar machine to have a fast vector machine. As the software world becomes more complex and dynamic, we believe that Alpha's focus on scalar performance, along with providing the necessary resources for vector-like programs, will produce the best performance.
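The Amdahl's-law point above — scalar speed caps overall speedup no matter how fast the vector resources are — can be sketched with a small calculation (the workload fractions below are illustrative, not from the talk):

```python
def amdahl_speedup(vector_fraction, vector_speedup, scalar_speedup=1.0):
    """Overall speedup when the vectorizable fraction runs vector_speedup
    times faster and the remaining scalar code runs scalar_speedup times faster."""
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction / scalar_speedup +
                  vector_fraction / vector_speedup)

# With an effectively infinite vector unit, an 80%-vectorizable program
# is capped at 5x; doubling scalar speed doubles that cap to 10x.
print(amdahl_speedup(0.8, 1e12))                      # ~5.0
print(amdahl_speedup(0.8, 1e12, scalar_speedup=2.0))  # ~10.0
```

The cap comes entirely from the scalar fraction, which is the argument for making the scalar machine fast first.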

Predication & Speculation

if ((b[j]==true) && (a[I+j]==true) && (c[I-j+7]==true))

IA-64:
  1  R1=&b[j]
     R3=&a[I+j]
     R5=&c[I-j+7]
  2  ld   R2=[R1]
     ld.s R4=[R3]
     ld.s R6=[R5]
  4  P1,P2 <- cmp(R2==true)
  5  <P1> chk.s R4
     <P1> P3,P4 <- cmp(R4==true)
  6  <P3> chk.s R6
     <P3> P5,P6 <- cmp(R6==true)
  7  <P5> br then

Alpha:
  1  R1=&b[j]
     R3=&a[I+j]
     R5=&c[I-j+7]
  2  ld R2=[R1]
     ld R4=[R3]
     ld R6=[R5]
  4  cmoveq r4, r31, r2
  5  cmoveq r6, r31, r2
  6  beq r2, else

(The numbers are architectural cycles.)

Compare the code for IA-64, presented last year, with the output from today's Unix compiler: Alpha is already doing speculation of loads and predication with CMOV. The IA-64 sequence was copied from last year's IA-64 presentation. The Alpha sequence was created by the currently shipping Unix compiler for Alpha. Note the use of speculative loads to improve the parallelism in the memory system, as well as conditional move instructions to remove data-dependent branches. Assuming lots of functional units with 1-cycle operations and 2-cycle loads, both sequences execute in a similar number of architectural cycles.

Predication & Speculation

if ((b[j]==true) && (a[I+j]==true) && (c[I-j+7]==true))

               IA-64¹       Alpha
Instructions   12           9
Bytes          64           36
Branches       1            1
Mispredicts    16%          13% (measured)
Cycles         7 (if())     6 dependent cycles (if());
                            11 static full loop;
                            6 executing on 21264

¹ J. Crawford, J. Huck, "Next Generation Instruction Set Architecture", Microprocessor Forum 1997, p. 25

The increased size of the IA-64 instructions results in a larger memory footprint, allowing us to call Alpha a bit-efficient architecture :-). Both code sequences contain a single branch; Intel reported a 16% mispredict rate for some designs, and we have measured 13% on the 21264. I would now like to move away from architecture to actual implementations. On the 21264, the static schedule for the full loop requires 11 cycles; however, the out-of-order hardware allows the machine to execute the loop in 6 cycles.

Queens on 21264

[Pipeline diagram: several iterations of the loop (LDA, LDL, CMOV, BEQ) in flight over time, with a fetch group and an execute group of slots; each loop iteration is shown in a different color.]

This slide shows the execution of the full loop on the 21264. The top group of blocks shows the instructions being fetched from the Icache and mapped into the issue queue. Each loop iteration is a different color. Note that fetching and mapping the instructions is the limiting work for this loop and that there is no bubble in the fetch stream for the taken branch. The second group of blocks shows the instructions executing. Note that the hardware is overlapping two iterations of the loop. The CMOV instructions are broken into two parts by the mapper and execution stages because we are unwilling to map three source registers per cycle and likewise unwilling to add a third read port to the register file just for these instructions. Next, I'd like to review the roadmap.

Alpha Roadmap

[Roadmap chart, 1995-2001. Higher performance: EV5/333 21164 (0.5µm) → EV6/575 21264 (0.35µm) → EV7/1000 21364 (0.18µm) → EV8 (0.13µm). Lower cost: EV56/600 21164 (0.35µm, 0.28µm), EV67/750 21264 (0.28µm), EV68/1000 21264 (0.18µm), PCA56/533 21164PC (0.35µm), PCA57/600 21164PC (0.28µm).]

Next I would like to review the Alpha processor roadmap. We will continue to introduce new cores at the high end and then cost-reduce those products with improved technology and value engineering. The 21164 is currently shipping in 0.35µm at 600MHz. A cost-reduced version of the part will be introduced this quarter in 0.28µm, also running at 600MHz. The 21264 will begin shipping this quarter in 0.35µm. That part will be followed by shrinks into 0.28µm and then 0.18µm over the next 18 months. The 21364 will be introduced in 0.18µm. An entirely new core, EV8, will be introduced in 0.13µm.

Alpha 21164PC

Shipping at 583MHz, November 1998:
- 16.7/17.0 estimated SPECint95 (base/peak)
- 20.7/22.7 estimated SPECfp95 (base/peak)
- 340 MB/sec STREAMS

Chip features:
- 1.0 cm² die, 7 million transistors
- 32K 2-set Icache
- 16K virtual Dcache
- improved 3-cycle multiplier
- improved 6 bit/cycle divider
- increased write buffer size (8 x 32B)
- support for 200MHz off-chip cache

The 21164PC in 0.28µm CMOS will be introduced later this quarter. At 583MHz, the part achieves 17 SPECint95 and 22.7 SPECfp95 using a 1MB external cache built from 200MHz PB2 SRAMs. The 1 cm² die contains 7 million transistors. Several improvements were made to the part compared to the original 0.35µm version. The Icache was doubled in size to 32K and changed from direct-mapped to 2-way set associative. The Dcache was also doubled in size, to 16KB, and converted from a physically indexed, physically tagged cache to be virtually indexed and physically tagged. A new integer multiplier is fully pipelined for 32-bit multiplies with a latency of 3 cycles. A new floating-point divider produces 6 result bits per cycle, compared to the 2.4 bits/cycle of the old design. The floating-point unit has improved support for IEEE floating point, including hardware support for NaNs and infinity. The size of the write buffer was increased, and the chip now supports off-chip caches built from 200MHz PB2 SRAMs.
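The divider improvement is easy to sanity-check: at n result bits per cycle, the iterative part of a double-precision divide takes roughly ceil(53/n) cycles (a rough sketch that ignores start-up and rounding overhead, which the talk does not specify):

```python
import math

def divide_iteration_cycles(result_bits, bits_per_cycle):
    """Cycles for the iterative core of a divide, ignoring fixed overhead."""
    return math.ceil(result_bits / bits_per_cycle)

# Double-precision significand: 53 bits.
print(divide_iteration_cycles(53, 2.4))  # old design: 23 cycles of iteration
print(divide_iteration_cycles(53, 6))    # new design: 9 cycles of iteration
```

Going from 2.4 to 6 bits per cycle cuts the iteration count by more than half.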

Alpha 21264 Update

Microprocessor Forum 1996:
- 30+ SPECint95 and 50+ SPECfp95
- 500MHz in 0.35µm CMOS
- Spectacular memory bandwidth
- Systems 2H97

Status:
- First power-on July 1997 (no FP)
- Full-function power-on February 1998
- Production power-on June 1998

At Microprocessor Forum in 1996, we introduced the architecture of the 21264. At that time we estimated SPECint95 of 30 and SPECfp95 performance of 50. In addition, we hoped to ship systems by the end of 1997 at 500MHz. In the past we have taken great pride in delivering the performance we predict on time; this time we did not. Completing the physical design and verification of the chip took considerably longer than we had planned. The first prototype version of the chip was powered on in the summer of 1997. The first full version of the chip was powered on in February of this year. System and software bring-up has progressed rapidly, with the last version of the chip powering on in June. We now expect to ship systems later this quarter. The current systems use 1/3rd-speed caches, while the simulated results used 2/3rd-speed caches; this results in a 10% shortfall on SPEC performance. During 1999 we expect to introduce systems with 2/3rd-speed caches.

Alpha 21264 Systems

AlphaServer 8400 with EV6/575

[Chart: 21164 (EV5) vs. 21264 (EV6) performance on three single-processor and three multiprocessor benchmarks; estimated values marked **.]

The AlphaServer 8400 is a large bus-based SMP system. This chart shows the performance of the 21164 (EV5) as well as the 21264 (EV6) for 3 single-processor benchmarks as well as 3 multiprocessor benchmarks. Note that the 21264 is providing a 50% performance improvement over the 21164 using the same memory system. Please note that the full TPC-C information is missing from your slides. In the next slide I'd like to look more closely at the TPC performance.

*37,541 tpmC at $79.4/tpmC for 8 CPUs, 16GB, Sybase V11.9, available 12/98

Estimated time for TPC-C

- New core
- Higher MHz
- Higher integration

[Graph: single-processor TPC-C execution time, normalized to the 21164, broken down into compute, mispredict, cache, and memory components.]

This graph shows where a single processor is spending its time executing the TPC benchmark; lower is faster. The graph is normalized to the 21164 processor. For a single processor, simply replacing the 21164 with a 21264 results in a doubling of performance. Since both chips were measured in the same memory system and at the same frequency, the improvement is completely attributed to the micro-architecture of the 21264. Note that significant improvements have been made in the compute, mispredict, cache, and memory portions of the graph. As the frequency of the chip is increased with improved technology in the coming years, the memory system will again dominate the performance of this application. This provides the motivation for the 21364 design.

Alpha 21364 Goals

Improve:
- Single-processor performance, operating frequency, and memory system
- SMP scaling
- System performance density (computes/ft³)
- Reliability and availability

Decrease:
- System cost
- System complexity

The 21364 processor will use a very high level of system integration to improve single-processor performance, SMP scaling, compute density, and reliability and availability. Naturally, we would also like to reduce system cost and complexity.

Alpha 21364 Features

- Alpha 21264 core with enhancements
- Integrated L2 cache
- Integrated memory controller
- Integrated network interface
- Support for lock-step operation to enable high-availability systems

The 21364 will start with an enhanced version of the 21264 core. We will add an integrated L2 cache, a Direct RAMbus memory controller, and a network interface. The chip will support lock-step operation to enable high-availability systems.

21364 Chip Block Diagram

[Block diagram: 21264 core (64K Icache, 64K Dcache) with 16 L1 miss buffers and 16 L1 victim buffers; L2 cache with 16 L2 victim buffers; memory controller with RAMBUS channels; network interface with N, S, E, W, and I/O ports; address-in and address-out paths.]

Here's a top-level block diagram of the chip; the next few slides will provide additional detail. We will start with the core of the 21264. The number of outstanding cache block fills will be increased from 8 to 16. Misses to the L1 caches will first access the L2 cache. Data will be returned on a 128-byte-wide bus. References that miss the L2 cache will access the local memory and return data to the core. Memory locations not located in the local memory will access the network. The integrated network interface will route the request to the appropriate node in the network using one of the 4 ports (N, S, E, W). The 21264 core's 8-entry victim buffer is currently used for both L1 and L2 victims. The new design will increase the size of the victim buffer to 16 x 64-byte blocks for L1→L2 victims. A new 16 x 64-byte victim buffer will be used to hold victims leaving the L2 cache for the local memory or the network.
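The miss path just described can be summarized as a four-level lookup (a schematic sketch only, not the real protocol; directory lookups and coherence traffic are omitted, and the function and parameter names are illustrative):

```python
def service_load(addr, l1, l2, local_memory_has, route_to_home):
    """Where a load finds its data: L1, then L2, then local memory,
    then the network (forwarded toward the home node via N/S/E/W)."""
    if addr in l1:
        return "L1"
    if addr in l2:
        return "L2"             # data returned on the wide bus to the core
    if local_memory_has(addr):
        return "local memory"   # integrated Direct RAMbus controller
    return route_to_home(addr)  # integrated network interface
```

Each level is integrated on chip, which is what removes the off-chip hops that dominate latency in earlier systems.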

21364 Core

[Pipeline diagram, stages 0-6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE. Branch predictors and next-line address feed the 64KB 2-set L1 instruction cache, delivering 4 instructions/cycle. The integer register map feeds a 20-entry integer issue queue and two copies of an 80-entry register file driving four execution units, two with address generation. The FP register map feeds a 15-entry FP issue queue, a 72-entry register file, an FP add/div/sqrt unit, and an FP multiply unit. The 64KB 2-set L1 data cache connects to the 1.5MB 6-set L2 cache through a victim buffer and miss-address path. 80 in-flight instructions plus 32 loads and 32 stores.]

The 21364 will use the 21264 core. I'd like to briefly walk through the pipeline and then point out a few interesting features. Four instructions are fetched from the 64K 2-set Icache each cycle and delivered to the integer and floating-point mappers. Following the path of integer instructions, the map stage maps the 32 virtual register numbers onto the 80 physical register numbers, which are then inserted into the issue queue. Every cycle the integer issue queue will issue up to 4 instructions in age-prioritized order, compress out empty slots, and insert new instructions from the mapper. The next cycle the inputs are read from the register file. Four integer units share two copies of the register file. The two lower units contain add and logic functions; these units generate addresses for the cache system. Each of the upper units contains add, logic, shift, and branch functions. A fully pipelined integer multiply unit is part of one upper unit, while the other upper unit processes the population count and multimedia instructions. The 64K 2-set L1 Dcache can process 2 loads or stores per cycle. References that miss are merged into cache block requests that access the L2 cache. Modified data is displaced during the fill and is buffered in a 16-entry victim buffer.
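The map stage described above — 32 architectural registers renamed onto 80 physical registers — can be sketched as a mapping table plus a free list (an illustrative model; the real mapper also handles checkpointing for mispredict recovery, which is omitted here):

```python
class RenameMap:
    """Minimal model of the integer register mapper (32 arch -> 80 phys)."""
    def __init__(self, arch_regs=32, phys_regs=80):
        self.table = list(range(arch_regs))            # arch reg -> phys reg
        self.free = list(range(arch_regs, phys_regs))  # unallocated phys regs

    def rename(self, dest, srcs):
        """Rename one instruction: sources read the current mapping; the
        destination gets a fresh physical register. The old mapping is
        returned so it can be freed when the instruction retires."""
        phys_srcs = [self.table[s] for s in srcs]
        old_phys = self.table[dest]
        self.table[dest] = self.free.pop(0)
        return self.table[dest], phys_srcs, old_phys
```

Renaming the same destination twice in a row gives it two different physical homes, which is what lets independent loop iterations overlap in flight.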

Integrated L2 Cache

- 1.5 MB, 6-way set associative
- 16 GB/s total read/write bandwidth
- 16 victim buffers for L1 → L2
- 16 victim buffers for L2 → memory
- ECC SECDED code
- 12ns load-to-use latency

The 1.5MB 6-set L2 cache has a 12-cycle load-to-use latency. This latency is set by the existing control in the core and is used to significantly reduce the power consumption of the L2 array. The L2 cache can read or write 16 bytes/cycle at 1GHz, resulting in 16GB/second of read or write bandwidth. The array is protected by a single-error-correct, double-error-detect ECC code. Errors are corrected on the fly in hardware.

Integrated Memory Controller

- Direct RAMbus
- High data capacity per pin
- 800 MHz operation
- 30ns CAS latency pin to pin
- 6 GB/sec read or write bandwidth
- 100s of open pages
- Directory-based cache coherence
- ECC SECDED

The chip contains an integrated Direct RAMbus memory controller. Direct RAMbus provides high data capacity per pin along with outstanding bandwidth and latency. The pin-to-pin delay for a page hit in the RDRAM is 30ns. The memory controller will provide 6GB/sec of read or write bandwidth to the core. With 2 GFLOPs, the chip provides 3 bytes/FLOP of usable memory bandwidth, a significant improvement over current systems. To reduce memory latency the memory controller will track 100s of open pages in the RDRAM array. A directory-based cache coherence protocol is an integral part of the memory controller. The memory is protected by a single-error-correct, double-error-detect ECC code.

For comparison: EV56/600: 300MB/sec / 1.2 GFLOPS = 0.4 b/FLOP; EV6/600: 1000MB/sec / 1.2 GFLOPS = 0.83 b/FLOP.
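The bytes-per-FLOP figure quoted above is just peak bandwidth divided by peak floating-point rate (a sketch reproducing the slide's arithmetic; decimal GB and GFLOPS units assumed):

```python
def bytes_per_flop(bandwidth_gb_s, gflops):
    """Usable memory bandwidth per peak floating-point operation."""
    return bandwidth_gb_s / gflops

print(bytes_per_flop(6.0, 2.0))  # 21364: 6 GB/s at 2 GFLOPS -> 3.0 B/FLOP
print(bytes_per_flop(1.0, 1.2))  # EV6/600: 1 GB/s at 1.2 GFLOPS -> ~0.83 B/FLOP
```

The integrated controller roughly triples the bandwidth available per FLOP compared to the EV6/600 figure in the notes.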

Integrated Network Interface

- Direct processor-to-processor interconnect
- 10 GB/second per processor
- 15ns processor-to-processor latency
- Out-of-order network with adaptive routing
- Asynchronous clocking between processors
- 3 GB/second I/O interface per processor

The integrated network interface allows multiprocessor systems to be built using a 2D torus topology. Each node is capable of moving 10GB/second. Each hop in the network will take an average of 15ns. The network moves data and control packets from the source to the destination; it does not guarantee ordering. Adaptive routing of packets allows the network to detect and avoid hot spots. Asynchronous clocking between processors removes the need to distribute a low-skew clock within a large system. A fifth port provides up to 3GB/sec of bandwidth to industry-standard buses: PCI, PCI-X, AGP, and ServerNet, to name a few.
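The four network ports map naturally onto a 2D torus: each axis wraps around, so the worst-case hop count grows with the square root of the processor count rather than linearly. A small sketch (the grid dimensions and coordinate scheme are assumptions for illustration; a 12-processor system might be arranged as 4x3, for example):

```python
def torus_neighbors(x, y, width, height):
    """The four links (N, S, E, W) of node (x, y), wrapping at the edges."""
    return {
        "N": (x, (y - 1) % height),
        "S": (x, (y + 1) % height),
        "E": ((x + 1) % width, y),
        "W": ((x - 1) % width, y),
    }

def torus_hops(src, dst, width, height):
    """Minimum hop count between two nodes; wraparound shortens each axis."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)

# Opposite corners of a 4x3 torus are only 2 hops apart thanks to the wrap links.
print(torus_hops((0, 0), (3, 2), 4, 3))  # 2
```

A router can reach the minimal path along either axis first, which is the freedom adaptive routing exploits to steer around hot spots.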

21364 System Block Diagram

[Block diagram: 12 processors connected in a 2D torus, each 21364 with its own local memory (M) and I/O connection.]

Here's the block diagram of a 12-processor system using the 2D torus topology. Each processor may have its own local memory and may have its own local I/O connection. It is possible for a processor to operate in the system without memory or I/O if that is attractive.

Alpha 21364 Technology

- 0.18 µm CMOS
- 1000+ MHz
- 100 Watts @ 1.5 volts
- 3.5 cm²
- 6-layer metal
- 100 million transistors: 8 million logic, 92 million RAM

Alpha 21364 Status

- 70 SPECint95 (estimated)
- 120 SPECfp95 (estimated)
- RTL model running
- Tapeout 4Q99

Conclusion

- The 21364's integrated L2 cache and memory controller provide outstanding single-processor performance
- The 21364's integrated network interface enables high-performance multiprocessor systems
- The high level of integration directly supports systems containing a large number of processors
