Presentation on theme: "The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA"— Presentation transcript:
The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA Ray.Hookway@compaq.com
Map Features Alpha Processor Roadmap Alpha Systems Alpha Clusters
Processor Roadmap References: Pete Bannon, “Alpha 21364: A Scalable Single-chip SMP”, Microprocessor Forum 1998, http://www.digital.com/alphaoem/microprocessorforum.htm Joel Emer, “Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999
Alpha Roadmap [figure: processor timeline, 1998 to 2003, moving toward Higher Performance and Lower Cost; processors shown: EV6 (21264), EV67, EV68, EV7, EV78, EV8; process geometries shown: 0.35 µm, 0.28 µm, 0.18 µm, 0.125 µm]
Alpha 21264 is performance leader [figure: SPECfp95 chart comparing Compaq AlphaServer DS20, HP D380, Sun UE250, and IBM F50; values shown: 58.7, 17.4, 20.6, 12.6]
Alpha 21264 Systems AlphaServer 8400 with EV6/575 * 37,541 tpmC at $79.4/tpmC for 8CPU 16GB Sybase V11.9 available 12/98 **estimated
IA-64 vs. Alpha Philosophy. EPIC: Smart compiler and a dumb machine. Compiler creates record of execution. Machine plays record. Stall when compiler is wrong. Focus on vector programs. Compiler transforms scalar to vector. What about: function calls, indirection; dynamic linking; C++, Java/JIT. ALPHA: Smart compiler, smart machine, and a GREAT circuit design. Compiler creates record of execution. Machine exploits additional information available at runtime. Works across barriers to compile-time analysis. Focus on scalar programs. Add resources for vector. Amdahl’s law.
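The Amdahl's law point on this slide can be made concrete with a small calculation. The following is a minimal sketch, assuming a hypothetical workload in which a fixed fraction of run time can be vectorized; the function name and the sample numbers are illustrative and do not come from the talk.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only `vector_fraction` of run time is accelerated."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    scalar_fraction = 1.0 - vector_fraction
    # The scalar part runs at its original speed; only the vector part shrinks.
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# Hypothetical numbers: half the run time vectorizes and runs 10x faster.
print(round(amdahl_speedup(0.5, 10.0), 2))  # prints 1.82
```

Even with a tenfold gain on the vector half, the overall speedup stays below 2x, which is one way to read the slide's emphasis on scalar performance.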
Alpha 21364 Goals. Improve: Single processor performance, operating frequency, and memory system. SMP scaling. System performance density (computes/ft³). Reliability and availability. Decrease: System cost. System complexity.
Integrated L2 Cache 1.5 MB 6-way set associative 16 GB/s total read/write bandwidth 16 Victim buffers for L1 -> L2 16 Victim buffers for L2 -> Memory ECC SECDED code 12ns load to use latency
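The cache size and associativity above determine how an address is split into offset, index, and tag. The sketch below works through that arithmetic. The slide does not give the line size, so the 64-byte line used here is an assumption, and the function name is hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive set count and address-field widths for a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line_bytes")
    for name, value in (("line_bytes", line_bytes), ("sets", sets)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two")
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


# 1.5 MB, 6-way (from the slide); 64-byte lines (assumed, not stated).
print(cache_geometry(1536 * 1024, 6, 64))
# prints {'sets': 4096, 'offset_bits': 6, 'index_bits': 12}
```

Under that assumption, the six ways are what let a 1.5 MB capacity map onto a power-of-two number of sets.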
Integrated Memory Controller. Direct RAMbus: High data capacity per pin. 800 MHz operation. 30ns CAS latency pin to pin. 6 GB/sec read or write bandwidth. 100s of open pages. Directory based cache coherence. ECC SECDED.
Integrated Network Interface Direct processor-to-processor interconnect 10 GB/second per processor 15ns processor-to-processor latency Out-of-order network with adaptive routing Asynchronous clocking between processors 3 GB/second I/O interface per processor
21364 System Block Diagram [figure: twelve nodes, each drawn as a 21364 processor (labeled 364) with attached M and IO blocks]
Alpha 21364 Technology. 0.18 µm CMOS. 1000+ MHz. 100 Watts @ 1.5 volts. 3.5 cm². 6 Layer Metal. 100 million transistors: 8 million logic, 92 million RAM.
21364 Summary The 21364 integrated L2 cache and memory controller provide outstanding single processor performance The 21364 integrated network interface enables high performance multi-processor systems The high level of integration directly supports systems containing a large number of processors
21464 Overview Enhanced out-of-order execution 8-wide superscalar Large on-chip L2 cache Direct RAMBUS interface On-chip router for system interconnect Glueless, directory-based, ccNUMA for up to 512-way SMP 4-way simultaneous multithreading (SMT)
What Changed? Multiple Program Counters: Choose among them. More Architectural Register Space: Mapper. Register Files. Distinguished Per Thread Instruction State: Register mapping. Instruction Retire. Store Buffers. Abort and Restart Information.
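The idea of keeping several program counters and choosing among them each cycle can be illustrated with a toy model. This is only a sketch: the slide does not say how the 21464 picks a thread, so the round-robin policy that skips stalled threads is an assumption, and `ThreadContext` and `FetchChooser` are hypothetical names. The fetch width of 8 follows the "8-wide superscalar" bullet, and 4 bytes is the fixed Alpha instruction size.

```python
from dataclasses import dataclass


@dataclass
class ThreadContext:
    pc: int                # per-thread program counter
    stalled: bool = False  # e.g. waiting on a cache miss


class FetchChooser:
    """Round-robin choice among per-thread program counters (assumed policy)."""

    def __init__(self, contexts):
        self.contexts = list(contexts)
        self._last = len(self.contexts) - 1  # so thread 0 is tried first

    def choose(self):
        """Return the index of the next non-stalled thread, or None."""
        n = len(self.contexts)
        for step in range(1, n + 1):
            tid = (self._last + step) % n
            if not self.contexts[tid].stalled:
                self._last = tid
                return tid
        return None

    def fetch(self, width=8, insn_bytes=4):
        """Fetch one block for the chosen thread; return (thread, start pc)."""
        tid = self.choose()
        if tid is None:
            return None
        ctx = self.contexts[tid]
        start = ctx.pc
        ctx.pc += width * insn_bytes  # advance only the chosen thread's PC
        return tid, start


threads = [ThreadContext(pc=0x1000 * (i + 1)) for i in range(4)]
threads[1].stalled = True
chooser = FetchChooser(threads)
for _ in range(4):
    print(chooser.fetch())
```

Only the program counters and the chooser are per-thread in this model; the fetch path itself is shared, which matches the slide's point that little else has to change.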
What Didn’t Change Almost everything else No basic functional changes in any stage No partitioned instruction cache No partitioned data caches No partitioned off-chip caches No extra register files Little special branch prediction mechanism
AlphaServer Family Today. Switched based system, 64-bit PCI I/O subsystems, Very Large Memory. Scalable clusters on DIGITAL UNIX, OpenVMS. Modular system packaging, advanced systems management. AlphaServer DS Series: Uni and dual processor systems. Offerings scale to 8GB memory. Up to 6 PCI slots. AlphaServer ES Series: 1-4 Processors. Up to 32GB of memory. Up to 10 PCI slots. AlphaServer GS Series: 1-64 Processors. Up to 128GB of memory. Up to 224 PCI slots.
AlphaServer DS10 Fast Memory Access Large total RAM -128MB up to 1GB High bandwidth access - 1.3 GB/s Flexible Internal Storage Internal dual channel IDE storage included Optional SCSI adapter supported 3 internal disk bays Special Features 4 PCI I/O slots (3 64-bit, 1 32-bit) 300 watt power supply 3U Small footprint - Rack or Desktop Dual embedded 10/100 Ethernet ports
New AlphaServer DS Series Solution for project environment. Fastest Uni processor design in a 1U formfactor. Fastest Memory Access with the Highest Bandwidth memory in its class. High speed I/O with 64 bit PCI. Sleek, compact and powerful package. Dual Purpose Solutions Support: Rack and desktop-ready for space constrained environments.
AlphaServer DS Series. Fast Memory Access: Large total RAM, 64MB up to 1GB. High bandwidth access, 1.3 GB/s. Flexible Internal Storage: Internal dual channel IDE. Wide range of PCI Options supported. 2 disk bays, 27 GB IDE / 18 GB SCSI. Special Features: Optional Slimline CD-Floppy Combo. Toolless features, snap out CD and Disk. Full 1 PCI I/O slot (64-bit). 150 watt power supply. 1U (1.75”) Small footprint, Rack or Desktop. Dual embedded 10/100b Ethernet ports. Performance and Management Features: Remote management console. Serverworks and Compaq Insight Manager.
Two ways to build an Alpha cluster. Sierra: Scalable, robust HPC platform. Maximum performance over broadest range of applications. Outstanding system management and reliability features. Beowulf: Complementary, low-cost, open source model. Leadership performance over other Linux platforms. Tru64 UNIX compatibility with common SWD tools. Support services through Compaq and partners.
Sierra Architecture Tera-scale systems derived from ASCI PathForward Very large Distributed Shared Memory systems High speed, scalable interconnect (Quadrics) Exploit EV6, EV7 & EV8 Installed and administered as single system System wide scheduler High performance file systems (PFS, CFS, AdvFS) Application availability
Alpha Beowulf Clusters. Compaq ships 64-bit Linux on Alpha systems. Myrinet and other popular interconnects are supported. ServerNet-II available in late 1999. Compaq Tru64 UNIX (Digital UNIX) development tools ported in 1999 (!)
Prepackaged Beowulf Cluster CT-D10MJ-SR Starter DS10-based Beowulf cluster, including eight Alphaserver DS10 compute nodes, one Alphaserver DS10 management station with keyboard/trackball and display, Myrinet™ system area network, 73.1 GB JBOD UltraSCSI disk storage, Ethernet multiplexer for system management, and all Linux software required for basic Beowulf operation.
ServerNet-II Interconnect Scalable high-performance network. 65,536 end nodes, 5 km range. Multi-gigabit, low latency, low CPU, cheap. VIA - Virtual Interface Architecture. MPI - Message Passing Interface. Open source Intel and Alpha Linux drivers. NT, Tru64, NonStop Clusters, VxWorks.
Virtual Interface Architecture (VIA). Communication through “Virtual Interfaces (VI)” with associated “Completion Queues (CQ)”. 65,536 VIs per node. RDMA & send/recv. Reliable reception. < 2% CPU utilization. Low latency/zero copy. Thread-safe, protected. Basis for COMPAQ’s “System I/O”. [figure: VIA software stack; labels: Applications, DBMS, Apps, OS Vendor API, VI Primitive Library, Open/Close/Map Memory, Send/Receive/Read/Write, VI Kernel Support, VI Kernel HW Interface, SAN Media Interface (ServerNet, Ethernet,...), VI, CQ]
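The relationship between Virtual Interfaces and Completion Queues can be sketched with a small in-process model. This is a toy illustration only, not the VI Primitive Library API: the class and method names are hypothetical, and a "send" here is a plain copy into a buffer the receiver posted in advance, standing in for the zero-copy transfer that real hardware would perform.

```python
from collections import deque


class CompletionQueue:
    """Collects completion records from any VI attached to it."""

    def __init__(self):
        self._entries = deque()

    def push(self, vi_id, kind, nbytes):
        self._entries.append((vi_id, kind, nbytes))

    def poll(self):
        """Return the oldest completion, or None if the queue is empty."""
        return self._entries.popleft() if self._entries else None


class VirtualInterface:
    """One endpoint of a connected pair, with pre-posted receive buffers."""

    def __init__(self, vi_id, cq):
        self.vi_id = vi_id
        self.cq = cq
        self.peer = None
        self._recv_buffers = deque()

    def connect(self, other):
        self.peer, other.peer = other, self

    def post_recv(self, buffer: bytearray) -> None:
        self._recv_buffers.append(buffer)

    def post_send(self, data: bytes) -> None:
        peer = self.peer
        if peer is None:
            raise RuntimeError("VI is not connected")
        if not peer._recv_buffers:
            raise RuntimeError("peer has no posted receive buffer")
        if len(data) > len(peer._recv_buffers[0]):
            raise ValueError("posted receive buffer is too small")
        buffer = peer._recv_buffers.popleft()
        buffer[:len(data)] = data  # data lands directly in the posted buffer
        self.cq.push(self.vi_id, "send", len(data))
        peer.cq.push(peer.vi_id, "recv", len(data))


cq_a, cq_b = CompletionQueue(), CompletionQueue()
vi_a, vi_b = VirtualInterface(0, cq_a), VirtualInterface(1, cq_b)
vi_a.connect(vi_b)
inbox = bytearray(16)
vi_b.post_recv(inbox)
vi_a.post_send(b"hello")
print(cq_b.poll(), bytes(inbox[:5]))
```

The application learns that work finished by polling a completion queue instead of blocking in the kernel, which is the property behind the low CPU utilization figure on the slide.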
ServerNet-II Hardware Components. Dual-port PCI interface (NIC): VIA in hardware, DCE; negligible CPU cost; 64 bit, 33 MHz & 66 MHz. 12 port crossbar switch: wormhole routed; < 300 nsec latency; “fat pipe” channel bonding; bridges to fibre channel, gigabit ethernet. Gigabit ethernet cables: copper or fibre optic; 5 meters to 5 km. [figure: Router II; labels: Line Card, IBC Logic, FCAL bridge, dual line card, or LAN bridge]
ServerNet-II Hardware Performance. 1.25+1.25 gigabit/s links 1999, doubles 2001. < 300 nanosecond path formation per stage. 1M end nodes, 5 km fibre optic links. Throughput: 33 MHz PCI-64: 166 MB/s (Single VI), 240 MB/s (Multiple VIs). 66 MHz PCI-64: 197 MB/s (Single VI), 350 MB/s (Multiple VIs).
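To put the measured throughput figures in context, they can be compared with the theoretical peak of the PCI bus, which is bus width times clock rate. The sketch below does that arithmetic using the nominal 33 MHz and 66 MHz clocks named on the slide; the function names are hypothetical, and the comparison is an illustration that does not appear in the talk.

```python
def pci_peak_mb_per_s(bus_bits: int, clock_mhz: float) -> float:
    """Theoretical peak: one bus-width transfer per clock, in MB/s."""
    return bus_bits / 8 * clock_mhz


# Measured throughput from the slide, in MB/s, keyed by (clock MHz, workload).
MEASURED_MB_PER_S = {
    (33, "single VI"): 166,
    (33, "multiple VIs"): 240,
    (66, "single VI"): 197,
    (66, "multiple VIs"): 350,
}


def bus_utilization(bus_bits: int = 64) -> dict:
    """Fraction of theoretical PCI peak reached by each measurement."""
    return {
        key: mbps / pci_peak_mb_per_s(bus_bits, key[0])
        for key, mbps in MEASURED_MB_PER_S.items()
    }


for (clock, workload), fraction in bus_utilization().items():
    print(f"{clock} MHz PCI-64, {workload}: {fraction:.0%} of peak")
```

The ratios only say how much of the bus's nominal capacity each measurement used; they do not identify where the remaining time is spent.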
Reliability Dual port NICs, dual network topologies. Link level CRC, in-band control protocol. Strong packet ordering guarantees. Every packet is acknowledged by receiver. Automatic retry on transmission failure. Avoids deadlock & livelock.
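The acknowledge-and-retry behaviour listed above can be sketched as a generic stop-and-wait loop. The slide does not describe ServerNet-II's actual protocol, so this is an assumed, simplified model: `Receiver`, `send_reliably`, and the `lost` callback are hypothetical names, and losses are injected deterministically so the example is repeatable.

```python
class Receiver:
    """Delivers each packet once, in order, and acknowledges every arrival."""

    def __init__(self):
        self.expected_seq = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq == self.expected_seq:
            self.delivered.append(payload)
            self.expected_seq += 1
        # Duplicates are acknowledged again but not delivered again.
        return seq


def send_reliably(payloads, receiver, lost, max_retries=3):
    """Send packets one at a time, retrying until each is acknowledged.

    `lost(seq, attempt)` returns "packet", "ack", or None to say what,
    if anything, is dropped on that attempt. Returns attempts per packet.
    """
    attempts = []
    for seq, payload in enumerate(payloads):
        for attempt in range(max_retries + 1):
            fate = lost(seq, attempt)
            if fate == "packet":
                continue  # never arrived, so no ack comes back: retry
            receiver.receive(seq, payload)
            if fate == "ack":
                continue  # arrived, but the ack was dropped: retry
            attempts.append(attempt + 1)
            break
        else:
            raise RuntimeError(
                f"packet {seq} not acknowledged after {max_retries + 1} attempts"
            )
    return attempts


# Drop packet 1 once, and drop the ack for packet 2 once.
faults = {(1, 0): "packet", (2, 0): "ack"}
rx = Receiver()
print(send_reliably([b"a", b"b", b"c"], rx, lambda s, a: faults.get((s, a))))
print(rx.delivered)
```

Sequence numbers let the receiver discard the retransmitted copy of packet 2, so a retry never causes duplicate or out-of-order delivery.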
Linux Software Development Tools for AlphaServers. Boosted performance with Compaq Portable Math Library (CPML) for Linux on Alpha: Significantly increases the precision and speed of mathematical calculations up to 10 times compared to other mathematical libraries currently available on Linux. Following the success of the Portable Math Library, now announcing plans for Compaq Extended Math Library to run on Linux AlphaServer systems. Compaq C compiler announced in April. Compaq Fortran compilers for Linux announced in April, available in July beta program. New Compaq C++ compiler: Makes it easy to support both Linux and Tru64 UNIX software. New Software Development Test-Drive capabilities: Test out the performance of your Application over the Web. Get help from our leading Linux developers to optimize your application.
Summary. Alpha is the fastest processor available. Alpha is available in a full range of high performance systems. Sierra systems provide complete tera-scale solutions. Compaq wants to be involved in the Beowulf community.