Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA

Similar presentations


Presentation on theme: "The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA"— Presentation transcript:

1 The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA

2 Map Features  Alpha Processor Roadmap  Alpha Systems  Alpha Clusters

3 Processor Roadmap References:  Pete Bannon, “Alpha 21364: A Scalable Single- chip SMP”, Microprocessor Forum 1998, rum.htm rum.htm rum.htm  Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999  Joel Emer, “Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999

4 Alpha Roadmap Higher Performance Lower Cost EV EV  m EV  m 0.18  m EV7... EV  m EV78

5 Alpha is performance leader Compaq AlphaServer DS20 HP D380Sun UE250 SPECfp IBM F50

6 Alpha Systems  AlphaServer 8400 with EV6/575 * 37,541 tpmC at $79.4/tpmC for 8CPU 16GB Sybase V11.9 available 12/98 **estimated

7 IA-64.vs. Alpha Philosophy EPIC EPIC  Smart compiler and a dumb machine  Compiler creates record of execution  Machine plays record Stall when compiler is wrong Stall when compiler is wrong  Focus on vector programs Compiler transform scalar to vector Compiler transform scalar to vector  What about: function calls, indirection function calls, indirection dynamic linking dynamic linking C++, Java/JIT C++, Java/JIT ALPHA ALPHA  Smart compiler, smart machine, and a GREAT circuit design  Compiler creates record of execution  Machine exploits additional information available at runtime Works across barriers to compile- time analysis Works across barriers to compile- time analysis  Focus on scalar programs Add resources for vector Add resources for vector Amdahl’s law Amdahl’s law

8 Alpha Goals  Improve Single processor performance, operating frequency, and memory system Single processor performance, operating frequency, and memory system SMP scaling SMP scaling System performance density (computes/ft 3 ) System performance density (computes/ft 3 ) Reliability and availability Reliability and availability  Decrease System cost System cost System complexity System complexity

9 “It’s the Memory, Stupid” Dick Sites

10 Estimated time for TPC-C New core Higher MHz Higher integration

11 Alpha Features  Alpha core with enhancements  Integrated L2 Cache  Integrated memory controller  Integrated network interface  Support for lock-step operation to enable high- availability systems.

12 Memory Controller RAMBUSRAMBUS Chip Block Diagram Core 16 L1 Miss Buffers L2 Cache Address Out Address In Network Interface N S E W I/O 16 L1 Victim Buf 16 L2 Victim Buf 64K Icache 64K Dcache

13 Int Reg Map Branch Predictors Core FETCH MAP QUEUE REG EXEC DCACHE Stage: L2 cache1.5MB 6-Set Int Issue Queue (20) Exec 4 Instructions / cycle Reg File (80 ) Victim Buffer L1 Data Cache 64KB 2-Set FP Reg Map FP ADD Div/Sqrt FP MUL Addr 80 in-flight instructions plus 32 loads and 32 stores Addr Miss Address Next-Line Address L1 Ins. Cache 64KB 2-Set Exec Reg File (80 ) FP Issue Queue (15) Reg File (72 )

14 Integrated L2 Cache  1.5 MB  6-way set associative  16 GB/s total read/write bandwidth  16 Victim buffers for L1 -> L2  16 Victim buffers for L2 -> Memory  ECC SECDED code  12ns load to use latency

15 Integrated Memory Controller  Direct RAMbus High data capacity per pin High data capacity per pin 800 MHz operation 800 MHz operation 30ns CAS latency pin to pin 30ns CAS latency pin to pin  6 GB/sec read or write bandwidth  100s of open pages  Directory based cache coherence  ECC SECDED

16 Integrated Network Interface  Direct processor-to-processor interconnect  10 GB/second per processor  15ns processor-to-processor latency  Out-of-order network with adaptive routing  Asynchronous clocking between processors  3 GB/second I/O interface per processor

17 21364 System Block Diagram 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO

18 Alpha Technology  0.18  m CMOS  MHz  volts  3.5 cm 2  6 Layer Metal  100 million transistors 8 million logic 8 million logic 92 million RAM 92 million RAM

19 Alpha Performance/Status  70 SPECint95 (estimated)  140 SPECfp95 (estimated)  RTL model running  Tapeout 4Q99

20 21364 Summary  The integrated L2 cache and memory controller provide outstanding single processor performance  The integrated network interface enables high performance multi-processor systems  The high level of integration directly supports systems containing a large number of processors

21 21464 Overview  Enhanced out-of-order execution  8-wide superscalar  Large on-chip L2 cache  Direct RAMBUS interface  On-chip router for system interconnect  Glueless, directory-based, ccNUMA for up to 512- way SMP  4-way simultaneous multithreading (SMT)

22 Superscalar Instruction Issue Time

23 Multi-Threading

24 Simultaneous Multi-Threading Time

25 What Changed?  Multiple Program Counters Choose among them Choose among them  More Architectural Register Space Mapper Mapper Register Files Register Files  Distinguished Per Thread Instruction State Register mapping Register mapping Instruction Retire Instruction Retire Store Buffers Store Buffers Abort and Restart Information Abort and Restart Information

26 What Didn’t Change Almost everything else  No basic functional changes in any stage  No partitioned instruction cache  No partitioned data caches  No partitioned off-chip caches  No extra register files  Little special branch prediction mechanism

27 Multi-threaded Scaling 1.8x2.0x 1.9x 2.3x

28 AlphaServer DS Series Uni and dual processor systems Offerings scale to 8GB memory Up to 6 PCI slots Switched based system - 64-bit PCI I/O subsystems - Very Large Memory Scalable clusters on DIGITAL UNIX, OpenVMS Modular system packaging - advanced systems management 1-64Processors Up to 128GB of memory Up to 224 PCI slots Up to 32GB of memory 1- 4 Processors Up to 10 PCI slots AlphaServer ES Series AlphaServer GS Series AlphaServer Family Today

29 AlphaServer DS10 Fast Memory Access Large total RAM -128MB up to 1GB High bandwidth access GB/s Flexible Internal Storage Internal dual channel IDE storage included Optional SCSI adapter supported 3 internal disk bays Special Features 4 PCI I/O slots (3 64-bit, 1 32-bit) 300 watt power supply 3U Small footprint - Rack or Desktop Dual embedded 10/100 Ethernet ports

30 New AlphaServer DS Series Solution for project environment  Fastest Uni processor design in a 1U formfactor  Fastest Memory Access with the Highest Bandwidth memory in its class  High speed I/O with 64 bit PCI  Sleek, compact and powerful package  Dual Purpose Solutions Support Rack and desktop-ready for space constrained environments Rack and desktop-ready for space constrained environments

31 AlphaServer DS Series  Fast Memory Access Large total RAM - 64MB up to 1GB Large total RAM - 64MB up to 1GB High bandwidth access GB/s High bandwidth access GB/s  Flexible Internal Storage Internal dual channel IDE Internal dual channel IDE Wide range of PCI Options supported Wide range of PCI Options supported 2 disk bays-27 GB IDE / 18 GB SCSI 2 disk bays-27 GB IDE / 18 GB SCSI  Special Features Optional Slimline CD-Floppy Combo Optional Slimline CD-Floppy Combo Toolless features-snap out CD and Disk Toolless features-snap out CD and Disk Full 1 PCI I/O slot (64-bit) Full 1 PCI I/O slot (64-bit) 150 watt power supply 150 watt power supply 1U (1.75”) Small footprint – Rack or Desktop 1U (1.75”) Small footprint – Rack or Desktop Dual embedded 10/100b Ethernet ports Dual embedded 10/100b Ethernet ports  Performance and Management Features Remote management console Remote management console Serverworks and Compaq Insight Manager Serverworks and Compaq Insight Manager

32 New AlphaServer DS Series

33  Complementary, low-cost, open source model.  Leadership performance over other Linux platforms.  Tru64 UNIX compatibility with common SWD tools  Support services through Compaq and partners Two ways to build an Alpha cluster  Scalable, robust HPC platform  Maximum performance over broadest range of applications  Outstanding system management and reliability features Sierra Beowulf

34 Sierra Architecture  Tera-scale systems derived from ASCI PathForward  Very large Distributed Shared Memory systems  High speed, scalable interconnect (Quadrics)  Exploit EV6, EV7 & EV8  Installed and administered as single system  System wide scheduler  High performance file systems (PFS, CFS, AdvFS)  Application availability

35 Sierra – ASCI Pathforward Project

36 Alpha Beowulf Clusters  Compaq ships 64-bit Linux on Alpha systems  Myrinet and other popular interconnects are supported  SeverNet-II available in late 1999  Compaq Tru64Unix (Digital Unix) development tools ported in 1999 (!)

37 Prepackaged Beowulf Cluster CT-D10MJ-SR Starter DS10-based Beowulf cluster, including eight Alphaserver DS10 compute nodes, one Alphaserver DS10 management station with keyboard/trackball and display, Myrinet™ system area network, 73.1 GB JBOD UltraSCSI disk storage, Ethernet multiplexer for system management, and all Linux software required for basic Beowulf operation.

38 ServerNet-II Interconnect  Scalable high-performance network.  65,536 end nodes, 5 km range.  Multi-gigabit, low latency, low CPU, cheap.  VIA - Virtual Interface Architecture.  MPI - Message Passing Interface.  Open source Intel and Alpha Linux drivers.  NT, Tru64, NonStop Clusters, VxWorks.

39 Virtual Interface Architecture (VIA) Applications VI Primitive Library Open/Close/Map Memory Send/Receive/Read/Write VI Kernel Support VI Kernel HW Interface SAN Media Interface (ServerNet, Ethernet,...) OS Vendor API DBMSApps CQ VIVI VI  65,536 VIs per node.  RDMA & send/recv.  Reliable reception.  < 2% CPU utilization.  Low latency/zero copy.  Thread-safe, protected.  Basis for COMPAQ’s “System I/O”. Communication through “Virtual Interfaces (VI)” with associated “Completion Queues (CQ)”.

40 ServerNet-II Components Beowulf.loc1.Tandem.com

41 ServerNet-II Hardware Components Router II FCAL bridge, dual line card, or LAN bridge Line Car d IBC Logic FCAL bridge, dual line card, or LAN bridge  Dual-port PCI interface (NIC) –VIA in hardware, DCE –negligible CPU cost –64 bit, 33 MHz & 66 MHz  12 port crossbar switch –wormhole routed –< 300 nsec latency –“fat pipe” channel bonding –bridges to fibre channel, gigabit ethernet  Gigabit ethernet cables –copper or fibre optic –5 meters to 5 km

42 ServerNet-II Hardware Performance  gigabit/s links 1999, doubles  < 300 nanosecond path formation per stage.  1M end nodes, 5 km fibre optic links. Single VIMultiple VIs 33 MHz PCI MB/s240 MB/s 66 MHz PCI MB/s350 MB/s

43 Reliability  Dual port NICs, dual network topologies.  Link level CRC, in-band control protocol.  Strong packet ordering guarantees.  Every packet is acknowledged by receiver.  Automatic retry on transmission failure.  Avoids deadlock & livelock.

44 Linux Software Development Tools for AlphaServers  Boosted performance with Compaq Portable Math Library (CPML) for Linux on Alpha Significantly increases the precision and speed of mathematical calculations up to 10 times compared to other mathematical libraries currently available on Linux Significantly increases the precision and speed of mathematical calculations up to 10 times compared to other mathematical libraries currently available on Linux  Following the success of the Portable Math Library, now announcing plans for Compaq Extended Math Library to run on Linux AlphaServer systems  Compaq C compiler announced in April  Compaq Fortran compilers for Linux announced in April available in July beta program  New Compaq C++ compiler Makes it easy to support both Linux and Tru64 UNIX software Makes it easy to support both Linux and Tru64 UNIX software  New Software Development Test- Drive capabilities Test out the performance of your Application over the Web Test out the performance of your Application over the Web Get help from our leading Linux developers to optimize your application Get help from our leading Linux developers to optimize your application

45 SPEC CPU Benchmark* * not audited

46 Linpack (100x100) MFlops

47 Summary  Alpha is the fastest processor available  Alpha is available in a full range of high performance systems  Sierra systems provided complete tera-scale solutions  Compaq wants to be involved in the Beowulf community


Download ppt "The Alpha Roadmap How it applies to Alpha clusters Ray Hookway Compaq Computer Corporation Littleton, MA"

Similar presentations


Ads by Google