The Cray X1 Multiprocessor and Roadmap


1 The Cray X1 Multiprocessor and Roadmap

2 Cray’s Computing Vision
Scalable High-Bandwidth Computing
[Roadmap diagram, 2004-2010: one line runs X1 -> X1E (2004) -> 'Black Widow' (2006) -> 'Black Widow 2'; another runs Red Storm (2004) -> 'Strider 2', 'Strider 3', and 'Strider X' (2005-2006); the two lines converge through product integration toward 'Cascade' and sustained petaflops around 2010.]

3 Cray X1
Cray PVP heritage:
- Powerful vector processors
- Very high memory bandwidth
- Non-unit stride computation
- Special ISA features
-> Modernized the ISA
T3E heritage:
- Extreme scalability
- Optimized communication
- Memory hierarchy
- Synchronization features
-> Improved via vectors
(ISA = Instruction Set Architecture: the instructions implemented in the system.)
Result: extreme scalability with high-bandwidth vector processors.

4 Cray X1 Instruction Set Architecture
New ISA:
- Much larger register set (32x64 vector, scalar)
- All operations performed under mask
- 64- and 32-bit memory and IEEE arithmetic
- Integrated synchronization features
Advantages of a vector ISA:
- Compiler provides useful dependence information to hardware
- Very high single-processor performance with low complexity: ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr) (worked example below)
- Localized computation on the processor chip: large register state with very regular access patterns; registers and functional units grouped into local clusters (pipes) -> excellent fit with future IC technology
- Latency tolerance and pipelining to memory/network -> very well suited for scalable systems
- Easy extensibility to the next-generation implementation
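As a worked example of the ops/sec identity above (the 800 MHz clock, the two vector pipes per SSP, and the fused multiply-add per pipe per cycle are commonly cited X1 figures assumed here, not numbers stated on this slide):

```latex
% Peak rate of one SSP, then one 4-SSP MSP (assumed: 800 MHz clock,
% 2 vector pipes per SSP, 2 flops per pipe per cycle via multiply-add).
\[
  0.8\ \text{Gcycles/s} \times 2\ \text{pipes} \times 2\ \text{flops/(pipe} \cdot \text{cycle)}
  = 3.2\ \text{Gflops per SSP},
  \qquad
  4 \times 3.2\ \text{Gflops} = 12.8\ \text{Gflops per MSP}.
\]
```

This matches the 12.8 Gflops per MSP quoted on the node slide that follows.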

5 Not Your Father’s Vector Machine
- New instruction set, new system architecture, new processor microarchitecture
- "Classic" vector machines were programmed differently:
  - Classic vector: optimize for loop length with little regard for locality
  - Scalable micro: optimize for locality with little regard for loop length
- The Cray X1 is programmed like other parallel machines:
  - Rewards locality: register, cache, local memory, remote memory
  - Decoupled microarchitecture performs well on short loop nests (it does, however, require vectorizable code)

6 Cray X1 Node
[Node diagram: MSPs (P), caches ($), M chips, local memory, and I/O; 51 Gflops and 200 GB/s per node.]
- Four multistream processors (MSPs), each 12.8 Gflops
- High-bandwidth local shared memory (128 Direct Rambus channels)
- 32 network links and four I/O links per node
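A rough consistency check on these node numbers (the 1.6 GB/s per Direct Rambus channel is an assumption based on standard PC800 RDRAM, not a figure from the slide):

```latex
\[
  4\ \text{MSPs} \times 12.8\ \text{Gflops} = 51.2 \approx 51\ \text{Gflops},
  \qquad
  128\ \text{channels} \times 1.6\ \text{GB/s} \approx 205 \approx 200\ \text{GB/s}.
\]
```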

7 NUMA Scalable up to 1024 Nodes
Interconnection network (slide by Steve Scott):
- 16 parallel networks for bandwidth
- Global shared memory across the machine

8 Network Topology (16 CPUs)
[Diagram: nodes 0-3, each with four processors (P) and M chips M0-M15, cabled across sections 0-15.] This is the interconnect of an air-cooled cabinet (or a tiny liquid-cooled system): strictly cabling between the M chips on each node, with a maximum of two hops from any node to any other.

9 Network Topology (128 CPUs)
[Diagram: nodes joined by 16 links through router modules (R).] Example of two liquid-cooled (LC) cabinets. The "R" boxes are router modules that handle the interconnect beyond 8 nodes and between cabinets.

10 Network Topology (512 CPUs)
An even larger system: 8 full LC cabinets.

11 Designed for Scalability
T3E heritage:
- Distributed shared memory (DSM) architecture
- Low-latency load/store access to the entire machine (tens of TBs) (see the sketch after this slide)
Decoupled vector memory architecture for latency tolerance:
- Thousands of outstanding references, flexible addressing
Very high performance network:
- High bandwidth, fine-grained transfers
- Same router as the Origin 3000, but 32 parallel copies of the network
Architectural features for scalability:
- Remote address translation
- Global coherence protocol optimized for distributed memory
- Fast synchronization
- Parallel I/O scales with system size
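The sketch below illustrates the fine-grained, one-sided access style that low-latency global load/store memory supports. It is a generic illustration written against the OpenSHMEM C API rather than Cray's exact library of the period; the program, names, and sizes are assumptions, not code from the presentation.

```c
#include <stdio.h>
#include <shmem.h>

#define MAXPES 1024
long slots[MAXPES];        /* symmetric array: same address on every PE */

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long mine = (long)me;
    /* One-sided, fine-grained write: deposit a single word directly into
     * PE 0's copy of slots[me]; no matching receive is required. */
    shmem_long_put(&slots[me], &mine, 1, 0);

    shmem_barrier_all();   /* wait for all remote writes to complete */

    if (me == 0) {
        long sum = 0;
        for (int i = 0; i < npes; i++) sum += slots[i];
        printf("sum of PE ids across %d PEs = %ld\n", npes, sum);
    }
    shmem_finalize();
    return 0;
}
```

The point of the example is the access granularity: each transfer is a single word to a remote address, exactly the kind of traffic that rewards a high-bandwidth, fine-grained network.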

12 Decoupled Microarchitecture
Decoupled access/execute and decoupled scalar/vector:
- Scalar unit runs ahead, doing addressing and control
- Scalar and vector loads issued early
- Store addresses computed and saved for later use
- Operations queued and executed later, when data arrives
Hardware dynamically unrolls loops:
- Scalar unit goes on to issue the next loop before the current loop has completed
- Full scalar register renaming through 8-deep shadow registers
- Vector loads renamed through load buffers
- Special sync operations keep the pipeline full, even across barriers
This is key to making the system handle short-VL (short vector length) code well.
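The loop below is a generic, hypothetical example (not taken from the slides) of the kind of vectorizable code this pipeline targets; the comments sketch informally how the decoupled scalar and vector units would divide the work.

```c
/* Generic gather-style loop a vectorizing compiler could map onto the
 * decoupled pipeline described above. */
void scale_gather(double *restrict y, const double *restrict x,
                  const int *restrict idx, double a, int n)
{
    for (int i = 0; i < n; i++) {
        /* Scalar unit: runs ahead, handling loop control and address
         * generation for idx[i], x[idx[i]], and y[i], and issues the
         * loads early without waiting for data to return. */
        /* Vector unit: the multiply below is queued and executes later,
         * as the loaded elements stream back from memory. */
        y[i] = a * x[idx[i]];
        /* Store addresses are computed and buffered now; the stores
         * drain once the results are produced. */
    }
}
```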

13 Address Translation
- High translation bandwidth: scalar plus four vector translations per cycle per P chip
- Source translation using 256-entry TLBs, with support for multiple page sizes
- Remote (hierarchical) translation allows each node to manage its own memory (eases memory management)
- TLB only needs to hold translations for one node -> scales
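A deliberately simplified sketch of the remote (hierarchical) translation idea follows. The field widths, page size, and table layout are illustrative assumptions, not the X1's actual formats; the point is the division of labor, in which the source node only decides which node owns an address and the owning node finishes translation with its own local TLB and page tables.

```c
#include <stdint.h>

#define NODE_SHIFT  35u                           /* assume 32 GB per node */
#define OFFSET_MASK ((UINT64_C(1) << NODE_SHIFT) - 1)
#define PAGE_SHIFT  16u                           /* assume 64 KB pages    */
#define PAGE_MASK   ((UINT64_C(1) << PAGE_SHIFT) - 1)

typedef struct {
    uint32_t node;     /* home node, chosen by the source of the reference  */
    uint64_t offset;   /* virtual offset, translated later at the home node */
} remote_ref;

/* Step 1, at the source node: route on the high address bits only;
 * no remote page tables are consulted here. */
remote_ref route_reference(uint64_t global_vaddr)
{
    remote_ref r;
    r.node   = (uint32_t)(global_vaddr >> NODE_SHIFT);
    r.offset = global_vaddr & OFFSET_MASK;
    return r;
}

/* Step 2, at the home node: an ordinary local lookup (standing in for
 * the hardware TLB) maps the offset to a local physical address using
 * only that node's own page-frame table. */
uint64_t local_translate(const uint64_t *local_page_frames, uint64_t offset)
{
    uint64_t vpn = offset >> PAGE_SHIFT;          /* virtual page number */
    return (local_page_frames[vpn] << PAGE_SHIFT) | (offset & PAGE_MASK);
}
```

Because the second step always happens at the owning node, a TLB never needs entries for any other node's memory, which is what lets the scheme scale.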

14 Cache Coherence
Global coherence, but only cache memory from the local node (8-32 GB):
- Supports SMP-style codes up to 4 MSPs (16 SSPs)
- References outside this domain are converted to non-allocate
- Scalable codes use explicit communication anyway
- Keeps the directory entry and protocol simple
Explicit cache allocation control:
- Per-instruction hint for vector references
- Per-page hint for scalar references
- Use non-allocating references to avoid cache pollution
Coherence directory stored on the M chips (rather than in DRAM):
- Low latency and very high bandwidth to support vectors
- Typical CC system: 1 directory update per processor per 100-200 ns
- Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred)
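To unpack that last comparison (the 800 MHz SSP clock and one directory update per SSP per cycle are assumptions used here to reconstruct the slide's figure):

```latex
\[
  4\ \text{SSPs/MSP} \times 0.8\ \text{updates/ns} = 3.2\ \text{updates/ns per MSP},
  \qquad
  \frac{3.2\ \text{updates/ns}}{1\ \text{update}/(100\text{--}200\ \text{ns})} \approx 320\text{--}640\times .
\]
```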

15 System Software

16 Cray X1 I/O Subsystem
[Block diagram of the I/O path: the X1 node's I chip and IOCA drive SPC channels at 2 x 1.2 GB/s into I/O boards; the I/O boards carry 800 MB/s PCI-X buses with PCI-X cards; 200 MB/s FC-AL links reach RS200 storage (SB/CB) running XFS and XVM and ADIC SNFS; CNS and CPES servers connect over Gigabit Ethernet or HIPPI networks, with IP over fiber.]

17 Cray X1 UNICOS/mp Single System Image
- A UNICOS/mp UNIX kernel executes on each Application node (somewhat like the Chorus™ microkernel on UNICOS/mk) and provides SSI, scaling, and resiliency
- The global resource manager (Psched) schedules applications on Cray X1 nodes (shown: a 256-CPU Cray X1)
- Commands launch applications as on the T3E (mpprun)
- System Service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on System Service nodes

18 Programming Models
Shared memory:
- OpenMP, pthreads
- Single node (51 Gflops, 8-32 GB)
- Fortran and C
Distributed memory:
- MPI, shmem(), UPC, Co-array Fortran
- Hierarchical models (DM over OpenMP)
Both MSP and SSP execution modes are supported for all programming models.
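A minimal hybrid sketch of the hierarchical model named above, written as generic MPI over OpenMP in C (not Cray-specific code; the loop and names are illustrative assumptions):

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Shared-memory level: OpenMP threads within a node split the work. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (1.0 + (double)i);

    /* Distributed-memory level: MPI combines partial sums across nodes. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f over %d ranks\n", total, nranks);

    MPI_Finalize();
    return 0;
}
```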

19 Mechanical Design

20 Field-Replaceable Memory
[Node board photo, labeled: field-replaceable memory daughter cards, spray cooling caps, CPU MCM (8 chips), network interconnect, 17" x 22" PCB, air cooling manifold.]

21 Cray X1 Node Module

22 Cray X1 Chassis

23 64-Processor Cray X1 System, ~820 Gflops

24 The Next Generation HPC
Cray BlackWidow: the next-generation HPC system from Cray Inc.

25 BlackWidow Highlights
Designed as a follow-on to the Cray X1:
- Upward-compatible ISA
- FCS in 2006
Significant improvement (>> Moore's Law rate) in:
- Single-thread scalar performance
- Price/performance
Refine the X1 implementation:
- Continue to support DSM with 4-way SMP nodes
- Further improve sustained memory bandwidth
- Improve scalability up and down
- Improve reliability/fault tolerance
- Enhance the instruction set
A few BlackWidow features:
- Single-chip vector microprocessor
- Hybrid electrical/optical interconnect
- Configurable memory capacity, memory bandwidth, and network bandwidth

26 Cascade Project: Cray Inc., Stanford, Caltech/JPL, Notre Dame

27 High Productivity Computing Systems
Goals: provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2007-2010).
Impact:
- Performance (efficiency): improve critical national security applications by a factor of 10X to 40X
- Productivity (time-to-solution)
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS program focus areas (speaker notes):
Mission: provide a focused research and development program, creating new generations of high-end programming environments, software tools, architectures, and hardware components in order to realize a new vision of high-end computing: high productivity computing systems (HPCS). Address the issues of low efficiency, scalability, software tools and environments, and growing physical constraints. Fill the high-end computing gap between today's HPC, based on late-1980s technology, and the promise of quantum computing. Provide economically viable high productivity computing systems for the national security and industrial user communities in the latter part of this decade, with the following design attributes:
- Performance: improve the computational efficiency and performance of critical national security applications.
- Productivity: reduce the cost of developing, operating, and maintaining HPCS application solutions.
- Portability: insulate research and operational HPCS application software from system specifics.
- Robustness: deliver improved reliability to HPCS users and reduce the risk of malicious activities.
Background: high performance computing is at a critical juncture. Over the past three decades, this technology area has provided crucial, superior computational capability for many important national security applications. Government research, including substantial DoD investments, has enabled major advances in computing, contributing to U.S. dominance of the world computer market. Unfortunately, current trends in commercial high performance computing, future complementary metal oxide semiconductor (CMOS) technology challenges, and emerging threats are creating technology gaps that threaten continued U.S. superiority in important national security applications. As reported in recent DoD studies, there is a national security requirement for high productivity computing systems. Without government R&D and participation, high-end computing will be available only through commodity manufacturers primarily focused on mass-market consumer and business needs, a solution that would be ineffective for important national security applications. The HPCS program will significantly contribute to DoD and industry information superiority in the following critical application areas: operational weather and ocean forecasting; planning exercises related to analysis of the dispersion of airborne contaminants; cryptanalysis; weapons (warheads and penetrators); survivability/stealth design; intelligence/surveillance/reconnaissance systems; virtual manufacturing/failure analysis of large aircraft, ships, and ...
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Fill the critical technology and capability gap between today's HPC (late-1980s technology) and the future (quantum/bio computing).

28 HPCS Planned Program Phases 1-3
[Program timeline, fiscal years 2002-2010: Phase 1 industry concept study, Phase 2 R&D, and Phase 3 full-scale development (planned). Critical milestones include the industry BAA/RFP, concept reviews, the system design review, PDR and early prototype, Phase 2 readiness reviews, and the Phase 3 readiness review. Activity tracks cover application analysis, performance assessment, requirements and metrics, industry-guided research, technology assessments, early academia involvement, research prototypes and pilot systems, early metrics and benchmarks, and delivery of metrics, products, software research, pilot benchmarks, tools, platforms, and HPCS capability or products.]

29 Technology Focus of Cascade
System design:
- Network topology and router design
- Shared memory for low-latency, low-overhead communication
Processor design:
- Exploring clustered vectors, multithreading, and streams
- Enhancing temporal locality via deeper memory hierarchies
- Improving latency tolerance and memory/communication concurrency
Lightweight threads in memory:
- For exploitation of spatial locality with little temporal locality
- Thread -> data instead of data -> thread
- Exploring use of PIMs (processors in memory)
Programming environments and compilers:
- High-productivity languages
- Compiler support for heavyweight/lightweight thread decomposition
- Parallel programming performance analysis and debugging
Operating systems:
- Scalability to tens of thousands of processors
- Robustness and fault tolerance
(Speaker note: this lets us explore technologies we otherwise couldn't; a three-year head start on the typical development cycle.)

30 Cray’s Computing Vision
Scalable High-Bandwidth Computing
[Roadmap diagram repeated from slide 2: the X1 / X1E / 'Black Widow' line and the Red Storm / 'Strider' line converging through product integration toward 'Cascade' and sustained petaflops in 2010.]

31 Questions?

