Supercomputers, Enterprise Servers, and Some Other Big Ones Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Topics for This Lecture
1. A discussion of the C.mmp, a parallel computer developed in the early 1970s.
2. Examination of enterprise computers, such as the IBM S/360 line.
3. More on the Cray-1 and similar vector processors.
4. A discussion of the big ones: the IBM BlueGene and the Cray Jaguar.
Scalar System Performance Early scalar (not vector) computers were evaluated in terms of performance in MIPS (Millions of Instructions Per Second). The VAX-11/780 was rated at 1 MIPS by its developer, the Digital Equipment Corporation. Some claim the term stands for Meaningless Indicator of Performance (for) Salesmen. Even so, the measure does reflect system performance.
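The MIPS figure is just an instruction count divided by elapsed time, scaled by one million. The following is a minimal sketch of that arithmetic; the function name and the sample run are illustrative assumptions, not measurements from the lecture.

```python
def mips(instruction_count: int, seconds: float) -> float:
    """Return millions of instructions executed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instruction_count / (seconds * 1_000_000)


# Hypothetical run: 2,000,000 instructions completed in 2 seconds.
rating = mips(2_000_000, 2.0)
print(f"{rating:.2f} MIPS")
```

A machine that retires one million instructions every second rates 1 MIPS under this definition, which is the rating quoted for the VAX.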
The C.mmp This was a multiprocessor system developed by Carnegie-Mellon in the early 1970s. It was thoroughly documented by Wulf and Harbison in what has been fairly called "one of the most thorough and balanced research-project retrospectives … ever seen." Remarkably, this paper gives a thorough description of the project's failures.
The C.mmp Itself The C.mmp is described as a multiprocessor composed of 16 PDP–11s, 16 independent memory banks, and a cross-point [crossbar] switch which permits any processor to access any memory. It includes an independent bus, called the IP bus, used to communicate control signals. As of 1978, the system included the following 16 processors and memory.
5 PDP–11/20s, each rated at 0.20 MIPS (that is, 200,000 instructions per second)
11 PDP–11/40s, each rated at 0.40 MIPS
3 megabytes of shared memory (650 nsec core and 300 nsec semiconductor)
The system was observed to compute at 6 MIPS.
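As a quick check on the figures above, the per-processor ratings can be summed to get the aggregate rated speed. This is only a sketch of the arithmetic; the dictionary layout and function name are assumptions made for illustration. The sum of the ratings comes to about 5.4 MIPS, which is in the same range as the 6 MIPS figure quoted on the slide; the slide does not explain the difference.

```python
# Rated speeds quoted above: model -> (number of processors, MIPS each).
processors = {
    "PDP-11/20": (5, 0.20),
    "PDP-11/40": (11, 0.40),
}


def aggregate_mips(config):
    """Sum count * rating over every processor model."""
    return sum(count * rating for count, rating in config.values())


total_processors = sum(count for count, _ in processors.values())
total_mips = aggregate_mips(processors)
print(total_processors, "processors,", round(total_mips, 2), "MIPS rated")
```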
The Design Goals of the C.mmp The goal of the project seems to have been the construction of a simple system using as many commercially available components as possible. The C.mmp was intended to be a research project not only in distributed processors, but also in distributed software. The native operating system designed for the C.mmp was called Hydra. It was designed as an OS kernel, providing only minimal services in order to encourage experimentation in system software. As of 1978, the software developed on top of the Hydra kernel included file systems, directory systems, schedulers and a number of language processors.
The C.mmp: Lessons Learned The use of two variants of the PDP–11 was considered a mistake, as it complicated the process of making the necessary processor and operating system modifications. The authors had used newer variants of the PDP–11 in order to gain speed, but concluded that it would have been better to have had a single processor model, regardless of speed.
The C.mmp: More Lessons Learned The critical component was expected to be the crossbar switch. Experience showed the switch to be very reliable, and fast enough. Early expectations that the raw speed of the switch would be important were not supported by experience. The authors concluded that most applications are sped up by decomposing their algorithms to use the multiprocessor structure, not by executing on processors with short memory access times.
Still More Lessons Learned
1. Hardware (un)reliability was our largest day–to–day disappointment … The aggregate mean–time–between–failure (MTBF) of C.mmp/Hydra fluctuated between two to six hours.
2. About two–thirds of the failures were directly attributable to hardware problems. There is insufficient fault detection built into the hardware.
3. We found the PDP–11 UNIBUS to be especially noisy and error–prone.
4. The crosspoint [crossbar] switch is too trusting of other components; it can be hung by malfunctioning memories or processors.
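The first two figures above can be combined into a rough estimate of how often hardware alone brought the system down. The sketch below assumes failures arrive at a constant rate, so that failure rate is the reciprocal of MTBF; that model and the function name are assumptions of this example, not claims from the report. Under that assumption, if two-thirds of failures are hardware failures, the hardware-only MTBF is 1.5 times the aggregate MTBF.

```python
def hardware_only_mtbf(aggregate_mtbf_hours: float, hardware_fraction: float) -> float:
    """Estimate MTBF counting only hardware failures.

    Assumes a constant failure rate, so rate = 1 / MTBF.
    """
    failure_rate = 1.0 / aggregate_mtbf_hours          # failures per hour
    hardware_rate = failure_rate * hardware_fraction   # hardware failures per hour
    return 1.0 / hardware_rate


# Aggregate MTBF range and hardware share quoted in the list above.
for mtbf in (2.0, 6.0):
    estimate = hardware_only_mtbf(mtbf, 2 / 3)
    print(f"aggregate {mtbf:.0f} h -> hardware-only about {estimate:.1f} h")
```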
Another Set of Lessons My favorite lesson learned is summarized in the following two paragraphs in the report. We made a serious error in not writing good diagnostics for the hardware. The software developers should have written such programs for the hardware. In our experience, diagnostics written by the hardware group often did not test components under the type of load generated by Hydra, resulting in much finger–pointing between groups.
Enterprise Servers & Supercomputers There are two categories of large computers based on the intended use. Enterprise servers handle simple problems, usually commercial transactions, but handle a very large transaction volume. Supercomputers handle complex problems, usually scientific simulations, and work only a few problems at a time.
The IBM S/360 Evolves The IBM S/360, introduced in April 1964, was the first of a line of compatible enterprise servers. At left is a modern variant, possibly a z/11. All computers in this line were specialized for commercial work.
The z196: A Cloud in a Box Part of the title quotes IBM sales material. According to IBM, the z196 is the high-end server and the flagship of the IBM systems portfolio. Some interesting features of the z196:
1. It contains 96 microprocessors running at 5.2 GHz. Each processor is quad-core.
2. It can execute 52 billion instructions per second.
3. The main design goal is Zero Down Time, so the system has a lot of redundancy.
A Mainframe SMP: IBM zSeries
Configurations range from a uniprocessor with one main memory card to a high-end system with 48 processors and 8 memory cards.
Dual-core processor chip: each includes two identical central processors (CPs); CISC superscalar microprocessor; mostly hardwired, some vertical microcode; 256-kB L1 instruction cache and a 256-kB L1 data cache.
L2 cache: 32 MB; clusters of five; each cluster supports eight processors and access to the entire main memory space.
System control element (SCE): arbitrates system communication; maintains cache coherence.
Main store control (MSC): interconnects L2 caches and main memory.
Memory card: 32 GB each, maximum of 8, for a total of 256 GB; interconnects to the MSC via synchronous memory interfaces (SMIs).
Memory bus adapter (MBA): interface to I/O channels; traffic goes directly to the L2 cache.
The Cray Series of Supercomputers Note the design. In 1976, the magazine Computerworld called the Cray–1 the world's most expensive love seat.
FLOPS: MFLOPS to Petaflops The workload for a modern supercomputer focuses mostly on floating-point arithmetic. As a result, all supercomputers are rated in terms of FLOPS (Floating Point Operations Per Second). The names scale by factors of 1000: MFLOPS, GFLOPS, TFLOPS, and PFLOPS (petaflops). Today's high-end machines rate in the range from 1 to 20 petaflops, with more on the way.
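The factor-of-1000 naming scheme can be sketched as a small formatting helper. The function name and table layout are assumptions made for illustration; the two sample values are figures quoted elsewhere in this lecture (71 teraflops for the quarter-scale Blue Gene/L and 1,026 Tflop/s for Roadrunner).

```python
# Unit names and their scale factors, largest first.
PREFIXES = [
    ("PFLOPS", 1e15),
    ("TFLOPS", 1e12),
    ("GFLOPS", 1e9),
    ("MFLOPS", 1e6),
]


def format_flops(flops: float) -> str:
    """Express a raw FLOPS value using the largest unit that fits."""
    for name, scale in PREFIXES:
        if flops >= scale:
            return f"{flops / scale:.2f} {name}"
    return f"{flops:.0f} FLOPS"


print(format_flops(71e12))     # quarter-scale Blue Gene/L figure
print(format_flops(1.026e15))  # Roadrunner LINPACK figure
```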
History of Seymour Cray and His Companies Seymour Cray is universally regarded as the father of the supercomputer. There are no other claimants to this title. Cray began work at Control Data Corporation soon after its founding in 1957 and remained there until 1972. At CDC, he designed the CDC 1604, CDC 6600, and CDC 7600. The CDC 6600 is considered the first RISC machine, though Cray did not use the term.
Cray's Algorithm for Buying Cars
1. Go to the nearest auto dealership.
2. Look at the car closest to the entrance.
3. Offer to pay full price for that car.
4. Drive that car off the lot.
5. Return to designing fast computers.
More History Cray left Control Data Corporation in 1972 to found Cray Research, based in Chippewa Falls, Wisconsin. The Cray–1 was delivered in 1976. This led to a bidding war, with Los Alamos National Lab paying more than list price. In 1989, Cray left the company in order to found the Cray Computer Corporation. His goal was to spend more time on research. Seymour Cray died on October 5, 1996.
More Vector Computers The successful introduction of the Cray-1 ensured the cash flow for Cray's company, and allowed future work along two lines.
1. Research and development on the Cray–2.
2. Production of a line of computers that were derivatives of the Cray–1. These were called the X–MP, Y–MP, etc.
The X–MP was introduced in 1982. It was a dual–processor computer with a 9.5 nanosecond (105 MHz) clock and 16 to 128 megawords of static RAM main memory. The Y–MP was introduced in 1988, with up to eight processors that used VLSI chips. It had a 32–bit address space, with up to 64 megawords of static RAM main memory.
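The pairing of a 9.5 nanosecond clock with 105 MHz follows from frequency being the reciprocal of the clock period. A minimal sketch of the conversion is shown here; the function name is an assumption for illustration.

```python
def clock_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz.

    f (Hz) = 1 / (period_ns * 1e-9), so f (MHz) = 1000 / period_ns.
    """
    return 1_000.0 / period_ns


# The X-MP clock period quoted above.
print(f"{clock_mhz(9.5):.1f} MHz")
```

The result is slightly above 105 MHz, consistent with the rounded figure on the slide.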
The Cray-2 While his assistant, Steve Chen, oversaw the production of the commercially successful X–MP and Y–MP series, Seymour Cray pursued his development of the Cray–2. The original intent was to build the VLSI chips from gallium arsenide (GaAs), which would allow much faster circuitry. The technology for manufacturing GaAs chips was not then mature enough to be used for mass production. The Cray–2 was a four–processor computer that had 64 to 512 megawords (64-bit words) of 128–way interleaved DRAM memory. The computer was built very small in order to be very fast; as a result, the circuit boards were built as very compact stacked cards. VLSI chips were not yet available in 1985.
Cooling the Cray-2 Due to the card density, it was not possible to use air cooling. The entire system was immersed in a tank of Fluorinert, an inert liquid intended to be a blood substitute. When introduced in 1985, the Cray–2 was not significantly faster than the X–MP.
The Cray-3 and Cray-4 The Cray–3, a 16–processor system, was announced in 1993 but never delivered. The Cray–4, a smaller version of the Cray–3 with a 1 GHz clock, was cancelled when the Cray Computer Corporation went bankrupt in 1995. In 1993, Cray Research moved away from pure vector processors, producing its first massively parallel processing (MPP) system, the Cray T3D. Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996. It was spun off as a separate business unit in August 1999. In March 2000, Cray Research was merged with Tera Computer Company to form Cray, Inc. Cray, Inc. is going strong today (as of Summer 2012).
MPP and the AMD Opteron Beginning in the 1990s, there was a move to build supercomputers as MPP (Massively Parallel Processor) machines. These would have tens of thousands of stock commercial processors. The AMD Opteron quickly became the favorite chip, as it offered a few features not found in the Intel Pentium line.
The Opteron The AMD Opteron is a 64–bit processor that can operate in three modes.
1. In legacy mode, the Opteron runs standard Pentium binary programs unmodified.
2. In compatibility mode, the operating system runs in full 64–bit mode, but applications must run in 32–bit mode.
3. In 64–bit mode, all programs can issue 64–bit addresses; both 32–bit and 64–bit programs can run simultaneously in this mode.
The Cray XT-5 Introduced in 2007, this is built from about 60,000 Quad–Core AMD Opteron processors.
Some Comments on the Jaguar On July 30, 2008 the NCCS took delivery of the first 16 of 200 cabinets of an XT5 upgrade to Jaguar that ultimately has taken the system to 1,639 TF with 362 TB of high-speed memory and over 10,000 TB of disk space. The final cabinets were delivered on September 17, 2008. Twelve days later on September 29th, this incredibly large and complex system ran a full-system benchmark application that took two and one-half hours to complete.
The IBM Blue Gene/L The Blue Gene system was designed in 1999 as a massively parallel supercomputer for solving computationally–intensive problems in, among other fields, the life sciences. The BlueGene/L was the first model built; it was shipped to Lawrence Livermore Lab in June 2003. A quarter–scale model, with 16,384 processors, became operational in November 2004 and achieved a computational speed of 71 teraflops.
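Dividing the quoted speed by the processor count gives a feel for what each processor contributed in the quarter-scale system. This is only a back-of-the-envelope sketch using the two numbers above; the function name is an assumption, and an even split across processors is a simplification.

```python
def per_processor_gflops(total_flops: float, processors: int) -> float:
    """Average GFLOPS per processor, assuming an even split."""
    return total_flops / processors / 1e9


# Quarter-scale Blue Gene/L figures quoted above.
average = per_processor_gflops(71e12, 16_384)
print(f"about {average:.2f} GFLOPS per processor")
```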
Notes on Installing the Blue Gene Because Blue Gene/Q is so unique, there are several atypical requirements for the data center that will house the unit.
1. Power. Each Blue Gene/Q rack has a maximum input power consumption of 106 kVA. Therefore, the magnitude of the electrical infrastructure is much larger than typical equipment.
2. Cooling. Because of the large load, each rack is cooled by air and water. The building facility water must support the Blue Gene/Q cooling system. A raised floor is not a requirement for air cooling the Blue Gene/Q.
3. Size. The Blue Gene/Q racks are large and heavy. Appropriate actions must be taken before, during, and after installation to ensure the safety of the personnel and the Blue Gene/Q equipment.
IBM's Comments Roadrunner is the first general purpose computer system to reach the petaflop milestone. On June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking petaflop, or 10^15 floating point operations per second. Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester, Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final destination is the Los Alamos National Laboratory (LANL). Most notably, Roadrunner is the latest tool used by the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the US nuclear weapons stockpile.
Roadrunner Breaks the Pflop/s Barrier
1,026 Tflop/s on LINPACK, reported on June 9, 2008.
6,948 dual-core Opteron + 12,960 Cell BE processors.
80 TByte of memory.
IBM built, installed at LANL.
Impact of Cloud Simulation Clouds affect both solar and terrestrial radiation and control precipitation. Poorly simulated cloud distribution impacts the global moisture budget. Several important climate features are poorly simulated, including:
Inter-tropical convergence zone (ITCZ)
Madden-Julian Oscillation (MJO)
Underestimation of low marine stratus clouds
Errors in precipitation patterns, especially monsoons
The effects of clouds in current global climate models are parameterized, not directly simulated. Currently, cloud systems are much smaller than model grid cells (unresolved).
Global Cloud System Resolving Climate Modeling
Direct simulation of cloud systems replaces statistical parameterization.
This approach was recently called a top priority by the 1st UN WMO Modeling Summit.
Direct simulation of cloud systems in global models requires exascale computing.
Parameterization of mesoscale cloud statistics performs poorly.
Individual cloud physics is fairly well understood.
Global Cloud System Resolving Models
1 km: Cloud system resolving models enable transformational change in the quality of simulation results.
25 km: Upper limit of climate models with cloud parameterizations.
200 km: Typical resolution of IPCC AR4 models.
[Figure axis label: Surface Altitude (feet)]
Computational Requirements Computational requirements for a 1 km Global Cloud System Resolving Model, based on David Randall's (CSU) icosahedral code:
Approximately 1,000,000x more computation than current production models
Must run 1000x faster than real time to be useful for climate studies
10 PetaFlops sustained, ~200 PF peak
ExaFlop(s) for required ensemble runs
20 billion subdomains
Minimum 20-million-way parallelism
Only 5 MB memory requirement per core
200 MB/s in 4 nearest-neighbor directions
Dominated by the equation of motion due to the CFL condition
[Figure panels: fvCAM, Icosahedral]
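Several per-core figures follow directly from the numbers above. The sketch below works through that arithmetic; it assumes that "20-million-way parallelism" means 20 million cores and uses decimal units (1 TB = 10^6 MB), both of which are assumptions of this example and not statements from the slide. The constant and function names are likewise illustrative.

```python
# Figures quoted on the slide above.
SUSTAINED_FLOPS = 10e15      # 10 PetaFlops sustained
PEAK_FLOPS = 200e15          # ~200 PF peak
SUBDOMAINS = 20e9            # 20 billion subdomains
CORES = 20e6                 # ASSUMPTION: 20-million-way parallelism = 20 million cores
MEMORY_PER_CORE_MB = 5       # 5 MB per core
LINK_MB_PER_S = 200          # 200 MB/s per neighbor direction
NEIGHBOR_DIRECTIONS = 4      # 4 nearest-neighbor directions


def derived_requirements():
    """Derive per-core and whole-machine figures from the quoted totals."""
    return {
        "sustained_mflops_per_core": SUSTAINED_FLOPS / CORES / 1e6,
        "subdomains_per_core": SUBDOMAINS / CORES,
        "total_memory_tb": MEMORY_PER_CORE_MB * CORES / 1e6,  # decimal TB
        "sustained_fraction_of_peak": SUSTAINED_FLOPS / PEAK_FLOPS,
        "neighbor_mb_per_s_per_core": LINK_MB_PER_S * NEIGHBOR_DIRECTIONS,
    }


for name, value in derived_requirements().items():
    print(f"{name}: {value:g}")
```

Under these assumptions each core sustains a modest rate and holds very little memory, so the difficulty lies in the degree of parallelism and the communication between neighbors, not in the speed of any single core.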