1 Chemical Supercomputing on the Cheap, CSC’82, 1999 Chemical Supercomputing on the Cheap: 94GFlops computer system at cdn$3680/gigaflop S. Patchkovskii, R. Schmid, and T. Ziegler Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4 Canada
2 Introduction
Accurate quantum-chemical modeling of systems of chemical interest is extremely computationally intensive and requires substantial amounts of memory and secondary storage. This has traditionally consigned first-principles calculations of chemical properties to large (and expensive) vector and parallel computers, placing them out of reach of most practical chemists. With the ever-increasing computational power of low-end workstations and commodity PCs, it is now possible to perform useful quantum-chemical calculations on inexpensive off-the-shelf hardware. Widely available and robust local area network (LAN) technologies, such as switched 100Mbit/second Ethernet, can be used to combine multiple workstations into a larger parallel system, providing supercomputer-level performance at a favorable price/performance ratio.
In this poster, we describe the COBALT cluster (Computers On Benches All Linked Together), a chemically oriented supercomputer built in our research group at the University of Calgary.
3 Cobalt hardware: Nodes
A node of the Cobalt cluster is a Compaq/Digital Personal Workstation model 500au. Each workstation is configured with:
[Node configuration table not preserved in this transcript.]
(*) SpecInt and SpecFP values are estimated from published results for a 500au system with 2Mb of L3 cache. For comparison, a top-of-the-line 550MHz Intel Xeon workstation with 512Kb of L2 cache achieves 24.4 SpecInt 95 and 17.1 SpecFP 95 and costs about cdn$4400 from Dell (May 1999).
4 Cobalt hardware: Network
Cobalt nodes communicate through a dedicated 96-port full-duplex 100BaseTx Ethernet switch, constructed from four 24-port 3COM SuperStack II 3300 switches linked by a matrix module.
Latency and bandwidth were measured with Larry McVoy’s Lmbench using otherwise idle nodes.
5 Cobalt hardware: The Cluster
[Diagram labels: Node 1 ... Node 93; Switch; World; links marked 93x100BaseTx, 2x100BaseTx, and 100BaseTx (half-duplex); 128Mb memory; 18Gbytes RAID-1 (4 spindles)]
RAID, assembly and miscellaneous costs: cdn$6,500
6 Cobalt: system software
7 Cobalt: Single system image
Single system image (SSI), the ability of a group of computers to present the illusion of a single large computer system, is considered the defining characteristic of clusters. To have a usability advantage over a pile of individual computers, a cluster must provide its users with an SSI covering most of the users’ problem areas. Cobalt nodes present the illusion of a single computer in several important aspects, namely:
8 Cobalt: application software
(*) Gaussian supports cluster environments through Network Linda, an extra-cost package that is not available on Cobalt.
9 Cobalt: total cost
The complete per-node construction price, including all hardware and software, is thus substantially lower than the retail price of a comparably equipped PC.
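The per-node price claim can be roughly cross-checked from figures quoted elsewhere in this transcript. The sketch below is only a back-of-the-envelope estimate: it assumes that the cluster has 93 nodes (the highest node label in the cluster diagram) and that the price per gigaflop in the poster title covers the complete system. The function and constant names are illustrative, not taken from the poster.

```python
# Back-of-the-envelope cost check using figures quoted in this transcript.
TOTAL_GFLOPS = 94.0            # from the poster title
COST_PER_GFLOP_CDN = 3680.0    # from the poster title
NODE_COUNT = 93                # ASSUMPTION: read off the "Node 93" diagram label
XEON_WORKSTATION_CDN = 4400.0  # Dell Xeon workstation price quoted on the nodes slide


def total_cost(gflops, cost_per_gflop):
    """Total system cost implied by a price-per-gigaflop figure."""
    return gflops * cost_per_gflop


def cost_per_node(total, nodes):
    """Average all-inclusive cost of one node."""
    if nodes < 1:
        raise ValueError("node count must be at least 1")
    return total / nodes


if __name__ == "__main__":
    total = total_cost(TOTAL_GFLOPS, COST_PER_GFLOP_CDN)
    per_node = cost_per_node(total, NODE_COUNT)
    print(f"implied total cost:    cdn${total:,.0f}")
    print(f"implied cost per node: cdn${per_node:,.0f}")
    print(f"quoted Xeon price:     cdn${XEON_WORKSTATION_CDN:,.0f}")
```

Under these assumptions the implied per-node figure comes out a little over cdn$3,700, below the cdn$4400 quoted for the Xeon workstation.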
10 Running ADF in parallel on Cobalt
ADF has been parallelized at the Vrije Universiteit in Amsterdam, and can use either the MPI or the PVM message-passing library. Only the computationally intensive parts of the program (numerical integration and density fitting) have been parallelized. All relatively inexpensive parts of the calculation are repeated on all participating nodes, which greatly reduces the amount of data that has to be communicated over the network. In a typical ADF run, the nodes have to synchronize only once per SCF cycle or gradient calculation.
[Figure: timelines for Node 1, Node 2, and Node 3 showing communications over time, for ADF and for a classical parallel application]
11 We illustrate the parallel performance of ADF with a full geometry optimization of nitridoporphyrinatochromium(V), a medium-sized molecule with 38 atoms shown on the left. This calculation used a polarized triple-ζ basis set on all atoms, resulting in 580 basis functions. The molecule was constrained to C4v symmetry. For this system, a serial calculation takes 683 minutes on a single Cobalt node (using 45Mb of memory and about 100Mbytes of disk space). For the parallel runs, the execution time can be approximated by Amdahl’s law [equation not preserved in this transcript], where T_serial is the inherently serial part of the calculation (21 minutes), T_parallel is the parallel part of the calculation (662 minutes), and T_overhead is the parallel overhead (103 minutes).
[Figure: speedup versus number of nodes, comparing the ideal and Amdahl-law curves]
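As a rough illustration of what the serial and parallel timings imply, the sketch below evaluates the classical form of Amdahl's law with the 21-minute and 662-minute figures quoted above. It deliberately leaves out the overhead term, because the way that term depends on the node count is not recoverable from this transcript. The result is therefore an optimistic upper bound, not the fitted curve from the poster. The function names are illustrative.

```python
# Classical Amdahl's law with the ADF timings quoted on this slide.
# ASSUMPTION: the parallel-overhead term is omitted, so these numbers are
# an upper bound on the achievable speedup rather than the poster's fit.
T_SERIAL_MIN = 21.0     # inherently serial part, minutes
T_PARALLEL_MIN = 662.0  # parallelizable part, minutes


def amdahl_time(n, t_serial=T_SERIAL_MIN, t_parallel=T_PARALLEL_MIN):
    """Estimated wall-clock time in minutes on n nodes, overhead ignored."""
    if n < 1:
        raise ValueError("node count must be at least 1")
    return t_serial + t_parallel / n


def amdahl_speedup(n, t_serial=T_SERIAL_MIN, t_parallel=T_PARALLEL_MIN):
    """Speedup relative to the single-node run."""
    return amdahl_time(1, t_serial, t_parallel) / amdahl_time(n, t_serial, t_parallel)


if __name__ == "__main__":
    for nodes in (1, 2, 4, 8, 16, 32):
        print(f"{nodes:3d} nodes: {amdahl_time(nodes):7.1f} min, "
              f"speedup {amdahl_speedup(nodes):5.2f}")
```

With these inputs the speedup can never exceed 683/21, or about 32.5, however many nodes are used.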
12 Running PAW in parallel on Cobalt
As a parallel application, PAW is the exact opposite of ADF. Computationally, it is dominated by fast Fourier transforms (FFTs), which place heavy demands on both inter-node bandwidth and round-trip latency. When running on n nodes, the parallel FFT algorithm used in PAW needs to exchange all Fourier coefficients on each node (which can easily require several hundred megabytes of storage) n times during each molecular dynamics (MD) step (see below), resulting in heavy communications traffic.
[Figure: Node 1, Node 2, Node 3]
In a typical parallel PAW run on Cobalt, the full-duplex 100Mbit/second communication links between the nodes and the central switch continuously run at over 20% utilization (or more than 2.5Mbytes/second) in each direction. In a sense, the Cobalt nodes and the communication network are perfectly matched for PAW runs: faster CPUs would have made the communication network choke on the data, while a slower communication network would have been unable to keep the CPUs busy.
13 To illustrate the performance of parallel PAW on Cobalt, consider an S_N2 substitution reaction between CH₃I and [Rh(CO)₂I₂]⁻. This medium-size simulation was performed in an 11Å periodic cell. In a serial run, a single time step requires about 83 seconds; a complete simulation consists of several thousand steps. Fitting the execution times measured at different node counts to Amdahl’s law gives (all times in seconds): [fitted values not preserved in this transcript]
[Figure: speedup versus number of nodes, comparing the ideal and Amdahl-law curves]
Unlike the ADF case, where the inherently serial part constitutes about 3% of the total work, PAW spends almost 10% of the total time in the serial section. As a consequence, PAW cannot efficiently utilize more than four Cobalt nodes for this simulation.
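To make the four-node limit concrete, one can work through the classical form of Amdahl's law. The derivation below rests on two simplifying assumptions that are not the poster's fitted model: the "almost 10%" figure is treated as the share s = 0.10 of the run time that does not speed up with more nodes, and the parallel overhead is ignored. The symbols s, S(n) and E(n) are introduced here for illustration only.

```latex
% Speedup and parallel efficiency for a non-scaling fraction s on n nodes
S(n) = \frac{1}{s + \dfrac{1 - s}{n}}, \qquad
E(n) = \frac{S(n)}{n} = \frac{1}{1 + s\,(n - 1)}

% With s = 0.10 (assumed):
S(4)  = \frac{1}{0.10 + 0.225}   \approx 3.08, \qquad E(4)  \approx 0.77
S(8)  = \frac{1}{0.10 + 0.1125}  \approx 4.71, \qquad E(8)  \approx 0.59
S(16) = \frac{1}{0.10 + 0.05625} = 6.40,       \qquad E(16) = 0.40

% Upper limit as n grows without bound
\lim_{n \to \infty} S(n) = \frac{1}{s} = 10
```

Under these assumptions, efficiency is already down to roughly three quarters at four nodes and falls below 60% at eight, which is in line with the conclusion drawn on this slide.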
14 Molecular dynamics calculations in PAW are frequently limited by the amount of memory required rather than by the simulation time. In parallel mode, PAW can significantly reduce its per-node memory requirements by distributing both the real-space and Fourier-space grids between the nodes. Since the size of the grids grows with the unit cell size R as O(R^3), they dominate PAW memory requirements for all but the smallest systems. For the CH₃I and [Rh(CO)₂I₂]⁻ system, memory requirements in serial mode are relatively modest at 231 megabytes. In the parallel regime, per-node memory requirements are given by [equation not preserved in this transcript], where M_private is the amount of memory holding data private to a given node (7Mb), M_distributed is the amount of memory shared between the nodes (224Mb), and M_overhead is the parallel overhead (9Mb). Running this job on six nodes thus reduces the per-node memory requirements to just 53Mbytes. Parallel PAW has been used to run jobs requiring almost 3Gbytes of memory on Cobalt, even though no Cobalt node has more than 512Mb of memory installed.
[Figure: per-node memory usage versus number of nodes, ideal and measured]
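The per-node memory figures can be reproduced with a simple model. The equation itself is not preserved in this transcript, so the functional form used below is an assumption: the private part is held on every node, the distributed part is divided evenly by the node count, and a constant overhead is added whenever more than one node is used. This form was chosen only because it matches both numbers quoted above (231 megabytes in serial mode and about 53 megabytes on six nodes); the function name is illustrative.

```python
# Hypothetical per-node memory model for the PAW example on this slide.
# ASSUMPTION: the functional form is reconstructed from the quoted numbers
# and is not taken from the original (lost) equation.
M_PRIVATE_MB = 7.0        # data private to each node
M_DISTRIBUTED_MB = 224.0  # grid data divided between the nodes
M_OVERHEAD_MB = 9.0       # parallel overhead, assumed constant for n > 1


def per_node_memory_mb(n):
    """Estimated memory per node in megabytes when running on n nodes."""
    if n < 1:
        raise ValueError("node count must be at least 1")
    if n == 1:
        # A serial run holds everything on one node and pays no overhead.
        return M_PRIVATE_MB + M_DISTRIBUTED_MB
    return M_PRIVATE_MB + M_DISTRIBUTED_MB / n + M_OVERHEAD_MB


if __name__ == "__main__":
    for nodes in (1, 2, 4, 6, 8):
        print(f"{nodes} nodes: {per_node_memory_mb(nodes):6.1f} Mb per node")
```

Because the private and overhead parts do not shrink with the node count, the per-node requirement in this model levels off at about 16 megabytes for this system.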
15 Summary
We described the construction of the Cobalt cluster, a uniquely powerful and inexpensive dedicated computational chemistry resource. With a per-node construction cost typical of high-end PCs, Cobalt provides supercomputer-level performance on several quantum-chemical applications. Multiple nodes can be utilized in parallel, resulting in increased throughput and reduced wall-clock execution time. Tens of nodes can be utilized efficiently for a single large DFT calculation using ADF. For further information on Cobalt hardware and software, visit the Cobalt home page at http://www.cobalt.chem.ucalgary.ca
Credits
Financial support for the construction of the Cobalt cluster was provided by: Canada Foundation for Innovation (CFI); Alberta Intellectual Infrastructure Partnership program (AIIP); Department of Chemistry of the University of Calgary; Scientific Chemistry Simulations Inc., Netherlands; Mitsui Chemicals; Nova Chemicals
16 References and further reading
SPEC: SpecFp95 and SpecInt95 benchmark results are available on the web site of the Standard Performance Evaluation Corp. (SPEC) at http://www.specbench.org
Dell: Prices and system specifications of the Dell workstation were taken from the Dell Canada web site at http://www.dell.ca
3COM: Technical specifications of the 3COM fast Ethernet switches are available on the 3COM web site at http://www.3com.com
Lmbench: Larry McVoy’s Lmbench microbenchmark suite was downloaded from the Bitmover web site at http://www.bitmover.com/lmbench/
Clusters: Greg Pfister’s In Search of Clusters, 2nd edition, published by Prentice Hall in 1998, is the definitive guide to clusters
ADF: Additional information on the Amsterdam density functional code is available on the web site of Scientific Computing and Modeling at http://www.scm.com
PAW: Additional information on the PAW first-principles MD code is available on the Cobalt web site at http://www.cobalt.chem.ucalgary.ca/paw/
Gaussian: See the Gaussian Inc. web site at http://www.gaussian.com/