
1 Near-Term NCCS & Discover Cluster Changes and Integration Plans: A Briefing for NCCS Users October 30, 2014

2 Agenda
– Storage Augmentations
– Discover Cluster Hardware Changes
– SLURM Changes
– Q & A

3 Storage Augmentations
Dirac (Mass Storage) Disk Augmentation:
– 4 petabytes usable (5 petabytes “raw”)
– Arriving November 2014
– Gradual integration (many files and inodes to move)
Discover Storage Expansion:
– 8 petabytes usable (10 petabytes “raw”)
– Arriving December 2014
– For both the targeted “Climate Downscaling” project and general use
– Begins operational use in mid-to-late December

4 Motivation for Near-Term Discover Changes
– Due to demand for more resources, we are undertaking major Discover cluster augmentations. The result will be a 3x increase in computing capacity and a net increase of 20,000 cores for Discover.
– But we have floor space and power limitations, so we need a phased removal of the oldest Discover processors (12-core Westmeres).
– The interim reduction in Discover cores will be partly relieved by the addition of previously dedicated compute nodes.
– Prudent use of SLURM features can help optimize your job’s turnaround during the “crunch time” (more on this later).

5 Discover Hardware Changes
What we have now (October 2014):
– 12-core “Westmere” and 16-core “Sandy Bridge”
What’s being delivered near term:
– 28-core “Haswell”
What’s being removed to make room:
– 12-core “Westmere”
Impacts for Discover users: there will be an interim “crunch time” with fewer nodes/cores available. (The transition schedule is subject to change.)

6 Discover Compute Nodes, Early October 2014 (Peak ~600 TFLOPS)
“Westmere” nodes, 12 cores per node, 2 GB memory per core (SLES11 SP1):
– SCU7: 1,200 nodes, 14,400 cores total, 161 TFLOPS peak
– SCU1, SCU2, SCU3, SCU4: 1,032 nodes, 12,384 cores total, 139 TFLOPS peak
“Sandy Bridge” nodes, 16 cores per node (SLES11 SP1):
– SCU8, 2 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
– SCU9, 4 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak

7 Discover Compute Nodes, Late January 2015 (Peak ~2,200 TFLOPS)
No remaining “Westmere” nodes.
“Sandy Bridge” nodes, 16 cores per node, no change (SLES11 SP1):
– SCU8, 2 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
– SCU9, 4 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
“Haswell” nodes, 28 cores per node, new (SLES11 SP3):
– SCU10, 4.5 GB memory per core: 1,080 nodes, 30,240 cores total, 1,229 TFLOPS peak
– SCU11, 4.5 GB memory per core: ~600 nodes, 16,800 cores total, 683 TFLOPS peak
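To see which node types and counts are actually online at any point during the transition, SLURM’s sinfo command can summarize the cluster. This is a minimal sketch; the exact partition and feature names reported will depend on the NCCS configuration of the day:

sinfo -o "%P %D %c %m %f"   # partition, node count, CPUs per node, memory (MB), feature tags (e.g., processor type)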

8 [Chart] Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)

9 [Chart] Total Number of Discover Intel Xeon Processor Cores as a Function of Time

10 [Chart] Projected Weekly Detail (Subject to Change): Discover Processor Cores for General Work

11 Discover “Crunch Time” Transition: SLURM Job Turnaround Tips (1)
See http://www.nccs.nasa.gov/methods_job_slurm.html for many helpful tips.
Time limits:
– Specify a preferred maximum time limit, and also a minimum time limit if your workflow performs self-checkpointing.
– In this example, if you know that your job will save its intermediate results within the first 4 hours, these directives cause SLURM to schedule your job in the earliest available time window of 4 hours or longer, up to 12 hours:
#SBATCH --time=12:00:00
#SBATCH --time-min=04:00:00
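Putting both directives in context, a minimal self-checkpointing job script might look like the sketch below. This is illustrative only; the job name, task count, and run_model.sh workflow script are hypothetical placeholders, and the key lines are the two time directives from the slide above.

#!/bin/bash
#SBATCH --job-name=ckpt-example    # hypothetical job name
#SBATCH --ntasks=16                # hypothetical task count
#SBATCH --time=12:00:00            # preferred maximum wall time
#SBATCH --time-min=04:00:00        # minimum acceptable window; job checkpoints within 4 hours

# run_model.sh is a placeholder for a workflow that writes restart files
# at least once every 4 hours, so an early termination loses little work.
srun ./run_model.sh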

12 Discover “Crunch Time” Transition: SLURM Job Turnaround Tips (2)
See http://www.nccs.nasa.gov/methods_job_slurm.html for many additional helpful tips.
– Don’t specify a SLURM partition unless you are trying to access specialized hardware, such as datamove or co-processor nodes.
– Do specify memory requirements explicitly, either as memory per node or as memory per CPU, e.g.:
#SBATCH --mem=12G
#SBATCH --mem-per-cpu=3G
– Don’t specify a processor architecture (e.g., ‘west’ or ‘sand’) if your job can run on any of the processors. NCCS’s SLURM configuration ensures that each job runs on only one type of processor architecture.
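For illustration, a script that follows all three tips requests memory explicitly and leaves out both the partition and the architecture constraint, so the scheduler can place the job wherever capacity opens up first. This is a hypothetical sketch; the job name, task count, and my_app executable are placeholders.

#!/bin/bash
#SBATCH --job-name=flexible-job    # hypothetical job name
#SBATCH --ntasks=32                # hypothetical task count
#SBATCH --mem-per-cpu=3G           # explicit memory request instead of implying it via a node type
#SBATCH --time=06:00:00
# Note: no --partition and no processor-architecture constraint, so the job
# is eligible for whichever architecture becomes available first.

srun ./my_app                      # placeholder executable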

13 Questions & Answers
NCCS User Services: support@nccs.nasa.gov
301-286-9120
https://www.nccs.nasa.gov
Thank you!

14 SUPPLEMENTAL SLIDES

15 SLURM Quality of Service (‘qos’)

Quality of service    Time limit    Max CPUs/user    Max running jobs/user
allnccs (default)     12 hrs        4096             N/A
debug                 1 hr          512              1
long                  24 hrs        516              N/A
serial                12 hrs        4096             1

Example SBATCH directive for qos:
#SBATCH --qos=long
There is no need to specify the default qos (allnccs).
See http://www.nccs.nasa.gov/primer/computing.html for more details (more SLURM information will be coming soon).
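As a usage illustration, a job that needs more than the default 12-hour limit could request the ‘long’ qos. The sketch below is hypothetical (task count and executable are placeholders), with the limits taken from the table above.

#!/bin/bash
#SBATCH --qos=long          # 24-hour limit, up to 516 CPUs per user (per the qos table)
#SBATCH --time=20:00:00     # within the 24-hour qos limit
#SBATCH --ntasks=256        # within the 516-CPU cap

srun ./long_simulation      # placeholder executable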

16 FY14–FY15 Cluster Upgrade
– Combined funding from FY14 and FY15
– Taking advantage of new Intel processors: double the floating-point operations over Sandy Bridge
– Decommission SCU7 (Westmeres)
Scalable Unit 10:
– Target to effectively double the NCCS compute capability
– 128 GB of RAM per node with FDR InfiniBand (56 Gbps) or greater
– Benchmarks used in procurement include GEOS5 and WRF
– Target delivery date ~October 2014

17 Letter to NCCS Users

The NCCS is committed to providing the best possible high performance computing solutions to meet NASA’s science requirements. To this end, the NCCS is undertaking major integration efforts over the next several months to dramatically increase both the overall compute and storage capacity of the Discover cluster. The end result will increase the total number of processor cores by over 20,000 while increasing the peak computing capacity by almost a factor of 3x!

Given the budgetary and facility constraints, the NCCS will be removing parts of Discover to make room for the upgrades. The charts shown on this web page (PUT URL FOR INTEGRATION SCHEDULE HERE) show the current schedules and the impacts of the changes to the cluster environment. The decommissioning of Scalable Compute Unit 7 (SCU7) has already begun and will be complete by early November. After the new Scalable Compute Unit 10 (SCU10) becomes available, Scalable Compute Units 1 through 4 will be removed later this year. The end result will be the removal of all Intel Westmere processors from the Discover environment by the end of the 2014 calendar year.

While we are taking resources out of the environment, users may run into longer wait times as the new systems are integrated into operations. To alleviate this potential issue, the NCCS has coordinated with projects that are currently using dedicated systems in order to free up resources for general processing. Given the current workload, we are confident that curtailing the dedication of resources to specialized projects will keep wait times at their current levels.

The NCCS will communicate frequently with our user community throughout the integration efforts. Email will be sent out with information about the systems being taken offline and added. This web page, while subject to change, will be updated frequently, and as always, users are welcome to contact the support desk (support@nccs.nasa.gov) with any questions.

There is never a good time to remove computational capabilities, but the end result will be a major boost for the overall science community. Throughout this process, we are committed to doing everything possible to work with you to get your science done. We ask for your patience as we work through these changes to the environment, and we are excited about the future science that will be accomplished using NCCS resources!

Sincerely,
The NCCS Leadership Team

18 Discover Xeon Sandy Bridge Nodes: SCU8 and SCU9 (transition weeks of Oct. 27 – Dec. 26)
SCU8 (SLES11 SP1, 480 nodes, 7,680 cores, Intel Sandy Bridge, 160 TF peak):
– No changes will be made to SCU8 throughout this time period. SCU8 will be available for general use.
SCU9 (SLES11 SP1, 480 nodes, 7,680 cores, Intel Sandy Bridge, 160 TF peak):
– No changes will be made to SCU9 throughout this time period. This system has been dedicated to a specific project but will be made available for general use starting in early November.

19 Discover Compute: Transition Schedule (weeks of Oct. 27 – Dec. 26)
SCU7 decommissioning (SLES11 SP1, 1,200 nodes, 14,400 cores, Intel Westmere, 161 TF peak):
– The first 200 nodes of Scalable Unit 7 (SCU7, installed 2010) will be removed the week of October 27th; the remaining 1,000 nodes will be removed the week of November 3rd. The space vacated by SCU7 will be used to house the new SCU10 compute nodes.
SCU10 integration (SLES11 SP3, 1,080 nodes, 30,240 cores, Intel Haswell, 1,229 TF peak):
– Equipment delivery is scheduled for November 12th. The vendor will take about one week to cable the system and another week to perform the initial burn-in. The NCCS will then provision the system with Discover images and integrate it with the storage, followed by pioneer access. The target for general access is mid-December.
SCU11 integration (SLES11 SP3, 600 nodes, 16,800 cores, Intel Haswell, 683 TF peak):
– The integration schedule for the next 600 Haswell nodes (SCU11) is still being worked out; the NCCS is targeting delivery by the middle of December, subject to change. Removal of the final 516 Westmere nodes will coincide with SCU11 integration.
SCU1–SCU4 decommissioning (SLES11 SP1, 1,032 nodes, 12,384 cores, Intel Westmere, 139 TF peak):
– To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1, 2, 3, and 4 (installed in 2011) will be removed from operations in two batches of 516 nodes during the middle of December. The removal of half of these nodes will coincide with general access to SCU10.

20 Storage: Discover and Mass Storage Disk Augmentations (weeks of Oct. 27 – Dec. 26)
Dirac mass storage disk expansion (5,080 TB raw, 4,064 TB usable):
– Additional disk capacity for mass storage will be delivered at the beginning of November. The equipment will run through a variety of tests before being put into production operations. Once in operations, user directories on Dirac will be migrated to the new storage; the system administrators will coordinate the movement of file systems to the new storage with users. The additional capacity will dramatically increase the residence time of data on disk as new data is stored on Dirac, making recalls of recently stored data much faster.
Discover storage expansion (10,080 TB raw, 8,064 TB usable):
– Additional disk capacity for Discover will be delivered at the beginning of December. The equipment will run through a variety of tests before being put into production operations. This disk environment will be used for the downscaling experiments and for general use.

21 [Chart] Total Number of Discover Intel Xeon Processor Cores as a Function of Time

22 [Chart] Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)

