Niches, Long Tails, and Condos: Effectively Supporting Modest-Scale HPC Users. 21st High Performance Computing Symposium (HPC'13), San Diego Supercomputer Center.

1 SAN DIEGO SUPERCOMPUTER CENTER
Niches, Long Tails, and Condos: Effectively Supporting Modest-Scale HPC Users
21st High Performance Computing Symposium (HPC'13)
Rick Wagner, High Performance System Manager, San Diego Supercomputer Center
April 10, 2013

2 SAN DIEGO SUPERCOMPUTER CENTER
What trends are driving HPC in the SDSC machine room?
- Niches: specialized hardware for specialized applications
- Long tail of science: broadening the HPC user base
- Condos: leveraging our facility to support campus users

3 SAN DIEGO SUPERCOMPUTER CENTER
SDSC Clusters
- UCSD system: Triton Shared Compute Cluster
- XSEDE systems: Gordon, Trestles (https://www.xsede.org/)
- Overview: http://bit.ly/14Xj0Dm

4 SAN DIEGO SUPERCOMPUTER CENTER
SDSC Clusters: Gordon
Niches: specialized hardware for specialized applications

5 SAN DIEGO SUPERCOMPUTER CENTER
Gordon: An Innovative Data-Intensive Supercomputer
- Designed to accelerate access to massive amounts of data in genomics, earth science, engineering, medicine, and other areas
- Emphasizes memory and I/O over FLOPS
- Appro-integrated 1,024-node Sandy Bridge cluster
- 300 TB of high-performance Intel flash
- Large-memory supernodes via vSMP Foundation from ScaleMP
- 3D torus interconnect from Mellanox
- In production operation since February 2012
- Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program

6 SAN DIEGO SUPERCOMPUTER CENTER
Gordon Design: Two Driving Ideas
- Observation #1: Data keeps getting further away from processor cores ("red shift"). Do we need a new level in the memory hierarchy?
- Observation #2: Many data-intensive applications are serial and difficult to parallelize. Would a large, shared-memory machine be better from the standpoint of researcher productivity for some of these? It would also enable rapid prototyping of new approaches to data analysis.

7 SAN DIEGO SUPERCOMPUTER CENTER
Red Shift: Data keeps moving further away from the CPU with every turn of Moore's Law
[Figure: chart contrasting disk access time with CPU performance over time, annotated "BIG DATA LIVES HERE" on the disk side. Source: Dean Klein, Micron]

8 SAN DIEGO SUPERCOMPUTER CENTER
Gordon Design Highlights
- Interconnect: 3D torus, dual-rail QDR
- 64 dual-socket Westmere I/O nodes: 12 cores and 48 GB per node, 4 LSI controllers, 16 SSDs, dual 10GbE, SuperMicro motherboard, PCI Gen2
- Flash: 300 GB Intel 710 eMLC SSDs, 300 TB aggregate
- 1,024 dual-socket Xeon E5 (Sandy Bridge) compute nodes: 16 cores and 64 GB per node, Intel Jefferson Pass motherboard, PCI Gen3
- Large-memory vSMP supernodes: 2 TB DRAM, 10 TB flash
- "Data Oasis" Lustre parallel file system: 100 GB/s, 4 PB
(The aggregate flash, core, and memory figures are cross-checked in the sketch below.)
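
A minimal Python sketch, using only the per-node numbers from this slide, that recomputes the aggregate figures; the variable names are illustrative and the arithmetic is approximate, not an official SDSC accounting.

```python
# Rough aggregates derived from the per-node numbers on this slide.
io_nodes = 64            # dual-socket Westmere I/O nodes
ssds_per_io_node = 16    # Intel 710 eMLC SSDs per I/O node
ssd_capacity_gb = 300

compute_nodes = 1024     # dual-socket Xeon E5 (Sandy Bridge) compute nodes
cores_per_node = 16
dram_per_node_gb = 64

flash_tb = io_nodes * ssds_per_io_node * ssd_capacity_gb / 1000
print(f"Aggregate flash: ~{flash_tb:.0f} TB")                      # ~307 TB ("300 TB aggregate")
print(f"Compute cores:   {compute_nodes * cores_per_node:,}")      # 16,384
print(f"Aggregate DRAM:  ~{compute_nodes * dram_per_node_gb / 1000:.0f} TB")  # ~66 TB
```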

9 SAN DIEGO SUPERCOMPUTER CENTER
SDSC Clusters: Trestles
Long tail of science: broadening the HPC user base

10 SAN DIEGO SUPERCOMPUTER CENTER
The Majority of TeraGrid/XD Projects Have Modest-Scale Resource Needs
- "80/20" rule around 512 cores: roughly 80% of projects only run jobs smaller than this, and together they use less than 20% of resources
- Only about 1% of projects run jobs as large as 16K cores, and they consume more than 30% of resources
- Many projects/users only need modest-scale jobs and resources, and a modest-size resource can serve a large number of these projects/users
- Based on exceedance distributions of projects and usage as a function of the largest job (core count) run by a project over a full year (FY2009); the sketch below illustrates the calculation
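
As a rough illustration of how such an exceedance breakdown can be computed, here is a minimal Python sketch over a hypothetical job log of (project, cores, core_hours) records; the records and field layout are made up for illustration, not the actual FY2009 TeraGrid/XD data.

```python
from collections import defaultdict

# Hypothetical job log: (project, core count, core-hours charged).
jobs = [
    ("proj_a", 256, 1.0e4), ("proj_a", 512, 2.0e4),
    ("proj_b", 64, 5.0e3),  ("proj_c", 16384, 9.0e5),
]

largest_job = defaultdict(int)    # largest core count seen per project
usage = defaultdict(float)        # total core-hours per project
for project, cores, core_hours in jobs:
    largest_job[project] = max(largest_job[project], cores)
    usage[project] += core_hours

threshold = 512
modest = [p for p, c in largest_job.items() if c <= threshold]
print(f"Projects whose largest job is <= {threshold} cores: "
      f"{len(modest) / len(largest_job):.0%}")
print(f"Share of total core-hours used by those projects: "
      f"{sum(usage[p] for p in modest) / sum(usage.values()):.0%}")
```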

11 SAN DIEGO SUPERCOMPUTER CENTER
The Modest Scale
Source: XSEDE Metrics on Demand (XDMoD), https://xdmod.ccr.buffalo.edu/

12 SAN DIEGO SUPERCOMPUTER CENTER
Trestles Focuses on Productivity of Its Users Rather than System Utilization
- We manage the system with a different focus than has been typical of TeraGrid/XD systems
- Short queue waits are key to productivity; the primary system metric is expansion factor = 1 + (wait time / run time), computed as in the sketch below
- Long-running job queues (48 hours standard, up to 2 weeks)
- Shared nodes for interactive, accessible computing
- User-settable advance reservations
- Automatic on-demand access for urgent applications
- Robust suite of application software
- Once expectations for the system are established, say yes to user requests whenever possible
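
A minimal Python sketch of the expansion-factor metric as defined above; the job records and the choice of hours as the time unit are hypothetical.

```python
def expansion_factor(wait_hours: float, run_hours: float) -> float:
    """Expansion factor = 1 + (wait time / run time); 1.0 means no queue wait."""
    return 1.0 + wait_hours / run_hours

# Hypothetical jobs: (wait time, run time) in hours.
jobs = [(0.5, 12.0), (2.0, 48.0), (0.25, 1.0)]
factors = [expansion_factor(w, r) for w, r in jobs]
print(f"Per-job expansion factors: {[round(x, 2) for x in factors]}")
print(f"Mean expansion factor: {sum(factors) / len(factors):.2f}")
```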

13 SAN DIEGO SUPERCOMPUTER CENTER
Trestles is a 100 TF system with 324 nodes (each node: 4 sockets x 8 cores, 64 GB DRAM, 120 GB flash, AMD Magny-Cours)

AMD Magny-Cours compute node
  Sockets: 4
  Cores: 32
  Clock speed: 2.4 GHz
  Flop speed: 307 Gflop/s
  Memory capacity: 64 GB
  Memory bandwidth: 171 GB/s
  STREAM Triad bandwidth: 100 GB/s
  Flash memory (SSD): 120 GB

Full system
  Total compute nodes: 324
  Total compute cores: 10,368
  Peak performance: 100 Tflop/s
  Total memory: 20.7 TB
  Total memory bandwidth: 55.4 TB/s
  Total flash memory: 39 TB

QDR InfiniBand interconnect
  Topology: fat tree
  Link bandwidth: 8 GB/s (bidirectional)
  Peak bisection bandwidth: 5.2 TB/s (bidirectional)
  MPI latency: 1.3 us

Disk I/O subsystem
  File systems: NFS, Lustre
  Storage capacity (usable): 150 TB (Dec 2010), 2 PB (Aug 2011), 4 PB (July 2012)
  I/O bandwidth: 50 GB/s

(The full-system totals follow from the per-node column; see the sketch below.)
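
For illustration, a short Python sketch that cross-checks the full-system rows against the per-node column; all numbers are copied from the table above, and the variable names are ours.

```python
# Per-node figures from the "AMD Magny-Cours compute node" column.
nodes = 324
cores_per_node = 32
gflops_per_node = 307
mem_per_node_gb = 64
mem_bw_per_node_gbs = 171
flash_per_node_gb = 120

print(f"Total cores:      {nodes * cores_per_node:,}")                      # 10,368
print(f"Peak performance: {nodes * gflops_per_node / 1000:.0f} Tflop/s")    # ~99, quoted as 100
print(f"Total memory:     {nodes * mem_per_node_gb / 1000:.1f} TB")         # 20.7
print(f"Memory bandwidth: {nodes * mem_bw_per_node_gbs / 1000:.1f} TB/s")   # 55.4
print(f"Total flash:      {nodes * flash_per_node_gb / 1000:.0f} TB")       # ~39
```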

14 SAN DIEGO SUPERCOMPUTER CENTER
SDSC Clusters: Triton Shared Compute Cluster (UCSD system)
Condos: leveraging our facility to support campus users

15 SAN DIEGO SUPERCOMPUTER CENTER
Competitive Edge
Source: http://ovpitnews.iu.edu/news/page/normal/23265.html

16 SAN DIEGO SUPERCOMPUTER CENTER
Condo Model (for those who can't afford Crays)
- Central facility provides space, power, network, and management
- Researchers purchase compute nodes on grants
- Some campus subsidy for operation
- Small selection of node types ("Model T" or "Model A")
- Benefits: sustainability, efficiency from scale, harvesting of idle cycles (see the sketch below)
- Adopters: Purdue, LBNL, UCLA, Clemson, ...
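
As a rough illustration of the "harvest idle cycles" idea, here is a toy Python sketch of the policy: owner groups have first claim on the nodes they purchased, while guest jobs may run on idle owner nodes and are preempted when the owner returns. The classes and function are hypothetical; real condo clusters express this as batch-scheduler policy (for example, preemptible guest queues), not application code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    group: str
    is_guest: bool   # True if the job is borrowing another group's node

def can_start(job: Job, node_owner: str, running: Optional[Job]) -> bool:
    if running is None:
        return True                                  # idle node: anyone may use it
    if job.group == node_owner and running.is_guest:
        return True                                  # owner preempts a guest job
    return False                                     # otherwise the job waits

# An owner job reclaims its group's node from a guest harvesting idle cycles.
print(can_start(Job("astro", is_guest=False), node_owner="astro",
                running=Job("chem", is_guest=True)))   # True
```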

17 SAN DIEGO SUPERCOMPUTER CENTER
Triton Shared Compute Cluster
  Performance: TBD
  Compute nodes: ~100; dual-socket Intel Xeon E5, 2.6 GHz; 16 cores/node; 64 GB RAM
  GPU nodes: ~4; dual-socket Intel Xeon E5, 2.3 GHz; 16 cores/node; 32 GB RAM; 4 NVIDIA GeForce GTX 680
  Interconnects: 10GbE; QDR fat-tree islands (optional)

18 SAN DIEGO SUPERCOMPUTER CENTER
Conclusion: Supporting Evidence
Source: http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503148

