Presentation on theme: "CSG Research Computing Jim Pepin USC CTO/Director HPCC."— Presentation transcript:
CSG Research Computing Jim Pepin USC CTO/Director HPCC
HPCC: Provide common facilities and services for a large cross section of the university that requires leading-edge computational and networking resources. Leverage USC central resources with externally funded projects.
Overview: Sponsored by ISD (Information Services Division of USC) and ISI (Information Sciences Institute). User community: ISI, LAS, Engineering, School of Medicine, IMSC, ICT, others.
Current Resources: High Performance Computing Resources. Linux cluster (~1000 nodes/2000 CPUs, 2Gb/sec Myrinet): 20TB shared disk, 18GB-40GB local disk per node. Ranks in top 10 for academic clusters. Myrinet switch is 768 nodes. Adding nodes funded by USC research groups. Sun core servers (E15k shared memory): 72 processors, 288GB memory, 30TB shared disk. Mass storage facilities (Unitree): 18,000-tape capacity.
Funding Sources: ISD (University) resources: 1.5M M/S and equipment budget (software/maintenance 0.4M, generic capital 1.0M, other 0.1M); 3 FTEs direct support; 2 FTEs system staff offset. Los Nettos/LAAP 2.0M. Condo arrangements: 50k-250k one-off capital purchases.
Cluster Power Usage Math: 42 nodes/cabinet at 200 watts/node gives 8.4kW/cabinet. 1000 nodes need 24 cabinets. 1 control cabinet per 8 cabinets of compute servers; 8 control cabinets. 32 cabinets per 1000 nodes. 268kW per 1000 nodes. 100 tons of A/C per 1000 nodes. Roughly 400kW total power use for 1000 nodes. 1500-2000 sq feet of space.
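The arithmetic on this slide can be checked with a short sketch. It takes the slide's figures as given (42 nodes per cabinet, 200 watts per node, 8 control cabinets) and assumes, as the 268kW figure implies, that a control cabinet draws as much as a full compute cabinet. The conversion of about 3.517kW per ton of cooling is the standard definition of a refrigeration ton (12,000 BTU/h), not a number from the slide, and the function and variable names are illustrative only.

```python
import math

# Figures taken from the slide.
NODES_PER_CABINET = 42
WATTS_PER_NODE = 200
CONTROL_CABINETS = 8

# Assumption (not from the slide): 1 ton of refrigeration = 12,000 BTU/h,
# which is about 3.517 kW of heat removed.
KW_PER_TON = 3.517


def cluster_power(nodes):
    """Estimate cabinet count, equipment load, and cooling for a cluster."""
    compute_cabinets = math.ceil(nodes / NODES_PER_CABINET)
    total_cabinets = compute_cabinets + CONTROL_CABINETS
    watts_per_cabinet = NODES_PER_CABINET * WATTS_PER_NODE
    # Assumption: every cabinet, control or compute, draws a full cabinet load.
    equipment_kw = total_cabinets * watts_per_cabinet / 1000
    return {
        "compute_cabinets": compute_cabinets,
        "total_cabinets": total_cabinets,
        "kw_per_cabinet": watts_per_cabinet / 1000,
        "equipment_kw": equipment_kw,
        "cooling_tons": equipment_kw / KW_PER_TON,
    }


if __name__ == "__main__":
    for name, value in cluster_power(1000).items():
        print(f"{name}: {value:.1f}")
```

For 1000 nodes this gives 24 compute cabinets, 32 cabinets in all, and 268.8kW of equipment load, which works out to about 76 tons of cooling. The slide's 100 tons and roughly 400kW total presumably include headroom and the power drawn by the cooling plant itself.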
Current Software: Cluster software from IBM (xCAT) is the core of the facility. Stable production environment. MPI is the basic message-passing layer. Globus/NMI work is proceeding with Carl's help in funding plus ISD resources; it leverages the campus need for a global directory (more later). Solaris and Unitree are core for mass storage support. We need to look at other mass storage opportunities. Issues: We need to be able to support faculty/researchers with tools and consulting to help them effectively use large-scale resources. Many packages exist on HPCC resources, with no local support to help use them.
Middleware: Globus as base with NMI architecture for campus. GT2 moving to GT3 SCEC/ISI. Condor as lightweight job manager in user rooms. PBS/Maui on cluster and computation side of E15k. Issues: Kx509 bridge from Kerberos. USC PKI lite CA is base: only hosts and services, NMI based. Pubcookie (Kerberos back-end) uses host certs from PKI lite CA. Shib for some prototype library apps (scholars portal). Campus GDS/PR using NMI schemas (eduPerson etc.).
HPCC Governance: HPCC faculty advisory group meets 4-5 times a year and provides guidance to DCIO and CTO. Final decisions are in ISD (CIO/DCIO). Usual mode is agreement. Time allocation: no recharge; large projects reviewed by faculty allocation group; some projects over 500k node hours; condo users get dedicated nodes and cost sharing. Research leverage: Condo Cost sharing, external funding, grid construction, next generation network.
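To give a sense of scale for an allocation of 500k node hours, the sketch below converts node hours into wall-clock time on the whole machine and into a share of a year's capacity. It assumes the roughly 1000-node cluster described under Current Resources and perfect utilization; the function names are illustrative.

```python
HOURS_PER_DAY = 24


def full_cluster_days(node_hours, nodes):
    """Days of wall-clock time if every node ran the project nonstop."""
    return node_hours / nodes / HOURS_PER_DAY


def share_of_capacity(node_hours, nodes, days=365):
    """Fraction of the cluster's total node hours over the given period."""
    return node_hours / (nodes * HOURS_PER_DAY * days)


if __name__ == "__main__":
    # Assumption: ~1000 nodes, as listed under Current Resources.
    print(f"{full_cluster_days(500_000, 1000):.1f} days on the full cluster")
    print(f"{share_of_capacity(500_000, 1000):.1%} of one year's capacity")
```

Under these assumptions a 500k node hour project is equivalent to about three weeks of the entire cluster, or a little under 6% of a year's capacity.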
CTO/HPCC Projects: Advanced Networking Projects. Calren-2: 2xGb service today; 10Gb service in next 2 years. Fiber/wavelength services (CENIC/National Lambda Rail): online for west coast. Look at L2 possibilities to build shared spaces. Look to leverage for projects like Optiputer ITR. 1 Wilshire colo facilities: see if we can use that space to facilitate ETF proposal. Optiputer ITR as a way to help network expansion.
CTO/HPCC Projects: Leverage HPCC efforts at ISI with ISD resources. Clusters: expand cluster to ~2000 nodes centrally owned; expand cluster for other groups (condo model). Mass storage: look into large-scale storage for groups like the VHF project and other high-end storage needs (fractional petabytes). Globus/NMI: provide campus leadership for global directory services and identity management (authentication and authorization). Networking research.
CTO/HPCC Projects: Fiber is a major part of the HPCC's ability to service large-scale computational needs. The following slides show what we have today and how it can be used.
Fiber Facilities: Lease dark fiber. Started with dark fiber 3 years ago; pioneer in this area. DWP (Department of Water and Power). USC franchise area fiber for campus access. Leverage new players (NLR/CENIC). Use for USC, LAAP and Los Nettos projects. Built out today using low-cost CWDM and 15540s. 10Gbps Ethernet backbone in place Fall 02. Built out fiber to Caltech/JPL/VHF (Shoah) and other Los Nettos sites.
Fiber Facilities: Lease more dark fiber. Harvey Mudd. Build second path to USC for disaster recovery. Install DWDM gear from CENIC deal with Cisco. 1Gb wavelengths in first phase (fall 04); 10Gb wavelengths in summer 04. Use to enable projects like Optiputer and ETF. Experiment with optical switching hardware as fiber patch panel for development of shared computer centers.
Original USC Fiber Backbone (diagram). Sites shown: Downtown Clinic, UPC, HSC, ISI, ICT, 1 Wilshire. Legend: original 4-strand SM DWP fiber; external fiber plant.
Colo Facilities: Acquired space in 1 Wilshire (original site) 3 years ago. DWP fiber is core. Use to connect to exchanges and other ISPs. Extend to potentially other 1 Wilshire buildings. Use new campus Level 3 fiber as means. House routers and L2 equipment. Provide space on USC campus for partners. Enables Pacific Wave Exchange Point.
Experimental Networking: Networking research community; California Institutes for Science and Innovation (CITRIS, CalIT 2, Nano Systems, BioMedical); San Diego Super Computer Center; CACR; ISI; Teragrid/Distributed Terascale Facility; UCSB/Dan Blumenthal optical labs.
Future Resource Goals: High Performance Computing Resources. Linux cluster (2048 nodes/4096 CPUs, 2Gb/sec Myrinet): 60TB shared disk, 36GB-72GB local disk per node. Rank in top 5 for academic clusters. Start 64-bit nodes in summer 04. Switch fabric will expand past 1024 nodes with ability to condo other users. Plan to add more nodes funded by USC research groups (condo); goal would be 3000+ nodes total. Sun core servers (E15k shared memory): 72 processors, 288GB memory, 300TB disk. Use this system for high-end data users (large-scale databases) and video users. Mass storage facilities (Unitree today): 18,000-tape capacity; PB online as goal in 3 years.
3 Year Strategy: Next step after 32-bit Pentium. Need to determine what will replace Xeons. One answer is Opteron or IA64, but we need to start to develop clusters in this space and benchmark them. Much of the code will need reworking at user level. Find ways to cost share with local cluster purchasers. Condo housing of medium to large clusters will be important. Build Grid-U.
3 Year Strategy: As clusters expand into the 2-4k node space, power and A/C become significant issues (along with floor space). We need to develop several major partners to allow HPCC to be the central piece of joint proposals from USC for such initiatives as ETF and future cyberinfrastructure proposals. An example is a shared submission for a Major Research Instrument grant.
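A rough sense of why power, A/C, and floor space become significant at 2-4k nodes comes from scaling the per-1000-node figures on the Cluster Power Usage Math slide. The sketch below assumes those figures scale linearly with node count, which is a simplification (it ignores economies of scale, denser packaging, and changes in per-node power); the names are illustrative.

```python
# Per-1000-node figures quoted on the "Cluster Power Usage Math" slide.
PER_1000_NODES = {
    "equipment_kw": 268,
    "total_kw": 400,
    "cooling_tons": 100,
    "floor_sqft_low": 1500,
    "floor_sqft_high": 2000,
}


def scale_facility(nodes):
    """Linearly scale the per-1000-node facility figures (an assumption)."""
    factor = nodes / 1000
    return {name: value * factor for name, value in PER_1000_NODES.items()}


if __name__ == "__main__":
    for nodes in (2000, 3000, 4000):
        need = scale_facility(nodes)
        print(
            f"{nodes} nodes: ~{need['total_kw']:.0f} kW total, "
            f"~{need['cooling_tons']:.0f} tons A/C, "
            f"{need['floor_sqft_low']:.0f}-{need['floor_sqft_high']:.0f} sq ft"
        )
```

On this linear estimate a 4000-node cluster would need on the order of 1600kW of total power, 400 tons of A/C, and 6000-8000 sq feet of floor space.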
3 Year Strategy: Networking Futures. Expand Exchange Point (R/E, Pacific Wave). 10Gb at all sites. Layer 1 facilities (Optiputer-type connections). Re-design/RFP for campus network this month. Design network with enclaves for research or academic support. Much higher internal bandwidth (10Gb core-core, at least 1Gb to all buildings, 10Gb to major research centers). How to provide comprehensive security without unacceptable friction.