1 MURI Hardware Resources
Ray Garcia, Erik Olson
Space Science and Engineering Center, University of Wisconsin-Madison

2 Resources for Researchers
- CPU cycles
- Memory
- Storage space
- Network
- Software: compilers, models, visualization programs

3 Original MURI Hardware
- 16 Pentium III processors
- Storage server with 0.5 TB
- Gigabit networking
Purpose: provide a working environment for collaborative development, enable running of the large multiprocessor MM5 model, and gain experience working with clustered systems.

4 Capabilities and Limitations
Successfully supported initial MM5 model runs, algorithm development (fast model), and modeling of GIFTS optics (FTS simulator). MM5 runs covered 140 by 140 grid point domains; one 270 by 270 run completed with very limited time steps. OpenPBS scheduled hundreds of jobs, with idle CPU time given to FDTD ray tracing (a job-submission sketch follows this slide). The cluster was expanded to 28 processors using funding from B. Baum, the IPO, and others. However, MM5 runtime limited the domain size, and storage space limited the number of output time steps.
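The presentation does not show the actual job scripts, but as a rough illustration of how an OpenPBS queue can be fed hundreds of independent jobs, here is a minimal Python sketch that submits one job per work item with qsub. The queue name, resource request, and postprocess.sh script are hypothetical, not the MURI configuration.

```python
#!/usr/bin/env python
# Hypothetical sketch: submit a batch of independent OpenPBS jobs,
# one per MM5 output time step to post-process. Queue name, resource
# request, and script path are assumptions, not the actual MURI setup.
import subprocess

QUEUE = "workq"              # assumed default OpenPBS queue name
SCRIPT = "./postprocess.sh"  # hypothetical per-step processing script

def submit(step):
    """Build a tiny PBS job script on stdin and hand it to qsub."""
    job = "\n".join([
        "#PBS -N mm5-post-%03d" % step,
        "#PBS -q %s" % QUEUE,
        "#PBS -l nodes=1:ppn=1,walltime=01:00:00",
        "cd $PBS_O_WORKDIR",
        "%s %d" % (SCRIPT, step),
    ])
    # qsub reads the job script from stdin and prints the new job id
    out = subprocess.run(["qsub"], input=job, capture_output=True,
                         text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    ids = [submit(step) for step in range(100)]
    print("submitted %d jobs" % len(ids))
```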

5 CY2003 Upgrade
NASA provided funding for 11 dual-Pentium 4 nodes (2.4 GHz CPUs, 4 GB DDR RAM), expressly purposed for running large IHOP field program simulations (400 by 400 grid point domain).

6 Cluster “Mark 2”
Gains:
- Larger-scale model runs and instrument simulations as needed for IHOP
- Terabytes of experimental and simulation data online through NAS-hosted RAID arrays
Limitations to further work at even larger scale:
- Interconnect limitations slowed large model runs
- 32-bit memory limit on huge model set-up jobs for MM5 and WRF
- Increasing number of small storage arrays

7 3 Years of Cluster Work
Inexpensive: adding CPUs to the system.
Costly: adding users to the system; adding storage to the system.
Easily understood: Matlab.
Not so well understood: distributed system (computing, storage) capabilities.

8 Along Comes DURIP
The H. L. Huang / R. Garcia DURIP proposal was awarded in May 2004. Purpose: provide hardware for next-generation research and education programs. Scope: identify computing and storage systems that serve the need to expand simulation, algorithm research, data assimilation, and limited operational product generation experiments.

9 Selecting Computing Hardware
Cluster options for numerical modeling were evaluated and found to require a significant time investment. An SGI Altix was purchased in fall 2004 after extensive test runs with WRF and MM5:
- 24 Itanium 2 processors running Linux
- 192 GB of RAM
- 5 TB of FC/SATA disk
Recently upgraded to 32 CPUs and 10 TB of storage.

10 SGI Altix Capabilities
- Large, contiguous RAM allows a 1600 by 1600 grid point domain (greater than the CONUS area at 4 km resolution); the largest run so far is 1070 by 1070.
- NUMAlink interconnect provides fast turnaround for model runs.
- Presents itself as a single 32-CPU Linux machine.
- Intel compilers ease porting and optimizing Fortran/C on 32-bit and 64-bit hardware.

11 Storage Class: Home Directory
- Small size, for source code (preferably also held under CVS control) and critical documents
- Nightly incremental backups
- Quota enforcement (a usage-report sketch follows this slide)
Current implementation: local disks on the cluster head node, backed up by Technical Computing.
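As a hedged illustration of the quota-enforcement idea, this sketch walks each directory under an assumed /home tree and flags users over an assumed 500 MB soft limit. The actual cluster would more likely rely on filesystem-level quotas rather than a script like this.

```python
#!/usr/bin/env python
# Hypothetical sketch: report home directory usage against a soft quota.
# The /home layout and the 500 MB limit are assumptions for illustration.
import os

HOME_ROOT = "/home"          # assumed location of user home directories
SOFT_LIMIT = 500 * 1024**2   # assumed 500 MB soft quota per user

def usage_bytes(path):
    """Sum regular-file sizes under path, skipping anything unreadable."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

if __name__ == "__main__":
    for user in sorted(os.listdir(HOME_ROOT)):
        home = os.path.join(HOME_ROOT, user)
        if not os.path.isdir(home):
            continue
        used = usage_bytes(home)
        flag = "OVER" if used > SOFT_LIMIT else "ok"
        print("%-12s %8.1f MB  %s" % (user, used / 1024.0**2, flag))
```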

12 Storage Class: Workspace
- Optimized for speed
- Automatic flushing of unused files (a flushing sketch follows this slide)
- No insurance against disk failure; users are expected to move important results to long-term storage
Current implementation: RAID5 or RAID0 drive arrays within the cluster systems.
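A minimal sketch of the automatic-flushing policy, assuming a /workspace root and a 30-day access-time window (both are assumptions, not values stated in the presentation):

```python
#!/usr/bin/env python
# Hypothetical sketch: flush workspace files not accessed in MAX_AGE_DAYS.
# The /workspace path and the 30-day window are assumptions; the real
# flushing policy on the MURI cluster is not detailed here.
import os
import time

WORKSPACE = "/workspace"
MAX_AGE_DAYS = 30
CUTOFF = time.time() - MAX_AGE_DAYS * 86400

def flush(root, dry_run=True):
    """Remove regular files whose last access time is older than CUTOFF."""
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.isfile(full) and os.stat(full).st_atime < CUTOFF:
                    if not dry_run:
                        os.remove(full)
                    removed += 1
            except OSError:
                pass  # file vanished or unreadable; skip it
    return removed

if __name__ == "__main__":
    print("would remove %d files" % flush(WORKSPACE, dry_run=True))
```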

13 Storage Class: Long-term
- Large amount of space
- Redundant, preferably backed up to tape
- Managed directory system, preferably with metadata
Current implementation: many project-owned NAS devices with partial redundancy (RAID5), NFS "spaghetti" (a mount-inventory sketch follows this slide), and ad hoc tape backup.
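To make the "NFS spaghetti" concrete, here is a small sketch that inventories the NFS mounts visible on one Linux host by reading /proc/mounts and grouping them by server. No MURI-specific servers or paths are assumed; output depends entirely on the host it runs on.

```python
#!/usr/bin/env python
# Hypothetical sketch: inventory NFS mounts on one host to gauge how many
# separate servers and exports a user depends on ("NFS spaghetti").
from collections import defaultdict

def nfs_mounts(mounts_file="/proc/mounts"):
    """Map NFS server name -> list of (export, mountpoint) pairs."""
    servers = defaultdict(list)
    with open(mounts_file) as f:
        for line in f:
            device, mountpoint, fstype = line.split()[:3]
            if fstype.startswith("nfs") and ":" in device:
                server, export = device.split(":", 1)
                servers[server].append((export, mountpoint))
    return servers

if __name__ == "__main__":
    for server, mounts in sorted(nfs_mounts().items()):
        print("%s (%d mounts)" % (server, len(mounts)))
        for export, mountpoint in mounts:
            print("  %s -> %s" % (export, mountpoint))
```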

14 DURIP Phase 2: Storage
Long-term storage scaling and management goals:
- Reduce or eliminate NFS "spaghetti"
- Include a hardware phase-in / phase-out strategy in the purchase decision
- Acquire the hardware to seed a Storage Area Network (SAN) in the Data Center, improving uniformity and scalability
- Reduce overhead costs (principally human time)
- Work closely with the Technical Computing group on system setup and operations for a long-term facility

15 Immediate Options
- Red Hat GFS: size limitations and hardware/software mix-and-match; support costs offset the free source code.
- HP Lustre: more likely a candidate for workspace; expensive.
- SDSC SRB (Storage Resource Broker): stability, documentation, and maturity at time of testing found to be inadequate.
- Apple Xsan: plays well with third-party storage hardware; straightforward to configure and maintain; affordable.

16 Dataset Storage Purchase Plan
- 64-bit storage servers and a metadata server
- QLogic Fibre Channel switch to move data between hosts and drive arrays
- SAN software to provide a distributed filesystem: focusing on Apple Xsan for a 1-3 year span, followed by a 1-year assessment with the option of re-competing
- Storage arrays: competing Apple XRAID and Western Scientific Tornado

17 Target System for 2006
- Scalable dataset storage accessible from clusters, workstations, and the supercomputer
- Backup strategy
- Update existing cluster nodes to ROCKS: simplifies management, improves uniformity, and is proven on other clusters deployed by SSEC
- Retire or repurpose slower cluster nodes
- Reduce bottlenecks to workspace disk
- Improve ease of use and understanding

18 Long-term Goals
- 64-bit shared-memory system scaled to huge job requirements (Altix)
- Complementary compute farm migrating to x86-64 (Opteron) hardware
- Improved workspace performance
- Scalable storage with full metadata for long-term and published datasets
- Software development tools for multiprocessor algorithm development

