"Towards Petascale Grids as a Foundation of E-Science" Satoshi Matsuoka Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics.

1 "Towards Petascale Grids as a Foundation of E-Science" Satoshi Matsuoka Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics Oct. 1, 2007 EGEE07 Presentation @ Budapest, Hungary

2 Vision of Grid Infrastructure in the Past…
- A bunch of networked PCs virtualized into a supercomputer, OR
- Very divergent and distributed supercomputers, storage, etc., tied together and "virtualized"
The "dream" was for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications.

3 But this was not meant to be. (Slide shows illustrations: Don Quixote tilting at windmills; a dog barking up the wrong tree.)

4 TSUBAME: the first 100 TeraFlops Supercomputer for Grids, 2006-2010
- Sun Galaxy 4 (Opteron dual-core, 8-socket): 10,480 cores / 655 nodes, 32-128 GB per node, 21.4 TB memory, 50.4 TFlops; OS: Linux (SuSE 9, 10); NAREGI Grid MW
- ClearSpeed CSX600 SIMD accelerator: 360 (later 648) boards, 35 (later 52.2) TFlops
- Storage: 1.0 PB (Sun "Thumper") + 0.1 PB (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth (now 1.5 PB, 60 GB/s)
- Unified Infiniband network: Voltaire ISR9288, 10 Gbps x2, ~1310+50 ports, ~13.5 Tbit/s (3 Tbit/s bisection), plus 10 Gbps external networking
- NEC SX-8i (for porting); Sun Blade integer-workload accelerator (90 nodes, 720 CPUs)
- "Fastest Supercomputer in Asia": 29th on the Top500 at 48.88 TF; now 103 TFlops peak as of Oct. 31st!

5 TSUBAME Job Statistics, Dec. 2006 - Aug. 2007 (#Jobs)
- 797,886 jobs (~3,270 daily)
- 597,438 serial jobs (74.8%)
- 121,108 jobs with <=8 processors (15.2%)
- 129,398 ISV application jobs (16.2%)
- However, jobs with >32 processors account for 2/3 of cumulative CPU usage
- Coexistence of ease-of-use for both short-duration parameter surveys and large-scale MPI fits the TSUBAME design well

6 In the Supercomputing Landscape, Petaflops Class Is Already Here… in Early 2008
- 2008Q1 TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops; ~100 racks, ~300 m2 floorspace; 2.4 MW power; 1.4 km of IB cx4 copper cabling; 2 PB HDD
- 2008 LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops; ~72 racks, ~400 m2 floorspace; ~3 MW power; copper cabling
- Planned for 2011-2012 in the US, Japan, (EU), (other APAC): >10 Petaflops, >1 million cores, 10s of Petabytes
- Other Petaflops machines 2008/2009: LANL/IBM "Roadrunner", JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Jülich…), …

7 In Fact We Can Build One Now (!)
- @Tokyo --- one of the largest IDCs in the world (in Tokyo...)
- Can fit a 10 PF machine here easily (> 20 Rangers)
- On top of a 55 kV / 6 GW substation
- 150 m diameter (a small baseball stadium)
- 140,000 m2 of IDC floorspace
- 70 + 70 MW power
- The size of all of Google(?) (~a million low-power nodes)
- A source of "Cloud" infrastructure

8 Gilder’s Law – will make thin-client access to servers essentially "free" (Scientific American, January 2001)
Chart: performance per dollar spent vs. number of years (0-5):
- Optical fiber (bits per second): doubling time 9 months
- Data storage (bits per square inch): doubling time 12 months
- Silicon computer chips (number of transistors): doubling time 18 months
(Original slide courtesy Phil Papadopoulos @ SDSC)
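The doubling times on the chart translate directly into growth rates that can be compared. A small illustrative calculation (the formula and the `multiplier` function are my own framing, not from the slide):

```python
# Growth multiplier implied by a doubling time (assumed model):
# after `months` months, performance per dollar grows by 2 ** (months / doubling).
def multiplier(months, doubling_months):
    return 2 ** (months / doubling_months)

# Over 3 years, fiber (9-month doubling) gains 2^4 = 16x while silicon
# (18-month doubling) gains 2^2 = 4x, so fiber pulls ahead by a factor of 4.
print(multiplier(36, 9) / multiplier(36, 18))  # -> 4.0
```

This widening gap is the basis of the argument that bandwidth, not local computing, becomes the cheap resource.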

9 DOE SC Applications Overview (following slides courtesy John Shalf @ LBL NERSC)

NAME     Discipline        Problem/Method      Structure
SuperLU  Multi-Discipline  LU Factorization    Sparse Matrix
MADCAP   Cosmology         CMB Analysis        Dense Matrix
PMEMD    Life Sciences     Molecular Dynamics  Particle
PARATEC  Material Science  DFT                 Fourier/Grid
GTC      Magnetic Fusion   Vlasov-Poisson      Particle in Cell
LBMHD    Plasma Physics    MHD                 2D/3D Lattice
CACTUS   Astrophysics      General Relativity  3D Grid
FVCAM    Climate Modeling  AGCM                3D Grid

10 Latency Bound vs. Bandwidth Bound? How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
- N_1/2 from the early days of vector computing
- Bandwidth-delay product in TCP

System           Technology       MPI Latency  Peak Bandwidth  Bandwidth-Delay Product
Cray XD1         RapidArray/IB4x  1.7 us       2 GB/s          3.4 KB
Myrinet Cluster  Myrinet 2000     5.7 us       500 MB/s        2.8 KB
NEC ES           NEC Custom       5.6 us       1.5 GB/s        8.4 KB
Cray X1          Cray Custom      7.3 us       6.3 GB/s        46 KB
SGI Altix        Numalink-4       1.1 us       1.9 GB/s        2 KB

Bandwidth bound if message size > bandwidth * delay; latency bound if message size < bandwidth * delay
- Except if pipelined (unlikely with MPI due to overhead)
- Cannot pipeline MPI collectives (but can in Titanium)
(Original slide courtesy John Shalf @ LBL)
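The table's last column can be reproduced directly from the latency and bandwidth figures. A small sanity-check script (the system figures come from the slide; the function names are mine):

```python
# Bandwidth-delay product: the number of bytes "in flight" on a saturated link.
def bdp_bytes(bandwidth_bytes_per_s, latency_s):
    return bandwidth_bytes_per_s * latency_s

# A transfer is bandwidth bound once the message exceeds the BDP,
# and latency bound below it.
def classify(msg_bytes, bandwidth_bytes_per_s, latency_s):
    if msg_bytes > bdp_bytes(bandwidth_bytes_per_s, latency_s):
        return "bandwidth bound"
    return "latency bound"

# Figures from the table above
systems = {
    "Cray XD1":  (2.0e9, 1.7e-6),   # 2 GB/s, 1.7 us -> ~3.4 KB BDP
    "Cray X1":   (6.3e9, 7.3e-6),   # 6.3 GB/s, 7.3 us -> ~46 KB BDP
    "SGI Altix": (1.9e9, 1.1e-6),   # 1.9 GB/s, 1.1 us -> ~2 KB BDP
}

for name, (bw, lat) in systems.items():
    print(f"{name}: BDP = {bdp_bytes(bw, lat) / 1000:.1f} KB, "
          f"1 MB message is {classify(1 << 20, bw, lat)}")
```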

11 Message Size Distribution (MADBench-P2P): 60% of messages are > 1 MB => bandwidth dominant; could be executed on a WAN. (Original slide courtesy John Shalf @ LBL)

12 Message Size Distributions (SuperLU-PTP): > 95% of messages are < 1 KByte => low latency, tightly coupled LAN required. (Original slide courtesy John Shalf @ LBL)

13 Collective Buffer Sizes - demise of metacomputing - 95% latency bound! => For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps. (Original slide courtesy John Shalf @ LBL)

14 So What Does This Tell Us?
- A "grid" programming model for parallelizing a single app is not worthwhile: either it is a simple parameter sweep / workflow, or it will not work. We will have enough problems programming a single system with millions of threads (e.g., Jack's keynote).
- Grid programming belongs at the "diplomacy" level: we must look at multiple applications and how they compete and coordinate.
- The application execution environment should be virtualized, with the grid transparent to applications: zillions of apps in the overall infrastructure competing for resources, and hundreds to thousands of application components that coordinate (workflows, coupled multi-physics interactions, etc.).
- NAREGI focuses on these scenarios.
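The one single-app pattern that does survive on loosely coupled grids, the parameter sweep, fits in a few lines. An illustrative Python sketch (not NAREGI code; `simulate` is a stand-in for a real independent job):

```python
# Embarrassingly parallel parameter sweep: tasks share no data and never
# communicate, so inter-resource latency does not matter.
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    # Stand-in for an independent simulation run
    return param, param ** 2

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(simulate, range(8)))
print(results)
```

Each task is submitted independently; in a grid setting the executor would be replaced by remote job submission, but the structure is the same.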

15 Use Case in NAREGI: RISM-FMO Coupled Simulation
- RISM (solvent distribution, suitable for SMP) and FMO (electronic structure, suitable for clusters) are coupled via a Mediator over GridMPI.
- The solvent charge distribution is transformed from regular to irregular meshes; Mulliken charges are transferred as the partial charges of the solute molecules.
- The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules.
*The original RISM and FMO codes were developed by the Institute for Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
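The self-consistent exchange described above is a fixed-point iteration between two solvers. A toy sketch of the control flow (made-up stand-in solvers, not the actual RISM or FMO codes):

```python
# Two coupled solvers exchange data through a mediator until the
# solution stops changing (self-consistency).
def rism_step(partial_charges):      # stand-in for the solvent solver
    return [0.5 * q for q in partial_charges]

def fmo_step(solvent_charge):        # stand-in for the electronic-structure solver
    return [1.0 + 0.1 * c for c in solvent_charge]

charges = [1.0, 1.0]
for _ in range(100):
    solvent = rism_step(charges)     # mediator: regular -> irregular mesh
    new_charges = fmo_step(solvent)  # mediator: Mulliken charges back
    if max(abs(a - b) for a, b in zip(new_charges, charges)) < 1e-10:
        break                        # converged: self-consistent solution
    charges = new_charges
```

In the real coupled simulation, the mediator additionally handles the mesh transformation and data-format conversion at each exchange.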

16 Application Sharing in Research Communities: Registration and Deployment of Applications (PSE Server + ACS (Application Contents Service))
① The application developer registers the application (summary, program source files, input files, resource requirements, etc.)
② Select a compiling host, using resource info from the Information Service
③ Compile (e.g., Server #1: compiling OK but test run NG; Servers #2 and #3: test run OK)
④ Send back the compiled application environment
⑤ Select a deployment host
⑥ Deploy
⑦ Register the deployment info in the Information Service

17 Description of Workflow and Job Submission Requirements (architecture diagram)
- Web server (Apache) with a Workflow Servlet (Tomcat) and a JSDL applet for editing workflows (data icons, program icons, e.g., Appli-A / Appli-B), accessed over http(s)
- Workflow description in NAREGI-WFML, translated to BPEL + JSDL (BPEL → JSDL-A) and passed to the Super Scheduler via the NAREGI JM I/F module
- PSE supplies application information; the Information Service supplies global file information (e.g., /gfarm/..); DataGrid integration
- Stdout/stderr retrieved via GridFTP

18 Reservation-Based Co-Allocation
- The client submits a workflow with abstract JSDL; the Super Scheduler queries the Distributed Information Service (DAI) for resources and generates concrete JSDL for each target
- Reservation-based co-allocation: reservation, submission, query, control, etc., against GridVM-managed computing resources
- Accounting via CIM and UR/RUS; GridVM publishes resource info
- Enables co-allocation across heterogeneous architectures and applications
- Used for advanced science applications, huge MPI jobs, real-time visualization on the grid, etc.
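At its core, reservation-based co-allocation is an interval-intersection problem: find the earliest window in which every cluster can reserve the job simultaneously. A toy sketch of that core (my own simplification, not the Super Scheduler's actual algorithm):

```python
# Earliest common reservation window across several clusters.
def free_slots(busy, horizon):
    """Turn a sorted list of (start, end) busy intervals into free intervals."""
    slots, t = [], 0
    for s, e in busy:
        if s > t:
            slots.append((t, s))
        t = max(t, e)
    if t < horizon:
        slots.append((t, horizon))
    return slots

def co_allocate(busy_per_cluster, duration, horizon=1000):
    """Earliest time t where [t, t+duration) is free on every cluster."""
    for t in range(horizon):
        if all(any(s <= t and t + duration <= e
                   for s, e in free_slots(busy, horizon))
               for busy in busy_per_cluster):
            return t
    return None

clusters = [[(0, 3), (8, 10)],   # cluster A busy intervals
            [(2, 6)]]            # cluster B busy intervals
print(co_allocate(clusters, duration=2))  # -> 6
```

A real co-allocator must also negotiate the reservation with each local scheduler and roll back if any site declines; the sketch only covers the slot search.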

19 Communication Libraries and Tools
1. Modules
- GridMPI: MPI-1/2-compliant, grid-ready MPI library
- GridRPC: OGF GridRPC-compliant GridRPC library
- Mediator: communication tool for heterogeneous applications
- SBC: storage-based communication tool
2. Features
- GridMPI: MPI for a collection of geographically distributed resources; high performance, optimized for high-bandwidth networks
- GridRPC: task-parallel, simple, seamless programming
- Mediator: communication library for heterogeneous applications; data format conversion
- SBC: storage-based communication for heterogeneous applications
3. Supported standards: MPI-1 and MPI-2, OGF GridRPC

20 Grid-Ready Programming Libraries: standards-compliant GridMPI and GridRPC
- GridMPI: data-parallel, MPI compatibility (from 100-500 CPUs up to 100,000 CPUs)
- GridRPC (Ninf-G2): task-parallel RPC, simple and seamless programming
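The task-parallel GridRPC style amounts to: fire off asynchronous calls, get handles, then wait on them. An illustrative sketch in Python (mimicking the pattern only; `remote_call` and `heavy_kernel` are my stand-ins, not the Ninf-G2 API):

```python
# GridRPC-style task parallelism: asynchronous calls return handles that
# are later joined, analogous to grpc_call_async / grpc_wait in the
# GridRPC specification.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def remote_call(func, *args):
    """Stand-in for an asynchronous RPC: returns a handle (a future)."""
    return pool.submit(func, *args)

def heavy_kernel(x):
    # Pretend this runs on a remote compute server
    return sum(i * x for i in range(1000))

handles = [remote_call(heavy_kernel, x) for x in range(4)]  # launch calls
results = [h.result() for h in handles]                     # wait on handles
pool.shutdown()
print(results)
```

The simplicity is the point: the client never manages message passing, only call handles, which is why the slide calls the model "simple and seamless".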

21 Communication Tools for Co-Allocation Jobs
- Mediator: sits between Application-1 and Application-2, performing data format conversion in both directions over GridMPI
- SBC (Storage-Based Communication): connects Application-2 and Application-3 through the SBC library, using the SBC protocol

22 Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI)
- Cluster A (fast CPUs, slow network) and Cluster B (high bandwidth, large memory); App A (high bandwidth) competes with App B (CPU-bound)
- Each MPI process runs in its own VM on the cluster hosts; MPI communication logs feed a resource manager that is aware of individual application characteristics
- The resource manager redistributes work through VM/job migration and power optimization
