"Towards Petascale Grids as a Foundation of E-Science" Satoshi Matsuoka Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics.


"Towards Petascale Grids as a Foundation of E-Science" Satoshi Matsuoka Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics Oct. 1, 2007 EGEE07 Budapest, Hungary

Vision of Grid Infrastructure in the Past…
- A bunch of networked PCs virtualized to be a supercomputer, OR
- Very divergent and distributed supercomputers, storage, etc. tied together and "virtualized"
- The "dream" is for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications

But this is not meant to be…
[Illustration: Don Quixote / a dog barking up the wrong tree]

TSUBAME: the first 100 TeraFlops Supercomputer for Grids
- Compute: Sun Galaxy 4 (Opteron dual-core, 8-socket), 10,480 cores / 655 nodes, 21.4 TB memory, 50.4 TFlops; OS: Linux (SuSE 9, 10); NAREGI Grid MW; Sun Blade integer workload accelerator (90 nodes, 720 CPUs)
- ClearSpeed CSX600 SIMD accelerator boards (additional TFlops); NEC SX-8i (for porting)
- Storage: 1.0 PB (Sun "Thumper", 48 x 500 GB disks per unit) + 0.1 PB (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth
- Unified InfiniBand network: Voltaire ISR9288, 10 Gbps x 2 per node, ~13.5 Tbit/s aggregate (3 Tbit/s bisection), plus 10 Gbps external networking
- "Fastest Supercomputer in Asia" (29th); now 103 TFlops peak as of Oct. 31st, 1.5 PB storage, 60 GB/s

TSUBAME Job Statistics, Dec. – Aug. 2007 (number of jobs):
- 797,886 jobs total (~3,270 daily)
- 597,438 serial jobs (74.8%)
- 121,108 jobs of <= 8 processors (15.2%)
- 129,398 ISV application jobs (16.2%)
- However, >32-processor jobs account for about 2/3 of cumulative CPU usage
- Coexistence of ease-of-use for both short-duration parameter surveys and large-scale MPI fits the TSUBAME design well

In the Supercomputing Landscape, Petaflops Class Is Already Here… in early Q1
- TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km of IB CX4 copper cabling, 2 PB of HDD
- 2008, LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling
- Planned for the US, Japan, (EU), (other APAC): >10 Petaflops, >1 million cores, tens of Petabytes
- Other Petaflops systems, 2008/…: LANL/IBM "Roadrunner", JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Julich…), …

In fact, we can build one now in one of the largest IDCs in the world (in Tokyo…)
- Can fit a 10 PF machine here easily (> 20 Rangers)
- Sits on top of a 55 kV / 6 GW substation
- 150 m diameter (a small baseball stadium)
- 140,000 m2 of IDC floorspace
- 70 + 70 MW of power
- The size of the entire Google(?) (~a million LP nodes)
- A source of "Cloud" infrastructure

Gilder's Law – will make thin-client access to servers essentially "free"
[Chart from Scientific American, January 2001: performance per dollar spent vs. number of years; optical fiber (bits per second) doubles every 9 months, data storage (bits per square inch) every 12 months, and silicon computer chips (number of transistors) every 18 months.]
(Original slide courtesy Phil, SDSC)
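To make the doubling times concrete, here is a small illustrative calculation (not from the slide) of how much each curve multiplies performance per dollar over five years:

```c
/* Illustrative arithmetic: cumulative improvement in performance per dollar
 * over a fixed horizon for the three doubling times quoted above. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const char *curve[3]  = { "optical fiber", "data storage", "silicon chips" };
    double doubling_mo[3] = { 9.0, 12.0, 18.0 };   /* months per doubling */
    double years = 5.0;

    for (int i = 0; i < 3; i++) {
        double factor = pow(2.0, years * 12.0 / doubling_mo[i]);
        printf("%-13s: x%.0f over %.0f years\n", curve[i], factor, years);
    }
    return 0;
}
```

Over five years this gives roughly 100x for fiber versus 32x for storage and 10x for chips, which is why network access to remote servers keeps getting relatively cheaper.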

DOE SC Applications Overview (following slides courtesy John, LBL NERSC)

Name    | Discipline       | Problem / Method   | Structure
SuperLU | Multi-Discipline | LU factorization   | Sparse matrix
MADCAP  | Cosmology        | CMB analysis       | Dense matrix
PMEMD   | Life Sciences    | Molecular dynamics | Particle
PARATEC | Material Science | DFT                | Fourier/grid
GTC     | Magnetic Fusion  | Vlasov-Poisson     | Particle-in-cell
LBMHD   | Plasma Physics   | MHD                | 2D/3D lattice
CACTUS  | Astrophysics     | General relativity | 3D grid
FVCAM   | Climate Modeling | AGCM               | 3D grid

Latency Bound vs. Bandwidth Bound?
How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
- N_1/2 from the early days of vector computing
- The bandwidth-delay product in TCP

System          | Technology      | MPI Latency | Peak Bandwidth | Bandwidth-Delay Product
SGI Altix       | Numalink-4      | 1.1 us      | 1.9 GB/s       | 2 KB
Cray X1         | Cray custom     | 7.3 us      | 6.3 GB/s       | 46 KB
NEC ES          | NEC custom      | 5.6 us      | 1.5 GB/s       | 8.4 KB
Myrinet cluster | Myrinet 2000    | 5.7 us      | 500 MB/s       | 2.8 KB
Cray XD1        | RapidArray/IB4x | 1.7 us      | 2 GB/s         | 3.4 KB

- Bandwidth bound if message size > bandwidth * delay
- Latency bound if message size < bandwidth * delay
  - Except if pipelined (unlikely with MPI due to overhead)
  - Cannot pipeline MPI collectives (but can in Titanium)
(Original slide courtesy John, LBL)
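As a concrete check of the cut-off rule above, here is a small sketch (not part of the original slides; interconnect values copied from the table, the example message size chosen arbitrarily) that computes each interconnect's bandwidth-delay product and classifies a message size against it:

```c
/* Sketch: bandwidth-delay products for the interconnects in the table above,
 * and a latency-bound vs. bandwidth-bound classification of one example
 * message size (the 16 KB value is arbitrary, for illustration only). */
#include <stdio.h>

struct link { const char *system; double latency_s; double bw_bytes_per_s; };

int main(void) {
    struct link links[] = {             /* values from the table above */
        { "SGI Altix (Numalink-4)",         1.1e-6, 1.9e9 },
        { "Cray X1 (Cray custom)",          7.3e-6, 6.3e9 },
        { "NEC ES (NEC custom)",            5.6e-6, 1.5e9 },
        { "Myrinet cluster (Myrinet 2000)", 5.7e-6, 500e6 },
        { "Cray XD1 (RapidArray/IB4x)",     1.7e-6, 2.0e9 },
    };
    double msg_bytes = 16.0 * 1024.0;   /* example message size: 16 KB */

    for (int i = 0; i < 5; i++) {
        double bdp = links[i].latency_s * links[i].bw_bytes_per_s;
        printf("%-34s BDP = %5.1f KB -> %s bound at %.0f KB\n",
               links[i].system, bdp / 1024.0,
               msg_bytes > bdp ? "bandwidth" : "latency", msg_bytes / 1024.0);
    }
    return 0;
}
```

With the table's numbers, a 16 KB message is already bandwidth bound on every system except the Cray X1, whose 46 KB bandwidth-delay product still leaves it latency bound.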

Message Size Distribution (MADBench-P2P) (Original slide courtesy John, LBL)
[Histogram of point-to-point message sizes] 60% of messages > 1 MB → bandwidth dominant; could be executed over a WAN

Message Size Distribution (SuperLU-PTP) (Original slide courtesy John, LBL)
[Histogram of point-to-point message sizes] > 95% of messages < 1 KByte → needs low latency; tightly coupled LAN

Collective Buffer Sizes – the demise of metacomputing
95% latency bound! => For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps. (Original slide courtesy John, LBL)
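The three slides above all apply the same test. Here is a hedged sketch of it (the histogram below is made up; only the 1 MB cut-off comes from the slides) that reports what fraction of an application's messages are large enough to be bandwidth dominant:

```c
/* Sketch with a made-up message-size histogram; only the 1 MB cut-off is
 * taken from the slides. Reports the fraction of messages large enough to
 * be bandwidth dominant (WAN feasible) rather than latency bound (LAN only). */
#include <stdio.h>

int main(void) {
    struct { double size_bytes; long count; } bins[] = {  /* hypothetical bins */
        { 1024.0, 700000 }, { 65536.0, 150000 },
        { 1048576.0, 50000 }, { 16777216.0, 100000 },
    };
    const double cutoff = 1048576.0;    /* 1 MB, per the slides */

    long total = 0, large = 0;
    for (int i = 0; i < 4; i++) {
        total += bins[i].count;
        if (bins[i].size_bytes > cutoff) large += bins[i].count;
    }
    double frac = 100.0 * large / total;
    printf("%.1f%% of messages > 1 MB -> %s\n", frac,
           frac > 50.0 ? "bandwidth dominant, WAN feasible"
                       : "latency bound, keep on a tightly coupled LAN");
    return 0;
}
```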

So what does this tell us?
- A "grid" programming model for parallelizing a single app is not worthwhile
  - Either it reduces to simple parameter sweep / workflow, or it will not work
  - We will have enough problems programming a single system with millions of threads (e.g., Jack's keynote)
- Grid programming should happen at the "diplomacy" level
  - Must look at multiple applications, and how they compete / coordinate
- The apps' execution environment should be virtualized, with the grid being transparent to applications
  - Zillions of apps in the overall infrastructure, competing for resources
  - Hundreds to thousands of application components that coordinate (workflow, coupled multi-physics interactions, etc.)
  - NAREGI focuses on these scenarios

Use Case in NAREGI: RISM-FMO Coupled Simulation
- RISM (solvent distribution; suitable for SMP) and FMO (electronic structure; suitable for clusters) are coupled through the Mediator over GridMPI
- The solvent charge distribution is transformed from regular to irregular meshes
- Mulliken charges are transferred as the partial charges of the solute molecules
- The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules
*The original RISM and FMO codes are developed by the Institute of Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
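The self-consistent exchange is essentially a fixed-point loop. Below is a minimal sketch of that control flow with scalar stand-ins and stub functions (all names are hypothetical; this is neither the actual IMS/AIST codes nor the Mediator API):

```c
/* Minimal sketch of the RISM-FMO coupling loop: RISM produces a solvent
 * charge distribution, the Mediator converts it from the regular to the
 * irregular mesh, FMO returns Mulliken charges used as the solute's partial
 * charges, and the cycle repeats until self-consistency. Scalars and stub
 * functions stand in for the real mesh data and components (hypothetical). */
#include <math.h>
#include <stdio.h>

static double rism_solve(double solute_charge)     { return 0.5 * solute_charge + 1.0; } /* SMP side        */
static double mediator_regrid(double regular_mesh) { return regular_mesh; }              /* mesh conversion */
static double fmo_solve(double solvent_charge)     { return 0.8 * solvent_charge; }      /* cluster side    */

int main(void) {
    double charge = 0.0, prev;
    int iter = 0;
    do {                                       /* iterate to self-consistency */
        prev = charge;
        double solvent   = rism_solve(charge);
        double regridded = mediator_regrid(solvent);
        charge = fmo_solve(regridded);
        iter++;
    } while (fabs(charge - prev) > 1e-8 && iter < 100);
    printf("converged after %d iterations, charge = %.6f\n", iter, charge);
    return 0;
}
```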

Application Sharing in Research Communities: Registration & Deployment of Applications (PSE server + ACS (Application Contents Service))
① The application developer registers the application: application summary, program source files, input files, resource requirements, etc.
② A compiling host is selected using resource information from the Information Service
③ The application is compiled and test-run on the candidate servers (e.g., Server#1: compiling OK but test run NG; Server#2 and Server#3: test run OK)
④ The compiled application environment is sent back
⑤ A deployment host is selected
⑥ The application is deployed
⑦ The deployment information is registered in the Information Service
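A toy rendering of that seven-step sequence, with stub functions in place of the real PSE, ACS and Information Service calls (hypothetical names only; this is not the NAREGI API):

```c
/* Toy sketch of the PSE/ACS registration-and-deployment flow; every function
 * here is a hypothetical stub, not a NAREGI interface. */
#include <stdio.h>

static const char *select_compile_host(void)      { return "Server#2"; }                                   /* (2) */
static int compile_and_test(const char *host)     { printf("compile + test run on %s\n", host); return 0; } /* (3) */
static const char *select_deploy_host(void)       { return "Server#3"; }                                   /* (5) */
static void deploy(const char *host)              { printf("deploy to %s\n", host); }                      /* (6) */
static void register_deployment(const char *host) { printf("register deployment info for %s\n", host); }   /* (7) */

int main(void) {
    printf("register application: summary, sources, inputs, resource requirements\n"); /* (1) */
    const char *chost = select_compile_host();                                         /* (2) */
    if (compile_and_test(chost) != 0) return 1;                                        /* (3) */
    printf("compiled application environment sent back\n");                            /* (4) */
    const char *dhost = select_deploy_host();                                          /* (5) */
    deploy(dhost);                                                                     /* (6) */
    register_deployment(dhost);                                                        /* (7) */
    return 0;
}
```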

Description of Workflow and Job Submission Requirements
[Architecture diagram: a web server (Apache + Tomcat) hosts the Workflow Servlet and a JSDL applet; workflows are described in NAREGI-WFML using data and program icons (Appli-A, Appli-B); the description is converted to BPEL containing JSDL-A documents and passed over http(s) through the NAREGI JM I/F module to the Super Scheduler; the PSE supplies application information, the Data Grid Information Service supplies global file information (/gfarm/..), and stdout/stderr are returned via GridFTP.]

Reservation-Based Co-Allocation
- The client submits a workflow with abstract JSDL to the Super Scheduler, which queries the Distributed Information Service (via DAI) for resources and performs reservation-based co-allocation, sending concrete JSDL to the GridVM on each computing resource (reservation, submission, query, control, …)
- GridVM reports resource information; accounting uses CIM and UR/RUS
- Co-allocation works across heterogeneous architectures and applications
- Used for advanced science applications, huge MPI jobs, real-time visualization on the grid, etc.
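A minimal sketch of the all-or-nothing reservation pattern this implies: try to reserve the same time window on every required resource and back out completely if any reservation fails. The resource names and the reserve/cancel calls are hypothetical stubs, not the Super Scheduler or GridVM interfaces.

```c
/* Sketch of reservation-based co-allocation: reserve a common time window on
 * all resources, or cancel everything if one reservation cannot be granted.
 * reserve()/cancel() are stand-in stubs (hypothetical). */
#include <stdbool.h>
#include <stdio.h>

#define NRES 3

static bool reserve(const char *resource, long start, long seconds) {
    printf("reserve %s: t=%ld for %lds\n", resource, start, seconds);
    return true;                 /* a real call would go to the site's local resource manager */
}
static void cancel(const char *resource) { printf("cancel %s\n", resource); }

int main(void) {
    const char *resources[NRES] = { "siteA/cluster", "siteB/smp", "siteC/visualization" };
    long start = 1000, seconds = 3600;        /* common window for the co-allocated job */

    int granted = 0;
    while (granted < NRES && reserve(resources[granted], start, seconds))
        granted++;

    if (granted < NRES) {                      /* all-or-nothing: back out partial reservations */
        for (int i = 0; i < granted; i++) cancel(resources[i]);
        printf("co-allocation failed; retry with a later window\n");
    } else {
        printf("all %d resources reserved; submit concrete JSDL to each GridVM\n", NRES);
    }
    return 0;
}
```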

Communication Libraries and Tools
1. Modules
   - GridMPI: MPI-1 and MPI-2 compliant, grid-ready MPI library
   - GridRPC: OGF GridRPC compliant GridRPC library
   - Mediator: communication tool for heterogeneous applications
   - SBC: storage-based communication tool
2. Features
   - GridMPI: MPI for a collection of geographically distributed resources; high performance, optimized for high-bandwidth networks
   - GridRPC: task-parallel, simple, seamless programming
   - Mediator: communication library for heterogeneous applications; data format conversion
   - SBC: storage-based communication for heterogeneous applications
3. Supporting standards: MPI-1 and 2, OGF GridRPC

Grid-Ready Programming Libraries: standards-compliant GridMPI and GridRPC
- GridMPI: data-parallel, MPI compatibility (CPU-to-CPU message passing)
- GridRPC (Ninf-G2): task-parallel, simple, seamless programming (RPC from a client to remote resources)
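For flavor, here is a short task-parallel GridRPC client using the standard OGF GridRPC C API as implemented by Ninf-G. The server name, the remote function "mylib/monte_carlo", and its argument list are hypothetical placeholders that would have to match an actual deployed IDL.

```c
/* Task-parallel GridRPC sketch: launch several asynchronous remote calls and
 * wait for all of them. The remote function and its arguments are
 * hypothetical; a real client needs a matching Ninf-G IDL on the server. */
#include <stdio.h>
#include "grpc.h"

#define NTASK 4

int main(int argc, char *argv[]) {
    grpc_function_handle_t handles[NTASK];
    grpc_sessionid_t ids[NTASK];
    double results[NTASK];

    if (argc < 2 || grpc_initialize(argv[1]) != GRPC_NO_ERROR) return 1;  /* client config file */

    for (int i = 0; i < NTASK; i++) {
        grpc_function_handle_init(&handles[i], "server.example.org", "mylib/monte_carlo");
        grpc_call_async(&handles[i], &ids[i], 1000000, i, &results[i]);   /* samples, seed, out */
    }
    grpc_wait_all();                               /* task parallel: wait for every call */

    for (int i = 0; i < NTASK; i++) {
        printf("task %d -> %f\n", i, results[i]);
        grpc_function_handle_destruct(&handles[i]);
    }
    grpc_finalize();
    return 0;
}
```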

Communication Tools for Co-Allocation Jobs
- Mediator: Application-1 and Application-2 communicate through Mediator components that perform data format conversion on each side, exchanging data over GridMPI
- SBC (Storage-Based Communication): Application-2 and Application-3 communicate through the SBC library using the SBC protocol

Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI)
- Cluster A (fast CPUs, slow network) and Cluster B (high bandwidth, large memory)
- App A (high bandwidth) and App B (CPU-intensive) run as MPI processes inside VMs on the hosts of each cluster, and compete for resources
- A resource manager, aware of individual application characteristics, redistributes the clusters: MPI communication logs inform the decision, and VM-based job migration moves applications to the better-suited cluster, also enabling power optimization
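A toy sketch of the kind of characteristics-aware placement such a resource manager might make (the scoring rule and all numbers are invented; this is not ABARIS or the NAREGI resource manager):

```c
/* Toy placement: score each application's demands against each cluster's
 * strengths and pick the better-suited cluster. Purely illustrative numbers. */
#include <stdio.h>

struct cluster { const char *name; double cpu_speed; double net_bw; };
struct app     { const char *name; double cpu_need;  double bw_need; };

static const char *place(struct app a, const struct cluster *c, int n) {
    const char *best = c[0].name;
    double best_score = -1.0;
    for (int i = 0; i < n; i++) {
        double score = a.cpu_need * c[i].cpu_speed + a.bw_need * c[i].net_bw;
        if (score > best_score) { best_score = score; best = c[i].name; }
    }
    return best;
}

int main(void) {
    struct cluster clusters[2] = {
        { "Cluster A (fast CPU, slow network)",       3.0, 1.0  },
        { "Cluster B (high bandwidth, large memory)", 2.0, 10.0 },
    };
    struct app apps[2] = {
        { "App A (high bandwidth)", 0.2,  0.8  },
        { "App B (CPU-intensive)",  0.95, 0.05 },
    };
    for (int i = 0; i < 2; i++)
        printf("%s -> run/migrate to %s\n", apps[i].name, place(apps[i], clusters, 2));
    return 0;
}
```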