Slide 1: Grid Datafarm Architecture for Petascale Data Intensive Computing
Osamu Tatebe, Grid Technology Research Center, AIST, on behalf of the Gfarm project
http://datafarm.apgrid.org/
ACAT 2002, June 28, 2002, Moscow, Russia

Slide 2: Petascale Data Intensive Computing / Large-scale Data Analysis
- Data intensive computing, large-scale data analysis, data mining
  - High energy physics
  - Astronomical observation, earth science
  - Bioinformatics, ...
  - Good support is still needed
- Large-scale database search, data mining
  - E-government, e-commerce, data warehouses
  - Search engines
  - Other commercial applications

Slide 3: Example: The Large Hadron Collider Accelerator at CERN
[Photographs of the ALICE and LHCb detectors and the ATLAS detector (40 m x 20 m, 7,000 tons, shown with a truck for scale).]
- LHC circumference: 26.7 km
- ~2,000 physicists from 35 countries

Slide 4: Peta/Exascale Data Intensive Computing Requirements
- Peta/exabyte-scale files
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
- Efficient global sharing with group-oriented authentication and access control
- Resource management and scheduling
- System monitoring and administration
- Fault tolerance / dynamic reconfiguration
- Global computing environment

Slide 5: Current storage approach 1: HPSS, DFS, ...
[Diagram: a supercomputer with disk cache connected over an IP network (up to ~10 Gbps) to movers, a metadata manager with its metadata DB, and a petascale tape archive.]
- Single system image and parallel I/O
- I/O bandwidth is limited by the IP network

Slide 6: Current storage approach 2: Striping cluster filesystems, e.g. PVFS, GPFS
[Diagram: compute nodes and I/O nodes connected over an IP network (up to ~10 Gbps), with a metadata manager and metadata DB; file stripes are spread across the I/O nodes.]
- Single system image and parallel I/O
- I/O bandwidth is limited by the IP network

Slide 7: For Petabyte-scale Computing
- Efficient wide-area sharing
  - Fast wide-area file transfer
  - Wide-area file replica management
- Scalable I/O bandwidth, > TB/s
  - I/O bandwidth must not be limited by network bandwidth
  - Utilize local disk I/O as much as possible
  - Avoid data movement through the network as much as possible
- Fault tolerance
  - Temporary failure of the wide-area network is common
  - Node and disk failures are not exceptional cases but common
- A fundamentally new paradigm is necessary

Slide 8: Our approach: Data-parallel cluster-of-cluster filesystem
[Diagram: combined compute & I/O nodes, each holding file fragments on local disk, connected by an IP network to a metadata manager and its metadata DB; aggregate bandwidth of N x 100 MB/s.]
- Single system image and parallel I/O
- Local file view and file affinity scheduling exploit the locality of local I/O channels

Slide 9: Our approach (2): Parallel Filesystem for a Grid of Clusters
- Cluster-of-cluster filesystem on the Grid
  - File replicas among clusters for fault tolerance and load balancing
  - Extension of striping cluster filesystems: arbitrary file block length, unified I/O and compute nodes, parallel I/O, parallel file transfer, and more
- Extreme I/O bandwidth, > TB/s
  - Exploit data access locality
  - File affinity scheduling and local file view
- Fault tolerance and file recovery
  - Write-once files can be regenerated from a command history by re-computation

Slide 10: Gfarm cluster-of-cluster filesystem (1)
[Diagram: clusters interconnected at ~10 Gbps, each with its own metaserver (MS).]
- Extension of a cluster filesystem
  - A file is divided into file fragments
  - Arbitrary length for each file fragment
  - Arbitrary number of I/O nodes for each file
  - Filesystem metadata is managed by the metaserver
  - Parallel I/O and parallel file transfer
- Cluster-of-cluster filesystem
  - File replicas among (or within) clusters for fault tolerance and load balancing
  - A filesystem metaserver manages the metadata at each site

Slide 11: Gfarm cluster-of-cluster filesystem (2)
- gfsd: I/O daemon running on each filesystem node
  - Remote file operations
  - Authentication / access control (via GSI, ...)
  - Fast executable invocation
  - Heartbeat / load monitoring
- Process / resource monitoring and management
- gfmd: metaserver and process manager running at each site
  - Filesystem metadata management; the metadata consists of:
    - Mapping from logical filename to the physical, distributed fragment filenames
    - Replica catalog
    - Command history for regeneration of lost files
    - Platform information
    - File status information (size, protection, ...)
  (A hypothetical metadata-record sketch follows.)
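To make the metadata items above concrete, here is a minimal C sketch of what a single gfmd file record might hold. The field names and types are illustrative assumptions only; the slide does not describe the actual gfmd schema.

```c
/* Hypothetical sketch of a per-file metadata record kept by gfmd.
 * Field names and types are assumptions, not the real gfmd schema. */
#include <stddef.h>
#include <sys/types.h>

struct fragment_info {
    int   index;            /* fragment number within the logical file */
    char *physical_path;    /* fragment filename on a filesystem node  */
    char *hostname;         /* node that stores this fragment          */
};

struct replica_entry {
    int   fragment_index;   /* which fragment this replica duplicates  */
    char *hostname;         /* node holding the replica                */
};

struct gfarm_file_metadata {
    char   *logical_name;   /* e.g. "gfarm:input"                      */
    int     nfragments;     /* arbitrary number of I/O nodes per file  */
    struct fragment_info *fragments;  /* logical -> physical mapping   */
    struct replica_entry *replicas;   /* replica catalog               */
    size_t  nreplicas;
    char   *command_history;/* command line(s) to regenerate the file  */
    char   *platform;       /* platform information for re-execution   */
    off_t   size;           /* file status: size, protection, ...      */
    mode_t  protection;
};
```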

Slide 12: Extreme I/O bandwidth (1)
- A petascale file tends to be accessed with access locality
  - Local I/O is aggressively utilized for scalable I/O throughput
  - Target architecture: a cluster of clusters, each node equipped with large, fast local disks
- File affinity process scheduling
  - Essentially "disk-owner computes": computation is moved to the node that owns the data
- Gfarm parallel I/O extension: the local file view
  - MPI-IO is insufficient, especially for irregular and dynamically distributed data
  - Each parallel process accesses only its own file fragment
  - Flexible and portable management within a single system image
  - A Grid-aware parallel I/O library

Slide 13: Extreme I/O bandwidth (2): Process manager scheduling
- File affinity scheduling: process scheduling based on the file's fragment distribution
- Example: % gfrun -H gfarm:File Process
[Diagram: gfmd maps gfarm:File to fragments File.0-File.3 stored on Host0.ch, Host1.ch, Host2.jp and Host3.jp (each running gfsd); Process.0-Process.3 are dispatched to those same hosts. See the scheduling sketch after this slide.]
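As a rough illustration of the idea (not the actual gfrun/gfmd implementation), the scheduler places process i on whichever host stores fragment i, so each process can later open its data in the local file view. The types and the launch_process_on() helper below are hypothetical.

```c
/* Illustrative sketch of file affinity scheduling: place parallel
 * process i on the host that stores fragment i of the target file.
 * NOT the actual gfrun/gfmd code; launch_process_on() is hypothetical. */
#include <stdio.h>

struct fragment_location {
    int         index;     /* fragment number                          */
    const char *hostname;  /* filesystem node holding that fragment    */
};

/* hypothetical remote launcher, e.g. wrapping globus-job-run or rsh */
extern int launch_process_on(const char *hostname, const char *command,
                             int rank);

int schedule_by_file_affinity(const struct fragment_location *frags,
                              int nfrags, const char *command)
{
    /* One process per fragment, co-located with its data, so all
     * subsequent local-file-view I/O hits the local disk. */
    for (int i = 0; i < nfrags; i++) {
        if (launch_process_on(frags[i].hostname, command, i) != 0) {
            fprintf(stderr, "failed to start rank %d on %s\n",
                    i, frags[i].hostname);
            return -1;
        }
    }
    return 0;
}
```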

Slide 14: Extreme I/O bandwidth (3): Gfarm I/O API, file view (1): global file view
[Diagram: Process.0-Process.3 all access the whole logical file gfarm:File, whose fragments File.0-File.3 are stored on Host0.ch-Host3.jp and served by gfsd under gfmd.]
- In the global file view, I/O bandwidth is limited by the bisection bandwidth (~GB/s), as in an ordinary parallel filesystem

Slide 15: Extreme I/O bandwidth (4): Gfarm I/O API, file view (2): local file view
[Diagram: each of Process.0-Process.3 accesses only its own fragment File.0-File.3 stored on its host (Host0.ch-Host3.jp).]
- The accessible data set is restricted to the local file fragment
- Scalable disk I/O bandwidth (> TB/s)
- (The local file fragment may be stored on a remote node)
A code sketch of the local file view follows.
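A minimal sketch of reading a Gfarm file in the local file view, assuming the C API named on slide 19 (gfs_pio_open, gfs_pio_set_view_local, gfs_pio_read, gfs_pio_close). The header name, flag names, and error-handling convention are approximations, so treat this as pseudocode rather than the exact library interface.

```c
/* Sketch: read only the local fragment of a Gfarm file.
 * Signatures, flags, and the error-string convention are
 * approximations of the Gfarm API; check the documentation. */
#include <stdio.h>
#include <gfarm/gfarm.h>   /* assumed header name */

int read_local_fragment(const char *url)   /* e.g. "gfarm:File" */
{
    GFS_File f;
    char buf[65536];
    char *e;
    int n;

    /* gfarm_initialize() is assumed to have been called already. */
    e = gfs_pio_open(url, GFARM_FILE_RDONLY, &f);
    if (e != NULL) { fprintf(stderr, "open: %s\n", e); return -1; }

    /* Switch to the local file view: this process now sees only the
     * fragment assigned to this node, so reads go to the local disk. */
    e = gfs_pio_set_view_local(f, 0);
    if (e != NULL) { fprintf(stderr, "set_view_local: %s\n", e); return -1; }

    while ((e = gfs_pio_read(f, buf, sizeof(buf), &n)) == NULL && n > 0) {
        /* process n bytes of the local fragment ... */
    }
    gfs_pio_close(f);
    return e == NULL ? 0 : -1;
}
```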

Slide 16: Gfarm example: gfgrep, a parallel grep
% gfgrep -o gfarm:output regexp gfarm:input
[Diagram: gfmd maps gfarm:input to fragments input.1-input.5 on Host1.ch, Host2.ch, Host3.ch (CERN.CH) and Host4.jp, Host5.jp (KEK.JP); a grep process runs on each host, reading input.N and writing output.N locally.]
Per-process pseudocode from the slide:
  open("gfarm:input", &f1); create("gfarm:output", &f2);
  set_view_local(f1); set_view_local(f2);
  grep regexp; close(f1); close(f2);
A fuller C sketch follows.
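Expanding the slide's pseudocode into one worker process, under the same API assumptions as the previous sketch. The Gfarm signatures, flags, and EOF semantics are approximations, and matches_regexp() is a placeholder; this is not the real gfgrep source.

```c
/* Sketch of one gfgrep worker process, expanded from the slide's
 * pseudocode. Gfarm calls are approximations of the API named on
 * slide 19; matches_regexp() stands in for real regex handling. */
#include <stdio.h>
#include <string.h>
#include <gfarm/gfarm.h>   /* assumed header name */

static int matches_regexp(const char *line, const char *regexp)
{
    return strstr(line, regexp) != NULL;   /* placeholder for regexec() */
}

int gfgrep_worker(const char *regexp)
{
    GFS_File in, out;
    char line[4096], *e;
    int c, n, len = 0;

    e = gfs_pio_open("gfarm:input", GFARM_FILE_RDONLY, &in);
    if (e != NULL) { fprintf(stderr, "open: %s\n", e); return -1; }
    e = gfs_pio_create("gfarm:output", GFARM_FILE_WRONLY, 0644, &out);
    if (e != NULL) { fprintf(stderr, "create: %s\n", e); return -1; }

    /* Local file view: this process reads only its own input fragment
     * and writes only its own output fragment, all on the local disk. */
    gfs_pio_set_view_local(in, 0);
    gfs_pio_set_view_local(out, 0);

    while ((c = gfs_pio_getc(in)) != EOF) {   /* EOF semantics assumed */
        line[len++] = (char)c;
        if (c == '\n' || len == (int)sizeof(line) - 1) {
            line[len] = '\0';
            if (matches_regexp(line, regexp))
                gfs_pio_write(out, line, len, &n);
            len = 0;
        }
    }
    gfs_pio_close(in);
    gfs_pio_close(out);
    return 0;
}
```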

Slide 17: Fault-tolerance support
- File replicas on an individual fragment basis
- Regeneration of lost (or needed) write-once files using a command history
  - The program and input files are stored in the fault-tolerant Gfarm filesystem
  - The program must be deterministic
  - Regeneration also supports the GriPhyN virtual data concept

Slide 18: Specifics of Gfarm APIs and Commands
(For details, please see the paper at http://datafarm.apgrid.org/)

Slide 19: Gfarm Parallel I/O APIs
- gfs_pio_create / open / close
- gfs_pio_set_view_local / index / global
- gfs_pio_read / write / seek / flush
- gfs_pio_getc / ungetc / putc
- gfs_mkdir / rmdir / unlink
- gfs_chdir / chown / chgrp / chmod
- gfs_stat
- gfs_opendir / readdir / closedir

Slide 20: Major Gfarm Commands
- gfrep: replicate a Gfarm file using parallel streams
- gfsched / gfwhere: list the hostnames where each Gfarm fragment and replica is stored
- gfls: list the contents of a directory
- gfcp: copy files using parallel streams
- gfrm, gfrmdir: remove directory entries
- gfmkdir: make directories
- gfdf: display the number of free disk blocks and files
- gfsck: check and repair the filesystem

Slide 21: Porting Legacy or Commercial Applications
- Hook the open(), close(), write(), ... system calls to use the Gfarm filesystem
  - Intercepted syscalls are executed in the local file view
  - This allows thousands of files to be grouped automatically and processed in parallel
  - Quick start for legacy applications, though some portability problems must be dealt with (an interception sketch follows this slide)
- gfreg command
  - After thousands of files have been created, gfreg explicitly groups them into a single Gfarm file
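The slide does not show how the hooking is implemented. One common way to intercept libc calls on Linux is an LD_PRELOAD shim like the sketch below. This is a generic illustration of the interception technique, not the Gfarm hooking library itself; gfarm_open_hook() is a hypothetical stand-in for "redirect gfarm: paths into Gfarm under the local file view".

```c
/* Generic LD_PRELOAD interposer illustrating how open() can be hooked.
 * NOT the Gfarm syscall-hooking library; gfarm_open_hook() is a
 * hypothetical helper. Build: gcc -shared -fPIC -o hook.so hook.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

static int (*real_open)(const char *, int, ...) = NULL;

/* hypothetical: open a gfarm: path via Gfarm and return a descriptor
 * usable by the unmodified application, in the local file view */
extern int gfarm_open_hook(const char *path, int flags, mode_t mode);

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    va_list ap;

    if (flags & O_CREAT) {              /* mode argument only with O_CREAT */
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int); /* mode_t is promoted to int here */
        va_end(ap);
    }
    if (strncmp(path, "gfarm:", 6) == 0)
        return gfarm_open_hook(path, flags, mode);   /* Gfarm path */

    if (real_open == NULL)              /* fall through to the real libc */
        real_open = (int (*)(const char *, int, ...))
                    dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}
```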

Slide 22: Initial Performance Evaluation: Presto III Gfarm Development Cluster (Prototype)
- 128 nodes, each a dual Athlon MP 1.2 GHz with 768 MB memory and a 200 GB HDD
- Totals: 98 GB memory, 25 TB storage
- Myrinet 2000, full bandwidth, 64-bit PCI
- 614 GFLOPS peak; 331.7 GFLOPS Linpack (Top500)
- In operation since October 2001

Slide 23: Initial Performance Evaluation (2): Parallel I/O (file affinity scheduling and local file view)
- Presto III, 64 nodes serving as both compute and I/O nodes, 640 GB of data
- 1742 MB/s on writes, 1974 MB/s on reads
- Benchmark kernel from the slide (per process):
    open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f);
A fleshed-out sketch follows.
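A sketch of that per-process write kernel, using the same approximate Gfarm API as the earlier examples. The buffer size, per-node data volume, header name, flags, and signatures are all assumptions.

```c
/* Sketch of the per-process write benchmark from the slide: create
 * "gfarm:f", switch to the local file view, stream data to the local
 * fragment, close. Sizes, flags, and signatures are assumptions. */
#include <stdio.h>
#include <string.h>
#include <gfarm/gfarm.h>   /* assumed header name */

#define BUF_SIZE   (1 << 20)          /* 1 MB per write, assumed        */
#define TOTAL_SIZE (10LL << 30)       /* ~10 GB per node (640 GB / 64)  */

int write_benchmark(void)
{
    static char buf[BUF_SIZE];
    GFS_File f;
    char *e;
    int n;
    long long written = 0;

    memset(buf, 'x', sizeof(buf));

    e = gfs_pio_create("gfarm:f", GFARM_FILE_WRONLY, 0644, &f);
    if (e != NULL) { fprintf(stderr, "create: %s\n", e); return -1; }

    /* Local file view: each of the 64 processes writes only its own
     * fragment to its own local disk, so bandwidth scales with nodes. */
    gfs_pio_set_view_local(f, 0);

    while (written < TOTAL_SIZE) {
        e = gfs_pio_write(f, buf, sizeof(buf), &n);
        if (e != NULL || n <= 0) break;
        written += n;
    }
    gfs_pio_close(f);
    return e == NULL ? 0 : -1;
}
```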

Slide 24: Initial Performance Evaluation (3): File replication (gfrep)
- Presto III, Myrinet 2000, 10 GB per fragment
- 443 MB/s with 23 parallel streams; 180 MB/s with 7 parallel streams
[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002.

Slide 25: Design of AIST Gfarm Cluster I and the Gfarm Testbed
- Cluster node: 1U, dual 2.4 GHz Xeon, GbE; 480 GB RAID built from eight 2.5-inch 60 GB HDDs; 105 MB/s on writes, 85 MB/s on reads (per node: 480 GB, ~100 MB/s, 10 GFLOPS)
- 10-node experimental cluster (to be installed by July 2002): 10U plus a GbE switch; 5 TB RAID with 80 disks in total; 1050 MB/s on writes, 850 MB/s on reads (aggregate)
- Gfarm testbed: AIST 10 + 10 nodes, Titech 256 nodes, KEK, ICEPP, Osaka Univ. 108 nodes
- ApGrid/PRAGMA testbed: AIST, Titech, KEK, ...; Indiana Univ., SDSC, ...; ANU, APAC, Monash, ...; KISTI (Korea); NCHC (Taiwan); Kasetsart U., NECTEC (Thailand); NTU, BII (Singapore); USM (Malaysia), ...

Slide 26: Related Work
- MPI-IO: no local file view, which is the key to maximizing local I/O scalability
- PVFS (striping cluster filesystem): UNIX I/O API and MPI-IO; I/O bandwidth limited by network bandwidth; open questions on fault tolerance, wide-area operation, and scalability
- IBM PIOFS, GPFS
- HPSS (hierarchical mass storage system): also limited by network bandwidth
- Distributed filesystems (NFS, AFS, Coda, xFS, GFS, ...): poor bandwidth for parallel writes
- Globus (Grid toolkit): GridFTP provides Grid security and parallel streams; replica management via a replica catalog and GridFTP
- Kangaroo (Condor approach): latency hiding that uses local disks as a cache; no solution for bandwidth
- Gfarm is the first attempt at a cluster-of-cluster filesystem on the Grid, with file replicas, file affinity scheduling, ...

Slide 27: Grid Datafarm Development Schedule
- Initial prototype, 2000-2001
  - Gfarm filesystem, Gfarm API, file affinity scheduling, and data streaming
  - Deployed on the development Gfarm cluster
- Second prototype, 2002(-2003)
  - Grid security infrastructure
  - Load balancing, fault tolerance, scalability
  - Multiple metaservers with coherent caching
  - Evaluation in a cluster-of-cluster environment
  - Study of replication and scheduling policies
  - ATLAS full-geometry Geant4 simulation (1M events)
  - Accelerated by the national "Advanced Network Computing initiative" (US$10M over 5 years)
- Full production development, 2004-2005 and beyond
  - Deployed on the production Gfarm cluster
  - Petascale online storage
- Synchronized with the ATLAS schedule; ATLAS-Japan Tier-1 RC is the "prime customer"
[Network diagram: KEK and AIST/TACC (5 km apart) connected at 10xN Gbps over the Tsukuba WAN (10 Gbps) and Super SINET to U-Tokyo (60 km) and TITECH (80 km).]

Slide 28: Summary
- A petascale data intensive computing wave is coming; the key technologies are the Grid and clusters
- Grid Datafarm is an architecture for:
  - Online storage of > 10 PB with > TB/s I/O bandwidth
  - Efficient sharing on the Grid
  - Fault tolerance
- Initial performance evaluation shows scalable performance:
  - 1742 MB/s on writes and 1974 MB/s on reads on 64 cluster nodes
  - 443 MB/s for replication using 23 parallel streams
  - Metaserver overhead is negligible
  - I/O bandwidth is limited by disk I/O, not by the network (good!)
- Contact: datafarm@apgrid.org, http://datafarm.apgrid.org/

Slide 29: CERN/ATLAS High Energy Physics Data Analysis: Petascale Online Data Processing
[Data-flow diagram of the ATLAS event processing chain:
- RAW (~1 PB/year; 1 MB/event at 100 MB/s): tracker, calorimeter, and magnet digits
- REC (~1 PB/year): tracker position info, calorimeter energies, and magnetic field from the per-detector reconstruction algorithms
- ESD (~300 TB/year; 100 KB/event): tracks and clusters from the track and cluster reconstruction algorithms
- AOD (~10 TB/year; 10 KB/event): jets, electrons, photons, and missing Et from the identification algorithms]

