
1 Worldwide File Replication on Grid Datafarm. Osamu Tatebe and Satoshi Sekiguchi, Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST). APAN 2003 Conference, Fukuoka, January 2003.

2 ATLAS/Grid Datafarm project: CERN LHC Experiment
 LHC: perimeter 26.7 km; ~2000 physicists from 35 countries
 ATLAS detector: 40 m x 20 m, 7000 tons
 Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo
 [Slide figure: LHC ring with detectors for the ALICE, LHCb, and ATLAS experiments]

3 Petascale Data-intensive Computing Requirements
 Peta/exabyte-scale files
 Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
 Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
 Efficient global sharing with group-oriented authentication and access control
 Resource management and scheduling
 System monitoring and administration
 Fault tolerance / dynamic re-configuration
 Global computing environment

4 Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
 Cluster-of-cluster filesystem on the Grid: file replicas among clusters for fault tolerance and load balancing; an extension of a striping cluster filesystem
 Arbitrary file block length
 Filesystem node = compute node + I/O node; each node has large, fast local disks
 Parallel I/O, parallel file transfer, and more
 Extreme I/O bandwidth, > TB/s: exploits data access locality through file affinity scheduling and the local file view
 Fault tolerance and file recovery: write-once files can be re-generated from a command history by re-computation
[1] O. Tatebe et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/
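File affinity scheduling is the key idea above: a task that processes fragment i is dispatched to the node whose local disk already stores fragment i, so all I/O stays local. The following is a minimal sketch of that placement decision; the fragment-to-host table and all names are invented for illustration and are not the real Gfarm metadata-server interface.

```c
/* Minimal sketch of file-affinity scheduling: send each task to the
 * host that already holds the file fragment it will read.
 * The fragment table below is hypothetical; in Gfarm this mapping
 * is kept by the metadata server. */
#include <stdio.h>

struct fragment {
    const char *name;   /* logical fragment name           */
    const char *host;   /* node whose local disk stores it */
};

/* Hypothetical placement of five fragments of gfarm:input. */
static const struct fragment frags[] = {
    { "input.1", "host1.ch" },
    { "input.2", "host2.ch" },
    { "input.3", "host3.ch" },
    { "input.4", "host4.jp" },
    { "input.5", "host5.jp" },
};

int main(void)
{
    /* "Schedule" one task per fragment on the owning host, so each
     * task reads from a local disk instead of pulling data over WAN. */
    for (size_t i = 0; i < sizeof frags / sizeof frags[0]; i++)
        printf("run task %zu on %s (reads %s locally)\n",
               i + 1, frags[i].host, frags[i].name);
    return 0;
}
```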

5 Distributed disks across the clusters form a single Gfarm file system
 Each cluster generates the corresponding part of the data
 The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
 The analysis process is executed on the node that has the data
 Sites in the figure: Baltimore, Tsukuba, Indiana, San Diego, Tokyo

6 Extreme I/O bandwidth support example: gfgrep, a parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
 File affinity scheduling: gfmd, the Gfarm metadata server, schedules one gfgrep process on each host holding a fragment of gfarm:input (input.1 to input.5 on Host1.ch to Host3.ch at CERN.CH and Host4.jp, Host5.jp at KEK.JP)
 Each process runs: open("gfarm:input", &f1); create("gfarm:output", &f2); set_view_local(f1); set_view_local(f2); grep regexp; close(f1); close(f2)
 Each process writes its own output fragment (output.1 to output.5) to its local disk
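A rough C paraphrase of the per-node gfgrep flow shown on this slide, assuming the local file view has already mapped gfarm:input and gfarm:output to local fragment paths. The paths and the simple substring match are placeholders for illustration, not the actual Gfarm library calls named above.

```c
/* Per-node sketch of the gfgrep flow from the slide: with the local
 * file view, each node sees only its own fragment of gfarm:input and
 * writes only its own fragment of gfarm:output.  The fragment paths
 * and the plain substring match are placeholders, not the Gfarm API. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *pattern = "Higgs";                    /* stands in for regexp */
    FILE *in  = fopen("/gfarm/local/input.3", "r");   /* local input view     */
    FILE *out = fopen("/gfarm/local/output.3", "w");  /* local output view    */
    if (!in || !out) {
        perror("fopen");
        return 1;
    }

    char line[4096];
    while (fgets(line, sizeof line, in))              /* grep over the local  */
        if (strstr(line, pattern))                    /* fragment only        */
            fputs(line, out);

    fclose(in);                                       /* close(f1); close(f2) */
    fclose(out);
    return 0;
}
```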

7 Design of AIST Gfarm Cluster I
 Cluster node (high density and high performance): 1U, dual 2.4 GHz Xeon, GbE; 480 GB RAID with 4 x 3.5" 120 GB HDDs on a 3ware RAID controller
 136 MB/s on writes, 125 MB/s on reads per node
 12-node experimental cluster (operational since Oct 2002): 12U + GbE switch (2U) + KVM switch (2U) + keyboard + LCD; 6 TB of RAID with 48 disks in total
 1063 MB/s on writes, 1437 MB/s on reads
 410 MB/s for file replication with 6 streams, with WAN emulation using NIST Net
 [Figure labels: 480 GB, 120 MB/s, 10 GFlops per node; GbE switch]
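As a sanity check on these figures: one node's RAID delivers 136 MB/s on writes and 125 MB/s on reads, so 12 nodes would give 1632 MB/s and 1500 MB/s under an assumed ideal linear scaling, against the measured 1063 MB/s and 1437 MB/s. The tiny program below only reproduces that arithmetic.

```c
/* Back-of-the-envelope scaling check for the 12-node AIST Gfarm cluster:
 * compare the measured aggregate bandwidth against ideal linear scaling
 * of the single-node RAID figures quoted on this slide. */
#include <stdio.h>

int main(void)
{
    const int    nodes          = 12;
    const double node_write     = 136.0;   /* MB/s per node (3ware RAID) */
    const double node_read      = 125.0;   /* MB/s per node              */
    const double measured_write = 1063.0;  /* MB/s, 12-node aggregate    */
    const double measured_read  = 1437.0;  /* MB/s, 12-node aggregate    */

    double ideal_write = nodes * node_write;   /* 1632 MB/s */
    double ideal_read  = nodes * node_read;    /* 1500 MB/s */

    printf("writes: %.0f of %.0f MB/s (%.0f%% of linear scaling)\n",
           measured_write, ideal_write, 100.0 * measured_write / ideal_write);
    printf("reads : %.0f of %.0f MB/s (%.0f%% of linear scaling)\n",
           measured_read, ideal_read, 100.0 * measured_read / ideal_read);
    return 0;
}
```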

8 Grid Datafarm US-OZ-Japan Testbed
 Sites: KEK, Titech, AIST, ICEPP (U Tokyo) in Japan; SDSC and Indiana Univ. in the US; Melbourne in Australia
 [Network diagram: Tokyo NOC, Indianapolis GigaPoP, PNWG, StarLight; links include APAN/TransPAC OC-12 POS, OC-12 ATM, SuperSINET 1 Gbps, Tsukuba WAN 20 Mbps, NII-ESnet HEP PVC to US ESnet, and GbE site connections]
 Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s

9 Grid Datafarm for a HEP application: SC2002 High-Performance Bandwidth Challenge
Osamu Tatebe (Grid Technology Research Center, AIST), Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)

10 Target Application at SC2002: FADS/Goofy
 Monte Carlo simulation framework with Geant4 (C++)
 FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly (http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/)
 Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
 CPU-intensive event simulation with high-speed file replication and/or distribution

11 Network and cluster configuration for the SC2002 Bandwidth Challenge
 Grid Cluster Federation Booth at SC2002 (Baltimore), connected to SCinet with 10 GE via a Force10 E1200
 Sites and links as on the testbed slide: KEK, Titech, AIST, ICEPP, SDSC, Indiana U; APAN/TransPAC OC-12 POS and OC-12 ATM (271 Mbps), SuperSINET 1 Gbps, Tsukuba WAN 20 Mbps, NII-ESnet HEP PVC, StarLight OC-12, GbE site connections
 Total bandwidth from/to the SC2002 booth: 2.137 Gbps
 Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s; peak CPU performance: 962 GFlops

12 US Backbone
 Multiple routes in a single application!
 Parallel file replication between the SC2002 booth (Baltimore) and Japan achieved 741 Mbps
 A record speed at the SC2002 Bandwidth Challenge: 2.286 Gbps using 12 nodes
 [Route map: San Diego (SDSC), Indiana Univ., SC2002 booth, Chicago, Seattle, Tokyo (MAFFIN), Tsukuba WAN 20 Mbps, AIST, Titech, U Tokyo, KEK; transpacific link speeds 10 Gbps, 1 Gbps, 622 Mbps, 271 Mbps]

13 Network and cluster configuration
 SC2002 booth: 12-node AIST Gfarm cluster connected with GbE; connects to SCinet with 10 GE using a Force10 E1200
 Performance in the LAN: network bandwidth 930 Mbps; file transfer bandwidth 75 MB/s (= 629 Mbps)
 GTRC, AIST: the same 7-node AIST Gfarm cluster connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
 Indiana Univ: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
 SDSC: 8-node PC cluster connected with GbE; connects to the outside with OC-12
 TransPAC north and south routes: the north route is the default; the south route is used, via static routing, by 3 nodes each at the SC booth and AIST
 RTT between AIST and the SC booth: north 199 ms, south 222 ms; the south route is shaped to 271 Mbps
 [Figure: network configuration at the SC2002 booth; 12 nodes, GbE to the E1200, 10 GE to SCinet, OC-192]

14 Challenging points of TCP-based file transfer
 Large latency, high bandwidth (aka an LFN): big socket size for a large congestion window; fast window-size recovery after packet loss (High-Speed TCP, an Internet-Draft by Sally Floyd); network striping
 Packet loss due to real congestion: transfer control
 Poor disk I/O performance: 3ware RAID with 4 x 3.5" HDDs on each node gives over 115 MB/s (~1 Gbps of network bandwidth); network striping vs. disk striping access; # streams, stripe size
 Limited number of nodes: need to achieve maximum file transfer performance
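The first two points, big socket buffers and parallel streams, can be sketched with plain POSIX sockets: request a large send buffer before connecting (the OS must also permit it through its system-wide limits) and open several connections per node pair. The host, port, and stream count below are illustrative; the 610 KB buffer matches the northern-route setting quoted on a later slide. This is a sketch of the general technique, not the project's transfer code.

```c
/* Sketch of the LFN workarounds named on this slide, using plain POSIX
 * sockets: request a large send buffer (so the congestion window can
 * grow) and open several parallel streams per node pair. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSTREAMS 16            /* parallel streams per node pair     */
#define SOCKBUF  (610 * 1024)  /* socket buffer, northern-route size */

int main(void)
{
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                     /* example port */
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);  /* example host */

    int socks[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) {
        socks[i] = socket(AF_INET, SOCK_STREAM, 0);
        int buf = SOCKBUF;
        /* Ask for a big buffer; the kernel may clamp it to its
         * configured maximum, so system limits matter too. */
        setsockopt(socks[i], SOL_SOCKET, SO_SNDBUF, &buf, sizeof buf);
        if (connect(socks[i], (struct sockaddr *)&peer, sizeof peer) < 0)
            perror("connect");
    }

    for (int i = 0; i < NSTREAMS; i++)
        close(socks[i]);
    return 0;
}
```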

15 Bandwidth measurement result for TransPAC by Iperf
 Northern route: 2 node pairs; southern route: 3 node pairs (100 Mbps streams)
 753 Mbps in total (10-sec average); peak link bandwidth: 622 + 271 = 893 Mbps
 [Graphs: 10-sec and 5-min average bandwidth, northern and southern routes]

16 Bandwidth measurement result between the SC booth and other sites by Iperf (1-min average)
 The TransPAC northern route shows very high deviation, due to a packet-loss problem on Abilene between Denver and Kansas City
 Other chart annotations: "10 GE not available"; "due to evaluation of different configurations of the southern route"
 [Graph series: Indiana Univ, SDSC, TransPAC North, TransPAC South]

17 File replication between the US and Japan
 Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer! (out of 893 Mbps; 10-sec average bandwidth)
 [Graph: 10-sec average bandwidth]

18 Parameters of US-Japan file transfer

Parameter                  | Northern route | Southern route
socket buffer size         | 610 KB         | 250 KB
traffic control per stream | 50 Mbps        | 28.5 Mbps
# streams per node pair    | 16 streams     | 8 streams
# nodes                    | 3 hosts        | 1 host

Stripe unit size: 128 KB

# node pairs | # streams         | 10-sec average BW | Transfer time (sec) | Average BW
1 (N1)       | 16 (N16x1)        | -                 | 113.0               | 152 Mbps
2 (N2)       | 32 (N16x2)        | 419 Mbps          | 115.9               | 297 Mbps
3 (N3)       | 48 (N16x3)        | 593 Mbps          | 139.6               | 369 Mbps
4 (N3 S1)    | 56 (N16x3 + S8x1) | 741 Mbps          | 150.0               | 458 Mbps
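To make the stripe-unit parameter concrete, the sketch below cuts a source file into 128 KB units and deals them round-robin across the parallel streams; which stream each unit goes to is the only decision striping adds. The input path is a placeholder, and the streams here simply stand in for the TCP connections of the previous sketch.

```c
/* Sketch of striped transfer with the parameters from this slide:
 * 128 KB stripe units dealt round-robin over the parallel streams. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define STRIPE_UNIT (128 * 1024)   /* stripe unit size from the table */
#define NSTREAMS    16             /* streams per node pair (north)   */

int main(void)
{
    int src = open("/data/replica-source.dat", O_RDONLY);  /* placeholder path */
    if (src < 0) { perror("open"); return 1; }

    char unit[STRIPE_UNIT];
    ssize_t n;
    long unit_no = 0;

    while ((n = read(src, unit, sizeof unit)) > 0) {
        int stream = unit_no % NSTREAMS;   /* round-robin placement */
        /* A real transfer would write(socks[stream], unit, n) on the
         * matching TCP stream; here we only report the decision. */
        printf("stripe unit %ld (%zd bytes) -> stream %d\n",
               unit_no, n, stream);
        unit_no++;
    }

    close(src);
    return 0;
}
```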

19 File replication performance between SC-booth and other US sites

20 SC02 Bandwidth Challenge Result
 We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)
 [Graphs: 10-sec, 1-sec, and 0.1-sec average bandwidth]

21 Summary
 Petascale data-intensive computing wave; key technology: Grid and cluster
 Grid Datafarm is an architecture for online > 10 PB storage with > TB/s I/O bandwidth, efficient sharing on the Grid, and fault tolerance
 Initial performance evaluation shows scalable performance: 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III; 443 MB/s using 23 parallel streams on Presto III; 1063 MB/s and 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I; 410 MB/s using 6 parallel streams on AIST Gfarm I
 Metaserver overhead is negligible
 Gfarm file replication achieved 2.286 Gbps at the SC2002 Bandwidth Challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
 A smart resource broker is needed!!
datafarm@apgrid.org http://datafarm.apgrid.org/

22 Special thanks to
 Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
 Phillip Papadopoulos (SDSC, PRAGMA)
 Hisashi Eguchi (Maffin)
 Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
 Chris Robb (Indiana Univ, Abilene)
 Force10 Networks, Inc.

