Worldwide File Replication on Grid Datafarm
Osamu Tatebe and Satoshi Sekiguchi
Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
APAN 2003 Conference, Fukuoka, January 2003

ATLAS/Grid Datafarm project: CERN LHC Experiment
[Figure: the LHC site with the ATLAS, ALICE, and LHCb detectors; a truck shown for scale]
 ATLAS detector: 40 m x 20 m, 7,000 tons
 LHC perimeter: 26.7 km
 ~2,000 physicists from 35 countries
 Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo

Petascale Data-intensive Computing Requirements
 Peta/Exabyte-scale files
 Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
 Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
 Efficient global sharing with group-oriented authentication and access control
 Resource management and scheduling
 System monitoring and administration
 Fault tolerance / dynamic re-configuration
 Global computing environment

Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
 Cluster-of-cluster filesystem on the Grid
  - File replicas among clusters for fault tolerance and load balancing
  - Extension of a striping cluster filesystem
 Arbitrary file block length
 Filesystem node = compute node + I/O node; each node has large, fast local disks
 Parallel I/O, parallel file transfer, and more
 Extreme I/O bandwidth, > TB/s
  - Exploit data access locality: file affinity scheduling and local file view
 Fault tolerance: file recovery
  - Write-once files can be re-generated using a command history and re-computation
[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002.

Distributed disks across the clusters form a single Gfarm file system
 Each cluster generates the corresponding part of the data
 The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
 The analysis process is executed on the node that has the data
[Figure: clusters at Baltimore, Tsukuba, Indiana, San Diego, and Tokyo forming a single Gfarm file system]

Extreme I/O bandwidth support example: gfgrep, a parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
[Figure: file affinity scheduling; the gfmd metadata server maps the fragments input.1 to input.5 of gfarm:input onto Host1.ch, Host2.ch, Host3.ch (CERN.CH) and Host4.jp, Host5.jp (KEK.JP); each node runs gfgrep on its own fragment, calling open("gfarm:input", &f1), create("gfarm:output", &f2), set_view_local(f1), set_view_local(f2), grep regexp, close(f1), close(f2), and writes the corresponding fragment output.1 to output.5 of gfarm:output]
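To make the diagram concrete, the following is a minimal, self-contained C sketch of the per-fragment work one gfgrep worker does once file affinity scheduling has placed it on the node that stores its fragment. Plain POSIX I/O, explicit fragment paths, and the grep_fragment helper are illustrative stand-ins; the open/create/set_view_local calls shown on the slide belong to Gfarm's own API and are not reproduced here.

/* Minimal sketch of the per-fragment work of a gfgrep worker.
 * Plain POSIX I/O and explicit fragment paths stand in for Gfarm's
 * local file view; in the real system, gfrun and file affinity
 * scheduling start one such process on each node storing a fragment. */
#include <stdio.h>
#include <regex.h>

/* Grep one locally stored fragment (e.g. input.3) into its output fragment. */
static int grep_fragment(const char *regexp, const char *in_path,
                         const char *out_path)
{
    regex_t re;
    if (regcomp(&re, regexp, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    FILE *in  = fopen(in_path, "r");
    FILE *out = fopen(out_path, "w");
    if (in == NULL || out == NULL) {
        if (in)  fclose(in);
        if (out) fclose(out);
        regfree(&re);
        return -1;
    }

    char line[4096];
    while (fgets(line, sizeof line, in) != NULL)
        if (regexec(&re, line, 0, NULL, 0) == 0)  /* line matches regexp */
            fputs(line, out);

    fclose(in);
    fclose(out);
    regfree(&re);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s regexp input-fragment output-fragment\n", argv[0]);
        return 1;
    }
    return grep_fragment(argv[1], argv[2], argv[3]) == 0 ? 0 : 1;
}

Because every worker reads and writes only the fragment stored on its own disks, the aggregate grep bandwidth scales with the number of filesystem nodes rather than with any single network link.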

Design of AIST Gfarm Cluster I
 Cluster node (high density and high performance)
  - 1U, dual 2.4 GHz Xeon, GbE
  - 480 GB RAID with 4 x 120 GB HDDs + 3ware RAID
  - 136 MB/s on writes, 125 MB/s on reads
 12-node experimental cluster (operational since Oct 2002)
  - 12U + GbE switch (2U) + KVM switch (2U) + keyboard + LCD
  - 6 TB RAID in total, with 48 disks
  - 1063 MB/s on writes, 1437 MB/s on reads
  - 410 MB/s for file replication with 6 streams (WAN emulation with NistNet)
[Figure: one cluster node (480 GB, 120 MB/s, 10 GFlops) attached to the GbE switch]

Grid Datafarm US-OZ-Japan Testbed
[Figure: testbed network map; sites: KEK, Titech, AIST, ICEPP (Japan), SDSC, Indiana Univ. (Indianapolis GigaPoP NOC), Melbourne (Australia); links: Tsukuba WAN 20 Mbps, SuperSINET 1 Gbps, GbE, APAN/TransPAC OC-12 POS and OC-12 ATM via the Tokyo NOC, PNWG, StarLight, US ESnet, NII-ESnet HEP PVC, OC-12]
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s

Grid Datafarm for a HEP application
Osamu Tatebe (Grid Technology Research Center, AIST), Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)
SC2002 High-Performance Bandwidth Challenge

Target Application at SC2002: FADS/Goofy
 Monte Carlo simulation framework with Geant4 (C++)
 FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
 Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
 CPU-intensive event simulation with high-speed file replication and/or distribution

Network and cluster configuration for the SC2002 Bandwidth Challenge
[Figure: network map; the Grid Cluster Federation booth (Force10 E1200) at SC2002, Baltimore connects to SCinet at 10 GE; US side: Indiana Univ. (Indianapolis GigaPoP NOC), SDSC, StarLight, PNWG, US ESnet, NII-ESnet HEP PVC; Japan side: KEK, Titech, AIST, ICEPP via the Tokyo NOC, APAN/TransPAC OC-12 POS and OC-12 ATM (271 Mbps), Tsukuba WAN 20 Mbps, SuperSINET 1 Gbps, GbE]
Total bandwidth from/to the SC2002 booth: Gbps
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s
Peak CPU performance: 962 GFlops

Parallel file replication: multiple routes in a single application!
[Figure: transpacific routes between Japan (AIST, Titech, U Tokyo, KEK, MAFFIN, Tokyo, Tsukuba WAN 20 Mbps) and the US backbone (Seattle, Chicago, San Diego/SDSC, Indiana Univ, SC2002 booth in Baltimore), with link speeds of 271 Mbps, 622 Mbps, 1 Gbps, and 10 Gbps]
 741 Mbps transpacific: a record speed
 SC2002 Bandwidth Challenge: Gbps using 12 nodes!

Network and cluster configuration
 SC2002 booth: 12-node AIST Gfarm cluster connected with GbE; connects to SCinet at 10 GE using a Force10 E1200
  - Performance in the LAN: network bandwidth 930 Mbps, file transfer bandwidth 75 MB/s (= 629 Mbps)
 GTRC, AIST: a 7-node AIST Gfarm cluster of the same design connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
 Indiana Univ: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
 SDSC: 8-node PC cluster connected with GbE; connects to the outside with OC-12
 TransPAC north and south routes
  - North route is the default; the south route is used by static routing for 3 nodes each in the SC booth and at AIST
  - RTT between AIST and the SC booth: north 199 ms, south 222 ms
  - The south route is shaped to 271 Mbps
[Figure: network configuration at the SC2002 booth; PC nodes on GbE behind the E1200, uplinked to SCinet at 10 GE]

Challenging points of TCP-based file transfer
 Large latency, high bandwidth (aka LFN, a long fat network)
  - Big socket size for a large congestion window
  - Fast window-size recovery after packet loss: High Speed TCP (Internet-Draft by Sally Floyd)
  - Network striping
 Packet loss due to real congestion
  - Transfer control
 Poor disk I/O performance
  - 3ware RAID with 4 x 120 GB HDDs on each node: over 115 MB/s (~1 Gbps network bandwidth)
  - Network striping vs. disk striping access: # streams, stripe size
 Limited number of nodes
  - Need to achieve maximum file transfer performance
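The "big socket size" point follows from the bandwidth-delay product: a single TCP stream can keep at most one window in flight per round trip, so sustaining a rate R over a path with round-trip time RTT needs a socket buffer (and congestion window) of at least R x RTT bytes. Below is a small illustrative calculation in C using the link rates and RTTs quoted on these slides; the bdp_bytes helper is not part of Gfarm, and in practice the chosen value would be applied with setsockopt(SO_SNDBUF / SO_RCVBUF).

/* Back-of-the-envelope socket-buffer sizing from the bandwidth-delay product.
 * The rates and RTTs are the ones quoted on these slides; the actual transfer
 * tool may size its buffers differently. */
#include <stdio.h>

/* Minimum buffering (bytes) needed to keep rate_mbps in flight over rtt_ms. */
static double bdp_bytes(double rate_mbps, double rtt_ms)
{
    return rate_mbps * 1e6 / 8.0 * rtt_ms / 1e3;
}

int main(void)
{
    /* One GbE stream between AIST and the SC booth, north route (RTT 199 ms). */
    printf("1 Gbps   x 199 ms -> %.1f MB\n", bdp_bytes(1000.0, 199.0) / 1e6);

    /* The 271 Mbps-shaped south route (RTT 222 ms), aggregated over its streams. */
    printf("271 Mbps x 222 ms -> %.1f MB\n", bdp_bytes(271.0, 222.0) / 1e6);

    return 0;
}

This is also one motivation for network striping: with N parallel streams, each socket needs only about 1/N of the path's bandwidth-delay product, and a single packet loss halves the window of one stream instead of the whole transfer.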

Bandwidth measurement of TransPAC with Iperf
 Northern route: 2 node pairs; southern route: 3 node pairs (100 Mbps streams)
 753 Mbps in total (10-sec average); peak bandwidth: 622 + 271 = 893 Mbps
[Figure: 10-sec and 5-min average bandwidth over time for the northern and southern routes]

Bandwidth measurement between the SC booth and the other sites with Iperf (1-min average)
[Figure: time series for Indiana Univ, SDSC, TransPAC North, and TransPAC South; the TransPAC northern route shows very high deviation due to a packet-loss problem on Abilene between Denver and Kansas City; one gap where 10 GE was not available; further variation due to evaluation of different configurations of the southern route]

File replication between the US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer (out of 893 Mbps, 10-sec average bandwidth)!
[Figure: 10-sec average bandwidth over time]

Parameters of US-Japan file transfer

  Parameter                    Northern route   Southern route
  Socket buffer size           610 KB           250 KB
  Traffic control per stream   50 Mbps          28.5 Mbps
  # streams per node pair      16 streams       8 streams
  # nodes                      3 hosts          1 host
  Stripe unit size             128 KB

  # node pairs   # streams           10-sec average BW   Transfer time (sec)   Average BW
  1 (N1)         16 (N16x1)          -                   -                     -
  2 (N2)         32 (N16x2)          419 Mbps            -                     -
  3 (N3)         48 (N16x3)          593 Mbps            -                     -
  4 (N3 S1)      56 (N16x3 + S8x1)   741 Mbps            -                     -

File replication performance between the SC booth and other US sites

SC02 Bandwidth Challenge Result
We achieved Gbps using 12 nodes! (outgoing Gbps, incoming Gbps)
[Figure: 10-sec, 1-sec, and 0.1-sec average bandwidth over time]

Summary
 Petascale data-intensive computing wave
  - Key technology: Grid and cluster
 Grid Datafarm is an architecture for
  - Online > 10 PB storage, > TB/s I/O bandwidth
  - Efficient sharing on the Grid
  - Fault tolerance
 Initial performance evaluation shows scalable performance
  - 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
  - 443 MB/s using 23 parallel streams on Presto III
  - 1063 MB/s and 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I
  - 410 MB/s using 6 parallel streams on AIST Gfarm I
 Metaserver overhead is negligible
 Gfarm file replication achieved Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
 A smart resource broker is needed!!

Special thanks to
 Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
 Philip Papadopoulos (SDSC, PRAGMA)
 Hisashi Eguchi (Maffin)
 Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
 Chris Robb (Indiana Univ, Abilene)
 Force 10 Networks, Inc.