
1 BNL Grid Projects

2 Outline
 Network/dCache
 USATLAS Tier 1 Network Design
 TeraPaths
 Service Challenge 3
 Service Challenge 4 Planning
 USATLAS OSG Configuration
 LCG 2 Status
 3D (Distributed Deployment of Databases) Project
 PHENIX Data Transfer (Non-USATLAS)

3 Network/dCache

4 Current Network Configuration
 How is (or was) our network configured for SC3? What performance did we observe? What adjustments did we make? How significant is (or has been) the firewall? How many servers, and of what kind, did we use for dCache?

5 Network in the Past
 SC3 throughput performance: the peak rate was 150 MB/s sustained for several hours, with an average transfer rate of about 120 MB/s. During the SC3 service phase we re-installed and re-tuned the dCache system and experienced some data transfer problems, but we could still maintain a transfer rate of around 100 MB/s for several hours.
 Adjustments made after the SC3 throughput phase:
  September: the dCache write pool disks were changed from RAID 0 to RAID 5 to add redundancy for precious data. The file system was switched to EXT3 because an XFS bug crashed the RAID 5-based disks. Performance was degraded for the following several weeks.
  December: we upgraded the OS of the dCache servers to RHEL 4.0 and redeployed XFS on the write pool nodes.
 We constantly hit a 1 Gbps performance bottleneck. We found excessive traffic between the door nodes (SW9) and the pool nodes (SW7). That traffic already ran over aggregated Ethernet channels (3×1 Gbps) between the two ATLAS switches, but the hashing algorithm kept sending it down one physical fiber, leading to an unbalanced load distribution (see the sketch below).
 We eventually relocated all dCache servers onto one network switch to avoid inter-switch traffic.
 We did not find any performance issues associated with the firewall, but the firewall does drop some packets between the two ATLAS subnets (130.199.48.0 and 130.199.185.0), which prevents job submission from the ATLAS grid gatekeeper to the Condor pool. This problem does not affect SC3 data transfer to the BNL dCache system.
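As a hedged illustration of the imbalance described above (a toy model only; the real switch hashing algorithm is vendor-specific and not documented here), a hash that effectively keys on fields constant for all door-to-pool traffic pins every flow onto one member of the 3×1 Gbps aggregated channel. All IP addresses below are hypothetical.

```python
# Toy model of member selection on a 3x1 Gbps aggregated Ethernet channel.
# Hypothetical example; the real hashing algorithm is vendor-specific.
import zlib
from collections import Counter

def pick_member(hash_key: str, n_links: int = 3) -> int:
    """Map a flow, identified by whatever fields the switch hashes, to a member link."""
    return zlib.crc32(hash_key.encode()) % n_links

doors = ["130.199.185.10", "130.199.185.11"]                 # hypothetical door IPs
pools = ["130.199.48.20", "130.199.48.21", "130.199.48.22"]  # hypothetical pool IPs

# If the hash uses the full source/destination IP pair, the six flows spread out:
per_ip_pair = Counter(pick_member(f"{d}->{p}") for d in doors for p in pools)

# But if it effectively keys on something constant for all door->pool traffic
# (e.g. the subnet or next-hop pair), every flow lands on the same fiber:
per_subnet_pair = Counter(pick_member("130.199.185.0->130.199.48.0") for _ in doors for _ in pools)

print(per_ip_pair)      # spread over members (exact split depends on the hash)
print(per_subnet_pair)  # all counts on one member -- the observed hot spot
```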

6 Current dCache Configuration
 dCache consists of write pool nodes, read pool nodes and core services (courtesy of Zhenping):
  PNFS core server: 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
  SRM server (door): 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
  GridFTP and DCAP core servers (doors): 4 nodes (dedicated), RHEL 4.0, Dell 3.0 GHz
  Internal/external read pool nodes: 322 (shared), 145 TB, SL3, mix of Penguin 3.0 GHz and Dell 3.4 GHz
  Internal/external write pool nodes: 8 (dedicated), 1 TB, RHEL 4.0, Dell 3.0 GHz
  Total: 336 nodes, 146 TB (re-derived in the sketch below)
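As a quick consistency check of the totals quoted above, the inventory can be restated as data (a sketch only; the counts and capacities are copied from this slide, not queried from a live dCache instance):

```python
# The slide-6 dCache inventory restated as data, with the totals re-derived.
inventory = {
    "PNFS core server":        {"nodes": 1,   "tb": 0},
    "SRM server (door)":       {"nodes": 1,   "tb": 0},
    "GridFTP/DCAP doors":      {"nodes": 4,   "tb": 0},
    "read pools (shared)":     {"nodes": 322, "tb": 145},
    "write pools (dedicated)": {"nodes": 8,   "tb": 1},
}
total_nodes = sum(v["nodes"] for v in inventory.values())
total_tb    = sum(v["tb"] for v in inventory.values())
print(total_nodes, total_tb)  # 336 nodes and 146 TB, as on the slide
```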

7 One BNL dCache Instance (architecture diagram): DCap, SRM and GridFTP doors in front of read and write pools, with the PnfsManager and PoolManager core services and an HPSS back end; control channels connect DCap, SRM and GridFTP clients (including the batch system) to the doors, while the data channels go directly to the pools.

8 Future Network/dCache Plan
 The design of the USATLAS Tier 1 network is shown in the following slides.
 The network bandwidth to the ACF will be 20 Gbps of redundant external bandwidth; the BNL to CERN connection is 10 Gbps.
 dCache should be expanded to accommodate LHC data. We try to avoid mixing LHC data traffic with the remaining ATLAS production traffic: either we create a dedicated dCache instance, or we dedicate a fraction of the dCache resources (a separate dCache write pool group) to LHC data transfer. Zhenping and I prefer a dedicated dCache instance, since the number of nodes in the BNL dCache managed by the current dCache technology is approaching its limit. In any case, over the next several months the LHC fraction of dCache should be able to handle 200 MB/s and hold one day's worth of data on disk (16.5 TB; see the check below). We need 20 TB of local disk space (20% of which will be used for RAID 5 redundancy).
 10 nodes, each with 2 TB of local disk.
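The "one day's worth of disk" figure is consistent with the 200 MB/s target; a minimal arithmetic check (a sketch only, assuming a sustained 200 MB/s and interpreting the slide's 16.5 "Tera" as binary terabytes):

```python
# One day of LHC data at the 200 MB/s target rate.
rate_mb_per_s   = 200
seconds_per_day = 24 * 3600

day_volume_tb  = rate_mb_per_s * seconds_per_day / 1e6      # decimal terabytes
day_volume_tib = rate_mb_per_s * seconds_per_day / 1024**2  # binary terabytes
print(f"{day_volume_tb:.1f} TB/day, i.e. {day_volume_tib:.1f} TiB/day")
# ~17.3 TB/day, ~16.5 TiB/day -- matching the 16.5 "Tera" quoted on the slide.
```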

9 USATLAS Tier 1 Network Design

10 Current Unsolved/Unsettled Issues
 LHCOPN does not address Tier 2 site issues. What is the policy on trusting non-US Tier 2 sites? We simplify the issue and treat these non-US Tier 2 sites as regular internet end points.
 LHCOPN includes T0, all T1 sites and their existing connections: all T0-BNL and other ATLAS T1-BNL traffic will be treated as LHCOPN traffic and can share the network resources provided by US LHCnet.
 If one Tier 1 goes down, its LHC traffic will be routed via another Tier 1 and will use a fraction of the network resources owned by that Tier 1. This type of traffic does not affect the BNL internal network design. The AUP should be negotiated between Tier 1 sites; this is not done yet.

11 User Scenarios
1. LHC data is transferred via LHCOPN from CERN to BNL. Data is transferred into dCache, then migrated into HPSS. A small fraction of the data will be read immediately by users at Tier 2s. (Volume_{LHC})
2. All Tier 2s upload their simulation/analysis data to the Tier 1 dCache. The data is immediately replicated within the dCache cluster and migrated into HPSS. (Volume_{Tier 2})
3. Physicists at Tier 3s read input data from the Tier 1 dCache read pool, run analysis/transformation at their home institutions, and upload the result data to the Tier 1 dCache write pool. The results are then immediately replicated into the dCache read pool and archived into HPSS. (Volume_{Physicists} = Volume_{Inputs} + Volume_{Results})
4. BNL owns a fraction of the ATLAS reconstruction data: ESD and AOD/TAG data. This data will be read from dCache and sent to other Tier 1 sites; similarly, BNL needs to read the same type of data from other Tier 1s. (Volume_{T1} = Volume_{in} + Volume_{out})
5. European Tier 2 sites (2+) need to read data from BNL; this traffic will be treated as regular internet traffic.
 The total data volume that we put on network links and backplanes (see the sketch below):
 Volume_{Total} = 2*Volume_{LHC} + 3*Volume_{Tier 2} + Volume_{Inputs} + 3*Volume_{Results} + Volume_{T1} + Volume_{Others}
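The multipliers in the formula count how many times each byte crosses a link or the backplane (for example, LHC data appears to be counted once arriving over the WAN and again when replicated/archived, hence the factor 2). A minimal sketch with purely illustrative daily volumes (hypothetical numbers, not measurements):

```python
# The slide's traffic formula with purely illustrative daily volumes in TB
# (hypothetical numbers, not measurements):
#   V_total = 2*V_LHC + 3*V_Tier2 + V_inputs + 3*V_results + V_T1 + V_others
v_lhc, v_tier2      = 17.0, 5.0
v_inputs, v_results = 2.0, 1.0
v_t1, v_others      = 8.0, 1.0

v_total = 2*v_lhc + 3*v_tier2 + v_inputs + 3*v_results + v_t1 + v_others
print(f"total volume carried on LAN links and backplanes: {v_total:.1f} TB/day")
```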

12 Requirements
 dCache couples the ACF subsystems (grid and computing cluster) even more closely, with the computing cluster serving as the data storage system. Data will be constantly replicated among them; any connection restriction (firewall conduits) between them will potentially impact functionality and performance.
 We should isolate internal ATLAS traffic within the ATLAS network domain.
 We need to optimize the network traffic volume between the BNL campus and the ACF.
 What fraction of the data (items 1, 2, 3, ...) are we going to filter through the firewall? Any traffic that we plan to firewall may double or triple the tax on the link between the BNL campus and the ACF.
 Any operational issue in the BNL campus network should not impact the ACF internal network traffic between different USATLAS subnets.
 We should not overload the BNL firewall with large volumes of physics data.

13 USATLAS/BNL LAN

14 Option 1 (network diagram): dCache, HPSS and the ACF farm sit behind the ATLAS DL2 router with ACL/policy routing; CERN LHCOPN traffic and internet/analysis traffic (US ATLAS Tier 2s, other Tier 1s) enter separately. All traffic between any two hosts in the ACF is routed or switched internally.

15 Option 2 (network diagram): the LHC/SC4 dCache, grid services, HPSS and the ACF farm sit behind the ATLAS DL2 router with ACL/policy routing; CERN LHCOPN traffic and internet traffic (USATLAS Tier 2s, other Tier 1s) enter separately. LHC data to HPSS is internal to ATLAS; it never leaves the ATLAS router.

16 Option 3 (network diagram): the LHC dCache, grid services, HPSS and the ACF farm sit behind the ATLAS DL2 router with ACL/policy routing and a single network cable; CERN LHCOPN traffic and internet traffic (USATLAS Tier 2s, other Tier 1s) enter separately. LHC data to HPSS is external to ATLAS; it leaves the ATLAS router.

17 Option 4 (network diagram): the LHC dCache, HPSS, the ACF farm and the grid system all sit behind DL2; CERN LHCOPN traffic, internet traffic and USATLAS Tier 2 traffic (plus other Tier 1s) are all routed via DL2. LHC data to HPSS is routed via DL2, so the traffic has to leave the ATLAS router.
 Disadvantages:
  All ATLAS traffic may double or triple the tax on the BNL/USATLAS link.
  All traffic is routed via DL2.
  Network management is not easy.
  The firewall becomes the bottleneck.
  Does not utilize the ATLAS routing capability.

18 TeraPaths

19 QoS/MPLS
 QoS/MPLS technology can be manually deployed into the BNL campus/USATLAS network now. The behavior is well understood and LAN QoS expertise is now on hand.
 The TeraPaths software system is under intensive re-development to approach production quality. It will be ready by the end of February. We will need one month (March) to verify it and deploy it into our production network infrastructure. When SC4 starts, we can quantitatively manage how the BNL LAN sends and receives data. The following month will focus on deploying the software package to the Tier 2 sites participating in SC4.

20 What Is TeraPaths?
 This project investigates the integration and use of LAN QoS and MPLS-based differentiated network services in the ATLAS data-intensive distributed computing environment, as a way to manage the network as a critical resource.
 The collaboration includes BNL and the University of Michigan, with other collaborators from OSCAR (ESnet), Lambda Station (FNAL), and the TeraPaths monitoring project (SLAC).

21 TeraPaths System Architecture (diagram): Site A (initiator) and Site B (remote) each run a route planner, scheduler, user manager, site monitor and router manager with hardware drivers; QoS requests arrive via a web page, APIs or the command line, and the sites coordinate through WAN web services and WAN monitoring.

22 TeraPaths at SC2005
 Two bbcp transfers periodically copied data from BNL disk to UMICH disk; one used class 2 traffic (200 Mbps) and the other used class EF (expedited forwarding, 400 Mbps), while iperf generated background traffic. The allocated network resource was 800 Mbps (see the sketch below).
 We could quantitatively control shared network resources for mission-critical tasks.
 We verified the effectiveness of MPLS/LAN QoS and its impact on prioritized traffic, background best-effort traffic, and overall network performance.
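A minimal sketch of the bandwidth bookkeeping in this demo, assuming the allocation and class rates quoted above (the interpretation that the remainder of the 800 Mbps is what the best-effort iperf traffic competes for is our reading, not stated on the slide):

```python
# Bandwidth bookkeeping for the SC2005 demo (rates restated from this slide).
allocation_mbps = 800
prioritized = {"class 2 (first bbcp)": 200, "EF (second bbcp)": 400}

reserved = sum(prioritized.values())
assert reserved <= allocation_mbps, "prioritized classes must fit the allocation"

remainder = allocation_mbps - reserved
print(f"reserved for prioritized traffic: {reserved} Mbps")
print(f"left within the allocation for best-effort iperf traffic: {remainder} Mbps")
```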

23 Service Challenge 3

24 What SC3 configuration (hardware, software, middleware) did we use?

25 Services at BNL
 FTS client + server (FTS 1.3) with its back-end Oracle and MyProxy servers.
  FTS does the job of reliable file transfer from CERN to BNL.
  Most functionality was implemented. It became reliable at controlling data transfer after several rounds of redeployment for bug fixing: a short timeout value causing excessive failures, and incompatibility with dCache/SRM.
  FTS does not support direct data transfer from CERN to the BNL dCache data pool servers (dCache SRM third-party data transfer). The data transfers actually go through a few dCache GridFTP door nodes at BNL, which presents a scalability issue; we had to move these door nodes to non-blocking network ports to distribute the traffic.
  Both BNL and RAL discovered that the number of streams per file could not be more than 10 (a bug?).
 Networking to CERN:
  The network for dCache was upgraded to 2×1 Gbps around June.
  Shared link with a long round-trip time: >140 ms, while the RTT from European sites to CERN is about 20 ms.
  Occasional packet losses were discovered along the BNL-CERN path.
  1.5 Gbps aggregate bandwidth observed by iperf with 160 TCP streams (see the estimate below).
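The need for so many parallel streams follows from the bandwidth-delay product of the 140 ms path. A rough estimate (a sketch; the per-stream TCP window value is an assumption chosen for illustration, not a measured setting):

```python
# Rough bandwidth-delay-product estimate for the BNL-CERN path.
rtt_s        = 0.140        # round-trip time quoted above (>140 ms)
target_gbps  = 1.5          # aggregate iperf result with 160 streams
window_bytes = 160 * 1024   # assumed effective per-stream TCP window (illustrative)

bdp_bytes = target_gbps * 1e9 / 8 * rtt_s
streams   = bdp_bytes / window_bytes
print(f"BDP ~ {bdp_bytes / 1e6:.0f} MB; ~{streams:.0f} streams of "
      f"{window_bytes // 1024} KB each are needed to keep the pipe full")
```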

26 Services Used at BNL for SC3
 dCache/SRM (v1.6.6.3, with the SRM 1.3 interface). The detailed configuration can be found in slide 6.
  All read pool nodes run Scientific Linux 3 with the XFS module compiled.
  We experienced high load on the write pool servers during large data transfers; this was fixed by replacing the EXT3 file systems with XFS.
  The core server crashed once; the reason was identified and fixed.
  Small buffer space (1.0 TB) for data written into the dCache system.
  dCache can now deliver up to 200 MB/s of input/output (limited by network speed).
 The LFC (1.3.4) client and server were installed on the BNL replica catalog server.
  The server was installed and the basic functionalities were tested: lfc-ls, lfc-mkdir, etc.
  We will populate LFC with the entries in our production Globus RLS server.
 The ATLAS VO Box (DDM + LCG VO box) was deployed at BNL.
 Two instances of the Distributed Data Management (DDM) software (DQ2) were deployed at BNL, one for Panda production and one for the SC3 service phase.

27 How did the SC3 infrastructure evolve?
 FTS was upgraded from 1.2 to 1.3.
 dCache was upgraded from 1.6.5 to 1.6.6.3 (Dec 7, 2005).
 The write pool file system was migrated from EXT3 to XFS before the SC3 throughput phase. After the throughput phase, we migrated the underlying disks from RAID 0 to RAID 5 for better reliability, but this triggered an XFS file system bug with RAID 5 disks and crashed the server. We had to switch back to the EXT3 file system, which avoided the bug but significantly reduced performance. The recent OS upgrade on the dCache write pool and core servers alleviated the XFS bug (it did not fix it), so we migrated back to XFS for better performance.
 The dCache software on the read pools was upgraded as well; the OS on the read pool nodes has not changed since the May/June upgrade.

28 BNL SC3 Data Transfer (monitoring plots): all data are actually routed through the GridFTP doors; the SC3 rates monitored at BNL and at CERN are consistent.

29 Data Transfer Status
 BNL stabilized FTS data transfer with a high successful completion rate, as shown in the left image from the throughput phase.
 We attained a 150 MB/s rate for about one hour with a large number (>50) of parallel file transfers during the SC3 throughput phase.

30 Final SC3 Throughput Data Transfer Results

31 Lessons Learned from SC2
 Four file transfer servers with a 1 Gigabit WAN network connection to CERN.
 Met the performance/throughput challenge (70-80 MB/s disk to disk).
 Enabled data transfer between BNL dCache/SRM and the CERN SRM at openlab.
  Designed our own scripts to control SRM data transfers.
 Enabled data transfer between BNL GridFTP servers and CERN openlab GridFTP servers controlled by the Radiant software.
 Many components needed to be tuned:
  Long round-trip time and a high packet drop rate mean we have to use multiple TCP streams and multiple concurrent file transfers to fill the network pipe.
  Sluggish parallel file I/O with EXT2/EXT3: many processes sit in I/O wait, and the more file streams, the worse the file system performance.
  Slight improvement with XFS; we still need to tune file system parameters.

32 Some Issues During the SC3 Throughput Phase
 A service challenge also challenges resources:
  We tuned the network pipes and optimized the configuration and performance of the BNL production dCache system and its associated OS and file systems.
  More than one staff member was required to stabilize the newly deployed FTS, dCache and network infrastructure.
  The staffing level decreased as the services became stable.
 Limited resources are shared by experiments and users:
  At CERN, the SC3 infrastructure is shared by multiple Tier 1 sites.
  Due to the heterogeneous nature of Tier 1 sites, data transfer for each site should be optimized individually based on the site's characteristics: network RTT, packet loss rates, experiment requirements, etc.
  At BNL, the network and dCache are also used by production users.
  We need to closely monitor SRM and the network to avoid impacting production activities.
 At CERN, James Casey alone handles answering email, setting up the system, reporting problems and running data transfers. He provides 16-hour, 7-day support by himself.
  How do we scale to 24x7 production support and a production center?
  How do we handle the time difference between the US and CERN?
  CERN support phone (tried once, but the operator did not speak English).

33 Some Issues During the SC3 Service Phase
 FTS was upgraded from version 1.3 to 1.4 at CERN. FTS 1.4 was supposed to support direct third-party transfers, but when direct data transfer into the pools (bypassing the doors) was used, it could not handle the long waits, which led to channel lockup. We therefore had to switch to glite-url-copy, which handles transfers into dCache in an ad-hoc way.
 dCache was continuously improved for better performance and reliability over the past several months, and has recently reached a stable state.
 The SC3 service phase exposed several problems when it started. We took the opportunity to find and fix them; performance and stability improved continuously over the course of SC3, and we were able to achieve high performance by the end. A good learning experience indeed.
 SC operations need to be improved for timely problem reporting.

34 What has been done
 The SC3 throughput phase showed good data transfer bandwidth.
 SC3 Tier 2 data transfer: data were transferred to three selected Tier 2 sites.
 SC3 tape transfer: tape data transfer was stabilized at 60 MB/s with loaned tape resources, meeting the goal defined at the beginning of the service challenge.
 The full chain of data transfer was exercised.
 SC3 service phase: we showed very good peak performance.

35 General View of SC3
 When everything ran smoothly, BNL got very good results: 100 MB/s.
 The middleware (FTS) is stable, but there were still many compatibility issues:
  FTS does not work effectively with the new version of dCache/SRM (version 1.3).
  We had to turn off FTS-controlled direct data transfer into the dCache pools, since numerous timeout errors completely blocked the data transfer channel.
 We need to improve SC operations, including performance monitoring and timely problem reporting, to prevent deterioration and allow quick fixes.
 We fixed many dCache issues after its upgrade, and tuned the dCache system to work under the FTS/ATLAS DDM system (DQ2).
 We achieved the best performance among the dCache sites that participated in the ATLAS SC3 service phase; 15 TB of data was transferred to BNL. Sites using the CASTOR SRM showed better performance.

36 SC3 re-run and SC4 Planning

37 SC3 Re-run
 We upgraded the BNL dCache core server OS to RHEL 4 and dCache to 1.6.6, starting Dec 7, 2005.
 We will add a few more dCache pool nodes if the software upgrades do not meet our expectations.
 FTS should be upgraded if the fix needed to prevent channel blocking is ready before the new year.
 LCG BDII needs to report the status of dCache and FTS (before Christmas).
 We would like to schedule a test period at the beginning of January for stability and scalability.
 Everything should be ready by January 9.
 The re-run will start on January 16.

38 What will our SC4 configuration look like (network, servers, software, etc.)?
 The physical network location for SC4 is shown in slide 15.
 We subscribed two subnets to LHCOPN (130.199.185.0/24 and 130.199.48.0/23). The current dCache instance will be on these two subnets; the new dCache instance for LHC/SC4 will be in 130.199.185.0/24 exclusively (see the sketch below).
 10 dCache write/read pool servers.
 4 door servers (RAL has already merged door nodes with pool nodes; we will evaluate whether that is doable at BNL).
 2 core servers (dCache PNFS manager and SRM server).
 The newest dCache production release: dCache 1.6.6.3+.
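A small sketch of how one might check which LHCOPN-subscribed subnet a host falls into, using only the two subnets quoted above; the host addresses are hypothetical examples:

```python
# Check which of the LHCOPN-subscribed subnets an address belongs to.
import ipaddress

lhcopn_subnets = [ipaddress.ip_network("130.199.185.0/24"),
                  ipaddress.ip_network("130.199.48.0/23")]

def lhcopn_subnet_of(host: str):
    """Return the subscribed subnet containing the host, or None if outside both."""
    addr = ipaddress.ip_address(host)
    return next((net for net in lhcopn_subnets if addr in net), None)

print(lhcopn_subnet_of("130.199.185.42"))   # hypothetical new LHC/SC4 pool node
print(lhcopn_subnet_of("130.199.49.17"))    # falls inside 130.199.48.0/23
print(lhcopn_subnet_of("130.199.50.1"))     # None: outside the subscribed subnets
```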

39 BNL Service Challenge 4 Plan
 Several steps are needed to set up hardware and services (choose, procure, start install, end install, make operational), starting in January and ending before the beginning of March:
  LAN, tape system.
  FTS, LFC, DDM, LCG VO boxes and other baseline services will be maintained under the agreed SLA and supported by the USATLAS VO.
  A dedicated LHC dCache/SRM write pool providing up to 17 TB of storage (24 hours' worth of data), to be done in sync with the LAN and WAN work.
 Deploy and strengthen the necessary monitoring infrastructure based on Ganglia, Nagios, MonALISA and LCG R-GMA (February).
 Drill for service integration (March): simulate network failures and server crashes, and how the support center will respond to the issues.
 Tier 0/Tier 1 end-to-end high-performance network operational: bandwidth, stability and performance.

40 BNL Service Challenge 4 Plan
 April 2006: establish stable data transfer at 200 MB/s to disk and 200 MB/s to tape.
 May 2006: disk and computing farm upgrades.
 July 1, 2006: stable data transfer driven by the ATLAS production system and ATLAS data management infrastructure between T0 and T1 (200 MB/s), and provision of services satisfying the SLA (service level agreement).
 Details of involving Tier 2s are being planned as well (February and March):
  Tier 2 dCache: the UC dCache needs to be stabilized and operational in February; UTA and BU need to have dCache in March.
  Baseline client tools should be deployed at Tier 2 centers.
  Baseline services should support Tier 1 to Tier 2 data transfer before SC4 starts.

41 3D project

42 Oracle Part
 Tier 0 – Tier 1: Oracle, using Oracle Streams replication.
 BNL joined the 3D replication testbed.
 Streams replication was set up between CERN and BNL successfully in October 2005.
 Several experiments foresee Oracle clusters for online systems.
 Focus on Oracle database clusters as the main building block for Tier 0 and Tier 1.
 Propose to set up pre-production services for March and full service after 6 months of deployment experience.

43 BNL 3D Oracle Production Schedule
 Dec 2005: hardware setup (done) — two nodes with 500 GB of Fibre Channel storage.
 Jan 2006: hardware acceptance tests, RAC (Real Application Clusters) setup.
 March 2006: service starts.
 May 2006: service review → hardware defined for full production.
 September 2006: full database service in place.

44 MySQL Database Replication at BNL
 Oracle – MySQL replication:
  Database: ATLAS TAG DB.
  DB server at BNL: dbdevel2 (MySQL 4.0.25).
  Use case: Oracle at CERN to MySQL at BNL (push).
  Tool: Octopus replicator (Java-based extraction, transformation and loading).
  Thanks to Julius Hrivnac (LAL, Orsay) and Kristo Karr (ANL) for a successful collaboration.
 More details in the Twiki: https://uimon.cern.ch/twiki/bin/view/Atlas/DatabaseReplication

45 MySQL Database Replication at BNL
 MySQL – MySQL replication:
  Databases: the geometry DB ATLASDD, and the MySQL conditions DBs LArNBDC2 and LArIOVDC2.
  MySQL DB servers at BNL: dbdevel1.usatlas.bnl.gov (MySQL 4.0.25) and db1.usatlas.bnl.gov (MySQL 4.0.25).
 We collected our first experience with CERN-BNL ATLAS DB replication, using a procedure based on both mysqldump and on-line replication.
 The current versions correspond to the most recent ATLAS production release, 11.0.3.

46 LCG 2 at BNL

47 Summary
 The LCG setup at BNL is partially functional. The LCG VO box was used in SC3. There are no technical difficulties/hurdles preventing the CE and SE from becoming fully functional.
 Deployed on a mix of hardware (Dell 3.0 GHz and some VA Linux nodes): we deployed a CE, RB, SE, proxy server, monitoring node (R-GMA), and a collection of worker nodes. Some services are combined onto a single server.

48 Progress and To Do
 OS and LCG system installation and configuration are automated; the system can be reinstalled on new hardware within 2 hours.
 Managed via RPM and updated via local YUM repositories, which are automatically rebuilt from CERN and other upstream sources.
 GUMS controls the LCG grid-mapfile.
 Site information is being published correctly, and some SFTs (site functional tests) run from CERN operations complete successfully.
 We still need to configure LCG to run Condor jobs on the ATLAS pool.

49 BNL USATLAS Grid Testbed

50 BNL USATLAS OSG Configuration (diagram): grid users submit grid job requests over the internet to the OSG gatekeepers, which feed the RHIC/USATLAS job scheduler (Condor) and dCache; storage is provided by NFS, Panasas and local disks, with SRM/GridFTP servers in front and HPSS (via the HPSS movers) behind.

51 PHENIX Data Transfer Activities

52 Courtesy of Y. Watanabe


54 Data Transfer to CCJ
 The 2005 RHIC run ended on June 24; the plot above shows the last day of the RHIC run.
 The total data transferred to CCJ (the Computing Center in Japan) is 260 TB (polarized p+p raw data).
 100% of the data was transferred via the WAN; the tool used was GridFTP. No 747 involved.
 Average data rate: 60-90 MB/s; peak performance: 100 MB/s recorded in the Ganglia plot! About 5 TB/day (see the check below)!
 Courtesy of Y. Watanabe
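A quick check that the quoted rates and the "about 5 TB/day" figure are mutually consistent (a sketch using only the numbers on this slide):

```python
# Daily volume implied by the sustained PHENIX -> CCJ transfer rates.
for rate_mb_s in (60, 90, 100):
    tb_per_day = rate_mb_s * 86400 / 1e6
    print(f"{rate_mb_s} MB/s sustained -> {tb_per_day:.1f} TB/day")
# 60 MB/s gives ~5.2 TB/day, consistent with "about 5 TB/day";
# the 100 MB/s peak would be ~8.6 TB/day if sustained.
```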

55 Network Monitoring on the NAT Box

56 (monitoring plot; axis label: month and year)

57 Network Monitoring at the Perimeter Router

58 Network Monitoring at CCJ, Japan

59 Our Role
 Provide effective and efficient network/grid solutions for data transfer.
 Install grid tools on the PHENIX buffer boxes.
 Tune the performance of the network path along the PHENIX counting house/RCF/BNL LAN.
 Install Ganglia monitoring tools for the data transfer.
 Diagnose problems and provide fixes.
 For future PHENIX data transfers we will continue to play these roles. We will integrate dCache/SRM into future data transfers and automate them.
 Ofer maintains the PHENIX dCache/SRM pools. He is working on pilot data transfers from the PHENIX dCache/SRM to CCJ.

60 Lessons Learned
 Four monitoring systems — the BNL NAT Ganglia, the router MRTG (Multi Router Traffic Grapher), the CCJ Ganglia, and the data transfer monitoring — caught errors at an early stage.
 The EXT3 file system is not designed for high-performance data transfer.
 XFS has much better high-bandwidth disk I/O performance; this experience was applied in LHC Service Challenge 3 for the ATLAS experiment.
 The Broadcom BCM95703 copper gigabit network card has far fewer packet errors than the Intel Pro/1000.
 There were several ESnet/SINET network outages, and traffic was rerouted to alternative paths. Problems were promptly discovered and resolved by on-call personnel and network engineers. Because of the large disk caches at both ends, no data were lost due to the network outages.

