
1 DØ Data Handling Operational Experience
GridPP8, Sep 22-23, 2003
Rod Walker, Imperial College London
Roadmap of Talk:
– Computing Architecture
– Operational Statistics
– Challenges and Future Plans
– Regional Analysis Centres
– Computing activities
– Summary

2 DØ computing/data handling/database architecture
[Diagram: the fnal.gov site, connected via Startap Chicago to remote Monte Carlo production sites: Great Britain 200, Netherlands 50, France 100, Texas 64, Czech Republic 32 (all Monte Carlo production)]
– Online systems: DEC4000 hosts d0ola,b,c, L3 nodes, RIP, data logger, collector/router (a, b, c)
– Robotic tape storage: ADIC AML/2 and STK 9310 Powderhorn, with ENSTORE movers
– Network: CISCO switches (a: production, c: development), fiber to the experiment
– Reconstruction farm: 300+ dual PIII/IV Linux nodes
– Central analysis: SGI Origin2000 with 128 R12000 processors and 27 TB of fiber channel disk
– Central Analysis Backend (CAB): 160 dual 2 GHz Linux nodes, 35 GB cache each
– ClueDØ: Linux desktop user cluster of 227 nodes in the experimental hall/office complex
– Database and other servers: SUN 4500 (d0ora1), Linux quad (d0lxac1), Linux (d0dbsrv1), plus other UNIX hosts

3 SAM Data Management System
SAM is Sequential data Access via Meta-data, est. 1997. http://d0db.fnal.gov/sam
– Flexible and scalable distributed model
– Field-hardened code; reliable and fault tolerant
– Adapters for mass storage systems: Enstore (HPSS and others planned)
– Adapters for transfer protocols: cp, rcp, scp, encp, bbftp, GridFTP (see the illustrative sketch below)
– Useful in many cluster computing environments: SMP with compute servers, desktop, private network (PN), NFS shared disk, ...
– Ubiquitous for DØ users
SAM Station: 1. the collection of SAM servers which manage data delivery and caching for a node or cluster; 2. the node or cluster hardware itself.
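The transfer-protocol adapters lend themselves to a small illustration. The following is a minimal sketch of the adapter idea only, with hypothetical class and function names; it is not SAM's actual code or API, and only the trivial local-copy and scp cases are shown.

```python
# Illustrative protocol-adapter registry in the spirit of SAM's transfer
# adapters (cp, rcp, scp, encp, bbftp, GridFTP). All names here are
# hypothetical, not part of the real SAM code base.
import shutil
import subprocess


class TransferAdapter:
    """Common interface every transfer protocol must implement."""

    def fetch(self, source: str, destination: str) -> None:
        raise NotImplementedError


class CpAdapter(TransferAdapter):
    """Local or NFS-visible files: plain copy."""

    def fetch(self, source: str, destination: str) -> None:
        shutil.copy(source, destination)


class ScpAdapter(TransferAdapter):
    """Remote files reachable over ssh."""

    def fetch(self, source: str, destination: str) -> None:
        subprocess.run(["scp", source, destination], check=True)


ADAPTERS = {"cp": CpAdapter(), "scp": ScpAdapter()}


def deliver(protocol: str, source: str, destination: str) -> None:
    """Deliver one file to the local cache using the configured protocol."""
    ADAPTERS[protocol].fetch(source, destination)
```

Supporting another protocol then only means registering another adapter, which is what lets the same station logic run on SMP servers, desktops, and private networks alike.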

4 Overview of DØ Data Handling
Summary of DØ Data Handling:
– Registered users: 600
– Number of SAM stations: 56
– Registered nodes: 900
– Total disk cache: 40 TB
– Number of files (physical): 1.2M
– Number of files (virtual): 0.5M
– Robotic tape storage: 305 TB
[Map: SAM deployment worldwide, marking Regional Centers and Analysis sites]
[Plots: Integrated Files Consumed vs Month and Integrated GB Consumed vs Month (DØ), Mar 2002 to Mar 2003: 4.0M files and 1.2 PB consumed]

5 Data In and Out of Enstore (robotic tape storage)
[Plot: daily data in and out of Enstore, Aug 16 to Sep 20: 5 TB outgoing, 1 TB incoming; an annotation marks where the shutdown starts]

6 Consumption
Applications "consume" data. In the DH system, consumers can be hungry or satisfied; by allowing for the consumption rate, the next course is delivered before it is asked for (a toy prefetching sketch follows below).
– 180 TB consumed per month
– 1.5 PB consumed in one year
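A toy model of that "next course before asking" behaviour, assuming a single consumer with a steady rate; the function and its parameters are invented for illustration and are not the real station logic.

```python
# Keep deliveries ahead of a steady consumer so the "next course" is already
# cached when it is requested. Purely illustrative.
import time
from collections import deque


def prefetch_loop(pending: deque, deliver, secs_per_file: float,
                  delivery_latency: float, margin: int = 2) -> None:
    """Stage files far enough ahead to hide the delivery latency."""
    staged = 0
    while pending:
        # Files the consumer will finish while one delivery is in flight,
        # plus a safety margin.
        lead = int(delivery_latency / secs_per_file) + margin
        while staged < lead and pending:
            deliver(pending.popleft())   # start moving the next file into cache
            staged += 1
        time.sleep(secs_per_file)        # consumer finishes one file
        staged -= 1


# e.g. prefetch_loop(deque(["f1", "f2", "f3"]), print, 0.1, 0.3)
```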

7 Challenges
Getting SAM to meet the needs of DØ in its many configurations is, and has been, an enormous challenge. Some examples include:
– File corruption issues: solved with CRC checks (a sketch follows below).
– Preemptive distributed caching is prone to race conditions and log jams, or "gridlock"; these have been solved.
– Private networks sometimes require "border" services; this is understood.
– An NFS shared-cache configuration provides additional simplicity and generality, at the price of scalability (star configuration); this works.
– Global routing completed.
– Installation procedures for the station servers have been quite complex; they are improving and we plan to soon have "push button" and even "opportunistic deployment" installs.
– Lots of details with opening ports on firewalls, OS configurations, registration of new hardware, and so on.
– Username clashing issues; moving to GSI and Grid certificates.
– Interoperability with many mass storage systems.
– Network-attached files: sometimes the file does not need to move to the user.
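For the file-corruption point, a hedged sketch of the kind of CRC check involved: compute a streaming CRC32 over the delivered file and compare it with the value recorded in the file catalogue. These helpers are assumptions for illustration; the actual SAM implementation may differ in detail.

```python
# Verify a delivered file against a catalogued checksum to catch corruption
# introduced in transit. Hypothetical helpers, not SAM code.
import zlib


def crc32_of(path: str, chunk_size: int = 1 << 20) -> int:
    """Streaming CRC32 so multi-GB data files never need to fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def verify_transfer(path: str, expected_crc: int) -> bool:
    """True if the delivered file matches the catalogued value."""
    return crc32_of(path) == expected_crc
```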

8 RAC: Why Regions are Important
1. Opportunistic use of ALL computing resources within the region
2. Management of resources within the region
3. Coordination of all processing efforts is easier
4. Security issues within the region are similar: CAs, policies, ...
5. Increases the technical support base
6. Speak the same language
7. Share the same time zone
8. Frequent face-to-face meetings among players within the region
9. Physics collaboration at a regional level contributes to results at the global level
10. A little spirited competition among regions is good

9 Summary of Current & Soon-to-be RACs
Columns: RAC, IACs, CPU (Total*), Disk (Total*), Archive (Total*), Schedule
– GridKa @FZK; IACs: Aachen, Bonn, Freiburg, Mainz, Munich, Wuppertal; CPU 52 GHz (518 GHz); disk 5.2 TB (50 TB); archive 10 TB (100 TB); established as RAC
– SAR @UTA (Southern US); IACs: AZ, Cinvestav (Mexico City), LA Tech, Oklahoma, Rice, KU, KSU; CPU 160 GHz (320 GHz); disk 25 TB (50 TB); Summer 2003
– UK @tbd; IACs: Lancaster, Manchester, Imperial College, RAL; CPU 46 GHz (556 GHz); disk 14 TB (170 TB); archive 44 TB; active, MC production
– IN2P3 @Lyon; IACs: CCin2p3, CEA-Saclay, CPPM-Marseille, IPNL-Lyon, IRES-Strasbourg, ISN-Grenoble, LAL-Orsay, LPNHE-Paris; CPU 100 GHz; disk 12 TB; archive 200 TB; active, MC production
– DØ @FNAL (Northern US); farm, CAB, ClueDØ, Central-analysis; CPU 1800 GHz; disk 25 TB; archive 1 PB; established as CAC
*Numbers in parentheses are totals for the center or region; other numbers are DØ's current allocation.

10 UK RAC
[Diagram: global file routing among Manchester, Lancaster, LeSC, Imperial (CMS), RAL (3.6 TB) and the FNAL MSS (25 TB)]
– FNAL throttles transfers
– Direct access unnecessary (firewalls, policies, ...)
– Configurable, with fail-overs (routing sketch below)
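A minimal sketch of what "configurable, with fail-overs" might look like as a routing table: each destination lists routes in preference order and falls back to the next when a transfer fails. The site names, table layout and transfer callable are assumptions for illustration, not the real configuration.

```python
# Ordered routes per destination; a failed transfer triggers fail-over to the
# next route. Illustrative only.
ROUTES = {
    "manchester": [("ral-cache", "scp"), ("fnal-mss", "bbftp")],
    "lancaster":  [("ral-cache", "scp"), ("fnal-mss", "bbftp")],
}


def route_file(destination: str, filename: str, transfer) -> str:
    """Try each configured route in order; return the source that worked."""
    last_error = None
    for source, protocol in ROUTES[destination]:
        try:
            transfer(source, destination, filename, protocol)
            return source
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"all routes to {destination} failed") from last_error
```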

11 From RAC’s to Riches Summary and Future We feel that the RAC approach is important to more effectively use remote resources Management and organization in each region is as important as the hardware. However… –Physics group collaboration will transcend regional boundaries –Resources within each region will be used by the experiment at large (Grid computing Model) –Our models of usage will be revisited frequently. Experience already indicates that the use of thumbnails differs from that of our RAC model (skims). –No RAC will be completely formed at birth. There are many challenges ahead. We are still learning…

12 Stay Tuned for SAM-Grid
The best is yet to come...

13 CPU-intensive activities
– Primary reconstruction: on-site, with local help to keep up.
– MC production: anywhere; no input data needed.
– Re-reconstruction (reprocessing): must be fast to be useful; use all resources. First on SAMGrid.
– Thumbnail skims: one per physics group. A common skim, the OR of the group skims, ends up with all events if the triggers are good, which defeats the object, i.e. small datasets (toy calculation below).
– User analysis: not a priority (CAB can satisfy demand).
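The common-skim point can be made quantitative with a toy calculation: if each group skim keeps a sizeable, roughly independent fraction of the events, the OR of the skims keeps nearly all of them. The fractions below are made-up numbers chosen only to illustrate the effect.

```python
def union_fraction(skim_fractions):
    """Fraction of all events kept by the OR of independent skims."""
    survive_none = 1.0
    for f in skim_fractions:
        survive_none *= (1.0 - f)   # event passes none of the skims
    return 1.0 - survive_none


# e.g. eight groups each keeping ~25% of events:
print(union_fraction([0.25] * 8))   # ~0.90, so the common "skim" is hardly a skim
```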

14 Current Reprocessing of DØ Run II Data
Why now, and why fast?
– Improved tracking for the Spring conferences.
– Tevatron shutdown, so the reconstruction farm can be included.
Reprocess all Run II data:
– 40 TB of DST data
– 40k files (the basic unit of data handling)
– 80 million events (back-of-envelope arithmetic below)
How:
– Many sites in the US and Europe, including the UK RAC.
– qsub initially, but the UK will lead the move to SAMGrid.
– Nikhef (LCG).
– Will gather statistics and report.
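Back-of-envelope arithmetic for the numbers on this slide. The daily event throughput used for the turnaround estimate is an assumed figure, not one reported in the talk.

```python
total_tb = 40            # DST data to reprocess
n_files = 40_000         # files (basic unit of data handling)
n_events = 80_000_000    # events

print(total_tb * 1e6 / n_files, "MB per file on average")    # ~1000 MB
print(n_events / n_files, "events per file on average")      # 2000
# If the combined sites managed, say, 2 million events per day (assumption):
print(n_events / 2_000_000, "days to reprocess everything")  # 40
```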

15 Runjob and SAMGrid
Runjob workflow manager:
– Maintained by Lancaster; the mainstay of DØ MC production.
– No difference between MC production and data (re)processing.
SAMGrid integration:
– Was done for GT 2.0, e.g. the Tier-1A via an EDG 1.4 CE.
– Job bomb: one grid job maps to many local batch-system (BS) jobs, i.e. the job has structure (fan-out sketched below).
– Options: request a 2.0 gatekeeper (0 months), write custom Perl jobmanagers (2 months), or use DAGMan to absorb the structure (3 months).
– Pressure to use grid-submit; want 2.0 for now.
4 UK sites, 0.5 FTEs: we need to use SAMGrid.
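The "job bomb" fan-out can be sketched as one grid-level request expanding into many local batch submissions, here via qsub. The worker script name and the environment-variable convention are assumptions for illustration; the real SAMGrid jobmanagers also handle staging, bookkeeping and recovery.

```python
# Expand one grid job into N local batch jobs, one per segment of the request.
import subprocess


def explode_grid_job(worker_script, n_segments):
    """Submit one qsub job per segment and return the batch job ids."""
    job_ids = []
    for segment in range(n_segments):
        result = subprocess.run(
            ["qsub", "-v", f"SEGMENT={segment}", worker_script],
            capture_output=True, text=True, check=True,
        )
        job_ids.append(result.stdout.strip())
    return job_ids
```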

16 Conclusions
– SAM enables PB-scale HEP computing today.
– Details are important in a production system: PNs, NFS, scaling, cache management (free space = zero, always), gridlock, ...
– Official and semi-official tasks dominate CPU requirements: reconstruction, reprocessing, MC production, skims. By definition these are structured and repeatable, which is good for the grid.
– User analysis runs locally (it still needs DH) or centrally. (Still a project goal, just not mine.)
– SAM experience is valuable; see the report on reprocessing.
– Has LCG seen how good it is?

