
1 ATLAS Report 14 April 2010 RWL Jones

2 The Good News
• At the IoP meeting before Easter, Dave Charlton said the best thing about the Grid was that there was nothing to say about it
• It is good that he thinks so!
• But it is also a little optimistic!
• Still, stick with the good news for now…
• People are doing real work on the Grid, and in large numbers

3 Data throughput
[Plot: total ATLAS data throughput via the Grid (Tier-0, Tier-1s, Tier-2s), in MB/s per day, annotated with beam splashes, first collisions, cosmics and the end of data-taking]
• ATLAS was pumping data out at a high rate, despite the low luminosity
• We are limited by trigger rate and event size
• We also increased the number of first data replicas
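Why the trigger rate and event size set the ceiling can be seen from a back-of-envelope estimate; the numbers in the sketch below are illustrative placeholders, not the actual 2010 ATLAS parameters.

```python
# Back-of-envelope only: the trigger rate, event size and replica count are
# illustrative placeholders, not the real 2010 ATLAS figures.
trigger_rate_hz = 200       # assumed physics trigger rate (events/s)
raw_event_size_mb = 1.6     # assumed RAW event size (MB)
n_first_replicas = 2        # assumed number of first-pass replicas exported

export_rate_mb_s = trigger_rate_hz * raw_event_size_mb * n_first_replicas
print(f"Sustained export rate needed: ~{export_rate_mb_s:.0f} MB/s")
```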

4 900 GeV reprocessing
• We are reprocessing swiftly and often
• We need prompt and continuous response
• This dataset is tiny compared with what is to come!

5 Reprocessed data to the UK
• Only one outstanding data-transfer issue for the UK by the end of the Christmas period

6 Tier 2 Nominal Shares 2010

7 Event data flow to the UK
• Data distribution has been very fast despite the event size
• Most datasets are available at Tier-2s within a couple of hours

8 High Energy Data Begins
[Plots: UK T2 throughput (MB/s), UK T2 transfer volume (GB), UK T1 throughput (MB/s)]

9 Analysis
• This is going generally well
• Previously recognised good sites are generally doing well
• Workload is uneven, depending on the data distribution
  – This is not yet in ‘equilibrium’ as the data are sparse
  – Remember, datasets for specific streams go to specific sites, so usage reflects the sites hosting minimum-bias samples
  – However, the fact that UK sites were favoured (e.g. Glasgow) also reflects good performance
• The 7 TeV data will move things closer to equilibrium

10 Data Placement
• There are issues in the current ATLAS distribution
  – There should be a full set in each cloud
  – This has not always happened, because of a bug
  – We need to be more responsive to site performance & capacity
  – At the moment, the UK has been patching in extra copies ‘manually’
• ATLAS has followed UK ideas and has introduced ‘primary’ and ‘secondary’ copies of datasets
  – Secondary copies live only while space permits
  – This improves access – the UK typically has 1.6 copies

11 The UK and Data Placement
• The movement and placement of data must be managed
  – Overload of the data management system slows it down for everyone
  – Unnecessary multiple copies waste disk space and will prevent a full set from being available
  – Some multiple copies are a good idea, to balance loads
• We have a group for deciding the data placement:
  – UK Physics Co-ordinator, UK deputy spokesman, Tony Doyle (UK data ops), Roger Jones (UK ops), aided by Stewart, Love & Brochu
  – The UK Physics Co-ordinator consults the institute physics reps
• The initial data plan follows the matching of trigger type to site from previous exercises
• We will make secondary copies until we run short of space, and then the secondary copies will be removed *at little notice* (a sketch of this policy follows below)
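The primary/secondary policy can be pictured as a simple space-triggered clean-up. This is a conceptual sketch only, with invented class and field names; the real clean-up is done centrally by the ATLAS distributed data management system.

```python
# Conceptual sketch of the primary/secondary copy policy described above.
# All names and the least-recently-used ordering are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class DatasetCopy:
    name: str
    size_tb: float
    primary: bool        # primary copies are pinned; secondary copies are expendable
    idle_days: int       # time since the copy was last accessed

def free_space(copies, needed_tb):
    """Delete least-recently-used secondary copies until enough space is freed."""
    freed = 0.0
    for copy in sorted((c for c in copies if not c.primary),
                       key=lambda c: c.idle_days, reverse=True):
        if freed >= needed_tb:
            break
        copies.remove(copy)      # secondary copies go "at little notice"
        freed += copy.size_tb
    return freed
```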

12 But dataset x is not in the UK
• In general, this should not be the case, unless it is RAW
• Access it elsewhere (unless it is RAW or a less popular ESD)
  – The job goes to the data, not the data to you
• We can copy small amounts to the UK on request
  – E.g. my Higgs candidate in RAW or ESD
• But we must manage it – specify:
  – What the need for the data is (activity, which physics and performance group)
  – Why it is not already covered by a physics or performance group area
  – How big it will be *at a maximum*
  – How the data will be used (what sort of job will be run, database access etc.)
• We are still surprised to see requests to copy datasets to ‘their’ Tier 2 when they are already freely available on the Grid in the UK
• Local requests should go to Tier 3 (non-GridPP) space
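The points a request must cover could be captured in a simple structured form; the field names and values below are illustrative only, not an official ATLAS or GridPP request format.

```python
# Illustrative shape of a dataset-replication request covering the points
# listed above; field names and values are invented examples.
replication_request = {
    "dataset":         "data10_7TeV.*.physics_MinBias.*.ESD.*",  # pattern shown as an example only
    "activity":        "minimum-bias performance studies",
    "group":           "Standard Model / tracking performance",
    "why_not_covered": "candidate events needed beyond the group-area selection",
    "max_size_gb":     50,                                        # hard upper bound
    "usage":           "local Athena jobs with conditions DB access",
    "destination":     "Tier-3 (non-GridPP) space",
}
```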

13 Site responsibilities
• Sites are either
  – Supporting important datasets, or
  – Supporting physics groups
• Reliability is vital
  – We need to be in the high 90s at least
  – This means paying a lot of attention to ATLAS monitoring and not just to SAM tests
  – The switch to a SAM Nagios-based system is potentially useful, but there are many bugs to be ironed out
  – Sites simply have to look pro-actively at the ATLAS dashboards (blame this on the infrastructure people again)
• We are reviewing middleware, but the sites must play their part
  – Local monitoring is important
  – It should not be users who spot site problems first!
  – Sites must also look at ATLAS monitoring, not just SAM tests – they are not enough
  – ATLAS is working to help with this…
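A site wanting to spot problems before its users do could poll its own analysis efficiency regularly. The endpoint, JSON fields and threshold below are purely hypothetical placeholders; the real ATLAS dashboard URLs and formats are not given in these slides.

```python
# Purely hypothetical sketch of a pro-active local check; the URL, JSON
# fields and threshold are placeholders, not the real ATLAS dashboard API.
import json
import urllib.request

SITE = "UKI-EXAMPLE-T2"   # hypothetical site name
URL = f"https://dashboard.example.invalid/site/{SITE}/analysis_efficiency.json"

with urllib.request.urlopen(URL) as resp:
    status = json.load(resp)

if status["efficiency"] < 0.90:   # "high 90s at least" is the target
    print(f"WARNING: {SITE} analysis efficiency below target - investigate now")
```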

14 Monitoring & Validation  ATLAS is working to improve the monitoring  Learn more from the user jobs:  We focus on “active” probing of the sites.  But “passive” yet automatic observation of the user jobs would lead to a better understanding of what is happening at the sites.  The current ADC metrics for analysis are the Hammer Cloud tests using the GangaRobot  These tests are heavy but fairly reliable  Reflect the computing model and needs in data-taking era  Reminder:  About 55% of CPU for ATLAS-wide analysis  About 100% of disk for ATLAS-wide analysis  About 0% of either for local use!

15 GangaRobot Today
• ~8 tests per site per day with a mix of:
  – A few different releases
  – Different access modes
  – MC and real data
  – Conditions DB access
• All are defined/prepared by Johannes Elmsheuser
• Results are on the GR web pages and in SAM
  – Non-critical; sites usually ignore them
• Auto-blacklisting on EGEE/WMS
• 2x daily email report sent to DAST containing:
  – Sites with failures
  – Sites with no job submitted (brokerage error, e.g. no release)
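A GangaRobot functional test is essentially a small, automatically submitted Ganga analysis job. The sketch below is a best-effort recollection of the Ganga/GangaAtlas API of the time (Job, Athena, DQ2Dataset, Panda); the exact attribute names may differ, so treat it as illustrative rather than as the actual GangaRobot code.

```python
# Run inside a Ganga session, where Job, Athena, DQ2Dataset, DQ2OutputDataset
# and Panda are provided by the GangaAtlas plugins. Attribute names are
# best-effort recollections of the 2010-era API and may differ in detail.
j = Job(name='gangarobot-style-test')
j.application = Athena()                      # a small Athena analysis job
j.application.atlas_release = '15.6.9'        # one of the few releases cycled through
j.inputdata = DQ2Dataset()
j.inputdata.dataset = ['data09_900GeV.*.physics_MinBias.*ESD*']   # real data; MC is also tested
j.outputdata = DQ2OutputDataset()
j.backend = Panda()                           # EGEE/WMS submission is exercised as well
j.submit()
```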

16 ATLAS Validation – GR2
• New tool, GR2, under development to validate sites
• Lighter load on sites – GR2 is HammerCloud in ‘gentle mode’
• Concept of test templates (release, analysis, dataset pattern, [sites])
  – Defined by ADC
  – Still has bugs
• Installations need to be clearly defined and installed
• Test samples need to be in place
• This will almost certainly be the framework for future metrics
  – The metrics themselves require more experience to define
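The slide describes a GR2 test template as (release, analysis, dataset pattern, optional site list). One plausible way to represent that, with invented field names, since the real GR2 schema is not shown here:

```python
# Illustrative representation of a GR2 test template; the structure and
# field names are assumptions, not the actual GR2/HammerCloud schema.
gr2_template = {
    "release":    "15.6.9",
    "analysis":   "example_dijet_analysis",            # hypothetical analysis name
    "ds_pattern": "data10_7TeV.*.physics_MinBias.*AOD*",
    "sites":      None,   # None = let ADC broker the test to all candidate sites
}
```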

17 Installations
• Our sites have appeared to be ‘off’ because of missing releases
• ATLAS central is also slow at responding to problems with non-CERN set-ups
• A major clean-up is underway
• An auto-retry installer is under development (a generic sketch of the idea follows below)
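The auto-retry installer is only mentioned, not described; the generic retry loop below illustrates the idea, with an invented install command as a placeholder.

```python
# Generic retry sketch of the kind an auto-retry release installer needs;
# "install-atlas-release" is an invented placeholder command.
import subprocess
import time

def install_release(release, attempts=3, wait_s=600):
    """Try to install a release, retrying after transient failures."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["install-atlas-release", release])
        if result.returncode == 0:
            return True
        print(f"Attempt {attempt} for {release} failed; retrying in {wait_s}s")
        time.sleep(wait_s)
    return False
```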

18 PANDA & WMS  There are now two distinct groups of users  Those who use the PANDA back-end  Those who use the WMS  There is less monitoring of the WMS, and less control  Some (e.g. Graeme) favour a tight focus on the PANDA approach  I am not sure this is possible  However, ATLAS clearly has more feedback and more control if this route is taken  Do not be surprised!

19 Middleware
• Sites cannot be made 100% reliable with the current middleware
• Many options are being considered
  – In particular, data management may be reduced from 3 layers to 2
  – If so, this would effectively remove the LFC
  – Radical options are also being considered
• BUT ATLAS is involved in recognising the limitations of today’s system and in making it work

20 Conclusion
• We are now finally dealing with real data
• We are still learning
• We must all work hard to make things work
  – Many thanks for everyone’s effort so far
  – But the work continues for 20 years!
• The UK has been heavily used and involved in first physics studies
  – This is partly because of data location
  – But also because we are a reliable cloud
• We can all celebrate this at the dinner tonight
  – But please keep an eye on your sites on your smart phones!

