Take-home messages from Lecture 1: LHC Computing has been well sized to handle the production and analysis needs of the LHC (very high data rates and throughputs)

1 Take-home messages from Lecture 1: LHC Computing has been well sized to handle the production and analysis needs of the LHC (very high data rates and throughputs). It is based on the hierarchical MONARC model and has been very successful. WLCG operates smoothly and reliably. Data is transferred well and made available to everybody within a very short time: the Higgs boson discovery was announced within a week of the latest data update! The network has worked well and now allows for computing model changes.

2 Grid computing enables the rapid delivery of physics results (Ian.Bird@ce / August 2012)

3 Outlook to the Future

4 Computing Model Evolution: Evolution of computing models. (Figure panels: Hierarchy, Mesh.)

5 Evolution: During its development, the WLCG production grid has oscillated between structure and flexibility, driven by the capabilities of the infrastructure and the needs of the experiments. (Figure labels: ALICE Remote Access, PD2P/Popularity, CMS Full Mesh.)

6 Data Management Evolution (Less Deterministic): Data management in the WLCG has been moving to a less deterministic system as the software improved. It started with deterministic pre-placement of data on disk storage for all samples (ATLAS), then moved to subscriptions driven by physics groups (CMS), then to dynamic placement of data based on access, replicating only the samples that were going to be looked at (ATLAS). Once IO is optimized and network links improve, we can send data over the wide area so jobs can run anywhere and access the data (ALICE, ATLAS, CMS). This is good for opportunistic resources, balancing, clouds, or any other time when the sample will be accessed only once.

7 Scheduling Evolution (Less Deterministic): Scheduling evolution has similar drivers. We started with a very deterministic system where jobs were sent directly to a specific site. This led to early binding of jobs to resources: requests sat idle in long queues, with no ability to reschedule. All 4 experiments evolved to use a set of pilots to make better scheduling decisions based on current information. The pilot system is now evolving further to allow submission to additional resources such as clouds. What began as a deterministic system has evolved toward flexibility in scheduling and resources.
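
The difference between early binding and pilot-based late binding can be sketched in a few lines. This is a toy model only, not any experiment's pilot framework: `CentralQueue`, `run_pilot`, the job names, and the capability tags are all hypothetical and chosen for illustration.

```python
from collections import deque


class CentralQueue:
    """Hypothetical central task queue: jobs are not tied to a site when submitted."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def next_job(self, capabilities):
        # Return the first queued job whose requirements this slot satisfies.
        for job in list(self._jobs):
            if job["needs"] <= capabilities:
                self._jobs.remove(job)
                return job
        return None


def run_pilot(site, capabilities, queue):
    """A pilot starts on a resource first, then pulls matching work (late binding)."""
    assigned = []
    while (job := queue.next_job(capabilities)) is not None:
        assigned.append((site, job["name"]))
    return assigned


queue = CentralQueue([
    {"name": "reco-1", "needs": {"cvmfs"}},
    {"name": "sim-1", "needs": {"cvmfs", "highmem"}},
    {"name": "ana-1", "needs": set()},
])

# The job-to-site decision is made only when a pilot is actually running.
tier2_jobs = run_pilot("T2_A", {"cvmfs"}, queue)
cloud_jobs = run_pilot("CLOUD_B", {"cvmfs", "highmem"}, queue)
```

Because the queue hands out work only to pilots that are already running, a job never waits in a site-local queue it was bound to in advance, and a cloud resource looks like just another place where pilots start.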

8 Data Access Frequency: More dynamic data placement is needed. There are fewer restrictions on where the data comes from, but data is still pushed to sites. (Ian Fisk, FNAL/CD. Figure labels: ATLAS, Tier-1, Tier-2.)

9 Popularity: Services like the Data Popularity Service track all file accesses and can show what data is accessed and for how long. Over a year, popular data stays popular for reasonably long periods of time. (Figure: CMS Data Popularity Service.)

10 Dynamic Data Placement: ATLAS uses the central queue and popularity to understand how heavily used a dataset is. Additional copies of the data are made, jobs are re-brokered to use them, and unused copies are cleaned. (Figure labels: PANDA, Requests, Tier-1, Tier-2.)
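
The replicate-when-popular, clean-when-unused cycle described on this slide can be illustrated with a small sketch. It is not the ATLAS implementation; the function `plan_replicas`, the thresholds, and the dataset names are assumptions made up for this example.

```python
from collections import Counter


def plan_replicas(access_log, current_replicas, hot_threshold=3, max_replicas=3):
    """Return a new replica count per dataset from recent access counts.

    hot_threshold and max_replicas are illustrative values, not real policy.
    """
    counts = Counter(access_log)
    plan = {}
    for dataset, replicas in current_replicas.items():
        hits = counts.get(dataset, 0)
        if hits >= hot_threshold and replicas < max_replicas:
            plan[dataset] = replicas + 1   # popular: make an additional copy
        elif hits == 0 and replicas > 1:
            plan[dataset] = replicas - 1   # unused: clean one copy, keep the last
        else:
            plan[dataset] = replicas       # leave as is
    return plan


access_log = ["A", "A", "A", "B", "A"]     # one entry per dataset access
replicas = {"A": 1, "B": 2, "C": 2}
plan = plan_replicas(access_log, replicas)
```

In this toy run, dataset `A` gains a copy, `B` is left alone, and the unused `C` loses one copy but keeps its last replica.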

11 Analysis Data: We like to think of high energy physics data as a series of embarrassingly parallel events. In reality, that is not how we either write or read the files; the layout is more like the one in the figure (labels: 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8). There are big gains in how storage is used from optimizing how events are read and streamed to an application, with big improvements from the ROOT team and application teams in this area.

12 Wide Area Access: With optimized IO, other methods of managing the data and the storage become available, such as sending data directly to applications over the WAN. This allows users to open any file regardless of their location or the file's source. Sites deploy at least one xrootd server that acts as a proxy/door.
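
From the application's point of view, wide area access amounts to "use the local copy if there is one, otherwise read from wherever the file lives". The sketch below models that idea with plain dictionaries; it does not use the XRootD client API, and `open_with_fallback`, the site names, and the file names are hypothetical.

```python
def open_with_fallback(lfn, local_storage, federation):
    """Return (source, payload) for a logical file name, preferring local storage."""
    if lfn in local_storage:
        return "local", local_storage[lfn]
    # Not on local disk: ask each remote site in turn
    # (a stand-in for a lookup through a proxy/door).
    for site, storage in federation.items():
        if lfn in storage:
            return site, storage[lfn]
    raise FileNotFoundError(lfn)


local = {"/store/a.root": b"local-bytes"}
remote = {"T1_X": {"/store/b.root": b"remote-bytes"}, "T2_Y": {}}

source_a, _ = open_with_fallback("/store/a.root", local, remote)
source_b, data_b = open_with_fallback("/store/b.root", local, remote)
```

The caller uses the same logical file name in both cases and only the returned source differs, which is what lets jobs run at sites that do not host the data.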

13 Performance: We tested with a "diskless" Tier-3; CPU efficiency was competitive. Data volume: 8TB/day peak, about 1.5TB average.
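
To put these volumes in network terms, the daily figures can be converted to sustained rates. The sketch assumes decimal terabytes (10^12 bytes) and that the 1.5TB average is also a per-day figure; neither is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day_to_mbit_per_s(tb_per_day):
    """Convert a daily volume in decimal terabytes to a sustained rate in Mb/s."""
    bits = tb_per_day * 1e12 * 8
    return bits / SECONDS_PER_DAY / 1e6


peak_mbps = tb_per_day_to_mbit_per_s(8)     # roughly 740 Mb/s
avg_mbps = tb_per_day_to_mbit_per_s(1.5)    # roughly 140 Mb/s
```

Under those assumptions the peak corresponds to a bit under 1 Gb/s sustained over a day, and the average to well under 200 Mb/s.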

14 Transparent Access to Data: Once we have a combination of dynamic placement, wide area access to data, and reasonable networking, facilities can be treated as part of a coherent system. This also opens doors to using new kinds of resources (opportunistic resources, commercial clouds, data centers, ...).

15 Example: Expanding the CERN Tier0. CERN is deploying a remote computing facility in Budapest, with 200Gb/s of networking between the centers at 35ms ping time. As experiments, we cannot really tell the difference in where resources are installed. (Figure labels: CERN, Budapest, 100Gb/s.)
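
One way to see what 200Gb/s at 35ms implies is the bandwidth-delay product, the amount of data that must be in flight to keep the path full. The calculation below treats the 35ms ping time as the round-trip time and the 200Gb/s as the aggregate capacity; both are assumptions about how the slide's numbers were measured.

```python
BANDWIDTH_BITS_PER_S = 200 * 10**9   # 200 Gb/s aggregate, from the slide
RTT_MS = 35                          # ping time, assumed to be the round trip

# Bandwidth-delay product: bits in flight when the link is fully used.
bdp_bits = BANDWIDTH_BITS_PER_S * RTT_MS // 1000
bdp_megabytes = bdp_bits // 8 // 10**6
```

That is about 875 MB in flight across all streams, so in practice the capacity is only reachable with many parallel transfers or large transfer windows rather than a single default-tuned connection.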

16 Tier 0: Wigner Data Centre, Budapest. New facility due to be ready at the end of 2012; 1100m² (725m²) in an existing building but with new infrastructure; 2 independent HV lines; full UPS and diesel coverage for all IT load (and cooling); maximum 2.7MW.

17 Networks: These 100Gb/s links are the first in production for WLCG; other sites will soon follow. We have reduced the differences in site functionality, then reduced even the perception that two sites are separate. We can begin to think of the facility as one big center and not a cluster of centers. This concept can be expanded to many facilities.

18 Changing the Services: The WLCG service architecture has been reasonably stable for over a decade. This is beginning to change with new middleware for resource provisioning. A variety of places are opening their resources to "Cloud"-type provisioning. From a site perspective, this is often chosen for cluster management and flexibility reasons. Everything is virtualized and services are put on top.

19 Clouds vs Grids: Grids offer primarily standard services with agreed protocols; they are designed to be generic, but execute a particular task. Clouds offer the ability to build custom services and functions; they are more flexible, but also more work for users.

20 Trying this out: CMS and ATLAS are trying to provision resources like this with the High Level Trigger farms, with OpenStack interfaced to the pilot systems. In CMS we got to 6000 running cores, and the facility looks like another destination, though no grid CE exists. It will be used for large-scale production running in a few weeks. Several sites have already requested similar connections to local resources.

21 WLCG will remain a Grid: We have a grid because we need to collaborate and share resources; thus we will always have a "grid". Our network of trust is of enormous value for us and for (e-)science in general. We also need distributed data management that supports very high data rates and throughputs, and we will continually work on these tools. We are now working on how to integrate cloud infrastructures into WLCG.

22 Evolution of the Services and Tools

23 Need for Common Solutions: Computing infrastructure is a necessary piece of the ultimate core mission of HEP experiments, and development effort is steadily decreasing. Common solutions try to take advantage of the similarities in the experiments' activities to optimize development effort and offer lower long-term maintenance and support costs, together with the willingness of the experiments to work together. There are successful examples in Distributed Data Management, Data Analysis, and Monitoring (HammerCloud, Dashboards, Data Popularity, the Common Analysis Framework, ...), taking advantage of Long Shutdown 1.

24 Architecture of the Common Analysis Framework

25 Evolution of Capacity: CERN & WLCG. Modest growth until 2014; anticipate x2 in 2015; anticipate x5 after 2018. (Figure annotations: "What we thought was needed at LHC start"; "What we actually used at LHC start!")

26 CMS Resource Utilization: Resource utilization was highest in 2012 for both Tier-1 and Tier-2 sites.

27 CMS Resource Utilization: Growth curves for resources.

28 Conclusions: In the first years of LHC data, WLCG has helped deliver physics rapidly, with data available everywhere within 48h. This is just the start of decades of exploration of new physics. Sustainable solutions are needed! We are entering a phase of consolidation and, at the same time, evolution. LS1 is an opportunity for disruptive changes and scale testing of new technologies: wide area access, dynamic data placement, new analysis tools, clouds. Challenges for computing (scale and complexity) will continue to increase.

29 Evolving the Infrastructure: In the new resource provisioning model, the pilot infrastructure communicates with the resource provisioning tools directly, requesting groups of machines for periods of time. (Figure labels: Resource Provisioning, Pilots, Resource Requests, Cloud Interface, CE, VM with Pilots, Batch Queue, WN with Pilots.)
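
"Requesting groups of machines for periods of time" can be pictured as the pilot infrastructure sizing a lease request from its own queue depth. The sketch below is illustrative only: `ResourceRequest`, `size_request`, and the default values (cores per VM, lease length, VM cap) are hypothetical and do not correspond to any real cloud interface.

```python
import math
from dataclasses import dataclass


@dataclass
class ResourceRequest:
    """Hypothetical request sent to a resource provisioning tool."""
    n_machines: int
    lease_hours: int


def size_request(queued_jobs, idle_slots, cores_per_vm=8, lease_hours=24, max_vms=100):
    """Ask for enough VMs to cover the queued single-core jobs that have no idle slot."""
    deficit = max(0, queued_jobs - idle_slots)
    n_machines = min(max_vms, math.ceil(deficit / cores_per_vm))
    return ResourceRequest(n_machines=n_machines, lease_hours=lease_hours)


request = size_request(queued_jobs=50, idle_slots=10)
no_request = size_request(queued_jobs=5, idle_slots=10)
```

Here 40 uncovered jobs on 8-core machines give a request for 5 VMs over a 24-hour lease, and nothing is requested when idle slots already cover the queue.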
