1 Will / Can Clouds Replace Grids? A Three-Point Checklist
Jamie.Shiers@cern.ch – Grid Support Group, IT Department, CERN

2 Introduction
- This talk tries to establish a checklist to determine whether Cloud computing could be an alternative – or possibly a complementary – solution to Grids for LHC-scale computing
- In other words, it tries to build a list of criteria that a cloud-based solution must satisfy if it is to be considered an acceptable solution
- This checklist leads naturally to a set of actions or possible project(s) in this area
- Outstanding issues and/or current experience from today's solution (aka the Grid) are interleaved to emphasize the relevance of these issues

3 Abstract
The WLCG service has been declared officially open for production and analysis during the LCG Grid Fest held at CERN – with live contributions from around the world – on Friday 3rd October 2008. But the service is not without its problems: services or even entire sites suffer degradation or complete outages, with painful repercussions on experiment activities; the operations and service model is arguably not sustainable at this level, and yet an important element of the funding comes to an end approximately one year after this conference! Cloud computing – which has been referred to as Grid computing with a viable business model – makes ambitious claims. Could it solve all – or even a significant fraction, say Monte Carlo production – of our computing problems? What would be the associated costs and the technical and sociological implications? This presentation analyzes the Strengths, Weaknesses, Opportunities and Threats of these potential rival models from the viewpoint of the current WLCG service. It makes proposals for studies that should be performed – beyond existing, largely paper, analyses – and highlights some key differentiators between the two approaches.

4 What is Cloud Computing?
a. The latest round of hype;
b. Yet another form of utility computing;
c. Grid computing, but with a business model;
d. Where the action (money) currently is;
e. All of the above?

5 Does it matter?
- Some ten years ago Larry Ellison – head honcho at Oracle – declared:
  "There have been 3 generations of computing: mainframe, client-server and Internet computing. There'll be nothing new for one thousand (1000) years."
- Curiously enough, just a couple of years later, Oracle declared Grid to be "the next big thing"

6 What is Grid Computing?
- Today there are many definitions of Grid computing.
- The definitive definition of a Grid is provided by Ian Foster in his article "What is the Grid? A Three Point Checklist".
- The three points of this checklist are:
  1. Computing resources are not administered centrally;
  2. Open standards are used;
  3. Non-trivial quality of service is achieved.

8 WLCG Key Performance Indicators
- Since the beginning of last year we have held week-daily conference calls, open to all experiments and sites, to follow up on short-term operations issues
- These have been well attended by the experiments, with somewhat more patchy attendance from sites, but the minutes are widely and rapidly read by members of the WLCG Management Board and beyond
- A weekly summary is given to the Management Board, where we have tried to evolve towards a small set of Key Performance Indicators
- These currently include a summary of the GGUS tickets opened in the previous week by the LHC VOs, as well as the more important service incidents requiring follow-up: Service Incident Reports (aka post-mortems)

9 GGUS Summary

  VO concerned   USER   TEAM   ALARM   TOTAL
  ALICE             3      0       0       3
  ATLAS            16     16       0      32
  CMS              13      0       0      13
  LHCb              9      2       0      11
  Totals           41     18       0      59

- No alarm tickets – this may also reflect activity
- Increasing use of TEAM tickets
- Regular test of ALARM tickets coming soon! See https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090223#Tuesday under AOB
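A weekly table like the one above is simple enough to generate automatically. The following is a minimal Python sketch of how per-VO ticket counts could be tallied; the ticket records and the summarize_tickets helper are illustrative assumptions, not part of GGUS or any existing WLCG tool.

```python
from collections import Counter

# Hypothetical ticket records: (VO, ticket type) pairs as they might be
# exported from GGUS for one week. The data below reproduces the table above.
TICKETS = (
    [("ALICE", "USER")] * 3
    + [("ATLAS", "USER")] * 16 + [("ATLAS", "TEAM")] * 16
    + [("CMS", "USER")] * 13
    + [("LHCb", "USER")] * 9 + [("LHCb", "TEAM")] * 2
)

def summarize_tickets(tickets):
    """Tally tickets per VO and type, plus row and column totals."""
    counts = Counter(tickets)
    vos = sorted({vo for vo, _ in tickets})
    types = ["USER", "TEAM", "ALARM"]
    print(f"{'VO':<8}" + "".join(f"{t:>7}" for t in types) + f"{'TOTAL':>7}")
    for vo in vos:
        row = [counts[(vo, t)] for t in types]
        print(f"{vo:<8}" + "".join(f"{n:>7}" for n in row) + f"{sum(row):>7}")
    totals = [sum(counts[(vo, t)] for vo in vos) for t in types]
    print(f"{'Totals':<8}" + "".join(f"{n:>7}" for n in totals) + f"{sum(totals):>7}")

summarize_tickets(TICKETS)
```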

10 Intervention Summary (fake)

  Site    # scheduled   # overran   # unscheduled   Hours sched.   Hours unsched.
  Bilbo             5           0               1             10                4
  Frodo             1           1               0             22                2
  Drogo            27           0               0             16               50

- As with the GGUS summary we will drill down in case of exceptions (examples highlighted above)
- Q: what are reasonable thresholds?
- Proposal: look briefly at ALL unscheduled interventions, ALL overruns and a "high" (TBD) number of scheduled interventions (see the sketch below)
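A minimal sketch of the proposed drill-down rule. The record fields and the high_scheduled threshold are illustrative assumptions (the actual threshold is the "TBD" value above), not taken from any WLCG reporting tool.

```python
from dataclasses import dataclass

# Hypothetical per-site weekly intervention record; field names are invented.
@dataclass
class SiteInterventions:
    site: str
    scheduled: int
    overran: int
    unscheduled: int

def needs_drill_down(rec, high_scheduled=10):
    """Apply the proposed rule: flag ALL unscheduled interventions,
    ALL overruns, and a 'high' (threshold TBD) number of scheduled ones."""
    return rec.unscheduled > 0 or rec.overran > 0 or rec.scheduled >= high_scheduled

week = [
    SiteInterventions("Bilbo", scheduled=5, overran=0, unscheduled=1),
    SiteInterventions("Frodo", scheduled=1, overran=1, unscheduled=0),
    SiteInterventions("Drogo", scheduled=27, overran=0, unscheduled=0),
]
for rec in week:
    if needs_drill_down(rec):
        print(f"{rec.site}: review at the daily operations call")
```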

11 (Some) Unscheduled Interventions
- NL-T1 (SARA-MATRIX): A DDN storage device partially crashed and needs a cold reboot and some additional actions. We are uncertain how long it will take. The SARA CEs may be affected. Period announced 23-02-2009 09:30 – 11:15; intervention terminated 23-02-2009 12:20.
- NDGF: Some dCache pools offline from time to time due to bad hardware causing spontaneous reboots. Period announced 20-02-2009 15:22 – 23-02-2009 15:22; terminated 23-02-2009 16:25.
- We need to automatically harvest this information and improve follow-up reporting
- A convenient place to provide such a report is at the daily WLCG operations call!

12 [Figure: per-experiment activity plots – ALICE, ATLAS, CMS, LHCb]

13 Constant improvement of the quality of the infrastructure
[Figure: comparison of CMS site availability, based on the results of SAM tests specific to the CMS VO, for the first and last quarters of 2008]
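The availability shown in such plots is derived from test outcomes. As a rough illustration only – this is not the actual SAM/dashboard algorithm, which applies its own status intervals and criticality rules – availability can be thought of as the fraction of sampled critical tests a site passes over a period:

```python
# Illustrative only: a naive availability measure as the pass fraction of
# critical test samples for one site over one period.
def availability(test_results):
    """test_results: list of booleans, one per sampled critical test (True = OK)."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)

# Hypothetical sample: 22 of 24 hourly samples passed in a day.
samples = [True] * 22 + [False] * 2
print(f"availability = {availability(samples):.1%}")  # 91.7%
```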

14 2009 Data Taking – The Prognosis
- Production activities will work sufficiently well – many Use Cases have been tested extensively and for prolonged periods at a level equal to (or even greater than) the peak loads that can be expected from 2009 LHC operation
- There will be problems, but we must focus on restoring the service as rapidly and as systematically as possible
- Analysis activities are an area of larger concern – by definition the load is much less predictable
- Flexible Analysis Services and Analysis User Support will be key
- In parallel, we must transition to a post-EGEE III environment – whilst still not knowing exactly what this entails…
- But we do know what we need to run stable Grid Services!

15 The Goal – WLCG Services
- The goal is that – by end 2009 – the weekly WLCG operations / service report is quasi-automatically generated 3 weeks out of 4, with no major service incidents – just a (tabular?) summary of the KPIs (a sketch follows below)
- We are currently very far from this target, with (typically) multiple service incidents that are either:
  - New in a given week;
  - Still being investigated or resolved several to many weeks later
- Quite a few are avoidable too, if we followed some basic rules!
- By definition, such incidents are characterized by severe (or total) loss of a service or even of a complete site (or even a Cloud, in the case of ATLAS)
From the February 2009 LHCC mini-review of (W)LCG
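A minimal sketch of what "quasi-automatic" generation could look like: in a quiet week only the tabular KPI summary is emitted, otherwise the open incidents are listed for follow-up. The report format and field names are invented for illustration; they are not an existing WLCG tool.

```python
# Hypothetical weekly report generator: tabular KPIs always, incident list
# only when there is something still requiring follow-up.
def weekly_report(kpis, incidents):
    lines = ["WLCG weekly operations report", ""]
    lines += [f"  {name}: {value}" for name, value in kpis.items()]
    open_incidents = [i for i in incidents if not i["resolved"]]
    lines.append("")
    if not open_incidents:
        lines.append("No major service incidents this week.")
    else:
        lines.append("Service incidents requiring follow-up:")
        lines += [f"  - {i['site']}: {i['summary']}" for i in open_incidents]
    return "\n".join(lines)

print(weekly_report(
    kpis={"GGUS tickets": 59, "Unscheduled interventions": 1},
    incidents=[{"site": "ASGC", "summary": "Oracle/CASTOR problems", "resolved": False}],
))
```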

16 How Can We Improve?
- Change Management
  - Plan and communicate changes carefully;
  - Do not make untested changes on production systems – these can be extremely costly to recover from.
- Incident Management
  - The point is to learn from the experience and hopefully avoid similar problems in the future;
  - Documenting clearly what happened, together with possible action items, is essential.
- All teams must buy into this: it does not work simply by high-level management decision (which might not even filter down to the technical teams involved).
- CERN IT plans to address this systematically (ITIL – pronounced "Common Sense") as part of its 2009+ Programme of Work

17 What is LHC Scale Computing?
- (W)LCG was initially declared to require "100,000 of today's fastest PCs"
- Technology has changed quite significantly since this was first written, but with a very minor change this still holds true: 100,000 cores
- This is also used to loosely characterize "petascale" computing (aka supercomputing)
- Any demonstration that we do must be on a scale commensurate with this – a 1% (or less) test is completely irrelevant! (and we know that the data is petabyte-scale…)
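As a rough order-of-magnitude check (the per-core figure is an assumption, of order 10 GFLOPS for a circa-2009 core, not a number from this talk):

```latex
\[
10^{5}\ \text{cores} \times \mathcal{O}(10)\ \text{GFLOPS/core}
  \approx 10^{6}\ \text{GFLOPS} = 1\ \text{PFLOPS},
\]
```

which is why 100,000 cores is loosely equated with petascale computing above.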

18 Can Clouds Replace Grids?
- We now have the first two of the three points of this checklist:
  1. Non-trivial quality of service must be achieved;
     - We have well understood metrics from day-to-day operation of the WLCG services
  2. The scale of the test(s) must be meaningful for petascale computing;
     - Obviously one cannot expect to dedicate 100,000 cores to the first prototype, but anything done on a scale of fewer than 1,000 cores will not be relevant to the conclusion!
  3. Data…

19 Current Data Management vs Database Strategies
- Data Management: specify only the interface (e.g. SRM) and allow sites to choose the implementation (both of SRM and of the backend s/w & h/w mass storage system)
- Databases: agree on a single technology (for specific purposes) and agree on detailed implementation and deployment details
- WLCG experience from both areas shows that you need to have very detailed control down to the lowest levels to get the required performance and scalability. How can this be achieved through today's (or tomorrow's) Cloud interfaces? Are we just dumb???
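To make the Data Management approach concrete, here is a minimal sketch of "specify only the interface, let each site choose the implementation". This is not WLCG or SRM code; the class and method names are invented for illustration.

```python
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    """The agreed interface (SRM-like); everything behind it is a site decision."""

    @abstractmethod
    def put(self, local_path: str, remote_path: str) -> None: ...

    @abstractmethod
    def get(self, remote_path: str, local_path: str) -> None: ...

class SiteATapeBacked(StorageInterface):
    # One site's choice: a tape-backed mass storage system.
    def put(self, local_path, remote_path):
        print(f"[site A] staging {local_path} to tape-backed store at {remote_path}")

    def get(self, remote_path, local_path):
        print(f"[site A] recalling {remote_path} from tape to {local_path}")

class SiteBDiskOnly(StorageInterface):
    # Another site's choice: a plain disk pool.
    def put(self, local_path, remote_path):
        print(f"[site B] copying {local_path} to disk pool at {remote_path}")

    def get(self, remote_path, local_path):
        print(f"[site B] reading {remote_path} from disk pool to {local_path}")

def archive(storage: StorageInterface, files):
    """Client code sees only the interface, never the site's implementation."""
    for f in files:
        storage.put(f, f"/store/{f}")

archive(SiteATapeBacked(), ["run1.root"])
archive(SiteBDiskOnly(), ["run1.root"])
```

The open question in the slide is whether this level of per-site control survives when the interface is instead a generic cloud API.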

20 Major Service Incidents
- Quite a few such incidents are "DB-related" in the sense that they concern services with a DB backend
- The execution of a "not quite tested" procedure on ATLAS online led – partly due to the Xmas shutdown – to a break in replication of ATLAS conditions from online out to the Tier1s of over 1 month (online–offline was restored much earlier)
- Various Oracle problems over many weeks affected numerous services (CASTOR, SRM, FTS, LFC, ATLAS conditions) at ASGC → need for ~1 FTE of suitably qualified personnel at WLCG Tier1 sites, particularly those running CASTOR; recommendations to follow the CERN/3D DB configuration & perform a clean Oracle+CASTOR install; communication issues
- Various problems affecting CASTOR+SRM services at RAL over a prolonged period, including "Oracle bugs" strongly reminiscent of those seen at CERN with an earlier Oracle version: very similar (but not identical) problems seen recently at CERN & ASGC (not CNAF…)
- Plus not infrequent power + cooling problems [+ weather!]
- These can take out an entire site – the main concern is controlled recovery (and communication)
From the February 2009 LHCC mini-review of (W)LCG

21 At the November 2008 WLCG workshops a recommendation was made that each WLCG Tier1 site should have at least 1 FTE of DBA effort. This effort (preferably spread over multiple people) should proactively monitor the databases behind the WLCG services at that site: CASTOR/dCache, LFC/FTS, conditions and other relevant applications. The skills required include the ability to back up and recover, tune and debug the database and associated applications. At least one WLCG Tier1 does not have this effort available today.

22 Services – Concrete Actions
1. Review on a regular (3–6 monthly?) basis the open Oracle "Service Requests" that are significant risk factors for the WLCG service (Tier0 + Tier1s + Oracle). The first such meeting is being set up and will hopefully take place prior to CHEP 2009.
2. Perform "technology-oriented" reviews of the main storage solutions (CASTOR, dCache), focussing on service and operational issues. Follow-on to the Jan/Feb workshops in these areas; again report at the pre-CHEP WLCG Collaboration Workshop.
3. Perform Site Reviews – initially of Tier0 and Tier1 sites – focussing again on service and operational issues. This will take some time to cover all sites; the proposal is for the review panel to include members of the site to be reviewed, who will also participate in the reviews before and after their own site's.

23 Remaining Questions
- Are Grids too complex?
- Are Clouds too simple?
- IMHO we can learn much from the strengths and weaknesses of these approaches, particularly in the key (for us) areas of data(base) management & service provision. This must be a priority for the immediate future…
- Do Grids have to be too complex?
- Do Clouds have to be too simple?

24 Can Clouds Replace Grids? – The Checklist
- We have established a short checklist that will allow us to determine whether clouds can replace – or be used in conjunction with – Grids for LHC-scale, data-intensive applications:
  1. Non-trivial quality of service must be achieved;
  2. The scale of the test(s) must be meaningful for petascale computing;
  3. Data volumes, rates and access patterns must be representative of LHC data acquisition, (re-)processing and analysis;
  4. Cost (of entry; of ownership).

25 Conclusions
- We cannot afford to ignore major trends in the computing industry
  - Some may turn out to be dead ends
  - Some may die only to be reborn in a different guise
- We have established – through a long series of challenges – a well-proven mechanism for determining whether a (set of) computing service(s) satisfies an agreed set of requirements
- Not evaluating cloud computing for at least some HEP Use Cases would appear to be the one option we cannot afford to take…

26 Summary
- The following targets must be met by a Cloud-based (or any other) solution to satisfy LHC-scale needs:
  1. Non-trivial quality of service must be achieved;
  2. The scale of the test(s) must be meaningful for petascale computing;
  3. Data volumes, rates and access patterns must be representative of LHC data acquisition, (re-)processing and analysis;
  4. Cost (of entry; of ownership).

27 Acknowledgements
- Elsie Gee (E. Gee!) – for many interesting but often heated discussions

