29/8/07
CMS and the Grid: 2004–7
Or: Peta-meta-computing in the proto-Grid era: a sociotechnological retrospective
Or: Whose paradigm is it anyway?
Experiments for the Grid
The Grid for Experiments

History
→ 2004: The end of innocence
  ▪ DC04 data challenge - a learning experience
  ▪ (also start playing with a new idea: PhEDEx)
→ 2005: The year of (re-)design
  ▪ Computing TDR
  ▪ A new ground-up software & computing framework
→ 2006: Making it work
  ▪ CSA06 - the acid test
  ▪ CASTOR in the UK: "…we all thought you were crazy" - I. Fisk
→ 2007: Making it work without losing our sanity
  ▪ Learning about storage + data transfers
  ▪ CSA07: the real fun is yet to come

DC04
→ 25% (of startup) data challenge
  ▪ Set the traditional formula for subsequent challenges
  ▪ Exercised T0, T1, T2 centres for a full month
  ▪ 50M events, ~15 centres, ~30 people
→ Tools & technologies
  ▪ Ad hoc scripts for workflow mgmt + CASTOR at CERN
  ▪ Variety of storage at T1 (SRB + ADS at RAL), plain disk at T2
  ▪ First-generation EDG workload + data mgmt tools (incl. RLS)
→ Results
  ▪ Technically: largely a disaster
      - Managed to break essentially every component in the system
  ▪ Organisationally: a major step forward
      - Learned several key lessons that informed the computing model
      - Established new projects to solve the technical problems
  ▪ GridPP2 contributed substantially to the analysis / solution of problems

CSA06 - in Numbers
  7 Tier-1 centres
  35 Tier-2 sites
  100M fully simulated events
  1.4PB of data
  400MB/s rate from RAL CASTOR
  1200MB/s peak dataflow
  70 people
  180 meetings
  500 dodgy disks (but only one tape eaten!)
  £300k of electricity
  40l of Champagne at end-of-CSA party
"Shambolic CSA Forced to Abandon Targets" - Headline, The Independent, 5th Jan

Lesson the 1st: Data
It's all about the data, stupid!
→ Remember the DataGrid?
→ We've failed to build a uniform approach to WADM
→ We deal with data processing centres, not CPU centres
→ The (remaining) hard problems are in efficient data access
  ▪ I.e. storage, IO, data transfer at Tier-2, not just Tier-1
→ Need to nail the local IO problem very soon
→ Still have a lot to do on reliable data transfer as well
  ▪ BTW, the network still isn't the bottleneck (but keep trying…)

Lesson the 2nd: Locality
Keep local stuff local
→ Aim of the Grid: avoid central points of congestion
  ▪ Present coherent local interfaces, but reduce global state
  ▪ Actually: aim of all coherent system-building strategies
→ Examples from current CMS system:
  ▪ Use local catalogues whenever possible; update asynchronously
  ▪ Don't use off-site Grid services for local workflows (e.g. reco)
→ This also applies to contacts with sites
  ▪ Users / experiments must work with sites directly
  ▪ "Up the chain of command and down again" does not work
  ▪ NB: also applies to strategic discussions and resource planning
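
The "local catalogue, asynchronous update" idea can be sketched roughly as follows. This is an illustration only, not the CMS implementation: the class `LocalCatalogue`, its methods, and the `publish` callback standing in for a global index are all hypothetical names. The point is that local lookups never wait on an off-site service; updates are queued and pushed out whenever the global service happens to be reachable.

```python
from collections import deque


class LocalCatalogue:
    """Hypothetical site-local file catalogue with deferred global updates."""

    def __init__(self):
        self._replicas = {}      # logical file name -> local path
        self._pending = deque()  # updates not yet published off-site

    def register(self, lfn, path):
        # Record locally first; the global index is told later.
        self._replicas[lfn] = path
        self._pending.append((lfn, path))

    def lookup(self, lfn):
        # Purely local: works even when every off-site service is down.
        return self._replicas.get(lfn)

    def publish_pending(self, publish):
        """Push queued updates through `publish(lfn, path)`.

        `publish` is an assumed callback that raises OSError when the
        global service is unreachable. Returns the number of updates sent;
        anything not sent stays queued for the next call.
        """
        sent = 0
        while self._pending:
            lfn, path = self._pending[0]
            try:
                publish(lfn, path)
            except OSError:
                break  # leave this and later updates queued
            self._pending.popleft()
            sent += 1
        return sent
```

A local workflow would call `register` and `lookup` freely, while a separate periodic task calls `publish_pending`; a failure of the global index then delays global visibility but does not stop local work.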

Lesson the 3rd: Reliability
Reliability trumps performance & scalability
→ Unreliable systems are extremely inefficient
  ▪ $N_\mathrm{tries}$ goes as $[\log(1-p)]^{-1}$, bookkeeping at least as $N_\mathrm{tries}^{2}$
→ Unreliable systems are not trusted by users
→ If one can't make a small system work, larger systems will be progressively worse
  ▪ We are getting there; but not fast enough
  ▪ Reliability can be achieved iff robustness is built in
→ Without reliability, what is the point of the Grid?
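
One way to read the N_tries scaling is sketched below. It assumes that p is the probability that a single attempt succeeds, that attempts are independent, and that we want at least one success with some target confidence; the confidence target and the function names are assumptions added for illustration, not from the talk. All N attempts fail with probability (1-p)^N, so requiring this to be at most 1 - confidence gives N >= log(1 - confidence) / log(1 - p), which goes as 1/log(1-p).

```python
import math


def tries_for_confidence(p_success: float, confidence: float = 0.99) -> int:
    """Smallest N such that 1 - (1 - p_success)**N >= confidence.

    Assumes independent attempts, each succeeding with probability p_success.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    if p_success == 1.0:
        return 1  # a perfectly reliable system needs a single attempt
    n = math.log(1.0 - confidence) / math.log(1.0 - p_success)
    return max(1, math.ceil(n))


def bookkeeping_lower_bound(n_tries: int) -> int:
    """Relative bookkeeping cost, taking the slide's N_tries**2 as a floor."""
    return n_tries * n_tries


if __name__ == "__main__":
    for p in (0.99, 0.5, 0.1):
        n = tries_for_confidence(p)
        print(f"p={p}: N_tries={n}, bookkeeping >= {bookkeeping_lower_bound(n)}")
```

Under these assumptions, a service that succeeds half the time needs 7 attempts for 99% confidence, and one that succeeds 10% of the time needs 44, with the bookkeeping floor growing as the square of those numbers.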

Lesson the 4th: Exceptions
Sticking plasters won't cure a broken leg
→ We use the "network stack" model of fault tolerance
  ▪ Higher layer functions compensate for unreliability of lower layers
→ Alas, does not work for intrinsically unreliable systems
  ▪ Example: wireless network in CERN building 40
  ▪ b fine with 1% error rate; collapses with 10% (CMS week!)
→ Fault-finding is impossible without fault reporting
  ▪ And intelligible logging, recorded and accessible at all levels
→ Exception handling is clearly hard
  ▪ A key property of a mature system
  ▪ Remember: exceptions should be exceptional
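
A toy calculation, not taken from the talk, illustrates why retries at a higher layer stop helping as the underlying error rate rises. It assumes independent packet losses, a fixed retry budget per packet, and a transfer that fails if any single packet exhausts its retries; the function name and the 1000-packet, 3-attempt figures are illustrative assumptions.

```python
def delivery_probability(packet_error_rate: float,
                         packets: int,
                         max_attempts: int) -> float:
    """Probability that every packet gets through within max_attempts tries.

    Toy model: each transmission fails independently with
    packet_error_rate, and the transfer fails if any packet fails
    on all of its attempts.
    """
    if not 0.0 <= packet_error_rate <= 1.0:
        raise ValueError("packet_error_rate must be in [0, 1]")
    if packets < 0 or max_attempts < 1:
        raise ValueError("need packets >= 0 and max_attempts >= 1")
    per_packet_failure = packet_error_rate ** max_attempts
    return (1.0 - per_packet_failure) ** packets


if __name__ == "__main__":
    for rate in (0.01, 0.10):
        prob = delivery_probability(rate, packets=1000, max_attempts=3)
        print(f"error rate {rate:.0%}: transfer succeeds with p = {prob:.3f}")
```

In this model a 1000-packet transfer with three attempts per packet succeeds about 99.9% of the time at a 1% error rate, but only about 37% of the time at 10%. The compensating layer is unchanged; only the reliability of the layer beneath it differs.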

Lesson the 5th: People
"Generic Grid sites" do not really exist
→ A site is precisely as good as the people running it
  ▪ Objectively: throughput (transfers) tracks national holidays!
  ▪ We are still in a highly labour-intensive mode; the labour is at sites
→ What does CMS need from site operators, today?
  ▪ Close contact with the project and the users; sharing of experience
  ▪ Proactive deployment & testing of new services, software
  ▪ Active participation in resource planning and data operations
→ Will generic sites ever exist?
  ▪ Not until central support and problem-tracking are much improved

Lesson the 6th: Focus
No more neat ideas - for now
→ In 2007/8, that is!
  ▪ Focus on (dull, tedious, hard) integration, testing, documentation
  ▪ The excitement will come from the physics!
→ But many big unsolved problems for later:
  ▪ How can we store data more efficiently? Can we?
  ▪ How can we compute more efficiently? Can we?
  ▪ How should we use multi-threading & virtualisation?
  ▪ How do we use really high-speed networks?
  ▪ Will anyone ever make videoconferencing work properly?
→ Someone should start targeting these problems…

Hype?
→ Where are we? (Or at least: what sign is the gradient?)

Whither the Grid?
→ Is CMS using the Grid?
  ▪ PKI-based, uniform(ish) web services interfaces? Yes
      - But also a lot of remote DB access for many purposes
  ▪ Resource discovery / Info service? Not really
      - >90% of CMS jobs are whitelisted at RB level (many even at user level)
  ▪ Replica management? Partially, through our own mechanisms
      - No real attempt at optimisation of data access - yet
  ▪ Support, authentication, ROC services? Partially
      - Augmented with CMS-specific and national support mechanisms
→ Has it all been worth it (so far)?
  ▪ Yes! If it didn't exist, we'd have had to invent it anyway
→ Will we become more Grid-like?
  ▪ Undoubtedly (though not sure utility computing will ever be a goer)
  ▪ For now, efficiency appears to require simplicity - no surprise
→ The real value of The Grid is yet to come

(Near) Future
The hard work starts here!
→ I say this every six months
  ▪ So far it's always been true
→ CSA07
  ▪ Really the last big test of our organisation, readiness for data
  ▪ Already reviewing some aspects of model after discussion…
→ 2008: The crunch year
  ▪ Focus should be on basic reliable services at centres
  ▪ Need to reinforce communications between expts and sites
→ GridPP3
  ▪ Clearly a major role to play in CMS computing - at many levels
  ▪ Roll on LHC startup!