Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s). APA Conference, Brussels, October 2014
The Story So Far… Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for long-term data preservation (LTDP) is achievable – and will hopefully be funded. It is built on standards, certified via agreed procedures, and uses the “Cream of DP services”. In parallel, Business Cases and Cost Models are increasingly understood, working closely with Projects, Communities and Funding Agencies.
Topic For RDA-4 Joint DP Workshop: The high-level Use Cases that Funding Agencies (FAs) require us to address are: 1. Open Access (specific samples / purposes); 2. Reproducibility (of data, results, publications); 3. Provision of Data Management Plan(s). As far as I know, these “requirements” are not specific to a given community, i.e. ALL disciplines funded by a given FA must address them.
Open Questions: Long-term sustainability is still a technical issue – Let’s assume that we understand the Business Cases & Cost Models well enough… – And (we) even have agreed funding for key aspects. But can the service providers guarantee a multi-decade service? – Is this realistic? – Is this even desirable? I will address these issues at the APA conference next month in Brussels – with a proposal for “a solution”.
Background: 20 years ago – in 1994 – the first Computing R&D projects for the LHC were proposed – About 10 years before the expected startup date. History shows that these projects didn’t start too early – even including the LHC startup delays. We now foresee “next generation” data factories in the 2020s and beyond. These will generate Exabytes (e.g. HL-LHC) to Zettabytes (e.g. FCC, SKA) of data and last decades.
Technology(?): Of course, in 1-2 decades we can expect huge advances in technology. At least some of these changes are likely to be disruptive – just look back! But you cannot plan based on the unknown. Eventually, you will have to make decisions based either on what exists, or on what you can be confident will be delivered on the needed timescale. Major changes in technology during the active life of current / future projects are likely.
H2020 EINFRA: Managing, preserving and computing with big research data. 7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects. Clean slate approaches to data management targeting 'data factory' requirements of research communities and large scale facilities (e.g. ESFRI projects) are encouraged.
Lunatic Fringe: But this is clearly the lunatic fringe. What exactly does it have to do with me? Quite a lot: it has implications for efforts such as – CTRUST / RDA Certification Interest Group: can current certification procedures “scale” to such massive data volumes? Can multi-site requirements be addressed? – RDA Active Data Management Plans – 4C: costs of Exa / Zetta scale curation must clearly be well understood and justified – RDA Reproducibility Interest Group (and many others) – DPINFRA: next generation requirements – [ Preservation VRE: some aspects ~independent of total data volume, some not ] – APA CoE – … There are significant economies of scale in “shared bit repositories”.
Suppose these guys can build / share the most cost effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta- scale bit preservation? Can we ignore them?
Next Generation Data Factories: HL-LHC (https://indico.cern.ch/category/4863/) – Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030 – (European Strategy for Particle Physics). SKA – The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area. Typified by SCALE in several dimensions: – Cost; longevity; data rates & volumes – Last decades; cost O(EUR 10^9); EB / ZB data volumes
opportunities/digital-data-management / “The focus of this statement is sharing and preservation of digital research data.” All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements: 1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
LHC experiments increasingly talking about: 1. Open Access for Outreach; 2. Reproducibility of Results.
These are becoming mandatory activities, fully supported at all levels of the Collaborations
Computing at the HL-LHC (~2025+): Predrag Buncic on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group (ECFA Workshop, Aix-Les-Bains, October 3, 2013). ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer; LHCb: Renaud Legac, Niko Neufeld
Data: Outlook for HL-LHC. Very rough estimate of new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates. To be added: derived data (ESD, AOD), simulation, user data… At least 0.5 EB / year (x 10 years of data taking). [Figure: data volume in PB, with today’s position marked “We are here!”]
Data storage issues: Our data problems may still look small on the scale of the storage needs of internet giants. Business, video, music, smartphones and digital cameras generate more and more need for storage. The cost of storage will probably continue to go down but… Commodity high capacity disks may start to look more like tapes, optimized for multimedia storage and sequential access. They will need to be combined with flash memory disks for fast random access. The residual cost of disk servers will remain. While we might be able to write all this data, how long will it take to read it back? Need for sophisticated parallel I/O and processing. + We have to store this amount of data every year and for many years to come (Long Term Data Preservation).
WLCG Collaboration Today: Distributed infrastructure of 150 computing centers in 40 countries. 300+ k CPU cores (~ 2M HEP-SPEC-06). The biggest site with ~50k CPU cores, 12 T2 with 2-30k CPU cores. Distributed data, services and operation infrastructure.
WLCG Collaboration Tomorrow: How will this evolve to HL-LHC needs? To what extent is it applicable to other comparable scale projects? Already evolving, most significantly during Long Shutdowns, but also during data taking!
Today’s state of the art in 0.1EB scale bit preservation (or “exabit”)
Bit-preservation WG one-slider: Mandate summary (see w3.hepix.org/bit-preservation) – Collecting and sharing knowledge on bit preservation across HEP (and beyond) – Provide technical advice to – Recommendations for sustainable archival storage in HEP. Survey on large HEP archive sites carried out and presented at last HEPiX – 19 sites; areas such as archive lifetime, reliability, access, verification, migration – HEP archiving has become a reality by fact rather than by design – Overall positive, but lack of SLAs, metrics, best practices, and long-term costing impact
Ongoing Work: Two work areas: 1. Preparing a set of best-practice recommendations for bit-level preservation within HEP – ~10 recommendations – Concentrate more on “what” to do rather than “how” – Will be circulated to WG participants and surveyed sites in the summer – Feedback will be most appreciated. 2. Defining a simple and customisable model to help establish the long-term cost of bit-level preservation – Useful for site planning/outlook – Input for DPHEP: a significant fraction of the overall Data Preservation cost! – The rest of this presentation
Verification & reliability: Systematic verification of archive data ongoing – “Cold” archive: Users only accessed ~20% of the data (2013) – All “historic” data verified between – All new and repacked data being verified as well. Data reliability significantly improved over last 5 years – From annual bit loss rates of O( ) (2009) to O( ) (2012) – New drive generations + less strain (HSM mounts, TM “hitchback”) + verification – Differences between vendors getting small. Still, room for improvement – Vendor quoted bit error rates: O( ) – But, these only refer to media failures – Errors (e.g. bit flips) appearing in complete chain. ~35 PB verified in 2014: no losses.
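As an illustration of what systematic verification involves at the file level, here is a minimal sketch that recomputes checksums and compares them with a stored manifest. It is not the tooling used by the surveyed sites: the function names, the manifest layout and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import os
import tempfile

def file_checksum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_archive(manifest):
    """manifest maps path -> expected hex digest (hypothetical layout).

    Returns the paths that are missing or whose content no longer matches.
    """
    failures = []
    for path, expected in manifest.items():
        if not os.path.isfile(path) or file_checksum(path) != expected:
            failures.append(path)
    return failures

# Small demonstration on a temporary file: verify, flip one byte, verify again.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "run001.raw")  # hypothetical file name
    with open(sample, "wb") as handle:
        handle.write(b"\x00" * 1024)
    manifest = {sample: file_checksum(sample)}
    clean_failures = verify_archive(manifest)
    with open(sample, "r+b") as handle:
        handle.write(b"\x01")  # simulate silent corruption of the first byte
    damaged_failures = verify_archive(manifest)
    damaged_detected = damaged_failures == [sample]

print(clean_failures, damaged_detected)
```

A real archive would also have to schedule tape mounts and throttle reads, which this sketch ignores entirely.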
“LHC Cost Model” (simplified): Start with 10PB, then +50PB/year, then +50% every 3y (or +15% / year)
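The growth rule above can be sketched in a few lines. This is one possible reading of the slide, not the working group's actual model: it assumes the yearly intake starts at 50 PB and is multiplied by 1.5 at the start of every 3-year period. All function and parameter names are illustrative.

```python
def annualised_rate(step_growth, step_years):
    """Convert growth per step (e.g. 0.5 every 3 years) into an equivalent yearly rate."""
    return (1.0 + step_growth) ** (1.0 / step_years) - 1.0

def archive_size_pb(years, start_pb=10.0, intake_pb=50.0, step_growth=0.5, step_years=3):
    """Archive size in PB at the end of each year; index 0 is the starting volume.

    Assumption: the yearly intake is constant within a step and jumps by
    step_growth at the start of each new step.
    """
    sizes = [start_pb]
    for year in range(years):
        factor = (1.0 + step_growth) ** (year // step_years)
        sizes.append(sizes[-1] + intake_pb * factor)
    return sizes

rate = annualised_rate(0.5, 3)
sizes = archive_size_pb(6)
print(f"equivalent yearly growth: {rate:.1%}")
print(sizes)
```

The first function shows where the "+15% / year" shorthand comes from: +50% every 3 years compounds to roughly 14.5% per year.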
Case B) increasing archive growth
Case B) increasing archive growth. Total cost: ~$59.9M (~$2M / year)
From Petabytes to Exabytes: Can the current computing and data management models scale by orders of magnitude? We cannot simply “scale out” in terms of number of sites; we need much greater resilience against data loss / corruption, including (semi-)automated recovery + support for adding / removing sites. Today, this is often done by the experiments: how will this work after data taking stops? How will we cope when (not if) sites no longer “support” a given experiment?
History shows that we will need many years of R&D to reach a new scale. Not all paths will be successful but we cannot postpone starting as the whole process, including the necessary service hardening, will take many years. (Decade + ?)
Future Circular Collider (FCC)
Science case: Convince me that this project is scientifically excellent. Project Plan: Convince me that you know what you are doing: scope, costs and schedule are under control. “Business case”: Convince me that this is a good use of public money.
What did the Tevatron cost? Tevatron accelerator – $120M (1983) = $277M (2012 $). Main Injector project – $290M (1994) = $450M (2012 $). Detectors and upgrades – Guess: 2 x $500M (collider detectors) + $300M (FT). Operations – Say 20 years at $100M/year = $2 billion. Total cost = $4 billion
PhD Student Training: Value of a PhD student – $2.2M (US Census Bureau, 2002) = $2.8M (2012 $). Number of students trained at the Tevatron – 904 (CDF + DØ) – 492 (Fixed Target) – 18 (Smaller Collider experiments) – 1414 total. Financial Impact = $3.96 billion
Superconducting Magnets: Current value of SC Magnet Industry – $1.5 Billion p.a. Value of MRI industry (the major customer for SC magnets) – $5 Billion p.a. This industry would probably have succeeded anyway – what we can realistically claim is that the large scale investment in this technology at the Tevatron significantly accelerated its development – Guess: one to two years faster than otherwise? Financial Impact = $5-10 billion
Balance sheet
20 year investment in Tevatron: ~ $4B
Students: $4B
Magnets and MRI: $5-10B
Computing: $40B
Total return (students + magnets and MRI + computing): ~ $50B
Very rough calculation, but it confirms our gut feeling that investment in fundamental science pays off. I think there is an opportunity for someone to repeat this exercise more rigorously, cf. STFC study of SRS Impact.
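The arithmetic behind the Tevatron figures quoted on these slides can be checked in a few lines. The sketch below only restates the slide numbers (in millions of 2012 dollars); the variable names are illustrative, and the $40B computing figure is taken as quoted because its derivation is not shown here.

```python
# Costs, in millions of 2012 dollars, as quoted on the slides.
accelerator = 277
main_injector = 450
detectors = 2 * 500 + 300      # two collider detectors plus fixed target (a guess on the slide)
operations = 20 * 100          # 20 years at $100M/year
total_cost_musd = accelerator + main_injector + detectors + operations

# PhD training: students trained times the quoted value per student ($2.8M).
students_trained = 904 + 492 + 18
student_impact_musd = students_trained * 2.8

# MRI: one to two years of a $5B/year industry gained through faster development.
mri_low_musd, mri_high_musd = 5000 * 1, 5000 * 2

computing_musd = 40000         # quoted on the balance sheet, not derived here

benefit_low_musd = student_impact_musd + mri_low_musd + computing_musd
benefit_high_musd = student_impact_musd + mri_high_musd + computing_musd

print(f"cost    ~ ${total_cost_musd / 1000:.1f}B")
print(f"benefit ~ ${benefit_low_musd / 1000:.0f}B to ${benefit_high_musd / 1000:.0f}B")
```

The result lands in the range the balance sheet rounds to "~ $50B", against an investment of roughly $4B.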
We have a good song to sing in terms of the scientific, economic and cultural benefits of these next generation data factories. Data sharing, Reproducibility and Measurable Data Management Plans are going to be key.
Certification: Next generation data factories will bring new requirements in terms of certification. Multi-site certification can be expected to be core. Today’s “best practices” will need to be extended – possibly rethought for this new scale. Room for collaboration with peta- / exa-scale practitioners, e.g. those from HEPiX WG + RDA IG ??? Push key storage sites to pursue Certification in a coordinated fashion.
Data Management Plans: Often these are “static” – revised at best every few years (and hence typically out of date with reality) – e.g. WLCG Technical Design Report. Can we switch to a “dashboard mode”, whereby the current reality can be viewed, with the appropriate level of detail, through a portal? This is something that could “come naturally”, combining existing displays from data scrubbing, migration, caching and replication with Reproducibility & Outreach views: tabs for Experts, Funding Agencies and the General Public.
We’re moving towards capturing the analysis environment so that Reproducibility is part of the Approval Process for Publication!
CERN aims for 100% Gold Open Access for all its original HEP results, experimental and theoretical, by end 2016.
Costs of Curation: Given the scale, duration and expected costs of future generation data factories, a clear understanding of the costs and benefits of curation must be built in. The costs of “bit preservation” can clearly be reduced through economies of scale, but then not much further. – Is there any other way than “state of the art”? – Around $1M/year/EB in !!! The real issues relate to manpower-intensive areas, such as knowledge capture and the ability to fully re-use the data in the long term.
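To give a feel for the order of magnitude, the sketch below combines the unit cost quoted on this slide with the "at least 0.5 EB / year" intake from the HL-LHC outlook. It assumes, purely for illustration, that the unit cost stays flat over the whole period and that nothing is ever deleted; the names are hypothetical.

```python
UNIT_COST_MUSD_PER_EB_YEAR = 1.0  # "around $1M/year/EB", assumed constant over time
INTAKE_EB_PER_YEAR = 0.5          # "at least 0.5 EB / year" of new RAW data
YEARS_OF_DATA_TAKING = 10

def yearly_costs(intake_eb, unit_cost, years):
    """Cost (in $M) of holding the archive in each year as it grows by a fixed intake."""
    return [unit_cost * intake_eb * (year + 1) for year in range(years)]

costs = yearly_costs(INTAKE_EB_PER_YEAR, UNIT_COST_MUSD_PER_EB_YEAR, YEARS_OF_DATA_TAKING)
total_musd = sum(costs)

print(f"year 1: ${costs[0]:.1f}M, year {YEARS_OF_DATA_TAKING}: ${costs[-1]:.1f}M")
print(f"cumulative over data taking: ${total_musd:.1f}M")
```

Under these assumptions the bit-preservation bill stays in the tens of millions of dollars over a decade, which is consistent with the point that the harder costs lie in the manpower-intensive areas.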
Reproducibility: It is exciting to see such key issues being addressed by “grass roots” initiatives, such as the recent RDA BoF in this area, with many experts involved! – Leading hopefully to an Interest Group and concrete outcomes – Maybe a “specific” call once mature? We have much to learn by sharing expertise and not repeatedly re-inventing wheels…
opportunities/digital-data-management / “The focus of this statement is sharing and preservation of digital research data.” All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements: 1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
Surely we can address these generic (scientific) requirements together, using at least some common services: SCIDIP-ES outputs, CernVM[FS], Zenodo / Invenio, … A joint VRE (R&D) proposal?
2020 Vision for LTDP in HEP: Long-term – e.g. FCC timescales: disruptive change – By 2020, all archived data – e.g. that described in DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities with clear (Open) access policies and possibilities to annotate further – Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards – DPHEP portal, through which data / tools are accessed. “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable. Agree clear targets & metrics with Funding Agencies.
Summary: Next generation data factories will bring with them many challenges for computing, networking and storage. Data Preservation – and management in general – will be key to their success and must be an integral part of the projects: not an afterthought. We need to start a range of R&D activities now: these can bring tangible benefits to existing projects in addition to preparing us for the future.