
1 Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s). Jamie.Shiers@cern.ch, APARSEN Webinar, November 2014

2 The Story So Far… Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure (e-i/s) for Long-Term Data Preservation (LTDP) is achievable, and will hopefully be funded: built on standards, certified via agreed procedures, using the “Cream of DP services”. In parallel, Business Cases and Cost Models are increasingly well understood, working closely with Projects, Communities and Funding Agencies.

3 Open Questions
Long-term sustainability is still a technical issue:
– Let’s assume that we understand the Business Cases & Cost Models well enough…
– And (we) even have agreed funding for key aspects
But can the service providers guarantee a multi-decade service?
– Is this realistic?
– Is this even desirable?

4 4C Roadmap Messages: A Collaboration to Clarify the Costs of Curation
1. Identify the value of digital assets and make choices
2. Demand and choose more efficient systems
3. Develop scalable services and infrastructure
4. Design digital curation as a sustainable service
5. Make funding dependent on costing digital assets across the whole lifecycle
6. Be collaborative and transparent to drive down costs

5 “Observations” (unrepeatable) versus “measurements”; “records” versus “data”. Choices & decisions: some (re-)uses of data are unforeseen! No “one-size fits all”.

6 Suppose these guys can build / share the most cost-effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta-scale bit preservation? Can we ignore them?

7 H2020 EINFRA-1-2014: Managing, preserving and computing with big research data. 7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects. Clean-slate approaches to data management targeting 2020+ 'data factory' requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged.

8 Next Generation Data Factories
HL-LHC (https://indico.cern.ch/category/4863/)
– Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030 (European Strategy for Particle Physics)
SKA
– The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area
→ Typified by SCALE in several dimensions:
– Cost; longevity; data rates & volumes
– Last decades; cost O(EUR 10^9); EB / ZB data volumes

9 http://science.energy.gov/funding-opportunities/digital-data-management/
“The focus of this statement is sharing and preservation of digital research data.” All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:
1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

10 Data: Outlook for HL-LHC (Predrag Buncic, October 3, 2013, ECFA Workshop, Aix-Les-Bains). Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates. To be added: derived data (ESD, AOD), simulation, user data… → At least 0.5 EB / year (x 10 years of data taking). [Chart: projected yearly RAW data volume in PB, annotated “We are here!”]

11 Bit-preservation WG one-slider
Mandate summary (see w3.hepix.org/bit-preservation):
– Collecting and sharing knowledge on bit preservation across HEP (and beyond)
– Provide technical advice to …
– Recommendations for sustainable archival storage in HEP
Survey of large HEP archive sites carried out and presented at the last HEPiX:
– 19 sites; areas such as archive lifetime, reliability, access, verification, migration
– HEP archiving has become a reality by fact rather than by design
– Overall positive, but lack of SLAs, metrics, best practices, and long-term costing impact

12 Verification & reliability
Systematic verification of archive data ongoing:
– “Cold” archive: users only accessed ~20% of the data (2013)
– All “historic” data verified between 2010-2013
– All new and repacked data being verified as well
Data reliability significantly improved over the last 5 years:
– From annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012)
Still, room for improvement:
– Vendor-quoted bit error rates: O(10^-19 to 10^-20)
– But these only refer to media failures
– Errors (e.g. bit flips) appearing in the complete chain
~35 PB verified in 2014, no losses.
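To put these loss rates in perspective, here is a minimal back-of-the-envelope sketch (not from the slides: the 100 PB archive size is an assumed, illustrative figure; the rates are the orders of magnitude quoted above):

# Expected bits lost per year for an archive of a given size,
# under an assumed annual bit loss rate. Illustrative only.
PB = 10**15                     # bytes per petabyte (decimal convention)
archive_bytes = 100 * PB        # assumed archive size
archive_bits = 8 * archive_bytes

for label, rate in [("2009 (~1e-12)", 1e-12),
                    ("2012 (~1e-16)", 1e-16),
                    ("vendor media spec (~1e-19)", 1e-19)]:
    lost_bits = rate * archive_bits
    print(f"{label}: ~{lost_bits:.0e} bits lost per year "
          f"(~{lost_bits / 8:.0e} bytes)")

At the 2009-level rate this is on the order of hundreds of kilobits lost per year in a 100 PB archive; at the 2012-level rate it drops to a few tens of bits, which is why the improvement matters at exa-scale.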

13 “LHC Cost Model” (simplified): start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year). [Chart: projected archive volume, with the 1 EB and 10 EB levels marked]
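A minimal sketch of this growth model, under one possible reading of the slide: a linear +50 PB/year ramp-up followed by ~15% compound growth per year (equivalent to roughly +50% every 3 years). The length of the linear phase (10 years here) is an assumption, not a slide figure; the point is to show when the archive crosses the 1 EB and 10 EB marks on the chart.

# Simplified archive-growth model: 10 PB start, +50 PB/year during an
# initial linear phase, then +15% per year thereafter.
PB, EB = 1.0, 1000.0            # work in petabytes; 1 EB = 1000 PB

def archive_size_pb(years, linear_years=10):
    size = 10 * PB
    for y in range(1, years + 1):
        if y <= linear_years:
            size += 50 * PB     # linear ramp-up phase
        else:
            size *= 1.15        # compound growth phase (+15% / year)
    return size

for y in (5, 10, 20, 30, 40):
    print(f"year {y:2d}: {archive_size_pb(y) / EB:6.2f} EB")

Under these assumptions the archive passes 1 EB around year 15 and 10 EB in the early 30s, consistent with the multi-decade horizon of the slide.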

14 Case B) increasing archive growth: total cost ~$59.9M (~$2M / year).

15 Certification – Why Bother?
✚ Help align policies and practices across sites
✚ Improve reliability, eliminate duplication of effort, reduce “costs of curation” (some of this is being done via the HEPiX WG)
✚ Help address the “Data Management Plan” issue required by Funding Agencies
✚ Increase “trust” with “customers” with respect to stewardship of the data
✚ Increase attractiveness for future H2020 bids and / or to additional communities

16 2020 Vision for LT DP in HEP
Long-term – e.g. FCC timescales: disruptive change
– By 2020, all archived data – e.g. that described in the DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities, with clear (Open) access policies and possibilities to annotate further
– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
– DPHEP portal, through which data / tools are accessed
→ “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable
→ Agree with Funding Agencies on clear targets & metrics


18 Summary
Next generation data factories will bring with them many challenges for computing, networking and storage.
Data preservation – and data management in general – will be key to their success and must be an integral part of the projects: not an afterthought.
Raw “bit preservation” costs may drop to ~$100K / year / EB over the next 25 years.
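Rough arithmetic tying this summary figure to the HL-LHC volumes quoted earlier (at least 0.5 EB of RAW data per year over ~10 years of data taking); the resulting ~5 EB archive size is my extrapolation, not a slide figure:

# Projected annual bit-preservation cost at ~$100K per EB per year.
cost_per_eb_per_year = 100_000      # USD, projected long-term figure from the summary
raw_eb_per_year = 0.5               # from the HL-LHC outlook slide
years_of_data_taking = 10

archive_eb = raw_eb_per_year * years_of_data_taking
annual_cost = archive_eb * cost_per_eb_per_year
print(f"~{archive_eb:.0f} EB of RAW data -> ~${annual_cost:,.0f} per year "
      "for bit preservation alone")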

19 3 Points to Take Away: 1. Efficient; 2. Scalable; 3. Sustainable. → A (small-ish) network of certified, trusted digital repositories can address all of these.

