APARSEN Webinar, November 2014

Presentation on theme: "APARSEN Webinar, November 2014"— Presentation transcript:

1 Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s)
Jamie.Shiers@cern.ch
APARSEN Webinar, November 2014

2 The Story So Far… Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for long-term data preservation (LTDP) is achievable – and will hopefully be funded. It would be built on standards, certified via agreed procedures, and use the “cream of DP services”. In parallel, business cases and cost models are increasingly well understood, working closely with projects, communities and funding agencies.

3 Open Questions Long-term sustainability is still a technical issue.
Let’s assume that we understand the business cases and cost models well enough… and (we) even have agreed funding for key aspects. But can the service providers guarantee a multi-decade service? Is this realistic? Is it even desirable?

4 4C Roadmap Messages A Collaboration to Clarify the Costs of Curation
Identify the value of digital assets and make choices.
Demand and choose more efficient systems.
Develop scalable services and infrastructure.
Design digital curation as a sustainable service.
Make funding dependent on costing digital assets across the whole lifecycle.
Be collaborative and transparent to drive down costs.

5 OSD@Orsay - Jamie.Shiers@cern.ch
“Observations” (unrepeatable) versus “measurements”; “records” versus “data”. Choices and decisions: some (re-)uses of data are unforeseen! No “one size fits all”.

6 Suppose these guys can build and share the most cost-effective, scalable and reliable federated storage services, e.g. for peta-/exa-/zetta-scale bit preservation. Can we ignore them?

7 H2020 EINFRA Managing, preserving and computing with big research data. Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets, scaling to zettabytes and trillions of objects. Clean-slate approaches to data management targeting “data factory” requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged.

8 Next Generation Data Factories
HL-LHC: “Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors, with a view to collecting ten times more data than in the initial design by around 2030” (European Strategy for Particle Physics).
SKA: The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area.
Both are typified by SCALE in several dimensions: cost, longevity, data rates and volumes. They will last decades, cost O(EUR 10^9), and produce EB/ZB data volumes.

9 http://science.energy
“The focus of this statement is sharing and preservation of digital research data” All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements: DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

10 Data: Outlook for HL-LHC
[Chart: projected RAW data volume in PB, with a “We are here!” marker at the present day.] Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volumes scaled by the output rates. To be added: derived data (ESD, AOD), simulation, user data… At least 0.5 EB / year (× 10 years of data taking).

11 Bit-preservation WG one-slider
Mandate summary (see w3.hepix.org/bit-preservation):
Collecting and sharing knowledge on bit preservation across HEP (and beyond).
Providing technical advice and recommendations for sustainable archival storage in HEP.
A survey of large HEP archive sites was carried out and presented at the last HEPiX: 19 sites, covering areas such as archive lifetime, reliability, access, verification and migration. HEP archiving has become a reality by fact rather than by design. The overall picture is positive, but there is a lack of SLAs, metrics, best practices and long-term costing.

12 Verification & reliability
Systematic verification of archive data is ongoing. “Cold” archive: users only accessed ~20% of the data (2013). All “historic” data has been verified, and all new and repacked data is being verified as well. Data reliability has significantly improved over the last 5 years: from annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012). Still, there is room for improvement. Vendor-quoted bit error rates are O( ), but these only refer to media failures; errors (e.g. bit flips) appear across the complete chain. ~35 PB were verified in 2014, with no losses.
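The quoted improvement in annual bit loss rates can be put in perspective with a back-of-the-envelope calculation. A minimal sketch, assuming the quoted rates are per-bit, per-year loss probabilities (the function name is mine, for illustration):

```python
# Expected annual bit loss for an archive of a given size, assuming the
# quoted rates are per-bit, per-year loss probabilities.
PB = 10**15  # bytes in a petabyte (decimal convention)

def expected_bits_lost(volume_bytes: int, annual_bit_loss_rate: float) -> float:
    """Expected number of bits lost per year for a given archive volume."""
    return volume_bytes * 8 * annual_bit_loss_rate

# 35 PB (the volume verified in 2014) at the 2012 rate of O(1e-16):
print(round(expected_bits_lost(35 * PB, 1e-16)))   # -> 28 bits/year
# The same volume at the 2009 rate of O(1e-12):
print(round(expected_bits_lost(35 * PB, 1e-12)))   # -> 280000 bits/year
```

The four-orders-of-magnitude improvement turns an expected ~280,000 lost bits per year into a few tens, for the same archive volume.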

13 “LHC Cost Model” (simplified)
Start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year). [Chart markers: 1 EB, 10 EB.]
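The growth rule above is easy to sketch as code. This is an illustrative model based on my own reading of the slide: the annual increment starts at 50 PB and itself compounds at ~15% per year, which matches “+50% every 3 years” since 1.15³ ≈ 1.52.

```python
# Simplified archive-growth model: start at 10 PB; the yearly increment
# starts at 50 PB and grows ~15% per year (~= +50% every 3 years).
def archive_size_pb(years: int) -> float:
    """Total archive volume in PB after the given number of years."""
    total = 10.0       # starting volume, PB
    increment = 50.0   # first year's new data, PB
    for _ in range(years):
        total += increment
        increment *= 1.15  # the increment itself compounds
    return total

print(round(archive_size_pb(10)))  # roughly 1 EB (~1025 PB)
print(round(archive_size_pb(25)))  # roughly 10 EB (~10650 PB)
```

Under these assumptions the archive passes ~1 EB after about a decade and ~10 EB after about 25 years, consistent with the 1 EB and 10 EB markers on the original chart.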

14 Case B: increasing archive growth
Total cost: ~$59.9M (~$2M / year)
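A trivial consistency check on the two quoted figures (the ~30-year horizon is inferred from them, not stated on the slide):

```python
# Case B totals from the slide; the implied horizon is derived, not given.
total_cost_musd = 59.9   # total cost, million USD
per_year_musd = 2.0      # annualised cost, million USD

implied_horizon_years = total_cost_musd / per_year_musd
print(round(implied_horizon_years))  # -> 30
```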

15 Certification – Why Bother?
Help align policies and practices across sites: improve reliability, eliminate duplication of effort, and reduce the “costs of curation” (some of this is being done via the HEPiX WG). Help address the Data Management Plan requirements of funding agencies. Increase “trust” with “customers” regarding stewardship of the data. Increase attractiveness for future H2020 bids and/or to additional communities.

16 2020 Vision for LTDP in HEP Long-term – e.g. FCC timescales: disruptive change. By 2020, all archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by designated communities, with clear (Open) access policies and possibilities to annotate further. Best practices, tools and services should be well run-in, fully documented and sustainable; built in common with other disciplines, based on standards. A DPHEP portal, through which data and tools are accessed: a “HEP FAIRport” (Findable, Accessible, Interoperable, Re-usable). Agree clear targets and metrics with funding agencies.

17 OSD@Orsay - Jamie.Shiers@cern.ch

18 Summary Next generation data factories will bring with them many challenges for computing, networking and storage Data Preservation – and management in general – will be key to their success and must be an integral part of the projects: not an afterthought Raw “bit preservation” costs may drop to ~$100K / year / EB over the next 25 years

19 3 Points to Take Away: Efficient; Scalable; Sustainable.
A (small-ish) network of certified, trusted digital repositories can address all of these.

