Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s). APARSEN Webinar, November 2014.

Presentation transcript:

The Story So Far…
– Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure (e-i/s) for long-term data preservation (LTDP) is achievable, and will hopefully be funded
– It is built on standards, certified via agreed procedures, using the “Cream of DP services”
– In parallel, Business Cases and Cost Models are increasingly understood, working closely with Projects, Communities and Funding Agencies

Open Questions
Long-term sustainability is still a technical issue
– Let’s assume that we understand the Business Cases & Cost Models well enough…
– And (we) even have agreed funding for key aspects
But can the service providers guarantee a multi-decade service?
– Is this realistic?
– Is this even desirable?

4C Roadmap Messages
4C: a Collaboration to Clarify the Costs of Curation
1. Identify the value of digital assets and make choices
2. Demand and choose more efficient systems
3. Develop scalable services and infrastructure
4. Design digital curation as a sustainable service
5. Make funding dependent on costing digital assets across the whole lifecycle
6. Be collaborative and transparent to drive down costs

“Observations” (unrepeatable) versus “measurements”
“Records” versus “data”
Choices & decisions:
– Some (re-)uses of data are unforeseen!
No “one-size fits all”

Suppose these guys can build / share the most cost-effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta-scale bit preservation? Can we ignore them?

H2020 EINFRA: Managing, preserving and computing with big research data
7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects. Clean slate approaches to data management targeting 'data factory' requirements of research communities and large scale facilities (e.g. ESFRI projects) are encouraged.

Next Generation Data Factories
HL-LHC
– Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030 (European Strategy for Particle Physics)
SKA
– The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area
Typified by SCALE in several dimensions:
– Cost; longevity; data rates & volumes
– Last for decades; cost O(EUR 10^9); EB / ZB data volumes

opportunities/digital-data-management /
“The focus of this statement is sharing and preservation of digital research data”
All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:
1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

Data: Outlook for HL-LHC
(Predrag Buncic, ECFA Workshop, Aix-Les-Bains, October 3, 2013)
Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
To be added: derived data (ESD, AOD), simulation, user data…
– At least 0.5 EB / year (x 10 years of data taking)
[Chart: data volume in PB; annotation: “We are here!”]

Bit-preservation WG one-slider
Mandate summary (see w3.hepix.org/bit-preservation)
– Collecting and sharing knowledge on bit preservation across HEP (and beyond)
– Provide technical advice to
– Recommendations for sustainable archival storage in HEP
Survey on large HEP archive sites carried out and presented at last HEPiX
– 19 sites; areas such as archive lifetime, reliability, access, verification, migration
– HEP archiving has become a reality by fact rather than by design
– Overall positive but lack of SLAs, metrics, best practices, and long-term costing impact

Verification & reliability
Systematic verification of archive data ongoing
– “Cold” archive: users only accessed ~20% of the data (2013)
– All “historic” data verified between
– All new and repacked data being verified as well
Data reliability significantly improved over last 5 years
– From annual bit loss rates of O( ) (2009) to O( ) (2012)
Still, room for improvement
– Vendor quoted bit error rates: O( )
– But, these only refer to media failures
– Errors (e.g. bit flips) appearing in complete chain
~35 PB verified in 2014: no losses
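
As an illustration of what systematic verification of archived data can look like in its simplest form, the Python sketch below recomputes a checksum for each file and compares it with a stored value. It is not the procedure used by any HEP archive site: the choice of SHA-256, the manifest layout (path mapped to expected digest) and the names `file_digest` and `verify_archive` are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_archive(manifest):
    """Check files against a manifest (hypothetical layout: path -> digest).

    Returns the list of paths that are missing or whose current digest
    differs from the recorded one, e.g. after a bit flip.
    """
    failures = []
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file() or file_digest(p) != expected:
            failures.append(str(path))
    return failures
```

Reading in chunks keeps memory use flat for large files. A check of this kind catches errors introduced anywhere in the chain, not only media failures, because it compares what is read back with what was recorded at ingest.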

“LHC Cost Model” (simplified)
Start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year)
[Chart: projected archive volume, with axis marks at 1 EB and 10 EB]
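
The growth rule on this slide can be read in more than one way. The Python sketch below takes one possible reading as an assumption: the archive starts at 10 PB, the first year adds 50 PB, and the yearly intake then grows by 15% per year (compounded, this is close to +50% every 3 years, since 1.15^3 ≈ 1.52). The function name `archive_volume_pb` and its parameters are hypothetical, and the sketch covers volume only, not cost.

```python
def archive_volume_pb(years, start_pb=10.0, intake_pb=50.0, intake_growth=0.15):
    """Return the archive size in PB at the end of each year.

    Assumed reading of the slide: the yearly intake starts at `intake_pb`
    and itself grows by `intake_growth` per year.
    """
    volumes = []
    total = start_pb
    intake = intake_pb
    for _ in range(years):
        total += intake          # add this year's new data
        volumes.append(total)
        intake *= 1.0 + intake_growth  # next year's intake is larger
    return volumes


if __name__ == "__main__":
    # Print the projected volume every five years, in EB.
    for year, volume in enumerate(archive_volume_pb(25), start=1):
        if year % 5 == 0:
            print(f"year {year:2d}: {volume / 1000:.2f} EB")
```

Under these assumptions the archive passes 1 EB after about ten years and is on the order of 10 EB after 25 years. The 25-year horizon is itself an assumption, borrowed from the summary slide.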

Case B) increasing archive growth
Total cost: ~$59.9M (~$2M / year)

Certification – Why Bother?
✚ Help align policies and practices across sites
✚ Improve reliability, eliminate duplication of effort, reduce “costs of curation”
  – Some of this is being done via HEPiX WG
✚ Help address the “Data Management Plan” issue required by Funding Agencies
✚ Increase “trust” with “customers” with respect to stewardship of the data
✚ Increase attractiveness for future H2020 bids and / or to additional communities

2020 Vision for LTDP in HEP
Long-term, e.g. FCC timescales: disruptive change
– By 2020, all archived data (e.g. that described in DPHEP Blueprint, including LHC data) easily findable, fully usable by designated communities, with clear (Open) access policies and possibilities to annotate further
– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
– DPHEP portal, through which data / tools accessed
“HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable
Agree with Funding Agencies clear targets & metrics

Summary
– Next generation data factories will bring with them many challenges for computing, networking and storage
– Data Preservation, and data management in general, will be key to their success and must be an integral part of the projects: not an afterthought
– Raw “bit preservation” costs may drop to ~$100K / year / EB over the next 25 years

3 Points to Take Away:
1. Efficient
2. Scalable
3. Sustainable
A (small-ish) network of certified, trusted digital repositories can address all of these