
1 Update on Data Preservation (CERN / WLCG Scope)
Jamie.Shiers@cern.ch – WLCG OB, June 2016
International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

2 Pre-Amble
– Data Preservation (-related) services are offered and supported with a long-term view
– Close interaction with the experiments
– On-going discussions on, and measurable, improvements
– A Status Report covering the ~3 years since the DPHEP Blueprint has been published
– The next Status Report is tentatively scheduled to provide input to the next ESPP update
– BUT …

3 TALK TOPICS

4
– F.A.I.R. and Open Data
– Data Management Plans
– Data Preservation & Certification of Trusted Digital Repositories

5 F.A.I.R. AND OPEN DATA

6 F.A.I.R. & Open Data
– People seem to have (tacitly) agreed on F.A.I.R. without understanding the details
  – See the FAIR link on the agenda page
  – See http://b2find.eudat.eu/dataset?q=ALEPH
– Recent EU communication on making “Open Data” the default from now on
– Both of these will have major implications – we need to understand them and provide feedback!
– See also the H2020 DMP guidelines (next)

7 DATA MANAGEMENT PLANS 7

8 DMP Requirements from Funding Agencies
“To integrate data management planning into the overall research plan, all proposals submitted to the Office of Science for research funding are required to include a Data Management Plan (DMP) of no more than two pages that describes how data generated through the course of the proposed research will be shared and preserved or explains why data sharing and/or preservation are not possible or scientifically appropriate. At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.”
– Similar requirements from European FAs and the EU (H2020)
(From WLCG OB, May 2014)


10 Data Management Plans
What is new in 2016:
– Projects have been refused funding due to lacking or weak DMPs
– DMPs are not that difficult to write (see below), but again we need to be actively involved in the discussion
– A workshop is scheduled for June 28–30 (perhaps the 1st of several), this one explicitly on FA / project expectations and (mis-)matches

11 What are DMPs about? – Typical Questions
– How will the data be curated and preserved? Answer: in Trusted, Certified Digital Repositories
– How will they be shared / made accessible for verification and re-use? E.g. the CERN Open Data Portal; reproducibility? (CERN Analysis Preservation, CAP)
– How will publications refer to the data – the plots in the paper, the data behind them and so forth? HEPData et al.
The “Data Policies” of the LHC experiments are actually pretty close to “Data Management Plans” in terms of goals; less so in terms of implementation.

12 H2020: Annex 1 (DMP Template)
The DMP should address the points below…
1. Data set reference and name – Identifier for the data set to be produced
2. Data set description – Description; origin; nature & scale; to whom useful; underpins a publication? similar data?
3. Standards and metadata – Reference to standards of the discipline
4. Data sharing – How will it be shared? Embargo periods? Mechanisms for dissemination, s/w and other tools for re-use, access open or restricted to groups, where is the repository? Type of repository?
5. Archiving and preservation – Description of procedures; how long will it be preserved? End volume? Costs? How will these be covered?
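The five template points above map naturally onto a machine-readable record, which makes a DMP easy to validate and keep up to date. A minimal sketch in Python – the field names and the ALEPH-flavoured values are illustrative assumptions, not part of the H2020 annex:

```python
import json

# Illustrative (not official) field names mirroring the five H2020 Annex 1 points.
# All values are hypothetical examples.
dmp = {
    "dataset_reference": "ALEPH-RAW-EXAMPLE",            # 1. reference and name
    "description": "Example raw-event data set from a LEP-era experiment.",  # 2.
    "standards_and_metadata": ["discipline file format", "Dublin Core"],     # 3.
    "sharing": {                                          # 4. data sharing
        "embargo_years": 5,
        "repository": "CERN Open Data Portal",
        "access": "open-after-embargo",
    },
    "archiving": {                                        # 5. archiving and preservation
        "retention": "indefinite",
        "expected_volume_tb": 30,
        "cost_coverage": "host laboratory",
    },
}

# A machine-readable plan can be serialised, diffed and checked automatically.
print(json.dumps(dmp, indent=2))
```

A record like this could be checked in a funding-agency pipeline simply by asserting that all five top-level points are present and non-empty.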

13 H2020 DMP Guidelines – Annex 2
Are the data:
– Discoverable,
– Accessible,
– Assessable and Intelligible,
– Usable beyond their original purpose,
– Interoperable to specific quality standards?
This looks a bit like F.A.I.R. without implementation details.

14 “Annex 3” – Other Key Questions
– What is the relationship between the data you are collecting and any existing data? (NSF)
– Requirement #2: DMPs should provide a plan for making all research data displayed in publications resulting from the proposed research open, machine-readable, and digitally accessible to the public at the time of publication.
  – This includes data that are displayed in charts, figures, images, etc.
  – In addition, the underlying digital research data used to generate the displayed data should be made as accessible as possible to the public in accordance with the principles stated above.
  – This requirement could be met by including the data as supplementary information to the published article, or through other means.
  – The published article should indicate how these data can be accessed. (DoE)
– See the November 2015 workshop at CERN: https://indico.cern.ch/event/395374/other-view?view=standard#20151109.detailed

15 CERTIFICATION – STATUS & PLANS

16 ISO 16363 Overview
ISO 16363 (based on OAIS) is divided into 3 sets of metrics, numbered following the standard itself:
1. Introduction
2. Overview
3. Organisational Infrastructure
4. Digital Object Management
5. Infrastructure & Security Risk Management
– The colours already give a clue as to potential “problem areas”
The last 3 are the metrics, themselves broken down into (sub-)sub-categories:

17 Current Status
The original idea was to perform Certification in the context of WLCG. However:
a) Quite a few of the metrics concern the (CERN) site;
b) There is also interest in an OAIS archive for “CERN’s Digital Memory”;
c) The two are linked: policies, strategies and mission statements for the former are part of the latter;
d) Some things will be easier in the latter, which will in turn help the former.
Current thinking: (self-)certify site-wise; “project-specific details” via “Project DMPs”

18 Infrastructure & Security Risk Management
5.1 Technical Infrastructure Risk Management [We do all of this, but is it documented?]
– Technology watches, h/w & s/w changes, detection of bit corruption or loss, reporting, security updates, storage media refreshing, change management, critical processes, handling of multiple data copies, etc.
5.2 Security Risk Management [Do we do all of this, and is it documented?]
– Security risks (data, systems, personnel, physical plant), disaster preparedness and recovery plans…
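“Detection of bit corruption or loss” under metric 5.1 is commonly implemented as fixity checking: record a checksum for every file at ingest, then re-verify on a schedule and repair failures from a replica. A minimal sketch in Python – the manifest layout and file names here are assumptions for illustration, not an actual WLCG tool:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large archive files never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose current checksum no longer matches the value
    recorded at ingest (i.e. candidates for repair from another data copy)."""
    damaged = []
    for rel_path, recorded in manifest.items():
        p = root / rel_path
        if not p.exists() or sha256_of(p) != recorded:
            damaged.append(rel_path)
    return damaged

# Example: record a checksum at ingest, then simulate silent bit corruption.
root = Path("archive")
root.mkdir(exist_ok=True)
(root / "run1.dat").write_bytes(b"event data")
manifest = {"run1.dat": sha256_of(root / "run1.dat")}
(root / "run1.dat").write_bytes(b"event dat4")   # flipped byte, same size
print(verify(manifest, root))                    # -> ['run1.dat']
```

The “handling of multiple data copies” item above is what makes the repair step possible: a failed check on one copy is resolved by re-fetching from a replica whose checksum still matches.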

19 Organisational Infrastructure
3.1 Governance & Organisational Viability – Mission Statement, Preservation Policy, implementation plan(s), etc. [CERN, CERN, project(s)]
3.2 Organisational Structure & Staffing – Duties, staffing, professional development, etc.
3.3 Procedural Accountability & Preservation Policy Framework – Designated communities, knowledge bases, policies & reviews, change management, transparency & accountability, etc. [At least partially projects]
3.4 Financial Sustainability – Business planning processes, financial practices and procedures, etc.
3.5 Contracts, Licenses & Liabilities – For the digital materials preserved… [CERN? Projects?]

20 Digital Object Management
4.1 Ingest: acquisition of content
4.2 Ingest: creation of the AIP
4.3 Preservation planning
4.4 AIP preservation
4.5 Information management – “FAIR” etc.
4.6 Access management
The plan is to address these after metrics 3 & 5…

21 From the bottom:
– There is no reason why we cannot “write down” (preferably in formal documents with a DOI) responses to all metrics in category 5
  – Even if in some areas, such as disaster preparedness / recovery, there is still work to do…
  – But “the suspects” are fairly well identified…
– Digital Object Management needs (much) more work, but we agreed to address this after 3 & 5
  – Maybe first in the context of the CERN Digital Memory? a) as it is an easier case & b) to give us some experience
– We need a “CERN Mission Statement” (ISO terminology) & other formal documents
  – Drafts exist – approval process?
– Data ownership, liability and IPR are another potentially complex issue

22 Example 1
– A few people have asked this year about EMC data
– Even if the bits were still available, would we know how to access them?
  – Experiment s/w – obsolete operating systems, language constructs, version of CERNLIB, etc. (Hydra, MVS, HBOOK 3, …)
– But more basically, could we even open a file?
  – “DCB information” – a few basic types – long since lost
  – Not to mention calibrations, alignments, “titles”, etc.
– The “Information Packages” are intended to capture this knowledge so that the data remain usable in the (very) long term – AFTER the original project is over
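In OAIS terms, what was lost in this example is “representation information”, and the point of an Archival Information Package is to keep it bundled with the bits. A toy sketch in Python of what such a package could record – the field names and every value are illustrative assumptions, not the OAIS schema and not real EMC metadata:

```python
# A toy Archival Information Package: the content plus everything needed
# to interpret it after the original collaboration has dissolved.
# All field names and values are illustrative, not real EMC metadata.
aip = {
    "content": ["emc_run_0042.raw"],                       # the bits themselves
    "fixity": {"emc_run_0042.raw": "sha256:<recorded-at-ingest>"},
    "representation_information": {
        # Without this, the bits cannot even be opened (the lost "DCB information"):
        "record_format": "fixed-length records; see attached format note",
        "software": {"library": "CERNLIB", "histogramming": "HBOOK 3"},
        "platform": {"os": "MVS", "framework": "Hydra"},
    },
    "context": {
        # And without this, an opened file cannot be used for physics:
        "calibrations": "pointer to calibration & alignment data sets",
        "designated_community": "HEP analysts, post-collaboration",
    },
}

# The package is only complete if interpretation knowledge travels with the data.
assert "representation_information" in aip and "context" in aip
```

The design point is that each of these blocks answers one of the failure modes on the slide: fixity covers the bits, representation information covers opening the file, and context covers calibrations and the other knowledge needed to use it.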

23 Example 2
– Documentation for the CERN Program Library was last revised around 1995
– Revisited in 2015:
  – Produce consistent PDF/A & HTML;
  – Store in a Digital Repository for the long term;
  – Add maximum meta-data (including information missing from the 1995 revision).
– This required (small) modifications to the LaTeX sources
  – Did we have the right to do it?
  – Would an external repository have had the right? The ability?
  – What about reformatting e.g. ALICE data in 2100?

24 Certification Plans
– Draft responses to the metrics have been prepared in a Wiki (DPHEP-IB e-group)
– A number of “volunteers” (really!) to help complete this have been identified
  – Idea: a “senior – junior” team for continuity
  – Cross-department to start with; then Tier1 sites?
– Some formal documents will need to be approved – this will take time!
– Aligning with other external certification activities as a sanity check

25 Certification Summary
– “The plan” was to address 3 & 5 prior to iPRES (October), at least in draft form – this is still possible
– 2017 was foreseen for “formal approval” – this too is still possible
– Digital Object Management requires more thought…
  – But it is essential if our data (of all sorts) are really to be usable in the medium and long term
  – Otherwise we end up just doing “bit preservation”
– 2018: “all” should be completed prior to ESPP++

26 Post-Amble
– Data Preservation (-related) services are offered and supported
– Close interaction with the experiments
– On-going discussions on, and measurable, improvements
– A Status Report covering the ~3 years since the DPHEP Blueprint has been published
– The next Status Report is tentatively scheduled to provide input to the next ESPP update
– BUT REALLY long-term preservation requires investment up to and beyond the end of life of the original collaboration.
– We don’t know how to solve this yet, but the steps needed to satisfy ISO 16363 should help…

27
– F.A.I.R. and Open Data: requires effort & resources
– Data Management Plans: sharing, re-use; reproducibility of results
– Data Preservation & Certification of Trusted Digital Repositories: helps address the goals above.

28 Concluding Remarks
– Data Preservation is a Journey – not a Destination
  – “Once you stop pedalling, you stop & fall off”
– Data Preservation is not an Island – it is part of a much bigger picture, including the full data lifecycle
  – You can’t share or re-use data, nor reproduce results, if you haven’t first preserved them

29 And Recommendations
– “The WLCG OB is invited to approve this expansion of scope, understanding that no delays in the Certification Process are expected as a result.”
– “The WLCG OB will review the draft documents for approval at subsequent meetings (mission statement, strategy, implementation plan etc.) and, subject to any necessary amendments, support their adoption.”
– “The WLCG OB welcomes the target of achieving certification prior to the next ESPP update.”


31 LONG TERM

32 2020 Vision for LT DP in HEP
– Long-term – e.g. FCC timescales: disruptive change
– By 2020, all archived data – e.g. that described in the DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities, with clear (Open) access policies and possibilities to annotate further
– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
– A DPHEP portal, through which data / tools are accessed
  – “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable
– Agree with Funding Agencies on clear targets & metrics

33 3.5 Will there be a need for an adjustment of CERN’s general data policy?
“CERN will establish a data policy that is in line with funding agency requirements, including in terms of Open Access (Science). This can be expected to be largely similar to that adopted by the 4 main LHC experiments, with a significant fraction of the data released after a reasonable embargo period. The duration of the embargo period and the fraction of the data to be released would be determined based on experience, resource requirements and scientific, educational and cultural benefits. Given that the total dataset of the (HL-)LHC will be in the Exabyte range, the volume of data to be released will eventually become significant and the appropriate resources must be factored into any planning. The development and use of the computing models and the infrastructure for HL-LHC does not depend on the development of this policy.”
(5 November 2015)

