Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference,


1 Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference, November 2011

2 Overview Introduction – What is STFC? What do we do? Why do we do it? An example project (Practice) – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES RCUK Data Principles (Policy)

3 What is STFC? Programme includes: Neutron and Muon Source, Synchrotron Radiation Source, Lasers, Space Science, Particle Physics, Computing and Data Management, Microstructures, Nuclear Physics, Radio Communications. ESRF & ILL, Grenoble; Daresbury Laboratory; Square Kilometre Array; Large Hadron Collider.

4 What is the science?

5 ©John Womersley/Keith Jeffery/STFC Tomorrow’s Digital Infrastructure for Science APA Conference 2007 John Womersley/Keith Jeffery

6 The Innovation Lifecycle The Body of Knowledge The Government Process The Research Process Aggregation of Knowledge lies at the heart of the innovation lifecycle Enabling Knowledge Creation Enabling Wealth Creation Quality Assessment Strategic Direction Improved Quality of Life Improved Understanding Data and the Research Process

7 Data centric view of research Data Creation Archival Access Storage Compute Network Services Curation the researcher acts through ingest and access Virtual Research Environment the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure

8 The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Data

9 Creation Linked systems for: Proposal submission, User management, Data acquisition. Metadata carried from each system to the next. Detectors moving from Hz to kHz, MHz, GHz, ... Examining the detectors on MAPS instrument on ISIS. Barrel toroid magnet and detector module from ATLAS at CERN.

10 Capacity Currently store about 7 PetaBytes of data: LHC experiments ~2 PB; Active file system based HSM ~2 PB (facilities); Dark archive ~2 PB (facilities and non-STFC services). 1 PB = 10^15 Bytes = a Billion Floppies = a Million CDs = a Thousand PCs (of today’s). 10 x Moore’s Law (2 years); 2 x Moore’s Law (1.5 years); Moore’s Law: x1000 in 13 years, doubling every 1.3 years.
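
The "x1000 in 13 years, doubling every 1.3 years" figure can be checked with a little arithmetic. This is a minimal sketch that assumes steady exponential growth; the function names `doubling_time` and `growth_after` are illustrative and not from the talk.

```python
import math

PETABYTE = 10 ** 15  # bytes, as stated on the slide


def doubling_time(growth_factor: float, years: float) -> float:
    """Years needed to double, given growth by `growth_factor` over `years`."""
    if growth_factor <= 1 or years <= 0:
        raise ValueError("growth_factor must exceed 1 and years must be positive")
    return years / math.log2(growth_factor)


def growth_after(years: float, doubling_years: float) -> float:
    """Multiplicative growth after `years` when the quantity doubles every `doubling_years`."""
    return 2 ** (years / doubling_years)


if __name__ == "__main__":
    # x1000 in 13 years corresponds to doubling roughly every 1.3 years.
    print(round(doubling_time(1000, 13), 2))  # 1.3
```

Because 1000 is close to 2^10, ten doublings in 13 years gives the 1.3-year figure quoted on the slide.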

11 Computation Compute intensive components on the grid Computational applications for Laser Theory Group’s adoption of HPC Laser real-time diagnostics & data flow pipeline. Fitting of experimental data to model

12 Curation Complexity of Facility Archives All ISIS data (~25 years) > 3,000,000 files All Diamond Data (~2.5 years) > 11,000,000 files Breadth of Data Sources: STFC (Tier 1) NERC (BADC, NEODC) BBSRC (All institutes) AHRC MRC Others: – The STFC Data Portal – The STFC Publications Archive – The CCPs (Collaborative Computational Projects) – The Chemical Database Service – The Digital Curation Centre – The EUROPRACTICE Software service – The HPCx Supercomputer – The JISCmail service – The NERC Datagrid – The Starlink Software suite – The UK Grid Support Centre – The World Data Centre for Solar-Terrestrial Physics Atlas Datastore Tape Robot The StorageTek tape robot with capacity of 20PB

13 Collection Proposal: scientist submits application for beamtime. Approval: facility committee approves application. Scheduling: facility registers, trains, and schedules scientist’s visit. Experiment: scientist visits, facility runs experiment. Data cleansing: raw data filtered and cleansed. Record Publication: subsequent publication registered with facility. Data analysis: tools for processing made available.
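
The collection workflow, together with the earlier point that metadata is carried from each system to the next, can be sketched as a record that moves through ordered stages. The stage names are the six headings on this slide (the "Data analysis" box is left out for brevity); the class, field names and example values are hypothetical and do not describe any STFC system.

```python
from dataclasses import dataclass, field

# Stage names follow the slide; everything else here is a hypothetical illustration.
STAGES = [
    "Proposal",
    "Approval",
    "Scheduling",
    "Experiment",
    "Data cleansing",
    "Record publication",
]


@dataclass
class ExperimentRecord:
    proposal_id: str
    stage_index: int = 0
    metadata: dict = field(default_factory=dict)

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self, **new_metadata) -> str:
        """Move to the next stage, keeping earlier metadata and adding new items."""
        if self.stage_index == len(STAGES) - 1:
            raise ValueError("record is already at the final stage")
        self.metadata.update(new_metadata)
        self.stage_index += 1
        return self.stage


if __name__ == "__main__":
    record = ExperimentRecord("EXAMPLE-1", metadata={"investigator": "A. Scientist"})
    record.advance(approved_by="facility committee")
    record.advance(visit_scheduled=True)
    print(record.stage, record.metadata)
```

The point of the design is that nothing is re-entered by hand: each stage only adds to the metadata it inherited.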

14 Communication Immense Expectations ! Web enables: – access to everything Everything on-line Interlinking enables: – Validation of results – Repetition of experiment Discovery enables: – new knowledge from old Archiving enables: – Unplanned reuse of data Antarctic environmental data – Reuse of knowledge One paper has >20,000 downloads – Completing the cycle Publications entered in next proposal STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years “The web has changed everything...”

15 Collaboration Technology integration facilitates scientific collaboration: Cross facility/beamline, Cross disciplinary. Technology integration improves facility efficiency. PaN-data – Photon and Neutron Data infrastructure. ICAT also used in Australian Synchrotron and Oak Ridge National Lab

16 Overview Introduction – What is STFC? What do we do? Why do we do it? An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES RCUK Data Policy Principles

17 Scientific Data Landscape of Initiatives – results from call9 VREs. Diagram layers: Tools for virtual research environments; Generic services, storage and computation; OA participatory infrastructure. Scientific Data (Discipline Specific): Agriculture, Environment, Physics, Engineering, Biology, Medicine, Atmosphere/Space Physics, Social Sciences; Other Data. Diagram labels: Researcher 1, Researcher 2, Scientific World, Non Scientific World, Aggregated Data Sets (Temporary or Permanent), Workflows, Aggregation Path. Initiatives: transPLANT, EUDAT, AgINFRA, iMarine, OPENAire Plus, diXa, SCIDIP-ES, ESPAS, ENGAGE, PaNdataODI

18 PaN-data Partners PaN-data brings together 11 major European Research Infrastructures. PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK. ISIS is the world’s leading pulsed spallation neutron source. ILL operates the most intense slow neutron source in the world. PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser. HZB operates the BER II research reactor and the BESSY II synchrotron. CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor. ESRF is a third generation synchrotron light source jointly funded by 19 European countries. Diamond is a new 3rd generation synchrotron funded by the UK and the Wellcome Trust. DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser. Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007. ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser. ALBA is a new 3 GeV synchrotron facility due to become operational in 2010. JCNS: Juelich Centre for Neutron Science. MaxLab: Max IV Synchrotron. The partners operate hundreds of instruments used by over 30,000 scientists each year

19 PaNdata Vision EDNS - European Data Infrastructure for Neutron and Synchrotron Sources. Facility 1, Facility 2 and Facility 3 each with their own Raw Data, Data Analysis, Analysed Data, Publication Data and Publications: Different Infrastructures ⇒ Different User Experiences. Raw Data Catalogue, Data Analysis, Analysed Data Catalogue, Publication Data Catalogue and Publications Catalogue over Capacity Storage, Publications Repositories, Data Repositories and Software Repositories: Single Infrastructure ⇒ Single User Experience

20 Science driver – Data Integration Neutron diffraction, X-ray diffraction and NMR combined for high-quality structure refinement

21 PaN-data Standardisation PaN-data Europe is undertaking 5 standardisation activities: 1. Development of a common data policy framework 2. Agreement on protocols for shared user information exchange 3. Definition of standards for common scientific data formats 4. Strategy for the interoperation of data analysis software, enabling the most appropriate software to be used independently of where the data is collected 5. Integration and cross-linking of research outputs, completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs. PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories

22 PaNdata Activities

23 ERA Open Access Sharing Initiatives (examples, etc); ERA Infrastructure Platform Initiatives (EGI, etc); PaNdata Support Action (ends 30 Nov 11): Policies and Standards; PaNdata ODI (begins end 2011): JRAs, Services and Virtual Labs. Labels shown: Users, Data, Software, Integration, Provenance, Preservation, Scalability, Policies. Virtual Labs: Powder Diff, SAXS & SANS, Tomography

24 Objectives
Objective 2 – Users: To deploy, operate and evaluate a system for pan-European user identification across the participating facilities and implement common processes for the joint maintenance of that system.
Objective 3 – Data: To deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote its integration with other catalogues beyond the project.
Objective 4 – Provenance: To research and develop a conceptual framework, defined as a metadata model, which can record the analysis process, and to provide a software infrastructure which implements that model to record analysis steps, hence enabling the tracing of the derivation of analysed data outputs.
Objective 5 – Preservation: To add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format.
Objective 6 – Scalability: To develop a scalable data processing framework, combining parallel filesystems with parallelized standard data formats (pNexus, pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing.
Objective 7 – Demonstration: To deploy and operate the services and technology developed in the project in virtual laboratories for three specific techniques, providing a set of integrated end-to-end data services.
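
The provenance objective (record analysis steps so that the derivation of an analysed output can be traced) can be illustrated with a small sketch. This is not the PaNdata metadata model; the `AnalysisStep` fields, the `trace_derivation` helper, and the file and tool names in the usage example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisStep:
    """One recorded analysis step: which software turned which inputs into which output."""
    output: str
    inputs: tuple
    software: str


def trace_derivation(target: str, steps: list) -> list:
    """Return the recorded steps that led to `target`, earliest first."""
    by_output = {step.output: step for step in steps}
    ordered, seen = [], set()

    def visit(name: str) -> None:
        step = by_output.get(name)
        if step is None or name in seen:
            return  # raw data (no producing step) or already handled
        seen.add(name)
        for parent in step.inputs:
            visit(parent)
        ordered.append(step)  # appended only after everything it depends on

    visit(target)
    return ordered


if __name__ == "__main__":
    steps = [
        AnalysisStep("reduced.nxs", ("raw_001.nxs",), "reduce-tool"),
        AnalysisStep("fit.json", ("reduced.nxs", "model.cfg"), "fit-tool"),
    ]
    for step in trace_derivation("fit.json", steps):
        print(step.software, "->", step.output)
```

Anything with no recorded producing step is treated as raw input, which is where the trace stops.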

25 PaNdata ODI Joint Research Activities; PaNdata ODI Service Activities; PaNdata ODI Service Releases; Standards from PaNdata Support Action. Labels shown: uCat, dCat, vLabs, Prov, Pres, Scale; users, data, s/w, Integ. Releases: Rel 1, Rel 2, Rel 3, Rel 4. Dates shown: Dec 2012, Jun 2013, Dec 2013, Jun 2014

26 Data The Research Lifecycle – a personal view the researcher acts through ingest and access Research Environment Creation Archival Access Storage Compute Network Data Services the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure MetaData/ Catalogues Portals User Info feed DAQ feed Data Analysis feed EGI GEANT Local resources e-Infrastructure Provenanced Research

27 Overview Introduction – What is STFC? What do we do? Why do we do it? An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES RCUK Data Policy Principles

28 RCUK Principles on Data Policy Seven (fairly) orthogonal principles: Public good, Preservation, Discoverability, Confidentiality, First use, Recognition, Costs. Bracketed on the slide under: Data, Access, Rights

29 Motivation Data are a critical output of the research process: For the integrity, transparency and robustness of the research record Often value increases through aggregation Enables new research questions to be addressed Supports the wider exploitation of data

30 Repeat, Repeal, Repurpose Why might we want access to data? Three distinct reasons for sharing data: Repeat - Validation of previous analysis – How does this fit with peer review? Reconsider/Reform/Repeal/Reverse - Alternative hypotheses in the same field – c.f. Reuse – How does this fit with “right” to first use? Repurpose - New research in another field – c.f. Recycle – How does this fit with recognition of Intellectual contribution? (What’s in it for me?) Different concerns and requirements for each type of sharing

31 Data are a Public Good Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property. Public good – is nonrival and non-excludable [Wikipedia]: nonrival, consumption by one does not reduce availability for others; non-excludable, no one can be effectively excluded from using. Research Data: recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings. As few restrictions as possible: Later (distinguish registration from restriction). Timely: Later (discipline specific). Responsible: Later (maximising access does not necessarily maximise research benefit). Intellectual Property: Later (balance contribution from sharing and from primary research)

32 Data should be managed Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future research Policies and Plans DMPs should exist for all data Institutional/Departmental v. Project Standards and Best practice discipline specific Long term value eg. NERC has the concept of the Data Value Checklist. Future research (by current and future generations) Don’t lose it by accident

33 Data should be discoverable To enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data. Effectively reused by others: for validation (the data behind the graph – and derivation/provenance?); for reworking (alternative hypothesis); for repurposing (same data or data merging). Sufficient Metadata ... openly available (stronger than for data). Understand the re-use potential. Discoverable – repository? registration? Published results should include (pointers): could be an email address (but note longevity requirement)

34 Data may be protected RCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. Legal Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) Some Case law: UEA “Climategate” and a few others - can go either way ICO developing Guidelines on FoI Protection of Freedoms Bill Ethical Consent, Privacy, National Security, Consent eg longitudinal cohort studies – protection of cohort participation Commercial – shared funding, patent pending, commercial in confidence... all stages in the research process... Plan in proposal – peer reviewed external review of access requests

35 Originators may have first use To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils. “... may be entitled ...”: c.f. will be entitled. “... limited period of privileged use ...”: “Withholding of data without good reason beyond this period is not acceptable”. “... to publish the results of their research ...”: to work on and publish the results. “... the length of this period varies ...”: individual research councils’ policies elaborate

36 Reusers have responsibilities In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed. “... abide by the terms and conditions ...”: terms and conditions may exist to monitor use to promote terms and conditions of use. “... should acknowledge the sources of their data ...”: Data citation; c.f. should be required to acknowledge ...

37 Data sharing is not free It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds. Data sharing costs! Marginal cost may be small but initial cost may be high. Even if data are free at the point of use (which they may not be) there are costs behind. Cost is fundable by RCs - but needs tensioning against other research. The policies, models and mechanisms for managing and providing access to research data: “Funders will work with grant holders to ...” No prescription on how funding should be raised. You can ask for funding ... but be reasonable

38 Outcomes Ensure the continuing availability of data of long-term value Facilitate the development of mechanisms to improve the management, accessibility and attribution of data Promote awareness of legislation and guidance relating to the management and dissemination of information

39 The Seven Ps Public good → Public good; Preservation → Preservation; Discoverability → Promotion; Confidentiality → Protection & Privacy; First use → Privilege; Recognition → Probity; Costs → Price

40 Overview Introduction – What is STFC? What do we do? Why do we do it? An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES RCUK Data Policy Principles a final thought

41 The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Permanent Access Provenanced Research The Knowledge Lifecycle Data Creation Archival Access Storage Compute Network Services Curation

42 Thank You

43 Building an Open Data Infrastructure for Science Policy and Practice Juan Bicarregui STFC e-Science APA conference, November 2011
