
Linking raw data with scientific workflow and software repository: some early experience in PanData-ODI Erica Yang, Brian Matthews Scientific Computing Department (SCD) Rutherford Appleton Laboratory (RAL) Science and Technology Facilities Council (STFC), U. K.

BACKGROUND

STFC Rutherford Appleton Laboratory (RAL): CLF, Diamond, ISIS, SCD

What we do – Scientific Computing Department:
- Data management infrastructure, systems and software
- Supercomputing, HPC, storage, monitoring and archiving infrastructure, services and software
- Software development (tools and technologies)
- Application/user support
- Facility sciences (e.g. ISIS, CLF, Diamond, the PanData community)
- Scientific experiments and communities (e.g. LHC, EGI, NGS, SCARF, NSCCS, CCP4, CCPi, computational sciences)
- Institutional repository, metadata catalogue, semantic web, DOIs, visualisation facilities, supporting data processing pipelines
- Large-scale compute clusters, GPU clusters, storage disk and robotic tape systems
(spanning hardware, infrastructure, applications, technologies, services and communities)

I am a computer scientist... I2S2, SRF, PanData-ODI, SCAPE, PinPoinT, TomoDM

PANDATA-ODI

PaN-data ODI – an Open Data Infrastructure for European Photon and Neutron laboratories:
- Federated data catalogues supporting cross-facility, cross-discipline interaction at the scale of atoms and molecules (neutron diffraction, X-ray diffraction, high-quality structure refinement)
- Unification of data management policies
- Shared protocols for exchange of user information
- Common scientific data formats
- Interoperation of data analysis software
- Data provenance work package: linking data and publications
- Digital preservation: supporting the long-term preservation of research outputs

(Facility) Data Continuum: Proposal → Approval → Scheduling → Experiment → Data reduction → Data analysis → Publication, underpinned by the Metadata Catalogue and PanSoft. Services on top of ICAT: DOI, TopCAT, Eclipse ICAT Explorer...

Data Continuum: the proposal, approval, scheduling, experiment and data reduction stages are well developed and supported (Metadata Catalogue, PanSoft). The data analysis and publication stages are with users; although very useful for data citation, reuse and sharing, they are very difficult to capture, and practices vary from individual to individual and from institution to institution.

Prior Experience (credits: Martin Dove, Erica Yang, Nov. 2009): raw data → derived data → resultant data.

Can be difficult to capture... Credits: Alan Soper (Jan. 2010)

Data Provenance: Case Studies (so far)

Description | Facility | Likely stages of the continuum
1. Automated SANS2D reduction and analysis | ISIS | Experiment/sample preparation; raw data collection; data reduction; data analysis
2. Tomography reconstruction and analysis | Manchester Uni./DLS | Raw data; reconstructed data; analysed data
3. EXPRESS Services | ISIS | Experiment preparation; raw data collection; data analysis to standard final data product
4. Recording publications arising from a proposal | ISIS | Proposal system; raw data collection; publication recording
5. DAWN + iSpyB | DLS, ESRF | Experiment preparation; data analysis steps

These case studies have given us unique insights into today's facilities... Looking for more case studies...

TODAY’S FACILITIES – FACTS

Diamond and ISIS (the story so far...)
Diamond:
- ~290 TB and 120 million files [1]
- In the SCD data archive
- Largest file: ~120 GB (tomography beamlines [3])
ISIS [2]:
- ~16 TB and 11 million files
- In the ISIS data archive and the SCD data archive (ongoing)
- Largest file: ~16 GB (the WISH instrument)

Diamond Tomography:
- Up to 120 GB/file, one file every 30 minutes
- 6,000 TIFF images/file
- Up to 200 GB/hr, ~5 TB/day
- 1-3 days/experiment
How many images are there for each experiment? (See the estimate below.)
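A back-of-the-envelope answer to that question, using only the figures on this slide; the assumption of continuous acquisition (one file every 30 minutes around the clock) is mine, so real counts will be lower.

```python
# Rough estimate of TIFF image counts per tomography experiment,
# using the per-file figures quoted above. Assumes continuous
# acquisition, which overestimates a real beamline duty cycle.
IMAGES_PER_FILE = 6_000      # TIFF images per file
FILES_PER_HOUR = 2           # one ~120 GB file every 30 minutes

for days in (1, 3):
    files = FILES_PER_HOUR * 24 * days
    images = files * IMAGES_PER_FILE
    print(f"{days} day(s): ~{files} files, ~{images:,} images")

# 1 day(s): ~48 files, ~288,000 images
# 3 day(s): ~144 files, ~864,000 images
```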

ISIS:
- Peak data copy rate (off instrument) of 100 MB/s, expected to reach 500 MB/s-1 GB/s (but not soon)
- Data volumes expected to grow to 10 TB/cycle in 3-5 years
- Becoming interested in centrally hosted services (WISH)

WHAT DOES IT MEAN?

It means...
- Due to the volume, it is not cost effective to transfer the (raw + reconstructed) data back to home institutions or elsewhere for processing: the network bandwidth to universities means it would take a long time to transfer (see the transfer-time sketch below), so users have to physically take data home on a storage drive.
- It is impossible for users to do the reconstruction or analysis on their own computer/laptop. How much RAM do you have on your laptop, and how big is the file from WISH? (The Mantid story...)
- It is expensive to re-do the analysis at the home institution because of a lack of hardware resources, a lack of metadata (a large number of files often means there is not much useful metadata per file), and a lack of expertise (e.g. parallel processing, GPU programming), and this assumes the software is open source...
- Facilities become interested, again, in centralised computing services, right next to the data: the ISIS WISH story; the Diamond GPU cluster vs. the SCD GPU cluster (directly linked to the data archive).
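To make the transfer-time point concrete, a rough calculation for one day of tomography data (~5 TB, from the Diamond figures earlier); the link speeds used are illustrative assumptions, not figures quoted in the talk.

```python
# Illustrative transfer times for one day of tomography data (~5 TB,
# from the Diamond figures above) over link speeds that are assumptions
# here, not figures from the talk.
DATA_TB = 5
DATA_BITS = DATA_TB * 1e12 * 8   # decimal terabytes to bits

for name, mbit_per_s in [("100 Mbit/s", 100), ("1 Gbit/s", 1000)]:
    seconds = DATA_BITS / (mbit_per_s * 1e6)
    print(f"{name}: ~{seconds / 3600:.1f} hours (ideal, no protocol overhead)")

# 100 Mbit/s: ~111.1 hours
# 1 Gbit/s:   ~11.1 hours
```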

Users now ask for remote services
- Users are interested in remote data analysis services: "... Of course this would mean a step change in the facilities provided and the time users spend at the facility...."
- In response, we are developing "TomoDM"...
What benefits can remote services bring?
- Systematic recording of the data continuum, allowing the recording of scientific workflows, software, and data provenance (the first four categories of data as defined in the "Living Publication" [4])
- Driving data processing (reduction and analysis) with data provenance
- It is not only possible to create bi-directional links between raw data and publications; it is also possible to systematically create pair-wise bi-directional links between raw, derived and resultant data and publications.

TWO EXAMPLES

SRF – an automated data processing pipeline for ISIS. Samples and the experiment setup are recorded in SampleTracks; data acquisition produces raw data; an OpenGenie script performs data reduction to give reduced data; model fitting then produces analysed data. Raw and reduced data are registered in the ICAT data catalogue, and results are written up as blog posts in LabTrove. (A provenance sketch follows below.)
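A minimal sketch of the kind of pair-wise provenance record such a pipeline could keep alongside its catalogue entries; the class, field names and run identifier are hypothetical illustrations, not the actual SRF or ICAT schema.

```python
# Minimal sketch of pair-wise provenance links between pipeline stages.
# The class, field names and run identifier are hypothetical, not the
# actual SRF or ICAT data model.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str                                   # e.g. a catalogue dataset name
    kind: str                                   # "raw", "reduced" or "analysed"
    derived_from: list = field(default_factory=list)
    produced_by: str = ""                       # script or software that produced it

raw = DatasetRecord("SANS2D_run_00001", "raw")
reduced = DatasetRecord("SANS2D_run_00001_reduced", "reduced",
                        derived_from=[raw.name],
                        produced_by="OpenGenie reduction script")
analysed = DatasetRecord("SANS2D_run_00001_fit", "analysed",
                         derived_from=[reduced.name],
                         produced_by="model fitting")

for record in (raw, reduced, analysed):
    print(record)
```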

Links between metadata and files: for both raw and reconstructed/processed data, a small HDF5 file in NeXus format (e.g. 2 MB) holds the metadata plus a pointer to the actual data, which sits in a large HDF5 blob (e.g. 120 GB); the NeXus metadata file for the processed data also carries a pointer back to the raw data. A DOI service links the raw and processed data. (See the external-link sketch below.)
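A small sketch of this pattern using HDF5 external links via h5py: a lightweight NeXus-style metadata file points at the frames stored in a large blob file. The file names, group paths and sizes here are illustrative, not the facility's actual layout.

```python
# Sketch: a small metadata file (NeXus-style HDF5) pointing at a large
# raw-data HDF5 via an external link. File names and group paths are
# illustrative only.
import h5py
import numpy as np

# The large "blob" file holding the actual image frames.
with h5py.File("raw_blob.h5", "w") as blob:
    blob.create_dataset("/entry/data/frames",
                        data=np.zeros((10, 64, 64), dtype="uint16"))

# The small metadata file: experiment metadata plus a pointer (external
# link) to the frames in the blob file.
with h5py.File("scan_metadata.nxs", "w") as meta:
    entry = meta.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"
    entry.create_dataset("title", data="tomography scan (illustrative)")
    entry["data"] = h5py.ExternalLink("raw_blob.h5", "/entry/data/frames")

# Reading through the metadata file resolves the link transparently,
# so downstream tools only ever need to open the small file.
with h5py.File("scan_metadata.nxs", "r") as meta:
    print(meta["entry/data"].shape)   # (10, 64, 64)
```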

But...
If we can capture all types of data:
- Samples
- Raw data
- Reduced data
- Analysed data
- Software
- Data provenance (i.e. relationships between datasets)
Facility operators do not normally:
- Perform validation and fixity checking of data files (volume vs. cost; see the checksum sketch below)
- It is actually MUCH cheaper to simply back up all images without regard for quality, relevance or even duplication, than it is to build an infrastructure for automatically analyzing millions of image files to determine which are "worth keeping". [1]
- But: lifetime of data on tape vs. lifetime of tape
Are DOIs good enough for data citation?
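For illustration, checksum-based fixity checking is one common approach to the validation gap noted above; this is a generic sketch, not the facilities' actual archive tooling.

```python
# Sketch of checksum-based fixity checking: record a SHA-256 digest when
# a file is archived and re-verify it later. Generic approach only.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, recorded_digest: str) -> bool:
    """Return True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest
```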

STFC EXPERIENCE OF USING DATACITE DOIS

STFC DOI landing page

Behind the landing page

Citation Issues
- At what granularity should data be made citable? If single datasets are given identifiers, what about collections of datasets, or subsets of data?
- What are we citing: datasets or an aggregation of datasets? STFC links DOIs to experiments on large facilities, which contain many data files; other organisations' DOI usage policies use different granularities.
- Can there be a common, or discipline-common, DOI usage policy for data?
- Citation of different types of object: software, processes, workflows...
- Private vs. public datasets/DOIs

Summary
Infrastructure services:
- Compute
- Storage
- Archiving + preservation (validation, integrity)
- Remote data processing
- DOIs
It is already happening:
- DLS is already collecting processed data (e.g. iSpyB)
- But interesting issues remain to be resolved when lots of data become available...

Acknowledgements

Project(s) | Institution(s) | People
PanData-ODI (case study) | Diamond, U.K. | Mark Basham, Kaz Wanelik
PanData-ODI (case study) | Manchester University, U.K.; Harwell Research Complex, RAL; Diamond, U.K. | Prof. Philip Withers, Prof. Peter Lee
SCAPE, SRF, PanData-ODI | Scientific Computing Department, STFC | David Corney, Shaun De Witt, Chris Kruk, Michael Wilson
I2S2, SRF | Chemistry, Southampton University, UK | Prof. Jeremy Frey, Dr. Simon Coles
I2S2 | Cambridge University, UK | Prof. Martin Dove
I2S2, PanData-ODI, SRF, SCAPE | ISIS, STFC | Tom Griffin, Chris Morton-Smith, Alan Soper, Silvia Imberti, Cameron Neylon, Ann Terry, Matt Tucker

I2S2 and SRF are funded by JISC, UK; PanData-ODI and SCAPE are funded by the EC.

References
1. Colin Nave (Diamond), "Survey of Future Data Archiving Policies for Macromolecular Crystallography at Synchrotrons", distributed July.
2. Chris Morton-Smith (ISIS), "ISIS Data rates and sizes – up to March 2012", May.
3. Mark Basham (Diamond), "HDF5 Parallel Reading and Tomography Processing at DLS", PanData-ODI DESY meeting, Hamburg, Germany, Feb.
4. John R. Helliwell, Thomas C. Terwilliger, Brian McMahon, "The Living Publication", April 2012.

Questions

Backup