Research Data Introduction Mark Scott, Richard Boardman, Philippa Reed and Simon Cox Microsoft Institute for HPC and Engineering Materials Group.


1 Research Data Introduction Mark Scott, Richard Boardman, Philippa Reed and Simon Cox Microsoft Institute for HPC and Engineering Materials Group

2 Talk Outline
1. Five Ways to Think About Research Data
2. Case Studies
3. Data Management

3 Five Ways to Think About Research Data
1. How it is created
2. Forms of research
3. Electronic representation of research
4. Size of datasets
5. The data life cycle

4 1. Research Data Creation
– Scientific experiment
– Models or simulation
– Observation
– Derived data
– Reference data

5 2. Forms of Research
– Electronic text documents
– Spreadsheets
– Digital objects, e.g. figures, videos
– Database schemas
– Database contents
– Models, algorithms and scripts
– Software configuration
– Software input and output files (pre- and post-process)

6 2. Forms of Research
– Notebooks and diaries
– Questionnaires, transcripts and codebooks
– Audiotapes and videotapes
– Photographs and films
– Specimens, samples and artefacts
– Methodologies, workflows, procedures and protocols
– Experimental results
– Metadata (data describing data)

7 3. Electronic Storage of Research Data
Category | Examples
Textual | Text files, Microsoft Word, PDF, RTF
Numerical | Excel, CSV
Multimedia | TIFF image, AVI movie, MP3 audio
Structured | CSV, database, multi-purpose (XML)
Software code | Java, C, Matlab
Software specific | 3D CAD, statistical model
Discipline specific | Chemistry’s CIF (for crystallography)
Instrument specific | Archaeology’s laser scanner files

8 4. Size of Electronic Datasets
Size category | Example
Individual large file | Raw CT data; movie
Set of small files collectively large | Individual frames of movie
Set of small files collectively small | Source code files
Individual small file | Photograph
Combinations of the above
Subjective
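The size categories above can be sketched in code. This is a minimal illustration only: the 1 GiB threshold, the function names and the "combination" and "empty" labels are assumptions, not part of the presentation, and the slide itself stresses that what counts as "large" is subjective.

```python
import os

# Hypothetical threshold: the slide stresses that "large" is subjective.
LARGE_BYTES = 1 * 1024**3  # 1 GiB, an assumption for illustration only


def classify_sizes(sizes, large_bytes=LARGE_BYTES):
    """Map a list of file sizes (in bytes) onto the slide's size categories."""
    if not sizes:
        return "empty"
    total = sum(sizes)
    if len(sizes) == 1:
        if total >= large_bytes:
            return "individual large file"
        return "individual small file"
    if any(size >= large_bytes for size in sizes):
        # At least one large file mixed with other files.
        return "combination"
    if total >= large_bytes:
        return "set of small files, collectively large"
    return "set of small files, collectively small"


def directory_sizes(path):
    """Collect the sizes of all regular files below path."""
    sizes = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            sizes.append(os.path.getsize(os.path.join(root, name)))
    return sizes
```

A call such as classify_sizes(directory_sizes("my_dataset")) would classify a folder on disk, where "my_dataset" is a placeholder path.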

9 5. Data Life Cycle
[Figure: life cycle categories and stages]

10 5. Data Life Cycle (User/Creator)

11 5. Data Life Cycle (Curator)
(Digital Curation Centre, 2010)

12 5. Stages in Data Life Cycle (Research Project)
(Humphrey, 2006)

13 Case Studies

14 Life Cycles Used in Case Studies
[Figure: life cycle categories and stages]

15 Human Genetics Case Study

16 Human Genetics Case Study Life cycle
Case Study Activity | Category | Stage
Collect DNA and sequence | Obtain | Collect
Align against reference data | Use | Pre-Process
Process data (GATK) | Use | Process
Filter data (SIFT) | Use | Post-Process
Analyse with spreadsheet | Use | Analyse
Write-up discovery in journal | Use | Publish
Upload sequences to public databases | Manage | Curate

17 Human Genetics Research Data
Source of data | Example
Reference data | Human genome reference sequence
Scientific experiment | Analysis of gene sequence
Model/simulation | Process with GATK on HPC cluster
Derived data | Aligned data produced by Novoalign
Key: Case study provides good example; Also relevant

18 Human Genetics Research Data
Forms of research | Example
Specimens, samples | The DNA sequence
Electronic text | Journal paper
Spreadsheets | Analysis of results
Models, algorithms | GATK files
Pre-process files | FASTQ file
Post-process files | Novoalign files
Key: Case study provides good example; Also relevant

19 Human Genetics Research Data
Electronic representation | Example
Discipline specific | Gene sequence data (FASTQ)
Textual | Journal paper
Numerical | Spreadsheets with analysis
Structured | Gene sequence data (FASTQ)
Key: Case study provides good example; Also relevant

20 Human Genetics Research Data
Data volumes | Example
Individual large file | Gene sequence data (FASTQ)
Individual small file | Spreadsheet; Journal paper
Key: Case study provides good example; Also relevant

21 Materials Engineering Case Study

22 Materials Engineering Case Study Life cycle
Case Study Activity | Category | Stage
Collect PD readings from fatigue test | Obtain | Collect
Manipulation/smoothing of PD readings | Use | Post-Process
Plotting of da/dN versus ∆K | Use | Analyse
Publish findings in a paper | Use | Publish

23 Materials Engineering Research Data
Source of data | Example
Scientific experiment | The fatigue test
Observations | Monitoring PD to find threshold
Derived data | Smoothed PD readings in Excel
Reference data | K values for specimen geometry and load combination; PD calibration values
Key: Case study provides good example; Also relevant

24 Materials Engineering Research Data
Forms of research | Example
Spreadsheets | Manipulated PD readings in Excel
Digital objects | Graph of da/dN versus ∆K; Images of the material’s microstructure
Models, algorithms | Smoothing algorithm
Pre-process files | Raw CSV data of PD readings
Post-process files | Manipulated PD readings in Excel
Specimens, samples | Material being tested
Key: Case study provides good example; Also relevant

25 Materials Engineering Research Data
Electronic representation | Example
Numerical | CSV and Excel files
Multimedia | Graph of da/dN versus ∆K (vector file); Material’s microstructure (bitmap file)
Key: Case study provides good example; Also relevant

26 Materials Engineering Research Data
Data volumes | Example
Individual small file | CSV file with raw PD data; Excel file
Set of small files, collectively small | Microstructure images
Key: Case study provides good example; Also relevant

27 Aerodynamics Case Study

28 Aerodynamics Case Study Life cycle
Case Study Activity | Category | Stage
Create meshes and geometries | Obtain | Pre-Process
Run simulation using Fluent | Obtain | Process
Assess quality of data graphically (TecPlot) | Obtain/Use | Post-Process
Vortex and turbulence analysis (MATLAB) | Use | Analyse
Publish findings in a paper | Use | Publish
Look after data throughout life cycle | Manage | All

29 Aerodynamics Research Data
Source of data | Example
Model/simulation | Air flow simulation in Fluent
Derived data | Data produced from simulation
Reference data | Properties of air in simulation
Key: Case study provides good example; Also relevant

30 Aerodynamics Research Data
Forms of research | Example
Electronic text | Text document describing simulation
Digital objects | TecPlot images; animations of air flow
Models, algorithms | Meshes; geometries; MATLAB scripts
Software configuration | Fluent case files
Pre-process files | Meshes; geometries
Post-process files | Fluent output
Key: Case study provides good example; Also relevant

31 Aerodynamics Research Data
Electronic representation | Example
Textual | Document describing simulation; paper
Multimedia | TecPlot images; animations of air flow
Software code | MATLAB scripts; Fluent functions in C
Software specific | Meshes; geometries
Key: Case study provides good example; Also relevant

32 Aerodynamics Research Data
Data volumes | Example
Set of small files, collectively large | Fluent output
Set of small files, collectively small | Collection of meshes, geometries and other files for Fluent simulation
Key: Case study provides good example; Also relevant

33 Chemistry Case Study

34 Chemistry Case Study Life cycle
Case Study Activity | Category | Stage
The X-ray examination of the crystal | Obtain | Collect
The extraction of h, k and l Miller indices | Obtain | Post-Process
Find a model that matches the sample | Use | Analyse
Submit new chemical to journal | Use | Publish
Upload to Crystallographic Data Centre | Manage | Curate

35 Chemistry Research Data
Source of data | Example
Derived data | Extraction of h, k and l Miller indices
Scientific experiments | X-ray examination of the crystal
Key: Case study provides good example; Also relevant

36 Chemistry Research Data
Forms of research | Example
Electronic text | Details properties of the sample
Specimens, samples | The crystalline sample
Experimental results | Raw X-ray data from diffractometer
Digital objects | Crystal structure images and videos
Pre-process files | Raw X-ray data from diffractometer
Post-process files | h, k, l data
Key: Case study provides good example; Also relevant

37 Chemistry Research Data
Electronic representation | Example
Structured | h, k, l structured text data
Software specific | Diffractometer software’s data files
Discipline specific | h, k, l data; CIF (crystallographic information file)
Multimedia | Crystal structure images and videos
Instrument specific | Diffractometer’s data files
Key: Case study provides good example; Also relevant

38 Chemistry Research Data
Data volumes | Example
Set of small files, collectively large | Diffractometer raw data
Individual small file | h, k, l structured text data
Key: Case study provides good example; Also relevant

39 Archaeology Case Study

40 Archaeology Case Study Life cycle
Case Study Activity | Category | Stage
Excavation measurements, photos, etc. | Obtain | Collect
Assess collected data; look for patterns | Use | Analyse
Publish discoveries | Use | Publish
Upload to Archaeology Data Service | Manage | Curate
Look after data throughout life cycle | Manage | All

41 Archaeology Research Data
Source of data | Example
Observations | Details and features about sites and discoveries
Reference data | Maps of an area; record of previous work on a site
Key: Case study provides good example; Also relevant

42 Archaeology Research Data
Forms of research | Example
Electronic text | Excavation diary
Spreadsheets | Details of finds (dimensions, weights)
Laboratory notebooks | Excavation diary
Audio/video tapes | Excavation site video
Photographs, films | Photographs of site
Specimens, artefacts | Discoveries from excavation
Database contents | Excavation details database
Workflows, procedures | Excavation procedures
Metadata | IPTC photographic metadata
Digital objects | Digital photogrammetry
Key: Case study provides good example; Also relevant

43 Archaeology Research Data
Electronic representation | Example
Textual | Excavation diary
Multimedia | Photogrammetry; scene visualisations
Structured | Excavation database
Numerical | Finds spreadsheets (dimensions, weights)
Software specific | ArcGIS files
Discipline specific | ARK (Archaeological Recording Kit) files
Instrument specific | Laser scanner software
Key: Case study provides good example; Also relevant

44 Archaeology Research Data
Data volumes | Example
Individual large file | Scene visualisation
Set of small files, collectively large | Digital photogrammetry
Individual small file | Excavation diary; Spreadsheets with details of finds
Key: Case study provides good example; Also relevant

45 Data Management Best Practices

46 How do you find your data again?
Choose sensible file names. Include:
– something meaningful to you (what you are doing)
– something meaningful to someone else (experiment number or project name)
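One way to apply the file-naming advice above is to build names from the same parts every time. The sketch below is an assumption for illustration: the function names, the part order, the separators and the example values are hypothetical and are not taken from the presentation.

```python
import re
from datetime import date


def slug(text):
    """Lower-case text and replace runs of non-alphanumerics with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project, experiment, description, when, extension):
    """Combine project name, experiment number, description and ISO date."""
    parts = [
        slug(project),            # meaningful to someone else
        f"exp{experiment:03d}",   # experiment number, zero-padded so names sort
        slug(description),        # meaningful to you
        when.isoformat(),         # YYYY-MM-DD sorts chronologically
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


# Hypothetical example values:
print(build_filename("Fatigue Test", 7, "PD readings raw", date(2012, 3, 14), "csv"))
# prints: fatigue-test_exp007_pd-readings-raw_2012-03-14.csv
```

Zero-padded numbers and ISO dates keep an alphabetical listing in a sensible order, which helps when a folder holds many related files.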

47 How do you find your data again?
Use a sensible folder structure (one big flat folder versus a hierarchical tree)
Use file metadata (tagging)
Consider keeping a record of your data sets in a:
– spreadsheet, or
– database, or
– logbook
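A record of data sets kept as a spreadsheet could be as simple as a CSV catalogue. The following is a minimal sketch: the column names and function names are hypothetical, chosen only to illustrate the idea.

```python
import csv
import os

# Hypothetical catalogue columns; adapt them to the project.
FIELDS = ["file_name", "project", "description", "location"]


def append_record(catalogue_path, record):
    """Append one data set record to a CSV catalogue, writing the header if new."""
    is_new = (not os.path.exists(catalogue_path)
              or os.path.getsize(catalogue_path) == 0)
    with open(catalogue_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


def find_records(catalogue_path, keyword):
    """Return rows whose description contains keyword (case-insensitive)."""
    with open(catalogue_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle)
                if keyword.lower() in row["description"].lower()]
```

Because the catalogue is plain CSV, it opens in any spreadsheet program and remains readable as text.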

48 Protecting Your Data
Backup regularly – offsite!
Follow a process that allows you to cope with versions of files (or use specialist software such as Mercurial)
Link to a publication or suitable write-up if there is one, to help others understand the data
Upload to a discipline-specific data repository if available
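A simple process for coping with versions of files can be sketched as numbered, checksum-verified copies. This is an illustration under stated assumptions: the function names and the ".v001" naming scheme are hypothetical, it is not a substitute for version-control software such as Mercurial, and the copy only counts as an offsite backup if backup_dir points at storage in another location.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_version(source, backup_dir):
    """Copy source into backup_dir as the next numbered version and verify it."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Count earlier versions, e.g. readings.v001.csv, readings.v002.csv.
    existing = list(backup_dir.glob(f"{source.stem}.v*{source.suffix}"))
    target = backup_dir / f"{source.stem}.v{len(existing) + 1:03d}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

Comparing checksums after the copy catches silent corruption, which matters most for backups that are rarely opened.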

49 Protecting Your Data
Can you still access your file in 20 years?
Try to use text files. Formatting might be lost but the data will be useful.
– So, consider exporting to CSV, XML, free-form text
Otherwise, use file formats with openly published specifications to provide some protection:
– DOCX or ODT for textual data
– XLSX or ODS for spreadsheet data
– SVG for figures (an open-standard vector format)
– PDF/A for PDFs (a standardised version of PDF)
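Exporting tabular data to CSV or XML text might look like the sketch below, which uses only the Python standard library. The function names, tag names and example values are assumptions for illustration; column names must also be valid XML element names for the XML export to be well formed.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_csv(fieldnames, rows):
    """Serialise a list of dict rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_xml(fieldnames, rows, root_tag="dataset", row_tag="record"):
    """Serialise the same rows as XML text, one element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for name in fieldnames:
            ET.SubElement(record, name).text = str(row[name])
    return ET.tostring(root, encoding="unicode")


# Hypothetical readings, for illustration only:
readings = [{"cycle": 1, "pd": 0.5}, {"cycle": 2, "pd": 0.7}]
print(rows_to_csv(["cycle", "pd"], readings))
print(rows_to_xml(["cycle", "pd"], readings))
```

Both outputs are plain text, so the values stay readable even if the software that produced the original file is no longer available.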

50 Summary
Ways of thinking about research data:
– Its source
– Forms of research
– Electronic research data
– Data volume
– Data life cycle
Case studies illustrating these categories
Manage your data and consider the long-term view
More information in the accompanying guide

51 Acknowledgements
The categorisation of research data collection was defined in Research Information Network (2008).
The forms of research data and the categorisation of electronic storage of research data were adapted from The University of Edinburgh (2011).
The following people helped with the preparation of this document:
– Andy Collins (Human Genetics case study).
– Thomas Mbuya and Kath Soady (Materials Fatigue Test case study).
– Gregory Jasion (CFD case study).
– Simon Coles (Chemistry case study).
– Graeme Earl (Archaeology case study).
– Mark Scott, Richard Boardman, Philippa Reed and Simon Cox (overall content).
We acknowledge ongoing support from the University of Southampton, Robert’s funding, Microsoft, EPSRC, BBSRC, JISC, AHRC and MRC.

52 References
Digital Curation Centre (2010), ‘DCC Curation Lifecycle Model’. URL: http://www.dcc.ac.uk/resources/curation-lifecycle-model
Humphrey, C. (2006), ‘e-Science and the Life Cycle of Research’. URL: http://datalib.library.ualberta.ca/ humphrey/lifecycle-science060308.doc
Research Information Network (2008), ‘Stewardship of digital research data: a framework of principles and guidelines’.
The University of Edinburgh (2011), ‘Defining research data’. URL: http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition
University of York (2012), ‘Archaeology Data Service’. URL: http://archaeologydataservice.ac.uk/

