Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory.

Presentation transcript:

Fundamental Practices for Preparing Data Sets Bob Cook Environmental Sciences Division Oak Ridge National Laboratory

NACP Best Data Management Practices, February 3, 2013
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future: what does that investigator need to know to use the data?
Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations.
NRC (1991)

Proper Curation Enables Data Reuse
[Figure: information content plotted against time across the stages Planning, Collection, Assure, Documentation, and Archive, with a level marked "Sufficient for Sharing and Reuse".]

Metadata Needed to Understand Data
The details of the data …
[Diagram: a measurement linked to its date, sample ID, parameter name, and location.]
Courtesy of Raymond McCord, ORNL

Metadata Needed to Understand Data
[Diagram: the measurement record (date, sample ID, parameter name, location, method, media, generator, QA flag, units) linked to the definitions behind each field: sample, method, parameter, QA, and units definitions; generator details (organization type, name, custodian, address, etc.); location details (coordinates, elevation, type, depth, GIS); and the record system.]

Fundamental Data Practices
1. Define the contents of your data files
2. Use consistent data organization
3. Use stable file formats
4. Assign descriptive file names
5. Preserve information
6. Perform basic quality assurance
7. Provide documentation
8. Protect your data

1. Define the contents of your data files
Content flows from the science plan (hypotheses) and is informed by the requirements of the final archive.
Keep a set of similar measurements together in one file: same investigator, methods, time basis, and instrument.
– There are no hard and fast rules about the contents of each file.

1. Define the contents of your data files
Define the parameters
Ehleringer, et al. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/983

1. Define the contents of your data files
Define the parameters (cont)
Be consistent
Choose a format for each parameter,
– Explain the format in the metadata, and
– Use that format throughout the file
Use commonly accepted parameter names and units (SI units)
– For dates, use yyyymmdd; e.g., January 2, 1999 is 19990102
– Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.)
– Report in both local time and Coordinated Universal Time (UTC)
– See Hook et al. (2010) for additional examples of parameter formats
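The date and time conventions above can be sketched in a few lines of Python. This is only an illustration: the function name `format_observation` and the fixed UTC-5 local offset are assumptions made for the example, not part of the original slides.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone for illustration: a fixed UTC-5 offset.
LOCAL_TZ = timezone(timedelta(hours=-5))

def format_observation(local_dt):
    """Return the date as yyyymmdd and the time in 24-hour notation,
    in both local time and UTC."""
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "date_local": local_dt.strftime("%Y%m%d"),
        "time_local": local_dt.strftime("%H:%M"),
        "date_utc": utc_dt.strftime("%Y%m%d"),
        "time_utc": utc_dt.strftime("%H:%M"),
    }

# 1:30 p.m. local time on January 2, 1999
record = format_observation(datetime(1999, 1, 2, 13, 30, tzinfo=LOCAL_TZ))
print(record)
```

Whatever format is chosen, the metadata should state it (here: yyyymmdd, HH:MM, and the local offset from UTC).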

1. Define the contents of your data files
Site Table
Ehleringer, et al. LBA-ECO CD-02 Carbon, Nitrogen, Oxygen Stable Isotopes in Organic Material, Brazil. Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/983

2. Use consistent data organization (one good approach)
Each row in a file represents a complete record, and the columns represent all the parameters that make up the record.

Station   Date       Temp   Precip
Units     YYYYMMDD   C      mm
HOGI
HOGI
HOGI

Note: a missing value code is defined for the data set.

2. Use consistent data organization (a second good approach)
Parameter name, value, and units are placed in individual rows. This approach is used in relational databases.

Station   Date   Parameter   Value   Unit
HOGI             Temp        12      C
HOGI             Temp        14      C
HOGI             Precip      0       mm
HOGI             Precip      3       mm
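As a rough sketch of how the two layouts relate, the Python snippet below turns records in the first layout (one row per complete record) into the second (one row per measurement). The function name `wide_to_long` is hypothetical, and the dates are placeholders because the transcript does not preserve the ones on the slide.

```python
# Units for each parameter column, taken from the units row of the wide table.
UNITS = {"Temp": "C", "Precip": "mm"}

def wide_to_long(rows, units=UNITS):
    """Turn one-record-per-row data into one-measurement-per-row data."""
    long_rows = []
    for row in rows:
        for parameter, unit in units.items():
            long_rows.append({
                "Station": row["Station"],
                "Date": row["Date"],
                "Parameter": parameter,
                "Value": row[parameter],
                "Unit": unit,
            })
    return long_rows

# The dates below are hypothetical placeholders.
wide = [
    {"Station": "HOGI", "Date": "20000101", "Temp": 12, "Precip": 0},
    {"Station": "HOGI", "Date": "20000102", "Temp": 14, "Precip": 3},
]
long_rows = wide_to_long(wide)
for entry in long_rows:
    print(entry)
```

Either layout is workable; what matters is choosing one and keeping it consistent throughout the file.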

2. Use consistent data organization (cont)
Be consistent in file organization and formatting
– Don't change or re-arrange columns
– Include header rows (the first row should contain file name, data set title, author, date, and companion file names)
– Column headings should describe the content of each column, including one row for parameter names and one for parameter units
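One possible way to produce such a file is sketched below with Python's standard `csv` module. The helper `write_data_file` and every file name, title, author, and value in the example are made up for illustration.

```python
import csv
import io

def write_data_file(stream, file_info, names, units, records):
    """Write a file-information row, a names row, a units row, then the data."""
    if len(names) != len(units):
        raise ValueError("every column needs both a name and a unit")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(file_info)   # file name, title, author, date, companion files
    writer.writerow(names)       # one row for parameter names
    writer.writerow(units)       # one row for parameter units
    for record in records:
        if len(record) != len(names):
            raise ValueError("record does not match the column layout")
        writer.writerow(record)

# All values below are hypothetical.
buffer = io.StringIO()
write_data_file(
    buffer,
    ["hogi_met_2000.csv", "Example station data", "A. Author", "20000103", "hogi_sites.csv"],
    ["Station", "Date", "Temp", "Precip"],
    ["none", "YYYYMMDD", "C", "mm"],
    [["HOGI", "20000101", 12, 0], ["HOGI", "20000102", 14, 3]],
)
print(buffer.getvalue())
```

Checking the record length on every write is a cheap guard against columns drifting out of order.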

2. Use consistent data organization (cont)
Collaboration and Data Sharing
A personal example of bad practice…
Courtesy of Stefanie Hampton, NCEAS

3. Use stable file formats
Years of critical knowledge can be lost because modern PCs cannot always open old file formats.
Lesson: Avoid proprietary formats. They may not be readable in the future.

3. Use stable file formats (cont)
Use text-based comma-separated values (csv)
Aranibar, J. N. and S. A. Macko. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/783

4. Assign descriptive file names
Use descriptive file names: unique, reflecting the contents, ASCII characters only, and no spaces.
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: bigfoot_agro_2000_gpp.tiff
– bigfoot: project name
– agro: site name
– 2000: year
– gpp: what was measured
– tiff: file format
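The naming rules above lend themselves to a simple automated check. The sketch below is one hedged interpretation: the function `file_name_problems`, the allowed character set, and the threshold of three underscore-separated parts are assumptions chosen for illustration, not rules stated in the slides.

```python
import re

# Letters, digits, underscore, and hyphen, followed by a single extension.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")

def file_name_problems(name, min_parts=3):
    """Return a list of reasons a file name breaks the naming guidance."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if not name.isascii():
        problems.append("contains non-ASCII characters")
    if not problems and not SAFE_NAME.match(name):
        problems.append("contains unexpected characters")
    # Assumed heuristic: a descriptive name has several parts joined by "_".
    stem = name.rsplit(".", 1)[0]
    if len(stem.split("_")) < min_parts:
        problems.append("too few descriptive parts")
    return problems

for candidate in ["Mydata.xls", "2001_data.csv", "best version.txt",
                  "bigfoot_agro_2000_gpp.tiff"]:
    print(candidate, file_name_problems(candidate))
```

A check like this cannot tell whether a name is unique or truly reflects the contents; it only catches the mechanical problems.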

[Comic courtesy of PhD Comics]

4. Assign descriptive file names
Organize files logically
Make sure your file system is logical and efficient

Biodiversity
    Lake
        Experiments
            Biodiv_H20_heatExp_2005_2008.csv
            Biodiv_H20_predatorExp_2001_2003.csv
            …
        Field work
            Biodiv_H20_planktonCount_start2001_active.csv
            Biodiv_H20_chla_profiles_2003.csv
            …
    Grassland

From S. Hampton

5. Preserve information
Keep your raw data raw
No transformations, interpolations, etc., in the raw file

Raw data file: Giles_zoopCount_Diel_2001_2003.csv

    TAX   COUNT   TEMPC
    C
    F
    M
    F     0       11.9
    C
    F
    M
    N
    …

Processing script (R):

    ### Giles_zoop_temp_regress_4jun08.r
    ### Load data
    Giles <- read.csv("Giles_zoopCount_Diel_2001_2003.csv")
    ### Look at the data
    Giles
    plot(COUNT ~ TEMPC, data=Giles)
    ### Log-transform the dependent variable (y+1)
    Giles$Lcount <- log(Giles$COUNT + 1)
    ### Plot the log-transformed y against x
    plot(Lcount ~ TEMPC, data=Giles)

From S. Hampton

5. Preserve information (cont)
Use a scripted language to process data
– R statistical package (free, powerful)
– SAS
– MATLAB
Processing scripts are records of processing
– Scripts can be revised and rerun
Graphical user interface-based analyses may seem easy, but they don't leave a record
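The same idea can be sketched in other scripted languages. The Python example below applies a log(COUNT + 1) transform and writes the result to a new file, leaving the raw file untouched. The function name `add_log_count` and the temporary file names are assumptions for the example; the column names and the single data row follow the raw file shown on the previous slide.

```python
import csv
import math
import os
import tempfile

def add_log_count(raw_path, derived_path):
    """Read the raw file, add Lcount = log(COUNT + 1), and write a NEW file.
    The raw file is only ever opened for reading."""
    with open(raw_path, newline="") as raw:
        reader = csv.DictReader(raw)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    for row in rows:
        row["Lcount"] = math.log(float(row["COUNT"]) + 1)
    with open(derived_path, "w", newline="") as derived:
        writer = csv.DictWriter(derived, fieldnames=fieldnames + ["Lcount"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Demonstration with a throwaway copy of one raw record.
with tempfile.TemporaryDirectory() as folder:
    raw_path = os.path.join(folder, "raw_counts.csv")
    derived_path = os.path.join(folder, "derived_counts.csv")
    raw_text = "TAX,COUNT,TEMPC\nF,0,11.9\n"
    with open(raw_path, "w", newline="") as handle:
        handle.write(raw_text)
    result = add_log_count(raw_path, derived_path)
    with open(raw_path, newline="") as handle:
        raw_after = handle.read()

print(result[0]["Lcount"], raw_after == raw_text)
```

Because the script itself records each step, rerunning it on the raw file reproduces the derived file exactly.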

6. Perform basic quality assurance
Assure that data are delimited and line up in proper columns
Check that there are no missing values (blank cells) for key parameters
Scan for impossible and anomalous values
Perform and review statistical summaries
Map location data (lat/long) and assess errors
There is no better QA than to analyze the data
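Two of these checks, blank key fields and impossible values, are easy to script. The following is a minimal sketch under stated assumptions: the function `qa_report`, the field names, and the valid ranges are all hypothetical and would need to be replaced with values appropriate to the data set.

```python
def qa_report(rows, key_fields, valid_ranges, missing_code=None):
    """Return (row number, field, problem) tuples for a list of dict records."""
    problems = []
    for number, row in enumerate(rows, start=1):
        # Key parameters must not be blank.
        for field in key_fields:
            if str(row.get(field, "")).strip() == "":
                problems.append((number, field, "missing value"))
        # Numeric parameters must fall inside a plausible range.
        for field, (low, high) in valid_ranges.items():
            value = row.get(field)
            if value in ("", None) or value == missing_code:
                continue
            try:
                numeric = float(value)
            except (TypeError, ValueError):
                problems.append((number, field, "not numeric"))
                continue
            if not low <= numeric <= high:
                problems.append((number, field, "out of range"))
    return problems

# Hypothetical records and ranges for illustration.
rows = [
    {"Station": "HOGI", "Date": "20000101", "Temp": "12", "Lat": "35.9"},
    {"Station": "", "Date": "20000102", "Temp": "140", "Lat": "35.9"},
    {"Station": "HOGI", "Date": "20000103", "Temp": "14", "Lat": "359"},
]
problems = qa_report(rows, ["Station", "Date"],
                     {"Temp": (-90, 60), "Lat": (-90, 90)})
for problem in problems:
    print(problem)
```

Automated checks complement, rather than replace, plotting and mapping the data.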

6. Perform basic quality assurance (cont)
Place geographic data on a map to ensure that geographic coordinates are correct.

6. Perform basic quality assurance (cont)
Plot information to examine outliers
[Figure: NACP Site Synthesis Model-Observation Intercomparison. Model X uses UTC time; all others use Eastern Time.]
Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

6. Perform basic quality assurance (cont)
Plot information to examine outliers
[Figure: NACP Site Synthesis Model-Observation Intercomparison.]
Data from the North American Carbon Program Site Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

7. Provide Documentation / Metadata
What does the data set describe?
Why was the data set created?
Who produced the data set, and who prepared the metadata?
When and how frequently were the data collected?
Where were the data collected, and with what spatial resolution? (Include the coordinate reference system.)
How was each parameter measured?
How reliable are the data? What is the uncertainty and measurement accuracy? What problems remain in the data set?
What assumptions were used to create the data set?
What is the use and distribution policy of the data set? How can someone get a copy of the data set?
Provide any references to use of the data in publication(s)

8. Protect data
Ensure that file transfers are done without error
– Compare checksums before and after transfers
Example tools to generate checksums
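The slide's own list of checksum tools is not preserved in this transcript. As one illustration, Python's standard `hashlib` module can compute a checksum before and after a transfer. The helper names `file_checksum` and `transfer_ok`, and the choice of SHA-256, are assumptions for this sketch.

```python
import hashlib
import os
import shutil
import tempfile

def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_ok(source_path, copy_path):
    """True when the copy has the same checksum as the source."""
    return file_checksum(source_path) == file_checksum(copy_path)

# Demonstration: copy a file, verify it, then damage the copy and verify again.
with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "source.csv")
    copy = os.path.join(folder, "copy.csv")
    with open(source, "w") as handle:
        handle.write("Station,Date,Temp\nHOGI,20000101,12\n")
    shutil.copyfile(source, copy)
    intact = transfer_ok(source, copy)
    with open(copy, "a") as handle:
        handle.write("corrupted")
    damaged = transfer_ok(source, copy)
    checksum = file_checksum(source)

print(intact, damaged)
```

Storing the checksum alongside the data lets a recipient repeat the comparison later.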

8. Protect data (cont)
Create back-up copies often
– Ideally three copies: the original, one on-site (external), and one off-site
– Frequency based on need / risk

8. Protect data (cont)
Use reliable devices for backups
Removable storage device
Managed network drive
– RAID, tape system
Managed cloud file-server
– Dropbox, Amazon Simple Storage Service (S3), Carbonite

8. Protect data (cont)
Test your backups
Automatically test backup copies
– Media degrade over time
– Annually test copies using checksums or file compare
Know that you can recover from a data loss
– Periodically test your ability to restore information (at least once a year)
– Each year, simulate an actual loss by trying to recover solely from the backed-up copies
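A backup test based on file comparison could look roughly like the sketch below, which walks the original folder and compares each file byte for byte with its backup using the standard `filecmp` module. The function `verify_backup` and the folder and file names are hypothetical.

```python
import filecmp
import os
import tempfile

def verify_backup(original_dir, backup_dir):
    """Return (relative path, problem) pairs for files that are missing
    from the backup or whose contents differ from the original."""
    failures = []
    for folder, _subdirs, files in os.walk(original_dir):
        for name in files:
            original = os.path.join(folder, name)
            relative = os.path.relpath(original, original_dir)
            backup = os.path.join(backup_dir, relative)
            if not os.path.isfile(backup):
                failures.append((relative, "missing"))
            elif not filecmp.cmp(original, backup, shallow=False):
                failures.append((relative, "differs"))
    return sorted(failures)

# Demonstration: one good copy, one altered copy, one missing copy.
with tempfile.TemporaryDirectory() as root:
    original_dir = os.path.join(root, "original")
    backup_dir = os.path.join(root, "backup")
    os.makedirs(original_dir)
    os.makedirs(backup_dir)
    for name, text in {"a.csv": "1,2\n", "b.csv": "3,4\n", "c.csv": "5,6\n"}.items():
        with open(os.path.join(original_dir, name), "w") as handle:
            handle.write(text)
    with open(os.path.join(backup_dir, "a.csv"), "w") as handle:
        handle.write("1,2\n")
    with open(os.path.join(backup_dir, "b.csv"), "w") as handle:
        handle.write("3,9\n")
    failures = verify_backup(original_dir, backup_dir)

print(failures)
```

A check like this only proves the copies match; a restore drill is still needed to show that recovery actually works.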

Fundamental Data Practices
1. Define the contents of your data files
2. Use consistent data organization
3. Use stable file formats
4. Assign descriptive file names
5. Preserve information
6. Perform basic quality assurance
7. Provide documentation
8. Protect your data

Best Practices: Conclusions
Data management is important in today's science
Well-organized data:
– enable researchers to work more efficiently
– can be shared easily by collaborators
– can potentially be re-used in ways not imagined when originally collected

Bibliography
Cook, Robert B., Richard J. Olson, Paul Kanciruk, and Leslie A. Hook. Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America, Vol. 82, No. 2, April.
Hook, L. A., T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive.
Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. Non-Geospatial Metadata for Ecology. Ecological Applications. 7:

Questions?