Fundamental Practices for Preparing Data Sets. Bob Cook, Environmental Sciences Division, Oak Ridge National Laboratory. 5th NACP Principal Investigator’s Meeting.

Presentation transcript:

Fundamental Practices for Preparing Data Sets
Bob Cook, Environmental Sciences Division, Oak Ridge National Laboratory
5th NACP Principal Investigator’s Meeting, Washington, DC, January 25, 2015

NACP Best Data Management Practices, January 25, 2015

Data centers use the 20-year rule
The data set and accompanying documentation should be prepared for a user 20 years into the future: what does that investigator need to know to use the data?
Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations (NRC, 1991).

Fundamental Data Practices
1. Define the contents of your data files
2. Define the variables
3. Use consistent data organization
4. Use stable file formats
5. Assign descriptive file names
6. Preserve processing information
7. Perform basic quality assurance
8. Provide documentation
9. Protect your data
10. Preserve your data

1. Define the contents of your data files
Content flows from the science plan (hypotheses) and is informed by the requirements of the final archive.
Keep a set of similar measurements together in one file (same investigator, methods, time basis, and instrument).
  – There are no hard-and-fast rules about the contents of each file.

2. Define the variables
  1. Choose the units and format for each variable,
  2. Explain the format in the metadata, and
  3. Use that format consistently throughout the file.
Representation of dates and times (date / time example)
  – Use yyyymmdd; for example, January 2, 1999 is 19990102
  – Report in both local time and Coordinated Universal Time (UTC), using 24-hour notation (13:30 hrs instead of 1:30 p.m.)
  – Use a code (e.g., -9999) for missing values
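The date, time, and missing-value conventions above can be sketched in a few lines of Python. This is only an illustration: the function name `format_record`, the field names, and the fixed UTC-5 offset are assumptions made for the example, not part of the original slides.

```python
from datetime import datetime, timedelta, timezone

MISSING = -9999  # missing-value code, following the slide's example


def format_record(local_dt, value):
    """Return yyyymmdd dates, 24-hour local and UTC times, and a coded value."""
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "date_local": local_dt.strftime("%Y%m%d"),
        "time_local": local_dt.strftime("%H:%M"),
        "date_utc": utc_dt.strftime("%Y%m%d"),
        "time_utc": utc_dt.strftime("%H:%M"),
        # Replace an absent measurement with the documented missing-value code.
        "value": MISSING if value is None else value,
    }


# Hypothetical observation: 1:30 p.m. on January 2, 1999, at a fixed UTC-5 offset.
local_zone = timezone(timedelta(hours=-5))
record = format_record(datetime(1999, 1, 2, 13, 30, tzinfo=local_zone), None)
print(record)
```

Whatever code is chosen for missing values, it should be stated in the metadata so that a later user does not mistake it for a measurement.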

ISO Formatted Dates Sort Chronologically
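A small Python sketch of why this matters: year-first (ISO 8601 style) date strings sort correctly as plain text, while month-first strings do not. The dates and the helper `is_chronological` are made up for the illustration.

```python
from datetime import datetime

# The same three hypothetical dates written two ways.
iso_dates = ["2001-03-15", "1999-01-02", "2000-12-31"]
us_dates = ["03/15/2001", "01/02/1999", "12/31/2000"]


def is_chronological(strings, fmt):
    """Check whether the strings, in the order given, run from earliest to latest."""
    parsed = [datetime.strptime(s, fmt) for s in strings]
    return parsed == sorted(parsed)


# A plain text sort, as a spreadsheet or file browser would do.
iso_sorted = sorted(iso_dates)
us_sorted = sorted(us_dates)

print(iso_sorted, is_chronological(iso_sorted, "%Y-%m-%d"))
print(us_sorted, is_chronological(us_sorted, "%m/%d/%Y"))
```

The text sort of the month-first strings places the 2001 date ahead of the 2000 date, which is the kind of silent ordering error that year-first formats avoid.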

2. Define the variables (cont)
Use commonly accepted variable names and units
  – ORNL DAAC Best Practices (Hook et al., 2010): additional examples of variable names, units, and their formats
  – Next Generation Ecosystem Experiment – Arctic: guidance for variable names and units
  – FLUXNET: guidance for flux tower variable names and units
  – UDUNITS: unit database and conversion between units
  – CF Standard Name: Climate Forecast (CF) standards promote sharing
  – International System of Units

2. Define the variables (cont)
Variable Table (Scholes, 2005)
  – Be consistent
  – Explicitly state units
  – Use ISO formats

2. Define the variables (cont)
Site Table

Site Name          | Site Code | Latitude (deg) | Longitude (deg) | Elevation (m) | Date
Kataba (Mongu)     | k         | …              | …               | …             | …
Pandamatenga       | p         | …              | …               | …             | …
Skukuza Flux Tower | skukuza   | …              | …               | …             | …

Scholes, R. J. 2005. SAFARI 2000 Woody Vegetation Characteristics of Kalahari and Skukuza Sites. Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/777

3. Use consistent data organization (one good approach)
Each row in a file represents a complete record, and the columns represent all the variables that make up the record.

Station | Date     | Temp | Precip
Units   | YYYYMMDD | C    | mm
HOGI    | …        | …    | …
HOGI    | …        | …    | …
HOGI    | …        | …    | …

Note: a missing value code (e.g., -9999) is used for the data set.

3. Use consistent data organization (a second good approach)
Variable name, value, and units are placed in individual rows. This approach is used in relational databases.

Station | Date | Variable | Value | Unit
HOGI    | …    | Temp     | 12    | C
HOGI    | …    | Temp     | 14    | C
HOGI    | …    | Precip   | 0     | mm
HOGI    | …    | Precip   | 3     | mm
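The two layouts carry the same information, and one can be derived from the other. The Python sketch below converts record-per-row data into the variable-per-row layout. The function name `wide_to_long` is an assumption for the example, and the dates are hypothetical placeholders because the slide's dates did not survive in this transcript.

```python
# Units for each measured variable, as stated in the tables' unit fields.
UNITS = {"Temp": "C", "Precip": "mm"}


def wide_to_long(rows, units):
    """Turn one row per record into one row per (record, variable) pair."""
    long_rows = []
    for row in rows:
        for variable, unit in units.items():
            long_rows.append({
                "Station": row["Station"],
                "Date": row["Date"],
                "Variable": variable,
                "Value": row[variable],
                "Unit": unit,
            })
    return long_rows


# Hypothetical dates; the station and values follow the slide's example.
wide = [
    {"Station": "HOGI", "Date": "19961001", "Temp": 12, "Precip": 0},
    {"Station": "HOGI", "Date": "19961002", "Temp": 14, "Precip": 3},
]

long_rows = wide_to_long(wide, UNITS)
for r in long_rows:
    print(r)
```

Either layout is acceptable; what matters is choosing one and keeping it consistent across the file.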

3. Use consistent data organization (cont)
Be consistent in file organization and formatting
  – Don’t change or re-arrange columns
  – Include header rows (the first row should contain the file name, data set title, author, date, and companion file names)
  – Column headings should describe the content of each column, including one row for variable names and one for variable units
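As a rough sketch of the header-row layout described above, the Python snippet below writes a descriptive first row, a variable-name row, and a units row ahead of the data. The file names, title, author, dates, and values are all hypothetical, and `write_table` is a name invented for this example.

```python
import csv
import io


def write_table(metadata, names, units, records):
    """Write a metadata row, a variable-name row, a units row, then the data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(metadata)   # file name, title, author, date, companion files
    writer.writerow(names)      # one row for variable names
    writer.writerow(units)      # one row for variable units
    writer.writerows(records)
    return buffer.getvalue()


# All values below are illustrative placeholders.
text = write_table(
    metadata=["hogi_weather_1996.csv", "HOGI daily weather", "A. Author",
              "20150125", "hogi_weather_1996_readme.txt"],
    names=["Station", "Date", "Temp", "Precip"],
    units=["Units", "YYYYMMDD", "C", "mm"],
    records=[["HOGI", "19961001", 12, 0], ["HOGI", "19961002", 14, 3]],
)
print(text)
```

A reader of such a file needs to skip a fixed number of header rows, so that number should stay the same in every file of the data set.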

Example of Poor Data Practices for Collaboration and Data Sharing
Courtesy of Stefanie Hampton, NCEAS
Problems with spreadsheets
  – Multiple tables
  – Embedded figures
  – No headings / units
  – Poor file names

Stable Isotope Data at ORNL: tabular csv format
Aranibar and Macko, doi: /ORNLDAAC/783

4. Use stable file formats
Years of critical knowledge can be lost because modern PCs cannot always open old file formats.
Lesson: Avoid proprietary formats. They may not be readable in the future.

4. Use stable file formats (cont)
Use text (ASCII) file formats for tabular data (e.g., .txt or .csv (comma-separated values))
Aranibar, J. N. and S. A. Macko. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa. Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi: /ORNLDAAC/783

4. Use stable file formats (cont)
Suggested Geospatial File Formats
Raster formats
  – GeoTIFF
  – netCDF (with CF convention preferred)
  – HDF
  – ASCII (plain text file gridded format with external projection information)
Vector formats
  – Shapefile
  – ASCII
Example figures: GTOPO30 Elevation; Minimum Temperature

5. Assign descriptive file names
Use descriptive file names
  – Unique
  – Reflect contents
  – ASCII characters only
  – Avoid spaces
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: bigfoot_agro_2000_gpp.tiff
  – bigfoot: project name
  – agro: site name
  – 2000: year
  – gpp: what was measured
  – tiff: file format
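One way to keep names consistent is to build them from their parts rather than typing them by hand. The Python sketch below assumes the project_site_year_variable.format pattern of the "Better" example; the helper names `build_name` and `has_safe_characters` and the allowed-character set are choices made for this illustration.

```python
import re

# ASCII letters, digits, underscore, hyphen, and dot only: no spaces.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def has_safe_characters(name):
    """True if the name uses only the allowed ASCII characters."""
    return SAFE_NAME.fullmatch(name) is not None


def build_name(project, site, year, variable, extension):
    """Assemble project_site_year_variable.extension in lower case."""
    name = f"{project}_{site}_{year}_{variable}.{extension}".lower()
    if not has_safe_characters(name):
        raise ValueError(f"unsafe characters in file name: {name!r}")
    return name


print(build_name("bigfoot", "agro", 2000, "gpp", "tiff"))
print(has_safe_characters("best version.txt"))
```

The character check only catches spaces and non-ASCII characters. A name such as Mydata.xls passes it while still saying nothing about the contents, so descriptiveness remains a matter of judgment.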

Courtesy of PhD Comics

5. Assign descriptive file names (cont)
Organize files logically
Make sure your file system is logical and efficient
Example directory structure (from S. Hampton):
  Biodiversity
    Lake
      Experiments
        Biodiv_H20_heatExp_2005_2008.csv
        Biodiv_H20_predatorExp_2001_2003.csv
        …
      Field work
        Biodiv_H20_planktonCount_start2001_active.csv
        Biodiv_H20_chla_profiles_2003.csv
        …
    Grassland

6. Preserve processing information
Keep raw data raw:
  – Do not include transformations, interpolations, etc. in the raw file
  – Make your raw data “read only” to ensure no changes
When processing data:
  – Use a programming language (e.g., R, SAS, MATLAB)
  – Code is a record of the processing done
  – Code can be revised and rerun

Raw data file: Giles_zoopCount_Diel_2001_2003.csv
  Columns: TAX, COUNT, TEMPC (example row: F, 0, 11.9)

Processing script:
  ### Giles_zoop_temp_regress_4jun08.r
  ### Load data
  Giles <- read.csv("Giles_zoopCount_Diel_2001_2003.csv")
  ### Look at the data
  Giles
  plot(COUNT ~ TEMPC, data = Giles)
  ### Log-transform the dependent variable: log(y + 1)
  Giles$Lcount <- log(Giles$COUNT + 1)
  ### Plot the log-transformed y against x
  plot(Lcount ~ TEMPC, data = Giles)
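The same workflow can be sketched in Python: mark the raw file read-only, then write every transformation to a separate derived file. The helper names, the temporary file names, and the second data row are assumptions for the example; only the column names and the row F, 0, 11.9 come from the slide.

```python
import csv
import math
import os
import stat
import tempfile


def make_read_only(path):
    """Clear the write-permission bits so the raw file is not edited by accident."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def add_log_count(raw_path, derived_path):
    """Read the raw CSV and write a derived copy with LCOUNT = log(COUNT + 1)."""
    with open(raw_path, newline="") as src, open(derived_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["LCOUNT"])
        writer.writeheader()
        for row in reader:
            row["LCOUNT"] = math.log(float(row["COUNT"]) + 1)
            writer.writerow(row)


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw_counts.csv")          # hypothetical name
    derived = os.path.join(workdir, "derived_counts.csv")  # hypothetical name
    with open(raw, "w", newline="") as fh:
        fh.write("TAX,COUNT,TEMPC\nF,0,11.9\nC,3,12.4\n")  # second row is made up
    make_read_only(raw)
    add_log_count(raw, derived)
    raw_is_writable = bool(os.stat(raw).st_mode & stat.S_IWUSR)
    with open(derived, newline="") as fh:
        derived_rows = list(csv.DictReader(fh))

print(raw_is_writable, derived_rows[0])
```

Because the script never opens the raw file for writing, rerunning it after a correction regenerates the derived file without touching the original measurements.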

7. Perform basic quality assurance
Assure that data are delimited and line up in proper columns
Check that there are no missing values (blank cells) for key variables
Scan for impossible and anomalous values
Perform and review statistical summaries
Map location data (lat/long) and assess errors
No better QA than to analyze data
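Several of these checks are easy to automate. The Python sketch below flags blank cells, missing-value codes, and out-of-range values for one variable and reports a simple summary. The function name `qa_report`, the plausible range, and the sample temperatures are assumptions for the illustration.

```python
import statistics

MISSING = -9999  # missing-value code assumed for this example


def qa_report(values, low, high):
    """Report blanks, coded missing values, out-of-range values, and a summary."""
    blanks = [i for i, v in enumerate(values) if v is None or v == ""]
    coded = [i for i, v in enumerate(values) if v == MISSING]
    numeric = [(i, v) for i, v in enumerate(values)
               if isinstance(v, (int, float)) and v != MISSING]
    out_of_range = [i for i, v in numeric if not low <= v <= high]
    numbers = [v for _, v in numeric]
    summary = None
    if numbers:
        summary = {"n": len(numbers), "min": min(numbers),
                   "max": max(numbers), "mean": statistics.mean(numbers)}
    return {"blanks": blanks, "coded_missing": coded,
            "out_of_range": out_of_range, "summary": summary}


# Hypothetical air temperatures in degrees C, with a plausible range of -60 to 60.
temps = [12, 14, None, -9999, 85]
report = qa_report(temps, low=-60, high=60)
print(report)
```

The indices returned point back to rows in the file, so each flagged value can be inspected and either corrected or documented.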

7. Perform basic quality assurance (cont)
Place geographic data on a map to ensure that geographic coordinates are correct.
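Before plotting, a quick bounding-box test can catch the most common coordinate mistakes, such as swapped latitude and longitude or a dropped minus sign. The Python sketch below is a rough proxy for the map check, not a replacement for it. The site names, coordinates, and study-region bounds are all made up for the example.

```python
def outside_bbox(points, south, north, west, east):
    """Return the names of points that fall outside the expected study region."""
    return [name for name, lat, lon in points
            if not (south <= lat <= north and west <= lon <= east)]


# Hypothetical sites as (name, latitude, longitude) in decimal degrees.
sites = [
    ("site_a", -15.4, 23.3),
    ("site_b", 23.3, -15.4),   # latitude and longitude swapped on purpose
    ("site_c", -25.0, 31.5),
]

# Hypothetical study region covering southern Africa.
suspect = outside_bbox(sites, south=-35.0, north=-10.0, west=10.0, east=40.0)
print(suspect)
```

This simple test assumes the region does not cross the 180-degree meridian, and a point can pass it while still being in the wrong place, so viewing the points on a map remains the stronger check.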

7. Perform basic quality assurance (cont)
Plot information to examine outliers
Model-Observation Intercomparison: Model X uses UTC time, all others use Eastern Time
Data from the North American Carbon Program Interim Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)
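A time-zone mismatch like the one in this example shows up as a constant shift between series. The Python sketch below converts a timestamp recorded in Eastern Standard Time to UTC so that both sources share one time basis. It assumes a fixed UTC-5 offset and ignores daylight saving time; the timestamp itself is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Eastern Standard Time (no daylight saving handling).
EST = timezone(timedelta(hours=-5), "EST")


def to_utc(naive_dt, source_tz):
    """Attach the zone the value was recorded in, then convert it to UTC."""
    return naive_dt.replace(tzinfo=source_tz).astimezone(timezone.utc)


# Hypothetical record: local noon on January 15, 2004, stamped in Eastern Time.
noon_eastern = datetime(2004, 1, 15, 12, 0)
noon_as_utc = to_utc(noon_eastern, EST)

# Hours by which unconverted Eastern timestamps lag a series stored in UTC.
shift_hours = (noon_as_utc.replace(tzinfo=None) - noon_eastern) / timedelta(hours=1)
print(noon_as_utc.isoformat(), shift_hours)
```

Stating the time zone in the metadata, or reporting both local time and UTC, lets a later user apply this kind of conversion without guessing.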

7. Perform basic quality assurance (cont)
Plot information to examine outliers
Model-Observation Intercomparison
Data from the North American Carbon Program Interim Synthesis (Courtesy of Dan Ricciuto and Yaxing Wei, ORNL)

8. Provide Documentation / Metadata
What does the data set describe?
Why was the data set created?
Who produced the data set and who prepared the metadata?
When and how frequently were the data collected?
Where were the data collected and with what spatial resolution? (include coordinate reference system)
How was each variable measured?
How reliable are the data? What is the uncertainty and measurement accuracy? What problems remain in the data set?
What assumptions were used to create the data set?
What is the use and distribution policy of the data set?
How can someone get a copy of the data set?
Provide any references to use of data in publication(s)

9. Protect data
Create back-up copies often
  – Ideally three copies: original, one on-site (external), and one off-site
  – Frequency based on need / risk
Know that you can recover from a data loss
  – Periodically test your ability to restore information

9. Protect data (cont)
Ensure that file transfers are done without error
  – Compare checksums before and after transfers
Example tools to generate checksums
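As one possible illustration of the checksum comparison, the Python sketch below computes a SHA-256 digest before and after a copy and raises an error if the two differ. The slides do not name a specific algorithm here, so SHA-256 is an assumption, as are the helper names and the sample file.

```python
import hashlib
import os
import shutil
import tempfile


def sha256sum(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src, dst):
    """Copy a file and confirm the checksum is unchanged by the transfer."""
    before = sha256sum(src)
    shutil.copyfile(src, dst)
    after = sha256sum(dst)
    if before != after:
        raise OSError(f"checksum mismatch after copying {src} to {dst}")
    return after


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "data.csv")          # hypothetical file
    backup = os.path.join(workdir, "data_backup.csv")   # hypothetical file
    with open(source, "wb") as fh:
        fh.write(b"Station,Date,Temp,Precip\nHOGI,19961001,12,0\n")
    checksum = verified_copy(source, backup)

print(checksum)
```

Storing the checksum alongside the archived file also allows the same comparison to be repeated later, for example when testing a restore from back-up.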

10. Preserve Your Data
What to preserve from the research project?
Well-structured data files, with variables, units, and values defined
Documentation and metadata record describing the data
Additional information (provides context)
  – Materials from project wiki/websites
  – Files describing the project, protocols, or field sites (including photos)
  – Publication(s)

10. Preserve Your Data (cont)
Where should the data be archived?
Part of project planning
Contact archive / data center early to find out their requirements
  – What additional data management steps would they like you to do?
Suggested data centers / archives:
  – BCO-DMO
  – Ecological Archives
  – CDIAC
  – Dryad
  – NASA DAACs (ORNL DAAC)

Fundamental Data Practices
1. Define the contents of your data files
2. Define the variables
3. Use consistent data organization
4. Use stable file formats
5. Assign descriptive file names
6. Preserve processing information
7. Perform basic quality assurance
8. Provide documentation
9. Protect your data
10. Preserve your data

Best Practices: Conclusion
Data management is important in today’s science
Well organized data:
  – enables researchers to work more efficiently
  – can be shared easily by collaborators
  – can potentially be re-used in ways not imagined when originally collected
Include data management in your research workflow and budget
Data management practices should be a habit

Web Resources
Web Site
  – Data Management for Data Providers
Workshops
  – Workshops on Data Management 101 at NASA Terrestrial Ecology meeting in 2013
  – Workshop at American Meteorological Society 2012 with ESIP and DataONE
Training Materials (ORNL DAAC contributed)
  – ESIP training modules
  – DataONE training modules