Data Formats Curt Tilmes/NASA Jeff Arnfield/National Climatic Data Center Al Fleig/PITA Version 1.0.


1 Data Formats Curt Tilmes/NASA Jeff Arnfield/National Climatic Data Center Al Fleig/PITA Version 1.0

2 Data Formats; Version 1.0 Overview Data Formats: Avoiding proprietary formats Choosing and adopting community accepted standards Building understandable spreadsheets Using self-describing data formats

3 Data Formats; Version 1.0 Data Formats: Avoiding proprietary formats Albert J. Fleig PITA Analytic Sciences Version 1.0

4 Data Formats; Version 1.0 Overview The use of proprietary formats in data sets for general use introduces problems in every subsequent stage of archival, distribution, application and maintenance. Their use should be avoided.

5 Data Formats; Version 1.0 Cost, time and utility are compromised Both the ultimate user and the archival and distribution organization will need to acquire appropriate tools to read the data. This will require both money and time. Developers of future analysis and visualization packages are unlikely to incorporate readers for other companies’ proprietary formats into their new tools.

6 Data Formats; Version 1.0 Provenance is compromised Once a data set is reformatted by either the archival and distribution organization or the end user, it becomes very difficult or impossible to maintain provenance back to the original data set. For instance, neither a check-sum nor a byte-by-byte comparison is possible.
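The checksum point can be illustrated with a minimal sketch. The file names and contents below are made up for the demonstration, and SHA-256 is only one reasonable choice of digest. An untouched copy reproduces the recorded checksum, while a reformatted file holding the same values does not.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical files: the same two values, stored comma- and tab-delimited.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.csv")
    reformatted = os.path.join(workdir, "reformatted.tsv")
    with open(original, "w", newline="") as f:
        f.write("station,temp_c\nA1,21.5\n")
    with open(reformatted, "w", newline="") as f:
        f.write("station\ttemp_c\nA1\t21.5\n")

    original_digest = sha256_of_file(original)
    # Re-reading the untouched file reproduces the digest exactly.
    copy_matches = sha256_of_file(original) == original_digest
    # The reformatted file carries the same values but different bytes.
    reformat_matches = sha256_of_file(reformatted) == original_digest
```

Recording the digest alongside the distributed file lets a later user confirm they hold the original bytes, which is only possible if nobody had to reformat the file to read it.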

7 Data Formats; Version 1.0 Tools may become unavailable The manufacturer of the proprietary data format tools may not maintain them as technology for storing and processing data changes. Sufficient information for the archive or a user to provide the necessary changes may not be available.

8 Data Formats; Version 1.0 Best practices Provide data in an open source, well documented format that has widespread community access.

9 Data Formats; Version 1.0 Data Formats: Choosing and adopting community accepted standards Curt Tilmes NASA Version 1.0

10 Data Formats; Version 1.0 Overview Some guidelines for choosing and adopting community accepted standards.

11 Data Formats; Version 1.0 Background Most projects (rightly so) focus on the content of their data files, but you need to consider the format as well. Since you captured or created the data, and stored them in your own files, you know how the data are organized, how to read them, how to use them, and which characteristics of the data could constrain their use. The goal of a good data format is to make it easier for others to read the data too. Many man hours have gone into developing standards for formats – try to learn from them.

12 Data Formats; Version 1.0 Why use community standards? If you try to develop your data format from scratch, you will forget something. Build on the experience and improvements built into the community standards over years of use. Tools and analysis software natively support reading community standard data. Reduce development effort and support reuse. Positive feedback – they are more likely to be adopted by others.

13 Data Formats; Version 1.0 Why use community standards? http://xkcd.com/927/

14 Data Formats; Version 1.0 A few guidelines Consider your archive: Do they have any recommendations? Consider your users: Who wants this data? Why do they want it? What do they want to do with it? Will they be using your data in concert with other data? Consider heritage: What worked well for similar data in the past? What could be done better for newly created data? Consider tools: Try to use data formats supported by the software you intend to use it with.

15 Data Formats; Version 1.0 Adopting standards The standard gives you a starting point, not a complete solution. Communicate early with a broad range of data users: archivists, software engineers, scientists. Consider how you will be writing the data and how you will be reading the data. Get feedback before making final decisions. Start sharing sample data in proposed format to nail down specifics and work out ambiguities. Document your use and application of the standard completely.

16 Data Formats; Version 1.0 Data Formats: Building understandable spreadsheets Jeff Arnfield National Climatic Data Center Version 1.0 Review Date

17 Data Formats; Version 1.0 Overview Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management. Spreadsheets are seldom self-documenting, and seldom well-documented. Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis. Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently. Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes.

18 Data Formats; Version 1.0 Prior planning prevents pitfalls Clearly document all parameters, conventions, algorithms and assumptions. Use descriptive, consistent names. Use optimal and consistent layouts and formats. Preserve raw data and minimize duplication. Validate data and formulae. Keep formulae as simple and comprehensible as possible. Use some method of version control to manage change. Develop a strategy for archiving the content and logic.

19 Data Formats; Version 1.0 Planning and documenting Form follows function and content. Data sources, volume, diversity of content and planned transformations may require multiple files or directories. Your data will often suggest natural naming conventions and layouts. Document your conventions as they’re established. Revise documentation contemporaneously, not “after the fact”. This work is also the basis for end-user or reviewer documentation. What should you document? Everything! Data import, manipulation, QC procedures, special flags and encoding. Naming conventions, layouts, headings, units and abbreviations. Does TEMP mean “temporary,” “air temperature at time of observation,” or something else? Formulae and constants.

20 Data Formats; Version 1.0 What’s in a name? Define naming conventions for: Directory hierarchies (if the scope of your project requires them). Files, including different versions if necessary. Column and row labels, and individual tabs on multi-tab spreadsheets. Names should clearly and uniquely describe content. Sheet1, Sheet2 and Sheet3 do not convey much useful information! Stable, consistent names can be referenced and processed. If dates form part of a name, use a sortable format. yyyymmdd (20110825) is easier to process and validate automatically than 25Aug2011, or the unthinkable 25Aug11. Consider standard abbreviations or keywords, if they exist.
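A short sketch of why yyyymmdd is easier to process and validate. The file names and the helper `parse_yyyymmdd` are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Parse a strict 8-digit yyyymmdd string; raise ValueError otherwise."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected 8 digits, got {text!r}")
    # strptime also rejects impossible dates such as 20110231.
    return datetime.strptime(text, "%Y%m%d").date()


# Hypothetical file names using the two date styles discussed above.
sortable_names = ["obs_20110825.csv", "obs_20101231.csv", "obs_20110102.csv"]
day_first_names = ["obs_25Aug2011.csv", "obs_31Dec2010.csv", "obs_02Jan2011.csv"]

# A plain text sort puts yyyymmdd names in chronological order...
sorted_sortable = sorted(sortable_names)
# ...but sorts day-first names by day of month, not by date.
sorted_day_first = sorted(day_first_names)
```

With the sortable names, an ordinary directory listing is already in date order. With the day-first names, every tool that needs chronological order has to parse the date first.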

21 Data Formats; Version 1.0 Layout and formatting Use appropriate data types. Date/time data type permits sorting and calculations. If numeric identifiers may have leading 0s, format as text. Format numeric values with the appropriate number of decimal places. Use separate columns to denote special cases rather than relying solely upon even well-documented color coding. Color coded conditions cannot readily be sorted and aggregated. Color coding is lost entirely if data are exported as ASCII text. Include enough detail to make the data self-describing. Item identifier, date, time, parameter name, units, value, quality flag. Make the spreadsheet self-contained by adding a tab with legend details, column definitions and formula descriptions.
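The layout advice can be sketched as one self-describing record per row. The column names, the station identifiers and the -999.0 placeholder below are assumptions made for the example, not part of any standard. The identifier stays text so its leading zeros survive, and the quality flag is an ordinary column that can be filtered, unlike a cell color.

```python
import csv
import io

# Hypothetical export: one observation per row, with an explicit flag column.
raw = io.StringIO(
    "station_id,date,time,parameter,units,value,quality_flag\n"
    "00731,20110825,1200,air_temperature,degC,21.5,ok\n"
    "00732,20110825,1200,air_temperature,degC,-999.0,missing\n"
)

records = []
for row in csv.DictReader(raw):
    records.append({
        "station_id": row["station_id"],  # kept as text: leading zeros matter
        "date": row["date"],
        "time": row["time"],
        "parameter": row["parameter"],
        "units": row["units"],
        "value": float(row["value"]),     # numeric, so it can be aggregated
        "quality_flag": row["quality_flag"],
    })

# The flag column makes special cases easy to select or exclude.
good_values = [r["value"] for r in records if r["quality_flag"] == "ok"]

# Converting the identifier to a number silently drops the leading zeros.
id_as_number = int(records[0]["station_id"])
```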

22 Data Formats; Version 1.0 Data handling and formulae Don’t combine multiple data layouts in a single sheet. If sheets have common content, identical layouts and headings increase clarity and simplify linking for analysis. Simplify updates by separating data and analyses. If data are propagated across sheets or manually manipulated, refreshing data can be a daunting and error-prone task. Freeze heading rows and columns to provide context as you work deep in the sheet. Placing totals, counts and other summary statistics above the column headers ensures they are always visible. Headers, footers and repeating headings improve printouts.

23 Data Formats; Version 1.0 Best practices Hard code nothing! Use a separate “values and assumptions” tab or area for constants and conversion factors. Use named ranges and cells rather than row/column references. C2*CubicFtToGallonConvertFactor is clearer than C2*Assumptions!$B$3 When sorting rows or copying formulae, be sure cell references do not change unintentionally. Most spreadsheets have data validation tools. Use them! If using spreadsheet for data entry, build entry validation rules. Counts, averages, max/mins, standard deviations, value lookups and custom formulae provide more sophisticated QC checks. Pivot tables are useful for QC and summarization.
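The same practices carry over when spreadsheet logic is moved into a script. This is a rough sketch under assumed names: the `ASSUMPTIONS` table plays the role of the "values and assumptions" tab, the plausibility limit is invented, and the conversion factor is the usual US liquid gallons per cubic foot.

```python
import statistics

# One place for constants, like a "values and assumptions" tab.
ASSUMPTIONS = {
    "cubic_ft_to_gallons": 7.48052,   # US liquid gallons per cubic foot
    "max_volume_cubic_ft": 10000.0,   # hypothetical plausibility limit
}


def validate_volume(value):
    """Return a list of problems found with one entered volume."""
    problems = []
    if not isinstance(value, (int, float)):
        problems.append("not a number")
    elif value < 0:
        problems.append("negative volume")
    elif value > ASSUMPTIONS["max_volume_cubic_ft"]:
        problems.append("above plausible maximum")
    return problems


def cubic_ft_to_gallons(volume_cubic_ft):
    """Convert using the named factor rather than a hard-coded number."""
    return volume_cubic_ft * ASSUMPTIONS["cubic_ft_to_gallons"]


entered = [2.0, 3.5, -1.0, 25000.0]
accepted = [v for v in entered if not validate_volume(v)]
rejected = {v: validate_volume(v) for v in entered if validate_volume(v)}
gallons = [cubic_ft_to_gallons(v) for v in accepted]

# Simple summary statistics double as a QC check on the accepted values.
summary = {
    "count": len(accepted),
    "mean": statistics.mean(accepted),
    "max": max(accepted),
}
```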

24 Data Formats; Version 1.0 Version control and archiving Versioning and change management Periodic, dated backups are essential. Simple version control is possible via a naming convention for saving incremental copies, along with a log of significant changes. Version control systems, like Subversion, are powerful but complex. How will you archive for long-term availability? Gear your approach to your data, and to your archive’s requirements. Save content as delimited ASCII. Must save each tab separately. Document formulae separately, since results rather than formulae are saved. Easy cheat: Ctrl+~ (Ctrl+` in Excel) will reveal formulae, which can then be exported intact. However, references to other sheets will not be readily resolved. Convert to XML. Printing as PDF preserves appearance, but complicates future reuse.
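A minimal sketch of the "each tab separately" export. It assumes the workbook has already been read into memory as a mapping from tab name to rows; the tab names, the dated version suffix and the `export_tabs` helper are all hypothetical. Formulae are written to their own file, since a value export keeps only results.

```python
import csv
import os
import tempfile


def export_tabs(workbook, out_dir, version):
    """Write each tab to its own delimited ASCII file; return the paths."""
    paths = []
    for tab_name, rows in workbook.items():
        path = os.path.join(out_dir, f"{tab_name}_v{version}.csv")
        with open(path, "w", newline="", encoding="ascii") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths


# Hypothetical workbook contents, one entry per tab.
workbook = {
    "observations": [["station_id", "temp_c"], ["00731", "21.5"]],
    "assumptions": [["name", "value"], ["cubic_ft_to_gallons", "7.48052"]],
    "formulae": [["cell", "formula"],
                 ["summary!B2", "=AVERAGE(observations!B2:B100)"]],
}

with tempfile.TemporaryDirectory() as out_dir:
    written = [os.path.basename(p)
               for p in export_tabs(workbook, out_dir, "20110825")]
    with open(os.path.join(out_dir, "observations_v20110825.csv"),
              newline="") as f:
        first_line = f.readline().rstrip("\r\n")
```

The dated suffix follows the sortable yyyymmdd convention, so successive exports list in chronological order.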

25 Data Formats; Version 1.0 Data Formats: Using self-describing data formats Curt Tilmes NASA Version 1.0

26 Data Formats; Version 1.0 Overview Self-describing data formats have become a well accepted way of archiving and disseminating scientific data.

27 Data Formats; Version 1.0 Background Before self-describing data formats became widely used, each project often invented their own data formats, often raw binary or even ASCII. These approaches had a number of problems: Machine dependent byte ordering or floating point organizations. Required a ‘key’ to be able to open the file and read the right data. A new custom reader is needed for each different data organization. Working in a new language could be very difficult since you have to redevelop the reader anew.
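The byte-ordering problem is easy to reproduce. In this small sketch the record layout (one 32-bit integer followed by one 32-bit float) is invented for the example. The same bytes read under the wrong ordering give a different number without raising any error.

```python
import struct

count = 258   # 0x00000102
value = 1.0

# The same record written with explicit big- and little-endian layouts.
big_endian = struct.pack(">if", count, value)
little_endian = struct.pack("<if", count, value)

# A reader that knows the layout recovers the data.
right_count, right_value = struct.unpack(">if", big_endian)

# A reader that assumes the other byte order gets a wrong number, silently.
misread_count, _ = struct.unpack("<if", big_endian)
```

Without an external "key" stating the byte order and field layout, nothing in the raw file tells the reader which interpretation is correct.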

28 Data Formats; Version 1.0 Self-describing data formats Information describing the data contents of the file is embedded within the data file itself: Names for various fields. Data types – Standardized, portable, machine independent. Pointers to various fields, making it efficient to extract the particular fields you want without reading the entire file. Attributes and flags related to the primary fields with extra information such as units, fill values, etc. Include a standard API and portable data access libraries in a variety of languages. There are tools that can open and work with arbitrary files, using the embedded descriptions to interpret the data.
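To make the idea concrete, here is a toy container. It is not HDF or NetCDF, and every name in it is invented for illustration. A JSON header records each field's name, type, units, fill value and position, so a generic reader can pull out one field without knowing the layout in advance or reading the others.

```python
import io
import json
import struct


def write_container(stream, fields):
    """Write a header describing each field, followed by the packed data."""
    header = []
    payload = b""
    for field in fields:
        data = struct.pack(f">{len(field['values'])}d", *field["values"])
        header.append({
            "name": field["name"],
            "dtype": "float64-big-endian",   # fixed, machine-independent type
            "units": field["units"],
            "fill_value": field["fill_value"],
            "offset": len(payload),          # pointer to this field's data
            "count": len(field["values"]),
        })
        payload += data
    header_bytes = json.dumps(header).encode("ascii")
    stream.write(struct.pack(">I", len(header_bytes)))
    stream.write(header_bytes)
    stream.write(payload)


def read_field(stream, name):
    """Read one named field using only the embedded description."""
    stream.seek(0)
    (header_len,) = struct.unpack(">I", stream.read(4))
    header = json.loads(stream.read(header_len).decode("ascii"))
    entry = next(item for item in header if item["name"] == name)
    stream.seek(4 + header_len + entry["offset"])
    raw = stream.read(8 * entry["count"])
    return entry, list(struct.unpack(f">{entry['count']}d", raw))


buffer = io.BytesIO()
write_container(buffer, [
    {"name": "air_temperature", "units": "degC", "fill_value": -9999.0,
     "values": [21.5, 22.0, -9999.0]},
    {"name": "humidity", "units": "percent", "fill_value": -9999.0,
     "values": [40.0, -9999.0]},
])
entry, values = read_field(buffer, "humidity")
```

Real self-describing formats add much more (hierarchies, compression, portable libraries in many languages), but the principle is the same: the description travels inside the file.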

29 Data Formats; Version 1.0 Some examples HDF – Hierarchical Data Format HDF4 and HDF5 versions are in use today. A NASA variant called HDF-EOS is used within the Earth Observing System program. The Aura project developed a common approach across their instruments and released guidelines as a Technical Note. NetCDF – Network Common Data Form Widely used by agencies including NASA and NOAA. Climate and Forecast (CF) metadata conventions help standardize how metadata are encoded in NetCDF files.

30 Data Formats; Version 1.0 Best practices Choosing a self-describing format is a good first step, but it isn’t a panacea. You still have to decide how to encode your data into the format. Think carefully about how you use the format: Layout of data within the file. Unambiguous names for fields; use standard names if possible. Units. Fill values. Keep the users/readers of your files in mind. Some formats support seamless internal compression that can help with file sizes.

31 Data Formats; Version 1.0 Case Study: Format abuse A project had to distribute NORAD Two-Line Element (TLE) Sets. This is a small amount of data, in a well defined format within ASCII, widely used and common. ASCII isn’t the best format, but for a small amount of data like this, especially in a widely used and understood format, it would have been fine. People understand the TLE format and have standard ways to parse it. Nevertheless, it isn’t self-describing, and people unfamiliar with TLE wouldn’t have a clue what those numbers mean. They chose to encode into HDF. Example TLE:
1 39900U 10123A   10249.02432654  .00000388  00001-0  14877-3 0  3039
2 39900 098.6793 188.3954 0009896 294.6098 065.4121 14.19557889216547

32 Data Formats; Version 1.0 Case Study: Format abuse (cont.) A straightforward encoding would be to parse the fields, create fields with the right types (floating point) and name them according to their actual content from the TLE spec. They chose instead to maintain the ASCII text, encoding the individual characters of the file in their raw numerical form as an array of bytes. To read this data from the HDF file, you first have to extract the ASCII bytes, then parse the data according to the TLE spec. Rather than attaching metadata to the data fields, they created a separate empty dataset just to hold the metadata. This is just bizarre. Don’t do it like that.
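The "straightforward encoding" step can be sketched as below. It is not the project's code. It assumes the commonly documented fixed-column TLE layout, with the example's column spacing restored so each line is 69 characters, and the dictionary keys are hypothetical field names. Only line 2 is parsed here, to keep the sketch short.

```python
# The example TLE from the case study, in fixed-width form (69 columns each).
TLE_LINE1 = "1 39900U 10123A   10249.02432654  .00000388  00001-0  14877-3 0  3039"
TLE_LINE2 = "2 39900 098.6793 188.3954 0009896 294.6098 065.4121 14.19557889216547"


def tle_checksum_ok(line):
    """Check the modulo-10 checksum in column 69 of a TLE line."""
    total = 0
    for ch in line[:68]:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1   # each minus sign counts as 1; other characters as 0
    return total % 10 == int(line[68])


def parse_tle_line2(line):
    """Split line 2 into named, typed fields (names are illustrative)."""
    return {
        "satellite_number": int(line[2:7]),
        "inclination_deg": float(line[8:16]),
        "raan_deg": float(line[17:25]),
        # Eccentricity is stored with an implied leading decimal point.
        "eccentricity": float("0." + line[26:33]),
        "arg_of_perigee_deg": float(line[34:42]),
        "mean_anomaly_deg": float(line[43:51]),
        "mean_motion_rev_per_day": float(line[52:63]),
        "revolution_number": int(line[63:68]),
    }


elements = parse_tle_line2(TLE_LINE2)
checksums_ok = tle_checksum_ok(TLE_LINE1) and tle_checksum_ok(TLE_LINE2)
```

Each entry in `elements` could then be stored as its own named, typed field with units attached as attributes, so a reader of the HDF file never has to re-parse ASCII.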

33 Data Formats; Version 1.0 Case Study: Format abuse (cont)

34 Data Formats; Version 1.0 References and Resources Cook, R. B., R. J. Olson, P. Kanciruk, L. A. Hook. 2001. “Best Practices for Preparing Ecological Data Sets to Share and Archive.” Bulletin of the Ecological Society of America 82(2):138-141. http://www.jstor.org/stable/20168543 Leong, K. “Seven deadly sins of spreadsheet use in business: Excel best practices.” http://production-scheduling.com/seven-deadly-spreadsheet-sins/ GCMD science and associated directory keywords. http://gcmd.nasa.gov/Resources/valids/archives/keyword_list.html CF Metadata Convention – CF Standard Names. http://cf-pcmdi.llnl.gov/documents/cf-standard-names/

35 Data Formats; Version 1.0 References and Resources HDF: http://www.hdfgroup.org HDF-EOS: http://hdfeos.org HDF-EOS Aura File Format Guidelines: http://disc.sci.gsfc.nasa.gov/Aura/additional/documentation/HDFEOS_Aura_File_Format_Guidelines.pdf http://www.esdswg.org/spg/spgfolder/events/esdswg-meeting-october-25-27-2005/auraasabestpracticerev2.pdf NetCDF: http://www.unidata.ucar.edu/software/netcdf CF: http://cf-pcmdi.llnl.gov/

