Data Formats Curt Tilmes/NASA Jeff Arnfield/National Climatic Data Center Al Fleig/PITA Version 1.0.

Presentation transcript:

Data Formats; Version 1.0
Overview
Data Formats covers four topics: avoiding proprietary formats; choosing and adopting community accepted standards; building understandable spreadsheets; and using self-describing data formats.

Data Formats: Avoiding proprietary formats
Albert J. Fleig, PITA Analytic Sciences. Version 1.0

Overview
Using proprietary formats in data sets intended for general use introduces problems at every subsequent stage: archival, distribution, application and maintenance. Avoid them.

Cost, time and utility are compromised
Both the ultimate user and the archival and distribution organization must acquire appropriate tools to read the data, which costs both money and time. Developers of future analysis and visualization packages are unlikely to build readers for other companies’ proprietary formats into their new tools.

Provenance is compromised
Once a data set is reformatted, whether by the archival and distribution organization or by the end user, it becomes very difficult or impossible to maintain provenance back to the original data set. For instance, neither a checksum nor a byte-by-byte comparison against the original is possible, because the reformatted bytes no longer match.
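
The checksum point can be illustrated with a minimal Python sketch. The records, field names and the two serializations are hypothetical, chosen only to show that the same values written in a different format produce a different digest, while an exact copy does not.

```python
import hashlib
import json


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical observations, serialized two different ways.
records = [{"station": "A1", "temp_c": 21.5}, {"station": "B2", "temp_c": 19.0}]
as_csv = "station,temp_c\nA1,21.5\nB2,19.0\n".encode("ascii")
as_json = json.dumps(records).encode("ascii")

original_digest = sha256_hex(as_csv)
copy_digest = sha256_hex(bytes(as_csv))    # exact copy of the original bytes
reformatted_digest = sha256_hex(as_json)   # same values, different format

print(original_digest == copy_digest)         # True
print(original_digest == reformatted_digest)  # False
```

A checksum published with the original file can therefore verify a copy, but it says nothing about a reformatted version.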

Tools may become unavailable
The manufacturer of the tools for a proprietary data format may not maintain them as the technology for storing and processing data changes, and the archive or a user may not have enough information about the format to make the necessary changes themselves.

Best practices
Provide data in an open, well-documented format with widespread community access.

Data Formats: Choosing and adopting community accepted standards
Curt Tilmes, NASA. Version 1.0

Overview
Some guidelines for choosing and adopting community accepted standards.

Background
Most projects (rightly) focus on the content of their data files, but you need to consider the format as well. Since you captured or created the data and stored them in your own files, you know how the data are organized, how to read them, how to use them, and which characteristics of the data could constrain their use. The goal of a good data format is to make it easier for others to read the data too. Many hours of work have gone into developing standards for formats; try to learn from them.

Why use community standards?
If you try to develop your data format from scratch, you will forget something. Build on the experience and improvements that years of use have built into the community standards. Tools and analysis software natively support reading community standard data, which reduces development effort and supports reuse. There is also positive feedback: data in a community standard format are more likely to be adopted by others.

A few guidelines
Consider your archive: Do they have any recommendations?
Consider your users: Who wants these data? Why do they want them? What do they want to do with them? Will they be using your data in concert with other data?
Consider heritage: What worked well for similar data in the past? What could be done better for newly created data?
Consider tools: Try to use data formats supported by the software you intend to use the data with.

Adopting standards
The standard gives you a starting point, not a complete solution. Communicate early with a broad range of data users: archivists, software engineers, scientists. Consider how you will be writing the data and how you will be reading them. Get feedback before making final decisions. Start sharing sample data in the proposed format to nail down specifics and work out ambiguities. Document your use and application of the standard completely.

Data Formats: Building understandable spreadsheets
Jeff Arnfield, National Climatic Data Center. Version 1.0 Review Date

Overview
Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management. Spreadsheets are seldom self-documenting, and seldom well documented. Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis. Spreadsheet conventions, often ad hoc and evolutionary, may change or be applied inconsistently. Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes.

Prior planning prevents pitfalls
Clearly document all parameters, conventions, algorithms and assumptions. Use descriptive, consistent names. Use optimal and consistent layouts and formats. Preserve raw data and minimize duplication. Validate data and formulae. Keep formulae as simple and comprehensible as possible. Use some method of version control to manage change. Develop a strategy for archiving the content and logic.

Planning and documenting
Form follows function and content. Data sources, volume, diversity of content and planned transformations may require multiple files or directories. Your data will often suggest natural naming conventions and layouts. Document your conventions as they’re established. Revise documentation contemporaneously, not “after the fact”. This work is also the basis for end-user or reviewer documentation.
What should you document? Everything! Data import, manipulation, QC procedures, special flags and encoding. Naming conventions, layouts, headings, units and abbreviations: does TEMP mean “temporary,” “air temperature at time of observation,” or ? Formulae and constants.

What’s in a name?
Define naming conventions for directory hierarchies (if the scope of your project requires them); files, including different versions if necessary; column and row labels; and individual tabs on multi-tab spreadsheets.
Names should clearly and uniquely describe content. Sheet1, Sheet2 and Sheet3 do not convey much useful information! Stable, consistent names can be referenced and processed.
If dates form part of a name, use a sortable format. yyyymmdd ( ) is easier to process and validate automatically than 25Aug2011, or the unthinkable 25Aug11.
Consider standard abbreviations or keywords, if they exist.
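
The advantage of a sortable date format can be sketched in a few lines of Python. The helper names and the sample name fragments are hypothetical, and the conversion assumes English month abbreviations (the `%b` directive depends on the locale).

```python
from datetime import datetime


def to_sortable(date_text: str) -> str:
    """Convert a fragment such as '25Aug2011' to yyyymmdd form."""
    return datetime.strptime(date_text, "%d%b%Y").strftime("%Y%m%d")


def is_valid_yyyymmdd(text: str) -> bool:
    """Return True when text is eight digits that form a real calendar date."""
    if len(text) != 8 or not text.isdigit():
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


# Hypothetical date fragments taken from file names.
names = ["25Aug2011", "03Jan2012", "14Feb2011"]

# Plain text sorting of the original fragments is not chronological.
text_sorted = sorted(names)

# After conversion, plain text sorting is also chronological order.
sortable = sorted(to_sortable(name) for name in names)

print(text_sorted)
print(sortable)
print(is_valid_yyyymmdd("20110231"))  # 31 February does not exist
```

Because yyyymmdd strings sort chronologically as plain text, file listings and spreadsheet sorts need no date-aware logic, and validation reduces to a digit check plus a calendar check.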

Layout and formatting
Use appropriate data types. A date/time data type permits sorting and calculations. If numeric identifiers may have leading 0s, format them as text. Format numeric values with the appropriate number of decimal places.
Use separate columns to denote special cases rather than relying solely upon color coding, even if it is well documented. Color-coded conditions cannot readily be sorted and aggregated, and color coding is lost entirely if data are exported as ASCII text.
Include enough detail to make the data self-describing: item identifier, date, time, parameter name, units, value, quality flag. Make the spreadsheet self-contained by adding a tab with legend details, column definitions and formula descriptions.
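
As a rough illustration of these layout rules outside a spreadsheet, the Python sketch below writes two rows as delimited text and reads them back. The column names, identifiers and flag values are hypothetical. The identifier is kept as text so its leading zeros survive, and the special case lives in its own column, so it can be filtered after export.

```python
import csv
import io

# Hypothetical column set following the slide's list of self-describing details.
FIELDS = ["item_id", "date", "time", "parameter", "units", "value", "quality_flag"]

rows = [
    {"item_id": "00417", "date": "20110825", "time": "1200",
     "parameter": "air_temperature", "units": "degC", "value": "21.5",
     "quality_flag": "good"},
    {"item_id": "00418", "date": "20110825", "time": "1200",
     "parameter": "air_temperature", "units": "degC", "value": "48.9",
     "quality_flag": "suspect"},
]

# Write the rows as delimited text, then read them back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))

# Identifiers stored as text keep their leading zeros; a numeric type would not.
print(read_back[0]["item_id"])   # 00417
print(int("00417"))              # 417

# A flag column can be filtered and counted; a cell color could not.
suspect = [row for row in read_back if row["quality_flag"] == "suspect"]
print(len(suspect))              # 1
```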

Data handling and formulae
Don’t combine multiple data layouts in a single sheet. If sheets have common content, identical layouts and headings increase clarity and simplify linking for analysis. Simplify updates by separating data and analyses: if data are propagated across sheets or manually manipulated, refreshing them can be a daunting and error-prone task. Freeze heading rows and columns to provide context as you work deep in the sheet. Placing totals, counts and other summary statistics above the column headers ensures they are always visible. Headers, footers and repeating headings improve printouts.

Best practices
Hard code nothing! Use a separate “values and assumptions” tab or area for constants and conversion factors. Use named ranges and cells rather than row/column references: C2*CubicFtToGallonConvertFactor is clearer than C2*Assumptions!$B$3. When sorting rows or copying formulae, be sure cell references do not change unintentionally.
Most spreadsheets have data validation tools. Use them! If you use a spreadsheet for data entry, build entry validation rules. Counts, averages, max/mins, standard deviations, value lookups and custom formulae provide more sophisticated QC checks. Pivot tables are useful for QC and summarization.
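
The same two habits, named constants and entry validation, carry over to scripts that process spreadsheet exports. The Python sketch below is only an analogy to the spreadsheet features described above. The function names and the temperature limits are hypothetical, and the conversion factor is the usual value for US liquid gallons per cubic foot.

```python
# Named constant kept in one place, like a "values and assumptions" tab.
CUBIC_FT_TO_GALLONS = 7.48052  # US liquid gallons per cubic foot

# Hypothetical plausibility limits for an air temperature entry, in degrees Celsius.
TEMP_C_MIN = -90.0
TEMP_C_MAX = 60.0


def cubic_feet_to_gallons(volume_cubic_ft: float) -> float:
    """Convert a volume using the named factor instead of a bare number."""
    return volume_cubic_ft * CUBIC_FT_TO_GALLONS


def validate_temperature(raw: str) -> float:
    """Parse one entry and reject values outside the documented limits."""
    value = float(raw)  # raises ValueError for non-numeric text
    if not TEMP_C_MIN <= value <= TEMP_C_MAX:
        raise ValueError(
            f"temperature {value} outside {TEMP_C_MIN}..{TEMP_C_MAX} degC"
        )
    return value


print(cubic_feet_to_gallons(2.0))
print(validate_temperature("21.5"))

try:
    validate_temperature("999")
except ValueError as error:
    print("rejected:", error)
```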

Version control and archiving
Versioning and change management: Periodic, dated backups are essential. Simple version control is possible via a naming convention for saving incremental copies, along with a log of significant changes. Version control systems, like Subversion, are powerful but complex.
How will you archive for long-term availability? Gear your approach to your data, and to your archive’s requirements.
Save content as delimited ASCII. Each tab must be saved separately. Document formulae separately, since results rather than formulae are saved. Easy cheat: ~ will reveal formulae, which can then be exported intact. However, references to other sheets will not be readily resolved.
Convert to XML.
Printing as PDF preserves appearance, but complicates future reuse.
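
A minimal Python sketch of the "one delimited file per tab" approach follows. It assumes the workbook has already been read into memory as a mapping from tab name to rows (reading a real spreadsheet file would need a separate library). The function name, the file naming convention and the sample content are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def export_tabs(workbook: dict, out_dir: Path, stem: str) -> list:
    """Write each tab to '<stem>_<tab>.csv' and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for tab_name, rows in workbook.items():
        path = out_dir / f"{stem}_{tab_name}.csv"
        with path.open("w", newline="", encoding="ascii") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written


# Hypothetical workbook: a data tab plus a legend tab describing its columns.
workbook = {
    "observations": [["station", "temp_c"], ["A1", 21.5]],
    "legend": [["column", "meaning"], ["temp_c", "air temperature, degrees Celsius"]],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = export_tabs(workbook, Path(tmp), "survey_20110825_v1")
    exported_names = [path.name for path in paths]
    first_text = paths[0].read_text(encoding="ascii")

print(exported_names)
print(first_text)
```

Only cell values are written here, which mirrors the caveat above: formulae have to be documented or exported separately.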

Data Formats: Using self-describing data formats
Curt Tilmes, NASA. Version 1.0

Overview
Self-describing data formats have become a well accepted way of archiving and disseminating scientific data.

Background
Before self-describing data formats became widely used, each project often invented its own data formats, often raw binary or even ASCII. These approaches had a number of problems: byte ordering and floating-point representations were machine dependent; a ‘key’ was required to open the file and read the right data; a new custom reader was needed for each different data organization; and working in a new language could be very difficult, since the reader had to be developed anew.
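
The byte-ordering problem is easy to reproduce. In this short Python sketch (the value is arbitrary), the same four bytes yield a different integer when the reader assumes the wrong byte order, and nothing in the raw bytes indicates which order was used.

```python
import struct

value = 1025  # hexadecimal 0x00000401

big_endian = struct.pack(">i", value)      # most significant byte first
little_endian = struct.pack("<i", value)   # least significant byte first

print(big_endian)      # b'\x00\x00\x04\x01'
print(little_endian)   # b'\x01\x04\x00\x00'

# A reader that assumes the wrong byte order silently gets a wrong number.
misread = struct.unpack("<i", big_endian)[0]
correct = struct.unpack(">i", big_endian)[0]

print(misread)   # 17039360
print(correct)   # 1025
```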

Self-describing data formats
Information describing the data contents of the file is embedded within the data file itself: names for the various fields; data types that are standardized, portable and machine independent; pointers to the various fields, making it efficient to extract the particular fields you want without reading the entire file; and attributes and flags related to the primary fields, carrying extra information such as units and fill values. These formats include a standard API and portable data access libraries in a variety of languages. There are tools that can open and work with arbitrary files, using the embedded descriptions to interpret the data.
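
To make the idea concrete, here is a toy Python sketch of a self-describing container. It is not HDF or NetCDF and does not reflect how those libraries store data. The layout (a length-prefixed JSON header followed by packed values), the function names and the sample fields are all assumptions made for illustration. The header carries field names, types, units, fill values and offsets, so a reader can extract one field without any outside 'key'.

```python
import json
import struct


def pack_fields(fields: dict) -> bytes:
    """Build a blob: 4-byte header length, JSON header, then packed values."""
    header = {}
    payload = b""
    for name, spec in fields.items():
        values = spec["values"]
        header[name] = {
            "format": spec["format"],        # struct type code, e.g. "d" or "h"
            "count": len(values),
            "units": spec["units"],
            "fill_value": spec["fill_value"],
            "offset": len(payload),          # pointer to this field's bytes
        }
        # "<" fixes the byte order, so the blob reads the same on any machine.
        payload += struct.pack(f"<{len(values)}{spec['format']}", *values)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<I", len(header_bytes)) + header_bytes + payload


def read_header(blob: bytes):
    """Return the embedded description and the position where data start."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return header, 4 + header_len


def read_field(blob: bytes, name: str) -> list:
    """Extract one field by name, using only the embedded description."""
    header, data_start = read_header(blob)
    info = header[name]
    fmt = f"<{info['count']}{info['format']}"
    return list(struct.unpack_from(fmt, blob, data_start + info["offset"]))


blob = pack_fields({
    "air_temperature": {"format": "d", "units": "degC", "fill_value": -9999.0,
                        "values": [21.5, -9999.0, 19.0]},
    "quality_flag": {"format": "h", "units": "1", "fill_value": -1,
                     "values": [0, 2, 0]},
})

header, _ = read_header(blob)
print(sorted(header))                      # field names found in the file itself
print(header["air_temperature"]["units"])  # degC
print(read_field(blob, "quality_flag"))    # [0, 2, 0]
```

Real formats add much more (hierarchies, compression, library support in many languages), but the principle is the same: the description travels with the data.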

Some examples
HDF (Hierarchical Data Format): HDF4 and HDF5 versions are in use today. A NASA variant called HDF-EOS is used within the Earth Observing System program. The Aura project developed a common approach across its instruments and released guidelines as a Technical Note.
NetCDF (Network Common Data Form): Widely used by agencies including NASA and NOAA. The Climate and Forecast (CF) metadata conventions help standardize how some information is encoded in NetCDF.

Best practices
Choosing a self-describing format is a good first step, but it isn’t a panacea. You still have to decide how to encode your data into the format. Think carefully about how you use the format: the layout of data within the file; unambiguous names for fields (use standard names if possible); units; fill values. Keep the users and readers of your files in mind. Some formats support seamless internal compression that can help with file sizes.
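
Fill values are a good example of why these choices need to be recorded with the data. In the Python sketch below, the field description, its attribute names and the value -9999.0 are hypothetical. A reader that ignores the documented fill value computes a meaningless mean, while one that honors it does not.

```python
# Hypothetical field with its attributes stored alongside the values.
field = {
    "name": "air_temperature",
    "units": "degC",
    "fill_value": -9999.0,
    "values": [21.5, -9999.0, 19.0, 20.5],
}


def mean_ignoring_fill(field_description: dict) -> float:
    """Average only the real measurements, skipping the documented fill value."""
    fill = field_description["fill_value"]
    valid = [v for v in field_description["values"] if v != fill]
    return sum(valid) / len(valid)


naive_mean = sum(field["values"]) / len(field["values"])
masked_mean = mean_ignoring_fill(field)

print(naive_mean)              # -2484.5, dominated by the fill value
print(round(masked_mean, 2))   # 20.33
```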

Case Study: Format abuse
A project had to distribute NORAD Two-Line Element (TLE) sets. This is a small amount of data, in a well-defined, common and widely used ASCII format. ASCII isn’t the best format, but for a small amount of data like this, especially in a widely used and understood format, it would have been fine. People understand the TLE format and have standard ways to parse it. Nevertheless, it isn’t self-describing, and people unfamiliar with TLE wouldn’t have a clue what those numbers mean. The project chose to encode the data into HDF.

Case Study: Format abuse (cont.)
A straightforward encoding would be to parse the fields, create fields with the right types (floating point) and name them according to their actual content from the TLE spec. They chose instead to maintain the ASCII text, encoding the individual characters of the file in their raw numerical form as an array of bytes. To read these data from the HDF file, you first have to extract the ASCII bytes, then parse the data according to the TLE spec. Rather than attaching metadata to the data fields, they created a separate empty dataset just to hold the metadata. This is just bizarre. Don’t do it like that.
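
The "straightforward encoding" starts with parsing the text into typed, named fields. The Python sketch below does this for the second line of an element set. The column positions follow the commonly documented fixed-width TLE layout and should be checked against the TLE spec before real use. The field names and the sample line are illustrative assumptions, not the project's data.

```python
def parse_tle_line2(line: str) -> dict:
    """Parse TLE line 2 into named, typed fields (fixed-width columns)."""
    if len(line) != 69 or line[0] != "2":
        raise ValueError("expected a 69-character TLE line 2")
    return {
        "catalog_number": int(line[2:7]),
        "inclination_deg": float(line[8:16]),
        "raan_deg": float(line[17:25]),
        # Eccentricity is stored with an implied leading decimal point.
        "eccentricity": float("0." + line[26:33]),
        "arg_perigee_deg": float(line[34:42]),
        "mean_anomaly_deg": float(line[43:51]),
        "mean_motion_rev_per_day": float(line[52:63]),
    }


# Illustrative sample line; the values are used only to exercise the parser.
LINE2 = "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537"

parsed = parse_tle_line2(LINE2)
for field_name, field_value in parsed.items():
    print(field_name, field_value)
```

Each entry of the resulting dictionary could then become its own named floating-point field in the output file, with units attached as attributes, instead of an opaque array of ASCII bytes.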

References and Resources
Cook, R. B., R. J. Olson, P. Kanciruk, L. A. Hook. “Best Practices for Preparing Ecological Data Sets to Share and Archive.” Bulletin of the Ecological Society of America 82(2):
Leong, K. “Seven deadly sins of spreadsheet use in business: Excel best practices.” scheduling.com/seven-deadly-spreadsheet-sins/
GCMD science and associated directory keywords. st.html
CF Metadata Convention.
CF Standard Names.

References and Resources (cont.)
HDF:
HDF-EOS:
HDF-EOS Aura File Format Guidelines: http://disc.sci.gsfc.nasa.gov/Aura/additional/documentation/HDFEOS_Aura_File_Format_Guidelines.pdf
/auraasabestpracticerev2.pdf
NetCDF:
CF: