Survey of Current Practices for Reporting Missing, Qualified Data
Wade Sheldon, GCE-LTER



Missing Data
- Missing observations are ubiquitous in environmental data sets
- Primary data
  - Failures in measurement (equipment, data logging, communications)
  - Failures in data management (data entry, data loss, corruption)
- Processed data
  - QC/QA operations (data removal)
- Important to distinguish nature of missing values (Little & Rubin, 1984):
  - MCAR = missing completely at random (independent of data)
  - MAR = missing at random (independent of missing parameter, but may depend on other observed components and be predictable)
  - Non-ignorable (pattern non-random, cannot be predicted; mechanism related to missing values themselves, like off-scale readings)

Common Reporting Practices
- Structured binary storage systems
  - RDBMS – ANSI NULL
  - MATLAB, R (C, Java, …) – NaN (IEEE 754)
- XML text
  - Omitted elements
  - Empty elements
  - Text codes (unless numeric-typed in schema)
- Other text storage formats, spreadsheets
  - Anything and everything
  - Commonly seen examples:
    - Omitted records (e.g. long data gaps)
    - Omitted fields (i.e. delimiter-delimiter, empty cell)
    - Text codes: nd, n/a, M, NaN, period
    - Out-of-range numeric values: -9999
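As a rough illustration of how the text codes listed above might be mapped onto a single in-memory representation, here is a minimal Python sketch. The `MISSING_CODES` set and the `parse_value` helper are assumptions made for this example only; a real data source would document its own codes.

```python
import math

# Hypothetical set of missing-value codes, based on the examples in the slide.
# Codes are compared case-insensitively after stripping whitespace.
MISSING_CODES = {"", "nd", "n/a", "m", "nan", ".", "-9999"}

def parse_value(text, missing_codes=MISSING_CODES):
    """Return a float, or NaN when the field holds a recognized missing-value code."""
    token = text.strip()
    if token.lower() in missing_codes:
        return math.nan
    return float(token)

# Made-up fields from one delimited record, including an empty cell.
fields = ["12.4", "nd", "", "-9999", "N/A", "7.0"]
values = [parse_value(f) for f in fields]
```

Note that the comparison is purely textual, so a variant such as `-9999.0` would not match unless it is added to the set.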

Ramifications of Missing Value Encodings
- Non-standard codes need to be filtered or replaced before loading ASCII data into structured storage
  - Requires source-specific processing
  - Adds overhead, points of failure
- Omitted records can disrupt parsers (e.g. space-delimited text files)
- Out-of-range numeric values can lead to major analytical errors if not recognized by data users and automated workflow tools
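The last point can be made concrete with a small worked example. The readings below are made up for illustration, and the `-9999.0` sentinel is assumed to be the only missing-value code in the series.

```python
from statistics import mean

# Made-up temperature readings; one value is the -9999 missing-value sentinel.
temps = [21.5, 22.0, -9999.0, 21.0]

# Treating the sentinel as a real observation drags the mean far out of range.
naive_mean = mean(temps)      # -2483.625

# Filtering the sentinel first gives a plausible result.
cleaned = [t for t in temps if t != -9999.0]
clean_mean = mean(cleaned)    # 21.5
```

A single unrecognized sentinel among four readings is enough to turn a mean of about 21.5 into roughly -2484.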

Example – USGS

Example – NOAA NCDC/NWS

Example – NOAA NOS

Flags/Qualifiers
- Field annotations often present in data sets (record-level metadata)
- Often used to indicate anomalies identified during QC/QA (questionable/suspect, invalid, estimated)
- Also used to convey data use information (accumulating amount, accepted/provisional, good value)
- Representations highly variable
  - Flag attribute adjacent to observation attribute in table
  - Text/special characters appended to value (e.g. *)
  - Embedded flags in place of observation value (ice, rat, eqp, ***)
  - Variation in formatting (braces/brackets around values)
- Code definitions often hard to find for federal data

Ramifications of Flags/Qualifiers
- Flag formats other than dedicated attributes often break data parsers (particularly embedded flags)
- Conventional analysis software (e.g. spreadsheets, graphics apps) is ignorant of flags and provides few uses for the information
- Non-obvious, undefined flags are of dubious value (1, *)
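One way to cope with appended or embedded flags is to split each raw field into a numeric value and a dedicated flag before further processing. The sketch below is only an assumption about how that could look: the `EMBEDDED_FLAGS` set reuses the example codes from the slides, and `split_flag` is a hypothetical helper, not part of any agency's tooling.

```python
import re

# Hypothetical embedded flag codes, taken from the examples in the slides.
EMBEDDED_FLAGS = {"ice", "rat", "eqp", "***"}

# Optional opening bracket, a plain decimal number, optional closing bracket,
# then any trailing non-digit characters, which are treated as the flag.
_VALUE_WITH_SUFFIX = re.compile(
    r"^[\[\{(]?\s*(-?\d+(?:\.\d+)?)\s*[\]\})]?\s*([^\d\s]*)$"
)

def split_flag(field):
    """Return (value, flag): value is a float or None, flag is '' when absent."""
    token = field.strip()
    if token.lower() in EMBEDDED_FLAGS:
        # The flag replaces the observation, so there is no numeric value.
        return None, token
    match = _VALUE_WITH_SUFFIX.match(token)
    if match is None:
        raise ValueError(f"unrecognized field: {field!r}")
    return float(match.group(1)), match.group(2)

# Made-up raw fields covering the representations listed in the slides.
raw = ["12.5*", "[3.2]", "ice", "7.0"]
parsed = [split_flag(f) for f in raw]
```

Raising on unrecognized fields, rather than guessing, keeps undefined flags from slipping silently into the numeric column.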

Example – ClimDB

Example – NOAA NOS

Metadata Practices
- USGS, NOAA
  - Rely on published protocols for documenting QC/QA practices and qualifier code definitions, which can be very hard to find
  - Metadata distributed with files is sparse
- LTER/EML
  - Missing value codes defined at the attribute level (requires full implementation of dataTable, physical, attribute)
  - Various places to document QC/QA and data anomalies (e.g. QC methods trees can be added at various levels in the document, such as dataset, dataTable, attribute, …)
  - EBP document doesn’t provide specific guidelines, and makes no mention of how to describe data anomalies (dataTable/additionalInfo, additionalMetadata, ?)
- General
  - Reporting of QC/QA methodology and data anomalies varies tremendously in both structure and depth
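To illustrate what attribute-level missing value codes make possible, the following Python sketch reads the codes for each attribute from a metadata fragment. The fragment is a simplified, namespace-free stand-in written for this example, not a complete or valid EML document, and `missing_codes_by_attribute` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified fragment written for this example; real EML documents are
# namespaced and carry many more required elements.
EML_FRAGMENT = """
<attributeList>
  <attribute>
    <attributeName>water_temp</attributeName>
    <missingValueCode>
      <code>-9999</code>
      <codeExplanation>sensor failure</codeExplanation>
    </missingValueCode>
  </attribute>
  <attribute>
    <attributeName>salinity</attributeName>
    <missingValueCode>
      <code>NaN</code>
      <codeExplanation>value not recorded</codeExplanation>
    </missingValueCode>
  </attribute>
</attributeList>
"""

def missing_codes_by_attribute(xml_text):
    """Map each attribute name to the list of missing-value codes declared for it."""
    root = ET.fromstring(xml_text)
    codes = {}
    for attribute in root.iter("attribute"):
        name = attribute.findtext("attributeName")
        codes[name] = [
            mv.findtext("code") for mv in attribute.findall("missingValueCode")
        ]
    return codes

codes = missing_codes_by_attribute(EML_FRAGMENT)
```

With the codes available per attribute, a loader can replace them with NULL or NaN without any source-specific guessing.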