Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data Wade Sheldon Georgia Coastal Ecosystems LTER University of Georgia.


1 Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data
Wade Sheldon, Georgia Coastal Ecosystems LTER, University of Georgia

2 Introduction
- Quality control of high-volume, real-time data from automated sensors is an emerging challenge
  - Traditional techniques (plotting, statistics) often don't scale well
  - Data validation and Q/C can be the limiting factor in getting data “online”
  - Difficulties lead to release delays or posting of provisional data
- Software developed at Georgia Coastal Ecosystems LTER has proven useful for Q/C of real-time data
- Designed to automate GCE data processing and metadata generation, but very generalized and supports any tabular data
- Provides a dynamic, rule-based Q/C framework for data processing, analysis and synthesis

3 Framework Components
- Comprehensive data model
  - Implemented as hierarchical MATLAB 'structure' arrays
  - Packages dataset and attribute metadata, data, Q/C rules, and qualifier flags
- Metadata-based MATLAB software (GCE Data Toolbox)
  - Automatic (rule-based) and manual assignment of Q/C qualifier flags
  - Transparent management of flags throughout all data manipulation
  - Q/C-aware data management and analysis tools
  - Q/C-aware data integration and synthesis tools
- Modular implementation supports many scenarios
  - Interactive (command-line API and GUI forms)
  - Automated workflows (timed or triggered)
  - End-to-end (logger-to-scientist) or part of a larger workflow
  - Runs natively on multiple platforms (PC, *nix, MacOS)

4 GCE Data Toolbox Data Model

5 Quality Control Rules
- Basic syntax: [logical expression]='[flag code]'
- Logical expressions:
  - Any conditional statement or call to a MATLAB function that returns a logical array (0 = false, 1 = true)
  - Dataset columns are referenced in statements as:
    - “x”: alias for the current column (e.g. x<0)
    - “col_[name]”: any dataset column by name (e.g. col_Depth<0)
- Flag codes:
  - Alphanumeric character to assign when the expression is true (I, q, 9, *)
  - Codes are defined in the dataset metadata (I = invalid value, …)
- Unlimited rules per attribute, multiple flags per value
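
The rule syntax above can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the GCE Data Toolbox implementation: the function names `parse_rules` and `evaluate_rules`, the sample columns, and the row-by-row evaluation are all assumptions made for illustration (the toolbox evaluates vectorized MATLAB expressions). Only simple comparisons are handled, because MATLAB operators such as `|` and `&` behave differently in Python.

```python
import re

# One rule looks like:  <logical expression>='<single flag character>'
RULE_PATTERN = re.compile(r"^(.+)='(.)'$")


def parse_rules(criteria):
    """Split "expr1='A';expr2='B'" into [(expression, flag), ...]."""
    rules = []
    for part in criteria.split(";"):
        part = part.strip()
        if not part:
            continue
        match = RULE_PATTERN.match(part)
        if match is None:
            raise ValueError(f"not a valid rule: {part!r}")
        rules.append((match.group(1).strip(), match.group(2)))
    return rules


def evaluate_rules(criteria, columns, current):
    """Return one flag string per row; several rules may flag the same value."""
    n_rows = len(columns[current])
    flags = [""] * n_rows
    for expression, flag in parse_rules(criteria):
        for row in range(n_rows):
            # "col_<name>" refers to any column, "x" to the current column.
            names = {f"col_{name}": vals[row] for name, vals in columns.items()}
            names["x"] = columns[current][row]
            # eval() is used only to keep the sketch short; it is not safe
            # for untrusted rule text.
            if eval(expression, {"__builtins__": {}}, names):
                flags[row] += flag
    return flags


# Hypothetical sample data, not taken from the presentation.
columns = {
    "Salinity": [31.2, -1.0, 28.4, 120.0],
    "Depth": [1.5, 2.0, -0.3, 1.1],
}
salinity_flags = evaluate_rules(
    "x<0='I';x>100='I';col_Depth<0='Q'", columns, "Salinity"
)
print(salinity_flags)  # ['', 'I', 'Q', 'I']
```

Because each rule appends its flag code to the row's flag string, a value that satisfies several rules carries several flags, which mirrors the "multiple flags per value" point on the slide.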

9 Quality Control Rule Examples
- Numeric comparisons:
  - Simple:
    - x<0='I' (flags negative values)
    - x[…]100='I';x[…]80='Q' (overlapping bounds checks)
  - Statistical:
    - x>(mean(x)+3*std(x))='Q';x<(mean(x)-3*std(x))='Q' (flags values more than 3 standard deviations from the column mean)
  - Multi-column:
    - col_DOC>col_TOC='I' (in column DOC; flags DOC exceeding TOC)
    - col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90='I' (flags dry weights below 90% of wet weight minus ash weight)
    - col_Depth<0='I' (in column Salinity; flags Salinity when Depth < 0)
  - Compound (Boolean operators):
    - col_RH_Percent>100&col_Precip[…] 100% except during significant precipitation events)
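
As a rough illustration of the statistical and multi-column rules listed above, the following Python sketch reproduces their logic outside MATLAB. The function names and sample values are hypothetical. `statistics.stdev` is used because, like MATLAB's default `std`, it normalizes by N-1.

```python
from statistics import mean, stdev


def flag_sigma_outliers(values, n_sigma=3.0, flag="Q"):
    """Flag values more than n_sigma sample standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (N-1 normalization)
    upper = center + n_sigma * spread
    lower = center - n_sigma * spread
    return [flag if (v > upper or v < lower) else "" for v in values]


def flag_doc_exceeding_toc(doc, toc, flag="I"):
    """Flag DOC values that exceed the TOC value in the same row."""
    return [flag if d > t else "" for d, t in zip(doc, toc)]


# Hypothetical data: nineteen steady readings and one spike.
readings = [10.0] * 19 + [50.0]
sigma_flags = flag_sigma_outliers(readings)
print(sigma_flags.count("Q"), sigma_flags[-1])  # 1 Q

doc_flags = flag_doc_exceeding_toc([2.1, 3.5, 1.0], [2.5, 3.0, 1.0])
print(doc_flags)  # ['', 'I', '']
```

A caveat on the three-sigma rule: with N-1 normalization no value can lie more than (N-1)/sqrt(N) standard deviations from the mean, so the rule can never flag anything in a column of ten or fewer values.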

12 Quality Control Rule Examples (cont.)
- Text comparisons:
  - “IS”, “NOT” for strings, “IN”, “NOT IN” for lists
  - flag_notinlist(x,'Spartina,Juncus,Zizaniopsis')='Q'
- Algorithmic criteria (custom functions):
  - fn(parameters)='Q'
  - Various included Q/C functions
    - pattern checks, geographic checks, specialized algorithms (O2 saturation, etc.)
  - User-defined functions:
    - Any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc.
    - Unlimited scope
- Full suite of MATLAB numeric analysis capabilities supported, and extensible to use other technology
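
A custom Q/C function only needs to return one true/false value per row. The Python sketch below imitates what a list-membership check like `flag_notinlist` might do. It is not the toolbox's MATLAB code, and the case-sensitive matching and whitespace trimming are assumptions.

```python
def flag_notinlist(values, allowed_csv):
    """Return True for each value that is NOT in the comma-separated list.

    Assumption: matching is case-sensitive after trimming whitespace.
    """
    allowed = {item.strip() for item in allowed_csv.split(",")}
    return [value.strip() not in allowed for value in values]


# Hypothetical species column with one unexpected entry.
species = ["Spartina", "Juncus", "Borrichia", "Zizaniopsis"]
mask = flag_notinlist(species, "Spartina,Juncus,Zizaniopsis")

# Applying the rule  flag_notinlist(x,'...')='Q'  assigns 'Q' where the mask is true.
species_flags = ["Q" if hit else "" for hit in mask]
print(species_flags)  # ['', '', 'Q', '']
```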

13 Q/C Rule Management
- Rules can be defined in metadata “templates” and applied automatically to attributes when raw data are imported
- Rules can also be created and managed using a GUI form

14 Q/C Flag Assignment
- Q/C criteria are evaluated to assign/clear flags when:
  - A metadata template is applied or Q/C criteria are edited
  - New data records or columns are added
  - Values are edited (GUI) or columns are updated (CLI)
  - The evaluation function (dataflag) is invoked directly
- Flags can also be assigned/cleared manually by:
  - Clicking/dragging on plots with the mouse
  - Using a spreadsheet-like grid
  - Importing from text attributes (e.g. 3rd-party codes)
  - Propagating flags from source column(s) to dependent column(s)
- Manual assignment locks flags by inserting a “manual” token in the criteria; removing “manual” restores automatic evaluation
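
The locking behaviour described above can be sketched as follows. The representation is an assumption: the criteria are treated as a semicolon-separated string in which a bare `manual` token marks the flags as locked, and all helper names are hypothetical. The toolbox's actual storage format may differ.

```python
def is_locked(criteria):
    """True when the criteria string contains a bare 'manual' token."""
    return "manual" in [token.strip() for token in criteria.split(";")]


def lock(criteria):
    """Add the 'manual' token so automatic evaluation is skipped."""
    if is_locked(criteria):
        return criteria
    return criteria + ";manual" if criteria else "manual"


def unlock(criteria):
    """Remove the 'manual' token to restore automatic evaluation."""
    tokens = [token.strip() for token in criteria.split(";")]
    return ";".join(t for t in tokens if t and t != "manual")


def refresh_flags(criteria, values, current_flags, evaluate):
    """Re-evaluate flags unless they are locked by manual assignment."""
    if is_locked(criteria):
        return list(current_flags)
    return evaluate(values)


def auto_rule(values):
    """Stand-in for automatic evaluation of the rule x<0='I'."""
    return ["I" if v < 0 else "" for v in values]


criteria = "x<0='I'"
values = [1.0, -2.0, 3.0]
flags = refresh_flags(criteria, values, ["", "", ""], auto_rule)

# A manual edit flags the third value and locks the column.
flags[2] = "Q"
criteria = lock(criteria)
locked_flags = refresh_flags(criteria, values, flags, auto_rule)

# Removing the token lets the automatic rule overwrite the manual flag.
criteria = unlock(criteria)
restored_flags = refresh_flags(criteria, values, locked_flags, auto_rule)
print(locked_flags, restored_flags)  # ['', 'I', 'Q'] ['', 'I', '']
```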

15 Q/C-Aware Data Management & Analysis
- Q/C flags can be visualized in the data editor grid and in plots
- Flagged values can be selectively removed from data sets
- Statistics can be generated with or without flagged values
- Flags can be instantiated as coded text columns for export
- Flagged and missing values can be summarized by parameter and date for metadata
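
To make "statistics with or without flagged values" concrete, here is a small Python sketch under assumed conventions: flags are stored as one string per value (empty meaning unflagged) and missing values are `None`. The function name and data are hypothetical.

```python
from statistics import mean


def column_mean(values, flags, exclude_flagged=True):
    """Mean of a column, optionally ignoring values that carry any flag."""
    kept = [
        value
        for value, flag in zip(values, flags)
        if value is not None and not (exclude_flagged and flag)
    ]
    return mean(kept) if kept else None


# Hypothetical column in which the third value was flagged invalid.
values = [10.0, 12.0, 99.0, 11.0]
flags = ["", "", "I", ""]

mean_all = column_mean(values, flags, exclude_flagged=False)
mean_clean = column_mean(values, flags, exclude_flagged=True)
print(mean_all, mean_clean)  # 33.0 11.0
```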

16 Q/C-Aware Data Synthesis
- Flagged and missing values are summarized in re-sampled data (aggregated, binned, date-time resampled), with automatic Q/C rule creation
- Flags are automatically “locked” when merging multiple data sets (i.e. unions)
- All Q/C operations are logged to the processing history and reported in metadata to document lineage
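
One way to picture the summarization of flagged and missing values during re-sampling is the Python sketch below, which bins records by day. The record layout, the function name, and the choice to exclude flagged values from the daily mean are assumptions, not the toolbox's documented behaviour.

```python
from collections import defaultdict
from statistics import mean


def daily_summary(records):
    """Aggregate (day, value_or_None, flag) records into per-day summaries."""
    bins = defaultdict(lambda: {"good": [], "flagged": 0, "missing": 0})
    for day, value, flag in records:
        if value is None:
            bins[day]["missing"] += 1
        elif flag:
            bins[day]["flagged"] += 1
        else:
            bins[day]["good"].append(value)
    return {
        day: {
            "mean": mean(b["good"]) if b["good"] else None,
            "flagged": b["flagged"],
            "missing": b["missing"],
        }
        for day, b in sorted(bins.items())
    }


# Hypothetical records: (date, value, flag string).
records = [
    ("2008-06-01", 20.0, ""),
    ("2008-06-01", 22.0, ""),
    ("2008-06-01", 95.0, "I"),
    ("2008-06-02", None, ""),
    ("2008-06-02", 21.0, ""),
]
summary = daily_summary(records)
print(summary["2008-06-01"])  # {'mean': 21.0, 'flagged': 1, 'missing': 0}
```

The per-bin counts could then drive rules on the re-sampled column, for example flagging any daily mean whose bin contained flagged or missing inputs.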

17 Implementation Scenarios
- End-to-end (logger-to-scientist)
  - Acquire raw data from logger or file system (standard or custom import filters)
  - Assign metadata from a template or using forms to validate and flag data
  - Review data and fine-tune flag assignments
  - Generate distribution files & plots, archive data, index for searching
  - Desktop data management solution
- Data pre-processing
  - Acquire, validate and flag raw data (on demand or timed/triggered)
  - Upload processed data files (e.g. csv) or value & flag arrays to RDBMS
- Workflow step
  - Call toolbox functions as part of another workflow process or custom program
  - Kepler MATLAB actor?

18 Suitability for Real-Time Sensor Data
- Good scalability
  - Data volumes only limited by computer memory (tested with >2 GB data sets)
  - Multiple instances can be run on high-end, 64-bit, clustered workstations
  - Good flag evaluation performance in use and in testing with diverse rule sets
- Good scope for automation
  - Timed and triggered workflow implementations are easy to deploy
- Support for multiple I/O formats, transport protocols
  - Formats: ASCII, MATLAB, SQL, XML (partially implemented)
  - Transport: local file system, UNC paths, HTTP, FTP, SOAP
- Already used for real-time GCE data, USGS data harvesting service (LTER HydroDB, CWT)

19 Concluding Remarks
- Benefits
  - Flexible, modular design
  - No qualifier vocabulary or semantics assumed, so flags can serve many purposes and standards
  - Many operations on flagged values, supporting different strategies for archiving and distributing data at different processing levels
- Limitations
  - Requires MATLAB
  - Rule syntax is environment-specific; a more open standard would be ideal
  - Support for XML metadata is immature (but more development is planned)
More information and downloads at: http://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm
This work was supported by the National Science Foundation under grant numbers OCE-9982133 and OCE-0620959

