Managing Data Quality in a Terabyte-scale Sensor Archive (SAC 2008)
Bryce Cutt, Ramon Lawrence
University of British Columbia Okanagan, Kelowna, British Columbia, Canada

Page 2: Data-Driven Scientific Discovery
Modern scientific discovery uses vast quantities of data generated from instruments, sensors, and experimental systems.
- The quality and impact of the research is highly dependent on the quality of the data collection and analysis.
Challenges:
- The amount of data required for research is exploding.
- The types and sources of data are increasing, and data generated by different experiments or devices must be integrated.
These two factors require scientists to be concerned with how their experimental data is collected, archived, analyzed, and integrated to lead to research contributions.

Page 3: Fundamental Sensor Data Archive Issue
Sensors produce vast amounts of data that are valuable for historical as well as real-time applications and analysis.
Due to the number of sensors and the volume of data collected, manual curation and data validation is difficult.
By their nature, sensors are prone to failures, inaccuracies, and periods of intermittent or substandard performance.
Despite these device limitations, the historical data record should be as clean and accurate as possible.

Page 4: Key Question (and Answer)
Question: How can we achieve high-quality historical archives of sensor data?
Answer: In addition to operational monitoring of the data archive system, the data stream should be analyzed using metadata properties to detect errors.
- Operational monitoring – Are the system components and workflow functioning properly?
- Metadata validation – Does the data stream conform to known ranges? Can data cleansing and correction be performed?

Page 5: NEXRAD Archive System Overview
Our goal is to provide the science community with ready access to the vast archives and real-time information collected by the national network of NEXRAD radars. [This requires hiding the numerous data management issues.]
We will briefly overview:
- The data collected by the NEXRAD system and its scientific value.
- The current state of NEXRAD data archiving and its use in scientific discovery, including its data quality limitations.
- An extension of the system that uses metadata properties to validate and clean archived data.

Page 6: NEXRAD System and Generated Data
There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States.
- The system has been operational for over 10 years, and the amount of collected data is continually expanding.
A radar emits a coherent train of microwave pulses and processes the reflected pulses. Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360º produces a sweep. After a sweep, the radar elevation angle is increased and another sweep is performed. All sweeps together form a volume. (A minimal sketch of this data hierarchy follows below.)
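To make the bin/ray/sweep/volume hierarchy concrete, here is a minimal illustrative sketch in Python. The class and field names are assumptions for illustration only; they are not taken from the NEXRAD software.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Bin:
    """One processed pulse: a single range gate along a ray."""
    reflectivity_dbz: float              # reflectivity value for this range gate (hypothetical field)

@dataclass
class Ray:
    """One beam at a fixed azimuth: an ordered list of bins by range."""
    azimuth_deg: float
    bins: List[Bin] = field(default_factory=list)

@dataclass
class Sweep:
    """One full 360-degree rotation at a fixed elevation angle."""
    elevation_deg: float
    rays: List[Ray] = field(default_factory=list)

@dataclass
class Volume:
    """All sweeps collected in one scan cycle of the radar."""
    radar_id: str
    timestamp_utc: str
    sweeps: List[Sweep] = field(default_factory=list)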

Page 7: Usefulness of NEXRAD Data
Although the NEXRAD system was designed for severe weather forecasting, the data collected has been used in many areas including:
- flood prediction
- bird and insect migration
- rainfall estimation
The value of this data has been noted by an NRC report which labeled it a "critical resource."
Enhancing Access to NEXRAD Data—A Critical National Resource. National Academy Press, Washington, D.C., 1999.

Page 8: Archiving NEXRAD Data
Researchers have two options for acquiring NEXRAD data:
1) Retrieve RAW data from the National Climatic Data Center (NCDC) tape archive.
2) Capture real-time data distributed by the University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system.
Acquiring, archiving, and analyzing the data requires significant computational and operational knowledge, which makes it impractical for many researchers.

Page 9: NEXRAD Archive System
The NEXRAD archive system is an NSF-funded project that aims to simplify the analysis of NEXRAD data for researchers.
The NEXRAD archive:
- Collects and archives RAW data from the real-time stream.
- Analyzes and indexes data for retrieval by metadata properties.
- Performs data cleansing such as removing ground clutter.
- Allows researchers access to historical and real-time data in RAW form.
- Provides an analysis workflow system that generates derived products (such as rainfall maps) using the RAW data, known algorithms, and researcher parameters.
The NEXRAD archive is hosted at the University of Iowa, and development is done in conjunction with NCDC, Unidata, and Princeton University.

Page 10: NEXRAD Archive Architecture
- Files are added from the real-time stream or from other sources.
- A metadata extractor produces an XML description of each data file, which is used for indexing.
- Clients can access the archive directly using a C library and their own programs.
  - All data files are web accessible.
  - The metadata directory can be queried using a web services interface.
- Most clients use the pre-constructed web workflow system and do not access RAW data.
- Data and metadata will be replicated at a supercomputing center and eventually at NCDC.

Page 11: Metadata Archive – User/Client's View
[Architecture diagram: the user/client queries the metadata archive over HTTP via web services, receives URIs, and then fetches the data from the distributed data archive (NCDC, Iowa, etc.).]
Example query: "Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, with a spatial extent of more than Z km², and with a duration of less than N hours. I want the data in GeoTIFF."
An illustrative sketch of issuing such a query programmatically follows below.
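The sketch below shows how a client might issue the query above against the metadata directory's web services interface and then fetch the returned URIs. It is purely illustrative: the endpoint URL, parameter names, threshold values, and response fields are assumptions, since the slide does not specify the actual API, and X, Z, and N are only placeholders in the original query.

import requests  # assumes the 'requests' package is installed

# Hypothetical metadata query endpoint and parameter names (not the real API).
METADATA_ENDPOINT = "http://example.org/nexrad-archive/metadata/query"

params = {
    "watershed": "Ralston Creek",
    "year": 2002,
    "min_mean_areal_precip_mm": 25,   # stands in for "greater than X mm"
    "min_spatial_extent_km2": 10,     # stands in for "more than Z km²"
    "max_duration_hours": 6,          # stands in for "less than N hours"
    "format": "GeoTIFF",
}

# Query the metadata archive via web services; it returns URIs, and the data
# itself is then fetched over HTTP from the distributed data archive.
response = requests.get(METADATA_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

for uri in response.json().get("uris", []):
    data = requests.get(uri, timeout=60).content
    filename = uri.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(data)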

Page 12: NEXRAD System Current Statistics
The NEXRAD archive system:
- Has been running for over 2 years
- Collects data for 30 of the 150 radars
- Has indexed over 8 million radar scans
- Has RAW data that is over 8 TB in compressed form
- Processes a real-time data stream of GB/day
- Supports a sophisticated workflow system that produces derived data products (e.g. rainfall maps) for users on demand
- Has an operational monitoring system (is the archive workflow pipeline functioning properly?) but only simple data validation checks
Question: What is the quality of the data being archived?

Page 13: Archive Monitoring System
We developed a new archive monitoring system that:
- Explicitly tracks all archive workflow events in logs that are stored and queried using a database
- Detects data corruption using metadata properties as well as pipeline failures
- Produces reports on a web interface to simplify archive administration
The monitoring system was developed and operated separately from the main archive to compare performance and to prevent issues with the operational system.

Page 14: Archive System with Monitor
Basic archive workflow components are unchanged except for logging:
- Converter – translates the RAW form to compressed RLE
- Metadata Extractor – analyzes data properties and checks for inconsistencies
- Loader – loads metadata into the database and files onto web servers
Monitoring system:
- Loads XML log records from each archive component into a database.
- Provides metadata ranges for checking data validity.
- Tracks files through the pipeline (lineage) and handles corrupt files.
- Has a separate log database that is accessed using a web front-end.
- Can restart any workflow software.
(A sketch of how a component could emit such log records follows below.)
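As a rough illustration of how each workflow component (converter, metadata extractor, loader) could emit structured XML log events for the monitor to pick up later, here is a minimal sketch. The element names, file layout, stage names, and status values are assumptions for illustration, not the project's actual log format.

import datetime
import xml.etree.ElementTree as ET

def write_log_event(stage: str, raw_file: str, status: str, detail: str = "") -> str:
    """Append one XML log event for a RAW file as it passes through a workflow stage."""
    event = ET.Element("logEvent")
    ET.SubElement(event, "stage").text = stage        # e.g. converter | extractor | loader
    ET.SubElement(event, "file").text = raw_file      # identifies the RAW file (lineage)
    ET.SubElement(event, "status").text = status      # e.g. ok | fixed | dropped | warning
    ET.SubElement(event, "detail").text = detail
    ET.SubElement(event, "timestamp").text = datetime.datetime.now(datetime.timezone.utc).isoformat()
    xml_text = ET.tostring(event, encoding="unicode")
    # Events are appended to a local log that the monitor's log processor reads later,
    # keeping logging off the archive's critical path.
    with open("archive_events.log", "a") as log:
        log.write(xml_text + "\n")
    return xml_text

# Example: the converter records a successful RAW-to-RLE conversion
# (the file name here is hypothetical).
write_log_event("converter", "KDVN_20080115_0342.raw", "ok", "converted to compressed RLE")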

Page 15: Validating Sensor Data using Metadata
The operational ranges of data produced by sensors are commonly known.
- For example, the timestamp of sensor readings for the radars should be close to the current time, and reflectivity readings fall within known ranges given the weather conditions.
The monitoring system provides these operational ranges to the metadata extractor component, which can verify that data is within the accepted ranges.
Data outside the ranges causes a file to be dropped if it is not recoverable, fixed if possible (e.g. date changes), or otherwise flagged as a warning of potential corruption. (See the sketch of this decision logic below.)
The goal is to get as much data through the pipeline as possible while making sure compromised data is flagged.
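A minimal sketch of the drop / fix / flag decision described above, assuming hypothetical metadata fields and operational ranges supplied by the monitoring system; the field names and thresholds are illustrative, not the project's actual rules.

import datetime

# Illustrative operational ranges; in the system described here, these would be
# supplied by the monitoring system to the metadata extractor.
REFLECTIVITY_RANGE_DBZ = (-32.0, 95.0)            # assumed plausible bounds
MAX_TIMESTAMP_SKEW = datetime.timedelta(hours=1)  # assumed "close to current time"

def validate_scan(metadata: dict) -> tuple[str, dict]:
    """Return (action, metadata) where action is 'accept', 'fix', 'flag', or 'drop'."""
    now = datetime.datetime.now(datetime.timezone.utc)
    ts = metadata.get("timestamp")

    # Timestamp far from the current time: fix it if the error is recoverable
    # (e.g. a recognizable date-field corruption), otherwise drop the file.
    if ts is None:
        return "drop", metadata
    if abs(now - ts) > MAX_TIMESTAMP_SKEW:
        if metadata.get("arrival_time") is not None:
            metadata["timestamp"] = metadata["arrival_time"]  # recoverable date error
            return "fix", metadata
        return "drop", metadata

    # Reflectivity outside known ranges: keep the file but flag it as a warning
    # of potential corruption so it is not silently trusted downstream.
    lo, hi = REFLECTIVITY_RANGE_DBZ
    if not (lo <= metadata.get("max_reflectivity_dbz", lo) <= hi):
        return "flag", metadata

    return "accept", metadata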

Page 16: Monitoring System Implementation
The monitoring system required that each workflow component be changed to use XML log records instead of separate files.
Each XML log record is loaded into a Postgres database by a log processor. The log processor and logging are separate from the archive system to ensure that logging does not slow the archiving.
As a RAW file proceeds through the pipeline, log events for it are recorded at each stage. Files that do not make it through the pipeline are no longer "lost" to the archive as before.
The administrator has a web front-end to control archive processes, monitor events on a per-file or per-process level, and track operational characteristics across the entire workflow. (A sketch of the log database follows below.)
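For instance, the log processor could load each XML event into a Postgres table along the following lines; the schema, column names, and connection string are a sketch under assumptions, not the project's actual database design. The query illustrates how files that never reached the loader can be surfaced instead of being silently lost.

import psycopg2  # assumes psycopg2 is installed and a Postgres server is reachable

# One row per workflow event; table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS log_events (
    id         SERIAL PRIMARY KEY,
    file_name  TEXT NOT NULL,
    stage      TEXT NOT NULL,        -- converter | extractor | loader
    status     TEXT NOT NULL,        -- ok | fixed | dropped | warning
    detail     TEXT,
    event_time TIMESTAMPTZ NOT NULL
);
"""

# Files that entered the pipeline but were never successfully loaded.
STALLED_FILES = """
SELECT DISTINCT file_name
FROM log_events
WHERE file_name NOT IN (
    SELECT file_name FROM log_events WHERE stage = 'loader' AND status = 'ok'
);
"""

conn = psycopg2.connect("dbname=archive_monitor")  # hypothetical database name
with conn, conn.cursor() as cur:
    cur.execute(SCHEMA)
    cur.execute(STALLED_FILES)
    for (file_name,) in cur.fetchall():
        print("did not complete the pipeline:", file_name)
conn.close()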

Page 17: Monitor System Administrator Interface
[Screenshot of the administrator web interface.]

Page 18: Operational Results
A duplicate NEXRAD archive system that included the monitoring system processed two radars in parallel with the live system for six months.
Key results:
- Data errors occur in about 5% of input files. The expectation was less than 1% given the sophistication of the sensors.
  - One radar had data corruption for a two-week period that went unnoticed in the live system because the radar was indicating good operational status.
  - Most errors were not fixable, but a significant number of date errors were correctable.
- Administrator time was reduced dramatically compared to manual log investigation.
- The cost of logging a sensor stream is high. Storing log records in a database is a bottleneck and must be separated from the archive. (Database loading is also a bottleneck in the archive itself.)

Page 19: Future Work and Conclusions
Archiving sensor data is going to be an increasing challenge.
- Ensuring high-quality archives requires more than operational monitoring and should also include data validation using metadata properties.
The live archive system is being updated to use the monitoring system.
The bottleneck in archiving and monitoring sensor data is the database system. The monitor should be separated from the archive.
- Loading the metadata into the archive takes an order of magnitude longer than generating it.
- The metadata is growing beyond the capabilities of a single database. It will be replicated and distributed for performance and political reasons.
- Logging to the database provides easy access to information, but you must be aware of the performance issues.

Page 20: Project Participants
The University of Iowa (Lead)
- W.F. Krajewski (PI)
- A.A. Bradley, A. Kruger, R. Lawrence
Princeton University
- J.A. Smith (PI)
- M. Steiner, M.L. Baeck
National Climatic Data Center
- S.A. Delgreco (PI)
- S. Ansari
UCAR/Unidata Program Center
- M.K. Ramamurthy (PI)
- W.J. Weber
Research supported by NSF ITR Grant ATM : "A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology".

Managing Data Quality in a Terabyte-scale Sensor Archive
Bryce Cutt, Ramon Lawrence
University of British Columbia Okanagan, Kelowna, British Columbia, Canada
Thank You!