GIS Data Quality: Producing better data quality through robust business processes. Kim Ollivier, BrightStar TRAINING.


Schedule: Day One
Suggested times:
- Start: 9:00
- Session 1 (90 min)
- Morning tea: 10:30 to 10:45
- Session 2 (105 min)
- Lunch: 12:30 to 1:30
- Session 3 (90 min)
- Afternoon tea: 3:00 to 3:15
- Session 4 (105 min)
- Finish: 5:00
Each session will have an exercise or interactive discussion.

Today
- Introduction
- What causes poor quality
- Lunch
- Assessing Quality processes
- GIS upgrade project examples

Tomorrow
- Metadata
- Designing rules
- Lunch
- Data warehouse and ETL
- Feature maintenance

Overview
- Introduce yourself
- Your goals for this course?
- Build a data quality system
- Avoid the worst traps
- Be able to describe a project scope
  - Budget, timeline, priorities

Sections of course based on With permission from the author ISBN

What is Data Quality?
Data are of high quality:
- "If they are fit for their intended uses in operations, decision making and planning."
- "If they correctly represent the real-world construct to which they refer."

Spatial Accuracy

Statistical Accuracy
- Completeness Score = Relevant / (Relevant + Missing)
- Accuracy Score = (Relevant - Errors) / Relevant
- Overall Score = (Relevant - Errors) / (Relevant + Missing)
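The three ratios on this slide can be worked through with a small sketch. The function name and the sample counts are illustrative assumptions, not figures from the course.

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return the completeness, accuracy and overall scores as fractions (0 to 1).

    relevant: records that are present and should be present
    missing:  records that should be present but are not
    errors:   relevant records that contain an error
    """
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    return {
        "completeness": relevant / (relevant + missing),
        "accuracy": (relevant - errors) / relevant,
        "overall": (relevant - errors) / (relevant + missing),
    }


# Hypothetical dataset: 900 records present, 100 missing, 45 in error.
scores = quality_scores(relevant=900, missing=100, errors=45)
print(scores)
```

Note that the overall score is the product of the other two, so a dataset can only score well overall if it is both complete and accurate.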

Completeness
- LINZ Bulk Data Extract
- metadata\meta.html

Data Profiling
- Find out what is there
- Assess the risks
- Understand data challenges early
- Have an enterprise view of all data

Profile Metrics
- Integrity
- Consistency
- Completeness, Density
- Validity
- Timeliness
- Accessibility
- Uniqueness
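As a rough illustration of three of these metrics (completeness, uniqueness and validity), a single attribute can be profiled as follows. The field name, the allowed code list and the definitions of each ratio are assumptions made for this sketch; a real profile would use the rules agreed for the dataset.

```python
from collections import Counter


def profile_column(rows, field, allowed=None):
    """Profile one attribute across a list of dict records."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    total = len(values)
    counts = Counter(filled)
    profile = {
        "records": total,
        # Share of records where the field is populated.
        "completeness": len(filled) / total if total else 0.0,
        # Share of populated values that are distinct.
        "uniqueness": len(counts) / len(filled) if filled else 0.0,
    }
    if allowed is not None:
        # Share of populated values found in the allowed code list.
        valid = sum(1 for v in filled if v in allowed)
        profile["validity"] = valid / len(filled) if filled else 0.0
    return profile


# Hypothetical parcel records with a land-use code.
parcels = [
    {"id": 1, "land_use": "RES"},
    {"id": 2, "land_use": "COM"},
    {"id": 3, "land_use": ""},
    {"id": 4, "land_use": "res"},
]
result = profile_column(parcels, "land_use", allowed={"RES", "COM", "IND"})
print(result)
```

The lower-case "res" counts as populated and distinct but not valid, which is the kind of detail a profile is meant to surface early.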

Security
- Confidentiality
- Possession
- Integrity
- Authenticity
- Availability
- Utility

Consistency
- Discrepancies between attributes
- Exceptions in a cluster
- Spatial discrepancies
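One way to look for a discrepancy between an attribute and the geometry is to compare a stated area with the area computed from the polygon itself. This is a minimal sketch that assumes planar (projected) coordinates, simple rings without holes, and hypothetical field names `ring` and `stated_area`.

```python
def polygon_area(ring):
    """Planar area of a simple ring of (x, y) tuples, using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def area_discrepancies(features, tolerance=0.05):
    """Return (id, stated, computed) where the stated area differs by more than tolerance."""
    flagged = []
    for f in features:
        computed = polygon_area(f["ring"])
        stated = f["stated_area"]
        if computed == 0 or abs(stated - computed) / computed > tolerance:
            flagged.append((f["id"], stated, computed))
    return flagged


# Hypothetical parcels: B has a stated area that disagrees with its shape.
features = [
    {"id": "A", "ring": [(0, 0), (10, 0), (10, 10), (0, 10)], "stated_area": 100.0},
    {"id": "B", "ring": [(0, 0), (20, 0), (20, 10), (0, 10)], "stated_area": 150.0},
]
flagged = area_discrepancies(features)
print(flagged)
```

The 5% tolerance is arbitrary here; the point is that the check reports how far apart the two values are rather than simply pass or fail.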

A GIS Data Quality System
- Assess: Data Quality Assessment, Data Profiling
- Improve: Data Cleaning
- Prevent: Monitoring Data Integration Interfaces; Ensuring Quality of Data Conversion and Consolidation
- Recognise: Building Data Quality Metadata Warehouse
- Monitor: Recurrent Data Quality Assessment

Course examples
- LINZ coordinate upgrade
- NSCC services upgrade 2008
- Valuation roll structure and matching
- ETL of utilities from SDE to AutoCAD
- Address location issues NAR, DRA
Documents and examples on memory stick

Exercise 1: Nominate your database
Select a representative example dataset for later discussion:
- You may be responsible for it
- Or, you have to integrate it
- Or, you have to load it
- Or, you supply it to others
Morning Tea

Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes

Processes Affecting Data Quality
- Processes bringing data from outside: Initial Data Conversion, System Consolidations, Manual Data Entry, Batch Feeds, Real-Time Interfaces
- Processes causing data decay: Changes not captured, System Upgrades, New Data Uses, Loss of Expertise, Process Automation
- Processes changing data from within: Data processing, Data cleaning, Data purging
All three groups act on the database.

Outside: Initial Data Conversion
- Define data mapping
- Extract, Transform, Load (ETL)
- Drown in Data Problems
- Find Scapegoat

Outside: System Consolidation
- Often from mergers (Auckland?)
- Unplanned, unreasonable timeframes
- Head-on two car wreck
- Square pegs into round holes
- Winner-loser merging (50% wrong)

Outside: Manual Data Entry
- High error rate
- Complex and poor entry forms
- Users find ways around checks
- Forcing non blanks does not work
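When a form forces a non-blank value, users tend to type filler instead. A frequency check for filler values is one way to find them afterwards. The pattern list below is an assumption for illustration and would need tuning, since values such as "0" can be legitimate in some fields.

```python
import re
from collections import Counter

# Hypothetical filler patterns: n/a, none, unknown, tbc, or runs of x . - 0 9.
PLACEHOLDER = re.compile(r"^(n/?a|none|unknown|tbc|x+|\.+|-+|0+|9+)$", re.IGNORECASE)


def suspect_placeholders(values):
    """Count values that look like filler typed to get past a mandatory field."""
    counts = Counter(v.strip() for v in values)
    return {v: n for v, n in counts.items() if PLACEHOLDER.match(v)}


# Hypothetical street address entries.
entries = ["12 High St", "N/A", "n/a", "xxx", "...", "5 Queen St", "999", "N/A"]
suspects = suspect_placeholders(entries)
print(suspects)
```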

Outside: Batch Feeds
- Large volumes mean lots of errors
- Source system subject to changes
- Errors accumulate
- Especially dangerous if triggers activated

Outside: Real-Time Interfaces
- Keep data between databases in synchronisation
- Data arrives in small packets, out of context
- Too fast to validate
- Rejection loses the record, so it is accepted
- Faster or better but not both!

Decay: Changes Not Captured
- Object changes are unnoticed by computers
- Retroactive changes may not be propagated

Decay: System Upgrades
- The data is assumed to comply with the new requirements
- Upgrades are tested against what the data is supposed to be, not what is actually there
- Once upgrades are implemented everything goes haywire

Decay: New Data Uses
- "Fitness to the purpose of use" may not apply
- Acceptable error rates may now be an issue
- Value granularity, map scale
- Data retention policy

Decay: Loss of Expertise
- The meaning of codes may change over time in ways that only "experts" know
- Experts know when data looks wrong
- Retirees rehired to work systems
- Auckland address points were entered on corners and the rest guessed, later used as exact

Decay: Process Automation
- Web 2.0 bots automate form filling
- Transactions are generated without ever being checked by people
- Customers given automated access are more sensitive to errors in their own data

Within: Data Processing
- Changes in the programs
- Programs may not keep up with changes in data collection
- Processing may be done at the wrong time

Special GIS Data Issues
- Coordinate data not usually readable
- Data models: CAD v GIS
- Fuzzy matching is not Boolean (near)
- Atomic objects harder to define
- Features have 2, 3, 4 or 5 dimensions
- Projection systems are not exact
- Topology requires special operators
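The point that spatial matching is "near" rather than Boolean can be sketched with a tolerance-based match that returns a distance instead of true or false. This assumes projected coordinates in metres; the function name and sample points are hypothetical.

```python
import math


def nearest_match(point, candidates, tolerance):
    """Return (candidate_id, distance) for the closest candidate within tolerance, else None."""
    best = None
    for cid, (x, y) in candidates.items():
        d = math.hypot(x - point[0], y - point[1])
        if d <= tolerance and (best is None or d < best[1]):
            best = (cid, d)
    return best


# Hypothetical address points in a projected coordinate system.
candidates = {"A": (100.0, 200.0), "B": (103.0, 204.0)}
loose = nearest_match((103.0, 200.0), candidates, tolerance=5.0)
strict = nearest_match((103.0, 200.0), candidates, tolerance=2.0)
print(loose, strict)
```

Keeping the distance with the match lets a later quality rule decide how near is near enough for a given use.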

Within: Data Purging
- Highly risky for data quality
- Relevant data may be purged
- Erroneous data may fit criteria
- It may not work the next year

Within: Data Cleaning
- En masse processes may add errors
- Cleaning processes may have bugs
- Incomplete information about data

Assessing Data Quality
- Data profiling
- Interview users
- Examine data model
- Data Gazing

Data Gazing
- Count the records
- Just open the sources and scroll
- Sort and look at the ends
- Run some simple frequency reports
- See if the field names make sense
- What is missing that should be there
Lunch
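Several of the data gazing steps (count the records, list the field names, sort and look at the ends, run a frequency report) can be sketched in a few lines. The CSV content and field names are made up for this example.

```python
import csv
import io
from collections import Counter


def gaze(csv_text, field, ends=2):
    """Summarise one field of a CSV source for a first look at the data."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = sorted(row[field] for row in rows)
    return {
        "count": len(rows),
        "fields": list(rows[0].keys()) if rows else [],
        "lowest": values[:ends],    # blanks and odd punctuation sort first
        "highest": values[-ends:],  # filler such as ZZZ sorts last
        "frequencies": Counter(values).most_common(),
    }


# Hypothetical extract with one blank and one filler value.
sample = "id,suburb\n1,Albany\n2,Takapuna\n3,\n4,Albany\n5,ZZZ\n"
report = gaze(sample, "suburb")
print(report)
```

Looking at both ends of the sorted values is a cheap way to spot blanks and filler before any formal rules exist.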

Data Cleaning
- There are always lots of errors
- It is too much to inspect all by hand
- Data experts are rare and too busy
- It does not fix process errors
- You may make it worse

Automated Cleaning
- The only practical method
- Needs sophisticated pattern analysis
- Allow for backtracking
- Data quality rules are interdependent
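"Allow for backtracking" can be read as: every automated change is logged so that it can be undone. The sketch below shows one simple way to do that with an in-memory change log; the function names and the cleaning rule are assumptions for illustration only.

```python
def apply_rule(records, field, fix, log):
    """Apply fix(value) to one field, logging (index, field, old, new) for every change."""
    for i, rec in enumerate(records):
        old = rec[field]
        new = fix(old)
        if new != old:
            log.append((i, field, old, new))
            rec[field] = new


def backtrack(records, log):
    """Undo all logged changes, most recent first, emptying the log."""
    while log:
        i, field, old, _new = log.pop()
        records[i][field] = old


# Hypothetical rule: trim whitespace and title-case suburb names.
records = [{"suburb": " albany "}, {"suburb": "Takapuna"}]
log = []
apply_rule(records, "suburb", lambda s: s.strip().title(), log)
cleaned = [r["suburb"] for r in records]
changes = len(log)
backtrack(records, log)
restored = [r["suburb"] for r in records]
print(cleaned, changes, restored)
```

Undoing in reverse order matters when rules are interdependent, because a later rule may have acted on the output of an earlier one.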

Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata

Metadata
- Data model
- Business rules, relations, state
- Subclasses (lookup tables)
- GIS Metadata (NZGLS or ISO) XML
- Readme.txt
Includes everything known about the data

Data Exchange
- Batch or interactive
- ETL (Extract Transform Load)
- Replication
- Time differences in data

GIS in Business Processes
- Integrates many different sources
- Spatial patterns are revealed
- Display thousands of records simultaneously with direct access
- Location now seen as important

Scorecard
- DQ Score
- Score Summary
- Score Decompositions
- Intermediate Error Reports
- Atomic Level Data Quality Information
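The scorecard levels can be illustrated by rolling atomic error records up into per-rule scores and a single overall score. This is only a sketch of the idea: the scoring definition (share of records with no errors), the rule names and the output keys are assumptions, not the method used in the course.

```python
from collections import defaultdict


def build_scorecard(total_records, atomic_errors):
    """Roll (record_id, rule) error pairs up into rule-level and overall scores."""
    by_rule = defaultdict(set)
    bad_records = set()
    for record_id, rule in atomic_errors:
        by_rule[rule].add(record_id)
        bad_records.add(record_id)
    return {
        # Top level: share of records that break no rule.
        "dq_score": 1 - len(bad_records) / total_records,
        # Decomposition: share of records passing each rule.
        "by_rule": {rule: 1 - len(ids) / total_records for rule, ids in by_rule.items()},
        # Intermediate report: which records fail each rule.
        "error_report": {rule: sorted(ids) for rule, ids in by_rule.items()},
    }


# Hypothetical atomic errors found in a 10-record dataset.
errors = [(1, "missing_suburb"), (2, "missing_suburb"), (2, "bad_postcode")]
card = build_scorecard(10, errors)
print(card)
```

Record 2 breaks two rules but is counted once in the overall score, which is why the top-level figure cannot be derived by simply multiplying the rule scores.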

Case Study
- Outline a GIS data quality system
- Measles Chart
- Prioritise
- Interview
- Build up a scorecard
Afternoon Tea

Assessment Exercise
- Split into pairs
- Interview one person about their dataset
- Collect basic information
- Devise a strategy for a profile
- Rotate pair with another
- Interview other person
- Verbal reports to class

Major Upgrade Projects
- LINZ Coordinate upgrade
- NSCC Coordinate upgrade

References
- Data Quality Assessment, Arkady Maydanchik