GIS Data Quality Producing better data quality through robust business processes Kim Ollivier BrightStar TRAINING.

1 GIS Data Quality
Producing better data quality through robust business processes
Kim Ollivier, BrightStar TRAINING

2 Schedule Day One
Suggested breaks for the following times:
- Start: 9:00
- Session 1 (90 min)
- Morning tea: 10:30 to 10:45
- Session 2 (105 min)
- Lunch: 12:30 to 1:30
- Session 3 (90 min)
- Afternoon tea: 3:00 to 3:15
- Session 4 (105 min)
- Finish: 5:00
Each session will have an exercise or interactive discussion.

3 Today
- Introduction
- What causes poor quality
- Lunch
- Assessing Quality processes
- GIS upgrade project examples

4 Tomorrow
- Metadata
- Designing rules
- Lunch
- Data warehouse and ETL
- Feature maintenance

5 Overview
- Introduce yourself
- Your goals for this course?
- Build a data quality system
- Avoid the worst traps
- Be able to describe a project scope
- Budget, timeline, priorities

6 Sections of course based on With permission from the author ISBN 978-0-09771400-2

7 What is Data Quality?
Data are of high quality:
- “If they are fit for their intended uses in operations, decision making and planning.”
- “If they correctly represent the real-world construct to which they refer.”

8 Spatial Accuracy

10 Statistical Accuracy
- Completeness Score = Relevant / (Relevant + Missing)
- Accuracy Score = (Relevant - Errors) / Relevant
- Overall Score = (Relevant - Errors) / (Relevant + Missing)
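
The three ratios above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the reading of the terms is an assumption, since the slide does not define them (Relevant = records present that should be present, Missing = records that should be present but are not, Errors = relevant records with at least one error).

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return the three slide scores as fractions between 0 and 1.

    The meaning of each count is an assumption, not taken from the slide.
    """
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    if missing < 0 or not 0 <= errors <= relevant:
        raise ValueError("counts out of range")
    expected = relevant + missing  # everything that should be in the dataset
    return {
        "completeness": relevant / expected,
        "accuracy": (relevant - errors) / relevant,
        "overall": (relevant - errors) / expected,
    }


# Hypothetical counts: 900 records held, 100 known to be missing, 45 in error.
scores = quality_scores(relevant=900, missing=100, errors=45)
```

Under these assumptions the overall score is completeness multiplied by accuracy, so a dataset that is 90% complete and 95% accurate scores 85.5% overall.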

11 Completeness
- LINZ Bulk Data Extract
- metadata\meta.html

12 Data Profiling
- Find out what is there
- Assess the risks
- Understand data challenges early
- Have an enterprise view of all data

13 Profile Metrics
- Integrity
- Consistency
- Completeness, Density
- Validity
- Timeliness
- Accessibility
- Uniqueness
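
Three of these metrics (completeness, uniqueness and validity) are easy to measure per field. The sketch below is one possible way to do it in plain Python; the function name, the sample parcel rows and the list of allowed land-use codes are all hypothetical, and the definitions used (share of non-blank values, share of distinct values, share of values passing a test) are assumptions rather than the course's own.

```python
def profile_field(rows, field, is_valid=None):
    """Profile one field of a list of dict records."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    result = {
        "count": len(values),
        # share of records where the field is not blank
        "completeness": len(filled) / len(values) if values else 0.0,
        # share of non-blank values that are distinct
        "uniqueness": len(set(filled)) / len(filled) if filled else 0.0,
    }
    if is_valid is not None:
        passed = sum(1 for v in filled if is_valid(v))
        result["validity"] = passed / len(filled) if filled else 0.0
    return result


# Hypothetical sample records
parcels = [
    {"parcel_id": "P1", "land_use": "RES"},
    {"parcel_id": "P2", "land_use": "COM"},
    {"parcel_id": "P2", "land_use": ""},
    {"parcel_id": "P4", "land_use": "???"},
]
id_profile = profile_field(parcels, "parcel_id")
use_profile = profile_field(
    parcels, "land_use", is_valid=lambda v: v in {"RES", "COM", "IND"}
)
```

Here the duplicated P2 lowers the uniqueness of parcel_id, while the blank and the "???" show up separately as a completeness problem and a validity problem in land_use.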

14 Security
- Confidentiality
- Possession
- Integrity
- Authenticity
- Availability
- Utility

15 Consistency
- Discrepancies between attributes
- Exceptions in a cluster
- Spatial discrepancies
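
One common GIS consistency check compares an attribute with the geometry it describes. As an illustration only, the sketch below flags polygons whose recorded area differs from the area computed from their coordinates. The field names (id, ring, area_m2), the 5% tolerance and the sample features are hypothetical, and the calculation assumes simple polygons in a projected (planar) coordinate system.

```python
def polygon_area(ring):
    """Planar area of a simple polygon by the shoelace formula.

    ring is a list of (x, y) tuples; it may or may not repeat the first point.
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def area_discrepancies(features, tolerance=0.05):
    """Return (id, recorded, computed) where the two areas disagree."""
    flagged = []
    for feature in features:
        computed = polygon_area(feature["ring"])
        recorded = feature["area_m2"]
        if computed == 0 or abs(recorded - computed) / computed > tolerance:
            flagged.append((feature["id"], recorded, computed))
    return flagged


# Hypothetical sample features
features = [
    {"id": "A", "ring": [(0, 0), (10, 0), (10, 10), (0, 10)], "area_m2": 100.0},
    {"id": "B", "ring": [(0, 0), (20, 0), (20, 10), (0, 10)], "area_m2": 150.0},
]
flagged = area_discrepancies(features)
```

A relative tolerance is used rather than an exact comparison because legal or surveyed areas rarely match computed areas to the last digit.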

18 A GIS Data Quality System
Stages: Assess, Improve, Prevent, Recognise, Monitor
- Data Quality Assessment
- Data Profiling
- Data Cleaning
- Monitoring Data Integration Interfaces
- Ensuring Quality of Data Conversion and Consolidation
- Building Data Quality Metadata Warehouse
- Recurrent Data Quality Assessment

19 Course examples
- LINZ coordinate upgrade 1998-2003
- NSCC services upgrade 2008
- Valuation roll structure and matching
- ETL of utilities from SDE to AutoCAD
- Address location issues NAR, DRA
Documents and examples on memory stick

20 Exercise 1: Nominate your database
Select a representative example dataset for later discussion
- You may be responsible for it
- Or, you have to integrate it
- Or, you have to load it
- Or, you supply it to others
Morning Tea

21 Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes

22 Processes Affecting Data Quality
Processes bringing data from outside:
- Initial Data Conversion
- System Consolidations
- Manual Data Entry
- Batch Feeds
- Real-Time Interfaces
Processes causing data decay:
- Changes not captured
- System Upgrades
- New Data Uses
- Loss of Expertise
- Process Automation
Processes changing data from within:
- Data processing
- Data cleaning
- Data purging
Centre of diagram: Database

23 Outside: Initial Data Conversion
- Define data mapping
- Extract, Transform, Load (ETL)
- Drown in Data Problems
- Find Scapegoat
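
To make the first two steps concrete, here is a minimal sketch of a data mapping driving the transform step, with failed records set aside instead of silently dropped. Everything in it is hypothetical: the source and target field names, the converters and the sample rows are invented for illustration and do not come from the course material.

```python
# Data mapping: source field -> (target field, converter). All names are hypothetical.
FIELD_MAP = {
    "PAR_ID": ("parcel_id", str.strip),
    "AREA": ("area_m2", float),
    "LU": ("land_use", str.upper),
}


def transform(source_rows, field_map=FIELD_MAP):
    """Apply the mapping; return (loaded, rejected) so rejects can be reviewed."""
    loaded, rejected = [], []
    for line_no, row in enumerate(source_rows, start=1):
        try:
            target = {
                new: convert(row[old]) for old, (new, convert) in field_map.items()
            }
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the source row and the reason, not just a count.
            rejected.append((line_no, row, repr(exc)))
        else:
            loaded.append(target)
    return loaded, rejected


source = [
    {"PAR_ID": " P1 ", "AREA": "812.5", "LU": "res"},
    {"PAR_ID": "P2", "AREA": "n/a", "LU": "com"},  # area is not a number
    {"PAR_ID": "P3", "LU": "ind"},                  # area field is absent
]
loaded, rejected = transform(source)
```

Keeping each rejected row together with its reason gives an early count of the data problems the slide warns about.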

24 Outside: System Consolidation
- Often from mergers (Auckland?)
- Unplanned, unreasonable timeframes
- Head-on two car wreck
- Square pegs into round holes
- Winner-loser merging (50% wrong)

25 Outside: Manual Data Entry
- High error rate
- Complex and poor entry forms
- Users find ways around checks
- Forcing non blanks does not work
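
The last point can be illustrated with a small check for values that satisfy a "not blank" rule without carrying any information. This is a sketch under assumptions: the placeholder list and the repeated-character rule are hypothetical examples, and any real list would need tuning to the dataset.

```python
# Hypothetical list of values people type to get past a mandatory field.
PLACEHOLDERS = {"n/a", "na", "none", "unknown", "tbc", "tba", "test", "x", "-", "."}


def looks_like_placeholder(value: str) -> bool:
    """True if a non-blank entry is probably a dummy value."""
    text = value.strip().lower()
    if not text or text in PLACEHOLDERS:
        return True
    # One character repeated three or more times, e.g. "xxxx", "0000", "....."
    return len(text) >= 3 and len(set(text)) == 1


def flag_placeholders(rows, field):
    """Return the indexes of rows whose field looks like a dummy entry."""
    return [i for i, row in enumerate(rows) if looks_like_placeholder(row[field])]


# Hypothetical form entries
entries = [
    {"address": "12 Queen Street"},
    {"address": "N/A"},
    {"address": "xxxx"},
    {"address": "   "},
]
suspect = flag_placeholders(entries, "address")
```

A check like this belongs in profiling rather than on the entry form, because a stricter form tends to produce more inventive dummy values.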

26 Outside: Batch Feeds
- Large volumes mean lots of errors
- Source system subject to changes
- Errors accumulate
- Especially dangerous if triggers activated

27 Outside: Real-Time Interfaces
- Data between databases in synchronisation
- Data in small packets out of context
- Too fast to validate
- Rejection loses record, so accepted
- Faster or better but not both!

28 Decay: Changes Not Captured
- Object changes are unnoticed by computers
- Retroactive changes may not be propagated

29 Decay: System Upgrades
- The data is assumed to comply with the new requirements
- Upgrades are tested against what the data is supposed to be, not what is actually there
- Once upgrades are implemented everything goes haywire

30 Decay: New Data Uses
- “Fitness to the purpose of use” may not apply
- Acceptable error rates may now be an issue
- Value granularity, map scale
- Data retention policy

31 Decay: Loss of Expertise
- Meaning of codes may change over time in ways that only “experts” know
- Experts know when data looks wrong
- Retirees rehired to work systems
- Auckland address points were entered on corners and the rest guessed, later used as exact.

32 Decay: Process Automation
- Web 2.0 bots automate form filling
- Transactions are generated without ever being checked by people
- Customers given automated access are more sensitive to errors in their own data

33 Within: Data Processing
- Changes in the programs
- Programs may not keep up with changes in data collection
- Processing may be done at the wrong time

34 Special GIS Data Issues
- Coordinate data not usually readable
- Data models CAD v GIS
- Fuzzy matching is not Boolean (near)
- Atomic objects harder to define
- Features have 2, 3, 4 or 5 dimensions
- Projection systems are not exact
- Topology requires special operators
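
The point about fuzzy matching can be shown with a nearest-point match that returns a distance instead of a yes/no answer. This is a minimal sketch, assuming projected coordinates in metres; the function name, the candidate points and the tolerance are hypothetical.

```python
import math


def nearest_match(point, candidates, tolerance):
    """Find the closest candidate to point.

    Returns (candidate_id, distance) when the closest candidate lies within
    the tolerance, otherwise (None, distance) so the caller can still see
    how near the miss was.
    """
    best_id, best_dist = None, math.inf
    for candidate_id, (cx, cy) in candidates.items():
        dist = math.hypot(point[0] - cx, point[1] - cy)
        if dist < best_dist:
            best_id, best_dist = candidate_id, dist
    if best_dist <= tolerance:
        return best_id, best_dist
    return None, best_dist


# Hypothetical address points
candidates = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
hit = nearest_match((0.3, 0.4), candidates, tolerance=1.0)
miss = nearest_match((5.0, 5.0), candidates, tolerance=1.0)
```

Returning the distance with every result lets the tolerance be reviewed later, which an exact equality test on coordinates cannot offer.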

35 Within: Data Purging
- Highly risky for data quality
- Relevant data may be purged
- Erroneous data may fit criteria
- It may not work the next year

36 Within: Data Cleaning
- En masse processes may add errors
- Cleaning processes may have bugs
- Incomplete information about data

37 Assessing Data Quality
- Data profiling
- Interview users
- Examine data model
- Data Gazing

38 Data Gazing
- Count the records
- Just open the sources and scroll
- Sort and look at the ends
- Run some simple frequency reports
- See if the field names make sense
- What is missing that should be there
Lunch
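
Several of the data gazing steps (count, sort and look at the ends, simple frequency report) take only a few lines of Python. The sketch below is one way to do it for a single text field; the function name and the sample suburb values are hypothetical.

```python
from collections import Counter


def gaze(rows, field, ends=3):
    """Count records, look at both ends of the sorted values, list top values."""
    values = [row.get(field) for row in rows]
    present = sorted(v for v in values if v not in (None, ""))
    return {
        "records": len(values),
        "blank": len(values) - len(present),
        "lowest": present[:ends],
        "highest": present[-ends:],
        "most_common": Counter(present).most_common(ends),
    }


# Hypothetical sample values
names = ["Albany", "Takapuna", "", "albany", "Takapuna", "ZZZ", "Devonport"]
rows = [{"suburb": name} for name in names]
report = gaze(rows, "suburb", ends=2)
```

In this sample the sorted ends expose a dummy value ("ZZZ") and a case variant ("albany"), which is the kind of thing scrolling through the middle of a table tends to miss.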

39 Data Cleaning
- There are always lots of errors
- It is too much to inspect all by hand
- Data experts are rare and too busy
- It does not fix process errors
- You may make it worse

40 Automated Cleaning
- The only practical method
- Needs sophisticated pattern analysis
- Allow for backtracking
- Data quality rules are interdependent
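
"Allow for backtracking" can be as simple as logging every change a cleaning rule makes so that the rule can be reversed. The following is a minimal sketch of that idea, not a description of any particular cleaning tool; the function names, the rule name and the sample road names are hypothetical.

```python
def apply_rule(rows, field, fix, rule_name, log):
    """Apply fix() to one field of every row, logging each change made."""
    for index, row in enumerate(rows):
        old = row[field]
        new = fix(old)
        if new != old:
            log.append((rule_name, index, field, old, new))
            row[field] = new


def undo_rule(rows, rule_name, log):
    """Restore the old values written by one rule, newest change first."""
    for name, index, field, old, _new in reversed(log):
        if name == rule_name:
            rows[index][field] = old
    log[:] = [entry for entry in log if entry[0] != rule_name]


# Hypothetical sample data and rule
roads = [{"road": "queen st"}, {"road": "High Street"}]
change_log = []
apply_rule(roads, "road", str.title, "title_case", change_log)
cleaned = [r["road"] for r in roads]

undo_rule(roads, "title_case", change_log)
restored = [r["road"] for r in roads]
```

Because rules are interdependent, this sketch is only safe for undoing the most recently applied rule; reversing an earlier rule would also require re-running every rule applied after it.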

41 Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata

42 Metadata
- Data model
- Business rules, relations, state
- Subclasses (lookup tables)
- GIS Metadata (NZGLS or ISO) XML
- Readme.txt
Includes everything known about the data

43 Data Exchange
- Batch or interactive
- ETL (Extract Transform Load)
- Replication
- Time differences in data

44 GIS in Business Processes
- Integrates many different sources
- Spatial patterns are revealed
- Display thousands of records simultaneously with direct access
- Location now seen as important

45 Scorecard
- DQ Score
- Score Summary
- Score Decompositions
- Intermediate Error Reports
- Atomic Level Data Quality Information
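
One way to read these layers is as a roll-up: atomic error records at the bottom, an overall score at the top. The sketch below illustrates that reading in Python. The structure is an assumption, not the course's scorecard design, and the function name, rule names and record ids are hypothetical.

```python
from collections import defaultdict


def build_scorecard(record_ids, errors):
    """Roll atomic (record_id, rule_name) errors up into scores."""
    total = len(record_ids)
    if total == 0:
        raise ValueError("no records to score")
    by_rule = defaultdict(set)
    bad_records = set()
    for record_id, rule in errors:
        by_rule[rule].add(record_id)
        bad_records.add(record_id)
    return {
        # share of records with no error under any rule
        "dq_score": (total - len(bad_records)) / total,
        # decomposition: share of records passing each rule
        "by_rule": {
            rule: (total - len(ids)) / total for rule, ids in sorted(by_rule.items())
        },
        # intermediate report: which records need attention
        "error_report": sorted(bad_records),
    }


# Hypothetical atomic-level errors for five parcels
ids = ["P1", "P2", "P3", "P4", "P5"]
errors = [("P2", "blank_address"), ("P2", "bad_area"), ("P4", "bad_area")]
card = build_scorecard(ids, errors)
```

A record that fails two rules counts once in the overall score but once per rule in the decomposition, so the per-rule scores cannot simply be multiplied to get the total.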

46 Case Study
- Outline a GIS data quality system
- Measles Chart
- Prioritise
- Interview
- Build up a scorecard
Afternoon Tea

47 Assessment Exercise
- Split into pairs
- Interview one person about their dataset
- Collect basic information
- Devise a strategy for a profile
- Rotate pair with another
- Interview other person
- Verbal reports to class

48 Major Upgrade Projects
- LINZ Coordinate upgrade
- NSCC Coordinate upgrade

49 References
- Data Quality Assessment, Arkady Maydanchik

