Saskia Ossen and Piet Daas: Introduction in the Data hyperdimension


Purpose of the module
– Introduction to the Data hyperdimension
– Introduction of indicators for data evaluation (implemented in an R software package)
  ‐ Developed within the European BLUE ETS project
  ‐ Theory and practical examples
  ‐ Group exercise in which groups determine whether a source should be used, based on the results for the Data hyperdimension
– Introduction of the Quality Report Card

Data: quality of the input
– Input quality of administrative data
  ‐ After evaluation of Source and Metadata hyperdimension
– Data hyperdimension studies
  ‐ Quality of the facts (values) in the source
  ‐ Data are part of every delivery!
  ‐ Time needed for evaluation is a serious issue
  ‐ Evaluate every delivery thoroughly?
  ‐ Evaluation may differ depending on the use intended (output)
  ‐ Relation with process (availability and quality of other data sources)

Essential pre-requisites and considerations
– Evaluation of the data quality of input sources needs to be efficient
– Focus on essential quality components
  ‐ What are the essential dimensions of input data quality?
  ‐ What are the essential indicators for those dimensions? For objects (units/events) and variables
– Purely input or also with output in mind?
  ‐ Data Source Quality (admin. data quality per se): input oriented
  ‐ Output Quality (guesstimate of expected effect on output)

Essential dimensions of input data quality
– Five essential quality dimensions identified for input data of administrative sources:
  1. Technical checks: technical accessibility, IT part
  2. Accuracy: correctness, validity, error-freeness
  3. Completeness: coverage of units, missing variable data
  4. Time-related dimension: timeliness, punctuality, period covered
  5. Integrability: ease of integration and consistency of data between sources

Technical checks: Theory
Technical checks dimension (indicators and descriptions):
1. Technical checks: Technical usability of the file and data in the file
1.1 Readability: Accessibility of the file and data in the file
1.2 File declaration compliance: Compliance of the data in the file to the metadata agreements
1.3 Convertibility: Conversion of the file to the NSI-standard format

Technical checks: Examples
– Very important for new sources, becomes somewhat less essential later on
  ‐ Corrupt files
  ‐ Encoded files for which the decoding password is missing
  ‐ Files in which the data do not comply with the metadata description
  ‐ Files with errors during/after conversion

Technical checks: File declaration compliance – Simple frequency distributions are very helpful

Technical checks: File declaration compliance
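
A simple frequency distribution check of this kind can be sketched as follows. This is only an illustration in Python, not the R package mentioned in the slides; the column contents, the declared code list and the function names are hypothetical.

```python
from collections import Counter

def frequency_distribution(values):
    """Return {value: count}, ordered by descending count."""
    return dict(Counter(values).most_common())

def undeclared_values(values, declared_codes):
    """Return the values (with counts) that the metadata does not declare."""
    freq = Counter(values)
    return {v: n for v, n in freq.items() if v not in declared_codes}

# Hypothetical delivery: a 'sex' column that the metadata declares as M or F.
sex_column = ["M", "F", "F", "M", "m", "", "F", "9"]
declared = {"M", "F"}

print(frequency_distribution(sex_column))
print(undeclared_values(sex_column, declared))
```

Any value that shows up in the distribution but not in the declared code list (here a lower-case code, an empty string and a stray digit) points to a possible compliance problem.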

Accuracy: Theory
Accuracy dimension (indicators and descriptions):
2. Accuracy: The extent to which data are correct, reliable, and certified
Objects
2.1 Authenticity: Legitimacy of objects
2.2 Inconsistent objects: Extent of erroneous objects in source
2.3 Dubious objects: Presence of untrustworthy objects
Variables
2.4 Measurement errors: Deviation of actual data value from ideal error-free measurements
2.5 Inconsistent values: Extent of inconsistent combinations of variable values
2.6 Dubious values: Presence of implausible values or combinations of values for variables

Accuracy example: Authenticity (1)
% of objects with a syntactically incorrect identification key
– Objects with incorrect identification numbers (IDs)
– In the Netherlands all people have a Citizen Service Number
  ‐ 9-digit number (e.g )
  ‐ Number has a feasibility check; the last digit is a check digit
  ‐ Rule used: sum = 9*n_1 + 8*n_2 + 7*n_3 + 6*n_4 + 5*n_5 + 4*n_6 + 3*n_7 + 2*n_8 - 1*n_9; the remainder of sum/11 should be 0
– In the Social Statistical Database* it was found (in 2000) that:
  ‐ 0.3% of all persons in the admin. data sources used had an invalid Citizen Service Number
*Set of integrated admin. data sources and surveys (then ~100 million admin. records)
Arts et al. (2000) Netherlands Official Statistics 15, pp
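
The check-digit rule on this slide can be written out as a short Python sketch. It applies only the weighted-sum rule quoted above; the function names are hypothetical and the sample numbers are test values chosen for the arithmetic, not real identifiers. Any further validity rules a real register may apply are not covered.

```python
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)

def passes_eleven_check(number: str) -> bool:
    """Apply the rule from the slide: the weighted digit sum must be divisible by 11."""
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    total = sum(w * int(d) for w, d in zip(WEIGHTS, number))
    return total % 11 == 0

def pct_invalid(numbers) -> float:
    """Percentage of identification keys that fail the syntactic check."""
    numbers = list(numbers)
    invalid = sum(1 for n in numbers if not passes_eleven_check(n))
    return 100 * invalid / len(numbers)

# Test values only: the first has weighted sum 66 (divisible by 11), the second 147.
print(passes_eleven_check("111222333"))
print(passes_eleven_check("123456789"))
print(pct_invalid(["111222333", "123456789"]))
```

Note that the rule as quoted also accepts a string of nine zeros, so a production check would probably need extra conditions that the slide does not mention.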

Accuracy example: Authenticity (2)
% of objects for which the source contains information contradictory to information in a reference list for those objects
– Studies reveal significant differences between findings for ‘educational attainment’ obtained from a survey and from linked administrative data sources.
More in: Bakker (2011) Estimating the Validity of Administrative Variables. ISI-paper session IPS030, Dublin, Ireland.

Accuracy example: Authenticity (3)
% of objects for which the source contains information contradictory to information in a reference list for those objects

Accuracy example: Inconsistent objects
– Rule: a person is part of exactly one household
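
A rule like this can be checked mechanically. The sketch below, in Python, assumes the source can be reduced to (person, household) pairs plus a list of all persons; both the data layout and the identifiers are hypothetical.

```python
from collections import defaultdict

def persons_violating_household_rule(memberships, population):
    """Return {person: number of households} for persons not in exactly one household.

    memberships: iterable of (person_id, household_id) pairs
    population:  iterable of all person ids that should be covered
    """
    households = defaultdict(set)
    for person, household in memberships:
        households[person].add(household)
    result = {}
    for person in population:
        n = len(households.get(person, ()))
        if n != 1:
            result[person] = n
    return result

# Hypothetical data: p2 is registered in two households, p4 in none.
memberships = [("p1", "h1"), ("p2", "h1"), ("p2", "h2"), ("p3", "h3")]
population = ["p1", "p2", "p3", "p4"]

print(persons_violating_household_rule(memberships, population))
```

Dividing the number of flagged persons by the population size would give the share of inconsistent objects.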

Accuracy example: Dubious values
– Cross tabulation of the variable “Current activity status” versus age group
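
Such a cross tabulation can be sketched in a few lines of Python. The category labels and the list of implausible combinations below are made up for illustration; which cells count as dubious is a judgment the analyst has to supply.

```python
from collections import Counter

def cross_tabulate(records, row_key, col_key):
    """Count records per (row value, column value) combination."""
    return Counter((r[row_key], r[col_key]) for r in records)

def flag_cells(table, implausible_cells):
    """Return the non-empty cells that were declared implausible."""
    return {cell: table[cell] for cell in implausible_cells if table[cell] > 0}

# Hypothetical records and hypothetical plausibility rules.
records = [
    {"activity": "employed", "age_group": "25-64"},
    {"activity": "retired", "age_group": "65+"},
    {"activity": "retired", "age_group": "0-14"},
    {"activity": "pupil", "age_group": "0-14"},
    {"activity": "employed", "age_group": "25-64"},
]
implausible = {("retired", "0-14"), ("employed", "0-14")}

table = cross_tabulate(records, "activity", "age_group")
print(flag_cells(table, implausible))
```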

Completeness: Theory
Completeness dimension (indicators and descriptions):
3. Completeness: Degree to which a data source includes data describing the corresponding set of real-world objects and variables
Objects
3.1 Undercoverage: Absence of target objects (missing objects) in the source
3.2 Overcoverage: Presence of non-target objects in the source
3.3 Selectivity: Statistical coverage and representativity of objects
3.4 Redundancy: Presence of multiple registrations of objects
Variables
3.5 Missing values: Absent values for (key) variables
3.6 Imputed values: Presence of values resulting from imputation actions by data source holder
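
The object-level indicators 3.1, 3.2 and 3.4 can be illustrated with set operations. This is a minimal Python sketch that assumes a reference list of target objects is available; the choice of denominators is an assumption made here, not something the slide specifies, and both lists are assumed to be non-empty.

```python
from collections import Counter

def coverage_indicators(source_ids, target_ids):
    """Percentages for undercoverage, overcoverage and redundancy (assumed definitions)."""
    source_counts = Counter(source_ids)
    source = set(source_counts)
    target = set(target_ids)
    duplicated = sum(1 for n in source_counts.values() if n > 1)
    return {
        # target objects that are absent from the source
        "undercoverage_pct": 100 * len(target - source) / len(target),
        # source objects that do not belong to the target population
        "overcoverage_pct": 100 * len(source - target) / len(source),
        # distinct source objects that are registered more than once
        "redundancy_pct": 100 * duplicated / len(source),
    }

# Hypothetical identifiers: 'b' is registered twice, 'x' is not a target object.
print(coverage_indicators(["a", "b", "b", "x"], ["a", "b", "c", "d"]))
```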

Completeness example: Selectivity (1)

Completeness example: Selectivity (2)
The education register has age-related undercoverage of educational attainment (56.3% is missing)
Explanation:
1) Children under 15 have a known level of education
2) Level of education of young adults is usually stored in recently created admin. data sources
3) Information on ‘middle-aged’ people is obtained from the LFS survey (small compared to admin. data info)
4) Information on ‘elderly’ people (≥65 years) is almost completely missing (not surveyed and hardly registered)

Pre-evaluation and input quality of administrative data sources (Part 2) Completeness example: Selectivity (3)

Time-related: Theory
Time-related dimension (indicators and descriptions):
4. Time-related dimension: Indicators that are time and/or stability related
4.1 Timeliness: Lapse of time between the end of the reference period and the moment of receipt of the data source
4.2 Punctuality: Possible time lag between the actual delivery date of the source and the date it should have been delivered
4.3 Overall time lag: Overall time difference between the end of the reference period and the moment it is concluded that the source can definitely be used
4.4 Delay: Extent of delays in registration
Objects
4.5 Dynamics of objects: Changes in the population of objects (new and dead objects) over time
Variables
4.6 Stability of variables: Changes of variables or values over time
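
Indicators 4.1 and 4.2 are plain date differences. The Python sketch below expresses them in days, which is an assumption; the dates are invented for the example.

```python
from datetime import date

def timeliness_days(reference_period_end: date, received: date) -> int:
    """Days between the end of the reference period and receipt of the source."""
    return (received - reference_period_end).days

def punctuality_days(agreed_delivery: date, actual_delivery: date) -> int:
    """Days between the agreed and the actual delivery date (positive = late)."""
    return (actual_delivery - agreed_delivery).days

# Hypothetical delivery of a source covering the year 2011.
reference_period_end = date(2011, 12, 31)
agreed_delivery = date(2012, 2, 15)
received = date(2012, 3, 1)

print(timeliness_days(reference_period_end, received))   # 61
print(punctuality_days(agreed_delivery, received))       # 15
```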

Time-related example: Delay
– Events recorded some time after they have occurred
  ‐ Events are missing (or erroneously recorded)
  ‐ Particularly important for sources used immediately
– Examples:
  ‐ Marriages contracted in immigrants’ country of origin are sometimes recorded two or three years after the event (Bakker et al. AIOS-paper 2008)
  ‐ Part of VAT-data is reported later than is needed for monthly estimates (Vlag, ISI-paper 2011)
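
Where a source records both the event date and the registration date, the extent of delay can be summarised as sketched below in Python. The field names, the cutoff and the records are hypothetical.

```python
from datetime import date

def registration_delays(records):
    """Days between occurrence and registration for each record."""
    return [(r["registered"] - r["occurred"]).days for r in records]

def pct_registered_late(records, cutoff_days):
    """Percentage of events registered more than cutoff_days after they occurred."""
    delays = registration_delays(records)
    return 100 * sum(1 for d in delays if d > cutoff_days) / len(delays)

# Hypothetical events; the second one is registered two years after the fact.
events = [
    {"occurred": date(2010, 1, 10), "registered": date(2010, 1, 12)},
    {"occurred": date(2008, 6, 1), "registered": date(2010, 6, 1)},
    {"occurred": date(2010, 3, 1), "registered": date(2010, 3, 31)},
]

print(registration_delays(events))
print(pct_registered_late(events, cutoff_days=30))
```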

Time-related example: Stability of variables (1)
– Type of comparison used in the Dutch Short-term Statistics

Time-related example: Stability of variables (2)
– Time series for a single company

Integrability: Theory
Integrability dimension (indicators and descriptions):
5. Integrability: Extent to which the data source is capable of undergoing integration or of being integrated
Objects
5.1 Comparability of objects: Similarity of objects in source (at the proper level of detail) with the objects used by NSI
5.2 Alignment of objects: Linkability (alignability) of objects in source with those of NSI
Variables
5.3 Linking variable: Usefulness of linking variables (keys) in source
5.4 Comparability of variables: Proximity (closeness) of variables
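
One simple way to put a number on indicator 5.2 is the share of source objects whose key is found in the NSI's own register. The Python sketch below assumes exact key matching and treats missing keys as unlinked; both are simplifying assumptions, and the keys are invented.

```python
def linkability_pct(source_keys, register_keys):
    """Percentage of source objects whose linking key occurs in the register."""
    register = set(register_keys)
    source_keys = list(source_keys)
    linked = sum(1 for key in source_keys if key is not None and key in register)
    return 100 * linked / len(source_keys)

# Hypothetical keys: one source record has no key, one key is unknown to the register.
source_keys = ["A1", "A2", None, "Z9"]
register_keys = ["A1", "A2", "B7"]

print(linkability_pct(source_keys, register_keys))  # 50.0
```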

Integrability example: Alignment of objects
[Figure: panels “export” and “import”; axes VAT-turnover (€) and ICP-turnover (€)]
Finding:
‐ Differences between two admin. data sources (ICP and VAT), both used for international trade statistics
‐ Export aligns well, but import is much more problematic!
Explanation:
‐ ICP import units are difficult to identify and can therefore not always be linked correctly
‐ ICP export data can be integrated well

Quality Report Card: Step 1 Indicator level – Step 1: Determine one score per indicator

Quality Report Card: Step 2 Dimensional level – Step 2: Determine one score per dimension

Quality Report Card: Step 3 General level – Step 3: Determine a general score
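
The three steps can be pictured as a roll-up from indicators to dimensions to one general score. The transcript does not say how scores are combined, so the Python sketch below is purely illustrative: it assumes a numeric scale (1 = poor, 3 = good) and a "weakest link" rule in which the lowest score wins. The scores themselves are invented.

```python
def dimension_scores(indicator_scores):
    """Step 2 (assumed rule): a dimension scores as low as its worst indicator."""
    return {dim: min(scores.values()) for dim, scores in indicator_scores.items()}

def general_score(dim_scores):
    """Step 3 (assumed rule): the general score is the worst dimension score."""
    return min(dim_scores.values())

# Step 1: one hypothetical score per indicator, on an assumed 1-3 scale.
indicator_scores = {
    "Technical checks": {"Readability": 3, "File declaration compliance": 3, "Convertibility": 2},
    "Accuracy": {"Authenticity": 3, "Dubious values": 1},
    "Completeness": {"Undercoverage": 2, "Missing values": 2},
}

per_dimension = dimension_scores(indicator_scores)
print(per_dimension)
print(general_score(per_dimension))
```

Other roll-up rules (averages, weights that depend on the intended use) are equally conceivable; the point is only that each level condenses the one below it.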

Questions? Any questions or comments?

Exercise – Let’s try to interpret some data quality findings! – To ease the exercise, every indicator has a single score

Group exercise
– Participants will be split into groups and each group is provided with:
  ‐ The Source, Metadata and Data results for the administrative data source discussed in the previous exercise
  ‐ An intended use
– Each group will be asked to discuss:
  ‐ Whether the data in the source could be used for the intended purpose. If yes, why is everything OK? If not, what is the problem that prevents its use and how can it be solved?