Presentation is loading. Please wait.

Presentation is loading. Please wait.

Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension.

Similar presentations


Presentation on theme: "Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension."— Presentation transcript:

1 Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension

2 Purpose of the module -Introduction in Data hyperdimension -Introduction of indicators for data evaluation (implemented in R software package) Developed within European BLUE ETS project Theory and practical examples Group exercise in which groups determine whether a source should be used based on the results for the data hyperdimension. -Introduction of Quality Report Card

3 Data: quality of the input – Input quality of administrative data After evaluation of Source and Metadata hyperdimension – Data hyperdimension studies Quality of the facts (values) in the source Data are part of every delivery! Time needed for evaluation is a serious issue Evaluate every delivery thoroughly? Evaluation may differ depending on the use intended (output) Relation with process (availability and quality of other data sources)

4 Essential pre-requisites and considerations – Evaluation of the data quality of input sources needs to be efficient – Focus on essential quality components What are the essential dimensions of input data quality? What are the essential indicators for those dimensions? For objects (units/events) and variables – Purely input or also with output in mind? Data Source Quality (admin. data quality per se) Input oriented Output Quality (guestimate of expected effect on output)

5 Essential dimensions of input data quality – Five essential quality dimensions identified for input data of administrative sources: 1.Technical checks Technical accessibility, IT-part 2.Accuracy Correctness, validity, error-freeness 3.Completeness Coverage of units, missing variable data 4.Time-related dimension Timeliness, punctuality, period covered 5.Integrability Easiness of integration and consistency of data between sources

6 Technical checks: Theory IndicatorsDescription 1. Technical checksTechnical usability of the file and data in the file 1.1 ReadabilityAccessability of the file and data in the file 1.2 File declaration Compliance of the data in the file to the metadata complianceagreements 1.3 Convertability Conversion of the file to the NSI-standard format Technical checks dimension

7 Technical checks: Examples – Very important for new sources, becomes somewhat less essential later on ‐ Corrupt files ‐ Encoded files of which decoding password is missing ‐ Files of which the data is not compliant to the metadata description ‐ Files with errors during/after conversion

8 Technical checks: File declaration compliance – Simple frequency distributions are very helpful

9 Technical checks: File declaration compliance

10 Accuracy: Theory Indicators Description 2. Accuracy The extent to which data are correct, reliable, and certified Objects 2.1 Authenticity Legitimacy of objects 2.2 Inconsistent objects Extent of erroneous objects in source 2.3 Dubious objects Presence of untrustworthy objects Variables 2.4 Measurement errors Deviation of actual data value from ideal error-free measurements 2.5 Inconsistent values Extent of inconsistent combinations of variable values 2.6 Dubious values Presence of implausible values or combinations of values for variables Accuracy dimension

11 – Objects with incorrect Identification numbers (ID’s) – In the Netherlands all people have a Citizen’s Service Number ‐ 9-digit number (e.g. 123456782) ‐ Number has a feasibility check, last digit is a checking digit ‐ Rule used: sum(9*n 1 + 8*n 2 + 7*n 3 + 6*n 4 + 5*n 5 + 4*n 6 + 3*n 7 + 2*n 8 – 1*n 9 ) Remainder of sum/11 should be 0 – In the Social Statistical Database* it was found (in 2000) that: ‐ 0,3% of all persons in admin. data sources used had an invalid Citizen Service Number *set of integrated admin. data sources and surveys (then ~100 million admin records) Arts et al. (2000) Netherlands Official Statistics 15, pp. 16-22. Accuracy example: Authenticity (1) % of objects with a syntactically incorrect identification key

12 Accuracy example: Authenticity (2) % of objects for which the source contains information contradictive to information in a reference list for those objects – Studies reveal significant differences between findings for ‘educational attainment’ obtained from a survey and from linked administrative data sources. More in: Bakker (2011) Estimating the Validity of Administrative Variables. ISI-paper session IPS030, Dublin, Ireland.

13 Accuracy example: Authenticity (3) % of objects for which the source contains information contradictive to information in a reference list for those objects

14 Accuracy example: Inconsistent objects Rule: a person is part of exactly one household

15 Accuracy example: Dubious values Cross tabulation of the variable “Current activity status” versus age group

16 Completeness: Theory Indicators Description 3. Completeness Degree to which a data source includes data describing the corresponding set of real-world objects and variables Objects 3.1 UndercoverageAbsence of target objects (missing objects) in the source 3.2 OvercoveragePresence of non-target objects in the source 3.3 SelectivityStatistical coverage and representativity of objects 3.4 RedundancyPresence of multiple registrations of objects Variables 3.5 Missing valuesAbsent values for (key) variables 3.6 Imputed valuesPresence of values resulting from imputation actions by data source holder Completeness dimension

17 Completeness example: Selectivity (1)

18 Completeness example: Selectivity (2) The education register has age- related undercoverage of educational attainment (56,3% is missing) Explanation: 1) Children <15 age have a known level of education 2) Level of education of young adults is usually stored in recently created admin. data sources 3) Information from ‘middle-aged’ people is obtained from LFS-survey (small compared to admin. data info) 4) Information of ‘elderly’ people (≥65 year) almost completely missing (not surveyed and hardly registered)

19 Pre-evaluation and input quality of administrative data sources (Part 2) Completeness example: Selectivity (3)

20 Time related: Theory Indicators Description 4. Time-related dimension Indicators that are time and/or stability related 4.1 Timeliness Lapse of time between the end of the reference period and the moment of receipt of the data source 4.2 PunctualityPossible time lag between the actual delivery date of the source and the date it should have been delivered 4.3 Overall time lagOverall time difference between the end of the reference period and the moment it is concluded that it can definitely be used 4.4 DelayExtent of delays in registration Objects 4.5 Dynamics of objectsChanges in the population of objects (new and dead objects) over time Variables 4.6 Stability of variablesChanges of variables or values over time Time-related dimension

21 Time-related example: Delay – Events recorded some time after they have occurred Events are missing (or erroneously recorded) Particularly important for sources used immediately – Examples: Marriages contracted in immigrants’ country of origin are sometimes recorded two or three years after the event (Bakker et al. AIOS-paper 2008) Part of VAT-data is reported later than is needed for monthly estimates (Vlag, ISI-paper 2011)

22 Time-related example: Stability of variables (1) Type of comparison used in the Dutch Short term Statistics

23 Time-series for a single company Time-related example: Stability of variables (2)

24 Integrability: Theory IndicatorsDescription 5. Integrability Extent to which the data source is capable of undergoing integration or of being integrated. Objects 5.1 Comparability of objectsSimilarity of objects in source -at the proper level of detail- with the objects used by NSI 5.2 Alignment of objectsLinking-ability (align-ability) of objects in source with those of NSI Variables 5.3 Linking variable Usefulness of linking variables (keys) in source 5.4 Comparability of variablesProximity (closeness) of variables Integrability dimension

25 Integrability example: Alignment of objects export import VAT-turnover (€) ICP-turnover (€) Finding: -Differences between two admin. Data sources (ICP and VAT) both used for International trade statistics -Export aligns good but import is much more problematic! Explanation: -ICP import units are difficult to identify and can therefore not always by linked correctly -ICP export data can be integrated well. VAT-turnover (€) ICP-turnover (€)

26 Quality Report Card: Step 1 Indicator level – Step 1: Determine one score per indicator

27 Quality Report Card: Step 2 Dimensional level – Step 2: Determine one score per dimension

28 Quality Report Card: Step 3 General level – Step 3: Determine a general score

29 Questions? Any questions or comments?

30 Exercise – Let’s try to interpret some data quality findings! – To ease the exercise, every indicator has a single score

31 Group exercise – Participants will be split into groups and each group is provided with: ‐ The Source, Metadata and Data results for the administrative data source discussed in the previous exercise ‐ An intended use – Each group will be asked to discuss: ‐ whether the data in the source could be used for the purpose intended/ If yes, why is everything OK? If not, what is the problem that prevents its use and how can it be solved?


Download ppt "Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension."

Similar presentations


Ads by Google