www.stat.gov.lt 20-21 March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)


DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)
Nadežda Fursova, Chief specialist

The main topics of the presentation
• What is data linking?
• Input data set
• Data linkage methods
• Problems we meet when linking data

Data linking and data integration
Data linking: linking different input sources (administrative data, survey data, etc.) to one population.
Data integration: in a next step, the linked data are processed into one consistent dataset, which greatly increases the power of analysis beyond what is possible with the individual sources.
• Opportunity: reducing costs, improving quality
• Challenge: preparatory work to examine the data; linking is normally easy if there is a unique ID, but some cases remain unlinkable

Types of data linking
Record linkage is used:
• for organizing ONE dataset: data cleaning, removing duplicates
• for merging TWO or MORE datasets: merging data into one consistent dataset

Input data set
The first step in data linkage is to determine needs and check data availability. Proposed scope of input data:

Statistical Business Register and population frame
• To link several input data sets in an S-DWH, we need to agree on the default target population and on the enterprise unit to which all input data are matched.
• The default target population is defined as the statistical enterprise units that have been active during the reference year.
• Input source "backbone: population frame" includes the following information:
  • Frame reference year
  • Statistical enterprise unit, including its national ID and its EGR ID
  • Name and address of the enterprise
  • National ID of the enterprise
  • Date in population (mm/yr)
  • Date out of population (mm/yr)
  • NACE code
  • Institutional sector code
  • Size class
• The population frame is the crucial information for determining the default active population.
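As an illustration of how the frame's "date in" and "date out" fields could determine the default active population, here is a minimal sketch. The record layout, the field names and the rule that a unit is active if its time in the population overlaps the reference year by at least one month are assumptions made for this example, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month), compared as tuples


@dataclass
class FrameUnit:
    unit_id: str                   # hypothetical national ID of the enterprise unit
    date_in: YearMonth             # date in population
    date_out: Optional[YearMonth]  # date out of population; None = still in
    nace: str                      # NACE code


def is_active(unit: FrameUnit, ref_year: int) -> bool:
    """Assumed rule: the unit was in the population for at least one month of ref_year."""
    entered_by_year_end = unit.date_in <= (ref_year, 12)
    still_in_at_year_start = unit.date_out is None or unit.date_out >= (ref_year, 1)
    return entered_by_year_end and still_in_at_year_start


frame = [
    FrameUnit("A1", (2009, 3), None, "4711"),
    FrameUnit("B2", (2010, 6), (2011, 12), "6201"),  # left before 2012
    FrameUnit("C3", (2012, 11), None, "2511"),       # entered during 2012
    FrameUnit("D4", (2013, 2), None, "4711"),        # entered after 2012
]

active_2012 = [u.unit_id for u in frame if is_active(u, 2012)]
print(active_2012)
```

All other input sources would then be matched against this list of unit IDs, so that every source is compared with the same target population.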

Data sources
• One aim of an S-DWH is to create a set of fully integrated data about enterprises. These data may come from different sources, such as surveys, administrative data, accounting data and census data. Different data sources cover different populations.
• The main issue is to link these input data sources while ensuring that the data are linked to the same enterprise unit and compared with the same target population.
• Main data sources:
  • Surveys (censuses, sample surveys)
  • Combined data (survey and administrative data)
  • Administrative data

Defining metadata
• The term metadata is very broad. A distinction is made between "structural" metadata, which define the structure of statistical data sets and metadata sets, and "reference" metadata describing actual metadata contents.
• NSIs need to define metadata before linking sources.
• What kind of reference metadata needs to be submitted?
• ESMS (Euro SDMX Metadata Structure) metadata files are used for describing the statistics released by Eurostat. They aim at documenting methodologies, quality and the statistical production processes in general.

The statistical unit base
• The unit base is closely related to the SBR.
• Its contents are also closely related to the available input data, and it was recommended to consider it as a separate input source. The unit base describes the relationship between the different units and the statistical enterprise unit.
[Figures: example of the Netherlands unit base; example of Lithuania]

Data linking methods
Data linkage methods usually fall across a spectrum between:
• Deterministic linkage: methods involve exact one-to-one character matching of linkage variables.
• Probabilistic linkage: methods involve the calculation of linkage weights, estimated given all the observed agreements and disagreements of the data values of the matching variables.
A combination of linkage methods may be used, but the choice of method depends on the types and quality of linkage variables available on the data sets to be linked.

Deterministic linkage
• Simplest method of matching: sort/merge.
• Deterministic linkage is based on exact matches. Variables used in deterministic linkage need to be accurate, robust, stable over time and complete.
• Works best if there is a common unique identifier (company ID number, Social Security number, etc.).
• When there is no unique identifier (= not ideal), use a statistical linkage key (SLK). Generally it is a combination of attributes, including last name, first name, sex and date of birth.
• Stepwise deterministic record linkage is a more sophisticated form of deterministic linkage. It has been developed in response to variations that often exist in the attributes used in creating the SLKs for deterministic linkage.
• "Rules-based linkage": a set of rules can be used to classify pairs of records as matches or non-matches.
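The stepwise idea can be sketched as follows: try the strictest key first (a unique company ID), then fall back to a weaker combination of attributes for the records that remain unlinked. The field names (`row`, `company_id`, `name`, `postcode`) and the order of the steps are hypothetical choices for this illustration only.

```python
def link_stepwise(left, right, key_sets):
    """Exact-match linkage applied in successive steps.

    Each step uses one tuple of key fields. A left record is linked only if
    exactly one still-unused right record has identical, non-missing values.
    """
    matches, used_right = [], set()
    remaining = list(left)
    for keys in key_sets:
        # Index the right-hand records not yet linked in an earlier step.
        index = {}
        for rec in right:
            key = tuple(rec.get(f) for f in keys)
            if rec["row"] not in used_right and None not in key:
                index.setdefault(key, []).append(rec)
        still_unlinked = []
        for rec in remaining:
            key = tuple(rec.get(f) for f in keys)
            candidates = index.get(key, []) if None not in key else []
            if len(candidates) == 1 and candidates[0]["row"] not in used_right:
                matches.append((rec["row"], candidates[0]["row"], keys))
                used_right.add(candidates[0]["row"])
            else:
                still_unlinked.append(rec)
        remaining = still_unlinked
    return matches, remaining


survey = [
    {"row": "s1", "company_id": "111", "name": "alfa uab", "postcode": "01100"},
    {"row": "s2", "company_id": None, "name": "beta uab", "postcode": "44240"},
    {"row": "s3", "company_id": None, "name": "gama uab", "postcode": "91100"},
]
register = [
    {"row": "r1", "company_id": "111", "name": "alfa uab", "postcode": "01100"},
    {"row": "r2", "company_id": "222", "name": "beta uab", "postcode": "44240"},
    {"row": "r3", "company_id": "333", "name": "delta uab", "postcode": "91100"},
]

# Step 1: unique ID; step 2: exact name + postcode for what is left.
matches, unlinked = link_stepwise(survey, register,
                                  [("company_id",), ("name", "postcode")])
print(matches)
print([r["row"] for r in unlinked])
```

Recording which key set produced each link makes it possible to review the weaker, later-step links separately.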

Deterministic linkage: statistical linkage keys
• Generally, most SLKs for statistics on persons are constructed from last name, first name, sex and full date of birth. SLKs protect privacy and data confidentiality because they serve as an alternative to a person's name and date of birth being on the data sets to be linked.
• There are two kinds of errors associated with SLKs:
  • There may be incomplete or missing data items on an individual's record, which means that the SLK will be incomplete.
  • Errors in the source data may lead to the generation of multiple SLKs for the same individual, or multiple individuals may share the same SLK.
Problems:
• often no unique, known and accurate ID
• poor quality data (errors, variations, missing data, etc.)
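To make the two kinds of SLK error concrete, the sketch below builds a key from selected letters of the names plus date of birth and sex. The exact composition (letters 2, 3 and 5 of the last name, letters 2 and 3 of the first name, `_` as a placeholder for missing items, ASCII letters only) is an assumption for this example; real SLK schemes define their own rules.

```python
import re


def pick_letters(name, positions):
    """Take the letters at the given 1-based positions; '_' marks a missing one."""
    letters = re.sub(r"[^A-Z]", "", (name or "").upper())  # ASCII letters only
    return "".join(letters[p - 1] if p <= len(letters) else "_" for p in positions)


def build_slk(last_name, first_name, dob, sex):
    """Hypothetical SLK: 3 last-name letters + 2 first-name letters + DOB + sex.

    dob is expected as 'YYYY-MM-DD'; None stands for a missing item.
    """
    dob_part = dob.replace("-", "") if dob else "_" * 8
    sex_part = sex.upper() if sex else "_"
    return (pick_letters(last_name, (2, 3, 5))
            + pick_letters(first_name, (2, 3))
            + dob_part + sex_part)


def is_complete(slk):
    return "_" not in slk


key_a = build_slk("Petraitis", "Jonas", "1980-05-17", "m")
# Error type 1: missing items give an incomplete key.
key_b = build_slk("O'Neil", "Al", None, "F")
# Error type 2: two different individuals can share the same key.
key_c = build_slk("Petrauskas", "Donatas", "1980-05-17", "M")

print(key_a, is_complete(key_a))
print(key_b, is_complete(key_b))
print(key_a == key_c)
```

The key no longer carries the full name, which is the confidentiality benefit, but that same loss of information is what allows distinct individuals to collide.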

Probabilistic linkage
• may be undertaken where there are no unique identifiers or SLKs
• or where the linkage variables and/or entity identifiers are not as accurate, stable or complete as required for the deterministic method
• can lead to much better linkage than simple deterministic linkage methods
• has a greater capacity to link records with errors in their linking variables

Probabilistic linkage
• M-probability (match probability):
  • probability that a field agrees given that the pair of records is a true match
  • for any given field, the same M-probability applies for all records
• U-probability (unmatch probability):
  • probability that a field agrees given that the pair of records is not a true match
  • often simplified as the chance that two records will randomly match
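In the usual Fellegi-Sunter formulation, the M- and U-probabilities of each field are turned into a linkage weight: log2(m/u) when the field agrees and log2((1-m)/(1-u)) when it disagrees, summed over the fields. The sketch below shows this calculation; the m and u values, the field names and the two decision thresholds are invented for illustration and would normally be estimated from the data.

```python
import math

# Hypothetical (m, u) per field: m = P(agree | true match), u = P(agree | non-match).
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "nace": (0.85, 0.10),
}


def field_weight(m, u, agrees):
    """Agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def pair_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Total linkage weight of a record pair: sum of the per-field weights."""
    return sum(field_weight(m, u, rec_a[f] == rec_b[f])
               for f, (m, u) in probs.items())


def classify(weight, lower=0.0, upper=8.0):
    """Assumed two-threshold rule; pairs in between go to clerical review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


a = {"name": "baltic trade", "postcode": "01100", "nace": "4711"}
same = dict(a)
name_only = {"name": "baltic trade", "postcode": "44240", "nace": "6201"}
different = {"name": "vilnius bakery", "postcode": "44240", "nace": "1071"}

for other in (same, name_only, different):
    w = pair_weight(a, other)
    print(round(w, 2), classify(w))
```

A field with a low U-probability (agreement is rare by chance) contributes a large positive weight when it agrees, which is why distinctive variables such as names dominate the total.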

Summary: deterministic and probabilistic linking
• Ideal situation: availability of a unique ID (= deterministic linking)
  • Simplest and fastest method
  • Best quality
• For an S-DWH, a unique ID is desired for the most important data sources (if not, the work will be too laborious).
• If there is no unique ID for some, less important, data sources, several deterministic and probabilistic linking techniques (as presented before) can be used.

The data linkage process
The data linkage process may vary depending on the linkage model and the linkage method. There are, however, four steps that are common to both data linkage models:
• Data cleaning and standardization
• Blocking (in case of large datasets)
• Record pair comparison
• Decision model
Determinants of linkage quality:
• the quality of SLKs (in the case of deterministic linkage)
• the quality of blocking and linkage variables (in the case of probabilistic linkage)
Poor quality of variables can lead to some records not being linked or being linked to wrong records.
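The four steps can be sketched end to end with the standard library alone. Everything specific here is an assumption for illustration: the field names, blocking on postcode, string similarity from `difflib.SequenceMatcher` as the comparison, and a single threshold of 0.85 as the decision model.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def standardize(rec):
    """Step 1: cleaning and standardization (case, punctuation, whitespace)."""
    name = rec["name"].lower().replace(",", " ").replace(".", " ")
    return {"id": rec["id"],
            "name": " ".join(name.split()),
            "postcode": rec["postcode"].replace(" ", "").upper()}


def block(records, key="postcode"):
    """Step 2: blocking, so only records sharing the key are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def compare(a, b):
    """Step 3: record pair comparison, here a name similarity in [0, 1]."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()


def link(left, right, threshold=0.85):
    """Step 4: decision model, here a single assumed similarity threshold."""
    right_blocks = block([standardize(r) for r in right])
    links = []
    for a in (standardize(r) for r in left):
        for b in right_blocks.get(a["postcode"], []):
            if compare(a, b) >= threshold:
                links.append((a["id"], b["id"]))
    return links


survey = [{"id": "s1", "name": "UAB  Baltic Trade.", "postcode": "LT 01100"}]
register = [
    {"id": "r1", "name": "uab baltic trade", "postcode": "LT01100"},
    {"id": "r2", "name": "uab baltic trade", "postcode": "LT44240"},    # other block
    {"id": "r3", "name": "uab vilnius bakery", "postcode": "LT01100"},  # same block
]

links = link(survey, register)
print(links)
```

Record r2 is never compared with s1 because it falls in a different block, which shows both the saving from blocking and the risk: an error in the blocking variable makes a true match unreachable.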

Measures of quality of data linkage
Measures that may be used to assess data linkage quality include:
• accuracy
• sensitivity
• specificity
• precision
• false-positive rate
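These measures follow from comparing the links made with a known truth (for example a clerically reviewed sample) and counting true and false positives and negatives among the compared record pairs. A minimal sketch using the standard definitions; the counts are invented for illustration.

```python
def linkage_quality(tp, fp, fn, tn):
    """Standard quality measures from pair-level counts.

    tp: true matches that were linked      fp: non-matches that were linked
    fn: true matches that were missed      tn: non-matches left unlinked
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),          # share of true matches found
        "specificity": tn / (tn + fp),          # share of non-matches rejected
        "precision": tp / (tp + fp),            # share of links that are correct
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Invented counts for 1000 compared record pairs.
quality = linkage_quality(tp=90, fp=10, fn=5, tn=895)
for name, value in quality.items():
    print(f"{name}: {value:.3f}")
```

Because non-matching pairs usually vastly outnumber matching ones, accuracy and specificity tend to look high even for poor linkage, so sensitivity and precision are often the more informative pair.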

Conclusions
For successful data linking:
• the population of the different data sources should be known
• the input sources should be of high quality
• a unique identifier is desired
If there is no unique identifier:
• different methods can be applied (deterministic and probabilistic)
Quality of data linkage:
• depends on the presence of a unique ID
• AND on the accuracy and precision of the data and the false-positive ratios when linking
Data linking and data integration in the next steps:
• the challenge is to deal with errors, missing data and conflicting data (presentations tomorrow)