www.stat.gov.lt 20-21 March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)

Nadežda Fursova, Chief specialist

The main topics of the presentation
- What is data linking?
- Input data set
- Data linkage methods
- Problems we meet when linking data

Data linking and data integration
Linking different input sources (administrative data, survey data, etc.) to one population (= data linking). In a next step, these linked data are processed into one consistent dataset, which greatly increases the power of analysis possible with the data (= data integration).
- Opportunity: reducing costs, improving quality
- Challenge: preparatory work is needed to examine the data; linking is normally easy if there is a unique ID, but unlinkable cases remain

Types of data linking
Record linkage
- for organizing ONE dataset: data cleaning, removing duplicates
- for merging TWO or MORE datasets: merging data into one consistent dataset

Input data set
The first step in data linkage is to determine needs and check data availability. Proposed scope of input data:

Statistical Business Register and Population frame
- To link several input data sets in a SDWH, we need to agree on the default target population and on the enterprise unit to which all input data are matched.
- The default target population is defined as the statistical enterprise units which have been active during the reference year.
- Input source 'backbone: population frame' includes the following information:
  - Frame reference year
  - Statistical enterprise unit, including its national ID and its EGR ID
  - Name/address of the enterprise
  - National ID of the enterprise
  - Date in population (mm/yr)
  - Date out of population (mm/yr)
  - NACE code
  - Institutional sector code
  - Size class
- The population frame is crucial information for determining the default active population.
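As an illustration of how the dates in and out of the population could be used to derive the default active population, here is a minimal Python sketch. The record layout (`FrameRecord`, `unit_id`, `date_in`, `date_out`) and the rule "in the population at any time during the reference year" are assumptions made for this example, not the register's actual structure or definition.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FrameRecord:
    # Hypothetical layout: dates are (year, month) pairs, mirroring mm/yr.
    unit_id: str
    date_in: Tuple[int, int]
    date_out: Optional[Tuple[int, int]]  # None: unit has not left the population

def active_in_year(rec: FrameRecord, year: int) -> bool:
    # Assumed rule: the unit entered no later than the reference year
    # and had not left before the reference year started.
    entered = rec.date_in[0] <= year
    not_left = rec.date_out is None or rec.date_out[0] >= year
    return entered and not_left

def active_units(frame: List[FrameRecord], year: int) -> List[str]:
    # Return the IDs of all units counted as active in the reference year.
    return [rec.unit_id for rec in frame if active_in_year(rec, year)]
```

A unit that entered in 2010 and never left, and a unit present only from June to September 2012, would both be counted for reference year 2012 under this assumed rule.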

Data sources
- One aim of a SDWH is to create a set of fully integrated data about enterprises. These data may come from different sources, such as surveys, administrative data, accounting data and census data. Different data sources cover different populations.
- The main issue is to link these input data sources while ensuring that the data are linked to the same enterprise unit and compared with the same target population.
- Main data sources:
  - Surveys (censuses, sample surveys)
  - Combined data (survey and administrative data)
  - Administrative data

Defining metadata
- The term metadata is very broad. A distinction is made between “structural” metadata that define the structure of statistical data sets and metadata sets, and “reference” metadata describing actual metadata contents.
- NSIs need to define metadata before linking sources.
- What kind of reference metadata needs to be submitted?
- ESMS metadata files are used for describing the statistics released by Eurostat. They aim at documenting methodologies, quality and the statistical production processes in general.

The statistical unit base
- The unit base is closely related to the SBR.
- Its contents are also closely related to the available input data, and it was recommended to consider it as a separate input source. The unit base describes the relationship between the different units and the statistical enterprise unit.
[Figures: example of the Netherlands unit base; example of Lithuania]

Data linking methods
Data linkage methods usually fall across a spectrum between:
- Deterministic linkage: methods involve exact one-to-one character matching of linkage variables.
- Probabilistic linkage: methods involve the calculation of linkage weights, estimated given all the observed agreements and disagreements of the data values of the matching variables.
A combination of linkage methods may be used, but the choice of method depends on the types and quality of linkage variables available on the data sets to be linked.

Deterministic linkage
- The simplest method of matching is sort/merge.
- Deterministic linkage is based on exact matches. Variables used in deterministic linkage need to be accurate, robust, stable over time and complete.
- It works best if there is a common unique identifier (company ID number, Social Security number, etc.).
When there is no unique identifier (= not ideal):
- Use a statistical linkage key (SLK). Generally it is a combination of attributes, including last name, first name, sex and date of birth.
- Stepwise deterministic record linkage is a more sophisticated form of deterministic linkage. It has been developed in response to variations that often exist in the attributes used in creating the SLKs for deterministic linkage.
- “Rules-based linkage”: a set of rules can be used to classify pairs of records as matches or non-matches.
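The basic idea of exact matching on a unique identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dictionaries, the field name passed as `key` (for example a company ID) is hypothetical, and a record is linked only when exactly one candidate with the same key value exists.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]

def deterministic_link(
    left: List[Record], right: List[Record], key: str
) -> Tuple[List[Tuple[Record, Record]], List[Record]]:
    # Index the right-hand data set by the linkage key, skipping missing keys.
    index: Dict[object, List[Record]] = {}
    for rec in right:
        value = rec.get(key)
        if value is not None:
            index.setdefault(value, []).append(rec)

    links, unlinked = [], []
    for rec in left:
        candidates = index.get(rec.get(key), []) if rec.get(key) is not None else []
        if len(candidates) == 1:
            # Exact, unambiguous agreement on the key: accept the link.
            links.append((rec, candidates[0]))
        else:
            # Missing key, no candidate, or several candidates: leave unlinked.
            unlinked.append(rec)
    return links, unlinked
```

Records that end up in `unlinked` are the cases for which a linkage key or a probabilistic method would be needed.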

Deterministic linkage: statistical linkage keys
- Generally, most SLKs for statistics on persons are constructed from last name, first name, sex and full date of birth. SLKs protect privacy and data confidentiality because they serve as an alternative to a person's name and date of birth being on the data sets to be linked.
- There are two kinds of errors associated with SLKs:
  - There may be incomplete or missing data items on an individual's record, which means that the SLK will be incomplete.
  - Errors in the source data may lead to the generation of multiple SLKs for the same individual, or multiple individuals may share the same SLK.
Problems:
- often no unique, known and accurate ID
- poor quality data (errors, variations, missing data, etc.)
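To make the two kinds of SLK error concrete, the following Python sketch builds a key from the four attributes named above. The key format (first three letters of the last name, first two letters of the first name, date of birth as YYYYMMDD, one-letter sex code, with `_` as a placeholder for missing parts) is a hypothetical format chosen for this example, not a format prescribed in the presentation.

```python
from datetime import date
from typing import Optional

def _prefix(name: Optional[str], n: int) -> str:
    # Keep letters only, upper-case them, and pad missing positions with "_".
    letters = "".join(ch for ch in (name or "").upper() if ch.isalpha())
    return letters[:n].ljust(n, "_")

def make_slk(last_name: Optional[str], first_name: Optional[str],
             sex: Optional[str], dob: Optional[date]) -> str:
    # Hypothetical SLK layout: LLL + FF + YYYYMMDD + S
    dob_part = dob.strftime("%Y%m%d") if dob else "_" * 8
    sex_part = (sex or "_").upper()[:1]
    return _prefix(last_name, 3) + _prefix(first_name, 2) + dob_part + sex_part

born = date(1980, 5, 17)
full_key = make_slk("Smith", "John", "M", born)           # complete key
incomplete_key = make_slk("Smith", None, "M", born)       # missing first name
variant_key = make_slk("Smyth", "John", "M", born)        # spelling variation
collision_key = make_slk("Smithson", "Joanna", "M", born) # different person
```

With this assumed format, `incomplete_key` contains placeholders, `variant_key` differs from `full_key` although it may describe the same person, and `collision_key` equals `full_key` although it describes someone else.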

Probabilistic linkage
- may be undertaken where there are no unique identifiers or SLKs
- or where the linkage variables and/or entity identifiers are not as accurate, stable or complete as required for the deterministic method
- can lead to much better linkage than simple deterministic linkage methods
- has a greater capacity to link records with errors in their linking variables

Probabilistic linkage
- M-probability (match probability):
  - the probability that a field agrees given that the pair of records is a true match
  - for any given field, the same M-probability applies for all records
- U-probability (unmatch probability):
  - the probability that a field agrees given that the pair of records is not a true match
  - often simplified as the chance that two records will randomly match
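One common way to turn M- and U-probabilities into linkage weights is the Fellegi-Sunter formulation, in which a field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees. The presentation does not state which weighting it uses, so the Python sketch below should be read as an illustration of that standard approach; the field names and probability values in the usage note are invented for the example.

```python
import math
from typing import Dict, Tuple

def field_weights(m: float, u: float) -> Tuple[float, float]:
    # Agreement weight is positive when agreement is more likely among
    # true matches (m) than among non-matches (u).
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

def pair_weight(agreements: Dict[str, bool],
                probs: Dict[str, Tuple[float, float]]) -> float:
    # Sum the per-field weights for one candidate record pair.
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(*probs[field])
        total += agree_w if agrees else disagree_w
    return total
```

For example, with assumed values `probs = {"name": (0.9, 0.1), "nace": (0.8, 0.2)}`, a pair that agrees on name but disagrees on NACE code gets `pair_weight({"name": True, "nace": False}, probs)`, which is then compared with thresholds chosen by the analyst to decide between match, non-match and possible match.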

Summary: deterministic and probabilistic linking
- Ideal situation: a unique ID is available (= deterministic linking)
  - simplest and fastest method
  - best quality
- For a SDWH, a unique ID is desired for the most important data sources (if not, the work will be too laborious).
- If there is no unique ID for some, less important, data sources, several deterministic and probabilistic linking techniques (as presented before) can be used.

The data linkage process
The data linkage process may vary depending on the linkage model and the linkage method. There are, however, four steps that are common to both data linkage models:
- Data cleaning and standardization
- Blocking (in the case of large datasets)
- Record pair comparison
- Decision model
Determinants of linkage quality:
- the quality of SLKs (in the case of deterministic linkage)
- the quality of blocking and linkage variables (in the case of probabilistic linkage)
Poor quality of variables can lead to some records not being linked or being linked to the wrong records.
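The four steps can be sketched end to end in Python. This is a deliberately simple illustration, not the procedure used in the SDWH: the record fields (`id`, `name`, `city`), the choice of city as blocking variable, the use of `difflib.SequenceMatcher` as similarity measure and the threshold of 0.85 are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

def standardize(name: str) -> str:
    # Step 1, cleaning and standardization: upper-case, drop punctuation,
    # collapse repeated whitespace.
    return " ".join(name.upper().replace(".", " ").replace(",", " ").split())

def block_key(rec: Dict[str, str]) -> str:
    # Step 2, blocking: only records sharing this key are compared.
    return rec["city"].strip().upper()

def link(left: List[Dict[str, str]], right: List[Dict[str, str]],
         threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)

    matches = []
    for l_rec in left:
        for r_rec in blocks.get(block_key(l_rec), []):
            # Step 3, record pair comparison: string similarity of names.
            score = SequenceMatcher(
                None, standardize(l_rec["name"]), standardize(r_rec["name"])
            ).ratio()
            # Step 4, decision model: a single assumed threshold.
            if score >= threshold:
                matches.append((l_rec["id"], r_rec["id"], round(score, 2)))
    return matches
```

Blocking keeps the number of comparisons manageable for large data sets, at the price that records with an erroneous blocking variable are never compared, which is why the quality of blocking variables matters.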

Measures of quality of data linkage
Measures that may be used to assess data linkage quality include:
- accuracy
- sensitivity
- specificity
- precision
- false-positive
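These measures can be computed from the four counts obtained when linkage decisions are checked against a set of pairs whose true status is known. The Python sketch below uses the usual definitions; reading "false-positive" as the false-positive rate is an assumption, and the counts in the usage note are invented.

```python
from typing import Dict

def linkage_quality(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    # tp: true links found, fp: wrong links made,
    # fn: true links missed, tn: non-links correctly left unlinked.
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),          # share of true matches found
        "specificity": tn / (tn + fp),          # share of non-matches rejected
        "precision": tp / (tp + fp),            # share of links that are correct
        "false_positive_rate": fp / (fp + tn),  # assumed meaning of "false-positive"
    }
```

For example, `linkage_quality(tp=90, fp=10, fn=10, tn=890)` describes a linkage that finds 90 of 100 true matches while making 10 wrong links among 900 non-matching pairs. Because non-matching pairs usually vastly outnumber matches, accuracy and specificity tend to look high even for poor linkages, so sensitivity and precision are often more informative.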

Conclusions
For successful data linking:
- the population of the different data sources should be known
- the input sources should be of high quality
- a unique identifier is desired
If there is no unique identifier:
- different methods can be applied (deterministic and probabilistic)
Quality of data linkage:
- depends on the presence of a unique ID
- AND on the accuracy and precision of the data and the false-positive ratios when linking
Data linking and data integration in the next steps:
- the challenge is to deal with errors, missing data and conflicting data (presentations tomorrow)
