Sharing Solution for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences
Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)
(1) Italian National Statistical Institute – ISTAT – Italy
(2) Spanish National Statistical Institute – INE – Spain
NTTS 2009, Brussels, February 2009

Outline
1. The Record Linkage
2. The ESSnet on ISAD
3. The Idea and the Features of the RELAIS Software
4. The Italian and Spanish Experiences in using RELAIS
5. Throughout RELAIS 2.0
6. Conclusions
Theory and Practice in Developing a Record Linkage Software
Nicoletta Cibella, Brussels, 19th February 2009

Record Linkage
The purpose of record linkage is to identify the same real-world entity when it is represented differently in different data sources.
Different approaches to record linkage:
- Exact RL
- Deterministic RL
- Probabilistic RL (Fellegi and Sunter theory)
- Bayesian RL
- Machine Learning
- Knowledge Representation
- …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)

Record Linkage Complexity
Record linkage techniques are a multidisciplinary set of methods and practices.
- SEARCH SPACE REDUCTION: Sorted Neighbourhood Method, Blocking, Hierarchical Grouping, …
- DECISION MODEL CHOICE: Fellegi & Sunter, Deterministic, Bayesian, Knowledge-based, Mixed, …
- COMPARISON FUNCTION CHOICE: Exact, Edit distance, Smith-Waterman, Q-grams, Jaro string comparator, Soundex code, TF-IDF, …
- PRE-PROCESSING: Conversion of upper/lower cases, Replacement of null strings, Standardization, Parsing, …
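To make two of the boxes above concrete, here is a minimal sketch of a pre-processing step and of two comparison functions from the list (edit distance and q-grams). It is not RELAIS code: the function names and the rescaling of the edit distance to a [0, 1] similarity are assumptions made for illustration.

```python
def normalize(value):
    """Pre-processing: collapse whitespace, upper-case, map empty strings to None."""
    if value is None:
        return None
    cleaned = " ".join(value.split()).upper()
    return cleaned or None


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def qgrams(text, q=2):
    """Set of overlapping substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}
```

A comparison function of this kind returns a graded score instead of the yes/no answer of an exact comparison, so a threshold (or a decision model) is still needed to turn scores into links.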

The Record Linkage Phases
Record linkage should be decomposed into its constituent phases as much as possible:
1. Pre-processing of the input files
2. Creation and reduction of the search space of link candidate pairs
3. Choice of the matching variables
4. Choice of the comparison function
5. Choice of the decision model
6. Selection of unique links
7. Record linkage evaluation

The ESSnet ISAD: Integration of Surveys and Administrative Data
The ESSnet and its focus
The aim of the project is to raise, across the whole ESS, knowledge and understanding of the statistical methodologies for integrating two (or more) data sources.
Partners
The ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June …
The project involved 5 countries:
- ISTAT – Italy (scientific coordinator)
- STAT – Austria
- CZSO – Czech Republic
- CBS – Netherlands
- INE – Spain

RELAIS: The Idea
- There is no single optimal solution to record linkage problems: for each phase, the most appropriate technique should be chosen depending on the application and data requirements, not only on the practitioner's skill
- An ad-hoc record linkage process (workflow) should be built dynamically
- RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose

Record Linkage Workflows
[Diagram: modules available for each phase, and two workflows composed from them]
- Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
- Search Space Reduction: Blocking, SNM
- Comparison Function: Edit Distance, Jaro, Equality
- Decision Model: Probabilistic, Empirical
- RecLink WF Appl1: Normalization, UpperLowerCase, Blocking, Jaro, Empirical, Equality
- RecLink WF Appl2: SNM, Probabilistic

RELAIS Features
- Modular structure: each phase is planned as a module of the toolkit, with an explicit interface to the other modules
- Top-down design: this allows modules (phases) of the record linkage process to be omitted and/or iterated
Advantages:
- dynamic composition of record linkage processes
- parallel development of various techniques is allowed
- design for Web service encapsulation, in order to permit remote invocation

RELAIS: An Open Source Project
- Results produced by the scientific community in recent years can be gathered and made available
  – … papers mentioning record linkage (Google Scholar)
- Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers
RELAIS Implementation Choices
  – Java
  – R statistical language

RELAIS: the First Release
RELAIS 1.0
- SEARCH SPACE REDUCTION: Cross Product, Sorted Neighbourhood Method, Blocking
- COMPARISON FUNCTION CHOICE: Equality
- DECISION MODEL CHOICE: Fellegi & Sunter
- 1:1 REDUCTION: Optimised Transportation Problem
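As a reminder of what a Fellegi & Sunter decision model computes, the sketch below turns an agreement vector into a composite log-likelihood-ratio weight and classifies it with two thresholds. The m- and u-probabilities, the thresholds, and the function names are hypothetical values chosen for illustration; in practice these parameters are usually estimated from the data (for example with the EM algorithm) rather than fixed by hand.

```python
from math import log2


def match_weight(vector, m, u):
    """Sum of per-variable weights for a 1/0 agreement vector.

    m[k] = P(variable k agrees | pair is a true match)
    u[k] = P(variable k agrees | pair is a true non-match)
    """
    total = 0.0
    for agrees, m_k, u_k in zip(vector, m, u):
        if agrees:
            total += log2(m_k / u_k)
        else:
            total += log2((1 - m_k) / (1 - u_k))
    return total


def classify(weight, upper, lower):
    """Three-way Fellegi & Sunter decision rule."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical parameters for two matching variables (e.g. surname, year of birth).
m = [0.9, 0.8]
u = [0.1, 0.2]

for vector in [(1, 1), (1, 0), (0, 0)]:
    w = match_weight(vector, m, u)
    print(vector, round(w, 2), classify(w, upper=4.0, lower=-4.0))
```

Pairs falling between the two thresholds are the ones normally sent to clerical review.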

RELAIS in the Italian and Spanish Experiences
- Common ideas and needs about the software (no ad-hoc solutions)
- Sharing of knowledge and cooperation started in the ESSnet
- Evaluation of the adaptability of RELAIS to Spanish data integration problems as well

RELAIS in the Italian Tests
A Scenario: the Data
- Individual-level data from the 2001 Italian Census and the PES (about … units each)
- Capture-recapture model to estimate the Census Coverage Rate
  - assumes no matching errors in linking Census and PES records
- Linkage was a very complex operation:
  - deterministic and probabilistic approaches and clerical review
  - almost 15 matching variables
  - several working months
- Due to the accuracy of the matching procedures adopted, the true linkage status of all candidate pairs is known

RELAIS in the Italian Tests
A focus on Rome
- Size of the PES and CEN files: about … units each
- Cartesian product CEN x PES: more than … pairs (expected link probability …)
1st Linkage Pass
- Blocking variable: month of birth of the household head
- Matching variables: name, surname, gender, day-month-year of birth
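Blocking as used in this pass can be sketched as an index on the blocking variable, so that only records sharing the same value are ever compared. The field name `birth_month` and the toy records are hypothetical; the sketch only shows how blocking shrinks the Cartesian product.

```python
from collections import defaultdict


def block_pairs(file_a, file_b, key):
    """Yield candidate pairs that share the same non-missing blocking value."""
    index = defaultdict(list)
    for record in file_b:
        if record.get(key) is not None:
            index[record[key]].append(record)
    for record in file_a:
        # Records with a missing blocking value find no block and yield nothing.
        for candidate in index.get(record.get(key), []):
            yield record, candidate


cen = [{"id": "c1", "birth_month": 3},
       {"id": "c2", "birth_month": 7},
       {"id": "c3", "birth_month": None}]
pes = [{"id": "p1", "birth_month": 3},
       {"id": "p2", "birth_month": 3},
       {"id": "p3", "birth_month": 7}]

candidates = [(a["id"], b["id"]) for a, b in block_pairs(cen, pes, "birth_month")]
print(len(candidates), "candidate pairs instead of", len(cen) * len(pes))
```

Note that `c3`, whose blocking value is missing, never enters a block: records like this are left over for a later pass with a different search-space reduction.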

RELAIS in the Italian Tests
Results of the 1st Linkage Pass against the True Linkage Status
[Table with columns Matched, Not Matched, Total; only two values are recoverable: row "Not Matched" 856, row "Total" 6 872]
Results of the 1st linkage step:
- Match Rate: 88%
- False Match Rate: 0.5%
- False Non-Match Rate: 12%
The software also provides results at the block level.
MATCH RATE TOO LOW IN COVERAGE CONTEXT
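The slide does not define the three rates. The sketch below uses one set of definitions that is consistent with the figures that survive (856 missed links out of 6 872 true links gives a false non-match rate of about 12%); these definitions are an assumption, and the counts passed to `linkage_rates` in the example are hypothetical toy numbers, not results from the test.

```python
def linkage_rates(true_links_found, false_links, total_true_links):
    """Quality rates for a linkage run whose true status is known.

    Assumed definitions:
      match rate           = declared links / true links
      false match rate     = false links / declared links
      false non-match rate = missed true links / true links
    """
    declared = true_links_found + false_links
    return {
        "match_rate": declared / total_true_links,
        "false_match_rate": false_links / declared,
        "false_non_match_rate": (total_true_links - true_links_found) / total_true_links,
    }


# Hypothetical toy run: 100 true links, 90 of them found, 2 wrong links declared.
toy = linkage_rates(true_links_found=90, false_links=2, total_true_links=100)

# The only rate that can be recomputed from the values surviving on the slide.
slide_false_non_match_rate = 856 / 6872
print(f"{slide_false_non_match_rate:.0%}")
```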

RELAIS in the Italian Tests
2nd Linkage Pass
- Residuals of the 1st step: about … units in each file
  - mainly records with a missing value in the blocking variable used at the 1st step; expected link probability …
- Cartesian product: again not recommended …
- Search space reduction by means of the Sorted Neighbourhood Method
  - Sorting variable: first letter of surname; window size = 450 (frequency of the most common first letter = 250)
- Matching variables: name, surname, day-month-year of birth
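The Sorted Neighbourhood Method can be sketched as follows: the two files are merged, sorted on the sorting variable, and each record is compared only with the records that fall inside a sliding window after it. This is a generic sketch of the technique, not the RELAIS implementation; the function name, the record layout, and the tiny window are hypothetical.

```python
def sorted_neighbourhood_pairs(file_a, file_b, sort_key, window):
    """Candidate (a, b) pairs whose positions in the merged sorted list
    are fewer than `window` places apart."""
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda item: sort_key(item[1]))
    pairs = []
    for i, (source_i, record_i) in enumerate(tagged):
        for source_j, record_j in tagged[i + 1:i + window]:
            if source_i != source_j:  # only compare records from different files
                if source_i == "A":
                    pairs.append((record_i, record_j))
                else:
                    pairs.append((record_j, record_i))
    return pairs


def first_letter(record):
    """Sorting variable used on the slide: first letter of the surname."""
    return record["surname"][0]


cen = [{"id": "a1", "surname": "Bianchi"}, {"id": "a2", "surname": "Rossi"}]
pes = [{"id": "b1", "surname": "Bruno"}, {"id": "b2", "surname": "Russo"}]

narrow = [(a["id"], b["id"])
          for a, b in sorted_neighbourhood_pairs(cen, pes, first_letter, window=2)]
wide = sorted_neighbourhood_pairs(cen, pes, first_letter, window=4)
```

With a coarse sorting variable such as a single letter, the window has to be large enough to cover the records sharing the most common value, otherwise true links sorted far apart within the same letter are never compared.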

RELAIS in the Italian Tests
Results of the Linkage Procedure against the True Linkage Status
[Table with columns Matched, Not Matched, Total; only two values are recoverable: row "Not Matched" 160, row "Total" 6 872]
Results of the overall linkage procedure (1st plus 2nd steps):
- Match Rate: 98.5%
- False Match Rate: 0.8%
- False Non-Match Rate: 2.3%
Working time: less than 2 hours

RELAIS in the Italian Tests
Rome PES Workflow (RELAIS 1.0)
[Diagram: modules shown for each phase, and the two steps composed from them]
- Search Space Reduction: Cross Product, Blocking, SNM
- Comparison Function: Edit Distance, Jaro-Winkler, Equality
- Decision Model: Probabilistic
- Linking Type: 1:1, Many:Many
- Step 1: Blocking, Equality, Probabilistic, 1:1
- Step 2: SNM, Equality, Probabilistic, 1:1
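The 1:1 linking type means that, among the candidate links produced by the decision model, each record may be kept in at most one link. RELAIS 1.0 is described earlier in the deck as solving this as an optimised transportation problem; the sketch below is a stand-in that simply enumerates all assignments, which is only feasible for toy inputs. The function name and the weights are hypothetical.

```python
from itertools import permutations


def one_to_one(weights):
    """Pick the set of links with maximum total weight such that every
    record appears in at most one link.

    weights: dict mapping (a_id, b_id) -> weight of a candidate link.
    Brute force over assignments: for illustration on tiny inputs only.
    """
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    # Pad the B side with None so that an A record may stay unlinked.
    slots = b_ids + [None] * len(a_ids)
    best, best_total = [], float("-inf")
    for assignment in permutations(slots, len(a_ids)):
        chosen = [(a, b) for a, b in zip(a_ids, assignment) if (a, b) in weights]
        total = sum(weights[pair] for pair in chosen)
        if total > best_total:
            best, best_total = chosen, total
    return best


# Hypothetical many-to-many output: a1 is a candidate for both b1 and b2.
candidate_weights = {("a1", "b1"): 5.0, ("a1", "b2"): 4.0, ("a2", "b1"): 4.5}
unique_links = one_to_one(candidate_weights)
```

In this example a greedy choice of the single best pair (a1, b1) would leave a2 unlinked, while the global optimum keeps two links; this is why the reduction is posed as an optimisation problem.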

RELAIS in the Spanish Tests
A Scenario: the Data
- Individual-level data from the Living Conditions Survey (LCS) and the Central Population Register (CPR)
- 1st main objective: obtain an ID number for LCS records
- 2nd main objective: compare the RELAIS results with ad-hoc procedures
- Linkage was a very complex operation:
  - only name and geographical variables were available
  - large amount of data
- Blocking on geographic area variables

RELAIS in the Spanish Tests
Weaknesses of RELAIS 1.0
- difficulties in managing a large number of blocks
- difficulties in dealing with different probability estimates in each block
- difficulties in writing the largest output files
Strengths of RELAIS 1.0
- efficacy of the implemented probabilistic method
- noticeable flexibility in modifying and adapting the implemented functionalities (reduction from M:N to 1:1)

Throughout RELAIS 2.0
- A relational database architecture, to optimize performance when managing huge amounts of data through the whole record linkage process (input, intermediate phases and output).
- Several distance functions for string and numerical comparisons (not only equality).
- Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.
- A data profiling phase, to help the user in the critical choice of the best blocking and matching variables.
- One-shot execution, to deal with a large number of blocks.
RELAIS 2.0 is now being tested and will be available from May 2009.
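A rough idea of why a relational database helps with large files and with data profiling: blocking becomes an indexed join, and profiling a candidate blocking variable becomes a single aggregate query. The sketch uses an in-memory SQLite database purely so that it is self-contained; the slides do not say which database RELAIS 2.0 relies on, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cen (id TEXT PRIMARY KEY, surname TEXT, birth_month INTEGER);
    CREATE TABLE pes (id TEXT PRIMARY KEY, surname TEXT, birth_month INTEGER);
    CREATE INDEX cen_block ON cen (birth_month);
    CREATE INDEX pes_block ON pes (birth_month);
""")
conn.executemany("INSERT INTO cen VALUES (?, ?, ?)",
                 [("c1", "ROSSI", 3), ("c2", "BIANCHI", 7), ("c3", "VERDI", None)])
conn.executemany("INSERT INTO pes VALUES (?, ?, ?)",
                 [("p1", "ROSSI", 3), ("p2", "BRUNO", 3), ("p3", "BIANCHI", 7)])

# Data profiling: total rows, non-missing values and distinct values of a
# candidate blocking variable.
profile = conn.execute(
    "SELECT COUNT(*), COUNT(birth_month), COUNT(DISTINCT birth_month) FROM cen"
).fetchone()

# Blocking as a join: only pairs sharing the blocking value are materialised,
# together with an equality comparison on one matching variable.
candidate_pairs = conn.execute("""
    SELECT cen.id, pes.id, cen.surname = pes.surname AS surname_agrees
    FROM cen JOIN pes ON cen.birth_month = pes.birth_month
    ORDER BY cen.id, pes.id
""").fetchall()
conn.close()
```

The profile immediately shows one record with a missing blocking value, which is the kind of information that helps in choosing blocking and matching variables before running the linkage.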

Concluding Remarks
- Profitable experience of cooperation between NSIs.
- The open-source philosophy, and moving beyond ad-hoc approaches, proved to be a winning choice.
- NSIs share common problems and needs in data integration projects.
New challenge:
- Add methods for evaluating record linkage quality to RELAIS.

RELAIS: Availability and Contacts
- RELAIS 1.0 is available on the website: …
- RELAIS 2.0 will be available in May 2009
RELAIS contacts:
- Nicoletta Cibella, Statistician
- Tiziana Tuoto, Statistician