1 Sharing Solution for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences
Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)
(1) Italian National Statistical Institute (ISTAT), Italy
(2) Spanish National Statistical Institute (INE), Spain
NTTS 2009 Brussels 18-20 February 2009

2 Outline
1. The Record Linkage
2. The ESSnet on ISAD
3. The Idea and the Features of the RELAIS Software
4. The Italian and Spanish Experiences in using RELAIS
5. Throughout RELAIS 2.0
6. Conclusions
Theory and Practice in Developing a Record Linkage Software
Nicoletta Cibella, Brussels, 19th February 2009

3 Record Linkage
The purpose of record linkage is to identify the same real-world entity when it is represented differently across data sources.
Different approaches to record linkage:
- Exact RL
- Deterministic RL
- Probabilistic RL (Fellegi and Sunter theory)
- Bayesian RL
- Machine Learning
- Knowledge Representation
- …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)

4 Record Linkage Complexity
Record linkage techniques are a multidisciplinary set of methods and practices.
PRE-PROCESSING: Conversion of upper/lower cases; Replacement of null strings; Standardization; Parsing; …
SEARCH SPACE REDUCTION: Sorted Neighbourhood Method; Blocking; Hierarchical Grouping; …
COMPARISON FUNCTION CHOICE: Exact; Edit distance; Smith-Waterman; Q-grams; Jaro string comparator; Soundex code; TF-IDF; …
DECISION MODEL CHOICE: Fellegi & Sunter; Deterministic; Bayesian; Knowledge-based; Mixed; …
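As an illustration of one of the comparison functions listed on this slide, here is a minimal sketch of the edit (Levenshtein) distance and a similarity score derived from it. The function names and the scaling to [0, 1] are assumptions made for this example; the slides do not describe how RELAIS implements the comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance to [0, 1], 1.0 = identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

A threshold on such a similarity (for instance, treating two surnames as agreeing when the score is high enough) is one way to tolerate typing errors that an exact comparison would count as disagreement.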

5 The Record Linkage Phases
Record linkage should be decomposed into its constituent phases as much as possible:
1. Pre-processing of the input files
2. Creation and reduction of the search space of link candidate pairs
3. Choice of the matching variables
4. Choice of the comparison function
5. Choice of the decision model
6. Selection of unique links
7. Record linkage evaluation

6 The ESSnet ISAD: Integration of Surveys and Administrative Data
The ESSnet and its focus
The aim of the project is to raise, across the whole ESS, knowledge and understanding of the statistical methodologies for the integration of two (or more) data sources.
Partners
The ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June 2008. The project involved 5 countries:
- ISTAT, Italy (scientific coordinator)
- STAT, Austria
- CZSO, Czech Republic
- CBS, Netherlands
- INE, Spain

7 RELAIS: The Idea
- There is no single optimal solution for record linkage problems: for each phase the most appropriate technique should be chosen, depending on the application and data requirements and not only on the practitioner's skill
- An ad-hoc record linkage process (workflow) should be built dynamically
- RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose

8 Record Linkage Workflows
Techniques available in each phase:
- Preprocessing: Normalization; UpperLowerCase; Schema reconciliation
- Search Space Reduction: Blocking; SNM
- Comparison Function: Edit Distance; Jaro; Equality
- Decision Model: Probabilistic; Empirical
Example workflows:
- RecLink WF Appl1: Normalization, UpperLowerCase, Blocking, Jaro, Empirical
- RecLink WF Appl2: SNM, Equality, Probabilistic

9 RELAIS Features
- Modular structure: each phase is planned as a module of the toolkit, with an explicit interface to the other modules
- Top-down design: this allows modules (phases) of the record linkage process to be omitted and/or iterated
Advantages:
- dynamic composition of record linkage processes
- parallel development of various techniques is allowed
- design for Web service encapsulation in order to permit remote invocation
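The modular idea can be sketched as follows: each phase is a small interchangeable function, and a workflow is simply one choice per phase. All names here (`upper_case`, `blocking`, `equality`, `all_agree`, `run_workflow`) are hypothetical and chosen for this illustration; this is not the RELAIS API, and the deterministic `all_agree` rule stands in for whichever decision model a real workflow would plug in.

```python
from itertools import product


def upper_case(records):
    """Pre-processing module: convert every field to upper case."""
    return [{k: str(v).upper() for k, v in r.items()} for r in records]


def cross_product(file_a, file_b):
    """Search space module: every pair of records is a candidate."""
    return list(product(file_a, file_b))


def blocking(key):
    """Search space module: keep only pairs that share the blocking key."""
    def reduce(file_a, file_b):
        return [(a, b) for a, b in product(file_a, file_b) if a[key] == b[key]]
    return reduce


def equality(fields):
    """Comparison module: 1 if a matching variable agrees exactly, else 0."""
    def compare(a, b):
        return tuple(int(a[f] == b[f]) for f in fields)
    return compare


def all_agree(pattern):
    """Decision module (deterministic): link only if every variable agrees."""
    return all(pattern)


def run_workflow(file_a, file_b, preprocess, reduce, compare, decide):
    """Compose one module per phase into a record linkage workflow."""
    a, b = preprocess(file_a), preprocess(file_b)
    return [(x, y) for x, y in reduce(a, b) if decide(compare(x, y))]
```

Swapping `blocking("month")` for `cross_product`, or `equality` for a string-distance comparison, changes one argument of `run_workflow` and leaves the other phases untouched, which is the kind of dynamic composition the slide describes.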

10 RELAIS: An Open Source Project
- Results produced by the scientific community in recent years can be gathered and made available: 175 000 papers mentioning record linkage (Google Scholar)
- Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers
RELAIS Implementation Choices
- Java
- R statistical language

11 RELAIS: the First Release
RELAIS 1.0
SEARCH SPACE REDUCTION: Cross Product; Sorted Neighbourhood Method; Blocking
COMPARISON FUNCTION CHOICE: Equality
DECISION MODEL CHOICE: Fellegi & Sunter
1:1 REDUCTION: Optimised Transportation Problem
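For readers unfamiliar with the Fellegi and Sunter decision model, the sketch below shows its textbook form: each matching variable contributes log2(m/u) to the composite weight when it agrees and log2((1-m)/(1-u)) when it disagrees, where m and u are the probabilities of agreement among true matches and among true non-matches. The m and u values and the two thresholds used in the usage note are invented for illustration; the slides do not report the values used or estimated in RELAIS.

```python
from math import log2


def match_weight(pattern, m, u):
    """Composite Fellegi-Sunter weight of an agreement pattern.

    pattern: sequence of 1/0 flags, one per matching variable
    m[k]: P(variable k agrees | the pair is a true match)
    u[k]: P(variable k agrees | the pair is a true non-match)
    """
    total = 0.0
    for agrees, m_k, u_k in zip(pattern, m, u):
        if agrees:
            total += log2(m_k / u_k)
        else:
            total += log2((1 - m_k) / (1 - u_k))
    return total


def classify(weight, lower, upper):
    """Three-way decision using two thresholds (hypothetical values)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"
```

With, say, `m = [0.9, 0.9]`, `u = [0.1, 0.1]` and thresholds of -3 and 3, a pair agreeing on both variables is classified as a link, a pair disagreeing on both as a non-link, and a pair agreeing on only one falls in the middle region that would typically go to clerical review.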

12 RELAIS: the First Release

13 RELAIS in the Italian and Spanish Experiences
- Common ideas and needs about the software (no ad-hoc solutions)
- Knowledge sharing and cooperation started in the ESSnet
- Evaluation of the adaptability of RELAIS to Spanish data integration problems as well

14 RELAIS in the Italian Tests
A Scenario: the Data
Individual data from the 2001 Italian Census and the PES (Post Enumeration Survey), about 180 000 records each.
Capture-recapture model to estimate the Census coverage rate:
- assumes no matching errors in linking Census and PES records
Linkage was a very complex operation:
- deterministic and probabilistic approaches and clerical review
- almost 15 matching variables
- several working months
Due to the accuracy of the matching procedures adopted, we know the true linkage status of all candidate pairs.

15 RELAIS in the Italian Tests
A focus on Rome
Size of the PES and CEN files: about 8 000 units each
Cartesian product CEN x PES: more than 72 250 000 pairs (expected link probability 0.0001)
1st Linkage Pass
Blocking variable: month of birth of the head of household
Matching variables: name, surname, gender, day-month-year of birth
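The effect of blocking on the search space can be sketched by counting candidate pairs. Note that the slide's figure of 72 250 000 pairs equals 8 500 x 8 500. The helper names below are hypothetical, and treating records with a missing blocking key as left out of the pass is an assumption of this sketch.

```python
from collections import Counter


def cross_product_pairs(n_a: int, n_b: int) -> int:
    """Number of candidate pairs when every record meets every record."""
    return n_a * n_b


def blocked_pairs(keys_a, keys_b) -> int:
    """Number of candidate pairs when only records sharing a blocking key
    are compared. Records whose key is None (missing) are skipped, so they
    remain unlinked in this pass."""
    count_a = Counter(k for k in keys_a if k is not None)
    count_b = Counter(k for k in keys_b if k is not None)
    return sum(n * count_b[key] for key, n in count_a.items())
```

If a blocking variable such as month of birth split both files into 12 blocks of roughly equal size, the number of pairs to compare would fall to about one twelfth of the Cartesian product.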

16 RELAIS in the Italian Tests
Results of the 1st Linkage Pass
                            True Linkage Status
Linkage result        Matched   Not Matched     Total
Matched                 6 016            30     6 046
Not Matched               856
Total                   6 872
Match Rate: 88%
False Match Rate: 0.5%
False Non-Match Rate: 12%
The software also provides results at the block level.
MATCH RATE TOO LOW IN COVERAGE CONTEXT
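The three rates on this slide can be reproduced from the table counts. The slides do not state the formulas, so the definitions below are an assumption inferred from the reported figures: match rate as declared links over true matches, false match rate as false links over declared links, and false non-match rate as missed true matches over true matches.

```python
def linkage_rates(true_links_found, false_links, true_links_missed):
    """Quality rates for a linkage pass, under the assumed definitions
    described above (inferred from the slide's figures, not stated there)."""
    declared = true_links_found + false_links          # pairs declared matched
    true_total = true_links_found + true_links_missed  # all true matches
    return {
        "match_rate": declared / true_total,
        "false_match_rate": false_links / declared,
        "false_non_match_rate": true_links_missed / true_total,
    }


# Counts from the 1st linkage pass table: 6 016 true links found,
# 30 false links, 856 true matches missed.
first_pass = linkage_rates(6016, 30, 856)
for name, value in first_pass.items():
    print(f"{name}: {value:.1%}")
```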

17 RELAIS in the Italian Tests
2nd Linkage Pass
Residuals of the 1st step: about 1 500 units in each file
- mainly records with a missing value in the blocking variable used in the 1st step; expected link probability 0.0003
Cartesian product: again not recommended …
Search space reduction by means of the Sorted Neighbourhood Method
Sorting variable: first letter of surname; window size = 450 (frequency of the most common first letter = 250)
Matching variables: name, surname, day-month-year of birth
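A minimal sketch of the Sorted Neighbourhood Method used in this pass: the two files are pooled and sorted on a key, a fixed-size window slides over the sorted list, and only records from different files that fall in the same window become candidate pairs. The function name and its signature are assumptions for this example; the slides do not show how RELAIS implements the method.

```python
def sorted_neighbourhood_pairs(file_a, file_b, key, window):
    """Candidate pairs (record_a, record_b) whose positions in the pooled,
    key-sorted list are less than `window` apart."""
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda item: key(item[1]))  # stable sort on the key
    pairs = []
    for i, (source_i, record_i) in enumerate(tagged):
        # The window holds record i and the (window - 1) records after it.
        for source_j, record_j in tagged[i + 1:i + window]:
            if source_i != source_j:  # only cross-file pairs are candidates
                if source_i == "A":
                    pairs.append((record_i, record_j))
                else:
                    pairs.append((record_j, record_i))
    return pairs
```

On the slide the sorting key is the first letter of the surname and the window is 450, larger than the 250 records sharing the most common first letter. A window wider than the largest group of tied keys means that, within each file, all records sharing a first letter sit within one window of each other however the ties are ordered.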

18 RELAIS in the Italian Tests
Results of the Overall Linkage Procedure (1st plus 2nd steps)
                            True Linkage Status
Linkage result        Matched   Not Matched     Total
Matched                 6 712            58     6 770
Not Matched               160
Total                   6 872
Match Rate: 98.5%
False Match Rate: 0.8%
False Non-Match Rate: 2.3%
Working Time: less than 2 hours

19 RELAIS in the Italian Tests
Rome PES Workflow (RELAIS 1.0)
Techniques shown for each phase:
- Search Space Reduction: Cross Product; Blocking; SNM
- Comparison Function: Edit Distance; Jaro-Winkler; Equality
- Decision Model: Probabilistic
- Linking Type: 1:1; Many:Many
Step 1: Blocking, Equality, Probabilistic, 1:1
Step 2: SNM, Equality, Probabilistic, 1:1

20 RELAIS in the Spanish Tests
A Scenario: the Data
Individual data from the Living Conditions Survey (LCS) and the Central Population Register (CPR)
1st Main Objective: obtain the ID number for LCS records
2nd Main Objective: compare the RELAIS results with ad-hoc procedures
Linkage was a very complex operation:
- only name and geographical variables were available
- large amount of data
Blocking on geographic area variables

21 RELAIS in the Spanish Tests
Weaknesses of RELAIS 1.0
- difficulties in managing a large number of blocks
- difficulties in dealing with different probability estimates in each block
- difficulties in writing the largest output files
Strengths of RELAIS 1.0
- efficacy of the implemented probabilistic method
- noticeable flexibility in modifying and adapting the implemented functionalities (reduction from M:N to 1:1)

22 Throughout RELAIS 2.0
- A relational database architecture, to optimize performance when managing huge amounts of data through the whole record linkage process (input, intermediate phases and output).
- Several distance functions for string and numerical comparisons (not only equality).
- Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.
- A data profiling phase to help the user in the critical step of choosing the best blocking or matching variables.
- One-shot execution to deal with a large number of blocks.
RELAIS 2.0 is now being tested and will be available from May 2009.

23 Concluding Remarks
- Profitable experience of cooperation between NSIs.
- The open-source philosophy and the move away from ad-hoc approaches proved to be winning choices.
- NSIs share common problems and needs in data integration projects.
New Challenge:
- Add methods for evaluating record linkage quality to RELAIS.

24 RELAIS: Availability and Contacts
RELAIS 1.0 is available on the website: www.istat.it
RELAIS 2.0 will be available in May 2009
RELAIS Contacts:
- Nicoletta Cibella, Statistician, e-mail: cibella@istat.it
- Tiziana Tuoto, Statistician, e-mail: tuoto@istat.it

