1 Open source software: a way to enrich local solutions N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy.


1 Open source software: a way to enrich local solutions
N. Cibella, M. Scannapieco, T. Tuoto
Italian National Statistical Office, Italy

2 Outline
- The record linkage problem and the RELAIS solution
- RELAIS, a shareable tool
- The main features of RELAIS
- International experiences in using RELAIS

3 The problem
Record linkage aims to accurately recognize the same real-world entity at the individual (micro) level, even when it is stored differently in sources of various types.
Examples of applications (in official statistics):
- data integration
- update and de-duplication of a source
- quality improvement of a data source
- measurement of population size by capture-recapture
- estimation of the risk of re-identification in public-use microdata
Also known as: Object Identification, Record Matching, …

4 Possible Solutions for Record Linkage
A very fragmented picture, not only in Istat.
Different approaches to deal with record linkage: Exact RL, Deterministic RL, Probabilistic RL (Fellegi and Sunter theory), Bayesian RL, Machine Learning, Knowledge Representation, …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…).
Several software packages and tools have been proposed, based on different approaches, free or commercial.
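As an illustration of the probabilistic approach named above, the following is a minimal sketch of the Fellegi and Sunter decision rule for binary comparison vectors. The m and u probabilities, the thresholds and the variable names are hypothetical values chosen for the example; they are not taken from RELAIS.

```python
import math


def fs_weight(gamma, m, u):
    """Composite log2 likelihood-ratio weight of a binary comparison vector.

    gamma[k] is 1 when the two records agree on variable k, else 0.
    m[k] = P(agree on k | pair is a match), u[k] = P(agree on k | non-match).
    """
    weight = 0.0
    for g, m_k, u_k in zip(gamma, m, u):
        if g:
            weight += math.log2(m_k / u_k)
        else:
            weight += math.log2((1 - m_k) / (1 - u_k))
    return weight


def fs_decide(weight, lower, upper):
    """Three-way decision: match, non-match, or possible match (clerical review)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical probabilities for three matching variables
# (e.g. surname, year of birth, municipality).
M_PROBS = [0.95, 0.90, 0.85]
U_PROBS = [0.01, 0.05, 0.10]

print(fs_decide(fs_weight((1, 1, 1), M_PROBS, U_PROBS), lower=0.0, upper=8.0))  # match
print(fs_decide(fs_weight((0, 0, 0), M_PROBS, U_PROBS), lower=0.0, upper=8.0))  # non-match
```

Pairs whose weight falls between the two thresholds are the ones usually sent to manual review.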

5 RELAIS, a brief history
- RELAIS (REcord Linkage At Istat) is a toolkit for record linkage (RL)
- Istat started developing RELAIS in 2006 and the system is now at its 2.1 release
  - the 2.2 release is going to be published

6 RELAIS, a brief history
- Istat working group, with several cooperation activities and training courses on probabilistic record linkage
- Enriched experience in data integration as coordinator of the ESSnet on Data Integration
- Common nature of the problems and needs of NSIs in data integration projects
- Profitable experiences in cooperation with NSIs, also in sharing the same software tools (NTTS 2009)

7 RELAIS: a Shareable Tool
A tool designed to be shared:
- It is a toolkit: new techniques can be added to the system, so solutions that are already available can be reused
- Open source implementation: Java and R as programming languages and MySQL as database management system

8 RELAIS: a Shareable Tool
Reuse of existing solutions:
- Most of the comparison functions are part of the Java package StringMetrics (http://www.dcs.shef.ac.uk/~sam/stringmetrics.html)
- The 1:1 reduction phase is implemented using the R package lpSolve (http://cran.r-project.org/web/packages/lpSolve/index.html)
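To give an idea of what a string comparison function does, here is a minimal sketch of a normalized edit-distance similarity written in plain Python. It does not reproduce the StringMetrics API; the function names and the 0.8 agreement threshold are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def agree(a: str, b: str, threshold: float = 0.8) -> bool:
    """Binary comparison outcome; the threshold is a hypothetical choice."""
    return similarity(a, b) >= threshold


print(levenshtein("kitten", "sitting"))      # 3
print(agree("SCANNAPIECO", "SCANAPIECO"))    # True
```

Thresholding the similarity turns a string distance into the agree/disagree outcome that a decision model can consume.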

9 RELAIS: a Shareable Tool
Sharing of the software: both the source code and the executables of RELAIS have been released on:
- Istat site: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
- OSOR site: http://forge.osor.eu/projects/relais/

10 RELAIS: a Shareable Tool
Licensing problem:
- RELAIS was the first system that Istat decided to release as an open source system, so no previous experience was available
- Analysis of the available licensing solutions
- Choice of the EUPL (European Union Public Licence)
  - Consistency with the copyright law in the 27 Member States of the European Union
  - Compatibility with popular open-source software licences (e.g. GPL)

11 The main ideas of RELAIS
- decompose a complex RL project into its constituent phases;
- dynamically choose the most appropriate technique for each phase, depending on application and data requirements, not only on the practitioner’s skill

12 Choose the most appropriate techniques

13 Build ad-hoc RL workflows
Phases and available techniques:
- Preprocessing: Normalization, UpperLowerCase
- Search Space Reduction: Blocking, SNM
- Comparison Function: Edit Distance, Jaro, Equality
- Decision Model: Probabilistic, Deterministic
[Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each built by picking techniques from the phases above.]
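The idea of assembling a workflow from interchangeable phase techniques can be sketched as follows. This is only an illustration in Python, not RELAIS code: the function names, the record layout and the sample data are all hypothetical, and the workflow shown (upper-case normalization, blocking, equality, deterministic decision) is just one possible combination.

```python
def normalize(record):
    """Preprocessing: upper-case string fields and collapse whitespace."""
    return {k: " ".join(v.upper().split()) if isinstance(v, str) else v
            for k, v in record.items()}


def blocking(file_a, file_b, key):
    """Search space reduction: only pair records that share the blocking key."""
    blocks = {}
    for rec_b in file_b:
        blocks.setdefault(rec_b[key], []).append(rec_b)
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[key], []):
            yield rec_a, rec_b


def equality(a, b):
    """Comparison function: 1 on exact agreement, 0 otherwise."""
    return 1 if a == b else 0


def deterministic(gamma):
    """Decision model: link only when every matching variable agrees."""
    return all(gamma)


def run_workflow(file_a, file_b, preprocess, reduce_space, block_key,
                 compare, variables, decide):
    """Chain the four phases; each argument can be swapped for another technique."""
    a = [preprocess(r) for r in file_a]
    b = [preprocess(r) for r in file_b]
    links = []
    for rec_a, rec_b in reduce_space(a, b, block_key):
        gamma = [compare(rec_a[v], rec_b[v]) for v in variables]
        if decide(gamma):
            links.append((rec_a["id"], rec_b["id"]))
    return links


# Hypothetical sample data.
FILE_A = [{"id": 1, "surname": "Rossi ", "city": "roma"},
          {"id": 2, "surname": "Bianchi", "city": "Milano"}]
FILE_B = [{"id": 10, "surname": "ROSSI", "city": "Roma"},
          {"id": 11, "surname": "Verdi", "city": "Milano"}]

links = run_workflow(FILE_A, FILE_B, normalize, blocking, "city",
                     equality, ["surname"], deterministic)
print(links)  # [(1, 10)]
```

Replacing `equality` with a string-distance comparison, or `deterministic` with a probabilistic rule, changes the workflow without touching the other phases.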

14 RELAIS 2.1 - May 2010
- Relational database support: input of data from Oracle or MySQL databases.
- New default input values for the parameter estimation of the probabilistic model and new definition of the candidate pairs for the optimal 1:1 reduction.
- More than one variable can be used for search space reduction by the sorted neighborhood method.
- Minor bugs have been fixed.
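A minimal sketch of the sorted neighborhood method with a sorting key built from more than one variable is given below. It illustrates the general technique in Python and is not the RELAIS implementation; the function name, the window size and the sample records are assumptions.

```python
def sorted_neighborhood(records_a, records_b, key_vars, window=3):
    """Return candidate (id_a, id_b) pairs found inside a sliding window.

    Both files are merged and sorted on a composite key made of key_vars;
    only records at most window - 1 positions apart are compared.
    """
    def sort_key(rec):
        return tuple(str(rec[v]).upper() for v in key_vars)

    tagged = [("A", r) for r in records_a] + [("B", r) for r in records_b]
    tagged.sort(key=lambda item: sort_key(item[1]))

    pairs = set()
    for i, (src_i, rec_i) in enumerate(tagged):
        for src_j, rec_j in tagged[i + 1:i + window]:
            if src_i == src_j:
                continue  # keep only cross-file pairs
            rec_a, rec_b = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
            pairs.add((rec_a["id"], rec_b["id"]))
    return pairs


# Hypothetical sample data.
FILE_A = [{"id": "a1", "surname": "Rossi", "year": 1970},
          {"id": "a2", "surname": "Verdi", "year": 1985}]
FILE_B = [{"id": "b1", "surname": "Rosso", "year": 1970},
          {"id": "b2", "surname": "Bianchi", "year": 1990}]

candidates = sorted_neighborhood(FILE_A, FILE_B, ["surname", "year"], window=2)
print(("a1", "b1") in candidates)  # True: ROSSI and ROSSO sort next to each other
```

Unlike blocking, records with slightly different key values can still be paired, as long as they end up close to each other in the sorted order.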

15 Main features of RELAIS 2.1
- Input files both in text format and from database (MySQL or Oracle) tables;
- Data profiling to guide the choice of matching and blocking variables;
- Creation of the search space of candidate pairs by means of the “cross product”, “blocking” and “sorted neighborhood” methods;
- Choice of matching variables;
- Set of comparison functions (with several string distances);
- Probabilistic record linkage: estimation of the Fellegi-Sunter (F-S) model parameters via the EM algorithm;
- Deterministic record linkage: both exact and rule-based;
- Reduction from an N:M to a 1:1 matching solution with optimal or greedy methods.
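The EM estimation step mentioned in the list can be sketched as follows for binary comparison vectors under the usual conditional independence assumption. This is a simplified Python illustration, not the RELAIS code: the starting values, the iteration count and the pattern counts are hypothetical.

```python
def em_fellegi_sunter(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate p (match proportion), m and u from comparison-pattern counts.

    patterns maps each binary comparison vector (a tuple) to the number of
    candidate pairs showing it. Agreement on each variable is assumed to be
    conditionally independent given the match status.
    """
    k = len(next(iter(patterns)))
    m, u = [m_init] * k, [u_init] * k
    total = sum(patterns.values())
    eps = 1e-6  # keeps probabilities away from 0 and 1

    def clamp(x):
        return min(max(x, eps), 1 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that a pair with pattern gamma is a match.
        resp = {}
        for gamma in patterns:
            pm, pu = p, 1 - p
            for g, m_j, u_j in zip(gamma, m, u):
                pm *= m_j if g else 1 - m_j
                pu *= u_j if g else 1 - u_j
            resp[gamma] = pm / (pm + pu)

        # M-step: re-estimate the parameters from the weighted counts.
        match_mass = max(sum(resp[g] * n for g, n in patterns.items()), eps)
        unmatch_mass = max(total - match_mass, eps)
        p = clamp(match_mass / total)
        for j in range(k):
            agree_m = sum(resp[g] * n for g, n in patterns.items() if g[j])
            agree_u = sum((1 - resp[g]) * n for g, n in patterns.items() if g[j])
            m[j] = clamp(agree_m / match_mass)
            u[j] = clamp(agree_u / unmatch_mass)
    return p, m, u


# Hypothetical counts of comparison patterns over three matching variables.
COUNTS = {(1, 1, 1): 90, (1, 1, 0): 6, (1, 0, 1): 5, (0, 1, 1): 4,
          (1, 0, 0): 40, (0, 1, 0): 60, (0, 0, 1): 80, (0, 0, 0): 715}

p_hat, m_hat, u_hat = em_fellegi_sunter(COUNTS)
print(p_hat, m_hat, u_hat)
```

The estimated m and u values can then be plugged into the match weights of the probabilistic decision rule.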

16 A glance at RELAIS 2.1

17 RELAIS 2.2 in June 2011
- Explicit application for de-duplication
- Nested blocking methods
- Probabilities can be set by the user
- Improvement of GUI functionalities for output management and user interaction (manual review)
- Summary output on linkage results
- Batch execution
- Interfaces for clerical review

18 RELAIS and extra-Istat interaction
Spontaneous collaboration among NSIs (Spain, UK, Tunisia, Brazil) was favoured by the open source philosophy adopted in RELAIS, but even in a statistical system with shared goals and regulations (the ESS), different constraints (e.g. language features) may be present and could affect the outcome of the same linkage.

19 RELAIS and extra-Istat interaction
The collaboration among NSIs helped in:
- assessing the capabilities of the various functionalities included in the RELAIS toolkit, e.g. the use of the EM algorithm for record linkage purposes;
- comparing the results achieved by the software with those obtained through alternative ad hoc techniques;
- testing the performance of the methods implemented in RELAIS.

20 RELAIS and extra-Istat interaction
Istat, coordinator of the DI (Data Integration) ESSnet project, conducted on-the-job training on record linkage methods in the UK in January 2011. The training had these crucial aspects:
- the combination of the theoretical concepts of record linkage with the solutions proposed in RELAIS;
- the test of the RELAIS toolkit, during the computer session, on the specific record linkage problem faced by ONS on their own data;
- a very interactive way of conducting the lessons by the trainers.

21 Next challenges
- Censuses and post-censal surveys (Population and Agriculture): integration of population registers with auxiliary ones to focus on population register under-coverage, de-duplication also due to multi-channel answers, Post Enumeration Survey.
- Longitudinal study of regular foreign residents
- Integration of ICT enterprises

22 Future research projects
- Preprocessing (character conversions, schema reconciliation, standardization, etc.);
- Modification of the probabilistic approach:
  - Non-binary comparison vector
  - Allowing interactions between matching variables
  - Bayesian approach
- Graphical analysis of the model fitting

23 Thanks and Invitation to Cooperation
RELAIS Contacts:
Computer Scientists:
- Monica Scannapieco, E-mail: scannapi@istat.it
- Laura Tosco, E-mail: tosco@istat.it
- Luca Valentino, E-mail: luvalent@istat.it
Statisticians:
- Nicoletta Cibella, E-mail: cibella@istat.it
- Tiziana Tuoto, E-mail: tuoto@istat.it
http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
http://www.osor.eu/projects/relais

