1 Open source software: a way to enrich local solutions N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy.

Presentation transcript:

1 Open source software: a way to enrich local solutions N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy

2 Outline
- The record linkage problem and the RELAIS solution
- RELAIS, a shareable tool
- The main features of RELAIS
- International experiences in using RELAIS

3 The problem
Record linkage aims to accurately recognize the same real-world entity at the individual (micro) level, even when it is stored differently in sources of various types.
Examples of applications in official statistics:
- data integration
- update and de-duplication of a source
- quality improvement of a data source
- measurement of population size by capture-recapture
- estimation of the risk of re-identification in public-use microdata
Also known as: Object Identification, Record Matching, …
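
As a worked illustration of the capture-recapture application: when two lists cover the same population, the number of linked records m feeds the classical Lincoln-Petersen estimator, N = n1 * n2 / m. The sketch below is not part of RELAIS; the function name and all counts are hypothetical.

```python
def lincoln_petersen(n1: int, n2: int, matched: int) -> float:
    """Estimate population size from two list sizes and the number of linked records."""
    if matched <= 0:
        raise ValueError("at least one linked pair is needed")
    return n1 * n2 / matched


# Hypothetical counts: 900 records in the first list, 800 in the second,
# 720 pairs linked between them.
estimate = lincoln_petersen(900, 800, 720)

# If the linkage misses 20 true matches, the estimate is inflated.
estimate_with_missed_links = lincoln_petersen(900, 800, 700)
```

With these counts the first estimate is 1000.0. Missed links lower m and push the estimate upwards, which is one reason linkage accuracy matters for this application.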

4 Possible Solutions for Record Linkage
A very fragmented picture, not only in Istat. Different approaches to deal with record linkage:
- Exact RL
- Deterministic RL
- Probabilistic RL (Fellegi and Sunter theory)
- Bayesian RL
- Machine Learning
- Knowledge Representation
- …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…). Several software packages and tools have been proposed, based on different approaches, free or commercial.

5 RELAIS, a brief history
RELAIS (REcord Linkage At Istat) is a toolkit for record linkage (RL). Istat started developing RELAIS in 2006 and the system is now at its 2.1 release; the 2.2 release is about to be published.

6 RELAIS, a brief history
– Istat working group, with several cooperation activities and training courses on probabilistic record linkage
– Enriched experience in Data Integration as coordinator of the ESSnet
Common nature of the problems and needs of NSIs in data integration projects.
Profitable experiences in cooperation with NSIs, also in sharing the same software tools (NTTS 2009).

7 RELAIS: a Shareable Tool
A tool designed to be shared:
- It is a toolkit: new techniques can be added to the system, and solutions that are already available can be reused
- Open source implementation: Java and R as programming languages and MySQL as database management system

8 RELAIS: a Shareable Tool
Reuse of existing solutions:
- Most of the comparison functions are part of the Java package StringMetrics –( )
- The 1:1 reduction phase is implemented by making use of the R package lpSolve –( project.org/web/packages/lpSolve/index.html)
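
The 1:1 reduction can be read as an assignment problem: pick at most one partner per record so that the total match weight is maximal. RELAIS delegates this step to lpSolve. The sketch below is only an illustration in Python, not the RELAIS code: it finds the optimum by brute-force enumeration (feasible for tiny inputs only) and contrasts it with a simple greedy pass. Function names and weights are hypothetical.

```python
from itertools import permutations


def optimal_one_to_one(weights):
    """Return the 1:1 assignment with maximal total weight (brute force).

    weights[i][j] is the match weight between record i of file A and
    record j of file B. Assumes len(A) <= len(B).
    """
    n_a, n_b = len(weights), len(weights[0])
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_b), n_a):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_pairs, best_total


def greedy_one_to_one(weights):
    """Repeatedly accept the heaviest pair whose two records are still unused."""
    candidates = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row)),
        reverse=True,
    )
    used_a, used_b, pairs, total = set(), set(), [], 0.0
    for w, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
            total += w
    return sorted(pairs), total


# Hypothetical weights: the heaviest single pair (0, 0) is not in the best solution.
weights = [[0.9, 0.8],
           [0.8, 0.1]]
optimal_pairs, optimal_total = optimal_one_to_one(weights)
greedy_pairs, greedy_total = greedy_one_to_one(weights)
```

Here the optimal solution links (0, 1) and (1, 0), while the greedy pass locks in (0, 0) first and is left with the weak pair (1, 1). A linear-programming solver reaches the optimum without enumerating every assignment.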

9 RELAIS: a Shareable Tool
Sharing of the software: both source code and executables of RELAIS have been released on:
–Istat site: analisi_dati/relais /
–OSOR site:

10 RELAIS: a Shareable Tool
Licensing problem:
- RELAIS was the first system that Istat decided to release as an open system, so no previous experience was available
- Analysis of available licensing solutions
- Choice of the EUPL (European Union Public Licence)
  – Consistency with the copyright law in the 27 Member States of the European Union
  – Compatibility with popular open-source software licences (e.g. GPL)

11 The main ideas of RELAIS
- decompose the complex RL project into its constituent phases;
- dynamically choose the most appropriate technique for each phase, depending on application and data requirements, not only on the practitioner’s skill

12 Choose the most appropriate techniques

13 Build ad-hoc RL workflows
[Figure: techniques available for each phase, combined into application-specific workflows]
- Preprocessing: Normalization, UpperLowerCase
- Search Space Reduction: Blocking, SNM
- Comparison Function: Edit Distance, Jaro, Equality
- Decision Model: Probabilistic, Deterministic
Two example workflows are shown, RecLink WF Appl1 and RecLink WF Appl2, each selecting a subset of these techniques.
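
A minimal sketch of one such workflow, chaining the four phases: normalization, blocking, an edit-distance comparison and a deterministic decision rule. This is an illustration in plain Python, not RELAIS code. The records, field names and the decision rule are hypothetical.

```python
from collections import defaultdict


def normalize(record, fields=("surname", "city")):
    """Preprocessing: trim whitespace and upper-case the chosen text fields."""
    out = dict(record)
    for f in fields:
        out[f] = out[f].strip().upper()
    return out


def block(file_a, file_b, key):
    """Search space reduction: pair only records that share the blocking key."""
    index = defaultdict(list)
    for b in file_b:
        index[b[key]].append(b)
    return [(a, b) for a in file_a for b in index.get(a[key], [])]


def edit_distance(s, t):
    """Comparison function: Levenshtein distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def decide(a, b):
    """Decision model (deterministic): equal birth year, surname within one edit."""
    return a["year"] == b["year"] and edit_distance(a["surname"], b["surname"]) <= 1


file_a = [{"id": "a1", "surname": " Rossi", "year": 1980, "city": "rome"},
          {"id": "a2", "surname": "Bianchi", "year": 1975, "city": "Milan"}]
file_b = [{"id": "b1", "surname": "ROSSI", "year": 1980, "city": "Rome"},
          {"id": "b2", "surname": "Bianci", "year": 1975, "city": "MILAN"},
          {"id": "b3", "surname": "Verdi", "year": 1975, "city": "Turin"}]

a_norm = [normalize(r) for r in file_a]
b_norm = [normalize(r) for r in file_b]
candidates = block(a_norm, b_norm, key="city")
links = [(a["id"], b["id"]) for a, b in candidates if decide(a, b)]
```

Blocking on city cuts the six pairs of the cross product down to two candidates, and both pass the rule. Swapping any single function (for example, a different comparison function or a probabilistic decision model) leaves the other phases untouched, which is the point of the phase decomposition.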

14 RELAIS May 2010
- Relational database support: input of data from an Oracle or MySQL database.
- New default input values for the parameter estimation of the probabilistic model and new definition of the candidate pairs for the optimal 1:1 reduction.
- More than one variable for search space reduction by the sorted neighborhood method.
- Minor bugs have been fixed.
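
The sorted neighborhood method with a multi-variable sorting key can be sketched as follows: the two files are pooled, sorted on the key, and only records that fall within a sliding window of each other become candidate pairs. This is a generic illustration, not the RELAIS implementation. The function name, field names and window size are assumptions.

```python
def sorted_neighborhood(file_a, file_b, keys, window):
    """Return candidate (a_id, b_id) pairs that fall within a sliding window.

    keys: field names combined, in order, into the sorting key.
    window: number of consecutive sorted records compared with each other.
    """
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda t: tuple(t[1][k] for k in keys))
    pairs = set()
    for i, (src_i, rec_i) in enumerate(tagged):
        for src_j, rec_j in tagged[i + 1 : i + window]:
            if src_i != src_j:  # only cross-file pairs are candidates
                a, b = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.add((a["id"], b["id"]))
    return pairs


# Hypothetical, already normalized records.
file_a = [{"id": "a1", "surname": "ROSSI", "year": 1980},
          {"id": "a2", "surname": "BIANCHI", "year": 1975}]
file_b = [{"id": "b1", "surname": "ROSSI", "year": 1980},
          {"id": "b2", "surname": "BIANCI", "year": 1975},
          {"id": "b3", "surname": "VERDI", "year": 1975}]

pairs = sorted_neighborhood(file_a, file_b, keys=("surname", "year"), window=2)
```

Unlike blocking, the method tolerates small errors in the key: BIANCHI and BIANCI sort next to each other and are still compared, although they would land in different blocks under an exact surname key.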

15 Main features of RELAIS 2.1
- Input files both in text format and from database (MySQL or Oracle) tables;
- Data profiling to guide the choice of matching and blocking variables;
- Creation of the search space of candidate pairs by means of the “cross product”, “blocking” and “sorted neighborhood” methods;
- Choice of matching variables;
- Set of comparison functions (with several string distances);
- Probabilistic record linkage: estimation of the F-S model parameters via the EM algorithm;
- Deterministic record linkage: both exact and rule based;
- Reduction from N:M to 1:1 matching solution with optimal or greedy methods.
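
To make the EM step concrete, here is a minimal sketch of EM estimation for the Fellegi-Sunter model, assuming binary (agree/disagree) comparison vectors and conditional independence between matching variables. It estimates m (probability of agreement among matches), u (probability of agreement among non-matches) and p (proportion of matches). It is not the RELAIS code. The function name, starting values and pattern counts are hypothetical.

```python
from math import prod


def em_fellegi_sunter(patterns, m, u, p, iterations=100):
    """Estimate m, u and p by EM.

    patterns maps each binary agreement vector (tuple of 0/1) to the number
    of candidate pairs showing it. m, u and p are starting values.
    """
    n = sum(patterns.values())
    k = len(m)
    clip = lambda x: min(max(x, 1e-6), 1 - 1e-6)  # keep probabilities off 0 and 1
    for _ in range(iterations):
        # E-step: posterior probability that a pair with pattern gamma is a match.
        g = {}
        for gamma in patterns:
            pm = p * prod(m[j] if gamma[j] else 1 - m[j] for j in range(k))
            pu = (1 - p) * prod(u[j] if gamma[j] else 1 - u[j] for j in range(k))
            g[gamma] = pm / (pm + pu)
        # M-step: re-estimate the parameters from the expected match counts.
        match_mass = sum(g[gm] * c for gm, c in patterns.items())
        m = [clip(sum(g[gm] * c * gm[j] for gm, c in patterns.items()) / match_mass)
             for j in range(k)]
        u = [clip(sum((1 - g[gm]) * c * gm[j] for gm, c in patterns.items())
                  / (n - match_mass))
             for j in range(k)]
        p = clip(match_mass / n)
    return m, u, p


# Hypothetical counts of agreement patterns on three matching variables.
patterns = {
    (1, 1, 1): 74,
    (1, 1, 0): 16, (1, 0, 1): 16, (0, 1, 1): 16,
    (1, 0, 0): 74, (0, 1, 0): 74, (0, 0, 1): 74,
    (0, 0, 0): 656,
}
m, u, p = em_fellegi_sunter(patterns, m=[0.8] * 3, u=[0.2] * 3, p=0.2)
```

The estimates give each pattern a match weight, the ratio of its probability under m to its probability under u, which the decision model then compares against thresholds. Working on pattern counts rather than individual pairs keeps each iteration cheap even when the search space is large.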

16 A glance at RELAIS 2.1

17 RELAIS 2.2 in June 2011
- Explicit application for de-duplication
- Nested blocking methods
- Probabilities set by the users
- Improvement of GUI functionalities for output management and user interactions (manual review)
- Summary output on linkage results
- Batch execution
- Interfaces for clerical review

18 RELAIS and extra-Istat interaction
Spontaneous collaboration among NSIs (Spain, UK, Tunisia, Brazil) was favoured by the open source philosophy adopted in RELAIS. But even in a statistical system with shared goals and regulations (the ESS), different constraints (e.g. language features) may be present and could affect the outcome of the same linkage.

19 RELAIS and extra-Istat interaction
The collaboration among NSIs helped in:
- assessing the capabilities of the various functionalities included in the RELAIS toolkit, e.g. the use of the EM algorithm for record linkage purposes;
- comparing the results achieved by the software with those obtained through alternative ad hoc techniques;
- testing the performance of the methods implemented in RELAIS.

20 RELAIS and extra-Istat interaction
Istat, coordinator of the DI (Data Integration) ESSnet project, conducted on-the-job training on record linkage methods in the UK in January 2011. The training had these crucial aspects:
- the combination of the theoretical concepts of record linkage with the solutions proposed in RELAIS;
- the test of the RELAIS toolkit, during the computer session, on the specific record linkage problem faced by ONS on their own data;
- a very interactive way of conducting the lessons by the trainers.

21 Next challenges
- Censuses and post-censual surveys (Population and Agriculture): integration of population registers and auxiliary ones to focus on population register under-coverage, de-duplication also due to multi-channel answers, Post Enumeration Survey
- Longitudinal study of regular foreign people
- Integration of ICT enterprises

22 Future research projects
- Preprocessing (character conversions, schema reconciliation, standardization, etc.);
- Modification of the probabilistic approach:
  – Non-binary comparison vector
  – Allowing interactions between matching variables
  – Bayesian approach
- Graphical analysis of the model fitting

23 Thanks and Invitation to Cooperation
RELAIS Contacts:
- Computer Scientists: Monica Scannapieco, Laura Tosco, Luca Valentino
- Statisticians: Nicoletta Cibella, Tiziana Tuoto