Proposals for linking Big Data and statistical registers

Presentation on theme: "Proposals for linking Big Data and statistical registers"— Presentation transcript:

1 Proposals for linking Big Data and statistical registers
Daniela Fusco*, Tiziana Tuoto*, Antony Rizzi**
*Istat, Italian National Institute of Statistics
**Consiglio di Stato

2 Summary
- Introduction to the statistical use of Big Data
- The proposed record linkage methods
- A case study
- First results
- Concluding remarks
Proposal for linking Big Data and statistical registers, Daniela Fusco – Bruxelles, 14 March 2017

3 Introduction
Big Data: Volume, Velocity, Variety
Statistical Registers: Volume, Velocity, Variety
Possible solution: cost reduction, enlarged contents, timeliness

4 Big Data and statistical registers
[Figure with panels labelled BIG DATA and STATISTICAL REGISTERS]

5 Record linkage as we know it
- Record linkage is a classification problem.
- The aim is to recognise the same units located in different sources, even when they are represented in non-homogeneous ways.
- Statistical methods for record linkage (probabilistic RL) follow the classical approach of Fellegi and Sunter (1969) and are now well established (Herzog, Scheuren and Winkler 2007).
- Software and tools for linkage problems: FEBRL, RELAIS (Record Linkage At Istat).
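The classification idea behind the Fellegi and Sunter approach can be sketched in a few lines. This is only an illustration under stated assumptions: the m- and u-probabilities, the field names and the two thresholds are hypothetical values chosen for the example, not figures from the presentation, and conditional independence between fields is assumed.

```python
import math

# Hypothetical parameters (not from the slides):
# m = P(field agrees | pair is a match), u = P(field agrees | pair is a non-match)
M_PROB = {"name": 0.95, "address": 0.90, "postal_code": 0.98}
U_PROB = {"name": 0.01, "address": 0.05, "postal_code": 0.10}


def match_weight(agreement):
    """Sum the log2 likelihood ratios over fields, assuming conditional independence."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROB[field], U_PROB[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision: match, non-match, or possible match (clerical review)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


pair = {"name": True, "address": True, "postal_code": False}
print(classify(match_weight(pair)))
```

Pairs whose weight falls between the two thresholds are the ones that would be sent to clerical review.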

6 Record Linkage as we know it: the phases
An ONS report (Gill et al., 2001) describes three phases:
1) Pre-elaborations
- Select matching and blocking variables
- Edit and parse the variables
- Block and sort the files
2) Record linkage
- Select the method
- Select the model/rules
- Evaluate the model
- Set the thresholds
- Select the matching output
3) Analysis
- Check by clerical review
- Evaluate linkage errors
Pre-processing techniques: upper/lower case conversion, treatment of null strings, standardization, parsing, coding, construction of derived variables
Blocking techniques: blocking, sorted neighbourhood, Simhash, canopy clustering, hierarchical grouping
Methods: Fellegi & Sunter, Bayesian, deterministic
Matching output: 1 to 1, 1 to many, many to many
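The pre-processing and blocking steps can be illustrated with a minimal sketch. The function names, the record layout and the choice of postal code as blocking key are assumptions made for the example; the slides do not specify how these steps were implemented.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Upper-case, strip accents and punctuation, collapse spaces; null strings become ''."""
    if text is None:
        return ""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text).upper()
    return " ".join(text.split())


def block_by(records, key):
    """Group records by the value of a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def candidate_pairs(register, scraped, key="postal_code"):
    """Yield only the pairs that share the same blocking key value."""
    reg_blocks = block_by(register, key)
    for rec in scraped:
        for reg in reg_blocks.get(rec[key], []):
            yield reg, rec


print(normalize("  Agriturismo  Il Podére, s.r.l. "))
```

Blocking keeps the number of compared pairs far below the full cross product of the two files, at the price of missing matches whose blocking key is wrong or missing.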

7 Georeferenced approach
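One possible way to use coordinates as matching variables is to compare great-circle distances between a scraped unit and the register units. The sketch below is an assumption for illustration only: the haversine formula, the 0.5 km threshold and the record layout (`lat`, `lon`, `id`) are not taken from the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_within(point, candidates, max_km=0.5):
    """Return (candidate, distance) for the closest candidate within max_km, else None."""
    best = None
    for cand in candidates:
        d = haversine_km(point["lat"], point["lon"], cand["lat"], cand["lon"])
        if d <= max_km and (best is None or d < best[1]):
            best = (cand, d)
    return best
```

In practice the threshold would depend on the accuracy of the geocoding in both sources.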

8 Combined methods approach: the case study
Aim: use Big Data to update the Farm Register, so that statistics on the activities and services offered by agritourism farms can be produced and periodically disseminated at minimum cost.
Specifically, at the end of the integration process, it will be possible to:
• Validate the addresses in the SFR and identify them if they are missing;
• Estimate the variables available on the net (e-commerce, price, etc.) to add other information to the SFR;
• Check and integrate information in the SFR (telephone number, web site, etc.).
Sources: Italian Farm Register; Internet-scraped data
Target: “hub”, a website hosting and describing a plurality of units
Size: 13,000 units in the FR; 7,000 units scraped from 3 hubs

9 Farm Register
Review: something about the Farm Register
Administrative sources:
- Integrated Administration and Control System (IACS)
- Animal register
- Tax declaration on agricultural land
- Land cadastre
- Chambers of Commerce
- Value Added Tax on agricultural income
Statistical sources:
- Business Register
- Agricultural Census
- Survey on rural tourism accommodations
- Survey on quality products

10 Number of AFs by main sectoral hubs
Table 1. Number of AFs by main URL of the hub web site (the URLs are missing): 3,520; 2,292; 7,575; 4,389; 2,636; 1,514; 618
Table 2. Variables scraped from the Internet, by topic (columns: Topic, Variable, Updating, Additional information; only the X mark for Address survives)
- Farm localization and contacts: Address (X), Telephone number, Web site, Geo-localization
- Structural information: Number of rooms, Prices, Number of restaurant seats
- Direct sales: Product typology, E-commerce
The SFR frame for agritourism covers about 13,000 units out of the 20,000 recorded by the Agricultural Ministry (year 2013). We propose integrating Internet-scraped data on agritourism farms (AFs) with the data in the Farm Register built by Istat. The initial and most important targets of web scraping are the websites acting as “hubs” (hosting and describing a plurality of agritourism farms), generally maintained by private companies or business associations. Some AFs may of course appear in more than one source. Table 1 reports the number of possible AFs extracted from the main sectoral websites. Information from the websites is often hard to combine because of errors or missing information in the record identifiers. The final aim of this integration is to use Internet information for statistical purposes, in particular to update and integrate some of the data collected in the SFR for the AFs. Table 2 lists some variables available on the Internet that are useful for this purpose.

11 Combined methods approach: the case study
Linkage models
Linking variables: denomination, address, longitude, latitude, postal code
Comparison functions: Simhash, Jaro, 3-grams, 3-grams weighted by frequency
Linkage model: EM binomial and multinomial (5 and 8 classes)
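Two of the comparison functions named above can be sketched as follows. This is a minimal illustration, not the implementation used in the case study: the 3-gram similarity is assumed here to be the Jaccard overlap of unpadded character trigram sets (other definitions exist), the Jaro function follows the standard textbook definition, and Simhash and the frequency-weighted variant are left out.

```python
def trigrams(s):
    """Set of character 3-grams; strings shorter than 3 characters yield themselves."""
    if len(s) < 3:
        return {s} if s else set()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets (0.0 when both strings are empty)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3
```

Both functions return a score between 0 and 1, which is what the quantile-based classes of the multinomial model are built from.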

12 Combined methods approach: the case study
Multinomial EM algorithm
- Traditionally, the EM algorithm is applied to maximise the likelihood with two categories, agree/disagree, for each matching variable.
- Here we define k categories (k = 5, 8), where each category is a class based on an interval of the string comparator, in this case its quantiles.
- The EM algorithm under the multinomial distribution is used to estimate the match parameters for each variable q in class k.
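The idea can be sketched as a two-class latent mixture in which each matching variable takes one of k similarity classes. This is a simplified illustration under assumptions that are not in the slides: conditional independence between variables, quantile cut points computed with the standard library, a fixed number of iterations, a small smoothing constant, and a starting point that favours high classes for matches.

```python
import bisect
import statistics


def to_classes(scores, k):
    """Map similarity scores to classes 0..k-1 using their empirical quantiles."""
    cuts = statistics.quantiles(scores, n=k)  # k - 1 cut points
    return [bisect.bisect_right(cuts, s) for s in scores]


def multinomial_em(patterns, k, n_iter=100, p_init=0.1, eps=1e-6):
    """patterns: one tuple per pair, holding the class index of each variable.

    Returns (p, m, u, g): match proportion, class probabilities given match and
    non-match for each variable, and the posterior match probability of each pair.
    """
    n_vars = len(patterns[0])
    norm = k * (k + 1) / 2
    # Starting values: matches favour high classes, non-matches favour low ones.
    m = [[(c + 1) / norm for c in range(k)] for _ in range(n_vars)]
    u = [[(k - c) / norm for c in range(k)] for _ in range(n_vars)]
    p = p_init
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for pat in patterns:
            like_m, like_u = p, 1.0 - p
            for q, c in enumerate(pat):
                like_m *= m[q][c]
                like_u *= u[q][c]
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate the parameters from the weighted class counts.
        sum_g = sum(g)
        sum_ng = len(g) - sum_g
        p = sum_g / len(g)
        for q in range(n_vars):
            for c in range(k):
                gm = sum(gi for gi, pat in zip(g, patterns) if pat[q] == c)
                gu = sum(1.0 - gi for gi, pat in zip(g, patterns) if pat[q] == c)
                m[q][c] = (gm + eps) / (sum_g + k * eps)
                u[q][c] = (gu + eps) / (sum_ng + k * eps)
    return p, m, u, g
```

With k = 2 this reduces to the usual agree/disagree formulation; larger k lets the model distinguish near-agreement from clear disagreement.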

13 The result of the combined approach
[Figure panels: Denomination, Address]

14 Comparison Evaluation
Denomination
Comparison function   Precision   Recall
Simhash               30%         47%
3grams                55%         87%
Jaro                  26%         41%
Address (row labels missing and one value missing; values in source order): 36% 57%; 43% 67%; 74%
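Precision and recall of a linkage run can be computed from the declared links and a set of true links. The sketch below uses the usual definitions; the pair identifiers are made up for illustration, and the slides do not say how the reference set of true links was obtained.

```python
def precision_recall(declared_links, true_links):
    """Precision = correct declared links / declared links; recall = correct declared links / true links."""
    declared, truth = set(declared_links), set(true_links)
    true_positives = len(declared & truth)
    precision = true_positives / len(declared) if declared else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical (register id, scraped id) pairs
declared = [(1, "a"), (2, "b"), (3, "c"), (4, "x")]
truth = [(1, "a"), (2, "b"), (5, "e")]
print(precision_recall(declared, truth))
```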

15 The result of the georeferenced approach in Emilia Romagna Region

16 Lesson learnt
The first evidence highlights the role played by the pre-processing phase and the data cleaning/reconciliation activity. It is well known in official statistics that preparing the input files is the first phase and takes about 75% of the whole effort needed to implement a record linkage procedure. In this case the pre-processing step was particularly large and expensive, requiring almost 95% of the whole time. Ignoring this task may compromise the effectiveness of the following steps.

17 Conclusions
The agricultural field is the most challenging area for evaluating the performance of new linkage methodologies, because of the well-known difficulties in recognising the statistical units of this field and in handling rural addresses. Dealing with new sources of data requires new methodologies for linking data. However, due attention should be devoted to evaluating output quality, in order to better understand the benefits and risks of the integration and to allow analysts to take potential integration errors into account in subsequent analyses. In this paper, we experiment with and compare the use of GPS coordinates as matching variables and for spatial linkage. Moreover, we introduce some machine learning algorithms to test their effectiveness in dealing with unstructured data and their advantages over traditional standardization and parsing of linkage variables. In addition, these features are compared with some innovations in the traditional approach to probabilistic record linkage.

18 Next steps
- We will explore new solutions
- We will assess the validation of the linkage results and the measurement of output quality

19 Thank you for your kind attention

