
1 Use of web scraping and text mining techniques in the Istat survey on “Information and Communication Technology in enterprises” Giulio Barcaroli(*), Alessandra Nurra(*), Marco Scarnò(**), Donato Summa(*) (*) Italian National Institute of Statistics (Istat) (**) Cineca Quality 2014 Wien, June 2-5 2014

2 The “ICT in enterprises” survey
- In Italy, the survey covers a population of 211,851 enterprises with at least 10 employees, by means of a sample survey involving 19,186 of them (2011).
- In the 2013 round of the survey, 8,687 enterprises indicated their website (45% of the responding sample units).
- Accessing the indicated websites to gather information directly from them opens up several opportunities.

3 The “ICT in enterprises” survey
Action / Target:
1. Action: replace the traditional questionnaire-based collection technique with a new one based on the Internet as Data Source (IaD), for all suitable questions. Target: reduction of respondent burden.
2. Action: integrate the information collected via questionnaire with the information collected via IaD. Target: increase of the accuracy of estimates.
3. Action: collect additional information. Target: increase the offer of statistical information.

4 The “ICT in enterprises” survey

6 Predictive approach vs Content Analysis
We assume that our target is to increase the accuracy of estimates by using data from the Internet as auxiliary data. In this particular case, the auxiliary data are textual. Texts are a “perfect” example of unstructured data, which is one of the characteristics of most Big Data.
First, the usual model-based approach will be followed, which requires predicting values at unit level: under this approach, the target is to maximise the correctness of classification for each unit in the reference population.
Next, a different approach will be illustrated, in which the prediction of values at unit level is no longer required and the target becomes to directly maximise accuracy at the aggregate level (accuracy of the estimates).

7 Predictive approach
In a predictive approach, the subset of data related to sampled respondent units can be considered as the labeled data, and supervised learning methods can be applied. In other words, the subset of 8,687 enterprises that indicated that they have a website or a home page, and that also responded to questions [B8a : B8g], can be used as the training and test set on which different models are estimated, in order to predict the answers to questions [B8a : B8g] for the whole reference population.
[Diagram labels: Texts (website content); Survey microdata; Text and data mining; Model]

8 Predictive approach
In our case, we can apply any of the following supervised learning methods:
- Classification Trees;
- “ensembles” (Bootstrap Aggregating, Adaptive Boosting, Random Forests);
- Supervised Latent Dirichlet Allocation for classification (SLDA);
- Neural Networks;
- Logistic Regression;
- Support Vector Machines;
- Naïve Bayes.
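As an illustration of how one of these learners works on text, here is a minimal sketch of a multinomial Naïve Bayes classifier written from scratch. It is not the implementation used in the survey: the whitespace tokenisation, the add-one smoothing, the function names and the four toy "website texts" with their Yes/No labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()  # assumption: naive whitespace tokenisation
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, not survey data)
texts = [
    "add to cart checkout order online",
    "book your room online reservation",
    "company history and mission",
    "contact us office address",
]
labels = ["yes", "yes", "no", "no"]

model = train_naive_bayes(texts, labels)
print(predict(model, "order online and checkout"))
```

In a real application the training texts would be the scraped website contents and the labels the questionnaire answers of the same enterprises.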

9 Evaluation of predictive models
From the error matrix it is possible to compute the following indicators (Indicator: Expression; Meaning):
- Accuracy (precision): (TP + TN) / Total; rate of correctly classified cases
- Sensitivity (true positives rate): TP / (TP + FN); rate of positive cases correctly classified
- Specificity (true negatives rate): TN / (FP + TN); rate of negative cases correctly classified
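The three indicators can be computed directly from the four cells of the error matrix. The sketch below shows this; the function name and the counts are hypothetical and do not come from the survey.

```python
def indicators(tp, fn, fp, tn):
    """Compute accuracy, sensitivity and specificity from error-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # correctly classified cases
        "sensitivity": tp / (tp + fn),     # positives correctly classified
        "specificity": tn / (fp + tn),     # negatives correctly classified
    }


# Hypothetical counts for illustration only
result = indicators(tp=30, fn=10, fp=20, tn=140)
print(result)
```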

10 Evaluation of predictive models
Application of different learners to predict question B8a “Online ordering or reservation or booking (Yes/No)”

11 Evaluation of predictive models
In general, when the two kinds of misclassification (false positives and false negatives) are not balanced in absolute terms, the distribution of predicted values can differ significantly from the distribution of observed cases.
From these results, the Naïve Bayes predictor can be considered the most convenient: although its accuracy (78%) is the lowest, its sensitivity is the highest, its specificity is good, and the observed and predicted proportions are perfectly aligned.
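The link between balanced misclassifications and aligned proportions can be checked with a few lines of code: the observed proportion of "Yes" is (TP + FN) / Total, the predicted one is (TP + FP) / Total, so the two coincide exactly when FP = FN. The sketch below uses two hypothetical error matrices (invented counts) to show that the more accurate classifier is not necessarily the one that reproduces the observed proportion.

```python
def proportions(tp, fn, fp, tn):
    """Return (observed, predicted) proportion of positive cases."""
    total = tp + fn + fp + tn
    observed = (tp + fn) / total    # units that are truly positive
    predicted = (tp + fp) / total   # units classified as positive
    return observed, predicted


def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical learner A: errors are balanced (FP == FN)
balanced = dict(tp=25, fn=20, fp=20, tn=135)
# Hypothetical learner B: fewer errors overall, but unbalanced (FP << FN)
unbalanced = dict(tp=20, fn=25, fp=5, tn=150)

for name, cells in (("balanced", balanced), ("unbalanced", unbalanced)):
    obs, pred = proportions(**cells)
    print(f"{name}: accuracy={accuracy(**cells):.3f} observed={obs:.3f} predicted={pred:.3f}")
```

Here learner B has the higher accuracy, yet it underestimates the proportion of positive cases, while learner A reproduces it exactly.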

12 Evaluation of predictive models
Application of Naïve Bayes to predict all questions in section B8

13 Content analysis

14 Content analysis performance …
To verify the robustness of the Content Analysis, we repeated 40 times the selection of a training set from the survey data (each time producing an estimate of the proportion of web sales functionality), for different ratios of training set size to the total (from 10% to 90%).
The results show correctness of the method until a training rate of 30%, but a high variability of the estimates at every rate.
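The structure of this exercise can be sketched as a small resampling harness: for each training rate, draw a training set 40 times, compute an estimate each time, then summarise bias and variability. The code below is only an illustration of that loop. The labels are synthetic, and the estimator is a deliberately trivial stand-in (the proportion observed in the training set), not the Content Analysis method itself; all names are assumptions.

```python
import random
import statistics


def repeated_estimates(labels, training_rate, estimator, iterations=40, seed=0):
    """Draw `iterations` random training sets and collect one estimate per draw."""
    rng = random.Random(seed)
    n_train = max(1, round(training_rate * len(labels)))
    estimates = []
    for _ in range(iterations):
        training = rng.sample(labels, n_train)  # sampling without replacement
        estimates.append(estimator(training))
    return estimates


def training_proportion(training_labels):
    """Stand-in estimator: share of positive labels in the training set."""
    return sum(training_labels) / len(training_labels)


# Synthetic labels (hypothetical): 1 = web sales functionality, 0 = none
labels = [1] * 200 + [0] * 800
true_proportion = sum(labels) / len(labels)

for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    estimates = repeated_estimates(labels, rate, training_proportion)
    bias = statistics.mean(estimates) - true_proportion
    spread = statistics.stdev(estimates)
    print(f"rate={rate:.0%} bias={bias:+.4f} sd={spread:.4f}")
```

Replacing `training_proportion` with a function that fits a model on the training set and estimates the proportion on the remaining units would give the comparison described on the slide.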

15 … compared to Naïve Bayes
The same exercise has been carried out for Naïve Bayes. The results show a small bias (on the order of one or two percentage points), but a much lower variability.

16 Future work
The approach tested here will be improved and extended in different directions:
1. with reference to the population of interest: we will consider the URLs of all the units belonging to the Business Register and perform a mass scraping of the related websites (in this case also dealing more properly with the high-volume problems related to Big Data), using the whole sampling subset of websites as a training set, so as to obtain a model that can be applied to the whole population. The aim is to produce estimates under a full predictive approach, reducing the sampling errors at the cost of introducing additional bias (both components of the MSE should be evaluated);
2. with reference to the content of the questionnaire: the results obtained with the set of variables contained in the “B8” section of the questionnaire will also be evaluated with the other suitable variables in the questionnaire (e-recruitment, e-procurement, use of social networks, etc.).
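For reference, the two components of the MSE mentioned in point 1 are the variance and the squared bias of the estimator. This is the standard decomposition, written here with generic symbols ($\hat{\theta}$ for the estimator and $\theta$ for the population value) that are not taken from the slides:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2
```

A full predictive approach would shrink the first term (no sampling error) while possibly enlarging the second (model bias), which is why both need to be evaluated.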

17 Contacts
barcaroli@istat.it
nurra@istat.it
marco.scarno@cineca.it
summa@istat.it
Thank you for your attention

