Q2014 - European Conference on quality in official statistics Vienna, 2-5 June 2014 (4 June) Q2014 European Conference on quality in official statistics.

1 Q2014 European Conference on quality in official statistics, Vienna, 2-5 June 2014
Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation
Riccardo Giannini (rigianni@istat.it), Rosanna Lo Conte (rolocont@istat.it), Stefano Mosca (stmosca@istat.it), Federico Polidoro (polidoro@istat.it), Francesca Rossetti (frrosset@istat.it)

2 Outline of the presentation
1. Implementing European project “Multipurpose price statistics”
2. Centralised data collection for Italian Harmonized Index of Consumer Prices (HICP)
3. Testing and implementing web scraping techniques on the survey concerning prices of “consumer electronics” products
4. Testing web scraping techniques on the survey concerning “airfares”
5. IT choices adopted to implement web scraping procedures
6. Possible future developments and conclusive remarks

3 Implementing European project “Multipurpose price statistics”
The key role of price statistics for the choices of policy makers
The requirement to ensure and improve quality in terms of methodology and production process, with specific reference to the data collection phase
The demand for timely and cost-efficient production of high-quality statistical data is increasing, as is the need for new solutions to declining response levels (Scheveningen Memorandum)
Multipurpose price statistics as a reply to these requirements

4 Implementing European project “Multipurpose price statistics”
Multipurpose price statistics:
Modernizing data collection methods
Linking of HICP and PPP processes
Developing a data warehouse approach
Providing more detailed and timely HICPs, Price Level Indices (PLIs) and Detailed Average Prices (DAP)

5 Implementing European project “Multipurpose price statistics”
Modernization of data collection tools to improve HICP quality is one of the pillars of “Multipurpose Price Statistics”
Focus on “scanner data” and “web scraping” techniques as tools to capture large amounts of data for the compilation of inflation estimates
Concerning web scraping, Istat is testing and implementing procedures to “scrape” large amounts of data for HICP purposes, using the Internet as data source
Focus on two groups of products: “consumer electronics” (goods) and “airfares” (services)

6 Centralised data collection for Italian HICP
BREAKDOWN OF THE BASKET OF PRODUCTS IN TERMS OF WEIGHTS
Territorial data collection: 78.5%
Centralised data collection: 21.5%, of which:
  Entire external databases (medicines, school books, household contribution to National Health Service): 0.6%
  The most efficient way to collect prices necessary for indices compilation (e.g. camping, package holidays, highway toll): 11.6%
  Prices referred to the real purchase on the Internet (e.g. air tickets, consumer electronics and e-book readers): 2.3%
  Other prices centrally collected (e.g. tobacco and cigarettes): 7.0%

7 Centralised data collection for Italian HICP
CRITERIA TO SELECT THE PRODUCTS FOR TESTING WEB SCRAPING TECHNIQUES
Representativeness of both goods and services
Relevance of the web as retail trade channel
Products for which the data collection phase is extremely time consuming
Products for which it is important to widen the coverage of the sample in both temporal and spatial terms, overcoming the constraints due to manual data collection

8 Centralised data collection for Italian HICP
RELEVANCE OF WEB AS RETAIL TRADE CHANNEL
Table 1. E-commerce. Individuals aged 14 and over who have used the web during the last 12 months and who have bought or ordered goods or services for private use over the Internet, by groups of products purchased or ordered. 2012. Percentages
Overnight stays for holidays (hotels, pensions, etc.)                             35.5
Other travel expenditures for holidays (railway and air tickets, rent a car, etc.) 33.5
Clothing and footwear                                                              28.9
Books, newspapers, magazines, including e-books                                    25.1
Tickets for shows                                                                  19.7
Consumer electronics products                                                      18.6
Articles for the house, furniture, toys, etc.                                      17.9
Others                                                                             15.1
Film, music                                                                        14.4
Telecommunication services                                                         14.0
Software for computer and updates (excluding videogames)                           11.5
Hardware for computer                                                               8.4
Videogames and their updates                                                        8.0
Financial and insurance services                                                    6.0
Food products                                                                       5.6
Material for e-learning                                                             2.8
Games of chance                                                                     1.2
Medicines                                                                           0.8
Source: Istat survey on “Aspects of daily life”

9 Centralised data collection for Italian HICP
The choice finally fell on two groups of products: consumer electronics (goods) and airfares (services)
Testing web scraping techniques on these two groups of products was aimed first of all at making online data collection more efficient, for products for which the web is a relevant retail trade channel
A further aim was to explore the potential of web scraping to allow better coverage of the reference population (linked to the issue of using big data for statistical purposes and its consequences for traditional sampling methodologies)

10 Testing and implementing web scraping techniques on “consumer electronics” products
The survey on prices of consumer electronics is carried out in ten phases and concerns:
Mobile phones
Smartphones
PC notebook
PC desktop
PC Tablet
PC peripherals: monitors
PC peripherals: printers
Cordless or wired telephones
Digital cameras
Video cameras

11 Testing and implementing web scraping techniques on “consumer electronics” products
Phase 2: Market segmentation based on technical specifications and performance (annually fixed). Example, digital cameras: seg1 = ‘compact’ camera; seg2 = ‘bridge’ camera; etc.
Phase 3: Identification of minimum requirements to be satisfied (annually fixed). Example, PC desktop: O.S. at least Windows XP, HD capacity 160 GB or higher, RAM at least 2 GB, etc.
Phase 4: Monthly data collection of the whole range of models, in terms of commercial name and main technical specifications, offered on the market by the main brands, within the segments identified in phase 2 and satisfying the minimum requirements identified in phase 3. In this phase the sample is selected for a specific month (‘continually updated’ sample with ‘automated’ replacement of models that are losing importance in the market).
Phase 5: Price data collection, for all the models included in the sample, from the web site of each sampled shop.
Manual detection: for some shops (9), price collectors scanned the corresponding websites manually and registered the prices in external files or databases.
Semi-automatic detection: for other shops (9), price lists were manually downloaded (“copy and paste”), then formatted and submitted to SAS procedures that automatically linked the product codes identified in phase 4 to the codes in the list from each store.
The web scraping test focused on phase 5 and on semi-automatic detection (a time-consuming phase, and one where the test was feasible)
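The automatic linking step of phase 5 is carried out with SAS procedures that the slides do not describe in detail. As a rough illustration only, the Python sketch below links a shop price list to the phase 4 sample by exact match on a normalised product code. The matching rule, the function names, the product codes and the prices are all hypothetical.

```python
def normalise(code: str) -> str:
    """Upper-case a product code and drop anything that is not a letter or digit.

    This normalisation rule is an assumption made for the example.
    """
    return "".join(ch for ch in code.upper() if ch.isalnum())


def link_prices(sample_codes, shop_list):
    """Split one shop's price list into rows linked / not linked to the sample.

    sample_codes: product codes selected in phase 4
    shop_list: list of (code as published by the shop, price) tuples
    """
    wanted = {normalise(c): c for c in sample_codes}
    linked, unlinked = [], []
    for shop_code, price in shop_list:
        key = normalise(shop_code)
        if key in wanted:
            linked.append((wanted[key], price))
        else:
            unlinked.append((shop_code, price))
    return linked, unlinked


# Hypothetical data, for illustration only
sample = ["CAM-100", "NB 200X"]
shop = [("cam 100", 99.90), ("NB-200X", 349.00), ("XYZ-1", 10.00)]
linked, unlinked = link_prices(sample, shop)
print(linked)    # rows usable for index compilation
print(unlinked)  # scraped rows that match no sampled model
```

Rows that end up in `unlinked` correspond to scraped quotes that cannot be used for the current sample.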

12 Testing and implementing web scraping techniques on “consumer electronics” products
EVALUATION OF THE RESULTS OBTAINED IN TERMS OF
Number of prices downloaded in the lists
Number of prices that it was possible to link automatically for each store (via SAS procedures) to the product codes in the sample selected in phase 4
Improvements of efficiency in terms of time saving

13 Testing and implementing web scraping techniques on “consumer electronics” products
Table 2. Number of elementary price quotes manually collected and web scraped and usable for indices compilation. A comparison between February and March 2013
Products                        Manually downloaded lists of prices, February 2013    Web scraped lists of prices, March 2013
Cordless or wired telephones     195     185
Mobile phones                    102     111
Smartphones                      174     171
PC desktop                       102      83
PC peripherals: monitors         142     310
PC notebook                      328     433
PC peripherals: printers         383     421
PC Tablet                        100      87
Digital cameras                  392     322
Video cameras                    103      83
Total                           2021    2206
Source: Istat

14 Testing and implementing web scraping techniques on “consumer electronics” products
Table 3. Current workload for monthly data collection. Comparison between manual and web scraping download. Minutes per shop, totals in hours and minutes
Columns: (a) number of products; (b) manual download: navigation, copy and paste (minutes); (c) manual download: standardization of formats (minutes); (d) iMacros download: macro execution (minutes); (e) iMacros download: formatting output (minutes)
Online shop website     (a)    (b)    (c)    (d)    (e)
www.compushop.it         10     50     80     15     50
www.ekey.it               6     30     20     15     70
www.misco.it             10     60     90     10     45
www.pmistore.it           7     40     90     15     20
www.softprice.it         10     90    180     25     40
www.syspack.it            8     45     90     20     45
Total time: (b) 5 hours 15 minutes; (c) 9 hours 10 minutes; (d) 1 hour 40 minutes; (e) 4 hours 30 minutes
Source: Istat
The comparison between the workload necessary for the manual detection of prices and the download of prices through web scraping macros shows the efficiency gains obtained
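The totals of Table 3 can be re-derived from the per-shop minutes. The short Python check below uses the figures transcribed from the table; the variable and function names are only illustrative.

```python
# Minutes per shop, transcribed from Table 3:
# (manual navigation, manual standardization, macro execution, output formatting)
minutes = {
    "www.compushop.it": (50, 80, 15, 50),
    "www.ekey.it": (30, 20, 15, 70),
    "www.misco.it": (60, 90, 10, 45),
    "www.pmistore.it": (40, 90, 15, 20),
    "www.softprice.it": (90, 180, 25, 40),
    "www.syspack.it": (45, 90, 20, 45),
}


def hm(total_minutes: int) -> str:
    """Format a number of minutes as hours and minutes."""
    hours, mins = divmod(total_minutes, 60)
    return f"{hours} h {mins:02d} min"


# Column totals, in the same order as the tuples above
totals = [sum(column) for column in zip(*minutes.values())]
manual = totals[0] + totals[1]    # monthly manual workload, minutes
scraped = totals[2] + totals[3]   # monthly web scraping workload, minutes

print([hm(t) for t in totals])
print("manual:", hm(manual), "| web scraping:", hm(scraped))
print("annual hours:", manual * 12 / 60, "vs", scraped * 12 / 60)
```

Multiplying the monthly totals by twelve gives the annual data collection hours for the two methods.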

15 Testing and implementing web scraping techniques on “consumer electronics” products
Table 4. Annual working hours for half of the shops sample for data collection of prices of consumer electronics products. Comparison between manual download and web scraping. Hours
                                             Manual download    Web scraping
Starting workload (annual changing base)            0                34
Current maintenance                                 0                12
Current data collection                           173                74
Total working hours                               173               120
Source: Istat
A correct evaluation of the efficiency gains requires assessing the annual workload, taking into account the starting cost of developing the macros (starting workload, to be considered at each annual change of the index base) and the maintenance of the macros (current maintenance)
Efficiency increases on an annual basis too: the workload necessary to manage the survey is reduced from about 23 working days to 16 working days (more than 30% of time saved, a gain that increases if we take into account that human resources become available for other tasks)
Hence the choice to move the web scraping macros for consumer electronics from testing to production
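The annual saving quoted for Table 4 follows from simple arithmetic, sketched below in Python. The hours come from the table; the length of a working day (7.5 hours) is an assumption, chosen because it reproduces the 23 and 16 working days quoted on the slide.

```python
# Annual hours from Table 4
manual_hours = {"starting": 0, "maintenance": 0, "collection": 173}
scraping_hours = {"starting": 34, "maintenance": 12, "collection": 74}

total_manual = sum(manual_hours.values())
total_scraping = sum(scraping_hours.values())

# Share of the annual workload saved by web scraping, in percent
saving_pct = 100 * (total_manual - total_scraping) / total_manual

HOURS_PER_DAY = 7.5  # assumption, not stated in the source
days_manual = total_manual / HOURS_PER_DAY
days_scraping = total_scraping / HOURS_PER_DAY

print(total_manual, total_scraping)
print(f"saving: {saving_pct:.1f}%")
print(f"working days: {days_manual:.0f} vs {days_scraping:.0f}")
```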

16 Testing and implementing web scraping techniques on “consumer electronics” products
Table 5. Sample of models, price quotes scraped and price quotes collected for consumer electronics products survey for Italian CPI/HICP compilation. January 2014. Units and percentages
Columns: (a) number of models in the sample; (b) number of price quotes web scraped; (c) number of price quotes collected and linked to the sample; (d) price quotes linked / price quotes scraped (%)
Product                          (a)      (b)     (c)     (d)
Cordless or wired telephones     190      844     224    26.5
Mobile phones                     63     2024     108     5.3
Smartphones                      131     2396     187     7.8
Digital cameras                  352     2642     400    15.1
PC desktop                        37     1837      81     4.4
PC peripherals: monitors         273     2734     299    10.9
PC notebook                      179     3597     288     8.0
PC peripherals: printers         143     5887     370     6.3
PC Tablet                        179     1824      42     2.3
Video cameras                    152      560      56    10.0
TOTAL                           1699    24345    2055     8.4
Column (a): the number of models selected in the sample in phase 4
Column (b): the number of elementary price quotes scraped
Column (c): the number of elementary quotes that it was possible to link with the codes of the models selected in phase 4
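The last column of Table 5 is the ratio of linked to scraped quotes. The Python sketch below recomputes it from the transcribed counts and derives the totals; the names are illustrative.

```python
# (models in sample, quotes scraped, quotes linked), transcribed from Table 5
table5 = {
    "Cordless or wired telephones": (190, 844, 224),
    "Mobile phones": (63, 2024, 108),
    "Smartphones": (131, 2396, 187),
    "Digital cameras": (352, 2642, 400),
    "PC desktop": (37, 1837, 81),
    "PC peripherals: monitors": (273, 2734, 299),
    "PC notebook": (179, 3597, 288),
    "PC peripherals: printers": (143, 5887, 370),
    "PC Tablet": (179, 1824, 42),
    "Video cameras": (152, 560, 56),
}


def link_rate(scraped: int, linked: int) -> float:
    """Percentage of scraped quotes that could be linked to the sample."""
    return round(100 * linked / scraped, 1)


rates = {name: link_rate(s, l) for name, (_, s, l) in table5.items()}
total_models, total_scraped, total_linked = (sum(col) for col in zip(*table5.values()))
total_rate = link_rate(total_scraped, total_linked)

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print("TOTAL:", total_models, total_scraped, total_linked, f"{total_rate}%")
```

The complement of the total rate is the share of scraped quotes that the current sample leaves unused.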

17 Testing and implementing web scraping techniques on “consumer electronics” products
The potential of web scraping techniques in terms of the amount of information captured
A big “waste” of information: why?
Because the information is, from the statistical point of view, useless for estimating inflation of consumer electronics products?
Or because the sampling schemes were conceived taking into account the constraints of data collection, and in particular the limits on collecting all the information available on the web?
Does the possible massive use of web scraping techniques open “new doors” to statisticians concerning the capture of the information necessary to measure the object of the survey?

18 Testing web scraping techniques on the survey concerning “airfares”
A BRIEF DESCRIPTION OF THE SURVEY
COICOP class (passenger transport by air, weight on the total HICP basket of products = 0.85% in 2013) articulated in three consumption segments: domestic flights, European flights and intercontinental flights, further stratified by type of carrier, destination and route
In 2013 the final sample consisted of 208 routes (from/to 16 Italian airports): 47 national routes, 97 European routes, 64 intercontinental routes. 81 routes referred to traditional carriers (TCs) and 127 routes to low-cost carriers (LCCs)
Product definition: one ticket, economy class, adult, fixed route connecting two cities, outward and return trip, on fixed departure/return days, final price including airport or agency fees

19 Testing web scraping techniques on the survey concerning “airfares”
A BRIEF DESCRIPTION OF THE SURVEY
Prices are collected by means of purchasing simulations on the Internet, according to a pre-fixed yearly calendar (first Tuesday of the month, considering 2 or 4 time distances from the date of departure)
In 2013, data collection was carried out on 16 LCCs’ websites and on three web agencies selling air tickets (Opodo, Travelprice and Edreams), where only TCs’ airfares are collected
More than 960 elementary price quotes are registered monthly; they correspond to the cheapest economy fare available at the moment of booking for the route and for the dates selected, including taxes and compulsory service charges
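A collection calendar of this kind can be generated programmatically. The Python sketch below computes the first Tuesday of a month and a set of departure dates at fixed distances from it. The offsets of 10 and 30 days are placeholders chosen for illustration, not the survey's actual calendar rules.

```python
from datetime import date, timedelta


def first_tuesday(year: int, month: int) -> date:
    """Return the first Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday == 0, Tuesday == 1
    return first + timedelta(days=(1 - first.weekday()) % 7)


def departure_dates(collection_day: date, offsets_in_days=(10, 30)):
    """Departure dates at fixed distances from the collection day.

    The default offsets are hypothetical values for this example.
    """
    return [collection_day + timedelta(days=n) for n in offsets_in_days]


collection = first_tuesday(2014, 6)
print(collection)
print(departure_dates(collection))
```

A list built this way could serve as part of the input that tells a scraping macro which dates to query.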

20 Testing web scraping techniques on the survey concerning “airfares”
TESTING WEB SCRAPING
The aim of testing web scraping techniques on airfares is twofold: verifying the improvements in efficiency, and evaluating the chance of extending data collection to further dates (two and three months of “purchasing advance”) beyond those ordinarily scheduled (monthly or twice a month, with departure dates ten days and one month after), exploiting the potential of web scraping procedures
Given the characteristics and peculiarities of the survey on airfares, testing web scraping techniques on airfares data collection has required not only developing and assembling scraping macros but also implementing a multitude of logic controls derived from the statistical design of the survey
Web scraping techniques have been tested on EasyJet, Ryanair and Meridiana and, for the traditional airline companies, on Opodo.it

21 Testing web scraping techniques on the survey concerning “airfares”
RESULTS OF TESTING WEB SCRAPING
With regard to the LCCs, each airline company site showed its own specific problems:
EasyJet did not allow the prices to be scraped directly using the traditional link www.easyjet.com/it/ and required specific airport descriptions (different from the simple airport IATA codes)
Ryanair, at the very beginning of the tests, presented a CAPTCHA (“Completely Automated Public Turing test to tell Computers and Humans Apart”), a challenge-response test used to determine whether the web user is human or not
The Meridiana website, in replying to a specific query, showed additional pages offering optional services or asking for travellers’ information before displaying the final price, thus obliging us to develop a distinctive and more complex macro to scrape prices

22 Testing web scraping techniques on the survey concerning “airfares”
RESULTS OF TESTING WEB SCRAPING
Finally, attention was concentrated on EasyJet
The macros developed have provided excellent results in correctly replicating manual data collection. They are used in the current activities, starting from the most recent data collections
Improvements in terms of time saving have been quite small. This is due to the time spent preparing the input files that the macros use to identify the routes and dates for which to scrape prices and to return an output usable for index compilation. It is also due to the limited number of elementary quotes involved (60), which does not allow a meaningful measure of the time saved by adopting web scraping as a tool to acquire large amounts of elementary data efficiently

23 Testing web scraping techniques on the survey concerning “airfares”
RESULTS OF TESTING WEB SCRAPING
For airfares offered by traditional airline companies, web scraping macros have been tested on the web agency Opodo (www.opodo.it). In this case, about 160 monthly price quotes were involved
Improvements in terms of efficiency were more meaningful than those obtained with the EasyJet macro (1 hour and 48 minutes to download the 160 elementary price quotes, against about 2 and a half hours for the manual download)
For Opodo too it is necessary to prepare an input file to drive the macro in searching the correct sample of routes and, beyond what the EasyJet macro requires, in managing the distinction between traditional and low-cost carriers
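The Opodo figures translate into a relative saving as follows. This small Python calculation uses only the two durations and the number of quotes quoted on the slide, with “about 2 and a half hours” taken as exactly 150 minutes for the sake of the example.

```python
QUOTES = 160                 # monthly price quotes collected on Opodo
scraping_minutes = 60 + 48   # 1 hour and 48 minutes
manual_minutes = 150         # "about 2 and a half hours", taken as 150 minutes

saved_minutes = manual_minutes - scraping_minutes
saving_pct = 100 * saved_minutes / manual_minutes

print(f"saved: {saved_minutes} minutes ({saving_pct:.0f}%)")
print(f"seconds per quote: {60 * manual_minutes / QUOTES:.1f} manual, "
      f"{60 * scraping_minutes / QUOTES:.1f} scraped")
```

The figure excludes the time needed to prepare the input file and to update the macro, which the slides report separately.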

24 Testing web scraping techniques on the survey concerning “airfares”
RESULTS OF TESTING WEB SCRAPING
Therefore the total time necessary for automatic detection of prices is not so different from that of manual detection, and time to update the macro is also needed
But it has to be considered that, if the Opodo macro works correctly and only marginal checking is needed, then the two hours of manual work are saved and could be dedicated to other phases of the production process or to improving the quality and coverage of the survey
Also in the case of airfares, the possibility of better covering the reference universe (by enlarging the amount of elementary data collected through web scraping) emerges clearly

25 IT choices adopted to implement web scraping procedures
The choice of iMacros as the software for testing web scraping techniques in the field of consumer price data collection. Why?
It speeds up the acquisition of textual information on the web and, above all, it can be used together with programming and scripting languages (e.g. Java, JavaScript)
iMacros tasks can be performed with the most popular browsers
Documented by a wiki (e.g. http://wiki.imacros.net/iMacros_for_Firefox) and forums (e.g. http://forum.iopus.com/viewforum.php)
It is possible to take advantage of some projects (e.g. http://sourceforge.net/projects/jacob-project/) for the use of Java, giving the user great potential for interfacing and integrating with other software solutions and legacy environments

26 IT choices adopted to implement web scraping procedures
The approach adopted has been to implement two different macros for each survey: a pointing macro and a scraping macro (the pointing macro reaches the page, the scraping macro collects the data and registers them in a flat file)
Main advantages:
a) Easy maintenance, due to modularity that helps identify problems when they occur
b) Whenever problems reside in the pointing macro, no IT specialist support is needed for maintenance
Main disadvantages:
a) Lower usability (collectors are forced to use two macros instead of one)
b) More time necessary to execute the complete web scraping activity
But the advantages outweigh the disadvantages
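The separation between a pointing step and a scraping step can be illustrated outside iMacros. The Python sketch below is a stand-in for the idea, not a translation of the actual macros: the URL scheme, the HTML layout, the field names and the sample page are all hypothetical, and no network access is performed.

```python
import csv
import io
import re


def pointing(base_url: str, category: str) -> str:
    """'Pointing' step: build the address of the page to reach.

    The URL scheme is hypothetical.
    """
    return f"{base_url}/catalogue/{category}"


# Hypothetical page layout: one <li> per product, price as element text
ROW = re.compile(r'<li data-code="([^"]+)">\s*([0-9]+(?:\.[0-9]+)?)\s*</li>')


def scraping(html: str):
    """'Scraping' step: extract (product code, price) pairs from a page."""
    return [(code, float(price)) for code, price in ROW.findall(html)]


def to_flat_file(rows, out) -> None:
    """Register the scraped rows in a semicolon-separated flat file."""
    writer = csv.writer(out, delimiter=";")
    writer.writerow(["code", "price"])
    writer.writerows(rows)


# Sample page content standing in for a downloaded shop page
page = '<ul><li data-code="CAM-100">99.90</li><li data-code="NB-200X">349.00</li></ul>'

url = pointing("http://shop.example", "cameras")
rows = scraping(page)
buffer = io.StringIO()
to_flat_file(rows, buffer)
print(url)
print(buffer.getvalue())
```

Keeping the two steps in separate functions mirrors the maintenance argument on the slide: a change in site navigation touches only `pointing`, while a change in page layout touches only `scraping`.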

27 Possible future developments and conclusive remarks
Developing and testing web scraping procedures for the Italian consumer price survey have confirmed the enormous potential of automatic detection of prices (and related information)
Concerning efficiency, the improvements are clear when data collection is carried out on a few websites holding a large amount of information. The situation appears partially different if few prices must be collected on several distinct websites
This stresses the potential use of web scraping techniques to collect information for Purchasing Power Parity (PPP) or Detailed Average Price (DAP) exercises at the international level of comparison, but seems to limit their use for sub-national spatial comparison, for which data collection on a considerable number of websites would be necessary
But the actual challenges that have emerged are now clearer, and they confront statisticians in terms of the use of “big data” for statistical purposes

28 Possible future developments and conclusive remarks
The open questions regard the adoption of web scraping techniques to gather large amounts of data useful to better estimate inflation. This challenge has already been taken up by the study carried out by economic researchers at the Massachusetts Institute of Technology (MIT) within the project called "The Billion Prices Project @ MIT", which is aimed at monitoring daily price fluctuations of online retailers across the world
Are web scraping and scanner data the future of consumer price statistics as a basis for reengineering production processes, or are they also challenges that call for a deep revision of the statistical survey design?
Is it possible to fully exploit the potential of these “big data” (web scraped prices and scanner data) to enhance the quality of official statistical information in such a delicate field as inflation estimation?

29 Thank you for the attention

