Published by Annabelle Norman. Modified over 7 years ago.
1
Webscraping at Statistics Netherlands
Olav ten Bosch
23 March 2016, ESSnet big data WP2, Rome
2
Content
– Internet as a datasource (IAD): motivation
– Some IAD projects over past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register
3
The why
Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
– …
– Surveys
Internet sources
Less!!! Less!!!
Faster, better, more efficient
New indicators
4
Fuel prices (2009)
‐ Daily fuel prices from website of unmanned petrol stations (tinq.nl)
‐ Regional prices (per station) every day
Now (2016):
‐ A direct data feed from travelcard company, weekly
‐ Fuel prices per day and all transactions of that week
‐ Publication on website: prices per month
5
Airline tickets (2010)
– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by SN
– Prices comply with manual collection
– Quite expensive; negative business case
– 2016: still manual price collection of airline tickets
6
Airline tickets
[Figure: price of a ticket over time]
7
Housing market
– Housing market (from 2011):
‐ Discussions with external company for > 1 year (iWoz)
‐ We scraped 5 sites, about 250,000 observations / week, 2 years
From 2013:
‐ Direct feed from one of the sites (Jaap.nl)
‐ Statline tables: Bestaande woningen in verkoop (existing dwellings for sale)
‐ “based on 80-90 percent of the market”
8
Bulk price collection for CPI (1)
– Bulk price collection for CPI (from 2012):
‐ Mainly clothing
‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)
2016:
‐ About 500,000 price observations daily from 10 sites
‐ Data from 3 sites used in production of Dutch CPI
‐ Price collection process embedded in organisation
‐ Plans to extend to > 20 sites; other domains
9
Bulk price collection for CPI (2)
Processing bulk data from the Internet
[Diagram labels: Big Data; Data collection & Feature extraction; Structured data; Index methods; Index based on internet data]
Features: Fine-knit Jumper Dark blue Striped Cotton edges
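The feature extraction step on this slide could look roughly like the sketch below. It treats the slide's example text as a scraped product description and looks up known terms in a small vocabulary. The vocabulary, the feature names and the function `extract_features` are assumptions made for illustration; the slides do not describe the method actually used.

```python
# Hypothetical vocabulary: feature name -> terms that may occur in a description.
FEATURE_VOCABULARY = {
    "garment": ["jumper", "cardigan", "t-shirt"],
    "colour": ["dark blue", "blue", "red", "black"],
    "pattern": ["striped", "checked"],
    "material": ["cotton", "wool"],
    "knit": ["fine-knit", "chunky-knit"],
}


def extract_features(description):
    """Return a dict of features found in a free-text product description."""
    text = description.lower()
    features = {}
    for feature, terms in FEATURE_VOCABULARY.items():
        # Try longer terms first so that "dark blue" wins over "blue".
        for term in sorted(terms, key=len, reverse=True):
            if term in text:
                features[feature] = term
                break
    return features


if __name__ == "__main__":
    print(extract_features("Fine-knit Jumper Dark blue Striped Cotton edges"))
```

Simple substring matching like this is brittle; it is only meant to show how unstructured text becomes structured data that index methods can work on.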
10
Robot-assisted price collection
– Robot tool for detecting price changes on (parts of) websites
– Traffic light indicates status:
‐ Green: nothing changed, price is saved in database
‐ Red: some change, needs attention of statistician
‐ Two clicks to keep the old price or store a new one
‐ In production from 2014
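The traffic-light logic can be sketched as a comparison between the last saved version of a watched page fragment and the current one. The class `Observation` and the functions `check_fragment` and `resolve_red` are hypothetical names; how the real robot tool detects changes is not described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    status: str            # "green", "red" or "resolved"
    price: Optional[str]   # price to store, or None while awaiting a decision


def _normalise(fragment: str) -> str:
    # Collapse whitespace so layout-only differences do not count as changes.
    return " ".join(fragment.split())


def check_fragment(saved_fragment: str, current_fragment: str, saved_price: str) -> Observation:
    """Green when the watched fragment is unchanged, red otherwise."""
    if _normalise(saved_fragment) == _normalise(current_fragment):
        # Green: nothing changed, the known price is saved again.
        return Observation("green", saved_price)
    # Red: something changed, a statistician has to look at it.
    return Observation("red", None)


def resolve_red(saved_price: str, new_price: str, keep_old: bool) -> Observation:
    """The statistician's decision: keep the old price or store a new one."""
    return Observation("resolved", saved_price if keep_old else new_price)
```

For example, `check_fragment("Price: 1.99", "Price: 2.09", "1.99")` gives a red status, after which `resolve_red("1.99", "2.09", keep_old=False)` stores the new price.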
11
Collect data on enterprises for EGR (2013)
– Pilot: find data about EGR enterprises on the web
‐ We scraped semi-structured data from Wikipedia
‐ Multiple Wikipedia languages (NL, EN, DE, FR)
‐ 2016: something similar in ESSnet BD WP2?
12
Search product descriptions for classifying business activities
– Search product descriptions on web (from 2014)
‐ First time we used automated search with Google search API for statistics
‐ Pilot, no production
‐ Some doubts on Google results
13
Twitter-LinkedIn (1)
– LinkedIn-Twitter for profiling (2015)
‐ Automated search on LinkedIn based on a sample of Twitter users
‐ Very specific and experimental
‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch
15
Scraping websites of enterprises
– Identify family businesses (search and / or crawling) (2016)
– Identify businesses with a Corporate Social Responsibility (CSR) (search and / or crawling) (2016)
– Research program:
‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP2 !!!
16
Crawling for Statistics
[Diagram labels: Internet; Focused Crawler (Roboto); Data store; Search & Match; ElasticSearch; Url-base; Incomplete statistical data; More complete statistical data]
Search terms
Navigation terms
Item identifier terms
“year report, family business”
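One way to read the navigation terms and item identifier terms above is sketched below: navigation terms decide which links a focused crawler follows, and item identifier terms decide whether a page contains what is being looked for. This is an illustration in Python using only the standard library, not the Roboto (Node.js) setup from the slide; `LinkCollector`, `links_to_follow` and `page_matches` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def links_to_follow(html, base_url, navigation_terms):
    """Keep only links whose anchor text or URL contains a navigation term."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    terms = [term.lower() for term in navigation_terms]
    return [
        url
        for url, text in collector.links
        if any(term in text.lower() or term in url.lower() for term in terms)
    ]


def page_matches(page_text, item_terms):
    """Return the item identifier terms that occur in the page text."""
    lowered = page_text.lower()
    return [term for term in item_terms if term.lower() in lowered]
```

Restricting the crawl to links that match navigation terms keeps the number of requests per site small, which also fits the netiquette points in the legal slide.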
17
Technologies used
– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– NodeJS (JavaScript on server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (nodejs package, 2015-2016)
– Nutch: tested, not used
– Generic Framework (robot framework) for bulk scraping of prices
18
Summary / trends
Columns: Production, Scrape, Search, Crawl, External company
Tinq: x (x) Travelcard
Airlines: x 2 robots
Housing: x (x) Jaap.nl
BulkCPI: x x
Robottool: x x (x)
EGR: x x
RGS: x
Twitter / LinkedIn: x x
Enterprises: x x Dataprovider?
19
Observations / thoughts …
‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!
‐ It’s powerful to combine something we know with something we observe!
‐ External companies can help, but be careful …
21
Legal
– Dutch Statistics Law:
‐ Enterprises have to provide data to Statistics Netherlands on request
‐ Scraping information from websites reduces response burden
‐ Statistics Netherlands uses data for official statistics only
– Dutch database legislation:
‐ Commercial re-use of intellectual property is forbidden
‐ This may also apply to internet sources
– Privacy:
‐ Dutch (statistical) legislation on protection of personal information
‐ Statistics Netherlands scrapes only public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) data internally
– Netiquette:
‐ respect robots.txt
‐ identify yourself (user-agent)
‐ do not overload servers, use some idle time between requests
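The three netiquette points can be sketched with Python's standard library as follows. The user-agent string, the default delay and the helper names (`build_parser`, `polite_urls`, `make_request`) are assumptions for illustration, not the settings used by Statistics Netherlands.

```python
import time
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identification string: say who you are and how to reach you.
USER_AGENT = "example-statistics-bot/0.1 (+https://example.org/contact)"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(parser, urls, default_delay=1.0, sleep=time.sleep):
    """Yield only URLs that robots.txt allows, pausing between them."""
    # Use the site's Crawl-delay when one is given, else a default idle time.
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    for url in urls:
        if parser.can_fetch(USER_AGENT, url):
            yield url
            # Idle time before moving on to the next URL.
            sleep(delay)


def make_request(url: str) -> Request:
    """Build a request that identifies the scraper through its User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A caller would loop over `polite_urls(...)` and pass each URL to `urllib.request.urlopen(make_request(url))`; the `sleep` parameter exists only so the pause can be replaced in tests.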
22
Dutch Business Register (simplified)
[Diagram labels: Legal units, relationships; Cluster of control; Enterprise groups; Enterprises; Local units]
Sources:
- Trade Register
- Tax Register
- Social security register (employees)
- Profilers
From administrative units to statistical units:
- About 1.5 million administrative entities
- About 0.5 million have a URL
- Quality of URL field not known, but seems usable
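Since the quality of the URL field is not known, a first syntactic check could look like the sketch below. It only normalises the value and rejects entries that cannot be a web address; it does not test whether the site exists or belongs to the enterprise. The function `normalise_url` and its rules are assumptions, not part of the Business Register.

```python
from urllib.parse import urlsplit


def normalise_url(raw: str):
    """Return a cleaned http(s) URL, or None if the field is not usable."""
    value = raw.strip()
    if not value:
        return None
    # Register fields often omit the scheme, e.g. "www.example.nl".
    if "://" not in value:
        value = "http://" + value
    try:
        parts = urlsplit(value)
    except ValueError:
        return None
    host = (parts.hostname or "").lower()
    # Require a web scheme and a host that looks like a domain name.
    if parts.scheme not in ("http", "https") or "." not in host or " " in host:
        return None
    return f"{parts.scheme}://{host}{parts.path or '/'}"
```

Counting how many of the roughly 0.5 million values survive such a check would give a rough first impression of the field's usability before any crawling starts.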