Olav ten Bosch 23 March 2016, ESSnet big data WP2, Rome Webscraping at Statistics Netherlands.

1 Olav ten Bosch 23 March 2016, ESSnet big data WP2, Rome Webscraping at Statistics Netherlands

2 Content
– Internet as a data source (IAD): motivation
– Some IAD projects over the past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register

3 The why
Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
– …
– Surveys
Internet sources
Less!!! Less!!!
Faster, better, more efficient
New indicators

4 Fuel prices (2009)
‐ Daily fuel prices from the website of unmanned petrol stations (tinq.nl)
‐ Regional prices (per station) every day
2016:
‐ A direct weekly data feed from a travelcard company
‐ Fuel prices per day and all transactions of that week
‐ Publication on website: prices per month

5 Airline tickets (2010)
– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by SN
– Prices were consistent with manual collection
– Quite expensive; negative business case
– 2016: still manual price collection of airline tickets

6 Airline tickets Price of a ticket over time

7 Housing market
– Housing market (from 2011):
  ‐ Discussions with an external company (iWoz) for more than 1 year
  ‐ We scraped 5 sites, about 250,000 observations per week, for 2 years
From 2013:
  ‐ Direct feed from one of the sites (Jaap.nl)
  ‐ Statline tables: Bestaande woningen in verkoop (existing dwellings for sale)
  ‐ “based on 80-90 percent of the market”

8 Bulk price collection for CPI (1)
– Bulk price collection for CPI (from 2012):
  ‐ Mainly clothing
  ‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)
2016:
  ‐ About 500,000 price observations daily from 10 sites
  ‐ Data from 3 sites used in production of the Dutch CPI
  ‐ Price collection process embedded in the organisation
  ‐ Plans to extend to more than 20 sites and to other domains
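As an illustration of what "scrapes all prices and product data" can look like, here is a minimal sketch using only the Python standard library. The markup in `SAMPLE_HTML`, the `data-*` attribute names and the `ProductParser` class are assumptions made for this example; real shop pages differ per site and the slides do not describe the actual extraction code.

```python
from html.parser import HTMLParser

# Hypothetical product listing markup; real shop pages differ per site.
SAMPLE_HTML = """
<ul>
  <li class="product" data-id="A1" data-category="jumpers"
      data-colour="dark blue" data-price="39.95">Fine-knit jumper</li>
  <li class="product" data-id="B2" data-category="trousers"
      data-colour="black" data-price="59.00">Slim-fit trousers</li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one record per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # record being built, or None outside a product

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._current = {
                "id": attrs.get("data-id"),
                "category": attrs.get("data-category"),
                "colour": attrs.get("data-colour"),
                "price": float(attrs["data-price"]),
                "name": "",
            }

    def handle_data(self, data):
        # Text inside the product element is taken as the product name.
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.products.append(self._current)
            self._current = None


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


products = extract_products(SAMPLE_HTML)
```

Each record keeps the price next to the descriptive fields, so later steps can group observations by product characteristics.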

9 Bulk price collection for CPI (2)
Processing bulk data from the Internet
Diagram labels: Big Data; Data collection & Feature extraction; Structured data; Index methods; Index based on internet data
Features: Fine-knit, Jumper, Dark blue, Striped, Cotton edges

10 Robot-assisted price collection
– Robot tool for detecting price changes on (parts of) websites
– Traffic light indicates status:
  ‐ Green: nothing changed, price is saved in database
  ‐ Red: something changed, needs the attention of a statistician
  ‐ Two clicks to keep the old price or store a new one
  ‐ In production since 2014
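The traffic-light idea can be sketched as a comparison between a stored fingerprint of the watched page fragment and a fresh one. This is only an illustration: the function names, the use of SHA-256 and the whitespace normalisation are assumptions, not a description of the actual robot tool.

```python
import hashlib


def fingerprint(fragment: str) -> str:
    """Hash a page fragment after collapsing whitespace differences."""
    normalised = " ".join(fragment.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def traffic_light(previous_fingerprint: str, fragment: str):
    """Return ('green' | 'red', current fingerprint) for a watched fragment.

    green: fragment unchanged, the known price can be saved again
    red:   fragment changed, a statistician should look at it
    """
    current = fingerprint(fragment)
    status = "green" if current == previous_fingerprint else "red"
    return status, current


# Example: the fragment stored during the previous collection round.
stored = fingerprint("Price: 12.50")
status_same, _ = traffic_light(stored, "Price:   12.50\n")
status_changed, _ = traffic_light(stored, "Price: 13.00")
```

Watching only the relevant part of a page, rather than the whole page, keeps unrelated layout changes from turning the light red.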

11 Collect data on enterprises for EGR (2013)
– Pilot: find data about EGR enterprises on the web
  ‐ We scraped semi-structured data from Wikipedia
  ‐ Multiple Wikipedia languages (NL, EN, DE, FR)
  ‐ 2016: something similar in ESSnet BD WP2?

12 Search product descriptions for classifying business activities
– Search product descriptions on the web (from 2014)
  ‐ First time we used automated search with the Google Search API for statistics
  ‐ Pilot, no production
  ‐ Some doubts about Google results

13 Twitter-LinkedIn (1)
– LinkedIn-Twitter for profiling (2015)
  ‐ Automated search on LinkedIn based on a sample of Twitter users
  ‐ Very specific and experimental
  ‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

15 Scraping websites of enterprises
– Identify family businesses (search and / or crawling) (2016)
– Identify businesses with Corporate Social Responsibility (CSR) (search and / or crawling) (2016)
– Research program:
  ‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP2 !!!

16 Crawling for Statistics
Diagram labels: Internet; Focused Crawler (Roboto); Data store; Search & Match (ElasticSearch); Url-base; Incomplete statistical data; More complete statistical data
Crawler inputs: Search terms; Navigation terms; Item identifier terms (“year report, family business”)
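A focused crawler of the kind named on this slide can be sketched as a priority queue in which links whose anchor text matches navigation terms are visited first, and pages are flagged when they contain item identifier terms. The sketch below runs on a small in-memory site (`PAGES`) instead of the real web. The page contents, the scoring rule and all names are assumptions for illustration and are not taken from Roboto.

```python
import heapq

# Hypothetical in-memory website: url -> page text and outgoing links
# (link url -> anchor text). A real crawler would fetch these over HTTP.
PAGES = {
    "/": {
        "text": "Welcome to our company",
        "links": {
            "/about": "About us",
            "/products": "Products",
            "/investors": "Investors and year report",
        },
    },
    "/about": {"text": "We are a family business since 1950", "links": {}},
    "/products": {"text": "Our products", "links": {}},
    "/investors": {"text": "Download the year report 2015", "links": {}},
}

NAVIGATION_TERMS = ("about", "investors", "report")  # steer the crawl
ITEM_TERMS = ("year report", "family business")      # identify target pages


def score(anchor_text):
    """Count how many navigation terms occur in a link's anchor text."""
    text = anchor_text.lower()
    return sum(term in text for term in NAVIGATION_TERMS)


def focused_crawl(start, max_pages=10):
    """Visit the most promising links first; return visit order and hits."""
    frontier = [(0, start)]  # min-heap of (negative score, url)
    seen = {start}
    visited, hits = [], {}
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        page = PAGES.get(url)
        if page is None:
            continue
        visited.append(url)
        found = [t for t in ITEM_TERMS if t in page["text"].lower()]
        if found:
            hits[url] = found
        for link, anchor in page["links"].items():
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(anchor), link))
    return visited, hits


visited, hits = focused_crawl("/")
```

With a small page budget (`max_pages`), the ordering matters: the pages most likely to contain the wanted items are reached before the budget runs out.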

17 Technologies used
– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– NodeJS (JavaScript on the server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (NodeJS package, 2015-2016)
– Nutch: tested, not used
– Generic framework (robot framework) for bulk scraping of prices

18 Summary / trends
Columns: Production, Scrape, Search, Crawl, External company
Tinq: x, (x), Travelcard
Airlines: x, 2 robots
Housing: x, (x), Jaap.nl
Bulk CPI: x, x
Robot tool: x, x, (x)
EGR: x, x
RGS: x
Twitter / LinkedIn: x, x
Enterprises: x, x, Dataprovider?

19 Observations / thoughts …
‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!
‐ It’s powerful to combine something we know with something we observe!
‐ External companies can help, but be careful …

21 Legal
– Dutch Statistics Law:
  ‐ Enterprises have to provide data to Statistics Netherlands on request
  ‐ Scraping information from websites reduces response burden
  ‐ Statistics Netherlands uses data for official statistics only
– Dutch database legislation:
  ‐ Commercial re-use of intellectual property is forbidden
  ‐ This may also apply to internet sources
– Privacy:
  ‐ Dutch (statistical) legislation on protection of personal information
  ‐ Statistics Netherlands scrapes only public sources and processes the data within its own safe environment, just as it does with other (privacy-sensitive) data
– Netiquette:
  ‐ Respect robots.txt
  ‐ Identify yourself (user-agent)
  ‐ Do not overload servers; use some idle time between requests
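The three netiquette rules can be sketched with the Python standard library's `urllib.robotparser`. The user-agent string, the sample robots.txt and the `polite_fetch` helper are hypothetical. The `fetch` argument stands in for whatever HTTP client is used, so the sketch runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identification string; a real one would name the organisation
# and give contact details.
USER_AGENT = "example-statistics-bot"

# Hypothetical robots.txt content, parsed locally instead of downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule object."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch(urls, rules, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch allowed URLs only, identify ourselves, and pause between requests."""
    # Use the site's Crawl-delay if it states one, else our own default.
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # respect robots.txt
        results[url] = fetch(url, headers={"User-Agent": USER_AGENT})
        sleep(delay)  # idle time so the server is not overloaded
    return results


rules = build_rules(ROBOTS_TXT)
```

In use, `fetch` could be a thin wrapper around any HTTP library that accepts a `headers` argument; passing `sleep` as a parameter makes the pause easy to replace in tests.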

22 Dutch Business Register (simplified)
Diagram labels: Legal units; Relationships; Cluster of control; Enterprise groups; Enterprises; Local units
Sources:
- Trade Register
- Tax Register
- Social security register (employees)
- Profilers
From administrative units to statistical units:
- About 1.5 million administrative entities
- About 0.5 million have a url
- Quality of url field not known, but seems usable

