1
Big Data tools for IT professionals supporting statisticians – Istat SW for web scraping
Donato Summa
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
In short, we have 2 use cases:
1. URL retrieval
2. Web scraping of enterprise websites
3
Use case 1 – URL retrieval
4
URLs retrieval
In order to crawl and scrape content from enterprise websites, you first of all need to know the list of those website addresses. If you don't have that list available, you have to procure it. We tried several traditional ways to acquire the full list, without success.
5
What we have: a list of about 200K enterprise records, each containing several pieces of information:
- ISTAT code (primary key)
- official name of the enterprise
- certified mail
- VAT number
- city
- province
- telephone number
6
The automated URL retrieval procedure
Idea: obtain an enterprise web site (if it exists) starting from a search of the enterprise name in a search engine
7
The automated URL retrieval procedure
Input from existing sources (Register plus other archives): denomination, address, telephone number, fiscal code, …
For each enterprise in the target population:
- introduce the denomination into a search engine
- obtain the list of the first k resulting web pages
- for each one of these results, calculate the value of binary indicators, for instance:
  - the URL contains the denomination (Yes/No)
  - the scraped website contains geographical information coincident with that already available in the Register (Yes/No)
  - the scraped website contains the same fiscal code as in the Register (Yes/No)
  - the scraped website contains the same telephone number in the Register (Yes/No)
  - …
- compute a score on the basis of the values of the above indicators (a sketch of this computation follows)
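A minimal sketch of the indicator and score computation described above. Class, field and method names and the weights are invented for illustration; in the real procedure the decision rule comes from the model fitted on the training set (next slide).

    // Minimal sketch (not the actual Istat code): compute a binary indicator
    // vector and a weighted score for one candidate URL of one enterprise.
    public class UrlScoreSketch {

        // Register data assumed to be available for the enterprise
        record Enterprise(String denomination, String fiscalCode,
                          String telephone, String city) {}

        static int[] indicators(Enterprise e, String url, String pageText) {
            String text = pageText.toLowerCase();
            int[] ind = new int[4];
            ind[0] = url.toLowerCase().contains(
                         e.denomination().toLowerCase().replace(" ", "")) ? 1 : 0;
            ind[1] = text.contains(e.city().toLowerCase()) ? 1 : 0;   // geographical info
            ind[2] = text.contains(e.fiscalCode()) ? 1 : 0;           // fiscal code
            ind[3] = text.contains(e.telephone()) ? 1 : 0;            // telephone number
            return ind;
        }

        // Illustrative weights only: the real weights / acceptance rule are
        // learned from the subset of enterprises with known URLs.
        static int score(int[] ind) {
            int[] weights = {2, 1, 5, 3};
            int s = 0;
            for (int i = 0; i < ind.length; i++) s += weights[i] * ind[i];
            return s;
        }
    }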
8
The automated URL retrieval procedure
On the subset of enterprises for which the URL is known (training set), model the relation between the binary indicators, plus the score, and the success/failure of the found URL. Apply the model to the subset of enterprises for which the URL is not known, in order to decide whether the found URL is acceptable or not.
9
URLs retrieval chain (diagram): list of names and IDs of enterprises → UrlSearcher → file with URLs retrieved from Bing → RootJuice (our crawler) → UrlScorer → R scripts → list of URLs
10
URLs retrieval chain – Step 1 (same diagram as above, repeated for Step 1)
11
Step 1 – UrlSearcher
- takes as input 2 files containing the list of the firm names and the corresponding list of firm IDs
- for each enterprise the program queries a search engine (we are currently using Bing) and retrieves the list of the first 10 URLs provided by the search engine
- these URLs are then stored in a file, so we will have one file for each firm
- at the end, the program reads each produced file and creates the seed file (a simplified sketch of this loop follows)
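A minimal sketch of the UrlSearcher loop, under these assumptions: the input file names are hypothetical, the search engine call is abstracted away (the real program queries Bing), and for brevity the sketch appends to the seed file directly instead of writing one file per firm first.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;

    public class UrlSearcherSketch {

        // Placeholder: in the real program this wraps a Bing query
        static List<String> searchFirstResults(String firmName, int k) {
            throw new UnsupportedOperationException("search engine call goes here");
        }

        public static void main(String[] args) throws IOException {
            List<String> names = Files.readAllLines(Path.of("firmNames.txt"));
            List<String> ids   = Files.readAllLines(Path.of("firmIds.txt"));

            Path seed = Path.of("seed.txt");
            for (int i = 0; i < names.size(); i++) {
                List<String> urls = searchFirstResults(names.get(i), 10);
                for (int pos = 0; pos < urls.size(); pos++) {
                    // url TAB firmId TAB position, as described on the next slide
                    String line = urls.get(pos) + "\t" + ids.get(i) + "\t" + (pos + 1);
                    Files.writeString(seed, line + System.lineSeparator(),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
            }
        }
    }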
12
Step 1 – UrlSearcher
The seed file is a txt file in which every row is composed as follows: url_retrieved + TAB + firm_Id + TAB + position_of_the_url, e.g.:
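(The example shown on the original slide is not reproduced here; the row below is a hypothetical illustration with invented values.)

    http://www.example.com <TAB> 00123 <TAB> 1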
13
URLs retrieval chain – Step 2 (same diagram as above, repeated for Step 2)
14
Step 2 – RootJuice (crawling/scraping)
It takes as input 3 files:
- seed.txt: the seed file containing the list of the URLs to be scraped
- domainsToFilterOut.txt: a list of web domains to avoid (directory domains), e.g.:
    yellowpages.com
    domainToAvoid2
    domainToAvoidN
- rootJuiceConf.properties: a configuration file, e.g.:
    # proxy configuration
    PROXY_HOST = proxy.istat.it
    PROXY_PORT = 3128
    # technical parameters of the scraper
    RESUMABLE_CRAWLING = false
    NUM_OF_CRAWLERS = 10
    MAX_DEPTH_OF_CRAWLING = 2
    MAX_PAGES_TO_FETCH = -1
    MAX_PAGES_PER_SEED = 20
    # paths
    CRAWL_STORAGE_FOLDER = specific path
    CSV_FILE_PATH = specific path
    LOG_FILE_PATH = specific path
15
Step 2 – RootJuice (crawling/scraping)
For each row of the seed file (if the URL is not in the list of the domains to avoid) the program tries to acquire the HTML page. From each acquired HTML page it extracts just the textual content of the fields we are interested in and writes a line in a CSV file.
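As an illustration only, here is a sketch of the per-page extraction written with jsoup; the HTML parsing library actually used inside RootJuice is not specified here, and only a few of the CSV fields are reproduced.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.util.stream.Collectors;

    public class PageExtractionSketch {

        static String extractCsvLine(String url, String firmId, int linkPosition) throws Exception {
            Document doc = Jsoup.connect(url).get();

            String title    = doc.title();
            String bodyText = doc.body().text();
            String imgAlt   = doc.select("img[alt]").stream()
                                 .map(e -> e.attr("alt"))
                                 .collect(Collectors.joining(" "));
            String metaDesc = doc.select("meta[name=description]").attr("content");

            // TAB-separated line, in the spirit of the structure shown on the next slide
            return String.join("\t", firmId, url, imgAlt, metaDesc,
                               String.valueOf(linkPosition), title, bodyText);
        }
    }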
16
Step 2 – RootJuice (crawling/scraping)
The structure of each row of the produced CSV is the following (TAB-separated fields):
id, url, imgsrc, imgalt, links, ahref, aalt, inputvalue, inputname, metatagDescription, metatagKeywords, firmId, sitoAzienda, link_position, title, text_of_the_pagebody
17
URLs retrieval chain – Step 3 (same diagram as above, repeated for Step 3)
18
Step 3 – Load scraped data into Solr
The produced solrInput.csv file is the input for the next step of the process. Now that we have the scraped textual content of the HTML pages, we need to index/persist it in Solr.
19
Step 3 – Load scraped data into Solr
It is possible to load documents into Solr in different ways (e.g. curl); we wrote an ad hoc program that uses SolrJ, the Solr client API for Java. SolrTSVImporter takes as input 2 files:
- solrTsvImporterConf.properties: a configuration file, e.g.:
    # proxy configuration
    PROXY_HOST = proxy.istat.it
    PROXY_PORT = 3128
    # Solr server configuration
    SOLR_SERVER_URL = specify the url
    SOLR_SERVER_QUEUE_SIZE = 100
    SOLR_SERVER_THREAD_COUNT = 5
    # paths
    LOG_FILE_PATH = specific path
- solrInput.csv: the CSV file containing the scraped content (produced by RootJuice), with one TAB-separated row per scraped page, structured as described above (id, url, imgsrc, imgalt, links, ahref, aalt, inputvalue, inputname, metatagDescription, metatagKeywords, firmId, sitoAzienda, link_position, title, text_of_the_pagebody)
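A minimal SolrJ sketch of the loading step (not the actual SolrTSVImporter): the Solr URL, core name and field names are assumptions, while the queue size and thread count mirror the configuration parameters above.

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.nio.file.*;
    import java.util.List;

    public class SolrTsvImportSketch {
        public static void main(String[] args) throws Exception {
            ConcurrentUpdateSolrClient client =
                    new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/firms")
                            .withQueueSize(100)
                            .withThreadCount(5)
                            .build();

            List<String> rows = Files.readAllLines(Path.of("solrInput.csv"));
            for (String row : rows) {
                String[] f = row.split("\t", -1);
                if (f.length < 16) continue;          // skip malformed rows
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f[0]);
                doc.addField("url", f[1]);
                doc.addField("firmId", f[11]);
                doc.addField("body", f[15]);          // text_of_the_pagebody
                client.add(doc);
            }
            client.commit();
            client.close();
        }
    }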
20
Step 3 – Load scraped data into Solr
21
URLs retrieval chain – Step 4 (same diagram as above, repeated for Step 4)
22
Step 4 – UrlScorer The “last” step before analysis operations
It requires 2 input parameters:
- a file containing information about each scraped firm
- the Solr index directory
For each document in Solr, the program calculates a score vector and assigns a score on the basis of predefined conditions (e.g. the text of the document contains the VAT code or the telephone number).
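A sketch of the scoring idea: the real UrlScorer works directly on the Solr index directory, while this illustration queries a running Solr server via SolrJ; the field names, example values and weights are assumptions.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class UrlScorerSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/firms").build();

            String firmId = "00123", vat = "01234567890", phone = "0651234567";

            QueryResponse resp = client.query(new SolrQuery("firmId:" + firmId));
            for (SolrDocument doc : resp.getResults()) {
                String text = String.valueOf(doc.getFieldValue("body"));
                int vatFound   = text.contains(vat)   ? 1 : 0;
                int phoneFound = text.contains(phone) ? 1 : 0;
                String scoreVector = "" + vatFound + phoneFound;
                int score = 5 * vatFound + 3 * phoneFound;   // illustrative weights
                System.out.println(firmId + "\t" + doc.getFieldValue("url")
                                   + "\t" + scoreVector + "\t" + score);
            }
            client.close();
        }
    }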
23
Step 4 – UrlScorer
Finally, it writes a row in a CSV file with the following structure: firmId + TAB + linkPosition + TAB + url + TAB + scoreVector + TAB + score. Normally there will be several rows (different URLs with different scores) for each firm.
24
Step 5 – UrlMatchTableGenerator
The CSV file obtained in step 4 is the input for the machine learning program that will try to predict the correct URL for each firm. In order to accomplish this task you have to train the learner in advance, providing it with a training set (a similar CSV file containing the extra boolean column “is the found URL correct”). We created this training set using a custom Java program (UrlMatchTableGenerator) that merges the CSV file from step 4 with a list of known correct sites.
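A sketch of the merge used to build such a training table; file names and column layout are assumptions, and only the domain part of the URLs is compared, as in the evaluation described later.

    import java.net.URI;
    import java.nio.file.*;
    import java.util.*;

    public class UrlMatchTableSketch {

        static String domain(String url) {
            try { return new URI(url).getHost(); } catch (Exception e) { return ""; }
        }

        public static void main(String[] args) throws Exception {
            // knownSites.txt: firmId TAB officialUrl
            Map<String, String> known = new HashMap<>();
            for (String line : Files.readAllLines(Path.of("knownSites.txt"))) {
                String[] f = line.split("\t");
                known.put(f[0], domain(f[1]));
            }

            List<String> out = new ArrayList<>();
            // scoredUrls.csv: firmId TAB linkPosition TAB url TAB scoreVector TAB score
            for (String line : Files.readAllLines(Path.of("scoredUrls.csv"))) {
                String[] f = line.split("\t");
                boolean correct = Objects.equals(domain(f[2]), known.get(f[0]));
                out.add(line + "\t" + correct);   // extra boolean column
            }
            Files.write(Path.of("trainingSet.csv"), out);
        }
    }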
25
Step 5 – UrlMatchTableGenerator
Once the learner has been trained, it will be able to “recognize” the correct URL for a firm without knowing the real official site in advance (as it did in the training phase).
26
Obtained results
For each firm considered in the test set we compared the most probable official URL found by the procedure with the official URL that we already knew. In particular, we compared only the domain part of the URL. We found an exact match in 64% of the cases.
27
Considerations
The real success percentage is probably higher than the one obtained, because sometimes the official website that we know is not correct:
- it is outdated (the URL has changed)
- it does not exist anymore
- it has the wrong domain (e.g. “rossi.it” instead of “rossi.com”)
- it is only an information page with contacts (e.g. Pagine Gialle)
- it is a site that sells the products of the enterprise (e.g. mediaworld.it)
- it is the site of the mother company (franchising enterprises)
In these cases Bing and our algorithm probably find the correct site, but we consider it incorrect because it differs from the one we know.
28
Use case 2 – Web scraping of enterprise websites
29
Web scraping chain (diagram): list of URLs → RootJuice → scraped content → T/D Matrix generator → learners → final results
30
Web scraping chain – Step 1 (same diagram as above, repeated for Step 1)
31
Step 1 – RootJuice (crawling/scraping)
It takes as input 3 files:
- seed.txt: the seed file containing the list of the URLs to be scraped
- domainsToFilterOut.txt: a list of web domains to avoid (directory domains), e.g.:
    yellowpages.com
    domainToAvoid2
    domainToAvoidN
- rootJuiceConf.properties: a configuration file, e.g.:
    # proxy configuration
    PROXY_HOST = proxy.istat.it
    PROXY_PORT = 3128
    # technical parameters of the scraper
    RESUMABLE_CRAWLING = false
    NUM_OF_CRAWLERS = 10
    MAX_DEPTH_OF_CRAWLING = 2
    MAX_PAGES_TO_FETCH = -1
    MAX_PAGES_PER_SEED = 20
    # paths
    CRAWL_STORAGE_FOLDER = specific path
    CSV_FILE_PATH = specific path
    LOG_FILE_PATH = specific path
32
Step 1 – RootJuice (crawling/scraping)
For each row of the seed file (if the URL is not in the list of the domains to avoid) the program tries to acquire the HTML page. From each acquired HTML page it extracts just the textual content of the fields we are interested in and writes a line in a CSV file.
33
Step 1 – RootJuice (crawling/scraping)
The structure of each row of the produced CSV is the following (TAB-separated fields):
id, url, imgsrc, imgalt, links, ahref, aalt, inputvalue, inputname, metatagDescription, metatagKeywords, firmId, sitoAzienda, link_position, title, text_of_the_pagebody
34
Web scraping chain – Step 2 (same diagram as above, repeated for Step 2)
35
Step 2 – Load scraped data into Solr
The produced solrInput.csv file is the input for the next step of the process. Now that we have the scraped textual content of the HTML pages, we need to index/persist it in Solr.
36
Step 2 – Load scraped data into Solr
It is possible to load documents into Solr in different ways (e.g. curl); we wrote an ad hoc program that uses SolrJ, the Solr client API for Java. SolrTSVImporter takes as input 2 files:
- solrTsvImporterConf.properties: a configuration file, e.g.:
    # proxy configuration
    PROXY_HOST = proxy.istat.it
    PROXY_PORT = 3128
    # Solr server configuration
    SOLR_SERVER_URL = specify the url
    SOLR_SERVER_QUEUE_SIZE = 100
    SOLR_SERVER_THREAD_COUNT = 5
    # paths
    LOG_FILE_PATH = specific path
- solrInput.csv: the CSV file containing the scraped content (produced by RootJuice), with one TAB-separated row per scraped page, structured as described above (id, url, imgsrc, imgalt, links, ahref, aalt, inputvalue, inputname, metatagDescription, metatagKeywords, firmId, sitoAzienda, link_position, title, text_of_the_pagebody)
37
Step 2 – Load scraped data into Solr
38
Web scraping chain – Step 3 (same diagram as above, repeated for Step 3)
39
Step 3 – FirmsDocTermMatrixGenerator
It takes as input a configuration file, e.g.:
    # technical parameters of the program
    # MAX_RESULTS = max num of documents per firm retrievable from the storage platform
    MAX_RESULTS = 10000
    FIRST_LANG = ITA
    SECOND_LANG = ENG
    # paths
    SOLR_INDEX_DIRECTORY_PATH = specific/path/on/my/computer
    MATRIX_FILE_FOLDER = specific/path/on/my/computer
    GO_WORDS_FILE_PATH = specific/path/on/my/computer
    STOP_WORDS_FILE_PATH = specific/path/on/my/computer
    LOG_FILE_PATH = specific/path/on/my/computer
    TREE_TAGGER_EXE_FILEPATH = specific/path/on/my/computer
    FIRST_LANG_PAR_FILE_PATH = specific/path/on/my/computer
    SECOND_LANG_PAR_FILE_PATH = specific/path/on/my/computer
40
Step 3 – FirmsDocTermMatrixGenerator
The output will be a matrix having:
- on the first column, all the relevant stemmed terms found in all the documents
- on the first row, all the firm IDs contained in the storage platform
- in each cell, the number of occurrences of the specific term in all the documents referring to the specific firm
(The original slide shows an example T/D matrix: terms on the rows, firm IDs on the columns, sparse occurrence counts in the cells.)
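A sketch of how such a sparse term-by-firm matrix can be accumulated and serialized; input and output formats are assumptions, and the real FirmsDocTermMatrixGenerator reads the documents from Solr and applies the word selection rules described on the next slide.

    import java.util.*;

    public class TermFirmMatrixSketch {

        // termCounts.get(term).get(firmId) = number of occurrences
        private final Map<String, Map<String, Integer>> termCounts = new TreeMap<>();
        private final Set<String> firmIds = new TreeSet<>();

        public void addDocument(String firmId, List<String> stemmedTerms) {
            firmIds.add(firmId);
            for (String term : stemmedTerms) {
                termCounts.computeIfAbsent(term, t -> new HashMap<>())
                          .merge(firmId, 1, Integer::sum);
            }
        }

        public String toCsv() {
            StringBuilder sb = new StringBuilder("term");
            for (String id : firmIds) sb.append('\t').append(id);
            sb.append('\n');
            for (var e : termCounts.entrySet()) {
                sb.append(e.getKey());
                for (String id : firmIds)
                    sb.append('\t').append(e.getValue().getOrDefault(id, 0));
                sb.append('\n');
            }
            return sb.toString();
        }
    }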
41
Step 3 – FirmsDocTermMatrixGenerator
The words are obtained in this way:
- all the words present in Solr are retrieved
- all the words having fewer than 3 or more than 25 characters are discarded
- all the words not recognized as “first language” words or “second language” words are discarded
- the “first language” words are lemmatized with TreeTagger and stemmed with SnowballStemmer
- the “second language” words are lemmatized with TreeTagger and stemmed with SnowballStemmer
- the words contained in a “go word list” are added to the word list
- the words contained in a “stop word list” are removed from the word list
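A sketch of these word selection rules: the TreeTagger lemmatization and the language recognition are external steps and are not shown, and the stemming assumes the Snowball stemmers bundled with Lucene (org.tartarus.snowball.ext), used here for the first language only.

    import org.tartarus.snowball.ext.ItalianStemmer;
    import java.util.*;

    public class WordPipelineSketch {

        static Set<String> selectTerms(Collection<String> words,
                                       Set<String> goWords, Set<String> stopWords) {
            ItalianStemmer stemmer = new ItalianStemmer();   // "first language" = ITA
            Set<String> terms = new TreeSet<>();

            for (String w : words) {
                if (w.length() < 3 || w.length() > 25) continue;   // length filter
                // (language recognition and TreeTagger lemmatization would happen here)
                stemmer.setCurrent(w.toLowerCase());
                stemmer.stem();
                terms.add(stemmer.getCurrent());
            }
            terms.addAll(goWords);      // go words are always kept
            terms.removeAll(stopWords); // stop words are always discarded
            return terms;
        }
    }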
42
Web scraping chain – Step 4 (same diagram as above, repeated for Step 4)
43
Thank you for your attention !