
1 Common Crawl Mining
Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff
Manager: Don Sanderson (Eastman Chemical Company)
Client: Ken Denmark (Eastman Chemical Company)
Instructor: Edward Fox
Multimedia, Hypertext, and Information Access (CS 4624)
Virginia Tech, Blacksburg, VA, Spring 2017

2 Project Goals
Improve Eastman Chemical Company's ability to inform key business decisions
Assist product marketing and strategy development
Generate information for trend analysis of keywords
Publication date extraction
Common Crawl keyword searches

The main goal behind the Common Crawl Mining project was to improve Eastman Chemical Company's ability to inform key business decisions, such as whether to market a certain chemical. To do this, they wanted specific information from the Common Crawl archive for use by their product marketing and strategy development divisions. This information would allow them to drill down to a particular time period and look at what people were writing about certain keywords and chemicals during that period. To perform such trend analysis, they needed assistance extracting publication dates and performing keyword searches on Common Crawl data; however, the system we developed turned out to be much more than that.

3 History
Established client contact
Experimented with MapReduce searching
Transitioned to Elasticsearch[1]
Reduced dataset
Configured prototype
Tested Elasticsearch prototype
Refined Elasticsearch prototype
Parallelization
Custom relevance scoring
Deduplication

Here is an overview of our project's history. We started off trying to establish contact with our client, which took longer than we would have liked. In the meantime, we began experimenting with MapReduce to search Common Crawl WARC files for keywords and chemical names. From there, we moved to an Elasticsearch engine due to resource and time constraints. Once we had set up our Elasticsearch prototype, we tested it with keyword searches for chemical names, along the lines of the sketch below. Lastly, we refined the prototype by parallelizing the processing of WARC records, implementing custom relevance scoring, and implementing deduplication during indexing. Next, Ali will go over some of the problems we ran into during this process and how we addressed them.
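A minimal sketch of the kind of keyword query the prototype handled, assuming the official elasticsearch-py client; the index name, field name, and chemical keyword here are illustrative placeholders, not the project's actual values.

```python
# Hedged sketch: full-text keyword search over indexed Common Crawl records.
# "commoncrawl" and "text" are hypothetical index/field names.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on the default port

resp = es.search(
    index="commoncrawl",
    body={
        "query": {"match": {"text": "cellulose acetate"}},  # example chemical keyword
        "size": 10,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("url"))
```

Relevance here is Elasticsearch's default scoring; the custom relevance scoring mentioned above would replace or augment it.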

4 Problems
EMR cluster crashing (node size and default timeout)
Change configurations
Webhose printing to stdout
dup2()
JavaScript in HTTP responses
Beautiful Soup
Alternatives to Webhose date extraction (11,031/43,748)
IBM AlchemyLanguage
Scalability
Multiprocessing

One of the early problems we had during prototyping was that the EMR cluster would run for a couple of hours and then crash. To limit production costs, we had initially used small and medium nodes for the master and core nodes, and this was causing the issue; for the cluster to initialize properly, we had to switch to larger nodes. We also ran into a problem with the default timeout for Elasticsearch indexing, which we had to increase.

Another problem occurred when we would SSH into AWS, run the script, and then exit: if one of the modules tried to write to stdout (Webhose, in this case), it no longer had access to the terminal, and so it would terminate. Simply reassigning sys.stdout or running the script in the background wasn't enough; we had to change it at the file-descriptor level using the dup2() function, as shown in the first sketch after this slide.

During the refinement stage, we wanted to get rid of the JavaScript code in the HTTP responses being stored in Elasticsearch, since it was consuming memory and making search more costly. We used the decompose() function provided by Beautiful Soup to strip out all of the style and script text (second sketch below).

The date extraction tool we are using is Webhose, which is open source. It extracted a date for only 11,031 of 43,748 records, a ratio of about 25%, which is not that great. We could use IBM's AlchemyLanguage to get the dates, but it would incur a large cost.

For scalability, we parallelized with multiprocessing instead of multithreading, since CPython's global interpreter lock causes threads to execute sequentially, which isn't much better than no parallelism at all (third sketch below).
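A minimal sketch of the file-descriptor-level redirection described above, assuming output should go to a log file so it survives after the SSH session exits; the log filename is a hypothetical placeholder.

```python
import os
import sys

# Redirect fd 1 (stdout) and fd 2 (stderr) to a log file at the descriptor
# level, so writes from any module -- even ones that bypass sys.stdout --
# keep working after the launching terminal goes away.
log_fd = os.open("mining.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)  # hypothetical path
os.dup2(log_fd, sys.stdout.fileno())
os.dup2(log_fd, sys.stderr.fileno())
os.close(log_fd)  # fds 1 and 2 still point at the file

print("this line lands in mining.log")
```

Reassigning sys.stdout only redirects Python-level writes; dup2() rebinds the underlying descriptor, which is why the simpler approaches weren't enough.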
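A sketch of the script/style cleanup, assuming the HTML body of an HTTP response pulled from a WARC record; the helper name and sample markup are illustrative.

```python
from bs4 import BeautifulSoup

def strip_scripts_and_styles(html):
    """Return page text with <script> and <style> subtrees removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # shorthand for soup.find_all(...)
        tag.decompose()  # delete the tag and everything inside it
    return soup.get_text(separator=" ", strip=True)

sample = "<body><p>Acetate market news</p><script>track();</script></body>"
print(strip_scripts_and_styles(sample))  # -> "Acetate market news"
```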
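A sketch of process-based parallelism over WARC records; process_record and the input list are hypothetical stand-ins for the real pipeline, though the technique (a multiprocessing pool) is the one named in the transcript.

```python
from multiprocessing import Pool

def process_record(record):
    # Hypothetical stand-in for the real per-record work
    # (HTML cleanup, date extraction, indexing prep).
    return record.lower()

if __name__ == "__main__":
    records = ["Record A", "Record B", "Record C"]  # placeholder data
    # Separate worker processes each get their own interpreter, so
    # CPU-bound work isn't serialized by the GIL the way threads are.
    with Pool(processes=4) as pool:
        results = pool.map(process_record, records)
    print(results)
```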

5 Demo
Querying[2]
Visualization[2]

6 Consider all problems you may face and construct solutions ahead of time.

7 Acknowledgements
Ken Denmark (Eastman Chemical Company)
Dr. Edward Fox (Virginia Tech)
Liuqing Li (Virginia Tech)
Don Sanderson (Eastman Chemical Company)

8 References
[1] "Elastic MapReduce Using Python and MRJob." Cs4980s15 course wiki. MoinMoin. Web. Accessed 19 Feb. 2017.
[2] "Kibana: Explore, Visualize, Discover Data." Elastic. Web. Accessed 10 Apr. 2017.
[3] Common Crawl. Common Crawl, 2017. Web. Accessed 19 Feb. 2017.

9 Thanks! Questions? [3]

