Collection Management Webpages Final Presentation

Presentation transcript:

Collection Management Webpages Final Presentation CS 5604 Information Storage & Retrieval Professor Edward A. Fox M. Eagan, X. Liang, L. Michael, S. Patil Virginia Polytechnic Institute and State University Blacksburg, VA 24061 December 7, 2017

Review and Background

The Web as a Graph The web can be represented as a directed graph whose nodes are webpages and whose edges are hyperlinks. The anchor text pointing to page B is a good description of page B, and a hyperlink from page A to page B represents an endorsement of page B by the creator of page A.

Web Crawlers Must have robustness (resilience to crawler traps) and politeness. Should be distributed, scalable, efficient, biased toward quality pages, continuously operating, and extensible.

Web Crawler Architecture
1. Start with a seed set of URLs
2. Fetch content from a URL and parse it
3. The extracted text is fed to a text indexer
4. Extracted links are added to the URL frontier
5. For continuous crawling, the fetched URL is added back to the frontier
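As a rough illustration, here is a minimal Python sketch of the loop described above; it is not the project's crawler. The `index_text` stub is a hypothetical stand-in for the indexing step, and politeness (robots.txt, rate limiting) is omitted.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def index_text(url, text):
    # Stand-in for step 3: in a real system the text would go to a text indexer.
    print(url, len(text))


def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)        # 1. start with a seed set of URLs
    seen = set(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text   # 2. fetch content from a URL
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")       # 2. parse the content
        index_text(url, soup.get_text())                # 3. feed the text to the indexer
        for anchor in soup.find_all("a", href=True):    # 4. add extracted links to the frontier
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```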

Event Focused Crawler (EFC) Overview Designed by Mohamed Farag. An event is "something that happened at a certain place on a specific date." The EFC generates a list of URLs based on user-provided seed URLs, and it applies standard crawling principles such as using anchor text as relevance information.

Input URLs from the Collection Management Tweets (CMT) team; URLs collected by hand; WARC files from organizations such as the Internet Archive. (Presenter: Mischa)

Current System Design (Presenter: Mischa)

Output WARC files, to be uploaded to the Internet Archive; raw and cleaned HTML, along with associated entities such as 'collection_name' and 'location', stored in HBase. (Presenter: Mischa)
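Because the cleaned output lands in HBase, a hedged sketch of such an upload using the happybase client is shown below. The host, table name, and column family/qualifier names are illustrative assumptions, not the project's actual schema.

```python
import happybase

# Assumptions: an HBase Thrift gateway at "hbase-host", a table named "cmw-webpages",
# and a "webpage" column family; the schema actually used by the class may differ.
connection = happybase.Connection("hbase-host")
table = connection.table("cmw-webpages")


def upload_page(row_key, raw_html, clean_html, collection_name, location):
    # Store the raw and cleaned HTML together with the associated entities.
    table.put(row_key, {
        b"webpage:raw_html": raw_html.encode("utf-8"),
        b"webpage:clean_html": clean_html.encode("utf-8"),
        b"webpage:collection_name": collection_name.encode("utf-8"),
        b"webpage:location": location.encode("utf-8"),
    })
```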

Twitter Data Flow
CMT provided CSV files for tweets stored in HBase
We pulled out only the tweets that contain URLs
We checked these URLs for uniqueness
We fetched these pages and cleaned them
We uploaded the processed data to HBase
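Below is a minimal sketch of the URL-extraction and uniqueness steps of this flow; it assumes the tweet text can be found somewhere in each CSV row, and it is not necessarily the exact script the team ran.

```python
import csv
import re

URL_RE = re.compile(r"https?://\S+")


def extract_unique_urls(tweet_csv, url_csv):
    # Read a CMT tweet CSV, pull out every URL, and keep only the unique ones.
    seen = set()
    with open(tweet_csv, newline="", encoding="utf-8") as fin, \
         open(url_csv, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            text = " ".join(row)                 # assumption: tweet text appears somewhere in the row
            for url in URL_RE.findall(text):
                if url not in seen:              # uniqueness check
                    seen.add(url)
                    writer.writerow([url])
```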

Processing Completed and now in HBase

Processing Pipeline for August21
Stage            File Name                              Number of Rows
CSV from CMT     August21_final.csv                     2,017
URL CSV          August21_finalURL.csv                  690
Unique URL CSV   August21_finalURL_condensed.csv.csv    510
Processed TSV    August21_finalURL_rst_bk.tsv           -

Processing Pipeline for oreclipse
Stage            File Name                              Number of Rows
CSV from CMT     oreclipse_final.csv                    15,243
URL CSV          oreclipse_finalURL.csv                 2,411
Unique URL CSV   oreclipse_finalURL_condensed.csv       992
Processed TSV    oreclipse_finalURL_rst_bk.tsv          -

Processing Pipeline for eclipseglasses
Stage            File Name                                  Number of Rows
CSV from CMT     eclipseglasses_final.csv                   20,618
URL CSV          eclipseglasses_finalURL.csv                3,364
Unique URL CSV   oclipseglasses_finalURL_condensed.csv.csv  2,399
Processed TSV    eclipseglasses_finalURL_rst.tsv            -

Processing Pipeline for totaleclipse
Stage            File Name                                Number of Rows
CSV from CMT     totaleclipse_final.csv                   1,005,938
URL CSV          totaleclipse_finalURL.csv                41,735
Unique URL CSV   totaleclipse_finalURL_condensed.csv.csv  8,683
Processed TSV    totaleclipse_finalURL_rst.tsv            -

Processing Pipeline for totalsolareclipse
Stage            File Name                                     Number of Lines
CSV from CMT     totalsolareclipse_final.csv                   26,788
URL CSV          totalsolareclipse_finalURL.csv                4,703
Unique URL CSV   totalsolareclipse_finalURL_condensed.csv.csv  2,941
Processed TSV    totalsolareclipse_finalURL_rst.tsv            -

WARC Files Processed For HBase

Process Overview We were provided with WARC files from EFC runs in CS 6604. The HTML is extracted and cleaned, and the resulting information is uploaded to HBase.
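One way to implement the extraction step is with the warcio library, as in the sketch below, which yields the target URI and raw HTML for each HTML response record. This is illustrative rather than the exact code the team used.

```python
from warcio.archiveiterator import ArchiveIterator


def html_records(warc_path):
    # Yield (url, html) pairs for the HTML response records in a WARC file.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            yield url, html
```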

Collection Name               Number of Rows
oregon school shooting        521
Dunbar High School shooting   258
NIU shooting                  19
University Alabama shooting   328
Virginia Tech shooting        3,442
These collections were based on the WARC files provided to us; we speculate that their sizes were a function of iteration, growing bigger as the CS 6604 group worked throughout the semester.

Crawling We Did

Process Overview Starting from a URL list, the HTML is extracted and cleaned, and the resulting information is uploaded to HBase.
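A hedged sketch of the fetch-and-clean step, using requests and BeautifulSoup; the real cleaning also includes NER and language detection, which are omitted here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_clean(url):
    # Fetch one page from the URL list and reduce it to visible text.
    raw_html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):        # drop non-content elements
        tag.decompose()
    clean_text = " ".join(soup.get_text(separator=" ").split())
    return raw_html, clean_text                  # both versions are kept for HBase
```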

Collection Name    Number of Lines
Eclipse2017        1,344
VegasShooting      913
Hurricane Irma     2,726
These crawls were conducted throughout the semester; their sizes were based on hand-examining URLs to evaluate their relevance and on setting the EFC to conduct crawls that produce quality results.

Parallel Processing Big data! Better performance. We have finally exploited the advantages of the cluster.

PySpark What is Spark? Apache Spark is an open-source cluster-computing framework. What is PySpark? The Python Spark API, which lets us use Spark from Python (our processing script is written in Python). We want to use PySpark because our processing...

Parallel Processing I - PySpark
Total time used (crawling + cleaning), without NER and the language detector. Parallelized using PySpark with the default number of nodes. Unit: seconds.
URL size   Baseline (s)                 Parallel Stemming (s)           Parallel Crawling (s)
13         12.08366489, 15.4960489273   16.3642930984, 15.6808428764    21.1652750969, 20.149545908
46         -                            39.8726541996                   26.4784898758
1000       650.2592189                  714.77546                       480.311785936, 497.93241787
At that time we had not yet figured out how to run NER and the language detector on the cluster. Stemming is only one small step of cleaning, but it is the only step that does not depend on nltk_data or on libraries that were not installed on the other nodes. The table shows the total time used for crawling and cleaning, excluding the functions we could not yet run on the cluster. It is not a complete evaluation, so there are some blanks in the table, but it gives a sense of the results. We tried several different things. First we parallelized stemming; performance did not improve for either a small or a large URL size. We then parallelized the crawling step instead; performance improved, especially for relatively large URL sizes, which makes it feasible for us to process more URLs. Parallel crawling did show a speed-up, but was most likely bottlenecked by network constraints.
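For concreteness, here is a minimal sketch of what parallel crawling with PySpark could look like: the unique-URL list is distributed across the cluster and each executor fetches its share. The file names and the fetch function are illustrative assumptions, not the team's exact code.

```python
import requests
from pyspark import SparkContext


def fetch(url):
    # Runs on the executors; returns None for pages that could not be fetched.
    try:
        return url, requests.get(url, timeout=10).text
    except requests.RequestException:
        return url, None


sc = SparkContext(appName="parallel-crawl")       # default parallelism / number of nodes
urls = sc.textFile("unique_urls.csv")             # assumption: one URL per line
pages = urls.map(fetch).filter(lambda pair: pair[1] is not None)
pages.saveAsTextFile("crawled_pages")             # cleaning and the HBase load happen later
```

Distributing only the fetch step this way is consistent with the measurements above: the network-bound crawling parallelizes well, while cleaning stays serial until its dependencies are available on every node.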

Parallel Processing II - PySpark
Crawling time only. Parallelized using PySpark with the default number of nodes. Unit: seconds.
URL size   Parallel Crawling (s)
992        259.0064449
1000       288.0420101
2399       2097.015636
This is incomplete and missing a baseline, but it is all the information we have. Looking at the crawling time alone, without any cleaning, the parallelized crawling does not consume a lot of time. If we also find a way to parallelize the entire cleaning step, we should see much better performance.

Parallel Processing III - Processes and Pipes
WARC processing time, populating all fields. Parallelized using the Python multiprocessing module to create additional processes, with pipes for communication. Unit: minutes.
WARC File / Number of Records    No Parallelization (Local)   Parallelized (Local)   Parallelized (Cluster)
Dunbar High Shooting / 259       11.83                        10.74                  6.68
Reynolds High Shooting / 522     23.09                        21.91                  13.75
VT Shooting / 1941               n/a                          78.81                  52.14
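A sketch of the processes-and-pipes pattern named above: records are split into chunks, each chunk is sent to a worker process over a Pipe, and the cleaned results are sent back. `clean_record` is a hypothetical stand-in for the real field-population logic.

```python
from multiprocessing import Pipe, Process


def clean_record(record):
    # Hypothetical placeholder for the real per-record cleaning / field-population logic.
    return record


def worker(conn):
    # Receive one chunk of WARC records over the pipe, clean them, send the results back.
    chunk = conn.recv()
    conn.send([clean_record(r) for r in chunk])
    conn.close()


def process_records(records, n_workers=4):
    pipes, procs = [], []
    for i in range(n_workers):
        parent, child = Pipe()
        proc = Process(target=worker, args=(child,))
        proc.start()
        parent.send(records[i::n_workers])        # round-robin split of the records
        pipes.append(parent)
        procs.append(proc)
    results = []
    for parent, proc in zip(pipes, procs):
        results.extend(parent.recv())             # collect cleaned records from each worker
        proc.join()
    return results
```

Sending whole chunks rather than one record at a time keeps all workers busy and limits pipe traffic to one round trip per process.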

Future Work
Build a single pipeline for all sources of input (WARC, EFC, tweets)
Parallelize the EFC (feasible using PySpark + URL queues)
Speed up cleaning
Support phrases in NER (currently separate words)

Acknowledgements
Our team would like to thank Dr. Fox for his guidance during this project. We would also like to thank Liuqing Li for answering our questions and helping us get up and running. Our efforts contribute to, and benefit from, the work done on the Global Event and Trend Archive Research project, funded by NSF Grant IIS-1619028.