1 CPSC 8985 Fall 2015 P10 Web Crawler Mike Schmidt

2 Overview
A web crawler is a script or piece of code that goes out onto the Internet and pulls information or data (a minimal crawl loop is sketched below).
A web crawler can be trained to look only for certain information; extracting that information and saving it into a database is known as web scraping.
Analysis can then be performed on the stored data to show trends or similarities between data sets.
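The slides do not include code, but the crawl loop they describe might look like the following sketch, written with the Jsoup library introduced on the next slide. The seed URL and page limit are placeholder assumptions, not details from this project.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        // Placeholder seed URL and page limit -- not taken from the project.
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");
        int maxPages = 10;

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // skip pages we have already crawled
            }
            // Fetch and parse the page with Jsoup.
            Document page = Jsoup.connect(url).get();
            System.out.println(page.title() + " -> " + url);

            // Queue unseen outgoing links for later crawling.
            for (Element link : page.select("a[href]")) {
                String next = link.absUrl("href");
                if (!next.isEmpty() && !visited.contains(next)) {
                    frontier.add(next);
                }
            }
        }
    }
}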

3 Architecture
Java – the language in which the business objects and data access objects are written.
Jsoup – the Java library the application uses to pull HTML elements from web pages.
MongoDB – a NoSQL database the application uses to store the information collected from the web.

4 MongoDB
The name Mongo comes from the word "humongous," as MongoDB is built to store massive amounts of data.
MongoDB is a NoSQL database that stores information as JSON-like document objects (a small insert sketch follows this slide).
Mongo databases can be spread across multiple servers, which makes them well suited to large data sets that need to be accessed in a timely manner.
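As a rough illustration of storing a scraped record as a document object, here is a sketch using the MongoDB Java driver's MongoClients API (the 2015-era driver used new MongoClient(...) instead). The connection string, database, collection, and field names are assumptions for illustration, not taken from the project.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.Date;

public class MongoStoreSketch {
    public static void main(String[] args) {
        // Connection string, database, and collection names are illustrative assumptions.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("webcrawler");
            MongoCollection<Document> weather = db.getCollection("weather");

            // A scraped record stored as a JSON-like document object.
            Document reading = new Document("city", "St. Louis")
                    .append("tempF", 72)
                    .append("source", "example.com")
                    .append("scrapedAt", new Date());
            weather.insertOne(reading);
        }
    }
}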

5 JSoup
The jsoup Java library is used to parse web pages into elements using HTML tags and attributes.
Jsoup breaks pages down using CSS selectors and jQuery-like methods (see the selector sketch below).
Scraped jsoup elements can easily be added to a document object, which is then sent to the MongoDB server.
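A small sketch of jsoup's CSS-selector style of extraction. The HTML snippet and class names are made up for illustration; a real page would be fetched with Jsoup.connect(url).get() as in the crawler sketch above.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectSketch {
    public static void main(String[] args) {
        // Stand-in HTML; the class names are illustrative, not from a real site.
        String html = "<div class='score'>"
                + "<span class='team'>Cardinals</span>"
                + "<span class='runs'>5</span>"
                + "</div>";

        Document page = Jsoup.parse(html);

        // CSS/jQuery-like selectors pull out the elements of interest.
        Element score = page.select("div.score").first();
        String team = score.select("span.team").first().text();
        String runs = score.select("span.runs").first().text();

        System.out.println(team + " scored " + runs);
    }
}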

6 Scraped Data
This application scrapes data live from the Internet (weather, sports scores, and movie listings).
The collected data is stored in a Mongo database where analysis can be performed (a query sketch follows this slide).
Scraping lets users pull information from multiple sources and aggregate it in one central location.
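As an example of the kind of analysis the slide mentions, here is a query sketch against the hypothetical weather collection from the earlier insert sketch; the field names and threshold are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class ScrapedDataQuerySketch {
    public static void main(String[] args) {
        // Database, collection, and field names match the earlier (assumed) insert sketch.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> weather = client.getDatabase("webcrawler")
                    .getCollection("weather");

            // Simple analysis: readings above 70 degrees F, most recent first.
            for (Document d : weather.find(Filters.gt("tempF", 70))
                                     .sort(Sorts.descending("scrapedAt"))) {
                System.out.println(d.getString("city") + ": " + d.get("tempF") + " F");
            }
        }
    }
}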

7 Live Demo of Application

