A Web Crawler Design for Data Mining

1 A Web Crawler Design for Data Mining
Mike Thelwall, University of Wolverhampton, Wolverhampton, UK. Journal of Information Science, 2001. 27 April 2011, IDB Lab Seminar. Presented by Jee-bum Park.

2 Outline
Introduction; Architecture; Implementation; System Testing

3 Introduction - Motive
The importance of the web has guaranteed academic interest in it, not only for affiliated technologies but also for its content.

4 Introduction - Motive
Information scientists and others wish to perform data mining on large numbers of web pages. They will require the services of a web crawler to extract patterns from the web and to extract meaning from the link structure of the web. Hence the necessity of an effective paradigm for a web mining crawler.

5 Introduction - Web Crawler
A web crawler (also called a robot or spider) is a program that is capable of iteratively and automatically downloading web pages, extracting URLs from their HTML, and fetching them.
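The download, extract, fetch loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the crawler from the paper: the `crawl` function, the `LinkExtractor` class, and the in-memory `SITE` dictionary (which stands in for real HTTP requests) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)                 # download the page
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)                 # extract URLs from the HTML
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:      # queue each new URL for fetching
                seen.add(absolute)
                queue.append(absolute)
    return order


# Hypothetical in-memory site standing in for real HTTP requests.
SITE = {
    "http://example.test/index.html":
        '<a href="login.php">Login</a> <a href="/board/index.php">Board</a>',
    "http://example.test/login.php": '<a href="index.html">Home</a>',
    "http://example.test/board/index.php": '<a href="index.php?id=2">Next</a>',
    "http://example.test/board/index.php?id=2": "",
}

visited = crawl("http://example.test/index.html", SITE.get)
```

Passing `SITE.get` as the fetch function keeps the sketch runnable offline; a real crawler would substitute a function that performs the HTTP request.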

6 Introduction - Web Crawler: Workflow
Diagram: a web crawler traversing an example site with the following structure.
/
    index.html
    login.php
/images/
    logo.gif
    menu.jpg
    bg.png
/board/
    index.php
    index.php?id=2
    index.php?id=3
/board/files/
    a.jpg
    b.txt

7 Introduction - Web Crawler: Architecture

8 Introduction - Web Crawler: Roles
A sophisticated web crawler may also: identify pages judged relevant to the crawl; reject pages as duplicates of ones previously visited; support the action of search engines, for example by constructing the searchable index.

9 Introduction - Web Crawler: Issue
In the normal course of operation, a simple crawler will spend most of its time awaiting data, both when requesting a web page and when receiving it. For this reason, crawlers are normally multi-threaded. If the crawling task requires more complex processing, the speed of the crawler will be reduced, so a distributed approach for crawlers is needed.
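To illustrate why multi-threading helps when most of the time is spent waiting, the sketch below fetches several pages through a thread pool so that the waits overlap. It is an assumption-laden toy: `slow_fetch` simulates network latency with `time.sleep`, and the URLs and the worker count are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(url, delay=0.05):
    """Stand-in for a network request: sleeps, then returns fake content."""
    time.sleep(delay)
    return f"<html>{url}</html>"


def fetch_all(urls, workers=8):
    """Fetch every URL using a pool of threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_fetch, urls))


# Hypothetical URLs; no real network traffic is generated.
urls = [f"http://example.test/page{i}.html" for i in range(8)]

start = time.perf_counter()
pages = fetch_all(urls)
elapsed = time.perf_counter() - start  # waits overlap instead of adding up
```

With one thread the eight simulated waits would run back to back; with eight workers they run concurrently, which is the effect the slide attributes to multi-threaded crawlers.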

10 Introduction - Distributed Systems
Distributed systems use idle computers connected to the internet to gain extra processing power and to distribute processing. For personal site-specific crawlers, a single personal computer solution may be fast enough. An alternative is a distributed model: a central control unit, with many crawlers operating on individual personal computers.

12 Architecture
Two components: the crawler/analyzer units and the control unit. Four constraints: (1) almost all processing should be conducted on idle computers; (2) the distributed architecture should not increase network traffic; (3) the system must be able to operate through a firewall; (4) the components must be easy to install and remove.

13 Architecture
Diagram: one control unit and eight crawler units.

14 Architecture - The Crawler/Analyzer Units
The program crawls a site or set of sites, analyzes the pages, and reports its results. It can execute on the type of computers on which there will be spare time, normally personal computers.

15 Architecture - The Crawler/Analyzer Units: Data Management
Each unit needs access to permanent storage space to save the web pages, either by linking to a database or by using the normal file storage system. Pages must be saved on each host computer in order to minimize network traffic. If the system is capable of handling enough data, a large-scale server-based database can be used. The unit must provide a facility for the user to delete all saved data.
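A rough sketch of the file-storage option, including the delete-all facility, might look as follows. The `LocalPageStore` class, its method names, and the choice to name files by a hash of the URL are assumptions for illustration, not details from the paper.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class LocalPageStore:
    """Saves downloaded pages as files under one directory on the host."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url):
        # Hash the URL so that any URL maps to a safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return self.root / f"{name}.html"

    def save(self, url, html):
        self._path_for(url).write_text(html, encoding="utf-8")

    def load(self, url):
        path = self._path_for(url)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def clear_all(self):
        """Delete every saved page, leaving an empty store directory."""
        shutil.rmtree(self.root, ignore_errors=True)
        self.root.mkdir(parents=True, exist_ok=True)


# Usage with a temporary directory so the example leaves little behind.
store = LocalPageStore(Path(tempfile.mkdtemp()) / "pages")
store.save("http://example.test/index.html", "<html>home</html>")
saved = store.load("http://example.test/index.html")
store.clear_all()
after_clear = store.load("http://example.test/index.html")
```

Keeping all pages under a single directory is what makes the delete-all operation a one-step removal, which fits the constraint that components be easy to install and remove.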

16 Architecture - The Crawler/Analyzer Units: Interface
Two controls: immediate stop; clear all data from the computer.

17 Architecture - The Control Unit
The control unit will live on a web server. It will be triggered when a crawler unit requests a job or sends some data. It will need to store the commands the owner wishes to be executed, with a status for each: completed, in progress, or unallocated.
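The job bookkeeping described here can be sketched as a small in-memory table. This is only an illustration under stated assumptions: the `ControlUnit` class, its method names, and the command strings are hypothetical, and a real control unit would sit behind a web server and persist its state.

```python
from enum import Enum


class Status(Enum):
    UNALLOCATED = "unallocated"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


class ControlUnit:
    """Holds the owner's commands and hands them out when a crawler asks."""

    def __init__(self, commands):
        self.jobs = {
            job_id: {"command": command, "status": Status.UNALLOCATED, "result": None}
            for job_id, command in enumerate(commands)
        }

    def request_job(self):
        """Triggered by a crawler asking for work; returns (id, command) or None."""
        for job_id, job in self.jobs.items():
            if job["status"] is Status.UNALLOCATED:
                job["status"] = Status.IN_PROGRESS
                return job_id, job["command"]
        return None

    def submit_result(self, job_id, result):
        """Triggered by a crawler sending data back for a job."""
        job = self.jobs[job_id]
        job["result"] = result
        job["status"] = Status.COMPLETED

    def summary(self):
        """Count jobs per status."""
        counts = {status.value: 0 for status in Status}
        for job in self.jobs.values():
            counts[job["status"].value] += 1
        return counts


# Hypothetical commands; the command format is an assumption.
control = ControlUnit([
    "crawl http://a.example.test/",
    "crawl http://b.example.test/",
    "crawl http://c.example.test/",
])
first = control.request_job()
second = control.request_job()
control.submit_result(first[0], {"pages": 12})
report = control.summary()
```

Because the crawler units always initiate the exchange (requesting a job or sending data), a design like this is compatible with the constraint that the system operate through a firewall.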

18 Architecture
Diagram: one control unit and eight crawler units.

20 Implementation - The Crawler/Analyzer Units
The architecture was employed to create a system for analyzing the link structure of university web sites

21 Implementation - The Crawler/Analyzer Units
Previous system: a single crawler/analyzer program. Issues: it did not run quickly enough; it had to be individually set up and run on a number of computers; this was inefficient in terms of both human time and processor use. New system: the existing stand-alone crawler was used as the basis; communication and easy-installation features were added; buttons were added to instantly close the program and remove any saved data; the program was processed by a compressor for easy distribution.

22 Implementation - The Crawler/Analyzer Units
There is a choice of types of checking for duplicate pages: no page checking, HTML page checking, or weak HTML page checking. Comparison methods: comparing each page against all of the others (naive), or calculating various numbers from the text of each page, for example the length of the page or an MD5 or SHA-1 hash.
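The second comparison method, calculating numbers from the text of each page, can be sketched as a fingerprint lookup. The sketch below assumes exact-match checking with a (length, SHA-1) pair as the fingerprint; the `fingerprint` function and `DuplicateChecker` class are illustrative names, and the slides do not specify how the weak HTML check differs, so it is not modelled here.

```python
import hashlib


def fingerprint(html):
    """Return (length in bytes, SHA-1 hex digest) of the page text."""
    data = html.encode("utf-8")
    return len(data), hashlib.sha1(data).hexdigest()


class DuplicateChecker:
    """Remembers one URL per fingerprint, so each check is a dictionary lookup."""

    def __init__(self):
        self._seen = {}

    def check(self, url, html):
        """Return the URL of an earlier identical page, or None if new."""
        key = fingerprint(html)
        if key in self._seen:
            return self._seen[key]
        self._seen[key] = url
        return None


checker = DuplicateChecker()
r1 = checker.check("http://example.test/a.html", "<html>same</html>")
r2 = checker.check("http://example.test/b.html", "<html>same</html>")
r3 = checker.check("http://example.test/c.html", "<html>different</html>")
```

Compared with the naive method of comparing each page against all of the others, a fingerprint table needs only one lookup per page, at the cost of detecting only pages whose text is byte-for-byte identical.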

23 Implementation - The Control Unit
The control unit was entirely new. It was given a reporting facility that delivers statistics summarizing the crawlers.

25 System Testing
Conducted in June and July of 2000. Two items: a set of sites or web pages to download; an analysis to perform on the downloaded sites.

26 System Testing - Result
The total number of crawler units peaked at just over 100, with three rooms of computers. The system completed 9112 tasks and downloaded over 100,000 pages. Each crawler used approximately 1 GB of hard disk space. The system had become a virtual computer with over 100 GB of disk space and over 100 processors.

27 System Testing - Limitations
The system was not able to run fully automatically. The problem was randomly generated web pages, for example a huge set of web pages containing usage statistics for electronic equipment, with one page per device per day. The solution was to manually check the root cause of the problem and to add the offending URLs to a banned list operated by the control unit. An alternative is to design a heuristic to avoid such problems, for example a maximum crawl depth.
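Both remedies, the banned list and the maximum crawl depth, can be expressed as a filter applied before a URL is queued. The sketch below is an illustration only: `should_fetch`, `links_of`, and the endless example statistics site are hypothetical, and depth is measured here as the number of links followed from the start page, which is one possible reading of "crawl depth".

```python
from collections import deque


def should_fetch(url, depth, banned_prefixes, max_depth):
    """Return True if the URL passes both the depth limit and the banned list."""
    if depth > max_depth:
        return False
    return not any(url.startswith(prefix) for prefix in banned_prefixes)


def links_of(url):
    """Hypothetical site: each statistics page links to the next day's page, forever."""
    if url == "http://example.test/":
        return ["http://example.test/about.html", "http://example.test/stats/day1"]
    if "/stats/day" in url:
        day = int(url.rsplit("day", 1)[1])
        return [f"http://example.test/stats/day{day + 1}"]
    return []


def crawl(start, banned_prefixes=(), max_depth=3):
    """Breadth-first crawl that tracks link depth for every queued URL."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue:
        url, depth = queue.popleft()
        fetched.append(url)
        for link in links_of(url):
            if link not in seen and should_fetch(link, depth + 1, banned_prefixes, max_depth):
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


depth_limited = crawl("http://example.test/", max_depth=3)
banned = crawl("http://example.test/", banned_prefixes=("http://example.test/stats/",))
```

Without either check this crawl would never terminate, since every statistics page generates a new link. The depth limit stops it automatically but still fetches a few generated pages, while the banned list avoids them entirely but requires someone to identify the prefix first.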

29 Conclusion
The distributed architecture has shown itself capable of crawling a large collection of web sites by using idle processing power and disk space. The testing of the system has shown that it cannot operate fully automatically without an effective heuristic for identifying duplicate pages.

30 Conclusion
The architecture is particularly suited to situations where a task can be decomposed into a collection of crawling-based tasks. It would be unsuitable if the crawls had to cross-reference each other or if the data mining had to be performed in an integrated way. The architecture is an effective way to use idle computing resources in order to perform large-scale web data mining tasks.

31 Any Questions or Comments?
Thank You!
