
Chapter 7: Web Content Mining. DSCI 4520/5240, Dr. Nick Evangelopoulos


1 Chapter 7: Web Content Mining. DSCI 4520/5240, Dr. Nick Evangelopoulos

2 Introduction
- Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents:
  - textual
  - audio
  - video
  - still images
  - metadata
  - hyperlinks

3 Introduction
- Problems with Web data:
  - Distributed data
  - Large volume
  - Unstructured data
  - Redundant data
  - Quality of data
  - High percentage of volatile data
  - Varied data

4 Introduction
- Two approaches to Web content mining:
  - Agent-based: software agents perform the content mining
  - Database-oriented: view the Web data as belonging to a database

5 Web Crawler
- A computer program that navigates the hypertext structure of the Web
- Crawlers are used to ease the formation of the indexes used by search engines
- The pages that the crawler begins with are called the seed URLs
- A periodic crawler visits a number of pages, builds an index, and then replaces the current index
- It is known as a periodic crawler because it is activated periodically
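The crawler behaviour on this slide can be sketched as follows. This is a minimal illustration, not the course's Crawl class: the class name SeedCrawlSketch, the crawl method, the maxPages limit, and the in-memory map that stands in for fetching a page and extracting its links are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SeedCrawlSketch {
    // Visits pages reachable from the seed URLs, breadth-first, up to maxPages.
    // 'web' is a hypothetical stand-in for "fetch the page and extract its links".
    static Set<String> crawl(List<String> seeds, Map<String, List<String>> web, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();       // keeps visit order
        Deque<String> frontier = new ArrayDeque<>(seeds);  // URLs waiting to be visited
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.removeFirst();
            if (!visited.add(url)) {
                continue; // already visited
            }
            for (String link : web.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) {
                    frontier.addLast(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of("a", "d"),
            "d", List.of());
        System.out.println(crawl(List.of("a"), web, 10)); // prints [a, b, c, d]
    }
}
```

In a periodic crawler, the set returned by one run would be used to build a fresh index that replaces the previous one.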

6 Web Crawler
- Another type is the focused crawler
- Generally recommended because of the large size of the Web
- Visits only pages related to topics of interest
- If a page is not pertinent, the entire set of possible pages below it is pruned
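The pruning rule on this slide can be illustrated with a short sketch. The names FocusedCrawlSketch and isPertinent, and the in-memory link map, are assumptions for the example; a real focused crawler would decide pertinence with a trained classifier rather than a simple predicate.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class FocusedCrawlSketch {
    // Returns the pertinent pages reached from the seed.
    // Links of a non-pertinent page are never followed, which prunes everything below it.
    static List<String> crawl(String seed, Map<String, List<String>> links,
                              Predicate<String> isPertinent) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.removeFirst();
            if (!isPertinent.test(page)) {
                continue; // prune: do not expand this page
            }
            kept.add(page);
            for (String next : links.getOrDefault(page, List.of())) {
                if (seen.add(next)) {
                    frontier.addLast(next);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "mining", List.of("mining/text", "sports"),
            "sports", List.of("mining/hidden"),
            "mining/text", List.of());
        // "sports" is not pertinent, so "mining/hidden" below it is never visited.
        System.out.println(crawl("mining", links, p -> p.startsWith("mining")));
    }
}
```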

7 Web Crawler
- Crawling process:
  - Begin with a group of URLs
    - Submitted by users
    - Common URLs
  - Traverse breadth-first or depth-first
  - Extract more URLs
- Numerous crawlers:
  - Problem of redundancy
  - Web partition: one robot per partition
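One way to avoid redundant work among several crawlers is to assign each URL to exactly one robot. The sketch below partitions by host name; the slide does not say how the Web is partitioned, so hashing the host, the class name WebPartitionSketch, and the method robotFor are assumptions chosen only for illustration.

```java
import java.net.URI;
import java.util.Locale;

class WebPartitionSketch {
    // Maps a URL to a robot index in [0, robotCount) based on its host name,
    // so that all pages of one host are crawled by the same robot.
    static int robotFor(String url, int robotCount) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), robotCount);
    }

    public static void main(String[] args) {
        int robots = 4;
        for (String url : new String[] {
                "http://example.org/index.html",
                "http://example.org/about.html",
                "http://example.com/" }) {
            System.out.println(url + " -> robot " + robotFor(url, robots));
        }
    }
}
```

Because the assignment depends only on the host, two robots never fetch the same page, at the cost of uneven load when a few hosts are very large.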

8 Focused Crawler
- The focused crawler structure consists of two major parts:
  - The distiller
  - The hypertext classifier

9 Focused Crawler
- The pages that the crawler visits are selected using a priority-based structure
- The structure is ordered by the priorities that the classifier and the distiller assign to pages
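A priority-based frontier of this kind can be sketched with a priority queue. The class name PriorityFrontierSketch and the single numeric priority are assumptions for the example; the slide does not specify how the classifier and distiller scores are combined, so the sketch simply takes a ready-made priority value.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class PriorityFrontierSketch {
    private static final class Candidate {
        final String url;
        final double priority;

        Candidate(String url, double priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    // Highest priority first.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
        Comparator.comparingDouble((Candidate c) -> c.priority).reversed());

    // Registers a page with the priority assigned to it (hypothetical combined score).
    void add(String url, double priority) {
        queue.add(new Candidate(url, priority));
    }

    // Returns the URL to visit next, or null when the frontier is empty.
    String next() {
        Candidate best = queue.poll();
        return best == null ? null : best.url;
    }

    public static void main(String[] args) {
        PriorityFrontierSketch frontier = new PriorityFrontierSketch();
        frontier.add("http://example.org/a", 0.2);
        frontier.add("http://example.org/b", 0.9);
        frontier.add("http://example.org/c", 0.5);
        String url;
        while ((url = frontier.next()) != null) {
            System.out.println(url); // visits b, then c, then a
        }
    }
}
```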

10 Focused Crawler
- Sample documents are identified and classified based on a hierarchical classification tree
- These documents are used as the seed documents to begin the focused crawling

11 Context Graph
- Work on focused crawling proposed the use of context graphs, which led to the context focused crawler (CFC)
- The CFC performs crawling in two steps:
  - Context graphs and classifiers are constructed using a set of seed documents as a training set
  - Crawling is performed using the classifiers to guide it

12 Context Graph

13 Implementation of a Web Crawler
- Wget is a free GNU utility that makes it possible to retrieve Web documents
- Wget supports Internet protocols:
  - HTTP (Hypertext Transfer Protocol)
  - FTP (File Transfer Protocol)
- It can recursively browse through the structure of HTML documents and FTP directory trees
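A recursive Wget retrieval could be launched from Java roughly as sketched below. The long options used (--recursive, --level, --no-parent, --directory-prefix) are standard Wget options; the class name WgetCommandSketch, the method recursiveFetch, and the example URL and directory are assumptions. The sketch only builds and prints the command line and does not run it.

```java
import java.util.List;

class WgetCommandSketch {
    // Builds a Wget command line for a recursive download limited to 'depth' levels.
    // --no-parent keeps the crawl below the starting directory;
    // --directory-prefix sets the folder where retrieved files are saved.
    static List<String> recursiveFetch(String url, int depth, String outputDir) {
        return List.of(
            "wget",
            "--recursive",
            "--level=" + depth,
            "--no-parent",
            "--directory-prefix=" + outputDir,
            url);
    }

    public static void main(String[] args) {
        List<String> command = recursiveFetch("http://example.org/docs/", 2, "mirror");
        System.out.println(String.join(" ", command));
        // prints: wget --recursive --level=2 --no-parent --directory-prefix=mirror http://example.org/docs/
    }
}
```

To actually run the download, the list could be passed to new ProcessBuilder(command), assuming Wget is installed and on the PATH.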

14 Commonly Used Options for Wget

15 Methods for Crawl Class

16 Crawl Class
Figure 7.7: Code from the main method of the Crawl class (suitable for Java programmers)

17 The readContent Method of Crawl Class
Figure 7.8: Code from the readContent method of the Crawl class (suitable for Java programmers)
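The code of Figure 7.8 is not reproduced in this transcript. As a rough idea of what reading a page's content involves, here is a minimal sketch; it is not the Crawl class's readContent method, and the names ReadContentSketch and readAll are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadContentSketch {
    // Reads everything from 'source' line by line and returns it as one string,
    // with each line terminated by '\n'.
    static String readAll(Reader source) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for a network stream so the sketch runs offline.
        String html = readAll(new StringReader("<html>\n<body>hello</body>\n</html>"));
        System.out.print(html);
    }
}
```

For a live page, the Reader could wrap the stream returned by URI.create(address).toURL().openStream() in an InputStreamReader with an explicit charset.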

18 Code for Extracting Links from Crawl Class
Figure 7.9
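Figure 7.9 is likewise missing from this transcript. The sketch below shows one simple way to pull link targets out of HTML text with a regular expression; it is not the code from the figure, the names LinkExtractorSketch and extractLinks are assumptions, and a regex handles only well-formed quoted href attributes, so a real crawler would more likely use an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Matches <a ... href="target"> or <a ... href='target'>, case-insensitively,
    // and captures the target in group 1.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values of all anchor tags, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.org/a\">A</a> "
                    + "<A class='x' HREF='b.html'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [http://example.org/a, b.html]
    }
}
```

The extracted links are what a crawler would add to its list of URLs still to visit; relative links such as b.html would first need to be resolved against the page's own URL.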

19 Thank you for your attention

