Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.


1 Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo

2 Contents Introduction Applications Features Algorithms Experiments

3 Introduction Large number of web pages on the World Wide Web. Web information retrieval tasks: crawling, searching, extracting knowledge bases (KBs), …

4 Introduction Subject classification: consider the subject or topic of web page. Example: “business”, “sports”,… Functional classification: role of web pages. Example: course page, researcher homepage,…

5 Applications Improving the quality of search results. Building a focused crawler. Extracting KBs.

6 Improving Search Results Solve query ambiguity: the user is asked to specify a category before searching (Chekuri et al. [1997]). Present a categorized view of the results to users (Kaki [2005]).

7 Building Focused Crawler When only domain-specific queries are expected, performing a full crawl is usually inefficient. Only documents relevant to a predefined set of topics are of interest. (Chakrabarti et al. [1999])

8 Extracting KBs Store complex structured and unstructured information from the World Wide Web to create a computer-understandable environment. First step: recognize class instances by classifying the web's content. (Craven et al. [1998])

9 Feature Selection Textual contents, HTML tags, hyperlinks, anchor texts On-page features Neighbors features

10 On-page Features
Textual content:
▫ Bag-of-words
▫ N-gram representation: n consecutive words (Mladenic [1998]). Example: "New York" yields the features "new york", "new", and "york".
HTML tags: Ardo [2005]
URL: Kan and Thi [2005], Sujatha [2013]. Advantage: reduces processing time.
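A minimal sketch of the bag-of-words and n-gram features described above. The function names `word_ngrams` and `ngram_features` are illustrative, and whitespace tokenization with lowercasing is an assumption; the slides do not say how the text is tokenized.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return every run of n consecutive words in `text`, lowercased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n=2):
    """Count all 1-grams up to max_n-grams; max_n=1 is plain bag-of-words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts

features = ngram_features("New York", max_n=2)
print(sorted(features))  # ['new', 'new york', 'york']
```

Setting `max_n=1` reduces the representation to bag-of-words, so the two textual features on this slide share one code path.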

11 Neighbors Features (1) Weak assumption: the neighbors of pages that belong to the same category share common characteristics. Strong assumption: a page is much more likely to be surrounded by pages of the same category.

12 Neighbors Features (2)

13 Neighbors Features (3) Sibling pages are more useful than parents and children (Chakrabarti et al. [1998], Qi and Davison [2006]). The content of neighbors needs to be sufficiently similar to the target page (Oh et al. [2000]). Using a portion of the content on parent and child pages: title, anchor text, and the text surrounding the anchor text on the parent pages.

14 Algorithms k-NN Co-training Naïve Bayes

15 K-NN Kwon and Lee [2000] Bag-of-words
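A minimal sketch of k-NN classification over bag-of-words vectors. This is plain k-NN with cosine similarity and majority voting, not the specific method of Kwon and Lee, whose details are not given on the slide. The toy training pages and the names `bag_of_words`, `cosine`, and `knn_classify` are assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, training, k=3):
    """Label `text` by majority vote among its k most similar training pages."""
    query = bag_of_words(text)
    ranked = sorted(training,
                    key=lambda pair: cosine(query, bag_of_words(pair[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training pages, reduced to a few words each.
training = [
    ("course syllabus homework lecture exam", "course"),
    ("lecture notes homework assignments", "course"),
    ("professor research publications teaching", "faculty"),
    ("professor publications students research group", "faculty"),
]

print(knn_classify("homework and exam schedule for the lecture", training))
```

The classifier stores the training pages and defers all work to query time, which is why k-NN needs no separate training step.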

16 Co-training Blum and Mitchell [1998] Labeled and unlabeled data. Two classifiers that are trained on different sets of features are used to classify the unlabeled instances. The prediction of each classifier is used to train the other.
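A minimal sketch of the co-training loop, under several assumptions: each page has two views (here, page text and anchor text), the base learner is a simple per-class word-count scorer rather than any particular classifier, and each view labels one unlabeled page per round. The helper names and the toy data are hypothetical. Both classifiers add their confident predictions to a shared labeled pool, so each one ends up training on labels produced by the other.

```python
from collections import Counter, defaultdict

def train(examples, view):
    """Per-class word counts for one view; examples are ((view0, view1), label)."""
    counts = defaultdict(Counter)
    for views, label in examples:
        counts[label].update(views[view].split())
    return counts

def predict(counts, text):
    """Return (label, confidence): the best class and its share of word matches."""
    words = text.split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, (scores[best] / total if total else 0.0)

def co_train(labeled, unlabeled, rounds=5):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            model = train(labeled, view)
            # Pick the unlabeled page this view's classifier is most sure about.
            scored = [(predict(model, views[view]), views) for views in unlabeled]
            (label, conf), views = max(scored, key=lambda s: s[0][1])
            if conf == 0.0:
                continue  # this view has no evidence for any remaining page
            unlabeled.remove(views)
            labeled.append((views, label))  # the other view trains on this label
    return train(labeled, 0), train(labeled, 1), labeled

# Hypothetical data: (page text, anchor text) pairs.
labeled = [
    (("homework lecture exam", "course page"), "course"),
    (("research publications professor", "faculty homepage"), "faculty"),
]
unlabeled = [
    ("lecture schedule syllabus", "cs101 syllabus"),
    ("syllabus grading policy", "course syllabus"),
    ("professor biography awards", "prof smith"),
    ("awards and honors", "faculty awards"),
]

text_model, anchor_model, pool = co_train(labeled, unlabeled)
for views, label in pool:
    print(label, views)
```

In this toy run the page text "syllabus grading policy" shares no words with the initial labeled text, so the text classifier cannot label it; the anchor-text classifier can, and the text classifier then learns from that label.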

17 Web Page Classification using Naive Bayes
Bernoulli model: a document is represented by a feature vector with binary elements, taking value 1 if the corresponding word is present in the document and 0 if it is not.
▫ E.g.: for a six-word vocabulary and the short document “the blue dog ate a blue biscuit”, the Bernoulli feature vector is $b = (1, 0, 1, 0, 1, 0)^T$.
Given a web page D whose class is C, we classify D as the class with the highest posterior probability: $P(C \mid D) \propto P(D \mid C)\, P(C)$
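A minimal sketch of building the Bernoulli feature vector. The vocabulary below is an assumption chosen so that the example document reproduces b = (1, 0, 1, 0, 1, 0); the original word list is not shown here. `bernoulli_vector` is an illustrative name.

```python
def bernoulli_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Assumed vocabulary (not taken from the slides).
vocabulary = ["blue", "red", "dog", "cat", "biscuit", "apple"]

b = bernoulli_vector("the blue dog ate a blue biscuit", vocabulary)
print(b)  # [1, 0, 1, 0, 1, 0]
```

Note that "blue" occurs twice in the document but still contributes a single 1: the Bernoulli model records presence, not frequency.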

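18 Web Page Classification using Naive Bayes
The document likelihood: $P(D_i \mid C) = \prod_{t=1}^{|V|} \left[ b_{it}\, P(w_t \mid C) + (1 - b_{it})\,(1 - P(w_t \mid C)) \right]$
Where:
▫ $b_i$: the Bernoulli feature vector of document $D_i$, with elements $b_{it}$; $|V|$ is the vocabulary size.
▫ $P(w_t \mid C)$: the probability of word $w_t$ occurring in a document of class C, estimated as $\hat{P}(w_t \mid C = k) = n_k(w_t) / N_k$.
▫ $n_k(w_t)$: the number of documents of class C = k in which $w_t$ is observed.
▫ $N_k$: the total number of documents of class C = k.
The prior term: $\hat{P}(C = k) = N_k / N$, where N is the total number of training documents.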
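A minimal sketch of training and applying a Bernoulli Naive Bayes classifier, assuming the standard relative-frequency estimates: the word probability is n_k(w_t) / N_k and the prior is N_k / N. Two things here are assumptions that do not come from the slides: add-one smoothing, included so that no probability is exactly 0 or 1, and the toy vocabulary and documents. Log probabilities are used to avoid underflow on long vocabularies.

```python
import math

def to_vector(text, vocabulary):
    """Binary presence vector of `text` over `vocabulary`."""
    present = set(text.lower().split())
    return [1 if w in present else 0 for w in vocabulary]

def train_bernoulli_nb(vectors, labels):
    """Return {class: (prior, [P(w_t | class) for each word t])}."""
    n_docs = len(vectors)
    model = {}
    for k in set(labels):
        class_vecs = [v for v, y in zip(vectors, labels) if y == k]
        n_k = len(class_vecs)  # N_k: number of documents of class k
        # n_k(w_t) is the count of class-k documents containing word t.
        # Add-one smoothing (an assumption, not on the slides) avoids log(0).
        word_prob = [(sum(col) + 1) / (n_k + 2) for col in zip(*class_vecs)]
        model[k] = (n_k / n_docs, word_prob)
    return model

def classify(model, b):
    """Return the class with the highest log posterior for binary vector b."""
    def log_posterior(k):
        prior, word_prob = model[k]
        score = math.log(prior)
        for b_t, p in zip(b, word_prob):
            score += math.log(p if b_t else 1.0 - p)
        return score
    return max(model, key=log_posterior)

# Hypothetical vocabulary and training pages.
vocabulary = ["homework", "lecture", "exam", "research", "publications", "professor"]
texts = ["homework lecture exam", "lecture exam schedule",
         "research publications professor", "professor research group"]
labels = ["course", "course", "faculty", "faculty"]

model = train_bernoulli_nb([to_vector(t, vocabulary) for t in texts], labels)
print(classify(model, to_vector("exam and homework dates", vocabulary)))
```

Absent words matter in this model: each vocabulary word missing from the page contributes a factor of 1 - P(w_t | C), which is what distinguishes the Bernoulli model from a count-based one.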

19 Experimental Results
Dataset: WebKB
▫ Contains 8145 web pages.
▫ Seven categories: student, faculty, staff, course, project, department, and other.
▫ Data are collected from departments at four universities (Cornell, Texas, Washington, Wisconsin), plus some pages from other universities.
Experimental setup:
▫ Select the four most populous categories: student, faculty, course, and project.
▫ Training data: Cornell, Washington, Texas, and miscellaneous pages collected from other universities.
▫ Testing data: Wisconsin.

20 Experimental Results
Experimental result:

Classes               Faculty   Course   Student   Project
# of training pages   1082      845      1485      479
# of testing pages    42        85       156       25
Accuracy              0.8182    0.8851   0.7595    0.8148

21 THANK YOU

