Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.

1 Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender

2 Outline
• Introduction to Classification
• Background
  - Classification Types
  - Classification Methods
• Applications
• Features
• Algorithms
• Evolution of Websites

3 What is web page classification?
• The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business)
• Classification is generally posed as a supervised learning problem
  - A set of labeled data is used to train a classifier, which is then applied to label future examples
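
As a rough illustration of the train-then-label workflow described on this slide, the sketch below fits a tiny multinomial Naive Bayes classifier on labeled page texts and applies it to an unseen page. The choice of Naive Bayes, the function names, and the toy training data are assumptions made for the example; the slides do not prescribe a particular classifier.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of training pages
    vocabulary = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the probability.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled data (made up for illustration).
training_pages = [
    ("team wins the final game", "sports"),
    ("player scores in the match", "sports"),
    ("stocks fall as market closes", "business"),
    ("company reports quarterly profit", "business"),
]
model = train(training_pages)
predicted = classify(model, "the team scores in the final match")
```

Here `predicted` is the label assigned to a page that was not part of the training set, which is the "future example" in the slide's wording.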

4 Background - Classification Types
• The supervised learning problem can be broken into sub-problems:
  - Subject Classification
  - Functional Classification
  - Sentiment Classification
  - Other types of Classification

5 Subject Classification
• Concerned with the subject or topic of the web page
  - Judging whether a page is about arts, business, sports, etc.
Functional Classification
• Concerned with the role that the page plays
  - Deciding whether a page is a personal homepage, course page, admissions page, etc.

6 Sentiment Classification
• Focuses on the opinion that is presented in a web page
Other types of Classification
• Such as genre classification and search engine spam classification

7 Background - Classification Methods
• Binary vs. Multiclass
• Single Label vs. Multi Label
• Soft vs. Hard
• Flat vs. Hierarchical

8 Binary vs. Multiclass Classification

9 Single-Label vs. Multi-Label Classification

10 Soft vs. Hard Classification

11 Flat vs. Hierarchical Classification
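
The slides for these distinctions carry only titles, so the following is a minimal sketch under the usual reading of the terms: a hard classifier commits to a category, a soft classifier reports a probability per category, and a multi-label classifier may return several categories at once. The function names, the scores, and the threshold are illustrative assumptions.

```python
def soft_labels(scores):
    """Soft classification: normalise raw scores into a probability per category."""
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}

def hard_label(scores):
    """Hard, single-label classification: commit to exactly one category."""
    return max(scores, key=scores.get)

def multi_labels(scores, threshold):
    """Multi-label classification: keep every category whose probability clears the threshold."""
    probabilities = soft_labels(scores)
    return sorted(c for c, p in probabilities.items() if p >= threshold)

# Hypothetical raw classifier scores for one page over three categories (multiclass).
scores = {"sports": 6.0, "news": 3.0, "business": 1.0}
probabilities = soft_labels(scores)
single = hard_label(scores)
several = multi_labels(scores, threshold=0.25)
```

A binary classifier is the special case in which `scores` has only two categories.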

12 Applications
• Why is classification important and how can we use it efficiently?

13 Constructing, maintaining, or expanding web directories
• Web directories provide an efficient way to browse for information within a predefined set of categories
• Example: Open Directory Project (ODP)
• Currently constructed by human effort
  - 78,940 ODP editors

14 Improving the quality of search results
• A big problem with search results is search ambiguity

15 Helping question answering systems
• Classification systems can be used to help improve the quality of answers
• Example: Wolfram Alpha
Other applications
• Contextual advertising

17 Features
• Which features can we extract from a web page to help classify it?

18 Features - Introduction
• Because of features such as the hyperlink …, webpage classification is vastly different from other forms of classification such as plaintext classification.
• Features are organized into two groups:
  ○ On-page features: located directly on the page
  ○ Neighbor features: found on related pages

19 On Page Features
• Textual Contents & Tags
  - Bag-of-words
    ○ N-gram feature
      - Rather than analyzing individual words, group them into sequences of n consecutive words.
      - Ex: "New York" vs. "new ….. ….. York"
      - Yahoo! has used a 5-gram feature
  - HTML tags: title, heading, metadata, main text
  - URL
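
The difference between a plain bag of words and an n-gram feature can be sketched as follows. The function name and the example sentences are assumptions for illustration; with n = 1 the function reduces to an ordinary bag of words.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count contiguous word n-grams in the text (n=1 gives a plain bag of words)."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# "New York" as a phrase vs. "new" and "York" appearing far apart.
adjacent = ngram_counts("I moved to New York", 2)
separated = ngram_counts("A new bridge opened in York", 2)
separated_words = ngram_counts("A new bridge opened in York", 1)
```

Both sentences contain the words "new" and "york", so a bag of words cannot tell them apart on that basis, while only the first contains the bigram "new york".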

20 On Page Features
• Visual Analysis
  - Each page has two representations
    ○ Text via HTML
    ○ Visual via the browser
  - Each page can be represented as a visual adjacency multigraph

21 Features of Neighbors
• What happens when a page’s features are missing or are unrecognizable?

22 Features of Neighbors
• Assumptions
  - If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page.
  - Linked pages are more likely to have terms in common

23 Features of Neighbors
• Neighbor Selection
  - Focus on pages within 2 steps of the target
  - 6 types: parent, child, sibling, spouse, grandparent, and grandchild
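
The six neighbor types can be read off a directed link graph. The sketch below assumes the common definitions (a parent links to the target, a child is linked from it, siblings share a parent, spouses share a child); the `links` dictionary and page names are made up for the example.

```python
def neighbors(links, target):
    """Return the six neighbor sets of `target` in a directed link graph.

    `links` maps each page to the set of pages it links to.
    """
    def children_of(page):
        return links.get(page, set())

    def parents_of(page):
        return {source for source, targets in links.items() if page in targets}

    parents = parents_of(target)
    children = set(children_of(target))
    result = {
        "parent": parents,
        "child": children,
        "sibling": {s for p in parents for s in children_of(p)},
        "spouse": {s for c in children for s in parents_of(c)},
        "grandparent": {g for p in parents for g in parents_of(p)},
        "grandchild": {g for c in children for g in children_of(c)},
    }
    # A page is not counted as its own neighbor.
    return {kind: pages - {target} for kind, pages in result.items()}

# Hypothetical link graph: root -> hub -> {a, b}, a -> c, b -> c, c -> d.
links = {
    "root": {"hub"},
    "hub": {"a", "b"},
    "a": {"c"},
    "b": {"c"},
    "c": {"d"},
}
around_a = neighbors(links, "a")
```

Every page returned is at most two link steps away from the target, which matches the "within 2 steps" restriction on the slide.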

24 Features of Neighbors
• Labels
• Anchor Text
• Surrounding Anchor Text
• By using the anchor text, surrounding text, and page title of a parent page in combination with text from the target page, classification can be improved.
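
One possible way to collect anchor text from a parent page and combine it with the target page's own text is sketched below, using Python's standard `html.parser`. The class name, the example URLs, and the simple string concatenation are assumptions; how the combined text is weighted in a real classifier is not specified on the slide.

```python
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collect the anchor text of every link in a parent page that points to `target_url`."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.inside_target_link = False
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self.inside_target_link = True
            self.anchor_texts.append("")

    def handle_data(self, data):
        if self.inside_target_link:
            self.anchor_texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside_target_link = False

# Hypothetical parent page with one link to the target and one link elsewhere.
parent_html = (
    '<p>Read the <a href="http://example.com/t">match report</a> '
    'or the <a href="http://example.com/o">weather</a>.</p>'
)
collector = AnchorTextCollector("http://example.com/t")
collector.feed(parent_html)

target_text = "final whistle"
combined = target_text + " " + " ".join(collector.anchor_texts)
```

The `combined` string can then be fed to whatever text classifier is in use, so that the parent's description of the target contributes to the decision.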

25 Features of Neighbors
• Implicit Links
  - Connections between pages that appear in the results of the same query and are both clicked by users
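
Implicit links could be derived from a search click log roughly as follows. The log format (a list of query and clicked-URL pairs), the function name, and the example URLs are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Build implicit links from (query, clicked_url) records.

    Two pages are linked when users clicked both of them for the same query.
    """
    clicked_by_query = defaultdict(set)
    for query, url in click_log:
        clicked_by_query[query].add(url)
    links = set()
    for urls in clicked_by_query.values():
        # Sorting makes each undirected pair appear in one canonical order.
        for first, second in combinations(sorted(urls), 2):
            links.add((first, second))
    return links

# Hypothetical click log.
click_log = [
    ("jaguar speed", "cats.example/jaguar"),
    ("jaguar speed", "zoo.example/big-cats"),
    ("jaguar price", "cars.example/jaguar"),
]
links = implicit_links(click_log)
```

The two pages clicked for "jaguar speed" become implicit neighbors even though neither page contains a hyperlink to the other.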

26 Algorithms
• What are the algorithmic approaches to webpage classification?
  - Dimension reduction
  - Relational learning
  - Hierarchical classification
  - Information combination

27 Dimension Reduction
• Boost classification by emphasizing the features that are more useful in classification
  - Feature Weighting
    ○ Reduces the dimensions of the feature space
    ○ Reduces computational complexity
    ○ Classification can be more accurate in the reduced space

28 Dimension Reduction
• Methods
  - Use first fragment
  - K-nearest neighbor algorithm
    ○ Weighted features
    ○ Weighted HTML Tags
    ○ Metrics
      - Expected mutual information
      - Mutual information
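
To make the metrics concrete, here is a minimal sketch of scoring terms by expected mutual information with a category and ranking them, so that low-scoring terms can be dropped or down-weighted. The document representation (a set of words plus a label), the function name, and the toy data are assumptions; the formula is the standard expected mutual information between term presence and category membership.

```python
import math

def expected_mutual_information(docs, term, category):
    """Expected mutual information between presence of `term` and membership in `category`.

    `docs` is a list of (set_of_words, label) pairs.
    """
    n = len(docs)
    score = 0.0
    for has_term in (True, False):
        for in_category in (True, False):
            joint = sum(1 for words, label in docs
                        if (term in words) == has_term
                        and (label == category) == in_category)
            if joint == 0:
                continue  # a zero joint probability contributes nothing
            p_term = sum(1 for words, _ in docs if (term in words) == has_term) / n
            p_cat = sum(1 for _, label in docs if (label == category) == in_category) / n
            p_joint = joint / n
            score += p_joint * math.log2(p_joint / (p_term * p_cat))
    return score

# Toy corpus (made up for illustration).
docs = [
    ({"team", "the", "game"}, "sports"),
    ({"team", "the", "match"}, "sports"),
    ({"market", "the", "stocks"}, "business"),
    ({"market", "the", "profit"}, "business"),
]
vocabulary = set().union(*(words for words, _ in docs))
ranked = sorted(vocabulary,
                key=lambda t: (-expected_mutual_information(docs, t, "sports"), t))
```

In this toy corpus "the" occurs in every document and so carries no information about the category, while "team" and "market" separate the two categories perfectly and rank first.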

29 Relational Learning
• Relaxation Labeling
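
The slide names relaxation labeling without detail. The sketch below is a deliberately simplified illustration of the general idea, in which each page's class probabilities are repeatedly re-estimated from its own initial estimate and the current estimates of its neighbors. The linear blending rule, the `weight` parameter, and the example data are assumptions and not the specific update used in the literature.

```python
def relax(initial, links, rounds=10, weight=0.5):
    """Iteratively blend each page's class probabilities with its neighbors' average.

    `initial` maps page -> {category: probability} (e.g., from a text classifier).
    `links` maps page -> list of neighbor pages.
    `weight` is the share given to the neighbors (an illustrative choice).
    """
    current = {page: dict(probs) for page, probs in initial.items()}
    for _ in range(rounds):
        updated = {}
        for page, own in initial.items():
            nbrs = links.get(page, [])
            if not nbrs:
                updated[page] = dict(current[page])
                continue
            updated[page] = {
                category: (1 - weight) * own[category]
                + weight * sum(current[n][category] for n in nbrs) / len(nbrs)
                for category in own
            }
        current = updated
    return current

# Hypothetical initial estimates: page "c" leans slightly towards "news".
initial = {
    "a": {"sports": 0.9, "news": 0.1},
    "b": {"sports": 0.8, "news": 0.2},
    "c": {"sports": 0.45, "news": 0.55},
}
links = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
result = relax(initial, links)
label_of_c = max(result["c"], key=result["c"].get)
```

Because both neighbors of "c" are confidently "sports", its label moves from "news" to "sports" once neighbor information is taken into account.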

30 Hierarchical Classification
• Based on “divide and conquer”
  - The classification problem is split into a hierarchical set of sub-problems.
• Error Minimization
  - When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
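
A top-down walk through a category tree with the "shift one level up" rule might look like the sketch below. The tree, the per-category confidence scores, and the threshold are hypothetical; in practice each node would have its own trained classifier producing those scores.

```python
def classify_hierarchically(page_scores, tree, root="root", threshold=0.6):
    """Walk down the category tree, stopping when no child is confident enough.

    `tree` maps a category to its child categories. `page_scores` maps a category
    to the confidence of that category's (hypothetical) local classifier for this page.
    """
    node = root
    while tree.get(node):
        best_child = max(tree[node], key=lambda child: page_scores.get(child, 0.0))
        if page_scores.get(best_child, 0.0) < threshold:
            break  # uncertain at this level: keep the assignment one level up
        node = best_child
    return node

tree = {"root": ["sports", "business"], "sports": ["soccer", "tennis"]}
confident = {"sports": 0.9, "business": 0.1, "soccer": 0.8, "tennis": 0.2}
uncertain = {"sports": 0.9, "business": 0.1, "soccer": 0.55, "tennis": 0.45}

deep_label = classify_hierarchically(confident, tree)
shallow_label = classify_hierarchically(uncertain, tree)
```

The first page is confidently placed in a leaf category, while the second stays at the parent category because neither child clears the threshold.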

31 Information Combination
• Combine several methods into one
  - Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
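
One simple way to let several classifiers reach a collective decision is majority voting, sketched below. Voting is only one possible combination scheme and is an assumption here, as are the three rule-based stand-in classifiers and the example page.

```python
from collections import Counter

def combine_by_vote(classifiers, page):
    """Let each classifier label the page and return the majority label."""
    votes = Counter(classifier(page) for classifier in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for classifiers trained on different information sources.
def by_text(page):
    return "sports" if "game" in page["text"] else "business"

def by_url(page):
    return "sports" if "sports" in page["url"] else "business"

def by_anchor(page):
    return "sports" if "score" in page["anchor_text"] else "business"

page = {
    "text": "quarterly results",
    "url": "example.com/sports/final",
    "anchor_text": "live score",
}
final_label = combine_by_vote([by_text, by_url, by_anchor], page)
```

Here the text-based classifier is outvoted by the URL-based and anchor-based ones, which is the kind of correction that combining sources is meant to provide.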

32 Conclusion
• Webpage classification is a type of supervised learning problem that aims to assign a webpage to one or more of a predefined set of categories.
• In the future, efforts will most likely focus on effectively combining content and link information to build a more accurate classifier

33 Evolution of Websites
• Apple in 1998

34 Evolution of Websites
• Apple in 2008

35 Evolution of Websites
• Nike in 2000

36 Evolution of Websites
• Nike in 2008

37 Evolution of Websites
• Yahoo in 1996

38 Evolution of Websites
• Yahoo in 2008

39 Evolution of Websites
• Microsoft in 1998

40 Evolution of Websites
• Microsoft in 2008

41 Evolution of Websites
• MTV in 1998

42 Evolution of Websites
• MTV in 2008

43 Sources
• Web Page Classification: Features and Algorithms by Xiaoguang Qi & Brian D. Davison
• Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic
• The Evolution of Websites http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx

