Exploring a Hybrid of Support Vector Machines (SVMs) and a Heuristic Based System in Classifying Web Pages Santa Clara, California, USA Ahmad Rahman, Yuliya.

1 Exploring a Hybrid of Support Vector Machines (SVMs) and a Heuristic Based System in Classifying Web Pages Santa Clara, California, USA Ahmad Rahman, Yuliya Tarnikova and Hassan Alam

2 Why Classify Web Pages? Web page classification is often a pre-processing stage in a number of applications:
–Web search
–Web page summarization
–Display of web pages on small-screen devices
–Archiving web pages
–Format conversion from HTML to other formats

3 Why Classify Web Pages?
–Specific algorithm
–Different way to apply
–Specific parameters
–Local optimizations

4 What Makes Web Pages Different from Each Other?
Type of Content
–Banking and Finance
–Programming Language
–Science
–Sport
–Others?
Manifestation
–Linguistic Difference
M. Sinha and D. Corne. A large benchmark dataset for web document clustering. Int. Conf. on Hybrid Intelligence Systems, 2002.

5 Sports Page

6 Programming Page

7 Banking/Finance Page

8 What is this?

9 How Do We Use Web Classes? Do people writing a web page on banking/finance do it differently than people writing a sports page? We know there will be linguistic differences, but will there be structural differences as well? If there are differences, how do we characterize them?

10 Alternate Definitions?
Intent of the Web Page
–What is the Main Message?
  Convey information?
  Help in locating information?
  Allow specific requests to be processed?
–Manifestation
  Text/Link Mapping Specific Task Oriented tagset

11 Example 1: Informative Web Page (Primarily Textual Content)

12 Example 2: Locating Information (Primarily Links)

13 Example 3: Facilitator (Large Chunks of Forms)

14 Non-Linguistic Features: Structural and Hierarchical Information
–Number of large-story-type columns
–Largest number of forms in one column
–Text size
–Number of links
–Number of images
–Number of columns with forms
–…and others.
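A few of the simpler features listed above (text size, number of links, number of images, number of forms) can be approximated directly from the HTML. The following is a minimal sketch using Python's standard `html.parser`; the class name `StructuralFeatureCounter`, the function `extract_features` and the exact counting rules are assumptions made for illustration, not the extractor used in this work. The column-based features would need layout analysis and are not attempted here.

```python
from html.parser import HTMLParser


class StructuralFeatureCounter(HTMLParser):
    """Hypothetical counter for a few structural page features."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.forms = 0
        self.text_size = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1  # count only anchors that actually link somewhere
        elif tag == "img":
            self.images += 1
        elif tag == "form":
            self.forms += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Assumption: "text size" is the number of visible characters.
        if self._skip_depth == 0:
            self.text_size += len(data.strip())


def extract_features(html):
    """Return a dict of the counted features for one page."""
    counter = StructuralFeatureCounter()
    counter.feed(html)
    counter.close()
    return {
        "text_size": counter.text_size,
        "links": counter.links,
        "images": counter.images,
        "forms": counter.forms,
    }
```

The resulting dictionary could be turned into a numeric feature vector for a classifier; how the original system scaled or combined its features is not stated in the slides.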

15 Support Vector Machine
Structural Risk Minimization
–Vapnik-Chervonenkis (VC) Dimension
  - Property of a set of functions
  - Maximum number of training points that can be shattered by that set of functions
  - Example: the VC dimension of the set of oriented lines
–VC theory provides bounds on the test error, which depend on both the empirical risk and the capacity of the function class
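The slide does not reproduce the bound itself. As a sketch, the form usually quoted from Vapnik's theory is given below; the symbols are assumptions introduced here, not taken from the slides: R is the expected (test) risk, R_emp the empirical risk on l training points, h the VC dimension of the function class, and the bound holds with probability 1 - η.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
```

The second term grows with h, which is why structural risk minimization trades empirical risk against capacity. For the oriented-lines example, the standard textbook result is that lines in the plane have VC dimension 3: any three points in general position can be shattered, but no set of four can.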

16 Hyperplane Classification

17 SVM Implementation We adopted SVM light, an implementation of Vapnik's Support Vector Machine [1] for the problem of pattern recognition. The optimization algorithm used in SVM light is described in [2].
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[2] T. Joachims. "Making Large-Scale SVM Learning Practical". In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.

18 Initial Experiment
Database: 200 randomly selected web pages
–Training database: 100
–Test database: 100
Classes:
1. Story Pages
2. Reference Pages
3. Form Pages
SVM performance:
–On training data: 95%
–On test data: 87%
SVM: Dot Product, Pair-wise
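A pair-wise, dot-product setup for three classes can be sketched as three linear decision functions, one per class pair, combined by voting. The code below is illustrative only: the function names, the weight vectors and the three-element feature vector are hypothetical, and in practice the weights would come from training (for example with SVM light) rather than being written by hand.

```python
from collections import Counter
from itertools import combinations

CLASSES = ("story", "reference", "form")


def dot(w, x):
    """Plain dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_predict(x, models):
    """Vote over all class pairs.

    models maps (class_a, class_b) -> (w, b); a positive value of
    w.x + b is a vote for class_a, otherwise for class_b.
    Returns the list of classes sharing the highest vote count,
    so a list with more than one entry signals a tie.
    """
    votes = Counter()
    for a, b in combinations(CLASSES, 2):
        w, bias = models[(a, b)]
        winner = a if dot(w, x) + bias > 0 else b
        votes[winner] += 1
    top = max(votes.values())
    return sorted(c for c, v in votes.items() if v == top)


# Hypothetical hand-written weights over [text, links, forms] features.
example_models = {
    ("story", "reference"): ([1.0, -1.0, 0.0], 0.0),
    ("story", "form"): ([1.0, 0.0, -1.0], 0.0),
    ("reference", "form"): ([0.0, 1.0, -1.0], 0.0),
}

# Hypothetical models whose votes form a cycle, producing a three-way tie.
cyclic_models = {
    ("story", "reference"): ([0.0, 0.0, 0.0], 1.0),
    ("story", "form"): ([0.0, 0.0, 0.0], -1.0),
    ("reference", "form"): ([0.0, 0.0, 0.0], 1.0),
}
```

With `example_models`, a text-heavy vector such as `[0.9, 0.2, 0.1]` collects two votes for `story`. With `cyclic_models`, every class gets exactly one vote, which is the kind of tie a pair-wise scheme has to resolve somehow.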

19 Hybridization

20 Form Separation Heuristics
Define the Form Probability Score (FPS) as
F = ∑ over all forms i of f(i) * w(i),
where the individual form score is
f(i) = #(submits and resets) * 0.2 + #(radio buttons and check boxes) * 0.5 + #(all other active fields),
and the weight w(i) of the form is
w(i) = f(i), if f(i) ∈ [0, 2],
w(i) = 2 + (f(i) - 2)/2, if f(i) ∈ (2, 4],
w(i) = 3 + (f(i) - 4)/4, if f(i) ∈ (4, 6],
w(i) = 3.5, if f(i) > 6.
Based on these two parameters, a web page is a form page if the size of the text preceding the first form is less than 300, F / (#links) > 0.25, and F / (#text) > 0.01.
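The heuristic above translates almost directly into code. The sketch below is one possible reading, not the authors' implementation: the `Form` record and all function names are hypothetical, the third weight interval is taken as running from 4 to 6 so that w is continuous, "#text" is assumed to mean the page's text size, and a zero link count or text size is assumed to pass its ratio test whenever F is positive (the slide does not say how that case is handled).

```python
from dataclasses import dataclass


@dataclass
class Form:
    """Hypothetical per-form field counts."""
    submits_resets: int
    radios_checkboxes: int
    other_fields: int


def form_score(form):
    """Individual form score f(i)."""
    return (0.2 * form.submits_resets
            + 0.5 * form.radios_checkboxes
            + form.other_fields)


def form_weight(f):
    """Piecewise weight w(i); grows more slowly as f increases."""
    if f <= 2:
        return f
    if f <= 4:
        return 2 + (f - 2) / 2
    if f <= 6:
        return 3 + (f - 4) / 4
    return 3.5


def form_probability_score(forms):
    """F = sum over all forms of f(i) * w(i)."""
    return sum(form_score(fm) * form_weight(form_score(fm)) for fm in forms)


def ratio_exceeds(score, count, threshold):
    # Assumption: with a zero denominator, any positive score passes.
    if count == 0:
        return score > 0
    return score / count > threshold


def is_form_page(forms, text_before_first_form, n_links, text_size):
    """Apply the three conditions from the slide."""
    if not forms or text_before_first_form >= 300:
        return False
    score = form_probability_score(forms)
    return (ratio_exceeds(score, n_links, 0.25)
            and ratio_exceeds(score, text_size, 0.01))
```

For a single form with one submit button, two check boxes and three text fields, f = 4.2, w = 3.05 and F = 12.81, so a page with 10 links, a text size of 500 and 100 characters of text before the form would be classified as a form page.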

21 New Experiment (1)
Training Set: First 100
Test Set: Last 100
Training data:
            Story  Reference
Story         42       1
Reference      1      46
Test data:
            Story  Reference
Story         41       1
Reference      4      37
On Training Data: 97% Correct
On Test Data: 90% Correct
First Stage (Heuristics): 100% on Train and Test Data
Second Stage Combined: Training: 98%, Test: 95%

22 New Experiment (2)
Training Set: Last 100
Test Set: First 100
On Training Data: 98% Correct
On Test Data: 90% Correct
First Stage (Heuristics): 100% on Train and Test Data
Second Stage Combined: Training: 99%, Test: 91%
Training data:
            Story  Reference
Story         40       1
Reference      0      42
Test data:
            Story  Reference
Story         39       4
Reference      5      42

23 Average Accuracy
Hybrid
–On Training Data: 98.5%
–On Test Data: 93%
Pair-wise SVM
–On Training Data: 95%
–On Test Data: 87%
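The hybrid averages are simply the means of the combined results of the two experiments (98% and 99% on training data, 95% and 91% on test data). The short sketch below reproduces that arithmetic and also shows how an accuracy figure follows from a story/reference matrix. The helper names are hypothetical, and the matrix values assume that the run-together digits on the New Experiment (2) slide ("Story394 Reference542") read as 39, 4 and 5, 42.

```python
def accuracy(matrix):
    """Fraction of items on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def mean(values):
    return sum(values) / len(values)


# Combined (hybrid) results reported for experiments (1) and (2), in percent.
hybrid_train = mean([98, 99])
hybrid_test = mean([95, 91])

# Pair-wise SVM baseline from the initial experiment, in percent.
svm_test = 87
test_gain = hybrid_test - svm_test

# Assumed reading of the second matrix in experiment (2).
exp2_test_matrix = [[39, 4], [5, 42]]
exp2_test_accuracy = accuracy(exp2_test_matrix)
```

Under that reading, 81 of 90 pages are on the diagonal, which agrees with the 90% test figure reported for experiment (2), and the hybrid improves on the pair-wise SVM test accuracy by 6 percentage points.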

24 Future Work?
–We want to correlate different types of pages (structure) with respect to linguistic differences
–We want to characterize the structural features we used with respect to purely linguistic features
–Quantify the improvement in a secondary process due to the success/failure of the web classification process

25 Conclusion
–SVM is a very effective solution for web page classification
–Often the pre-defined number of web classes is small
–Heuristics, if correctly applied, can be very useful in boosting the SVM ensemble
–For a problem of more than three classes, heuristics can be applied in sequence
–For problems of more than three classes, solving ties of the pair-wise classifiers becomes a major problem; this is addressed in a later paper (MCS2003)
–Current applications of this include web page summarization and re-authoring

