
Comparison of Web Page Classification Algorithms
Presented by Yi Cheng, Jianye Ge, Jun Liang, Sheng Yu

The objective of our final project is to evaluate several supervised learning algorithms for assigning web documents to pre-defined classes.

Project Outline
- Problem Statement
- Literature Review
- Project Design
- Implementation
- Results & Comparison

Problem Statement
- Why Web Page Classification
- Supervised or Unsupervised Classification
- Classification Accuracy
- Classification Efficiency

Literature Review: Web Categorization
Arul Prakash Asirvatham et al. (2000) reviewed web categorization algorithms. The major classification approaches fall into five classes:
(1) Supervised classification, also called manual categorization. This is useful when the classes have been predefined.
(2) Unsupervised, or clustering, approaches. Clustering algorithms can group web documents without any pre-defined framework or background information. However, most clustering algorithms, such as K-means, require the number of clusters to be set in advance, and their computational cost is high.

Literature Review: Web Categorization (continued)
(3) Meta-tag-based categorization, which uses meta-tag attributes to classify web documents. The assumption that the author of a document will put correct keywords in the meta tags is not always true.
(4) Text-content-based categorization. A database of keywords for each category is prepared, and commonly occurring words (called stop words) are removed from this list. The remaining words can then be used for classification, e.g. with the K-Nearest-Neighbor algorithm.
(5) Link and content analysis, or hub-authority analysis. The link-based approach is an automatic web page categorization technique based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it.
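Approach (4) above can be sketched in a few lines of Python. This is an illustrative sketch only, not the reviewed systems' implementation; the stop-word list, documents, and labels are hypothetical:

```python
from collections import Counter

# Hypothetical minimal stop-word list for illustration
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def keyword_vector(text):
    """Count remaining keywords after stop-word removal."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def knn_classify(doc_vector, labeled_docs, k=1):
    """Classify by majority vote among the k training documents
    sharing the most keyword occurrences with the input."""
    def overlap(a, b):
        return sum((a & b).values())  # shared keyword counts
    ranked = sorted(labeled_docs,
                    key=lambda d: overlap(doc_vector, keyword_vector(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, with training documents for the three project categories, `knn_classify(keyword_vector("apple recipe"), train)` would pick the food-related page as nearest neighbor.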

Literature Review: Supervised Classification
Given a set of example records, each record consists of:
- a set of attributes
- a class label
Build an accurate model for each class based on the set of attributes, then use the model to classify future data for which the class labels are unknown.

Literature Review: Supervised Classification Models
- Neural networks
- Statistical models (linear/quadratic discriminants)
- Decision trees
- Genetic models

Literature Review: Naïve Bayes Algorithm
- A straightforward and frequently used method for supervised learning.
- Provides a flexible way of dealing with any number of attributes or classes, based on Bayes' rule of probability theory.
- Asymptotically the fastest learning algorithm that examines all of its training input.
- Performs surprisingly well on a very wide variety of problems in spite of the simplistic nature of the model.
- Small amounts of "noise" do not perturb the results by much.

Literature Review: Naïve Bayes Algorithm
How it works: Suppose Ck are the classes into which the data will be classified. For each class, P(Ck) is the prior probability of an instance belonging to Ck, and it can be estimated from the training dataset. For n attribute values Vj (j = 1…n), the goal of classification is to find the conditional probability P(Ck | V1 ∧ V2 ∧ ... ∧ Vn). By Bayes' rule,

    P(Ck | V1 ∧ ... ∧ Vn) = P(Ck) · P(V1 ∧ ... ∧ Vn | Ck) / P(V1 ∧ ... ∧ Vn)

For classification, the denominator is irrelevant, since for given values of the Vj it is the same regardless of the value of Ck. Under the naïve independence assumption, P(V1 ∧ ... ∧ Vn | Ck) is approximated by the product of the individual P(Vj | Ck), each of which can be estimated from the training data.
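The calculation above can be sketched directly for categorical attributes. The project actually used Weka's Naïve Bayes implementation; this is a hypothetical minimal sketch, with a simple Laplace-style smoothing choice added to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def train_nb(records):
    """records: list of (attribute_tuple, class_label).
    Returns class counts and per-attribute value counts."""
    priors = Counter(label for _, label in records)
    cond = defaultdict(Counter)  # cond[(label, attr_index)][value] = count
    for attrs, label in records:
        for j, v in enumerate(attrs):
            cond[(label, j)][v] += 1
    return priors, cond, len(records)

def classify_nb(attrs, priors, cond, n):
    """Pick the class maximizing P(Ck) * product_j P(Vj | Ck)."""
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n  # prior P(Ck)
        for j, v in enumerate(attrs):
            counts = cond[(label, j)]
            # add-one smoothing over the values seen for this attribute
            score *= (counts[v] + 1) / (count + len(counts) + 1)
        if score > best_score:
            best, best_score = label, score
    return best
```

The denominator P(V1 ∧ ... ∧ Vn) is never computed, mirroring the observation in the slide that it is the same for every class.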

Literature Review: Decision Tree Classification
- Relatively fast compared to other classification models
- Obtains similar, and sometimes better, accuracy compared to other models
- Simple and easy to understand
- Can be converted into simple, easy-to-understand classification rules

Literature Review: Decision Tree Classification
A decision tree is created in two phases:
- Tree Building Phase: repeatedly partition the training data until all the examples in each partition belong to one class or the partition is sufficiently small.
- Tree Pruning Phase: remove dependency on statistical noise or variation that may be particular only to the training set.

Literature Review: Decision Tree Classification
The ID3 algorithm builds a decision tree from a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set T of records. The basic ideas behind ID3 are:
- In the decision tree, each internal node corresponds to a non-categorical attribute and each arc to a possible value of that attribute.
- A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf.
- Entropy is used to measure how informative a node is.
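The entropy measure ID3 uses to choose the most informative attribute can be sketched as follows (an illustrative sketch of the standard formulas, not the project's Java code):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr_index):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v),
    where S_v are the records with value v for attribute A.
    records: list of (attribute_tuple, class_label)."""
    labels = [label for _, label in records]
    by_value = defaultdict(list)
    for attrs, label in records:
        by_value[attrs[attr_index]].append(label)
    remainder = sum(len(subset) / len(records) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

ID3 splits on the attribute with the highest gain at each node; an attribute that separates the classes perfectly achieves a gain equal to the full entropy of the node.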

Literature Review: Decision Tree Classification
C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and more. When building a decision tree, training records with unknown attribute values can be handled by evaluating the gain, or the gain ratio, for an attribute using only the records where that attribute is defined. When using a decision tree, records with unknown attribute values can be classified by estimating the probabilities of the various possible results.
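C4.5's gain-ratio criterion with unknown values can be sketched as follows; this is a simplified illustration of the idea described above (restrict the gain computation to records where the attribute is defined, scale by the defined fraction, and divide by the split information), not Weka's J48 code:

```python
from collections import Counter
from math import log2

def _entropy(items):
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(records, attr_index):
    """records: list of (attribute_tuple, class_label); an attribute
    value of None marks an unknown value, as in C4.5's treatment."""
    known = [(a, l) for a, l in records if a[attr_index] is not None]
    if len(known) < 2:
        return 0.0
    frac = len(known) / len(records)        # fraction with known values
    values = [a[attr_index] for a, _ in known]
    labels = [l for _, l in known]
    remainder = sum(
        (cnt / len(known)) *
        _entropy([l for a, l in known if a[attr_index] == v])
        for v, cnt in Counter(values).items()
    )
    gain = frac * (_entropy(labels) - remainder)  # gain scaled by frac
    split_info = _entropy(values)                 # penalizes many-valued splits
    return gain / split_info if split_info > 0 else 0.0
```

Dividing by the split information is what distinguishes the gain ratio from plain information gain: it penalizes attributes that fragment the data into many small partitions.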

Project Design
- Search for a web page set based on a topic
- Define the categories by observation: three categories, 1-clothes, 2-computer, 3-food
- Generate the training set from the web page set:
  - randomly download web pages, some for each category
  - define keywords for each category
- Build up the training set: 30 keywords, 80 records, generated automatically by a program
- Build up the categories and decision tree: Naïve Bayes and Decision Tree
- Classify the test set of new web pages

Implementation
- Java 2 application
- Topic: Apple
- A Java program for building up a training set
- Classification algorithms are based on the Weka Java package
- Classification based on Naïve Bayes and Decision Tree

What is Weka?
Weka is a Java package developed at the University of Waikato in New Zealand; "Weka" stands for the Waikato Environment for Knowledge Analysis. It is a collection of machine learning algorithms for solving real-world data mining problems. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and it is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

Processing the Training and Test Data Sets
Two processing steps were implemented in Java:
1. Keyword vector space generation. All web documents collected for each category are defined as input training sets and processed with the Java program. The program takes two inputs: the training set and a keywords index file. Keywords are chosen based on the properties of each category. The result is a matrix: each row is an individual file, each column is a keyword, and each cell value is the frequency with which that keyword appears in the document.
2. Conversion to ARFF format. ARFF is the standard input format for the Weka package. For examples, see the execution of our sample data.
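The two steps can be sketched as follows. The project's actual pipeline was a Java program; this is an illustrative Python sketch with hypothetical documents and keywords, emitting the standard ARFF layout (@relation, @attribute, @data) that Weka expects:

```python
def keyword_matrix(docs, keywords):
    """Step 1: rows = documents, columns = keyword frequencies."""
    rows = []
    for text in docs:
        words = text.lower().split()
        rows.append([words.count(k) for k in keywords])
    return rows

def to_arff(relation, keywords, rows, labels, classes):
    """Step 2: serialize the frequency matrix to ARFF text."""
    lines = [f"@relation {relation}", ""]
    for k in keywords:
        lines.append(f"@attribute {k} numeric")
    # nominal class attribute listing the pre-defined categories
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines.append("")
    lines.append("@data")
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines)
```

For instance, `to_arff("webpages", keywords, rows, labels, ["clothes", "computer", "food"])` yields a file Weka can load directly into its Explorer or classifier API.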

Result of Decision Tree Training (three categories: 1-clothes, 2-computer, 3-food):

computer <= 0
|   power <= 3
|   |   recipe <= 0
|   |   |   power <= 0
|   |   |   |   shop <= 2: 1 (7.0)
|   |   |   |   shop > 2: 3 (3.0)
|   |   |   power > 0: 3 (4.0/1.0)
|   |   recipe > 0: 3 (21.0)
|   power > 3: 1 (3.0/1.0)
computer > 0
|   jeans <= 0: 2 (39.0)
|   jeans > 0: 1 (5.0)

Number of Leaves: 7
Size of the tree: 13

Training Set Quality:
   a b c   <-- classified as
         | a = 1
         | b = 2
         | c = 3

Test Set Result:
   a b c   <-- classified as
         | a = 1
         | b = 2
         | c = 3

Result of Naïve Bayes Training (three categories: 1-clothes, 2-computer, 3-food):

Class 1: prior probability = 0.19
Class 2: prior probability = 0.48
Class 3: prior probability = 0.33
For each keyword, a normal distribution is estimated (mean, standard deviation, weight sum, precision).

Training Set Quality:
   a b c   <-- classified as
         | a = 1
         | b = 2
         | c = 3

Test Set Result:
   a b c   <-- classified as
         | a = 1
         | b = 2
         | c = 3

Comparison of the Two Classifiers (classes: 1-clothes, 2-computer, 3-food)
1. The Naïve Bayes classifier has better overall performance than the decision tree. Correctly classified instances (percentage): Naïve Bayes %, Decision Tree % (total test set 42, training set 82).
2. Naïve Bayes performs better on classes 1 and 3, but not on class 2; the decision tree performs better on classes 2 and 3, but not on class 1. Both perform well on class 3 (see results).

References
1. Heide Brücher, Gerhard Knolmayer, Marc-André Mittermayer, Document Classification Methods for Organizing Explicit Knowledge.
2. Y. Bi, F. Murtagh, S. McClean and T. Anderson, Text Passage Classification Using Supervised Learning.
3. Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers.
4. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM'98).
5. Arul Prakash Asirvatham, Kranthi Kumar Ravi, Web Page Categorization Based on Document Structure.