Bayesian Classification Using P-tree

• Classification
–Classification is the process of predicting an unknown attribute value in a relation.
–Given a relation R(k1..kn, A1, ..., An, C), where the ki's are structural attributes, A1, ..., An are feature attributes, and C is the class label attribute: given an unclassified data sample (no C-value present), classification predicts the C-value for the sample and thus determines its class.
• There are two types of classification techniques:
–Eager classifier: builds a classifier from the training samples before classifying any new sample.
–Lazy classifier: no classifier is built ahead of time; the training data is used directly to classify each new sample.
• Stream data: arrives continuously or at fixed time intervals.
–E.g., weather data for a particular area, or images taken from a satellite at fixed intervals.

These notes contain NDSU confidential & proprietary material. Patents pending on bSQ, P-tree technology.

Preparing the Data for Classification

• Data cleaning
–Involves handling noisy data and missing values.
–Noise can be removed or reduced by applying "smoothing", and missing values can be replaced with the most common value or some statistically determined value.
• Relevance analysis
–Not all attributes in the given data are relevant to the classification task.
–To reduce the cost of classification, these attributes should be identified and removed from the task.
• Data transformation
–Data can be generalized from low-level to high-level values using a concept hierarchy.
–For spatial data, the values of the different bands are continuous numerical values. We may intervalize them as high, medium, low, etc., using the concept hierarchy.
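As a concrete illustration of the data transformation step, here is a minimal sketch of intervalizing continuous band values into a small concept hierarchy. The cut points and function name are illustrative assumptions, not part of the original notes.

import numpy as np

def intervalize(band_values, low_cut=85, high_cut=170):
    """Map continuous 8-bit band values to coarse levels: 0=low, 1=medium, 2=high.

    The cut points are illustrative; in practice they come from the
    concept hierarchy defined for the data set.
    """
    band_values = np.asarray(band_values)
    levels = np.zeros(band_values.shape, dtype=np.uint8)
    levels[band_values >= low_cut] = 1
    levels[band_values >= high_cut] = 2
    return levels

# Example: one band of a 2x3 image patch
print(intervalize([[12, 90, 200], [60, 171, 84]]))
# [[0 1 2]
#  [0 2 0]]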

Bayesian Classification

A Bayesian classifier is a statistical classifier based on the following result, known as Bayes' theorem.

Bayes' theorem: Let X be a data sample whose class label is unknown, and let H be a hypothesis (e.g., "X belongs to class C"). Let P(H|X) be the posterior probability of H given X and P(H) the prior probability of H. Then

  P(H|X) = P(X|H) P(H) / P(X)

where P(X|H) is the conditional probability (likelihood) of X given H, and P(X) is the prior probability of X.
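A quick numeric check of the theorem; the probabilities below are made up purely for illustration.

# Illustrative numbers only: P(H), P(X|H), and P(X|not H) are assumptions.
p_h = 0.3                      # prior P(H)
p_x_given_h = 0.8              # likelihood P(X|H)
p_x_given_not_h = 0.2          # likelihood P(X|not H)

# Total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H)P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.632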

Naïve Bayesian Classification  Given a relation R(K, A 1..A n, C) where K is the structure attribute and A i and C are feature attributes. Also C is the class label attribute.  Each data sample is represented by feature vector, X=(x 1..,x n ) depicting the measurements made on the sample from A 1,..A n, respectively.  Given classes, C 1,...C m, the naive Bayesian Classifier will predict the class of unknown data sample, X, to be class, C j having the highest posterior probability, conditioned on X P(C j |X) > P(C i |X), where i  j. (called the maximum posteriori hypothesis),  From the Bayes theorem: P(C j |X) = P(X|C j )P(C j )/P(X) –P(X) is constant for all classes so we maximize P(X|C j )P(C j ). If we assume equal likelihood of classes, maximize P(X|C j ) otherwise we maximize the whole product. –To reduce the computational complexity of calculating all P(X|C j )'s the naive assumption of class conditional independence of values is used.

Naïve Bayesian Classification

Class-conditional independence:
• This assumption says that the attribute values are conditionally independent of one another given the class, so
  P(X|Cj) = P(x1|Cj) * ... * P(xn|Cj).
• Each P(xi|Cj) can then be estimated directly from the training samples.

Calculating P(xi|Cj) from P-trees:
  P(xi|Cj) = sjxi / sj
where sj = # of training samples in class Cj, and sjxi = # of training samples of class Cj having Ai-value xi. These counts are obtained as P-tree root counts:
  sjxi = RootCount[ Pi(xi) ^ PC(Cj) ]
  sj   = RootCount[ PC(Cj) ]
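A minimal sketch of the count-based estimate. Plain NumPy boolean masks stand in for P-trees here (a real P-tree implementation would return the same root counts without scanning the data), and the function names are assumptions:

import numpy as np

def root_count(mask):
    """Stand-in for the P-tree RootCount: number of 1-bits in the mask."""
    return int(mask.sum())

def p_xi_given_cj(attr_col, class_col, xi, cj):
    """Estimate P(A_i = xi | C = cj) = RootCount[P_i(xi) ^ P_C(cj)] / RootCount[P_C(cj)]."""
    p_i_xi = (attr_col == xi)          # value P-tree for A_i = xi
    p_c_cj = (class_col == cj)         # value P-tree for C = cj
    s_j_xi = root_count(p_i_xi & p_c_cj)
    s_j = root_count(p_c_cj)
    return s_j_xi / s_j if s_j else 0.0

# Tiny example: 6 training samples, one attribute, two classes
attr = np.array([1, 0, 1, 1, 0, 1])
cls  = np.array([0, 0, 1, 1, 1, 1])
print(p_xi_given_cj(attr, cls, xi=1, cj=1))   # 3/4 = 0.75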

Non-Naïve Bayesian Classifier (Cont.)

• One problem with non-naïve Bayesian P-tree classifiers: if rc(Ptree(X)) = 0, we get no class label for that tuple.
• This can happen when the whole tuple (x1, x2, ..., xk-1, xk, xk+1, ..., xn) does not occur in the training data.
• Solution (partial naïve): divide the whole tuple into two parts by separating one attribute Xk from the rest, so the whole tuple is split into the separated tuple X' = (x1, ..., xk-1, xk+1, ..., xn) and the single value xk. Then
  P(X|Ci) = rc[Ptree(X') ^ PC(Ci)] * rc[Pk(xk) ^ PC(Ci)]
(each root count is divided by rc[PC(Ci)], as on the previous slide, to turn it into a probability estimate). A count-based sketch of this fallback follows below.
• The remaining problem is how to select the attribute Xk. One way is to use information gain theory: calculate the information gain of every attribute Xi; Xk is the one with the lowest information gain (see the Information Gain slide).
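A minimal sketch of the partial-naïve scoring, again using boolean masks as a stand-in for P-trees; the names are illustrative and this is not the authors' implementation:

import numpy as np

def root_count(mask):
    return int(mask.sum())

def partial_naive_score(data, class_col, x, cj, k):
    """Score class cj for sample x when the whole-tuple root count is zero.

    data:      2-D array, one column per attribute A_1..A_n
    class_col: 1-D array of class labels
    x:         the unclassified sample (tuple of attribute values)
    k:         index of the separated attribute (chosen, e.g., by lowest info gain)
    """
    cj_mask = (class_col == cj)
    # Separated tuple X' = x with attribute k removed
    xprime_mask = np.ones(len(class_col), dtype=bool)
    for i, v in enumerate(x):
        if i != k:
            xprime_mask &= (data[:, i] == v)
    xk_mask = (data[:, k] == x[k])
    # rc[Ptree(X') ^ PC(Ci)] * rc[Pk(xk) ^ PC(Ci)]
    return root_count(xprime_mask & cj_mask) * root_count(xk_mask & cj_mask)

The predicted label is the class with the largest score; dividing each root count by root_count(cj_mask) recovers the corresponding probability estimates.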

Information Gain

Let C have m different classes, C1 to Cm. The information needed to classify a given sample is
  I(s1..sm) = - Σ(i=1..m) pi * log2(pi)
where pi = si/s is the probability that a sample belongs to Ci.

Let A be an attribute with values {a1..av}. The entropy of A is
  E(A) = Σ(j=1..v) (Sj / s) * I(s1j..smj),  where Sj = Σ(i=1..m) sij
  I(s1j..smj) = - Σ(i=1..m) pij * log2(pij)
and pij = sij/Sj is the probability that a sample with A-value aj belongs to Ci.

Information gain of A: Gain(A) = I(s1..sm) - E(A).

All counts come from P-tree root counts:
  si = rc(PC(ci)),  Sj = rc(PA(aj)),  sij = rc(PC(ci) ^ PA(aj)).
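A minimal sketch of computing Gain(A) from the same kind of root counts, again with boolean masks standing in for P-trees and illustrative function names:

import numpy as np

def root_count(mask):
    return int(mask.sum())

def entropy(counts):
    """I(s_1..s_m) = -sum p_i log2 p_i with p_i = s_i / s (0 log 0 taken as 0)."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(attr_col, class_col):
    """Gain(A) = I(s_1..s_m) - E(A), with all counts taken as root counts."""
    classes = np.unique(class_col)
    s = len(class_col)
    s_i = [root_count(class_col == c) for c in classes]                  # rc(P_C(c_i))
    e_a = 0.0
    for a in np.unique(attr_col):
        s_ij = [root_count((class_col == c) & (attr_col == a)) for c in classes]
        e_a += (sum(s_ij) / s) * entropy(s_ij)                           # (S_j/s) * I(s_1j..s_mj)
    return entropy(s_i) - e_a

The attribute Xk separated out in the partial-naïve scheme above would be the one with the smallest info_gain value.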

Performance of P-tree AND Operation / Performance of Classification: Comparison

[Charts and table omitted in this transcript: performance of the P-tree AND operation, and classification success rates for 4 classification classes compared across bit resolutions for the Naïve Bayesian Classifier (NBC), the Bayesian Classifier (BC), and the information-gain variant (IG Succ, IG Use).]

IG Use - proportion of the number of times the information gain was used for successful classification.

Performance in Data Stream Applications

• Data stream mining should meet the following criteria:
–It must require only a small, constant time per record.
–It must use only a fixed amount of main memory.
–It must be able to build a model with at most one scan of the data.
–It must make a usable model available at any point in time.
–It should produce a model equivalent to the one that would be obtained by the corresponding database-mining algorithm.
–When the data-generating phenomenon changes over time, the model at any time should be up to date while still incorporating past information.
• Data stream mining using P-trees meets these criteria:
–Small, constant time per record: P-tree updates require small, constant time.
–Fixed amount of main memory: satisfied by P-trees.
–At most one scan of the data: building the P-trees requires only one scan.
–Usable model available at any point in time: satisfied by P-trees.
–Equivalent model: any conventional algorithm is also implementable with P-trees.
–Up-to-date model that includes past information: satisfied by P-trees.