
Performance Improvement for Bayesian Classification on Spatial Data with P-Trees
Amal S. Perera, Masum H. Serazi, William Perrizo
Department of Computer Science, North Dakota State University, Fargo, ND
These notes contain NDSU confidential and proprietary material. Patents pending on the P-tree technology.

Outline
– Introduction
– P-Tree
– P-Tree Algebra
– Bayesian Classifier
– Calculating Probabilities using P-Trees
– Band-based vs. Bit-based Approach
– Sample Data
– Classification Accuracy
– Classification Time
– Conclusion

Introduction
Classification is a form of data analysis and data mining that can be used to extract models describing important data classes or to predict future data trends. Some data classification techniques are:
– Decision Tree Induction
– Bayesian Classification
– Neural Networks
– K-Nearest Neighbor
– Case-Based Reasoning
– Genetic Algorithms
– Rough Sets
– Fuzzy Logic Techniques
A Bayesian classifier is a statistical classifier that uses Bayes' theorem to predict class membership as a conditional probability that a given data sample falls into a particular class.

Introduction Cont.
The P-Tree data structure allows us to compute the Bayesian probability values efficiently, without resorting to the naïve Bayesian assumption. Bayesian classification with P-Trees has been used successfully on remotely sensed imagery in precision agriculture to predict yield, and in genomics (yeast two-hybrid classification) to place in the ACM KDD Cup 2002 competition. To eliminate the naïve assumption completely, a bit-based Bayesian classification is used instead of the band-based approach.

P-Tree
Most spatial data comes in a band format called BSQ (Band Sequential). Each BSQ band is divided into several files, one for each bit position of the data values; this format is called 'bit Sequential' or bSQ. Each bSQ bit file, Bij (the file constructed from the j-th bits of the i-th band), is converted into a tree structure called a Peano Tree (P-Tree). P-Trees represent tabular data in a lossless, compressed, bit-by-bit, recursive, data-mining-ready arrangement.
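A minimal sketch of the bSQ decomposition, assuming NumPy; the function name to_bsq and the toy band are illustrative, not from the paper:

    import numpy as np

    def to_bsq(band, bits=8):
        # Split a 2-D band of unsigned integers into bit planes B_i1..B_i8,
        # where plane j holds the j-th most significant bit of every pixel.
        return [(band >> (bits - j)) & 1 for j in range(1, bits + 1)]

    band = np.array([[200, 33], [47, 255]], dtype=np.uint8)
    planes = to_bsq(band)
    print(planes[1])  # the bit plane built from the 2nd bits of this band

Each plane would then be stored recursively by Peano quadrant, with pure quadrants compressed to a single node.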

A bSQ file, its raster spatial file and P-Tree
– Peano or Z-ordering
– Pure (Pure-1/Pure-0) quadrant
– Root Count
– Level
– Fan-out
– QID (Quadrant ID)

P-Tree Algebra
Logical operators:
– AND
– OR
– Complement
– Other (XOR, etc.)
Applying these operators, we can calculate value P-Trees, interval P-Trees, and slice P-Trees.
Example (an 8×8 image: the root covers 64 pixels, each of the four quadrants 16): a P-Tree with root count 55 and first-level quadrant counts 16, 8, 15, 16; its complement has root count 9 with quadrant counts 0, 8, 1, 0, each count being the quadrant size minus the original count.
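A hedged sketch of that complement example, with a one-level P-Tree stored simply as its quadrant counts (names illustrative):

    QUAD_SIZE = 16  # pixels per quadrant in the 8x8 example

    def complement(quad_counts):
        # Complement flips every bit, so each 1-count becomes QUAD_SIZE - count.
        return [QUAD_SIZE - c for c in quad_counts]

    counts = [16, 8, 15, 16]             # root count 16+8+15+16 = 55
    comp = complement(counts)            # [0, 8, 1, 0]
    print(sum(counts), comp, sum(comp))  # -> 55 [0, 8, 1, 0] 9

AND and OR of two P-Trees can likewise be computed level by level from counts and purity information, without decompressing to raw bits.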

P-Tree Algebra Cont. (' indicates the COMPLEMENT operation)
Basic P-Trees can be combined using logical operations to produce P-Trees for the original values at any level of bit precision. Using 8-bit precision for values, Pb, which counts the number of occurrences of the value 11010011 in each quadrant, can be constructed from the basic P-Trees as:
Pb = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8
The AND operation is simply the pixel-wise AND of the bits. Similarly, any data set in relational format can be represented as P-Trees. For any combination of values (v1, v2, …, vn), where vi is from band i, the quadrant-wise count of occurrences of this combination of values is given by:
P(v1, v2, …, vn) = P1(v1) ^ P2(v2) ^ … ^ Pn(vn)
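A sketch of this value-count construction with flat NumPy bit vectors standing in for compressed P-Trees (the root count RC is then just a sum; a real P-Tree keeps the count per quadrant):

    import numpy as np

    def value_count(planes, value_bits):
        # AND plane j where bit j of the value is 1, its complement where it is 0.
        mask = np.ones_like(planes[0])
        for plane, bit in zip(planes, value_bits):
            mask &= plane if bit else (plane ^ 1)
        return int(mask.sum())  # RC[P_b]

    band = np.array([0b11010011, 0b11010011, 0b00000000], dtype=np.uint8)
    planes = [(band >> (8 - j)) & 1 for j in range(1, 9)]
    print(value_count(planes, [1, 1, 0, 1, 0, 0, 1, 1]))  # -> 2 pixels equal 11010011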

Bayesian Classifier
Based on Bayes' theorem:
Pr(Ci | X) = Pr(X | Ci) × Pr(Ci) / Pr(X)
– Pr(Ci | X) is the posterior probability
– Pr(Ci) is the prior probability
– The conditional probabilities Pr(X | Ci) can be estimated from the training data
– Classify X with the class Ci that maximizes Pr(Ci | X)
– Since Pr(X) is constant for all classes, it suffices to maximize Pr(X | Ci) × Pr(Ci)
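A toy numeric check of the decision rule (all numbers hypothetical): because Pr(X) does not depend on the class, comparing Pr(X | Ci) × Pr(Ci) is enough:

    priors = {"c1": 0.6, "c2": 0.4}           # Pr(Ci), hypothetical
    likelihoods = {"c1": 0.02, "c2": 0.09}    # Pr(X|Ci), hypothetical
    best = max(priors, key=lambda c: likelihoods[c] * priors[c])
    print(best)  # -> c2: 0.09 * 0.4 = 0.036 beats 0.02 * 0.6 = 0.012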

Calculating Probabilities Pr(X | Ci)
Using the naïve assumption:
Pr(X | Ci) = Pr(X1 | Ci) × Pr(X2 | Ci) × … × Pr(Xn | Ci)
Scan the data and calculate Pr(X | Ci) for the given X.
Using P-Trees:
Pr(X | Ci) = (# training samples in Ci having pattern X) / (# samples in class Ci)
= RC[ P1(X1) ^ P2(X2) ^ … ^ Pn(Xn) ^ PC(Ci) ] / RC[ PC(Ci) ]
Problem: what if RC[ P1(X1) ^ P2(X2) ^ … ^ Pn(Xn) ^ PC(Ci) ] = 0 for all i, i.e., the unclassified pattern does not exist in the training set?
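A sketch of the P-Tree estimate, with boolean masks standing in for value P-Trees and .sum() for the root count RC; all names are illustrative:

    import numpy as np

    def pr_x_given_c(attr_masks, class_mask):
        # RC[P1(X1) ^ ... ^ Pn(Xn) ^ PC(Ci)] / RC[PC(Ci)]
        match = class_mask.copy()
        for m in attr_masks:
            match &= m
        rc_class = int(class_mask.sum())
        return int(match.sum()) / rc_class if rc_class else 0.0

    a1 = np.array([1, 1, 0, 1], dtype=bool)   # samples with value X1 in band 1
    a2 = np.array([1, 0, 0, 1], dtype=bool)   # samples with value X2 in band 2
    ci = np.array([1, 1, 0, 1], dtype=bool)   # samples in class Ci
    print(pr_x_given_c([a1, a2], ci))         # -> 2/3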

Band-based P-Tree Approach
When RC = 0 for the given pattern in every class:
– Reduce the restrictiveness of the pattern by removing the attribute with the least information gain
– Recalculate (assuming attribute 2 has the least IG):
Pr(X | Ci) = RC[ P1(X1) ^ P3(X3) ^ … ^ Pn(Xn) ^ PC(Ci) ] / RC[ PC(Ci) ]
Information gain is calculated using P-Trees:
– a one-time calculation over the entire training data
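A sketch of that fallback loop, reusing pr_x_given_c from the previous sketch; ig_order is assumed to list attribute indices from least to most information gain, computed once up front:

    def band_fallback(attr_masks, class_mask, ig_order):
        masks = dict(enumerate(attr_masks))
        p = pr_x_given_c(list(masks.values()), class_mask)
        for attr in ig_order:
            if p > 0:
                break
            masks.pop(attr)  # drop the least informative attribute left
            p = pr_x_given_c(list(masks.values()), class_mask)
        return p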

Bit-based Approach
Search for similar patterns by removing the least significant bits in the attribute space. The order in which bits are removed is selected by calculating the information gain (IG).
E.g., calculate the Bayesian conditional probability value for the pattern [G,R] = [10,01] in a 2-attribute space. Assume the IG for the 1st significant bit of R is less than that of G, and the IG for the 2nd significant bit of G is less than that of R.
– Initially, search for the pattern [10,01] (a).
– If not found, search for [1_,01], dropping the 2nd significant bit of G. The search space increases (b).
– If not found, search for [1_,0_], dropping the 2nd significant bit of R. The search space increases (c).
– If not found, search for [1_,_ _], dropping the 1st significant bit of R. The search space increases (d).
[Figure: panels (a)–(d) in the R–G attribute plane, showing the search space growing at each step.]
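A sketch of this widening search on plain integers (a real implementation would get the counts from P-Tree root counts rather than a scan); bit_ig_order is assumed precomputed as (attribute, bit-position) pairs, least informative bit first:

    def bit_fallback(train, pattern, n_attrs, bits, bit_ig_order):
        # A sample matches if it agrees with the pattern on every bit still
        # marked significant; each dropped bit widens the search space.
        keep = [(1 << bits) - 1] * n_attrs    # all bits significant at first
        for step in [None] + bit_ig_order:    # None tries the exact pattern
            if step is not None:
                attr, bit = step
                keep[attr] &= ~(1 << bit)     # drop this bit (bit 0 = LSB)
            hits = sum(
                all((s & m) == (v & m) for s, v, m in zip(sample, pattern, keep))
                for sample in train
            )
            if hits:
                return hits
        return 0

    # The example above: [G,R] = [10,01], dropping G's 2nd bit, then R's
    # 2nd bit, then R's 1st bit (hypothetical IG order).
    train = [(0b11, 0b00), (0b10, 0b11)]
    print(bit_fallback(train, (0b10, 0b01), 2, 2, [(0, 0), (1, 0), (1, 1)]))  # -> 1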

Experiments
The experimental data was extracted from two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota.
– The images were taken in two different years, the first in 1997. Each image contains 3 bands: red, green, and blue reflectance values.
– Three other files contain synchronized soil moisture, nitrate, and yield values.

Classification Accuracy
The accuracy of the proposed bit-based approach is compared with the band-based approach and with KNN using Euclidean distance. It is clear that our approach outperforms the others.
[Chart: classification accuracy on the '97 data for training data sizes of 4K, 16K, 65K, and 260K pixels, comparing the Band-Ptree, KNN-Euclidean, and Bit-based methods.]

Classification Accuracy Cont.
The accuracy of the approach was also compared to an existing Bayesian belief network classifier, J. Cheng's publicly available Bayesian Belief Network.
– This classifier was the winning entry for the KDD Cup 2001 data mining competition. Its developer claims that it can perform with or without domain knowledge.
– For the comparison, smaller training data sets ranging from 4K to 16K pixels were used, due to the inability of the implementation to handle larger data sets.
– The belief network was built without using any domain knowledge, to make it comparable with the P-Tree approach.
[Table: classification accuracy by training size (4K–16K pixels) for the Bit-Ptree-based classifier vs. the Bayesian belief network; the belief network's accuracies include 26% and 51%.]

Classification Time
The P-Tree approach requires no build time (it is a lazy classifier). In most lazy classifiers, the classification time per tuple varies with the number of items in the training set, because the training data must be scanned; the P-Tree approach does not require a traditional data scan. The data in the figure was collected using 5 significant bits and a fixed threshold probability. The time is given for scalability comparisons.
[Chart: variation of classification time with training sample size (pixels) for the bit-P-tree algorithm.]

Conclusion
The naïve assumption reduces the accuracy of classification in this particular application domain. Our approach increases the accuracy of a P-Tree Bayesian classifier by completely eliminating the naïve assumption.
– The new approach has better accuracy than the existing P-Tree-based Bayesian classifier.
– It was also shown to be better than a Bayesian belief network implementation and a Euclidean-distance-based KNN approach.
– It has the same computational cost, in terms of P-Tree operations, as the previous P-Tree approach, and is scalable with respect to the size of the data set.