
1 PGM: Tirgul 11: Naïve Bayesian Classifier + Tree Augmented Naïve Bayes (adapted from a tutorial by Nir Friedman and Moises Goldszmidt)

2 The Classification Problem
- From a data set describing objects by vectors of features and a class label
- Find a function F: features → class to classify a new object
(Slide figure: an example data set in which each feature vector carries a class label, e.g. Vector 1 = Presence, Vector 2 = Presence, ..., Vector 7 = Absence.)

3 Examples
- Predicting heart disease
  - Features: cholesterol, chest pain, angina, age, etc.
  - Class: {present, absent}
- Finding lemons in cars
  - Features: make, brand, miles per gallon, acceleration, etc.
  - Class: {normal, lemon}
- Digit recognition
  - Features: matrix of pixel descriptors
  - Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
- Speech recognition
  - Features: signal characteristics, language model
  - Class: {pause/hesitation, retraction}

4 Approaches
- Memory based
  - Define a distance between samples
  - Nearest neighbor, support vector machines
- Decision surface
  - Find the best partition of the space
  - CART, decision trees
- Generative models
  - Induce a model and impose a decision rule
  - Bayesian networks

5 Generative Models
- Bayesian classifiers
  - Induce a probability distribution describing the data, P(A_1, ..., A_n, C)
  - Impose a decision rule: given a new object with attribute values a_1, ..., a_n, predict c* = argmax_c P(C = c | a_1, ..., a_n)
- We have shifted the problem to learning P(A_1, ..., A_n, C)
- We are learning how to do this efficiently: learn a Bayesian network representation for P(A_1, ..., A_n, C)
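A minimal Python sketch of this decision rule, assuming the joint distribution P(A_1, ..., A_n, C) has already been learned and is stored as a table; the names, shapes, and toy data are illustrative assumptions, not taken from the tutorial.

    import numpy as np

    def bayes_classify(joint, attr_values):
        # joint: array of shape (|A_1|, ..., |A_n|, |C|) holding P(A_1, ..., A_n, C).
        # attr_values: tuple of observed attribute value indices (a_1, ..., a_n).
        # P(C = c | a_1, ..., a_n) is proportional to P(a_1, ..., a_n, C = c),
        # so the normalizing constant can be skipped when taking the argmax.
        posterior_unnormalized = joint[attr_values]   # vector over class values
        return int(np.argmax(posterior_unnormalized))

    # Toy usage with a hypothetical joint over two binary attributes and a binary class.
    joint = np.random.dirichlet(np.ones(8)).reshape(2, 2, 2)
    print(bayes_classify(joint, (1, 0)))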

6 Optimality of the decision rule
Minimizing the error rate...
- Let c_i be the true class, and let l_j be the class returned by the classifier.
- A decision by the classifier is correct if c_i = l_j, and in error if c_i ≠ l_j.
- The error incurred by choosing label l_j is given by the expression reconstructed below.
- Thus, had we had access to P, we would minimize the error rate by choosing the l_i with the largest posterior probability, which is exactly the decision rule for the Bayesian classifier.
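The formulas on this slide did not survive extraction; a standard reconstruction consistent with the surrounding text (not necessarily the slide's exact notation) is

    \mathrm{Err}(l_j) \;=\; \sum_{i \neq j} P(c_i \mid a_1, \ldots, a_n)
    \;=\; 1 - P(l_j \mid a_1, \ldots, a_n),

so the error rate is minimized by choosing l_i whenever P(l_i \mid a_1, \ldots, a_n) \geq P(l_j \mid a_1, \ldots, a_n) for all j, i.e. l^{*} = \arg\max_{c} P(C = c \mid a_1, \ldots, a_n).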

7 Advantages of the Generative Model Approach
- Output: a rank over the outcomes, e.g. the likelihood of present vs. absent
- Explanation: what is the profile of a "typical" person with heart disease?
- Missing values: handled both in training and in testing
- Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?
- Validation: confidence measures over the model and its parameters
- Background knowledge: priors and structure

8 Evaluating the performance of a classifier: n-fold cross validation
(Slide figure: the original data set is partitioned into segments D1, D2, D3, ..., Dn; in run k, segment Dk is held out for testing and the remaining segments are used for training.)
- Partition the data set into n segments
- Do n times:
  - Train the classifier on the training segments (green in the figure)
  - Test accuracy on the held-out segment (red in the figure)
- Compute statistics over the n runs: mean accuracy and variance
- Accuracy on test data of size m: Acc = (number of correctly classified test instances) / m
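A minimal Python sketch of the procedure, assuming a classifier object with fit and predict methods; the interface, like the rest of the code, is an illustrative assumption rather than anything prescribed by the slides.

    import numpy as np

    def cross_validate(X, y, make_classifier, n_folds):
        # Shuffle the instances and split them into n roughly equal segments.
        indices = np.random.permutation(len(X))
        folds = np.array_split(indices, n_folds)
        accuracies = []
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            clf = make_classifier()
            clf.fit(X[train_idx], y[train_idx])
            # Acc = (number of correctly classified test instances) / m
            accuracies.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
        return np.mean(accuracies), np.var(accuracies)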

9 Advantages of Using a Bayesian Network
(Slide figure: a Bayesian network for heart disease over the attributes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels, Thal, and Outcome; accuracy = 85%; data source: UCI repository.)
- Efficiency in learning and query answering
  - Combine knowledge engineering and statistical induction
  - Algorithms for decision making, value of information, diagnosis and repair

10 Problems with BNs as classifiers
When evaluating a Bayesian network, we examine the likelihood of the model B given the data D and try to maximize it (score reconstructed below). When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:
- A Bayesian network minimizes the error over all the variables in the domain, not necessarily the local error of the class given the attributes (this is OK with enough data).
- Because of the penalty, a Bayesian network in effect looks only at a small subset of the variables that affect a given node (its Markov blanket).
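The score the slide points to did not survive extraction; one standard form consistent with the description (a reconstruction, not necessarily the slide's exact notation) is

    LL(B \mid D) = \sum_{m=1}^{N} \log P_B\big(a_1^{(m)}, \ldots, a_n^{(m)}, c^{(m)}\big),
    \qquad
    \mathrm{MDL}(B \mid D) = \frac{\log N}{2}\,|B| \;-\; LL(B \mid D),

where |B| is the number of parameters in the network and N the number of training instances; structure learning minimizes the MDL score, balancing fit against complexity.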

11 Problems with BNs as classifiers (cont.)
Let's look closely at the likelihood term (reconstructed below):
- The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.
- When there are many attributes, the second term starts to dominate (the magnitude of the log grows as probabilities become small, and the joint probability over many attributes is small).
- Why not use just the first term? Because then we can no longer factorize, and the calculations become much harder.
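The decomposition the slide refers to can be reconstructed from the surrounding text (generic notation, not necessarily the slide's own):

    LL(B \mid D) \;=\; \sum_{m=1}^{N} \log P_B\big(c^{(m)} \mid a_1^{(m)}, \ldots, a_n^{(m)}\big)
    \;+\; \sum_{m=1}^{N} \log P_B\big(a_1^{(m)}, \ldots, a_n^{(m)}\big).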

12 The Naïve Bayesian Classifier
(Slide figure: the naïve Bayes structure, with class C as the single parent of features F1 through F6; in this example the features are pregnant, age, insulin, dpf, mass, and glucose. Diabetes in Pima Indians, from the UCI repository.)
- Fixed structure encoding the assumption that features are independent of each other given the class.
- Learning amounts to estimating the parameters of P(F_i | C) for each feature F_i.
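Under this structure the posterior used for classification factorizes as (a standard consequence of the structure, written out here for completeness):

    P(C \mid F_1, \ldots, F_n) \;\propto\; P(C) \prod_{i=1}^{n} P(F_i \mid C).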

13 The Naïve Bayesian Classifier (cont.)
What do we gain?
- We ensure that in the learned network the probability P(C | A_1, ..., A_n) takes every attribute into account.
- We will show a polynomial-time algorithm for learning the network.
- The estimates are robust: they consist of low-order statistics that require only few instances.
- It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.
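A minimal Python sketch of such a learner for discrete, integer-coded features; the function names and data layout are illustrative assumptions, not taken from the tutorial. Zero counts would make some estimates undefined; smoothing (discussed on slide 19) addresses that.

    import numpy as np

    def train_naive_bayes(X, y, n_values, n_classes):
        # Estimate P(C) and each P(F_i | C) by frequency counts (multinomial MLE).
        n, d = X.shape
        prior = np.bincount(y, minlength=n_classes) / n
        cond = []                       # cond[i][v, c] approximates P(F_i = v | C = c)
        for i in range(d):
            counts = np.zeros((n_values[i], n_classes))
            for v, c in zip(X[:, i], y):
                counts[v, c] += 1
            cond.append(counts / counts.sum(axis=0, keepdims=True))
        return prior, cond

    def naive_bayes_predict(x, prior, cond):
        # argmax_c  log P(c) + sum_i log P(x_i | c)
        log_post = np.log(prior)
        for i, v in enumerate(x):
            log_post += np.log(cond[i][v])
        return int(np.argmax(log_post))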

14 The Naïve Bayesian Classifier (cont.)
- Common practice is to estimate the parameters as shown below.
- These estimates are identical to the MLE for multinomials.
(Slide figure: the naïve Bayes structure with class C as the parent of features F1 through F6.)
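The estimation formulas on this slide did not survive extraction; the standard frequency-count (multinomial MLE) estimates they most likely refer to are

    \hat{P}(C = c) = \frac{N(c)}{N},
    \qquad
    \hat{P}(F_i = f \mid C = c) = \frac{N(f, c)}{N(c)},

where N(\cdot) counts occurrences in the training data.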

15 Improving Naïve Bayes
- Naïve Bayes encodes assumptions of independence that may be unreasonable: are pregnancy and age independent given diabetes?
  - Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level together over-penalize the class variable)
- The success of naïve Bayes is attributed to
  - Robust estimation
  - The decision may be correct even if the probabilities are inaccurate
- Idea: improve on naïve Bayes by weakening the independence assumptions. Bayesian networks provide the appropriate mathematical language for this task.

16 Tree Augmented Naïve Bayes (TAN)
- Approximate the dependence among features with a tree Bayes net
- Tree induction algorithm
  - Optimality: maximum likelihood tree
  - Efficiency: polynomial algorithm
- Robust parameter estimation
(Slide figure: the TAN structure, with class C as the parent of features F1 through F6 (pregnant, age, insulin, dpf, mass, glucose) and an additional tree of edges among the features.)

17 Optimal Tree construction algorithm
The procedure of Chow and Liu constructs a tree structure B_T that maximizes LL(B_T | D):
- Compute the mutual information between every pair of attributes (formula reconstructed below).
- Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.
- Build a maximum weighted spanning tree of this graph.
Complexity: O(n^2 N) + O(n^2) + O(n^2 log n) = O(n^2 N), where n is the number of attributes and N is the sample size.
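The mutual-information formula did not survive extraction; the standard empirical form it refers to is

    I(A_i; A_j) \;=\; \sum_{a_i, a_j} \hat{P}(a_i, a_j)\, \log \frac{\hat{P}(a_i, a_j)}{\hat{P}(a_i)\,\hat{P}(a_j)},

computed from the empirical (frequency) distribution. A rough Python sketch of the three steps, using a simple Prim-style maximum spanning tree; the code is illustrative, not from the tutorial.

    import numpy as np
    from itertools import combinations

    def mutual_information(x, y):
        # Empirical mutual information between two discrete (integer-coded) columns.
        joint = np.zeros((x.max() + 1, y.max() + 1))
        for a, b in zip(x, y):
            joint[a, b] += 1
        joint /= len(x)
        px, py = joint.sum(axis=1), joint.sum(axis=0)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

    def chow_liu_tree(X):
        # Step 1: pairwise mutual information; Step 2: complete weighted graph;
        # Step 3: maximum weighted spanning tree (Prim's algorithm).
        n = X.shape[1]
        w = np.zeros((n, n))
        for i, j in combinations(range(n), 2):
            w[i, j] = w[j, i] = mutual_information(X[:, i], X[:, j])
        in_tree, edges = {0}, []
        while len(in_tree) < n:
            i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                       key=lambda e: w[e])
            edges.append((i, j))
            in_tree.add(j)
        return edges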

18 Tree construction algorithm (cont.)
It is easy to "plant" the optimal tree in the TAN model by revising the algorithm to use a revised conditional measure that takes the conditioning on the class into account (reconstructed below). This measures the gain in log-likelihood of adding A_i as a parent of A_j when C is already a parent.
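In its standard form, the conditional measure referred to here is (a reconstruction):

    I(A_i; A_j \mid C) \;=\; \sum_{a_i, a_j, c} \hat{P}(a_i, a_j, c)\,
    \log \frac{\hat{P}(a_i, a_j \mid c)}{\hat{P}(a_i \mid c)\,\hat{P}(a_j \mid c)}.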

19 Problem with TAN
When estimating the parameters we estimate the conditional probability P(A_i | Parents(A_i)). This is done by partitioning the data according to the possible values of Parents(A_i).
- When a partition contains just a few instances we get an unreliable estimate.
- In naïve Bayes the partition was only on the values of the class (and we have to assume that this is adequate).
- In TAN each attribute has twice as many parents, so there are many more (and smaller) partitions, and we get unreliable estimates, especially for small data sets.
Solution: smooth the estimates (formula reconstructed below), where s is the smoothing bias, typically small.
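The smoothing formula did not survive extraction; one common form consistent with the description (a reconstruction, so the slide's exact expression may differ) interpolates the partition estimate with the unconditional one:

    \hat{\theta}_s(a_i \mid \mathrm{pa}_i) \;=\;
    \frac{N(\mathrm{pa}_i)}{N(\mathrm{pa}_i) + s}\, \hat{P}(a_i \mid \mathrm{pa}_i)
    \;+\; \frac{s}{N(\mathrm{pa}_i) + s}\, \hat{P}(a_i),

where N(pa_i) is the number of instances in the partition and s is the smoothing bias.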

20 Performance: TAN vs. Naïve Bayes
(Slide figure: scatter plot comparing per-data-set accuracy of TAN against naïve Bayes over the 65 to 100 percent range.)
25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

21 Performance: TAN vs. C4.5
(Slide figure: scatter plot comparing per-data-set accuracy of C4.5 against TAN over the 65 to 100 percent range.)
25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

22 Beyond TAN
- Can we do better by learning a more flexible structure?
- Experiment: learn a Bayesian network without restrictions on the structure

23 Performance: TAN vs. Bayesian Networks
(Slide figure: scatter plot comparing per-data-set accuracy of TAN against unrestricted Bayesian networks over the 65 to 100 percent range.)
25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

24 Classification: Summary
- Bayesian networks provide a useful language to improve Bayesian classifiers
  - Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.
- Additional benefits
  - Missing values
  - Computing the trade-offs involved in finding out feature values
  - Computing misclassification costs
- Recent progress: combining generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as support vector machines

