PGM: Tirgul 11 - Naïve Bayesian Classifier + Tree Augmented Naïve Bayes (adapted from a tutorial by Nir Friedman and Moises Goldszmidt).


The Classification Problem
- From a data set describing objects by vectors of features and a class
- Find a function F: features → class to classify a new object
(Example training set from the slide: Vector 1 = Presence, Vector 2 = Presence, Vector 3 = Presence, Vector 4 = Absence, Vector 5 = Absence, Vector 6 = Presence, Vector 7 = Absence)

Examples
- Predicting heart disease
  - Features: cholesterol, chest pain, angina, age, etc.
  - Class: {present, absent}
- Finding lemons in cars
  - Features: make, brand, miles per gallon, acceleration, etc.
  - Class: {normal, lemon}
- Digit recognition
  - Features: matrix of pixel descriptors
  - Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
- Speech recognition
  - Features: signal characteristics, language model
  - Class: {pause/hesitation, retraction}

Approaches
- Memory based
  - Define a distance between samples
  - Nearest neighbor, support vector machines
- Decision surface
  - Find the best partition of the space
  - CART, decision trees
- Generative models
  - Induce a model and impose a decision rule
  - Bayesian networks

Generative Models
- Bayesian classifiers
  - Induce a probability distribution describing the data, P(A1,…,An,C)
  - Impose a decision rule: given a new object, predict c = argmax_c P(C = c | a1,…,an)
- We have shifted the problem to learning P(A1,…,An,C)
- We are learning how to do this efficiently: learn a Bayesian network representation for P(A1,…,An,C)

Optimality of the Decision Rule
Minimizing the error rate...
- Let c_i be the true class, and let l_j be the class returned by the classifier.
- A decision by the classifier is correct if c_i = l_j, and in error if c_i ≠ l_j.
- The error incurred by choosing label l_j is the total posterior probability of all the other classes given the observed attributes.
- Thus, had we had access to P, we would minimize the error rate by choosing the label l_i with the largest posterior probability, which is exactly the decision rule of the Bayesian classifier.
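A reconstruction of the equations missing from the slide (standard notation, not copied from the original):

\[
\mathrm{Err}(l_j \mid a_1,\dots,a_n) = \sum_{i \neq j} P(c_i \mid a_1,\dots,a_n) = 1 - P(l_j \mid a_1,\dots,a_n)
\]

so the error rate is minimized by choosing

\[
l^{*} = \arg\max_{c}\, P(C = c \mid a_1,\dots,a_n) = \arg\max_{c}\, P(a_1,\dots,a_n \mid C = c)\, P(C = c).
\]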

Advantages of the Generative Model Approach
- Output: a ranking over the outcomes, e.g., the likelihood of present vs. absent
- Explanation: what is the profile of a “typical” person with heart disease?
- Missing values: handled both in training and in testing
- Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?
- Validation: confidence measures over the model and its parameters
- Background knowledge: priors and structure

Evaluating the Performance of a Classifier: n-fold Cross Validation
[Figure: the original data set is partitioned into segments D1, D2, D3, …, Dn; in run i, segment Di is held out for testing and the remaining segments are used for training.]
- Partition the data set into n segments
- Do n times:
  - Train the classifier on the other n-1 segments
  - Test its accuracy on the held-out segment
- Compute statistics over the n runs: mean accuracy and variance
- Accuracy on test data of size m: Acc = (number of correctly classified test instances) / m
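A minimal Python sketch of the procedure above (not part of the original tutorial); train and accuracy are hypothetical callables standing in for any classifier and its evaluation:

# A minimal sketch of n-fold cross validation; `train` and `accuracy`
# are hypothetical stand-ins for any classifier and its evaluation.
import random

def cross_validate(data, labels, n_folds, train, accuracy, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]     # n roughly equal segments
    scores = []
    for i in range(n_folds):
        held_out = set(folds[i])
        train_x = [data[j] for j in idx if j not in held_out]
        train_y = [labels[j] for j in idx if j not in held_out]
        model = train(train_x, train_y)                   # train on the other n-1 segments
        test_x = [data[j] for j in folds[i]]
        test_y = [labels[j] for j in folds[i]]
        scores.append(accuracy(model, test_x, test_y))    # test on the held-out segment
    mean = sum(scores) / n_folds
    var = sum((s - mean) ** 2 for s in scores) / n_folds
    return mean, var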

Advantages of Using a Bayesian Network
[Figure: a Bayesian network for heart disease over the attributes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels, Thal and the class variable Outcome; accuracy = 85%; data source: UCI repository.]
- Efficiency in learning and query answering
  - Combine knowledge engineering and statistical induction
  - Algorithms for decision making, value of information, diagnosis and repair

Problems with BNs as Classifiers
When evaluating a Bayesian network we examine the likelihood of the model B given the data D, LL(B | D), and try to maximize it. When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:
- A Bayesian network minimizes the error over all the variables in the domain, and not necessarily the local error of the class given the attributes (OK with enough data).
- Because of the penalty, a Bayesian network in effect looks only at a small subset of the variables that affect a given node (its Markov blanket).

Problems with BNs as Classifiers (cont.)
Let's look closely at the likelihood term. It decomposes into two sums: one over the conditional probability of the class given the attributes, and one over the joint probability of the attributes.
- The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.
- When there are many attributes, the second term starts to dominate (the magnitude of the log grows as the probabilities become small).
- Why not use just the first term? We can no longer factorize, and the calculations become much harder.
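A reconstruction of the missing decomposition (notation mine; the superscript d indexes the N training instances):

\[
LL(B \mid D) = \sum_{d=1}^{N} \log P_B\left(c^{d} \mid a_1^{d},\dots,a_n^{d}\right) + \sum_{d=1}^{N} \log P_B\left(a_1^{d},\dots,a_n^{d}\right)
\]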

The Naïve Bayesian Classifier
[Figure: the naïve Bayes structure, with the class node C pointing to the feature nodes F1,…,F6; example: diabetes in Pima Indians (from the UCI repository) with features pregnant, age, insulin, dpf, mass, glucose.]
- Fixed structure encoding the assumption that features are independent of each other given the class.
- Learning amounts to estimating the parameters P(Fi | C) for each feature Fi.
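To make this concrete, here is a minimal Python sketch of a discrete naïve Bayes learner and its decision rule (not the tutorial's code; the Laplace-style correction for unseen values is an illustrative choice):

# A minimal discrete naive Bayes: count-based parameter estimates and
# an argmax decision rule, as described above.
from collections import Counter, defaultdict
import math

def train_nb(X, y):
    """X: list of feature tuples, y: list of class labels."""
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)            # (feature index, class) -> value counts
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            feat_counts[(i, c)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, xs):
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / total)              # log P(C = c)
        for i, v in enumerate(xs):
            counts = feat_counts[(i, c)]
            # log P(F_i = v | C = c), with a small Laplace-style correction for unseen values
            score += math.log((counts[v] + 1) / (nc + len(counts) + 1))
        if score > best_score:
            best, best_score = c, score
    return best                                   # argmax_c P(c) * prod_i P(f_i | c)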

The Naïve Bayesian Classifier (cont.)
What do we gain?
- We ensure that in the learned network the probability P(C | A1…An) takes every attribute into account.
- We will show a polynomial-time algorithm for learning the network.
- The estimates are robust: they consist of low-order statistics that require only few instances.
- It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.

The Naïve Bayesian Classifier (cont.)
- Common practice is to estimate the parameters from the observed frequencies in the training data.
- These estimates are identical to the MLE for multinomials.
[Figure: the naïve Bayes structure C → F1,…,F6 repeated.]
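A reconstruction of the estimates the slide refers to (the original formula was an image); N(·) denotes counts in a training set of size N:

\[
\hat{P}(C = c) = \frac{N(C = c)}{N}, \qquad \hat{P}(F_i = f \mid C = c) = \frac{N(F_i = f, C = c)}{N(C = c)}
\]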

Improving Naïve Bayes
- Naïve Bayes encodes assumptions of independence that may be unreasonable: are pregnancy and age independent given diabetes?
  Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level over-penalize the class variable).
- The success of naïve Bayes is attributed to
  - Robust estimation
  - The decision may be correct even if the probabilities are inaccurate
- Idea: improve on naïve Bayes by weakening the independence assumptions. Bayesian networks provide the appropriate mathematical language for this task.

Tree Augmented Naïve Bayes (TAN)
[Figure: the TAN structure for the Pima Indians diabetes example; the class node C points to every feature (pregnant, age, insulin, dpf, mass, glucose), and the feature nodes F1,…,F6 are additionally connected by a tree of edges.]
- Approximate the dependence among features with a tree Bayes net
- Tree induction algorithm
  - Optimality: maximum-likelihood tree
  - Efficiency: polynomial algorithm
- Robust parameter estimation

Optimal Tree Construction Algorithm
The procedure of Chow and Liu constructs a tree structure B_T that maximizes LL(B_T | D):
- Compute the mutual information between every pair of attributes.
- Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.
- Build a maximum weighted spanning tree of this graph.
Complexity: O(n^2 N) + O(n^2) + O(n^2 log n) = O(n^2 N), where n is the number of attributes and N is the sample size.
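A minimal Python sketch of the Chow-Liu procedure above (not the tutorial's code); it computes pairwise mutual information and grows a maximum weighted spanning tree with Prim's algorithm:

# Chow-Liu tree construction: pairwise mutual information + maximum
# weighted spanning tree (Prim's algorithm).
from collections import Counter
import math

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) between two discrete columns."""
    N = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((nxy / N) * math.log((nxy * N) / (px[x] * py[y]))
               for (x, y), nxy in pxy.items())

def chow_liu_tree(columns):
    """columns: list of n discrete attribute columns; returns a list of tree edges (i, j)."""
    n = len(columns)
    weight = {(i, j): mutual_information(columns[i], columns[j])
              for i in range(n) for j in range(i + 1, n)}
    in_tree, edges = {0}, []
    while len(in_tree) < n:          # Prim: repeatedly add the heaviest edge crossing the cut
        i, j = max(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: weight[(min(e), max(e))])
        edges.append((i, j))
        in_tree.add(j)
    return edges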

Tree Construction Algorithm (cont.)
It is easy to “plant” the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the class into account: the mutual information between attributes conditioned on the class, I(A_i; A_j | C). This measures the gain in log-likelihood of adding A_i as a parent of A_j when C is already a parent.
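Written out, assuming the standard conditional mutual information used in the TAN literature (the slide's original equation was an image):

\[
I_{\hat{P}}(A_i; A_j \mid C) = \sum_{a_i, a_j, c} \hat{P}(a_i, a_j, c)\, \log \frac{\hat{P}(a_i, a_j \mid c)}{\hat{P}(a_i \mid c)\,\hat{P}(a_j \mid c)}
\]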

Problem with TAN
The parameters are the conditional probabilities P(A_i | Parents(A_i)), estimated by partitioning the data according to the possible values of Parents(A_i).
- When a partition contains just a few instances, we get an unreliable estimate.
- In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate).
- In TAN each attribute is conditioned on the class and on one more attribute, so the partitions are finer and the estimates are unreliable, especially for small data sets.
Solution: smooth each conditional estimate toward the unconditional (marginal) estimate, where s is the smoothing bias and is typically small.
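One plausible form of the smoothed estimate, in the spirit of Friedman, Geiger and Goldszmidt's TAN paper (a reconstruction, not the slide's own formula; Π_i denotes Parents(A_i), N the sample size, s the smoothing bias):

\[
\theta^{s}(a_i \mid \Pi_i) = \frac{N \hat{P}(\Pi_i)}{N \hat{P}(\Pi_i) + s}\, \hat{P}(a_i \mid \Pi_i) + \frac{s}{N \hat{P}(\Pi_i) + s}\, \hat{P}(a_i)
\]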

Performance: TAN vs. Naïve Bayes
[Scatter plot comparing the accuracy of TAN against naïve Bayes on each data set.]
- 25 data sets from the UCI repository: medical, signal processing, financial, games
- Accuracy based on 5-fold cross-validation
- No parameter tuning

Performance: TAN vs. C4.5
[Scatter plot comparing the accuracy of TAN against C4.5 on each data set.]
- 25 data sets from the UCI repository: medical, signal processing, financial, games
- Accuracy based on 5-fold cross-validation
- No parameter tuning

Beyond TAN
- Can we do better by learning a more flexible structure?
- Experiment: learn a Bayesian network without restrictions on the structure

Performance: TAN vs. Bayesian Networks
[Scatter plot comparing the accuracy of TAN against unrestricted Bayesian networks on each data set.]
- 25 data sets from the UCI repository: medical, signal processing, financial, games
- Accuracy based on 5-fold cross-validation
- No parameter tuning

Classification: Summary
- Bayesian networks provide a useful language for improving Bayesian classifiers
  - Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.
- Additional benefits
  - Missing values
  - Compute the tradeoffs involved in finding out feature values
  - Compute misclassification costs
- Recent progress
  - Combine generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as support vector machines