1 Bayesian Learning
- Provides practical learning algorithms:
  - Naïve Bayes learning
  - Bayesian belief network learning
  - Combines prior knowledge (prior probabilities) with observed data
- Provides foundations for machine learning:
  - Evaluating learning algorithms
  - Guiding the design of new algorithms
  - Learning from models: meta-learning
2 Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
3 Basic Formulas for Probabilities
- Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
- Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_i P(Ai) = 1, then
  P(B) = Σ_i P(B | Ai) P(Ai)
4 Basic Approach
Bayes rule:
  P(h | D) = P(D | h) P(h) / P(D)
- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h | D) = probability of h given D (posterior probability)
- P(D | h) = probability of D given h (likelihood of D given h)
The goal of Bayesian learning: find the most probable hypothesis given the training data, the Maximum A Posteriori (MAP) hypothesis:
  h_MAP = argmax_{h in H} P(h | D) = argmax_{h in H} P(D | h) P(h)
5 An Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
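Working this example with Bayes rule (it suffices to compare unnormalized posteriors, since P(D) is the same for both hypotheses):
  P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078
  P(+ | ~cancer) P(~cancer) = 0.03 × 0.992 ≈ 0.0298
So h_MAP = ~cancer, and the normalized posterior is P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.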
6 MAP Learner
- For each hypothesis h in H, calculate the posterior probability P(h | D).
- Output the hypothesis h_MAP with the highest posterior probability.
Comments (see the sketch after this list):
- Computationally intensive.
- Provides a standard for judging the performance of learning algorithms.
- Choosing P(h) and P(D | h) reflects our prior knowledge about the learning task.
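A minimal brute-force Python sketch of this learner (the coin-bias hypothesis space, uniform prior, and Bernoulli likelihood in the usage lines are illustrative assumptions, not from the slides):

# Brute-force MAP learner: score every hypothesis by prior * likelihood.
def map_learner(hypotheses, prior, likelihood, data):
    """hypotheses: iterable of candidate hypotheses;
    prior(h) = P(h); likelihood(h, data) = P(data | h)."""
    # P(D) is constant across hypotheses, so it can be ignored in the argmax.
    return max(hypotheses, key=lambda h: prior(h) * likelihood(h, data))

# Usage: pick the most probable coin bias given observed flips.
hypotheses = [0.3, 0.5, 0.8]            # candidate values of P(heads)
prior = lambda h: 1 / 3                 # uniform prior over hypotheses
likelihood = lambda h, d: h**sum(d) * (1 - h)**(len(d) - sum(d))
data = [1, 1, 0, 1]                     # observed coin flips
print(map_learner(hypotheses, prior, likelihood, data))  # -> 0.8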
7 Bayes Optimal Classifier
Question: given a new instance x, what is its most probable classification?
- h_MAP(x) is not necessarily the most probable classification!
- Example: let P(h1 | D) = .4, P(h2 | D) = .3, P(h3 | D) = .3. Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −. What is the most probable classification of x?
Bayes optimal classification:
  v* = argmax_{v in V} Σ_{h in H} P(v | h) P(h | D)
Example:
  P(h1 | D) = .4, P(− | h1) = 0, P(+ | h1) = 1
  P(h2 | D) = .3, P(− | h2) = 1, P(+ | h2) = 0
  P(h3 | D) = .3, P(− | h3) = 1, P(+ | h3) = 0
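Summing over the hypotheses: P(+ | D) = 1 × .4 + 0 × .3 + 0 × .3 = .4 and P(− | D) = 0 × .4 + 1 × .3 + 1 × .3 = .6, so the Bayes optimal classification of x is −, even though the single most probable hypothesis h1 predicts +.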
8 Naïve Bayes Learner
Assume a target function f: X -> V, where each instance x is described by attributes <a1, a2, ..., an>. The most probable value of f(x) is:
  v_MAP = argmax_{vj in V} P(vj | a1, ..., an) = argmax_{vj in V} P(a1, ..., an | vj) P(vj)
Naïve Bayes assumption (attributes are conditionally independent given the class):
  P(a1, ..., an | vj) = Π_i P(ai | vj)
which gives the Naïve Bayes classifier:
  v_NB = argmax_{vj in V} P(vj) Π_i P(ai | vj)
9 Bayesian Classification
The classification problem may be formalized using a-posteriori probabilities:
- P(C | X) = probability that the sample tuple X = <x1, ..., xk> is of class C.
- E.g., P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C | X) is maximal.
10 Estimating A-Posteriori Probabilities
- Bayes theorem: P(C | X) = P(X | C) · P(C) / P(X)
- P(X) is constant for all classes.
- P(C) = relative frequency of class-C samples.
- The C that maximizes P(C | X) is the C that maximizes P(X | C) · P(C).
- Problem: computing P(X | C) directly is infeasible!
11 Naïve Bayesian Classification
- Naïve assumption: attribute independence, i.e.
  P(x1, ..., xk | C) = P(x1 | C) · ... · P(xk | C)
- If the i-th attribute is categorical: P(xi | C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C.
- If the i-th attribute is continuous: P(xi | C) is estimated through a Gaussian density function.
- Computationally easy in both cases.
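For the continuous case, the standard estimate (assuming a normal class-conditional distribution, which is the usual reading of this slide) is
  P(xi | C) = (1 / (σ_C √(2π))) exp(−(xi − μ_C)² / (2 σ_C²))
where μ_C and σ_C are the mean and standard deviation of the i-th attribute over the class-C training samples.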
12 Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities P(xi | C) for each attribute value and class.
14 Example: Naïve Bayes
Predict playing tennis on a day with the conditions <sunny, cool, high, strong> (i.e., compute P(v | o = sunny, t = cool, h = high, w = strong)) using the following training data:

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

we have:
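Counting in the table above gives the estimates P(yes) = 9/14, P(no) = 5/14, P(sunny | yes) = 2/9, P(cool | yes) = 3/9, P(high | yes) = 3/9, P(strong | yes) = 3/9, and P(sunny | no) = 3/5, P(cool | no) = 1/5, P(high | no) = 4/5, P(strong | no) = 3/5. Then:
  P(yes) · P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
  P(no) · P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206
so Naïve Bayes predicts Play Tennis = No.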
15 The Independence Hypothesis...
- ... makes computation possible
- ... yields optimal classifiers when satisfied
- ... but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first
16 Naïve Bayes Algorithm
Naïve_Bayes_Learn(examples):
  for each target value vj:
    estimate P(vj)
    for each attribute value ai of each attribute a:
      estimate P(ai | vj)

Classify_New_Instance(x):
  v_NB = argmax_{vj in V} P(vj) Π_{ai in x} P(ai | vj)

Typical (m-estimate) estimation of P(ai | vj):
  P(ai | vj) = (nc + m·p) / (n + m)
where
  n: number of examples with v = vj
  nc: number of examples with v = vj and a = ai
  p: prior estimate for P(ai | vj)
  m: the weight given to the prior (equivalent sample size)
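A minimal runnable Python sketch of this algorithm (the (attributes, label) data format and the uniform prior p used in the m-estimate are assumptions for illustration):

from collections import Counter, defaultdict

def naive_bayes_learn(examples, m=1.0):
    """examples: list of (attributes_tuple, target_value) pairs.
    Returns class priors and an m-estimate conditional probability function."""
    class_counts = Counter(v for _, v in examples)
    priors = {v: c / len(examples) for v, c in class_counts.items()}

    # Count attribute values per (attribute index, class).
    value_counts = defaultdict(Counter)   # (i, vj) -> Counter over ai
    values_per_attr = defaultdict(set)    # i -> set of observed values ai
    for attrs, v in examples:
        for i, ai in enumerate(attrs):
            value_counts[(i, v)][ai] += 1
            values_per_attr[i].add(ai)

    def cond_prob(i, ai, vj):
        # m-estimate: (nc + m*p) / (n + m), with uniform prior p.
        n = class_counts[vj]
        nc = value_counts[(i, vj)][ai]
        p = 1.0 / len(values_per_attr[i])
        return (nc + m * p) / (n + m)

    return priors, cond_prob

def classify(x, priors, cond_prob):
    # v_NB = argmax_vj P(vj) * prod_i P(ai | vj)
    def score(vj):
        s = priors[vj]
        for i, ai in enumerate(x):
            s *= cond_prob(i, ai, vj)
        return s
    return max(priors, key=score)

Trained on the 14-day tennis table from slide 14 (encoded as a list of (attributes, label) pairs), classify(('Sunny', 'Cool', 'High', 'Strong'), *naive_bayes_learn(examples)) returns 'No', agreeing with the hand computation.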
17 Bayesian Belief Networks
- The Naïve Bayes assumption of conditional independence is too restrictive.
- But inference is intractable without some such assumptions.
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
Bayesian net:
- Node = variable
- Arc = dependency
- DAG, with the direction of an arc representing causality
18 Bayesian Networks: Multiple Variables with Dependency
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.
Bayesian net:
- Node = variable, where each variable has a finite set of mutually exclusive states
- Arc = dependency
- DAG, with the direction of an arc representing causality
- To each variable A with parents B1, ..., Bn there is attached a conditional probability table P(A | B1, ..., Bn)
19 Bayesian Belief Networks
- Age, Occupation, and Income determine whether a customer will buy this product.
- Given that the customer buys the product, whether there is interest in insurance is independent of Age, Occupation, and Income.
- [Network: Age, Occ, Income -> Buy X -> Interested in Insurance] The joint distribution factorizes as:
  P(Age, Occ, Inc, Buy, Ins) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Ins | Buy)
- Current state of the art: given structure and probabilities, existing algorithms can handle inference with categorical values and limited representation of numerical values.
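A small Python sketch of how this factorization turns CPT lookups into a joint probability (all table values below are invented placeholders, not from the slides):

# Joint probability via the network's factorization (hypothetical CPTs).
p_age = {'old': 0.3, 'young': 0.7}
p_occ = {'clerk': 0.5, 'eng': 0.5}
p_inc = {'high': 0.4, 'low': 0.6}
p_buy = {('old', 'eng', 'high'): 0.9}     # P(Buy=yes | Age, Occ, Inc)
p_ins = {True: 0.8, False: 0.1}           # P(Ins=yes | Buy)

def joint(age, occ, inc, buy, ins):
    pb = p_buy.get((age, occ, inc), 0.5)  # default for unlisted CPT rows
    pb = pb if buy else 1 - pb
    pi = p_ins[buy] if ins else 1 - p_ins[buy]
    return p_age[age] * p_occ[occ] * p_inc[inc] * pb * pi

print(joint('old', 'eng', 'high', True, True))  # 0.3*0.5*0.4*0.9*0.8 = 0.0432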
21 Nodes as Functions
- A node in a BN is a conditional distribution function P(X | A = a, B = b):
  - input: the parents' state values
  - output: a distribution over its own values
- [Figure: CPT for a node X with values {l, m, h}, one column per parent state combination (a b, a ~b, ~a b, ~a ~b); e.g., the a b column gives P(X = l, m, h | a, b) = (0.1, 0.3, 0.6).]
22 Special Case: Naïve Bayes
- [Network: class node h with children e1, e2, ..., en]
- P(e1, e2, ..., en, h) = P(h) P(e1 | h) ... P(en | h)
23 Inference in Bayesian Networks
- [Network nodes: Age, Income, HouseOwner, LivingLocation, NewspaperPreference, EUVotingPattern]
- How likely are elderly rich people to buy the Sun?
  P(paper = Sun | Age > 60, Income > 60k)
24 Inference in Bayesian Networks
- [Same network as above]
- How likely are elderly rich people who voted Labour to buy the Daily Mail?
  P(paper = DM | Age > 60, Income > 60k, v = labour)
25 Bayesian Learning
- [Network over Burglary, Earthquake, Alarm, Newscast, Call; data cases are tuples over (B, E, A, C, N), e.g., (~b, e, a, c, n), ...]
- Input: fully or partially observable data cases
- Output: parameters AND also structure
Learning methods:
- EM (Expectation Maximisation):
  - use the current approximation of the parameters to estimate the filled-in data
  - use the filled-in data to update the parameters (ML)
- Gradient ascent training
- Gibbs sampling (MCMC)
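As a concrete illustration of the EM loop described above, here is a minimal Python sketch for a tiny invented model: one hidden binary root H with two observed binary children (the data, structure, and starting values are made up for illustration, not from the slides):

import numpy as np

# Hidden binary root H with two observed binary children E1, E2;
# each data row is an observed (e1, e2) case, H is never observed.
data = np.array([[1, 1], [1, 0], [0, 0], [1, 1], [0, 1], [0, 0]])

p_h = 0.5                              # P(H=1), initial guess
p_e = np.array([[0.4, 0.6],            # P(Ei=1 | H=0) for i = 1, 2
                [0.7, 0.8]])           # P(Ei=1 | H=1)

for _ in range(50):
    # E-step: posterior P(H=1 | e) per case -- the "filled-in data".
    lik = lambda h: np.prod(np.where(data == 1, p_e[h], 1 - p_e[h]), axis=1)
    w1 = p_h * lik(1)
    w0 = (1 - p_h) * lik(0)
    post = w1 / (w0 + w1)              # expected count of H=1 per case

    # M-step: ML parameter updates from the expected counts.
    p_h = post.mean()
    p_e[1] = (post[:, None] * data).sum(0) / post.sum()
    p_e[0] = ((1 - post)[:, None] * data).sum(0) / (1 - post).sum()

print(p_h, p_e)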