1 CS 391L: Machine Learning: Bayesian Learning: Beyond Naïve Bayes Raymond J. Mooney University of Texas at Austin.

Slides:



Advertisements
Similar presentations
CS188: Computational Models of Human Behavior
Advertisements

Bayesian networks Chapter 14 Section 1 – 2. Outline Syntax Semantics Exact computation.
Markov Networks Alan Ritter.
Bayesian Networks CSE 473. © Daniel S. Weld 2 Last Time Basic notions Atomic events Probabilities Joint distribution Inference by enumeration Independence.
Belief networks Conditional independence Syntax and semantics Exact inference Approximate inference CS 460, Belief Networks1 Mundhenk and Itti Based.
Exact Inference in Bayes Nets
Dynamic Bayesian Networks (DBNs)
For Monday Read chapter 18, sections 1-2 Homework: –Chapter 14, exercise 8 a-d.
For Monday Finish chapter 14 Homework: –Chapter 13, exercises 8, 15.
Overview of Inference Algorithms for Bayesian Networks Wei Sun, PhD Assistant Research Professor SEOR Dept. & C4I Center George Mason University, 2009.
Markov Networks.
CPSC 322, Lecture 26Slide 1 Reasoning Under Uncertainty: Belief Networks Computer Science cpsc322, Lecture 27 (Textbook Chpt 6.3) March, 16, 2009.
Bayesian network inference
Review: Bayesian learning and inference
Bayesian Networks Chapter 2 (Duda et al.) – Section 2.11
1 Graphical Models in Data Assimilation Problems Alexander Ihler UC Irvine Collaborators: Sergey Kirshner Andrew Robertson Padhraic Smyth.
Inference in Bayesian Nets
Bayesian Networks. Motivation The conditional independence assumption made by naïve Bayes classifiers may seem to rigid, especially for classification.
Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.
Bayesian Belief Networks
Bayesian Networks Alan Ritter.
CPSC 422, Lecture 18Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Feb, 25, 2015 Slide Sources Raymond J. Mooney University of.
1 Bayesian Networks Chapter ; 14.4 CS 63 Adapted from slides by Tim Finin and Marie desJardins. Some material borrowed from Lise Getoor.
Bayesian Reasoning. Tax Data – Naive Bayes Classify: (_, No, Married, 95K, ?)
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
Crash Course on Machine Learning
11 CS 388: Natural Language Processing: Discriminative Training and Conditional Random Fields (CRFs) for Sequence Labeling Raymond J. Mooney University.
Bayesian networks Chapter 14. Outline Syntax Semantics.
1 CS 343: Artificial Intelligence Bayesian Networks Raymond J. Mooney University of Texas at Austin.
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
Undirected Models: Markov Networks David Page, Fall 2009 CS 731: Advanced Methods in Artificial Intelligence, with Biomedical Applications.
Bayesian networks. Motivation We saw that the full joint probability can be used to answer any question about the domain, but can become intractable as.
1 Chapter 14 Probabilistic Reasoning. 2 Outline Syntax of Bayesian networks Semantics of Bayesian networks Efficient representation of conditional distributions.
For Wednesday Read Chapter 11, sections 1-2 Program 2 due.
Aprendizagem Computacional Gladys Castillo, UA Bayesian Networks Classifiers Gladys Castillo University of Aveiro.
2 Syntax of Bayesian networks Semantics of Bayesian networks Efficient representation of conditional distributions Exact inference by enumeration Exact.
1 CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes Raymond J. Mooney University of Texas at Austin.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.
Bayesian Statistics and Belief Networks. Overview Book: Ch 13,14 Refresher on Probability Bayesian classifiers Belief Networks / Bayesian Networks.
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
1 Generative and Discriminative Models Jie Tang Department of Computer Science & Technology Tsinghua University 2012.
1 CMSC 671 Fall 2001 Class #21 – Tuesday, November 13.
Slides for “Data Mining” by I. H. Witten and E. Frank.
Marginalization & Conditioning Marginalization (summing out): for any sets of variables Y and Z: Conditioning(variant of marginalization):
Lecture 2: Statistical learning primer for biologists
Exact Inference in Bayes Nets. Notation U: set of nodes in a graph X i : random variable associated with node i π i : parents of node i Joint probability:
Inference Algorithms for Bayes Networks
Learning and Acting with Bayes Nets Chapter 20.. Page 2 === A Network and a Training Data.
1 CMSC 671 Fall 2001 Class #20 – Thursday, November 8.
Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,
Chapter 12. Probability Reasoning Fall 2013 Comp3710 Artificial Intelligence Computing Science Thompson Rivers University.
CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016.
Web-Mining Agents Data Mining Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)
Naive Bayes (Generative Classifier) vs. Logistic Regression (Discriminative Classifier) Minkyoung Kim.
1 Chapter 6 Bayesian Learning lecture slides of Raymond J. Mooney, University of Texas at Austin.
A Brief Introduction to Bayesian networks
CS 2750: Machine Learning Review
CS 2750: Machine Learning Directed Graphical Models
Qian Liu CSE spring University of Pennsylvania
Web-Mining Agents Part: Data Mining
Markov Networks.
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18
Read R&N Ch Next lecture: Read R&N
Class #19 – Tuesday, November 3
Expectation-Maximization & Belief Propagation
Class #16 – Tuesday, October 26
Hankz Hankui Zhuo Bayesian Networks Hankz Hankui Zhuo
Discriminative Probabilistic Models for Relational Data
Markov Networks.
Presentation transcript:

1 CS 391L: Machine Learning: Bayesian Learning: Beyond Naïve Bayes Raymond J. Mooney University of Texas at Austin

2 Logistic Regression Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is: Equivalent to a one-layer backpropagation neural net. –Logistic regression is the source of the sigmoid function used in backpropagation. –Objective function for training is somewhat different.

3 Logistic Regression as a Log-Linear Model Logistic regression is basically a linear model, which is demonstrated by taking logs. Also called a maximum entropy model (MaxEnt) because it can be shown that standard training for logistic regression gives the distribution with maximum entropy that is consistent with the training data.

4 Logistic Regression Training Weights are set during training to maximize the conditional data likelihood : where D is the set of training examples and Y d and X d denote, respectively, the values of Y and X for example d. Equivalently viewed as maximizing the conditional log likelihood (CLL)

5 Logistic Regression Training Like neural-nets, can use standard gradient descent to find the parameters (weights) that optimize the CLL objective function. Many other more advanced training methods are possible to speed convergence. –Conjugate gradient –Generalized Iterative Scaling (GIS) –Improved Iterative Scaling (IIS) –Limited-memory quasi-Newton (L-BFGS)

6 Preventing Overfitting in Logistic Regression To prevent overfitting, one can use regularization (a.k.a. smoothing) by penalizing large weights by changing the training objective: This can be shown to be equivalent to assuming a Guassian prior for W with zero mean and a variance related to 1/λ. Where λ is a constant that determines the amount of smoothing

7 Multinomial Logistic Regression Logistic regression can be generalized to multi-class problems (where Y has a multinomial distribution). Effectively constructs a linear classifier for each category.

8 Relation Between Naïve Bayes and Logistic Regression Naïve Bayes with Gaussian distributions for features (GNB), can be shown to given the same functional form for the conditional distribution P(Y|X). –But converse is not true, so Logistic Regression makes a weaker assumption. Logistic regression is a discriminative rather than generative model, since it models the conditional distribution P(Y|X) and directly attempts to fit the training data for predicting Y from X. Does not specify a full joint distribution. When conditional independence is violated, logistic regression gives better generalization if it is given sufficient training data. GNB converges to accurate parameter estimates faster (O(log n) examples for n features) compared to Logistic Regression (O(n) examples). –Experimentally, GNB is better when training data is scarce, logistic regression is better when it is plentiful.

9 Graphical Models If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference. No realistic amount of training data is sufficient to estimate so many parameters. If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted. Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated. –Bayesian Networks: Directed acyclic graphs that indicate causal structure. –Markov Networks: Undirected graphs that capture general dependencies.

10 Bayesian Networks Directed Acyclic Graph (DAG) –Nodes are random variables –Edges indicate causal influences Burglary Earthquake Alarm JohnCalls MaryCalls

11 Conditional Probability Tables Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case). –Roots (sources) of the DAG that have no parents are given prior probabilities. Burglary Earthquake Alarm JohnCalls MaryCalls P(B).001 P(E).002 BEP(A) TT.95 TF.94 FT.29 FF.001 AP(M) T.70 F.01 AP(J) T.90 F.05

12 CPT Comments Probability of false not given since rows must add to 1. Example requires 10 parameters rather than 2 5 –1=31 for specifying the full joint distribution. Number of parameters in the CPT for a node is exponential in the number of parents (fan-in).

13 Joint Distributions for Bayes Nets A Bayesian Network implicitly defines a joint distribution. Example Therefore an inefficient approach to inference is: –1) Compute the joint distribution using this equation. –2) Compute any desired conditional probability using the joint distribution.

14 Naïve Bayes as a Bayes Net Naïve Bayes is a simple Bayes Net Y X1X1 X2X2 … XnXn Priors P(Y) and conditionals P(X i |Y) for Naïve Bayes provide CPTs for the network.

15 Bayes Net Inference Given known values for some evidence variables, determine the posterior probability of some query variables. Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? John calls 90% of the time there is an Alarm and the Alarm detects 94% of Burglaries so people generally think it should be fairly high. However, this ignores the prior probability of John calling.

16 Bayes Net Inference Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? John also calls 5% of the time when there is no Alarm. So over 1,000 days we expect 1 Burglary and John will probably call. However, he will also call with a false report 50 times on average. So the call is about 50 times more likely a false report: P(Burglary | JohnCalls) ≈ 0.02 P(B).001 AP(J) T.90 F.05

17 Bayes Net Inference Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? Actual probability of Burglary is since the alarm is not perfect (an Earthquake could have set it off or it could have gone off on its own). On the other side, even if there was not an alarm and John called incorrectly, there could have been an undetected Burglary anyway, but this is unlikely. P(B).001 AP(J) T.90 F.05

18 Complexity of Bayes Net Inference In general, the problem of Bayes Net inference is NP-hard (exponential in the size of the graph). For singly-connected networks or polytrees in which there are no undirected loops, there are linear- time algorithms based on belief propagation. –Each node sends local evidence messages to their children and parents. –Each node updates belief in each of its possible values based on incoming messages from it neighbors and propagates evidence on to its neighbors. There are approximations to inference for general networks based on loopy belief propagation that iteratively refines probabilities that converge to accurate values in the limit.

19 Belief Propagation Example λ messages are sent from children to parents representing abductive evidence for a node. π messages are sent from parents to children representing causal evidence for a node. Burglary Earthquake Alarm JohnCalls MaryCalls λ λ λ π Alarm Burglary Earthquake MaryCalls

20 Markov Networks Undirected graph over a set of random variables, where an edge represents a dependency. The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X). Every node in a Markov Net is conditionally independent of every other node given its Markov blanket.

21 Distribution for a Markov Network The distribution of a Markov net is most compactly described in terms of a set of potential functions, φ k, for each clique, k, in the graph. For each joint assignment of values to the variables in clique k, φ k assigns a non-negative real value that represents the compatibility of these values. The joint distribution of a Markov is then defined by: Where x {k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes a joint distribution that sums to 1.

22 Inference in Markov Networks Inference in general Markov nets is #P complete. Approximation algorithms include: –Markov Chain Monte Carlo (MCMC) –Loopy belief propagation

23 Bayes Nets vs. Markov Nets Bayes nets represent a subclass of joint distributions that capture non-cyclic causal dependencies between variables. A Markov net can represent any joint distribution. –If network is fully connected then there is one clique that is includes all of the variables and whose potential function directly encodes the full joint distribution.

24 Learning Graphical Models Structure Learning: Learn the graphical structure of the network. Parameter Learning: Learn the real- valued parameters of the network –CPTs for Bayes Nets –Potential functions for Markov Nets

25 Structure Learning Use greedy top-down search through the space of networks, considering adding each possible edge one at a time and picking the one that maximizes a statistical evaluation metric that measures fit to the training data. Alternative is to test all pairs of nodes to find ones that are statistically correlated and adding edges accordingly. Bayes net learning requires determining the direction of causal influences. Special algorithms for limited graph topologies. –TAN (Tree Augmented Naïve-Bayes) for learning Bayes nets that are trees.

26 Parameter Learning If values for all variables are available during training, then parameter estimates can be directly estimated using frequency counts over the training data. –Must smooth estimates to compensate for limited training data. If there are hidden variables, some form of gradient descent or Expectation Maximization (EM) must be used to estimate distributions for hidden variables. –Like setting the weights feeding hidden units in backpropagation neural nets.

27 Statistical Relational Learning Expand graphical model learning approach to handle instances more expressive than feature vectors that include arbitrary numbers of objects with properties and relations between them. –Probabilistic Relational Models (PRMs) –Stochastic Logic Programs (SLPs) –Bayesian Logic Programs (BLPs) –Relational Markov Networks (RMNs) –Markov Logic Networks (MLNs) –Other TLAs Collective classification: Classify multiple dependent objects based on both and object’s properties as well as the class of other related objects. –Get beyond IID assumption for instances

28 Collective Classification of Web Pages using RMNs [Taskar, Abbeel & Koller 2002]

29 Conclusions Bayesian learning methods are firmly based on probability theory and exploit advanced methods developed in statistics. Naïve Bayes is a simple generative model that works fairly well in practice. Logistic Regression is a discriminative classifier that directly models the conditional distribution P(Y|X). Graphical models allow specifying limited dependencies using graphs. –Bayes Nets: DAG –Markov Nets: Undirected Graph