CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016

Plan for today and next week Today and next time: – Bayesian networks (Bishop Sec. 8.1) – Conditional independence (Bishop Sec. 8.2) Next week: – Markov random fields (Bishop Sec. 8.3) – Hidden Markov models (Bishop Sec. 13.2) – Expectation maximization (Bishop Ch. 9)

Graphical Models If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference. No realistic amount of training data is sufficient to estimate so many parameters. If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted. Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies, allowing less restrictive independence assumptions while limiting the number of parameters that must be estimated. Bayesian networks: Directed acyclic graphs indicate causal structure. Markov networks: Undirected graphs capture general dependencies. Slide credit: Ray Mooney

Learning Graphical Models Structure Learning: Learn the graphical structure of the network. Parameter Learning: Learn the real-valued parameters of the network. CPTs for Bayes nets Potential functions for Markov nets Slide credit: Ray Mooney

Parameter Learning If values for all variables are observed during training, then the parameters can be estimated directly using frequency counts over the training data. If there are hidden variables, some form of gradient descent or Expectation Maximization (EM) must be used to estimate distributions for the hidden variables. Adapted from Ray Mooney
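To make the counting concrete, here is a minimal sketch (not from the slides; the variable names and toy records are invented for illustration) of maximum-likelihood CPT estimation from fully observed data by normalized frequency counts:

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """Estimate P(child | parents) from fully observed records via frequency counts."""
    joint = Counter()          # counts of (parent values, child value)
    parent_counts = Counter()  # counts of parent value combinations
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(pa, record[child])] += 1
        parent_counts[pa] += 1
    return {(pa, val): cnt / parent_counts[pa] for (pa, val), cnt in joint.items()}

# Toy fully observed data over Alarm (A) and JohnCalls (J); values are illustrative.
data = [
    {"A": True, "J": True}, {"A": True, "J": True}, {"A": True, "J": False},
    {"A": False, "J": False}, {"A": False, "J": False}, {"A": False, "J": True},
]
print(estimate_cpt(data, child="J", parents=["A"]))
# e.g. P(J=T | A=T) = 2/3, P(J=T | A=F) = 1/3
```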

Bayesian Networks Directed Acyclic Graph (DAG) Slide from Bishop

Bayesian Networks General Factorization: p(x) = ∏_k p(x_k | pa_k), where pa_k denotes the parents of node x_k. Slide from Bishop

Bayesian Networks Directed Acyclic Graph (DAG): nodes are random variables, edges indicate causal influences. Example network: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Slide credit: Ray Mooney

Conditional Probability Tables Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case). Roots (sources) of the DAG that have no parents are given prior probabilities. CPTs for the burglary network:
P(B) = .001; P(E) = .002
P(A | B, E): B=T, E=T: .95; B=T, E=F: .94; B=F, E=T: .29; B=F, E=F: .001
P(J | A): A=T: .90; A=F: .05
P(M | A): A=T: .70; A=F: .01
Slide credit: Ray Mooney
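As an illustration of how such CPTs can be encoded, here is a hedged sketch (the dictionary-based representation is an assumption, not the slides' code) that stores the burglary network's CPTs and evaluates the joint via the factorization P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A):

```python
# CPTs for the burglary network; each dict stores P(variable = True | parents).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=T | A)

def bern(p_true, value):
    """P(X = value) for a Boolean variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Chain-rule factorization defined by the network structure."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_J[a], j) * bern(P_M[a], m))

# Classic example: P(J=T, M=T, A=T, B=F, E=F) = .9 * .7 * .001 * .999 * .998 ≈ 0.00063
print(joint(b=False, e=False, a=True, j=True, m=True))
```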

CPT Comments The probability of false is not given since each row must sum to 1. The example requires 10 parameters rather than 2^5 − 1 = 31 for specifying the full joint distribution. The number of parameters in the CPT for a node is exponential in the number of parents. Slide credit: Ray Mooney

Bayes Net Inference Given known values for some evidence variables, determine the posterior probability of some query variables. Example: Given that John calls, what is the probability that there is a Burglary, i.e. P(Burglary | JohnCalls)? John calls 90% of the time there is an Alarm, and the Alarm detects 94% of Burglaries, so people generally think the probability should be fairly high. However, this ignores the prior probability of John calling. Slide credit: Ray Mooney

Bayes Net Inference Example: Given that John calls, what is the probability that there is a Burglary? John also calls 5% of the time when there is no Alarm. So over 1,000 days we expect 1 Burglary, and John will probably call. However, he will also call with a false report 50 times on average, so the call is about 50 times more likely to be a false report: P(Burglary | JohnCalls) ≈ 0.02. (Relevant numbers: P(B) = .001; P(J | A): A=T: .90, A=F: .05.) Slide credit: Ray Mooney
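The quoted posterior can be checked by inference by enumeration; the sketch below (again illustrative, reusing the CPT dictionaries and the joint() helper from the previous sketch) sums the joint over the hidden variables:

```python
from itertools import product

# P(Burglary = T | JohnCalls = T): sum the joint over the hidden variables E, A, M,
# then normalize by P(JohnCalls = T).
def posterior_burglary_given_john():
    num = sum(joint(True, e, a, True, m) for e, a, m in product([True, False], repeat=3))
    den = sum(joint(b, e, a, True, m) for b, e, a, m in product([True, False], repeat=4))
    return num / den

print(posterior_burglary_given_john())  # ≈ 0.016, consistent with the ≈ 0.02 quoted on the slide
```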

Bayesian Curve Fitting (1) Polynomial Slide from Bishop

Bayesian Curve Fitting (2) Plate Slide from Bishop

Bayesian Curve Fitting (3) Input variables and explicit hyperparameters Slide from Bishop

Bayesian Curve Fitting—Learning Condition on data Slide from Bishop

Bayesian Curve Fitting—Prediction Predictive distribution: where Slide from Bishop
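The equations on these curve-fitting slides did not survive transcription; the following reconstruction follows Bishop Sec. 8.1.1 (symbols as in PRML: polynomial weights w, targets t, inputs x, hyperparameters α and σ²):

```latex
% Joint over targets t and weights w, conditioned on inputs x and hyperparameters:
p(\mathbf{t}, \mathbf{w} \mid \mathbf{x}, \alpha, \sigma^2)
  = p(\mathbf{w} \mid \alpha) \prod_{n=1}^{N} p(t_n \mid \mathbf{w}, x_n, \sigma^2)

% Predictive distribution for a new input \hat{x}: condition on the data and integrate out w
p(\hat{t} \mid \hat{x}, \mathbf{x}, \mathbf{t})
  \propto \int p(\hat{t}, \mathbf{t}, \mathbf{w} \mid \hat{x}, \mathbf{x}, \alpha, \sigma^2)\, \mathrm{d}\mathbf{w}
```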

Generative vs Discriminative Models Generative approach: model p(x | C_k) and p(C_k), then use Bayes' theorem to obtain p(C_k | x). Discriminative approach: model p(C_k | x) directly. Slide from Bishop

Generative Models Causal process for generating images Slide from Bishop

Discrete Variables (1) General joint distribution: K^2 − 1 parameters. Independent joint distribution: 2(K − 1) parameters. Slide from Bishop

Discrete Variables (2) General joint distribution over M variables: K^M − 1 parameters. M-node Markov chain: K − 1 + (M − 1) K(K − 1) parameters. Slide from Bishop
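A quick numeric check of these counts (the values of K and M are illustrative):

```python
# Parameter counts for M discrete K-state variables: full joint vs. first-order Markov chain.
def full_joint_params(K, M):
    return K**M - 1

def markov_chain_params(K, M):
    return (K - 1) + (M - 1) * K * (K - 1)

for K, M in [(2, 5), (10, 3)]:
    print(K, M, full_joint_params(K, M), markov_chain_params(K, M))
# K=2, M=5: 31 vs 9;  K=10, M=3: 999 vs 189
```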

Discrete Variables: Bayesian Parameters (1) Slide from Bishop

Discrete Variables: Bayesian Parameters (2) Shared prior Slide from Bishop

Parameterized Conditional Distributions If x_1, …, x_M are discrete, K-state variables, then in general p(y = 1 | x_1, …, x_M) has O(K^M) parameters. The parameterized form p(y = 1 | x_1, …, x_M) = σ(w_0 + Σ_{i=1}^M w_i x_i), a logistic sigmoid of a linear combination of the parent values, requires only M + 1 parameters. Slide from Bishop
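A small sketch of such a parameterized conditional, assuming the logistic-sigmoid form above with made-up illustrative weights:

```python
import math

# Logistic-sigmoid CPD: p(y = 1 | x_1, ..., x_M) = sigmoid(w0 + sum_i w_i * x_i),
# which uses M + 1 parameters instead of a full conditional table.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(x, w0, w):
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

print(p_y_given_x(x=[1, 0, 1], w0=-1.0, w=[2.0, 0.5, -0.3]))  # sigmoid(0.7) ≈ 0.668
```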

Conditional Independence a is independent of b given c: p(a | b, c) = p(a | c). Equivalently: p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c). Notation: a ⫫ b | c. Slide from Bishop

Conditional Independence: Example 1 Slide from Bishop Node c is “tail to tail” for path from a to b: path makes a and b dependent

Conditional Independence: Example 1 Slide from Bishop Node c is “tail to tail” for path from a to b: c blocks the path thus making a and b conditionally independent

Conditional Independence: Example 2 Slide from Bishop Node c is “head to tail” for path from a to b: path makes a and b dependent

Node c is “head to tail” for path from a to b: c blocks the path thus making a and b conditionally independent Conditional Independence: Example 2 Slide from Bishop

Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c unobserved. Slide from Bishop Node c is “head to head” for path from a to b: c blocks the path thus making a and b independent

Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c observed. Slide from Bishop Node c is “head to head” for path from a to b: c unblocks the path thus making a and b conditionally dependent

“Am I out of fuel?” B = Battery (0 = flat, 1 = fully charged) F = Fuel Tank (0 = empty, 1 = full) G = Fuel Gauge Reading (0 = empty, 1 = full), and hence the joint factorizes as p(B, F, G) = p(B) p(F) p(G | B, F). Slide from Bishop

“Am I out of fuel?” Probability of an empty tank increased by observing G = 0. Slide from Bishop

“Am I out of fuel?” Probability of an empty tank is reduced by observing B = 0. This is referred to as “explaining away”. Slide from Bishop
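The effect can be verified numerically; the sketch below uses the fuel-gauge probabilities from Bishop Sec. 8.2 (p(B=1) = p(F=1) = 0.9 and the gauge CPT shown in the code) to reproduce the prior, post-gauge, and post-battery beliefs about an empty tank:

```python
# Explaining away in the fuel-gauge network: B and F are independent causes of G.
pB = {1: 0.9, 0: 0.1}
pF = {1: 0.9, 0: 0.1}
pG1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}  # p(G=1 | B, F)

def pG(g, b, f):
    return pG1[(b, f)] if g == 1 else 1.0 - pG1[(b, f)]

def joint(b, f, g):
    return pB[b] * pF[f] * pG(g, b, f)

# Prior belief that the tank is empty
p_F0 = pF[0]
# Observing an empty gauge raises it ...
p_F0_given_G0 = (sum(joint(b, 0, 0) for b in (0, 1)) /
                 sum(joint(b, f, 0) for b in (0, 1) for f in (0, 1)))
# ... but additionally observing a flat battery lowers it again (explaining away)
p_F0_given_G0_B0 = joint(0, 0, 0) / sum(joint(0, f, 0) for f in (0, 1))

print(p_F0, round(p_F0_given_G0, 3), round(p_F0_given_G0_B0, 3))  # 0.1, 0.257, 0.111
```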

D-separation A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⫫ B | C. Slide from Bishop
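For small graphs these rules can be checked mechanically; the following is a hedged sketch (not from the slides) of a brute-force d-separation test that enumerates simple paths and applies the two blocking conditions:

```python
# Brute-force d-separation for small DAGs. The graph is a dict: node -> list of parents.
def descendants(graph, node):
    children = {n: [c for c, ps in graph.items() if n in ps] for n in graph}
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def paths(graph, a, b, visited=None):
    # All simple paths from a to b, ignoring edge direction.
    if visited is None:
        visited = [a]
    if a == b:
        yield list(visited)
        return
    neighbors = set(graph[a]) | {n for n, ps in graph.items() if a in ps}
    for n in neighbors:
        if n not in visited:
            yield from paths(graph, n, b, visited + [n])

def blocked(graph, path, observed):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        head_to_head = prev in graph[node] and nxt in graph[node]
        if head_to_head:
            # Collider: blocks unless the node or one of its descendants is observed.
            if node not in observed and not (descendants(graph, node) & observed):
                return True
        elif node in observed:
            # Chain (head-to-tail) or fork (tail-to-tail): blocked by observing the node.
            return True
    return False

def d_separated(graph, a, b, observed):
    return all(blocked(graph, p, set(observed)) for p in paths(graph, a, b))

# Example: the head-to-head graph a -> c <- b from Example 3 above.
g = {"a": [], "b": [], "c": ["a", "b"]}
print(d_separated(g, "a", "b", observed=[]))     # True: c unobserved blocks the path
print(d_separated(g, "a", "b", observed=["c"]))  # False: observing c unblocks it
```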

D-separation: Example Slide from Bishop

D-separation: I.I.D. Data Slide from Bishop Are the x_i's marginally independent? No: integrating out the shared parameter (e.g. the mean μ) couples them. The x_i's are conditionally independent given that parameter.

Naïve Bayes Conditioned on the class z, the distributions of the input variables x_1, …, x_D are independent. Are the x_1, …, x_D marginally independent?

Naïve Bayes as a Bayes Net Naïve Bayes is a simple Bayes Net with structure Y → X_1, X_2, …, X_n. Priors P(Y) and conditionals P(X_i | Y) for Naïve Bayes provide the CPTs for the network. Slide credit: Ray Mooney
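A minimal sketch of Naïve Bayes treated as a Bayes net (the priors, conditionals, and feature values below are made-up illustrative numbers):

```python
# Naive Bayes as a Bayes net: P(Y, X_1, ..., X_n) = P(Y) * prod_i P(X_i | Y).
# Classification normalizes the products over the class values.
prior = {"spam": 0.4, "ham": 0.6}          # P(Y)
cpt = {                                    # P(X_i = 1 | Y) for binary features
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.2, 0.3, 0.4],
}

def score(y, x):
    p = prior[y]
    for p1, xi in zip(cpt[y], x):
        p *= p1 if xi == 1 else 1.0 - p1
    return p

x = [1, 0, 1]
scores = {y: score(y, x) for y in prior}
total = sum(scores.values())
print({y: s / total for y, s in scores.items()})   # posterior P(Y | X), ≈ {spam: 0.28, ham: 0.72}
```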

The Markov Blanket Factors independent of x_i cancel between numerator and denominator. Slide from Bishop The parents, children and co-parents of x_i form its Markov blanket, the minimal set of nodes that isolate x_i from the rest of the graph.
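The conditional behind this statement, reconstructed from Bishop Sec. 8.2 (the slide's equation was an image):

```latex
% Conditional of x_i given all remaining variables, written via the BN factorization:
p(x_i \mid \mathbf{x}_{\setminus i})
  = \frac{p(x_1, \ldots, x_D)}{\int p(x_1, \ldots, x_D)\, \mathrm{d}x_i}
  = \frac{\prod_k p(x_k \mid \mathrm{pa}_k)}{\int \prod_k p(x_k \mid \mathrm{pa}_k)\, \mathrm{d}x_i}
```

Only the factors that mention x_i, namely p(x_i | pa_i) and the conditionals of its children, survive the cancellation, which is why the blanket consists of parents, children, and co-parents.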

Bayes Nets vs. Markov Nets Bayes nets represent a subclass of joint distributions that capture non-cyclic causal dependencies between variables. A Markov net can represent any joint distribution. Slide credit: Ray Mooney

Markov Chains In general: p(x_1, …, x_N) = p(x_1) ∏_{n=2}^N p(x_n | x_1, …, x_{n−1}). First-order Markov chain: p(x_1, …, x_N) = p(x_1) ∏_{n=2}^N p(x_n | x_{n−1}).

Markov Chains: Second-order Markov chain: p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) ∏_{n=3}^N p(x_n | x_{n−1}, x_{n−2}).

Markov Random Fields Undirected graph over a set of random variables, where an edge represents a dependency. The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X). Every node in a Markov Net is conditionally independent of every other node given its Markov blanket. Slide credit: Ray Mooney

Markov Random Fields Markov Blanket Slide from Bishop A node is conditionally independent of all other nodes conditioned only on the neighboring nodes.

Cliques and Maximal Cliques Clique Maximal Clique Slide from Bishop

Distribution for a Markov Network The distribution of a Markov net is most compactly described in terms of a set of potential functions, φ_k, one for each clique k in the graph. For each joint assignment of values to the variables in clique k, φ_k assigns a non-negative real value that represents the compatibility of these values. The joint distribution of a Markov net is then defined by P(x) = (1/Z) ∏_k φ_k(x_{k}), where x_{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1. Slide credit: Ray Mooney
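A hedged sketch of this definition (the chain structure and potential values are invented for illustration): two pairwise cliques over binary variables, with Z computed by brute-force summation:

```python
from itertools import product

# Markov-net joint P(x) = (1/Z) * prod_k phi_k(x_{k}) for a tiny chain A - B - C
# with pairwise cliques {A, B} and {B, C}; all variables are binary.
phi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # favors A == B
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # favors B == C

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

print(Z, p(0, 0, 0), p(0, 1, 0))  # Z = 24.0, p(0,0,0) = 0.25, p(0,1,0) = 1/24
```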

Illustration: Image De-Noising (1) Original Image Noisy Image Slide from Bishop

Illustration: Image De-Noising (2) Slide from Bishop y_i in {+1, −1}: labels in the observed noisy image; x_i in {+1, −1}: labels in the noise-free image; i is the index over pixels. Neighboring x_i's, and each x_i with its y_i, are coupled through the energy E(x, y) = h Σ_i x_i − β Σ_{{i,j}} x_i x_j − η Σ_i x_i y_i, with p(x, y) = (1/Z) exp{−E(x, y)}.
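A hedged sketch of the ICM restoration used on the next slide, applied to a synthetic image (the parameter values h, β, η and the toy image are assumptions for illustration):

```python
import random

# Iterated conditional modes (ICM) for the pairwise MRF above: repeatedly set each
# x_i to the sign that lowers the energy, holding all other pixels fixed.
H, BETA, ETA = 0.0, 1.0, 2.1

def local_energy(x, y, i, j):
    n, m = len(x), len(x[0])
    neigh = sum(x[a][b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                if 0 <= a < n and 0 <= b < m)
    return H * x[i][j] - BETA * x[i][j] * neigh - ETA * x[i][j] * y[i][j]

def icm(y, sweeps=5):
    x = [row[:] for row in y]                 # initialize the restored image to the noisy one
    for _ in range(sweeps):
        for i in range(len(x)):
            for j in range(len(x[0])):
                x[i][j] = +1
                e_plus = local_energy(x, y, i, j)
                x[i][j] = -1
                e_minus = local_energy(x, y, i, j)
                x[i][j] = +1 if e_plus <= e_minus else -1   # keep the lower-energy state
    return x

# Synthetic test: a half-and-half +/-1 image with 10% of pixels flipped.
clean = [[+1 if j < 8 else -1 for j in range(16)] for _ in range(16)]
random.seed(0)
noisy = [[-v if random.random() < 0.1 else v for v in row] for row in clean]
restored = icm(noisy)
errors = sum(r != c for rr, cr in zip(restored, clean) for r, c in zip(rr, cr))
print("pixels still wrong:", errors)
```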

Illustration: Image De-Noising (3) Noisy Image / Restored Image (ICM) Slide from Bishop