Presentation on theme: "CVPR Winter seminar Jaemin Kim"— Presentation transcript:

1 Bayesian Networks
CVPR Winter seminar, Jaemin Kim

2 Outline
Concepts in Probability: random variables, basic properties (Bayes rule)
Bayesian Networks
Inference
Decision making
Learning networks from data
Reasoning over time
Applications

3 Probabilities
Probability distribution P(X|x), where X is a random variable (discrete or continuous) and x is the background state of information.

4 Discrete Random Variables
Finite set of possible outcomes; e.g. X binary: two possible values whose probabilities sum to 1.

5 Continuous Random Variables
Probability distribution (density function) over continuous values

6 More Probabilities
Joint, P(X=x, Y=y): probability that both X=x and Y=y.
Conditional, P(X=x | Y=y): probability that X=x given we know that Y=y.

7 Rules of Probability
Product rule: P(X,Y) = P(X|Y) P(Y).
Marginalization, X binary: P(Y) = P(Y, x) + P(Y, ¬x).

8 Bayes Rule
P(X|Y) = P(Y|X) P(X) / P(Y)

9 Graph Model
Purpose: infer information about a particular variable (its probability distribution) from information about other, correlated variables.
Definition: a collection of variables (nodes) with a set of dependencies (edges) between the variables, and a set of probability distribution functions for each variable.
A Bayesian network is a special type of graph model whose graph is a directed acyclic graph (DAG).

10 Bayesian Networks
A graph: nodes represent the random variables; directed edges (arrows) connect pairs of nodes; the graph must be a directed acyclic graph (DAG); the graph represents relationships between variables.
Conditional probability specifications: the conditional probability distribution (CPD) of each variable given its parents; for a discrete variable, a table (CPT).

11 Bayesian Networks (Belief Networks)
Bayesian networks: a graph with directed edges (arrows) between pairs of nodes; an edge expresses causality (A "causes" B); used mainly in the AI and statistics communities.
Markov random fields (MRF): a graph with undirected edges between pairs of nodes; a simple definition of independence: if all paths between the nodes in A and B are separated by the nodes in a third set C, then A and B are conditionally independent given C; used mainly in the physics and vision communities.

12 Bayesian Networks

13 Bayesian Networks: Basics
Structured representation
Conditional independence
Naïve Bayes model
Independence facts

14 Bayesian networks: Smoking → Cancer

P(S):
  P(S=no)    = 0.80
  P(S=light) = 0.15
  P(S=heavy) = 0.05

P(C|S):               S=no   S=light  S=heavy
  P(C=none   | S)     0.96   0.88     0.60
  P(C=benign | S)     0.03   0.08     0.25
  P(C=malig  | S)     0.01   0.04     0.15

15 Product Rule: P(C,S) = P(C|S) P(S)
P(C=none ^ S=no) = P(C=none | S=no)P(S=no) = 0.96*0.8 = 0.768

17 Marginalization
P(Smoke): P(S=no) = P(S=no ^ C=none) + P(S=no ^ C=benign) + P(S=no ^ C=malig)
P(Cancer): P(C=malig) = P(C=malig ^ S=no) + P(C=malig ^ S=light) + P(C=malig ^ S=heavy)

18 Bayes Rule Revisited

P(S|C):               C=none  C=benign  C=malignant
  P(S=no    | C)      0.821   0.522     0.421
  P(S=light | C)      0.141   0.261     0.316
  P(S=heavy | C)      0.037   0.217     0.263
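The computations on slides 14–18 can be reproduced mechanically. Below is a minimal Python sketch (plain dictionaries; the variable names P_S and P_C_given_S are my own, not from the slides) of the product rule, marginalization, and Bayes rule applied to the slide-14 CPTs. The Bayes-rule numbers it prints follow from those CPTs and may not exactly match the rounded table above.

```python
# Sketch: product rule, marginalization, Bayes rule on the Smoking -> Cancer CPTs.
P_S = {"no": 0.80, "light": 0.15, "heavy": 0.05}            # P(S)
P_C_given_S = {                                             # P(C | S)
    "no":    {"none": 0.96, "benign": 0.03, "malig": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malig": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malig": 0.15},
}

# Product rule: P(C, S) = P(C | S) P(S)
joint = {(c, s): P_C_given_S[s][c] * P_S[s]
         for s in P_S for c in P_C_given_S[s]}
print(joint[("none", "no")])          # 0.768, as on slide 15

# Marginalization: P(C=malig) = sum over s of P(C=malig, S=s)
P_C_malig = sum(joint[("malig", s)] for s in P_S)
print(P_C_malig)                      # 0.0215

# Bayes rule: P(S | C=malig) is the renormalized joint column
P_S_given_malig = {s: joint[("malig", s)] / P_C_malig for s in P_S}
print(P_S_given_malig)
```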

19 A Bayesian Network
Nodes: Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, Lung Tumor.

20 Problems with Large Instances
The joint probability distribution P(A,G,E,S,C,L,SC): for seven binary variables there are 2^7 = 128 values in the joint distribution (for 100 variables there are over 10^30 values). How are these values to be obtained?
Inference: obtaining posterior distributions once some evidence is available requires summation over an exponential number of terms, e.g. 2^2 in the example calculation, which increases to 2^97 if there are 100 variables.

21 Independence
Age and Gender are independent.
P(A,G) = P(G)P(A)
P(A|G) = P(A), i.e. A ⊥ G
P(G|A) = P(G), i.e. G ⊥ A
P(A,G) = P(G|A) P(A) = P(G)P(A)
P(A,G) = P(A|G) P(G) = P(A)P(G)

22 Conditional Independence
Cancer is independent of Age and Gender given Smoking: P(C|A,G,S) = P(C|S), i.e. C ⊥ A,G | S.
Conditioning on Smoking=heavy constrains the probability distribution of Age and Gender.
Conditioning on Smoking=heavy constrains the probability distribution of Cancer.
Given Smoking=heavy, Cancer is independent of Age and Gender.

23 More Conditional Independence: Naïve Bayes
Serum Calcium and Lung Tumor are dependent.
Serum Calcium is independent of Lung Tumor, given Cancer: P(L|SC,C) = P(L|C).

24 More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are independent: E ⊥ S.
Exposure to Toxics is dependent on Smoking, given Cancer:
P(E = heavy | C = malignant) > P(E = heavy | C = malignant, S = heavy)

25 More Conditional Independence: Explaining Away
Exposure to Toxics is dependent on Smoking, given Cancer. Moralize the graph: connect the parents (Exposure to Toxics and Smoking) of the common child Cancer.

26 Putting it all together
The full network over Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, and Lung Tumor.

27 General Product (Chain) Rule for Bayesian Networks
P(X1, …, Xn) = Π_i P(Xi | Pai), where Pai = parents(Xi).

28 Conditional Independence
A variable (node) is conditionally independent of its non-descendants given its parents.
Example: Cancer is independent of its non-descendants Age and Gender given its parents Exposure to Toxics and Smoking; Serum Calcium and Lung Tumor are its descendants.

29 Another non-descendant
Adding a Diet node: Cancer is independent of Diet given Exposure to Toxics and Smoking.

30 Representing the Joint Distribution
In general, for a network with nodes X1, X2, …, Xn, the joint factors as P(X1, …, Xn) = Π_i P(Xi | parents(Xi)). An enormous saving can be made in the number of values required for the joint distribution: specifying the joint distribution directly for n binary variables requires 2^n − 1 values, whereas for a BN with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required.

31 An Example
Variables: Smoking history (S), Bronchitis (B), Lung Cancer (L), Fatigue (F), X-ray (X).
P(s1) = 0.2
P(b1|s1) = 0.25    P(b1|s2) = 0.05
P(l1|s1) = 0.003   P(l1|s2) =          (value not shown)
P(f1|b1,l1) = 0.75   P(f1|b1,l2) = 0.10   P(f1|b2,l1) = 0.5   P(f1|b2,l2) = 0.05
P(x1|l1) = 0.6     P(x1|l2) = 0.02

32 Solution
The joint distribution over the 5 variables factorizes according to the network as
P(S,B,L,F,X) = P(S) P(B|S) P(L|S) P(F|B,L) P(X|L).
For example, the probability that someone has a smoking history, lung cancer but not bronchitis, suffers from fatigue and tests positive in an X-ray test is
P(s1, b2, l1, f1, x1) = P(s1) P(b2|s1) P(l1|s1) P(f1|b2,l1) P(x1|l1) = 0.2 × 0.75 × 0.003 × 0.5 × 0.6 = 0.000135.
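As a sanity check of the chain rule (slide 27) on this model, here is a minimal Python sketch; the dictionary and function names are mine, and the missing P(l1|s2) from the slide is deliberately left out.

```python
# Chain rule on the slide-31 network: P(S,B,L,F,X) = P(S) P(B|S) P(L|S) P(F|B,L) P(X|L)
P_s1 = 0.2
P_b1_given_s = {True: 0.25, False: 0.05}                    # P(b1 | S)
P_l1_given_s = {True: 0.003}                                # P(l1 | s2) not shown on the slide
P_f1_given_bl = {(True, True): 0.75, (True, False): 0.10,
                 (False, True): 0.5, (False, False): 0.05}  # P(f1 | B, L)
P_x1_given_l = {True: 0.6, False: 0.02}                     # P(x1 | L)

def joint(s, b, l, f, x):
    """Joint probability of one full assignment (True = the '1' value of each variable)."""
    def bern(p, value):   # P(var=value) for a binary variable with P(var=True)=p
        return p if value else 1.0 - p
    return (bern(P_s1, s) * bern(P_b1_given_s[s], b) * bern(P_l1_given_s[s], l)
            * bern(P_f1_given_bl[(b, l)], f) * bern(P_x1_given_l[l], x))

# Smoking history, no bronchitis, lung cancer, fatigue, positive X-ray:
print(joint(s=True, b=False, l=True, f=True, x=True))       # 0.000135
```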

33 Independence and Graph Separation
Given a set of observations, is one set of variables dependent on another set? Observing effects can induce dependencies. d-separation (Pearl 1988) allows us to check conditional independence graphically.

34 Bayesian networks: Additional structure
Nodes as functions
Causal independence
Context-specific dependencies
Continuous variables
Hierarchy and model construction

35 Nodes as functions
A BN node is a conditional distribution function: its parent values are the inputs, and its output is a distribution over its own values (e.g., for one setting of the parents A and B, X is lo with probability 0.7, med with 0.1, hi with 0.2).

36 Nodes as functions
Any type of function from Val(A,B) to distributions over Val(X) can be used, not just a table.

37 Continuous variables
Example: Indoor Temperature depends on A/C Setting (e.g. hi) and Outdoor Temperature (e.g. 97°). The node is a function from Val(A,B) to density functions over Val(X).

38 Gaussian (normal) distributions
N(μ, σ) — the figure shows curves with different means and with different variances.

39 Gaussian networks
Each variable is a linear function of its parents, with Gaussian noise. The joint probability density over X and Y is then itself Gaussian.
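A minimal numpy sketch of such a linear Gaussian pair X → Y; the parameters a, b, sigma_y are illustrative values of my choosing, not from the slide.

```python
# X ~ N(mu_x, sigma_x^2) and Y = a*X + b + noise, i.e. p(y|x) = N(a*x + b, sigma_y^2)
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x = 0.0, 1.0          # prior over the parent X
a, b, sigma_y = 2.0, 1.0, 0.5     # linear-Gaussian CPD of the child Y given X

x = rng.normal(mu_x, sigma_x, size=10_000)
y = a * x + b + rng.normal(0.0, sigma_y, size=10_000)

# Empirically the joint (X, Y) is Gaussian, with cov(X, Y) = a * sigma_x^2.
print(np.cov(x, y))               # off-diagonal entry is close to 2.0
```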

40 Composing functions
Recall: a BN node is a function. We can compose functions to get more complex functions. The result: a hierarchically structured BN. Since functions can be called more than once, we can reuse a BN model fragment in multiple contexts.

41 Example: a hierarchically structured car model (figure)
Car fragment: Owner, Maintenance, Age, Original-value, Mileage, Engine, Tires, Brakes, Fuel-efficiency, Braking-power. Owner fragment: Income, Age. Brakes fragment: Power, Brakes. Tires fragment: RF-Tire, LF-Tire, Traction, Pressure. Engine fragment: Power, Engine.

42 Bayesian Networks: Knowledge acquisition
Variables
Structure
Numbers

43 Values versus Probabilities
What is a variable? A set of collectively exhaustive, mutually exclusive values (e.g. Error Occurred / No Error). Distinguish values from probabilities: "Smoking" is a variable, "Risk of Smoking" is a probability.

44 Clarity Test: Knowable in Principle
Weather {Sunny, Cloudy, Rain, Snow}
Gasoline: cents per gallon
Temperature {≥ 100°F, < 100°F}
User needs help on Excel Charting {Yes, No}
User's personality {dominant, submissive}

45 Structuring
Network structure corresponding to "causality" is usually good. Example nodes: Age, Gender, Smoking, Exposure to Toxics, Genetic Damage, Cancer, Lung Tumor. Extending the conversation.

46 Course Contents Concepts in Probability Bayesian Networks Inference Decision making Learning networks from data Reasoning over time Applications

47 Inference Patterns of reasoning Basic inference Exact inference Exploiting structure Approximate inference

48 Predictive Inference
How likely are elderly males to get malignant cancer?
P(C=malignant | Age>60, Gender=male)

49 Combined
How likely is an elderly male patient with high Serum Calcium to have malignant cancer?
P(C=malignant | Age>60, Gender=male, Serum Calcium=high)

50 Explaining away
If we see a lung tumor, the probability of heavy smoking and of exposure to toxics both go up. If we then observe heavy smoking, the probability of exposure to toxics goes back down.

51 Inference in Belief Networks
Find P(Q=q | E=e), where Q is the query variable and E is the set of evidence variables.
P(q | e) = P(q, e) / P(e)
P(q, e) = Σ_{x1,…,xn} P(q, e, x1, …, xn), where X1, …, Xn are the network variables other than Q and E.

52 Basic Inference (chain A → B → C)
P(b) = Σ_a P(a, b) = Σ_a P(b | a) P(a)
P(c) = Σ_b P(c | b) P(b)
Equivalently: P(c) = Σ_{b,a} P(a, b, c) = Σ_{b,a} P(c | b) P(b | a) P(a) = Σ_b P(c | b) Σ_a P(b | a) P(a) = Σ_b P(c | b) P(b)
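A minimal sketch of this computation on the chain A → B → C, with made-up CPTs for binary variables (names and numbers are illustrative). It compares brute-force enumeration with pushing the sums in, as the slide does.

```python
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}   # P(B | A)
P_C_given_B = {True: {True: 0.6, False: 0.4}, False: {True: 0.05, False: 0.95}} # P(C | B)
vals = (True, False)

# Brute force: P(c) = sum over a, b of P(a) P(b|a) P(c|b)
P_C_brute = {c: sum(P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
                    for a in vals for b in vals) for c in vals}

# Pushing the sums in: first P(b) = sum_a P(b|a) P(a), then P(c) = sum_b P(c|b) P(b)
P_B = {b: sum(P_B_given_A[a][b] * P_A[a] for a in vals) for b in vals}
P_C = {c: sum(P_C_given_B[b][c] * P_B[b] for b in vals) for c in vals}

print(P_C_brute, P_C)   # both give the same distribution over C
```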

53 Inference in trees
P(x) = Σ_{y1,y2} P(x | y1, y2) P(y1, y2)
Because of the independence of Y1 and Y2:
P(x) = Σ_{y1,y2} P(x | y1, y2) P(y1) P(y2)

54 Polytrees
A network is singly connected (a polytree) if it contains no undirected loops.
Theorem: inference in a singly connected network can be done in linear time* (* in network size, including table sizes).
Main idea: in variable elimination, we need only maintain distributions over single nodes.

55 The problem with loops
Network: Cloudy → Rain, Cloudy → Sprinkler, Rain → Grass-wet, Sprinkler → Grass-wet.
P(c) = 0.5; P(s | c) = 0.01, P(s | ¬c) = 0.99; P(r | c) = 0.99, P(r | ¬c) = 0.01.
Grass-wet is a deterministic OR of Rain and Sprinkler: the grass is dry only if there is no rain and no sprinklers, so P(¬g) = P(¬r, ¬s) ≈ 0.

56 The problem with loops, contd.
P(¬g) = P(¬g | r, s) P(r, s) + P(¬g | r, ¬s) P(r, ¬s) + P(¬g | ¬r, s) P(¬r, s) + P(¬g | ¬r, ¬s) P(¬r, ¬s)
     = P(¬r, ¬s) ≈ 0   (the first three conditionals are 0, the last is 1)
If Rain and Sprinkler are wrongly treated as independent:
P(¬r, ¬s) = P(¬r) P(¬s) ≈ 0.5 · 0.5 = 0.25 — problem.

57 Variable elimination (chain A → B → C)
P(c) = Σ_b P(c | b) Σ_a P(b | a) P(a)
Steps: multiply P(A) × P(B | A) to get P(B, A); sum out A to get P(B); multiply P(B) × P(C | B) to get P(C, B); sum out B to get P(C).

58 Inference as variable elimination
A factor over X is a function from Val(X) to numbers in [0,1]: a CPT is a factor; a joint distribution is also a factor.
BN inference: factors are multiplied to give new ones, and variables in factors are summed out. A variable can be summed out as soon as all factors mentioning it have been multiplied.

59 Variable elimination with loops
Multiply P(A) × P(G) × P(S | A,G) to get P(A,G,S); sum out G to get P(A,S); multiply by P(E | A) to get P(A,E,S); sum out A to get P(E,S); multiply by P(C | E,S) to get P(E,S,C); sum out E,S to get P(C); multiply by P(L | C) to get P(C,L); sum out C to get P(L).
Complexity is exponential in the size of the factors.
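A small sketch of the two factor operations the slides describe (multiply and sum out), run on the chain example from slide 52; the factor encoding and CPT numbers are my own, not the tutorial's code.

```python
# A factor is (variables, table), where table maps value tuples to numbers.
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    (vars1, t1), (vars2, t2) = f1, f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assignment in product([True, False], repeat=len(out_vars)):
        env = dict(zip(out_vars, assignment))
        table[assignment] = (t1[tuple(env[v] for v in vars1)]
                             * t2[tuple(env[v] for v in vars2)])
    return (out_vars, table)

def sum_out(var, factor):
    """Sum a variable out of a factor."""
    f_vars, t = factor
    keep, idx = [v for v in f_vars if v != var], f_vars.index(var)
    table = {}
    for assignment, value in t.items():
        key = tuple(v for i, v in enumerate(assignment) if i != idx)
        table[key] = table.get(key, 0.0) + value
    return (keep, table)

# Chain A -> B -> C with the illustrative CPTs used earlier.
fA = (["A"], {(True,): 0.3, (False,): 0.7})
fBA = (["A", "B"], {(True, True): 0.8, (True, False): 0.2,
                    (False, True): 0.1, (False, False): 0.9})
fCB = (["B", "C"], {(True, True): 0.6, (True, False): 0.4,
                    (False, True): 0.05, (False, False): 0.95})

fB = sum_out("A", multiply(fA, fBA))      # P(B)
fC = sum_out("B", multiply(fB, fCB))      # P(C); P(C=True) = 0.2205
print(fC)
```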

60 Inference in BNs and the Junction Tree
The main point of BNs is to enable probabilistic inference to be performed. Inference is the task of computing the probability of each value of a node in the BN when other variables' values are known. The general idea is to do inference by representing the joint probability distribution on an undirected graph called the junction tree.
The junction tree has the following characteristics: it is an undirected tree; its nodes are clusters of variables; given two clusters C1 and C2, every node on the path between them contains their intersection C1 ∩ C2; a separator S is associated with each edge and contains the variables in the intersection between neighbouring nodes (e.g. clusters ABC, BCD, CDE with separators BC and CD).

61 Inference in BNs
1. Moralize the Bayesian network.
2. Triangulate the moralized graph.
3. Let the cliques of the triangulated graph be the nodes of a tree, and construct the junction tree.
4. Do inference by belief propagation throughout the junction tree.

62 Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG. Consider the BN in our example (nodes S, F, B, L, X): the moral graph is obtained by "marrying" the parents of each node and removing the arrows.
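A minimal sketch of this moralization step using the example network's parent sets (the dictionary encoding is mine; the DAG is the one from slide 31: S → B, S → L, B,L → F, L → X).

```python
parents = {"S": [], "B": ["S"], "L": ["S"], "F": ["B", "L"], "X": ["L"]}

def moralize(parents):
    """Return the undirected moral graph as a set of edges (frozensets of two nodes)."""
    edges = set()
    for child, pas in parents.items():
        for p in pas:                        # keep every original arc, now undirected
            edges.add(frozenset((p, child)))
        for i, p1 in enumerate(pas):         # "marry" every pair of parents
            for p2 in pas[i + 1:]:
                edges.add(frozenset((p1, p2)))
    return edges

print(sorted(tuple(sorted(e)) for e in moralize(parents)))
# adds the edge (B, L) because B and L are both parents of F
```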

63 Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph. An undirected graph is triangulated if every cycle of length greater than 3 possesses a chord.

64 Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset of nodes which is complete (i.e. there is an edge between every pair of nodes) and maximal. In the example, the cliques are {B,S,L}, {B,L,F}, and {L,X}.

65 Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be ordered (C1, C2, …, Ck) so that they possess the running intersection property: for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ … ∪ Cj-1) ⊆ Ci. To build the junction tree, choose one such i for each j and add an edge between Cj and Ci.
In the example, the junction tree is BSL — BLF — LX, with separators BL (between BSL and BLF) and L (between BLF and LX).

66 Potential Initialization
To initialize the potential functions:
1. Set all potentials to unity.
2. For each variable Xi, select one node in the junction tree (i.e. one clique) containing both that variable and its parents pa(Xi) in the original DAG.
3. Multiply that clique's potential by P(xi | pa(xi)).

67 Potential Representation
The joint probability distribution can now be represented in terms of potential functions ϕ defined on each clique and each separator of the junction tree. The joint distribution is given by the product of the clique potentials divided by the product of the separator potentials:
P(U) = Π_C ϕ_C / Π_S ϕ_S.
The idea is to transform one representation of the joint distribution into another in which, for each clique C, the potential function gives the marginal distribution of the variables in C, i.e. ϕ_C = P(C). This will also apply to the separators S.

68 Triangulation
Given a numbered graph, proceed from node n down to node 1. At each node, determine the lower-numbered nodes that are adjacent to the current node, including those that may have been made adjacent to it earlier in this algorithm, and connect these nodes to each other.

69 Triangulation: numbering the nodes
Option 1: arbitrarily number the nodes.
Option 2: maximum cardinality search — give any node the number 1; for each subsequent number, pick a new unnumbered node that neighbours the most already-numbered nodes.
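A minimal sketch of maximum cardinality search; the neighbour-set encoding and the small example graph (the moralized graph from slides 62–64) are my own.

```python
def max_cardinality_search(neighbours):
    """Order the nodes: start anywhere, then repeatedly pick the unnumbered node
    with the most already-numbered neighbours."""
    ordered = []
    remaining = set(neighbours)
    while remaining:
        best = max(remaining,
                   key=lambda n: sum(1 for m in neighbours[n] if m in ordered))
        ordered.append(best)
        remaining.remove(best)
    return ordered

neighbours = {          # moralized example graph: S-B, S-L, B-L, B-F, L-F, L-X
    "S": {"B", "L"},
    "B": {"S", "L", "F"},
    "L": {"S", "B", "F", "X"},
    "F": {"B", "L"},
    "X": {"L"},
}
print(max_cardinality_search(neighbours))
```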

70 Triangulation (figure: the BN and its moralized graph)

71 Triangulation (figure: the graph with an arbitrary numbering of its eight nodes)

72 Triangulation (figure: the graph numbered by maximum cardinality search)

73 Course Contents
Concepts in Probability, Bayesian Networks, Inference, Decision making, Learning networks from data, Reasoning over time, Applications

74 Decision making
A decision is an irrevocable allocation of domain resources. Decisions should be made so as to maximize expected utility. View decision making in terms of beliefs/uncertainties, alternatives/decisions, and objectives/utilities.
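A tiny sketch of "maximize expected utility"; the outcome probabilities and utility values below are invented for illustration, not from the slides.

```python
beliefs = {  # P(outcome | decision)
    "operate":    {"cured": 0.7, "complication": 0.2, "no_change": 0.1},
    "medicate":   {"cured": 0.4, "complication": 0.05, "no_change": 0.55},
    "do_nothing": {"cured": 0.1, "complication": 0.0, "no_change": 0.9},
}
utility = {"cured": 100.0, "complication": -50.0, "no_change": 0.0}

# Expected utility of each alternative: EU(d) = sum over outcomes of P(o|d) * U(o)
expected_utility = {d: sum(p * utility[o] for o, p in outcomes.items())
                    for d, outcomes in beliefs.items()}
best = max(expected_utility, key=expected_utility.get)
print(expected_utility, "->", best)
```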

75 Course Contents Concepts in Probability Bayesian Networks Inference
Decision making Learning networks from data Reasoning over time Applications

76 Learning networks from data
The learning task Parameter learning Fully observable Partially observable Structure learning Hidden variables

77 The learning task
Input: training data (cases over Burglary, Earthquake, Alarm, Call, Newscast). Output: a BN modeling the data.
Questions: is the input fully or partially observable data cases? Is the output parameters only, or also structure?

78 Parameter learning: one variable
Unfamiliar coin: let θ = bias of the coin (long-run fraction of heads).
If θ is known (given), then P(X = heads | θ) = θ.
Different coin tosses are independent given θ:
P(X1, …, Xn | θ) = θ^h (1-θ)^t, for h heads and t tails.

79 Maximum likelihood
Input: a set of previous coin tosses X1, …, Xn = {H, T, H, H, H, T, T, H, …, H} with h heads and t tails.
Goal: estimate θ.
The likelihood is P(X1, …, Xn | θ) = θ^h (1-θ)^t, and the maximum likelihood solution is θ* = h / (h + t).
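A quick sketch checking the closed-form estimate θ* = h / (h + t) against a brute-force maximization of the likelihood; the toss data are made up.

```python
import numpy as np

tosses = list("HTHHHTTH") + ["H"]        # illustrative data
h, t = tosses.count("H"), tosses.count("T")

theta_star = h / (h + t)                 # closed-form MLE from the slide

thetas = np.linspace(0.001, 0.999, 999)  # brute-force scan of the likelihood
likelihood = thetas**h * (1 - thetas)**t
print(theta_star, thetas[np.argmax(likelihood)])   # both are about h / (h + t)
```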

80 Conditioning on data D
Prior P(θ); posterior P(θ | D) ∝ P(θ) P(D | θ) = P(θ) θ^h (1-θ)^t, for h heads and t tails (the figure shows the update after 1 head and 1 tail).

81 Conditioning on data
A good parameter distribution: the Beta distribution.*
* The Dirichlet distribution generalizes the Beta to non-binary variables.

82 General parameter learning
A multi-variable BN is composed of several independent parameters ("coins"). For the network A → B there are three parameters: θ_A, θ_B|a, θ_B|¬a. We can use the same techniques as in the one-variable case to learn each one separately. The maximum likelihood estimate of θ_B|a is
θ*_B|a = (#data cases with b, a) / (#data cases with a).

83 Partially observable data
Some data cases have missing values, e.g. (b, ?, a, c, ?) over Burglary, Earthquake, Alarm, Call, Newscast.
Fill in missing data with its "expected" value, where expected = a distribution over the possible values, using a "best guess" BN to estimate that distribution.

84 Intuition
In the fully observable case:
θ*_n|e = (#data cases with n, e) / (#data cases with e) = Σ_j I(n, e | dj) / Σ_j I(e | dj),
where I(e | dj) = 1 if E=e in data case dj and 0 otherwise.
In the partially observable case I is unknown; the best estimate for I is its expected value under the current parameters. Problem: θ* is unknown.

85 Expectation Maximization (EM)
Repeat until convergence:
Expectation (E) step: use the current parameters θ to estimate the filled-in data.
Maximization (M) step: use the filled-in data to do maximum likelihood estimation, and set θ to the new estimate.
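The slides apply EM to BN parameters with missing values; the sketch below illustrates the same E/M mechanics on a classic toy problem of my choosing: two coins with unknown biases, where each batch of flips comes from a hidden, unobserved coin. Data and initial guesses are made up.

```python
import numpy as np

heads = np.array([9, 8, 7, 2, 1, 3, 8, 2, 9, 1])   # heads out of n flips per batch
n = 10
theta_a, theta_b = 0.6, 0.5                        # initial guesses for the two biases

for _ in range(50):
    # E step: posterior responsibility that each batch came from coin A
    # (assuming the two coins are a priori equally likely).
    lik_a = theta_a**heads * (1 - theta_a)**(n - heads)
    lik_b = theta_b**heads * (1 - theta_b)**(n - heads)
    w_a = lik_a / (lik_a + lik_b)
    w_b = 1.0 - w_a
    # M step: weighted maximum likelihood estimates of each bias.
    theta_a = np.sum(w_a * heads) / np.sum(w_a * n)
    theta_b = np.sum(w_b * heads) / np.sum(w_b * n)

print(theta_a, theta_b)   # converges near the two clusters of head rates (~0.82 and ~0.18)
```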

86 Structure learning Goal: find “good” BN structure (relative to data) Solution: do heuristic search over space of network structures.

87 Search space Space = network structures
Operators = add/reverse/delete edges

88 Heuristic search
Use a scoring function to do heuristic search (any algorithm). Greedy hill-climbing with randomness works pretty well.

89 Scoring
Fill in parameters using the previous techniques and score the completed networks. One possibility for the score is the likelihood function: Score(B) = P(data | B).
Example: X, Y are independent coin tosses; typical data = (27 h-h, 22 h-t, 25 t-h, 26 t-t). The maximum likelihood network structure connects X and Y: the maximum likelihood network is typically fully connected. This is not surprising: maximum likelihood always overfits.

90 Better scoring functions
MDL formulation: balance fit to data against model complexity (# of parameters): Score(B) = log P(data | B) − model complexity.
Full Bayesian formulation: put a prior on network structures and parameters; more parameters → a higher-dimensional space, so the balancing effect is obtained as a byproduct.*
* With a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.

91 Hidden variables
There may be interesting variables that we never get to observe: the topic of a document in information retrieval; the user's current task in an online help system. Our learning algorithm should hypothesize the existence of such variables and learn an appropriate state space for them.

92 (figure: data over E1, E2, E3 — randomly scattered data)

93 (figure: data over E1, E2, E3 — actual data)

94 Bayesian clustering (Autoclass)
A naïve Bayes model over E1, E2, …, En with a (hypothetical) class variable that is never observed. If we know that there are k classes, just run EM; the learned classes = clusters. Bayesian analysis allows us to choose k, trading off fit to the data against model complexity.

95 (figure: resulting cluster distributions over E1, E2, E3)

96 Detecting hidden variables
Unexpected correlations suggest hidden variables.
Hypothesized model: Cholesterolemia → Test1, Test2, Test3. Data model: the tests are also correlated with each other. "Correct" model: an additional hidden variable, Hypothyroid, also influences the tests.

97 Course Contents Concepts in Probability Bayesian Networks Inference
Decision making Learning networks from data Reasoning over time Applications

98 Reasoning over time
Dynamic Bayesian networks
Hidden Markov models
Decision-theoretic planning: Markov decision problems, structured representation of actions, the qualification problem & the frame problem, causality (and the frame problem revisited)

99 Dynamic environments
State(t) → State(t+1) → State(t+2) → …
Markov property: the past is independent of the future given the current state; a conditional independence assumption; implied by the fact that there are no arcs from t to t+2.

100 Dynamic Bayesian networks
The state is described via random variables, e.g. Velocity(t), Position(t), Weather(t), Drunk(t), replicated for t, t+1, t+2, …

101 Hidden Markov model
An HMM is a simple model for a partially observable stochastic domain: a state transition model State(t) → State(t+1) plus an observation model State(t) → Obs(t).

102 Hidden Markov model
Partially observable stochastic environments (the figure shows example transition probabilities 0.8, 0.15, 0.05):
Mobile robots: states = location, observations = sensor input.
Speech recognition: states = phonemes, observations = acoustic signal.
Biological sequencing: states = protein structure, observations = amino acids.
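A minimal sketch of HMM filtering (the forward algorithm), which maintains a belief over the hidden state and updates it with each observation; the transition matrix, observation model, and observation sequence below are illustrative, not from the slides.

```python
import numpy as np

T = np.array([[0.80, 0.15, 0.05],     # T[i, j] = P(State(t+1)=j | State(t)=i)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
O = np.array([[0.9, 0.1],             # O[i, k] = P(Obs=k | State=i)
              [0.5, 0.5],
              [0.1, 0.9]])
belief = np.array([1/3, 1/3, 1/3])    # prior over the 3 hidden states

for obs in [0, 0, 1, 1]:              # a short observation sequence
    belief = belief @ T               # predict: push the belief through the transition model
    belief = belief * O[:, obs]       # update: weight by the observation likelihood
    belief = belief / belief.sum()    # renormalize to get P(State(t) | obs_1..t)
    print(belief)
```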

103 Acting under uncertainty
Markov Decision Problem (MDP): an action model in which the agent observes the state; Action(t) influences State(t+1), and each state yields a Reward(t).
Overall utility = sum of momentary rewards. This allows a rich preference model, e.g. rewards corresponding to "get to goal asap": one reward value for goal states and another for all other states.
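The slides define MDPs but do not show how to solve one; a standard method is value iteration, sketched below on a made-up 4-state chain in the "get to goal asap" spirit (reward 0 at the goal, -1 everywhere else). Everything here is illustrative, not from the slides.

```python
import numpy as np

n_states, gamma = 4, 0.95
actions = ["left", "right"]

def transition(s, a):
    """Deterministic toy dynamics: move along the chain; the goal state 3 is absorbing."""
    if s == 3:
        return 3
    return max(s - 1, 0) if a == "left" else min(s + 1, 3)

reward = lambda s: 0.0 if s == 3 else -1.0

V = np.zeros(n_states)
for _ in range(100):                                  # value iteration sweeps
    V = np.array([reward(s) + gamma * max(V[transition(s, a)] for a in actions)
                  for s in range(n_states)])

policy = [max(actions, key=lambda a: V[transition(s, a)]) for s in range(n_states)]
print(V, policy)   # from every non-goal state the policy moves right, toward the goal
```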

104 Partially observable MDPs
The agent observes Obs(t), not the state; Obs depends on the state. The optimal action at time t depends on the entire history of previous observations; instead, a distribution over State(t) suffices.

105 Structured representation
Actions (e.g. Move, Turn) are represented over state variables Position(t), Holding(t), Direction(t) and their successors at t+1, with preconditions and effects. A probabilistic action model allows for exceptions and qualifications; persistence arcs are a solution to the frame problem.

106 Applications
Medical expert systems: Pathfinder, Parenting MSN
Fault diagnosis: Ricoh FIXIT, decision-theoretic troubleshooting
Vista
Collaborative filtering

