
1 Knowledge Representation & Reasoning Lecture #5 UIUC CS 498: Section EA Professor: Eyal Amir Fall Semester 2005 (Based on slides by Lise Getoor and Alvaro Cardenas (UMD) (in turn based on slides by Nir Friedman (Hebrew U)))

2 So Far and Today
1. Probabilistic graphical models: Bayes networks (directed GMs) and Markov fields (undirected GMs)
2. Treewidth methods: variable elimination and the clique tree algorithm
3. Applications du jour: Sensor Networks

3 Markov Assumption We now make this independence assumption more precise for directed acyclic graphs (DAGs). Each random variable X is independent of its non-descendants, given its parents Pa(X). Formally, I(X, NonDesc(X) | Pa(X)). (Figure: a node X with parents Y1 and Y2; its ancestors, parents, descendants, and non-descendants are marked.)

4 Markov Assumption Example In this example:
– I(E, B)
– I(B, {E, R})
– I(R, {A, B, C} | E)
– I(A, R | B, E)
– I(C, {B, E, R} | A)
(Graph: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call.)

5 I-Maps A DAG G is an I-Map of a distribution P if all Markov assumptions implied by G are satisfied by P (assuming G and P both use the same set of random variables). (Figure: example two-node DAGs over X and Y.)

6 Factorization Theorem
Thm: if G is an I-Map of P, then P(X1,…,Xp) = Πi P(Xi | Pa(Xi))
Proof: wlog. X1,…,Xp is an ordering consistent with G.
By the chain rule: P(X1,…,Xp) = Πi P(Xi | X1,…,Xi-1)
Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi)), and {X1,…,Xi-1} ⊆ Pa(Xi) ∪ NonDesc(Xi)
We conclude P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Hence, P(X1,…,Xp) = Πi P(Xi | Pa(Xi))

7 Factorization Example For the alarm network (Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call): the general chain rule gives P(C,A,R,E,B) = P(B) P(E|B) P(R|E,B) P(A|R,B,E) P(C|A,R,B,E) versus the factored form P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
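To make the saving concrete, here is a minimal Python sketch of the factored joint for this alarm network. Only the factorization P(B)P(E)P(R|E)P(A|B,E)P(C|A) comes from the slide; the CPT numbers are made up for illustration.

```python
# Factored joint for the alarm network; CPT values are illustrative only.
P_B = {True: 0.01, False: 0.99}                    # P(B)
P_E = {True: 0.02, False: 0.98}                    # P(E)
P_R_given_E = {True: {True: 0.9, False: 0.1},      # P(R | E), outer key is E
               False: {True: 0.0001, False: 0.9999}}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,   # P(A=True | B, E)
                (False, True): 0.29, (False, False): 0.001}
P_C_given_A = {True: 0.7, False: 0.01}             # P(C=True | A)

def joint(c, a, r, e, b):
    """P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)."""
    p_a = P_A_given_BE[(b, e)] if a else 1 - P_A_given_BE[(b, e)]
    p_c = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    return P_B[b] * P_E[e] * P_R_given_E[e][r] * p_a * p_c

# 1 + 1 + 2 + 4 + 2 = 10 CPT parameters instead of 2**5 - 1 = 31 for the full joint.
print(joint(c=True, a=True, r=False, e=False, b=True))
```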

8 Consequences We can write P in terms of “local” conditional probabilities. If G is sparse – that is, |Pa(Xi)| < k – then:
– each conditional probability can be specified compactly; e.g. for binary variables, each requires O(2^k) params
– the representation of P is compact: linear in the number of variables

9 Summary We defined the following concepts:
– The Markov independencies of a DAG G: I(Xi, NonDesc(Xi) | Pa(Xi))
– G is an I-Map of a distribution P: P satisfies the Markov independencies implied by G
We proved the factorization theorem: if G is an I-Map of P, then P(X1,…,Xp) = Πi P(Xi | Pa(Xi))

10 Conditional Independencies Let Markov(G) be the set of Markov independencies implied by G.
The factorization theorem shows: G is an I-Map of P ⇒ P(X1,…,Xp) = Πi P(Xi | Pa(Xi))
We can also show the opposite:
Thm: P(X1,…,Xp) = Πi P(Xi | Pa(Xi)) ⇒ G is an I-Map of P

11 Proof (Outline) (Example worked on a three-node graph over X, Y, Z.)

12 Implied Independencies Does a graph G imply additional independencies as a consequence of Markov(G)? We can define a logic of independence statements. Some axioms:
– I(X; Y | Z) ⇒ I(Y; X | Z)
– I(X; Y1, Y2 | Z) ⇒ I(X; Y1 | Z)

13 d-separation A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no. Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)

14 Paths Intuition: dependency must “flow” along paths in the graph. A path is a sequence of neighboring variables. Examples in the alarm network: R ← E → A ← B and C ← A ← E → R

15 Paths We want to know when a path is
– active -- creates dependency between end nodes
– blocked -- cannot create dependency between end nodes
We want to classify situations in which paths are active.

16 Path Blockage Three cases:
– Common cause (e.g., R ← E → A): blocked when the common cause E is given; active otherwise

17 Path Blockage Three cases:
– Common cause
– Intermediate cause (e.g., E → A → C): blocked when the intermediate node A is given; active otherwise

18 Path Blockage Three cases:
– Common cause
– Intermediate cause
– Common effect (e.g., B → A ← E, with descendant C): active when A or one of its descendants (e.g., C) is given; blocked otherwise

19 Path Blockage -- General Case A path is active, given evidence Z, if
– whenever we have a common-effect configuration A → B ← C on the path, B or one of its descendants is in Z
– no other node on the path is in Z
A path is blocked, given evidence Z, if it is not active.

20 Example (alarm network)
– d-sep(R,B)?

21 Example (cont.)
– d-sep(R,B) = yes
– d-sep(R,B|A)?

22 Example (cont.)
– d-sep(R,B) = yes
– d-sep(R,B|A) = no
– d-sep(R,B|E,A)?

23 d-Separation X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z. Checking d-separation can be done efficiently (linear time in the number of edges)
– Bottom-up phase: mark all nodes whose descendants are in Z
– X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check if they are blocked
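For small graphs, d-separation can also be checked straight from the path-blockage definition by enumerating undirected paths. The sketch below does exactly that (brute force, not the linear-time marking algorithm above) on the alarm network, and reproduces the answers from the example slides.

```python
# Brute-force d-separation check on a small DAG (child -> set of parents).
dag = {'A': {'B', 'E'}, 'R': {'E'}, 'C': {'A'}, 'B': set(), 'E': set()}

def children(g, v):
    return {c for c, ps in g.items() if v in ps}

def descendants(g, v):
    out, stack = set(), [v]
    while stack:
        for c in children(g, stack.pop()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def undirected_paths(g, x, y, path=None):
    path = path or [x]
    if path[-1] == y:
        yield path
        return
    for n in (g[path[-1]] | children(g, path[-1])) - set(path):
        yield from undirected_paths(g, x, y, path + [n])

def path_active(g, path, z):
    for i in range(1, len(path) - 1):
        prev, v, nxt = path[i - 1], path[i], path[i + 1]
        if prev in g[v] and nxt in g[v]:          # collider: both edges point into v
            if v not in z and not (descendants(g, v) & z):
                return False                      # inactive collider blocks the path
        elif v in z:                              # chain or fork blocked by evidence
            return False
    return True

def d_separated(g, x, y, z):
    return not any(path_active(g, p, set(z)) for p in undirected_paths(g, x, y))

print(d_separated(dag, 'R', 'B', set()))        # True:  d-sep(R,B) = yes
print(d_separated(dag, 'R', 'B', {'A'}))        # False: d-sep(R,B|A) = no
print(d_separated(dag, 'R', 'B', {'E', 'A'}))   # True
```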

24 Soundness Thm: If G is an I-Map of P and d-sep(X; Y | Z, G) = yes, then P satisfies I(X; Y | Z). Informally: any independence reported by d-separation is satisfied by the underlying distribution

25 Completeness Thm: If d-sep(X; Y | Z, G) = no, then there is a distribution P such that G is an I-Map of P and P does not satisfy I(X; Y | Z). Informally: any independence not reported by d-separation might be violated by the underlying distribution; we cannot determine this by examining the graph structure alone

26 Summary: Structure We explored DAGs as a representation of conditional independencies: –Markov independencies of a DAG –Tight correspondence between Markov(G) and the factorization defined by G –d-separation, a sound & complete procedure for computing the consequences of the independencies –Notion of minimal I-Map –P-Maps This theory is the basis for defining Bayesian networks

27 Complexity of variable elimination Suppose in one elimination step we compute fX(y1,…,yk) = Σx Πi=1..m fi(x, zi), where each zi ⊆ {y1,…,yk}. This requires
– m · |Val(X)| · Πj |Val(Yj)| multiplications: for each value of x, y1,…,yk, we do m multiplications
– |Val(X)| · Πj |Val(Yj)| additions: for each value of y1,…,yk, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor
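A small Python sketch of a single elimination step, with placeholder factors over binary variables, makes the multiplication and addition counts above concrete.

```python
# One variable-elimination step: multiply all factors that mention x, sum x out.
# Factors are (variables, table) pairs; table keys are assignments in variable order.
from itertools import product

def eliminate(x, factors, domain=(0, 1)):
    involved = [f for f in factors if x in f[0]]
    rest = [f for f in factors if x not in f[0]]
    new_vars = sorted({v for vars_, _ in involved for v in vars_} - {x})
    table = {}
    for assignment in product(domain, repeat=len(new_vars)):
        env = dict(zip(new_vars, assignment))
        total = 0.0
        for xval in domain:                      # |Val(X)| additions per assignment
            env[x] = xval
            prod_val = 1.0
            for vars_, tab in involved:          # m multiplications per value of x
                prod_val *= tab[tuple(env[v] for v in vars_)]
            total += prod_val
        table[assignment] = total
    return rest + [(tuple(new_vars), table)]

# Example: eliminate 'a' from two illustrative factors f1(a, b) and f2(a, c).
f1 = (('a', 'b'), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4})
f2 = (('a', 'c'), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1})
print(eliminate('a', [f1, f2]))                  # a single factor over ('b', 'c')
```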

28 Undirected graph representation At each stage of the procedure, we have an algebraic term that we need to evaluate. In general this term is of the form Σ_{x1,…,xl} Πi fi(Zi), where the Zi are sets of variables. We now draw a graph with an undirected edge X--Y if X, Y are arguments of some factor, that is, if X, Y are in some Zi. Note: this is the Markov network that describes the probability on the variables we did not eliminate yet

29 Chordal Graphs elimination ordering ⇒ undirected chordal graph
– Maximal cliques are factors in elimination
– Factors in elimination are cliques in the graph
– Complexity is exponential in the size of the largest clique in the graph
(Figure: a DAG over V, S, T, L, A, B, X, D and the chordal graph produced by elimination.)

30 Induced Width The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination This quantity is called the induced width of a graph according to the specified ordering Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
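A quick Python sketch of computing the induced width of an undirected graph under a given elimination ordering, following the slide's reading (size of the largest clique created during elimination; some texts report this value minus one). The graph here is the moralized alarm network and the ordering is illustrative.

```python
# Induced width under an elimination ordering: eliminate nodes in order,
# connect each node's remaining neighbors (fill-in), track the largest clique.
def induced_width(adj, ordering):
    adj = {v: set(ns) for v, ns in adj.items()}      # work on a copy
    width = 0
    for v in ordering:
        nbrs = adj[v]
        width = max(width, len(nbrs) + 1)            # clique on v and its neighbors
        for a in nbrs:
            adj[a] |= nbrs - {a}                     # fill-in edges
            adj[a].discard(v)
        del adj[v]
    return width

# Moralized alarm network: triangle B-E-A plus edges E-R and A-C.
graph = {'B': {'A', 'E'}, 'E': {'A', 'B', 'R'}, 'A': {'B', 'E', 'C'},
         'R': {'E'}, 'C': {'A'}}
print(induced_width(graph, ['C', 'R', 'B', 'E', 'A']))   # 3: largest clique {A, B, E}
```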

31 PolyTrees A polytree is a network where there is at most one path from one variable to another. Thm: Inference in a polytree is linear in the representation size of the network (this assumes a tabular CPT representation). (Figure: an example polytree over nodes A–H.)

32 Today
1. Probabilistic graphical models
2. Treewidth methods: variable elimination and the clique tree algorithm
3. Applications du jour: Sensor Networks

33 Junction Tree Why junction tree?
– More efficient for some tasks than variable elimination
– We can avoid cycles if we turn highly-interconnected subsets of the nodes into “supernodes” ⇒ clusters
Objective: compute P(V = v | E = e), where v is a value of a variable V and e is the evidence for a set of variables E

34 Properties of Junction Tree An undirected tree; each node is a cluster (nonempty set) of variables. Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y. Separator sets (sepsets): the intersection of adjacent clusters. (Example: the chain ABD – ADE – DEF, with sepset AD between clusters ABD and ADE, and sepset DE between ADE and DEF.)
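A tiny Python sketch that checks the running intersection property on the ABD – ADE – DEF chain from the figure; the cluster tree is encoded as a plain adjacency list.

```python
# Check the running intersection property of a cluster tree.
from itertools import combinations

clusters = {'ABD': set('ABD'), 'ADE': set('ADE'), 'DEF': set('DEF')}
tree = {'ABD': ['ADE'], 'ADE': ['ABD', 'DEF'], 'DEF': ['ADE']}   # tree adjacency

def path(tree, src, dst, seen=None):
    seen = (seen or set()) | {src}
    if src == dst:
        return [src]
    for n in tree[src]:
        if n not in seen:
            rest = path(tree, n, dst, seen)
            if rest:
                return [src] + rest
    return None

def has_running_intersection(clusters, tree):
    for a, b in combinations(clusters, 2):
        common = clusters[a] & clusters[b]
        # every cluster on the a-b path must contain the intersection
        if not all(common <= clusters[c] for c in path(tree, a, b)):
            return False
    return True

print(has_running_intersection(clusters, tree))   # True for this chain
```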

35 Potentials Potentials: denoted by φX for a set of variables X; each instantiation of X is mapped to a nonnegative real number.
– Marginalization: φX = Σ_{Y\X} φY, the marginalization of φY into X (for X ⊆ Y)
– Multiplication: φXY = φX φY, the multiplication of φX and φY

36 Properties of Junction Tree Belief potentials: map each instantiation of a cluster or sepset into a real number. Constraints:
– Consistency: for each cluster X and neighboring sepset S, Σ_{X\S} φX = φS
– The joint distribution: P(U) = Π_clusters φX / Π_sepsets φS

37 Properties of Junction Tree If a junction tree satisfies these properties, it follows that:
– For each cluster (or sepset) X, φX = P(X)
– The probability distribution of any variable V can be computed from any cluster (or sepset) X that contains V: P(V) = Σ_{X\{V}} φX

38 Building Junction Trees DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree

39 Constructing the Moral Graph (Figure: the example DAG over nodes A–H.)

40 Constructing The Moral Graph Add undirected edges between all co-parents which are not currently joined (“marrying parents”). (Figure: the DAG with the added co-parent edges.)

41 Constructing The Moral Graph Add undirected edges between all co-parents which are not currently joined (“marrying parents”), then drop the directions of the arcs. (Figure: the resulting moral graph over nodes A–H.)
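A short Python sketch of moralization, as shown in the figure above. The parent sets are read off the slides' 8-node example DAG, so treat them as illustrative.

```python
# Moralization: marry all co-parents, then drop arc directions.
from itertools import combinations

parents = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B'], 'E': ['C'],
           'F': ['D', 'E'], 'G': ['C'], 'H': ['E', 'G']}   # child -> parents (assumed)

def moralize(parents):
    edges = set()
    for child, ps in parents.items():
        for p in ps:                         # keep every arc, direction dropped
            edges.add(frozenset((p, child)))
        for p, q in combinations(ps, 2):     # marry co-parents
            edges.add(frozenset((p, q)))
    return edges

moral = moralize(parents)
print(sorted(tuple(sorted(e)) for e in moral))   # includes the new D-E and E-G edges
```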

42 Triangulating An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes. (Figure: the moral graph with fill-in edges added so that it is triangulated.)

43 Identifying Cliques A clique is a subgraph of an undirected graph that is complete and maximal. In the triangulated example graph over A–H, the cliques are ABD, ADE, ACE, DEF, CEG, and EGH.

44 Junction Tree A junction tree is a subgraph of the clique graph that
– is a tree
– contains all the cliques
– satisfies the running intersection property
For the example: cluster ADE is joined to ABD via sepset AD, to ACE via AE, and to DEF via DE; ACE is joined to CEG via CE; CEG is joined to EGH via EG.
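One standard way to pick such a tree (not necessarily the construction used in the original slides) is a maximum-weight spanning tree of the clique graph, weighting each candidate edge by the size of the sepset. A Python sketch over the cliques identified above:

```python
# Junction tree as a maximum-weight spanning tree of the clique graph,
# with edge weight = |sepset| (Kruskal with a simple union-find).
from itertools import combinations

cliques = [set('ABD'), set('ADE'), set('ACE'), set('CEG'), set('DEF'), set('EGH')]

def junction_tree(cliques):
    candidates = sorted(((len(cliques[i] & cliques[j]), i, j)
                         for i, j in combinations(range(len(cliques)), 2)),
                        reverse=True)
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    edges = []
    for w, i, j in candidates:
        if w > 0 and find(i) != find(j):
            parent[find(i)] = find(j)
            edges.append((cliques[i], cliques[j], cliques[i] & cliques[j]))
    return edges

for ci, cj, sep in junction_tree(cliques):
    print(sorted(ci), '--', sorted(sep), '--', sorted(cj))   # sepsets AD, AE, CE, DE, EG
```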

45 Principle of Inference DAG → Junction Tree → Initialization → Inconsistent Junction Tree → Propagation → Consistent Junction Tree → Marginalization → P(V | E)

46 Example: Create Join Tree HMM with 2 time steps: X1 → X2, X1 → Y1, X2 → Y2. Junction tree: clusters (X1,Y1), (X1,X2), (X2,Y2), with sepset X1 between (X1,Y1) and (X1,X2), and sepset X2 between (X1,X2) and (X2,Y2).

47 Example: Initialization
Variable   Associated cluster   Potential function
X1         (X1,Y1)              P(X1)
Y1         (X1,Y1)              P(Y1|X1)
X2         (X1,X2)              P(X2|X1)
Y2         (X2,Y2)              P(Y2|X2)
So φ(X1,Y1) = P(X1) P(Y1|X1), φ(X1,X2) = P(X2|X1), φ(X2,Y2) = P(Y2|X2).

48 Example: Collect Evidence Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected. Call recursively neighboring cliques for messages:
1. Call (X1,Y1):
– Projection: φ(X1) = Σ_Y1 φ(X1,Y1)
– Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X1) (dividing by the old sepset potential, which is initialized to 1)

49 Example: Collect Evidence (cont.)
2. Call (X2,Y2):
– Projection: φ(X2) = Σ_Y2 φ(X2,Y2)
– Absorption: φ(X1,X2) ← φ(X1,X2) · φ(X2)

50 Example: Distribute Evidence Pass messages recursively to neighboring nodes. Pass message from (X1,X2) to (X1,Y1):
– Projection: φ_new(X1) = Σ_X2 φ(X1,X2)
– Absorption: φ(X1,Y1) ← φ(X1,Y1) · φ_new(X1) / φ(X1)

51 Example: Distribute Evidence (cont.) Pass message from (X1,X2) to (X2,Y2):
– Projection: φ_new(X2) = Σ_X1 φ(X1,X2)
– Absorption: φ(X2,Y2) ← φ(X2,Y2) · φ_new(X2) / φ(X2)

52 Example: Inference with evidence Assume we want to compute P(X2 | Y1=0, Y2=1) (state estimation). Assign likelihoods to the potential functions during initialization: multiply φ(X1,Y1) by λ(Y1) with λ(Y1=0)=1, λ(Y1=1)=0, and φ(X2,Y2) by λ(Y2) with λ(Y2=0)=0, λ(Y2=1)=1, i.e., zero out the entries that disagree with the evidence.

53 Example: Inference with evidence (cont.) Repeating the same steps as in the previous case, we obtain φ(X2) ∝ P(X2, Y1=0, Y2=1); normalizing gives P(X2 | Y1=0, Y2=1).
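A numeric Python sketch of this two-step example with made-up CPTs. Since the query variable X2 lives in the root cluster (X1,X2), collecting evidence into that cluster already answers the query, so the distribute pass is omitted here.

```python
import numpy as np

# Hypothetical CPTs for binary X1, X2, Y1, Y2 (the slides leave the numbers out).
pX1    = np.array([0.6, 0.4])                   # P(X1)
pY1_X1 = np.array([[0.8, 0.2], [0.3, 0.7]])     # P(Y1 | X1), rows indexed by X1
pX2_X1 = np.array([[0.9, 0.1], [0.2, 0.8]])     # P(X2 | X1)
pY2_X2 = np.array([[0.8, 0.2], [0.3, 0.7]])     # P(Y2 | X2)

# Initialization, with evidence Y1=0, Y2=1 entered as 0/1 likelihoods.
phi_X1Y1 = pX1[:, None] * pY1_X1 * np.array([1.0, 0.0])[None, :]
phi_X1X2 = pX2_X1.copy()
phi_X2Y2 = pY2_X2 * np.array([0.0, 1.0])[None, :]

# Collect evidence into the root cluster (X1, X2).
msg_X1 = phi_X1Y1.sum(axis=1)        # projection onto sepset X1
phi_X1X2 *= msg_X1[:, None]          # absorption
msg_X2 = phi_X2Y2.sum(axis=1)        # projection onto sepset X2
phi_X1X2 *= msg_X2[None, :]          # absorption

# phi_X1X2 now holds P(X1, X2, Y1=0, Y2=1); marginalize X1 and normalize.
pX2_post = phi_X1X2.sum(axis=0)
pX2_post /= pX2_post.sum()
print(pX2_post)                      # P(X2 | Y1=0, Y2=1)
```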

54 Next Time Inference with Propositional Logic Later in the semester: (a) Approximate Probabilistic Inference via sampling: Gibbs, Priority, MCMC (b) Approximate Probabilistic Inference using a close, simpler distribution

55 THE END

56 Example: Naïve Bayesian Model A common model in early diagnosis: symptoms are conditionally independent given the disease (or fault). Thus, if X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and H denotes the hypothesis about the patient's health, then P(X1,…,Xp,H) = P(H) P(X1|H) … P(Xp|H). This naïve Bayesian model allows a compact representation, though it does embody strong independence assumptions.
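A minimal Python sketch of diagnosis with this model; the two symptoms and all numbers are made up for illustration.

```python
# Naive Bayes: P(X1,...,Xp,H) = P(H) * prod_i P(Xi | H); posterior by normalization.
p_H = {'sick': 0.1, 'healthy': 0.9}
p_X_given_H = {                         # P(symptom present | H)
    'headache':   {'sick': 0.8, 'healthy': 0.1},
    'high_fever': {'sick': 0.7, 'healthy': 0.05},
}

def posterior(symptoms):
    """P(H | x1,...,xp) obtained by normalizing P(H) * prod_i P(xi | H)."""
    scores = {}
    for h, prior in p_H.items():
        score = prior
        for name, present in symptoms.items():
            p = p_X_given_H[name][h]
            score *= p if present else (1 - p)
        scores[h] = score
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

print(posterior({'headache': True, 'high_fever': False}))
```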

57 Elimination on Trees Formally, for any tree, there is an elimination ordering with induced width = 1 Thm Inference on trees is linear in number of variables

