# Knowledge Representation and Reasoning University "Politehnica" of Bucharest Department of Computer Science Fall 2010 Adina Magda Florea


Knowledge Representation and Reasoning. University "Politehnica" of Bucharest, Department of Computer Science, Fall 2010. Adina Magda Florea. http://turing.cs.pub.ro/krr_10, curs.cs.pub.ro. Master of Science in Artificial Intelligence, 2010-2012.

## Lecture 11: Uncertain representation of knowledge

Lecture outline:

- Uncertain knowledge
- Belief networks
- Bayesian prediction

## 1. Uncertain knowledge

Probability theory has two main interpretations:

- Statistical: a measure of the proportion of individuals (the long-run frequency of a set of events).
  - The probability of a bird flying is the proportion of birds that fly out of the set of all birds.
- Personal, subjective or Bayesian: an agent's measure of belief in some proposition, based on the agent's knowledge.
  - The probability of a bird flying is the agent's measure of belief in the flying ability of an individual, based on the knowledge that the individual is a bird.
  - It can be viewed as a measure over all the worlds that are possible given the agent's knowledge about a particular situation (in each possible world, the bird either flies or it does not).

## Bayesian probability

- Both views have the same calculus.
- We talk about the second view.
- We assume uncertainty is epistemological (pertaining to the agent's knowledge about the world) rather than ontological (how the world is).

Semantics of (prior) probability:

- Interpretations are defined on possible worlds.
- They specify not only the truth of formulas but also how likely the real world is as compared to these formulas.
- Modal logics: possible worlds + an accessibility relation.
- Probabilities: possible worlds + a measure on the possible worlds.

## Semantics of probability

- A possible world is an assignment of exactly one value to every random variable.
- Let $W$ be the set of all possible worlds. If $w \in W$ and $f$ is a formula, "$f$ is true in $w$" (written $w \models f$) is defined inductively on the structure of $f$:
  - $w \models x = v$ iff $w$ assigns value $v$ to $x$
  - $w \models \neg f$ iff $w \not\models f$
  - $w \models f \wedge g$ iff $w \models f$ and $w \models g$
  - $w \models f \vee g$ iff $w \models f$ or $w \models g$
- Associated with each possible world is a measure. When there are only a finite number of worlds:
  - $0 \le p(w)$ for all $w \in W$
  - $\sum_{w \in W} p(w) = 1$

## Semantics of probability

- The probability of a formula $f$ is the sum of the measures of the possible worlds in which $f$ is true:

  $$P(f) = \sum_{w \models f} p(w)$$

Semantics of conditional probability:

- A formula $e$ representing the conjunction of all of the agent's observations of the world is called evidence.
- The measure of belief in formula $h$ based on formula $e$ is called the conditional probability of $h$ given $e$, $P(h \mid e)$.
- Evidence $e$ rules out all possible worlds that are incompatible with $e$.

## Semantics of probability

- Evidence $e$ introduces a new measure $p_e$ over possible worlds, in which all worlds where $e$ is false have measure 0 and the remaining worlds are normalized so that their measures sum to 1:

  $$p_e(w) = \begin{cases} p(w) / P(e) & \text{if } w \models e \\ 0 & \text{if } w \not\models e \end{cases}$$

  $$P(h \mid e) = \sum_{w \models h} p_e(w) = \frac{\sum_{w \models h \wedge e} p(w)}{P(e)} = \frac{P(h \wedge e)}{P(e)}$$

- We assume $P(e) > 0$. If $P(e) = 0$ then $e$ is false in all possible worlds and thus cannot be observed.
- Chain rule:

  $$P(f_1 \wedge \dots \wedge f_n) = P(f_1) \times P(f_2 \mid f_1) \times \dots \times P(f_n \mid f_1 \wedge \dots \wedge f_{n-1})$$
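The possible-worlds semantics can be sketched in a few lines of Python. This is only an illustration: the two Boolean variables `bird` and `flies` and the four measures are made-up values, not taken from the lecture, and the helper names `prob` and `cond_prob` are hypothetical.

```python
from itertools import product

# Hypothetical example: two Boolean random variables with made-up measures.
variables = ["bird", "flies"]
# One possible world per assignment: (T,T), (T,F), (F,T), (F,F).
worlds = [dict(zip(variables, values))
          for values in product([True, False], repeat=len(variables))]
# Measure p(w) of each world, in the same order as `worlds`; they sum to 1.
measure = [0.72, 0.08, 0.02, 0.18]


def prob(formula):
    """P(f): sum of the measures of the worlds in which f is true."""
    return sum(p for w, p in zip(worlds, measure) if formula(w))


def cond_prob(h, e):
    """P(h | e) = P(h and e) / P(e); only defined when P(e) > 0."""
    p_e = prob(e)
    if p_e == 0:
        raise ValueError("P(e) must be positive")
    return prob(lambda w: h(w) and e(w)) / p_e


# Belief that an individual flies, given that it is a bird.
p_flies_given_bird = cond_prob(lambda w: w["flies"], lambda w: w["bird"])
```

Formulas are represented as Python predicates over a world, so the satisfaction relation $w \models f$ is simply `formula(w)`.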

## Bayes theorem

- Given the current belief in a proposition $H$ based on evidence $K$, $P(H \mid K)$, we observe $E$:

  $$P(H \mid E \wedge K) = \frac{P(E \mid H \wedge K)\, P(H \mid K)}{P(E \mid K)}$$

- If the background knowledge $K$ is implicit:

  $$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$

## Independence assumptions

- Independence: the knowledge of the truth of one proposition does not affect the belief in another.
- A random variable $X$ is independent of a random variable $Y$ given a random variable $Z$ if, for all values of the random variables (i.e., $a_i$, $b_j$, $c_k$),

  $$P(X = a_i \mid Y = b_j \wedge Z = c_k) = P(X = a_i \mid Z = c_k)$$

- Knowledge of $Y$'s value does not affect the belief in the value of $X$, given the value of $Z$.

## 2. Belief networks

- A BN (Belief Network or Bayesian Network) is a graphical representation of conditional independence.
- It is represented as a Directed Acyclic Graph (DAG).
- The nodes represent random variables.
- The edges represent direct dependence among the variables.
- $X \rightarrow Y$: $X$ has a direct influence on $Y$ (represents a statistical dependence).
- $X = Parent(Y)$ if $X \rightarrow Y$.
- $X = Ancestor(Y)$ if there is a directed path from $X$ to $Y$ ($X \rightarrow \dots \rightarrow Y$).
- $Z = Descendant(Y)$ if $Z = Y$ or there is a directed path from $Y$ to $Z$ ($Y \rightarrow \dots \rightarrow Z$).

## BN

The independence assumption embedded in a BN is:

- Each random variable is independent of its nondescendants given its parents. If $Y_1, \dots, Y_n$ are the parents of $X$, then

  $$P(X = a \mid Y_1 = v_1 \wedge \dots \wedge Y_n = v_n \wedge R) = P(X = a \mid Y_1 = v_1 \wedge \dots \wedge Y_n = v_n)$$

  if $R$ does not involve descendants of $X$ (including $X$ itself).
- The number of probabilities that need to be specified for each variable is exponential in the number of parents of the variable.
- A BN contains a set of conditional probability tables $P(X = a \mid Y_1 = v_1 \wedge \dots \wedge Y_n = v_n)$.

## BN

- Therefore a BN defines a Joint Probability Distribution (JPD) over the variables in the network.
- A value of the JPD can be computed as

  $$P(X_1 = x_1 \wedge \dots \wedge X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid parents(X_i))$$

  where $parents(X_i)$ represents the specific values of $Parents(X_i)$.
- By the chain rule,

  $$P(X_1 = x_1 \wedge \dots \wedge X_n = x_n) = P(x_1, \dots, x_n) = P(x_n \mid x_{n-1}, \dots, x_1)\, P(x_{n-1}, \dots, x_1) = \dots = \prod_{i=1}^{n} P(x_i \mid x_{i-1}, \dots, x_1)$$

- Order of variables in the BN: $P(X_i \mid X_{i-1}, \dots, X_1) = P(X_i \mid Parents(X_i))$ provided that $Parents(X_i) \subseteq \{X_{i-1}, \dots, X_1\}$.

## BN

- $P(X_i \mid X_{i-1}, \dots, X_1) = P(X_i \mid Parents(X_i))$ provided that $Parents(X_i) \subseteq \{X_{i-1}, \dots, X_1\}$.
- A BN is a correct representation of the domain, provided that each node is conditionally independent of its predecessors, given its parents.

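Example network with variables Tampering (T), Fire (F), Alarm (A), Smoke (S), Leaving (L) and Report (R). Each table gives the probability that the variable is true, given the values of its parents.

| Variable | Probability |
|----------|-------------|
| P(T)     | 0.02        |
| P(F)     | 0.01        |

| F | T | P(A)   |
|---|---|--------|
| T | T | 0.5    |
| T | F | 0.99   |
| F | T | 0.85   |
| F | F | 0.0001 |

| A | P(L)  |
|---|-------|
| T | 0.88  |
| F | 0.001 |

| F | P(S) |
|---|------|
| T | 0.9  |
| F | 0.01 |

| L | P(R) |
|---|------|
| T | 0.75 |
| F | 0.01 |

Instead of computing the joint distribution of all the variables by the chain rule,

$$P(T,F,A,S,L,R) = P(T)\, P(F \mid T)\, P(S \mid F,T)\, P(A \mid S,F,T)\, P(L \mid A,S,F,T)\, P(R \mid L,A,S,F,T)$$

the BN defines a unique JPD in a factored form:

$$P(T,F,A,S,L,R) = P(T)\, P(F)\, P(A \mid T,F)\, P(S \mid F)\, P(L \mid A)\, P(R \mid L)$$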

## Inferences

- The probability of a variable given nondescendants can be computed using the "reasoning by cases" rule:

  $$P(L \mid S) = P(L \mid A,S)\, P(A \mid S) + P(L \mid \neg A,S)\, (1 - P(A \mid S)) = P(L \mid A)\, P(A \mid S) + P(L \mid \neg A)\, (1 - P(A \mid S))$$

  $$P(A \mid S) = P(A \mid F,T)\, P(F,T \mid S) + P(A \mid F,\neg T)\, P(F,\neg T \mid S) + P(A \mid \neg F,T)\, P(\neg F,T \mid S) + P(A \mid \neg F,\neg T)\, P(\neg F,\neg T \mid S)$$

- The right-hand factor of each product can be computed using the multiplicative rule:

  $$P(F,T \mid S) = P(F \mid T,S)\, P(T \mid S) = P(F \mid T,S)\, P(T)$$

- For computing $P(F \mid T,S)$ we cannot use the independence assumption, because $S$ is a descendant of $F$; we can use Bayes' rule instead:

  $$P(F \mid T,S) = \frac{P(S \mid F,T)\, P(F \mid T)}{P(S \mid T)} = \frac{P(S \mid F)\, P(F)}{P(S \mid T)}$$
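As a worked instance of the last step, the numbers can be plugged in as follows. This assumes $P(F) = 0.01$, the prior that is consistent with the value $P(Smoke) = 0.0189$ quoted in the lecture, and uses the fact that $S$ depends on $T$ only through $F$, which is independent of $T$, so $P(S \mid T) = P(S)$.

$$
\begin{aligned}
P(S \mid T) &= P(S) = P(S \mid F)\, P(F) + P(S \mid \neg F)\, P(\neg F) = 0.9 \times 0.01 + 0.01 \times 0.99 = 0.0189 \\
P(F \mid T,S) &= \frac{P(S \mid F)\, P(F)}{P(S \mid T)} = \frac{0.9 \times 0.01}{0.0189} \approx 0.476
\end{aligned}
$$

The result coincides with $P(Fire \mid Smoke)$, as expected when Tampering carries no information about Fire once Smoke is the only other evidence.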

## Inferences

- The prior probabilities (with no evidence) of each variable are:
  - P(Tampering) = 0.02
  - P(Fire) = 0.01
  - P(Report) = 0.028
  - P(Smoke) = 0.0189
- Observing the Report gives:
  - P(Tampering|Report) = 0.399
  - P(Fire|Report) = 0.2305
  - P(Smoke|Report) = 0.215
- The probabilities of both Tampering and Fire are increased by the Report.
- Because the probability of Fire is increased, so is the probability of Smoke.

## Inferences

- Suppose instead that Smoke was observed:
  - P(Tampering|Smoke) = 0.02
  - P(Fire|Smoke) = 0.476
  - P(Report|Smoke) = 0.320
- Note that the probability of Tampering is not affected by observing Smoke; however, the probabilities of Report and Fire are increased.
- Suppose that both Report and Smoke were observed:
  - P(Tampering|Report, Smoke) = 0.0284
  - P(Fire|Report, Smoke) = 0.964
- Thus, observing both makes Fire more likely.
- However, in the context of Report, the presence of Smoke makes Tampering less likely.

## Inferences

- Suppose instead that there is a Report but no Smoke:
  - P(Tampering|Report, ~Smoke) = 0.501
  - P(Fire|Report, ~Smoke) = 0.0294
- In the context of Report, Fire becomes much less likely, and so the probability of Tampering increases to explain the Report.
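The posterior values quoted on the Inferences slides can be checked by brute-force enumeration of the 64 possible worlds of the fire alarm network. The sketch below is not the method used in the lecture, only a sanity check; it assumes the priors P(T) = 0.02 and P(F) = 0.01, which are the values that reproduce the quoted results, and the names `bern`, `joint` and `query` are hypothetical helpers.

```python
from itertools import product

# CPTs of the fire alarm network. The priors 0.02 and 0.01 are an assumption:
# they are the values that reproduce the probabilities quoted in the lecture.
P_T = 0.02
P_F = 0.01
P_A = {(True, True): 0.5, (True, False): 0.99,      # keyed by (fire, tampering)
       (False, True): 0.85, (False, False): 0.0001}
P_S = {True: 0.9, False: 0.01}      # keyed by fire
P_L = {True: 0.88, False: 0.001}    # keyed by alarm
P_R = {True: 0.75, False: 0.01}     # keyed by leaving

NAMES = ("T", "F", "A", "S", "L", "R")


def bern(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(w):
    """Factored JPD: P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (bern(P_T, w["T"]) * bern(P_F, w["F"])
            * bern(P_A[(w["F"], w["T"])], w["A"])
            * bern(P_S[w["F"]], w["S"])
            * bern(P_L[w["A"]], w["L"])
            * bern(P_R[w["L"]], w["R"]))


def query(var, evidence=None):
    """P(var = true | evidence), summing the measure of compatible worlds."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        w = dict(zip(NAMES, values))
        if any(w[k] != v for k, v in evidence.items()):
            continue  # world ruled out by the evidence
        p = joint(w)
        denominator += p
        if w[var]:
            numerator += p
    return numerator / denominator
```

For example, `query("T", {"R": True})` corresponds to P(Tampering|Report) and `query("F", {"R": True, "S": False})` to P(Fire|Report, ~Smoke). Enumeration is exponential in the number of variables, so it is only practical for small networks like this one.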

## Determining posterior distributions

- Problem: computing conditional probabilities given the evidence.
- Estimating posterior probabilities in a BN within an absolute error (of less than 0.5) is NP-hard.
- There are 3 main approaches.

(1) Exploit the structure of the network

- Clique tree propagation method: the network is transformed into a tree with nodes labeled with sets of variables. Reasoning is performed by passing messages between the nodes in the tree.
- Time complexity is linear in the number of nodes of the tree; the tree is in fact a polytree, and its size may be exponential in the size of the belief network.

## Determining posterior distributions

(2) Search-based approaches

- Enumerate possible worlds and estimate the posterior probabilities from the enumerated worlds.

(3) Stochastic simulation

- Random cases are generated according to a probability distribution. By treating these cases as a set of samples, one can estimate the marginal distribution on any combination of variables.
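Stochastic simulation can be illustrated with forward sampling plus rejection of the samples that contradict the evidence. This is one simple scheme among several and is not necessarily the one intended in the lecture. The sketch uses the fire alarm network, again assuming the priors P(T) = 0.02 and P(F) = 0.01; `sample_world` and `estimate` are hypothetical names.

```python
import random

# Probability that Alarm is true, keyed by (fire, tampering).
P_ALARM = {(True, True): 0.5, (True, False): 0.99,
           (False, True): 0.85, (False, False): 0.0001}


def sample_world(rng):
    """Draw one random case, sampling each variable after its parents."""
    t = rng.random() < 0.02                  # assumed prior P(T)
    f = rng.random() < 0.01                  # assumed prior P(F)
    a = rng.random() < P_ALARM[(f, t)]
    s = rng.random() < (0.9 if f else 0.01)
    l = rng.random() < (0.88 if a else 0.001)
    r = rng.random() < (0.75 if l else 0.01)
    return {"T": t, "F": f, "A": a, "S": s, "L": l, "R": r}


def estimate(var, evidence, n_samples, seed=0):
    """Estimate P(var = true | evidence) by rejection sampling."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_samples):
        w = sample_world(rng)
        if all(w[k] == v for k, v in evidence.items()):
            kept += 1          # sample agrees with the evidence
            hits += w[var]     # True counts as 1
    return hits / kept if kept else float("nan")


p_smoke = estimate("S", {}, 200_000)
p_fire_given_smoke = estimate("F", {"S": True}, 200_000)
```

With rare evidence most samples are rejected, so the estimate of a conditional probability is much noisier than the estimate of a prior for the same number of samples.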

## A structure approach method

- Based on the notion that a BN specifies a factorization of the JPD.
- A factor is a representation of a function from a tuple of random variables into a number.
- In $f(X_1, \dots, X_n)$, $X_1, \dots, X_n$ are the variables of the factor, and $f$ is a factor on $X_1, \dots, X_n$.
- If $f(X_1, \dots, X_n)$ is a factor and each $v_i$ is an element of the domain of $X_i$, then $f(X_1 = v_1, \dots, X_n = v_n)$ is a number: the value of $f$ when each $X_i$ has value $v_i$.

## A structure approach method

- The product of two factors $f_1$ and $f_2$ is a factor on the union of their variables:

  $$(f_1 \times f_2)(X_1, \dots, X_i, Y_1, \dots, Y_j, Z_1, \dots, Z_k) = f_1(X_1, \dots, X_i, Y_1, \dots, Y_j) \times f_2(Y_1, \dots, Y_j, Z_1, \dots, Z_k)$$

- Given a factor $f(X_1, \dots, X_i)$, one can sum out a variable, say $X_1$, and the result is a factor on $X_2, \dots, X_i$:

  $$\Big(\sum_{X_1} f\Big)(X_2, \dots, X_i) = f(X_1 = v_1, X_2, \dots, X_i) + \dots + f(X_1 = v_k, X_2, \dots, X_i)$$

  where $v_1, \dots, v_k$ are the possible values of $X_1$.
- A conditional probability distribution can be seen as a factor:

  $$f(X = u, Y_1 = v_1, \dots, Y_j = v_j) = P(X = u \mid Y_1 = v_1 \wedge \dots \wedge Y_j = v_j)$$
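The two factor operations can be sketched with a small table-based representation. This is a minimal illustration assuming Boolean variables; the `Factor` class and the functions `multiply` and `sum_out` are hypothetical names, and the example uses P(F) = 0.01 (an assumed prior) with the P(S|F) table of the fire alarm network.

```python
from itertools import product


class Factor:
    """A function from assignments of `variables` to numbers, stored as a table."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)  # maps a tuple of values (one per variable) to a number

    def value(self, assignment):
        """Value of the factor for a dict assigning (at least) its variables."""
        return self.table[tuple(assignment[v] for v in self.variables)]


def multiply(f1, f2, domain=(True, False)):
    """Product of two factors: a factor on the union of their variables."""
    variables = f1.variables + tuple(v for v in f2.variables
                                     if v not in f1.variables)
    table = {}
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        table[values] = f1.value(assignment) * f2.value(assignment)
    return Factor(variables, table)


def sum_out(f, var):
    """Sum out `var`: add up the entries that agree on all other variables."""
    idx = f.variables.index(var)
    variables = f.variables[:idx] + f.variables[idx + 1:]
    table = {}
    for values, number in f.table.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + number
    return Factor(variables, table)


# P(F) as a factor on F, and P(S | F) as a factor on (S, F).
p_fire = Factor(["F"], {(True,): 0.01, (False,): 0.99})
p_smoke_given_fire = Factor(["S", "F"], {(True, True): 0.9, (False, True): 0.1,
                                         (True, False): 0.01, (False, False): 0.99})
# Summing F out of the product gives the marginal P(S), a factor on S only.
p_smoke = sum_out(multiply(p_smoke_given_fire, p_fire), "F")
```

A dictionary keyed by value tuples keeps the sketch short; a multidimensional array would be a more efficient representation for larger factors.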

## A structure approach method

- The BN inference problem is computing the posterior distribution of a variable given some evidence.
- It can be reduced to the problem of computing the probabilities of conjunctions.
- Given the evidence $Y_1 = v_1, \dots, Y_j = v_j$ and the query variable $Z$:

  $$P(Z \mid v_1, \dots, v_j) = \frac{P(Z, v_1, \dots, v_j)}{P(v_1, \dots, v_j)} = \frac{P(Z, v_1, \dots, v_j)}{\sum_{z} P(z, v_1, \dots, v_j)}$$

- Therefore: compute the factor $P(Z, v_1, \dots, v_j)$ and normalize.

## A structure approach method

- The variables of the BN are $X_1, \dots, X_n$.
- To compute the factor $P(Z, v_1, \dots, v_j)$ we must sum out the other variables from the JPD.
- Let $Z_1, \dots, Z_k$ be an enumeration of the other variables in the BN:

  $$\{Z_1, \dots, Z_k\} = \{X_1, \dots, X_n\} - \{Z\} - \{Y_1, \dots, Y_j\}$$

- The factor can be computed by summing out each $Z_i$.
- The order of the $Z_i$ is an elimination order:

  $$P(Z, Y_1 = v_1, \dots, Y_j = v_j) = \sum_{Z_k} \dots \sum_{Z_1} P(X_1, \dots, X_n)_{Y_1 = v_1, \dots, Y_j = v_j}$$

## A structure approach method

$$P(Z, Y_1 = v_1, \dots, Y_j = v_j) = \sum_{Z_k} \dots \sum_{Z_1} P(X_1, \dots, X_n)_{Y_1 = v_1, \dots, Y_j = v_j}$$

- There is a possible world for each assignment of a value to each variable.
- The JPD $P(X_1, \dots, X_n)$ gives the probability (measure) of each possible world.
- The approach selects the worlds with the observed values for the $Y$'s and sums over the possible worlds with the same value for $Z$; this is in fact the definition of conditional probability.

## A structure approach method

- By the rule for the conjunction of probabilities and the definition of a BN:

  $$P(X_1, \dots, X_n) = P(X_1 \mid Parents(X_1)) \times \dots \times P(X_n \mid Parents(X_n))$$

- The BN inference problem is now reduced to the problem of summing out a set of variables from a product of factors.
- To compute the posterior distribution of a query variable given observations:
  1. Construct the JPD in terms of a product of factors.
  2. Set the observed variables to their observed values.
  3. Sum out each of the other variables (the $Z_1, \dots, Z_k$).
  4. Multiply the remaining factors and normalize.

## A structure approach method

- To sum out a variable $Z$ from a product $f_1 \times \dots \times f_k$ of factors:
  - First partition the factors into those that do not contain $Z$, say $f_1, \dots, f_i$, and those that contain $Z$, say $f_{i+1}, \dots, f_k$.
  - Then

    $$\sum_{Z} f_1 \times \dots \times f_k = f_1 \times \dots \times f_i \times \Big(\sum_{Z} f_{i+1} \times \dots \times f_k\Big)$$

  - Then explicitly construct a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor.
- The factor size is exponential in the number of variables of the factor.
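As a worked instance of the partition step, consider summing out the variable $A$ (Alarm) from the factored JPD of the fire alarm network. Only $P(A \mid T,F)$ and $P(L \mid A)$ contain $A$, so the other four factors move outside the sum; the name $f'$ for the new factor is introduced here only for illustration.

$$
\begin{aligned}
\sum_{A} P(T)\, P(F)\, P(A \mid T,F)\, P(S \mid F)\, P(L \mid A)\, P(R \mid L)
&= P(T)\, P(F)\, P(S \mid F)\, P(R \mid L) \times \Big(\sum_{A} P(A \mid T,F)\, P(L \mid A)\Big) \\
&= P(T)\, P(F)\, P(S \mid F)\, P(R \mid L) \times f'(T,F,L)
\end{aligned}
$$

The new factor $f'$ is on three Boolean variables, so its table has $2^3 = 8$ entries, whereas the full joint over six Boolean variables would have $2^6 = 64$.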

## 3. Bayesian prediction

5 bags of candies:

- $h_1$: 100% cherry
- $h_2$: 75% cherry, 25% lime
- $h_3$: 50% cherry, 50% lime
- $h_4$: 25% cherry, 75% lime
- $h_5$: 100% lime

Setting:

- $H$ (the set of hypotheses): the type of bag, with values $h_1, \dots, h_5$.
- Collect evidence (random variables) $d_1, d_2, \dots$ with possible values cherry or lime.
- Goal: predict the flavour of the next candy.

## Bayesian prediction

Let $D$ be the data, with observed value $d$.

- The probability of each hypothesis, based on Bayes' rule, is:

  $$P(h_i \mid d) = \alpha\, P(d \mid h_i)\, P(h_i) \qquad (1)$$

- The prediction on an unknown quantity $X$ is:

  $$P(X \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d) \qquad (2)$$

- Key elements: the prior probabilities $P(h_i)$ and the probability of the evidence under each hypothesis, $P(d \mid h_i)$:

  $$P(d \mid h_i) = \prod_j P(d_j \mid h_i) \qquad (3)$$

We assume the prior probabilities:

| $h_1$ | $h_2$ | $h_3$ | $h_4$ | $h_5$ |
|-------|-------|-------|-------|-------|
| 0.1   | 0.2   | 0.4   | 0.2   | 0.1   |

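Applying equation (1), $P(h_i \mid d) = \alpha\, P(d \mid h_i)\, P(h_i)$, after one lime candy is observed, with the priors $(0.1, 0.2, 0.4, 0.2, 0.1)$ and the lime proportions $(0, 0.25, 0.5, 0.75, 1)$ of $h_1, \dots, h_5$:

- $P(lime) = 0.1 \times 0 + 0.2 \times 0.25 + 0.4 \times 0.5 + 0.2 \times 0.75 + 0.1 \times 1 = 0.5$
- $\alpha = 1 / 0.5 = 2$
- $P(h_1 \mid lime) = \alpha\, P(lime \mid h_1)\, P(h_1) = 2 \times (0 \times 0.1) = 0$
- $P(h_2 \mid lime) = \alpha\, P(lime \mid h_2)\, P(h_2) = 2 \times (0.25 \times 0.2) = 0.1$
- $P(h_3 \mid lime) = \alpha\, P(lime \mid h_3)\, P(h_3) = 2 \times (0.5 \times 0.4) = 0.4$
- $P(h_4 \mid lime) = \alpha\, P(lime \mid h_4)\, P(h_4) = 2 \times (0.75 \times 0.2) = 0.3$
- $P(h_5 \mid lime) = \alpha\, P(lime \mid h_5)\, P(h_5) = 2 \times (1 \times 0.1) = 0.2$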

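Applying equations (1) and (3), $P(h_i \mid d) = \alpha\, P(d \mid h_i)\, P(h_i)$ with $P(d \mid h_i) = \prod_j P(d_j \mid h_i)$, after two lime candies are observed:

- $P(lime, lime) = 0.1 \times 0 \times 0 + 0.2 \times 0.25 \times 0.25 + 0.4 \times 0.5 \times 0.5 + 0.2 \times 0.75 \times 0.75 + 0.1 \times 1 \times 1 = 0.325$
- $\alpha = 1 / 0.325 \approx 3.0769$
- $P(h_1 \mid lime, lime) = \alpha\, P(lime, lime \mid h_1)\, P(h_1) = 3.0769 \times (0 \times 0 \times 0.1) = 0$
- $P(h_2 \mid lime, lime) = \alpha\, P(lime, lime \mid h_2)\, P(h_2) = 3.0769 \times (0.25 \times 0.25 \times 0.2) \approx 0.0385$
- $P(h_3 \mid lime, lime) = \alpha\, P(lime, lime \mid h_3)\, P(h_3) = 3.0769 \times (0.5 \times 0.5 \times 0.4) \approx 0.3077$
- $P(h_4 \mid lime, lime) = \alpha\, P(lime, lime \mid h_4)\, P(h_4) = 3.0769 \times (0.75 \times 0.75 \times 0.2) \approx 0.3462$
- $P(h_5 \mid lime, lime) = \alpha\, P(lime, lime \mid h_5)\, P(h_5) = 3.0769 \times (1 \times 1 \times 0.1) \approx 0.3077$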

[Figure: posterior probabilities $P(h_i \mid d_1, \dots, d_{10})$ computed from equation (1)]

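## Bayesian prediction

Applying equation (2), $P(X \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d)$, to predict the second candy after the first one, $d_1$, was lime. The weights are the posteriors $P(h_i \mid d_1) = (0, 0.1, 0.4, 0.3, 0.2)$:

$$
\begin{aligned}
P(d_2 = lime \mid d_1) &= P(d_2 \mid h_1)\, P(h_1 \mid d_1) + P(d_2 \mid h_2)\, P(h_2 \mid d_1) + P(d_2 \mid h_3)\, P(h_3 \mid d_1) + P(d_2 \mid h_4)\, P(h_4 \mid d_1) + P(d_2 \mid h_5)\, P(h_5 \mid d_1) \\
&= 0 \times 0 + 0.25 \times 0.1 + 0.5 \times 0.4 + 0.75 \times 0.3 + 1 \times 0.2 = 0.65
\end{aligned}
$$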
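The update and prediction steps of equations (1) to (3) can be sketched as a short Python program for the candy example. The function names `likelihood`, `posterior` and `predict_lime` are hypothetical; the priors and lime proportions are those of the lecture.

```python
# Priors P(h_i) and the proportion of lime candies under each hypothesis h1..h5.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]


def likelihood(i, data):
    """Equation (3): P(d | h_i) as a product over the observations."""
    result = 1.0
    for d in data:
        result *= p_lime[i] if d == "lime" else 1.0 - p_lime[i]
    return result


def posterior(data):
    """Equation (1): P(h_i | d) = alpha * P(d | h_i) * P(h_i)."""
    unnormalized = [likelihood(i, data) * priors[i] for i in range(len(priors))]
    alpha = 1.0 / sum(unnormalized)
    return [alpha * u for u in unnormalized]


def predict_lime(data):
    """Equation (2): P(next = lime | d) = sum_i P(lime | h_i) * P(h_i | d)."""
    return sum(p * q for p, q in zip(p_lime, posterior(data)))


posterior_after_one_lime = posterior(["lime"])
prediction_after_one_lime = predict_lime(["lime"])
```

Here `posterior(["lime"])` gives the five values computed by hand for one lime candy, and `predict_lime(["lime"])` gives the prediction for the second candy. Calling `posterior` with an empty list returns the priors.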

## Remarks

- The true hypothesis will eventually dominate the prediction.
- There are problems if the hypothesis space is big.
- Approximation: prediction based on the most probable hypothesis.
  - MAP learning: maximum a posteriori.
  - $P(X \mid d) \approx P(X \mid h_{MAP})$
  - In the example, $h_{MAP} = h_5$ after 3 lime observations, so the MAP prediction that the next candy is lime has probability 1.0.
  - As more data is collected, the MAP and Bayesian predictions tend to be closer.
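The difference between the MAP approximation and the full Bayesian prediction can be illustrated on the candy example, assuming that every observed candy is lime. The helper names `posterior_after_limes`, `bayes_prediction` and `map_prediction` are hypothetical.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)


def posterior_after_limes(n):
    """P(h_i | n lime candies observed), normalized."""
    unnormalized = [prior * q ** n for prior, q in zip(priors, p_lime)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bayes_prediction(n):
    """Full Bayesian prediction: average over all hypotheses."""
    return sum(q * p for q, p in zip(p_lime, posterior_after_limes(n)))


def map_prediction(n):
    """MAP approximation: use only the most probable hypothesis."""
    post = posterior_after_limes(n)
    h_map = max(range(len(post)), key=post.__getitem__)
    return h_map, p_lime[h_map]


h_map_3, map_3 = map_prediction(3)   # index 4 is h5, which predicts lime with 1.0
bayes_3 = bayes_prediction(3)        # about 0.796
```

After three lime candies the MAP prediction is 1.0 while the Bayesian prediction is about 0.8; after ten lime candies the two are much closer, which is the convergence described in the remarks.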
