
1 Modeling biological data and structure with probabilistic networks I Yuan Gao, Ph.D. 11/05/2002. Slides prepared from text material by Simon Kasif and Arthur Delcher of UIC and Loyola College, Computational Methods in Molecular Biology.

2 Probabilistic networks Also known as Bayes’ networks. Bayes’ networks are graphical models of probability distributions that provide a convenient medium for scientists to: –Express their knowledge of the domain as a collection of probabilistic or causal rules that describe biological sequences; these rules capture our intuition about the information in biological sequences. –Learn the exact structure of the network and the associated probabilities governing the probabilistic processes that generate the sequences, directly from observed data.

3 Probabilistic networks Bayes’ Law: –P(X|Y) = P(X, Y)/P(Y), or equivalently –P(X, Y) = P(X|Y)P(Y), where P(X, Y) is the joint probability distribution of the random variables X and Y. This generalizes to more random variables (the chain rule): P(X1, X2, …, XN) = P(X1|X2, …, XN)P(X2|X3, …, XN) … P(XN-1|XN)P(XN)
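A minimal Python sketch of this product form, using a made-up joint distribution over two binary variables (all numbers are purely illustrative):

```python
# A toy numerical check of Bayes' Law on an invented joint
# distribution over two binary variables X and Y.
joint = {  # P(X, Y)
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_y(y):
    """Marginal P(Y = y), summing the joint over X."""
    return sum(p for (x_, y_), p in joint.items() if y_ == y)

def p_x_given_y(x, y):
    """Conditional P(X = x | Y = y) = P(X, Y) / P(Y)."""
    return joint[(x, y)] / p_y(y)

# Check the product form P(X, Y) = P(X | Y) P(Y) for every cell.
for (x, y), p in joint.items():
    assert abs(p_x_given_y(x, y) * p_y(y) - p) < 1e-12
print(p_x_given_y(1, 1))  # 0.40 / 0.50 = 0.8
```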

4 Probabilistic networks X and Y are independent iff –P(X|Y) = P(X), or equivalently –P(X, Y) = P(X)P(Y). X and Y are conditionally independent given Z iff P(X|Y, Z) = P(X|Z), or equivalently P(X, Y|Z) = P(X|Z)P(Y|Z). The notion of independence can be generalized to sets of random variables.
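A small numerical check of the conditional-independence identity, on a toy joint constructed as P(Z)P(X|Z)P(Y|Z) so that the identity holds by construction (all numbers are invented):

```python
# X and Y are dependent marginally but independent given Z, because
# the joint below is built as P(Z) P(X|Z) P(Y|Z).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # p_y_given_z[z][y]

def p_xyz(x, y, z):
    return p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]

def p_xy_given_z(x, y, z):
    return p_xyz(x, y, z) / p_z[z]

for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            lhs = p_xy_given_z(x, y, z)
            rhs = p_x_given_z[z][x] * p_y_given_z[z][y]
            assert abs(lhs - rhs) < 1e-12  # P(X,Y|Z) = P(X|Z) P(Y|Z)
print("conditional independence given Z holds")
```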

5 Probabilistic networks Consider a sequence of random variables X1, X2, …, XN. Assume that each Xi is conditionally independent of X1, X2, …, Xi-2 given Xi-1, or equivalently, that P(Xi|X1, …, Xi-1) = P(Xi|Xi-1). –This is often referred to as the Markov assumption. In other words, the future depends only on the present; the process has no memory of the entire past!

6 Probabilistic networks Under the Markov assumption, –P(X1, X2, …, XN) = P(X1|X2)P(X2|X3) … P(XN-1|XN)P(XN), or equivalently, in the forward direction, P(X1)P(X2|X1) … P(XN|XN-1). –Graphically, these dependencies can be represented as a chain graph. Figure 1. Chain model (an example): X1 → X2 → X3 → X4 → X5, with edges labeled P(X2|X1), P(X3|X2), P(X4|X3), P(X5|X4).
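A sketch of evaluating a sequence probability under this chain factorization, written in the forward direction to match the figure; the two-letter alphabet and the transition numbers are made up for illustration:

```python
# Evaluate P(x1,...,xN) = P(X1) * prod_i P(Xi | Xi-1) for a toy
# two-symbol chain with invented probabilities.
initial = {"A": 0.5, "B": 0.5}        # P(X1)
transition = {                        # transition[prev][cur] = P(Xi | Xi-1)
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.3, "B": 0.7},
}

def chain_probability(seq):
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(chain_probability("AABB"))  # 0.5 * 0.9 * 0.1 * 0.7
```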

7 Chain Models in Biology A simple problem: –Given a set of homologous protein sequences D (a so-called training set) and a new protein sequence, what is the likelihood that the new sequence is a member of the set (family) D? Start with a few assumptions, which can be removed easily: –The sequences are all of length L. A standard solution: devise a simple probabilistic network model of the sequences in D.

8 A standard solution with an independence assumption First associate a random variable Xi with each position 1 ≤ i ≤ L, where the values of each Xi range over the 20 amino acids. –Assume the model has the topology shown below. Figure 2. Simplest possible network, assuming independence of the random variables: nodes X1, X2, X3, X4, X5 with no edges between them.

9 A standard solution with an independence assumption The distribution of Xi is intended to model the relative frequency of each amino acid at position i, independently of the other positions. The joint probability of X1, …, XL is defined by –P(X1, …, XL) = P(X1)P(X2) … P(XL) This is a standard model in biology, typically expressed as a “consensus matrix”.
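A sketch of the consensus-matrix model: one distribution per position, with the joint probability as a product across positions. The residue frequencies below are hypothetical, not drawn from real data:

```python
# Position-wise independence model over a toy three-position protein.
consensus = [                          # consensus[i][aa] = P(Xi = aa)
    {"M": 0.95, "L": 0.05},            # position 1
    {"K": 0.60, "R": 0.40},            # position 2
    {"A": 0.50, "G": 0.30, "S": 0.20}, # position 3
]

def sequence_probability(seq):
    """P(x1,...,xL) = P(x1) P(x2) ... P(xL); 0 if a residue is unseen."""
    p = 1.0
    for dist, aa in zip(consensus, seq):
        p *= dist.get(aa, 0.0)
    return p

print(sequence_probability("MKA"))  # 0.95 * 0.60 * 0.50
```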

10 A standard solution with an independence assumption Generative interpretation of the model –Generating a database of protein sequences using the model: (1) For i = 1 to L, generate a random value for the ith position in the sequence using the probability distribution associated with the ith variable. (2) Repeat step 1 to generate many more new sequences. Now we ask: what is the best network in this simple class of networks that models the sequences in our database as closely as possible? This is referred to as the learning problem for Bayes’ networks.
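A sketch of this generative procedure, sampling each position independently from a hypothetical consensus matrix like the one above:

```python
import random

# Draw each position from its own distribution, then repeat to build
# a small database of sequences. Frequencies are invented.
consensus = [
    {"M": 0.95, "L": 0.05},
    {"K": 0.60, "R": 0.40},
    {"A": 0.50, "G": 0.30, "S": 0.20},
]

def generate_sequence():
    out = []
    for dist in consensus:                 # step 1: one draw per position
        residues = list(dist)
        weights = [dist[aa] for aa in residues]
        out.append(random.choices(residues, weights=weights)[0])
    return "".join(out)

database = [generate_sequence() for _ in range(5)]  # step 2: repeat
print(database)  # e.g. ['MKA', 'MKG', 'MRA', 'MKA', 'LKS']
```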

11 A standard solution with an independence assumption More formally, we want to find a probabilistic model M such that P(M|D) is maximized. Mathematically, using Bayes’ Law, we want to maximize P(D|M)P(M)/P(D). –Assuming each model is equally likely, this is equivalent to finding a model that maximizes P(D|M). –Assuming further that each sequence is generated independently, we can maximize the product of the P(Di|M), where Di is the ith sequence. Since the model is of the form P(X1, …, XL) = P(X1)P(X2) … P(XL), simple differentiation shows that the model maximizing the probability of the database D is obtained by recording the individual frequencies of each amino acid at each position. In other words, learning the BEST model that generates the data is done simply by recording the relative frequency of each amino acid at each position!
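A sketch of this learning step: the maximum-likelihood consensus matrix is obtained by counting per-position relative frequencies. The training sequences are invented:

```python
from collections import Counter

# Learn a consensus matrix from a toy training set by recording the
# relative frequency of each amino acid at each position.
training = ["MKA", "MKG", "MRA", "MKA", "LKS"]
L = len(training[0])

consensus = []
for i in range(L):
    counts = Counter(seq[i] for seq in training)
    total = sum(counts.values())
    consensus.append({aa: c / total for aa, c in counts.items()})

print(consensus[0])  # e.g. {'M': 0.8, 'L': 0.2}
```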

12 A standard solution with an independence assumption Returning to our problem: given a new protein sequence x = x1, x2, …, xL, we evaluate its probability of being generated by the model as –P(x|M) = P(x1)P(x2) … P(xL) Typically, a score is computed as –P(x|M)/Pr(x), or log[P(x|M)/Pr(x)], –where Pr(x) is the probability that x was generated “at random”. The score is then compared to some carefully chosen threshold. This is known as the “consensus matrix” or “weight matrix” approach. Is this a Markov model? It is a zeroth-order Markov model, because of the independence assumption.
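A sketch of the weight-matrix score as a log-odds value; the uniform 1/20 background and the zero threshold are illustrative assumptions, not prescribed by the slides:

```python
import math

# Score a sequence as log[P(x|M) / Pr(x)] with a uniform background
# of 1/20 per residue. Consensus frequencies are hypothetical.
consensus = [
    {"M": 0.95, "L": 0.05},
    {"K": 0.60, "R": 0.40},
    {"A": 0.50, "G": 0.30, "S": 0.20},
]
BACKGROUND = 1.0 / 20.0

def log_odds_score(seq):
    score = 0.0
    for dist, aa in zip(consensus, seq):
        p = dist.get(aa, 0.0)
        if p == 0.0:
            return float("-inf")  # residue never seen at this position
        score += math.log(p / BACKGROUND)
    return score

print(log_odds_score("MKA") > 0.0)  # accept if above a chosen threshold
```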

13 A standard solution with an independence assumption A few notes –A simple model has few parameters; a more complex model requires more parameters. –However, this simple learning methodology is valid for arbitrary fixed (predetermined) structure Bayes’ networks without hidden variables.

14 A Slight Generalization to a first-order Markov model Removing the independence assumption from the previous zeroth-order Markov model, under the Markov assumption we need to record the frequencies of amino-acid pairs for each position in the protein sequence. With this model, given a new protein sequence x = x1, x2, …, xL, we evaluate the probability that x has been generated by the model as the product P(x1|x2) … P(xL-1|xL)P(xL). This generalizes the consensus matrix approach for biological modeling. This approach was followed by Salzberg and produced a statistically significant improvement in the false-positive rate for detecting donor/acceptor sites in DNA sequences.
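A sketch of the position-specific first-order model: pair frequencies are recorded at each position, and a new sequence is scored with the equivalent forward factorization P(x1)P(x2|x1) … P(xL|xL-1). The training set is invented:

```python
from collections import Counter

# Learn position-specific pair frequencies, then evaluate a new
# sequence under the forward first-order Markov factorization.
training = ["MKA", "MKG", "MRA", "MKA", "LKS"]
L = len(training[0])

first = Counter(seq[0] for seq in training)              # counts for position 1
pairs = [Counter((seq[i], seq[i + 1]) for seq in training)
         for i in range(L - 1)]                          # pair counts per position

def markov_probability(seq):
    p = first[seq[0]] / len(training)                    # P(x1)
    for i in range(L - 1):
        pair_count = pairs[i][(seq[i], seq[i + 1])]
        prev_count = sum(c for (a, _), c in pairs[i].items() if a == seq[i])
        if prev_count == 0:
            return 0.0
        p *= pair_count / prev_count                     # P(x_{i+1} | x_i)
    return p

print(markov_probability("MKA"))  # 4/5 * 3/4 * 2/4 = 0.3
```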

15 Modeling hidden state systems with probabilistic networks Now consider a very simple hidden Markov model (HMM) for protein sequence modeling. –Assume each amino acid in the protein can be in one of three states: alpha-helix, beta-sheet, or coil. –With each state Si, we associate a set of transition probabilities; let Ai,j denote the probability of moving from state Si to state Sj. –With each state, we also associate a matrix of “output” probabilities: in each state, an amino acid is “output” or “emitted” according to some probability distribution. Let Bi,k denote the probability of generating the kth amino acid in state Si.

16 Modeling hidden state systems with probabilistic networks Figure 3. Simple HMM for generating protein sequences, with states Helix, Sheet, and Coil.

17 A simple HMM for protein Sequences A generative model of proteins –To generate a protein sequence, we simply follow the algorithm below. Let L be the length of the protein sequence. (1) Set the current state S = coil. (2) Output an amino acid according to the output distribution Bs,k of S. (3) Transition to a new state S according to the probability distribution As,j associated with the current state S. (4) Repeat steps 2-3 L-1 more times.
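A sketch of this generative algorithm; the transition table A and emission table B below are made-up illustrative numbers, not probabilities estimated from real proteins:

```python
import random

STATES = ["helix", "sheet", "coil"]
A = {   # A[s][t]: probability of moving from state s to state t
    "helix": {"helix": 0.8, "sheet": 0.1, "coil": 0.1},
    "sheet": {"helix": 0.1, "sheet": 0.8, "coil": 0.1},
    "coil":  {"helix": 0.3, "sheet": 0.3, "coil": 0.4},
}
B = {   # B[s][aa]: probability of emitting amino acid aa in state s
    "helix": {"A": 0.5, "L": 0.5},
    "sheet": {"V": 0.6, "I": 0.4},
    "coil":  {"G": 0.5, "P": 0.5},
}

def draw(dist):
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(L):
    state, seq = "coil", []            # step 1: start in the coil state
    for _ in range(L):
        seq.append(draw(B[state]))     # step 2: emit an amino acid
        state = draw(A[state])         # step 3: move to the next state
    return "".join(seq)                # steps 2-3 run L times in total

print(generate(10))
```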

18 A probabilistic analog of HMM First, fix the length of the sequences to L. Associate a random variable PSi with the state at position i, and a random variable Ei with the amino acid “generated” at position i. In fact, every HMM has an equivalent probabilistic network that is a two-layer tree network. Note: –During learning, all variables are fully observable. –During inference, the values of the state variables are hidden. Figure 4. Two-layer network: a chain of state nodes … PSi-1 → PSi → PSi+1 …, each with an edge to its emission node (… Ei-1, Ei, Ei+1 …).
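A sketch of the fully observed case noted above: when both the state path and the emissions are visible (as during learning), the joint probability of the two-layer network is a product of one transition term and one emission term per position. The tables are illustrative, and the deterministic coil start mirrors the generative algorithm:

```python
# Joint probability P(PS1..PSL, E1..EL) for a fully observed path,
# with invented two-state tables and PS1 fixed to 'coil'.
A = {"helix": {"helix": 0.8, "coil": 0.2},
     "coil":  {"helix": 0.4, "coil": 0.6}}   # A[s][t]: transitions
B = {"helix": {"A": 0.7, "G": 0.3},
     "coil":  {"A": 0.2, "G": 0.8}}          # B[s][aa]: emissions

def joint_probability(states, emissions):
    assert states[0] == "coil" and len(states) == len(emissions)
    p = B[states[0]][emissions[0]]           # start state emits E1
    for i in range(1, len(states)):
        p *= A[states[i - 1]][states[i]]     # P(PSi | PSi-1)
        p *= B[states[i]][emissions[i]]      # P(Ei | PSi)
    return p

print(joint_probability(["coil", "helix", "helix"], ["G", "A", "A"]))
# 0.8 * (0.4 * 0.7) * (0.8 * 0.7) = 0.12544
```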

