
1 Fa07CSE182-L8 HMMs, Gene Finding

2 Fa07CSE182-L8 Midterm 1 In class next Tuesday Syllabus: L1-L8 –Please review HW, questions, and other notes –Bring one sheet of written formulae Most of the questions will test concepts, so hopefully, you don’t have to remember too much. Final will be a take-home exam We will discuss class projects next time.

3 Fa07CSE182-L8 QUIZ! Question: your ‘friend’ likes to gamble. He tosses a coin: TAILS, he gives you a dollar. HEADS, you give him a dollar. Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. Can you say what fraction of the times he loads the coin? Suppose you know that in the loaded coin Pr(H)=0.8. Also, that he switches coins with 30% probability.

4 Fa07CSE182-L8 Modeling the coin problem as an HMM Q: Given the HMM, can you predict the hidden states? Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF States: L (loaded), F (fair). Emissions: Pr(H|L)=0.8, Pr(H|F)=0.5. Transitions: Pr(F->L)=0.3, Pr(L->F)=0.3, Pr(F->F)=0.7, Pr(L->L)=0.7.
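One way to answer the question on this slide is Viterbi decoding. The following is a minimal sketch, not the course's own code. The emission and transition numbers are the ones on the slide; the uniform start distribution `START` is an assumption, since the slide does not say which coin is used first.

```python
import math

STATES = ("F", "L")
# Emission probabilities from the slide: fair coin Pr(H)=0.5, loaded coin Pr(H)=0.8.
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.8, "T": 0.2}}
# Transition probabilities from the slide: the coin is switched with probability 0.3.
TRANS = {"F": {"F": 0.7, "L": 0.3}, "L": {"F": 0.3, "L": 0.7}}
# Assumption (not on the slide): either coin is equally likely for the first toss.
START = {"F": 0.5, "L": 0.5}


def viterbi(observed):
    """Return the most likely hidden state string for a string of H/T tosses."""
    # score[s] = log-probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observed[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t+1
    for symbol in observed[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
        back.append(pointers)
        score = new_score
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


observed = "HHTHHHTHHHHTHHTHHHHHTTTHTH"
hidden = viterbi(observed)
print(observed)
print(hidden)
```

The decoded string need not match the "Hidden" row on the slide, which shows the states that actually generated the tosses rather than the most likely reconstruction.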

5 Fa07CSE182-L8 HMM? It is an automaton with different ‘states’ (Ex: L, F). We start from a specific start state. Each state of the automaton generates a symbol from an alphabet ({‘H’, ‘T’}), according to a distribution. After generating, we move (transition) to a new state. While the output (the generated symbols) is visible, the states themselves are hidden. Observed: HHTHHHTHHHHTHHTHHHHHTTTHTH Hidden: LLFFLLLLLFFFLFFFLLLFFFFFFF
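The generate-then-transition loop described on this slide can be sketched in a few lines. This is an illustrative sketch only: the probabilities are those of the coin example, while the function name `generate`, the choice of F as the start state and the `seed` parameter are assumptions made for the example.

```python
import random

# Emission and transition probabilities of the coin example.
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.8, "T": 0.2}}
TRANS = {"F": {"F": 0.7, "L": 0.3}, "L": {"F": 0.3, "L": 0.7}}


def generate(length, start_state="F", seed=None):
    """Run the automaton for `length` steps; return (observed, hidden) strings."""
    rng = random.Random(seed)
    state, observed, hidden = start_state, [], []
    for _ in range(length):
        hidden.append(state)
        # Emit a symbol according to the current state's distribution.
        symbols = list(EMIT[state])
        weights = [EMIT[state][s] for s in symbols]
        observed.append(rng.choices(symbols, weights=weights)[0])
        # Move (transition) to the next state.
        targets = list(TRANS[state])
        weights = [TRANS[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
    return "".join(observed), "".join(hidden)


obs, hid = generate(26, seed=0)
print("Observed:", obs)
print("Hidden:  ", hid)
```

Only the first printed row would be available to an observer; the second row is what the HMM algorithms try to reconstruct.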

6 Fa07CSE182-L8 Profile HMMs Many natural problems can be expressed as an HMM. Ex: protein sequence profiles. Recall that a profile is created by looking at position specific distributions. [Figure: example column frequencies 0.71, 0.14, 0.28]

7 Fa07CSE182-L8 The generative model Think of each column in the alignment as generating a distribution. For each column, build a node that outputs a residue with the appropriate distribution. Example column: Pr[F]=0.71, Pr[Y]=0.14

8 Fa07CSE182-L8 A simple Profile HMM Connect nodes for each column into a chain. This chain generates sequences based on the profile distribution. What is the probability of generating FKVVGQVILD? In this representation –Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S] What is the difference with Profiles?

9 Fa07CSE182-L8 Profile HMMs can handle gaps The match states are the same as on the previous page. Insertion and deletion states help introduce gaps. A sequence may be generated using different paths.

10 Fa07CSE182-L8 Example The HMM M models a family of 3 sequences. Question: Is sequence S=ALIL part of the family? Solution: Compute Pr(S|M), the probability that the HMM M generated S. If Pr(S|M) is ‘High’, S belongs to the family. If it is ‘Low’, S does not belong. How can we compute Pr(S=ALIL|M)? Family alignment: AL-L, AIVL, AI-L

11 Fa07CSE182-L8 Example Probability [ALIL|M]? Note that multiple paths can generate this sequence. –P1 = $M_1 I_1 M_2 M_3$ –P2 = $M_1 M_2 I_2 M_3$ We ask a simpler question. Suppose a specific path was given. –Compute the joint probability Pr(S,P|M) Family alignment: AL-L, AIVL, AI-L

12 Fa07CSE182-L8 Joint probability computation Joint probability of seeing a sequence S, and path P –$\Pr[S,P \mid M] = \Pr[S \mid P,M]\,\Pr[P \mid M]$ –$\Pr[\text{ALIL AND } M_1 I_1 M_2 M_3 \mid M] = \Pr[\text{ALIL} \mid M_1 I_1 M_2 M_3, M]\,\Pr[M_1 I_1 M_2 M_3 \mid M] = {?}$

13 Fa07CSE182-L8 Formally The emitted sequence is $S = S_1 S_2 \ldots S_m$. The path traversed is $P_1 P_2 P_3 \ldots$ $e_{P_j}(s)$ = emission probability of symbol $s$ in state $P_j$. Transition probability $T[P_j,P_k]$: probability of transitioning from state $P_j$ to state $P_k$. $\Pr(P,S \mid M) = e_{P_1}(S_1)\, T[P_1,P_2]\, e_{P_2}(S_2) \cdots$ Let $P[k,n] = \Pr(P_1 \ldots P_k, S_1 \ldots S_n \mid M)$. Then, $P[k,n] = P[k-1,n-1]\, T[P_{k-1},P_k]\, e_{P_k}(S_n)$. What is $\Pr(S \mid M)$?
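The product of emissions and transitions along a given path can be sketched as follows. The slides do not give numeric parameters for the example HMM, so every number in `EMIT` and `TRANS` below is hypothetical and chosen only so that the arithmetic can be followed; state names are written `M1`, `I2` and so on.

```python
# Hypothetical emission probabilities (NOT from the slides).
EMIT = {
    "M1": {"A": 0.7, "L": 0.1, "I": 0.1, "V": 0.1},
    "M2": {"A": 0.1, "L": 0.3, "I": 0.5, "V": 0.1},
    "I2": {"A": 0.25, "L": 0.25, "I": 0.25, "V": 0.25},
    "M3": {"A": 0.1, "L": 0.7, "I": 0.1, "V": 0.1},
}
# Hypothetical transition probabilities (NOT from the slides).
TRANS = {("M1", "M2"): 0.9, ("M2", "I2"): 0.3, ("M2", "M3"): 0.7, ("I2", "M3"): 0.8}


def joint_probability(seq, path):
    """Pr(S, P | M) for a path whose states each emit exactly one symbol."""
    if len(seq) != len(path):
        raise ValueError("every state on the path must emit one symbol")
    prob = EMIT[path[0]][seq[0]]
    for k in range(1, len(path)):
        # Multiply in the transition into state k, then its emission.
        prob *= TRANS[(path[k - 1], path[k])] * EMIT[path[k]][seq[k]]
    return prob


p = joint_probability("ALIL", ["M1", "M2", "I2", "M3"])
print(p)
```

With these made-up numbers the product is 0.7 × (0.9 × 0.3) × (0.3 × 0.25) × (0.8 × 0.7). Paths through delete states, which emit nothing, are outside this simplified sketch.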

14 Fa07CSE182-L8 Two solutions An unknown (hidden) path is traversed to produce (emit) the sequence S. The probability that M emits S can be either –The sum over the joint probabilities over all paths: $\Pr(S \mid M) = \sum_P \Pr(S,P \mid M)$ –OR, the probability of the most likely path: $\Pr(S \mid M) = \max_P \Pr(S,P \mid M)$ Both are appropriate ways to model, and have similar algorithms to solve them.

15 Fa07CSE182-L8 Viterbi Algorithm for HMM Let $P_{max}(i,j \mid M)$ be the probability of the most likely solution that emits $S_1 \ldots S_i$, and ends in state $j$ (is it sufficient to compute this?) Viterbi: $P_{max}(i,j \mid M) = \max_k P_{max}(i-1,k)\, T[k,j]\, e_j(S_i)$ Sum: $P_{sum}(i,j \mid M) = \sum_k \big( P_{sum}(i-1,k)\, T[k,j] \big)\, e_j(S_i)$ Family alignment: AL-L, AIVL, AI-L

16 Fa07CSE182-L8 Profile HMM membership We can use the Viterbi/Sum algorithm to compute the probability that the sequence belongs to the family. Backtracking can be used to get the path, which allows us to give an alignment. Family alignment: AL-L, AIVL, AI-L. Path: $M_1 M_2 I_2 M_3$ for A L I L
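The two recurrences on this slide differ only in whether predecessors are combined with max or with sum, so a single function can sketch both. This example reuses the coin HMM from the earlier slides; the uniform `START` distribution and the helper names are assumptions, and the brute-force enumeration is included only to check the dynamic program on a short string.

```python
from itertools import product

STATES = ("F", "L")
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.8, "T": 0.2}}
TRANS = {"F": {"F": 0.7, "L": 0.3}, "L": {"F": 0.3, "L": 0.7}}
START = {"F": 0.5, "L": 0.5}  # assumption: not specified on the slides


def dp_probability(seq, combine):
    """Pr(S|M) with combine=sum (all paths) or combine=max (best path)."""
    # table[j] = P(i, j): value for the prefix ending at the current symbol in state j.
    table = {j: START[j] * EMIT[j][seq[0]] for j in STATES}
    for symbol in seq[1:]:
        table = {j: combine(table[k] * TRANS[k][j] for k in STATES) * EMIT[j][symbol]
                 for j in STATES}
    return combine(table.values())


def joint(seq, path):
    """Pr(S, P | M) for one explicit path."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, symbol in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][symbol]
    return p


def brute_force(seq, combine):
    """Enumerate every path; exponential, for checking only."""
    return combine(joint(seq, path) for path in product(STATES, repeat=len(seq)))


seq = "HHTH"
print(dp_probability(seq, sum), brute_force(seq, sum))
print(dp_probability(seq, max), brute_force(seq, max))
```

The dynamic program does work proportional to (sequence length) × (number of states)², whereas the enumeration grows exponentially with the sequence length.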

17 Fa07CSE182-L8 Summary To search for remote homologs (family members) of a protein sequence, we need to have alignments of “disparate protein sequences”. HMMs allow us to model position specific gap penalties, and allow for automated training to get a good alignment. Patterns/Profiles/HMMs allow us to represent families and focus on key residues. Each has its advantages and disadvantages, and needs special algorithms to query efficiently.

18 Fa07CSE182-L8 Protein Domain databases A number of databases capture proteins (domains) using various representations. Each domain is also associated with structure/function information, parsed from the literature. Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function.

19 Fa07CSE182-L8 Sequence databases We talked about DNA and protein sequence databases. Next, we will talk about the connection between DNA and proteins –Proteins are the cellular machinery –DNA is the inherited material, and must contain the code for generating proteins –Solution of the genetic code was one of the great intellectual achievements of the last century

20 Fa07CSE182-L8 Gene Finding What is a Gene?

21 Fa07CSE182-L8 Gene We define a gene as a location on the genome that codes for proteins. The genic information is used to manufacture proteins through transcription, and translation. There is a unique mapping from triplets to amino-acids

22 Fa07CSE182-L8 Eukaryotic gene structure

23 Fa07CSE182-L8 Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon.

24 Fa07CSE182-L8 Translation The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino-acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon. Given a DNA sequence, how many ways can you translate it?
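One common answer to the question on this slide is six: three reading frames on the given strand and three on its reverse complement. The sketch below assumes the standard genetic code and treats the input as the coding-sense DNA (T instead of U); the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete triplets from the start of dna; stops are shown as '*'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame label ('+1'..'+3', '-1'..'-3') to peptide."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, peptide in frames.items():
    print(label, peptide)
```

For this toy input only frame +1 starts with M and ends at a stop codon, which is the kind of evidence an open-reading-frame scan looks for.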

25 Fa07CSE182-L8 Gene Features [Figure labels: ATG, 5’ UTR, intron, exon, 3’ UTR, acceptor and donor splice sites, transcription start, translation start]

26 Fa07CSE182-L8 End of L8

27 Fa07CSE182-L8 Gene identification Eukaryotic gene definitions: –Location that codes for a protein –The transcript sequence(s) that encodes the protein –The protein sequence(s) Suppose you want to know all of the genes in an organism. This was a major problem in the 70s. PhDs, and careers were spent isolating a single gene sequence. All of that changed with better reagents and the development of high throughput methods like EST sequencing

28 Fa07CSE182-L8 Computational Gene Finding Given genomic DNA, identify all the coordinates of the gene. TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode) [Figure labels: ATG, 5’ UTR, intron, exon, 3’ UTR, acceptor and donor splice sites, transcription start, translation start]

29 Fa07CSE182-L8 Gene Finding: The 1st generation Given genomic DNA, does it contain a gene (or not)? Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions. Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

30 Fa07CSE182-L8 Coding versus non-coding You are given a collection of exons, and a collection of intergenic sequence. Count the number of occurrences of ATGATG in Intergenic and Exons. –Suppose 1% of the hexamers in Exons are ATGATG –Only 0.01% of the hexamers in Intergenic are ATGATG How can you use this idea to find genes?

31 Fa07CSE182-L8 Generalizing Compute a frequency count for all hexamers (AAAAAA, AAAAAC, AAAAAG, AAAAAT, ...). Exons (E), Intergenic (I) and the sequence X are all vectors in a multi-dimensional space. Use this to decide whether a sequence X is exonic/intergenic. [Figure: table of hexamer frequencies ($\times 10^{-5}$) for I, E and X]

32 Fa07CSE182-L8 A geometric approach (2 hexamers) Plot the following vectors – E = [10, 20] – I = [10, 5] – $V_3$ = [6, 10] – $V_4$ = [9, 15] Is $V_3$ more like E or more like I? [Figure: E, I and $V_3$ plotted on axes with ticks at 5, 10, 15, 20]

33 Fa07CSE182-L8 Choosing between Intergenic and Exonic $V' = V/\|V\|$. All normalized vectors have the same length (they lie on the unit circle). Next, compute the angle to E, and to I. Choose the class that is ‘closer’ (smaller angle).
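The angle comparison can be sketched with the four vectors from the previous slide. The function names `angle_between` and `classify` are illustrative; normalizing both vectors and taking the arccosine of their dot product gives the angle between them.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


E = [10, 20]  # exonic frequencies for the two hexamers
I = [10, 5]   # intergenic frequencies for the two hexamers


def classify(x):
    """Return 'E' or 'I', whichever reference vector makes the smaller angle with x."""
    return "E" if angle_between(x, E) <= angle_between(x, I) else "I"


for name, vec in (("V3", [6, 10]), ("V4", [9, 15])):
    print(name, round(angle_between(vec, E), 1), round(angle_between(vec, I), 1),
          classify(vec))
```

$V_3$ and $V_4$ point in the same direction (both are multiples of [3, 5]), so they receive the same label; only the direction of the frequency vector matters, not its length.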

34 Fa07CSE182-L8 Coding versus non-coding signals Fickett and Tung (1992) compared various measures Measures that preserve the triplet frame are the most successful. Genscan uses a 5th order Markov Model

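35 Fa07CSE182-L8 5th order Markov chain $\Pr_{\text{EXON}}[\text{AAAAAACGAGAC}\ldots] = T[\text{AAAAA},\text{A}]\; T[\text{AAAAA},\text{C}]\; T[\text{AAAAC},\text{G}]\; T[\text{AAACG},\text{A}] \cdots = (20/78)\,(50/78) \cdots$ Hexamer counts from the figure (first column is EXON; the label of the second column and its last entry are not preserved): AAAAAA 20, 1; AAAAAC 50, 10; AAAAAG 5, 30; AAAAAT 3, .. [Figure: state AAAAA with outgoing edges labeled A, C, G to states AAAAA, AAAAC, AAAAG]

36 Fa07CSE182-L8 Scoring for coding regions The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. an intron. In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions.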
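The transition probabilities on this slide are hexamer counts normalized within each 5-mer context, which is where 20/78 and 50/78 come from (20 + 50 + 5 + 3 = 78). The sketch below uses only the four EXON counts that survive from the slide; the function names are illustrative, and treating unseen hexamers as probability 0 is a simplifying assumption (a real model would smooth the counts).

```python
from collections import defaultdict

# EXON hexamer counts for the context AAAAA, as given on the slide.
EXON_COUNTS = {"AAAAAA": 20, "AAAAAC": 50, "AAAAAG": 5, "AAAAAT": 3}


def transition_probs(hexamer_counts):
    """T[context, next] = count(context + next) / total count for that context."""
    totals = defaultdict(float)
    for hexamer, count in hexamer_counts.items():
        totals[hexamer[:5]] += count
    return {h: c / totals[h[:5]] for h, c in hexamer_counts.items()}


def chain_probability(seq, trans):
    """Product of T[previous 5 bases, base] over positions 6..n of seq.

    The probability of the first five bases is left out, as on the slide.
    """
    prob = 1.0
    for i in range(5, len(seq)):
        prob *= trans.get(seq[i - 5:i + 1], 0.0)  # assumption: unseen hexamer -> 0
    return prob


T_EXON = transition_probs(EXON_COUNTS)
print(T_EXON["AAAAAA"], T_EXON["AAAAAC"])
print(chain_probability("AAAAAAC", T_EXON))
```

Only the first two factors of the slide's product can be evaluated here, because the counts for the contexts AAAAC and AAACG are not given.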
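Written out for a sequence $X = X_1 \ldots X_n$, the log odds might look as follows. This is a sketch under two assumptions: the symbols $T_{\text{EXON}}$ and $T_{\text{INTRON}}$ stand for 5th order transition matrices trained on exons and introns, and the contribution of the first five bases is ignored. It is not Genscan's exact scoring function, which additionally uses one matrix per frame.

```latex
\[
\text{score}(X)
  = \log \frac{\Pr_{\text{EXON}}[X]}{\Pr_{\text{INTRON}}[X]}
  = \sum_{i=6}^{n} \log
    \frac{T_{\text{EXON}}[X_{i-5} \ldots X_{i-1},\, X_i]}
         {T_{\text{INTRON}}[X_{i-5} \ldots X_{i-1},\, X_i]}
\]
```

A positive score means the exon model explains the sequence better than the intron model, and because the score is a sum over positions it can be accumulated over a sliding window.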

37 Fa07CSE182-L8 Coding differential for 380 genes

38 Fa07CSE182-L8 Coding region can be detected Plot the coding score using a sliding window of fixed length. The (large) exons will show up reliably. Not enough to predict gene boundaries reliably.

39 Fa07CSE182-L8 Other Signals [Figure labels: ATG, GT, AG, Coding] Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise. When combined they can be effective.

40 Fa07CSE182-L8 Combining Signals We can compute the following: –E-score[i,j] –I-score[i,j] –D-score[i] –A-score[i] –Goal is to find coordinates that maximize the total score

41 Fa07CSE182-L8 The second generation of Gene finding Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure. It was not easy to train on many parameters. The Burset & Guigo test revealed that accuracy was still very low. Problem with multiple genes in a genomic region.

42 Fa07CSE182-L8 Combining signals using D.P. An HMM is the best way to model and optimize the combination of signals Here, we will use a simpler approach which is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

43 Fa07CSE182-L8 Hidden states & gene structure Identifying a gene is equivalent to labeling each nucleotide as E/I/intergenic etc. These ‘labels’ are the hidden states. For simplicity, consider only two states E and I. Example labeling: IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII, with boundaries marked $i_1$, $i_2$, $i_3$, $i_4$

44 Fa07CSE182-L8 Gene finding reformulated Given a labeling L, we can score it as I-score[0..$i_1$] + E-score[$i_1$..$i_2$] + D-score[$i_2$+1] + I-score[$i_2$+1..$i_3$-1] + A-score[$i_3$-1] + E-score[$i_3$..$i_4$] + ……. Goal is to compute a labeling with maximum score. Example labeling: IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEE IIIII, with boundaries marked $i_1$, $i_2$, $i_3$, $i_4$

45 Fa07CSE182-L8 Optimum labeling using D.P. (Viterbi) Define $V_E(i)$ = Best score of a labeling of the prefix 1..i such that the i-th position is labeled E. Define $V_I(i)$ = Best score of a labeling of the prefix 1..i such that the i-th position is labeled I. Why is it enough to compute $V_E(i)$ & $V_I(i)$?

46 Fa07CSE182-L8 Optimum parse of the gene [Figure: recurrence over positions j and i; details not preserved in the transcript]

47 Fa07CSE182-L8 Generalizing Note that we deal with two states, and consider all paths that move between the two states. [Figure: states E and I at position i]

48 Fa07CSE182-L8 Generalizing We did not deal with the boundary cases in the recurrence. Instead of labeling with two states, we can label with multiple states, –$E_{init}$, $E_{fin}$, $E_{mid}$, –$I$, $I_G$ (intergenic) [Figure: state diagram over $E_{init}$, $I$, $E_{fin}$, $E_{mid}$, $I_G$] Note: not all links are shown here
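The two tables $V_E$ and $V_I$ can be filled left to right because the best labeling of a prefix depends only on the label of its last position. The sketch below is a deliberately simplified variant, not the recurrence from the slides: it assumes per-position scores `e_score[i]` and `i_score[i]` instead of segment scores, and a single hypothetical `switch_penalty` in place of the donor and acceptor scores.

```python
def best_labeling(e_score, i_score, switch_penalty):
    """Return (labels, score) maximizing per-position scores minus switch penalties."""
    n = len(e_score)
    v_e, v_i = [e_score[0]], [i_score[0]]  # V_E(i), V_I(i)
    back_e, back_i = [None], [None]        # label at i-1 on the best path into E/I at i
    for pos in range(1, n):
        stay_e, from_i = v_e[-1], v_i[-1] - switch_penalty
        stay_i, from_e = v_i[-1], v_e[-1] - switch_penalty
        back_e.append("E" if stay_e >= from_i else "I")
        back_i.append("I" if stay_i >= from_e else "E")
        v_e.append(e_score[pos] + max(stay_e, from_i))
        v_i.append(i_score[pos] + max(stay_i, from_e))
    # Backtrack from the better of the two final states.
    state = "E" if v_e[-1] >= v_i[-1] else "I"
    labels = [state]
    for pos in range(n - 1, 0, -1):
        state = back_e[pos] if state == "E" else back_i[pos]
        labels.append(state)
    return "".join(reversed(labels)), max(v_e[-1], v_i[-1])


# Toy scores: positions 2..4 look exonic, the rest look intronic.
e_score = [-1, -1, 2, 2, 2, -1, -1]
i_score = [1, 1, -2, -2, -2, 1, 1]
labels, score = best_labeling(e_score, i_score, switch_penalty=1)
print(labels, score)
```

Raising `switch_penalty` makes short exons less likely to be called, which loosely mimics requiring good splice-site evidence at each boundary.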

49 Fa07CSE182-L8 An HMM for Gene structure

50 Fa07CSE182-L8 Generalized HMMs, and other refinements A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described. In standard HMMs, the time spent in a state follows a geometric distribution (the discrete analogue of the exponential). This is violated by many states of the gene structure HMM. Solution is to model these using generalized HMMs.
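The duration claim follows from the self-transition structure of a standard HMM. As a short worked derivation, let $p$ denote the probability of staying in a state at each step (the symbol $p$ is introduced here for illustration). Staying for exactly $d$ steps requires $d-1$ self-transitions followed by one exit:

```latex
\[
\Pr[\text{duration} = d] = p^{\,d-1}\,(1 - p), \qquad d = 1, 2, \ldots,
\qquad
\mathbb{E}[\text{duration}] = \frac{1}{1 - p}.
\]
```

For the coin HMM with $p = 0.7$ this gives an expected run of about 3.3 tosses, and the most probable duration is always 1. A length distribution whose mode is well away from 1 cannot be matched by any choice of $p$, which is the motivation for explicit duration models.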

51 Fa07CSE182-L8 Length distributions of Introns & Exons

52 Fa07CSE182-L8 Generalized HMM for gene finding Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

53 Fa07CSE182-L8 Forward algorithm for gene finding Emission Prob.: Probability that you emitted $X_i \ldots X_j$ in state $q_k$ (given by the 5th order Markov model). Forward Prob.: Probability that you emitted $i$ symbols and ended up in state $q_k$. Duration Prob.: Probability that you stayed in state $q_k$ for $j-i+1$ steps. [Figure: segment from position i to position j in state $q_k$]
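The three quantities on this slide combine into a recurrence of the following general shape. This is a sketch of the usual generalized-HMM forward recurrence, not a transcription of the slide's lost figure. The symbols are assumptions: $F_k(j)$ is the probability of emitting $X_1 \ldots X_j$ with a segment in state $q_k$ ending at $j$, $d_k$ is the duration distribution of $q_k$, $e_k$ is its segment emission probability, and the previous segment is taken to end at position $i-1$.

```latex
\[
F_k(j) \;=\; \sum_{i \le j} \; \sum_{l}
  F_l(i-1)\; T[q_l, q_k]\; d_k(j-i+1)\; e_k(X_i \ldots X_j)
\]
```

Compared with the standard forward algorithm, the extra sum over segment start positions $i$ is what makes generalized HMMs more expensive to evaluate.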

54 Fa07CSE182-L8 HMMs and Gene finding Generalized HMMs are an attractive model for computational gene finding –Allow incorporation of various signals –Quality of gene finding depends upon quality of signals.

55 Fa07CSE182-L8 DNA Signals Coding versus non-coding Splice Signals Translation start

56 Fa07CSE182-L8 Splice signals GT is a Donor signal, and AG is the acceptor signal.

57 Fa07CSE182-L8 PWMs Fixed length for the splice signal. Each position is generated independently according to a distribution. Figure shows data from > 1200 donor sites. Positions: 3 2 1 (exon side), 1 2 3 4 5 6 (intron side). Example sites: AAGGTGAGT, CCGGTAAGT, GAGGTGAGG, TAGGTAAGG

58 Fa07CSE182-L8 MDD (Maximal Dependence Decomposition) PWMs do not capture correlations between positions. Many position pairs in the Donor signal are correlated.
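A position weight matrix can be sketched from the four 9-mers that appear in this slide's figure text. Four sites are far too few for a real model (the slide's figure is based on more than 1200), so this is only an illustration. The pseudocount of 1 and the uniform background of 0.25 are assumptions, and the function names are illustrative.

```python
import math

# The four example donor sites recoverable from the slide's figure text.
SITES = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]


def build_pwm(sites, pseudocount=1.0):
    """One probability distribution over A, C, G, T per position (column)."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: c / total for base, c in counts.items()})
    return pwm


def log_odds(seq, pwm, background=0.25):
    """Sum of per-position log2 ratios; positions are treated as independent."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))


PWM = build_pwm(SITES)
print(round(log_odds("AAGGTGAGT", PWM), 2))  # a site from the training set
print(round(log_odds("AAGCCGAGT", PWM), 2))  # same site with the GT replaced by CC
```

Because each column is scored independently, a PWM cannot reward or penalize particular combinations of bases at different positions.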

59 Fa07CSE182-L8 Choose the position which has the highest correlation score. Split sequences into two: those which have the consensus at position I, and the remaining. Recurse until

60 Fa07CSE182-L8 MDD for Donor sites

61 Fa07CSE182-L8 De novo Gene prediction: Summary Various signals distinguish coding regions from non-coding. HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals. Further improvement may come from improved signal detection.

62 Fa07CSE182-L8 How many genes do we have? Nature Science

63 Fa07CSE182-L8 Alternative splicing

64 Fa07CSE182-L8 Comparative methods Gene prediction is harder with alternative splicing. One approach might be to use comparative methods to detect genes. Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence? Yes, with a variant on alignment algorithms that penalizes separately for introns, versus other gaps.

65 Fa07CSE182-L8 Comparative gene finding tools Procrustes/Sim4: mRNA vs. genomic Genewise: proteins versus genomic CEM: genomic versus genomic Twinscan: Combines comparative and de novo approach.

66 Fa07CSE182-L8 Databases RefSeq and other databases maintain sequences of full-length transcripts. We can query using sequence.

