Presentation is loading. Please wait.

Presentation is loading. Please wait.

Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov.

Similar presentations


Presentation on theme: "Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov."— Presentation transcript:

1 Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

2 Reading GENSCAN EasyGene SLAM Twinscan Optional: Chris Burge’s Thesis

3 Gene expression DNA RNA Protein PEPTIDE transcription translation
CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA But first some quick genetics to make sure that we all are on the same page. The genes are expressed by the DNA in our chromosomes being transcribed into RNA. Basically an enzyme attaches to the DNA molecule and copies it, creating a RNA molecule. Then the RNA is translated in triplets into amin acids, creating a peptide, which in its finished form becomes a protein. The protein is the end product of the gene, and thus the expression of the genes. PEPTIDE

4 Gene structure transcription splicing translation exon = coding
intron1 intron2 exon1 exon2 exon3 transcription splicing The genes themselves are structured in coding bits, that is the stuff that becomes amino acids, called exons, and non-coding stretches of sequence in between, called introns. When the gene is transcribed the whole thing becomes an RNA molecule, including the garbage in between the exons, and then these introns are cut out in a process called splicing. The resulting bits are glued together and translated into a protein. translation exon = coding intron = non-coding

5 Finding genes Exon 1 Exon 2 Exon 3 Intron 1 Intron 2 Splice sites
5’ 3’ Exon 1 Exon 2 Exon 3 Intron 1 Intron 2 Start codon ATG Stop codon TAG/TGA/TAA The problem of predicting genes means to give coordinates for the exon boundaries. There are several existing approaches to this. Splice sites

6

7

8 Approaches to gene finding
Homology BLAST, Procrustes. Ab initio Genscan, Genie, GeneID. Hybrids GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM. Annotating features of biological importance is relatively straightforward for organisms with compact genomes such as bacteria and yeast, because exons tend to be large and the introns small or non-existent. In large genomes as for mammals the coding signal is scattered in a vast sea of non-coding noise. The crudest, yet often most reliable approach has been to manually look for exons by integrating and filtering various sources of homology information. This is based on the fact that exons and regulatory regions tend to be more strongly conserved by evolution than random genomic sequences. The tool that is frequently used is called BLAST. Procrustes is a program that attempts to integrate BLAST based gene finding with ‘ab initio methods'. Ab inito approaches, such as the following directly use only information about the input sequence itself to identify likely splice sites and to detect differences in sequence composition between coding and non-coding DNA. One of the best of this sort is GENSCAN, which uses a Hidden Markov Model to scan large genomic sequences. Then there are various forms of hybrids. Here we mean softwares that takes extra information from a second sequence, or database, in some sense. GenomeScan and GenieEST, uses protein and EST databases respectively to confirm the potential exons. Twinscan and SGP both have a single species algorithm as base, take two sequences as input and use the second to boost the exon probability if the alignment is good. ROSETTA is the first cross-species gene finder, using a anchor based approach to determine homologous exon, and then a heuristic scoring scheme for the final prediction. CEM is similar to ROSETTA. TBLASTX is another variant of BLAST taking a nucleotide queary and a nucleotide database as input, but translating both the query and the database into peptides in all reading frames before searching for matches. SLAM uses a generalized pair HMM, that is a merging of the Genscan HMM and a pair HMM, usually used for alignments.

9 HMMs for single species gene finding: Generalized HMMs
Traditionally the problems of gene finding and alignment have been treated separately, but it’s becoming more and more apparent that they are closely connected. The gene finding can be helped by aligning the sequences, and the alignment can be helped by actually first locating the genes. Our approach is to combine the two, performing them simultaneously.

10 GHMM for gene finding Exon1 Exon2 Exon3
This is a HMM similar to the GENSCAN HMM This is the state space of the Generalized HMM used by for instance Genscan and Genie, performing single species gene finding. The hidden states are the exons in red and the introns ang intergene in green. The reason to why we have so many states is because the sequence is translated into protein in triplets, so there are three different ways to translate the same sequence, or three different reading frames. The end product has to be a sequence divisible by three, and if one exon ends in the middle of a codon, that codon has to be finished in the beginning of the next codon. Thus in each exon we would have to remember the number of extra bases in the previous exon, that is two states ago, which violates the Markov model assumption that the transition to the next state only depends on the current. If we instead have one intron for each reading frame, we so to speak bring the information with us, preserving the Markov property. duration Exon1 Exon2 Exon3 T A G C

11 Elements of the HMM Duration of states – length distributions of
Exons Introns Signals at state transitions ATG Stop Codon TAG/TGA/TAA Exon/Intron and Intron/Exon Splice Sites Emissions Coding potential and frame at exons Intron emissions

12 1. Durations of states Why the algorithm has to be generalized is because in a standard HMM the output in each step would be one base, leading to state durations of geometric length. If we look at empirical data, such as these plots, we see that the intron lengths seem to follow the geometric distribution fairly well, but for the exon that would be a pretty bad model. So in our state space the exons are generalized states, choosing a length from a general distribution and outputting the entire exon in one step, while the intron and intergene states still output one base at a time and follow the geometric distribution.

13 Better way to do it: negative binomial
EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A Negative binomial with n = 3

14 2. Splice Site Detection Donor: 7.9 bits Acceptor: 9.4 bits
(Stephens & Schneider, 1996) How much info? Find branch sites (

15 2. Splice Site Detection Donor site 5’ 3’ Position %

16 2. Splice Site Detection WMM: weight matrix model = PSSM (Staden 1984)
WAM: weight array model = 1st order Markov (Zhang & Marr 1993) MDD: maximal dependence decomposition (Burge & Karlin 1997) Decision-tree algorithm to take pairwise dependencies into account For each position I, calculate Si = ji2(Ci, Xj) Choose i* such that Si* is maximal and partition into two subsets, until No significant dependencies left, or Not enough sequences in subset Train separate WMM models for each subset Lots of negative results, correlations around splice sites G5 G5G-1 G5G-1 A2 G5G-1 A2U6 All donor splice sites not G5 G5 not G-1 G5G-1 not A2 G5G-1A2 not U6

17 atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

18 3. Coding potential Amino Acid SLC DNA codons
Isoleucine I ATT, ATC, ATA Leucine L CTT, CTC, CTA, CTG, TTA, TTG Valine V GTT, GTC, GTA, GTG Phenylalanine F TTT, TTC Methionine M ATG Cysteine C TGT, TGC Alanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCG Threonine T ACT, ACC, ACA, ACG Serine S TCT, TCC, TCA, TCG, AGT, AGC Tyrosine Y TAT, TAC Tryptophan W TGG Glutamine Q CAA, CAG Asparagine N AAT, AAC Histidine H CAT, CAC Glutamic acid E GAA, GAG Aspartic acid D GAT, GAC Lysine K AAA, AAG Arginine R CGT, CGC, CGA, CGG, AGA, AGG Stop codons Stop TAA, TAG, TGA

19 4. GENSCAN’s hidden weapon
C+G content is correlated with: Gene content (+) Mean exon length (+) Mean intron length (–) These quantities affect parameters of model Solution Train parameters of model in four different C+G content ranges!

20 Results of GENSCAN On the initial test dataset (Burset & Guigo)
80% exact exon detection 10% partial exons 10% wrong exons In general Most accurate single sequence-based gene predictor In practice it overpredicts human genes by ~2x

21 Comparison-based Methods

22 Comparison of 1196 orthologous genes (Makalowski et al., 1996)
Sequence identity between genes in human/mouse exons: 84.6% protein: 85.4% introns: 35% 5’ UTRs: 67% 3’ UTRs: 69% 27 proteins were 100% identical.

23 Human-mouse homology Human Mouse
So mouse in many cases seems to be a very good organism to compair with.

24 Here is a plot of a homologous region in 6 different organisms versus human. This is a variant of a percent identity plot produced by a program called VISTA. So this gives us a quick overview of how similar the sequences are. The first one is macaque, a monkey, the second is pig, then rabbit, mouse, rat and chicken. The blue region shows the exons and the red conserved non-coding sequence. We see how the same exons can be identified in all of the top five, but not in chicken in this case. So to use monkey would not give much extra help, since most of the sequence is conserved anyway. Pig is much better, but is still perhaps too similar, and then it gets better with rabbit and mouse, while chicken would not help at all either.

25 Not always: HoxA human-mouse
While for generalized HMM methods for single species gene finding there is a problem of overprediction in general, that we bring down by doing the comparative approach, we run into a new problem. Here's a visualization of the alignment of the HoxA region in human versus mouse. The region contains 11 genes with two exons each, these are colored blue. The red bits in between are just highly similar regions. So this region is very special in the sense that the non-coding region is almost as well conserved as the coding. Even though the probability of a potential exon really being exon includes several elements beside the alignment, the alignment turns out to be the most important ingredient. A region like this therefor becomes very difficult for a comparative gene finder.

26 Alignment : : : : : 247 GGTGAGGTCGAGGACCCTGCA CGGAGCTGTATGGAGGGCA AGAGC |: || ||||: |||| --:|| ||| |::| |||---|||| 368 GAGTCGGGGGAGGGGGCTGCTGTTGGCTCTGGACAGCTTGCATTGAGAGG : : : : : 292 TTC CTACAGAAAAGTCCCAGCAAGGAGCCACACTTCACTG ||| || | |::| |: ||||::|:||:-|| ||:| | 418 TTCTGGCTACGCTCTCCCTTAGGGACTGAGCAGAGGGCT CAGGTCGCGG : : : : : ATGTCGAGGGGAAGACATCATTCGGGATGTCAGTG ||||||||||||||||||||||:|||||||||||| 467 TGGGAGATGAGGCCAATGTCGAGGGGAAGACATCATTTGGGATGTCAGTG : : : : : 367 TTCAACCTCAGCAATGCCATCATGGGCAGCGGCATCCTGGGACTCGCCTA |||||:||||||||:||||||||||||||:|| ||:|||||:|||||||| 517 TTCAATCTCAGCAACGCCATCATGGGCAGTGGAATTCTGGGGCTCGCCTA

27 Twinscan Twinscan is an augmented version of the Gencscan HMM. I E
transitions duration emissions ACUAUACAGACAUAUAUCAU

28  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
Twinscan Algorithm Align the two sequences (eg. from human and mouse) Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

29 Twinscan Algorithm Run Viterbi using emissions ek(b)
where b  { A-, A:, A|, …, T| } Note: Emission distributions ek(b) estimated from real genes from human/mouse eI(x|) < eE(x|): matches favored in exons eI(x-) > eE(x-): gaps (and mismatches) favored in introns

30 Example Human: ACGGCGACUGUGCACGU Mouse: ACUGUGAC GUGCACUU
Alignment: ||:|:|||-||||||:| Input to Twinscan HMM: A| C| G: G| C: G| A| C| U- G| U| G| C| A| C| G: U| Recall, eE(A|) > eI(A|) eE(A-) < eI(A-) Likely exon

31 HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs
Traditionally the problems of gene finding and alignment have been treated separately, but it’s becoming more and more apparent that they are closely connected. The gene finding can be helped by aligning the sequences, and the alignment can be helped by actually first locating the genes. Our approach is to combine the two, performing them simultaneously.

32 A Pair HMM for alignments
BEGIN M P(xi, yj) I P(xi) J P(yj) 1 - 2 1-  - 2 M I J END

33 Cross-species gene finding
Exon1 Exon2 Exon3 Intron1 Intron2 [human] 5’ 3’ CNS [mouse] Exon = coding Intron = non-coding CNS = conserved non-coding

34 Generalized Pair HMMs This is the state space, or half the state space, for SLAM. As you can see it looks very similar to the single organism case. In fact it is identical. But the output of the states are now pairs of sequences, and moreover an _aligned_ pair of sequences, rather than just one sequence. There are a number of complications with this approach.

35 Ingredients in exon scores
Splice site detection (VLMM) Length distribution (generalized) Coding potential (codon freq. tables) GC-stratification

36 Exon GPHMM 1.Choose exon lengths (d,e).
2.Generate alignment of length d+e. e One problem with the exon PHMM is that since the state is generalized we first choose exon lengths (d,e) say, and then we run the PHMM to generate the sequence. But we have no way of knowing when the PHMM has generated exactly d bases in the first sequence and exactly e bases in the second. We can only stop the PHMM after producing an aligment of a certain length. To get around this and get the probability of an aligned exon pair we need to normalize by the probability that the PHMM produced the exon lengths that we want. In practice this is done by running two forward algorithms in parallell, one for the nominator and one for the denominator. d

37 The SLAM hidden Markov model
Therefor we had to adjust or model a bit. I told you before that the observed sequence identity in a study of 1196 human and mouse genes was 36%. We found that in fact most of the intron and intergenic regions were even lower in identity, with stretches of highly similar, non-coding sequence, conserved non-coding sequence, or CNS. Thus we changed our intron and intergene model from being a PHMM, to model the two sequences completely independent, one at a time, interspersed by highly conserved stretches, CNSs, modelled with a regular PHMM. Thus our model now consists of a PHMM on the protein level in the coding regions, and a PHMM on the DNA level in the CNSs. We found this to bring down a whole lot of overprediction, and our program is now a powerful tool to distinguish between conserved coding and conserved non-coding. Something that is a completely novel feature in existing gene finding tools.

38 Example: HoxA2 and HoxA3 SLAM SGP-2 Twinscan Genscan TBLASTX SLAM CNS
Here is a bp region containing the two genes HoxA2 and HoxA3 from the region before. The green track is the RefSeq annotaion, the so called "true" answer, and the other ones are the results of various softwares. As I said before this is a very difficult region, since all programs but Genscan here tries to locate genes by using the conservation. Twinscan is a combination of Genscan and TBLASTX, identifying potential genes according to the Genscan method, and boosting their probability if there is additional TBLASTX support. We see here that overpredictions in Genscan supported by TBLASTX lead to overpredictions also in Twinscan, but that also exons disregarded by Genscan still can get found if the conservation is good. SLAM has as an additional feature the ability to distinguish between conservation that is coding and conservation that is not coding, and we see in regions like these this is very useful to bring down the level of overprediction. SLAM SGP-2 Twinscan Genscan TBLASTX SLAM CNS VISTA RefSeq

39 Computational complexity
Another problem is the complexity of the model. Here’s the time the number of operations needed to run for instance the Viterbi algorithm on the problem. We see that for the HMM and the GHMM the problem is still decent, it grows linearly with the input sequence. But as soon as we introduce another sequence we get something quadratic. As an example, N is currently 14 in SLAM, we would at least be able to handle BAC sized sequences (and even that seems a bit naive these days) such that T and U is about 200,000 bp. D is the maximum length of an exon here, and we’re currently using D=3000, which is still a limitation since there exists quite a few exons longer than that. Anyway, if we assume that a fairly decent computer takes 10e-08 seconds to perform one operation I can leave you as an exercise to calculate how many years it would take to solve the full GPHMM problem... (it’s along the order of 10^11 years...) So, we have to pre-process our data in some way. no. states length seq1 max duration length seq2

40 Approximate alignment
Reduces TU -factor to hT This is an example of a fairly simply limitation of the search space, also know as a banded Smith-Waterman alignment. Since most sequence combination in the search space will have zero probability, and the true solution is likely to be found along the diagonal, one can limit the search space for each base pair in the first sequence to a subinterval, or a window of length h, in the second. This makes the problem grow linearly in the length of one of the input sequences.

41 Measuring Performance
Definition: Sensitivity (SN): (# correctly predicted)/(# true) Specificity (SP): (# correctly predicted)/(# total predicted)

42 Measuring Performance


Download ppt "Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov."

Similar presentations


Ads by Google