Eukaryotic Gene Finding

1 Eukaryotic Gene Finding
Adapted in part from

2 Prokaryotic vs. Eukaryotic Genes
Prokaryotes: small genomes; high gene density; no introns (or splicing); no RNA processing; similar promoters; overlapping genes
Eukaryotes: large genomes; low gene density; introns (splicing); RNA processing; heterogeneous promoters; polyadenylation


4 Pre-mRNA Splicing
[Figure: assembly of the spliceosome on the pre-mRNA, followed by catalysis. Labels: U1 snRNP, U2 snRNP, U2AF65, SR proteins, 5’ splice signal, 3’ splice signal, polyY tract, branch signal, exonic enhancers, exonic repressor, intronic enhancers, intronic repressor, intron definition, exon definition.]


6 Some Statistics
On average, a vertebrate gene is about 30 kb long
The coding region takes up about 1 kb
Exon sizes vary from a few tens of bases to kilobases
An average 5’ UTR is about 750 bp
An average 3’ UTR is about 450 bp, but both UTRs can be much longer

7 Human Splice Signal Motifs



10 Semi-Markov HMM Model

11 Genscan HSMM

12 GenScan States
N: intergenic region
P: promoter
F: 5’ untranslated region
Esngl: single exon (intronless) (translation start -> stop codon)
Einit: initial exon (translation start -> donor splice site)
Ek: phase k internal exon (acceptor splice site -> donor splice site)
Eterm: terminal exon (acceptor splice site -> stop codon)
Ik: phase k intron (0: between codons; 1: after the first base of a codon; 2: after the second base of a codon)
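
The intron phase definition above can be illustrated with a short Python sketch. It assumes the input is the list of coding lengths of consecutive exons, counted from the first base of the start codon; the function name `intron_phases` is hypothetical and is not part of GenScan.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of the intron that follows each exon except the last.

    Phase 0: the intron falls between two codons.
    Phase 1: the intron falls after the first base of a codon.
    Phase 2: the intron falls after the second base of a codon.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        # The remainder modulo 3 is the number of bases of an
        # unfinished codon left at the end of this exon.
        phases.append(coding_bases_so_far % 3)
    return phases


# Toy gene with four coding exons (lengths are made up for illustration).
print(intron_phases([99, 52, 82, 67]))  # [0, 1, 2]
```

An internal exon that follows a phase k intron must start by completing the interrupted codon, which is why the model tracks the phase in its Ek and Ik states.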

13 GenScan features
Model both strands at once
Each state may output a string of symbols (according to some probability distribution)
Explicit intron/exon length modeling
Advanced splice site modeling
Parameters learned from annotated genes
Separate parameter training for different CpG content groups
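
To see why explicit length modeling matters, the following Python sketch contrasts the length distribution implied by a plain HMM state with a self-loop (always geometric) with an explicit distribution estimated from observed lengths. The function names and the toy lengths are assumptions for illustration, not GenScan's actual parameters.

```python
from collections import Counter


def geometric_length_prob(d, p_stay):
    """P(length = d) for an ordinary HMM state with self-loop probability p_stay."""
    return (1 - p_stay) * p_stay ** (d - 1)


def empirical_length_dist(lengths):
    """Explicit length distribution estimated from a list of observed lengths."""
    counts = Counter(lengths)
    total = len(lengths)
    return {d: c / total for d, c in counts.items()}


# In the geometric case the most likely length is always 1, whatever p_stay is.
print(geometric_length_prob(1, 0.99) > geometric_length_prob(120, 0.99))  # True

# An explicit distribution can peak anywhere (made-up exon lengths).
print(empirical_length_dist([120, 120, 150, 90]))
```

A semi-Markov state draws a length from a distribution of the second kind and then emits a string of that length, which is what lets the model favor realistic exon sizes.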


15 GenScan Signal Modeling
PSSM (position-specific scoring matrix): P(S) = P1(S1)•P2(S2)•…•Pn(Sn)
Used for: PolyA signal; translation initiation/termination signal; promoters
WAM (weight array model): P(S) = P1(S1)•P2(S2|S1)•…•Pn(Sn|Sn-1)
Used for: 5’ and 3’ splice sites
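
The two formulas can be sketched in Python as follows. The probability tables are toy values invented for a two-base site, not trained GenScan parameters, and the function names are hypothetical.

```python
def pssm_prob(site, position_probs):
    """PSSM: each position is scored independently, P(S) = prod_i P_i(S_i)."""
    prob = 1.0
    for i, base in enumerate(site):
        prob *= position_probs[i][base]
    return prob


def wam_prob(site, first_probs, cond_probs):
    """WAM: each position depends on the previous base.

    first_probs[base] is P_1(base).
    cond_probs[i - 1][prev][base] is P_{i+1}(base | prev) for 0-based i >= 1.
    """
    prob = first_probs[site[0]]
    for i in range(1, len(site)):
        prob *= cond_probs[i - 1][site[i - 1]][site[i]]
    return prob


# Toy parameters for a two-base "GT"-like signal (illustrative only).
P1 = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
P2 = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}
UNIFORM = {b: 0.25 for b in "ACGT"}
COND = [{"A": UNIFORM, "C": UNIFORM, "T": UNIFORM,
         "G": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85}}]

print(pssm_prob("GT", [P1, P2]))
print(wam_prob("GT", P1, COND))
```

With these toy tables the PSSM gives about 0.49 and the WAM about 0.595 for "GT": the WAM can reward a T specifically when it follows a G, a dependency the PSSM cannot express.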

16 HMM-based Gene Finding
GENSCAN (Burge 1997)
FGENESH (Solovyev 1997)
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
VEIL (Henderson, Salzberg, & Fasman 1997)

17 GenomeScan
Idea: We can enhance gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan)
Focus on the ‘typical case’ in which homologous but not identical proteins are available.



20 GeneWise [Birney, Amitai]
Motivation: Use a good database of the protein world (PFAM) to help us annotate genomic DNA
The GeneWise algorithm aligns a profile HMM directly to the DNA

21 Sample GeneWise Output

22 Developing GeneWise Model
Start with a PFAM domain HMM
Replace AA emissions with codon emissions
Allow for sequencing errors (deletions/insertions)
Add a 3-state intron model
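
The step "replace AA emissions with codon emissions" can be sketched in Python as below. This is a minimal illustration that assumes each amino acid's probability is split evenly among its synonymous codons in the standard genetic code; GeneWise's actual parameterization may differ, and the name `codon_emissions` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_emissions(aa_probs):
    """Turn an amino-acid emission distribution into a codon emission distribution.

    Assumption: probability mass is divided uniformly over synonymous codons.
    """
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            synonyms.setdefault(aa, []).append(codon)
    emissions = {}
    for aa, p in aa_probs.items():
        for codon in synonyms[aa]:
            emissions[codon] = p / len(synonyms[aa])
    return emissions


# Toy match-state emission over three amino acids (made-up probabilities).
emit = codon_emissions({"M": 0.5, "L": 0.3, "W": 0.2})
print(emit["ATG"], len(emit))  # 0.5 8
```

After this substitution each match state consumes three DNA bases at a time, so the profile can be aligned to genomic sequence rather than to a protein.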

23 GeneWise Model

24 GeneWise Intron Model
[Figure: intron model with labels 5’ site, central spacer, PY tract, 3’ site.]

25 GeneWise Model
Viterbi algorithm -> “best” alignment of DNA to protein domain
Alignment gives exact exon-intron boundaries
Parameters learned from species-specific statistics
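
As a reminder of what the Viterbi algorithm computes, here is a minimal log-space sketch in Python on a toy two-state HMM. The states ("E" for a GC-rich exon-like state, "I" for an AT-rich intron-like state) and all probabilities are invented for illustration; this is not the GeneWise model, which has many more states.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for the observation sequence."""
    # best[s] = log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev = best
        best, pointers = {}, {}
        for s in states:
            r = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[r] + log_trans[r][s] + log_emit[s][symbol]
            pointers[s] = r
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


STATES = ["E", "I"]
LOG_START = logs({"E": 0.5, "I": 0.5})
LOG_TRANS = {"E": logs({"E": 0.9, "I": 0.1}), "I": logs({"E": 0.1, "I": 0.9})}
LOG_EMIT = {"E": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
            "I": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})}

print("".join(viterbi("GCGCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)))
# EEEEEEIIIIII
```

The point where the path switches state plays the role of a predicted boundary, which is the sense in which the best alignment yields exon-intron boundaries.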

26 GeneWise problems
Only provides partial prediction, and only where the homology lies
Does not find “more” genes
Pseudogenes and retrotransposons are picked up
CPU intensive
Solution: Pre-filter with BLAST

27 Summary
Genes are complex structures which are difficult to predict with the required level of accuracy/confidence
Different approaches to gene finding:
Ab initio: GenScan
Ab initio modified by BLAST homologies: GenomeScan
Homology guided: GeneWise
