4 Pre-mRNA Splicing
[Figure: signals and factors involved in pre-mRNA splicing (assembly of spliceosome, catalysis). Labels: 5’ splice signal, branch signal, polyY, 3’ splice signal; U1 snRNP, U2AF65, SR proteins; exonic enhancers, exonic repressor, intronic enhancers, intronic repressor; intron definition, exon definition.]
6 Some Statistics
- On average, a vertebrate gene is about 30 kb long.
- The coding region takes up about 1 kb.
- Exon sizes can vary from a few tens of bases to kilobases.
- An average 5’ UTR is about 750 bp.
- An average 3’ UTR is about 450 bp, but both can be much longer.
12 GenScan States
- N: intergenic region
- P: promoter
- F: 5’ untranslated region
- Esngl: single exon (intronless) (translation start -> stop codon)
- Einit: initial exon (translation start -> donor splice site)
- Ek: phase k internal exon (acceptor splice site -> donor splice site)
- Eterm: terminal exon (acceptor splice site -> stop codon)
- Ik: phase k intron, where k = 0 means between codons, k = 1 means after the first base of a codon, and k = 2 means after the second base of a codon
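The intron phases listed above can be illustrated with a small sketch. It assumes we already know the coding lengths of consecutive exons in one gene; the function name `intron_phases` is hypothetical and is not part of GenScan. The phase of the intron after each exon is then the running coding length modulo 3.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of the intron that follows each exon except the last.

    Phase 0: the intron falls between codons.
    Phase 1: the intron falls after the first base of a codon.
    Phase 2: the intron falls after the second base of a codon.
    """
    phases = []
    total = 0
    # The last exon ends at the stop codon, so no intron follows it.
    for length in coding_exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


# Toy gene with three coding exons (lengths are made up for illustration).
example_phases = intron_phases([100, 50, 30])
```

In this toy case the first intron interrupts a codon after its first base and the second intron falls cleanly between codons, which is the distinction the Ik states keep track of.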
13 GenScan features
- Models both strands at once
- Each state may output a string of symbols (according to some probability distribution)
- Explicit intron/exon length modeling
- Advanced splice site modeling
- Parameters learned from annotated genes
- Separate parameter training for different CpG content groups
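To make "each state may output a string of symbols" and "explicit length modeling" more concrete, here is a minimal sketch of how one state of a generalized (semi-Markov) HMM might score a segment. The function names, the toy length table, and the independent per-base content model are all assumptions for illustration; GenScan's actual content and length models are more elaborate. The first function shows the geometric length distribution that an ordinary HMM self-loop implies, which is what explicit length modeling replaces.

```python
import math


def geometric_length_prob(length, stay_prob):
    """Length distribution implied by a self-loop in an ordinary HMM."""
    return (stay_prob ** (length - 1)) * (1.0 - stay_prob)


def segment_log_prob(segment, length_probs, base_probs):
    """Log-probability that one generalized-HMM state emits `segment`.

    length_probs: dict mapping length -> probability (explicit length model)
    base_probs:   dict mapping base -> probability (toy content model,
                  bases treated as independent)
    """
    p_len = length_probs.get(len(segment), 0.0)
    if p_len == 0.0:
        # The state never emits a segment of this length.
        return float("-inf")
    log_p = math.log(p_len)
    for base in segment:
        log_p += math.log(base_probs[base])
    return log_p


# Toy parameters, made up for illustration only.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score = segment_log_prob("ACG", {3: 0.5, 6: 0.5}, uniform)
```

The design point is that the length table can take any shape (for example, a histogram learned from annotated exons), whereas the geometric form always decays monotonically with length.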
17 GenomeScan
- Idea: We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
- Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan)
- Focus on the ‘typical case’ when homologous but not identical proteins are available.
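The idea of combining extrinsic and intrinsic evidence can be sketched as a Bayesian update in odds form. This is not GenomeScan's published scoring scheme; the function name `combine_with_homology` and the two likelihood ratios are hypothetical values chosen only to show the direction of the effect: an overlapping BLAST hit raises the probability that a candidate region is a coding exon, and the absence of a hit lowers it slightly.

```python
def combine_with_homology(intrinsic_prob, overlaps_hit,
                          lr_hit=5.0, lr_no_hit=0.8):
    """Update an exon probability with a yes/no homology observation.

    intrinsic_prob: exon probability from the gene-structure model alone
    overlaps_hit:   True if a protein homology hit overlaps the region
    lr_hit:         assumed likelihood ratio P(hit | exon) / P(hit | not exon)
    lr_no_hit:      assumed likelihood ratio for observing no hit
    """
    if not 0.0 < intrinsic_prob < 1.0:
        raise ValueError("intrinsic_prob must be strictly between 0 and 1")
    odds = intrinsic_prob / (1.0 - intrinsic_prob)
    odds *= lr_hit if overlaps_hit else lr_no_hit
    return odds / (1.0 + odds)


with_hit = combine_with_homology(0.5, overlaps_hit=True)
without_hit = combine_with_homology(0.5, overlaps_hit=False)
```

Because the hit is only homologous and not identical in the typical case, a real system would also weight the evidence by hit quality; this sketch treats the hit as a single binary observation.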
24 GeneWise Intron Model
[Figure: GeneWise intron model. Labels: 5’ site, central, PY tract, spacer, 3’ site.]
25 GeneWise Model
- Viterbi algorithm -> “best” alignment of DNA to protein domain
- Alignment gives exact exon-intron boundaries
- Parameters learned from species-specific statistics
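The Viterbi step can be illustrated on a much simpler model than GeneWise's. The sketch below runs log-space Viterbi on a toy two-state HMM (a GC-rich "exon" state and an AT-rich "intron" state); the states and all probabilities are made up for illustration. GeneWise itself aligns a protein or protein-domain model to DNA with additional states, which this sketch does not attempt to reproduce.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_state_path) for a discrete HMM.

    Assumes `observations` is non-empty and all probabilities are non-zero.
    """
    # Scores for the first observation.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        back.append({})
        for s in states:
            # Best predecessor state for reaching s at position t.
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s]))
                 for p in states),
                key=lambda pair: pair[1],
            )
            scores[t][s] = best + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return scores[-1][last], path


# Toy parameters, not taken from any real gene finder.
STATES = ["exon", "intron"]
START_P = {"exon": 0.5, "intron": 0.5}
TRANS_P = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
EMIT_P = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

best_log_prob, best_path = viterbi("GGGGAAAA", STATES, START_P, TRANS_P, EMIT_P)
```

The point where the decoded path switches state plays the role of a predicted boundary, which is the sense in which the best alignment gives exon-intron boundaries.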
26 GeneWise problems
- Only provides partial prediction, and only where the homology lies
- Does not find “more” genes
- Pseudogenes and retrotransposons are picked up
- CPU intensive
- Solution: pre-filter with BLAST
27 Summary
- Genes are complex structures which are difficult to predict with the required level of accuracy/confidence
- Different approaches to gene finding:
  - Ab initio: GenScan
  - Ab initio modified by BLAST homologies: GenomeScan
  - Homology guided: GeneWise