“Gene Finding in Novel Genomes” by Ian Korf Presented by: Christine Lee SoCAL BSI 2004.


1 “Gene Finding in Novel Genomes” by Ian Korf Presented by: Christine Lee SoCAL BSI 2004

2 Outline
- Background and motivation
- Existing gene finder programs
- SNAP as an ab initio, high-performance gene finder
- Novel genome gene prediction
- The data
- Genome compositional differences
- Parameter estimation in novel genomes
- Conclusion

3 Background and motivation
- Rapid genome sequencing
- Key task: identifying the structure of protein-coding regions
- Ab initio gene prediction
- Gene finder
- Annotation of novel genomes
- SNAP and de novo species-specific parameter estimation

4 Existing gene finder programs
- Genscan: performs as well as recent gene finders designed for Arabidopsis
- HMMGene and Genefinder: well-established gene prediction programs for C. elegans
- Augustus: one of the latest, shown to outperform Genscan, GENIE, and GENEID in Drosophila

5
Genome  Gene finder   Nucleotide      Exon            Gene
                      SN      SP      SN      SP      SN      SP
At      SNAP          97.1    95.2    82.9    81.2    54.3    46.8
        Genscan       79.9    92.9    65.3    71.2    19.5    21.3
Ce      SNAP          97.6    94.2    85.5    79.3    46.0    32.5
        Genefinder    98.1    95.3    89.2    86.1    51.6    48.0
        Genscan       81.3    91.6    48.6    66.4    10.2     9.6
        HMMGene       84.1    97.0    58.9    71.7    20.9    19.6
Dm      SNAP          94.3    86.5    78.6    67.2    50.8    37.5
        Augustus      92.4    88.6    77.2    68.2    50.7    31.9
        Genscan       84.5    81.1    68.7    62.9    22.1    20.0
Os      SNAP          86.2    94.0    70.2    72.4    51.2    37.0
        Genscan       70.3    89.8    58.2    74.8    25.9    32.0

6 Data Set
Data set characteristics. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa.

Genome  Sequence  Genes  GC      Single-exon genes  Mean exon  Mean intron
At      1.89 Mb   631    37.3%   19.8%              230 bp     157 bp
Ce      3.02 Mb   626    36.1%    2.2%              220 bp     334 bp
Dm      3.66 Mb   602    43.6%   24.9%              394 bp     948 bp
Os      1.55 Mb   424    44.5%   22.9%              237 bp     350 bp
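The characteristics in the data set table (GC content, mean exon length, mean intron length) can be derived from a sequence and its exon coordinates. The following is a minimal sketch, not the code used for the slides: the function names, the 1-based inclusive coordinate convention, and the toy inputs are all assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons of one gene.

    Exons are assumed to be 1-based, inclusive (start, end) pairs.
    """
    exons = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def mean_length(intervals) -> float:
    """Mean length of 1-based, inclusive (start, end) intervals."""
    lengths = [end - start + 1 for start, end in intervals]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy inputs (hypothetical, not taken from the data set above).
toy_seq = "ATGCGCGTAAGT"
toy_exons = [(1, 100), (201, 350), (500, 620)]
toy_introns = introns_from_exons(toy_exons)

print(gc_content(toy_seq))
print(mean_length(toy_exons), mean_length(toy_introns))
```

A gene with a single exon yields an empty intron list, which is how the single-exon gene fraction could be counted under the same assumptions.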

7
                      At              Ce              Dm              Os
Parameters  Measure   SN      SP      SN      SP      SN      SP      SN      SP
At          Nuc       97.1    95.2    78.7    91.3    77.7    68.0    90.7    71.8
            Exon      82.9    81.2    44.3    52.8    38.6    24.0    57.1    42.3
            Gene      54.3    46.8    20.9    11.3    18.8     5.7    20.5     9.7
Ce          Nuc       83.5    91.5    97.6    94.2    81.3    73.6    79.7    74.5
            Exon      40.5    49.9    85.5    79.3    42.2    29.8    27.5    26.0
            Gene      25.7    18.1    46.0    32.5    21.9     8.8    13.9     7.3
Dm          Nuc       30.0    95.3    45.9    95.0    94.3    86.5    78.4    89.8
            Exon      16.5    41.3    29.9    47.2    78.6    67.2    50.0    58.4
            Gene       3.2     4.3     7.8     6.9    50.8    37.5    36.3    28.9
Os          Nuc       39.3    96.3    24.9    95.5    79.8    88.7    86.2    94.0
            Exon      30.7    47.6    11.1    36.6    47.4    44.4    70.2    72.4
            Gene       5.1     6.1     5.3     7.8    27.2    17.2    51.2    37.0

8 Codon Frequency
The frequency of each degenerate codon is indicated in a species-specific color (At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa). Codons are grouped by their parent amino acid.
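Codon frequencies grouped by parent amino acid, as plotted on this slide, can be tabulated from coding sequences. The sketch below is illustrative only: it assumes the standard genetic code, normalizes each codon's count within its amino acid (one possible reading of "frequency"), and uses a made-up coding sequence. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence, skipping ambiguous ones."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def relative_codon_usage(counts: Counter) -> dict:
    """Group codons by amino acid and normalize counts within each group."""
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        by_aa[CODON_TABLE[codon]][codon] = n
    return {
        aa: {codon: n / sum(group.values()) for codon, n in group.items()}
        for aa, group in by_aa.items()
    }


# Toy coding sequence (hypothetical): Met, Ala, Ala, Ala, stop.
usage = relative_codon_usage(codon_counts("ATGGCTGCTGCCTAA"))
print(usage["A"])
```

Running the same tabulation on coding sequences from each species would give one frequency per codon per species, which is the kind of comparison the slide's color coding conveys.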

9 Pictograms of splice sites and translation start
The height of each letter is proportional to its frequency. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa.
(a) splice acceptor site: canonical AG is at positions -2 and -1
(b) splice donor site: canonical GT is at +1 and +2
(c) translation start site: canonical ATG is at +1 to +3
(d) splice acceptor site consensus derived from gene predictions in A. thaliana with C. elegans
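A pictogram of this kind is drawn from per-position base frequencies in a set of aligned sites. The following sketch shows how such a frequency matrix could be computed; it is an illustration under stated assumptions, with a hypothetical function name and invented donor-site windows (two exonic bases followed by six intronic bases), not data from the slides.

```python
from collections import Counter


def position_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites.

    Returns one dict per alignment column, mapping A, C, G, T to the
    fraction of sites carrying that base at that position.
    """
    if not sites:
        raise ValueError("need at least one site")
    sites = [s.upper() for s in sites]
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return matrix


# Toy donor-site windows (hypothetical); the canonical GT sits at
# window indices 2 and 3, i.e. intron positions +1 and +2.
toy_donors = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGT"]
for offset, freqs in enumerate(position_frequencies(toy_donors)):
    print(offset, freqs)
```

In the plotted pictogram each letter's height at a position would be scaled by the corresponding frequency, so an invariant position such as the canonical G or T fills the full column height.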

10 Performance of foreign and bootstrapped parameters
The boldface values (same species for parameters and genomic DNA) are determined by 5-fold cross-validation within the same species. At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa. Sensitivity (NSN) and specificity (NSP) are reported at the nucleotide level. The bootstrapped values (bottom part of the table) are derived from parameter estimates based on gene predictions and no actual data. In these experiments, only inter-species gene parameters were used; dashes represent cells that would contain intra-species predictions.

                      Genomic DNA
                      At              Ce              Dm              Os
Parameters            NSN     NSP     NSN     NSP     NSN     NSP     NSN     NSP
Actual      At        97.6    94.3    81.0    90.1    75.3    63.6    90.7    68.5
            Ce        86.3    91.0    98.1    92.5    85.1    72.4    79.8    73.0
            Dm        26.0    96.0    38.6    96.0    93.8    87.0    76.1    89.8
            Os        36.0    96.9    21.7    96.1    78.5    88.7    85.1    94.2
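Nucleotide-level sensitivity and specificity can be computed by comparing the set of annotated coding positions with the set of predicted coding positions. The sketch below assumes the convention common in gene-finding evaluations, SN = TP / (TP + FN) and SP = TP / (TP + FP); the slides do not spell out the formulas, so treat that as an assumption. Function names and the toy coordinates are hypothetical.

```python
def positions(intervals):
    """Expand 1-based, inclusive (start, end) intervals into a set of positions."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def nucleotide_accuracy(annotated, predicted):
    """Return (SN, SP) at the nucleotide level.

    SN = TP / (TP + FN): fraction of annotated coding bases that were predicted.
    SP = TP / (TP + FP): fraction of predicted coding bases that are annotated
    (assumed convention; this is precision in other fields).
    """
    true_set = positions(annotated)
    pred_set = positions(predicted)
    tp = len(true_set & pred_set)
    fn = len(true_set - pred_set)
    fp = len(pred_set - true_set)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Toy example (hypothetical): the prediction covers only half of the
# annotated exon but adds no extra coding bases.
sn, sp = nucleotide_accuracy(annotated=[(1, 100)], predicted=[(1, 50)])
print(f"NSN = {sn:.1%}, NSP = {sp:.1%}")
```

Under this convention a conservative predictor shows low sensitivity with high specificity, the pattern seen in the table where Dm or Os parameters are applied to At and Ce genomic DNA.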

11
                      Genomic DNA
                      At              Ce              Dm              Os
Parameters            NSN     NSP     NSN     NSP     NSN     NSP     NSN     NSP
Bootstrapped
  At                  -       -       95.8    88.2    94.7    76.0    92.1    76.0
  Ce                  96.6    93.2    -       -       95.7    80.3    91.0    79.2
  Dm                  75.2    95.5    90.0    94.9    -       -       74.1    88.7
  Os                  85.6    95.8    76.5    94.3    92.5    86.6    -       -
  At Ce               -       -       -       -       95.7    78.3    92.8    77.8
  At Dm               -       -       96.7    91.1    -       -       85.4    80.9
  At Os               -       -       95.5    90.3    94.0    81.2    -       -
  Ce Dm               94.3    94.4    -       -       -       -       84.0    83.0
  Ce Os               94.5    94.6    -       -       94.7    83.3    -       -
  Dm Os               84.9    95.8    88.4    94.9    -       -       -       -
  At Ce Dm            -       -       -       -       -       -       88.1    80.2
  At Ce Os            -       -       -       -       95.2    80.9    -       -
  At Dm Os            -       -       95.8    91.9    -       -       -       -
  Ce Dm Os            93.4    95.1    -       -       -       -       -       -

12 Conclusion and Future Work
- Feasibility of gene finder as bootstrap predictor
- Improved results through implementation of advanced statistical methods
- Apply gene prediction to large genomes

