Fgenesh++ pipelines for automatic annotation of eukaryotic genomes
Victor Solovyev, Peter Kosarev
Royal Holloway College, University of London; Softberry Inc.

Steps of the FGENESH++ annotation pipeline

1. Mapping of the RefSeq mRNA set by the EST_MAP program (sequences with mapped genes are excluded from the further gene prediction process).
2. Mapping of NR proteins by the Prot_map program.
3. Fgenesh+ gene prediction on sequences having a significant hit with protein sequences (sequences with predicted genes are excluded from the further gene prediction process).
4. Run of FGENESH ab initio gene prediction in regions free from predictions made at stages 1 and 3.
6. Run of FGENESH gene prediction in large introns of known and predicted genes.

A simple variant of the pipeline was used. For human, a lot of additional information, such as ESTs, can be used.
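Step 4 restricts ab initio prediction to regions not already covered by genes from stages 1 and 3. A minimal sketch of that bookkeeping follows, assuming 1-based inclusive coordinates. The function name `free_regions` and the example coordinates are hypothetical and are not part of the Fgenesh++ software.

```python
def free_regions(seq_length, predicted):
    """Return the 1-based inclusive intervals of a sequence that are not
    covered by any (start, end) interval in `predicted`."""
    free = []
    next_free = 1  # first position not yet known to be covered
    for start, end in sorted(predicted):
        if start > next_free:
            free.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= seq_length:
        free.append((next_free, seq_length))
    return free


# Hypothetical genes found by mRNA mapping (stage 1) and Fgenesh+ (stage 3).
mrna_genes = [(100, 400), (900, 1200)]
fgenesh_plus_genes = [(350, 600)]
print(free_regions(1500, mrna_genes + fgenesh_plus_genes))
```

Overlapping predictions are merged implicitly, so each free interval could be passed to an ab initio predictor without re-predicting known genes.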

Components of the Fgenesh++ automatic pipeline

FGENESH: ab initio gene prediction. Runs on whole chromosomes (~300 Mb). FAST: the human genome (~3 Gb of sequence) is processed in about 4 hours.
EST_MAP: a program for fast mapping of a set of mRNAs/ESTs to a chromosome sequence. It takes splice site weight matrices into account for accurate mapping, and maps small exon sequences more accurately than BLAT.
FGENESH+: a derivative of FGENESH that uses information on homologous proteins to improve gene prediction, if a homolog can be found.
PROT_MAP: maps a database of protein sequences to a genome, taking splice sites into account.

Example of Prot_map: mapping of a protein sequence to a genome

First sequence Chr19 [cut:1 3000000]
[DD] Sequence: 1( 1), S: 52.623, L:1739 IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1
Summ of block lengths: 1468, Alignment bounds:
On first sequence: start 2146727, end 2167939, length 21213
On second sequence: start 263, end 1739, length 1477
Blocks of alignment: 19
 1 E: 2146727  70 [ca GT] P: 2146727  263 L:  23, G: 101.313, W: 1160, S:14.1355
 2 E: 2147573 107 [AG GT] P: 2147575  287 L:  35, G: 102.892, W: 1810, S:17.7256
 3 E: 2148934  42 [AG GT] P: 2148934  322 L:  14, G: 102.539, W:  720, S:11.1699
 4 E: 2150399 111 [AG GT] P: 2150399  336 L:  37, G: 101.777, W: 1880, S:18.0157
 5 E: 2150620 235 [AG GT] P: 2150620  373 L:  78, G: 101.251, W: 3930, S:26.0143
 6 E: 2151098 114 [AG GT] P: 2151100  452 L:  37, G: 105.778, W: 2000, S:18.7669
 7 E: 2151750  92 [AG GT] P: 2151752  490 L:  30, G: 101.188, W: 1510, S:16.1227
 8 E: 2153538 102 [AG GT] P: 2153538  520 L:  34, G: 100.414, W: 1690, S:17.0246
 9 E: 2153848 138 [AG GT] P: 2153848  554 L:  46, G: 99.168, W: 2240, S:19.5414
10 E: 2154470 126 [AG GT] P: 2154470  600 L:  42, G: 101.071, W: 2110, S:19.0531
11 E: 2156280 485 [AG GT] P: 2156280  642 L: 161, G: 102.616, W: 8290, S:37.9091
12 E: 2156954 136 [AG GT] P: 2156955  804 L:  45, G: 103.244, W: 2340, S:20.1719
13 E: 2157771 147 [AG GT] P: 2157771  849 L:  49, G: 98.511, W: 2360, S:20.0267
14 E: 2160107 115 [AG GT] P: 2160107  898 L:  38, G: 100.777, W: 1900, S:18.0672
15 E: 2161975 584 [AG GT] P: 2161977  937 L: 194, G: 101.031, W: 9740, S:40.932
16 E: 2163280 206 [AG GC] P: 2163280 1131 L:  68, G: 103.135, W: 3530, S:24.7691
17 E: 2165387  65 [AG GT] P: 2165388 1200 L:  21, G: 98.427, W: 1010, S:13.0987
…………………………………………………………………
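The block lines in the listing above are regular enough to parse with a regular expression. The sketch below is only an illustration: the slide does not document the fields, so the meanings given to `E:` (genomic start and length of the block), `P:` (genomic and protein start of the aligned part) and `L:` (aligned length in amino acids) are guesses from the example, and `parse_block` is a hypothetical helper, not part of Prot_map.

```python
import re

# Pattern for one "Blocks of alignment" line. Group names reflect guessed
# field meanings; G, W and S are kept as numbers without interpretation.
BLOCK_RE = re.compile(
    r"(?P<index>\d+)\s+E:\s*(?P<exon_start>\d+)\s+(?P<exon_len>\d+)\s+"
    r"\[(?P<left_site>\w+)\s+(?P<right_site>\w+)\]\s+"
    r"P:\s*(?P<genome_pos>\d+)\s+(?P<protein_pos>\d+)\s+"
    r"L:\s*(?P<aa_len>\d+),\s*G:\s*(?P<g>[\d.]+),\s*"
    r"W:\s*(?P<w>\d+),\s*S:\s*(?P<s>[\d.]+)"
)


def parse_block(line):
    """Parse one block line into a dict of ints, floats and site strings."""
    match = BLOCK_RE.search(line)
    if match is None:
        raise ValueError(f"not a block line: {line!r}")
    fields = match.groupdict()
    for key in ("index", "exon_start", "exon_len", "genome_pos",
                "protein_pos", "aa_len", "w"):
        fields[key] = int(fields[key])
    for key in ("g", "s"):
        fields[key] = float(fields[key])
    return fields


block = parse_block(
    "2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 102.892, W: 1810, S:17.7256"
)
print(block["exon_start"], block["exon_len"], block["aa_len"], block["right_site"])
```

In the example, the genomic block length (107 bp) is close to three times the aligned protein length (35 residues), which is consistent with the guessed field meanings but does not confirm them.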

Prot_map example of alignment

1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
...............(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277

2146797 2146806 2147558 2147568 2147581 2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
 ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299

2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323

2148967 2148982 2150384 2150391 2150402 2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347

2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373

Using one processor, Prot_map aligns a human protein set of 55,946 proteins to chromosome 19 (~59 Mb) in 90 min (best hit for each protein) or 148 min (all significant hits for each protein).

Predicted genes in different classes (% of protein-coding bases)

Predictions        44 sequences   31 sequences   13 sequences
mRNA supported        35.14%         34.34%         36.72%
prot. supported       51.84%         51.35%         52.82%
ab initio             13.29%         14.41%         11.07%

mRNA-supported genes might have alternative splice forms that overlap.

Predicted gene numbers

                   44 seq        31 seq       13 seq
mRNA supported     177 (313)     118 (209)    59 (104)
prot. supported    293           203          90
----------------------------------------------------------
ab initio          208           147          61
Total              678 (814)     468 (559)    210 (255)
Havana             435 (1061)    297 (716)    138 (345)

CDS prediction accuracy on nucleotide level

All genes, nucleotide level, CDS, 1-base shift fixed:

         44 sequences   31 sequences   13 sequences
sn+         87.79          89.84          83.53
sp+         74.21          76.20          70.12
sn-         88.85          92.13          83.02
sp-         76.96          77.56          75.81
sn          88.34          91.00          83.25
sp          75.62          76.89          73.10

All genes, nucleotide level, CDS, WITHOUT fix:

         44 sequences   31 sequences   13 sequences
sn          88.26          90.92          83.17
sp          75.55          76.83          73.03

The initial posting contained a bug: exons of mRNA-supported genes on the negative strand were shifted by 1 bp.
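The slides do not define sn and sp. In the gene-prediction literature, nucleotide-level sensitivity is usually TP / (TP + FN) and "specificity" is TP / (TP + FP), where TP counts coding bases that are both annotated and predicted. A minimal sketch under that assumption, with hypothetical coordinates and a hypothetical function name `nucleotide_sn_sp`:

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_sn_sp(annotated, predicted):
    """Return (sn, sp) in percent for coding nucleotides on one strand.

    sn = shared bases / annotated bases; sp = shared bases / predicted bases.
    """
    truth = covered_positions(annotated)
    pred = covered_positions(predicted)
    shared = len(truth & pred)
    sn = 100.0 * shared / len(truth) if truth else 0.0
    sp = 100.0 * shared / len(pred) if pred else 0.0
    return sn, sp


annotated = [(101, 200), (301, 400)]   # 200 annotated coding bases
predicted = [(151, 200), (301, 500)]   # 250 predicted coding bases
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"sn = {sn:.2f}  sp = {sp:.2f}")
```

Expanding intervals into sets is simple but memory-hungry; for whole chromosomes an interval-merging approach would be a better choice.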

Prediction accuracy on nucleotide level

                                 44 sequences   31 sequences   13 sequences
CDS                        sn       88.34          91.00          83.25
                           sp       75.62          76.89          73.10
Coding + noncoding exons   sn       62.28          63.02          60.87
                           sp       78.87          80.14          76.44

HAVANA annotations contain many more untranslated and partially translated exons than our predictions do.
We have such exons only for mRNA-mapped genes (~35% of cases).
Such exons need to be added to the annotations in future using ESTs and provisional mRNA.

Nucleotide specificity depending on prediction class

                                                44 sequences   31 sequences   13 sequences
CDS                                       sn       88.34          91.00          83.25
                                          sp       75.62          76.89          73.10

Versus "44regions_coding.gff":
mRNA supported genes (35%)                sp       89.26          94.22          80.04
protein supported genes (53%)             sp       82.51          83.91          79.82
ab initio genes (13% of all CDS)          sp       13.20          10.74          19.57

The ab initio class may contain some NEW genes (?); also, ~10% of these predictions overlap with predicted pseudogenes.

Accuracy of exact CDS prediction

                    44 sequences   31 sequences   13 sequences
CDS OVERLAP   sn       87.68          90.19          83.31
              sp       74.85          75.36          73.94
CDS 1EDGE     sn       84.79          88.30          78.72
              sp       72.40          73.92          69.62
CDS EXACT     sn       72.13          74.30          68.36
              sp       68.42          69.22          66.95
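The slide does not say how the OVERLAP, 1EDGE and EXACT categories are scored. One plausible reading, used here purely as an assumption, is that a CDS exon counts as matched if it overlaps a counterpart (OVERLAP), overlaps it and shares at least one boundary (1EDGE), or shares both boundaries (EXACT). The function `exon_level_accuracy` and the coordinates below are hypothetical.

```python
def exon_level_accuracy(annotated, predicted, mode):
    """Return (sn, sp) in percent at exon level.

    annotated, predicted: lists of 1-based inclusive (start, end) tuples.
    mode: "overlap", "1edge" or "exact" (assumed definitions, see lead-in).
    """
    def matches(a, b):
        overlap = a[0] <= b[1] and b[0] <= a[1]
        if mode == "overlap":
            return overlap
        if mode == "1edge":
            return overlap and (a[0] == b[0] or a[1] == b[1])
        if mode == "exact":
            return a == b
        raise ValueError(f"unknown mode: {mode}")

    # Annotated exons that some prediction matches, and vice versa.
    found = sum(any(matches(a, p) for p in predicted) for a in annotated)
    correct = sum(any(matches(p, a) for a in annotated) for p in predicted)
    sn = 100.0 * found / len(annotated) if annotated else 0.0
    sp = 100.0 * correct / len(predicted) if predicted else 0.0
    return sn, sp


annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 390), (510, 590), (900, 950), (1000, 1100)]
for mode in ("overlap", "1edge", "exact"):
    sn, sp = exon_level_accuracy(annotated, predicted, mode)
    print(f"{mode:8s} sn = {sn:.2f}  sp = {sp:.2f}")
```

Under these definitions the scores can only decrease from OVERLAP to 1EDGE to EXACT, which matches the ordering of the values in the table.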

Canonical and non-canonical splice sites

GT-AG:         99.24%
GC-AG:          0.69%
AT-AC:          0.05%
other sites:    0.02%

SpliceDB (Burset, Seledtsov, Solovyev, NAR 1999, 2000)

Gene prediction is usually done with only standard splice sites.

What we have not done:
Fgenesh/Fgenesh+ have an option to account for the GC donor site.
At least for Prot_map + Fgenesh+ predictions we need to include GC splice sites.
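The classes in the table above are defined by the first two and last two bases of an intron. A small sketch of how such a tally could be made from a list of intron sequences; the function name `splice_site_class` and the toy introns are hypothetical and are not taken from SpliceDB.

```python
from collections import Counter

KNOWN_CLASSES = {"GT-AG", "GC-AG", "AT-AC"}


def splice_site_class(intron):
    """Classify an intron by its donor (first two) and acceptor (last two) bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to hold both splice sites")
    pair = f"{seq[:2]}-{seq[-2:]}"
    return pair if pair in KNOWN_CLASSES else "other"


# Toy intron sequences (far shorter than real introns).
introns = ["gtaagtttcag", "GTGAGTCTTAG", "gcaagttttag", "atatccttac", "gtaagtttaaa"]
counts = Counter(splice_site_class(s) for s in introns)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

A predictor that only accepts GT-AG introns would miss the GC-AG and AT-AC cases entirely, which is the motivation for enabling the GC donor option.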

How we can improve the power of the Fgenesh++ annotation pipeline

USE ESTs and provisional mRNA
  Fgenesh_c predicts genes using a genomic sequence and an EST sequence.
  Add EST-based noncoding exons/parts of exons.
USE synteny
  We have a pipeline to generate syntenic regions between genomes based on the coding-exon annotation produced by Fgenesh++.
  Fgenesh2 predicts genes using 2 syntenic genomic sequences.
Mark or remove pseudogenes from the predictions (especially check ab initio predictions).
Include promoter prediction in Fgenesh (developed), then include prediction of non-coding exons.
Time and testing are needed to determine to what extent the above approaches improve the annotation.

To ENCODE:

Keep and improve annotations of the 44 ENCODE regions to use them as a test bed for adding new blocks to annotation pipelines.
It would be good to have GTF annotations of the 44 regions with sequences extended to include complete genes at both ends.
Include, in the check of downloaded predictions, a flag for UNUSUAL CDS without GT/AG ends or without an ATG-GT or AG-STOP structure, to avoid bugs in data posted for evaluation.
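The proposed check for unusual CDS structures could look roughly like the sketch below. It is an illustration under several assumptions: the gene is on the forward strand, coordinates are 1-based and inclusive, the stop codon is counted as part of the last CDS exon, and only GT donors and AG acceptors are treated as usual. The function name `unusual_cds_features` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def unusual_cds_features(genome, exons):
    """Return a list of problems for one forward-strand gene.

    genome: genomic sequence as a string.
    exons: CDS exons as 1-based inclusive (start, end) pairs, sorted by start.
    """
    seq = genome.upper()
    problems = []
    first_start = exons[0][0]
    last_end = exons[-1][1]
    if seq[first_start - 1:first_start + 2] != "ATG":
        problems.append("first CDS exon does not start with ATG")
    if seq[last_end - 3:last_end] not in STOP_CODONS:
        problems.append("last CDS exon does not end with a stop codon")
    # Check the intron between each pair of neighbouring exons.
    for (_, donor_end), (acceptor_start, _) in zip(exons, exons[1:]):
        if seq[donor_end:donor_end + 2] != "GT":
            problems.append(f"no GT after exon end {donor_end}")
        if seq[acceptor_start - 3:acceptor_start - 1] != "AG":
            problems.append(f"no AG before exon start {acceptor_start}")
    return problems


# Toy locus: flank, exon 1 (3-8), intron (9-20), exon 2 (21-26), flank.
genome = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA" + "CC"
print(unusual_cds_features(genome, [(3, 8), (21, 26)]))   # correct coordinates
print(unusual_cds_features(genome, [(3, 8), (20, 26)]))   # exon 2 shifted by 1 bp
```

A 1 bp shift of an exon boundary, like the bug reported for the initial posting, shows up immediately as a missing AG or GT signal.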
