The Havana-Gencode annotation GENCODE CONSORTIUM.


1 The Havana-Gencode annotation GENCODE CONSORTIUM

2 Loci annotated in the 44 ENCODE regions

3 Experimental validations of the manual annotations
The annotations produced by the Havana team at Sanger are being verified experimentally through RT-PCRs and RACEs (University of Geneva):
- 5' RACEs to obtain full-length mRNA(s)
- RT-PCRs to check 360 junctions
- Bidirectional RACEs to obtain full-length mRNAs
- Experimental validation of the annotated single-exon loci
Initial annotation --> Experimental validations --> Updated annotation --> New set of confirmed genes

4 Experimental validations of the manual annotations: 5' RACEs to extend Known and Novel protein genes
- 214 / 426 loci provided positive RACEs for at least one primer (50%)
- About 10% of the successful RACEs extend the loci in 5' (and some provide new exon junctions)
(Some RACE products are still being analysed.)

5 Experimental validations of the manual annotations: RT-PCRs on VEGA Novel_transcript and Putative loci
=> The Novel_transcript loci have a higher success rate than the Putative loci (in accordance with their definitions).
When more than one junction was submitted for the same transcript, all the junctions agreed in 2/3 of the cases (mostly all junctions negative).

6 Experimental validations of the manual annotations: RT-PCRs on non-canonical splice sites
43 non-canonical splice sites (neither GT-AG nor GC-AG) were detected in the 13 training ENCODE regions.
32 could be tested by RT-PCR (for the others, the exons were too short for primer picking):
- 1 was confirmed: it is actually a canonical U12 intron (AT-AC)
- 6 provided canonical junctions (already existing in other annotated splice forms)
- 25 were negative
=> None of the non-canonical splice sites could be validated experimentally.
(83 other splice sites are being checked in the 31 other regions.)
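The splice-site classes named on this slide can be illustrated with a minimal sketch. It assumes the intron sequence is given 5' to 3' on the transcript strand; the function name `classify_intron` and the returned labels are hypothetical and not part of the Havana-Gencode pipeline.

```python
# Terminal dinucleotide pairs treated as canonical on the slide.
CANONICAL = {("GT", "AG"), ("GC", "AG")}
# The AT-AC pair that the slide describes as a canonical U12 intron.
U12_AT_AC = ("AT", "AC")


def classify_intron(intron_seq: str) -> str:
    """Classify an intron by its first and last two bases (hypothetical helper)."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have distinct donor and acceptor sites")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) in CANONICAL:
        return "canonical"
    if (donor, acceptor) == U12_AT_AC:
        return "U12 (AT-AC)"
    return "non-canonical"


if __name__ == "__main__":
    for seq in ("GTAAGTCCCTTTCAG", "ATATCCTTTTAC", "CTAAAAAC"):
        print(seq, "->", classify_intron(seq))
```

Only the terminal dinucleotides are inspected here; a real check would also need strand handling and genome coordinates.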

7 Gene predictions outside of Havana-Gencode annotations
In 13 ENCODE regions, 1255 predicted introns (by one or more of the 9 methods) are not annotated in VEGA:
- 380 (30%) extend VEGA objects (1)
- 530 (42%) are in introns of VEGA objects (2)
- 11 (1%) link exons from distinct VEGA objects (3)
- 334 (27%) are completely outside of VEGA annotations (4)
(Figure: Havana-Gencode annotation versus predictions, cases (1) to (4).)
The 9 methods: 6 computational gene prediction programs (geneid, genscan, SGP, twinscan, fgenesh, exonify); 3 EST-based methods (acembly, Ecgene, Ensembl EST)
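The four categories on this slide could be assigned along the following lines. This is a sketch under assumed definitions, since the slide does not state the exact criteria: a predicted intron "touches" a locus when it directly abuts one of that locus's exons, and coordinates are 1-based and inclusive. The function name, the category labels, and the input layout are all hypothetical, and the input is assumed to contain only introns that are not already annotated.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); assumed convention


def classify_predicted_intron(intron: Interval,
                              loci: Dict[str, List[Interval]]) -> str:
    """Place an unannotated predicted intron in one of the four slide categories."""
    start, end = intron
    touching = set()       # loci with an exon directly abutting the intron
    in_annotated_intron = False
    for name, exons in loci.items():
        for ex_start, ex_end in exons:
            if ex_end == start - 1 or ex_start == end + 1:
                touching.add(name)
        locus_start = min(s for s, _ in exons)
        locus_end = max(e for _, e in exons)
        overlaps_exon = any(s <= end and start <= e for s, e in exons)
        if locus_start <= start and end <= locus_end and not overlaps_exon:
            in_annotated_intron = True
    if len(touching) >= 2:
        return "links distinct loci"        # case (3)
    if len(touching) == 1:
        return "extends locus"              # case (1)
    if in_annotated_intron:
        return "inside annotated intron"    # case (2)
    return "outside annotation"             # case (4)


if __name__ == "__main__":
    loci = {"A": [(100, 200), (300, 400)], "B": [(1000, 1100)]}
    for intron in [(401, 500), (220, 280), (401, 999), (2000, 2100)]:
        print(intron, "->", classify_predicted_intron(intron, loci))
```

Requiring exact abutment is a deliberately strict choice; a looser version might accept any overlap between the predicted flanking exons and annotated exons.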

8 Gene predictions outside of Havana-Gencode annotations: RT-PCRs on exon junctions
1255 predicted introns tested:
=> 16 RT-PCRs confirmed the predicted junction, 9 provided another junction (excluding pseudogenes).
=> Only 3 are intergenic (new loci?) --> being extended by RACE
* 1: RT-PCR successful; 2: RT-PCR provided a product with a wrong exon junction

9 Gene predictions outside of Havana-Gencode annotations: the last 31 regions
- About 3500 introns predicted by standard programs from UCSC tracks are outside of the Havana-Gencode annotation (about 900 intergenic). Very few of those could correspond to real positives (=> need to prioritize).
- Additionally, the EGASP predictions add about 7000 other new introns (about 1000 intergenic).

10 Description of the annotations: gene density

11 Description of the annotations: alternative splicing
Avg: 4.2 transcripts per locus; 6.7 exons per transcript

12 Description of the annotations: coding loci
424 coding loci in 44 ENCODE regions. On average, 44.6% of the transcripts are annotated as coding.

13 Description of the annotations: lengths of exons, introns, CDS, UTRs…

14 Comparison between Havana-Gencode annotation and other sets: ENSEMBL, REFSEQ, MGC, CCDS

15 Gene level
=> Most of the genes from the other sets are contained in the Havana-Gencode annotation (less so for ENSEMBL)

16 Transcript level
=> Very few full transcripts are exactly identical. The coding part of the transcripts is better conserved.

17 Relaxed criterion: allows transcripts from the other sets to be included in Havana-Gencode transcripts
=> Few transcripts are exactly identical, but most of the transcripts from other sets are included in transcripts from Havana-Gencode, especially MGC genes (transcripts not as extended as the annotation).
(Figure: a Havana-Gencode transcript compared with transcripts from other sets, some supporting and some not supporting the annotated transcript.)
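One way to read the relaxed criterion is sketched here. The slide does not give a formal definition, so this assumes that a transcript is "included" when its introns form a contiguous run of the annotated transcript's introns and its two terminal exons do not extend beyond the matching annotated exons. The names `intron_chain` and `is_included` are hypothetical, coordinates are 1-based and inclusive, and both exon lists are assumed to be non-empty.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); assumed convention


def intron_chain(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by a list of exons, in genomic order."""
    exons = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def is_included(other_exons: List[Interval],
                annotated_exons: List[Interval]) -> bool:
    """True if the other transcript fits inside the annotated one (assumed rule)."""
    other_exons = sorted(other_exons)
    annotated_exons = sorted(annotated_exons)
    other = intron_chain(other_exons)
    ref = intron_chain(annotated_exons)
    n = len(other)
    first_start = other_exons[0][0]
    last_end = other_exons[-1][1]
    for i in range(len(ref) - n + 1):
        if ref[i:i + n] != other:
            continue  # intron run does not match at this offset
        # Terminal exons may be shorter than the annotated ones, not longer.
        if annotated_exons[i][0] <= first_start and last_end <= annotated_exons[i + n][1]:
            return True
    return False


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    shorter = [(150, 200), (300, 400), (500, 550)]
    print(is_included(shorter, annotated))
```

Under the strict criterion the two transcripts above would count as different; under this reading of the relaxed criterion the shorter one supports the annotated transcript.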

18 Transcript level: relaxed criterion

19 Exon/intron level
=> More introns than exons are in common: this could be explained by the fact that most differences are in UTRs (last exons)
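The explanation offered on this slide can be made concrete with a small illustration. The coordinates below are invented for the example, and the helper names are hypothetical: two transcripts that differ only in the length of their terminal (UTR) exons share every intron but only their internal exons.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); assumed convention


def introns_of(exons: List[Interval]) -> Set[Interval]:
    """Introns implied by consecutive exons of one transcript."""
    exons = sorted(exons)
    return {(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])}


def shared_fraction(query: Set[Interval], reference: Set[Interval]) -> float:
    """Fraction of query features found with identical coordinates in reference."""
    return len(query & reference) / len(query) if query else 0.0


if __name__ == "__main__":
    # Invented example: same splice junctions, different transcript ends.
    annotated = [(100, 200), (300, 400), (500, 650)]
    other = [(120, 200), (300, 400), (500, 600)]
    print("shared exons:  ", shared_fraction(set(other), set(annotated)))
    print("shared introns:", shared_fraction(introns_of(other), introns_of(annotated)))
```

Exact-match exon comparison penalizes any difference in transcript ends, whereas intron comparison depends only on the splice junctions.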

20 Conclusions
Nucleotide level
- Havana-Gencode annotation is richer than the other data sets.
- REFSEQ, MGC and CCDS are almost completely contained in Havana-Gencode, especially CCDS (smaller set)
- ENSEMBL contains more "false positives" (bigger set)
- Transcripts from the other sets are less extended than transcripts from Havana-Gencode annotations, especially MGC (very few transcripts are completely identical)
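A nucleotide-level containment figure of the kind summarized above could be computed roughly as follows. This is a sketch, not the method used for the slides: it assumes each set is reduced to exon intervals on one sequence, with 1-based inclusive coordinates, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); assumed convention


def merge(intervals: List[Interval]) -> List[List[int]]:
    """Collapse overlapping or adjacent intervals into disjoint ones."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(query: List[Interval], reference: List[Interval]) -> int:
    """Number of query bases that are also present in the reference set."""
    q, r = merge(query), merge(reference)
    total = i = j = 0
    while i < len(q) and j < len(r):
        lo, hi = max(q[i][0], r[j][0]), min(q[i][1], r[j][1])
        if lo <= hi:
            total += hi - lo + 1
        # Advance whichever interval finishes first.
        if q[i][1] < r[j][1]:
            i += 1
        else:
            j += 1
    return total


def fraction_contained(query: List[Interval], reference: List[Interval]) -> float:
    """Share of the query set's nucleotides that fall inside the reference set."""
    query_len = sum(end - start + 1 for start, end in merge(query))
    return covered_bases(query, reference) / query_len if query_len else 0.0


if __name__ == "__main__":
    other_set = [(100, 199), (300, 349)]
    annotation = [(50, 249), (300, 399)]
    print(fraction_contained(other_set, annotation))
    print(fraction_contained(annotation, other_set))
```

Computing the fraction in both directions separates "the other set is contained in the annotation" from "the annotation is richer than the other set".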

21

22 Exon pair level (exon-intron-exon)

