Gene prediction in ENCODE. Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona. Advanced Bioinformatics, CHSL, October 2005.


1 Gene prediction in ENCODE. Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona. Advanced Bioinformatics, CHSL, October 2005

2 6/1/2015 Advanced Bioinformatics CHSL, 2005 2

3 1% of the genome. 44 regions. target selection: committee to select sequence targets
– manual targets: a lot of information
– random targets: stratified by non-exonic conservation with mouse and by gene density

4

5 Long-range regulatory elements (enhancers, repressors/silencers, insulators); cis-regulatory elements (promoters, transcription factor binding sites); DNA replication; DNase hypersensitive sites; genes and transcripts; epigenetic

6 gencode: encyclopedia of genes and gene variants
Roderic Guigó (IMIM-UPF-CRG); Stylianos Antonarakis, Alexandre Reymond (Geneva); Ewan Birney (EBI); Michael Brent (WashU); Lior Pachter (Berkeley); Manolis Dermitzakis (Sanger); Jennifer Ashurst, Tim Hubbard
identify all protein coding genes in the ENCODE regions: identify one complete mRNA sequence for at least one splice isoform of each protein coding gene. eventually, identify a number of additional alternative splice forms.

7 the gencode annotation pipeline. manual curation: HAVANA (Sanger); experimental verification: Geneva; bioinformatics: IMIM

8 comparison with other gene sets (figure panels: all exons; coding exons)

9 from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos

10 one gene, many proteins. very complex transcription units

11 chimerism: tandem transcription / intergenic splicing

12 KUA and UEV, Thomson et al., Genome Research 2000

13 systematic search for functional chimeras in ENCODE: 165 tandem pairs in the same orientation; 126 chimeric predictions obtained; 96 tested, at least 4 positive. Parra et al., Genome Research, in press
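
A minimal sketch of how tandem gene pairs in the same orientation could be enumerated from an annotation. The tuple layout, the helper name `tandem_pairs`, and the definition of "tandem" (consecutive, non-overlapping, same strand) are assumptions for illustration; the actual procedure is the one described in Parra et al. and may differ.

```python
from collections import defaultdict


def tandem_pairs(genes):
    """Return adjacent gene pairs located on the same chromosome and strand.

    genes: iterable of (name, chrom, start, end, strand) tuples.
    Two genes are treated as a tandem pair when they are consecutive after
    sorting by start coordinate, do not overlap, and share a strand
    (an illustrative definition, not taken from the slides).
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene[1]].append(gene)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g[2])
        for upstream, downstream in zip(chrom_genes, chrom_genes[1:]):
            same_strand = upstream[4] == downstream[4]
            disjoint = upstream[3] < downstream[2]
            if same_strand and disjoint:
                pairs.append((upstream[0], downstream[0]))
    return pairs


# Toy annotation (hypothetical gene names and coordinates).
toy_genes = [
    ("geneA", "chr1", 100, 500, "+"),
    ("geneB", "chr1", 800, 1500, "+"),
    ("geneC", "chr1", 2000, 2600, "-"),
    ("geneD", "chr2", 50, 400, "-"),
    ("geneE", "chr2", 900, 1300, "-"),
]
pairs = tandem_pairs(toy_genes)
print(pairs)
```

Each such pair would then be a candidate for a chimeric prediction spanning both genes, to be tested experimentally.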

14 EGASP’05
the complete annotation of 13 regions was released on January 30.
– the annotation of the remaining 31 regions was still being produced, and it was withheld.
gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15.
– 18 groups participated, submitting 30 prediction sets.
predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute on May 6 and 7.

15

16

17 EGASP’05 two main goals:
1. to assess how well automatic methods can reproduce the (costly) manual/computational/experimental gencode annotation
2. to assess how complete the gencode annotation is: are there still genes consistently predicted by computational methods but absent from the annotation?

18

19

20 accuracy measures

21 accuracy at the exon level: coding exons. 18 groups participated, submitting 30 prediction sets: evidence-based, dual genome, “ab initio”

22 accuracy at the exon level: all exons. 18 groups participated, submitting 30 prediction sets: evidence-based, dual genome, “ab initio”
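
The accuracy plots themselves are not part of this transcript. As a sketch of the measures conventionally used in gene prediction evaluation, exon-level sensitivity is the fraction of annotated exons predicted with both boundaries exact, and exon-level specificity is the fraction of predicted exons that exactly match an annotated exon. That EGASP’05 used exactly these definitions is an assumption here; the function name and the toy coordinates are hypothetical.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    An exon is a (chrom, start, end, strand) tuple and counts as correct
    only if it matches an annotated exon exactly.
    """
    annotated, predicted = set(annotated), set(predicted)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data: four annotated exons, three predicted, one with a shifted start.
annotated_exons = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 800, 900, "+"),
}
predicted_exons = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 510, 650, "+"),  # wrong 5' boundary, so not counted as correct
}
sn, sp = exon_level_accuracy(annotated_exons, predicted_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Restricting both sets to coding exons, or keeping all exons including untranslated ones, would give the two views named on slides 21 and 22.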

23 programs are quite good at calling the protein coding exons (accuracy of 80%; not as good at calling the transcribed exons), but the best of the programs correctly predict only 40% of the complete CDS exonic structures, and in about 30% of the cases they predict none of the CDS exonic structures correctly

24 programs are quite good at calling the protein coding exons (accuracy of 80%; not as good at calling the transcribed exons), but the best of the programs correctly predict only 40% of the complete transcripts (considering only the coding fraction), and in about 30% of the cases they predict none of the CDS exonic structures correctly
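
The gap between exon-level and transcript-level accuracy can be illustrated with a small sketch: a transcript counts as correct only if its whole CDS exon chain is reproduced, so a single wrong boundary spoils it. The function name, data layout and toy coordinates are hypothetical, and the exact scoring rules used in EGASP’05 may differ.

```python
def cds_structure(exons):
    """Canonical representation of a transcript's CDS exon chain."""
    return tuple(sorted(exons))


def transcript_level_sensitivity(annotated, predicted):
    """Fraction of annotated transcripts whose CDS exon chain is predicted exactly."""
    predicted_structures = {cds_structure(t) for t in predicted}
    hits = sum(1 for t in annotated if cds_structure(t) in predicted_structures)
    return hits / len(annotated) if annotated else 0.0


# Toy data: two annotated transcripts given as lists of (start, end) CDS exons.
annotated = [
    [(100, 200), (300, 400), (500, 600)],
    [(1000, 1100), (1200, 1300)],
]
predicted = [
    [(100, 200), (300, 400), (500, 600)],  # exact match
    [(1000, 1100), (1210, 1300)],          # one exon boundary is off
]

annotated_exons = {exon for t in annotated for exon in t}
predicted_exons = {exon for t in predicted for exon in t}
exon_sn = len(annotated_exons & predicted_exons) / len(annotated_exons)
transcript_sn = transcript_level_sensitivity(annotated, predicted)
print(f"exon-level sensitivity: {exon_sn:.2f}")
print(f"transcript-level sensitivity: {transcript_sn:.2f}")
```

In this toy case four of five exons are right, yet only one of two transcripts is, which is the same qualitative pattern the slide reports.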

25 the issue of completeness

26 many novel exons predicted: we will prioritize a few hundred for experimental verification using RACE + RT-PCR, although our experiment in the 13 regions suggests that only a few of them are likely to be real

27 many computational predictions outside of the annotation
in 13 ENCODE regions:
– 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
– 334 (27%) are outside annotations (could correspond to novel genes)

28 many computational predictions outside of the annotation
in 13 ENCODE regions:
– 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
– 334 (27%) are outside annotations (could correspond to novel genes)
all tested by RT-PCR on 24 tissues:
– 25 (2.0%) confirmed by RT-PCR in 24 tissues
– 16 (1.2%) with correctly predicted intron junctions
– 3 (0.2%) outside annotations (1% confirmation)
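
The percentages on slides 27 and 28 can be reproduced from the counts. This sketch uses only the numbers given on the slides; the variable names are illustrative.

```python
# Counts taken from slides 27 and 28.
total_unannotated = 1255  # unique predicted introns not in the annotation
counts = {
    "outside annotations": 334,
    "confirmed by RT-PCR": 25,
    "correct intron junctions": 16,
    "confirmed, outside annotations": 3,
}

# Each count as a percentage of all 1255 unannotated predicted introns.
percent_of_total = {
    label: 100 * n / total_unannotated for label, n in counts.items()
}

# Confirmation rate among the 334 predictions lying outside annotations.
confirmation_outside = (
    100 * counts["confirmed, outside annotations"] / counts["outside annotations"]
)

for label, pct in percent_of_total.items():
    print(f"{label}: {pct:.2f}% of {total_unannotated}")
print(f"confirmation rate outside annotations: {confirmation_outside:.2f}%")
```

The "1% confirmation" figure is 3 out of the 334 predictions outside annotations (about 0.9%). The ratio 16/1255 evaluates to about 1.27%, which the slide gives as 1.2%.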

29 Overview of the verification efforts II. AFFX-GenCode: novel regions
40 intergenic transfrags from the HL60 cell line that overlap GenCode gene predictions
– 20 overlapping gene predictions with no verification attempted by GenCode
– 20 overlapping gene predictions where verification by GenCode was negative
40 intergenic GenCode gene predictions that do not overlap HL60 transfrags
– 20 where no verification was attempted by GenCode
– 20 where verification by GenCode was negative
(slide by Phil Kapranov, Affymetrix)

30 Some preliminary stats on the 80 regions: 3’ RACE only
Gene predictions overlapping transfrags: total 39 (1/40 is a duplicated transfrag)
– 27 (69%) are positive in HL60 and 31 (80%) in HepG2 in the 3’ RACE assays
Gene predictions not overlapping transfrags: total 38 (2/40 are outside of the regions where we have probes on the ENCODE array)
– 18 (47%) are positive in HL60 and 25 (66%) in HepG2 in the 3’ RACE assays
(slide by Phil Kapranov, Affymetrix)

31 3’ RACE based on a predicted exon, ENr131_egasp_224555_224677, identifies new major and minor exons (shown by arrows) of a gene, BC042133, in the HepG2 cell line only. Good correspondence between RACE exons and GenScan exons. (figure tracks: HepG2 3’ RACE bottom strand; HepG2 3’ RACE top strand; GenScan)

32 high-throughput, genome-wide, unbiased transcription interrogation techniques (table comparing transfrags, CAGE tags, ditags and genes by the information they provide: transcription, strand, connectivity, structure)
the ENCODE genes and transcripts group: transfrags, Tom Gingeras (Affymetrix) and Mike Snyder (Yale); CAGE tags, Albin Sandelin (RIKEN); ditags, Yijun Ruan (Genome Institute of Singapore)

33 Proteasome (prosome, macropain) 26S subunit, non-ATPase, 4 (inhibits cholera-induced intestinal fluid secretion) Chrom 2

34 protein coding genes are only a fraction of the transcription detected in ENCODE
Total number of nucleotides: 29,409,540
Feature                           Nucleotides covered   % nucleotides covered
Annotated exons                   1,624,326             5.5%
Transfrags/TARs (Affymetrix, Yale) 2,699,256            9.2%
CAGE (RIKEN)                      146,588               0.5%
Ditags (GIS)                      26,528                0.1%
TOTAL UNIQUE                      3,534,868             12.0%
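
The "TOTAL UNIQUE" row is smaller than the sum of the individual rows (4,496,698), which is what one expects if nucleotides covered by more than one feature type are counted once. A minimal sketch of that kind of count, merging overlapping intervals before summing their lengths. The half-open interval convention, the function names and the toy coordinates are assumptions; how the figures on the slide were actually computed is not stated.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_nucleotides(intervals):
    """Number of distinct positions covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Toy features on a hypothetical 1000-nucleotide region.
exons = [(0, 100), (200, 300)]
transfrags = [(50, 150), (250, 400)]
cage_tags = [(90, 110)]

summed = sum(covered_nucleotides(f) for f in (exons, transfrags, cage_tags))
unique = covered_nucleotides(exons + transfrags + cage_tags)
print(f"sum of per-feature coverage: {summed}")
print(f"unique coverage: {unique} ({100 * unique / 1000:.1f}% of the region)")

# Consistency check against the slide's totals.
slide_percent = 100 * 3_534_868 / 29_409_540
print(f"slide total unique: {slide_percent:.1f}%")
```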

35 transcription (apparently) not associated with protein coding genes. TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)

36 THREADING TRANSFRAGS into PROTEIN CODING GENES: inferring novel protein coding genes from transfrags

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51 http://genome.imim.es/gencode
ENCODE
IMIM: France Denoeud, Julien Lagarde, Josep F. Abril, Robert Castelo, Eduardo Eyras
Geneva: Stylianos Antonarakis, Alexandre Reymond, Catherine Ucla
EBI: Ewan Birney, Damian Keefe, Paul Flicek
WashU: Michael Brent
Berkeley: Lior Pachter
Sanger: Manolis Dermitzakis
HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam Frankish, David Swarbreck, James Gilbert
Affymetrix: Tom Gingeras, Sujit Dike, Phil Kapranov
EGASP’05: Michael Ashburner, Vladimir Bajic, Suzanne Lewis, Martin Reese, Peter Good, Elise Feingold

