Gene prediction in ENCODE. Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona. Advanced Bioinformatics, CHSL, October 2005.

6/1/2015 Advanced Bioinformatics CHSL,

% of the genome. 44 regions.
target selection: committee to select sequence targets
–manual targets: a lot of information
–random targets: stratified by non-exonic conservation with mouse and by gene density

Long-range regulatory elements (enhancers, repressors/silencers, insulators)
Cis-regulatory elements (promoters, transcription factor binding sites)
DNA Replication
DNase Hypersensitive Sites
Genes and Transcripts
Epigenetic

gencode: encyclopedia of genes and gene variants
Roderic Guigó, IMIM-UPF-CRG
Stylianos Antonarakis, Geneva
Alexandre Reymond
Ewan Birney, EBI
Michael Brent, WashU
Lior Pachter, Berkeley
Manolis Dermitzakis, Sanger
Jennifer Ashurst, Tim Hubbard
identify all protein coding genes in the ENCODE regions:
–identify one complete mRNA sequence for at least one splice isoform of each protein coding gene.
–eventually, identify a number of additional alternative splice forms.

the gencode annotation pipeline
–manual curation: HAVANA (Sanger)
–experimental verification: Geneva
–bioinformatics: IMIM

comparison with other gene sets (figure panels: all exons; coding exons)

from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos

one gene, many proteins: very complex transcription units

chimerism: tandem transcription / intergenic splicing

KUA and UEV, Thomson et al., Genome Research 2000

systematic search for functional chimeras in ENCODE:
–165 tandem pairs in the same orientation
–126 chimeric predictions obtained
–96 tested, at least 4 positive
Parra et al., Genome Research, in press

EGASP’05
the complete annotation of 13 regions was released on January 30.
–the annotation of the remaining 31 regions was still being produced, and it was withheld.
gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15.
–18 groups participated, submitting 30 prediction sets.
predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute, on May 6 and 7.

EGASP’05
two main goals:
1. to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental gencode annotation
2. to assess how complete the gencode annotation is: are there still genes consistently predicted by computational methods but absent from the annotation?

accuracy measures
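
The transcript does not say which accuracy measures the slide showed. In the gene-finding literature, exon-level accuracy is commonly reported as sensitivity (fraction of annotated exons predicted exactly) and specificity (fraction of predicted exons that are annotated). The sketch below assumes those two definitions; the `Exon` tuple layout, the function name and the toy coordinates are illustrative and are not taken from the talk.

```python
from typing import Set, Tuple

# Hypothetical exon representation: (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity), counting only exact exon matches."""
    true_positives = len(annotated & predicted)
    # Sensitivity: share of annotated exons that were predicted exactly.
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    # Specificity (gene-finding usage): share of predicted exons that are annotated.
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
annotated = {
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 400, "+"),
    ("chr21", 500, 650, "+"),
    ("chr21", 800, 900, "+"),
}
predicted = {
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 400, "+"),
    ("chr21", 510, 650, "+"),  # wrong start coordinate, so not an exact match
}

sn, sp = exon_level_accuracy(annotated, predicted)
print(f"exon sensitivity = {sn:.2f}, exon specificity = {sp:.2f}")
```

An exon with a single wrong boundary counts as both a missed annotation and a false prediction, which is why exact-match exon accuracy is stricter than nucleotide-level accuracy.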

accuracy at the exon level -- coding exons
18 groups participated, submitting 30 prediction sets: evidence-based, dual genome, “ab initio”

accuracy at the exon level -- all exons
18 groups participated, submitting 30 prediction sets: evidence-based, dual genome, “ab initio”

programs are quite good at calling the protein coding exons (accuracy at 80%; they are not as good at calling the transcribed exons),
but the best of the programs correctly predict only 40% of the complete CDS exonic structures,
and in about 30% of the cases they fail to predict any of the CDS exonic structures correctly

programs are quite good at calling the protein coding exons (accuracy at 80%; they are not as good at calling the transcribed exons),
but the best of the programs correctly predict only 40% of the complete transcripts (considering only the coding fraction),
and in about 30% of the cases they fail to predict any of the CDS exonic structures correctly
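
The two transcript-level figures above can be read as (a) the fraction of annotated CDS exonic structures reproduced exactly and (b) the fraction of genes for which no annotated structure is reproduced. The sketch below computes both under that reading. Treating "cases" as genes is an assumption, and the data layout, names and coordinates are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical representation: one CDS structure = the set of its (start, end) exons.
Structure = FrozenSet[Tuple[int, int]]


def transcript_level_summary(
    annotated_by_gene: Dict[str, Set[Structure]],
    predicted: Set[Structure],
) -> Tuple[float, float]:
    """Return (fraction of annotated structures matched exactly,
    fraction of genes with no structure matched)."""
    if not annotated_by_gene:
        raise ValueError("no annotated genes given")
    structures = [s for gene in annotated_by_gene.values() for s in gene]
    exact = sum(1 for s in structures if s in predicted)
    missed_genes = sum(1 for gene in annotated_by_gene.values() if not (gene & predicted))
    return exact / len(structures), missed_genes / len(annotated_by_gene)


# Toy annotation, invented for illustration: gene_a has two isoforms, gene_b has one.
annotation = {
    "gene_a": {
        frozenset({(100, 200), (300, 400)}),
        frozenset({(100, 200), (500, 600)}),
    },
    "gene_b": {frozenset({(1000, 1100), (1300, 1400)})},
}
prediction = {
    frozenset({(100, 200), (300, 400)}),      # exact match to one gene_a isoform
    frozenset({(1000, 1100), (1310, 1400)}),  # one wrong boundary, so no match
}

matched_fraction, missed_gene_fraction = transcript_level_summary(annotation, prediction)
print(f"structures matched: {matched_fraction:.2f}, genes missed: {missed_gene_fraction:.2f}")
```

Because a structure matches only if every exon boundary agrees, high exon-level accuracy can coexist with much lower transcript-level accuracy, as the figures on the slide illustrate.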

the issue of completeness

many novel exons predicted: we will prioritize a few hundred for experimental verification using RACE + RT-PCR, although our experiment in the 13 regions suggests that only a few of them are likely to be real

many computational predictions outside of the annotation
In 13 ENCODE regions:
–1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
–334 (27%) are outside annotations (could correspond to novel genes)
–all tested by RT-PCR on 24 tissues
–25 (2.0%) confirmed by RT-PCR in 24 tissues
–16 (1.2%) with correctly predicted intron junctions
–3 (0.2%) outside annotations (1% confirmation)
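
The percentages on this slide can be rechecked from the counts it gives. The short sketch below does so; the variable names are illustrative, and the choice of denominators (1255 for most figures, 334 for the "1% confirmation") is inferred from the numbers rather than stated on the slide.

```python
# Counts transcribed from the slide (13 ENCODE regions).
unannotated_introns = 1255  # predicted introns absent from the annotation
outside_annotations = 334   # subset lying outside annotations
confirmed = 25              # confirmed by RT-PCR
correct_junctions = 16      # confirmed with the predicted intron junctions
confirmed_outside = 3       # confirmed and outside annotations


def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


rates = {
    "outside annotations / all unannotated": percent(outside_annotations, unannotated_introns),
    "confirmed / all unannotated": percent(confirmed, unannotated_introns),
    "correct junctions / all unannotated": percent(correct_junctions, unannotated_introns),
    "confirmed outside / all unannotated": percent(confirmed_outside, unannotated_introns),
    "confirmed outside / outside annotations": percent(confirmed_outside, outside_annotations),
}

for label, value in rates.items():
    print(f"{label}: {value:.2f}%")
```

With these denominators the slide's figures are reproduced; the only discrepancy is that 16/1255 is slightly above 1.27%, so the slide's 1.2% looks truncated rather than rounded.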

Overview of the verification efforts II
AFFX-GenCode: novel regions
40 intergenic transfrags from HL60 cell line that overlap GenCode gene predictions
–20 overlapping gene predictions with no verification attempted by GenCode
–20 overlapping gene predictions where verification by GenCode was negative
40 intergenic GenCode gene predictions that do not overlap HL60 transfrags
–20 where no verification was attempted by GenCode
–20 where verification by GenCode was negative
(slide by Phil Kapranov, Affymetrix)

Some preliminary stats on the 80 regions: 3’ RACE only
Gene predictions overlapping transfrags: total 39 (1/40 is a duplicated transfrag)
–27 (69%) are positive in HL60 and 31 (80%) in HepG2 in the 3’ RACE assays
Gene predictions not overlapping transfrags: total 38 (2/40 are outside of the regions where we have probes on the ENCODE array)
–18 (47%) are positive in HL60 and 25 (66%) in HepG2 in the 3’ RACE assays
(slide by Phil Kapranov, Affymetrix)
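
As a quick check of the 3’ RACE figures, the positive rates per group and cell line can be recomputed from the counts above. The dictionary layout and names below are illustrative only.

```python
# Counts transcribed from the slide: predictions tested and 3' RACE positives per cell line.
race_results = {
    "overlapping transfrags": {"total": 39, "HL60": 27, "HepG2": 31},
    "not overlapping transfrags": {"total": 38, "HL60": 18, "HepG2": 25},
}

# Percentage of tested predictions that were positive, per group and cell line.
positive_rates = {
    group: {
        cell: 100.0 * counts[cell] / counts["total"]
        for cell in ("HL60", "HepG2")
    }
    for group, counts in race_results.items()
}

for group, by_cell in positive_rates.items():
    for cell, rate in by_cell.items():
        print(f"{group}, {cell}: {rate:.1f}% positive")
```

The recomputed values agree with the slide, except that 31/39 comes to about 79.5%, which the slide gives as 80%. In both cell lines the rate is higher for predictions that overlap transfrags than for those that do not.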

’ RACE based on a predicted exon ENr131_egasp_224555_ identifies new major and minor exons (shown by arrows) of a gene BC in HepG2 cell line only. Good correspondence between RACE exons and GenScan exons.
(tracks shown: HepG2 3’ RACE bottom strand; HepG2 3’ RACE top strand; GenScan)

high-throughput genome-wide unbiased transcription interrogation techniques
(table comparing transfrags, CAGE tags, ditags and genes by transcription, strand, connectivity and structure)
the ENCODE genes and transcripts group:
–transfrags: Tom Gingeras (Affymetrix) and Mike Snyder (Yale)
–CAGE tags: Albin Sandelin, RIKEN
–ditags: Yijun Ruan, Genome Institute of Singapore

Proteasome (prosome, macropain) 26S subunit, non-ATPase, 4 (inhibits cholera-induced intestinal fluid secretion). Chrom 2

protein coding genes are only a fraction of the transcription detected in ENCODE
Total nb of nucleotides :
(columns: nb of nucleotides covered; % nucleotides covered)
Annotated exon: 1624,326; 5,5%
transfrag/TAR (Affymetrix, Yale): ,2%
CAGE (RIKEN): ,5%
ditags (GIS): 26528; 0,1%
TOTAL UNIQUE: 3,534, ; %

transcription (apparently) not associated with protein coding genes
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)

THREADING TRANSFRAGS into PROTEIN CODING GENES
inferring novel protein coding genes from transfrags

ENCODE
IMIM: France Denoeud, Julien Lagarde, Josep F. Abril, Robert Castelo, Eduardo Eyras
Geneva: Stylianos Antonarakis, Alexandre Reymond, Catherine Ucla
EBI: Ewan Birney, Damian Keefe, Paul Flicek
WashU: Michael Brent
Berkeley: Lior Pachter
Sanger: Manolis Dermitzakis
HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam Frankish, David Swarbreck, James Gilbert
AFFYMETRIX: Tom Gingeras, Sujit Dike, Phil Kapranov
EGASP’05: Michael Ashburner, Vladimir Bajic, Suzanne Lewis, Martin Reese, Peter Good, Elise Feingold