
GENCODE: a rich dataset of all gene features in the human genome

The GENCODE consortium aims to identify all gene features in the human genome, using a combination of computational analysis, manual annotation and experimental validation of selected transcripts. Although the number of protein-coding genes has remained relatively steady since the first release, the average number of transcripts per protein-coding locus in the GENCODE annotation has gradually increased from 4.8 to 6.9, and the number of distinct translations has risen with it (a 29% increase). GENCODE also contains the most comprehensive publicly available annotation of long non-coding RNA (lncRNA) loci, currently totalling 11,790. Unlike protein-coding loci, the number of lncRNA loci is likely to continue to increase as new RNAseq-derived tissue-specific datasets are incorporated into the annotation. For both protein-coding and lncRNA genes, the number of splice variants per locus is significantly higher in GENCODE than in other public gene sets. GENCODE covers 81% of RefSeq and 67% of UCSC transcripts, whereas 109,000 GENCODE transcripts (68%) are not present in RefSeq or UCSC. Almost 40% of those are protein-coding and give rise to 39,000 unique translations.

The GENCODE data release cycle is coupled to the tri-monthly Ensembl releases. Each release contains updated gene sets in which new data from the HAVANA manual annotation have been integrated with the refined Ensembl automated gene set. GENCODE is publicly available from the gencodegenes.org website, where the main annotation files can be downloaded in GTF format. GENCODE data can also be visualized via the Ensembl and UCSC genome browsers or accessed through the Ensembl databases, Perl API and BioMart.
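The transcripts-per-locus figure quoted in the abstract can be recomputed from a downloaded GTF file. The following is a minimal sketch, not the consortium's own method: it assumes that `transcript` rows carry `gene_id`, `transcript_id` and `gene_type` attributes, and it counts every transcript that belongs to a gene of the requested type. The function name and the toy records are hypothetical.

```python
import re
from collections import defaultdict

# Matches quoted GTF attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def mean_transcripts_per_gene(gtf_lines, gene_type="protein_coding"):
    """Mean number of transcripts per gene of the given type.

    Assumes GTF 'transcript' rows with gene_id, transcript_id and
    gene_type attributes (an assumption about the file layout).
    """
    transcripts = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript rows are counted
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("gene_type") == gene_type:
            transcripts[attrs["gene_id"]].add(attrs["transcript_id"])
    if not transcripts:
        return 0.0
    return sum(len(t) for t in transcripts.values()) / len(transcripts)


def _row(feature, attrs):
    # Build one tab-separated GTF row with made-up coordinates.
    return "\t".join(["chr1", "HAVANA", feature, "1", "100", ".", "+", ".", attrs])


# Toy records with invented identifiers, for illustration only.
example = [
    "##description: toy example",
    _row("gene", 'gene_id "G1"; gene_type "protein_coding";'),
    _row("transcript", 'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding";'),
    _row("transcript", 'gene_id "G1"; transcript_id "T2"; gene_type "protein_coding";'),
    _row("transcript", 'gene_id "G1"; transcript_id "T3"; gene_type "protein_coding";'),
    _row("transcript", 'gene_id "G2"; transcript_id "T4"; gene_type "protein_coding";'),
    _row("transcript", 'gene_id "G3"; transcript_id "T5"; gene_type "lincRNA";'),
]

mean_pc = mean_transcripts_per_gene(example)
```

In the toy data, gene G1 has three transcripts and G2 has one, so the protein-coding mean is 2.0; the lincRNA gene G3 is ignored unless `gene_type="lincRNA"` is passed.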
González J.M., Tapanari E., Harrow J., The GENCODE Consortium *
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, UK

[Panel titles: GENCODE pipeline; GENCODE contains the most comprehensive annotation of lncRNAs; GENCODE includes pseudogene annotation]

The aim of GENCODE is to annotate all evidence-based gene features in the entire human genome at a high accuracy. The process to create this annotation involves manual curation, different computational analyses and targeted experimental approaches. Putative loci can be verified by wet-lab experiments, and computational predictions are analysed manually. The result is a set of annotations including all protein-coding loci with alternatively transcribed variants, non-coding loci with transcript evidence (lncRNAs), and pseudogenes. The GENCODE data release cycle is coupled to the tri-monthly Ensembl releases.

How to access GENCODE data

1. GTF files: annotation files in GTF format from the GENCODE web and FTP sites. These files contain unique information derived from the manual annotation, including experimental validation status, transcript and CDS completeness and presence of non-canonical splice sites.
   http://www.gencodegenes.org
   ftp://ftp.sanger.ac.uk/pub/gencode
2. Ensembl browser: GENCODE is the default human gene set in the Ensembl core database.
   - Genome browser: http://www.ensembl.org
   - MySQL databases: mysql -h ensembldb.ensembl.org -P 5306 -u anonymous
   - Perl API: http://www.ensembl.org/info/docs/api
   - BioMart: http://www.ensembl.org/biomart/martview
3. UCSC browser: GENCODE tracks in the UCSC genome browser.
   http://genome.ucsc.edu

This diagram shows the flow of data between the groups of the GENCODE Consortium. Manual annotation is central to the process but relies on specialized prediction pipelines to provide hints to first-pass annotation and quality control (QC) for completed annotation.
Automated annotation supplements manual annotation: the two are merged to produce the GENCODE data set and are also used to apply QC to the completed annotation. A subset of annotated gene models is subject to experimental validation. The AnnoTrack tracking system contains data from all groups and is used to highlight differences, co-ordinate QC and track outcomes.

[Panel title: GENCODE stats]

Proportion of GENCODE lncRNA and mRNA transcripts with Cap Analysis Gene Expression (CAGE) clusters mapped around their transcription start sites (TSS), in bins of increasing expression level (log10 RPKM, Reads Per Kilobase per Million mapped reads).

Number of exons per transcript for all lncRNA transcripts, lncRNAs having CAGE or PET (paired-end tag) support for either their 5' or 3' ends, lncRNAs having PET tags mapping to both ends of the transcript, and protein-coding transcripts.

The lncRNA content of three major publicly available gene sets, GENCODE, RefSeq and UCSC, was compared at the levels of total gene number, total transcript number and mean transcripts per locus. As of v13, GENCODE outperforms the other gene sets with 12,393 genes and 19,835 transcripts (1.6 transcripts/locus).

The GENCODE pseudogene annotation is an integrated procedure including HAVANA manual annotation and two automated prediction pipelines: PseudoPipe (Yale U.) and RetroFinder (UCSC). The loci that are annotated by both PseudoPipe and RetroFinder are collected in a subset labeled "2-way consensus", which is further intersected with the manually annotated HAVANA pseudogenes. The intersection results in three subsets of pseudogenes. Level 1 pseudogenes are loci that have been identified by all three methods. Level 2 pseudogenes are loci that have been discovered through manual curation and were not found by either automated pipeline. Delta 2-way contains pseudogenes that have been identified only by computational pipelines and were not validated by manual annotation.
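The intersection described above can be sketched with plain set operations. This is an illustrative simplification under stated assumptions, not the consortium's pipeline: loci are represented by shared identifiers, whereas real predictions would have to be matched by genomic overlap. Level 2 is taken here as the manually annotated loci outside the 2-way consensus, which is one reading of the description and agrees with the counts shown on the poster (8,006 + 4,824 = 12,830); a stricter reading would subtract the union of both pipelines instead. The function name and locus identifiers are hypothetical.

```python
def classify_pseudogenes(havana, pseudopipe, retrofinder):
    """Split pseudogene loci into consensus-based subsets.

    All arguments are sets of locus identifiers (an assumption made
    for illustration; real loci are matched by genomic overlap).
    """
    # Loci predicted by both automated pipelines.
    consensus = pseudopipe & retrofinder
    return {
        "two_way_consensus": consensus,
        # Found by manual annotation and by both pipelines.
        "level_1": consensus & havana,
        # Manual annotation only (assumed: outside the consensus).
        "level_2": havana - consensus,
        # Consensus predictions without manual support.
        "delta_two_way": consensus - havana,
    }


# Toy identifiers, invented for this example.
havana = {"p1", "p2", "p3", "p4"}
pseudopipe = {"p1", "p2", "p5", "p6"}
retrofinder = {"p1", "p2", "p5", "p7"}

subsets = classify_pseudogenes(havana, pseudopipe, retrofinder)
```

With these toy sets, Level 1 and Level 2 together partition the manual set, and Level 1 and Delta 2-way together partition the consensus, mirroring the way the poster's counts add up.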
As a QC exercise to determine completeness of pseudogene annotation in chromosomes that have been manually annotated, 2-way consensus pseudogenes are analysed by the HAVANA team to establish their validity and are included in the manually annotated pseudogene set if appropriate.

Classification of GENCODE pseudogenes based on their origin. Processed pseudogenes are derived by a retrotransposition event and unprocessed pseudogenes by a gene duplication event, in both cases followed by the gain of a disabling mutation. Both processed and unprocessed pseudogenes can retain or gain transcriptional activity, which is reflected in the transcribed_processed and transcribed_unprocessed pseudogene classification. Polymorphic pseudogenes contain a disabling mutation in the reference genome but are known to be coding in other individuals, while unitary pseudogenes have functional protein-coding orthologs in other species but contain a fixed disabling mutation in human.

* The Wellcome Trust Sanger Institute (WTSI, Hinxton, Cambridge, United Kingdom), The University of Lausanne (Lausanne, Switzerland), Centre for Genomic Regulation (CRG, Barcelona, Spain), University of California, Santa Cruz (UCSC, Santa Cruz, California), Washington University, St. Louis (WashU, St. Louis, USA), Massachusetts Institute of Technology (MIT, Boston, Massachusetts, USA), Yale University (Yale, New Haven, Connecticut, USA), The Spanish National Cancer Research Centre (CNIO, Madrid, Spain)

[Venn diagram; labels: REFSEQ, GENCODE, UCSC; values shown: 1,376, 108,334, 1,773]

Overlap among GENCODE, RefSeq and UCSC data sets. Both protein-coding and lncRNA transcripts were compared. Two transcripts were considered to match if all their exon junction coordinates were identical in the case of multiexonic transcripts, or if their transcript coordinates were the same for monoexonic transcripts.

[Panel titles: Pseudogene pipeline; Pseudogene biotypes]

Distribution of genes and transcripts in the GENCODE v13 data set across four broad biotypes.
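The transcript matching rule used for the GENCODE, RefSeq and UCSC overlap comparison (identical exon junctions for multiexonic transcripts, identical transcript coordinates for monoexonic ones) can be sketched as follows. The record layout, the function names, the requirement that chromosome and strand agree, and the decision that a monoexonic transcript never matches a multiexonic one are assumptions made for this illustration.

```python
def junctions(exons):
    """Exon junctions as (left_exon_end, right_exon_start) pairs.

    Exons are (start, end) tuples in genomic coordinates.
    """
    ordered = sorted(exons)
    return tuple((left[1], right[0]) for left, right in zip(ordered, ordered[1:]))


def transcripts_match(a, b):
    """Apply the matching rule to two transcript records.

    Each record is a dict with 'chrom', 'strand' and 'exons' keys
    (a hypothetical layout chosen for this sketch).
    """
    if a["chrom"] != b["chrom"] or a["strand"] != b["strand"]:
        return False
    exons_a, exons_b = sorted(a["exons"]), sorted(b["exons"])
    multi_a, multi_b = len(exons_a) > 1, len(exons_b) > 1
    if multi_a != multi_b:
        return False  # assumed: mono- and multiexonic never match
    if multi_a:
        # Multiexonic: every junction must be identical; the outer
        # transcript ends are allowed to differ.
        return junctions(exons_a) == junctions(exons_b)
    # Monoexonic: transcript start and end must be identical.
    return exons_a[0] == exons_b[0]


# Toy transcripts with invented coordinates.
t1 = {"chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400), (500, 650)]}
t2 = {"chrom": "chr1", "strand": "+", "exons": [(150, 200), (300, 400), (500, 600)]}
t3 = {"chrom": "chr1", "strand": "+", "exons": [(100, 200), (310, 400), (500, 650)]}
m1 = {"chrom": "chr1", "strand": "-", "exons": [(10, 90)]}
m2 = {"chrom": "chr1", "strand": "-", "exons": [(10, 95)]}
```

Here `t1` and `t2` match because they share both junctions even though their outer ends differ, `t3` does not match `t1` because one junction is shifted, and `m1` and `m2` do not match because their end coordinates differ.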
* Excluding immunoglobulin gene segments

[Figure values: 17,091; 5,782; 16,411; 32,662]

Classification of GENCODE lncRNAs according to their relation with protein-coding loci. LincRNAs are long intergenic non-coding RNA loci with a length >200 bp. Antisense lncRNAs overlap any coding exon of a locus on the opposite strand. Sense_intronic lncRNAs lie in introns of a coding gene and do not overlap any of its exons. Sense_overlapping lncRNAs contain a coding gene within an intron on the same strand.

[Browser screenshot labels: GENCODE track configuration display; Linked to Ensembl gene view page; UCSC transcript view page showing metadata; GENCODE gene annotation in basic/comprehensive display mode]

[Panel titles: LncRNA biotypes; LncRNA start site accuracy; Comparison with other lncRNA sets; LncRNA structure complexity]

Related publications

Harrow et al. 2012. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 22:1760-1774
Derrien et al. 2012. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res. 22:1775-1789
Pei et al. 2012. The GENCODE pseudogene resource. Genome Biology 13:R51
Howald et al. 2012. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res. 22:1698-1710

[GENCODE pipeline diagram labels: Specialized prediction pipelines (Congo, PseudoPipe, RetroFinder, APPRIS, etc.); Automated annotation (Ensembl); GENCODE data set; Merge data; QC; Release and submit data set; Tracking system (AnnoTrack); Manual annotation (HAVANA); Experimental verification (RT-PCR-Seq, RACE); Assign validation level; Highlight conflicts; Track solutions; Annotate new regions; Update annotation incorporating QC; Novel and putative transcripts]

[LncRNA biotypes chart: Number of genes in GENCODE v13 by biotype; categories: antisense, sense-intronic, lincRNA, sense-overlapping; values shown: 557, 142, 6,096, 4,220]

Pseudogene pipeline counts:
PseudoPipe: 16,881
RetroFinder: 14,729
HAVANA: 12,830
2-way consensus: 9,023
Delta 2-way: 1,017
Level 1: 8,006
Level 2: 4,824
[Diagram labels: Feedback loop; intersect]
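The lncRNA biotype definitions given earlier (lincRNA, antisense, sense_intronic, sense_overlapping) can be read as interval tests against protein-coding genes. The sketch below is a simplified illustration, not the rule set used by the annotators: it assumes all features lie on one chromosome, uses whole exons rather than coding portions of exons, omits the >200 bp length check, applies the tests in an arbitrary priority order, and returns "unclassified" for cases the definitions do not cover. All names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def within(inner, outer):
    """True if interval inner lies entirely inside interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def span(exons):
    """Genomic extent covered by a list of (start, end) exons."""
    return (min(s for s, _ in exons), max(e for _, e in exons))


def introns(exons):
    """Intervals between consecutive exons."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]


def classify_lncrna(lnc, coding_genes):
    """Assign a biotype label to one lncRNA (simplified sketch)."""
    lnc_span = span(lnc["exons"])
    # Antisense: overlaps an exon of a gene on the opposite strand.
    for gene in coding_genes:
        if gene["strand"] != lnc["strand"] and any(
            overlaps(le, ge) for le in lnc["exons"] for ge in gene["exons"]
        ):
            return "antisense"
    for gene in coding_genes:
        if gene["strand"] != lnc["strand"]:
            continue
        # Sense intronic: lies entirely inside one intron of the gene.
        if any(within(lnc_span, i) for i in introns(gene["exons"])):
            return "sense_intronic"
        # Sense overlapping: the gene sits inside an intron of the lncRNA.
        if any(within(span(gene["exons"]), i) for i in introns(lnc["exons"])):
            return "sense_overlapping"
    # Intergenic: no overlap with the extent of any coding gene.
    if not any(overlaps(lnc_span, span(g["exons"])) for g in coding_genes):
        return "lincRNA"
    return "unclassified"


# Toy coding gene and lncRNAs with invented coordinates.
genes = [{"strand": "+", "exons": [(100, 200), (500, 600)]}]
labels = [
    classify_lncrna({"strand": "-", "exons": [(150, 250)]}, genes),
    classify_lncrna({"strand": "+", "exons": [(250, 300), (350, 400)]}, genes),
    classify_lncrna({"strand": "+", "exons": [(10, 50), (700, 800)]}, genes),
    classify_lncrna({"strand": "+", "exons": [(1000, 1100)]}, genes),
]
```

A production classifier would also need chromosome handling, the length threshold and a documented precedence between biotypes; those are left out here to keep the interval logic visible.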

