Evaluating Genes and Transcripts (“Genebuild”)


1 Evaluating Genes and Transcripts (“Genebuild”)

2 Outline
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Ensembl / Havana merged gene set
- CCDS project

3 Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
- UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
- NCBI RefSeq: a partially manually curated database
- UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ: primary nucleotide sequence repository

4 The Ensembl Genebuild
Genome assembly + computer programs + experimental evidence -> Ensembl genes

5 The Ensembl Genebuild
A new release of Ensembl doesn’t contain a new genebuild for each species! New genebuilds are only done if there is:
- a new genome assembly
- a lot of new supporting evidence

6 Genome Assemblies
Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g.:
- NCBI: human, mouse
- Rat Genome Sequencing Consortium: rat
- Sanger: zebrafish
- Broad Institute: mammals
- Baylor College: cow
- Washington University: chicken
- etc.

7 The Ensembl Genebuild
- Targeted build: align species-specific proteins to the genome to create transcripts
- Similarity build: align proteins from closely related species to locate additional transcripts
- Add UTRs using mRNA evidence
- Eliminate redundant transcripts and create genes
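The last step above (eliminate redundant transcripts and create genes) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the Ensembl pipeline's code: transcripts are represented as tuples of exon coordinates on a single strand, "redundant" is taken to mean an identical exon structure, and transcripts are grouped into one gene when any of their exons overlap. The names `remove_redundant`, `exons_overlap` and `cluster_into_genes` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]          # (start, end), inclusive coordinates
Transcript = Tuple[Exon, ...]   # exons of one transcript, all on the same strand


def remove_redundant(transcripts: List[Transcript]) -> List[Transcript]:
    """Keep one copy of each distinct exon structure (assumed redundancy rule)."""
    seen = set()
    unique = []
    for t in transcripts:
        key = tuple(sorted(t))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster_into_genes(transcripts: List[Transcript]) -> List[List[Transcript]]:
    """Group non-redundant transcripts into genes by exon overlap (assumed rule)."""
    genes: List[List[Transcript]] = []
    for t in remove_redundant(transcripts):
        keep, merged = [], [t]
        for gene in genes:
            if any(exons_overlap(t, u) for u in gene):
                merged.extend(gene)   # t links this gene to the new cluster
            else:
                keep.append(gene)
        genes = keep + [merged]
    return genes


example = [
    ((100, 200), (300, 400)),
    ((100, 200), (300, 400)),   # exact duplicate, dropped
    ((150, 200), (300, 450)),   # overlaps the first, same gene
    ((1000, 1100),),            # separate locus, separate gene
]
genes = cluster_into_genes(example)
```

With the example input, the duplicate is removed and the remaining three transcripts fall into two genes.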

8 “Special” cases
- Pseudogenes
- Non-coding RNA genes
  - sequences from RFAM and miRBase dbs and covariance models
  - hand-checked set
- Ig Segment Genes (Immunoglobulin and T-cell receptor segments)
  - sequences from IMGT db and Exonerate

9 Classification of Transcripts
- Ensembl transcripts and proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries
- Genes that map to species-specific protein/mRNA records are classified as known
- Genes that do not map to species-specific protein/mRNA records are classified as novel
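The known / novel rule above can be written down as a short sketch. It assumes each mapped database record carries the species it belongs to; the `Xref` record, its fields, the accessions and the function name are hypothetical and only illustrate the rule as stated on the slide.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Xref:
    """A mapped external record (hypothetical structure for illustration)."""
    source: str      # e.g. "Swiss-Prot", "RefSeq", "TrEMBL"
    accession: str   # placeholder accessions are used in the example below
    species: str


def classify_gene(xrefs: Iterable[Xref], gene_species: str) -> str:
    """Return "known" if any mapped record is from the gene's own species."""
    for xref in xrefs:
        if xref.species == gene_species:
            return "known"
    return "novel"


same_species = [Xref("Swiss-Prot", "X00001", "Homo sapiens")]
other_species = [Xref("RefSeq", "X00002", "Mus musculus")]
labels = (
    classify_gene(same_species, "Homo sapiens"),
    classify_gene(other_species, "Homo sapiens"),
    classify_gene([], "Homo sapiens"),
)
```

A gene supported only by records from another species, or by no mapped record at all, comes out as novel.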

10 Names and Descriptions
- Transcript names are inferred from mapped transcripts and proteins
  - Swiss-Prot > RefSeq > TrEMBL ID
  - Novel transcripts have only Ensembl identifiers
- Genes are assigned the official gene symbol if available
  - HGNC (HUGO) symbol for human genes
  - Species-specific nomenclature committees (MGI, ZFIN etc.)
  - Otherwise Swiss-Prot > RefSeq > TrEMBL ID
- Gene description is inferred from mapped database entries; the source is always given
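The naming precedence above (official symbol first for genes, then Swiss-Prot > RefSeq > TrEMBL, then the Ensembl identifier alone) is a simple ordered lookup. The sketch below assumes the mapped IDs are available as a dictionary keyed by source name; the function names and all identifiers in the example are hypothetical placeholders.

```python
from typing import Dict, Optional

# Precedence taken from the slide: Swiss-Prot > RefSeq > TrEMBL
SOURCE_PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_transcript_name(mapped_ids: Dict[str, str], ensembl_id: str) -> str:
    """Use the highest-priority mapped ID; novel transcripts keep the Ensembl ID."""
    for source in SOURCE_PRIORITY:
        if source in mapped_ids:
            return mapped_ids[source]
    return ensembl_id


def pick_gene_name(official_symbol: Optional[str],
                   mapped_ids: Dict[str, str],
                   ensembl_id: str) -> str:
    """Prefer the official symbol (e.g. from HGNC, MGI, ZFIN) when one exists."""
    if official_symbol:
        return official_symbol
    return pick_transcript_name(mapped_ids, ensembl_id)


mapped = {"TrEMBL": "TREMBL_ID_1", "RefSeq": "REFSEQ_ID_1"}
transcript_name = pick_transcript_name(mapped, "ENS_PLACEHOLDER_T1")
gene_name = pick_gene_name("SYMBOL1", mapped, "ENS_PLACEHOLDER_G1")
novel_name = pick_gene_name(None, {}, "ENS_PLACEHOLDER_G2")
```

Here the transcript takes the RefSeq ID because no Swiss-Prot mapping exists, the gene takes its official symbol, and the gene with no mappings keeps its Ensembl identifier.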

11 Supporting evidence: ExonView (figure labels: mRNA, peptide, mRNA; UTR, coding/UTR)

12 Supporting evidence: ContigView

13 Configuring the Genebuild
Genebuild configured for each species
- Data availability
  - Targeted build most useful in human, mouse
  - Similarity build most useful in C. intestinalis, mosquito
- Structural issues
  - Zebrafish
    - Many duplications
    - Genome from different haplotypes
  - Mosquito
    - Many single-exon genes
    - Genes within genes

14 Low Coverage Genomes
- Low coverage genomes (~2x) come in lots of scaffolds: a “classic” genebuild will result in many partial and fragmented genes
- Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

15 Low Coverage Genomes (figure labels: NNNNNN, “gene-scaffold”, reference assembly)
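The gene-scaffold idea can be sketched in a few lines. This is a toy illustration, not the Ensembl implementation: it assumes the whole genome alignment has already given each scaffold a start position on the reference, ignores scaffold orientation, and joins the scaffolds in reference order with a run of N as a spacer (six Ns here, echoing the figure label). The function names and the `gap` parameter are hypothetical.

```python
from typing import Dict, List


def order_by_reference(ref_starts: Dict[str, int]) -> List[str]:
    """Order scaffold names by where they align on the reference (assumed input)."""
    return sorted(ref_starts, key=ref_starts.get)


def build_gene_scaffold(scaffolds: Dict[str, str],
                        order: List[str],
                        gap: int = 6) -> str:
    """Concatenate scaffolds in the given order, padding each join with N."""
    spacer = "N" * gap
    return spacer.join(scaffolds[name] for name in order)


scaffolds = {"scaffold_a": "ACGT", "scaffold_b": "GGCC"}
ref_starts = {"scaffold_a": 5000, "scaffold_b": 1200}   # hypothetical positions
order = order_by_reference(ref_starts)
gene_scaffold = build_gene_scaffold(scaffolds, order)
```

In this example `scaffold_b` aligns upstream of `scaffold_a`, so the gene-scaffold is `scaffold_b`, the N spacer, then `scaffold_a`.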

16 EST Gene Set
- ESTs (Expressed Sequence Tags) are single reads, with a high chance of sequencing mistakes
- EST libraries are regularly contaminated with genomic DNA
- Generally ~400 bp, so unlikely to cover a whole gene
THEREFORE: EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set

17 EST Gene Set: ContigView (figure labels: ESTs, EST genes)

18 Ab initio Predictions
- Predict translatable transcript structures solely on the basis of genome sequence.
- No validation with biological expression information.
- GENSCAN for vertebrate genomes
- SNAP better for invertebrates
NB: Both programs over-predict transcript structures.
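To make "solely on the basis of genome sequence" concrete, here is a deliberately naive stand-in: a forward-strand open reading frame scan. It is not how GENSCAN or SNAP work (both use probabilistic gene models); it only shows that a sequence-only method will report anything that looks translatable, with no expression evidence to confirm it, which is one intuition for over-prediction. The function name and the `min_codons` threshold are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code assumed


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) 0-based half-open spans of ATG..stop ORFs, forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs


sequence = "CCATGAAATAGCC"
orfs = find_orfs(sequence)
```

Lowering `min_codons` makes the scan report more and shorter candidates, none of which is backed by biological evidence.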

19 Ab initio Predictions: ContigView (figure label: GENSCAN prediction)

20 Automatic vs Manual Annotation
Automatic Annotation
- Quick
- Use unfinished sequence or shotgun assembly
- Consistent annotation
Manual Annotation
- Slow
- Need finished sequence
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consult publications as well as databases

21 Annotation that Causes Problems for Ensembl
- Multiple variants
- UTRs
- Pseudogenes
- Non-coding genes (ncRNAs)
- Overlapping genes, anti-sense genes
- Gene duplication events

22 Manually Curated Gene Sets
- FlyBase: fruitfly
- WormBase: C. elegans
- SGD: yeast
- Vega: human, zebrafish, mouse, dog

23 Vega Genome Browser: http://vega.sanger.ac.uk

24 Vega Transcripts
Vega transcripts:
- Vega Havana: transcripts annotated by the Havana team at Sanger
- Vega External: transcripts annotated by other Vega teams

25 Ensembl / Havana Merge
Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set.
Transcripts:
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue
Genes:
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue

26 Ensembl / Havana Merge (figure labels: merged Ensembl / Havana gene, merged Ensembl / Havana transcript)
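The merge and its colour key can be pictured as a set comparison. The slides do not state the exact merge criterion, so this sketch assumes, purely for illustration, that a transcript counts as merged when both sets contain an identical exon structure. The colours come from the slide; the type and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Structure = Tuple[Tuple[int, int], ...]   # exon (start, end) pairs of one transcript

# Colour key as given on the slide
COLOURS = {"Ensembl/Havana": "gold", "Ensembl": "red / black", "Havana": "blue"}


def merge_sets(ensembl: List[Structure],
               havana: List[Structure]) -> Dict[Structure, str]:
    """Label each distinct structure by which set(s) it occurs in (assumed criterion)."""
    in_ensembl, in_havana = set(ensembl), set(havana)
    labels = {}
    for structure in in_ensembl | in_havana:
        if structure in in_ensembl and structure in in_havana:
            labels[structure] = "Ensembl/Havana"
        elif structure in in_ensembl:
            labels[structure] = "Ensembl"
        else:
            labels[structure] = "Havana"
    return labels


shared = ((100, 200), (300, 400))
ensembl_only = ((100, 200), (300, 450))
havana_only = ((900, 950),)
labels = merge_sets([shared, ensembl_only], [shared, havana_only])
colours = {s: COLOURS[label] for s, label in labels.items()}
```

The structure present in both sets is labelled Ensembl/Havana and drawn in gold; the others keep the colour of their source set.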

27 CCDS (Consensus Coding Sequences)
- Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse
- Long-term aim is to get to a single gene set for human and mouse
- The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
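"Complete (ATG->stop)" can be checked mechanically on a spliced coding sequence. The sketch below is a minimal check under the assumption of the standard genetic code; it is not the CCDS project's own validation, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code assumed


def is_complete_cds(cds: str) -> bool:
    """True if the CDS starts with ATG, ends with a stop codon, is a whole
    number of codons long, and contains no in-frame stop before the end."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


results = {
    "ATGAAATAG": is_complete_cds("ATGAAATAG"),        # ATG ... stop
    "ATGAAA": is_complete_cds("ATGAAA"),              # no stop codon
    "ATGTAGAAATAA": is_complete_cds("ATGTAGAAATAA"),  # internal stop
    "AAGAAATAG": is_complete_cds("AAGAAATAG"),        # no start codon
}
```

Only the first sequence passes; the other three each fail one of the completeness conditions.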

28 Q & A

