The Ensembl Gene set The “Genebuild” 21 April 2008.

1 The Ensembl Gene set The “Genebuild” 21 April 2008

2 Outline
- The GeneBuild (determining the Ensembl gene set)
- What it means for the scientist
- ‘Annotation pipeline’ vs ‘manual curation’
- Pseudogenes
- ncRNAs
- The CCDS project

3 Introduction: What is available? I) Sequence assemblies from genome sequencing efforts

4 Gene Sequencing: the Assembly. http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html This approach generates clones, unlike newer sequencing methods.

5 Clones Available. Human: tilepath (used in the assembly). Ciona intestinalis: shotgun assembly.

6 ContigView: Clones and Contigs. Figure labels: Contigs; Clones (plate/well numbers); Ensembl Transcripts.

7 Task: View the tilepath clone in ContigView for the region containing the human BRCA2 gene. Hint: Start with a search for the BRCA2 gene.

8 The Ensembl Geneset. How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome? Diagram labels: Protein; Sequence Assembly; Ensembl Geneset.

9 Once the Assembly is Imported… Proteins and mRNAs are aligned to it. These sequences have been submitted to databases such as UniProt/Swiss-Prot (manually curated) and RefSeq (partially manually curated).

10 The Biological Evidence. All Ensembl gene predictions are based on experimental evidence:
- UniProt/Swiss-Prot: a manually curated database, and therefore of highest accuracy
- NCBI RefSeq: a partially manually curated database
- UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ: primary nucleotide sequence repositories

11 Database Relationship. Diagram labels: individual labs’ submissions; EMBL-Bank, GenBank, DDBJ; NCBI RefSeq; UniProt (Swiss-Prot, TrEMBL).

12 Diagram labels: Sequence (Assembly); Proteins (e.g. Swiss-Prot); mRNA; EST; EMBL-Bank, GenBank, DDBJ; Ensembl Genebuild; EST genes; Manual annotation (HAVANA).

13 What is an Ensembl gene based on? Ensembl genes may be based on multiple proteins/mRNAs. Why do I want to know?…

14 Task: Look at the evidence for the human EPO gene. What was this gene based on? Hint: Go to Exon Information from the GeneView page.

15 EPO gene supporting evidence

16 Species-Specific GeneBuilds
- Pan troglodytes genes are built by projection from human genes.
- Zebrafish has many gene duplications.
- Homo sapiens genes must have protein evidence, not just mRNA.

17 Task: When was the chimpanzee (Pan troglodytes) Genebuild performed? Can you find information as to how genes were annotated? Hint: Look on the chimpanzee index page.

18 External Gene Set: VEGA/Havana. Human, zebrafish, mouse and dog. Havana transcripts in blue or gold… What are Havana transcripts?

19 Automatic vs Manual Annotation
Automatic annotation (Ensembl Genebuild):
- Quick
- Can use unfinished sequence or a shotgun assembly
- Consistent annotation
Manual annotation (Havana):
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consults publications as well as databases
- Handles ‘out of the ordinary’ biology
- However… slow, and needs finished sequence

20 Havana and Ensembl match. When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.
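
The "base pair for base pair" rule can be illustrated with a minimal sketch. The `Transcript` class, the function names and the coordinates below are hypothetical and are not taken from the Ensembl pipeline; the sketch only assumes that two transcript models count as identical when they share chromosome, strand and every exon boundary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model used only for this illustration."""
    source: str                          # "havana" or "ensembl"
    chrom: str
    strand: int                          # +1 or -1
    exons: Tuple[Tuple[int, int], ...]   # (start, end) pairs, inclusive


def identical_structure(a: Transcript, b: Transcript) -> bool:
    """True only if both models agree on chromosome, strand and every exon boundary."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and sorted(a.exons) == sorted(b.exons)
    )


def merge_status(havana: Transcript, ensembl: Transcript) -> str:
    """Assumed display rule: exact matches are merged (gold), others stay separate."""
    if identical_structure(havana, ensembl):
        return "merged (gold)"
    return "kept separate"


# Made-up coordinates, for illustration only.
manual = Transcript("havana", "7", 1, ((100, 200), (300, 450)))
auto_same = Transcript("ensembl", "7", 1, ((100, 200), (300, 450)))
auto_diff = Transcript("ensembl", "7", 1, ((100, 200), (300, 451)))

print(merge_status(manual, auto_same))
print(merge_status(manual, auto_diff))
```

A single-base difference in any exon boundary is enough to keep the two models separate in this sketch, which is the strictness the slide describes.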

21 Manually-curated gene sets in Ensembl
- Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris
- WormBase: Caenorhabditis elegans
- FlyBase: Drosophila melanogaster
- SGD: Saccharomyces cerevisiae

22 Consensus coding sequences (CCDS). A collaboration between NCBI, UCSC, Ensembl and Havana to agree on a coding sequence for a transcript. The long-term aim is to have a single gene set for human. http://www.ncbi.nlm.nih.gov/CCDS/ The genebuild pipeline has been modified to retain these CDSs.

23 What Can Go Wrong?
I) A gap in the assembly: the gene might not be found in Ensembl.
II) Fused genes: the gene might be associated with two names. (Figure label: BLAST hit, Swiss-Prot entry.)

24 Outline
- The genome sequence
- The Genebuild
- ‘Manual curation’ by Havana
- Other: EST gene set, pseudogenes, ncRNAs

25 Expressed Sequence Tags vs ‘cDNA’. ESTs are annotated separately. Why?
- mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.
- EST: lower-quality sequence. ‘One-shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence. ESTs are only 500-800 nucleotides long and are low-quality fragments, with a sequence error rate of ~2%.
BUT ESTs confer useful expression information:
- Discovery of new genes, especially in diseased organisms
- Tissue type
- Timing/developmental stage
- Sampling of more transcripts and variants
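
To put the ~2% figure in perspective, the short calculation below estimates how many erroneous bases a single EST read would carry. It is only a sketch: it assumes errors are independent and uniformly distributed along the read, which the slides do not state, and the function names are illustrative.

```python
ERROR_RATE = 0.02  # ~2% per-base error rate quoted for ESTs


def expected_errors(length: int, rate: float = ERROR_RATE) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return rate * length


def prob_error_free(length: int, rate: float = ERROR_RATE) -> float:
    """Probability that a read contains no errors, assuming independent errors."""
    return (1.0 - rate) ** length


for length in (500, 800):
    print(
        f"{length} nt: about {expected_errors(length):.0f} expected errors, "
        f"P(error-free) = {prob_error_free(length):.2e}"
    )
```

Under these assumptions a 500-800 nucleotide EST carries roughly 10 to 16 errors on average, and an error-free read is very unlikely, which is consistent with keeping ESTs out of the main GeneBuild.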

26 Where Can I See This EST Geneset? In ContigView: choose ‘EST genes’ (EST track).

27 Pseudogenes: ‘False’ Genes
- Unprocessed: produced by gene duplication and rearrangement.
- Processed: produced by reverse transcription of an mRNA and re-integration into the genome; the pseudogene retains the mRNA’s poly(A) tail (AAAAAA).

28 ncRNAs (non-coding RNAs). What types are in Ensembl?
- tRNA (transfer RNA)
- rRNA (ribosomal RNA)
- scRNA (small cytoplasmic RNA)
- snRNA (small nuclear RNA)
- snoRNA (small nucleolar RNA)
- miRNA (microRNA)

29 ncRNAs (2 types)
I) RNAs with low sequence homology can be identified through conserved secondary structure (search the genome using Rfam patterns).
II) RNAs with high sequence conservation (miRNA) are found by BLAST alignment; ‘RNA fold’ is then applied to make sure the sequences can fold (hairpin).
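
The idea behind the hairpin check can be sketched with a deliberately naive test for a base-paired stem enclosing a loop. This is not the ‘RNA fold’ step named on the slide, which would normally compute a minimum free energy structure; the function names and the thresholds `min_stem` and `min_loop` are assumptions chosen for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def longest_stem(seq: str, min_loop: int = 3) -> int:
    """Length of the longest run of stacked base pairs (i, j), (i+1, j-1), ...
    that still leaves at least min_loop unpaired bases in the enclosed loop."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    best = 0
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            # Extend the stem inward from the outer pair (i, j).
            length = 0
            a, b = i, j
            while b - a > min_loop and (seq[a], seq[b]) in PAIRS:
                length += 1
                a += 1
                b -= 1
            best = max(best, length)
    return best


def can_form_hairpin(seq: str, min_stem: int = 4, min_loop: int = 3) -> bool:
    """Crude filter: does the sequence contain a stem of at least min_stem pairs?"""
    return longest_stem(seq, min_loop) >= min_stem


print(can_form_hairpin("GGGGAAAACCCC"))  # four G-C pairs around an AAAA loop
print(can_form_hairpin("AAAAAAAAAAAA"))  # no complementary bases at all
```

The scan is cubic in sequence length in the worst case, which is acceptable for short miRNA-sized candidates but would not suit whole-genome searches.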

30 ncRNAs… where can I see them? Find them in ContigView, or use BioMart.

31 Summary – Ensembl Genes
- All Ensembl genes are based on biological evidence (protein and mRNA).
- One Ensembl gene may come from proteins and mRNAs in various databases.
- Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human.
- The CCDS set strives for consensus coding sequences across databases.
- Pseudogenes and RNAs are annotated, along with a separate EST gene set.

32 For more on GeneBuild: Help and Documentation (About Ensembl) http://www.ensembl.org/info/about/docs/genome_annotation.html

