Alternative Splicing from ESTs Eduardo Eyras Bioinformatics UPF – February 2004.

Outline: Intro; ESTs; Prediction of Alternative Splicing from ESTs

[Figure: DNA (exons, introns) → Transcription → pre-mRNA (5' to 3') → Splicing → Mature mRNA (5' CAP, poly-A tail) → Translation → Peptide]

[Figure: DNA (exons, introns) → Transcription → pre-mRNA (5' to 3') → Different Splicing → Mature mRNA (5' CAP, poly-A tail) → Translation → Different Peptide]

Alt splicing as a mechanism of gene regulation: Functional domains can be added/subtracted → protein diversity. It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs. It can modify the activity of transcription factors, affecting the expression of genes. It is observed in nearly all metazoans. It is estimated to occur in 30%-40% of human genes.

Forms of alternative splicing: Exon skipping / inclusion; Alternative 3' splice site; Alternative 5' splice site; Mutually exclusive exons; Intron retention. [Figure legend: Constitutive exon; Alternatively spliced exons]

How to study alternative splicing?

ESTs (Expressed Sequence Tags): Single-pass sequencing of a small (end) piece of cDNA. Typically [...] nucleotides long. It may contain coding and/or non-coding regions.

ESTs: Cells from a specific organ, tissue or developmental stage. [Figure: cDNA synthesis. Steps shown: mRNA extraction; add oligo-dT primer; reverse transcriptase; ribonuclease H; DNA polymerase; ribonuclease H. Intermediates shown: mRNA with poly-A tail, RNA/DNA hybrid, double-stranded cDNA]

ESTs: Clone cDNA into a vector → multiple cDNA clones → single-pass sequence reads (5' EST and 3' EST)

Alternative Splicing from ESTs [Figure: Genomic → Primary transcript → Splicing → Splice variants → cDNA clones → EST sequences (5', 3')]

ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)

EST sequencing: Is fast and cheap. Gives direct information about the gene sequence. Partial information. Resulting ESTs (DB searches): known gene; similar to known gene; contaminant; novel gene.

ESTs provide expression data: eVOC Ontologies
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample from which the sample was prepared. Examples are normal, lymphoma and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are embryo, fetus and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

Linking the expression vocabulary to gene annotations [Figure: ESTs, Genes]

Normalized vs. non-normalized libraries

The downside of ESTs: They cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory). Random sampling: the more ESTs we sequence, the fewer new useful sequences we get.
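The diminishing return of random sampling can be illustrated with a small calculation. This is only a sketch under an idealized assumption that is not in the slides: each EST is drawn independently from a gene with probability proportional to that gene's expression level. The function name `expected_genes_seen` and the toy abundances are hypothetical.

```python
def expected_genes_seen(abundances, n_ests):
    """Expected number of distinct genes hit after sampling n_ests ESTs.

    Assumption (illustrative only): every EST is drawn independently,
    and gene i is picked with probability abundances[i] / sum(abundances).
    """
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** n_ests for a in abundances)


# Toy library: 10 highly expressed genes and 990 rarely expressed ones.
toy_abundances = [1000.0] * 10 + [1.0] * 990

for n in (1_000, 10_000, 100_000):
    seen = expected_genes_seen(toy_abundances, n)
    print(f"{n:>7} ESTs -> about {seen:.0f} of {len(toy_abundances)} genes seen")
```

Under this model each additional batch of ESTs uncovers fewer new genes than the previous one, and rarely expressed genes stay undetected longest.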

Gene Hunting: Sequencing of the Human Genome (HGP); EST Sequencing

Origin of the ESTs Science Jun 21;252(5013): Complementary DNA sequencing: expressed sequence tags and human genome project. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al. Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD. Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.

EST-sequencing explosion: Merck and WashU (1994) → public ESTs → GenBank → dbEST → non-exclusivity (1992)

dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism:
Homo sapiens (human): 5,472,005
Mus musculus + domesticus (mouse): 4,056,481
Rattus sp. (rat): 583,841
Triticum aestivum (wheat): 549,926
Ciona intestinalis: 492,511
Gallus gallus (chicken): 460,385
Danio rerio (zebrafish): 450,652
Zea mays (maize): 391,417
Xenopus laevis (African clawed frog): 359,901
…

EST lengths [Figure: Human EST length distribution (dbEST Sep ), ~ 450 bp]

Recover the mRNA from the ESTs

What is an EST cluster? A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity. Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene (Burke, Davison, Hide, Genome Research 1999).
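One simple way to picture "consolidated according to sequence similarity" is single-linkage grouping: any two sequences joined by a similarity hit end up in the same cluster. The sketch below assumes the pairwise hits have already been computed by some alignment tool. It is not the procedure used by any particular clustering resource, and the function name and sequence identifiers are hypothetical.

```python
def cluster_by_similarity(seq_ids, similar_pairs):
    """Group sequences into clusters by single linkage (union-find).

    seq_ids: all EST/mRNA identifiers.
    similar_pairs: (id_a, id_b) pairs judged similar by an external
    comparison step (assumed to be given).
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "est2"), ("est2", "mrna1")]
print(cluster_by_similarity(ids, hits))
```

Single linkage is the weakest possible grouping rule: one spurious hit (for example from a chimeric clone or a shared repeat) is enough to fuse two genes into one cluster.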

EST pre-processing: screen for vector, repeats, mitochondrial sequence and xenocontaminants

EST Clustering: UniGene (NCBI); TIGR Human Gene Index (The Institute for Genomic Research); StackDB (South African Bioinformatics Institute)

UniGene
Species: UniGene Entries
Homo sapiens: 118,517
Mus musculus: 82,482
Rattus norvegicus: 43,942
Sus scrofa: 20,426
Gallus gallus: 11,970
Xenopus laevis: 21,734
Xenopus tropicalis: 17,102
…

ESTs and the Genome

ESTs aligned to the genome. Some advantages: It defines the location of exons and introns. We can verify the splice sites of introns (e.g. GT-AG) → hence also check the correct strand of spliced ESTs. It helps prevent chimeras. It can avoid putting together ESTs from paralogous genes. We can prevent including pseudogenes in our analysis.

Aligning ESTs to the Genome: Many ESTs → fast programs, fast computers. Nearly exact matches: Coverage >= 97%, Percent_id >= 97%. Splice sites: GT-AG, AT-AC, GC-AG
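The thresholds and splice-site pairs on this slide can be sketched as a small filter. This is only an illustration: the function names, the 0-based half-open intron coordinates, and the uppercase A/C/G/T genome string are assumptions, and no specific alignment program is implied. The strand check reflects the idea that an intron read on the reverse strand shows the reverse complement of the donor/acceptor pair.

```python
CANONICAL_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def passes_filter(coverage, percent_id, min_coverage=97.0, min_percent_id=97.0):
    """Keep only nearly exact matches (thresholds taken from the slide)."""
    return coverage >= min_coverage and percent_id >= min_percent_id


def intron_strand(genome, start, end):
    """Infer the strand of the intron genome[start:end] from its splice sites.

    Returns '+', '-' or None when neither strand shows an accepted pair.
    """
    first_two, last_two = genome[start:start + 2], genome[end - 2:end]
    if (first_two, last_two) in CANONICAL_SPLICE_SITES:
        return "+"
    # On the reverse strand the donor is the reverse complement of the
    # last two bases and the acceptor that of the first two bases.
    if (reverse_complement(last_two), reverse_complement(first_two)) in CANONICAL_SPLICE_SITES:
        return "-"
    return None


forward = "CCCC" + "GTAAAAAG" + "CCCC"
reverse = "CCCC" + "CTTTTTAC" + "CCCC"
print(intron_strand(forward, 4, 12), intron_strand(reverse, 4, 12))
print(passes_filter(coverage=98.5, percent_id=97.0))
```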

Aligning ESTs to the Genome. Extra pre-processing of ESTs: Clip poly-A tails / clip 20 bp from either end. Best in genome. Remove potential processed pseudogenes. Give preference to ESTs that are spliced.
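The clipping step can be sketched in a few lines. The 20 bp figure comes from the slide. The minimum poly-A run length of 6 and the choice to apply both clips in sequence are assumptions made here for illustration (the slide separates them with a slash, which may mean they are alternatives), and `clip_est` is a hypothetical name.

```python
import re


def clip_est(seq, end_clip=20, min_poly_a=6):
    """Remove a trailing poly-A tail, then clip end_clip bases from each end.

    min_poly_a is an assumed threshold, not a value from the slides.
    Returns an empty string if nothing would be left after clipping.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_poly_a, "", seq)  # strip 3' poly-A run
    if len(seq) <= 2 * end_clip:
        return ""
    return seq[end_clip:len(seq) - end_clip]


read = "ACGT" * 15 + "A" * 10
print(clip_est(read))
```

Clipping the read ends discards the positions where single-pass reads tend to be least reliable, at the cost of some sequence.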

Human ESTGenes [Figure: Genomic length distribution of aligned human ESTs; ~ 400 bp; tail up to ~ 800 kb]

The Problem: What are the transcripts represented in this set of mapped ESTs? [Figure: ESTs aligned to the Genome]

Transcript predictions: Predict transcripts from ESTs. Merge ESTs according to splicing structure compatibility.

Representation: Sort by the smallest coordinate ascending and by the largest coordinate descending. Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not. The redundancy relation is a graph with two kinds of edges, Extension and Inclusion. [Figure: ESTs x, y, z with extension and inclusion edges]
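The sort order and the two redundancy relations can be sketched as follows. This is a simplified reading, not the published definition: an EST is a list of (start, end) exons in half-open genomic coordinates, two ESTs count as compatible only if they have exactly the same introns inside their overlap (no mismatch tolerance), and the function names are hypothetical.

```python
def span(exons):
    """Genomic start and end of an EST given as a list of (start, end) exons."""
    return exons[0][0], exons[-1][1]


def introns(exons):
    """Introns implied by consecutive exons, as (intron_start, intron_end)."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def sort_ests(ests):
    """Smallest coordinate ascending, then largest coordinate descending."""
    return sorted(ests, key=lambda e: (span(e)[0], -span(e)[1]))


def relation(x, y):
    """Relation of y to x, assuming x sorts before y.

    Returns 'inclusion' (y lies within x), 'extension' (y reaches beyond
    the end of x) or None when the two are not redundant.
    """
    (xs, xe), (ys, ye) = span(x), span(y)
    lo, hi = max(xs, ys), min(xe, ye)
    if lo >= hi:
        return None  # no genomic overlap

    def introns_in_overlap(est):
        return {i for i in introns(est) if i[0] < hi and i[1] > lo}

    if introns_in_overlap(x) != introns_in_overlap(y):
        return None  # different splicing in the shared region
    if xs <= ys and ye <= xe:
        return "inclusion"
    return "extension"


x = [(100, 200), (300, 400), (500, 600)]
y = [(150, 200), (300, 350)]
z = [(350, 400), (500, 600), (700, 800)]
w = [(150, 250), (300, 400)]
print(relation(x, y), relation(x, z), relation(x, w))
```

With the ESTs sorted this way, an EST can only be included in or extend one that comes earlier in the list, so each pair needs to be examined in one direction only.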

Criteria of merging: Allow internal mismatches. Allow intron mismatches. Allow edge-exon mismatches.

Transitivity: Extension and Inclusion are transitive. This reduces the number of comparisons needed. [Figure: ESTs x, y, z, w with extension and inclusion edges]

ClusterMerge graph: Each node defines an inclusion sub-tree. Extensions form acyclic graphs. [Figure: ESTs x, y, z, w and the corresponding graph]

Recovering the Solution: Mergeable sets of ESTs can be recovered as special paths in the graph.

Recovering the Solution: Root: does not extend any node. Leaf: not extended, and root of an inclusion tree.

Recovering the Solution: Any set of ESTs in a path from a root to a leaf is mergeable.

Recovering the Solution: Add the inclusion tree attached to each node in the path.

Recovering the Solution: Lists produced: (1,2,3,4,5,6,7,8), (1,2,3,4,5,6,7,9). This representation minimizes the necessary comparisons between ESTs.
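The recovery of mergeable lists can be sketched as a walk over root-to-leaf paths, collecting the inclusion tree of every node on the path. The graph used below is invented purely so that the output matches the two lists quoted on the slide; the real figure is not recoverable from the transcript. The data layout (`extended_by`, `includes`) and function names are assumptions.

```python
def mergeable_lists(extended_by, includes):
    """Enumerate mergeable EST sets from an extension/inclusion graph.

    extended_by[x]: ESTs that extend x (edges of the acyclic extension graph).
    includes[x]:    ESTs directly included in x (children in its inclusion tree).
    """
    def inclusion_tree(node):
        members = [node]
        for child in includes.get(node, []):
            members.extend(inclusion_tree(child))
        return members

    included = {c for children in includes.values() for c in children}
    extenders = {y for ys in extended_by.values() for y in ys}
    path_nodes = (set(extended_by) | extenders | set(includes)) - included
    # Root: a node that does not extend any other node.
    roots = sorted(n for n in path_nodes if n not in extenders)

    results = []

    def walk(node, path):
        path = path + [node]
        successors = extended_by.get(node, [])
        if not successors:  # leaf: not extended by anything
            members = []
            for n in path:
                members.extend(inclusion_tree(n))
            results.append(sorted(set(members)))
            return
        for nxt in successors:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return results


# Hypothetical topology chosen to reproduce the lists on the slide.
extended_by = {1: [3], 3: [5], 5: [8, 9]}
includes = {1: [2], 3: [4], 5: [6], 6: [7]}
print(mergeable_lists(extended_by, includes))
```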

How to build the graph: Mutual recursion. Search graph (leaves); recursive search along extension branch; search sub-graph. Inclusion => go up in the tree.

How to build the graph: Example [Animation frames; the graph figures are not recoverable. Frame labels in order: Leaves; Inclusion; Inclusion; Extension; Inclusion; Place 7; Inclusion; tagged as visited - skip]

How to build the graph: Example. Possible sub-trees beyond 1 or 3 remain unseen! The representation minimizes the necessary comparisons.

Deriving the transcripts from the lists: Internal splice sites: external coordinates of the 5' and 3' exons are not allowed to contribute.

Deriving the transcripts from the lists: Splice sites are set to the most common coordinate. 5' and 3' coordinates are set to the exon coordinate that extends the potential UTR the most.
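These rules can be sketched as a vote per exon boundary. The input layout is an assumption made for illustration: the ESTs of one mergeable list are taken to be already grouped by transcript exon, on the forward strand, and each observation records whether that exon is the EST's own first or last exon so that read ends do not vote on internal splice sites. The function names are hypothetical.

```python
from collections import Counter


def most_common(values):
    """Most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def consensus_transcript(exon_columns):
    """Derive exon coordinates from the ESTs of one mergeable list.

    exon_columns[i] holds one (start, end, est_first, est_last) tuple per
    EST covering transcript exon i. est_first/est_last say whether this is
    the first/last exon of that EST, i.e. whether its start/end is a read
    end rather than a splice site. Assumes every internal boundary is
    supported by at least one spliced EST.
    """
    last = len(exon_columns) - 1
    transcript = []
    for i, column in enumerate(exon_columns):
        if i == 0:
            # Transcript 5' end: the coordinate that extends the UTR most.
            start = min(s for s, _, _, _ in column)
        else:
            start = most_common([s for s, _, est_first, _ in column if not est_first])
        if i == last:
            # Transcript 3' end: the coordinate that extends the UTR most.
            end = max(e for _, e, _, _ in column)
        else:
            end = most_common([e for _, e, _, est_last in column if not est_last])
        transcript.append((start, end))
    return transcript


columns = [
    [(100, 200, True, False), (120, 200, True, False), (90, 198, True, False)],
    [(300, 400, False, False), (302, 400, False, False),
     (300, 380, False, True), (320, 400, True, False)],
    [(500, 650, False, True), (500, 700, False, True)],
]
print(consensus_transcript(columns))
```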

Single exon transcripts: Reject resulting single exon transcripts when using ESTs.

Annotation with ESTs: ESTs aligned to the genome can provide information about UTRs and alternative splicing.

Annotation with ESTs EST-Transcripts at

Annotation with ESTs

Results for Human and Mouse
Human EST-genes (assembly ncbi33): 38,581 Genes; 122,247 Transcripts (42% with full CDS)
Mouse EST-genes (assembly ncbi30): 32,848 Genes; 103,664 Transcripts (36% with full CDS)

How many transcripts are conserved? Is Alternative Splicing conserved?

EST-transcript pairs: 42,625 transcript pairs (in 18,242 gene pairs). Of the gene pairs, 78% have one transcript pair conserved and 22% have more than one transcript pair conserved. For 22% of the gene pairs some form of alt. splicing is conserved.

Conservation of Alt. Splicing: Take gene pairs with more than one transcript pair. 19% of alt. variants in human are conserved in mouse. 32% of alt. variants in mouse are conserved in human.

\[ \%\,\text{conservation} = \frac{\sum (\text{number of paired transcripts} - 1)}{\sum (\text{number of transcripts} - 1)} \]

∑ = sum over genes in a gene pair with more than one variant (the 1 subtracted is the 'main' transcript form).
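The conservation ratio can be worked through on a toy example. The gene counts below are invented for illustration and are unrelated to the 19% and 32% figures reported on the slide. The function name is hypothetical, and the result is multiplied by 100 to express it as a percentage.

```python
def percent_conservation(genes):
    """Percentage of alternative variants that are conserved.

    genes: one (n_transcripts, n_paired_transcripts) tuple per gene of a
    given species, restricted to gene pairs with more than one variant.
    One transcript per gene is subtracted as the 'main' form.
    """
    conserved_variants = sum(paired - 1 for _, paired in genes)
    all_variants = sum(total - 1 for total, _ in genes)
    return 100.0 * conserved_variants / all_variants


# Invented example: three genes with 4, 3 and 6 transcripts,
# of which 2, 2 and 3 have a counterpart in the other species.
toy_genes = [(4, 2), (3, 2), (6, 3)]
print(percent_conservation(toy_genes))  # (1 + 1 + 2) / (3 + 2 + 5) -> 40.0
```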

How many predicted ‘novel’ genes are validated by Human-Mouse comparison?

Novel genes [Figure: Venn diagram. Sets: ESTGenes not in Ensembl; human ESTGenes validated by comparison to mouse; ESTGenes with at least one complete ORF. Counts shown: 13,174; 18,242; 24,201]

Novel genes: 984 ESTGenes not in Ensembl, validated by comparison to mouse, with a complete ORF

THE END