Presentation on theme: "Alternative Splicing from ESTs"— Presentation transcript:
1 Alternative Splicing from ESTs. Eduardo Eyras, Bioinformatics UPF – February 2004. In this presentation I would like to give an overview of how Ensembl produces comparative genomics data. In particular I will present results of the comparison of the Mouse and Human genomes according to the Ensembl analyses.
2 Intro; ESTs; Prediction of Alternative Splicing from ESTs
5 Alt splicing as a mechanism of gene regulation. Functional domains can be added/subtracted (protein diversity). It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs. It can modify the activity of transcription factors, affecting the expression of genes. It is observed in nearly all metazoans. Estimated to occur in 30%-60% of human genes.
6 Forms of alternative splicing: exon skipping / inclusion; alternative 3’ splice site; alternative 5’ splice site; mutually exclusive exons; intron retention. Figure legend: constitutive exon; alternatively spliced exons. There are 5 types of alternative splicing. Exon skipping: one exon is not included in one of the variants. Alternative 3’ splice site: one variant contains an extra piece of sequence at the 3’ end. Alternative 5’ splice site: similarly, but at the 5’ end. Mutually exclusive exons: each exon of a pair is included in a different variant only, so the two never appear together. Intron retention: one intron in a variant is part of an exon in another variant.
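The exon-skipping case can be made concrete with a small sketch. It assumes each variant is given as a list of exons with (start, end) genomic coordinates on the forward strand, sorted by position; the function name `skipped_exons` and this representation are illustrative choices, not something taken from the slides.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, forward strand


def skipped_exons(a: List[Exon], b: List[Exon]) -> List[Exon]:
    """Return the internal exons of variant `a` that variant `b` skips.

    An exon of `a` counts as skipped when `b` has a single intron that joins
    exactly the two exons flanking it in `a`.
    """
    # Introns of b, written as (end of upstream exon, start of downstream exon).
    b_introns = {(b[i][1], b[i + 1][0]) for i in range(len(b) - 1)}
    return [
        a[i]
        for i in range(1, len(a) - 1)
        if (a[i - 1][1], a[i + 1][0]) in b_introns
    ]


long_variant = [(100, 200), (300, 400), (500, 600)]
short_variant = [(100, 200), (500, 600)]
print(skipped_exons(long_variant, short_variant))  # [(300, 400)]
```

The other four event types would need their own tests (shifted splice sites, retained introns, exon pairs that never co-occur); only exon skipping is sketched here.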
8 ESTs (Expressed Sequence Tags). Single-pass sequencing of a small (end) piece of cDNA. Typically nucleotides long. It may contain coding and/or non-coding regions. ESTs represent snapshots of the genome being expressed under a certain set of conditions. They are single-pass sequence reads from cDNAs cloned from a cell. They are usually short; the 5’ and 3’ ends of the clones are usually over-represented. Sequence quality usually diminishes towards the end of the ESTs. Some contain pieces of sequence from the vectors. ESTs may contain coding and non-coding regions from the cDNA. The information they provide can be biased by too restrictive a sampling. Note: mRNA is very unstable outside of a cell; therefore, scientists use special enzymes to convert it to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it was generated from an mRNA in which the introns had been removed, cDNA represents only expressed DNA sequence.
9 ESTs. Figure: cells from a specific organ, tissue or developmental stage; mRNA extraction; add oligo-dT primer; reverse transcriptase (RNA-DNA hybrid); ribonuclease H; DNA polymerase; double-stranded cDNA. Expressed genes are converted to mature RNAs after transcription and splicing. mRNA molecules are unstable but can be converted into cDNAs (complementary DNA). cDNA molecules are DNA molecules, hence have the usual base pairing and are more stable. Most eukaryotic mRNAs have a poly-A tail at their 3’ end. This is used as the priming site for the cDNA synthesis. The primer is a short stretch of synthetic DNA oligonucleotide, typically 20 nucleotides in length, made up entirely of T’s. After the first strand is synthesized, the preparation is treated with ribonuclease H, which specifically degrades the RNA component of an RNA-DNA hybrid. This is done so that short segments of the RNA are left to prime the second-strand synthesis, which is catalyzed by DNA polymerase I.
10 ESTs: single-pass sequence reads. Figure: clone cDNA into a vector; multiple cDNA clones; single-pass sequence reads give a 5’ EST and a 3’ EST. Double-stranded cDNAs are cloned into vectors; this generates what is called a clone library. Clones are picked at random for sequencing. Only short segments are sequenced from the 5’ and 3’ ends. The ESTs therefore represent the ends of expressed mRNAs.
11 Sampling the Transcriptome with ESTs. Figure: genomic sequence; primary transcript; splicing; splice variants; oligo-dT primer; reverse transcriptase; cDNA clones (double stranded); EST sequences (single-pass sequence reads). The reverse transcriptase used to manufacture each cDNA in the library will eventually fall off the template, and this will terminate the production of the cDNA. Thus a series of length-differentiated, 3’-delimited cDNA fragments may be produced for each mRNA that is a viable template in the library. The length of the cDNA will vary, and this is an important factor for development of coverage for each mRNA template of an available gene. Usually, several hundred to several thousand clones are isolated at random from a given cDNA library. Clones are sequenced a single time, from one or both ends of the DNA insert, using universal primers which are complementary to the vector at the multiple cloning site. In almost all cases, the process produces ‘oriented’ clones, where the positions of the 5’ and 3’ ends of the cDNA relative to the vector are known in principle (although subject to some experimental error). Thus, two defined vector-based primers can be used to obtain a 3’ and a 5’ sequence from the same clone; depending on the length of the insert and the quality of the trace data, the sequences determined from the two ends may or may not overlap. A single read is taken from each primer. ESTs potentially retain information about the differential splicing of the primary RNA.
12 Large scale EST-sequencing coupled to Genome sequencing
13 EST sequencing. Is fast and cheap. Gives direct information about the gene sequence. Partial information. Resulting ESTs: known gene (DB searches); similar to known gene; contaminant; novel gene. EST sequencing turned out to be a very fast and relatively cheap way of obtaining direct information about the genes. Each sequence contains partial information but is long enough to identify the gene it originates from. ESTs can be analyzed using database searches with programs like BLAST. They usually fall into four main categories: those that are identical to a portion of a known gene; those with sequence similarity to a known gene; those that can be deemed useless because they are either devoid of meaningful sequence or match sequence from contaminating organisms; and those that do not match anything in the database.
14 dbEST release 20 February 2004. Number of public entries: ,039,613. Summary by organism: Homo sapiens (human) ,472,005; Mus musculus + domesticus (mouse) ,056,481; Rattus sp. (rat) ,841; Triticum aestivum (wheat) ,926; Ciona intestinalis ,511; Gallus gallus (chicken) ,385; Danio rerio (zebrafish) ,652; Zea mays (maize) ,417; Xenopus laevis (African clawed frog) ,901; …
15 Human EST length distribution (dbEST Sep ). EST lengths ~ 450 bp. ESTs are usually between 400 and 600 bp in length. For human, this distribution peaks at around 450 bp. There is, however, a second peak at nearly 1000 bp, which is perhaps related to the fact that there are also many cDNA sequences (almost full length) in dbEST.
16 ESTs provide expression data: eVOC Ontologies (J Kelso et al. Genome Research 2002). Anatomical System: the tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina. Cell Type: the precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte. Pathology: the pathological state of the sample from which the library was prepared. Examples are normal, lymphoma, and congenital. Developmental Stage: the stage during the organism's development at which the sample was prepared. Examples are embryo, fetus, and adult. Pooling: indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue. ESTs also allow the identification of genes specifically expressed in a chosen library or tissue, since they are obtained under a given set of known conditions. Once we localize the gene an EST belongs to, we obtain expression information about that gene. Currently there are several projects to organize the expression information into a set of orthogonal vocabularies (ontologies) which can describe the expression in a specific manner. One of these projects is the eVOC Ontologies from SANBI, which provides a very high quality classification of the expression information for ESTs and is becoming a standard. eVOC provides a link between the vocabularies and the EST sequences. A link between genes and eVOC expression data can also be found at
17 ESTs provide expression data: eVOC Ontologies. Figure: ontologies (Developmental Stage, Cell Type, Pathology, Anatomical System, Pooling); example terms … nervous, brain, cerebellum …; Library 1, Library 2, …; ESTs.
18 Linking the expression vocabulary to gene annotations (V Curwen et al. Genome Research (2004)). Figure: ESTs; genes. From the set of ESTs aligned to the genome, we can derive a mapping between the ESTs and the Ensembl genes according to compatible splicing structure. The comparison is coordinate based and not sequence based. One could also imagine a sequence-based comparison system, although it would be less specific than the genomic-position-based system. In the SANBI expression database each EST is linked to a library_name, which is linked to five ontologies or trees of expression vocabulary: Anatomy, Cell Type, Pathology, Developmental Stage, Preparation. In this way we can link each Ensembl gene with an expression vocabulary.
19 Gene expression vocabulary. This vocabulary can be used for querying in EnsMart.
20 Normalized vs. non-normalized libraries. In order to obtain information from the lowly expressed genes and not be overwhelmed by the highly expressed genes, the results are usually ‘normalized’, that is, we equilibrate the density of all ESTs regardless of how highly expressed their genes are. This is usually called a ‘normalized library’, and it is the standard information to work with. Here we can see the case of the human genome with non-normalized EST libraries mapped to Ensembl genes. In blue we can see the number of ESTs per gene. This gives us an idea of the transcription activity in the genome.
21 The down side of the ESTs. Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory). Random sampling: the more ESTs we sequence, the fewer new useful sequences we will get. Despite the usefulness of ESTs, they have some problems which set limitations on this approach. With ESTs it is very hard to detect genes which are expressed at a very low level, or genes that are expressed only under a very rare and specific set of conditions. We would have to reproduce every single possible set of conditions to be able to find those. Moreover, this method would not detect non-expressed sequences, like regulatory regions. An added problem is the nature of random sampling. In the way the ESTs are obtained, every time we get one EST we have not reduced the number of possible sequences that we may obtain for the next one; it is always the same pool of sequences. As a result, sequencing further brings us less and less closer to an exhaustive collection of all the genes.
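The diminishing return of random sampling can be illustrated with a small calculation. The sketch below makes a strong simplifying assumption that is not in the slides: all transcripts in the pool are equally abundant and each EST is an independent draw from the same pool. Real libraries are far more skewed, which makes the effect worse for lowly expressed genes. The function names are illustrative.

```python
def expected_distinct(pool_size: int, n_reads: int) -> float:
    """Expected number of distinct transcripts seen after n_reads draws,
    sampling with replacement from pool_size equally abundant transcripts."""
    if pool_size <= 0 or n_reads < 0:
        raise ValueError("pool_size must be positive and n_reads non-negative")
    # Each transcript is missed by one draw with probability (1 - 1/pool_size).
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** n_reads)


def marginal_gain(pool_size: int, n_reads: int) -> float:
    """Expected number of new transcripts contributed by the next read."""
    return expected_distinct(pool_size, n_reads + 1) - expected_distinct(pool_size, n_reads)


if __name__ == "__main__":
    for n in (0, 1000, 5000):
        print(n, round(marginal_gain(1000, n), 4))
```

Under these assumptions the first read always yields a new transcript, while later reads mostly resample transcripts that have already been seen.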
22 Using ESTs to study Alternative Splicing. Since the first EST sequencing project, many institutions around the world have carried out intensive EST sequencing of different organisms and under different conditions (as we can see from the dbEST release). In the meantime, the human genome has been sequenced. All the sequence corresponding to the euchromatin and part of the heterochromatin is known. Now is a good moment to combine both sources of information to explore the genome. The study of the genome with ESTs is now known as Transcriptomics.
23 ESTs aligned to the genome. Figure labels: Stop*; GT, AG; PolyA; processed pseudogene; true match (best in genome); paralog. ESTs can provide the gene sequence, but it is limited to expressed sequences. The genomic sequence is necessary to obtain information about regulatory signals. Approach: our approach to finding alternative splicing using ESTs is to consider ESTs aligned to the genome. This has the following advantages: it defines the location of exons and introns; we can verify the splice site sequences and hence also check the correct strand of spliced ESTs; it helps prevent chimeras; it can find paralogs in the genomic sequence, which avoids putting together ESTs from paralogous genes; we can avoid including pseudogenes in our analysis; with appropriate filtering, sequencing errors can be avoided. Poly-A tails must be clipped before aligning. In this situation we define the problem of finding alternative splicing information as follows. Problem: find the minimal set of transcripts which is compatible with the splicing of a given EST cluster, such that the transcripts are not redundant with each other, that is, their splicing is non-equivalent.
24 Alternative Exons / 3’ PolyA sites from ESTs. ESTs can provide information about possible alternative splicing when they are aligned against the genome or against mRNA data. On the left we can see an example of several ESTs aligned to the genome. Their alignment structure suggests several possible forms of splicing, i.e. several possible combinations of exons. We will see in more detail later on a method to derive this. On the right we see a set of ESTs aligned to an mRNA sequence. In this case, the ESTs suggest multiple polyadenylation sites on the mRNA. Likewise, EST alignments can suggest exon skipping and other similar alternative splicing phenomena.
25 Aligning ESTs to the Genome. Many ESTs: fast programs, fast computers. Nearly exact matches: coverage >= 97%, percent_id >= 97%. Splice sites: GT-AG, AT-AC, GC-AG. We use exonerate to align the ESTs, with an est2genome model. We clip a number of bases on either end of the ESTs and further remove any remaining polyA/polyT tails. This increases the number of ESTs which are mapped in full length. The thresholds taken are much stricter than usual, to make sure we obtain a good set of predictions. The criterion for merging is exact match at internal splice sites, allowing any mismatch at external sites.
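The clipping and filtering steps can be sketched as follows. The 97% thresholds and the three accepted splice-site pairs come from the slide; everything else is an assumption made for illustration: the function names, the minimum run length of 6 bases for a tail, and the idea that the aligner output has already been reduced to a coverage value, a percent identity and the dinucleotides at each intron's ends. This is not the Ensembl pipeline code.

```python
from typing import Iterable, Tuple

# Donor/acceptor dinucleotide pairs accepted on the slide.
ACCEPTED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}


def clip_poly_tails(seq: str, min_run: int = 6) -> str:
    """Remove a trailing poly-A run and a leading poly-T run.

    min_run is a hypothetical cut-off: shorter runs are left untouched.
    """
    seq = seq.upper()
    tail = len(seq) - len(seq.rstrip("A"))
    if tail >= min_run:
        seq = seq[: len(seq) - tail]
    head = len(seq) - len(seq.lstrip("T"))
    if head >= min_run:
        seq = seq[head:]
    return seq


def passes_filter(
    coverage: float,
    percent_id: float,
    intron_ends: Iterable[Tuple[str, str]],
    min_coverage: float = 97.0,
    min_percent_id: float = 97.0,
) -> bool:
    """Keep an alignment only if it is nearly exact and every intron has an
    accepted donor/acceptor pair (read on the strand of the alignment)."""
    if coverage < min_coverage or percent_id < min_percent_id:
        return False
    return all((d.upper(), a.upper()) in ACCEPTED_SPLICE_SITES for d, a in intron_ends)


print(clip_poly_tails("ACGTACGTAAAAAAAA"))           # ACGTACGT
print(passes_filter(98.5, 99.0, [("GT", "AG")]))     # True
print(passes_filter(98.5, 99.0, [("CT", "AC")]))     # False
```

An intron reading CT...AC is the reverse complement of GT...AG, so a rejection like the last one often means the EST was placed on the wrong strand, which is the strand check mentioned for spliced ESTs.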
26 Genomics as a Technology. Development of special software: fast versus accurate alignment. Development of special technology: efficient use of computer farms (~2000 CPUs).
28 Recover the mRNA from the ESTs. EST sequences are partial information. We want to recover the full mature RNA sequence from the ESTs. This has led to strategies to ‘cluster’ ESTs according to sequence similarity in order to try to recover the complete sequence. Moreover, since for each EST the clone library is known, as well as whether it is a 5’ or a 3’ EST, we can use that information to put together clusters from both ends of a gene. Nevertheless, this information is not always available, so it will not always be possible to recover the full gene sequence. Clustering methods try to strike a balance between the gene coverage by the clusters and the specificity of the clusters. This depends on how stringently or loosely the clustering is performed. Stringent one-pass assembly methods tend to result in fewer, shorter consensus sequences. Looser systems for clustering result in larger, more ‘sloppy’ clusters, with various expressed forms being represented within each cluster. Each approach has its advantages and disadvantages. Stringent clustering provides greater initial fidelity, at a cost of lower coverage of expressed gene data and a lower inclusion rate of expressed gene forms. Loose clustering provides greater coverage and greater inclusion of alternate expressed forms, at a cost of possible inclusion of paralogous expressed genes and lower fidelity data.
29 The Problem: what are the transcripts represented in this set of mapped ESTs? Figure: ESTs aligned to the genome.
30 Predict Transcripts from ESTs. Merge ESTs according to splicing structure compatibility. Figure: transcript predictions. In this situation we define the problem of finding alternative splicing information as follows. Problem: find the minimal set of transcripts which is compatible with the splicing of a given EST cluster, such that the transcripts are not redundant with each other, that is, their splicing is non-equivalent. We must consider the global relation between the splices in a given EST, to avoid combining the different splices combinatorially. Thus we consider each EST as a set of splices such that every two ESTs must be either compatible or incompatible regarding the splicing structure. We consider ESTs as whole structures, and only combine the splices from two ESTs if ALL their overlapping splice sites are equivalent.
31 Redundant ESTs. Consider 2 ESTs, x and z, in a genomic cluster with more ESTs. Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible. If z gives redundant splicing information (x + z), we could keep only x. However, the relation with other ESTs in the cluster is important: a third EST, w, is compatible with z (z + w) but not with x. Therefore we keep all relations.
32 Extension of the exon structure. Consider 2 ESTs, x and y, in a genomic cluster with more ESTs (such as z and w). Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible. If y extends x (x + y), we can assume that they are from the same mRNA. Our success will depend on the coverage of the exons. However, ESTs are 3’ and 5’ biased (ESTs like z are not so frequent), hence we will have fragmentation.
33 Representation (E Eyras et al. Genome Research (2004)). For every 2 ESTs in a genomic cluster, we decide if they represent equivalent splicing structures. The compatibility relation is a graph, with extension edges (x, y) and inclusion edges (x, z). Consider a set of ESTs that we have mapped onto the genome. We can cluster those mapped ESTs according to their position in the chromosome, which gives rise to a number of potential gene loci defined by these EST clusters. Every 2 overlapping ESTs in the cluster may or may not be splicing-compatible. If they are compatible, we can always represent that relation as one of two possibilities: extension (one EST extends the 3’ end of the other) or inclusion (one EST is totally included in the other). We represent these relations as a graph where the nodes are the ESTs and there are two types of edges: single arrows for extension and double arrows for inclusion. Furthermore, we sort the ESTs in the cluster by two variables: by the 5’ end coordinate in descending order and by the 3’ coordinate in descending order (if the 5’ coordinate is the same).
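Grouping mapped ESTs into genomic clusters by position can be sketched as a single sweep over the sorted alignments. The input format (an identifier plus the start and end of the whole alignment on one chromosome strand) and the function name `genomic_clusters` are assumptions for illustration; the sketch only groups by overlapping extent and says nothing yet about splicing compatibility.

```python
from typing import List, Tuple


def genomic_clusters(spans: List[Tuple[str, int, int]]) -> List[List[str]]:
    """Group ESTs whose genomic extents overlap, directly or through a chain
    of other ESTs. Each span is (est_id, start, end) on one chromosome strand."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = None
    for est_id, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        # A gap between the running cluster and this EST closes the cluster.
        if current and start > current_end:
            clusters.append(current)
            current, current_end = [], None
        current.append(est_id)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(current)
    return clusters


ests = [("a", 100, 400), ("c", 1000, 1200), ("b", 350, 600)]
print(genomic_clusters(ests))  # [['a', 'b'], ['c']]
```

Each resulting group corresponds to one potential gene locus, inside which the pairwise splicing comparison is then carried out.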
34 Criteria of “merging”. Slide labels: allow edge-exon mismatches; allow internal mismatches; allow intron mismatches (is this intron real?). The comparison between two transcripts has one of four possible results: 1. Inclusion; 2. Extension; 3. Clash (overlap, but structure non-compatible); 4. No overlap (none of the exons overlap). At the level of the comparison we have to establish the criteria for defining two ESTs as mergeable (or redundant). The algorithm is implemented so that we can choose between different types of merging criteria, according to the type of data we are dealing with. We can merge in a strict way, which is not very realistic, so we never use it. We can allow mismatches of exons or parts of exons at the edges of the transcripts. This could be used when we have good cDNA or EST data. We can allow internal mismatches. This can be used with data we know may contain a lot of noise, or when we have annotated the transcripts using two different alignment methods that may have produced different splice sites (after all, we are doing this automatically). Finally, we could allow for intron mismatches if we cover an intron which is too small to be real, maybe due to an incomplete alignment or to a disagreement between the cDNA and the genomic sequence. This is the typical case used for human ESTs.
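The four possible outcomes of a pairwise comparison can be sketched as below. This is a minimal version of the strict criterion only, not the ClusterMerge implementation: internal splice sites must match exactly wherever the two ESTs overlap, the outer ends are free as long as they do not reach into an intron of the other EST, and none of the tolerances described above (edge-exon, internal or intron mismatches) are modelled. ESTs are assumed to be lists of (start, end) exons sorted along the forward strand; all names are illustrative.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, forward strand


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Introns as (end of upstream exon, start of downstream exon)."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def compare(a: List[Exon], b: List[Exon]) -> str:
    """Return 'inclusion', 'extension', 'clash' or 'no_overlap'."""
    if not any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b):
        return "no_overlap"
    a_start, a_end = a[0][0], a[-1][1]
    b_start, b_end = b[0][0], b[-1][1]
    # Introns of each EST that fall within the extent of the other one.
    a_shared = [i for i in introns(a) if i[0] < b_end and i[1] > b_start]
    b_shared = [i for i in introns(b) if i[0] < a_end and i[1] > a_start]
    if a_shared != b_shared:
        return "clash"
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return "inclusion" if (a_in_b or b_in_a) else "extension"


x = [(100, 200), (300, 400), (500, 600)]
y = [(550, 600), (700, 800)]
z = [(350, 400), (500, 550)]
w = [(350, 400), (450, 480)]
print(compare(x, z), compare(x, y), compare(x, w))  # inclusion extension clash
```

Allowing the looser criteria would mean replacing the exact equality of the shared introns with a comparison that tolerates a bounded difference at chosen positions.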
35 Transitivity. Figure: extension (x, y) and inclusion (x, z; z, w) relations. The ordering naturally induces a transitivity in the representation: the extension and the inclusion are transitive, so we do not need to show redundant relations. This ordering and, in turn, the transitivity also minimize the number of comparisons that we have to make against the ESTs in a graph when comparing a new EST.
36 ClusterMerge graph (E Eyras et al. Genome Research (2004)). Each node defines an inclusion sub-tree. Extensions form acyclic graphs. More complicated situations arise from the interaction of inclusions and extensions. We choose to put the inclusions as high as possible in the extension tree. Considering only the inclusions, each node defines an (inclusion) tree. In fact, for every given node, we can define a sub-tree which is the tree given by all the nodes ‘included’ in this node, which is the root of the inclusion tree. The extensions, however, do not necessarily form a tree. On the other hand, the directed graph is always acyclic. This property will be exploited in the algorithm. A generic graph of this type can then be seen as an intertwined forest of inclusion trees and extension acyclic directed graphs. We call this structure a ClusterMerge graph.
37 Mergeable sets: example. Figure: ESTs 1 to 7. Consider a set of ESTs that we have mapped onto the genome. We can cluster those mapped ESTs according to their position in the chromosome, which gives rise to a number of potential gene loci defined by these EST clusters. Consider as an example a set of 7 ESTs, already put in the order specified.
38 Mergeable sets: example. Figure: ESTs 1 to 7 and the corresponding graph.
39 Mergeable sets: example. Figure: the graph of ESTs 1 to 7, with the root and the leaves marked.
40 Mergeable sets: example. Figure: the graph of ESTs 1 to 7, with the root and the leaves marked. Lists produced: (1,2,3,5,6,7) and (1,2,3,4,5,7).
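One simple way to turn an acyclic graph into lists is to enumerate every path from a root to a leaf. The sketch below does only that, on a small hypothetical graph; it ignores the inclusion trees, it is not the ClusterMerge procedure itself, and its toy edges are not the edges of the 7-EST example, whose exact structure cannot be recovered from the transcript.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def root_to_leaf_paths(
    nodes: Sequence[Hashable], edges: Sequence[Tuple[Hashable, Hashable]]
) -> List[Tuple[Hashable, ...]]:
    """Enumerate all root-to-leaf paths of a directed acyclic graph.

    Roots are nodes with no incoming edge, leaves are nodes with no outgoing edge.
    """
    children: Dict[Hashable, list] = {n: [] for n in nodes}
    has_parent = set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)

    paths: List[Tuple[Hashable, ...]] = []

    def walk(node: Hashable, prefix: list) -> None:
        prefix = prefix + [node]
        if not children[node]:
            paths.append(tuple(prefix))
            return
        for child in children[node]:
            walk(child, prefix)

    for node in nodes:
        if node not in has_parent:
            walk(node, [])
    return paths


# Hypothetical toy graph: 1 is extended by 2, which is extended by either 3 or 4.
print(root_to_leaf_paths([1, 2, 3, 4], [(1, 2), (2, 3), (2, 4)]))
# [(1, 2, 3), (1, 2, 4)]
```

Each path is a candidate set of mutually compatible ESTs; because the graph is acyclic, the recursion always terminates.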
41 Deriving the transcripts from the lists. Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute. Once we have the lists, we must produce the putative transcripts from them. To merge the linked ESTs into a transcript we cluster the exons. Each exon cluster will contribute to a given exon in the final transcript. If this exon is going to be internal, we do not allow external exons in the ESTs to contribute. If the external coordinate of one EST extends beyond the most common internal coordinate, this is a potential alternative UTR termination. On the other hand, it is very hard to conclude whether those are real or not when working with ESTs.
42 Deriving the transcripts from the lists. Splice sites: are set to the most common coordinate. 5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most. We can parameterize how much mismatch in the 3’ and 5’ splice sites we allow when comparing ESTs, to try to distinguish true alternative 5’ and 3’ sites from sequencing errors in the ESTs. It is very difficult to determine what this threshold should be. In the example, the human EST giving evidence for an alternative 3’ splice site is not considered as such because we have set a higher threshold, so this EST is merged. For the internal splice sites we take the most common coordinates in the exon cluster. For the external splices (potential 5’ or 3’ ends of the transcript) we take the coordinate that extends the final exon the most, to maximise the chance of covering UTRs.
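The two coordinate rules can be sketched for a single exon cluster. The sketch assumes each EST contributes one boundary coordinate to the cluster together with a flag saying whether that boundary is the outer end of the EST (and so must not vote on an internal splice site). The function names and the "left"/"right" convention for forward-strand coordinates are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def internal_splice_site(boundaries: List[Tuple[int, bool]]) -> int:
    """Most common coordinate among boundaries that are real splice sites.

    Each item is (coordinate, is_est_end); EST ends are excluded from the vote.
    """
    votes = Counter(coord for coord, is_est_end in boundaries if not is_est_end)
    if not votes:
        raise ValueError("no spliced boundary supports this splice site")
    return votes.most_common(1)[0][0]


def external_end(coords: List[int], side: str) -> int:
    """Outer end of a terminal exon: the coordinate that extends it the most."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    return min(coords) if side == "left" else max(coords)


# Three ESTs agree on position 200, one disagrees, one simply ends at 180.
print(internal_splice_site([(200, False), (200, False), (200, False), (203, False), (180, True)]))  # 200
print(external_end([100, 90, 120], "left"))   # 90
print(external_end([580, 640, 600], "right"))  # 640
```

A mismatch tolerance, as discussed above, would decide beforehand whether the EST supporting 203 is merged into this cluster or kept as evidence for an alternative site.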
43 Single exon transcripts. Reject resulting single-exon transcripts when using ESTs. From the resulting set of putative transcripts we reject the unspliced ones. They could be produced from spurious hits, perhaps ESTs containing genomic sequence. They could also be related to pseudogenes. Another possibility is that the EST cluster it was derived from represents a UTR region of a gene which did not have any overlap with a spliced EST (see the figure for a possible case like this).
45 Conservation of Alternative Splicing. Degree of conservation: 30-60%. Methods: 1. Comparison of single events; 2. Cross-alignment of full transcripts.
46 Exon Skipping Events. Introns flanking alternatively spliced (skipped) exons have high sequence conservation, higher on average than constitutive introns. R Sorek & G Ast. Genome Research 13: , 2003
47 Sequences regulating the (alternative) splicing. Figure: conserved alternative exon with flanking introns; overrepresented hexamer (downstream). Overrepresented sequences in conserved introns (between human and mouse) may be involved in the regulation of alternative splicing. Overrepresented: found in these introns more often than expected at random AND not found in intronic sequences flanking constitutive exons (and upstream of skipped ones). R Sorek & G Ast. Genome Research (2003) 13:
48 Sequences regulating the (alternative) splicing. Figure: conserved alternative exon with flanking introns; overrepresented hexamer. Not all types of events are equally conserved. Introns flanking alternative 5’ and 3’ exons, and retained introns, have higher sequence conservation. Sugnet CW, Kent WJ, Ares M Jr, Haussler D. Pac Symp Biocomput. 2004;:66-77
49 Frame preservation (A Resch et al. Nucleic Acids Research 2004, 32 (4) 1261-1269)
Frame-preserving exons, constitutive vs. alternative:
All exons: constitutive 39.7% (Human), 39.5% (Mouse); alternative 41.6% (Human), 44.7% (Mouse)
Conserved exons: constitutive 40.9% (Human), 38% (Mouse); alternative 51.8% (Human), 51.9% (Mouse)
51 Features differentiating between alternatively spliced and constitutively spliced exons (R Sorek et al. Genome Research (2004) 14:1617-1623)
Feature: alternative exons / constitutive exons
Average size: 87 / 128
Length = multiple of 3: 73% / 37%
Average human-mouse exon conservation: 94% / 89%
(A) Exons with upstream intron conserved in mouse: 92% / 45%
(B) Exons with downstream intron conserved in mouse: 82% / 35%
(A) + (B): 77% / 17%
(A), (B): an intron is considered conserved if there are at least 12 consecutive matches over 100 bp of the intron.
52 Build a classifier to make predictions. Rule: a set of conditions over the parameters, e.g. “at least 99% conservation with mouse AND divisible by 3, etc.” Try all the possible combinations of parameters. Select the rule that correctly identifies a maximum number of true alternative exons while minimizing the number of false positives. This rule achieved 31% sensitivity and no false positives in a set of known exons: at least 95% identity with the mouse orthologous exon; exon size is a multiple of 3; an upstream intronic alignment of at least 15 bp with at least 85% identity; a downstream intronic exact alignment of at least 12 bp. R Sorek et al. Genome Research (2004) 14:
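The selected rule is a plain conjunction of four thresholds, which can be written down directly. The thresholds are the ones quoted on the slide; the `ExonFeatures` container, its field names and the function name are hypothetical, and computing the features themselves (the human-mouse exon and intron alignments) is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class ExonFeatures:
    length: int                   # exon size in bp
    mouse_identity: float         # % identity with the orthologous mouse exon
    upstream_aln_len: int         # length (bp) of the best upstream intronic alignment
    upstream_aln_identity: float  # % identity of that upstream alignment
    downstream_exact_len: int     # length (bp) of the longest exact downstream intronic match


def predicted_alternative(f: ExonFeatures) -> bool:
    """Apply the four-condition rule: all conditions must hold."""
    return (
        f.mouse_identity >= 95.0
        and f.length % 3 == 0
        and f.upstream_aln_len >= 15
        and f.upstream_aln_identity >= 85.0
        and f.downstream_exact_len >= 12
    )


candidate = ExonFeatures(length=87, mouse_identity=96.5, upstream_aln_len=20,
                         upstream_aln_identity=90.0, downstream_exact_len=14)
print(predicted_alternative(candidate))  # True
```

Because every condition must hold, the rule trades sensitivity for specificity, which is consistent with the reported 31% sensitivity and no false positives.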
53 Summary. Alternative splicing is a mechanism to generate functional diversity. We can study alternative splicing using ESTs (Expressed Sequence Tags). EST data is fragmented and full of noise: it needs to be processed. Some alternative splicing is conserved across species (Human-Mouse). Prediction of alternative (conserved) exons is possible (a classifier), but not ab initio. Evolution of alternative splicing?