How many transcripts does it take to reconstruct the splice graph?


Paul Jenkins and Jotun Hein

Introduction

Alternative splicing is the process by which a single gene may be used to encode more than one protein. Genes are composed of coding material (exons) separated by long stretches of non-coding material (introns). DNA is transcribed into a strand of precursor messenger RNA, which is then matured by a macromolecular complex known as the spliceosome before being translated into protein. By retaining different configurations of exons, the spliceosome enables a gene to be translated in a number of different ways.

With a gene we can associate its alternative splice graph, or ASG (figure 1): a graph minimally explaining each observed configuration of exons. A natural question that arises is: how large is this graph relative to the true ASG of the gene?

In this work we propose a probabilistic model of transcript generation from a gene idealised as a real interval [0,L]. We investigated the growth of the ASG with sampled transcripts for different probabilities and different example genes, and assessed the applicability of different models for a selection of sample genes.

Figure 1: Example alternative splice graph. Exons are numbered rectangles; translation occurs from left to right. Splicing events are shown as curved edges. Intron retention is shown in pink. A competing 3' splice site is shown in blue. More complicated and nested relationships are also visible. The ASG of this gene offers more than 5000 putative transcripts, though far fewer have been observed (see Leipzig et al., 2004).

A Stochastic Model

Idealise a gene as the interval [0,L] and assume transcripts are always spliced at a set S of exact locations on this interval. Transcription can be modelled either by associating pairwise probabilities p(i,j) between splice sites (model 1), or by associating each splice site with probabilities of jumping 'into' and 'out of' transcription (model 2).
In model 1, transcript generation can be seen as a walk along the line [0,L], jumping forwards (or not) with well-defined probabilities at each splice site. In model 2, transcripts are obtained by travelling along the real line from 0 to L and, on reaching each splice site, jumping 'in' if we are 'out' or jumping 'out' if we are 'in', with well-defined probabilities. The transcript is the concatenation of all those subintervals of [0,L] for which we are 'in'. Model 2 is simpler in the sense that it attempts to explain the same data with fewer parameters.

Figure 2: Model 1 (left). Here, S = {1, 2, 3, 4, 5, 6, 7, 8}. Transcription commences from position 1 (marked by a blue square) and terminates at one of the terminal positions marked by a green circle. Each transcript has a well-defined probability dependent on p(2,3), p(2,7) and p(4,5). Model 2 (right). Exons may be spliced together more flexibly. In this example an additional possible transcript skips from position 2 to position 5.

Minimal transcripts required

An ASG is a directed acyclic graph, so graph theory lets us make statements about it. One theoretical result is a polynomial-time algorithm that calculates the minimal number of transcripts required to reconstruct a given ASG. In graph-theoretic terms a transcript is any path from a source to a sink, and the graph is recovered when we obtain an edge covering, that is, a set of transcripts that together pass over every edge (figure 3).

Figure 3: A directed acyclic graph, with vertices V = {s, a, b, c, d, e, f, g, h, t} and directions from left to right. An edge covering of 5 transcripts is shown (each in a different colour). The weight of each edge is marked. In fact this graph requires only 4 transcripts.

Results

We simulated transcripts for a number of selected genes and for a range of different probability values in our model. Example results are illustrated in figure 4.
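To make the simulation step concrete, here is a minimal Python sketch of sampling transcripts under model 2 and tracking how many distinct ASG edges have been seen. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sample_transcript`, `transcript_edges`, `asg_growth`), the choice to start 'in' at position 0 and close any open interval at L, and the representation of ASG edges as retained segments plus splice junctions are all hypothetical.

```python
import random

def sample_transcript(sites, p_in, p_out, length, rng):
    """Sample one model-2 transcript as a tuple of retained (start, end) intervals.

    Assumptions (not stated in the poster): the walk starts 'in' at position 0,
    and an interval still open at the end of the gene is closed at `length`.
    `p_in[s]` and `p_out[s]` are the jump probabilities at splice site `s`.
    """
    intervals = []
    inside, start = True, 0.0
    for s in sorted(sites):
        if inside and rng.random() < p_out[s]:
            intervals.append((start, s))   # jump 'out': close the current interval
            inside = False
        elif not inside and rng.random() < p_in[s]:
            start, inside = s, True        # jump 'in': open a new interval
    if inside:
        intervals.append((start, length))
    return tuple(intervals)

def transcript_edges(transcript):
    """ASG edges used by a transcript: retained segments and the junctions between them."""
    edges = {("segment", a, b) for a, b in transcript}
    for left, right in zip(transcript, transcript[1:]):
        edges.add(("junction", left[1], right[0]))
    return edges

def asg_growth(n_transcripts, sites, p_in, p_out, length, seed=0):
    """Number of distinct ASG edges observed after each successive sampled transcript."""
    rng = random.Random(seed)
    seen, curve = set(), []
    for _ in range(n_transcripts):
        seen |= transcript_edges(sample_transcript(sites, p_in, p_out, length, rng))
        curve.append(len(seen))
    return curve

if __name__ == "__main__":
    sites = [2.0, 5.0, 7.0]                 # hypothetical splice sites on [0, 10]
    prob = {s: 0.5 for s in sites}          # hypothetical jump probabilities
    print(asg_growth(20, sites, prob, prob, 10.0))
```

Averaging such curves over many seeds would give a quantity analogous to a mean number of reconstructed edges; the probabilities used here are placeholders rather than estimates from data.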
Different genes responded differently, and the response did not depend simply on gene length or number of exons. We also performed likelihood ratio tests to compare model 1 with model 2, this time basing model probabilities on maximum likelihood estimates taken from the original EST data. We found that, for the small sample of genes tested, exon clusters tended to fall neatly into model 1 (7/11) or model 2 (3/11), with only one test resulting in a p-value difficult to interpret at the 5% level (0.047). This supports the idea that the regulation of alternative splicing can vary widely both between genes and within genes.

Figure 4: (Top) Ten simulated reconstructions of the ASG for human gene ABCB5, under model 1. Shown in black is the minimal number of transcripts required to reach the full ASG size, as calculated using the algorithm outlined above. (Bottom) Mean number of reconstructed edges across simulations.

References

S. Heber, M. Alekseyev, S. Sze, H. Tang & P.A. Pevzner. "Splicing graphs and EST assembly problem." Bioinformatics, 18: 181–188 (2002).
J. Leipzig, P. Pevzner & S. Heber. "The Alternative Splicing Gallery (ASG): bridging the gap between genome and transcriptome." Nucleic Acids Res., 32: (2004).

Discussion and further work

As alternative splicing becomes more important in bioinformatics, so too does the need for its theoretical modelling. We have introduced a mathematical framework for predicting transcript generation. Given a gene and a sample of transcripts, it can be used to simulate transcripts from the gene's ASG in a quantitatively controlled way. As we have illustrated, mathematical modelling allows us both to make further use of mathematical results (such as the graph theory problem considered above) and to make predictions of biological behaviour. In the near future microarray data will rapidly increase the potential for both. Future work can then make use of experimentally derived probabilities when applying a model.
Other extensions to this work include incorporating further biological features into the model, such as tissue-specific regulation, which could be modelled by giving a gene two or more overlapping, weighted ASGs. The functionality of transcripts and the evolution of the ASG are further candidates for future modelling.
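Returning to the minimal-transcripts result: the poster does not spell out its algorithm, so the following Python sketch shows one standard polynomial-time approach, not necessarily the authors' own. It treats the problem as a minimum flow in which every edge must carry at least one unit: first route one source-to-sink path through each edge, then cancel surplus flow by augmenting from sink to source. The function name `min_transcripts` and the example graphs are hypothetical, and the code assumes a DAG in which every edge lies on some source-to-sink path.

```python
from collections import deque

def min_transcripts(edges, source, sink):
    """Minimum number of source-to-sink paths that together use every edge.

    Assumes `edges` is a list of distinct (u, v) pairs forming a DAG in which
    every edge lies on at least one source-to-sink path.
    """
    out_edges, in_edges = {}, {}
    for i, (u, v) in enumerate(edges):
        out_edges.setdefault(u, []).append(i)
        in_edges.setdefault(v, []).append(i)

    def bfs_tree(root, adjacency, far_end):
        # Map each reached vertex to the edge index linking it back towards root.
        link, queue = {root: None}, deque([root])
        while queue:
            x = queue.popleft()
            for i in adjacency.get(x, []):
                y = edges[i][far_end]
                if y not in link:
                    link[y] = i
                    queue.append(y)
        return link

    from_source = bfs_tree(source, out_edges, 1)  # edge entering each vertex
    to_sink = bfs_tree(sink, in_edges, 0)         # edge leaving each vertex

    # Step 1: a feasible flow, sending one full path through every edge.
    flow = [0] * len(edges)
    for i, (u, v) in enumerate(edges):
        flow[i] += 1
        while from_source[u] is not None:
            j = from_source[u]
            flow[j] += 1
            u = edges[j][0]
        while to_sink[v] is not None:
            j = to_sink[v]
            flow[j] += 1
            v = edges[j][1]
    total = len(edges)

    # Step 2: repeatedly cancel one unit of flow along a sink-to-source
    # residual path, never letting any edge drop below one unit.
    while True:
        prev, queue = {sink: None}, deque([sink])
        while queue and source not in prev:
            x = queue.popleft()
            for i in in_edges.get(x, []):      # walk edge (y, x) backwards: remove a unit
                y = edges[i][0]
                if flow[i] > 1 and y not in prev:
                    prev[y] = (x, i, -1)
                    queue.append(y)
            for i in out_edges.get(x, []):     # walk edge (x, y) forwards: add a unit
                y = edges[i][1]
                if y not in prev:
                    prev[y] = (x, i, 1)
                    queue.append(y)
        if source not in prev:
            return total
        x = source
        while prev[x] is not None:
            x, i, delta = prev[x]
            flow[i] += delta
        total -= 1

if __name__ == "__main__":
    diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
    print(min_transcripts(diamond, "s", "t"))
```

Each augmentation lowers the total by one and takes time linear in the number of edges, so the whole procedure is polynomial. A curve of simulated edge counts can then be compared against this lower bound on the number of transcripts needed.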