Sequence analysis with Scripture


Sequence analysis with Scripture Manuel Garber

The dynamic genome: bound by transcription factors, repaired, wrapped in marked histones, folded into a fractal globule, and transcribed. The genome is a very dynamic place. Proteins like BRCA2 bind to it to repair it, and Pol II reads its content into RNA. The histones around which it wraps may be modified and “marked” with very specific biochemical alterations. Transcription factors bind to it to prevent or initiate transcription. It folds into a fractal globule that is easily unwrapped to give access to regions of interest.

With a very dynamic epigenetic state. So the genome is a dynamic place whose state is determined by a combination of biochemical modifications to histones, proteins that bind to it, and RNA that is read. Histone modifications, in particular, are very important: for example, different methylation or acetylation of lysine residues determines accessibility to the underlying DNA sequence. (Catherine Dulac, Nature 2010)

Which define the state of its functional elements. For example, a gene promoter demarcated by tri-methylation of lysine 9 or 27 signals an inactive gene, while methylation of lysine 4 indicates an active or available promoter. K36 tri-methylation marks actively expressed genes. Active enhancers are marked by a combination of high mono-methylation of K4, low tri-methylation of K4, and acetylation of lysine 27. Motivation: find the genome state using sequencing data.

We can use sequencing to find the genome state (Protein-DNA) Histone Marks ChIP-Seq Transcription Factors Park, P

We can use sequencing to find the genome state Wang, Z Nature Reviews Genetics 2009 RNA-Seq Transcription

Once sequenced, the problem becomes computational. Figure labels: cells; cDNA or ChIP; sequencer; sequenced reads; alignment; genome; read coverage.

Overview of the session. We’ll cover the 3 main computational challenges of sequence analysis for counting applications. Read mapping: placing short reads in the genome. Reconstruction: finding the regions that originated the reads. Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

Trapnell, Salzberg, Nature Biotechnology 2009

Short read mapping software for ChIP-Seq Seed-extend Short indels Use base qual B-W Maq No YES BWA BFAST Yes NO Bowtie GASSST Soap2 RMAP SeqMap SHRiMP

What software to use. If read quality is good (error rate < 1%) and there is a reference, BWA is a very good choice. If read quality is not good, or the reference is phylogenetically distant (e.g. wolf to dog), and you have a server with enough memory, SHRiMP or BFAST should be a sensitive but relatively fast choice. What about RNA-Seq?

RNA-Seq read mapping is more complex (figure scale labels: 100s bp, 10s kb). RNA-Seq reads can be spliced, and spliced reads are the most informative.

Method 1: Seed-extend spliced alignment

Method II: Exon-first spliced alignment

Short read mapping software for RNA-Seq. Seed-extend Short indels Use base qual Exon-first GSNAP No NO MapSplice QPALMA Yes SpliceMap STAMPY YES TopHat BLAT. Exon-first aligners map contiguous reads first, at the expense of spliced hits.

Exon-first aligners are faster, but at a cost. How do we visualize the results of these programs?

IGV: Integrative Genomics Viewer. A desktop application for the visualization and interactive exploration of genomic data: microarrays, epigenomics, RNA-Seq, NGS alignments, comparative genomics. Visualization of data is a key need in this field; IGV currently has all the necessary features to allow ... Desktop app: you run it on your computer, not through a Web browser (unlike the UCSC Genome Browser). Viewer: not a computational tool. Supports any type of data that can be mapped to the genome. Ask who works with what type of data.

Visualizing read alignments with IGV. Long marks, medium marks, punctate marks.

Visualizing read alignments with IGV — RNASeq Gap between reads spanning exons

Visualizing read alignments with IGV — RNASeq close-up What are the gray reads? We will revisit later.

Visualizing read alignments with IGV — zooming out How can we identify regions enriched in sequencing reads?

Overview of the session. The 3 main computational challenges of sequence analysis for counting applications. Read mapping: placing short reads in the genome. Reconstruction: finding the regions that originate the reads. Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

Scripture was originally designed to identify ChIP-Seq peaks. Goal: identify regions enriched in the chromatin mark of interest. Challenge: as we saw, chromatin marks come in very different forms. In this talk we are concerned with the identification of chromatin modifications from ChIP-Seq data (Mikkelsen et al. 2007).

Chromatin domains demarcate interesting surprises in the transcriptome K4me3 K36me3 ??? XIST

How can we identify these chromatin marks and the genes within? Short modification, long modification, discontinuous data. Scripture is a method to solve this general question.

Our approach Permutation Poisson α=0.05 We have an efficient way to compute read count p-values …

So we want to compute a multiple hypothesis correction. The genome is big, so a lot happens by chance: at α=0.05 over ~3,000,000,000 positions, ~150,000,000 bases are expected to pass by chance alone.

Bonferroni correction is way too conservative. Correction factor: 3,000,000,000. Bonferroni corrects the number of hits but misses many true hits because it is too conservative. How do we get more power?
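To make the numbers on these two slides concrete, here is a minimal sketch: the per-window p-value is a Poisson upper tail, and Bonferroni divides α by the number of windows tested. Only α=0.05 and the 3,000,000,000 factor come from the slides; the read depth, the window width, and the function names are illustrative assumptions.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_sf(k, mu):
    """Upper tail P(X >= k) for X ~ Poisson(mu)."""
    if k <= 0:
        return 1.0
    if k <= mu:
        # The tail is large here, so 1 - CDF loses no meaningful precision.
        return 1.0 - sum(poisson_pmf(i, mu) for i in range(k))
    # Past the mean the terms shrink, so sum the tail directly.
    total, i, term = 0.0, k, poisson_pmf(k, mu)
    while term > total * 1e-16:
        total += term
        i += 1
        term *= mu / i
    return total

def min_significant_count(mu, alpha):
    """Smallest read count k with P(X >= k) <= alpha."""
    k = 0
    while poisson_sf(k, mu) > alpha:
        k += 1
    return k

GENOME_SIZE = 3_000_000_000   # correction factor from the slide
ALPHA = 0.05
N_READS = 30_000_000          # assumed sequencing depth
WINDOW = 200                  # assumed window width in bp

mu = N_READS / GENOME_SIZE * WINDOW          # expected reads per window
expected_by_chance = ALPHA * GENOME_SIZE     # about 150,000,000 positions
k_uncorrected = min_significant_count(mu, ALPHA)
k_bonferroni = min_significant_count(mu, ALPHA / GENOME_SIZE)
print(mu, expected_by_chance, k_uncorrected, k_bonferroni)
```

The gap between the two count thresholds is the power Bonferroni gives up by treating every overlapping window as an independent test.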

Controlling FWER. Given a region of size w and an observed read count n, what is the probability that one or more of the 3×10^9 regions of size w has read count >= n under the null distribution? We could go back to our permutations and compute an FWER from the genome-wide maximum of the counts over same-sized regions, but that is really, really slow. Figure labels: count distribution (Poisson), α=0.05; max count distribution, αFWER=0.05.

Scan distribution, an old problem. Is the observed number of read counts over our region of interest high? Given a set of Geiger counts across a region, find clusters of high radioactivity. Are there time intervals where assembly line errors are high? Thankfully, there is a distribution called the scan distribution, which gives a closed form for this max-count distribution. It accounts for the dependency of overlapping windows and is thus more powerful. Figure labels: Poisson distribution, α=0.05; scan distribution, αFWER=0.05.

Scan distribution for a Poisson process. The probability of observing k reads on a window of size w in a genome of size L, given a total of N reads, can be approximated in closed form (Alm 1983). The scan distribution gives a computationally very efficient way to estimate the FWER.
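The formula itself did not survive on this slide. As a sketch only, the block below uses one commonly quoted form of the Alm-style approximation, P ≈ 1 − F(k−1; λw) · exp(−((k − λw)/k) · λ(L − w) · p(k−1; λw)) with λ = N/L, where p and F are the Poisson probability and cumulative probability. Treat that form as an assumption about what the slide showed; the function names and the example depth are also illustrative.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_tail(k, mu):
    """P(X >= k) for k > mu, summing the shrinking upper-tail terms."""
    total, i, term = 0.0, k, poisson_pmf(k, mu)
    while term > total * 1e-16:
        total += term
        i += 1
        term *= mu / i
    return total

def scan_pvalue(k, w, n_reads, genome_size):
    """Approximate P(some window of width w holds >= k reads) genome-wide."""
    lam = n_reads / genome_size            # expected reads per base
    mu = lam * w                           # expected reads per window
    if k <= mu:
        return 1.0                         # the approximation targets counts above the mean
    sf = poisson_tail(k, mu)               # 1 - F(k-1; mu)
    exponent = (k - mu) / k * lam * (genome_size - w) * poisson_pmf(k - 1, mu)
    # 1 - F(k-1) * exp(-exponent), evaluated stably for tiny p-values
    return -math.expm1(math.log1p(-sf) - exponent)

L, N, W = 3_000_000_000, 30_000_000, 200   # assumed genome size, depth, window
single = poisson_tail(20, N / L * W)       # one fixed window
scan = scan_pvalue(20, W, N, L)            # any window, overlap accounted for
bonferroni = min(1.0, single * L)          # every position treated as independent
print(single, scan, bonferroni)
```

With these assumed numbers the scan p-value sits between the single-window value and the Bonferroni bound, which is the extra power the slides refer to.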

By utilizing the dependency of overlapping windows we have greater power, while still controlling the same genome-wide false positive rate.

Segmentation method for contiguous regions. Example: PolII ChIP. Steps: find significant windows using the FWER-corrected p-value, merge, trim. But which window?

We use multiple windows. Small windows detect small punctate regions. Longer windows can detect regions of moderate enrichment over long spans. In practice we scan different windows, finding significant ones in each scan. It helps to use some prior information in picking the windows, although globally it might be OK.
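A minimal sketch of the scan, merge, and trim steps with several window widths. The per-width count thresholds would come from the FWER-corrected p-value; here they are simply given. Trimming by dropping flanking bases below a coverage cutoff is an assumption, and may differ from how Scripture trims.

```python
def significant_windows(coverage, w, min_count):
    """Half-open windows [start, start + w) whose summed count is >= min_count."""
    hits = []
    if len(coverage) < w:
        return hits
    total = sum(coverage[:w])
    for start in range(len(coverage) - w + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            total += coverage[start + w - 1] - coverage[start - 1]
        if total >= min_count:
            hits.append((start, start + w))
    return hits

def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def trim(interval, coverage, min_cov=1):
    """Drop flanking bases whose coverage is below min_cov (assumed trimming rule)."""
    start, end = interval
    while start < end and coverage[start] < min_cov:
        start += 1
    while end > start and coverage[end - 1] < min_cov:
        end -= 1
    return (start, end)

def segment(coverage, thresholds):
    """thresholds maps window width -> minimum significant count for that width."""
    hits = []
    for w, min_count in thresholds.items():
        hits.extend(significant_windows(coverage, w, min_count))
    return [trim(iv, coverage) for iv in merge(hits)]

coverage = [0, 0, 0, 5, 6, 7, 0, 0, 0, 0, 1, 0]   # toy per-base read counts
regions = segment(coverage, {3: 10, 6: 18})
print(regions)
```

On this toy track the short and long windows both fire around the same peak, and merging followed by trimming collapses them to one region.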

Applying Scripture to a variety of ChIP-Seq data 200, 500 & 1000 bp windows 100 bp windows

Application of Scripture to mouse chromatin state maps. K4me3, K36me3, lincRNA. Identified ~1500 lincRNAs: conserved, noncoding, robustly expressed. (Mitch Guttman)

Can we identify enriched regions across different data types? Short modification, long modification: using chromatin signatures we discovered hundreds of putative genes. What is their structure? Discontinuous data: RNA-Seq to find gene structures for these gene-like regions.

Scripture for RNA-Seq: Extending segmentation to discontiguous regions

The transcript reconstruction problem. Challenges: Genes exist at many different expression levels, spanning several orders of magnitude. Reads originate from both mature mRNA (exons) and immature mRNA (introns), and it can be problematic to distinguish between them. Reads are short and genes can have many isoforms, making it challenging to determine which isoform produced each read (figure scale labels: 100s bp, 10s kb). There are two main approaches to this problem; first let’s discuss Scripture’s.

Scripture: a statistical genome-guided transcriptome reconstruction. Statistical segmentation of chromatin modifications uses continuity of segments to increase power for interval detection. RNA-Seq: if we know the connectivity of fragments, we can increase our power to detect transcripts.

Longer (76 base) reads provide an increased number of junction reads. Exon junction spanning reads provide the connectivity information.

The power of spliced alignments. Figure: a protein-coding gene with 2 isoforms; read coverage; exon-exon junctions; alternative isoforms (ES RNA-Seq, 76 base reads, aligned with TopHat). Let’s look at how our data looks with 76 base reads. Shown at the top is the locus of an annotated gene with two known isoforms (red and green). The read coverage below shows clear enrichment of all 4 exons. Are both isoforms expressed, or only the bottom one? Let’s start by zooming in on the first two exons. If we zoom in, we see that there are indeed spliced reads spanning the junction between the first two exons. I will represent an aligned spliced read as a “dumbbell”, with each thick end representing the aligned sequence and the horizontal line the gap between the aligned portions. There are many such reads for this robustly expressed gene, and they help us (1) connect the two segments into a transcript, and (2) find small or lowly expressed exons as a connected pair when each would be missed on its own. Now let’s look at this alternatively spliced exon: with spliced reads we can identify the two alternative isoforms.

Statistical reconstruction of the transcriptome. Step 1: Align reads to the genome, allowing gaps flanked by splice sites. Step 2: Build an oriented connectivity map, orienting every spliced alignment using the flanking motifs. We start by aligning reads to the genome with gaps (e.g. using TopHat). We first focus only on reads that aligned with a gap flanked by splice sites (the dumbbells). In the second step we use these spliced reads to build a connectivity graph and give the genome a new topology, in which each base is connected to its neighbors as in the linear genome, but also to any base at the other end of a gap. We orient these connections based on the splice sites flanking them. These connections recapitulate the original contiguity of the sequenced RNA across the now discontiguous genome. The “connectivity graph” connects all bases that are directly connected within the transcriptome.
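Steps 1 and 2 can be sketched as follows, assuming each spliced read has already been reduced to the half-open genomic interval of its gap. Orientation uses the canonical GT-AG intron boundaries, which read as CT-AC on the forward strand for minus-strand transcripts. The data layout and function names are illustrative assumptions, not Scripture's internals.

```python
from collections import defaultdict

# Canonical intron boundaries as read on the forward genome strand.
SPLICE_MOTIFS = {("GT", "AG"): "+", ("CT", "AC"): "-"}

def intron_strand(genome, start, end):
    """Strand implied by the dinucleotides flanking intron [start, end), or None."""
    return SPLICE_MOTIFS.get((genome[start:start + 2], genome[end - 2:end]))

def build_connectivity_graph(genome, gaps):
    """Map each base to the bases it is joined to across a splice gap.

    gaps: iterable of (start, end) half-open intron intervals from spliced reads.
    Edges run 5' to 3' along the transcript. Neighbor links of the linear
    genome are implicit and not stored.
    """
    graph = defaultdict(set)
    for start, end in gaps:
        strand = intron_strand(genome, start, end)
        if strand == "+":
            graph[start - 1].add(end)    # last exonic base -> first base after the gap
        elif strand == "-":
            graph[end].add(start - 1)    # transcript runs right to left
        # Gaps without a canonical motif are skipped as likely artifacts.
    return dict(graph)

genome = "AAAC" + "GTTTTTAG" + "CCCC"          # toy sequence: exon, intron, exon
graph = build_connectivity_graph(genome, [(4, 12), (4, 12), (2, 9)])
print(graph)
```

Repeated gaps collapse into a single edge here; a fuller version would keep the read count per edge as splice support.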

Statistical reconstruction of the transcriptome. Step 3: Identify “segments” across the graph. Step 4: Find significant transcripts. We now rely on the connectivity graph to apply a method similar to the one we used to determine regions enriched in chromatin data, this time to regions (which may now be disconnected) enriched in RNA-Seq reads. For example, the region demarcated by the red path would be significant and corresponds to one of the original isoforms. Once we have enriched paths, they are translated into putative isoforms. Another significant path is the one that goes through the larger exon (here in green) and corresponds to the isoform that includes this exon. We then use paired-end data to join paths that may not have had enough spliced read support, to remove paths that would require paired reads to be at a distance that is unlikely given the insert size distribution, and to gauge whether or not we have reached the 3’ end of the isoform.
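As a sketch of how paths through the graph become putative isoforms, the block below enumerates every source-to-sink path in a small exon-level graph. The exon names and the dictionary layout are hypothetical, the graph is assumed to be acyclic, and the significance scoring and paired-end filtering described above are not shown.

```python
def enumerate_paths(edges):
    """All source-to-sink paths in an acyclic exon graph.

    edges: dict mapping an exon to the exons it is spliced to (5' to 3').
    """
    targets = {v for vs in edges.values() for v in vs}
    nodes = set(edges) | targets
    sources = sorted(nodes - targets)      # exons with no incoming junction
    paths = []

    def walk(node, path):
        successors = sorted(edges.get(node, ()))
        if not successors:                 # no outgoing junction: path is complete
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path + [nxt])

    for source in sources:
        walk(source, [source])
    return paths

# Toy gene: exon e3 is alternatively spliced (included or skipped).
edges = {"e1": ["e2"], "e2": ["e3", "e4"], "e3": ["e4"]}
isoforms = enumerate_paths(edges)
print(isoforms)
```

Each returned path is a candidate isoform; in practice only paths that pass the enrichment test would be reported.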

Scripture overview. Map reads; scan “discontiguous” windows; merge windows and build the transcript graph; filter and report isoforms. Spliced reads now give the genome a graph structure, so sliding a window is now a discontinuous process.

Can we identify enriched regions across different data types? Short modification, long modification, discontinuous data. Are we really sure reconstructions are complete?

RNA-Seq data is incomplete for comprehensive annotation Library construction can help provide more information. More on this later

Applying scripture: Annotating the mouse transcriptome

Reconstructing the transcriptome of mouse cell types Sequence Reconstruct

Sensitivity across expression levels. Even at low expression (20th percentile), average coverage of a transcript is ~95% and 60% of transcripts have full coverage.

Sensitivity at low expression levels improves with depth. As coverage increases we are able to fully reconstruct a larger percentage of known protein-coding genes. Why don’t we always get to 1? One reason is that a lot of the known annotations are wrong in ES cells.

Novel variation in protein-coding genes ES cells 3 cell types 1,804 3,137 1,310 2,477 588 903

Novel variation in protein-coding genes ~85% contain K4me3

Novel variation in protein-coding genes: ~50% contain a polyA motif, compared to ~6% for random.

Novel variation in protein-coding genes ~80% retain ORF

What about novel genes? Class 1: Overlapping ncRNA Class 2: Large Intergenic ncRNA (lincRNA) Class 3: Novel protein-coding genes

Class 1: Overlapping ncRNA ES cells 3 cell types 201 446

Overlapping ncRNAs: assessing their evolutionary conservation. SiPhy (Garber et al. 2009) omega: phylogenetic conservation based on contraction of the tree. Axis: conservation, high to low. Overlapping ncRNAs show little evolutionary conservation.

What about novel genes? Class 1: Overlapping ncRNA Class 2: Large Intergenic ncRNA (lincRNA) Class 3: Novel protein-coding genes

Class 2: Intergenic ncRNA (lincRNA) ES cells 3 cell types ~500 ~1500

>95% do not encode proteins lincRNAs: How do we know they are non-coding? ORF Length CSF (ORF Conservation) >95% do not encode proteins

lincRNAs: assessing their evolutionary conservation. Omega: phylogenetic conservation based on contraction of the tree. Axis: conservation, high to low.

What about novel genes? Class 1: Overlapping ncRNA Class 2: Large Intergenic ncRNA (lincRNA) Class 3: Novel protein-coding genes

Finding novel coding genes ~40 novel protein-coding genes

lincRNAs: comparison to the K4-K36 approach. Chromatin approach: 80% have reconstructions. RNA-Seq first approach: 85% have chromatin. RNA-Seq reconstruction and chromatin signature synergize to identify lincRNAs.

Other transcript reconstruction methods

Method I: Direct assembly

Method II: Genome-guided

Transcriptome reconstruction method summary

Pros and cons of each approach. Transcript assembly methods are the obvious choice for organisms without a reference sequence. Genome-guided approaches are ideal for annotating high-quality genomes, expanding the catalog of expressed transcripts, and comparing transcriptomes of different cell types or conditions. Hybrid approaches suit lesser-quality genomes or transcriptomes that underwent major rearrangements, such as in cancer cells. More than 1000-fold variability in expression levels makes transcriptome assembly a harder problem than regular genome assembly. Genome-guided methods are very sensitive to alignment artifacts.

RNA-Seq transcript reconstruction software Assembly Published Genome Guided Oasis NO Cufflinks Trans-ABySS YES Scripture Trinity

Differences between Cufflinks and Scripture. Scripture was designed with annotation in mind: it reports all possible transcripts that are significantly expressed given the aligned data (maximum sensitivity). Cufflinks was designed with quantification in mind: it limits reported isoforms to the minimal number that explains the data (maximum precision).

Differences between Cufflinks and Scripture: example. Figure tracks: annotation, Scripture, Cufflinks, alignments.

Different ways to go at the problem. Total loci called: Cufflinks 27,700, Scripture 19,400. Median # isoforms per locus: 1. 3rd quartile # isoforms per locus: 2. Max # isoforms: Cufflinks 19, Scripture 1,400.

Different ways to go at the problem. Many of the bogus loci and isoforms are due to alignment artifacts.

Why so many isoforms? Figure tracks: annotation, reconstructions. Every such splicing event or alignment artifact doubles the number of isoforms reported.
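The doubling can be checked on a toy graph: a chain of constitutive exons with one optional exon between each consecutive pair has 2^n source-to-sink paths for n optional exons. The graph builder and exon names below are hypothetical and only illustrate the counting argument.

```python
from functools import lru_cache

def cassette_graph(n):
    """Constitutive exons c0..cn with one optional exon a_i between c_i and c_(i+1)."""
    edges = {}
    for i in range(n):
        edges[f"c{i}"] = [f"a{i}", f"c{i + 1}"]   # include or skip the optional exon
        edges[f"a{i}"] = [f"c{i + 1}"]
    edges[f"c{n}"] = []
    return edges

def count_paths(edges, source, sink):
    """Number of distinct paths from source to sink in an acyclic graph."""
    @lru_cache(maxsize=None)
    def count(node):
        if node == sink:
            return 1
        return sum(count(nxt) for nxt in edges[node])
    return count(source)

for n in (1, 2, 5, 10):
    print(n, count_paths(cassette_graph(n), "c0", f"c{n}"))
```

Ten independent events, real or artifactual, already give over a thousand candidate isoforms, which is the order of magnitude of the Scripture maximum quoted earlier.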

Alignment revisited — spliced alignment is still work in progress

Exon-first aligners are faster, but at a cost. Alignment artifacts can also decrease sensitivity.

Missing spliced reads for highly expressed genes Sfrs3 TopHat Read mapped uniquely Read ambiguously mapped New Pipeline

Can more sensitive alignments overcome this problem? Use gapped aligners (e.g. BLAT) to map reads: align all reads with BLAT; filter hits and build a candidate junction “database” from the BLAT hits (Scripture light); use a short read aligner (Bowtie) to map reads against the transcriptome inferred from the connectivity graph; map the transcriptome alignments back to the genome.

ScriptAlign: can increase alignment across junctions. Sfrs3: TopHat vs. BLAT pipeline.

ScriptAlign: can increase alignment across junctions. We even get more uniquely aligned reads (not just spliced reads). “Map first” reconstruction approaches benefit directly from mapping improvements.

Overview of the session. The 3 main computational challenges of sequence analysis for counting applications. Read mapping: placing short reads in the genome. Reconstruction: finding the regions that originate the reads. Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

Quantification. Fragmentation of transcripts results in length bias: longer transcripts have higher counts. Different experiments have different yields, so normalization is required for cross-lane comparisons: RPKM, reads per kilobase of exonic sequence per million mapped reads (Mortazavi et al., Nature Methods 2008). This is all good when genes have one isoform.
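Spelled out, RPKM divides the read count by the exonic length in kilobases and by the number of mapped reads in millions. A small sketch, with a hypothetical function name and made-up numbers:

```python
def rpkm(read_count, exonic_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads."""
    kilobases = exonic_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)

# 500 reads on a gene with 2 kb of exonic sequence, in a lane with 20 million mapped reads.
print(rpkm(500, 2_000, 20_000_000))
```

Doubling either the transcript length or the lane yield halves the value, which is how the measure removes the length bias and the lane-to-lane yield difference.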

Quantification with multiple isoforms How do we define the gene expression? How do we compute the expression of each isoform?

Computing gene expression. Idea 1: RPKM of the constitutive reads (Neuma, Alexa-Seq, Scripture).

Computing gene expression — isoform deconvolution

Computing gene expression: isoform deconvolution. If we knew the origin of the reads, we could compute each isoform’s expression. The gene’s expression would be the sum of the expression of all its isoforms: E = RPKM1 + RPKM2 + RPKM3.
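Since the origin of each read is not known, one standard way to estimate it is expectation-maximization over read-to-isoform compatibility. The sketch below is a generic EM under a simplifying assumption (reads start uniformly along their isoform); it is not the exact model of any tool named in this talk, and all names and numbers are illustrative.

```python
def em_isoform_fractions(compatible, lengths, n_iter=200):
    """Estimate the fraction of reads drawn from each isoform.

    compatible: one set of isoform indices per read (never empty).
    lengths: isoform lengths in bp; P(read | isoform j) is taken as
    proportional to 1 / lengths[j] (uniform read start, an assumption).
    """
    k = len(lengths)
    theta = [1.0 / k] * k
    for _ in range(n_iter):
        counts = [0.0] * k
        for isoforms in compatible:
            weights = {j: theta[j] / lengths[j] for j in isoforms}
            norm = sum(weights.values())
            for j, weight in weights.items():
                counts[j] += weight / norm            # E-step: fractional assignment
        theta = [c / len(compatible) for c in counts]  # M-step: update fractions
    return theta

def isoform_rpkms(theta, lengths, gene_reads, total_mapped_reads):
    """Per-isoform RPKM from the estimated read fractions."""
    return [t * gene_reads * 1e9 / (length * total_mapped_reads)
            for t, length in zip(theta, lengths)]

# 60 reads fit only isoform 0, 20 fit only isoform 1, 20 fit both.
reads = [{0}] * 60 + [{1}] * 20 + [{0, 1}] * 20
lengths = [1000, 1000]
theta = em_isoform_fractions(reads, lengths)
per_isoform = isoform_rpkms(theta, lengths, gene_reads=len(reads),
                            total_mapped_reads=1_000_000)
gene_expression = sum(per_isoform)    # E = RPKM1 + RPKM2 + ...
print(theta, per_isoform, gene_expression)
```

The ambiguous reads end up split in proportion to the unambiguous evidence, and the gene-level value is the sum over isoforms as on the slide.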

Programs to measure transcript expression Implemented method Alexa-seq Gene expression by constitutive exons ERANGE Gene expression by using all Exons Scripture Cufflinks Transcript deconvolution by solving the maximum likelihood problem MISO RSEM

Impact of library construction methods

Library construction improvements — Paired-end sequencing Adapted from the Helicos website

Paired-end reads are easier to associate to isoforms. P1 originates from isoform 1 or 2 but not 3. P2 and P3 originate from isoform 1. Paired ends increase isoform deconvolution confidence. Do paired-end reads also help identify reads originating in isoform 3?

We can estimate the insert size distribution. Get all single-isoform reconstructions; splice and compute the insert distances (d1, d2 for pairs P1, P2); estimate the empirical insert size distribution.

… and use it for probabilistic read assignment: for each candidate isoform (1, 2, 3), the insert distance di implied by that isoform is scored by P(d > di).
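A sketch of these two slides under stated assumptions: the insert distance of a pair is the number of exonic bases between its outer ends on a given isoform, the empirical distribution is a sorted list of distances from single-isoform genes, and each isoform is scored with the tail probability from the slide. Normalizing the scores into weights is my assumption, as are all names and numbers.

```python
from bisect import bisect_left

def spliced_distance(exons, left, right):
    """Exonic bases in [left, right) for sorted, non-overlapping exons (start, end)."""
    return sum(max(0, min(end, right) - max(start, left)) for start, end in exons)

def empirical_tail(sorted_inserts, d):
    """Fraction of observed insert distances that are >= d."""
    return (len(sorted_inserts) - bisect_left(sorted_inserts, d)) / len(sorted_inserts)

def assignment_weights(implied, sorted_inserts):
    """Normalize per-isoform tail probabilities into assignment weights."""
    tails = {iso: empirical_tail(sorted_inserts, d) for iso, d in implied.items()}
    norm = sum(tails.values())
    return {iso: (t / norm if norm else 0.0) for iso, t in tails.items()}

# Insert distances observed on single-isoform reconstructions (made-up values).
observed = sorted([90, 95, 100, 105, 110, 120])

# One read pair whose outer ends map to genomic positions 50 and 1050.
isoform_exons = {
    "isoform 1": [(0, 100), (1000, 1100)],               # skips the middle exon
    "isoform 2": [(0, 100), (500, 600), (1000, 1100)],   # includes it
}
implied = {iso: spliced_distance(exons, 50, 1050)
           for iso, exons in isoform_exons.items()}
weights = assignment_weights(implied, observed)
print(implied, weights)
```

Here the pair would need an implausibly long insert to come from the isoform that includes the middle exon, so all the weight goes to the other one.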

And improve quantification (Katz et al., Nature Methods 2010)

Paired-end reads improve reconstructions. Paired-end data complements the connectivity graph.

And merge regions Single reads Paired reads

Or split regions Single reads Paired reads

Summary. Paired-end reads are now routine in Illumina and SOLiD sequencers. Paired-end alignment is supported by most short read aligners. Transcript quantification depends heavily on paired-end data. Transcript reconstruction is greatly improved when using paired ends (work in progress).

Giving orientation to transcripts: strand-specific libraries. Scripture relies on splice motifs to orient transcripts. It orients every edge in the connectivity graph. Single-exon genes are left unoriented.

Strand-specific library construction results in oriented reads. The sequence depends on the adapters ligated. The second strand is destroyed, thus the cDNA read is always in reverse orientation to the RNA (adapted from Levin et al., Nature Methods). Scripture and Cufflinks allow the user to specify the orientation of the reads.

The libraries we will work with are strand specific.

Summary. Several methods now exist to build strand-specific RNA-Seq libraries. Quantification methods support strand-specific libraries. For example, Scripture will compute expression on both strands if desired.

Complementing RNA-Seq libraries with 3’ and 5’ tagged segments to pinpoint the start and end of reconstructions. In the works

Overview of the session. The 3 main computational challenges of sequence analysis for counting applications. Read mapping: placing short reads in the genome. Reconstruction: finding the regions that originate the reads. Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

The problem. Find genes that have different expression between two or more conditions. Find genes with isoforms expressed at different levels between two or more conditions. Find differentially used splicing events. Find alternatively used transcription start sites. Find alternatively used 3’ UTRs.

Differential gene expression using RNA-Seq (Normalized) read counts  Hybridization intensity We observe the individual events.

The Poisson model. Suppose you have 2 conditions and R replicates for each condition, with each replicate in its own lane. Let’s consider a single gene G. Let Cik be the number of reads aligned to G in lane i of condition k, with k = 1, 2 and i = 1, …, R. Assume for simplicity that all lanes give the same number of reads (otherwise introduce a normalization constant). Assume Cik is Poisson distributed with unknown mean mik. Use a GLM to estimate mik using two parameters, a gene-dependent parameter a and a sample-dependent parameter sk, log(mik) = a + sk, to obtain two estimators m1 and m2. Alternatively, estimate a single mean m using all replicates for all conditions.

The Poisson model. G is differentially expressed when m1 != m2. The question is whether P(Ci1, Ci2 | m) is close to P(Ci1 | m1) P(Ci2 | m2). The likelihood ratio test is ideal for this and, since the two models differ by one parameter, the test statistic follows a χ² distribution with 1 degree of freedom, which can be used to assess significance. For details see Auer and Doerge, Statistical Design and Analysis of RNA Sequencing Data, Genetics 2010, and Marioni et al., RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Research 2008.
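A minimal sketch of this likelihood ratio test for one gene, assuming equal lane yields as the slide does. With Poisson counts the maximum-likelihood means are the sample means, so no GLM fitting routine is needed here; the function name and the counts are made up.

```python
import math

def poisson_lrt(counts1, counts2):
    """Likelihood ratio test of one shared Poisson mean vs. one mean per condition.

    Returns (statistic, p_value); the statistic is compared with a
    chi-square distribution with 1 degree of freedom.
    """
    s1, s2 = sum(counts1), sum(counts2)
    n1, n2 = len(counts1), len(counts2)
    m1, m2 = s1 / n1, s2 / n2                 # per-condition estimates
    m = (s1 + s2) / (n1 + n2)                 # pooled estimate under the null

    def term(total, mean):
        # total * log(mean / m), with the convention 0 * log(0) = 0
        return total * math.log(mean / m) if total > 0 else 0.0

    stat = 2.0 * (term(s1, m1) + term(s2, m2))
    # Chi-square (1 df) upper tail: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

print(poisson_lrt([10, 12, 11], [30, 28, 33]))   # clearly different means
print(poisson_lrt([10, 10], [10, 10]))           # identical means
```

The linear terms of the two log-likelihoods cancel at the maximum-likelihood estimates, which is why only the log-ratio terms remain in the statistic.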

Cufflinks differential isoform usage. Let a gene G have n isoforms and let p1, …, pn be the estimated fraction of expression of each isoform. Call this the isoform expression distribution P for G. Given two samples, testing differential isoform usage amounts to determining whether H0: P1 = P2 or H1: P1 != P2 is true. To compare distributions, Cufflinks uses an information-content-based measure of how different two distributions are, the Jensen-Shannon divergence. The square root of the JS divergence follows a normal distribution.
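The divergence formula is missing from the slide, so here is the standard definition as a sketch: the entropy of the average distribution minus the average of the two entropies. The log base (2, giving values between 0 and 1) and the function names are my choices, and this is not a reproduction of the Cufflinks code.

```python
import math

def entropy(p):
    """Shannon entropy in bits, with 0 * log(0) taken as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: H((P + Q) / 2) - (H(P) + H(Q)) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

def js_distance(p, q):
    """Square root of the divergence, the quantity the slide refers to."""
    return math.sqrt(max(js_divergence(p, q), 0.0))

sample1 = [0.5, 0.5]    # two isoforms, equally used
sample2 = [1.0, 0.0]    # only the first isoform is used
print(js_divergence(sample1, sample2), js_distance(sample1, sample2))
```

Identical isoform distributions give 0 and completely disjoint ones give 1, so the value reads as a bounded measure of isoform switching.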

RNA-Seq differential expression software (tool: underlying model; notes). DegSeq: Normal, mean and variance estimated from replicates; works directly from reference transcriptome and read alignment. EdgeR: Negative Binomial; gene expression table. DESeq: Poisson. Myrna: Empirical; sequence reads and reference transcriptome.

Mouse dendritic cell stimulation timecourse. Stimuli: pam, lps, pic, cpg. Dendritic cells are one of the major regulators of inflammation in our body: DCs are guardians of homeostasis (Liew et al. 2005).

Acknowledgements. Mitchell Guttman. New contributors: Moran Cabili, Hayden Metsky. RNA-Seq analysis: Cole Trapnell, Manfred Grabherr. Could not do it without the support of: Ido Amit, Eric Lander, Aviv Regev. Notes: merge ES and others (change to bar chart); compress method; remake K4-K36 slide; swap case 1 and case 2 ncRNA.

What are the gray colored reads? This is a “paired-end” library, gray reads aligned without their mate

We now can include novel genes in differential expression analysis RPKM

Scripture: genomic jack of all trades. Tiny modifications (TFs); short modifications (K4); long modifications (K36/K27); discontinuous (RNA); alignment; tiling arrays.

Conclusions. A method for transcriptome reconstruction. Lots of cell-type-specific variation in protein-coding annotations. Chromatin and RNA maps synergize to annotate the genome. www.broadinstitute.org/scripture

Longer windows for dimmer but longer regions. Significant windows; merge and trim. In practice we scan a range of windows and integrate the results of each scan.

But which window?

This works given a known window size; how do we find it? Since the scan distribution depends on 3 parameters (genome size, number of reads, and window size), we can vary the window size and get an adjusted distribution for each. We can then scan the genome using different window sizes, allowing the distributions to account directly for the variable enrichment sizes that we observe. So in practice we scan with multiple sizes (from low to high) and merge the results. While the scan distributions account for these different sizes, there are practical issues which can sometimes confound. For example, short enrichments that are close together can be merged by long windows because they make the long window significant. In practice, it helps to use some prior information in picking the windows, although globally it might be OK.