The iPlant Collaborative


1 The iPlant Collaborative
Community Cyberinfrastructure for Life Science Kallisto: near-optimal RNA seq quantification tool Discovery Environment

2 Welcome to the Discovery Environment
A Simple Interface to Hundreds of Bioinformatics Apps, Powerful Computing, and Data

3 Discovery Environment Overview
A bench-biologist’s view on the evolution of computing: it can be difficult to keep up! Investing in cutting-edge tools and resources is expensive.

4 Discovery Environment Overview
Manage data Data Upload / Download files and folders Share files via URL (Public Links) Share files/folders with other users

5 Discovery Environment Overview
Analyze data and customize Applications Apps Run hundreds of bioinformatics Apps Build automated workflows Modify Apps or integrate new ones

6 Discovery Environment Overview
View history, find results, reproduce analyses, optimize parameters Analyses Monitor job status and find results Cancel jobs or re-launch jobs Detailed job history

7 Discovery Environment Overview
Leverage dedicated support and training. Support: iPlant Ask forums; tell iPlant about your needs. The platform is meant to be user driven, and there is always support staff to resolve any issues that come up.

8 Discovery Environment Overview
Benefits Get Science Done Use hundreds of bioinformatics Apps without the command line Add your own applications – an extensible, scalable platform Create and publish Apps and workflows so anyone can use them Analysis history and provenance – “avoid forensic bioinformatics” High-performance computing – not dependent on your hardware Manage a secure data repository and share data easily Reproducibility Productivity

9 RNA-seq Overview
Sequencing: cDNAs are sequenced on a sequencing platform. Analysis: reads are mapped to a reference genome or transcriptome; mapped reads are counted per gene or per transcript; counts are tested statistically for significant differences.

10 RNA seq analysis pipeline
QC, demultiplex, filter, and trim sequencing reads: FASTQC, Trimmomatic
Normalize sequencing reads: Diginorm, Trinity normalization
Either assemble transcripts de novo: Trinity, SOAP-denovo
or map (align) sequencing reads to a reference genome or transcriptome: Tophat, HISAT, STAR
Annotate assembled transcripts and count mapped reads to estimate transcript abundance: Cufflinks
Perform statistical analysis to identify differential expression (or differential splicing) among samples or treatments: Cuffdiff, eXpress, DESeq2
Normalization: several factors preclude direct comparison of raw read counts across libraries; for example, some samples are sequenced at higher depth than others.
RPKM: reads per kilobase per million mapped reads (for single-end RNA-seq)
FPKM: fragments per kilobase per million mapped reads (for paired-end RNA-seq)
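The RPKM and FPKM definitions can be made concrete with a small calculation. The sketch below is only an illustration: the function names and the example numbers are hypothetical and are not taken from the slides or from any particular tool.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same formula as RPKM, but counting fragments (read pairs), not reads."""
    return rpkm(fragment_count, transcript_length_bp, total_mapped_fragments)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library with 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
print(example_rpkm)
```

Dividing by transcript length corrects for longer transcripts collecting more reads, and dividing by library size corrects for differences in sequencing depth.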

11 “Alignment free” quantification

12 Kallisto- near optimal RNA seq quantification tool

13 Kallisto
Introduction of pseudoalignment instead of alignment (Nicolas Bray, Ph.D. thesis, 2014). RNA-Seq analysis of 30 million reads in 2.5 minutes; 500 to 1000x faster than previous approaches. Possible thanks to fast hashing techniques and pseudoalignment via the Target de Bruijn Graph. First RNA-Seq analysis approach that is tractable on a laptop while being as accurate as (or more accurate than) existing methods. Speed allows for bootstrapping to obtain uncertainty estimates, thus leading to new methods for differential analysis. Authors: Nicolas Bray, Harold Pimentel, Páll Melsted, Lior Pachter

14 RNA-Seq transcript abundance
Given a set of RNA-seq reads and a reference transcriptome, quantify the proportion of each transcript. RNA-seq reads: assume standard reads, single or paired end. Reference transcriptome: does not require a genome reference; works only with a transcriptome. Proportion: corresponds to TPM (transcripts per million): for every 1 million transcripts expressed, how many come from this one? In other words, if I took 1 million transcripts and looked for this specific transcript, how many copies would I see? This is a scaled probability distribution.
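To illustrate why TPM is a scaled probability distribution, here is a minimal sketch that converts per-transcript read counts into TPM. It assumes plain transcript lengths (real tools typically use an effective length), and the function name and numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert per-transcript read counts to transcripts per million (TPM)."""
    # Reads per base: corrects for longer transcripts producing more reads.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    # Normalize to a probability distribution, then scale to one million.
    return [rate / total_rate * 1_000_000 for rate in rates]


# Hypothetical example with three transcripts.
example_tpm = tpm(counts=[100, 200, 300], lengths_bp=[1_000, 2_000, 1_000])
print(example_tpm)
print(sum(example_tpm))
```

Because the values are normalized before scaling, the TPM values of a sample always sum to one million, which is what makes them proportions.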

15 Why Kallisto?
Consider two conditions, for example a control and a treatment. Alignment can be done with or without an annotation; without an annotation you are typically aligning to the genome, and there are a number of ways to quantify. Advantages: pseudoalignment of reads preserves the key information needed for quantification. Kallisto is therefore not only fast, but also as accurate as existing quantification tools. Blazing fast and accurate.

16 How fast is Pseudoalignment?
Given a (paired) read, which transcripts could it have originated from? Pseudoalignment is not nucleotide sequence alignment: it determines, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. In other words, given a read, it yields the list of transcripts the read could have come from. Pseudoalignments provide the sufficient statistics for the EM algorithm. How fast is pseudoalignment? The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core: ~6 million reads quantified per minute.

17 Why Kallisto?
Most RNA-seq tools (Cufflinks, RSEM, eXpress, etc.) do RNA-seq analysis in two parts. Alignment: align reads to the transcriptome, or align split reads over the genome. Quantification: convert the alignments to abundance metrics (FPKM, RPKM, TPM). There are two clusters of quantification tools, count-based vs. Expectation-Maximization (EM) based; the key difference is how they deal with ambiguous read alignments. Count-based: how many reads are assigned to each gene. EM-based: works at the transcript level and takes into account normalization for the different lengths of transcripts. Reads that are unambiguous when mapped to a gene model become ambiguous when they are shared between two transcripts. Kallisto fuses the two steps: reads are pseudoaligned to the reference transcriptome, and the EM algorithm deconvolutes the pseudoalignments to obtain transcript abundances.
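How an EM algorithm can split ambiguous reads between transcripts is easier to see in code. The following is a minimal sketch, not kallisto's implementation: it assumes reads have already been grouped by the set of transcripts they are compatible with, uses plain transcript lengths, and runs a fixed number of iterations. All names and numbers are hypothetical.

```python
def em_abundances(class_counts, lengths, n_iter=200):
    """Estimate the fraction of reads coming from each transcript.

    class_counts maps a tuple of compatible transcript names to the
    number of reads compatible with exactly that set of transcripts.
    """
    transcripts = sorted(lengths)
    total_reads = sum(class_counts.values())
    # Start from a uniform guess.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        for members, count in class_counts.items():
            # E-step: share the reads of this class among its transcripts
            # in proportion to current abundance divided by length.
            weights = {t: alpha[t] / lengths[t] for t in members}
            norm = sum(weights.values())
            if norm == 0:
                continue
            for t, weight in weights.items():
                assigned[t] += count * weight / norm
        # M-step: abundances become the fraction of reads assigned.
        alpha = {t: assigned[t] / total_reads for t in transcripts}
    return alpha


# Hypothetical example: 30 reads unique to A, 10 unique to B,
# and 40 reads compatible with both.
example_alpha = em_abundances(
    {("A",): 30, ("A", "B"): 40, ("B",): 10},
    lengths={"A": 1_000, "B": 1_000},
)
print(example_alpha)
```

In this example the 40 shared reads end up split in the same 3:1 ratio as the uniquely assigned reads, which is the behavior a count-based tool cannot reproduce.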

18 Target de Bruijn Graph (T-DBG)
This is the data structure that gives kallisto its advantage: the Target de Bruijn Graph (T-DBG). You build the k-mers of every transcript and walk along the resulting path; wherever transcripts differ, a new path is created. Create every k-mer in the transcriptome (k=31), build the de Bruijn graph, and color each k-mer by the transcripts it occurs in. Preprocess the transcriptome to create the T-DBG. Indexing is faster
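The idea of coloring k-mers by transcript can be sketched with a simplified stand-in for the T-DBG: a plain hash map from each k-mer to the set of transcripts containing it. This is an assumption made for illustration; it does not build an actual graph, ignores reverse complements, and uses a small k and made-up sequences.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k=31):
    """Map every k-mer to the set of transcripts (its "colors") containing it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(name)
    return dict(index)


# Hypothetical toy transcriptome, indexed with k=3 to keep it readable.
toy_transcripts = {"t1": "ACGTTG", "t2": "CGTTGA"}
toy_index = build_kmer_index(toy_transcripts, k=3)
for kmer in sorted(toy_index):
    print(kmer, sorted(toy_index[kmer]))
```

K-mers shared by both transcripts carry both colors, while k-mers at the point where the transcripts differ carry only one.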

19 Target de Bruijn Graph (T-DBG)
When you get a read, you match its k-mers along the graph. All paths are colored, and the colors tell us which transcripts a match is compatible with. Use the k-mers in a read to find which transcripts it could have come from. We want to find pseudoalignments. Pseudoalignment: which transcripts the read (pair) is compatible with, not an alignment of the nucleotide sequences.

20 Target de Bruijn Graph (T-DBG)
Each k-mer appears in a set of transcripts. The intersection of all sets is our pseudoalignment. Can jump over k-mers in the T-DBG that provide the same information. Jumping provides ~8x speedup over checking all k-mers.
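The intersection step can be sketched as follows. This is a simplified illustration that checks every k-mer of the read (no jumping) and skips k-mers missing from the index; both choices are assumptions for the sketch. The index is a hypothetical hand-written mapping from k-mers to transcript sets.

```python
def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for start in range(len(read) - k + 1):
        hits = index.get(read[start:start + k])
        if hits is None:
            continue  # assumption: k-mers absent from the index are ignored
        # Intersect the transcript sets seen so far.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is compatible with all k-mers
    return compatible or set()


# Hypothetical index for two toy transcripts, with k=3.
toy_kmer_sets = {
    "ACG": {"t1"},
    "CGT": {"t1", "t2"},
    "GTT": {"t1", "t2"},
    "TTG": {"t1", "t2"},
    "TGA": {"t2"},
}
print(pseudoalign("CGTTGA", toy_kmer_sets, k=3))
print(pseudoalign("CGTT", toy_kmer_sets, k=3))
```

A read covering only shared k-mers stays ambiguous between both transcripts, while a read that includes a distinguishing k-mer is narrowed to one.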

21 Performance - Accuracy
Simulated 20 samples of 30M paired-end (PE) reads using the RSEM simulator. Relative difference metric: the simulator sets the true abundance (alpha true), we estimate it (alpha est), and we compute the relative difference between the two.

22 Performance - Speed
Total running time for running 20 samples on 20 cores.

23 Bootstrap
A new statistical feature of Kallisto, possible only because of its speed, is the bootstrap. The result is that we can accurately estimate the uncertainty in abundance estimates. The figure is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample, while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance.
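The bootstrap idea can be sketched in a few lines: resample the reads with replacement, re-run the quantification on each resample, and look at the variance of the estimates. The sketch below is an illustration under simplifying assumptions, not kallisto's code: reads are represented only by per-class counts, and the quantification step is a hypothetical stand-in (`proportions`) that just computes read fractions.

```python
import random
from statistics import variance


def bootstrap_variance(class_counts, estimate, n_boot=100, seed=0):
    """Variance of abundance estimates across bootstrap resamples of the reads."""
    rng = random.Random(seed)
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    n_reads = sum(weights)
    estimates = []
    for _ in range(n_boot):
        # Resample n_reads reads with replacement.
        draw = rng.choices(classes, weights=weights, k=n_reads)
        resampled = {c: 0 for c in classes}
        for c in draw:
            resampled[c] += 1
        estimates.append(estimate(resampled))
    return {t: variance([e[t] for e in estimates]) for t in estimates[0]}


def proportions(counts):
    """Stand-in quantifier: fraction of reads per transcript."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


# Hypothetical example: 1,000 reads split 300/700 between two transcripts.
boot_var = bootstrap_variance({"t1": 300, "t2": 700}, proportions, n_boot=200)
print(boot_var)
```

Because each resample requires a full re-quantification, this approach is only practical when quantification itself is fast, which is the point the slide makes.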

24 Hands on Demo of Kallisto in DE

