The iPlant Collaborative

Presentation transcript:

The iPlant Collaborative: Community Cyberinfrastructure for Life Science. Kallisto: near-optimal RNA-seq quantification tool. Discovery Environment.

Welcome to the Discovery Environment A Simple Interface to Hundreds of Bioinformatics Apps, Powerful Computing, and Data

Discovery Environment Overview. A bench biologist's view on the evolution of computing: it can be difficult to keep up! Investing in cutting-edge tools and resources is expensive. Image from: http://www.wired.com/wired/archive/17.01/ff_mac_viewer.html

Discovery Environment Overview: manage data. Data: upload and download files and folders; share files via URL (Public Links); share files and folders with other users.

Discovery Environment Overview: analyze data and customize applications. Apps: run hundreds of bioinformatics Apps; build automated workflows; modify Apps or integrate new ones.

Discovery Environment Overview: view history, find results, reproduce analyses, optimize parameters. Analyses: monitor job status and find results; cancel or re-launch jobs; detailed job history.

Discovery Environment Overview: leverage dedicated support and training. Support: iPlant Ask forums; tell iPlant about your needs. Speaker note: mention that the platform is user driven, and that support staff are always available to resolve any issues that come up.

Discovery Environment Overview: benefits. Get science done: use hundreds of bioinformatics Apps without the command line; add your own applications (an extensible, scalable platform); create and publish Apps and workflows so anyone can use them; analysis history and provenance ("avoid forensic bioinformatics"); high-performance computing that does not depend on your hardware; manage a secure data repository and share data easily. Reproducibility. Productivity.

RNA-seq overview. Sequence: cDNAs are sequenced using a sequencing platform. Analysis: reads are mapped to a reference genome or transcriptome; mapped reads are counted per gene or per transcript; counts are tested statistically for significant differences.

RNA-seq analysis pipeline:
QC, demultiplex, filter, and trim sequencing reads: FastQC, Trimmomatic.
Normalize sequencing reads: Diginorm, Trinity normalization.
De novo assembly of transcripts (Trinity, SOAPdenovo), or map (align) sequencing reads to a reference genome or transcriptome (TopHat, HISAT, STAR).
Annotate assembled transcripts and count mapped reads to estimate transcript abundance: Cufflinks.
Perform statistical analysis to identify differential expression (or differential splicing) among samples or treatments: Cuffdiff, eXpress, DESeq2.
Normalization: several factors preclude direct comparison of raw read counts across different libraries; for example, some samples are sequenced at higher depth than others.
RPKM: reads per kilobase per million mapped reads (for single-end RNA-seq).
FPKM: fragments per kilobase per million mapped reads (for paired-end RNA-seq).
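To make the RPKM/FPKM definition concrete, here is a minimal sketch in Python. The function name `rpkm` and the toy counts and lengths are illustrative assumptions, not part of the slides; the same arithmetic gives FPKM when the counts are fragments (read pairs) rather than single reads.

```python
def rpkm(read_counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    read_counts: mapped reads (or fragments, for FPKM) per gene.
    lengths_bp:  gene or transcript lengths in base pairs.
    """
    total_mapped = sum(read_counts)
    if total_mapped == 0:
        return [0.0] * len(read_counts)
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return [
        count * 1e9 / (length * total_mapped)
        for count, length in zip(read_counts, lengths_bp)
    ]


# Toy library (hypothetical numbers): two genes with equal read counts,
# but the second gene is twice as long.
rpkm_values = rpkm([500, 500], [1000, 2000])
print(rpkm_values)
```

In this toy library the longer gene receives half the RPKM of the shorter one despite identical raw counts, which is the length correction the slide refers to.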

“Alignment free” quantification

Kallisto: a near-optimal RNA-seq quantification tool

Kallisto. Pseudoalignment was introduced in place of alignment (Nicolas Bray, Ph.D. thesis, 2014). RNA-Seq analysis of 30 million reads in 2.5 minutes; 500 to 1000x faster than previous approaches. This is possible thanks to fast hashing techniques and pseudoalignment via the Target de Bruijn Graph. It is the first RNA-Seq analysis approach that is tractable on a laptop while being as accurate as (or more accurate than) existing methods. The speed allows for bootstrapping to obtain uncertainty estimates, thus leading to new methods for differential analysis. Nicolas Bray, Harold Pimentel, Páll Melsted, Lior Pachter. https://math.berkeley.edu/~lpachter/group.html

RNA-Seq transcript abundance. Given a set of RNA-seq reads and a reference transcriptome, quantify the proportion of each transcript. RNA-seq reads: standard reads are assumed, single-end or paired-end. Reference transcriptome: no genome reference is required; kallisto works only with a transcriptome. Proportion: corresponds to TPM (transcripts per million): "for every 1M transcripts expressed, how many are from this one?" In other words, if I took 1 million transcripts and looked for this specific transcript, how many copies would I see? TPM is a scaled probability distribution.
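The "scaled probability distribution" reading of TPM can be sketched as follows. This is a simplified illustration: the function name `tpm` and the toy inputs are assumptions, and plain transcript lengths stand in for the effective lengths a real quantifier would use.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from per-transcript counts and lengths."""
    # Reads per base: proportional to how many copies of the transcript exist.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        return [0.0] * len(counts)
    # Normalize to a probability distribution, then scale to one million.
    return [rate / total_rate * 1e6 for rate in rates]


# Toy example (hypothetical numbers): equal counts, second transcript twice as long.
tpm_values = tpm([500, 500], [1000, 2000])
print(tpm_values, sum(tpm_values))
```

Unlike RPKM, the values always sum to one million within a sample, so each value can be read directly as "copies of this transcript per million transcripts".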

Why Kallisto? Advantages: pseudoalignment of reads preserves the key information needed for quantification, so kallisto is not only fast but also as accurate as existing quantification tools. Blazing fast and accurate. Speaker note: a typical experiment has two conditions, for example a control and a treatment. Reads can be aligned with or without an annotation; without one you are typically aligning to the genome, and there are a number of ways to quantify.

Pseudoalignment. Given a (paired) read, from which transcripts could it have originated? This is not nucleotide sequence alignment: it determines, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. In short, a pseudoalignment is the list of transcripts a read could have come from. Pseudoalignments provide the sufficient statistics for the EM algorithm. How fast is pseudoalignment? The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core, roughly 6 million reads quantified per minute.

Why Kallisto? Most RNA-seq tools (Cufflinks, RSEM, eXpress, etc.) do RNA-seq analysis in two parts. Alignment: align reads to the transcriptome, or align split reads over the genome. Quantification: convert the alignments to abundance metrics (FPKM, RPKM, TPM). There are two clusters of quantification tools, count-based vs. expectation-maximization (EM) based; the key difference is how they deal with ambiguous read alignments. Count-based tools ask how many reads are assigned to each gene. EM-based tools work at the transcript level and take into account normalization for the different lengths of transcripts. Reads that are unambiguous when mapped to a gene model become ambiguous when they are shared between two transcripts. Kallisto fuses the two steps: reads are pseudoaligned to the reference transcriptome, and the EM algorithm deconvolutes the pseudoalignments to obtain transcript abundances.
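How an EM algorithm can split ambiguous reads between transcripts is sketched below. This is a minimal illustration, not kallisto's implementation: the function name `em_abundances`, the fixed iteration count, and the toy equivalence classes are assumptions, and plain lengths are used where a real tool would use effective lengths.

```python
def em_abundances(eq_classes, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    eq_classes: dict mapping a tuple of compatible transcript indices
                to the number of reads with that compatibility set.
    lengths:    transcript lengths, indexed like the transcripts.
    """
    n = len(lengths)
    alpha = [1.0 / n] * n  # start from a uniform guess
    total_reads = sum(eq_classes.values())
    for _ in range(n_iter):
        assigned = [0.0] * n
        # E-step: split each class's reads in proportion to alpha_t / length_t.
        for transcripts, count in eq_classes.items():
            weights = [alpha[t] / lengths[t] for t in transcripts]
            norm = sum(weights)
            if norm == 0:
                continue
            for t, w in zip(transcripts, weights):
                assigned[t] += count * w / norm
        # M-step: the new abundance is the share of reads assigned.
        alpha = [a / total_reads for a in assigned]
    return alpha


# Toy data (hypothetical): 30 reads unique to transcript 0, 10 unique to
# transcript 1, and 60 ambiguous reads compatible with both.
alpha = em_abundances({(0,): 30, (1,): 10, (0, 1): 60}, lengths=[1000, 1000])
print(alpha)
```

In the toy data the ambiguous reads end up divided 3:1, following the ratio of uniquely assigned reads, whereas a naive count-based scheme would have to discard them or split them evenly.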

Target de Bruijn Graph (T-DBG). http://arxiv.org/pdf/1505.02710v2.pdf The data structure that gives kallisto its advantage is called the T-DBG. You take the k-mers of every transcript and walk along the resulting path; wherever transcripts differ, a new path is created. Create every k-mer in the transcriptome (k=31), build the de Bruijn graph, and color each k-mer. The transcriptome is preprocessed to create the T-DBG. Indexing is faster.
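The "color each k-mer" step can be illustrated with a flat hash map from k-mer to the set of transcripts containing it. This is a deliberate simplification and an assumption of this sketch: it does not build graph edges or merged paths, ignores reverse complements, and uses a small k and made-up sequences so the result is easy to inspect.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k=31):
    """Map every k-mer to the set of transcript names (colors) containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return dict(index)


# Two hypothetical transcripts that share a prefix and then diverge.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
kmer_index = build_kmer_index(transcripts, k=5)
for kmer, colors in sorted(kmer_index.items()):
    print(kmer, sorted(colors))
```

K-mers from the shared prefix carry both colors, while k-mers past the point of divergence carry only one, which mirrors the "new path wherever they differ" description.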

Target de Bruijn Graph (T-DBG). http://arxiv.org/pdf/1505.02710v2.pdf When you get a read, you match its k-mers along the graph. All the paths are colored, and the colors tell us which transcripts the match is compatible with. Use the k-mers in a read to find which transcripts it could have come from. The goal is to find pseudoalignments. Pseudoalignment: which transcripts the read (pair) is compatible with, not an alignment of the nucleotide sequences.

Target de Bruijn Graph (T-DBG). http://arxiv.org/pdf/1505.02710v2.pdf Each k-mer appears in a set of transcripts. The intersection of all sets is our pseudoalignment. We can jump over k-mers in the T-DBG that provide the same information. Jumping provides ~8x speedup over checking all k-mers.
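The intersection rule can be sketched as below. The function name `pseudoalign`, the hand-written toy index, and the choice to ignore k-mers absent from the index are assumptions of this sketch; it checks every k-mer and omits the jumping optimization described above.

```python
def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer not in the transcriptome (assumed to be ignorable)
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is compatible with all k-mers seen so far
    return compatible or set()


# Hypothetical colored k-mer index for two transcripts, k = 5.
toy_index = {
    "ACGTA": {"t1", "t2"},
    "CGTAC": {"t1", "t2"},
    "GTACG": {"t1"},
    "TACGG": {"t1"},
    "GTACC": {"t2"},
    "TACCT": {"t2"},
}

print(pseudoalign("ACGTAC", toy_index, 5))   # read from the shared region
print(pseudoalign("CGTACGG", toy_index, 5))  # read crossing into t1-only sequence
```

The first read stays ambiguous between both transcripts, while the second is narrowed to `t1` as soon as one of its k-mers carries only that color.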

Performance: accuracy. Simulated 20 samples of 30M paired-end reads using the RSEM simulator. Relative difference metric: the simulator sets the true value (alpha true), kallisto estimates it (alpha est), and we compute the relative difference between the two. Accuracy: http://arxiv.org/pdf/1505.02710v2.pdf

Performance: speed. Total running time for 20 samples on 20 cores. Speed: http://arxiv.org/pdf/1505.02710v2.pdf

Bootstrap. A new statistical feature of kallisto, possible only because of its speed, is the bootstrap. The result is that we can accurately estimate the uncertainty in abundance estimates. The figure is based on an analysis of 40 samples of 30 million reads subsampled from 275 million rat RNA-Seq reads. Each dot corresponds to a transcript and is colored by its abundance. The x-axis shows the variance estimated from kallisto bootstraps on a single subsample, while the y-axis shows the variance computed from the different subsamples of the data. We see that the bootstrap recapitulates the empirical variance. http://arxiv.org/pdf/1505.02710v2.pdf
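The idea of bootstrapping abundance estimates can be sketched as follows: resample the reads with replacement, re-estimate, and look at the spread. The function names and toy counts are assumptions, and a simple proportion estimator stands in for rerunning the EM step on each resample.

```python
import random
from statistics import variance


def bootstrap_variances(class_counts, estimator, n_boot=100, seed=0):
    """Variance of each estimated quantity across bootstrap resamples of the reads."""
    rng = random.Random(seed)
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    n_reads = sum(weights)
    runs = []
    for _ in range(n_boot):
        # Draw n_reads reads with replacement, keeping only their class labels.
        draws = rng.choices(classes, weights=weights, k=n_reads)
        resampled = {c: 0 for c in classes}
        for c in draws:
            resampled[c] += 1
        runs.append(estimator(resampled))
    n_out = len(runs[0])
    return [variance([run[j] for run in runs]) for j in range(n_out)]


def proportions(counts):
    """Stand-in estimator (assumption): fraction of reads per class."""
    total = sum(counts.values())
    return [counts[c] / total for c in sorted(counts)]


# Hypothetical sample: 900 reads from t1 and 100 reads from t2.
boot_var = bootstrap_variances({"t1": 900, "t2": 100}, proportions, n_boot=50, seed=1)
print(boot_var)
```

Because each resample only redraws counts and reruns the estimator, the cost per bootstrap is small, which is why a fast quantifier makes this kind of uncertainty estimate practical.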

Hands on Demo of Kallisto in DE