RNA-Seq visualization with CummeRbund

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
DEG Mi-kyoung Seo.
© by Pearson Education, Inc. All Rights Reserved.
Information Retrieval in Practice
RNA-seq Analysis in Galaxy
Chapter 7 Managing Data Sources. ASP.NET 2.0, Third Edition2.
Scaffold Download free viewer:
Introduction to SPSS Short Courses Last created (Feb, 2008) Kentaka Aruga.
Overview of Search Engines
Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
ATM 315 Environmental Statistics Course Goto Follow the link and then choose the desktop application.
An Introduction to RNA-Seq Transcriptome Profiling with iPlant
RNA-Seq Visualization
Introduction to RNA-Seq and Transcriptome Analysis
Expression Analysis of RNA-seq Data
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
IE 411/511: Visual Programming for Industrial Applications
BIF Group Project Group (A)rabidopsis: David Nieuwenhuijse Matthew Price Qianqian Zhang Thijs Slijkhuis Species: C. Elegans Project: Advanced.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 Part 4 Curve Fitting.
RNAseq analyses -- methods
Copyright OpenHelix. No use or reproduction without express written consent1.
Agenda Introduction to microarrays
1 Computer Programming (ECGD2102 ) Using MATLAB Instructor: Eng. Eman Al.Swaity Lecture (1): Introduction.
An Introduction to RNA-Seq Transcriptome Profiling with iPlant.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Introduction to RNA-Seq
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop RNA-Seq using the Discovery Environment And COGE.
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
Introduction to SPSS. Object of the class About the windows in SPSS The basics of managing data files The basic analysis in SPSS.
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
RNA-Seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is doing the.
Introduction to RNAseq
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop RNA-Seq visualization with cummeRbund.
Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.
Analyzing digital gene expression data in Galaxy Supervisors: Peter-Bram A.C. ’t Hoen Kostas Karasavvas Students: Ilya Kurochkin Ivan Rusinov.
The iPlant Collaborative
An Introduction to RNA-Seq Transcriptome Profiling with iPlant (
Accessing and visualizing genomics data
Canadian Bioinformatics Workshops
HOMER – a one stop shop for ChIP-Seq analysis
Canadian Bioinformatics Workshops
Transforming Science Through Data-driven Discovery Genomics in Education University of Delaware – February 2016 Jason Williams, Education, Outreach, Training.
Canadian Bioinformatics Workshops
Transforming Science Through Data-driven Discovery Tools and Services Workshop Atmosphere Joslynn Lee – Data Science Educator Cold Spring Harbor Laboratory,
IENG-385 Statistical Methods for Engineers SPSS (Statistical package for social science) LAB # 1 (An Introduction to SPSS)
Transforming Science Through Data-driven Discovery Tools and Services Workshop Data Store Overview.
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Transforming Science Through Data-driven Discovery Workshop Overview Ohio State University MCIC Jason Williams – Lead, CyVerse – Education, Outreach, Training.
Transforming Science Through Data-driven Discovery Tools and Services Workshop Data Store – Managing your ‘Big’ Data Joslynn Lee, Ph.D. – Data Science.
Transforming Science Through Data-driven Discovery Tools and Services Workshop Data Store – Managing your ‘Big’ Data Joslynn Lee – Data Science Educator.
Arrays How do they work ? What are they ?. WT Dwarf Transgenic Other species Arrays are inverted Northerns: Extract target RNA YFG Label probe + hybridise.
Transforming Science Through Data-driven Discovery Using CyVerse Cyberinfrastructure to Enable Data Intensive Research, Collaboration, and Education Joslynn.
Transforming Science Through Data-driven Discovery Using CyVerse Cyberinfrastructure to Enable Data Intensive Research, Collaboration, and Education Atmosphere.
Joslynn S. Lee, PhD, Data Science Educator Cold Spring Harbor Laboratory, DNA Learning Center Transforming Science Through Data-driven Discovery.
Introductory RNA-seq Transcriptome Profiling
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
RNA-Seq visualization with CummeRbund
RNA-Seq analysis in R (Bioconductor)
By Dr. Madhukar H. Dalvi Nagindas Khandwala college
Introductory RNA-Seq Transcriptome Profiling
DEPARTMENT OF COMPUTER SCIENCE
Communication and Coding Theory Lab(CS491)
Assessing changes in data – Part 2, Differential Expression with DESeq2
Additional file 2: RNA-Seq data analysis pipeline
Transcriptomics – towards RNASeq – part III
Presentation transcript:

RNA-Seq visualization with CummeRbund Atmosphere Workflow Joslynn Lee – Data Science Educator Cold Spring Harbor Laboratory, DNA Learning Center jolee@cshl.edu

RNA-Seq Experiment Overview In this example we will compare gene transcript abundance between wild type and mutant Arabidopisis strains. LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF) Mutations in the HY5 gene cause a myriad of aberrant phenotypes (morphology /pigmentation) in Arabidopsis We will use RNA-seq to compare expression levels for genes between WT and hy5- samples

Useful resources for Tuxedo workflow Papers with some CuffDiff results *Graphics taken from these publications

Compare expression analysis using CuffDiff One option using Tuxedo Package Other programs: Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences. The folder cuffdiff_out contains the output files from cuffdiff. Files related to differential gene expression at the gene and transcript levels: gene_exp.diff Results (gene fragments per Kb of exon per million [FPKM] values) of differential expression testing between the two samples, WT and hy5, at the gene level.

Workflow of RNA-Seq in CyVerse Using a GUI Tophat (bowtie) Cufflinks Cuffmerge Cuffdiff CummeRbund Your Data CyVerse Data Store Discovery Environment Atmosphere FASTQ

Workflow of RNA-Seq in Atmosphere CummeRbund Any image w/R can work, and you could also search for an image with cummeRbund installed Description: Allows for persistent storage, access, exploration, and manipulation of Cufflinks high-throughput sequencing data. In addition, provides numerous plotting functions for commonly used visualizations.

cummeRbund is a visualization package Available via Bioconductor All of the output files are related to each other through their various tracking_ids, but parsing through individual files to query for important result information requires both a good deal of patience and a strong grasp of command-line text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multi-layered dataset promote rapid analysis of RNA-Seq data by aggregating, indexing, and allowing you easily visualize create publication-ready figures of your RNA-Seq data while maintaining appropriate relationships between connected data points

Bringing data into Atmosphere Use iCommands or Cyberduck Cyberduck iCommands

Installing CummerBund Available via Bioconductor Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. CummeRbund is now included as part of the R/Bioconductor project. The links above point to the R packages directly, and every effort should be made to use the version provided by Bioconductor

Connect with VNC Viewer Visualization for Atmosphere Integrated Development Environment (IDE) for R Execute R code directly from the source editor Quickly jump to function definitions Easily manage multiple working directories using projects

Output files of CuffLinks Find in Atmosphere 1. Find output files Desktop  test_data  iplant_cuffdiff_out

Output files of CuffLinks Find in Atmosphere 1. Find output files Desktop  test_data  iplant_cuffdiff_out 2. Install cummeRbund (once) > source("https://bioconductor.org/biocLite.R") > biocLite("cummeRbund") 3. Load library >library(cummeRbund) 4. Set working directory to “iplant_cuffdiff_out” > setwd("~/Desktop/test_data/iplant_cuffdiff_out”)

Load cuffdiff data into the database Import files into R Studio CummeRbund begins by re-organizing output files of a cuffdiff analysis, and storing these data in a local SQLite database. To read your cuffdiff files: In R studio: > cuff <- readCufflinks(rebuild=TRUE) > cuff Can store values in variable (cuff) using the assignment operator <- readCufflinks(): enables you to keep the entire dataset out of memory and significantly improve performance for large cufflinks datasets. Variable (cuff) now contains a value

What do we visualize? Global statistics and Quality Control How good is my data? Are my biological replicates Creating gene sets from significantly regulated genes Exploring the relationships between conditions

Global statistics and Quality Control A good place to begin is to evaluate the quality of the model fitting Overdispersion is a common problem in RNA-Seq data As of cufflinks v2.0 mean counts, variance, and dispersion are all emitted, allowing you to visualize the estimated overdispersion for each sample as a quality control measure. >disp<-dispersionPlot(genes(cuff)) >disp Counts (x-axis) vs. dispersion (y-axis) Overdispersion greater variability in a data set than would be expected based on a given model (in our case extra-Poisson variation) If you use Poisson model, you will overestimate differential expression WT (left) and hy5 (right)

Poisson does not capture biological variability Use dispersion If you use Poisson model, you will overestimate differential expression REDO http://www.fgcz.ch/education/StatMethodsExpression/03_Count_data_analysis.pdf

Example of overdispered data Use dispersion

Squared Coefficient of Variation The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data >genes.scv<-fpkmSCVPlot(genes(cuff)) >genes.scv Represents the relationship of the standard deviation to the mean Differences in SCV can result in lower numbers of differentially expressed genes due to a higher degree of variability between replicate FPKM estimates

Distribution of FPKM scores To assess the distributions of the genes in the two samples, you can use the csDensity plot. The density plot generated by cummeRbund can be thought of as a ‘smoothed’ representation of a histogram. >dens<-csDensity(genes(cuff)) >dens >densRep<-csDensity(genes(cuff),replicates=T) >densRep Two WT and two samples

Distribution of FPKM scores Pairwise comparisons between conditions can be made by using a scatterplot. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. > csScatter(genes(cuff),‘WT’,‘hy5’,smooth=T) These plots can be useful for identifying global changes and trends in gene expression between pairs of conditions. Scatterplot of log10(FPKM) in WT and hy5 sample is shown

Selecting and Filtering Gene Sets Creating gene sets from significantly regulated genes Determine which genes are differentially regulated. cummeRbund makes accessing the results of these significance tests simple via getSig(): Returns the identifiers of significant genes in a vector format. >sig <-getSig(cuff, alpha=0.05, level =‘genes’) #An alpha value by which to filter multiple-testing corrected q-values to determine significance, 0.05 is default >length(sig) #returns the number of genes in the sig object

Selecting and Filtering Gene Sets Using the ‘getGenes’ function # Get the gene information >sigGenes <- getGenes(cuff,sig) Plot this in another scatter plot >csScatter(sigGenes, ‘WT’, ‘hy5’)

Heatmap: Similar Expression Values There are several plotting functions available for gene-set-level visualization: The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. >sigGenes <-getGenes(cuff,tail(sig,50)) #last 50 genes in the list we created >csHeatmap(sigGenes,cluster=‘both’)

Heatmap: Similar Expression Values >csHeatmap(sigGenes,cluster=‘both’,replicates=‘T’) REDO

Expression Plots by Genes > myGeneId<-”AT5G41471" > myGene<-getGene(cuff,myGeneId) > myGene

Expression Plots by Genes > expressionPlot(myGene,replicates=‘T’)

Help: ask.iplantcollaborative.org Detailed instructions with videos, manuals, documentation in CyVerse Wiki Search by tag