Differential Expression Analysis Introduction to Systems Biology Course Chris Plaisier Institute for Systems Biology.

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Shibing Deng Pfizer, Inc. Efficient Outlier Identification in Lung Cancer Study.
Multiple testing and false discovery rate in feature selection
CS1100: Computer Science and Its Applications Building Flexible Models in Microsoft Excel.
Inferring Biologically Meaningful Relationships Introduction to Systems Biology Course Chris Plaisier Institute for Systems Biology.
Microsoft Office Illustrated Fundamentals Unit H: Using Complex Formulas, Functions, and Tables.
Statistical Modeling and Data Analysis Given a data set, first question a statistician ask is, “What is the statistical model to this data?” We then characterize.
Supervised and unsupervised analysis of gene expression data Bing Zhang Department of Biomedical Informatics Vanderbilt University
1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Multiple testing adjustments European Molecular Biology Laboratory Predoc Bioinformatics Course 17 th Nov 2009 Tim Massingham,
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Differentially expressed genes
Statistical Analysis of Microarray Data
1 Data Analysis for Gene Chip Data Part I: One-gene-at-a-time methods Min-Te Chao 2002/10/28.
False Discovery Rate Methods for Functional Neuroimaging Thomas Nichols Department of Biostatistics University of Michigan.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Statistics for Microarrays
Multiple Testing Procedures Examples and Software Implementation.
Introduction to Glioblastoma Chris Plaisier Introduction to Systems Biology Course Institute for Systems Biology.
Databases Ms. Scales. What is a Database? Database  A collection of data organized for fast search and retrieval  Examples: Telephone Directories Hospital.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets.
Multiple Comparison Correction in SPMs Will Penny SPM short course, Zurich, Feb 2008 Will Penny SPM short course, Zurich, Feb 2008.
Multiple testing correction
Proteomics Informatics – Data Analysis and Visualization (Week 13)
Hypothesis Testing Statistics for Microarray Data Analysis – Lecture 3 supplement The Fields Institute for Research in Mathematical Sciences May 25, 2002.
Lucio Baggio - Lucio Baggio - False discovery rate: setting the probability of false claim of detection 1 False discovery rate: setting the probability.
Candidate marker detection and multiple testing
Differential Analysis & FDR Correction
Essential Statistics in Biology: Getting the Numbers Right
StAR web server tutorial for ROC Analysis. ROC Analysis ROC Analysis: This module allows the user to input data for several classifiers to be tested.
Differential Gene Expression Dennis Kostka, Christine Steinhoff Slides adapted from Rainer Spang.
False Discovery Rates for Discrete Data Joseph F. Heyse Merck Research Laboratories Graybill Conference June 13, 2008.
Microsoft Office 2007 Intermediate© 2008 Pearson Prentice Hall1 PowerPoint Presentation to Accompany GO! With Microsoft ® Office 2007 Intermediate Chapter.
Multiple Testing in Microarray Data Analysis Mi-Ok Kim.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.
Back to basics – Probability, Conditional Probability and Independence Probability of an outcome in an experiment is the proportion of times that.
Introduction to Microarrays Dr. Özlem İLK & İbrahim ERKAN 2011, Ankara.
Statistical Testing with Genes Saurabh Sinha CS 466.
1 ArrayTrack Demonstration National Center for Toxicological Research U.S. Food and Drug Administration 3900 NCTR Road, Jefferson, AR
For a specific gene x ij = i th measurement under condition j, i=1,…,6; j=1,2 Is a Specific Gene Differentially Expressed Differential expression.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
The Broad Institute of MIT and Harvard Differential Analysis.
1 Drug Screening and the False Discovery Rate Charles W Dunnett McMaster University 3 rd International Conference on Multiple Comparisons, Bethesda, Maryland,
1 Significance analysis of Microarrays (SAM) Applied to the ionizing radiation response Tusher, Tibshirani, Chu (2001) Dafna Shahaf.
Bioinformatics for biologists
SMART LIGHTING Using Excel with Data from Analog Discovery and LTspice K. A. Connor Mobile Studio Project Center for Mobile Hands-On STEM SMART LIGHTING.
Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Matrix Multiplication The Introduction. Look at the matrix sizes.
Stitching the Tutorials Together Introduction to Systems Biology Course Chris Plaisier Institute for Systems Biology.
Canadian Bioinformatics Workshops
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.
Module 2: Analyzing gene lists: over-representation analysis
Differential Gene Expression
Statistical Testing with Genes
Patients with SLE have elevated ERV expression.
Statistical Analysis and Design of Experiments for Large Data Sets
Statistical Analysis of Microarray Data
False discovery rate estimation
Statistical Testing with Genes
Distribution plot of the differentially expressed genes.
Benjamini & Hochberg-corrected p < 0.05 Bonferroni-corrected
Distinct molecular and clinical correlates of H3F3A mutation subgroups
Microsoft Office Illustrated Fundamentals
Presentation transcript:

Differential Expression Analysis Introduction to Systems Biology Course Chris Plaisier Institute for Systems Biology

Glioma: A Deadly Brain Cancer Wikimedia commons

miRNAs in Cancer Caldas et al., 2005 RISC

Utility of miRNAs in Cancer Chan et al., 2011

miRNAs are Dysregulated in Cancer Chan et al., 2011

What data do we need? TCGA

Analysis Method Mischel et al, 2004

Utility of miRNAs in Cancer Chan et al., 2011

Student’s T-test

Data for Analysis Patient tumor miRNA expression levels By identifying miRNAs whose expression is significantly different between glioma and normal – Could be drivers of cancer related processes

Loading the Data Comma separated values file is a text file where each line is a row and the columns separated by a comma. In R you can easily load these types of files using: # Load up data for differential expression analysis d1 = read.csv( ' systemsbiology.net.events.sysbio/files/uploads/cnvData_miRNAE xp.csv', header=T, row.names=1) NOTE: CSV files can easily be imported or exported from Microsoft Excel.

What does the data look like?

Subset Data Types This file contains the case/control stats, CNVs and miRNA expression We want to separate these out to make our analysis easier # Case or control status (1 = case, 0 = control) case_control = d1[1,] # miRNA expression levels mirna = d1[361:894,] # Copy number variation cnv = d1[2:360,]

Plot the Data

Questions What statistics should we compute? What results should we save from the analysis?

Calculating T-test for all miRNAs Use a Student's T-test to identify the differentially expressed miRNAs from the study (compare experimental to controls). Input: cnvData_miRNAExp.csv - matrix of miRNA expression profiles Desired output: t.test.fc.mirna.csv – a matrix of fold-changes, Student’s T-test statistics, Student's T-test p- values with Bonferroni and Benjamini-Hochberg correction in separate columns labeled by miRNA names (write them out sorted by Benjamini-Hochberg corrected p-values). The number of miRNAs differentially expressed (α ≤ 0.05 and fold-change ± 2) for no multiple testing correction, Bonferroni and Benjamini-Hochberg correction (use whatever method you like: R or Excel) volcanoPlotTCGAmiRNAs.pdf – Create a volcano plot of the –log 10 (p-value) vs. log 2 (fold- change) Useful functions: read.csv, t.test, sapply, p.adjust, order, write.csv, print, pdf, plot, t, pdf, dev.off, paste

Calculating Fold-Changes Now lets calculate the fold-changes for each of the miRNAs, values are log 2 transformed so need to reverse this before calculating fold-changes: # Calculate fold-changes fc = rep(NA, nrow(mirna)) for(i in 1:nrow(mirna)) { fc[i] = median(2^as.numeric(mirna[i, which(case_control==1)]), na.rm = T) / median(2^as.numeric(mirna[i, which(case_control==0)]), na.rm = T) } or a faster version using an apply: # Faster version using an sapply fc.2 = sapply(1:nrow(mirna), function(i) { return(median(2^as.numeric(mirna[i, which(case_control==1)]), na.rm = T) / median(2^as.numeric(mirna[i, which(case_control==0)]), na.rm = T)) } )

Calculating T-Test for All miRNAs Now lets calculate the significance of differential expression for each of the miRNAs: # Calculate Student's T-test p-values t1.t = rep(NA, nrow(mirna)) t1.p = rep(NA, nrow(mirna)) for(i in 1:nrow(mirna)) { t1 = t.test( mirna[ i, which(case_control==1) ], mirna[ i, which(case_control==0) ] ) t1.t[ i ] = t1$statistic t1.p[ i ] = t1$p.value }

Multiple Testing Correction When to use FDR vs. FWER for setting a threshold? Family Wise Error Rate (FWER) - e.g. Bonferroni –Extremely conservative only few miRNAs are called significant. –Is used when one needs to be certain that all called miRNAs are truly positive. False Discovery Rate (FDR) - e.g. Benjamini-Hochberg –If the FWER is too stringent when one is more interested in having more true positives. The false positives can be sorted out in subsequent experiments (expensive). –By controlling the FDR one can choose how many of the subsequent experiments one is willing to be in vain.

Adjust for Multiple Testing Next we will correct our p-values for multiple testing in two ways: # Do Bonferroni multiple testing correction (FWER) p.bonferroni = p.adjust(pValues, method='bonferroni') # Do Benjamini-Hochberg multiple testing correction (FDR) p.benjaminiHochberg = p.adjust(pValues, method='BH') # How many miRNAs are considered significant print(paste('Uncorrected = ',sum(pValues<=0.05),'; Bonferroni = ',sum(p.bonferroni<=0.05),'; Benjamini- Hochberg = ',sum(p.benjaminiHochberg<=0.05),sep=''))

Write Out Results to CSV We will now write out the results of the T-test analysis to a CSV file: # Create index ordered by Benjamini-Hochberg corrected p-values to sort each vector o1 = order(p.benjaminiHochberg) # Make a data.frame with the three columns td1 = data.frame(fold.change = fc[o1], t.stats = t1.t[o1], t.p = t1.p[o1], t.p.bonferroni = p.bonferroni[o1], t.p.benjaminiHochberg = p.benjaminiHochberg[o1]) # Add miRNAs names as rownames rownames(td1) = sub('exp.', '', rownames(mirna)[o1]) # Write out results file write.csv(td1, file = 't.test.fc.mirna.csv') This can now be opened in Excel for further analysis.

What are the DE miRNAs? Typically genes / miRNAs are considered DE if adjusted p-value ≤ 0.05 and fold-change ± 2 – Benjamini-Hochberg FDR ≤ 0.05 and FC ± 2 = 66 miRNA How do we figure out which 66? #The significant miRNAs sub('exp.', '', rownames(mirna)[ which(p.benjaminiHochberg = 2)) ] )

Basic Code for a Volcano Plot # Make a volcano plot plot(log(fc,2), -log(t1.p, 10), ylab = '- log10(p-value)', xlab = 'log2(Fold Change)', axes = F, col = rgb(0, 0, 1, 0.25), pch = 20, main = "TCGA miRNA Differential Expresion", xlim = c(-6, 5.5), ylim=c(0, 110)) p1 = par() axis(1) axis(2)

Volcano Plot

Adding Some Flair to the Volcano Plot # Open a PDF output device to store the volcano plot pdf('volcanoPlotTCGAmiRNAs.pdf') # Make a volcao plot plot(log(fc,2), -log(t1.p, 10), ylab = '-log10(p-value)', xlab = 'log2(Fold Change)', axes = F, col = rgb(0, 0, 1, 0.25), pch = 20, main = "TCGA miRNA Differential Expresion", xlim = c(-6, 5.5), ylim=c(0, 110)) # Get some plotting information for later p1 = par() # Add the axes axis(1) axis(2) ## Label significant miRNAs on the plot # Don’t make a new plot just write over top of the current plot par(new=T) # Choose the significant miRNAs included = c(intersect(rownames(td1)[which(td1[, 't.p'] =2)]), intersect(rownames(td1)[which(td1[, 't.p']<=(0.05/534))], rownames(td1)[which(td1[, 'fold.change']<=0.5)])) # Plot the red highlighting circles plot(log(td1[included, 'fold.change'], 2), -log(td1[included, 't.p'], 10), ylab = '-log10(p-value)', xlab = 'log2(Fold Change)', axes = F, col = rgb(1, 0, 0, 1), pch = 1, main = "TCGA miRNA Differential Expresion", cex = 1.5, xlim = c(-6, 5.5), ylim = c(0, 110)) # Add labels as text text((log(td1[included, 'fold.change'], 2)), ((-log(td1[included, 't.p'], 10))+-3), included, cex = 0.4) # Close PDF output device, closes PDF file dev.off()

Making a Volcano Plot

Making a PDF R has options that allow you to easily make PDFs of your plots – Nice because they can be loaded into Illustrator and modified Can either be done at the command line or through the graphical interface

Integrating miRNA Expression and CNVs Hypothesis: If an miRNA was deleted or amplified it could affect its expression in a dose dependent manner TCGA, Nature 2012

Correlation: Finding Linear Relationships

What is Linear? y = mx + b

Can CNV Levels Predict miRNA Expression Levels?

What kind of data do we need?

Does the TCGA have it? TCGA Integrate

Does the biology modify integration? Should we correlate each CNV across genome with each miRNA? Is there a way to reduce multiple testing? Does it imply something about the causality of the association?

Tabulating miRNA CNVs 1.Collect miRNA genomic coordinates 2.Collect CNV levels across genome 3.Identify CNV levels for each miRNA 4.Correlate a expression and CNV levels for each miRNA

Calculating Correlation Between miRNA CNV and Expression Use a correlation to identify the copy number variants that have a dose dependent effect on miRNA expression: Input: cnvData_miRNAExp.csv - matrix of miRNA expression profiles Desired output: corTestCnvExp_miRNA_gbm.csv - a matrix of correlation coefficients, correlation p-values, and Bonferroni and Benjamini-Hochberg correction in separate columns labeled by miRNA names (write them out sorted by Benjamini-Hochberg corrected p-values). corTestCnvExp_miRNA_gbm.pdf – scatter plots of the top 15 miRNAs correlated with CNV variation. Best two candidate miRNAs for follow-up studies. Useful functions: read.csv, cor.test, sapply, p.adjust, order, write.csv, print, pdf, plot, t, pdf, dev.off, paste

Formulae in R Formulae in R are very handy: response.variable ~ explanatory.variables Formulae can be used in place of data vectors for many functions. In our case: cor.test(.exp.miRNA ~ cnv.miRNA)

Calculating Correlation Now lets calculate the fold-changes for each of the miRNAs, values are log2 transformed so need to reverse this before calculating fold-changes: # Make a matrix to hold the Copy Number Variation data for each miRNA # Most miRNAs should have a corresponding CNV entry cnv = d1[2:360,] # Run the analysis for hsa-miR-10b c1 = cor.test(as.numeric(cnv['cnv.hsa-miR-10b',]), as.numeric(mirna['exp.hsa-miR- 10b',]), na.rm = T) # Plot hsa-miR-10b expression vs. Copy Number levels plot(as.numeric(mirna['exp.hsa-miR-10b',]) ~ as.numeric(cnv['cnv.hsa-miR-10b',]), col = rgb(0, 0, 1, 0.5), pch = 20, xlab = 'Copy Number', ylab = 'hsa-miR-10b Expression', main = 'hsa-miR-10b:\n Expression vs. Copy Number') # Add a trend line to the plot lm1 = lm(as.numeric(mirna['exp.hsa-miR-10b',]) ~ as.numeric(cnv['cnv.hsa-miR- 10b',])) abline(lm1, col='red', lty=1, lwd=1)

Plot Correlation # Plot hsa-miR-10b expression vs. Copy Number levels plot(as.numeric(mirna['exp.hsa-miR-10b',]) ~ as.numeric(cnv['cnv.hsa-miR-10b',]), col = rgb(0, 0, 1, 0.5), pch = 20, xlab = 'Copy Number', ylab = 'hsa-miR-10b Expression', main = 'hsa-miR-10b:\n Expression vs. Copy Number') # Add a trend line to the plot lm1 = lm(as.numeric(mirna['exp.hsa-miR-10b',]) ~ as.numeric(cnv['cnv.hsa-miR-10b',])) abline(lm1, col='red', lty=1, lwd=1)

Not Associated with Copy Number P-Value = 0.51 R = -0.03

Scaling it Up to Whole miRNAome # Create a matrix to strore the output m1 = matrix(nrow = 359, ncol = 2, dimnames = list(rownames(cnv), c('cor.r', 'cor.p'))) # Run the analysis for(i in rownames(cnv)) { # Try function catches errors caused by missing data c1 = try(cor.test(as.numeric(cnv[i,]), as.numeric(mirna[sub('cnv','exp',i),]), na.rm = T), silent = T)\ # If there are no errors then adds values to matrix m1[i, 'cor.r'] = ifelse(class(c1)=='try-error', 'NA', c1$estimate) m1[i, 'cor.p'] = ifelse(class(c1)=='try-error', 'NA', c1$p.value) } # Adjust p-values and get rid of NA’s using na.omit m2 = na.omit(cbind(m1, p.adjust(as.numeric(m1[,2]), method = 'BH')))

Write Out Results Write out the resulting correlations and sort them by the correlation coefficient: # Create index ordered by correlation coefficient to sort the entire matrix o1 = order(m2[,1], decreasing = T) # Write out results file write.csv(m2[o1,], file = 'corTestCnvExp_miRNA_gbm.csv')

Plot Top 15 Correlations # Get top 15 to plot based on correlation coefficient top15 = sub('cnv.', '', rownames(head(m2[order(as.numeric(m2[,1]), decreasing = T),], n = 15))) ## Plot top 15 correlations # Open a PDF device to output plots pdf('corTestCnvExp_miRNA_gbm.pdf') # Iterate through all the top 15 miRNAs for(mi1 in top15) { # Plot correlated miRNA expression vs. copy number variation plot(as.numeric(mirna[paste('exp.', mi1, sep = ''),]) ~ as.numeric(cnv[paste('cnv.', mi1, sep = ''),]), col = rgb(0, 0, 1, 0.5), pch = 20, xlab = 'Copy Number', ylab = 'Expression', main = paste(mi1, '\n Expression vs. Copy Number'), sep = '') # Make a trend line and plot it lm1 = lm(as.numeric(mirna[paste('exp.', mi1, sep = ''),]) ~ as.numeric(cnv[paste('cnv.', mi1, sep = ''),])) abline(lm1, col = 'red', lty = 1, lwd = 1) } # Close PDF device dev.off()

Amplification: Associated with Copy Number Correlation P-Value = < 2.2 x R = 0.77 Differential Expression Fold-change = 0.85 P-Value = 0.82

Deletion: Associated with Copy Number Correlation P-Value = < 2.2 x R = 0.45 Differential Expression Fold-change = -4.1 P-Value = 1.33 x 10 -8

Sub-clonal Amplification: Associated with Copy Number Correlation P-Value = < 2.2 x R = 0.50 Differential Expression Fold-change = 2.2 P-Value = 2.0 x10 -10

Top Two Candidates for Follow-Up? What are your suggestions? What other data would help to choose? Can we overlap the miRNA DE and CNV correlation studies? – What if they don’t overlap? What should we do for follow-up studies?