Making Sense of Complicated Microarray Data

Slides:



Advertisements
Similar presentations
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Advertisements

Normalization of microarray data
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Getting the numbers comparable
Normalization for cDNA Microarray Data Yee Hwa Yang, Sandrine Dudoit, Percy Luu and Terry Speed. SPIE BIOS 2001, San Jose, CA January 22, 2001.
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Microarray Data Preprocessing and Clustering Analysis
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Statistical Analysis of Microarray Data
Independent Samples and Paired Samples t-tests PSY440 June 24, 2008.
Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review
Gene Expression Data Analyses (2)
Analysis of Differential Expression T-test ANOVA Non-parametric methods Correlation Regression.
Normalization of 2 color arrays Alex Sánchez. Dept. Estadística Universitat de Barcelona.
Lecture 24: Thurs. Dec. 4 Extra sum of squares F-tests (10.3) R-squared statistic (10.4.1) Residual plots (11.2) Influential observations (11.3,
Topic 3: Regression.
Independent Sample T-test Often used with experimental designs N subjects are randomly assigned to two groups (Control * Treatment). After treatment, the.
Today Concepts underlying inferential statistics
Independent Sample T-test Classical design used in psychology/medicine N subjects are randomly assigned to two groups (Control * Treatment). After treatment,
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Richard M. Jacobs, OSA, Ph.D.
(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.
The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.
Inference for regression - Simple linear regression
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 – Multiple comparisons, non-normality, outliers Marshall.
Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
CDNA Microarrays MB206.
Significance analysis of microarrays (SAM) SAM can be used to pick out significant genes based on differential expression between sets of samples. Currently.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.
Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
Introduction to Statistics Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
The Analysis of Microarray data using Mixed Models David Baird Peter Johnstone & Theresa Wilson AgResearch.
Lecture 8 Simple Linear Regression (cont.). Section Objectives: Statistical model for linear regression Data for simple linear regression Estimation.
Maximum Likelihood Estimation Methods of Economic Investigation Lecture 17.
Correlation Assume you have two measurements, x and y, on a set of objects, and would like to know if x and y are related. If they are directly related,
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
Statistical Inference for the Mean Objectives: (Chapter 9, DeCoursey) -To understand the terms: Null Hypothesis, Rejection Region, and Type I and II errors.
Statistics for Differential Expression Naomi Altman Oct. 06.
Single-Factor Studies KNNL – Chapter 16. Single-Factor Models Independent Variable can be qualitative or quantitative If Quantitative, we typically assume.
Analysis of Affy 1.0 ST Gene Array Data in R To analyze Affymetrix 1.0 ST data (exon or gene) you need: Expression data in.CEL format A CDF (chip definition.
CSIRO Insert presentation title, do not remove CSIRO from start of footer Experimental Design Why design? removal of technical variance Optimizing your.
Microarray analysis Quantitation of Gene Expression Expression Data to Networks BIO520 BioinformaticsJim Lund Reading: Ch 16.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Introducing Communication Research 2e © 2014 SAGE Publications Chapter Seven Generalizing From Research Results: Inferential Statistics.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-2 Correlation 10-3 Regression.
1 Estimation of Gene-Specific Variance 2/17/2011 Copyright © 2011 Dan Nettleton.
Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University Plate Effects in cDNA Microarray Data.
Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Jump to first page Inferring Sample Findings to the Population and Testing for Differences.
THE SCIENTIFIC METHOD: It’s the method you use to study a question scientifically.
Statistical Inference for the Mean Objectives: (Chapter 8&9, DeCoursey) -To understand the terms variance and standard error of a sample mean, Null Hypothesis,
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
DNA Microarray. Microarray Printing 96-well-plate (PCR Products) 384-well print-plate Microarray.
CHAPTER 15: THE NUTS AND BOLTS OF USING STATISTICS.
Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.
Computational Diagnostics
Getting the numbers comparable
Inferential Statistics
Normalization for cDNA Microarray Data
Presentation transcript:

Making Sense of Complicated Microarray Data Data normalization and transformation Prashanth Vishwanath Boston University

Dealing with Data Before any pattern analysis can be done, one must first normalize and filter the data. Normalization facilitates comparisons between arrays. Filtering transformations can eliminate questionable data and reduce complexity.

Why Normalize Data? Goal : Measure ratios of gene expression levels. Ratio = Ti/Ci. Ratio of measured treatment intensity to control intensity for the ith spot In self/self experiments, ratio = 1. But this is never true Imbalances can be caused by Different incorporation of dyes Different amounts of mRNA Different scanning parameters etc. Normalization balances red and green intensities. Reduce intensity-dependent effects Remove non-linear effects

The starting point - Using the log2 ratio The log2 ratio treats up and down regulated genes equally. e.g. when looking for genes with more than 2 fold variation in expression Differential expression ratio is log ratio is over-expression  2  1 neutral  1  0 under-expression  0.5  -1

The MA or Ratio-Intensity plots Differential expression log ratio M = log2(R / G) Log intensity A = log10(R*G) / 2

Common problems diagnosed using RI plots Low end variation Saturation High end variation Curvature at Low intensity Large curvature Heterogeneity

Different Normalization techniques Total Intensity Iterative log(mean) centering Linear Regression Lowess correction ……..and others Dataset for normalization Entire data set (all genes) User defined dataset/ controls (e.g. housekeeping genes)

Total Intensity Normalization: mean or median Assumption: equal quantities of RNA in samples that are being compared Normalized expression ratio adjusts each ratio such that mean ratio is 1 log2(ratio’i ) = log2(ratioi )- log2(Ntotal) A similar scaling factor could be used to normalize intensities across all arrays

Normalization - Lowess Lowess detects systematic bias in the data Intensity-dependent structure Different ways to fit a lowess curve Local linear regression Splines

Global vs. local normalization Most normalization algorithms can be applied globally or locally Advantages of local normalization Correct systematic spatial variation Inconsistencies in the spotting pen Variability in slide surface Local differences in hybridization conditions An example of a local set may be each group of array elements spotted by a single spotting pen.

Spatial Lowess Before After spatial Lowess

Printing Variability

Normalization - print-tip-group Assumption: For every print group, changes roughly symmetric at all intensities. Before After

Variance regularization Stochastic processes can cause the variance of the measured log2(ratio) values to differ Adjust ratios such that variance is the same (using a single factor ai for a subarray in a chip to scale all values in that subarray) Scaling factor = variancesubarray variance of all subarrays A similar measure can be used to regularize variances between arrays

Averaging over replicates Result is equivalent to taking the geometric mean ratio = (Ti1*Ti2*Ti3)1/3 (Ci1*Ci2*Ci3)1/3 Combining intensity data Ratio of the geometric means of the channel intensities Dye swaps – helps in reducing dye effects

Pro-platelet basic protein Number of replicates Integrin alpha 2b Pro-platelet basic protein

I have normalized the data – what next What was I asking? Typically: “which genes changed expression patterns when I did ____” Typical contexts Binary conditions: knock out, treatment, etc Unordered discrete scales: multiple types of treatment or mutations, tissue types etc Continuous scales: time courses, levels of treatment, etc Analysis methods vary with question of interest

Common questions Which genes changed expression patterns? Statistical tests Which genes can be used to classify or predict the diagnostic category of the sample? Machine-learning class prediction methods e.g. Support vector machines (References) Which genes behave similarly over time when exposed to treatment Cluster analysis (Gabriel Eichler)

Selecting a subset of genes Question - which genes are (most) differentially expressed? Common Methods Fold change in expression after combining replicates Genes that have fold changes more than two standard deviations from the mean or pass the Z-score test Statistical tests Diagnostic experiments (two treatments) t-test A t-test compares the means of the control and treatment groups Multiple treatments – ANOVA F test test the equality of three or more means at one time by using variances.

Significance – Z-scores Intensity dependent Z-scores to identify differential Expression Z < 1 1<Z<2 Z > 2 log2(T/C) The uncertainty in measurements increases as intensity decreases Measurements close to the detection limit are the most uncertain Fold-change measurements ignore these effects We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data: log10(T*C) where µ - mean log(ratio)  - standard deviation Z = (log2(T/C) - µ) 

Statistical tests Control Treatment Exp 1 Exp 2 Exp 5 Exp 3 Exp 4 Exp 6 Mean 1 Mean 2 These means are definitely significantly different Control Treatment Mean 1 Mean 2 Less than a 0.05 % chance that the sample with mean s came from population 1, i.e., s is significantly different from “mean 1” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2. Control Treatment Sample gene, mean “s”

Statistical tests – permutation tests Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution. But expression measurements? Probably not. Permutation tests can be used to get around the violation of the normality assumption For each gene, calculate the t or F statistic Randomly shuffle the values of the gene between groups A and B,such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Gene 1 Group A Group B Original grouping Group A Group B Exp 2 Exp 3 Exp 6 Exp 4 Exp 5 Exp 1 Randomized grouping Gene 1

Statistical tests – permutation tests Compute t or F-statistic for the randomized gene Repeat randomization n times Let x be the number of times t or F exceeds the absolute values of the randomized statistic for n randomizations. Then, the p-value associated with the gene = 1 – (x/n) The p-value of an event is a measure of the likelihood of its occurring. The lower the p-value the better If the calculated p-value for a gene is less than or equal to the critical p-value, the gene is considered significant. But it may not be that simple………..

The problem of multiple testing Let’s imagine there are 10,000 genes on a chip, AND None of them is differentially expressed. Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.01. (adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)

The problem of multiple testing We are testing 10,000 genes, not just one!!! Even though none of the genes is differentially expressed, about 1% of the genes (i.e., 100 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.01 If only one gene were being studied, a 1% margin of error might not be a big deal, but 100 false conclusions in one study? That doesn’t sound too good.

The problem of multiple testing There are “tricks” we can use to reduce the severity of this problem. Slash p-value for each gene - each gene will be evaluated at a lower p-value. False Discovery Rate (FDR)- proportion of genes likely to have been wrongly identified by chance as being significant

Statistical Tests - Conclusions Don’t get too hung up on p-values. P-values should help you evaluate the strength of the evidence. P-values are not an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance. Results from statistical tests need to be verified either in the lab (Quantitative RT-PCRs) or by comparisons to previous studies.

Treatment 1: Ratio of treatment to control Time series Analysis Consider an experiment with 4 timepoints for each treatment. Treatment 1: Ratio of treatment to control Genes Timepoint 1: Time 0 Timepoint 2: 30 min Timepoint 3: 1 h Timepoint 4 : 2h   Replicates 1 2 3 Gene 1 1.2 1.5 0.8 2.7 1.3 0.6 0.9 0.3 0.4 Gene 2 - 1.05 1.7 1.4 1.8 1.6 Gene 3 0.5 3.7 4.3 4.1 2.8 2.3 1.9 Gene 4 0.65 0.7 0.93 0.2 0.15 0.32 0.76 0.87 Gene 5 0.98 0.54 0.85 Gene 6 1.67 1.1 7.32 6.43 10.54 4.54 4.32 5.3 2.31 2.12 Gene … Ratio at Time 0 should be 1 in perfect circumstances

Time Series Analysis : Genes of interest Treatment effects Time effects on reference or control Interaction between treatment and time Identifying genes that are co-expressed. Prediction of function. Identifying a set of co-regulated genes Identifying regulatory modules.

Time series Analysis - transforming data The time series profile has both behavior and amplitude The time series profile has only behavior information

What’s next (by Gabriel Eichler) What I have covered Various normalization techniques Replicate filtering Various statistical methods for selecting differentially expressed genes. Time series data transformations What’s next (by Gabriel Eichler) Clustering algorithms Concatenating time profiles from various treatments – GEDI Other tools Supervised machine learning algorithms (suggested papers for reading) Post clustering analysis (introduced in part this morning) GO terms Analysis of promoter elements for transcription factor binding sites

Analysis of replicates- replicate trim The lowess-adjusted log2(R/G) values for two independent replicates are plotted against each other element by element. Outliers in the original data (in red) are excluded from the remainder of the data (blue) selected on the basis of a two-standard-deviation cut on the replicates. Log2(T/C) array 2 For a multi-slide experiment Cy5/Cy3 ratios should be the same for all two replicates filter out genes whose log2(ratio1/ratio2) deviates greatly from this expected value of zero. calculate the mean and SD of this log2(ratio1/ratio2) for each pair of replicates in the entire array, and eliminate pairs of elements whose log2(ratio1/ratio2) is greater than 2 SD from the mean. Repeat the process for all pairs of replicates Log2(T/C) array 1