Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland.



Overview
- Scales for analysis
- Systematic errors
- Sample outliers and experimental consistency
- Useful graphics
- Implications for experimental design
- Platform consistency
- Individual differences

Distribution of Signals
- Most genes are expressed at very low levels.
- Even after a log transform, the distribution is skewed.
- NB: the signal-to-abundance ratio is NOT the same for different genes on the chip.

Explanation of Distribution Shape
- The steep, bell-shaped left-hand side of the distribution is probably due to measurement noise.
- The underlying real distribution is probably even steeper.
- [Figure: abundances + noise = observed values]

Variation Between Chips
- Technical variation: differences between measures of transcript abundance in the same samples.
  - Causes: sample preparation, slide, hybridization, measurement.
- Individual variation: variation between samples or individuals.
  - Healthy individuals really do have consistently different levels of gene expression!

Replicates in True Scale
- Signals vary more between replicates at the high end.
- The level of 'noise' increases with signal.
- [Figure: comparison of chips (Affy), chip 1 vs chip 2]
- [Figure: standard deviation (SD) as a function of mean signal across all chips; the red line is a lowess fit]

Replicates on Log Scale
- The log scale measures fold-change identically across genes.
- After the log transform, noise is higher at the lower end.
- [Figure: chip 1 vs chip 2 after log transform]
- [Figure: SD vs signal after log transform]
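The contrast between the true scale and the log scale can be illustrated with a small simulation. This is only a sketch under an assumed noise model (multiplicative error plus additive error); the function names, noise levels, and the clipping floor are hypothetical choices, not values from the slides.

```python
import math
import random
import statistics


def simulate_replicates(true_signal, n_chips, rng, mult_sd=0.15, add_sd=20.0):
    """Simulate replicate measurements of one gene.

    Assumed (hypothetical) model: true signal times a log-normal
    multiplicative error, plus a normal additive error.
    """
    return [
        true_signal * math.exp(rng.gauss(0.0, mult_sd)) + rng.gauss(0.0, add_sd)
        for _ in range(n_chips)
    ]


def sd_raw_and_log(true_signal, n_chips=2000, seed=0, floor=1.0):
    """Return (SD on the raw scale, SD on the log2 scale) across replicates."""
    rng = random.Random(seed)
    values = simulate_replicates(true_signal, n_chips, rng)
    raw_sd = statistics.stdev(values)
    # Clip at a small positive floor so the log is defined for noisy low signals.
    log_values = [math.log2(max(v, floor)) for v in values]
    log_sd = statistics.stdev(log_values)
    return raw_sd, log_sd


low_raw, low_log = sd_raw_and_log(50.0)      # a dim gene
high_raw, high_log = sd_raw_and_log(5000.0)  # a bright gene

print("raw-scale SD:  low signal %.1f, high signal %.1f" % (low_raw, high_raw))
print("log2-scale SD: low signal %.3f, high signal %.3f" % (low_log, high_log))
```

Under this assumed model the raw-scale SD grows with signal, while on the log scale the dim gene is the noisier one, which matches the pattern the two slides describe.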

Ratio-Intensity (R-I) plots
- The log scale makes it convenient to represent fold-changes up or down symmetrically.
- R = log(Red/Green); I = (1/2) log(Red*Green)
- Also known as MA (minus, add) plots.
- [Figure: (log) ratio plotted against (log) intensity]
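The R and I coordinates can be computed per spot as in the sketch below. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name is hypothetical.

```python
import math


def ratio_intensity(red, green):
    """Return (R, I) for one two-colour spot, on an assumed log2 scale.

    R = log2(red / green)        (the "minus" axis, M)
    I = 0.5 * log2(red * green)  (the "add" axis, A)
    """
    if red <= 0 or green <= 0:
        raise ValueError("signals must be positive to take logs")
    r = math.log2(red / green)
    i = 0.5 * math.log2(red * green)
    return r, i


# A four-fold change up and a four-fold change down are symmetric in R
# and share the same I.
up = ratio_intensity(400.0, 100.0)
down = ratio_intensity(100.0, 400.0)
print(up, down)
```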

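Variance Stabilization
- Simple power transforms (Box-Cox) often nearly stabilize variance.
- Durbin and Huber derived a variance-stabilizing transform from a theoretical model: $y = \alpha + m e^{\eta} + \varepsilon$, where $\alpha$ is the background, $e^{\eta}$ is the multiplicative error and $\varepsilon$ is the static error.
- $m$ is the true signal; $\eta$ and $\varepsilon$ have $N(0, \sigma_\eta^2)$ and $N(0, \sigma_\varepsilon^2)$ distributions.
- Transform: [formula not preserved in this transcript]
- One could estimate $\alpha$ (background), $\sigma_\eta$ and $\sigma_\varepsilon$ empirically.
- In practice, the best effect on variance often comes from parameters different from the empirical estimates.
- Huber's version is harder to estimate.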
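For reference, the transform usually quoted for this kind of two-component error model is the generalized logarithm. The notation below is an assumption, since the slide's own formula and symbols did not survive: $\alpha$ is background, $m$ the true signal, $\eta$ the multiplicative error and $\varepsilon$ the additive error. It should be read as a sketch of the standard form, not as the exact expression shown on the slide.

```latex
% Two-component error model (symbols assumed as described above)
\[
  y = \alpha + m\,e^{\eta} + \varepsilon, \qquad
  \eta \sim N(0, \sigma_\eta^2), \quad
  \varepsilon \sim N(0, \sigma_\varepsilon^2)
\]

% Generalized-log (variance-stabilizing) transform
\[
  h(y) = \ln\!\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + c} \right), \qquad
  c = \frac{\sigma_\varepsilon^2}{S_\eta^2}, \quad
  S_\eta^2 = e^{\sigma_\eta^2}\left(e^{\sigma_\eta^2} - 1\right)
\]

% The same transform, up to an additive constant, written with arsinh
\[
  h(y) = \operatorname{arsinh}\!\left(\frac{y - \alpha}{\sqrt{c}}\right) + \ln\sqrt{c},
  \qquad
  \operatorname{arsinh}(x) = \ln\!\left(x + \sqrt{x^2 + 1}\right)
\]
```

For large signals $h(y)$ behaves like $\ln(y - \alpha)$ plus a constant, and near background it is approximately linear, which is why it tames the noise at both ends.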

Box-Cox Transforms
- Simple power transformations (including log as the limiting case), e.g. cube root.
- Often work almost as well as a variance-stabilizing transform.
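A minimal sketch of the Box-Cox family follows, using the standard definition (x^lambda - 1) / lambda, with the natural log at lambda = 0. The function name is hypothetical, and the choice of lambda is left to the analyst.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of a positive value x.

    lam = 1/3 corresponds to a (shifted, scaled) cube root;
    lam = 0 is defined as the natural log, the limit as lam -> 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Cube-root member of the family, and the log as the limiting case.
cube_root_value = box_cox(8.0, 1.0 / 3.0)
near_log_value = box_cox(8.0, 1e-8)
print(cube_root_value, near_log_value, math.log(8.0))
```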

Should you use Transforms?
- Transforms change the list of genes that appear to be differentially regulated.
- The common argument for transforming is that bright genes have higher variability.
- However, you aren't comparing different genes with each other; each gene is compared with itself across samples.
- The log transform expands the variability of repressed genes.
- Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers).
- Weak transforms are better suited to situations where small changes are of interest (e.g. neurobiology).

Graphical methods
- Aims:
  - Exploratory analysis, to see natural groupings and to detect outliers.
  - To identify combinations of features that usefully characterize samples or genes.
  - Not really suitable for quantitative measures of confidence.
- Principal Components Analysis (PCA): the standard procedure for finding combinations of features with the greatest variance.
- Multi-dimensional scaling (MDS): represents distances between samples as distances in a two- or three-dimensional plot; easy to visualize.
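The idea behind PCA, finding the combination of features with the greatest variance, can be sketched in pure Python. This illustration finds only the first principal component, using power iteration; the function name, the starting vector, and the toy data are hypothetical, and a real analysis would normally rely on a linear algebra library rather than this loop.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component.

    rows: list of samples, each a list of feature (gene) values.
    Uses power iteration on X^T X, where X is the column-centred data.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Arbitrary non-uniform start; it must not be orthogonal to the answer.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]  # X v
        w = [sum(centred[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in this direction
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Toy data: 4 samples x 3 genes; genes 1 and 2 vary together, gene 3 barely varies.
toy = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [3.0, 3.0, 0.0], [4.0, 4.0, 0.1]]
direction, scores = first_principal_component(toy)
print(direction, scores)
```

Plotting the scores of the first two or three components against each other gives the kind of sample map that is used to spot groupings and outliers.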

MDS Plots

Representing Groups
- Cluster diagram
- Multi-dimensional scaling
- [Figure label: Day 1 chips]

Different Metrics – Same Scale
- 8 tumor and 2 normal tissue samples.
- Distances are similar in each tree.
- The normals are close together.
- Tree topologies appear different.
- Take with a grain of salt!

Volcano Plot
- Displays both biological importance and statistical significance.
- [Figure axes: log2(fold change) against log2(p-value) or t-score]
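The two coordinates of a volcano plot can be computed per gene as sketched below, using the t-score option for the vertical axis. The sketch assumes the inputs are already log2-transformed signals and uses Welch's unequal-variance t-statistic; the function names and example values are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two groups of measurements."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    diff = statistics.fmean(a) - statistics.fmean(b)
    if se == 0.0:
        # Degenerate case: no within-group variation at all.
        return math.copysign(math.inf, diff) if diff != 0.0 else 0.0
    return diff / se


def volcano_point(group1_log2, group2_log2):
    """Return (log2 fold change, t-score) for one gene.

    On the log2 scale the fold change is a difference of group means.
    """
    log2_fold_change = statistics.fmean(group1_log2) - statistics.fmean(group2_log2)
    return log2_fold_change, welch_t(group1_log2, group2_log2)


# Hypothetical gene: a four-fold (2 log2 units) difference between groups.
x, y = volcano_point([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
print(x, y)
```

Genes far from zero on the horizontal axis are biologically large changes; genes far from zero on the vertical axis are statistically well supported. The interesting ones are far out on both.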

Quantile Plot
- Plot sample t-scores against the t-scores expected under the random (null) hypothesis.
- Statistically significant genes stand out.
- [Figure axes: sample t-scores against corresponding quantiles of the t-distribution]

Systematic Variation
- Intensity-dependent dye bias, due to 'quenching'.
- Stringency (specificity) of hybridization, due to the ionic strength of the hyb solution.
- How far the hybridization reaction progresses, due to variation in mixing efficiency.
- Spatial variation in all of the above.

Relevance for Experimental Designs
- Balanced designs with several replicates built in have smaller standard errors than a reference design with the same number of chips (Kerr & Churchill).
- Assuming error is random!
- In practice it is very hard to deal with systematic errors in a symmetric design: no two slides have comparable fold-changes.
- [Figure: design diagram linking Sample 1, Sample 2, Sample 3, Sample 4 and Sample 5]

Critique of Optimal Designs
- Optimal for reduction of variance, if:
  - All chips are good quality.
  - There are no systematic errors, only random noise.
- In fact, systematic error is almost as great as random noise in many microarray experiments.
- With loop designs, single chip failures cause more loss of information than with reference designs.

Individual Variation
- Numerous genes show high levels of inter-individual variation.
- The level of variation also depends on the tissue.
- Donors or experimental animals may be infected or under social stress.
- Tissues are hypoxic or ischemic for variable times before freezing.

Frequent False Positives
- Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples.
- Permutation p-values will be insignificant, even if the t-score appears large.
- [Figure: frequency of gene levels in Group 1 and Group 2]
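A permutation p-value of the kind mentioned above can be computed exactly for small groups by enumerating every relabelling of the samples. The sketch below assumes Welch's t-statistic as the test statistic and a two-sided test; the function names and the example gene (one sample much higher than the rest, on a log2 scale) are hypothetical.

```python
import itertools
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two groups of measurements."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    diff = statistics.fmean(a) - statistics.fmean(b)
    if se == 0.0:
        return math.copysign(math.inf, diff) if diff != 0.0 else 0.0
    return diff / se


def permutation_p_value(group1, group2):
    """Exact two-sided permutation p-value over all relabellings of the samples."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(welch_t(group1, group2))
    hits = 0
    total = 0
    for idx in itertools.combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabellings at least as extreme as the observed labelling.
        if abs(welch_t(a, b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical gene with one outlying sample in group 1 (log2 units).
outlier_gene_p = permutation_p_value([6.1, 5.9, 6.0, 9.0], [6.0, 6.2, 5.8, 6.1])
# Perfectly separated groups of three: the smallest p-value this design allows.
separated_p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(outlier_gene_p, separated_p)
```

With three samples per group there are only 20 relabellings, so the two-sided permutation p-value can never fall below 2/20 = 0.1, however large the t-score looks.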