# Statistical Analysis of Microarray Data

## Presentation on theme: "Statistical Analysis of Microarray Data"— Presentation transcript:

Statistical Analysis of Microarray Data
By H. Bjørn Nielsen & Hanne Jarmer

The DNA Array Analysis Pipeline
Question Experimental Design Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Normalization Expression Index Calculation Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

What's the question? without alcohol with alcohol
Typically we want to identify differentially expressed genes Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media alcohol dehydrogenase without alcohol with alcohol

Statistics However, the measurements contain stochastic noise
There is no way around it He’s going to say it Statistics

You can choose to think of statistics as a black box
Noisy measurements p-value statistics But, you still need to understand how to interpret the results

The output of the statistics
P-value The chance of rejecting the null hypothesis by coincidence For gene expression analysis we can say: the chance that a gene is categorized as differentially expressed by coincidence

The statistics gives us a p-value for each gene
We can rank the genes according to the p-value But, we can’t really trust the p-value in a strict statistical way!

Why not! For two reasons:
We are rarely fulfilling all the assumptions of the statistical test We have to take multi-testing into account

The t-test Assumptions
1. The observations in the two categories must be independent 2. The observations should be normally distributed 3. The sample size must be ‘large’ (>30 replicates)

Multi-testing? In a typical microarray analysis we test thousands of genes If we use a significance level of 0.05 and we test 1000 genes. We expect 50 genes to be significant by chance 1000 x 0.05 = 50

Correction for multiple testing
Bonferroni: P ≤ 0.01 N Confidence level of 99% Benjamini-Hochberg: P ≤ i N 0.01 N = number of genes i = number of accepted genes

But really, those methods are too strict
What we can trust is the ability of the statistical test to rank the genes according to their reliability The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff If we permute the samples we can get an estimate of the False Discovery Rate (FDR) in this set

Volcano Plot log2 fold change (M) P-value

What's inside the black box ‘statistics’
t-test or ANOVA

The t-test Calculate T Lookup T in a table

The t-test II The t-test tests for difference in means () wt wt
mut mutant Intensity of gene x Density

The t-test III The t statistic is based on the sample mean and variance t

ANOVA ANalysis Of Variance
Very similar to the t-test, but can test multiple categories Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2 Advantage: it has more ‘power’ than the t-test

ANOVA II Density Intensity Variance between groups
Variance within groups Density Intensity

Blocks and paired tests
Some undesired factors may influence the experiments, the effect of such can be greatly reduced if they are blocked out or if the experiment is paired. Some possible blocks: - Dye - Patient - Technician - Batch - Day of experiment - Array

Paired t-test Block 1 Block 2 Block 3 Hypothesis:
There is no difference between the mean BLUE and RED Block 1 Block 2 Block 3

An example: (2-way ANOVA with blocks)
Experimental Design: A CreA Glucose Ethanol Glucose Ethanol 3x Everything was done in batches to capture the systematic noise

Example: Batch to batch variation
Within batch variation is lower than the between batch variation

Example: Data analysis - Blocking
We can capture the batch variation by blocking Two-way ANOVA Glucose Ethanol A CreA Effect 1 Effect 2 Batch A B C

Ex: Result of the 2-way ANOVA (3 p-values)
A genotype p-value A growth media p-value An interaction p-value Two-way ANOVA Ethanol Glucose A CreA Media effect Genotype effect Batch A B C

Conclusion Array data contains stochastic noise
Therefore statistics is needed to conclude on differential expression We can’t really trust the p-value But the statistics can rank genes The capacity/needs of downstream processes can be used to set cutoff FDR can be estimated t-test is used for two category tests ANOVA is used for multiple categories