# Linear Models for Microarray Data


LIMMA: Linear Models for Microarray Data

Difficulties with microarray data
- Variability of the expression values differs between genes
- Non-identical and dependent distributions between genes
- Multiple testing of tens of thousands of genes

Correct for multiple comparisons
- Multiple testing corrections: family-wise error rate (FWER), false discovery rate (FDR), etc.
- The parallel nature of the inference allows for compensating possibilities
- Information is borrowed from the ensemble of genes to assist inference about individual genes
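The two error rates named above can be illustrated with a minimal sketch. It assumes the Bonferroni correction for the family-wise error rate and the Benjamini-Hochberg procedure for the false discovery rate; the slide names neither method, and the p-values are made up for illustration.

```python
def bonferroni(p_values):
    """Family-wise error rate control: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """False discovery rate control: BH adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(raw)
fdr_adjusted = benjamini_hochberg(raw)
print("Bonferroni:", fwer_adjusted)
print("Benjamini-Hochberg:", fdr_adjusted)
```

The FDR-adjusted values are never larger than the Bonferroni ones, which is why FDR control is usually preferred when tens of thousands of genes are tested.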

Empirical Bayes
- Frequentist methods: a hypothesis is typically rejected or not rejected without directly assigning it a probability
- Bayesian methods: a prior probability is specified, which is then updated in the light of new data
- In Bayesian techniques, the prior distribution is assigned independently of the data and fixed before any data are observed

Empirical Bayes
- Superficially similar to Bayesian methods in that a prior distribution is assigned
- However, the prior distribution is estimated from the data
- Therefore Empirical Bayes is a frequentist technique

LIMMA
- Empirical Bayes techniques have previously been applied to microarray data
- Those analyses were specific to the experiment and very difficult to implement
- LIMMA: a simple model with a simple expression for the posterior odds
- Allows linear modelling to be applied to microarray data
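As a sketch of what the simple model looks like, the formulas below give the moderated variance and moderated t-statistic in the standard limma formulation. The notation is an assumption, since the slides define no symbols: $s_g^2$ is the residual variance of gene $g$ with $d_g$ degrees of freedom, $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all genes, $\hat{\beta}_{gj}$ is the estimated coefficient or contrast $j$, and $v_{gj}$ is its unscaled variance.

```latex
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},
\qquad
\tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g \sqrt{v_{gj}}}
```

Under the null hypothesis the moderated statistic follows a t-distribution with $d_0 + d_g$ degrees of freedom, so each gene gains degrees of freedom from the information borrowed across the ensemble.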

Estrogen Data
- 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays
- Factors: estrogen (presence/absence) and length of exposure (10hr/48hr)
- The aim of the study is to identify genes that respond to estrogen treatment

Read in the Data
- Load in the estrogen data
- Normalise the data
- Define the targets (factors) for the linear model

Design Matrix

| Array | File | Estrogen | Time (hr) |
|-------|------|----------|-----------|
| 1 | low10-1.cel | absent | 10 |
| 2 | low10-2.cel | absent | 10 |
| 3 | high10-1.cel | present | 10 |
| 4 | high10-2.cel | present | 10 |
| 5 | low48-1.cel | absent | 48 |
| 6 | low48-2.cel | absent | 48 |
| 7 | high48-1.cel | present | 48 |
| 8 | high48-2.cel | present | 48 |

- Eight arrays
- Four pairs of replicates
- Four parameters in the linear model
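A design matrix for these eight arrays can be sketched as follows. The sketch assumes a group-means parameterisation, with one column per estrogen/time combination; the slide states only that there are four parameters, so the column names and their order are illustrative choices.

```python
# Targets as listed on the slide: (file, estrogen, time in hours).
targets = [
    ("low10-1.cel", "absent", 10),
    ("low10-2.cel", "absent", 10),
    ("high10-1.cel", "present", 10),
    ("high10-2.cel", "present", 10),
    ("low48-1.cel", "absent", 48),
    ("low48-2.cel", "absent", 48),
    ("high48-1.cel", "present", 48),
    ("high48-2.cel", "present", 48),
]

# Assumed parameterisation: one mean per treatment group (hypothetical names).
groups = ["absent10", "present10", "absent48", "present48"]


def design_matrix(targets, groups):
    """Return one 0/1 indicator row per array, one column per group."""
    rows = []
    for _file, estrogen, hours in targets:
        label = f"{estrogen}{hours}"
        rows.append([1 if label == group else 0 for group in groups])
    return rows


design = design_matrix(targets, groups)
for (name, _estrogen, _hours), row in zip(targets, design):
    print(f"{name:<14}{row}")
```

Each row contains a single 1 and each column sums to 2, reflecting the four pairs of replicates.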

Contrast Matrix
- Estrogen effect at 10 hours
- Time effect without estrogen
- Estrogen effect at 48 hours
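The three contrasts can be written as differences between group means. The sketch below assumes one coefficient per estrogen/time group (a group-means parameterisation, with hypothetical group names) and uses made-up log2 means for a single gene.

```python
# Assumed coefficient order: one mean per treatment group (hypothetical names).
groups = ["absent10", "present10", "absent48", "present48"]

# Each contrast is a set of weights on the group means.
contrasts = {
    "estrogen_at_10h": {"present10": 1, "absent10": -1},
    "time_without_estrogen": {"absent48": 1, "absent10": -1},
    "estrogen_at_48h": {"present48": 1, "absent48": -1},
}


def contrast_matrix(groups, contrasts):
    """Rows follow the groups, columns follow the contrasts."""
    return [[weights.get(group, 0) for weights in contrasts.values()]
            for group in groups]


def apply_contrasts(coefficients, matrix):
    """Combine per-group coefficients into one estimate per contrast."""
    n_contrasts = len(matrix[0])
    return [sum(coefficients[i] * matrix[i][j] for i in range(len(coefficients)))
            for j in range(n_contrasts)]


matrix = contrast_matrix(groups, contrasts)

# Hypothetical log2 group means for one gene, in the order of `groups`.
means = [7.0, 9.5, 7.2, 10.0]
effects = apply_contrasts(means, matrix)
for name, effect in zip(contrasts, effects):
    print(f"{name}: {effect:.2f}")
```

Every column of the contrast matrix sums to zero, so each contrast compares groups rather than measuring overall expression level.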

Differential Expression
- Extract linear model fit for contrasts
- Obtain list of differentially expressed genes for contrasts
- Look for overlap among differentially expressed genes

Linear Model Fit
- logFC: estimate of the log2 fold change corresponding to the effect or contrast
- AveExpr: average log2 expression for the probe over all arrays/channels
- t: moderated t-statistic
- P.Value: raw p-value
- adj.P.Val: p-value adjusted for multiple testing
- B: log odds that the gene is differentially expressed
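Because B is a log odds, it can be converted to a probability of differential expression. The helper below is a small sketch with a hypothetical function name; it assumes the log odds are on the natural-log scale, as in limma.

```python
import math


def log_odds_to_probability(b):
    """Convert a natural-log odds B into a probability."""
    return 1.0 / (1.0 + math.exp(-b))


# B = 0 means even odds; larger B means stronger evidence of differential expression.
for b in (-2.0, 0.0, 1.5, 4.0):
    print(f"B = {b:>4}: probability = {log_odds_to_probability(b):.3f}")
```

A B-statistic of 0 therefore corresponds to a 50-50 chance that the gene is differentially expressed, under the model's prior assumptions.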

Annotating Data
- Probe arrays can be annotated with external data
- Multiple sources of gene annotations

Gene Set Enrichment
- All biochemical pathways are determined by sets of genes
- Gene sets are determined by prior biological knowledge relating to co-expression, function, location or known biochemical pathways
- If a pathway is in any way related to a biological trait, then the co-functioning genes should display a higher degree of enrichment compared to the rest of the transcriptome
- Gene Set Enrichment (GSE) is a computational technique which determines whether an a priori defined set of genes shows statistically significant overlap
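One simple way to judge whether an overlap is statistically significant is a one-sided hypergeometric (over-representation) test. This is a sketch of that swapped-in technique, not necessarily the enrichment method used in the presentation; the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_universe: genes measured on the array
    n_set:      genes in the predefined gene set
    n_selected: genes called differentially expressed
    n_overlap:  differentially expressed genes that are also in the set
    """
    total = comb(n_universe, n_selected)
    largest_possible = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return tail / total


# Hypothetical example: 10 genes measured, 4 in the gene set,
# 3 called differentially expressed, all 3 of them in the set.
p_value = enrichment_p_value(n_universe=10, n_set=4, n_selected=3, n_overlap=3)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value indicates that the gene set contains more differentially expressed genes than random selection would be expected to produce.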

Estrogen receptor (ER) gene set
- If estrogen is present, ER genes will bind the estrogen and become activated
- They gain the ability to regulate gene expression, resulting in differential expression between the cells with and without estrogen
- This should lead to up-regulation of ER genes