1 Canadian Bioinformatics Workshops

2 Module #: Title of Module

3 Module 2 From Pre-Processing to Gene-Lists
Paul Boutros, Microarray Data Analysis, June 4–5, 2012

4 Pre-Processing What exactly is pre-processing (aka normalization)?
Why do we do it?

5 Sources of Technical Noise
Where does technical noise come from?

6 More Sources of Technical Noise

7 Any step in the experimental pipeline can introduce artifactual noise
Array design
Array manufacturing
Sample quality
Sample identity → sequence effects?
Sample processing
Hybridization conditions → ozone?
Scanner settings
Pre-Processing tries to remove these systematic effects

8 Affymetrix Pre-Processing Steps
Background Correction
Normalization
Probe-Specific Adjustment
Summarization of multiple Probes into a single ProbeSet
Let's look at two common approaches.

9 Approach #1: MAS5
Affymetrix put significant effort into developing good data pre-processing approaches. MAS5 was an attempt to develop a "standard" technique for 3' expression arrays. The flaws of MAS5 led to an influx of research in this area. The algorithm is best described in an Affymetrix white paper, and is actually quite challenging to reproduce exactly in R.
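Bioconductor's affy package provides a close approximation; a minimal sketch, assuming your .CEL files sit in the working directory:

```r
## A minimal sketch of MAS5 pre-processing via Bioconductor's affy
## package, which follows the Affymetrix white paper closely (though,
## as noted above, not always exactly).
library(affy)

abatch <- ReadAffy()        # reads all .CEL files in the working directory
eset   <- mas5(abatch)      # background correction, scaling, Tukey-biweight summarization
calls  <- mas5calls(abatch) # Present/Marginal/Absent detection calls
head(exprs(eset))           # ProbeSet-level expression values
```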

10 MAS5 Model Observations = True Signal + Random Noise + Probe Effects
Assumptions?

11–14 [Image-only slides; no transcript text]

15 What is RMA? RMA = Robust Multi-array Average
Why do we use a "robust" method? Robust summaries improve on standard ones by down-weighting outliers while leaving their effects visible in the residuals.
Why "multi-array"? To put each chip's values in the context of a set of similar values.

16 What is RMA?
It is a log-scale linear additive model.
It assumes all the chips have the same background distribution. Why?
It does not use the mismatch-probe (MM) data from the microarray experiments. Why?

17 What is RMA?
Mismatch probes (MM) definitely carry information, about both signal and noise, but using it without adding more noise is a challenge. We should be able to improve the background correction using MM without having the noise level blow up; this is a topic of current research (GCRMA). Ignoring MM decreases accuracy but increases precision. Why does ignoring MM decrease accuracy?
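For comparison with MAS5 above, the entire RMA pipeline described on these slides is a single call in the same affy package; a hedged sketch:

```r
## A sketch of RMA pre-processing with affy. rma() uses only the PM
## probes: convolution background correction, quantile normalization,
## and median-polish summarization, all on the log2 scale.
library(affy)

abatch <- ReadAffy()
eset   <- rma(abatch)   # log2-scale ProbeSet expression values
boxplot(exprs(eset))    # distributions should now be essentially identical
```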

18 Methodology: Quantile Normalization
The goal of this method is to make the distribution of probe intensities the same for every array in a set of arrays. It is motivated by the idea that in a Q-Q plot, two data vectors have the same distribution if the plot is a straight diagonal line, and different distributions if it is anything else. The normalization distribution is chosen by averaging each quantile across chips, which results in a normalization that is probably overly conservative.
What if you want to see whether n vectors have the same distribution? Plot in n dimensions.
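The method is simple enough to sketch directly; a toy implementation on simulated data (the production version lives in Bioconductor's preprocessCore package as normalize.quantiles):

```r
## A toy quantile normalization on a matrix of log2 probe intensities
## (rows = probes, columns = arrays). Each array is forced onto the
## reference distribution obtained by averaging each quantile across arrays.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  target <- rowMeans(sorted)                          # average each quantile across arrays
  apply(ranks, 2, function(r) target[r])              # map ranks back onto the target
}

set.seed(1)
m  <- matrix(rnorm(5000 * 4, mean = rep(6:9, each = 5000)), ncol = 4)
qn <- quantile_normalize(m)
round(apply(qn, 2, quantile), 2)  # all columns now share one distribution
```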

19 Methodology
We see data from multiple arrays, each with a different distribution; after quantile normalization, the thick black curve shows all of these PM intensities normalized onto a single common distribution.

20 Methodology
Summarization: combining the multiple probe intensities of each probeset to produce expression values. An additive linear model is fit to the normalized data to obtain an expression measure for each ProbeSet on each GeneChip:
Y_ij = a_j + β_i + ε_ij

21 Methodology: Y_ij = a_j + β_i + ε_ij
Y_ij denotes the background-corrected, normalized probe value for the jth probe within the probeset on the ith GeneChip [log2(PM − BG)_ij]
a_j is the probe affinity of the jth probe; probe effects are additive on the log scale
β_i is the chip effect for the ith GeneChip (the log-scale expression level)
ε_ij is the random error term

22 Methodology: Y_ij = a_j + β_i + ε_ij
Estimate a_j (probe affinity) and β_i (chip effect) using a robust method: Tukey's median polish (quick). It fits iteratively, successively removing row and column medians and accumulating the terms until the process stabilizes; the residuals are what is left at the end.
Robust: important because of the potential for outliers in large data sets.
Exploratory: allows a "general picture" approach to statistical ideas.
Important for computational efficiency and complex structures.

23 RMA vs MAS5
RMA sacrifices accuracy for precision
RMA is generally not appropriate for clinical settings
RMA provides higher sensitivity/specificity in some tests
RMA reduces variance (critical for small-n studies)
RMA is better accepted by journals and reviewers

24 One key detail has been omitted so far:
How do we know if our pre-processing actually worked?
Arrays do not measure mRNA levels in a cell, but mRNA levels across all cells in a population; only *relative* mRNA levels; and only for some genes. So: some relative mRNA levels, averaged across a population of cells.

25 Outline
Assessing Pre-Processing Results
Univariate Statistical Approaches
ProbeSet Remapping
Consecutive basepairs

26 Can we determine how well our pre-processing worked? Or if our data looks good?

27 Let’s See Some “Bad” Data

28–30 [Image slides: the three example arrays discussed on the next slide; no transcript text]

31 Those Three Were From A Spike-In Experiment Done by Affymetrix

32–34 [Image slides: three more example arrays, discussed on the next slide; no transcript text]

35 Those Last Three Were From An Experiment We Did On Rat Liver Samples

36 Were Those Bad Samples? Lots of evident spatial artifacts
But in practice all samples were carried forward into analysis And validation (RT-PCR) confirmed the overall study results for many genes

37 Eye-Ball Assessments Are Hard
A couple of useful tricks:
Look at the distributions: did quantile normalization work (for RMA)?
Look at the inter-sample correlations: is one sample a strong outlier?
Look at the 3' → 5' trend across a ProbeSet.
I know of no accepted, systematic QA/QC methods.

38 Distributions (Raw)

39 Distributions (Normalized)

40 Inter-Sample Correlations

41 3' → 5' Signal Trend

42 What Do You Do If You Find a Bad Array?
Repeat it? Drop the sample? Include it but account for the “noise” in another way?

43 In This Case
We excluded a series of outlier samples. We believed these samples had been badly degraded because they were derived from FFPE blocks.

44 Final Distribution

45 Final Heatmap

46 Outline
Assessing Pre-Processing Results
Univariate Statistical Approaches
ProbeSet Remapping
Consecutive basepairs

47 Statistics Reminder
Very generally, statistics can be divided into two major branches:
Estimation: measures of centrality and measures of error (mean, median, std-dev)
Significance-Testing: p-values and goodness-of-fit tests (t-test, F-test, ANOVA)

48 Statistics Reminder #2: What is a P-Value?
Imagine we are playing a dice game. I roll a 5; you need to roll a 6 to win. What is the chance that you will win? 1 in 6: P = 1/6 = 0.167. The probability that you will win is 0.167, so you have a 16.7% chance of winning and I have an 83.3% chance of winning.

49 Significance Testing Questions
Are these two groups different? Do these two things synergize? Does treatment affect patient outcome?

50 Distributional Assumptions
A parametric test makes assumptions about the underlying distribution of the data. A non-parametric test makes no assumptions about the underlying distribution, but may make other assumptions!

51 Two-Sample Analyses
Also called univariate analysis. Requires two conditions, treated as a binary variable, e.g. treatment or control. Probably the most common experimental design.
Standard approaches:
T-test
Wilcoxon rank-sum test
T-test variants
Permutation tests

52 T-tests What are the assumptions of the t-test?
When would you feel comfortable using a t-test?
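A sketch of per-gene t-tests on a normalized expression matrix; the matrix and group labels below are simulated, purely for illustration:

```r
## Per-gene two-sample t-tests: rows are genes, columns are samples.
set.seed(4)
x     <- matrix(rnorm(1000 * 6), nrow = 1000)           # 1,000 genes, 6 samples
group <- factor(rep(c("control", "treated"), each = 3))

pvals <- apply(x, 1, function(g) t.test(g ~ group)$p.value)
head(pvals)
```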

53 T-Test Alternative: Wilcoxon Rank-Sum
Also called the U-test or Mann-Whitney (U) test. Some argue that for continuous microarray data there is rarely a good reason to use this test:
Low n: tests of normality are not very powerful
High n: the central limit theorem provides support
If the sample is normal, the asymptotic relative efficiency is 0.95
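The non-parametric equivalent in base R, reusing the simulated matrix and groups from the t-test sketch above:

```r
## Per-gene Wilcoxon rank-sum (Mann-Whitney U) tests.
pvals_w <- apply(x, 1, function(g) wilcox.test(g ~ group)$p.value)
head(pvals_w)
```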

54 T-Test Alternative: Moderated Statistics
A family of methods based on Bayesian statistical methodologies. Gordon Smyth's limma R package is by far the most widely used implementation of this technique. The gene-wise variance term is "shrunk" by borrowing power across all genes, which increases effective power.

55 T-Test Alternative: Permutation Tests
SAM is the classic method, though most people suggest not using SAM today. The idea is to empirically estimate the null distribution: start with many samples, randomly sample (permute the labels), and iterate.
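A minimal label-permutation test for a single gene, sketching that "randomly sample and iterate" loop (the function name perm_p is hypothetical):

```r
## Empirically estimate the null distribution of the difference in
## group means by repeatedly shuffling the sample labels.
perm_p <- function(g, group, n_perm = 10000) {
  obs  <- diff(tapply(g, group, mean))
  null <- replicate(n_perm, diff(tapply(g, sample(group), mean)))
  mean(abs(null) >= abs(obs))  # two-sided empirical p-value
}

perm_p(x[1, ], group)  # one gene from the simulated matrix above
```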

56 Problems with Significance Testing
What happens if there are NO changes? Imagine: you analyzed 1,000 clinical samples, with 20,000 genes in the genome, at P < 0.05. What if somebody comes and randomizes all your data?

57 You Had a Lot of Data
20,000 genes/array × 1,000 patients = 20,000,000 data points.
After randomization, genes are mixed up together and patients are mixed together. What happens if you analyze this data? There should be NO real hits anymore!

58 What Will You Actually Find?
Array: 20,000 genes; threshold: p < 0.05.
20,000 × 0.05 = 1,000 false positives.
This is called the "multiple testing" problem. There is a solution.

59 A "false-discovery rate adjustment" (FDR) for multiple testing considers all 20,000 p-values simultaneously. In this experiment there are lots of low p-values, so we can use this to "adjust" the p-values and find the true hits.
[Figure: p-value histogram, y-axis 0%–20%, with the expected (uniform) level marked]
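The arithmetic of the previous two slides, plus the adjustment, in a few lines of R; p.adjust with method "BH" is the standard Benjamini-Hochberg FDR correction:

```r
## Randomized data: 20,000 uniform p-values. A raw p < 0.05 cutoff
## yields ~1,000 false positives; BH-FDR adjustment considers all
## p-values simultaneously and removes essentially all of them.
set.seed(5)
p_random <- runif(20000)
sum(p_random < 0.05)                  # ~1,000 raw "hits"
sum(p.adjust(p_random, "BH") < 0.05)  # ~0 after FDR adjustment
```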

60 This is what you get from randomized data…
In this experiment, NO enrichment for low p-values, so no more hits than expected randomly.

61 Outline
Pre-Processing Matters!
Assessing Pre-Processing Results
Univariate Statistical Approaches
ProbeSet Remapping
Consecutive basepairs

62 Arrays Can Become Outdated
Gene definitions change
The reference genome sequence gets finished
Novel splice variants are found
Errors made in the initial design remain present in all arrays manufactured

63 The Mask Production Makes Affymetrix Designs Expensive To Change
Photolithographic mask

64 But… there are multiple probes per gene

65 We Can Change Those Mappings!
Hybridized Chip

66 CDF File: Chip Definition File
This file maps Probes (positions) into ProbeSets. We can update those mappings to:
Ignore deprecated or cross-hybridizing probes
Merge multiple probes that recognize the same gene
Account for entirely new genes that were not known at the time of array design
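A sketch of applying an updated mapping through affy's cdfname argument; the Brainarray-style package name below is illustrative only, so check the actual remapped CDF package for your chip:

```r
## Pre-process with a remapped Chip Definition File. The CDF package
## name is hypothetical here; install the updated CDF for your array first.
library(affy)

abatch <- ReadAffy(cdfname = "hgu133plus2hsentrezgcdf")  # remapped probe -> ProbeSet definitions
eset   <- rma(abatch)  # ProbeSets now follow the updated gene definitions
```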

67 Sequence Mappings Are Slow
This requires aligning millions of 25-bp probes against the transcriptome and identifying the best match for each. Fortunately, other groups have done this for us and regularly update their mappings.

68 Many Probes Are Lost

69 But There Is Also A Major Benefit
Increased validation rates using RT-PCR (~10%) (Sandberg et al., BMC Bioinformatics, 2007)

70 After the Break
Learn how to make QA/QC plots in R
Compare univariate statistical analysis techniques
Apply an alternative ProbeSet remapping
Contrast the effects of pre-processing

71 We are on a Coffee Break & Networking Session

