Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College


Variability
• key to statistics
• within-slide vs. between-slide replication
• dye bias, red (Cy5) > green (Cy3): dye swap
• log (base 2) transformation

Example: "Variation in Gene Expression Patterns in Follicular Lymphoma and the Response to Rituximab," by Bohen, Troyanskaya, Alter, Warnke, Botstein, Brown, and Levy
• 2 groups: those who responded to treatment and those who did not.
• Cy5 dye used on malignant lymphoid tissues; Cy3 dye used on mRNA derived from cell lines.
• Biopsies were obtained before treatment with Rituximab.
• Are there differences in gene expression between those who responded to treatment and those who didn't?

Data Cleaning Individual points were median centered for each cDNA clone and filtered for data quality. Data values are either:

The Data:

Differential Expression Across Two Groups
• Fold Change
• t-test
• Wilcoxon Rank-Sum Test
• SAM

Fold Change
• Of mean? Of median?
• Across treatment groups? vs. reference group?
• Small vs. large values
• What about how variable the groups are?
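The first question above (mean or median?) can be illustrated with a minimal sketch. It assumes the values are already log (base 2) ratios, so a difference of summaries is a log2 fold change. The function name and the data are hypothetical and are not taken from the lymphoma data set.

```python
import statistics

def log2_fold_change(group1, group2, summary=statistics.mean):
    """Difference of group summaries on the log2 scale (= log2 of the fold change)."""
    return summary(group1) - summary(group2)

# Hypothetical log2(Cy5/Cy3) values for one gene; the last responder is an outlier.
responders = [1.2, 0.9, 1.1, 1.0, 4.0]
non_responders = [0.1, 0.2, 0.0, 0.1, 0.1]

fc_mean = log2_fold_change(responders, non_responders, statistics.mean)
fc_median = log2_fold_change(responders, non_responders, statistics.median)

# Back-transform to a ratio: 2 ** log2 fold change.
print(f"mean-based:   log2 FC = {fc_mean:.2f}, ratio = {2 ** fc_mean:.2f}")
print(f"median-based: log2 FC = {fc_median:.2f}, ratio = {2 ** fc_median:.2f}")
```

The single outlier pulls the mean-based fold change well above the median-based one, and neither number says anything about how variable the groups are.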

An Example using one gene:

t-test
• Test statistic:
• p-value = probability of seeing data as extreme as yours, or more extreme, if there is no difference between the groups

t-test in Excel
Syntax: TTEST(array1,array2,tails,type)
Example:
• first group is in cells C3–K3
• second group is in cells L3–V3
• we want a two-sided t-test (no preconceived idea about which group is more highly expressed)
• we assume the variance is unequal
In cell W3 type: “=TTEST(C3:K3,L3:V3,2,3)”
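For readers working outside Excel, the following is a minimal sketch of the unequal-variance (Welch) two-sample t statistic, which is the form Excel's type 3 corresponds to. The function name and data are hypothetical. The sketch stops at the statistic and the approximate degrees of freedom, because the Python standard library has no t-distribution CDF; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the statistic with a two-sided p-value.

```python
import math
import statistics

def welch_t(group1, group2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(group2)
    se2 = v1 / n1 + v2 / n2           # squared standard error of the difference
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical expression values for one gene in two groups.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.3f}, df = {df:.2f}")
```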

Wilcoxon Rank Sum Test
• Instead of comparing averages, this test compares rankings (or medians).
• To discount influential points, we replace the data values with their ranks.
• We compute a z-test (sister of the t-test) on the ranked data.

[Figure panels: up-regulated genes; down-regulated genes]

Technical Details
• Replace values with ranks
• Sum the ranks in the first group
• Calculate hypothesized mean1 = n1*(n1+n2+1)/2
• Calculate hypothesized standard deviation1 = sqrt(n1*n2*(n1+n2+1)/12)
• Calculate test statistic = (sum ranks - hyp mean1) / hyp stdev1
• Find the p-value using the normal distribution (probability of a result at least as extreme as the test statistic if there are no differences between the two groups)
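The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure behind the slides: ranks run from 1 for the smallest value, ties receive the average rank, no tie correction is applied to the standard deviation, and the p-value is two-sided. The function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(group1, group2):
    """z statistic and two-sided normal-approximation p-value for the rank sum."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    rank_sum = sum(ranks[:n1])                       # sum of ranks in group 1
    mean1 = n1 * (n1 + n2 + 1) / 2                   # hypothesized mean
    sd1 = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # hypothesized std. deviation
    z = (rank_sum - mean1) / sd1
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - normal_cdf)

# Hypothetical data: every value in group 1 is below every value in group 2.
z, p = rank_sum_z([1, 2, 3], [4, 5, 6])
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

Excel's RANK ranks in descending order by default, which flips the sign of z but leaves the two-sided p-value unchanged.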

Wilcoxon Rank Sum in Excel
Using the RANK function, translate your data into ranks:
• Y3: “=RANK(C3,C3:V3)” finds the rank of C3 in the range C3–V3 (you’ll probably get a “#value” here; that’s OK because C3 is empty for gene = IMAGE:253507)
• Repeat this command for Z3 to AR3, keeping the second half of the function always C3:V3
• Copy the row from Y3 to AR3 and paste from Y4 to AR2366
Then compute:
• AS2: “=SUMIF(Y3:AG3,">0",Y3:AG3)” (sum rank grp1)
• AT2: “=COUNT(Y3:AG3)*(COUNT(Y3:AR3)+1)/2” (mean1)
• AU2: “=SQRT(COUNT(Y3:AG3)*COUNT(AH3:AR3)*(COUNT(Y3:AR3)+1)/12)” (stdev1)
• AV2: “=(AS3-AT3)/AU3” (zscore1 = test stat)
• AW2: “=2*(1-NORMDIST(ABS(AV3),0,1,TRUE))” (p-value)

SAM (Significance Analysis of Microarrays)
• is a statistical technique for finding significant genes in a set of microarray experiments
• can be used in a comparison experiment
• can also be used with a quantitative response (like tumor size) or with one-class data

Technical Details
• For the i-th gene, comparing two groups, the test statistic d_i is:
• Rank the d_i and keep as test statistics
• Permute the data labels 100 times, and calculate expected values for the d_i given no structure.
• Plot observed d_i vs. expected d_i
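The permutation scheme above can be sketched as follows. The sketch assumes the usual published form of the two-class SAM statistic, d_i = (mean1 - mean2) / (s_i + s0), where s_i is the pooled standard error for gene i and s0 is a small positive constant. Here s0 is simply fixed at a hypothetical value, whereas the SAM software chooses it from the data. The function names and the small data matrix are illustrative only.

```python
import math
import random
import statistics

def sam_d(row, labels, s0):
    """Relative difference d for one gene: (mean1 - mean2) / (s + s0)."""
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    g2 = [x for x, lab in zip(row, labels) if lab == 2]
    n1, n2 = len(g1), len(g2)
    m1, m2 = statistics.mean(g1), statistics.mean(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))  # pooled standard error
    return (m1 - m2) / (s + s0)

def observed_and_expected(data, labels, s0=0.1, n_perm=100, seed=0):
    """Sorted observed d values and their expected values under permuted labels."""
    rng = random.Random(seed)
    observed = sorted(sam_d(row, labels, s0) for row in data)
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)                     # break any group structure
        perm_d = sorted(sam_d(row, shuffled, s0) for row in data)
        totals = [t + d for t, d in zip(totals, perm_d)]
    expected = [t / n_perm for t in totals]       # average each order statistic
    return observed, expected

# Hypothetical data: 4 genes (rows) by 6 arrays, 3 arrays per group.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [5.0, 5.5, 6.0, 1.0, 1.5, 2.0],
    [0.3, 0.1, 0.2, 0.2, 0.3, 0.1],
]
labels = [1, 1, 1, 2, 2, 2]
d_first = sam_d(data[0], labels, s0=0.1)
observed, expected = observed_and_expected(data, labels)
```

Plotting `observed` against `expected` gives the kind of plot described above; genes far from the diagonal are the candidates for differential expression.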

False Discovery Rate
• The expected d_i were computed with no group structure, so any “large” expected d_i values are false positives.
• If we see 30 observed d_i above some cutoff and 10 expected d_i above the same cutoff, we probably have about 10 false positives (though we can never know *which* genes are the false positives).
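The counting argument above amounts to a ratio, sketched here under simplifying assumptions: it counts only values above a single upper cutoff, and it uses one list of expected values rather than a count per permutation. The function name and the numbers are hypothetical and are chosen to reproduce the 30-versus-10 example.

```python
def estimated_fdr(observed_d, expected_d, cutoff):
    """Estimated false discovery rate: expected calls / observed calls above cutoff."""
    called = sum(1 for d in observed_d if d > cutoff)
    false_calls = sum(1 for d in expected_d if d > cutoff)
    if called == 0:
        return 0.0  # convention chosen for this sketch: no calls, no false calls
    return min(1.0, false_calls / called)

# Hypothetical values: 30 observed and 10 expected d values exceed the cutoff of 2.0.
observed_d = [3.0] * 30 + [0.5] * 70
expected_d = [2.5] * 10 + [0.2] * 90
fdr = estimated_fdr(observed_d, expected_d, cutoff=2.0)
print(f"estimated FDR = {fdr:.2f}")
```

Raising the cutoff usually lowers both counts, which is the trade-off a false discovery rate slider exposes.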

Features of SAM
• Slider: we can change the false discovery rate
• Fold change: in addition to the false discovery rate, we can require the genes to be at some fold change threshold (on average)
• Gene lists: gene lists are given along with corresponding significance levels
• Web Link option for more information about particular genes

Imputation
• Most microarray data have missing values.
• If background is bigger than foreground, the observed signal will be negative!
• Poor-quality spots are removed prior to analysis.
• SAM needs a complete data set; missing values can be filled in by:
  • Substitution of the row average
  • Substitution using k-nearest neighbors
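Both substitution strategies can be sketched briefly. This is an illustration under assumptions, not SAM's own imputation code: missing values are represented as `None`, rows are genes, distance between genes is the root mean squared difference over the arrays both genes have, and `k=2` is an arbitrary choice for the small hypothetical data set.

```python
import math

def impute_row_average(row):
    """Replace missing entries (None) with the mean of the row's observed values."""
    present = [x for x in row if x is not None]
    mean = sum(present) / len(present)
    return [mean if x is None else x for x in row]

def impute_knn(data, k=2):
    """Replace each missing entry with the mean of that column in the k nearest rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes with an observed value in column j.
            candidates = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(data)
                if idx != i and other[j] is not None
            )
            neighbors = [data[idx][j] for _, idx in candidates[:k]]
            if neighbors:
                filled[i][j] = sum(neighbors) / len(neighbors)
    return filled

# Hypothetical log ratios: gene 0 is missing its third array.
genes = [
    [1.0, 2.0, None, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [-5.0, -4.0, -3.0, -2.0],
]
by_row_average = impute_row_average(genes[0])
by_knn = impute_knn(genes, k=2)
```

In this example the row average fills the gap with about 2.33, while the two nearest genes suggest 3.0, because the neighbors carry information about that particular array.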