Comp. Genomics Recitation 10 4/7/09 Differential expression detection.


Outline
- Clustering vs. differential expression
- Fold change
- T-test
- Multiple testing
- FDR/SAM
- Mann-Whitney
- Examples

Microarray preliminaries
- General input: a matrix of probes (sequences) and intensities
- We assume the hard work is over:
  - Probes are assigned to genes
  - The data is properly (?) normalized
- We have an expression matrix
  - Rows correspond to genes
  - Columns correspond to conditions

Microarray analysis
Common scenarios:
- We tested the behavior of genes across several time points, or we tested a large number of different conditions
  - Clustering is the solution
- We compared a small number of conditions (2) and have multiple replicates for each condition
  - E.g., we measured blood expression in 10 sick and 10 healthy individuals
  - Differential expression analysis is the solution

Identification of differential genes
- The most basic experimental design: comparison between 2 conditions, ‘treatment’ vs. control
- More complex: sick/treatment/control
- The goal: identify genes that are differentially expressed in the examined conditions
- The number of replicates is usually low (n=2-4)
- Statistics are important
Slides: Rani Elkon

Approaches for identification of differential genes
1. Fold Change
2. T-test
3. SAM

1. Fold Change
- Consider genes whose mean expression level changed by at least a chosen fold threshold as differential genes
- Pros: very simple!
- Cons:
  - Usually no estimate of the false positive rate is provided
  - Biased toward genes with low expression levels
  - Ignores the variability of gene levels over replicates

Fold Change limitation: biased toward low expression levels
- Remedy: determine a ‘floor’ cut-off and set all expression levels below it to this floor level

Fold Change limitation: ignores variability over replicates
- We need a score that ‘punishes’ genes with high variability over replicates
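
The fold-change criterion with a floor can be sketched in a few lines. This is a minimal illustration, not the method of any particular package: the function names, the 2-fold default threshold, and the floor values are assumptions chosen for the example.

```python
import math

def log2_fold_change(control, treatment, floor=1.0):
    """log2 ratio of mean treatment to mean control, after flooring low values.

    `floor` is a hypothetical cut-off: every level below it is raised to it.
    """
    c = [max(x, floor) for x in control]
    t = [max(x, floor) for x in treatment]
    mean_c = sum(c) / len(c)
    mean_t = sum(t) / len(t)
    return math.log2(mean_t / mean_c)

def is_differential(control, treatment, fold=2.0, floor=1.0):
    """True if the mean level changed (up or down) by at least `fold`."""
    return abs(log2_fold_change(control, treatment, floor)) >= math.log2(fold)

# A weakly expressed gene shows a 4-fold change built on very small numbers...
low_c, low_t = [1.0, 2.0], [4.0, 8.0]
print(is_differential(low_c, low_t, floor=1.0))   # True
# ...which disappears once levels below the floor are clamped.
print(is_differential(low_c, low_t, floor=20.0))  # False
```

Note that the sketch still ignores the spread of the replicates, which is the second limitation listed above.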

Approaches for identification of differential genes
1. Fold Change
2. T-test
3. SAM

2. T-test
- Compute a t-score for each gene, using:
  - m_c, m_t: mean levels in Control and Treatment
  - S_c^2, S_t^2: variance estimates in Control and Treatment
  - n_c, n_t: number of replicates in Control and Treatment
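
With these quantities, the usual unequal-variance (Welch) form of the score is t = (m_t - m_c) / sqrt(S_c^2/n_c + S_t^2/n_t). The transcript does not preserve the slide's own formula, so treating it as this form is an assumption. A minimal sketch using only the standard library, with a hypothetical function name:

```python
import math
from statistics import mean, variance

def t_score(control, treatment):
    """Welch-style t-score: difference of means over its standard error."""
    m_c, m_t = mean(control), mean(treatment)
    s2_c, s2_t = variance(control), variance(treatment)  # sample variances
    n_c, n_t = len(control), len(treatment)
    return (m_t - m_c) / math.sqrt(s2_c / n_c + s2_t / n_t)

# Same 2-fold change in the means, very different variability over replicates:
stable = t_score([10.0, 11.0, 9.0], [20.0, 21.0, 19.0])
noisy = t_score([2.0, 10.0, 18.0], [5.0, 20.0, 35.0])
print(stable > noisy)  # True: the noisy gene is 'punished' with a lower score
```

Turning the score into a p-value needs the t distribution (for example from a statistics library), which is left out here.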

T-test
- The t-score is useful because it is the result of a well-known statistical hypothesis test
- We assume the samples are normally distributed (with unknown variance) and compare two hypotheses:
  - H_0: control and treatment measurements come from the same distribution
  - H_1: control and treatment measurements come from different normal distributions
- In this case a p-value can be derived for every t-score

T-test
- Set a cut-off for the p-value (α=0.01) and consider all genes with p-value < α as differential genes

Multiple Testing
- The p-value P_g associated with the t-score t_g is the probability of obtaining by chance a t-score at least as extreme as t_g
- Multiplicity problem: thousands of genes are tested simultaneously (all the genes on the array!)
- Simple example: 10,000 genes on a chip, not a single one is differentially expressed (everything is random), α=0.01
  - 10,000 x 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance
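
The 100-gene figure can be checked with a small simulation. The sketch below assumes that, when no gene is differential, each p-value is uniformly distributed on [0, 1] (which holds for a continuous test statistic under H_0); the function name and seed are illustrative.

```python
import random

def count_false_positives(n_genes=10_000, alpha=0.01, seed=0):
    """Count genes passing the cut-off when every gene is truly null."""
    rng = random.Random(seed)
    # Under H_0 a p-value is just a uniform random number in [0, 1).
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)

expected = 10_000 * 0.01
observed = count_false_positives()
print(expected, observed)  # observed should land near the expected 100
```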

Multiple testing
- Individual p-values below a conventional cut-off no longer correspond to significant findings
- We need to adjust for multiple testing when assessing the statistical significance of findings
- This is actually a fairly common problem in statistics

Multiple Testing
- Simple solution (Bonferroni): consider as differential genes only those with p-value < α/N, where N is the number of tests
- α=0.01, N=10,000: cut-off = 0.01/10,000 = 10^-6
- Ensures a very low probability (less than α) of having any false positive genes
- Advantage: a very clean list of differential genes
- Limitation: the list usually contains very few genes, i.e., an unacceptably high rate of false negatives
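
A minimal sketch of the Bonferroni rule, assuming the p-values are already computed and held in a plain list (the function name is hypothetical):

```python
def bonferroni_select(p_values, alpha=0.01):
    """Return indices of genes whose p-value is below alpha / N."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]

# With N = 4 tests and alpha = 0.01 the per-gene cut-off is 0.0025.
print(bonferroni_select([1e-7, 5e-4, 0.2, 0.03]))  # [0, 1]
```

With N = 10,000 tests the same rule gives a per-gene cut-off of 0.01 / 10,000 = 10^-6, which is why so few genes survive it.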

FDR correction (Benjamini & Hochberg)
- FDR: False Discovery Rate
- In high-throughput studies a certain proportion of false positives is tolerable
- Control the expected proportion of false positives among the genes declared as differential (e.g., q=10%)
- Scheme:
  - Rank genes according to their p-values: p(1) < p(2) < … < p(N)
  - Consider as differential the top k genes, where k = max{i : p(i) < i·(q/N)}
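
The ranking scheme can be sketched as follows. The function name is hypothetical, and the comparison uses ≤, the usual Benjamini-Hochberg convention; it differs from a strict inequality only when a p-value lands exactly on its threshold.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return indices of genes declared differential at FDR level q."""
    n = len(p_values)
    # Gene indices ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k = rank  # keep the LARGEST rank that passes its threshold
    return sorted(order[:k])

# Thresholds for N=4, q=0.1 are 0.025, 0.05, 0.075, 0.1.
# The smallest p-value (0.03) misses its own threshold, but ranks 2 and 3
# pass, so the top 3 genes are all declared differential.
print(benjamini_hochberg([0.03, 0.04, 0.045, 0.9], q=0.10))  # [0, 1, 2]
```

Taking the maximum passing rank, rather than stopping at the first failure, is what makes the procedure less conservative than Bonferroni.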

Approaches for identification of differential genes
1. Fold Change
2. T-test
3. SAM

3. SAM (Tusher, Tibshirani & Chu)
- ‘Significance Analysis of Microarrays’
- Limitation of the analytical FDR approach: it assumes that the tests are independent
- In the microarray context, the expression levels of some genes are highly correlated → unreliable FDR estimate
- SAM uses permutations to get an ‘empirical’ estimate of the FDR of the reported differential genes

SAM
Scheme:
1. Compute for each gene a statistic that measures its relative expression difference in control vs. ‘treatment’ (t-score or a variant)
2. Rank the genes according to their ‘difference score’
3. Set a cut-off (d_0) and consider all genes above it as differential (N_d)
4. Permute the condition labels, and count how many genes got a score above d_0 (N_p)
5. Repeat on many (or all possible) permutations and count (N_pj)
6. Estimate the FDR as the proportion: Average(N_pj)/N_d
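
The scheme above can be sketched as a small permutation loop. This is a simplified illustration under stated assumptions, not the published SAM procedure: it uses a plain Welch t-score as the difference score (SAM itself uses a modified statistic), it enumerates every relabeling with the same group sizes, and all function and parameter names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_score(a, b):
    """Welch-style score; returns 0.0 if both groups have zero variance."""
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / denom if denom > 0 else 0.0

def permutation_fdr(matrix, control_idx, d0):
    """Return (N_d, estimated FDR) for cut-off d0.

    matrix: one row of expression values per gene, one column per sample.
    control_idx: column indices of the control samples in the real labeling.
    """
    n_cols = len(matrix[0])

    def count_above(ctrl):
        trt = [j for j in range(n_cols) if j not in ctrl]
        return sum(
            abs(t_score([row[j] for j in ctrl], [row[j] for j in trt])) > d0
            for row in matrix
        )

    n_d = count_above(tuple(control_idx))
    if n_d == 0:
        return 0, 0.0
    # N_pj: genes passing the cut-off under each permuted labeling.
    counts = [
        count_above(c)
        for c in combinations(range(n_cols), len(control_idx))
        if set(c) != set(control_idx)
    ]
    return n_d, (sum(counts) / len(counts)) / n_d

genes = [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]]
n_d, fdr = permutation_fdr(genes, control_idx=(0, 1, 2), d0=5.0)
print(n_d, round(fdr, 4))
```

Because the labels are permuted jointly for all genes, correlations between genes are preserved in the permuted data, which is the point of the empirical estimate.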

Permutation on condition labels

Gene  Expression values (8 samples)    D score  Permutation 1  Permutation 2
G1    e11 e12 e13 e14 e15 e16 e17 e18  d1       d1p1           d1p2
G2    e21 e22 e23 e24 e25 e26 e27 e28  d2       d2p1           d2p2
G3    e31 e32 e33 e34 e35 e36 e37 e38  d3       d3p1           d3p2

SAM example
- Ionizing radiation response experiment
- After setting the threshold: 46 genes found significant
- 36 permutations
- 8.4 genes on average pass the threshold
- False discovery rate is 18%
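
The 18% figure follows directly from the Average(N_pj)/N_d ratio; a quick check with the numbers on the slide (variable names are illustrative):

```python
n_called = 46          # genes above the threshold with the real labels (N_d)
mean_permuted = 8.4    # average number above the threshold over 36 permutations
fdr = mean_permuted / n_called
print(f"{fdr:.1%}")  # 18.3%
```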

Mann-Whitney/Wilcoxon
- In general, the normality assumption of the t-test is problematic
- Non-parametric statistics are very useful in many bioinformatics-related problems
- They assume nothing about the distribution of the samples
- Less powerful (more false negatives, but fewer false positives)

Mann-Whitney/Wilcoxon
- MW/Wilcoxon test for two samples:
  - H_0: the medians of both distributions are the same
  - H_1: the medians of the distributions are different
- Assumes:
  - The two samples are independent
  - The observations can be ordered (ordinal)

Mann-Whitney/Wilcoxon
- Computes a U-score whose distribution is known under H_0 (and can be approximated by a normal distribution in large samples)
- Arrange all the observations into a single ranked series
- Add up the ranks in sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N+1)/2
- U-score:
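
A common textbook definition is U_1 = R_1 - n_1(n_1+1)/2, where R_1 is the sum of ranks in sample 1 and n_1 its size, with U_2 = n_1·n_2 - U_1. The transcript does not preserve the slide's own formula, so treating it as this form is an assumption. A minimal sketch with a hypothetical function name, assigning tied observations their average rank:

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) computed from ranks in the pooled, sorted data."""
    pooled = sorted(list(sample1) + list(sample2))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold equal values: give each the mean of ranks i+1..j.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    r1 = sum(ranks[x] for x in sample1)  # rank sum of sample 1
    n1, n2 = len(sample1), len(sample2)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2

# Complete separation: every value in sample 1 is below every value in sample 2.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 9.0)
```

Only the ordering of the observations enters the score, which is why no distributional assumption is needed.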