Statistical Testing with Genes Saurabh Sinha CS 466.

Presentation transcript:


Hypergeometric test
Consider a population of N genes.
Microarray data identified a subset of n genes as being up-regulated in cancer.
We also know a set of m genes involved in cell division.
We want to test for association between cancer and cell division.

Hypergeometric test
[Venn diagram: Population (N), Cancer (n), Cell Div (m), Cancer and Cell Div (k)]
Is the intersection (size k) significantly large, given N, m, and n?

Hypergeometric Distribution
Given that m of N genes are labeled “cell division”: if we picked a random sample of n genes, how likely is an intersection of size exactly k?
$\Pr(X = k) = \binom{m}{k}\binom{N-m}{n-k} \Big/ \binom{N}{n}$

Hypergeometric Distribution
Given that m of N genes are labeled “cell division”: if we picked a random sample of n genes, how likely is an intersection of size k or greater?
$P = \Pr(X \ge k) = \sum_{i=k}^{\min(m,n)} \binom{m}{i}\binom{N-m}{n-i} \Big/ \binom{N}{n}$
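The two questions above (an intersection of exactly k, and of k or more) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names are hypothetical and not from the slides.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """Pr(X = k): exactly k of the n sampled genes carry the label,
    when m of the N genes in the population carry it."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

def hypergeom_tail(k, N, m, n):
    """Pr(X >= k): an intersection of size k or larger."""
    upper = min(m, n)  # the intersection cannot exceed either set
    return sum(hypergeom_pmf(i, N, m, n) for i in range(k, upper + 1))

# Toy numbers (assumed for illustration): N = 5 genes, m = 2 labeled, n = 2 sampled.
p_exact = hypergeom_pmf(1, 5, 2, 2)
p_tail = hypergeom_tail(1, 5, 2, 2)
```

The tail probability is the quantity used as the p-value in the test, since an intersection even larger than the observed one would count as at least as strong evidence of association.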

“Enrichment” analysis
If P calculated this way is below some threshold $\alpha$, e.g., 0.05, we say that the association between the cancer set and the cell division set is statistically significant.
In general, such “enrichment analysis”
–has a set of genes with known function (or other property, e.g., “fast evolving”)
–has a set of genes identified by the particular study (e.g., microarray study of cancer)
–tests if the identified genes are enriched for the function or property
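As a rough sketch of the whole enrichment test, the three ingredients listed above can be passed in as Python sets. The gene labels, the function name, and the threshold of 0.05 are assumptions made for illustration only.

```python
from math import comb

def enrichment_p_value(population, study_set, known_set):
    """Hypergeometric tail probability Pr(X >= k) for the overlap
    between a study set and a known-function set."""
    N = len(population)
    n = len(study_set & population)   # genes identified by the study
    m = len(known_set & population)   # genes with the known property
    k = len(study_set & known_set & population)  # observed intersection
    total = comb(N, n)
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(m, n) + 1)) / total

# Hypothetical toy data: 10 genes, 3 with the known function, 4 found by the study.
population = {f"g{i}" for i in range(10)}
known = {"g0", "g1", "g2"}
study = {"g0", "g1", "g2", "g3"}

p = enrichment_p_value(population, study, known)
significant = p < 0.05  # threshold taken from the slide's example value
```

Restricting both sets to the population first guards against genes that appear in one list but were never measured in the study.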

Gene Ontology
Gene Ontology (GO) is a highly popular source of “known gene sets”, e.g., gene sets with common known function.
Source: “Bioinformatics” - Polanski and Kimmel

Motivation for GO
There are large numbers of genes in different organisms, with large numbers of protein products and functions.
Several databases (often species specific) store such information (e.g., FlyBase).
We need a way to organize all this information from all these databases.

Consistent and Structured organization
Browsing through genome databases like FlyBase is powerful, but there is a lack of consistency from one database to another. (The “interface” is not the same.)
Annotations (e.g., functions of genes) could be more “structured”, for much easier and automated access
–e.g., Google Scholar/PubMed versus bibliography software/databases

Gene Ontology database
Created in response to these needs.
Aims at structured and consistent terms to describe gene function.
A “standardized” and “structured” vocabulary to describe genes and their products.

Structure of GO
Tree-like structure (strictly, a directed acyclic graph, since a term may have more than one parent), with three main branches:
–“molecular function”: activities of gene product at molecular level, e.g., “DNA binding”
–“biological process”: process (series of events), e.g., “metabolism” or “oxygen transport”
–“cellular component”: e.g., “cell nucleus”
Every GO term is a descendant of one of these three branches.
The structure captures the natural hierarchy of terms, e.g., “metabolism” is an ancestor node of “amino acid metabolism”.
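The hierarchy described above can be sketched as a mapping from each term to its parent terms. The table below is a toy fragment built from the slide's example terms, not real GO data; real GO terms carry identifiers and many intermediate levels, and the function names are hypothetical.

```python
# Toy hierarchy (assumed for illustration): child term -> list of parent terms.
PARENTS = {
    "molecular function": [],
    "biological process": [],
    "cellular component": [],
    "DNA binding": ["molecular function"],
    "metabolism": ["biological process"],
    "oxygen transport": ["biological process"],
    "amino acid metabolism": ["metabolism"],
    "cell nucleus": ["cellular component"],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

def main_branches(term, parents=PARENTS):
    """Return the root branch(es) that a term descends from."""
    candidates = ancestors(term, parents) | {term}
    return {t for t in candidates if not parents[t]}
```

Storing a list of parents per term, rather than a single parent, keeps the sketch usable even when a term sits under more than one parent.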

GO database
GO is not only a hierarchy of descriptive terms; it is also the assignment of one or more of these terms to genes.
Large numbers of genes in different organisms have been manually annotated with GO terms.

Accessing GO
One can download the entire hierarchy of terms, as well as the assignment of these terms to genes in a species.
One can also access GO through specialized interfaces, e.g.,
–GeneMerge for enrichment analysis of sets of genes
–AmiGO: search by gene name (all terms for that gene) or by term name (all genes for that term)
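The two kinds of lookup mentioned above (all terms for a gene, all genes for a term) amount to reading an annotation table in either direction. The sketch below uses a made-up table with placeholder gene names; it does not call GeneMerge, AmiGO, or any real GO interface.

```python
# Hypothetical annotation table: gene name -> set of assigned terms.
ANNOTATIONS = {
    "geneA": {"DNA binding", "cell nucleus"},
    "geneB": {"metabolism"},
    "geneC": {"DNA binding", "metabolism"},
}

def terms_for_gene(gene, annotations=ANNOTATIONS):
    """Search by gene name: all terms assigned to that gene."""
    return set(annotations.get(gene, set()))

def genes_for_term(term, annotations=ANNOTATIONS):
    """Search by term name: all genes annotated with that term."""
    return {gene for gene, terms in annotations.items() if term in terms}
```

The result of `genes_for_term` is exactly the kind of “known gene set” that an enrichment analysis takes as input.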

Statistical Tests
Enrichment analysis is a statistical test:
We calculate the probability of seeing, by random chance, something “as good” as what is observed (e.g., an intersection at least as large).
If this probability is very small, we “reject the null hypothesis” that the observation happened “by chance”.

Statistical Testing with genes: another example
Consider a single gene g.
Take its expression values in experimental condition 1, and in experimental condition 2.
Compare the two samples, e.g., calculate the difference of means.
Null hypothesis: both samples come from populations with equal means.
Is the difference of means significantly different from zero?
This is a statistical test about gene g.
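The slide does not say which test to use for the difference of means. As one possible sketch, an exact permutation test asks how often a random relabeling of the pooled expression values gives a difference at least as large as the observed one. The technique, the function name, and the expression values below are all assumptions for illustration; a t-test would be another common choice.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p_value(sample1, sample2):
    """Two-sided p-value for the difference of means, obtained by
    enumerating every way of splitting the pooled values into groups
    of the original sizes. Only practical for small samples."""
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    observed = abs(mean(sample1) - mean(sample2))
    indices = range(len(pooled))
    count = total = 0
    for group1 in combinations(indices, n1):
        chosen = set(group1)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in indices if i not in chosen]
        # Small tolerance so that ties are not lost to rounding error.
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

# Made-up expression values for one gene in two conditions.
condition1 = [5.1, 4.9, 5.3, 5.0]
condition2 = [6.8, 7.1, 6.9, 7.3]
p_value = exact_permutation_p_value(condition1, condition2)
```

With four replicates per condition there are only 70 possible splits, so the smallest attainable two-sided p-value is 2/70; very small samples limit how significant any single gene can look.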

Multiple genes
The null hypothesis about equality of means is a “gene-wise” null hypothesis.
How do we extend this framework to test a set of genes (for difference of expression between the two conditions)?
One natural extension: hypothesize equal expression in both conditions for every gene. This is the complete null hypothesis.
Source: Stat. Methods in Bioinformatics; Ewens & Grant.

Complete Null Hypothesis
But this may not be the hypothesis we are interested in accepting or rejecting: rejecting it doesn’t tell us which (subset of) genes are differentially expressed.
Rejecting the complete null hypothesis only tells us that “some gene is not equally expressed in the two conditions”.

What we’d like
A procedure that predicts which genes are differentially expressed.
Remember that “prediction” in this context is not a definitive prediction; there is some margin of error.
We’d like our procedure to provide us with an overall margin of error.

To be more specific
Let’s go back to the gene-wise null hypothesis $H_0$ and the test $\Pr(X \ge k \mid H_0) < \alpha$.
If we observe random variable X to have a value of k or more, we reject $H_0$, i.e., we predict that $H_0$ is not true.
However, this prediction has a margin of error, $\alpha$. It is possible that $H_0$ is true, yet we rejected it; in fact the probability of this “false positive” error is at most $\alpha$.
So we are able to control the “false positive” rate in the single gene test.
“Positive” because rejecting the null hypothesis usually incriminates the gene as being “interesting” in some way.
“False” because $H_0$ being true means that the prediction is false.
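To make the rejection rule concrete, the sketch below finds the smallest k for which Pr(X ≥ k | H0) falls below α and reports the resulting false positive probability. It borrows the hypergeometric null from the enrichment test as a stand-in for the distribution of X; that choice, the numbers, and the function names are assumptions for illustration.

```python
from math import comb

def tail_probability(k, N, m, n):
    """Pr(X >= k | H0) under a hypergeometric null."""
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(m, n) + 1)) / comb(N, n)

def critical_value(N, m, n, alpha):
    """Smallest k with Pr(X >= k | H0) < alpha, or None if no k qualifies."""
    for k in range(0, min(m, n) + 1):
        if tail_probability(k, N, m, n) < alpha:
            return k
    return None

# Toy setting: N = 10, m = 3, n = 4, alpha = 0.05.
k_star = critical_value(10, 3, 4, 0.05)
false_positive_probability = tail_probability(k_star, 10, 3, 4)
```

Because X is discrete here, the false positive probability at the critical value is strictly below α rather than equal to it.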

Controlling false positives
In the multiple gene test procedure (not defined as yet), we would like to predict a set of genes as being interesting, i.e., as violating the null hypothesis.
But we would like to have some control over (i.e., some idea of) the false positive rate.

False Discovery Rate
The False Discovery Rate (FDR) approach is one such procedure.
The final outcome will be a set of genes predicted to be differentially expressed.
We will have some control over the proportion of false positives among these predicted genes.
Say there are 10,000 genes and 100 are truly differentially expressed. It may be OK to predict some set of 100 genes, with the disclaimer that 50 of these may be false positives.
–False Discovery Rate of 50% may be OK!
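A quick calculation shows why a per-gene threshold alone is not enough in this setting. Using the slide's numbers (10,000 genes, 100 truly differentially expressed) and an assumed per-gene threshold of 0.05, the sketch below estimates how many false positives to expect. It also assumes, as a best case, that all 100 true genes are detected.

```python
def expected_false_positives(num_null_genes, alpha):
    """Expected number of true-null genes rejected at per-gene level alpha."""
    return num_null_genes * alpha

total_genes = 10_000
truly_different = 100
alpha = 0.05  # assumed per-gene threshold

null_genes = total_genes - truly_different
expected_fp = expected_false_positives(null_genes, alpha)

# Best-case assumption: every truly different gene is also rejected.
expected_rejections = expected_fp + truly_different
rough_false_proportion = expected_fp / expected_rejections
```

Under these assumptions roughly 495 of about 595 predicted genes would be false positives, which is far worse than the 50% the slide describes as tolerable.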

Some definitions
Consider the genes for which the null hypothesis is rejected. (Say R in number.)
Let V be the # genes (from these R) for which the null hypothesis is true (i.e., falsely rejected, or false positive).
Let S be the # genes (from these R) for which the null hypothesis is false (i.e., correctly rejected), so that $R = V + S$.
Define $Q = 0$ if $R = 0$, and $Q = V/R$ if $R > 0$.
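The definitions of V, S, R, and Q translate directly into a few counts. The sketch below assumes the truth about each gene is known, which is only the case in simulations or toy examples; the function name and inputs are hypothetical.

```python
def rejection_summary(rejected, null_is_true):
    """Compute V, S, R, and Q from per-gene rejection decisions and
    the (normally unknown) truth of each null hypothesis."""
    V = sum(1 for r, t in zip(rejected, null_is_true) if r and t)      # falsely rejected
    S = sum(1 for r, t in zip(rejected, null_is_true) if r and not t)  # correctly rejected
    R = V + S
    Q = V / R if R > 0 else 0.0
    return V, S, R, Q

# Toy example with five genes.
rejected = [True, True, True, False, False]
null_is_true = [True, False, False, True, False]
V, S, R, Q = rejection_summary(rejected, null_is_true)
```

Setting Q to 0 when nothing is rejected avoids a division by zero and matches the definition above.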

Some definitions
V, S, R, and Q are random variables, and even after all tests are done we don’t know Q (the false positive proportion).
Define $\mathrm{FDR} = E(Q)$.
The FDR “procedure” will aim to control this expected value of Q.

A difference between p-values and FDRs
FDR is fundamentally different from a p-value.
A p-value assesses the significance of data. If we publish some data that we claim to be significant, we should present a small p-value for the data (e.g., $\le 0.05$).
FDR is generally used as a “culling tool”; the investigator wants to predict a set of genes to test experimentally, and an FDR of 0.5 may be acceptable to her (she’ll do twice as much experimental work, which may be fine).

The “Procedure”
Proposed by Benjamini and Hochberg.
Many other procedures have appeared since then, but we’ll only see this original one.
Begin with a per-gene p-value, i.e., $\Pr(X \ge k \mid H_0)$, for every one of the g genes being studied.
–g gene-wise null hypotheses, with their corresponding p-values computed.

Benjamini-Hochberg procedure
Let the g gene-wise p-values be denoted by $p_{(i)}$.
Consider these g p-values to be sorted in ascending order, i.e., $p_{(1)} \le p_{(2)} \le \dots \le p_{(g)}$.
Let $H_{(i)}$ be the hypothesis corresponding to $p_{(i)}$.
Let $q_i = \frac{i}{g}\,\alpha$ for $i = 1, 2, 3, \dots, g$, where $\alpha$ is the desired FDR.
Let k be the max i such that $p_{(i)} \le q_i$.
Procedure: Reject null hypotheses $H_{(1)}, H_{(2)}, \dots, H_{(k)}$ and accept all others.
Theorem: $E(Q) = (g_0/g)\,\alpha \le \alpha$, where $g_0$ is the number of genes for which the null hypothesis is true.
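The procedure can be sketched in a few lines. This is an illustrative implementation that assumes the standard Benjamini-Hochberg thresholds q_i = (i/g)·α; the function name and the example p-values are made up for the demonstration.

```python
def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans, True where the gene-wise null
    hypothesis is rejected at desired FDR alpha."""
    g = len(p_values)
    # Indices of the genes, ordered by ascending p-value.
    order = sorted(range(g), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / g:
            k = rank  # keep the largest rank that passes its threshold
    rejected = [False] * g
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

# Made-up p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
strict = benjamini_hochberg(p_values, 0.05)
lenient = benjamini_hochberg(p_values, 0.25)
```

Note that k is the largest passing rank, not the first failing one. With α = 0.25 the eighth p-value (0.205) misses its own threshold of 0.2, yet it is still rejected because the tenth p-value passes its threshold.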