Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.

Presentation transcript:

Getting the story – biological model based on microarray data
Once the differentially expressed genes are identified (sometimes hundreds of them), we need to figure out what it all means.
Since we don't know much about the function of most of the genes, this is not easy.
This is complicated further by the fact that gene function is context-specific: it depends on the tissue, the developmental stage of the organism and multiple other factors.
"Functional clustering": grouping genes with respect to their known function (ontology).
Establishing the statistical significance of the association between groups of genes identified in the analysis and "Functional clusters".

Analyzing Microarray Data
Experimental Design
[Design diagram: Universal Control; samples C1, C2, C3 and C4, each Not Treated and Treated]
Data Normalization: reducing technical variability
Statistical Analysis (ANOVA): identifying differentially expressed genes, factoring out variability sources
Data Mining

Data Integration and Interpretation

Gene Ontology (GO)
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.
The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

Molecular Function
Biochemical activity or action of the gene product. MF describes a capability that the gene product has; there is no reference to where or when this activity or usage actually occurs.
Examples: enzyme, transporter, ligand
Cytochrome c: electron transporter activity

Biological Process
A biological objective to which the gene product contributes. A biological process is accomplished via one or more ordered assemblies of molecular functions. There is generally some temporal aspect to the process, and it will often involve the transformation of some physical thing.
Examples: cell growth and maintenance
Cytochrome c: oxidative phosphorylation, induction of cell death

Cellular Component
A component of a cell that is part of some larger object or structure.
Examples: chromosome, nucleus, ribosome
Cytochrome c: mitochondrial matrix, mitochondrial inner membrane

First step of making a story: Statistical significance of a particular "Functional cluster"
Suppose we have analyzed a total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified; call them significant).
Suppose that x out of the n significant genes and y out of the N total genes were classified into a specific "Functional group".
Q1: Is this "Functional group" significantly correlated with our group of significant genes?
Q2: Are significant genes overrepresented in this functional group when compared to their overall frequency among all analyzed genes?
Q3: What is the chance of getting x or more significant genes if we randomly draw y out of N genes "out of a hat", assuming that each gene remaining in the hat has an equal chance of being drawn? ($H_0$: p(significant gene belonging to this category) = y/N)
Q3A: What is the p-value for rejecting this null hypothesis?
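The "out of a hat" question in Q3 can be approximated by simulation before any formula is introduced. The following is a minimal sketch, not the method used in the slides: the function name, the seed, the number of trials and the example values of N, n, y and x are all assumptions chosen for illustration.

```python
import random


def simulate_hat_draw(N, n, y, x, trials=20000, seed=0):
    """Estimate P(at least x significant genes among y genes drawn from N)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Genes are labelled 0..N-1; labels below n stand for the significant genes.
        drawn = rng.sample(range(N), y)  # draw y genes without replacement
        significant_drawn = sum(1 for g in drawn if g < n)
        if significant_drawn >= x:
            hits += 1
    return hits / trials


# Hypothetical numbers, for illustration only.
estimate = simulate_hat_draw(N=1000, n=100, y=15, x=3)
print(f"Estimated chance of 3 or more significant genes: {estimate:.3f}")
```

The estimate fluctuates with the seed and the number of trials; the exact counting argument on the later slides removes that noise.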

Strategy for finding "Statistically Significant" GO categories:
1) Identify all categories that contain at least 5 genes from the microarray (about 1800 in our case).
2) Perform a Fisher's exact test for each category to test for statistically significant over-representation of differentially expressed genes.
3) Adjust the individual Fisher's p-values for the fact that we are testing 1800 hypotheses by calculating FDRs.
4) Repeat this for different levels of the statistical significance used to select differentially expressed genes (FDR < 0.01, 0.05, 0.1, 0.2) and observe the statistical significance of the two most significant GO categories.
Fisher's tests (
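The filtering and FDR steps of this strategy can be sketched as follows. The slides do not say which FDR procedure was used; this sketch assumes the Benjamini-Hochberg adjustment, and the category names, sizes and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical categories: name -> (genes on the array, Fisher p-value).
categories = {
    "category A": (40, 0.005),
    "category B": (12, 0.01),
    "category C": (3, 0.001),  # fewer than 5 genes, so it is not tested
    "category D": (25, 0.03),
    "category E": (8, 0.04),
}

MIN_GENES = 5
kept = {name: p for name, (size, p) in categories.items() if size >= MIN_GENES}
fdr = dict(zip(kept, benjamini_hochberg(list(kept.values()))))

for name, value in fdr.items():
    print(f"{name}: FDR = {value:.3f}")
```

Filtering before the adjustment matters: the number of hypotheses m counts only the categories that are actually tested.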

Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR < 0.01

GO Term 1
FDR for the category =
GOID = GO:
Term = muscle contraction
Definition = A process leading to shortening and/or development of tension in muscle tissue. Muscle contraction occurs by a sliding filament mechanism whereby actin filaments slide inward among the myosin filaments.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]

GO Term 2
FDR for the category =
GOID = GO:
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent of muscle contraction.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    2   13
[2,]
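A two-by-two membership matrix of this kind can be built from the four counts N, n, y and x defined earlier. The sketch below is only one possible layout: the slides do not state which rows and columns their printed matrix uses, so the ordering here (rows: in category / not in category; columns: significant / not significant) is an assumption, and the values of N and n are hypothetical.

```python
def two_by_two(N, n, y, x):
    """Build a 2x2 table of gene memberships.

    Assumed layout (not confirmed by the slides):
      row 1: genes in the category     -> [significant, not significant]
      row 2: genes not in the category -> [significant, not significant]
    """
    return [
        [x, y - x],
        [n - x, N - n - y + x],
    ]


# Hypothetical counts: 1000 genes analyzed, 100 significant,
# 15 genes in the category, 3 of them significant.
table = two_by_two(N=1000, n=100, y=15, x=3)
for row in table:
    print(row)
```

Whatever layout is used, the four cells must add up to N, the first row to y and the first column to n.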

Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR < 0.05

GO Term 1
FDR for the category =
GOID = GO:
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

GO Term 2
FDR for the category =
GOID = GO:
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR < 0.1

GO Term 1
FDR for the category =
GOID = GO:
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel over time, from its formation to the mature structure. The blood vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

GO Term 2
FDR for the category =
GOID = GO:
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood vessels are generated and organized. Morphogenesis pertains to the creation of form. The blood vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR < 0.2

GO Term 1
FDR for the category =
GOID = GO:
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel over time, from its formation to the mature structure. The blood vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

GO Term 2
FDR for the category =
GOID = GO:
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood vessels are generated and organized. Morphogenesis pertains to the creation of form. The blood vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

Statistical significance of a particular "Functional cluster" - cont
[Diagram: the N genes $g_1, \ldots, g_N$ drawn as boxes, with red boxes for significant genes and blue borders for the y genes in the functional group; panels "Observed" and "Removing Functional Classification"]
Q: By randomly drawing y boxes to color their border blue, what is the chance to draw x or more red ones?
Outcome ($o_1, \ldots, o_T$): a set of y genes selected from the list of N genes.
Event of interest (E): the set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x.
Since the drawing is random, all outcomes are equally probable.

Statistical significance of a particular "Functional cluster" - cont
Outcome ($o_1, \ldots, o_T$): a set of y genes selected from the list of N genes.
Event of interest (E): the set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x.
All we have to do is calculate M and T, so that $P(E) = M/T$, where:
T = number of different sets of y genes that can be drawn out of a total of N genes: $T = \binom{N}{y}$. This comes from the fact that the order in which we pick genes does not matter.
M = number of different ways to obtain x red boxes (significant genes) when drawing y boxes (genes) out of a total of N boxes (genes), n of which are red (significant): $M = \binom{n}{x}\binom{N-n}{y-x}$. First pick x red boxes; then, for each such set of x red boxes, pick a set of y-x non-red boxes.
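The two counts described on this slide, T and M, can be computed directly with binomial coefficients. This is a minimal sketch using Python's standard `math.comb`; the function name and the small example numbers are assumptions chosen so the result can be checked by hand.

```python
from math import comb


def count_outcomes(N, n, y, x):
    """Return (T, M) for drawing y genes out of N, n of which are significant."""
    T = comb(N, y)                       # all unordered sets of y genes
    M = comb(n, x) * comb(N - n, y - x)  # x significant genes, then y-x others
    return T, M


# Hypothetical small example: 10 genes, 4 significant, draw 3, want exactly 2.
T, M = count_outcomes(N=10, n=4, y=3, x=2)
probability = M / T
print(T, M, probability)
```

Here T = 120 and M = 6 × 6 = 36, so the chance of drawing exactly 2 significant genes is 0.3.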

Statistical significance of a particular "Functional cluster" - p-value
Fisher's exact test or the "hypergeometric" test
P-value: probability of observing x or more significant genes under the null hypothesis:
$$p = \sum_{k=x}^{\min(n,\,y)} \frac{\binom{n}{k}\binom{N-n}{y-k}}{\binom{N}{y}}$$
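The p-value described here, the chance of x or more significant genes among the y drawn, is the upper tail of the hypergeometric distribution. The sketch below sums that tail with exact integer arithmetic; the function name and the example numbers are assumptions, and a statistics package would normally be used in practice.

```python
from math import comb


def hypergeom_p_value(N, n, y, x):
    """P(at least x significant genes among y drawn from N, n of them significant)."""
    total = comb(N, y)  # number of equally likely sets of y genes
    upper = min(n, y)   # cannot draw more significant genes than exist or are drawn
    favourable = sum(comb(n, k) * comb(N - n, y - k) for k in range(x, upper + 1))
    return favourable / total


# Hypothetical small example: 10 genes, 4 significant, 3 in the category,
# at least 2 of those significant.
p_value = hypergeom_p_value(N=10, n=4, y=3, x=2)
print(f"p-value = {p_value:.4f}")
```

In this example the favourable sets number 36 (exactly 2) plus 4 (exactly 3), giving 40/120.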

381 genes that were differentially expressed after treating a cell line with three different carcinogens: Dex, E2 and irradiation
Conditions shown: Dex_Day1, Dex_Day2, Dex_Day3, E2_Day4, E2_Day7, E2_Day10, Irr_Day1, Irr_Day2, Irr_Day3

Up

Finding important functional groups for up-regulated genes
Using the "Ease" annotation tool, we obtained the following significant gene ontologies: Up_DexANDNE2ANDirr_381_GO.htm
Homework:
1) Download and install Ease.
2) Select the top 20 most significantly up-regulated genes in our W-C dataset (using the three-way ANOVA analysis) and identify significantly over-represented categories.
3) Repeat the analysis with 30, 40, 50 and 100 up-regulated and down-regulated genes.
4) Prepare questions for the next class regarding problems you run into.

Regulating Transcription
The transcription factor itself does not need to be transcriptionally regulated.

Modeling Microarray Data
Mathematical/Statistical Models
Computer Algorithms/Software