Supervised and unsupervised analysis of gene expression data Bing Zhang Department of Biomedical Informatics Vanderbilt University

Presentation transcript:

Overall workflow of gene expression studies
Biological question -> Experimental design -> Data generation -> Data analysis -> Hypothesis -> Experimental validation
Data generation by platform:
- Microarray: image analysis, giving signal intensities
- RNA-Seq: reads mapping, giving read counts
- Shotgun proteomics: peptide/protein ID, giving spectral counts or peak intensities

Data matrix
Genes in rows, samples in columns. Entries are signal intensities, read counts, or spectral counts or peak intensities, depending on the platform.

Three major goals of gene expression studies
- Differential expression (supervised analysis)
  - Input: gene expression data, class labels of the samples
  - Output: differentially expressed genes
  - e.g. disease biomarker discovery
- Clustering (unsupervised analysis)
  - Input: gene expression data
  - Output: groups of similar samples or genes
  - e.g. disease subtype identification
- Classification (machine learning)
  - Input: gene expression data, class labels of the samples (training data)
  - Output: prediction model
  - e.g. disease diagnosis and prognosis

Data preprocessing I: missing value imputation
- Replace with zeros: replace all missing values with 0
- Replace with row averages: replace missing values with the mean of the available values in each row (gene)
- KNN imputation: estimate missing values via K-nearest neighbors analysis
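The row-average and KNN strategies can be sketched as follows. This is a minimal illustration, not the method of any particular package: missing values are assumed to be encoded as `None`, the function names are hypothetical, and the neighbor distance (root-mean-square difference over the samples observed in both genes) is one reasonable choice among several.

```python
import math

def impute_row_mean(matrix):
    """Replace each missing value (None) with the mean of the observed values in its row (gene)."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Fall back to 0.0 only when a gene has no observed values at all
        fill = sum(observed) / len(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed

def impute_knn(matrix, k=2):
    """Replace each missing value with the mean of that sample's value in the k nearest genes."""
    def distance(a, b):
        # Root-mean-square difference over samples observed in both genes
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes with an observed value in sample j
            candidates = [(distance(row, other), other[j])
                          for g, other in enumerate(matrix)
                          if g != i and other[j] is not None]
            candidates = [c for c in candidates if c[0] < math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the value missing if no usable neighbor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Genes in rows, samples in columns
data = [[1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [1.1, 2.1, 3.2],
        [9.0, 9.0, 9.0]]
print(impute_row_mean(data)[0])
print(impute_knn(data, k=2)[0])
```

In the example, the row average fills the gap using only the gene's own values, while KNN borrows the value from the two genes with the most similar profiles.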

Data preprocessing II: normalization
Purpose: to remove systematic variation and make experiments comparable.
- Use control or housekeeping genes that are expected to have the same expression level across all experiments
- Use spike-in controls
- Equalize the mean values across all experiments (global normalization)
- Match the data distributions across all experiments (quantile normalization)
Figure panels: no normalization, global normalization, quantile normalization.
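Global and quantile normalization can be sketched as below, assuming a complete genes-by-samples matrix with no missing values. The function names are hypothetical. The global version shifts each sample additively, which assumes log-scale data; on raw intensities a multiplicative scaling would be the analogous step. The quantile version does not average tied values, which fuller implementations usually do.

```python
def global_normalize(matrix):
    """Shift every sample (column) so that all samples share the same mean."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    col_means = [sum(row[j] for row in matrix) / n_genes for j in range(n_samples)]
    target = sum(col_means) / n_samples
    return [[row[j] - col_means[j] + target for j in range(n_samples)] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Reference distribution: mean of the r-th smallest value across samples
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Genes of sample j ordered from lowest to highest value
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            normalized[g][j] = reference[rank]
    return normalized

expression = [[2.0, 4.0],
              [4.0, 8.0],
              [6.0, 6.0]]
print(global_normalize(expression))
print(quantile_normalize(expression))
```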

Data preprocessing III: transformation
Purpose: to make the data more closely meet the assumptions of a statistical inference procedure.
- Log transformation to improve normality

Differential expression (supervised analysis)
Data matrix of genes by samples, with the samples split into a control group and a case (treatment) group.
Which genes are differentially expressed between the two groups?

Fold change
- n-fold change
  - Arbitrarily selected fold-change cutoffs
  - Usually ≥ 2-fold
- Pros
  - Intuitive
  - Simple and rapid
- Cons
  - Outlier observations can create an apparent difference
  - Many real biological differences cannot pass the 2-fold cutoff
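A small sketch of a fold-change filter, assuming positive raw-scale expression values and group means as the summary. The function names and numbers are illustrative only. The example reproduces both drawbacks listed above: a single outlier passes the 2-fold cutoff, while a consistent 1.4-fold shift does not.

```python
import math

def log2_fold_change(control, case):
    """log2 of the ratio of group means (case over control), for raw-scale values."""
    mean_control = sum(control) / len(control)
    mean_case = sum(case) / len(case)
    return math.log2(mean_case / mean_control)

def passes_fold_cutoff(control, case, fold=2.0):
    """True if the gene changes by at least `fold` in either direction."""
    return abs(log2_fold_change(control, case)) >= math.log2(fold)

control = [10.0, 11.0, 9.0, 10.0]
outlier_case = [10.0, 11.0, 9.0, 50.0]     # one extreme sample drives the mean
consistent_case = [14.0, 15.0, 13.0, 14.0]  # every sample is modestly higher

print(passes_fold_cutoff(control, outlier_case))
print(passes_fold_cutoff(control, consistent_case))
```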

Statistical analysis: hypothesis testing
A statistical hypothesis is an assumption about a population parameter, e.g. the group mean.
For each gene (row), a null hypothesis is tested against an alternative hypothesis by comparing the control samples with the case (treatment) samples.

t-test (graph courtesy of …)
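A minimal sketch of the two-sample Student's t statistic for a single gene, assuming equal variances in the two groups (the pooled-variance form). The function name `student_t` and the example values are illustrative assumptions. Converting t into a p value requires the t distribution, which the Python standard library does not provide, so that step is omitted here.

```python
import math
from statistics import mean, variance

def student_t(control, case):
    """Return (t statistic, degrees of freedom) for a pooled-variance two-sample t-test."""
    n1, n2 = len(control), len(case)
    # Pooled estimate of the common variance (sample variances use n - 1)
    pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(case)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(case) - mean(control)) / standard_error
    return t, n1 + n2 - 2

t, df = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 3), df)
```

A large absolute t means the difference between group means is large relative to the variability within the groups.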

p value
p value: the probability, under the null hypothesis, of a test statistic at least as extreme as the one observed; equivalently, the sum of the tail areas.
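The idea of "a statistic at least as extreme as the one observed" can be illustrated with an exact permutation test. Note that this is a swapped-in technique, not the t-distribution tail areas of the slide: the null distribution is built by enumerating every relabeling of the samples, and the statistic is the absolute difference in group means. The function name is hypothetical, and full enumeration is practical only for small groups.

```python
from itertools import combinations

def permutation_p_value(control, case):
    """Two-sided exact permutation p value for the difference in group means."""
    pooled = list(control) + list(case)
    n1 = len(control)

    def abs_mean_diff(group1_idx):
        g1 = [pooled[i] for i in group1_idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in group1_idx]
        return abs(sum(g2) / len(g2) - sum(g1) / len(g1))

    observed = abs_mean_diff(tuple(range(n1)))
    extreme = total = 0
    # Every way of choosing which samples are labeled "control"
    for idx in combinations(range(len(pooled)), n1):
        total += 1
        # Small tolerance guards against floating-point ties
        if abs_mean_diff(idx) >= observed - 1e-12:
            extreme += 1
    return extreme / total

print(permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

With three samples per group there are only 20 relabelings, so the smallest attainable two-sided p value is 2/20 = 0.1, which shows why small sample sizes limit statistical power.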

Correction for multiple testing: why?
- In an experiment with a 10,000-gene array in which the significance level is set at 0.05, about 10,000 × 0.05 = 500 genes would be expected to be called significant even if none is differentially expressed.
- The probability of drawing the wrong conclusion in at least one of n different (independent) tests is

  $$\alpha_g = 1 - (1 - \alpha)^n$$

  where $\alpha$ is the significance level at the single-gene level and $\alpha_g$ is the global significance level.
- Each row of the data matrix is one test, so n is the number of genes tested.
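The two quantities on this slide are quick to compute. The sketch below assumes independent tests and uses hypothetical function names; the global significance level is the standard probability of at least one false positive, 1 - (1 - alpha)^n.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def global_significance_level(n_tests, alpha):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(expected_false_positives(10_000, 0.05))
for n in (1, 10, 100):
    print(n, round(global_significance_level(n, 0.05), 4))
```

Even at 100 tests the chance of at least one false positive is close to 1, so an uncorrected threshold of 0.05 is uninformative at the scale of a whole array.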

Correction for multiple testing: how?
- Control the family-wise error rate (FWER), the probability of at least one type I error in the entire set (family) of hypotheses tested.
  - e.g. standard Bonferroni correction: corrected p value = uncorrected p value × number of genes tested
- Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses.
  - e.g. Benjamini and Hochberg correction:
    1. Rank all genes according to their p value
    2. Pick a desired FDR level, q (e.g. 5%)
    3. Starting from the top of the list, accept all genes with p ≤ (i/m) × q, where i is the number of genes accepted so far (the rank of the gene) and m is the total number of genes tested
Example table columns: p, Bonferroni, rank (i), q, (i/m) × q, significant?
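Both corrections can be sketched in a few lines. The function names and p values are illustrative assumptions. The Benjamini-Hochberg function follows the usual step-up form of the procedure: it finds the largest rank i with p ≤ (i/m) × q and accepts every gene up to that rank, which can accept slightly more genes than stopping at the first failure.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Return a True/False flag per gene, controlling the FDR at level q."""
    m = len(p_values)
    # Gene indices ordered from smallest to largest p value
    order = sorted(range(m), key=lambda g: p_values[g])
    cutoff_rank = 0
    for rank, g in enumerate(order, start=1):
        if p_values[g] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that meets its threshold
    significant = [False] * m
    for g in order[:cutoff_rank]:
        significant[g] = True
    return significant

p = [0.001, 0.015, 0.045, 0.5, 0.035]
print(bonferroni(p))
print(benjamini_hochberg(p, q=0.05))
```

In this example Bonferroni at 0.05 keeps only the first gene, while Benjamini-Hochberg keeps the first two, illustrating that FDR control is less conservative than FWER control.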

Clustering (unsupervised analysis)
- Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
- Unsupervised techniques that do not require sample annotation in the process
- Identify candidate subgroups in complex data, e.g. identification of novel subtypes in cancer, identification of co-expressed genes
Example data matrix: genes in rows, samples (Sample_1, Sample_2, Sample_3, Sample_4, Sample_5, ...) in columns.

Hierarchical clustering
- Agglomerative hierarchical clustering (bottom-up)
  - Start out with all sample units in n clusters of size 1.
  - At each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
  - The algorithm stops when all sample units are combined into a single cluster of size n.
- Requires distance measurements
  - Between two objects
  - Between clusters
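The bottom-up procedure can be sketched directly from the three steps above. This is a naive illustration, not an efficient implementation: it assumes Euclidean distance between objects and single linkage between clusters, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, profiles):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)

def agglomerative(profiles, cluster_distance=single_linkage):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    # Start with n clusters of size 1, each holding one object index
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the shortest distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

profiles = [[0.0], [1.0], [5.0], [5.5]]
for cluster_a, cluster_b, distance in agglomerative(profiles):
    print(cluster_a, cluster_b, distance)
```

The recorded merge distances are what a dendrogram draws as the heights of its joins.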

Between-object distance measurement
- Euclidean distance
  - Focuses on the absolute expression value
- Pearson correlation coefficient
  - Focuses on the shape of the expression profile
  - Linear relationship
- Spearman correlation coefficient
  - Focuses on the shape of the expression profile
  - Monotonic relationship
  - Less sensitive but more robust than Pearson
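The three measures can be compared on small profiles with the sketch below. The function names are hypothetical, Spearman is computed as Pearson on average ranks, and turning a correlation r into a distance as 1 - r is a common convention assumed here rather than something stated on the slide.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

gene_a = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0, 20.0, 30.0, 40.0]   # same shape, much higher level
curved = [1.0, 4.0, 9.0, 16.0]      # monotonic but not linear

print(euclidean_distance(gene_a, scaled), 1 - pearson(gene_a, scaled))
print(1 - pearson(gene_a, curved), 1 - spearman(gene_a, curved))
```

The scaled profile is far from gene_a in Euclidean terms yet perfectly correlated with it, and the curved profile is perfectly rank-correlated while its Pearson correlation falls below 1.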

Different measurement, different distance
The profile most similar to GeneA (blue) depends on the distance measurement:
- Euclidean: GeneB (pink)
- Pearson: GeneC (green)
- Spearman: GeneD (red)

Between-cluster distance measurement
Pairwise distances are taken between each member of one cluster and each member of the other.
- Single linkage: the smallest of all pairwise distances
- Complete linkage: the largest of all pairwise distances
- Average linkage: the average of all pairwise distances
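The three linkage rules differ only in how they summarize the same set of pairwise distances, as this sketch shows. The function names are hypothetical, and one-dimensional points with absolute difference as the distance are used purely to keep the example easy to check by hand.

```python
def pairwise_distances(cluster_a, cluster_b, dist):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(distances):
    return min(distances)

def complete_linkage(distances):
    return max(distances)

def average_linkage(distances):
    return sum(distances) / len(distances)

cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]
d = pairwise_distances(cluster_a, cluster_b, lambda x, y: abs(x - y))

print(sorted(d))
print(single_linkage(d), complete_linkage(d), average_linkage(d))
```

Single linkage tends to chain clusters together through their closest members, whereas complete linkage favors compact clusters; average linkage sits between the two.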

Visualization of hierarchical clustering results
- Dendrogram
  - Output of a hierarchical clustering
  - Tree structure with the genes or samples as the leaves
  - The height of the join indicates the distance between the branches
- Heat map
  - Graphical representation of data where the values are represented as colors

Example #1
Clustered display of data from a time course of serum stimulation of primary human fibroblasts. The sequence-verified named genes in these clusters contain multiple genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate–early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes.
Eisen et al. Cluster analysis and display of genome-wide expression patterns. PNAS, 1998

Example #2
Sorlie et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. PNAS, 2001

Summary
- Three major goals of gene expression studies
  - Differential expression (supervised analysis)
  - Clustering (unsupervised analysis)
  - Classification (machine learning)
- Gene expression data preprocessing steps
  - Missing data imputation
  - Normalization
  - Transformation
- Differential expression analysis
  - Student's t-test
  - Multiple-test adjustment
    - Control the family-wise error rate (FWER)
    - Control the false discovery rate (FDR)
- Agglomerative hierarchical clustering
  - Bottom-up
  - Between-object distance measurement
    - Euclidean distance
    - Pearson's correlation coefficient
    - Spearman's correlation coefficient
  - Between-cluster distance measurement
    - Single linkage
    - Complete linkage
    - Average linkage
  - Visualization
    - Dendrogram
    - Heat map

Reading
Sabates-Bellver et al., Mol Cancer Res, 5(12): , 2007