Published by Emely Gallamore. Modified about 1 year ago.

1
Supervised and unsupervised analysis of gene expression data
Bing Zhang
Department of Biomedical Informatics, Vanderbilt University

2
Overall workflow of gene expression studies
[Workflow diagram]
Stages shown: Biological question; Experimental design; Data analysis; Hypothesis; Experimental validation
Platforms shown:
  Microarray: image analysis, yielding signal intensities
  RNA-Seq: reads mapping, yielding read counts
  Shotgun proteomics: peptide/protein ID, yielding spectral counts or peak intensities

3
Data matrix
Rows: genes; columns: samples
Values: signal intensities, read counts, or spectral counts / peak intensities

4
Three major goals of gene expression studies
Differential expression (supervised analysis)
  Input: gene expression data, class labels of the samples
  Output: differentially expressed genes
  e.g. disease biomarker discovery
Clustering (unsupervised analysis)
  Input: gene expression data
  Output: groups of similar samples or genes
  e.g. disease subtype identification
Classification (machine learning)
  Input: gene expression data, class labels of the samples (training data)
  Output: prediction model
  e.g. disease diagnosis and prognosis

5
Data preprocessing I: missing value imputation
Replace with zeros
  Replace all missing values with 0
Replace with row averages
  Replace missing values with the mean of the available values in each row (gene)
KNN imputation
  Estimate missing values from the K nearest neighbors
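The row-average and KNN strategies can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the use of `None` for missing values, the root-mean-square distance over shared columns, and the zero fallback for a row with no observed values are all assumptions made for the example.

```python
from statistics import mean

def impute_row_mean(matrix):
    """Replace None entries with the mean of the observed values in the same row (gene)."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values falls back to 0.0.
        fill = mean(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed

def impute_knn(matrix, k=2):
    """Fill each None with the mean of that column in the k most similar rows."""
    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have a value in column j.
            candidates = [(distance(row, other), other[j])
                          for r, other in enumerate(matrix)
                          if r != i and other[j] is not None]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = mean(nearest)
    return imputed

# Hypothetical data: rows are genes, columns are samples.
genes = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [8.0, 9.0, 10.0],
]
print(impute_row_mean(genes)[0])
print(impute_knn(genes, k=2)[0])
```

In this toy matrix the row average fills the gap with 1.5, while KNN borrows from the two genes with similar profiles and fills it with about 3.1, which shows why the neighbor-based estimate can be closer to the local pattern.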

6
Data preprocessing II: normalization
Goal: remove systematic variation and make experiments comparable
  Use control or housekeeping genes that you would expect to have the same expression level across all experiments
  Use spike-in controls
  Equalize the mean values for all experiments (global normalization)
  Match the data distributions for all experiments (quantile normalization)
[Figure panels: no normalization; global normalization; quantile normalization]
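Global and quantile normalization can be illustrated with a short sketch. The code below assumes the data are given as one list of values per sample, that all sample means are positive (so scaling is possible), and that ties within a sample are broken by position; the function names are hypothetical. Scaling is only one way to equalize means; shifting would be another.

```python
from statistics import mean

def global_normalize(columns):
    """Scale each sample so that all samples end up with the same mean."""
    target = mean(mean(col) for col in columns)
    return [[v * target / mean(col) for v in col] for col in columns]

def quantile_normalize(columns):
    """Force every sample to share the same distribution of values."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the values at each rank across samples.
    reference = [mean(col[rank] for col in sorted_cols) for rank in range(n)]
    normalized = []
    for col in columns:
        # Indices of the sample's values from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        new_col = [0.0] * n
        for rank, idx in enumerate(order):
            new_col[idx] = reference[rank]
        normalized.append(new_col)
    return normalized

# Hypothetical data: three samples, four genes each.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
print([round(mean(col), 3) for col in global_normalize(samples)])
print(quantile_normalize(samples))
```

After quantile normalization every sample contains exactly the same set of values, only assigned to different genes according to each gene's rank within its sample.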

7
Data preprocessing III: transformation
Goal: make the data more closely meet the assumptions of a statistical inference procedure
  e.g. log transformation to improve normality
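A log transformation takes one line per value. The sketch below assumes base 2 and a pseudocount of 1 to keep zero values finite; both choices are illustrative and not stated on the slide.

```python
import math

def log2_transform(values, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount (an assumption) keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in values]

# Hypothetical raw values spanning three orders of magnitude.
raw = [0.0, 1.0, 3.0, 7.0, 1023.0]
print(log2_transform(raw))  # [0.0, 1.0, 2.0, 3.0, 10.0]
```

The transformed values are evenly spread even though the raw values are heavily right-skewed, which is the property the normality argument relies on.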

8
Differential expression (supervised analysis)
[Figure: data matrix of genes by samples, with the samples split into a control group and a case (treatment) group]
Which genes are differentially expressed between the two groups?

9
Fold change
n-fold change
Arbitrarily selected fold-change cutoffs, usually ≥ 2-fold
Pros
  Intuitive
  Simple and rapid
Cons
  Outlier observations can create an apparent difference
  Many real biological differences cannot pass the 2-fold cutoff
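A small sketch makes the outlier problem concrete. It assumes linear-scale (not log-transformed) positive expression values and defines fold change as the ratio of group means; the function names and data are hypothetical.

```python
from statistics import mean

def fold_change(control, case):
    """Ratio of the mean case expression to the mean control expression."""
    return mean(case) / mean(control)

def passes_cutoff(control, case, cutoff=2.0):
    """True if the gene changes at least `cutoff`-fold in either direction."""
    fc = fold_change(control, case)
    return fc >= cutoff or fc <= 1.0 / cutoff

control = [10.0, 12.0, 11.0]
steady = [30.0, 33.0, 36.0]    # consistent increase in every case sample
outlier = [11.0, 10.0, 60.0]   # a single extreme sample drives the mean

print(fold_change(control, steady), passes_cutoff(control, steady))
print(round(fold_change(control, outlier), 2), passes_cutoff(control, outlier))
```

Both genes pass the 2-fold cutoff, although only the first shows a change in every sample; the fold change alone cannot tell them apart.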

10
Statistical analysis: hypothesis testing
A statistical hypothesis is an assumption about a population parameter, e.g. the group mean.
Null hypothesis vs. alternative hypothesis
[Figure: data matrix of genes by samples, with the samples split into a control group and a case (treatment) group]

11
t-test
[Figure: graph courtesy of ...]
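The test statistic of a two-sample Student's t-test can be computed as follows. This sketch assumes the pooled-variance (equal variance) form of the test; the slide does not say which variant is meant, and the function name and data are hypothetical.

```python
from statistics import mean, variance

def student_t(control, case):
    """Two-sample Student's t statistic with pooled variance.

    Returns the statistic and its degrees of freedom.
    """
    n1, n2 = len(control), len(case)
    # Pooled estimate of the common variance of the two groups.
    pooled = ((n1 - 1) * variance(control) + (n2 - 1) * variance(case)) / (n1 + n2 - 2)
    standard_error = (pooled * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(case) - mean(control)) / standard_error, n1 + n2 - 2

# Hypothetical expression values of one gene in two groups.
control = [1.0, 2.0, 3.0]
case = [4.0, 5.0, 6.0]
t, df = student_t(control, case)
print(round(t, 3), df)
```

Unlike the fold change, the statistic divides the difference in means by its standard error, so a difference caused by one noisy sample yields a smaller t than a consistent one.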

12
p value
p value: the probability, under the null hypothesis, of a test statistic at least as extreme as the one observed; for a two-sided test, the sum of the two tail areas
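The "probability of a more extreme statistic" can be made tangible with an exact permutation test, which is a swapped-in technique here: it replaces the t distribution used by the t-test with an enumeration of all relabelings of the samples. The function name and data are hypothetical, and the approach is only practical for small groups.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(control, case):
    """Two-sided exact permutation p value for the difference in group means."""
    pooled = control + case
    observed = abs(mean(case) - mean(control))
    extreme = total = 0
    # Enumerate every way of choosing which samples are labeled "case".
    for case_idx in combinations(range(len(pooled)), len(case)):
        chosen = set(case_idx)
        perm_case = [pooled[i] for i in chosen]
        perm_control = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabelings at least as extreme as the observed difference
        # (small tolerance guards against floating-point rounding).
        if abs(mean(perm_case) - mean(perm_control)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

control = [1.0, 2.0, 3.0]
case = [4.0, 5.0, 6.0]
print(permutation_p_value(control, case))  # 2 of the 20 relabelings are as extreme: 0.1
```

Using the absolute difference counts both tails, which corresponds to the "sum of tail areas" in the two-sided definition.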

13
Correction for multiple testing: why?
In an experiment with a 10,000-gene array in which the significance level is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even if none is differentially expressed.
The probability of drawing the wrong conclusion in at least one of the n different tests (assuming independent tests) is
  $\alpha_{\text{global}} = 1 - (1 - \alpha_{\text{gene}})^{n}$
where $\alpha_{\text{gene}}$ is the significance level at the single-gene level and $\alpha_{\text{global}}$ is the global significance level.
[Figure: data matrix; each row is a test, n rows in total]
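A few lines of arithmetic show how quickly the chance of at least one false positive grows with the number of tests. The sketch assumes independent tests, and the function name is hypothetical.

```python
def family_wise_error(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Expected number of false positives on a 10,000-gene array at alpha = 0.05.
expected_false_positives = 10_000 * 0.05
print(expected_false_positives)

for n in (1, 10, 100, 10_000):
    print(n, round(family_wise_error(0.05, n), 4))
```

With only 100 tests the probability of at least one wrong call already exceeds 99%, which is why a per-gene threshold of 0.05 is not usable on a whole array.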

14
Correction for multiple testing: how?
Control the family-wise error rate (FWER), the probability of at least one type I error in the entire set (family) of hypotheses tested.
  e.g. standard Bonferroni correction: corrected p value = uncorrected p value x number of genes tested
Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses.
  e.g. Benjamini and Hochberg correction:
    Rank all genes according to their p value
    Pick a desired FDR level, q (e.g. 5%)
    Starting from the top of the list, accept all genes with $p \le \frac{i}{m} q$, where i is the gene's rank (the number of genes accepted so far) and m is the total number of genes tested.
[Table columns: p; Bonferroni; rank (i); q; (i/m)*q; significant?]
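Both corrections fit in a short sketch. The Benjamini and Hochberg function below follows the usual step-up formulation, which rejects every hypothesis up to the largest rank i with p ≤ (i/m)q; this is slightly more permissive than stopping at the first failure when walking down the list. Function names and p values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank whose p value is below its threshold.
    last = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            last = rank
    significant = [False] * m
    for idx in order[:last]:
        significant[idx] = True
    return significant

# Hypothetical p values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(adj <= 0.05 for adj in bonferroni(p)))   # genes significant after Bonferroni
print(sum(benjamini_hochberg(p, q=0.05)))          # genes significant at 5% FDR
```

On this list Bonferroni keeps one gene and the FDR procedure keeps two, reflecting that FDR control is less conservative than FWER control.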

15
Clustering (unsupervised analysis)
Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
Unsupervised techniques that do not require sample annotation in the process
Identify candidate subgroups in complex data, e.g. identification of novel subtypes in cancer, identification of co-expressed genes
[Table: genes (rows: TNNC, DKK, ZNF, CHST, FABP, MGST, DEFA, VIL, AKAP, HS3ST, ...) by samples (columns: Sample_1, Sample_2, Sample_3, Sample_4, Sample_5, ...)]

16
Hierarchical clustering
Agglomerative hierarchical clustering (bottom-up)
  Start with all sample units in n clusters of size 1.
  At each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
  The algorithm stops when all sample units are combined into a single cluster of size n.
Requires distance measurements
  Between two objects
  Between clusters
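The bottom-up procedure can be sketched directly from the three steps. The example assumes Euclidean distance between objects and single linkage between clusters, neither of which is fixed by the slide; the naive search over all cluster pairs is chosen for readability, not speed, and all names are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2, points):
    """Smallest pairwise distance between members of two clusters."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, cluster_distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]   # n clusters of size 1
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return history

# Hypothetical samples described by two expression values each.
samples = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
for left, right, height in agglomerate(samples, single_linkage):
    print(left, right, round(height, 3))
```

Each entry of the history records which clusters were joined and at what distance, which is the information a dendrogram draws.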

17
Between-object distance measurement
Euclidean distance
  Focuses on the absolute expression value
Pearson correlation coefficient
  Focuses on the shape of the expression profile
  Linear relationship
Spearman correlation coefficient
  Focuses on the shape of the expression profile
  Monotonic relationship
  Less sensitive but more robust than Pearson
[Table: genes (rows: TNNC, DKK, ZNF, CHST, FABP, ...) by samples (columns: Sample_1, Sample_2, Sample_3, Sample_4, Sample_5, ...)]
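The three measures can be compared on toy profiles. This is a sketch with hypothetical function names and data; Spearman is computed as the Pearson correlation of average ranks, and a correlation r is commonly turned into a distance as 1 - r, which is an assumption not stated on the slide.

```python
from statistics import mean

def euclidean(a, b):
    """Euclidean distance: sensitive to the absolute expression level."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pearson(a, b):
    """Pearson correlation: measures how linear the relationship is."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            result[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]   # same shape as gene_a, ten times the level
gene_c = [1.0, 4.0, 9.0, 16.0, 25.0]      # monotonic in gene_a but not linear

print(round(euclidean(gene_a, gene_b), 2), round(pearson(gene_a, gene_b), 4))
print(round(pearson(gene_a, gene_c), 4), round(spearman(gene_a, gene_c), 4))
```

gene_b is far from gene_a in Euclidean terms yet perfectly correlated with it, and gene_c reaches a perfect Spearman correlation while its Pearson correlation stays below 1.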

18
Different measurement, different distance
Most similar profile to GeneA (blue) based on different distance measurements:
  Euclidean: GeneB (pink)
  Pearson: GeneC (green)
  Spearman: GeneD (red)

19
Between-cluster distance measurement
Single linkage: the smallest of all pairwise distances between objects in the two clusters
Complete linkage: the largest of all pairwise distances between objects in the two clusters
Average linkage: the average of all pairwise distances between objects in the two clusters
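The three linkage rules differ only in how they summarize the same list of pairwise distances, as this sketch shows. Euclidean distance between objects is assumed, and the function names and points are hypothetical.

```python
import math
from statistics import mean

def pairwise_distances(cluster_1, cluster_2):
    """Euclidean distance between every object in one cluster and every object in the other."""
    return [math.dist(p, q) for p in cluster_1 for q in cluster_2]

def single_linkage(cluster_1, cluster_2):
    return min(pairwise_distances(cluster_1, cluster_2))

def complete_linkage(cluster_1, cluster_2):
    return max(pairwise_distances(cluster_1, cluster_2))

def average_linkage(cluster_1, cluster_2):
    return mean(pairwise_distances(cluster_1, cluster_2))

# Two hypothetical clusters of two objects each, placed on a line.
cluster_1 = [[0.0, 0.0], [1.0, 0.0]]
cluster_2 = [[3.0, 0.0], [5.0, 0.0]]

print(single_linkage(cluster_1, cluster_2))    # closest pair: 2.0
print(complete_linkage(cluster_1, cluster_2))  # farthest pair: 5.0
print(average_linkage(cluster_1, cluster_2))   # mean of 3, 5, 2, 4: 3.5
```

Because the same two clusters get distances of 2.0, 5.0, or 3.5 depending on the rule, the choice of linkage can change which clusters are merged first.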

20
Visualization of hierarchical clustering results
Dendrogram
  Output of a hierarchical clustering
  Tree structure with the genes or samples as the leaves
  The height of the join indicates the distance between the branches
Heat map
  Graphical representation of data in which the values are represented as colors

21
Example #1
Clustered display of data from time course of serum stimulation of primary human fibroblasts. The sequence-verified named genes in these clusters contain multiple genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate–early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes.
Eisen et al. Cluster analysis and display of genome-wide expression patterns. PNAS, 1998

22
Example #2
Sorlie et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. PNAS, 2001

23
Summary
Three major goals of gene expression studies
  Differential expression (supervised analysis)
  Clustering (unsupervised analysis)
  Classification (machine learning)
Gene expression data pre-processing steps
  Missing data imputation
  Normalization
  Transformation
Differential expression analysis
  Student's t-test
  Multiple-test adjustment
    Control the family-wise error rate (FWER)
    Control the false discovery rate (FDR)
Agglomerative hierarchical clustering
  Bottom-up
  Between-object distance measurement
    Euclidean distance
    Pearson's correlation coefficient
    Spearman's correlation coefficient
  Between-cluster distance measurement
    Single linkage
    Complete linkage
    Average linkage
  Visualization
    Dendrogram
    Heat map

24
Reading
Sabates-Bellver et al., Mol Cancer Res, 5(12): , 2007
