September 24, 2003 Microarray data analysis


Copyright notice
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these PowerPoints for educational purposes, but please acknowledge the source. The book has a homepage that includes hyperlinks to the book chapters.

Microarray data analysis
Microarray data analysis begins with a data matrix (gene expression values versus samples). Typically, there are many genes (>> 10,000) and few samples (~ 10). The analysis involves preprocessing, inferential statistics, and descriptive statistics. Page 190

Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
-- different labeling efficiencies of Cy3 and Cy5
-- uneven spotting of DNA onto an array surface
-- variations in RNA purity or quantity
-- variations in washing efficiency
-- variations in scanning efficiency
Page 191

Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 191

Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 192

Data analysis: global normalization
Global normalization is used to correct two or more data sets.
Example: total fluorescence in the Cy3 channel = 4 million units; total fluorescence in the Cy5 channel = 2 million units. The uncorrected ratio for a gene could then show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation. Page 192

Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
Page 192
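The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular software package: the function name, the three-gene intensities, and the background value of 100 units are all hypothetical, and the average ratio is taken as a simple arithmetic mean.

```python
from statistics import mean

def global_normalize(cy5, cy3, background5=0.0, background3=0.0):
    """Return background-subtracted Cy5/Cy3 ratios rescaled to average 1."""
    # Step 1: subtract the background measured on a blank region of the array.
    red = [value - background5 for value in cy5]
    green = [value - background3 for value in cy3]
    # Raw per-gene ratios; these still carry any global dye imbalance.
    ratios = [r / g for r, g in zip(red, green)]
    # Step 2: rescale so that the average ratio equals 1.
    scale = mean(ratios)
    return [ratio / scale for ratio in ratios]

# Hypothetical array in which the Cy3 channel is twice as bright overall.
cy3 = [2100.0, 4100.0, 8100.0]
cy5 = [1100.0, 2100.0, 4100.0]
normalized = global_normalize(cy5, cy3, background5=100.0, background3=100.0)
print(normalized)
```

Before normalization every gene in this toy array shows a ratio of 0.5, which would look like 2-fold regulation; after rescaling, the ratios center on 1.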

Microarray data preprocessing
Some researchers use housekeeping genes for global normalization. See the Human Gene Expression (HuGE) Index. Page 192

Scatter plots
-- Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
-- Each dot corresponds to a gene expression value
-- Most dots fall along a line
-- Outliers represent up-regulated or down-regulated genes
Page 193

Scatter plot analysis of microarray data Page 193

[Figure: Differential gene expression in different tissue and cell types (brain, astrocyte, fibroblast)]

[Figure: scatter plot of expression level (sample 1) versus expression level (sample 2); annotations: expression level high, low; up, down] Page 193

Log-log transformation Page 195

Scatter plots
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.
time    behavior      raw ratio value    log2 ratio value
t=0     basal         1.0                0.0
t=1h    no change     1.0                0.0
t=2h    2-fold up     2.0                1.0
t=3h    2-fold down   0.5                -1.0
Page 194, 197
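The symmetry gained from the log transformation can be checked with a short Python sketch. The ratio values here are illustrative assumptions chosen to match the behaviors named on the slide (no change, 2-fold up, 2-fold down).

```python
import math

# Illustrative raw expression ratios for the four time points (assumed values).
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# On the raw scale, 2-fold up (2.0) and 2-fold down (0.5) sit at unequal
# distances from 1.0; on the log2 scale they become +1 and -1.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(time, raw_ratios[time], value)
```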

[Figure: log ratio versus mean log intensity; annotations: expression level high, low; up, down] Page 196

SNOMAD converts array data to scatter plots
[Figure: a linear-linear plot (EXP versus CON) and a log-log plot (Log10(Ratio) versus Mean(Log10(Intensity))), each with 2-fold lines and the regions EXP > CON and EXP < CON marked]

SNOMAD corrects local variance artifacts
[Figure: Log10(Ratio) versus Mean(Log10(Intensity)) with a robust local regression fit and residuals; a second panel shows the corrected Log10(Ratio) [residuals] versus Mean(Log10(Intensity)), with 2-fold lines and the regions EXP > CON and EXP < CON marked]

SNOMAD describes regulated genes in Z-scores
[Figure: corrected Log10(Ratio) versus Mean(Log10(Intensity)) with 2-fold lines and the locally estimated standard deviation of positive and negative ratios (Z = 1, Z = -1); a second panel shows the local Log10(Ratio) Z-score versus Mean(Log10(Intensity)) with contours at Z = ±1, ±2, and ±5]
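The idea of a locally estimated Z-score can be illustrated with a rough Python sketch. This is not the SNOMAD implementation: SNOMAD uses a robust local regression, whereas the sketch below swaps in a plain sliding window over genes of similar intensity. The function name, window size, and data are hypothetical.

```python
from statistics import mean, pstdev

def local_z_scores(mean_log_intensity, log_ratio, window=5):
    """Z-score each gene's log ratio against genes of similar intensity."""
    # Rank genes by intensity so that neighbours in rank have similar signal.
    order = sorted(range(len(log_ratio)), key=lambda i: mean_log_intensity[i])
    z_scores = [0.0] * len(log_ratio)
    half = window // 2
    for rank, gene in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neighbours = [log_ratio[j] for j in order[lo:hi]]
        centre = mean(neighbours)    # local estimate of the expected log ratio
        spread = pstdev(neighbours)  # local estimate of the standard deviation
        z_scores[gene] = (log_ratio[gene] - centre) / spread if spread > 0 else 0.0
    return z_scores

# Hypothetical data: seven genes, one of which (index 4) has a large log ratio.
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ratio = [0.0, 0.1, -0.1, 0.0, 2.0, 0.1, -0.1]
z = local_z_scores(intensity, ratio)
print(z)
```

Because the standard deviation is estimated locally, a given log ratio counts as more or less unusual depending on how noisy the measurements are at that intensity.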

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to p < 0.05. Page 199

Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma} = \frac{\text{difference between mean values}}{\text{variability (noise)}}$
Questions:
-- Is the sample size (n) adequate?
-- Are the data normally distributed?
-- Is the variance of the data known?
-- Is the variance the same in the two groups?
-- Is it appropriate to set the significance level to p < 0.05?
Page 199
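As a concrete instance of "difference between mean values divided by variability", the sketch below computes a two-sample t statistic in Python. The slide does not say which form of the t-test is meant; this example assumes Welch's version, which does not require equal variances in the two groups. The function name and the measurements are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group1, group2):
    """t = (difference between means) / (standard error of that difference)."""
    standard_error = sqrt(variance(group1) / len(group1)
                          + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / standard_error

# Hypothetical expression values for one gene in two groups of samples.
normal = [1.0, 2.0, 3.0]
diseased = [4.0, 5.0, 6.0]
t = welch_t(diseased, normal)
print(round(t, 4))
```

Converting t into a p value additionally requires the degrees of freedom and the t distribution, which this sketch leaves out.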

Inferential statistics
Paradigm                      Parametric test    Nonparametric
Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
Compare two paired groups     Paired t-test      Wilcoxon test
Compare 3 or more groups      ANOVA

Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes: p < (0.05)/10,000 or p < 5 x 10^-6. Page 199
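The arithmetic behind the Bonferroni correction is short enough to spell out in Python. The gene names and p values below are made up for illustration; only the 10,000 genes and the 0.05 level come from the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Divide the significance level by the number of measurements."""
    return alpha / n_tests

alpha = 0.05
n_genes = 10_000

# Number of genes expected to pass p < 0.05 by chance alone.
expected_by_chance = alpha * n_genes

# Corrected per-gene criterion.
threshold = bonferroni_threshold(alpha, n_genes)

# Hypothetical p values for three genes.
p_values = {"geneA": 1e-7, "geneB": 0.01, "geneC": 4e-6}
significant = [gene for gene, p in p_values.items() if p < threshold]
print(expected_by_chance, threshold, significant)
```

Note that geneB would pass the uncorrected p < 0.05 criterion but fails the corrected one.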

Significance analysis of microarrays (SAM)
SAM:
-- an Excel plug-in (URL: page 202)
-- modified t-test
-- adjustable false discovery rate
Page 200

[Figure: SAM plot of observed versus expected values, with up-regulated and down-regulated genes marked] Page 202

Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Page 203
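The two metrics named above can behave very differently on the same pair of expression profiles, as the following Python sketch suggests. The function names and the two three-point profiles are hypothetical.

```python
from math import sqrt
from statistics import mean

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Two hypothetical genes with the same shape but different magnitude.
gene1 = [1.0, 2.0, 3.0]
gene2 = [2.0, 4.0, 6.0]
distance = euclidean_distance(gene1, gene2)
correlation = pearson_correlation(gene1, gene2)
print(distance, correlation)
```

The Euclidean distance is well above zero because the magnitudes differ, while the correlation is essentially 1 because the profiles rise together. Which behavior is preferable depends on whether absolute level or shape matters for the question at hand.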

Data matrix (20 genes and 3 time points from Chu et al.) Page 205

3D plot (using S-PLUS software); axes: t=0, t=0.5, t=2.0 Page 205

Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204
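A minimal Python sketch of the agglomerative case follows. It assumes single linkage (the distance between two clusters is the distance between their closest members) and one-dimensional data; neither choice is stated on the slide. The positions assigned to the objects a to e are hypothetical.

```python
def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [frozenset([name]) for name in points]
    merges = []

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(points[p] - points[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (linkage(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append(sorted(merged))
    return merges

# Hypothetical one-dimensional positions for five objects.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
tree = agglomerate(positions)
for step in tree:
    print(",".join(step))
```

Each printed line is one node of the tree, from the first pair joined to the root that contains every object.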

Algorithmic Techniques
-- Hierarchical
-- K-Nearest Neighbors (K-Means, K-Median)
-- Neural Networks
-- Self-Organizing Maps
-- Principal Component Analysis

Agglomerative clustering
Starting from the individual objects a, b, c, d, e, the tree is built from the bottom up: first a,b are joined, then d,e, then c,d,e, and finally a,b,c,d,e, at which point the tree is constructed. Page 206

Divisive clustering
Starting from the full set a,b,c,d,e, the tree is built from the top down: first the group c,d,e is separated, then d,e, then a,b, and finally the individual objects a, b, c, d, e, at which point the tree is constructed. Page 206

[Figure: the same tree (a, b, c, d, e; a,b; d,e; c,d,e; a,b,c,d,e) read in the agglomerative direction and in the divisive direction] Page 206

Agglomerative and divisive clustering sometimes give conflicting results, as shown here. Page 207

Cluster and TreeView Page 208

Cluster and TreeView: clustering, PCA, SOM, K-means Page 208

Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) Page 209

Self-organizing maps (SOM) To download GeneCluster:

Self-organizing maps (SOM) One chooses a geometry of 'nodes', for example a 3x2 grid. Page 210

Self-organizing maps (SOM) The nodes are mapped into k-dimensional space, initially at random and then successively adjusted. Page 210

Self-organizing maps (SOM) Page 211

Unlike k-means clustering, which is unstructured, SOMs allow one to impose partial structure on the clusters. The principle of SOMs is as follows. One chooses an initial geometry of “nodes” such as a 3 x 2 rectangular grid (indicated by solid lines in the figure connecting the nodes). Hypothetical trajectories of nodes as they migrate to fit the data during successive iterations of the SOM algorithm are shown. Data points are represented by black dots, the six nodes of the SOM by large circles, and trajectories by arrows.
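The node migration described above can be sketched in Python as follows. This is a simplified illustration, not GeneCluster's algorithm: the learning-rate schedule, the neighborhood rule (the winning node moves fully, its immediate grid neighbors move half as much), the iteration count, and the data are all assumptions.

```python
import random

def train_som(data, rows=2, cols=3, iterations=200, seed=0):
    """Fit a rows x cols grid of nodes to the data points."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Nodes are mapped into k-dimensional space, initially at random.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        # The learning rate shrinks as training proceeds (assumed schedule).
        learning_rate = 0.5 * (1.0 - step / iterations)
        point = rng.choice(data)
        # The winning node is the one closest to the chosen data point.
        best = min(nodes, key=lambda n: sum((w - x) ** 2
                                            for w, x in zip(nodes[n], point)))
        for (r, c), weights in nodes.items():
            grid_distance = abs(r - best[0]) + abs(c - best[1])
            if grid_distance > 1:
                continue  # nodes far away on the grid are left alone
            influence = 1.0 if grid_distance == 0 else 0.5
            rate = learning_rate * influence
            # Move the node part of the way toward the data point.
            nodes[(r, c)] = [w + rate * (x - w) for w, x in zip(weights, point)]
    return nodes

# Hypothetical two-dimensional data, scaled to the unit square.
data = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.8], [0.85, 0.9], [0.5, 0.5]]
nodes = train_som(data)
for position, weights in sorted(nodes.items()):
    print(position, [round(w, 2) for w in weights])
```

Because grid neighbors of the winning node are dragged along with it, nodes that are adjacent on the grid tend to end up near similar data, which is the partial structure that k-means lacks.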

Self-organizing maps (SOM) Neighboring nodes tend to define 'related' clusters. An SOM based on a rectangular grid thus is analogous to an entomologist's specimen drawer in which adjacent compartments hold similar insects.

Two pre-processing steps essential to apply SOMs
1. Variation Filtering: Data were passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.
2. Normalization: The expression level of each gene was normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
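Both pre-processing steps can be sketched in Python. The slide does not give the filter criterion or the normalization formula, so this example assumes a minimum 2-fold change between the highest and lowest value for the filter, and rescaling each gene to mean 0 and standard deviation 1 for the normalization. The gene names and values are hypothetical.

```python
from statistics import mean, pstdev

def variation_filter(expression, min_fold=2.0):
    """Keep genes whose highest value is at least min_fold times the lowest."""
    return {gene: values for gene, values in expression.items()
            if max(values) / min(values) >= min_fold}

def normalize_gene(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]

# Hypothetical expression levels for three genes across three samples.
expression = {
    "geneA": [100.0, 200.0, 400.0],
    "geneB": [50.0, 52.0, 51.0],  # nearly invariant, removed by the filter
    "geneC": [1.0, 2.0, 4.0],
}
filtered = variation_filter(expression)
normalized = {gene: normalize_gene(values) for gene, values in filtered.items()}
print(normalized)
```

After normalization geneA and geneC have the same profile even though their absolute levels differ 100-fold, which illustrates the focus on the shape of the expression pattern.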

Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines
[Figure: PCA plot with principal component axis #1 (87%), principal component axis #2 (10%), and PC#3 (1%). Legend: Lead (P), Sodium (N), Control (C). Samples: C1, C2, C3, C4; N2, N3, N4; P1, P2, P3, P4]

Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D. For a matrix of m genes x n samples, create a new covariance matrix of size n x n. This transforms some large number of variables into a smaller number of uncorrelated variables called principal components (PCs). Page 211
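The covariance step and the first principal component can be sketched in plain Python. The sketch uses power iteration to find the dominant eigenvector, which is one simple option and not necessarily what any given PCA package does. The function names and the 4-gene by 2-sample matrix are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """For m genes x n samples, return the n x n covariance of the samples."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]

def first_principal_component(cov, iterations=100):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    n = len(cov)
    vector = [1.0] * n
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return vector

# Hypothetical matrix: 4 genes (rows) x 2 samples (columns).
matrix = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
cov = covariance_matrix(matrix)
pc1 = first_principal_component(cov)

# Variance along PC1 as a fraction of the total variance.
variance_pc1 = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(2) for j in range(2))
explained = variance_pc1 / (cov[0][0] + cov[1][1])
print(pc1, explained)
```

In this toy data the two samples are strongly correlated, so a single principal component captures nearly all of the variance. Power iteration as written can fail if the starting vector happens to be orthogonal to the dominant eigenvector, so a library routine is preferable for real data.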

Principal components analysis (PCA): objectives
-- to reduce dimensionality
-- to determine the linear combination of variables
-- to choose the most useful variables (features)
-- to visualize multidimensional data
-- to identify groups of objects (e.g. genes/samples)
-- to identify outliers
Page 211

Page 212

Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain [Figure label: Chr 21]