Lecture 2 – Pre-processing and Normalization José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute.

Lecture 2 – Pre-processing and Normalization José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute of Bioinformatics – Johannes Kepler University June 2010

Outline
1. Microarray Raw Data
2. Image Analysis
3. Diagnostic Plots
4. Non-specific Filtering
5. Normalization

Microarray Data Analysis Pipeline
Flowchart of an experiment with microarrays
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." (Ronald Fisher)

Microarray Raw Data: To keep in mind...
● Raw data differ considerably between
  ● cDNA (spotted) arrays
  ● Affymetrix arrays
● The spirit of the process is similar, but specific procedures or steps differ

Microarray Raw Data (1): ... from cDNA arrays (1)
● One .GPR file per chip, containing
  ● One row per gene but many columns with
    – Intensity values for each channel (R, G)
    – Summary values for intensities
    – Quality controls, such as FLAG
● Intensity values are converted into a single expression matrix containing
  ● One column per chip with log(R/G) values
  ● One row per gene (same rows as the .GPR files)
● Gene information is stored in the .GAL file
● Both .GPR and .GAL are ASCII files
● An accurate description of these files is available here
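
The conversion from per-chip intensities to a single expression matrix can be sketched as follows. This is only an illustration: the chip names, gene names and intensity values are made up, real .GPR parsing is not shown, and base-2 logarithms are an assumption (the slide only says log(R/G)).

```python
import math

# Hypothetical per-chip data: one (R, G) pair per gene, in the same
# gene order as the rows of the corresponding .GPR file.
chips = {
    "chip1": [(1200.0, 600.0), (300.0, 300.0), (150.0, 600.0)],
    "chip2": [(800.0, 800.0), (1000.0, 250.0), (90.0, 180.0)],
}
genes = ["geneA", "geneB", "geneC"]

def expression_matrix(chips, genes):
    """Return (chip_names, {gene: [log2(R/G) per chip]})."""
    chip_names = sorted(chips)
    matrix = {}
    for row, gene in enumerate(genes):
        matrix[gene] = [
            math.log2(chips[name][row][0] / chips[name][row][1])
            for name in chip_names
        ]
    return chip_names, matrix

chip_names, matrix = expression_matrix(chips, genes)
for gene in genes:
    print(gene, matrix[gene])
```

Each row of the result corresponds to one gene and each position in the row to one chip, matching the layout described on the slide.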

Microarray Raw Data (2): ... from Affymetrix arrays (1)
● One .CEL file per chip, containing
  ● PM and MM values for each probe in the chip
  ● Presence/Absence calls (one per probeset)
    – They can be interpreted as a statistical test of the spot foreground intensity in the experimental sample with respect to the background intensity distribution
● Separate PM/MM values are converted into a single expression matrix containing
  ● One column per chip with absolute intensity values
  ● One row per probeset
● Gene information is stored in the .CDF file
● The .CEL file is a binary file

Image Analysis (1): Red/Green overlay
● Co-registration and overlay offer a quick visualization, providing information on
  ● color balance,
  ● uniformity of hybridization,
  ● spot uniformity,
  ● background, and
  ● artifacts such as
    – dust or
    – scratches
[Figure panels: Bad array (high background, little DE); Good array (low background, detectable DE)]

Image Analysis (2): Some hints
● Image analysis is a crucial pre-processing step
  ● Association of a location and corresponding annotation with signal intensities
● Several non-trivial technical choices can affect the quality of the signal
  – scanner,
  – image analysis software,
  – ...
● Background correction is sometimes not desirable (low-background arrays)
● Multiple pixel values on the image are reduced to a single gene-specific value for each channel

Diagnostic Plots: Look at the data!
● Plots are useful to...
  ● Check microarray data quality
  ● Give hints on how to pre-process the data
  ● Verify how the pre-processing has worked

MA-plot (1): Scatterplots
● The data obtained from cDNA arrays come in the form of fluorescent Red (Cy5 or R) and Green (Cy3 or G) dye intensities
● To determine whether normalization is needed, one can plot R vs G intensities and see whether the slope of the line is around 1
● But a better representation of genes with "medium" expression is to take logs...

MA-plot (2): Log2 transformation
● Biologically, a unit change on the log2 scale represents a 2-fold change
● Increase and decrease are symmetric under the log
[Figure panels: Linear scale; Log scale]
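
A small worked example of the symmetry, using made-up intensities of 100 and 200: on the linear scale a 2-fold increase and a 2-fold decrease give ratios of 2 and 0.5, which are not symmetric around 1, while on the log2 scale they become +1 and -1.

```python
import math

# Made-up intensities for illustration only.
ratio_up = 200 / 100     # 2-fold increase on the linear scale
ratio_down = 100 / 200   # 2-fold decrease on the linear scale

log_up = math.log2(ratio_up)
log_down = math.log2(ratio_down)

print(ratio_up, ratio_down)   # asymmetric around 1
print(log_up, log_down)       # symmetric around 0
```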

MA-plot (3): M and A values
● An improvement on the R vs G plot is the MA-plot, which is basically a scaled 45-degree rotation of the log-log scatterplot
● It is a plot of
  ● the M-value, which is the log2 of the R/G intensity ratio: $M = \log_2(R/G)$
  ● the A-value, which is the log2 of the (geometric) average intensity: $A = \tfrac{1}{2}\log_2(RG)$
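
As a minimal sketch, M and A can be computed per spot as below. The function name `ma_values` is hypothetical, and A is taken in its usual form, the mean of the two log2 intensities. With x = log2 R and y = log2 G, M = x - y and A = (x + y) / 2, which is where the "scaled 45-degree rotation" comes from.

```python
import math

def ma_values(r, g):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(r / g)          # log ratio
    a = 0.5 * math.log2(r * g)    # mean of log2(r) and log2(g)
    return m, a

print(ma_values(400.0, 100.0))
print(ma_values(256.0, 256.0))
```

A spot with equal intensities in both channels has M = 0 whatever its A value, so unchanged genes lie on a horizontal line.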

MA-plot (4): Relationship between intensity and M vs A
● The general assumption is that most genes do not change their expression
● Most points should therefore lie around M = 0, since log2(1) = 0

Boxplots (1): Five-Number Summary
● The five-number summary is a descriptive statistic that provides information about a set of observations
● It consists of the five most important sample percentiles
  ● the sample minimum (min) is the smallest observation
  ● the lower or first quartile (Q1) is the 25th percentile
  ● the median, middle value or second quartile (Q2) is the 50th percentile
  ● the upper or third quartile (Q3) is the 75th percentile
  ● the sample maximum (max) is the largest observation
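
The five numbers behind a boxplot can be computed with the Python standard library, as in this sketch. The function name and the data are made up, and the "inclusive" quartile method is one convention among several; other software may give slightly different quartiles for small samples.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of a sample."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```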

Boxplots (2): Example from limma package
● Liver samples from 16 mice (8 WT and 8 ApoAI KO)

Signal/Noise Histograms: Example
● Images with high background tend to have lower signal-to-noise ratios

Spatial Plots (1): Slide backgrounds
● They are useful when spot coordinates are available

Spatial Plots (2): Highlighting extreme log ratios
● If there are no spatial effects, high-intensity spots should be uniformly distributed
● Top (black) and bottom (green) 5% of log ratios
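
Selecting the spots to highlight can be sketched as below. The spot layout (8 columns), the M values and the function name are all made up for illustration; plotting the selected coordinates on the array grid is left out.

```python
def extreme_spots(spots, fraction=0.05):
    """Return (lowest, highest) spot coordinates by log ratio.

    spots is a list of (row, col, M) tuples; each returned list holds
    the (row, col) of the bottom / top `fraction` of M values.
    """
    k = max(1, round(len(spots) * fraction))
    ordered = sorted(spots, key=lambda s: s[2])
    lowest = [(row, col) for row, col, _ in ordered[:k]]
    highest = [(row, col) for row, col, _ in ordered[-k:]]
    return lowest, highest

# Hypothetical 5 x 8 array whose log ratios increase along the array.
spots = [(i // 8, i % 8, (i - 20) / 10) for i in range(40)]
low, high = extreme_spots(spots)
print(low, high)
```

In this artificial example the extreme spots cluster at opposite corners of the array, which is the kind of pattern a spatial plot is meant to reveal.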

Pin-group Effects (1): Example
[Figure panels: Lowess lines through points from pin groups; Boxplots of log ratios by pin-group]

Pin-group Effects (2): Example
● Boxplots show a clear example of spatial bias

Plate Effects: Quality between slides

Filtering (1): Why?
● There may be errors during hybridization and/or scanning which yield bad spots
  ● These are automatically flagged
● Many spots may show very low signals
  ● Problems with spotting
  ● No hybridization in this spot
● Bad spots may be removed from the analysis to avoid unnecessary noise

Filtering (2): ... spots and adjust signal
● We may filter the data on intensity
  ● by excluding spots where both the red and green channel intensities are below 100
  ● by setting an intensity to the minimum of 100 when only one of the two channel intensities is below that minimum
● We may use the flag column imported with the data, and exclude intensities with a flag value not equal to 1
● ...
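
The intensity and flag rules above can be sketched as a single function. The threshold of 100 and the "flag must equal 1" rule are taken from the slide; the function name and the return convention (None for an excluded spot) are assumptions made for this example.

```python
MIN_INTENSITY = 100.0

def filter_spot(r, g, flag, good_flag=1):
    """Return the (possibly floored) intensities, or None if excluded."""
    if flag != good_flag:
        return None                      # flagged spot: exclude
    if r < MIN_INTENSITY and g < MIN_INTENSITY:
        return None                      # both channels too weak: exclude
    # At most one channel is below the minimum: raise it to the minimum.
    return max(r, MIN_INTENSITY), max(g, MIN_INTENSITY)

print(filter_spot(50, 60, 1))
print(filter_spot(50, 400, 1))
print(filter_spot(500, 400, 0))
```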

Filtering (3): So, must we filter data?
● Filtering is intended to remove spots whose images or signals are wrong, for reasons such as
  1) a small quantity of cDNA on the array
  2) errors during the scanning process
● But some people prefer not to filter, to avoid eliminating "good spots" unintentionally

Normalization (1): What is normalization?
● Normalization describes techniques used to transform the data to correct for systematic variation (or differences)
  ● among arrays and
  ● within arrays
● But not for biological variation among samples

Normalization (2): How to know if normalization is necessary?
● By looking at diagnostic plots for biases that vary with
  ● spot intensities,
  ● location on the array,
  ● plate origin,
  ● pins,
  ● scanning parameters, ...
● By performing self-self hybridization

Normalization (3): What is self-self normalization? (1)
● In dual-channel cDNA arrays, the two samples are each labeled with a different fluorescent dye
● In most studies, the samples are from different sources (e.g. cancer vs normal)
● However, it is also possible to co-hybridize two samples from the same source (but differently labeled)
● If we hybridize a sample with itself, intensities should be the same in both channels
● Any deviation from this equality means there is systematic bias that needs correction

Normalization (4): What is self-self normalization? (2)

Normalization (5): What is self-self normalization? (3)
● In a self-self hybridization, we would expect all ratios to be equal to one
● But they may not be!
● Why not?
  ● Unequal labeling efficiency
  ● Noise in the system
  ● Differential expression
● Normalization brings (appropriate) ratios back to one
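
One simple way to quantify the overall deviation in a self-self array is the median log ratio, sketched below. The function name and the intensities are made up; the example assumes the red channel is uniformly 10% brighter than the green one.

```python
import math
from statistics import median

def dye_bias(r, g):
    """Median log2(R/G); 0 would mean no overall dye bias."""
    return median(math.log2(ri / gi) for ri, gi in zip(r, g))

# Hypothetical self-self intensities with a 10% red excess.
red = [220.0, 440.0, 880.0]
green = [200.0, 400.0, 800.0]
bias = dye_bias(red, green)
print(bias)
```

A single number like this only captures a constant bias; biases that change with intensity or position need the diagnostic plots described earlier.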

Normalization (6): Examples of self-self hybridization
[Figure panels: False Color Overlay; Boxplots within pin-groups; Scatter MA-plots]

Normalization (7): Examples of non self-self hybridizations
[Figure panels: Early Goodman (UC Berkeley); From the NCI60 data set; Early Ngai la (UC Berkeley); Early PMCRI (Melbourne, Australia)]

Normalization (8): Methods and Issues
● Methods
  ● Global adjustment
  ● Intensity-dependent normalization
  ● Within print-tip group normalization
  ● And many others...
● Selection of spots for normalization

Normalization (9): Global adjustment
● Normalization based on a global adjustment
  $\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\bigl(R/(kG)\bigr)$
● Choices for $k$ or $c = \log_2 k$ are
  ● $c$ = median or mean of the log ratios for a particular gene set (e.g. housekeeping genes)
  ● total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$
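
Both choices of the constant can be sketched in a few lines. The function names and the intensities are made up; here the median is taken over all spots, whereas the slide allows any chosen gene set.

```python
import math
from statistics import median

def median_center(r, g):
    """Return (c, normalized M values) with c = median of log2(R/G)."""
    m = [math.log2(ri / gi) for ri, gi in zip(r, g)]
    c = median(m)
    return c, [mi - c for mi in m]

def total_intensity_k(r, g):
    """Return k = sum(R_i) / sum(G_i) for total intensity normalization."""
    return sum(r) / sum(g)

# Hypothetical intensities for five spots.
r = [200.0, 400.0, 800.0, 1600.0, 100.0]
g = [100.0, 200.0, 400.0, 200.0, 100.0]

c, normalized = median_center(r, g)
k = total_intensity_k(r, g)
print(c, normalized)
print(k, math.log2(k))
```

The two constants generally differ: the median is robust to a few strongly changed spots, while the total intensity ratio is dominated by the brightest ones.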

Normalization (10): Intensity-dependent normalization
● Here we run a line through the middle of the MA-plot, shifting the M value of the pair (A, M) by $c = c(A)$
  $\log_2(R/G) \rightarrow \log_2(R/G) - c(A) = \log_2\bigl(R/(k(A)\,G)\bigr)$
● One estimate of $c(A)$ is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
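
The idea can be sketched with a simplified local regression written in plain Python. This is not Cleveland's full LOWESS: it fits a tricube-weighted straight line around each point but omits the robustness iterations, and the function names, the `frac` parameter and the data are assumptions made for the example.

```python
def local_linear_fit(a, m, frac=0.5):
    """Estimate c(A) at every point by tricube-weighted linear regression."""
    n = len(a)
    k = max(2, int(round(frac * n)))          # neighbourhood size
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12              # bandwidth: k-th smallest distance
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, a)) / sw
        ybar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted

def intensity_normalize(a, m, frac=0.5):
    """Return M - c(A) for every spot."""
    c = local_linear_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]

# Hypothetical array whose dye bias grows linearly with intensity.
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.5 + 0.1 * x for x in a_vals]
m_norm = intensity_normalize(a_vals, m_vals)
print(max(abs(v) for v in m_norm))
```

Because the artificial bias here is purely a function of A, the normalized M values come out essentially zero; on real data the residual M values would carry the differential expression signal.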

Normalization (11): Which spots to use for normalization?
● The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability
  ● e.g. global LOWESS is justified by supposing that, when stratified by mRNA abundance,
    a) only a minority of genes are expected to be differentially expressed, or
    b) any differential expression is as likely to be up-regulation as down-regulation
● Pin-group LOWESS requires stronger assumptions: that one of the above applies within each pin-group
● The use of other sets of genes, e.g. control or housekeeping genes, involves similar assumptions
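
Applying a correction separately within each pin-group can be sketched as below. To keep the example short, the per-group LOWESS curve is replaced by a per-group median, which is a deliberate simplification; the function name and the data are made up.

```python
from collections import defaultdict
from statistics import median

def pin_group_center(m_values, pin_groups):
    """Subtract from each M value the median M of its own pin-group."""
    by_group = defaultdict(list)
    for m, group in zip(m_values, pin_groups):
        by_group[group].append(m)
    centers = {group: median(vals) for group, vals in by_group.items()}
    return [m - centers[group] for m, group in zip(m_values, pin_groups)]

# Hypothetical log ratios from two pin-groups with opposite biases.
m_values = [0.5, 1.0, 0.75, -0.25, -0.75, -0.5]
pin_groups = [1, 1, 1, 2, 2, 2]
centered = pin_group_center(m_values, pin_groups)
print(centered)
```

Centering each group on zero is only reasonable under the stronger assumption stated above, namely that most genes within every pin-group are unchanged or that changes are balanced.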

Normalization (12): Which spots to use for normalization?
[Figure panels: Unnormalized; Print-tip normalization; Print-tip and scale normalization]