Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006.

http://www.esat.kuleuven.ac.be/~kmarchal/
Course material: course notes + PowerPoint files
Exercises

Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
– Exercises

Gene expression
[Figure: DNA is transcribed from position +1 into mRNA; the mRNA is translated into protein]

Gene expression: adaptation of a cell to its environment
[Figure: bacterial cell receiving Signal 1 and Signal 2 from outside; FNR box upstream of cytN, cytO, cytQ, cytP]
– Adaptation of a cell: response to environmental signals, e.g. to hormones (cell differentiation)
– The cellular response is determined by the genes that are switched on upon a signal

Gene expression
The action of genetic networks underlies the observed phenotypic behavior

Functional genomics Structural Genomics Comparative Genomics

Omics era
Traditional molecular biology
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions
Omics era: measurement of the expression of thousands of genes or proteins simultaneously
– The function or the expression of a gene in the global context of the cell
– Holistic approaches allow a better understanding of fundamental molecular biological processes, because a gene does not act on its own: it is always embedded in a larger network (systems biology)

Omics era: transcriptomics
[Figure: RNA from a reference sample and a test sample is converted to cDNA (reference, test) and detected]

proteomics Omics era

metabolomics Omics era

SYSTEMS BIOLOGY Consider the cell as a system Omics era

SYSTEMS BIOLOGY Mechanistic insight in the biological system at molecular biological level High throughput data Omics era

Omics era: bioinformatics
– Analysis of such large-scale data is no longer trivial => computational challenges
  – Low signal/noise
  – High dimensionality
– Simple spreadsheet analysis (e.g. in Excel) is no longer sufficient
– More advanced data mining procedures become necessary
– Another urgent problem is how to store and organize all the information

Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
  – Principle of microarray
  – Applications
– Experiment design
– Preprocessing
– Exercises

Transcript profiling: transcriptomics
[Figure: RNA from a reference sample and a test sample is converted to cDNA (reference, test) and detected]

Transcript profiling
– Previously: measure the expression level of one gene (Northern blot analysis)
– Novel techniques: measure the expression level of all genes simultaneously => EXPRESSION PROFILING
– Principle: hybridisation (complementary strands stick together)
  mRNA: 5' -UGACCUGACG- 3'
  cDNA: 3' -ACTGGACTGC- 5'
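The base pairing behind hybridisation can be checked with a few lines of Python. This is only an illustrative sketch: the function name `complementary_cdna` and the lookup table are not from the course material, and the sequence is the one shown on the slide.

```python
# Watson-Crick pairing between an mRNA base and the cDNA base it hybridises to.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the cDNA strand paired base by base with the mRNA.

    The mRNA is read 5'->3', so the returned cDNA is written 3'->5',
    the same left-to-right alignment used on the slide.
    """
    return "".join(RNA_TO_CDNA[base] for base in mrna_5_to_3.upper())


print(complementary_cdna("UGACCUGACG"))  # ACTGGACTGC
```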

Transcript profiling
Monitor molecular activities on a global level
– Protein levels: proteomics
– Enzyme activities
– Metabolites
– Gene expression (mRNA): transcriptomics = transcript profiling
This gives a general insight into the global cell behavior (holistic)
Molecular biological methods
– RT-PCR
– SAGE
– Protein arrays
– Microarray analysis

Transcript profiling: cDNA array
– cDNA spotted on a glass slide
– Upscaled Northern hybridisation
[Figure: gene (DNA, transcription start +1) -> transcript (mRNA) -> cDNA]

Transcript profiling: preparation of probes
– Collect cDNA clones
– Amplify target cDNA insert by PCR
– Check yield & specificity by electrophoresis
– Spot + PCR products on glass slides

Transcript profiling
[Figure: RNA from a reference sample and a test sample is converted to cDNA (reference, test) and detected]

Transcript profiling: experimental workflow
1. Cell culture (Signal 1, Signal 2)
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis -> numerical value

http://www.bio.davidson.edu/courses/genomics/chip/chip.html Transcript profiling

Transcript profiling: superimposed color image
– Transform into color images
– Superimpose the color images from the R and G channels
[Figure: examples of good alignment and bad alignment]

Transcript profiling: superimposed color image
– Black spots: gene was expressed neither in the test nor in the control sample
– Green: gene was only expressed in the control sample
– Red: gene was only expressed in the test sample
– Yellow: gene was expressed in both the test and the control sample

Transcript profiling: image analysis
– Signal intensity is proportional to the amount of cDNA present in the sample
– Signal Cy3 -> numerical value
– Signal Cy5 -> numerical value
– The numerical values are the input for data analysis

Data representation Gene profile Experiment profile

Transcript profiling
Spotted DNA microarray | High density oligonucleotide array

Experiment design
The mathematical approach depends on the experimental design
– Comparison of 2 samples (black/white)
– Comparison of multiple arrays
  – Global dynamic profiling
  – Static experiment: comparison of samples (mutants, patients)

Experiment design: 2 sample design
Type 1: comparison of 2 samples
– Control sample versus induced sample
– Statistical testing
– Retrieve statistically over- or underexpressed genes
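As a rough illustration of statistical testing in a two-sample design, the sketch below computes a one-sample t statistic on the replicate log2(test/ref) ratios of a single gene. The choice of a one-sample test against zero, the function name and the example values are assumptions made for illustration; the slides only say that a test statistic is used.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(log_ratios):
    """t statistic for H0: the mean log2(test/ref) of this gene is 0.

    Needs at least two replicates with non-identical values,
    otherwise the standard deviation is undefined or zero.
    """
    n = len(log_ratios)
    standard_error = stdev(log_ratios) / sqrt(n)
    return mean(log_ratios) / standard_error


# Four hypothetical replicate log ratios for one gene.
t_value = one_sample_t([1.0, 1.2, 0.8, 1.0])
print(round(t_value, 2))
```

A large absolute t value points to over- or underexpression; converting it to a p value requires the t distribution with n - 1 degrees of freedom.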

Experiment design: black/white experiment (Latin square)
Description (array V, mouse genes)
– Condition 1: pygmy mouse, 10 days old (test)
– Condition 2: normal mouse, 10 days old (ref)
– Goal: detect differentially expressed genes
Array 1: Condition 1, dye 1 (replica L, replica R) | Condition 2, dye 2 (replica L, replica R)
Array 2: Condition 2, dye 1 (replica L, replica R) | Condition 1, dye 2 (replica L, replica R)
Per gene, per condition, 4 measurements are available

Experiment design: multiple array design
Measure the expression of all genes
– During time (dynamic profile)
– In different conditions
Clustering: identify coexpressed genes
Motif finding: identify the mechanism of coregulation

Experiment design: multiple array design
Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999)
– 15 time points (E=18); time points 90 & 100 min deleted (Zhang et al. 1999; Tavazoie et al. 1999)
– Original dataset: 6178 genes
– Preprocessing: select 4634 most variable (25 % most variable); variance normalized
– Adaptive quality-based clustering (32 clusters) (95%)

Experiment design: reference design (e.g. Spellman dataset)
– Reference: unsynchronized cells
– Condition: synchronized cells during the cell cycle at distinct time intervals
Array 1: Condition 1, dye 1, replica L | Condition 19, dye 2, replica L
         Condition 2, dye 1, replica L | Condition 19, dye 2, replica L
         Condition 3, dye 1, replica L | Condition 19, dye 2, replica L
         Condition 4, dye 1, replica L | Condition 19, dye 2, replica L
         …

Loop design Experiment Design

Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
  – Sources of variation
  – General normalization steps
  – Slide by slide normalization
  – ANOVA normalization

Preprocessing: sources of variation
Consistent errors
– Overshine effects
– Dye effect
– Spot effects
– Array effect
Consistent errors complicate the direct comparison of measurements of the same gene/condition
Consistent errors need to be removed by preprocessing/normalization
Preprocessing is tedious and influences downstream measurements

Preprocessing: dye effect in the experimental workflow
1. Cell culture (Signal 1, Signal 2)
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis -> numerical value

Preprocessing: dye and condition effect (within-slide variation)
– Measurement error arises during
  – Preparation of mRNA
  – Labeling & reverse transcription
– Result: the overall signal in one channel is more pronounced than in the other channel
– Remedy: normalization, under the global normalization assumption

Preprocessing: array effect in the experimental workflow
1. Cell culture (Signal 1, Signal 2)
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis -> numerical value

Preprocessing: array effects (between-slide variation)
– Cause: hybridization differences
– Differences in global intensity between slides make a direct comparison between slides impossible
– Remedy: normalization within each slide, use of ratios

Array effects: Between slide variation Preprocessing

Preprocessing: spot effects (pin main effects)
– Measurement error: different quantity of DNA in each spot, differences between duplicate spots
– Absolute levels are not comparable between genes
– Ratio: allows differential expression to be compared between genes
  Gene 1: test 4, ref 2, R/G = 2
  Gene 2: test 8, ref 4, R/G = 2

Preprocessing: overshine effects (within-slide variation)
– Non-specific Cy5 or Cy3 signal resulting from overshining = emission from neighboring spots
– Background intensity increases with the intensity of the neighboring spots

Preprocessing
Removing the sources of variation is an obligatory step
– To make comparisons within a slide possible, e.g. to find differentially expressed genes
– To allow interslide comparisons, e.g. to combine the replicas of the original experiment and the color flip

Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
  – Sources of variation
  – General normalization steps
  – Slide by slide normalization
  – ANOVA normalization

Preprocessing: two routes
[Figure: flowchart of preprocessing steps, starting from background correction]
– Array by array approach: log transformation, filtering, normalization, ratio, test statistic (T-test)
– ANOVA based: log transformation, filtering, linearisation, bootstrapping

Preprocessing: background correction
– Background correction compensates for overshining
– Background correction is considered additive
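Because the background is treated as additive, the correction amounts to a per-spot subtraction. The sketch below shows this under assumptions that are not in the slides: the function name, the example intensities, and the choice to clip negative results at a floor of 0.

```python
def background_correct(foreground, background, floor=0.0):
    """Subtract the local background estimate from each spot intensity.

    Values that would become negative are clipped to `floor`
    (an assumption; such spots are often flagged and filtered instead).
    """
    return [max(f - b, floor) for f, b in zip(foreground, background)]


# Hypothetical spot and local background intensities for one channel.
print(background_correct([500, 120, 80], [100, 100, 100]))  # [400, 20, 0.0]
```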

Preprocessing: log transformation
– Additive error: independent of the measured intensity; the absolute level of the error remains the same (high relative error at low expression levels, low relative error at high expression levels)
– Multiplicative error: the error increases with the measured intensity (high absolute error at high levels)

Preprocessing: log transformation
LOG2 transformed intensity values
– Multiplicative effects are removed: residuals are constant at high intensities
– Additive effects become more pronounced: the error increases as the signal gets lower (intuitively plausible)
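One common way to make the two error types explicit is a two-component error model. The notation below (true level \mu, multiplicative error \eta, additive error \varepsilon) is an assumption for illustration and does not come from the slides.

```latex
\begin{align}
y &= \mu\, e^{\eta} + \varepsilon,
  \qquad \mathrm{E}[\eta] = \mathrm{E}[\varepsilon] = 0 \\
\log_2 y &\approx \log_2 \mu + \frac{\eta}{\ln 2}
  \qquad \text{(high intensity, } \mu \gg |\varepsilon| \text{)} \\
y &\approx \mu + \varepsilon
  \qquad \text{(low intensity, } |\varepsilon| \text{ comparable to } \mu \text{)}
\end{align}
```

At high intensity the log turns the multiplicative term into an additive one with constant spread. At low intensity the additive term dominates, so the log of the measured value becomes unstable.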

Preprocessing: log transformation (why log2?)
– Ratio (test/ref)
  – test > ref: upregulation, range 1…+infinity
  – test < ref: downregulation, range 0…1 (range of downregulation squashed)
– Log2(test/ref) = log2(test) - log2(ref)
  – Upregulation: range 0…+infinity
  – Downregulation: range 0…-infinity
– 2-fold overexpression: ratio = 2, log2(ratio) = 1
– 2-fold underexpression: ratio = 0.5, log2(ratio) = -1
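The symmetry argument is easy to verify numerically. In this small sketch the function name `log2_ratio` and the intensities are illustrative only.

```python
import math


def log2_ratio(test, ref):
    """log2(test/ref), computed as log2(test) - log2(ref)."""
    return math.log2(test) - math.log2(ref)


# 2-fold up, 2-fold down, unchanged: the raw ratios are 2, 0.5 and 1,
# the log2 ratios are symmetric around zero.
for test, ref in [(8, 4), (4, 8), (4, 4)]:
    print(test / ref, log2_ratio(test, ref))
```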

Preprocessing: filtering
Spots are identified by image analysis
– Array Vision
– ImaGene
– Matarray
Spot detection and signal acquisition
– E.g. the signal is defined as the mean intensity of all pixels in a spot whose intensity is higher than the local background + 2 SD
Spots can have different qualities
– Irregular spots
– Spots with an excessively large diameter
– Spots which are extremely small (artifacts)
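The signal definition quoted above can be sketched as follows. It assumes that "2 SD" refers to the standard deviation of the local background pixels; the function name and the pixel values are hypothetical, and real image analysis packages may define the threshold differently.

```python
from statistics import mean, stdev


def spot_signal(spot_pixels, background_pixels, n_sd=2.0):
    """Mean of the spot pixels brighter than local background + n_sd * SD.

    Returns None when no pixel exceeds the threshold, so that the spot
    can be flagged for filtering.
    """
    threshold = mean(background_pixels) + n_sd * stdev(background_pixels)
    kept = [p for p in spot_pixels if p > threshold]
    return mean(kept) if kept else None


# Hypothetical pixel intensities: the background averages 10.
print(spot_signal([5, 14, 20, 26], [10, 12, 8, 10]))  # 20
```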

Preprocessing: filtering
[Figure legend: red > 0.1 stdev, green > 1 stdev, blue > 2 stdev]

Preprocessing: filtering
Zero values: treat these separately
– The ratio and the log transformation are undefined for zero values
– In a black/white experiment zero values mark interesting genes: off in condition 1 versus on in condition 2
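A minimal sketch of treating zero values separately: spots with a zero (or negative) intensity in either channel are set aside instead of being log transformed. The function name, the tuple layout and the gene names are assumptions for illustration.

```python
import math


def split_zero_spots(spots):
    """Separate spots with a defined log ratio from spots to inspect by hand.

    `spots` is a list of (gene, test_intensity, ref_intensity) tuples.
    """
    usable, flagged = {}, []
    for gene, test, ref in spots:
        if test > 0 and ref > 0:
            usable[gene] = math.log2(test / ref)
        else:
            # On/off genes end up here; they may be the most interesting ones.
            flagged.append(gene)
    return usable, flagged


usable, flagged = split_zero_spots([("g1", 8, 4), ("g2", 0, 5), ("g3", 3, 0)])
print(usable, flagged)
```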

Preprocessing: filtering
– Some genes are only labeled with the green dye, not with the red dye, and therefore seem underexpressed
– If no mRNA of a gene is present, does the green dye bind aspecifically to a spot?
– A color flip is essential to eliminate false positives
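Why the color flip removes a gene-specific dye bias can be shown with a small calculation. The sketch assumes the bias adds the same amount to log2(R/G) on both arrays; the function name and the numbers are illustrative.

```python
def dye_swap_log_ratio(m_forward, m_flipped):
    """Combine a dye-swap pair into one log2(test/ref) estimate.

    m_forward: log2(R/G) on the array where the test sample is red.
    m_flipped: log2(R/G) on the array where the test sample is green.
    A bias added to both values cancels in the difference.
    """
    return (m_forward - m_flipped) / 2.0


# Hypothetical gene: true log2(test/ref) = 1.0, dye bias = -0.5 on log2(R/G).
true_effect, dye_bias = 1.0, -0.5
forward = true_effect + dye_bias    # 0.5
flipped = -true_effect + dye_bias   # -1.5
print(dye_swap_log_ratio(forward, flipped))  # 1.0
```

A gene that is not differentially expressed but labels poorly with red gives the same negative value on both arrays, so the combined estimate is 0 instead of a false positive.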

Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
  – Sources of variation
  – General normalization steps
  – Slide by slide normalization
  – ANOVA normalization

Preprocessing: normalization
On average the ratio red/green should be 1
– Rescale based on the average of housekeeping genes
– Rescale based on spikes
– Rescale based on the average expression value of the full array (global normalization)
Methods used for normalization
– Linear normalization
– Intensity dependent normalization
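Global normalization can be sketched as centering the log ratios of one array so that their average is 0, which corresponds to an average red/green ratio of 1 on the log scale. Using the mean of the log ratios (rather than, say, the median) is an assumption here, as is the function name.

```python
from statistics import mean


def global_normalize(log_ratios):
    """Shift all log2(R/G) values of one array so that their mean is 0."""
    shift = mean(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical array on which the red channel is globally too bright.
print(global_normalize([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```

Rescaling on housekeeping genes or spikes follows the same pattern, with the shift computed from that subset only.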

Preprocessing: linear normalization
[Figure: scatter plots of R versus G]

Preprocessing: linear normalization
– Red and green are related by a constant factor
– Calculate the factor by linear regression
– Filter to remove outliers in the non-linear range (green values)
[Figure: Log2(ratio) centered on 0 after applying the normalization factor determined by linear regression]
http://afgc.stanford.edu/~finkel/talk.htm
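A minimal sketch of estimating the constant factor by least squares. It assumes a regression through the origin (R = k * G) on intensities that have already been filtered for outliers; the function names and values are illustrative, not the method of any particular package.

```python
def linear_normalization_factor(red, green):
    """Least-squares slope k of the line R = k * G through the origin."""
    numerator = sum(r * g for r, g in zip(red, green))
    denominator = sum(g * g for g in green)
    return numerator / denominator


def normalize_green(red, green):
    """Rescale the green channel so that red/green is 1 on average."""
    k = linear_normalization_factor(red, green)
    return [k * g for g in green]


# Hypothetical array where red is consistently twice as bright as green.
print(linear_normalization_factor([2, 4, 6], [1, 2, 3]))  # 2.0
print(normalize_green([2, 4, 6], [1, 2, 3]))              # [2.0, 4.0, 6.0]
```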

Preprocessing: normalization
Linear normalization is not straightforward
[Figure: Log2(R/G) plotted against (Log2(R) + Log2(G))/2, with a linear fit and a lowess fit]

Preprocessing: non-linear, intensity dependent normalization
Lowess (Dudoit et al., 2000)
– Genes that seem underexpressed because of a specific dye effect are compensated for
– Log R and log G are recalculated based on the lowess fit
– Lowess both linearizes and normalizes the data
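The idea of intensity dependent normalization can be sketched with a simplified lowess: a tricube-weighted local linear fit of M = log2(R) - log2(G) on A = (log2(R) + log2(G))/2, whose fitted trend is subtracted from M. This version omits the robustness iterations of the full lowess procedure, and the function names and the `frac` default are assumptions.

```python
def lowess_fit(a, m, frac=0.4):
    """Tricube-weighted local linear fit of m on a, evaluated at each a[i]."""
    n = len(a)
    k = max(3, int(round(frac * n)))            # points per local window
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[min(k, n) - 1] or 1e-12       # window half-width
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(log_r, log_g, frac=0.4):
    """Return log ratios with the intensity dependent trend removed."""
    a = [(r + g) / 2 for r, g in zip(log_r, log_g)]
    m = [r - g for r, g in zip(log_r, log_g)]
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Hypothetical array with a constant dye bias of 0.3 on the log scale.
log_g = [float(x) for x in range(1, 11)]
log_r = [g + 0.3 for g in log_g]
print(max(abs(v) for v in lowess_normalize(log_r, log_g)))
```

The normalized log R and log G can then be recovered as A + M/2 and A - M/2 using the corrected M.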

Intensity dependent normalization Preprocessing: normalization

Result of the normalization Preprocessing: normalization

Preprocessing: ratio
– The ratio compensates for spot effects
– The choice of the reference is important
  – Intuitive reference (first time point, uninduced sample): intuitive interpretation possible, but the ratio is often undefined
  – Independent reference (reference design, e.g. tissue mixture): ratio defined, but the interpretation is complicated

Preprocessing: ratio
– Ratio (R/G)
  – R > G: upregulation, range 1…+infinity
  – R < G: downregulation, range 0…1 (range of downregulation squashed)
– Log ratio
  – Upregulation: range 0…+infinity
  – Downregulation: range 0…-infinity
– 2-fold overexpression: ratio = 2, log2(ratio) = 1
– 2-fold underexpression: ratio = 0.5, log2(ratio) = -1

Overview of further analysis
Raw data -> preprocessing -> preprocessed data
– Test statistic -> differentially expressed genes
– Clustering -> clusters of coexpressed genes

Preprocessing: two routes
[Figure: flowchart of preprocessing steps, starting from background correction]
– Array by array approach: log transformation, filtering, normalization, ratio, test statistic (T-test)
– ANOVA based: log transformation, filtering, linearisation, bootstrapping

Preprocessing: ANOVA (multifactor, linear, fixed levels)
Model the expression level of each gene as a combination of the different factors
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Dye effect (labeling efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– GC effect: differential expression due to the altered variety (the effect of interest)
Least squares fit, subject to restrictions
Contrast of interest: estimate (GC)_i1 - (GC)_i2

Preprocessing: ANOVA
– Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ²
– Check: plot the residuals (y estimated - y measured) against the estimated intensity
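Written out, a model containing the effects listed on these slides could look as follows. The index letters (a for array, d for dye, c for condition, i for gene), the use of log-transformed intensities for y, and the sum-to-zero form of the restrictions are assumptions chosen to match the contrast (GC)_i1 - (GC)_i2; they are not taken verbatim from the slides.

```latex
\begin{align}
y_{adci} &= \mu + A_a + D_d + C_c + G_i + (GC)_{ic} + \varepsilon_{adci} \\
\varepsilon_{adci} &\sim F, \qquad
  \mathrm{E}[\varepsilon_{adci}] = 0, \qquad
  \mathrm{Var}[\varepsilon_{adci}] = \sigma^2 \\
0 &= \sum_a A_a = \sum_d D_d = \sum_c C_c = \sum_i G_i
   = \sum_i (GC)_{ic} = \sum_c (GC)_{ic} \\
\text{contrast of interest for gene } i &: \quad (GC)_{i1} - (GC)_{i2}
\end{align}
```

The residual plotted in the diagnostic is \hat{y}_{adci} - y_{adci}, shown against the fitted value \hat{y}_{adci}.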

Preprocessing: ANOVA
I. MAIN EFFECTS + EFFECT OF INTEREST
The analysis of variance shows the relative contribution of each of the effects

Preprocessing: ANOVA
Advantages
– Gains more information with fewer observations => derives variation from all measurements made (fewer replicas required, e.g. array effect based on N-1 gene measurements)
– Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels
– No ratios required
Requirements
– Requires knowledge about the experimental effects
– The model used implies that all effects and combinations of effects are linear
– Bootstrapping: residuals should be normally distributed around zero with constant variance

Preprocessing: ANOVA bootstrap analysis
– Estimate the error
– Simulate new datasets based on the estimated error (3000 times)
– Calculate the factor of interest (GC effect) for each bootstrapped dataset (recalculate the ANOVA)
– Calculate a CI on (GC1-GC2) for each of the N genes based on the 3000 bootstraps
– Use this interval to test for significant genes (does the interval on GC1-GC2 contain 0?)
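The bootstrap steps can be illustrated on a deliberately reduced case: one gene, two conditions, and per-condition means standing in for the full ANOVA fit. This simplification, the function name, the percentile interval and the example values are all assumptions; a real analysis would refit the complete ANOVA model on every simulated dataset.

```python
import random
from statistics import mean


def bootstrap_contrast_ci(cond1, cond2, n_boot=3000, level=0.99, seed=0):
    """Percentile CI for mean(cond1) - mean(cond2) by resampling residuals."""
    rng = random.Random(seed)
    m1, m2 = mean(cond1), mean(cond2)
    # Step 1: estimate the error as residuals around the fitted values.
    residuals = [y - m1 for y in cond1] + [y - m2 for y in cond2]
    diffs = []
    for _ in range(n_boot):
        # Step 2: simulate a new dataset from fitted values + resampled error.
        sim1 = [m1 + rng.choice(residuals) for _ in cond1]
        sim2 = [m2 + rng.choice(residuals) for _ in cond2]
        # Step 3: recompute the contrast of interest.
        diffs.append(mean(sim1) - mean(sim2))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical log intensities of one gene in two conditions.
low, high = bootstrap_contrast_ci([3.1, 2.9, 3.0, 3.2], [1.0, 1.1, 0.9, 1.2])
print(low, high, "significant" if low > 0 or high < 0 else "not significant")
```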

Preprocessing: more arrays simultaneously
DATA
– Filtered for zero values
– Set 1: unnormalised data
MODELS (Kerr et al. 2000, 2001)
– Model 1 (no spot effects)
– Model 2 (spot effects independent)
– Model 3 (spot effects dependent)
Observations
– GC effects are not confounded with the spot effects
– The type of model does influence σ (the residual error) => it does influence the bootstrap interval

Preprocessing: more arrays simultaneously
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Dye effect (labeling efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– GC effect: differential expression due to the altered variety

Preprocessing: more arrays simultaneously
– Least squares fit, subject to restrictions
– Contrast of interest: estimate (VG)_k1g - (VG)_k2g
– The usual confidence intervals based on normal theory are not appropriate
– Bootstrap analysis of the residuals avoids making distributional assumptions about the error
– Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ²

Preprocessing: more arrays simultaneously

Preprocessing: more arrays simultaneously
[Figure: four panels plotted against the estimated intensity ŷ: test, array 1; reference, array 1; reference, array 2; test, array 2]

Preprocessing: more arrays simultaneously
Additive error and non-linear effects undermine the application of ANOVA

Preprocessing: more arrays simultaneously
[Figure: four panels plotted against the estimated intensity ŷ: test, array 1; reference, array 1; reference, array 2; test, array 2]

Preprocessing: bootstrap analysis
– Lowess
– 99 % confidence interval based on 100 genes, 3000 bootstraps
– Retained 370 genes (62 with T-test p value < 0.01)

Methods comparison
Methods tested on the pygmy dataset (3750 genes)
1. ANOVA 99 % CI
2. ANOVA 95 % CI
3. SAM
4. T-test
5. Fold test
– Retained 360 genes
– Construct for each gene a binary profile over the five methods (e.g. 1 1 1 1 1)
– Hierarchically cluster the genes based on this profile
– Only 8 genes were retained by all methods
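Building the binary profiles is straightforward once each method has produced its list of retained genes. The sketch below uses made-up method names and gene identifiers and stops before the hierarchical clustering step.

```python
def binary_profiles(selected_by_method):
    """Map each gene to a 0/1 tuple: was it retained by each method?

    `selected_by_method` maps a method name to the set of genes it retained.
    Methods are ordered alphabetically in the profile.
    """
    methods = sorted(selected_by_method)
    genes = sorted(set().union(*selected_by_method.values()))
    profiles = {
        g: tuple(int(g in selected_by_method[m]) for m in methods)
        for g in genes
    }
    return methods, profiles


def retained_by_all(profiles):
    """Genes whose profile consists of ones only."""
    return [g for g, profile in profiles.items() if all(profile)]


# Hypothetical selections from three methods.
example = {
    "anova_99": {"g1", "g2"},
    "sam": {"g1", "g3"},
    "t_test": {"g1", "g2", "g3"},
}
methods, profiles = binary_profiles(example)
print(methods, profiles, retained_by_all(profiles))
```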

methods Comparison

Experiment design: Latin square (mouse data set)
– Reference: normal mouse
– Condition: pygmy mouse
– Two experiments: C=1, C=2 reflect two sample time points
– 2 batches: not all genes of the genome are on one array
Array  Condition  Batch  Test  Ref
A1     C1         B1     R     G
A2     C1         B1     G     R
A3     C1         B2     R     G
A4     C1         B2     R     G
A5     C2         B1     R     G
A6     C2         B1     G     R
A7     C2         B2     R     G
A8     C2         B2     G     R
