Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006.


1 Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006

2 http://www.esat.kuleuven.ac.be/~kmarchal/ Course material: course notes + powerpoint files Exercises

3 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises

4 Gene expression: DNA -> (transcription, start at +1) -> mRNA -> (translation) -> protein

5 Gene expression: adaptation of a cell to its environment. [Figure: bacterial cell receiving Signal 1 and Signal 2; FNR box upstream of cytN, cytO, cytQ, cytP] Adaptation of a cell: response to environmental signals, e.g. response to hormones (cell differentiation). The cellular response is determined by the genes which are switched on upon a signal.

6 Gene expression: the action of genetic networks underlies the observed phenotypic behavior.

7 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises

8 Functional genomics Structural Genomics Comparative Genomics

9 Omics era. Traditional molecular biology: directed toward understanding the role of a particular gene or protein in a molecular biological process (Northern analysis, mutational analysis, expression by reporter fusions). Omics era: measurement of the expression of thousands of genes or proteins simultaneously; the function or the expression of a gene is studied in the global context of the cell. Holistic approaches allow a better understanding of fundamental molecular biological processes, because a gene does not act on its own: it is always embedded in a larger network (systems biology).

10 Detection Reference Test Reference sample Test sample RNA cDNA transcriptomics Omics era

11 proteomics Omics era

12 metabolomics Omics era

13 SYSTEMS BIOLOGY Consider the cell as a system Omics era

14 SYSTEMS BIOLOGY Mechanistic insight in the biological system at molecular biological level High throughput data Omics era

15 Omics era: bioinformatics. Analysis of such large-scale data is no longer trivial => computational challenges: low signal/noise ratio, high dimensionality. Simple spreadsheet analysis (e.g. in Excel) is no longer sufficient; more advanced data mining procedures become necessary. Another urgent problem is how to store and organize all the information.

16 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling –Principle of microarray –Applications Experiment design Preprocessing Exercises

17 Detection Reference Test Reference sample Test sample RNA cDNA transcriptomics Transcript profiling

18 Transcript profiling. Previously: measure the expression level of one gene (Northern blot analysis). Novel techniques: measure the expression level of all genes simultaneously => EXPRESSION PROFILING. Principle: hybridisation (complementary strands stick together). mRNA: 5' -UGACCUGACG- 3'; cDNA: 3' -ACTGGACTGC- 5'
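
The base pairing on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `complementary_cdna` is made up for this example and is not part of the course material.

```python
# Watson-Crick pairing between an RNA base and the DNA base opposite it.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the complementary cDNA, written 3'->5' so that it lines up
    base by base with the mRNA written 5'->3'."""
    return "".join(RNA_TO_CDNA[base] for base in mrna_5_to_3.upper())

cdna = complementary_cdna("UGACCUGACG")
print("mRNA 5'-UGACCUGACG-3'")
print("cDNA 3'-" + cdna + "-5'")
```

Each position of the mRNA is matched with its partner base, which is why the two strands can hybridise.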

19 Transcript profiling. Monitor molecular activities on a global level: protein levels (proteomics), enzyme activities, metabolites, gene expression (mRNA; transcriptomics = transcript profiling). This gives a general insight into the global behavior of the cell (holistic). Molecular biological methods: RT-PCR, SAGE, protein arrays, microarray analysis.

20

21 Transcript profiling: cDNA array. Spotted cDNA on a glass slide; an upscaled Northern hybridisation. [Figure: gene (DNA, +1) -> transcript (mRNA) -> cDNA]

22 Preparation of probes Collect cDNA clones Amplify target cDNA insert by PCR Check yield & specificity by electrophoresis Spot + PCR products on glass slides Transcript profiling

23 Detection Reference Test Reference sample Test sample RNA cDNA Transcript profiling

24 Signal 1 Signal 2 2. mRNA isolation 3. labeling 4. Hybridization + washing 5. scanning 6. Image analysis numerical value 1. Cell culture Transcript profiling

25 http://www.bio.davidson.edu/courses/genomics/chip/chip.html Transcript profiling

26 Superimposed color image * Transform into color images * Superimpose color images from R and G channel good alignment bad alignment Transcript profiling

27 Transcript profiling: superimposed color image. Black spots: gene was expressed neither in the test nor in the control sample. Green: gene was only expressed in the control sample. Red: gene was only expressed in the test sample. Yellow: gene was expressed both in the test and in the control sample.
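
The colour code above can be written down as a small decision rule. The sketch below assumes a single detection threshold that decides whether a channel counts as "expressed"; the threshold and the function name are hypothetical and only serve the illustration.

```python
def spot_colour(test_signal: float, ref_signal: float, threshold: float) -> str:
    """Classify a spot in the superimposed image.

    test_signal: red channel (test sample), ref_signal: green channel
    (control sample). A channel is called 'on' when it exceeds `threshold`,
    which is an assumed cut-off, not a value from the slides.
    """
    test_on = test_signal > threshold
    ref_on = ref_signal > threshold
    if test_on and ref_on:
        return "yellow"   # expressed in both samples
    if test_on:
        return "red"      # expressed only in the test sample
    if ref_on:
        return "green"    # expressed only in the control sample
    return "black"        # expressed in neither sample

for test, ref in [(900, 50), (40, 700), (800, 750), (30, 20)]:
    print(test, ref, spot_colour(test, ref, threshold=100))
```

In a real image the colour is continuous; the four categories are only the extreme cases named on the slide.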

28 Transcript profiling. Signal intensity is proportional to the amount of cDNA present in the sample. Image analysis: signal cy3 -> numerical value, signal cy5 -> numerical value. These values are the input for data analysis.

29 Data representation Gene profile Experiment profile

30 Transcript profiling: spotted DNA microarray; high density oligonucleotide array

31 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises

32 Experiment design. The mathematical approach depends on the experimental design: comparison of 2 samples (black/white); comparison of multiple arrays, either global dynamic profiling or a static experiment (comparison of samples: mutants, patients).

33 Type1: Comparison of 2 samples Statistical testing Control sample Induced sample Retrieve statistically over or under expressed genes 2 sample design Experiment Design

34 Experiment design: black/white experiment. Description (array V mice genes): Condition 1: pygmy mouse, 10 days old (test); Condition 2: normal mouse, 10 days old (ref). Goal: detect differentially expressed genes. Design (Latin square): Array 1: Condition 1 with dye 1 (replicate spots L and R), Condition 2 with dye 2 (replicate spots L and R). Array 2: Condition 2 with dye 1 (replicate spots L and R), Condition 1 with dye 2 (replicate spots L and R). Per gene and per condition, 4 measurements are available.

35 Measure expression of all genes During time (dynamic profile) In different conditions Identify coexpressed genes Identify mechanism of coregulation Motif Finding Clustering Multiple array design Experiment Design

36 Experiment design: multiple array design. Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999): 15 time points (E=18); time points 90 & 100 min deleted (Zhang et al. 1999, Tavazoie et al. 1999). Original dataset: 6178 genes. Preprocessing: select the 4634 most variable genes (the 25 % least variable are dropped); variance normalized; adaptive quality-based clustering (32 clusters) (95%).

37 Reference: unsynchronized cells Condition: synchronized cells during cell cycle at distinct time intervals Condition 1 Dye1 Replica L Condition 2 Dye1 Replica L Condition 3 Dye1 Replica L Condition 4 Dye1 Replica L. … Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Array 1 Reference design: e.g. Spellman dataset Experiment Design

38 Loop design Experiment Design

39 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization

40 Preprocessing: sources of variation. Overshine effects, dye effect, spot effects, array effect. These are consistent errors: they complicate direct comparison of measurements of the same gene/condition and need to be removed by preprocessing/normalization. Preprocessing is tedious and influences downstream measurements.

41 Signal 1 Signal 2 2. mRNA isolation 3. labeling 4. Hybridization + washing 5. scanning 6. Image analysis numerical value 1. Cell culture Preprocessing Dye effect

42 Dye, condition effect: within slide variation Measurement error: –Preparation mRNA –Labeling &reverse transcription Normalization Global normalization assumption Overall signal in one channel more pronounced than in other channel Preprocessing

43 Signal 1 Signal 2 2. mRNA isolation 3. labeling 4. Hybridization + washing 5. scanning 6. Image analysis numerical value 1. Cell culture Preprocessing Array effect

44 normalization within slide ratio Differences in global intensity between slides Comparison between slides impossible Array effects: between slide variation Preprocessing Hybridization differences

45 Array effects: Between slide variation Preprocessing

46 Preprocessing: spot effects (pin main effects). Measurement error: different quantity of DNA in a spot; differences between duplicate spots. Absolute levels are incomparable between genes. Ratio: compare differential expression between genes. Gene 1: test: 4, ref: 2, R/G: 2. Gene 2: test: 8, ref: 4, R/G: 2.
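
The two genes on this slide make the point numerically: their absolute intensities differ by a factor of two, yet their ratios agree. A minimal sketch of that arithmetic, with the data structure invented for the example:

```python
# Intensities taken from the slide; the dictionary layout is illustrative.
spots = {
    "gene 1": {"test": 4.0, "ref": 2.0},
    "gene 2": {"test": 8.0, "ref": 4.0},
}

# A spot effect multiplies both channels of the same spot by the same
# factor, so it cancels out in the test/ref ratio.
ratios = {gene: v["test"] / v["ref"] for gene, v in spots.items()}

for gene, ratio in ratios.items():
    print(gene, "R/G =", ratio)
```

Both genes come out at R/G = 2, although gene 2 has twice the raw signal in each channel.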

47 Non specific signal Cy5 or Cy3 resulting from overshining = emission from neighboring spots Overshine effects: within slide variation Preprocessing Background intensity increases with the intensity of the neighboring spots

48 Preprocessing. Removing sources of variation is an obligatory step: to make comparisons within a slide possible (e.g. find differentially expressed genes) and to allow interslide comparisons (e.g. combining the replicates of the original experiment and the color flip).

49 Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization ANOVA

50 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

51 Background correction compensates for overshining Background correction is considered additive Preprocessing: Background correction Background correction
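
Because the background is treated as additive, the correction is a subtraction per spot. The sketch below assumes that image analysis has already produced one foreground and one local background value per spot; the function name is hypothetical.

```python
def background_correct(foreground, background):
    """Subtract the local background from the foreground, spot by spot.

    Both arguments are equally long lists of intensities for one channel.
    """
    if len(foreground) != len(background):
        raise ValueError("one background value is needed per spot")
    return [f - b for f, b in zip(foreground, background)]

corrected = background_correct([500.0, 120.0, 95.0], [100.0, 110.0, 105.0])
print(corrected)
```

Note that a weak spot can end up at or below zero after subtraction; such values cannot be log transformed and need separate treatment during filtering.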

52 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

53 Preprocessing: log transformation. Additive error: independent of the measured intensity; the absolute level of the error remains the same (high relative error at low expression levels, low relative error at high expression levels). Multiplicative error: the error increases with the measured intensity (high absolute error at high levels, while the relative error stays the same).

54 Preprocessing: log transformation. LOG2-transformed intensity values: multiplicative effects removed, additive effects more pronounced. Residuals are constant at high intensities. Additive error: the error increases as the signal gets lower (intuitively plausible).
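
A small numerical illustration of why the log transformation behaves this way. The error sizes (a 10 % multiplicative error and an additive error of 50 intensity units) are arbitrary values chosen for the example.

```python
import math

def log2_shift(signal, additive=0.0, multiplicative=1.0):
    """Change in log2 intensity when an error is applied to a true signal."""
    observed = signal * multiplicative + additive
    return math.log2(observed) - math.log2(signal)

for signal in (100.0, 1000.0, 10000.0):
    mult = log2_shift(signal, multiplicative=1.1)  # 10 % multiplicative error
    add = log2_shift(signal, additive=50.0)        # constant additive error
    print(f"signal {signal:>7.0f}: multiplicative {mult:.3f}, additive {add:.3f}")
```

On the log2 scale the multiplicative error becomes the same shift at every intensity, while the additive error shrinks towards zero at high intensities and dominates at low ones.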

55

56 Preprocessing: log transformation. Why log2? Ratio (test/ref): test>ref, upregulation, range 1…+infinity; test<ref, downregulation, range 0...1: the range of downregulation is squashed. Log2(test/ref) = log2(test) - log2(ref): upregulation range 0…+infinity, downregulation range 0…-infinity. 2-fold overexpression: ratio = 2, log2(ratio) = 1. 2-fold underexpression: ratio = 0.5, log2(ratio) = -1.
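
The symmetry argument can be verified directly. The intensities below are invented; only the 2-fold and 0.5-fold cases come from the slide.

```python
import math

def log2_ratio(test: float, ref: float) -> float:
    """log2(test/ref), computed as a difference of log2 intensities."""
    return math.log2(test) - math.log2(ref)

up = log2_ratio(8.0, 4.0)    # 2-fold overexpression
down = log2_ratio(4.0, 8.0)  # 2-fold underexpression
print("ratio", 8.0 / 4.0, "-> log2 ratio", up)
print("ratio", 4.0 / 8.0, "-> log2 ratio", down)
```

Up- and downregulation of the same fold change end up at the same distance from zero, which the raw ratio does not offer.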

57 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

58 Preprocessing: filtering. Spots are identified by image analysis (Array Vision, ImaGene, Matarray): spot detection and signal acquisition. E.g. the signal is defined as the mean pixel intensity of all pixels in a spot whose intensity is higher than the local background + 2 SD. Spots can have different qualities; artifacts include irregular spots, spots with an excessively large diameter, and spots which are extremely small.
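
The example signal definition on this slide can be sketched as follows. It assumes that the pixel intensities of the spot and of its local background region are available as plain lists, and it takes "SD" to mean the standard deviation of the background pixels; commercial image analysis packages may define this differently.

```python
from statistics import mean, stdev

def spot_signal(spot_pixels, background_pixels, n_sd=2.0):
    """Mean intensity of the spot pixels that exceed background + n_sd * SD.

    Returns None when no pixel passes the threshold, so that such spots
    can be filtered out instead of being reported as zero.
    """
    threshold = mean(background_pixels) + n_sd * stdev(background_pixels)
    bright = [p for p in spot_pixels if p > threshold]
    return mean(bright) if bright else None

signal = spot_signal(spot_pixels=[14, 20, 30, 40], background_pixels=[10, 12, 14, 12])
print(signal)
```

In this toy case the pixel with intensity 14 falls below the threshold and is excluded from the mean.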

59 Red >0.1 stdev Green >1 stdev Blue >2 stdev Preprocessing: filtering

60 Preprocessing: filtering. Zero values: treat these separately, because the ratio and its log transformation are undefined. In a black/white experiment zero values can mark interesting genes: off in condition 1 versus on in condition 2.
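
One way to treat zero values separately is to split the spots into two groups before taking log ratios. This is a sketch under the assumption that intensities are already background corrected; gene names and values are invented.

```python
import math

def split_zero_spots(spots):
    """Separate spots with a defined log ratio from spots with a zero channel.

    `spots` maps a gene name to a (test, ref) pair of intensities.
    """
    log_ratios, flagged = {}, {}
    for gene, (test, ref) in spots.items():
        if test > 0 and ref > 0:
            log_ratios[gene] = math.log2(test / ref)
        else:
            # Ratio or log is undefined: keep the raw values for separate
            # inspection, since this may be a genuine on/off gene.
            flagged[gene] = (test, ref)
    return log_ratios, flagged

log_ratios, flagged = split_zero_spots(
    {"geneA": (800.0, 200.0), "geneB": (500.0, 0.0), "geneC": (0.0, 300.0)}
)
print(log_ratios)
print(flagged)
```

The flagged genes are not discarded; they are simply kept out of the ratio-based analysis.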

61 Preprocessing: filtering. Some genes are only labeled with the green dye, not with the red dye, and appear seemingly underexpressed. If no mRNA of a gene is present, does the green dye bind non-specifically to a spot? A color flip is essential to eliminate false positives.

62 MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization Overview

63 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

64 On average ratio red/green should be 1 – Rescale based on average of housekeeping genes – Rescale based on spikes – Rescale based on average expression value of the full array (global normalization) Methods used for normalization – linear normalization – Intensity dependent normalization Preprocessing: normalization
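
Global normalization can be sketched as a shift of all log ratios so that their average becomes 0, which corresponds to an average red/green ratio of 1. Averaging on the log2 scale is an assumption of this sketch; rescaling on housekeeping genes or spikes would use the same shift computed from a subset of spots.

```python
import math

def global_normalize(red, green):
    """Return log2(R/G) values shifted so that their mean is zero."""
    log_ratios = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    shift = sum(log_ratios) / len(log_ratios)
    return [m - shift for m in log_ratios]

normalized = global_normalize(red=[200.0, 400.0, 800.0], green=[100.0, 100.0, 100.0])
print([round(m, 3) for m in normalized])
```

The method relies on the global normalization assumption that most genes on the array are not differentially expressed.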

65 Linear Normalization G R G R Preprocessing: normalization

66 –Red and green related by a constant factor –Calculate factor by linear regression Log2(ratio) 0 0 Linear normalization factor determined by linear regression Filtering to remove outliers in the non-linear range (green values) http://afgc.stanford.edu/~finkel/talk.htm Preprocessing: normalization
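
If red and green differ by a constant factor, that factor can be estimated with a least squares line. The sketch below fits a line through the origin (R = k * G), which is one possible reading of "linear regression" on this slide, not necessarily the exact procedure used in the course.

```python
def linear_factor(red, green):
    """Least squares slope k of the line R = k * G through the origin."""
    return sum(r * g for r, g in zip(red, green)) / sum(g * g for g in green)

def linear_normalize(red, green):
    """Divide the red channel by the fitted factor so that R/G is 1 on average."""
    k = linear_factor(red, green)
    return [r / k for r in red]

green = [100.0, 200.0, 400.0]
red = [200.0, 400.0, 800.0]          # red is exactly twice green here
print(linear_factor(red, green))
print(linear_normalize(red, green))
```

Outliers in the non-linear range pull the fitted slope, which is why the slide recommends filtering them out first.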

67 Linear normalization not straightforward,… Log2(R/G) (Log2(R) + Log2(G))/2 Linear fit Lowess fit Preprocessing: normalization

68 Non-linear intensity dependent normalization Lowess (Dudoit et al., 2000) : genes seemingly underexpressed due to specific dye effect will be compensated for Log R and log G recalculated based on the lowess fit Lowess linearizes and normalizes the data !!!!! Preprocessing: normalization
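
The idea behind an intensity dependent fit can be sketched with a simplified local regression: for every spot a weighted straight line is fitted through its neighbours in the plot of log2(R/G) against average log intensity, and the fitted trend is subtracted. This is a reduced version of lowess (tricube weights, no robustness iterations) written for illustration; the function names and the `frac` default are assumptions, and it returns normalized log ratios rather than recalculated log R and log G.

```python
import math

def local_linear_fit(a_values, m_values, frac=0.5):
    """Fit M against A locally with tricube weights; return the trend at each A."""
    n = len(a_values)
    k = max(2, math.ceil(frac * n))          # number of neighbours per fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12            # bandwidth: k-th nearest distance
        w = [(1 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in a_values]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        var_a = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        cov = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = cov / var_a if var_a > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def intensity_dependent_normalize(red, green, frac=0.5):
    """Subtract the local trend from M = log2(R/G), with A = mean log2 intensity."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

A quick check uses synthetic spots with a pure intensity dependent dye bias, for example M = 0.5 * A - 2 and no true differential expression; after normalization all log ratios should be close to zero.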

69 Intensity dependent normalization Preprocessing: normalization

70 Result of the normalization Preprocessing: normalization

71 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

72 Compensates for spot effects Choice of the reference important –Intuitive reference: First time point Uninduced sample –Independent reference (reference design) Tissue mixture Intuitive interpretation possible Ratio often undefined interpretation complicated Ratio defined Preprocessing: ratio

73 Log ratio: upregulation range 0…+infinity downregulation range 0…-infinity 2 fold overexpression 2 fold underexpression Ratio = 2 Ratio = 0.5 Log2(Ratio) = 1 Log2(Ratio) = -1 ratio (R/G): R>G upregulation range 1…+infinity R<G downregulation range 0...1: range of downregulation squashed Preprocessing: ratio

74 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering normalization Ratio Test statistic (T-test) Log transformation Preprocessing Background corr

75 Overview further analysis Raw data Preprocessed data Differentially expressed genes Clusters of coexpressed genes Preprocessing ClusteringTest statistic

76 ANOVA based Filtering Linearisation Bootstrapping Log transformation Array by array approach Filtering Normalization Ratio Test statistic (T-test) Log transformation Background corr Preprocessing

77 Preprocessing: ANOVA. I. MAIN EFFECTS + EFFECT OF INTEREST. Overall mean; array effect (hybridisation efficiency); dye effect (labeling efficiency); condition effect (mRNA isolation efficiency); gene effect (constitutive level of the gene); GC effect (differential expression due to the altered variety). Model the expression level of each gene as a combination of the different factors (multifactor, linear, fixed levels). Least squares fit, subject to restrictions. Contrast of interest: estimate (GC) i1 - (GC) i2.
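
A minimal sketch of how the effect of interest can be estimated. It assumes a balanced design such as the two-array Latin square, an additive model on log intensities (overall mean + array + dye + condition + gene + GC + error) and sum-to-zero restrictions. Under those assumptions the least squares estimate of a GC effect reduces to a combination of means, because array and dye effects average out. All numbers are invented, and the code is not the implementation behind the slides.

```python
from collections import defaultdict
from statistics import mean

def gc_effects(measurements):
    """Estimate gene x condition (GC) effects from a balanced design.

    `measurements` is a list of (gene, condition, log_intensity) tuples.
    """
    by_gene, by_cond, by_cell = defaultdict(list), defaultdict(list), defaultdict(list)
    for gene, cond, y in measurements:
        by_gene[gene].append(y)
        by_cond[cond].append(y)
        by_cell[(gene, cond)].append(y)
    grand = mean(y for _, _, y in measurements)
    # Balanced design: cell mean - gene mean - condition mean + overall mean.
    return {
        (g, c): mean(ys) - mean(by_gene[g]) - mean(by_cond[c]) + grand
        for (g, c), ys in by_cell.items()
    }

# Synthetic, noise-free data on a 2-array Latin square (hypothetical values).
MU = 10.0
ARRAY = {1: 0.3, 2: -0.3}
DYE = {1: 0.2, 2: -0.2}
COND = {1: 0.1, 2: -0.1}
GENE = {"g1": 1.0, "g2": -1.0}
GC = {("g1", 1): 0.5, ("g1", 2): -0.5, ("g2", 1): -0.5, ("g2", 2): 0.5}
LAYOUT = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]  # (array, dye, condition)

data = [
    (g, c, MU + ARRAY[a] + DYE[d] + COND[c] + GENE[g] + GC[(g, c)])
    for g in GENE for (a, d, c) in LAYOUT
]
estimates = gc_effects(data)
contrast_g1 = estimates[("g1", 1)] - estimates[("g1", 2)]
print(round(contrast_g1, 6))
```

The contrast for gene g1 recovers the difference between the two GC values that were put into the simulation. For unbalanced designs a general least squares solver is needed instead of this shortcut.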

78 Preprocessing: ANOVA. Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ². Check: plot the residuals (y estimated - y measured) against the estimated intensity.

79 I. MAIN EFFECTS + EFFECT OF INTEREST Analysis of variance shows relative contribution of each of the effects Explains the relative contribution of each of these effects Preprocessing: ANOVA

80 Preprocessing: ANOVA. Advantages: gains more information with fewer observations => derives variation from all measurements made (fewer replicates required, e.g. array effect based on N-1 gene measurements). Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels. No ratios required. Requirements: requires knowledge about the experimental effects. The model used implies that all effects and combinations of effects are linear. Bootstrapping: residuals should be normally distributed around zero with constant variance.

81 Preprocessing: ANOVA bootstrap analysis. Estimate the error. Simulate new datasets based on the estimated error (3000 times). Calculate the factor of interest (GC effect) for each bootstrapped dataset (recalculate the ANOVA). Calculate a CI on (GC1-GC2) for each of the N genes based on the 3000 bootstraps. Use this interval to test for significant genes (does the interval on GC1-GC2 contain 0?).
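
The bootstrap steps can be sketched for a single gene. To stay short, the sketch makes strong simplifications that are assumptions of this example: the fitted values and the pooled residuals are taken as given, and the contrast is recomputed as a difference of condition means instead of refitting the full ANOVA on every simulated dataset.

```python
import random
from statistics import mean

def bootstrap_ci(fitted_c1, fitted_c2, residual_pool,
                 n_boot=3000, level=0.99, seed=0):
    """Percentile confidence interval for the contrast (condition 1 - condition 2).

    fitted_c1 / fitted_c2: fitted values of one gene in each condition.
    residual_pool: residuals of the fitted model, pooled over all genes.
    """
    rng = random.Random(seed)
    contrasts = []
    for _ in range(n_boot):
        # Simulate a new dataset: fitted value + residual drawn with replacement.
        sim1 = [f + rng.choice(residual_pool) for f in fitted_c1]
        sim2 = [f + rng.choice(residual_pool) for f in fitted_c2]
        contrasts.append(mean(sim1) - mean(sim2))
    contrasts.sort()
    cut = int((1 - level) / 2 * n_boot)      # number of values trimmed per tail
    return contrasts[cut], contrasts[n_boot - 1 - cut]

low, high = bootstrap_ci(
    fitted_c1=[2.0, 2.0, 2.0, 2.0],
    fitted_c2=[1.0, 1.0, 1.0, 1.0],
    residual_pool=[-0.1, -0.05, 0.0, 0.05, 0.1],
)
print("gene is called significant:", not (low <= 0.0 <= high))
```

A gene is retained when its interval does not contain 0. Pooling residuals over all genes is what lets the method work with few replicates per gene.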

82

83 Preprocessing: more arrays simultaneously. DATA: filtered for zero values; set 1: unnormalised data. MODELS (Kerr et al. 2000, 2001): Model 1 (no spot effects), Model 2 (spot effects independent), Model 3 (spot effects dependent). GC effects are not confounded with the spot effects. The type of model does influence σ (residual error) => does influence the bootstrap interval.

84 Preprocessing: more arrays simultaneously. DATA: filtered for zero values; set 1: unnormalised data. MODELS (Kerr et al. 2000, 2001): Model 1 (no spot effects), Model 2 (spot effects independent), Model 3 (spot effects dependent). GC effects are not confounded with the spot effects. The type of model does influence σ (residual error) => does influence the bootstrap interval.

85 Preprocessing: more arrays simultaneously. I. MAIN EFFECTS + EFFECT OF INTEREST. Overall mean; array effect (hybridisation efficiency); dye effect (labeling efficiency); condition effect (mRNA isolation efficiency); gene effect (constitutive level of the gene); GC effect (differential expression due to the altered variety).

86 Preprocessing: more arrays simultaneously. Least squares fit, subject to restrictions. Contrast of interest: estimate (VG)k1g - (VG)k2g. Usual confidence intervals based on normal theory are not appropriate; bootstrap analysis of the residuals avoids making distributional assumptions about the error. Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ².

87 Preprocessing: more arrays simultaneously

88 Preprocessing: more arrays simultaneously. [Figure panels: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2; axis: ŷ]

89 Preprocessing: more arrays simultaneously. Additive error and non-linear effects undermine the application of ANOVA.

90 Preprocessing: more arrays simultaneously. [Figure panels: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2; axis: ŷ]

91 Lowess 99 % confidence interval based on 100 genes, 3000 bootstraps retained 370 genes (62 T-test p value < 0.01) Bootstrap analysis Preprocessing

92 Methods comparison. Methods tested on the pygmy dataset (3750 genes): 1. ANOVA 99 % CI; 2. ANOVA 95 % CI; 3. SAM; 4. T-test; 5. Fold test. Retained 360 genes. Construct for each gene a binary profile (e.g. 1 1 1 1 1) and hierarchically cluster the genes based on this profile. Only 8 genes are retained by all methods.
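
The binary profile is simply a record of which methods selected a gene. The sketch below builds such profiles from sets of selected genes; the gene names and selections are invented and do not reproduce the pygmy results.

```python
def binary_profiles(selected_by_method):
    """Map each gene to a tuple of 0/1 flags, one per method (sorted by name)."""
    methods = sorted(selected_by_method)
    genes = sorted(set().union(*selected_by_method.values()))
    return {
        gene: tuple(int(gene in selected_by_method[m]) for m in methods)
        for gene in genes
    }

# Hypothetical selections by three of the methods.
selected = {
    "anova_99": {"geneA", "geneB"},
    "sam": {"geneA", "geneC"},
    "t_test": {"geneA", "geneB", "geneC"},
}
profiles = binary_profiles(selected)
retained_by_all = [gene for gene, flags in profiles.items() if all(flags)]
print(profiles)
print(retained_by_all)
```

Genes with identical profiles group together in the hierarchical clustering, and the all-ones profile corresponds to the genes retained by every method.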

93 methods Comparison

94 methods Comparison

95 Transcript profiling, experiment design: Latin square (mouse data set). Reference: normal mouse. Condition: pygmy mouse. Two experiments (C=1, C=2) reflect two sample time points. 2 batches (B1, B2): not all genes of the genome are on one array. Arrays: A1 (C1, B1): test = R, ref = G; A2 (C1, B1): test = G, ref = R; A3 (C1, B2): test = R, ref = G; A4 (C1, B2): test = R, ref = G; A5 (C2, B1): test = R, ref = G; A6 (C2, B1): test = G, ref = R; A7 (C2, B2): test = R, ref = G; A8 (C2, B2): test = G, ref = R.

