The microarray data analysis Ana Deckmann Carla Judice Jorge Lepikson Jorge Mondego Leandra Scarpari Marcelo Falsarella Carazzolle Michelle Servais Tais.


1 The microarray data analysis Ana Deckmann Carla Judice Jorge Lepikson Jorge Mondego Leandra Scarpari Marcelo Falsarella Carazzolle Michelle Servais Tais Herig

2 Summary - Statistics background - Introduction to microarray - Pre-processing microarray data - Statistics analysis - Applications on the LGE - Gene Chip

3 Statistics background - Error model: measurement = truth + error; error = bias + variance - Bias describes a systematic tendency of the measurement, e.g. the dyes Cy3 and Cy5 do not have the same efficiency. It is handled by normalization. - Variance is often normally distributed, e.g. instrument imperfections and biological variation. It is handled by experimental replicates (technical and biological) and statistics.

4 - Standard deviation. Mean: mean(x). Standard deviation. Gaussian function.
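For reference, a sketch of the standard definitions these headings refer to. The K-1 denominator in the standard deviation is an assumption (the slide may have used K), and the symbols \mu and \sigma for the Gaussian parameters are likewise assumed rather than taken from the slide.

```latex
\bar{x} = \frac{1}{K}\sum_{i=1}^{K} x_i, \qquad
s = \sqrt{\frac{1}{K-1}\sum_{i=1}^{K}\left(x_i - \bar{x}\right)^2}, \qquad
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```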

5 - Mean vs median: Assume data with one outlier: x = (8, 85, 7, 9, 5, 4, 13, 6, 8) - The mean of all x's, i.e. (x_1 + x_2 + ... + x_K)/K, is affected by the outlier: mean(x) = 16.11 (7.5 without the outlier) - The median of all x's, i.e. the middle value of the ordered (x_1, x_2, ..., x_K), is not affected (if < 50% of the values are outliers): x ordered = (4, 5, 6, 7, 8, 8, 9, 13, 85), median(x) = 8.0. Use the median instead of the mean if you expect artifacts. (If there are many measurements and the errors are symmetrically distributed, the median gives the same result as the mean without outliers.)
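The numbers on this slide can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

x = [8, 85, 7, 9, 5, 4, 13, 6, 8]      # data from the slide; 85 is the outlier
x_clean = [v for v in x if v != 85]    # same data with the outlier removed

mean_all = mean(x)           # pulled upward by the single outlier
mean_clean = mean(x_clean)   # mean of the remaining eight values
median_all = median(x)       # middle value of the ordered data

print(round(mean_all, 2), mean_clean, median_all)  # prints: 16.11 7.5 8
```

The median stays close to the mean of the clean data even though the outlier is still present.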

6 - Quantiles: the p quantile Q_p is the value below which the fraction p (or percent) of the points fall. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value. Example: x = (0, 10, 40, 25, 15, 50, 70, 60); ordered values x = (0, 10, 15, 25, 40, 50, 60, 70). Quantile(x; 30%) = (0, 10, 15); 1st quartile = 10; 3rd quartile = 60; Median = (25 + 40)/2 = 32.5

7 Introduction to microarray - Three different microarray technologies: - Spotted cDNA microarrays (500 to 2500 bp) - Spotted oligonucleotide microarrays (30 to 70 bp) - Affymetrix chips (25 bp) - Can be used for: differential gene expression studies, gene co-regulation studies, gene function identification studies, time-course studies, dose-response studies, clinical diagnosis, …

8 Two color architecture

9 Codelink architecture (one color) Probes: 30-mers, 90% within 550 bases of the 3' end. Targets: 10 µg biotinylated cRNA

10 Scanning (figure labels: excitation, emission; higher frequency, more energy; lower frequency, less energy; red laser, green laser; overlay images)

11 Ludwig scanner (array layout figure with labels A to H, 1 to 4, 1 to 11 and a to k; source: Scarpari, Leandra, 2006, doctoral thesis) Ludwig flags: (0) Int <= Back (1) Irregular spots (3) Spot ok (4) Saturated

12 Codelink scanner Codelink flags: (L) near background (C) contaminated (S) saturated (M) masked (G) good

13 LGE scanner (figure labels: A to H, 1 to 4) LGE defined flags: (0) Spot ok (1) Spot saturated (2) Int/Back <= 1.05 (3) Area <= 110 or 50 (9x9 or 11x11) Defined intensity: - Int Cy3 = Area Cy3 * (median(Int Cy3) - median(Bkgd Cy3)) - Int Cy5 = Area Cy5 * (median(Int Cy5) - median(Bkgd Cy5))
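The intensity definition and flag 2 on this slide can be sketched as follows. This is an illustration only, under two assumptions not stated on the slide: the area is taken as the number of pixels inside the spot, and the Int/Back ratio uses the median spot and median background pixel values. The function names and pixel values are hypothetical.

```python
from statistics import median

def integrated_intensity(spot_pixels, background_pixels):
    """Area * (median(spot) - median(background)), following the slide."""
    area = len(spot_pixels)  # assumption: area = number of pixels in the spot
    return area * (median(spot_pixels) - median(background_pixels))

def low_signal(spot_pixels, background_pixels, threshold=1.05):
    """True when Int/Back <= threshold, i.e. the spot would get flag 2."""
    # assumption: Int and Back are the median spot and background values
    return median(spot_pixels) / median(background_pixels) <= threshold

# Hypothetical pixel values for one channel of one spot
spot = [1200, 1250, 1300, 1280, 1190]
background = [200, 210, 190, 205, 195]

intensity = integrated_intensity(spot, background)  # 5 * (1250 - 200)
flag2 = low_signal(spot, background)                # 1250 / 200 is well above 1.05
```

The same function would be applied separately to the Cy3 and Cy5 channels.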

14 Cy3 = 3329280; Cy5 = 2251624; r = 0.67 (fold = -1.49). Integrated intensity = (Target median - Bkgd median) * Area (figure: pixels in, pixels out)

15 Spot examples:
- Cy3 = 222824; Cy5 = 15488; r = 0.069; fold = -14.5; flag = 0
- Cy3 = 481536; Cy5 = 676000; r = fold = 1.40; flag = 0
- Cy3 = 293664; Cy5 = 485368; r = 1.65; flag = 0
- Cy3 = 6400; Cy5 = -3584; NA (signal:noise <= 1); flag = 2
- Cy3 = 8767720; Cy5 = 1349296; r = 0.15; fold = -6.7; flag = 1

16 Pre-processing microarray data - Bioconductor repository (http://www.bioconductor.org/) - Log intensities: R = G, log2 R = log2 G. Most genes have low gene expression levels. What happens here?

17 M vs A plot. Transformed data {(M, A)_i}: M = log2(R) - log2(G) (minus); A = ½·[log2(R) + log2(G)] (add). Figure labels: up-regulated genes, down-regulated genes. Non-differentially expressed genes are now along the horizontal line: M = 0 ⇔ log2(R) - log2(G) = 0 ⇔ R = G
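A minimal sketch of the (M, A) transform defined on this slide, using only the standard library. The function name and the example intensities are illustrative assumptions.

```python
from math import log2

def ma_transform(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = log2(red) - log2(green)          # log ratio (minus)
    a = 0.5 * (log2(red) + log2(green))  # average log intensity (add)
    return m, a

# Hypothetical intensities: an unchanged gene and a gene with R = 4 * G
m_equal, a_equal = ma_transform(1024, 1024)
m_up, a_up = ma_transform(4096, 1024)
```

A spot with R = G lands on the line M = 0, while a four-fold higher red signal gives M = 2.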

18 Density plot: log2 R = red channel signal; log2 G = green channel signal

19 Print-tip box plot (figure axis: print-tip groups 1 to 16)

20 Normalization within slides Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.

21 Median normalization: sets the median of the log intensity ratios to zero (median value = 0). Lowess normalization: global lowess normalization
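Median normalization as described here amounts to subtracting the median log ratio from every spot. A minimal sketch, assuming the M values of one slide are held in a plain list; the values are hypothetical, and the lowess variant is not shown.

```python
from statistics import median

def median_normalize(m_values):
    """Shift the log ratios so that their median becomes zero."""
    center = median(m_values)
    return [m - center for m in m_values]

m_raw = [0.4, 0.6, 0.5, 2.5, -1.0]   # hypothetical log2 ratios for one slide
m_norm = median_normalize(m_raw)
```

After the shift, most spots sit around M = 0, which matches the expectation that most genes are not differentially expressed.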

22 Print-tip normalization: print-tip group lowess normalization. X*_ij = (X_ij - median(GRID_j)) / sd(GRID_j). Scaled print-tip: scaled print-tip group lowess normalization
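The centering and scaling formula on this slide can be sketched per print-tip group as below. This covers only the formula X* = (X - median(GRID)) / sd(GRID), not the lowess fit, and it assumes the sample standard deviation; the function name and the data are hypothetical.

```python
from statistics import median, stdev

def scale_by_grid(values_by_grid):
    """Center each grid on its median and divide by its standard deviation."""
    scaled = {}
    for grid, values in values_by_grid.items():
        center = median(values)
        spread = stdev(values)  # assumption: sample standard deviation (n - 1)
        scaled[grid] = [(x - center) / spread for x in values]
    return scaled

# Hypothetical values for two print-tip groups with different spreads
grids = {1: [0.0, 1.0, 2.0], 2: [10.0, 20.0, 30.0]}
scaled = scale_by_grid(grids)
```

After scaling, both groups have the same center and spread, so the print-tip groups become comparable.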

23 Normalization across slides - QUANTILE: QQ plot, mean between 8 slides
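Quantile normalization across slides can be sketched in plain Python: sort each slide, average the values at each rank across slides, and give every spot the mean for its rank. This is a simplified illustration (ties are broken by position and missing values are not handled), not the Bioconductor implementation; the function name and data are hypothetical.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists of intensities, one list per slide."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all slides
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # spot indices, smallest first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]  # replace each value by the mean for its rank
        normalized.append(out)
    return normalized

slides = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]   # two hypothetical slides, three spots
normalized = quantile_normalize(slides)
```

After this step every slide has the same distribution of values, which is what a QQ plot between slides would show as a straight diagonal.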

24 - LOWESS (applied to one-color microarrays). Transformed data {(M, A)_i}: M = log2(Int_1) - log2(Int_2); A = ½·[log2(Int_1) + log2(Int_2)]

25 Statistical analysis - T statistic test. The T statistic down-weights the importance of the average if the deviation is large, and vice versa: T = mean(x) / SE(x), where SE(x) = std.dev(x)/√N (standard error of the mean). The blue gene has a lower T-value than the red gene.
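A minimal sketch of the T statistic, taking the standard error of the mean as std.dev(x)/sqrt(N) and using the sample standard deviation (an assumption). The two replicate sets are hypothetical and share the same mean, so only their spread differs.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(x):
    """T = mean(x) / SE(x), with SE(x) = std.dev(x) / sqrt(N)."""
    n = len(x)
    se = stdev(x) / sqrt(n)   # standard error of the mean
    return mean(x) / se

stable = [1.0, 1.1, 0.9, 1.0]    # hypothetical log ratios, small spread
noisy = [1.0, 3.0, -1.0, 1.0]    # same mean, large spread

t_stable = t_statistic(stable)
t_noisy = t_statistic(noisy)
```

Both genes have the same average log ratio, but the noisy one gets a much lower T-value, which is the down-weighting the slide describes.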

26 Top table and volcano plot. Fold change = ratio if ratio >= 1, or -1/ratio if ratio < 1
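The signed fold change rule can be written as a small function. This is a sketch; it assumes the ratio is Cy5/Cy3, which is consistent with the spot examples on slide 15, and it rejects non-positive ratios, a choice made here rather than one stated on the slide.

```python
def fold_change(ratio):
    """Signed fold change: ratio if ratio >= 1, otherwise -1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio

fc_up = fold_change(676000 / 481536)   # spot from slide 15, Cy5/Cy3
fc_down = fold_change(0.5)             # a halved signal becomes -2

print(round(fc_up, 2), fc_down)        # prints: 1.4 -2.0
```

With this convention a two-fold increase and a two-fold decrease are reported symmetrically as 2 and -2.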

27 Cluster data analysis

28 Missing values. Bioinformatics (2001) vol 17, n. 6, 520-525. Gene expression microarray experiments can generate data sets with multiple missing expression values. Accurate estimation of missing values is important for efficient data analysis.

29 Applications on the LGE - Codelink (Ana Deckmann) - There is a Bioconductor package for Codelink - Pipeline used: Read Codelink file; Normalize between slides: method LOWESS (BMC Bioinformatics 2005, 6:309); Background corrected; Bad spots excluded (flags: C, S, M, X and I); Clustering and data analyses; Replicate validation (at least the flags: GG x GG, GG x LL, LL x GG); Statistical analyses (fold change >= 2, P-value <= 0.05)
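The final statistical filter (fold change >= 2 and P-value <= 0.05) can be sketched as below. It assumes the fold change threshold applies to the absolute value of the signed fold change, so that down-regulated genes also pass; the gene names and values are hypothetical.

```python
def is_significant(fold, p_value, min_fold=2.0, max_p=0.05):
    """Keep a gene when |fold change| >= min_fold and P-value <= max_p."""
    # assumption: the threshold is applied to the absolute fold change
    return abs(fold) >= min_fold and p_value <= max_p

# Hypothetical results: name -> (signed fold change, P-value)
results = {
    "gene_a": (5.7, 0.0001),   # strong up-regulation, significant
    "gene_b": (1.2, 0.5),      # small change, not significant
    "gene_c": (-3.0, 0.001),   # strong down-regulation, significant
    "gene_d": (2.5, 0.2),      # large change but not significant
}

selected = [name for name, (fold, p) in results.items() if is_significant(fold, p)]
```

Only genes passing both cutoffs are kept, which is the region highlighted in a volcano plot.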

30 LOWESS

31

32 - Ludwig (Leandra Scarpari) - Reformat file from ScanArray (Ludwig) to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma) - Pipeline used: Background corrected; Reformat file; Read ScanAlyze file; Normalize across slides: method quantile; Clustering and data analyses (results were compatible with Ludwig analyses); Bad spots excluded (flags: 0, 1, 2 and 4); Normalize within arrays: method lowess (Nucleic Acids Research, 2002, Vol 30, No 4); Replicate validation (at least flag = 3 in 2 internal replicates for each array); Statistical analyses (fold change >= 2, P-value <= 0.05)

33 LOWESS

34 QUANTILE

35

36 - LGE (two color) - Reformat file from Scanner LGE to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma) - Pipeline used: Background corrected; Reformat file; Read ScanAlyze file; Normalize within arrays: method lowess; Normalize across slides: method quantile; Data analyses; Bad spots excluded (flag: 2, ratio Int/Back < XX); Replicate validation (at least flag 3 in 2 internal replicates for each array); Statistical analyses (fold change >= 2, P-value <= 0.05)

37 LOWESS + QUANTILE

38 - LGE (one color) - Reformat file from Scanner LGE to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma) - Pipeline used: Background corrected; Reformat file; Read ScanAlyze file; Normalize within arrays: method median; Normalize across slides: method quantile; Clustering and data analyses; Bad spots excluded (flag: 2, ratio Int/Back < XX); Replicate validation (at least flag 3 in 2 internal replicates for each array); Statistical analyses (fold change >= 2, P-value <= 0.05)

39 MEDIAN + QUANTILE

40 More highly expressed in Op0d
Cutoff/background; Sample; p-value; Fold change; Identity; Organism
0.05; G1.i10; 6.93E-07; 5.66; gnl|Amel_1.1|Contig6992 2e-13; Apis mellifera
; F1.j10; 2.59E-06; 4.05; unknown; Apis mellifera
; D1.i10; 7.70E-05; 3.08; no hits (low quality);
0.01; B1.a2; 0.515352; 1.21; Dunce 2e-39; Drosophila melanogaster
More highly expressed in Op5d
Cutoff/background; Sample; p-value; Fold change; Identity; Organism
0.05; H4.b2; 0.00017; -3.00; gnl|Amel_1.1|Contig4902 2e-55; Apis mellifera
; B3.i3; 0.000992; -2.35; gnl|Amel_1.1|Contig896 1e-09; Apis mellifera
; H2.d2; 0.001343; -2.16; gnl|Amel_1.1|Contig10843 1e-16; Apis mellifera
0.01; H4.h3; 0.015089; -2.80; Groucho 1.6e-14; Anopheles gambiae

41 Gene Chip

42

43

44

45

46 End

47 Comparison of normalization methods for Codelink Bioarray data. Differences between pairs of arrays in the technical replicates: (1) Array 1 vs array 4 (2) Array 4 vs array 5. BMC Bioinformatics 2005, 6:309

48 - Within-slide normalization (figure panels: before and after print-tip normalization; no normalization, print-tip, scaled print-tip). Nucleic Acids Research, 2002, vol 30, No 4

