Presentation on theme: "Introduction to Microarray Gene Expression"— Presentation transcript:
1 Introduction to Microarray Gene Expression. Shyamal D. Peddada, Biostatistics Branch, National Inst. Environmental Health Sciences (NIH), Research Triangle Park, NC
2 Outline of the four talks. Talk 1, a general overview of microarray data: some important terminology and background; various platforms; sources of variation; normalization of data. Talk 2, analysis of gene expression data with nominal explanatory variables: two types of explanatory variables; scientific questions of interest; a brief discussion of false discovery rate (FDR) analysis; some existing methods of analysis.
3 Outline of the four talks. Talk 3, analysis of ordered gene expression data: common experimental designs; some existing statistical methods; an example; demonstration of ORIOGEN; some open research problems. Talk 4, analysis of data from cell-cycle experiments: some background on cell-cycle experiments; modeling the data; data from multiple experiments; some open research problems.
5 To perform statistical analysis of any given data, it is important to understand all sources of (i) bias and (ii) variability. Have some basic understanding of the underlying technology! Understand the sampling/experimental design.
8 Some background terminology: DNA and RNA. DNA (deoxyribonucleic acid): contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded. Four nucleotides (building blocks of DNA): adenine (A), guanine (G), thymine (T), cytosine (C). Base pairs: (A, T), (G, C). E.g. 5’ ---AAATGCAT--- 3’ pairs with 3’ ---TTTACGTA--- 5’.
9 Some background terminology: DNA and RNA. RNA (ribonucleic acid): transcribed (or copied) from DNA. It is single stranded (a complementary copy of one of the strands of DNA). RNA polymerase: an enzyme that helps in the transcription of DNA to form RNA. Four nucleotides (building blocks of RNA): adenine (A), guanine (G), uracil (U), cytosine (C). Base pairs: (A, U), (G, C).
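The pairing rules on the two slides above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and strands are written left to right in the same order as the slide's example.

```python
# Pairing rules from the slides: A-T and G-C in DNA; A-U and G-C when RNA is made.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}  # DNA template base -> RNA base


def complement_strand(strand: str) -> str:
    """Return the DNA strand that pairs with `strand`, position by position."""
    return "".join(DNA_PAIR[base] for base in strand)


def transcribe(template: str) -> str:
    """Return the RNA that is complementary to a DNA template strand."""
    return "".join(RNA_PAIR[base] for base in template)


top = "AAATGCAT"                 # 5' -> 3' strand from the slide
bottom = complement_strand(top)  # the paired 3' -> 5' strand
rna = transcribe(bottom)         # same letters as `top`, with U in place of T
```

Transcribing the lower strand gives back the upper strand's sequence with uracil replacing thymine, which is what "complementary copy of one of the strands" means in practice.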
10 Some background terminology: Types of RNA. Types of RNA: transfer RNA (tRNA), ribosomal RNA (rRNA), etc. mRNA (messenger RNA): carries information from DNA to the ribosomes, where protein synthesis takes place (less stable than DNA).
11 Some background terminology: Oligos. Oligonucleotide: a short segment of DNA consisting of a few base pairs, commonly called an “oligo” for short. “mer”: the unit of measurement for an oligo, equal to its number of base pairs. So a 30-base-pair oligo is 30-mer long.
12 Some background terminology: Probes. cDNA (complementary DNA): a DNA sequence that is complementary to a given mRNA, obtained using an enzyme called reverse transcriptase. Probe: a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.
13 Some background terminology: “Blots”, the origins of microarrays. Southern blot (Edwin Southern, 1975, J. Molec. Biol.): a method used to identify the presence of a DNA sequence in a sample of DNA. Western blot (immunoblot): used to identify a specific protein in a tissue extract.
14 Some background terminology. Southwestern blot: used to identify and characterize DNA-binding proteins. Northern blot: a method used to study gene expression from a sample of mRNA.
16 Northern blot vs. microarray. Rate of expression analysis: microarray, thousands of genes at a time (high throughput); Northern blot, a few genes at a time. Automation: microarray, automation possible; Northern blot, manual. Scope: microarray, allows exploring relationships among several hundred genes at the same time; Northern blot, limited.
17 What is a microarray? Sequences from thousands of different genes are immobilized, or attached, at fixed locations on a support. They are either spotted or synthesized directly onto the support.
18 Microarray technology. Two-color dye arrays (spotted arrays): spotted cDNA microarrays; spotted oligo microarrays. Single-dye arrays: in situ oligo microarrays.
21 Spotted DNA microarray. Slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths. Expression profiling involves the hybridization of fluorescently labeled cDNA, prepared from cellular mRNA, to microarrays carrying thousands of unique sequences.
22 Spotted DNA microarray. A spotted DNA array is typically “home made”, so you need to think about: cDNA or oligo; location of the oligo in a given gene; oligo length (number of bp).
23 Spotted DNA microarray. Gene expression Y: Y < 0, the gene is over-expressed in the green-labeled sample compared to the red-labeled sample; Y = 0, the gene is equally expressed in both samples; Y > 0, the gene is over-expressed in the red-labeled sample compared to the green-labeled sample.
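The slide does not preserve how Y is computed. A common convention that matches the three sign cases is the log ratio of the two channel intensities, Y = log2(red / green). The sketch below assumes that definition; the function names and intensities are invented for illustration.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Assumed definition (not stated on the slide): Y = log2(red / green)."""
    return math.log2(red / green)


def interpret(y: float) -> str:
    """Map the sign of Y to the three cases listed on the slide."""
    if y > 0:
        return "over-expressed in red-labeled sample"
    if y < 0:
        return "over-expressed in green-labeled sample"
    return "equally expressed in both samples"


y = log_ratio(red=2000.0, green=500.0)  # red is four times green
label = interpret(y)
```

Under this convention a value of Y = 1 corresponds to a 2-fold difference and Y = -1 to a 2-fold difference in the other direction, which is one reason the log scale is convenient.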
25 Major commercial platforms. More than 50 companies currently offer various DNA microarray platforms, reagents and software. Affymetrix dominated the market for many years. Agilent has one- and two-color microarray platforms.
26 Affymetrix GeneChip. Each gene is represented by 11 to 20 oligos, each a 25-mer. Probe: an oligo of 25-mer. Probe pair: a PM and MM pair. Perfect match (PM): a 25-mer complementary to a reference sequence of interest (part of the gene). Mismatch (MM): same as the PM, with a single base change at the middle (13th) base (G <-> C, A <-> T). Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene.
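As a small illustration of the PM/MM relationship described above, the sketch below derives a mismatch probe from a perfect-match probe by swapping the 13th base. The probe sequence is made up, and `mismatch_probe` is a name chosen here, not part of any Affymetrix software.

```python
# Base swaps at the middle position, as listed on the slide: G <-> C, A <-> T.
SWAP = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm: str) -> str:
    """Return the MM probe: identical to the PM except at the 13th base."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # the 13th base has zero-based index 12
    return pm[:middle] + SWAP[pm[middle]] + pm[middle + 1:]


# Made-up 25-mer: 12 bases, the middle base, then 12 more bases.
pm = "ATGCATGCATGC" + "G" + "TACGTACGTACG"
mm = mismatch_probe(pm)
differing = [i for i in range(25) if pm[i] != mm[i]]
```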
27 Affymetrix call for the presence of a signal. The Affymetrix detection algorithm uses probe-pair intensities to obtain a detection p-value. Using this p-value, it decides whether the signal is “present”, “marginal” or “absent”.
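28 Affy call: detection p-value. Calculate the discrimination score T = (PM - MM) / (PM + MM) for each probe pair. The scores are compared with a small threshold tau (a tuning constant of the Affymetrix algorithm, not Kendall’s tau rank correlation). Determine the statistical significance of the gene by computing the p-value of a one-sided Wilcoxon signed-rank test of the scores against tau.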
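The detection call can be sketched as follows. This is a simplified illustration, not the vendor implementation: the threshold `tau` and the cutoffs `alpha1` and `alpha2` are assumed illustrative values, the intensities are made up, and the exact signed-rank p-value below assumes no zero differences and no tied absolute differences.

```python
def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_p_value(scores, tau):
    """Exact one-sided p-value P(W+ >= observed) for the signed-rank statistic.

    Simplification: zero differences and tied absolute differences are not
    given any special treatment.
    """
    diffs = [s - tau for s in scores]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    n = len(diffs)
    max_w = n * (n + 1) // 2
    # counts[w] = number of sign patterns whose positive ranks sum to w.
    counts = [0] * (max_w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return sum(counts[w_plus:]) / 2 ** n


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Translate the p-value into a call; the cutoffs are assumptions."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"


# Made-up probe set of 11 pairs in which PM clearly exceeds MM.
pm = [1200, 980, 1500, 1100, 1340, 900, 1750, 1020, 1280, 1430, 1160]
mm = [400, 410, 520, 300, 610, 450, 500, 390, 470, 380, 505]
p = signed_rank_p_value(discrimination_scores(pm, mm), tau=0.015)
call = detection_call(p)
```

When every score exceeds the threshold, only one of the 2^11 possible sign patterns is as extreme as the observed one, so the p-value is 1/2048 and the call is "present".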
32 Which platform to choose? Every platform has its own unique features. Choose a platform based on: the nature of the study; the amount of available RNA; cost. Platforms are compared in the MAQC study.
33 MAQC project. Objective: to generate a set of quality-control tools for the microarray research community. 137 participants representing 51 organizations. Gene expression from two distinct RNA samples (4 samples in total): Sample A = 100% Universal Human Reference RNA (UHRR); Sample B = 100% Human Brain Reference RNA (HBRR); Sample C = 75% UHRR + 25% HBRR; Sample D = 25% UHRR + 75% HBRR.
35 Why normalize data? To “calibrate” or adjust the data so as to reduce or eliminate effects that arise from variation in technology and other sources rather than from true biological differences between test groups.
36 Sources of bias/variation. Tissue or cell lines. mRNA: it can degrade over time, so there is a potential batch effect if portions of an experiment are performed at different times; purity and quantity. Dye color effect (spotted arrays). Variation due to technology, which is substantially reduced with improved technology. Etc.
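37 A useful graphical representation of data. Data matrix: [definition not preserved in the transcript].
38 A useful graphical representation of data. Let its spectral decomposition be given by [formula not preserved in the transcript], where [definitions not preserved].
39 A useful graphical representation of data. Then plot [quantities not preserved in the transcript].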
41 Internal control normalization (housekeeping gene(s)). Expression of each gene is measured relative to the average of the housekeeping genes. Basic assumption: expression of housekeeping genes does not change. Disadvantages: housekeeping genes may sometimes be highly expressed; unexpected regulation of housekeeping gene(s) leads to misinterpretation.
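A minimal sketch of housekeeping normalization, assuming the comparison is made on the log2 scale (the slide does not say which scale is used). The intensities are made up and the function name is hypothetical.

```python
import math
from statistics import mean


def housekeeping_normalize(intensities, housekeeping):
    """Express each gene's log2 intensity relative to the housekeeping average."""
    log_expr = {gene: math.log2(value) for gene, value in intensities.items()}
    baseline = mean([log_expr[gene] for gene in housekeeping])
    return {gene: value - baseline for gene, value in log_expr.items()}


# Made-up intensities for one chip; GAPDH and ACTB play the housekeeping role.
chip = {"GAPDH": 1024.0, "ACTB": 4096.0, "geneX": 8192.0, "geneY": 512.0}
normalized = housekeeping_normalize(chip, ["GAPDH", "ACTB"])
```

If a housekeeping gene is itself regulated by the treatment, the baseline moves and every other gene appears to change, which is the misinterpretation risk noted on the slide.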
42 Global normalization. Basic assumption: the mean/median expression ratio of all monitored mRNAs is constant across a chip. Regression of [formula not preserved in the transcript]. In simple terms, the log ratios are corrected by a common “mean” or “median”. This method can also be applied to single-dye data.
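The "common mean or median" correction can be sketched as a single shift applied to every log ratio on a chip. The values below are made up, and `global_normalize` is a name chosen for this illustration.

```python
from statistics import median


def global_normalize(log_ratios, center=median):
    """Subtract one common location estimate (median by default) from all spots."""
    shift = center(log_ratios)
    return [value - shift for value in log_ratios]


# Made-up log ratios from one chip; most spots sit near 1.0 instead of 0.
raw = [0.9, 1.1, 1.0, 3.0, 0.8, 1.2, -1.0]
normalized = global_normalize(raw)
```

Passing `center=statistics.mean` gives the mean-based version; the median is less sensitive to a few strongly regulated genes.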
43 Linear normalization (for spotted arrays). Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity. Regression of [formula not preserved in the transcript].
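The regression formula on this slide is lost, so the sketch below uses a common convention as an assumption: M = log2(R/G) is regressed on the average log intensity A = (log2 R + log2 G) / 2, and the fitted line is subtracted. All names and numbers are illustrative.

```python
import math
from statistics import mean


def ma_values(red, green):
    """Assumed conventions: M = log2(R/G), A = average of log2 R and log2 G."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def linear_normalize(m, a):
    """Fit M = intercept + slope * A by least squares; return the residuals."""
    a_bar, m_bar = mean(a), mean(m)
    sxy = sum((x - a_bar) * (y - m_bar) for x, y in zip(a, m))
    sxx = sum((x - a_bar) ** 2 for x in a)
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(a, m)]
    return residuals, intercept, slope


# Made-up spots whose log ratio drifts upward with intensity.
a = [6.0, 8.0, 10.0, 12.0, 14.0]
m = [0.2 + 0.1 * x for x in a]
residuals, intercept, slope = linear_normalize(m, a)
```

The residuals are the normalized log ratios: the intensity-dependent trend has been removed, so what remains is interpreted as differential expression.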
44 Non-linear normalization (for spotted arrays). Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity. Regression of [formula not preserved in the transcript], where the regression function is estimated by the robust scatterplot smoother LOWESS (LOcally WEighted Scatterplot Smoothing).
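To give a feel for the smoother, here is a deliberately simplified local linear fit with tricube weights. It is not full LOWESS: the robustness iterations that down-weight outliers are omitted, and the span `frac`, the function names and the data are assumptions made for this sketch.

```python
def tricube(u):
    """Tricube kernel: weight falls smoothly to zero at |u| = 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(x0, xs, ys, frac):
    """Weighted straight-line fit around x0, evaluated at x0."""
    k = max(3, int(round(frac * len(xs))))
    dists = sorted(abs(x - x0) for x in xs)
    bandwidth = dists[k - 1] or 1e-12  # distance to the k-th nearest point
    w = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_bar + slope * (x0 - x_bar)


def smooth_normalize(m, a, frac=0.5):
    """Subtract the locally fitted trend of M on A from each spot."""
    fitted = [local_linear_fit(x0, a, m, frac) for x0 in a]
    return [y - f for y, f in zip(m, fitted)]


# Made-up spots with a curved intensity-dependent bias in the log ratio.
a = [6.0 + 0.5 * i for i in range(17)]
m = [0.02 * (x - 10.0) ** 2 for x in a]
normalized = smooth_normalize(m, a)
```

A straight line could not remove this curved bias, while the local fit follows it closely. For real data a library implementation with robustness iterations is the better choice.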
45 Analysis of variance (ANOVA). Standard analysis of variance model. Response variable: gene expression. Explanatory variables: dye color; batch; other potential effects? Advantage: statistically significant genes can be identified while controlling for the various experimental conditions/factors.
46 Some important experimental designs. Pooled samples versus separate samples: sometimes there may not be sufficient biological sample/specimen from a given animal. In such cases, biological samples from several identical animals are pooled to form a sample.
47 An example of a pooling design (for each treatment group). Diagram: subjects, pools, observations (microarray chips).
48 The pooling design. Subjects: 9; pools: 3 (3 subjects per pool); observations (microarray chips): 6. More generally: n subjects, p pools (r = n/p subjects per pool), m observations.
49 The standard design. Subjects, pools, observations (microarray chips), with r = 1. More generally: n subjects, p = n pools, m = n observations.
50 Some issues. What are the underlying parameters? Effect of pooling on power. The basic assumption. Validity of the assumption.
51 Parameters. Total variation in the expression of a gene can be decomposed into biological variation and technical variation. Biological samples (n). Number of pools (p). Biological samples per pool (r = n/p). Observed number of samples, e.g. microarrays (m).
52 Some comments about pooling. The variance of the estimated mean expression of a gene depends on: the number of pools (p); the number of biological samples per pool (r); the number of arrays (m); the biological variation; the technical variation. Pooling works well when the biological variation in gene expression is substantially larger than the technical variation.
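The slide lists what the variance depends on but not the formula. The sketch below uses one simple form as an assumption: if a pool's expression is the average of its r subjects, errors are independent, and the m arrays are spread evenly over the p pools with one technical error per array, then the variance of the estimated mean is var_bio / (p * r) + var_tech / m. The numeric variances are made up.

```python
def variance_of_mean(p, r, m, var_bio, var_tech):
    """Variance of the estimated mean under the assumptions stated above."""
    return var_bio / (p * r) + var_tech / m


# Illustrative variances: biological variation 4 times the technical variation.
VAR_BIO, VAR_TECH = 4.0, 1.0

standard_5 = variance_of_mean(p=5, r=1, m=5, var_bio=VAR_BIO, var_tech=VAR_TECH)
standard_6 = variance_of_mean(p=6, r=1, m=6, var_bio=VAR_BIO, var_tech=VAR_TECH)
pooled_6 = variance_of_mean(p=3, r=2, m=3, var_bio=VAR_BIO, var_tech=VAR_TECH)
pooled_10 = variance_of_mean(p=5, r=2, m=5, var_bio=VAR_BIO, var_tech=VAR_TECH)
```

With the same five arrays per group, pooling ten subjects gives a smaller variance than running five subjects separately, because the biological term shrinks with the total number of subjects while the technical term is unchanged. The advantage fades as technical variation grows relative to biological variation.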
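53 Power comparisons (Zhang and Gant, 2005). Columns: # biological samples, # microarrays, pool size, power. Rows: 5/group, 5/group, pool size 1 (standard design); 6/group, 6/group, pool size 1 (standard design); 6/group, 3/group, pool size 2 (i.e. 3 pools/group); 8/group, 4/group (i.e. 4 pools/group); 10/group, 5/group (i.e. 5 pools/group). [Power values not preserved in the transcript.]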
54 Power comparisons. Conditions of the simulation study: biological variation is 4 times the technical variation; the false positive rate is [value not preserved in the transcript]; detect 2-fold expression; data are normally distributed.
55 A fundamental assumption. Biological averaging: suppose an experiment consists of pooling “r” samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene’s expression in the “r” samples. This assumption need not be true, especially if the expression values are transformed non-linearly.
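A small numeric illustration of why a non-linear transformation matters here: even if pooling averages the raw expression values exactly, the log of the pooled value is not the average of the individual logs. The three expression values are made up.

```python
import math

# Made-up raw expression of one gene in r = 3 subjects.
raw = [100.0, 400.0, 1600.0]

# Biological averaging on the raw scale: the pool behaves like the mean.
pooled_raw = sum(raw) / len(raw)

# What the array would report after a log2 transformation of the pool...
log_of_pool = math.log2(pooled_raw)

# ...versus the average of the individually log-transformed values.
mean_of_logs = sum(math.log2(v) for v in raw) / len(raw)

gap = log_of_pool - mean_of_logs  # positive for any non-constant sample
```

The gap is positive whenever the subjects differ, so on the log scale the pooled measurement overstates the average of the individual measurements, and the overstatement grows with the spread between subjects.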
56 Some important experimental designs. Reference designs (spotted arrays): each treatment sample is hybridized against a common reference control. Loop designs (spotted arrays): suppose we have a control and three experimental groups A, B and C. Then hybridize Control with A, A with B, B with C and C with A.
57 Data analysis: preliminaries. Normalization. Transformation of data (usual methods): perhaps first fit an ANOVA and plot the residuals; log transformation; square root; more generally, the Box-Cox family of transformations. Identify potential outliers in the data (again, perhaps using the residuals).
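The Box-Cox family mentioned above contains the log and, up to a linear rescaling, the square-root transformation as special cases. A minimal sketch in its standard one-parameter form (the function name and test values are chosen here for illustration):

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive data")
    if abs(lam) < 1e-12:
        return math.log(y)           # limiting case lam -> 0: log transformation
    return (y ** lam - 1.0) / lam    # lam = 0.5 is 2 * (sqrt(y) - 1)


log_case = box_cox(math.e, 0.0)   # natural log of e
sqrt_case = box_cox(9.0, 0.5)     # 2 * (3 - 1)
identity_case = box_cox(5.0, 1.0) # y - 1, i.e. no change of shape
```

In practice the parameter is chosen so that the residuals (for example from the ANOVA fit mentioned on the slide) look as close to normal with constant variance as possible.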
58 Data analysis. The method of analysis depends upon the scientific question of interest. In the next three lectures we describe several general methods and illustrate some using real data!