
1 Microarrays Naomi Altman Dept. of Statistics and PSU March 7, 2007.


1 1 Microarrays Naomi Altman Dept. of Statistics and PSU March 7, 2007

2 2 Microarrays 100: A Statistician’s Simplification. A microarray is a piece of glass or polymer with several thousand spots, each of which contains thousands of copies of a short piece of one strand of the "double helix" of DNA or of cDNA (to be explained). Each rung of the DNA ladder consists of 2 bound bases, which are designated C, G, A, T. A bound pair is called a base pair. Each spot on a microarray consists of a piece of one side of the ladder with the attached bases. C binds only to G. A binds only to T. A sample containing an unknown number of the complementary strands is labeled and hybridized to the array. The response is a measure of the quantity of label at each spot, which should be proportional to the number of complementary strands in the sample. http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/
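
The pairing rule on this slide (C binds only to G, A binds only to T) can be sketched in a few lines of Python. This is only an idealized illustration: the function names and the 9-base probe are hypothetical, and real hybridization tolerates some mismatches, which this sketch ignores.

```python
# Watson-Crick pairing as stated on the slide: C binds only to G, A binds only to T.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the strand that would pair with `strand`, written in its own reading direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hybridizes(probe: str, target: str) -> bool:
    """Idealized rule (an assumption): a target binds only if it is the exact reverse complement."""
    return target == reverse_complement(probe)


probe = "CCGTTCACA"                  # hypothetical 9-base probe printed on one spot
target = reverse_complement(probe)   # the labeled strand this spot would capture
print(target)                        # TGTGAACGG
print(hybridizes(probe, target))     # True
print(hybridizes(probe, probe))      # False: a strand does not pair with itself here
```

Under this rule the label measured at a spot counts only strands complementary to the printed probe, which is why the response is expected to track the number of complementary strands in the sample.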

3 3 Agenda
1. DNA 100: what we need to know to understand what a microarray can measure.
2. What can a microarray measure?
3. Where does the material printed on the microarray come from?
4. What does a microarray experiment "look like" and where do statistical methods fit in?
5. (Time permitting) Gene expression experiments and the p >> n problem.

4 4 DNA 100: A Statistician’s Simplification. Every cell in an organism has the same genetic material, stored in the double helix of DNA. In a diploid organism, most cells have 2 copies of each chromosome. Genes are the parts of the DNA that code for proteins, but there are many other important features that interest biologists. http://www.accessexcellence.org/RC/VL/GG/chromosome.html

5 5 Transcription (Making RNA)
- Transcription factors bind to the promoter and bind RNA polymerase.
- The DNA strands separate and transcription is initiated.
- Transcription continues, reading the template strand in the 3'-to-5' direction, until a termination signal is reached.
- The completed RNA strand is released for post-processing.
www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm

6 6 Introns and Exons In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions, called exons. The introns are spliced out of the RNA transcript before translation into protein. The resulting spliced strand is the mRNA. "Splicing variants" can be formed by the cell selecting combinations of the exons. We can "predict" exons using statistical algorithms, but the gold standard is a match to an mRNA sequence, because only exons appear in the mRNA. At each end of the mRNA is an untranslated region (UTR) which is unique to the gene. http://biology.unm.edu/ccouncil/Biology_124/Summaries/T&T.html [Figure labels: chromosome, promoter]
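
A toy sketch of splicing, assuming exons are given as (start, end) slices of the gene sequence. The sequence, the coordinates, and the function names are all hypothetical, and enumerating every ordered subset of exons is a simplification: a real cell produces only some of these combinations.

```python
from itertools import combinations


def splice(gene: str, exons: list) -> str:
    """Join the exon slices (start, end) of `gene`, dropping the introns between them."""
    return "".join(gene[start:end] for start, end in exons)


def splice_variants(gene: str, exons: list) -> list:
    """Every transcript built from a non-empty subset of exons, kept in genome order."""
    variants = []
    for k in range(1, len(exons) + 1):
        for subset in combinations(exons, k):
            variants.append(splice(gene, list(subset)))
    return variants


# Toy gene: uppercase runs stand for exons, lowercase runs for introns.
gene = "AAAgtgtCCCgtgtGGG"
exons = [(0, 3), (7, 10), (14, 17)]

print(splice(gene, exons))                 # AAACCCGGG
print(len(splice_variants(gene, exons)))   # 7 candidate variants from 3 exons
```

The count grows as 2^k - 1 for k exons, which is one reason an exon array needs a separate probe per exon rather than one probe per gene.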

7 7 cDNA RNA is much less stable than DNA. To preserve the exon sequence, and for printing microarrays, reverse transcription is used in the lab to convert the RNA into the complementary cDNA. cDNA can be preserved by inserting it into the genome of a living microbe (cDNA library).

8 8 DNA 100: A Statistician’s Simplification. DNA is complicated stuff. Protein-coding regions are called genes. There are also other functional parts of the DNA, some of which code for RNA and some of which are regulatory regions, i.e. they help control how the coding regions are used (e.g. promoters). The supercoiling of the DNA may also control how the coding regions are used. As well, there is a lot of DNA which appears to be "junk", i.e. to date no function is known. But we keep making new discoveries, e.g. some of the "junk" codes for small RNA pieces that are functional.

9 9 What can be measured on a microarray?
1. Amount of mRNA expressed by a gene.
2. Amount of mRNA expressed by an exon.
3. Amount of RNA expressed by a region of DNA.
4. Which strand of DNA is expressed.
5. Which of several similar DNA sequences is present in the genome.
6. How many copies of a gene are present in the genome.
7. Where a known protein has bound to the DNA (ChIP on chip).

10 10 What can be measured on a microarray?
1. Amount of mRNA expressed by a gene: gene expression array, exon array, tiling array
2. Amount of mRNA expressed by an exon: exon array, tiling array
3. Amount of RNA expressed by a region of DNA: tiling array
4. Which strand of DNA is expressed: exon array, tiling array
5. Which of several similar DNA sequences is present in the genome: SNP array
6. How many copies of a gene are present in the genome: gene expression array, exon array, tiling array
7. Where a known protein has bound to the DNA (ChIP on chip): promoter array, tiling array

11 11 Types of Microarrays A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available. [Figure: a chromosome sequence with promoter, Exon 1, Exon 2, Exon 3 and UTR, and the corresponding cDNA sequence, marked with the probe types cDNA, oligo, exon, tile, promoter and SNP. Sequence text in the figure: CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA; probe text: CCGTTCACA, AAGGCCGTT, CCGTGCACA, AAGGACGTT]
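
As a rough illustration of why tiling probes need the sequence to be known, the sketch below cuts a sequence into fixed-length tiles. The first 28 bases of the sequence text on the slide are used as input; the tile length of 9, the step sizes, and the function name are assumptions chosen for the example, not values from the talk.

```python
def tile_probes(sequence: str, length: int, step: int) -> list:
    """Cut `sequence` into probes of `length` bases, starting a new probe every `step` bases."""
    return [sequence[i:i + length] for i in range(0, len(sequence) - length + 1, step)]


# First 28 bases of the sequence text shown on the slide.
sequence = "CCGTTCACATTAGGATACCAGTTCAAGG"

# step == length gives end-to-end tiles; a smaller step gives overlapping tiles.
for probe in tile_probes(sequence, length=9, step=9):
    print(probe)

overlapping = tile_probes(sequence, length=9, step=3)
print(len(overlapping), "overlapping probes")
```

A cDNA array needs no such step, because the printed material is copied directly from the library rather than designed from a sequence.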

12 12 Print Technology The cDNA or oligo can be:
1. Printed on the slide using an "arraying robot", which deposits a drop of liquid containing the material at each spot. (gene expression only) 40,000+ spots
2. Oligos (all the same length) can be synthesized on the slide using: i) inkjet technology ii) photolithography. 1,000,000+ spots
3. There are other technologies that give similar types of results (e.g. "beads").

13 13 Spotted 2-Channel Array Spotted arrays are printed on coated microscope slides. 2 RNA samples are converted to cDNA. Each is labelled with a different dye. http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg

14 14 Format of an Affymetrix Array http://cnx.rice.edu/content/m12388/latest/figE.JPG

15 15 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray

16 16 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray
- obtain sequence info: sequencing error, assembly error, contamination
- select oligos: unique, similar hybridization rates

17 17 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray
Create the labeled samples: obtain tissue sample -> extract RNA -> extract mRNA -> label -> normalize mRNA

18 18 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray
Create the labeled samples: obtain tissue sample -> extract RNA -> extract mRNA -> label -> normalize mRNA
Experimental design:
- number of biological replicates
- technical replicates
- blocks
- sample pooling

19 19 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray
Create the labeled samples: obtain tissue sample -> extract RNA -> extract mRNA -> label -> normalize mRNA
Then: hybridize

20 20 Microarray experiments
Print or buy the microarray: obtain sequence info -> select oligos -> print microarray
Create the labeled samples: obtain tissue sample -> extract RNA -> extract mRNA -> label -> normalize mRNA
Then: hybridize
- hybridize: hybridization design (multichannel)

21 21 Microarray experiments
hybridize -> scan -> process image -> remove array-specific noise
Image processing steps: detect spots, compute spot summary, detect background, detect bad spots

22 22 Microarray experiments
hybridize -> scan -> process image -> remove array-specific noise
- detect spots: spot detection software
- compute spot summary: pixel mean, median...
- detect background: background correction, detection limit, background > foreground
- detect bad spots: badly printed spots, flaws
- process image: using multiple scans
- remove array-specific noise: normalization
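
A minimal sketch of the spot-summary step, assuming the spot detection software has already assigned each pixel to the spot (foreground) or to its surroundings (background). The function name, the choice of the median, and the pixel values are illustrative assumptions, not the output of any particular package.

```python
from statistics import median


def summarize_spot(foreground: list, background: list) -> dict:
    """Summarize one spot by its median pixel intensity, corrected for local background."""
    fg = median(foreground)
    bg = median(background)
    return {
        "foreground": fg,
        "background": bg,
        "signal": fg - bg,       # background-corrected intensity
        "flagged": bg >= fg,     # background at or above foreground: treat as undetected
    }


# Hypothetical pixel intensities; 5000 mimics a dust speck, which the median ignores.
good_spot = summarize_spot([1200, 1250, 1300, 5000, 1220], [200, 210, 190, 205])
weak_spot = summarize_spot([100, 110, 90], [150, 160, 140])

print(good_spot["signal"], good_spot["flagged"])   # 1047.5 False
print(weak_spot["signal"], weak_spot["flagged"])   # -50 True
```

Using the median rather than the mean is one way to limit the effect of flaws inside a spot. A spot whose background exceeds its foreground gives a negative corrected signal, which is why such spots are usually flagged rather than log-transformed.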

23 23 Spot Detection Adaptive segmentation; fixed circle segmentation. Figure legend: GenePix, QuantArray, ScanAnalyze. Spot uses morphological opening. Slide credit: Rafael A Irizarry, Department of Biostatistics, JHU, rafa@jhu.edu, http://www.biostat.jhsph.edu/~ririzarr, http://www.bioconductor.org (nci 2002)

24 24 Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array, followed by:
- sample classification
- clustering genes and samples
- differential expression analysis

25 25 Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array, followed by:
- differential expression analysis: t-tests, ANOVA; Bayesian versions of the above; Fourier analysis of time series; false discovery and nondiscovery rates
- sample classification (supervised learning): discriminant analysis, support vector machines
- clustering genes and samples (unsupervised learning): hierarchical clustering, k-means clustering, heatmaps
- also listed: robust methods to downweight outliers, data imputation (if needed)

26 26 A heatmap Samples of different regions of the brain in humans and chimpanzees. The sample clusters show that different regions of the brain cluster more closely than different species. The gene clusters show that some genes differentiate among brain regions while others differentiate the 2 species ☺

27 27 SNP Microarray experiments
obtain numerical summary for each SNP, followed by:
- estimate SNP frequency
- haplotyping
- association with subpopulation

28 28 SNP Microarray experiments
obtain numerical summary for each SNP, followed by:
- estimate SNP frequency: binomial distribution
- haplotyping: determine which sets of SNPs come from each of the 2 chromosomes
- association with subpopulation: association with disease, ecotype, etc.
- also listed: multivariate analysis, mixture models
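
A small sketch of the binomial model for a SNP frequency, assuming each sampled chromosome independently carries the allele of interest with the same probability. The counts, the function name, and the normal-approximation interval are illustrative assumptions.

```python
import math


def allele_frequency(allele_count: int, n_chromosomes: int, z: float = 1.96):
    """Binomial estimate of an allele frequency with a normal-approximation interval."""
    p_hat = allele_count / n_chromosomes
    std_err = math.sqrt(p_hat * (1 - p_hat) / n_chromosomes)
    lower = max(0.0, p_hat - z * std_err)
    upper = min(1.0, p_hat + z * std_err)
    return p_hat, std_err, (lower, upper)


# Hypothetical data: 100 diploid individuals give 200 chromosomes, 30 carrying the allele.
p_hat, std_err, interval = allele_frequency(30, 200)
print(round(p_hat, 3), round(std_err, 4), tuple(round(x, 3) for x in interval))
```

The normal approximation is poor for rare alleles or small samples, so this interval is only a rough guide in those cases.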

29 29 Tiling Microarray experiments
obtain numerical summary for each codon on the chromosome -> visualization

30 30 Tiling Microarray experiments
obtain numerical summary for each codon on the chromosome -> visualization
- visualization: nonparametric smoothing
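
One of the simplest nonparametric smoothers is a running mean over neighbouring positions along the chromosome. The sketch below is an assumption about what such smoothing could look like, not the method used in the talk; the window size and intensities are made up.

```python
def running_mean(values: list, half_window: int) -> list:
    """Replace each value by the mean of its neighbours within `half_window` positions."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)                # window is truncated at the ends
        hi = min(len(values), i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical intensities along consecutive tiling probes, with one isolated spike.
intensities = [0, 0, 9, 0, 0]
print(running_mean(intensities, half_window=1))   # [0.0, 3.0, 3.0, 3.0, 0.0]
```

The smoother spreads an isolated spike over its neighbours, so the window width trades noise reduction against the ability to see sharp boundaries such as exon edges.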

31 31 p >> n
n = #samples. Usually, we have some type of response on the samples, which may be quantitative (e.g. body mass index, HDL) or categorical (cancer type, growth stage...).
p = #measurements, which may be the intensity per gene, exon, locus, promoter region...

32 32 p >> n
n = #samples. Call the response Y, typically an n x 1 vector (e.g. BMI).
p = #measurements. Call the measurements X, an n x p matrix.

33 33 p >> n
n = #samples. Call the response Y, typically an n x 1 vector (e.g. BMI).
p = #measurements. Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of the X measures, e.g. k << n of the p variables, e.g. $Y = U\beta + \epsilon$, where U is the n x k matrix of measurements, $\beta$ is an unknown vector of constants and $\epsilon$ is random.

34 34 p >> n
n = #samples. Call the response Y, typically an n x 1 vector (e.g. BMI).
p = #measurements. Call the measurements X, an n x p matrix.
Typically we might think that Y is related to a small set of the X measures, e.g. k << n of the p variables, e.g. $Y = U\beta + \epsilon$, where U is the n x k matrix of measurements, $\beta$ is an unknown vector of constants and $\epsilon$ is random.
If we try to solve $Y = X\gamma$ for a coefficient vector $\gamma$ and X has rank n, we will always find an exact solution. In fact, if we select any submatrix of columns of X with rank n, we will always find an exact solution, even if those columns are completely independent of U.
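
The exact-solution claim can be checked numerically. In the sketch below Y and X are independent random noise, so no column of X has anything to do with Y, yet any n columns of X reproduce Y exactly. The sizes (n = 6, p = 1000), the seed, and the small Gaussian-elimination solver are illustrative assumptions.

```python
import random


def solve(A: list, b: list) -> list:
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                       # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


rng = random.Random(0)
n, p = 6, 1000
Y = [rng.gauss(0, 1) for _ in range(n)]                      # response: pure noise
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # measurements: unrelated noise

cols = rng.sample(range(p), n)                         # any n columns will do
A = [[X[i][j] for j in cols] for i in range(n)]
coef = solve(A, Y)

fitted = [sum(A[i][j] * coef[j] for j in range(n)) for i in range(n)]
max_residual = max(abs(f - y) for f, y in zip(fitted, Y))
print(max_residual)                                    # essentially zero: a "perfect" fit to noise
```

A perfect fit on the observed samples therefore says nothing about whether the selected measurements are related to the response.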

35 35 p >> n Another approach is to try to predict Y using 1 column of X at a time. If none of the columns are in U (so that the corresponding coefficients are 0), then, if we do any statistical test of whether the coefficient is 0 and reject for p-value < $\alpha$, we will reject about $\alpha p$ of the tests and conclude that the corresponding spots are associated with Y. Because usually p > 10000, we will make a lot of mistakes unless $\alpha$ is extremely small.
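
The arithmetic behind this slide can be simulated directly. The sketch assumes that when no measurement is related to Y, each test's p-value is uniform on (0, 1), which holds for a valid test with a continuous test statistic. The values p = 10000 and alpha = 0.05 and the function name are chosen for illustration.

```python
import random


def count_false_positives(n_tests: int, alpha: float, rng: random.Random) -> int:
    """Count null tests rejected at level alpha, drawing each null p-value from Uniform(0, 1)."""
    return sum(rng.random() < alpha for _ in range(n_tests))


rng = random.Random(1)
p, alpha = 10000, 0.05

expected = alpha * p                          # about 500 false rejections on average
observed = count_false_positives(p, alpha, rng)
per_test_level = alpha / p                    # level needed to expect only alpha false rejections in total

print(expected, observed, per_test_level)
```

Dividing alpha by the number of tests is the Bonferroni correction, shown here only to indicate how small the per-test level must become. It is one standard remedy, not necessarily the one the talk goes on to recommend.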

36 36 p >> n All of the special methods for analysis of gene expression data are developed to solve the p >> n problem.

37 37 p >> n e.g. 2-sample problem
2 conditions: e.g. cancer, normal.
For gene (exon, locus...) i, we have n samples with p genes and observe $Y_{ijk}$, where i = gene id, j = condition id, k = sample id.
Usual method: 2-sample t-test.
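
The formula on this slide did not survive in the transcript, so the sketch below uses the textbook pooled-variance 2-sample t statistic as an assumption about what was shown. The gene names and expression values are made up.

```python
import math
from statistics import mean, variance


def two_sample_t(x: list, y: list) -> float:
    """Pooled-variance 2-sample t statistic comparing the means of x and y."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical log-intensities: for each gene, samples under condition 1 and condition 2.
expression = {
    "gene_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "gene_b": ([2.0, 3.0, 4.0], [2.5, 3.5, 3.0]),
}

t_stats = {gene: two_sample_t(cond1, cond2) for gene, (cond1, cond2) in expression.items()}
for gene, t in t_stats.items():
    print(gene, round(t, 3))
```

With only a few samples per condition, each gene's variance estimate is very noisy, so a gene can reach a large t statistic simply because its estimated variance happens to be small.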

38 38 p >> n e.g. 2-sample problem
Some ideas to improve selection of differentially expressed genes:
1) Force all genes to have the same variability (2-way ANOVA) by the normalization step.
2) Assume that there is a distribution of gene means known in advance or estimated from the data (Bayesian or empirical Bayes methods).
3) Use the data to estimate the number of inference errors.
4) Force the data to be normally distributed (within gene) in the normalization step or use bootstrap or permutation methods (suitable for fairly large sample size).
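
Idea 3 is often carried out by controlling the false discovery rate. The sketch below implements the Benjamini-Hochberg step-up procedure as one standard example of that idea; the slide does not name a specific method, and the p-values are made up.

```python
def benjamini_hochberg(p_values: list, q: float) -> list:
    """Indices of the hypotheses rejected by the Benjamini-Hochberg procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest p-value up
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank                            # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


# Hypothetical p-values from 8 per-gene tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

print(benjamini_hochberg(p_values, q=0.05))                       # [0, 1]
print([i for i, pv in enumerate(p_values) if pv < 0.05])          # [0, 1, 2, 3, 4]
```

Rejecting every p-value below 0.05 would select five genes here, while the step-up rule keeps two, which shows how the threshold tightens as the number of tests grows.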

39 39 Unsolved Problems People are still working on normalization, differential expression analysis, clustering and classification for gene expression arrays. There are also problems in combining data from other sources including measurements from other platforms, meta-analysis, and data from the literature. These problems are not dead, but it will be increasingly difficult to find new problems without a paradigm change. The new arrays (exon, SNP, tiling) will need more new methodology.

40 40 Acknowledgements Francesca Chiaromonte Floral Genome Project dePamphilis Lab Ma Lab Carlson Lab McNellis Lab Pugh Lab Fedorov Lab Tony Hua Xianyun Mao Bioinformatics Consulting Center Huck Institute Diya Zhang Wenlei Liu Qing Zhang Allison Lab (UAB)

