13. Reverse Transcription
[Diagram: DNA (replication) -> transcription -> RNA -> translation -> Protein, with reverse transcription running from RNA back to DNA.]
Using reverse transcriptase, we can convert RNA into cDNA.
14. The Southern Blot
The Southern blot is a basic DNA detection technique that has been in use for over 30 years:
- A “known” strand of DNA is deposited on a solid support (e.g. nitrocellulose paper).
- An “unknown” mixed bag of DNA is labeled (radioactively or fluorescently).
- The “unknown” DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then the excess solution is washed off.
- If a copy of the “known” DNA occurs in the “unknown” sample, it will stick (hybridize), and the labeled DNA will be detected on photographic film.
15. mRNA Represents Gene Function
- When we measure the level of an mRNA, we are monitoring the activity of a gene.
- Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome.
- A microarray collects the data of a large number of blots in a single experiment, which makes monitoring genome-wide activity possible.
16. Content
- Biology background of microarray
- Design of microarray
- The workflow of microarray
- Image analysis of microarray
- Data analysis of microarray
- Discussion
17. Design of Microarray
- Microarray in different contexts
- The idea of microarray
- Main types of array chips
18. mRNA Levels Compared in Many Different Contexts
- Different tissues, same organism (brain vs. liver)
- Same tissue, same organism (tumor vs. non-tumor)
- Same tissue, different organisms (wt vs. mutant)
- Time course experiments (development)
- Other special designs (e.g. to detect spatial patterns)
19. Idea of Microarray
[Diagram: Cell A and Cell B; labeled cDNA from geneX; hybridization to the chip.]
- The spot for geneX carries the sequence complementary to the colored cDNA.
- This spot shows a red color after scanning.
20. Over 10,000 Hybridizations Can Be Done at One Time
21. Several Types of Arrays
- Spotted DNA arrays
  - Developed by Pat Brown’s lab at Stanford
  - PCR products of full-length genes (>100 nt)
- Affymetrix gene chips
  - Photolithography technology from the computer industry allows building many 25-mers
- Ink-jet microarrays from Agilent
  - 25-60-mers “printed” directly on glass slides
  - Flexible, rapid, but expensive
22. Array Fabrication: Spotting
- Use PCR to amplify DNA
- A robotic "pen" deposits DNA at defined coordinates
- Approximately 1-10 ng per spot
- Experimentation with oligos (40, 70 bp)
23. This machine can make 48 microarrays simultaneously.
24. Array Fabrication: Photolithography
- Light-activated synthesis
- Synthesize oligonucleotides on glass slides
- 10^7 copies per oligo in a 24 x 24 µm square
- Use 20 pairs of different 25-mers per gene
- Perfect match and mismatch
26. Affymetrix Microarrays
[Figure: raw image of a chip; dimension labels 1.28 cm and 50 µm.]
- ~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM)
- Raw gene expression is the intensity difference: PM - MM
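A minimal sketch of the PM - MM idea, assuming the differences of a gene's probe pairs are simply averaged. The slide only states the difference itself, so the averaging step and the function name `average_difference` are illustrative assumptions.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene.

    pm, mm: equal-length lists of perfect-match and mismatch intensities.
    Averaging over the pairs is an assumption made for this sketch.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("need one MM intensity for every PM intensity")
    return mean(p - m for p, m in zip(pm, mm))


# Two probe pairs with differences 50 and 100.
signal = average_difference([100, 200], [50, 100])
```

A negative result means the mismatch probes were brighter than the perfect-match probes, which is usually read as "not reliably detected".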
27. Agilent cDNA Microarrays and Oligonucleotide Microarrays
- Agilent delivers printed 60-mer microarrays in addition to 25-mer formats.
- The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.
31. Cy3 and Cy5 cDNA Hybridization onto the Chip
Samples: e.g. treatment / control, normal / tumor tissue

Sample loading
1. Loading from the corner of the cover slip. This is time-consuming and easily produces bubbles.
2. Loading the sample at the center of the array, then putting the slip on smoothly. This is faster and has a lower chance of producing bubbles than the first method.
3. Loading the sample at the side of the array, then putting the slip on. The solution attaches to the slip as soon as the slip touches it, and spreads with the slip as we slowly lower it.
32. Scan
- Green: down-regulated
- Red: up-regulated
- Yellow: equal level
34. Image Analysis
- Finding the spots
- Converting features into numeric data
- Image normalization
35. The Algorithms
1. Find spots: finds the location of each spot on the microarray.
2. Cookie cutter algorithm:
   (1) Assume the distribution of pixel intensities follows a Gaussian curve.
   (2) Use the SD or the IQR to identify the feature and background pixels of each spot.
   (3) Calculate statistics for the pixel population.
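36. Interquartile Range (IQR)
[Figure: pixel intensity distribution marked at 25%, 50%, and 75%, with the IQR between the 25% and 75% marks and a boundary for rejection on each side, labeled 1.42 IQR. Other figure label: DK=IQR/2.]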
37. [Figure: layout of one spot. Labels: feature or cookie, D, local background, exclusion zone.]
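The cookie cutter steps can be sketched as follows, assuming the feature ("cookie") pixels and the local background pixels of one spot have already been extracted into two lists. Reading the 1.42 IQR label as a rejection boundary measured outward from the quartiles is an assumption, and the function names are hypothetical.

```python
from statistics import mean, quantiles

IQR_MULTIPLIER = 1.42  # taken from the "1.42 IQR" figure label; assumed meaning


def reject_outliers_iqr(pixels, k=IQR_MULTIPLIER):
    """Keep pixels inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(pixels, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in pixels if low <= p <= high]


def spot_statistics(feature_pixels, background_pixels):
    """Summarize one spot after outlier rejection on both pixel populations."""
    feature = reject_outliers_iqr(feature_pixels)
    background = reject_outliers_iqr(background_pixels)
    return {
        "feature_mean": mean(feature),
        "background_mean": mean(background),
        "net_signal": mean(feature) - mean(background),
        "pixels_used": len(feature),
    }


# One bright dust pixel (5000) in the feature is rejected before averaging.
stats = spot_statistics(
    [100, 102, 98, 101, 99, 103, 97, 100, 5000],
    [10, 11, 9, 10, 10, 12, 8, 10],
)
```

Pixels in the exclusion zone are simply never passed in, so they influence neither the feature nor the background estimate.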
45. Additional Normalization: Pin-Dependent
- Similar to the intensity-dependent fit
- Compute an individual lowess fit for each pin group
- Within-slide normalization
  - After pin-dependent normalization, the log ratios for each pin are centered around 0
  - Scale the variance for each pin
  - Uses MAD (median absolute deviation)
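A rough sketch of the per-pin scale step. Two simplifications are assumptions of this example: each pin group is centered on its median instead of a lowess fit, and every group is scaled so that its MAD equals the geometric mean of all the groups' MADs. The function names are hypothetical.

```python
from math import exp, log
from statistics import mean, median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def normalize_pin_groups(log_ratios_by_pin):
    """Center each pin group at 0, then bring all groups to a common MAD."""
    centered = {}
    for pin, values in log_ratios_by_pin.items():
        center = median(values)  # stand-in for the per-pin lowess fit
        centered[pin] = [v - center for v in values]

    mads = {pin: mad(values) for pin, values in centered.items()}
    if any(m <= 0 for m in mads.values()):
        raise ValueError("every pin group needs a non-zero MAD")

    # Geometric mean of the MADs is the common target scale.
    target = exp(mean(log(m) for m in mads.values()))
    return {
        pin: [v * target / mads[pin] for v in values]
        for pin, values in centered.items()
    }


normalized = normalize_pin_groups({
    "pin1": [0.0, 1.0, 2.0],   # tight spread
    "pin2": [-3.0, 1.0, 5.0],  # wide spread
})
```

After the call both groups are centered on 0 and share the same MAD, so no single pin dominates the spread of the log ratios on the slide.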
46. Additional Normalization: Dye Swap
- Combine relative expression levels without explicit normalization
- Compute a lowess fit for log2(RR'/GG')/2 vs. log2(A + A')/2
- The normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction
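A small sketch of the dye-swap arithmetic for a single spot. It assumes R, G come from the first array and R', G' from the array with the dyes swapped, and it uses the per-spot value of log2(RR'/GG')/2 directly in place of the lowess-smoothed c(A), which is a simplification. The function names are hypothetical.

```python
from math import log2


def dye_swap_bias(r, g, r_swap, g_swap):
    """Per-spot dye bias estimate: log2(R * R' / (G * G')) / 2."""
    return 0.5 * log2((r * r_swap) / (g * g_swap))


def dye_swap_corrected(r, g, r_swap, g_swap):
    """log2(R/G) minus the bias estimate (no lowess smoothing here)."""
    return log2(r / g) - dye_swap_bias(r, g, r_swap, g_swap)


# A gene with a true 2-fold change and a 4-fold red-channel bias:
# first array R/G = 2 * 4 = 8, swapped array R'/G' = (1/2) * 4 = 2.
bias = dye_swap_bias(800.0, 100.0, 200.0, 100.0)
corrected = dye_swap_corrected(800.0, 100.0, 200.0, 100.0)
```

In this example the bias estimate comes out as 2 (a 4-fold dye effect) and the corrected log ratio as 1 (a 2-fold change), which is how the swap cancels the dye effect without an explicit normalization step.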
48. Data Analysis
- Data filtering
- Fold change analysis
- Classification
- Clustering
- Future direction
49. Microarray Data Classification
[Diagram: microarray chips -> images scanned by laser -> datasets (a Gene / Value table with probe IDs D26528_at, D26561_cds1_at, D26561_cds2_at, D26561_cds3_at, D26579_at, D26598_at, D26599_at, D26600_at, D28114_at) -> data mining and analysis; new sample -> prediction.]
50. The Threshold of Spots
Filtering: remove genes with insufficient variation
- Remove inadequate spots: saturated, non-uniform, background too high, ...
- Remove by signal range: e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
- Statistical filtering (e.g. p-value < 0.01)
Why filter:
- Biological reasons
- Feature reduction for algorithmic reasons
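The range rule can be sketched like this, assuming each gene is a list of positive intensities across samples. The thresholds 500 and 5 are the example values from the slide; the function names are hypothetical.

```python
def has_sufficient_variation(values, min_diff=500.0, min_ratio=5.0):
    """False when MaxVal - MinVal < min_diff and MaxVal / MinVal < min_ratio."""
    low, high = min(values), max(values)
    if low <= 0:
        raise ValueError("intensities must be positive (threshold them first)")
    insufficient = (high - low) < min_diff and (high / low) < min_ratio
    return not insufficient


def filter_genes(expression):
    """Keep only the genes that pass the variation filter."""
    return {
        gene: values
        for gene, values in expression.items()
        if has_sufficient_variation(values)
    }


kept = filter_genes({
    "flat": [100, 150, 200],   # diff 100, ratio 2: removed
    "wide": [100, 900, 400],   # diff 800: kept
    "ratio": [20, 60, 120],    # diff 100 but ratio 6: kept
})
```

Because the two conditions are joined with "and", a gene survives if it varies enough by either the absolute or the relative measure.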
51. Microarray Data Analysis Types
- Differential gene expression
  - Fold change analysis
- Classification (supervised)
  - Identify disease
  - Predict outcome / select best treatment
- Clustering (unsupervised)
  - Find new biological classes / refine existing ones
  - Exploration
- ...
52. Differential Gene Expression
- n-fold change
  - n typically >= 2
  - May hold no biological relevance
  - Often too restrictive
- 2σ expression
  - Calculate the standard deviation σ
  - Genes with expression more than 2σ away are differentially expressed
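Both rules can be compared side by side in a short sketch. It assumes one treated and one control intensity per gene, and that "2σ away" is measured from the mean of the log2 ratios across all genes; both are assumptions of this example, as are the function names.

```python
from math import log2
from statistics import mean, stdev


def fold_change_hits(treated, control, n=2.0):
    """Genes whose ratio is at least n-fold up or n-fold down."""
    hits = []
    for gene in treated:
        ratio = treated[gene] / control[gene]
        if ratio >= n or ratio <= 1.0 / n:
            hits.append(gene)
    return hits


def two_sigma_hits(treated, control, k=2.0):
    """Genes whose log2 ratio lies more than k standard deviations from the mean."""
    log_ratios = {g: log2(treated[g] / control[g]) for g in treated}
    values = list(log_ratios.values())
    mu, sigma = mean(values), stdev(values)
    return [g for g, v in log_ratios.items() if abs(v - mu) > k * sigma]


control = {f"g{i}": 100.0 for i in range(8)}
treated = dict(control)                      # eight unchanged genes
control.update({"mild": 100.0, "up": 100.0})
treated.update({"mild": 200.0, "up": 1600.0})

by_fold = fold_change_hits(treated, control)
by_sigma = two_sigma_hits(treated, control)
```

Here the fixed 2-fold cutoff flags both `mild` and `up`, while the 2σ rule flags only `up`, which illustrates how the two criteria can disagree on the same data.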
55. Classification: Multi-Class
Similar approach:
- Select the top genes most correlated to each class
- Select the best subset using cross-validation
- Build a single model separating all classes
Advanced:
- Build a separate model for each class vs. the rest
- Choose the model making the strongest prediction
56. Popular Classification Methods
- Decision trees/rules: find the smallest gene sets, but also false positives
- Neural nets: work well if the number of genes is reduced
- SVM: good accuracy, does its own gene selection, hard to understand
- K-nearest neighbor: robust for a small number of genes
- Bayesian nets: simple, robust
57. Multi-Class Data Example
- Brain data, Pomeroy et al., Nature 415, Jan 2002
- 42 examples, about 7,000 genes, 5 classes
- Selected the top 100 genes most correlated to each class
- Selected the best subset by testing subsets of 1, 2, ..., 20 genes, with leave-one-out cross-validation for each
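The subset search can be sketched with leave-one-out cross-validation around a 1-nearest-neighbor classifier. The classifier choice, the pre-computed gene ranking `ranked_genes`, and all function names are assumptions of this example, not the method used in the cited study.

```python
from math import dist


def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training sample closest to the query (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once and predict it from all the others."""
    correct = 0
    for i, held_out in enumerate(samples):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, held_out) == labels[i]:
            correct += 1
    return correct / len(samples)


def best_subset_size(samples, labels, ranked_genes, max_genes=20):
    """Try the top 1..max_genes ranked genes; return the best size and all scores."""
    scores = {}
    for k in range(1, min(max_genes, len(ranked_genes)) + 1):
        columns = ranked_genes[:k]
        reduced = [[row[c] for c in columns] for row in samples]
        scores[k] = leave_one_out_accuracy(reduced, labels)
    # Prefer higher accuracy, then the smaller subset on ties.
    best_k = max(scores, key=lambda k: (scores[k], -k))
    return best_k, scores


samples = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.2], [1.1, 0.9]]
labels = ["a", "a", "b", "b"]
best_k, scores = best_subset_size(samples, labels, ranked_genes=[0, 1])
```

One caveat: ranking the genes on the full data set before cross-validating lets information from the held-out sample leak into the selection, so a stricter version would redo the ranking inside each fold.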
58. Classification: Other Applications
- Combining clinical and genetic data
- Outcome / treatment prediction
- Age, sex, and stage of disease are useful
  - e.g. if the data are from a male, the diagnosis is not ovarian cancer
59. Clustering Goals
- Find natural classes in the data
- Identify new classes / gene correlations
- Refine existing taxonomies
- Support biological analysis / discovery
- Different methods: hierarchical clustering, SOMs, etc.
60. SOM Clustering
- SOM: self-organizing maps
- Preprocessing
  - Filter away genes with insufficient biological variation
  - Normalize gene expression (across samples) to mean 0, st. dev. 1, for each gene separately
- Run the SOM for many iterations
- Plot the results
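The normalization step in the preprocessing can be sketched as a per-gene z-score. Using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, and the function names are hypothetical.

```python
from statistics import mean, stdev


def standardize_gene(values):
    """Rescale one gene's expression across samples to mean 0, st. dev. 1."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        raise ValueError("constant gene: filter it out before normalizing")
    return [(v - mu) / sigma for v in values]


def standardize_all(expression):
    """Apply the rescaling to every gene separately."""
    return {gene: standardize_gene(values) for gene, values in expression.items()}


scaled = standardize_all({
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [100.0, 300.0, 500.0],
})
```

After scaling, `geneA` and `geneB` have the same profile, so the SOM groups genes by the shape of their expression pattern rather than by absolute level.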
62. Hierarchical Clustering
- The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method, which works with the data in a bottom-up manner.
- Initially, each data point forms its own cluster, and the algorithm works through the cluster sets by repeatedly merging the two that are the most similar or have the shortest distance.
- The algorithm involves computing the distance or similarity matrix, which has O(N^2) complexity and thus is not very efficient.
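A deliberately naive sketch of the bottom-up procedure, assuming Euclidean distance and average linkage (both choices are assumptions; the slide does not fix them). It recomputes pairwise distances on every pass instead of caching a distance matrix, so it is only meant for small inputs.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster point pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        ia, ib = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[ia] + clusters[ib]
        merges.append((clusters[ia], clusters[ib]))
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters, merges = agglomerative(points, n_clusters=2)
```

Running it down to `n_clusters=1` and keeping `merges` gives the full merge order, which is the information a dendrogram is drawn from.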
64. Future Directions
- Algorithms optimized for small samples (the number of samples will remain small for many tasks)
- Integration with other data
  - Biological networks
  - Medical text
  - Protein data
- Cost-sensitive classification algorithms
  - Error cost depends on the outcome (we don’t want to miss a treatable cancer), treatment side effects, etc.
65. Integrate biological knowledge when analyzing microarray data (from Cheng Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p. 25
67. Microarray Potential Applications
- Biological discovery
  - New and better molecular diagnostics
  - New molecular targets for therapy
  - Finding and refining biological pathways
- Mutation and polymorphism detection
- Recent examples
  - Molecular diagnosis of leukemia, breast cancer, ...
  - Appropriate treatment for a genetic signature
  - Potential new drug targets
68. Microarray Limitations
- Cross-hybridization of sequences with high identity
- Chip-to-chip variation
- True measure of abundance?
- Do mRNA levels reflect protein levels?
- Generally, microarrays do not “prove” new biology; they simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.
- What fold change has biological relevance?
- Need a cloned EST or some sequence knowledge; rare messages may go undetected
- Expensive! Not every lab can afford to repeat experiments.
- The real limitation is bioinformatics
69. Additional Information
Review papers on microarrays:
- Genomics, gene expression and DNA arrays (Nature, June 2000)
- Microarray - technology review (Nature Cell Biology, Aug. 2001)
- Magic of Microarray (Scientific American, Feb. 2002)
Molecular biology tutorial
70. Biological Data Retrieval Systems: Entrez
http://www. ncbi. nlm. nih
1. A retrieval system for searching a number of interconnected databases at the NCBI. It provides access to:
   - PubMed: the biomedical literature (Medline)
   - GenBank: nucleotide sequence database
   - Protein: sequence database
   - Structure: three-dimensional macromolecular structures
   - Genome: complete genome assemblies
   - PopSet: population study data sets
   - OMIM: Online Mendelian Inheritance in Man
   - Taxonomy: organisms in GenBank
   - Books: online books
   - ProbeSet: gene expression and microarray datasets
   - 3D Domains: domains from Entrez Structure
   - UniSTS: markers and mapping data
   - SNP: single nucleotide polymorphisms
   - CDD: conserved domains
2. Entrez allows users to perform various searches.