Introduction to Microarray Data Analysis 11/17/08 – 11/24/08

Presentation transcript:

CS 5263 Bioinformatics: Introduction to Microarray Data Analysis, 11/17/08 – 11/24/08

Outline: What is a microarray? Basic categories of microarrays. How can microarrays be used? Computational and statistical methods involved in microarray analysis: probe design; image processing; pre-processing; differentially expressed gene identification; clustering and classification; network / pathway modeling.

Gene expression: Reverse transcription (in the lab) converts mRNA into a product called cDNA. Genes have different activities at different times and in different environments. DNA microarrays measure gene transcription (amount of mRNA) in a high-throughput fashion, as a surrogate of gene activity.

Northern blot (an old technique for measuring mRNA expression): 1. mRNA is extracted and purified. 2. mRNA is loaded for electrophoresis (lane 1: size standards; lane 2: RNA to be tested). 3. The gel is charged and the RNA “swims” through the gel according to weight. 4. The mRNA is transferred from the gel to a membrane. 5. A labeled probe specific for the RNA fragment is incubated with the blot (hybridization), so the RNA of interest can be detected. Needs a relatively large amount of mRNA. http://www.escience.ws/b572/L13/north.html

RT-PCR (reverse transcription-polymerase chain reaction): RNA is reverse transcribed to DNA. PCR can then be used to amplify the DNA at an exponential rate, and the amplified product is quantified on a gel. This is a semi-quantitative method that needs a smaller amount of sample. See animation of RT-PCR: http://www.bio.davidson.edu/courses/Immunology/Flash/RT_PCR.html Real-time RT-PCR: the PCR amplification can be monitored by fluorescence in “real time”. The fluorescence values recorded in each cycle represent the amount of amplified product, which makes this a quantitative method. It is currently the most advanced and accurate analysis for mRNA abundance and is often used to validate microarray results. http://www.ambion.com/techlib/basics/rtpcr/

Limitations of the old techniques: Labor intensive. Can only detect up to dozens of genes (gene-by-gene analysis).

What is a microarray: A 2D array of DNA sequences from thousands of genes. Each spot has many copies of the same gene (probe). mRNAs from a sample are allowed to hybridize and form RNA-DNA double strands. The number of hybridizations per spot is measured.

What is a Microarray (2): Conceptually similar to a (reverse) Northern blot. (Many) probes, rather than mRNAs, are fixed on some surface, in an ordered way.

Microarray categories: cDNA microarray: each probe is the cDNA of a gene (length: hundreds to thousands of nucleotides); Stanford, Brown Lab. Oligonucleotide microarray: each probe is a synthesized short DNA (uniquely corresponding to a substring of a gene); Affymetrix: ~25-mers; Agilent: ~60-mers; others.

Spotted cDNA microarray

Array Manufacturing: Each tube contains cDNAs corresponding to a unique gene. The cDNAs are pre-amplified and spotted onto a glass slide.

Experiment cy3 cy5

Data acquisition Computer programs are used to process the image into digital signals. Segmentation: determine the boundary between signal and background Results: gene expression ratios between two samples

cDNA Microarray Methodology Animation

Affymetrix GeneChip®

Array Design: 25-mer unique oligo; mismatch in the middle nucleotide; multiple probes (11~16) for each gene. (From Affymetrix Inc.)

Array Manufacturing: In situ synthesis of oligonucleotides. Technology adapted from the semiconductor industry (photolithography and combinatorial chemistry). (From Affymetrix Inc.)

GeneChip® Probe Arrays (figure, with an image of a hybridized probe array): GeneChip probe array, 1.28 cm; hybridized probe cell, 24 µm, holding millions of copies of a specific oligonucleotide probe; >200,000 different complementary probes; single-stranded, labeled RNA target bound to the oligonucleotide probe.

Overview of the Affymetrix GeneChip technology: Each probe set combines to give an absolute expression level. Image segmentation is relatively easy. But how to use the MM signal is debatable. (From Affymetrix Inc.)

Comparison of cDNA array and GeneChip
Probe preparation. cDNA: probes are cDNA fragments, usually amplified by PCR and spotted by robot. GeneChip: probes are short oligos synthesized using a photolithographic approach.
Colors. cDNA: two-color (measures relative intensity). GeneChip: one-color (measures absolute intensity).
Gene representation. cDNA: one probe per gene. GeneChip: 11-16 probe pairs per gene.
Probe length. cDNA: long, varying lengths (hundreds to 1K bp). GeneChip: 25-mers.
Density. cDNA: maximum of ~15000 probes. GeneChip: 38500 genes * 11 probes = 423500 probes.

Why the difference? Affymetrix GeneChip: one-color design. cDNA microarray: two-color design.

Affymetrix GeneChip: photolithography (the amount of oligos on a probe is well controlled). cDNA microarray: robotic spotting (the amount of cDNA spotted on a probe may vary greatly).

Advantages and disadvantages of cDNA array and GeneChip
Data quality. cDNA microarray: the data can be noisy and of variable quality. Affymetrix GeneChip: specific and sensitive; results very reproducible.
Hybridization. cDNA microarray: cross (non-specific) hybridization can often happen. Affymetrix GeneChip: hybridization more specific.
RNA amount. cDNA microarray: may need an RNA amplification procedure. Affymetrix GeneChip: can use a small amount of RNA.
Image analysis. cDNA microarray: more difficult. Affymetrix GeneChip: image analysis and intensity extraction are easier.
Annotation. cDNA microarray: need to search the database for gene annotation. Affymetrix GeneChip: more widely used; better quality of gene annotation.
Cost. cDNA microarray: cheap (both initial cost and per-slide cost). Affymetrix GeneChip: expensive (~$400 per array + labeling and hybridization).
Species. cDNA microarray: can be custom made for special species. Affymetrix GeneChip: only several popular species are available.
Sequence knowledge. cDNA microarray: do not need to know the exact DNA sequence. Affymetrix GeneChip: need the DNA sequence for probe selection.

Typical Microarray Analysis (flowchart): Raw data; preprocess: normalize, filter (present/absent, minimum value, fold change); significance; classification; clustering; function (Gene Ontology); regulation (motif finding).

Preprocessing: Garbage in => garbage out. Background subtraction: account for non-specific hybridization. Transformation (e.g. to log scale): for convenience, and to convert data into a certain distribution (e.g. normal) assumed by many statistical procedures. Normalization: remove systematic biases; make data from different samples comparable. Filtering, averaging, etc.: remove random noise. The order of these steps may differ, and steps may be combined.

Background subtraction: For cDNA arrays, relatively straightforward: raw data contain foreground and background values; foreground values are obtained from detected spots and background values from the surrounding area; it may occur that background > foreground. For oligo arrays, probes are densely packed, so the surrounding area cannot be used directly. Hope: MM captures non-specific hybridization? Recent studies suggest that PM and MM are correlated, so it is better to ignore MM entirely or use it with caution. Available software tools: MAS 5 (by Affymetrix), dChip, GCRMA.

Normalization: Where could errors come from? Random noise: repeating the same experiment twice gives different results; using multiple replicates reduces the problem. Systematic errors: arrays manufactured at different times; on the same array, probes printed with different printer tips may have different biases; dye effect: difference between Cy5 and Cy3 labeling. Experimental factors: more mRNA applied to array A than to array B; sample preparation procedure; experiments carried out at different times, by different users, etc.

cDNA microarray data preprocessing

Typical experiments: Wild-type cells vs mutated cells. Diseased cells vs normal cells. Cells under normal growth conditions vs cells treated with chemicals. Typically repeated several times. (Figure: matrix of ratios, with probes (genes) as rows.)

Transforming cDNA microarray data: Data: Cy5/Cy3 ratios as well as raw intensities. The most common transformation is log2: 2-fold increase => log2(2) = 1; 2-fold decrease => log2(1/2) = -1.

Dye effect: cDNA microarray experiments using two identical samples. Observation: Cy5 consistently lower than Cy3 (mean log(cy5/cy3) < 0). Solution: dye swapping.

Dye swapping: Chip 1: label test by cy5 and control by cy3. Chip 2: label test by cy3 and control by cy5. Ideally cy5/cy3 on chip 1 = cy3/cy5 on chip 2; this is not so, due to the dye effect. Compute the average ratio: ½ log2(cy5/cy3 on chip 1) + ½ log2(cy3/cy5 on chip 2).
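The averaging step can be sketched in a few lines of Python. This is only an illustration: the function name `dye_swap_log_ratio` and the example intensities are assumptions, not part of the lecture.

```python
import math

def dye_swap_log_ratio(cy5_chip1, cy3_chip1, cy5_chip2, cy3_chip2):
    """Average log2(test/control) over a dye-swapped pair of chips.

    Chip 1: test labeled with Cy5, control with Cy3.
    Chip 2: test labeled with Cy3, control with Cy5.
    """
    m1 = math.log2(cy5_chip1 / cy3_chip1)  # test/control on chip 1
    m2 = math.log2(cy3_chip2 / cy5_chip2)  # test/control on chip 2
    return 0.5 * m1 + 0.5 * m2

# Hypothetical spot: true ratio 2, Cy5 reading 20% too low on both chips.
avg = dye_swap_log_ratio(1600.0, 1000.0, 800.0, 2000.0)
```

In this made-up example the multiplicative dye bias cancels, so the average comes back as log2(2) = 1.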

Total intensity normalization: Even after dye-swapping, systematic biases may still be seen. Assume the total amount of mRNA should not change between two samples, and rescale so that the two colors have the same total intensity. The assumption is not necessarily true. Alternative: rescale according to a subset of genes: house-keeping genes; the middle 90% (for example) of genes; spike-in genes.
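A minimal sketch of the rescaling, assuming the Cy5 channel is scaled to match the Cy3 total (the choice of which channel to rescale, and the function name, are assumptions made for illustration):

```python
def total_intensity_normalize(cy5, cy3):
    """Rescale Cy5 so both channels have the same total intensity.

    Returns the rescaled Cy5 values and the constant factor applied.
    """
    factor = sum(cy3) / sum(cy5)
    return [v * factor for v in cy5], factor

# Hypothetical intensities for three spots.
scaled_cy5, factor = total_intensity_normalize(
    [100.0, 200.0, 300.0], [200.0, 400.0, 600.0]
)
```

To normalize on a subset (house-keeping or spike-in genes), the same factor would be computed from that subset only and then applied to every spot.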

M-A plot: Also known as the ratio-intensity plot. M: log2(cy5 / cy3) = log2(cy5) – log2(cy3). A: ½ log2(cy5 * cy3) = (log2(cy5) + log2(cy3)) / 2. Ideal: M is centered at zero and its variance does not depend on A. However: there is a systematic dependence between M and A, and high variance of M for smaller A.
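The two coordinates follow directly from the definitions above. A small sketch, with a hypothetical function name and made-up intensities:

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) lists for paired Cy5/Cy3 intensities."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]
    return m, a

# One hypothetical spot: Cy5 = 1024, Cy3 = 256.
m, a = ma_values([1024.0], [256.0])
```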

Lowess normalization: Lowess: Locally Weighted Regression. Fit local polynomial functions of M against A; M is then adjusted according to the fitted line. (Figure: M vs A before, and M’ vs A after normalization.)
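The idea can be sketched with a simplified local linear fit. This is not the full lowess procedure: it assumes tricube weights, a single pass with no robustness iterations, and a hypothetical `frac` parameter for the neighborhood size. A production analysis would normally rely on an established implementation.

```python
def lowess_normalize(a_vals, m_vals, frac=0.5):
    """Subtract a locally weighted linear fit of M on A from each M."""
    n = len(a_vals)
    k = min(n, max(3, int(round(frac * n))))  # neighborhood size
    normalized = []
    for x0, y0 in zip(a_vals, m_vals):
        # Bandwidth: distance to the k-th nearest point (including itself).
        h = sorted(abs(x - x0) for x in a_vals)[k - 1]
        if h == 0:
            h = 1.0
        # Tricube weights; points at or beyond the bandwidth get weight 0.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a_vals]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a_vals)) / sw
        my = sum(wi * y for wi, y in zip(w, m_vals)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a_vals))
        sxy = sum(wi * (x - mx) * (y - my)
                  for wi, x, y in zip(w, a_vals, m_vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted = my + slope * (x0 - mx)
        normalized.append(y0 - fitted)  # M' = M - fitted trend
    return normalized

# Hypothetical data with a purely linear intensity-dependent bias.
a = [float(i) for i in range(6, 16)]
m = [0.3 * x - 2.0 for x in a]
m_prime = lowess_normalize(a, m)
```

With a purely linear bias, as in this made-up input, the adjusted values M' come out at essentially zero.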

Replicate filtering: Experiments are repeated. Genes with very high variability between replicates are questionable. (Figure: ratio 1 vs ratio 2, and log2(ratio1) vs log2(ratio2).)

oligo microarray data preprocessing (Affymetrix chip)

Typical experiments: Multiple microarrays: n samples (from different times, locations, conditions, treatments, etc.), with k replicates for each sample. For example: samples collected from 100 healthy people and 100 cancer patients; or cells treated with some drug, with samples taken every 10 minutes. Repeat on 3 – 5 microarrays for each sample to improve the reliability of the results; replicates are often averaged after some preprocessing.

Main characteristics: For each gene, there are multiple PM and MM probes (11-16 pairs); how to obtain overall intensities from these probe-level intensities? Array outputs are absolute values rather than ratios, so cross-array normalization is important for them to be comparable.

Transformation: Log transformation for one-color arrays. When you get a data set from someone else, be careful with the scale.

Normalization: Ideas similar to cDNA microarrays. For cDNA microarrays, normalize on log ratios; there may be one or more arrays. Here, normalize absolute expression values; usually there are multiple arrays. Total intensity normalization: each array has the same mean intensity; can be based on all genes or a selected subset of genes (house-keeping genes; the middle 90%, for example, of genes; spike-in genes). Lowess: using a common reference, or cyclic. Many useful tools are implemented in R (Bioconductor).

Quantile normalization: Normalize multiple arrays. Assume the distribution of the values obtained from each array is the same or similar.

Quantile normalization (figure): Sort each column, replace each value with the mean across columns at that rank, then restore the original order.
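The sort / column-mean / restore-order steps can be sketched as follows. The function name and the toy arrays are assumptions, and ties are simply broken by position, which a real implementation might handle differently.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of values per array).

    All arrays must have the same length (same genes, same order).
    """
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # Mean of the values sharing the same rank across arrays.
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(arrays) for r in range(n)
    ]
    result = []
    for col in arrays:
        order = sorted(range(n), key=lambda i: col[i])  # gene index per rank
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]  # restore the original gene order
        result.append(out)
    return result

# Two hypothetical arrays measuring the same three genes.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step every array has exactly the same set of values, differing only in which gene receives which value.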

An example data set: J DeRisi, V Iyer, and P Brown, “Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale”, Science, 278: 680 – 686, 1997. Yeast cells grow in glucose medium; when glucose was depleted, the cells changed their metabolic pathways. cDNA microarray. Test: 2, 4, 6, 8, 10, 12, 14 hours after growth. Control: 0 hour. Total data points: ~6000 x 7. No replicates! No normalization! Fold-change was used to get differentially expressed genes!

Histogram of log ratios Median = -0.27 Two possibilities: Dye effect Sample difference

Total intensity normalization: Median = -0.1. mean(cy3) = 3141, mean(cy5) = 2838, 3141 / 2838 = 1.11. Other options: use the median; use a subset of genes (exclude the 10% most extreme; house-keeping genes; spike-in genes; etc.). Net effect: a constant factor for every gene.

Intensity-intensity plot Total intensity normalization Total intensity normalization worked well here

Intensity-intensity plot Total intensity normalization Did not work well for this experiment Dye-swapping can probably help

M-A plot: A: log2(cy5 * cy3) = log2(cy5) + log2(cy3). M: log2(cy5 / cy3) = log2(cy5) - log2(cy3).

M-A plot Dependency of M on A

Box plot

Conclusions: Microarrays provide a way to measure thousands of genes simultaneously and make global monitoring of cellular activities possible. The method produces noisy data, and normalization is crucial. Real-time RT-PCR is used for validation of a small number of genes.

Limitations: Measures mRNA instead of proteins; actual protein abundance and post-translational modifications cannot be detected. Suitable for global monitoring; results should be used to generate further hypotheses or be combined with other carefully designed experiments.

Mechanisms in microarray: Important mechanisms that make microarrays work: Reverse transcription: mRNA => cDNA. This is usually also the step at which dyes are attached. (Protein cannot be reverse translated to mRNA or to another form, so it is difficult to label with dyes.) Double-strand binding of complementary DNA sequences. (Protein does not enjoy such a good property; there are 20 amino acids without complementary binding.)

Typical Microarray Analysis (flowchart): Raw data; normalize; filter (present/absent, minimum value, fold change); significance; classification; clustering; function (Gene Ontology); regulation (motif finding).

Identify differentially expressed genes: Two samples: one normal, one cancer. Which set of genes has significantly different expression levels between the two samples? Naïve approach: fold change threshold (e.g. two-fold): log2(cy5 / cy3) > 1: up-regulated / induced; log2(cy5 / cy3) < -1: down-regulated / repressed. Still widely used because it is very simple. Main problem: genes with low expression levels may have a large fold change by chance. From 10 to 100: ten-fold. From 1000 to 3000: three-fold. However: low intensity => relatively high variance.

Problem with fold change The most “differentially” expressed genes are the ones with the lowest average expression levels

More robust estimation of differential expression: Estimate variance as a function of average expression. Compute a Z-score depending on location: Z(x) = (x - <x>) / σ(x), where x is the log2(R/G) value, <x> is the local mean, and σ(x) is the local standard deviation. Reference: Quackenbush, Nat Gen, 2002.

SAM (Significance Analysis of Microarrays): Tusher et al., PNAS 2001, 98:5116-5121. Excel add-in (free download, technical details). Most cited method of microarray data analysis. Example: Test, 3 reps; Control, 3 reps. Which gene is more significantly differentially expressed?
Columns: T1 T2 T3 C1 C2 C3 Ratio
Gene1: 1000 2000 1500 200 300 250 6
Gene2: 3000 500 2
Gene3: 100 20 80 50 8
Gene4: 1800 1700 1900 800 900

Gene 2: ratio = 2000/1000 = 2. Gene 4: ratio = 1800/900 = 2.

SAM (Significance Analysis of Microarrays): Basic idea: compute a statistic (e.g. Student’s t-test). Larger t => higher significance. The p-value can be directly computed for the t-test or estimated from a permutation test. A constant S0 is added to the denominator of the statistic to avoid the small-sample problem.
Columns: T1 T2 T3 C1 C2 C3 Ratio t
Gene1: 1000 2000 1500 200 300 250 6 4.3
Gene2: 3000 500 2 1.5
Gene3: 100 20 80 50 8 1.2
Gene4: 1800 1700 1900 800 900 11.0
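As a rough sketch of the statistic, the pooled two-sample t is shown below with the constant added to the denominator. The function name and the parameter `s0` (defaulting to 0, which gives the ordinary t-statistic) are assumptions; how SAM chooses S0 is not covered here.

```python
import math
from statistics import mean, variance

def sam_statistic(test, control, s0=0.0):
    """(mean(test) - mean(control)) / (s + s0), s = pooled standard error."""
    n1, n2 = len(test), len(control)
    pooled_var = (
        (n1 - 1) * variance(test) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(test) - mean(control)) / (s + s0)

# Gene1 from the example table.
t_gene1 = sam_statistic([1000, 2000, 1500], [200, 300, 250])
```

For Gene1 this evaluates to about 4.3, in line with the t column of the table; a positive `s0` shrinks the statistic, most strongly for genes whose standard error is small.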

Permutation test to determine significance: Gene1 (1000 2000 1500 200 300 250) has t = 4.3; permuting the sample labels gives Perm1: 0.17, Perm2: -1.3, …, Perm-n: 0.7. Number of unique permutations: (6 choose 3) = 20, so the smallest possible p-value is 1/20 = 0.05. With 5 samples on each side: (10 choose 5) = 252. With 10 samples on each side: (20 choose 10) = 184756 (~200k). For small sample sizes: pool all genes.
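A small sketch of an exhaustive one-sided permutation test for a single gene, using an ordinary pooled t-statistic. The function names are hypothetical, and this per-gene version does not show the pooling across genes mentioned above.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_stat(a, b):
    """Pooled-variance two-sample t-statistic."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def permutation_p_value(test, control):
    """Fraction of label assignments with t at least as large as observed."""
    values = list(test) + list(control)
    observed = t_stat(test, control)
    count = total = 0
    for idx in combinations(range(len(values)), len(test)):
        chosen = set(idx)
        a = [values[i] for i in idx]
        b = [values[i] for i in range(len(values)) if i not in chosen]
        total += 1
        if t_stat(a, b) >= observed - 1e-12:
            count += 1
    return count / total

# Gene1 from the example table: 20 possible label assignments.
p_gene1 = permutation_p_value([1000, 2000, 1500], [200, 300, 250])
```

For Gene1 only the real labeling reaches the observed t, so the p-value is 1/20 = 0.05, the smallest value attainable with 3 vs 3 samples.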

Permutation test (figure): Sorted columns Real t, t1, t2, …, tn, tavg, and treal - tavg, with thresholds Δ and -Δ.

SAM

False Discovery Rate (FDR): Multiple testing problem: with a p-value cutoff of 0.05 and 10000 genes tested, we would expect 500 genes by chance at this significance level. If we found 600 genes with p < 0.05, many might be due to noise. Bonferroni correction: use p-value cutoff 0.05 / 10000; among all genes selected, P(at least one false positive) <= 0.05. Too conservative: very few genes can be selected. False Discovery Rate (FDR): FDR = 0.1 means that among all genes selected (say 100), we would expect 10 to be false positives. An FDR as high as 0.5 may be acceptable to biologists. There are several different approaches to estimate it (most popular: Benjamini & Hochberg).
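To make the contrast concrete, here is a sketch of the Bonferroni cutoff next to the Benjamini & Hochberg step-up procedure. The function names and the ten p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests passing the Bonferroni-corrected cutoff alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices selected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under rank * fdr / m.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Ten hypothetical p-values.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected_bonf = bonferroni(p)        # [0]
selected_bh = benjamini_hochberg(p)  # [0, 1]
```

On this toy input Bonferroni keeps one test while Benjamini & Hochberg keeps two, which illustrates why the FDR approach is the less conservative of the two.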

FDR in SAM (figure: sorted columns Real t, t1, t2, …, tn, tavg, and treal - tavg, with thresholds Δ and -Δ): FDR = (median number of “significant” ones in the permuted columns) / (number of significant ones in the real column). Small Δ: more genes selected; higher FDR. Large Δ: fewer genes selected; lower FDR.

FDR in SAM FDR = 1855/5065=36% FDR = 1.5/209<1%

Typical Microarray Analysis (flowchart): Raw data; normalize; filter (present/absent, minimum value, fold change); significance; classification; clustering; function (Gene Ontology); regulation (motif finding).

Source: “Practical Microarray Analysis”, Presentation by Benedikt Brors, German Cancer Research Center

Classification (supervised learning; clustering is unsupervised learning): Classification: separate items into groups based on features of the items and on a training set of previously labeled items. Many classification algorithms: decision tree, SVM, naïve Bayes, nearest neighbors, neural networks, etc. Some tell you how the classification is made, which might help biologists to understand the molecular mechanisms; some are black boxes. In most cases, performance of different algorithms is similar. Having the right features (predictor variables) is the key.

AML: acute myeloid leukemia. ALL: acute lymphoblastic leukemia. Classification is critical for successful treatment. Clinical distinction involves an experienced hematopathologist’s interpretation of tumor morphology, histochemistry, immunophenotyping, and cytogenetic analysis, each performed in a separate, highly specialized laboratory. It is still imperfect and errors do occur. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286: 531 – 537, 1999. Method: weighted vote (similar to centroid classifier).

Centroid-based classifier: Model training: based on the training data, calculate the centroid for each class. Classification: given a data point, calculate the distance between the point and each of the class centroids, and assign the point to the closest class. (Figure: training points of two classes plotted on axes G1 and G2, with the ALL centroid c1 and the AML centroid c2, and a query point x at distances d1 and d2 from them.)
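The training and classification steps above can be sketched with Euclidean distance. The function names and the two-gene toy samples are hypothetical; the weighted-vote method of Golub et al. is related but not identical.

```python
import math

def train_centroids(samples, labels):
    """Compute one centroid (mean expression vector) per class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(xs) for col in zip(*xs)]
        for y, xs in groups.items()
    }

def classify(x, centroids):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Hypothetical two-gene expression profiles.
samples = [[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
centroids = train_centroids(samples, labels)
prediction = classify([1.5, 1.5], centroids)
```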

K-Nearest-Neighbour classifier: Model training: none. Classification: given a data point x, locate the K nearest points and return the most common class label among them. We usually set K > 1 to avoid outliers. Variations: a radius threshold can be used rather than K; a weight can also be set for each neighbour that takes into account how far it is from the query point.
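A minimal sketch of the unweighted majority-vote version, assuming Euclidean distance; the function name and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """Return the most common label among the k training points nearest x."""
    nearest = sorted(
        range(len(samples)), key=lambda i: math.dist(x, samples[i])
    )[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression profiles.
samples = [[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
prediction = knn_classify([1.5, 1.2], samples, labels, k=3)
```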

Cancer classification: Tons of papers have been published, and many claimed high accuracy. Be careful when evaluating those papers: it is very easy to overfit, because there are far more genes than samples. Simple methods often outperform fancy ones; SVM and KNN are among the best. Simple methods are usually also robust and easy to interpret. In most cases, performance of different algorithms is similar. Having the right features (predictor variables) is the key.

Clustering microarray data Unsupervised learning Group genes into co-expressed sets Genes with similar expression patterns across multiple experiments may be co-regulated Group experiments into clusters Experiments within the same group may have similar “gene expression” signature For example, disease sub-types that can be classified from gene expression data

Clustering microarray data How to tell if two expression vectors are similar? Define the (dis)-similarity measure between two vectors How to group multiple profiles into meaningful subsets ? Describe the clustering procedure Are the results meaningful ? Evaluate biological meaning of a clustering

(Dis)-similarity measures Two genes, X=(x1,…, xm) and Y=(y1…,ym). Euclidean distance Pearson correlation coefficient Cosine similarity Mutual information Etc.
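Three of the listed measures can be written out directly for two expression vectors. This is a sketch with hypothetical function names; mutual information is left out because it needs a discretization or density estimate.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.dist(x, y)

def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant vectors)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cosine(x, y):
    """Cosine similarity (undefined for all-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Hypothetical profiles: y has the same shape as x at twice the amplitude.
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
```

In this example `pearson(x, y)` and `cosine(x, y)` both equal 1 even though the Euclidean distance is nonzero, which is why correlation-based measures are often preferred when the shape of the profile matters more than its magnitude.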

Clustering algorithms Hierarchical clustering K-means clustering Self Organizing Maps (SOMs) Spectral clustering Model-based Graph-based Etc. Jiang and Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11. (2004), pp. 1370-1386

Hierarchical clustering: Agglomerative or divisive (less popular). Agglomerative basic idea: given n genes, initially put every gene in its own cluster. In each iteration, find the two most similar genes (or gene groups) and combine them into one cluster. Terminate when only one cluster is left. (How to define similarity between two groups?)

Hierarchical clustering: Exact behavior depends on how the distance between two clusters is computed. No need to specify the number of clusters. A distance cutoff is often chosen to break the tree into clusters.
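A naive sketch of agglomerative clustering with average linkage and a distance cutoff. The function names are hypothetical, and the repeated pairwise search is far too slow for thousands of genes; it is meant only to show the merge loop.

```python
import math

def average_linkage(c1, c2, points):
    """Mean pairwise distance between members of two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(points, distance_cutoff):
    """Merge the closest clusters until the closest pair exceeds the cutoff."""
    clusters = [[i] for i in range(len(points))]  # every gene starts alone
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > distance_cutoff:
            break  # cutting the tree at this height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Four hypothetical genes forming two tight pairs.
clusters = agglomerative([[0, 0], [0, 1], [10, 10], [10, 11]], 3.0)
```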

Distance between clusters: Single-linkage (not recommended; can be reduced to MST). Complete-linkage. Average-linkage (very similar to UPGMA). Centroid method. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

An example Genes Experiments

Hierarchical clustering Average linkage. Cluster genes only.

Average linkage. Cluster both genes and experiments.

Leaf ordering (figure): The leaves of the same tree shown in two different orders: a b c d e f and a b c f d e.

Optimal leaf ordering Idea: maximize sum of similarities of adjacent leaves in the ordering. Algorithm: Dynamic Programming. Bar-Joseph, Gifford and Jaakkola, Fast optimal leaf ordering for hierarchical clustering, Bioinformatics Vol. 17 no. 90001 2001

K-means: Basic idea: given n genes, guess the number of clusters k and (randomly) choose k genes as cluster centers. Then repeat: assign each gene to the closest center and re-compute the center of each cluster, until the assignment is stable. Similar to EM. Objective function: minimize total distance to cluster centers. May be trapped by local optima; multiple runs with different random starting points are generally needed. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
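The loop can be sketched as below. The function name, the `seed` and `init` parameters, and the toy points are assumptions added for reproducibility; a real analysis would run several random starts and keep the best result.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Plain k-means; returns (cluster index per point, cluster centers)."""
    rng = random.Random(seed)
    # Start from given centers, or from k randomly chosen points.
    centers = [list(c) for c in (init or rng.sample(points, k))]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignment is stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Six hypothetical genes in two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centers = kmeans(points, 2, init=[points[0], points[3]])
```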

K-means K = 15

Another view of clusters Log ratio Log ratio Experiments

How to determine the number of clusters? An open problem. Larger K: more homogeneity within clusters, less separation between clusters. Smaller K: the opposite. Many heuristic methods have been proposed; none is uniformly good.

Heuristics to determine the number of clusters: Tibshirani, Walther and Hastie, Estimating the number of clusters in a dataset via the gap statistic (2000). Define some statistic with respect to the number of clusters. Gap statistic: compares the (weighted) average log distance to cluster centers with its expected value.

Biclustering Cheng Y, Church GM (2000). "Biclustering of expression data". Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology: 93–103. Gene Condition

Evaluating clustering Do genes in the same cluster share similar functions? Functional enrichment analysis Do genes in the same cluster share similar cis-regulatory motifs? Motif finding

Gene Ontology (GO): Gene functions were often defined using free text, which is hard to extract, transfer, revise, predict, annotate, comprehend, manage … The list of vocabularies should be pre-defined and commonly agreed upon. Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes.

Gene ontology: Two parts: Ontology: the list of vocabularies (terms) to use. Annotations: characterizing genes using ontology terms. Three ontology categories: biological process; molecular function; cellular components.

Part of a GO graph: Each GO category is a directed acyclic graph. A term can have multiple parents and multiple children. A gene can be annotated by multiple terms. If annotated by a child term, it is automatically annotated by all ancestor terms.

Example functional enrichment analysis: Total number of genes in yeast: 7268, of which 65 have a function in co-enzyme biosynthesis. Cluster A has 100 genes, and 20 of them have a function in co-enzyme biosynthesis. Significance can be computed using the cumulative hypergeometric test: if we randomly draw 100 genes from the genome, what is the chance that we will see at least 20 co-enzyme biosynthesis genes?

Example functional enrichment analysis: If we randomly draw 100 genes from the genome, the probability that we will see exactly 20 co-enzyme biosynthesis genes is (65 choose 20) × (7268 - 65 choose 100 - 20) / (7268 choose 100). The p-value of enrichment is the sum of these probabilities over 20, 21, …, 65 co-enzyme biosynthesis genes. Correction for the multiple testing problem is usually preferred, as there are many GO terms being tested. Besides GO, other information can also be used to test for enrichment, e.g. protein complexes, pathways, motifs, etc.
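The cumulative hypergeometric p-value can be computed exactly with integer binomial coefficients. In this sketch the function and argument names are hypothetical; the numbers are the ones from the example.

```python
from math import comb

def hypergeom_enrichment_p(genome_size, annotated, cluster_size, hits):
    """P(at least `hits` annotated genes in a random draw of cluster_size)."""
    total = comb(genome_size, cluster_size)
    upper = min(annotated, cluster_size)
    favorable = sum(
        comb(annotated, i) * comb(genome_size - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / total

# 7268 yeast genes, 65 annotated, cluster of 100 genes, 20 of them annotated.
p_value = hypergeom_enrichment_p(7268, 65, 100, 20)
```

A random draw of 100 genes would contain fewer than one such gene on average (100 × 65 / 7268 ≈ 0.9), so seeing 20 gives an extremely small p-value.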

Gene Ontology Tools: geneontology.org: download ontology files and species-specific annotation files; links to many useful analysis tools. Tools for enrichment analysis: GO:TermFinder: downloadable (web interface available at SGD for yeast only). FuncAssociate: web tool; about a dozen model organisms (human, mouse, fruit fly, C. elegans, yeast, Arabidopsis, etc.). DAVID Bioinformatics Resources: web tool (downloadable); mammalian genes.

An example application: Tavazoie et al., Systematic determination of genetic network architecture, Nature Genetics, 22, 1999. 3000 yeast genes, 15 time points during the “cell cycle”. K-means clustering with k = 30; clusters correlate well with known function. AlignACE motif finding on 600-bp upstream regions found many known motifs.

Cell-division cycle: The process by which a cell duplicates its genome and divides into two identical cells. Four phases: G1 (preparation), S (DNA duplication), G2 (preparation), M (cell division).

Enriched functions in clusters

Motifs in Clusters

Comparing clustering results: When the true clustering is known: confusion matrix; Jaccard Index, Wallace Index, Rand Index, etc. (M. Meila, Comparing clusterings--an information based distance, Journal of Multivariate Analysis, 2007, 98:873 - 895). When the true clustering is unknown: use Gene Ontology, e.g. total number of enriched GO terms (affected by # of clusters); product of enrichment p-values (affected by # of clusters); the above measurements compared to random (e.g. Z-score). (Gibbons and Roth, Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation, Genome Res, 2002, 12:1574-81)

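Jaccard Index: Jaccard Index = (# gene-pairs in the same cluster under c1 AND c2) / (# gene-pairs in the same cluster under c1 OR c2), where c1 is the real clustering and c2 is the K-means clustering. Confusion matrix between real clusters and K-means clusters: 33, 31 (total 64); 24, 52 (total 76); overall total 140. Jaccard Index = (33² + 31² + 24² + 52²) / (64² + 24² + 52² + 76² + 33² + 31² – N11) = 0.54, where N11 is the numerator (the gene-pairs in the same cluster under both c1 and c2).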
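A sketch of the pair-counting definition, computed from two label vectors. The function name is hypothetical. It counts unordered pairs of distinct genes, so its values can differ slightly from an approximation based on squared cluster sizes.

```python
from itertools import combinations

def jaccard_index(labels1, labels2):
    """Jaccard index between two clusterings given as per-gene labels."""
    both = either = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        both += same1 and same2    # pair together under c1 AND c2
        either += same1 or same2   # pair together under c1 OR c2
    return both / either if either else 1.0

# Hypothetical example: c2 splits the second cluster of c1.
score = jaccard_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Here one pair is together in both clusterings and two pairs are together in at least one, giving 0.5.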