02/21/00 V1.2 Clustering Large Data Sets in Gene expression analysis Daniel Weaver.

Overview
What is “gene expression”?
Scientific questions and clustering techniques

“The Central Dogma”
DNA → RNA → Protein (DNA → RNA is transcription; RNA → protein is translation)
The arrows represent the transfer or flow of information.
DNA and RNA store information in a base-4 code (the four nucleotides).
Proteins store information in a base-20 code (the 20 amino acids).

What’s in a name?
DNA → RNA = “Transcription”
  – because the information is exactly copied (or “transcribed”) from one base-4 system (DNA) to an equivalent base-4 system (RNA). Think of a monk transcribing a scroll.
RNA → Protein = “Translation”
  – because the information is converted from a base-4 system (RNA) to a base-20 system (protein). Think of a monk translating a scroll into a new language.
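The base-4 to base-20 conversion can be made concrete with a small sketch. It relies on one fact the slide does not spell out: RNA is read three nucleotides at a time, which gives 4^3 = 64 possible triplets, enough to encode 20 amino acids. The function names and the four-entry codon table below are illustrative only (the full genetic code has 64 entries), and the input is assumed to be the coding strand of the DNA.

```python
# Illustrative subset of the genetic code; the real table has 64 codons.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}

def transcribe(dna_coding_strand):
    """Base-4 to base-4: the RNA carries the same message, with U in place of T."""
    return dna_coding_strand.replace("T", "U")

def translate(rna):
    """Base-4 to base-20: read codons (triplets) and look up amino acids."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]
        if residue == "Stop":
            break
        protein.append(residue)
    return protein

rna = transcribe("ATGTTTGGCTAA")
protein = translate(rna)
```

Transcription keeps the alphabet size, so it is a character-for-character copy; translation changes the alphabet, so it needs a lookup table.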

What is a “gene”?
“A gene is a segment of DNA that contains all the information necessary to code for some function.”
A gene is also the unit of information that is transferred through transcription and translation.

Switching genes on (or off)
Purpose: to correctly control the amount of active functional (protein) product present in the cell or organism.
[Figure with labels “Promoter” and “Enhancer”; taken, with permission, from Alberts et al., Molecular Biology of the Cell.]

Presence vs. expression
All cells have the same set of genes.
Different cell types express different subsets of their genes.
Constitutive genes are expressed in most cell types.
Cell-type-specific genes are expressed in only a few cell types.
[Figure: panels A, B, C]

Gene expression responds to the environment
Changes to the cell’s internal or external environment can lead to changes in gene expression.
Most human diseases manifest through a misregulation of gene expression.
[Figure: panels A, B, C]

Microarrays and related technologies

Example: raw microarray data
[Figure: microarray image. The legend's color swatches were lost in extraction; its three entries read: more abundant in cell type A; more abundant in cell type B; equally abundant in both cell types.]

Interpreting raw data
Most gene expression detection data sets are expressed as a ratio of red:green (experiment:control) signal.
A normalized log(red:green) ratio is frequently used. For gene X, the value in experiment i is
$$X_i = \frac{\log(\mathrm{ratio}_i)}{\left[\sum_j \log^2(\mathrm{ratio}_j)\right]^{1/2}}$$
such that the Euclidean length of X is 1.
Interpreted raw data are tabulated in an entity-by-entity table: genes-by-experiments.
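The normalization can be sketched in a few lines. The function name and the example ratios are hypothetical, and the natural log is an arbitrary choice: any log base gives the same normalized vector, because the constant factor between bases cancels in the division.

```python
import math

def normalize_log_ratios(ratios):
    """Return the log-ratios of one gene, scaled to Euclidean length 1."""
    logs = [math.log(r) for r in ratios]
    norm = math.sqrt(sum(v * v for v in logs))
    if norm == 0.0:
        # Every ratio is 1 (no change anywhere); keep the zero vector.
        return logs
    return [v / norm for v in logs]

# Hypothetical red:green ratios for one gene across four experiments.
example_ratios = [2.0, 0.5, 4.0, 1.0]
x = normalize_log_ratios(example_ratios)
length = math.sqrt(sum(v * v for v in x))
```

After this step, genes with the same pattern of up- and down-regulation but different magnitudes map to the same vector, which is what makes their profiles comparable.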

Gene-by-Experiment table
Gene expression analysis is a variant of classic data mining: looking for informative patterns in the rows and columns of this type of table.

Data volumes
~120,000 genes in the human genome.
Expression detection techniques can take from 1 to 50 measurements simultaneously on each gene.
Many diverse gene and experiment attributes.
In 3-5 years, 10^5+ data sets will be available for analysis.
Data volumes will range from tens of Gb to a few Tb.
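A back-of-the-envelope check shows how these figures fit together. The sketch assumes that each measurement is stored as a 4-byte value and that “Gb” and “Tb” mean gigabytes and terabytes; neither assumption comes from the slide, and attribute metadata is ignored.

```python
GENES = 120_000
DATA_SETS = 10 ** 5
BYTES_PER_VALUE = 4  # assumption: one 32-bit number per measurement

def estimated_bytes(measurements_per_gene):
    """Raw storage for all data sets, ignoring gene and experiment attributes."""
    return GENES * measurements_per_gene * DATA_SETS * BYTES_PER_VALUE

low = estimated_bytes(1)    # one measurement per gene
high = estimated_bytes(50)  # fifty measurements per gene
print(low / 1e9, "GB to", high / 1e12, "TB")
```

Under these assumptions the range runs from 48 GB to 2.4 TB, in line with the slide's estimate.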

Analyzing gene expression data
What genes are (or are not) expressed?
  – In different cells
  – Under different external conditions
  – In different disease states
How much does their expression change?
Does the change in expression correlate with other observed parameters?
These questions are handled with descriptive statistics.

Clustering and classifying gene expression
Scientific questions to be answered
Clustering techniques that are being applied
Lots of room and need for novel statistical and computational analyses

Clustering gene expression data
Functionally classify novel genes
Identify co-regulated gene groups
Identify diagnostic gene expression patterns

Functionally classifying genes
Problem:
  – Genome sequencing projects identify many previously unstudied genes.
  – Can one use the genes’ expression patterns to cluster genes that have similar function?

Inputs and outputs
Inputs
  – A set of genes whose functional classification is known.
  – A set of genes whose functional classification is unknown.
  – Gene expression data sets for all the genes.
Desired output
  – A “best fit” functional classification for each of the novel genes.

Examples
Brown et al. (2000) PNAS 97(1), 262-267.
Input:
  – Log-normalized data from 79 experiments on 2,467 genes
Trained on two-thirds of the genes, tested on the remaining third.
Classifiers tried include Support Vector Machines (SVMs) and four other machine learning algorithms (Parzen, FLD, C4.5, MOC1).
SVMs performed best, using the kernel $K(X, Y) = (X \cdot Y + 1)^d$ with d = 1, 2, or 3.
This kernel transforms the data into a higher-dimensional space where it is easier to identify a separating hyperplane.
Sensitivity = ~0.6
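To illustrate what “transforms the data into a higher-dimensional space” means, the sketch below evaluates the kernel for d = 2 on two-dimensional vectors and compares it with an ordinary dot product taken after an explicit six-dimensional feature map. The function names and example vectors are hypothetical, and this shows only the kernel, not the SVM training used in the paper.

```python
import math

def polynomial_kernel(x, y, d):
    """K(X, Y) = (X . Y + 1)^d"""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d

def feature_map_d2(v):
    """Explicit feature map for d = 2 on a 2-D input (2 dims become 6)."""
    v1, v2 = v
    r2 = math.sqrt(2.0)
    return [1.0, r2 * v1, r2 * v2, v1 * v1, v2 * v2, r2 * v1 * v2]

# Hypothetical normalized expression vectors for two genes.
x = [0.6, 0.8]
y = [0.8, 0.6]

via_kernel = polynomial_kernel(x, y, 2)
via_mapping = sum(a * b for a, b in zip(feature_map_d2(x), feature_map_d2(y)))
```

The two values agree, so the kernel yields the higher-dimensional inner product without ever constructing the mapped vectors. With 79 experiments per gene the explicit map would be far larger, which is why the kernel form is the practical one.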

Examples
Hierarchical clustering, average linkage (DeRisi et al.)
  – Cluster the genes.
  – Examine the clusters (through human intervention) to determine whether a cluster contains genes with known functions.

Co-regulated genes
Problem:
  – Biological processes typically involve genes of many functional categories.
  – Knowledge of which genes act coordinately can help direct drug development.
[Figure: Expression Group 1, Expression Group 2, Expression Group 3]

Inputs and outputs
Inputs
  – Gene expression data for all genes of interest
  – (Information about the experimental conditions in which the gene expression data sets were collected)
Desired outputs
  – Ordering of the input genes into sets of genes with related expression patterns

Examples
Eisen et al. (1998) PNAS 95: 14863-14868
Input:
  – Log-normalized data from 12 experiments on 2,467 genes
Performed pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient as the metric.
Genes that cluster together are displayed in a dendrogram in which the branch lengths reflect the degree of similarity.
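The clustering step can be sketched as a naive agglomerative loop. This is an illustration under stated assumptions, not the paper's implementation: it uses the standard Pearson coefficient rather than the modified one, defines cluster-to-cluster similarity as the mean of all pairwise gene similarities, and runs on four hypothetical gene profiles that are assumed to be non-constant.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Standard Pearson correlation (assumes neither profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Merge the two most similar clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    sim = {frozenset(p): pearson(profiles[p[0]], profiles[p[1]])
           for p in combinations(profiles, 2)}

    def linkage(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical expression profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.2],
}
merges = average_linkage(profiles)
```

The recorded similarity at each merge is the information a dendrogram encodes in its branch lengths. The loop re-evaluates every cluster pair at each step, which is acceptable for a toy example but too slow for thousands of genes.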

Examples
Tavazoie et al. (1999) Nature Genetics 22: 281-285.
Inputs:
  – “Variance-normalized” data from 15 experiments on 6,220 genes. Variance normalization is $X'_{ij} = (X_{ij} - \bar{X}_i) / \mathrm{stdev}(X_i)$ for gene i in experiment j, where $\bar{X}_i$ is the mean of gene i over all experiments.
Used Euclidean distance as the metric and performed k-means clustering, programmed to find 10, 30, and 60 centroids.
Gene clusters were shown to contain functionally related genes, as expected.
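A minimal sketch of variance normalization followed by k-means with Euclidean distance is shown below. Several details are assumptions rather than the paper's choices: the sample standard deviation is used, the initial centroids are passed in by the caller, and the four gene profiles are hypothetical.

```python
import math
import statistics

def variance_normalize(row):
    """(X_ij - mean_i) / stdev_i for one gene; sample stdev is an assumption."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    return [(v - mean) / sd for v in row]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_means(points, centroids, iterations=100):
    """Plain k-means: assign each point to the nearest centroid, then re-average."""
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical raw profiles: g1/g2 rise, g3/g4 fall, at very different scales.
raw = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [3.0, 2.0, 1.0], [30.0, 21.0, 10.0]]
points = [variance_normalize(row) for row in raw]
labels, centroids = k_means(points, centroids=[points[0], points[2]])
```

Because normalization removes each gene's scale, g1 and g2 land in the same cluster even though their raw values differ tenfold. In practice the result depends on the initial centroids, so runs are usually repeated with different seeds.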

Diagnostic expression patterns
Problem:
  – Many diseases cannot be reliably distinguished through traditional techniques (microscopy, pathology, etc.).
  – Given gene expression data from diseased tissue, is there a set of genes that correctly distinguishes the diseases (as judged by other criteria)?

Inputs and outputs
Inputs
  – Gene expression data for all genes (available)
  – Information about the patients afflicted with the complex disease of interest
Desired output
  – The minimal set of genes that accurately partitions the disease, i.e. the minimal diagnostic gene expression pattern

Examples
Alizadeh et al. (2000) Nature 403: 503-511.
Input:
  – Log-normalized data from 96 experiments on 4,026 genes (out of 17,856 measured). The 96 experiments were performed on cancer biopsies from patients with Diffuse Large B-cell Lymphoma (DLBCL).
Pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric (Eisen et al., 1998).
Two previously unknown DLBCL subtypes are distinguished by small gene clusters (~40 genes and ~70 genes).
Subtypes correspond to prognosis:
  – “GC B-like” → 76% survivorship
  – “Activated B-like” → 16% survivorship
(Overhead)

Summary
Current techniques include supervised and unsupervised classification.
Three main scientific questions:
  – Functionally classifying genes
  – Identifying co-regulated sets of genes
  – Identifying diagnostic expression “fingerprints”
Data sets are relatively small now, but growing rapidly.
Classification draws from the expression data and from other domain knowledge.
Lots of room and need for novel statistical and computational analyses.

Further Reading
Clustering gene expression data:
1. Alizadeh, et al. (2000) Nature 403: 503-511.
2. Alon, et al. (1999) PNAS 96: 6745-6750.
3. Butte and Kohane. (2000) Proceedings of Pacific Sym. Biocomputing.
4. Brown, et al. (2000) PNAS 97: 262-267.
5. Eisen, et al. (1998) PNAS 95: 14863-14868.
6. Iyer, et al. (1999) Science 283: 83-87.
7. Raychaudhuri, et al. (2000) Proceedings of Pacific Sym. Biocomputing.
8. Roberts, et al. (2000) Science 287: 873-880.
9. Ross, et al. (2000) Nature Genetics 24: 227-235.
10. Scherf, et al. (2000) Nature Genetics 24: 236-244.
11. Spellman, et al. (1998) Mol Biol Cell 9: 3273-3297.
12. Tamayo, et al. (1999) PNAS 96: 2907-2912.
13. Tavazoie, et al. (1999) Nature Genetics 22: 281-285.
14. Zhu and Zhang. (2000) Proceedings of Pacific Sym. Biocomputing.

Further Reading
Other related gene expression papers:
  – Holstege, et al. (1998) Cell 95: 717-728.
  – DeRisi, et al. (1996) Nature Genetics 14: 457-460.
  – Schena, et al. (1995) Science 270: 467-470.
  – DeRisi, et al. (1997) Science 278: 680-686.
  – Hilsenbeck, et al. (1999) J. Natl. Cancer Inst. 91: 453-459.

Expression data sets
European Bioinformatics Institute (EBI) (links to refs. 4, 5, 6, 11)
  – Main microarray page: http://www.ebi.ac.uk/microarray/
  – Microarray public data set page (a great portal site from which you can browse to many of the published data sets): http://industry.ebi.ac.uk/~brazma/Data-mining/microarray.html
National Human Genome Research Institute (NHGRI)
  – Main page: http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/
  – Data set download page: ftp://kronos.nhgri.nih.gov/pub/outgoing/olga/old/
National Cancer Institute (NCI) (refs. 9 and 10)
  – Main page: http://discover.nci.nih.gov/
  – Data set download page: http://discover.nci.nih.gov/nature2000/
Lymphoma data set (ref. 1)
  – Main page: http://llmpp.nih.gov/
  – Data set download page: http://llmpp.nih.gov/lymphoma/
