Future data scientists also need to be skilled in statistics, and to be able to tell stories with data, to make it understandable to a variety of people.


DNA microarray and array data analysis

What is DNA Microarray
DNA microarray is a new technology to measure the level of the RNA gene products of a living cell.
A microarray chip is a rectangular chip on which is imposed a grid of DNA spots. These spots form a two dimensional array.
Each spot in the array contains millions of copies of some DNA strand, bonded to the chip.
Chips are made tiny so that a small amount of RNA is needed from experimental cells.

DNA Microarray
Many applications in both basic and clinical research: determining the role a gene plays in a pathway, disease, diagnostics and pharmacology, …
Main platforms for performing microarray analyses:
cDNA arrays (generic, multiple manufacturers)
Oligonucleotide arrays (genechips) (Affymetrix)
BeadArray (BeadChip) (Illumina)
cDNA membranes (radioactive detection)

cDNA Microarray
Spot cloned cDNAs onto a glass/nylon microscope slide, usually PCR amplified segments of plasmids.
Complementary hybridization:
-- CTAGCAGG actual gene
-- GATCGTCC cDNA (reverse transcriptase)
-- CUAGCAGG mRNA
Label 2 mRNA samples with 2 different colors of fluorescent dye -- control vs. experimental.
Mix two labeled mRNAs and hybridize to the chip.
Make two scans, one for each color.
Combine the images to calculate ratios of amounts of each mRNA that bind to each spot.
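The base pairing on the cDNA slide can be checked with a few lines of code. This is only an illustrative sketch: the helper names `complement` and `to_rna` are made up here, and the only data used are the three sequences quoted on the slide.

```python
# Hypothetical helpers; the sequences are the ones quoted on the slide.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(dna: str) -> str:
    """Base-by-base complement of a DNA strand (no reversal)."""
    return "".join(DNA_COMPLEMENT[base] for base in dna)

def to_rna(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")

gene = "CTAGCAGG"
cdna = complement(gene)   # strand that pairs base-by-base with the mRNA
mrna = to_rna(gene)       # mRNA carries the gene sequence with U in place of T
print(cdna, mrna)
```

The complement is written position by position, matching the way the slide lines the three strands up, rather than as a reverse complement.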

Spotted Microarray Process (figure labels: CTRL, TEST)

cDNA Array Experiment Movie: s/chip/chip.html

Affymetrix
Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
cRNA labeled and scanned in a single “color”: one sample per chip
Can have as many as 760,000 probes on a chip
Arrays get smaller every year (more genes)
Chips are expensive
Proprietary system: “black box” software, can only use their chips

Affymetrix GeneChip® Probe Arrays
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe.
Figure labels: GeneChip Probe Array; 1.28cm; 24~50µm; Hybridized Probe Cell; Oligonucleotide probe; Single stranded, fluorescently labeled cRNA target; Image of Hybridized Probe Array.

GeneChip® Human Gene 1.0 ST Array

Affymetrix GeneChip ® Probe Array

Affymetrix Genome Arrays

Affymetrix GeneChip
Probe: 25 bases long single stranded DNA oligos
Probe Cell: Single square-shaped feature on an array containing one type of probe. Contains millions of probe molecules
Probe Pair: Perfect Match/Mismatch
Probe Set

Array Design
Twenty 25 mer DNA oligo probes are selected from the 3’ end of the gene.
For each probe selected, a partner containing a central mutation is also made (Perfect Match, PM, and Mismatch, MM).
For each gene a total of 20 probe pairs are arrayed on the chip.
Figure labels: 5’, 3’; Probe Set; Probe Pair; Probe Cell (24 µm).

Probe Sub-types on chips
1. Known genes
2. Expressed sequence tags (ESTs)
3. Spiked control transcripts
Other labels on the slide: Exemplars, Consensus, Specific transcripts, Housekeeping genes

cRNA preparation
Total RNA (5-8 µg), poly-A tailed (AAAAAAAAA)
cDNA Strand 1 synthesis: SS II reverse transcriptase, primed with a poly-T primer carrying a T7 RNA pol. promoter
cDNA Strand 2 synthesis: E. coli DNA pol. I
IVT cRNA synthesis (T7 RNA pol.) amplifies and labels transcripts with Biotin
Fragmented cRNA
cRNA is now ready for hybridization to test chip

Figure: cRNA labeled targets (B) hybridize to the cDNA probes; post hybridization washes separate Specific Binding from Non-Specific Binding. Other labels: S, FL.

Figure: Streptavidin (S) bound to the labeled targets (B).

Sample is added to a hybridization cocktail along with spiked control transcripts and is loaded onto an array chip.
Chip is placed in a hybridization oven and incubated overnight.
Chips are placed in the Fluidics station where they are washed, stained and washed again (2.5 hours).
After staining, the signal intensities are measured with a laser scanner (15 min).
Data is acquired by the computer as soon as the scan has been completed.
Figure labels: Hybridization cocktail; Affymetrix Array Chip.

The chip image data file (or “.dat” file) is the first part of data acquisition and appears on the computer screen upon completion of the laser scan. Here, we zoom in to see an individual probe set that has been highlighted. (Figure label: Probe set)

The first image is “sample1.dat”. Note the pixel to pixel variation within a probe cell. A “*.cel” file is automatically generated when the “*.dat” image first appears on the screen. Note that this derivative file has homogeneous signal intensity within its probe cells.

Affymetrix Algorithms
1. Signal
1.1 Adjusting MMs to purge negative values
All MMs < PMs: no adjustment necessary
Few MMs > PMs: change MMs based on weighted mean of other MMs
Most MMs > PMs: change MMs to be slightly less than PM

Affymetrix Algorithms
Signal Calculation. Calculate the signal.
Having adjusted the MM values, we now calculate the signal. The figure shows the PM values, the MM values, and the calculated PM-MM values.
The unweighted mean is vulnerable to outlier data. In order to protect against this, we dampen the effect of outliers by using the Tukey bi-weight mean. PM-MM values that are a number of standard deviations away from the mean are given low weights in accordance with the graph shown here (weight factor against standard deviations). Individual PM-MM data are multiplied by the weight factor before calculation of the mean. The weighted mean is then called the “signal.”
Unweighted mean = 2063
Using Tukey’s biweight mean = 1780
Signal (expression level) = 1780
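The outlier damping described on this slide can be illustrated with a one-step Tukey biweight. The sketch below is not the vendor's implementation: it assumes the common formulation that measures distance from the median in units of the median absolute deviation (the slide speaks of standard deviations from the mean), and the tuning constants `c=5.0` and `eps=1e-4` as well as the PM-MM numbers are assumptions chosen for illustration.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers.

    Assumed formulation: distances are taken from the median and scaled
    by the median absolute deviation (MAD); c and eps are tuning constants.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)
        # bisquare weight: near 1 close to the center, 0 beyond |u| >= 1
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Made-up PM-MM differences with one obvious outlier.
pm_minus_mm = [10.0, 11.0, 9.0, 10.0, 100.0]
robust = tukey_biweight(pm_minus_mm)
plain = statistics.mean(pm_minus_mm)
print(robust, plain)
```

With these numbers the outlier receives weight zero, so the robust estimate stays near 10 while the unweighted mean is pulled up to 28, which mirrors the gap between the two means quoted on the slide.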

cel file
[CEL]
Version=3
[HEADER]
Cols=640
Rows=640
TotalX=640
TotalY=640
OffsetX=0
OffsetY=0
GridCornerUL=
GridCornerUR=
GridCornerLR=
GridCornerLL=
Axis-invertX=0
AxisInvertY=0
swapXY=0
DatHeader=[ ] CL AA:CLS=4733 RWS=4733 XIN=3 YIN=3 VE= /16/01 13:32:23 HG_U95Av2.1sq 6
Algorithm=Percentile
AlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004
[INTENSITY]
NumberCells=
CellHeader=X Y MEAN STDV NPIXELS

Microarray Data Analysis
Data processing and visualization
Supervised learning
Feature selection
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory regions predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Microarrays: An Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
72 examples (38 train, 34 test), about 7,000 probes
well-studied (CAMDA-2000), good test example
ALL and AML: visually similar, but genetically very different

Normalization
Need to scale the red sample so that the overall intensities for each chip are equivalent.
Figure labels: control, Sample 1, Sample 2.
What can we tell from the two plots?

Normalization
To ensure the data are comparable, normalization attempts to correct the following variables:
Number of cells in the sample
Total RNA isolation efficiency
Signal measurement sensitivity
…
Can use simple/complicated math:
Normalization by global scaling (bring each image to the same average brightness)
Normalization by sectors
Normalization to housekeeping genes
…
Active research area
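Of the approaches listed, global scaling is the simplest to write down: each chip is multiplied by one factor so that all chips end up with the same average intensity. A minimal sketch follows; the function name `global_scale`, the target value of 100 and the intensities are illustrative assumptions, not values from the slides.

```python
def global_scale(chips, target=100.0):
    """Rescale each chip so that its mean intensity equals `target`."""
    scaled = []
    for chip in chips:
        factor = target * len(chip) / sum(chip)   # target / chip mean
        scaled.append([x * factor for x in chip])
    return scaled

# Two made-up chips; the second is uniformly twice as bright as the first.
chips = [[50.0, 100.0, 150.0], [100.0, 200.0, 300.0]]
normalized = global_scale(chips)
print(normalized)
```

After scaling, the two chips carry identical values, which is the sense in which global scaling brings each image to the same average brightness. It cannot correct intensity-dependent or spatial effects, which is where sector-based or housekeeping-gene normalization comes in.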

AML vs ALL

Microarray Data Analysis
Data processing and visualization
Supervised learning
Feature selection
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory regions predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Feature selection
Table columns: Probe, AML1, AML2, AML3, ALL1, ALL2, ALL3
Probe rows: D21869_s_at, D25233cds_at, D25543_at, L03294_g_at, J03960_at, M81855_at, L14936_at, L19998_at, L19998_g_at, AB017912_at, AB017912_g_at, U86635_g_at, …

Table columns: Probe, AML1, AML2, AML3, ALL1, ALL2, ALL3, p-value
Probe rows: D21869_s_at, D25233cds_at, D25543_at, L03294_g_at, J03960_at, M81855_at, L14936_at, L19998_at, L19998_g_at, AB017912_at, AB017912_g_at, U86635_g_at, …

Hypothesis Testing

Null hypothesis is a hypothesis set up to be nullified in order to support an alternative hypothesis.
Hypothesis testing is to test the viability of the null hypothesis for a set of experimental data.
Example: Test whether the time to respond to a tone is affected by the consumption of alcohol.
Hypothesis: µ1 - µ2 = 0
µ1 is the mean time to respond after consuming alcohol
µ2 is the mean time to respond otherwise

Z-test
Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$.
In particular, $\bar{x} = \sum_i x_i / n \sim N(\mu, \sigma^2 / n)$.
Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$).
What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters of $\mu = 100$ and $\sigma = 8$? Use $z = (\bar{x} - \mu_0) / (\sigma_0 / \sqrt{N})$.
Note: z follows a normal distribution N(0, 1).
Reject the null hypothesis.
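The worked example on this slide (N = 46, sample mean 104, population mean 100, standard deviation 8) can be reproduced numerically. The sketch below assumes a two-sided test; the function name `z_test` is made up for illustration.

```python
import math

def z_test(sample_mean, mu0, sigma0, n):
    """z statistic and two-sided p-value for a mean with known sigma."""
    z = (sample_mean - mu0) / (sigma0 / math.sqrt(n))
    # For a standard normal, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

z, p = z_test(104, 100, 8, 46)
print(round(z, 2), p)
```

The statistic comes out at about 3.39, well beyond the usual 1.96 cutoff for a 5% two-sided test, which agrees with the slide's conclusion to reject the null hypothesis.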

Z-test
Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$).
But, in practice $\sigma_0$ is often unknown.

T-test
$s_{\bar{x}-\bar{y}}$: standard error of the difference, assuming $\sigma_1$ and $\sigma_2$ are different
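When the two variances are allowed to differ, the usual choice is the unequal-variance (Welch) form, in which the standard error of the difference is the square root of the sum of the two per-group variances of the mean. The slide's own formula did not survive in the transcript, so the sketch below is an assumption about what it showed; the function name `welch_t` and the expression values are made up.

```python
import math
import statistics

def welch_t(x, y):
    """t statistic and approximate degrees of freedom, unequal variances."""
    vx = statistics.variance(x) / len(x)   # squared standard error of mean(x)
    vy = statistics.variance(y) / len(y)   # squared standard error of mean(y)
    se = math.sqrt(vx + vy)                # standard error of the difference
    t = (statistics.mean(x) - statistics.mean(y)) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Made-up expression values for one probe in three AML and three ALL samples.
aml = [7.0, 8.0, 9.0]
all_ = [10.0, 11.0, 12.0]
t, df = welch_t(aml, all_)
print(t, df)
```

Turning `t` and `df` into a p-value needs the t distribution, which the Python standard library does not provide; a statistics package would be used for that step.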

T-test
William Sealy Gosset (Guinness Brewing Company)

P-value
Does a particular gene have the same expression level in ALL and AML?
Table columns: Probe, AML1, AML2, AML3, ALL1, ALL2, ALL3, p-value
Probe rows: D25543_at, L03294_g_at, …
Figure labels: ALL, AML

Data processing
Feature selection
T-test
Based on the fold change

Matlab ttest
[H,P] = ttest2(X,Y)
Determines whether the means from matrices X and Y are statistically different.
H returns 0 or 1, indicating whether the null hypothesis (that the means are the same) is not rejected (0) or rejected (1).
P returns the p-value of the test.

Microarray Data Analysis
Data processing and visualization
Supervised learning
Feature selection
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Infer gene interactions in pathways and networks
Gene regulatory regions predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Nearest Neighbor Classification
Figure: samples plotted by Feature 1 and Feature 2; M = AML, L = ALL, plus an unlabeled test sample.
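Nearest neighbor classification as pictured here assigns a test sample the label of the training samples closest to it in feature space. The following is a minimal sketch under stated assumptions: two features per sample, Euclidean distance, a majority vote over `k=3` neighbors, and made-up coordinates; the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label); majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up (Feature 1, Feature 2) values for labeled training samples.
train = [((1.0, 1.2), "ALL"), ((0.8, 1.0), "ALL"), ((1.1, 0.9), "ALL"),
         ((3.0, 3.2), "AML"), ((3.1, 2.9), "AML"), ((2.8, 3.0), "AML")]
prediction = knn_predict(train, (2.9, 3.1))
print(prediction)
```

An odd `k` avoids tied votes when there are two classes; the choice of distance measure matters as much as `k`.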

Distance Issues
Euclidean distance
Pearson distance
Figure labels: g1, g2, g3, g4
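The two distances can rank the same genes very differently. The sketch below assumes "Pearson distance" means one minus the Pearson correlation, a common convention that the slide does not spell out; the profiles `g1`, `g2`, `g3` are made up.

```python
import math
import statistics

def euclidean(a, b):
    return math.dist(a, b)

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for profiles with the same shape."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) *
                     sum((y - mb) ** 2 for y in b))
    return 1 - cov / norm

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]   # same shape as g1, ten times larger
g3 = [4.0, 3.0, 2.0, 1.0]       # same scale as g1, opposite trend
print(euclidean(g1, g2), pearson_distance(g1, g2))
print(euclidean(g1, g3), pearson_distance(g1, g3))
```

By Euclidean distance `g1` is much closer to `g3` than to `g2`; by Pearson distance `g1` and `g2` are identical in shape and `g3` is as far away as possible. Which behavior is wanted depends on whether absolute expression level or the pattern across samples is the quantity of interest.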

Cross-validation

Microarray Data Analysis
Data processing and visualization
Supervised learning
Feature selection
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Gene regulatory regions predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Genetic Algorithm for Feature Selection
Figure: Sample (Clear cell RCC, etc.); Raw measurement data; Feature vector = pattern (f1 f2 f3 f4 f5)

Why Genetic Algorithm?
Assuming 2,000 relevant genes, 20 important discriminator genes (features).
Cost of an exhaustive search for the optimal set of features?
$C(n,k) = n! / (k!(n-k)!)$
$C(2000, 20) = 2000! / (20! \, 1980!) \ge 100^{20} = 10^{40}$
If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
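The arithmetic behind this estimate is easy to check. The sketch assumes a femtosecond is taken as 1e-15 seconds and a year as 365.25 days; the variable names are illustrative.

```python
import math

n_subsets = math.comb(2000, 20)            # exact number of 20-gene subsets
lower_bound = 100 ** 20                    # the slide's bound, 10^40
seconds = lower_bound * 1e-15              # one femtosecond per subset
years = seconds / (365.25 * 24 * 3600)
print(n_subsets >= lower_bound, f"{years:.1e}")
```

Even the loose lower bound of 10^40 subsets gives roughly 3 × 10^17 years, so the exact count, which is larger still, only strengthens the case for a heuristic search.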

Evolutionary Methods
Based on the mechanics of Darwinian evolution
The evolution of a solution is loosely based on biological evolution
Population of competing candidate solutions
Chromosomes (a set of features)
Genetic operators (mutation, recombination, etc.) generate new candidate solutions
Selection pressure directs the search: those that do well survive (selection) to form the basis for the next set of solutions.

A Simple Evolutionary Algorithm
Figure labels: Selection, Genetic Operators, Evaluation

Genetic Operators
Crossover: randomly selected crossover point
Mutation: randomly selected mutation site
Recombination is intended to produce promising individuals.
Mutation maintains population diversity, preventing premature convergence.
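The two operators can be sketched for chromosomes written as fixed-length lists of gene indices. Everything here is an illustrative assumption rather than the presenters' implementation: single-point crossover, a mutation that overwrites one site with a gene drawn from a pool of 2,000, and the names `crossover` and `mutate`.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Swap the tails of two chromosomes at a randomly selected point."""
    point = rng.randrange(1, len(parent_a))
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

def mutate(chromosome, gene_pool, rng):
    """Replace the gene at one randomly selected site."""
    site = rng.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[site] = rng.choice(gene_pool)
    return mutated

rng = random.Random(0)
a = [2, 1, 6, 3, 21]        # a chromosome: indices of five selected genes
b = [20, 7, 5, 2, 100]
c1, c2 = crossover(a, b, rng)
m = mutate(a, list(range(1, 2001)), rng)
print(c1, c2, m)
```

This simple version can leave the same gene twice in one chromosome; an encoding or repair step that rules out such illegal solutions would be needed in practice.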

Genetic Algorithm
g2g1g6g3g21 g201g17g51g21g1 g12g7g15g12g10 g25g72g56g23g10 g20g7g5g2g100
Good enough: Stop
g20g7g6g3g21 g20g7g25g23g14 g12g7g15g22g10 g25g72g56g23g10 g2g1g5g2g100
Not good enough

GA Fitness
At the core of any optimization approach is the function that measures the quality of a solution or optimization.
Called:
Objective function
Fitness function
Error function
measure
etc.

Encoding
Most difficult, and important, part of any GA
Encode so that illegal solutions are not possible
Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space
Most GAs use a binary encoding of a solution, but other schemes are possible

Genetic Algorithm/K-Nearest Neighbor Algorithm
Figure labels: Microarray Database; Feature Selection (GA); Classifier (kNN)

Microarray Data Analysis
Data processing and visualization
Supervised learning
Machine learning approaches
Unsupervised learning
Clustering and pattern detection
Gene regulatory regions predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases
…

Unsupervised learning
Supervised methods: can only validate or reject hypotheses; cannot lead to discovery of unexpected partitions.
Unsupervised learning: no prior knowledge is used; explore structure of data on the basis of similarities.

DEFINITION OF THE CLUSTERING PROBLEM

CLUSTER ANALYSIS YIELDS DENDROGRAM T (RESOLUTION)

BUT WHAT ABOUT THE OKAPI?

Agglomerative Hierarchical Clustering
Initially: each point = cluster.
At each step, merge the pair of nearest clusters.
Figure: Dendrogram, with axis label “Distance between joined clusters”.
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage (UPGMA): average distance between all pairs or distance between cluster centers.
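The merge loop and the three linkage rules can be sketched in a few lines. This is a deliberately naive version for small inputs, with assumptions that are not on the slide: Euclidean distance, "average" meaning the mean over all cross-cluster pairs, stopping at a requested number of clusters instead of recording the full dendrogram, and made-up points. The names `linkage_distance` and `agglomerate` are hypothetical.

```python
import math
from itertools import combinations

def linkage_distance(c1, c2, method):
    d = [math.dist(p, q) for p in c1 for q in c2]
    if method == "single":
        return min(d)            # closest pair
    if method == "complete":
        return max(d)            # farthest pair
    return sum(d) / len(d)       # "average": mean over all pairs

def agglomerate(points, n_clusters, method="average"):
    clusters = [[p] for p in points]          # initially each point is a cluster
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]], method))
        merged = clusters.pop(j)              # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, n_clusters=3)
print(result)
```

Recomputing every pairwise linkage at each step is wasteful; practical implementations keep a distance matrix and update only the rows affected by the merge.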

Hierarchical Clustering - Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters
Average Linkage (UPGMA): the most widely used clustering method in gene expression analysis

Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples

Heat map (breast cancer, Nature 2002)

Centroid methods: K-means
Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\nu$, $\nu = 1, \dots, K$
Assign data point $i$ to centroid $\nu$: $S_i = \nu$
Cost E: $E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$
Minimize E over $S_i$, $Y_\nu$

K-means “Guess” K=3

K-means Iteration = 0
Start with random positions of centroids.

K-means Iteration = 1
Start with random positions of centroids.
Assign each data point to closest centroid.

K-means Iteration = 2
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points.

K-means Iteration = 3
Start with random positions of centroids.
Assign each data point to closest centroid.
Move centroids to center of assigned points.
Iterate till minimal cost.
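The four steps above can be sketched as a short loop. This is a plain illustration under stated assumptions, not a tuned implementation: initial centroids are drawn from the data points, distance is Euclidean, iteration stops when the centroids no longer move, and the points are made up. The function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial positions
    groups = []
    for _ in range(iterations):
        # assign each data point to its closest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # move each centroid to the center of its assigned points
        updated = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g
                   else centroids[c]           # keep an empty cluster in place
                   for c, g in enumerate(groups)]
        if updated == centroids:               # nothing moved: cost is stable
            break
        centroids = updated
    return centroids, groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, k=2)
print(centroids, groups)
```

On this well-separated toy data the loop recovers the two groups; on real data the result depends on the initial centroids, so several restarts with different seeds are commonly compared.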

K-means - Summary
Fast algorithm: compute distances from data points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions

Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?

Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters