The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.

The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis (Nov , PICB Shanghai) by Peter Serocka

Analysis of Multiple Experiments TIGR Multiple Experiment Viewer (MeV)

The Expression Matrix is a representation of data from multiple microarray experiments. Each element is a log ratio, usually log2(Cy5 / Cy3).
- Red indicates a positive log ratio, i.e., Cy5 > Cy3
- Green indicates a negative log ratio, i.e., Cy5 < Cy3
- Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
- Gray indicates missing data
[Figure: expression matrix with rows Gene 1 to Gene 6 and columns Exp 1 to Exp 6]
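The color coding above can be sketched in a few lines of Python. This is only an illustration: the function names `log_ratio` and `cell_color`, the tolerance `eps`, and the sample intensities are assumptions made for the example and are not part of MeV.

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3), or None when an intensity is missing or not positive."""
    if cy5 is None or cy3 is None or cy5 <= 0 or cy3 <= 0:
        return None
    return math.log2(cy5 / cy3)

def cell_color(ratio, eps=1e-9):
    """Map a log ratio to the display color described on the slide."""
    if ratio is None:
        return "gray"    # missing data
    if ratio > eps:
        return "red"     # Cy5 > Cy3
    if ratio < -eps:
        return "green"   # Cy5 < Cy3
    return "black"       # Cy5 and Cy3 very close in value

# Hypothetical (Cy5, Cy3) intensities: 2 genes x 2 experiments
intensities = [
    [(400.0, 100.0), (100.0, 400.0)],
    [(250.0, 250.0), (None, 300.0)],
]
matrix = [[log_ratio(cy5, cy3) for cy5, cy3 in row] for row in intensities]
colors = [[cell_color(r) for r in row] for row in matrix]
```

Each row of `matrix` is then one gene's expression vector across the experiments.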

Expression Vectors
- Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
[Figure: one gene's expression vector, each element a log2(Cy5/Cy3) value]

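Expression Vectors As Points in ‘Expression Space’
[Figure: genes (G1, G2, ...) from a matrix with columns Exp 1 to Exp 3 plotted as points on the axes Experiment 1, Experiment 2, and Experiment 3; genes with similar expression lie close together]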

Distance and Similarity
- the ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
- distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
- selection of a distance metric defines the concept of distance

Distance: a measure of similarity between genes.
[Figure: expression vectors of Gene A ($x_{1A}, \dots, x_{6A}$) and Gene B ($x_{1B}, \dots, x_{6B}$) across Exp 1 to Exp 6]
Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
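As a rough illustration, the three metrics named above can be written in plain Python. The function names are chosen for this example only, and the Pearson-based distance is taken here as r × -1, following the "Pearson(r*-1)" label used in these slides; other tools use 1 - r instead.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    """Pearson correlation coefficient (undefined for a completely flat profile)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    """Assumed convention: distance = r * -1, so perfectly correlated genes get -1."""
    return -pearson_r(a, b)

# Hypothetical expression vectors over six experiments
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

Note that `gene_a` and `gene_b` are far apart in Euclidean terms but as close as possible under the Pearson-based distance, which is why the choice of metric matters.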

Distance is Defined by a Metric
[Figure: the same expression profiles compared under two distance metrics, Euclidean and Pearson (r × -1), each giving a different distance D]

Algorithms…

Hierarchical Clustering (HCL) HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree. (HCL-1)

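Hierarchical Clustering (HCL-2)
Genes g1 to g8 start as eight separate clusters.
- g1 is most like g8. Leaf order: g1 g8 g2 g3 g4 g5 g6 g7
- g4 is most like {g1, g8}. Leaf order: g1 g8 g4 g2 g3 g5 g6 g7

Hierarchical Clustering (HCL-3)
- g5 is most like g7. Leaf order: g1 g8 g4 g2 g3 g5 g7 g6
- {g5, g7} is most like {g1, g4, g8}. Leaf order: g1 g8 g4 g5 g7 g2 g3 g6

Hierarchical Tree (HCL-4)
[Figure: completed dendrogram with leaf order g1 g8 g4 g5 g7 g2 g3 g6]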

Hierarchical Clustering During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods. (HCL-5)

Agglomerative Linkage Methods (HCL-6)
Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. Three linkage methods that are commonly used are:
- Single Linkage
- Average Linkage
- Complete Linkage

Single Linkage (HCL-7)
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

Average Linkage (HCL-8)
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

Complete Linkage (HCL-9)
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$

Comparison of Linkage Methods (HCL-10)
[Figure: trees built from the same data with single, average, and complete linkage]
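The three linkage rules, and the agglomerative loop that uses them, can be sketched as follows. This is a minimal pure-Python illustration rather than MeV's implementation: it assumes Euclidean distance between genes, the function names are invented for the example, and it searches all cluster pairs on every pass, which is only practical for small inputs.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def single_linkage(A, B, d=euclidean):
    """Minimum distance between a member of A and a member of B."""
    return min(d(u, v) for u in A for v in B)

def average_linkage(A, B, d=euclidean):
    """Average distance over all N_A * N_B member pairs."""
    return sum(d(u, v) for u in A for v in B) / (len(A) * len(B))

def complete_linkage(A, B, d=euclidean):
    """Maximum distance between a member of A and a member of B."""
    return max(d(u, v) for u in A for v in B)

def agglomerate(genes, linkage):
    """Join the two closest clusters until one tree remains.

    Returns the merges as (members_a, members_b, distance), in the order made.
    """
    clusters = [[i] for i in range(len(genes))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage([genes[i] for i in clusters[a]],
                               [genes[i] for i in clusters[b]])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical one-experiment "genes" so the distances are easy to check by hand
genes = [[0.0], [1.0], [5.0], [7.0]]
single_tree = agglomerate(genes, single_linkage)
complete_tree = agglomerate(genes, complete_linkage)
```

With these four points both trees make the same joins, but the final join happens at distance 4 under single linkage and at distance 7 under complete linkage, which is the difference the comparison slide illustrates.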

K-Means / K-Medians Clustering (KMC) – 1
1. Specify the number of clusters.
2. Randomly assign genes to clusters.
[Figure: genes G1 to G13 randomly assigned to clusters]

K-Means Clustering – 2
3. Calculate the mean / median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
[Figure: genes G1 to G13 regrouped into clusters]
K-Means / K-Medians is most useful when the user has an a priori hypothesis about the number of clusters the genes should group into.
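The five steps can be sketched as below. This is a hedged illustration of the K-Means variant only (means, not medians), using squared Euclidean distance. The function names, the `seed` parameter, and the handling of a cluster that becomes empty (it is re-seeded with a randomly chosen gene) are assumptions of this example.

```python
import random

def mean_profile(members):
    """Mean expression profile of a list of gene vectors (step 3)."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(genes, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: k is given; assign genes to clusters at random
    assignment = [rng.randrange(k) for _ in genes]
    for _ in range(max_iter):
        # Step 3: mean profile of each cluster
        centroids = []
        for c in range(k):
            members = [g for g, a in zip(genes, assignment) if a == c]
            if not members:
                members = [rng.choice(genes)]  # assumption: re-seed an empty cluster
            centroids.append(mean_profile(members))
        # Step 4: move each gene to the cluster with the closest mean profile
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(g, centroids[c])) for g in genes
        ]
        # Step 5: stop when no gene changes cluster (or max_iter is reached)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

# Hypothetical log-ratio profiles: three induced genes, three repressed genes
genes = [
    [2.0, 2.2, 1.9], [2.1, 1.8, 2.0], [1.9, 2.0, 2.1],
    [-2.0, -1.9, -2.2], [-2.1, -2.0, -1.8], [-1.8, -2.1, -2.0],
]
labels = kmeans(genes, k=2)
```

The cluster numbers themselves are arbitrary; only which genes share a label is meaningful.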

Cluster Affinity Search Technique (CAST)
- uses an iterative approach to segregate elements with ‘high affinity’ into a cluster
- the process iterates through two phases:
  - addition of high affinity elements to the cluster being created
  - removal or clean-up of low affinity elements from the cluster being created

Cluster Affinity Search Technique (CAST) – 1
Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set the initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.
[Figure: empty cluster C1 and the unassigned genes G1 to G15]

CAST – 2
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
8. Repeat steps 5-7 as long as changes occur to the cluster C1.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating the steps above.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.
[Figure: current cluster C1 and the remaining unassigned genes]
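A compact sketch of the add and remove phases is given below. Several choices here are assumptions, not something the slides specify: similarities are taken to lie in [0, 1] with 1 on the diagonal, so the maximum possible affinity to a cluster equals its size and the threshold becomes `t * len(cluster)`; a cluster is never emptied completely; and `max_rounds` caps the add/remove cycle in case it oscillates.

```python
def cast(sim, t, max_rounds=1000):
    """Cluster genes 0..n-1 from a symmetric similarity matrix `sim`."""
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        pool = sorted(unassigned)             # genes available to this cluster
        affinity = {g: 0.0 for g in pool}     # step 2: affinities start at zero
        cluster = []                          # step 1: new empty cluster

        def add(g):
            cluster.append(g)
            unassigned.discard(g)
            for h in pool:                    # step 4: affinity += similarity
                affinity[h] += sim[h][g]

        def remove(g):
            cluster.remove(g)
            unassigned.add(g)
            for h in pool:                    # step 6: affinity -= similarity
                affinity[h] -= sim[h][g]

        # Step 3: seed the cluster with the two most similar genes
        if len(pool) >= 2:
            pairs = [(a, b) for i, a in enumerate(pool) for b in pool[i + 1:]]
            a, b = max(pairs, key=lambda p: sim[p[0]][p[1]])
            add(a)
            add(b)
        else:
            add(pool[0])

        changed, rounds = True, 0
        while changed and rounds < max_rounds:   # step 8
            changed = False
            rounds += 1
            while unassigned:                    # step 5: ADD GENES
                best = max(sorted(unassigned), key=affinity.get)
                if affinity[best] <= t * len(cluster):
                    break
                add(best)
                changed = True
            while len(cluster) > 1:              # steps 6-7: REMOVE GENES
                worst = min(cluster, key=affinity.get)
                if affinity[worst] >= t * len(cluster):
                    break
                remove(worst)
                changed = True
        clusters.append(sorted(cluster))         # steps 9-10: start the next cluster
    return clusters

# Hypothetical similarities: genes 0-2 resemble each other, as do genes 3-4
hi, lo = 0.9, 0.1
sim = [
    [1.0, hi, hi, lo, lo],
    [hi, 1.0, hi, lo, lo],
    [hi, hi, 1.0, lo, lo],
    [lo, lo, lo, 1.0, hi],
    [lo, lo, lo, hi, 1.0],
]
result = cast(sim, t=0.5)
```

Unlike K-Means, the number of clusters is not chosen in advance; it falls out of the threshold `t`.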

QT-Clust (from Heyer et al. 1999) (HJC) – 1
1. Compute a jackknifed distance between all pairs of genes. (Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)
2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster diameter the least. Continue adding genes until adding another gene would exceed the specified cluster diameter limit.
3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a new cluster. All clusters are provisional at this point.
[Figure: a “seed” gene with its current cluster, surrounded by the currently unassigned genes]

QT-Clust – 2
4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the largest clusters at random.
5. All genes that are not in the cluster selected above are treated as currently unassigned. Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has fewer genes than a user-specified number. All genes that are not in a cluster at this point are treated as unassigned.
[Figure: provisional clusters grown from “seed” genes G1, G2, and G3; the largest one is picked]
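The QT-Clust steps can be sketched as follows. This is an illustration under stated assumptions: the base distance is Euclidean (the slides do not fix one), cluster diameter is taken as the largest pairwise jackknifed distance among members, ties for the largest cluster go to the first candidate instead of a random one so the result is reproducible, and the function and parameter names are invented for the example.

```python
import math

def jackknife_distance(a, b):
    """Largest distance obtained when each experiment is left out in turn (step 1)."""
    worst = 0.0
    for skip in range(len(a)):
        d = math.sqrt(sum((x - y) ** 2
                          for i, (x, y) in enumerate(zip(a, b)) if i != skip))
        worst = max(worst, d)
    return worst

def grow(seed, pool, dist, max_diameter):
    """Step 2: grow a provisional cluster around one seed gene."""
    cluster, diameter = [seed], 0.0
    rest = [g for g in pool if g != seed]
    while rest:
        # Diameter the cluster would have if gene g were added
        new_diam = {g: max(diameter, max(dist[g][m] for m in cluster)) for g in rest}
        g = min(rest, key=new_diam.get)
        if new_diam[g] > max_diameter:
            break
        cluster.append(g)
        rest.remove(g)
        diameter = new_diam[g]
    return cluster

def qt_clust(genes, max_diameter, min_size):
    n = len(genes)
    dist = [[jackknife_distance(genes[i], genes[j]) for j in range(n)]
            for i in range(n)]
    unassigned = list(range(n))
    clusters = []
    while unassigned:
        # Step 3: every unassigned gene gets a turn as the seed
        candidates = [grow(s, unassigned, dist, max_diameter) for s in unassigned]
        largest = max(candidates, key=len)         # step 4
        if len(largest) < min_size:                # step 6
            break
        clusters.append(sorted(largest))
        unassigned = [g for g in unassigned if g not in largest]   # step 5
    return clusters, unassigned

# Hypothetical profiles: two tight groups plus one outlier gene
genes = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0],
    [10.0, -10.0, 3.0],
]
clusters, leftover = qt_clust(genes, max_diameter=1.0, min_size=2)
```

The outlier gene ends up in `leftover`, which is the behavior step 6 describes: genes that fit no sufficiently large cluster stay unassigned.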

Self Organizing Tree Algorithm (SOTA - 1)
Dopazo, J., J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44.
Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2).

SOTA Characteristics (SOTA - 2)
- Divisive clustering, allowing high-level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
- Data set is reduced to clusters arranged in a binary tree topology
- The number of resulting clusters is not fixed before clustering
- Neural network approach which has advantages similar to SOMs, such as handling large data sets that have large amounts of ‘noise’

SOTA Topology (SOTA - 3)
[Figure: a parent node with two offspring cells, the winning cell and the sister cell; each cell holds a centroid vector and its member genes]
Each cell has a migration factor, ordered sister < parent < winning.

Adaptation Overview (SOTA - 4)
- each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
- the most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.
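One adaptation step might look like the sketch below. It assumes the usual update of this family of networks, in which a centroid moves a fraction (its migration factor) of the way toward the presented gene, and it always updates the parent and sister along with the winner, which simplifies the published algorithm. The function names and the migration values used in the example are illustrative assumptions; only their ordering (sister < parent < winner) comes from the slides.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adapt(centroid, gene, migration):
    """Move a centroid toward the gene by the given migration factor."""
    return [c + migration * (g - c) for c, g in zip(centroid, gene)]

def present_gene(gene, parent, offspring, eta_w, eta_p, eta_s):
    """Present one gene to a parent node and its two offspring cells.

    Returns (new_parent, new_offspring, winner_index).
    """
    winner = min((0, 1), key=lambda k: sq_dist(gene, offspring[k]))
    sister = 1 - winner
    new_offspring = list(offspring)
    new_offspring[winner] = adapt(offspring[winner], gene, eta_w)  # largest move
    new_offspring[sister] = adapt(offspring[sister], gene, eta_s)  # smallest move
    return adapt(parent, gene, eta_p), new_offspring, winner

# Hypothetical centroids; the factors are exaggerated so the movement is visible
parent = [0.0, 0.0]
offspring = [[0.5, 0.5], [-1.0, -1.0]]
new_parent, new_offspring, winner = present_gene(
    [1.0, 1.0], parent, offspring, eta_w=0.5, eta_p=0.25, eta_s=0.125
)
```

In practice the migration factors are much smaller than in this example, so that each gene nudges the centroids only slightly during a training epoch.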

- following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
- if the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
SOTA - 5

The most ‘diverse’ cell is selected for division at the start of the next training cycle. SOTA - 6

Growth Termination Expansion stops when the most diverse cell’s diversity falls below a threshold. SOTA - 7

Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle. SOTA - 8

Self-organizing maps (SOMs) – 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.
[Figure: genes G1 to G29 scattered in expression space with nodes N1 to N6 arranged on a grid; N = nodes, G = genes]

SOMs – 2
2. Choose a random gene, e.g., G9.
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.
[Figure: nodes N1 to N6 shifting toward gene G9]

SOM Neighborhood Options
- Bubble neighborhood: only some nodes move (those within the radius); alpha is constant.
- Gaussian neighborhood: all nodes move; alpha is scaled.
[Figure: nodes N1 to N6 around genes G7 to G11 under each neighborhood option]

SOMs – 3
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: nodes N1 to N6 settled among clusters of genes G1 to G29]
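The SOM steps, including the bubble and Gaussian neighborhoods, can be sketched as below. This is a simplified illustration: it uses a rectangular grid only, initializes nodes at randomly chosen genes, decreases alpha linearly to zero, and keeps the radius fixed. All of those choices, and the function and parameter names, are assumptions of this example.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def som_step(nodes, grid, gene, alpha, radius, neighborhood="gaussian"):
    """Steps 2-3: move nodes toward one gene; returns the index of the closest node."""
    winner = min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], gene))
    for i, node in enumerate(nodes):
        d2 = sq_dist(grid[i], grid[winner])      # squared distance on the 2-D grid
        if neighborhood == "bubble":
            h = 1.0 if d2 <= radius ** 2 else 0.0        # some move, alpha constant
        else:
            h = math.exp(-d2 / (2.0 * radius ** 2))      # all move, alpha scaled
        nodes[i] = [w + alpha * h * (g - w) for w, g in zip(node, gene)]
    return winner

def train_som(genes, rows, cols, iterations=2000, alpha0=0.5, radius=1.0,
              neighborhood="gaussian", seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]   # step 1: geometry
    nodes = [list(rng.choice(genes)) for _ in grid]             # assumed initialization
    for t in range(iterations):                                  # step 4
        alpha = alpha0 * (1.0 - t / iterations)                  # movement shrinks over time
        som_step(nodes, grid, rng.choice(genes), alpha, radius, neighborhood)
    # Step 5: each gene belongs to its nearest node
    assignment = [min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], g))
                  for g in genes]
    return nodes, assignment

# Hypothetical two-experiment profiles
genes = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9], [-3.0, 3.0], [-3.1, 2.8]]
nodes, assignment = train_som(genes, rows=1, cols=3)
```

Calling `som_step` directly on a small set of nodes is a convenient way to see how the two neighborhood options differ for a single gene.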

Gene Shaving
- Compute the first principal component of the expression matrix
- Shave off a fixed percentage (default 10%) of genes with the lowest values of dot product with the 1st principal component
- Orthogonalize the expression matrix with respect to the average gene in the cluster and repeat the shaving procedure
- Repeat until only one gene remains
- Results in a series of nested clusters
- Choose a cluster of appropriate size as determined by the gap statistic calculation

Gene Shaving: gap statistic calculation (choosing cluster size)
Quality measure for clusters: $R^2 = \frac{\text{between variance}}{\text{within variance}}$
- between variance: variance of the mean gene across experiments
- within variance: variance of each gene about the cluster average
- Large $R^2$ implies a tight cluster of coherent genes
- Create random permutations of the expression matrix and calculate $R^2$ for each
- Compare $R^2$ of each cluster to that of the entire expression matrix
- Choose the cluster whose $R^2$ is furthest from the average $R^2$ of the permuted expression matrices.
The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.
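The cluster quality measure can be illustrated with a short function. This sketch reads the slide's fraction as between variance divided by within variance, which is an interpretation of the flattened slide text, and uses population variances. The function name and the example clusters are made up for illustration.

```python
def cluster_r2(cluster):
    """Ratio of between variance to within variance for a list of gene vectors."""
    n_genes, n_exp = len(cluster), len(cluster[0])
    # The "mean gene": average expression of the cluster in each experiment
    mean_gene = [sum(col) / n_genes for col in zip(*cluster)]
    grand_mean = sum(mean_gene) / n_exp
    # Between variance: variance of the mean gene across experiments
    between = sum((m - grand_mean) ** 2 for m in mean_gene) / n_exp
    # Within variance: variance of each gene about the cluster average
    within = sum((x - m) ** 2
                 for gene in cluster
                 for x, m in zip(gene, mean_gene)) / (n_genes * n_exp)
    if within == 0:
        return float("inf")   # all genes identical to the mean gene
    return between / within

# Two coherent genes versus two genes with opposite trends
tight = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5]]
loose = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
r2_tight = cluster_r2(tight)
r2_loose = cluster_r2(loose)
```

The coherent pair scores far higher than the opposing pair, matching the statement that a large value indicates a tight cluster of coherent genes.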

Relevance Networks
- Set of genes whose expression profiles are predictive of one another.
- Can be used to identify negative correlations between genes.
- Genes with low entropy (least variable across experiments) are excluded from analysis.
$H = -\sum_{x=1}^{10} p(x) \log_2 p(x)$
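The entropy filter can be sketched as follows, assuming that p(x) is the fraction of a gene's expression values falling into the x-th of 10 equal-width bins. The binning range (`lo`, `hi`) and the function name are assumptions of this example; the slides do not say how the bins are defined.

```python
import math

def entropy(values, bins=10, lo=None, hi=None):
    """Shannon entropy (base 2) of a gene's values over equal-width bins."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    counts = [0] * bins
    for v in values:
        if hi == lo:
            k = 0   # a completely flat gene occupies a single bin
        else:
            k = int((v - lo) / (hi - lo) * bins)
            k = max(0, min(k, bins - 1))   # clamp values on or beyond the edges
        counts[k] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# A flat gene carries no information; an evenly spread gene carries the most
flat_gene = [5.0] * 10
spread_gene = [x + 0.5 for x in range(10)]
h_flat = entropy(flat_gene, lo=0.0, hi=10.0)
h_spread = entropy(spread_gene, lo=0.0, hi=10.0)
```

Genes whose entropy falls below a chosen cutoff would then be dropped before correlations are computed.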

Relevance Networks
- The expression pattern of each gene is compared to that of every other gene.
- The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient.
- Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds are eliminated (in the example, T min = 0.50 and T max = 0.90).
- The remaining relationships between genes define the subnets.
[Figure: genes A to E before and after thresholding]
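The thresholding and subnet steps can be sketched as below. This illustration assumes the thresholds are applied to the absolute value of the Pearson correlation, so that strong negative correlations are kept as well. That choice, the function names, and the example profiles are assumptions made for the example.

```python
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def relevance_network(genes, t_min=0.50, t_max=0.90):
    """Keep gene pairs whose |r| lies within [t_min, t_max]; returns {(a, b): r}."""
    names = sorted(genes)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(genes[a], genes[b])
            if t_min <= abs(r) <= t_max:
                edges[(a, b)] = r
    return edges

def subnets(edges):
    """Connected components of the remaining relationships."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            g = stack.pop()
            if g in component:
                continue
            component.add(g)
            stack.extend(neighbors[g] - component)
        seen |= component
        result.append(sorted(component))
    return result

# Hypothetical profiles over four experiments
genes = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 3.0, 2.0, 4.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [2.0, 1.0, 2.0, 1.0],
}
edges = relevance_network(genes)
nets = subnets(edges)
```

In this example the A-C pair is dropped because its correlation of -1 lies beyond T max, while the negative B-C and B-D relationships survive and tie all four genes into one subnet.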