

Presentation on theme: "Overview of Gene Clustering and Algorithmic Methodologies", by Beth Benas, Rizwan Habib, Alexander Lowitt and Piyush Malve. Presentation transcript:

1 Overview of Gene Clustering and Algorithmic Methodologies
Beth Benas, Rizwan Habib, Alexander Lowitt, Piyush Malve

2 Contents
1 Overview of Gene Clustering: Biological Definitions
2 Algorithmic Methodologies
3 Software – Cluster 3 Program
4 Case Study / Article Overview

3 What Is Gene Clustering?
– A gene cluster is a group of two or more genes that code for the same or similar products
– The original gene is duplicated by one of two processes:
  1) Homologous recombination
  2) Transposition events

4 Homologous Recombination
– Genetic recombination in which nucleotide sequences are exchanged between similar or identical strands of DNA
– Involves breaking and rejoining strands of DNA
– Occurs during meiosis, where it provides more genetic variability



7 Retrotransposon
– Transposons: mobile DNA
  – Sequences of DNA that are capable of moving to alternative positions along the genome of a single cell
  – “Jumping genes”
– Retrotransposon: a type of transposon able to become amplified within a genome
  – Relatively stable and tend to withstand natural selection
  – Thus, prevalent across generations

8 Mutations in Duplicated Gene
– The second copy generated is free from selective pressure
– The second copy can therefore mutate more quickly
– The changes are not necessarily lasting

9 What Does All This Mean?
Gene clustering is a useful technique for grouping similar genetic code together:
– Relational understanding between homologous objects
– Trends / patterns of genetic expression
– Functional relatedness
– Phenotypic relatedness

10 What Is Gene Clustering?
Presume:
– The genome is a 2D Cartesian space, like a sheet of graph paper
– Genes are points on this graph paper
– Let us see how many lines and hyperbolas there are
Gene clustering is the process of assigning two or more genes that encode the same or similar products to a “gene cluster”.
Because populations from a common ancestor tend to possess the same varieties of gene clusters, the clusters are useful for tracing recent evolutionary history.
An example of a gene cluster is the human β-globin gene cluster, which contains five functional genes and one non-functional gene for similar proteins.
– Every hemoglobin molecule contains two identical proteins from this gene cluster; which two depends on the molecule's specific role.

12 Hierarchical Clustering
Allows the organization of the clustering data to be represented as a tree (dendrogram).
– Agglomerative (bottom up): each observation starts as its own cluster, and clusters are merged based on similarity.
– Divisive (top down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, merges and splits in the tree are determined in a greedy manner.

13 Hierarchical Clustering
[Figure: divisive vs. agglomerative clustering]

14 Hierarchical Clustering
A measure of dissimilarity between sets of observations is required to decide which clusters to combine or divide. It is obtained from two ingredients:
– an appropriate metric (a measure of distance between pairs of observations), and
– a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of the observations in the sets.

15 Hierarchical Clustering
– The choice of metric influences the shape of the clusters, because some elements may be close to one another under one distance and farther apart under another.
– The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

16 Advantage
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
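
The point that only a distance matrix is needed can be illustrated with a minimal sketch of agglomerative clustering. This is an illustration under stated assumptions, not the implementation used by any particular package: the function name `agglomerative`, the four-gene matrix `D` and the choice of single linkage (`min`) as the default are all made up for the example.

```python
from itertools import combinations

def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up using only a pairwise distance matrix.

    dist    -- symmetric matrix (list of lists) of pairwise distances
    linkage -- min gives single linkage, max gives complete linkage
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage(dist[i][j] for i in a for j in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

# Four hypothetical genes: 0 and 1 are close, 2 and 3 are close.
D = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.5],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.5, 2.0, 0.0],
]
history = agglomerative(D)                  # single linkage
history_complete = agglomerative(D, max)    # complete linkage
```

Reading `history` from first to last entry gives the dendrogram from the leaves upward; the recorded distance of each merge is the height of the corresponding branch. Switching the linkage changes only the height of the final merge in this small example.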

17 K-Means Clustering
– k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
– It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.

18 K-Means Clustering
Regarding computational complexity, the k-means clustering problem is:
– NP-hard in a general Euclidean space of dimension d, even for 2 clusters
– NP-hard for a general number of clusters k, even in the plane
– If k and d are fixed, exactly solvable in time O(n^(dk+1) log n), where n is the number of entities to be clustered
Thus, a variety of heuristic algorithms are generally used.

19 K-Means Clustering
– Heuristic algorithm: there is no guarantee that it will converge to the global optimum
– The algorithm is usually very fast, so it is common to run it multiple times with different starting conditions
– It has been shown that there exist certain point sets on which k-means takes superpolynomial time, 2^Ω(√n), to converge
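
A small sketch can make the "no guarantee of a global optimum" point concrete. It assumes the usual Lloyd iteration (assign each point to the nearest mean, then recompute the means) as the heuristic; the helper names, the eight points and the two starting conditions are invented for the illustration, and the starting centers are passed in explicitly so that the behaviour is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(group) for axis in zip(*group))

def lloyd(points, centers, max_iter=100):
    """One run of the heuristic from the given starting centers.

    Returns (final centers, within-cluster sum of squared distances).
    """
    k = len(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [mean(g) if g else centers[c] for c, g in enumerate(groups)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

# Two well-separated squares of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Two hypothetical starting conditions: the first gets stuck in a poor
# local optimum, the second reaches the natural two-cluster solution.
starts = [[(0, 1), (1, 0)], [(0, 0), (11, 11)]]
runs = [lloyd(points, start) for start in starts]
best_centers, best_cost = min(runs, key=lambda run: run[1])
```

Keeping the run with the lowest cost is the reason for restarting from several starting conditions; in practice the starts would typically be drawn at random, for example with `random.sample(points, k)`.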

20 K-Means Clustering
Two key features account for the efficiency of k-means, and they are often regarded as its biggest drawbacks:
– The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.
– Euclidean distance is used as the metric, and variance is used as the measure of cluster scatter.

21 Applications of K-Means
Image segmentation
– The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation.
– The results of the segmentation are used to aid border detection and object recognition.
– Standard Euclidean distance is usually insufficient for forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.

22 SOM
A self-organizing map (SOM) is a learning method that produces low-dimensional data (e.g. 2D) from high-dimensional data (nD).
E.g. an apple differs from a banana in more than two ways, but the two can be differentiated based on their size and color alone.
If we represent apples and bananas with points and similarity with lines, then:
– Two points connected by a shorter line are of the same kind
– Two points connected by a longer line are of different kinds
– Shorter line = line with length less than threshold t
– Longer line = line with length greater than threshold t
We have just created a map that differentiates an apple from a banana based on two traits only.
We have successfully “trained” the SOM; now anyone can use it to “map” apples and bananas, i.e. to tell one from the other.
DEMO for SOM Training
DEMO for SOM Mapping
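
The training and mapping steps can be sketched with a very small one-dimensional SOM (a chain of nodes). This is a simplified illustration, not the procedure from the demos: the fruit measurements, the (size, color) encoding, the number of nodes, the linear learning-rate decay and the fixed neighbourhood width `sigma` are all assumptions chosen for the example.

```python
import math

def bmu(weights, x):
    """Index of the best-matching unit: the node whose weights are nearest to x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def train_som(data, n_nodes=4, epochs=20, lr0=0.5, sigma=0.5):
    """Train a 1-D SOM on a list of feature tuples and return the node weights."""
    first, last = data[0], data[-1]
    # Assumption: initialise the chain on the straight line between the
    # first and the last sample, so that training is deterministic.
    weights = [[a + (b - a) * j / (n_nodes - 1) for a, b in zip(first, last)]
               for j in range(n_nodes)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)  # learning rate decays linearly
        for x in data:
            winner = bmu(weights, x)
            for j, w in enumerate(weights):
                # Nodes close to the winner on the chain are pulled along with it.
                h = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
                weights[j] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

# Hypothetical (size in cm, redness from 0 to 1) measurements:
# three apples followed by three bananas.
fruit = [(7.0, 0.9), (8.0, 0.8), (7.5, 0.85),
         (18.0, 0.2), (20.0, 0.1), (19.0, 0.15)]

som = train_som(fruit)                     # "training"
apple_node = bmu(som, (7.2, 0.88))         # "mapping" a new apple
banana_node = bmu(som, (19.5, 0.12))       # "mapping" a new banana
```

Mapping a new fruit means looking up its best-matching node; fruits of the same kind are expected to land on the same or neighbouring nodes. In a real analysis the features would usually be rescaled first, since here the size dominates the distance.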

23 Application of SOM
Genome clustering
– Goal: trying to understand the phylogenetic relationship between different genomes.
– Compute: bootstrap support of individual genomes for different phylogenetic tree topologies, then cluster based on the topology support.
Clustering proteins based on the architecture of their activation loops
– Align the proteins under investigation
– Extract the functional centers
– Turn the 3D representation into 1D feature vectors
– Cluster based on the feature vectors

24 PCA
– Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.
– It is a dimension reduction technique (related to, but not the same as, independent component analysis).
– SOM and PCA are related (SOM can be viewed as a non-linear PCA).
– PCA decomposes complex data relationships into simple components and then represents all the data in terms of these simple components.
– SOM is more efficient than PCA, but PCA is more versatile.
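
As a rough sketch of what "transforming correlated variables into principal components" involves, the example below centers the data, builds the covariance matrix and finds the first principal component. It uses power iteration instead of a full eigendecomposition to stay within the standard library; the function name and the three perfectly correlated variables X1, X2, X3 are assumptions made for the illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance along it, projected scores) for the leading component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration. Assumption: the start vector is not orthogonal
    # to the leading eigenvector of the covariance matrix.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores

# Hypothetical observations of (X1, X2, X3) with X2 = 2 * X1 and X3 = -X1.
rows = [(x, 2.0 * x, -x) for x in (1.0, 2.0, 3.0, 4.0, 5.0)]
direction, variance, scores = first_principal_component(rows)
total_variance = 2.5 + 10.0 + 2.5   # sum of the three individual variances
```

Because the three variables move together perfectly in this toy data set, a single component along (1, 2, -1) carries all of the variance, so the three-dimensional cloud can be replaced by the one-dimensional `scores` without loss.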

25 PCA Example
Suppose three entities X1, X2 and X3 act together to define a process, i.e. their graph will have three dimensions.

26 A 3D Plot

27 Apply PCA
From the 3D plot it is hard to guess the relationships:
– X1 vs. X2
– X2 vs. X3
– X3 vs. X1
PCA can transform this 3D graph into four 2D graphs that reveal the individual relationships among the three Xi.

28 One 3D Plot = Four 2D Plots


31 Cluster 3.0
– Implements the most commonly used clustering methods for gene expression data analysis
– Provides a computational and graphical environment for analyzing data from DNA microarray experiments or other genomic datasets
– Data_set.txt => Cluster 3.0 => cluster_output.txt
– cluster_output.txt => TreeView => Visualization
– Cluster 3.0 and TreeView are both open source, and sample data is provided to experiment with
(Links on the slide: Cluster 3.0, TreeView, sample data)

32 Loading File
– Rows are genes
– Columns are samples (BLUE)
– YORF (yeast open reading frame) is used by TreeView to specify how rows are linked to external websites
– The table must be stored as a tab-delimited file for Cluster to use it

33 Filter Data
The Filtering tab allows you to remove genes that do not have certain desired properties from your dataset:
– % Present >= X. This removes all genes that have missing values in greater than (100 - X) percent of the columns.
– SD (Gene Vector) >= X. This removes all genes that have standard deviations of observed values less than X.
– At least X Observations with abs(Val) >= Y. This removes all genes that do not have at least X observations with absolute values greater than Y.
– MaxVal - MinVal >= X. This removes all genes whose maximum minus minimum values are less than X.
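
The four filters can be sketched as a single predicate over one gene's row of values. This is an illustration of the rules as stated above, not Cluster's own code: the function names, the thresholds, the gene names, the use of `None` for missing values, the use of the sample standard deviation and the `>=` reading of the third filter (taken from its label) are all assumptions.

```python
import statistics

def percent_present(values):
    """Percentage of columns in which the gene has a (non-missing) value."""
    return 100.0 * sum(v is not None for v in values) / len(values)

def passes_filters(values, min_present=80.0, min_sd=2.0,
                   min_obs=1, min_abs=2.0, min_range=2.0):
    """True if one gene's row survives all four filters."""
    present = [v for v in values if v is not None]
    if percent_present(values) < min_present:              # % Present >= X
        return False
    if len(present) < 2 or statistics.stdev(present) < min_sd:   # SD >= X
        return False
    if sum(abs(v) >= min_abs for v in present) < min_obs:  # X observations with abs(Val) >= Y
        return False
    if max(present) - min(present) < min_range:            # MaxVal - MinVal >= X
        return False
    return True

# Hypothetical log-ratio rows; None marks a missing measurement.
genes = {
    "GENE_A": [2.5, -3.0, 0.1, 4.0, -2.2],   # complete and variable
    "GENE_B": [0.1, 0.2, None, None, 0.1],   # only 60% present
    "GENE_C": [0.1, -0.1, 0.2, 0.0, 0.1],    # complete but flat
}
kept = [name for name, values in genes.items() if passes_filters(values)]
```

With these thresholds only the complete, strongly varying gene is kept; the gene with too many missing values and the flat gene are both removed.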

34 Adjusting Data
Cluster lets you perform a number of operations that alter the underlying data in the imported file:
– Log Transform Data: replace all data values x by log2(x). Why?
– Center genes [mean or median]: subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
– Center arrays [mean or median]: subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.
– Normalize genes: multiply all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).
– Normalize arrays: multiply all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column).
These operations are not commutative, so the order in which they are applied is very important. Log transforming centered genes is not the same as centering log transformed genes.
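
A short sketch shows why the order of the adjustments matters. It applies the row-wise ("genes") versions of the operations to made-up numbers; the function names and data are assumptions for the illustration, and the column-wise ("arrays") versions would be the same functions applied to columns.

```python
import math

def log_transform(row):
    """Replace every value x by log2(x); all values must be positive."""
    return [math.log2(x) for x in row]

def center(row):
    """Subtract the row mean so that the row averages to 0."""
    m = sum(row) / len(row)
    return [x - m for x in row]

def normalize(row):
    """Scale the row so that the sum of the squares of its values is 1.0."""
    s = math.sqrt(sum(x * x for x in row))
    return [x / s for x in row]

# Hypothetical expression ratios for one gene.
ratios = [1.0, 2.0, 4.0]
log_then_center = center(log_transform(ratios))
centered_first = center(ratios)   # contains negative values, so log2 is undefined

# Centering and normalizing do not commute either.
row = [1.0, 2.0, 3.0, 6.0]
center_then_norm = normalize(center(row))   # mean 0 and sum of squares 1
norm_then_center = center(normalize(row))   # mean 0, but sum of squares is not 1
```

Centering the log-transformed ratios gives a symmetric profile, while centering first produces negative values that cannot be log transformed at all. Likewise, only the center-then-normalize order leaves the row with both a zero mean and a unit sum of squares.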

35 Log Transformation
Experiment: analyzing gene expression data from a DNA microarray as fluorescent ratios.
– We are looking at gene expression over time
– Results are expression levels relative to time 0
– Time 0: base time
– Time 1: gene is unchanged
– Time 2: gene is up-regulated 2-fold
– Time 3: gene is down-regulated 2-fold
“Is 2-fold up the same magnitude of change as 2-fold down, just in the opposite direction?”
– If yes, then log transform the sample data
– If no, then use the data as it is
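
The time course can be written out numerically, assuming the ratios relative to time 0 are 1.0 (unchanged), 2.0 (2-fold up) and 0.5 (2-fold down); the dictionary names are made up for the example.

```python
import math

# Expression ratios relative to time 0 (assumed values for the four time points).
ratios = {"time 0": 1.0, "time 1": 1.0, "time 2": 2.0, "time 3": 0.5}

# Change measured on the raw ratio scale and on the log2 scale.
raw_change = {t: r - 1.0 for t, r in ratios.items()}
log_change = {t: math.log2(r) for t, r in ratios.items()}
```

On the raw scale the 2-fold increase is a change of +1.0 while the 2-fold decrease is only -0.5; after the log2 transform they become +1.0 and -1.0, equal in magnitude and opposite in direction.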

36 Mean/Median Centering
Experiment: analyzing a large number of tumor samples, all compared to a common reference sample made from a collection of cell lines.
– For each gene, you have a series of ratio values that are relative to the expression level of that gene in the reference sample.
– Since the reference sample really has nothing to do with your experiment, you want your analysis to be independent of the amount of a gene present in the reference sample.
“Should the analysis be independent of the amount of a gene present in the reference sample, i.e. is the reference sample unrelated to the experimental samples?”
– If yes, then use centering
– If no, then work with the raw data
Median centering is preferred over mean centering.

37 Distance/Similarity Measure
“Is the graph on the left the same as the graph on the right?”
– Pearson correlation says they are similar: it treats x, 2x and 2x + y as the same profile. Alternatives in Cluster 3.0: Spearman rank correlation or Kendall's τ.
– Euclidean distance says they are not similar, i.e. x != 2x.
Pearson correlation measures only similarity in shape, while Euclidean distance also measures the magnitude of the difference.
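
The contrast between the two measures can be checked on a toy profile. The sketch below assumes that y on the slide stands for a constant offset; the profile values and function names are made up for the illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]              # hypothetical expression profile
doubled = [2 * v for v in x]          # same shape, twice the magnitude
shifted = [2 * v + 3 for v in x]      # same shape, scaled and offset

r_doubled = pearson(x, doubled)
r_shifted = pearson(x, shifted)
d_doubled = euclidean(x, doubled)
```

Both correlations come out as 1 (up to rounding), so under Pearson correlation the three profiles are indistinguishable, whereas the Euclidean distance between x and 2x is clearly non-zero.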

39 Article
“Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines”
– 2000: Nature America, Inc.; Princeton University
– Primary authors: Douglas Ross, Uwe Scherf, Michael Eisen

40 Background
– Cell lines from human tumors: used for many years as experimental models of neoplasia or neoplastic disease
– 60 cancer cell lines: National Cancer Institute’s Developmental Therapeutics Program (DTP)
– DNA microarrays were used to show variation in the prevalence of transcripts
– Comparing RNA from:
  – Two breast cancer biopsy samples
  – A sample of normal breast tissue
  – NCI60 cell lines derived from breast cancers (excluding MDA-MB-435 and MDA-N)
  – Leukaemias
– A pattern is shared between the cancer specimens and individual cell lines derived from breast cancers and leukaemias

41 Background
– cDNA microarrays were used to explore variation in 8,000 different genes across 60 cell lines
– The cell lines come from the National Cancer Institute screen for anti-cancer drugs
– Purpose:
  – Show phenotypic variation: cell reproduction rate, drug metabolism
  – Location of tumors
  – Verify how gene expression patterns in cell lines compare with those of normal breast tissue or tumor samples within breast tissue
– Clustering was used to look at outliers that would validate or dismiss previous classification efforts

42 Clustering in Action
Process:
– Develop rows of genes and columns of microarray hybridizations
– Normalized fluorescence ratios from the database
– Subtraction of local background
– Established specific criteria to group a subset of the 9,703 cDNA elements from the arrays
– Centered data by subtracting the arithmetic mean of all ratios measured
– log2(ratio) > 2.8
– Centering makes all future analysis independent of the amount of mRNA in the reference pool

43 Clustering in Action
– Display representing microarray hybridizations and genes
– Normalized the data and converted the quantitative data to a color gradient
– Each color represents the mean-adjusted expression level of the gene in the cell line




47 Hierarchical Clustering Algorithm
– Pearson correlation coefficient: compares similarities and ignores differences in the amount of variation of genes across cell lines
– Similar expression is characterized by short branches; longer branches denote dissimilarity

48 Clustering in Action
– Dendrogram: gene expression patterns within cell lines, by tissue of origin
– Cell lines derived from leukaemia, melanoma, central nervous system, colon, renal and ovarian tissue

49 Conclusions
– The cDNAs provided 8,000 genes: only 3,700 represented previously classified human proteins, 1,900 had homologues in other organisms and 2,400 were identified via ESTs
– Estimated that 80% of the genes were correctly identified
– Able to analyze intact tumors within their specific microenvironment
– Dendrograms offer the possibility of an improved taxonomy of cancer
– Helpful for explaining the heterogeneity of breast cancer
– Possibility of individual treatment regimens (personalized medicine)

50 Thank You!
