Project Phase I
• Due on 9/22; send it to me by email
• 2-10 pages
• Free-style writing (use 11pt font or larger)
• Project description
  - Overview
  - Problem definition
  - Why it is important
  - Some review of existing work
  - Objectives to achieve

Gene Expression Data Analyses
Dong Xu
Computer Science Department
109 Engineering Building West

Lecture Outline
• Gene expression
• Similarity between gene expression profiles
• Concept of clustering
• K-means clustering
• Hierarchical clustering
• Minimum spanning tree-based clustering

Gene expression profiles
[Figure: expression (levels relative to a reference point at 0) plotted against time/condition]

Goal of Microarray Experiments
[Diagram: microarray data, biological pathway]
Gene expression → regulation/function in pathway / cellular state / phenotype → disease diagnosis / disease gene identification

What Microarray Can Tell Us
• Differentially expressed genes
  - Under different conditions
  - Different genotypes (mutant vs. wild type)
• Co-expression and gene function inference
• Regulatory network inference

Regulatory Networks
• Which gene controls what?
• Current methods for network reconstruction
  - Boolean networks
    * qualitative representation (on/off relationship)
    * computationally more manageable
  - Differential equations
    * give "detailed" dynamic properties of networks
    * mathematically/computationally more problematic
  - Bayesian networks
    * define regulatory relationships
    * widely used
• E-Cell Project (network modeling): http://

Similarity between Profiles
Similarity measure:
• Euclidean distance
• Correlation coefficient
• Trend
• …
Correlation coefficient often works better.
[Figure: expression profile, expression versus time]

Pearson Correlation Coefficient
• Compares scaled profiles!
• Can detect inverse relationships
• Most commonly used
Symbols:
n = number of conditions
x̄ = average expression of gene x in all n conditions
ȳ = average expression of gene y in all n conditions
s_x = standard deviation of x
s_y = standard deviation of y
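The symbols on this slide fit the usual definition of the Pearson coefficient, the average product of the two scaled profiles. The sketch below assumes that definition and uses population standard deviations (dividing by n). The function name `pearson` and the three example profiles are illustrative and do not come from the lecture.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n): an assumption of this sketch.
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Average product of the scaled (z-scored) profiles.
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / n

# Illustrative profiles (made up for this example).
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_x, different scale
gene_z = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_x

r_xy = pearson(gene_x, gene_y)  # close to +1: scale does not matter
r_xz = pearson(gene_x, gene_z)  # close to -1: inverse relationship detected
```

Because both profiles are scaled before they are compared, `gene_x` and `gene_y` correlate perfectly even though their magnitudes differ by a factor of ten.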

Correlation Pitfalls
[Figure: example with correlation = 0.97]

Correlation coefficient
[Figure: Gene X and Gene Y profiles with their correlation S(X,Y)]

Euclidean Distance
• Scaled versus unscaled
• Cannot detect inverse relationships
For Gene X = (x_1, x_2, …, x_n) and Gene Y = (y_1, y_2, …, y_n), the distance is $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
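A minimal sketch of the "scaled versus unscaled" point. It assumes that "scaled" means each profile is z-scored (mean 0, standard deviation 1) before the distance is taken. The helper names and the profiles are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scale(profile):
    """z-score a profile (assumed meaning of 'scaled' in this sketch)."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    return [(v - mean) / sd for v in profile]

# Illustrative profiles (made up for this example).
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times larger
gene_z = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_x

d_unscaled = euclidean(gene_x, gene_y)               # large: magnitudes differ
d_scaled = euclidean(scale(gene_x), scale(gene_y))   # about 0: same shape
d_inverse = euclidean(scale(gene_x), scale(gene_z))  # large: inverse looks "far"
```

The inversely related pair ends up far apart even after scaling, which is why Euclidean distance cannot flag inverse relationships the way a correlation of -1 does.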

Data-Mining through Clustering
[Figure: clusters labelled Degradation, Synthesis, Chromatin, Glycolysis]
Assumptions for clustering analysis:
• The expression level of a gene reflects the gene's activity.
• Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.

Idea of Clustering
Clustering: group objects into clusters so that
• objects in each cluster have "similar" features;
• objects of different clusters have "dissimilar" features

Methods of Clustering
• discriminant analysis (Fisher, 1931)
• K-means (Lloyd, 1948)
• support vector machines (Vapnik, 1985)
• self-organizing maps (Kohonen, 1980)
• single linkage (dendrogram)
• hierarchical clustering
• minimum spanning tree based clustering

Issues in Cluster Analysis
• A lot of clustering algorithms
• A lot of distance/similarity metrics
• Which clustering algorithm runs faster and uses less memory?
• How many clusters after all?
• Are the clusters stable?
• Are the clusters meaningful?

Which Clustering Method Should I Use?
• What is the biological question?
• Do I have a preconceived notion of how many clusters there should be?
• How strict do I want to be? Split or join?
• Can a gene be in multiple clusters?
• Hard or soft boundaries between clusters?

K-means clustering for expression profiles
Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix
Step 2: Cluster genes based on a k-means clustering algorithm
To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric.
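Step 1 can be sketched as follows. The example assumes Euclidean distance as the metric, but any similarity or distance function could be passed in instead. `distance_matrix` and the tiny expression matrix are illustrative names and data, not part of the lecture.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(profiles, metric=euclidean):
    """Turn an n x m expression matrix into a symmetric n x n distance matrix."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(profiles[i], profiles[j])
            dist[i][j] = dist[j][i] = d  # distances are symmetric
    return dist

# Illustrative matrix: 3 genes (rows) x 3 experiments (columns).
expression = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [3.0, 4.0, 12.0],
]
dist = distance_matrix(expression)
```

Only the upper triangle is computed and then mirrored, so the metric is evaluated n(n-1)/2 times.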

K-means algorithm
The most popular algorithm for clustering
What is so attractive?
• Simple
• Mathematically correct
• Fast
• Invariant to dimension
• Easy to implement

K-Means Clustering
• Basic idea: use the cluster centre (mean) to represent each cluster
• Assign each data element to the closest cluster (centre)
• Goal: minimize the squared error (intra-class dissimilarity): $E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $\mu_j$ is the centre (mean) of cluster $C_j$
• There is no hierarchy.
• Must supply the number of clusters (k) into which the data are to be grouped.

K-means Clustering: Procedure (1)
Initialization 1: Specify the number of clusters k, for example k = 4
[Figure: expression matrix (genes × conditions); each point is called a "gene"]

K-means Clustering: Procedure (2)
Initialization 2: Genes are randomly assigned to one of k clusters (or choose random starting centers)

K-means Clustering: Procedure (3)
Calculate the mean of each cluster
[Figure: example points (1,2), (3,2), (3,4), (6,7); a cluster mean is computed from [(6,7) + (3,4) + …]]

K-means Clustering: Procedure (4)
Each gene is reassigned to the nearest cluster
[Figure: gene i assigned to cluster c]

K-means Clustering: Procedure (5)
Iterate until the means have converged
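The five procedure steps can be sketched in a few lines. This version assumes the "choose random starting centers" variant of the initialization and squared Euclidean distance. The function names, the `seed` and `max_iter` parameters, and the six example points are illustrative.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Procedures (1)-(2): k is given; pick k distinct data points as centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Procedure (4): assign each gene to the nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        # Procedure (5): stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Procedure (3): recompute the mean of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_point(members)
    return labels, centers

# Illustrative data: two well-separated groups of "genes".
points = [[1, 2], [3, 2], [3, 4], [10, 10], [11, 12], [12, 10]]
labels, centers = kmeans(points, k=2)
```

Stopping when the assignment is unchanged is equivalent to stopping when the means have converged, since the means are a function of the assignment.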

Convergence of K-means algorithm
Example: 111 data points in 9-dimensional space
N = number of starts needed to reach the global solution
[Table: number of clusters versus N]
For each set of starting centers we get a local minimum. Increase the number of starts!
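One way to act on "increase the number of starts" is to run k-means from several random initializations and keep the run with the smallest squared error. This is a sketch under that assumption. `kmeans_once`, `best_of_n_starts`, the default of 20 starts, and the nine example points are illustrative choices.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random starting centers; returns (error, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Squared error of this (possibly only locally optimal) solution.
    error = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return error, labels

def best_of_n_starts(points, k, n_starts=20, seed=0):
    """Run k-means n_starts times and keep the run with the lowest error."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[0])

# Illustrative data: three tight, well-separated groups.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]
best_error, best_labels = best_of_n_starts(points, k=3)
```

Each run can end in a different local minimum, so comparing the objective value across runs is what makes the extra starts useful.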

Hierarchical clustering (1)
Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix
Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains
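The two steps can be sketched as a bottom-up (agglomerative) procedure. The slide does not say how the distance between two clusters is defined, so this example assumes single linkage (the distance between the closest pair of members) and Euclidean distance. All names and the four one-dimensional profiles are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(profiles):
    """Return the dendrogram as an ordered list of merges (left, right, distance)."""
    n = len(profiles)
    # Step 1: genes x genes distance matrix.
    dist = [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    # Step 2: repeatedly merge the two closest clusters until one remains.
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Illustrative profiles: four genes measured in one condition.
profiles = [[0.0], [1.0], [5.0], [7.0]]
merges = single_linkage(profiles)
```

The merge distances are the heights at which branches join in the dendrogram, so cutting the tree at a chosen height yields a flat clustering.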

Hierarchical clustering (2)

Hierarchical Clustering Results

K-Means vs Hierarchical Clustering

Graph Representation
Represent a set of n-dimensional points as a graph
• each data point (gene) is represented as a node
• each pair of genes is represented as an edge with a weight defined by the "dissimilarity" between the two genes
[Figure: n-D data points, distance matrix, graph representation]
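A minimal sketch of this representation, assuming Euclidean distance as the "dissimilarity" and a plain list of (weight, node, node) tuples as the graph. `build_graph` and the three example profiles are illustrative.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_graph(profiles, dissimilarity=euclidean):
    """Complete weighted graph: one node per gene, one edge per pair of genes."""
    nodes = list(range(len(profiles)))
    edges = [(dissimilarity(profiles[i], profiles[j]), i, j)
             for i, j in combinations(nodes, 2)]
    return nodes, edges

# Illustrative profiles: 3 genes in 2 conditions.
profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
nodes, edges = build_graph(profiles)
```

A graph on n genes has n(n-1)/2 edges, which is the same information as the upper triangle of the distance matrix.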

Minimum Spanning Tree
• Spanning tree: a sub-graph that has all nodes connected and has no cycles
• Minimum spanning tree (MST): a spanning tree with the minimum total distance
[Figure: panels (a), (b), (c)]

How to Construct Minimum Spanning Tree
Prim's algorithm and Kruskal's algorithm
Kruskal's algorithm:
• step 1: select the edge with the smallest distance from the graph
• step 2: add it to the tree as long as no cycle is formed
• step 3: remove the edge from the graph
• step 4: repeat steps 1-3 until all nodes are connected in the tree
[Figure: panels (a)-(e) showing the tree being built]
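Kruskal's steps can be sketched as below. Sorting the edges once is equivalent to repeatedly selecting and removing the smallest remaining edge, and a union-find structure (an implementation choice of this sketch) performs the cycle check. The edge list is illustrative.

```python
def kruskal(n_nodes, edges):
    """edges: list of (weight, u, v). Returns the MST as a list of edges."""
    parent = list(range(n_nodes))

    def find(u):
        # Follow parent links to the root of u's component (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    # Steps 1 and 3: visit edges from smallest to largest, each exactly once.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        # Step 2: an edge inside one component would form a cycle, so skip it.
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((weight, u, v))
            # Step 4: stop once all nodes are connected.
            if len(tree) == n_nodes - 1:
                break
    return tree

# Illustrative graph with 4 nodes and 5 weighted edges.
edges = [(1, 0, 1), (3, 1, 2), (4, 0, 2), (2, 2, 3), (5, 1, 3)]
mst = kruskal(4, edges)
total = sum(weight for weight, _, _ in mst)
```

A spanning tree on n nodes always has n - 1 edges, which gives the early exit condition.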

Foundation of MST Approach
• Significantly simplifies the data clustering problem, while losing very little essential information for clustering.
• We have mathematically proved: a multi-dimensional clustering problem is equivalent to a tree-partitioning problem!

Clustering by Cutting Long Edge
Hierarchical cutting
• 1st cut: longest edge
• 2nd cut: second longest edge
• …
Works well for "easy" cases. Produces many single-element clusters for some "difficult" cases.
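A sketch of this cutting scheme, assuming the MST is already available as a list of (weight, u, v) edges. Removing the k - 1 longest edges leaves k connected components, which are returned as the clusters. The function name and the five-node tree are illustrative.

```python
def cut_longest_edges(n_nodes, mst_edges, k):
    """Remove the k-1 longest MST edges; return clusters as lists of node ids."""
    if not 1 <= k <= n_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    # Keep every edge except the k-1 longest ones.
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]

    parent = list(range(n_nodes))

    def find(u):
        # Root of the component containing u.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Nodes still joined by a kept edge belong to the same cluster.
    for _, u, v in kept:
        parent[find(u)] = find(v)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Illustrative MST on 5 nodes: one clearly long edge (9.0) between two groups.
mst_edges = [(1.0, 0, 1), (1.5, 1, 2), (9.0, 2, 3), (1.2, 3, 4)]
two_clusters = cut_longest_edges(5, mst_edges, k=2)
three_clusters = cut_longest_edges(5, mst_edges, k=3)
```

With k = 3 the second cut isolates node 2 on its own, a small instance of the single-element clusters that cutting by edge length alone can produce.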

Tree-Based Clustering
• For each edge, calculate the assessment value
• Find the edge that gives the minimum assessment value as the place to cut
• Clustering using an iterative method
• Guaranteed to find the global optimum using tree-based dynamic programming
[Figure: label g*]

Automated Selection of Number of Clusters
Select the "transition point" in the assessment value as the "correct" number of clusters.

Transition Profiles
indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1])
A[k] is the assessment value for the partition with k clusters
[Figure: our clustering of yeast data]
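The indicator can be computed directly from a table of assessment values. In this sketch the values in `A` are made up for illustration, and taking the n with the largest indicator as the "transition point" is an assumption about how the profile is read, not a rule stated on the slide.

```python
def transition_indicators(assessment):
    """assessment maps k -> A[k]; returns n -> indicator[n] wherever defined."""
    indicators = {}
    for n in sorted(assessment):
        # indicator[n] needs both neighbours A[n-1] and A[n+1].
        if n - 1 in assessment and n + 1 in assessment:
            denominator = assessment[n] - assessment[n + 1]
            if denominator != 0:  # skip flat stretches to avoid dividing by zero
                indicators[n] = (assessment[n - 1] - assessment[n]) / denominator
    return indicators

# Made-up assessment values: big gains up to 3 clusters, small gains afterwards.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 16.0}
indicators = transition_indicators(A)
# Assumption: the transition point is where the indicator peaks.
best_n = max(indicators, key=indicators.get)
```

A large indicator at n means that moving from n - 1 to n clusters improved the assessment value much more than moving from n to n + 1 does.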

Reading Assignments (1)
• Suggested reading:
  - Chapter 10 in Neil C. Jones and Pavel A. Pevzner: An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). MIT Press, 2004.
  - Chapter 11 in Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press.

Reading Assignments (2)
• Optional reading:
  1. Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:
  2. Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:

Project Assignment 2
Develop a program that implements the K-means clustering algorithm
1. Allow several random initializations and compare their clustering results. Choose the one that has the best value of the objective function.
2. Test the program using the gene expression data sent to the mailing list.
3. Output gene IDs for each cluster.