Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University.


Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Gene Expression Data (Microarray)
- A matrix of p genes on n samples: rows are genes, columns are mRNA samples (sample1, sample2, sample3, sample4, sample5, …)
- Each entry is the gene expression level of gene i in mRNA sample j
- Expression level = log(treated-exp-value / controlled-exp-value)
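As a minimal sketch of how the log-ratio entries of the expression matrix might be computed: the slide does not state the base of the logarithm, so base 2 is an assumption here, and the function name `log_ratio_matrix` and the intensity values are hypothetical.

```python
import math

def log_ratio_matrix(treated, control):
    """Return the genes x samples matrix of log2(treated / control) values.

    Both inputs are lists of rows (one row per gene, one column per sample)
    holding strictly positive expression values.
    """
    return [
        [math.log2(t / c) for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(treated, control)
    ]

# Hypothetical intensities: 2 genes (rows) measured on 3 samples (columns).
treated = [[200.0, 400.0, 100.0], [50.0, 100.0, 100.0]]
control = [[100.0, 100.0, 100.0], [100.0, 100.0, 100.0]]
ratios = log_ratio_matrix(treated, control)
```

With a log ratio, a value of 0 means no change between the treated and controlled conditions, and up- and down-regulation by the same factor are symmetric around 0.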

Some possible applications
- Sample from a specific organ to show which genes are expressed
- Compare samples from healthy and sick hosts to find gene-disease connections
- Discover co-regulated genes
- Discover promoters

Major Analysis Techniques
- Single gene analysis
  - Compare the expression levels of the same gene under different conditions
  - Main techniques: Significance test (e.g., t-test)
- Gene group analysis
  - Find genes that are expressed similarly across many different conditions
  - Main techniques: Clustering (many possibilities)
- Gene network analysis
  - Analyze gene regulation relationships at a large scale
  - Main techniques: Bayesian networks
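For single gene analysis, a significance test compares one gene's expression under two conditions. The sketch below computes Welch's t statistic, which is one common variant of the t-test and is chosen here as an assumption (the slide only says "t-test"). The function name `welch_t` and the sample values are hypothetical, and turning the statistic into a p-value would additionally require the t distribution, which is not shown.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples.

    Uses the sample variance of each group, so the two groups
    are not assumed to share the same variance.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical log-ratio values of one gene in healthy and sick samples.
healthy = [1.0, 1.2, 0.8, 1.0]
sick = [2.0, 2.2, 1.8, 2.0]
t = welch_t(healthy, sick)
```

A large absolute value of `t` suggests that the difference between the two conditions is large relative to the variation within each condition.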

Clustering Methods
- Similarity-based (need a similarity function)
  - Construct a partition
    - Agglomerative, bottom up
    - Searching for an optimal partition
  - Typically “hard” clustering
- Model-based (latent models, probabilistic or algebraic)
  - First compute the model
  - Clusters are obtained easily after having a model
  - Typically “soft” clustering

Similarity-based Clustering
- Define a similarity function to measure similarity between two objects
- Common criteria: Find a partition to
  - Maximize intra-cluster similarity
  - Minimize inter-cluster similarity
- Two ways to construct the partition
  - Hierarchical (e.g., Agglomerative Hierarchical Clustering)
  - Search by starting at a random partition (e.g., K-means)

Method 1 (Similarity-based): Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering
- Given a similarity function to measure similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
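The bottom-up procedure can be sketched as follows. This is an illustrative, unoptimized version: the stopping criterion (a minimum similarity `min_sim`), the use of negative distance as the similarity function, and all names are assumptions made for the example, not part of the lecture.

```python
def agglomerative(items, sim, group_sim, min_sim):
    """Repeatedly merge the two most similar clusters.

    Stops when only one cluster is left or when the best pair of
    clusters is less similar than min_sim (the assumed stopping criterion).
    """
    clusters = [[x] for x in items]          # start with one cluster per object
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = group_sim(clusters[i], clusters[j], sim)
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < min_sim:
            break
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters

def average_link(g1, g2, sim):
    """Group similarity as the average similarity over all cross-group pairs."""
    pairs = [sim(a, b) for a in g1 for b in g2]
    return sum(pairs) / len(pairs)

def neg_distance(a, b):
    """Similarity of two numbers as the negative absolute difference."""
    return -abs(a - b)

points = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0]
clusters = agglomerative(points, neg_distance, average_link, min_sim=-1.0)
```

On this toy input the procedure ends with three groups: the values near 1, the values near 5, and the single value 9. Passing a different `group_sim` function changes how groups are compared without changing the merging loop.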

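Similarity Measure: Pearson CC
- The most popular correlation coefficient is the Pearson correlation coefficient (1892)
- Correlation between $X = \{X_1, X_2, \ldots, X_n\}$ and $Y = \{Y_1, Y_2, \ldots, Y_n\}$:
  $$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
  where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
- $s_{XY}$ is the similarity between X and Y
- Better measures focus on a subset of values…
(Adapted from a slide by Shin-Mu Tseng)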
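A short sketch of the Pearson correlation coefficient between two expression profiles X and Y, using the standard definition (covariance divided by the product of the standard deviations). The function name `pearson` and the example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles of three genes over four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, larger scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend
s_ab = pearson(gene_a, gene_b)
s_ac = pearson(gene_a, gene_c)
```

Because the coefficient depends on the shape of the profile rather than its scale, `gene_a` and `gene_b` come out as maximally similar even though their absolute values differ.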

Similarity-induced Structure

How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
- Single-link algorithm: s(g1,g2) = similarity of the closest pair
- Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
- Average-link algorithm: s(g1,g2) = average of similarity of all pairs
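The three group similarity definitions translate directly into code. In this sketch the closest pair is the pair with the highest similarity and the farthest pair is the one with the lowest; the negative-distance similarity and the example groups are assumptions chosen for illustration.

```python
def single_link(g1, g2, sim):
    """Similarity of the closest (most similar) cross-group pair."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest (least similar) cross-group pair."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    pairs = [sim(a, b) for a in g1 for b in g2]
    return sum(pairs) / len(pairs)

def neg_distance(a, b):
    """Similarity of two numbers as the negative absolute difference."""
    return -abs(a - b)

g1 = [1.0, 2.0]
g2 = [4.0, 7.0]
s_single = single_link(g1, g2, neg_distance)      # pair (2.0, 4.0)
s_complete = complete_link(g1, g2, neg_distance)  # pair (1.0, 7.0)
s_average = average_link(g1, g2, neg_distance)    # mean over all four pairs
```

For any two groups, the single-link value is at least the average-link value, which in turn is at least the complete-link value.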

Three Methods Illustrated
(Figure: two groups g1 and g2, compared under the single-link, complete-link, and average-link algorithms)

Comparison of the Three Methods
- Single-link
  - “Loose” clusters
  - Individual decision, sensitive to outliers
- Complete-link
  - “Tight” clusters
  - Individual decision, sensitive to outliers
- Average-link
  - “In between”
  - Group decision, insensitive to outliers
- Which one is the best? Depends on what you need!

Method 2 (similarity-based): K-Means

K-Means Clustering
- Given a similarity function
- Start with k randomly selected data points
- Assume they are the centroids of k clusters
- Assign every data point to the cluster whose centroid is closest to the data point
- Recompute the centroid for each cluster
- Repeat this process until the similarity-based objective function converges
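The K-means steps can be sketched as below. Two choices here are assumptions rather than part of the lecture: squared Euclidean distance plays the role of the (dis)similarity function, and the loop stops when the cluster assignments no longer change, at which point the objective has also stopped changing. All names and data are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after running K-means on the points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)     # k randomly selected data points
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        # Recompute each centroid; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
```

The result depends on the random starting points in general, so in practice the procedure is often run several times with different seeds and the partition with the best objective value is kept.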

Method 3 (model-based): Mixture Models

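Mixture Model for Clustering
- Component distributions: $P(X \mid \text{Cluster}_1)$, $P(X \mid \text{Cluster}_2)$, $P(X \mid \text{Cluster}_3)$
- $P(X) = \lambda_1 P(X \mid \text{Cluster}_1) + \lambda_2 P(X \mid \text{Cluster}_2) + \lambda_3 P(X \mid \text{Cluster}_3)$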

Mixture Model Estimation
- Likelihood function
- Parameters: $\lambda_i$, $\mu_i$, $\sigma_i$
- Using EM algorithm
- Similar to “soft” K-means
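A minimal sketch of EM estimation for a mixture model, assuming one-dimensional Gaussian components whose parameters are a mixing weight, a mean, and a standard deviation per cluster (the component family is not stated on the slide, so this is an assumption). The names, the starting values, and the fixed number of iterations are hypothetical, and the code has no safeguards against a point receiving zero density from every component.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, mus, sigmas, weights, n_iter=50, min_sigma=1e-3):
    """Run EM on a 1-D Gaussian mixture; returns weights, means, sigmas, resp."""
    k = len(mus)
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: soft assignment of every point to every cluster.
        resp = []
        for x in data:
            w = [weights[i] * gaussian_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate parameters from the soft assignments.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(data)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / n_i
            sigmas[i] = max(math.sqrt(var), min_sigma)   # avoid collapse to zero
    return weights, mus, sigmas, resp

data = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
weights, mus, sigmas, resp = em_gmm_1d(
    data, mus=[0.0, 6.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The rows of `resp` are the "soft" cluster memberships: each point gets a probability for every cluster instead of a single label, which is the sense in which the method resembles a soft K-means.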

Method 4 (model-based) [If we have time]: Singular Value Decomposition (SVD), also called “Latent Semantic Indexing” (LSI)

Example of “Semantic Concepts” (Slide from C. Faloutsos’s talk)

Singular Value Decomposition (SVD)
$A_{[n \times m]} = U_{[n \times r]} \, \Sigma_{[r \times r]} \, (V_{[m \times r]})^T$
- A: n x m matrix (n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- $\Sigma$: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)
(Slide from C. Faloutsos’s talk)
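To make the idea of a "concept" concrete, the sketch below estimates only the strongest concept of a small document-term matrix using power iteration. This is a simplified stand-in for a full SVD, not the decomposition itself; in practice a library routine such as `numpy.linalg.svd` would normally be used. The matrix values, the function name, and the uniform starting vector are assumptions.

```python
import math

def top_singular_triple(A, n_iter=200):
    """Estimate the largest singular value of A and its singular vectors.

    Power iteration on A^T A, started from a uniform vector (assumed not
    to be orthogonal to the dominant right singular vector).
    Returns (u, sigma, v) with A v = sigma * u.
    """
    n, m = len(A), len(A[0])
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]   # A v
        w = [sum(A[i][j] * u[i] for i in range(n)) for j in range(m)]   # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Hypothetical term counts; columns: data, inf, retrieval, brain, lung.
A = [
    [1, 1, 1, 0, 0],   # CS documents
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],   # MD documents
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]
u, sigma, v = top_singular_triple(A)
```

Here `v` is the term representation of the strongest concept (nonzero only on the three CS terms), `u` gives how strongly each document expresses that concept, and `sigma` is the strength of the concept.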

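Example of SVD
(Figure: a document-term matrix A over the terms data, inf, retrieval, brain, lung, with CS and MD documents, decomposed as $A = U \Sigma V^T$. Labels: CS-concept, MD-concept, strength of CS-concept, term representation of concept, dimension reduction)
(Slide adapted from C. Faloutsos’s talk)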

More clustering methods and software
- Partitioning: K-Means, K-Medoids, PAM, CLARA, …
- Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
- Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …
- Grid-based: STING, CLIQUE, WaveCluster, …
- Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …
- Two-way clustering
- Block clustering