ArrayCluster: an analytic tool for clustering, data visualization and module finding on gene expression profiles. Team members: 李祥豪, 謝紹陽, 江建霖

Outline
- Introduction
- Mixed Factors Model
- Analytic Tools
- Summary
- Demo

Introduction  This task can be addressed by grouping gene expression patterns of a large number of genes  Typical microarray data have a fairly small sample size, less than 100, whereas the number of genes involved is more than several thousands

Introduction  One major difficulty in this problem is that the number of samples to be clustered is much smaller than the dimension of data  Most clustering technologies, e.g. k- means, Gaussian mixture clustering, hierarchical clustering and so on, would be limited by over-learning

Introduction  In statistics, overfitting is fitting a statistical model that has too many parameters.  When the degrees of freedom in parameter selection exceed the data, this leads to arbitrariness in the final (fitted) model parameters which reduces or destroys the ability of the model to generalize beyond the fitting data.

Introduction  In machine learning, usually a learning algorithm is trained using some set of training examples, especially in learning was performed too long or training are rare, the learner may adjust to very specific random features of the training data, that have no causal relation to the target function.

Introduction  In both statistics and machine learning, in order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, early stopping, Bayesian Priors on parameters or model comparison), that can indicate when further training is not resulting in better generalization.

Mixed Factors Model  The mixed factors model presents a parsimonious parameterization of Gaussian mixture model  Our primal intention is parsimoniously to describe the group structure of data based on the factor variables. To this end, we devise the mixed factors that follow a G- components Gaussian mixture as

Mixed Factors Model  The mixed factors model, we possibly avoid the over-fitting of the Gaussian mixture by choosing an appropriate factor dimension regardless to the high dimensionality of data.  Once the model has been fitted to a given dataset, clustering can be addressed by the Bayes rule.

Mixed Factors Model  To avoid it, we impose the orthogonality on the q columns of the factor loading matrix  This imposition leads to a canonical representation of the mixed factors model as  From this equation, one achieves the fact that the q canonical variates in A T x j € R q are distributed according to

Mixed Factors Model  The canonical variates can be considered as the q modules of genes which are relevant to the existing molecular subtypes.  This process yields a feature selection that constructs good discriminators for existing groups as linear combination d genes.

Analytic Tools  File format of data file

Analytic Tools  model selection based on BIC curve

Analytic Tools  In this plot, the horizontal and vertical axes correspond to the factor dimension and the BIC scores, respectively. The each line represents curve of BIC scores against to varying factor dimensions (q) for a fixed number of clusters (G)

Analytic Tools  File format of mixed_factors

Analytic Tools  Box plot of the computed factor scores

Analytic Tools  Each cluster is separated with the blank lines. All samples in one cluster are ordered according to the degree of the belongings that are measured by the Maharanobis distance between each sample point and the corresponding group centeroid. The calculated distances are indicated next to the sample identifiers

Analytic Tools  File format of relevant_set

Analytic Tools  relevant module profiling

- After selecting rows (genes) of interest, the enlarged expression image is displayed in the right window.

Analytic Tools  The ArrayCluster provides users an usable environment to perform the following tasks: Parameter estimation of the mixed factors model: The ArrayCluster computes the maximum likelihood estimators by using the EM algorithm Determination of the number of clusters and the factor dimension (the number of group- relatedmodules):These are selected based on the Bayesian information criterion (BIC) Clustering based on the Bayes rule

Analytic Tools
- Dimension reduction of data: this task is addressed in the same way as classical factor analysis, but the mixed factors analysis explicitly reflects the existing group structure of the original data, whereas classical factor analysis ignores it during dimension reduction.
- Identification of the group-related genes: in ArrayCluster, the relevant genes in each module are selected as the top L (user-specified) genes with the highest positive (or negative) correlation with each element of the factor vector (sketched after the next slide).

Analytic Tools
- Identification of the modules: by separating the genes positively and negatively correlated with the factor vector in each module, we identify 2q modules in total (see the sketch below).
- Missing data imputation.
- Data preprocessing: the methods include normalization and gene filtering.
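A hedged sketch of the gene-selection step from the two slides above: for each factor, rank genes by their correlation with the factor scores and split the top hits by sign, giving 2q modules. Variable names and the Pearson-correlation choice are illustrative assumptions, not ArrayCluster's code.

```python
# Select group-related genes per factor and split them into positive/negative
# modules (2q modules total). Illustrative sketch, not ArrayCluster's code.
import numpy as np

def find_modules(expr, factors, L=20):
    """expr: (n_samples, n_genes) expression matrix;
    factors: (n_samples, q) factor scores; L: genes kept per module.
    Returns a list of 2q arrays of gene indices."""
    modules = []
    for k in range(factors.shape[1]):
        fc = factors[:, k] - factors[:, k].mean()
        ec = expr - expr.mean(axis=0)
        # Pearson correlation of every gene with factor k.
        corr = (ec.T @ fc) / (np.linalg.norm(ec, axis=0)
                              * np.linalg.norm(fc) + 1e-12)
        modules.append(np.argsort(corr)[::-1][:L])  # top-L positively correlated
        modules.append(np.argsort(corr)[:L])        # top-L negatively correlated
    return modules
```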

Summary  The ArrayCluster visualizes the computed factor scores using the box plot matrix  Enhancing the graphical understanding of the group structure.  A casual link from the calibrated clusters to biological knowledge can be elucidated through the inspection of the group-related modules.

Summary  The ArrayCluster displays the expression patterns of these modules.  Genes at these modules and their visualization give us a scope to question where the calibrated clusters come from.

Thanks for your attention. Next: DEMO