A New Biclustering Algorithm for Analyzing Biological Data
Prashant Paymal
Advisor: Dr. Hesham Ali


Introduction
Microarray technology is used to study the expression of many genes at once
Microarray experiments produce large amounts of data
Proper analysis of the data is essential to extract meaningful information from it
There is a need for new analysis techniques

Data Analysis
From data to knowledge
We need to process data by grouping and synthesizing information into a “big picture” based on characteristics and relationships
One of the most widely used analysis techniques is traditional clustering

Traditional Clustering
Applied to either the rows or the columns of the data matrix separately
Each gene is defined using all the conditions
Each condition is characterized by the activity of all the genes that belong to it
[Figure: data matrix with axes labeled Genes and Conditions]

Motivation
The large amount of data poses great challenges for analysis
Clustering algorithms consider all the conditions to group genes and all the genes to group conditions
Biological data may not show similar behavior under all conditions, but only under a subset of them
Traditional clustering algorithms will very likely miss some important information

Biclustering
The term “biclustering” was first used in gene expression data analysis by Cheng and Church (2000)
Clusters do not need to include all parameters (genes in bioinformatics) for all conditions
Data matrix
  ▫ Each gene: one row
  ▫ Each condition: one column
  ▫ Each element: expression level of a gene under a specific condition

Biclustering (Cont.)
Performs clustering in these two dimensions simultaneously
Each gene is selected using only a subset of the conditions
Each condition is selected using only a subset of the genes
[Figure: data matrix with axes labeled Genes and Conditions]
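
As a rough illustration of the idea on this slide, a bicluster can be viewed as a submatrix picked out by a subset of genes and a subset of conditions. The toy matrix, the `submatrix` helper, and the chosen indices below are made up for illustration and do not come from the presentation.

```python
# Toy expression matrix: rows are genes, columns are conditions (values are made up).
matrix = [
    [1.0, 5.0, 2.0, 9.0],  # gene 0
    [1.1, 0.3, 2.1, 4.0],  # gene 1
    [7.0, 2.2, 6.5, 0.1],  # gene 2
    [0.9, 8.0, 1.9, 3.3],  # gene 3
]


def submatrix(matrix, genes, conditions):
    """Return the rows in `genes` restricted to the columns in `conditions`."""
    return [[matrix[g][c] for c in conditions] for g in genes]


# Genes 0, 1 and 3 behave alike only under conditions 0 and 2.
# Clustering on all four conditions would not group them together.
bicluster = submatrix(matrix, genes=[0, 1, 3], conditions=[0, 2])
print(bicluster)
```

The selected rows are close to each other only in the selected columns, which is the situation the Motivation slide describes.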

Goal of Biclustering
To identify subgroups of genes and subgroups of conditions by clustering the rows and columns of the gene expression matrix simultaneously, instead of clustering these two dimensions separately
Finding biclusters is an NP-hard problem; it is a generalized version of traditional clustering

Previous Work
A systematic comparison and evaluation of biclustering methods for gene expression data, Amela Prelic et al. (2006)
Algorithms:
  ▫ Statistical Algorithmic Method for Biclustering Analysis Algorithm (SAMBA)
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ Iterative Signature Algorithm (ISA)
  ▫ Cheng and Church algorithm
  ▫ xMotif
  ▫ Bimax

Previous Work (Cont.)
Comparative Analysis of Biclustering Algorithms, Doruk Bozdag … (2010)
Algorithms:
  ▫ Correlated Pattern Bicluster Algorithm (CPB)
  ▫ Cheng and Church Algorithm
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ HARP Algorithm
  ▫ Minimum Sum-Squared Residue-based CoClustering Algorithm (MSSRCC)
  ▫ Statistical Algorithmic Method for Biclustering Analysis Algorithm (SAMBA)

The Importance of Assessment
Different algorithms give different solutions for the same data
There is no agreed-upon guideline for choosing among them
Validation techniques
  ▫ External validation measures
      ▫ Evaluate a result based on knowledge of the correct class labels
  ▫ Internal validation measures
      ▫ Evaluate a result based on information intrinsic to the data alone

Validation
In most biclustering papers, external validation measures are used to assess the methods
  ▫ It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al., 2003)
  ▫ Internal measures do not work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures

Objectives of the Project
Comprehensive assessment technique
  ▫ Internal measures as well as external measures
Customized biclustering method
  ▫ Input domain

Validation using Synthetic Data
Testing using manufactured data
  ▫ The portion of the implanted bicluster that the algorithm was able to return
  ▫ The portion of the algorithm's output that is external or irrelevant to the implanted bicluster
  ▫ Two metrics to evaluate cluster quality
      ▫ U: uncovered portion of the implanted bicluster
      ▫ E: portion of the output cluster external to the implanted bicluster
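
The two metrics U and E can be sketched as simple set ratios. This is a minimal sketch that assumes a bicluster is represented as a set of (gene, condition) cells and that both portions are measured as fractions of cells; the slide does not say how the portions are counted, and the helper names `cells` and `uncovered_and_external` are hypothetical.

```python
from itertools import product


def cells(genes, conditions):
    """Represent a bicluster as the set of (gene, condition) cells it covers."""
    return set(product(genes, conditions))


def uncovered_and_external(implanted, output):
    """Return (U, E) as fractions of cells (an assumption, see lead-in).

    U: share of the implanted bicluster that the output misses.
    E: share of the output that lies outside the implanted bicluster.
    """
    u = len(implanted - output) / len(implanted)
    e = len(output - implanted) / len(output) if output else 0.0
    return u, e


# Implanted 10 x 10 bicluster, and an output shifted by two genes.
implanted = cells(range(0, 10), range(0, 10))
output = cells(range(2, 12), range(0, 10))
u, e = uncovered_and_external(implanted, output)
print(f"U = {u:.2f}, E = {e:.2f}")
```

Under this reading, a perfect result has U = 0 and E = 0.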

Validation using Real Data
Testing using real (domain-specific) data, for example using the gene match score
  ▫ Let M1 and M2 be two sets of biclusters
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2
Potential improvements
  ▫ The gene match score does not consider samples / conditions
  ▫ Specificity and sensitivity
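
The gene match score described above can be sketched as follows. The sketch assumes the per-pair match score is the Jaccard index of the two gene sets, which is the usual definition in the Prelic et al. comparison; the data layout (each bicluster as a pair of Python sets) and the function names are choices made for this example.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match found in m2.

    Each bicluster is a (genes, conditions) pair of sets. Conditions are
    ignored here, which is the limitation noted on the slide.
    """
    if not m1 or not m2:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1]
    return sum(best) / len(m1)


m1 = [({1, 2, 3}, {0, 1}), ({7, 8}, {2})]
m2 = [({1, 2, 3, 4}, {0, 1}), ({8, 9}, {2, 3})]
score = gene_match_score(m1, m2)
print(round(score, 4))
```

The score is not symmetric: `gene_match_score(m1, m2)` and `gene_match_score(m2, m1)` answer different questions, so both directions are often reported.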

Proposed Assessment
Calculate sensitivity and specificity scores
  ▫ Specificity: proportion of actual negatives that are correctly identified
  ▫ Sensitivity: proportion of actual positives that are correctly identified
Improve existing measures
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples)
Assessment based on knowledge of domain data
  ▫ The resulting biclusters were evaluated based on the enrichment of Gene Ontology (GO) terms
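
A small sketch of the sensitivity and specificity calculation, assuming that positives are the matrix cells belonging to the true bicluster and negatives are all other cells. The slide does not state the unit of counting, so the cell-level view and the function name `sensitivity_specificity` are assumptions.

```python
from itertools import product


def sensitivity_specificity(true_cells, predicted_cells, n_genes, n_conditions):
    """Cell-level sensitivity and specificity for one predicted bicluster."""
    total = n_genes * n_conditions
    tp = len(true_cells & predicted_cells)   # true cells that were found
    fn = len(true_cells - predicted_cells)   # true cells that were missed
    fp = len(predicted_cells - true_cells)   # reported cells that are not true
    tn = total - tp - fn - fp                # everything else in the matrix
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# True bicluster: genes 0-9, conditions 0-9. Prediction overlaps half of it.
true_cells = set(product(range(10), range(10)))
predicted_cells = set(product(range(10), range(5, 15)))
sens, spec = sensitivity_specificity(true_cells, predicted_cells, 100, 100)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.4f}")
```

Because a bicluster usually covers a small part of the matrix, specificity tends to stay close to 1 even for poor predictions, so it is most informative when read together with sensitivity.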

Experiments
Given two biclustering results
  ▫ M1: result of a biclustering algorithm
  ▫ M2: true result
  ▫ (G1, C1) ∈ M1 and (G2, C2) ∈ M2
Calculate the similarity score (Jaccard coefficient)
Calculate the two scores
  ▫ Score 1: % of the algorithm's result that is included in the true result
  ▫ Score 2: % of the true result that the algorithm can find
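
The formulas on this slide did not survive in the transcript, so the following is only one plausible reading, not the presenter's exact definition. It uses the standard Jaccard coefficient |A ∩ B| / |A ∪ B| on gene sets and condition sets, and computes Score 1 and Score 2 over the union of cells covered by each set of biclusters. All function names are hypothetical.

```python
def jaccard(a, b):
    """Standard Jaccard coefficient of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def covered_cells(biclusters):
    """All (gene, condition) cells covered by a list of (genes, conditions) pairs."""
    return {(g, c) for genes, conds in biclusters for g in genes for c in conds}


def score_1_and_2(result, truth):
    """Score 1: share of the result inside the truth. Score 2: share of the truth found."""
    found, true_cells = covered_cells(result), covered_cells(truth)
    overlap = len(found & true_cells)
    score_1 = overlap / len(found) if found else 0.0
    score_2 = overlap / len(true_cells) if true_cells else 0.0
    return score_1, score_2


# M2 (truth) has two 10 x 10 biclusters; M1 (result) recovers only the first one.
m2 = [(set(range(0, 10)), set(range(0, 10))), (set(range(10, 20)), set(range(10, 20)))]
m1 = [(set(range(0, 10)), set(range(0, 10)))]

gene_sim = jaccard(m1[0][0], m2[0][0])  # similarity of gene sets G1 and G2
cond_sim = jaccard(m1[0][1], m2[0][1])  # similarity of condition sets C1 and C2
s1, s2 = score_1_and_2(m1, m2)
print(gene_sim, cond_sim, s1, s2)
```

In this example everything the algorithm reports is correct, but it finds only half of the true result, so the two scores separate precision-like and recall-like behavior.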

Results
Synthetic data: 100 genes and 100 samples
10 implanted biclusters, each of size 10 x 10 (10 genes and 10 samples)
Used different publicly available biclustering algorithm implementations
Score 1: % of the algorithm's result that is included in the true result
Score 2: % of the true result that the algorithm can find

Algorithm                                                       | No. of biclusters | Score 1 | Score 2
Cheng and Church Algorithm (CC)                                 |                   |         |
Iterative Signature Algorithm (ISA)                             | 9                 | 1       | 0.9
Order Preserving Submatrix (OPSM) Algorithm                     |                   |         |
Statistical Algorithmic Method for Bicluster Analysis (SAMBA)   |                   |         |
xMotif Algorithm                                                |                   |         |
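
A data set of the shape described on this slide could be generated along the following lines. The slide only gives the dimensions, so the Gaussian background, the constant-valued biclusters, their placement along the diagonal, and the `make_synthetic` function are all assumptions made for this sketch.

```python
import random


def make_synthetic(n_genes=100, n_conditions=100, n_biclusters=10, size=10, seed=0):
    """Build a noisy matrix with non-overlapping constant biclusters on the diagonal.

    Returns the matrix (list of rows) and the ground truth as a list of
    (genes, conditions) pairs of sets.
    """
    rng = random.Random(seed)
    # Background: independent standard normal noise (an assumption).
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n_conditions)] for _ in range(n_genes)]
    truth = []
    for k in range(n_biclusters):
        genes = range(k * size, (k + 1) * size)
        conditions = range(k * size, (k + 1) * size)
        level = 5.0 + k  # a distinct constant expression level per bicluster
        for g in genes:
            for c in conditions:
                matrix[g][c] = level
        truth.append((set(genes), set(conditions)))
    return matrix, truth


matrix, truth = make_synthetic()
print(len(matrix), len(matrix[0]), len(truth))
```

The returned `truth` list plays the role of M2 (the true result) when scoring the output of a biclustering algorithm.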

Conclusion
Traditional clustering is too restrictive a technique for analyzing datasets in various application domains
We need new, flexible analysis techniques such as biclustering to deal with possible imperfections in the input datasets
Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain
Biclustering is a powerful tool for data analysis in a variety of domains and can be applied to datasets outside biology

References
Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey
Prelic, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data

Thank you…