EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
An Association Analysis Approach to Biclustering website:
Cluster Analysis. Midterm: Monday Oct 29, 4PM  Lecture Notes from Sept 5, 2007 until Oct 15, Chapters from Textbook and papers discussed in class.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Finding Local Linear Correlations in High Dimensional Data Xiang Zhang Feng Pan Wei Wang University of.
Texture Segmentation Based on Voting of Blocks, Bayesian Flooding and Region Merging C. Panagiotakis (1), I. Grinias (2) and G. Tziritas (3)
A Probabilistic Framework for Semi-Supervised Clustering
University at BuffaloThe State University of New York Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data Daxin Jiang Jian.
Clustered alignments of gene- expression time series data Adam A. Smith, Aaron Vollrath, Cristopher A. Bradfield and Mark Craven Department of Biosatatistics.
A New Biclustering Algorithm for Analyzing Biological Data Prashant Paymal Advisor: Dr. Hesham Ali.
Bi-correlation clustering algorithm for determining a set of co- regulated genes BIOINFORMATICS vol. 25 no Anindya Bhattacharya and Rajat K. De.
Principal Component Analysis
Mutual Information Mathematical Biology Seminar
Evaluation and optimization of clustering in gene expression data analysis A. Fazel Famili, Ganming Liu and Ziying Liu National Research Council of Canada.
SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech  Distance Metrics: Measuring similarity using the Euclidean and Correlation.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Reduced Support Vector Machine
1 DECK: Detecting Events from Web Click-through Data Ling Chen, Yiqun Hu, Wolfgang Nejdl Presented by Sebastian Föllmer.
Mining Phenotypes and Informative Genes from Gene Expression Data Chun Tang, Aidong Zhang and Jian Pei Department of Computer Science and Engineering State.
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
Clustering (Part II) 11/26/07. Spectral Clustering.
FACE RECOGNITION, EXPERIMENTS WITH RANDOM PROJECTION
Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.
Introduction to Bioinformatics - Tutorial no. 12
Fuzzy K means.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.
CSCE822 Data Mining and Warehousing
Applications of Data Mining in Microarray Data Analysis Yen-Jen Oyang Dept. of Computer Science and Information Engineering.
An Unsupervised Learning Approach for Overlapping Co-clustering Machine Learning Project Presentation Rohit Gupta and Varun Chandola
A hybrid method for gene selection in microarray datasets Yungho Leu, Chien-Pan Lee and Ai-Chen Chang National Taiwan University of Science and Technology.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
COMMUNITIES IN MULTI-MODE NETWORKS 1. Heterogeneous Network Heterogeneous kinds of objects in social media – YouTube Users, tags, videos, ads – Del.icio.us.
Bi-Clustering Jinze Liu. Outline The Curse of Dimensionality Co-Clustering  Partition-based hard clustering Subspace-Clustering  Pattern-based 2.
Gene expression & Clustering (Chapter 10)
Slides are based on Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems.
Cs: compressed sensing
Bi-Clustering. 2 Data Mining: Clustering Where K-means clustering minimizes.
Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles Jin Chen Sep 2012.
A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside.
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
Ground Truth Free Evaluation of Segment Based Maps Rolf Lakaemper Temple University, Philadelphia,PA,USA.
Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Manoranjan.
Gene-Markers Representation for Microarray Data Integration Boston, October 2007 Elena Baralis, Elisa Ficarra, Alessandro Fiori, Enrico Macii Department.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung.
Lecture 10. Clustering Algorithms The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2011.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.
Data Mining, ICDM '08. Eighth IEEE International Conference on Duy-Dinh Le National Institute of Informatics Hitotsubashi, Chiyoda-ku Tokyo,
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Biclustering of Expression Data by Yizong Cheng and Geoge M. Church Presented by Bojun Yan March 25, 2004.
Chapter 13 (Prototype Methods and Nearest-Neighbors )
Clustering High-Dimensional Data. Clustering high-dimensional data – Many applications: text documents, DNA micro-array data – Major challenges: Many.
An unsupervised conditional random fields approach for clustering gene expression time series Chang-Tsun Li, Yinyin Yuan and Roland Wilson Bioinformatics,
University at BuffaloThe State University of New York Pattern-based Clustering How to cluster the five objects? qHard to define a global similarity measure.
1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.
 Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems n Introduction.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.
Jinbo Bi Joint work with Jiangwen Sun, Jin Lu, and Tingyang Xu
Principal Component Analysis (PCA)
Subspace Clustering/Biclustering
CARPENTER Find Closed Patterns in Long Biological Datasets
Multivariate Statistical Methods
A Fast Algorithm for Subspace Clustering by Pattern Similarity
GPX: Interactive Exploration of Time-series Microarray Data
CS 485G: Special Topics in Data Mining
Presentation transcript:

EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science

Administrative Final exam: Dec 15 7:30-10: EECS 7302

EECS 7303 Model Based Subspace Clustering Microarray Bi-clustering δ-clustering

EECS 7304 MicroArray Dataset

EECS 7305 Gene Expression Matrix Genes Conditions Genes Conditions Time points Cancer Tissues

EECS 7306 Data Mining: Clustering Where K-means clustering minimizes

EECS 7307 Clustering by Pattern Similarity (p- Clustering) The micro-array “raw” data shows 3 genes and their values in a multi- dimensional space Parallel Coordinates Plots Difficult to find their patterns “non-traditional” clustering

EECS 7308 Clusters Are Clear After Projection

EECS 7309 Why p-Clustering? Microarray data analysis may need to Clustering on thousands of dimensions (attributes) Discovery of both shift and scaling patterns Clustering with Euclidean distance measure? — cannot find shift patterns Clustering on derived attribute A ij = a i – a j ? — introduces N(N-1) dimensions Bi-cluster using transformed mean-squared residue score matrix (I, J) Where A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0 Problems with bi-cluster No downward closure property, Due to averaging, it may contain outliers but still within δ-threshold

EECS Motivation DNA microarray analysis CH1ICH1BCH1DCH2ICH2B CTFC VPS EFB SSA FUN SP MDM CYS DEP NTG

EECS Motivation

EECS Motivation Strong coherence exhibits by the selected objects on the selected attributes. They are not necessarily close to each other but rather bear a constant shift. Object/attribute bias bi-cluster

EECS Challenges The set of objects and the set of attributes are usually unknown. Different objects/attributes may possess different biases and such biases may be local to the set of selected objects/attributes are usually unknown in advance May have many unspecified entries

EECS Previous Work Subspace clustering Identifying a set of objects and a set of attributes such that the set of objects are physically close to each other on the subspace formed by the set of attributes. Collaborative filtering: Pearson R Only considers global offset of each object/attribute.

EECS bi-cluster Terms Consists of a (sub)set of objects and a (sub)set of attributes Corresponds to a submatrix Occupancy threshold  Each object/attribute has to be filled by a certain percentage. Volume: number of specified entries in the submatrix Base: average value of each object/attribute (in the bi-cluster) Biclustering of Expression Data, Cheng & Church ISMB’00

EECS bi-cluster CH1ICH1BCH1DCH2ICH2BObj base CTFC3 VPS EFB SSA1 FUN14 SP07 MDM10 CYS DEP1 NTG1 Attr base

EECS conditions 40 genes

EECS Motivation

EECS conditions 40 genes

EECS Motivation Co-regulated genes

EECS bi-cluster Perfect  -cluster Imperfect  -cluster Residual: d IJ d Ij d iJ d ij

EECS bi-cluster The smaller the average residue, the stronger the coherence. Objective: identify  -clusters with residue smaller than a given threshold

EECS Cheng-Church Algorithm Find one bi-cluster. Replace the data in the first bi-cluster with random data Find the second bi-cluster, and go on. The quality of the bi-cluster degrades (smaller volume, higher residue) due to the insertion of random data.

EECS The FLOC algorithm Generating initial clusters Determine the best action for each row and each column Perform the best action of each row and column sequentially Improved? Y N Yang et al. delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02

EECS The FLOC algorithm Action: the change of membership of a row (or column) with respect to a cluster column row M+N actions are Performed at each iteration N=3 M=4

EECS The FLOC algorithm Gain of an action: the residual reduction incurred by performing the action Order of action: Fixed order Random order Weighted random order Complexity: O((M+N)MNkp)

EECS The FLOC algorithm Additional features Maximum allowed overlap among clusters Minimum coverage of clusters Minimum volume of each cluster Can be enforced by “temporarily blocking” certain action during the mining process if such action would violate some constraint.

EECS Performance Microarray data: 2884 genes, 17 conditions 100 bi-clusters with smallest residue were returned. Average residue = The average residue of clusters found via the state of the art method in computational biology field is The average volume is 25% bigger The response time is an order of magnitude faster

EECS Conclusion Remark The model of bi-cluster is proposed to capture coherent objects with incomplete data set. base residue Many additional features can be accommodated (nearly for free).

EECS References J. Young, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp , H. Wang, W. Wang, J. Young, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data Bioinformatics and Bioengineering, J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.