
Bi-Clustering Jinze Liu

Outline
- The Curse of Dimensionality
- Co-Clustering (partition-based hard clustering)
- Subspace-Clustering (pattern-based)

Clustering
K-means clustering minimizes the within-cluster sum of squared distances:
$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$,
where $\mu_j$ is the centroid of cluster $C_j$.
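For concreteness, a minimal NumPy sketch of this objective (not from the slides), assuming a data matrix X and a given assignment of points to clusters:

```python
import numpy as np

def kmeans_objective(X, labels, k):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]           # points assigned to cluster j
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)    # the cluster mean mu_j
        total += ((members - centroid) ** 2).sum()
    return total

# Two tight, well-separated blobs -> a small objective value.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
labels = np.array([0] * 5 + [1] * 5)
print(kmeans_objective(X, labels, k=2))
```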

The Curse of Dimensionality
The dimension of a problem refers to the number of input variables (more precisely, degrees of freedom).
The curse of dimensionality: the amount of data required to densely populate a space grows exponentially with its dimension.
In high-dimensional space, points become nearly equidistant from one another.
(Figure: the same number of samples spread over 1-D, 2-D, and 3-D spaces.)
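The "nearly equidistant" effect is easy to demonstrate. A small sketch, assuming uniform random points in the unit hypercube, showing how the relative contrast between the nearest and farthest point collapses as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(42)
for d in (1, 2, 10, 100, 1000):
    X = rng.random((500, d))              # 500 points in [0, 1]^d
    q = rng.random(d)                     # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={contrast:.3f}")
# The contrast drops toward 0 as d grows: all points look equally far away.
```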

Motivation
Document clustering:
- Define a similarity measure
- Cluster the documents using, e.g., k-means
Term clustering:
- Symmetric with document clustering

Motivation
(Figure: a genes-by-patients expression matrix, with hierarchical clustering applied separately to the genes and to the patients.)

Contingency Tables
Let X and Y be discrete random variables:
- X and Y take values in {1, 2, …, m} and {1, 2, …, n}
- p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
Key obstacles in clustering contingency tables:
- High dimensionality, sparsity, noise
- Need for robust and scalable algorithms
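As an illustration (a sketch with a made-up count matrix), estimating p(X, Y) and its marginals from co-occurrence counts, plus the mutual information I(X; Y) commonly used to measure association in such tables:

```python
import numpy as np

counts = np.array([[10, 0, 2],    # rows: values of X (e.g. words)
                   [ 0, 5, 1],    # columns: values of Y (e.g. documents)
                   [ 3, 1, 8]], dtype=float)

p_xy = counts / counts.sum()      # empirical joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)            # marginal of X (rows)
p_y = p_xy.sum(axis=0)            # marginal of Y (columns)

# Mutual information I(X; Y), summing only over non-zero cells.
nz = p_xy > 0
mi = (p_xy[nz] * np.log2(p_xy[nz] / np.outer(p_x, p_y)[nz])).sum()
print(f"I(X;Y) = {mi:.3f} bits")
```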

Co-Clustering
Simultaneously:
- Cluster the rows of p(X, Y) into k disjoint groups
- Cluster the columns of p(X, Y) into l disjoint groups
Key goal: exploit the "duality" between row and column clustering to overcome sparsity and noise.

Co-clustering Example for Text Data
Co-clustering clusters both words and documents simultaneously, using the underlying co-occurrence frequency matrix.
(Figure: a word-document matrix reordered to expose word clusters and document clusters.)
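For a runnable taste of co-clustering, scikit-learn's SpectralCoclustering (Dhillon's spectral algorithm; a different method from the information-theoretic co-clustering referenced in these slides, but it likewise partitions words and documents simultaneously) applied to a toy co-occurrence matrix:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy co-occurrence matrix: rows = words, columns = documents,
# with two topic blocks.
A = np.array([[5, 4, 0, 0],
              [6, 5, 1, 0],
              [0, 1, 7, 6],
              [0, 0, 5, 8]])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A)
print("word clusters:    ", model.row_labels_)     # e.g. [0 0 1 1]
print("document clusters:", model.column_labels_)  # e.g. [0 0 1 1]
```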

Result of Co-Clustering
(Figure: the co-clustered matrix after reordering rows and columns.)
A presentation topic: Hierarchical Co-Clustering.

Clustering by Patterns

Clustering by Pattern Similarity (p-Clustering)
The microarray "raw" data shows 3 genes and their values in a multi-dimensional space:
- Parallel-coordinates plots
- Their patterns are difficult to find in the raw data
- This calls for "non-traditional" clustering

Clusters Are Clear After Projection
(Figure: the same three genes restricted to a selected subset of dimensions, where the shared pattern becomes apparent.)
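A sketch reproducing this effect with synthetic data: three "genes" share a constant-shift pattern on a subset of conditions, invisible in the raw parallel-coordinates plot but obvious after projection (the condition indices and values here are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
base = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
noisy = [1, 4, 7]                                # conditions outside the pattern
genes = []
for shift in (0.0, 2.0, 4.0):                    # gene = base + constant shift
    g = base + shift
    g[noisy] = rng.uniform(0.0, 10.0, len(noisy))  # garble non-pattern columns
    genes.append(g)

keep = [0, 2, 3, 5, 6, 8, 9]                     # the selected subspace
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
for g in genes:
    ax1.plot(range(len(g)), g, marker="o")           # raw: pattern is hidden
    ax2.plot(range(len(keep)), g[keep], marker="o")  # projected: parallel lines
ax1.set_title("All conditions (raw)")
ax2.set_title("Selected conditions (projected)")
plt.show()
```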

Motivation
E-Commerce: collaborative filtering.
(Table: a viewer-by-movie rating matrix, Viewers 1-5 by Movies 1-7.)


Motivation
DNA microarray analysis.
(Table: an expression matrix with ten genes as rows, including CTFC3, SSA1, FUN14, SP07, DEP1, and NTG1, and conditions CH1I, CH1B, CH1D, CH2I, and CH2B as columns.)


Motivation
Strong coherence is exhibited by the selected objects on the selected attributes:
- They are not necessarily close to each other; rather, they differ by a constant shift (e.g., rows (1, 3, 5) and (4, 6, 8) are perfectly coherent despite being far apart).
- This is an object/attribute-bias bi-cluster.

Challenges
The set of objects and the set of attributes are usually unknown.
Different objects/attributes may possess different biases, and such biases:
- may be local to the set of selected objects/attributes
- are usually unknown in advance
The matrix may also have many unspecified entries.

Previous Work
Subspace clustering:
- Identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by those attributes.
Collaborative filtering (Pearson R):
- Only considers the global offset of each object/attribute.
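A quick demonstration of the "global offset" limitation, using invented rating vectors: Pearson correlation is perfect under a constant shift across all attributes, but it misses a shift that holds only on a local subset of attributes:

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
b = a + 4.0                                   # global shift of `a`
c = np.array([5.0, 7.0, 9.0, 2.0, 8.0])      # shift of `a` only on attrs 0-2

print(np.corrcoef(a, b)[0, 1])   # 1.0  -- global bias is handled
print(np.corrcoef(a, c)[0, 1])   # < 1  -- local (subspace) bias is missed
```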

Bi-Cluster
Consists of a (sub)set of objects and a (sub)set of attributes:
- Corresponds to a submatrix
- Occupancy threshold α: each object/attribute has to be filled to a certain percentage
- Volume: the number of specified entries in the submatrix
- Base: the average value of each object/attribute (within the bi-cluster)

Bi-Cluster
(Table: the microarray matrix from the motivation, extended with an "Obj base" column holding each gene's row average and an "Attr base" row holding each condition's column average.)

Bi-Cluster
Perfect δ-cluster vs. imperfect δ-cluster.
Residue of an entry $d_{ij}$ in bi-cluster $(I, J)$:
$r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$,
where $d_{iJ}$ is the row base (mean of row $i$ over $J$), $d_{Ij}$ is the column base, and $d_{IJ}$ is the overall base. A perfect δ-cluster has zero residue at every entry.

Bi-Cluster
The smaller the average residue, the stronger the coherence.
Objective: identify δ-clusters whose residue is smaller than a given threshold δ.
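A short NumPy sketch of the mean squared residue H(I, J) for a fully specified sub-matrix (missing entries, handled by the occupancy threshold, are ignored here for simplicity):

```python
import numpy as np

def mean_squared_residue(D):
    """H(I,J): mean over (i,j) of (d_ij - d_iJ - d_Ij + d_IJ)^2."""
    row_base = D.mean(axis=1, keepdims=True)   # d_iJ
    col_base = D.mean(axis=0, keepdims=True)   # d_Ij
    all_base = D.mean()                        # d_IJ
    residue = D - row_base - col_base + all_base
    return (residue ** 2).mean()

# A perfect shifting bi-cluster: each row is a constant offset of the others.
perfect = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [0.0, 1.0, 2.0]])
print(mean_squared_residue(perfect))   # 0.0: zero residue everywhere
noisy = perfect + np.random.default_rng(1).normal(0.0, 0.1, perfect.shape)
print(mean_squared_residue(noisy))     # small but positive
```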

Cheng-Church Algorithm
- Find one bi-cluster.
- Replace the data in the first bi-cluster with random data.
- Find the second bi-cluster, and so on.
- The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.
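A simplified, runnable sketch of this masking loop. The single-bi-cluster step below uses greedy single-node deletion, a simplification of the full Cheng-Church procedure (which also performs multiple-node deletion and node addition); the thresholds and planted data are illustrative only:

```python
import numpy as np

def msr(D, rows, cols):
    """Mean squared residue of D[rows, cols], plus the residue matrix."""
    S = D[np.ix_(rows, cols)]
    R = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    return (R ** 2).mean(), R

def find_one_bicluster(D, delta):
    """Greedy single-node deletion: drop the worst row/column until H <= delta."""
    rows, cols = np.arange(D.shape[0]), np.arange(D.shape[1])
    while True:
        H, R = msr(D, rows, cols)
        if H <= delta or min(len(rows), len(cols)) <= 2:   # crude stopping rule
            return rows, cols
        row_scores = (R ** 2).mean(axis=1)   # each row's contribution to H
        col_scores = (R ** 2).mean(axis=0)   # each column's contribution
        if row_scores.max() >= col_scores.max():
            rows = np.delete(rows, row_scores.argmax())
        else:
            cols = np.delete(cols, col_scores.argmax())

def cheng_church(D, delta, n_clusters, seed=0):
    D, rng = D.copy(), np.random.default_rng(seed)
    lo, hi = D.min(), D.max()
    found = []
    for _ in range(n_clusters):
        rows, cols = find_one_bicluster(D, delta)
        found.append((rows, cols))
        # Mask the discovered bi-cluster with random data so later passes do
        # not rediscover it -- the source of the quality degradation noted above.
        D[np.ix_(rows, cols)] = rng.uniform(lo, hi, (len(rows), len(cols)))
    return found

# Demo on a matrix with one planted shifting pattern in the top-left corner.
rng = np.random.default_rng(1)
D = rng.uniform(0, 10, (30, 12))
D[:8, :6] = rng.uniform(0, 10, 6) + rng.uniform(0, 10, (8, 1))
print([(len(r), len(c)) for r, c in cheng_church(D, delta=0.5, n_clusters=2)])
```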

The FLOC Algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. If the clustering improved, return to step 2; otherwise, stop.

The FLOC Algorithm
Action: the change of membership of a row (or column) with respect to a cluster.
M + N actions are performed at each iteration, one per row and one per column (the illustration uses M = 4 rows and N = 3 columns).

The FLOC Algorithm
Gain of an action: the residue reduction incurred by performing the action.
Order of actions:
- Fixed order
- Random order
- Weighted random order
Complexity: O((M + N) · M · N · k · p)
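A sketch of one such gain computation, assuming an action toggles a single row's membership in one cluster (the names and the planted pattern are invented for illustration):

```python
import numpy as np

def msr(S):
    """Mean squared residue of a fully specified sub-matrix."""
    R = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    return (R ** 2).mean()

def row_action_gain(D, row_set, col_idx, r):
    """Residue reduction if row r flips its membership in the cluster
    (row_set, col_idx); a positive gain means the action helps."""
    before = msr(D[np.ix_(sorted(row_set), col_idx)])
    after = msr(D[np.ix_(sorted(row_set ^ {r}), col_idx)])  # toggle row r
    return before - after

# Rows 0-2 form a perfect shifting pattern on columns 0-2; row 4 is noise.
rng = np.random.default_rng(3)
D = rng.normal(size=(6, 5))
D[:3, :3] = np.array([1.0, 4.0, 2.0]) + np.array([[0.0], [2.0], [5.0]])
print(row_action_gain(D, {0, 1, 2, 4}, [0, 1, 2], r=4))  # > 0: drop row 4
```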

The FLOC Algorithm
Additional features:
- Maximum allowed overlap among clusters
- Minimum coverage of clusters
- Minimum volume of each cluster
These can be enforced by "temporarily blocking" certain actions during the mining process if they would violate a constraint.

Performance
Microarray data: 2884 genes, 17 conditions.
- The 100 bi-clusters with the smallest residue were returned.
- Average residue = […] (versus […] for the state-of-the-art method in the computational biology field).
- The average volume is 25% bigger.
- The response time is an order of magnitude faster.

Concluding Remarks
The bi-cluster model is proposed to capture coherent objects in an incomplete data set, built on the notions of:
- base
- residue
Many additional features can be accommodated (nearly for free).