EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science


1 EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/

2 Administrative. Final exam: Dec 15, 7:30-10:00. (2015-11-25, EECS 730)

3 Model Based Subspace Clustering: Microarray, Bi-clustering, δ-clustering

4 MicroArray Dataset

5 Gene Expression Matrix [Figure: two gene expression matrices with axes labeled Genes and Conditions; the conditions are labeled Time points and Cancer Tissues.]

6 Data Mining: Clustering. K-means clustering minimizes $\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $\mu_k$ is the centroid of cluster $C_k$.
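The K-means objective named on this slide is the within-cluster sum of squared distances to the cluster centroids. A minimal sketch of that quantity follows; the function name `kmeans_objective` and the sample points are illustrative assumptions, not part of the lecture.

```python
def kmeans_objective(points, labels, k):
    """Within-cluster sum of squared Euclidean distances to each centroid."""
    dim = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tight = kmeans_objective(points, [0, 0, 1, 1], k=2)  # groups nearby points
loose = kmeans_objective(points, [0, 1, 0, 1], k=2)  # mixes the two groups
print(tight, loose)
```

The assignment that groups nearby points gives the smaller objective, which is what K-means searches for.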

7 Clustering by Pattern Similarity (p-Clustering). The micro-array “raw” data shows 3 genes and their values in a multi-dimensional space (parallel coordinates plots). Difficult to find their patterns. “Non-traditional” clustering.

8 Clusters Are Clear After Projection

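9 Why p-Clustering? Microarray data analysis may need to cluster on thousands of dimensions (attributes) and discover both shift and scaling patterns. Clustering with a Euclidean distance measure? It cannot find shift patterns. Clustering on derived attributes $A_{ij} = a_i - a_j$? It introduces $N(N-1)$ dimensions. Bi-clustering uses the transformed mean-squared residue score of a submatrix $(I, J)$: $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$, where $d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}$, $d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}$, and $d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}$. A submatrix is a δ-cluster if $H(I, J) \le \delta$ for some $\delta > 0$. Problems with bi-cluster: it has no downward closure property, and due to averaging it may contain outliers yet still stay within the δ-threshold.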
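Two of the claims on this slide can be checked numerically: rows that differ only by a constant shift are far apart in Euclidean distance yet have zero mean squared residue, and averaging can hide an outlier under the δ-threshold. The sketch below assumes a fully specified submatrix; the function name, the sample matrices, and the value δ = 0.5 are made up for illustration.

```python
import math

def mean_squared_residue(sub):
    """H(I, J) for a fully specified submatrix given as a list of rows."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_mean = [sum(r) / n_cols for r in sub]                             # d_iJ
    col_mean = [sum(r[j] for r in sub) / n_rows for j in range(n_cols)]   # d_Ij
    all_mean = sum(sum(r) for r in sub) / (n_rows * n_cols)               # d_IJ
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            total += (sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
    return total / (n_rows * n_cols)

# Hypothetical rows that differ only by a constant shift.
shifted = [[1.0, 5.0, 3.0], [11.0, 15.0, 13.0], [101.0, 105.0, 103.0]]
h_shift = mean_squared_residue(shifted)        # essentially zero
far_apart = math.dist(shifted[0], shifted[2])  # large Euclidean distance

# Hypothetical 10 x 10 additive pattern with one entry pushed off by 5.
grid = [[10.0 * i + j for j in range(10)] for i in range(10)]
grid[0][0] += 5.0
h_outlier = mean_squared_residue(grid)
delta = 0.5  # assumed threshold
print(h_shift, far_apart, h_outlier, h_outlier <= delta)
```

The perturbed grid still qualifies as a δ-cluster under the assumed threshold even though one entry is off by 5, which illustrates the outlier problem the slide mentions.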

10 Motivation: DNA microarray analysis
        CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3   4392   284  4108   280   228
VPS8     401   281   120   275   298
EFB1     318   280    37   277   215
SSA1     401   292   109   580   238
FUN14   2857   285  2576   271   226
SP07     228   290    48   285   224
MDM10    538   272   266   277   236
CYS3     322   288    41   278   219
DEP1     312   272    40   273   232
NTG1     329   296    33   274   228

11 Motivation

12 Motivation. Strong coherence is exhibited by the selected objects on the selected attributes. They are not necessarily close to each other but rather bear a constant shift. Object/attribute bias. bi-cluster

13 Challenges. The set of objects and the set of attributes are usually unknown. Different objects/attributes may possess different biases, and such biases may be local to the set of selected objects/attributes and are usually unknown in advance. The data may have many unspecified entries.

14 Previous Work. Subspace clustering: identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by the attributes. Collaborative filtering (Pearson R): only considers the global offset of each object/attribute.

15 bi-cluster Terms. A bi-cluster consists of a (sub)set of objects and a (sub)set of attributes, and corresponds to a submatrix. Occupancy threshold α: each object/attribute has to be filled by a certain percentage. Volume: the number of specified entries in the submatrix. Base: the average value of each object/attribute (in the bi-cluster). Reference: Cheng & Church, Biclustering of Expression Data, ISMB’00.
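The volume and occupancy terms above can be made concrete with a small sketch. It assumes unspecified entries are stored as `None` and that "filled by a certain percentage" means a fraction of at least `alpha` in every selected row and column; the function names and the sample matrix are hypothetical.

```python
def volume(matrix, rows, cols):
    """Number of specified (non-None) entries in the submatrix."""
    return sum(1 for i in rows for j in cols if matrix[i][j] is not None)

def meets_occupancy(matrix, rows, cols, alpha):
    """True if every selected row and column is filled to at least alpha."""
    rows_ok = all(
        sum(matrix[i][j] is not None for j in cols) >= alpha * len(cols)
        for i in rows
    )
    cols_ok = all(
        sum(matrix[i][j] is not None for i in rows) >= alpha * len(rows)
        for j in cols
    )
    return rows_ok and cols_ok

# Hypothetical matrix with unspecified entries marked as None.
matrix = [
    [1, None, 3],
    [4, 5, 6],
    [None, None, 9],
]
all_rows, all_cols = [0, 1, 2], [0, 1, 2]
print(volume(matrix, all_rows, all_cols))                 # specified entries overall
print(meets_occupancy(matrix, all_rows, all_cols, 0.5))   # sparse last row fails
print(meets_occupancy(matrix, [0, 1], all_cols, 0.5))     # passes without that row
```

Dropping the sparsely filled row is what lets the remaining submatrix satisfy the threshold.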

16 bi-cluster
           CH1I  CH1B  CH1D  CH2I  CH2B  Obj base
CTFC3
VPS8        401         120         298       273
EFB1        318          37         215       190
SSA1
FUN14
SP07
MDM10
CYS3        322          41         219       194
DEP1
NTG1
Attr base   347          66         244       219

17 [Figure: 17 conditions, 40 genes.]

18 Motivation

19 [Figure: 17 conditions, 40 genes.]

20 Motivation: Co-regulated genes

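21 bi-cluster. Perfect δ-cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$ for every specified entry. Imperfect δ-cluster: the entries deviate from this, and the deviation is the residual $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$ is the base of object $i$, $d_{Ij}$ is the base of attribute $j$, and $d_{IJ}$ is the base of the whole bi-cluster.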
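As a worked check, the bases and residuals can be computed for the genes VPS8, EFB1, and CYS3 on the conditions CH1I, CH1D, and CH2B, using the expression values from the microarray table on slide 10. This is a minimal sketch for fully specified submatrices; the helper names `bases` and `residues` are illustrative. Adding SSA1 on the same conditions is included only to show an imperfect case.

```python
def bases(sub):
    """Object bases (row means), attribute bases (column means), cluster base."""
    n_rows, n_cols = len(sub), len(sub[0])
    obj_base = [sum(row) / n_cols for row in sub]
    attr_base = [sum(row[j] for row in sub) / n_rows for j in range(n_cols)]
    cluster_base = sum(sum(row) for row in sub) / (n_rows * n_cols)
    return obj_base, attr_base, cluster_base

def residues(sub):
    """Residual d_ij - d_iJ - d_Ij + d_IJ for every entry."""
    obj_base, attr_base, cluster_base = bases(sub)
    return [
        [sub[i][j] - obj_base[i] - attr_base[j] + cluster_base
         for j in range(len(sub[0]))]
        for i in range(len(sub))
    ]

# Values on CH1I, CH1D, CH2B taken from the slide 10 table.
vps8, efb1, cys3 = [401, 120, 298], [318, 37, 215], [322, 41, 219]
ssa1 = [401, 109, 238]

perfect = [vps8, efb1, cys3]
imperfect = perfect + [ssa1]

print(bases(perfect))       # matches the Obj base / Attr base entries on slide 16
print(residues(perfect))    # every residual is zero: a perfect δ-cluster
print(residues(imperfect))  # nonzero residuals once SSA1 is included
```

The three selected genes shift together by a constant amount on these conditions, so all their residuals vanish even though their raw values are far apart.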

22 bi-cluster. The smaller the average residue, the stronger the coherence. Objective: identify δ-clusters with residue smaller than a given threshold.

23 Cheng-Church Algorithm. Find one bi-cluster. Replace the data in the first bi-cluster with random data. Find the second bi-cluster, and so on. The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.
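The find, mask, repeat loop described on this slide can be sketched as follows. To find one bi-cluster, the sketch uses only greedy single node deletion (drop the row or column with the largest mean squared residue until H ≤ δ); the multiple node deletion and node addition phases of the published algorithm are omitted. The function names, the sample data, and the uniform masking range are assumptions made for illustration.

```python
import random

def msr_parts(data, rows, cols):
    """Return H(I, J) plus per-row and per-column mean squared residues."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, row_score, col_score

def find_one_bicluster(data, delta):
    """Greedy single node deletion until the residue score is at most delta."""
    rows = list(range(len(data)))
    cols = list(range(len(data[0])))
    h, row_score, col_score = msr_parts(data, rows, cols)
    while h > delta and len(rows) > 1 and len(cols) > 1:
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
        h, row_score, col_score = msr_parts(data, rows, cols)
    return rows, cols, h

def cheng_church_sketch(data, delta, n_clusters, seed=0):
    """Find a bi-cluster, overwrite it with random data, and repeat."""
    rng = random.Random(seed)
    work = [list(row) for row in data]  # leave the caller's data untouched
    low = min(min(row) for row in data)
    high = max(max(row) for row in data)
    found = []
    for _ in range(n_clusters):
        rows, cols, h = find_one_bicluster(work, delta)
        found.append((rows, cols, h))
        for i in rows:
            for j in cols:
                work[i][j] = rng.uniform(low, high)  # mask the found cluster
    return found

# Hypothetical data: three shifted rows followed by two noisy rows.
data = [
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [5, 90, 7, 60],
    [80, 3, 55, 9],
]
for rows, cols, h in cheng_church_sketch(data, delta=1.0, n_clusters=2):
    print(rows, cols, round(h, 4))
```

Because each later search runs on data that already contains random values, later clusters tend to be smaller or less coherent, which is the degradation the slide points out.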

24 The FLOC algorithm. (1) Generate initial clusters. (2) Determine the best action for each row and each column. (3) Perform the best action of each row and column sequentially. (4) Improved? If yes, return to step 2; if no, stop. Reference: Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02.

25 The FLOC algorithm. Action: the change of membership of a row (or column) with respect to a cluster. M+N actions are performed at each iteration. [Figure: example data matrix with labeled rows and columns; N=3, M=4.]

26 The FLOC algorithm. Gain of an action: the residual reduction incurred by performing the action. Order of actions: fixed order, random order, or weighted random order. Complexity: $O((M+N) \cdot MN \cdot k \cdot p)$, where $k$ is the number of clusters and $p$ is the number of iterations.
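The gain of a single row action can be sketched as the cluster residue before the membership change minus the residue after it. The sketch assumes the cluster residue is the mean absolute residual over a fully specified bi-cluster, which is one possible choice and an assumption here; the function names are illustrative, and the data are the VPS8, EFB1, CYS3, and SSA1 values on CH1I, CH1D, and CH2B from the slide 10 table.

```python
def cluster_residue(data, rows, cols):
    """Mean absolute residual of the bi-cluster (rows, cols); 0.0 if empty."""
    rows, cols = sorted(rows), sorted(cols)
    if not rows or not cols:
        return 0.0
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(data[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    total = sum(abs(data[i][j] - row_mean[i] - col_mean[j] + all_mean)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))

def row_action_gain(data, rows, cols, i):
    """Residue reduction from toggling row i's membership in the cluster."""
    toggled = rows ^ {i}  # remove i if present, add it otherwise
    return cluster_residue(data, rows, cols) - cluster_residue(data, toggled, cols)

# Rows: VPS8, EFB1, CYS3, SSA1; columns: CH1I, CH1D, CH2B.
data = [
    [401, 120, 298],
    [318, 37, 215],
    [322, 41, 219],
    [401, 109, 238],
]
cols = {0, 1, 2}
gain_remove = row_action_gain(data, {0, 1, 2, 3}, cols, 3)  # drop SSA1
gain_add = row_action_gain(data, {0, 1, 2}, cols, 3)        # add SSA1
print(gain_remove, gain_add)
```

Removing SSA1 has a positive gain because the remaining three genes form a perfect cluster, while adding it to that cluster has the opposite, negative gain.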

27 The FLOC algorithm. Additional features: maximum allowed overlap among clusters, minimum coverage of clusters, minimum volume of each cluster. These can be enforced by “temporarily blocking” certain actions during the mining process if such actions would violate some constraint.

28 Performance. Microarray data: 2884 genes, 17 conditions. The 100 bi-clusters with the smallest residue were returned; their average residue is 10.34. The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54. The average volume is 25% bigger. The response time is an order of magnitude faster.

29 Concluding Remarks. The bi-cluster model is proposed to capture coherent objects in an incomplete data set, using the notions of base and residue. Many additional features can be accommodated (nearly for free).

30 References
J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.

