A Fast Algorithm for Subspace Clustering by Pattern Similarity

Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, and Jian Pei
In the 16th International Conference on Scientific and Statistical Database Management, 2004
2019/1/3, presenter: 吳建良

Abstract
Clustering by pattern similarity: objects exhibit a coherent pattern of rise and fall in subspaces.
SeqClus, a novel model:
  Intuitive for capturing subspace pattern similarity
  Reduces computational complexity dramatically
  Discovers pattern similarity embedded in data sequences

Subspace pattern similarity
(a) Raw data: 3 objects, 10 columns
(b) A shifting pattern in subspace {b, c, h, j, e}
(c) A scaling pattern in subspace {f, d, a, g, i}
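A minimal sketch of how the two kinds of pattern could be checked, assuming the usual reading that a shifting pattern means a constant difference between two objects across the subspace and a scaling pattern means a constant ratio. The function names, the tolerance, and the values are illustrative assumptions; the values are not those of the figure.

```python
def is_shifting(x, y, subspace, tol=1e-9):
    """True if x[c] - y[c] is the same constant for every column c in subspace."""
    diffs = [x[c] - y[c] for c in subspace]
    return max(diffs) - min(diffs) <= tol


def is_scaling(x, y, subspace, tol=1e-9):
    """True if x[c] / y[c] is the same constant for every column c in subspace."""
    ratios = [x[c] / y[c] for c in subspace]
    return max(ratios) - min(ratios) <= tol


# Hypothetical objects, not the values shown in the figure.
x = {"b": 10.0, "c": 40.0, "h": 25.0, "f": 2.0, "d": 8.0}
y = {"b": 30.0, "c": 60.0, "h": 45.0, "f": 6.0, "d": 24.0}

shift_ok = is_shifting(x, y, ["b", "c", "h"])  # y = x + 20 on {b, c, h}
scale_ok = is_scaling(x, y, ["f", "d"])        # y = 3 * x on {f, d}
mixed_ok = is_shifting(x, y, ["b", "f"])       # differences -20 and -4 disagree
```

Neither relation holds over all columns at once, which is why the patterns are sought in subspaces.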

Applications Analysis of Large Scientific Datasets
Gene expression data
Example: VPS8, EFB1, CYS3 change coherently under {CH1I, CH1D, CH2B}

Applications (cont.)
Example 1. Counting: How many genes have an expression level in sample CH1I that is about 100±5 units higher than that in CH2B, 280±5 units higher than that in CH1D, and 75±5 units higher than that in CH2I?
Example 2. Clustering: Find clusters of genes that exhibit coherent subspace patterns, given the following constraints: i) the subspace pattern has dimensionality higher than minCols; and ii) the number of objects in the cluster is larger than minRows.
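The counting query of Example 1 can be sketched as a brute-force scan. This is only an illustration of what is being counted, not the SeqClus algorithm; the gene identifiers g1 to g3 and all expression values are hypothetical, and `count_matches` is a name introduced here.

```python
def count_matches(rows, base, constraints, tol):
    """Count rows where rows[base] - rows[col] is within tol of each target offset."""
    return sum(
        all(abs((row[base] - row[col]) - target) <= tol
            for col, target in constraints.items())
        for row in rows.values()
    )


# Hypothetical expression levels; g1..g3 are placeholder gene ids.
genes = {
    "g1": {"CH1I": 318, "CH2B": 220, "CH1D": 37, "CH2I": 245},
    "g2": {"CH1I": 400, "CH2B": 297, "CH1D": 118, "CH2I": 322},
    "g3": {"CH1I": 150, "CH2B": 140, "CH1D": 100, "CH2I": 90},
}

# CH1I is about 100 above CH2B, 280 above CH1D, and 75 above CH2I (each +/- 5).
constraints = {"CH2B": 100, "CH1D": 280, "CH2I": 75}
n = count_matches(genes, "CH1I", constraints, tol=5)  # g1 and g2 qualify
```

A scan like this touches every object for every query, which is the cost the counting structures later in the talk are meant to avoid.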

Applications (cont.) Discovery of Sequential Patterns
Network event logs Focus on two attributes: Event and Timestamp

Applications (cont.) Example 3. Sequential Pattern
Mining subspace patterns over a sliding window, with events as conditions on the X axis (Event1, Event2, Event3, …) and timestamps on the Y axis (t1, t2, t3, t4).
Example: event CiscoDCDLinkUp is followed by MLMStatusUp, which is followed in turn by CiscoDCDLinkUp, under the constraint that the interval between the first two events is about 20±2 seconds and the interval between the 1st and 3rd events is about 40±2 seconds.
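As a rough illustration of the constraint in Example 3, the sketch below counts occurrences of an event sequence whose time offsets from the first event fall within a tolerance. It is a naive scan over a hypothetical log, not the method of the paper; `count_pattern` and the log entries are assumptions made for the example.

```python
def count_pattern(log, events, offsets, tol):
    """Count occurrences where events[i] happens offsets[i] (+/- tol) seconds
    after an occurrence of events[0]."""
    hits = 0
    for name, t0 in log:
        if name != events[0]:
            continue
        # Every later step needs a matching event at the expected offset.
        if all(
            any(n == ev and abs((t - t0) - off) <= tol for n, t in log)
            for ev, off in zip(events[1:], offsets[1:])
        ):
            hits += 1
    return hits


# Hypothetical log of (event, timestamp in seconds).
log = [
    ("CiscoDCDLinkUp", 0), ("MLMStatusUp", 21), ("CiscoDCDLinkUp", 39),
    ("MLMStatusUp", 100), ("CiscoDCDLinkUp", 200),
]

hits = count_pattern(
    log,
    ["CiscoDCDLinkUp", "MLMStatusUp", "CiscoDCDLinkUp"],
    [0, 20, 40],
    tol=2,
)
```

Only the occurrence starting at timestamp 0 satisfies both intervals (21 and 39 seconds), so `hits` is 1.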

Object Definition
Use sequences to represent objects in a tabular dataset D.
A = {c1, c2, …, cn}: the set of columns, on which a total order is defined.
Object x: <(c1, x_{c1}), (c2, x_{c2}), …, (cn, x_{cn})>, where x_{ci} is the value of x in column ci.
Example: EFB1: <(CH1I, 318), (CH1B, 280), (CH1D, 37), …>

Sequence-based pattern similarity
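Distance function: given two objects x and y, a subspace S, and an arbitrary dimension ck in S,
dist_{k,S}(x, y) = max over ci in S of |(x_{ci} − y_{ci}) − (x_{ck} − y_{ck})|    (1)
Example (figure): objects x and y plotted over columns c1, c3, c2, c4, with dist_{k,S}(x, y) = 3.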
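A small sketch of the distance function, assuming Eq. (1) takes the form dist_{k,S}(x, y) = max over ci in S of |(x_{ci} − y_{ci}) − (x_{ck} − y_{ck})|. That form is a reconstruction, and the object values below are hypothetical rather than read off the figure.

```python
def dist(x, y, subspace, k):
    """Sequence-based distance with reference column k (assumed form of Eq. 1)."""
    base = x[k] - y[k]
    return max(abs((x[c] - y[c]) - base) for c in subspace)


# Hypothetical values, not the ones plotted on the slide.
x = {"c1": 4, "c2": 5, "c3": 2, "c4": 3}
y = {"c1": 1, "c2": 1, "c3": 2, "c4": 2}
S = ["c1", "c2", "c3", "c4"]

d_c1 = dist(x, y, S, "c1")  # differences 3, 4, 0, 1 measured against 3
d_c3 = dist(x, y, S, "c3")  # the same differences measured against 0
```

Here `d_c1` is 3 and `d_c3` is 4, so the value depends on the reference column, but only within a bounded factor.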

Sequence-based pattern similarity
Property1 Proof:
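The statement and proof on this slide did not survive extraction. As a hedged illustration only, assuming the distance has the form given below, one bound that follows directly is that changing the reference column can at most double the distance. Whether this is exactly the slide's Property 1 is an assumption.

```latex
\begin{align*}
\text{Let } d_i &= x_{c_i} - y_{c_i}, \qquad
  \mathrm{dist}_{k,S}(x,y) = \max_{c_i \in S} \lvert d_i - d_k \rvert .\\
\text{If } \mathrm{dist}_{k_1,S}(x,y) &\le \delta, \text{ then for any } c_{k_2} \in S
  \text{ and any } c_i \in S:\\
\lvert d_i - d_{k_2} \rvert
  &\le \lvert d_i - d_{k_1} \rvert + \lvert d_{k_1} - d_{k_2} \rvert
   \le \delta + \delta = 2\delta ,\\
\text{so } \mathrm{dist}_{k_2,S}(x,y) &\le 2\delta .
\end{align*}
```

The second term is bounded by δ because c_{k2} is itself a column of S, so it is covered by the maximum taken with reference column c_{k1}.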

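Pattern Definition
Pattern p: a tuple (T, δ), where T = <(c1, v1), …, (ck, vk)> is an ordered sequence of (column, value) pairs and δ is a tolerance.
Object x exhibits pattern p in subspace S = {c1, …, ck} if
|(x_{ci} − x_{c1}) − (vi − v1)| ≤ δ for 1 ≤ i ≤ k    (2)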

Pattern Definition (cont.)
Example:
  Pattern p: <(c1,0), (c2,2), (c3,1)>, δ = 2
  Object x: <(c1,2), (c2,3), (c3,1)>
  x shifted so that its c1 value is 0: x' = <(c1,0), (c2,1), (c3,-1)>
  (Figure: plot over columns c1, c3, c2 with a band of +δ and −δ.)
High density pattern: the number of objects that satisfy Eq. (2) reaches a user-defined density threshold.
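The example on this slide can be checked with a few lines, assuming "exhibits" means that, after shifting both the object and the pattern so their first value is 0, every column differs by at most δ. That reading and the function name `exhibits` are assumptions; the pattern and object are the ones from the slide.

```python
def exhibits(obj, pattern, delta):
    """Check whether obj follows pattern within tolerance delta.

    obj and pattern are lists of (column, value) pairs over the same columns;
    both are shifted so that their first value becomes 0 before comparing.
    """
    x0 = obj[0][1]
    p0 = pattern[0][1]
    return all(
        oc == pc and abs((ov - x0) - (pv - p0)) <= delta
        for (oc, ov), (pc, pv) in zip(obj, pattern)
    )


p = [("c1", 0), ("c2", 2), ("c3", 1)]
x = [("c1", 2), ("c2", 3), ("c3", 1)]

x_shifted = [(c, v - x[0][1]) for c, v in x]   # the x' of the slide
match_2 = exhibits(x, p, delta=2)              # deviations 0, 1, 2
match_1 = exhibits(x, p, delta=1)              # the deviation of 2 on c3 is too large
```

With δ = 2 the object matches; tightening δ to 1 rejects it because of column c3.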

Construct a Counting tree
A compact summary structure of high-density patterns, similar to a suffix trie.
Create a counting tree:
  For each object x, insert its relevant subsequences (subject to a minimum length) into the tree. At the node t where a subsequence ends, increase t's count by 1.
  With a depth-first traversal, label each tree node t with a triple (ID1, ID2, Count).

Construct a Counting tree (cont.)
Relevant subsequences: the relevant subsequences of an object x in an n-dimensional space are
  x^i = <(ci, 0), (c{i+1}, x_{c{i+1}} − x_{ci}), …, (cn, x_{cn} − x_{ci})>
that is, x shifted so that its value in the starting column ci is 0.
Node label (ID1, ID2, Count):
  ID1: unique identifier of node t
  ID2: the largest ID1 among t's descendant nodes
  Count: if t is a leaf node, the number of objects that end at t; otherwise, the sum of the counts of its child nodes

Example: =2 x c1 c3 c2 c4 z y x1 c1 c3 c2 c4 x3 x2 y1 c1 c3 c2 c4 y3
Relevant subsequences x c1 c3 c2 c4 4 3 2 z 1 y x1 c1 c3 c2 c4 -1 -4 -2 x3 x2 -3 y1 c1 c3 c2 c4 1 -2 y3 y2 -3 -1 z1 c1 c3 c2 c4 1 2 z3 -2 z2 -1 c1 c2 c3 c4 [1, , ] [2, , ] [3, , ] [4, ,1] [1,9, ] [2,4, ] [3,4, ] [4,4,1] [ , ,1] [1,9,3] [2,4,1] [3,4,1] [4,4,1] x -1 -4 -2 1 y -2 x, y -3 -1 [ , ,1] [5,9,2] [6,7,1] [7,7,1] [5, , ] [6, , ] [7, ,1] [5,9, ] [6,7, ] [7,7,1] z 2 [8,9, ] [9,9,1] [8, , ] [9, ,1] [8,9,1] [9,9,1] [ , ,1] [10, , ] [11, , ] [12, ,2] [10,14, ] [11,12, ] [12,12,2] [ , ,2] [ , ,1] [10,14,3] [11,12,2] [12,12,2] z 1 -1 [13,14,1] [14,14,1] [ , ,1] [13, , ] [14, ,1] [13,14, ] [14,14,1]
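The tree and its (ID1, ID2, Count) labels can be reproduced with a short sketch. The raw values of x, y, and z are hypothetical, chosen only so that their shifted subsequences match the branch values of the example (-1, -4, -2 for x, and so on). The sketch inserts the subsequences starting at c1 and c2, which is what the example tree contains, and sidesteps how the length threshold is applied. All function names are introduced here.

```python
COLS = ["c1", "c2", "c3", "c4"]


def relevant_subsequence(obj, start):
    """Shift obj so that column COLS[start] has value 0; keep columns from start on."""
    base = obj[COLS[start]]
    return [(c, obj[c] - base) for c in COLS[start:]]


def build_tree(objects, starts):
    """Insert each object's relevant subsequences into a nested-dict trie."""
    root = {"children": {}, "count": 0}
    for obj in objects:
        for s in starts:
            node = root
            for item in relevant_subsequence(obj, s):
                node = node["children"].setdefault(
                    item, {"children": {}, "count": 0})
            node["count"] += 1  # one more object ends at this node
    return root


def label(root):
    """Depth-first labelling; returns {path: (ID1, ID2, Count)}."""
    labels, next_id = {}, [1]

    def visit(node, path):
        id1 = next_id[0]
        next_id[0] += 1
        count = node["count"]  # objects ending here (non-zero only at leaves here)
        for item, child in node["children"].items():
            count += visit(child, path + (item,))
        labels[path] = (id1, next_id[0] - 1, count)  # ID2 = last ID in the subtree
        return count

    for item, child in root["children"].items():
        visit(child, (item,))
    return labels


# Hypothetical raw values that yield the subsequences of the example.
x = {"c1": 5, "c2": 4, "c3": 1, "c4": 3}
y = {"c1": 2, "c2": 3, "c3": 0, "c4": 2}
z = {"c1": 1, "c2": 2, "c3": 3, "c4": 1}

labels = label(build_tree([x, y, z], starts=[0, 1]))
```

Because children are visited in insertion order, the IDs come out as in the example: the c1 subtree takes IDs 1 to 9 and the c2 subtree takes IDs 10 to 14.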

Counting list: count pattern occurrences during the depth-first traversal
Counts within a list are cumulative, so the second entry of (c1,c4,0) carries 1+1 = 2.
Link head      List of node labels
(c1,c1,0)      [1,9,3]
(c1,c2,-1)     [2,4,1]
(c1,c3,-4)     [3,4,1]
(c1,c4,-2)     [4,4,1]
(c1,c2,1)      [5,9,2]
(c1,c3,-2)     [6,7,1]
(c1,c4,0)      [7,7,1] [9,9,2]
(c1,c3,2)      [8,9,1]
(c2,c2,0)      [10,14,3]
(c2,c3,-3)     [11,12,2]
(c2,c4,-1)     [12,12,2] [14,14,3]
(c2,c3,1)      [13,14,1]

Counting Pattern Occurrence
(Figure: a counting tree over columns c1 to c5 with node labels [1,12,3], [2,5,1], [3,5,1], [4,5,1], [5,5,1], [6,12,2], [7,9,1], [8,9,1], [9,9,1], [10,12,1], [11,12,1], [12,12,1].)
Link head      List of node labels
(c1,c1,0)      [1,12,3]
(c1,c2,-1)     [2,5,1]
(c1,c3,1)      [3,5,1] [7,9,2]
(c1,c4,1)      [4,5,1] [8,9,2] [11,12,3]
(c1,c5,2)      [5,5,1] [9,9,2] [12,12,3]
(c1,c2,1)      [6,12,2]
(c1,c3,2)      [10,12,1]
Number of occurrences of <(c1,0), (c3,1), (c4,1)> = 2, obtained by following the lists (c1,c1,0), (c1,c3,1), and (c1,c4,1).
Counting rule: let v and w be the first and last nodes of a list that fall within the current ID range, and let u be the node that precedes v in the list. If ID_v is the first element of the list, there are cnt_w objects; otherwise, there are cnt_w − cnt_u objects.
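The counting rule can be sketched with binary search over the lists, using the three lists from this slide. The sketch assumes that a node lies "within" a range [ID1, ID2] when its own ID1 falls in that interval, and that list counts are cumulative; `nodes_in` and `count_pattern` are names introduced for the illustration.

```python
from bisect import bisect_left, bisect_right

# Counting lists from the slide: key -> [(ID1, ID2, cumulative count)], sorted by ID1.
lists = {
    ("c1", "c1", 0): [(1, 12, 3)],
    ("c1", "c3", 1): [(3, 5, 1), (7, 9, 2)],
    ("c1", "c4", 1): [(4, 5, 1), (8, 9, 2), (11, 12, 3)],
}


def nodes_in(entries, lo, hi):
    """Index slice [i, j) of the entries whose ID1 lies in [lo, hi]."""
    ids = [e[0] for e in entries]
    return bisect_left(ids, lo), bisect_right(ids, hi)


def count_pattern(lists, pattern):
    """pattern: [(column, value), ...] whose first value is 0."""
    first = pattern[0][0]
    head = lists[(first, first, 0)]
    ranges = [(e[0], e[1]) for e in head]
    total = head[-1][2]
    for col, val in pattern[1:]:
        entries = lists[(first, col, val)]
        new_ranges, total = [], 0
        for lo, hi in ranges:
            i, j = nodes_in(entries, lo, hi)
            if i == j:
                continue  # no node of this list falls inside the range
            new_ranges.extend((e[0], e[1]) for e in entries[i:j])
            # Counting rule: cnt_w, or cnt_w - cnt_u when a node u precedes v.
            total += entries[j - 1][2] - (entries[i - 1][2] if i > 0 else 0)
        ranges = new_ranges
    return total


occurrences = count_pattern(lists, [("c1", 0), ("c3", 1), ("c4", 1)])
```

For the pattern above, node [11,12,3] of list (c1,c4,1) is skipped because it lies outside both [3,5] and [7,9], which leaves 1 + (2 − 1) = 2 occurrences.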

Clustering
Construct Cluster Tree
  Each node is a triple (item, count, range-list).
  Count the occurrences of all 2-column patterns.
  If a pattern is frequent (count ≧ minRows), insert it under the root node of the cluster tree.
  For each node p on the current level, join p with its eligible nodes to derive the nodes on the next level.

Clustering (cont.)
A node q is one of node p's eligible nodes if:
  q is on the same level as p;
  if p denotes a-b=v and q denotes c-d=v', then b = d and a precedes c in the column order.
Join operation
  p: (a-b=v, count, range-list)
  q: (c-d=v', count', range-list')
  New node: (c-b=v', count', range-list')

Clustering (cont.) minRows=2
(Figure: the cluster tree. X marks nodes whose count is below minRows.)
Root
  (c2-c1=-1, 1, [2,4])  X
  (c2-c1=1, 2, [5,9])
    joined with (c4-c1=0, 2, [7,7], [9,9]), giving child (c4-c1=0, 2, [7,7], [9,9])
  (c3-c1=-4, 1, [3,4])  X
  (c3-c1=-2, 1, [6,7])  X
  (c3-c1=2, 1, [8,9])  X
  (c4-c1=-2, 1, [4,4])  X
  (c4-c1=0, 2, [7,7], [9,9])
  (c3-c2=-3, 2, [11,12])
    joined with (c4-c2=-1, 3, [12,12], [14,14]), giving child (c4-c2=-1, 3, [12,12], [14,14])
  (c3-c2=1, 1, [13,14])  X
  (c4-c2=-1, 3, [12,12], [14,14])
Each path from the root to a leaf represents a frequent pattern:
  <(c1,0),(c2,1),(c4,0)>
  <(c1,0),(c4,0)>
  <(c2,0),(c3,-3),(c4,-1)>
  <(c2,0),(c4,-1)>
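The first two levels of this cluster tree can be sketched as follows. The input is the set of 2-column patterns with the node ranges and per-node counts from the counting-tree example. The join shown here keeps only those ranges of q that are nested inside a range of p, and requires the same base column with p's column ordered before q's; both are one reading of the slide and should be treated as assumptions, as are the function names. The slide itself displays the joined node with q's own count and range-list.

```python
MIN_ROWS = 2

# (column, base column, value) -> [(ID1, ID2, node count)], from the counting tree.
lists = {
    ("c2", "c1", -1): [(2, 4, 1)],
    ("c2", "c1", 1): [(5, 9, 2)],
    ("c3", "c1", -4): [(3, 4, 1)],
    ("c3", "c1", -2): [(6, 7, 1)],
    ("c3", "c1", 2): [(8, 9, 1)],
    ("c4", "c1", -2): [(4, 4, 1)],
    ("c4", "c1", 0): [(7, 7, 1), (9, 9, 1)],
    ("c3", "c2", -3): [(11, 12, 2)],
    ("c3", "c2", 1): [(13, 14, 1)],
    ("c4", "c2", -1): [(12, 12, 2), (14, 14, 1)],
}


def total(nodes):
    """Number of objects covered by a list of (ID1, ID2, count) nodes."""
    return sum(c for _, _, c in nodes)


def join(p_key, p_nodes, q_key, q_nodes):
    """Join p (a-b=v) with q (c-b=v'); returns q's nodes nested in p's ranges."""
    (a, b, _), (c, d, _) = p_key, q_key
    # Plain string comparison orders c1..c4 correctly in this small example.
    if b != d or not a < c:
        return []
    return [n for n in q_nodes
            if any(lo <= n[0] and n[1] <= hi for lo, hi, _ in p_nodes)]


# Level 1: frequent 2-column patterns.
level1 = {k: v for k, v in lists.items() if total(v) >= MIN_ROWS}

# Level 2: join every frequent node with its eligible nodes.
level2 = {}
for pk, pn in level1.items():
    for qk, qn in level1.items():
        joined = join(pk, pn, qk, qn)
        if joined and total(joined) >= MIN_ROWS:
            level2[(pk, qk)] = joined
```

The two surviving joins correspond to the 3-column patterns <(c1,0),(c2,1),(c4,0)> and <(c2,0),(c3,-3),(c4,-1)> listed on the slide.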

Experiment
Synthetic data
  Tabular form: value range 0~300; embedded clusters with δ = 0, 2, 4, 6, …
  Sequential form: (id, timestamp) pairs, generated by a probabilistic distribution

Experiment (cont.) Scalability

Experiment (cont.)
Gene expression data
  Yeast micro-array: 2884 genes, 17 conditions; expression level range 0~600, discretized into 40 bins
Event management data
  NETVIEW: 241 event types; 10 days' worth of event logs

Experiment (cont.) Yeast micro-array NETVIEW
