1 SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes
K.L. Ong, W. Li, W.K. Ng, and E.P. Lim
Proc. of the 6th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 2004), Zaragoza, Spain, September 2004
2019/5/25  Presenter: 吳建良

2 Outline
- Clustering a Data Stream of Categorical Attributes
- SCLOPE Algorithm
  - CluStream Framework
  - CLOPE Algorithm
  - FP-Tree-Like Structure
- Experimental Results

3 Clustering a Data Stream of Categorical Attributes
- Technical challenges
  - High dimensionality
  - Sparsity of categorical datasets
- Additional stream constraints
  - One-pass I/O
  - Low CPU consumption

4 SCLOPE Algorithm
- Adopts two aspects of the CluStream framework
  - Pyramidal time frame: store summary statistics at different time periods
  - Separation of the clustering process
    - Online micro-clustering component
    - Offline macro-clustering component
- SCLOPE
  - Online: pyramidal time frame, FP-tree-like structure
  - Offline: CLOPE clustering algorithm

5 CLOPE Clustering Algorithm
- Cluster quality measure: histogram
  - A larger height-to-width ratio means better intra-cluster similarity
  - H = S / W, where S is the size of the cluster histogram (total number of item occurrences) and W its width (number of distinct items)
- Example transactions:

  Tid  Items
  1    ab
  2    abc
  3    acd
  4    de
  5    def
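
As an illustration (not part of the slides), a minimal Python sketch that builds the cluster histogram for the first three example transactions and computes S, W, and H:

    from collections import Counter

    # Cluster containing the first three example transactions.
    cluster = ["ab", "abc", "acd"]

    # Cluster histogram: occurrence count of every item in the cluster.
    histogram = Counter(item for record in cluster for item in record)

    S = sum(histogram.values())   # size  = total item occurrences = 8
    W = len(histogram)            # width = number of distinct items = 4 (a, b, c, d)
    H = S / W                     # height-to-width ratio = 2.0

    print(histogram, S, W, H)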

6 CLOPE Clustering Algorithm (cont.)
- Suppose a clustering C = {C1, C2, ..., Ck}
- Height-to-width ratio (gradient): G(Ci) = H(Ci) / W(Ci) = S(Ci) / W(Ci)^2
- Criterion function (profit):
  Profit_r(C) = [ Σi S(Ci) / W(Ci)^r × |Ci| ] / Σi |Ci|
- r: repulsion, controls the level of intra-cluster similarity
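
A minimal Python sketch of this profit criterion (transactions are written as strings of single-character items; r = 2 is an assumption here, chosen because it reproduces the 0.55 and 0.41 values quoted on the next slide):

    from collections import Counter

    def profit(clusters, r=2.0):
        # Sum over clusters of S(Ci) / W(Ci)^r * |Ci|, divided by the total record count.
        total, records = 0.0, 0
        for cluster in clusters:
            histogram = Counter(item for rec in cluster for item in rec)
            total += sum(histogram.values()) / (len(histogram) ** r) * len(cluster)
            records += len(cluster)
        return total / records

    print(profit([["ab", "abc"]]))       # ~0.556: T2 added to C1 (slide quotes 0.55)
    print(profit([["ab"], ["abc"]]))     # ~0.417: T2 in its own cluster (slide quotes 0.41)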

7 CLOPE Clustering Algorithm (cont.)
- Initial phase (using the example transactions above):
  - T1 starts a new cluster: C1 = {T1}
  - T2: tentatively added to C1 = {T1, T2}, Profit = 0.55; creating a new cluster C2 = {T2}, Profit = 0.41. T2 joins C1
  - T3: tentatively added to C1 = {T1, T2, T3}, Profit = 0.5; creating a new cluster C2 = {T3}, Profit = 0.41. T3 joins C1
- Final result: C1 = {T1, T2, T3}, C2 = {T4, T5}

8 CLOPE Clustering Algorithm (cont.)
- Iteration phase:

  repeat
      moved = false
      for all transactions t in the database
          move t to the existing or new cluster Cj that maximizes profit
          if Ci ≠ Cj then
              write <t, j>
              moved = true
  until not moved
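
A self-contained Python sketch of this profit-driven reassignment loop. It is a simplified reading of CLOPE: the profit is recomputed for every candidate placement rather than using the incremental deltas a real implementation would use, and profit() is repeated from the previous sketch so the block stands alone.

    from collections import Counter

    def profit(clusters, r=2.0):
        total, n = 0.0, 0
        for c in clusters:
            if not c:
                continue
            h = Counter(item for t in c for item in t)
            total += sum(h.values()) / (len(h) ** r) * len(c)
            n += len(c)
        return total / n if n else 0.0

    def iteration_phase(transactions, assignment, r=2.0):
        # assignment[tid] = index of the cluster transaction tid currently lives in.
        clusters = [[] for _ in range(max(assignment.values()) + 1)]
        for tid, t in enumerate(transactions):
            clusters[assignment[tid]].append(t)
        moved = True
        while moved:                                   # repeat ... until not moved
            moved = False
            for tid, t in enumerate(transactions):     # for all transactions t
                old = assignment[tid]
                clusters[old].remove(t)
                clusters.append([])                    # candidate new cluster
                best_j, best_p = old, float("-inf")
                for j, c in enumerate(clusters):       # move t where profit is maximal
                    c.append(t)
                    p = profit(clusters, r)
                    c.remove(t)
                    if p > best_p:
                        best_p, best_j = p, j
                clusters[best_j].append(t)
                if not clusters[-1]:                   # drop the unused candidate cluster
                    clusters.pop()
                if best_j != old:                      # "write <t, j>"
                    assignment[tid] = best_j
                    moved = True
        return [c for c in clusters if c]

    # Usage: start from the slide-7 result and let it converge (it is already stable).
    data = ["ab", "abc", "acd", "de", "def"]
    print(iteration_phase(data, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}))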

9 Maintain Summary Statistics
- Data streams
  - A set of records R1, ..., Ri, ... arriving at time stamps t1, ..., ti, ...
  - Each record R contains attributes A = {a1, a2, ..., aj}
- A micro-cluster within a time window [tp, tq] is defined as a tuple of
  - a vector of record identifiers, and
  - a cluster histogram, with
    - width W: number of distinct attribute values in the cluster
    - size S: total number of attribute-value occurrences in the cluster
    - height H: size-to-width ratio, H = S / W
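
One way to picture this summary structure in Python (a sketch only, not the paper's data layout; the field names are illustrative):

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class MicroCluster:
        record_ids: list = field(default_factory=list)        # vector of record identifiers
        histogram: Counter = field(default_factory=Counter)   # attribute-value counts

        def add(self, rid, record):
            self.record_ids.append(rid)
            self.histogram.update(record)

        @property
        def width(self):    # number of distinct attribute values
            return len(self.histogram)

        @property
        def size(self):     # total attribute-value occurrences
            return sum(self.histogram.values())

        @property
        def height(self):   # size-to-width ratio
            return self.size / self.width if self.width else 0.0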

10 FP-Tree-Like Structure
- Drawbacks of CLOPE
  - Multiple scans of the dataset
  - Multiple evaluations of the criterion function for each record
- FP-tree-like structure
  - Requires only two scans of the dataset
    - Scan 1: determine the singleton frequencies
    - Scan 2: insert each record into the FP-tree after arranging its attributes in descending singleton-frequency order
  - Records sharing common prefixes share tree paths
  - No need to compute the clustering criterion during this phase

11 Construct FP-Tree-Like Structure
- Original transactions:

  Tid  Items
  1    ab
  2    abc
  3    acd
  4    de
  5    def

- Scan the database once to obtain singleton frequencies: a:3, d:3, b:2, c:2, e:2, f:1
- Arrange each transaction by descending singleton frequency:

  Tid  Items
  1    ab
  2    abc
  3    adc
  4    de
  5    def

- Resulting tree (indentation shows parent-child links):

  null
    a:3
      b:2
        c:1
      d:1
        c:1
    d:2
      e:2
        f:1
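
A compact Python sketch of this two-scan construction (illustrative, not the paper's implementation): the first scan counts singleton frequencies, the second inserts each reordered record so that common prefixes share nodes; every root-to-leaf path then corresponds to one micro-cluster.

    from collections import Counter

    class Node:
        def __init__(self, item=None):
            self.item, self.count, self.children = item, 0, {}

    def build_tree(transactions):
        # Scan 1: singleton frequencies.
        freq = Counter(item for t in transactions for item in t)
        root = Node()
        # Scan 2: insert each record with attributes in descending singleton frequency.
        for t in transactions:
            ordered = sorted(t, key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                node = node.children.setdefault(item, Node(item))
                node.count += 1
        return root

    def paths(node, prefix=()):
        # Enumerate root-to-leaf paths, i.e. the candidate micro-clusters.
        if not node.children:
            yield prefix
        for child in node.children.values():
            yield from paths(child, prefix + ((child.item, child.count),))

    tree = build_tree(["ab", "abc", "acd", "de", "def"])
    for p in paths(tree):
        print(p)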

12 FP-Tree-Like Structure
- Each path (from the root to a leaf node) is a micro-cluster
- The number of micro-clusters depends on the available memory space
- Merge strategy (when memory runs low)
  - Select the node representing the longest common prefix
  - Select any two paths passing through that node
  - Merge their corresponding micro-clusters (record-id vectors and cluster histograms)
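
Under the micro-cluster summary above, merging two paths amounts to concatenating their record-id vectors and summing their histograms; a minimal sketch of that reading, with (record_ids, histogram) pairs:

    from collections import Counter

    def merge_micro_clusters(a, b):
        # Merged histogram is the sum of the two histograms, so the merged
        # width, size, and height follow directly from it.
        ids_a, hist_a = a
        ids_b, hist_b = b
        return ids_a + ids_b, hist_a + hist_b

    # Example: the two paths passing through node a:3 in the tree above.
    p1 = ([1, 2], Counter({"a": 2, "b": 2, "c": 1}))   # records 1, 2 (ab, abc)
    p2 = ([3],    Counter({"a": 1, "d": 1, "c": 1}))   # record 3 (adc)
    print(merge_micro_clusters(p1, p2))
    # -> ([1, 2, 3], Counter({'a': 3, 'b': 2, 'c': 2, 'd': 1}))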

13 Online Micro-clustering Component of SCLOPE
On the beginning of window wi do
  1: if (i = 0) then Q′ ← {a random order of v1, ..., v|A|}
  2: T ← new FP-tree and Q ← Q′
  3: for all (incoming records R) do
  4:   order R according to Q
  5:   if (R can be inserted completely along an existing path Pi in T) then
  6:     insert R along Pi and update its cluster histogram
  7:   else
  8:     Pj ← new path in T, with a new cluster histogram for Pj
  9:     insert R along Pj and update its cluster histogram

14 Online Micro-clustering Component of SCLOPE (cont.)
On the end of window wi do
  10: L ← {<n, height(n)> : n is a node in T with more than 2 children}
  11: order L according to height(n)
  12: while (…) do
  13:   select the <n, height(n)> with the largest value
  14:   select two paths Pi, Pj that pass through n
  15:   merge the micro-clusters of Pi and Pj
  16:   delete …
  17: output micro-clusters and cluster histograms

15 Offline Macro-clustering Component of SCLOPE
- Time horizon h and repulsion r
  - h: spans one or more windows
  - r: controls the intra-cluster similarity
- Profit function as the clustering criterion
- Each micro-cluster is treated as a pseudo-record
  - The number of micro-clusters is much smaller than the number of physical records, so it takes less time to converge on the clustering criterion
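
Because the histogram of a group of micro-clusters is simply the sum of their histograms, the same profit criterion can be evaluated directly over these summaries. A sketch under that reading (the grouping in the example is illustrative, not the paper's result):

    from collections import Counter

    def macro_profit(groups, r=2.0):
        # Each micro-cluster (pseudo-record) is summarized as (record_count, histogram).
        total, records = 0.0, 0
        for group in groups:
            n = sum(count for count, _ in group)              # records represented
            hist = sum((h for _, h in group), Counter())      # summed histograms
            S, W = sum(hist.values()), len(hist)
            total += S / (W ** r) * n
            records += n
        return total / records

    # The three micro-clusters (tree paths) from the example above.
    mc1 = (2, Counter({"a": 2, "b": 2, "c": 1}))   # records 1, 2
    mc2 = (1, Counter({"a": 1, "d": 1, "c": 1}))   # record 3
    mc3 = (2, Counter({"d": 2, "e": 2, "f": 1}))   # records 4, 5
    print(macro_profit([[mc1, mc2], [mc3]]))       # ~0.52, same as profit on the raw records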

16 Experimental Results
- Environment
  - CPU: Pentium 4, 2 GHz
  - RAM: 1 GB
  - OS: Windows 2000
- Aspects: performance, scalability, cluster accuracy
- Datasets: real-world and synthetic data

17 Performance and Scalability
- Real-life data: FIMI repository

18

19 Performance and Scalability (cont.)
- Synthetic data: IBM synthetic data generator
- Dataset: 50k records
  - (a) 50 clusters
  - (b) 100 clusters
  - (c) 500 clusters

20 #Attributes: 1000

21 Cluster Accuracy
- Mushroom data set
  - 117 distinct attributes and 8124 records
  - Two classes: 4208 edible and 3916 poisonous
- Purity metric
  - The average percentage of the dominant class label in each cluster
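
A small Python sketch of this purity measure (one illustrative reading: an unweighted average, over clusters, of the fraction of records carrying the cluster's dominant class label):

    from collections import Counter

    def purity(clusters):
        # clusters: list of lists of class labels, one list per cluster.
        fractions = [
            Counter(labels).most_common(1)[0][1] / len(labels)
            for labels in clusters if labels
        ]
        return sum(fractions) / len(fractions)

    # Toy example with 'e' (edible) and 'p' (poisonous) labels.
    print(purity([["e", "e", "p"], ["p", "p", "p", "e"]]))   # (2/3 + 3/4) / 2 ≈ 0.71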

22 Cluster Accuracy (cont.)
- Online micro-clustering component of SCLOPE
- SCLOPE vs. CLOPE

