
1
RIC: Refine Initial Cluster Centers in Partitioning Clustering for Clustering Transaction Data — Department of Computer Science and Information Engineering, 張蕙珠, 01/04/2005

2
Clustering. Clustering is the unsupervised classification of patterns into groups (clusters). Data objects in the same cluster are similar to one another, while objects in different clusters are dissimilar. Clustering is useful in pattern analysis, market or customer segmentation, machine learning, information retrieval, and data mining.

3
Classes of clustering: partitioning clustering and hierarchical clustering.

4
Partitioning clustering. Given k, the number of partitions to construct: create an initial partitioning, then use an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
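
The iterative relocation idea can be sketched on 1-D points; this is an assumed illustration following the familiar k-means scheme, since the slides do not name a specific partitioning algorithm:

```python
def kmeans(points, k, iters=20):
    """Iterative relocation: assign each point to its nearest center,
    then move each center to the mean of its group, until stable."""
    centers = points[:k]  # naive initial centers (the weakness RIC targets)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            groups[nearest].append(p)
        # Relocate: each center becomes the mean of its current group.
        new = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return centers, groups
```

With well-separated inputs such as `[1.0, 1.1, 0.9, 5.0, 5.2, 4.8]` and `k=2`, the centers settle near 1.0 and 5.0 after a few relocations.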

5
Hierarchical clustering. The bottom-up approach starts with each object forming a separate group; it successively merges the objects or groups closest to one another until a termination condition holds. The top-down approach starts with all the objects in a single cluster; in each successive iteration a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds.
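
A minimal sketch of the bottom-up approach on 1-D points, with single-linkage distance as an assumed concrete choice (the slide describes the approach generically) and "reach k clusters" as the termination condition:

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]  # each object starts in its own group
    while len(clusters) > k:          # termination condition
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters
```

For example, `agglomerative([1, 2, 10, 11, 12], 2)` merges the nearby values into `[1, 2]` and `[10, 11, 12]`.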

6
Motive. Partitioning clustering needs refined initial cluster centers (see the example below). Partitioning clustering also needs an accurate number of clusters (see the example below).

7
RIC clustering algorithm — definitions: (1) Width (W): the number of distinct data items in the dataset. (2) Height (H): the number of times each data item appears. (3) Average height (Avg-H): the average of the heights of all data items.
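
These three quantities can be computed directly with a counter; the toy transactions below are hypothetical:

```python
from collections import Counter

# Toy transaction set (hypothetical data for illustration)
transactions = [{"A", "B"}, {"B", "C"}, {"B"}]

H = Counter(item for t in transactions for item in t)  # height of each item
W = len(H)                                             # number of distinct items
avg_h = sum(H.values()) / W                            # average height
```

Here item B appears three times, A and C once each, so W = 3 and Avg-H = 5/3.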

8
RIC Clustering Algorithm
Step 1: Scan the market-basket data and compute W, H, and Avg-H.
Step 2: Select the items with H ≥ Avg-H.
Step 3: Combine the selected items into candidate item sets.
Step 4: Repeat: count the occurrences of every distinct combined item set, recompute H, W, and Avg-H, and select the combined item sets with H ≥ Avg-H.
Step 5: Until k cluster centers are obtained, or all H(i) < Avg-H, or the last result no longer changes.

9
Example
1. H(A)=2, H(B)=9, H(C)=4, H(D)=8, H(E)=2, H(F)=3, H(G)=2, H(H)=5, H(I)=6; W=9; Avg-H = (2+9+4+8+2+3+2+5+6)/9 = 41/9 ≈ 4 (truncated).
2. Select the items with H ≥ Avg-H: H(B)=9, H(C)=4, H(D)=8, H(H)=5, H(I)=6.
3. Combine the selected items into pairs: {BD, BC, CD, BH, CH, DH, HI, BI, CI, DI}.
4. H(B,D)=4, H(B,C)=3, H(C,D)=2, H(D,H)=4, H(B,I)=5, H(C,I)=2, otherwise H=0; W=10; Avg-H = (4+3+2+4+5+2)/10 = 2.
5. Select the combined sets with H ≥ Avg-H: H(B,D)=4, H(B,C)=3, H(C,D)=2, H(D,H)=4, H(B,I)=5, H(C,I)=2.

10
Example (cont'd)
6. Combine the selected pairs into larger item sets: {B,C,D} is combined from (B,D), (B,C), and (C,D); {B,C,I} is combined from (B,I), (B,C), and (C,I); (D,H) remains on its own. So center1 = {B,C,D}, center2 = {D,H}, center3 = {B,C,I}.
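
The whole procedure above can be sketched in Python. This is an assumed reading of the slides: averages are truncated to integers (41/9 → 4), selection uses ≥, and selected pairs are merged into a larger set whenever every pair inside it survives. Run on the 15 transactions from slide 14, it reproduces the three centers derived above:

```python
from collections import Counter
from itertools import combinations

def ric_centers(transactions):
    """One RIC refinement pass: select frequent items, keep frequent
    co-occurring pairs, and merge them into candidate cluster centers."""
    # Step 1: heights H(i), width W, and (truncated) average height.
    h = Counter(item for t in transactions for item in t)
    avg_h = sum(h.values()) // len(h)
    # Step 2: keep items whose height reaches the average.
    selected = {i for i, c in h.items() if c >= avg_h}
    # Steps 3-4: count co-occurrences of every selected pair.
    pairs = list(combinations(sorted(selected), 2))
    ph = Counter()
    for t in transactions:
        for p in combinations(sorted(selected & t), 2):
            ph[p] += 1
    pair_avg = sum(ph.values()) // len(pairs)
    kept = {p for p in pairs if ph[p] >= pair_avg}
    # Step 5 (merge): a set becomes a center when all its pairs survive;
    # keep only maximal such sets.
    centers = []
    for size in range(len(selected), 1, -1):
        for group in combinations(sorted(selected), size):
            if all(p in kept for p in combinations(group, 2)):
                if not any(set(group) <= c for c in centers):
                    centers.append(set(group))
    return centers

# The 15 transactions from slide 14.
data = [
    {"B","D"}, {"D","F","H"}, {"B","G","I"}, {"B","I"}, {"A","B","D"},
    {"D","G","H"}, {"C","I"}, {"H"}, {"B","C","D"}, {"A","B","I"},
    {"B","E","I"}, {"D","H"}, {"D","F","H"}, {"B","C","D","F"}, {"B","C","E","I"},
]
centers = ric_centers(data)  # the three centers {B,C,D}, {B,C,I}, {D,H}
```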

11
Result. Continuing the clustering process with these centers assigns the transactions as follows:

Trans  Cluster | Trans  Cluster | Trans  Cluster
110    C1      | 210    C2      | 310    C3
120    C2      | 220    C3      | 320    C2
130    C3      | 230    C2      | 330    C2
140    C3      | 240    C1      | 340    C1
150    C1      | 250    C3      | 350    C3

12
Improvement
RIC allows a one-pass clustering process to converge to a good solution, whereas the plain partitioning clustering method needs an unknown number of iterations to converge.
RIC leads toward an optimal solution, whereas the plain partitioning clustering method yields results that are acceptable but may be far from optimal.
RIC prevents the inaccurate clustering results that arise when the user specifies an unsuitable number of clusters, which the plain partitioning clustering method cannot prevent.

13
The partitioning clustering method uses an iterative procedure that converges to one of numerous local minima, and such iterative techniques are especially sensitive to the initial starting conditions. Refined initial cluster centers allow the clustering process to converge to a better local minimum.

14
Trans  Items    | Trans  Items    | Trans  Items
110    B,D      | 210    D,G,H    | 310    B,E,I
120    D,F,H    | 220    C,I      | 320    D,H
130    B,G,I    | 230    H        | 330    D,F,H
140    B,I      | 240    B,C,D    | 340    B,C,D,F
150    A,B,D    | 250    A,B,I    | 350    B,C,E,I

Arbitrary selection of cluster centers: C1={H} (id=230), C2={D,H} (id=320), C3={A,B,I} (id=250).
Distance formula: d(trans_i, center_j) = #(trans_i ∩ center_j) / #(trans_i), with threshold = 1/2.
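
The distance and threshold rule can be sketched as follows; tie handling and whether the threshold is inclusive are assumptions, since the slides are not explicit about either:

```python
def d(trans, center):
    """Slide's overlap distance: shared items over transaction size."""
    return len(trans & center) / len(trans)

def assign(trans, centers, threshold=0.5):
    """Assign a transaction to the closest center(s), or to none if even
    the best overlap falls below the threshold."""
    scores = {name: d(trans, c) for name, c in centers.items()}
    best = max(scores.values())
    if best < threshold:
        return None  # the transaction belongs to no cluster
    return [name for name, s in scores.items() if s == best]  # keep ties

# The arbitrarily selected centers from this slide.
centers = {"C1": {"H"}, "C2": {"D", "H"}, "C3": {"A", "B", "I"}}
```

For transaction 110 = {B,D} this yields the "C2 or C3" tie shown on the next slide, and transaction 240 = {B,C,D} scores below the threshold against every center.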

15
Trans  Cluster  | Trans  Cluster | Trans  Cluster
110    C2 or C3 | 210    C2      | 310    C3
120    C2       | 220    C3      | 320    C2
130    C3       | 230    C1      | 330    C2
140    C3       | 240    none    | 340    none
150    C3       | 250    C3      | 350    none
("none" means the transaction does not belong to any cluster.)

16
We obtain an inaccurate clustering result when the chosen number of clusters does not conform to the actual data distribution. If the chosen number of clusters is smaller than the real one, a great deal of data cannot be assigned to any cluster. If the chosen number of clusters is larger than the real one, some clusters contain no data objects at all.
