DATA MINING - CLUSTERING


1 DATA MINING - CLUSTERING

2 Clustering
Clustering - unsupervised classification
Clustering - the process of grouping physical or abstract objects into classes of similar objects
Clustering - helps construct a meaningful partitioning of a large set of objects
Data clustering is used in statistics, machine learning, spatial databases, and data mining
Cluster analysis helps build partitions of large sets of objects, such as large-scale systems, whose decomposition simplifies their design and implementation. As a data mining task, data clustering detects densely populated regions, according to some distance measure, in very large multidimensional data sets. In a given large set of multidimensional data points, the data space is usually not uniformly occupied by the points. Data clustering identifies the sparse and the dense regions and thereby discovers the overall distribution patterns of the data set.

3 CLARANS Algorithm
CLARANS - Clustering Large Applications based on RANdomized Search - presented by Ng and Han
CLARANS - based on randomized search and on two statistical clustering algorithms: PAM and CLARA
Method: randomized search for a local optimum
Example of algorithm usage
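The randomized-search idea can be sketched in a few lines of Python. This is a minimal, illustrative k-medoids search in the spirit of CLARANS, not the published algorithm: the 1-D data, the absolute-value distance, and the `numlocal`/`maxneighbor` defaults are assumptions for the demo.

```python
import random

def clarans(points, k, numlocal=2, maxneighbor=20, seed=0):
    """Sketch of a CLARANS-style randomized search for k medoids."""
    rng = random.Random(seed)

    def cost(medoids):
        # Total distance of every point to its nearest medoid.
        return sum(min(abs(p - m) for m in medoids) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(points, k)          # random starting medoids
        current_cost = cost(current)
        tries = 0
        while tries < maxneighbor:
            # Neighbour: swap one medoid for a random non-medoid.
            i = rng.randrange(k)
            candidate = rng.choice([p for p in points if p not in current])
            neighbour = current[:i] + [candidate] + current[i + 1:]
            neighbour_cost = cost(neighbour)
            if neighbour_cost < current_cost:    # accept improving move
                current, current_cost, tries = neighbour, neighbour_cost, 0
            else:
                tries += 1
        if current_cost < best_cost:             # keep the best local optimum
            best, best_cost = current, current_cost
    return sorted(best)
```

Each of the `numlocal` restarts walks from a random set of medoids to a local optimum by examining at most `maxneighbor` random swaps, which is exactly the "search for a local optimum" mentioned above.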

4 Focusing Methods
FM - based on the CLARANS algorithm and an efficient spatial access method, such as the R*-tree
The focusing on representative objects technique
The focusing on relevant clusters technique
The focusing on a cluster technique
Examples of usage

5 Pattern-Based Similarity Search
Searching for similar patterns in a temporal or spatio-temporal database
Two types of queries encountered in data mining operations:
object-relative similarity query
all-pair similarity query
Various approaches, differing in:
the similarity measure chosen
the type of comparison chosen (time domain or a transformation)
the subsequence parameters chosen (length, etc.)

6 Similarity Measures (1)
1st Measure - the Euclidean distance between two sequences:
{x_i} - the target sequence of length n
{y_i} - a sequence of length N in the database
{z_i^J} - the J-th subsequence of length n of {y_i}
D(J) = sqrt( Σ_{i=1..n} (x_i − z_i^J)² )
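The first measure can be sketched directly: slide a window of length n over the database sequence and compute the Euclidean distance of the target against every subsequence.

```python
import math

def subsequence_distances(x, y):
    """Sliding-window Euclidean distances: compare the target sequence
    {x_i} of length n with every length-n subsequence {z_i^J} of a
    database sequence {y_i} of length N."""
    n = len(x)
    return [
        math.sqrt(sum((x[i] - y[j + i]) ** 2 for i in range(n)))
        for j in range(len(y) - n + 1)
    ]
```

The J-th entry of the result is the distance of {x_i} to the subsequence starting at offset J, so a near-zero entry marks a matching pattern.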

7 Similarity Measures (2)
2nd Measure - the linear correlation between two sequences
3rd Measure - the correlation between two sequences computed via Discrete Fourier Transforms
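The second measure, written out in plain Python: the linear (Pearson) correlation between two equal-length sequences, where values near 1 mean the sequences move together.

```python
import math

def linear_correlation(x, y):
    """Linear (Pearson) correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den
```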

8 Alternative approaches
Matching all of the data points of a sequence simultaneously
Mapping each sequence to a small set of multidimensional rectangles in the feature space
Fourier transformation
SVD - Singular Value Decomposition
The Karhunen-Loeve transformation
Hierarchy Scan - a new approach
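The Fourier-transformation approach listed above can be sketched as follows: map a sequence to its first k DFT coefficients, giving a low-dimensional feature vector for indexing. The function name and the choice k=2 are illustrative; with the orthonormal 1/sqrt(n) scaling used here, distances in feature space relate to the original distances via Parseval's theorem.

```python
import cmath

def dft_features(seq, k=2):
    """First k coefficients of the (orthonormally scaled) DFT of seq."""
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * cmath.pi * f * t / n)
            for t in range(n)) / (n ** 0.5)
        for f in range(k)
    ]
```

For most real-world sequences the energy concentrates in the first few coefficients, which is why a handful of them suffice as an index key.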

9 Mining Path Traversal Patterns
Solution of the problem of mining traversal patterns:
first step: convert the original sequence of log data into a set of traversal subsequences (maximal forward references)
second step: determine the frequent traversal patterns, termed large reference sequences
Problems with finding large reference sequences

10 Mining Path Traversal Patterns - Example
[Diagram: traversal tree over the nodes A, B, C, D, E, G, H, O, U, V, W, with the traversal steps numbered 1-15]
Traversal path for a user: {A,B,C,D,C,B,E,G,H,G,W,A,O,U,O,V}
The set of maximal forward references for this user: {ABCD, ABEGH, ABEGW, AOU, AOV}
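The first step above can be sketched as a short routine: a backward move (revisiting a node already on the current path) ends a forward reference, and the current path is emitted whenever the last move extended it.

```python
def maximal_forward_references(path):
    """Convert a raw traversal log into its set of maximal
    forward references."""
    refs = []
    stack = []          # the current forward path
    extending = False   # did the last move extend the path?
    for node in path:
        if node in stack:
            if extending:                            # a forward run just ended
                refs.append("".join(stack))
            stack = stack[: stack.index(node) + 1]   # backtrack to the node
            extending = False
        else:
            stack.append(node)
            extending = True
    if extending:                                    # log ended on a forward move
        refs.append("".join(stack))
    return refs
```

Running it on the example path reproduces the five references shown on the slide.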

11 Clustering Features and CF-trees
CF - a triplet summarizing the information about a subcluster of points: CF = (N, LS, SS), where N is the number of points, LS is their linear sum, and SS is their square sum
CF-tree - a balanced tree with 2 parameters:
branching factor B - max number of children
threshold T - max diameter of the subclusters stored at the leaf nodes
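A minimal sketch of the CF triplet for 1-D points. The key property is additivity: two CFs can be merged by adding componentwise, which is what lets non-leaf nodes store the sums of their children's CFs, and statistics such as the centroid and radius follow directly from (N, LS, SS) without revisiting the points.

```python
class CF:
    """Clustering Feature (N, LS, SS) for a set of 1-D points."""
    def __init__(self, points=()):
        self.n = len(points)                 # N: number of points
        self.ls = sum(points)                # LS: linear sum
        self.ss = sum(p * p for p in points) # SS: square sum

    def __add__(self, other):
        # Additivity: merging subclusters is componentwise addition.
        merged = CF()
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # RMS distance of the points from the centroid, from (N, LS, SS) alone.
        return (self.ss / self.n - (self.ls / self.n) ** 2) ** 0.5
```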

12 Usage of this structure in BIRCH algorithm
Construction of the CF-tree:
The non-leaf nodes store the sums of their children's CFs
The CF-tree is built dynamically as data points are inserted
A point is inserted into the closest leaf entry (subcluster)
If the diameter of the subcluster after the insertion is larger than the threshold T, the leaf node is split
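The leaf-level insertion rule can be illustrated with a deliberately simplified sketch: each leaf entry is a plain list of 1-D points rather than a CF triplet, the point goes to the entry with the closest centroid, and the entry is split when its diameter would exceed the threshold T.

```python
def insert_point(leaf_entries, point, T):
    """Toy BIRCH-style leaf insertion: closest entry, split if the
    1-D diameter after insertion exceeds the threshold T."""
    if not leaf_entries:
        leaf_entries.append([point])
        return leaf_entries
    # Closest entry by centroid distance.
    best = min(leaf_entries, key=lambda e: abs(sum(e) / len(e) - point))
    candidate = best + [point]
    diameter = max(candidate) - min(candidate)   # 1-D diameter
    if diameter > T:
        leaf_entries.append([point])             # split: start a new entry
    else:
        best.append(point)
    return leaf_entries
```

Real BIRCH keeps only the (N, LS, SS) summaries in each entry and rebalances the tree on splits; this sketch keeps the raw points purely to make the threshold test visible.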

13 BIRCH Algorithm - Balanced Iterative Reducing and Clustering using Hierarchies

14 BIRCH Algorithm (1) PHASE 1
PHASE 1: Scan all data and build an initial in-memory CF tree using the given amount of memory, recycling space on disk.
The constructed CF tree tries to reflect the clustering information of the data set as accurately as the available memory allows. Processing in the later phases is:
a) fast, because no further I/O operations are needed and the problem size has been reduced
b) accurate, because outliers have been discarded
c) well ordered, because the data are better organized than in the original set
Phases: 1 - Load into memory, 2 - Condense, 3 - Global Clustering, 4 - Cluster Refining

15 BIRCH Algorithm (2) PHASE 2 (optional)
PHASE 2 (optional): Scan the leaf entries of the initial CF tree to rebuild a smaller CF tree, removing more outliers and merging crowded subclusters into larger ones.

16 BIRCH Algorithm (3) PHASE 3
PHASE 3: Adapt an existing clustering algorithm for a set of data points to work with the set of subclusters, each described by its CF vector.

17 BIRCH Algorithm (4) PHASE 4 (optional)
PHASE 4 (optional): Pass over the data to correct inaccuracies and to refine the clusters further; this phase entails the cost of an additional scan.
After phase 3 we obtain a set of clusters that captures the major distribution patterns of the data; however, some minor inaccuracies may remain.

18 CURE Algorithm - Clustering Using REpresentatives

19 CURE Algorithm (1)
Pipeline: Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk
CURE begins by drawing a random sample from the database.

20 CURE Algorithm (2)
To further speed up clustering, CURE first partitions the random sample into p partitions, each of size n/p.

21 CURE Algorithm (3)
Partially cluster each partition until the number of clusters in each partition is reduced to n/(pq) for some constant q > 1.

22 CURE Algorithm (4)
Outliers do not belong to any of the clusters. In CURE, outliers are eliminated at multiple steps.

23 CURE Algorithm (5)
Cluster the partial clusters in a final pass to generate the final k clusters.

24 CURE Algorithm (6)
Each data point is assigned to the cluster containing the representative point closest to it.

25 CURE - cluster procedure
procedure cluster(S, k)
begin
  T := build_kd_tree(S)
  Q := build_heap(S)
  while size(Q) > k do {
    u := extract_min(Q)
    v := u.closest
    delete(Q, v)
    w := merge(u, v)
    delete_rep(T, u)
    delete_rep(T, v)
    insert_rep(T, w)
    w.closest := x /* x is an arbitrary cluster in Q */
    for each x ∈ Q do {
      if dist(w, x) < dist(w, w.closest)
        w.closest := x
      if x.closest is either u or v {
        if dist(x, x.closest) < dist(x, w)
          x.closest := closest_cluster(T, x, dist(x, w))
        else
          x.closest := w
        relocate(Q, x)
      }
      else if dist(x, x.closest) > dist(x, w) {
        x.closest := w
        relocate(Q, x)
      }
    }
    insert(Q, w)
  }
end

26 CURE - merge procedure
procedure merge(u, v)
begin
  w := u ∪ v
  w.mean := (|u|·u.mean + |v|·v.mean) / (|u| + |v|)
  tmpSet := ∅
  for i := 1 to c do {
    maxDist := 0
    for each point p in cluster w do {
      if i = 1
        minDist := dist(p, w.mean)
      else
        minDist := min{ dist(p, q) : q ∈ tmpSet }
      if (minDist ≥ maxDist) {
        maxDist := minDist
        maxPoint := p
      }
    }
    tmpSet := tmpSet ∪ {maxPoint}
  }
  for each point p in tmpSet do
    w.rep := w.rep ∪ { p + α·(w.mean − p) }
  return w
end
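The core of the merge step can be sketched in Python: pick up to c well-scattered points from the merged cluster by a farthest-point heuristic, then shrink them toward the mean by a factor α to damp outliers. The function name `merge_reps` and the defaults are illustrative, not from the paper.

```python
import math

def merge_reps(points, c=4, alpha=0.5):
    """Sketch of CURE's representative selection: c well-scattered
    points, each shrunk toward the cluster mean by alpha."""
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scattered = []
    for i in range(min(c, len(points))):
        # Farthest-point heuristic: first farthest from the mean,
        # then farthest from the already-chosen set.
        best, best_d = None, -1.0
        for p in points:
            d = dist(p, mean) if i == 0 else min(dist(p, q) for q in scattered)
            if d > best_d:
                best, best_d = p, d
        scattered.append(best)

    # Shrink each scattered point toward the mean to damp outliers.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean))
            for p in scattered]
```

Shrinking is what makes CURE robust: an outlier chosen as a scattered point is pulled strongly toward the cluster's interior before it can distort later merge decisions.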

27 The Intelligent Miner of IBM
Demographic Clustering provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated. Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet's criterion is maximised: (sum of all record similarities of pairs in the same cluster) - (sum of all record similarities of pairs in different clusters)
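Condorcet's criterion as described above can be computed directly. This is a generic sketch, not Intelligent Miner's implementation: `clusters` maps each record index to a cluster id, and `similarity` is any user-chosen pairwise function.

```python
def condorcet_score(records, clusters, similarity):
    """Sum of pairwise similarities within the same cluster minus the
    sum of pairwise similarities across different clusters."""
    score = 0.0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            s = similarity(records[i], records[j])
            score += s if clusters[i] == clusters[j] else -s
    return score
```

A clustering that keeps similar records together and dissimilar records apart scores high, which is why maximising this quantity also determines a natural number of clusters.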

28 The Intelligent Miner - example
Suppose that you have the database of a supermarket that includes customer identification and information about the date and time of the purchases. The clustering mining function clusters this data to enable the identification of different types of shoppers. For example, it might reveal that certain customers buy many articles on Fridays and usually pay by credit card.

