Published by Clinton Parsons. Modified about 1 year ago.

1
Part II - Clustering © Prentice Hall
Clustering Large DB
Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database then applied to the entire database.
Algorithms
–BIRCH
–DBSCAN
–CURE
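
The sample-then-apply idea can be sketched as follows. This is only an illustration under stated assumptions: plain k-means stands in for whatever in-memory clustering algorithm is used, and the names `nearest`, `kmeans` and `cluster_large` are hypothetical, not taken from the slides.

```python
import random

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a small in-memory sample (stand-in clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group happens to be empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def cluster_large(db, k, sample_size, seed=0):
    """Cluster a random sample, then label every tuple in one pass over db."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)
    centers = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centers) for p in db]  # single scan of the full data
    return centers, labels
```

Only the sample has to fit in memory for the expensive step; the final labelling reads each tuple once.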

2
Desired Features for Large Databases
One scan (or less) of DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g. sampling)
Process each tuple once

3
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
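
The incremental, one-scan behaviour can be illustrated with a deliberately simplified sketch. It is not BIRCH itself: it keeps a flat list of cluster summaries instead of a tree, and it uses a distance-to-centroid `threshold`, which is an assumption made for this example. The function name `one_scan_cluster` is hypothetical.

```python
import math

def one_scan_cluster(stream, threshold):
    """Return a list of [count, linear_sum] summaries, one per cluster."""
    summaries = []
    for p in stream:  # each point is read exactly once
        best, best_d = None, math.inf
        for s in summaries:
            centroid = [x / s[0] for x in s[1]]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, centroid)))
            if d < best_d:
                best, best_d = s, d
        if best is not None and best_d <= threshold:
            # Absorb the point into the closest existing summary.
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
        else:
            # Too far from every summary: start a new cluster.
            summaries.append([1, list(p)])
    return summaries
```

Because only the summaries are stored, memory use depends on the number of clusters rather than the number of points.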

4
Clustering Feature
(N,LS,SS)
–N: Number of points in cluster
–LS: Sum of points in the cluster
–SS: Sum of squares of points in the cluster
CF Tree
–Balanced search tree
–Node has CF triple for each child
–Leaf node represents cluster and has CF value for each subcluster in it.
–Subcluster has maximum diameter
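
A minimal sketch of the CF triple, assuming SS is kept as a single scalar (the sum of squared norms of the points). The class name `CF` and its methods are hypothetical; the radius and diameter formulas are the usual BIRCH definitions and do not appear on the slide.

```python
from dataclasses import dataclass
import math

@dataclass
class CF:
    n: int      # N: number of points
    ls: tuple   # LS: per-dimension sum of the points
    ss: float   # SS: sum of squared norms of the points (scalar, assumed)

    @classmethod
    def of_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CF triples are additive: merging clusters adds component-wise."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        """Root-mean-square pairwise distance between points in the cluster."""
        if self.n < 2:
            return 0.0
        ls2 = sum(x * x for x in self.ls)
        num = 2 * self.n * self.ss - 2 * ls2
        return math.sqrt(max(num / (self.n * (self.n - 1)), 0.0))
```

The point of the triple is that centroid, radius and diameter can all be computed from it, so the original points never need to be revisited when a subcluster is tested against the maximum diameter.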

5
BIRCH Algorithm

6
Improve Clusters

7
DBSCAN
Density Based Spatial Clustering of Applications with Noise
Outliers will not affect creation of cluster.
Input
–MinPts – minimum number of points in cluster
–Eps – for each point in cluster there must be another point in it less than this distance away.

8
DBSCAN Density Concepts
Eps-neighborhood: Points within Eps distance of a point.
Core point: Eps-neighborhood dense enough (MinPts)
Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.
Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
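
The density concepts above can be expressed directly as predicates. This is a brute-force sketch under two assumptions not stated on the slide: a point counts as a member of its own Eps-neighborhood, and "within Eps" means distance less than or equal to Eps. All function names are hypothetical.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def eps_neighborhood(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if _dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if points[p] is directly density-reachable from points[q]."""
    return is_core(points, q, eps, min_pts) and _dist(points[p], points[q]) <= eps

def density_reachable(points, p, q, eps, min_pts):
    """True if points[p] can be reached from points[q] by stepping only
    through core points, each step no longer than eps."""
    if not is_core(points, q, eps, min_pts):
        return False
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        for j in eps_neighborhood(points, cur, eps):
            if j == p:
                return True
            if j not in seen and is_core(points, j, eps, min_pts):
                seen.add(j)
                frontier.append(j)
    return False
```

Note that reachability is not symmetric: a border point can be reached from a core point, but nothing is reachable from a border point.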

9
Density Concepts

10
DBSCAN Algorithm

11
CURE
Clustering Using Representatives
Use many points to represent a cluster instead of only one
Points will be well scattered
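
One way to pick well-scattered representatives is a farthest-first selection, sketched below. The slide does not say how the points are chosen, so the procedure, the function name `scattered_representatives` and the parameter `c` (number of representatives) are assumptions made for illustration.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scattered_representatives(cluster, c):
    """Pick up to c points of the cluster that are far from each other."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    reps = []
    for _ in range(min(c, len(cluster))):
        if not reps:
            # First representative: the point farthest from the centroid.
            nxt = max(cluster, key=lambda p: _dist(p, centroid))
        else:
            # Later ones: the point farthest from its nearest chosen representative.
            nxt = max(cluster, key=lambda p: min(_dist(p, r) for r in reps))
        reps.append(nxt)
    return reps
```

Several scattered points describe an elongated or irregular cluster better than a single centroid can.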

12
CURE Approach

13
CURE Algorithm

14
CURE for Large Databases
