Clustering: An overview of clustering algorithms. Dènis de Keijzer, GIA 2004.

Overview
  Algorithms: GRAVIclust, AUTOCLUST, AUTOCLUST+, 3D Boundary-based Clustering, SNN

Gravity-based spatial clustering: GRAVIclust
  Initialisation Phase: calculate the initial cluster centres
  Optimisation Phase: improve the positions of the cluster centres so as to reach a solution that minimises the distance function

GRAVIclust: Initialisation Phase
  Input:
    set of points P
    matrix of distances between all pairs of points
      assumption: actual access-path distances exist in GIS maps, e.g. http://www.transinfo.qld.gov.au
      very versatile: footpath, road map, rail map
    number of required clusters k

GRAVIclust: Initialisation Phase
  Step 1: calculate the first initial centre, i.e. the point with the largest number of points within radius r; remove the first initial centre and all points within radius r from further consideration
  Step 2: repeat Step 1 until k initial centres have been chosen
  Step 3: create initial clusters by assigning all points to the closest cluster centre
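The three initialisation steps can be sketched as follows. This is only an illustration under stated assumptions, not the GRAVIclust authors' code: the function names `initial_centres` and `assign_to_closest` are hypothetical, the radius `r` is passed in rather than derived from the region's area (static variant), and ties are broken by the lowest point index so the result stays deterministic.

```python
def initial_centres(dist, k, r):
    """Pick up to k initial centres from a symmetric distance matrix `dist`.

    Static-radius variant: the same r is used for every pick (assumption).
    """
    remaining = set(range(len(dist)))
    centres = []
    while len(centres) < k and remaining:
        # Step 1: the remaining point with the most remaining points within r.
        def coverage(p):
            return sum(1 for q in remaining if dist[p][q] <= r)

        best = max(sorted(remaining), key=coverage)  # lowest index wins ties
        centres.append(best)
        # Remove the centre and all points within r from further consideration.
        remaining -= {q for q in remaining if dist[best][q] <= r}
    return centres


def assign_to_closest(dist, centres):
    """Step 3: for every point, the index of its closest centre."""
    return [min(centres, key=lambda c: dist[p][c]) for p in range(len(dist))]
```

With points at positions 0, 1, 2, 10, 11, 12 on a line, k = 2 and r = 1.5, this picks the middle point of each group and assigns every point to the centre of its own group.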

GRAVIclust: radius calculation
  Radius r is calculated based on the area of the region considered for clustering
  Static radius: based on the assumption that all clusters are of the same size
  Dynamic radius: recalculated after each initial cluster centre is chosen

GRAVIclust: Static vs. Dynamic
  Static
    reduced computation: the number of points within radius r has to be calculated only once
    not suitable for problems where the points are separated by large empty areas
  Dynamic
    increases computation time
    ensures the radius is adjusted as points are removed
  The two differ only when the distribution is non-uniform

GRAVIclust: Optimisation Phase
  Step 1: for each cluster, calculate a new centre: the point closest to the cluster's centre of gravity
  Step 2: re-assign points to the new cluster centres
  Step 3: recalculate the distance function; it is never greater than the previous value
  Step 4: repeat Steps 1 to 3 until the value of the distance function equals the previous value
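A minimal sketch of the optimisation loop, under assumptions the slides do not state: points are planar coordinates, Euclidean distance stands in for the access-path distance matrix, the centre of gravity is the coordinate mean, and the distance function is the sum of point-to-centre distances. The names `optimise`, `assign` and `total_distance` are hypothetical. The loop stops as soon as the distance function no longer decreases, which guarantees termination in this sketch.

```python
import math


def assign(points, centres):
    """Label each point with the position (in `centres`) of its closest centre."""
    return [min(range(len(centres)),
                key=lambda j: math.dist(p, points[centres[j]]))
            for p in points]


def total_distance(points, centres, labels):
    """Assumed distance function: sum of point-to-centre distances."""
    return sum(math.dist(p, points[centres[l]]) for p, l in zip(points, labels))


def optimise(points, centres):
    """`centres` is a list of point indices; returns (centres, labels, cost)."""
    labels = assign(points, centres)
    cost = total_distance(points, centres, labels)
    while True:
        new_centres = []
        for j in range(len(centres)):
            members = [i for i, l in enumerate(labels) if l == j]
            if not members:  # keep the old centre for an empty cluster
                new_centres.append(centres[j])
                continue
            gx = sum(points[i][0] for i in members) / len(members)
            gy = sum(points[i][1] for i in members) / len(members)
            # Step 1: the member closest to the centre of gravity.
            new_centres.append(
                min(members, key=lambda i: math.dist(points[i], (gx, gy))))
        new_labels = assign(points, new_centres)                      # Step 2
        new_cost = total_distance(points, new_centres, new_labels)    # Step 3
        if new_cost >= cost:                                          # Step 4
            return centres, labels, cost
        centres, labels, cost = new_centres, new_labels, new_cost
```

Starting from the end points of two collinear groups, the centres move to the middle point of each group and the loop stops one pass later.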

GRAVIclust
  Deterministic
  Can handle obstacles
  Monotonic convergence of the distance function to a stable point

AUTOCLUST Definitions

AUTOCLUST Definitions II

AUTOCLUST
  Phase 1: finding boundaries
  Phase 2: restoring and re-attaching
  Phase 3: detecting second-order inconsistency

AUTOCLUST: Phase 1
  Finding boundaries
  Calculate the Delaunay Diagram
  For each point p_i, calculate ShortEdges(p_i), LongEdges(p_i) and OtherEdges(p_i)
  Remove ShortEdges(p_i) and LongEdges(p_i)
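A rough sketch of the Phase 1 classification follows. The definitions on the two "Definitions" slides did not survive in this transcript, so the code assumes the ones commonly given for AUTOCLUST: LocalMean(p_i) is the mean length of the edges incident to p_i, MeanStDev(P) is the average over all points of the (population) standard deviation of those lengths, and an edge is short or long when it lies more than MeanStDev(P) below or above LocalMean(p_i). The edge list is taken as input (it would normally come from a Delaunay triangulation), and the function name `autoclust_phase1` is hypothetical.

```python
import math
from statistics import mean, pstdev


def autoclust_phase1(points, edges):
    """Classify edges per point and prune; returns (kept, short, long_).

    Assumes every point has at least one incident edge.
    """
    incident = {i: [] for i in range(len(points))}
    for i, j in edges:
        length = math.dist(points[i], points[j])
        incident[i].append((j, length))
        incident[j].append((i, length))

    local_mean = {i: mean([l for _, l in inc]) for i, inc in incident.items()}
    local_sd = {i: pstdev([l for _, l in inc]) for i, inc in incident.items()}
    mean_st_dev = mean(list(local_sd.values()))  # assumed MeanStDev(P)

    short = {i: [] for i in incident}
    long_ = {i: [] for i in incident}
    removed = set()
    for i, inc in incident.items():
        for j, length in inc:
            if length < local_mean[i] - mean_st_dev:
                short[i].append(j)
            elif length > local_mean[i] + mean_st_dev:
                long_[i].append(j)
            else:
                continue  # OtherEdges(p_i): kept
            removed.add(frozenset((i, j)))
    kept = [e for e in edges if frozenset(e) not in removed]
    return kept, short, long_
```

An edge is dropped when either endpoint classifies it as short or long; the short ones are candidates for restoration in Phase 2.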

AUTOCLUST: Phase 2
  Restoring and re-attaching
  For each point p_i where ShortEdges(p_i) ≠ ∅:
    1. Determine a candidate connected component C for p_i
       a. If there are 2 edges e_j = (p_i, p_j) and e_k = (p_i, p_k) in ShortEdges(p_i) with CC[p_j] ≠ CC[p_k], then
          compute, for each edge e = (p_i, p_j) ∈ ShortEdges(p_i), the size ||CC[p_j]|| and let M be the maximum of ||CC[p_j]|| over these edges
          let C be the class label of the largest connected component (if there are two different connected components with cardinality M, let C be the one with the shortest edge to p_i)
       b. Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(p_i) connect p_i to
    2. If the edges in OtherEdges(p_i) connect to a connected component different from C, remove them. Note that all edges in OtherEdges(p_i) are removed, and that only in this case will p_i swap connected components
    3. Add all edges e ∈ ShortEdges(p_i) that connect to C

AUTOCLUST: Phase 3
  Detecting second-order inconsistency
  Compute the LocalMean for 2-neighbourhoods
  Remove all edges in N_{2,G}(p_i) that are long edges

AUTOCLUST
  No user-supplied arguments
    eliminates expensive human-based exploration time for finding best-fit arguments
  Robust to noise, outliers, bridges and type of distribution
  Able to detect clusters with arbitrary shapes, different sizes and different densities
  Can handle multiple bridges
  O(n log n)

AUTOCLUST+
  Construct the Delaunay Diagram
  Calculate MeanStDev(P)
  For all edges e, remove e if it intersects some obstacle
  Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
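The obstacle step can be illustrated with a standard segment-intersection test. This sketch assumes obstacles are modelled as line segments (the slides do not say how obstacles are represented) and treats touching as intersecting; `remove_blocked_edges` and the helper names are hypothetical.

```python
def orient(a, b, c):
    """Sign of the turn a -> b -> c: 1 left, -1 right, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def in_box(a, b, c):
    """True if c lies in the bounding box of segment a-b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p, q, r, s):
    """True if segment p-q and segment r-s share at least one point."""
    o1, o2 = orient(p, q, r), orient(p, q, s)
    o3, o4 = orient(r, s, p), orient(r, s, q)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint lies on the other segment.
    return ((o1 == 0 and in_box(p, q, r)) or (o2 == 0 and in_box(p, q, s))
            or (o3 == 0 and in_box(r, s, p)) or (o4 == 0 and in_box(r, s, q)))


def remove_blocked_edges(points, edges, obstacles):
    """Keep only the edges that cross none of the obstacle segments."""
    return [(i, j) for i, j in edges
            if not any(segments_intersect(points[i], points[j], a, b)
                       for a, b in obstacles)]
```

The surviving edge list would then be passed on to the three AUTOCLUST phases.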

3D Boundary-based Clustering
  Benefits from 3D clustering
    more accurate spatial analysis
    distinguishes
      positive clusters: clusters in higher dimensions but not in lower dimensions
      negative clusters: clusters in lower dimensions but not in higher dimensions

3D Boundary-based Clustering
  Based on AUTOCLUST
  Uses Delaunay tetrahedrizations
  Definitions: e_j is a potential inter-cluster edge if:

3D Boundary-based Clustering
  Phase I
    For all p_i ∈ P, classify each edge e_j incident to p_i into one of three groups
      ShortEdges(p_i) when the length of e_j is less than the range in AI(p_i)
      LongEdges(p_i) when the length of e_j is greater than the range in AI(p_i)
      OtherEdges(p_i) when the length of e_j is within AI(p_i)
    For all p_i ∈ P, remove all edges in ShortEdges(p_i) and LongEdges(p_i)

3D Boundary-based Clustering
  Phase II
    Recuperate ShortEdges(p_i) incident to border points using connected component analysis
  Phase III
    Remove exceptionally long edges in local regions

Shared Nearest Neighbour
  Clustering in higher dimensions
    distances or similarities between points become more uniform, making clustering more difficult
    also, similarity between points can be misleading, i.e. a point can be more similar to a point that “actually” belongs to a different cluster
  Solution
    shared nearest neighbour approach to similarity

SNN: An alternative definition of similarity
  Euclidean distance
    most common distance metric used
    while useful in low dimensions, it doesn’t work well in high dimensions

      A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
  P1   3   0   0   0   0   0   0   0   0   0
  P2   0   0   0   0   0   0   0   0   0   4
  P3   3   2   4   0   1   2   3   1   2   0
  P4   0   2   4   0   1   2   3   1   2   4
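The point of the four example rows P1 to P4 can be checked with a few lines of code: the pair (P1, P2) and the pair (P3, P4) are equally far apart in Euclidean terms, although P3 and P4 agree on most attributes and P1 and P2 agree on none of their non-zero ones. The helper names `euclidean` and `shared_nonzero` are hypothetical, and counting matching non-zero attributes is only an illustrative measure, not one taken from the slides.

```python
import math

rows = {
    "P1": [3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "P2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
    "P3": [3, 2, 4, 0, 1, 2, 3, 1, 2, 0],
    "P4": [0, 2, 4, 0, 1, 2, 3, 1, 2, 4],
}


def euclidean(a, b):
    """Plain Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_nonzero(a, b):
    """Number of attributes on which both vectors hold the same non-zero value."""
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)


d12 = euclidean(rows["P1"], rows["P2"])       # sqrt(3^2 + 4^2) = 5
d34 = euclidean(rows["P3"], rows["P4"])       # also sqrt(3^2 + 4^2) = 5
s12 = shared_nonzero(rows["P1"], rows["P2"])  # 0 shared non-zero attributes
s34 = shared_nonzero(rows["P3"], rows["P4"])  # 7 shared non-zero attributes
```

Both distances come out as 5, so Euclidean distance alone cannot tell the closely related pair from the unrelated one.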

SNN: An alternative definition of similarity
  Define similarity between points in terms of their shared nearest neighbours
  The similarity of the points is “confirmed” by their common shared nearest neighbours

SNN: An alternative definition of density
  SNN similarity, with the k-nearest neighbour approach
    if the k-th nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point
    since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space

SNN: Algorithm
  Compute the similarity matrix
    corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points
  Sparsify the similarity matrix by keeping only the k most similar neighbours
    corresponds to keeping only the k strongest links of the similarity graph
  Construct the shared nearest neighbour graph from the sparsified similarity matrix
  Find the SNN density of each point
  Find the core points
  Form clusters from the core points
  Discard all noise points
  Assign all non-noise, non-core points to clusters
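The algorithm steps listed above can be strung together in a compact sketch. Several details are assumptions that the slides do not spell out: Euclidean distance is the base similarity, two points are linked in the SNN graph only if each is in the other's k-nearest-neighbour list, the link weight is the number of neighbours they share, SNN density is the number of links with weight at least `eps`, and a point is a core point when its density is at least `min_pts`. The function name `snn_cluster` and the parameter names `eps` and `min_pts` are hypothetical.

```python
import math


def snn_cluster(points, k, eps, min_pts):
    """Return one label per point; None marks a discarded noise point."""
    n = len(points)
    # Similarity matrix, sparsified: keep only each point's k nearest neighbours.
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (math.dist(points[i], points[j]), j))
        knn.append(set(others[:k]))

    # Shared nearest neighbour graph: weight = number of shared neighbours.
    def snn(i, j):
        if i in knn[j] and j in knn[i]:
            return len(knn[i] & knn[j])
        return 0

    strong = [[j for j in range(n) if j != i and snn(i, j) >= eps]
              for i in range(n)]
    density = [len(s) for s in strong]                     # SNN density
    is_core = [d >= min_pts for d in density]              # core points

    # Form clusters: core points joined by strong links share a label.
    labels = [None] * n
    next_label = 0
    for c in range(n):
        if not is_core[c] or labels[c] is not None:
            continue
        labels[c] = next_label
        stack = [c]
        while stack:
            u = stack.pop()
            for v in strong[u]:
                if is_core[v] and labels[v] is None:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1

    # Non-core points: attach to the most similar linked core point,
    # or leave as None (noise) when no core point is strongly linked.
    for i in range(n):
        if labels[i] is None:
            linked = [j for j in strong[i] if is_core[j]]
            if linked:
                labels[i] = labels[max(linked, key=lambda j: snn(i, j))]
    return labels
```

The number of clusters is not passed in; it falls out of how many groups of connected core points are found.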

Shared Nearest Neighbour
  Finds clusters of varying shapes, sizes and densities, even in the presence of noise and outliers
  Handles data of high dimensionality and varying densities
  Automatically detects the number of clusters
