# CLUSTERING.



Introduction
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is used:
· in biology, to develop new plant and animal taxonomies;
· in business, to help marketers discover distinct groups of customers and characterize each group on the basis of its purchasing patterns;
· to identify groups of automobile insurance policy holders;
· to identify groups of houses in a city on the basis of house type, cost, and geographical location;
· to classify documents on the web for information discovery.

The basic requirements of cluster analysis are:
· Dealing with different types of attributes: Many clustering algorithms are designed to cluster interval-based (numerical) data, but many applications require other types of data, such as binary, nominal, or ordinal data, or a mixture of these types.
· Dealing with noisy data: A database generally contains outliers and missing, unknown, or erroneous data, and many clustering algorithms are sensitive to such data.
· Constraints on clustering: Many applications need to perform clustering under various constraints. For example, to search for the location of a person in a city, you may cluster households while taking into account constraints such as the street, area, or house number.
· Dealing with arbitrary shapes: Many clustering algorithms determine clusters using Euclidean or Manhattan distance measures. Algorithms based on these measures tend to find spherical clusters of similar size and density. Real clusters can have any shape, so you need an algorithm that can detect clusters of arbitrary shape.
· High dimensionality: A database or data warehouse can contain many dimensions or attributes. Some clustering algorithms work well only on low-dimensional data with two or three dimensions. High-dimensional data requires algorithms designed for it.
· Ordering of input data: Some clustering algorithms are sensitive to the order of the input data. For example, the same data set presented in a different order may produce different clusters.
· Interpretability and usability: The results of a clustering algorithm should be interpretable, comprehensible, and usable.
· Determining input parameters: Many clustering algorithms require the user to supply parameters at run time. The clustering result can be very sensitive to these parameters, and they are hard to determine, especially for high-dimensional objects.
· Scalability: Many clustering algorithms work well on small data sets of fewer than 250 data objects, whereas a large database may contain millions of objects, so you need a highly scalable clustering technique.

Types of Data
· Data matrix: represented as a relational table in the form of an n × p matrix, where n is the number of objects and p is the number of variables.
· Dissimilarity matrix: represented as a relational table in the form of an n × n matrix, where n is the number of objects.

Interval Scaled Variables
Interval-scaled variables are continuous measurements on a linear scale, for example height and weight. The choice of measurement unit affects cluster analysis: a temperature recorded in Celsius and the same temperature recorded in Fahrenheit give different numerical values and therefore different distances. To avoid dependence on the measurement unit, always standardize the data so that it is unit-less. Converting the original measurements to unit-less variables takes two steps:

1. Calculate the mean absolute deviation $s_f$:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

where $x_{1f}, \dots, x_{nf}$ are the n measurements of variable f and $m_f$ is the mean value of f:

$$m_f = \frac{x_{1f} + x_{2f} + \dots + x_{nf}}{n}$$

2. Calculate the standardized measurement (z-score) $z_{if}$:

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

After calculating the z-scores, you can compute the dissimilarity $d(i,j)$ between objects i and j, each described by p variables, using one of these distance measures:

· Euclidean distance: the geometric (straight-line) distance in multidimensional space.

$$d(i,j) = \left(\sum_{f=1}^{p} (x_{if} - x_{jf})^2\right)^{1/2}$$

· Manhattan distance: the sum of the absolute differences across all dimensions.

$$d(i,j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$$

· Minkowski distance: a generalization of both the Euclidean and the Manhattan distance (q = 2 gives Euclidean, q = 1 gives Manhattan).

$$d(i,j) = \left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^q\right)^{1/q}$$
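
The standardization and the distance measures above can be sketched in a few lines of Python. This is only an illustration: the function names `standardize` and `minkowski` and the sample heights are made up for the example and do not come from the slides.

```python
def standardize(values):
    """Convert raw measurements of one variable f into unit-less z-scores."""
    n = len(values)
    m_f = sum(values) / n                          # mean value of f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    # Assumes the values are not all identical (otherwise s_f would be 0).
    return [(x - m_f) / s_f for x in values]


def minkowski(a, b, q):
    """Minkowski distance: q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)


heights_cm = [150.0, 160.0, 170.0, 180.0]          # hypothetical sample data
z_heights = standardize(heights_cm)

print(z_heights)                       # [-1.5, -0.5, 0.5, 1.5]
print(minkowski((0, 0), (3, 4), 1))    # 7.0 (Manhattan)
print(minkowski((0, 0), (3, 4), 2))    # 5.0 (Euclidean)
```

Expressing the same heights in metres instead of centimetres yields the same z-scores up to floating-point rounding, which is the point of the standardization step.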

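Binary Variables
A binary variable has two states, 0 and 1: 0 means the variable is absent and 1 means it is present. There are two types of binary variables:
· symmetric: both states are equally valuable and carry the same weight;
· asymmetric: the two states are not equally important.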

- Nominal, ordinal, and ratio-scaled variables
- Mixed variables

Partitioning Method
The k-means Method
1. Randomly select k objects, each of which initially represents a cluster mean (the centre of gravity of a cluster).
2. Assign each object to the cluster whose mean it is most similar to, based on the distance between the object and the cluster mean.
3. Update the mean of each cluster, if necessary.
4. Repeat steps 2 and 3 until the criterion function converges:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where p is an object in cluster $C_i$ and $m_i$ is the mean of cluster $C_i$.
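
A minimal sketch of the k-means steps above, assuming squared Euclidean distance and points stored as tuples. The helper names (`sq_dist`, `mean_of`, `assign`), the `max_iter` cap, the fixed `seed`, and the sample points are all assumptions made for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_of(cluster):
    """Centre of gravity of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def assign(points, means):
    """Step 2: put every point into the cluster with the nearest mean."""
    clusters = [[] for _ in means]
    for p in points:
        nearest = min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)                  # step 1: k random objects
    for _ in range(max_iter):
        clusters = assign(points, means)           # step 2
        # Step 3: update each mean (an empty cluster keeps its old mean).
        new_means = [mean_of(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:                     # step 4: nothing changed
            break
        means = new_means
    clusters = assign(points, means)
    # Criterion function E: sum of squared distances to the cluster means.
    error = sum(sq_dist(p, m) for c, m in zip(clusters, means) for p in c)
    return means, clusters, error


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),      # hypothetical data
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
means, clusters, error = kmeans(points, k=2)
print(clusters)
print(error)
```

On this small, well-separated data set the two obvious groups are recovered. In general the result depends on the random initial objects, so different seeds can give different clusterings.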

The k-medoids Method
1. Randomly select k objects as the initial reference points (medoids).
2. Assign each remaining object to the cluster whose medoid it is most similar to.
3. Randomly select a non-medoid object $O_{random}$.
4. Calculate the total cost C of swapping a medoid $O_i$ with the non-medoid object $O_{random}$.
5. If the total cost of swapping is negative, swap $O_i$ with $O_{random}$ to form the new set of medoids.
6. Restart from step 2, and repeat until no swap occurs.
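
The k-medoids procedure above can be sketched as follows. Two choices here are assumptions, not part of the slides: Manhattan distance is used as the dissimilarity, and instead of drawing the non-medoid candidate at random the sketch tries every non-medoid in turn, so that the stopping rule "no swap occurs" is deterministic. The names `manhattan`, `total_cost`, and `k_medoids` are made up for the example.

```python
import random


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids):
    """Sum of the distances from every point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # random initial medoids
    improved = True
    while improved:                            # repeat until no swap occurs
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                # Cost of the swap: negative means the swap is an improvement.
                swap_cost = total_cost(points, trial) - total_cost(points, medoids)
                if swap_cost < 0:
                    medoids = trial
                    improved = True
    clusters = {m: [] for m in medoids}
    for p in points:                           # final assignment to medoids
        clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
    return medoids, clusters


points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]   # hypothetical data
medoids, clusters = k_medoids(points, k=2)
print(sorted(medoids))    # [(1, 1), (8, 8)]
```

Unlike a k-means centre, each medoid is always one of the actual data objects.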

Hierarchical Method
Agglomerative and Divisive Hierarchical Clustering
In agglomerative clustering, each object starts in its own cluster. These single-object clusters are merged into larger clusters, and merging continues until all of them have been combined into one big cluster that contains all the objects. In divisive hierarchical clustering, by contrast, all objects start in one big cluster, which is repeatedly divided into smaller clusters until each cluster contains a single object.
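
A minimal sketch of the agglomerative (bottom-up) direction on one-dimensional data. The slides do not say how the distance between two clusters is measured, so single linkage (the distance between the closest pair of members) is assumed here. The function names, the `target_clusters` stopping parameter, and the sample values are hypothetical.

```python
def single_link(c1, c2):
    """Distance between two clusters: closest pair of members (assumption)."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerative(values, target_clusters=1):
    clusters = [[v] for v in values]           # every object is its own cluster
    merge_distances = []
    while len(clusters) > target_clusters:
        # Find the pair of clusters that are closest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)                # replace the pair by its union
        merge_distances.append(d)
    return clusters, merge_distances


values = [1.0, 2.0, 4.5, 10.0, 11.5]           # hypothetical data
clusters, merge_distances = agglomerative(values, target_clusters=2)
print(sorted(sorted(c) for c in clusters))     # [[1.0, 2.0, 4.5], [10.0, 11.5]]
print(merge_distances)                         # [1.0, 1.5, 2.5]
```

With `target_clusters=1` the loop runs until a single cluster holds all objects, as in the description above. Stopping earlier corresponds to cutting the hierarchy at a chosen level.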

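[Figure: agglomerative versus divisive hierarchical clustering of objects 1 to 6. The agglomerative direction runs from Level 0 to Level 5, merging the objects through clusters such as {1, 2}, {4, 5}, {1, 2, 3}, and {4, 5, 6} into the single cluster {1, 2, 3, 4, 5, 6}; the divisive direction runs through the same levels in the opposite order.]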

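Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)
BIRCH is built on two concepts: the clustering feature (CF) and the clustering feature tree (CF tree). A clustering feature is the triple CF = (N, LS, SS), where:
· N: number of objects in a sub-cluster
· LS: linear sum of the N objects
· SS: square sum of the N data objects (the sum of their squares)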
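
To make the clustering feature concrete, here is a small sketch, reading SS as the sum of the squared data objects (the usual BIRCH definition) and storing it as a single number, which is an assumption of this example. The class and method names are hypothetical. The sketch shows why a CF is a useful summary: two CFs can be merged by simple addition, and the centroid and radius of a sub-cluster can be derived from the CF alone, without the original objects.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int          # N: number of objects in the sub-cluster
    ls: tuple       # LS: per-dimension linear sum of the objects
    ss: float       # SS: sum of the squared objects (stored as a scalar)

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """CFs are additive: merging sub-clusters just adds the components."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return ClusteringFeature(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the objects from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))


cf = ClusteringFeature.from_point((2, 5))          # hypothetical objects
for point in [(3, 2), (4, 3)]:
    cf = cf.merge(ClusteringFeature.from_point(point))

print(cf.n, cf.ls, cf.ss)     # 3 (9, 10) 67
print(cf.centroid())
print(cf.radius())
```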

Working:
1. BIRCH scans the data set to build an initial in-memory CF tree, which represents a multilevel compression of the data objects in the form of clustering features.
2. BIRCH applies a selected clustering algorithm to cluster the leaf nodes of the CF tree.
Limitation:
· It is suitable only for spherical clusters.

CURE (Clustering Using Representatives)
Working steps:
1. Draw a random sample of the objects contained in the data set.
2. Divide the sample into a fixed set of partitions.
3. Partially cluster each partition.
4. Remove outliers by random sampling: if a cluster grows too slowly, delete it from the partition.
5. Cluster the partial clusters of the partitions using the shrinking factor: the representative points of each newly formed cluster are moved towards the cluster centre by a user-defined fraction, the shrinking factor.
6. Label the data with the clusters that were formed.
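
The shrinking of representative points described above can be illustrated in isolation. This sketch covers only that one operation, not the full CURE algorithm. The function names, the symbol `alpha` for the shrinking factor, and the sample points are assumptions made for the example.

```python
def centroid(points):
    """Cluster centre: the per-dimension mean of the points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def shrink_representatives(representatives, centre, alpha):
    """Move each representative point a fraction alpha towards the centre.

    alpha = 0 leaves the points where they are; alpha = 1 collapses
    them onto the cluster centre.
    """
    return [
        tuple(r + alpha * (c - r) for r, c in zip(rep, centre))
        for rep in representatives
    ]


cluster = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]     # hypothetical cluster
centre = centroid(cluster)                         # (2.0, 2.0)
reps = [(0.0, 0.0), (4.0, 0.0)]                    # hypothetical representatives
print(shrink_representatives(reps, centre, alpha=0.5))
# [(1.0, 1.0), (3.0, 1.0)]
```

Pulling the representatives towards the centre reduces the influence of points on the fringe of a cluster, which is one way the method limits the effect of outliers.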

Density Based Method
DBSCAN working steps:
1. Check the ε-neighbourhood of each data point in the data set.
2. Mark an object as a core object if its ε-neighbourhood contains at least MinPts data points.
3. Collect all the objects that are directly density-reachable from these core objects.
4. Merge the objects that are density-connected to form a new cluster.
5. Terminate the process when no other data point can be added to any cluster.
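
A compact sketch of the DBSCAN steps above, assuming Euclidean distance and a neighbourhood that includes the point itself. The names `region_query`, `eps`, `min_pts`, and `NOISE`, and the sample points, are illustrative choices. The neighbourhood search is a plain linear scan, so this is suitable only for small data sets.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points in the eps-neighbourhood of point i (incl. i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)              # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:          # not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                 # start a cluster at this core
        queue = [j for j in neighbours if j != i]
        while queue:                           # collect density-reachable points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id         # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is a core object: expand
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels


points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.0),      # hypothetical data
          (8.0, 8.0), (8.1, 8.2), (7.9, 8.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)    # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) is not density-reachable from any core object, so it is labelled as noise instead of being forced into a cluster.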