Presentation is loading. Please wait.

Presentation is loading. Please wait.

Intelligent and Adaptive Systems Research Group A Novel Method of Estimating the Number of Clusters in a Dataset Reza Zafarani and Ali A. Ghorbani Faculty.

Similar presentations


Presentation on theme: "Intelligent and Adaptive Systems Research Group A Novel Method of Estimating the Number of Clusters in a Dataset Reza Zafarani and Ali A. Ghorbani Faculty."— Presentation transcript:

1 Intelligent and Adaptive Systems Research Group A Novel Method of Estimating the Number of Clusters in a Dataset Reza Zafarani and Ali A. Ghorbani Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada {r.zafarani, ghorbani} @ unb.ca What is Clustering? The unsupervised division of patterns (data points, feature vectors, instances,…) into groups of similar objects. Objects in the same group are more similar whereas objects in different groups are more dissimilar. The Need of Clustering Analysis of Data / Finding (dis)similar data. Need of abstraction / Reducing redundancy. Alleviate the effect of low computation power. Cost efficiency and business use-cases. Related Work Early literature in the area of dynamic clustering have attempted to solve this by running algorithms for several Ks (number of clusters). Best K among them is determined based on some coefficients or statistics. Distance between two cluster centroids normalized by cluster's standard deviation could be used as a coefficient. Silhouette coefficient, which compares the average distance Value between points in the same cluster and the average Distance value between points in different clusters. These coefficients are plotted as a function of K (number of clusters) and the best K is selected. Probabilistic measures which determine the best model in mixture models can also be used. In this area, an optimal K corresponds to the best fitting model. Some famous criteria in this area are BIC, MDL, and MML. The Oracle of Clustering Definition: A function, called the Oracle, that can predict whether two random data points belong to the same cluster or not. Oracle Approximation Thresholding: A simple yet effective approach to predict the Oracle is to use thresholding on the similarity function between the data. A justifiable threshold could be a linear combination of the mean and standard deviation of the similarities between the data. In order to make this more accurate dimensionality reduction methods can be used on the data first. Ensemble Clustering Two points are considered to be in the same cluster if a majority of different clustering algorithms consider them to be in the same cluster. This method can be computationally inefficient. Predicting the Number of Clusters Using the Oracle Monte Carlo Sampling: Given this Oracle function and using Monte Carlo sampling of this Oracle function, the probability of random points being in the same cluster can be estimated., and are the size of the cluster, the number of clusters, and the dataset size, respectively. The sampling is controlled using chernoff bounds: Where p is the actual probability, N is the sample size, and are two prefixed constants. For Example: This problem links this area to the research avenues in Partition Theory, and more specifically, variations of Kloosterman sums and summand distributions in integer partitions. Discussions It's simple to see that given this Oracle function, the clustering can be done within a O(mn) time complexity where m is the number of clusters and n is the number of datapoints. The Oracle function can be approximated for different clustering algorithms (preferably for those with quadratic running times). Their running times can be reduced to O(mn), if this approximation can take place. Transfer Learning can be used in order to approximate this Oracle for algorithms with quadratic running times (the relation between oracles of different clustering algorithms can be learnt or approximated). Future Work A reasonable way to predict the number of clusters (m) from these probabilities is to use methods in Partition Theory along with the methods in Convex Optimization. Gap statistics is another area worth investigating in order to detect the number of clusters here. The methods in gap statistics can be used to refine the threshold values which are used in oracle prediction. The transfer learning function can be learnt and the optimum conditions under which the function is learnable can be discovered. References Milligan, G., Cooper, M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2) (1985) 159-179 Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number of clusters. Proceedings of the 17 th International Conf. on Machine Learning (2000) 727- 734 Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 63(2) (2001) 411-423 Guan, Y., Ghorbani, A., Belacel, N.: Y-means: A clustering method for intrusion detection. Proceedings of Canadian Conference on Electrical and Computer Engineering (2003) Erdos, P., Lehner, J.: The distribution of the number of summands in the partitions of a positive integer. Duke Math. J 8(2) (1941) 335-345 Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004) Challenges in Clustering Dynamism Validity High Dimensions (curse of dimensionality) Subjectivity e.g. the set {ship, bird, fish} can be clustered in two different ways. Large Data Sets Complexity Proximity Measures Initial Conditions Ensemble Techniques


Download ppt "Intelligent and Adaptive Systems Research Group A Novel Method of Estimating the Number of Clusters in a Dataset Reza Zafarani and Ali A. Ghorbani Faculty."

Similar presentations


Ads by Google