A genetic approach to the automatic clustering problem
Authors: Lin Yu Tseng, Shiueng Bien Yang
Graduate: Chien-Ming Hsiao

Outline
  Motivation
  Objective
  Introduction
  The basic concept of the genetic strategy
  The genetic clustering algorithm
  The heuristic to find a good clustering
  Conclusion
  Personal Opinion

Motivation
  Some clustering algorithms require the user to provide the number of clusters as input.
  The user generally has no idea how many clusters there should be, and the number is not easy to guess.
  A poor guess may give a poor clustering result, especially when the number of clusters is large.

Objective
  Propose a genetic clustering algorithm that
    automatically searches for a proper number of clusters
    classifies the objects into these clusters

Introduction
  Clustering methods
    Hierarchical
      Agglomerative methods
      Divisive methods
    Non-hierarchical
      The K-means algorithm
        An iterative hill-climbing algorithm
        The solution obtained depends on the initial clustering

The basic concept of the genetic strategy

The genetic clustering algorithm
  The algorithm CLUSTERING consists of two stages.
  Stage 1: the nearest-neighbor algorithm
    Groups the data that are close to one another.
    Reduces the data to a moderate size that is suitable for the genetic clustering algorithm.
  Stage 2: the genetic clustering algorithm
    Groups the small clusters into larger clusters.
    A heuristic strategy is then used to find a good clustering.

The nearest-neighbor algorithm
  Distance: based on the average of the nearest-neighbor distances.
  Steps
    1. For each object O_i, find the distance between O_i and its nearest neighbor.
    2. Compute d_av, the average of the nearest-neighbor distances found in step 1.
    3. View the n objects as nodes of a graph. Compute the n × n adjacency matrix A.
    4. Find the connected components of this graph.
       Let the data sets represented by these connected components be denoted by B_1, B_2, …, B_m.
       Let the center of each set B_i be denoted by V_i, 1 ≤ i ≤ m.
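The four steps above can be sketched in Python as follows. This is only an illustration: the slides do not state the rule used to fill the adjacency matrix, so the sketch assumes that two objects are connected when their distance is at most u * d_av, where u is a hypothetical scale factor. The function name and the use of the coordinate-wise mean as the center V_i are also assumptions.

```python
import math

def nearest_neighbor_groups(points, u=1.5):
    """Group points into connected components B_1..B_m and return their centers.

    `u` is a hypothetical scale factor; the slides do not give the threshold.
    Requires at least two points.
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Step 1: nearest-neighbor distance of each object O_i.
    nn = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 2: d_av, the average nearest-neighbor distance.
    d_av = sum(nn) / n
    # Step 3: adjacency matrix (assumed rule: edge if distance <= u * d_av).
    adj = [[i != j and dist[i][j] <= u * d_av for j in range(n)] for i in range(n)]
    # Step 4: connected components by depth-first search.
    seen, groups = [False] * n, []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if adj[i][j] and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        groups.append(sorted(comp))
    # Center V_i of each set B_i (assumed: coordinate-wise mean).
    dims = range(len(points[0]))
    centers = [tuple(sum(points[i][k] for i in g) / len(g) for k in dims) for g in groups]
    return groups, centers
```

For example, calling nearest_neighbor_groups on three points near the origin and two points near (10, 10) yields two groups, which then play the role of B_1 and B_2 in the second stage.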

The genetic algorithm
  Initialization step
  Iterative generations
    Reproduction phase
    Crossover phase
    Mutation phase

The genetic algorithm
  Initialization step
    A population of N strings is randomly generated.
    The length of each string is m, where m is the number of sets obtained in the first stage.
    Each string corresponds to a subset of {B_1, B_2, …, B_m}: if B_i is in this subset, the ith position of the string is 1; otherwise, it is 0.
    Each B_i in the subset is used as a seed to generate a cluster.
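A minimal sketch of this encoding, assuming strings are stored as Python lists of 0/1. The function names are hypothetical, and forcing at least one 1 per string is an added assumption so that every string yields at least one seed; the slides do not say how an all-zero string is handled.

```python
import random

def init_population(n_strings, m, rng=random):
    """Return n_strings random binary strings (lists of 0/1) of length m."""
    population = []
    for _ in range(n_strings):
        bits = [rng.randint(0, 1) for _ in range(m)]
        if not any(bits):
            # Assumption: avoid an empty seed subset by setting one random bit.
            bits[rng.randrange(m)] = 1
        population.append(bits)
    return population

def seeds_of(bits):
    """Indices i (0-based) of the sets B_i that a string selects as seeds."""
    return [i for i, b in enumerate(bits) if b == 1]
```

With m = 5, the string [1, 0, 1, 1, 0] selects the first, third and fourth sets as seeds.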

The genetic algorithm

How to generate a set of clusters from the seeds
  Let T = {T_1, T_2, …, T_s} be the subset corresponding to a string.
  The initial clusters C_i are the T_i, and the initial centers S_i of the clusters are the V_i, for i = 1, 2, …, s.
  The size of cluster C_i is |C_i| = |T_i| for i = 1, 2, …, s, where |T_i| denotes the number of objects belonging to T_i.

The genetic algorithm
  The B_i's in {B_1, B_2, …, B_m} - T are taken one by one, and the distance between the center V_i of the taken B_i and the center S_j of each cluster C_j is calculated.
  If B_i is classified into the cluster C_j, the center S_j and the size of the cluster C_j are recomputed.
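The cluster-growing procedure can be sketched as below. Two points are assumptions, since the slides do not spell them out: each non-seed B_i joins the cluster whose center S_j is nearest to V_i, and the recomputed center is the size-weighted mean of the old center and V_i. The function name grow_clusters is hypothetical.

```python
import math

def grow_clusters(centers, sizes, bits):
    """Assign every non-seed set B_i to the cluster with the nearest center.

    centers[i] is V_i, sizes[i] is |B_i|, bits is the 0/1 string.
    Returns (labels, cluster_centers, cluster_sizes), where labels[i] is
    the index of the cluster that B_i belongs to.
    """
    seeds = [i for i, b in enumerate(bits) if b == 1]
    if not seeds:
        raise ValueError("string selects no seeds")
    S = [list(centers[i]) for i in seeds]   # cluster centers S_j
    size = [sizes[i] for i in seeds]        # cluster sizes |C_j|
    labels = {i: j for j, i in enumerate(seeds)}
    for i, b in enumerate(bits):
        if b == 1:
            continue
        # Assumed rule: B_i joins the cluster with the nearest center.
        j = min(range(len(S)), key=lambda c: math.dist(centers[i], S[c]))
        total = size[j] + sizes[i]
        # Assumed update: size-weighted mean of the old center and V_i.
        S[j] = [(size[j] * s + sizes[i] * v) / total for s, v in zip(S[j], centers[i])]
        size[j] = total
        labels[i] = j
    return [labels[i] for i in range(len(bits))], S, size
```

Because the centers move as sets are absorbed, the result can depend on the order in which the B_i are taken; the sketch simply uses index order.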

The genetic algorithm
  Reproduction phase
    The intra-distance within the cluster C_i
    The inter-distance between the cluster C_i and the set of all other clusters
    The fitness function of a string R

The genetic algorithm
  Crossover phase
    Two random numbers p and q in [1, m] are generated to decide which pieces of the strings are to be interchanged.
    The crossover operator is applied with probability p_c.
  Mutation phase
    Each chosen bit is changed from 0 to 1 or from 1 to 0.
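A short sketch of both operators, assuming list-of-bits strings. Reading p and q as the 1-based, inclusive end points of the exchanged segment is an assumption, as is choosing the bits to mutate independently with a mutation probability p_m. The default rates mirror the 80 % and 5 % quoted in the Experiments slide.

```python
import random

def crossover(a, b, p_c=0.8, rng=random):
    """With probability p_c, swap the segment between two random cut points."""
    if rng.random() >= p_c:
        return a[:], b[:]
    m = len(a)
    p, q = sorted((rng.randint(1, m), rng.randint(1, m)))
    # Positions p..q (1-based, inclusive) are exchanged; assumed reading of the slide.
    lo, hi = p - 1, q
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]

def mutate(bits, p_m=0.05, rng=random):
    """Flip each bit independently with probability p_m."""
    return [1 - b if rng.random() < p_m else b for b in bits]
```

Both functions return new lists and leave their inputs unchanged, which keeps the parent strings available for the next reproduction step.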

The heuristic strategy to find a good clustering
  D_1(w) estimates the closeness of the clusters in the clustering.
  D_2(w) estimates the compactness of the clusters in the clustering.

The heuristic strategy to find a good clustering
  The values of w are chosen from [w_1, w_2] by some kind of binary search.
  The strategy finds the greatest jump in the values of D_1(w) and the greatest jump in the values of D_2(w).
  Based on these jumps, it then decides which clustering is a good one.

Experiments
  The population size is 50.
  The crossover rate is 80 %.
  The mutation rate is 5 %.
  [w_1, w_2] = [1, 3], where w_1 is the smallest value and w_2 is the largest value.
  Three sets of data were used.

Fig. (a): The first set of data consists of three groups of points on the plane. The densities of the three groups are not the same.
Fig. (b), (c): K-means algorithm
Fig. (d): Complete-link method
Fig. (e): Single-link method

Fig. (a): The original data set with five groups of points
Fig. (b), (c) and (d): K-means algorithm
Fig. (e): By CLUSTERING, complete-link, single-link and K-means

Conclusion and Personal Opinion
  The experimental results show that CLUSTERING is effective.
  It can automatically search for a proper number of clusters.