ADBIS 2007
Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, York University (3rd October, 2007)

1st October 2007, ADBIS 2007, Varna, Bulgaria

Overview
Introduction
Data Transformation
The Data Summarization Approach
– Dynamic Aggregation of Relational Attributes (DARA)
Experimental Evaluations
Experimental Results
Conclusions

Introduction
Relational databases require effective and efficient ways of extracting patterns from content stored in multiple tables.
Relational database setting
– A record can be associated with one or more records because of the one-to-many association constraint.
– Traditional data mining tools require the data in a relational database to be transformed into an attribute-value format by joining multiple tables.
– However, for large volumes of relational data with a high degree of one-to-many association, this process is inefficient: the joined table can be too large to process, and some information may be lost during the join operation.

Introduction
We propose a GA-based clustering technique to aggregate multiple instances.
We transform the data into a form suitable for the clustering task.
We treat the multiple instances of a record as a bag of terms.
We provide experimental results comparing several clustering algorithms as means of aggregating multiple instances:
– Semi-supervised GA-clustering DARA-based algorithm (SSGA-DARA)
– GA-clustering DARA-based algorithm (GA-DARA)
– k-means DARA-based clustering algorithm (K-DARA)

Data Transformation
Let R denote a set of m records stored in the target table, and let S denote a set of n records (T_1, T_2, T_3, …, T_n) stored in the non-target table.
Let S_i be a subset of S, S_i ⊆ S, associated with a single record R_a stored in the target table, R_a ∈ R.
The association of these records can be described as R_a → S_i.
Since a record can be characterized by the bag of records associated with it, we can use the vector space model to represent the data.
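The association R_a → S_i can be pictured as grouping non-target rows by the target record they reference. The following is only a minimal sketch: the function name `build_bags`, the record identifiers, and the (target id, value) row layout are illustrative assumptions, not part of the presentation.

```python
from collections import defaultdict


def build_bags(non_target_rows):
    """Group non-target values by the target record they are associated with.

    non_target_rows: iterable of (target_id, value) pairs (hypothetical layout).
    Returns a dict mapping each target id to its bag (list) of values.
    """
    bags = defaultdict(list)
    for target_id, value in non_target_rows:
        bags[target_id].append(value)  # duplicates are kept: it is a bag, not a set
    return dict(bags)


# Hypothetical rows: record R1 is associated with three non-target rows, R2 with one.
rows = [("R1", "London"), ("R1", "Paris"), ("R2", "London"), ("R1", "London")]
bags = build_bags(rows)
```

Keeping duplicates matters here, because term frequencies are needed later by the vector space model.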

Data Transformation
Non-target table with a single attribute
1. Compute the cardinality of the attribute's domain in the non-target table (discretize continuous values first).
2. Encode the values: find the number of bits, n, needed to represent them, where 2^(n-1) < |attribute's domain| ≤ 2^n. For example, if the attribute has 5 different values (London, New York, Chicago, Paris, Kuala Lumpur), then 3 bits suffice (2^2 < 5 ≤ 2^3), and the values are encoded as 001, 010, 011, 100, 101.
3. Add each encoded term to the bag of terms.
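The encoding step can be sketched as follows. This is an illustration under assumptions: the function name `encode_domain` and the `first_code` parameter are hypothetical, and codes are numbered from 1 only to match the city example on the slide.

```python
def encode_domain(values, first_code=1):
    """Map each distinct value to a fixed-width binary code.

    Codes are assigned in order of first appearance, starting at first_code
    (1 by default, to mirror the slide's example). Note that when the domain
    size is an exact power of two, starting at 1 needs one more bit than the
    slide's bound; pass first_code=0 to stay within that bound.
    """
    domain = list(dict.fromkeys(values))      # distinct values, order preserved
    largest = len(domain) - 1 + first_code    # largest code that will be used
    n_bits = max(1, largest.bit_length())     # bits needed for that code
    return {v: format(i + first_code, f"0{n_bits}b") for i, v in enumerate(domain)}


cities = ["London", "New York", "Chicago", "Paris", "Kuala Lumpur"]
codes = encode_domain(cities)
# codes["London"] is "001" and codes["Kuala Lumpur"] is "101"
```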

Data Transformation
Non-target table with multiple attributes
1. Repeat steps (1) and (2) of the previous case for all attributes.
2. Combine p attributes per term:
– Let F = (F_1, F_2, F_3, …, F_k) denote the k attributes.
– Let dom(F_i) denote the domain of the i-th attribute.
– An instance may have the values (F_{1,a}, F_{2,b}, F_{3,c}, F_{4,d}, …, F_{k-1,b}, F_{k,n}), where a ∈ dom(F_1), b ∈ dom(F_2), …, n ∈ dom(F_k).
– If p = 1, k terms are produced: 1:F_{1,a}, 2:F_{2,b}, 3:F_{3,c}, 4:F_{4,d}, …, k-1:F_{k-1,b}, k:F_{k,n}
– If p = 2, k/2 terms are produced (for an even number of fields): 1:F_{1,a} F_{2,b}, 2:F_{3,c} F_{4,d}, …, (k/2):F_{k-1,b} F_{k,n}
– If p = k, a single term is produced: 1:F_{1,a} F_{2,b} F_{3,c} F_{4,d} … F_{k-1,b} F_{k,n}
3. Add each encoded term to the bag of terms.
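A small sketch of combining p consecutive encoded attributes into indexed terms is given below. The function name `combine_terms`, the plain string concatenation of codes, and the handling of a shorter final chunk when k is not divisible by p are assumptions made for illustration.

```python
def combine_terms(encoded_fields, p):
    """Build indexed terms from one instance by joining p consecutive fields.

    encoded_fields: the instance's k encoded attribute values, in attribute order.
    Returns terms of the form "index:code...", with indices starting at 1.
    If k is not a multiple of p, the last term simply holds fewer fields
    (an assumption; the slide only covers the evenly divisible case).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    terms = []
    for start in range(0, len(encoded_fields), p):
        chunk = encoded_fields[start:start + p]
        terms.append(f"{start // p + 1}:{''.join(chunk)}")
    return terms


instance = ["001", "10", "1", "011"]   # hypothetical encoded values, k = 4
terms_p1 = combine_terms(instance, 1)  # k terms
terms_p2 = combine_terms(instance, 2)  # k/2 terms
terms_pk = combine_terms(instance, 4)  # a single term
```

Larger p yields fewer but more specific terms, so p controls how much attribute co-occurrence is preserved in the bag.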

GA-Based Clustering Technique
A Genetic Algorithm
– A GA is a computational abstraction of biological evolution that is used to solve some optimisation problems.
– A GA is an iterative process that applies a series of genetic operators (selection, crossover, mutation) to a population of elements (chromosomes).
– Initially, a random population is created, which represents different points in the search space.
– An objective and fitness function is associated with each chromosome and represents the degree of goodness of that chromosome.
– The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied.

GA-Based Clustering Technique
A Clustering Algorithm
– Each record R_a is considered to be a vector in the term space (pattern space).
– We employ the tf-idf term weighting model, in which each record is represented as (tf_1·log(n/df_1), tf_2·log(n/df_2), …, tf_m·log(n/df_m)), where tf_i is the frequency of the i-th term in the record, df_i is the number of records that contain the i-th term, and n is the number of records.
– In the vector space model, cosine similarity is the most commonly used measure of the similarity between two records R_i and R_j.
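The tf-idf weighting and cosine similarity described above can be sketched in a few lines. The function names, the sparse dictionary representation, and the toy bags are illustrative assumptions; the natural logarithm is used because the slide does not state a base.

```python
import math
from collections import Counter


def tfidf_vectors(bags):
    """Turn each bag of terms into a sparse tf-idf vector: tf_i * log(n / df_i)."""
    n = len(bags)
    # df counts, for each term, how many bags contain it at least once.
    df = Counter(term for bag in bags for term in set(bag))
    vectors = []
    for bag in bags:
        tf = Counter(bag)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


bags = [["a", "a", "b"], ["a", "c"], ["c", "c"]]  # hypothetical bags of terms
vecs = tfidf_vectors(bags)
sim_disjoint = cosine(vecs[0], vecs[2])  # no shared terms
sim_shared = cosine(vecs[1], vecs[2])    # share the term "c"
```

A term that occurs in every record gets weight log(1) = 0, so it does not influence the similarity.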

GA-Based Clustering Technique
A Semi-Supervised Clustering Algorithm
– The basis of our semi-supervised algorithm is an unsupervised clustering method optimized with a genetic algorithm, incorporating a measure of classification accuracy used in decision tree algorithms, the Gini index (GI).
– We examine clustering algorithms that minimize some objective function applied to the K cluster centers, such as cluster quality (the Davies-Bouldin Index, DBI).
– DBI uses both within-cluster and between-cluster distances to measure cluster quality.
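For reference, the standard Davies-Bouldin Index can be computed as in the sketch below. It assumes Euclidean distance, at least two clusters, and distinct centroids; the slides do not say which distance the authors used, so this illustrates the index itself rather than their implementation.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def davies_bouldin(clusters):
    """Standard DBI: mean over clusters of the worst (S_i + S_j) / d(c_i, c_j).

    S_i is the average distance of cluster i's points to its centroid
    (within-cluster scatter); d(c_i, c_j) is the distance between centroids.
    Lower values indicate compact, well-separated clusters.
    """
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two tight, well-separated clusters (hypothetical data).
dbi = davies_bouldin([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
```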

GA-Based Clustering Technique
A Semi-Supervised Clustering Algorithm
– Clustering can be considered as a K-nary partition, so the Gini index can be applied to measure the partition's impurity.
– The GI of a cluster k is defined in terms of n, the number of classes; P_kc, the number of points belonging to the c-th class in cluster k; and N_k, the total number of points in cluster k.
– The impurity of the partitioning into K clusters is defined in terms of N, the number of points in the dataset, and T_Ck, the number of points in cluster k.
– Minimizing the impurity measure gives a better-quality partition.
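The slide's formulas are not preserved in this transcript. The sketch below uses the usual Gini form that matches the symbols listed above, namely GI_k = 1 − Σ_c (P_kc / N_k)² for a cluster and Σ_k (T_Ck / N)·GI_k for the whole partition; treating these as the authors' exact definitions is an assumption.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one cluster, given the class labels of its points."""
    n_k = len(labels)
    return 1.0 - sum((count / n_k) ** 2 for count in Counter(labels).values())


def partition_impurity(clusters):
    """Size-weighted average of the per-cluster Gini impurities."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * gini(c) for c in clusters)


# Hypothetical labelled clusters: one evenly mixed, one pure.
mixed = ["pos", "pos", "neg", "neg"]
pure = ["pos", "pos", "pos", "pos"]
impurity = partition_impurity([mixed, pure])
```

A pure cluster contributes zero, so the measure rewards partitions whose clusters line up with the class labels.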

GA-Based Clustering Technique
A Semi-Supervised Clustering Algorithm
– The algorithm becomes semi-supervised by minimizing an objective function that is a linear combination of the cluster dispersion measure (DBI) and the cluster impurity measure (GI).
– More specifically, given N points and K clusters, we select the K cluster centers that minimize this linear combination of DBI and GI.
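The objective function itself is not preserved in this transcript. A linear combination of the two measures, as described above, would have the following general shape, where the weights α and β are assumed symbols and their values are not given in the source:

```latex
f(N, K) = \alpha \cdot \mathrm{DBI} + \beta \cdot \mathrm{GI}, \qquad \alpha, \beta \ge 0
```

With β = 0 the objective reduces to the purely unsupervised dispersion criterion, which corresponds to the GA-DARA setting that uses dispersion alone.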

GA-Based Clustering Technique
A Semi-Supervised GA-Based Clustering Algorithm
– The semi-supervised GA-based clustering algorithm has two phases.
Phase 1, data reduction: reduce the N data points by grouping each point with its nearest neighbour (this speeds up the genetic clustering).
– Connect each object to its nearest neighbour, and repeat this step as long as the distance between two components of connected objects is below the specified scale value.
– Let the data sets represented by these connected nodes be denoted by B_1, B_2, B_3, …, B_{m-1}, B_m, where m is the number of connected components and m < N, since each B_i (i ≤ m) consists of one or more connected nodes.
Phase 2, genetic clustering: use the genetic algorithm to cluster the m data points (m < N) based on the objective function defined previously.
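The data reduction phase can be approximated by the simplified sketch below. It makes a single pass, linking each point to its nearest neighbour when that distance is below the scale value, and then returns the connected components; the repeated merging of components described on the slide is not reproduced. The function name, Euclidean distance, and the sample points are assumptions.

```python
import math


def reduce_points(points, scale):
    """Group points into connected components of nearest-neighbour links.

    Returns a sorted list of components, each a list of point indices.
    A point whose nearest neighbour is not closer than `scale` stays alone.
    """
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if j != i:
                d = math.dist(p, q)
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None and best_d < scale:
            parent[find(i)] = find(best_j)  # link the two components

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


pts = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]  # hypothetical points
components = reduce_points(pts, scale=2.0)
```

Each component plays the role of one B_i, so the genetic algorithm then works on m components instead of N points.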

GA-Based Clustering Technique
Genetic algorithm to cluster the m data points, m < N
– Population initialization: a population of X strings of length m is randomly generated, with the number of 1s in each string uniformly distributed within [1, m]. Each string represents a subset S of {B_1, B_2, B_3, …, B_{m-1}, B_m}: if B_i is in S, the i-th position of the string is 1; otherwise it is 0, where i = 1, 2, 3, …, m. Each B_i in S is used as a seed to generate a cluster. If B_j is not in S, it is merged into the nearest B_k in S, where j, k = 1, 2, 3, …, m and j ≠ k.
– Fitness computation: the genetic algorithm maximizes the objective fitness function (OFF).
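The chromosome encoding can be sketched as follows. The function names, the use of one representative point per B_i (here called `centers`), and Euclidean distance for "nearest" are assumptions made for illustration.

```python
import math
import random


def init_population(x, m, rng):
    """Create x binary strings of length m, each with 1..m ones.

    The number of ones is drawn uniformly from [1, m]; their positions
    are then chosen at random.
    """
    population = []
    for _ in range(x):
        ones = rng.randint(1, m)
        on = set(rng.sample(range(m), ones))
        population.append([1 if i in on else 0 for i in range(m)])
    return population


def decode(chromosome, centers):
    """Turn a binary string into clusters: 1-bits are seeds, 0-bits join the nearest seed.

    centers[i] is a representative point for B_i (hypothetical).
    Returns a dict mapping each seed index to the indices in its cluster.
    """
    seeds = [i for i, bit in enumerate(chromosome) if bit]
    clusters = {s: [s] for s in seeds}
    for j, bit in enumerate(chromosome):
        if not bit:
            nearest = min(seeds, key=lambda s: math.dist(centers[j], centers[s]))
            clusters[nearest].append(j)
    return clusters


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = init_population(5, 8, rng)
clusters = decode([1, 0, 0, 1], [(0, 0), (1, 0), (10, 0), (11, 0)])
```

Because the number of 1s varies between strings, the number of clusters is itself part of what the genetic search explores.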

GA-Based Clustering Technique
Genetic algorithm to cluster the m data points, m < N
– Selection: a roulette wheel with slots sized according to fitness is used.
– Crossover: a pair of chromosomes, c_i and c_j, is chosen for applying the crossover operator. One parameter of a genetic system is the probability of crossover, p_c. In this experiment, we set p_c = 0.25.
– Mutation: the mutation operator works on a bit-by-bit basis. Another parameter of the genetic system, the probability of mutation p_m, gives the expected number of mutated bits, p_m·m·X. In this experiment, we set p_m = 0.01.
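The three operators can be sketched as below. The slide does not say which crossover variant was used, so single-point crossover is an assumption, as are the function names; the defaults p_c = 0.25 and p_m = 0.01 come from the slide.

```python
import random


def roulette_select(population, fitness, rng):
    """Draw a new population; each pick is proportional to the chromosome's fitness."""
    return rng.choices(population, weights=fitness, k=len(population))


def crossover(a, b, rng, pc=0.25):
    """With probability pc, swap the tails of a and b at a random cut point.

    Single-point crossover is assumed; otherwise copies of the parents are returned.
    """
    if len(a) > 1 and rng.random() < pc:
        point = rng.randint(1, len(a) - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]


def mutate(chromosome, rng, pm=0.01):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]


rng = random.Random(0)
parents = roulette_select([[0, 0, 0, 0], [1, 1, 1, 1]], [1.0, 3.0], rng)
child_a, child_b = crossover(parents[0], parents[1], rng)
child_a = mutate(child_a, rng)
```

Since each of the m·X bits flips independently with probability p_m, the expected number of flipped bits per generation is p_m·m·X, as stated on the slide. A mutated string could end up with no 1s at all; how that case was handled is not described in the source.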

Experimental Evaluations
The experiments are designed to demonstrate that:
– The semi-supervised genetic algorithm-based clustering technique (SSGA-DARA) achieves a performance gain over the K-clusters clustering technique (K-DARA) and genetic algorithm-based clustering (GA-DARA).
– The proposed genetic algorithm-based DARA method of data transformation outperforms other approaches to relational data mining.
In this experiment, we chose two well-known datasets, Mutagenesis and Musk.

Experimental Evaluations
The evaluations are performed with three different settings:
– K-DARA: clustering is based on K_K clusters, where K_K is manually defined by the user. We take the average accuracy of J48 over 10 different values of K_K.
– GA-DARA: records are automatically clustered into K_GA clusters using the genetic algorithm (GA), with the measure of cluster dispersion as the objective fitness function. Other parameters were set to p_c = 0.25 (crossover probability) and p_m = 0.01 (mutation probability).
– SSGA-DARA: records are automatically clustered into K_SSGA clusters using the genetic algorithm (GA), with the measures of cluster dispersion and impurity as the objective fitness function. Other parameters were set to p_c = 0.25 (crossover probability) and p_m = 0.01 (mutation probability).

Experimental Results
The accuracy estimates, from leave-one-out performance results, are much lower for GA-DARA and K-DARA than for SSGA-DARA.
– This is because neither of these algorithms addresses the cluster impurity problem.
The GA-DARA algorithm performs virtually the same as the K-DARA algorithm.

Experimental Results
Table 2 shows the results of paired one-sided t-tests (p = 0.0025) on the Musk and Mutagenesis datasets.
– The symbol '●' indicates a significant performance improvement by the method in the row over the method in the column.
– The symbol '○' indicates no significant performance improvement by the method in the row over the method in the column.
A represents the K-DARA algorithm, B represents the GA-DARA algorithm, and C represents the SSGA-DARA algorithm.

Conclusions
We presented a novel method for semi-supervised learning that combines supervised and unsupervised learning techniques to extract patterns from multiple tables with a high degree of one-to-many association.
– The basic idea is to treat the series of records associated with a single record in the target table as a bag of terms, then take an unsupervised clustering method and simultaneously optimize the misclassification error of the resulting clusters.
Experimental results show that using DBI for cluster dispersion and GI for cluster impurity finds solutions with much greater accuracy.

Thank You
