ADBIS 2007: Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique. Rayner Alfred, Dimitar Kazakov.


ADBIS 2007
Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, York University (3rd October, 2007)

1st October 2007, ADBIS 2007, Varna, Bulgaria
Overview
Introduction
Data Transformation
The Data Summarization Approach – Dynamic Aggregation of Relational Attributes (DARA)
Experimental Evaluations
Experimental Results
Conclusions

Introduction
Relational databases require effective and efficient ways of extracting patterns based on content stored in multiple tables.
The relational database setting:
– A record can be associated with one or more records due to the one-to-many association constraint.
– Traditional data mining tools require data in relational databases to be transformed into an attribute-value format by joining multiple tables.
– However, with a large volume of relational data with a high degree of one-to-many associations, this process is not efficient: the joined table can be too large to be processed, and we may lose information during the join operation.

Introduction
We propose a GA-based clustering technique to aggregate multiple instances.
We transform the data into a form suitable for the clustering task, treating the multiple instances of a record as a bag of terms.
We provide experimental results comparing a variety of clustering algorithms, as a means of aggregating multiple instances, including:
– the semi-supervised GA-clustering DARA-based algorithm (SSGA-DARA),
– the GA-clustering DARA-based algorithm (GA-DARA),
– the k-means DARA-based clustering algorithm (K-DARA).

Data Transformation
Let R denote a set of m records stored in the target table and let S denote a set of n records (T1, T2, T3, …, Tn) stored in the non-target table.
Let Si be a subset of S, Si ⊆ S, associated with a single record Ra stored in the target table, Ra ∈ R. The association of these records can be described as Ra → Si.
Since a record can be characterized by the bag of records associated with it, we can use the vector space model to represent the data.

Data Transformation
Non-target table with a single attribute:
1. Compute the cardinality of the attribute's domain in the non-target table (discretize continuous values).
2. Encode the values: find the appropriate number of bits, n, that can represent these values, where 2^(n-1) < |attribute's domain| ≤ 2^n. For example, if the attribute has 5 different values (London, New York, Chicago, Paris, Kuala Lumpur), then we need only 3 bits (2^2 < 5 ≤ 2^3) to represent each of these values (001, 010, 011, 100, 101).
3. For each encoded term, add the term to the bag of terms.
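The encoding step above can be sketched in Python. This is an illustrative sketch, not code from the paper; the function name and the choice to start codes at 1 (matching the slide's 001…101 example) are assumptions.

```python
import math

def encode_domain(values):
    """Map each distinct attribute value to a fixed-width binary code.
    The width n satisfies 2**(n-1) < |domain| <= 2**n, per the slide."""
    domain = sorted(set(values))
    n = max(1, math.ceil(math.log2(len(domain))))
    # Codes count up from 1, matching the slide's example (001, 010, ...);
    # if |domain| == 2**n exactly, the last value wraps to the all-zero code.
    return {v: format((i + 1) % (2 ** n), f"0{n}b")
            for i, v in enumerate(domain)}, n

cities = ["London", "New York", "Chicago", "Paris", "Kuala Lumpur"]
codes, width = encode_domain(cities)
# 5 distinct values -> width == 3; e.g. codes["Chicago"] == "001"
```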

Data Transformation
Non-target table with multiple attributes:
1. Repeat Steps (1) and (2) of the previous case for all attributes.
2. Combine the attributes, p at a time:
– Let F = (F1, F2, F3, …, Fk) denote the k attributes.
– Let dom(Fi) denote the domain of the i-th attribute.
– An instance may have the values (F1,a, F2,b, F3,c, F4,d, …, Fk-1,b, Fk,n), where a ∈ dom(F1), b ∈ dom(F2), …, n ∈ dom(Fk).
– If p = 1, we have k terms: 1:F1,a, 2:F2,b, 3:F3,c, 4:F4,d, …, k-1:Fk-1,b, k:Fk,n.
– If p = 2, k/2 terms are produced (assuming an even number of fields): 1:F1,aF2,b, 2:F3,cF4,d, …, (k/2):Fk-1,bFk,n.
– If p = k, a single term is produced: 1:F1,aF2,bF3,cF4,d…Fk-1,bFk,n.
3. For each encoded term, add the term to the bag of terms.
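The term-grouping scheme above can be sketched as follows. This is an assumed illustration (the helper name and the use of encoded strings as inputs are ours), showing how p consecutive field values fold into one indexed term.

```python
def terms_from_instance(values, p):
    """Group an instance's k encoded field values into k/p composite terms,
    each prefixed with its position index, as on the slide."""
    assert len(values) % p == 0, "p must divide the number of attributes"
    return [f"{i + 1}:" + "".join(values[i * p:(i + 1) * p])
            for i in range(len(values) // p)]

inst = ["001", "010", "011", "100"]   # k = 4 encoded attribute values
terms_from_instance(inst, 1)  # ['1:001', '2:010', '3:011', '4:100']
terms_from_instance(inst, 2)  # ['1:001010', '2:011100']
terms_from_instance(inst, 4)  # ['1:001010011100']
```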

GA-Based Clustering Technique
A genetic algorithm:
– A GA is a computational abstraction of biological evolution that is used to solve optimization problems.
– A GA is an iterative process applying a series of genetic operators (selection, crossover, mutation) to a population of elements (chromosomes).
– Initially, a random population is created, representing different points in the search space.
– An objective and fitness function is associated with each chromosome, representing the degree of goodness of the chromosome.
– The process of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied.

GA-Based Clustering Technique
A clustering algorithm:
– Each record Ra is considered to be a vector in the term-space (pattern-space).
– We employ the tf-idf term weighting model, in which each record can be represented as (tf1·log(n/df1), tf2·log(n/df2), …, tfm·log(n/dfm)), where tfi is the frequency of the i-th term in the record, dfi is the number of records that contain the i-th term, and n is the number of records.
– In the vector-space model, cosine similarity is the most commonly used method to compute the similarity between two records Ri and Rj.
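The weighting and similarity computations above can be sketched directly from the slide's formula. This is an illustrative implementation, not the paper's code; sparse dicts stand in for term vectors.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    """records: list of bags of terms. Returns one sparse tf-idf
    vector per record, weighting term i as tf_i * log(n / df_i)."""
    n = len(records)
    df = Counter(t for bag in records for t in set(bag))
    vecs = []
    for bag in records:
        tf = Counter(bag)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that a term occurring in every record gets weight log(n/n) = 0, so ubiquitous terms contribute nothing to the similarity, which is the usual effect of the idf factor.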

GA-Based Clustering Technique
A semi-supervised clustering algorithm:
– The basis of our semi-supervised algorithm is an unsupervised clustering method, optimized with a genetic algorithm, that incorporates a measure of classification accuracy used in decision tree algorithms: the Gini index.
– We examine a clustering algorithm that minimizes an objective function applied to the k cluster centers, such as cluster quality measured by the Davies-Bouldin Index (DBI).
– DBI uses both the within-cluster and between-cluster distances to measure cluster quality.
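The standard Davies-Bouldin index mentioned above can be sketched as follows; this is our illustration of the textbook definition, assuming Euclidean distance and precomputed cluster centers (the slides do not show the implementation).

```python
import math

def davies_bouldin(clusters, centers):
    """Davies-Bouldin index over K clusters: the mean over k of
    max_{j != k} (S_k + S_j) / d(c_k, c_j), where S_k is the average
    distance of cluster k's points to its center. Lower is better."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Within-cluster scatter S_k for every cluster.
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Tight, well-separated clusters yield a small DBI.
dbi = davies_bouldin([[(0, 0), (0, 1)], [(10, 10), (10, 11)]],
                     [(0, 0.5), (10, 10.5)])
```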

GA-Based Clustering Technique
A semi-supervised clustering algorithm:
– Clustering can be considered a K-nary partition, so the Gini index can be applied to measure the partition's impurity.
– The Gini index of a certain cluster k is GI_k = 1 − Σ_{c=1..n} (P_kc / N_k)², where n is the number of classes, P_kc is the number of points belonging to the c-th class in cluster k, and N_k is the total number of points in cluster k.
– The impurity of the partitioning into K clusters is GI = Σ_{k=1..K} (T_Ck / N)·GI_k, where N is the number of points in the dataset and T_Ck is the number of points in cluster k.
– Minimize the impurity measurement to get better quality.
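The two formulas above translate directly into code. This is a minimal sketch of the standard Gini computation (function names are ours), with clusters given as lists of class labels.

```python
from collections import Counter

def gini(labels):
    """Impurity of one cluster: 1 - sum_c (P_kc / N_k)**2."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def partition_impurity(clusters):
    """Size-weighted Gini over the whole partition:
    sum_k (N_k / N) * gini_k."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * gini(c) for c in clusters)

pure = [["x", "x", "x"], ["y", "y"]]   # each cluster holds one class
mixed = [["x", "y"], ["x", "y"]]       # every cluster is half-and-half
partition_impurity(pure)   # 0.0
partition_impurity(mixed)  # 0.5
```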

GA-Based Clustering Technique
A semi-supervised clustering algorithm:
– By minimizing an objective function that is a linear combination of the cluster dispersion measure (DBI) and the cluster impurity measure (GI), the algorithm becomes semi-supervised.
– More specifically, given N points and K clusters, select the K cluster centers that minimize this combined objective function.
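The slide states only that the objective is a linear combination of DBI and GI; the exact coefficients are not shown in the transcript, so the weight alpha below is an assumption for illustration.

```python
def combined_objective(dbi, gi, alpha=0.5):
    """Hypothetical linear combination of cluster dispersion (DBI) and
    cluster impurity (GI); alpha is an assumed weight, since the exact
    coefficients are not reproduced on the slide. Lower is better."""
    return alpha * dbi + (1 - alpha) * gi
```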

GA-Based Clustering Technique
A semi-supervised GA-based clustering algorithm:
There are two phases in the semi-supervised GA-based clustering algorithm.
1. Data reduction: reduce the N data points by grouping all points with their nearest neighbours (this speeds up the genetic clustering).
– Connect each object to its nearest neighbour, and repeat this step as long as the distance between two components of connected objects is below the specified scale value.
– Let the data sets represented by these connected components be denoted by B1, B2, B3, …, Bm-1, Bm, where m is the number of connected components and m < N, since each Bi consists of one or more connected nodes, i ≤ m.
2. Use the genetic algorithm to cluster the m data points based on the objective function defined previously, m < N.
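A simplified sketch of the data-reduction phase follows: each point links to its nearest neighbour when the distance is below the scale, and connected components become the reduced points B1, …, Bm. This union-find implementation is our assumption of one reasonable reading of the slide, not the paper's code.

```python
import math

def reduce_points(points, scale):
    """Group points into connected components by linking each point to its
    nearest neighbour whenever that distance is below `scale`.
    Returns a list of index groups (the B_i of the slide)."""
    parent = list(range(len(points)))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, p in enumerate(points):
        j = min((j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(p, points[j]))
        if math.dist(p, points[j]) < scale:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5)]
reduce_points(pts, 1.0)  # two groups of indices: [0, 1] and [2, 3]
```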

GA-Based Clustering Technique
Using the genetic algorithm to cluster the m data points, m < N:
– Population initialization: a population of X binary strings of length m is randomly generated, with the number of 1s in each string uniformly distributed within [1, m]. Each string represents a subset of {B1, B2, B3, …, Bm-1, Bm}. If Bi is in this subset S, the i-th position of the string is 1; otherwise, it is 0, where i = 1, 2, 3, …, m. Each Bi in subset S is used as a seed to generate a cluster. If Bj is not in the subset, it is merged into the nearest Bk in the subset S, where j, k = 1, 2, 3, …, m and j ≠ k.
– Fitness computation: the objective fitness function (OFF) to be maximized is derived from the objective function defined previously.
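The initialization step above can be sketched as follows; the function name is ours, but the invariants (X strings of length m, number of 1s uniform in [1, m]) are taken from the slide.

```python
import random

def init_population(X, m):
    """Generate X random m-bit chromosomes; the number of 1s in each
    string is uniformly distributed within [1, m]. Bit i set to 1 means
    B_i seeds a cluster."""
    population = []
    for _ in range(X):
        ones = random.randint(1, m)          # how many seeds this string has
        bits = [1] * ones + [0] * (m - ones)
        random.shuffle(bits)                 # scatter the seed positions
        population.append(bits)
    return population

pop = init_population(10, 8)  # 10 chromosomes over m = 8 reduced points
```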

GA-Based Clustering Technique
Using the genetic algorithm to cluster the m data points, m < N:
– Selection: a roulette wheel with slots sized according to fitness is used.
– Crossover: a pair of chromosomes, ci and cj, is chosen for applying the crossover operator. One of the parameters of a genetic system is the crossover probability pc. In this experiment, we set pc = 0.25.
– Mutation: the mutation operator works on a bit-by-bit basis. Another parameter of the genetic system, the mutation probability pm, gives the expected number of mutated bits as pm·m·X. In this experiment, we set pm = 0.01.
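The three operators above can be sketched as follows, using the slide's parameter values as defaults. This is an illustrative implementation; single-point crossover is an assumption, as the slide does not say which crossover variant was used.

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: pick a chromosome with probability
    proportional to its fitness."""
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]

def crossover(a, b, pc=0.25):
    """Single-point crossover applied with probability pc (pc = 0.25
    on the slide); otherwise the parents are copied unchanged."""
    if random.random() < pc and len(a) > 1:
        cut = random.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(chrom, pm=0.01):
    """Bit-by-bit mutation: each bit flips with probability pm
    (pm = 0.01 on the slide)."""
    return [bit ^ 1 if random.random() < pm else bit for bit in chrom]
```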

Experimental Evaluations
Experiments are designed to demonstrate:
– the performance gain of the semi-supervised genetic algorithm-based clustering technique (SSGA-DARA) over the K-clusters clustering technique (K-DARA) and the genetic algorithm-based clustering technique (GA-DARA);
– that the proposed genetic algorithm-based DARA method of data transformation outperforms other approaches to relational data mining.
In these experiments, we chose two well-known datasets: Mutagenesis and Musk.

Experimental Evaluations
The evaluations are performed with three different settings:
– K-DARA: clustering based on K_K clusters, where K_K is manually defined by the user. We take the average accuracy of J48 over 10 different values of K_K.
– GA-DARA: records are automatically clustered into K_GA clusters using the genetic algorithm (GA), taking the measure of cluster dispersion as the objective fitness function. Other parameters were set to pc = 0.25 (crossover probability) and pm = 0.01 (mutation probability).
– SSGA-DARA: records are automatically clustered into K_SSGA clusters using the genetic algorithm (GA), taking the measures of cluster dispersion and impurity as the objective fitness function. Other parameters were set to pc = 0.25 (crossover probability) and pm = 0.01 (mutation probability).

Experimental Results
The accuracy estimates, from leave-one-out performance results, for GA-DARA and K-DARA are much lower compared to SSGA-DARA, since neither of these algorithms addresses the cluster impurity problem.
The GA-DARA algorithm performs virtually the same as the K-DARA algorithm.

Experimental Results
Table 2 shows the results of paired one-sided t-tests (p = 0.0025) on the Musk and Mutagenesis datasets:
– the symbol '●' indicates a significant performance improvement by the method in the row over the method in the column;
– the symbol '○' indicates no significant performance improvement by the method in the row over the method in the column;
– A represents the K-DARA algorithm, B represents the GA-DARA algorithm, and C represents the SSGA-DARA algorithm.

Conclusions
We presented a novel method for semi-supervised learning that combines supervised and unsupervised learning techniques to extract patterns from multiple tables with a high degree of one-to-many associations.
– The basic idea is to treat the series of records associated with a single record in the target table as a bag of terms, take an unsupervised clustering method, and simultaneously optimize the misclassification error of the resulting clusters.
Experimental results show that using DBI for cluster dispersion and GI for cluster impurity finds solutions with much greater accuracy.

Thank You Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique