ADBIS 2007
Discretization Numbers for Multiple-Instances Problem in Relational Database
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, York University (30th September 2007)

30th September 2007, ADBIS 2007, Varna, Bulgaria

Overview
Introduction
Objectives
Experimental Design
– Data Pre-processing: Discretization
– Data Summarization (DARA)
Experimental Evaluation
Experimental Results
Conclusions

Introduction
Handling numerical data stored in a relational database raises problems that single-table data does not, because of
– the multiple occurrences of an individual record in the non-target table, and
– non-determinate relations between tables.
Most traditional data mining methods deal with a single table, and their discretization process is likewise based on a single table.
In a relational database, multiple records with numerical attributes in one table are associated with a single structured individual stored in the target table.
Numbers in multi-relational data mining (MRDM) are often discretized after considering the schema of the relational database.

Introduction
This paper considers different alternatives for dealing with continuous attributes in MRDM.
The discretization procedures considered include algorithms
– that do not depend on the multi-relational structure, and
– that are sensitive to this structure.
Several discretization methods are implemented, including the proposed entropy-instance-based discretization, all embedded in the DARA algorithm.

Objectives
To study the effect of taking the one-to-many association into account when discretizing continuous numbers.
– We propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
– Within DARA, we employ several discretization methods in conjunction with the C4.5 classifier as the induction algorithm.
– We demonstrate on the empirical results that discretization can be improved by taking the multiple-instance problem into consideration.

Experimental Design
Data Pre-processing
– Discretization of continuous attributes in a multi-relational setting using the entropy-instance-based algorithm
Data Aggregation
– Data summarization using DARA, based on cluster dispersion and impurity
Evaluation of the discretization methods using C4.5 classifiers
Pipeline (from the slide diagram): Relational Data → discretization of continuous attributes using the entropy-instance-based algorithm → Categorical Data → data summarization using DARA based on cluster dispersion and impurity → Summarized Data. Learning can then be done using any traditional attribute-value (AV) data mining method.

Data Pre-processing: Discretization
To study the effect of the one-to-many association on discretizing continuous numbers, we propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
Within DARA, we employ several discretization methods in conjunction with the C4.5 classifier as the induction algorithm:
– Equal Height: each bin has the same number of samples
– Equal Weight: considers the distribution of the numeric values present and the groups they appear in
– Entropy-Based: uses the class information entropy
– Entropy-Instance-Based: uses the class information entropy and the individual information entropy
We demonstrate that discretization can be improved by considering the one-to-many problem.
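As an illustration of the simplest method listed on this slide, the following is a minimal sketch of Equal Height binning. The function names, the midpoint placement of the boundaries and the sample values are assumptions made for the example, not details from the talk; the sketch also assumes there are at least as many samples as bins.

```python
def equal_height_cut_points(values, bins):
    """Return bins-1 cut points so that each bin holds about the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for j in range(1, bins):
        idx = j * n // bins  # position of the first sample of bin j
        # Assumption: place the boundary halfway between the two neighbouring samples.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bin(value, cuts):
    """Return the index of the bin that value falls into (a value equal to a cut goes right)."""
    return sum(value >= c for c in cuts)


# Hypothetical attribute values, chosen only for illustration.
values = [1.5, 2.5, 4.0, 5.5, 9.0, 12.5, 15.5, 20.5]
cuts = equal_height_cut_points(values, bins=4)
bins_of_values = [assign_bin(v, cuts) for v in values]
```

With eight samples and four bins, each bin receives two samples. Note that this method looks only at the values themselves and ignores both the class labels and which individual each value belongs to.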

Entropy-Instance-Based (EIB) Discretization
Background
– Based on the entropy-based multi-interval discretization method (Fayyad and Irani 1993)
– Given a set of instances S, two subsets S_1 and S_2 of S, a feature A, and a partition boundary T, the class information entropy is
  $E(A,T,S) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)$
  where, for m classes C_1, ..., C_m,
  $Ent(S_k) = -\sum_{i=1}^{m} p(C_i, S_k) \log_2 p(C_i, S_k)$
– So, for k bins, the class information entropy for multi-interval entropy-based discretization is
  $I(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|} Ent(S_j)$
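The class information entropy of a single boundary T can be sketched as follows. This is only an illustration of the weighted-entropy criterion: the function names and the toy data are assumptions, candidate boundaries are taken as midpoints between distinct sorted values, and the MDL stopping rule of Fayyad and Irani is not included.

```python
from collections import Counter
from math import log2


def ent(labels):
    """Class entropy Ent(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_entropy(values, labels, t):
    """Weighted class entropy E(A,T,S) of the two subsets induced by boundary t."""
    left = [y for x, y in zip(values, labels) if x <= t]
    right = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    return sum(len(s) / n * ent(s) for s in (left, right) if s)


def best_cut(values, labels):
    """Boundary with the lowest E(A,T,S) among midpoints of adjacent distinct values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_entropy(values, labels, t))


# Hypothetical toy data: low values are class "a", high values are class "b".
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cut = best_cut(values, labels)
```

On this toy data the chosen boundary separates the two classes perfectly, so the weighted entropy at that boundary is zero.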

Entropy-Instance-Based (EIB) Discretization
In EIB, besides the class information entropy, a second measure based on individual information entropy is used to select the multi-interval boundaries for discretization.
Given n individuals, the individual information entropy of a subset S is
  $IndEnt(S) = -\sum_{i=1}^{n} p(I_i, S) \log_2 p(I_i, S)$
where $p(I_i, S)$ is the probability of the i-th individual in the subset S.
The total individual information entropy for all k partitions is
  $Ind(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|} IndEnt(S_j)$

Entropy-Instance-Based (EIB) Discretization
By minimizing the function Ind_I(A,T,S,k), which is the sum of the two sub-functions I(A,T,S,k) and Ind(A,T,S,k), we discretize the attribute values based on both the class and the individual information entropy:
  $Ind\_I(A,T,S,k) = I(A,T,S,k) + Ind(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|} Ent(S_j) + \sum_{j=1}^{k} \frac{|S_j|}{|S|} IndEnt(S_j)$
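A minimal sketch of the combined criterion, assuming each partition's class entropy and individual entropy are weighted by the partition's share of the rows and then summed. The row layout (value, class label, individual ID), the function names and the toy molecules are assumptions for illustration only.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def ind_i(rows, cuts):
    """Combined class + individual entropy of the partitions induced by cuts.

    rows: iterable of (value, class_label, individual_id)
    cuts: sorted list of interior boundaries
    """
    partitions = {}
    for value, label, individual in rows:
        partitions.setdefault(bisect_right(cuts, value), []).append((label, individual))
    total = len(rows)
    score = 0.0
    for part in partitions.values():
        weight = len(part) / total
        class_ent = entropy([label for label, _ in part])
        indiv_ent = entropy([ind for _, ind in part])
        score += weight * (class_ent + indiv_ent)
    return score


# Hypothetical non-target rows: two values each for two individuals.
rows = [
    (1.0, "active", "m1"),
    (1.2, "active", "m1"),
    (5.0, "inactive", "m2"),
    (5.5, "inactive", "m2"),
]
clean = ind_i(rows, [3.0])   # keeps each individual's values together
mixed = ind_i(rows, [1.1])   # splits m1's values across two bins
```

A boundary that keeps all values of an individual in the same bin scores lower than one that mixes individuals, which is the behaviour the individual-entropy term is meant to encourage.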

Entropy-Instance-Based (EIB) Discretization
One of the main problems with this discretization criterion is that it is relatively expensive to compute.
– We therefore use a GA-based discretization to obtain a multi-interval discretization for continuous attributes. It consists of an initialization step followed by iterative generations of
  – the reproduction phase,
  – the crossover phase, and
  – the mutation phase.

Entropy-Instance-Based (EIB) Discretization
Initialization step
– A set of strings (chromosomes) is randomly generated, where each string consists of b-1 continuous values, representing the b partitions, drawn between the minimum and maximum values of the attribute.
– For instance, given minimum and maximum values of 1.5 and 20.5 for a continuous field, we have (2.5,5.5,9.3,12.6,15.5,20.5)
– The fitness function for genetic entropy-instance-based discretization is defined as f = 1/Ind_I(A,T,S,k)

Entropy-Instance-Based (EIB) Discretization
The iterative generations consist of
– the reproduction phase: roulette wheel selection is used
– the crossover phase: a crossover probability pc of 0.50 is used
– the mutation phase: a fixed probability pm of 0.10 is used
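The GA loop described on these slides might look roughly like the sketch below. Only the fitness f = 1/cost, roulette wheel selection, pc = 0.50 and pm = 0.10 come from the slides; single-point crossover, per-gene mutation by resampling, the population size, the number of generations, the small epsilon guarding against division by zero, and all function names are assumptions. The `cost` argument stands in for Ind_I(A,T,S,k); a toy cost is used here so the block runs on its own.

```python
import random


def random_chromosome(lo, hi, bins, rng):
    """bins-1 sorted cut points drawn uniformly between lo and hi."""
    return sorted(rng.uniform(lo, hi) for _ in range(bins - 1))


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]


def crossover(a, b, pc, rng):
    """Single-point crossover with probability pc (operator choice is an assumption)."""
    if len(a) > 1 and rng.random() < pc:
        point = rng.randrange(1, len(a))
        a, b = a[:point] + b[point:], b[:point] + a[point:]
    return sorted(a), sorted(b)


def mutate(chrom, lo, hi, pm, rng):
    """Resample each cut point with probability pm (operator choice is an assumption)."""
    return sorted(rng.uniform(lo, hi) if rng.random() < pm else g for g in chrom)


def evolve(cost, lo, hi, bins, pop_size=20, generations=30, pc=0.50, pm=0.10, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(lo, hi, bins, rng) for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(generations):
        # Fitness f = 1 / cost; the epsilon avoids dividing by zero.
        fitnesses = [1.0 / (cost(c) + 1e-9) for c in population]
        offspring = []
        while len(offspring) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            c1, c2 = crossover(p1, p2, pc, rng)
            offspring += [mutate(c1, lo, hi, pm, rng), mutate(c2, lo, hi, pm, rng)]
        population = offspring[:pop_size]
        best = min(population + [best], key=cost)  # remember the best seen so far
    return best


# Toy cost standing in for Ind_I: distance of the cuts from hypothetical targets.
target = [5.0, 10.0, 15.0]
toy_cost = lambda cuts: sum(abs(c - t) for c, t in zip(cuts, target))
best = evolve(toy_cost, lo=1.5, hi=20.5, bins=4)
```

Keeping the best chromosome seen so far is a convenience added for the sketch, so that the returned result is never worse than the best member of the initial population.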

Data Summarization (DARA)
Data summarization based on Information Retrieval (IR) theory
Dynamic Aggregation of Relational Attributes (DARA)
– categorizes objects with similar patterns based on tf-idf weights, borrowed from IR theory
– is scalable and produces interpretable rules
Figure: data summarization of the non-target table (NT) into the target table (T)

Data Summarization (DARA)
Data summarization based on Information Retrieval (IR) theory
TF-IDF (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining.
It is a statistical measure used to evaluate how important a word is to a document in a corpus.
The importance of a term increases proportionally to the number of times it appears in the document, but is offset by the frequency of the term in the corpus.

Data Summarization (DARA)
In a multi-relational setting,
– an object (a single record) is considered as a document
– all corresponding attribute values stored in multiple tables are considered as terms that describe the characteristics of the object (the record)
– DARA transforms the data representation from a relational model into a vector space model and employs the TF-IDF weighting scheme to cluster and summarize the records

Data Summarization (DARA)
tf_i-idf_i (term frequency-inverse document frequency)
The term frequency is
  $tf_i = \frac{n_i}{\sum_k n_k}$
where n_i is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms.
The inverse document frequency is a measure of the general importance of the term,
  $idf_i = \log \frac{|D|}{d}$
with |D| the total number of documents in the corpus and d the number of documents in which the term t_i appears.
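To make the weighting concrete, here is a small sketch that treats each record as a document and its attribute values as terms. The function name, the dictionary layout, the toy element symbols and the use of the natural logarithm are assumptions; the slide does not state the log base.

```python
from collections import Counter
from math import log


def tf_idf(records):
    """records: {record_id: [term, ...]}; returns {record_id: {term: weight}}."""
    n_docs = len(records)
    # Number of records in which each term appears at least once.
    doc_freq = Counter(term for terms in records.values() for term in set(terms))
    weights = {}
    for rid, terms in records.items():
        counts = Counter(terms)
        total = len(terms)
        weights[rid] = {
            t: (c / total) * log(n_docs / doc_freq[t])  # tf * idf
            for t, c in counts.items()
        }
    return weights


# Hypothetical records: two molecules described by the elements of their atoms.
records = {"m1": ["c", "c", "o"], "m2": ["c", "n"]}
weights = tf_idf(records)
```

The term "c" appears in every record, so its idf is log(1) = 0 and it contributes nothing to either vector, while "o" and "n" each appear in one record and receive positive weights.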

Data Summarization (DARA)
Data Summarization Stages
1. Information Propagation Stage
– Propagates the record ID and classes from the target concepts to the non-target tables
2. Data Aggregation Stage
– Summarizes each record to become a single tuple
– Uses a clustering technique based on the TF-IDF weight, in which each record can be represented as
  $(tf_1 \log(n/df_1), tf_2 \log(n/df_2), \ldots, tf_m \log(n/df_m))$
– The cosine similarity method is used to compute the similarity between two records R_i and R_j:
  $\cos(R_i, R_j) = \frac{R_i \cdot R_j}{\|R_i\| \cdot \|R_j\|}$
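The cosine similarity between two record vectors can be sketched as below. Representing a record as a sparse dictionary of term weights and returning 0.0 for an all-zero vector are assumptions made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty or all-zero record is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical weighted records.
r1 = {"o": 3.0, "n": 4.0}
r2 = {"o": 3.0, "n": 4.0}
r3 = {"cl": 1.0}
same = cosine(r1, r2)
disjoint = cosine(r1, r3)
```

Records with identical weights have similarity 1, and records that share no terms have similarity 0, which is what a clustering step needs in order to group records with similar patterns.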

Experimental Evaluation
We implement the discretization methods in the DARA algorithm, in conjunction with the C4.5 classifier as the induction algorithm, which is run on DARA's discretized and transformed data representation.
We chose three varieties of a well-known dataset, the Mutagenesis relational database.
– The data describes 188 molecules falling into two classes, mutagenic (active) and non-mutagenic (inactive); 125 of these molecules are mutagenic.

Experimental Evaluation
Three different sets of background knowledge are used (referred to as experiments B1, B2 and B3).
– B1: The atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom.
– B2: Besides B1, the charges of the atoms are added.
– B3: Besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (ε_LUMO) are added.
We perform a leave-one-out cross-validation using C4.5 for different numbers of bins, b, tested on B1, B2 and B3.
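The leave-one-out procedure itself can be sketched independently of the classifier. In the block below a majority-class predictor stands in for C4.5, purely so that the example is self-contained; the function names are assumptions, and the only figures taken from the slides are the class counts (125 mutagenic out of 188 molecules).

```python
from collections import Counter


def majority_fit(train_x, train_y):
    """Stand-in learner (NOT C4.5): remember the most frequent training class."""
    return Counter(train_y).most_common(1)[0][0]


def majority_predict(model, x):
    """Predict the remembered class regardless of the input."""
    return model


def leave_one_out_accuracy(xs, ys, fit, predict):
    """Train on all examples but one, test on the held-out one, and average."""
    correct = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        correct += predict(model, xs[i]) == ys[i]
    return correct / len(xs)


# Class distribution from the Mutagenesis data: 125 active, 63 inactive.
ys = ["active"] * 125 + ["inactive"] * 63
xs = [None] * len(ys)  # features are irrelevant to the stand-in learner
baseline = leave_one_out_accuracy(xs, ys, majority_fit, majority_predict)
```

The stand-in always predicts "active", so its accuracy equals 125/188; any real learner plugged in through `fit` and `predict` can be compared against that majority-class baseline.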

Experimental Results
Table: Performance (%) of leave-one-out cross-validation of C4.5 on the Mutagenesis dataset
The predictive accuracy of EqualHeight and EqualWeight is lower on datasets B1 and B2 when the number of bins is smaller.
The accuracy of entropy-based and entropy-instance-based discretization is lower on dataset B3 when the number of bins is smaller.
The results of entropy-based (EB) and entropy-instance-based (EIB) discretization on B1, B2 and B3 are virtually identical (EIB performs better than EB in five out of nine tests).

Conclusions
We presented a method called dynamic aggregation of relational attributes (DARA), with entropy-instance-based discretization, to propositionalise a multi-relational database.
The DARA method has shown good predictive accuracy on three variants of the well-known Mutagenesis dataset.
The entropy-instance-based and entropy-based discretization methods are recommended for discretizing attribute values in multi-relational datasets.
– Disadvantage: computation is expensive when the number of bins is large.

Thank You
