Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
A hierarchical clustering algorithm for categorical sequence data

1 A hierarchical clustering algorithm for categorical sequence data
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Advisor: Dr. Hsu
Graduate: Wen-Hsiang Hu
Authors: Seung-Joon Oh*, Jae-Yearn Kim
2003 Elsevier B.V. All rights reserved. Republic of Korea.

2 Outline
Motivation
Objective
Introduction
The proposed similarity measure and hierarchical clustering algorithm
Experimental results
Conclusions
Personal opinion

3 Motivation
Some existing similarity measures, such as edit distance and sequence alignment, have defects.

4 Objective
Generate better-quality clusters.
Lower the computational time.

5 Introduction
We study how to cluster sequence datasets, such as protein sequences, retail transactions, and web logs. We propose a new similarity measure and develop a hierarchical clustering algorithm for categorical sequence data.

6 Measure of similarity between sequences
The similarity between data must be decided before clustering the data.
1. We are given a database D of sequences.
2. A sequence S = (x_1 x_2 ... x_i ... x_j ... x_n) is an ordered list of items, where x_i is an item having a categorical value.
3. A sequence element e_k is a pair of items x_i x_j (i < j) in sequence S.
4. E = (e_1, e_2, ..., e_k, ...) is the collection of sequence elements e_k.
5. The number of elements in E is referred to as the size of E and is denoted by |E|.
Example: the similarity between S1 = A B C D and S2 = A C D E is calculated using the pairs of items in each sequence:
S1 => (AB, AC, AD, BC, BD, CD) = E1
S2 => (AC, AD, AE, CD, CE, DE) = E2
The pairs that appear in both E1 and E2 are AC, AD, and CD. [Resulting similarity value not preserved.]
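The definitions on this slide can be sketched in Python. This is a minimal illustration, assuming each sequence is a string of single-character items; the function name `sequence_elements` is hypothetical and does not come from the paper. It only enumerates the sequence elements and finds the shared ones, since the similarity formula itself is not preserved here.

```python
from itertools import combinations

def sequence_elements(seq):
    """Return E for a sequence: every pair x_i x_j with i < j, in order."""
    # combinations() keeps the original item order, so i < j always holds.
    return [a + b for a, b in combinations(seq, 2)]

E1 = sequence_elements("ABCD")
E2 = sequence_elements("ACDE")

# Pairs that occur in both collections (treated as sets here, an assumption).
common = sorted(set(E1) & set(E2))

print(E1)      # pairs of S1
print(E2)      # pairs of S2
print(common)  # pairs shared by S1 and S2
```

A sequence of n items yields n × (n − 1)/2 elements, which is why |E1| = |E2| = 6 for the four-item sequences above.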

7 Proposed method to compute the dissimilarity between sequences
Our proposed similarity measure between sequences Si and Sj can be converted into a dissimilarity by using a simple transformation. [Formula not preserved.]
The measure d is called a semimetric if it fulfills the following conditions. [Conditions not preserved.]
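For orientation only: a transformation that is commonly used for a similarity s with values in [0, 1], together with the conditions usually required of a semimetric, can be written as below. Both lines are assumptions based on standard usage, not a reconstruction of the slide, so the paper's exact forms may differ.

```latex
d(S_i, S_j) = 1 - s(S_i, S_j), \qquad 0 \le s(S_i, S_j) \le 1

d(S_i, S_j) \ge 0, \qquad d(S_i, S_i) = 0, \qquad d(S_i, S_j) = d(S_j, S_i)
```

Under these conditions the triangle inequality is not required, which is what distinguishes a semimetric from a metric in this usage.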

8 The proposed hierarchical clustering algorithm
First, each of the n × (n − 1)/2 possible pairwise merges is evaluated, and the two clusters that give the maximum value of the criterion function, Eq. (2), are merged.
Eq. (2) is derived from Eq. (3), which is taken from Zhao and Karypis [10]. [Eqs. (2) and (3) not preserved.]
In Eq. (2), n_r is the number of sequences in cluster C_r and k is the number of clusters. In Eq. (3), n_r is the number of objects in C_r and k is the number of clusters.
Example: S1 = ABCD => (AB, AC, AD, BC, BD, CD) = E1
[Figure: sequences S1 to S6 and a new cluster C_new.]

9 The proposed hierarchical clustering algorithm (cont.)
Our proposed agglomerative hierarchical clustering algorithm is presented as follows. [Algorithm listing not preserved.]
[Figure: sequences S1 to S6 and a new cluster C_new.]
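The merge loop described on these two slides can be sketched as follows. This is a rough sketch under stated assumptions, not the authors' listing: the similarity is an assumed Jaccard-style ratio over pair sets, and the criterion is an assumed stand-in (each cluster's summed pairwise similarity divided by its size n_r), because Eq. (2) is not preserved. All function names are hypothetical.

```python
from itertools import combinations

def pair_set(seq):
    """Set of item pairs x_i x_j (i < j) of a sequence."""
    return {a + b for a, b in combinations(seq, 2)}

def similarity(s1, s2):
    # ASSUMPTION: Jaccard-style ratio of shared pairs; the slide's formula is lost.
    e1, e2 = pair_set(s1), pair_set(s2)
    union = e1 | e2
    return len(e1 & e2) / len(union) if union else 0.0

def criterion(clusters):
    # ASSUMPTION: stand-in for Eq. (2). For each cluster C_r, sum the
    # similarity over all ordered pairs (u, v) in C_r and divide by n_r.
    total = 0.0
    for c in clusters:
        total += sum(similarity(u, v) for u in c for v in c) / len(c)
    return total

def agglomerate(sequences, k):
    """Merge clusters greedily until only k clusters remain."""
    clusters = [[s] for s in sequences]
    while len(clusters) > k:
        best_score, best_pair = None, None
        # Evaluate every possible pairwise merge of the current clusters.
        for i, j in combinations(range(len(clusters)), 2):
            rest = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
            score = criterion(rest + [clusters[i] + clusters[j]])
            if best_score is None or score > best_score:
                best_score, best_pair = score, (i, j)
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

result = agglomerate(["ABCD", "ABCE", "XYZW", "XYZV"], 2)
print(result)
```

Each pass recomputes the criterion from scratch for clarity; a practical implementation would cache pairwise similarities.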

10 An efficient algorithm for measuring similarity
Step 1 requires a large computational time.
S1 = 〈a1 ... ai ... ck ... an〉
S2 = 〈b1 ... bj ... cl ... bm〉
Items ai are exclusive to sequence S1, items bj are exclusive to S2, and items ck and cl are common to S1 and S2. However, the common items ck and cl may or may not form the same sequence (see Example 2). The sequence of items ck is called S3, and the sequence of items cl is called S4.
Let E1, E2, E3, and E4 be the collections of sequence elements in S1, S2, S3, and S4, respectively. The similarity between sequences S1 and S2 is then defined in terms of these collections. [Formula not preserved.]

11 An efficient algorithm for measuring similarity (cont.)
S1 = 〈A B C F Z A〉 and S2 = 〈A F C H〉 give the following pairs of items:
S1 => (AB, AC, AF, AZ, AA, BC, BF, BZ, BA, CF, CZ, CA, FZ, FA, ZA) = E1
S2 => (AF, AC, AH, FC, FH, CH) = E2
S1 is compared with S2, and each item of S1 that also occurs in S2 is inserted into S3 = 〈A C F A〉 => E3 = (AC, AF, AA, CF, CA, FA).
S2 is compared with S1, and each item of S2 that also occurs in S1 is inserted into S4 = 〈A F C〉 => E4 = (AF, AC, FC).
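The reduction in this example can be checked with a short Python sketch. It assumes sequences are strings of single-character items; the helper names `sequence_elements` and `common_items` are hypothetical. Only the construction of S3, S4, E3, and E4 is shown, because the similarity formula itself is not preserved.

```python
from itertools import combinations

def sequence_elements(seq):
    """Every pair x_i x_j with i < j, in order of appearance."""
    return [a + b for a, b in combinations(seq, 2)]

def common_items(seq, other):
    """Keep, in their original order, the items of seq that also occur in other."""
    other_items = set(other)
    return "".join(x for x in seq if x in other_items)

S1, S2 = "ABCFZA", "AFCH"
S3 = common_items(S1, S2)   # items of S1 shared with S2
S4 = common_items(S2, S1)   # items of S2 shared with S1

E1, E2 = sequence_elements(S1), sequence_elements(S2)
E3, E4 = sequence_elements(S3), sequence_elements(S4)

# The shared pairs can be found from the shorter collections E3 and E4.
shared_small = set(E3) & set(E4)
shared_full = set(E1) & set(E2)
print(S3, S4, sorted(shared_small), shared_small == shared_full)
```

Any pair present in both E1 and E2 can only involve items that occur in both sequences, so intersecting E3 and E4 finds the same shared pairs while enumerating 6 and 3 pairs instead of 15 and 6.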

12 Experimental results
Three algorithms are compared. Each combines a similarity measure with a hierarchical clustering algorithm:
Algorithm 1: edit distance with the proposed hierarchical clustering algorithm
Algorithm 2: edit distance with the complete linkage method
Proposed clustering algorithm: [similarity measure not preserved] with the proposed hierarchical clustering algorithm
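Algorithms 1 and 2 use edit distance as the baseline measure. The slides do not say which variant was used; the sketch below assumes the unit-cost Levenshtein distance (insert, delete, and substitute each cost 1), and the function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Unit-cost Levenshtein distance between sequences s and t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from s
                            curr[j - 1] + 1,      # insert b into s
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("ABCD", "ACDE"))
```

Unlike the pair-based measure, this is a distance, so smaller values mean more similar sequences.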

13 Experimental results - Splice dataset
The splice dataset contains 767 EI (exon/intron) sequences and 768 IE (intron/exon) sequences.

14 Experimental results - Synthetic dataset
Four different datasets were used: DS1, DS2, DS3, and DS4. Each dataset was a market basket database.
[Table: number of misclassified transactions; values not preserved.]

15 We generated synthetic datasets (2000 transactions).
[Table: number of misclassified transactions; values not preserved.]

16 Conclusions
We developed a hierarchical clustering algorithm and presented an efficient method for determining the similarity measure.

17 Personal Opinion
2-dimension => multi-dimension

