
Slide 1: Unsupervised Learning with Mixed Numeric and Nominal Data
Authors: Cen Li and Gautam Biswas, IEEE Transactions on Knowledge and Data Engineering, 2002
Advisor: Dr. Hsu
Graduate student: Yu-Cheng Chen
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)

Slide 2: Outline
- Motivation
- Objective
- Introduction
- Background
- SBAC
- Experimental results
- Conclusions
- Personal opinion

Slide 3: Motivation
- Traditional clustering algorithms assume features are either all numeric or all categorical.
- However, much of the useful real-world data is described by a mixture of numeric and nominal features.

Slide 4: Objective
Develop unsupervised learning techniques that exhibit good performance on mixed data.

Slide 5: Introduction
Traditional approaches for handling mixed data include (a preprocessing sketch follows below):
- Binary encoding of nominal attributes, so everything becomes numeric.
- Discretizing numeric attributes, so everything becomes nominal.
- Generalizing the criterion function to handle mixed data directly.
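A minimal sketch of the first two workarounds, assuming a small pandas data frame with one nominal and one numeric column (the column names and bin labels are illustrative, not from the paper; the paper itself pursues the third option, a criterion that handles mixed data directly):

    import pandas as pd

    df = pd.DataFrame({
        "color": ["red", "blue", "red", "green"],   # nominal feature
        "weight": [2.3, 7.1, 3.8, 9.0],             # numeric feature
    })

    # 1) Binary (one-hot) encoding: every nominal value becomes a 0/1 column,
    #    so a purely numeric clustering algorithm can be applied.
    encoded = pd.get_dummies(df, columns=["color"])

    # 2) Discretization: numeric values are binned into nominal intervals,
    #    so a purely categorical algorithm can be applied.
    discretized = df.copy()
    discretized["weight"] = pd.cut(df["weight"], bins=3,
                                   labels=["light", "medium", "heavy"])

    print(encoded)
    print(discretized)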

Slide 6: Background
COBWEB/3 uses the category utility (CU) measure for categorical attributes, with an extension for numeric attributes (the formulas appear as images on the original slide).

Slide 7: Background (cont.)
COBWEB/3's CU measure for numeric attributes and the overall CU that combines all attributes are defined on this slide (both formulas appear as images).

Slide 8: Background (cont.)
COBWEB/3 limitations:
- It assumes numeric data follow a normal distribution.
- The estimate of σ_ik is unreliable when the sample size is small.
- When all objects in C_k share one value for an attribute, σ_ik = 0 and 1/σ_ik → ∞, so COBWEB/3 sets 1/σ_ik = 1 whenever σ_ik < 1.
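The CU formulas on slides 6 and 7 appear only as images in this transcript. Below is a minimal sketch of the standard COBWEB/3 category-utility computation for a single attribute (an assumption based on the usual formulation, not copied from the paper's text); the function names and the dict-based object representation are hypothetical, and the 1/σ cap from this slide is applied:

    import math
    from collections import Counter

    def cu_categorical(partition, population, attr):
        """CU contribution of one categorical attribute.
        partition: list of clusters, each a list of objects (dicts)."""
        n = len(population)
        pop_counts = Counter(obj[attr] for obj in population)
        expected = sum((c / n) ** 2 for c in pop_counts.values())
        total = 0.0
        for cluster in partition:
            counts = Counter(obj[attr] for obj in cluster)
            within = sum((c / len(cluster)) ** 2 for c in counts.values())
            total += (len(cluster) / n) * (within - expected)
        return total / len(partition)

    def cu_numeric(partition, population, attr, acuity=1.0):
        """CU contribution of one numeric attribute under the normal
        assumption: the sums of squared probabilities become 1/(2*sqrt(pi)*sigma)."""
        def inv_sigma(objs):
            vals = [obj[attr] for obj in objs]
            mean = sum(vals) / len(vals)
            sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            # fix for sigma -> 0 (this slide): 1/sigma is capped at 1 when sigma < 1
            return 1.0 / max(sigma, acuity)

        n = len(population)
        pop_term = inv_sigma(population)
        total = sum((len(c) / n) * (inv_sigma(c) - pop_term) for c in partition)
        return total / (2 * math.sqrt(math.pi) * len(partition))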

Slide 9: Background (cont.)
ECOBWEB tries to remedy two weaknesses of COBWEB/3:
- the normal-distribution assumption, and
- the breakdown when σ_ik = 0.

Slide 10: Background (cont.)
ECOBWEB limitation: the choice of its parameters has a significant effect on the CU computation.

Slide 11: Background (cont.)
AUTOCLASS:
- Uses a Bayesian method for clustering.
- Derives the most probable class distribution for the data given prior information.
Limitations:
- Computational complexity is very high.
- It is prone to overfitting.

Slide 12: SBAC System
SBAC:
- uses a similarity measure defined by Goodall, and
- adopts a hierarchical agglomerative approach to build the partition structure.
The similarity is driven by the uncommonality of feature-value matches. Example: let X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d}, with P(a) = P(b) >= P(c) = P(d). Because the values c and d are rarer, the similarity of X3 and X4 should be greater than that of X1 and X2.

Slide 13: SBAC System
Summary: for numeric feature values, the similarity depends on
- the difference between the feature values, and
- the uniqueness (rarity) of the feature-value pair.

Slide 14: SBAC System
Computing similarity for numeric attributes: define the More Similar Feature Segment Set
MSFSS((V_i)_k, (V_j)_k) = the set of all value pairs of feature k that are equally or more similar than the pair ((V_i)_k, (V_j)_k), i.e., pairs whose values differ by no more than |(V_i)_k - (V_j)_k|.

Slide 15: SBAC System
- The probability of picking two objects whose values ((V_l)_k, (V_m)_k) fall in MSFSS((V_i)_k, (V_j)_k) is defined on this slide (formula shown as an image).
- The dissimilarity (D_ij)_k of the pair is the sum of these probabilities over the MSFSS.
- The similarity of the pair ((V_i)_k, (V_j)_k) is then the complement of the dissimilarity (formula shown as an image on the original slide).
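A minimal sketch of the numeric-feature similarity described on slides 14 and 15. The slide formulas are images, so the pair probabilities below follow Goodall's draw-two-objects-without-replacement scheme and should be read as an assumption rather than the paper's exact formula; the example values are illustrative:

    from collections import Counter
    from itertools import combinations_with_replacement

    def numeric_similarity(values, vi, vj):
        """Similarity of the value pair (vi, vj) for one numeric feature.
        values: that feature's values over all n objects in the data set."""
        n = len(values)
        freq = Counter(values)
        d = abs(vi - vj)
        dissim = 0.0
        # MSFSS: all value pairs that are at least as close together as (vi, vj)
        for vl, vm in combinations_with_replacement(sorted(freq), 2):
            if abs(vl - vm) <= d:
                if vl == vm:
                    # probability of drawing two objects with the same value vl
                    dissim += freq[vl] * (freq[vl] - 1) / (n * (n - 1))
                else:
                    # two distinct values: either order of drawing counts
                    dissim += 2 * freq[vl] * freq[vm] / (n * (n - 1))
        return 1.0 - dissim  # similarity is the complement of the dissimilarity

    vals = [7.5, 9, 9, 9, 10.5]
    print(numeric_similarity(vals, 9, 9))       # exact match        -> 0.7
    print(numeric_similarity(vals, 7.5, 9))     # values 1.5 apart   -> 0.1
    print(numeric_similarity(vals, 7.5, 10.5))  # values 3 apart     -> 0.0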

Slide 16: SBAC System
For nominal feature values, define the More Similar Feature Value Set
MSFVS((V_i)_k) = the set of all matching pairs ((V_l)_k, (V_l)_k) of feature k that are equally or more similar than the matching pair ((V_i)_k, (V_i)_k), i.e., matches on values that occur no more frequently than (V_i)_k.
Example: with value frequencies f(a) = 3, f(b) = 3, f(c) = 4:
MSFVS(c, c) = {(a, a), (b, b), (c, c)}
MSFVS(b, b) = {(a, a), (b, b)}

Slide 17: SBAC System
- The probability of picking a pair of objects that share a value (V_l)_k in MSFVS((V_i)_k) is defined on this slide (formula shown as an image).
- The dissimilarity (D_ii)_k of the matching pair is the sum of these probabilities over the MSFVS.

Slide 18: SBAC System
Worked example with value frequencies f(a) = 3, f(b) = 3, f(c) = 4:
MSFVS(c, c) = {(a, a), (b, b), (c, c)}
MSFVS(b, b) = {(a, a), (b, b)}
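A minimal sketch of the nominal-feature similarity built from the MSFVS definition above. The pair probability f(f-1)/(n(n-1)) is the usual draw-without-replacement form and is an assumption here (the slide shows the formula only as an image); n = 10 assumes a, b, and c are the only values of this feature:

    def nominal_similarity(freq, vi):
        """Similarity of a match on value vi for one nominal feature.
        freq: dict mapping each feature value to its frequency in the data set."""
        n = sum(freq.values())
        # MSFVS(vi, vi): matches on values that are no more frequent than vi
        msfvs = [v for v, f in freq.items() if f <= freq[vi]]
        # dissimilarity = probability of drawing a pair at least as similar
        dissim = sum(freq[v] * (freq[v] - 1) / (n * (n - 1)) for v in msfvs)
        return 1.0 - dissim

    freq = {"a": 3, "b": 3, "c": 4}          # n = 10
    print(nominal_similarity(freq, "c"))     # 1 - 24/90 ~= 0.733
    print(nominal_similarity(freq, "b"))     # 1 - 12/90 ~= 0.867 (rarer match, higher similarity)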

Slide 19: SBAC System
Aggregating similarity from multiple features: the per-feature results are expressed as Fisher's χ² and combined, with one combination for the numeric features and one for the nominal features (the formulas appear as images on the original slide).
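A minimal sketch of a Fisher chi-square combination of per-feature dissimilarities. This is an assumption: the slide shows the aggregation formulas only as images, and the paper combines the numeric and nominal feature groups separately, so this generic combiner is only illustrative; SciPy's chi-square survival function supplies the tail probability:

    import math
    from scipy.stats import chi2

    def combine_dissimilarities(dissims):
        """Combine per-feature dissimilarities (treated as probabilities)
        into one dissimilarity via Fisher's chi-square statistic."""
        # guard against log(0) when a per-feature dissimilarity is zero
        stat = -2.0 * sum(math.log(max(d, 1e-12)) for d in dissims)
        dof = 2 * len(dissims)
        # probability of a chi-square value at least this large: small
        # per-feature dissimilarities give a small combined dissimilarity
        return chi2.sf(stat, dof)

    # e.g. the numeric and nominal dissimilarities of one object pair
    print(combine_dissimilarities([0.3, 12 / 90]))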

Slide 20: SBAC System
Combining the two types of features. Example objects (object ID with its nominal and numeric feature values):
Object 6: {c, 9}
Object 9: {a, 7.5}
Object 5: {c, 10.5}
Object 8: {c, 9}

Slide 21: SBAC System
The agglomerative clustering algorithm (the pseudocode appears as a figure on the original slide).
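The merging procedure is standard hierarchical agglomeration over the combined dissimilarities. A minimal sketch using SciPy's average-linkage routine as a stand-in for the paper's pseudocode (the dissimilarity matrix values are illustrative, not taken from the paper):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # pairwise dissimilarities for four objects, e.g. obtained by combining
    # the per-feature dissimilarities with Fisher's method
    D = np.array([
        [0.00, 0.40, 0.25, 0.05],
        [0.40, 0.00, 0.70, 0.45],
        [0.25, 0.70, 0.00, 0.30],
        [0.05, 0.45, 0.30, 0.00],
    ])

    # condensed form required by SciPy, then repeated merging of the two most
    # similar clusters; Z records each merge and its dissimilarity (the dendrogram)
    Z = linkage(squareform(D), method="average")
    print(Z)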

Slide 22: SBAC System
The predefined threshold t: set t = 0.3 * D(root); with D(root) = 0.876, t = 0.263.
Walking down from the root of the dendrogram, if the dissimilarity drops by more than t, stop; that level defines the final partition.
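A minimal sketch of this stopping rule applied to the SciPy linkage matrix Z from the previous sketch (an assumption: the paper traverses its own dendrogram, so the exact treatment of the drop and of ties may differ):

    import numpy as np

    def cut_threshold(Z, factor=0.3):
        """Walk down from the root and return a cut height just below the first
        merge whose drop in dissimilarity from the merge above it exceeds
        t = factor * D(root)."""
        heights = np.sort(Z[:, 2])[::-1]   # merge dissimilarities, root first
        t = factor * heights[0]            # e.g. t = 0.3 * 0.876 = 0.263 on this slide
        for upper, lower in zip(heights[:-1], heights[1:]):
            if upper - lower > t:          # the drop exceeds the threshold: stop here
                return (upper + lower) / 2
        return 0.0                         # no sharp drop found: every object stays a singleton

    # usage with the linkage matrix Z from the previous sketch:
    #   from scipy.cluster.hierarchy import fcluster
    #   labels = fcluster(Z, t=cut_threshold(Z), criterion="distance")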

Slide 23: Experimental results
Artificial data:
- 180 data points in three classes G1, G2, G3, with 60 points per class.
- Two nominal and two numeric attributes.

Slide 24: Experimental results (cont.)
(Content shown as an image on the original slide.)

Slide 25: Experimental results (cont.)
Clustering results on the artificial data for SBAC, COBWEB, ECOBWEB, and AUTOCLASS (shown as figures on the original slide).

Slide 26: Experimental results (cont.)
Real data sets:
- Handwritten character (8OX) data: numeric features, 45 objects.
- Mushroom data: nominal features, 200 objects (100 of them poisonous).
- Heart disease data: mixed features, 303 patients.

Slide 27: Experimental results (cont.)
Results (shown as an image on the original slide).

Slide 28: Experimental results (cont.)
Results (shown as an image on the original slide).

Slide 29: Experimental results (cont.)
Results (shown as an image on the original slide).

Slide 30: Conclusions
- The paper proposes a new similarity measure that gives greater weight to feature-value matches that are uncommon in the population.
- The approach clusters mixed data better than the other methods compared.

Slide 31: Personal Opinion
- The time complexity of the approach is high.
- The similarity computation and the clustering process are complicated.

