Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology): Unsupervised Learning with Mixed Numeric and Nominal Data

Presentation transcript:

Unsupervised Learning with Mixed Numeric and Nominal Data
Intelligent Database Systems Lab, National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Graduate: Yu-Cheng Chen
Authors: Cen Li, Gautam Biswas
Source: IEEE Transactions on Knowledge and Data Engineering, 2002

Outline
Motivation
Objective
Introduction
Background
SBAC
Experimental results
Conclusions
Personal Opinion

Motivation
Traditional clustering algorithms assume that features are either all numeric or all categorical.
Most useful data is described by a mix of numeric and nominal features.

Objective
Develop unsupervised learning techniques that perform well on mixed data.

Introduction
Traditional approaches to handling mixed data:
Binary encoding of nominal attributes.
Discretization of numeric attributes.
Generalizing criterion functions to handle mixed data.
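
The first two approaches can be illustrated with a minimal sketch. The helper names `one_hot` and `equal_width_bins` are hypothetical, equal-width binning is only one possible discretization scheme, and the sample values are the four {nominal, numeric} objects that appear later in the deck.

```python
def one_hot(values):
    """Binary-encode a nominal column: one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def equal_width_bins(values, n_bins):
    """Discretize a numeric column into n_bins equal-width intervals (bin indices)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


nominal = ["c", "a", "c", "c"]
numeric = [9, 7.5, 10.5, 9]

encoded, categories = one_hot(nominal)
bins = equal_width_bins(numeric, 3)
print(categories, encoded)  # ['a', 'c'] [[0, 1], [1, 0], [0, 1], [0, 1]]
print(bins)                 # [1, 0, 2, 1]
```

Either transformation lets a single-type algorithm run on the data, at the cost of changing the geometry (binary encoding) or discarding within-bin ordering (discretization).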

Background
COBWEB/3 uses the category utility (CU) measure for categorical attributes:
[equation missing from transcript]
For numeric attributes:
[equation missing from transcript]

Background (cont.)
In COBWEB/3, the CU measure for numeric attributes is defined as:
[equation missing from transcript]
The overall CU is defined as:
[equation missing from transcript]
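
The slide equations did not survive in this transcript. For orientation only, the forms below are the ones usually quoted for COBWEB-style category utility and its numeric (CLASSIT-style) counterpart. They are an assumption, not recovered from the slides. Here $A_i$ is attribute $i$, $V_{ij}$ its $j$-th value, $\sigma_{ik}$ the standard deviation of attribute $i$ in class $C_k$, $\sigma_i$ its standard deviation over all data, and $K$ the number of classes.

```latex
% Nominal attributes: category utility contribution of class C_k
\[
CU_k^{\mathrm{nominal}} = P(C_k) \sum_i \sum_j
  \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]
\]

% Numeric attributes (normal-distribution assumption)
\[
CU_k^{\mathrm{numeric}} = \frac{P(C_k)}{2\sqrt{\pi}} \sum_i
  \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right)
\]

% Overall category utility of a partition into K classes
\[
CU = \frac{1}{K} \sum_{k=1}^{K}
  \left( CU_k^{\mathrm{nominal}} + CU_k^{\mathrm{numeric}} \right)
\]
```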

Background (cont.)
COBWEB/3 limitations:
It assumes that numeric data are normally distributed.
The accuracy of the estimate is suspect when the sample size is small.
When all objects in C_k share a single value, σ_ik = 0 and 1/σ_ik → ∞, so 1/σ_ik is set to 1 whenever σ_ik < 1.
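
A minimal sketch of the zero-variance workaround described above. It assumes the commonly quoted numeric CU form, P(C_k)/(2√π) · Σ_i (1/σ_ik − 1/σ_i), which is not preserved in this transcript; the function names and the `floor` parameter are hypothetical.

```python
import math
from statistics import pstdev


def inv_sigma(values, floor=1.0):
    """Return 1/sigma, replaced by 1 whenever sigma falls below the floor."""
    sigma = pstdev(values)
    return 1.0 if sigma < floor else 1.0 / sigma


def cu_numeric(cluster_cols, all_cols, p_cluster):
    """Numeric CU contribution of one cluster (assumed formula, see lead-in).

    cluster_cols / all_cols hold one list of values per numeric attribute,
    restricted to the cluster and over the whole data set respectively.
    """
    total = sum(inv_sigma(c) - inv_sigma(a) for c, a in zip(cluster_cols, all_cols))
    return p_cluster * total / (2 * math.sqrt(math.pi))


# A cluster whose members all share the value 5: sigma is 0, so 1/sigma is clamped to 1.
print(inv_sigma([5, 5]))                          # 1.0
print(cu_numeric([[5, 5]], [[1, 5, 5, 9]], 0.5))  # small positive value
```

The clamp keeps CU finite, but it also makes every tight cluster score the same regardless of how tight it is.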

Background (cont.)
ECOBWEB attempts to remedy two disadvantages of COBWEB/3:
The normal distribution assumption.
The case σ_ik = 0.

Background (cont.)
ECOBWEB limitations:
The choice of parameters has a significant effect on the CU computation.

Background (cont.)
AUTOCLASS
Uses a Bayesian method for clustering.
Derives the most probable class distribution for the data given prior information.
Limitations:
Computational complexity is too high.
Prone to overfitting.

SBAC System
SBAC uses a similarity measure defined by Goodall and adopts a hierarchical agglomerative approach to build partition structures.
Similarity is determined by the uncommonality of feature value matches.
Example: X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d}, with (P(a) = P(b)) >= (P(c) = P(d)).
The similarity of X3 and X4 should be greater than that of X1 and X2, because their matching values are rarer.

SBAC System: Summary
For numeric feature values, the similarity takes into account:
The difference between the feature values.
The uniqueness of the feature value pair.

SBAC System: Computing similarity for numeric attributes
We define the More Similar Feature Segment Set (MSFSS): the set of all pairs of values for feature k that are equally or more similar to the pair ((V_i)_k, (V_j)_k).

SBAC System
The probability of picking a pair of values ((V_l)_k, (V_m)_k) ∈ MSFSS((V_i)_k, (V_j)_k) is defined as:
[equation missing from transcript]
The dissimilarity (D_ij)_k of the pair is defined as the sum of these probabilities.
The similarity of the pair ((V_i)_k, (V_j)_k) is defined as:
[equation missing from transcript]
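
The slide formulas are missing here, so the sketch below is a reconstruction under stated assumptions rather than the authors' definition. It assumes that a pair is "equally or more similar" when its value gap is smaller, or equal with no more population mass inside the segment; that pair probabilities are 2·f_l·f_m / (n(n−1)) for distinct values and f_l(f_l−1) / (n(n−1)) for equal values; and that similarity is 1 minus dissimilarity. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def numeric_similarity(population, vi, vj):
    """Similarity of the value pair (vi, vj) for one numeric feature."""
    freq = Counter(population)
    values = sorted(freq)
    n = len(population)
    denom = n * (n - 1)

    def mass(lo, hi):
        # Number of objects whose value lies in the closed segment [lo, hi].
        return sum(f for v, f in freq.items() if lo <= v <= hi)

    lo, hi = min(vi, vj), max(vi, vj)
    target_gap, target_mass = hi - lo, mass(lo, hi)

    dissimilarity = Fraction(0)
    for idx, a in enumerate(values):
        for b in values[idx:]:
            gap = b - a
            in_msfss = gap < target_gap or (gap == target_gap and mass(a, b) <= target_mass)
            if not in_msfss:
                continue
            if a == b:
                dissimilarity += Fraction(freq[a] * (freq[a] - 1), denom)
            else:
                dissimilarity += Fraction(2 * freq[a] * freq[b], denom)
    return 1 - dissimilarity


population = [9, 7.5, 10.5, 9]  # numeric values of the four example objects in the deck
print(numeric_similarity(population, 9, 9))       # 5/6
print(numeric_similarity(population, 7.5, 9))     # 1/6
print(numeric_similarity(population, 7.5, 10.5))  # 0
```

Exact fractions are used so that ties in the gap comparison are not disturbed by rounding; real-valued data would need a tolerance instead.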

SBAC System
For nominal feature values, the similarity is:
[equation missing from transcript]
We define the More Similar Feature Value Set (MSFVS): the set of all pairs of values for feature k that are equally or more similar to the pair ((V_i)_k, (V_i)_k).
Example: f(a) = 3, f(b) = 3, f(c) = 4
MSFVS(c, c) = {(a, a), (b, b), (c, c)}
MSFVS(b, b) = {(a, a), (b, b)}

SBAC System
The probability of picking a pair ((V_l)_k, (V_l)_k) ∈ MSFVS((V_i)_k) is defined as follows:
[equation missing from transcript]
The dissimilarity (D_ii)_k of the pair is defined as the sum of these probabilities.

SBAC System
Example: f(a) = 3, f(b) = 3, f(c) = 4
MSFVS(c, c) = {(a, a), (b, b), (c, c)}
MSFVS(b, b) = {(a, a), (b, b)}
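
The example frequencies can be worked through with a short sketch. Since the slide formula is missing, it assumes that the probability of a matching pair (v, v) is f(v)(f(v) − 1) / (n(n − 1)), that MSFVS(v, v) contains every matching pair whose value is no more frequent than v, that similarity is 1 minus dissimilarity, and that mismatched nominal values have similarity 0. The function name is hypothetical.

```python
from fractions import Fraction


def nominal_similarity(freq, vi, vj):
    """Similarity of two nominal values, given value frequencies in the population."""
    if vi != vj:
        return Fraction(0)  # assumed: a mismatch contributes no similarity
    n = sum(freq.values())
    denom = n * (n - 1)
    # Sum the probabilities of all matching pairs that are at least as rare as (vi, vi).
    dissimilarity = sum(
        Fraction(f * (f - 1), denom) for f in freq.values() if f <= freq[vi]
    )
    return 1 - dissimilarity


freq = {"a": 3, "b": 3, "c": 4}
print(nominal_similarity(freq, "c", "c"))  # 11/15
print(nominal_similarity(freq, "b", "b"))  # 13/15
print(nominal_similarity(freq, "a", "c"))  # 0
```

A match on the rarer value b scores higher (13/15) than a match on the more common value c (11/15), which is the behaviour the measure is designed to have.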

SBAC System: Aggregating similarity from multiple features
The per-feature results are combined by expressing them as Fisher's χ².
For numeric features:
[equation missing from transcript]
For nominal features:
[equation missing from transcript]
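
As a rough illustration of the aggregation step, the sketch below applies the standard Fisher method for combining t independent probabilities: χ² = −2 Σ ln p_k with 2t degrees of freedom. Treating the per-feature dissimilarities (D_ij)_k as those probabilities is an assumption based on the slide text, and the sketch covers only the continuous (numeric) case; the slide's separate formula for nominal features is not reproduced. All function names are hypothetical.

```python
import math


def fisher_chi2(dissimilarities):
    """Fisher's statistic: -2 * sum(ln p). Every p must be strictly positive."""
    return -2.0 * sum(math.log(d) for d in dissimilarities)


def chi2_sf_even_dof(x, t):
    """P(chi-square with 2*t degrees of freedom >= x), closed form for even dof."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(t))


def combined_dissimilarity(dissimilarities):
    """Combine per-feature dissimilarities into a single overall dissimilarity."""
    return chi2_sf_even_dof(fisher_chi2(dissimilarities), len(dissimilarities))


print(combined_dissimilarity([0.3]))       # a single feature is returned unchanged (up to rounding)
print(combined_dissimilarity([0.5, 0.5]))  # about 0.5966
```

The closed-form tail probability avoids a dependency on a statistics library; it is valid only because Fisher's statistic always has an even number of degrees of freedom.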

SBAC System
Combining the two types of features:
[equation missing from transcript]
Example objects: 6 {c, 9}, 9 {a, 7.5}, 5 {c, 10.5}, 8 {c, 9}

SBAC System
The agglomerative clustering algorithm:
[algorithm listing missing from transcript]

SBAC System: The predefined threshold t
We set t = 0.3 × D(root). With D(root) = 0.876, t = 0.263.
If the drop in dissimilarity is larger than t, stop.
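
The algorithm listing itself is not preserved, so the following is only a sketch of how a threshold of this kind could be applied. It assumes group-average linkage for the agglomeration and a top-down traversal that stops at a child node when the dissimilarity drop from its parent exceeds t. Both choices, and the names `upgma` and `cut`, are assumptions.

```python
from itertools import combinations


def upgma(dist):
    """Group-average agglomerative clustering on a symmetric dissimilarity matrix.

    Returns the root node as a tuple (height, members, left, right);
    leaves are (0.0, [index], None, None).
    """
    def avg(a, b):
        return sum(dist[i][j] for i in a[1] for j in b[1]) / (len(a[1]) * len(b[1]))

    clusters = [(0.0, [i], None, None) for i in range(len(dist))]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((avg(a, b), a[1] + b[1], a, b))
    return clusters[0]


def cut(node, t):
    """Walk down from node; a child becomes a cluster once the drop exceeds t."""
    height, members, left, right = node
    if left is None:
        return [sorted(members)]
    found = []
    for child in (left, right):
        if child[2] is None or height - child[0] > t:
            found.append(sorted(child[1]))
        else:
            found.extend(cut(child, t))
    return found


dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
root = upgma(dist)
t = 0.3 * root[0]    # same rule as the slide: t = 0.3 * D(root)
print(cut(root, t))  # [[0, 1], [2, 3]]
```

With the slide's own numbers the threshold works out the same way: 0.3 × 0.876 = 0.2628, which rounds to 0.263.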

Experimental results
Artificial data: 180 data points in three classes (G1, G2, G3), with 60 data points per class.
Each data point has two nominal and two numeric attributes.

Experimental results (cont.)
[Figures not preserved in the transcript. Panels: SBAC, COBWEB, ECOBWEB, AUTOCLASS]

Experimental results (cont.)
Real data:
Handwritten character (8OX) data: numeric features, 45 objects.
Mushroom data: nominal features, 200 objects (100 of them poisonous).
Heart disease data: mixed features, 303 patients.

Experimental results (cont.)
Results
[Three results slides; tables and figures not preserved in the transcript]

Conclusions
The paper proposes a new similarity measure that assigns greater weight to feature value matches that are uncommon in the population.
The approach achieves better clustering performance than the other methods compared.

Personal Opinion
The time complexity of this approach is too high.
The procedures for computing similarity and for clustering are too complicated.