ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES Yiwen Fan.


1 ROUGH SET THEORY AND FUZZY LOGIC BASED WAREHOUSING OF HETEROGENEOUS CLINICAL DATABASES Yiwen Fan

2 Purpose: Warehouse medical databases. Clinical databases have accumulated large quantities of information about patients and their medical conditions. To warehouse these databases and analyze patients' conditions, we need an efficient data mining technique. Data mining process: data warehousing, data query and cleaning, and data analysis.

3 Three major data mining techniques: regression, clustering and classification.

4

5 Techniques used in this paper. Two phases: clustering and classification. First phase: use Rough Set Theory (RST) for clustering (the clustering technique reduces the complexity of the RST result). Second phase: use fuzzy logic to classify the resulting clusters. In short: RST for clustering, fuzzy logic for classification. Definition of clustering: a data mining technique for warehousing heterogeneous databases. It groups data with similar characteristics into the same cluster and places data with dissimilar characteristics in other clusters (used to handle uncertainty and incomplete information).

6 Previous clustering techniques: K-Means, Expectation Maximization, Association Rule, K-Prototype, Fuzzy K-Modes, etc.

7 Phase 1 – Clustering. Definition: partition data into groups of similar categories or objects. Cluster: a group of objects in the same category. Different clusters: the objects within a cluster are similar to one another and dissimilar to the objects of other clusters. Fewer clusters: 1) Loss: data details are lost; 2) Benefit: simplification. The search for clusters is unsupervised learning. Cluster types: 1. Exclusive clusters: every category or object belongs to only one cluster. 2. Overlapping clusters: a category or object may belong to several clusters. 3. Probabilistic clusters: a category or object belongs to each cluster with a certain probability.

8 Notations in Rough Set Theory (RST). Definition 1: Indiscernibility relation, IND(B). Definition 2: Equivalence class, [x_i]_IND(B). Definition 3: Lower approximation. Definition 4: Upper approximation. Definition 5: Roughness. Definition 6: Mean roughness. Definition 7: Standard deviation.
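The formulas for these definitions did not survive in the transcript. As a rough illustration only, the sketch below uses the standard textbook forms: equivalence classes of IND(B) group objects with identical values on the attributes B, the lower approximation is the union of classes fully inside a target set, and the upper approximation is the union of classes that intersect it. Roughness is taken here as Pawlak's 1 - |lower| / |upper|, which is an assumption; the paper may define it differently. The `patients` table, its attribute names and the `target` set are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Partition object ids by their values on attrs (the classes of IND(B))."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj_id)
    return list(classes.values())

def lower_approximation(table, attrs, target):
    """Union of the equivalence classes fully contained in target."""
    result = set()
    for c in equivalence_classes(table, attrs):
        if c <= target:
            result |= c
    return result

def upper_approximation(table, attrs, target):
    """Union of the equivalence classes that intersect target."""
    result = set()
    for c in equivalence_classes(table, attrs):
        if c & target:
            result |= c
    return result

def roughness(table, attrs, target):
    """Assumed definition (Pawlak): 1 - |lower| / |upper|."""
    upper = upper_approximation(table, attrs, target)
    if not upper:
        return 0.0
    lower = lower_approximation(table, attrs, target)
    return 1.0 - len(lower) / len(upper)

# Hypothetical toy table: object id -> attribute values
patients = {
    1: {"chest_pain": "typical", "angina": "yes"},
    2: {"chest_pain": "typical", "angina": "yes"},
    3: {"chest_pain": "atypical", "angina": "no"},
    4: {"chest_pain": "atypical", "angina": "yes"},
}
target = {1, 2, 4}        # hypothetical set of interest
B = ["chest_pain"]        # attributes used for indiscernibility

lower = lower_approximation(patients, B, target)
upper = upper_approximation(patients, B, target)
rough = roughness(patients, B, target)
```

With B = chest_pain, objects 3 and 4 are indiscernible, so object 4 cannot be placed in the lower approximation even though it belongs to the target set.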

9 Clustering algorithm steps:
1) Take the whole data set as the parent node U.
2) Set the current number of clusters, CNC (iterated from 1 to K).
3) Across the attributes A, find the attributes that fall in the same category.
4) Calculate the roughness of these attributes for this category.
5) Find the mean value over all these attributes.
6) Calculate and store the standard deviation of these attributes.
7) The attribute with the smallest standard deviation is used for the next iteration.
8) If the standard deviation does not match the smallest value, the next smallest value is taken as the splitting attribute.
9) Perform binary splitting: split the whole data set into two clusters.
10) Use the Distance of Relevance formula to select the cluster (the one with the largest distance).
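The attribute-selection part of these steps (roughness, mean, standard deviation, smallest value) can be sketched as follows. This is one possible reading, not the paper's exact procedure: roughness is assumed to be 1 - |lower| / |upper|, the mean roughness of an attribute is averaged over its value sets with respect to each other attribute, and the splitting attribute is the one whose mean roughness values have the smallest population standard deviation. The function names and the toy `table` are hypothetical, and the Distance of Relevance step is not covered.

```python
from collections import defaultdict
from statistics import mean, pstdev

def partition(table, attr):
    """Group object ids by their value of a single attribute."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        groups[row[attr]].add(obj_id)
    return list(groups.values())

def roughness(target, classes):
    """Assumed definition: 1 - |lower| / |upper| of target w.r.t. classes."""
    lower = sum(len(c) for c in classes if c <= target)
    upper = sum(len(c) for c in classes if c & target)
    return 1.0 - lower / upper

def mean_roughness(table, attr_i, attr_j):
    """Average roughness of each value set of attr_i w.r.t. attr_j."""
    classes_j = partition(table, attr_j)
    return mean(roughness(x, classes_j) for x in partition(table, attr_i))

def choose_splitting_attribute(table, attrs):
    """Pick the attribute whose mean roughness values vary the least."""
    spread = {}
    for a_i in attrs:
        means = [mean_roughness(table, a_i, a_j) for a_j in attrs if a_j != a_i]
        spread[a_i] = pstdev(means)
    return min(spread, key=spread.get), spread

# Hypothetical categorical data: object id -> attribute values
table = {
    1: {"a": "x", "b": "p", "c": "u"},
    2: {"a": "x", "b": "p", "c": "v"},
    3: {"a": "y", "b": "q", "c": "u"},
    4: {"a": "y", "b": "p", "c": "v"},
}
best, spread = choose_splitting_attribute(table, ["a", "b", "c"])
```

On this toy table the sketch selects attribute b, whose mean roughness is the same (0.75) with respect to both a and c, giving a standard deviation of zero.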

10

11 Phase 2 – Classification. Fuzzy inference: generating a mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made or patterns discerned. Fuzzy Inference System (FIS): 1) Fuzzification 2) Fuzzy rules generation 3) Defuzzification. Fuzzy inference process: 1) Membership functions 2) Logical operations 3) If-Then rules.
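The three ingredients of the fuzzy inference process can be illustrated with a small generic sketch. It assumes triangular membership functions and the common min/max operators for AND/OR; the slides do not say which shapes or operators the paper uses, and the rule, variable names and numeric ranges below are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(mu1, mu2):
    """Logical AND as the minimum of two membership degrees."""
    return min(mu1, mu2)

def fuzzy_or(mu1, mu2):
    """Logical OR as the maximum of two membership degrees."""
    return max(mu1, mu2)

# Hypothetical rule:
# IF blood pressure is high AND cholesterol is high THEN risk is high
bp_high = triangular(150, 120, 160, 200)
chol_high = triangular(260, 200, 300, 400)
rule_strength = fuzzy_and(bp_high, chol_high)
```

The rule's firing strength is the smaller of the two antecedent memberships, so the weaker condition limits how strongly the conclusion applies.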

12 Fuzzification Conditions
1. All "Cluster 1 (C-1)" values are compared with the "Minimum Limit Value (ML(C-1))". Any Cluster 1 values less than ML(C-1) are set as L.
2. All "Cluster 1 (C-1)" values are compared with the "Maximum Limit Value (XL(C-1))". Any Cluster 1 values greater than XL(C-1) are set as H.
3. Any "Cluster 1 (C-1)" values greater than ML(C-1) and less than XL(C-1) are set as M.
Similar conditions are applied to the other cluster, C-2, to generate its fuzzy values.
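The three conditions amount to a threshold mapping, sketched below under two assumptions: values above the maximum limit XL map to H (the only reading consistent with condition 3), and values exactly equal to a limit, which the slide does not cover, are treated as M. The function name, the limits and the sample values are hypothetical.

```python
def fuzzify(value, ml, xl):
    """Map a crisp cluster value to 'L', 'M' or 'H' using the two limits."""
    if value < ml:
        return "L"
    if value > xl:
        return "H"
    return "M"  # assumption: values equal to a limit count as M

# Hypothetical cluster values and limits
cluster_1 = [110, 130, 150, 175]
labels = [fuzzify(v, ml=120, xl=160) for v in cluster_1]
```

The same function would be applied to Cluster 2 with its own pair of limits.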

13 Fuzzy Rules Generation. General form of a fuzzy rule: "IF A THEN B", where the IF part is the antecedent and the THEN part is the conclusion. The output values of the FIS, between L and H, are used in training to generate the fuzzy rules. The rules are generated according to the fuzzy values produced for each feature in the fuzzification process.

14 Defuzzification. Input: the fuzzy set. Output: a single value, L, M or H, indicating whether the given input dataset is in the low, medium or high range. The FIS is trained using the fuzzy rules, and testing is carried out on the datasets.

15 Evaluation metrics. Sensitivity: measures the proportion of actual positives that are correctly identified. It relates to the test's ability to identify positive results. Specificity: measures the proportion of negatives that are correctly identified. It relates to the test's ability to identify negative results. Accuracy: obtained from the above results as the proportion of all cases, positive and negative, that are correctly identified. These metrics are used to evaluate the effectiveness of the proposed systems and to justify their theoretical and practical developments.
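The accuracy formula shown on the slide is missing from the transcript. A minimal sketch using the standard confusion-matrix definitions of the three metrics is given below; the function name and the counts are hypothetical and are not results from the paper.

```python
def evaluate(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all cases that are correct
    return sensitivity, specificity, accuracy

# Hypothetical counts: 50 actual positives, 50 actual negatives
sens, spec, acc = evaluate(tp=40, tn=45, fp=5, fn=10)
```

To compare with the tables in the results slides, multiply each value by 100 to express it as a percentage.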

16 Results and Discussions. The paper used the heart disease data sets: Cleveland, Hungarian and Switzerland. Total number of attributes: 76. Generally used: 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression, slope of the peak exercise ST segment, number of major vessels, thal, and diagnosis of heart disease.

17 Clustering Results. The dataset is clustered into two sets. Red dots: Cluster 1. Blue dots: Cluster 2. Crosses: centroids.

18 Cleveland dataset. Graph: sensitivity, specificity and accuracy of the Cleveland dataset. Performance evaluation for sensitivity, specificity and accuracy of the Cleveland dataset:
Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
121730
2               29                   19                   37
3               36                   25                   44
4               54                   25                   45
5               57                   38                   47
6               57                   38                   50
7               64                   50                   54
8               71                   57                   59
9               71                   69                   64
107975

19 Switzerland dataset. Graph: sensitivity, specificity and accuracy of the Switzerland dataset. Performance evaluation for sensitivity, specificity and accuracy of the Switzerland dataset:
Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1               8                    98                   15
2               25                   98                   31
3               68                   98                   69
4               83                   98                   85
5               83                   98                   85
6               93                   98                   92
7               93                   98                   92
898

20 Hungarian dataset. Graph: sensitivity, specificity and accuracy of the Hungarian dataset. Performance evaluation for sensitivity, specificity and accuracy of the Hungarian dataset:
Iteration No.   Sensitivity (in %)   Specificity (in %)   Accuracy (in %)
1               9                    26                   40
2               9                    58                   50
3               18                   59                   54
4               28                   63                   54
5               37                   69                   57
6               37                   69                   57
7               37                   73                   60
8               46                   79                   62
9               46                   89                   69
10              64                   98                   72

21 Conclusion. Rough Set Theory was used as the clustering algorithm, and fuzzy logic was used to classify the clusters. The experiments were carried out on heart disease datasets, and the evaluation metrics of sensitivity, specificity and accuracy were analyzed for the proposed work. Result: the Switzerland dataset gave better results than the other two datasets. At the highest iteration level, good clustering and classification results were achieved.

22 References:
[1] R. Saravana Kumar, "Rough Set Theory and Fuzzy Logic Based Warehousing of Heterogeneous Clinical Databases".
[2] Duo Chen, Du-Wu Cui, Chao-Xue Wang, and Zhu-Rong Wang, "A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data", International Journal of Information Technology, Vol. 12, No. 3, pp. 149-159, 2006.

