A three-step approach for STULONG database analysis: characterization of patients’ groups O. Couturier, H. Delalin, H. Fu, E. Kouamou, E. Mephu Nguifo.


1 A three-step approach for STULONG database analysis: characterization of patients’ groups
O. Couturier, H. Delalin, H. Fu, E. Kouamou, E. Mephu Nguifo
Computer Science Research Center of Lens (CRIL), CNRS - Université d’Artois – IUT de Lens
Discovery Challenge (PKDD 2004)

2 Goal
What are the relations between social factors (social characteristics) and the other characteristics of the men in the respective groups?

3 Overview
Discovery process
Techniques and results
–Clustering
–Classification
–Association rules
Conclusion and further work

4 Discovery Process
Hypotheses on the data
–ENTRY table
–Groups provided by the expert
  Groups 1 and 2 merged: Normal group
  Groups 3 and 4 merged: Risk group
  Group 6 ignored
–Characteristics
Considering the earlier work of the LRI ML research team at previous PKDD Challenges
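The group merging described on this slide can be sketched as a simple lookup. This is an illustration only: the function name `merge_group` and the string labels are hypothetical, chosen to match the "12", "34" and "5" notation used on later slides, where group 5 is the Pathological group.

```python
# Hypothetical mapping from the expert-provided group number to the merged
# groups used in the analysis: 12 = Normal, 34 = Risk, 5 = Pathological.
MERGED = {1: "12", 2: "12", 3: "34", 4: "34", 5: "5"}


def merge_group(group):
    """Return the merged label, or None for groups that are ignored (group 6)."""
    return MERGED.get(group)


labels = [merge_group(g) for g in [1, 2, 3, 4, 5, 6]]
print(labels)
```

Patients whose merged label is `None` would simply be dropped before the clustering, classification and itemset steps.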

5 Discovery Process
Can we find a model that fits the provided groups?
Are there strong similarities among instances of different groups?
What kinds of relations exist among the group characteristics?

6 Discovery Process
Data → Tasks → Knowledge
Data: ENTRY data groups
–Task: Clustering. Knowledge: generated clusters vs. provided ones
–Task: Supervised classification. Knowledge: similarities among instances and groups
–Task: Association rules search. Knowledge: affinity among group characteristics

7 Techniques and Results: Clustering
Goal: can the initial groups be recovered as they were defined?
Data: groups 12, 34 and 5
Clustering systems (WEKA package):
–COBWEB: 2 groups
–EM: 4 groups
–KMEANS: 2 groups
Results: it is difficult to identify properties that allow the initial groups to be retrieved
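One way to compare generated clusters with the provided groups is a cross-tabulation together with a purity score. The sketch below is illustrative only: the cluster assignments and group labels are made-up toy values, not STULONG data, and purity is an assumed evaluation measure that the slides do not mention.

```python
from collections import Counter, defaultdict


def cross_tab(clusters, groups):
    """Count how many members of each provided group fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, group in zip(clusters, groups):
        table[cluster][group] += 1
    return table


def purity(clusters, groups):
    """Fraction of instances that belong to the majority group of their cluster."""
    table = cross_tab(clusters, groups)
    return sum(max(counts.values()) for counts in table.values()) / len(groups)


# Toy values, purely hypothetical.
toy_clusters = [0, 0, 1, 1, 0, 1]
toy_groups = ["12", "34", "5", "5", "12", "34"]
toy_purity = purity(toy_clusters, toy_groups)
print(dict(cross_tab(toy_clusters, toy_groups)), toy_purity)
```

A purity close to the share of the largest group would indicate that the clusters do not recover the provided groups.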

8 Techniques and Results: Supervised Classification
Are Risk group patients similar to those in the Normal group or in the Pathological group?
Data:
–Training set: group 12 and group 5
–Test set: group 34 (Risk)
System (WEKA package):
–Decision tree C4.5

9 Techniques and Results: Supervised Classification
Training results:
–The HT descriptor is one of the most relevant factors of the disease
–Thirteen instances of the Pathological group are classified as Normal
=== Confusion Matrix ===
   a   e   <-- classified as
 276   0 | a = 12
  13 101 | e = 5
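The training confusion matrix can be summarised numerically. The short sketch below uses only the four counts shown on the slide; the variable names are illustrative.

```python
# Rows are true groups, columns are predicted groups (counts from the slide).
confusion = {
    "12": {"12": 276, "5": 0},
    "5": {"12": 13, "5": 101},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[g][g] for g in confusion)
accuracy = correct / total

# Share of Pathological (group 5) instances that were classified as Normal.
pathological_missed = confusion["5"]["12"] / sum(confusion["5"].values())

print(f"training accuracy: {accuracy:.3f}")
print(f"pathological classified as normal: {pathological_missed:.3f}")
```

With these counts, 377 of the 390 training instances are classified correctly, and 13 of the 114 Pathological instances are labelled Normal.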

10 Techniques and Results: Supervised Classification
Test set: Risk group 34
–Health district number is not a relevant factor
–About 2/5 of Risk patients are similar to Normal group patients
=== Confusion Matrix ===
   a   c   d   e   <-- classified as
   0   0   0   0 | a = 12
 177   0   0 250 | c = 3 (odd)
 197   0   0 235 | d = 4 (even)
   0   0   0   0 | e = 5
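The "2/5" figure can be checked directly from the test confusion matrix. The sketch below uses only the counts on the slide for the two Risk subgroups; the variable names are illustrative.

```python
# True subgroup -> predicted group counts (values from the slide).
predicted = {
    "3 (odd)": {"12": 177, "5": 250},
    "4 (even)": {"12": 197, "5": 235},
}

as_normal = sum(row["12"] for row in predicted.values())
total = sum(sum(row.values()) for row in predicted.values())
share_normal = as_normal / total

print(f"{as_normal} of {total} Risk patients classified as Normal "
      f"({share_normal:.1%})")
```

This gives 374 of 859 Risk patients, or roughly 44%, which is consistent with the slide's rounded "2/5".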

11 Techniques and Results: Association rules search
Goal: find the relations that exist among group characteristics
Data: 1417 patients of groups 12, 34 and 5
System:
–Apriori: B. Goethals' implementation
–Preprocessing: binary conversion of the 27 characteristics
–Frequent itemsets search
Results:
–Frequent itemsets common to different groups

12 Techniques and Results: Association rules search
Preprocessing: binary conversion
–BMI: weight / size² (size in m)
  If bmi > 27 then 1 else 0
–Age
  If age > 45 then 1 else 0
–Smoker
  If smokerconsumption != 0 OR duration then 1 else 0
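The three rules on this slide could be written as follows. This is a minimal sketch with assumptions: the function and parameter names are hypothetical, weight is assumed to be in kilograms, and the "OR duration" condition is read as "smoking duration is non-zero", which the slide does not state explicitly.

```python
def binarize_basic(weight_kg, height_m, age, smoker_consumption, smoking_duration):
    """Binary conversion of BMI, age and smoking status, following the slide."""
    bmi = weight_kg / height_m ** 2
    return {
        "BMI": 1 if bmi > 27 else 0,
        "AGE": 1 if age > 45 else 0,
        # Assumption: "OR duration" means a non-zero smoking duration.
        "SMOKER": 1 if smoker_consumption != 0 or smoking_duration != 0 else 0,
    }


# Hypothetical patient: 90 kg, 1.75 m, 50 years old, never smoked.
example = binarize_basic(90, 1.75, 50, 0, 0)
print(example)
```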

13 Techniques and Results: Association rules search
Preprocessing: binary conversion
–Bolhr (chest pain)
  If bolhr = 1 or bolhr = 6 then 0 else 1
–Chol
  If chol > 2 + (age / 100) then 1 else 0
–Tg
  If tg < 150 then 0 else 1
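These three rules translate directly into code. The sketch below is an illustration: the function name is hypothetical, and the measurement units of `chol` and `tg` are not given on the slide, so the thresholds are applied to the raw values exactly as written.

```python
def binarize_clinical(bolhr, chol, tg, age):
    """Binary conversion of chest pain, cholesterol and triglycerides."""
    return {
        # Codes 1 and 6 map to 0; every other code maps to 1.
        "BOLHR": 0 if bolhr in (1, 6) else 1,
        # The cholesterol threshold depends on age.
        "CHOL": 1 if chol > 2 + age / 100 else 0,
        "TG": 0 if tg < 150 else 1,
    }


# Hypothetical values for a 50-year-old patient.
example = binarize_clinical(bolhr=1, chol=2.6, tg=150, age=50)
print(example)
```

Note that a `tg` value of exactly 150 maps to 1, since only values strictly below 150 map to 0.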

14 Techniques and Results: Association rules search
Frequent itemsets search
–Support threshold (MinSup) = 0.10: significant for at least 10% of the population
–Search was done with no MinSup (i.e. MinSup value = 0)
Table columns: Itemsets | Support value in Class 12 | Class 34 | Class 5
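The notion of support used here can be illustrated with a small brute-force search. This is not the Apriori implementation cited in the slides: it enumerates every candidate itemset without pruning, which is only practical for tiny examples. The transactions are made-up toy data, and all names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, minsup, max_size=3):
    """Brute-force enumeration of itemsets whose support is at least minsup."""
    items = sorted(set().union(*transactions)) if transactions else []
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= minsup:
                result[candidate] = s
    return result


# Toy patients: each set holds the attributes whose binary value is 1.
toy = [
    frozenset({"AGE", "SMOKER"}),
    frozenset({"AGE", "SMOKER", "CHOL"}),
    frozenset({"CHOL"}),
    frozenset(),
]
found = frequent_itemsets(toy, minsup=0.5, max_size=2)
print(found)
```

Running the same function separately on the patients of groups 12, 34 and 5 would give one support value per group for each itemset, which is the comparison the table columns describe.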


16 Techniques and Results: Association rules search
Frequent itemsets search: results
–Support value of the Alcohol attribute = 1
–1-itemsets
  Attribute IM is false for every patient of groups 12 and 34; it is true for 33% of the patients of group 5.
  HT is false for every patient of group 12.
  STUDY is more frequent in group 12 than in group 5.
–3-itemsets
  AGE & SMOKER & CHOL is less frequent in group 12 than in group 5.
–etc.
–The support value for group 34 lies between the support values for group 12 and group 5.

17 Conclusion
The Risk group (RG) shows similarities with both the Normal group (NG) and the Pathological group (PG).
3 steps:
–Clustering: the initial groups are not recovered
–Classification: some attributes characterize the Pathological group, but they were already known
–Frequent itemsets search: concrete results are difficult to highlight, but the search yields interesting information

18 Further work
Improve the binary conversion
Refine the data set on the population
–for instance, 12 patients died of atherosclerosis although they were in the NG
Refine our hypotheses
–Data set of the ENTRY table
–Look at the CONTROL table

19 Thanks!

