Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment Kati Iltanen.


1 Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment
Kati Iltanen (a), Sami Kiviharju (a), Lida Ao (a), Martti Juhola (a), Ilmari Pyykkö (b)
(a) School of Information Sciences, University of Tampere, Finland
(b) School of Medicine, University of Tampere, Finland

2 Introduction
Kati Iltanen, Medinfo 2013
- Aim of the study: to examine the applicability of association rules for analysing the effects of genetic and environmental factors on age-related hearing impairment (ARHI)
  - To possibly generate new hypotheses for medical research
- Association analysis
  - Data mining approach to discover items (variable-value pairs) that frequently co-occur in data
  - Association rules of the form “A → B” are generated from frequent item sets
  - Capability to do a complete search efficiently

3 Introduction
- Challenge
  - High-dimensional data result in a very large number of association rules.
  - Rules may be overlapping.
  - Postprocessing is needed.
- Focus of the study: to develop an approach to cluster, summarise and represent association rules for easier exploration

4 ARHI data
- Originate from a European multicentre study on ARHI
  - Collected in nine medical centres from seven European countries (e.g. Van Laer et al., 2008)
- 2428 cases: females and males aged 53 to 67
  - The cases represent the best and the worst hearing thirds of their population at high frequencies (2, 4 and 8 kHz)
  - 1241 cases with ARHI
  - Cases having pathologies (other than ARHI) that possibly influence hearing ability were excluded

5 ARHI data
- 764 variables
  - 42 phenotypes and environmental factors
    - Phenotypes: e.g. gender, age, body mass index, blood pressure, diabetes, cardiovascular disease and renal failure
    - Environmental and lifestyle factors: e.g. use of ototoxic medication, exposure to chemicals, exposure to noise, alcohol use and tobacco smoking
  - 722 single nucleotide polymorphisms (SNPs) from 70 candidate genes

6 ARHI rules
- Rules were mined with Magnum Opus from RuleQuest Research.
- Form for rules: LHS → Zhighbest > 0.147
  - LHS: genotype, phenotype and environmental variables; from 1 to 3 items
  - Zhighbest > 0.147: “Has a hearing impairment”
  - Zhighbest: averaged gender- and age-independent Z-score of high frequencies (2, 4 and 8 kHz) for the better-hearing ear
  - 0.147: a threshold value given by the expert physician

7 ARHI rules
- Interestingness measures used for association rules
  - Support
  - Confidence
  - Lift
  - Statistical significance: Fisher exact test
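The four measures listed above can be sketched from simple case counts. This is a minimal illustration using the textbook definitions of support, confidence and lift, and it assumes a one-sided Fisher exact test; the slides do not say which variant the mining tool applies. The function names and the counts 100 and 70 are hypothetical; only the totals 2428 and 1241 come from the slides.

```python
from math import comb


def rule_measures(n_total, n_lhs, n_rhs, n_both):
    """Return (support, confidence, lift) for a rule LHS -> RHS from case counts."""
    support = n_both / n_total              # share of all cases matching LHS and RHS
    confidence = n_both / n_lhs             # share of LHS cases that also match RHS
    lift = confidence / (n_rhs / n_total)   # confidence relative to the RHS base rate
    return support, confidence, lift


def fisher_one_sided(n_total, n_lhs, n_rhs, n_both):
    """One-sided Fisher exact p-value (assumed variant): probability of seeing
    at least n_both co-occurrences if LHS and RHS were independent."""
    upper = min(n_lhs, n_rhs)
    tail = sum(
        comb(n_rhs, k) * comb(n_total - n_rhs, n_lhs - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_lhs)


# 2428 cases and 1241 ARHI cases are from the slides; 100 and 70 are made up.
support, confidence, lift = rule_measures(2428, 100, 1241, 70)
p_value = fisher_one_sided(2428, 100, 1241, 70)
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f} p={p_value:.2g}")
```

A rule would then be kept only if it clears the chosen support and confidence thresholds, has lift above 1 and a p-value below the significance level.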

8 Clustering ARHI rules
- Measure of similarity or closeness between two association rules: the proportion of cases matched by both rules among the cases matched by either one or both rules (a variant of a measure presented by Gupta et al., 1999)
- Example
  - Intersection of R22 and R26: 187 cases (both R22 and R26 hold for 187 cases)
  - Union of R22 and R26: 190 cases
  - The similarity between R22 and R26: 187/190 ≈ 0.98
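The similarity measure above is the ratio of intersection size to union size over the sets of cases each rule matches. A small sketch follows; the function name and the case identifiers are hypothetical and were chosen only so that the set sizes reproduce the 187 and 190 from the slide.

```python
def rule_similarity(cases_a, cases_b):
    """Cases matched by both rules divided by cases matched by either rule."""
    union = cases_a | cases_b
    if not union:
        return 0.0  # two rules that match no cases are treated as dissimilar
    return len(cases_a & cases_b) / len(union)


# Hypothetical case ids: 188 cases for one rule, 189 for the other.
r22 = set(range(0, 188))
r26 = set(range(1, 190))

similarity = rule_similarity(r22, r26)
print(len(r22 & r26), len(r22 | r26), round(similarity, 2))  # prints: 187 190 0.98
```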

9 Clustering ARHI rules
- Clustering method based on graph-theoretic techniques
  - Implemented using Matlab, Java and PostgreSQL
- Rule graph
  - Rules: nodes
  - Similarities between rules: weights of edges between nodes
  - Similarities above the chosen threshold: connections between nodes
- One connected component is a rule subset or cluster.
- Clustering: searching for connected components
[Figure: a connected component (a threshold of 0.3 used for the similarity measure)]
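The rule-graph idea above can be sketched as follows: connect two rules when their similarity exceeds the threshold, then read off the connected components. The original work used Matlab, Java and PostgreSQL; this Python version is only an illustration, and the function names, rule labels and case sets are hypothetical.

```python
from collections import deque
from itertools import combinations


def jaccard(a, b):
    """Cases matched by both rules divided by cases matched by either rule."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_rules(matched_cases, threshold=0.3):
    """Group rules into connected components of the thresholded rule graph."""
    # Build adjacency: an edge exists when similarity is above the threshold.
    neighbours = {rule: set() for rule in matched_cases}
    for r1, r2 in combinations(matched_cases, 2):
        if jaccard(matched_cases[r1], matched_cases[r2]) > threshold:
            neighbours[r1].add(r2)
            neighbours[r2].add(r1)

    # Breadth-first search from every rule not yet assigned to a component.
    clusters, seen = [], set()
    for start in matched_cases:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            rule = queue.popleft()
            component.append(rule)
            for other in neighbours[rule]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        clusters.append(sorted(component))
    return clusters


# Toy data: A-B and B-C are similar enough, A-C is not, D is isolated.
toy_rules = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {4, 5, 6},
    "D": {10, 11},
}
clusters = cluster_rules(toy_rules, threshold=0.3)
print(clusters)
```

In the toy data, A and C land in the same cluster even though their direct similarity is below the threshold, because both are connected through B. This chaining is a property of connected-component clustering.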

10 Summarising rule subsets
- Rules represented in HTML documents
  - Program implemented using Matlab
- Rule subset information is given at different levels of detail
- Overall summary listing for rule subsets
  - Number of rules, coverage, main item

11 Summarising rule subsets
- At the next level, the rule subset information is extended with information about the other items.

12 Representing rule subsets
- Gene colouring
- Marking items of special interest
  - Important SNPs from earlier studies
- Ordering items in rules on the basis of item frequencies

13 Representing rule subsets
- Ordering rules in clusters on the basis of item frequencies

14 Representing rule subsets

15 Representing rule subsets
- Similarities between the rules in a similarity matrix
[Figure: similarity matrix; labelled regions: “Solvent exposure” rules, “Noisy workplace” rules, highly overlapping rules]

16 Summary statistics of ARHI rules
                               1-item LHS     2-item LHS     3-item LHS
Size of search space           2231           2.48535·10^6   1.84332·10^9
Minimum confidence threshold   60%            70%            90%
Number of rules                6              77             518
Total coverage                 48.3%          86.6%          96.5%
Support                        3.4-13.4%      2.1-7.8%       1-2%
Confidence                     60-67.9%       70-80.6%       90-100%
Lift                           1.17-1.33      1.37-1.58      1.76-1.96
Minimum support threshold: 50 cases; 1%
Common threshold values: lift 1; Fisher exact test: α = 0.01

17 Conclusions
- The developed approach
  - simplified the rule exploration by grouping together the rules concerning the same items and the rules concerning the same phenomenon
  - enabled the recognition of overlapping rules, possibly suggesting more complex interactions
- Association analysis
  - detected factors found significant in previous studies concerning this ARHI data
  - enabled more exhaustive analysis of more complex patterns; however, the problem of multiple testing has to be kept in mind
  - gave new interesting information to the expert physician, especially rules concerning osteoporosis

18 References and acknowledgments
Acknowledgments
The authors are grateful to Baur M, Bille M, Bonaconsa A, Cremers CW, Demeester K, Dhooge I, Diaz-Lacava AN, Espeso A, Fransen E, Hannula S, Hendrickx JJ, Huygen PL, Huyghe J, Huyghe JR, Jensen M, Konings A, Kremer H, Kunst S, Lacava A, Lemkens N, Manninen M, Mazzoli M, Mäki-Torkko E, Orzan E, Parving A, Pawelczyk M, Pfister M, Rajkowska E, Sliwinska-Kowalska M, Sorri M, Steffens M, Stephens D, Topsakal V, Tropitzsch A, Van Camp G, Van de Heyning PH, Van Eyken E, Van Laer L, Verbruggen K, and Wienker TF for the possibility to use the ARHI data.
References
Gupta et al., Distance based clustering of association rules. In: Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of ANNIE 1999), ASME Press, 1999, pp. 759-764.
Van Laer et al., The grainyhead like 2 gene (GRHL2), alias TFCP2L3, is associated with age-related hearing impairment. Hum Mol Genet 2008; 17(2): 159-69.

