ECML/PKDD 2003 Discovery Challenge Attribute-Value and First Order Data Mining within the STULONG project Anneleen Van Assche, Sofie Verbaeten,


ECML/PKDD 2003 Discovery Challenge
Attribute-Value and First Order Data Mining within the STULONG project
Anneleen Van Assche, Sofie Verbaeten, Darek Krzywania, Jan Struyf, Hendrik Blockeel
Department of Computer Science, Katholieke Universiteit Leuven

Outline
- Data Mining Effort
  - Data
  - Data preprocessing
  - Data mining
  - Evaluation criteria
- Discovered Knowledge
  - Initial exploration
  - Entry data
  - Control data
- Conclusions

Data
- we studied 2 of the 4 data matrices from the STULONG data set:
  - the Entry data matrix
  - the Control data matrix
- men in the Entry data are divided into 3 subgroups based on the occurrence of risk factors:
  - normal group (NG): none of these risk factors
  - risk group (RG): at least one of the risk factors
  - pathological group (PG): manifested serious disease

Data preprocessing
- missing values / empty entries / “not stated” / “no”
- data = propositionalisation of a relational database
  - many empty entries + redundancies (e.g. personal anamnesis)
  - 1-n relation from Entry to Control data set
- solution: relational representation (ILP)
  - background knowledge can be used
  - new features for trend analysis in control examinations
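The 1-n relation between Entry and Control can be illustrated with a small sketch. It is not the authors' ILP representation: the field names (`id`, `year`, `chol`) and the rows are hypothetical, and the code only shows the idea of keeping every control examination attached to its patient instead of flattening it into one row.

```python
from collections import defaultdict

# Hypothetical rows; the real STULONG column names differ.
entry = [{"id": 1, "group": "RG"}, {"id": 2, "group": "NG"}]
control = [
    {"id": 1, "year": 1985, "chol": 260},
    {"id": 1, "year": 1980, "chol": 240},
    {"id": 2, "year": 1981, "chol": 200},
]


def group_controls(control_rows):
    """Collect the control examinations of each patient, ordered by year."""
    by_patient = defaultdict(list)
    for row in control_rows:
        by_patient[row["id"]].append(row)
    for rows in by_patient.values():
        rows.sort(key=lambda r: r["year"])
    return dict(by_patient)


def relational_view(entry_rows, control_rows):
    """Attach the (possibly empty) list of examinations to each Entry record."""
    exams = group_controls(control_rows)
    return [dict(e, exams=exams.get(e["id"], [])) for e in entry_rows]


patients = relational_view(entry, control)
```

Each patient keeps a variable-length list of examinations, so no empty columns are needed for patients with few examinations.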

Data preprocessing
- Attribute-value: Entry data set
  - converted to Weka .arff format
  - introduction of new attributes (e.g. BMI, …)
- Relational: Entry + Control data set
  - converted to relational ILP format
  - introduction of background knowledge
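As an illustration of a derived attribute, BMI can be computed from weight and height with the standard formula (kg divided by metres squared). The function and parameter names are assumptions; the threshold of 25 for the overweight flag is the one used later in the slides for the OVERWEIGHT task.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def overweight(bmi_value, threshold=25.0):
    """Binary flag: 1 if BMI is above the threshold, else 0."""
    return 1 if bmi_value > threshold else 0


example_bmi = bmi(70, 175)          # roughly 22.9
example_flag = overweight(bmi(90, 175))
```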

Data mining
- Entry data in .arff format: Weka
  - classification (ZeroR, OneR, NB, Decision Stump, Decision Table, J48, …)
  - regression (Linear Regression, M5’)
  - association rules (Apriori)
- Entry + Control data in ILP format: ACE
  - classification (Tilde)
  - regression (Tilde)
- since the data distributions are skewed, it is better to use regression to predict the chance of being positive/negative instead of using classification

Evaluation criteria
- 10-fold cross-validation
- classifiers
  - ROC analysis (Area Under Curve)
  - accuracy
- regression models
  - relative error (RE)
  - Pearson’s correlation coefficient (r)
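The criteria listed here can be sketched in plain Python. This is a minimal sketch, not the Weka or ACE implementation. Two points are assumptions: the relative error is taken to be the relative absolute error (total absolute error divided by the error of always predicting the mean), and the folds are assigned round-robin without shuffling or stratification.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a positive case
    is scored above a negative one (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relative_absolute_error(actual, predicted):
    """Assumed definition: absolute error relative to predicting the mean."""
    mean = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(mean - a) for a in actual)
    return num / den


def k_fold_indices(n, k=10):
    """Round-robin assignment of n example indices to k test folds."""
    return [list(range(i, n, k)) for i in range(k)]
```

Each fold returned by `k_fold_indices` serves once as the test set while the remaining folds are used for training.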

Initial exploration of Entry
Comparison of mean values of attributes for the three subgroups:
- reached education
- responsibility in job
- physical activity in job
- physical activity after job
- skinfold above musculus triceps
- skinfold above musculus subscapularis

Initial exploration of Entry
Correlation between BMI and skinfold for the three subgroups

Results from the Entry data set
- Relations between social factors and other characteristics
  - education level and physical activity in job
  - education level and smoking
  - pensioner and drinking
  - age and blood pressure
- Relations between physical activities and other characteristics
  - activity after job and smoking
  - duration of way to work and drinking
  - ...

Results from the Entry data set
Correlation between skinfolds and BMI in particular risk groups
- regression task: predict BMI using SUBSC and TRIC
- classification task: predict OVERWEIGHT (OW) (1 if BMI > 25 else 0)

Experiment  Size  ACC   RAE  r  AUC
OW_T        6.0   71%
OW_NG       0.6   53%
OW_RG       3.9   74%
OW_PG       1.0   75%
BMI_T
BMI_NG
BMI_RG
BMI_PG

Results from the Entry data set
Correlation between skinfolds and BMI in particular subgroups
- correlation is strongest in the risk group
- for all groups, SUBSC > ±15 is the most important split to distinguish between overweight and non-overweight
- SUBSC BMI
- the influence of TRIC on BMI is less than the influence of SUBSC

Results from the Entry data set
Correlation between skinfolds and BMI in particular subgroups
[Figure: example decision tree for the risk group, with yes/no branches on the tests TRIC < 15, SUBSC < 10, SUBSC < 15, SUBSC < 20, SUBSC < 35 and SUBSC < 70]

Results from the Entry data set
Staying healthy in the risk group (RG)
- task: predict whether a person from RG came down with cardio disease
- new attribute ILL introduced, based on the HODN0 attribute from Control
- no good performance (most correlation coefficients < 0.05)
- best correlation (0.15) for cholesterol level
- if cholesterol < 250 then chance to stay healthy

Results from the Control data set
- relational Control data set: Tilde
- task: predict whether a person from the risk group comes down with cardio disease (1) or not (0)
- use only control examinations (ce) before the patient’s cardio disease: ce.year ≤ ROK i
- numeric attributes: extra features
  - compute trend over different ce’s
  - slope of least squares model of attr. over time interval T – N
    - T: start of patient’s first disease
    - N: parameter chosen by Tilde
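The trend feature can be sketched as the slope of an ordinary least squares line fitted to one attribute over the examination years. This is an illustrative sketch, not the authors' implementation. It assumes that "time interval T – N" means the window of years from T - N up to T, and the record layout (`year` plus an attribute key) is hypothetical.

```python
def ls_slope(points):
    """Slope of the least squares line through (time, value) pairs.
    Returns None when fewer than two distinct time points are available."""
    n = len(points)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        return None
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return sxy / sxx


def trend_feature(exams, attr, T, N):
    """Slope of `attr` over the examinations with T - N <= year <= T
    (assumed reading of the interval); missing values are skipped."""
    window = [
        (e["year"], e[attr])
        for e in exams
        if T - N <= e["year"] <= T and e.get(attr) is not None
    ]
    return ls_slope(window)


# Hypothetical examinations of one patient.
exams = [
    {"year": 1980, "chol": 200},
    {"year": 1985, "chol": 210},
    {"year": 1990, "chol": 220},
]
slope_20 = trend_feature(exams, "chol", T=1990, N=20)
```

A positive slope indicates that the attribute rose over the window; a window with a single examination yields no trend.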

Results from the Control data set
Statistics on the Control data experiments

Input attributes        Size  ACC   RAE  r  AUC  AUC (33%)
Job                     1.0   68%
Physical activity       0.1   68%
Smoking                 3.7   67%
Diet                    0.0   68%
BMI                     1.4   67%
Blood Pressure          3.3   63%
Cholesterol             9.1   64%
Glycaemia & Uric acid   3.3   66%
BMI & Cholesterol       10.6  63%
Smoking & Cholesterol   12.5  63%
All                     8.5   66%

Results from the Control data set
Some interesting subgroups from the decision trees:
- proportion of class 1 in whole group = 32%
- total population = 1417
- IF glycaemia > 7.2 and BMI > 23.5 in each examination and diastolic blood pressure slope during last 10 years < -77 THEN 64% (103)
- IF systolic blood pressure slope during last 20 years < THEN 53% (122)
- IF glycaemia > 7.2 in each examination THEN 48% (434)
- IF patient leaves to full retirement in some examination THEN 20% (233)
- IF reduced smoking in some examination and slope in number of cigarettes during last 20 years THEN 16% (116)
- IF glycaemia < 7.2 in some examination THEN 7% (285)
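The way such a rule is scored can be sketched as follows: "in each examination" corresponds to a condition that holds for all examinations, "in some examination" to one that holds for at least one, and the reported percentage is the share of class 1 among the covered patients. The patient records below are toy data invented for illustration, and the key names (`glyc`, `ill`) are hypothetical.

```python
def glyc_high_in_each(exams, threshold=7.2):
    """True if glycaemia exceeds the threshold in every examination."""
    return bool(exams) and all(e["glyc"] > threshold for e in exams)


def glyc_low_in_some(exams, threshold=7.2):
    """True if glycaemia is below the threshold in at least one examination."""
    return any(e["glyc"] < threshold for e in exams)


def subgroup_stats(patients, condition):
    """Return (subgroup size, proportion of class 1) for the patients
    whose examinations satisfy the condition."""
    covered = [p for p in patients if condition(p["exams"])]
    if not covered:
        return 0, None
    positives = sum(p["ill"] for p in covered)
    return len(covered), positives / len(covered)


# Toy data, not taken from STULONG.
patients = [
    {"ill": 1, "exams": [{"glyc": 7.5}, {"glyc": 8.0}]},
    {"ill": 0, "exams": [{"glyc": 7.5}, {"glyc": 6.0}]},
    {"ill": 1, "exams": [{"glyc": 7.9}]},
    {"ill": 0, "exams": [{"glyc": 5.5}]},
]
high_size, high_rate = subgroup_stats(patients, glyc_high_in_each)
low_size, low_rate = subgroup_stats(patients, glyc_low_in_some)
```

Comparing each subgroup's proportion with the proportion in the whole population shows how strongly the rule separates the classes.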

Results from the Control data set
- glycaemia is the most important attribute
- also blood pressure, cholesterol and smoking …
- slope of numeric attributes very useful
- statistics may be negatively biased due to cross-validation

Conclusions
- used a variety of data mining algorithms
  - propositional techniques
  - multi-relational techniques
- results consistent over different algorithms
- much discovered knowledge is difficult to handle
  - interpretation of results by domain experts is necessary
- careful handling of results
  - if the accuracy of a classifier is not larger than that of predicting the average, the classifier can still be informative!!
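The last point can be illustrated with a toy example on skewed data. The labels and scores are invented for illustration and are not results from the challenge. At a 0.5 threshold the classifier predicts the majority class for everyone, so its accuracy equals the baseline, yet its scores still rank most positive cases above the negative ones.

```python
def auc(y_true, scores):
    """Probability that a positive case is scored above a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 3 positive and 7 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.40, 0.20, 0.35, 0.30, 0.25, 0.15, 0.10, 0.10, 0.05]

# Thresholding at 0.5 labels every case as negative.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Baseline: always predict the majority class.
majority = max(y_true.count(0), y_true.count(1)) / len(y_true)

ranking_quality = auc(y_true, scores)
```

Here `acc` equals `majority`, while `ranking_quality` is well above 0.5, the value expected from random scores.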

The End
Thanks for your attention!!