Published by Anderson Janey. Modified over 2 years ago.

1
Dept of Biomedical Engineering, Medical Informatics, Linköpings universitet, Linköping, Sweden
A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining
Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar
Department of Biomedical Engineering, Division of Medical Informatics
Linköpings universitet, Linköping, Sweden

2
A Data Pre-processing Method in Data Mining
Outline
–Introduction
–Dataset and variables
–Data pre-processing
–Data mining algorithm (DTI)
–Result
–Discussion

3
Introduction
Abundance of data in medicine and availability of comprehensive registers
Difficulty in analysing huge amounts of data with traditional methods
Efficient data mining methods

4
Introduction
Applying data mining methods to a breast cancer register
Pre-processing is an essential part of knowledge discovery in databases
Finding an efficient pre-processing approach is essential for successful data mining

5
Methods
Dataset
Data pre-processing
–Data combination and selection
–Cleaning data
–Replacing missing values
–Dimension reduction
Decision Tree Induction (DTI)
Performance comparison

6
Dataset
3949 female patients, 1986 to 1995, with follow-up to 2003
Data from three registers (regional, tumour marker and death registers), more than 150 variables overall

7
Variables

8
Data Pre-processing - Data Selection
After combining data from the different registers, important variables (predictors and outcomes) were selected in consultation with domain experts:
–The number of predictors was reduced from more than 150
–Four important outcomes for breast cancer were chosen

9
Data Pre-processing - Cleaning Data
Cleaning the data of outliers and errors, for example:
–Duration between diagnosis of the disease and recurrence
–Age
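A minimal sketch of what such a cleaning pass might look like. The record layout, the field names and the age limits are all hypothetical, chosen only to illustrate the two checks named on the slide; they are not taken from the registers or from the authors' procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # Hypothetical fields; the real registers hold more than 150 variables.
    patient_id: int
    age_at_diagnosis: float
    diagnosis_date: date
    recurrence_date: Optional[date]  # None when no recurrence is registered


def is_plausible(rec: Record, min_age: float = 18, max_age: float = 110) -> bool:
    """Return True when the record passes both sanity checks."""
    # Check 1: age must fall inside an assumed plausible range.
    if not (min_age <= rec.age_at_diagnosis <= max_age):
        return False
    # Check 2: a recurrence cannot be dated before the diagnosis.
    if rec.recurrence_date is not None and rec.recurrence_date < rec.diagnosis_date:
        return False
    return True


records = [
    Record(1, 54, date(1990, 3, 1), date(1994, 6, 15)),
    Record(2, 154, date(1991, 7, 9), None),               # implausible age
    Record(3, 61, date(1988, 11, 20), None),
    Record(4, 47, date(1992, 5, 10), date(1991, 1, 1)),   # negative duration
]
clean = [r for r in records if is_plausible(r)]
```

In practice a flagged record would more likely be sent back for correction than silently dropped; the filter here only shows where the checks sit.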

10
Data Pre-processing - Replacing Missing Values
EM (expectation maximization) algorithm
–Dempster et al., 1977
–An iterative approach that estimates the parameters of a model, starting from an initial guess. Each iteration consists of two steps:
  An expectation step that finds the distribution of the missing data, given the known values of the observed variables and the current estimate of the parameters.
  A maximization step that re-estimates the parameters from the data completed in the expectation step; missing values are then substituted with their expected values.
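The two steps can be sketched as follows, assuming the data are modelled as multivariate normal (the slides do not state which model was used). The function name `em_impute` and its parameters are illustrative, and this is not the authors' implementation.

```python
import numpy as np


def em_impute(X, n_iter=50, tol=1e-6):
    """Fill NaN entries of X by EM under an assumed multivariate normal model."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n, p = X.shape
    # Initial guess: column means of the observed values.
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True)
    for _ in range(n_iter):
        prev = filled.copy()
        cond_cov_sum = np.zeros((p, p))
        # E-step: conditional mean (and covariance) of the missing entries
        # given the observed entries and the current mu, sigma.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # row with nothing observed
                filled[i] = mu
                cond_cov_sum += sigma
                continue
            s_mo = sigma[np.ix_(m, o)]
            coef = s_mo @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cond_cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: re-estimate mean and covariance from the completed data.
        mu = filled.mean(axis=0)
        centred = filled - mu
        sigma = (centred.T @ centred + cond_cov_sum) / n
        if np.max(np.abs(filled - prev)) < tol:
            break
    return filled


# Toy data in which the second column is exactly twice the first.
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, np.nan]])
completed = em_impute(data)
```

On the toy data the imputed entry approaches 10, the value implied by the linear relation among the complete rows. Register data with many categorical variables would need a different model, so this is only a picture of the iteration itself.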

11
Data Pre-processing - Dimension Reduction
Canonical Correlation Analysis (CCA)
–It investigates the relationship between two sets of variables.
–A canonical correlation is the correlation between two canonical variates, one representing a set of independent variables, the other a set of dependent variables.
–A canonical variate is a linear combination of a set of original variables.

12
Data Pre-processing - Dimension Reduction
–The aim is to create a number of canonical solutions, each consisting of a linear combination of one set of variables:
$U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$
and a linear combination of the other set of variables:
$V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$
–The goal is to determine the coefficients (the a's and b's) that maximize the correlation between the canonical variates $U_i$ and $V_i$.
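The maximization can be written compactly in vector form. The coefficient vectors a and b and the covariance matrices Σ_XX, Σ_YY and Σ_XY are notation assumed here for illustration; they do not appear on the slides.

```latex
\rho_i \;=\; \max_{a,\,b}\ \operatorname{corr}(U_i, V_i)
       \;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                                {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}},
\qquad U_i = a^{\top}X,\quad V_i = b^{\top}Y
```

Each later pair of variates is found the same way, with the added constraint that it is uncorrelated with all earlier pairs, so at most min(m, n) canonical solutions exist.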

13
Data Pre-processing - Dimension Reduction
–To find the important variables in each set (predictors and outcomes), the magnitudes of the loadings were used.
–Variables whose loadings had an absolute value of at least 0.3 were considered important and passed on to the data mining step.
–A loading shows how much each original variable contributes to each canonical variate.
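A rough sketch of loading-based selection, using NumPy only. It computes CCA through a QR decomposition followed by an SVD, assumes both data matrices have full column rank, and applies the 0.3 threshold to the first canonical variate; the slides do not say which variates were inspected, so that choice is an assumption. The function names and the synthetic data are illustrative only.

```python
import numpy as np


def cca_loadings(X, Y):
    """Return canonical correlations and the loadings of the X and Y variables."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centred data.
    qx, _ = np.linalg.qr(Xc)
    qy, _ = np.linalg.qr(Yc)
    # The singular values of qx^T qy are the canonical correlations.
    u, s, vt = np.linalg.svd(qx.T @ qy)
    k = len(s)
    U = qx @ u[:, :k]      # canonical variates of X (unit-norm columns)
    V = qy @ vt.T[:, :k]   # canonical variates of Y (unit-norm columns)
    # Loading = correlation between an original variable and a variate.
    load_x = (Xc.T @ U) / np.linalg.norm(Xc, axis=0)[:, None]
    load_y = (Yc.T @ V) / np.linalg.norm(Yc, axis=0)[:, None]
    return s, load_x, load_y


def select(loadings, variate=0, threshold=0.3):
    """Indices of variables with |loading| >= threshold on one variate."""
    return [j for j, v in enumerate(loadings[:, variate]) if abs(v) >= threshold]


# Synthetic data: only predictor 0 and outcome 0 are related.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=n), rng.normal(size=n)])

corrs, load_x, load_y = cca_loadings(X, Y)
kept_predictors = select(load_x)
kept_outcomes = select(load_y)
```

With this construction the unrelated noise columns should receive loadings near zero and be dropped, which mirrors how the threshold shrinks the variable set before the data mining step.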

14
Data Pre-processing - Dimension Reduction
Variables with their loadings

15
Data Mining Algorithm
Decision Tree Induction (DTI)
–A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
–Each internal node denotes a test on a variable, each branch stands for an outcome of the test, each leaf node represents an outcome, and the uppermost node in the tree is the root node.
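The structure described above can be illustrated with a small hand-built tree and the traversal that classifies a case. This shows only how a finished tree is read, not how one is induced from data. The variable names, thresholds and labels are invented for illustration and do not come from the study.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # the classification stored at a leaf node


@dataclass
class Node:
    variable: str                       # variable tested at this internal node
    threshold: float                    # test is: value < threshold
    below: Union["Node", Leaf]          # branch taken when the test is true
    at_or_above: Union["Node", Leaf]    # branch taken when the test is false


def classify(tree: Union[Node, Leaf], case: Dict[str, float]) -> str:
    """Walk from the root to a leaf, following the outcome of each test."""
    node = tree
    while isinstance(node, Node):
        if case[node.variable] < node.threshold:
            node = node.below
        else:
            node = node.at_or_above
    return node.label


# Hypothetical tree: the root tests tumour size, one child tests node count.
root = Node(
    variable="tumour_size_mm",
    threshold=20,
    below=Leaf("no recurrence"),
    at_or_above=Node(
        variable="positive_nodes",
        threshold=1,
        below=Leaf("no recurrence"),
        at_or_above=Leaf("recurrence"),
    ),
)

result_a = classify(root, {"tumour_size_mm": 35, "positive_nodes": 3})
result_b = classify(root, {"tumour_size_mm": 10, "positive_nodes": 5})
```

Fewer input variables generally mean fewer distinct tests along each path, which is one reason a smaller tree is easier to read.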

16
Resulting Decision Tree

17
Performance comparison
$\text{Sensitivity} = \frac{TP}{TP + FN}$
$\text{Specificity} = \frac{TN}{TN + FP}$
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Number of leaves and tree size
TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively
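As a small worked example, the three measures can be computed from the four counts using their standard confusion-matrix definitions. The function name and the counts below are made up for illustration and are not results from the study.

```python
def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        # Share of actual positives that were classified as positive.
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives that were classified as negative.
        "specificity": tn / (tn + fp),
        # Share of all cases that were classified correctly.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Invented counts: 50 actual positives and 50 actual negatives.
measures = performance(tp=40, tn=45, fp=5, fn=10)
```

For these counts the sensitivity is 40/50 = 0.8, the specificity 45/50 = 0.9 and the accuracy 85/100 = 0.85.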

18
Performance Comparison
Comparing different approaches

19
Discussion
Effective data pre-processing is a very important step in knowledge discovery
–Real-world data are usually
  Incomplete
  Noisy
  Inconsistent
  Not collected for data mining

20
Discussion
Replacing missing values before dimension reduction
–Provides more information to CCA for dimension reduction
Running CCA prior to DTI
–Reduces the number of variables while increasing the accuracy of classification
–Considerably increases the interpretability of DTI

21
Discussion
In medical studies, often no pre-processing is done before DTI
Proper pre-processing, including consulting with domain experts, replacing missing values and dimension reduction, prepares the data for better data mining by DTI
Increased accuracy and interpretability of DTI are the result of our approach

22
Future Work
Increase the efficiency of knowledge discovery in medical registers.
Validate the result of our methodology (pre-processing prior to data mining) with domain experts for the prediction of recurrence of cancer.
Investigate how to use the discovered knowledge and integrate it into the clinical workflow.
Improve the quality of registers by adding and completing important predictors.

23
Thanks for your attention
