Trend Analysis and Risk Identification 1 The Gerstner laboratory for intelligent decision making and control, Czech Technical University, Prague Lenka.


1 Trend Analysis and Risk Identification
Lenka Nováková 1, Jiří Kléma 1, Michal Jakob 1, Simon Rawles 2, Olga Štěpánková 1
1 The Gerstner laboratory for intelligent decision making and control, Czech Technical University, Prague
2 Department of Computer Science, University of Bristol, Bristol, UK
PKDD 2003, Discovery Challenge

2 Outline
STULONG data, orientation towards CVD
Used tools
–SumatraTT, Statistica, Weka
Used techniques
–mainly statistical tests (ANOVA, chi-square, etc.)
Exploratory analysis and subgroup discovery
–Entry table
Trend analysis
–Entry and Control tables
–three principal ways of preprocessing
–derived aggregated attributes
–univariate and multivariate analysis

3 STULONG Data
Four tables: Entry, Control, Letter, Death
Dependent variable: CVD
–CardioVascular Disease
–boolean attribute derived from the A2 questionnaire (Control table)
CVD = false: the patient has no coronary disease.
CVD = true: at least one of the attributes Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14 is true for the patient.
Diagnoses listed on the slide: positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident, ischaemic heart disease.
Patients who have only diabetes (Hodn4) or cancer (Hodn15) are removed.
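The labelling rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing: the per-patient dictionary layout, the function name `derive_cvd` and the sample patients are assumptions, and "diabetes or cancer only" is read here as "Hodn4 or Hodn15 set while none of the CVD attributes is set".

```python
# Attribute names come from the slide; everything else is hypothetical.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(patient):
    """Return True/False for CVD, or None if the patient should be removed."""
    if any(patient.get(flag, False) for flag in CVD_FLAGS):
        return True
    if any(patient.get(flag, False) for flag in EXCLUDE_FLAGS):
        return None  # diabetes or cancer only: dropped from the analysis
    return False


# Made-up patients keyed by ICO; flags are assumed to be already
# aggregated over each patient's Control-table records.
patients = {
    "ICO_1": {"Hodn2": True},
    "ICO_2": {"Hodn4": True},
    "ICO_3": {},
    "ICO_4": {"Hodn15": True, "Hodn13": True},
}
labels = {ico: derive_cvd(p) for ico, p in patients.items()}
labelled = {ico: cvd for ico, cvd in labels.items() if cvd is not None}
```

Under this reading, a patient with both cancer and a CVD attribute (ICO_4 above) stays in the data as CVD = true.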

4 ENTRY - subgroup discovery
AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
–module for interactive decision tree induction
–two-tailed t-test or chi-square test to assess the significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
–social characteristics: derived attribute AGE_of_ENTRY
–alcohol: positive effect of beer, no effect of wine
–sugar consumption increases CVD risk
–well-known dependencies are not mentioned (smoking, BMI, cholesterol)
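To make the subgroup significance check concrete, here is a small sketch of a Pearson chi-square test on a 2x2 table (subgroup membership vs. CVD). It is not the Statistica procedure used in the study; the function name and the counts are invented for illustration, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square tail probability
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up counts: rows = in subgroup yes/no, columns = CVD true/false.
stat, p = chi_square_2x2(30, 70, 45, 255)
significant = p <= 0.05
```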

5 ENTRY - general model
General CVD model (in WEKA)
–feature selection + modeling (e.g., decision trees)
–tends to generate trivial models (always predicting false)
–asymmetric error-cost matrix does not help
Predict CVD risk
–Identify principal variables (Chi-squared test)
–Naïve Bayes + ROC evaluation
–three independent variables: discretized AGE_of_ENTRY, discretized BMI, Cholrisk (derived from CHLST)
–AUC = 0.66
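The AUC figure quoted here can be read as the probability that a randomly chosen CVD patient receives a higher risk score than a randomly chosen non-CVD patient. The sketch below computes it that way; the function name, scores and labels are made up and do not reproduce the WEKA evaluation.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive case outscores a negative one;
    ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up risk scores and CVD labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [True, False, True, False, False]
auc = roc_auc(scores, labels)
# Negating the scores mirrors the AUC around 0.5.
auc_reversed = roc_auc([-s for s in scores], labels)
```

An AUC of 0.5 corresponds to random ranking, so a value below 0.5 indicates a score that is informative in the opposite direction.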

6 CONTROL - trend analysis
AQ no.7: Are there any differences in development of risk factors for different CVD groups?
ENTRY table: ICO (primary key), Year of birth, Year of entry, Smoking, Alcohol, Cholesterol, Body Mass Index, Blood pressure
CONTR table: ICO; risk factors followed during 20 years

7 Global Approach
Risk factors to be observed are selected
–SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
–pivoting
Patients with no control entries are removed
–about 60 patients
Trend aggregates are calculated
Resulting table, one row per patient (ICO_1, ICO_2, ...): ICO | Entry | Contr1 | Contr2 | ... | ContrM | Aggr1 | ... | AggrN
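The pivoting step can be pictured as turning long-format control records into one wide row per patient. The following is a rough sketch under assumed data structures (dictionaries keyed by ICO, tuples for control records); the function name and sample values are hypothetical and the original work used SumatraTT for this transformation.

```python
from collections import defaultdict


def pivot_controls(entry_rows, control_rows, factor):
    """Build one wide row per patient: [entry value, contr1, contr2, ...].

    entry_rows:   {ico: {factor: value}}
    control_rows: list of (ico, exam_time, {factor: value}) in long format
    Patients without any control record are dropped.
    """
    by_patient = defaultdict(list)
    for ico, exam_time, values in control_rows:
        by_patient[ico].append((exam_time, values.get(factor)))
    wide = {}
    for ico, exams in by_patient.items():
        if ico not in entry_rows:
            continue
        exams.sort(key=lambda pair: pair[0])  # chronological order
        wide[ico] = [entry_rows[ico].get(factor)] + [v for _, v in exams]
    return wide


# Made-up data: ICO_2 has no control examination and is removed.
entry = {"ICO_1": {"SYST": 130}, "ICO_2": {"SYST": 120}}
controls = [
    ("ICO_1", 1980.5, {"SYST": 140}),
    ("ICO_1", 1978.0, {"SYST": 135}),
]
wide = pivot_controls(entry, controls, "SYST")
```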

8 Derived trend attributes
Intercept, Gradient, Correlation coefficient, Standard deviation, Mean
[Figure: regression line through the observations; x = decimal time (~ year + 1/12 month), y = observed variable, referential time = 1975]
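The five aggregates can be obtained from an ordinary least-squares line fitted to one patient's series. The sketch below assumes that time is measured from the referential year 1975 (so the intercept is the fitted value at 1975) and that the standard deviation is the sample standard deviation of y; both are interpretations of the slide, and the function names are illustrative.

```python
import math


def decimal_time(year, month, reference_year=1975):
    """Decimal time relative to the referential year (year + month / 12)."""
    return (year - reference_year) + month / 12.0


def trend_aggregates(times, values):
    """Least-squares line y = intercept + gradient * x plus summary statistics."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired observations")
    mean_x = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, values))
    if sxx == 0:
        raise ValueError("all examinations share the same time")
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    # Correlation is undefined for a constant series; report 0.0 in that case.
    correlation = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    std_dev = math.sqrt(syy / (n - 1))  # sample standard deviation of y
    return {
        "intercept": intercept,
        "gradient": gradient,
        "correlation": correlation,
        "std_dev": std_dev,
        "mean": mean_y,
    }


# Made-up systolic pressures measured in June 1976, 1977 and 1978.
times = [decimal_time(y, 6) for y in (1976, 1977, 1978)]
aggregates = trend_aggregates(times, [130.0, 134.0, 138.0])
```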

9 Global Approach - results
The derived aggregates were discretized
–e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
Chi-square test for independence with respect to CVD
A large number of aggregates proved to be significant, including gradients (chi-square test, p=0.05)
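As an illustration of the discretization step, the sketch below maps a gradient to one of the five classes and tallies the classes against CVD into a contingency table, which is the input a chi-square test of independence needs. The cut points `small` and `large` are invented; the slides do not state the thresholds that were used.

```python
from collections import Counter

CLASSES = (
    "strongly decreasing",
    "decreasing",
    "constant",
    "increasing",
    "strongly increasing",
)


def discretize_gradient(gradient, small=0.5, large=2.0):
    """Map a gradient to one of five ordered classes (illustrative cut points)."""
    if gradient <= -large:
        return CLASSES[0]
    if gradient <= -small:
        return CLASSES[1]
    if gradient < small:
        return CLASSES[2]
    if gradient < large:
        return CLASSES[3]
    return CLASSES[4]


def contingency_table(gradients, cvd_flags):
    """Count patients per (gradient class, CVD) cell."""
    counts = Counter(
        (discretize_gradient(g), bool(cvd)) for g, cvd in zip(gradients, cvd_flags)
    )
    return {cls: (counts[(cls, True)], counts[(cls, False)]) for cls in CLASSES}


# Made-up gradients and CVD labels.
table = contingency_table([-3.0, -1.0, 0.0, 1.0, 3.0, 2.5], [0, 0, 0, 1, 1, 0])
```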

10 [Figure: chart over the five gradient classes: Strongly decreasing, Decreasing, Constant, Increasing, Strongly increasing]

11 [Figure: chart over the five gradient classes: Strongly decreasing, Decreasing, Constant, Increasing, Strongly increasing]

12 ControlCount vs. CVD
ControlCount
–number of examinations
–strong relation with CVD
–AUC = 0.35
–a higher ControlCount goes with a lower CVD risk (hence the AUC below 0.5)
–anachronistic attribute
–introduced by the design of the study
ControlCount has influence on the trend aggregates, e.g., on how steep the gradients tend to be
Conclusion: global approach cannot be applied (at least with these aggregates)

13 Windowing Approach I.
The same risk factors, the same pivoting transformation and similar trend aggregates, but with a constant number of examinations
Issues:
–window time period vs. number of examinations: 5 examinations are enough to express a trend
–patients : records (1 : ControlCount - 3); the entry examination is used as the first examination; records are dependent
–CVD classification: time from the last examination to CVD, or yes/no (yes = CVD in the next year or CVD in future)
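The 1 : ControlCount - 3 ratio follows from sliding a window of 5 examinations over a series that starts with the entry examination: ControlCount + 1 examinations yield (ControlCount + 1) - 5 + 1 windows. A short sketch, with a hypothetical function name and made-up values:

```python
def sliding_windows(entry_value, control_values, size=5):
    """Return all windows of `size` consecutive examinations.

    The entry examination is treated as the first examination.
    Patients with too few examinations produce no window.
    """
    series = [entry_value] + list(control_values)
    return [series[i:i + size] for i in range(len(series) - size + 1)]


# Made-up patient with ControlCount = 6 control examinations.
windows = sliding_windows(130, [132, 135, 131, 138, 140, 142])
```

Because consecutive windows share four of their five examinations, the resulting records are not independent, which is the dependence issue noted on the slide.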

14 Windowing Approach I.
[Diagram: a window moves over the Entry and control examination data; labels: First vector, New vector]

15 Aggregate tests
[Statistica output: T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
Trend aggregates approach the normal distribution in both specified CVD groups
Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next exam (1)
T-test for comparison of the group means can be applied (p<=0.05)
Do the means of the calculated aggregates differ in the different CVD groups? Just a few of them
–two gradients are clearly significant: SYST and DIAST only
–two intercepts are significant: TRIGL and CHLST
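For reference, the statistic behind a two-sample comparison of group means can be computed as below. This sketch assumes the classical pooled-variance (Student) form; the slides do not say which variant Statistica was run with, and the sample gradients are invented. It returns only the t statistic and degrees of freedom, which would then be compared against a t-distribution table or passed to a statistics package for the p-value.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    if pooled_var == 0:
        raise ValueError("both groups are constant; t is undefined")
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Made-up SYST gradients: group 1000 (no future CVD) vs. group 1 (CVD at next exam).
t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
```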

16 Further tests of SYST, DIAST
[Statistica output: T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
Test the gradients for all the CVD groups, not only the two extreme groups
Repeated-measures ANOVA can be applied: development of the SYST/DIAST trend for different CVD groups

20 Windowing Approach II.
There are missing values of risk factors
Windowing I.
–skips missing values
–different numbers of rows are generated for different factors
Windowing II.
–replaces the missing values
–the same numbers of rows are generated for different factors
–enables multivariate analysis: combination of different aggregates and their relation with CVD
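The slides do not say how Windowing II. replaces missing values. Purely as an illustration of why replacement keeps the row counts equal across factors, the sketch below uses last-observation-carried-forward, which is an assumption and not necessarily the method the authors applied; the function name and values are made up.

```python
def fill_missing(values):
    """Replace None by the previous known value.

    Leading gaps take the first known value. This replacement rule is an
    illustrative assumption, not the rule used in the original study.
    """
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to fill from
    filled, last = [], known[0]
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Made-up series for two factors of the same patient with different gaps.
syst = fill_missing([None, 130, None, 140, None])
trigl = fill_missing([1.2, None, None, 1.5, 1.4])
```

After replacement both factors have the same number of examinations, so windows for different factors line up row by row and can be combined.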

21 Windowing II.
[Diagram: a window moves over the Entry and control examination data; labels: First vector, New vector]

22 27 patients only!

23 Conclusions
The main scope
–AQ no.7: Are there any differences in development of risk factors for different CVD groups?
Contributions
–Pitfalls of the global approach revealed
–Using windowing: differences proved for SYST and DIAST blood pressures
–Other assumptions and ideas: an interesting course of development of risk factors (DIAST first decreases, then increases, and CVD appears); other trends may have influence under specific conditions (BMITrend and overweight, etc.)

