Trend Analysis and Risk Identification

Presentation transcript:

Trend Analysis and Risk Identification
Lenka Nováková (1), Jiří Kléma (1), Michal Jakob (1), Simon Rawles (2), Olga Štěpánková (1)
(1) The Gerstner laboratory for intelligent decision making and control, Czech Technical University, Prague
(2) Department of Computer Science, University of Bristol, Bristol, UK
PKDD 2003, Discovery Challenge

Outline
STULONG data, orientation towards CVD
Tools used
  – SumatraTT, Statistica, Weka
Techniques used
  – mainly statistical tests: ANOVA, chi-square, etc.
Exploratory analysis and subgroup discovery
  – Entry table
Trend analysis
  – Entry and Control tables
  – three principal ways of preprocessing
  – derived aggregated attributes
  – univariate and multivariate analysis

STULONG Data
Four tables: Entry, Control, Letter, Death
Dependent variable: CVD
  – CardioVascular Disease
  – boolean attribute derived from the A 2 questionnaire (Control table)
CVD = false: the patient has no coronary disease.
CVD = true: at least one of the attributes Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14 is true. These cover positive angina pectoris, (silent) myocardial infarction, cerebrovascular accident and ischaemic heart disease.
Patients who have only diabetes (Hodn4) or cancer (Hodn15) are removed.
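The derivation of the CVD attribute can be sketched in a few lines. This is only an illustration: the record layout (a dict of boolean Hodn flags per patient) is an assumption, and "removed" is read here as "has diabetes or cancer but none of the CVD flags".

```python
# Hypothetical record layout: one dict of boolean Hodn flags per patient.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(patient):
    """Return True/False for the CVD attribute, or None if the patient is removed."""
    if any(patient.get(flag, False) for flag in CVD_FLAGS):
        return True
    # Assumed reading of the slide: diabetes or cancer without any CVD flag
    # means the patient is dropped from the analysis.
    if any(patient.get(flag, False) for flag in EXCLUDE_FLAGS):
        return None
    return False
```

A patient with both a CVD flag and diabetes is kept as CVD = true under this reading; only the "diabetes or cancer only" cases return None.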

ENTRY - subgroup discovery
AQ no. 6: Are there any differences in the ENTRY examination for different CVD groups?
Statistica 6.0
  – module for interactive decision tree induction
  – two-tailed t-test or chi-square test to assess the significance of subgroups
Dependencies are relatively weak
Interesting dependencies found
  – social characteristics: derived attribute AGE_of_ENTRY
  – alcohol: positive effect of beer, no effect of wine
  – sugar consumption increases CVD risk
  – well-known dependencies are not mentioned (smoking, BMI, cholesterol)

ENTRY - general model
General CVD model (in WEKA)
  – feature selection + modeling (e.g., decision trees)
  – tends to generate trivial models (always predicting false)
  – an asymmetric error-cost matrix does not help
Predict CVD risk
  – identify principal variables (chi-squared test)
  – Naïve Bayes + ROC evaluation
  – three independent variables
      – discretized AGE_of_ENTRY
      – discretized BMI
      – Cholrisk, derived from CHLST
  – AUC = 0.66
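For readers unfamiliar with the AUC figure, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen CVD = true patient receives a higher score than a randomly chosen CVD = false patient. The function name and inputs are illustrative and not taken from the WEKA setup used in the slides.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to a random ranking, 1.0 to a perfect one, and values below 0.5 indicate that the score is inversely related to the class.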

CONTROL - trend analysis
AQ no. 7: Are there any differences in development of risk factors for different CVD groups?
ENTRY table: ICO (primary key), Year of birth, Year of entry, Smoking, Alcohol, Cholesterol, Body Mass Index, Blood pressure
CONTR table: ICO, risk factors followed during 20 years

Global Approach
Risk factors to be observed are selected
  – SYST, DIAST, TRIGL, BMI, CHLSTMG
Selected control examinations are transformed
  – pivoting
Patients with no control entries are removed
  – about 60 patients
Trend aggregates are calculated

ICO   | Entry | Contr1 | Contr2 | ... | ContrM | Aggr1 | ... | AggrN
ICO_1 |
ICO_2 |
...
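The pivoting step can be illustrated as follows. This is a minimal sketch under assumed names: each control row is a dict with the hypothetical keys "ICO" and "time" plus one key per risk factor; the actual transformation in the slides was done with the tools listed earlier.

```python
from collections import defaultdict


def pivot_controls(control_rows, factor):
    """Collect one time-ordered series per patient for a single risk factor.

    control_rows: iterable of dicts with the assumed keys 'ICO', 'time'
    and the risk factor name (e.g. 'SYST'); missing values are None.
    Returns {ICO: [(time, value), ...]}; patients without any usable
    control entry do not appear in the result.
    """
    series = defaultdict(list)
    for row in control_rows:
        value = row.get(factor)
        if value is not None:
            series[row["ICO"]].append((row["time"], value))
    return {ico: sorted(points) for ico, points in series.items()}
```

Each resulting series corresponds to one row of the pivoted table, from which the trend aggregates are then computed.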

Derived trend attributes
Intercept
Gradient
Correlation coefficient
Standard deviation
Mean
[Figure: line fitted to the observed variable y over x, the decimal time (~ year + month/12); the referential time is 1975]
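A possible computation of these aggregates is an ordinary least-squares line per patient and risk factor. The sketch below assumes that time is measured relative to the referential year 1975, so that the intercept is the fitted value in 1975, and that the standard deviation is the sample standard deviation of y; the slides do not spell out either detail.

```python
import math

REFERENCE_YEAR = 1975  # referential time named on the slide


def decimal_time(year, month):
    """Decimal time as on the slide: year + month/12."""
    return year + month / 12.0


def trend_aggregates(times, values, reference=REFERENCE_YEAR):
    """Least-squares trend aggregates for one patient and one risk factor."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs")
    xs = [t - reference for t in times]  # assumption: x is time since 1975
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    if sxx == 0:
        raise ValueError("all examinations share the same time stamp")
    gradient = sxy / sxx
    return {
        "intercept": mean_y - gradient * mean_x,
        "gradient": gradient,
        # A constant series has no defined correlation; 0.0 is used here.
        "correlation": sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0,
        "std": math.sqrt(syy / (n - 1)),
        "mean": mean_y,
    }
```

For example, systolic pressures of 120, 125 and 130 measured in 1975, 1976 and 1977 give a gradient of 5 per year and an intercept of 120.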

Global Approach - results
The derived aggregates were discretized
  – e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
Chi-square test for independence with respect to CVD
A large number of aggregates proved to be significant, including gradients (chi-square test, significance level 0.05)
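The discretization and the independence test can be sketched as below. The cut points `small` and `large` are hypothetical (the slides do not give the thresholds), and the function returns only the test statistic and degrees of freedom, which would then be compared with a chi-square critical value or converted to a p-value by a statistics package.

```python
def discretize_gradient(gradient, small=0.5, large=2.0):
    """Map a gradient to one of five categories (thresholds are hypothetical)."""
    if gradient <= -large:
        return "strongly decreasing"
    if gradient <= -small:
        return "decreasing"
    if gradient < small:
        return "constant"
    if gradient < large:
        return "increasing"
    return "strongly increasing"


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts.

    table: list of rows, e.g. one row per gradient category and one
    column per CVD value. Returns (statistic, degrees_of_freedom).
    Every row and column total must be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

With five gradient categories and a boolean CVD attribute the table has 4 degrees of freedom, for which the 0.05 critical value is about 9.49.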

[Two figures over the gradient categories: strongly decreasing, decreasing, constant, increasing, strongly increasing]

ControlCount vs. CVD
ControlCount
  – number of examinations
  – strong relation with CVD
  – AUC = 0.35
  – a higher ControlCount goes with a lower CVD risk (AUC below 0.5)
  – anachronistic attribute
  – introduced by the design of the study
ControlCount influences the trend aggregates, e.g., it affects how steep the gradients tend to be
Conclusion: the global approach cannot be applied (at least with these aggregates)

Windowing Approach I.
The same risk factors, the same pivoting transformation and similar trend aggregates, BUT a constant number of examinations
Issues:
  – window time period vs. number of examinations
      – 5 examinations are enough to express a trend
  – patients : records (1 : ControlCount - 3)
      – the entry examination is used as the first examination
      – records are dependent
  – CVD classification
      – time from the last examination to CVD
      – yes/no (yes = CVD in the next year or CVD in the future)
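The record count quoted above follows from sliding a window of 5 examinations over the entry examination plus the ControlCount control examinations: ControlCount + 1 examinations give (ControlCount + 1) - 5 + 1 = ControlCount - 3 windows. A minimal sketch, with an illustrative function name:

```python
WINDOW = 5  # examinations per window, as stated on the slide


def sliding_windows(examinations, size=WINDOW):
    """Return all windows of `size` consecutive examinations.

    examinations: time-ordered list with the entry examination first.
    Patients with fewer than `size` examinations yield no records.
    """
    return [examinations[i:i + size]
            for i in range(len(examinations) - size + 1)]
```

Because consecutive windows share four of their five examinations, the resulting records are not independent, which is the dependence issue noted in the list above.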

Windowing Approach I.
[Diagram: a window moving over the examination data, starting at the entry examination; labels: first vector, new vector]

Aggregate tests
[Statistica output: T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
Trend aggregates approach the normal distribution in both specified CVD groups
Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next examination (1)
A t-test for comparison of the group means can be applied (p <= 0.05)
Do the means of the calculated aggregates differ between the CVD groups? Only a few of them do
  – only two gradients are clearly significant: SYST and DIAST
  – two intercepts are significant: TRIGL and CHLST
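The comparison of group means can be sketched with a two-sample t statistic. The slides do not say which variant Statistica applied, so the pooled-variance Student form below is an assumption; the function returns the statistic and degrees of freedom only.

```python
import math


def pooled_t_statistic(group_a, group_b):
    """Two-sample Student t statistic with pooled variance (assumed variant).

    Returns (t, degrees_of_freedom). Each group needs at least two values
    and the pooled variance must be positive.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    dof = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / dof
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return t, dof
```

Here `group_a` would hold one aggregate (for instance the SYST gradient) for the windows labeled 1000 and `group_b` the same aggregate for the windows labeled 1.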

Further tests of SYST, DIAST
[Statistica output: T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
Test the gradients for all the CVD groups, not only the two extreme groups
Repeated-measures ANOVA can be applied: development of the SYST/DIAST trend for different CVD groups

Windowing Approach II.
There are missing values of risk factors
Windowing I.
  – skips missing values
  – different numbers of rows are generated for different factors
Windowing II.
  – replaces the missing values
  – the same number of rows is generated for different factors
  – enables multivariate analysis
      – combination of different aggregates and their relation with CVD
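The slides do not state how the missing values are replaced. Purely as an illustration of the idea, the sketch below carries the last observed value forward (and back-fills leading gaps with the first observed value); this replacement rule is an assumption, not the authors' method.

```python
def fill_missing(values):
    """Replace None entries so every risk factor keeps the same row count.

    Assumed rule: last observation carried forward; gaps before the first
    observation take the first observed value. A series with no observed
    value is returned unchanged.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    filled = []
    last = observed[0]
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

After such a replacement every risk factor has a value at every examination, so the windows of different factors line up row by row.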

Windowing II.
[Diagram: a window moving over the examination data, starting at the entry examination; labels: first vector, new vector]

27 patients only!

Conclusions
The main scope
  – AQ no. 7: Are there any differences in development of risk factors for different CVD groups?
Contributions
  – pitfalls of the global approach revealed
  – with windowing, differences were shown for SYST and DIAST blood pressures
  – other assumptions and ideas:
      – interesting course of development of risk factors (DIAST first decreases, then increases, and CVD appears)
      – other trends may have influence under specific conditions (BMITrend and overweight, etc.)