Selecting the Right Predictors

Organized by Farrokh Alemi, Ph.D. Narrated by Yara Alemi.

In predicting an outcome from data in electronic health records, we have to decide which set of predictors to include in the model. This section helps you think through the selection of predictors.

Types of Predictors
If you have the right variables, you can predict anything. An electronic health record holds hundreds of thousands of variables, so some of them are likely to be the right ones. This section describes how to select among them.

There are many different types of predictors in electronic health records. One can use diagnoses, treatments, or medications to predict the outcomes for a patient. One could use patient characteristics, such as address or gender, to improve the predictions. Vital signs can also be used. In all, an electronic health record offers hundreds of thousands of potential predictors.

In the past, we have relied primarily on diagnoses, in other words the patient's history of illness, to predict various outcomes.

Diagnoses (Dx)
In predicting mortality of patients within 6 months, medical history, i.e., the patient's diagnostic codes, was more predictive than laboratory values or physiological markers.

Diagnoses (Dx) More Predictive than Heart Ejection Fraction
For example, in predicting 6-month mortality from heart failure, patients' diagnoses were more predictive than heart ejection fraction.

Diagnoses (Dx) More Predictive than Chronological Age
Diagnoses are also more predictive than chronological age in predicting mortality within 6 months. What matters is not how old you are but what illnesses you have.

Diagnoses (Dx) More Predictive than Laboratory Values
Diagnoses are also more predictive than laboratory values. Many lab values and physiological measurements can be controlled through medication. For instance, a hypertensive patient may show normal readings if the condition is controlled with medication, even though the underlying diagnosis has not changed.

Selective Versus Comprehensive
There are thousands of diagnoses, so a careful choice needs to be made. Historically, scientists have relied on clinicians to select a set of variables known to affect the outcome of interest. The approach we prefer is to use all diagnoses. This leads to the difficult situation of having thousands of predictors in the model, but it has the advantage that no relevant piece of information is missing.

Including All Is More Accurate
In predicting mortality, models that rely on all diagnoses, without grouping them into categories, have proven to be more accurate than models that are selective in their approach or that break diagnoses into homogeneous categories.
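One way to picture the comprehensive approach is a table with one binary indicator per distinct diagnosis code, with no grouping into categories. The following is only a minimal sketch under that assumption, not the presenters' implementation. The patient records, the codes shown, and the function name `build_indicator_rows` are all hypothetical and chosen for illustration.

```python
def build_indicator_rows(patient_codes):
    """Build one 0/1 column per distinct diagnosis code (no grouping).

    patient_codes maps a patient ID to the set of diagnosis codes
    recorded for that patient. Returns the sorted list of column
    names and a dict mapping each patient ID to its 0/1 row.
    """
    all_codes = sorted({code for codes in patient_codes.values() for code in codes})
    rows = {
        pid: [1 if code in codes else 0 for code in all_codes]
        for pid, codes in patient_codes.items()
    }
    return all_codes, rows


# Hypothetical records; the codes are illustrative only.
patients = {
    "p1": {"E11.9", "I50.9"},
    "p2": {"I10"},
    "p3": {"E11.9", "N18.3", "I10"},
}

columns, rows = build_indicator_rows(patients)
for pid in sorted(rows):
    print(pid, dict(zip(columns, rows[pid])))
```

With real records the number of columns would run into the thousands, which is the management burden the text describes; a sparse representation would likely be needed in practice.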

Rare Predictors
In statistical modeling, for example in regression equations, a common practice is to discard predictors that rarely occur. The logic is that these rare predictors occur too infrequently to make a difference for an average patient. In electronic health records, this practice is not advised.

In electronic health records, we have thousands of rare predictors
Electronic health records contain thousands of rare predictors. Ignoring one has a negligible effect, but ignoring thousands of them will have a large impact on the accuracy of predictions for the average patient. Furthermore, ignoring these predictors will reduce accuracy in the subset of patients who experience these rare diseases. Therefore, we do not recommend excluding rare predictors from the models. This yields a statistical model with thousands of variables, most of which occur in rare situations. The model will be accurate but difficult to manage because there are so many variables. If the choice is between accuracy and ease, we would rather take the accurate route.
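A rough calculation can show why many rare predictors matter collectively. The sketch below assumes, purely for illustration, that every rare predictor has the same prevalence and that predictors occur independently of one another; neither assumption comes from the presentation, and the numbers (1 in 10,000 prevalence, 5,000 predictors) are hypothetical.

```python
def share_with_any_rare_predictor(prevalence, n_predictors):
    """Probability that a patient has at least one of n rare predictors.

    Assumes equal prevalence and independence across predictors,
    which is a simplification made only for this illustration.
    """
    return 1.0 - (1.0 - prevalence) ** n_predictors


one = share_with_any_rare_predictor(0.0001, 1)
many = share_with_any_rare_predictor(0.0001, 5000)

print(f"one rare predictor:    {one:.4%} of patients affected")
print(f"5,000 rare predictors: {many:.1%} of patients affected")
```

Under these assumptions, a single rare predictor touches about 1 patient in 10,000, while 5,000 of them together touch roughly 39 percent of patients, which is why discarding them as a group can hurt accuracy for the average patient.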

Obvious Predictors
A related issue is whether we should keep obvious predictors, that is, predictors that make the prediction task trivial. One example is predicting from coma that the patient will die. As another example, a patient with diabetic neuropathy is clearly diabetic. There is no need to predict whether this patient has undiagnosed diabetes or will develop diabetes in the future; the word "diabetic" is in the name of the disease the patient is reported to have had.

Obvious predictors should be kept in the model for two reasons. First, errors in these cases will lead clinicians to ridicule the model and abandon its use. It is important not to miss the prediction in obvious cases; deleting these clues makes it harder for the computer to remain accurate in obvious situations.

Second, crucial information may be missing from the electronic health record, and obvious predictors can adjust for situations where the information is missing. In our example, a patient may be hospitalized with diabetic neuropathy even though no diabetes was recorded for this patient. Diabetes is usually diagnosed in an outpatient setting, and the outpatient doctor who sees this patient may not use the same electronic health record as the hospital. As a consequence, this piece of information is not available in our records. Keeping obvious predictors helps the system address missing information.

After the Fact Tautology
Statisticians are concerned about using a variable that occurs after an outcome to predict that outcome. On the surface, such predictions look tautological.

For example, if we want to predict whether a patient will develop diabetes, then all complications or consequences of diabetes are tautological predictors. They should not be part of the analysis.

Consequences & Complications of Diabetes
In contrast, when we are trying to detect whether a patient has already developed an illness, later events are legitimate evidence. In these situations, we detect diabetes by its consequences or even its complications. For example, undiagnosed diabetes, or diabetes not previously reported in the electronic health record, can be detected by checking whether the patient has complications of diabetes such as renal illness.

Remove Later Predictors
Detection: backward look, keep later predictors. Prediction: forward look, remove later predictors.

Detection and prediction use different sets of predictors. In predictive models, we are looking forward to establish the risk of future events. In these models, only predictors that occur before the outcome can be used; predictors that occur after the outcome should be removed. In detection, we are looking backward to see whether a diagnosis was missed, so diagnoses both before and after the outcome of interest can be used.
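The difference between the two tasks can be sketched as a filter on diagnosis dates. This is a minimal illustration, not the presenters' code: the function name `select_predictors`, the record layout, and the example dates are hypothetical.

```python
from datetime import date


def select_predictors(diagnoses, outcome_date, mode):
    """Choose which diagnoses may be used as predictors.

    diagnoses: list of (diagnosis name, date recorded) pairs.
    mode: "prediction" keeps only diagnoses recorded before the
    outcome; "detection" keeps diagnoses before and after it.
    """
    if mode == "prediction":
        return [name for name, when in diagnoses if when < outcome_date]
    if mode == "detection":
        return [name for name, _ in diagnoses]
    raise ValueError("mode must be 'prediction' or 'detection'")


# Hypothetical patient history and outcome date.
history = [
    ("obesity", date(2019, 3, 1)),
    ("hypertension", date(2020, 6, 15)),
    ("renal illness", date(2022, 1, 10)),
]
diabetes_date = date(2021, 5, 1)

forward = select_predictors(history, diabetes_date, "prediction")
backward = select_predictors(history, diabetes_date, "detection")
print("prediction uses:", forward)
print("detection uses: ", backward)
```

In this example, renal illness is dropped in the forward-looking prediction because it was recorded after the outcome, but it is kept for detection, where it serves as a clue to a possibly missed diagnosis.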

Association or Causal
When evaluating predictive models, the practice is to divide the data into two sets: training and validation. The parameters of the predictive model are estimated in the training set, and the model is then tested in the validation set. In the training set, all diagnoses are included as predictors of the outcome. This means that diagnoses occurring either before or after the outcome are included in estimating the association between the predictor and the outcome.

Avoid Time Travel
In the testing or validation situation, we no longer have the luxury of including variables that occur after the outcome is known. Here we want to rely only on predictors that occur prior to the outcome, so it is important to exclude any diagnosis that occurs after the outcome. This information is available in the electronic health record but not in real life. In real life, we are making a prediction about the likelihood of the outcome before the outcome has occurred, so we do not have access to any diagnosis or other information that occurs after the outcome.
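The split described here could be sketched as follows: training records keep every diagnosis, while validation records are cut off at the outcome. This is only an illustration of the rule as stated in the presentation. The function name `prepare_records`, the record layout, the use of integer day numbers, and the choice of validation patients are all hypothetical.

```python
def prepare_records(records, validation_ids):
    """Split records into training and validation predictor lists.

    records maps a patient ID to a dict with "outcome_day" (int) and
    "diagnoses" (list of (name, day) pairs). Training patients keep
    all diagnoses; validation patients keep only diagnoses recorded
    before the outcome day, so the test involves no time travel.
    """
    training, validation = {}, {}
    for pid, rec in records.items():
        if pid in validation_ids:
            cutoff = rec["outcome_day"]
            validation[pid] = [name for name, day in rec["diagnoses"] if day < cutoff]
        else:
            training[pid] = [name for name, _ in rec["diagnoses"]]
    return training, validation


# Hypothetical records; days are counted from an arbitrary start.
records = {
    "p1": {"outcome_day": 100, "diagnoses": [("hypertension", 20), ("renal illness", 150)]},
    "p2": {"outcome_day": 80, "diagnoses": [("obesity", 10), ("neuropathy", 90)]},
}

training, validation = prepare_records(records, validation_ids={"p2"})
print("training:  ", training)
print("validation:", validation)
```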

Avoid Diagnoses on Causal Path
Sometimes the available data are correct, with nothing wrong about them, but should nevertheless be ignored in the context of the planned analysis.

If we are studying the impact of treatment on survival, we must drop complications of treatment from the multivariate analysis. Including these variables will distort the estimated impact of treatment on survival. In electronic health records, complications are diagnoses that occur after treatment. The same diagnosis is considered medical history if it occurs before treatment, a comorbidity if it occurs at the time of treatment, and a complication if it occurs after treatment. The statistical advice therefore requires us to drop some diagnoses and retain others. Consider an example in which a patient had an infection and was overweight, so a large dose of antibiotic was given; the antibiotic disturbed the microbes in the patient's gut, and the patient developed diabetes. If we keep diabetes in our multivariate model, the estimated effect of the antibiotic on survival will be distorted. In these situations, we want to keep the comorbidities, i.e., overweight and infection, but not the treatment complication.
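The timing rule above can be sketched as a simple labeling step. The code below is a hedged illustration only: the function names, the use of integer day numbers, and the example timeline are hypothetical, and real records would need a more careful definition of "at time of treatment" than an exact match on the day.

```python
def label_by_timing(diagnoses, treatment_day):
    """Label each diagnosis by when it occurs relative to treatment."""
    labels = {}
    for name, day in diagnoses:
        if day < treatment_day:
            labels[name] = "medical history"
        elif day == treatment_day:
            labels[name] = "comorbidity"
        else:
            labels[name] = "complication"
    return labels


def covariates_for_treatment_effect(diagnoses, treatment_day):
    """Keep history and comorbidities; drop complications of treatment."""
    labels = label_by_timing(diagnoses, treatment_day)
    return [name for name, _ in diagnoses if labels[name] != "complication"]


# Hypothetical timeline for the antibiotic example in the text.
timeline = [("overweight", 0), ("infection", 10), ("diabetes", 200)]
antibiotic_day = 10

labels = label_by_timing(timeline, antibiotic_day)
kept = covariates_for_treatment_effect(timeline, antibiotic_day)
print(labels)
print("covariates kept:", kept)
```

Here diabetes is labeled a complication because it follows the antibiotic, so it is left out of the model of the antibiotic's effect on survival, while overweight and infection are retained.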

Diagnoses and medical history, with some exceptions, are some of the best predictors to include in the analysis.
