Automatic Feature Selection Feb 2015


1 Automatic Feature Selection Feb 2015

2 Update on Hadoop / R
- Try HortonWorks Sandbox
- Get a VM player
- Download and install OVA (VM file from HortonWorks)
- sandbox/#install
- Do tutorials – here
- Add R / RStudio Server to your VM
- Use RHadoop to interface Hadoop and R

3 Issue
Many predictive analytical models will work. Which among them is best?

4 Example Data – HVAC building log data
Variables, with the sample values that survive in the transcript (runs of digits are left as extracted):
- date: 6/1/13, 6/25/13
- time: 0:00:01, 0:13:19
- target.temp
- actual.temp
- system
- system.age
- building.id
- temp.diff: 1489
- temp.range: COLD, NORMAL
- extreme.temp: 1110
- country: Egypt, Finland, South Africa, Indonesia
- hvac.product: FN39TG, GG1919, FN39TG, JDNS77
- building.age
- building.manager: M17, M4, M7, M18
- service.center.distance
- days.since.service
- he.efficiency
- fan.hours
- coolant.type: B12
- software.release: P10
- ave.outside.temp
- software.P
- coolant.B
- neg.diff: 111
- abs.diff: 14891
- diff.size: 3221
- cut.off: 1110

5 What to look for among models
- R-squared (linear models)
- Variable significance
- # of variables that are significant
- Sign of variables
- Confusion matrix “score” (non-linear models)
- AIC number (non-linear models)
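A minimal sketch of how each item on this list could be read off a fitted model in base R. The data are synthetic stand-ins (only the column names echo the slides), and using accuracy as the confusion matrix "score" is an assumption, since the slides do not define the score.

```r
# Synthetic stand-in data; column names echo the slides, values are made up.
set.seed(42)
n <- 200
hvac <- data.frame(
  system.age   = runif(n, 1, 30),
  building.age = runif(n, 1, 30),
  fan.hours    = runif(n, 0, 100)
)
# Assumed relationships, for illustration only.
hvac$temp.diff <- 0.5 * hvac$system.age + rnorm(n)
hvac$cut.off   <- rbinom(n, 1, plogis(-2 + 0.15 * hvac$system.age))

# R-squared (linear model)
lin.fit   <- lm(temp.diff ~ system.age + building.age, data = hvac)
r.squared <- summary(lin.fit)$r.squared

# Logistic model for the binary outcome
fit <- glm(cut.off ~ system.age + building.age + fan.hours,
           data = hvac, family = binomial)

# Variable significance, count of significant variables, and signs
coefs    <- summary(fit)$coefficients
p.values <- coefs[-1, "Pr(>|z|)"]        # drop the intercept row
n.signif <- sum(p.values < 0.1)
signs    <- sign(coefs[-1, "Estimate"])

# AIC (lower is better among models fit to the same data)
model.aic <- AIC(fit)

# Confusion matrix and a simple score (accuracy, an assumption)
predicted <- factor(as.integer(fitted(fit) > 0.5), levels = c(0, 1))
actual    <- factor(hvac$cut.off, levels = c(0, 1))
conf.mat  <- table(actual, predicted)
cm.score  <- sum(diag(conf.mat)) / sum(conf.mat)
```

Any other summary of the confusion matrix (sensitivity, balanced accuracy) could replace `cm.score` without changing the rest.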

6 What to look for among models
[Annotated model output. Callouts: Variables and Significance; AIC Score; Confusion Matrix; Confusion Matrix Score]

7 Hand Done Model Outcome

8 Approach
- Calculate the combinations of all independent variables
- Write a function to:
  - Run each model possibility
  - For X (~10) samples of training / test data sets
  - Collect:
    - # of variables that have significance < .1
    - “Score” of the confusion matrix
- Multiply the # of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame

9 Step 1 – set up empty data frame to hold results
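The code for this step is not in the transcript. One way it might look, as a sketch: the column names `Model`, `Sample`, `Matrix` and `Sig` are assumptions chosen to line up with the averaged columns on slide 13.

```r
# Empty, typed data frame: one row will be added per model per sample.
results <- data.frame(
  Model  = character(),   # formula text
  Sample = integer(),     # index of the train/test sample
  Matrix = numeric(),     # confusion matrix score
  Sig    = integer(),     # number of variables with p < .1
  stringsAsFactors = FALSE
)

# Appending a row later (placeholder values, for illustration only)
one.run <- data.frame(Model = "cut.off ~ system.age", Sample = 1L,
                      Matrix = 0.8, Sig = 1L, stringsAsFactors = FALSE)
filled <- rbind(results, one.run)
```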

10 Step 2 – calculate all combinations of variables
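A sketch of this step with base R's `combn`. The 15 predictor names are the ones that appear in the formulas on slide 13. Fixing the subset size at 7 is an assumption, made because every model listed on that slide has seven predictors.

```r
predictors <- c(
  "system", "system.age", "building.id", "country", "hvac.product",
  "building.age", "building.manager", "service.center.distance",
  "days.since.service", "he.efficiency", "fan.hours", "coolant.type",
  "software.release", "ave.outside.temp", "software.P"
)

# Every 7-variable subset of the predictors (subset size is an assumption)
combos <- combn(predictors, 7, simplify = FALSE)

# Turn each subset into formula text with cut.off as the outcome
formulas <- vapply(
  combos,
  function(vars) paste("cut.off ~", paste(vars, collapse = " + ")),
  character(1)
)

length(formulas)   # equals choose(15, 7)
```

Each element of `formulas` can be passed through `as.formula()` when the model is fit.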

11 Step 3 – run function to estimate all models and save parameters
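The function itself is not in the transcript. A sketch of what it could look like follows, under stated assumptions: a logistic `glm` as the model, a 70/30 train/test split, accuracy as the confusion matrix score, and synthetic data in place of the HVAC log. `run.model` is a hypothetical name.

```r
# Synthetic stand-in data; column names echo the slides, values are made up.
set.seed(1)
n <- 300
hvac <- data.frame(
  system.age         = runif(n, 1, 30),
  building.age       = runif(n, 1, 30),
  fan.hours          = runif(n, 0, 100),
  days.since.service = runif(n, 0, 365)
)
hvac$cut.off <- rbinom(n, 1, plogis(-2 + 0.15 * hvac$system.age))

# Fit one candidate model on several random train/test splits.
run.model <- function(formula.text, data, n.samples = 10, train.frac = 0.7) {
  rows <- lapply(seq_len(n.samples), function(i) {
    train.idx <- sample(nrow(data), size = floor(train.frac * nrow(data)))
    train <- data[train.idx, ]
    test  <- data[-train.idx, ]

    fit <- glm(as.formula(formula.text), data = train, family = binomial)
    p.values <- summary(fit)$coefficients[-1, "Pr(>|z|)"]

    prob      <- predict(fit, newdata = test, type = "response")
    predicted <- factor(as.integer(prob > 0.5), levels = c(0, 1))
    actual    <- factor(test$cut.off, levels = c(0, 1))
    conf.mat  <- table(actual, predicted)

    data.frame(Model  = formula.text,
               Sample = i,
               Matrix = sum(diag(conf.mat)) / sum(conf.mat),
               Sig    = sum(p.values < 0.1),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Run every 2-variable model (small subsets keep the example quick)
predictors <- setdiff(names(hvac), "cut.off")
combos   <- combn(predictors, 2, simplify = FALSE)
formulas <- vapply(combos,
                   function(v) paste("cut.off ~", paste(v, collapse = " + ")),
                   character(1))
results <- do.call(rbind, lapply(formulas, run.model, data = hvac))
```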

12 Step 4 – average all models and sort
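A sketch of the averaging and sorting, following the approach on slide 8: multiply the significant-variable count by the confusion matrix score for each run, average per model, then sort. The per-run numbers below are placeholders, not results from the presentation.

```r
# Placeholder per-run results (made-up numbers, for illustration only)
results <- data.frame(
  Model  = rep(c("cut.off ~ a + b", "cut.off ~ a + c", "cut.off ~ b + c"),
               each = 2),
  Sample = rep(1:2, times = 3),
  Matrix = c(0.80, 0.70, 0.60, 0.60, 0.90, 0.70),
  Sig    = c(2, 2, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Per-run weighted score: # significant variables x confusion matrix score
results$Weighted <- results$Sig * results$Matrix

# Average over the samples for each model
ranked <- aggregate(cbind(Matrix, Sig, Weighted) ~ Model,
                    data = results, FUN = mean)
names(ranked) <- c("Model", "MatrixMean", "SigMean", "Weighted")

# Best weighted score first
ranked <- ranked[order(-ranked$Weighted), ]
```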

13 Averages of the Top Models
Columns: Model, MatrixMean, SigMean, Weighted (only the Model column survives in the transcript)
cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P
cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type
cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp
cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp
cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P
cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P
cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P
cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp
cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type
cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P
cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P
cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P

14 Each of these should be tested again
- More extensive use of varied train / test data sample sets
- Stability of each model beyond the scoring
- Chosen model “makes sense”

15 Alternative ways to do this …
- caret package function “rfe” (recursive feature elimination)
  - Try all variables first
  - Train and test the model with cross-validation
  - Calculate the most important variables
  - Eliminate the least important variables
  - Train and test the model again
  - Calculate the most important variables
  - Eliminate the least important variables
  - Repeat …

16 Setting it up & running RFE
[Annotated code. Callouts:]
- Data frame of predictor variables
- Vector of outcome variable
- Max number of variables to keep
- Control functions
- Run recursive elimination model
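The slide's code is not in the transcript. A sketch of an `rfe` call that matches the callouts follows. Assumptions: synthetic data, random forest helper functions (`rfFuncs`, which needs the randomForest package installed), and 5-fold cross-validation. The presenter's actual settings are not known.

```r
library(caret)   # rfFuncs additionally requires the randomForest package

# Synthetic stand-in data; column names echo the slides, values are made up.
set.seed(7)
n <- 200

# Data frame of predictor variables
x <- data.frame(
  system.age         = runif(n, 1, 30),
  building.age       = runif(n, 1, 30),
  fan.hours          = runif(n, 0, 100),
  days.since.service = runif(n, 0, 365),
  ave.outside.temp   = runif(n, -10, 40)
)

# Vector of outcome variable (a factor, so caret treats it as classification)
cut.off <- rbinom(n, 1, plogis(-2 + 0.15 * x$system.age))
y <- factor(ifelse(cut.off == 1, "yes", "no"))

# Subset sizes to evaluate; the largest caps the number of variables kept
# (the full set of predictors is always evaluated as well)
subset.sizes <- 1:4

# Control functions: model helpers plus the resampling scheme
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

# Run recursive elimination model
rfe.fit <- rfe(x = x, y = y, sizes = subset.sizes, rfeControl = ctrl)

print(rfe.fit)        # resampled performance by subset size
predictors(rfe.fit)   # variables in the selected subset
```

Swapping `rfFuncs` for another helper set shipped with caret, such as `lmFuncs` for a numeric outcome, changes the underlying model but not the shape of the call.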

17 Outcome of the RFE

18 Problems
- The number of variable combinations can get HUGE
- Might need multicore or parallel processing to get through it
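A sketch of both points: how quickly the number of candidate models grows, and one way to spread the fitting over workers with base R's `parallel` package. The data are synthetic, `fit.aic` is a hypothetical helper, and the worker count of 2 is an arbitrary assumption.

```r
library(parallel)

# Number of non-empty variable subsets for p candidate variables: 2^p - 1
n.vars <- c(10, 15, 20, 30)
growth <- data.frame(n.vars, all.subsets = 2^n.vars - 1)

# Synthetic stand-in data; column names echo the slides, values are made up.
set.seed(3)
n <- 300
hvac <- data.frame(
  system.age         = runif(n, 1, 30),
  building.age       = runif(n, 1, 30),
  fan.hours          = runif(n, 0, 100),
  days.since.service = runif(n, 0, 365)
)
hvac$cut.off <- rbinom(n, 1, plogis(-2 + 0.15 * hvac$system.age))

predictors <- setdiff(names(hvac), "cut.off")
combos   <- combn(predictors, 2, simplify = FALSE)
formulas <- vapply(combos,
                   function(v) paste("cut.off ~", paste(v, collapse = " + ")),
                   character(1))

# Hypothetical helper: fit one model and return its AIC
fit.aic <- function(formula.text, data) {
  AIC(glm(as.formula(formula.text), data = data, family = binomial))
}

cl <- makeCluster(2)                      # 2 local workers (assumption)
aics <- parLapply(cl, formulas, fit.aic, data = hvac)
stopCluster(cl)

names(aics) <- formulas
```

On Linux or macOS, `mclapply(formulas, fit.aic, data = hvac, mc.cores = 2)` is a shorter alternative that forks instead of starting a cluster.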

19 Thank You Brooke Aker

