1 Automatic Feature Selection Feb 2015

2 Update on Hadoop / R
• Try HortonWorks Sandbox
• Get a VM player
• Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install
• Do the tutorials: http://hortonworks.com/tutorials/
• Add R / RStudio Server to your VM
• Use RHadoop to interface Hadoop and R

3 Issue: There are many predictive analytical models that will work. Which among the many is best?

4 Example Data – HVAC building log data
date: 6/1/13, 6/25/13
time: 0:00:01, 0:13:19
target.temp: 69, 66, 69, 70
actual.temp: 55, 58, 60, 71
system: 1413519
system.age: 620814
building.id: 174718
temp.diff: 14, 8, 9
temp.range: COLD, NORMAL
extreme.temp: 1110
country: Egypt, Finland, South Africa, Indonesia
hvac.product: FN39TG, GG1919, FN39TG, JDNS77
building.age: 11171325
building.manager: M17, M4, M7, M18
service.center.distance: 15011510068
days.since.service: 14210916486
he.efficiency: 1222236
fan.hours: 1716158
coolant.type: B12
software.release: P10
ave.outside.temp: 91467780
software.P12: 0000
coolant.B12: 1111
neg.diff: 111
abs.diff: 14, 8, 9, 1
diff.size: 3221
cut.off: 1110

5 What to look for among models
• R-squared (linear models)
• Variable significance
• # of variables that are significant
• Sign of variables
• Confusion matrix “score” (non-linear models)
• AIC number (non-linear models)
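The slides do not say how the confusion matrix "score" is computed. One common choice, assumed here, is plain accuracy: the share of test cases that fall on the diagonal of the matrix. The sketch is in Python rather than the R used in the talk, and the function name, matrix layout, and counts are illustrative only.

```python
def confusion_matrix_score(matrix):
    """Accuracy from a square confusion matrix given as a list of rows.

    Assumption: rows are actual classes and columns are predicted classes,
    so correct predictions sit on the diagonal.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical counts: 40 true negatives, 5 false positives,
# 6 false negatives, 49 true positives.
example = [[40, 5],
           [6, 49]]
score = confusion_matrix_score(example)  # 89 correct out of 100
```

A score on a 0 to 1 scale would be in line with the MatrixMean values reported later in the talk, but that match is an inference, not something the slides state.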

6 What to look for among models
[Screenshot of model output, annotated: Variables and Significance; AIC Score; Confusion Matrix; Confusion Matrix Score]

7 Hand Done Model Outcome

8 Approach
• Calculate all combinations of the independent variables
• Write a function to:
  • Run each candidate model
  • Repeat over X (~10) random training / test splits of the data
  • Collect:
    • # of variables with significance < .1
    • the “score” of the confusion matrix
• Multiply the # of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame

9 Step 1 – set up empty data frame to hold results

10 Step 2 – calculate all combinations of variables
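The code for this step is not in the transcript. As a rough sketch of what enumerating the candidate models could look like, the Python below (the talk itself uses R) builds one formula string per combination of predictors. The function name and the choice to fix the number of predictors per model are assumptions; the variable names are borrowed from the example data.

```python
from itertools import combinations


def model_formulas(outcome, predictors, size):
    """Return one formula string per combination of `size` predictors."""
    return [
        f"{outcome} ~ " + " + ".join(combo)
        for combo in combinations(predictors, size)
    ]


predictors = ["system", "system.age", "building.id", "country"]
formulas = model_formulas("cut.off", predictors, 2)
# 4 choose 2 = 6 formulas, the first being "cut.off ~ system + system.age"
```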

11 Step 3 – run function to estimate all models and save parameters
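The function from this slide is not reproduced in the transcript. The following Python harness is one possible shape for it, not the author's R code: it redraws a random train / test split several times and records, for every formula, the number of significant variables, the confusion matrix score, and their product. `fit_and_score` is a hypothetical callable the caller must supply (the actual model fitting is out of scope here), and the 70/30 split and the record keys are assumptions.

```python
import random


def evaluate_models(formulas, rows, fit_and_score, n_samples=10,
                    train_fraction=0.7, seed=0):
    """Score every formula on repeated random train/test splits.

    `fit_and_score(formula, train, test)` is a hypothetical callable that
    fits the model on `train`, evaluates it on `test`, and returns the pair
    (number of significant variables, confusion matrix score).
    """
    rng = random.Random(seed)
    results = []
    for sample in range(n_samples):
        # Draw a fresh random split for this sampling round.
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        for formula in formulas:
            n_sig, matrix_score = fit_and_score(formula, train, test)
            results.append({
                "model": formula,
                "sample": sample,
                "sig": n_sig,
                "matrix": matrix_score,
                "weighted": n_sig * matrix_score,
            })
    return results
```

Every formula is scored on the same split within a round, so differences between models in one round are not caused by different data.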

12 Step 4 – average all models and sort
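A minimal sketch of the averaging and sorting step, again in Python rather than the R used in the talk. It assumes each per-sample record is a dict with `model`, `matrix` and `sig` keys (an illustrative layout, not the author's data frame); the output column names follow the results table on slide 13.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Average per-sample scores for each model and sort by weighted score."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["model"]].append(record)

    summary = []
    for model, runs in grouped.items():
        summary.append({
            "Model": model,
            "MatrixMean": mean(r["matrix"] for r in runs),
            "SigMean": mean(r["sig"] for r in runs),
            # Mean of the per-sample products, not the product of the means.
            "Weighted": mean(r["matrix"] * r["sig"] for r in runs),
        })
    summary.sort(key=lambda row: row["Weighted"], reverse=True)
    return summary
```

Averaging the per-sample products follows the order given on slide 8 (multiply, then average). That may be why the weighted values in the talk's results table are close to, but not exactly, MatrixMean times SigMean.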

13 Average of Top Models Are …
Model | MatrixMean | SigMean | Weighted
cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 | 0.79 | 5.60 | 4.45
cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type | 0.88 | 5.00 | 4.39
cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp | 0.85 | 4.90 | 4.17
cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.77 | 4.30 | 3.30
cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 | 0.91 | 3.60 | 3.28
cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 | 0.86 | 3.80 | 3.25
cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 | 0.84 | 3.80 | 3.18
cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type | 0.88 | 3.60 | 3.17
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 | 0.87 | 3.60 | 3.14
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release | 0.85 | 3.70 | 3.14
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 | 0.89 | 3.50 | 3.11
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.89 | 3.50 | 3.10
cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type | 0.88 | 3.50 | 3.09
cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 | 0.85 | 3.60 | 3.06
cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 | 0.81 | 3.70 | 3.00
cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 | 0.91 | 3.30 | 3.00

14 Each of these should be tested again
• More extensive use of varied train / test data sample sets
• Stability of each model beyond the scoring
• Chosen model “makes sense”

15 Alternative ways to do this …
• Caret package function “rfe” (recursive feature elimination)
• Try all variables first
• Train and test the model with cross-validation
• Calculate the most important variables
• Eliminate the least important variables
• Train and test the model again
• Calculate the most important variables
• Eliminate the least important variables
• Repeat …
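The loop described on this slide can be sketched in a few lines. The Python below is a simplified illustration of the idea only; it is not caret's `rfe` implementation, which among other things compares subset sizes by resampled performance. `importance` is a hypothetical callable standing in for "train, test with cross-validation, and rank the variables", and all names are illustrative.

```python
def recursive_elimination(features, importance, keep, drop_per_round=1):
    """Repeatedly drop the least important features until `keep` remain.

    `importance(features)` is a hypothetical callable that trains and tests
    a model on the given features and returns a dict mapping each feature
    name to an importance score (higher means more important).
    """
    if drop_per_round < 1:
        raise ValueError("drop_per_round must be at least 1")
    remaining = list(features)
    history = [list(remaining)]  # feature set at the start of each round
    while len(remaining) > keep:
        scores = importance(remaining)
        ranked = sorted(remaining, key=lambda name: scores[name], reverse=True)
        n_drop = min(drop_per_round, len(remaining) - keep)
        remaining = ranked[:len(ranked) - n_drop]
        history.append(list(remaining))
    return remaining, history
```

Unlike the exhaustive search earlier in the talk, this refits the model once per round, so the number of fits grows roughly linearly with the number of variables.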

16 Setting it up & running RFE
[Code screenshot, annotated: data frame of predictor variables; vector of outcome variable; max number of variables to keep; control functions; run recursive elimination model]

17 Outcome of the RFE

18 Problems
• Number of variable combinations can get HUGE
• Might need multicore or parallel processing to get through it
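To put a number on "HUGE", the Python sketch below counts candidate models. The figures 15 and 7 come from the top-models table (fifteen distinct predictors appear there, seven per model); whether the talk's search was restricted to exactly that setup is an assumption, and the function name is illustrative.

```python
from math import comb


def count_models(n_predictors, size=None):
    """Number of candidate models: all non-empty subsets, or one fixed size."""
    if size is None:
        return 2 ** n_predictors - 1
    return comb(n_predictors, size)


seven_of_fifteen = count_models(15, 7)   # 6435
all_subsets_of_15 = count_models(15)     # 32767
all_subsets_of_30 = count_models(30)     # 1073741823
```

With about 10 train / test samples per model, each count is multiplied by ten model fits. The fits are independent of one another, which is what makes spreading them across cores (for example with Python's `concurrent.futures.ProcessPoolExecutor`) a natural option.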

19 Thank You Brooke Aker baker@bigdatalens.com

