Published by Matteo Hakey. Modified over 2 years ago.

1
Automatic Feature Selection Feb 2015

2
Update on Hadoop / R
- Try the HortonWorks Sandbox
  - Get a VM player
  - Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install
  - Do the tutorials here: http://hortonworks.com/tutorials/
- Add R / RStudio Server to your VM
- Use RHadoop to interface Hadoop and R

3
Issue
Many predictive analytical models will work. Which among them is best?

4
Example Data – HVAC building log data
date: 6/1/13, 6/25/13
time: 0:00:01, 0:13:19
target.temp: 69, 66, 69, 70
actual.temp: 55, 58, 60, 71
system: 1413519
system.age: 620814
building.id: 17, 4, 7, 18
temp.diff: 1489
temp.range: COLD, NORMAL
extreme.temp: 1, 1, 1, 0
country: Egypt, Finland, South Africa, Indonesia
hvac.product: FN39TG, GG1919, FN39TG, JDNS77
building.age: 11, 17, 13, 25
building.manager: M17, M4, M7, M18
service.center.distance: 15011510068
days.since.service: 14210916486
he.efficiency: 1222236
fan.hours: 1716158
coolant.type: B12
software.release: P10
ave.outside.temp: 91, 46, 77, 80
software.P12: 0, 0, 0, 0
coolant.B12: 1, 1, 1, 1
neg.diff: 111
abs.diff: 14, 8, 9, 1
diff.size: 3, 2, 2, 1
cut.off: 1, 1, 1, 0

5
What to look for among models
- R-squared (linear models)
- Variable significance
- Number of variables that are significant
- Sign of variables
- Confusion matrix "score" (non-linear models)
- AIC number (non-linear models)
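The deck does not say how the confusion matrix is "scored". A minimal sketch, assuming the score is plain accuracy (correct predictions divided by all predictions) and that the outcome is coded 0/1 like cut.off. The function names and the sample labels are made up for illustration, and Python stands in for the deck's R code, which is not reproduced in this transcript.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:  # a == 1 and p == 0
            counts["fn"] += 1
    return counts


def matrix_score(cm):
    """Assumed definition of the 'score': overall accuracy."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# Made-up labels, only to exercise the functions.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
cm = confusion_matrix(actual, predicted)
score = matrix_score(cm)
print(cm, score)
```

Other definitions (for example sensitivity or balanced accuracy) would slot into `matrix_score` without changing the rest of the procedure.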

6
What to look for among models
- Variables and significance
- AIC score
- Confusion matrix
- Confusion matrix score
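For reference, AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; among models fit to the same data, lower is better. A small Python sketch of the comparison follows. The model names, parameter counts, and log-likelihoods are hypothetical values, not results from the deck.

```python
def aic(n_params, log_likelihood):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fits: (number of parameters, maximized log-likelihood).
candidates = {"model_a": (8, -41.2), "model_b": (5, -43.0)}
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}

# The model with the lowest AIC is preferred.
best = min(scores, key=scores.get)
print(scores, best)
```

Here the smaller model wins even though its likelihood is slightly worse, because the penalty for three extra parameters outweighs the gain in fit.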

7
Hand Done Model Outcome

8
Approach
- Calculate the combinations of all independent variables
- Write a function to:
  - Run each model possibility
  - For each of X (~10) samples of training / test data sets
  - Collect:
    - Number of variables that have significance < .1
    - "Score" of the confusion matrix
- Multiply the number of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame
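The repeated training / test sampling can be sketched as below. This is an illustration in Python rather than the deck's R code; the 70/30 split, the fixed seed, and the function name are assumptions, while the ten samples follow the "~10" on the slide.

```python
import random


def train_test_splits(n_rows, n_samples=10, train_fraction=0.7, seed=0):
    """Return n_samples random (train_indices, test_indices) pairs."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    n_train = round(n_rows * train_fraction)
    splits = []
    for _ in range(n_samples):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits


splits = train_test_splits(n_rows=100)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every candidate model is then fit on each training set and scored on the matching test set, so all models see the same ten samples.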

9
Step 1 – set up empty data frame to hold results

10
Step 2 – calculate all combinations of variables
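A hedged Python sketch of this step, using the standard library's `itertools.combinations`. The helper name is made up, and only four of the deck's variable names are used, with subsets of size 2, to keep the output short; the models in the deck's results each use seven predictors.

```python
from itertools import combinations


def candidate_formulas(outcome, predictors, size):
    """Build one formula string per subset of `size` predictors."""
    return [
        f"{outcome} ~ " + " + ".join(combo)
        for combo in combinations(predictors, size)
    ]


predictors = ["system", "system.age", "building.id", "hvac.product"]
formulas = candidate_formulas("cut.off", predictors, size=2)
for formula in formulas:
    print(formula)
```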

11
Step 3 – run function to estimate all models and save parameters
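One way to structure this step, sketched in Python rather than the deck's R: loop over every model and every train / test sample, and record one result row per fit. The actual model fitting is left as a caller-supplied function; `fit_and_score` and the stand-in `fake_fit_and_score` are hypothetical and only make the sketch runnable.

```python
def run_all(models, splits, fit_and_score):
    """Fit every model on every split and collect one row per fit."""
    results = []
    for model in models:
        for sample_id, (train, test) in enumerate(splits):
            n_sig, matrix = fit_and_score(model, train, test)
            results.append({
                "model": model,
                "sample": sample_id,
                "sig": n_sig,          # number of variables with p < .1
                "matrix": matrix,      # confusion matrix score
                "weighted": n_sig * matrix,
            })
    return results


def fake_fit_and_score(model, train, test):
    """Stand-in for a real fit: treats every predictor as significant
    and returns a constant confusion matrix score of 0.8."""
    n_predictors = model.count("+") + 1
    return n_predictors, 0.8


models = ["y ~ a + b", "y ~ a"]
splits = [([0, 1], [2]), ([0, 2], [1])]
results = run_all(models, splits, fake_fit_and_score)
print(len(results), results[0])
```

A real `fit_and_score` would fit the model on the training rows, count coefficients with p-value below .1, and score the confusion matrix on the test rows.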

12
Step 4 – average all models and sort
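A Python sketch of the averaging and sorting, following the order given in the approach: multiply the significant-variable count by the confusion matrix score for each sample, then average per model. The column names mirror the deck's results table; the input rows are made-up numbers.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Average per model, then sort by the weighted score, best first."""
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model"]].append(row)
    summary = [
        {
            "model": model,
            "MatrixMean": mean(r["matrix"] for r in rows),
            "SigMean": mean(r["sig"] for r in rows),
            "Weighted": mean(r["sig"] * r["matrix"] for r in rows),
        }
        for model, rows in by_model.items()
    ]
    return sorted(summary, key=lambda s: s["Weighted"], reverse=True)


# Made-up per-sample results for two models.
rows = [
    {"model": "B", "sig": 3, "matrix": 0.9},
    {"model": "B", "sig": 4, "matrix": 0.9},
    {"model": "A", "sig": 5, "matrix": 0.8},
    {"model": "A", "sig": 6, "matrix": 0.9},
]
ranked = summarize(rows)
for entry in ranked:
    print(entry)
```

Because the product is taken per sample before averaging, the weighted column need not equal MatrixMean times SigMean exactly.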

13
Average of Top Models Are …
Model | MatrixMean | SigMean | Weighted
cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 | 0.79 | 5.60 | 4.45
cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type | 0.88 | 5.00 | 4.39
cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp | 0.85 | 4.90 | 4.17
cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.77 | 4.30 | 3.30
cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 | 0.91 | 3.60 | 3.28
cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 | 0.86 | 3.80 | 3.25
cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 | 0.84 | 3.80 | 3.18
cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type | 0.88 | 3.60 | 3.17
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 | 0.87 | 3.60 | 3.14
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release | 0.85 | 3.70 | 3.14
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 | 0.89 | 3.50 | 3.11
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.89 | 3.50 | 3.10
cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type | 0.88 | 3.50 | 3.09
cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 | 0.85 | 3.60 | 3.06
cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 | 0.81 | 3.70 | 3.00
cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 | 0.91 | 3.30 | 3.00

14
Each of these should be tested again
- More extensive use of varied train / test data sample sets
- Stability of each model beyond the scoring
- Chosen model "makes sense"

15
Alternative ways to do this …
caret package function "rfe" (recursive feature elimination)
- Try all variables first
- Train and test the model with cross-validation
- Calculate the most important variables
- Eliminate the least important variables
- Train and test the model again
- Calculate the most important variables
- Eliminate the least important variables
- Repeat …
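The elimination loop described above can be sketched generically. This is a plain Python illustration of the idea, not caret's implementation: `evaluate` and `importance` are hypothetical callables standing in for cross-validated model performance and variable importance, and the toy weights below are invented so the example runs.

```python
def recursive_elimination(features, evaluate, importance, min_features=1):
    """Repeatedly score the current subset and drop its least
    important feature. Returns (best_subset, best_score, history)."""
    current = list(features)
    history = []
    while True:
        history.append((tuple(current), evaluate(current)))
        if len(current) <= min_features:
            break
        ranks = importance(current)  # dict: feature -> importance
        weakest = min(current, key=lambda f: ranks[f])
        current = [f for f in current if f != weakest]
    best_subset, best_score = max(history, key=lambda h: h[1])
    return best_subset, best_score, history


# Toy setup: two useful features, two useless ones, and a small
# penalty per feature so smaller subsets are preferred when tied.
true_signal = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.0}


def evaluate(subset):
    return sum(true_signal[f] for f in subset) - 0.05 * len(subset)


def importance(subset):
    return {f: true_signal[f] for f in subset}


best_subset, best_score, history = recursive_elimination(
    ["a", "b", "c", "d"], evaluate, importance
)
print(best_subset, round(best_score, 2))
```

Unlike the exhaustive search in the earlier steps, this fits only one model per subset size, which is why it scales to many more variables.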

16
Setting it up & running RFE
- data frame of predictor variables
- vector of outcome variable
- max number of variables to keep
- control functions
- run recursive elimination model

17
Outcome of the RFE

18
Problems
- The number of variable combinations can get HUGE
- Might need multicore or parallel processing to get through it
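To give a sense of scale, the number of candidate models can be counted directly. A short Python sketch follows (`math.comb` needs Python 3.8 or later); the variable counts 10, 20, and 30 are illustrative and not taken from the deck, and the ten samples per model follow the approach slide.

```python
from math import comb


def count_models(n_vars, size=None):
    """Models with exactly `size` predictors, or all non-empty
    subsets of the predictors when size is None."""
    if size is None:
        return 2 ** n_vars - 1
    return comb(n_vars, size)


def count_fits(n_vars, size=None, n_samples=10):
    """Total model fits when each model is run on n_samples splits."""
    return count_models(n_vars, size) * n_samples


for n in (10, 20, 30):
    print(n, count_models(n, 7), count_models(n), count_fits(n, 7))
```

Each fit is independent of the others, so the work divides cleanly across cores.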

19
Thank You Brooke Aker baker@bigdatalens.com
