Automatic Feature Selection, Feb 2015
Update on Hadoop / R: Try the HortonWorks Sandbox. Get a VM player. Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install Do the tutorials here: http://hortonworks.com/tutorials/ Add R / RStudio Server to your VM. Use RHadoop to interface Hadoop and R.
Issue: There are many predictive analytical models that will work. Which among the many is best?
Example Data – HVAC building log data. date: 6/1/13 6/25/13; time: 0:00:01 0:13:19; target.temp: 69666970; actual.temp: 55586071; system: 1413519; system.age: 620814; building.id: 174718; temp.diff: 1489; temp.range: COLD NORMAL; extreme.temp: 1110; country: Egypt, Finland, South Africa, Indonesia; hvac.product: FN39TGGG1919FN39TGJDNS77; building.age: 11171325; building.manager: M17, M4, M7, M18; service.center.distance: 15011510068; days.since.service: 14210916486; he.efficiency: 1222236; fan.hours: 1716158; coolant.type: B12; software.release: P10; ave.outside.temp: 91467780; software.P12: 0000; coolant.B12: 1111; neg.diff: 111; abs.diff: 14891; diff.size: 3221; cut.off: 1110
What to look for among models: R-squared (linear models); variable significance; number of variables that are significant; sign of variables; confusion matrix “score” (non-linear models); AIC number (non-linear models)
What to look for among models: variables and significance; AIC score; confusion matrix; confusion matrix score
Approach: Calculate all combinations of the independent variables. Write a function to: run each model possibility for a sample of X (~10) training / test data sets; collect the number of variables with significance < .1 and “score” the confusion matrix; multiply the number of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame.
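A minimal sketch of the scoring arithmetic in this approach, written in Python rather than the R used in the talk. The slides do not say how the confusion matrix is “scored”; this sketch assumes plain accuracy (correct predictions over all predictions). The function names and the sample p-values and matrices are made up for illustration.

```python
from statistics import mean

def confusion_score(matrix):
    """Assumed score: accuracy = diagonal total / grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

def model_score(runs, alpha=0.1):
    """Average over samples of (# significant variables) * (confusion score).

    runs: list of (p_values, confusion_matrix) pairs, one per train/test sample.
    """
    per_run = []
    for p_values, matrix in runs:
        n_significant = sum(1 for p in p_values if p < alpha)
        per_run.append(n_significant * confusion_score(matrix))
    return mean(per_run)

# Hypothetical results for two candidate models over two samples each.
results = {
    "model_a": [([0.01, 0.04, 0.30], [[40, 10], [5, 45]]),
                ([0.02, 0.08, 0.20], [[42, 8], [10, 40]])],
    "model_b": [([0.01, 0.50], [[30, 20], [20, 30]]),
                ([0.03, 0.40], [[35, 15], [15, 35]])],
}

# Highest combined score first, like sorting the results data frame.
ranking = sorted(results, key=lambda name: model_score(results[name]), reverse=True)
print(ranking)
```

Multiplying by the count of significant variables favours larger models, so the ranking is a screening aid rather than a final verdict.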
Step 1 – set up empty data frame to hold results
Step 2 – calculate all combinations of variables
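The code for this step is not in the transcript. As an illustration only, here is one way to enumerate every non-empty combination of predictors in Python (the talk itself used R). The three column names are borrowed from the HVAC example; picking these three is an arbitrary assumption.

```python
from itertools import combinations

def all_variable_subsets(variables):
    """Return every non-empty subset of the variables, smallest first."""
    subsets = []
    for size in range(1, len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

# Column names from the HVAC example; the choice of three is arbitrary.
predictors = ["system.age", "building.age", "fan.hours"]
subsets = all_variable_subsets(predictors)

print(len(subsets))
for subset in subsets:
    print(" + ".join(subset))
```

With n variables there are 2^n - 1 candidate models, so the count doubles with every added variable and an exhaustive search is only practical for small n.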
Step 3 – run function to estimate all models and save parameters
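The function for this step is also missing from the transcript. The sketch below shows the general shape of such a loop in Python, under stated assumptions: `fit_and_score` is a hypothetical caller-supplied function that fits one model on the training rows and returns a single number, and the 70/30 split and the `dummy_scorer` stand-in are invented so the example runs without a modelling library.

```python
import random
from itertools import combinations
from statistics import mean

def evaluate_all_models(rows, variables, fit_and_score, n_samples=10,
                        train_fraction=0.7, seed=0):
    """Score every variable subset over repeated random train/test splits.

    fit_and_score(train, test, subset) is a caller-supplied (hypothetical)
    function that fits a model and returns one numeric score.
    """
    rng = random.Random(seed)
    results = []
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            scores = []
            for _ in range(n_samples):
                shuffled = rows[:]
                rng.shuffle(shuffled)
                cut = int(len(shuffled) * train_fraction)
                train, test = shuffled[:cut], shuffled[cut:]
                scores.append(fit_and_score(train, test, subset))
            results.append({"variables": subset, "score": mean(scores)})
    # Best average score first, like sorting the results data frame.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results

# Stand-in scorer so the sketch runs: it ignores the data and rewards
# subsets containing "fan.hours". A real scorer would fit a model.
def dummy_scorer(train, test, subset):
    return len(subset) + (1.0 if "fan.hours" in subset else 0.0)

rows = [{"id": i} for i in range(20)]
table = evaluate_all_models(rows, ["system.age", "fan.hours"], dummy_scorer)
for row in table:
    print(row["variables"], row["score"])
```

Passing the scorer in as an argument keeps the resampling loop independent of the model type, so the same loop could serve linear and non-linear models.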
Each of these should be tested again: more extensive use of varied train / test data sample sets; stability of each model beyond the scoring; chosen model “makes sense”
Alternative ways to do this … Caret package function “rfe” (recursive feature elimination): Try all variables first. Train and test the model with cross-validation. Calculate the most important variables. Eliminate the least important variables. Train and test the model again. Calculate the most important variables. Eliminate the least important variables. Repeat …
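The elimination loop described here can be sketched in a few lines. This is a simplified Python illustration of the idea, not caret's implementation: `importance` is a hypothetical caller-supplied function, one variable is dropped per step, and the fixed weights stand in for importances that a real run would recompute from a cross-validated model at each step.

```python
def recursive_elimination(variables, importance, keep=1):
    """Drop the least important variable until `keep` remain.

    importance(subset) is a caller-supplied (hypothetical) function that
    refits the model on `subset` and returns {variable: importance}.
    Returns the list of surviving variables at each step.
    """
    current = list(variables)
    history = [current[:]]
    while len(current) > keep:
        ranks = importance(current)
        weakest = min(current, key=lambda v: ranks[v])
        current.remove(weakest)
        history.append(current[:])
    return history

# Stand-in importance with fixed made-up weights; a real one would come
# from a model refit with cross-validation at every step.
weights = {"system.age": 0.5, "building.age": 0.2, "fan.hours": 0.9}

def fixed_importance(subset):
    return {v: weights[v] for v in subset}

steps = recursive_elimination(list(weights), fixed_importance, keep=1)
for step in steps:
    print(step)
```

Unlike the exhaustive search, this fits at most one model per elimination step, so the cost grows linearly with the number of variables instead of exponentially.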
Setting it up & running RFE: data frame of predictor variables; vector of outcome variable; max number of variables to keep; control functions; run recursive elimination model