Advanced Analytics Using Enterprise Miner

A Primer for the Predictive Modeler Certification

Importing Data: PVA97NK
In the Metadata Advisor Options, choose Advanced, then Customize.
Set the Class Levels Count Threshold to 2.
Set the Reject Levels Count Threshold to 100.
Change the role of Target_D to Rejected.

Exploring Data
Right-click the dataset in the left side panel and choose Explore.
See the options for Sampling Method.
Choose Plot, then Histogram, and set the role of DemAge to X.
Right-click the graph, open Graph Properties, and change the number of bins to 87.
Observe the histogram, then include a bin for missing values.
Resize the window to be smaller, then create a pie chart with Target_B (set its role to Category).
See how you can interactively select parts of the population from either chart.

See histograms for all variables
Drag the dataset to the diagram.
Right-click the dataset, highlight all variables, and click Explore.
Look further into the DemMedIncome variable by adding more bins to the histogram. We’ll want to change these 0 income values to missing.
In the Explore window you can also retrieve some basic descriptive statistics, including percent missing.

MODIFY tab
This tab has nodes that modify the columns of a dataset.

MODIFY: Replacement Node
Can be used to replace certain values of a variable (usually extreme values) with a specified replacement value.
The income variable is interval, so focus on that portion of the properties panel.
Change Default Limits Method to ‘None’ (so nothing else gets changed).
Change Replacement Values to ‘Missing’.
Click the ellipsis next to Replacement Editor and change the method to ‘User-Specified’ for DemMedIncome.
For Replacement Lower Limit, enter the number 1.
A new variable is created: all values of DemMedIncome that fall below 1 are set to missing, and all other values do not change.
Run the node, then in the properties panel click the Exported Data ellipsis and explore the histogram for the generated variable (include a missing bin in the histogram).
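The replacement rule above can be sketched in a plain DATA step. This is only an illustration of the logic, not the code the node generates: work.pva97nk is a hypothetical copy of the data, and REP_DemMedIncome is an assumed name for the generated variable.

```sas
/* Sketch only: work.pva97nk is a hypothetical copy of the PVA97NK data,
   and REP_DemMedIncome is an assumed name for the generated variable. */
data work.pva97nk_rep;
    set work.pva97nk;
    /* Start from the original value ... */
    REP_DemMedIncome = DemMedIncome;
    /* ... then set anything below the lower limit of 1 to missing. */
    if DemMedIncome < 1 then REP_DemMedIncome = .;
run;

/* Compare counts of missing values before and after the replacement. */
proc means data=work.pva97nk_rep n nmiss min max;
    var DemMedIncome REP_DemMedIncome;
run;
```

The original column is left untouched, which mirrors the node creating a new variable rather than overwriting DemMedIncome.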

Regression Modelling
Let’s build a regression model to predict the binary target.
We’ll split our data into training and validation sets first.
Then we’ll need to take care of missing values using the Impute node.

SAMPLE tab
This tab has nodes that modify the rows of a dataset.

SAMPLE: Data Partition Node
Splits data into specified proportions of Training/Validation/Test data.
Connect the Data Partition node to the Replacement node and specify 65% Training and 35% Validation.
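Outside Enterprise Miner, a comparable 65/35 split could be sketched with PROC SURVEYSELECT. The dataset names and the seed are hypothetical, and simple random sampling is an assumption here; the node's own partitioning method may differ.

```sas
/* Sketch only: dataset names and the seed are hypothetical. */
proc surveyselect data=work.pva97nk out=work.pva97nk_part
                  method=srs samprate=0.65 seed=12345 outall;
run;

/* OUTALL keeps every row and adds Selected = 1 (sampled) or 0 (not sampled). */
data work.train work.validate;
    set work.pva97nk_part;
    if Selected = 1 then output work.train;   /* about 65% of rows */
    else output work.validate;                /* remaining 35%     */
run;
```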

MODIFY: Impute Node
After the Data Partition node, connect the Impute node.
Take a look at the Properties Panel and explore the defaults. The panel is split into Class Variables and Interval Variables, and Input and Target variables are specified separately.
Use median imputation for numeric variables and tree imputation for class variables.
Under the Score section, create binary indicator variables that show you’ve imputed a variable. Unique indicators are created for each variable; a single indicator is created for each observation (1 if anything was imputed).
Set their Role to Input to include them in the modeling process.
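As a rough illustration of median imputation with a unique indicator, the sketch below flags missing DemAge values and then fills them with the median using PROC STDIZE. The dataset names and the M_DemAge indicator name are assumptions, and only one variable is shown; the node handles all inputs and names its outputs itself.

```sas
/* Sketch only: work.train and the M_ prefix are assumed names. */
data work.train_flag;
    set work.train;
    /* Indicator is 1 where DemAge was missing before imputation, else 0. */
    M_DemAge = missing(DemAge);
run;

/* REPONLY replaces missing values with the median without standardizing. */
proc stdize data=work.train_flag out=work.train_imp
            method=median reponly;
    var DemAge;
run;
```

In practice the median computed on the training data would also be applied to the validation data, so that no information from validation leaks into the imputation.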

MODEL: Regression Node
Set Selection Model to Stepwise.
Change the Selection Criterion for the model to Validation Misclassification (this is how EM will optimize the complexity of the model).
Notice the panel options for including all interactions and all quadratic terms.
You can also specify certain interactions by setting User Terms to ‘Yes’ and using the Term Editor: choose the first variable to enter into the interaction and click the right arrow, choose the second variable and click the right arrow, then click Save to save that interaction term.
The same steps can be used to create multivariable interactions.
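For comparison, a stepwise logistic regression with one user-specified interaction might look like the sketch below in PROC LOGISTIC. The dataset name, the REP_DemMedIncome variable name, and the entry/stay significance levels are assumptions. Note that PROC LOGISTIC selects steps by significance level, whereas the node setting described above picks the step with the best validation misclassification rate, so the two will not necessarily agree.

```sas
/* Sketch only: work.train_imp and REP_DemMedIncome are assumed names,
   and the inputs are assumed to have been imputed already. */
proc logistic data=work.train_imp;
    /* Model the probability that Target_B equals 1. */
    model Target_B(event='1') =
          DemAge REP_DemMedIncome
          DemAge*REP_DemMedIncome      /* one two-way interaction term */
          / selection=stepwise slentry=0.05 slstay=0.05;
run;
```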

MODIFY: Transform Variables
Let’s see if we can get a better model by transforming our numeric inputs to be more normal.
Drag a Transform Variables node and connect it to the Impute node.
For interval inputs, choose ‘Maximum Normal’.
Look at the results of this node to see what type of transformation was applied to each interval variable to make its distribution more normal.
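To get a feel for what "more normal" means, the sketch below builds two candidate transformations of DemAge by hand and compares skewness and kurtosis. The dataset and variable names are hypothetical, and the candidate list is only illustrative; the node's own candidates and its normality measure may differ.

```sas
/* Sketch only: work.train_imp, LOG_DemAge and SQRT_DemAge are assumed names. */
data work.train_tx;
    set work.train_imp;
    LOG_DemAge  = log(DemAge + 1);   /* +1 keeps the log defined at zero */
    SQRT_DemAge = sqrt(DemAge);
run;

/* Values closer to 0 suggest a more symmetric, more normal-looking shape. */
proc means data=work.train_tx n skewness kurtosis;
    var DemAge LOG_DemAge SQRT_DemAge;
run;
```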

ASSESS: Model Comparison Node
Which regression model worked best?
Change the selection criteria to Validation Average Squared Error (change both the Selection Statistic and the Selection Table). Do you choose the same model?
Which model has the best lift at a depth of 10%?
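The two selection statistics mentioned here can be computed by hand from scored validation data, which helps show why they can rank models differently. The sketch assumes a hypothetical dataset work.valid_scored that holds a numeric 0/1 Target_B and the predicted probability P_Target_B1, and it assumes a 0.5 classification cutoff.

```sas
/* Sketch only: work.valid_scored is a hypothetical scored validation set
   with a numeric 0/1 Target_B and the predicted probability P_Target_B1. */
data work.valid_err;
    set work.valid_scored;
    /* Squared error of the predicted probability. */
    sq_err   = (Target_B - P_Target_B1)**2;
    /* 1 if the 0.5-cutoff classification disagrees with the actual target. */
    misclass = (Target_B ne (P_Target_B1 >= 0.5));
run;

/* The means are the average squared error and the misclassification rate. */
proc means data=work.valid_err mean;
    var sq_err misclass;
run;
```

Average squared error rewards well-calibrated probabilities, while misclassification only looks at which side of the cutoff each prediction falls on.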

EXPLORE: StatExplore Node
The StatExplore node gives basic univariate descriptive statistics and also statistics regarding the relationships of variables with the target.
Connect the StatExplore node after the Data Partition node.
See the "worth" of each interval variable and the Chi-Square score for each class variable.
In this case, try to find the Chi-Square value relating DemCluster to the target. Hint: search the output for DemCluster.
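As a cross-check on the DemCluster exercise, a chi-square test of association can be run directly with PROC FREQ. The dataset name is hypothetical, and the value reported by the node may differ if it bins or samples the data differently.

```sas
/* Sketch only: work.train is a hypothetical copy of the training partition. */
proc freq data=work.train;
    /* CHISQ requests the chi-square test of association between
       DemCluster and Target_B; the other options trim the cell output. */
    tables DemCluster*Target_B / chisq norow nocol nopercent;
run;
```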

Variable Selection – Two Methods
Method 1 is the EXPLORE: Variable Selection node.
Method 2 is the MODEL: Decision Tree node. For this example, change the Subtree Method to Largest (this results in an unpruned tree).
Use both of these methods to filter variables by simply connecting them to a model on the other side.
Connect a Neural Network node to each (two Neural Network nodes), using the default Neural Network options (1 hidden layer, 3 hidden units).
Connect all 4 models (2 Regressions, 2 Neural Networks) to a Model Comparison node.

Correcting for Prior Probabilities
Suppose the current data was oversampled to account for a rare event.
We can enter the true population proportion of events through the Decisions ellipsis in the data set properties. This is the same place we entered decision weights for profit/loss.
Set the prior of the event (Target = 1) to 0.05 and the nonevent to 0.95.
How does this affect our models?
The regressions were chosen to minimize validation misclassification. With these priors, that is achieved by calling everything a non-event!
If we change that selection criterion, we get better regression models.
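The effect of the priors can be illustrated with the standard adjustment for oversampled data, which rescales each predicted probability by the ratio of population to sample proportions. This is a sketch of that general formula, not necessarily the exact computation Enterprise Miner performs; work.valid_scored is a hypothetical scored dataset with a numeric 0/1 Target_B, and the sample event rate is estimated from that same data.

```sas
/* Sketch only: work.valid_scored is a hypothetical scored dataset. */
%let pi1 = 0.05;                      /* true population event rate */

/* Estimate the event rate in the (oversampled) data. */
proc sql noprint;
    select mean(Target_B) into :rho1 from work.valid_scored;
quit;

data work.valid_adj;
    set work.valid_scored;
    /* Reweight event and non-event odds by population/sample ratios. */
    num   = P_Target_B1 * (&pi1 / &rho1);
    den   = num + (1 - P_Target_B1) * ((1 - &pi1) / (1 - &rho1));
    P_adj = num / den;                /* prior-adjusted probability */
    drop num den;
run;
```

With a 5% prior, most adjusted probabilities fall well below 0.5, which is why a misclassification criterion ends up favoring a model that labels everything a non-event.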

Scoring a Data Set
Let’s score a new set of observations, contained in the dataset SCOREPVA97NK.
Import the data using the same customized advanced metadata input as the previous data (Slide 2).
Set the data Role to Score (if you miss this on import, it’s on the properties panel).
Drag the data to the diagram.
Drag a Score node from the ASSESS tab.
Connect the score data and the model to the Score node and run the Score node.
Examine (Browse) the Exported Data (the score table) to find predicted probabilities, etc.

SAS CODE Node
You must use the macro names provided in the SAS Code node to refer to datasets.
Run PROC UNIVARIATE on the predicted probabilities:
    proc univariate data=&EM_IMPORT_SCORE;
        var P_Target_B1;
    run;

SAS CODE Node
Add a new column called name to the scored data, equal to "Shaina" for all observations.
Click Code Editor in the Properties Panel and enter:
    data &EM_EXPORT_TRAIN;
        set &EM_IMPORT_SCORE;
        name = "Shaina";
    run;
