# 1 Jerry Tsai This presentation available at: clintuition.com/pubs/

1 Jerry Tsai Jerry.Tsai@clintuition.com This presentation available at: clintuition.com/pubs/

2 Optimal Model Search by a Genetic Algorithm Using SAS® (Jerry Tsai)

3 Problem Statement
- n observations; p possible predictors
- n >> p >> 0
- $2^p$ possible subsets of the set of predictors
- The challenge: choose a subset of the possible predictors that has the greatest predictive ability relative to its size

4 Problem Definition
- What do statisticians call this problem?
  - “Subset selection”
  - Finding the “best (predictive) model”
  - Finding a “parsimonious model”
- How do statisticians approach this problem?
  - Conduct a search through the space defined by the $2^p$ possible combinations of the p parameters to find a subset of those parameters that optimizes an objective function

5 Reasons to Search for an Optimal Model
1. To describe the relative importance of variables
2. To save money in data collection and management
3. To enhance predictive ability

- But we should make very sure it is worth the effort
  - Inappropriate for estimation and hypothesis testing
  - Time-consuming

6 Commonly Known Search Heuristics
- Forward; backward; stepwise
  - Found in REG, LOGISTIC, PHREG, more
- LAR (least angle regression)
- LASSO (least absolute shrinkage and selection operator)
  - Both found in GLMSELECT
- All of these heuristics use an incremental approach when searching for an optimal model

7 Incremental Approach
- To a set, add or subtract one variable at a time
- Include or exclude a candidate variable if:
  - The variable meets entry and stopping criteria, OR
  - The set of variables with the candidate variable added better optimizes the objective function
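
As an illustration of the incremental idea, here is a minimal forward-selection sketch. It is written in Python purely for illustration (the talk's own implementation is a SAS macro), and the names `forward_selection` and `toy_score`, the variables `x`, `y`, `z`, and the scoring rule are all hypothetical. The toy score rewards `x` and `y` only when they appear together, which is the kind of synergy a one-variable-at-a-time search can miss.

```python
def forward_selection(candidates, score):
    """Greedy forward selection: add the single best variable until nothing improves."""
    selected, remaining = [], list(candidates)
    best = score(selected)
    while remaining:
        # Try each remaining variable and keep the one that scores highest.
        trial = max(remaining, key=lambda v: score(selected + [v]))
        trial_score = score(selected + [trial])
        if trial_score <= best:  # stopping criterion: no improvement
            break
        selected.append(trial)
        remaining.remove(trial)
        best = trial_score
    return selected, best


def toy_score(subset):
    """Hypothetical objective: x and y only help together; each variable costs 1."""
    s = set(subset)
    synergy = 30 if {"x", "y"} <= s else 0
    return synergy + (10 if "z" in s else 0) - len(s)


print(forward_selection(["x", "y", "z"], toy_score))  # (['z'], 9)
print(toy_score(["x", "y", "z"]))                      # 37
```

Under this toy score the greedy search stops at `z` alone, even though the full set scores much higher.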

8 Holistic Approach
- Assess a set of variables as a whole
- Sets of variables are compared to one another
- Each element (variable) of the set is treated equally
  - Disadvantage: less “helpful” elements of the set are treated the same as more “helpful” elements of the set
  - Advantage: may uncover synergism or confounding among variables

9 Advantage of a Non-incremental Approach
- The absolute optimum may be undiscoverable through an incremental approach, due to:
  - Confounding
  - Endogeneity
  - Nonlinearity (with respect to a link function)
- Space searched could be much greater
  - Forward selection: $O(p^2)$
  - $\lim_{p \to \infty} \frac{O(p^2)}{2^p} = 0$, and this expression converges quickly
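
To make the size gap concrete, the following Python sketch (Python is used only for illustration) compares the number of models forward selection fits with the number of possible subsets. The worst-case count p + (p - 1) + ... + 1 = p(p + 1)/2 is an assumption about how forward selection is run, and the function names are hypothetical.

```python
def forward_selection_models(p: int) -> int:
    """Assumed worst case for forward selection: p + (p - 1) + ... + 1 models."""
    return p * (p + 1) // 2


def all_subsets(p: int) -> int:
    """Number of possible subsets of p predictors."""
    return 2 ** p


for p in (5, 10, 20, 40):
    visited = forward_selection_models(p)
    total = all_subsets(p)
    print(f"p={p:>2}  forward={visited:>4}  subsets={total:>14}  fraction={visited / total:.2e}")
```

The fraction of the model space visited shrinks rapidly as p grows, which is the point of the limit above.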

10 Advantage of Using Regression
- Statisticians are very familiar with generalized linear models (GLMs)
- Parameter estimates are amenable to comprehensible interpretation

11 Genetic Algorithm Implementation
- Create a generation of sets of variables (a set of sets)
- Score all sets in a generation
- Sets that score higher are selected for reproduction
- These selected sets are recombined and mutated to yield additional sets
- These additional sets constitute a new generation that in turn undergoes scoring, selection, and recombination

12 Why Use a Genetic Algorithm?
- Examples from nature suggest local optima are eventually found
- A holistic approach allows variables to be assessed simultaneously
- The search covers a much larger area than traditional incremental approaches

13 Implementation
- The presence (or absence) of each variable in a set is represented by a bit
- A string of bits together represents a chromosome
- So each chromosome represents a subset of the possible predictors

14 Implementation Illustration
- 12 possible parameters
  - alfa, bravo, charlie, delta…kilo, lima
- Representation example:
  - The variables bravo, charlie, and kilo constitute a subset (i.e., constitute a model)
  - Bits `0110 0000 0010` map to positions `abcd efgh ijkl`
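
The bit-string representation can be sketched as follows, in Python for illustration only. The slide elides the names between delta and kilo; the names used here for positions e through j are assumed from the NATO alphabet, and `encode` / `decode` are hypothetical helper names.

```python
# Names for positions e..j are an assumption (NATO alphabet); the slide elides them.
VARIABLES = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
             "golf", "hotel", "india", "juliett", "kilo", "lima"]


def encode(subset):
    """Turn a collection of variable names into a chromosome (string of 0/1 bits)."""
    chosen = set(subset)
    return "".join("1" if v in chosen else "0" for v in VARIABLES)


def decode(bits):
    """Turn a chromosome back into the list of variables it includes."""
    return [v for v, b in zip(VARIABLES, bits) if b == "1"]


print(encode(["bravo", "charlie", "kilo"]))  # 011000000010
print(decode("011000000010"))                # ['bravo', 'charlie', 'kilo']
```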

15 Genetic Operation – Mutation
- Logically negate bits within a chromosome (point mutation)
- 0 becomes 1; 1 becomes 0

16 Implementation Illustration
- Assume 12 possible parameters
  - alfa, bravo, charlie, delta…kilo, lima
- Example:
  - bravo, charlie, and kilo are in the model; all other variables are not
  - Bits `0110 0000 0010` map to positions `abcd efgh ijkl`

17-20 Mutation Example
- Starting chromosome: `0110 0000 0010` { bravo ; charlie ; kilo }
- Randomly selected for mutation: bravo, echo, lima
- Resulting chromosome: `0010 1000 0011` { charlie ; echo ; kilo ; lima }

21 Genetic Operation – Mutation
- Logically negate random bits within a chromosome (point mutation)
- 0 becomes 1; 1 becomes 0
- Example: { bravo ; charlie ; kilo }; MUTATE(bravo ; echo ; lima)
  - `0110 0000 0010` becomes `0010 1000 0011`
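
A minimal Python sketch of point mutation follows (Python for illustration only; the talk's code is SAS). `flip` reproduces the slide's example deterministically, while `mutate` assumes each bit flips independently with a per-bit probability `rate`, which is one common way to apply a mutation rate and not necessarily the author's. Both function names are hypothetical.

```python
import random


def flip(bits, positions):
    """Negate the bits at the given 0-based positions."""
    out = list(bits)
    for i in positions:
        out[i] = "1" if out[i] == "0" else "0"
    return "".join(out)


def mutate(bits, rate, rng=random):
    """Assumed scheme: flip each bit independently with probability `rate`."""
    positions = [i for i in range(len(bits)) if rng.random() < rate]
    return flip(bits, positions)


# bravo, echo, and lima sit at 0-based positions 1, 4, and 11.
print(flip("011000000010", [1, 4, 11]))  # 001010000011
```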

22 Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)

23-26 Crossover Example
- Parents: `0110 0000 0010` { bravo ; charlie ; kilo } and `0100 1000 0001` { bravo ; echo ; lima }
- Offspring: `0110 0000 0001` { bravo ; charlie ; lima } and `0100 1000 0010` { bravo ; echo ; kilo }

27 Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)
- Example:
  - CROSSOVER [ { bravo ; charlie ; kilo }; { bravo ; echo ; lima }; @ foxtrot ]
  - Parents `0110 0000 0010` and `0100 1000 0001` yield offspring `0110 0000 0001` and `0100 1000 0010`
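
The crossover example can be reproduced with a short single-point crossover sketch, written in Python for illustration. Reading "@ foxtrot" as "cut after the sixth bit" is an assumption, and `crossover` is a hypothetical name.

```python
def crossover(parent_a, parent_b, point):
    """Single-point crossover: the two parents swap everything after the first `point` bits."""
    if len(parent_a) != len(parent_b):
        raise ValueError("chromosomes must have the same length")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


# Assumed reading of "@ foxtrot": cut after position 6 (foxtrot is the sixth variable).
print(crossover("011000000010", "010010000001", 6))
# ('011000000001', '010010000010')
```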

28 Genetic Algorithm - Main Steps
- Initialize
  - Set up environment
  - Create starting generation
- Evaluate (i.e., score)
  - Chromosomes (i.e., individuals)
  - Generation
- Report, interim
- Select (i.e., choose which individuals reproduce)
- Reproduce (i.e., create new generation)
  - Apply genetic operators
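
The main steps can be tied together in a compact loop. The Python sketch below is an illustration under several assumptions, not the author's SAS macro: the score is a toy function rather than a fitted regression, selection is simple fitness-proportional sampling (a stand-in for whatever method the analyst prefers), the escape rule is "no new best score for `patience` generations", and the interim report is omitted. All names and default values are hypothetical.

```python
import random

TARGET = "011000000010"  # hypothetical "true" model, used only by the toy score


def toy_score(bits):
    """Toy objective: number of positions that agree with TARGET (higher is better)."""
    return sum(a == b for a, b in zip(bits, TARGET))


def run_ga(score, n_bits, pop_size=20, p_cross=0.7, p_mut=0.05,
           patience=10, max_gen=200, seed=0):
    rng = random.Random(seed)
    # Initialize: random starting generation (pop_size should be even for pairing).
    pop = ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(pop_size)]
    saved = {}  # scores already computed, keyed by chromosome
    best, best_score, stale, generations = None, float("-inf"), 0, 0
    for _ in range(max_gen):
        generations += 1
        # Evaluate: score only chromosomes not seen before.
        for c in pop:
            if c not in saved:
                saved[c] = score(c)
        gen_best = max(pop, key=saved.get)
        if saved[gen_best] > best_score:
            best, best_score, stale = gen_best, saved[gen_best], 0
        else:
            stale += 1
        # Escape: stop after `patience` generations without a new best.
        if stale >= patience:
            break
        # Select: fitness-proportional sampling on scores shifted to be positive.
        low = min(saved[c] for c in pop)
        weights = [saved[c] - low + 1e-9 for c in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Reproduce: single-point crossover on pairs, then per-bit mutation.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a, b]
        pop = ["".join("10"[int(bit)] if rng.random() < p_mut else bit for bit in c)
               for c in children]
    return best, best_score, generations


best, best_score, generations = run_ga(toy_score, n_bits=12)
print(best, best_score, generations)
```

In a real run, `score` would fit the model named by the chromosome and return the chosen objective, which is where nearly all of the run time goes.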

29 Flow Chart [figure: boxes labeled Initialize, Evaluate, Select, Reproduce, Report (Interim), and Report (Final), with an "Escape?" decision branching Yes / No]

30 Initialize
- Clear environment
- Initialize parameters
- Create &&VAR&I macro variables from the list of possible parameters
- Evaluate and store minimum (aka null) model
- Evaluate and store maximum (aka full) model
- Initialize parents (create starting generation)

32 Evaluate
- Individual (chromosomes)
  - If a chromosome has a score saved, assign that score to the chromosome
  - Otherwise, evaluate the chromosome on its fitness for reproduction
  - Save scores for newly evaluated chromosomes
- Generation (of chromosomes)
  - Evaluate and store historical information on the characteristics of the generation, e.g., the mean score
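
The save-and-reuse logic amounts to memoizing the score. A minimal Python sketch, for illustration only, with hypothetical names (`make_cached_scorer`, `generation_summary`) and a toy score that simply counts the variables in a chromosome:

```python
def make_cached_scorer(score):
    """Wrap `score` so each distinct chromosome is evaluated only once."""
    saved = {}             # chromosome -> score
    counter = {"new": 0}   # how many chromosomes were actually evaluated

    def scorer(bits):
        if bits not in saved:
            saved[bits] = score(bits)
            counter["new"] += 1
        return saved[bits]

    return scorer, saved, counter


def generation_summary(generation, scorer):
    """Store generation-level characteristics, here the mean and best score."""
    scores = [scorer(c) for c in generation]
    return {"mean": sum(scores) / len(scores), "best": max(scores)}


scorer, saved, counter = make_cached_scorer(lambda bits: bits.count("1"))
print(generation_summary(["110", "011", "110"], scorer))  # {'mean': 2.0, 'best': 2}
print(counter["new"])                                      # 2
```

Because each evaluation is a full model fit in practice, skipping repeats of a chromosome already scored is where the saved scores pay off.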

33 Scores
- Evaluate each chromosome by computing the value of these functions:
  - Objective function = the function to be optimized
    - Reward greater predictive ability while penalizing any increase in the number of parameters
    - e.g., Akaike’s Information Criterion (AIC)
  - Fitness function
    - A function based on the objective function that determines the probability of a chromosome being selected for reproduction
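
As a hedged illustration of the two functions, the Python sketch below computes AIC from a log-likelihood using the standard definition AIC = 2k - 2 ln(L), then converts a set of AIC values into fitness scores. The conversion shown (distance from the worst AIC in the generation, plus 1 so every chromosome keeps some chance of selection) is only one possible choice and is an assumption, not the author's fitness function; the log-likelihood values are made up.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def fitness_from_aic(aics):
    """Assumed mapping: lower AIC gets higher fitness; the worst model gets 1.0."""
    worst = max(aics)
    return [worst - a + 1.0 for a in aics]


# Made-up example: the larger model fits slightly better but pays for 3 extra parameters.
small = aic(log_likelihood=-100.0, n_params=3)
large = aic(log_likelihood=-98.0, n_params=6)
print(small, large)                      # 206.0 208.0
print(fitness_from_aic([small, large]))  # [3.0, 1.0]
```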

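34 SAS® Code Evaluation Illustration

    proc anly-proc data = input-data-set ;
      model
        %do i = 1 %to &cntvars.;
          %if %substr(&bitstrg., &i., 1) = 1 %then %do;
            &&var&i..
          %end;
        %end;
      ;
    run;

- `&cntvars.`: p, the number of possible parameters
- `&bitstrg.`: the chromosome
- `&&var&i..`: the variable(s) placed in the MODEL statement
- `anly-proc` and `input-data-set` are placeholders for the analysis procedure and the input data set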

35 SAS® Code Comments
- You will very likely create output data sets from the PROC (through ODS statements, OUTPUT statements, or an output option on the MODEL statement) to obtain the statistics that will constitute your objective function and fitness function scores.
- I actually use a modified version of my %ITERLIST macro (Tsai, WUSS 2008) to create the list of variables in the MODEL statement.

38 Evaluate Escape Criterion
- You need to specify a condition to escape the loop… if you want the algorithm to terminate
- Escape criteria examples:
  - Mean score for a particular generation fails to exceed any of those for a specified number of generations immediately preceding
  - Failure to surpass the best score seen so far within a specified number of generations
  - Time or resource constraints reached
  - Minimum score surpassed
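
One of these criteria, failure to surpass the best score seen so far within a specified number of generations, can be sketched as follows. This is Python for illustration only; it assumes higher scores are better, and the names `should_escape` and `patience` are hypothetical.

```python
def should_escape(best_by_generation, patience):
    """True when none of the last `patience` generations beat the best score before them."""
    if len(best_by_generation) <= patience:
        return False  # not enough history yet
    earlier_best = max(best_by_generation[:-patience])
    recent_best = max(best_by_generation[-patience:])
    return recent_best <= earlier_best


print(should_escape([1, 2, 3, 3, 3], patience=2))  # True
print(should_escape([1, 2, 3, 4], patience=2))     # False
```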

41 Select
- Those chromosomes with superior scores are given preference in the selection for reproduction
- The method of selection is at the analyst’s discretion
- One popular method used in GAs is stochastic universal sampling

42 Stochastic Universal Sampling
- Uses a single randomly chosen value to sample from the generation, choosing chromosomes at evenly spaced intervals across their collective fitness score
  - F = sum of the fitness scores for all chromosomes in a generation
  - N = number of chromosomes to be selected for reproduction
  - The N selection points are spaced F/N apart, starting from a random offset between 0 and F/N
- Source: Wikipedia, 2009
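
A minimal Python sketch of stochastic universal sampling, for illustration only. It assumes non-negative fitness scores with a positive sum, takes `population` and `fitness` as parallel lists, and uses hypothetical names throughout.

```python
import random


def stochastic_universal_sampling(population, fitness, n, rng=random):
    """Select n chromosomes using one random draw and n evenly spaced pointers."""
    total = sum(fitness)            # F
    step = total / n                # distance between pointers, F / N
    start = rng.uniform(0, step)    # the single random value
    selected = []
    i = 0
    cumulative = fitness[0]
    for k in range(n):
        pointer = start + k * step
        # Advance to the chromosome whose fitness interval contains the pointer.
        while cumulative < pointer and i < len(population) - 1:
            i += 1
            cumulative += fitness[i]
        selected.append(population[i])
    return selected


picks = stochastic_universal_sampling(["A", "B", "C"], [1.0, 2.0, 1.0], n=4,
                                      rng=random.Random(1))
print(picks)
```

Because the pointers are evenly spaced, a chromosome holding half of the total fitness (B here) is chosen exactly half the time in this example, which is the low-variance property that distinguishes this method from repeated roulette-wheel draws.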

44 Reproduce
- Apply to selected chromosomes the genetic operations of crossover and mutation
- The resulting chromosomes constitute (in part and possibly in full) a new generation

52 Final Report
- Number of generations the algorithm evaluated
- Mean fitness score for each generation
- Best chromosome discovered, with its fitness and objective scores

53 Disadvantages of Using a GA
- Not a built-in SAS functionality
- Many parameters to specify
  - Generation size
  - Crossover probability
  - Mutation rate
  - Objective function / fitness function
- Time-consuming to run
- Still may not find the absolute optimum

54 Advantages of Using a GA
- Deeper exploration of the model space
- Allows you to remain within a familiar paradigm (regression) with interpretable parameter coefficients
- Agnostic to the regression model chosen: the same macro can be used for any GLM with minor modifications
- “Proven” success in the real world

55 Suggested Reading
- References in paper
- Search heuristics
  - LAR and LASSO heuristics: Robert Cohen, Peter Flom, and David Cassell
- Information criteria in model selection
  - Linear regression: Dennis Beal
  - Logistic and proportional hazards regression: Ernest Shtatland
  - Mixed models: Jesse Canchola and Torsten Neilands
