1 Jerry Tsai This presentation available at: clintuition.com/pubs/

2 Optimal Model Search By a Genetic Algorithm Using SAS®
Jerry Tsai

3 Problem Statement
- n observations; p possible predictors
- n >> p >> 0
- $2^p$ possible subsets of the set of predictors
- The challenge: Choose a subset of the possible predictors that has the greatest predictive ability relative to its size

4 Problem Definition
- What do statisticians call this problem?
  - “Subset selection”
  - Finding the “best (predictive) model”
  - Finding a “parsimonious model”
- How do statisticians approach this problem?
  - Conduct a search through the space defined by the $2^p$ possible combinations of the p parameters to find a subset of those parameters that optimizes an objective function

5 Reasons to Search for an Optimal Model
1. To describe the relative importance of variables
2. To save money in data collection and management
3. To enhance predictive ability
- But we should make very sure it is worth the effort
  - Inappropriate for estimation and hypothesis testing
  - Time-consuming

6 Commonly-Known Search Heuristics
- Forward; backward; stepwise
  - Found in REG, LOGISTIC, PHREG, more
- LAR (least angle regression)
- LASSO (least absolute shrinkage and selection operator)
  - Both found in GLMSELECT
- All of these heuristics use an incremental approach when searching for an optimal model

7 Incremental Approach
- To a set, add or subtract one variable at a time
- Include or exclude a candidate variable if:
  - The variable meets entry and stopping criteria, OR
  - The set of variables with the candidate variable added better optimizes the objective function

8 Holistic Approach
- Assess a set of variables as a whole
  - Sets of variables are compared to one another
  - Each element (variable) of the set is treated equally
- Disadvantage: less “helpful” elements of the set are treated the same as more “helpful” elements of the set
- Advantage: May uncover synergism or confounding among variables

9 Advantage of a Non-incremental Approach
- The absolute optimum may be undiscoverable through an incremental approach, due to:
  - Confounding
  - Endogeneity
  - Nonlinearity (with respect to a link function)
- Space searched could be much greater
  - Forward selection: $O(p^2)$
  - $\lim_{p \to \infty} \frac{O(p^2)}{2^p} = 0$, and this expression quickly converges
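To make the size gap concrete, here is a small Python sketch (not part of the original SAS material). It assumes forward selection run to completion fits at most p + (p - 1) + ... + 1 = p(p + 1)/2 models, which is the $O(p^2)$ figure above, and compares that with the $2^p$ subsets. The function names are illustrative only.

```python
def forward_selection_models(p: int) -> int:
    """Upper bound on models fitted by forward selection run to completion:
    p candidates at step 1, p - 1 at step 2, ..., 1 at step p."""
    return p * (p + 1) // 2


def all_subsets(p: int) -> int:
    """Number of distinct subsets of p candidate predictors."""
    return 2 ** p


def coverage_ratio(p: int) -> float:
    """Share of the full model space that forward selection can visit at most."""
    return forward_selection_models(p) / all_subsets(p)


if __name__ == "__main__":
    for p in (5, 10, 20, 40):
        print(f"p={p:>2}  forward<={forward_selection_models(p):>4}  "
              f"subsets={all_subsets(p):>14}  ratio={coverage_ratio(p):.2e}")
```

The ratio shrinks rapidly as p grows, which is the sense in which the expression "quickly converges" to zero.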

10 Advantage of Using Regression
- Statisticians are very familiar with generalized linear models (GLMs)
- Parameter estimates are amenable to comprehensible interpretation

11 Genetic Algorithm Implementation
- Create a generation of sets of variables (a set of sets)
- Score all sets in a generation
- Sets that score higher are selected for reproduction
- These selected sets are recombined and mutated to yield additional sets
- These additional sets constitute a new generation that in turn undergoes scoring, selection, and recombination

12 Why Use a Genetic Algorithm?
- Examples from nature suggest local optima are eventually found
- A holistic approach allows variables to be assessed simultaneously
- The search covers a much larger area than traditional incremental approaches

13 Implementation
- The presence (or absence) of each variable in a set is represented by a bit
- A string of bits together represents a chromosome
- So each chromosome represents a subset of the possible predictors

14 Implementation Illustration
- 12 possible parameters
  - alfa, bravo, charlie, delta…kilo, lima
- Representation example:
  - The variables bravo, charlie, and kilo constitute a subset (i.e., constitute a model)

    abcd efgh ijkl
    0110 0000 0010
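The bit-string representation can be sketched in Python as follows. This is an illustration, not the author's SAS macro. The variable names elided by "…" on the slide are assumed here to follow the NATO alphabet (echo through juliett), and `encode` and `decode` are hypothetical helper names.

```python
# Assumption: the names elided on the slide follow the NATO alphabet.
VARIABLES = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
             "golf", "hotel", "india", "juliett", "kilo", "lima"]


def encode(subset, variables=VARIABLES):
    """Turn a collection of variable names into a chromosome (string of 0/1)."""
    unknown = set(subset) - set(variables)
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")
    return "".join("1" if name in subset else "0" for name in variables)


def decode(bits, variables=VARIABLES):
    """Turn a chromosome back into the list of variables it includes."""
    if len(bits) != len(variables):
        raise ValueError("chromosome length must equal the number of variables")
    return [name for name, bit in zip(variables, bits) if bit == "1"]


if __name__ == "__main__":
    chromosome = encode({"bravo", "charlie", "kilo"})
    print(chromosome)          # one bit per candidate variable
    print(decode(chromosome))  # the model's variables, in list order
```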

15 Genetic Operation – Mutation
- Logically negate bits within a chromosome (point mutation)
  - 0 becomes 1; 1 becomes 0

16 Implementation Illustration
- Assume 12 possible parameters
  - alfa, bravo, charlie, delta…kilo, lima
- Example:
  - bravo, charlie, and kilo are in the model; all other variables are not

    abcd efgh ijkl
    0110 0000 0010

17-20 Mutation Example (figure): the bits randomly selected for mutation are negated; the resulting chromosome contains bravo, echo, and lima.

21 Genetic Operation – Mutation
- Logically negate random bits within a chromosome (point mutation)
  - 0 becomes 1; 1 becomes 0
- Example: { bravo ; charlie ; kilo } → MUTATE → { bravo ; echo ; lima }
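A minimal Python sketch of point mutation, assuming chromosomes are strings of "0" and "1" in the order alfa through lima. The per-bit `rate` parameter and the function names are assumptions for illustration; the slides do not specify how bits are chosen beyond "randomly".

```python
import random


def flip_bits(bits, positions):
    """Negate the bits at the given zero-based positions."""
    chosen = set(positions)
    return "".join(
        ("1" if bit == "0" else "0") if index in chosen else bit
        for index, bit in enumerate(bits)
    )


def mutate(bits, rate, rng):
    """Negate each bit independently with probability `rate`."""
    positions = [i for i in range(len(bits)) if rng.random() < rate]
    return flip_bits(bits, positions)


if __name__ == "__main__":
    parent = "011000000010"  # bravo, charlie, kilo
    # Flipping charlie (2), echo (4), kilo (10), and lima (11) reproduces the
    # slide's example result of bravo, echo, lima.
    print(flip_bits(parent, [2, 4, 10, 11]))
    print(mutate(parent, rate=0.1, rng=random.Random(2009)))
```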

22 Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)

23 Crossover Example
{ bravo ; charlie ; kilo }
{ bravo ; echo ; lima }

26 Crossover Example
- Parents: { bravo ; charlie ; kilo } and { bravo ; echo ; lima }
- Offspring: { bravo ; charlie ; lima } and { bravo ; echo ; kilo }

27 Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)
- Example:
  - CROSSOVER [ { bravo ; charlie ; kilo } ; { bravo ; echo ; lima } ; foxtrot ] → { bravo ; charlie ; lima } ; { bravo ; echo ; kilo }
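A Python sketch of single-point crossover on bit-string chromosomes follows. Reading "foxtrot" as the crossover point (the parents swap everything after the sixth variable) is an assumption. It does reproduce the offspring shown in the crossover example, but so would any cut between echo and kilo. The function name is illustrative.

```python
def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two chromosomes after `point` bits."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 < point < len(parent_a):
        raise ValueError("crossover point must fall strictly inside the chromosome")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


if __name__ == "__main__":
    first = "011000000010"   # bravo, charlie, kilo
    second = "010010000001"  # bravo, echo, lima
    # Assumed cut after foxtrot, the sixth of the twelve variables.
    print(crossover(first, second, point=6))
```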

28 Genetic Algorithm - Main Steps
- Initialize
  - Set up environment
  - Create starting generation
- Evaluate (i.e., score)
  - Chromosomes (i.e., individuals)
  - Generation
- Report, interim
- Select (i.e., choose which individuals reproduce)
- Reproduce (i.e., create new generation)
  - Apply genetic operators
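The main steps can be tied together in a compact Python sketch. Everything specific here is an assumption made for illustration and is not taken from the author's SAS macro: the toy scoring function (a stand-in for fitting a regression and computing an information criterion), the parameter defaults, the even population size, the use of simple fitness-proportional selection via `random.choices`, and the patience-based escape rule.

```python
import random

USEFUL = "011000000010"  # hypothetical "truly useful" variables: bravo, charlie, kilo


def toy_score(bits):
    """Toy objective (higher is better): reward useful variables, penalize model size."""
    hits = sum(1 for b, u in zip(bits, USEFUL) if b == "1" and u == "1")
    return 2 * hits - bits.count("1")


def flip(bit):
    return "1" if bit == "0" else "0"


def run_ga(score, n_bits, pop_size=20, mutation_rate=0.05,
           patience=10, max_generations=200, seed=0):
    rng = random.Random(seed)
    # Initialize: create a random starting generation (pop_size assumed even).
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]
    cache = {}  # saved scores, so no chromosome is evaluated twice
    best_bits, best_score, stale = None, float("-inf"), 0
    mean_history = []
    for _ in range(max_generations):
        # Evaluate chromosomes, then the generation as a whole.
        for bits in population:
            if bits not in cache:
                cache[bits] = score(bits)
        scores = [cache[bits] for bits in population]
        mean_history.append(sum(scores) / len(scores))  # interim report
        leader = max(population, key=cache.get)
        if cache[leader] > best_score:
            best_bits, best_score, stale = leader, cache[leader], 0
        else:
            stale += 1
        # Escape? Stop after `patience` generations without a new best score.
        if stale >= patience:
            break
        # Select: fitness-proportional sampling on shifted, positive weights.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Reproduce: single-point crossover followed by point mutation.
        population = []
        for a, b in zip(parents[0::2], parents[1::2]):
            point = rng.randrange(1, n_bits)
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                population.append("".join(
                    flip(c) if rng.random() < mutation_rate else c for c in child))
    return best_bits, best_score, mean_history


if __name__ == "__main__":
    bits, value, means = run_ga(toy_score, n_bits=12)
    print(bits, value, len(means))  # final report
```

As with any GA, nothing guarantees that the returned chromosome is the global optimum; the sketch only returns the best one seen.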

29 Flow Chart (figure): Initialize → Evaluate → Report, Interim → Escape? If No: Select → Reproduce → Evaluate (loop). If Yes: Report, Final.

30 Initialize
- Clear environment
- Initialize parameters
- Create &&VAR&I macro variables from the list of possible parameters
- Evaluate and store minimum (aka null) model
- Evaluate and store maximum (aka full) model
- Initialize parents (create starting generation)

32 Evaluate
- Individual (chromosomes)
  - If a chromosome has a score saved, assign that score to the chromosome
  - Otherwise, evaluate the chromosome on its fitness for reproduction
  - Save scores for newly-evaluated chromosomes
- Generation (of chromosomes)
  - Evaluate and store historical information on the characteristics of the generation, e.g., the mean score
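The score-saving logic above amounts to memoization. A Python sketch, with hypothetical names (`evaluate_generation`, `score_fn`, `cache`) and a placeholder scoring function standing in for the actual model fit:

```python
def evaluate_generation(population, score_fn, cache):
    """Score every chromosome, reusing saved scores, and summarize the generation."""
    new_evaluations = 0
    for bits in population:
        if bits not in cache:             # no saved score: evaluate and save it
            cache[bits] = score_fn(bits)
            new_evaluations += 1
    scores = [cache[bits] for bits in population]
    summary = {
        "mean": sum(scores) / len(scores),
        "best": max(scores),
        "new_evaluations": new_evaluations,
    }
    return scores, summary


if __name__ == "__main__":
    saved = {}
    generation = ["110", "110", "001"]
    # Placeholder score: the number of variables in the model.
    print(evaluate_generation(generation, lambda bits: bits.count("1"), saved))
    print(evaluate_generation(generation, lambda bits: bits.count("1"), saved))
```

Because fitting a model is usually the expensive step, the saved scores matter most in later generations, when many chromosomes reappear.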

33 Scores
- Evaluate each chromosome by computing the value of these functions:
  - Objective function = the function to be optimized
    - Reward greater predictive ability while penalizing any increase in the number of parameters
    - e.g., Akaike’s Information Criterion (AIC)
  - Fitness function
    - A function based on the objective function that determines the probability of a chromosome being selected for reproduction
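As a Python illustration of the two scores, the sketch below uses the standard definition AIC = 2k - 2 ln L as the objective, where lower is better. The fitness transformation shown (distance from the worst AIC in the generation, plus a small floor so that every chromosome keeps a nonzero chance) is only one possible choice and is an assumption, not the author's fitness function.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L). Lower values are better."""
    return 2 * n_params - 2 * log_likelihood


def fitness_from_aic(aic_values, floor=1e-6):
    """Map AIC values (lower is better) to positive fitness scores (higher is better)."""
    worst = max(aic_values)
    return [worst - value + floor for value in aic_values]


if __name__ == "__main__":
    # Hypothetical (log-likelihood, number of parameters) pairs for three models.
    fits = [(-100.0, 3), (-99.0, 6), (-98.0, 2)]
    objective = [aic(ll, k) for ll, k in fits]
    print(objective)
    print(fitness_from_aic(objective))
```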

34 SAS® Code Evaluation Illustration

    proc anly-proc data = input-data-set ;
      model
        %do i = 1 %to &cntvars.;  /* &cntvars. = p = # of possible parameters */
          %if %substr(&bitstrg., &i., 1) = 1 %then %do;  /* &bitstrg. = chromosome */
            &&var&i..  /* variable(s) included in the model */
          %end;
        %end;
      ;
    run;

35 SAS® Code Comments
- You will very likely create output data sets from the PROC, through the use of ODS statements, OUTPUT statements, or an output option on the MODEL statement, to obtain statistics that will constitute your objective function and fitness function scores.
- I actually use a modified version of my %ITERLIST macro (Tsai, WUSS 2008) to create the list of variables in the MODEL statement.

38 Evaluate Escape Criterion
- You need to specify a condition to escape the loop… if you want the algorithm to terminate
- Escape criteria examples:
  - Mean score for a particular generation fails to exceed any of those for a specified number of generations immediately preceding
  - Failure to surpass the best score seen so far within a specified number of generations
  - Time or resource constraints reached
  - Minimum score surpassed
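The first two escape criteria can be sketched in Python as checks on the per-generation history. The sketch assumes higher scores are better; the function and parameter names (`window`, `patience`) are hypothetical.

```python
def mean_score_stalled(mean_scores, window):
    """True if the latest generation's mean fails to exceed every one of the
    `window` generation means immediately preceding it."""
    if len(mean_scores) <= window:
        return False  # not enough history yet
    current = mean_scores[-1]
    return all(current <= previous for previous in mean_scores[-window - 1:-1])


def no_recent_improvement(best_scores, patience):
    """True if the last `patience` generations never surpassed the best score
    seen before them."""
    if len(best_scores) <= patience:
        return False
    return max(best_scores[-patience:]) <= max(best_scores[:-patience])


if __name__ == "__main__":
    print(mean_score_stalled([1.0, 3.0, 3.0, 2.0], window=2))
    print(no_recent_improvement([1, 2, 3, 3, 3], patience=2))
```

A time or resource limit is usually added as a hard cap on the number of generations, so the loop terminates even if neither check fires.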

41 Select
- Those chromosomes with superior scores are given preference in the selection for reproduction
- The method of selection is at the analyst’s discretion
- One popular method used in GAs is stochastic universal sampling

42 Stochastic Universal Sampling
- Uses a single randomly chosen value to sample from the generation, choosing chromosomes at evenly spaced intervals across their collective fitness score
  - F = sum of the fitness scores for all chromosomes in a generation
  - N = number of chromosomes to be selected for reproduction
  - Selection points are spaced F/N apart, starting from a random point in [0, F/N)
- Source: Wikipedia, 2009
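A Python sketch of stochastic universal sampling as commonly described: lay the chromosomes end to end on a line of length F, draw one random start in [0, F/N), and take N pointers spaced F/N apart. The function name and argument order are illustrative, and fitness scores are assumed to be non-negative with a positive sum.

```python
import random


def stochastic_universal_sampling(population, fitness, n_select, rng):
    """Select `n_select` chromosomes using one random draw and evenly spaced pointers."""
    if any(f < 0 for f in fitness) or sum(fitness) <= 0:
        raise ValueError("fitness scores must be non-negative with a positive sum")
    total = sum(fitness)              # F
    step = total / n_select           # F / N
    start = rng.random() * step       # single random value in [0, F/N)
    selected = []
    index = 0
    cumulative = fitness[0]           # right edge of the current chromosome's segment
    for k in range(n_select):
        pointer = start + k * step
        # Advance until the pointer falls inside the current segment.
        while pointer >= cumulative and index < len(fitness) - 1:
            index += 1
            cumulative += fitness[index]
        selected.append(population[index])
    return selected


if __name__ == "__main__":
    chosen = stochastic_universal_sampling(
        ["A", "B", "C"], [5.0, 3.0, 2.0], n_select=5, rng=random.Random(42))
    print(chosen)
```

Compared with drawing N independent roulette-wheel samples, a chromosome holding a share s of F is selected either floor(sN) or ceil(sN) times, which lowers the sampling variance.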

44 Reproduce
- Apply to selected chromosomes the genetic operations of crossover and mutation
- The resulting chromosomes constitute (in part and possibly in full) a new generation

52 Final Report
- Number of generations the algorithm evaluated
- Mean fitness score for each generation
- Best chromosome discovered and its fitness and objective scores

53 Disadvantages of using a GA
- Not a built-in SAS functionality
- Many parameters to specify
  - Generation size
  - Crossover probability
  - Mutation rate
  - Objective function / Fitness function
- Time-consuming to run
- Still may not find the absolute optimum

54 Advantages of using a GA
- Deeper exploration of the model space
- Allows you to remain within a familiar paradigm (regression) with interpretable parameter coefficients
- Agnostic to the regression model chosen: can use the same macro for any GLM with minor modifications
- “Proven” success in the real world

55 Suggested Reading
- References in paper
- Search heuristics
  - LAR and LASSO heuristics: Robert Cohen, Peter Flom, and David Cassell
- Information criteria in model selection
  - Linear regression: Dennis Beal
  - Logistic and proportional hazards regression: Ernest Shtatland
  - Mixed models: Jesse Canchola and Torsten Neilands
