Published by Steve Winey. Modified about 1 year ago.

1
Jerry Tsai
This presentation is available at: clintuition.com/pubs/

2
Optimal Model Search By a Genetic Algorithm Using SAS®
Jerry Tsai

3
Problem Statement
- n observations; p possible predictors
- $n \gg p \gg 0$
- $2^p$ possible subsets of the set of predictors
- The challenge: choose a subset of the possible predictors that has the greatest predictive ability relative to its size

4
Problem Definition
- What do statisticians call this problem?
  - “Subset selection”
  - Finding the “best (predictive) model”
  - Finding a “parsimonious model”
- How do statisticians approach this problem?
  - Conduct a search through the space defined by the $2^p$ possible combinations of the p parameters to find a subset of those parameters that optimizes an objective function

5
Reasons to Search for an Optimal Model
1. To describe the relative importance of variables
2. To save money in data collection and management
3. To enhance predictive ability
But we should make very sure it is worth the effort:
- Inappropriate for estimation and hypothesis testing
- Time-consuming

6
Commonly-Known Search Heuristics
- Forward; backward; stepwise
  - Found in REG, LOGISTIC, PHREG, and more
- LAR (least angle regression)
- LASSO (least absolute shrinkage and selection operator)
  - Both found in GLMSELECT
- All of these heuristics use an incremental approach when searching for an optimal model

7
Incremental Approach
- To a set, add or subtract one variable at a time
- Include or exclude a candidate variable if:
  - The variable meets entry and stopping criteria, OR
  - The set of variables with the candidate variable added better optimizes the objective function
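As a rough illustration of the incremental idea, here is a minimal Python sketch of forward selection (Python is used for brevity; it is not the SAS procedures named in the deck). The function `forward_select` and the toy objective `toy_score` are hypothetical names introduced here, and the sketch assumes a lower score is better, as with AIC.

```python
from typing import Callable, FrozenSet, Sequence


def forward_select(candidates: Sequence[str],
                   score: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Greedy forward selection: add one variable at a time while the score improves."""
    current: FrozenSet[str] = frozenset()
    best = score(current)
    while True:
        # Score every set obtained by adding exactly one unused candidate.
        trials = [(score(current | {v}), v) for v in candidates if v not in current]
        if not trials:
            return current
        trial_score, var = min(trials)
        if trial_score >= best:  # stop: no single addition improves the objective
            return current
        current, best = current | {var}, trial_score


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical objective: penalise size, reward bravo and kilo (lower is better)."""
    return len(subset) - 3 * len(subset & {"bravo", "kilo"})


selected = forward_select(["alfa", "bravo", "charlie", "kilo"], toy_score)
print(sorted(selected))
```

Because each step only looks one variable ahead, the search never scores sets that differ from the current set by two or more variables at once.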

8
Holistic Approach
- Assess a set of variables as a whole
- Sets of variables are compared to one another
- Each element (variable) of the set is treated equally
- Disadvantage: less “helpful” elements of the set are treated the same as more “helpful” elements of the set
- Advantage: may uncover synergism or confounding among variables

9
Advantage of a Non-incremental Approach
- The absolute optimum may be undiscoverable through an incremental approach, due to:
  - Confounding
  - Endogeneity
  - Nonlinearity (with respect to a link function)
- Space searched could be much greater
  - Forward selection: $O(p^2)$
  - $\lim_{p \to \infty} \frac{O(p^2)}{2^p} = 0$, and this expression converges quickly
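To give a feel for how quickly that ratio shrinks, the following Python sketch compares the largest number of models forward selection can score, p + (p - 1) + ... + 1 = p(p + 1)/2, with the 2^p possible subsets. The function names are illustrative only.

```python
def forward_models(p: int) -> int:
    """Upper bound on models scored by forward selection: p + (p-1) + ... + 1."""
    return p * (p + 1) // 2


def fraction_searched(p: int) -> float:
    """Share of the 2**p possible subsets that forward selection can visit."""
    return forward_models(p) / 2 ** p


for p in (5, 10, 20, 40):
    print(p, forward_models(p), 2 ** p, fraction_searched(p))
```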

10
Advantage of Using Regression
- Statisticians are very familiar with generalized linear models (GLMs)
- Parameter estimates are amenable to comprehensible interpretation

11
Genetic Algorithm Implementation
- Create a generation of sets of variables (a set of sets)
- Score all sets in a generation
- Sets that score higher are selected for reproduction
- These selected sets are recombined and mutated to yield additional sets
- These additional sets constitute a new generation that in turn undergoes scoring, selection, and recombination

12
Why Use a Genetic Algorithm?
- Examples from nature suggest local optima are eventually found
- A holistic approach allows variables to be assessed simultaneously
- The search covers a much larger area than traditional incremental approaches

13
Implementation
- The presence (or absence) of each variable in a set is represented by a bit
- A string of bits together represents a chromosome
- So each chromosome represents a subset of the possible predictors

14
Implementation Illustration
- 12 possible parameters: alfa, bravo, charlie, delta … kilo, lima
- Representation example: the variables bravo, charlie, and kilo constitute a subset (i.e., constitute a model)

    abcd efgh ijkl
    0110 0000 0010
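The bit-string encoding can be sketched in a few lines of Python. The deck names only alfa through delta, kilo, and lima, plus echo and foxtrot in later examples; the remaining names below (golf, hotel, india, juliett) are an assumption filled in from the NATO alphabet purely for illustration, and `encode` / `decode` are hypothetical helper names.

```python
# Assumed ordering of the 12 candidate variables (golf..juliett are illustrative guesses).
VARS = ["alfa", "bravo", "charlie", "delta", "echo", "foxtrot",
        "golf", "hotel", "india", "juliett", "kilo", "lima"]


def encode(subset):
    """Return the chromosome: one bit per candidate variable, 1 = in the model."""
    return "".join("1" if v in subset else "0" for v in VARS)


def decode(bits):
    """Return the list of variables whose bit is set."""
    return [v for v, b in zip(VARS, bits) if b == "1"]


chromosome = encode({"bravo", "charlie", "kilo"})
print(chromosome)          # the 12-bit string for this model
print(decode(chromosome))  # back to variable names
```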

15
Genetic Operation – Mutation
- Logically negate bits within a chromosome (point mutation)
- 0 becomes 1; 1 becomes 0

16
Implementation Illustration
- Assume 12 possible parameters: alfa, bravo, charlie, delta … kilo, lima
- Example: bravo, charlie, and kilo are in the model; all other variables are not

    abcd efgh ijkl
    0110 0000 0010

17
Mutation Example
- Randomly selected for mutation: bravo, echo, lima

21
Genetic Operation – Mutation
- Logically negate random bits within a chromosome (point mutation)
- 0 becomes 1; 1 becomes 0
- Example: { bravo ; charlie ; kilo }; MUTATE(bravo ; echo ; lima)
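A minimal Python sketch of point mutation on a bit string follows. It assumes the 12 variables occupy positions 0 to 11 in alphabetical order, so bravo, echo, and lima are positions 1, 4, and 11; `mutate_at` and `mutate` are hypothetical names, and the per-bit `rate` parameter is an assumption about how "random bits" are chosen.

```python
import random

FLIP = {"0": "1", "1": "0"}


def mutate_at(bits, positions):
    """Negate the bits at the given 0-based positions."""
    chosen = set(positions)
    return "".join(FLIP[b] if i in chosen else b for i, b in enumerate(bits))


def mutate(bits, rate, rng):
    """Negate each bit independently with probability `rate`."""
    return "".join(FLIP[b] if rng.random() < rate else b for b in bits)


parent = "011000000010"                 # { bravo ; charlie ; kilo }
mutant = mutate_at(parent, [1, 4, 11])  # flip bravo, echo, lima
print(mutant)                           # bravo switches off; echo and lima switch on
print(mutate(parent, 0.1, random.Random(0)))
```

Under these assumptions, the slide's example yields the model { charlie ; echo ; kilo ; lima }.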

22
Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)

23
Crossover Example
- { bravo ; charlie ; kilo }
- { bravo ; echo ; lima }

26
Crossover Example
- Before crossover: { bravo ; charlie ; kilo } and { bravo ; echo ; lima }
- After crossover: { bravo ; charlie ; lima } and { bravo ; echo ; kilo }

27
Genetic Operation – Crossover
- Two chromosomes exchange genetic information (Morgan 1916)
- Example: CROSSOVER [ { bravo ; charlie ; kilo } ; { bravo ; echo ; lima } ; foxtrot ]
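Single-point crossover on bit strings can be sketched in Python as below. This assumes that "foxtrot" in the example names the crossover point (0-based position 5 when the 12 variables are ordered alphabetically), which is an interpretation rather than something the slide states; `crossover` is a hypothetical name.

```python
def crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes from `point` onward."""
    if len(parent_a) != len(parent_b):
        raise ValueError("chromosomes must have the same length")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


first = "011000000010"    # { bravo ; charlie ; kilo }
second = "010010000001"   # { bravo ; echo ; lima }
child_a, child_b = crossover(first, second, 5)  # 5 = assumed position of foxtrot
print(child_a, child_b)
```

With that crossover point the two children decode to { bravo ; charlie ; lima } and { bravo ; echo ; kilo }, matching the deck's crossover example.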

28
Genetic Algorithm - Main Steps
- Initialize
  - Set up environment
  - Create starting generation
- Evaluate (i.e., score)
  - Chromosomes (i.e., individuals)
  - Generation
- Report, interim
- Select (i.e., choose which individuals reproduce)
- Reproduce (i.e., create new generation)
  - Apply genetic operators
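The main steps can be strung together in a compact Python sketch. This is not the author's SAS macro: the objective is a toy function (Hamming distance to a hypothetical target chromosome, lower is better), selection uses a simple binary tournament as a stand-in for whatever method the analyst prefers, and the escape criterion is just a fixed generation budget. All names and default parameter values are assumptions.

```python
import random


def run_ga(p, objective, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over p-bit chromosomes; lower objective is better."""
    rng = random.Random(seed)
    saved = {}  # scores already computed, keyed by chromosome

    def score(ch):
        if ch not in saved:
            saved[ch] = objective(ch)
        return saved[ch]

    def pick(population):
        # Binary tournament selection (a swapped-in, simple selection method).
        a, b = rng.sample(population, 2)
        return a if score(a) <= score(b) else b

    # Initialize: random starting generation.
    population = ["".join(rng.choice("01") for _ in range(p)) for _ in range(pop_size)]
    best = min(population, key=score)
    history = []
    for _ in range(generations):                     # Escape: generation budget
        scores = [score(ch) for ch in population]    # Evaluate
        history.append(sum(scores) / len(scores))    # Report, interim
        best = min(population + [best], key=score)
        children = []
        while len(children) < pop_size:              # Select and Reproduce
            a, b = pick(population), pick(population)
            cut = rng.randrange(1, p)
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                child = "".join(("1" if bit == "0" else "0")
                                if rng.random() < mutation_rate else bit
                                for bit in child)
                children.append(child)
        population = children[:pop_size]
    best = min(population + [best], key=score)
    return best, score(best), history


TARGET = "011000000010"  # hypothetical "true" model used only by the toy objective


def toy_objective(ch):
    return sum(a != b for a, b in zip(ch, TARGET))


best, best_score, history = run_ga(12, toy_objective)
print(best, best_score)
```

In a real application the objective would come from fitting the model encoded by each chromosome, which is far more expensive than this toy function, so caching scores matters.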

29
Flow Chart: Initialize → Evaluate → Report, Interim → Escape? If No: Select → Reproduce → back to Evaluate. If Yes: Report, Final.

30
Initialize
- Clear environment
- Initialize parameters
- Create &&VAR&I macro variables from the list of possible parameters
- Evaluate and store minimum (aka null) model
- Evaluate and store maximum (aka full) model
- Initialize parents (create starting generation)
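In bit-string terms, the null model is all zeros, the full model is all ones, and the starting generation can be drawn at random. The Python sketch below assumes uniformly random starting chromosomes, which the deck does not specify; `initialize` is a hypothetical name.

```python
import random


def initialize(p, pop_size, seed=None):
    """Return the null model, the full model, and a random starting generation."""
    rng = random.Random(seed)
    null_model = "0" * p   # no predictors
    full_model = "1" * p   # all p predictors
    parents = ["".join(rng.choice("01") for _ in range(p)) for _ in range(pop_size)]
    return null_model, full_model, parents


null_model, full_model, parents = initialize(p=12, pop_size=8, seed=1)
print(null_model, full_model, parents[:2])
```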

32
Evaluate
- Individual (chromosomes)
  - If a chromosome has a score saved, assign that score to the chromosome
  - Otherwise, evaluate the chromosome on its fitness for reproduction
  - Save scores for newly-evaluated chromosomes
- Generation (of chromosomes)
  - Evaluate and store historical information on the characteristics of the generation, e.g., the mean score
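The save-and-reuse logic amounts to memoizing the objective. A minimal Python sketch, assuming chromosomes are bit strings and that `objective` is whatever function fits and scores the corresponding model (here replaced by a trivial stand-in that counts set bits):

```python
def evaluate_generation(generation, objective, saved_scores):
    """Score each chromosome, reusing saved scores, and summarise the generation."""
    scores = []
    for ch in generation:
        if ch not in saved_scores:          # only fit models not seen before
            saved_scores[ch] = objective(ch)
        scores.append(saved_scores[ch])
    summary = {"mean": sum(scores) / len(scores), "best": min(scores)}
    return scores, summary


calls = []  # records every chromosome that was actually evaluated


def counting_objective(ch):
    calls.append(ch)
    return ch.count("1")  # stand-in objective: number of variables in the model


saved = {}
scores_1, summary_1 = evaluate_generation(["01", "10", "01"], counting_objective, saved)
scores_2, summary_2 = evaluate_generation(["01", "11"], counting_objective, saved)
print(scores_1, summary_1, len(calls))
```

Here "best" is taken as the minimum, which assumes a lower-is-better objective such as AIC.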

33
Scores
Evaluate each chromosome by computing the value of these functions:
- Objective function = the function to be optimized
  - Reward greater predictive ability while penalizing any increase in the number of parameters
  - e.g., Akaike’s Information Criterion (AIC)
- Fitness function
  - A function based on the objective function that determines the probability of a chromosome being selected for reproduction
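As one possible pairing of the two functions, the Python sketch below computes AIC = 2k - 2 ln(L) as the objective and turns it into a non-negative fitness by measuring how far each chromosome's AIC is below the worst AIC in the generation. The fitness transformation is an assumption chosen for simplicity, not the one used in the paper.

```python
def aic(log_likelihood, k):
    """Akaike's Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * k - 2 * log_likelihood


def fitness_from_aic(aic_values, eps=1e-9):
    """Map AICs to fitness scores: the best (lowest) AIC gets the largest fitness.

    eps keeps the worst chromosome from having exactly zero fitness.
    """
    worst = max(aic_values)
    return [worst - value + eps for value in aic_values]


generation_aic = [aic(-100.0, 3), aic(-101.0, 4), aic(-98.0, 2)]
fitness = fitness_from_aic(generation_aic)
print(generation_aic, fitness)
```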

34
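SAS® Code Evaluation Illustration

    /* Runs inside a macro definition, because it uses %do and %if. */
    proc anly-proc data = input-data-set;
      model
        /* &cntvars. = p, the number of possible parameters */
        %do i = 1 %to &cntvars.;
          /* &bitstrg. = the chromosome; a 1 in position &i. includes variable &i. */
          %if %substr(&bitstrg., &i., 1) = 1 %then %do;
            &&var&i..   /* resolves to the name of the i-th variable */
          %end;
        %end;
      ;
    run;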

35
SAS® Code Comments
- You will very likely create output data sets from the PROC (through the use of ODS statements, OUTPUT statements, or an output option on the MODEL statement) to obtain the statistics that will constitute your objective function and fitness function scores.
- I actually use a modified version of my %ITERLIST macro (Tsai, WUSS 2008) to create the list of variables in the MODEL statement.

38
Evaluate Escape Criterion
- You need to specify a condition to escape the loop… if you want the algorithm to terminate
- Escape criteria examples:
  - Mean score for a particular generation fails to exceed any of those for a specified number of generations immediately preceding
  - Failure to surpass the best score seen so far within a specified number of generations
  - Time or resource constraints reached
  - Minimum score surpassed
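The second criterion, failing to surpass the best score seen so far within a set number of generations, might be checked as in this Python sketch. The name `should_escape`, the `patience` parameter, and the lower-is-better default are assumptions for illustration.

```python
def should_escape(best_by_generation, patience, lower_is_better=True):
    """True if the last `patience` generations did not beat the earlier best score."""
    if len(best_by_generation) <= patience:
        return False  # not enough history yet
    earlier = best_by_generation[:-patience]
    recent = best_by_generation[-patience:]
    if lower_is_better:
        return min(recent) >= min(earlier)
    return max(recent) <= max(earlier)


print(should_escape([10, 8, 7, 7, 7, 7], patience=3))  # stalled at 7
print(should_escape([10, 8, 7, 6], patience=3))        # still improving
```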

41
Select
- Those chromosomes with superior scores are given preference in the selection for reproduction
- The method of selection is at the analyst’s discretion
- One popular method used in GAs is stochastic universal sampling

42
Stochastic Universal Sampling
- Uses a single randomly-chosen value to sample from the generation, choosing chromosomes at evenly-spaced intervals across their collective fitness score
- F = sum of the fitness scores for all chromosomes in a generation
- N = number of chromosomes to be selected for reproduction
- Selection pointers are spaced F/N apart, starting from a random value in [0, F/N)
- Source: Wikipedia, 2009
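A minimal Python sketch of stochastic universal sampling, assuming non-negative fitness scores where larger means fitter. It returns the indices of the selected chromosomes; `sus` is a hypothetical name.

```python
import random


def sus(fitness, n, rng):
    """Select n indices with one random draw and n evenly spaced pointers."""
    total = sum(fitness)             # F
    step = total / n                 # distance between pointers, F / N
    start = rng.random() * step      # single random value in [0, F/N)
    picks = []
    idx = 0
    cumulative = fitness[0]          # upper edge of chromosome 0's slice of [0, F)
    for k in range(n):
        pointer = start + k * step
        # Advance to the chromosome whose slice contains the pointer.
        while cumulative <= pointer and idx < len(fitness) - 1:
            idx += 1
            cumulative += fitness[idx]
        picks.append(idx)
    return picks


picks = sus([1.0, 1.0, 2.0], 4, random.Random(0))
print(picks)
```

With fitness scores 1, 1, and 2 and four selections, the pointers are one unit apart, so the chromosome holding half of the total fitness is picked twice whatever the random start.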

44
Reproduce
- Apply to selected chromosomes the genetic operations of crossover and mutation
- The resulting chromosomes constitute (in part and possibly in full) a new generation
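One way to combine the two operators is sketched below in Python: selected chromosomes are paired off in order, each pair undergoes single-point crossover with some probability, and every resulting bit is then mutated with a small probability. The pairing scheme and both probability parameters are assumptions; with an odd number of selected chromosomes this sketch drops the last one.

```python
import random

FLIP = {"0": "1", "1": "0"}


def reproduce(selected, crossover_prob, mutation_rate, rng):
    """Build children from consecutive pairs of selected chromosomes."""
    children = []
    for i in range(0, len(selected) - 1, 2):
        a, b = selected[i], selected[i + 1]
        if rng.random() < crossover_prob:
            cut = rng.randrange(1, len(a))   # crossover point, never at either end
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        children.extend([a, b])
    # Point mutation: each bit flips independently with probability mutation_rate.
    return ["".join(FLIP[bit] if rng.random() < mutation_rate else bit for bit in ch)
            for ch in children]


selected = ["011000000010", "010010000001", "000000000000", "111111111111"]
children = reproduce(selected, crossover_prob=0.8, mutation_rate=0.02,
                     rng=random.Random(0))
print(children)
```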

52
Final Report
- Number of generations the algorithm evaluated
- Mean fitness score for each generation
- Best chromosome discovered and its fitness and objective scores

53
Disadvantages of using a GA
- Not a built-in SAS functionality
- Many parameters to specify
  - Generation size
  - Crossover probability
  - Mutation rate
  - Objective function / Fitness function
- Time-consuming to run
- Still may not find the absolute optimum

54
Advantages of using a GA
- Deeper exploration of the model space
- Allows you to remain within a familiar paradigm (regression) with interpretable parameter coefficients
- Agnostic to the regression model chosen: can use the same macro for any GLM with minor modifications
- “Proven” success in the real world

55
Suggested Reading
- References in paper
- Search heuristics
  - LAR and LASSO heuristics -- Robert Cohen, Peter Flom, and David Cassell
- Information criteria in model selection
  - Linear regression -- Dennis Beal
  - Logistic and proportional hazards regression -- Ernest Shtatland
  - Mixed models -- Jesse Canchola and Torsten Neilands
