The Assessment and Application of Lineage Information in Genetic Programs for Producing Better Models

Gary D. Boetticher, Boetticher@uhcl.edu
Univ. of Houston - Clear Lake, Houston, TX, USA

Kim Kaminsky, Kaminsky@uhcl.edu
Univ. of Houston - Clear Lake, Houston, TX, USA

About the Author: Gary D. Boetticher

Ph.D. in Machine Learning and Software Engineering
  A neural network-based software reuse economic model
Executive member of IEEE Reuse Standard Committees (1990s)
Commercial consultant: U.S. Olympic Committee, LDDS Worldcom, Mellon Mortgage, …
Currently: Associate Professor, Department of Comp. Science/Software Engineering, University of Houston - Clear Lake, Houston, TX, USA
boetticher@uhcl.edu
Research interests: Data mining, ML, Computational Bioinformatics, and Software metrics

Motivating Questions

Does chromosome lineage information within a Genetic Program (GP) provide any insight into the effectiveness of solving problems?

If so, how could these insights be utilized to make better breeding decisions?

Genetic Program Overview

X, Y, and Z: RESULT?

X   Y   Z   RESULT
2   4   5   30
5   3   2   16
:   :   :   :
1   3   6   24

1) Create a population of equations
2) Determine the fitness for each (1 / Standard Error)

Eq#    Equation       Fitness
1      X+Y            87
2      (Z-X)*Y+X      84
:      :              :
1000   (X*X)-Z        57

3) Breed equations

Parents:     X + Y         (Z-X) * Y+X
Offspring:   (Z-X) + Y     X * Y+X

4) Generate new populations and breed until a solution is found
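Steps 2 and 3 of the overview can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes "standard error" means the root-mean-square residual, represents equations as nested (operator, left, right) tuples, and uses hypothetical helper names (`standard_error`, `fitness`, `crossover`) and hand-picked crossover points. The slide's fitness values (87, 84, ...) presumably involve a scaling that is not stated, so the raw 1 / standard error below will not reproduce them.

```python
import math

def standard_error(equation, rows):
    """Root-mean-square residual of equation over (x, y, z, result) rows (assumed definition)."""
    squared = [(equation(x, y, z) - result) ** 2 for x, y, z, result in rows]
    return math.sqrt(sum(squared) / len(squared))

def fitness(equation, rows, eps=1e-12):
    """Fitness = 1 / standard error; eps is an assumed guard against division by zero."""
    return 1.0 / (standard_error(equation, rows) + eps)

def get_subtree(tree, path):
    """Follow a path of child indices into a nested-tuple expression tree."""
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    head = path[0]
    return tree[:head] + (replace_subtree(tree[head], path[1:], new),) + tree[head + 1:]

def crossover(a, path_a, b, path_b):
    """Swap the subtree of a at path_a with the subtree of b at path_b."""
    sub_a, sub_b = get_subtree(a, path_a), get_subtree(b, path_b)
    return replace_subtree(a, path_a, sub_b), replace_subtree(b, path_b, sub_a)

# Parents from the slide: X + Y and (Z - X) * Y + X.
parent_1 = ("+", "X", "Y")
parent_2 = ("+", ("*", ("-", "Z", "X"), "Y"), "X")

# Swap the leaf X in parent_1 with the subtree (Z - X) in parent_2.
child_1, child_2 = crossover(parent_1, (1,), parent_2, (1, 1))
```

With these crossover points, `child_1` is the tree for (Z-X) + Y and `child_2` is the tree for X * Y + X, matching the offspring shown on the slide. A real GP would choose the crossover points at random.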

Genetic Program Overview

Generation N

Equation                Fitness
(X+Y)                   87
(X - Z) * (Y * Y)       86
ZYZY                    75
:                       :
Y                       22
Y - X                   18

Generation N+1

Equation                Fitness
(X - Z)
(X + Y) * (Y * Y)
Z + Y
:
X
Y + Y

Why discard legacy information?

Goal: Examine fitness patterns over time

Ranked population (the same list appears under Generation 1, Generation 2, and Generation 3):

Equation                          Fitness
(X+Y)                             87
(X - Z) * (Y * Y)                 86
ZY                                85
(X - Z) * (Y * Y)                 84
Y                                 79
Y - X                             75
Z + Y                             75
(X - Z) * (Y * Y)                 75
Y                                 73
Y - X                             71
(X - Z) * (Y * Y) + W + W         68
Y - X                             67
ZY                                66
(X - Z) * (Y * Y)                 66
Y                                 65
Y - X                             65
(X - Z) * (Y * Y) + W + W         64
Y - X                             64
Z - Y                             62
(X - Z) * (Y * Y)                 59
Y                                 58
Y - X                             55
(X - Z) * (Y * Y) + W + W         44

Localized? Volatile?

Proof of Concept Experiments - 1

5 experiments using synthetic equations:
  Z = W + X + Y
  Z = 2 * X + Y - W
  Z = X / Y
  Z = X^3
  Z = W^2 + W * X - Y

Data slightly perturbed to prevent premature convergence

Genetic Program
  1000 Chromosomes (Equations)
  50 Generations
  Breeding based on fitness rank
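One way to build slightly perturbed data for the five synthetic equations is sketched below. The slides do not say how the perturbation was applied, so the sampling range (1 to 10), the row count, the 1% multiplicative jitter, and the function name `make_dataset` are all assumptions made for illustration.

```python
import random

def make_dataset(target, n_rows=100, noise=0.01, seed=0):
    """Sample W, X, Y uniformly and return rows (w, x, y, z) with z slightly perturbed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Inputs drawn from [1, 10] (assumed range); keeps Y away from zero for X / Y.
        w, x, y = (rng.uniform(1.0, 10.0) for _ in range(3))
        z = target(w, x, y)
        # Multiplicative jitter of up to +/- noise (1% by default, an assumed value).
        z *= 1.0 + rng.uniform(-noise, noise)
        rows.append((w, x, y, z))
    return rows

# The five synthetic equations listed on the slide.
targets = {
    "Z = W + X + Y": lambda w, x, y: w + x + y,
    "Z = 2 * X + Y - W": lambda w, x, y: 2 * x + y - w,
    "Z = X / Y": lambda w, x, y: x / y,
    "Z = X^3": lambda w, x, y: x ** 3,
    "Z = W^2 + W * X - Y": lambda w, x, y: w ** 2 + w * x - y,
}

datasets = {name: make_dataset(fn) for name, fn in targets.items()}
```

Because the noise keeps any candidate equation from fitting the data exactly, the population is less likely to converge in the first few generations.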

Proof of Concept Experiments - 2

For the 1000 Chromosomes:
  Divide into 5 groups of 200 (by fitness)
  Focus on the best, middle, and worst groups
  See where each group's offspring occur in the next generation
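The grouping and tracking described above might be tabulated as follows. This is a sketch under assumptions: ranks are 0-based with 0 as the fittest, each lineage record is a hypothetical (parent_rank, child_rank) pair (an offspring with two parents would contribute two records), and the names `group_of` and `offspring_placement` are illustrative only.

```python
from collections import Counter

def group_of(rank, population_size=1000, n_groups=5):
    """Map a 0-based fitness rank (0 = best) to a group index (0 = best group)."""
    return rank * n_groups // population_size

def offspring_placement(lineage, population_size=1000, n_groups=5):
    """Count, per parent group, how many offspring land in each group of the next generation.

    lineage is a list of (parent_rank, child_rank) pairs with 0-based ranks.
    """
    counts = {g: Counter() for g in range(n_groups)}
    for parent_rank, child_rank in lineage:
        parent_group = group_of(parent_rank, population_size, n_groups)
        child_group = group_of(child_rank, population_size, n_groups)
        counts[parent_group][child_group] += 1
    return counts

# Hypothetical lineage records, for illustration only.
lineage = [(0, 10), (150, 199), (199, 450), (500, 420), (999, 800), (850, 990)]
placement = offspring_placement(lineage)
```

Here `placement[0]`, `placement[2]`, and `placement[4]` correspond to the best, middle, and worst groups of 200 that the experiments focus on.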

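Results for Z = W + X + Y
[Figure: panels for the Best, Middle, and Worst groups]

Results for Z = 2 * X + Y - W
[Figure: panels for the Best, Middle, and Worst groups]

Results for Z = X / Y
[Figure: panels for the Best, Middle, and Worst groups]

Results for Z = X^3
[Figure: panels for the Best, Middle, and Worst groups]

Results for Z = W^2 + W * X - Y
[Figure: panels for the Best, Middle, and Worst groups]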

Applied Experiments

Best class produces best offspring. Now what?

Compare 2 Genetic Programs (GPs)
1) Use a vanilla-based GP
2) Use a GP that breeds only the top 20% of a population and replicates 5 times.

Genetic Program
  1000 Chromosomes (Equations)
  50 Generations
  20 Trials

Equations to model
  Z = Sin(W) + Sin(X) + Sin(Y)
  Z = log10(W^X) + (Y * Z)
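The second GP's selection rule could look like the sketch below. The slide leaves the details open, so this reads "breeds only the top 20% and replicates 5 times" as: keep the fittest fifth of the ranked population and copy each survivor five times, which refills a population of 1000. That reading, and the name `lineage_based_pool`, are assumptions.

```python
def lineage_based_pool(ranked_population, top_percent=20, copies=5):
    """Keep the top share of a fitness-ranked population and replicate each survivor.

    ranked_population must already be sorted from best to worst fitness.
    """
    n_keep = len(ranked_population) * top_percent // 100
    elite = ranked_population[:n_keep]
    # Each elite member appears `copies` times in the breeding pool.
    return [individual for individual in elite for _ in range(copies)]

# Placeholder population of 1000 ranked equations (names are illustrative).
ranked = [f"eq{i}" for i in range(1, 1001)]
pool = lineage_based_pool(ranked)
```

With 1000 chromosomes, the pool holds 200 distinct equations at 5 copies each, so the population size is unchanged while the bottom 80% contributes no parents.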

Results for Z = Sin(W) + Sin(X) + Sin(Y)

                                      Vanilla-Based GP    Lineage-Based GP
Average Fitness                       591.8               740.9
Average r^2                           0.8734              0.9315
Ave. Generations needed to complete   29.1                28.5

Results for Z = log10(W^X) + (Y * Z)

                                      Vanilla-Based GP    Lineage-Based GP
Average Fitness                       210.9               346.5
Average r^2                           0.7244              0.8069
Ave. Generations needed to complete   50.0                48.6
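To put the two results slides on a common scale, the reported averages can be turned into relative changes. The snippet below only does arithmetic on the numbers from the slides; the percentage framing and the name `relative_gain` are additions for illustration, not figures reported by the authors.

```python
# Averages reported on the two results slides: (vanilla-based GP, lineage-based GP).
reported = {
    "Z = Sin(W) + Sin(X) + Sin(Y)": {
        "fitness": (591.8, 740.9),
        "r2": (0.8734, 0.9315),
        "generations": (29.1, 28.5),
    },
    "Z = log10(W^X) + (Y * Z)": {
        "fitness": (210.9, 346.5),
        "r2": (0.7244, 0.8069),
        "generations": (50.0, 48.6),
    },
}

def relative_gain(vanilla, lineage):
    """Percentage change of the lineage-based value relative to the vanilla value."""
    return 100.0 * (lineage - vanilla) / vanilla

gains = {
    equation: {metric: relative_gain(*pair) for metric, pair in metrics.items()}
    for equation, metrics in reported.items()
}

for equation, metrics in gains.items():
    summary = ", ".join(f"{metric}: {value:+.1f}%" for metric, value in metrics.items())
    print(f"{equation}: {summary}")
```

On these numbers, average fitness rises by about 25% for the sine equation and about 64% for the logarithmic one, while the average number of generations needed falls slightly in both cases.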

Conclusions

Proof of concept experiments demonstrate the viability of considering lineage in GPs

Applied experiments show that lineage-based GP modeling produces better results faster

