1 Coevolving Solutions to the Shortest Common Superstring Problem Assaf Zaritsky & Moshe Sipper Ben-Gurion University, Israel www.cs.bgu.ac.il/~assafza.


2 Outline
The “Shortest Common Superstring” problem. DNA sequencing and the input domain. Standard and cooperative coevolutionary genetic algorithm (GA). The Puzzle approach. Conclusions and future work. Messy Puzzle.

3 The Shortest Common Superstring Problem (SCS)
Let S = {s_1, …, s_n} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each s_i in S is a substring of x. Problem: find a shortest (common) superstring of S. The problem is NP-complete and MAX-SNP hard. Motivation: DNA sequencing, data compression.
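The definition translates directly into a membership test. The sketch below is illustrative only; the function name `is_superstring` and the sample blocks are ours, not from the presentation.

```python
def is_superstring(x: str, blocks: list[str]) -> bool:
    """Return True if every block occurs as a contiguous substring of x."""
    return all(block in x for block in blocks)


blocks = ["ab", "bc", "ca"]
print(is_superstring("abca", blocks))  # True: "ab", "bc" and "ca" all occur
print(is_superstring("abc", blocks))   # False: "ca" is missing
```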

4 SCS: Example
S = {ate, half, lethal, alpha, alfalfa}. A trivial superstring is “atehalflethalalphaalfalfa”, of length 25 (a simple concatenation of all blocks). A shortest common superstring is “lethalphalfalfate”, of length 17. Note that a “compressed” permutation of the blocks is itself a superstring.
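One way to read “compressed permutation” is to merge the blocks from left to right, letting each block overlap the string built so far as much as possible. The sketch below assumes that reading; the helper names `overlap` and `compress` are hypothetical. With the ordering shown, it reproduces the length-17 superstring from the slide.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(permutation: list[str]) -> str:
    """Merge blocks left to right, each with maximal overlap (assumed rule)."""
    result = ""
    for block in permutation:
        result += block[overlap(result, block):]
    return result


order = ["lethal", "alpha", "half", "alfalfa", "ate"]
s = compress(order)
print(s, len(s))  # lethalphalfalfate 17
```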

5 Approximation Algorithms
Several linear approximations for SCS have been proposed, most of which rely on greedy approaches. GREEDY is the most widely used heuristic in DNA sequencing. Conjecture [Blum 1994, Sweedyk 1999]: the superstring produced by GREEDY is at most twice as long as the optimal one. We are not aware of any previous evolutionary approach to the SCS problem.
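GREEDY is usually described as repeatedly merging the two strings with the largest overlap until a single string remains. The following is a minimal, unoptimized sketch under that description (exhaustive pair search, hypothetical function names); the slides do not show the implementation used in the experiments.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    unique = set(blocks)
    # Blocks contained in another block are covered automatically; drop them.
    strings = sorted(s for s in unique
                     if not any(s != t and s in t for t in unique))
    while len(strings) > 1:
        # Pick the ordered pair (i, j) with the largest suffix/prefix overlap.
        i, j = max(permutations(range(len(strings)), 2),
                   key=lambda ij: overlap(strings[ij[0]], strings[ij[1]]))
        merged = strings[i] + strings[j][overlap(strings[i], strings[j]):]
        strings = [s for k, s in enumerate(strings) if k not in (i, j)] + [merged]
    return strings[0] if strings else ""


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(blocks)
print(result, len(result))
```

Ties between equally good pairs are resolved here by sort order, which is an arbitrary choice; different tie-breaking can yield superstrings of different lengths.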

6 Outline (next section: DNA sequencing and the input domain)

7 DNA Sequencing
The most common application of the SCS problem.

8 DNA Sequencing (cont’d)
The problem: “read” a string of DNA. Short DNA strands can be read in the laboratory. To sequence a long DNA strand (the DNA sequence is available in many copies):
1. Cut the DNA into short fragments using restriction enzymes.
2. Sequence each of the resulting fragments.
3. Order the fragments using an SCS algorithm.

9 The Input Domain
The input strings used in the experiments were inspired by DNA sequencing.

10 Input Generation Setup: Parameters
Size of random string                                250 bits (~50 blocks) or 400 bits (~80 blocks)
Minimal block size                                   20 bits
Maximal block size                                   30 bits
Number of duplicates created from a random string    5
NB: increasing the number of blocks results in exponential growth of the problem’s complexity.
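A generator consistent with these parameters might look as follows. This is a sketch under assumptions: the slides do not say how cut positions are chosen, so block sizes are drawn uniformly from the 20 to 30 bit range, and the last fragment of each copy may be shorter than the minimum. The function name and the `seed` parameter are ours.

```python
import random


def generate_blocks(string_bits=250, copies=5, min_block=20, max_block=30, seed=0):
    """Cut several copies of one random bit string into short blocks."""
    rng = random.Random(seed)
    target = "".join(rng.choice("01") for _ in range(string_bits))
    blocks = []
    for _ in range(copies):
        pos = 0
        while pos < len(target):
            # Assumed: each copy is cut independently at random block sizes.
            size = rng.randint(min_block, max_block)
            blocks.append(target[pos:pos + size])
            pos += size
    return target, blocks


target, blocks = generate_blocks()
print(len(target), len(blocks))
```

With 250 bits and 5 copies this yields roughly 50 blocks, which matches the “~50 blocks” figure on the slide; the original string is a superstring of all blocks by construction.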

11 Outline (next section: Standard and cooperative coevolutionary genetic algorithm (GA))

12 Simple Genetic Algorithm
produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
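The loop above can be sketched in Python as a generic, problem-independent routine. Everything here is illustrative: the function signature, the weighting used for fitness-proportionate selection when minimizing, and the toy bit-counting problem are our assumptions, not the setup used in the experiments.

```python
import random


def simple_ga(init, fitness, recombine, mutate,
              pop_size=20, generations=30, seed=0):
    """Generational GA that minimizes a non-negative fitness value."""
    rng = random.Random(seed)
    # Produce and evaluate an initial population.
    population = [init(rng) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    for _ in range(generations):  # termination: fixed number of generations
        # Select: fitness-proportionate, smaller score gives larger weight.
        weights = [1.0 / (1.0 + s) for s in scores]
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            offspring.append(mutate(recombine(p1, p2, rng), rng))
        population = offspring  # generate a new population
        scores = [fitness(ind) for ind in population]
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best]


# Toy usage (not the SCS setup): minimize the number of 1s in a 10-bit list.
def init(rng):
    return [rng.randint(0, 1) for _ in range(10)]


def recombine(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(ind, rng):
    return [bit ^ 1 if rng.random() < 0.03 else bit for bit in ind]


best = simple_ga(init, sum, recombine, mutate)
print(best, sum(best))
```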

13 Simple GA for the SCS Problem
Given a set of strings as input, generate an initial population of random candidate solutions. The fitness of each individual depends on its length and accuracy. The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated. These steps are repeated a predefined number of times or until the solution is deemed satisfactory.

14 Simple GA for the SCS Problem (cont’d)
Blocks of the input set are atomic components. Representation: an individual’s genome is a sequence of blocks. An individual may have missing blocks or contain duplicate copies of the same block. Permutation representation: good or bad?

15 Simple GA for the SCS Problem (cont’d)
Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual. Genetic operators: fitness-proportionate selection; two-point recombination, which allows growth and reduction of the genome’s length; block-change mutation.

16 Simple GA for the SCS Problem (example)
S = {s_1, s_2, s_3, s_4}; s_1 = 0011, s_2 = 1100, s_3 = 1001, s_4 = 111.
Fitness( ) = |110011| + |111| = 6 + 3 = 9.
Fitness( ) = |11100111| = 8.
Recombination: p_1 = , p_2 = , p_3 = recombine_1(p_1, p_2) = , p_4 = recombine_2(p_1, p_2) = ; mutate( ) = .
[The genomes were shown graphically on the slide.]
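The fitness rule (length of the compressed genome plus the total length of uncovered blocks) can be sketched as below. Two assumptions are ours: consecutive blocks are merged with maximal overlap, and lower fitness is better. The genomes in the usage lines are hypothetical choices that reproduce the compressed strings 110011 and 11100111 quoted on the slide; the genomes actually drawn there are not recoverable.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge consecutive blocks of the genome with maximal overlap."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus the lengths of all blocks left uncovered."""
    s = compress(genome)
    uncovered = sum(len(b) for b in blocks if b not in s)
    return len(s) + uncovered


S = ["0011", "1100", "1001", "111"]
print(fitness(["1100", "0011"], S))        # 9: |110011| + |111|
print(fitness(["111", "1001", "111"], S))  # 8: |11100111|, all blocks covered
```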

17 Coevolution
Simultaneous evolution of two or more species with coupled fitness. Coevolving species either compete or cooperate. Competitive coevolution: the fitness of an individual is based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).

18 Cooperative Coevolution

19 Cooperative Coevolution (cont’d)
Cooperative coevolution involves a number of independently evolving species. Interaction between species occurs via the fitness function only. The fitness of an individual depends on its ability to collaborate with individuals from other species.

20 Cooperative Coevolution (cont’d)
Source: Potter & DeJong (1997)

21 Cooperative Coevolutionary Algorithm for the SCS Problem
Two species evolve simultaneously. The first species contains prefixes of candidate solutions to the SCS problem at hand. The second species contains candidate suffixes. The fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.

22 Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)
[Figure labels: Prefixes population, Suffixes population, Individual, Suffix Representative, Merge]

23 Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)
[Figure labels: Prefixes population, Suffixes population, Evaluate, Fitness]
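The evaluation process can be sketched as: merge an individual with the representative of the other species, then score the merged genome with the ordinary SCS fitness. The code below assumes the fitness rule of the simple GA (compressed length plus uncovered block lengths, lower is better) and assumes the representative is the individual that currently collaborates best; all function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome):
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome, blocks):
    s = compress(genome)
    return len(s) + sum(len(b) for b in blocks if b not in s)


def evaluate_prefix(prefix, suffix_rep, blocks):
    """Score a prefix individual by merging it with the suffix representative."""
    return fitness(prefix + suffix_rep, blocks)


def evaluate_suffix(suffix, prefix_rep, blocks):
    """Score a suffix individual by merging it with the prefix representative."""
    return fitness(prefix_rep + suffix, blocks)


def pick_representative(population, partner_rep, blocks, as_prefix):
    """Assumed policy: the best collaborator becomes the representative."""
    score = evaluate_prefix if as_prefix else evaluate_suffix
    return min(population, key=lambda ind: score(ind, partner_rep, blocks))


S = ["0011", "1100", "1001", "111"]
prefixes = [["1100"], ["111"]]
suffix_rep = ["0011"]
print([evaluate_prefix(p, suffix_rep, S) for p in prefixes])  # [9, 7]
print(pick_representative(prefixes, suffix_rep, S, as_prefix=True))  # ['111']
```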

24 Experiments
Compare: GREEDY, Standard GA, Cooperative Coevolution

25 Experimental Setup
Population size                     500
Number of generations               5000
Recombination rate                  0.8
Mutation rate                       0.03
Problem instances per experiment    50
Each type of GA was executed twice on each problem instance; the better of the two runs was used for statistical purposes.

26 Results: Experiment I (~50 blocks)

27 Results: Experiment II (~80 blocks)

28 Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
Algorithm      50 blocks    80 blocks
GREEDY         381 (131)    596 (196)
Genetic        280 (30)     685 (285)
Cooperative    275 (25)     547 (147)

29 Conclusion
The collaboration between the two populations results in a good decomposition of the problem into two smaller subproblems, each of which is solved using a standard GA.

30 Outline (next section: The Puzzle approach)

31 The Puzzle Algorithm

32 The Schema Theorem
“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.” Holland (1975)

33 Building Blocks Hypothesis
“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

34 Our Interpretation
“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

35 The Main Assumption
Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.

36 A Problem
Recombination may destroy quality building blocks found by the GA.

37 Example
Brain Appearance 0010101010101010101000011110100010000

38 Example (cont’d)
Brain Appearance 0010101010101010101000011110100010000
1. Smart (presumably) 2. Blond. But not very beautiful…

39 The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm

40 Puzzle Algorithm: The Idea
Improve the recombination operator. Preserve good building blocks discovered by the GA by selecting recombination loci that do not destroy them. Result: assembly of good building blocks to construct better solutions (as in a puzzle).

41 Puzzle Algorithm (cont’d)
Two populations:
1. Candidate solutions: as in the simple GA.
2. Building blocks: each individual is a sequence of blocks contained in at least one candidate solution.

42 Puzzle Algorithm (cont’d)
Interaction between candidate solutions and building blocks is through the fitness function. Interaction between building blocks and candidate solutions is through constraints on recombination points.
[Figure labels: Building blocks population, Candidate solutions population, Fitness evaluation, Crossover location]

43 Puzzle Algorithm: Zoom In
Each individual is a sequence of blocks.

44 Puzzle Algorithm: Zoom In
Each building block is contained in at least one individual in the solutions population. [Figure label: overlapping building blocks]

45 The Candidate Solutions Population
Representation, fitness evaluation, selection, and mutation are identical to the simple GA. A recombination-aid vector aids in selecting the recombination loci. The recombination-aid vector is updated by building-block individuals.

46 The Building Blocks Population
An individual is represented as a sequence of blocks contained in at least one candidate solution. The fitness of an individual is the average fitness of the candidate solutions containing it. Fitness-proportionate selection.
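The building-block fitness rule can be sketched as an average over the candidate solutions that contain the building block. The sketch assumes that “contained” means occurring as a contiguous run of blocks; the function names and the sample data are hypothetical.

```python
def contains(solution: list[str], bb: list[str]) -> bool:
    """True if bb occurs in solution as a contiguous run of blocks (assumed)."""
    m = len(bb)
    return any(solution[i:i + m] == bb for i in range(len(solution) - m + 1))


def building_block_fitness(bb, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain bb."""
    scores = [f for sol, f in zip(solutions, solution_fitness) if contains(sol, bb)]
    return sum(scores) / len(scores) if scores else None


solutions = [["a", "b", "c"], ["b", "c", "d"], ["c", "a"]]
solution_fitness = [10, 20, 30]
print(building_block_fitness(["b", "c"], solutions, solution_fitness))  # 15.0
print(building_block_fitness(["a", "c"], solutions, solution_fitness))  # None
```

Returning `None` for a building block that no solution contains is a design choice of this sketch; the slides do not say how such a case is handled.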

47 The Building Blocks Population (cont’d)
“Unisex” individuals. Two modification operators:
Expansion: increase its genome by one block. Occurs with high probability.
Exploration: “die” and start over as a new 2-block individual. Occurs with low probability.
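A possible reading of the two operators is sketched below. Several details are assumptions, since the slides only name the operators: expansion copies the neighbouring block from a candidate solution that contains the building block, exploration restarts from two adjacent blocks of a random candidate solution, and the default rates are placeholders. All names are hypothetical.

```python
import random


def occurrences(bb, solution):
    """Start indices at which bb occurs contiguously in solution."""
    m = len(bb)
    return [i for i in range(len(solution) - m + 1) if solution[i:i + m] == bb]


def expand(bb, solutions, rng):
    """Grow bb by one block taken from a solution that contains it."""
    hosts = [(sol, i) for sol in solutions for i in occurrences(bb, sol)]
    rng.shuffle(hosts)
    for sol, i in hosts:
        end = i + len(bb)
        if end < len(sol):
            return bb + [sol[end]]      # extend to the right
        if i > 0:
            return [sol[i - 1]] + bb    # extend to the left
    return bb                           # no host leaves room to grow


def explore(solutions, rng):
    """'Die' and restart as two adjacent blocks of a random solution."""
    sol = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(sol) - 1)
    return sol[i:i + 2]


def modify(bb, solutions, rng, expansion_rate=0.8, exploration_rate=0.1):
    r = rng.random()
    if r < exploration_rate:
        return explore(solutions, rng)
    if r < exploration_rate + expansion_rate:
        return expand(bb, solutions, rng)
    return bb


solutions = [["a", "b", "c", "d"], ["c", "d", "a"]]
rng = random.Random(1)
print(expand(["b", "c"], solutions, rng))  # ['b', 'c', 'd']
print(explore(solutions, rng))
```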

48 Building Blocks – Candidate Solutions
[Figure labels: Fitness evaluation, Building blocks population, Candidate solutions population, f1, f2, f3, f4]

49 Building Blocks – Candidate Solutions
[Figure labels: Fitness evaluation, Building blocks population, Candidate solutions population, f1, f2, f3, f4, Update “recombination-aid” vector]

50 Update Recombination-aid Vector
Building block #1 fitness = 0.3, building block #2 fitness = 0.4, building block #3 fitness = 0.6.
Recombination-aid vector of the solution’s genome: 0 0 0 0 0 0 0

51 Update Recombination-aid Vector
Recombination-aid vector: 0 0.6 0.4 0 0.3 0.3 0

52 Update Recombination-aid Vector
Recombination-aid vector: 0.6 0.6 0.4 0 0.3 0.3 0.3

53 Recombination-loci Selection
Recombination-aid vector: 0.6 0.6 0.4 0 0.3 0.3 0.3
* Ties are broken arbitrarily.
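The update and selection steps can be sketched as follows. The exact rules are not spelled out on the slides, so this sketch assumes that each building block found in the genome writes its fitness into the positions it covers, keeping the larger value where building blocks overlap, and that the recombination locus is the position with the smallest entry. The genome and the placement of the three building blocks are hypothetical, chosen so that the result matches the final vector shown on the slide.

```python
import random


def update_aid_vector(genome, building_blocks):
    """building_blocks is a list of (blocks, fitness) pairs."""
    vector = [0.0] * len(genome)
    for bb, bb_fitness in building_blocks:
        m = len(bb)
        for start in range(len(genome) - m + 1):
            if genome[start:start + m] == bb:
                for pos in range(start, start + m):
                    # Assumed rule: keep the highest fitness seen per position.
                    vector[pos] = max(vector[pos], bb_fitness)
    return vector


def select_locus(vector, rng):
    """Pick the least protected position; ties are broken at random."""
    lowest = min(vector)
    return rng.choice([i for i, v in enumerate(vector) if v == lowest])


genome = ["g0", "g1", "g2", "g3", "g4", "g5", "g6"]
building_blocks = [
    (["g4", "g5", "g6"], 0.3),  # building block #1
    (["g1", "g2"], 0.4),        # building block #2
    (["g0", "g1"], 0.6),        # building block #3
]
vector = update_aid_vector(genome, building_blocks)
print(vector)                                  # [0.6, 0.6, 0.4, 0.0, 0.3, 0.3, 0.3]
print(select_locus(vector, random.Random(0)))  # 3
```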

54 Experiments
Compare: GREEDY, Standard GA, Puzzle

55 Building Blocks - Experimental Setup
Population size     1000
Expansion rate      0.8
Exploration rate    0.1

56 Results: Experiment III (~50 blocks)
[Figure label: Cooperative]

57 Results: Experiment IV (~80 blocks)
[Figure label: Cooperative]
Did we lose to cooperative? NO!

58 Results: Summary
Average of the best superstring lengths (distance from optimum in parentheses)
Algorithm    50 blocks    80 blocks
GREEDY       381 (131)    596 (196)
Genetic      280 (30)     685 (285)
Puzzle       253 (3)      571 (171)

59 Relations Between The Algorithms
[Diagram nodes: GA, Puzzle, Cooperative, Co-Puzzle; arrows labeled “puzzle” and “cooperation”]

60 The Co-Puzzle Algorithm
[Figure labels: Possible building blocks population, Candidate prefixes population, Fitness eval, Crossover location; Possible building blocks population, Candidate suffixes population, Fitness eval, Crossover location; Fitness evaluation]

61 Experiments
Compare: GREEDY, Cooperative Coevolution, Co-Puzzle

62 Results: Experiment V (~80 blocks)

63 Results: Experiment VI (~50 blocks)
Puzzle????

64 Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
Algorithm      50 blocks    80 blocks
GREEDY         381 (131)    596 (196)
Cooperative    275 (25)     547 (147)
Co-puzzle      268 (18)     482 (82)
42% improvement over cooperative

65 Outline (next section: Conclusions and future work)

66 Results: Summary
Size of shortest common superstring (distance from optimum in parentheses)
Algorithm      50 blocks    80 blocks    90 blocks    100 blocks
GREEDY         381 (131)    596 (196)    677 (227)    768 (268)
Cooperative    275 (25)     547 (147)    673 (223)    768 (268)
Puzzle         253 (3)      571 (171)    683 (233)    813 (313)
Co-puzzle      268 (18)     482 (82)     617 (167)    732 (232)
20 problem instances per experiment.
Annotations: 25% better, 13% better, 83% better, 42% better

67 Larger Problems - Using More Species
Size of shortest common superstring (distance from optimum in parentheses) for GREEDY, Co-puzzle and 3-Co-puzzle. Values are listed in slide order; the assignment of values to algorithms is not recoverable from the transcript.
110 blocks: ? (?), 867 (317), 836 (286)
120 blocks: 906 (306), 992 (392), 906 (306)

68 Conclusions
Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?). When a suitable number of species is used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.

69 Conclusions (cont’d)
Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA. Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.

70 Future Work
Test the (Co-)Puzzle approach on other problem domains. A hybrid GA. Tackle larger problems. Comparison to greedy-stochastically based local-search algorithms.

71 Outline (next section: Messy Puzzle)

72 The Messy Puzzle Algorithm

73 Static Detection of Building Blocks for Addressing the Linkage Problem
Hillel Maoz, Ben-Gurion University, Israel

74 The Linkage Problem
A binary genome of size n = 14. Genes a and b together encode important information. Random crossover is applied. Survival probability = the chance of appearing in the offspring. Left genome: 4/15. Right genome: 14/15.
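One way such fractions can arise: under one-point crossover with equally likely cut points, genes a and b are separated exactly when the cut falls between them, so the chance that they stay together is one minus the fraction of cut points lying between them. The sketch below assumes this model and takes 15 cut points from the denominators on the slide. The positions of a and b in the two genomes were shown only in the figure, so the positions used here are hypothetical values that reproduce 4/15 and 14/15.

```python
from fractions import Fraction


def survival_probability(pos_a: int, pos_b: int, cut_points: int) -> Fraction:
    """Chance that one uniformly chosen cut does not fall between a and b."""
    separating = abs(pos_a - pos_b)  # number of cut points between the two genes
    return 1 - Fraction(separating, cut_points)


print(survival_probability(1, 12, 15))  # 4/15: genes far apart
print(survival_probability(6, 7, 15))   # 14/15: adjacent genes
```

The closer the two genes sit in the representation, the more likely they are to be inherited together, which is the essence of the linkage problem.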

75 The Linkage Problem (cont’d)
In many cases it is hard to know the optimal representation.

76 The MaxCut Problem
Input: an undirected weighted graph G = (V, E, W). Output: a partition of V into two disjoint sets (S, V\S). Goal: maximal sum of edge weights between the sets. NP-complete.

77 MaxCut - Example
Cut = 34; Cut = 47

78 Simple GA for MaxCut
Population of candidate solutions: give each node a number; assign ‘0’ or ‘1’ to indicate which set the node belongs to. Iteration step: select any two parents; recombine and create an offspring; repeat until a new population is generated. Fitness: the weight of the cut.
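The bit-string representation and its fitness can be sketched as below. The edge list is a made-up example (not the graph from the slides), and one-point crossover is assumed for “recombine”, which the slide does not specify.

```python
import random


def cut_weight(edges, assignment):
    """Sum of the weights of edges whose endpoints lie in different sets."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def one_point_crossover(a, b, rng):
    """Offspring takes a's genes up to a random cut and b's genes after it."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


# Hypothetical 4-node graph given as (node, node, weight) triples.
edges = [(0, 1, 3), (1, 2, 5), (2, 3, 2), (0, 3, 4), (0, 2, 1)]
parent1 = [0, 1, 0, 1]  # node i belongs to set parent1[i]
parent2 = [0, 0, 1, 1]
print(cut_weight(edges, parent1))  # 14
print(cut_weight(edges, parent2))  # 10
child = one_point_crossover(parent1, parent2, random.Random(0))
print(child, cut_weight(edges, child))
```

Because crossover keeps contiguous runs of genes together, the numbering of the nodes determines which vertices tend to be inherited as a group.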

79 The Representation Problem
“How to define the order of the vertices within the genome?”

80 Messy Genes
The main difficulty: identifying the related vertices. A messy gene is an ordered pair. Possible solution: use some sort of messy genes to detect related genes, and use the Puzzle approach to keep them together.

81 The Messy Puzzle Algorithm
A building block’s genome is represented as a sequence of messy genes.

82 Messy Puzzle Algorithm
Two-population setup as in the Puzzle algorithm. Enhanced recombination operator. Evolved building-block structure (similar to Puzzle).

83 Enhanced Recombination
I) Add the 1st BB: success. II) Add the 2nd BB: failure. III) Add the 3rd BB: success. IV) Simple crossover.
[Figure values: 0.8, 0.7, 0.6; 1 2 3 4 5 6 7 8]

84 Static Detection of Building Blocks
Building blocks do not truly evolve: there are no Expansion and Exploration operators. Building blocks’ fitness is based on a number of generations. Purpose: to check and understand the core of the messy puzzle algorithm.

85 Results
Randomly generated graphs. 1000 generations. 10 separate experiments per problem instance.
Problem instances: 1 graph_200_0.01_1, 2 graph_200_0.05_1, 3 graph_200_0.1_1, 4 graph_200_0.3_1, 5 graph_200_0.5_1, 6 graph_300_0.01_1, 7 graph_300_0.05_1, 8 graph_300_0.1_1, 9 graph_300_0.3_1, 10 graph_300_0.5_1
[Figure labels: Distance to optimum, Puzzle addition]

86 Conclusions and Future Work
Do messy work to solve the linkage problem. Even a small population of building blocks improves the GA performance. Messy puzzle is better when inner structures exist. Applying evolution to the building blocks population. Comparing to different representation-search techniques.

