
Probabilistic Modeling for Combinatorial Optimization Scott Davies School of Computer Science Carnegie Mellon University Joint work with Shumeet Baluja.




1 Probabilistic Modeling for Combinatorial Optimization
Scott Davies, School of Computer Science, Carnegie Mellon University
Joint work with Shumeet Baluja (done partially at Justsystem Pittsburgh Research Center)

2 Combinatorial Optimization
- Maximize an "evaluation function" f(x)
  - Input: a fixed-length bitstring x
  - Output: a real value
- x might represent:
  - job-shop schedules
  - TSP tours
  - discretized numeric values
  - etc.
- Our focus: "black-box" optimization
  - No domain-dependent heuristics

3 Most Commonly Used Approaches
- Hill-climbing, simulated annealing
  - Generate candidate solutions "neighboring" a single current working solution (e.g., differing by one bit)
  - Typically make no attempt to model how particular bits affect solution quality
- Genetic algorithms
  - Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
  - Use crossover and mutation operators on population members to generate new candidate solutions

4 Using Explicit Probabilistic Models
- Maintain an explicit probability distribution P from which we generate new candidate solutions:
  - Initialize P to the uniform distribution
  - Until termination criteria are met:
    - Stochastically generate K candidate solutions from P
    - Evaluate them
    - Update P to make it more likely to generate solutions "similar" to the "good" solutions
- Several different choices for what sort of P to use and how to update it after candidate solutions are evaluated
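The loop above can be sketched generically. The model below uses one independent probability per bit (a preview of PBIL, discussed later); the class, function names, and all parameter values are illustrative assumptions, not the authors' implementation:

```python
import random

class IndependentBitModel:
    """One independent probability per bit; nudged toward good samples."""
    def __init__(self, n, alpha=0.1):
        self.p = [0.5] * n          # start from the uniform distribution
        self.alpha = alpha
    def sample(self):
        return tuple(1 if random.random() < pi else 0 for pi in self.p)
    def update(self, good):
        for x in good:
            self.p = [(1 - self.alpha) * pi + self.alpha * xi
                      for pi, xi in zip(self.p, x)]

def optimize(f, model, iters=200, K=30, M=3):
    """Sample K candidates from the model, evaluate them, and shift the
    model toward the best M; return the best candidate ever seen."""
    best_x, best_val = None, float("-inf")
    for _ in range(iters):
        candidates = [model.sample() for _ in range(K)]
        candidates.sort(key=f, reverse=True)
        if f(candidates[0]) > best_val:
            best_x, best_val = candidates[0], f(candidates[0])
        model.update(candidates[:M])
    return best_x, best_val
```

Any distribution with `sample`/`update` methods can be plugged in; the later slides swap in progressively richer models.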

5 Probability Distributions Over Bitstrings
- Let x = (x_1, x_2, ..., x_n), where each x_i takes one of the values {0, 1} and n is the length of the bitstring.
- Any distribution P(x_1, ..., x_n) can be factorized bit by bit:
  P(x_1, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_n | x_1, ..., x_{n-1})
- In general, this formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.
- Obviously too many parameters to estimate from limited data!

6 Representing Independencies with Bayesian Networks
- Graphical representation of probability distributions:
  - Each variable is a vertex
  - Each variable's probability distribution is conditioned only on its parents in a directed acyclic graph ("dag")
- Example network over five variables: F (Wean Hall on Fire), D (Office Door Open), I (Ice Cream Truck Nearby), A (Fire Alarm Activated), H (Hear Bells):
  P(F, D, I, A, H) = P(F) * P(D) * P(I|F) * P(A|F) * P(H|D, I, A)
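The slide's factorization can be made concrete by filling in conditional probability tables. The numeric CPT entries below are purely hypothetical (the slide gives none); only the factorization itself comes from the slide:

```python
from itertools import product

# Hypothetical CPT entries, purely illustrative.
P_F = {1: 0.01, 0: 0.99}                                # P(F): Wean Hall on fire
P_D = {1: 0.60, 0: 0.40}                                # P(D): office door open
P_I = {1: {1: 0.05, 0: 0.95}, 0: {1: 0.10, 0: 0.90}}    # P(I | F)
P_A = {1: {1: 0.90, 0: 0.10}, 0: {1: 0.01, 0: 0.99}}    # P(A | F)
P_H = {(d, i, a): {1: p, 0: 1 - p}                      # P(H | D, I, A)
      for (d, i, a), p in {
          (0, 0, 0): 0.01, (0, 0, 1): 0.30, (0, 1, 0): 0.10, (0, 1, 1): 0.40,
          (1, 0, 0): 0.05, (1, 0, 1): 0.80, (1, 1, 0): 0.40, (1, 1, 1): 0.95,
      }.items()}

def joint(f, d, i, a, h):
    """Joint probability via the network's factorization: each variable is
    conditioned only on its parents in the dag."""
    return P_F[f] * P_D[d] * P_I[f][i] * P_A[f][a] * P_H[(d, i, a)][h]

# Sanity check: the factored distribution sums to 1 over all 2^5 assignments.
total = sum(joint(*bits) for bits in product([0, 1], repeat=5))
```

The network needs only 1 + 1 + 2 + 2 + 8 = 14 parameters instead of the 2^5 - 1 = 31 a full table would require.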

7 "Bayesian Networks" for Bitstring Optimization?
- Fully connected network: P(x_1, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_n | x_1, ..., x_{n-1})
  Yuck. Let's just assume all the bits are independent instead. (For now.)
- Empty network: P(x_1, ..., x_n) = P(x_1) P(x_2) P(x_3) ... P(x_n)
  Ah... much better.

8 Population-Based Incremental Learning
- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
  - Maintains a vector of probabilities: one independent probability P(x_i = 1) for each bit x_i
  - Until termination criteria are met:
    - Generate a population of K bitstrings from P
    - Evaluate them
    - Use each of the best M of the K to update P with a learning rate alpha:
      P(x_i = 1) <- (1 - alpha) P(x_i = 1) + alpha   if x_i is set to 1, or
      P(x_i = 1) <- (1 - alpha) P(x_i = 1)           if x_i is set to 0
    - Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings, and/or "mutate" P with random tweaks
  - Return the best bitstring ever evaluated
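The PBIL loop can be sketched as follows. The update rule follows the slide; the particular parameter values and the details of the optional mutation scheme are illustrative assumptions:

```python
import random

def pbil(f, n, generations=150, K=40, M=2, lr=0.1,
         mut_prob=0.02, mut_shift=0.05):
    """PBIL sketch: maintain one independent probability per bit, shifted
    toward the best M of each generation's K samples."""
    p = [0.5] * n
    best_x, best_val = None, float("-inf")
    for _ in range(generations):
        pop = [tuple(1 if random.random() < pi else 0 for pi in p)
               for _ in range(K)]
        pop.sort(key=f, reverse=True)
        if f(pop[0]) > best_val:
            best_x, best_val = pop[0], f(pop[0])
        for x in pop[:M]:                  # move P toward the best M bitstrings
            p = [pi * (1 - lr) + lr * xi for pi, xi in zip(p, x)]
        for i in range(n):                 # optional random "mutation" of P
            if random.random() < mut_prob:
                p[i] = p[i] * (1 - mut_shift) + mut_shift * random.randint(0, 1)
    return best_x, best_val
```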

9 PBIL vs. Discrete Learning Automata
- Equivalent to a team of discrete learning automata, one automaton per bit [Thathachar & Sastry, 1987]
  - Learning automata choose actions independently, but receive a common reinforcement signal that depends on all of their actions
  - The PBIL update rule is equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975] with "success" defined as "best in the bunch"
- However, discrete learning automata were previously used mostly on problems with few variables but noisy evaluation functions

10 PBIL vs. Genetic Algorithms
- PBIL originated as a tool for understanding GA behavior
- Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
  - Regenerates P from scratch after every generation
  - All K bitstrings are used to update P, weighted according to the probabilities that a GA would have selected them for reproduction
- Why might normal GAs be better?
  - They implicitly capture inter-bit dependencies with the population
  - However, because the model is only implicit, crossover must be randomized. Also, limited population size often leads to premature convergence based on noise in the samples.

11 Four Peaks Problem
- Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.
- Given an input vector X with N bits and a difficulty parameter T:
  - FourPeaks(T, X) = MAX(head(1, X), tail(0, X)) + Bonus(T, X)
    - head(b, X) = number of contiguous leading bits in X set to b
    - tail(b, X) = number of contiguous trailing bits in X set to b
    - Bonus(T, X) = 100 if head(1, X) > T AND tail(0, X) > T, or 0 otherwise
- A GA with one-point crossover can combine two solutions to get the Bonus relatively easily.
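The definitions above translate directly to code (a straightforward transcription, not the authors' implementation):

```python
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    n = 0
    for bit in x:
        if bit != b:
            break
        n += 1
    return n

def tail(b, x):
    """Number of contiguous trailing bits of x equal to b."""
    return head(b, list(reversed(x)))

def four_peaks(T, x):
    """MAX(head(1,X), tail(0,X)) plus a bonus of 100 when both runs exceed T."""
    bonus = 100 if head(1, x) > T and tail(0, x) > T else 0
    return max(head(1, x), tail(0, x)) + bonus
```

The two "narrow" peaks, a leading run of 1s and a trailing run of 0s each just long enough to trigger the bonus, are what make the problem hard for converging populations.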

12 Four Peaks Problem Results
- 80 GA settings tried; the best five are shown here
- Averages over 50 runs; 100 bits, varying T

13 Large-Scale PBIL Empirical Comparison
- Sorry, this slide is a placeholder for a big table on another non-PowerPoint slide!
- Gist of the results: 27 domains; comparison of
  - two versions of PBIL
  - three versions of multiple-restart hillclimbing
  - two GAs
- PBIL was best on 22 domains, hillclimbing on 3, and the GAs on 2

14 Modeling Inter-Bit Dependencies
- How about automatically learning probability distributions in which at least some dependencies between variables are modeled?
- Problem statement: given a dataset D and a set of allowable Bayesian networks {B_i}, find the B_i with the maximum posterior probability P(B_i | D) ∝ P(D | B_i) P(B_i)

15 Likelihood Scoring Function
- Log-likelihood of the dataset:
  log P(D | B) = sum_j sum_i log P(x_i = d_i^j | Pi_i = pi_i^j)
- Where:
  - d^j is the j-th datapoint
  - d_i^j is the value assigned to x_i by d^j
  - Pi_i is the set of x_i's parents in B
  - pi_i^j is the set of values assigned to Pi_i by d^j
  - P' is the empirical probability distribution exhibited by D

16 Likelihood Scoring Function, cont'd
- Fact: given that B has a network structure S, the optimal probabilities to use in B are just the empirical probabilities in D, i.e., P'. So:
  - Pick B to maximize: sum_j sum_i log P'(x_i = d_i^j | Pi_i = pi_i^j)
  - Equivalently, pick S to maximize: -|D| * sum_i H(x_i | Pi_i), i.e., minimize the summed conditional entropy of each bit given its parents

17 Entropy Calculation Example
- [Figure: node x_20 with parents x_17 and x_32; a table of counts over 80 total datapoints shows x_20's contribution to the score]
- where H(p, q) = -p log p - q log q.
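The slide's entropy function, and a node's weighted conditional-entropy contribution, can be computed from counts. The dictionary layout mapping parent configurations to count pairs is an assumption for illustration:

```python
import math

def H(p, q):
    """Binary entropy H(p, q) = -p log p - q log q (log base 2), as on the slide."""
    return sum(-v * math.log2(v) for v in (p, q) if v > 0)

def conditional_entropy(counts):
    """H(x | parents): `counts` maps each parent configuration to a pair
    (count of x=0, count of x=1); each cell's entropy is weighted by the
    fraction of datapoints falling in that cell."""
    total = sum(a + b for a, b in counts.values())
    h = 0.0
    for a, b in counts.values():
        n = a + b
        if n:
            h += (n / total) * H(a / n, b / n)
    return h
```

A parent configuration that perfectly predicts the child (a pure cell) contributes zero entropy, which is exactly what makes informative parents raise the network score.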

18 Single-Parent Tree-Shaped Networks
- Now let's allow each bit to be conditioned on at most one other bit.
- [Figure: example tree-shaped network over nine bits]
- Adding an arc from x_j to x_i increases the network score by
  H(x_i) - H(x_i | x_j)    (x_j's "information gain" with x_i)
  = H(x_j) - H(x_j | x_i)  (not necessarily obvious, but true)
  = I(x_i; x_j)            (the mutual information between x_i and x_j)

19 Optimal Single-Parent Tree-Shaped Networks
- To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(x_i; x_j) as the weight for the edge between x_i and x_j [Chow & Liu, 1968]:
  - Start with an arbitrary root node x_r.
  - Until all n nodes have been added to the tree:
    - Of all pairs of nodes x_in and x_out, where x_in has already been added to the tree but x_out has not, find the pair with the largest I(x_in; x_out).
    - Add x_out to the tree with x_in as its parent.
- Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics)

20 Optimal Dependency Trees for Combinatorial Optimization [Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria are met:
  - Build the optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1.
- Return the best bitstring ever evaluated.
- Running time: O(K*n + M*n^2) per iteration
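Generating a bitstring from a tree-shaped distribution just means sampling parents before children. A minimal sketch, assuming the conditional probabilities are given (in the full algorithm they would be estimated from the weighted dataset D):

```python
import random

def depth(parent, i):
    """Distance from node i to the root along parent pointers."""
    d = 0
    while parent[i] is not None:
        i = parent[i]
        d += 1
    return d

def sample_from_tree(parent, cond_p):
    """Draw one bitstring from a tree-shaped distribution.

    parent[i] is x_i's parent index (None for the root); cond_p[i][v] is
    the probability that x_i = 1 given its parent's value v (keyed by None
    for the root).  Sampling in depth order guarantees every parent is
    assigned before its children."""
    n = len(parent)
    x = [None] * n
    for i in sorted(range(n), key=lambda i: depth(parent, i)):
        pv = None if parent[i] is None else x[parent[i]]
        x[i] = 1 if random.random() < cond_p[i][pv] else 0
    return tuple(x)
```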

21 Tree-Based Optimization vs. MIMIC
- The tree-based optimization algorithm was inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet et al., 1997]:
  - Learned chain-shaped networks rather than tree-shaped networks
  - Dataset: the best N% of all bitstrings ever evaluated
    - +: the dataset has a simple, well-defined interpretation
    - -: have to remember bitstrings
    - -: seems to converge too quickly on some larger problems

22 MIMIC Dataset vs. Exponential-Decay Dataset
- Solving a system of linear equations the hard way
- Error of the best solution as a function of generation number, averaged over 10 problems
- 81 bits

23 Graph-Coloring Example
- Noisy graph-coloring example:
  - For each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5
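The noisy evaluation function described above is simple to transcribe (an illustrative sketch; the function name and edge representation are assumptions):

```python
import random

def noisy_coloring_score(colors, edges, p=0.5):
    """Each edge whose endpoints have different colors contributes +1
    with probability p, so repeated evaluations of the same coloring
    return different scores."""
    return sum(1 for (u, v) in edges
               if colors[u] != colors[v] and random.random() < p)
```

Setting p to 1.0 or 0.0 recovers deterministic behavior, which is handy for testing.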

24 Peaks Problems Results
- Fraction of the time the Bonus was rewarded, over 100 runs

25 Checkerboard Problem Results
- 16x16 grid of bits
- For each bit in the middle 14x14 subgrid, add +1 to the evaluation for each of its four neighbors set to the opposite value
- Results: Genetic Algorithm (avg: 740), PBIL (avg: 742), Chain (avg: 760), Tree (avg: 776)
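The checkerboard evaluation transcribes directly (a sketch, not the authors' code). Note that a perfect 16x16 checkerboard scores 14 * 14 * 4 = 784, which puts the averages quoted above in context:

```python
def checkerboard_score(grid):
    """+1 for every (interior bit, 4-neighbor) pair with opposite values;
    only bits in the middle (N-2)x(N-2) subgrid are scored."""
    N = len(grid)
    score = 0
    for r in range(1, N - 1):
        for c in range(1, N - 1):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if grid[r][c] != grid[r + dr][c + dc]:
                    score += 1
    return score
```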

26 Linear Equations Results
- 81 bits again; 9 problems, each with results averaged over 25 runs
- Minimize error

27 Modeling Higher-Order Dependencies
- The maximum spanning tree algorithm gives us the optimal Bayesian network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents, for K > 1?
  - NP-complete problem! [Chickering et al., 1995]
  - However, we can use search heuristics such as hillclimbing to look for "good" network structures (e.g., [Heckerman et al., 1995]).

28 Scoring Function for Arbitrary Networks
- Rather than restricting K directly, add a penalty term to the scoring function to limit total network size:
  Score(B) = log P(D | B) - alpha * |B|
  where alpha is the penalty factor and |B| is the number of parameters in B
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to an MDL interpretation
- The size of the penalty controls the exploration/exploitation tradeoff

29 Bayesian Network-Based Combinatorial Optimization
- Initialize D with C bitstrings from the uniform distribution, and a Bayesian network B to the empty network containing no edges
- Until termination criteria are met:
  - Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no changes increase the score:
    - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
    - Perform the change that maximizes the increase in score
  - Set B to B'.
  - Generate and evaluate K bitstrings from B.
  - Decay the weight of the datapoints in D.
  - Add the best M of the K recently generated datapoints to D.
- Return the best bitstring ever evaluated.
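The inner structure search can be sketched abstractly. This toy version assumes the `score` callable encapsulates the penalized log-likelihood and rejects cyclic graphs; everything here (names, edge representation) is an illustrative assumption rather than the authors' implementation:

```python
def structure_hillclimb(score, init_edges, all_edges):
    """Steepest-ascent search over network structures: repeatedly apply the
    single edge addition, removal, or reversal that most improves `score`,
    stopping when no change helps."""
    current = set(init_edges)
    while True:
        candidates = []
        for e in all_edges:
            if e in current:
                candidates.append(current - {e})                  # removal
                candidates.append((current - {e}) | {e[::-1]})    # reversal
            else:
                candidates.append(current | {e})                  # addition
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current                     # local optimum reached
        current = best
```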

30 Cutting Computational Costs
- Can cache contingency tables for all possible one-arc changes to the network structure
  - Only have to recompute the scores associated with at most two nodes after an arc is added, removed, or reversed
  - Prevents having to slog through the entire dataset recomputing score changes for every possible arc change whenever the dataset changes
  - However, kiss memory goodbye
- Only a few network structure changes are required after each iteration, since the dataset hasn't changed much (one or two structural changes on average)

31 Evolution of Network Complexity
- 100 bits; +1 added to the evaluation function for every bit set to the parity of the previous K bits
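The parity-based evaluation function above can be transcribed as follows. Treating the first K bits as contributing nothing (since they have no K predecessors) is my reading of the slide, not something it states:

```python
def parity_score(x, K):
    """+1 for every bit equal to the parity (XOR) of the previous K bits;
    the first K bits have no full window of predecessors and are skipped."""
    return sum(1 for i in range(K, len(x))
               if x[i] == sum(x[i - K:i]) % 2)
```

Because each bit's reward depends on exactly K predecessors, this family of functions directly stresses how many parents the learned network needs.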

32 Summation Cancellation
- Minimize the magnitudes of the cumulative sums of discretized numeric parameters (s_1, ..., s_n) represented with a standard binary encoding
- Average value over 50 runs of the best solution found in 2000 generations

33 Bayesian Networks: Empirical Results Summary
- Does better than the tree-based optimization algorithm on some toy problems:
  - Significantly better on the "Summation Cancellation" problem
  - 10% reduction in error on the System of Linear Equations problems
- Roughly the same as the tree-based algorithm on others, e.g., small Knapsack problems
- However, requires significantly more computation despite the efficiency hacks.
- Why not much better results?
  - Too much emphasis on exploitation rather than exploration?
  - Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?

34 Using Probabilistic Models for Intelligent Restarts
- The tree-based algorithm's O(n^2) execution time per generation is very expensive for large problems
- Even more so for more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g., hillclimbing

35 COMIT
- Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]:
  - Initialize a dataset D with bitstrings drawn from the uniform distribution
  - Until termination criteria are met:
    - Build the optimal dependency tree T with which to model D.
    - Generate K bitstrings from the distribution represented by T. Evaluate them.
    - Execute a hillclimbing run starting from the single best bitstring of these K.
    - Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
  - Return the best bitstring ever evaluated
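The COMIT loop can be sketched end to end. For brevity this sketch models D with independent bit frequencies where the real algorithm builds a Chow-Liu dependency tree; the hillclimber, function names, and parameter values are illustrative assumptions:

```python
import random

def simple_hillclimb(f, start, patience=30):
    """Stochastic hillclimber: flip a random bit, keep strict improvements,
    give up after `patience` consecutive non-improving flips.  Returns the
    sequence of accepted points."""
    x, tries, visited = list(start), 0, [tuple(start)]
    while tries < patience:
        i = random.randrange(len(x))
        y = x[:]
        y[i] = 1 - y[i]
        if f(tuple(y)) > f(tuple(x)):
            x, tries = y, 0
            visited.append(tuple(x))
        else:
            tries += 1
    return visited

def comit(f, n, iters=20, K=60, M=10, D_size=100):
    """COMIT sketch: fit a model to D, sample K candidates, hillclimb from
    the best sample, and fold the best discoveries back into D."""
    D = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(D_size)]
    best_x = max(D, key=f)
    for _ in range(iters):
        p = [sum(x[i] for x in D) / len(D) for i in range(n)]   # fit model to D
        samples = [tuple(1 if random.random() < pi else 0 for pi in p)
                   for _ in range(K)]
        start = max(samples, key=f)                # best sampled bitstring
        found = sorted(simple_hillclimb(f, start), key=f, reverse=True)
        D = found[:M] + D[:D_size - M]             # replace up to M entries of D
        if f(found[0]) > f(best_x):
            best_x = found[0]
    return best_x
```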

36 COMIT, cont'd
- Empirical tests were performed with a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting
- We compare COMIT against:
  - Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
  - Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution

37 COMIT: Example of Behavior
- TSP domain: 100 cities, 700 bits
- [Figure: tour length (x 10^3) vs. evaluation number (x 10^3) for the hillclimber and COMIT]

38 COMIT: Empirical Comparisons
- Each number is an average over at least 25 runs
- Highlighted: better than each non-COMIT hillclimber with P > 95%
- AHCxxx: pick the best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick the best of xxx starting points generated by the tree before hillclimbing

39 Summary
- PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well
- The algorithm using tree-based distributions seems to work even better, though at significantly greater computational expense
- More sophisticated networks: past the point of diminishing marginal returns? "Future research"
- COMIT makes the tree-based algorithm applicable to much larger problems

40 Future Work
- Making the algorithm based on complex Bayesian networks more practical
  - Combine with simpler search algorithms, a la COMIT?
- Applying COMIT to more interesting problems
  - WALKSAT?
- Using COMIT to combine the results of multiple search algorithms
- Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
  - Gaussians?
  - Kernel-based representations?
  - Hierarchical representations?
  - Hands off :-)

41 Acknowledgements
- Shumeet Baluja
- Justsystem Pittsburgh Research Center
- Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …




