Recent Progress on the Sampling Problem
Yin Tat Lee (MSR/UW), Santosh Vempala (Georgia Tech)
My Dream
Tell the complexity of a convex problem just by looking at its formulation.
Example, Minimum Cost Flow: this is a linear program in which each row has two non-zeros. It can be solved in Õ(m√n) time [LS14]. (Previous best: Õ(m√m), for a graph with m edges and n vertices.)
My Dream
Tell the complexity of a convex problem just by looking at its formulation.
Example, Submodular Minimization: minimize f(S), where f satisfies diminishing returns, i.e. f(S∪{e}) − f(S) ≤ f(T∪{e}) − f(T) for all T ⊂ S, e ∉ S.
f can be extended to a convex function on [0,1]^n, and a subgradient of f can be computed in n² time. The problem can be solved in Õ(n³) [LSW15]. (Previous best: Õ(n⁵).)
Fundamental in combinatorial optimization; worth ≥ 2 Fulkerson prizes.
Algorithmic Convex Geometry
To describe a formulation, we need some operations. Given a convex set K, we have the following:
- Membership(x): check whether x ∈ K.
- Separation(x): assert x ∈ K, or find a hyperplane separating x from K.
- Width(c): compute min_{x∈K} c^T x.
- Optimize(c): compute argmin_{x∈K} c^T x.
- Sample(g): sample according to g(x)·1_K(x). (Assume g is logconcave.)
- Integrate(g): compute ∫_K g(x) dx. (Assume g is logconcave.)
Theorem: they are all equivalent under polynomial-time reductions.
One of the major sources of polynomial-time algorithms!
Algorithmic Convex Geometry
Traditionally viewed as impractical; now we have an efficient version of the ellipsoid method.
Why these operations? For any convex f, define the conjugate f*(c) = max_x c^T x − f(x), and let ℓ_K(x) = ∞·1_{K^c}(x) be the convex indicator of K. Then:
- Membership: evaluate ℓ_K(x)
- Width: evaluate ℓ_K*(c)
- Separation: compute ∂ℓ_K(x)
- Optimization: compute ∂ℓ_K*(c) (convex optimization)
- Integration: compute ∫_K g(x) dx
- Sampling: sample ~ e^{−g}·1_K (today's focus)
Progress: we are getting the tight polynomial equivalences among the first four.
Problem: Sampling
Input: a convex set K. Output: a sample from the uniform distribution on K.
Generalized problem: given a logconcave distribution f, sample a point according to f.
Why? Sampling is useful for optimization, integration/counting, learning, and rounding. It is the best known way to minimize a convex function given only a noisy value oracle, and the only known way to compute the volume of a convex set.
Non-trivial Application: Convex Bandits
Game: for each round t = 1, 2, …, T:
- The adversary selects a convex loss function ℓ_t.
- The player chooses (possibly randomly) x_t from the unit ball in n dimensions, based on past observations.
- The player receives the loss/observation ℓ_t(x_t) ∈ [0,1]. Nothing else about ℓ_t is revealed!
Performance is measured by regret. There is a good fixed action, but we learn only one point per iteration, and the adversary can give confusing information!
(Sébastien Bubeck, Ronen Eldan)
The gold standard is O(√T) regret: even n^1000·√T is better than T^{2/3}.
Non-trivial Application: Convex Bandits
(Same game as above.) After a decade of research, we have regret R_T = Õ(n^{10.5}·√T): the first algorithm that is both polynomial time and achieves √T regret.
(Sébastien Bubeck, Ronen Eldan)
How to Input the Set
Oracle setting: a membership oracle answering YES/NO to "x ∈ K?", plus a ball x₀ + rB such that x₀ + rB ⊆ K ⊆ x₀ + poly(n)·rB.
Explicit setting: the set is given explicitly, e.g. polytopes, spectrahedra, …
In this talk, we focus on the polytope {Ax ≥ b} (m = # constraints).
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Sampling Problem
Input: a convex set K with a membership oracle. Output: a sample from the uniform distribution on K.
Conjectured lower bound: n² oracle calls.
Generalized problem: given a logconcave distribution p, sample x from p.
Conjectured Optimal Algorithm: Ball Walk
At x, pick a random y from x + δB_n; if y ∈ K, go to y; otherwise, stay and sample again.
(This walk may get trapped on one side if the set is not convex.)
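As a minimal sketch of the step above: the membership oracle is just a Python callable, and the toy example samples the unit ball. The function name `ball_walk` and the specific parameters are illustrative choices, not from the talk.

```python
import numpy as np

def ball_walk(in_K, x0, delta, steps, rng):
    """Ball walk sketch: propose y uniform in x + delta*B_n; move only if y is in K."""
    x = x0.copy()
    n = len(x)
    for _ in range(steps):
        # uniform point in a ball of radius delta: random direction, radius ~ u^(1/n)
        d = rng.standard_normal(n)
        d *= delta * rng.random() ** (1.0 / n) / np.linalg.norm(d)
        y = x + d
        if in_K(y):          # otherwise stay put ("sample again" next round)
            x = y
    return x

# toy example: the unit ball given only through a membership oracle
rng = np.random.default_rng(0)
in_ball = lambda y: np.linalg.norm(y) <= 1.0
x = ball_walk(in_ball, np.zeros(5), delta=0.3, steps=2000, rng=rng)
```

Note the walk only ever queries membership, which is exactly why it fits the oracle setting.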
Isoperimetric Constant
For any set K, we define the isoperimetric constant φ_K by
    φ_K = min_S Area(∂S) / min(vol(S), vol(S^c)).
Theorem: given a random point in K, the ball walk with step size δ generates another in O( n/(δ² φ_K²) · log(1/ε) ) iterations.
The larger φ_K or δ, the faster it mixes; but δ cannot be too large, otherwise the failure probability is close to 1.
(φ large: hard to cut the set. φ small: easy to cut the set.)
Isoperimetric Constant of Convex Sets
Note that φ_K is not affine-invariant and can be arbitrarily small. However, we can renormalize K so that Cov(K) = I.
Definition: K is isotropic if it has mean 0 and Cov(K) = I.
Theorem: if δ < 0.001/√n, the ball walk stays inside the set with constant probability.
Theorem: given a random point in an isotropic K, we can generate another in O( n²/φ_K² · log(1/ε) ) iterations.
To make the body isotropic, we can sample from it to estimate the covariance.
(The KLS constant L is defined by φ_K = 1/L.)
KLS Conjecture
Kannan-Lovász-Simonovits conjecture: for any isotropic convex K, φ_K = Ω(1).
If this is true, the ball walk takes O(n²) iterations for isotropic K, matching the believed information-theoretic lower bound.
To get the "tight" reduction from membership to sampling, it suffices to prove the KLS conjecture.
KLS Conjecture and its Related Conjectures
- Slicing conjecture: any unit-volume convex set K has a slice with volume Ω(1).
- Thin-shell conjecture: for isotropic convex K, E(‖x‖ − √n)² = O(1).
- Generalized Lévy concentration: for a logconcave distribution p and 1-Lipschitz f with Ef = 0, P(|f(x) − Ef| > t) ≤ exp(−Ω(t)).
Essentially, these ask whether all convex sets look like ellipsoids.
Main Result
What if we cut the body by spheres only? Define σ_K ≝ √( n / Var(‖X‖²) ); spherical cuts give σ_K ≥ φ_K.
- [Lovász-Simonovits 93] φ = Ω(n^{−1/2}).
- [Klartag 2006] σ = Ω(n^{−1/2} log^{1/2} n).
- [Fleury, Guédon, Paouris 2006] σ = Ω(n^{−1/2} log^{1/6} n · log^{−2} log n).
- [Klartag 2006] σ = Ω(n^{−0.4}).
- [Fleury 2010] σ = Ω(n^{−0.375}).
- [Guédon, Milman 2010] σ = Ω(n^{−0.333}).
- [Eldan 2012] φ = Ω̃(σ) = Ω̃(n^{−0.333}).
- [Lee, Vempala 2016] φ = Ω(n^{−0.25}).
In particular, we get Õ(n^{2.5}) mixing for the ball walk.
Do you know a better way to bound the mixing time of the ball walk?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting:
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Problem: Sampling a Polytope
Input: a polytope {Ax ≥ b} with m constraints and n variables. Output: a sample from the uniform distribution on it.
- [KN09] Dikin walk: mn iterations, m·n^{1.38} time per iteration.
- [LV16] Ball walk: n^{2.5} iterations, mn time per iteration.
- [LV16] Geodesic walk: m·n^{0.75} iterations, m·n^{1.38} time per iteration.
The geodesic walk is the first sub-quadratic algorithm; the m·n^{1.38} term is the cost of matrix inversion.
How Does Nature Mix Particles?
Brownian motion. It works for sampling on ℝ^n; however, a convex set has a boundary.
Option 1: reflect the motion when it hits the boundary. However, this needs tiny steps for discretization.
Option 2: remove the boundary by blowing the set up. However, this requires an explicit polytope.
Blowing Up?
Original polytope: the uniform distribution on [0,1]. After blowing up: a non-uniform distribution on the whole real line.
The distortion makes the hard constraint become "soft".
Enter Riemannian Manifolds
An n-dimensional manifold M is an n-dimensional surface. Each point p has a tangent space T_pM of dimension n, the local linear approximation of M at p; tangents of curves in M lie in T_pM.
The inner product ⟨u,v⟩_p in T_pM depends on p.
Informally, you can think of this as assigning a unit ball to every point.
Enter Riemannian Manifolds
Each point p has a linear tangent space T_pM, with inner product ⟨u,v⟩_p depending on p.
The length of a curve c : [0,1] → M is L(c) = ∫₀¹ ‖c′(t)‖_{c(t)} dt.
The distance d(x,y) is the infimum of the lengths of all paths in M between x and y.
"Generalized" Ball Walk
At x, pick a random y from D_x, where D_x = {y : d(x,y) ≤ 1}.
Hessian Manifold
A Hessian manifold is a subset of ℝ^n with the inner product ⟨u,v⟩_p = u^T ∇²φ(p) v.
For a polytope {a_i^T x ≥ b_i for all i}, we use the log barrier function φ(x) = Σ_{i=1}^m log(1/s_i(x)), where s_i(x) = a_i^T x − b_i is the distance from x to constraint i.
The norm ‖·‖_p blows up as x approaches the boundary, so our walk is slower near the boundary.
Suggested Algorithm
At x, pick a random y from D_x = {y : d(x,y) ≤ 1}, induced by the log barrier. (D_x is called the Dikin ellipsoid.)
This doesn't work! The walk converges to the boundary, since in this metric the volume of the "boundary" is +∞.
Getting the Uniform Distribution
Lemma: if p(x→y) = p(y→x), then the stationary distribution is uniform.
To make a Markov chain p symmetric, we use p̃(x→y) = min( p(x→y), p(y→x) ) for x ≠ y, with the leftover probability mass assigned to staying at x.
To implement it: sample y according to p(x→·); if p(x→y) ≤ p(y→x), go to y; otherwise, go to y with probability p(y→x)/p(x→y), and stay at x otherwise.
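The min-rule above can be checked on a tiny discrete chain: symmetrizing an asymmetric proposal kernel this way makes the transition matrix symmetric, so the uniform distribution is stationary. The 4-state matrix below is an invented toy example.

```python
import numpy as np

# toy 4-state chain with an asymmetric proposal kernel P (rows sum to 1)
P = np.array([[0.20, 0.50, 0.20, 0.10],
              [0.10, 0.30, 0.40, 0.20],
              [0.30, 0.10, 0.40, 0.20],
              [0.25, 0.25, 0.25, 0.25]])

# metropolized kernel: off-diagonal p~(x->y) = min(p(x->y), p(y->x));
# the leftover mass becomes the probability of staying at x
Q = np.minimum(P, P.T)
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))
```

Since Q is symmetric with rows summing to 1, its columns also sum to 1, which is exactly the condition for the uniform distribution to be stationary.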
Dikin Walk
At x, pick a random y from D_x:
- if x ∉ D_y, reject y;
- else, accept y with probability min(1, vol(D_x)/vol(D_y)).
[KN09] proved it takes Õ(mn) steps: better than the previous best Õ(n^{2.5}) for the oracle setting.
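A hedged sketch of one Dikin-walk step for {Ax ≥ b}, using the log-barrier Hessian H(x) = Σ_i a_i a_iᵀ/s_i(x)². The step radius `r`, the function names, and the cube example are my choices for illustration; the volume ratio uses vol(D_x) ∝ det H(x)^{−1/2}.

```python
import numpy as np

def hessian(A, b, x):
    """Log-barrier Hessian H(x) = sum_i a_i a_i^T / s_i(x)^2 for {Ax >= b}."""
    s = A @ x - b
    return (A / s[:, None] ** 2).T @ A

def dikin_step(A, b, x, r, rng):
    """One Dikin-walk step: propose y uniform in the ellipsoid
    D_x = {y : (y-x)^T H(x) (y-x) <= r^2}, then metropolize by volumes."""
    n = len(x)
    H = hessian(A, b, x)
    L = np.linalg.cholesky(np.linalg.inv(H))   # maps the unit ball onto D_x
    w = rng.standard_normal(n)
    w *= rng.random() ** (1.0 / n) / np.linalg.norm(w)
    y = x + r * L @ w
    if np.any(A @ y - b <= 0):                 # left the polytope: reject
        return x
    Hy = hessian(A, b, y)
    if (x - y) @ Hy @ (x - y) > r * r:         # x must lie in D_y, else reject
        return x
    # vol(D_x)/vol(D_y) = sqrt(det H(y) / det H(x))
    accept = min(1.0, np.sqrt(np.linalg.det(Hy) / np.linalg.det(H)))
    return y if rng.random() < accept else x

# toy run on the cube [0,1]^3, written as {x >= 0, -x >= -1}
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.concatenate([np.zeros(3), -np.ones(3)])
x = np.full(3, 0.5)
rng = np.random.default_rng(1)
for _ in range(300):
    x = dikin_step(A, b, x, r=0.5, rng=rng)
```

Because the ellipsoid shrinks near the boundary, every accepted iterate stays strictly inside the polytope.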
Dikin Walk and its Limitation
The Dikin ellipsoid is fully contained in K.
Idea: pick the next step y from a blown-up Dikin ellipsoid. We can afford to blow up by ~√(n/log m) and still have y ∈ K with high probability.
But in high dimension, vol(D_x) is not that smooth (the worst case is [0,1]^n): any larger step makes the acceptance probability exponentially small!
[0,1]^n is the worst case for the ball walk, hit-and-run, and the Dikin walk.
Going Back to Brownian Motion
The Dikin walk is not symmetric in "space": it has a tendency to move toward the center. Taking the step size to 0, the Dikin walk becomes a stochastic differential equation
    dx_t = μ(x_t) dt + σ(x_t) dW_t,
where σ(x_t) = (φ″(x_t))^{−1/2} and μ(x_t) is the drift toward the center.
What is the Drift? The Fokker-Planck Equation
The probability density of the SDE dx_t = μ(x_t) dt + σ(x_t) dW_t evolves as
    ∂p/∂t(x,t) = −∂/∂x [μ(x) p(x,t)] + (1/2) ∂²/∂x² [σ²(x) p(x,t)].
For the stationary distribution to be constant, we need
    −∂/∂x μ(x) + (1/2) ∂²/∂x² σ²(x) = 0.
Integrating once (with constant of integration 0) gives μ(x) = (1/2)(σ²)′(x) = σ(x)σ′(x).
A New Walk
A new walk: x_{t+h} = x_t + h·μ(x_t) + σ(x_t)W, with W ~ N(0, hI).
As written, this doesn't make sense: a straight-line step ignores the geometry.
Exponential Map
The exponential map exp_p : T_pM → M is defined by exp_p(v) = γ_v(1), where γ_v is the unique geodesic (locally shortest path) starting from p with initial velocity v.
Geodesic Walk
A new walk: x_{t+h} = exp_{x_t}( (h/2)·μ(x_t) + σ(x_t)W ), with W ~ N(0, hI).
However, this walk has discretization error, so we apply a Metropolis filter afterwards. Since our walk is complicated, the filter is super complicated.
Any way to avoid using the filter?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Geodesic Walk
A new walk: x_{t+h} = exp_{x_t}( (h/2)·μ(x_t) + W ), with W ~ N(0, hI).
A geodesic is better than a "straight line": it extends indefinitely, and it gives a massive cancellation.
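To make the exponential-map step concrete, here is a one-dimensional illustration (my own sketch, not the talk's implementation): the log barrier on [0,1] induces the metric g(x) = 1/x² + 1/(1−x)², and a geodesic solves x″ = −Γ(x)(x′)² with Γ = g′/(2g). The drift and the Metropolis filter are omitted.

```python
import numpy as np

def g(x):   # log-barrier metric on (0,1): phi''(x) for phi = -log x - log(1-x)
    return 1.0 / x**2 + 1.0 / (1.0 - x)**2

def gp(x):  # derivative g'(x)
    return -2.0 / x**3 + 2.0 / (1.0 - x)**3

def geodesic(x0, v0, T=1.0, steps=1000):
    """Integrate the geodesic equation x'' = -(g'/(2g)) (x')^2 with RK4.
    Returns (x(T), x'(T))."""
    def acc(x, v):
        return -0.5 * gp(x) / g(x) * v * v
    dt = T / steps
    x, v = x0, v0
    for _ in range(steps):
        k1x, k1v = v, acc(x, v)
        k2x, k2v = v + 0.5*dt*k1v, acc(x + 0.5*dt*k1x, v + 0.5*dt*k1v)
        k3x, k3v = v + 0.5*dt*k2v, acc(x + 0.5*dt*k2x, v + 0.5*dt*k2v)
        k4x, k4v = v + dt*k3v,     acc(x + dt*k3x,     v + dt*k3v)
        x += dt * (k1x + 2*k2x + 2*k3x + k4x) / 6.0
        v += dt * (k1v + 2*k2v + 2*k3v + k4v) / 6.0
    return x, v
```

Two properties worth checking numerically: running the geodesic backwards returns to the start, and the Riemannian speed g(x)·(x′)² is conserved along the path; both reflect why geodesic steps are reversible and never leave the set.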
Key Lemma 1: Provably Long Geodesics
A straight line is defined only until it hits the boundary; a geodesic is defined for all time.
Thm [LV16]: for the manifold induced by the log barrier, a random geodesic γ starting from x satisfies |a_i^T γ′(t)| ≤ O(n^{−1/4})·(a_i^T x − b_i) for 0 ≤ t ≤ Õ(n^{1/4}).
Namely, the geodesic is well behaved for a long time.
Remark: if the central path in interior point methods had this property, we would have an m^{5/4}-time algorithm for MaxFlow!
Key Lemma 2: Massive Cancellation
Consider an SDE on the 1-dimensional real line (NOT a manifold): dx_t = μ(x_t) dt + σ(x_t) dW_t.
How good is the "Euler method" x_0 + hμ(x_0) + √h·σ(x_0)W? By "Taylor" expansion,
    x_h = x_0 + hμ(x_0) + √h·σ(x_0)W + (h/2)·σ′(x_0)σ(x_0)(W² − 1) + O(h^{1.5}).
If σ′(x_0) ≠ 0, the error is O(h); if σ′(x_0) = 0, the error is O(h^{1.5}).
For the geodesic walk, σ′(x_0) = 0 (the Christoffel symbols vanish in normal coordinates).
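The expansion above is exactly the Milstein correction to the Euler-Maruyama step. A minimal sketch (the function names and the example coefficients are mine): when σ′ = 0, the two schemes coincide, mirroring why the geodesic walk avoids the O(h) error term.

```python
import math

def euler_step(x, mu, sigma, h, z):
    """Euler-Maruyama: x + mu(x) h + sigma(x) sqrt(h) z, with z ~ N(0,1)."""
    return x + mu(x) * h + sigma(x) * math.sqrt(h) * z

def milstein_step(x, mu, sigma, dsigma, h, z):
    """Milstein adds the correction (h/2) sigma'(x) sigma(x) (z^2 - 1)."""
    return euler_step(x, mu, sigma, h, z) + 0.5 * h * dsigma(x) * sigma(x) * (z * z - 1.0)

mu = lambda x: -x
const_sigma, d_const = (lambda x: 0.3), (lambda x: 0.0)      # sigma' = 0
lin_sigma, d_lin = (lambda x: 0.3 * x), (lambda x: 0.3)      # sigma' != 0
```

With constant σ the correction term is identically zero; with state-dependent σ it is the term that upgrades the strong order from 0.5 to 1.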
Convergence Theorem
Thm [LV16]: for the log barrier, the geodesic walk mixes in Õ(m·n^{0.75}) steps.
Thm [LV16]: for the log barrier on [0,1]^n, it mixes in Õ(n^{1/3}) steps. (The best bound for the ball walk, hit-and-run, and the Dikin walk on [0,1]^n is O(n²) steps.)
Our walk is similar to the Milstein method. Are higher-order methods for SDEs used in MCMC?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
How to Implement the Algorithm
Can we simply do a Taylor expansion? In high dimension, it may take n^k time to compute the k-th derivatives.
The algorithm: in the tangent space at x, pick w ~ N_x(0, I), i.e. a standard Gaussian in ‖·‖_x; compute y = exp_x( (h/2)·μ(x) + √h·w ); accept with probability min(1, p(y→x)/p(x→y)).
How do we compute the geodesic and the rejection probability? We need high accuracy for the rejection probability due to the "directedness". The geodesic is given by the geodesic equation; the probability is given by Jacobi fields.
Collocation Method for ODEs
A weakly polynomial time algorithm for some ODEs. Consider the ODE y′ = f(t, y(t)) with y(0) = y₀.
Given a degree-d polynomial q and distinct points t₁, t₂, …, t_d, let T(q) be the unique degree-d polynomial p such that p′(t_i) = f(t_i, q(t_i)) for i = 1, …, d, and p(0) = q(0).
Lem [LV16]: T is well defined, and if the t_i are Chebyshev points on [0,1], then Lip(T) = O(Lip(f)).
Thm [LV16]: if Lip(f) ≤ 0.001, we can find a fixed point of T efficiently.
Collocation Method for ODEs
Consider the ODE y′ = f(t, y(t)) with y(0) = y₀.
Thm [LV16]: suppose that Lip(f) ≤ 0.001 and there is a degree-d polynomial p such that ‖p′ − y′‖ ≤ ε. Then we can find a ỹ with ‖ỹ − y(1)‖ = O(ε) in time O(d log²(d/ε)) with O(d log(d/ε)) evaluations of f.
Remark: no need to compute f′! In general, the runtime is Õ( n·d·Lip^{O(1)}(f) ) instead.
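A minimal, derivative-free sketch of the fixed-point collocation idea (a plain Picard-style iteration at Chebyshev points, not the paper's algorithm or its Lipschitz analysis; `collocation_solve` and its parameters are my choices): maintain a degree-d polynomial through its values at the nodes, interpolate f(t, q(t)), integrate, and repeat.

```python
import numpy as np

def collocation_solve(f, y0, d=12, iters=60):
    """Fixed-point collocation sketch on [0,1]: iterate
    q <- y0 + integral of the degree-d interpolant of f(t, q(t)).
    f must be vectorized over its arguments."""
    k = np.arange(d + 1)
    t = 0.5 * (1.0 - np.cos(np.pi * k / d))    # Chebyshev points mapped to [0,1]
    q = np.full(d + 1, float(y0))              # values of q at the nodes
    for _ in range(iters):
        c = np.polyfit(t, f(t, q), d)          # interpolant of f(t, q(t))
        C = np.polyint(c)                      # its antiderivative
        q = y0 + np.polyval(C, t) - np.polyval(C, 0.0)
    C = np.polyint(np.polyfit(t, f(t, q), d))
    return y0 + np.polyval(C, 1.0) - np.polyval(C, 0.0)   # approximation of y(1)

# y' = y, y(0) = 1  =>  y(1) = e
y1 = collocation_solve(lambda t, y: y, 1.0)
```

Each iteration costs only d + 1 evaluations of f and never touches f′, which is the point of the remark above.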
How Can I Bound the 270-th Derivative?
For a function of one variable, we can estimate k-th derivatives easily.
Idea: reduce estimating derivatives of general functions to the one-variable case. In general, we write F ≼_x f if ‖D^k F(x)‖ ≤ f^{(k)}(0) for all k.
Calculus rule: if F ≼_x f and G ≼_{F(x)} g, then G∘F ≼_x g∘(f − f(0)).
Implementation Theorem
Using the trick above, we show that the geodesic can be approximated by a polynomial of degree Õ(1); hence the collocation method finds it in Õ(1) steps.
Thm [LV16]: if h ≤ n^{−1/2}, one step of the geodesic walk can be implemented in matrix multiplication time. For the hypercube, h ≤ Õ(1) suffices.
Questions
- We have no background in numerical ODEs/SDEs or Riemannian geometry, so the running time should be easy to improve.
- How can we avoid the filtering step?
- Is there a way to tell whether a walk has mixed? (Then, even if we cannot prove KLS, the algorithm could stop early.)
- Are higher-order SDE methods useful in MCMC?
- Any other suggestions/heuristics for sampling a convex set?