Recent Progress on the Sampling Problem
Yin Tat Lee (MSR/UW), Santosh Vempala (Georgia Tech)
My Dream
Tell the complexity of a convex problem just by looking at its formulation.
Example, Minimum Cost Flow: this is a linear program in which each row has two non-zeros. It can be solved in Õ(m√n) time [LS14]. (Previous best: Õ(m√m), for a graph with m edges and n vertices.)
My Dream
Tell the complexity of a convex problem just by looking at its formulation.
Example, Submodular Minimization: minimize f(S), where f satisfies diminishing returns, i.e. f(S∪{e}) − f(S) ≤ f(T∪{e}) − f(T) for all T ⊂ S, e ∉ S.
f can be extended to a convex function on [0,1]^n, and a subgradient of f can be computed in n² time. The problem can be solved in Õ(n³) [LSW15]. (Previous best: Õ(n⁵).)
Fundamental in combinatorial optimization; worth ≥ 2 Fulkerson prizes.
Algorithmic Convex Geometry
To describe a formulation, we need some operations. Given a convex set K, we have the following:
- Membership(x): check whether x ∈ K.
- Separation(x): assert x ∈ K, or find a hyperplane separating x from K.
- Width(c): compute min_{x∈K} c^T x.
- Optimize(c): compute argmin_{x∈K} c^T x.
- Sample(g): sample according to g(x)·1_K(x). (Assume g is logconcave.)
- Integrate(g): compute ∫_K g(x) dx. (Assume g is logconcave.)
Theorem: they are all equivalent under polynomial-time reductions.
One of the major sources of polynomial-time algorithms!
Algorithmic Convex Geometry
Traditionally viewed as impractical; now we have an efficient version of the ellipsoid method.
Why these operations? For any convex f, define the conjugate f*(c) = max_x c^T x − f(x), and let ℓ_K(x) = ∞·1_{K^c}(x) be the convex indicator of K. Then:
- Membership: evaluate ℓ_K(x)
- Width: evaluate ℓ_K*(c)
- Separation: compute ∂ℓ_K(x)
- Optimization: compute ∂ℓ_K*(c) (convex optimization)
- Integration: compute ∫_K g(x) dx
- Sampling: sample ~ e^{−g}·1_K (today's focus)
Progress: we are getting the tight polynomial equivalences among the first four.
Problem: Sampling
Input: a convex set K. Output: a sample from the uniform distribution on K.
Generalized problem: given a logconcave distribution f, sample a point according to f.
Why? Sampling is useful for optimization, integration/counting, learning, and rounding. It is the best known way to minimize a convex function given only a noisy value oracle, and the only known way to compute the volume of a convex set.
Non-trivial Application: Convex Bandits
Game: for each round t = 1, 2, …, T:
- The adversary selects a convex loss function ℓ_t.
- The player chooses (possibly randomly) x_t from the unit ball in n dimensions, based on past observations.
- The player receives the loss/observation ℓ_t(x_t) ∈ [0,1]. Nothing else about ℓ_t is revealed!
Performance is measured by regret. There is a good fixed action, but we learn only one point per iteration, and the adversary can give confusing information!
(Sébastien Bubeck, Ronen Eldan)
The gold standard is O(√T) regret: even n^1000·√T is better than T^{2/3}.
Non-trivial Application: Convex Bandits
(Same game as above.) After a decade of research, we have regret R_T = Õ(n^{10.5}·√T): the first algorithm that is both polynomial time and achieves √T regret.
(Sébastien Bubeck, Ronen Eldan)
How to Input the Set
Oracle setting: a membership oracle answering YES/NO to "x ∈ K?", plus a ball x₀ + rB such that x₀ + rB ⊆ K ⊆ x₀ + poly(n)·rB.
Explicit setting: the set is given explicitly, e.g. polytopes, spectrahedra, …
In this talk, we focus on the polytope {Ax ≥ b} (m = # constraints).
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Sampling Problem
Input: a convex set K with a membership oracle. Output: a sample from the uniform distribution on K.
Conjectured lower bound: n² oracle calls.
Generalized problem: given a logconcave distribution p, sample x from p.
Conjectured Optimal Algorithm: Ball Walk
At x, pick a random y from x + δB_n; if y ∈ K, go to y; otherwise, stay and sample again.
(This walk may get trapped on one side if the set is not convex.)
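As a minimal sketch of the step above: the membership oracle is just a Python callable, and the toy example samples the unit ball. The function name `ball_walk` and the specific parameters are illustrative choices, not from the talk.

```python
import numpy as np

def ball_walk(in_K, x0, delta, steps, rng):
    """Ball walk sketch: propose y uniform in x + delta*B_n; move only if y is in K."""
    x = x0.copy()
    n = len(x)
    for _ in range(steps):
        # uniform point in a ball of radius delta: random direction, radius ~ u^(1/n)
        d = rng.standard_normal(n)
        d *= delta * rng.random() ** (1.0 / n) / np.linalg.norm(d)
        y = x + d
        if in_K(y):          # otherwise stay put ("sample again" next round)
            x = y
    return x

# toy example: the unit ball given only through a membership oracle
rng = np.random.default_rng(0)
in_ball = lambda y: np.linalg.norm(y) <= 1.0
x = ball_walk(in_ball, np.zeros(5), delta=0.3, steps=2000, rng=rng)
```

Note the walk only ever queries membership, which is exactly why it fits the oracle setting.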
Isoperimetric Constant
For any set K, we define the isoperimetric constant φ_K by
    φ_K = min_S Area(∂S) / min(vol(S), vol(S^c)).
Theorem: given a random point in K, the ball walk with step size δ generates another in O( n/(δ² φ_K²) · log(1/ε) ) iterations.
The larger φ_K or δ, the faster it mixes; but δ cannot be too large, otherwise the failure probability is close to 1.
(φ large: hard to cut the set. φ small: easy to cut the set.)
Isoperimetric Constant of Convex Sets
Note that φ_K is not affine-invariant and can be arbitrarily small. However, we can renormalize K so that Cov(K) = I.
Definition: K is isotropic if it has mean 0 and Cov(K) = I.
Theorem: if δ < 0.001/√n, the ball walk stays inside the set with constant probability.
Theorem: given a random point in an isotropic K, we can generate another in O( n²/φ_K² · log(1/ε) ) iterations.
To make the body isotropic, we can sample from it to estimate the covariance.
(The KLS constant L is defined by φ_K = 1/L.)
KLS Conjecture
Kannan-Lovász-Simonovits conjecture: for any isotropic convex K, φ_K = Ω(1).
If this is true, the ball walk takes O(n²) iterations for isotropic K, matching the believed information-theoretic lower bound.
To get the "tight" reduction from membership to sampling, it suffices to prove the KLS conjecture.
KLS Conjecture and its Related Conjectures
- Slicing conjecture: any unit-volume convex set K has a slice with volume Ω(1).
- Thin-shell conjecture: for isotropic convex K, E(‖x‖ − √n)² = O(1).
- Generalized Lévy concentration: for a logconcave distribution p and 1-Lipschitz f with Ef = 0, P(|f(x) − Ef| > t) ≤ exp(−Ω(t)).
Essentially, these ask whether all convex sets look like ellipsoids.
Main Result
What if we cut the body by spheres only? Define σ_K ≝ √( n / Var(‖X‖²) ); spherical cuts give σ_K ≥ φ_K.
- [Lovász-Simonovits 93] φ = Ω(n^{−1/2}).
- [Klartag 2006] σ = Ω(n^{−1/2} log^{1/2} n).
- [Fleury, Guédon, Paouris 2006] σ = Ω(n^{−1/2} log^{1/6} n · log^{−2} log n).
- [Klartag 2006] σ = Ω(n^{−0.4}).
- [Fleury 2010] σ = Ω(n^{−0.375}).
- [Guédon, Milman 2010] σ = Ω(n^{−0.333}).
- [Eldan 2012] φ = Ω̃(σ) = Ω̃(n^{−0.333}).
- [Lee, Vempala 2016] φ = Ω(n^{−0.25}).
In particular, we get Õ(n^{2.5}) mixing for the ball walk.
Do you know a better way to bound the mixing time of the ball walk?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting:
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Problem: Sampling a Polytope
Input: a polytope {Ax ≥ b} with m constraints and n variables. Output: a sample from the uniform distribution on it.
- [KN09] Dikin walk: mn iterations, m·n^{1.38} time per iteration.
- [LV16] Ball walk: n^{2.5} iterations, mn time per iteration.
- [LV16] Geodesic walk: m·n^{0.75} iterations, m·n^{1.38} time per iteration.
The geodesic walk is the first sub-quadratic algorithm; the m·n^{1.38} term is the cost of matrix inversion.
How Does Nature Mix Particles?
Brownian motion. It works for sampling on ℝ^n; however, a convex set has a boundary.
Option 1: reflect the motion when it hits the boundary. However, this needs tiny steps for discretization.
Option 2: remove the boundary by blowing the set up. However, this requires an explicit polytope.
Blowing Up?
Original polytope: the uniform distribution on [0,1]. After blowing up: a non-uniform distribution on the whole real line.
The distortion makes the hard constraint become "soft".
Enter Riemannian Manifolds
An n-dimensional manifold M is an n-dimensional surface. Each point p has a tangent space T_pM of dimension n, the local linear approximation of M at p; tangents of curves in M lie in T_pM.
The inner product ⟨u,v⟩_p in T_pM depends on p.
Informally, you can think of this as assigning a unit ball to every point.
Enter Riemannian Manifolds
Each point p has a linear tangent space T_pM, with inner product ⟨u,v⟩_p depending on p.
The length of a curve c : [0,1] → M is L(c) = ∫₀¹ ‖c′(t)‖_{c(t)} dt.
The distance d(x,y) is the infimum of the lengths of all paths in M between x and y.
"Generalized" Ball Walk
At x, pick a random y from D_x, where D_x = {y : d(x,y) ≤ 1}.
Hessian Manifold
A Hessian manifold is a subset of ℝ^n with the inner product ⟨u,v⟩_p = u^T ∇²φ(p) v.
For a polytope {a_i^T x ≥ b_i for all i}, we use the log barrier function φ(x) = Σ_{i=1}^m log(1/s_i(x)), where s_i(x) = a_i^T x − b_i is the distance from x to constraint i.
The norm ‖·‖_p blows up as x approaches the boundary, so our walk is slower near the boundary.
Suggested Algorithm
At x, pick a random y from D_x = {y : d(x,y) ≤ 1}, induced by the log barrier. (D_x is called the Dikin ellipsoid.)
This doesn't work! The walk converges to the boundary, since in this metric the volume of the "boundary" is +∞.
Getting the Uniform Distribution
Lemma: if p(x→y) = p(y→x), then the stationary distribution is uniform.
To make a Markov chain p symmetric, we use p̃(x→y) = min( p(x→y), p(y→x) ) for x ≠ y, with the leftover probability mass assigned to staying at x.
To implement it: sample y according to p(x→·); if p(x→y) ≤ p(y→x), go to y; otherwise, go to y with probability p(y→x)/p(x→y), and stay at x otherwise.
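The min-rule above can be checked on a tiny discrete chain: symmetrizing an asymmetric proposal kernel this way makes the transition matrix symmetric, so the uniform distribution is stationary. The 4-state matrix below is an invented toy example.

```python
import numpy as np

# toy 4-state chain with an asymmetric proposal kernel P (rows sum to 1)
P = np.array([[0.20, 0.50, 0.20, 0.10],
              [0.10, 0.30, 0.40, 0.20],
              [0.30, 0.10, 0.40, 0.20],
              [0.25, 0.25, 0.25, 0.25]])

# metropolized kernel: off-diagonal p~(x->y) = min(p(x->y), p(y->x));
# the leftover mass becomes the probability of staying at x
Q = np.minimum(P, P.T)
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))
```

Since Q is symmetric with rows summing to 1, its columns also sum to 1, which is exactly the condition for the uniform distribution to be stationary.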
Dikin Walk
At x, pick a random y from D_x:
- if x ∉ D_y, reject y;
- else, accept y with probability min(1, vol(D_x)/vol(D_y)).
[KN09] proved it takes Õ(mn) steps: better than the previous best Õ(n^{2.5}) for the oracle setting.
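A hedged sketch of one Dikin-walk step for {Ax ≥ b}, using the log-barrier Hessian H(x) = Σ_i a_i a_iᵀ/s_i(x)². The step radius `r`, the function names, and the cube example are my choices for illustration; the volume ratio uses vol(D_x) ∝ det H(x)^{−1/2}.

```python
import numpy as np

def hessian(A, b, x):
    """Log-barrier Hessian H(x) = sum_i a_i a_i^T / s_i(x)^2 for {Ax >= b}."""
    s = A @ x - b
    return (A / s[:, None] ** 2).T @ A

def dikin_step(A, b, x, r, rng):
    """One Dikin-walk step: propose y uniform in the ellipsoid
    D_x = {y : (y-x)^T H(x) (y-x) <= r^2}, then metropolize by volumes."""
    n = len(x)
    H = hessian(A, b, x)
    L = np.linalg.cholesky(np.linalg.inv(H))   # maps the unit ball onto D_x
    w = rng.standard_normal(n)
    w *= rng.random() ** (1.0 / n) / np.linalg.norm(w)
    y = x + r * L @ w
    if np.any(A @ y - b <= 0):                 # left the polytope: reject
        return x
    Hy = hessian(A, b, y)
    if (x - y) @ Hy @ (x - y) > r * r:         # x must lie in D_y, else reject
        return x
    # vol(D_x)/vol(D_y) = sqrt(det H(y) / det H(x))
    accept = min(1.0, np.sqrt(np.linalg.det(Hy) / np.linalg.det(H)))
    return y if rng.random() < accept else x

# toy run on the cube [0,1]^3, written as {x >= 0, -x >= -1}
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.concatenate([np.zeros(3), -np.ones(3)])
x = np.full(3, 0.5)
rng = np.random.default_rng(1)
for _ in range(300):
    x = dikin_step(A, b, x, r=0.5, rng=rng)
```

Because the ellipsoid shrinks near the boundary, every accepted iterate stays strictly inside the polytope.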
Dikin Walk and its Limitation
The Dikin ellipsoid is fully contained in K.
Idea: pick the next step y from a blown-up Dikin ellipsoid. We can afford to blow up by ~√(n/log m) and still have y ∈ K with high probability.
But in high dimension, vol(D_x) is not that smooth (the worst case is [0,1]^n): any larger step makes the acceptance probability exponentially small!
[0,1]^n is the worst case for the ball walk, hit-and-run, and the Dikin walk.
Going Back to Brownian Motion
The Dikin walk is not symmetric in "space": it has a tendency to move toward the center. Taking the step size to 0, the Dikin walk becomes a stochastic differential equation
    dx_t = μ(x_t) dt + σ(x_t) dW_t,
where σ(x_t) = (φ″(x_t))^{−1/2} and μ(x_t) is the drift toward the center.
What is the Drift? The Fokker-Planck Equation
The probability density of the SDE dx_t = μ(x_t) dt + σ(x_t) dW_t evolves as
    ∂p/∂t(x,t) = −∂/∂x [μ(x) p(x,t)] + (1/2) ∂²/∂x² [σ²(x) p(x,t)].
For the stationary distribution to be constant, we need
    −∂/∂x μ(x) + (1/2) ∂²/∂x² σ²(x) = 0.
Integrating once (with constant of integration 0) gives μ(x) = (1/2)(σ²)′(x) = σ(x)σ′(x).
A New Walk
A new walk: x_{t+h} = x_t + h·μ(x_t) + σ(x_t)W, with W ~ N(0, hI).
As written, this doesn't make sense: a straight-line step ignores the geometry.
Exponential Map
The exponential map exp_p : T_pM → M is defined by exp_p(v) = γ_v(1), where γ_v is the unique geodesic (locally shortest path) starting from p with initial velocity v.
Geodesic Walk
A new walk: x_{t+h} = exp_{x_t}( (h/2)·μ(x_t) + σ(x_t)W ), with W ~ N(0, hI).
However, this walk has discretization error, so we apply a Metropolis filter afterwards. Since our walk is complicated, the filter is super complicated.
Any way to avoid using the filter?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
Geodesic Walk
A new walk: x_{t+h} = exp_{x_t}( (h/2)·μ(x_t) + W ), with W ~ N(0, hI).
A geodesic is better than a "straight line": it extends indefinitely, and it gives a massive cancellation.
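To make the exponential-map step concrete, here is a one-dimensional illustration (my own sketch, not the talk's implementation): the log barrier on [0,1] induces the metric g(x) = 1/x² + 1/(1−x)², and a geodesic solves x″ = −Γ(x)(x′)² with Γ = g′/(2g). The drift and the Metropolis filter are omitted.

```python
import numpy as np

def g(x):   # log-barrier metric on (0,1): phi''(x) for phi = -log x - log(1-x)
    return 1.0 / x**2 + 1.0 / (1.0 - x)**2

def gp(x):  # derivative g'(x)
    return -2.0 / x**3 + 2.0 / (1.0 - x)**3

def geodesic(x0, v0, T=1.0, steps=1000):
    """Integrate the geodesic equation x'' = -(g'/(2g)) (x')^2 with RK4.
    Returns (x(T), x'(T))."""
    def acc(x, v):
        return -0.5 * gp(x) / g(x) * v * v
    dt = T / steps
    x, v = x0, v0
    for _ in range(steps):
        k1x, k1v = v, acc(x, v)
        k2x, k2v = v + 0.5*dt*k1v, acc(x + 0.5*dt*k1x, v + 0.5*dt*k1v)
        k3x, k3v = v + 0.5*dt*k2v, acc(x + 0.5*dt*k2x, v + 0.5*dt*k2v)
        k4x, k4v = v + dt*k3v,     acc(x + dt*k3x,     v + dt*k3v)
        x += dt * (k1x + 2*k2x + 2*k3x + k4x) / 6.0
        v += dt * (k1v + 2*k2v + 2*k3v + k4v) / 6.0
    return x, v
```

Two properties worth checking numerically: running the geodesic backwards returns to the start, and the Riemannian speed g(x)·(x′)² is conserved along the path; both reflect why geodesic steps are reversible and never leave the set.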
Key Lemma 1: Provably Long Geodesics
A straight line is defined only until it hits the boundary; a geodesic is defined for all time.
Thm [LV16]: for the manifold induced by the log barrier, a random geodesic γ starting from x satisfies |a_i^T γ′(t)| ≤ O(n^{−1/4})·(a_i^T x − b_i) for 0 ≤ t ≤ Õ(n^{1/4}).
Namely, the geodesic is well behaved for a long time.
Remark: if the central path in interior point methods had this property, we would have an m^{5/4}-time algorithm for MaxFlow!
Key Lemma 2: Massive Cancellation
Consider an SDE on the 1-dimensional real line (NOT a manifold): dx_t = μ(x_t) dt + σ(x_t) dW_t.
How good is the "Euler method" x_0 + hμ(x_0) + √h·σ(x_0)W? By "Taylor" expansion,
    x_h = x_0 + hμ(x_0) + √h·σ(x_0)W + (h/2)·σ′(x_0)σ(x_0)(W² − 1) + O(h^{1.5}).
If σ′(x_0) ≠ 0, the error is O(h); if σ′(x_0) = 0, the error is O(h^{1.5}).
For the geodesic walk, σ′(x_0) = 0 (the Christoffel symbols vanish in normal coordinates).
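The expansion above is exactly the Milstein correction to the Euler-Maruyama step. A minimal sketch (the function names and the example coefficients are mine): when σ′ = 0, the two schemes coincide, mirroring why the geodesic walk avoids the O(h) error term.

```python
import math

def euler_step(x, mu, sigma, h, z):
    """Euler-Maruyama: x + mu(x) h + sigma(x) sqrt(h) z, with z ~ N(0,1)."""
    return x + mu(x) * h + sigma(x) * math.sqrt(h) * z

def milstein_step(x, mu, sigma, dsigma, h, z):
    """Milstein adds the correction (h/2) sigma'(x) sigma(x) (z^2 - 1)."""
    return euler_step(x, mu, sigma, h, z) + 0.5 * h * dsigma(x) * sigma(x) * (z * z - 1.0)

mu = lambda x: -x
const_sigma, d_const = (lambda x: 0.3), (lambda x: 0.0)      # sigma' = 0
lin_sigma, d_lin = (lambda x: 0.3 * x), (lambda x: 0.3)      # sigma' != 0
```

With constant σ the correction term is identically zero; with state-dependent σ it is the term that upgrades the strong order from 0.5 to 1.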
Convergence Theorem
Thm [LV16]: for the log barrier, the geodesic walk mixes in Õ(m·n^{0.75}) steps.
Thm [LV16]: for the log barrier on [0,1]^n, it mixes in Õ(n^{1/3}) steps. (The best bound for the ball walk, hit-and-run, and the Dikin walk on [0,1]^n is O(n²) steps.)
Our walk is similar to the Milstein method. Are higher-order methods for SDEs used in MCMC?
Outline
Oracle setting:
- Introduce the ball walk
- KLS conjecture and its related conjectures
- Main result
Explicit setting (the originally promised talk):
- Introduce the geodesic walk
- Bound the number of iterations
- Bound the cost per iteration
How to Implement the Algorithm
Can we simply do a Taylor expansion? In high dimension, it may take n^k time to compute the k-th derivatives.
The algorithm: in the tangent space at x, pick w ~ N_x(0, I), i.e. a standard Gaussian in ‖·‖_x; compute y = exp_x( (h/2)·μ(x) + √h·w ); accept with probability min(1, p(y→x)/p(x→y)).
How do we compute the geodesic and the rejection probability? We need high accuracy for the rejection probability due to the "directedness". The geodesic is given by the geodesic equation; the probability is given by Jacobi fields.
Collocation Method for ODEs
A weakly polynomial time algorithm for some ODEs. Consider the ODE y′ = f(t, y(t)) with y(0) = y₀.
Given a degree-d polynomial q and distinct points t₁, t₂, …, t_d, let T(q) be the unique degree-d polynomial p such that p′(t_i) = f(t_i, q(t_i)) for i = 1, …, d, and p(0) = q(0).
Lem [LV16]: T is well defined, and if the t_i are Chebyshev points on [0,1], then Lip(T) = O(Lip(f)).
Thm [LV16]: if Lip(f) ≤ 0.001, we can find a fixed point of T efficiently.
Collocation Method for ODEs
Consider the ODE y′ = f(t, y(t)) with y(0) = y₀.
Thm [LV16]: suppose that Lip(f) ≤ 0.001 and there is a degree-d polynomial p such that ‖p′ − y′‖ ≤ ε. Then we can find a ỹ with ‖ỹ − y(1)‖ = O(ε) in time O(d log²(d/ε)) with O(d log(d/ε)) evaluations of f.
Remark: no need to compute f′! In general, the runtime is Õ( n·d·Lip^{O(1)}(f) ) instead.
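A minimal, derivative-free sketch of the fixed-point collocation idea (a plain Picard-style iteration at Chebyshev points, not the paper's algorithm or its Lipschitz analysis; `collocation_solve` and its parameters are my choices): maintain a degree-d polynomial through its values at the nodes, interpolate f(t, q(t)), integrate, and repeat.

```python
import numpy as np

def collocation_solve(f, y0, d=12, iters=60):
    """Fixed-point collocation sketch on [0,1]: iterate
    q <- y0 + integral of the degree-d interpolant of f(t, q(t)).
    f must be vectorized over its arguments."""
    k = np.arange(d + 1)
    t = 0.5 * (1.0 - np.cos(np.pi * k / d))    # Chebyshev points mapped to [0,1]
    q = np.full(d + 1, float(y0))              # values of q at the nodes
    for _ in range(iters):
        c = np.polyfit(t, f(t, q), d)          # interpolant of f(t, q(t))
        C = np.polyint(c)                      # its antiderivative
        q = y0 + np.polyval(C, t) - np.polyval(C, 0.0)
    C = np.polyint(np.polyfit(t, f(t, q), d))
    return y0 + np.polyval(C, 1.0) - np.polyval(C, 0.0)   # approximation of y(1)

# y' = y, y(0) = 1  =>  y(1) = e
y1 = collocation_solve(lambda t, y: y, 1.0)
```

Each iteration costs only d + 1 evaluations of f and never touches f′, which is the point of the remark above.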
How Can I Bound the 270-th Derivative?
For a function of one variable, we can estimate k-th derivatives easily.
Idea: reduce estimating derivatives of general functions to the one-variable case. In general, we write F ≼_x f if ‖D^k F(x)‖ ≤ f^{(k)}(0) for all k.
Calculus rule: if F ≼_x f and G ≼_{F(x)} g, then G∘F ≼_x g∘(f − f(0)).
Implementation Theorem
Using the trick above, we show that the geodesic can be approximated by a polynomial of degree Õ(1); hence the collocation method finds it in Õ(1) steps.
Thm [LV16]: if h ≤ n^{−1/2}, one step of the geodesic walk can be implemented in matrix multiplication time. For the hypercube, h ≤ Õ(1) suffices.
Questions
- We have no background in numerical ODEs/SDEs or Riemannian geometry, so the running time should be easy to improve.
- How can we avoid the filtering step?
- Is there a way to tell whether a walk has mixed? (Then, even if we cannot prove KLS, the algorithm could stop early.)
- Are higher-order SDE methods useful in MCMC?
- Any other suggestions/heuristics for sampling a convex set?