Log-Sobolev Inequality on the Multislice (and what those words mean)

Presentation transcript:

Log-Sobolev Inequality on the Multislice (and what those words mean). Ryan O’Donnell (Carnegie Mellon), Yuval Filmus (Technion), Xinyu Wu (Carnegie Mellon).

Random walks on (regular) graphs. Boolean Cube: V = {0,1}^n, Hamming-distance-1 edges. Hamming “Slice”: Boolean strings of Hamming weight k; edge when two strings differ by a transposition. “Multislice”: e.g., ternary strings with exactly k_1 1’s, k_2 2’s, and k_3 3’s; edge when two strings differ by a transposition. Symmetric group: V = S_n, transposition edges. Also: Grassmann graph, association schemes, polar spaces, …
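As an illustration (not from the talk), here is a minimal Python sketch of a small “Multislice” and its transposition edges; the helper names multislice and transposition_neighbors are made up for this example.

from itertools import permutations

def multislice(counts):
    # All strings using symbol s exactly counts[s] times, e.g. {1: 2, 2: 1, 3: 1}.
    base = [s for s, k in counts.items() for _ in range(k)]
    return sorted(set(permutations(base)))

def transposition_neighbors(u):
    # Strings reachable from u by swapping two coordinates holding different symbols.
    out = set()
    for i in range(len(u)):
        for j in range(i + 1, len(u)):
            if u[i] != u[j]:
                v = list(u)
                v[i], v[j] = v[j], v[i]
                out.add(tuple(v))
    return out

verts = multislice({1: 2, 2: 1, 3: 1})   # ternary strings with two 1's, one 2, one 3
print(len(verts), "vertices;", len(transposition_neighbors(verts[0])), "neighbors of", verts[0])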

Log-Sobolev inequalities: Related to mixing time for random walk

Conductance / Expansion. Let S ⊆ {0,1}^n be a starting set. Pick u ~ S at random and take one step from u to v. Ask: did we escape from S? Φ(S) = Pr_{u ~ S, v ~ u}[v ∉ S] (conductance / expansion / boundary size).

Examples. Φ(S) = Pr_{u ~ S, v ~ u}[v ∉ S]. S = { u ∈ {0,1}^n : u_1 = 1 }: Φ(S) = 1/n. S = { u : HamWeight(u) > n/2 }: Φ(S) = Θ(1/√n). S = { u : HamWeight(u) is odd }: Φ(S) = 1. These S are all large: vol(S) = Pr[u ∈ S | u uniform] = 1/2.

Examples, continued: some S with exponentially small vol(S). S = { 111∙∙∙1 }: Φ(S) = 1. S = { u : HamWeight(u) > (3/4)n }: Φ(S) ≈ 3/4. S = { u : u_1 = u_2 = ∙∙∙ = u_{n/2} = 1 }: Φ(S) = 1/2. This is the “Small Set Expansion” phenomenon in the Boolean cube.
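As a sanity check (not from the talk), here is a small Python computation of a few of these conductance values, done exactly on {0,1}^n for a modest n (n = 12 is just an arbitrary choice):

def conductance(S, n):
    # Phi(S): average, over u in S, of the fraction of u's n neighbors that lie outside S.
    S = set(S)
    return sum(sum((u ^ (1 << i)) not in S for i in range(n)) / n for u in S) / len(S)

n = 12
cube = range(2 ** n)
dictator = [u for u in cube if u & 1]                       # u_1 = 1
parity   = [u for u in cube if bin(u).count("1") % 2 == 1]  # odd Hamming weight
subcube  = [u for u in cube if u % (1 << (n // 2)) == (1 << (n // 2)) - 1]  # u_1 = ... = u_{n/2} = 1
allones  = [2 ** n - 1]                                     # the single string 11...1
print(conductance(dictator, n), "vs 1/n =", 1 / n)
print(conductance(parity, n), "vs 1")
print(conductance(subcube, n), "vs 1/2")
print(conductance(allones, n), "vs 1")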

Isoperimetric Problem: Among all S of fixed vol(S), how small can Φ(S) be? Ancient combinatorics on the Boolean cube: the exact minimizer S for every vol(S) is known. Log-Sobolev inequalities take a more analytic approach, and they are the only known way to answer: “Among all S of fixed vol(S), how small can Φ_t(S) be?”, where Φ_t(S) = Pr_{u ~ S, v ~ t steps from u}[v ∉ S].

A related question: If we pick u ~ S and do a t-step random walk to v, how close is v’s distribution to the uniform distribution? Picture the chain u = u_0, u_1, u_2, u_3, ∙∙∙, u_{t−1}, u_t = v: u_0 is uniform on S, u_1 has a slightly less ‘spiky’ distribution, u_2 an even less ‘spiky’ one, ∙∙∙, u_{t−1} a pretty ‘smooth/flat’ distribution (?), u_t an even ‘smoother/flatter’ one, and u_∞ has the uniform distribution.

Helpful intuition: Pretend each distribution in this chain is uniform on a subset.

Under this intuition, u_0 is uniform on S_0 = S, u_1 is “uniform on the neighbors S_1 of S_0”, u_2 is “uniform on the neighbors S_2 of S_1”, …, and u_∞ is uniform on all of {0,1}^n. “Small Set Expansion” idea: Φ(S_i) is always “large” so long as S_i is “small” ⇒ the walk mixes quickly, at least at the beginning.

But if S_i reaches, say, { u : u_1 = 1 }, the walk will take Θ(n) further steps to make more progress.
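To make the picture concrete (not from the talk), here is a Python sketch on a small cube; it assumes a lazy walk (stay put with probability 1/2, else flip a random coordinate) so that the walk actually converges, and prints the total-variation distance of u_t’s distribution from uniform:

import numpy as np

n = 8
N = 2 ** n
P = np.zeros((N, N))                 # lazy walk: w.p. 1/2 stay, else flip a random coordinate
for u in range(N):
    P[u, u] = 0.5
    for i in range(n):
        P[u, u ^ (1 << i)] = 0.5 / n

S = [u for u in range(N) if (u & 3) == 3]      # u_1 = u_2 = 1, so vol(S) = 1/4
p = np.zeros(N); p[S] = 1 / len(S)             # u_0 is uniform on S
uniform = np.full(N, 1 / N)
for t in range(33):
    if t % 8 == 0:
        print(t, round(0.5 * np.abs(p - uniform).sum(), 4))   # TV distance to uniform
    p = p @ P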

This intuition is well captured by “Log-Sobolev inequalities”. We’ll need to quantify “distance of a distribution from uniform”. There are zillions of “distances” for probability distributions: total variation, Hellinger, KL divergence, χ²-distance, L^p-distances… We’ll get to these. But first, an easier cousin of Log-Sobolev inequalities…
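As a tiny illustration (not from the talk), here are a few of these distances between a distribution p and the uniform distribution on a 16-point set, in Python (the particular p is just an arbitrary choice):

import numpy as np

N = 16
p = np.zeros(N); p[:4] = 1 / 4                        # uniform on a 4-element subset
u = np.full(N, 1 / N)
tv   = 0.5 * np.abs(p - u).sum()                      # total variation distance
hel2 = 0.5 * ((np.sqrt(p) - np.sqrt(u)) ** 2).sum()   # squared Hellinger distance
chi2 = ((p - u) ** 2 / u).sum()                       # chi-squared distance
mask = p > 0
kl   = (p[mask] * np.log(p[mask] / u[mask])).sum()    # KL divergence
print(tv, hel2, chi2, kl)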

Expansion vs. volume via eigenvalues. Poincaré Inequality: it implies Φ(S) ≥ (2/n)(1−vol(S)). Hence vol(S) ≤ 1/2 ⇒ Φ(S) ≥ 1/n. Better than nothing, but it doesn’t capture “Small Set Expansion”.

More generally, the Poincaré Inequality is a statement about an arbitrary probability distribution p on {0,1}^n, and it implies the set statement above by taking p = Unif_S.

The Poincaré Inequality for distributions on {0,1}^n is exactly equivalent to: “the second-eigenvalue gap of the Boolean cube’s random walk matrix is ≥ 2/n.”

Poincaré Inequality: for any probability distribution p on {0,1}^n, (average “local” L²-distance)² ≥ (2/n) · (global L²-distance of p from uniformity)².
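Here is a quick numerical check (not from the slides) in Python that the second-eigenvalue gap of the cube’s random walk matrix is exactly 2/n, for a small n:

import numpy as np

n = 6
N = 2 ** n
P = np.zeros((N, N))               # one step: flip a uniformly random coordinate
for u in range(N):
    for i in range(n):
        P[u, u ^ (1 << i)] = 1 / n
eigs = np.sort(np.linalg.eigvalsh(P))[::-1]   # P is symmetric, so eigvalsh applies
print("second-eigenvalue gap =", 1 - eigs[1], "  2/n =", 2 / n)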

Log-Sobolev Inequality [Gross’75]. It implies: Φ(S) ≥ ½·(2/n)·ln(1/vol(S)). So again, for vol(S) ≈ 1/2 this only gives Φ(S) ≥ Ω(1/n), but for vol(S) = 2^−Θ(n) you get Φ(S) ≥ Ω(1)! Small Set Expansion! This inequality is sharp (up to Θ(1)) for all values of vol(S).
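A quick check of this bound (not from the slides), in Python, against the subcubes S = { u : u_1 = ∙∙∙ = u_k = 1 }, for which Φ(S) = k/n exactly and vol(S) = 2^−k:

import math

n = 100
for k in (1, 10, 50, 75):
    phi   = k / n                              # exact conductance of the subcube
    bound = 0.5 * (2 / n) * k * math.log(2)    # (1/2)(2/n) ln(1/vol(S)) with vol(S) = 2^-k
    print(k, phi, round(bound, 3), phi >= bound)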

More generally, the Log-Sobolev Inequality is the following statement: for any probability distribution p on {0,1}^n, the average local Hellinger²-distance is at least (2/n) times a global distance of p from uniformity.

Taking p = Unif_S in this general statement yields the set statement Φ(S) ≥ ½·(2/n)·ln(1/vol(S)) above.

What else is great about the Log-Sobolev Inequality: it “tensorizes”, i.e., it behaves beautifully under taking product graphs. The Hamming cube is the n-fold product of a single-edge graph.

The factor 2/n appearing in the inequality is the “log-Sobolev constant” of the Hamming cube.

The tensorization property ⇒ the log-Sobolev constant for {0,1}^n is (1/n) × (the log-Sobolev constant for the single-edge graph).

The log-Sobolev constant for a single-edge graph is 2; verifying this is a simple 1-variable inequality. With tensorization, this gives the log-Sobolev constant 2/n for {0,1}^n.

Log-Sobolev Inequality ⇔ Hypercontractive Inequality [Gross’75]. Let p be a probability distribution on {0,1}^n, and let q be the final distribution of: “draw u ~ p, walk for ≈ t steps” (technically: walk for T ~ Poisson(t) steps). Then avg_u (q(u) − 2^−n)² ≤ avg_u |p(u) − 2^−n|^{1+c}, where c = exp(−2·(2/n)·t) < 1. E.g., if t = .1n then c = exp(−.4) ≈ .67, so 1+c ≈ 1.67.

In the t = .1n case: avg_u (q(u) − 2^−n)² ≤ avg_u |p(u) − 2^−n|^{1.67}, i.e., an (average) L²-distance of q to uniformity is bounded by an (average) L^{1.67}-distance of p to uniformity.

Note that avg-L²-distance(p) ≥ avg-L^{1+c}-distance(p) always; e.g., for p a single point mass these are ~2^{−n/2} vs. ~2^{−n}. So bounding q’s L²-distance by p’s smaller L^{1+c}-distance is a genuine gain.
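Here is a numerical sanity check of the displayed inequality (not from the slides), in Python on a small cube; the Poisson(t)-step walk is implemented as the matrix exponential exp(t(P − I)), and p is taken to be a point mass as in the note above:

import numpy as np

n, t = 6, 0.6                                   # t = 0.1 n
N = 2 ** n
P = np.zeros((N, N))                            # one step: flip a uniformly random coordinate
for u in range(N):
    for i in range(n):
        P[u, u ^ (1 << i)] = 1 / n

c = np.exp(-2 * (2 / n) * t)                    # ≈ 0.67
lam, V = np.linalg.eigh(P)
Walk = V @ np.diag(np.exp(t * (lam - 1))) @ V.T # exp(t(P − I)): the Poisson(t)-step walk

p = np.zeros(N); p[0] = 1.0                     # a single point mass
q = p @ Walk
lhs = np.mean((q - 1 / N) ** 2)
rhs = np.mean(np.abs(p - 1 / N) ** (1 + c))
print(lhs, "<=", rhs, ":", lhs <= rhs)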

Back to the general statement, with c = exp(−2·(2/n)·t): if you let p = Unif_S you get…

Log-Sobolev Inequality ⇔ Hypercontractive Inequality ⇒ “Φ_{ϵn}”(S) ≥ 1 − vol(S)^{ϵ/(1−ϵ)}: very strong “Small Set Expansion” in the “noisy hypercube”! Here “Φ_{ϵn}”(S) = Pr_{u ~ S, v = Noise_ϵ(u)}[v ∉ S], where Noise_ϵ(u) flips each coordinate independently with probability ϵ.
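A quick check of this bound (not from the slides), in Python, on the subcubes S = { u : u_1 = ∙∙∙ = u_k = 1 }: here vol(S) = 2^−k and, exactly, Φ_{ϵn}(S) = 1 − (1−ϵ)^k, since v stays in S iff none of the first k coordinates flips:

eps = 0.2
for k in (1, 5, 20, 60):
    phi   = 1 - (1 - eps) ** k
    bound = 1 - (0.5 ** k) ** (eps / (1 - eps))   # 1 - vol(S)^{eps/(1-eps)}
    print(k, round(phi, 4), round(bound, 4), phi >= bound)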

This in turn has zillions of applications: the KKL Theorem, Friedgut’s Junta Theorem, robust Kruskal−Katona, weak learning of monotone functions, sharp threshold phenomena, optimal Unique Games-hardness results, …

Summary so far: Log-Sobolev inequalities are cool. They imply “small set expansion” for 1-step walks. They imply hypercontractive inequalities, which imply “small set expansion” for long walks and have many, many other applications. And they’re easy to prove for product graphs / Markov chains.

Random walks on (regular) graphs, revisited: of the earlier examples (Boolean Cube, Hamming “Slice”, “Multislice”, Symmetric group, Grassmann graph, …), the Boolean Cube is the only product graph.

Boolean Cube: log-Sobolev constant 2/n. Symmetric group [Diaconis−Saloff-Coste ’96]: log-Sobolev constant Θ(1/(n log n)). The log n ruins everything:  no good small set expansion, no good hypercontractivity, no KKL or other cool applications…

What about the Hamming “Slice”: Boolean strings of Hamming weight k; edge when two strings differ by a transposition?

Hamming “Slice” (Boolean strings with k_0 0’s and k_1 1’s; edge when two strings differ by a transposition): [T.-Y. Lee and H.-T. Yau ’98]: the log-Sobolev constant is Θ(1/n) provided k_0/n and k_1/n are Ω(1). Great! These slices enjoy all the same SSE and applications as the Boolean Cube!

So far: Boolean Cube: log-Sobolev constant 2/n. Hamming “Slice”: log-Sobolev constant Θ(1/n), provided each symbol is used Ω(n) times . “Multislice”: ? Symmetric group: log-Sobolev constant Θ(1/(n log n)) .

The “Multislice”: e.g., ternary strings with exactly k_1 1’s, k_2 2’s, and k_3 3’s; edge when two strings differ by a transposition.

Full picture: Boolean Cube: log-Sobolev constant 2/n. Hamming “Slice”: Θ(1/n), provided each symbol is used Ω(n) times . “Multislice”: Θ(1/n), provided each symbol is used Ω(n) times and there are O(1) symbols −[Filmus-O-Wu ’18] . Symmetric group: Θ(1/(n log n)) .

How did Lee−Yau do it for the Slice? And why can’t you do the same for the Multislice? Chain rule for KL-divergence… Averaging over a random coordinate… An induction… In the Slice, there is only one kind of step: swapping a 0 and a 1. Even in the ternary Multislice, there are multiple kinds of steps: swapping 1 & 2, swapping 1 & 3, swapping 2 & 3.
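(For reference, not from the slides: the chain rule for KL divergence invoked here is the standard identity KL(p_{XY} ‖ q_{XY}) = KL(p_X ‖ q_X) + E_{x ~ p_X}[ KL(p_{Y|X=x} ‖ q_{Y|X=x}) ].)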

For the Multislice, the induction becomes much more complicated; it doesn’t look like it’s going to work. But then one of the coauthors just makes it work. 

Open Directions: Log-Sobolev inequalities (and hypercontractivity, and Small Set Expansion) for more interesting Markov chains! And if they turn out badly, try to classify the small sets for which SSE fails!

The End - Thanks!