
1 Distance Correlation E-Statistics Gábor J. Székely Rényi Institute of the Hungarian Academy of Sciences Columbia University, April 28-April 30, 2014

2 Topics
Lecture 1. Distance Correlation. From correlation (Galton/Pearson, 1895) to distance correlation (Székely, 2005). Important measures of dependence and how to classify them via invariances. The distance correlation t-test of independence. Open problems for big data.
Lecture 2. Energy statistics (E-statistics) and their applications. Testing for symmetry, testing for normality, DISCO analysis, energy clustering, etc. A simple inequality on energy statistics and a beautiful theorem on Fourier transforms. What makes a statistic U (or V)?
Lecture 3. Brownian correlation. Correlation with respect to stochastic processes. Distances and negative definite functions. Physics principles in statistics (the uncertainty principle of statistics, symmetries/invariances, equilibrium estimates). CLT for dependent variables via Brownian correlation. What if the sample is not iid, what if the sample comes from a stochastic process?
Colloquium talk. Partial distance correlation. Distance correlation and dissimilarities via unbiased distance covariance estimates. What is wrong with the Mantel test? Variable selection via pdCor. What is a good measure of dependence? My Erlangen program in statistics.

3 Lecture 1. Distance Correlation
Dependence measures and tests for independence
Kolmogorov: "Independence is the most important notion of probability theory"
Correlation (Galton 1885-1888, Natural Inheritance, 1889; Pearson, 1895)
Chi-square (Pearson, 1900)
Spearman's rank correlation (1904), Amer. J. Psychol. 15: 72-101
Fisher, R. (1922) and Fisher's exact test
Kendall's tau (1938), "A New Measure of Rank Correlation", Biometrika 30 (1-2): 81-89
Maximal correlation: Hirschfeld, Gebelein (1941), Lancaster (1957), Rényi (1959), Sarmanov (1958), Buja (1990), Dembo (2001)
Hoeffding's independence test (1948), Annals of Mathematical Statistics 19: 293-325
Blum-Kiefer-Wolfowitz (1961)
Mantel test (1967)
RKHS: Baker (1973), Fukumizu, Gretton, Póczos, …
RV coefficient (1976): Robert, P. and Escoufier, Y., "A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient", Applied Statistics 25 (3): 257-265. A Stack Exchange question/answer in which dCor apparently compared favorably with the RV coefficient: http://math.stackexchange.com/questions/690972/distance-or-similarity-between-matrices-that-are-not-the-same-size
Distance correlation (dCor): Székely (2005), Székely, Bakirov and Rizzo (2007). A nice free Matlab/Octave implementation (I should perhaps add a link to our energy page): http://mastrave.org/doc/mtv_m/dist_corr
Brownian correlation: Székely and Rizzo (2009)
dCor generalizes and improves Correlation, RV, Mantel and Chi-square (the denominator!)
MIC (2010) Valhalla --- GÖTTERDÄMMERUNG

4 Kolmogorov: "Independence is the most important notion of probability theory"
What is Pearson's correlation?
Sample: (X_k, Y_k), k = 1, 2, …, n
Centered sample: A_k := X_k − X̄, B_k := Y_k − Ȳ
cov(x, y) = (1/n) Σ_k A_k B_k
r := cor(x, y) = cov(x, y) / [cov(x, x) cov(y, y)]^{1/2}
Prehistory:
(i) Gauss (1823) – normal surface with n correlated variables – for Gauss this was just one of several parameters
(ii) Auguste Bravais (1846) referred to one of the parameters of the bivariate normal distribution as "une corrélation", but like Gauss he did not recognize the importance of correlation as a measure of dependence between variables. [Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie royale des sciences de l'Institut de France, 9, 255-332.]
(iii) Francis Galton (1885-1888)
(iv) Karl Pearson (1895): the product-moment r; "LIII. On lines and planes of closest fit to systems of points in space", Philosophical Magazine Series 6, 1901 – cited about 1700 times. Pearson had no unpublished thoughts.
Why do we (NOT) like Pearson's correlation? What is the remedy?
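As a quick reference for the definition above, a minimal NumPy transcription (illustrative only; the names are ad hoc):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's product-moment correlation, exactly as defined above:
    center the samples, then r = cov(x, y) / sqrt(cov(x, x) * cov(y, y))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    A, B = x - x.mean(), y - y.mean()          # centered sample
    cov = lambda u, v: (u * v).mean()          # (1/n) * sum_k u_k v_k
    return cov(A, B) / np.sqrt(cov(A, A) * cov(B, B))
```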

5 Apples and Oranges If we want to study the dependence between oranges and apples then it is hard to add or multiply them but it is always easy to do the same with their distances.

6 a_kl := |X_k − X_l|, b_kl := |Y_k − Y_l| for k, l = 1, 2, …, n
A_kl := a_kl − a_k. − a_.l + a.. and B_kl := b_kl − b_k. − b_.l + b.. (row mean, column mean, grand mean)
Distance covariance: dCov²(X,Y) := V²(X,Y) := (1/n²) Σ_{k,l} A_kl B_kl ≥ 0 (!?!)
see Székely, Rizzo, Bakirov (2007), Ann. Statist. 35/6

7 Distance covariance: V²(X,Y) := (1/n²) Σ_{k,l} A_kl B_kl
Distance standard deviation: V(X) := V(X,X), V(Y) := V(Y,Y)
Distance correlation: dCor²(X,Y) := R²(X,Y) := V²(X,Y) / [V(X) V(Y)]
This should be introduced in our teaching at the undergraduate level.
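The definitions on the last two slides translate directly into code. A minimal NumPy sketch, illustrative only (the energy package for R referenced in these slides provides production implementations):

```python
import numpy as np

def _double_centered_dist(x):
    """Double-centered distance matrix of one sample (slide 6):
    a_kl = |x_k - x_l|,  A_kl = a_kl - a_k. - a_.l + a.. ."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:                       # treat a univariate sample as points in R^1
        x = x[:, None]
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()

def dcov2(x, y):
    """V-statistic dCov_n^2(X, Y) = (1/n^2) * sum_kl A_kl B_kl."""
    A, B = _double_centered_dist(x), _double_centered_dist(y)
    return (A * B).mean()

def dcor(x, y):
    """Sample distance correlation dCor_n (slide 7); 0 if a factor is degenerate."""
    v2_xy, v2_xx, v2_yy = dcov2(x, y), dcov2(x, x), dcov2(y, y)
    return np.sqrt(v2_xy / np.sqrt(v2_xx * v2_yy)) if v2_xx * v2_yy > 0 else 0.0
```

For instance, dcor(x, y) should be near 0 for independent samples and near 1 when y is a similarity transformation of x (cf. slide 11).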

8 The population values are a.s. limits of the empirical ones as n → ∞.
Thm: dCov²_n = ||f_n(s,t) − f_n(s) f_n(t)||², where ||·|| is the L² norm with the singular weight w(s,t) := c/(st)².
This kernel is unique if we require the following invariance: dCov²(a_1 + b_1 O_1 X, a_2 + b_2 O_2 Y) = b_1 b_2 dCov²(X,Y).

9 A beautiful theorem on Fourier transforms
∫ (1 − cos(tx))/t² dt = c|x|
The Fourier transform of any power of |t| is a constant times a power of |x|.
Gel'fand, I. M. and Shilov, G. E. (1958, 1964), Generalized Functions

10 Thm
V(X) = 0 iff X is constant
V(a + bCX) = |b| V(X)
V(X+Y) ≤ V(X) + V(Y) for independent rv's, with equality iff X or Y is constant

11 0 ≤ dCor(X,Y) ≤ 1; dCor(X,Y) = 0 iff X and Y are independent; dCor(X,Y) = 1 iff Y = a + bXC.

12 a_kl := |X_k − X_l|^α, b_kl := |Y_k − Y_l|^α define R_α for 0 < α < 2 [R_1 = R].
R_2(X,Y) = |Cor(X,Y)| = |Pearson's correlation|.
energy: E-statistics (energy statistics). R package version 1.1-0.

13 Thm. Under independence of X and Y,
n dCov²_n(X,Y) → Q = Σ_k λ_k Z²_k;
otherwise the limit is ∞. Thus we have a universally consistent test of independence.
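Since the eigenvalues λ_k are unknown in practice, one common way to calibrate the statistic n·dCov²_n is a permutation test. A hedged sketch, reusing dcov2 from the earlier snippet:

```python
import numpy as np

def dcov_independence_test(x, y, n_perm=999, seed=None):
    """Permutation test of independence based on n * dCov_n^2.
    Permuting one sample reproduces the null (independence) distribution."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    observed = n * dcov2(x, y)
    perm = np.array([n * dcov2(x, y[rng.permutation(n)]) for _ in range(n_perm)])
    p_value = (1 + np.sum(perm >= observed)) / (n_perm + 1)
    return observed, p_value
```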

14 What if (X,Y) is bivariate normal? In this case 0.89 |cor| ≤ dCor ≤ |cor|.
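A quick Monte Carlo illustration of this bracket, reusing the dcor sketch above (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2014)
for rho in (0.2, 0.5, 0.9):
    xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=1000)
    r = dcor(xy[:, 0], xy[:, 1])
    # up to sampling error, r should fall between 0.89*rho and rho
    print(f"rho={rho:.2f}  0.89*rho={0.89*rho:.3f}  dCor={r:.3f}")
```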

15 Unbiased Distance Correlation

16 Unbiased distance covariance and bias-corrected distance correlation
The unbiased estimator of dCov²(X, Y) is dCov*_n := (1/[n(n−3)]) (A*, B*).
This is an inner product in the linear space H_n of n×n matrices generated by n×n distance matrices. The population Hilbert space is denoted by H, where the inner product is (generated by) dCov*(X, Y).
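A sketch of the U-centered matrices A*, B* and of dCov*_n and R*_n; the centering constants below are the ones I believe are used in the unbiased construction and should be checked against the original paper (n ≥ 4 is required):

```python
import numpy as np

def _u_centered_dist(x):
    """U-centered distance matrix A* (assumed form): for i != j,
    A*_ij = a_ij - a_i./(n-2) - a_.j/(n-2) + a../((n-1)(n-2)), and A*_ii = 0."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    n = a.shape[0]
    A = (a - a.sum(axis=1, keepdims=True) / (n - 2)
           - a.sum(axis=0, keepdims=True) / (n - 2)
           + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(A, 0.0)
    return A

def dcov2_star(x, y):
    """Unbiased estimator dCov*_n = (1/(n(n-3))) * (A*, B*)."""
    A, B = _u_centered_dist(x), _u_centered_dist(y)
    n = A.shape[0]
    return (A * B).sum() / (n * (n - 3))

def dcor_star(x, y):
    """Bias-corrected distance correlation R*_n (may be slightly negative)."""
    denom = np.sqrt(dcov2_star(x, x) * dcov2_star(y, y))
    return dcov2_star(x, y) / denom if denom > 0 else 0.0
```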

17 The power of the dCor test for independence is very good, especially in high dimensions p, q.
Denote the unbiased version by dCov*_n; the corresponding bias-corrected distance correlation is R*_n. This is the correlation for the 21st century.
Theorem. In high dimension, if the CLT holds for the coordinates, then
T_n := [M − 1]^{1/2} R*_n / [1 − (R*_n)²]^{1/2}, where M = n(n−3)/2,
is t-distributed with d.f. M − 1.
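A sketch of the resulting t-test, assuming dcor_star from the previous snippet; under the theorem's conditions T_n is referred to a t distribution with M − 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

def dcor_t_test(x, y):
    """Distance correlation t-test of independence (slide 17)."""
    n = len(x)
    r = dcor_star(x, y)                       # bias-corrected R*_n
    M = n * (n - 3) / 2.0
    T = np.sqrt(M - 1.0) * r / np.sqrt(1.0 - r * r)
    p_value = stats.t.sf(T, df=M - 1.0)       # one-sided: large T_n indicates dependence
    return T, p_value
```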

18 Why? R*_n = Σ_ij U_ij V_ij / [Σ U_ij² Σ V_ij²]^{1/2} with iid standard normal variables.
Put Z_ij = U_ij / [Σ U_ij²]^{1/2}; then Σ Z_ij² = 1.
Under the null (independence of U_ij and V_ij), Z_ij does not depend on V_ij.
Given Z, by Cochran's theorem (the square of Σ_ij Z_ij V_ij is a rank-1 quadratic form), T_n is t-distributed when Z is given, and therefore also unconditionally.

19 Under the alternative? We need to show that if U, V are standard normal with zero mean and correlation ρ > 0, then P(UV > c) is a monotone increasing function of ρ.
For the proof, notice that if X, Y are iid standard normal and a² + b² = 1, 2ab = ρ, then for U := aX + bY and V := bX + aY we have Var(U) = Var(V) = 1 and E(UV) = 2ab = ρ. Thus UV = ab(X² + Y²) + (a² + b²)XY = ρ(X² + Y²)/2 + XY. Q.E.D.
(I do not need it, but I do not know what happens if the expectations are not zero.)

20 The number of operations is O(n²), independently of the dimension, which can even be infinite (X and Y can live in two different metric spaces – Hilbert spaces).
The storage complexity can be reduced to O(n) via a recursive formula.
Parallel processing for big n?

21 A characteristic measure of dependence (population value):
dCov²(X,Y) = E|X−X'||Y−Y'| + E|X−X'| E|Y−Y'| − 2 E|X−X'||Y−Y''|

22 dCov = cov of distances?
(X,Y), (X',Y'), (X'',Y'') are iid.
dCov²(X,Y) = E[|X−X'||Y−Y'|] + E|X−X'| E|Y−Y'| − E[|X−X'||Y−Y''|] − E[|X−X''||Y−Y'|]
= cov(|X−X'|, |Y−Y'|) − 2 cov(|X−X'|, |Y−Y''|)
(i) Does cov(|X−X'|, |Y−Y'|) = 0 imply that X and Y are independent?
(ii) Does the independence of |X−X'| and |Y−Y'| imply the independence of X and Y?
Counterexample to (i): q(x) = −c/2 for −1 < x < 0, 1/2 for 0 < x < c, 0 otherwise; p(x,y) := 1/4 − q(x)q(y).

23 Max correlation? sup_{f,g} Cor(f(X), g(Y)) over all Borel functions f, g with 0 < Var f(X), Var g(Y) < ∞.
Why should we (not) like max cor?
If maxcor(X,Y) = 0 then X, Y are independent.
For the bivariate normal, maxcor = |cor|.
For partial sums of iid variables, maxcor²(S_m, S_n) = m/n for m ≤ n; Sarmanov (1958), Dokl. Akad. Nauk SSSR.
What is wrong with maxcor?

24 What is the meaning of max cor = 1?

25 Trigonometric coins: S_n := sin U + sin 2U + … + sin nU tends to Cauchy (we did not divide by √n !!)
Open problem: What is the sup of dCor for uncorrelated X and Y? Can it be > 0.85?

26 Lecture 2. Energy Statistics (E-statistics) Newton’s gravitational potential energy can be generalized for statistical applications. Statistical observations are heavenly bodies (in a metric space) governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis holds. Potential energy statistics are symmetric functions of distances between statistical observations in metric spaces. EXAMPLE Testing Independence

27 Potential Energy Statistics
Potential energy statistics, or energy statistics, or E-statistics in short, are U-statistics or V-statistics that are functions of distances between sample elements. The idea is to consider statistical observations as heavenly bodies governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis is true.

28 Distances and Energy: the next level of abstraction (Prelude)
In the beginning Man created integers. The accountants of Uruk in Mesopotamia, about five thousand years ago, invented the first numerals – signs encoding the concept of oneness, twoness, threeness, etc., abstracted from any particular entity. Before that, for about another 5000 years, jars of oil were counted with ovoids, measures of grain were counted with cones, etc.; numbers were indicated by one-to-one correspondence. Numerals revolutionized our civilization: they expressed abstract thoughts; after all, "two" does not exist in nature, only two fingers, two people, two sheep, two apples, two oranges. After this abstraction we could not tell from the numerals what the objects were; seeing the signs 1, 2, 3, ... we could not see or smell oranges, apples, etc., but we could do comparisons, we could do "statistics", "statistical inference".
In this lecture, instead of working with statistical observations, data taking integer or real values, or taking values in Euclidean spaces, Hilbert spaces or in more general metric spaces, we make inferences from their distances. Distances and angles work wonders in science (see e.g. Thales, 600 BC; G. J. Székely: Thales and the Ten Commandments). Here we will exploit this in statistics. Instead of working with numbers, vectors, functions, etc., we first compute their distances, and all our inferences will be based on these distances. This is the next level of abstraction, where not only can we not tell what the objects are, we cannot even tell how big their numbers are; we cannot tell what the data are, we can just tell how far they are from each other. At this level of abstraction we of course lose even more information; we cannot sense many properties of the data, e.g. if we add the same constant to all data then their distances will not change. No rigid motion of the space changes the distances. On the other hand we gain a lot: distances are always easy to add, multiply, etc., even when it is not so natural to add or multiply vectors and more abstract observations, especially if they are not from the same space.
The next level of abstraction is energy statistics: invariance with respect to ratios of distances; the angles are invariant. Distance correlation depends on angles.

29 Goodness-of-fit

30 Dual space

31 Application in statistics
Construct a U (or V) statistic with kernel h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|,
V_n = (1/n²) Σ_{i,k} h(X_i, X_k).
Under the null, Eh(X,Y) = 0, but h is also a rank-one degenerate kernel because Eh(x,Y') = 0 a.s. under the null. Thus under the null the limit distribution of nV_n is Q := Σ_k λ_k Z_k², where the λ_k are eigenvalues of the Hilbert-Schmidt operator ∫ h(x,y) ψ(y) dF(y) = λψ(x), and under the alternative (X and Y have different distributions) nV_n → ∞ a.s.

32 What to do with Hilbert-Schmidt? ∫ h(x,y) ψ(y) dF(y) = λψ(x), Q := Σ_k λ_k Z_k²
(i) Approximate the eigenvalues: (1/n) Σ_i h(X_i, X_j) ψ = λψ
(ii) If Σ_i λ_i = 1 then P(Q ≥ c) ≤ P(Z² ≥ c) if this probability is at most 0.215 [conservative, consistent test]
(iii) t-test (see later)

33 Simple vs Good
E_α(X,Y) := 2E|X−Y|^α − E|X−X'|^α − E|Y−Y'|^α ≥ 0 for 0 < α < 2, with equality iff X and Y are identically distributed.
For α = 2 we have E_2(X,Y) = 2[E(X) − E(Y)]².
In the case of "classical statistics" (α = 2) life is simple but not always good; in the case of "energy statistics" (0 < α < 2) life is not so simple but good.
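A minimal two-sample sketch of E_α via V-statistic averages (calibration of the test, e.g. by permutation, and the usual scaling by the sample sizes are omitted):

```python
import numpy as np

def _mean_pair_dist(x, y, alpha=1.0):
    """Mean of |x_i - y_j|^alpha over all pairs (a V-statistic style average)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    return (d ** alpha).mean()

def energy_distance(x, y, alpha=1.0):
    """E_alpha(X,Y) = 2E|X-Y|^a - E|X-X'|^a - E|Y-Y'|^a; nonnegative, and zero iff
    the two distributions coincide when 0 < alpha < 2 (slide 33)."""
    return (2.0 * _mean_pair_dist(x, y, alpha)
            - _mean_pair_dist(x, x, alpha)
            - _mean_pair_dist(y, y, alpha))
```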

34 Testing for (multivariate) normality
Standardized sample: y_1, y_2, …, y_n
E|Z − Z'| = ?  (For U(a,b): E|U−U'| = |b−a|/3; for the exponential: E|e−e'| = 1/λ.)
E|y − Z| = ?  (For U(a,b): E|x−U| is a quadratic polynomial.)
Hint: if Z is a d-variate standard normal then |y−Z|² has a noncentral chi-square distribution with noncentrality parameter |y|²/2 and d.f. d + 2p, where p is a Poisson r.v. with mean |y|²/2; see Zacks (1981), p. 55.
In 1 dimension: E|y − Z| = √(2/π) exp{−y²/2} + y(2Φ(y) − 1).
For implementation see the energy package in R and Székely and Rizzo (2004).
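A univariate sketch of the corresponding goodness-of-fit statistic, using the closed forms above; the standardization convention and the calibration (e.g. by parametric bootstrap) may differ from the energy package:

```python
import numpy as np
from scipy import stats

def energy_normality_stat(x):
    """n * [ (2/n) sum_i E|y_i - Z| - E|Z - Z'| - (1/n^2) sum_ij |y_i - y_j| ]
    for the standardized sample y; large values indicate non-normality."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = (x - x.mean()) / x.std(ddof=1)
    # E|y - Z| = sqrt(2/pi) * exp(-y^2/2) + y * (2 * Phi(y) - 1)   (slide 34)
    e_yz = np.sqrt(2.0 / np.pi) * np.exp(-y**2 / 2.0) + y * (2.0 * stats.norm.cdf(y) - 1.0)
    e_zz = 2.0 / np.sqrt(np.pi)                     # E|Z - Z'| for standard normal Z
    e_yy = np.abs(y[:, None] - y[None, :]).mean()   # (1/n^2) * sum_ij |y_i - y_j|
    return n * (2.0 * e_yz.mean() - e_zz - e_yy)
```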

35 Why is energy a very good test for normality?
1. It is affine invariant
2. Consistent against general alternatives
3. Powerful omnibus test
In the univariate case our energy test is "almost" the same as the Anderson-Darling EDF test based on ∫ (F_n(x) − F(x))² dF(x) / [F(x)(1 − F(x))]. But here the weight dF(x)/[F(x)(1 − F(x))] is close to constant for standard normal F, and thus almost the same as "energy"; our energy test is therefore essentially a multivariate extension of the powerful Anderson-Darling test.

36 Distance skewness
Advantage: classical skewness Skew(X) := E[((X − E(X))/σ)³] = 0 does NOT characterize symmetry, but distance skewness dSkew(X) := 1 − E|X−X'|/E|X+X'| = 0 iff X is centrally symmetric.
Sample version: 1 − Σ|X_i − X_k| / Σ|X_i + X_k|
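A direct transcription of the sample version (whether the sample should first be centered at the hypothesized center of symmetry is left to the application):

```python
import numpy as np

def distance_skewness(x):
    """Sample distance skewness 1 - sum|x_i - x_k| / sum|x_i + x_k| (slide 36)."""
    x = np.asarray(x, dtype=float)
    num = np.abs(x[:, None] - x[None, :]).sum()
    den = np.abs(x[:, None] + x[None, :]).sum()
    return 1.0 - num / den
```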

37 DISCO: a nonparametric extension of ANOVA
DISCO is a multi-sample test of equal distributions, a generalization of the hypothesis of equal means, which is ANOVA.
Put A = (X_1, X_2, …, X_n), B = (Y_1, Y_2, …, Y_m), and d(A, B) := (1/(nm)) Σ |X_i − Y_k|.
Within-sample dispersion: W := Σ_j (n_j/2) d(A_j, A_j).
Put N = n_1 + n_2 + … + n_K and A := {A_1, A_2, …, A_K}.
Total dispersion: T := (N/2) d(A, A).
Thm. T = B + W, where B, the between-sample dispersion, is the energy distance, i.e. the weighted sum of the E(A_j, A_k) = 2 d(A_j, A_k) − d(A_j, A_j) − d(A_k, A_k).
The same thing with exponent α = 2 in d(A, B) is ANOVA.
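A univariate sketch of the decomposition T = B + W; the between-sample dispersion is recovered here as B = T − W rather than from the weighted sum of pairwise energy distances:

```python
import numpy as np

def disco_decomposition(samples):
    """DISCO dispersion decomposition for a list of 1-d samples (slide 37):
    d(A, B) = mean of |a_i - b_k| over all pairs,
    W = sum_j (n_j / 2) d(A_j, A_j),  T = (N / 2) d(pooled, pooled),  B = T - W."""
    def d(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return np.abs(a[:, None] - b[None, :]).mean()
    pooled = np.concatenate([np.asarray(s, dtype=float) for s in samples])
    N = len(pooled)
    T = (N / 2.0) * d(pooled, pooled)
    W = sum((len(s) / 2.0) * d(s, s) for s in samples)
    return T, W, T - W        # total, within, between
```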

38 E-clustering
Hierarchical clustering: we merge the pair of clusters with minimum energy distance, updating
E(C_i ∪ C_j, C_k) = (n_i + n_k)/(n_i + n_j + n_k) E(C_i, C_k) + (n_j + n_k)/(n_i + n_j + n_k) E(C_j, C_k) − n_k/(n_i + n_j + n_k) E(C_i, C_j).
In E-clustering not only the cluster centers matter but also the cluster point distributions. If the exponent in d is α = 2 then we get Ward's minimum variance method, a geometric method that separates and identifies clusters by their centers. Thus Ward is not consistent but E-clustering is consistent. The ability of E-clustering to separate and identify clusters with equal or nearly equal centers has important practical applications. For details see Székely and Rizzo (2005), Hierarchical clustering via joint between-within distances, Journal of Classification, 22(2), 151-183.
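The merge rule above is a Lance-Williams-type update; a small helper, purely illustrative (cluster sizes and pairwise energy distances are assumed to be maintained by the caller):

```python
def energy_merge_distance(E_ik, E_jk, E_ij, n_i, n_j, n_k):
    """Energy distance from the merged cluster C_i U C_j to C_k (slide 38)."""
    n = n_i + n_j + n_k
    return ((n_i + n_k) / n) * E_ik + ((n_j + n_k) / n) * E_jk - (n_k / n) * E_ij
```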

39 Under the null the limit distribution of nV_n is Q := Σ_k λ_k Z_k², where the λ_k are eigenvalues of the Hilbert-Schmidt operator ∫ h(x,y) ψ(y) dF(y) = λψ(x), with h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|.
Differentiate twice: −ψ''/(2f) = Eψ with boundary conditions ψ'(a) = ψ'(b) = 0 (the second derivative w.r.t. x of (1/2)|x−y| is −δ(x−y), where δ is the Dirac delta). Thus in 1 dimension E = 1/λ.
Thus we transformed the potential energy (Hilbert-Schmidt) equation into a kinetic energy (Schrödinger) equation.
Schrödinger equation (1926): −ψ''(x)/(2m) + V(x)ψ(x) = (E + 1/E)ψ(x)
Energy conservation law? Kinetic Energy (E)

40 My Erlangen Program in Statistics
Klein, Felix (1872), "A comparative review of recent researches in geometry": a classification of geometries via invariances (Euclidean, Similarity, Affine, Projective, …). Klein was then at Erlangen.
Energy statistics are always rigid-motion invariant; their ratios, e.g. dCor, are also invariant w.r.t. scaling (angles remain invariant, as in Thales's geometry of similarities). Can we have more invariance? In the univariate case we have monotone-invariant rank statistics. But in the multivariate case, if a statistic is 1-1 affine/projective invariant and continuous then it is constant. (Projection is affine but not 1-1; still, because of continuity the statistics are invariant to all projections to (coordinate) lines, thus they are constant.)

41 Affine invariant energy statistics They cannot be continuous but in case of testing for normality affine invariance is natural (it is not natural for testing independence because it changes angles). BUT dCor =0 is invariant with respect to all 1-1 Borel functions and max cor is also invariant wrt all 1-1 Borel but these are population values. Maximal correlation is too invariant. Why? Max correlation can easily be 1 for uncorrelated rv’s but the max of dCor for uncorrelated variables is < 0.85.

42 Unsolved dCor problems
Using subsampling, construct a confidence interval for dCor². Why (not) bootstrap?
Define the complexity of a function f via dCor(X, f(X)).
Find sup dCor(X,Y) for uncorrelated X and Y.

43 Energy and U, V

44 Lecture 3. Brownian Correlation / Covariance
X_id := X − E(X) = id(X) − E(id(X) | id(·))
Cov²(X,Y) = E(X_id X'_id Y_id' Y'_id')
X_W := W(X) − E(W(X) | W(·))
Cov²_W(X,Y) := E(X_W X'_W Y_W' Y'_W')
Remark: Cov_id(X,Y) = |Cov(X,Y)|
Theorem: dCov(X,Y) = Cov_W(X,Y) (!!)  Székely (2009), Ann. Appl. Statist. 3/4, discussion paper.
What if Brownian motion is replaced by another stochastic process? What matters is the (positive definite) covariance function of the process.

45 Why Brownian? We can replace BM by any two stochastic processes U = U(t) and V = V(t):
Cov²_{U,V}(X,Y) := E(X_U X'_U Y_V Y'_V)
But why is this generalization good, how do we compute it, how do we apply it?
The covariance function of BM is 2 min(s,t) = |s| + |t| − |s−t| (for s, t ≥ 0).

46 Fractional BM
The simplest extension is |s|^α + |t|^α − |s−t|^α, and a zero-mean Gaussian process with this covariance is the fractional BM, defined for 0 < α < 2. This process was mentioned for the first time in Kolmogorov (1940). α = 2H, where H is the Hurst exponent; the fractal dimension is D = 2 − H.
H describes the raggedness of the resultant motion, with a higher value leading to a smoother motion. The value of H determines what kind of process the fBm is: if H = 1/2 the process is in fact a Brownian motion (Wiener process); if H > 1/2 the increments of the process are positively correlated; if H < 1/2 the increments of the process are negatively correlated. The increment process X(t) = B_H(t+1) − B_H(t) is known as fractional Gaussian noise.

47 Variogram
What properties of the (fractional) BM do we need to make sure that the covariance w.r.t. certain stochastic processes is of "energy" type, i.e. depends on the distances of observations only?
In spatial statistics the variogram 2γ(s,t) of a random field Z(t) is 2γ(s,t) := Var(Z(s) − Z(t)). Suppose E(Z(t)) = 0. For stationary processes γ(s,t) = γ(s−t), and for stationary isotropic ones γ(s,t) = γ(|s−t|). A function is a variogram of a zero-mean process/field iff it is conditionally negative definite (see later).
If the covariance function C(s,t) of the process exists then 2C(s,t) = 2E[Z(s)Z(t)] = E[Z(s)²] + E[Z(t)²] − E[(Z(t) − Z(s))²] = γ(s,s) + γ(t,t) − 2γ(s−t). For BM we had 2 min(s,t) = |s| + |t| − |s−t|. We also have the converse: 2γ(s,t) = C(s,s) + C(t,t) − 2C(s,t).
Cov²_{U,V}(X,Y) is of "energy type" if the increments of U, V are stationary and isotropic.

48 Special Gaussian processes
The negative log of the symmetric Laplace ch.f. is γ(t) := log(1 + |t|²); it defines a Laplace-Gaussian process with the corresponding C(s,t), because this γ is conditionally negative definite.
The negative log of the ch.f. of the difference of two iid Poisson variables is γ(t) := 1 − cos t. This defines a Poisson-Gaussian process.

49 Correlation w.r.t. stochastic processes
When only the covariance matters we can assume the processes are Gaussian. Why do we need this generalization?
Conjecture. We need this generalization if the observations (X_t, Y_t) are not iid but stationary ergodic? Then consider correlation w.r.t. zero-mean (Gaussian) processes with stationary increments having the same covariance as (X_t, Y_t)?

50 A property of squared distances
What exactly are the properties of distances in Euclidean spaces (and Hilbert spaces) that we need for statistical inference? We need the following property of squared distances |x−y|² in Hilbert spaces.
Let H be a real Hilbert space and x_i ∈ H. If a_i ∈ R and Σ_i a_i = 0, then
Σ_ij a_i a_j |x_i − x_j|² = −2 |Σ_i a_i x_i|² ≤ 0.
Thus if y_i ∈ H, i = 1, 2, …, n, is another set of elements of H, then
Σ_ij a_i a_j |x_i − y_j|² = −2 Σ_ij a_i a_j ⟨x_i, y_j⟩ = −2 ⟨Σ_i a_i x_i, Σ_i a_i y_i⟩,
and adding to this the two corresponding sums over the x's alone and over the y's alone gives −2 |Σ_i a_i (x_i + y_i)|² ≤ 0.
This is what we call the (conditional) negative definite property of |x−y|².

51 Negative definite functions
Let the data come from an arbitrary set S. A function h(x,y): S×S → R is negative definite if h(x,y) = h(y,x) (symmetric), h(x,x) = 0, and for all points x_i ∈ S and all real numbers a_i with Σ_i a_i = 0 we have
Σ_ij a_i a_j h(x_i, x_j) ≤ 0. (*)
The function h is strongly negative definite if (*) holds and equality in (*) holds iff all a_i = 0.
Theorem (I. J. Schoenberg (1938)). A metric space (S,d) embeds in a Hilbert space iff h = d² is negative definite.

52 Further examples
h(x,y) := |x−y|^α is negative definite if 0 < α ≤ 2, strictly negative definite if 0 < α < 2. This is equivalent to the claim that exp{−|t|^α} is a characteristic function (of a symmetric stable distribution).
Classical statistics was built on α = 2. This makes classical formulae simpler, but because "strictness" does not hold there, the corresponding "quadratic theorems" apply only to "quadratic type" distributions, e.g. Gaussian distributions, whose densities are exp{quadratic polynomial}. See also least squares.
For α = 2 life is simple (~ multivariate normal) but not always good; for 0 < α < 2 life is not so simple but good (nonparametric). My "energy" inferences are based on strictly negative definite kernels.

53 Why do we need negative definite functions?
Let p_i and q_i be two probability distributions on the points x_i, y_i, resp. Let X, Y be independent rv's: P(X = x_i) = p_i, P(Y = y_i) = q_i. Then the strong negative definiteness of h(x,y) implies that if a_i = p_i − q_i then
Σ_ij (p_i − q_i)(p_j − q_j) |x_i − y_j| ≤ 0,
i.e. if E denotes expectation, then the potential energy of (X,Y),
E(X,Y) := E|X−Y| + E|X'−Y'| − E|X−X'| − E|Y−Y'| ≥ 0, (*)
where X' and Y' are iid copies of X and Y, resp. Strong negative definiteness implies that equality holds iff X and Y are identically distributed. What this means is that the double-centered expected distance between X and Y, i.e. the potential energy of (X,Y), is always nonnegative and equals zero iff X and Y are identically distributed.

54 High school example
n "red" cities x_i are on two sides of a line L (a river), k of them on the left side, n−k on the right; similarly, of n "green" cities y_i, m are on the left side of the same line and n−m are on the right. We connect two cities if they are on different sides of the river. Red cities are connected with red, greens with green, and mixed pairs with blue.
Claim: 2#blue − #red − #green ≥ 0, and = 0 iff k = m.
Hint: k(n−m) + m(n−k) − k(n−k) − m(n−m) = (k−m)² ≥ 0.
Combine this with M. W. Crofton's (1868) integral geometry formula on random lines to get Energy.

55 Newton's potential energy – Statistical potential energy
Newton's potential energy in our 3-dimensional space is proportional to the reciprocal of the distance; if r := |x−y| denotes the distance of points x, y, then the potential energy is proportional to 1/r. The mathematical significance of this function is that it is harmonic, i.e. 1/r is the fundamental solution of the Laplace equation. In 1 dimension r itself is harmonic. For statistical applications what is relevant is that r^α is strictly negative definite iff 0 < α < 2.
Statistical potential energy is the double-centered version of E|X−Y|^α for 0 < α < 2:
E(X,Y) := 2E|X−Y|^α − E|X−X'|^α − E|Y−Y'|^α ≥ 0 for 0 < α < 2.

56 SEP (stationary ergodic processes)
Suppose for simplicity that the kernel of a V-statistic has two arguments: h = h(x_1, x_2). This is the situation if we want to check that X has a given distribution. But what if the sample is SEP? Stationary = ? Ergodic = ?
Even in this case the SLLN holds, and thus (1/n²) Σ_{i,j} h(X_i, X_j) → Eh(X, X') a.s.; thus we have strongly consistent estimators.
If h has rank 1 and the sample is iid then the limit distribution of (1/n) Σ_{i,j} h(X_i, X_j) is Q = Σ_k λ_k Z_k², where the λ_k are eigenvalues of the Hilbert-Schmidt operator ∫ h(x_1, x_2) Ψ(x_2) dF(x_2) = λΨ(x_1). We know that in general this is not true for SEP, e.g. if h = x_1 x_2.
We can still compute the eigenvalues μ_k, k = 1, 2, …, n, of the random operator (n×n random matrix) (1/n) [h(X_i, X_j); i,j = 1, 2, …, n], and we can consider the corresponding Gaussian quadratic form Q = Σ_{k=1}^n μ_k Z_k². Can the critical values for the corresponding null hypothesis be computed from Q if a CLT holds, e.g. if we have a martingale difference structure or mixing/weak dependence or dCor → 0 (we need to approach the Gaussian distribution)?
What to do with kernels like h(x_1, x_2, x_3, x_4), etc., and how to test independence for SEP?

57 Testing independence of ergodic processes
If we have strongly stationary ergodic sequences then, by the SLLN for V-statistics, we know that the empirical dCor converges a.s. to the population dCor, and this is constant a.s. So we have a consistent estimator for the population dCor. But how can we test if dCor = 0, i.e. if the X process is independent of the Y process? Permutation tests won't work. Limit theorems for Q depend on the dependence structure, so it is complicated. How about the t-test? For this we need a kind of CLT.

58 What is the question?
Do we want to test if X_t is independent of Y_t, or if the X sequence is independent of the Y sequence?
Example. Let X_t, t = 1, 2, …, be iid and Y_t = X_{t+1}. Then X_t is independent of Y_t, but the Y sequence is a shift of the X sequence, so the sequences are not independent. We can now test if (X_t, X_{t+1}) is independent of (Y_t, Y_{t+1}), etc., using a permutation test.
Null: the X process is independent of the Y process.
Test if p-tuples of consecutive observations with random starting points are independent, e.g. with p = √n.

59 Proof of a conjecture of Ibragimov-Linnik: how to avoid mixing conditions in the CLT
Thm. Let X_n, n = 0, ±1, ±2, …, be a strictly stationary sequence with EX_n = 0 and S_n = X_1 + … + X_n. Suppose
(i) s_n := [Var(S_n)]^{1/2}, with s_n² = n f(n) where f(n) is a slowly varying function,
(ii) Cor_W(S_{-m}/s_m, (S_{r+m} − S_m)/s_m) → 0 as m, r → ∞ and r/m → 0, and
(iii) (S_m/s_m)² is uniformly integrable.
Then the CLT holds.
(Bakirov, N. K. and Székely, G. J., Brownian covariance and central limit theorem for stationary sequences, Theory of Probability and Its Applications, Vol. 55, No. 3, 371-394, 2011.)

60 ER
ER in our case has two meanings: Emergency Room and Energy in R, i.e. energy programs in the program package R. Classical emergency toolkits of statisticians contain things like the t-test, F-test, ANOVA, tests of independence, Pearson's correlation, etc. Most of them are based on the assumption that the underlying distribution is Gaussian. Our first aid toolkit is a collection of programs that are based on the notion of energy, and they do not assume that the underlying distribution is Gaussian. ANOVA is replaced by DISCO, Ward's hierarchical clustering is replaced by energy clustering, Pearson's correlation by distance correlation, etc. We can also test if a distribution is (multivariate) Gaussian. We suggest statisticians use our energy package in R, ER, as a first aid for analyzing data in the Emergency Room. ER of course cannot replace further scrutiny by specialists.

61 References
Székely, G. J. (1985-2005) Technical reports on energy (E-)statistics and on distance correlation. Potential and kinetic energy in statistics.
Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances, Ann. Statist. 35/6, 2769-2794.
Székely, G. J. and Rizzo, M. L. (2009) Brownian distance covariance, discussion paper, Ann. Appl. Statist. 3/4, 1236-1265.
Lyons, R. (2013) Distance covariance in metric spaces, Ann. Probab. 41/5, 3284-3305.
Székely, G. J. and Rizzo, M. L. (2013) Energy statistics: a class of statistics based on distances, JSPI, invited paper.

