
1 Satyaki Mahalanabis, Daniel Štefankovič (University of Rochester): Density estimation in linear time (+ approximating L1-distances)

2 Density estimation. Input: DATA + F = a family of densities {f1, f2, f3, f4, f5, f6, ...}; output: a density.

3 Density estimation - example. DATA: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625. F = the family of normal densities N(μ, σ) with σ = 1; the output is some N(μ, 1).

4 Measure of quality: L1-distance from the truth, |f-g|_1 = ∫ |f(x) - g(x)| dx, where g = TRUTH and f = OUTPUT. Why L1? 1) small L1 ⇒ every event is estimated with small additive error; 2) it is scale invariant.
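
A minimal sketch (not from the talk) of how the L1 distance can be approximated numerically on a grid; the helper names and the integration range are illustrative assumptions.

```python
import numpy as np

def l1_distance(f, g, lo=-10.0, hi=10.0, num=100_001):
    """Riemann-sum approximation of |f - g|_1 = ∫ |f(x) - g(x)| dx."""
    xs = np.linspace(lo, hi, num)
    dx = (hi - lo) / (num - 1)
    return np.sum(np.abs(f(xs) - g(xs))) * dx

def normal(mu, sigma=1.0):
    """Density of N(mu, sigma)."""
    return lambda x: np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

print(l1_distance(normal(0.0), normal(1.0)))  # ≈ 0.766 for N(0,1) vs N(1,1)
```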

5 Obstacles to “quality”: a weak class of densities F (how far is the truth from F, i.e. dist_1(g,F)?) and bad data.

6 What is bad data? g = TRUTH, h = DATA (the empirical density). Define Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)|, where Y(F) is the Yatracos class of F: the sets A_ij = {x | f_i(x) > f_j(x)} (e.g. A_12, A_13, A_23 for F = {f1, f2, f3}). Note |h-g|_1 ≥ Δ.
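
A small illustrative sketch (my own, on a discretized domain rather than the talk's continuous one) of the Yatracos class and the discrepancy Δ; densities are represented as probability vectors on a common finite grid.

```python
import numpy as np

def yatracos_discrepancy(F, h, g):
    """Δ = 2 * max over Yatracos sets A_ij = {x : f_i(x) > f_j(x)} of |h(A) - g(A)|.
    F is a list of density vectors; h and g are probability vectors on the same grid."""
    delta = 0.0
    for i, fi in enumerate(F):
        for j, fj in enumerate(F):
            if i != j:
                A = fi > fj                                  # Yatracos set A_ij
                delta = max(delta, abs(h[A].sum() - g[A].sum()))
    return 2.0 * delta
```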

7 Density estimation, restated: from DATA (h) + F, output f with small |g-f|_1, assuming both dist_1(g,F) and Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)| are small.

8 Why would these be small? They will be if: 1) F is picked large enough (so dist_1(g,F) is small); 2) F is picked small enough that the VC-dimension of Y(F) is small; 3) the data are iid from the truth g. Theorem (Haussler, Dudley, Vapnik, Chervonenkis): with on the order of VC(Y(F))/ε² samples, E[max_{A ∈ Y(F)} |h(A) - g(A)|] ≤ ε.

9 How to choose from 2 densities f1, f2?

10 (Figure: the densities f1 and f2 plotted, with an indicator set of height +1.)

11 How to choose from 2 densities? Take the indicator T (of height +1) of the set where f1 > f2 and compare T∘f1 = ∫_T f1 and T∘f2 = ∫_T f2 with the empirical mass T∘h.

12 How to choose from 2 densities? Scheffé test: if T∘h > T∘(f1+f2)/2, output f1, else output f2. Theorem (see DL'01): the output f satisfies |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ.
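
A minimal sketch (illustrative, not the talk's code) of the Scheffé test on a discretization grid; `xs` is assumed to cover the support and `samples` are iid draws from the unknown density.

```python
import numpy as np

def scheffe_choose(f1, f2, samples, xs):
    """Return f1 if T∘h > T∘(f1+f2)/2 for T = {x : f1(x) > f2(x)}, else f2."""
    dx = xs[1] - xs[0]
    T = f1(xs) > f2(xs)                        # Scheffé set on the grid
    f1_T = np.sum(f1(xs)[T]) * dx              # ≈ ∫_T f1
    f2_T = np.sum(f2(xs)[T]) * dx              # ≈ ∫_T f2
    h_T = np.mean(f1(samples) > f2(samples))   # empirical mass h(T)
    return f1 if h_T > (f1_T + f2_T) / 2 else f2
```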

13 Recall the goal: from DATA (h) + F, output f with small |g-f|_1, assuming dist_1(g,F) and Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)| are small.

14 Test functions for F = {f_1, f_2, ..., f_N}: T_ij(x) = sgn(f_i(x) - f_j(x)), so T_ij∘(f_i - f_j) = ∫ (f_i - f_j) sgn(f_i - f_j) = |f_i - f_j|_1. Comparing T_ij∘h with T_ij∘f_i and T_ij∘f_j decides whether f_i or f_j wins.

15 Density estimation algorithms (with n = |F| candidates). Scheffé tournament (~n² tests): pick the density with the most wins; Theorem (DL'01): |f-g|_1 ≤ 9 dist_1(g,F) + 8Δ. Minimum distance estimate (Y'85, ~n³ work): output the f_k ∈ F that minimizes max_{i,j} |(f_k - h)∘T_ij|; Theorem (DL'01): |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ.

16 Recap: Scheffé tournament - time ~n², guarantee 9 dist_1(g,F) + 8Δ; minimum distance estimate - time ~n³, guarantee 3 dist_1(g,F) + 2Δ. Can we do better?
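
For concreteness, minimal sketches (my own illustrations, reusing `scheffe_choose` from above) of the two classical selection rules; both take the finite family F as a list of callables, the held-out samples, and an integration grid.

```python
import numpy as np

def scheffe_tournament(F, samples, xs):
    """Run all pairwise Scheffé tests and return the density with the most wins
    (DL'01 guarantee: |f-g|_1 <= 9 dist_1(g,F) + 8Δ)."""
    wins = [0] * len(F)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            winner = scheffe_choose(F[i], F[j], samples, xs)
            wins[i if winner is F[i] else j] += 1
    return F[int(np.argmax(wins))]

def minimum_distance_estimate(F, samples, xs):
    """Y'85: return the f_k minimizing max_{i,j} |f_k(A_ij) - h(A_ij)|, which
    equals max_{i,j} |(f_k - h)∘T_ij| up to a factor of 2
    (DL'01 guarantee: |f-g|_1 <= 3 dist_1(g,F) + 2Δ)."""
    dx = xs[1] - xs[0]

    def score(k):
        worst = 0.0
        for i in range(len(F)):
            for j in range(len(F)):
                if i != j:
                    A = F[i](xs) > F[j](xs)                        # Yatracos set A_ij
                    fk_A = np.sum(F[k](xs)[A]) * dx                # f_k(A_ij)
                    h_A = np.mean(F[i](samples) > F[j](samples))   # empirical h(A_ij)
                    worst = max(worst, abs(fk_A - h_A))
        return worst

    return F[min(range(len(F)), key=score)]
```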

17 Our algorithm: Efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of surviving distributions in F that are furthest apart (in L1); 2) eliminate the loser of their test. Take the most “discriminative” action. Theorem [MS'08]: |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ, in time ~n (* after preprocessing F).
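
A minimal sketch (illustrative) of the selection rule on this slide; the pairwise L1 distances `l1` are assumed precomputed (the preprocessing the slide refers to), and the furthest-pair search inside the loop is done naively rather than with the fast data structure discussed next.

```python
def minimum_loss_weight(F, samples, xs, l1):
    """Repeatedly compare the two surviving densities that are furthest apart
    in L1 (l1[i][j] = |f_i - f_j|_1) and eliminate the loser of their test."""
    alive = set(range(len(F)))
    while len(alive) > 1:
        i, j = max(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: l1[p[0]][p[1]])       # most "discriminative" pair
        winner = scheffe_choose(F[i], F[j], samples, xs)
        alive.remove(j if winner is F[i] else i)
    return F[alive.pop()]
```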

18 Tournament revelation problem. INPUT: a weighted undirected graph G (wlog all edge-weights distinct). Interaction: REPORT the heaviest edge {u1,v1} in G; the ADVERSARY eliminates u1 or v1, giving G1; REPORT the heaviest edge {u2,v2} in G1; the ADVERSARY eliminates u2 or v2, giving G2; and so on. OBJECTIVE: minimize the total time spent generating reports.

19 Tournament revelation problem - example. A graph on vertices A, B, C, D with distinct edge weights 1-6 (figure). Report the heaviest edge.

20 Report the heaviest edge: BC.

21 The adversary eliminates B. Report the heaviest edge of the remaining graph.

22 Report the heaviest edge of the remaining graph: AD.

23 The adversary eliminates A. Report the heaviest edge of the remaining graph: CD.

24 Tournament revelation problem (figure: the full decision tree of reports for the example graph - BC at the root, then AD or BD, and so on). Trade-offs: 2^O(F) preprocessing gives O(F) run-time; O(F² log F) preprocessing gives O(F²) run-time. WE DO NOT KNOW: can one get O(F) run-time with polynomial preprocessing?
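
A minimal sketch (my own illustration) of the second trade-off on this slide: sort all edges once, then serve the whole report/eliminate sequence with a single forward-moving pointer, since an eliminated vertex never returns; this is O(F² log F) preprocessing and O(F²) total report time.

```python
def tournament_revelation(weights, eliminate):
    """weights: dict {(u, v): w} with distinct weights; eliminate: callback that
    receives the reported edge and returns the endpoint the adversary removes."""
    order = sorted(weights, key=weights.get, reverse=True)   # preprocessing: heaviest first
    alive = {v for edge in weights for v in edge}
    ptr = 0
    while len(alive) > 1:
        while order[ptr][0] not in alive or order[ptr][1] not in alive:
            ptr += 1                                         # skip edges with a dead endpoint
        report = order[ptr]                                  # heaviest surviving edge
        alive.discard(eliminate(report))
    return alive.pop()
```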

25 Back to Efficient minimum loss-weight (repeat until one distribution is left: 1) pick the pair of surviving distributions that are furthest apart in L1; 2) eliminate the loser). Step 1) is exactly the tournament revelation problem: 2^O(F) preprocessing gives O(F) run-time, O(F² log F) preprocessing gives O(F²) run-time, and WE DO NOT KNOW whether O(F) run-time is possible with polynomial preprocessing. (In practice step 2) is the more costly one.)

26 Efficient minimum loss-weight. Theorem: the output f satisfies |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ, in time ~n. Proof: for every f' to which f loses, |f-f'|_1 ≤ max_{f' loses to f''} |f'-f''|_1 (“that guy lost even more badly!”).

27 Proof sketch (picture: f_1 = the output, BEST = f_2, and a “bad loss” of f_2 to f_3). For every f' to which f loses, |f-f'|_1 ≤ max_{f' loses to f''} |f'-f''|_1 (“that guy lost even more badly!”). Ingredients on the slide: 2 h∘T_23 ≤ f_2∘T_23 + f_3∘T_23 (the bad loss), (f_1-f_2)∘T_12 ≤ (f_2-f_3)∘T_23, an estimate of (f_4-h)∘T_23, and (f_i-f_j)∘(T_ij-T_kl) ≥ 0; combining them yields |f_1-g|_1 ≤ 3|f_2-g|_1 + 2Δ.

28 Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56). K = kernel, h = density, x_1, x_2, ..., x_n i.i.d. samples from h; the kernel is used to smooth the empirical g: (1/n) Σ_{i=1}^n K(y - x_i) = g*K → h*K as n → ∞.

29 What K should we choose? g*K = (1/n) Σ_{i=1}^n K(y - x_i) → h*K as n → ∞, so a Dirac δ would be good (h*δ = h), but smoothing the empirical g with a Dirac δ is not good. Something in-between: bandwidth selection for kernel density estimates, K_s(x) = K(x/s)/s, so K_s → Dirac δ as s → 0. Theorem (see DL'01): as s → 0 with sn → ∞, |g*K_s - h|_1 → 0.
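
A minimal sketch (illustrative) of a kernel density estimate with bandwidth s, f_s(y) = (1/(n s)) Σ_i K((y - x_i)/s), using a Gaussian kernel as a stand-in for the talk's piecewise uniform/linear kernels.

```python
import numpy as np

def kde(samples, s):
    """Return a callable kernel density estimate with bandwidth s."""
    samples = np.asarray(samples, dtype=float)

    def f(y):
        y = np.atleast_1d(y).astype(float)[:, None]
        K = np.exp(-((y - samples) / s) ** 2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
        return K.mean(axis=1) / s                                       # (1/(n s)) Σ K((y - x_i)/s)

    return f
```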

30 Data splitting methods for kernel density estimates: how to pick the smoothing factor s in (1/(ns)) Σ_{i=1}^n K((y - x_i)/s)? Split x_1, ..., x_n into x_1, ..., x_{n-m} and x_{n-m+1}, ..., x_n; form f_s = (1/((n-m)s)) Σ_{i=1}^{n-m} K((y - x_i)/s) for each candidate s, and choose s using density estimation on the remaining m points.
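
A minimal sketch (illustrative) of the data-splitting scheme: one kernel estimate per candidate bandwidth is built from the first n-m points, and a density-selection rule (e.g. the Scheffé tournament or minimum-distance estimate sketched earlier) picks among them using the held-out m points.

```python
def select_bandwidth(data, candidate_s, m, xs, selector):
    """Return the bandwidth whose kernel estimate the selector prefers."""
    train, held_out = data[:-m], data[-m:]
    F = [kde(train, s) for s in candidate_s]   # one candidate density per bandwidth
    best = selector(F, held_out, xs)           # e.g. minimum_distance_estimate(F, held_out, xs)
    return candidate_s[F.index(best)]
```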

31 Kernels we will use in (1/(ns)) Σ_i K((y - x_i)/s): piecewise uniform and piecewise linear.

32 Bandwidth selection for uniform kernels. Setup: N distributions, each piecewise uniform with n pieces, and m datapoints (e.g. N ≈ n^{1/2}, m ≈ n^{5/4}). Goal: run the density estimation algorithm efficiently. Quantities needed: |f_i - f_j|_1, the comparison of g∘T_ij with (f_i + f_j)∘T_ij / 2, and (f_k - h)∘T_kj; the slide's table counts how many of each the MD and EMLW algorithms need (on the order of N² vs N) and the time per evaluation (about n + m log n).

33 Bandwidth selection for uniform kernels (same setup as the previous slide). Can we speed this up?

34 Bandwidth selection for uniform kernels (same setup). Can we speed this up? Approximating the distances with small absolute error would be bad, but small relative error is good enough.

35 Approximating L1-distances between distributions: N piecewise uniform densities (each with n pieces). TRIVIAL (exact): N²n. WE WILL DO: (N² + Nn)(log N)/ε².

36 Dimension reduction for L2. Johnson-Lindenstrauss Lemma ('82): a random map Φ: L2 → L2^t with N(0, t^{-1/2}) entries and t = O(ε^{-2} ln n) satisfies, for all x, y in a set S with |S| = n, d(x,y) ≤ d(Φ(x), Φ(y)) ≤ (1+ε) d(x,y).

37 Dimension reduction for L1. Cauchy Random Projection (Indyk'00): a random map Φ: L1 → L1^t with C(0, 1/t) entries and t = O(ε^{-2} ln n) satisfies, for all x, y in S with |S| = n, d(x,y) ≤ est(Φ(x), Φ(y)) ≤ (1+ε) d(x,y), where est is an estimator rather than the L1 distance itself. (Charikar, Brinkman'03: est cannot be replaced by d.)
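
A minimal sketch (illustrative, using standard Cauchy entries and the median estimator rather than Indyk's exact normalization) for finite-dimensional vectors: each coordinate of Φx - Φy is Cauchy with scale |x - y|_1, and the median of |C(0, γ)| equals γ, so the median over t coordinates estimates the distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def cauchy_projection(dim, t):
    """t x dim projection matrix with iid standard Cauchy entries."""
    return rng.standard_cauchy(size=(t, dim))

def l1_estimate(px, py):
    """Estimate |x - y|_1 from the projected vectors px = Px, py = Py."""
    return np.median(np.abs(px - py))

# usage sketch: t = O(eps^-2 log n) rows suffice for n vectors
dim, t = 1000, 400
x, y = rng.random(dim), rng.random(dim)
P = cauchy_projection(dim, t)
print(l1_estimate(P @ x, P @ y), np.abs(x - y).sum())   # the two numbers should be close
```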

38 Cauchy distribution: C(0,1) has density function 1/(π(1+x²)). FACTS: X ~ C(0,1) ⇒ aX ~ C(0,|a|); X ~ C(0,a), Y ~ C(0,b) independent ⇒ X+Y ~ C(0,a+b).

39 Cauchy random projection for L1^D (Indyk'00). Partition the line into intervals and attach an independent Cauchy increment to each: X_1, ..., X_9 with X_i ~ C(0, length of interval i), e.g. X_1 ~ C(0,z) for an interval of length z. A density that takes value A on some intervals and B on others projects to A(X_2+X_3) + B(X_5+X_6+X_7+X_8).

40 Cauchy random projection for L1^D (Indyk'00), continued: a second density with value D everywhere projects to D(X_1+X_2+...+X_8+X_9), and the difference of the two projections is distributed as Cauchy(0, d) where d is the L1 distance between the two densities.
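
A minimal sketch (my own rendering of the idea on these slides): refine all piecewise-constant densities to common breakpoints, attach one independent Cauchy increment of scale length_i to each elementary interval, and project each density by the weighted sum of the increments; the difference of two projected coordinates is then C(0, |f_k - f_l|_1), and repeating t times and taking medians of absolute differences gives all pairwise estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_piecewise_constant(values, lengths, t):
    """values[k][i]: value of density k on elementary interval i;
    lengths[i]: length of interval i.  Returns a (num_densities, t) array."""
    values = np.asarray(values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    increments = rng.standard_cauchy(size=(t, len(lengths))) * lengths   # X_i ~ C(0, length_i)
    return values @ increments.T

def all_pairs_l1_estimates(values, lengths, t=400):
    """est |f_k - f_l|_1 = median over the t repetitions of |proj_k - proj_l|."""
    W = project_piecewise_constant(values, lengths, t)
    return np.median(np.abs(W[:, None, :] - W[None, :, :]), axis=2)
```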

41 All pairs L1-distances: piecewise linear densities.

42 All pairs L1-distances, piecewise linear densities - example: with independent X_1, X_2 ~ C(0, 1/2) (Cauchy increments of the two half-intervals), the red linear piece projects to R = (3/4)X_1 + (1/4)X_2 and the blue one to B = (3/4)X_2 + (1/4)X_1, so R - B = (X_1 - X_2)/2 ~ C(0, 1/2).

43 All pairs L1-distances, piecewise linear densities. Problem: too many intersections! Solution: cut into even smaller pieces - stochastic measures are useful.

44 Brownian motion: increments with density (1/(2π)^{1/2}) exp(-x²/2). Cauchy motion: increments with density 1/(π(1+x²)).

45 Brownian motion (increment density (1/(2π)^{1/2}) exp(-x²/2)): for f: R → R^d, the integral ∫ f dL is Gaussian, Y ~ N(0, S) with an explicitly computable covariance S; computing such integrals is easy.

46 Cauchy motion (increment density 1/(π(1+x²))): for f: R → R^d, the integral ∫ f dL satisfies Y ~ C(0, s) when d = 1, and computing it is easy; for d > 1 computing integrals* is hard (* = obtaining an explicit expression for the density).

47 What were we doing? The interval increments X_1, ..., X_9 amount to computing ∫ (f_1, f_2, f_3) dL = ((w_1)_1, (w_2)_1, (w_3)_1) - one coordinate of the projection of each density.

48 (Same picture.) Can we efficiently compute integrals ∫ · dL for piecewise linear f?

49 Can we efficiently compute integrals ∫ · dL for piecewise linear f? Consider φ: R → R², φ(z) = (1, z), and (X, Y) = ∫ φ dL.

50 For φ: R → R², φ(z) = (1, z), and (X, Y) = ∫ φ dL, the slide gives an explicit expression for the density of (2(X-Y), 2Y) at the point (u+v, u-v).

51 Results: all pairs L1-distances for mixtures of uniform densities in time O((N² + Nn)(log N)/ε²); all pairs L1-distances for piecewise linear densities in time O((N² + Nn)(log N)/ε²).

52 QUESTIONS: 1) φ: R → R³, φ(z) = (1, z, z²), (X, Y, Z) = ∫ φ dL - can its density be computed? 2) higher dimensions?

