
1 Satyaki Mahalanabis, Daniel Štefankovič (University of Rochester): Density estimation in linear time (+ approximating L1-distances)

2 Density estimation. Input: DATA + F = a family of densities {f1, f2, f3, f4, f5, f6, ...}; output: a density.

3 Density estimation - example. DATA: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625. F = the family of normal densities N(μ, σ) with σ = 1; the output is some N(μ, 1).

4 Measure of quality: L1-distance from the truth, |f-g|_1 = ∫ |f(x) - g(x)| dx, where g = TRUTH and f = OUTPUT. Why L1? 1) small L1 ⇒ every event is estimated with small additive error; 2) it is scale invariant.
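
A minimal sketch (not from the talk) of how the L1 distance can be approximated numerically on a grid; the helper names and the integration range are illustrative assumptions.

```python
import numpy as np

def l1_distance(f, g, lo=-10.0, hi=10.0, num=100_001):
    """Riemann-sum approximation of |f - g|_1 = ∫ |f(x) - g(x)| dx."""
    xs = np.linspace(lo, hi, num)
    dx = (hi - lo) / (num - 1)
    return np.sum(np.abs(f(xs) - g(xs))) * dx

def normal(mu, sigma=1.0):
    """Density of N(mu, sigma)."""
    return lambda x: np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

print(l1_distance(normal(0.0), normal(1.0)))  # ≈ 0.766 for N(0,1) vs N(1,1)
```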

5 Obstacles to “quality”: a weak class of densities F (how far is the truth from F, i.e. dist_1(g,F)?) and bad data.

6 What is bad data? g = TRUTH, h = DATA (the empirical density). Define Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)|, where Y(F) is the Yatracos class of F: the sets A_ij = {x | f_i(x) > f_j(x)} (e.g. A_12, A_13, A_23 for F = {f1, f2, f3}). Note |h-g|_1 ≥ Δ.
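
A small illustrative sketch (my own, on a discretized domain rather than the talk's continuous one) of the Yatracos class and the discrepancy Δ; densities are represented as probability vectors on a common finite grid.

```python
import numpy as np

def yatracos_discrepancy(F, h, g):
    """Δ = 2 * max over Yatracos sets A_ij = {x : f_i(x) > f_j(x)} of |h(A) - g(A)|.
    F is a list of density vectors; h and g are probability vectors on the same grid."""
    delta = 0.0
    for i, fi in enumerate(F):
        for j, fj in enumerate(F):
            if i != j:
                A = fi > fj                                  # Yatracos set A_ij
                delta = max(delta, abs(h[A].sum() - g[A].sum()))
    return 2.0 * delta
```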

7 Density estimation, restated: from DATA (h) + F, output f with small |g-f|_1, assuming both dist_1(g,F) and Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)| are small.

8 Why would these be small? They will be if: 1) F is picked large enough (so dist_1(g,F) is small); 2) F is picked small enough that the VC-dimension of Y(F) is small; 3) the data are iid from the truth g. Theorem (Haussler, Dudley, Vapnik, Chervonenkis): with on the order of VC(Y(F))/ε² samples, E[max_{A ∈ Y(F)} |h(A) - g(A)|] ≤ ε.

9 How to choose from 2 densities f1, f2?

10 (Figure: the densities f1 and f2 plotted, with an indicator set of height +1.)

11 How to choose from 2 densities? Take the indicator T (of height +1) of the set where f1 > f2 and compare T∘f1 = ∫_T f1 and T∘f2 = ∫_T f2 with the empirical mass T∘h.

12 How to choose from 2 densities? Scheffé test: if T∘h > T∘(f1+f2)/2, output f1, else output f2. Theorem (see DL'01): the output f satisfies |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ.
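
A minimal sketch (illustrative, not the talk's code) of the Scheffé test on a discretization grid; `xs` is assumed to cover the support and `samples` are iid draws from the unknown density.

```python
import numpy as np

def scheffe_choose(f1, f2, samples, xs):
    """Return f1 if T∘h > T∘(f1+f2)/2 for T = {x : f1(x) > f2(x)}, else f2."""
    dx = xs[1] - xs[0]
    T = f1(xs) > f2(xs)                        # Scheffé set on the grid
    f1_T = np.sum(f1(xs)[T]) * dx              # ≈ ∫_T f1
    f2_T = np.sum(f2(xs)[T]) * dx              # ≈ ∫_T f2
    h_T = np.mean(f1(samples) > f2(samples))   # empirical mass h(T)
    return f1 if h_T > (f1_T + f2_T) / 2 else f2
```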

13 Recall the goal: from DATA (h) + F, output f with small |g-f|_1, assuming dist_1(g,F) and Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)| are small.

14 Test functions for F = {f_1, f_2, ..., f_N}: T_ij(x) = sgn(f_i(x) - f_j(x)), so T_ij∘(f_i - f_j) = ∫ (f_i - f_j) sgn(f_i - f_j) = |f_i - f_j|_1. Comparing T_ij∘h with T_ij∘f_i and T_ij∘f_j decides whether f_i or f_j wins.

15 Density estimation algorithms (with n = |F| candidates). Scheffé tournament (~n² tests): pick the density with the most wins; Theorem (DL'01): |f-g|_1 ≤ 9 dist_1(g,F) + 8Δ. Minimum distance estimate (Y'85, ~n³ work): output the f_k ∈ F that minimizes max_{i,j} |(f_k - h)∘T_ij|; Theorem (DL'01): |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ.

16 Recap: Scheffé tournament - time ~n², guarantee 9 dist_1(g,F) + 8Δ; minimum distance estimate - time ~n³, guarantee 3 dist_1(g,F) + 2Δ. Can we do better?
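
For concreteness, minimal sketches (my own illustrations, reusing `scheffe_choose` from above) of the two classical selection rules; both take the finite family F as a list of callables, the held-out samples, and an integration grid.

```python
import numpy as np

def scheffe_tournament(F, samples, xs):
    """Run all pairwise Scheffé tests and return the density with the most wins
    (DL'01 guarantee: |f-g|_1 <= 9 dist_1(g,F) + 8Δ)."""
    wins = [0] * len(F)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            winner = scheffe_choose(F[i], F[j], samples, xs)
            wins[i if winner is F[i] else j] += 1
    return F[int(np.argmax(wins))]

def minimum_distance_estimate(F, samples, xs):
    """Y'85: return the f_k minimizing max_{i,j} |f_k(A_ij) - h(A_ij)|, which
    equals max_{i,j} |(f_k - h)∘T_ij| up to a factor of 2
    (DL'01 guarantee: |f-g|_1 <= 3 dist_1(g,F) + 2Δ)."""
    dx = xs[1] - xs[0]

    def score(k):
        worst = 0.0
        for i in range(len(F)):
            for j in range(len(F)):
                if i != j:
                    A = F[i](xs) > F[j](xs)                        # Yatracos set A_ij
                    fk_A = np.sum(F[k](xs)[A]) * dx                # f_k(A_ij)
                    h_A = np.mean(F[i](samples) > F[j](samples))   # empirical h(A_ij)
                    worst = max(worst, abs(fk_A - h_A))
        return worst

    return F[min(range(len(F)), key=score)]
```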

17 Our algorithm: Efficient minimum loss-weight. Repeat until one distribution is left: 1) pick the pair of surviving distributions in F that are furthest apart (in L1); 2) eliminate the loser of their test. Take the most “discriminative” action. Theorem [MS'08]: |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ, in time ~n (* after preprocessing F).
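
A minimal sketch (illustrative) of the selection rule on this slide; the pairwise L1 distances `l1` are assumed precomputed (the preprocessing the slide refers to), and the furthest-pair search inside the loop is done naively rather than with the fast data structure discussed next.

```python
def minimum_loss_weight(F, samples, xs, l1):
    """Repeatedly compare the two surviving densities that are furthest apart
    in L1 (l1[i][j] = |f_i - f_j|_1) and eliminate the loser of their test."""
    alive = set(range(len(F)))
    while len(alive) > 1:
        i, j = max(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: l1[p[0]][p[1]])       # most "discriminative" pair
        winner = scheffe_choose(F[i], F[j], samples, xs)
        alive.remove(j if winner is F[i] else i)
    return F[alive.pop()]
```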

18 Tournament revelation problem. INPUT: a weighted undirected graph G (wlog all edge-weights distinct). Interaction: REPORT the heaviest edge {u1,v1} in G; the ADVERSARY eliminates u1 or v1, giving G1; REPORT the heaviest edge {u2,v2} in G1; the ADVERSARY eliminates u2 or v2, giving G2; and so on. OBJECTIVE: minimize the total time spent generating reports.

19 Tournament revelation problem - example. A graph on vertices A, B, C, D with distinct edge weights 1-6 (figure). Report the heaviest edge.

20 Report the heaviest edge: BC.

21 The adversary eliminates B. Report the heaviest edge of the remaining graph.

22 Report the heaviest edge of the remaining graph: AD.

23 The adversary eliminates A. Report the heaviest edge of the remaining graph: CD.

24 Tournament revelation problem (figure: the full decision tree of reports for the example graph - BC at the root, then AD or BD, and so on). Trade-offs: 2^O(F) preprocessing gives O(F) run-time; O(F² log F) preprocessing gives O(F²) run-time. WE DO NOT KNOW: can one get O(F) run-time with polynomial preprocessing?
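
A minimal sketch (my own illustration) of the second trade-off on this slide: sort all edges once, then serve the whole report/eliminate sequence with a single forward-moving pointer, since an eliminated vertex never returns; this is O(F² log F) preprocessing and O(F²) total report time.

```python
def tournament_revelation(weights, eliminate):
    """weights: dict {(u, v): w} with distinct weights; eliminate: callback that
    receives the reported edge and returns the endpoint the adversary removes."""
    order = sorted(weights, key=weights.get, reverse=True)   # preprocessing: heaviest first
    alive = {v for edge in weights for v in edge}
    ptr = 0
    while len(alive) > 1:
        while order[ptr][0] not in alive or order[ptr][1] not in alive:
            ptr += 1                                         # skip edges with a dead endpoint
        report = order[ptr]                                  # heaviest surviving edge
        alive.discard(eliminate(report))
    return alive.pop()
```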

25 Back to Efficient minimum loss-weight (repeat until one distribution is left: 1) pick the pair of surviving distributions that are furthest apart in L1; 2) eliminate the loser). Step 1) is exactly the tournament revelation problem: 2^O(F) preprocessing gives O(F) run-time, O(F² log F) preprocessing gives O(F²) run-time, and WE DO NOT KNOW whether O(F) run-time is possible with polynomial preprocessing. (In practice step 2) is the more costly one.)

26 Efficient minimum loss-weight. Theorem: the output f satisfies |f-g|_1 ≤ 3 dist_1(g,F) + 2Δ, in time ~n. Proof: for every f' to which f loses, |f-f'|_1 ≤ max_{f' loses to f''} |f'-f''|_1 (“that guy lost even more badly!”).

27 Proof sketch (picture: f_1 = the output, BEST = f_2, and a “bad loss” of f_2 to f_3). For every f' to which f loses, |f-f'|_1 ≤ max_{f' loses to f''} |f'-f''|_1 (“that guy lost even more badly!”). Ingredients on the slide: 2 h∘T_23 ≤ f_2∘T_23 + f_3∘T_23 (the bad loss), (f_1-f_2)∘T_12 ≤ (f_2-f_3)∘T_23, an estimate of (f_4-h)∘T_23, and (f_i-f_j)∘(T_ij-T_kl) ≥ 0; combining them yields |f_1-g|_1 ≤ 3|f_2-g|_1 + 2Δ.

28 Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56). K = kernel, h = density, x_1, x_2, ..., x_n i.i.d. samples from h; the kernel is used to smooth the empirical g: (1/n) Σ_{i=1}^n K(y - x_i) = g*K → h*K as n → ∞.

29 What K should we choose? g*K = (1/n) Σ_{i=1}^n K(y - x_i) → h*K as n → ∞, so a Dirac δ would be good (h*δ = h), but smoothing the empirical g with a Dirac δ is not good. Something in-between: bandwidth selection for kernel density estimates, K_s(x) = K(x/s)/s, so K_s → Dirac δ as s → 0. Theorem (see DL'01): as s → 0 with sn → ∞, |g*K_s - h|_1 → 0.
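
A minimal sketch (illustrative) of a kernel density estimate with bandwidth s, f_s(y) = (1/(n s)) Σ_i K((y - x_i)/s), using a Gaussian kernel as a stand-in for the talk's piecewise uniform/linear kernels.

```python
import numpy as np

def kde(samples, s):
    """Return a callable kernel density estimate with bandwidth s."""
    samples = np.asarray(samples, dtype=float)

    def f(y):
        y = np.atleast_1d(y).astype(float)[:, None]
        K = np.exp(-((y - samples) / s) ** 2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
        return K.mean(axis=1) / s                                       # (1/(n s)) Σ K((y - x_i)/s)

    return f
```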

30 Data splitting methods for kernel density estimates: how to pick the smoothing factor s in (1/(ns)) Σ_{i=1}^n K((y - x_i)/s)? Split x_1, ..., x_n into x_1, ..., x_{n-m} and x_{n-m+1}, ..., x_n; form f_s = (1/((n-m)s)) Σ_{i=1}^{n-m} K((y - x_i)/s) for each candidate s, and choose s using density estimation on the remaining m points.
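
A minimal sketch (illustrative) of the data-splitting scheme: one kernel estimate per candidate bandwidth is built from the first n-m points, and a density-selection rule (e.g. the Scheffé tournament or minimum-distance estimate sketched earlier) picks among them using the held-out m points.

```python
def select_bandwidth(data, candidate_s, m, xs, selector):
    """Return the bandwidth whose kernel estimate the selector prefers."""
    train, held_out = data[:-m], data[-m:]
    F = [kde(train, s) for s in candidate_s]   # one candidate density per bandwidth
    best = selector(F, held_out, xs)           # e.g. minimum_distance_estimate(F, held_out, xs)
    return candidate_s[F.index(best)]
```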

31 Kernels we will use in (1/(ns)) Σ_i K((y - x_i)/s): piecewise uniform and piecewise linear.

32 Bandwidth selection for uniform kernels. Setup: N distributions, each piecewise uniform with n pieces, and m datapoints (e.g. N ≈ n^{1/2}, m ≈ n^{5/4}). Goal: run the density estimation algorithm efficiently. Quantities needed: |f_i - f_j|_1, the comparison of g∘T_ij with (f_i + f_j)∘T_ij / 2, and (f_k - h)∘T_kj; the slide's table counts how many of each the MD and EMLW algorithms need (on the order of N² vs N) and the time per evaluation (about n + m log n).

33 Bandwidth selection for uniform kernels (same setup as the previous slide). Can we speed this up?

34 Bandwidth selection for uniform kernels (same setup). Can we speed this up? Approximating the distances with small absolute error would be bad, but small relative error is good enough.

35 Approximating L1-distances between distributions: N piecewise uniform densities (each with n pieces). TRIVIAL (exact): N²n. WE WILL DO: (N² + Nn)(log N)/ε².

36 Dimension reduction for L2. Johnson-Lindenstrauss Lemma ('82): a random map Φ: L2 → L2^t with N(0, t^{-1/2}) entries and t = O(ε^{-2} ln n) satisfies, for all x, y in a set S with |S| = n, d(x,y) ≤ d(Φ(x), Φ(y)) ≤ (1+ε) d(x,y).

37 Dimension reduction for L1. Cauchy Random Projection (Indyk'00): a random map Φ: L1 → L1^t with C(0, 1/t) entries and t = O(ε^{-2} ln n) satisfies, for all x, y in S with |S| = n, d(x,y) ≤ est(Φ(x), Φ(y)) ≤ (1+ε) d(x,y), where est is an estimator rather than the L1 distance itself. (Charikar, Brinkman'03: est cannot be replaced by d.)
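
A minimal sketch (illustrative, using standard Cauchy entries and the median estimator rather than Indyk's exact normalization) for finite-dimensional vectors: each coordinate of Φx - Φy is Cauchy with scale |x - y|_1, and the median of |C(0, γ)| equals γ, so the median over t coordinates estimates the distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def cauchy_projection(dim, t):
    """t x dim projection matrix with iid standard Cauchy entries."""
    return rng.standard_cauchy(size=(t, dim))

def l1_estimate(px, py):
    """Estimate |x - y|_1 from the projected vectors px = Px, py = Py."""
    return np.median(np.abs(px - py))

# usage sketch: t = O(eps^-2 log n) rows suffice for n vectors
dim, t = 1000, 400
x, y = rng.random(dim), rng.random(dim)
P = cauchy_projection(dim, t)
print(l1_estimate(P @ x, P @ y), np.abs(x - y).sum())   # the two numbers should be close
```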

38 Cauchy distribution: C(0,1) has density function 1/(π(1+x²)). FACTS: X ~ C(0,1) ⇒ aX ~ C(0,|a|); X ~ C(0,a), Y ~ C(0,b) independent ⇒ X+Y ~ C(0,a+b).

39 Cauchy random projection for L1^D (Indyk'00). Partition the line into intervals and attach an independent Cauchy increment to each: X_1, ..., X_9 with X_i ~ C(0, length of interval i), e.g. X_1 ~ C(0,z) for an interval of length z. A density that takes value A on some intervals and B on others projects to A(X_2+X_3) + B(X_5+X_6+X_7+X_8).

40 Cauchy random projection for L1^D (Indyk'00), continued: a second density with value D everywhere projects to D(X_1+X_2+...+X_8+X_9), and the difference of the two projections is distributed as Cauchy(0, d) where d is the L1 distance between the two densities.
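
A minimal sketch (my own rendering of the idea on these slides): refine all piecewise-constant densities to common breakpoints, attach one independent Cauchy increment of scale length_i to each elementary interval, and project each density by the weighted sum of the increments; the difference of two projected coordinates is then C(0, |f_k - f_l|_1), and repeating t times and taking medians of absolute differences gives all pairwise estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_piecewise_constant(values, lengths, t):
    """values[k][i]: value of density k on elementary interval i;
    lengths[i]: length of interval i.  Returns a (num_densities, t) array."""
    values = np.asarray(values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    increments = rng.standard_cauchy(size=(t, len(lengths))) * lengths   # X_i ~ C(0, length_i)
    return values @ increments.T

def all_pairs_l1_estimates(values, lengths, t=400):
    """est |f_k - f_l|_1 = median over the t repetitions of |proj_k - proj_l|."""
    W = project_piecewise_constant(values, lengths, t)
    return np.median(np.abs(W[:, None, :] - W[None, :, :]), axis=2)
```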

41 All pairs L1-distances: piecewise linear densities.

42 All pairs L1-distances, piecewise linear densities - example: with independent X_1, X_2 ~ C(0, 1/2) (Cauchy increments of the two half-intervals), the red linear piece projects to R = (3/4)X_1 + (1/4)X_2 and the blue one to B = (3/4)X_2 + (1/4)X_1, so R - B = (X_1 - X_2)/2 ~ C(0, 1/2).

43 All pairs L1-distances, piecewise linear densities. Problem: too many intersections! Solution: cut into even smaller pieces - stochastic measures are useful.

44 Brownian motion: increments with density (1/(2π)^{1/2}) exp(-x²/2). Cauchy motion: increments with density 1/(π(1+x²)).

45 Brownian motion (increment density (1/(2π)^{1/2}) exp(-x²/2)): for f: R → R^d, the integral ∫ f dL is Gaussian, Y ~ N(0, S) with an explicitly computable covariance S; computing such integrals is easy.

46 Cauchy motion (increment density 1/(π(1+x²))): for f: R → R^d, the integral ∫ f dL satisfies Y ~ C(0, s) when d = 1, and computing it is easy; for d > 1 computing integrals* is hard (* = obtaining an explicit expression for the density).

47 What were we doing? The interval increments X_1, ..., X_9 amount to computing ∫ (f_1, f_2, f_3) dL = ((w_1)_1, (w_2)_1, (w_3)_1) - one coordinate of the projection of each density.

48 (Same picture.) Can we efficiently compute integrals ∫ · dL for piecewise linear f?

49 Can we efficiently compute integrals ∫ · dL for piecewise linear f? Consider φ: R → R², φ(z) = (1, z), and (X, Y) = ∫ φ dL.

50 For φ: R → R², φ(z) = (1, z), and (X, Y) = ∫ φ dL, the slide gives an explicit expression for the density of (2(X-Y), 2Y) at the point (u+v, u-v).

51 Results: all pairs L1-distances for mixtures of uniform densities in time O((N² + Nn)(log N)/ε²); all pairs L1-distances for piecewise linear densities in time O((N² + Nn)(log N)/ε²).

52 QUESTIONS: 1) φ: R → R³, φ(z) = (1, z, z²), (X, Y, Z) = ∫ φ dL - can its density be computed? 2) higher dimensions?

