So Much Data, So Little Time. Bernard Chazelle, Princeton University.

Presentation on theme: "So Much Data, So Little Time. Bernard Chazelle, Princeton University." Presentation transcript:

1 So Much Data, So Little Time. Bernard Chazelle, Princeton University

2 So Many Slides, So Little Time (before lunch). Bernard Chazelle, Princeton University

3 computation math experimentation algorithms

4 Computers have two problems

5 1. They don’t have steering wheels

6

7 2. End of Moore’s Law: party’s over!

8 computation algorithms experimentation

9 32 x 17 = 224 + 320 = 544. This is not me

10 FFT RSA

11

12

13 noisy, low entropy, uncertain, unevenly priced, big

14 noisy, low entropy, uncertain, unevenly priced, big

15 Biomedical imaging Sloan Digital Sky Survey 4 petabytes (~1MG) 10 petabytes/yr 150 petabytes/yr

16 Collected works of Micha Sharir My A(9,9)-th paper

17 massive input output Sublinear Algorithms Sample tiny fraction

18 Shortest Paths [C-Liu-Magen ’03] New York Delphi

19 Ray Shooting  Volume  Intersection  Point location

20 Approximate MST [C-Rubinfeld-Trevisan ’01]

21 Reduces to counting connected components

22 E[X] = no. connected components; var[X] << (no. connected components)^2; whp, X is a good estimator of # connected components
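
The estimator idea on slides 21 and 22 can be sketched with a standard sampling identity: summing 1/(size of u's component) over all vertices u gives exactly the number of connected components, so averaging that quantity over a few random vertices and scaling by n gives an estimate. The sketch below is an illustration under assumptions, not the construction from the cited paper: the function names, the adjacency-dict input format, and the `samples` and `cap` parameters are all hypothetical.

```python
import random
from collections import deque


def truncated_component_size(adj, start, cap):
    """BFS from `start`, stopping once `cap` vertices have been discovered."""
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < cap:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
                if len(seen) >= cap:
                    break
    return len(seen)


def estimate_components(adj, samples, cap, rng=None):
    """Estimate the number of connected components of an undirected graph.

    adj     : dict mapping each vertex to a list of its neighbours (assumed format)
    samples : number of vertices sampled uniformly at random, with replacement
    cap     : exploration budget per sample; components larger than `cap`
              are treated as having size `cap`, which biases the estimate upward
    """
    rng = rng or random.Random()
    vertices = list(adj)
    n = len(vertices)
    total = 0.0
    for _ in range(samples):
        u = rng.choice(vertices)
        total += 1.0 / truncated_component_size(adj, u, cap)
    # Average of 1/size over the sample, scaled up to all n vertices.
    return n * total / samples
```

When `cap` is at least the largest component size the estimate is unbiased; a smaller `cap` trades accuracy for fewer lookups, since each sample then touches at most about `cap` vertices and their incident edges.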

23 worst case input space average case (uniform)

24 worst case

25 average case = actuarial view

26 “OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”

27 arbitrary, unknown random source Self-Improving Algorithms

28 Yes! This could be YOU, too!

29 E[T_k] → optimal expected time for random source: time T1, time T2, time T3, time T4

30 Clustering [ Ailon-C-Liu-Comandur ’05 ] K-median over Hamming cube

31 minimize sum of distances
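
To make the objective on slides 30 and 31 concrete, here is a small sketch of the k-median cost over the Hamming cube: each point pays its Hamming distance to the nearest center, and the goal is to choose k centers that minimize the total. The function names are hypothetical. The `majority_center` helper only covers the easy k = 1 case, where the coordinate-wise majority vote is optimal; it is not the clustering algorithm from the cited work.

```python
def hamming(a, b):
    """Number of coordinates in which two equal-length 0/1 tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median_cost(points, centers):
    """Sum over all points of the Hamming distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


def majority_center(points):
    """Optimal single center (k = 1): take the majority bit in each coordinate.

    Ties are broken toward 0; any tie-break gives the same cost.
    """
    d = len(points[0])
    return tuple(int(2 * sum(p[i] for p in points) > len(points)) for i in range(d))
```

For k > 1 the coordinates can no longer be handled independently, which is where approximation schemes come in.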

32

33 [ Kumar-Sabharwal-Sen ’04 ] COST ≤ OPT (1 + ε)

34 How to achieve linear limiting time? Input space {0,1}^(dn); prob < O(dn)/KSS. Identify core. Tail: use KSS

35 Store sample of precomputed KSS Nearest neighbor Incremental algorithm

36 Main difficulty: How to spot the tail?

37

38 encode

39 decode

40

41 Data inaccessible before noise What makes you think it’s wrong?

42 Data inaccessible before noise must satisfy some property (eg, convex, bipartite) but does not quite

43 f(x) = ? x f(x) data f = access function

44 f(x) = ? x f(x) f = access function

45 f(x) = ? x f(x) But life being what it is…

46 f(x) = ? x f(x)

47 Humans Define distance from any object to data class

48 f(x) = ? x g(x) x1, x2, … f(x1), f(x2), … filter g is access function for:

49 Online Data Reconstruction

50 Monotone function: [n] → R^d. Filter requires polylog(n) lookups [ Ailon-C-Liu-Comandur ’04 ]
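
The filter interface from slide 48 can be illustrated with a deliberately naive baseline for the one-dimensional monotone case: g(x) is the running maximum of f over positions up to x. This is only a sketch of what "g is an access function built from lookups to f" means. It is not the filter from the cited paper: it spends up to n lookups per query rather than polylog(n), and a single large outlier early in the data changes every later answer. The class name and the 0-based indexing are assumptions.

```python
class RunningMaxFilter:
    """Naive monotone filter: g(x) = max of f(0), ..., f(x).

    f is the access function for the noisy data, indexed 0..n-1 (assumed).
    g is always non-decreasing, and g equals f wherever f is already
    non-decreasing up to x.
    """

    def __init__(self, f):
        self.f = f
        self.lookups = 0  # counts calls to f, the resource the slides measure

    def g(self, x):
        best = None
        for y in range(x + 1):
            value = self.f(y)
            self.lookups += 1
            if best is None or value > best:
                best = value
        return best
```

For example, wrapping `data.__getitem__` for a list `data` gives a filter whose answers can be requested online, one query at a time, without preprocessing the whole input.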

51 Convex polygon Filter requires : lookups [C-Comandur ’06 ]

52 Convex terrain lookups Filter requires :

53 Iterated planar separator theorem

54

55 Iterated (weak) planar separator theorem in sublinear time!

56 Using epsilon-nets in spaces of unbounded VC dimension reconstruct

57 bipartite graph k-connectivity expander

58 denoising low-dim attractor sets

59 Priced computation & accuracy: spectrometry/cloning/gene chip, PCR/hybridization/chromatography, gel electrophoresis/blotting. [figure: 0/1 matrix] Linear programming

60 Pricing data. Factoring is easy. Here’s why… Gaussian mixture sample: 00100101001001101010101 ….

61 Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu, Avner Magen, Ronitt Rubinfeld, Luca Trevisan

