
1 Bruno Ribeiro CS69000-DM1 Topics in Data Mining

2 Announcement Reminder
• Reviews of next week's papers due Friday 5pm (submission closes Sunday 11:59pm)
  ◦ Assignment on Blackboard
• Deadline to select projects
  ◦ Sept 29

3 Today
• Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
• Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.

4 Waiting Time Paradox
• Why is your bus often full?

5 Set Size Estimation Problem
sample prob = p
More likely to observe sets with a large number of elements.
How much more likely are we to see the green set than the blue set?
[Figure: observed sets]
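One way to put a number on the question above, assuming each element is sampled independently with probability p: a set with k elements is observed (has at least one sampled element) with probability 1 - (1 - p)^k. The sketch below compares two sets; the sizes chosen for the "green" and "blue" sets are hypothetical, not taken from the slide.

```python
def observe_prob(k: int, p: float) -> float:
    """Probability that a set of k elements has at least one element sampled
    when each element is kept independently with probability p."""
    return 1.0 - (1.0 - p) ** k


p = 0.1
green_size, blue_size = 10, 2  # hypothetical set sizes, for illustration only

# How much more likely the larger set is to show up in the sample.
ratio = observe_prob(green_size, p) / observe_prob(blue_size, p)
print(f"green vs. blue observation ratio: {ratio:.3f}")
```

For small p the ratio approaches green_size / blue_size, which is the size bias behind the waiting time paradox.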

6 Set Size Distribution Estimation
[Figure labels: original data, random sampling, observed data, estimation, set size distribution]

7 Example Application
Do we see c_0?

8 Problem Formulation (corrected)

9 Application 1: Estimate Latent Characteristics
• If edges arrive independently at random…
• Estimate original average degree
  ◦ Knowing the sampling probability p
[Figure labels: underlying "true network" (e.g. phone calls); observed during window [0, T]; edges labeled p]
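A minimal sketch of the simplest version of this estimate, under a strong assumption that is not stated on the slide: the total number of nodes is known, including nodes with no observed edge. If each original edge is observed independently with probability p, the observed edge count divided by p is an unbiased estimate of the original edge count. The function name and arguments are hypothetical.

```python
def estimate_avg_degree(observed_edges: int, num_nodes: int, p: float) -> float:
    """Estimate the original average degree 2|E| / |V|.

    Each original edge is observed independently with probability p, so
    observed_edges / p is an unbiased estimate of |E|.
    ASSUMPTION: num_nodes (|V|) is known, including nodes with no observed edge.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 2.0 * observed_edges / (p * num_nodes)


print(estimate_avg_degree(observed_edges=50, num_nodes=100, p=0.5))
```

When unobserved nodes are not known, nodes with zero sampled edges go missing and this shortcut no longer applies, which is where the set size distribution estimation problem comes in.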

10 Application 2: TCP flow size estimation
Estimate the original flow size distribution from counts of the number of sampled packets.
[Figure: TCP flow packets under packet sampling; outcomes range from no packet sampled (flow not sampled), to 1 packet sampled, to all packets sampled. Pipeline labels: original data, random sampling, observed data, estimation, set size distribution]
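The relation between the original and observed distributions can be sketched as follows, assuming packets are sampled independently with probability p (binomial thinning) and that a flow with no sampled packet is never seen. The names `theta`, `thinning_prob`, and `observed_distribution`, and the example distribution, are illustrative assumptions rather than the lecture's notation.

```python
from math import comb


def thinning_prob(j: int, i: int, p: float) -> float:
    """P(j of the i packets of a flow are sampled), packets kept i.i.d. with prob. p."""
    if j < 0 or j > i:
        return 0.0
    return comb(i, j) * p**j * (1 - p) ** (i - j)


def observed_distribution(theta, p):
    """theta[i-1] = fraction of flows with i packets (i = 1..W).

    Returns d with d[j-1] = P(j sampled packets | flow has >= 1 sampled packet).
    """
    W = len(theta)
    raw = [
        sum(theta[i - 1] * thinning_prob(j, i, p) for i in range(j, W + 1))
        for j in range(1, W + 1)
    ]
    seen = sum(raw)  # probability that a flow is observed at all
    return [r / seen for r in raw]


theta = [0.5, 0.3, 0.2]  # hypothetical original flow size distribution
d = observed_distribution(theta, p=0.5)
print(d)
```

Estimation is the inverse problem: recover `theta` from empirical counts that follow `d`.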

11 Maximum Likelihood Estimation in practice…
[Figure: accuracy of proposed estimator at sampling rate = 1/100, without proto. info. and with proto. info.; axis label: n]

12 What I will show
Fisher information and the data processing inequality as tools to "debug" measurement methods
Lessons:
• Feature engineering by trial & error is tricky and expensive
• Analyze the last step
  ◦ Is there enough information to proceed to estimation?
  ◦ Does a better summary function exist?
  ◦ Where is information lost?

13 Estimating characteristics from sampling
Data processing inequality: "No processing can increase the amount of statistical information already contained in the data"
[Diagram labels: Nature, sampling, raw samples, sample summary, Estimator, characteristic summary; annotated "Data processing inequality"]

14 "Debugging" the sampling design
• Fisher information
  ◦ Amount of information observations carry about the unknown characteristic
• Cramér-Rao inequality
  ◦ Connects the Fisher information with the minimum mean squared error (MSE) achievable by any unbiased estimator
[Diagram labels: Nature, sampling, raw samples, sample summary, Best Estimator, characteristic summary; annotated "Data processing inequality" and "assumption: θ". Decision: quality of estimates? good: done; poor: back to the drawing board (summary, best estimator)]

15 "[The finding] that the amount of information extracted in the process of estimation could never exceed the quantity supplied by the data, combined with the practical fact that directly available processes of computation would extract almost always a very large fraction of the total available [information], shifted the moral balance. The weight of [the statistician's] responsibility was thrown back on to the process by which the data had come into existence. […] what types of observational programs would yield the most information for a given expenditure in time, money and labor."
R. A. Fisher, 1947

16 Problem Formulation

17 Fisher Information
[Equations not preserved: a definition with a "where" clause, then the same quantity in matrix form]

18 Cramér-Rao Lower Bound
Suppose we obtain unbiased estimates from observations. The mean squared error (covariance matrix) of the estimates is bounded below by the inverse Fisher information (the Cramér-Rao bound).
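As a reference point for the bound named on this slide, the standard textbook forms of the Fisher information matrix and the Cramér-Rao bound are sketched below. The symbols are assumptions and may differ from the lecture's notation: f is the likelihood of a single observation X, N is the number of i.i.d. observations, and T is any unbiased estimator of θ; the usual regularity conditions are assumed.

```latex
J(\theta)_{ij}
  = \mathbb{E}\!\left[
      \frac{\partial \log f(X;\theta)}{\partial \theta_i}\,
      \frac{\partial \log f(X;\theta)}{\partial \theta_j}
    \right],
\qquad
\operatorname{Cov}_\theta(T) \succeq \frac{1}{N}\, J(\theta)^{-1}
```

Here the relation ⪰ means that the difference of the two matrices is positive semidefinite, so each diagonal entry of the covariance (the MSE of one coordinate) is at least the matching diagonal entry of the scaled inverse Fisher information.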

19 Cramér-Rao Lower Bound
But we must consider the parameter constraint.
[Figure labels: CRLB without constraint; CRLB with constraint]

20 Fisher Information with Priors
• Fisher information with priors
[Equation labels: total FI, FI of prior, FI original]
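One common way to write the decomposition that the labels on this slide suggest is the Bayesian (van Trees) form below. This is an assumption about what the slide showed: π denotes a prior density over θ, and J(θ) is the Fisher information of the data alone.

```latex
J^{(\mathrm{total})}
  = \mathbb{E}_{\pi}\!\left[ J(\theta) \right] + J^{(\mathrm{prior})},
\qquad
J^{(\mathrm{prior})}_{ij}
  = \mathbb{E}_{\pi}\!\left[
      \frac{\partial \log \pi(\theta)}{\partial \theta_i}\,
      \frac{\partial \log \pi(\theta)}{\partial \theta_j}
    \right]
```

In this form, prior knowledge adds information to what the observations carry, which tightens the resulting lower bound on estimation error.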

21 Different Sampling Designs
• FS = Flow Sampling: sample sets with probability q
• SH = Sample and Hold: randomly sample first element with probability q′, but collect all future elements of the same set
• DS = Dual Sampling: sample first element with high probability; sample following elements with low probability and use "sequence numbers" to obtain elements lost "in the middle"
• PS = Packet Sampling: sample elements with probability p
(Sets are seen as a stream of elements.)
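A rough simulation sketch of three of the designs above, to make the differences concrete. It reflects one reading of the slide, not the papers' exact definitions: for SH, each packet of a not-yet-held flow is sampled with probability q1 and every later packet is collected once one is sampled. DS is omitted because it depends on sequence-number details not given here. All function and parameter names are hypothetical.

```python
import random


def packet_sampling(flow_sizes, p, rng):
    """PS: keep each packet independently with probability p.
    Returns the sampled packet count per flow (0 means the flow is unseen)."""
    return [sum(rng.random() < p for _ in range(n)) for n in flow_sizes]


def flow_sampling(flow_sizes, q, rng):
    """FS: keep a whole flow with probability q; a kept flow reveals its full size."""
    return [n if rng.random() < q else 0 for n in flow_sizes]


def sample_and_hold(flow_sizes, q1, rng):
    """SH (as read from the slide): sample packets with probability q1 until
    the first hit, then collect every later packet of the same flow."""
    counts = []
    for n in flow_sizes:
        seen = 0
        for k in range(n):
            if rng.random() < q1:
                seen = n - k  # the sampled packet plus all packets after it
                break
        counts.append(seen)
    return counts


rng = random.Random(0)        # fixed seed so runs are repeatable
sizes = [1, 5, 10, 50]        # hypothetical flow sizes
print("PS:", packet_sampling(sizes, 0.1, rng))
print("FS:", flow_sampling(sizes, 0.1, rng))
print("SH:", sample_and_hold(sizes, 0.1, rng))
```

FS returns either 0 or the exact flow size, while PS returns a thinned count for every flow, which is why the two summaries carry different amounts of information about the size distribution.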

22 Results: Different Sampling Designs (Veitch & Tune '14)
• FS = Flow sampling
• SH = Sample and hold
• DS = Dual sampling
• PS = Packet sampling

23 Today
• Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
• Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.

24 Part 1: Random Sampling vs. Data Streaming

25 What if we decided to bypass sampling?
• Fisher information of the sample summary?

26 Sketching
[Figure: Sketch phase at the router: a universal hash function maps packets to counters, and two flows can hit the same counter (collision!!). Estimation phase at a powerful back-end server: the counters summary is turned into a flow size distribution estimate. Annotations: "Prevent collisions: keep unique packet ID (flow sampling)"; "Disambiguate"]
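The sketch phase can be illustrated with a small counter array. This is a minimal sketch under stated assumptions: a salted BLAKE2 digest stands in for the universal hash function, the packet stream is a made-up list of flow IDs, and the estimation phase is not shown.

```python
import hashlib


def bucket(flow_id: str, num_counters: int, seed: int = 0) -> int:
    """Map a flow ID to a counter index. A salted BLAKE2 digest stands in
    for the universal hash function (an assumption of this sketch)."""
    digest = hashlib.blake2b(
        flow_id.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % num_counters


def sketch(packets, num_counters, seed=0):
    """Sketch phase: one counter increment per packet, no per-flow state."""
    counters = [0] * num_counters
    for flow_id in packets:
        counters[bucket(flow_id, num_counters, seed)] += 1
    return counters


packets = ["a", "b", "a", "c", "a", "b"]  # hypothetical stream of flow IDs
counters = sketch(packets, num_counters=4)
print(counters)
```

When two flows map to the same index, their packet counts are summed in one counter, and the back end can no longer tell them apart from that counter alone; that is the collision the figure marks.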

27 Eviction Sketch
• Why?
  ◦ Fisher information analysis shows collided counter ≃ 0 information


29 Set Size Estimation Errors in Practice
• p = 0.25
• (a) N = 10,000 and (b) N = 50,000 sampled sets
• (c) N ∈ {5, 10, 20, 50, 100} × 10³ sampled sets

30 Set Size Estimation Errors in Practice II
• p = 0.90
• (a) N = 10,000 and (b) N = 50,000 sampled sets
• (c) N ∈ {5, 10, 20, 50, 100} × 10³ sampled sets

31 Scaling on max set size: phase transition of estimation errors
[symbol not preserved]: observable set sizes
W: size of largest set
T_i(S): estimate of θ_i

32 Infinite support & power laws
• If the set size distribution θ is a power law with infinite support (W → ∞)
  ◦ if p < ½, any unbiased estimator is inaccurate ⇒ might as well output random estimates
  ◦ if p > ½, estimates can be accurate if enough samples are collected

33 Next Class
• How to collect data!!

