Sampling and Flow Measurement Eric Purpus 5/18/04.

2 Outline –Application of Sampling Methodologies to Network Traffic Characterization (Claffy, Polyzos, Braun, ‘93) –Trajectory Sampling for Direct Traffic Observation (Duffield, Grossglauser, ‘00) –Trajectory Sampling with Unreliable Reporting (Duffield, Grossglauser, ‘04)

3 Problem/Motivation Dealing with huge traces of wide area network traffic is often infeasible –Difficult to collect –Difficult to analyze Instead, we’d like to sample to reduce the size of the traces. Many sampling methods exist, but which is best in this context?

4 Approach Take a full trace and simulate different types of sampling on that trace. Compare sample distributions to the parent population (full trace) –Packet arrival rate –Byte arrival rate –Mean per-second packet size Measure performance as the degree of similarity between sample and parent distributions.

5 Sampling Mechanisms –Systematic: select every k-th element –Stratified random: choose a bucket size k and select one random element from each bucket –Simple random: select n elements at random from the total population
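
The three mechanisms can be sketched in a few lines of Python. This is only an illustration: the list `packets` stands in for a packet trace, the function names are made up for this example, and the final bucket in `stratified_random` may be shorter than k when the trace length is not a multiple of k.

```python
import random

def systematic(population, k, start=0):
    """Select every k-th element, beginning at index `start`."""
    return population[start::k]

def stratified_random(population, k, rng):
    """Cut the population into consecutive buckets of size k and
    pick one element uniformly at random from each bucket."""
    return [rng.choice(population[i:i + k])
            for i in range(0, len(population), k)]

def simple_random(population, n, rng):
    """Select n elements uniformly at random, without replacement."""
    return rng.sample(population, n)

packets = list(range(100))   # stand-in for packet records (hypothetical data)
rng = random.Random(0)       # fixed seed so a run is repeatable

print(systematic(packets, 10))              # every 10th packet, starting at packet 0
print(stratified_random(packets, 10, rng))  # one packet from each block of 10
print(simple_random(packets, 10, rng))      # 10 packets from anywhere in the trace
```

All three calls return 10 of the 100 packets; they differ only in how evenly the chosen packets are spread over the trace.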

6 Timer-driven vs Packet-driven Apply the three sampling methods with both packet-driven and timer-driven triggers. –Timer-driven: when the timer expires, select the next packet to arrive –Only systematic and stratified random sampling are used for the timer-driven approach
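
A minimal sketch of timer-driven systematic sampling, assuming the timer fires on a fixed grid (period, 2·period, ...) and that a packet already selected by an earlier expiry is not selected again. Both choices, and the name `timer_driven_systematic`, are assumptions made for this example; the slides do not specify them.

```python
from bisect import bisect_left

def timer_driven_systematic(arrival_times, period):
    """Return the indices of the selected packets.

    arrival_times: sorted packet timestamps in seconds.
    period: timer period in seconds (assumed to fire at period, 2*period, ...).
    """
    selected = []
    if not arrival_times:
        return selected
    last = arrival_times[-1]
    k = 1
    while k * period <= last:
        # First packet arriving at or after this timer expiry.
        i = bisect_left(arrival_times, k * period)
        if not selected or selected[-1] != i:
            selected.append(i)
        k += 1
    return selected

# A burst of five packets followed by three widely spaced ones (made-up data).
arrivals = [0.1, 0.2, 0.3, 0.4, 0.5, 3.0, 6.2, 9.1]
print(timer_driven_systematic(arrivals, period=2.0))  # [5, 6, 7]
```

None of the five packets in the opening burst is selected, which illustrates why a timer trigger can under-represent bursty periods with small interarrival times.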

7 Metrics for disparity between distributions χ² coefficient: $\chi^2 = \sum_{i=1}^{B} (O_i - E_i)^2 / E_i$ –B: number of bins –O_i: number of observations found in the i-th bin –E_i: number of observations expected in the i-th bin based on the parent population model –Drawback: sensitive to the size of the data set

8 Phi Coefficient φ coefficient: –Well established in statistical literature –Not affected by sample size
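
The slide text does not preserve the formula for φ. The sketch below assumes the textbook form, φ = sqrt(χ²/n) with n the number of sampled observations, and Pearson's χ² over the bins B, O_i, E_i defined on the previous slide. The paper's exact normalization may differ, and the bin counts are made up.

```python
import math

def expected_counts(parent_counts, sample_size):
    """Scale the parent population's bin counts down to the sample size."""
    total = sum(parent_counts)
    return [sample_size * c / total for c in parent_counts]

def chi_squared(observed, expected):
    """Pearson's chi-squared statistic over matching bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def phi_coefficient(observed, expected):
    """Assumed textbook form: sqrt(chi^2 / n), n = number of observations."""
    return math.sqrt(chi_squared(observed, expected) / sum(observed))

parent = [500, 300, 200]   # hypothetical parent histogram
small = [45, 35, 20]       # a sample of 100 observations
large = [450, 350, 200]    # same proportions, 1000 observations

for sample in (small, large):
    exp = expected_counts(parent, sum(sample))
    print(chi_squared(sample, exp), phi_coefficient(sample, exp))
```

The two samples have identical proportions, yet χ² grows tenfold with the sample size while φ stays the same, which is the property the slide attributes to φ.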

9 Parent Population 24-hour trace of packets sent from SDSC to NSFNET –Collected at the single entrance interface into the backbone 650MB trace collected on March 22, 1993 –Only perform sampling simulations on the period from 13:00 to 14:00 on March 23

10 Mean phi across sampling methods for packet interarrivals –Lower 3 lines are packet-driven –Upper 2 lines are timer-driven –Timer-based sampling is especially bad for bursty periods with many packets and small interarrival times

11 Sampling Fraction Values of φ for the packet size distribution across sampling fractions (fraction 1/x, bucket size x) using systematic sampling. Analyzing a 1024-second interval

12 Sampling Duration Values of φ for the packet size distribution across sampling durations

13 Conclusions –Time-driven sampling is not appropriate for wide area network traffic –Within each class of sampling (time or packet-driven), differences are small? (Even though systematic sampling was used in all of the results graphs) –Provides a way to compare across sampling fractions and intervals, but it is not rigorous

14 Trajectory Sampling Traffic measurement to determine the paths followed by packets between any ingress and egress point of a domain. Useful for understanding traffic patterns when solving traffic engineering problems

15 Main Idea Hash each packet’s content to decide whether the packet is sampled. Because the same hash function is used at all the monitors, a packet that is sampled on one link is sampled on every link it traverses. A second hash of the packet contents gives a label that is unique (with high probability) within a measurement period and is used to track the packet’s trajectory

16 Trajectory Sampling

17 Some Formalization –Invariance function φ: input is the packet contents x, output is the invariant content of the packet, i.e. the part not modified upon forwarding (unlike TTL, for example, which changes at every hop) –Sampling hash function h(φ(x)) = φ(x) mod A: based on the invariant content, hashes into l bits –Sampling domain D: if h(φ(x)) ∈ D, then sample the packet. Here, D = {0, 1, …, r − 1} –Identification hash function g(φ(x)) = φ(x) mod B: produces an identifier that is unique with high probability within a measurement period (Note: B ≠ A)
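
A minimal sketch of the two hashes. The moduli `A` and `B`, the domain size `R`, the packet fields, and the way `invariant_content` packs them into an integer are all hypothetical choices for this example, not values from the paper.

```python
A = 1009    # sampling modulus (hypothetical value)
B = 65521   # identification modulus (hypothetical value, different from A)
R = 101     # sampling domain D = {0, ..., R - 1}

def invariant_content(packet: dict) -> int:
    """phi(x): pack fields that do not change from hop to hop into one
    integer. TTL is deliberately left out because routers decrement it."""
    header = f"{packet['src']}|{packet['dst']}|{packet['ip_id']}|".encode()
    return int.from_bytes(header + packet["payload"], "big")

def is_sampled(packet: dict) -> bool:
    """Sampling hash h = phi(x) mod A; sample when h falls in D."""
    return invariant_content(packet) % A < R

def label(packet: dict) -> int:
    """Identification hash g = phi(x) mod B, reported for sampled packets."""
    return invariant_content(packet) % B

pkt = {"src": "10.0.0.1", "dst": "192.0.2.7", "ip_id": 4242,
       "ttl": 64, "payload": b"hello"}
next_hop = dict(pkt, ttl=63)   # the same packet one hop later

print(is_sampled(pkt) == is_sampled(next_hop))  # True: same decision at every hop
print(label(pkt) == label(next_hop))            # True: same label at every hop
```

Because neither function reads the TTL, every router reaches the same sampling decision and the same label for a given packet without any coordination.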

18 Packet Invariance

19 Label collision Label collisions can create ambiguity in the label subgraph. A simple solution is to discard ambiguous labels

20 Is sampling with mod A good? Use traces from 500 campus hosts and 3000 distinct external hosts, for a total of 1 million IP packets. The distribution of any variable attribute of the packet (such as source/destination IP) should be the same in the sample as in the parent population (sound familiar?). Use the chi-squared test on address prefix, bitwise address, and temporal sampling distributions. Summary: yes, they seem to think it’s ok

21 Optimal Sampling There is a trade-off between taking n samples and using m bits per sample as the packet identification. We want to maximize the expected number of unique samples, subject to the total measurement traffic nm not exceeding a constant c. This turns into an optimization problem

22 Optimization U(n, m) is the expected number of unique samples. –To maximize U(n, m), use the whole budget: set m = c/n –Becomes a one-variable optimization of U(n)
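
The slides do not give the form of U. As a rough sketch, assume labels are independent and uniform over 2^m values, so the expected number of labels seen exactly once is n(1 − 2^(−m))^(n−1); with m = c/n this depends on n alone and can be maximized by brute force. The paper's exact objective may differ, and the budget below is a toy value.

```python
import math

def expected_unique(n: int, c: float) -> float:
    """Expected number of labels that appear exactly once among n samples,
    assuming independent uniform labels of m = c / n bits each."""
    m = c / n
    # log1p keeps precision when 2**-m is very small
    return n * math.exp((n - 1) * math.log1p(-2.0 ** -m))

def best_sample_count(c: float) -> int:
    """Brute-force search over n = 1 .. c for the maximizer of U(n)."""
    return max(range(1, int(c) + 1), key=lambda n: expected_unique(n, c))

c = 10_000                    # toy budget in bits per measurement epoch
n_star = best_sample_count(c)
print(n_star, c / n_star, expected_unique(n_star, c))
```

Few samples with long labels waste budget on identification; many samples with short labels collide. The maximizer sits between those extremes.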

23 Optimal Values “real world” Say you have 100 OC-192 links (10 Gbps each). Measurement system can handle 10 Mbps. Measurement epoch is T = 10 s (upper bound on packet lifetime). Assume all packets are 1500 bytes. For c = T × 10^7 = 10^8 bits per measurement epoch: –n = 3.84 × 10^6 (3840 samples per link per second) –m = 26 bits per label –collision probability p_coll = 5.4%
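
The arithmetic on this slide can be checked directly. The collision model used here (independent, uniformly distributed labels, with m = c/n left unrounded) is an assumption; the slides do not state how p_coll was computed.

```python
import math

LINKS = 100            # number of OC-192 links
COLLECTOR_BPS = 10e6   # measurement system capacity: 10 Mbps
T = 10                 # measurement epoch in seconds

c = T * COLLECTOR_BPS             # bit budget per epoch
n = 3.84e6                        # samples per epoch, from the slide
m = c / n                         # bits available per label
per_link_per_sec = n / (LINKS * T)

# Assumed model: a given label collides if any of the other n - 1
# samples draws the same one of 2**m equally likely values.
p_coll = -math.expm1((n - 1) * math.log1p(-2.0 ** -m))

print(f"c = {c:.0e} bits")
print(f"m = {m:.2f} bits per label")
print(f"{per_link_per_sec:.0f} samples per link per second")
print(f"p_coll = {p_coll:.1%}")
```

Under that assumption the result is about 5.4%, in line with the slide, and m comes out just above 26 bits.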

24 Experimental Setup Assume a service provider wants to determine the fraction of packets on a certain backbone link that belong to a certain customer. Fraction = n_{c,b} / n_b
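
A sketch of the estimator, reading n_b as the number of sampled packets seen on backbone link b and n_{c,b} as those among them that belong to customer c. That reading, the record layout, and the data are assumptions made for this example.

```python
def estimate_fraction(samples, customer, link):
    """samples: iterable of (customer_id, links_traversed) pairs, one per
    sampled packet. Returns n_{c,b} / n_b, or None if link b saw no samples."""
    n_b = sum(1 for _, links in samples if link in links)
    n_cb = sum(1 for cust, links in samples
               if link in links and cust == customer)
    return n_cb / n_b if n_b else None

# Hypothetical sampled trajectories: (customer, set of links on the path)
samples = [
    ("c1", {"a", "b"}),
    ("c2", {"b"}),
    ("c1", {"a"}),
    ("c1", {"b", "d"}),
]
print(estimate_fraction(samples, customer="c1", link="b"))  # 2 of the 3 samples on b
```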

25 Real and Estimated (1000 bit)

26 Real and Estimated (10000 bit)

27 Conclusions –Simple processing (no memory lookups, just division arithmetic) –No router state required –Direct packet observation (does not require knowledge or understanding of routing state/behavior)

28 Trajectory Sampling with Unreliable Reporting Reports can be lost between router and collector, since trajectory reports are not sent reliably. Duplicate labels are also possible, which can make trajectories ambiguous.

29 Approach –Record labels in Bloom filters at ingress, which are transmitted to the collector: unbiased inference of original traffic intensities –Reconstruct paths from incomplete trajectories using multiple packet reports for the same path –Infer link loss rates even in the presence of report loss
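
For readers unfamiliar with Bloom filters, here is a generic sketch of one that records labels. The filter size, the number of hash functions, and the use of SHA-256 to derive bit positions are arbitrary choices for illustration and are not taken from the paper.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, label):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{label}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, label):
        for p in self._positions(label):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, label):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(label))

ingress_filter = BloomFilter()
for lbl in (17, 4242, 99001):    # labels of packets sampled at this ingress
    ingress_filter.add(lbl)

print(4242 in ingress_filter)    # True: an added label is always found
```

The filter is a fixed-size bit array, so an ingress router can ship it to the collector at a cost independent of how many labels it recorded.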

30 Label Sets

31 Duplicate Elimination Eliminate a label l if: –It has been observed more than once at a single ingress router and/or –It has been observed at multiple ingress routers The elimination process is unbiased (proven in the paper)
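
The two elimination rules can be sketched as follows. The input layout (a dict from ingress router name to the list of labels it observed) and the function name are assumptions for this example.

```python
from collections import Counter

def eliminate_duplicates(ingress_labels):
    """ingress_labels: dict mapping ingress router -> list of observed labels.
    Drops every label seen more than once at one router or at several routers."""
    routers_seen = Counter()   # label -> number of ingress routers that saw it
    repeated = set()           # labels seen more than once at a single router
    for labels in ingress_labels.values():
        counts = Counter(labels)
        routers_seen.update(counts.keys())
        repeated.update(l for l, c in counts.items() if c > 1)

    keep = {l for l, r in routers_seen.items() if r == 1 and l not in repeated}
    return {router: [l for l in labels if l in keep]
            for router, labels in ingress_labels.items()}

observed = {"r1": [1, 2, 2, 3], "r2": [3, 4]}
print(eliminate_duplicates(observed))   # {'r1': [1], 'r2': [4]}
```

Label 2 is dropped because r1 saw it twice, and label 3 because both r1 and r2 saw it.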

32 Path Reconstruction When routing is stable, packets from the same flow follow the same path (or set of paths). As long as the report loss rate on each link is below 100%, enough packets on the same path will eventually yield a report from every link on the path
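
A simplified sketch of the idea, assuming a single stable path and ignoring link order: each sampled packet yields reports from only some of the links it crossed, and the union over several packets recovers the full link set. The function name and the data are made up.

```python
def reconstruct_links(partial_reports):
    """partial_reports: one set per sampled packet of the same flow, holding
    the links whose reports reached the collector. Returns their union."""
    links = set()
    for reported in partial_reports:
        links |= reported
    return links

# True path l1 -> l2 -> l3 -> l4; every packet lost some of its reports.
reports = [{"l1", "l3"}, {"l2", "l3"}, {"l1", "l4"}]
print(sorted(reconstruct_links(reports)))   # ['l1', 'l2', 'l3', 'l4']
```

No single packet reported all four links, but three packets together cover the path.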

33 Conclusion Trajectory sampling is possible even with lossy reporting
