
REU 2009 - Traffic Analysis of IP Networks
Daniel S. Allen, Mentor: Dr. Rahul Tripathi
Department of Computer Science & Engineering


Data Streams
Data streams are long sequences of data packets. Information travels over computer networks in the form of data streams. Data streams are often very large (millions of elements).

Streaming Algorithms
We wish to analyze the elements of a data stream to discover anomalies, reveal patterns in traffic, etc. By doing so, misuse of network resources can be detected. Entropy can be used to obtain this information. Entropy is a measure of the unpredictability of the value of each stream element. We consider a data stream of length m with values in the range {1, 2, 3, …, n}. The entropy H of a stream is defined as follows:

$$H = -\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m}$$

where m_i is the frequency of the ith element, i.e. the number of times the value i occurs in the stream (terms with m_i = 0 contribute 0). When all stream elements are identical, H = 0. When all n values occur with the same frequency, H = log n; the largest possible value, log m, is reached when every element of the stream is distinct.

Experiments
The algorithm was implemented in C++. Several experiments were performed on simulated data streams with the following specifications: n = 1000 and ε = δ = 0.25. The “stream” elements take on values from 0 through 999. Multiple sets of values representing different data flows were used:
Fig. 1: The counts for all values are reasonably close to a uniform distribution. This stream contains 25,000 elements.
Fig. 2: The approximated and actual entropies of streams of increasing length. These streams follow the same distribution as above.

Conclusions
Entropy is useful in detecting unusual volumes or distributions of traffic flow. The algorithm performs reasonably well for a close to uniform distribution of values. As the entropy of the stream decreases, the time required by the algorithm increases. The algorithm also produces estimates that are closer to the entropy H, as defined in the formula, for greater values of S and for streams of greater length m.
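The entropy definition in the Streaming Algorithms section can be evaluated exactly when the whole frequency table fits in memory, which is presumably how the "actual" entropies in Fig. 2 were obtained. The following is a minimal C++ sketch under stated assumptions, not the poster's code: the function name exact_entropy is hypothetical, and logarithms are taken base 2 (the poster does not state a base).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Exact empirical entropy H = -sum_i (m_i / m) * log2(m_i / m), in bits.
// Assumption: base-2 logarithm; the source only writes "log".
double exact_entropy(const std::vector<int>& stream) {
    assert(!stream.empty());
    const double m = static_cast<double>(stream.size());

    // m_i: how many times each value occurs. This table is what makes the
    // exact computation linear in the number of distinct values.
    std::unordered_map<int, std::size_t> freq;
    for (int a : stream) ++freq[a];

    double h = 0.0;
    for (const auto& kv : freq) {
        const double p = static_cast<double>(kv.second) / m;
        h -= p * std::log2(p);  // values with m_i = 0 never appear in freq
    }
    return h;
}
```

A stream of identical elements gives 0, and a stream in which four values each occur equally often gives log2(4) = 2 bits, matching the two boundary cases described above.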
The Algorithm
Since streams are typically large, an algorithm with minimal (i.e. sub-linear) space requirements is ideal. Lall et al. show that a deterministic approximation algorithm, or a randomized algorithm that computes the exact value, must use space at least linear in m. Therefore, a combined approach, both randomized and approximate, is needed. Rather than compute the entropy H of a stream directly, the algorithm estimates S, defined by

$$S = \sum_{i=1}^{n} m_i \log m_i$$

from which the entropy follows as H = log m − S/m. The algorithm described by Lall et al. is an (ε, δ)-approximation: it returns an answer with a relative error of at most ε with probability at least 1 − δ. The algorithm has three phases:
1. Pre-processing: A number of random locations in the stream are chosen.
2. Online: For each random location chosen, a new counter is created for the element at that location. Each active counter is updated whenever its element appears again.
3. Post-processing: Counts are arranged in a matrix. Estimated S values are calculated from the counts, then the mean of each row is taken, and the median of the means is returned as the final estimated value. This guarantees a tight error bound on the estimated value of S.

References
1. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, Vol. 1, No. 2, pp. 117-236, 2005.
2. A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the ACM SIGMETRICS conference, pp. 145-156, 2006.

Fig. 1. A close to uniform distribution
Fig. 2. The performance of the algorithm

The Algorithm (pseudocode)
    Pre-processing stage
        z := 32 log m / ε², g := 2 log (1/δ)
        choose z * g locations in the stream at random
    Online stage
        for each item a_j in the stream do
            if a_j already has one or more counters then
                increment all of a_j's counters
            if j is one of the randomly chosen locations then
                start keeping a count for a_j, initialized at 1
    Post-processing stage
        // View the g * z counts as a matrix c of size g × z
        // (0 log 0 is taken to be 0)
        for i := 1 to g do
            for j := 1 to z do
                X_{i,j} := m * (c_{i,j} log c_{i,j} − (c_{i,j} − 1) log (c_{i,j} − 1))
        for i := 1 to g do
            avg[i] := the average of the Xs in group i
        return the median of avg[1], ..., avg[g]
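The three stages of the pseudocode can be sketched in C++ roughly as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the names estimate_S and estimate_entropy are hypothetical, logarithms are base 2, sample locations are drawn with replacement, and z and g are passed in directly rather than derived from ε and δ. The stream is held in a vector only to keep the example self-contained; the online loop itself reads each item once.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <unordered_map>
#include <vector>

// c * log2(c), with the convention 0 * log 0 = 0.
static double xlog2x(double c) { return c > 0.0 ? c * std::log2(c) : 0.0; }

// Estimates S = sum_i m_i * log2(m_i) using g groups of z sampled counters.
double estimate_S(const std::vector<int>& stream, std::size_t z,
                  std::size_t g, std::uint64_t seed) {
    assert(!stream.empty() && z > 0 && g > 0);
    const std::size_t m = stream.size();
    const std::size_t total = z * g;

    // Pre-processing: choose z*g random locations (assumed: with replacement).
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, m - 1);
    std::vector<std::size_t> pos(total);
    for (auto& p : pos) p = pick(rng);

    // Visit counters in stream order, but keep each counter's original index
    // so that the groups formed later remain independent random samples.
    std::vector<std::size_t> order(total);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&pos](std::size_t a, std::size_t b) { return pos[a] < pos[b]; });

    // Online: one pass over the stream.
    std::vector<std::uint64_t> count(total, 0);
    std::unordered_map<int, std::vector<std::size_t>> counters_of;
    std::size_t next = 0;
    for (std::size_t j = 0; j < m; ++j) {
        // Increment every counter already tracking this item's value.
        auto it = counters_of.find(stream[j]);
        if (it != counters_of.end())
            for (std::size_t k : it->second) ++count[k];
        // Start a counter, initialized at 1, for each location equal to j.
        while (next < total && pos[order[next]] == j) {
            const std::size_t k = order[next++];
            count[k] = 1;
            counters_of[stream[j]].push_back(k);
        }
    }

    // Post-processing: view the counts as a g x z matrix, average each row,
    // and return the median of the row averages.
    std::vector<double> avg(g);
    for (std::size_t i = 0; i < g; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < z; ++j) {
            const double c = static_cast<double>(count[i * z + j]);
            sum += static_cast<double>(m) * (xlog2x(c) - xlog2x(c - 1.0));
        }
        avg[i] = sum / static_cast<double>(z);
    }
    std::sort(avg.begin(), avg.end());
    return (g % 2 == 1) ? avg[g / 2] : 0.5 * (avg[g / 2 - 1] + avg[g / 2]);
}

// Converts the estimate of S into an entropy estimate via H = log2(m) - S/m.
double estimate_entropy(const std::vector<int>& stream, std::size_t z,
                        std::size_t g, std::uint64_t seed) {
    const double m = static_cast<double>(stream.size());
    return std::log2(m) - estimate_S(stream, z, g, seed) / m;
}
```

Each X value is an unbiased estimate of S because a location that lands on the k-th-from-last occurrence of a value yields a count of k, and the differences k log k − (k − 1) log (k − 1) telescope to m_i log m_i over all occurrences of value i. With the experimental setting δ = 0.25 and base-2 logarithms, the pseudocode's formula gives g = 2 log2(4) = 4 groups. For simplicity this sketch stores every counter explicitly, so it does not itself demonstrate the sub-linear space bound.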

