REU 2009: Traffic Analysis of IP Networks
Daniel S. Allen, Mentor: Dr. Rahul Tripathi
Department of Computer Science & Engineering


Data Streams
Data streams are long sequences of data packets. Information travels over computer networks in the form of data streams. Data streams are often very large (millions of elements).

Streaming Algorithms
We wish to analyze the elements of a data stream to discover anomalies, reveal patterns in traffic, and so on. Doing so makes it possible to detect misuse of network resources. Entropy can be used to obtain this information. Entropy measures how unpredictable the value of each stream element is. We consider a data stream of length m with values in the range {1, 2, 3, …, n}. The entropy H of a stream is defined as follows:

$$H = -\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m}$$

where m_i is the frequency of the ith element, and terms with m_i = 0 contribute 0. When all stream elements are identical, H = 0; when all values that occur have the same frequency, H attains its maximum for that number of distinct values, reaching log(m) when all m elements are distinct.

Experiments
The algorithm was implemented in C++. Several experiments were performed on simulated data streams with the following specifications: n = 1000, and ε, δ = [values missing from the transcript]. The "stream" elements take on values from 0 through 999. Multiple sets of values representing different data flows were used:
Fig. 1: The counts for all values are reasonably close to a uniform distribution. This stream contains 25,000 elements.
Fig. 2: The approximated and actual entropies of streams of increasing length. These streams follow the same distribution as above.

Conclusions
Entropy is useful in detecting unusual volumes or distributions of traffic flow. The algorithm performs reasonably well for a close to uniform distribution of values. As the entropy of the stream decreases, the time required by the algorithm increases. The algorithm also produces estimates that are closer to the entropy H, as defined in the formula, for greater values of S and for streams of greater length m.
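As a point of reference for the entropy H defined above, the following is a minimal sketch of the exact (non-streaming) computation that an estimate can be compared against. It is not the poster's implementation: the function name `exact_entropy`, the use of 32-bit unsigned values, and the choice of base-2 logarithms are assumptions made for illustration, since the poster does not state the base of the logarithm. It stores one count per distinct value, so it needs the linear space that the streaming algorithm is designed to avoid.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Exact empirical entropy of a stream, computed directly from the definition
// H = -sum_i (m_i / m) * log2(m_i / m).
// Assumption: base-2 logarithm (the poster does not specify the base).
double exact_entropy(const std::vector<std::uint32_t>& stream) {
    assert(!stream.empty());

    // m_i: how many times each value occurs in the stream.
    std::unordered_map<std::uint32_t, std::uint64_t> freq;
    for (std::uint32_t a : stream) {
        ++freq[a];
    }

    const double m = static_cast<double>(stream.size());
    double h = 0.0;
    for (const auto& kv : freq) {
        const double p = static_cast<double>(kv.second) / m;  // m_i / m
        h -= p * std::log2(p);
    }
    return h;
}
```

A stream of identical elements gives 0, and a stream of four distinct elements gives log2(4) = 2, which matches the two extreme cases described in the text.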
The Algorithm
Since streams are typically large, an algorithm with a minimal (i.e., sub-linear) space requirement is ideal. Lall et al. show that an algorithm that is only deterministic (even if approximate) or only randomized (but exact) must use space linear in the stream length, on the order of m bits. Therefore, a combined approach, both randomized and approximate, is needed. Rather than compute the entropy value H for a stream directly, the algorithm estimates S, defined by

$$S = \sum_{i=1}^{n} m_i \log m_i$$

from which the entropy follows as H = log m − S/m. The algorithm described by Lall et al. is an (ε, δ)-approximation: it returns an answer with a relative error of at most ε with probability at least 1 − δ. The algorithm has three phases:
1. Pre-processing: A number of random locations in the stream are chosen.
2. Online: For each random location chosen, a new counter is created. Each active counter is updated.
3. Post-processing: Counts are arranged in a matrix. Estimated S values are calculated from the counts, then the mean of each row is taken, and the median of the means is returned as the final estimated value. This guarantees a tight error bound on the estimated value of S.

References
1. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, Vol. 1, No. 2, pp. 117–236.
2. A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the ACM SIGMETRICS conference, pp. 145–156.

Fig. 1. A close to uniform distribution
Fig. 2. The performance of the algorithm

The Algorithm (pseudocode)
Pre-processing stage
  z := 32 log m / ε², g := 2 log (1/δ)
  choose z ∗ g locations in the stream at random
Online stage
  for each item a_j in the stream do
    if a_j already has one or more counters then
      increment all of a_j's counters
    if j is one of the randomly chosen locations then
      start keeping a count for a_j, initialized at 1
Post-processing stage
  // View the g ∗ z counts as a matrix c of size g × z
  for i := 1 to g do
    for j := 1 to z do
      X_{i,j} := m ∗ (c_{i,j} log c_{i,j} − (c_{i,j} − 1) log (c_{i,j} − 1))
  for i := 1 to g do
    avg[i] := the average of the Xs in group i
  return the median of avg[1], ..., avg[g]
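The three stages of the pseudocode above can be sketched in C++ roughly as follows. This is an illustrative reading of the pseudocode, not the poster's actual code. The names `estimate_s`, `estimate_entropy`, and `xlog2x` are hypothetical; the sketch assumes base-2 logarithms, assumes that S = Σ m_i log m_i (the quantity each X is an unbiased estimate of) so that H = log m − S/m, takes z and g as direct parameters instead of deriving them from ε and δ, and holds the whole stream in a vector for simplicity, where a real deployment would process items one at a time and know m in advance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <random>
#include <unordered_map>
#include <vector>

// x * log2(x), with the convention 0 * log 0 = 0.
double xlog2x(double x) { return x > 0.0 ? x * std::log2(x) : 0.0; }

// Estimate S = sum_i m_i * log2(m_i) using g groups of z sampled counters.
double estimate_s(const std::vector<std::uint32_t>& stream,
                  std::size_t z, std::size_t g, std::mt19937_64& rng) {
    assert(!stream.empty() && z > 0 && g > 0);
    const std::size_t m = stream.size();

    // Pre-processing: choose z * g stream locations uniformly at random
    // (with replacement), then visit them in increasing order of location.
    std::uniform_int_distribution<std::size_t> pick(0, m - 1);
    std::vector<std::size_t> loc(z * g);
    for (auto& p : loc) p = pick(rng);
    std::vector<std::size_t> order(z * g);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return loc[a] < loc[b]; });

    // Online: count[k] is the number of occurrences of the item found at
    // loc[k], counted from that location to the end of the stream.
    std::vector<std::uint64_t> count(z * g, 0);
    std::unordered_map<std::uint32_t, std::vector<std::size_t>> active;
    std::size_t next = 0;
    for (std::size_t j = 0; j < m; ++j) {
        auto it = active.find(stream[j]);
        if (it != active.end()) {
            for (std::size_t k : it->second) ++count[k];  // existing counters
        }
        for (; next < order.size() && loc[order[next]] == j; ++next) {
            count[order[next]] = 1;                       // new counter
            active[stream[j]].push_back(order[next]);
        }
    }

    // Post-processing: view the counts as a g x z matrix, average each row
    // of estimates X, and return the median of the row averages.
    std::vector<double> avg(g, 0.0);
    for (std::size_t i = 0; i < g; ++i) {
        for (std::size_t k = 0; k < z; ++k) {
            const double c = static_cast<double>(count[i * z + k]);
            avg[i] += static_cast<double>(m) * (xlog2x(c) - xlog2x(c - 1.0));
        }
        avg[i] /= static_cast<double>(z);
    }
    std::sort(avg.begin(), avg.end());
    return (g % 2 == 1) ? avg[g / 2] : 0.5 * (avg[g / 2 - 1] + avg[g / 2]);
}

// Entropy estimate derived from the estimate of S: H = log2(m) - S / m.
double estimate_entropy(const std::vector<std::uint32_t>& stream,
                        std::size_t z, std::size_t g, std::mt19937_64& rng) {
    const double m = static_cast<double>(stream.size());
    return std::log2(m) - estimate_s(stream, z, g, rng) / m;
}
```

Calling `estimate_entropy(stream, z, g, rng)` with a seeded `std::mt19937_64` returns an estimate of H. For a stream in which every element is distinct, every counter stays at 1, each X is 0, and the result is exactly log2(m) regardless of which locations are sampled. For other streams the result varies with the sampled locations, and larger z and g trade memory for a tighter estimate.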