Estimating Entropy and Entropy Norm on Data Streams
Amit Chakrabarti†   Khanh Do Ba†,*   S. Muthukrishnan‡
† Department of Computer Science, Dartmouth College
‡ Department of Computer Science, Rutgers University
* Work done during DIMACS REU 2005; supported by a Dean of Faculty Undergraduate Research Grant

ABSTRACT
We consider the problem of computing information-theoretic functions such as entropy on a data stream, using sublinear space. Our first result deals with a measure we call the “entropy norm” of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic-space, one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold. Our second group of results concerns estimating the empirical entropy of an input stream. We first present a sublinear-space, one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m^(2α)) and provides a 1/α-approximation in the worst case and a (1 + ε)-approximation in “most” cases. We then present a two-pass, polylogarithmic-space (1 + ε)-approximation algorithm.

BACKGROUND AND MOTIVATIONS
Already the focus of much recent research, algorithms for computational problems on data streams grow increasingly essential in today’s highly connected world. In this model, the input is a stream of “items” too long to be stored completely in memory, and a typical problem involves computing some statistic on this stream. The challenge is to design algorithms that are efficient not only in running time but also in space: sublinear is a must and polylogarithmic is often the goal. The quintessential need for such algorithms arises in analyzing IP network traffic at the packet level on high-speed routers. In monitoring IP traffic, one cares about anomalies, which are in general hard to define and detect, since intrusions can be subtle and there are sophisticated dependences among network events and agents.

An early attempt to capture the overall behavior of a data stream, now a classical problem in the area, was the family of statistics called frequency moments. If a stream of length m contains m_i occurrences of item i (1 ≤ i ≤ n), then its k-th frequency moment, denoted F_k, is defined by F_k = Σ_{i=1}^{n} m_i^k. In their seminal paper, Alon et al. [1] showed that F_k can be estimated arbitrarily well in o(n) space for all integers k ≥ 0, and in Õ(1) space for k ≤ 2. Their results were later improved by Coppersmith and Kumar [2] and by Indyk and Woodruff [4].

PROBLEM STATEMENTS
We want to estimate the following statistics on the input stream using o(n), and ideally Õ(1), space and a single pass:
1) Entropy norm: F_H = Σ_{i=1}^{n} m_i lg m_i.
2) Empirical entropy: H = Σ_{i=1}^{n} (m_i/m) lg(m/m_i).

OUTLINE OF ALGORITHMS
A subroutine computes the basic estimator: a random variable X whose mean is the target quantity and whose variance is small. The algorithm uses this subroutine to maintain s_1 · s_2 independent basic estimators X_ij, for 1 ≤ i ≤ s_1 and 1 ≤ j ≤ s_2, and outputs the final estimator Y = median_{1 ≤ j ≤ s_2} ( (1/s_1) Σ_{i=1}^{s_1} X_ij ).

For the entropy norm, our basic estimator subroutine is:
Input stream: A = ⟨a_1, a_2, …, a_m⟩, where each a_i ∈ {1, …, n}.
1. Choose p uniformly at random from {1, …, m}.
2. Let r = |{q : a_q = a_p, p ≤ q ≤ m}|.
3. Let X = m[r lg r − (r − 1) lg(r − 1)], where 0 lg 0 = 0.
For empirical entropy, the subroutine is identical except for line 3, where we now have X = r lg(m/r) − (r − 1) lg(m/(r − 1)).
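The description above fully determines the procedure, so, as an illustration only, here is a minimal one-pass Python sketch of the basic-estimator and median-of-means scheme applied to the entropy norm. The function name entropy_norm_estimate and the use of reservoir sampling to choose the random position p online are our own choices for this sketch, not details taken from the poster.

import math
import random
import statistics

def entropy_norm_estimate(stream, s1, s2):
    # One-pass sketch of the scheme above for F_H = sum_i m_i lg m_i.
    # Each of the s1*s2 basic estimators picks a uniformly random stream
    # position p (via reservoir sampling) and counts r, the number of
    # occurrences of a_p from position p through the end of the stream.
    k = s1 * s2
    item = [None] * k   # a_p for each basic estimator
    r = [0] * k         # count of a_p at positions p, ..., m
    m = 0
    for a in stream:    # single pass over the stream
        m += 1
        for j in range(k):
            if random.random() < 1.0 / m:
                # Restart estimator j here; p ends up uniform over {1, ..., m}.
                item[j], r[j] = a, 1
            elif a == item[j]:
                r[j] += 1

    def xlgx(x):        # x lg x, with the convention 0 lg 0 = 0
        return x * math.log2(x) if x > 0 else 0.0

    X = [m * (xlgx(rj) - xlgx(rj - 1)) for rj in r]                # basic estimators
    means = [sum(X[g * s1:(g + 1) * s1]) / s1 for g in range(s2)]  # averages of s1 copies
    return statistics.median(means)                                # median over s2 groups

For the empirical entropy, only the line computing X changes, to r lg(m/r) − (r − 1) lg(m/(r − 1)) as in step 3 above; the averaging and median steps are unchanged. How large s1 and s2 must be for the stated guarantees is determined by the variance analysis in the paper and is not reproduced here.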
SUMMARY OF RESULTS
Result 1.1: We designed a one-pass algorithm that estimates F_H to within a (1 + ε) factor in Õ(1) space, assuming F_H is sufficiently large.
Result 1.2: If F_H is too small to satisfy the assumption above, then we proved that no one-pass, Õ(1)-space algorithm can approximate it to within even a constant factor.
Result 2.1: We designed a one-pass algorithm that estimates H to within a (1 + ε) factor in Õ(1) space if H is sufficiently large, and in Õ(m^(2/3)) space in general.
Result 2.2: We designed a two-pass algorithm that estimates H to within a (1 + ε) factor in Õ(1) space for any H.

CONCLUSIONS
Our complete results have been published as a technical report (DIMACS-TR). We believe our algorithms will be of practical interest in data stream systems, as recent work in the networking community appears to be converging on entropy as a reasonable approach to anomaly detection [3, 5]. Although we have proven our entropy-norm algorithm to be optimal, it appears feasible, as future work, to improve our last algorithm for estimating empirical entropy (Result 2.2) to run in a single pass. It will also be of interest to study these problems on streams with deletions as well as insertions.

REFERENCES
[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Proc. ACM STOC, 20-29, 1996.
[2] D. Coppersmith and R. Kumar. An improved data stream algorithm for frequency moments. Proc. ACM-SIAM SODA, 2004.
[3] Y. Gu, A. McCallum, and D. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. Proc. Internet Measurement Conference, 2005.
[4] P. Indyk and D. Woodruff. Optimal approximations of the frequency moments of data streams. Proc. ACM STOC, 2005.
[5] K. Xu, Z. Zhang, and S. Bhattacharya. Profiling internet backbone traffic: behavior models and applications. Proc. ACM SIGCOMM, 2005.