
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.


1 Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan

2 Review of Data Streams Motivation: huge data stream that needs to be mined for info “efficiently.” Applications: monitoring IP traffic, mining email and text message streams, etc.

3 The Mathematical Model Sequence of integers $A = \langle a_1, \ldots, a_m \rangle$, where each $a_i \in N = \{1, \ldots, n\}$. For each $v \in N$, the frequency $m_v$ of $v$ is the number of occurrences of $v$ in $A$. Statistics to be estimated are functions on $A$, but usually just on the $m_v$'s (e.g. frequency moments).

4 What is Entropy? In physics: measure of disorder in a system. In math: measure of randomness (or uniformity) of a probability distribution. Formula: $H = -\sum_{v} \Pr[v] \log \Pr[v]$.

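5 Entropy on Data Streams For big $m$, $m_v/m \to \Pr[v]$. So formula becomes: $H = -\sum_{v} \frac{m_v}{m} \log \frac{m_v}{m} = \log m - \frac{1}{m} \sum_{v} m_v \log m_v$. Suffices to compute $m$ (easy) and $\mu = \sum_{v} m_v \log m_v$.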
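As a reference point for the streaming estimator, the identity above can be checked with an exact (non-streaming) computation that stores every frequency. This is only a sketch: the function name `entropy_exact` and the choice of base-2 logarithms are assumptions, since the slides do not fix a log base.

```python
import math
from collections import Counter


def entropy_exact(stream):
    """Exact empirical entropy, computed as log m - mu / m.

    Uses O(n) space for the frequency table, so it is a baseline for
    checking an estimator, not a streaming algorithm.
    Base-2 logarithms are an assumption of this sketch.
    """
    counts = Counter(stream)            # m_v for every value v seen
    m = sum(counts.values())            # stream length
    if m == 0:
        raise ValueError("empty stream")
    # mu = sum over v of m_v * log(m_v)
    mu = sum(c * math.log2(c) for c in counts.values())
    return math.log2(m) - mu / m
```

A stream of four distinct items is uniform over four values and gives 2 bits; a stream that repeats one item gives 0.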

6 The Goal Approximation algorithm to estimate $\mu$. Approximate means to output a number $Y$ such that $\Pr[|Y - \mu| > \lambda\mu] \le \varepsilon$, for any user-specified $\lambda, \varepsilon > 0$. Restrictions: $o(n)$, preferably $\tilde{O}(1)$, space, and only 1 pass over data.

7 The Algorithm We want $Y$ to have $E[Y] = \mu$ and very small variance, so find a computable random variable $X$ with $E[X] = \mu$ and small variance, and compute it several times. $Y$ is the median of $s_2$ RVs $Y_i$, each of which is the mean of $s_1$ RVs $X_{ij} = X$ (independently, identically computed).
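The median-of-means combination can be sketched on its own, independent of how X is computed. Here `sample` is a hypothetical zero-argument callable returning one independent copy of X, and `median_of_means` is an illustrative name. In a real one-pass setting all s1 * s2 copies would be maintained in parallel over the same stream rather than drawn one after another as below.

```python
import statistics


def median_of_means(sample, s1, s2):
    """Return the median of s2 group means, each over s1 draws of sample().

    sample: hypothetical callable producing one independent copy of X.
    s1: draws per group (averaging reduces the variance).
    s2: number of groups (the median boosts the success probability).
    """
    means = [
        statistics.fmean(sample() for _ in range(s1))
        for _ in range(s2)
    ]
    return statistics.median(means)
```

Averaging controls the spread of each Y_i, and taking the median makes the final Y robust to the few groups whose mean lands far from the target.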

8 Computing X Choose $p \in \{1, \ldots, m\}$ uniformly at random. Let $r = \#\{q \ge p \mid a_q = a_p\}$ ($\ge 1$). $X = m[r \log r - (r - 1) \log (r - 1)]$.
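One way to compute a single copy of X in one pass is sketched below. It reads r as the number of occurrences of the sampled item at or after position p. Two details are assumptions not stated on the slide: the position p is chosen by size-1 reservoir sampling so that m need not be known in advance, and 0 log 0 is taken to be 0 with base-2 logarithms. The name `sample_x` is illustrative.

```python
import math
import random


def sample_x(stream, rng=random):
    """One-pass computation of X = m * (r log r - (r - 1) log(r - 1)).

    Assumptions of this sketch: reservoir sampling picks the position p,
    logs are base 2, and 0 * log 0 is treated as 0.
    """
    m = 0          # items seen so far
    token = None   # the sampled item a_p
    r = 0          # occurrences of a_p at positions q >= p
    for a in stream:
        m += 1
        # Replace the sample with probability 1/m, which leaves every
        # position equally likely to be the final p.
        if rng.randrange(m) == 0:
            token, r = a, 1
        elif a == token:
            r += 1
    if m == 0:
        raise ValueError("empty stream")

    def g(k):
        return k * math.log2(k) if k > 0 else 0.0

    return m * (g(r) - g(r - 1))
```

Only `m`, `token`, and `r` are kept per copy, which matches the O(log n + log m) space per X quoted in the analysis. Summing the telescoping terms over r = 1, …, m_v for each v shows why the expectation equals the sum of m_v log m_v.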

9 The Analysis Easy: $E[Y] = E[X] = \mu$. Hard: $\mathrm{Var}[Y]$ is very small. Turns out $s_1 = O(\log n)$, $s_2 = O(1)$ works. Each $X$ maintained in $O(\log n + \log m)$ space. Total: $O(s_1 s_2 (\log n + \log m)) = O(\log n \log m)$.

10 Future Directions Extension to insert/delete streams. Applications in: DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries; monitoring open flows through internet routers. Lower-bound proof showing the algorithm is optimal, or an improved algorithm.

