Mining of Massive Datasets Ch4. Mining Data Streams.


1 Mining of Massive Datasets Ch4. Mining Data Streams

2 Outline. Counting distinct elements in a stream: the count-distinct problem, the Flajolet-Martin algorithm, combining estimates. Estimating moments: definition of moments, the Alon-Matias-Szegedy algorithm, dealing with infinite streams.

3 The count-distinct problem. Stream elements are chosen from a universal set. How many different elements have appeared in the stream? Examples: Amazon user logs, Google IP addresses. Obvious approach: maintain the set of elements seen so far, for example in a search tree. Problem: memory size. An IP (IPv4) address is 4 bytes, so there are about 4 billion possible addresses.
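As a point of reference, the obvious approach can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_exact` and the sample addresses are made up here, and a hash-based `set` stands in for the search tree mentioned on the slide. The memory used grows with the number of distinct elements, which is exactly the problem the slide points out.

```python
def count_distinct_exact(stream):
    """Count distinct elements by remembering every element seen so far."""
    seen = set()  # grows by one entry per distinct element
    for element in stream:
        seen.add(element)
    return len(seen)


# Hypothetical sample data: five log entries, three distinct addresses.
sample = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(count_distinct_exact(sample))  # 3
```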

4 The Flajolet-Martin algorithm

5

6
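The slide contents for the Flajolet-Martin algorithm did not survive in this transcript, so the following is a minimal sketch of the standard idea rather than the presenter's own material: hash every element, track the largest number of trailing zero bits R seen in any hash value, and estimate the number of distinct elements as 2^R. The helper names (`hash32`, `trailing_zeros`, `flajolet_martin`) and the use of SHA-256 as the hash family are assumptions made for illustration.

```python
import hashlib


def trailing_zeros(value, width=32):
    """Number of trailing 0 bits in value; by convention width if value is 0."""
    if value == 0:
        return width
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def hash32(element, seed):
    """Hypothetical hash family: first 32 bits of SHA-256 over seed and element."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def flajolet_martin(stream, seed=0):
    """Estimate the distinct count as 2**R, R being the longest tail of zeros seen."""
    max_tail = 0
    for element in stream:
        max_tail = max(max_tail, trailing_zeros(hash32(element, seed)))
    return 2 ** max_tail


# Repeated elements hash to the same value, so they never change the estimate.
print(flajolet_martin(["a", "b", "c", "a", "b"]))
```

Only one small integer per hash function is stored, which is what makes the method attractive compared with keeping the whole set. A single estimate is always a power of 2 and can be far off, so in practice several hash functions are used and their estimates combined.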

7 Combining estimates
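The body of this slide is missing from the transcript. A common way to combine Flajolet-Martin estimates, sketched here as an assumption about what the slide covered, is to split the estimates from many hash functions into small groups, average within each group, and take the median of the group averages. The average alone is pulled up by occasional huge estimates, and the median alone is always a power of 2. The function name `combine_estimates` and the sample numbers are made up for illustration.

```python
from statistics import mean, median


def combine_estimates(estimates, group_size):
    """Median of group averages over consecutive groups of group_size estimates."""
    groups = [
        estimates[i:i + group_size]
        for i in range(0, len(estimates), group_size)
    ]
    return median(mean(group) for group in groups)


# Twelve hypothetical estimates, one of them (1024) a large outlier.
estimates = [2, 4, 4, 8, 4, 4, 8, 1024, 2, 2, 4, 8]
print(combine_estimates(estimates, group_size=4))  # 4.5
```

The group averages are 4.5, 260 and 4, so the outlier affects only one group and the median ignores it.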

8 Exercises

9 Estimating moments. Counting distinct elements in a stream is a special case of a more general problem called computing moments. Computing moments involves the distribution of frequencies of the different elements in the stream.

10 Definition of moment

11
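The definition itself is not preserved in this transcript. Under the usual definition, which is assumed here, the k-th moment of a stream is the sum over all distinct elements of (number of occurrences)^k. The 0th moment is then the number of distinct elements, the 1st moment is the stream length, and the 2nd moment measures how uneven the frequencies are. A small exact computation, with the hypothetical function name `moment`, makes this concrete.

```python
from collections import Counter


def moment(stream, k):
    """Exact k-th moment: sum of m_i ** k over the occurrence counts m_i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


# Example stream of length 15: a occurs 5 times, b 4 times, c and d 3 times each.
stream = list("abcbdacdabdcaab")
print(moment(stream, 0))  # 4  (distinct elements)
print(moment(stream, 1))  # 15 (stream length)
print(moment(stream, 2))  # 59 (5*5 + 4*4 + 3*3 + 3*3)
```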

12 The Alon-Matias-Szegedy algorithm

13

14
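The slides describing the algorithm are empty in this transcript, so what follows is a sketch of the Alon-Matias-Szegedy estimator for the second moment as it is usually stated, not a reconstruction of the slides. Each variable X remembers an element (the one at a chosen position) and a value (how many times that element occurs from that position to the end of the stream); each variable yields the estimate n(2·X.value − 1), and the estimates are averaged. The function name `ams_estimates` and the choice to pass the positions in explicitly are assumptions for illustration; in practice the positions are chosen at random.

```python
def ams_estimates(stream, positions):
    """One-pass AMS second-moment estimates for the given 0-based positions."""
    wanted = set(positions)
    variables = {}  # position -> [element, value]
    for i, x in enumerate(stream):
        # Every existing variable tracking x has seen one more occurrence.
        for var in variables.values():
            if var[0] == x:
                var[1] += 1
        # Start a new variable at each chosen position, with value 1.
        if i in wanted:
            variables[i] = [x, 1]
    n = len(stream)
    return [n * (2 * variables[p][1] - 1) for p in positions]


stream = list("abcbdacdabdcaab")               # exact second moment is 59
estimates = ams_estimates(stream, [2, 7, 12])  # 3rd, 8th and 13th elements
print(estimates)                               # [75, 45, 45]
print(sum(estimates) / len(estimates))         # 55.0
```

Only the element and a counter are stored per variable, so memory depends on the number of variables and not on the stream length.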

15 Why the Alon-Matias-Szegedy algorithm works
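The argument on this slide is not preserved, so the following is the usual derivation, given as a sketch. The notation is assumed here: c(i) is the number of times the element at position i occurs in positions i, i+1, ..., n, and m_a is the total number of occurrences of element a. Since each position is picked with probability 1/n, the expected value of one estimate is the exact second moment.

```latex
\begin{align*}
E\bigl[n\,(2\,X.\mathit{value} - 1)\bigr]
  &= \frac{1}{n}\sum_{i=1}^{n} n\,\bigl(2\,c(i) - 1\bigr)
   = \sum_{i=1}^{n} \bigl(2\,c(i) - 1\bigr) \\
  &= \sum_{a} \sum_{j=1}^{m_a} (2j - 1)
   = \sum_{a} m_a^{2}
\end{align*}
```

The second line groups positions by element: for an element a, the last occurrence has c = 1, the one before it has c = 2, and so on up to m_a for the first occurrence, and the sum of the first m_a odd numbers is m_a squared.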

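16 Dealing with infinite streams. In practice, n grows with time. If the positions of the variables are chosen once and for all, then as the stream gets longer the choice is biased toward early positions and the estimate of the moment will be too large. Instead, maintain as many variables as we can store, and throw some out as the stream grows. The discarded variables are replaced by new ones.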

17 Dealing with infinite streams. Suppose we store s variables and have seen n elements. When the (n+1)st element arrives, pick its position with probability s/(n+1). If it is not picked, the s variables keep their positions. If it is picked, throw out one of the current s variables, chosen with equal probability, and replace it with a new variable whose element is the (n+1)st element and whose value is 1.

18 Dealing with infinite streams
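The replacement rule for an unbounded stream can be sketched as a small class. This is a minimal illustration under stated assumptions: the class name `StreamingAMS` and its methods are hypothetical, and the sketch assumes that the first s positions of the stream are each taken as the position of one of the s variables, a detail the slides do not state.

```python
import random


class StreamingAMS:
    """Keeps s AMS variables whose positions stay uniformly distributed."""

    def __init__(self, s, rng=None):
        self.s = s
        self.n = 0            # number of elements seen so far
        self.variables = []   # each entry is [element, value]
        self.rng = rng if rng is not None else random.Random()

    def add(self, x):
        # Existing variables that track x have seen one more occurrence.
        for var in self.variables:
            if var[0] == x:
                var[1] += 1
        self.n += 1
        if len(self.variables) < self.s:
            # Assumption: the first s positions are all kept.
            self.variables.append([x, 1])
        elif self.rng.random() < self.s / self.n:
            # Picked with probability s/(n+1): evict one variable uniformly.
            victim = self.rng.randrange(self.s)
            self.variables[victim] = [x, 1]

    def estimate(self):
        """Average of n * (2 * value - 1) over the stored variables."""
        if not self.variables:
            raise ValueError("no elements seen yet")
        total = sum(self.n * (2 * value - 1) for _, value in self.variables)
        return total / len(self.variables)


sketch = StreamingAMS(s=3, rng=random.Random(0))
for x in "abcbdacdabdcaab":
    sketch.add(x)
print(sketch.estimate())  # varies with the random choices; the exact value is 59
```

Because n is stored and multiplied in only when an estimate is requested, the same variables can be queried at any point in the stream.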

