More Stream-Mining: Counting Distinct Elements, Computing “Moments”

Topics: Counting Distinct Elements; Computing “Moments”; Frequent Itemsets; Elephants and Troops; Exponentially Decaying Windows

Counting Distinct Elements Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Obvious approach: maintain the set of elements seen.
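A minimal sketch of the obvious approach, just to make the memory cost concrete (the example stream is hypothetical):

```python
# Exact distinct counting: remember every element ever seen.
# Memory grows with the number of distinct elements, which is
# exactly the cost the sketches below are designed to avoid.
seen = set()

def observe(element):
    seen.add(element)

def distinct_count():
    return len(seen)

for x in ["cat", "dog", "cat", "bird"]:   # hypothetical stream
    observe(x)
print(distinct_count())                    # 3
```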

Applications How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?). How many different Web pages does each customer request in a week?

Using Small Storage Real Problem: what if we do not have space to store the complete set? Estimate the count in an unbiased way. Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin* Approach Pick a hash function h that maps each of the n elements to at least log2 n bits. For each stream element a, let r(a) be the number of trailing 0’s in h(a). Record R = the maximum r(a) seen. Estimate = 2^R. * Really based on a variant due to AMS (Alon, Matias, and Szegedy)
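A sketch of this idea in Python. The choice of hash (Python’s built-in hash salted with a random mask) is my assumption, not part of the slides; any hash that looks random on the element set will do.

```python
import random

MASK = random.getrandbits(64)   # fixed random salt for this run

def h(element):
    """Hash to 64 bits; a stand-in for the hash function h of the slides."""
    return (hash(element) ^ MASK) & ((1 << 64) - 1)

def trailing_zeros(x, width=64):
    """r(a): the number of trailing 0 bits in x."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin: track R = max r(a) over the stream, estimate 2^R."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a)))
    return 2 ** R

stream = [random.randrange(10_000) for _ in range(100_000)]
print(fm_estimate(stream))   # a rough estimate of ~10,000; a single hash is typically off by a small factor
```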

Why It Works The probability that a given h(a) ends in at least r 0’s is 2^-r, so the probability that it ends in fewer than r 0’s is 1 - 2^-r. If there are m different elements, the probability that all of their hash values end in fewer than r 0’s is (1 - 2^-r)^m, so the probability that R ≥ r is 1 - (1 - 2^-r)^m.

Why It Works – (2) Since 2^-r is small, 1 - (1 - 2^-r)^m ≈ 1 - e^(-m 2^-r) (using the first two terms of the Taylor expansion of e^x). If 2^r >> m, then 1 - (1 - 2^-r)^m ≈ 1 - (1 - m 2^-r) ≈ m/2^r ≈ 0. If 2^r << m, then 1 - (1 - 2^-r)^m ≈ 1 - e^(-m 2^-r) ≈ 1. Thus, 2^R will almost always be around m.

Why It Doesn’t Work E(2^R) is actually infinite: the probability halves each time R increases by 1, but the value doubles. The workaround involves using many hash functions and getting many samples. How should the samples be combined? An average is thrown off by a single very large value; a median is always a power of 2.

Solution Partition your samples into small groups. Take the average of each group. Then take the median of the averages.
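A sketch of this combining rule, assuming the raw 2^R estimates are already collected (e.g., from running the sketch above with several independent hash functions); the group size is an arbitrary choice:

```python
import statistics

def combine(estimates, group_size=10):
    """Average within small groups of raw estimates, then take the median of the averages."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    averages = [sum(g) / len(g) for g in groups]
    return statistics.median(averages)

# usage (hypothetical): combine(list_of_2_to_the_R_estimates_from_independent_hash_functions)
```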

Generalization: Moments Suppose a stream has elements chosen from a set of n values. Let m_i be the number of times value i occurs. The kth moment is the sum of (m_i)^k over all i.

Special Cases 0th moment = number of different elements in the stream. The problem just considered. 1st moment = sum of the numbers of elements = length of the stream. Easy to compute. 2nd moment = surprise number = a measure of how uneven the distribution is.

Example: Surprise Number Stream of length 100; 11 values appear. Unsurprising: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise # = 910. Surprising: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise # = 8,110.
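A quick check of these numbers, using a small helper for the kth moment (the frequency vectors are copied from the slide):

```python
def moment(counts, k):
    """kth moment: sum of (m_i)^k over all values i, given their frequencies m_i."""
    return sum(m ** k for m in counts)

print(moment([10] + [9] * 10, 2))   # 910   -- the "unsurprising" stream
print(moment([90] + [1] * 10, 2))   # 8110  -- the "surprising" stream
```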

AMS Method Works for all moments; gives an unbiased estimate. We’ll just concentrate on the 2nd moment. Based on calculation of many random variables X. Each requires a count in main memory, so the number of variables is limited by available memory.

One Random Variable Assume the stream has length n. Pick a random time to start, so that any time is equally likely. Let the chosen time have element a in the stream. X = n * ((twice the number of a’s in the stream starting at the chosen time) - 1). Note: store n once, plus a count of a’s for each X.
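A sketch of one such variable, assuming for the moment that the whole stream is available as a Python list so a uniform start time can be drawn directly (the streaming fixups come later in this section):

```python
import random

def ams_variable(stream):
    """One AMS estimate of the 2nd moment for a stream of known length n."""
    n = len(stream)
    t = random.randrange(n)            # uniformly random start time
    a = stream[t]                      # element at the chosen time
    count = sum(1 for x in stream[t:] if x == a)   # number of a's from time t onward
    return n * (2 * count - 1)

stream = list("abcabdacd")             # hypothetical stream
# true 2nd moment: a appears 3x, b 2x, c 2x, d 2x -> 9 + 4 + 4 + 4 = 21
estimates = [ams_variable(stream) for _ in range(10_000)]
print(sum(estimates) / len(estimates))  # should be close to 21
```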

Expected Value of X The 2nd moment is Σ_a (m_a)^2. E(X) = (1/n) Σ_{all times t} n * (twice the number of times the stream element at time t appears from that time on, minus 1). Grouping the times by the value a seen there, this becomes Σ_a (1/n)(n)(1 + 3 + 5 + … + (2m_a - 1)): the term 1 comes from the time the last a is seen, 3 from the time the penultimate a is seen, and so on up to 2m_a - 1 for the time the first a is seen. Since the sum of the first m_a odd numbers is (m_a)^2, E(X) = Σ_a (m_a)^2.

Combining Samples Compute as many variables X as can fit in available memory. Average them in groups. Take median of averages. Proper balance of group sizes and number of groups assures not only correct expected value, but expected error goes to 0 as number of samples gets large.

Problem: Streams Never End We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n is a variable – the number of inputs seen so far.

Fixups (1) The variables X have n as a factor; keep n separately and just hold the count in X. (2) Suppose we can only store k counts. We must throw some X’s out as time goes on. Objective: each starting time t is selected with probability k/n.

Solution to (2) Choose the first k times for k variables. When the nth element arrives (n > k), choose it with probability k/n. If you choose it, throw one of the previously stored variables out, with equal probability.
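A sketch of this selection rule (it is reservoir sampling over start times), tracking only each chosen element and its running count; the function and variable names are mine, not from the slides:

```python
import random

def ams_streaming(stream, k):
    """Maintain k AMS variables over a stream of unknown length.
    Each stored entry is [element, count of that element since its start time]."""
    reservoir = []
    n = 0
    for a in stream:
        n += 1
        for entry in reservoir:          # every stored variable tracking this element
            if entry[0] == a:            # sees one more occurrence
                entry[1] += 1
        if len(reservoir) < k:
            reservoir.append([a, 1])     # keep the first k start times
        elif random.random() < k / n:    # otherwise keep time n with probability k/n...
            reservoir[random.randrange(k)] = [a, 1]   # ...evicting one stored variable uniformly
    # each variable contributes X = n * (2*count - 1); average them
    return sum(n * (2 * c - 1) for _, c in reservoir) / len(reservoir)

# e.g. ams_streaming(list("abcabdacd"), k=3)   # estimates the 2nd moment (21 for this stream)
```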

New Topic: Counting Items Problem: given a stream, which items appear more than s times in the window? Possible solution: think of the stream of baskets as one binary stream per item. 1 = item present; 0 = not present. Use DGIM to estimate counts of 1’s for all items.

Extensions In principle, you could count frequent pairs or even larger sets the same way. One stream per itemset. Drawbacks: Only approximate. Number of itemsets is way too big.

Approaches “Elephants and troops”: a heuristic way to converge on unusually strongly connected itemsets. Exponentially decaying windows: a heuristic for selecting likely frequent itemsets.

Elephants and Troops When Sergey Brin wasn’t worrying about Google, he tried the following experiment. Goal: find unusually correlated sets of words. “High correlation” = frequency of occurrence of the set >> product of the frequencies of its members.

Experimental Setup The data was an early Google crawl of the Stanford Web. Each night, the data would be streamed to a process that counted a preselected collection of itemsets. If {a, b, c} is selected, count {a, b, c}, {a}, {b}, and {c}. “Correlation” = n^2 * #abc / (#a * #b * #c), where n = the number of pages.
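The correlation measure as a small hypothetical helper; the counts #a, #b, #c, and #abc would come from the nightly pass over the pages, and the example numbers are made up:

```python
def correlation(n_pages, count_abc, count_a, count_b, count_c):
    """Brin's measure: n^2 * #abc / (#a * #b * #c); values far above 1 mean the
    triple occurs much more often than independence of a, b, c would predict."""
    return n_pages ** 2 * count_abc / (count_a * count_b * count_c)

# e.g. correlation(n_pages=1_000_000, count_abc=500,
#                  count_a=2_000, count_b=3_000, count_c=1_000)   # ~83,000: highly correlated
```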

After Each Night’s Processing . . . Find the most correlated sets counted. Construct a new collection of itemsets to count the next night: all the most correlated sets (“winners”); pairs of a word in some winner and a random word; winners combined in various ways; some random pairs.

After a Week . . . The pair {“elephants”, “troops”} came up as the big winner. Why? It turns out that Stanford students were playing a Punic-War simulation game internationally, where moves were sent by Web pages.

New Topic: Mining Streams Versus Mining DB’s Unlike mining databases, mining streams doesn’t have a fixed answer. We’re really mining from the statistical point of view, e.g., “Which itemsets are frequent in the underlying model that generates the stream?”

Stationarity Our assumptions make a big difference: Is the model stationary ? I.e., are the same statistics used throughout all time to generate the stream? Or does the frequency of generating given items or itemsets change over time?

Some Options for Frequent Itemsets Run periodic experiments, like E&T. Like SON – itemset is a candidate if it is found frequent on any “day.” Good for stationary statistics. Frame the problem as finding all frequent itemsets in an “exponentially decaying window.” Good for nonstationary statistics.

Exponentially Decaying Windows If the stream is a_1, a_2, … and we are taking the sum of the stream, take the answer at time t to be Σ_{i=1,2,…,t} a_i e^(-c(t-i)). c is a constant, presumably tiny, like 10^-6 or 10^-9.
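Because e^(-c(t-i)) = e^(-c) * e^(-c(t-1-i)), the whole sum can be maintained in a single variable, decayed once per new element. A minimal sketch (the class name is mine):

```python
import math

class DecayingSum:
    """Maintain S_t = sum_{i<=t} a_i * e^(-c*(t-i)) incrementally:
    S_t = e^(-c) * S_{t-1} + a_t."""
    def __init__(self, c=1e-6):
        self.decay = math.exp(-c)
        self.value = 0.0

    def add(self, a):
        self.value = self.value * self.decay + a
        return self.value
```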

Example: Counting Items If each a_i is an “item,” we can compute the characteristic function of each possible item x as an E.D.W. That is: Σ_{i=1,2,…,t} δ_i e^(-c(t-i)), where δ_i = 1 if a_i = x, and 0 otherwise. Call this sum the “count” of item x.
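A sketch of these per-item counts, using the (1 - c) approximation to e^(-c) and the drop-below-½ rule that the later slides introduce; the eager decay loop is for clarity only (a real implementation would decay lazily with one global factor):

```python
def edw_item_counts(stream, c=1e-6, threshold=0.5):
    """Decaying count per item: multiply all live counts by (1 - c),
    add 1 for the arriving item, then drop any count below 1/2."""
    counts = {}
    for a in stream:
        for x in counts:
            counts[x] *= (1 - c)
        counts[a] = counts.get(a, 0.0) + 1.0
        counts = {x: w for x, w in counts.items() if w >= threshold}
    return counts
```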

Sliding Versus Decaying Windows [Figure: a fixed-length sliding window compared with an exponentially decaying window; the label 1/c marks the decaying window’s effective length.]

Counting Items – (2) Suppose we want to find those items of weight at least ½. Important property: the sum over all weights is the geometric series Σ_{i≥0} e^(-ci) = 1/(1 - e^-c), which is very close to 1/[1 - (1 - c)] = 1/c. Thus: at most 2/c items can have weight at least ½.
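A quick numeric check of that bound:

```python
import math

c = 1e-6
total_weight = 1.0 / (1.0 - math.exp(-c))   # sum of e^(-c*i) over i = 0, 1, 2, ...
print(total_weight, 1.0 / c)                # both about 1,000,000, so at most 2/c items can weigh >= 1/2
```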

Extension to Larger Itemsets* Count (some) itemsets in an E.D.W. When a basket B comes in: Multiply all counts by (1-c ); For uncounted items in B, create new count. Add 1 to count of any item in B and to any counted itemset contained in B. Drop counts < ½. Initiate new counts (next slide). * Informal proposal of Art Owen

Initiation of New Counts Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to arrival of basket B. Example: Start counting {i, j} iff both i and j were counted prior to seeing B. Example: Start counting {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B.
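A sketch of the whole per-basket update, including this initiation rule. The eager decay loop and the naive subset test are for illustration only, and the function name, size cap, and example baskets are mine:

```python
from itertools import combinations

def edw_itemset_counts(baskets, c=1e-6, threshold=0.5, max_size=3):
    """Decaying counts for items and (some) itemsets: decay all live counts by
    (1 - c); add 1 for every counted set contained in the basket (single items
    always get a count); start a count for a larger set only if all its proper
    subsets were counted before this basket; drop counts below 1/2."""
    counts = {}                                   # frozenset -> decayed count
    for basket in baskets:
        basket = set(basket)
        previously_counted = set(counts)          # sets counted prior to this basket
        for s in counts:
            counts[s] *= (1 - c)
        for i in basket:                          # singletons: count unconditionally
            counts[frozenset([i])] = counts.get(frozenset([i]), 0.0) + 1.0
        for size in range(2, max_size + 1):       # larger sets contained in the basket
            for combo in combinations(sorted(basket), size):
                s = frozenset(combo)
                if s in counts:
                    counts[s] += 1.0
                elif all(frozenset(sub) in previously_counted
                         for sub in combinations(combo, size - 1)):
                    counts[s] = 1.0               # initiation rule
        counts = {s: w for s, w in counts.items() if w >= threshold}
    return counts

# e.g. edw_itemset_counts([["i", "j"], ["i", "j", "k"], ["j", "k"]])
```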

How Many Counts? Counts for single items < (2/c) times the average number of items in a basket. Counts for larger itemsets = ?? But we are conservative about starting counts of large sets. If we counted every set we saw, one basket of 20 items would initiate about 2^20 ≈ 1 million counts.