Presentation on theme: "COMP5331 Data Stream. Prepared by Raymond Wong. Presented by Raymond Wong."— Presentation transcript:

1 COMP5331 Data Stream. Prepared by Raymond Wong. Presented by Raymond Wong. raywong@cse

2 Data Mining over Static Data: 1. Association, 2. Clustering, 3. Classification. Static Data -> Output (Data Mining Results).

3 Data Mining over Data Streams: 1. Association, 2. Clustering, 3. Classification. Unbounded Data -> Output (Data Mining Results). Real-time processing.

4 Data Streams: a sequence of points, from less recent to more recent. Each point: a transaction.

5 Data Streams
              Traditional Data Mining        Data Stream Mining
Data Type     Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory        Limited                        Limited -> more challenging
Efficiency    Time-consuming                 Must be efficient
Output        Exact answer                   Approximate (or exact) answer

6 Entire Data Streams: each point is a transaction, from less recent to more recent. Obtain the data mining results from all data points read so far.

7 Data Streams with Sliding Window: each point is a transaction, from less recent to more recent. Obtain the data mining results over a sliding window (the most recent points only).

8 Data Streams: (1) Entire Data Streams; (2) Data Streams with Sliding Window.

9 Entire Data Streams: Association (frequent pattern/item), Clustering, Classification.

10 Frequent Item over Data Streams. Let N be the current length of the data stream, and let s be the support threshold, expressed as a fraction (e.g., 20%). Problem: find all items with frequency >= sN. Each point of the stream is a transaction, from less recent to more recent.
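
For contrast with what follows, here is a minimal exact-counting baseline in Python (item names and the sample stream are invented for illustration). It keeps one counter per distinct item, which is exactly what an unbounded stream makes infeasible:

```python
from collections import Counter

def exact_frequent_items(stream, s):
    """Exact baseline: one counter per distinct item."""
    counts = Counter(stream)
    N = len(stream)
    return {item for item, f in counts.items() if f >= s * N}

# N = 10 and s = 0.2, so the threshold is sN = 2.
stream = ["I1", "I2", "I1", "I3", "I1", "I2", "I1", "I3", "I1", "I2"]
print(exact_frequent_items(stream, 0.2))  # {'I1', 'I2', 'I3'}
```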

11 Data Streams
              Traditional Data Mining        Data Stream Mining
Data Type     Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory        Limited                        Limited -> more challenging
Efficiency    Time-consuming                 Must be efficient
Output        Exact answer                   Approximate (or exact) answer

12 Data Streams. Static data -> Output (Data Mining Results): frequent item I1; infrequent items I2, I3. Unbounded data -> Output (Data Mining Results): frequent items I1, I3; infrequent item I2.

13 False Positive/Negative
E.g. expected output: frequent item I1; infrequent items I2, I3. Algorithm output: frequent items I1, I3; infrequent item I2.
False positive: an item classified as frequent which is in fact infrequent.
Which item is one of the false positives? I3. More? No. No. of false positives = 1.
When we say the algorithm has no false positives, we mean: all truly infrequent items are classified as infrequent in the algorithm output.

14 False Positive/Negative
E.g. expected output: frequent items I1, I3; infrequent item I2. Algorithm output: frequent item I1; infrequent items I2, I3.
False negative: an item classified as infrequent which is in fact frequent.
Which item is one of the false negatives? I3. More? No. No. of false negatives = 1. No. of false positives = 0.
When we say the algorithm has no false negatives, we mean: all truly frequent items are classified as frequent in the algorithm output.
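
Since both notions recur throughout the deck, a small helper (function and variable names are my own) that extracts the two error types from the expected and reported frequent-item sets:

```python
def classification_errors(true_frequent, reported_frequent):
    """False positives: reported frequent but truly infrequent.
    False negatives: truly frequent but reported infrequent."""
    return (reported_frequent - true_frequent,   # false positives
            true_frequent - reported_frequent)   # false negatives

# Slide 14's example: I1 and I3 are truly frequent; only I1 is reported.
fp, fn = classification_errors({"I1", "I3"}, {"I1"})
print(len(fp), len(fn))  # 0 false positives, 1 false negative
```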

15 Data Streams
              Traditional Data Mining        Data Stream Mining
Data Type     Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory        Limited                        Limited -> more challenging
Efficiency    Time-consuming                 Must be efficient
Output        Exact answer                   Approximate (or exact) answer
We therefore need to introduce an input error parameter ε.

16 Data Streams. Static data -> Output (Data Mining Results): frequent item I1; infrequent items I2, I3. Unbounded data -> Output (Data Mining Results): frequent items I1, I3; infrequent item I2.

17 Data Streams
Static data -> store the statistics of all items: I1: 10, I2: 8, I3: 12.
Unbounded data -> estimate the statistics of all items: I1: 10, I2: 4, I3: 10.

Item   True Frequency   Estimated Frequency   Difference D
I1     10               10                    0
I2     8                4                     4
I3     12               10                    2

N: total no. of occurrences of items. Here N = 20 and ε = 0.2, so εN = 4. Is every difference D <= εN? Yes.

18 ε-deficient synopsis
Let N be the current length of the stream (i.e., the total no. of occurrences of items), and let ε be an input error parameter (a real number between 0 and 1).
An algorithm maintains an ε-deficient synopsis if its output satisfies the following properties:
Condition 1: There are no false negatives, i.e., all truly frequent items are classified as frequent in the algorithm output.
Condition 2: The difference between the estimated frequency and the true frequency of any item is at most εN.
Condition 3: All items whose true frequencies are less than (s - ε)N are classified as infrequent in the algorithm output.
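
The three conditions translate directly into a checker. A minimal sketch (function and argument names are my own; it assumes the true frequencies of all items are known, which in a real stream they are not):

```python
def is_epsilon_deficient(true_freq, est_freq, reported, s, eps):
    """Verify the three epsilon-deficient-synopsis conditions.
    true_freq / est_freq: dicts item -> frequency;
    reported: set of items the algorithm outputs as frequent."""
    N = sum(true_freq.values())
    # Condition 1: no false negatives.
    cond1 = all(e in reported
                for e, f in true_freq.items() if f >= s * N)
    # Condition 2: every estimate is within eps*N of the truth.
    cond2 = all(abs(f - est_freq.get(e, 0)) <= eps * N
                for e, f in true_freq.items())
    # Condition 3: items with true frequency below (s - eps)*N
    # must be reported as infrequent.
    cond3 = all(e not in reported
                for e, f in true_freq.items() if f < (s - eps) * N)
    return cond1 and cond2 and cond3
```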

19 Frequent Pattern Mining over Entire Data Streams. Algorithms: Sticky Sampling Algorithm, Lossy Counting Algorithm, Space-Saving Algorithm.

20 Sticky Sampling Algorithm. Inputs: s (support threshold), ε (error parameter), δ (confidence parameter). Unbounded data -> Sticky Sampling -> statistics of items, stored in memory -> Output: frequent items / infrequent items.

21 Sticky Sampling Algorithm
The sampling rate r varies over the lifetime of the stream. With confidence parameter δ (a small real number), let t = ⌈(1/ε) ln(1/(sδ))⌉.

Data No.       r (sampling rate)
1 ~ 2t         1
2t+1 ~ 4t      2
4t+1 ~ 8t      4
…              …

22 Sticky Sampling Algorithm
E.g. s = 0.02, ε = 0.01, δ = 0.1, giving t = 622.

Data No.                      r (sampling rate)
1 ~ 2t      = 1 ~ 1244        1
2t+1 ~ 4t   = 1245 ~ 2488     2
4t+1 ~ 8t   = 2489 ~ 4976     4

23 Sticky Sampling Algorithm
E.g. s = 0.5, ε = 0.35, δ = 0.5, giving t = 4.

Data No.                      r (sampling rate)
1 ~ 2t      = 1 ~ 8           1
2t+1 ~ 4t   = 9 ~ 16          2
4t+1 ~ 8t   = 17 ~ 32         4
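
Both worked examples follow from the definition of t; a few lines of Python reproduce them (helper names are my own):

```python
import math

def sticky_t(s, eps, delta):
    """t = ceil((1/eps) * ln(1/(s*delta))), as on slide 21."""
    return math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))

print(sticky_t(0.02, 0.01, 0.1))  # 622, matching slide 22
print(sticky_t(0.5, 0.35, 0.5))   # 4, matching slide 23

def sampling_rate(n, t):
    """r for the n-th data item: r doubles each time the stream
    passes 2t, 4t, 8t, ... elements (r = 1 for the first 2t)."""
    r = 1
    while n > 2 * r * t:
        r *= 2
    return r

print(sampling_rate(1244, 622), sampling_rate(1245, 622))  # 1 2
```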

24 Sticky Sampling Algorithm
1. S: empty list; it will contain entries (e, f), where f is the estimated frequency of element e.
2. When data e arrives:
   - if e exists in S, increment f in (e, f);
   - if e does not exist in S, add the entry (e, 1) with probability 1/r (where r is the current sampling rate).
3. Just after r changes, for each entry (e, f): repeatedly toss a coin with P(head) = 1/r until the outcome is head; for every tail, decrement f in (e, f), and if f reaches 0, delete the entry (e, f).
4. [Output] Report every item whose entry satisfies f + εN >= sN.
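
A runnable sketch of these four steps, assuming hashable items (class and method names are my own; the coin tosses make the contents of S randomized, as the algorithm intends):

```python
import math
import random

class StickySampling:
    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))
        self.r = 1   # current sampling rate
        self.n = 0   # stream length N so far
        self.S = {}  # item e -> estimated frequency f

    def process(self, e):
        self.n += 1
        # r doubles when the stream passes 2t, 4t, 8t, ... elements.
        if self.n > 2 * self.r * self.t:
            self.r *= 2
            self._diminish()
        if e in self.S:
            self.S[e] += 1                    # step 2, existing entry
        elif random.random() < 1.0 / self.r:
            self.S[e] = 1                     # step 2, add with prob. 1/r

    def _diminish(self):
        # Step 3: per entry, toss a coin with P(head) = 1/r until a
        # head; decrement f once per tail, deleting entries at f = 0.
        for e in list(self.S):
            while random.random() >= 1.0 / self.r:   # tail
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def frequent_items(self):
        # Step 4: report e if f + eps*N >= s*N.
        return {e for e, f in self.S.items()
                if f + self.eps * self.n >= self.s * self.n}
```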

25 Analysis. ε-deficient synopsis: Sticky Sampling computes an ε-deficient synopsis with probability at least 1 - δ. Memory consumption: Sticky Sampling occupies at most ⌈(2/ε) ln(1/(sδ))⌉ entries on average.

26 Frequent Pattern Mining over Entire Data Streams. Algorithms: Sticky Sampling Algorithm, Lossy Counting Algorithm, Space-Saving Algorithm.

27 Lossy Counting Algorithm. Inputs: s (support threshold), ε (error parameter). Unbounded data -> Lossy Counting -> statistics of items, stored in memory -> Output: frequent items / infrequent items.

28 Lossy Counting Algorithm. The stream (each point a transaction, from less recent to more recent) is divided into buckets: Bucket 1, Bucket 2, Bucket 3, …, Bucket b_current. With N the current length of the stream, the bucket width is w = ⌈1/ε⌉ and the current bucket id is b_current = ⌈N/w⌉.

29 Lossy Counting Algorithm
1. D: empty set; it will contain entries (e, f, Δ), where f is the frequency of element e since this entry was inserted into D and Δ is the maximum possible error in f.
2. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ);
   - if e does not exist in D, add the entry (e, 1, b_current - 1).
3. Whenever N ≡ 0 mod w (i.e., whenever the stream reaches a bucket boundary), remove entries from D by the rule: (e, f, Δ) is deleted if f + Δ <= b_current.
4. [Output] Report every item whose entry satisfies f + εN >= sN.
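
A runnable sketch of the same four steps (class and method names are my own); unlike Sticky Sampling, it is deterministic:

```python
import math

class LossyCounting:
    def __init__(self, s, eps):
        self.s, self.eps = s, eps
        self.w = math.ceil(1.0 / eps)  # bucket width, slide 28
        self.n = 0                     # stream length N so far
        self.D = {}                    # item e -> (f, delta)

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            f, d = self.D[e]
            self.D[e] = (f + 1, d)          # step 2, existing entry
        else:
            self.D[e] = (1, b_current - 1)  # step 2, new entry
        if self.n % self.w == 0:
            # Step 3: at a bucket boundary, delete entries with
            # f + delta <= b_current.
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > b_current}

    def frequent_items(self):
        # Step 4: report e if f + eps*N >= s*N.
        return {x for x, (f, d) in self.D.items()
                if f + self.eps * self.n >= self.s * self.n}
```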

30 Lossy Counting Algorithm. ε-deficient synopsis: Lossy Counting computes an ε-deficient synopsis (with 100% confidence). Memory consumption: Lossy Counting occupies at most ⌈(1/ε) log(εN)⌉ entries.

31 Comparison
                  ε-deficient synopsis      Memory consumption
Sticky Sampling   with probability 1 - δ    ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence           ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1000: Sticky Sampling memory = 1243; Lossy Counting memory = 231.

32 Comparison
                  ε-deficient synopsis      Memory consumption
Sticky Sampling   with probability 1 - δ    ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence           ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000: Sticky Sampling memory = 1243; Lossy Counting memory = 922.

33 Comparison
                  ε-deficient synopsis      Memory consumption
Sticky Sampling   with probability 1 - δ    ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence           ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000: Sticky Sampling memory = 1243; Lossy Counting memory = 1612.
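
The figures on slides 31-33 follow from the two bounds; a short Python check (note they come out exactly as quoted only if log in the Lossy Counting bound is read as the natural logarithm):

```python
import math

def sticky_memory(s, eps, delta):
    return math.ceil((2.0 / eps) * math.log(1.0 / (s * delta)))

def lossy_memory(eps, N):
    return math.ceil((1.0 / eps) * math.log(eps * N))  # natural log

print(sticky_memory(0.02, 0.01, 0.1))   # 1243, independent of N
for N in (1_000, 1_000_000, 1_000_000_000):
    print(lossy_memory(0.01, N))        # 231, 922, 1612
```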

34 Frequent Pattern Mining over Entire Data Streams. Algorithms: Sticky Sampling Algorithm, Lossy Counting Algorithm, Space-Saving Algorithm.

35 Sticky Sampling Algorithm. Inputs: s (support threshold), ε (error parameter), δ (confidence parameter). Unbounded data -> Sticky Sampling -> statistics of items, stored in memory -> Output: frequent items / infrequent items.

36 Lossy Counting Algorithm. Inputs: s (support threshold), ε (error parameter). Unbounded data -> Lossy Counting -> statistics of items, stored in memory -> Output: frequent items / infrequent items.

37 Space-Saving Algorithm. Inputs: s (support threshold), M (memory parameter). Unbounded data -> Space-Saving -> statistics of items, stored in memory -> Output: frequent items / infrequent items.

38 Space-Saving. M: the maximum number of entries that can be stored in memory.

39 Space-Saving
1. D: empty set; it will contain entries (e, f, Δ), where f is the frequency of element e since this entry was inserted into D and Δ is the maximum possible error in f.
2. p_e = 0.
3. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ);
   - if e does not exist in D: if the size of D equals M, set p_e to the minimum of f + Δ over all entries in D and remove every entry with f + Δ <= p_e; then add the entry (e, 1, p_e).
4. [Output] Report every item whose entry satisfies f + Δ >= sN.
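
A runnable sketch of the variant described on this slide (class and method names are my own; note that on overflow it evicts every entry tied at the minimum f + Δ, exactly as the slide specifies):

```python
class SpaceSaving:
    def __init__(self, s, M):
        self.s, self.M = s, M  # support threshold, max no. of entries
        self.n = 0             # stream length N so far
        self.D = {}            # item e -> (f, delta)

    def process(self, e):
        self.n += 1
        if e in self.D:
            f, d = self.D[e]
            self.D[e] = (f + 1, d)   # step 3, existing entry
            return
        p_e = 0
        if len(self.D) == self.M:
            # p_e is the smallest f + delta; drop all entries at or
            # below it to make room for the new one.
            p_e = min(f + d for f, d in self.D.values())
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > p_e}
        self.D[e] = (1, p_e)         # step 3, new entry

    def frequent_items(self):
        # Step 4: report e if f + delta >= s*N.
        return {x for x, (f, d) in self.D.items()
                if f + d >= self.s * self.n}
```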

40 Space-Saving. Greatest error: let E be the greatest error in any estimated frequency, expressed as a fraction of N. Then E <= 1/M. ε-deficient synopsis: Space-Saving computes an ε-deficient synopsis if E <= ε.

41 Comparison
                  ε-deficient synopsis                Memory consumption
Sticky Sampling   with probability 1 - δ              ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence                     ⌈(1/ε) log(εN)⌉
Space-Saving      100% confidence, provided E <= ε    M

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000: Sticky Sampling memory = 1243; Lossy Counting memory = 1612. For Space-Saving, M can be very large (e.g., 4,000,000); since E <= 1/M, the error is then very small.

42 Data Streams: (1) Entire Data Streams; (2) Data Streams with Sliding Window.

43 Data Streams with Sliding Window: Association (frequent pattern/itemset), Clustering, Classification.

44 Sliding Window. Mining frequent itemsets in a sliding window. E.g. t1: I1 I2; t2: I1 I3 I4; … The goal is to find the frequent itemsets within the current sliding window of transactions.

45 Sliding Window. The sliding window covers the last 4 batches, B1, B2, B3, B4, whose statistics are kept in storage.

46 Sliding Window. When a new batch B5 arrives, the window slides forward: the whole oldest batch (B1) is removed from storage, and the window again covers the last 4 batches (B2, B3, B4, B5).
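
The slides give only the picture; a minimal Python sketch of this batch bookkeeping (counting single items rather than itemsets for brevity; class and method names are my own):

```python
from collections import Counter, deque

class BatchSlidingWindow:
    def __init__(self, k):
        self.k = k              # window covers the last k batches
        self.batches = deque()  # one Counter per batch
        self.total = Counter()  # counts over the whole window

    def add_batch(self, transactions):
        counts = Counter(item for t in transactions for item in t)
        self.batches.append(counts)
        self.total += counts
        if len(self.batches) > self.k:
            # Window slides: remove the whole oldest batch.
            self.total -= self.batches.popleft()

    def frequent_items(self, s):
        N = sum(self.total.values())
        return {e for e, f in self.total.items() if f >= s * N}

# E.g. a window of 4 batches, as on slides 45-46:
w = BatchSlidingWindow(4)
w.add_batch([["I1", "I2"], ["I1", "I3", "I4"]])
print(w.frequent_items(0.2))  # {'I1', 'I2', 'I3', 'I4'}
```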

