1 Crossroads: A Practical Data Sketching Solution for Mining Intersection of Streams Jun Xu, Zhenglin Yu (Georgia Tech) Jia Wang, Zihui Ge, He Yan (AT&T Labs Research) Ashwin Lall (Denison University)

2 Overview: Problem statement, Motivation, Our approach, Data collection architecture, Evaluation, Conclusions

3 Problem Statement Anonymously mine the logs of cellular data traffic to rapidly detect network performance anomalies. This can be converted to an association-rule mining problem if we store every cellular packet in a static database.

4 Motivation Complexity of the network ecosystem (sophisticated phones and tablets; application servers; a tremendous variety of apps and online services). Performance issues can be introduced by different causes or combinations of causes (device, app, network, app server). It is infeasible to store all combinations of such data (especially when it is collected in real time).

5 Problem Statement (detailed) An example event “Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app” & “Source = Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ” ⇒ “unusually long RTT”.

6 Main challenge: We cannot afford to store all the combinations (the number of different attribute combinations is huge). Our goal: An asymptotic reduction in space usage while keeping accuracy loss small when detecting anomalous values. Contribution of the paper: An intersection scheme that can significantly reduce the storage cost while keeping accuracy loss small.

7 Our approach ●Based on a data sketching solution (inspired by the tug-of-war sketch) ●A sketch is constructed to succinctly summarize the performance metrics (e.g., average RTT) of all data items

8 Our approach Partition the attributes into 2 groups. Example: split the attributes into 2 groups, giving sets A_i and B_j

9 Our approach A_i: the set of packets that match (“Source = the Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ”) B_j: the set of packets that match (“Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app”)

10 Our approach We can compute functions on the intersection of arbitrary sets A_i and B_j

11 Our approach ●Use sketches to store summary statistics (e.g., mean, variance) for A_i and B_j ●Derive the performance metrics of the data by intersecting the sketches (A_i ∩ B_j)
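The idea of intersecting two sketches can be illustrated with a minimal sketch in the tug-of-war style. This is an illustration under stated assumptions, not the authors' exact construction: each set keeps K signed counters, and the averaged product of two counter arrays estimates a sum over the packets common to both sets. The names `signs`, `build`, `averaged_product`, and `intersection_mean`, the packet IDs, and the RTT values are all hypothetical.

```python
import hashlib

K = 4096  # counters per sketch (the evaluation slides quote 4096)


def signs(packet_id, k=K):
    """Derive k pseudo-random +1/-1 signs from a packet identifier."""
    raw = hashlib.shake_256(packet_id.encode()).digest(k // 8)
    return [1 if (raw[i >> 3] >> (i & 7)) & 1 else -1 for i in range(k)]


def build(packets, k=K):
    """Build (value_sketch, count_sketch) for one set of packets.

    packets maps a packet ID to its metric (here, a made-up RTT).
    """
    values = [0.0] * k
    counts = [0] * k
    for packet_id, rtt in packets.items():
        for h, s in enumerate(signs(packet_id, k)):
            values[h] += s * rtt  # signed sum of the metric
            counts[h] += s        # signed sum of ones
    return values, counts


def averaged_product(u, v):
    """Average of counter-wise products: the tug-of-war style estimate."""
    return sum(x * y for x, y in zip(u, v)) / len(u)


def intersection_mean(sketch_a, sketch_b):
    """Estimate the mean metric and the size of the intersection of A and B."""
    values_a, counts_a = sketch_a
    _, counts_b = sketch_b
    est_sum = averaged_product(values_a, counts_b)    # ~ sum of RTT over A ∩ B
    est_count = averaged_product(counts_a, counts_b)  # ~ |A ∩ B|
    return est_sum / est_count, est_count


# Hypothetical data: 300 packets, A and B overlap on pkt100..pkt199.
rtt = {f"pkt{i}": 50.0 + (i % 7) for i in range(300)}
ids = list(rtt)
A = {p: rtt[p] for p in ids[:200]}
B = {p: rtt[p] for p in ids[100:]}

est_mean, est_count = intersection_mean(build(A), build(B))
common = [rtt[p] for p in ids[100:200]]
true_mean = sum(common) / len(common)
print(f"estimated mean RTT {est_mean:.1f} vs true {true_mean:.1f}")
print(f"estimated |A ∩ B| {est_count:.1f} vs true {len(common)}")
```

The two sketches are built independently, so neither set needs to know which combinations it will later be intersected with. The estimate is noisy, and the noise shrinks as K grows.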

12 Storage saving Reduce the storage cost from O(n) to O(√n). For example, the number of joint value combinations of all these attributes is in the trillions (∼10^12), while each subset may only be in the millions (∼10^6).
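The O(n) to O(√n) claim can be spelled out with a short worked calculation. The symbols N_A, N_B, k, and c are introduced here for illustration and are not from the slides: N_A and N_B are the numbers of distinct value combinations in the two attribute groups, k is the number of counters per sketch, and c is the size of one counter.

```latex
\begin{align*}
n &= N_A \cdot N_B
  && \text{joint combinations, one stored value each: } O(n) \\
S_{\text{sketch}} &= (N_A + N_B)\, k\, c
  && \text{one sketch per group-level combination} \\
N_A \approx N_B \approx \sqrt{n}
  \;&\Rightarrow\; S_{\text{sketch}} \approx 2\,k\,c\,\sqrt{n} = O(\sqrt{n})
  && \text{for fixed } k \text{ and } c \\
n \sim 10^{12} \;&\Rightarrow\; \sqrt{n} \sim 10^{6}
\end{align*}
```

The bound assumes the two groups are roughly balanced. A lopsided partition still costs (N_A + N_B)·k·c, but the larger group then dominates the total.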

13 3-Way Intersection ●How about 3-way intersection? ●An impossibility result:

14 Data Collection Architecture Real data collection* *Note: No personally identifiable information (PII) was gathered or used in conducting this study. To the extent any data was analyzed, it was anonymous and/or aggregated data.

15 Evaluation 8 different attributes in our real data. Partition scheme: ●1st group: RNC, service category, handheld device speed category, day of week ●2nd group: handheld device manufacturer/model, content provider, access point network, hour

16 Evaluation 1.4 million distinct combinations for the 1st group; 1.5 million distinct combinations for the 2nd group. Storage cost of maintaining the value of every combination: 1.4M × 1.5M × 4 bytes ≈ 7.5 TB

17 Evaluation Using our intersection scheme with 4096 counters in each sketch, the space cost is (1.4M + 1.5M) × 4096 × 4 bytes ≈ 45 GB, much less than 7.5 TB. The relative error will be about 10%.
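The storage arithmetic on the two evaluation slides can be reproduced in a few lines. The constants are the figures quoted on the slides, while the function names are made up for this illustration. The exact products are 8.4 × 10^12 bytes (about 7.6 TiB) and roughly 4.75 × 10^10 bytes (about 44 GiB), which is in line with the rounded 7.5 TB and 45 GB reported above.

```python
# Figures quoted on the evaluation slides.
GROUP_1_COMBINATIONS = 1_400_000
GROUP_2_COMBINATIONS = 1_500_000
BYTES_PER_VALUE = 4
COUNTERS_PER_SKETCH = 4096


def per_combination_bytes(g1, g2, width):
    """Store one value for every (group 1, group 2) pair."""
    return g1 * g2 * width


def per_group_sketch_bytes(g1, g2, counters, width):
    """Store one sketch for every group-level combination."""
    return (g1 + g2) * counters * width


naive = per_combination_bytes(
    GROUP_1_COMBINATIONS, GROUP_2_COMBINATIONS, BYTES_PER_VALUE
)
sketched = per_group_sketch_bytes(
    GROUP_1_COMBINATIONS, GROUP_2_COMBINATIONS,
    COUNTERS_PER_SKETCH, BYTES_PER_VALUE,
)

print(f"per-combination table: {naive / 2**40:.2f} TiB")
print(f"per-group sketches:    {sketched / 2**30:.2f} GiB")
print(f"reduction factor:      {naive / sketched:.0f}x")
```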

18 Evaluation ●As the number of buckets (memory usage) increases, the average relative error decreases. ●As the intersection ratio increases, the average relative error decreases.

19 Evaluation Mean relative error as memory (number of buckets) varies, for intersection ratios of 0.01, 0.02, 0.05, and 0.10

20 Conclusions We provide an intersection scheme for estimating arbitrary summary statistics on large data sets. We show how to reduce storage cost from O(n) to O(√n). We demonstrate efficacy using both synthetic and real data.

