Crossroads: A Practical Data Sketching Solution for Mining Intersection of Streams Jun Xu, Zhenglin Yu (Georgia Tech) Jia Wang, Zihui Ge, He Yan (AT&T Labs Research) Ashwin Lall (Denison University)

Overview ●Problem statement ●Motivation ●Our approach ●Data collection architecture ●Evaluation ●Conclusions

Problem Statement Anonymously mine the logs of cellular data traffic to rapidly detect network performance anomalies. This can be converted to an association-rule mining problem if every cellular packet is stored in a static database.

Motivation Complexity of the network ecosystem (sophisticated phones and tablets, application servers, a tremendous variety of apps and online services). Performance issues can be introduced by different causes or combinations of causes (device, app, network, app server). It is infeasible to store all combinations of such data (especially when it is collected in real time).

Problem Statement (detailed) An example event “Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app” & “Source = Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ” ⇒ “unusually long RTT”.

Main challenge: We cannot afford to store all the combinations, since the number of different attribute combinations is huge. Our goal: Asymptotic reduction in space usage while keeping accuracy loss small when detecting anomalous values. Contribution of the paper: An intersection scheme that significantly reduces the storage cost while keeping accuracy loss small.

Our approach ●Based on a data sketching solution (inspired by the tug-of-war sketch) ●A sketch is constructed to succinctly summarize the performance metrics (e.g., average RTT) of all data items

Our approach Partition the attributes into 2 groups. Example: the two groups define sets A_i and B_j

Our approach A_i: the set of packets that match (“Source = the Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ”) B_j: the set of packets that match (“Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app”)
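The grouping above can be illustrated with a minimal sketch, assuming each packet record is a dictionary of attribute values. The field names, the `GROUP_A` / `GROUP_B` constants, and the `group_keys` helper are hypothetical and are not taken from the slides; the split simply mirrors the example (location/destination/time versus device/OS/application).

```python
from typing import Dict, Tuple

# Hypothetical attribute split mirroring the slide's example.
GROUP_A = ("source", "destination", "time")      # defines the sets A_i
GROUP_B = ("device", "os", "application")        # defines the sets B_j


def group_keys(packet: Dict[str, str]) -> Tuple[tuple, tuple]:
    """Return the (A_i key, B_j key) pair that a packet belongs to."""
    key_a = tuple(packet[name] for name in GROUP_A)
    key_b = tuple(packet[name] for name in GROUP_B)
    return key_a, key_b


# Example record using the slide's illustrative attribute values.
packet = {
    "source": "Metropolis downtown location",
    "destination": "FunContent.com",
    "time": "tau",
    "device": "Magic Phone 7",
    "os": "Magic OS 88.8",
    "application": "FunContent.app",
}
key_a, key_b = group_keys(packet)
print(key_a)
print(key_b)
```

Each packet then updates exactly two summaries, one for its A_i key and one for its B_j key, instead of one summary per full attribute combination.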

Our approach We can compute functions on the intersection of arbitrary sets A_i and B_j

Our approach ●Use sketches to store summary statistics (e.g., mean, variance) for A_i and B_j ●Derive the performance metrics of the data by intersecting the sketches (A_i ∩ B_j)
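One way to picture "intersecting the sketches" is the following minimal illustration. It is not the paper's exact construction: it assumes a generic signed-counter sketch in the tug-of-war family, where every packet has a unique identifier that is hashed to one counter and one ±1 sign. Under that assumption, the inner product of a sketch built over A_i and a sketch built over B_j is an unbiased estimate of a sum over A_i ∩ B_j, because packets outside the intersection contribute only zero-mean noise. The class name `SignedSketch`, the helper names, and the synthetic RTT values are all hypothetical.

```python
import hashlib

NUM_COUNTERS = 4096  # counters per sketch, as in the slides' evaluation


def _hash64(packet_id: str) -> int:
    """Deterministic 64-bit hash of a packet identifier."""
    digest = hashlib.blake2b(packet_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class SignedSketch:
    """Array of counters; each item adds +value or -value to one counter."""

    def __init__(self, num_counters: int = NUM_COUNTERS) -> None:
        self.counters = [0.0] * num_counters

    def add(self, packet_id: str, value: float = 1.0) -> None:
        h = _hash64(packet_id)
        bucket = h % len(self.counters)            # which counter to update
        sign = 1.0 if (h >> 63) & 1 else -1.0      # pseudo-random sign from the top bit
        self.counters[bucket] += sign * value


def inner_product(x: SignedSketch, y: SignedSketch) -> float:
    return sum(a * b for a, b in zip(x.counters, y.counters))


def mean_on_intersection(a_sum: SignedSketch, a_count: SignedSketch,
                         b_count: SignedSketch) -> float:
    """Estimate the mean value over A ∩ B as (estimated sum) / (estimated count)."""
    total = inner_product(a_sum, b_count)
    count = inner_product(a_count, b_count)
    return total / count if count > 0 else float("nan")


# Synthetic data: A holds packets 0..1999, B holds packets 1000..2999.
a_sum, a_count, b_count = SignedSketch(), SignedSketch(), SignedSketch()
for i in range(0, 2000):
    rtt = 100.0 + (i % 50)                 # made-up RTT in milliseconds
    a_sum.add(f"pkt-{i}", rtt)
    a_count.add(f"pkt-{i}", 1.0)
for i in range(1000, 3000):
    b_count.add(f"pkt-{i}", 1.0)

est_count = inner_product(a_count, b_count)
est_mean = mean_on_intersection(a_sum, a_count, b_count)
print(f"estimated |A ∩ B| = {est_count:.0f} (true 1000)")
print(f"estimated mean RTT on A ∩ B = {est_mean:.1f} (true 124.5)")
```

The estimates are noisy; in this kind of sketch the noise shrinks as the number of counters grows and as the intersection makes up a larger share of the two sets.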

Storage saving Reduce the storage cost from O(n) to O(√n). For example, the number of joint value combinations of all these attributes is in the trillions (∼10^12), while each subset may only be in the millions (∼10^6)
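The O(n) to O(√n) claim can be spelled out as follows. This is a sketch of the reasoning, assuming the two attribute groups have N_A and N_B distinct value combinations and each sketch uses a fixed number m of counters; the symbols N_A, N_B, and m are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
  n &= N_A \cdot N_B
    && \text{(joint combinations, one stored value each: } O(n) \text{ space)} \\
  S_{\text{sketch}} &= (N_A + N_B)\, m
    && \text{(one sketch per } A_i \text{ and per } B_j\text{)} \\
  N_A \approx N_B \approx \sqrt{n}
    \;&\Rightarrow\; S_{\text{sketch}} \approx 2m\sqrt{n} = O(\sqrt{n})
    && \text{(for fixed } m\text{)}
\end{align*}
```

With n in the trillions and each group in the millions, √n is on the order of a million, which matches the example on the slide.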

3-Way Intersection ●How about 3-way intersection? ●An impossibility result:

Data Collection Architecture Real data collection* * Note: No personally identifiable information (PII) was gathered or used in conducting this study. To the extent any data was analyzed, it was anonymous and/or aggregated data.

Evaluation 8 different attributes in our real data*. Partition scheme: ●1st group: RNC, service category, handheld device speed category, day of week ●2nd group: handheld device manufacturer/model, content provider, access point network, hour

Evaluation 1.4 million distinct combinations for the 1st group. 1.5 million distinct combinations for the 2nd group. Storage cost of maintaining the value of every combination: 1.4M × 1.5M × 4 bytes ≈ 7.5 TB

Evaluation Using our intersection scheme with 4096 counters in each sketch, the space cost is (1.4M + 1.5M) × 4096 × 4 bytes ≈ 45 GB, much less than 7.5 TB. The relative error will be about 10%
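The storage arithmetic on these two slides can be checked with a few lines of Python. The constant names are illustrative; the numbers are the ones given on the slides. The products come to about 8.4 × 10^12 bytes and 4.75 × 10^10 bytes, which is roughly 7.6 TiB and 44 GiB in binary units, so the slides' 7.5 TB and 45 GB are best read as approximate figures.

```python
# Figures from the evaluation slides.
GROUP_1_COMBINATIONS = 1_400_000
GROUP_2_COMBINATIONS = 1_500_000
BYTES_PER_VALUE = 4
COUNTERS_PER_SKETCH = 4096

# One stored value per joint combination of the two groups.
full_bytes = GROUP_1_COMBINATIONS * GROUP_2_COMBINATIONS * BYTES_PER_VALUE

# One sketch per group-1 combination plus one sketch per group-2 combination.
sketch_bytes = ((GROUP_1_COMBINATIONS + GROUP_2_COMBINATIONS)
                * COUNTERS_PER_SKETCH * BYTES_PER_VALUE)

print(f"full table : {full_bytes:,} bytes ({full_bytes / 2**40:.2f} TiB)")
print(f"sketches   : {sketch_bytes:,} bytes ({sketch_bytes / 2**30:.2f} GiB)")
print(f"reduction  : {full_bytes / sketch_bytes:.0f}x")
```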

Evaluation ●As the number of buckets (memory usage) increases, the average relative error decreases. ●As the intersection ratio increases, the average relative error decreases.

Evaluation Mean relative error as a function of memory (number of buckets), for intersection ratios of 0.01, 0.02, 0.05, and 0.10

Conclusions ●We provide an intersection scheme for estimating arbitrary summary statistics on large data sets ●We show how to reduce storage cost from O(n) to O(√n) ●We demonstrate efficacy using both synthetic and real data