Augmented Sketch: Faster and More Accurate Stream Processing

Slides:

Advertisements

Similar presentations

Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.

Advertisements

Revisiting Co-Processing for Hash Joins on the Coupled CPU- GPU Architecture School of Computer Engineering Nanyang Technological University 27 th Aug.

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

1 An Efficient, Hardware-based Multi-Hash Scheme for High Speed IP Lookup Hot Interconnects 2008 Socrates Demetriades, Michel Hanna, Sangyeun Cho and Rami.

Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California.

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.

Energy Conservation in Datacenters through Cluster Memory Management and Barely-Alive Memory Servers Vlasia Anagnostopoulou Susmit.

Database Implementation of a Model-Free Classifier Konstantinos Morfonios ADBIS 2007 University of Athens.

SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.

Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.

Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.

1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.

Catching Accurate Profiles in Hardware Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese Presented by Jelena Trajkovic.

Mining Optimal Decision Trees from Itemset Lattices Dr, Siegfried Nijssen Dr. Elisa Fromont KDD 2007.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models Kai-Wei Chang and Dan Roth Experiment Settings Block Minimization.

Multiple Aggregations Over Data Streams Rui ZhangNational Univ. of Singapore Nick KoudasUniv. of Toronto Beng Chin OoiNational Univ. of Singapore Divesh.

Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.

Qingqing Gan Torsten Suel CSE Department Polytechnic Institute of NYU Improved Techniques for Result Caching in Web Search Engines.

1 LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams Qun Huang and Patrick P. C. Lee The Chinese.

Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.

Efficient Local Statistical Analysis via Integral Histograms with Discrete Wavelet Transform Teng-Yok Lee & Han-Wei Shen IEEE SciVis ’13Uncertainty & Multivariate.

By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.

Hardened IDS using IXP Didier Contis, Dr. Wenke Lee, Dr. David Schimmel Chris Clark, Jun Li, Chengai Lu, Weidong Shi, Ashley Thomas, Yi Zhang  Current.

SCREAM: Sketch Resource Allocation for Software-defined Measurement Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat (CoNEXT’15)

Accurate WiFi Packet Delivery Rate Estimation and Applications Owais Khan and Lili Qiu. The University of Texas at Austin 1 Infocom 2016, San Francisco.

Re-evaluating Measurement Algorithms in Software Omid Alipourfard, Masoud Moshref, Minlan Yu {alipourf, moshrefj,

BAHIR DAR UNIVERSITY Institute of technology Faculty of Computing Department of information technology Msc program Distributed Database Article Review.

Dense-Region Based Compact Data Cube

SketchVisor: Robust Network Measurement for Software Packet Processing

Big Data Infrastructure

Constant Time Updates in Hierarchical Heavy Hitters

Processes and threads.

Frequency Counts over Data Streams

International Conference on Data Engineering (ICDE 2016)

A paper on Join Synopses for Approximate Query Answering

Embedded Systems Design

Finding Frequent Items in Data Streams

Implementation of GPU based CCN Router

A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.

Supporting Fault-Tolerance in Streaming Grid Applications

Kijung Shin1 Mohammad Hammoud1

Query-Friendly Compression of Graph Streams

DISTRIBUTED CLUSTERING OF UBIQUITOUS DATA STREAMS

StreamApprox Approximate Stream Analytics in Apache Flink

Pyramid Sketch: a Sketch Framework

Optimal Elephant Flow Detection Presented by: Gil Einziger,

StreamApprox Approximate Stream Analytics in Apache Spark

StreamApprox Approximate Computing for Stream Analytics

Chapter 4: Threads.

SCREAM: Sketch Resource Allocation for Software-defined Measurement

Elastic Sketch: Adaptive and Fast Network-wide Measurements

Streaming Sensor Data Fjord / Sensor Proxy Multiquery Eddy

Smita Vijayakumar Qian Zhu Gagan Agrawal

Approximate Frequency Counts over Data Streams

Stratified Sampling for Data Mining on the Deep Web

Pramod Bhatotia, Ruichuan Chen, Myungjin Lee

Range-Efficient Computation of F0 over Massive Data Streams

Elastic Sketch: Adaptive and Fast Network-wide Measurements

GATES: A Grid-Based Middleware for Processing Distributed Data Streams

Constant Time Updates in Hierarchical Heavy Hitters

Lecture 6: Counting triangles Dynamic graphs & sampling

By: Ran Ben Basat, Technion, Israel

By: Ran Ben Basat, Technion, Israel

Online Analytical Processing Stream Data: Is It Feasible?

Lu Tang , Qun Huang, Patrick P. C. Lee

Presentation transcript:

Augmented Sketch: Faster and More Accurate Stream Processing Pratanu Roy Arijit Khan Gustavo Alonso Systems Group ETH Zurich Nanyang Technical University Singapore

Data Stream Processing f( ) = 3 e f( ) = 2 e …. e a a c e e Data Stream IP traffic, phone calls, sensor measurements, web clicks and crawls P. Roy, A. Khan, G. Alonso

Data Stream Processing f( ) = 3 e f( ) = 2 e …. e a a c e e Data Stream IP traffic, phone calls, sensor measurements, web clicks and crawls Applications: Load balancing Ranking Frequent itemsets mining Classification P. Roy, A. Khan, G. Alonso

Challenges in Stream Processing Trade-off among Space, Accuracy, and Efficiency: -- Increasing space increases accuracy, but reduces throughput Other requirements: -- Build summary in one pass over the stream -- Incremental updates in summary P. Roy, A. Khan, G. Alonso

Related Work h H1(e) (e,f) w Hw(e) + f + f + f Sketch Space-saving Wavelets Sampling h + f H1(e) (e,f) + f w Hw(e) + f Count-Min Sketch P. Roy, A. Khan, G. Alonso

Our Motivation Improve accuracy for frequent items -- Critical for threshold checking (service-level agreement ) , ranking, load-balancing Reduce misclassification -- Frequent items mining the first step of frequent itemsets mining -- Even a small number of misclassified items lead to a large number of false-positive itemsets Improve throughput P. Roy, A. Khan, G. Alonso

Main Takeaway Let the common case go faster Hot data Frequency Cold data Data mining tasks often has to deal with skewed data and traditionally it is seen as a performance bottleneck especially for parallelization A skewed distribution looks like this where few items have higher frequency But if we look closely then few items have significantly higher count then others 3. We want to design a framework to let the common case go faster without slowing down the others Items Let the common case go faster P. Roy, A. Khan, G. Alonso

State-of-the-art Sketch Algorithms Main Takeaway: Solution Framework Hot data Cold data Optimized Codepath (Filter) State-of-the-art Sketch Algorithms Input -- -- We have designed an algorithm where we create a shortcut codepath that takes away the hot data. -- Shortcut Code path can be designed with SIMD instructions on modern multi-core machines -- Improvements can come in the form of throughput and accuracy Optimized codepath acts like a filter for hot data Improvements: accuracy, throughput P. Roy, A. Khan, G. Alonso

Main Takeaway: Desired Outcome Solution framework Without going into too much detail, if we plot skew versus throughput of the state-of-the-art shown in red curve vs the proposed algorithm, then As we increase skew, more and more items will be processed by the optimized codepath and improve the throughput, upto the point where everything is processed by the optimized codepath If you design the optimized codepath well, then you will loose only a little when there is no skew in the data In addition, the separation of high-frequency and low-frequency items also result in improved accuracy as we will show later State of the art P. Roy, A. Khan, G. Alonso

Augmented Sketch (ASketch) Items with lower count Filter Input Count Min Items with higher count Very small (~0.3%) and make the processing very fast (with SIMD) estimate(k) > minimum in (filter) Challenges -- Removing items from sketch -- Cascading exchanges P. Roy, A. Khan, G. Alonso

Exchange Mechanism Count-Min 4 3 8 7 H1 6 9 H2 filter A B Items 7 10 new 2 1 old A 8 Found in filter Update in filter Count-Min 4 3 8 7 H1 6 9 H2 filter A B Items 8 10 new 2 1 old 9 C 10 Not found in filter Update in count-min estimate (C) > minimum count (filter)  initiate an exchange P. Roy, A. Khan, G. Alonso

Exchange Mechanism We do not perform multiple exchanges 4 3 9 7 6 10 A Count-Min 4 3 9 7 H1 6 10 H2 filter A B Items 8 10 new 2 1 old A 8 2 C 9 9 10 9 Step 1: Move C to filter Count-Min 4 3 9 7 H1 6 10 H2 filter C B Items 9 10 new 1 old 13 10 Step 2: Move A to Count-Min with (8-2) = 6 We do not perform multiple exchanges P. Roy, A. Khan, G. Alonso

Other Technical Contributions Theoretical Error Bounds Four different filter implementation -- Array, Strict Heap, Relaxed Heap, Stream Summary Hardware-conscious filter (SIMD) Pipeline parallelism SPMD parallelism P. Roy, A. Khan, G. Alonso

Experimental Results: Stream Processing Throughput Count-Min [Cormode et al., Algo 05] Frequency Aware Counting (FCM) [Thomas et al, ICDE 09] Holistic UDAFs [Cormode et al, SIGMOD 04] Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso

Experimental Results: Query Processing Throughput Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB Queries are generated by sampling the input distribution P. Roy, A. Khan, G. Alonso

Experimental Results: Accuracy Improvement IP-Trace, 13M data items, stream size = 461M, sketch size = 128 KB, zipf 0.9 Queries are generated by sampling the input distribution P. Roy, A. Khan, G. Alonso

Conclusions ASketch dynamically identifies and aggregates most frequent items -- Improves throughput and accuracy of existing sketches -- Allows efficient utilization of modern hardware features such as SIMD, multi-cores, etc. Future work: investigate the use of Asketch in the context of machine learning and data mining applications P. Roy, A. Khan, G. Alonso

Impact of Misclassifications 8M data items; stream size = 32M, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso

Dataflow with Pipeline Parallelism Core 1 Core 2 Optimized codepath Count-Min Input Use message passing to communicate across the cores P. Roy, A. Khan, G. Alonso

Parallel-ASketch vs ASketch Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso