A Formal Analysis of Conservative Update Based Approximate Counting
Gil Einziger and Roy Friedman, Technion, Haifa



Approximate Counting
We wish to count the number of occurrences of various items from a very large domain. To gain space efficiency, we are willing to tolerate only an approximate count.

Bloom Filters
- An array BF of m bits and k hash functions {h_1, …, h_k}, each with range [0, …, m-1].
- Adding an object obj to the Bloom filter is done by computing h_1(obj), …, h_k(obj) and setting the corresponding bits in BF.
- Checking set membership for an object cand is done by computing h_1(cand), …, h_k(cand) and verifying that all corresponding bits are set.
Example (figure; the bit array BF is not recoverable): m=11, k=3; h_1(o1)=0, h_2(o1)=7, h_3(o1)=5; h_1(o2)=0, h_2(o2)=7, h_3(o2)=4. Result marks in the figure: √ ×
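The add and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class name and the lookup table TABLE are assumptions, and the table merely stands in for real hash functions so that the hash values quoted on the slide can be reused. It also assumes only o1 was added before the two lookups.

```python
class BloomFilter:
    """m-bit Bloom filter; `hashes` is a list of k functions mapping an object to [0, m-1]."""

    def __init__(self, m, hashes):
        self.bits = [0] * m
        self.hashes = hashes

    def add(self, obj):
        # Set every bit selected by the k hash functions.
        for h in self.hashes:
            self.bits[h(obj)] = 1

    def contains(self, cand):
        # Membership holds only if all k selected bits are set.
        return all(self.bits[h(cand)] for h in self.hashes)


# Hypothetical stand-in for real hash functions: a lookup table holding the
# hash values quoted on the slide (m=11, k=3).
TABLE = {"o1": (0, 7, 5), "o2": (0, 7, 4)}
hashes = [lambda obj, i=i: TABLE[obj][i] for i in range(3)]

bf = BloomFilter(11, hashes)
bf.add("o1")
print(bf.contains("o1"))  # True
print(bf.contains("o2"))  # False: bit 4 was never set
```

A lookup can return a false positive when other insertions happen to set all k bits of an object that was never added, but it never returns a false negative.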

Counting Bloom Filters
A vector of counters (instead of bits). A counting Bloom filter supports the operations:
- Increment: increment by 1 all entries that correspond to the results of the k hash functions.
- Decrement: decrement by 1 all entries that correspond to the results of the k hash functions.
- Estimate (instead of get): return the minimal value of all corresponding entries.
Example (figure; the values of m, CBF and Estimate(o1) are not recoverable): k=3; h_1(o1)=0, h_2(o1)=7, h_3(o1)=5.
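The three operations can be sketched as follows. This is an illustrative sketch only: the class name, the table-based hash functions, the value m=11 (borrowed from the Bloom filter slide, since the value on this slide is lost) and the hash values for o2 are all assumptions.

```python
class CountingBloomFilter:
    """Vector of m counters; `hashes` is a list of k functions mapping an object to [0, m-1]."""

    def __init__(self, m, hashes):
        self.counters = [0] * m
        self.hashes = hashes

    def increment(self, obj):
        for h in self.hashes:
            self.counters[h(obj)] += 1

    def decrement(self, obj):
        for h in self.hashes:
            self.counters[h(obj)] -= 1

    def estimate(self, obj):
        # The minimal counter is the one least inflated by other items.
        return min(self.counters[h(obj)] for h in self.hashes)


# Hypothetical table-based hash functions (k=3); the values for o2 are assumed.
TABLE = {"o1": (0, 7, 5), "o2": (0, 7, 4)}
hashes = [lambda obj, i=i: TABLE[obj][i] for i in range(3)]

cbf = CountingBloomFilter(11, hashes)
cbf.increment("o1")
cbf.increment("o1")
cbf.increment("o2")
print(cbf.estimate("o1"))  # 2
print(cbf.estimate("o2"))  # 1
cbf.decrement("o2")
print(cbf.estimate("o2"))  # 0
```

Counters 0 and 7 are shared by both objects and reach 3, which is why Estimate takes the minimum rather than any single entry.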

Conservative Update Technique
Give up the ability to Decrement in favor of accuracy/space efficiency:
- During an Increment operation, only update the lowest counters.
Example (figure; m and the array contents are not recoverable): k=3; h_1(o1)=0, h_2(o1)=7, h_3(o1)=5. In SBF-MI, Increment(o1) only adds to the first entry (3->4).
Empirically shown to improve accuracy, by up to two orders of magnitude for some workloads.
- But not formally understood.
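A conservative update increment can be sketched as below. The function names are hypothetical, and the counter values 3, 6 and 5 are made up for illustration (the slide's array is lost); only the "3->4" step on the first entry comes from the slide.

```python
def conservative_increment(counters, positions):
    """Increment only the counters that currently hold the minimal value."""
    low = min(counters[p] for p in positions)
    # set() avoids a double increment if two hash functions collide.
    for p in set(positions):
        if counters[p] == low:
            counters[p] += 1


def estimate(counters, positions):
    return min(counters[p] for p in positions)


# Assumed state: o1 hashes to positions 0, 7 and 5, holding 3, 6 and 5.
counters = [3, 0, 0, 0, 0, 5, 0, 6, 0, 0, 0]
o1_positions = [0, 7, 5]

conservative_increment(counters, o1_positions)
print(counters[0], counters[7], counters[5])  # 4 6 5
print(estimate(counters, o1_positions))       # 4
```

Because the larger counters are left untouched, they are not inflated further on behalf of o1, which is the source of the accuracy gain. The cost is that a later Decrement can no longer be applied safely.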

Motivation
Applications:
- Network measurements and heavy hitters.
- Network security: anomaly detection.
- Cache admission policy.
Additional applications in other fields, e.g. databases and natural language processing.

TinyLFU - Cache Admission Policy (PDP 2014)
The access distribution of most content is skewed.
- Often modeled using Zipf-like functions, power-law, etc.
[Figure: frequency vs. rank. A small number of very popular items, for example ~50% of the weight; a long heavy tail, for example ~50% of the weight.]

Eviction and Admission Policies
[Figure: a cache, a new item, a victim and a winner.]
- Eviction policy: "One of you guys should leave…"
- Admission policy: "Is the new item any better than the victim?"
What is the common answer?

Conservative Update - Intuition
Conservative Update allows counting just the head items with high accuracy, so our cache can make educated admission decisions.
[Figure labels: Items; Desired; Undesired.]
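A frequency based admission decision might look like the sketch below. The slides only pose the question of whether the new item is better than the victim. The rule used here (admit only if the new item's estimated frequency is strictly higher) is an assumption, as are the function name and the exact Counter that stands in for an approximate counter.

```python
from collections import Counter


def admit(estimate, new_item, victim):
    """Assumed rule: admit the new item only if it looks more frequent than the victim."""
    return estimate(new_item) > estimate(victim)


# Stand-in for an approximate counter: exact counts over a toy access history.
history = ["a", "a", "a", "b", "c", "c"]
freq = Counter(history)


def estimate(item):
    return freq[item]


print(admit(estimate, "a", "b"))  # True: 3 accesses beat 1
print(admit(estimate, "b", "c"))  # False: 1 access does not beat 2
print(admit(estimate, "z", "b"))  # False: an unseen item counts as 0
```

Only the comparison between the two estimates matters here, so overestimating rarely seen tail items is what would hurt such a decision most.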

Admission Policy Example
[Plot: hit rate vs. cache size, without an admission policy and with a frequency based admission policy. Annotations: "More memory", "Better cache management".]

The Basic Observation
A CBF is exactly like [figure: "CBF =" and "LCS =" drawings, contents not recoverable]
If we can quantify how many items are inserted into each level of the LCS, we can bound the error.

Simple Observations
- It is useful to discuss the number of items that are inserted into each level of the LCS.
- Since all levels are treated the same, the false positive probability of each level is determined only by the number of items inserted into that level.
- A false positive at a higher level implies a false positive at all lower levels.
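One way to picture the level view is sketched below. The slides' figures defining the LCS are lost, so this rests on an assumption: level i is taken to be the bit array whose bit j is set exactly when counter j is at least i. The function name is hypothetical.

```python
def levels(counters):
    """Split a counter vector into bit arrays, one per level (assumed reading of the LCS)."""
    top = max(counters, default=0)
    # Level i (1-based) has bit j set iff counters[j] >= i.
    return [[1 if c >= i else 0 for c in counters] for i in range(1, top + 1)]


counters = [2, 0, 1, 3]
lv = levels(counters)
for i, bits in enumerate(lv, start=1):
    print(i, bits)

# Summing the levels column by column gives back the counters.
print([sum(col) for col in zip(*lv)] == counters)  # True
```

Under this reading, any bit set at level i is also set at every level below i, which matches the observation that a false positive at a higher level implies one at all lower levels.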

The Model
- Known (constant) distribution.
- Large enough sample.
- We assume that we can make a ‘characteristic’ histogram. Formally, we know how many items are going to appear each number of times.
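A histogram of this kind can be computed from a sample as sketched below. The function name is hypothetical, and the sketch assumes the histogram maps each occurrence count to the number of distinct items that appear exactly that many times.

```python
from collections import Counter


def characteristic_histogram(stream):
    """Map each occurrence count to the number of distinct items with that count."""
    per_item = Counter(stream)          # item -> number of appearances
    return Counter(per_item.values())   # number of appearances -> number of items


sample = ["a", "a", "a", "b", "b", "c", "d"]
hist = characteristic_histogram(sample)
print(dict(hist))  # {3: 1, 2: 1, 1: 2}
```

Here one item appears three times, one appears twice and two appear once.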

Lower Bound
Denote by A[i] the number of items that are actually inserted into level i.
By definition: [equation not recoverable]
A min/max argument about the lowest level that could have experienced a false positive yields the following: [equation not recoverable]

Upper Bound
Derived similarly, by upper bounding A[i]. Requires a few further assumptions. Technical details are in the paper.

Accurate Configuration – Uniform

Accurate Configuration – Zipf 1

Inaccurate Configuration – Uniform

Inaccurate Configuration – Zipf 1

Real Trace – Counting TCP packets

Summary
- A simple analysis of an extensively used approximate counting optimization.
- First to analyze it for general distributions.
- Lower and upper bounds on the model.
- A good indicator on real workloads.
- An extended version is published as a tech report.
Thank You