Compact Data Structures and Applications
Gil Einziger and Roy Friedman, Technion, Haifa

Approximate Set Membership
The problem: maintain enough state to approximately answer set membership queries (add/query).
Things to consider:
– False positive probability.
– No false negatives!
– A tradeoff between space and the false positive probability.
Bloom filters are the classical example.

Application Example: Black List
– Client: a web browser (your computer).
– Server: let's say Google's data center.
– Mission: check whether each accessed URL is on the black list.
– Challenge: the black list is too big to fit in memory.
Trivial solution: access the data center for each URL.
Approximate set: access the data center only in case of a positive answer (we must still access it, since the answer may be wrong). But… there are no false negatives! Therefore every site that tests negative is not on the list, and we do not have to contact the data center.

Bloom Filters
An array BF of m bits and k hash functions h1,…,hk over the domain [0,…,m-1].
– Adding an object obj to the Bloom filter is done by computing h1(obj),…,hk(obj) and setting the corresponding bits.
– Checking set membership for an object cand is done by computing h1(cand),…,hk(cand) and verifying that all corresponding bits are set.
Example (figure): m=11, k=3. Adding o1 with h1(o1)=0, h2(o1)=7, h3(o1)=5 sets bits 0, 5, and 7, so a query for o1 succeeds (√). A query for o2 with h1(o2)=0, h2(o2)=7, h3(o2)=4 finds bit 4 unset, so o2 is correctly reported as not in the set (×).
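To make the add/query mechanics concrete, here is a minimal Python sketch of the Bloom filter described above. Deriving the k hash functions by double hashing over a SHA-256 digest is an assumption made for the example, not something stated in the slides.

```python
import hashlib

class BloomFilter:
    """m bits, k hash functions; no false negatives, possible false positives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indexes(self, obj):
        # Simulate k hash functions via double hashing over one digest.
        digest = hashlib.sha256(str(obj).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, obj):
        for i in self._indexes(obj):
            self.bits[i] = True

    def query(self, obj):
        # All k bits set -> "probably in the set"; any unset bit -> definitely not.
        return all(self.bits[i] for i in self._indexes(obj))

bf = BloomFilter(m=11, k=3)
bf.add("o1")
print(bf.query("o1"))  # True
print(bf.query("o2"))  # False, unless o2's bits happen to collide (a false positive)
```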

Approximate Counting
Multiset: instead of a query, an 'estimate' operation: "How many times was the item added before?"
Things to consider:
– False positives.
– No false negatives: we only get over-approximation!
– A tradeoff between space and accuracy.
Typically solved with Bloom filter extensions, like the Spectral Bloom Filter, Count-Min Sketch, Multi-Stage Filters, and many more.

Counting with Bloom Filter
A vector of counters (instead of bits). A counting Bloom filter supports the operations:
– Increment: increment by 1 all entries that correspond to the results of the k hash functions.
– Decrement: decrement by 1 all entries that correspond to the results of the k hash functions.
– Estimate (instead of get): return the minimal value of all corresponding entries.
Example (figure): with k=3 and h1(o1)=0, h2(o1)=7, h3(o1)=5, Estimate(o1) returns the minimum of the counters at positions 0, 7, and 5.
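The counting variant can be sketched in the same style. As above, the hashing scheme is an assumption, and a real counting Bloom filter would use small fixed-width counters rather than unbounded Python integers.

```python
import hashlib

class CountingBloomFilter:
    """A vector of counters; estimate() returns the minimum of the k entries."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, obj):
        digest = hashlib.sha256(str(obj).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def increment(self, obj):
        for i in self._indexes(obj):
            self.counters[i] += 1

    def decrement(self, obj):
        for i in self._indexes(obj):
            self.counters[i] -= 1

    def estimate(self, obj):
        # Collisions can only inflate the answer: an over-approximation.
        return min(self.counters[i] for i in self._indexes(obj))

cbf = CountingBloomFilter(m=64, k=3)
for _ in range(3):
    cbf.increment("o1")
print(cbf.estimate("o1"))  # >= 3; exactly 3 unless all three counters collide
```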

Some Applications
– Google Chrome
– Google BigTable
– Apache Hadoop
– (Facebook's) Apache Cassandra
– Venti (archive system)
– Cache admission policy
– And many more…

Approximate Set with Hash Tables
We use an array of fingerprints.
– The function p assigns each item a place in the array.
– The function f assigns each item a fingerprint.
– The add operation writes f(o) in location p(o).
– The query operation compares the content of location p(o) with f(o).
Example (figure): p(o1)=3, f(o1)=7 and p(o2)=5, f(o2)=12; a query succeeds (√) only when the stored fingerprint at p(o) equals f(o), otherwise it fails (×).
Only works when there are no collisions…
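A rough sketch of such a fingerprint table follows, assuming both p(o) and f(o) are carved out of a single hash value (the slot count and fingerprint width are illustrative). It simply overwrites on collision, which is exactly the limitation the next slide tackles.

```python
import hashlib

class FingerprintTable:
    """Stores only a short fingerprint f(o) at position p(o)."""

    def __init__(self, slots=16, fp_bits=8):
        self.slots = [None] * slots
        self.fp_bits = fp_bits

    def _p_and_f(self, obj):
        h = int.from_bytes(hashlib.sha256(str(obj).encode()).digest(), "big")
        p = h % len(self.slots)                           # place in the array
        f = (h // len(self.slots)) % (1 << self.fp_bits)  # fingerprint
        return p, f

    def add(self, obj):
        p, f = self._p_and_f(obj)
        self.slots[p] = f          # a colliding item would be overwritten here

    def query(self, obj):
        p, f = self._p_and_f(obj)
        return self.slots[p] == f

t = FingerprintTable()
t.add("o1")
print(t.query("o1"))  # True
print(t.query("o2"))  # False, unless both p and f collide (a false positive)
```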

Handling Collisions
– Chain-based hash table? A single pointer is 64 bits, so most of the space is simply for pointers!
– Array? Linear probing?
Can we do anything that is more space efficient than an array?

TinyTable Overview
Each item is mapped to a bucket, a chain within the bucket, and a tag (fingerprint):
– Bucket and chain: inferred from the item's place in the table.
– Tag (fingerprint): only the tag bits are stored.
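As an illustration of this split (field widths here are assumptions, not the paper's parameters), an item's hash can be divided into the three parts like so:

```python
import hashlib

# Illustrative parameters; a real table chooses these from its configuration.
NUM_BUCKETS = 1024
CHAINS_PER_BUCKET = 64
TAG_BITS = 8

def split_hash(item):
    """Split an item's hash into (bucket, chain, tag); only the tag is stored."""
    h = int.from_bytes(hashlib.sha256(str(item).encode()).digest(), "big")
    bucket = h % NUM_BUCKETS
    h //= NUM_BUCKETS
    chain = h % CHAINS_PER_BUCKET
    h //= CHAINS_PER_BUCKET
    tag = h % (1 << TAG_BITS)
    return bucket, chain, tag

print(split_hash("example.com"))  # e.g. a (bucket, chain, tag) triple
```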

Encoding of a Single Bucket
A bucket is encoded with three components:
– Chain index (1 bit per chain): indicates which chains are non-empty (in the figure, chain 2 is not empty and chain 7 is empty).
– Is-last bits (1 bit per item): mark the last item of each chain (in the figure, the first item is not last and the second item is last in its chain).
– Array of stored tags (A, B, C in the figure).

Examples: Adding Items to a Bucket
– Add B to chain 5: chain 5 is marked non-empty in the chain index, B's tag is placed in the array, and B becomes the last item of its chain.
– Add C to chain 0: items are kept ordered by chain, so C is inserted at chain 0's position and later items shift right.
– Add D to chain 2: D is inserted at chain 2's position, again shifting later items, and the is-last bits are updated.
(Each step in the original slides shows the chain index, the is-last bits, the array of 6 fixed-size items, and the resulting logical view.)
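The following Python sketch (an illustration, not the authors' code) shows how a chain can be located and extended using only the chain-index bitmap, the is-last bits, and the tag array; plain Python lists stand in for the real packed bit-level encoding.

```python
class Bucket:
    """One bucket: a chain-index bitmap, per-item is-last bits, and packed tags."""

    def __init__(self, num_chains):
        self.chain_index = [False] * num_chains  # 1 bit per chain
        self.is_last = []                        # 1 bit per stored item
        self.tags = []                           # fingerprints, ordered by chain

    def _chain_bounds(self, chain):
        """Return (start, end) of `chain`'s run inside the tag array."""
        runs_to_skip = sum(1 for c in range(chain) if self.chain_index[c])
        start = 0
        while runs_to_skip > 0:          # skip earlier chains' runs
            if self.is_last[start]:
                runs_to_skip -= 1
            start += 1
        if not self.chain_index[chain]:
            return start, start          # empty chain
        end = start
        while not self.is_last[end]:
            end += 1
        return start, end + 1

    def add(self, chain, tag):
        start, end = self._chain_bounds(chain)
        self.tags.insert(end, tag)
        if self.chain_index[chain]:
            self.is_last[end - 1] = False   # the old last item is no longer last
        else:
            self.chain_index[chain] = True  # the chain becomes non-empty
        self.is_last.insert(end, True)      # the new item terminates the chain

    def contains(self, chain, tag):
        start, end = self._chain_bounds(chain)
        return tag in self.tags[start:end]

b = Bucket(num_chains=8)
b.add(5, 0xB)
b.add(0, 0xC)
b.add(2, 0xD)
print(b.contains(5, 0xB), b.contains(2, 0xA))  # True False
```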

Handling Overflows
"When a bucket overflows … 'steal' space from a neighboring bucket."
Example (figure): buckets 0, 1, and 2 each start with capacity 5, holding 4, 4, and 2 items. Bucket 0 fills to 5/5 and overflows: its capacity grows to 6 while bucket 1's shrinks to 4. On the next overflow, bucket 1 is already full (4/4), so the extra slot comes from bucket 2, leaving bucket 0 at 7/7, bucket 1 at 4/4, and bucket 2 at 2/4.
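A simplified sketch of the capacity-stealing idea (an assumption-laden model, not the paper's algorithm): each bucket tracks its current size and assigned capacity, and an overflowing bucket borrows one slot from the first following bucket that still has room. The item shifting a real packed table must perform is ignored here.

```python
def steal_slot(sizes, capacities, bucket):
    """Grow `bucket`'s capacity by one slot taken from a following bucket.

    sizes[i]      -- items currently stored in bucket i
    capacities[i] -- slots currently assigned to bucket i
    """
    n = len(capacities)
    for offset in range(1, n):
        neighbor = (bucket + offset) % n
        if capacities[neighbor] > sizes[neighbor]:  # neighbor has a spare slot
            capacities[neighbor] -= 1
            capacities[bucket] += 1
            return neighbor
    raise RuntimeError("the table is completely full")

# Mirroring the example above: three buckets of capacity 5 holding 5, 4, 2 items.
sizes = [5, 4, 2]
capacities = [5, 5, 5]
steal_slot(sizes, capacities, 0)  # bucket 0 -> capacity 6, bucket 1 -> capacity 4
sizes[0] += 1                     # the overflowing item is stored
steal_slot(sizes, capacities, 0)  # bucket 1 is full, so the slot comes from bucket 2
print(capacities)                 # [7, 4, 4]
```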

Performance Tradeoff

The Tradeoff: Analysis and Empirical

(Figure: time for 1 million operations, in seconds, for several values of alpha, including 1.1 and 1.2.)
– Query is over 10 times faster than in a Bloom filter, regardless of alpha.
– Update speed depends on alpha, and can be similar to that of Bloom filters.

TinyTable and Counting Bloom Filters
(Figure: relative space requirement of TinyTable with alpha = 1.1 compared to state-of-the-art counting Bloom filters; one baseline is the original, plain Bloom filter, which does not support removals.)

TinyTable vs. Table-Based CBF
(Figure: comparison for alpha = 1.1 and alpha = 1.2.)

Approximate Counting
How do we represent counters? Consider a single logical chain: we add a single bit per item to indicate whether the item is a fingerprint (key) or a counter part.
(Figure: a chain containing key A followed by two counter parts, meaning A has a large counter, then key B followed by B's counter.)
Viewing the table as a black box: items can be either counters or keys, and counters are associated with the key to their left.
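An illustrative way to read such a counter (assumptions: counter parts act as base-2^w digits, where w is the fingerprint width, and the key's own presence counts as one occurrence; the paper's exact encoding may differ):

```python
FP_BITS = 8  # assumed width of a fingerprint / counter part

def estimate(chain, key_fp):
    """chain: list of (is_counter_part, value) entries in chain order."""
    i = 0
    while i < len(chain):
        is_part, value = chain[i]
        if not is_part and value == key_fp:
            count, shift, i = 1, 0, i + 1      # the key itself is one occurrence
            while i < len(chain) and chain[i][0]:
                count += chain[i][1] << shift  # counter parts extend the count
                shift += FP_BITS
                i += 1
            return count
        i += 1
    return 0  # key not found in this chain

# Key 0xA followed by two counter parts, then key 0xB with one counter part.
chain = [(False, 0xA), (True, 5), (True, 1), (False, 0xB), (True, 3)]
print(estimate(chain, 0xA))  # 1 + 5 + (1 << 8) = 262
print(estimate(chain, 0xB))  # 1 + 3 = 4
```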

Summary: TinyTable
– Query is always very fast: it is based on efficient bitwise operations and is very memory-local.
– Update time depends on memory density: a denser table means slower updates.
– Full support of additions, removals, and counting.
– Many attractive configurations! Can be made smaller than Bloom filters with reasonable (better) performance.

TinyTable (Alpha = 1.2) vs. Approximate Counting

TinyLFU: Admission Policy (PDP 2014)
It is not always beneficial to add a new item at the expense of the cache victim.
(Figure: frequency vs. rank. A small number of very popular items accounts for a large share of the weight, for example ~50%, followed by a long heavy tail.)

TinyLFU: Admission Policy (PDP 2014)
(Figure: eviction and admission policies. The eviction policy picks a cache victim; the admission policy then decides between the victim and the new item: "one of you guys should leave…")
Is the new item any better than the victim? What is the common answer?

TinyLFU (PDP 2014)
Use a sample of recent events to manage the cache. (Figure: the TinyLFU architecture, with TinyTable as an alternative implementation of the approximate counting component.)
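A minimal sketch of the admission decision itself; a plain dictionary stands in for the compact frequency sketch over recent events (a counting Bloom filter or TinyTable in the real design), and the eviction policy is a trivial stand-in.

```python
from collections import defaultdict

class FrequencySample:
    """Stand-in for an approximate frequency sketch over recent events."""

    def __init__(self):
        self.counts = defaultdict(int)

    def increment(self, key):
        self.counts[key] += 1

    def estimate(self, key):
        return self.counts[key]

def admit(freq, candidate, victim):
    """TinyLFU rule: admit the new item only if it is more frequent than the victim."""
    return freq.estimate(candidate) > freq.estimate(victim)

def on_access(cache, freq, key, capacity):
    freq.increment(key)                 # every access updates the recent-events sample
    if key in cache:
        return "hit"
    if len(cache) < capacity:
        cache.add(key)
        return "miss (inserted)"
    victim = min(cache)                 # arbitrary stand-in for the eviction policy
    if admit(freq, key, victim):
        cache.discard(victim)
        cache.add(key)
        return "miss (admitted, evicted %s)" % victim
    return "miss (rejected)"            # the victim stays; the new item is not cached

cache, freq = set(), FrequencySample()
for k in ["a", "a", "b", "a", "c", "a", "b", "d"]:
    print(k, "->", on_access(cache, freq, k, capacity=2))
```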

TinyLFU: Admission Policy Results
1. Low metadata overhead: less than 8 bytes per cache line.
2. Higher cache hit rate.
3. Faster cache operation (query is faster than update).

TinyTable is soon to be released as an open source project. TinyLFU was released as part of the Shades open source project. I believe there are many other applications! Thank you for your time! (And use my hash table – it is awesome.)