Bloom Filters

Presentation transcript:

Bloom Filters

Differential Files
Simple large database.
  - Collection/file of records residing on disk.
  - Single key.
  - Index to records.
Operations.
  - Retrieve.
  - Update.
      - Insert a new record.
      - Make changes to an existing record.
      - Delete a record.

Naïve Mode Of Operation
[Diagram: Key, Index, File, Ans.]
Problems.
  - Index and File change with time.
  - Sooner or later, system will crash.
  - Recovery =>
      - Copy Master File (MF) from backup.
      - Copy Master Index (MI) from backup.
      - Process all transactions since last backup.
  - Recovery time depends on MF & MI size + # of transactions since last backup.

Differential File
  - Make no changes to the master file.
  - Alter the index and write each updated record to a new file called the differential file (DF).

Differential File Operation
[Diagram: Key, Index, File, DF, Ans.]
Advantage.
  - DF is smaller than File and so may be backed up more frequently.
  - Index needs to be backed up whenever DF is. So, index should be small as well.
  - Recovery time is reduced.

Differential File Operation
[Diagram: Key, Index, File, DF, Ans.]
Disadvantage.
  - Eventually DF becomes large and can no longer be backed up with desired frequency.
  - Must integrate File and DF now.
  - Following integration, DF is empty.
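
A minimal in-memory sketch of the differential-file scheme, including the integration step. It assumes Python dicts stand in for the on-disk master file, differential file, and index; the names `DifferentialStore`, `upsert`, `delete`, `retrieve`, and `integrate` are illustrative and do not come from the slides.

```python
class DifferentialStore:
    """Toy model: the master file is never changed between integrations."""

    def __init__(self, master):
        self.master = dict(master)                    # master file (MF)
        self.df = {}                                  # differential file (DF)
        self.index = {k: "MF" for k in self.master}   # key -> file holding the live record

    def upsert(self, key, record):
        # Inserts and changes go to the DF; only the index is altered.
        self.df[key] = record
        self.index[key] = "DF"

    def delete(self, key):
        # Assumption: a delete simply drops the key from the index.
        self.df.pop(key, None)
        self.index.pop(key, None)

    def retrieve(self, key):
        where = self.index.get(key)
        if where is None:
            return None
        return self.df[key] if where == "DF" else self.master[key]

    def integrate(self):
        # Merge MF and DF into a new master file; the DF is empty afterwards.
        self.master = {k: self.retrieve(k) for k in self.index}
        self.df.clear()
        self.index = {k: "MF" for k in self.master}


store = DifferentialStore({1: "a", 2: "b"})
store.upsert(2, "B")      # change an existing record
store.upsert(3, "c")      # insert a new record
store.delete(1)           # delete a record
```

Until `integrate()` is called, `store.master` still holds the original two records, which is what keeps the frequently backed-up part (the DF) small.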

Differential File Operation
[Diagram: Key, Index, File, DF, Ans.]
Large Index.
  - Index cannot be backed up as frequently as desired.
  - Time to recover current state of index & DF is excessive.
  - Use a differential index (DI).
      - Make no changes to Index.
      - DI is an index to all deleted records and records in DF.

Differential File & Index Operation
[Diagram: Key, DI, DF, Index, File, Ans.; branches labeled Y and N.]
Performance hit.
  - Most queries search both DI and Index.
  - Increase in # of disk accesses/query.
Use a filter to decide whether or not DI should be searched.

Ideal Filter
[Diagram: Key, Filter, DI, DF, Index, File, Ans.; branches labeled Y and N.]
  - Y => this key is in the DI.
  - N => this key is not in the DI.
  - An ideal filter has the same functionality as the DI itself.
  - So, a filter that eliminates the performance hit of the DI doesn't exist.

Bloom Filter (BF)
[Diagram: Key, BF, DI, DF, Index, File, Ans.; branches labeled M, N, and Y.]
  - N => this key is not in the DI.
  - M (maybe) => this key may be in the DI.
Filter error.
  - BF says Maybe.
  - DI says No.

Bloom Filter (BF)
[Diagram: Key, BF, DI, DF, Index, File, Ans.; branches labeled M, N, and Y.]
Filter error.
  - BF says Maybe.
  - DI says No.
BF resides in memory. Performance hit paid only when there is a filter error.
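
The query path with an in-memory filter in front of the DI can be sketched as follows. This is a toy model under stated assumptions: dicts stand in for the on-disk DI and index, `maybe_in_di` is a deliberately coarse stand-in for a Bloom filter, and the `stats` counters are hypothetical bookkeeping added to make filter errors visible.

```python
def lookup(key, maybe_in_di, di, index, stats):
    """Locate a record, searching the DI only when the filter says 'maybe'."""
    if maybe_in_di(key):                 # in-memory check, no disk access
        stats["di_searches"] += 1        # the DI itself is on disk
        if key in di:
            return di[key]               # "DF" or "DELETED"
        stats["filter_errors"] += 1      # filter said maybe, DI said no
    return index.get(key)                # unmodified record: use the master index


# Toy data: keys 15 and 17 were updated; every other key is unmodified.
di = {15: "DF", 17: "DELETED"}
index = {k: "MF" for k in range(30)}
maybe_in_di = lambda k: k % 11 in {15 % 11, 17 % 11}   # never says "no" for a DI key
stats = {"di_searches": 0, "filter_errors": 0}

results = [lookup(k, maybe_in_di, di, index, stats) for k in (15, 6, 7)]
```

Key 15 is found through the DI, key 6 causes one wasted DI search (a filter error), and key 7 skips the DI entirely.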

Longest Matching Prefix
  - Suppose the router prefixes have W different lengths.
  - Create W Bloom filters, one for each length.
      - The ith Bloom filter is for prefixes of length i.
  - Keep W hash tables.
      - The ith hash table has the length-i prefixes together with next-hop information.
  - Query the Bloom filters to get the list of hash tables that may have a matching prefix.
  - Query those hash tables in decreasing order of length (or in parallel) to find the longest matching prefix.
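
A small sketch of this lookup scheme, assuming prefixes and addresses are written as bit strings. The `Bloom` class, its default sizes, and its SHA-256-based hashing are illustrative choices, not part of the slides; a router would use hardware-friendly hash functions instead.

```python
import hashlib


class Bloom:
    def __init__(self, m=256, h=3):
        self.m, self.h = m, h
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive h bit positions by salting one cryptographic hash.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))


def build(routes):
    """One Bloom filter and one hash table per prefix length."""
    filters, tables = {}, {}
    for prefix, next_hop in routes.items():
        n = len(prefix)
        filters.setdefault(n, Bloom()).add(prefix)
        tables.setdefault(n, {})[prefix] = next_hop
    return filters, tables


def longest_match(addr, filters, tables):
    # Step 1: ask the (on-chip) filters which lengths may match.
    candidates = [n for n, f in filters.items()
                  if n <= len(addr) and f.might_contain(addr[:n])]
    # Step 2: probe the (off-chip) hash tables, longest length first.
    for n in sorted(candidates, reverse=True):
        next_hop = tables[n].get(addr[:n])   # a miss here is a filter error
        if next_hop is not None:
            return addr[:n], next_hop
    return None


filters, tables = build({"10": "A", "1011": "B", "0": "C"})
```

Because every "maybe" is confirmed against a hash table, a filter error costs an extra table probe but never changes the answer.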

Longest Matching Prefix
[Diagram: Bloom filters B1, B2, B3, …, BW on chip; hash tables H1, H2, H3, …, HW off chip.]

Bloom Filter Design
  - Use m bits of memory for the BF. Larger m => fewer filter errors.
  - When DI is empty, all m bits = 0.
  - Use h > 0 hash functions: $f_1(), f_2(), \ldots, f_h()$.
  - When key k is inserted into DI, set bits $f_1(k), f_2(k), \ldots, f_h(k)$ to 1.
  - $f_1(k), f_2(k), \ldots, f_h(k)$ is the signature of key k.
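
The design above maps fairly directly onto a small class. This is a sketch only: the class name, method names, and the two hash functions used in the usage lines are assumptions chosen for illustration.

```python
class BloomFilter:
    def __init__(self, m, hash_functions):
        self.m = m                                  # filter size in bits
        self.hash_functions = list(hash_functions)  # f_1, ..., f_h
        self.bits = [0] * m                         # all zero while the DI is empty

    def signature(self, k):
        # Bit positions f_1(k), ..., f_h(k), reduced mod m for safety.
        return [f(k) % self.m for f in self.hash_functions]

    def insert(self, k):
        for pos in self.signature(k):
            self.bits[pos] = 1

    def maybe_contains(self, k):
        # False => k is definitely not in the DI; True => k may be in the DI.
        return all(self.bits[pos] for pos in self.signature(k))

    def reset(self):
        # Called after the DF and DI are integrated into the master copies.
        self.bits = [0] * self.m


# Hypothetical hash functions, for illustration only.
bf = BloomFilter(16, [lambda k: k, lambda k: 3 * k + 1])
bf.insert(5)               # sets bits 5 and 0
```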

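Example
  - m = 11 (normally, m would be much, much larger).
  - h = 2 (2 hash functions).
  - $f_1(k) = k \bmod m$.
  - $f_2(k) = (2k) \bmod m$.
  - Insert k = 15, then k = 17.
[Diagram: the 11-bit filter after inserting k = 15 (bits 4 and 8) and k = 17 (bits 6 and 1).]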

Example
DI has k = 15 and k = 17.
Search for k.
  - Bit $f_1(k)$ = 0 or bit $f_2(k)$ = 0 => k not in DI.
  - Bit $f_1(k)$ = 1 and bit $f_2(k)$ = 1 => k may be in DI.
  - k = 6 => filter error: $f_1(6) = 6$ and $f_2(6) = 1$, and both bits were set by k = 17.
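
The example can be replayed in a few lines. The hash functions and parameters are the ones from the slide; the helper name `maybe_in_di` is an assumption.

```python
m = 11
f1 = lambda k: k % m
f2 = lambda k: (2 * k) % m

bits = [0] * m
for k in (15, 17):            # keys currently in the DI
    bits[f1(k)] = 1
    bits[f2(k)] = 1

def maybe_in_di(k):
    # Both signature bits must be 1 for a "maybe".
    return bits[f1(k)] == 1 and bits[f2(k)] == 1

print(bits)                   # [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
print(maybe_in_di(15))        # True  (really in the DI)
print(maybe_in_di(5))         # False (bit 5 is 0, so definitely not in the DI)
print(maybe_in_di(6))         # True  (filter error: 6 is not in the DI)
```

Key 6 collides with key 17 on both hash functions because 6 and 17 are congruent mod 11, which is why relatively independent hash functions matter.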

Bloom Filter Design
  - Choose m (filter size in bits).
      - Use as much memory as is available.
  - Pick h (number of hash functions).
      - h too small => probability of different keys having same signature is high.
      - h too large => filter becomes filled with ones too soon.
  - Select the h hash functions.
      - Hash functions should be relatively independent.

Optimal Choice Of h
Probability of a filter error depends on:
  - Filter size … m.
  - # of hash functions … h.
  - # of updates before filter is reset to 0 … u.
      - Insert
      - Delete
      - Change
Assume that m and u are constant.
# of master file records = n >> u.

Probability Of Filter Error
$p(u)$ = probability of a filter error after u updates = $A \cdot B$.
  - A = p(request for an unmodified record after u updates).
  - B = p(filter bits are all 1 for this request for an unmodified record).

A = p(request for unmodified record)
  - p(update j is for record i) = $1/n$.
  - p(record i not modified by update j) = $1 - 1/n$.
  - p(record i not modified by any of the u updates) = $(1 - 1/n)^u = A$.

B = p(filter bits are all 1 for this request)
Consider an update with key K and a bit position i.
  - $p(f_j(K) \ne i) = 1 - 1/m$.
  - $p(f_j(K) \ne i \text{ for all } j) = (1 - 1/m)^h$.
  - p(bit i = 0 after one update) = $(1 - 1/m)^h$.
  - p(bit i = 0 after u updates) = $(1 - 1/m)^{uh}$.
  - p(bit i = 1 after u updates) = $1 - (1 - 1/m)^{uh}$.
  - p(signature of K is 1 after u updates) = $[1 - (1 - 1/m)^{uh}]^h = B$.
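
A quick numerical check of the per-bit formula. The sketch assumes each hash function picks a bit position uniformly and independently at random, the same idealization the derivation uses; the function names and parameter values are illustrative.

```python
import random


def bit_set_probability(m, u, h):
    """p(bit i = 1 after u updates) = 1 - (1 - 1/m)^(uh)."""
    return 1 - (1 - 1 / m) ** (u * h)


def all_signature_bits_set(m, u, h):
    """B = [1 - (1 - 1/m)^(uh)]^h."""
    return bit_set_probability(m, u, h) ** h


def simulate_fraction_set(m, u, h, seed=0):
    """Fraction of filter bits equal to 1 after u random updates."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(u):
        for _ in range(h):               # h idealized hash functions
            bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, u, h = 1000, 200, 3
predicted = bit_set_probability(m, u, h)     # roughly 0.45
observed = simulate_fraction_set(m, u, h)
```

With these parameters the simulated fraction of set bits should land close to the predicted value; real hash functions that are not independent may do worse.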

Probability Of Filter Error
  - $p(u) = A \cdot B = (1 - 1/n)^u \, [1 - (1 - 1/m)^{uh}]^h$.
  - $(1 - 1/x)^q \approx e^{-q/x}$ when x is large.
  - $p(u) \approx e^{-u/n} (1 - e^{-uh/m})^h$.
  - $dp(u)/dh = 0$ => $h = (\ln 2)\,m/u \approx 0.693\,m/u$.
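
One way to carry out the minimization, written out as a worked derivation. The substitution $x = e^{-uh/m}$ is a convenience introduced here and is not taken from the slides; m, u, and n are treated as constants.

```latex
\begin{align*}
\ln p(u) &\approx -\frac{u}{n} + h \ln\!\left(1 - e^{-uh/m}\right) \\
\text{Let } x &= e^{-uh/m}, \text{ so that } h = -\frac{m}{u}\ln x: \\
\ln p(u) &\approx -\frac{u}{n} - \frac{m}{u}\,\ln x \,\ln(1 - x) \\
\frac{d}{dx}\bigl[\ln x \,\ln(1-x)\bigr]
  &= \frac{\ln(1-x)}{x} - \frac{\ln x}{1-x} = 0
  \quad\text{at } x = \tfrac{1}{2} \\
e^{-uh/m} = \tfrac{1}{2}
  &\;\Longrightarrow\; h = \frac{m}{u}\ln 2 \approx 0.693\,\frac{m}{u} \\
p(u)\big|_{h_{\mathrm{opt}}}
  &\approx e^{-u/n}\, 2^{-h_{\mathrm{opt}}}
  = e^{-u/n}\,(0.6185)^{m/u}
\end{align*}
```

At the optimum each filter bit is equally likely to be 0 or 1, which matches the intuition that too few hash functions waste bits and too many fill the filter with ones.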

Optimal h
$h \approx 0.693\,m/u$.
  - $m = 10^6$, $u = 10^6/2$ => $h \approx 1.39$. Use h = 1 or h = 2.
  - $m = 2 \times 10^6$, $u = 10^6/2$ => $h \approx 2.77$. Use h = 2 or h = 3.
[Plot: p(u) versus h, with h_opt marked.]
$p(u) \approx e^{-u/n} (1 - e^{-uh/m})^h$
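
Since h must be an integer, the two neighbours of the real-valued optimum can be compared directly. In this sketch the factor $e^{-u/n}$ is left out because it does not depend on h; the function names are assumptions.

```python
import math


def optimal_h(m, u):
    """Real-valued optimum h = (ln 2) * m / u."""
    return math.log(2) * m / u


def error_factor(m, u, h):
    """The h-dependent part of p(u): (1 - e^(-uh/m))^h."""
    return (1 - math.exp(-u * h / m)) ** h


def best_integer_h(m, u):
    h = optimal_h(m, u)
    candidates = {max(1, math.floor(h)), max(1, math.ceil(h))}
    return min(candidates, key=lambda c: error_factor(m, u, c))


for m, u in ((10**6, 10**6 // 2), (2 * 10**6, 10**6 // 2)):
    print(m, u, round(optimal_h(m, u), 2), best_integer_h(m, u))
```

For the first case h = 1 narrowly beats h = 2 (about 0.393 versus 0.400), and for the second h = 3 beats h = 2 (about 0.147 versus 0.155), so the error is fairly flat near the optimum.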