CS 345: Topics in Data Warehousing Thursday, October 28, 2004.

CS 345: Topics in Data Warehousing Thursday, October 28, 2004

Review of Tuesday’s Class Bitmap compression with BBC codes –Gaps and Tails –Variable byte-length encoding of lengths –Special handling of lone bits Speeding up star joins –Cartesian product of dimensions –Semi-join reduction –Early aggregation

Outline of Today’s Class Join Indexes Projection Indexes –Horizontal vs. Vertical decomposition Bit-Sliced Indexes –Fast bitmap counts and sums –Range queries Bit Vector Filtering * O’Neil and Quass, “Improved Query Performance with Variant Indexes”, 1997

Join Indexes Generally, indexes are built on the tuples of a single relation Join indexes include attributes from more than one relation –Or else index entries consist of attributes from one relation and RIDs from another Example: –Standard index on Customer.State Index entry consists of pair –Join index on Customer.State & Sales fact Index entry consists of pair

Defining Join Indexes Join indexes are like indexes on a materialized view –Except that the view isn’t actually materialized… Oracle syntax: –CREATE BITMAP INDEX ON Sales(Customer.State) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key Creates an index with entries Join condition used in joining the tables is specified at index creation time

Why Join Indexes? Consider a star query plan using semi-join reductions –For each dimension: Determine keys of dimension rows that satisfy filter conditions Join list of dimension keys to fact table (using index) Result = list of fact RIDs –Merge lists of fact RIDs from all dimensions –Retrieve matching fact rows –Join back to grouping dimensions –Perform grouping and aggregation Join indexes allow the join step to be skipped –Join is precomputed when join index created –Fact RID list stored directly in join index Bitmap join indexes are particularly common –Other types are possible (B-Tree, etc.) –Bitmap join indexes facilitate RID list merging / index intersection

Limitations of Join Indexes Index creation cost is high –Need to compute join result to create index Index maintenance cost is high –Usually not an issue in data warehouses –Data warehouse is “read-mostly” –Drop index before warehouse load begins –Re-create after load completes Can’t access non-indexed columns of dimension table –SELECT Customer.Age, SUM(Quantity) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key AND Customer.State = 'CA' –Using ordinary index on State: Get dimension table RIDs using index Look up dimension rows to learn Age for each Join to fact table to learn Quantity –Using join index on State: Get fact RIDs using index Look up fact rows to learn Quantity Join back to dimension table to learn Age

Horizontal vs. Vertical Decomposition Relations have rows and columns –“Two-dimensional” structure Disk access model is one-dimensional stream –Disk spins fast, disk arm moves slowly –Sequential I/O = leave the disk arm in one place –Essentially, disk provides one-dimensional locality Need to store the bytes in some particular order –Embedding 2D structure in 1D space Two options: –Horizontal decomposition (“row-major” order) –Vertical decomposition (“column-major” order)

Horizontal vs. Vertical Horizontal decomposition –All attributes of a record are stored together –Values for the same attribute but different records are separated –Standard DBMS storage model Vertical decomposition –All values of an attribute are stored together –Values for the same record but different attributes are separated Col1Col2Col3 147 258 369 Col1Col2Col3 123 456 789 Horizontal Vertical

Horizontal vs. Vertical Horizontal decomposition –Good locality when: Multiple attributes from same record are co-accessed Not too many records are co-accessed –For example: Inserts and deletes Entire record is in one place Perform insert or delete with a single I/O Vertical decomposition –Good locality when: Multiple values of the same attribute are co-accessed Not too many attributes are co-accessed –Frequently the case for OLAP queries

Vertical Decomposition Example SELECT customer_key, income FROM Customer WHERE state = 'CA' AND age > 50 –Subplan for OLAP query that filters on state, age and groups on income –Suppose Customer has 100 attributes Horizontally decomposed data: –Scan table, one record at a time –Ignore 96 attributes of each record –Use 4 attributes (state, age, income, customer_key) –Lots of wasted I/O bandwidth Vertically decomposed data: –Parallel scans of customer_key, income, state, and age columns –If ith state = ‘CA’ and ith age > 50, then output ith customer_key and ith income –No redundant attributes are read from disk –I/O is kept to a minimum

Projection Index The idea behind projection indexes –Databases usually store data in horizontal format –Vertical format is more efficient for many analysis queries –Why not do both? Projection index –Logically: Index entries are pairs –Stored in same order as records in relation i.e. sorted by RID instead of sorted by Value –In practice: Storing RID is unnecessary Array storage format Array index determined from RID Incremental approach to vertical decomposition –Vertical decomposition = complete set of projection indexes –A few projection indexes = partial vertical decomposition

Using Projection Indexes Consider a star query via semi-join decomposition –For each dimension: Determine keys of dimension rows that satisfy filter conditions Join list of dimension keys to fact table (using index) Result = list of fact RIDs –Merge lists of fact RIDs from all dimensions –Retrieve matching fact rows –Join back to grouping dimensions –Perform grouping and aggregation

Using Projection Indexes “Retrieve matching fact rows” step can be expensive –All we care about are values of aggregated columns –Other attributes are irrelevant (e.g. dimension foreign keys) –Poor packing of relevant information into disk pages Relevant and irrelevant data is clumped together Using a projection index is often faster –List of fact RIDs tells us which index entries to read –Many index entries packed into same disk page No “clutter” from irrelevant fields –Fewer I/Os required

Bit-Sliced Indexes Bit-sliced indexes –Generally used for measurement columns –Allow for: Efficient aggregation Efficient range filtering (particularly for large ranges) –Most suitable for bitmap-based plans –Requires positive integer-valued column Fixed-precision decimals OK Example: Interpret $5.67 as 567 cents Bit-sliced index on attribute A –Treat A as multiple logical binary-valued columns Column A1 = Least significant bit of A Column A2 = 2 nd least significant bit of A Etc. Number of logical columns determined by max value –Store each column as a separate bitmap

Bit-Sliced Index Example Amount 5 13 2 6 7 Binary 0101 1101 0010 0110 0111 B4: 01000 B3: 11011 B2: 00111 B1: 11001 Bit-Sliced Index

Fast Bitmap Aggregation Take advantage of word-level parallelism –Implicit parallelism arising from SIMD operations in modern computer architectures –SIMD: Single Instruction, Multiple Data –Processor can compute bitwise operations on all bits in a word at the same time –Index intersection via bitmap merge takes advantage of this fact –It’s one reason for byte-aligned compression

Fast Bitmap Count Count the number of 1’s in a bitmap –Treat the bitmap as a byte array –Pre-compute lookup table with number of 1’s in each byte –Cycle through bitmap one byte at a time, accumulating count using lookup table Pseudocode –count = 0; for (int i = 0; i < n/8; i++) count += numSetBits[bitmap[i]]; –numSetBits[0] = 0, numSetBits[7] = 3, etc. Treating bitmap as short int array → even faster –Lookup table has 65536 entries instead of 256 –Bitmap of n bits → only add n/16 numbers

SUM using Bit-Sliced Index Suppose B f represents the foundset –foundset = List of fact RIDs that pass all filters For each bit slice B i : –Compute B f AND B i –Do fast bitmap count of resulting bitmap –Multiply count by 2 i Total = sum of weighted counts for all slices

SUM Example Amount 5 13 2 6 7 B4: 01000 B3: 11011 B2: 00111 B1: 11001 Bit-Sliced Index Count of B4: 1 Count of B3: 4 Count of B2: 3 Count of B1: 3 1*2 4 + 4*2 3 + 3*2 2 + 3*2 1 = 8 + 16 + 6 + 3 = 33 5 + 13 + 2 + 6 + 7 = 33

Range Filtering with Bit-Sliced Indexes Bit-sliced indexes allow range filtering Cost of applying range predicate independent of size of range –Not true for bitmap indexes, B-Trees We’ll give the algorithm for “A < c” –A is the attribute that is indexed –c is some constant –Other operations (>, =, etc.) are similar

Pseudocode for “A < c” Set B LT = all zeros. Set B EQ = all ones. For each bit slice B i, from most to least significant: –If bit i of constant c is 1: B LT = B LT OR (B EQ AND NOT(B i )) B EQ = B EQ AND B i –If bit i of constant c is 0: B EQ = B EQ AND NOT(B i ) Return B LT. Why does it work? Invariant: B EQ = 1 for all rows that match c on the most significant bits (and only those rows) A value x is less than c iff for some bit i: –x and c agree on all bits more significant than i –The ith bit of x is 0, and the ith bit of c is 1

“Amount < 7” Example Amount 5 13 2 6 7 B4: 01000 B3: 11011 B2: 00111 B1: 11001 Bit-Sliced Index 7 = 0111 B LT B EQ 0000011111 00000 00100 10100 10110 10111 10011 00011 00001

Bit-Sliced vs. Projection Index Both benefit from vertical decomposition –Values for 1 column are packed onto as few pages as possible Both allow for fast aggregation Both allow range filtering Bit-sliced indexes can be faster when: –Attribute values don’t use full data type precision –Query is CPU-bound Fewer machine instructions due to SIMD parallelism Bit-sliced indexes are more complicated

Bit Vector Filtering Sometimes called “Bloom Join” –From “Bloom Filters” [Bloom 1970] Bloom filter = cheap, approximate semi-join –Store bitmap with 1 bit per hash bucket –Initialize bitmap to all zeros –For each record in relation A: Compute hash value of join key Set the bit for the appropriate hash bucket to 1 In distributed database, send bitmap to Server 2. For each record in relation B: –Compute hash value of join key –Check whether bit for the appropriate bucket is set Yes → Record might join to something in A No → Record definitely doesn’t join to anything in A Send qualifying B records to Server 1 & compute join

Bit Vector Filtering Also useful in non-distributed database Applies to multi-pass hash or merge join Hash join –Partition relation A –Partition relation B –Join each A partition with matching B partition Merge join –Generate sorted runs for relation A –Generate sorted runs for relation B –Merge A’s runs, Merge B’s runs, Merge A & B Bit vector filtering –While pre-processing A, generate Bloom filter –While pre-processing B, discard records that don’t match filter –Significantly reduce size of B at low cost –Since A is already being scanned, generating the Bloom filter is “free” –Bloom filter can be made as small or large as memory permits Fewer buckets → more collisions → less effective filtering Fewer buckets → less memory to store bitmap

Next week: Physical DB Design This concludes the query processing topic Next week, we’ll begin physical database design –Selection of indexes and materialized views –Partitioning and data layout –RAID and hardware considerations –Physical design trade-offs –Database tuning Note on course project: –Start thinking about topics –We’ll discuss the project in more detail next class.

CS 345: Topics in Data Warehousing Thursday, October 28, 2004.

Similar presentations

Presentation on theme: "CS 345: Topics in Data Warehousing Thursday, October 28, 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 345: Topics in Data Warehousing Thursday, October 28, 2004.

Similar presentations

Presentation on theme: "CS 345: Topics in Data Warehousing Thursday, October 28, 2004."— Presentation transcript:

Similar presentations

About project

Feedback