# Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.

## Presentation on theme: "Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data."— Presentation transcript:

Multidimensional Data

Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data about sales. - A sale is described by (store, day, item, color, size, custName, etc.). Sale = point in 6dim space. - A customer is described by (custName, age, salary, pcode, marital­status, etc.). Typical Queries Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" Nearest neighbor : "If I am at coordinates (a,b), what is the nearest McDonalds." They are expressible in SQL. Do you see how?

SQL Range queries: “How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?” SELECT * FROM Customers WHERE age>=45 AND age<=55 AND sal<100; Nearest neighbor : “If I am at coordinates (a,b), what is the nearest McDonalds.” Suppose we have a relation Points(x,y,name) SELECT * FROM Points p WHERE p.name=‘McDonalds’ AND NOT EXISTS ( SELECT * FROM POINTS q WHERE (q.x-a)*(q.x-a)+(q.y-b)*(q.y-b) < (p.x-a)*(p.x-a)+(p.y-b)*(p.y-b) AND q.name=‘McDonalds’ );

Big Impediment For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE­ clause. An Approach for range queries Index on attributes independently. - Intersect pointers in main memory to save disk I/O.

Attempt at using B-trees for MD-queries (example 14.26) Database = 1,000,000 points evenly distributed in a 1000×1000 square. Stored in 10,000 blocks (100 recs per block) B-tree secondary indexes on x and on y Range query {(x,y) : 450  x  550, 450  y  550} 100,000 pointers (i.e. 1,000,000/10) for the x range, and same for y 10,000 pointers for answer (found by pointer intersection) Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O’s. Add here the cost of B-Trees: Root of each B-tree in main memory Suppose leaves have avg. 200 keys  500 disk I/O in each B-tree to get pointer lists  1000 + 2(for intermediate B-tree level) disk I/O’s Total 11,004 disk I/O’s, more than sequential scan of file = 10,000 I/O’s.

Bitmap Indexes

Suppose we have n tuples. A bitmap index for a field F is a collection of bit vectors of length n, one vector for each possible value that may appear in the field F. The vector for value v has 1 in position i if the i-th record has v in field F, and it has 0 there if not. (30, foo) (30, bar) (40, baz) (50, foo) (40, bar) (30, baz) foo 100100 bar 010010 baz 001001

Graphical Picture MF 10 10 01 10 CustidNameGenderRating 112JoeM3 115SamM5 119SueF5 112WuM4 12345 00100 00001 00001 00010 Two bit strings for the Gender bitmap Customer table. We will index Gender and Rating. Note that this is just a partial list of all the records in the table Five bit strings for the Rating bitmap

Bitmap operations Bit maps are designed to support partial match and range queries. How? To identify the records holding a subset of the values from a given dimension, we can do a binary OR on the bitmaps from that dimension. - For example, the OR of bit strings for Age = (20, 21, 22) To identify the partial matches on a group of dimensions, we can simply perform a binary AND on the OR-ed maps from each dimension. These operations can be done very quickly since binary operations are natively supported by the CPU.

Bit Map example MF 10 10 01 10 12345 00100 00001 00001 00010 SELECT * FROM Customer WHERE gender = M AND (rating = 3 OR rating = 5) 1 1 1 0 1 0 1 1 AND= 1 0 1 0 First two records in our fact table are retrieved 1 1 1 0 1 0 0 0 0 0 1 1 OR= 1 0 1 1

Gold-Jewelry Data (25; 60) (45; 60) (50; 75) (50; 100) (50; 120) (70; 110) (85; 140) (30; 260) (25; 400) (45; 350) (50; 275) (60; 260) What would be the bitmap index for age, and the bitmap index for salary? Suppose we want to find the jewelry buyers with an age in the range 45-55 and a salary in the range 100-200. What do we do?

Compressed Bitmaps Run length encoding is used to encode sequences or runs of zeros. Say that we have 20 zeros, then a 1, then 30 more zeros, then another 1. Naively, we could encode this as the integer pair - This would work. But what's the problem? - On a typical 32-bit machine, an integer uses 32 bits of storage. So our pair uses 64 bits. The original string only had 52!

Run Length Encoding So we must use a technique that stores our run-lengths as compactly as possible. Let’s say we have the string 000101 - This is made up of runs with 3 zeros and 1 zero. - In binary, 3 = 11, while 1 is, of course, just 1 - This gives us a compressed representation of 111. The problem? - How do we decompress this? - We could interpret this as 1-11 or 11-1 or even 1-1-1. - This would give us three different strings after the decompression.

Proper Run Length Encoding Run of length i has i 0’s followed by a 1. Let’s say that that we have a run of length i. Let j be the number of bits required to represent i. To define a run, we will use two values: - The “unary” representation of j A sequence of j – 1 “1” bits followed by a zero (the zero signifies the end of the unary string) - The value i (using j bits) - The special cases of i = 0 and i = 1 use 00 and 01 respectively. Here j=1 so there are no 1 bits (since j-1 =0)  Here we have two “0” runs of length 13 and 6  13 can be represented by 4 bits, 6 requires 3 bits  Run 1: 3 “1” bits + 0 + i → 111 0 1101  Run 2: 2 “1” bits + 0 + i → 11 0 110  Final compressed string: 11101101110110  Compression rate: 14/21 = 0.66 Example: 000000000000010000001

Decoding Let’s decode 1110110100110110 11101101001011  13 11101101001011  0 11101101001011  3 Our sequence of run lengths is: 13, 0, 3. What’s the bitmap? 0000000000000110001

Summary Pros: 1.Bitmaps provide efficient storage for low cardinality dimensions. On sparse, high cardinality dimensions, compression can be effective. 2.Bit operations can support multi-dimensional partial match and range queries Cons: 1.De-compression requires run-time overhead 2.Bit operations on large maps and with large dimension counts can be expensive.

Download ppt "Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data."

Similar presentations