An Improved Indexing Scheme for Range Queries
Yvonne Yao
Adviser: Professor Huiping Guo


1 An Improved Indexing Scheme for Range Queries
Yvonne Yao
Adviser: Professor Huiping Guo

2 Database-as-a-Service
Business organizations handle large amounts of data (terabytes).
Managing and maintaining this data on site is expensive.
Database-as-a-Service (DAS): outsourcing the DBMS.
Clients rely on service providers for data management and maintenance.
Costs are much lower. But…

3 Database-as-a-Service
The security of the data is not guaranteed, because service providers are untrusted.
Store only an encrypted form of the data on the remote server.
Only users with the correct key(s) can access it.
How, then, can we query the encrypted data?
Naive approach: retrieve and decrypt the entire table, then apply SQL statements to it. Too expensive!
A more realistic approach was developed.

4 Database-as-a-Service

5 Bucketization
Various approaches to building metadata: B+-tree based, hash-based, and bucket-based.
What is bucketization?
Partition the attribute's data into several buckets.
Each bucket is identified by an ID.
Bucket IDs are stored, along with the encrypted data, on the remote server.
The client keeps the partition information as metadata.
General bucketization approaches: equi-width and equi-depth.
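The two general approaches named on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names `equi_width` and `equi_depth` are hypothetical, and the sketch assumes equi-width means equal-sized value ranges and equi-depth means roughly equal numbers of data points per bucket.

```python
def equi_width(values, m):
    """Split the value range [min, max] into m intervals of equal width.

    Returns a list of (low, high) boundaries. Assumed semantics; the
    slides only name the approach.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / m
    return [(lo + i * width, lo + (i + 1) * width) for i in range(m)]


def equi_depth(values, m):
    """Split the sorted data into up to m chunks holding roughly the same
    number of points, and report each chunk's (smallest, largest) value."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(m):
        chunk = ordered[i * n // m:(i + 1) * n // m]
        if chunk:  # skip empty chunks when m > n
            buckets.append((chunk[0], chunk[-1]))
    return buckets
```

On skewed data the two differ noticeably: equi-width keeps interval sizes fixed regardless of how many points fall in each, while equi-depth adapts the interval sizes to the data.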

6 Example 1

7
Partition      ID
[0.0 ~ 1.0]    Bucket_1
[1.1 ~ 2.0]    Bucket_2
[2.1 ~ 3.0]    Bucket_3
[3.1 ~ 4.0]    Bucket_4

8 Example 1
User query: SELECT * FROM grades WHERE gpa < 3.0
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3'
The returned superset has size 29, of which 7 are false positives.
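The translation in Example 1 can be sketched in a few lines. The partition table, the table name `egrades`, and the column name `gpaID` come from the slides; the helper names (`buckets_for_less_than`, `server_query`, `drop_false_positives`) are hypothetical, and the sketch assumes the client filters the decrypted rows itself.

```python
# Client-side metadata from the Example 1 partition table: (low, high, bucket ID).
PARTITIONS = [
    (0.0, 1.0, "Bucket_1"),
    (1.1, 2.0, "Bucket_2"),
    (2.1, 3.0, "Bucket_3"),
    (3.1, 4.0, "Bucket_4"),
]


def buckets_for_less_than(bound, partitions):
    """IDs of every bucket that may contain a value < bound."""
    return [bucket_id for low, high, bucket_id in partitions if low < bound]


def server_query(bucket_ids, table="egrades", column="gpaID"):
    """Build the query sent to the server, which only sees bucket IDs."""
    condition = " OR ".join(f"{column} = '{b}'" for b in bucket_ids)
    return f"SELECT * FROM {table} WHERE {condition}"


def drop_false_positives(decrypted_gpas, bound):
    """After decryption, the client discards rows outside the real range."""
    return [g for g in decrypted_gpas if g < bound]


ids = buckets_for_less_than(3.0, PARTITIONS)
print(server_query(ids))
```

Because Bucket_3 covers [2.1 ~ 3.0], rows with a GPA of exactly 3.0 are returned by the server even though they do not satisfy `gpa < 3.0`; these are the false positives the client has to drop.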

9 Query Optimal Bucketization
General idea: minimize the bucket cost of each bucket.
Input:
  V = {v_1, v_2, v_3, …, v_n}, where v_1 < v_2 < v_3 < … < v_n
  F = frequency of each value
  M = number of buckets to fill
Output: a matrix indicating the boundaries of each bucket.

10 Query Optimal Bucketization
QOB finds optimal solutions to two smaller subproblems:
one contains the leftmost M-1 buckets, covering the (n-i) smallest points;
the other contains the rightmost single bucket, covering the remaining i points.
V = {v_1, v_2, v_3, v_4, v_5, v_6, …, v_(n-3), v_(n-2), v_(n-1), v_n}
The first n-i points go to the first M-1 buckets; the last i points go to the last bucket.
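The decomposition above lends itself to dynamic programming. The sketch below is one way to write it, under a loudly stated assumption: the slides do not define the bucket cost, so `bucket_cost` assumes (high - low + 1) × (total frequency in the bucket), which suits integer-valued attributes. The function names are hypothetical, and the sketch returns a list of boundaries rather than the boundary matrix mentioned on the previous slide.

```python
def bucket_cost(values, freqs, i, j):
    """ASSUMED cost of one bucket covering values[i..j] inclusive:
    (range width + 1) * total frequency. Not stated on the slides."""
    return (values[j] - values[i] + 1) * sum(freqs[i:j + 1])


def qob(values, freqs, m, cost=bucket_cost):
    """Partition sorted distinct values into m buckets of minimum total cost.

    best[b][j] is the cheapest way to cover the first j values with b
    buckets: the leftmost b-1 buckets cover the first k values and the
    rightmost single bucket covers values k..j-1.
    """
    n = len(values)
    m = min(m, n)  # cannot fill more buckets than there are values
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    split = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for b in range(1, m + 1):
        for j in range(b, n + 1):
            for k in range(b - 1, j):
                candidate = best[b - 1][k] + cost(values, freqs, k, j - 1)
                if candidate < best[b][j]:
                    best[b][j] = candidate
                    split[b][j] = k
    # Walk the split points backwards to recover the bucket boundaries.
    bounds = []
    j = n
    for b in range(m, 0, -1):
        k = split[b][j]
        bounds.append((values[k], values[j - 1]))
        j = k
    bounds.reverse()
    return best[m][n], bounds
```

For example, with values 1, 2, 10, 11 (each occurring once) and M = 2, splitting between 2 and 10 gives a total assumed cost of 8, whereas either other split costs 31, so the sketch returns the buckets [1 ~ 2] and [10 ~ 11].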

11 Example 2
Partition      ID
[0.7 ~ 1.2]    Bucket_1
[1.5 ~ 2.5]    Bucket_2
[2.8 ~ 3.0]    Bucket_3
[3.5 ~ 4.0]    Bucket_4

12 Example 2
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3'
This is the same server query as with the general bucketization method.
In most cases QOB outperforms the conventional bucketization strategy, but not always.

13 Deviation Bucketization
Built upon QOB; takes the same parameters.
Has two levels of buckets.
First level: the same buckets as those produced by QOB.
Second level: bucketization of deviation values, where a deviation is the difference between a value and the average of its bucket.
Each first-level bucket has at most M second-level buckets.
QOB has at most M buckets, while DB has at most M^2 buckets.

14 Deviation Bucketization
DB(D, M):
  Run QOB(D, M)
  Construct first-level buckets from the boundary matrix
  For each first-level bucket i:
    Initialize empty datasets v_i' and f_i'
    For each value v in the bucket:
      v_i' = v_i' ∪ {v − avg()}    (avg() is the average of the bucket)
      f_i' = f_i' ∪ {1}
    Create a new dataset d_i = (v_i', f_i')
    Run QOB(d_i, M)
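The two-level structure of the procedure above can be sketched as follows. To keep the example short, the partitioning step is passed in as a function, so a QOB implementation can be plugged in; `split_evenly` is a hypothetical stand-in and is not QOB. Two further assumptions: the bucket average is weighted by frequency, and second-level buckets are reported as ranges of the original values (deviations only shift values by a constant, so order is preserved).

```python
def split_evenly(values, freqs, m):
    """Hypothetical stand-in partitioner (NOT QOB): up to m contiguous,
    near-equal index ranges, returned as inclusive (start, end) pairs."""
    n = len(values)
    m = min(m, n)
    cuts = [i * n // m for i in range(m + 1)]
    return [(cuts[i], cuts[i + 1] - 1) for i in range(m)]


def deviation_bucketization(values, freqs, m, partition):
    """Build a two-level index with at most m * m buckets.

    partition(values, freqs, m) must return inclusive index ranges;
    it is used once for the first level and once per first-level bucket.
    """
    index = []
    first_level = partition(values, freqs, m)
    for b, (lo, hi) in enumerate(first_level, start=1):
        bucket_values = values[lo:hi + 1]
        bucket_freqs = freqs[lo:hi + 1]
        # Assumed: frequency-weighted average of the first-level bucket.
        avg = sum(v * f for v, f in zip(bucket_values, bucket_freqs)) / sum(bucket_freqs)
        deviations = [v - avg for v in bucket_values]
        ones = [1] * len(deviations)  # each deviation gets frequency 1
        second_level = partition(deviations, ones, m)
        for s, (dlo, dhi) in enumerate(second_level, start=1):
            index.append((f"Bucket_{b}_{s}", bucket_values[dlo], bucket_values[dhi]))
    return index
```

With eight values and M = 2 the sketch yields four second-level buckets, matching the M^2 upper bound. The quality of the resulting index depends entirely on the partitioner supplied.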

15 Example 3
First-level buckets:
Partition      ID          Avg
[0.7 ~ 1.2]    Bucket_1    0.93
[1.5 ~ 2.5]    Bucket_2    1.84
[2.8 ~ 3.0]    Bucket_3    2.93
[3.5 ~ 4.0]    Bucket_4    3.67

Second-level buckets:
Partition      ID            Avg
…              …             …
[2.8 ~ 2.8]    Bucket_3_1    2.8
[2.9 ~ 2.9]    Bucket_3_2    2.9
[3.0 ~ 3.0]    Bucket_3_3    3.0
…              …             …

16 Example 3
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3_1' OR gpaID = 'Bucket_3_2'
In this case, no false positives are returned.
In general, false positives are still returned, but far fewer of them.
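One plausible way to derive the ID list in this query from the Example 3 tables is sketched below. The selection rule is an assumption inferred from the example: a first-level bucket lying entirely inside the query range contributes its first-level ID, and a bucket straddling the bound contributes only its overlapping second-level IDs. Only Bucket_3's second-level entries appear on the slides, so only those are included; `ids_for_less_than` is a hypothetical name.

```python
# First-level buckets from Example 3: (ID, low, high).
FIRST_LEVEL = [
    ("Bucket_1", 0.7, 1.2),
    ("Bucket_2", 1.5, 2.5),
    ("Bucket_3", 2.8, 3.0),
    ("Bucket_4", 3.5, 4.0),
]

# Second-level buckets; the slides show these only for Bucket_3.
SECOND_LEVEL = {
    "Bucket_3": [
        ("Bucket_3_1", 2.8, 2.8),
        ("Bucket_3_2", 2.9, 2.9),
        ("Bucket_3_3", 3.0, 3.0),
    ],
}


def ids_for_less_than(bound, first_level, second_level):
    """Bucket IDs to request for the predicate value < bound."""
    ids = []
    for bucket_id, low, high in first_level:
        if high < bound:
            ids.append(bucket_id)  # bucket lies entirely inside the range
        elif low < bound:
            subs = second_level.get(bucket_id)
            if subs is None:
                ids.append(bucket_id)  # no finer index known: fall back
            else:
                ids.extend(sid for sid, slow, _ in subs if slow < bound)
    return ids


print(ids_for_less_than(3.0, FIRST_LEVEL, SECOND_LEVEL))
```

For `gpa < 3.0` this selects Bucket_1, Bucket_2, Bucket_3_1, and Bucket_3_2, leaving out Bucket_3_3, which holds exactly the rows that were false positives under the single-level index.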

17 Experiments
Two datasets:
Synthetic dataset: 10^5 integers from [0, 999].
Real dataset: 10^3 data points from the Aspect column of the Forest CoverType database in UCI's KDD Archive.
Two sets of queries: Q_syn and Q_real.

18 Experiment 1

19 Experiment 2

20 Thank You

