Database Operations on GPU Changchang Wu 4/18/2007.


1 Database Operations on GPU Changchang Wu 4/18/2007

2 Outline Database Operations on GPU Point List Generation on GPU Nearest Neighbor Searching on GPU

3 Database Operations on GPU

4 Design Issues Low bandwidth between GPU and CPU: avoid frame buffer readbacks. No arbitrary writes: avoid data rearrangements. Programmable pipeline has poor branching: evaluate branches using fixed-function tests.

5 Design Overview Use depth test functionality of GPUs for performing comparisons Implements all possible comparisons <, <=, >, >=, ==, !=, ALWAYS, NEVER Use stencil test for data validation and storing results of comparison operations Use occlusion query to count number of elements that satisfy some condition

6 Basic Operations Basic SQL query: SELECT A FROM T WHERE C, where A = attributes or aggregations (SUM, COUNT, MAX, etc.), T = relational table, C = Boolean combination of predicates (using operators AND, OR, NOT)

7 Basic Operations Predicates – a_i op constant or a_i op a_j Op is one of <, >, <=, >=, !=, =, TRUE, FALSE Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation Aggregations – COUNT, SUM, MAX, MEDIAN, AVG

8 Predicate Evaluation a_i op constant (d) Copy the attribute values a_i into depth buffer Define the comparison operation using depth test Draw a screen-filling quad at depth d glDepthFunc(…) glStencilOp( fail, zfail, zpass );
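
The steps on this slide can be emulated on the CPU to see what the depth and stencil tests compute. The sketch below is only an illustration in Python, not the GPU code: the names `COMPARE`, `evaluate_predicate` and `count_passing` are hypothetical, a list stands in for the depth buffer, and a list of 0/1 flags stands in for the stencil buffer. In real OpenGL the depth test compares the incoming fragment (the quad at depth d) against the stored value, so the function given to glDepthFunc is the mirror image of op; the sketch hides that detail.

```python
import operator

# Comparison functions corresponding to the depth-test modes listed on the slide.
COMPARE = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
    "ALWAYS": lambda a, d: True, "NEVER": lambda a, d: False,
}

def evaluate_predicate(depth_buffer, op, d):
    """Emulate drawing a quad at depth d: flag 1 where `a_i op d` holds, else 0."""
    test = COMPARE[op]
    return [1 if test(a, d) else 0 for a in depth_buffer]

def count_passing(stencil):
    """Stand-in for an occlusion query: number of records that passed the test."""
    return sum(stencil)

ages = [10, 24, 37, 99]                      # attribute copied into the "depth buffer"
mask = evaluate_predicate(ages, ">=", 37)    # per-record result kept in the "stencil"
print(mask, count_passing(mask))
```

The point of the design is that the comparison result never leaves the GPU: only the count would be read back.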

9 Predicate Evaluation Comparing two attributes: a_i op a_j is treated as (a_i – a_j) op 0 Semi-linear queries Easy to compute with fragment shader
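
A small Python sketch of the rewriting described here, taking a semi-linear query to mean a linear combination of attributes compared against a constant (an assumption, since the slide does not define the term). The function name `semilinear_select` and the sample rows are hypothetical.

```python
import operator

def semilinear_select(rows, coeffs, op, b):
    """Per record, evaluate (sum of coeffs[j] * row[j]) op b."""
    return [op(sum(c * a for c, a in zip(coeffs, row)), b) for row in rows]

rows = [(3, 5), (7, 2), (4, 4)]
# a_0 > a_1 rewritten as (a_0 - a_1) > 0, i.e. coefficients (1, -1) and constant 0.
print(semilinear_select(rows, (1, -1), operator.gt, 0))
```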

10 Boolean Combinations Expression provided as a CNF CNF is of form (A_1 AND A_2 AND … AND A_k) where A_i = (B_i^1 OR B_i^2 OR … OR B_i^{m_i}) CNF does not have NOT operator If CNF has a NOT operator, invert comparison operation to eliminate NOT E.g. NOT (a_i < d) => (a_i >= d) For example, compute a_i within [low, high] Evaluated as (a_i >= low) AND (a_i <= high)

11 Algorithm
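
The algorithm figure did not survive in this transcript. As a rough illustration of how a CNF can be evaluated with one integer per record acting as a stencil value, here is a Python sketch. The stencil encoding (a record holds i once it has satisfied clauses 1 to i) is an assumption chosen for clarity, not necessarily the encoding used in the original slides, and `eval_cnf` is a hypothetical name.

```python
import operator

def eval_cnf(records, clauses):
    """clauses: list of OR-clauses; each clause is a list of (attr_index, op, constant)."""
    stencil = [0] * len(records)
    for i, clause in enumerate(clauses, start=1):
        for attr, op, const in clause:           # one "rendering pass" per predicate
            for r, rec in enumerate(records):
                # Stencil test: only records that passed clauses 1..i-1 and have
                # not yet passed clause i are compared in this pass.
                if stencil[r] == i - 1 and op(rec[attr], const):
                    stencil[r] = i
    return [s == len(clauses) for s in stencil]

# Range query: a_0 within [10, 20] is (a_0 >= 10) AND (a_0 <= 20).
records = [(5,), (15,), (25,)]
in_range = eval_cnf(records, [[(0, operator.ge, 10)], [(0, operator.le, 20)]])
print(in_range)
```

A record that fails every predicate of some clause keeps a stale stencil value and is skipped by all later passes, which mirrors how a stencil test can mask out invalid records.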

12 Range Query Compute a_i within [low, high] Evaluated as (a_i >= low) AND (a_i <= high)

13 Aggregations COUNT, MAX, MIN, SUM, AVG No data rearrangements

14 COUNT Use occlusion queries to get pixel pass count Syntax: Begin occlusion query Perform database operation End occlusion query Get count of number of attributes that passed database operation Involves no additional overhead!

15 MAX, MIN, MEDIAN We compute Kth-largest number Traditional algorithms require data rearrangements We perform no data rearrangements, no frame buffer readbacks

16 K-th Largest Number By comparing and counting, determine every bit in order from MSB to LSB

17 Example: Parallel Max S = {10, 24, 37, 99, 192, 200, 200, 232} Step 1: Draw quad at 128 (10000000): 192, 200, 200, 232 pass Step 2: Draw quad at 192 (11000000): 192, 200, 200, 232 pass Step 3: Draw quad at 224 (11100000): 232 passes Step 4: Draw quad at 240 (11110000): no values pass Step 5: Draw quad at 232 (11101000): 232 passes Steps 6, 7, 8: Draw quads at 236, 234, 233: no values pass, so max is 232
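
The bit-by-bit construction in this example can be written as a short Python sketch. It is a CPU emulation under stated assumptions: values are non-negative integers that fit in `bits` bits, the inner count stands in for an occlusion query after drawing a quad at the candidate depth, and the name `kth_largest` is hypothetical.

```python
def kth_largest(values, k, bits=8):
    """Build the k-th largest value one bit at a time, from MSB to LSB."""
    x = 0
    for i in reversed(range(bits)):
        candidate = x | (1 << i)                 # tentatively set bit i
        # "Draw a quad at depth candidate" and count the values that pass.
        count = sum(1 for v in values if v >= candidate)
        if count >= k:                           # enough values are this large: keep the bit
            x = candidate
    return x

S = [10, 24, 37, 99, 192, 200, 200, 232]
print(kth_largest(S, 1))             # maximum, following the eight steps on the slide
print(kth_largest(S, len(S) // 2))   # one way to pick a median element
```

With k = 1 the candidates tested are 128, 192, 224, 240, 232, 236, 234 and 233, the same sequence as in the example. The data is never moved, only compared and counted.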

18 Accumulator, Mean Accumulator - Use sorting algorithm and add all the values Mean – Use accumulator and divide by n Interval range arithmetic Alternative algorithm Use fragment programs – requires very few renderings Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]

19 Accumulator Data representation is of form a_k 2^k + a_{k-1} 2^{k-1} + … + a_0 Sum = sum(a_k) 2^k + sum(a_{k-1}) 2^{k-1} + … + sum(a_0) Current GPUs support no bit-masking operations

20 The Algorithm >=0.5 means i-th bit is 1
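
Only the ">= 0.5" note survives from this slide. One reading that is consistent with the previous slide, offered here as an assumption, is that bit i of a value v is tested without bit masking by checking whether the fractional part of v / 2^(i+1) is at least 0.5, and that the sum is then assembled from per-bit counts. The Python sketch below follows that reading; `bit_is_set` and `accumulate` are hypothetical names.

```python
import math

def bit_is_set(value, i):
    """Test bit i using only division and a fractional part (no bit masks)."""
    frac = math.modf(value / 2 ** (i + 1))[0]
    return frac >= 0.5

def accumulate(values, bits=8):
    """Sum = sum over i of (number of values with bit i set) * 2^i."""
    total = 0
    for i in range(bits):
        count = sum(1 for v in values if bit_is_set(v, i))   # occlusion-query stand-in
        total += count * 2 ** i
    return total

S = [10, 24, 37, 99, 192, 200, 200, 232]
print(accumulate(S), sum(S))
```

This needs one counting pass per bit, which fits the later observation that the accumulator gains little over a SIMD CPU implementation.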

21 Implementation Algorithm CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations GPU – NVIDIA Cg Compiler Hardware Dell Precision Workstation with Dual 2.8GHz Xeon Processor NVIDIA GeForce FX 5900 Ultra GPU 2GB RAM

22 Benchmarks TCP/IP database with 1 million records and four attributes Census database with 360K records

23 Copy Time

24 Predicate Evaluation

25 Range Query

26 Multi-Attribute Query

27 Semi-linear Query

28 Kth-Largest

29

30 Kth-Largest conditional

31 Accumulator

32 Analysis: Issues Precision Copy time Integer arithmetic Depth compare masking Memory management No Branching No random writes

33 Analysis: Performance Relative Performance Gain High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count Medium Performance – Kth-largest number Low Performance - Accumulator

34 High Performance Parallel pixel processing engines Pipelining Early Z-cull Eliminate branch mispredictions

35 Medium Performance Parallelism FX 5900 has clock speed 450MHz, 8 pixel processing engines Rendering a single 1000x1000 quad takes 0.278ms Rendering 19 such quads takes 5.28ms; observed time is 6.6ms 80% efficiency in parallelism!!
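
The figures on this slide can be reproduced with a few lines of Python, assuming each of the 8 pixel engines processes one pixel per clock cycle (an assumption that the slide's numbers imply but do not state). The variable names are illustrative.

```python
clock_hz = 450e6          # FX 5900 clock speed
engines = 8               # pixel processing engines
pixels = 1000 * 1000      # one 1000x1000 quad

quad_ms = pixels / (engines * clock_hz) * 1e3   # ideal time for one quad
ideal_ms = 19 * quad_ms                         # ideal time for 19 quads
efficiency = ideal_ms / 6.6                     # against the observed 6.6 ms

print(round(quad_ms, 3), round(ideal_ms, 2), round(efficiency, 2))
```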

36 Low Performance No gain over SIMD based CPU implementation Two main reasons: Lack of integer-arithmetic Clock rate

37 Advantages Algorithms progress at GPU growth rate Offload CPU work Fast due to massive parallelism on GPUs Algorithms could be generalized to any geometric shape Eg. Max value within a triangular region Commodity hardware!

38 GPU Point List Generation Data compaction

39 Overall task

40 3D to 2D mapping

41 Current Problem

42 The solution

43 Overview, Data Compaction

44 Algorithm: Discriminator

45 Algorithm: Histogram Builder

46 Histogram Output

47 Algorithm: PointList Builder

48 PointList Output

49 Timing Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M log N steps
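
The slides for the histogram builder and point list builder are figures that did not survive, so the following Python sketch only illustrates the general idea behind the stated cost: an O(N) bottom-up pass that builds a pyramid of counts, then a top-down traversal of about log N levels for each of the M outputs. It is a 1D simplification with a power-of-two input length (both assumptions; the original works on 2D textures), and `build_pyramid` and `point_list` are hypothetical names.

```python
def build_pyramid(active):
    """active: list of 0/1 flags, length a power of two. Returns all pyramid levels."""
    levels = [list(active)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels                      # levels[-1][0] is the number of active entries

def point_list(active):
    """Return the indices of active entries, found by top-down traversal."""
    levels = build_pyramid(active)
    out = []
    for k in range(levels[-1][0]):     # each output element is independent of the others
        idx, remaining = 0, k
        for level in reversed(levels[:-1]):
            left = 2 * idx
            if remaining < level[left]:
                idx = left             # the k-th active entry lies in the left child
            else:
                remaining -= level[left]
                idx = left + 1         # skip the left child's entries, go right
        out.append(idx)
    return out

print(point_list([0, 1, 0, 0, 1, 1, 0, 0]))
```

Because every output index is computed independently, the traversal maps naturally onto one fragment per output element.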

50 Applications Image Analysis Feature Detection Volume Analysis Sparse Matrix Generation

51 Searching 1D Binary Search Nearest Neighbor Search for High dimension space K-NN Search

52 Binary Search Find a specific element in an ordered list Implement just like CPU algorithm Assuming hardware supports long enough shaders Finds the first element of a given value v If v does not exist, find next smallest element > v Search algorithm is sequential, but many searches can be executed in parallel Number of pixels drawn determines number of searches executed in parallel 1 pixel == 1 search

53 Binary Search Search for v0. Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5. Initialize: position 4. Search starts at center of sorted array. v2 >= v0, so search left half of sub-array.

54 Binary Search Search for v0. Positions: Initialize 4, Step 1: 2. v0 >= v0, so search left half of sub-array.

55 Binary Search Search for v0. Positions: Initialize 4, Step 1: 2, Step 2: 1. v0 >= v0, so search left half of sub-array.

56 Binary Search Search for v0. Positions: Initialize 4, Step 1: 2, Step 2: 1, Step 3: 0. At this point, we either have found v0 or are 1 element too far left. One last step to resolve.

57 Binary Search Search for v0. Positions: Initialize 4, Step 1: 2, Step 2: 1, Step 3: 0, Step 4: 0. Done!

58 Binary Search Search for v0 and v2. Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5. Initialize: both searches at position 4. Search starts at center of sorted array. Both searches proceed to the left half of the array.

59 Binary Search Search for v0 and v2. Positions for v0: 4, 2. Positions for v2: 4, 2. The search for v0 continues as before. The search for v2 overshot, so go back to the right.

60 Binary Search Search for v0 and v2. Positions for v0: 4, 2, 1. Positions for v2: 4, 2, 3. We've found the proper v2, but are still looking for v0. Both searches continue.

61 Binary Search Search for v0 and v2. Positions for v0: 4, 2, 1, 0. Positions for v2: 4, 2, 3, 2. Now we've found the proper v0, but overshot v2. The cleanup step takes care of this.

62 Binary Search Search for v0 and v2. Positions for v0: 4, 2, 1, 0, 0. Positions for v2: 4, 2, 3, 2, 3. Done! Both v0 and v2 are located properly.

63 Binary Search Summary Single rendering pass Each pixel drawn performs independent search O(log n) steps
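
The walkthrough above can be reproduced with a fixed-step search in which every query executes the same sequence of moves, as a shader would. The Python sketch below is a reconstruction from the positions shown on the slides, not the original shader: it assumes the list length is a power of two and at least 2, it uses the integers 0, 2 and 5 in place of v0, v2 and v5, and `search_trace` is a hypothetical name. A result equal to the list length means no element >= v exists.

```python
def search_trace(data, v):
    """Return the positions visited while locating the first element >= v."""
    n = len(data)                     # assumed: power of two, n >= 2
    pos = n // 2                      # start at the center of the sorted array
    trace = [pos]
    stride = n // 4
    while stride >= 1:                # halving strides: n/4, n/8, ..., 1
        pos += -stride if data[pos] >= v else stride
        trace.append(pos)
        stride //= 2
    pos += -1 if data[pos] >= v else 1    # one extra unit step
    trace.append(pos)
    if pos < n and data[pos] < v:         # cleanup: may be one element too far left
        pos += 1
    trace.append(pos)
    return trace

data = [0, 0, 0, 2, 2, 2, 5, 5]
print(search_trace(data, 0))   # positions for the v0 search
print(search_trace(data, 2))   # positions for the v2 search
# Many searches in parallel: one independent search per "pixel".
print([search_trace(data, v)[-1] for v in (0, 2, 3, 5)])
```

The step count depends only on the list length, never on the query value, which is what lets all searches run in lockstep in a single pass.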

64 Nearest Neighbor Search A fundamental step in similarity search for data mining, retrieval… Curse of dimensionality: when dimensionality is very high, structures like k-d trees do not help Use GPU to improve linear scan

65 Distances N-norm distance Cosine distance acos(dot(x,y))

66 Data Representation Use separate textures to store different dimensions.

67 Distance Computation Accumulating distance component of different dimensions

68 Reduction in RGBA

69 Reduction to find NN
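
The three preceding slides (one texture per dimension, per-dimension distance accumulation, reduction to the minimum) can be sketched on the CPU as follows. This is a simplified Python illustration under assumptions: squared Euclidean distance is used as the norm, plain lists stand in for textures, and a pairwise reduction stands in for the RGBA and texture reductions. The name `nearest_neighbor` is hypothetical.

```python
def nearest_neighbor(dim_textures, query):
    """dim_textures[d][i] is coordinate d of point i. Returns (index, squared distance)."""
    n = len(dim_textures[0])
    dist = [0.0] * n
    for tex, q in zip(dim_textures, query):      # one accumulation pass per dimension
        for i in range(n):
            diff = tex[i] - q
            dist[i] += diff * diff
    # Reduction: repeatedly keep the smaller of each neighboring pair.
    cand = list(zip(dist, range(n)))
    while len(cand) > 1:
        nxt = [min(cand[i], cand[i + 1]) for i in range(0, len(cand) - 1, 2)]
        if len(cand) % 2:
            nxt.append(cand[-1])                 # odd element carries over unchanged
        cand = nxt
    return cand[0][1], cand[0][0]

# Points (0,0), (3,4), (1,1), (5,5) stored as one list per dimension.
xs, ys = [0, 3, 1, 5], [0, 4, 1, 5]
print(nearest_neighbor([xs, ys], (1, 2)))
```

Storing each dimension separately means the accumulation loop touches one texture per pass, so the number of passes grows with dimensionality while every pass stays fully parallel over the points.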

70 Results

71

72 K-Nearest Neighbor Search Given a sample point p, find the k points nearest p within a data set On the CPU, this is easily done with a heap or priority queue Can add or reject neighbors as search progresses Don’t know how to build one efficiently on GPU kNN-grid Can only add neighbors…

73 kNN-grid Algorithm Want 4 neighbors (figure legend: sample point, neighbors found, candidate neighbor)

74 kNN-grid Algorithm Candidate neighbors must be within max search radius Visit voxels in order of distance to sample point

75 kNN-grid Algorithm If current number of neighbors found is less than the number requested, grow search radius (neighbors found: 1)

76 kNN-grid Algorithm If current number of neighbors found is less than the number requested, grow search radius (neighbors found: 2)

77 kNN-grid Algorithm Don’t add neighbors outside maximum search radius Don’t grow search radius when neighbor is outside maximum radius (neighbors found: 2)

78 kNN-grid Algorithm Add neighbors within search radius (neighbors found: 3)

79 kNN-grid Algorithm Add neighbors within search radius (neighbors found: 4)

80 kNN-grid Algorithm Don’t expand search radius if enough neighbors already found (neighbors found: 4)

81 kNN-grid Algorithm Add neighbors within search radius (neighbors found: 5)

82 kNN-grid Algorithm Visit all other voxels accessible within determined search radius Add neighbors within search radius (neighbors found: 6)

83 kNN-grid Summary Finds all neighbors within a sphere centered about sample point May locate more than requested k-nearest neighbors (neighbors found: 6, wanted 4)
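
The rules stated across the kNN-grid slides can be collected into one add-only loop. The Python sketch below is an interpretation of those rules, not the original GPU code: it uses a 2D grid instead of 3D voxels, measures voxel distance as the distance from the sample to the nearest edge of the cell, and all names (`knn_grid`, `cell`, `max_radius`) are hypothetical.

```python
import math

def knn_grid(points, sample, k, max_radius, cell=1.0):
    """Return every point inside the final search circle (may be more than k)."""
    sx, sy = sample
    grid = {}
    for p in points:                              # bucket points into grid cells
        key = (math.floor(p[0] / cell), math.floor(p[1] / cell))
        grid.setdefault(key, []).append(p)

    def cell_distance(key):                       # distance from sample to the cell's box
        dx = max(key[0] * cell - sx, 0.0, sx - (key[0] + 1) * cell)
        dy = max(key[1] * cell - sy, 0.0, sy - (key[1] + 1) * cell)
        return math.hypot(dx, dy)

    radius, found = 0.0, []
    for key in sorted(grid, key=cell_distance):   # visit cells nearest-first
        reach = cell_distance(key)
        if reach > max_radius or (len(found) >= k and reach > radius):
            break                                 # no remaining cell can contribute
        for p in grid[key]:
            d = math.hypot(p[0] - sx, p[1] - sy)
            if d > max_radius:
                continue                          # never add beyond the maximum radius
            if d <= radius:
                found.append(p)                   # inside the current search radius
            elif len(found) < k:
                radius = d                        # too few neighbors: grow the radius
                found.append(p)
    return found

pts = [(2.5, 2.5), (1.75, 2.5), (2.25, 3.5), (3.0, 2.5),
       (3.25, 3.5), (2.25, 0.5), (9.0, 9.0)]
print(knn_grid(pts, (2.25, 2.5), k=3, max_radius=3.0))
```

In the usage example, three neighbors are requested but four are returned, because a fourth point lies inside the radius that was fixed when the third neighbor was added. This matches the summary slide: neighbors are only ever added, never rejected later.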

84 References
Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin and Dinesh Manocha, Fast Computation of Database Operations using Graphics Processors, http://www.gpgpu.org/s2004/slides/govindaraju.DatabaseOperations.ppt
Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim, A Graphic Hardware Accelerated Algorithm for Nearest Neighbor Search
Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel, GPU Point List Generation through Histogram Pyramids, http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/
Tim Purcell, Sorting and Searching, http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.ppt

