Massive Spatial Query on the Kepler Architecture


1 Massive Spatial Query on the Kepler Architecture
Presentation at the 2017 ASAP International Conference. Wenhai Li, School of Computer, Wuhan University. 7/11/2017.

2 Content
Motivation
Driving Spatial Join in GPU
Cell-Driven Execution on Virtual Warp
Preemption-based Scheduling
Experimental Evaluation

3 1. Motivation: Why this topic?
A classical moving object application: 10^7 moving objects (MO), 10^6 queries (Q). How to handle it?
1. Moving object applications are everywhere. A classic scenario is to answer 10^6 range queries per second over 10^7 moving objects that are continuously moving within a big city.
2. If the moving objects, e.g., people and vehicles, move at a maximal speed of 100 meters/second, how can we query their latest positions every second?
3. We assume the moving objects are managed as a snapshot, where their latest positions have been properly loaded onto the GPU devices. Our target is to employ a high-performance GPU to improve the efficiency of massive range queries (each a rectangle <x_min, y_min, x_max, y_max>) against enormous numbers of spatial positions of the form <x, y>.
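To make the data model concrete, here is a minimal CUDA sketch of the point and query-rectangle layout described above; the type and function names are illustrative, not taken from the paper.

```cuda
// Illustrative layout (assumed names): a moving object is an <x, y> position
// pair and a range query is an axis-aligned rectangle <x_min, y_min, x_max, y_max>.
struct Point {
    float x, y;
};

struct RangeQuery {
    float x_min, y_min, x_max, y_max;
};

// A point satisfies a range query iff it lies inside the query rectangle.
__host__ __device__ inline bool contains(const RangeQuery& q, const Point& p) {
    return p.x >= q.x_min && p.x <= q.x_max &&
           p.y >= q.y_min && p.y <= q.y_max;
}
```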

4 1. Motivation: Why GPU?
Brute force cannot stand: given a moving set of 10^7 position pairs, each query range triggers up to 10^7 comparisons in the worst case (without an index). If we run the query on a conventional index in which each bucket holds an array of 100 position pairs, each query scans at least one bucket and also retrieves a series of inner nodes to locate the bucket it requires. A better way is to use a partition-based structure to access the buckets directly and to employ massive cores to scan the involved buckets for each query. The naive options are very expensive:
1. In practice, brute-force execution wastes many invalid verifications of the query range against irrelevant objects.
2. An existing index has to retrieve the path that locates the bucket, making the algorithm suboptimal.
3. A partition-based structure can locate the buckets directly, but it needs a deliberate design to drive the platform at its best.
Notice that in current GPUs a socket with more than a thousand cores is quite common, which offers a potential opportunity to improve the query throughput. The challenge is how to drive these massive cores effectively.

5 1. Motivation: Why scheduling?
How can we efficiently schedule the thread blocks (TBs) to exploit the SMXs and run the jobs in round-robin? (Figure: 8192 cores with fast in-chip memory, backed by slow device memory.) Unlike locality-insensitive platforms such as multi-core CPUs, the GPU requires the user-defined blocks to be regular and homogeneous. The most challenging problems are how to run the entire set of queries over the dispersed buckets, and how to resolve the imbalance issues derived from the variety of the query ranges and the skewed object distributions.

6 2. Driving Spatial Query in GPU
Spatial grid: a direct-access structure over massive buckets of objects. The grid is based on a uniform division of the spatial domain, and each cell in the grid is composed of a series of buckets. The range of a cell can be directly computed from its index ID, so that a query of the form <xmin, ymin, xmax, ymax> can be probed against the cell ranges. This significantly filters out the irrelevant cells. Within each cell, the involved objects are joined against the given query in round-robin. The scale of this join is substantially smaller, since the irrelevant cells (as well as their involved objects) have been pruned.
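As a rough illustration of the grid arithmetic, the following sketch recovers a cell's rectangle from its index ID and tests query/cell overlap; the domain origin, cell width/height, and grid width are assumed parameters, and the RangeQuery type comes from the earlier sketch.

```cuda
// Assumed uniform grid: the domain starts at (X0, Y0), cells are CW x CH in
// size, and the grid has NX columns; cells are numbered row-major.
__host__ __device__ inline RangeQuery cell_range(int cell_id,
                                                 float X0, float Y0,
                                                 float CW, float CH, int NX) {
    int cx = cell_id % NX;              // column index of the cell
    int cy = cell_id / NX;              // row index of the cell
    RangeQuery r;
    r.x_min = X0 + cx * CW;  r.x_max = r.x_min + CW;
    r.y_min = Y0 + cy * CH;  r.y_max = r.y_min + CH;
    return r;
}

// A query prunes a cell whenever their rectangles do not overlap.
__host__ __device__ inline bool overlaps(const RangeQuery& q, const RangeQuery& c) {
    return q.x_min <= c.x_max && c.x_min <= q.x_max &&
           q.y_min <= c.y_max && c.y_min <= q.y_max;
}
```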

7 2. Driving Spatial Query in GPU
We thus transform the massive queries into a spatial join, and design a two-stage framework to drive the join (of massive queries on enormous objects) on the GPU.
Dispatching stage: construct a vector of requests that maps the objects to their intersected queries.
Join stage: for each element in the request vector from the dispatching stage, apply the join algorithm on the objects and the query ranges.
Query-driven: once a thread gets a request, it fetches the bucket from device memory and joins the query on this bucket.
Cell-centered: once a head thread (of a group) gets a request, it fetches the bucket and triggers its inner-group threads to each run their own query on this bucket.
1. Our motivation is to translate the naive round-robin execution of massive range queries into a two-stage join, where the first stage constructs a vector of requests mapping the objects to their intersected queries, so that the second stage can answer these requests in batches.
2. The cell-centered join strategy aims to improve the locality of the memory accesses by driving a group of threads in a batch.
3. Notice that the cell-centered strategy benefits the inner-group threads, which do not have to fetch their own buckets.
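The following is one possible shape for the request vector; the struct name and layout are assumptions made for the sketches that follow, not the paper's exact structure.

```cuda
// Assumed layout of one entry in the request vector produced by the
// dispatching stage: it pairs a grid cell (a bucket of objects) with one
// query whose range intersects that cell.
struct Request {
    int cell_id;    // which bucket of objects to scan in the join stage
    int query_id;   // which query range to evaluate against that bucket
};

// The two join strategies differ only in how threads consume these entries:
//   query-driven : every thread claims an entry and fetches the bucket itself;
//   cell-centered: the entries of a cell are handled by one thread group, so
//                  the bucket is effectively fetched once per group.
```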

8 2. Driving Spatial Query in GPU
Dispatching: fixed-size stepping with cell range-based pruning, used for both join strategies. Notice: the atomicAdd in line 11 of the dispatching pseudo-code is required so that all threads compete for slots in the request list RCA.
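Below is a minimal sketch of such a dispatching kernel, written against the assumed grid parameters and Request layout from the earlier sketches; the kernel name and signature are illustrative, and the atomicAdd plays the role of line 11.

```cuda
// Sketch of the dispatching stage (assumed names; RCA is the request list
// named on the slide). Each thread walks the queries with a fixed-size step,
// prunes the cells that do not overlap the query range, and appends a
// (cell, query) request; the atomicAdd lets all threads compete for slots.
__global__ void dispatch_kernel(const RangeQuery* queries, int num_queries,
                                float X0, float Y0, float CW, float CH,
                                int NX, int NY,
                                Request* RCA, int* rca_size) {
    int stride = gridDim.x * blockDim.x;
    for (int q = blockIdx.x * blockDim.x + threadIdx.x;
         q < num_queries; q += stride) {                    // fixed-size stepping
        RangeQuery r = queries[q];
        // Cell range-based pruning: only the cells covered by the query
        // rectangle are enumerated.
        int cx0 = max(0,      (int)((r.x_min - X0) / CW));
        int cx1 = min(NX - 1, (int)((r.x_max - X0) / CW));
        int cy0 = max(0,      (int)((r.y_min - Y0) / CH));
        int cy1 = min(NY - 1, (int)((r.y_max - Y0) / CH));
        for (int cy = cy0; cy <= cy1; ++cy)
            for (int cx = cx0; cx <= cx1; ++cx) {
                int slot = atomicAdd(rca_size, 1);          // compete for a slot in RCA
                RCA[slot] = Request{cy * NX + cx, q};
            }
    }
}
```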

9 3. Cell-Driven Execution on Virtual Warp
Query-driven strategy: a request vector, stepping threads, one cell at a time; fetch the bucket, then join locally. Remark: poor locality, but simple scheduling. In this first, query-driven strategy, all threads follow three steps: (1) fetch a query request with a fixed-size step; (2) fetch its associated bucket from device memory; (3) process the query against the bucket. It suffers from poor locality when accessing device memory, especially when the queries and objects are highly skewed.
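A sketch of the query-driven join is shown below; the CSR-style bucket layout (points of cell c occupy cell_off[c] .. cell_off[c+1)) and the per-query match counters are assumptions, and contains() comes from the earlier sketch.

```cuda
// Sketch of the query-driven join (assumed names and layouts). Every thread
// independently (1) fetches a request with a fixed-size step, (2) fetches the
// bucket of its cell from device memory, and (3) evaluates its query against
// every point of that bucket. Neighbouring threads stream unrelated buckets,
// which is exactly the poor locality noted above.
__global__ void join_query_driven(const Request* RCA, const int* rca_size,
                                  const RangeQuery* queries,
                                  const Point* points, const int* cell_off,
                                  int* match_count) {
    int stride = gridDim.x * blockDim.x;
    for (int r = blockIdx.x * blockDim.x + threadIdx.x;
         r < *rca_size; r += stride) {
        Request req = RCA[r];
        RangeQuery q = queries[req.query_id];
        for (int i = cell_off[req.cell_id]; i < cell_off[req.cell_id + 1]; ++i)
            if (contains(q, points[i]))
                atomicAdd(&match_count[req.query_id], 1);   // count matches per query
    }
}
```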

10 3. Cell-Driven Execution on Virtual Warp
Cell-centered strategy:
Thread header: fetch a request, fetch a bucket, trigger the siblings.
Sibling threads: process a query, then fetch the next query.
With good locality.

11 3. Cell-Driven Execution on Virtual Warp
Join: fixed-size stepping; get the request and prepare for the join; then join on each object. Because the query-driven join can be easily realized, we focus on the cell-centered method in the remainder of the presentation. This basic cell-centered strategy uses a fixed STRIDE to step through and fetch the requests. Afterwards, it fetches the bucket and joins the queries against the bucket.
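The sketch below shows one way the basic cell-centered join could look with a virtual warp of 32 threads (here one hardware warp). For brevity it assumes the dispatching output has additionally been grouped by cell, so each group-level request carries a cell plus the span of its intersecting query IDs; this per-cell layout, all names, and the match counters are assumptions.

```cuda
// Basic cell-centered join on virtual warps of 32 threads (assumed names).
// Each virtual warp steps through the per-cell requests with a fixed STRIDE;
// within a request, the sibling lanes each take one query and scan the same
// bucket, so the bucket is read once per group (warp-broadcast loads) instead
// of once per thread.
struct CellRequest {
    int cell_id;    // bucket shared by the whole group
    int q_begin;    // first entry of this cell in q_list
    int q_end;      // one past the last entry of this cell in q_list
};

__global__ void join_cell_centered(const CellRequest* requests, const int* num_requests,
                                   const int* q_list,            // query IDs grouped by cell
                                   const RangeQuery* queries,
                                   const Point* points, const int* cell_off,
                                   int* match_count) {
    int lane     = threadIdx.x % 32;                              // position inside the group
    int vwarp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int STRIDE   = (gridDim.x * blockDim.x) / 32;                 // number of virtual warps

    for (int r = vwarp_id; r < *num_requests; r += STRIDE) {      // fixed-size stepping
        CellRequest req = requests[r];                            // same address -> one broadcast load
        int begin = cell_off[req.cell_id], end = cell_off[req.cell_id + 1];
        for (int qi = req.q_begin + lane; qi < req.q_end; qi += 32) {
            RangeQuery q = queries[q_list[qi]];                   // each sibling owns one query
            int hits = 0;
            for (int i = begin; i < end; ++i)                     // shared scan of the bucket
                if (contains(q, points[i])) ++hits;
            atomicAdd(&match_count[q_list[qi]], hits);
        }
    }
}
```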

12 4. Preemption-based Scheduling
Our observation is that, in the basic cell-centered join above, the fixed-size strategy may become a bottleneck once too many queries are conducted on a cell. We want a better scheduling method that lets the thread groups fetch requests at will. We therefore design a non-synchronous, preemption-based scheduling method to realize this kind of fetching. We introduce another atomic operation that allows the head of each thread group to compete for a request after its sibling threads have all completed their current jobs. We call this group-based thread management the Virtual Warping scheme, and endow each group with a header that manages its siblings.
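A sketch of this preemption-based fetching is given below, written with the modern warp-level *_sync intrinsics rather than the Kepler-era ones; the global work counter, the kernel name, and the per-cell request layout from the previous sketch are all assumptions.

```cuda
// Preemption-based scheduling (assumed names). Instead of a fixed stride, the
// head lane of each virtual warp claims the next unprocessed request through
// a global atomic counter as soon as all of its siblings have finished the
// current one, so fast groups simply take more requests and the load balances
// adaptively under skew.
__global__ void join_cell_centered_preempt(const CellRequest* requests,
                                           const int* num_requests,
                                           const int* q_list,
                                           const RangeQuery* queries,
                                           const Point* points, const int* cell_off,
                                           int* next_request,    // global work counter
                                           int* match_count) {
    int lane = threadIdx.x % 32;

    while (true) {
        int r = 0;
        if (lane == 0)
            r = atomicAdd(next_request, 1);                      // head preempts the next job
        r = __shfl_sync(0xffffffff, r, 0);                       // broadcast it to the siblings
        if (r >= *num_requests) break;                           // no work left for this group

        CellRequest req = requests[r];
        int begin = cell_off[req.cell_id], end = cell_off[req.cell_id + 1];
        for (int qi = req.q_begin + lane; qi < req.q_end; qi += 32) {
            RangeQuery q = queries[q_list[qi]];
            int hits = 0;
            for (int i = begin; i < end; ++i)
                if (contains(q, points[i])) ++hits;
            atomicAdd(&match_count[q_list[qi]], hits);
        }
        __syncwarp();        // all siblings done before the head claims again
    }
}
```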

13 4. Preemption-based Scheduling
Notice:
Each cell is fetched once.
Queries are processed by preemption.
The groups of threads can be balanced adaptively.
The cell can be shared by the siblings.

14 5. Experimental Evaluation
Query efficiencies under different cell granularities. The cell-centered join needs an optimal granularity of 512, at which it is about 10X faster than the query-driven one. When both optimized versions run at their best granularities, CC (cell-centered) outperforms QD (query-driven) by one order of magnitude, i.e., at granularity 512 for CC.

15 5. Experimental Evaluation
Query efficiencies under different query ranges. The cell-centered join is insensitive to the query ranges, and it is about 10X better than the query-driven implementations. When both optimized versions run at their best granularities, CC (cell-centered) outperforms QD (query-driven) by one order of magnitude.

16 5. Experimental Evaluation
The cell-centered preemption method has good scalability. It is always better than the query-driven methods by orders of magnitude, and the preemption has good scheduling effects on skewed data. Notice that the y-axes of all the figures are plotted on a logarithmic scale, which makes the differences between the methods look smaller than they are.

17 Thanks for your attention.

