Published by Natalie Bell. Modified over 2 years ago.

1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World
Jim Gray, Microsoft
Alex Szalay, Johns Hopkins U.

3 Background
I have been working with the astronomy community to build the World Wide Telescope:
–All telescope data federated in one internet-scale DB
–A great Web Services app
The work here is joint with Alex Szalay.
SkyServer.Sdss.Org is the first installment; SkyQuery.Net is the second installment (federated web services).

4 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

5 Spatial Data Access – SQL extension
Szalay, Kunszt, Brunner
Added a Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins.
Every object has a 20-deep Mesh ID.
Given a spatial definition, the routine returns up to 10 covering triangles.
A spatial query is then up to 10 range queries.
Fast: 1,000 triangles / second / GHz
[Figure: HTM triangle subdivision. Triangle 2 splits into 2,0 2,1 2,2 2,3; triangle 2,3 splits into 2,3,0 2,3,1 2,3,2 2,3,3.]
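Why one covering triangle becomes one range query can be sketched in a few lines. This is a minimal illustration, assuming each subdivision level appends two bits to the parent triangle's ID (four children per triangle); the function name `htm_range` and the sample ID are hypothetical and are not part of the SkyServer HTM library.

```python
def htm_range(prefix_id, prefix_depth, full_depth=20):
    """Return the inclusive (start, end) range of full-depth IDs under a triangle.

    Assumption: every level below the prefix appends 2 bits (children 0..3),
    so all descendants of a triangle share its ID as a bit prefix.
    """
    if not 0 <= prefix_depth <= full_depth:
        raise ValueError("prefix_depth must be between 0 and full_depth")
    shift = 2 * (full_depth - prefix_depth)
    start = prefix_id << shift
    end = start + (1 << shift) - 1
    return start, end


# A hypothetical triangle with ID 0b1110 at depth 1, expanded to depth 3:
start, end = htm_range(0b1110, prefix_depth=1, full_depth=3)
print(start, end)  # 224 239
```

Under that assumption, a predicate such as `htmID between start and end` selects exactly the objects inside the covering triangle, which is what lets an ordinary B-tree index serve the coarse filter.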

6 A typical call
    -- find objects within 1 arcminute of (60,20)
    select objID, ra, dec
    from PhotoObj as p, fHtmCover(60,20,1) as triangle
    where p.htmID between triangle.startHtmID and triangle.endHtmID   -- coarse filter
      and …   -- careful distance test rejects false positives (expression lost in this transcript)
    -- or better yet
    select objID, ra, dec, distance
    from dbo.fGetNearbyObjEq(60,20,1)
Slide annotations: the htmID range test is the coarse filter; the careful distance test is the correct filter.

7 Integration with CLR Makes it Nicer
Peter Kukol converted 500 lines of external stored procedure glue code to 50 lines of C# code.
Now we are converting the library to C#.
Also, Cross Apply is VERY useful:
    select p.objID, count(*)
    from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1) n
    group by p.objID   -- grouping added so count(*) is per object

8 But…
Wanted a faster way to do this: some computations were taking too long (see below).
Wanted to define areas in relational form.
Wanted a portable way that works on any relational system.
So, developed a constraint database approach (see below).

9 The Idea: Equations Define Subspaces
For (x,y) above the line: $ax + by > c$
Reverse the space by: $-ax - by > -c$
Intersect 3 half-spaces:
    $a_1 x + b_1 y > c_1$
    $a_2 x + b_2 y > c_2$
    $a_3 x + b_3 y > c_3$
[Figure: the line $ax + by = c$ on x and y axes, with intercepts $x = c/a$ and $y = c/b$.]
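The half-space test is easy to try by hand. The sketch below is an illustration only: the function name `inside_convex` and the triangle coefficients are made up for the example, and each half-space is stored as an `(a, b, c)` triple meaning `a*x + b*y > c`.

```python
def inside_convex(point, half_spaces):
    """True if the point satisfies a*x + b*y > c for every (a, b, c)."""
    x, y = point
    return all(a * x + b * y > c for a, b, c in half_spaces)


# Hypothetical triangle with vertices (0,0), (4,0), (0,4):
triangle = [
    (1.0, 0.0, 0.0),     # x > 0
    (0.0, 1.0, 0.0),     # y > 0
    (-1.0, -1.0, -4.0),  # x + y < 4, reversed into the form -x - y > -4
]

print(inside_convex((1.0, 1.0), triangle))  # True
print(inside_convex((3.0, 3.0), triangle))  # False
```

The third row shows the "reverse the space" trick: a less-than constraint is stored by negating all three coefficients, so every row uses the same comparison.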

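10 The Idea: Equations Define Subspaces
    $a_1 x + b_1 y > c_1$
    $a_2 x + b_2 y > c_2$
    $a_3 x + b_3 y > c_3$
    select count(*) from convex where a*@x + b*@y < c
    select count(*) from convex where a*@x + b*@y > c
(The transcript dropped the point variables; @x and @y stand for the coordinates of the query point.)
[Figure: x and y axes with the three half-spaces.]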

11 Domain is Union of Convex Hulls
Simple volumes are unions of convex hulls.
Higher order curves also work.
Complex volumes have holes, and their holes have holes (that is harder).
[Figure: a shape labeled "Not a convex hull".]

12 Now in Relational Terms
    create table HalfSpace (
        domainID    int not null       -- domain name
                    foreign key references Domain(domainID),
        convexID    int not null,      -- grouping a set of ½ spaces
        halfSpaceID int identity,      -- a particular ½ space
        a           float not null,    -- the (a,b,..) parameters
        b           float not null,    -- defining the ½ space
        c           float not null,    -- the constraint (c above)
        primary key (domainID, convexID, halfSpaceID)
    )
(x,y) is inside a convex if it is inside all lines of the convex.
(x,y) is inside a convex if it is NOT OUTSIDE ANY line of the convex.
Convexes containing point:
    select convexID                 -- return the convex hulls
    from HalfSpace                  -- from the constraints
    where (@x * a + @y * b) < c     -- point outside the line?
    group by all convexID           -- insist no line of convex
    having count(*) = 0             -- is outside (count outside == 0)
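The "not outside any line" query can be tried on any relational engine. The sketch below uses Python's built-in `sqlite3` module; since `group by all` is a T-SQL feature, it swaps in a conditional `sum` as a portable equivalent, which is an assumption of this example and not the slide's query. The two squares and the helper name `convexes_containing` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table HalfSpace (
        domainID    integer not null,
        convexID    integer not null,
        halfSpaceID integer not null,
        a real not null, b real not null, c real not null,
        primary key (domainID, convexID, halfSpaceID)
    )""")

# Two hypothetical convexes in domain 1: the unit square, and the same
# square shifted right by 5. A point is outside a row if a*x + b*y < c.
rows = []
for convex_id, x0 in ((1, 0.0), (2, 5.0)):
    rows += [
        (1, convex_id, 1,  1.0,  0.0, x0),           # x >= x0
        (1, convex_id, 2, -1.0,  0.0, -(x0 + 1.0)),  # x <= x0 + 1
        (1, convex_id, 3,  0.0,  1.0, 0.0),          # y >= 0
        (1, convex_id, 4,  0.0, -1.0, -1.0),         # y <= 1
    ]
conn.executemany("insert into HalfSpace values (?,?,?,?,?,?)", rows)


def convexes_containing(x, y):
    """Return convexIDs for which no half-space has the point outside."""
    sql = """
        select convexID
        from HalfSpace
        group by convexID
        having sum(case when (? * a + ? * b) < c then 1 else 0 end) = 0
        order by convexID"""
    return [r[0] for r in conn.execute(sql, (x, y))]


print(convexes_containing(0.5, 0.5))  # [1]
print(convexes_containing(5.5, 0.5))  # [2]
print(convexes_containing(3.0, 0.5))  # []
```

The conditional sum counts "outside" rows per convex while still producing a group for convexes with no such rows, which is the role `group by all` plays in the T-SQL version.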

13 All Domains Containing this Point
The group by is supported by the domain/convex index, so it's a sequential scan (pre-sorted!).
    select distinct domainID            -- return domains
    from HalfSpace                      -- from constraints
    where (@x * a + @y * b) < c         -- point outside
    group by all domainID, convexID     -- never happens
    having count(*) = 0                 -- count outside == 0

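14 The Algebra is Simple
Domains are built with stored procedures (the call arguments and variables were lost in this transcript):
    … = spDomainNew …
    … = spDomainNewConvex …
    … = spDomainNewConvexConstraint …
    … = select * from … float)
Once constructed, they can be manipulated with the Boolean operations:
    … = spDomainOr …
    … = spDomainAnd …
    … = spDomainNot … varchar(8000))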
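To see why the algebra is simple, here is a small in-memory sketch of union and intersection on domains. It is not the implementation of the `spDomain*` procedures, whose signatures are not recoverable from the slide; the representation (a domain is a list of convexes, a convex is a list of `(a, b, c)` rows, inside means `a*x + b*y >= c`) and all function names are assumptions. Negation is omitted because it needs a De Morgan expansion.

```python
from itertools import product


def contains(domain, x, y):
    """A point is in a domain if some convex has it inside every half-space."""
    return any(all(a * x + b * y >= c for a, b, c in convex) for convex in domain)


def domain_or(d1, d2):
    """Union: keep the convexes of both domains."""
    return d1 + d2


def domain_and(d1, d2):
    """Intersection: intersect every convex of d1 with every convex of d2."""
    return [c1 + c2 for c1, c2 in product(d1, d2)]


# Hypothetical vertical strips: 0 <= x <= 2 and 1 <= x <= 3
left = [[(1.0, 0.0, 0.0), (-1.0, 0.0, -2.0)]]
right = [[(1.0, 0.0, 1.0), (-1.0, 0.0, -3.0)]]
both = domain_and(left, right)    # 1 <= x <= 2
either = domain_or(left, right)   # 0 <= x <= 3

print(contains(both, 1.5, 0.0), contains(both, 0.5, 0.0))      # True False
print(contains(either, 0.5, 0.0), contains(either, 3.5, 0.0))  # True False
```

In row form, union is an insert of more convexes under one domain, and intersection is a cross join of convexes that concatenates their constraint rows.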

15 What! No Bounding Box?
A bounding box limits the search.
A subset of the convex hulls.
If the query runs at 3M half-spaces/sec, then there is no need for a bounding box unless you have more than 10,000 lines.
But if you have a lot of half-spaces, then a bounding box is good.

16 OK: solved Areas Contain Point? What about: Points near point?
Table-valued function finds points near a point:
–Select * from fGetNearbyEq(ra,dec,r)
Use Hierarchical Triangular Mesh:
–Space filling curve, bounding triangles…
–Standard approach
13 ms/call… so 70 objects/second.
Too slow, so pre-compute neighbors: materialized view.
At 70 objects/sec, it takes 6 months to compute the materialized view on a billion objects.
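The "6 months" estimate can be checked with a few lines of arithmetic. The inputs are the slide's own figures (13 ms per call, 70 objects per second, a billion objects); the 30-day month is an assumption made here for the conversion.

```python
MS_PER_CALL = 13            # from the slide
RATE_ON_SLIDE = 70          # objects/second, as quoted on the slide
N_OBJECTS = 1_000_000_000   # "billion objects"

calls_per_second = 1000 / MS_PER_CALL
seconds = N_OBJECTS / RATE_ON_SLIDE
days = seconds / 86_400
months = days / 30          # assumption: 30-day months

print(f"{calls_per_second:.0f} calls/s at {MS_PER_CALL} ms/call")
print(f"{days:.0f} days, about {months:.1f} months at {RATE_ON_SLIDE} objects/s")
```

At 13 ms per call the raw rate is about 77 calls per second, so the slide's 70 is a rounded-down figure, and a billion objects at that rate take roughly 165 days.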

17 Zone Based Spatial Join
Divide space into zones.
Key points by zone, offset (on the sphere this needs a wrap-around margin).
Point search: look in a few zones at a limited offset, ra ± r, a bounding box that has 1-π/4 false positives.
All inside the relational engine.
Avoids impedance mismatch.
Can batch all-all comparisons.
33x faster and parallel.
6 days, not 6 months!
[Figure: a circle of radius r crossing the zone boundary zoneMax; labels r, ra-zoneMax, zoneMax, cos(radians(zoneMax)), and the search window ra ± x.]
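The zone idea can be sketched outside the database. This is a minimal illustration under stated assumptions, not the authors' code: zones are taken as `floor((dec + 90) / zone_height)`, the zone height of 0.5 degrees is arbitrary, the ra window is widened by a cosine factor, wrap-around at ra 0/360 is ignored, and the search circle is assumed to stay away from the poles. All function names are hypothetical.

```python
import bisect
import math
from collections import defaultdict

ZONE_HEIGHT = 0.5  # degrees; an illustrative choice


def zone_id(dec):
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT))


def build_zone_index(points):
    """Group (obj_id, ra, dec) rows by zone, sorted by ra within each zone."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[zone_id(dec)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def unit_vector(ra, dec):
    ra_r, dec_r = math.radians(ra), math.radians(dec)
    return (math.cos(dec_r) * math.cos(ra_r),
            math.cos(dec_r) * math.sin(ra_r),
            math.sin(dec_r))


def angular_distance(ra1, dec1, ra2, dec2):
    """Angle in degrees between two directions, from the chord length."""
    chord = math.dist(unit_vector(ra1, dec1), unit_vector(ra2, dec2))
    return math.degrees(2.0 * math.asin(min(1.0, chord / 2.0)))


def nearby(zones, ra, dec, r):
    """Return obj_ids within r degrees of (ra, dec)."""
    # Widen the ra window by 1/cos(dec): a degree of ra shrinks away from the equator.
    x = r / math.cos(math.radians(min(abs(dec) + r, 89.9)))
    hits = []
    for z in range(zone_id(dec - r), zone_id(dec + r) + 1):
        rows = zones.get(z, [])
        lo = bisect.bisect_left(rows, (ra - x,))
        hi = bisect.bisect_right(rows, (ra + x, float("inf")))
        for o_ra, o_dec, obj_id in rows[lo:hi]:            # coarse: zone + ra window
            if abs(o_dec - dec) <= r and angular_distance(ra, dec, o_ra, o_dec) <= r:
                hits.append(obj_id)                        # careful distance test
    return sorted(hits)


points = [(1, 10.0, 20.0), (2, 10.3, 20.1), (3, 12.0, 20.0), (4, 10.0, 21.5)]
zones = build_zone_index(points)
print(nearby(zones, 10.0, 20.0, 0.5))  # [1, 2]
```

The sorted list per zone plays the role of a clustered index on (zone, ra): the coarse filter is a range scan, and only the few rows in the window reach the distance computation.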

18 In SQL: points near point
(… marks expressions lost in this transcript.)
    select o1.objID                     -- find objects
    from zone o1                        -- in the zoned table
    where o1.zoneID between … and …     -- where zone # overlaps the circle
      and o1.ra …                       -- quick filter on ra
      and o1.dec between … and …        -- quick filter on dec
      and ( (sqrt( …                    -- careful filter on distance
The zone, ra, and dec tests form the bounding box. The careful distance filter eliminates the ~21% = 1-π/4 false positives.
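The ~21% figure follows from comparing a circle with its bounding square. A short derivation, assuming a flat (small-angle) approximation in which the bounding box is a square of side 2r:

```latex
\[
\frac{A_{\text{box}} - A_{\text{circle}}}{A_{\text{box}}}
  = \frac{(2r)^2 - \pi r^2}{(2r)^2}
  = 1 - \frac{\pi}{4}
  \approx 0.215
\]
```

So roughly one candidate in five that passes the quick ra and dec filters lies in a corner of the box and is rejected by the distance test.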

19 Quantitative Evaluation
7x faster than the external stored proc (linkage is expensive).
[Chart: time vs. radius for neighbors, at various zone heights.] Any small zone height is adequate.
[Chart: time vs. best, various radius.] A zoneHeight of 4 is near-optimal.

20 All Neighbors of All points (can Batch Process the Joins)
A 5x additional speedup (35x in total).
The example ignores some spherical geometry details covered in the paper.
(… marks expressions lost in this transcript.)
    -- … in {-1, 0, 1}
    insert neighbors                            -- insert one zone's neighbors
    select o1.objID as objID,                   -- object pairs
           o2.objID as NeighborObjID, ..        -- other fields elided
    from zone o1 join zone o2                   -- join 2 zones
      on … = o2.zoneID                          -- using zone number and ra
     and o2.ra between o1.ra … and o1.ra …      -- points near ra
    where …                                     -- elided margin logic, see paper
      and o2.dec between … and …                -- quick filter on dec
      and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2)) …   -- careful filter on distance
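The batch join can be sketched in flat geometry. This is an illustration under assumptions, not the paper's SQL: points live on a plane, the zone height is set equal to the radius so that only zone offsets -1, 0, and +1 can hold neighbors, and the function name `all_neighbors` is hypothetical.

```python
import math
from collections import defaultdict


def all_neighbors(points, r):
    """Return sorted (id1, id2) pairs, id1 != id2, with distance <= r.

    Flat x/y geometry; zone height equals r, so only zone offsets
    -1, 0, +1 need to be joined.
    """
    zones = defaultdict(list)
    for obj_id, x, y in points:
        zones[math.floor(y / r)].append((x, y, obj_id))
    for rows in zones.values():
        rows.sort()  # sorted by x within each zone

    pairs = []
    for z, rows1 in zones.items():
        for dz in (-1, 0, 1):
            rows2 = zones.get(z + dz, [])
            start = 0
            for x1, y1, id1 in rows1:
                # Slide the window start forward; both lists are sorted by x.
                while start < len(rows2) and rows2[start][0] < x1 - r:
                    start += 1
                for x2, y2, id2 in rows2[start:]:
                    if x2 > x1 + r:
                        break  # past the x window
                    if id1 != id2 and math.hypot(x1 - x2, y1 - y2) <= r:
                        pairs.append((id1, id2))
    return sorted(pairs)


points = [(1, 0.0, 0.0), (2, 0.5, 0.2), (3, 0.9, 1.1), (4, 5.0, 5.0)]
print(all_neighbors(points, 1.0))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Because both sides of each zone pair are sorted, the inner loop is a merge-style scan, which is the same reason the relational version can run as a sequential, parallelizable join.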

21 Spatial Stuff Summary
Easy:
–Point in polygon
–Polygons containing points
–(instance and batch)
Works in higher dimensions.
Side note: spherical polygons are
–hard in 2-space
–easy in 3-space

22 Spatial Stuff Summary
Constraint databases are in:
–Streams (data is query, query is in DB)
–Notification: subscription in DB, data is query
–Spatial: constraints in DB, data is query
You can express constraints as rows. Then you:
–Can evaluate LOTS of predicates per second
–Can do set algebra on the predicates
Benefits from SQL parallelism.
SQL == Prolog // DataLog?

23 References
Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database
A Purely Relational Way of Computing Neighbors on a Sphere

24 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

25 Loading consists of Many Tasks
Very simply: capture, analyze to produce catalog info, convert to SQL, validate, import, index.
Each of these steps has many sub-steps.
We learned from the TerraServer that:
(1) LOADING IS WHERE THE TIME GOES.
(2) You get to load it again when
–you discover better data
–you discover a bug in the data
–you discover a better design.
(3) Essentially, you are always loading or preparing to load.
Everyone knows this, but you have to experience it to grasp it.
[Figure: pipeline from Telescope Observation to Generate Catalogs to Validate Catalogs to Load to Database.]

26 The SkyServer Load Manager
Built a workflow engine with SQL Agent (our batch job scheduler) and DTS.
The state machine is in the database, visible to all worker nodes.
Workers at each node pull work.
A step is a stored procedure:
–logs to the load monitor database
–3 levels of logging: job, step, phase
–Logs and reporting are VERY useful
Automatic, with some manual, backout to fix problems.
Demo
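The "state machine in the database, workers pull work" pattern can be sketched with Python's built-in `sqlite3` module. This is a toy illustration, not the SkyServer Load Manager: the `Step` and `Log` tables, their columns, the status values, and the function names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Step (
        stepID integer primary key,
        name   text not null,
        status text not null default 'ready',   -- ready, running, done, failed
        worker text
    );
    create table Log (
        logID   integer primary key,
        stepID  integer not null references Step(stepID),
        message text not null
    );
    insert into Step (name) values ('convert'), ('validate'), ('import');
""")


def pull_step(worker):
    """Claim the lowest-numbered ready step for this worker, or return None."""
    with conn:  # select and claim inside one transaction
        row = conn.execute(
            "select stepID, name from Step "
            "where status = 'ready' order by stepID limit 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "update Step set status = 'running', worker = ? where stepID = ?",
            (worker, row[0]))
        return row


def finish_step(step_id, ok, message):
    """Record the outcome of a step and write a log row."""
    with conn:
        conn.execute("update Step set status = ? where stepID = ?",
                     ("done" if ok else "failed", step_id))
        conn.execute("insert into Log (stepID, message) values (?, ?)",
                     (step_id, message))


step = pull_step("node1")
finish_step(step[0], True, "convert finished")
next_step = pull_step("node2")
print(step, next_step)  # (1, 'convert') (2, 'validate')
```

Keeping the step status in a table means any node can see what is running or failed, and a backout can be expressed as resetting a step's status and replaying it.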

27 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

28 How do you document your schema?

29 Knuth's Literate Programming
Put documentation in the program.
Tool generates manual from program.
In the new world: tool generates online hypertext from program.
We annotated every table, view, function, … with tags:
–units
–comment
–reference to other tables (flags, star schema)
Program scans DDL, generates web site.

30
(… marks parameter and variable names lost in this transcript.)
    CREATE FUNCTION fGetNearbyObjXYZ (… float)
    --/H Returns table of primary objects … arcmins of an xyz
    --/T There is no limit on the number of objects returned, but there are about 40 per sq arcmin.
    --/T returned table:
    --/T   objID bigint PRIMARY KEY, -- Photo primary object identifier
    --/T   run int NOT NULL,         -- run that observed this object
    --/T   camcol int NOT NULL,      -- camera column that observed the object
    --/T   field int NOT NULL,       -- field that had the object
    --/T   rerun int NOT NULL,       -- computer processing run that discovered the object
    --/T   type int NOT NULL,        -- type of the object (3=Galaxy, 6= star, see PhotoType in DBconstants)
    --/T   cx float NOT NULL,        -- x,y,z of unit vector to this object
    --/T   cy float NOT NULL,
    --/T   cz float NOT NULL,
    --/T   htmID bigint,             -- Hierarchical Triangular Mesh id of this object
    --/T   distance float            -- distance in arc minutes to this object from the ra,dec.
    --/T Sample call to find PhotoObjects within 5 arcminutes of xyz -.996,-.1,0
    --/T
    --/T   select *
    --/T   from dbo.fGetNearbyObjXYZ(-.996,-.1,0,5)
    --/T
    --/T see also fGetNearbyObjEq, fGetNearestObjXYZ, fGetNearestObjXYZ
    … TABLE (
        objID bigint PRIMARY KEY,
        …

31
    CREATE TABLE Chunk (
    --/H Contains basic data for a Chunk
    --
    --/T A Chunk is a unit for SDSS data export.
    --/T It is a part of an SDSS stripe, a 2.5 degree wide cylindrical segment
    --/T aligned at a great circle between the survey poles.
    --/T A Chunk has had both strips completely observed. Since
    --/T the SDSS camera has gaps between its 6 columns of CCDs, each stripe has
    --/T to be scanned twice (these are the strips) resulting in 12 slightly
    --/T overlapping narrow observation segments.
    --/T Only those parts of a stripe are ready for export where the observation
    --/T is complete, hence the declaration of a chunk, usually consisting of 2 runs
        chunkNumber   int NOT NULL,          --/D Unique chunk identifier
        startMu       int NOT NULL,          --/D Starting mu value --/U arcsec
        endMu         int NOT NULL,          --/D Ending mu value --/U arcsec
        [stripe]      int NOT NULL,          --/D Stripe number
        exportVersion varchar(32) NOT NULL,  --/D Export Version
    )
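A scanner for these tags is short. The sketch below is a guess at the idea, not the SkyServer tool (which the deck describes as a VB program): it assumes `--/H` is a one-line header, `--/T` is descriptive text, `--/D` is a column description, and `--/U` is a unit, as the examples suggest. The function name `scan_ddl` and the output layout are hypothetical.

```python
import re


def scan_ddl(ddl):
    """Collect --/H and --/T text plus per-column --/D and --/U tags."""
    doc = {"header": "", "text": [], "columns": {}}
    for line in ddl.splitlines():
        code, sep, _ = line.partition("--/")
        if not sep:
            continue  # no documentation tag on this line
        # re.split with a capture group yields ['', tag, text, tag, text, ...]
        parts = re.split(r"--/([HTDU])", line[len(code):])
        tags = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
        if "H" in tags:
            doc["header"] = tags["H"]
        if "T" in tags:
            doc["text"].append(tags["T"])
        if "D" in tags and code.split():
            name = code.split()[0].strip("[],")  # first token is the column name
            doc["columns"][name] = {"description": tags["D"],
                                    "units": tags.get("U", "")}
    return doc


sample = """\
CREATE TABLE Chunk (
--/H Contains basic data for a Chunk
--/T A Chunk is a unit for SDSS data export.
    chunkNumber int NOT NULL, --/D Unique chunk identifier
    startMu int NOT NULL, --/D Starting mu value --/U arcsec
    [stripe] int NOT NULL, --/D Stripe number
)"""

doc = scan_ddl(sample)
print(doc["header"])              # Contains basic data for a Chunk
print(doc["columns"]["startMu"])  # {'description': 'Starting mu value', 'units': 'arcsec'}
```

Because the tags are ordinary SQL comments, the DDL still runs unchanged, and the documentation cannot drift far from the schema it describes.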

32 SkyServer Object Browser
VB program scans DDL, generates web site. It is VERY useful.
Free Text Search: VERY useful.

33 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL
Not shown: a C# web service (http://skyservice.pha.jhu.edu/SdssCutout), or better yet! http://skyservice.pha.jhu.edu/SdssCutout
