There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World. Jim Gray, Microsoft; Alex Szalay, Johns Hopkins U.

Presentation transcript:

1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

2

3 Background
I have been working with the astronomy community to build the World Wide Telescope:
–all telescope data federated in one internet-scale DB
–a great Web Services app
The work here is joint with Alex Szalay.
SkyServer.Sdss.Org is the first installment; SkyQuery.Net is the second installment (federated web services).

4 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

5 Spatial Data Access – SQL extension
Szalay, Kunszt, Brunner
Added a Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins.
Every object has a 20-deep Mesh ID.
Given a spatial definition, the routine returns up to 10 covering triangles.
A spatial query is then up to 10 range queries.
Fast: 1,000 triangles / second / GHz
[Figure: hierarchical triangular mesh; a triangle is subdivided into 2,0 2,1 2,2 2,3, and 2,3 in turn into 2,3,0 2,3,1 2,3,2 2,3,3.]

6 A typical call
-- find objects within 1 arcminute of (60,20)
select objID, ra, dec
from PhotoObj as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID and triangle.endHtmID   -- coarse filter
  and ...                                                         -- correct filter: careful distance test rejects false positives
-- or better yet
select objID, ra, dec, distance
from dbo.fGetNearbyObjEq(60,20,1)

7 Integration with CLR Makes it Nicer
Peter Kukol converted 500 lines of external stored procedure glue code to 50 lines of C# code.
Now we are converting the library to C#.
Also, Cross Apply is VERY useful:
select p.objID, count(*)
from PhotoObj p
cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
group by p.objID

8 But… Wanted a faster way to do this: some computations were taking toooooo long (see below). Wanted to define areas in relational form. Wanted a portable way that works on any relational system. So, developed a constraint database approach – see below.

9 The Idea: Equations Define Subspaces
For (x,y) above the line: $ax + by > c$
Reverse the space by: $-ax - by > -c$
Intersect 3 half-spaces:
$a_1 x + b_1 y > c_1$
$a_2 x + b_2 y > c_2$
$a_3 x + b_3 y > c_3$
[Figure: the line $ax + by = c$ in the (x, y) plane, with intercepts $x = c/a$ and $y = c/b$.]

10 The Idea: Equations Define Subspaces
$a_1 x + b_1 y > c_1$
$a_2 x + b_2 y > c_2$
$a_3 x + b_3 y > c_3$
select count(*) from convex where @x*a + @y*b < c
select count(*) from convex where @x*a + @y*b > c
[Figure: a triangle in the (x, y) plane formed by the three half-spaces.]

11 Domain is Union of Convex Hulls
Simple volumes are unions of convex hulls.
Higher order curves also work.
Complex volumes have holes and their holes have holes (that is harder).
[Figure: a region labeled "Not a convex hull".]

12 Now in Relational Terms
create table HalfSpace (
    domainID    int not null                 -- domain name
                foreign key references Domain(domainID),
    convexID    int not null,                -- grouping a set of ½ spaces
    halfSpaceID int identity,                -- a particular ½ space
    a           float not null,              -- the (a,b,..) parameters
    b           float not null,              -- defining the ½ space
    c           float not null,              -- the constraint (c above)
    primary key (domainID, convexID, halfSpaceID)
)
(x,y) is inside a convex if it is inside all lines of the convex.
Equivalently, (x,y) is inside a convex if it is NOT OUTSIDE ANY line of the convex.
Convexes containing point:
select convexID                  -- return the convex hulls
from HalfSpace                   -- from the constraints
where (@x * a + @y * b) < c      -- point outside the line?
group by all convexID            -- insist no line of convex
having count(*) = 0              -- is outside (count outside == 0)

13 All Domains Containing this Point
The group by is supported by the domain/convex index, so it's a sequential scan (pre-sorted!).
select distinct domainID             -- return domains
from HalfSpace                       -- from constraints
where (@x * a + @y * b) < c          -- point outside
group by all domainID, convexID      -- never happens
having count(*) = 0                  -- count outside == 0
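The containment test on the last two slides can be tried outside SQL Server. The following is a minimal sketch, not the authors' code: it uses Python's built-in sqlite3 module, and because SQLite has no `group by all`, it counts the violated half-spaces with a conditional sum instead. The table layout follows the HalfSpace table above; the sample domains and the function names `build_db` and `domains_containing` are assumptions made for illustration.

```python
import sqlite3

def build_db():
    """Create an in-memory HalfSpace table with two hypothetical domains."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        create table HalfSpace (
            domainID    integer not null,
            convexID    integer not null,
            halfSpaceID integer not null,
            a real not null, b real not null, c real not null,
            primary key (domainID, convexID, halfSpaceID))""")
    # Each row is one half-space a*x + b*y > c.
    rows = [
        (1, 1, 1,  1,  0,  0),   # domain 1: unit square, x > 0
        (1, 1, 2, -1,  0, -1),   #           x < 1
        (1, 1, 3,  0,  1,  0),   #           y > 0
        (1, 1, 4,  0, -1, -1),   #           y < 1
        (2, 1, 1,  1,  0,  0),   # domain 2: triangle, x > 0
        (2, 1, 2,  0,  1,  0),   #           y > 0
        (2, 1, 3, -1, -1, -4),   #           x + y < 4
    ]
    db.executemany("insert into HalfSpace values (?,?,?,?,?,?)", rows)
    return db

def domains_containing(db, x, y):
    """Return the domains with at least one convex that has no violated half-space."""
    sql = """
        select distinct domainID
        from HalfSpace
        group by domainID, convexID
        having sum(case when (? * a + ? * b) < c then 1 else 0 end) = 0
        order by domainID"""
    return [row[0] for row in db.execute(sql, (x, y))]

if __name__ == "__main__":
    db = build_db()
    print(domains_containing(db, 0.5, 0.5))
    print(domains_containing(db, 2.0, 1.0))
```

The conditional sum plays the role of `group by all ... having count(*) = 0`: a convex with no violated half-space still appears as a group, with a count of zero.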

14 The Algebra is Simple
    ... = spDomainNew ...
    ... = spDomainNewConvex ...
    ... = spDomainNewConvexConstraint ...
    select * from ... float)
Once constructed they can be manipulated with the Boolean operators:
    ... = spDomainOr ...
    ... = spDomainAnd ...
    ... = spDomainNot ... varchar(8000))

15 What! No Bounding Box?
Bounding box limits search: a subset of the convex hulls.
If the query runs at 3M half-spaces/sec then there is no need for a bounding box, unless you have more than 10,000 lines.
But, if you have a lot of half-spaces then a bounding box is good.

16 OK: solved Areas Contain Point? What about: Points near point?
Table-valued function finds points near a point:
–Select * from fGetNearbyEq(ra,dec,r)
Use Hierarchical Triangular Mesh:
–Space filling curve, bounding triangles…
–Standard approach
13 ms/call… So 70 objects/second.
Too slow, so pre-compute neighbors: materialized view.
At 70 objects/sec: takes 6 months to compute the materialized view on a billion objects.
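The arithmetic behind the "6 months" figure can be checked in a few lines. This is only a back-of-the-envelope sketch: the function names are made up here, and the inputs are the numbers quoted on the slide (13 ms per call, 70 objects per second, one billion objects).

```python
SECONDS_PER_DAY = 86_400

def calls_per_second(ms_per_call):
    """Throughput of a function that takes ms_per_call milliseconds per call."""
    return 1000.0 / ms_per_call

def days_to_process(n_objects, objects_per_second):
    """Wall-clock days needed to process n_objects at a fixed rate."""
    return n_objects / objects_per_second / SECONDS_PER_DAY

if __name__ == "__main__":
    # 13 ms/call is about 77 calls/s, which the slide rounds down to 70.
    print(round(calls_per_second(13)))
    # A billion objects at 70/s is about 165 days, i.e. roughly half a year.
    print(round(days_to_process(1_000_000_000, 70)))
```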

17 Zone Based Spatial Join
Divide space into zones.
Key points by (zone, offset) (on the sphere this needs a wrap-around margin).
Point search: look in a few zones at a limited offset, ra ± r: a bounding box that has 1-π/4 false positives.
All inside the relational engine.
Avoids impedance mismatch.
Can batch all-all comparisons.
33x faster and parallel: 6 days, not 6 months!
[Figure: a search circle of radius r crossing the zone boundary zoneMax; within each zone the ra window is Ra ± x, where x is computed from r, the distance to zoneMax, and cos(radians(zoneMax)).]
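The 1-π/4 figure follows from comparing the search circle with its bounding box. A short derivation, assuming a patch of sky small enough to treat as flat and points spread uniformly over the box:

```latex
\[
\frac{\text{area of box} - \text{area of circle}}{\text{area of box}}
  = \frac{(2r)^2 - \pi r^2}{(2r)^2}
  = 1 - \frac{\pi}{4}
  \approx 0.215
\]
```

So about 21% of the candidates that pass the bounding-box test lie outside the circle and have to be rejected by the exact distance test.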

18 In SQL: points near point
select o1.objID                    -- find objects
from zone o1                       -- in the zoned table
where o1.zoneID between ...        -- where zone #
                    and ...        -- overlaps the circle
  and o1.ra ...                    -- quick filter on ra
  and o1.dec between ... and ...   -- quick filter on dec
  and ( (sqrt( ...                 -- careful filter on distance
The quick filters form the bounding box; the careful distance filter eliminates the ~21% = 1-π/4 false positives.
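Because the query above lost its bound expressions in this transcript, here is a hedged Python sketch of the same three-stage idea (zone range, bounding box, exact distance). It is not the authors' SQL: the zone height value, the function names, and the widening of the ra window by 1/cos(dec) are assumptions, and the sketch ignores ra wrap-around at 0/360 degrees and the poles.

```python
import math
from collections import defaultdict

ZONE_HEIGHT = 0.5  # degrees per zone; a hypothetical value

def unit_vector(ra, dec):
    """Cartesian unit vector for a point given in degrees."""
    ra_r, dec_r = math.radians(ra), math.radians(dec)
    return (math.cos(dec_r) * math.cos(ra_r),
            math.cos(dec_r) * math.sin(ra_r),
            math.sin(dec_r))

def zone_id(dec):
    """Zone number of a declination: horizontal stripes of ZONE_HEIGHT degrees."""
    return math.floor((dec + 90.0) / ZONE_HEIGHT)

def build_zone_table(objects):
    """Group (objID, ra, dec) rows by zone, like a table keyed on zoneID."""
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[zone_id(dec)].append((obj_id, ra, dec, unit_vector(ra, dec)))
    return zones

def nearby(zones, ra, dec, r):
    """Return sorted objIDs within r degrees of (ra, dec)."""
    cx, cy, cz = unit_vector(ra, dec)
    max_chord = 2.0 * math.sin(math.radians(r) / 2.0)  # angle r as a chord length
    # Widen the ra window away from the equator (assumed conservative bound).
    x = r / math.cos(math.radians(min(abs(dec) + r, 89.9)))
    hits = []
    for z in range(zone_id(dec - r), zone_id(dec + r) + 1):  # zones overlapping the circle
        for obj_id, o_ra, o_dec, (ox, oy, oz) in zones.get(z, ()):
            if not (ra - x <= o_ra <= ra + x):     # quick filter on ra
                continue
            if not (dec - r <= o_dec <= dec + r):  # quick filter on dec
                continue
            chord = math.sqrt((ox - cx) ** 2 + (oy - cy) ** 2 + (oz - cz) ** 2)
            if chord < max_chord:                  # careful filter on distance
                hits.append(obj_id)
    return sorted(hits)
```

In a database the inner loop would be an index range scan on (zoneID, ra) rather than a linear pass over the zone; the linear pass keeps the sketch short.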

19 Quantitative Evaluation
7x faster than external stored proc (linkage is expensive).
[Figure: time vs. radius for neighbors at various zone heights.] Any small zone height is adequate.
[Figure: time vs. zone height at various radii.] A zoneHeight of 4 is near-optimal.

20 All Neighbors of All points (can Batch Process the Joins)
A 5x additional speedup (35x in total).
... in {-1, 0, 1}
The example ignores some spherical geometry details covered in the paper.
insert neighbors                          -- insert one zone's neighbors
select o1.objID as objID,                 -- object pairs
       o2.objID as NeighborObjID, ..      -- other fields elided
from zone o1 join zone o2                 -- join 2 zones
  on ... = o2.zoneID                      -- using zone number and ra
 and o2.ra between o1.ra ... and o1.ra ...   -- points near ra
where ...                                 -- elided margin logic, see paper
  and o2.dec between ... and ...          -- quick filter on dec
  and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2)) ...   -- careful filter on distance
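The batch join can be illustrated with a small self-join in SQLite. This is a sketch under stated assumptions, not the query from the slide: it works on a flat patch (no spherical geometry, no wrap-around margin), requires the zone height to be at least r so that zone offsets -1, 0, 1 are sufficient, and compares squared distances to avoid depending on a SQL `sqrt`. The function name `all_neighbors` and the table columns are made up here.

```python
import math
import sqlite3

def all_neighbors(points, r, zone_height):
    """Return sorted (objID, neighborObjID) pairs closer than r on a flat patch.

    points: list of (objID, ra, dec). Requires zone_height >= r so that
    looking one zone up and one zone down covers the whole search circle.
    """
    if zone_height < r:
        raise ValueError("zone_height must be at least r")
    db = sqlite3.connect(":memory:")
    db.execute("""create table zone (
                      zoneID integer, ra real, dec real, objID integer,
                      primary key (zoneID, ra, objID))""")
    db.executemany(
        "insert into zone values (?,?,?,?)",
        [(math.floor(dec / zone_height), ra, dec, oid) for oid, ra, dec in points])
    db.execute("create table neighbors (objID integer, neighborObjID integer)")
    for delta in (-1, 0, 1):  # same zone, and the zones just below and above
        db.execute("""
            insert into neighbors
            select o1.objID, o2.objID
            from zone o1 join zone o2
              on o1.zoneID + ? = o2.zoneID                 -- using zone number
             and o2.ra between o1.ra - ? and o1.ra + ?     -- points near ra
            where o2.dec between o1.dec - ? and o1.dec + ? -- quick filter on dec
              and o1.objID <> o2.objID
              and (o1.ra - o2.ra) * (o1.ra - o2.ra)
                + (o1.dec - o2.dec) * (o1.dec - o2.dec) < ? * ?  -- careful filter
            """, (delta, r, r, r, r, r, r))
    return sorted(db.execute("select objID, neighborObjID from neighbors").fetchall())
```

Each of the three inserts is one set-oriented join, which is what lets the engine sort, scan, and parallelize the work instead of making one function call per object.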

21 Spatial Stuff Summary
Easy:
–Point in polygon
–Polygons containing points
–(instance and batch)
Works in higher dimensions.
Side note: spherical polygons are
–hard in 2-space
–easy in 3-space

22 Spatial Stuff Summary
Constraint databases are in:
–Streams (data is query, query is in DB)
–Notification: subscription in DB, data is query
–Spatial: constraints in DB, data is query
You can express constraints as rows. Then you:
–Can evaluate LOTS of predicates per second
–Can do set algebra on the predicates
Benefits from SQL parallelism.
SQL == Prolog // DataLog?

23 References
Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database
A Purely Relational Way of Computing Neighbors on a Sphere

24 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

25 Loading consists of Many Tasks
Very simply: capture, analyze to produce catalog info, convert to SQL, validate, import, index.
Each of these steps has many sub-steps.
We learned from the TerraServer that:
(1) LOADING IS WHERE THE TIME GOES.
(2) You get to load it again when you discover better data, you discover a bug in the data, or you discover a better design.
(3) Essentially, you are always loading or preparing to load.
Everyone knows this but you have to experience it to grasp it.
[Figure: pipeline from Telescope Observation to Generate Catalogs to Validate Catalogs to Load to Database.]

26 The SkyServer Load Manager
Built a workflow engine with SQL Agent (our batch job scheduler) and DTS.
State machine is in the database, visible to all worker nodes.
Workers at each node pull work.
Step is a stored procedure:
–logs to load monitor database
–3 levels of logging: job, step, phase
–Logs and reporting are VERY useful
Automatic (and some manual) backout to fix problems.
Demo
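The "state machine in the database, workers pull work" pattern can be sketched in a few lines. This is an illustration only and has nothing to do with SQL Agent or DTS: the table names, the state names, and the functions `make_db`, `add_steps`, and `pull_and_run` are all assumptions, and SQLite stands in for the shared load-monitor database.

```python
import sqlite3

def make_db():
    """Create a hypothetical step table (the state machine) and a log table."""
    db = sqlite3.connect(":memory:")
    db.execute("""create table step (
                      stepID integer primary key,
                      job text not null, name text not null,
                      state text not null default 'pending',
                      worker text)""")
    db.execute("create table log (stepID integer, worker text, message text)")
    return db

def add_steps(db, job, names):
    """Queue the steps of one load job in order."""
    with db:
        db.executemany("insert into step (job, name) values (?, ?)",
                       [(job, n) for n in names])

def pull_and_run(db, worker, actions):
    """Claim the oldest pending step, run it, and log the outcome.

    Returns the step name, or None when no work is left.
    """
    with db:  # claim the step in one transaction
        row = db.execute("""select stepID, name from step
                            where state = 'pending'
                            order by stepID limit 1""").fetchone()
        if row is None:
            return None
        step_id, name = row
        db.execute("update step set state = 'running', worker = ? where stepID = ?",
                   (worker, step_id))
    try:
        actions[name]()            # the step body (a stored procedure in the talk)
        state, message = "done", "ok"
    except Exception as exc:       # record the failure so it can be backed out later
        state, message = "failed", repr(exc)
    with db:
        db.execute("update step set state = ? where stepID = ?", (state, step_id))
        db.execute("insert into log values (?, ?, ?)",
                   (step_id, worker, f"{name}: {message}"))
    return name
```

Keeping both the state and the log in tables is what makes the reporting mentioned on the slide cheap: a status page is just a query.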

27 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL

28 How do you document your schema?

29 Knuth's Literate Programming
Put documentation in the program.
Tool generates manual from program.
In the new world: tool generates online hypertext from program.
We annotated every table, view, function, .. with tags:
–units
–comment
–reference to other tables (flags, star schema)
Program scans DDL, generates web site.

30
CREATE FUNCTION fGetNearbyObjXYZ (... float)
--/H Returns table of primary objects within ... arcmins of an xyz ...
--/T There is no limit on the number of objects returned, but there are about 40 per sq arcmin.
--/T returned table:
--/T objID bigint PRIMARY KEY, -- Photo primary object identifier
--/T run int NOT NULL,         -- run that observed this object
--/T camcol int NOT NULL,      -- camera column that observed the object
--/T field int NOT NULL,       -- field that had the object
--/T rerun int NOT NULL,       -- computer processing run that discovered the object
--/T type int NOT NULL,        -- type of the object (3=Galaxy, 6=star, see PhotoType in DBconstants)
--/T cx float NOT NULL,        -- x,y,z of unit vector to this object
--/T cy float NOT NULL,
--/T cz float NOT NULL,
--/T htmID bigint,             -- Hierarchical Triangular Mesh id of this object
--/T distance float            -- distance in arc minutes to this object from the ra,dec.
--/T Sample call to find PhotoObjects within 5 arcminutes of xyz -.996,-.1,0
--/T
--/T select *
--/T from dbo.fGetNearbyObjXYZ(-.996,-.1,0,5)
--/T
--/T see also fGetNearbyObjEq, fGetNearestObjXYZ, fGetNearestObjXYZ
... TABLE (
    objID bigint PRIMARY KEY,

31
CREATE TABLE Chunk (
--/H Contains basic data for a Chunk
--
--/T A Chunk is a unit for SDSS data export.
--/T It is a part of an SDSS stripe, a 2.5 degree wide cylindrical segment
--/T aligned at a great circle between the survey poles.
--/T A Chunk has had both strips completely observed. Since
--/T the SDSS camera has gaps between its 6 columns of CCDs, each stripe has
--/T to be scanned twice (these are the strips) resulting in 12 slightly
--/T overlapping narrow observation segments.
--/T Only those parts of a stripe are ready for export where the observation
--/T is complete, hence the declaration of a chunk, usually consisting of 2 runs
    chunkNumber   int NOT NULL,         --/D Unique chunk identifier
    startMu       int NOT NULL,         --/D Starting mu value --/U arcsec
    endMu         int NOT NULL,         --/D Ending mu value --/U arcsec
    [stripe]      int NOT NULL,         --/D Stripe number
    exportVersion varchar(32) NOT NULL, --/D Export Version
)
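A program that "scans DDL, generates web site" starts by pulling the tagged comments out of the DDL. The talk's tool was a VB program; the following is only a hypothetical Python sketch of that first step. It assumes the tag meanings suggested by the two examples above (`--/H` header, `--/T` free text, `--/D` column description, `--/U` unit), and the function name and output shape are made up here.

```python
import re

# One tag and its text, ending at the next tag or at the end of the line.
TAG = re.compile(r"--/([HTDU])\s*(.*?)\s*(?=--/[HTDU]|$)")
CREATE = re.compile(r"\s*CREATE\s+(TABLE|VIEW|FUNCTION)\s+(\w+)", re.IGNORECASE)

def parse_ddl_docs(ddl):
    """Collect the documentation tags embedded in one CREATE statement."""
    doc = {"kind": None, "name": None, "header": "", "text": [], "columns": {}}
    for line in ddl.splitlines():
        created = CREATE.match(line)
        if created:
            doc["kind"] = created.group(1).lower()
            doc["name"] = created.group(2)
        code = line.split("--", 1)[0].strip()   # the SQL before any comment
        for letter, text in TAG.findall(line):
            if letter == "H":
                doc["header"] = text
            elif letter == "T":
                doc["text"].append(text)
            elif code:
                # --/D and --/U describe the column declared on the same line.
                column = code.split()[0].strip("[]")
                entry = doc["columns"].setdefault(
                    column, {"description": "", "unit": ""})
                entry["description" if letter == "D" else "unit"] = text
    return doc
```

From a dictionary like this, generating one hypertext page per table or function is a matter of templating.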

32 SkyServer Object Browser VB program scans DDL, generates web site. It is VERY useful Free Text Search VERY useful

33 Outline
How to do spatial lookup:
–The old way: HTM
–The new way: zoned lookup
A load workflow system
Embedded documentation: literate programming for SQL DDL
Not shown: A C# web service ( or better yet!