
1 Database Management Systems: part of the solution or part of the problem? Clive Page 2004 April 28.



2 Cone search
Easy to provide cone-search using either:
– Spatial index, e.g. R-tree, available with most DBMS
– Pixel-code method, e.g. HTM or HEALPix.
Problems:
– Scalability: the search for brown dwarfs in the Hyades had an initial SELECT returning 8 million rows.
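A cone search can be illustrated without any index at all. The following is a minimal Python sketch, not the talk's implementation: it assumes rows are dicts with `ra` and `decl` in degrees (hypothetical field names), applies a cheap declination cut of the kind an index would serve, and then tests the exact great-circle separation.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(rows, ra0, dec0, radius_deg):
    """Return the rows lying within radius_deg of (ra0, dec0).

    Assumption: each row is a dict with 'ra' and 'decl' in degrees.
    """
    hits = []
    for row in rows:
        # Cheap pre-filter on declination; a real DBMS would use an index here.
        if abs(row["decl"] - dec0) > radius_deg:
            continue
        if angular_separation_deg(row["ra"], row["decl"], ra0, dec0) <= radius_deg:
            hits.append(row)
    return hits
```

Every row is still visited once, so this scales linearly with catalogue size; the spatial index or pixel code exists precisely to avoid that.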

3 Distributed Cone Search
Several “standard” protocols for cone-search over the web:
– CGI-based:
  GLU from CDS: http://simbad.u-strasbg.fr/glu/glu.htx
  US-NVO protocol: http://www.us-vo.org/metadata/conesearch/index.html
– XML-based: ADQL has a function called REGION
Results are hard to combine or merge because of the lack of compatible metadata; UCDs are not yet widespread.

4 Cross-matching Catalogues
Easy with true spatial indexing, e.g. R-trees, but join syntax is very DBMS-specific. Can write a range of back-ends.
Harder if only a pixel-code method (e.g. HTM) is in use:
– Algorithm more complex
– User needs to have privilege to CREATE INDEX
– Slower and less scalable than a join with a spatial index.
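To make the cross-match step concrete, here is a small Python sketch. It uses neither an R-tree nor HTM; as a plainly simpler stand-in it sorts one catalogue by declination and uses binary search to bound the candidates for each source. The `(id, ra, dec)` tuple layout is an assumption made for illustration.

```python
import bisect
import math

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(cat_a, cat_b, radius_deg):
    """Pair each cat_a source with every cat_b source within radius_deg.

    Assumption: catalogues are lists of (id, ra, dec) tuples in degrees.
    """
    b_sorted = sorted(cat_b, key=lambda s: s[2])   # order by declination
    b_decs = [s[2] for s in b_sorted]
    pairs = []
    for id_a, ra_a, dec_a in cat_a:
        # Binary search restricts candidates to a declination strip.
        lo = bisect.bisect_left(b_decs, dec_a - radius_deg)
        hi = bisect.bisect_right(b_decs, dec_a + radius_deg)
        for id_b, ra_b, dec_b in b_sorted[lo:hi]:
            if separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                pairs.append((id_a, id_b))
    return pairs
```

The strip still spans all right ascensions, which is one reason a two-dimensional spatial index scales better on dense catalogues.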

5 Cross-match cases
Cross-match of user’s own table with standard catalogues:
– Needs table upload: formats? FITS, VOTable, TSV…
– Need to generate error-box columns, e.g. using
  ALTER TABLE t ADD COLUMN (errbox box);
  UPDATE t SET errbox = box(whatever);
Cross-match of two standard catalogues, or sample from one catalogue cross-matched with another:
– Easy if stored within same DBMS
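The error-box idea can be tried out in SQLite through Python's `sqlite3` module. This is only a sketch under stated assumptions: SQLite has no `box` type, so four plain REAL columns stand in for it; the table names, column names and sample values are invented; `cosd` is a helper registered here, not a built-in; and RA wrap-around at 0/360 degrees is ignored.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical helper: cosine of an angle given in degrees.
conn.create_function("cosd", 1, lambda deg: math.cos(math.radians(deg)))

conn.executescript("""
CREATE TABLE user_src (id TEXT, ra REAL, decl REAL, err_deg REAL);
CREATE TABLE catalogue (id TEXT, ra REAL, decl REAL);
INSERT INTO user_src VALUES ('u1', 66.70, 15.80, 0.002), ('u2', 67.50, 16.40, 0.002);
INSERT INTO catalogue VALUES ('c1', 66.7005, 15.8004), ('c2', 66.90, 15.80), ('c3', 67.51, 16.40);
-- Four REAL columns play the role of the single box-typed column.
ALTER TABLE user_src ADD COLUMN ra_lo REAL;
ALTER TABLE user_src ADD COLUMN ra_hi REAL;
ALTER TABLE user_src ADD COLUMN dec_lo REAL;
ALTER TABLE user_src ADD COLUMN dec_hi REAL;
""")

# Fill the error box; the RA half-width is widened by 1/cos(decl).
conn.execute("""
UPDATE user_src
SET ra_lo = ra - err_deg / cosd(decl), ra_hi = ra + err_deg / cosd(decl),
    dec_lo = decl - err_deg,           dec_hi = decl + err_deg
""")

# Cross-match: a catalogue source matches if it falls inside the box.
matches = conn.execute("""
SELECT u.id, c.id
FROM user_src AS u JOIN catalogue AS c
  ON c.ra BETWEEN u.ra_lo AND u.ra_hi
 AND c.decl BETWEEN u.dec_lo AND u.dec_hi
ORDER BY u.id, c.id
""").fetchall()
```

With these sample rows only `u1` and `c1` fall in the same box. Note that the user needs ALTER TABLE and UPDATE rights on the uploaded table for this to work.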

6 Distributed Cross-match
Often need all or most of the information in the smaller table to be included in the results: easier to copy the table as a unit.
DBMS join algorithms need a lot of network activity and do not perform well over the wide-area network:
– Tests using Postgres+dblink show speeds about 7 times lower than when both tables are in the same DBMS
Could be done with a series of cone-searches plus a merger of the results. Not yet tested, but also bound to be slow.

7 Analysis of Brown Dwarf Search in Hyades Cluster
Most naturally done incrementally, e.g.
1. Select stars in the right patch of sky (cone-search) from USNO-B
2. Select from them stars with proper-motion vector in the right range
3. Compute projected proper-motion
4. Cross-match with 2MASS
5. Select stars with appropriate redness.

8 SQL for steps (1) and (2)
SELECT * FROM usnob
WHERE REGION('Circle J2000 66.72 15.87 13.0')
  AND pmra BETWEEN -150 AND -50
  AND pmdec BETWEEN 0 AND 100
  -- cosine rule: cos(angle) = (sin(dec0) - sin(decl) * cos(d)) / (cos(decl) * sin(d)),
  -- where cos(d) = sin(decl) * sin(dec0) + cos(decl) * cos(dec0) * cos(ra0 - ra)
  AND acos(
        (sin(radians(7.6))
         - sin(radians(decl))
           * (sin(radians(decl)) * sin(radians(7.6))
              + cos(radians(decl)) * cos(radians(7.6)) * cos(radians(94.2 - ra))))
        / (cos(radians(decl))
           * sin(acos(sin(radians(decl)) * sin(radians(7.6))
                      + cos(radians(decl)) * cos(radians(7.6)) * cos(radians(94.2 - ra)))))
      )
      BETWEEN atan2(pmra, pmdec) - 1.5 * sqrt(ra_err*ra_err + dec_err*dec_err) / sqrt(pmra*pmra + pmdec*pmdec)
          AND atan2(pmra, pmdec) + 1.5 * sqrt(ra_err*ra_err + dec_err*dec_err) / sqrt(pmra*pmra + pmdec*pmdec);
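The long acos(...) term is easier to check outside SQL. The Python sketch below evaluates the spherical cosine-rule expression for the angle at a star between north and the direction to a reference point; reading the constants 94.2 and 7.6 in the query as the (ra0, dec0) of that reference point is an interpretation, and the function name is invented for illustration.

```python
import math

def position_angle_deg(ra, dec, ra0, dec0):
    """Angle at (ra, dec) between north and the direction to (ra0, dec0).

    Uses cos(A) = (sin(dec0) - sin(dec) * cos(d)) / (cos(dec) * sin(d)),
    where d is the separation between the two points. All angles in degrees.
    acos yields 0..180 degrees, so east and west are not distinguished here.
    Undefined when the two points coincide or the star sits at a pole.
    """
    ra, dec, ra0, dec0 = map(math.radians, (ra, dec, ra0, dec0))
    cos_d = (math.sin(dec) * math.sin(dec0)
             + math.cos(dec) * math.cos(dec0) * math.cos(ra0 - ra))
    cos_d = max(-1.0, min(1.0, cos_d))
    sin_d = math.sqrt(1.0 - cos_d * cos_d)          # same as sin(acos(cos_d))
    cos_pa = (math.sin(dec0) - math.sin(dec) * cos_d) / (math.cos(dec) * sin_d)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_pa))))
```

Simple cases make a useful sanity check: a reference point due north of the star gives 0 degrees, and one due east on the equator gives 90 degrees.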

9 User Facilities Needed
SELECT INTO new-table (within DBMS)
UPDATE table SET column = expression
CREATE and DROP table
Probably: CREATE and DROP index
Separate namespace for each user
– Can be done with Schemas of DB2 and Postgres
Export of tables to MySpace or user’s computer in suitable format
Do we want some query builder to aggregate simple selections into monster SQL?
Problems:
– ADQL only supports SELECT so far
– But JDBC (probably) supports all these statements

10 Non-positional Queries
Simple example:
– SELECT * FROM table WHERE (bmag - vmag) > 1.5;
Fast only if an index exists on (bmag - vmag)
– Infeasible in tables with many parameter combinations
A great many such queries will need a scan of the whole table.
For tables the size of USNO-B or 2MASS this takes around 30 to 60 minutes, even on a fast DBMS.

11 Speeding up scanning queries
Two obvious ways of storing data in a table:
– Row-based: makes transactions easy; used by all RDBMS
– Column-based: better for read-only tables and queries involving only a few columns out of many; used by hardly any packages.
Store data in a binary file, not in the DBMS
– Do both: reduces time to scan the whole of one column of USNO-B by a factor of 80, e.g. from an hour to under one minute.
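Column-based binary storage is simple to sketch in Python with the standard `array` module. The file layout here (one flat file of native 4-byte floats per column) is an assumption for illustration, not the format used in the tests quoted above.

```python
import array
import os

def write_column(path, values, typecode="f"):
    """Store one column as a flat binary file of fixed-width values."""
    with open(path, "wb") as fh:
        array.array(typecode, values).tofile(fh)

def read_column(path, typecode="f"):
    """Load one column back; only this column's bytes are read from disk."""
    col = array.array(typecode)
    n_items = os.path.getsize(path) // col.itemsize
    with open(path, "rb") as fh:
        col.fromfile(fh, n_items)
    return col

def red_rows(bmag_path, vmag_path, threshold=1.5):
    """Row numbers where (bmag - vmag) exceeds threshold, touching two columns only."""
    bmag = read_column(bmag_path)
    vmag = read_column(vmag_path)
    return [i for i, (b, v) in enumerate(zip(bmag, vmag)) if b - v > threshold]
```

A query such as (bmag - vmag) > 1.5 then reads two small files instead of every full row, which is where the speed-up for wide tables comes from.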

12 Sybase-IQ
Sybase-IQ uses column-based storage, efficient data formats, and advanced indexing methods, e.g. bit-mask indices; it supports the same SQL as regular Sybase-ASE.
But…
– Not yet available on Linux, only Solaris
– Speed apparently only a few times better than RDBMS
– Has no spatial indexing
– List price £35,000/cpu, or AstroGrid site licence for “under £2M”.

13 Use of parallel hardware?
Most DBMS are not designed to exploit simple PC clusters.
ORACLE/RAC needs a special “shared-everything” cluster hardware configuration.
Many DBMS support replication, but only to improve resilience by fail-over to another node:
– Transactions are very hard to distribute across a cluster.
– Read-only databases are not of much commercial interest.
It would be interesting to try do-it-yourself parallelism, e.g.
– Install DBMS on each node of a Beowulf cluster
– Load a section of the large catalogue on each node
– Gather and merge results from distributed queries on master node.
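The do-it-yourself scheme amounts to scatter and gather. In the Python sketch below, worker threads stand in for cluster nodes and in-memory lists stand in for each node's DBMS; the declination-band partitioning and the dict row layout are assumptions, chosen only to show the shape of the idea.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partition_by_dec(rows, n_nodes):
    """Split rows (dicts with 'decl' in degrees) into n_nodes equal-width declination bands."""
    width = 180.0 / n_nodes
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        idx = min(int((row["decl"] + 90.0) / width), n_nodes - 1)
        nodes[idx].append(row)
    return nodes

def query_node(node_rows, predicate):
    """Work done on one node: filter its slice and return it sorted by id."""
    return sorted((r for r in node_rows if predicate(r)), key=lambda r: r["id"])

def scatter_gather(nodes, predicate):
    """Master node: send the query to every node, then merge the sorted partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda rows: query_node(rows, predicate), nodes))
    return list(heapq.merge(*partials, key=lambda r: r["id"]))
```

Equal-width bands hold unequal numbers of sources, so a real partitioning would probably balance by row count; a cone search would also need to be routed only to the bands it overlaps.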

14 Other DBMS problems
RDBMS have very limited statistical functionality, and no graphical output or facilities for visualisation.
– Can solve by exporting data to other packages, but:
  Awkward
  Slow
  Loses metadata

15 Other ADQL enhancements needed
Syntax for cross-match is not yet mature:
– Match criterion uses N “sigma”; should use probability.
– Cannot specify outer join (report unmatched sources).
Need to support physical units: SELECT * FROM table WHERE properMotion > 100.0;
– Have standard units for each UCD?
– Units to be specified with each constant?
Provide users with an estimate of running time, e.g. using the commonly provided EXPLAIN statement.
Need time-out on long-running queries.
Need exception-handling mechanism.
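A time-out of the kind asked for can be sketched with SQLite's progress handler, reached through Python's `sqlite3` module. SQLite is used here only as a convenient stand-in; other DBMS expose different mechanisms, and the function name `run_with_timeout` is invented for illustration.

```python
import sqlite3
import time

def run_with_timeout(conn, sql, timeout_s):
    """Run sql on conn; raise TimeoutError if it runs longer than timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 virtual-machine instructions;
    # a non-zero return value aborts the running statement.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {timeout_s} s") from exc
        raise  # a genuine SQL error, not a time-out
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler again
```

Turning the low-level interruption into a distinct TimeoutError also hints at the exception-handling side: the caller can tell a query that was cut off from one that was simply wrong.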

