Presentation on theme: "Introduction to SQL and ADQL Tom McGlynn NASA/GSFC (with thanks to Maria Nieto- Santisteban and Gretchen Greene)"— Presentation transcript:
Introduction to SQL and ADQL Tom McGlynn NASA/GSFC (with thanks to Maria Nieto- Santisteban and Gretchen Greene)
What are SQL and ADQL? SQL (sequel sometimes) is Structured Query Language a standard for creating and getting information from relational databases. ADQL (Astronomy Data Query Language) is an adaptation of SQL to meet specific needs for astronomical queries, especially positional queries. These enable astronomers to make sophisticated queries of astronomical databases.
Lots of places to find info NVO Book chapters on SQL Web site with lots of links about SQL: – On-line tutorials – – – – GIYF ADQL standard – SDSS Online database – OpenSkyQuery –
Background: The kinds of databases Network databases: pointers –Internal data in programs Hierarchical database: structure –XML files Relational databases: common indices –Relational database management systems
Network database Dora Supervisor Jamie CMS Joe Salary: 75,000 Admin Roles Personnel Team The WorldWideWeb is the most successful of all network databases.
Hierarchical Database 75,000 … …. With XML, hierarchical databases are making a comeback. (e.g., Carnivore registry)
Relational Database Personnel UID Name Salary 1 Dora 75,000 2 Joe 30,000 3 Jamie 66,000 … Teams TeamID Name LeadUID 1 CMS 1 2 Test 13 … TeamMembers TeamID UID …
What about astronomy? Relational DBMSs used by all major astronomical data providers: –ADS, MAST, IPAC, SDSS, … –Only RDBMS can scale to size of modern astronomical tables (100s of columns, 10 9 rows) –Flexible interactions between tables Standard SQL provides limited support for positional queries Some RDBMSs have support for objects
RDBMS servers MySQL: free, widely used, fast Postgres: free widely used, better standards compliance, object support Sybase (commercial) SQLServer (Microsoft) Gazillions more MySQL installation at NOAO available for use in summer school projects -- or just download a copy to run on your machine. (http://dev.mysql.com/downloads/)
Web availability CASJOBS –SQL based, SDSS database. Allows users to generate their own tables. OpenSkyQuery –ADQL based, lots of missions but more fragile.
Using CASJOBS 1.Connect to CASJOBS web site: 1.http://casjobs.sdss.org/CasJobs/http://casjobs.sdss.org/CasJobs/ 2.Get account or login 3.Build and query dat This is a production service and does not always respond in webtime.
Basic SQL commands CREATE TABLE tablename (col1 type1, col2 type2, …) DROP TABLE tablename; INSERT INTO TABLE tablename (col1,col2,…) VALUES(val1,val2,…) DELETE FROM TABLE tablename WHERE condition UPDATE tablename SET col1=val1,col2=val2,… WHERE condition SELECT fields FROM tables WHERE conditions ORDER BY col1,col2
Create an account in CASJOBS
Select MYDB to get access your private database Then use CREATE TABLE command Table name Column names Column types MySQL doesnt like this name.Must be escaped as `dec`
Types Numeric types –int,bigint,smallint –real,float –As in C, the size of types is not well standardized –Typically lots of aliases for various sizes of integers and floating point numbers Character types: char(n), varchar(n), text –Use varchar for long, variable length strings –Use text for very long strings that you wont need to compare with others (e.g., file content) Business oriented types (money, dates, decimal values) –Some are occasionally useful
Add data to small table: The INSERT command Use single quotes for strings. Values in same order as in create statement. RA,Dec normally stored as decimal degrees
How do I get rid of rows in a table? DELETE FROM table WHERE conditions DELETE FROM stars WHERE ra is null To delete all rows (but not delete the table entirely) DELETE from table
How do I modify a table? Update values in existing rows: –UPDATE table SET field=value,field=value,… WHERE condition UPDATE stars SET ra= , dec= WHERE starid=49 Adding columns. –Not supported by standard SQL. Create a new table with new columns and copy old values. Some DBs support ALTER TABLE … ADD COLUMN … Copying tables (differs a bit from DMBS to DBMS) –Create second table then SELECT INTO table2 * FROM table1 or INSERT INTO TABLE2 (SELECT * FROM table1)
Null values: no name for these stars Specify parameters to fill Include an explicit null in the list
Whats in the table? Status of my tables Lets try to query the table we created: use the SELECT command
But it fails! Try again… This query works. CasJobs doesnt like null values. The lesson is that one needs to be chary of nulls (and that CasJobs is not a full featured DB). CasJobs put the results in another table for us. We click on MYDB in the top bar, then the created table name, then on Sample to see the results.
The SELECT statement The SELECT statement is used to query and existing table or set of tables A typical query is: SELECT field1,field2,… FROM table1, table2, … WHERE condition1 AND/OR condition2 AND/OR … ORDER BY sortfield1,sortfield2,… The list of fields to be returned can usually be specified as * to get all the fields in the table. There are also GROUP BY and HAVING clauses for advanced queries.
The FROM clause The table, or list of tables, to be queried. –… FROM mystars … There may be a database specified, a collection of related tables –… FROM mydb.mystars … Each table in the list may have an alias that can be used to identify that table elsewhere – … FROM mystars m ….
Fields (all other clauses) Can be: Simple name of column in table: mag Expression: ra-dec A constant: , or pi You can name the result if you like using AS, e.g. select ra AS myra,ra-dec AS diff from mystars will return columns named myra and diff When there are multiple tables, table columns can be distinguished by table names and aliases. –Select t1.ra-t2.ra from mystars t1, yourstars, t2 where t1.id=t2.id –Select mystars.ra-yourstars.ra from mystars,yourstars where mystars.id=yourstars.id One difference between SQL and ADQL is that ADQL requires all column names to specify their origin table, not just where there is an ambiguity.
Conditions Comparisons: –mag < 6 –mag between 4 and 6 –name=star1 –ra < 3*dec –sin(ra)*cos(dec)-cos(ra)*sin(dec) >.234 –name like star% % is the wild card. _ matches one character. Lists –name in (star1, star2, star2) Null tests: –name is null Negation –not between 4 and 6 –name is not null Compound conditions –(mag 10 and mag>2*flux)
Querying multiple table: Joins Joins join the rows in two or more tables –One to one relationship The two tables are logically one bigger table A salary table and address table with one row for each employee. –Many to one Components Checks to Checking account –Each check belongs to exactly one account but each account can have many checks –Many to Many Observations/Sources –An single observation may include many sources –A single source may be seen in many observations
How does a query work… Table1 a11,a12,a13,… a21,a22,a23,.. a31,a32,a33,.. … Table2 b11,b12,b13,… b21,b22,b23,.. b31,b32,b33,.. … Naïve view: First looking at the FROM clause the database creates the product of all the rows of all the participating tables. If there are n rows in table 1 and m in table 2, then there are nxm in the join. For many way joins this can very large very quickly! Data base optimization is all about trying to reduce the number of rows that have to be looked at. a11,a12,a13…,b11,b12,b13… a11,a12,a13,…b21,b22,b23,… a11,a12,a13,…b31,b32,b33,… a21,a22,a23,…b11,b12,b13,… a31,a32,a33,…b11,b12,b13,… …
WHERE clause: Filtering the intermediate table The WHERE clause filters the rows in the product table. If a condition involves only one table it can be applied to the inputs before cross-product. If a condition applies to both tables, they can sometimes be organized (indexed) such that only a few rows in one table need to be checked for each row in the other.
Specifying the output The selection list tells the DBMS which columns to include (or generate) on the output The sort tells the DBMS the order.
Astronomy and RDBMSs RDBMSs expect exact joins on keys (user id, SSNs, product number) which matches business use. –Sometimes correct for astronomical tables, e.g., targets in a proposal. Many astronomical joins are soft: nearby in position and/or time looking for counterparts. –1-D is not a problem, but 2-D soft joins are hard. –GIS systems provide some non-standard features –ADQL extensions –Indexing of tables and planning of queries requires lots of work to do efficiently.
Create sample tables create table pets (owner varchar(12), pet varchar(12), species varchar(12)) insert into pets values('fred', 'fido', 'dog') insert into pets values('joe', 'ruff', 'dog') insert into pets values('mary', 'ears', 'dog') insert into pets values('andie', 'max', 'dog') insert into pets values('joe', 'silvertip', 'cat') insert into pets values('marlene', 'buster', 'cat') insert into pets values('bill', 'meow', 'cat') insert into pets values('simone', 'furball', 'cat') insert into pets values('mary', 'silver', 'horse') insert into pets values('nell', 'fairmane', 'horse') insert into pets values('effie', 'belle', 'horse') insert into pets values('nell', 'slither', 'salamander') insert into pets values('nell', 'hairy', 'tarantula') create table vets (vet varchar(12), species varchar(12)) insert into vets values('merriwether', 'dog') insert into vets values('parell', 'dog') insert into vets values('parell', 'cat') insert into vets values('parell', 'horse') insert into vets values('nestor', 'salamander')
Sample Queries: Which vets should each owner know? select p.owner,v.vet from pets p, vets v where p.species = v.species order by p.owner Are any pets not covered by a vet? select pet from pets where species not in (select species from vets) How many pets can each vet treat? select v.vet,count(p.pet) as pcnt from pets p, vets v where v.species = p.species group by v.vet
Upload SPOCS data
Query for Sun-like stars SELECT s.spocs, s.Name, s.Teff,s.Log_g,s.M_o_H FROM spocs s WHERE s.Teff BETWEEN 5720 AND 5820 AND s.Log_g BETWEEN 4.34 AND 4.54 AND s.m_o_h BETWEEN -0.1 AND +0.1 ORDER BY s.Teff
Correlate with SDSS: Can it be done? SELECT p.objID,p.field, p.ra, p.dec FROM dr2..PhotoObj p,mydb..spocs s WHERE s._dej2000 between p.dec-.001 and p.dec+.001 and s._raj2000 between p.ra-.001 and p.dec+.001 Making this possible is what ADQL is really about!
~7 ½ minutes to run this little query.
Other SQL topics TOP/LIMIT/SET ROWCOUNT –Different ways to limit the rows output Functions and procedures Indices, clusters, and ensuring efficient access How do we maintain a dynamic database? –Referential integrity, triggers, transactions. –Less critical to typical, relatively static astronomical databases. Groups and group functions –AVG, MIN, MAX, COUNT, SUM –GROUP BY and HAVING clauses
What to do if DB needed in project Databases and accounts set up and available for use at NOAO. Lots of free databases available. If you want help setting one up just ask….