Aggregating Metadata from Multiple Archives: a Non-VO Approach
Stephen Gwyn, Canadian Astronomy Data Centre (CADC)
- Astronomy is using more and more archival data
- More than 50% of HST papers are archival
- Similar trends for other telescopes
- Harder for solar system astronomy
SSOIS (Solar System Object Image Search) allows users to search for images of moving targets
Initially, only data from CFHT/MegaCam was searched
Next, data from external telescope archives was added: NEAT, CFHT, Subaru, ESO, Gemini, AAT, SDSS, NOAO, ING, HST, WISE
Scraping external archives:
For each image, we need:
- position (RA, Dec)
- field of view
- MJD of mid-exposure
- filter
- exposure time
- target name
- URL to data
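The per-image fields listed above map naturally onto a small record type. The following is only a sketch: the class name, field names, and units are assumptions made for illustration, not the SSOIS schema. It also shows the one piece of arithmetic that is often needed, since many archives report the exposure start rather than the mid-exposure time.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86400.0


@dataclass(frozen=True)
class ImageRecord:
    """One harvested image; names and units here are illustrative only."""
    ra_deg: float       # position: right ascension (degrees)
    dec_deg: float      # position: declination (degrees)
    fov_deg: float      # field of view (assumed: width of a square, degrees)
    mjd_mid: float      # MJD of mid-exposure
    filter_name: str
    exptime_s: float    # exposure time (seconds)
    target: str
    url: str            # link back to the data


def mid_exposure_mjd(mjd_start: float, exptime_s: float) -> float:
    """Convert an exposure start time (MJD) to the mid-exposure MJD."""
    return mjd_start + 0.5 * exptime_s / SECONDS_PER_DAY


# Made-up values, purely to show the record being filled in.
rec = ImageRecord(
    ra_deg=150.0, dec_deg=2.2, fov_deg=1.0,
    mjd_mid=mid_exposure_mjd(55000.0, 600.0),
    filter_name="r", exptime_s=600.0,
    target="example field", url="https://example.invalid/image/1",
)
```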
There are a variety of data archive interfaces....
Scraping external archives:
- In an ideal world: one query to get all metadata
- In real life: row limits
- As the archives are updated, they need to be re-scraped periodically
- Programmatic retrieval is required
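One possible way to cope with a row limit, assuming the archive lets a query be restricted to a time interval, is to split any interval whose result reaches the limit and query the two halves. This is a sketch of that idea, not a description of how SSOIS does it; `harvest`, `fake_query`, and the 1000-row cap are all hypothetical.

```python
from typing import Callable, List

Row = dict  # one metadata row; the real structure is archive-specific


def harvest(query: Callable[[float, float], List[Row]],
            mjd_start: float, mjd_end: float,
            row_limit: int, min_span: float = 1e-4) -> List[Row]:
    """Collect all rows with mjd_start <= MJD < mjd_end despite a row limit."""
    rows = query(mjd_start, mjd_end)
    if len(rows) < row_limit or (mjd_end - mjd_start) <= min_span:
        return rows  # complete result (or the window cannot shrink further)
    # The result may be truncated: split the window and query each half.
    mid = 0.5 * (mjd_start + mjd_end)
    return (harvest(query, mjd_start, mid, row_limit, min_span)
            + harvest(query, mid, mjd_end, row_limit, min_span))


# Stand-in for a real archive: 2500 exposures, capped at 1000 rows per query.
FAKE_ARCHIVE = [{"mjd": 55000.0 + 0.01 * i} for i in range(2500)]


def fake_query(lo: float, hi: float) -> List[Row]:
    hits = [r for r in FAKE_ARCHIVE if lo <= r["mjd"] < hi]
    return hits[:1000]


all_rows = harvest(fake_query, 55000.0, 55025.0, row_limit=1000)
```

The same function can be reused for periodic re-scraping by calling it only on the interval since the last harvest.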
Use SIAP?
Advantages:
- A single tool can scrape multiple archives
Disadvantages:
- Not all archives have an SIAP interface
- Many SIAP services do not conform to the VO standard
- Not all SIAP services contain all the necessary metadata
- Most archives have at least 1 heavily observed patch of sky: hit the row limit again
- SIAP services vary in ability for positional queries
  - maximum search area
  - search is circle or box
  - may require 10^5 queries: may be perceived as a DoS attack
Far better off scraping by day/night/MJD:
- Almost all telescopes take <10000 observations per 24 hours
- Can re-scrape with fewer queries
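The difference in query counts can be checked with a little arithmetic. The sketch below assumes a hypothetical SIAP service that accepts boxes no larger than 0.5 x 0.5 degrees and a hypothetical ten-year archive; both numbers are chosen only for illustration.

```python
import math

# Area of the whole sky: 4*pi steradians, about 41253 square degrees.
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def queries_by_position(max_box_deg: float) -> int:
    """Lower bound on the number of box queries needed to tile the sky."""
    return math.ceil(FULL_SKY_DEG2 / max_box_deg ** 2)


def queries_by_night(first_mjd: int, last_mjd: int) -> int:
    """One query per 24-hour window, counting both end nights."""
    return last_mjd - first_mjd + 1


n_pos = queries_by_position(0.5)           # hypothetical 0.5-degree box limit
n_night = queries_by_night(51544, 55196)   # MJD of 2000-01-01 and 2009-12-31
```

Under these assumptions the positional approach needs of order 10^5 queries, while one query per night needs a few thousand, and a re-scrape only has to cover the nights since the previous pass.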
Scraping by RA/Dec
Scraping by Date
Older archive interfaces:
- Query page + simple CGI result page
- view source on the query page
- get form inputs
- issue repeated queries to CGI result page using GET or POST with wget/curl/scripting API
- Easy
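The steps for an older interface can be sketched with the Python standard library. The endpoint and the form-field names (`date`, `format`) are hypothetical stand-ins for whatever "view source" reveals on a real query page.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; a real one is read from the query page's <form>.
CGI_URL = "https://archive.example.invalid/cgi-bin/search"


def build_get_url(night: str) -> str:
    """URL for a GET query covering one night (YYYY-MM-DD)."""
    return CGI_URL + "?" + urlencode({"date": night, "format": "text"})


def build_post_request(night: str) -> Request:
    """Equivalent POST request with the form fields in the body."""
    body = urlencode({"date": night, "format": "text"}).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")


def fetch(request_or_url) -> str:
    """Retrieve one result page (this performs network access)."""
    with urlopen(request_or_url, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


# Repeated queries: one URL per night.
urls = [build_get_url(n) for n in ("2010-01-01", "2010-01-02")]
```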
Newer archive interfaces:
- AJAX/HTML5/etc. page
- Download JavaScript and run through de-obfuscator
- locate relevant XMLHttpRequest
- determine if cookies are necessary
- issue repeated queries to XMLHttpRequest URLs
- Much harder
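Once the relevant XMLHttpRequest has been located, the last two steps might look roughly like this. Everything archive-specific here is an assumption: the URLs, the `night` parameter, the `X-Requested-With` header (which some services expect and others ignore), and the `rows` key in the JSON reply.

```python
import json
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical URLs; real ones come from reading the page's JavaScript.
LANDING_URL = "https://archive.example.invalid/search"
XHR_URL = "https://archive.example.invalid/api/observations"


def make_session():
    """Opener that keeps cookies from one response and sends them on the next."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def xhr_request(night: str) -> Request:
    """Request resembling the page's own XMLHttpRequest for one night."""
    return Request(
        XHR_URL + "?night=" + night,
        headers={"Accept": "application/json",
                 "X-Requested-With": "XMLHttpRequest"},
    )


def parse_rows(payload: str) -> list:
    """Extract the row list from a JSON reply (the 'rows' key is assumed)."""
    return json.loads(payload).get("rows", [])


def scrape_night(session, night: str) -> list:
    """Visit the landing page for cookies, then call the XHR URL (network)."""
    session.open(LANDING_URL, timeout=60).close()
    with session.open(xhr_request(night), timeout=60) as response:
        return parse_rows(response.read().decode("utf-8"))
```

If the service turns out not to need cookies, the landing-page visit can simply be dropped.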
Easiest of all...
A script to get all Subaru/SuprimeCam metadata...
#!/bin/bash
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
wget
The second easiest: CADC's Advanced Search
cnrc.gc.ca/tap/sync?LANG=ADQL&REQUEST=doQuery&QUERY=SELECT%20Observation.observationURI%20AS%20%22Preview%22%2C%20Observation.collection%20AS%20%22Collection%22%2C%20Observation.observationID%20AS%20%22Obs.%20ID%22%2C%20COORD1(CENTROID(Plane.position_bounds))%20AS%20%22RA%20(J2000.0)%22%2C%20COORD2(CENTROID(Plane.position_bounds))%20AS%20%22Dec.%20(J2000.0)%22%2C%20Plane.time_bounds_cval1%20AS%20%22Start%20Date%22%2C%20Observation.instrument_name%20AS%20%22Instrument%22%2C%20Plane.time_exposure%20AS%20%22Int.%20Time%22%2C%20Observation.target_name%20AS%20%22Target%20Name%22%2C%20Plane.energy_bandpassName%20AS%20%22Filter%22%2C%20Plane.calibrationLevel%20AS%20%22Cal.%20Lev.%22%2C%20Observation.type%20AS%20%22Obs.%20Type%22%2C%20Plane.energy_bounds_cval1%20AS%20%22Min.%20Wavelength%22%2C%20Plane.energy_bounds_cval2%20AS%20%22Max.%20Wavelength%22%2C%20Observation.proposal_id%20AS%20%22Proposal%20ID%22%2C%20Observation.proposal_pi%20AS%20%22P.I.%20Name%22%2C%20Plane.productID%20AS%20%22Product%20ID%22%2C%20Plane.dataRelease%20AS%20%22Data%20Release%22%2C%20AREA(Plane.position_bounds)%20AS%20%22Field%20of%20View%22%2C%20Plane.position_sampleSize%20AS%20%22Pixel%20Scale%22%2C%20Plane.dataProductType%20AS%20%22Data%20Type%22%2C%20Plane.position_timeDependent%20AS%20%22Moving%20Target%22%2C%20Plane.provenance_name%20AS%20%22Provenance%20Name%22%2C%20Plane.provenance_keywords%20AS%20%22Provenance%20Keywords%22%2C%20Observation.intent%20AS%20%22Intent%22%2C%20Observation.target_type%20AS%20%22Target%20Type%22%2C%20Observation.target_standard%20AS%20%22Target%20Standard%22%2C%20Plane.metaRelease%20AS%20%22Meta%20Release%22%2C%20Observation.sequenceNumber%20AS%20%22Sequence%20Number%22%2C%20Observation.algorithm_name%20AS%20%22Algorithm%20Name%22%2C%20Observation.proposal_title%20AS%20%22Proposal%20Title%22%2C%20Observation.proposal_keywords%20AS%20%22Proposal%20Keywords%22%2C%20Observation.proposal_project%20AS%20%22Proposal%20Project%22%2C%20Plane.position_bounds%20AS%20%22Polygon%22%2C%20Plane.energy_emBand%20AS%20%22Band%22%2C%20Plane.provenance_reference%20AS%20%22Prov.%20Reference%22%2C%20Plane.provenance_version%20AS%20%22Prov.%20Version%22%2C%20Plane.provenance_project%20AS%20%22Prov.%20Project%22%2C%20Plane.provenance_producer%20AS%20%22Prov.%20Producer%22%2C%20Plane.provenance_runID%20AS%20%22Prov.%20Run%20ID%22%2C%20Plane.provenance_lastExecuted%20AS%20%22Prov.%20Last%20Executed%22%2C%20Plane.provenance_inputs%20AS%20%22Prov.%20Inputs%22%2C%20Plane.energy_restwav%20AS%20%22Rest-frame%20Spectral%20Coverage%22%2C%20Plane.planeID%20AS%20%22planeID%22%2C%20isDownloadable(Plane.planeURI)%20AS%20%22DOWNLOADABLE%22%2C%20Plane.planeURI%20AS%20%22CAOM%20Plane%20URI%22%2C%20Observation.instrument_keywords%20AS%20%22Instrument%20Keywords%22%2C%20Plane.energy_transition_species%20AS%20%22Molecule%22%2C%20Plane.energy_transition_transition%20AS%20%22Transition%22%2C%20Plane.position_resolution%20AS%20%22IQ%22%20FROM%20caom2.Plane%20AS%20Plane%20JOIN%20caom2.Observation%20AS%20Observation%20ON%20Plane.obsID%20%3D%20Observation.obsID%20WHERE%20%20(%20Observation.instrument_name%20%3D%20%27MegaPrime%27%20AND%20Observation.collection%20%3D%20%27CFHT%27%20)&FORMAT=tsv
SELECT
  Observation.observationURI AS "Preview",
  Observation.collection AS "Collection",
  Observation.observationID AS "Obs. ID",
  COORD1(CENTROID(Plane.position_bounds)) AS "RA (J2000.0)",
  COORD2(CENTROID(Plane.position_bounds)) AS "Dec. (J2000.0)",
  Plane.time_bounds_cval1 AS "Start Date",
  Observation.instrument_name AS "Instrument",
  Plane.time_exposure AS "Int. Time",
  Observation.target_name AS "Target Name",
  Plane.energy_bandpassName AS "Filter",
  Plane.calibrationLevel AS "Cal. Lev.",
  Observation.type AS "Obs. Type",
  Plane.energy_bounds_cval1 AS "Min. Wavelength",
  Plane.energy_bounds_cval2 AS "Max. Wavelength",
  Observation.proposal_id AS "Proposal ID",
  Observation.proposal_pi AS "P.I. Name",
  Plane.productID AS "Product ID",
  Plane.dataRelease AS "Data Release",
  AREA(Plane.position_bounds) AS "Field of View",
  Plane.position_sampleSize AS "Pixel Scale",
  Plane.dataProductType AS "Data Type",
  Plane.position_timeDependent AS "Moving Target",
  Plane.provenance_name AS "Provenance Name",
  Observation.intent AS "Intent",
  Observation.target_type AS "Target Type",
  Observation.target_standard AS "Target Standard",
  Observation.sequenceNumber AS "Sequence Number",
  Observation.algorithm_name AS "Algorithm Name",
  Observation.proposal_title AS "Proposal Title",
  Observation.proposal_keywords AS "Proposal Keywords",
  Plane.energy_emBand AS "Band",
  Plane.provenance_version AS "Prov. Version",
  Plane.provenance_project AS "Prov. Project",
  Plane.provenance_runID AS "Prov. Run ID",
  Plane.provenance_lastExecuted AS "Prov. Last Executed",
  Plane.energy_restwav AS "Rest-frame Spectral Coverage",
  isDownloadable(Plane.planeURI) AS "DOWNLOADABLE",
  Plane.planeURI AS "CAOM Plane URI",
  Observation.instrument_keywords AS "Instrument Keywords",
  Plane.energy_transition_species AS "Molecule",
  Plane.energy_transition_transition AS "Transition",
  Plane.position_resolution AS "IQ"
FROM caom2.Plane AS Plane
  JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID
WHERE ( Observation.collection = 'CFHT' )
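A query like the one above can be generated and URL-encoded programmatically rather than copied from the browser. The sketch below keeps only columns that correspond to the fields needed per image and adds a time window. The host is a placeholder (the real service address is not reproduced here), and treating `Plane.time_bounds_cval1` as an MJD is an assumption.

```python
import csv
import io
from urllib.parse import quote, urlencode

# Placeholder host: substitute the real TAP service address.
TAP_SYNC = "https://tap.example.invalid/tap/sync"

ADQL_TEMPLATE = (
    "SELECT Observation.observationID, "
    "COORD1(CENTROID(Plane.position_bounds)), "
    "COORD2(CENTROID(Plane.position_bounds)), "
    "AREA(Plane.position_bounds), "
    "Plane.time_bounds_cval1, Plane.time_exposure, "
    "Plane.energy_bandpassName, Observation.target_name, Plane.planeURI "
    "FROM caom2.Plane AS Plane JOIN caom2.Observation AS Observation "
    "ON Plane.obsID = Observation.obsID "
    "WHERE ( Observation.instrument_name = 'MegaPrime' "
    "AND Observation.collection = 'CFHT' "
    # Assumption: time_bounds_cval1 holds the exposure start as an MJD.
    "AND Plane.time_bounds_cval1 >= {lo} AND Plane.time_bounds_cval1 < {hi} )"
)


def tap_url(mjd_lo: float, mjd_hi: float) -> str:
    """Synchronous TAP URL for one time window, with %20-style encoding."""
    params = {"LANG": "ADQL", "REQUEST": "doQuery",
              "QUERY": ADQL_TEMPLATE.format(lo=mjd_lo, hi=mjd_hi),
              "FORMAT": "tsv"}
    return TAP_SYNC + "?" + urlencode(params, quote_via=quote)


def parse_tsv(text: str) -> list:
    """Turn a tab-separated reply (header row first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


url = tap_url(55000.0, 55001.0)
```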
The other hard part:
- Parsing downloaded metadata
- Which observations are images?
- Quality control
  - is MJD right?
  - Are coordinates or ?
- Sorting out filters:
  - remove narrow band filter data
  - remove bad filters
  - remove grism data
  - maybe homogenize filter names (B vs Bj vs Bjohnson vs Johnson B vs...)
- Telescope footprint not typically part of the metadata
- Work out links back to original images
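Two of the parsing chores above, homogenizing filter names and basic quality control, can be sketched as follows. The alias table, the rejection pattern, and the MJD bounds are illustrative assumptions; in practice each archive needs its own rules worked out by inspection.

```python
import re
from typing import Optional

# Illustrative alias table covering the example spellings of Johnson B.
FILTER_ALIASES = {"b": "B", "bj": "B", "bjohnson": "B", "johnsonb": "B"}
# Assumed convention: grism and narrow-band data are labelled as such.
REJECT_PATTERN = re.compile(r"grism|narrow", re.IGNORECASE)


def normalize_filter(raw: str) -> Optional[str]:
    """Return a homogenized filter name, or None if the data should be dropped."""
    if REJECT_PATTERN.search(raw):
        return None
    key = re.sub(r"[^a-z0-9]", "", raw.lower())
    return FILTER_ALIASES.get(key, raw.strip())


def plausible(ra_deg: float, dec_deg: float, mjd: float,
              mjd_min: float = 40000.0, mjd_max: float = 70000.0) -> bool:
    """Range checks on coordinates (degrees) and date (MJD); bounds are arbitrary."""
    return (0.0 <= ra_deg < 360.0 and -90.0 <= dec_deg <= 90.0
            and mjd_min <= mjd <= mjd_max)
```

A check like `plausible` catches, for example, a Julian Date stored where an MJD was expected, since the value falls far outside the allowed range.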
SSOIS saves the Earth....
Summary:
- SSOIS allows multi-archive searches for moving objects
- Metadata is harvested from external archives
- Lessons learned:
  - SIAP is not useful for metadata harvesting
  - multiple queries by time not by position
  - older interfaces are easier to scrape
  - parsing metadata often harder than retrieving it