CSIRO Marine Research Divisional Data Centre: Current and Future Activities. Tony Rees, Data Centre Manager, April 2004
Talk outline
- General Divisional context – past and present
- Data Centre approaches and tools – including MarLIN, Data Warehouse & Trawler, CAAB, C-squares, and OBIS
- Data Centre services to CMR projects
- Cleveland-specific issues
Target audience and level of talk
- Introductory / overview level, some examples but not full detail
- Aimed at CMR staff in general, project managers, plus project metadata staff
- Database designers and application developers will find material of interest, but will need separate, more detailed information.
What are the Division’s chief assets?
- Our people (and their intellectual capabilities)
- Our hardware (collecting platforms etc.) and technologies
- Our data – newly collected, plus historic data
How do we manage our data assets?
- Mixture of good, moderately good, and not good at all
- “good” – well documented; details in searchable catalogue; appropriate/current formats; online access (to appropriate users); ongoing curation
- “moderately good” and “not good” depart from the above, to a lesser or greater degree
- Data Centre curates selected datasets on behalf of the Division; others reside long-term in projects
- Data Centre also maintains “MarLIN” – the Division’s data catalogue (metadata system)
Overview of metadata, data systems – national context
[Diagram labels: CMR, NOO, GA, AAD, AIMS, etc.; metadata systems (MarLIN, Neptune) describe / point to CMR data, NOO data, GA data, AAD data, AIMS data, etc.; 3rd party data (CMR copy)]
- ASDD (Australian Spatial Data Directory) – national cross-agency metadata gateway
- Example search via ASDD – search across multiple agencies, basic functionality
- Search via MarLIN – search only CMR holdings, but extra functionality (also view “CMR internal” records not visible to external users)
The Card Index...
MarLIN Marine Laboratories Information Network Divisional Data Catalogue (metadata system)
What is in MarLIN?
- Descriptions of <2,000 Divisional datasets (including c. 1,000 held by the Data Centre)
- Individual MarLIN records are searchable by subject, keyword, CMR project, geographic region, time period, biological species, voyage reference, and more
- Records contain metadata (“data about data”) in a common structure (ANZLIC format plus CMR-specific additional fields)
- Records can contain links to images, related documents, data files, and other metadata records
- “Quick maps” (using c-squares data footprints, see later) can indicate the spatial extent of the data
Who creates MarLIN metadata records?
- Records are created/maintained by the data custodians, who best understand the data and associated useful resources, using an online metadata entry form (Data Centre staff can assist with this process)
Sample MarLIN content Alphabetical dataset lists Indexes by keyword, etc.
Sample MarLIN content Alphabetical dataset lists Indexes by keyword, etc. Brief dataset details
example search result... (etc.)
Viewing the full metadata record produces... (etc.) with clickable link to show dataset extent using c-squares:
(Quick look at the ASDD)
What’s in it for me / us?
- Allows CMR staff / others to know what data we have already, what we are collecting (or plan to collect), and what we do not have (gap analysis) – facilitates data re-use, avoids duplicate acquisition, fosters collaborations
- Permits inspection of relevant data documentation in order to assess data usefulness / completeness / quality, inspect thumbnails of data coverage, etc.
- Gives a contact person and/or electronic access for the data, via a standard entry point
- Provides dissemination of project scientific activities into a new “information space” – online searching via the ASDD, indexing by web search engines, possible future one-csiro system (only don’t hold your breath for the latter)
- Can be feasible for projects to utilise MarLIN to catalogue / access their own data – use MarLIN’s built-in search capability rather than re-invent.
Data Warehouse and Data Trawler
2000 onwards – databasing of “all” Data Centre holdings into a Divisional Data Warehouse, accessed by a custom “Data Trawler” application
- Historic holdings of Hydrology (bottle chemistry) and CTD data – 200,000 HYD analyses, 10,000 CTD casts, from hundreds of research voyages and coastal stations
- Underway data for 175 research voyages (10 million observations) – depth, position, time, meteorological variables, sea temperature, salinity, fluorescence
- Biological (catch composition) data from 85 voyages – 10,000 trawls, 240,000 individual species records (number or weight caught)
- Currents data from 548 moored current meters (3 million readings)
- ADCP data and some old hydrology data are still in archives, awaiting migration to the on-line Warehouse system
- Also note, c. 50% of Divisional catch data is not held by the Data Centre at this time (probably still with original investigators)
example Data Trawler Screens
HYD and CTD data – all years current Warehouse content accessible via Data Trawler
moorings data – all years current Warehouse content accessible via Data Trawler
catch data – all years current Warehouse content accessible via Data Trawler
What’s in it for me / us?
- Provides access to centrally held data on a self-serve basis, via a standard web browser
- Allows queries to be constructed by data type, region, time period, species, voyage...
- Contains the actual data, but not text information (the latter is in MarLIN)
- Permits retrieval of data across multiple projects, as an integrated result set in a common format
- Provides preview / mapping of spatial extents of result sets generated (closer to a true web GIS facility than MarLIN, which is more of a quick “thumbnail” facility)
- Data are provided in csv / spreadsheet-compatible format, suitable for download to the user’s own machine for further manipulation.
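As a rough illustration of the kind of query described on this slide (select by data type, region and time period, then return one integrated result set as csv), here is a minimal Python sketch. It is not the Data Trawler implementation: the records, field names and the `select` / `to_csv` helpers are all hypothetical and exist only for this example.

```python
import csv
import io
from datetime import date

# Hypothetical records and field names, for illustration only.
RECORDS = [
    {"data_type": "CTD", "voyage": "V01", "lat": -42.5, "lon": 148.0, "date": date(1998, 3, 2)},
    {"data_type": "CTD", "voyage": "V02", "lat": -20.0, "lon": 115.0, "date": date(2001, 7, 9)},
    {"data_type": "catch", "voyage": "V03", "lat": -43.1, "lon": 147.5, "date": date(1999, 11, 20)},
]


def select(records, data_type=None, bbox=None, start=None, end=None):
    """Keep records matching every criterion that was supplied.

    bbox is (south, west, north, east) in decimal degrees.
    """
    out = []
    for r in records:
        if data_type is not None and r["data_type"] != data_type:
            continue
        if bbox is not None:
            south, west, north, east = bbox
            if not (south <= r["lat"] <= north and west <= r["lon"] <= east):
                continue
        if start is not None and r["date"] < start:
            continue
        if end is not None and r["date"] > end:
            continue
        out.append(r)
    return out


def to_csv(rows):
    """Serialise the selected rows to csv text with a header line."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Records from any voyage or data type inside one box and one time window.
tasmania = select(RECORDS, bbox=(-45.0, 144.0, -40.0, 149.0),
                  start=date(1998, 1, 1), end=date(1999, 12, 31))
print(to_csv(tasmania))
```

The point of the sketch is only that one set of filters applies across data from several voyages or projects, and that the result comes back in a single common format.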
Systems considered thus far...
[Diagram labels: Remote Applications; Divisional Systems; “MarLIN” Data Catalogue; Divisional Data Warehouse; “Data Trawler” application; Austr. Spatial Data Directory (ASDD); hyperlinked documents, graphics, etc.; project-based data holdings; off-line archived data]
CAAB Codes for Australian Aquatic Biota master taxonomic database
1999-current – upgrading of “CAAB” master taxon management system for the Division
- CAAB (Codes for Australian Aquatic Biota) is a database of species names and codes, now covering >25,000 marine species in Australian waters
- Codes are standardised species identifiers for use in Divisional databases (species names may change, codes are intended to be constant)
- “Quick maps” of all catch data in the Warehouse (by species) have been associated with the relevant CAAB record; also predicted species ranges for c. 3,000 fish species
- Individual maps form clickable interface(s) to retrieve corresponding data items (individual catch records) from the Warehouse and display them in a web page
web-accessible version of CAAB
What’s in it for me / us?
- Codes are a standard storage and interchange format for taxonomic information in CMR and other regional databases
- CAAB website and derived tables allow matching of codes to names, and vice versa
- Check correct spelling of species names and full citation; generate Australian species lists per genus / family / larger category
- Links to pictures and maps of CMR data distribution, where available
- “Quick maps” form a clickable front end to Data Warehouse queries
- Also provides access to the most recent predicted species range in many cases
- Potentially supports “what lives here” queries from predicted species ranges and specified depths (fishes only, at present time).
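The reason for storing codes rather than names can be sketched in a few lines of Python. This is an illustration only: the code strings, species names and helper functions below are made up for the example and are not real CAAB entries or CAAB software.

```python
# Hypothetical code-to-name table; the codes and names are NOT real CAAB entries.
current_names = {
    "90000001": "Examplefish alpha",
    "90000002": "Examplefish beta",
}

# Catch records store only the stable code, never the name.
catch_records = [
    {"voyage": "V01", "code": "90000001", "weight_kg": 12.5},
    {"voyage": "V02", "code": "90000002", "weight_kg": 3.0},
    {"voyage": "V03", "code": "90000001", "weight_kg": 7.25},
]


def codes_by_name(names):
    """Invert the table so that a name can be matched back to its code."""
    return {name: code for code, name in names.items()}


def label_catch(records, names):
    """Attach the current name to each record at display time."""
    return [{**r, "name": names.get(r["code"], "UNKNOWN")} for r in records]


# A taxonomic revision changes the name in one place only;
# the stored catch records do not need to be touched.
current_names["90000001"] = "Newgenus alpha"
labelled = label_catch(catch_records, current_names)
print([r["name"] for r in labelled])
print(codes_by_name(current_names)["Examplefish beta"])
```

Because the name is joined on only when data are displayed, a name change is a single edit to the lookup table, which is the behaviour the slide describes ("species names may change, codes are intended to be constant").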
C-squares Concise Spatial Query and Representation System spatial indexing and mapping utility
“C-squares” mapping / spatial indexing utility
- Original Data Centre creation, 2001 onwards
- Mainly a developer’s tool
- Permits “lightweight” spatial indexing, queries, and web mapping from a standard text-based system (no GIS required)
- Currently used in 4 CMR and 3 international systems (Tony Rees can supply more details if interested).
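To show what "text-based" spatial indexing means in practice, here is a minimal Python sketch that encodes a position as a c-squares string at 10, 5 or 1 degree resolution. It follows the c-squares notation as understood from its public description (a leading global quadrant digit, then latitude and longitude tens, then an intermediate quadrant and unit digits after a colon). The function name and structure are this example's own and are not Data Centre code.

```python
def csquare(lat, lon, resolution=1):
    """Return the c-squares code of the square containing (lat, lon).

    resolution is the square size in degrees: 10, 5 or 1.
    """
    if resolution not in (10, 5, 1):
        raise ValueError("resolution must be 10, 5 or 1")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("position out of range")

    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5

    # Work with absolute values; nudge the poles and the 180 degree
    # meridian just inside the outermost square.
    alat = min(abs(lat), 89.999999)
    alon = min(abs(lon), 179.999999)

    # 10 degree square: quadrant, tens of latitude, tens of longitude.
    code = f"{quadrant}{int(alat // 10)}{int(alon // 10):02d}"
    if resolution == 10:
        return code

    # Intermediate quadrant (1-4) from the units digits of lat and lon.
    lat_unit = int(alat) % 10
    lon_unit = int(alon) % 10
    intermediate = 1 + 2 * (lat_unit >= 5) + (lon_unit >= 5)
    code += f":{intermediate}"
    if resolution == 5:
        return code

    # 1 degree square: append the units digits themselves.
    return code + f"{lat_unit}{lon_unit}"


# A position near Hobart, at three resolutions.
hobart = [csquare(-42.9, 147.3, r) for r in (10, 5, 1)]
print(hobart)  # ['3414', '3414:2', '3414:227']

# A spatial query becomes plain string matching on stored codes.
stored = ["3414:227", "3414:228", "3311:456", "7500:110"]
print([c for c in stored if c.startswith("3414")])
```

Because a finer square's code begins with the code of the coarser square that contains it, "which records fall in this 10 degree square" can be answered with a prefix match in any database or text tool, which is why no GIS is needed.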
OBIS Ocean Biogeographic Information System
OBIS – Ocean Biogeographic Information System
- Operated by an international consortium, including CMR representation
- Like a “super CAAB” for the world, but with names only (not codes)
- Can currently access point data for 20,000 marine species from c. 20 institutions worldwide (2 million records), plus lists of names awaiting data, and returns integrated result sets (like Data Trawler)
- Many aspects similar to CAAB, including “Quick maps”, click-on-map spatial queries, OBIS taxonomic groups, and more (Data Centre staff did the interface and query logic)
- CMR catch data to be visible via the system in due course.
Data Centre Services to CMR projects
Who are we?
- Tony Rees (Hobart) – Data Centre manager; MarLIN, CAAB, C-squares technical support & development; national & international connections; project-level advice (metadata)
- Pamela Brodie, Leanne Wilkes (Hobart) – Data Warehouse, Data Trawler support and data loading; project-level advice (databases)
- Miroslaw Ryba (Hobart) – Oracle support; ships biological data collection suite
- Terry Byrne (Hobart) – National Facility Data Librarian; data requests; data archiving
- Hiski Kippo (Floreat) – project-level liaison, DC representation (WA)
- Steven Edgar (Cleveland) – project-level liaison, DC representation (QLD)
“On the ground” DC services to CMR projects
- Advice and assistance to CMR project staff – metadata entry, database design, general data management issues
- Maintaining the Division’s Oracle systems, and provision of Oracle advice and web-based help
- Servicing/forwarding data requests as appropriate
- Migrating project data to the Data Warehouse, for integration with other relevant data holdings, and archiving data to offline media as required
- Looking at whole-of-Division issues such as data access and exchange policies, engagement with relevant national and international data operations, cross-CSIRO data access, etc.
- New Data Management officers in Floreat (2002) and Cleveland (2004)
- Developing interest in GIS data layers and systems, e.g. ArcSDE, ArcIMS
- Continuing to advance existing DC systems on three fronts – tools, content, and connectivity (internally, nationally, internationally).
Cleveland-specific issues...
- Steven Edgar has an advisory role for Data Management in projects at the Cleveland site (project personnel actually do the project-level management); he can assist with database design, etc., and with some/all Oracle administration needs
- Steve’s time (or portions of it) can be spent on migrating project data to our central warehouse/trawler system, and on assisting project staff with metadata entry as needed
- Steve brings new expertise in GIS systems to the Data Centre; he will take an interest in cross-project / cross-Divisional GIS issues and progress where possible
- Steve can act as a conduit for technology/content/expertise transfer in 2 directions (DC systems/tools > CMR projects and vice versa) – also the “eyes and ears” of the Data Centre in Cleveland, to bring local issues to Hobart’s attention as needed
- Additional Hobart-based staff are only an email or phone call away if they can be of assistance.
Summary – an idealised “data life cycle” at CMR
[Diagram labels: Project starts; PSS (administrative details, project overview); “MarLIN” Data Catalogue; interim documents, graphics, etc.; project-based data holdings; Project completed; Divisional Data Warehouse; “Data Trawler” application; off-line archived data; persistent project db’s; project data repository; project data; published output]
Towards “best practice” data management at project level...
- Projects should be recording the existence of their data in MarLIN – ideally sooner rather than at the end of the project
- Data should eventually be migrated off PCs into Divisional systems
- As much relevant data as possible should be in the Warehouse
- Effort should be made to produce a definitive / final version of the data
- Data Centre can help with archiving for closed projects
- Data Warehouse table structure, and other Divisional databases, can provide starting points / examples for project-level databases
- Taxonomic / survey data recording should employ CAAB codes as a Divisional standard
... refer to the Data Centre internal website and local Data Centre person/s for additional information.
Some action items / ideas for discussion...
- Upgrade MarLIN content to reflect the true data holdings of the Division (augmented with project descriptions as available)
- Look into migrating more “completed” project datasets into centralised (Data Centre) holdings / systems
- Locate as much as possible of the “missing” catch data, to add to present Warehouse content
- Obtain clearance as needed to make CMR catch data visible to the outside world (currently, it is all intranet-only) via Data Trawler and other linked systems (CAAB, OBIS, others)
- Assist project staff with pressing data management issues and work to ensure good technology transfer for database design, etc.
- Work with key project staff to progress the usefulness of the “new” web-enabled GIS systems across appropriate datasets, for the benefit of multiple users
- Identify needs to digitise important non-digital data holdings (notebooks, field log sheets etc.) and assist in seeking resources to digitise them.
Feedback / discussion time...
Summary of core Data Centre components as at April 2004
[Diagram labels: Divisional Systems – “C-squares” Spatial Indexing / Mapping System; Divisional Data Warehouse; “MarLIN” Data Catalogue; “CAAB” Taxonomic Database; “Data Trawler” application; hyperlinked documents, graphics, etc.; project-based data holdings; off-line archived data. Remote Applications – Austr. Spatial Data Directory (ASDD); Distributed AODC? OBIS? other?; external c-squares users (FishBase, OBIS, others). URL fragments shown: jsp/loginpage.jsp; asdd.ga.gov.au/asdd/ (e.g.)]