Data Formats, Flags and Vocabularies Roy Lowry British Oceanographic Data Centre SeaDataNet Training Course, Ostend, June 16-19, 2008.


1 Data Formats, Flags and Vocabularies Roy Lowry British Oceanographic Data Centre SeaDataNet Training Course, Ostend, June 16-19, 2008

2 File Formats  Available formats  Format Selection Criteria  Types of Data  Delivery Use Case Issues  SeaDataNet Profiling Objectives  SeaDataNet Profiling Details

3 Available Formats  Three format profiles are being developed for SeaDataNet data transfers  SeaDataNet ODV Profile  Simple ASCII format based on a spreadsheet model  SeaDataNet MEDATLAS Profile  Minor variation on an established ASCII format  SeaDataNet CF NetCDF Profile  Binary data conforming to API and content model based on an established community standard (CF)

4 Format Selection Criteria  The $64,000 question is “What format should I use for my data?”  The answer depends on the type of data and on the data delivery use case

5 Types of Data  Think of data in terms of ‘feature types’  Profiles (x, y, t effectively fixed: z varies)  Bottle casts, CTDs, XBTs, radiosondes, core profiles  Point series (x, y, z effectively fixed: t varies)  Current meters, wave statistics, sea level, wind velocity  Trajectories (x, y, z (sometimes), t all vary)  Underway data (TSG, bathymetry, meteorology), undulator data, airborne measurements  Grids (Two or more of x, y, z, t vary systematically )  Satellite data, model output, synthesised data products

6 Types of Data  Most of our data may be modelled in terms of these feature types  For example:  CTD data –Modelled well by the ‘profile’ type  Recording current meter data –Modelled well by the ‘point series’ type  Moored ADCP –Modelled poorly by ‘point series’ type (needs to be considered as one point series per depth bin) –But is modelled well by ‘grid’ with z, t varying and x, y fixed

7 Delivery Use Case Issues  Data exchange between consenting Mediterranean partners  Data provider holds data in MEDATLAS format  Data recipient wants data in MEDATLAS format  Could be addressed using Nemo software to convert MEDATLAS to ODV profile

8 Delivery Use Case Issues  Problems with this approach  Recipient needs to do unnecessary work converting ODV to MEDATLAS  Risk of information loss in the conversion process  MEDATLAS is used by a significant proportion of the SeaDataNet community  Consequently, the transaction system development overhead to support exchange in MEDATLAS format was considered worthwhile

9 Format Recommendations  Mandatory formats  Use ODV for  Profiles  Point series  Trajectories (including underway ADCP)  Use NetCDF for  Grids  Data that don’t fit comfortably into ODV due to shape or volume  Data for use with NetCDF-enabled tools

10 Format Recommendations  Optional format  Use MEDATLAS for  Whatever you use MEDATLAS for at the moment
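
As a rough sketch, the format recommendations above can be written as a simple lookup. The function name, the feature-type strings and the `fits_odv` and `needs_netcdf_tools` switches are illustrative assumptions, not part of any SeaDataNet tool.

```python
# Hypothetical helper: names and string labels are illustrative only.
ODV_FEATURE_TYPES = {"profile", "point series", "trajectory"}


def recommended_format(feature_type, fits_odv=True, needs_netcdf_tools=False):
    """Return the mandatory SeaDataNet format suggested for a feature type."""
    # Grids, awkwardly shaped or very large data, and data destined for
    # NetCDF-enabled tools all go to NetCDF.
    if feature_type == "grid" or not fits_odv or needs_netcdf_tools:
        return "NetCDF"
    if feature_type in ODV_FEATURE_TYPES:
        return "ODV"
    raise ValueError(f"unknown feature type: {feature_type!r}")


print(recommended_format("profile"))
print(recommended_format("grid"))
```

MEDATLAS is left out on purpose: the slides treat it as an optional format for partners who already use it.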

11 SeaDataNet Profiling Objectives  Two objectives  Providing linkage between data and SeaDataNet metadata (CDI record)  Standardising semantics  Consistent labelling of parameters –Use terms from a controlled vocabulary (more on this later)  Consistent labelling of storage units –Use terms from a controlled vocabulary –Parameter definition DOES NOT dictate storage unit

12 SeaDataNet ODV Profile  Described in BSCW document (Word)  https://www.ifremer.fr/bscw/bscw.cgi/d93460/Specification%20of%20SeaDataNet%20Data%20Transport%20Formats  Examples of profile, point series and trajectory data (Excel)  https://www.ifremer.fr/bscw/bscw.cgi/d93465/Examples%20of%20SeaDataNet%20variant%20ODV%20spreadsheet-based%20import%20format

13 SeaDataNet ODV Profile  ODV format based on a spreadsheet model with three types of row  Comment row  One cell with text starting with //  Column header row  Data row  Column header and data rows have three types of column  Metadata columns  Primary variable data columns (value + flag)  Data columns (value + flag pairs)

14 SeaDataNet ODV Profile  SeaDataNet profile extensions  CDI linkage  Addition of two extra metadata columns (LOCAL_CDI_ID and EDMO_code)  Semantic mapping  Structured comment records immediately preceding the ODV column header record  First record is ‘//SDN_parameter_mapping’  Followed by one mapping record for each data column in the file

15 SeaDataNet ODV Profile  Mapping record example  //<subject>SDN:LOCAL:Depth</subject><object>SDN:P011::ADEPZZ01</object><units>SDN:P061::ULAA</units> –Subject element is the column heading text excluding ODV units field (e.g. ‘Depth’ for ‘Depth [m]’) –Object element is the SeaDataNet URN for the parameter (SDN:P011::ADEPZZ01) –Units element is the SeaDataNet URN for the data storage units (SDN:P061::ULAA)  More about URNs and what we can do with them later…
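
A minimal sketch of reading one mapping record, assuming the three elements are wrapped in `<subject>`, `<object>` and `<units>` tags after the leading `//`. The function name and the regular expression are illustrative assumptions, not taken from the format specification.

```python
import re

# Assumed record layout: '//' followed by <subject>, <object> and <units> elements.
MAPPING_RE = re.compile(
    r"^//\s*<subject>(?P<subject>[^<]+)</subject>\s*"
    r"<object>(?P<object>[^<]+)</object>\s*"
    r"<units>(?P<units>[^<]+)</units>\s*$"
)


def parse_mapping_record(line):
    """Split one semantic mapping record into its three URNs."""
    match = MAPPING_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a mapping record: {line!r}")
    return match.groupdict()


record = ("//<subject>SDN:LOCAL:Depth</subject>"
          "<object>SDN:P011::ADEPZZ01</object><units>SDN:P061::ULAA</units>")
mapping = parse_mapping_record(record)
# The last URN field of the subject is the column heading text.
print(mapping["subject"].split(":")[-1], mapping["object"], mapping["units"])
```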

16 SeaDataNet ODV Profile  SeaDataNet Metadata and Primary Variables  Profile data  Metadata (x,y,t) set to nominal profile position and time (same for every data value)  Primary variable is the z co-ordinate (depth in metres or pressure in decibars)  Point series data  Metadata (x,y,t) set to the measurement location and series start time (same for every data value)  Primary variable is the t co-ordinate (Chronological Julian Day - days elapsed since 00:00 on January 1 4713 BC)  Trajectory data  Metadata (x,y,t) set to measurement time and position  Primary variable is the z co-ordinate (depth in metres or pressure in decibars)
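
The point series time co-ordinate can be computed from an ordinary calendar date. The sketch below assumes Gregorian calendar dates as handled by Python's `datetime`, and assumes that the Chronological Julian Day is the astronomical Julian Date plus 0.5, so that whole numbers fall at midnight. The function name is hypothetical.

```python
from datetime import datetime

# datetime.toordinal() counts days from 0001-01-01 (ordinal 1) in the
# proleptic Gregorian calendar; adding this offset shifts that count to
# the Julian-day epoch, with whole numbers falling at 00:00.
ORDINAL_TO_CJD = 1721425


def to_chronological_julian_day(moment):
    """Convert a naive datetime to a Chronological Julian Day number."""
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    return moment.toordinal() + ORDINAL_TO_CJD + seconds / 86400.0


print(to_chronological_julian_day(datetime(2000, 1, 1, 0, 0)))
```

Time zone handling is deliberately left out of this sketch.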

17 SeaDataNet ODV Profile  Watchpoints  File extension should be .txt  Field separator is the tab character (not semi-colon)  Physical file mapping  The format is capable of holding multiple SeaDataNet data objects in a single physical file  The SeaDataNet 1 system CANNOT support this  Means aggregation and splitting tools (or a lot of patience!) will be required (hardly rocket science)
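
A splitting tool of the kind mentioned above might look like the following sketch. It assumes that every data row carries its own LOCAL_CDI_ID value, that the first line not starting with `//` is the column header row, and that all comment rows belong in every output file. The function name and the sample rows are hypothetical.

```python
def split_by_cdi(text):
    """Split tab-separated ODV text into one text block per LOCAL_CDI_ID."""
    comments, header, key_col, groups = [], None, None, {}
    for line in text.splitlines():
        if line.startswith("//"):
            comments.append(line)           # comment rows are copied to every output
        elif header is None:
            header = line                   # first non-comment row: column headers
            key_col = header.split("\t").index("LOCAL_CDI_ID")
        elif line.strip():
            key = line.split("\t")[key_col]
            groups.setdefault(key, []).append(line)
    return {key: "\n".join(comments + [header] + rows) + "\n"
            for key, rows in groups.items()}


# Hypothetical three-row file holding two data objects.
sample = (
    "//example comment\n"
    "Cruise\tLOCAL_CDI_ID\tEDMO_code\tDepth [m]\n"
    "C1\tA\t0\t1.0\n"
    "C1\tA\t0\t2.0\n"
    "C1\tB\t0\t1.0\n"
)
parts = split_by_cdi(sample)
print(sorted(parts))
```

Each value in `parts` could then be written to its own .txt file; aggregation is the reverse operation.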

18 SeaDataNet MEDATLAS Profile  Those who want to use MEDATLAS know it better than me, so I’m not going to try and teach the format!  The most important SeaDataNet extension is the link to CDI records, which is done by a pair of structured comment records for each SeaDataNet object thus:  *EDMO_CODE = EDMO identifier of the data centre managing the CDI  *LOCAL_CDI_ID = local identifier of the station
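
The pair of link records can be picked out of a MEDATLAS file with a short pattern match. This is only a sketch: the function name is hypothetical, the identifier values in the example are placeholders, and the exact spacing around `=` is an assumption.

```python
import re

# Matches '*EDMO_CODE = value' and '*LOCAL_CDI_ID = value' comment records.
LINK_RE = re.compile(r"^\*(EDMO_CODE|LOCAL_CDI_ID)\s*=\s*(.*\S)\s*$")


def read_cdi_link(lines):
    """Collect the CDI link fields found in a sequence of MEDATLAS lines."""
    link = {}
    for line in lines:
        match = LINK_RE.match(line)
        if match:
            link[match.group(1)] = match.group(2)
    return link


# Placeholder values, for illustration only.
example = ["*EDMO_CODE = 0000", "*LOCAL_CDI_ID = STATION_001", "*some other comment"]
print(read_cdi_link(example))
```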

19 SeaDataNet MEDATLAS Profile  We can also add standardised semantic mapping records as per ODV such as:  *<subject>SDN:LOCAL:Temperature</subject><object>SDN:P011::TEMPS901</object><units>SDN:P061::UPAA</units>  However, once the mapping between MEDATLAS parameter codes and P011 is completed, these become unnecessary

20 SeaDataNet CF NetCDF Profile  This is VERY immature, so currently there is nothing to teach  ASCII formats should be sufficient for most SeaDataNet 1 transactions  Further work during the next 6 months  Partners who feel they need NetCDF for their data should contact the Technical Task Team (Dick Schaap or Roy Lowry)

21 SeaDataNet Qualifying Flags  What is a Qualifying Flag?  SeaDataNet Flags  Conflict resolution

22 What is a Qualifying Flag?  Back in the mists of time (IODE in early 1980s?) it was decreed that all data values should be accompanied by a ‘flag’ in the form of a 1-byte code  Built into many data format specifications (MEDATLAS, BODC PXF/QXF, GF3…)  Initially thought of as a data quality label  However, it provides the only metadata ‘hook’ that is unambiguously linked to a specific data value  Consequently, it has suffered information overload carrying other information about non-quality issues  We cannot correct this without major re-engineering of data held as files, which isn’t going to happen

23 SeaDataNet Flags  Information overloading has led to two types of flag in SeaDataNet  Quality Flags  0 – quality unknown  1 – good value (looks good and no reported problems)  2 – probably good value (associated with a known malfunction but looks OK)  3 – probably bad value (associated with a known malfunction but looks wrong)  4 – bad value (clearly wrong)

24 SeaDataNet Flags  Information overloading has led to two types of flag in SeaDataNet  Information flags  5 – changed value (during quality control)  6 – below detection (true value < quoted value)  7 – value in excess (true value > quoted value)  8 – interpolated value (special case of a changed value)  9 – missing value  A – phenomenon uncertain (e.g. question over identification of biological specimen)
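
For reference, the two groups of flags listed on these slides can be held in a pair of lookup tables. The dictionary and function names are illustrative assumptions; the codes and labels are the ones given above.

```python
QUALITY_FLAGS = {
    "0": "quality unknown",
    "1": "good value",
    "2": "probably good value",
    "3": "probably bad value",
    "4": "bad value",
}

INFORMATION_FLAGS = {
    "5": "changed value",
    "6": "below detection",
    "7": "value in excess",
    "8": "interpolated value",
    "9": "missing value",
    "A": "phenomenon uncertain",
}


def describe_flag(flag):
    """Return (flag type, label) for a one-character SeaDataNet flag."""
    if flag in QUALITY_FLAGS:
        return ("quality", QUALITY_FLAGS[flag])
    if flag in INFORMATION_FLAGS:
        return ("information", INFORMATION_FLAGS[flag])
    raise ValueError(f"unknown flag: {flag!r}")


print(describe_flag("1"))
print(describe_flag("8"))
```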

25 Conflict Resolution  We can now see the problems caused by overloading  How can we tell the difference between a ‘good changed value’ and a ‘bad changed value’?  Simple answer is that we can’t. We can indicate the value was changed (flag 5), good (flag 1) or bad (flag 4)  So we have to compromise…

26 Conflict Resolution  How do we compromise?  By prioritising flag assignments  Initially, all flags are set to 0, 9, 7, 6 or A (detection level and uncertainty information comes from the originator, not QA)  Next we either interpolate or replace and flag appropriately (8 or 5)  Finally we switch remaining zero flags to 1, 2, 3 or 4 as appropriate  This is not ideal and we need to do better in SeaDataNet 2.
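
The three prioritised steps above can be sketched as a single function. The function name, argument names and the "interpolated" and "replaced" action labels are hypothetical. The sketch also assumes that an interpolation or replacement overrides an originator flag, which the slide does not state explicitly.

```python
ORIGINATOR_FLAGS = {"9", "7", "6", "A"}   # information supplied by the originator
QUALITY_VERDICTS = {"1", "2", "3", "4"}   # outcomes of quality control


def assign_flag(originator_flag=None, qc_action=None, qc_verdict=None):
    """Apply the prioritised flag assignment to a single data value."""
    # Step 1: start from the originator's information flag, or 0 if none.
    flag = originator_flag if originator_flag in ORIGINATOR_FLAGS else "0"
    # Step 2: values changed during quality control are flagged 8 or 5.
    if qc_action == "interpolated":
        flag = "8"
    elif qc_action == "replaced":
        flag = "5"
    # Step 3: only flags still at 0 receive a quality verdict.
    if flag == "0" and qc_verdict in QUALITY_VERDICTS:
        flag = qc_verdict
    return flag


print(assign_flag(qc_verdict="1"))
print(assign_flag(originator_flag="6", qc_verdict="4"))
```

The second call shows the compromise: a value flagged as below detection keeps flag 6 even when quality control judges it bad.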

27 Vocabularies  What are vocabularies and mappings?  Vocabularies for Metadata  Vocabularies for Data  Vocabulary Access  Vocabulary Maintenance

28 What is a Vocabulary?  A vocabulary is a list of standardised terms used to populate a metadata field  The SeaDataNet vocabulary model considers each such term to possess  A key (a permanent, semantically neutral identifier for the term, possibly a mnemonic)  A term (full human-readable label)  An abbreviation (short human-readable label)  A definition (full explanation of the term’s meaning)

29 What is a Mapping?  A mapping is a set of relationships between terms  Each relationship consists of a subject term (sometimes called subject concept), a predicate and an object term  The predicate gives the relationship ‘meaning’  Predicates may be simple to underpin something like a thesaurus (e.g. SKOS)  exactMatch - synonyms  narrowMatch – subject concept totally embraces the object concept  broadMatch – subject concept is totally embraced by the object concept  majorMatch – subject and object have a lot in common but some unique semantic elements  minorMatch - subject and object have something in common but significant unique semantic elements

30 What is a Mapping?  Predicates may also be semantically rich such as:  hasUnits – links a parameter to a unit of measurement  isMember – links a person to a group  hasName – links a person to a label  Mappings between defined entities with semantically rich predicates are what computer scientists call an ontology
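
A mapping as described on these two slides is a set of subject, predicate, object triples. The sketch below shows one possible in-memory representation; the `ex:` identifiers are invented placeholders, not real vocabulary terms, and the helper name is hypothetical.

```python
from collections import namedtuple

Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

# Placeholder triples using predicates named on the slides.
mappings = [
    Relationship("ex:sea_temperature", "narrowMatch", "ex:sea_surface_temperature"),
    Relationship("ex:sea_temperature", "hasUnits", "ex:degrees_celsius"),
    Relationship("ex:person_1", "isMember", "ex:group_1"),
]


def objects_of(relationships, subject, predicate):
    """Return every object linked to a subject by the given predicate."""
    return [r.object for r in relationships
            if r.subject == subject and r.predicate == predicate]


print(objects_of(mappings, "ex:sea_temperature", "hasUnits"))
```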

31 Vocabularies for Metadata  Many fields in SeaDataNet metadata are linked through the document schema to appropriate vocabularies  These cover subject areas such as:  Discovery parameters  Instruments  Platforms  Geographic locations (e.g. ports, sea areas)  Lists to be used are defined in the metadata guidance documentation.  List references (e.g. P021) provide the key to vocabulary access information

32 Vocabularies for Data  There are four vocabularies needed for data in SeaDataNet  ‘Light’ Parameter Usage Vocabulary (P012)  ‘The Full Monty’ Parameter Usage Vocabulary (P011)  SeaDataNet flags (L201)  Units Vocabulary (P061)

33 Vocabularies for Data  ‘Light’ Parameter Usage Vocabulary (P012)  Terms to describe parameters (i.e. column headings)  Kept as pure (no methods) and as simple as possible  Definitions available  Mapped to MEDATLAS/GF3 extended terms  Should be the first port of call for SeaDataNet data providers

34 Vocabularies for Data  ‘Full’ Parameter Usage Vocabulary (P011)  Comprehensive (nearly 20,000 terms) but can be hard to navigate  Microsoft Access navigation tool used inside BODC could be made available on request  True superset of P012, so all P012 URLs have an identical P011 equivalent  Handling data files will be easier if P011 version is used in SeaDataNet data files  Port of call if P012 fails to deliver

35 Vocabularies for Data  SeaDataNet data qualifier flags (L201)  The full list of the flags discussed previously  Units Vocabulary (P061)  Unlike MEDATLAS or the BODC internal system, SeaDataNet policy is to label a value with parameter and units INDEPENDENTLY  The vocabulary is a standardised description of the units used; it does not dictate the units  An aspiration is to develop units interconversion based on P061 terms

36 Vocabulary Access  There are five ways to access the SeaDataNet vocabularies  SeaDataNet Vocabulary Portal  Term and list URLs  HTTP-POX interface  SOAP API  BODC client interface  But I’m only going to cover the first four as the portal should cover SeaDataNet needs

37 Vocabulary Access  SeaDataNet Vocabulary Portal  User input through a web form at http://seadatanet.maris2.nl/v_bodc_vocab/welcome.aspx  Returns a human-readable table with key, term, abbreviation, definition and modification date columns  Table may be exported as a semicolon-delimited ‘CSV’ ASCII file
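
An exported table can be read back with Python's standard `csv` module by setting the delimiter to a semicolon. The sketch assumes the export starts with a header row; the header names and the sample row are hypothetical.

```python
import csv
import io


def read_vocab_export(text):
    """Parse a semicolon-delimited vocabulary export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


# Hypothetical export content: header names and values are placeholders.
sample_export = (
    "key;term;abbreviation;definition;modified\n"
    "TEMP;Example term;ExTerm;Example definition;2008-01-01\n"
)
rows = read_vocab_export(sample_export)
print(rows[0]["key"], rows[0]["term"])
```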

38 Vocabulary Access  Term and List URLs  User input is a URL  Returns an XML document based on the SKOS standard  List documents include labels and definitions for all terms in the list  Term documents include labels, definition and mappings for the term

39 Vocabulary Access  URL syntax  Namespace base (http://vocab.ndg.nerc.ac.uk/)  ‘list’ or ‘term’  List identifier (e.g. P021)  List version or ‘current’  Term identifier for term URL (e.g. TEMP)  Examples  List (SeaDataNet Parameter Discovery Vocabulary)  http://vocab.ndg.nerc.ac.uk/list/P021/current/  Term (CF Standard Name for sea temperature)  http://vocab.ndg.nerc.ac.uk/term/P071/current/CFSN0335
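
The URL syntax above is easy to assemble in code. The two function names below are hypothetical; the namespace base and the path layout are the ones listed on the slide.

```python
BASE = "http://vocab.ndg.nerc.ac.uk/"


def list_url(list_id, version="current"):
    """Build the URL of a whole vocabulary list."""
    return f"{BASE}list/{list_id}/{version}/"


def term_url(list_id, term_id, version="current"):
    """Build the URL of a single term within a list."""
    return f"{BASE}term/{list_id}/{version}/{term_id}"


print(list_url("P021"))
print(term_url("P071", "CFSN0335"))
```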

40 Vocabulary Access  Example term document: identifier SDN:P071:7:CFSN0335, label sea_water_temperature, date 2008-02-26T10:02:57.564+0000

41 Vocabulary Access  In SeaDataNet data and metadata we use URNs, not URLs (in case the server namespace changes)  URN syntax is  Namespace base (SDN)  List identifier (e.g. P021)  List version or null field for ‘current’  Term identifier (e.g. TEMP)  For example, the URL http://vocab.ndg.nerc.ac.uk/term/P021/current/TEMP is represented by the URN SDN:P021::TEMP  URN to URL conversion is simple string slicing
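
The string slicing can be sketched in a few lines. The function names are hypothetical, and the sketch assumes term URLs follow the syntax given earlier (namespace base, ‘term’, list identifier, version, term identifier).

```python
BASE = "http://vocab.ndg.nerc.ac.uk/"


def urn_to_url(urn):
    """Convert a SeaDataNet term URN such as SDN:P021::TEMP to a term URL."""
    namespace, list_id, version, term_id = urn.split(":")
    if namespace != "SDN":
        raise ValueError(f"not a SeaDataNet URN: {urn!r}")
    # An empty version field means the current version of the list.
    return f"{BASE}term/{list_id}/{version or 'current'}/{term_id}"


def url_to_urn(url):
    """Convert a term URL back to its URN."""
    if not url.startswith(BASE + "term/"):
        raise ValueError(f"not a term URL: {url!r}")
    list_id, version, term_id = url[len(BASE + "term/"):].split("/")
    return f"SDN:{list_id}:{'' if version == 'current' else version}:{term_id}"


print(urn_to_url("SDN:P021::TEMP"))
print(url_to_urn("http://vocab.ndg.nerc.ac.uk/term/P071/7/CFSN0335"))
```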

42 Vocabulary Access  HTTP-POX API  User input is a URL  Returns an XML document based on a BODC-defined schema  Provides access to  List catalogue  List contents (keys, terms, abbreviations, definitions, mappings)  Mappings  Plaintext searches across lists  Term verification  The API is documented at http://www.bodc.ac.uk/products/web_services/vocab/methods.html

43 Vocabulary Access  SOAP API  User input is a programmatic service call from a Java, Perl, PHP, Python, etc. application  Returns an XML document based on a BODC-defined schema  Provides access to  List catalogue  List contents (keys, terms, abbreviations, definitions, mappings)  Mappings  Plaintext searches across lists  Term verification  The API is documented at http://www.bodc.ac.uk/products/web_services/vocab/methods.html  The WSDL is available from http://vocab.ndg.nerc.ac.uk/

44 Vocabulary Maintenance  What if you can’t find the term you need?  Initially contact the SeaDataNet help desk (sdn-userdesk@seadatanet.org)  If they cannot resolve your problem they will pass the problem on to me  I will endeavour to add new terms or identify appropriate existing terms  Adding terms may involve discussions with vocabulary governance authorities  This can take time (possibly 2-3 weeks) so please try to think ahead

