ISOcat Data Model: Workflow & Guidelines Marc Kemps-Snijders a, Sue Ellen Wright b, Menzo Windhouwer a a Max Planck Institute for Psycholinguistics, b.

1 ISOcat Data Model: Workflow & Guidelines. Marc Kemps-Snijders (a), Sue Ellen Wright (b), Menzo Windhouwer (a); (a) Max Planck Institute for Psycholinguistics, (b) Kent State University. NEERI Helsinki Standards Workshop

2 Data category. The result of the specification of a given data field. A data category is an elementary descriptor in a linguistic structure or an annotation scheme (ISO ). Linguistic data categories: /part of speech/, /noun/, /verb/, /definition/, /context/, etc. © DCR Group

3 Data Category Applications. DCs are used as: field names in databases; permissible values for closed and constrained data categories; tag names and attribute values in annotation frameworks. DCs are used by: different broad thematic domains (e.g., terminology, morphosyntax, lexicography); different communities of practice within a given domain. Data category selections exist as: resource tag sets (e.g., tagsets used in major corpora); standardized sets of field names and values (e.g., TBX Basic, TBX [ISO 30042]).

4 Data category. TC 37 practice treats both data fields and enumerated domain values as data categories. Open data categories: e.g., term, which can take any value designated as a term. Closed data categories: e.g., grammatical gender, which takes a set of enumerated values as its content. Constrained data categories: e.g., Olympic years, which takes as its content values defined by a formal constraint (i.e., every fourth year starting from a certain date). Simple data categories: e.g., masculine, a member of an enumerated value domain.

5 Data category types. Complex data categories: open (writtenForm: string, any value); closed (grammaticalGender: string, enumerated values neuter, masculine, feminine); constrained (Address: string, constrained). Simple data categories: the enumerated values themselves (neuter, masculine, feminine).
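
The three complex types and the simple values from slides 4 and 5 can be sketched in Python as follows. This is only an illustration, not the ISOcat implementation: the class names, the `accepts` method, and the choice of 1896 as the starting year for the Olympic-years constraint are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class OpenDC:
    """Open data category: any string is an acceptable value."""
    identifier: str

    def accepts(self, value: str) -> bool:
        return isinstance(value, str)


@dataclass(frozen=True)
class ClosedDC:
    """Closed data category: the value must be one of the enumerated
    simple data categories in its value domain."""
    identifier: str
    value_domain: FrozenSet[str]

    def accepts(self, value: str) -> bool:
        return value in self.value_domain


@dataclass(frozen=True)
class ConstrainedDC:
    """Constrained data category: the value must satisfy a formal rule."""
    identifier: str
    rule: Callable[[str], bool]

    def accepts(self, value: str) -> bool:
        return self.rule(value)


def is_olympic_year(value: str, first_year: int = 1896) -> bool:
    # Assumption: "every fourth year starting from a certain date",
    # with 1896 chosen here as that date purely for illustration.
    if not value.isdigit():
        return False
    year = int(value)
    return year >= first_year and (year - first_year) % 4 == 0


written_form = OpenDC("writtenForm")
grammatical_gender = ClosedDC(
    "grammaticalGender", frozenset({"neuter", "masculine", "feminine"})
)
olympic_year = ConstrainedDC("olympicYear", is_olympic_year)
```

In this sketch the simple data categories are just the strings inside `value_domain`; a fuller model would give each of them its own record.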

6 Data category relationships. Value domain membership. Subsumption relationships between simple data categories. Relationships between complex data categories are not stored in the DCR. Example: partOfSpeech (enumerated string) has /pronoun/ in its value domain, and /pronoun/ subsumes /personal pronoun/.
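
A minimal sketch of the two relationship kinds named on this slide, value domain membership and subsumption between simple data categories. The dictionaries, the function names, and the camel case identifier `personalPronoun` are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set

# Value domain membership: complex (closed) DC -> its simple DCs.
VALUE_DOMAIN: Dict[str, Set[str]] = {"partOfSpeech": {"pronoun"}}

# Subsumption: narrower simple DC -> the simple DC that subsumes it.
IS_A: Dict[str, str] = {"personalPronoun": "pronoun"}


def ancestors(simple_dc: str) -> List[str]:
    """Return the chain of broader simple DCs, nearest first."""
    chain: List[str] = []
    seen = {simple_dc}
    current = simple_dc
    while current in IS_A:
        current = IS_A[current]
        if current in seen:
            raise ValueError(f"subsumption cycle at {current!r}")
        seen.add(current)
        chain.append(current)
    return chain


def subsumes(broader: str, narrower: str) -> bool:
    """True if `broader` is reachable from `narrower` via is-a links."""
    return broader in ancestors(narrower)
```

Keeping the two relations in separate structures mirrors the slide: membership links a complex category to simple ones, while subsumption only ever links simple categories to each other.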

7 Data Category Registry (DCR): a set of data categories to be used as a reference for the definition of linguistic annotation schemes or any other formats used in the area of language resources. Implemented as the TC 37 ISOcat registry. Registration Authority: Max Planck Institute for Psycholinguistics, Nijmegen. Open and accessible at: Come play with the cat! But he's a bit fussy and likes to have people follow some simple rules! The simple rules are spelled out in the DCR Guidelines.

8 ISOcat model and mission & metaphor. Not a layered onion … but a segmented aggregate, like a knob of garlic: the cloves are sets of private data categories; the center stem represents the standardization core. Many DCs and DCSs may never be intended for standardization. Only the standardized core is described in ISO 12620:2009. Need to define non- and pre-standardization procedures.

9 ISOcat Data model. The ISO data model consists of 3 main parts. Administrative part: administration and identification. Descriptive part: documentation and information for working language or languages; data element names and identifiers; data element concept definitions. Linguistic part: conceptual domain of object language; data element type declarations; special object language constraints.

10 Data Model and DC life cycle. Part 1 of the ISOcat data model reflects the DC standardization cycle. Major steps in the workflow = classes in the DC model. But the creation cycle precedes standardization: a DC must be created, and ideally discussed in a group, before the standardization process even begins. Not all DCs will be standardized. The process starts out here, and we need to define this process.

11 Non- & Pre-Standardization Workflow. DC created in private work space. Option: DC remains private. Option: assign DC to a group. Option: DC discussed & revised in the group to achieve consensus. Option: DC used in group. Option: DC used widely by public group. Option: DC submitted for standardization. Standards process starts with submission. Diagram labels: Standardized Core, DCS, DCR.

12 Cascade of Responsibility. ISOcat Model: design by the ISOcat development group; approved by TC 37; standardized in ISO 12620:2009. ISOcat input template, interface presentation: implementation by the ISOcat programmer/system administrator; approved by development group; scrutiny of beta testers, user community. ISOcat Guidelines for data category specifications: instantiation by the individual expert user; scrutiny by other users, eventually by DCR TDGs/DCRB.

13 ISO Overview. Three parts. Linchpin: Data Category

14 Part 1 Global Information & Administration Information Section

15 Identifiers – Responsibilities. Global Information: non-mnemonic identifier (Key), a system-assigned internal identifier; persistent identifier (PID), a system-generated external identifier. Administration Record: user-assigned Identifier (required), a camel case mnemonic ID that must be an XML-valid element name (without a namespace); e.g., partOfSpeech is valid, whereas my:POS and 123POS are not.
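
The identifier rule above (an XML-valid element name without a namespace) can be approximated with a short check. The sketch below is an assumption-laden simplification: it accepts only ASCII letters, digits, underscore, hyphen and period, whereas real XML names allow many more Unicode characters, and it does not try to enforce camel case. The function name `is_valid_identifier` is hypothetical.

```python
import re

# ASCII-only approximation of an XML name with no namespace prefix:
# it must start with a letter or underscore and may not contain a colon.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_identifier(candidate: str) -> bool:
    """True if `candidate` looks like a colon-free XML element name."""
    return _IDENTIFIER.fullmatch(candidate) is not None


for example in ("partOfSpeech", "my:POS", "123POS"):
    print(example, is_valid_identifier(example))
```

With this pattern, `my:POS` fails because of the namespace-style colon and `123POS` fails because an XML name cannot begin with a digit.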

16 Justification – Creator Responsibility (required). Justification for /part of speech/: the need for part of speech is obvious, but that is not true of every DC for every potential user. Required for standardization. Highly desirable for any DC that will be shared outside a private scope.

17 Administration Information Section. Implementation of the standardization workflow. Embodied in the information workflow associated with the standardization process. Standardized in ISO 12620:2009 in compliance with ISO Directives Annex ST for Standards as Databases. Represented by the flowcharts in slides 19 and 20. Responsibility: Thematic Domain Groups (TDGs), which act as stewards in maintaining data category specifications (DCs) and data category selections (DCSs); Data Category Registry Board (DCRB), which validates DCs and DCSs and endeavors to harmonize among TDGs.

18 Data category: the standardization option. Data categories can be kept private or submitted to the standardization process, in which case they are assigned to a Thematic Domain Group, which judges them. Diagram: DCR Board above TDG metadata, TDG ….., TDG morphosyntax, TDG terminology. At regular intervals, snapshots of the standardized subset of the DCR will be submitted to ISO to form a standard as database according to Annex ST of the ISO/IEC Directives.

19 TDG Role: Maintenance Team

20 DCRB Role: Validation Role

21 Part 2 Descriptive Part. Describes equivalents in working languages; English data element name, definition, and justification comment required. Database-, format- or application-specific data element names. Rigorous terminological definition consisting of a single sentence fragment linked to a logical concept system.

22 Part 2: Guideline Responsibilities. Data Element Name: language-independent name for the data category used in a specific application domain (specified in the Source); PoS / POS / pos are all common short forms used for /part of speech/ in various application environments. Name Section in a Language Section (min. one required in the English Language Section; multiple in multiple Language Sections permitted): human-legible (mnemonic) name, e.g., part of speech in the English language section, partie du discours in the French language section.

23 One en Name required. Multiple Names optional. Multiple Names in other languages optional.
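
The name requirements on slides 22 and 23 amount to a small validation rule: at least one name in the English language section, with further names in English or other languages optional. A hedged sketch follows; the dictionary layout (language code mapped to a list of names) and the function name are assumptions made for illustration, not the ISOcat data format.

```python
from typing import Dict, List


def missing_name_requirements(language_sections: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the names pass."""
    problems: List[str] = []
    # Ignore blank entries so that an empty string does not count as a name.
    english_names = [n for n in language_sections.get("en", []) if n.strip()]
    if not english_names:
        problems.append("at least one English (en) name is required")
    return problems


ok = {"en": ["part of speech"], "fr": ["partie du discours"]}
bad = {"fr": ["partie du discours"]}
print(missing_name_requirements(ok))   # no problems reported
print(missing_name_requirements(bad))  # reports the missing English name
```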

24 Part 2: Guideline Responsibilities. Definition: rigorous intensional definitions (ISO 704); single sentence fragment; additional information goes in comments fields, justification, etc. Example (translated from the German): The class of words of a language (broader concept) … on the basis of their assignment (characteristic) according to shared grammatical features (characteristic). Source: the source for any quoted material; here: Wikipedia.

25 Part 3 Linguistic Part

26 Data category: Linguistic part. Complex, constrained and simple data categories are explicitly modeled here. Constraints for a given object language. Enumeration of permissible values in closed value domains.

27 Data category: Linguistic part (example). Data category: /grammatical gender/. Conceptual domain: /masculine/, /feminine/, /neuter/ (lists all admissible values for all languages). Linguistic Section, language fr: value domain /masculine/, /feminine/ (lists all admissible values for French). Linguistic section values must be a subset of the defined conceptual domain. Data category: /part of speech/, value: /partitive/: limited in the Linguistic Section to French. Issue with the partitive case in Finnish: some values are very language dependent.
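
The subset rule on this slide is easy to check mechanically. The following sketch uses the /grammatical gender/ values from the slide; the function name `values_outside_domain` and the extra value `common` used in the failing case are hypothetical.

```python
from typing import FrozenSet, Iterable, Set

# Conceptual domain of /grammatical gender/: admissible for all languages.
CONCEPTUAL_DOMAIN: FrozenSet[str] = frozenset({"masculine", "feminine", "neuter"})

# Linguistic section for French (fr): admissible for that language only.
FRENCH_VALUE_DOMAIN: FrozenSet[str] = frozenset({"masculine", "feminine"})


def values_outside_domain(
    conceptual_domain: Iterable[str], linguistic_values: Iterable[str]
) -> Set[str]:
    """Return linguistic-section values missing from the conceptual domain.

    An empty result means the linguistic section is a valid subset.
    """
    return set(linguistic_values) - set(conceptual_domain)


print(values_outside_domain(CONCEPTUAL_DOMAIN, FRENCH_VALUE_DOMAIN))
print(values_outside_domain(CONCEPTUAL_DOMAIN, {"masculine", "common"}))
```

Returning the offending values, rather than a bare true or false, makes it easier to report exactly which value needs to be added to the conceptual domain or removed from the linguistic section.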

28 QA Components: option for ad hoc group validation; TDG approval during standardization; DCRB harmonization & validation; ISOcat Checker.

29 Thank you for your attention. Come play with the cat!
