Standards for language resources: the ISO/TC 37(/SC 4) perspective


1 Standards for language resources: the ISO/TC 37(/SC 4) perspective
Laurent Romary, Directeur de Recherche (research director), INRIA, ISO/TC 37/SC 4 chair

2 Context
ISO TC37 - Terminology and other language resources
  SC3 - Computer applications in terminology
    ISO Martif
    ISO Data categories (under revision)
    ISO TMF (Terminological Markup Framework)
  SC4 - Language Resource Management

3 An example scenario: information extraction
Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data

4 Horizontal view (W3C perspective)
Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data
W3C technologies shown: OWL, XML, RDF, SOAP

5 Vertical view (ISO/TC 37/SC 4 perspective)
Semantic content; Content analysis; Syntactic structures; Chunk parsing; Part-of-speech tagging; POS tagging; Primary Data
SC 4 components shown: Evaluation; Linguistic models and descriptors (Data Categories); Lexica

6 Linguistic information sources …and initiatives
Access protocols [Corba, SOAP]
Primary resources (text, dialogues)
  Structural mark-up
  Basic annotations
  [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]
Knowledge structures
  Hierarchies of types
  Relations between concepts (subjects/topics etc.)
  Links to primary resources
  [Topic Maps, OIL, RDF]
Links
NLP structures (annotations)
  POS tagging
  Chunks (cf. Named Entities)
  Deep Syntactic structures
  Co-references etc.
  [Eagles/ISLE, CES, MATE,…]
Lexical structures (Language models)
  Terminologies
  Transfer lexica
  LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]
Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]

7 SC4 Approach
Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches
No provision of new formats
Situate development squarely in the framework of XML and related standards
Ensure compatibility with established and widely accepted web-based technologies
Ensure feasibility of transduction from legacy formats into newly defined formats

8 SC4 and other standardizing bodies
Contributing organizations
TEI: text representation
  Reference for primary sources, e.g. text archives
  Oscar Text
W3C: basic protocols and formats
  XML (Schemas), XPath, XPointer, + RDF, SVG, SMIL, SOAP
ISO TC37/SC4: language resources, NLP perspective
  e.g. linguistic annotations, lexical formats
Technical background
MPEG: Multimedia, XML based
  e.g. MPEG7-4, word and phone lattices
Audio/Speech

9 ISO/TC 37/SC 4 structure
WG1: Basic descriptors and mechanisms for language resources
WG2: Representation schemes
WG3: Multilingual text representation
WG4: Lexical databases
WG5: Workflow of language Resource Management
Data categories

10 On-going activities
Feature structure representation (in collaboration with the TEI - Text Encoding Initiative): ISO DIS 24610
Morpho-syntactic annotation: ISO NP 24611
Lexical markup framework: ISO NP (+ ISO NP )
Task force on Meta-data for language resources (OLAC+IMDI)
ACL/Sigsem working group on multimodal content representation
Data category registry for ISO/TC 37: ISO CD on ballot (deadline Jan. 2004)

11 Modeling linguistic annotation structures

12 General framework - 1
Model for linguistic annotation that can
  be instantiated in a standard representational format
    GMT: Generic Mapping Tool
  serve as a pivot format into and out of which proprietary formats may be transduced, to enable comparison, merging and manipulation via common tools
Reference: ISO Terminological Markup Framework
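The pivot idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pivot here is a plain list of (token, tag) pairs, not the GMT format, and both "proprietary" formats (slash-separated tokens and tab-separated lines) as well as every function name are invented for the example.

```python
# Hypothetical pivot: a list of (token, pos) pairs. Neither input format
# below comes from ISO/TC 37/SC 4; both are invented for illustration.

def from_slash(text):
    """Read 'the/DET cat/NOUN' into the pivot."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]

def to_slash(pivot):
    """Write the pivot back out as 'token/pos' items."""
    return " ".join(f"{tok}/{pos}" for tok, pos in pivot)

def from_tsv(text):
    """Read one 'token<TAB>pos' pair per line into the pivot."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line]

def to_tsv(pivot):
    """Write the pivot back out as tab-separated lines."""
    return "\n".join(f"{tok}\t{pos}" for tok, pos in pivot)

def disagreements(a, b):
    """Compare two pivot annotations of the same tokens, position by position."""
    return [(ta, pa, pb) for (ta, pa), (tb, pb) in zip(a, b)
            if ta == tb and pa != pb]

a = from_slash("the/DET cat/NOUN sleeps/VERB")
b = from_tsv("the\tDET\ncat\tNOUN\nsleeps\tNOUN")
print(disagreements(a, b))
```

Once both inputs are in the pivot, comparison is written once rather than once per pair of formats, which is the motivation the slide gives for a pivot.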

13 General framework - 2
A meta-model
  A general, underlying model that informs current practice
A set of data-categories
  Provides the precise semantics of the format
  Obtained:
    By sub-setting a Data Category Registry
    By providing application-specific categories
Vs. terminology - fixed named levels
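The two ways of obtaining a data category selection (sub-setting a registry, adding application-specific categories) might look as follows. This is a sketch only: the registry is reduced to a set of names, `select_categories` is a hypothetical helper, and `/myProjectTag/` is an invented application-specific category.

```python
# Hypothetical registry content: a real Data Category Registry would hold
# full entries (definitions, profiles, ...), not just names.
REGISTRY = {
    "/Part of speech/", "/Grammatical gender/", "/Grammatical number/",
    "/Feminine/", "/Plural/", "/Ablative/",
}

def select_categories(registry, wanted, application_specific=()):
    """Build a selection: a subset of the registry plus local additions."""
    missing = set(wanted) - set(registry)
    if missing:
        # Anything not in the registry must be declared application-specific.
        raise KeyError(f"not in registry: {sorted(missing)}")
    return set(wanted) | set(application_specific)

selection = select_categories(
    REGISTRY,
    ["/Part of speech/", "/Grammatical gender/", "/Feminine/"],
    application_specific=["/myProjectTag/"],  # invented for this example
)
print(sorted(selection))
```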

14 ISO 16642: A family of formats
TMF
TMLs: TML1, TML2, TML3, TMLi (examples: Geneter, TBX)
GMT

15

16 Meta-model
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  * Terminological Entry (TE)
    * Language Section (LS)
      * Term Section (TS)
        * Term Component Section (TCS)
(* marks a repeatable level)

17 TMF: example
TE: id='ID67', subjectField='manufacturing', definition='A value…'
  LS: lang='en'
    TS: term='alpha smoothing factor', termType='fullForm'
  LS: lang='hu'
    TS: term='…'
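The TE / LS / TS nesting of this example can be mirrored with a few Python dataclasses. This is a minimal sketch, not part of TMF: the class and field names are invented, the GI, CI and TCS levels are left out for brevity, and each level simply carries its data categories in a dict. The elided values are kept as they appear on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class TermSection:              # TS level
    info: dict = field(default_factory=dict)

@dataclass
class LanguageSection:          # LS level
    info: dict = field(default_factory=dict)
    term_sections: list = field(default_factory=list)

@dataclass
class TerminologicalEntry:      # TE level
    info: dict = field(default_factory=dict)
    language_sections: list = field(default_factory=list)

entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing", "definition": "A value…"},
    language_sections=[
        LanguageSection(
            {"lang": "en"},
            [TermSection({"term": "alpha smoothing factor", "termType": "fullForm"})],
        ),
        LanguageSection({"lang": "hu"}, [TermSection({"term": "…"})]),
    ],
)

def terms_by_language(te):
    """Collect the terms of an entry, grouped by language section."""
    return {
        ls.info["lang"]: [ts.info["term"] for ts in ls.term_sections]
        for ls in te.language_sections
    }

print(terms_by_language(entry))
```

Keeping the levels fixed and the data categories open (plain dict keys here) reflects the split between meta-model and data category selection described earlier in the talk.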

18 Implementation in TBX (cf. www.lisa.org)
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
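To show how the TBX elements line up with the meta-model levels (termEntry as TE, langSet as LS, tig as TS), the fragment can be read with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the fragment is copied from the slide with the Hungarian `langSet` closed so that it parses, it uses the slide's plain `lang` attribute, and `read_entry` and the shape of the returned dict are invented for the example.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide; the closing tags of the 'hu' langSet are assumed.
TBX_FRAGMENT = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
"""

def read_entry(xml_text):
    """Map termEntry / langSet / tig onto TE / LS / TS as nested dicts."""
    te = ET.fromstring(xml_text)
    entry = {"id": te.get("id"), "languages": {}}
    for descrip in te.findall("descrip"):        # TE-level data categories
        entry[descrip.get("type")] = descrip.text
    for lang_set in te.findall("langSet"):       # LS level
        term_sections = []
        for tig in lang_set.findall("tig"):      # TS level
            ts = {"term": tig.findtext("term")}
            for note in tig.findall("termNote"):
                ts[note.get("type")] = note.text
            term_sections.append(ts)
        entry["languages"][lang_set.get("lang")] = term_sections
    return entry

entry = read_entry(TBX_FRAGMENT)
print(entry["languages"]["en"])
```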

19 Implementing a Data Category Registry for ISO TC37

20 Data Category
Definition:
  Elementary descriptor used in a linguistic description or annotation scheme
Example:
  /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/
Background:
  Experience gained from ISO in linguistic format specification
  Wider notion of data-categories as meta-data for tagged language resources

21 Multiple uses of data categories
Documentation; Meta-data; XML schemas; Data category selection; Meta model; XSL filters

22 Application domains
Terminological data collection (TC 37/SC 3)
  Cf. “old” ISO set of data categories for terminology
Language codes (TC 37/SC 2)
  Cf. evolution from ISO and ISO to ISO 639-4
On-going and future SC4 activities (TC 37/SC 4)
  Meta-data for language resources
  Morpho-syntax/Syntax, Discourse level annotation
  NLP lexica, MT lexica
  Multilingual data representations (e.g. translation memories) and access (query languages)

23 Technical background
ISO (ISO JTC 1/SC 32): meta-data registry view
  Provides mechanisms for the management of data categories
ISO (ISO TC 37/SC 3): terminology view
  Provides ways of dealing with multilingual issues
OWL (W3C Sem. Web activity): ontology view
  Provides a framework for dealing with hierarchies and expressing constraints on data-categories
  E.g. a /noun/ can be described by means of /gender/ and /number/ in French

24 Relation to ISO 11179
Data element concept: complex datcat /gender/
Conceptual domain: set of simple datcats /masculine/, /feminine/, /neuter/
Data element: XML object, implemented as an XML attribute named 'gen' (XML schema declaration)
Value domain: list of values m, f, n (XML schema declaration)
Example: <w lemme="vert" gen="f">verte</w>
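The link between the value domain (m, f, n) and the conceptual domain (/masculine/, /feminine/, /neuter/) can be made concrete with a small check on the slide's `<w>` element. This is a sketch only: the pairing of each code with a data category is the obvious one but is an assumption, and `gender_of` is a hypothetical helper, not part of any standard or schema.

```python
import xml.etree.ElementTree as ET

# Value domain of the 'gen' attribute mapped to the simple data categories
# of the conceptual domain. The pairing is assumed from the slide.
GEN_VALUES = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}

def gender_of(word_xml):
    """Return the data category behind the 'gen' attribute of a <w> element."""
    w = ET.fromstring(word_xml)
    code = w.get("gen")
    if code not in GEN_VALUES:
        raise ValueError(f"'gen' value {code!r} is outside the value domain")
    return GEN_VALUES[code]

print(gender_of('<w lemme="vert" gen="f">verte</w>'))
```

In a real deployment the same constraint would more likely live in the XML schema declaration itself, as the slide indicates; the Python check only illustrates what that declaration enforces.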

25 The ISO 12620-1 proposal
Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: {/feminine/, /masculine/, /neuter/}
Object Language: fr
  Name: genre
  Conceptual Domain: {/feminine/, /masculine/}
Object Language: en
  Name: gender
Object Language: de
  Name: Geschlecht
  Conceptual Domain: {/feminine/, /masculine/, /neuter/}
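The entry above combines a global conceptual domain with per-language restrictions, which can be modelled roughly as below. This is a sketch, not the ISO 12620-1 data model: the class and field names are invented, definitions are omitted, and falling back to the global domain when a language section lists none (as for en) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectLanguageSection:
    name: str
    conceptual_domain: Optional[frozenset] = None   # None: nothing listed

@dataclass
class DataCategoryEntry:
    identifier: str
    profile: str
    conceptual_domain: frozenset
    languages: dict = field(default_factory=dict)

    def domain_for(self, lang):
        """Language-specific domain if given, else the global one (assumed)."""
        section = self.languages.get(lang)
        if section is not None and section.conceptual_domain is not None:
            return section.conceptual_domain
        return self.conceptual_domain

ALL = frozenset({"/feminine/", "/masculine/", "/neuter/"})

gender = DataCategoryEntry(
    identifier="gender",
    profile="morpho-syntax",
    conceptual_domain=ALL,
    languages={
        "fr": ObjectLanguageSection("genre", frozenset({"/feminine/", "/masculine/"})),
        "en": ObjectLanguageSection("gender"),
        "de": ObjectLanguageSection("Geschlecht", ALL),
    },
)

print(sorted(gender.domain_for("fr")))
```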

26 Perspectives
ISO/TC 37/SC 4 in a wider picture
  Basic building blocks to bring coherence to the representation of linguistic information in a variety of application domains
    E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
  Provide vertical solutions for linguistically based applications
    E.g. Information extraction, indexing

