Towards portability and interoperability for linguistic annotation and language-specific ontologies Robert Munro & David Nathan Endangered Languages Archive,

1 Towards portability and interoperability for linguistic annotation and language-specific ontologies Robert Munro & David Nathan Endangered Languages Archive, School of Oriental and African Studies

2 Outline 1. Introduction and motivation 2. Linguistic ontologies and markups 3. Representing knowledge 4. Supporting fieldworkers 5. Supporting speakers 6. Conclusions

3 1. Introduction and motivation

4 Introduction The main goal of this paper: how well does GOLD meet the requirements of portability for language documentation and description (Bird & Simons, 2003)? Road-testing: ability to meet the needs of archive users and contributors

5 Motivation The Endangered Languages Archive (ELAR) is part of the Hans Rausing Endangered Languages Project (HRELP) HRELP supports: the archive grants for documentation projects postgraduate programs focussing on language documentation

6 Motivation We (ELAR): support a digital archive (preserve data and provide access to it) We also train students and grantees in: markup strategies data management strategies multimedia development choice of recording equipment

7 Motivation There is concern that cataloguing metadata (IMDI / OLAC) has not yet been sufficiently extended (Nathan and Austin, 2004) rich linguistic and contextual information is not being recorded in well-formed portable formats/structures Common ontologies present a solution to this

8 How does GOLD meet our needs? We find GOLD to be the most suitable ontology for supporting data portability GOLD's focus has been on data analysis

9 Summary We suggest extending the focus to: data acquisition data access Key extensions: formalising the definitions of concepts by representing them as a set of formal properties explicitly capturing the conventions and constraints for presentation (rendering) modelling features that are inherently indeterminate and/or complex structures

10 2. Linguistic ontologies and markups

11 Linguistic ontologies and markups Ontology: strictly, what we agree exists Markup: strictly, what we are certain about Ontology and markup converge: only with consensus and complete confidence but there is rarely full confidence in the classification of new hard-to-classify phenomena in little-studied endangered languages

12 Indeterminacy Builders of ontologies outside of linguistics have been reluctant to accept inherent indeterminacy: In some cases, the incompatibilities [between ontologies] can be smoothed over by tweaking definitions of concepts or formalizations of axioms; in other cases, wholesale theoretical revision may be required. (Niles & Pease, 2001) If we can identify the incompatibilities, we can model them

13 Supporting linguistics A theory-neutral model of linguistics is not possible: Theories are poly-centric They will change We need a pan-theory model of linguistics

14 Formalising definitions Each concept in GOLD should be represented by a set of properties that describe that concept Three possible values for a given property: Yes, No, or Undefined (default) To accurately represent variance: include enough properties to distinguish terms For portability: include as many properties as possible

15 Formalising definitions Yes can potentially be expanded: whether the property is mandatory or optional for the concept dependencies between properties for a concept
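The three-valued property scheme on the two slides above could be sketched roughly as follows. This is only an illustration, not GOLD's actual representation: the `Concept` class, the `mandatory` field and the property names (`heads_noun_phrase`, `takes_tense`, `takes_case`) are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Value(Enum):
    YES = "yes"
    NO = "no"
    UNDEFINED = "undefined"  # default for any property that has not been stated


@dataclass
class Concept:
    name: str
    # Only stated properties are stored; anything absent counts as UNDEFINED.
    properties: Dict[str, Value] = field(default_factory=dict)
    # Hypothetical expansion of YES: properties that are mandatory, not optional.
    mandatory: Set[str] = field(default_factory=set)

    def value(self, prop: str) -> Value:
        """Return the stated value of a property, or UNDEFINED by default."""
        return self.properties.get(prop, Value.UNDEFINED)


# Illustrative property names only; they are not taken from GOLD.
noun = Concept(
    name="Noun",
    properties={"heads_noun_phrase": Value.YES, "takes_tense": Value.NO},
    mandatory={"heads_noun_phrase"},
)

print(noun.value("heads_noun_phrase"))
print(noun.value("takes_case"))
```

Storing only the stated properties keeps Undefined as the default, so adding further properties for portability never changes the meaning of existing records.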

16 Example Noun in GOLD: Noun Definition: A noun is a broad classification of parts of speech which include substantives and nominals (Crystal 1997:371; Mish et al. 1990:1176). (, last checked 23/05/2003) How do I know if my definition is the same as Crystal or Mish et al? Is it both definitions, or the common ground?

17 Example Will future users of GOLD have the same definition? the core of noun may have longevity the boundaries with other concepts will not COPEs can define extensions in terms of sets of properties, and add those properties to GOLD

18 Example GOLD: NOUN COPEs: Gerund (NOUN), NomVerb (NOUN) Can't formally identify the similarities

19 Example GOLD: NOUN COPEs: Gerund (NOUN), NomVerb (NOUN) + property: verb suffix Can formally identify the similarities Definition of NOUN can grow
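A minimal sketch of the comparison in the example above, assuming each concept is stored as a mapping from property name to a boolean, with absent keys standing for Undefined. The property names `nominal_core` and `verb_suffix` and the `shared_properties` helper are hypothetical.

```python
from typing import Dict, Set

Properties = Dict[str, bool]  # absent key = Undefined


def shared_properties(a: Properties, b: Properties) -> Set[str]:
    """Properties marked Yes (True) in both concepts."""
    return {p for p, v in a.items() if v and b.get(p, False)}


# Hypothetical core definition of NOUN.
NOUN: Properties = {"nominal_core": True}

# Two COPE extensions that start out as plain copies of NOUN.
gerund: Properties = dict(NOUN)
nom_verb: Properties = dict(NOUN)
before = shared_properties(gerund, nom_verb)

# Each COPE adds the same formal property to its extension.
gerund["verb_suffix"] = True
nom_verb["verb_suffix"] = True
after = shared_properties(gerund, nom_verb)

print(sorted(before))  # ['nominal_core']
print(sorted(after))   # ['nominal_core', 'verb_suffix']
```

Before the property is added, the two extensions share only what they inherit from NOUN; afterwards the overlap between them can be found mechanically.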

20 3. Representing knowledge

21 Rendering Separating form from content: ideal for flexibility not possible for some materials (esp. video)

22 Rendering conventions / constraints Some are well known: italicize part-of-speech in dictionaries align interlinear transcriptions Some are not: representation of language-specific kinship systems, ethnobotanical ontologies etc

23 Solution 1 Include a (written) description and/or example of the rendering conventions and constraints: hard-code the interface

24 Solution 2 Include formal representations of the conventions within the data: interface takes instructions from the data
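As a rough illustration of Solution 2, the record below carries its own rendering instructions and a generic interface simply follows them. The record layout, the `STYLES` table and the sample dictionary entry are assumptions made for this sketch, not an existing format.

```python
# Hypothetical record: the content plus instructions for presenting it.
entry = {
    "fields": {"headword": "example", "pos": "n", "gloss": "a sample entry"},
    "rendering": [
        {"field": "headword", "style": "bold"},
        {"field": "pos", "style": "italic"},  # italicize part-of-speech
        {"field": "gloss", "style": "plain"},
    ],
}

# The interface knows a few generic styles, nothing language-specific.
STYLES = {"bold": "**{}**", "italic": "_{}_", "plain": "{}"}


def render(record: dict) -> str:
    """Render a record by following the instructions stored inside it."""
    parts = []
    for rule in record["rendering"]:
        template = STYLES.get(rule["style"], "{}")  # unknown styles fall back to plain
        parts.append(template.format(record["fields"][rule["field"]]))
    return " ".join(parts)


print(render(entry))  # **example** _n_ a sample entry
```

The interface here is language independent: changing the order or style of fields means editing the data, not the code.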

25 Solutions These are two extremes: hard-coded and language-specific; data-driven and language-independent Database architectures and linguistic ontologies are not designed for navigation Transparent access to such structures: who does it support?

26 4. Supporting fieldworkers

27 Supporting indeterminacy There are two kinds of indeterminacy in linguistics: confidence in assigning a category (uncertainty) phenomena that are inherently variable, probabilistic, gradient or continuous

28 The most valuable information The most valuable information that a field linguist learns may be the least likely to be annotated Example: 7uhch in Lakandon Maya: A temporal-modal deictic expressing participant frames and speaker's footings (Bergqvist 2005) This term has been given the most thought by the researcher, but it is still not completely understood The uncertainty (or the extent of certainty) should be recorded: all the properties we do know

29 5 reasons for modelling uncertainty 1. To record the extent of our knowledge For example, we want everything known about 7uhch in Lakandon Maya to be recorded, even if we don't yet have a category for it

30 5 reasons for modelling uncertainty 2. For searchability If an archive implementing an ontology with uncertain categories exists, then we can more easily find existing solutions to a problem If a problem is truly new, then we can allow future researchers to find it
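The first two reasons could be sketched as follows: an annotation keeps whatever properties are known even when no category has been assigned, and an archive can then be searched by those properties. The `Annotation` class, the `find` helper and the property names are hypothetical; the three properties given for 7uhch only restate the slide's description of it as a temporal-modal deictic, and the second record is an invented placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    form: str
    language: str
    category: Optional[str] = None  # None = no category assigned yet
    known: Dict[str, bool] = field(default_factory=dict)


def find(archive: List[Annotation], **props: bool) -> List[Annotation]:
    """Return annotations whose known properties match every given value."""
    return [a for a in archive
            if all(a.known.get(name) == value for name, value in props.items())]


archive = [
    Annotation("7uhch", "Lakandon Maya", None,
               {"temporal": True, "modal": True, "deictic": True}),
    # Invented placeholder record, included only to show filtering.
    Annotation("placeholder-form", "hypothetical language", "Noun",
               {"deictic": False}),
]

uncategorised = [a.form for a in find(archive, deictic=True) if a.category is None]
print(uncategorised)  # ['7uhch']
```

A later researcher facing a similar problem could query on the same properties and find the earlier, still uncategorised, record.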

31 5 reasons for modelling uncertainty 3. To reach certainty Even an indeterminate markup can allow a corpus analysis that can inform a decision about assigning the appropriate category

32 5 reasons for modelling uncertainty 4. To highlight problems with descriptive frameworks A feature may only appear to belong to multiple (or no) categories because the descriptive framework does not yet account for it

33 5 reasons for modelling uncertainty 5. Because the concept is inherently indeterminate The concept may be inherently fuzzy but not previously encountered as a continuous / contiguous phenomenon

34 Inherently indeterminate features Eg: cline, gradience, squish, continuities, contiguities, vague, fuzzy, probabilistic Many prosodic, semantic and discourse features are inherently continuous Growing arguments for probabilities to be part of our formal linguistic models for morphological and syntactic structures (Aarts, 2004; Baayen, 2003; Manning, 2003)

35 Inherently indeterminate features Representing categories by formal properties meets the current requirements of modelling gradience (Aarts, 2004) Perhaps the ContinuousObject concept of SUMO (Niles & Pease, 2001) could also be used? The problem is, currently, largely unresolved
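One simple reading of modelling gradience through formal properties is to score an item by the share of a category's defined properties that it matches. This is a sketch under that assumption only, not the method of Aarts (2004) or of GOLD; the property names and values are hypothetical and make no claim about any language.

```python
from typing import Dict, Optional

Properties = Dict[str, Optional[bool]]  # None = Undefined


def membership(item: Properties, category: Properties) -> float:
    """Fraction of the category's defined properties that the item matches."""
    defined = [p for p, v in category.items() if v is not None]
    if not defined:
        return 0.0
    matches = sum(1 for p in defined if item.get(p) == category[p])
    return matches / len(defined)


# Hypothetical category definition and item.
NOUN: Properties = {
    "takes_determiner": True,
    "takes_plural": True,
    "takes_tense": False,
    "takes_object": False,
}
gerund_like: Properties = {
    "takes_determiner": True,
    "takes_plural": False,
    "takes_tense": False,
    "takes_object": True,
}

print(membership(gerund_like, NOUN))  # 0.5
```

A score between 0 and 1 lets an item sit partway into a category instead of forcing a yes-or-no assignment.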

36 Incorporating new categories How do we know that a given category is not the same as another one identified elsewhere? Formal properties for concepts give us another means for comparison

37 Incorporating structures As well as inherently discrete phenomena and inherently indeterminate ones, there is a third kind: concepts that are complex structures common in syntax and discourse semantics How do we model a structure in an ontology?

38 5. Supporting speakers

39 Users of EL archives The largest (and growing) user group for endangered languages materials is the speakers of endangered languages Rarely interested in linguistic categories or navigating a corpus or archive via them Supporting language-specific ontologies means supporting information-rich structures for both navigation and analysis

40 Case Study: Yolngu kinship The Yolngu languages have an extensive kinship terminology called Gurrutu 27 terms that identify individuals and sets of individuals in terms of moiety, generation, gender, and patriline or matriline. The terms extend infinitely through cyclicity

41 Case Study: Yolngu kinship Speakers draw from the same sets of kinship relations to describe their relationship to the Yolngu lands We cannot always annotate well-known linguistic concepts independently of language-specific ontologies

42 6. Conclusions

43 Conclusions Ontology building for endangered languages can be very different to other ontology projects The uncertain is often more valuable than the certain The local is often more interesting than the universal … but will still need interoperability We suggest extending the focus of GOLD to data acquisition and data access

44 Conclusions Current GOLD does not need to be altered to incorporate our suggestions except to remove assumptions of invariability Key extensions formalising the definitions of concepts by representing them as a set of formal properties explicitly capturing the conventions and constraints for presentation (rendering) modelling features that are inherently indeterminate and/or complex structures

45 References
Aarts, B 2004 Modelling linguistic gradience. Studies in Language, 28(1):1–49.
Bateman, J 1992 The theoretical status of ontologies in natural language processing. In Text Representation and Domain Modelling – ideas from linguistics and AI, Technische Universität Berlin
Baayen, H 2003 Probabilistic Approaches to Morphology. In Bod, R., Hay J. and Jannedy, S. (eds). Probabilistic Linguistics. MIT Press.
Bergqvist, H 2005 Semantics of temporal deictics in Lakandon Maya. Presentation given at the ELAP-ELAR seminar series, SOAS, London.
Bird, S & G Simons 2003 Seven Dimensions of Portability for Language Documentation and Description, Language 79/3:
Christie, M & W Gaykamangu Kinship, moiety, land & language in Arnhem Land. In literacy link. Australian Council for Adult Literacy, vol 23, no 5 Oct
Christie, M, W Gaykamangu & D Nathan Yolngu Languages and Culture: Gupapuyngu. Faculty of Aboriginal and Torres Strait Islander Studies, NTU [Multimedia CD-ROM]
Crystal, D 1997 A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell
Cysouw, M, J Good, M Albu & HJ Bibiko 2005 Can GOLD cope with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. Proceedings of the E-MELD 2005
Farrar, S. & D. T. Langendoen A linguistic ontology for the Semantic Web. GLOT International 7 (3),
Farrar, S. 2003a Markup and the GOLD ontology. Proceedings of the EMELD 2003
Farrar, S. 2003b An ontological account of linguistics: extending SUMO with GOLD. Proceedings of the 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering. Beijing
Foley, W A 2003 Genre, register and language documentation in literate and preliterate communities. In Peter K Austin (ed.) Language Documentation and Description vol 1
Grinevald, C 2003 Speakers and documentation of endangered languages. In Peter K Austin (ed.) Language Documentation and Description volume 1
Gruber, T R A translation approach to portable ontologies. Knowledge Acquisition, 5(2),
Himmelmann, N P 1998 Documentary and descriptive linguistics. Linguistics Berlin: de Gruyter.
Holton, G 2003 Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive. Proceedings of the EMELD 2003
Manning, C 2003 Probabilistic Syntax. In Bod, R., Hay J. and Jannedy, S. (eds). Probabilistic Linguistics. MIT Press.
Nathan, D. (ed) Australia's Indigenous Languages. Adelaide: SSABSA
Nathan, D and P K Austin 2004 Reconceiving metadata: language documentation through thick and thin. In Peter K Austin (ed.) Language Documentation and Description Volume 2.
Niles, I & A Pease 2001 Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001)
Penton, D, C Bow, S Bird & B Hughes Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004
