Language data and XML: archiving and interoperability Simon Musgrave Linguistics Program Monash University

1 Language data and XML: archiving and interoperability
Simon Musgrave
Linguistics Program, Monash University

2 DRH Cheltenham 2/9/03 Language documentation
Language documentation produces large quantities of text
–Transcribed language events
–Associated annotations
–Lexica / dictionaries
–Analyses
–Ethnographic notes
–…
There is no standard software tool used by linguists
Use of proprietary software results in file formats with limited portability

3 Advantages of XML: Archiving
Unicode compatibility assured
–Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists
Explicit coding of data model
Generic file format assures better portability and lifespan

4 Building an archive
Addition of data to an XML archive should be automated
This implies the existence of transformation scripts to move data between formats
Creating these scripts is work which has to be done
It can have a second benefit

5 Advantages of XML: Interoperability
Members of a research team may use different software running on different platforms
Problems can arise in sharing data
An important use of XML is as an interchange format
Transformation scripts created for archiving can also be used for sharing data

6 Data structures - 1
Researchers may not agree on common data structures
–They are used to working with one tool in one particular way
–Their interests are different
Even if they agree on a data structure for current work, heritage data may have to be imported to the archive

7 Data structures - 2
Archive files must be able to hold all the information coded in all the possible input formats: there should be no loss of data
We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
Where possible, correspondences will be made between the information in different input files
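
The unification idea on this slide can be illustrated with a minimal sketch. It assumes attribute-value matrices are represented as nested Python dictionaries; the field names (`form`, `sense`, `gloss`, `pos`, `language`) and the values are hypothetical and are not taken from the project's data. Two structures unify when they never assign different atomic values to the same attribute, and the result holds the information from both.

```python
def unify(a, b):
    """Unify two attribute-value matrices given as nested dicts.

    Returns a new merged dict; raises ValueError if the same attribute
    carries two different atomic values.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Attribute only present in b: simply add it.
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            # Both sides are sub-matrices: unify them recursively.
            result[key] = unify(result[key], b_val)
        elif result[key] != b_val:
            raise ValueError(f"clash on {key!r}: {result[key]!r} vs {b_val!r}")
    return result


# Hypothetical entries for the same word from two different sources.
entry_a = {"form": "word1", "sense": {"gloss": "stone"}}
entry_b = {"form": "word1", "sense": {"pos": "n"}, "language": "Nusalaut"}
merged = unify(entry_a, entry_b)
print(merged)
```

In this reading, the archive structure is general enough when every input record unifies with it without a clash and without dropping an attribute.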

8 Example: Dictionary files
The prototype implementation of the process uses a simple type of information: dictionary files
Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut
Source 2 is a table in an Access database containing data from several languages

9 Source 1

10 Source 2

11 Process overview

12 Stage 1 – txt to xml
Data exported from database as delimited text file
A document type definition (DTD) is created for each source file
–This replicates the existing data structure, possibly with additions
A Perl script reads data from the txt file and adds tags based on the DTD
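
As a rough illustration of Stage 1, the sketch below wraps each field of a tab-delimited export in an element. It is written in Python rather than Perl, and it is not the project's script: the field names in `FIELDS`, the `dictionary` and `entry` element names, and the sample rows are all assumptions made for the example. In the real process the element names would come from the DTD written for that source.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical field order of the exported text file.
FIELDS = ["headword", "pos", "gloss"]


def rows_to_xml(text, fields=FIELDS, delimiter="\t"):
    """Turn delimited text into element-only XML, one <entry> per row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    lines = ["<dictionary>"]
    for row in reader:
        lines.append("  <entry>")
        for name, value in zip(fields, row):
            # escape() protects &, < and > in the field content.
            lines.append(f"    <{name}>{escape(value)}</{name}>")
        lines.append("  </entry>")
    lines.append("</dictionary>")
    return "\n".join(lines)


# Made-up sample rows standing in for a database export.
sample = "word1\tn\tstone & rock\nword2\tv\tto go\n"
xml_text = rows_to_xml(sample)
print(xml_text)
```

The output uses only elements, matching the later remark that attributes are introduced at Stage 2.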

13 Sample: specific XML

14 Stage 1 – Why?
Newer versions of commercial software offer an export to XML facility
Importing data from a normalized database often means having access to data from more than one table
–XSLT takes a single input file
–Perl (or an equivalent) does not have this limitation
Type conversion can be done using Perl
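
The two points about scripting, access to several tables and type conversion, can be sketched as follows. This uses Python as the "equivalent" of Perl; the table layouts, column names (`word_id`, `form`, `lang_id`, `checked`, `name`) and values are hypothetical and only stand in for exports from a normalized database.

```python
import csv
import io


def load_table(text, key):
    """Read a tab-delimited export with a header row into a dict keyed on one column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key]: row for row in reader}


def join_entries(words_text, langs_text):
    """Combine a word table with a language lookup table, converting types on the way."""
    langs = load_table(langs_text, "lang_id")
    entries = []
    for row in csv.DictReader(io.StringIO(words_text), delimiter="\t"):
        entries.append({
            "word_id": int(row["word_id"]),              # text -> integer
            "form": row["form"],
            "language": langs[row["lang_id"]]["name"],   # foreign key -> name
            "checked": row["checked"] == "1",            # "0"/"1" -> boolean
        })
    return entries


# Made-up exports of two related tables.
WORDS = "word_id\tform\tlang_id\tchecked\n1\tword1\tL1\t1\n2\tword2\tL2\t0\n"
LANGS = "lang_id\tname\nL1\tLanguage A\nL2\tLanguage B\n"
entries = join_entries(WORDS, LANGS)
print(entries)
```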

15 Stage 2 – XML1 to XML2
DTD for archive file has a place for all information in all input files
More structure imposed at this level
–Stage 1 used only elements
–Stage 2 uses attributes, mainly for metadata
–“Pseudo-normalization”: recurring data substructures treated as optionally recurring elements – the archive data structure is actually more general than ANY of the inputs
Date stamping done at this stage
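
A minimal sketch of what a Stage 2 transformation might look like, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`archive`, `entry`, `form`, `gloss`, `source`, `added`, `lang`) and the input layout with `gloss_en` / `gloss_id` columns are assumptions for illustration, not the project's DTD. It shows the three features named on the slide: metadata as attributes, repeated columns turned into an optionally recurring element, and a date stamp.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical source-specific XML (elements only, as produced at Stage 1).
SPECIFIC = (
    "<dictionary><entry>"
    "<headword>word1</headword>"
    "<gloss_en>stone</gloss_en><gloss_id>batu</gloss_id>"
    "</entry></dictionary>"
)


def to_archive(specific_root, source_name, stamp=None):
    """Map source-specific XML onto a more general archive structure."""
    stamp = stamp or date.today().isoformat()
    archive = ET.Element("archive")
    for entry in specific_root.findall("entry"):
        # Metadata (origin and date stamp) is carried in attributes.
        out = ET.SubElement(archive, "entry", {"source": source_name, "added": stamp})
        ET.SubElement(out, "form").text = entry.findtext("headword")
        # "Pseudo-normalization": gloss_xx columns become repeated <gloss> elements.
        for child in entry:
            if child.tag.startswith("gloss_") and child.text:
                lang = child.tag[len("gloss_"):]
                ET.SubElement(out, "gloss", {"lang": lang}).text = child.text
    return archive


archive = to_archive(ET.fromstring(SPECIFIC), "source1", stamp="2003-09-02")
print(ET.tostring(archive, encoding="unicode"))
```

Because `gloss` may occur any number of times, the archive form can hold an input with one gloss column as easily as one with five, which is the sense in which it is more general than any single input.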

16 Sample: General XML 1

17 Sample: General XML 2

18 Exporting Data
XSLT with
The only complication is undoing “pseudo-normalization”
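
The slide names XSLT as the export tool; the sketch below uses Python instead, purely to show what undoing pseudo-normalization involves. It assumes a hypothetical archive layout in which each `entry` has one `form` and any number of repeated `gloss` elements, and a target format with a fixed number of gloss columns. Neither the layout nor the column count comes from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive fragment with optionally recurring <gloss> elements.
ARCHIVE = (
    "<archive>"
    '<entry><form>word1</form><gloss lang="en">stone</gloss><gloss lang="id">batu</gloss></entry>'
    '<entry><form>word2</form><gloss lang="en">go</gloss></entry>'
    "</archive>"
)


def export_rows(archive_root, max_glosses=2, delimiter="\t"):
    """Flatten repeated <gloss> elements back into a fixed number of columns."""
    lines = []
    for entry in archive_root.findall("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")]
        # Pad with empty cells, or cut off glosses beyond the column count.
        glosses = (glosses + [""] * max_glosses)[:max_glosses]
        lines.append(delimiter.join([entry.findtext("form", default="")] + glosses))
    return "\n".join(lines)


rows = export_rows(ET.fromstring(ARCHIVE))
print(rows)
```

Note that cutting off extra glosses loses data on export; a real transformation would need a policy for entries that have more recurring elements than the target format has columns.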

19 A more complex problem: aligned interlinear text
Important way of presenting data for linguists
Various lines of annotation, different levels have different alignment patterns

20 The Bird, Bow & Hughes Model
Bird, Steven, Cathy Bow and Baden Hughes (2003) A generalised model of interlinear text. Proceedings of the EMELD Workshop.
A general data model for representing this type of information
Four levels:
–Text
–Phrase
–Word
–Morpheme
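
The four levels can be pictured as nested containers. The sketch below is only an illustration of that nesting: the slide gives the level names, while the attributes attached to each level (`form`, `gloss`, `translation`, `title`) and the English example are assumptions, and the published model may place annotations differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def gloss_line(phrase):
    """Build the morpheme-gloss line for a phrase, e.g. 'dog-PL bark-PAST'."""
    return " ".join("-".join(m.gloss for m in w.morphemes) for w in phrase.words)


# Made-up English example.
phrase = Phrase(
    words=[
        Word("dogs", [Morpheme("dog", "dog"), Morpheme("s", "PL")]),
        Word("barked", [Morpheme("bark", "bark"), Morpheme("ed", "PAST")]),
    ],
    translation="The dogs barked.",
)
text = Text(title="Example", phrases=[phrase])
print(gloss_line(phrase))
```

Keeping each annotation on the level it belongs to (a free translation on the phrase, a gloss on the morpheme) is what lets different display alignments be generated from one structure.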

21 XML model for aligned text

22 Aligned text: Problems
Various types of input:
–Text strings with space and/or tabs (Shoebox)
–Formatted text (e.g. Word tables)
–Structured data (e.g. Spinoza database)
Type of processing varies
–Text strings need a lot of parsing
–Structured data needs access to multiple tables
Ideally, time alignment to AV source should be included also
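
To give a sense of the parsing that text-string input needs, here is a minimal sketch that pairs up two annotation lines aligned by space padding. It assumes that each unit on the lower line starts in the same character column as the word above it and that only spaces (not tabs) are used; this is a simplification for illustration and not a description of the actual Shoebox file format. The sample lines are made up.

```python
import re


def align_by_columns(top, bottom):
    """Pair each word on the top line with the text below it in the same columns."""
    # Column at which each top-line word starts.
    starts = [m.start() for m in re.finditer(r"\S+", top)]
    # Each slice runs up to the start of the next word (or the end of the longer line).
    ends = starts[1:] + [max(len(top), len(bottom))]
    return [(top[s:e].strip(), bottom[s:e].strip()) for s, e in zip(starts, ends)]


top = "dog-s   bark-ed"
bottom = "dog-PL  bark-PAST"
pairs = align_by_columns(top, bottom)
print(pairs)
```

Real input is messier (tabs, wrapped lines, glosses longer than the slot above them), which is why the slide singles out text strings as needing a lot of parsing.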

23 What is gained
Interoperability within the project
–Data can be imported to the archive file from one format and exported to another format
Interoperability outside the project
–People who wish to share data with a group will define transformations from their data formats
–A bottom-up approach to developing standards
Improved data modeling
–Encourages members of the project to revise their data formats
–Gives us help in developing high-level models for linguistic data

24 Future work
Processing aligned text formats
Using schemas rather than DTDs: data validation
Improved version control, especially checking for duplicate or conflicting records
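
One possible reading of "checking for duplicate or conflicting records" is sketched below; it is an assumption about what such a check could do, not a description of the project's plans. Records are plain dictionaries with hypothetical `form` and `gloss` fields. Records sharing a key are reported as duplicates when they are identical and as conflicts when they differ.

```python
from collections import defaultdict


def check_records(records, key="form"):
    """Return (duplicates, conflicts): keys whose records are identical vs. differing."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec[key]].append(rec)

    duplicates, conflicts = [], []
    for k, group in by_key.items():
        if len(group) < 2:
            continue  # a single record can be neither
        if all(rec == group[0] for rec in group[1:]):
            duplicates.append(k)
        else:
            conflicts.append(k)
    return duplicates, conflicts


# Made-up records as they might arrive from two imports.
records = [
    {"form": "word1", "gloss": "stone"},
    {"form": "word1", "gloss": "stone"},
    {"form": "word2", "gloss": "go"},
    {"form": "word2", "gloss": "walk"},
    {"form": "word3", "gloss": "eat"},
]
duplicates, conflicts = check_records(records)
print(duplicates, conflicts)
```

Whether two differing records are a true conflict or legitimate homonyms is a linguistic decision, so a check like this can only flag candidates for a person to review.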

25 Some details
This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
Funding:
–Hans Rausing Endangered Languages Project
–Australian Research Council
–Faculty of Arts, Monash University
Contacts:
–http://www.arts.monash.edu.au/ling/maluku

