Language data and XML: archiving and interoperability
Simon Musgrave
Linguistics Program, Monash University
(Simon.Musgrave@arts.monash.edu.au)

DRH 2003 - Cheltenham 2/9/03

Language documentation
Language documentation produces large quantities of text
  – Transcribed language events
  – Associated annotations
  – Lexica / dictionaries
  – Analyses
  – Ethnographic notes
  – ...
There is no standard software tool used by linguists
Use of proprietary software results in file formats with limited portability

Advantages of XML: Archiving
Unicode compatibility assured
  – Besides support for many scripts, access to the full International Phonetic Alphabet character set is important for linguists
Explicit coding of data model
Generic file format assures better portability and lifespan

Building an archive
Addition of data to an XML archive should be automated
This implies the existence of transformation scripts to move data between formats
Creating these scripts is work which has to be done
It can have a second benefit

Advantages of XML: Interoperability
Members of a research team may use different software running on different platforms
Problems can arise in sharing data
An important use of XML is as an interchange format
Transformation scripts created for archiving can also be used for sharing data

Data structures - 1
Researchers may not agree on common data structures
  – They are used to working with one tool in one particular way
  – Their interests are different
Even if they agree on a data structure for current work, heritage data may have to be imported to the archive

Data structures - 2
Archive files must be able to hold all the information coded in all the possible input formats: there should be no loss of data
We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
Where possible, correspondences will be made between the information in different input files
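The unification idea can be made concrete with a small sketch. This is an illustration only, written in Python rather than anything used in the project: attribute-value matrices are modelled as nested dicts, `None` stands for an unspecified value, and the field names in `ARCHIVE_TEMPLATE` are hypothetical placeholders, not the project's actual archive fields.

```python
# Hypothetical general (archive) structure: every attribute any input may
# carry is listed, with None meaning "not yet specified".
ARCHIVE_TEMPLATE = {"headword": None, "gloss": None, "language": None, "notes": None}


def unify(a, b):
    """Unify two attribute-value matrices (nested dicts).

    None unifies with anything; equal atomic values unify with each other;
    two different specified values clash and raise ValueError.
    """
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = unify(a.get(key), value)
        return out
    if a == b:
        return a
    raise ValueError(f"clash: {a!r} vs {b!r}")


def has_place_for(template, record):
    """True if every attribute of a flat record exists in the template,
    i.e. importing the record loses no information."""
    return set(record) <= set(template)


# Placeholder record standing in for one row of an input file.
record = {"headword": "word1", "gloss": "gloss1"}
merged = unify(ARCHIVE_TEMPLATE, record)
```

Under this reading, an input format is acceptable when `has_place_for` holds for its records and `unify` never raises.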

Example: Dictionary files
The prototype implementation of the process uses a simple type of information: dictionary files
Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut
Source 2 is a table in an Access database containing data from several languages

Source 1

Source 2

Process overview

Stage 1 – txt to xml
Data exported from database as delimited text file
A document type definition (DTD) is created for each source file
  – This replicates the existing data structure, possibly with additions
A Perl script reads data from the txt file and adds tags based on the DTD
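A minimal sketch of this stage follows. The project used Perl; Python is used here purely for illustration. The sketch assumes a tab-delimited export whose first row holds field names that are valid XML element names, and the names `dictionary`, `entry`, `headword` and `gloss` are hypothetical. It does not read or validate against a DTD.

```python
import csv
import io
import xml.etree.ElementTree as ET


def delimited_to_xml(text, root_tag="dictionary", row_tag="entry", delimiter="\t"):
    """Wrap each row of a delimited export in elements named after its columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = ET.Element(root_tag)
    for row in reader:
        entry = ET.SubElement(root, row_tag)
        for field, value in row.items():
            # Skip empty cells and any surplus cells beyond the header row.
            if field and value:
                ET.SubElement(entry, field).text = value
    return root


# Placeholder data standing in for a database export.
sample = "headword\tgloss\nword1\tgloss1\nword2\t\n"
xml_text = ET.tostring(delimited_to_xml(sample), encoding="unicode")
```

Only elements are produced at this stage, mirroring the flat structure of the source table.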

Sample: specific XML

Stage 1 – Why?
Newer versions of commercial software offer an export to XML facility
Importing data from a normalized database often means having access to data from more than one table
  – XSLT takes a single input file
  – Perl (or an equivalent) does not have this limitation
Type conversion can be done using Perl
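The multi-table point can be sketched as follows, again in Python as a stand-in for Perl "or an equivalent". Two hypothetical tab-delimited exports are assumed: an entries table with columns `id`, `headword`, `checked`, and a senses table with columns `entry_id`, `gloss`. The `checked` flag shows a simple type conversion from `1`/`0` to `true`/`false`; none of these names come from the project's databases.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_table(text):
    """Read a tab-delimited export with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_to_xml(entries_txt, senses_txt):
    """Combine rows from two related tables into one XML tree."""
    glosses_by_entry = defaultdict(list)
    for row in read_table(senses_txt):
        glosses_by_entry[row["entry_id"]].append(row["gloss"])

    root = ET.Element("dictionary")
    for row in read_table(entries_txt):
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "headword").text = row["headword"]
        # Type conversion: a numeric database flag becomes a readable boolean.
        ET.SubElement(entry, "checked").text = "true" if row["checked"] == "1" else "false"
        for gloss in glosses_by_entry[row["id"]]:
            ET.SubElement(entry, "gloss").text = gloss
    return root


entries_txt = "id\theadword\tchecked\n1\tword1\t1\n2\tword2\t0\n"
senses_txt = "entry_id\tgloss\n1\tgloss1a\n1\tgloss1b\n2\tgloss2\n"
tree = join_to_xml(entries_txt, senses_txt)
```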

Stage 2 – XML1 to XML2
DTD for archive file has a place for all information in all input files
More structure imposed at this level
  – Stage 1 used only elements
  – Stage 2 uses attributes, mainly for metadata
  – “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements – the archive data structure is actually more general than ANY of the inputs
Date stamping done at this stage
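One possible reading of this stage, sketched in Python under stated assumptions: a source-specific entry with fixed fields such as `gloss_english` and `gloss_indonesian` is rewritten so that glosses become a repeatable `gloss` element carrying a `lang` attribute, and the entry receives `source` and `added` attributes as metadata. All element and attribute names are hypothetical, and the project's actual archive DTD may differ.

```python
import datetime
import xml.etree.ElementTree as ET


def to_archive_entry(specific_entry, source_name, stamp=None):
    """Convert one source-specific entry to a more general archive entry."""
    # Date stamping: default to today's date in ISO format.
    stamp = stamp or datetime.date.today().isoformat()
    archive = ET.Element("entry", {"source": source_name, "added": stamp})
    ET.SubElement(archive, "headword").text = specific_entry.findtext("headword")
    # "Pseudo-normalization": fixed gloss_<lang> fields become a repeatable
    # <gloss lang="..."> element, so any number of glosses can be held.
    for child in specific_entry:
        if child.tag.startswith("gloss_") and child.text:
            gloss = ET.SubElement(archive, "gloss", {"lang": child.tag[len("gloss_"):]})
            gloss.text = child.text
    return archive


specific = ET.fromstring(
    "<entry><headword>word1</headword>"
    "<gloss_english>gloss1</gloss_english>"
    "<gloss_indonesian>gloss2</gloss_indonesian></entry>"
)
archive = to_archive_entry(specific, "source1", stamp="2003-09-02")
```

Because `gloss` may repeat any number of times, the resulting structure is more general than a source that allows exactly two glosses.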

Sample: General XML 1

Sample: General XML 2

Exporting Data
XSLT with
The only complication is undoing “pseudo-normalization”
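Undoing the pseudo-normalization on export might look like the sketch below. The talk does this in XSLT; Python is used here only to show the logic. It assumes a hypothetical archive entry with repeatable `gloss` elements carrying a `lang` attribute, and a target format with one fixed column per language.

```python
import xml.etree.ElementTree as ET


def flatten_entry(entry, languages=("english", "indonesian")):
    """Turn repeatable <gloss lang="..."> elements back into fixed columns.

    Returns one tab-delimited row: headword, then one cell per language.
    If a language occurs more than once, only the last gloss is kept, so a
    fixed-column target can hold less than the archive does.
    """
    by_lang = {g.get("lang"): g.text or "" for g in entry.findall("gloss")}
    cells = [entry.findtext("headword") or ""]
    cells += [by_lang.get(lang, "") for lang in languages]
    return "\t".join(cells)


entry = ET.fromstring(
    '<entry source="source1" added="2003-09-02">'
    "<headword>word1</headword>"
    '<gloss lang="english">gloss1</gloss>'
    '<gloss lang="indonesian">gloss2</gloss></entry>'
)
row = flatten_entry(entry)
```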

A more complex problem: aligned interlinear text
Important way of presenting data for linguists
Various lines of annotation, different levels have different alignment patterns

The Bird, Bow & Hughes Model
Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.
A general data model for representing this type of information
Four levels:
  – Text
  – Phrase
  – Word
  – Morpheme
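The four levels can be pictured as nested containers. The Python sketch below is only an illustration of that nesting: which annotations attach at which level (a gloss on each morpheme, a free translation on each phrase) is an assumption made here, and the English example is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def gloss_line(phrase):
    """Build the morpheme-gloss line: hyphens inside a word, spaces between words."""
    return " ".join("-".join(m.gloss for m in w.morphemes) for w in phrase.words)


phrase = Phrase(
    words=[
        Word("dogs", [Morpheme("dog", "dog"), Morpheme("s", "PL")]),
        Word("bark", [Morpheme("bark", "bark")]),
    ],
    translation="Dogs bark.",
)
text = Text(title="example", phrases=[phrase])
```

Each annotation line of an interlinear display is then derived from one level of the hierarchy, which is what makes differing alignment patterns tractable.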

XML model for aligned text

Aligned text: Problems
Various types of input:
  – Text strings with space and/or tabs (Shoebox)
  – Formatted text (e.g. Word tables)
  – Structured data (e.g. Spinoza database)
Type of processing varies
  – Text strings need a lot of parsing
  – Structured data needs access to multiple tables
Ideally, time alignment to AV source should be included also
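To give a sense of the parsing involved for text strings, here is a minimal Python sketch. It assumes annotation lines that are aligned purely by space padding (no tabs, no field markers), with the first line defining where each column starts. Real Shoebox files carry more structure than this, so the function is a simplification and its name is hypothetical.

```python
import re


def align_columns(lines):
    """Split space-aligned annotation lines into columns.

    Column boundaries are the start offsets of the tokens in the first
    line; every line is sliced at those offsets, so an annotation that
    contains spaces stays attached to its own column.
    """
    starts = [m.start() for m in re.finditer(r"\S+", lines[0])]
    ends = starts[1:] + [None]
    return [
        tuple(line[start:end].strip() for line in lines)
        for start, end in zip(starts, ends)
    ]


# Invented English example with a word line and a gloss line.
lines = [
    "dogs   bark",
    "dog-PL bark.PRS",
]
columns = align_columns(lines)
```

The sketch breaks as soon as tabs are mixed in or an annotation is wider than its column, which is part of why this input type needs so much parsing.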

What is gained
Interoperability within the project
  – Data can be imported to the archive file from one format and exported to another format
Interoperability outside the project
  – People who wish to share data with a group will define transformations from their data formats
  – A bottom-up approach to developing standards
Improved data modeling
  – Encourages members of the project to revise their data formats
  – Helps us develop high-level models for linguistic data

Future work
Processing aligned text formats
Using schemas rather than DTDs: data validation
Improved version control, especially checking for duplicate or conflicting records
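A check for duplicate or conflicting records could start from something like the Python sketch below. It is an assumption about how such a check might work, not a description of the project's plans: records are flat dicts with string values, two records describe the same item when they share the hypothetical key fields `language` and `headword`, identical records count as duplicates, and records that share a key but differ elsewhere count as conflicts.

```python
from collections import defaultdict


def classify_records(records, key_fields=("language", "headword")):
    """Return (duplicates, conflicts) as lists of keys.

    A key is a duplicate if all records sharing it are identical, and a
    conflict if at least two records sharing it differ in some field.
    """
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[f] for f in key_fields)].append(record)

    duplicates, conflicts = [], []
    for key, group in groups.items():
        if len(group) < 2:
            continue
        distinct = {tuple(sorted(record.items())) for record in group}
        (duplicates if len(distinct) == 1 else conflicts).append(key)
    return duplicates, conflicts


# Placeholder records standing in for archive entries.
records = [
    {"language": "lang1", "headword": "word1", "gloss": "gloss1"},
    {"language": "lang1", "headword": "word1", "gloss": "gloss1"},
    {"language": "lang1", "headword": "word2", "gloss": "glossA"},
    {"language": "lang1", "headword": "word2", "gloss": "glossB"},
    {"language": "lang1", "headword": "word3", "gloss": "gloss3"},
]
duplicates, conflicts = classify_records(records)
```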

Some details
This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
Funding:
  – Hans Rausing Endangered Languages Project
  – Australian Research Council
  – Faculty of Arts, Monash University
Contacts:
  – maluku@arts.monash.edu.au
  – http://www.arts.monash.edu.au/ling/maluku
