Language data and XML: archiving and interoperability Simon Musgrave Linguistics Program Monash University


DRH Cheltenham 2/9/03
Language documentation
Language documentation produces large quantities of text
  – Transcribed language events
  – associated annotations
  – lexica / dictionaries
  – analyses
  – ethnographic notes
  – …….
There is no standard software tool used by linguists
Use of proprietary software results in file formats with limited portability

Advantages of XML: Archiving
Unicode compatibility assured
  – Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists
Explicit coding of data model
Generic file format assures better portability and lifespan

Building an archive
Addition of data to an XML archive should be automated
This implies the existence of transformation scripts to move data between formats
Creating these scripts is work which has to be done
It can have a second benefit

Advantages of XML: Interoperability
Members of a research team may use different software running on different platforms
Problems can arise in sharing data
An important use of XML is as an interchange format
Transformation scripts created for archiving can also be used for sharing data

Data structures - 1
Researchers may not agree on common data structures
  – They are used to working with one tool in one particular way
  – Their interests are different
Even if they agree on a data structure for current work, heritage data may have to be imported to the archive

Data structures - 2
Archive files must be able to hold all the information coded in all the possible input formats: there should be no loss of data
We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
Where possible, correspondences will be made between the information in different input files
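
The unification idea can be sketched with nested dictionaries standing in for attribute-value matrices. This is an illustration only, not the project's code: the function name `unify`, the field names and the placeholder values are all assumptions made for the example.

```python
def unify(a, b):
    """Unify two attribute-value structures represented as nested dicts.

    Returns the merged structure, or raises ValueError when the two
    inputs assign different atomic values to the same attribute.
    """
    result = dict(a)
    for key, value in b.items():
        if key not in result:
            # Attribute only present in b: carry it over unchanged.
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            # Both sides are complex values: unify them recursively.
            result[key] = unify(result[key], value)
        elif result[key] != value:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {value!r}")
    return result


# Hypothetical records from two sources (placeholder values, not real data).
source1 = {"headword": "wordA", "gloss": {"en": "glossA"}}
source2 = {"headword": "wordA", "gloss": {"id": "glossB"}, "language": "langX"}
merged = unify(source1, source2)
```

No information from either input is lost in `merged`, and a clash between atomic values is reported rather than silently overwritten.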

Example: Dictionary files
The prototype implementation of the process uses a simple type of information: dictionary files
Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut
Source 2 is a table in an Access database containing data from several languages

Source 1

Source 2

Process overview

Stage 1 – txt to xml
Data exported from database as delimited text file
A document type definition (DTD) is created for each source file
  – This replicates the existing data structure, possibly with additions
A Perl script reads data from the txt file and adds tags based on the DTD
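
A minimal sketch of this stage, written in Python rather than the Perl used in the project. It assumes a tab-delimited export with a header row whose column names are valid XML element names. The tag names `dictionary` and `entry`, the column names and the sample rows are hypothetical, and the sketch does not validate against a DTD.

```python
import csv
import io
import xml.etree.ElementTree as ET


def delimited_to_xml(text, root_tag="dictionary", record_tag="entry", delimiter="\t"):
    """Wrap each row of a delimited export in elements named after its columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = ET.Element(root_tag)
    for row in reader:
        entry = ET.SubElement(root, record_tag)
        for field, value in row.items():
            if value:  # leave out empty fields instead of emitting empty tags
                ET.SubElement(entry, field).text = value
    return root


# Placeholder export: header row, then one line per record.
sample = "headword\tgloss\tpos\nwordA\tglossA\tn\nwordB\tglossB\t\n"
root = delimited_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
```

As on the slide, this stage uses only elements; the source-specific structure is reproduced as it stands.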

Sample: specific XML

Stage 1 – Why?
Newer versions of commercial software offer an export to XML facility
Importing data from a normalized database often means having access to data from more than one table
  – XSLT takes a single input file
  – Perl (or an equivalent) does not have this limitation
Type conversion can be done using Perl

Stage 2 – XML1 to XML2
DTD for archive file has a place for all information in all input files
More structure imposed at this level
  – Stage 1 used only elements
  – Stage 2 uses attributes, mainly for metadata
  – “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements; the archive data structure is actually more general than ANY of the inputs
Date stamping done at this stage
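
The three operations named on this slide (metadata as attributes, pseudo-normalization, date stamping) might look roughly like the sketch below. The element names, the `gloss_` column prefix and the `source` and `date` attributes are assumptions made for illustration, not the project's archive DTD.

```python
import datetime
import xml.etree.ElementTree as ET


def to_archive_entry(entry, source_name, today=None):
    """Convert one source-specific entry into a more general archive entry."""
    today = today or datetime.date.today()
    # Metadata goes into attributes; the date stamp is added here.
    out = ET.Element("entry", {"source": source_name, "date": today.isoformat()})
    for child in entry:
        if child.tag.startswith("gloss_"):
            # Pseudo-normalization: fixed columns such as gloss_en and gloss_id
            # become one optionally recurring <gloss> element.
            lang = child.tag[len("gloss_"):]
            ET.SubElement(out, "gloss", {"lang": lang}).text = child.text
        else:
            ET.SubElement(out, child.tag).text = child.text
    return out


# Placeholder source-specific entry (hypothetical field names and values).
specific = ET.fromstring(
    "<entry><headword>wordA</headword>"
    "<gloss_en>glossA</gloss_en><gloss_id>glossB</gloss_id></entry>"
)
general = to_archive_entry(specific, "source1", datetime.date(2003, 9, 2))
```

Because `gloss` may recur any number of times, the resulting structure can hold an input with one gloss language as easily as one with several.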

Sample: General XML 1

Sample: General XML 2

Exporting Data
XSLT with
The only complication is undoing “pseudo-normalization”
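
Undoing pseudo-normalization means turning optionally recurring elements back into the fixed columns a target tool expects. The project does this in XSLT; the sketch below uses Python as a stand-in, and the element names, the `lang` attribute and the column naming scheme are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def flatten_entry(entry, gloss_langs):
    """Map recurring <gloss lang="..."> elements back to one column per language."""
    row = {"headword": entry.findtext("headword", default="")}
    glosses = {g.get("lang"): (g.text or "") for g in entry.findall("gloss")}
    for lang in gloss_langs:
        # A language missing from this entry yields an empty cell.
        row[f"gloss_{lang}"] = glosses.get(lang, "")
    return row


def export_delimited(root, gloss_langs, delimiter="\t"):
    """Produce a delimited text table with a header row from the archive XML."""
    columns = ["headword"] + [f"gloss_{lang}" for lang in gloss_langs]
    lines = [delimiter.join(columns)]
    for entry in root.findall("entry"):
        row = flatten_entry(entry, gloss_langs)
        lines.append(delimiter.join(row[c] for c in columns))
    return "\n".join(lines)


# Placeholder archive content (hypothetical values).
archive = ET.fromstring(
    '<dictionary>'
    '<entry><headword>wordA</headword>'
    '<gloss lang="en">glossA</gloss><gloss lang="id">glossB</gloss></entry>'
    '<entry><headword>wordB</headword><gloss lang="id">glossC</gloss></entry>'
    '</dictionary>'
)
table = export_delimited(archive, ["en", "id"])
```

The target format decides which recurring items become columns; anything the target has no column for is simply not exported.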

A more complex problem: aligned interlinear text
Important way of presenting data for linguists
Various lines of annotation, different levels have different alignment patterns

The Bird, Bow & Hughes Model
Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.
A general data model for representing this type of information
Four levels:
  – Text
  – Phrase
  – Word
  – Morpheme
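
One way to picture the four levels is as nested containers, each holding items of the level below. The sketch is not the formal definition from the paper: the class fields (`form`, `gloss`, `translation`, `title`) and the helper `morpheme_glosses` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""  # free translation aligns with the whole phrase


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def morpheme_glosses(text):
    """Return, per phrase, one hyphen-joined gloss string for each word."""
    return [["-".join(m.gloss for m in w.morphemes) for w in p.words]
            for p in text.phrases]


# Placeholder content (not real language data).
text = Text("sample", [Phrase(
    words=[Word("wordA", [Morpheme("wo", "G1"), Morpheme("rdA", "G2")]),
           Word("wordB", [Morpheme("wordB", "G3")])],
    translation="free translation",
)])
glosses = morpheme_glosses(text)
```

Each annotation attaches at the level it aligns with: glosses at the morpheme, the free translation at the phrase.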

XML model for aligned text

Aligned text: Problems
Various types of input:
  – Text strings with space and/or tabs (Shoebox)
  – Formatted text (e.g. Word tables)
  – Structured data (e.g. Spinoza database)
Type of processing varies
  – Text strings need a lot of parsing
  – Structured data needs access to multiple tables
Ideally, time alignment to AV source should be included also
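
To give a sense of the parsing that text-string input needs, the sketch below recovers word-level alignment from two space-padded lines. It assumes that every annotation starts in the same character column as the word it belongs to and that padding uses spaces only; tabs or byte-based columns would need extra handling. The function name and sample strings are hypothetical.

```python
import re


def align_by_columns(base_line, annotation_line):
    """Pair each token of base_line with the annotation text in its column span."""
    starts = [m.start() for m in re.finditer(r"\S+", base_line)]
    # Each span runs up to the start of the next token; the last span runs to
    # the end of whichever line is longer.
    ends = starts[1:] + [max(len(base_line), len(annotation_line))]
    pairs = []
    for start, end in zip(starts, ends):
        base = base_line[start:end].strip()
        annotation = annotation_line[start:end].strip()
        pairs.append((base, annotation))
    return pairs


# Placeholder lines built with explicit padding so the columns are unambiguous.
words = "wordA".ljust(7) + "wordB".ljust(10) + "wordC"
glosses = "g1".ljust(7) + "g2a g2b".ljust(10) + "g3"
pairs = align_by_columns(words, glosses)
```

An annotation made up of several words stays attached to a single base token, which splitting on whitespace alone would not preserve.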

What is gained
Interoperability within the project
  – Data can be imported to the archive file from one format and exported to another format
Interoperability outside the project
  – People who wish to share data with a group will define transformations from their data formats
  – A bottom-up approach to developing standards
Improved data modeling
  – Encourages members of the project to revise their data formats
  – Gives us help in developing high-level models for linguistic data

Future work
Processing aligned text formats
Using schemas rather than DTDs: data validation
Improved version control, especially checking for duplicate or conflicting records

Some details
This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
Funding:
  – Hans Rausing Endangered Languages Project
  – Australian Research Council
  – Faculty of Arts, Monash University
Contacts:
  –