A Field Linguist’s Guide to Making Long Lasting Texts and Databases LSA Organized Session January 4, 2007 Anaheim, California
Organized by: Jeff Good and Heidi Johnson Open Language Archives Community (OLAC) Outreach Committee Moderator: Laura Welcher Speakers: Debbie Anderson, Michael Appleby, Jessica Boynton, Naomi Fox, Connie Dickinson
Presentations from this session will be posted at http://www.language-archives.org/news.html#olac07
Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have Laura Welcher The Rosetta Project / Long Now Foundation
A great way to freak out a linguist “To be in compliance with best practice recommendations (ahem), your interlinear glossed text needs to be in XML format with morphosyntactic tags that reference the GOLD ontology.”
Reality Check There’s a difference between ideal best practice (which is still somewhat of a moving target) and a good, sufficient approximation. Some common practices are far from ideal or even sufficient (like saving the dictionary you worked on for 5 years as a Microsoft Word document). We can easily modify these practices to produce archivable resources that will last. And this can be done using tools that you already have, and knowledge that is easy to acquire. Hence the title: Best practice in your back pocket: getting the most out of the tools that you have.
Best Practice E-MELD project (Electronic Metastructure for Endangered Languages Data) Goals: –Help preserve endangered languages data –Develop infrastructure for electronic archives Defining best practice –E-MELD summer workshops http://www.emeld.org Promoting best practice: –“School of Best Practice” at http://www.emeld.org/school/index.html
Good, Better, Best Practice The information presented here comes from presentations of the E-MELD team, particularly the following: Simons and Dry (2006) Good, Better, and Best Practice The Experience of the E-MELD Project http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf
The first consideration: working, presentation and archival formats The process of creating digital language resources usually involves creating files in different formats: –Working format –Presentation format –Archival format
Working Format The saved format of whatever program you are working in: –.doc (MS Word) –.xls (Excel) –.fp7 (FileMaker Pro) This format is what you use for your own convenience and productivity –Typically this format is proprietary –Less typically, people may work in programs whose native format is not proprietary, automatically saving in .txt (plain text), .xml or .html (types of formatted plain text) A proprietary working file format is not the only format you should have!
Archival Format A very important format -- this format helps ensure that your resource will last and be usable well into the future An archival format has LOTS of good qualities (Simons, 2004) –Lossless –Open Standard –Transparent –Supported by multiple vendors
Archival Format: Lossless Avoid compressed formats that lose content A good rule of thumb is to use uncompressed formats: –Text: .txt, .html, .xml –Images: .tiff, .bmp –Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix) but make sure it is PCM (uncompressed) –Video: .avi (some codecs), .rtv Most compressed formats lose content, but some are lossless (.zip for text, black and white .gif for images, .ale Apple Lossless Encoding for audio, jpeg2000 video codec) -- use with caution!
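The "make sure it is PCM" advice can be checked mechanically. The following is a minimal sketch, assuming Python's standard `wave` module, which reads uncompressed PCM WAV files and raises `wave.Error` for formats it does not recognize. The function names `wav_summary` and `make_silent_wav` are illustrative, not part of any archive's tooling.

```python
import io
import wave


def wav_summary(source):
    """Return basic facts about a WAV file.

    `source` may be a path string or a binary file-like object.
    The standard-library reader raises wave.Error if the file is
    not in a format it recognizes as PCM.
    """
    with wave.open(source, "rb") as wav:
        return {
            "compression": wav.getcomptype(),  # "NONE" for uncompressed PCM
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
            "frames": wav.getnframes(),
        }


def make_silent_wav(seconds=1, frame_rate=44100):
    """Build a small mono 16-bit PCM WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(frame_rate)
        wav.writeframes(b"\x00\x00" * frame_rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(wav_summary(make_silent_wav()))
```

In practice you would pass the path of your own recording to `wav_summary`; a `wave.Error` is a hint that the file may be compressed and deserves a closer look before it goes to the archive.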
Archival Format: Open Avoid proprietary formats like .doc, .xls, .fp7 The company that produces the software may stop supporting the format, rendering your file unreadable For your archival format, choose a file format that is “open standard” like .xml, .html, .pdf or .rtf “Open standard” means that the specification of the format is publicly available, and anyone can implement it.
Archival Format: Transparent Use a file format that is easy to interpret Example: text files (.txt) –Have common characters like letters, numbers, punctuation –Virtually no formatting (tabs, returns) –Because of the simplicity of this file type, many programs can read it and make use of the data Other transparent formats: .wav, .aiff can be read by any audio program Not transparent: .zip, .mp3 (require a special algorithm for interpretation)
Archival Format: Supported Prefer formats that are widely supported If more vendors support it, it is less likely to become obsolete This is another reason to prefer an open standard format to a proprietary one
Presentation Format Presentation formats are those you choose for convenience and ease of access and display It is fine for presentation formats to be compressed, so long as you make a lossless archival copy as well Examples of presentation formats include .pdf files, .mp3 files, .jpg images, MPEG-2 video
So far, so good? As a responsible linguist creating digital language documentation that will last well into the future you… –Know the difference between a working, presentation, and archival file format –Know what makes a good archival format (LOTS) –Maintain an archival format of your data Anything beyond this? Yes, a bit more…
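The last item on this checklist, maintaining an archival format of your data, can be approximated with a small script. This is a minimal sketch: the two extension sets only collect formats named in these slides and are not exhaustive, and the rule "same file name, different extension" is an assumption about how you name your copies.

```python
from pathlib import PurePath

# Extensions mentioned in the slides; illustrative, not exhaustive.
WORKING_PROPRIETARY = {".doc", ".xls", ".fp7"}
ARCHIVAL_CANDIDATES = {".txt", ".xml", ".html", ".csv", ".tab", ".wav", ".aiff", ".tiff"}


def missing_archival_copies(filenames):
    """Return proprietary working files with no same-named archival-format file."""
    paths = [PurePath(name) for name in filenames]
    archived_stems = {
        p.stem for p in paths if p.suffix.lower() in ARCHIVAL_CANDIDATES
    }
    return sorted(
        p.name
        for p in paths
        if p.suffix.lower() in WORKING_PROPRIETARY and p.stem not in archived_stems
    )


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["dictionary.doc", "texts.fp7", "texts.xml", "story.wav"]
    print(missing_archival_copies(files))
```

Here `texts.fp7` has an XML counterpart, so only `dictionary.doc` is reported as lacking an archival copy.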
Best Practice Digital Resources are… Preservable in formats that are not vulnerable to decay or obsolescence (see LOTS) Intelligible so that content is easily understood by future scholars Accessible so that resources are easily discovered and accessed They are also interoperable, but this is mostly a concern of archives and services (Simons and Dry, 2006)
Create Preservable Resources Linguists are responsible for making preservable resources That is, creating archival formats that follow the principles of LOTS
Create Intelligible Resources In order to create resources that are intelligible to others, you must document your practices! Documentation includes: –Your markup practices –The encoding you use –Metadata about your resources This information should be kept in a file or files in an archival format, and archived along with your resources.
Presentational Markup Many people use presentational markup, particularly in working formats like Microsoft Word. Presentational markup means that aspects of the presentation (like bold, italics, indenting) are themselves meaningful For example…
Example of Presentational Markup AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas. AS_5.2.1978_audio Alice Spear “Crane Boy” May 2, 1978 Mayetta, Kansas
Presentational Markup Presentational markup is not recommended. BUT if you do use it, describe all meaningful aspects (e.g. “bold” means head word, “italics” is used for the part of speech)
Descriptive Markup It is better practice to use descriptive markup, like XML XML is basically text with “tags” that provide information about what is between the tags – mnomen – rice Tags can also be used to group information, much like you would group information in a database record, and have a whole set of records in a database
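To make the idea of tags concrete, here is a minimal sketch that wraps the slide's word and gloss in descriptive tags and groups them into one record. The tag names `lexicon`, `entry`, `form` and `gloss` are hypothetical stand-ins chosen for illustration, not a standard; the code uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def make_entry(form, gloss):
    """Group a word form and its gloss into one record-like element."""
    entry = ET.Element("entry")          # hypothetical tag name
    ET.SubElement(entry, "form").text = form
    ET.SubElement(entry, "gloss").text = gloss
    return entry


# A "database" is then just a parent element holding many entries.
lexicon = ET.Element("lexicon")
lexicon.append(make_entry("mnomen", "rice"))

xml_text = ET.tostring(lexicon, encoding="unicode")
print(xml_text)
# <lexicon><entry><form>mnomen</form><gloss>rice</gloss></entry></lexicon>
```

Because each piece of information is labeled by what it is rather than how it looks, a later reader (or program) does not have to guess that bold meant "head word".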
Example of Descriptive Markup AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas. AS_5.2.1978_audio Alice Spear “Crane Boy” May 2, 1978 Mayetta, Kansas
Descriptive Markup: XML AS_5.2.1978_audio Analog audio recording on Cassette tape Alice Spear Laura Buszard-Welcher “Crane Boy” narrative told in Potawatomi and in English Mayetta, Kansas digital audio: AS_5.2.1978_audio.wav, interlinear text: AS_5.2.1978_audio.txt Some restrictions; contact field linguist
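A record like this one can be written and read back with a few lines of code. The sketch below is an assumption-laden illustration: the element names (`identifier`, `title`, `description`, `coverage`, `rights`) are borrowed from the OLAC element list later in this talk and are not necessarily the tags used on the original slide, and only fields whose assignment is unambiguous are included.

```python
import xml.etree.ElementTree as ET

# Values taken from the slide; the element names are illustrative assumptions.
FIELDS = {
    "identifier": "AS_5.2.1978_audio",
    "title": "“Crane Boy”",
    "description": "narrative told in Potawatomi and in English",
    "coverage": "Mayetta, Kansas",
    "rights": "Some restrictions; contact field linguist",
}


def build_record(fields):
    """Turn a dict of field names and values into one XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record


def read_record(xml_text):
    """Read a record back into a dict, showing that nothing was lost."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


if __name__ == "__main__":
    xml_text = ET.tostring(build_record(FIELDS), encoding="unicode")
    print(xml_text)
    print(read_record(xml_text) == FIELDS)
```

The round trip is the point: any program that can parse XML can recover every field by name, without knowing which application produced the file.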
Descriptive Markup: XML It is a good practice to use standard tags where they are available. –OLAC has a set of tags that you would use for metadata to describe your resources –GOLD has a set of tags used for morphosyntactic description Otherwise, be sure to document the meaning of the tags that you use Although some people feel comfortable working in XML, many don’t like to use it as a working format. Fortunately many common programs now allow you to save your work as an XML file.
The Advantage of XML Besides creating an archival data file, XML has other advantages By creating stylesheets, you can give the same XML file different presentation forms For example…
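As a rough illustration of "one XML file, several presentation forms", the sketch below renders the same small XML source as tab-separated text and as an HTML list. Two plain Python functions stand in for real stylesheets (which would normally be written in XSLT or CSS), and the tag names in `SOURCE` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical tag names; the content is the word and gloss used earlier in the talk.
SOURCE = "<lexicon><entry><form>mnomen</form><gloss>rice</gloss></entry></lexicon>"


def entries(xml_text):
    """Yield (form, gloss) pairs from the XML source."""
    for entry in ET.fromstring(xml_text).iter("entry"):
        yield entry.findtext("form"), entry.findtext("gloss")


def as_plain_text(xml_text):
    """Presentation 1: one tab-separated line per entry."""
    return "\n".join(f"{form}\t{gloss}" for form, gloss in entries(xml_text))


def as_html(xml_text):
    """Presentation 2: an HTML list with bold head words and italic glosses."""
    items = "".join(
        f"<li><b>{escape(form)}</b> <i>{escape(gloss)}</i></li>"
        for form, gloss in entries(xml_text)
    )
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    print(as_plain_text(SOURCE))
    print(as_html(SOURCE))
```

The archival file never changes; bold and italics live only in the presentation step, where they can be redesigned at any time.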
Delimited Text Another kind of markup that you might find yourself using is delimited text. Spreadsheet and database programs allow you to export your data as text, delimited by a particular character –Comma separated text (.csv) –Tab separated text (.tab) To help with intelligibility, create an initial record where the name of each field / cell is given inside the record itself. That way, the names of your fields / cells will be exported and saved along with the rest of your data. Exporting text data this way is good practice, particularly if you are careful about documenting your practices inside your fields / cells (for more on this see following slides).
Other aspects of markup Document any special conventions that you use What do your morpheme boundary markers mean (+ / - / = …any others?) What glossing conventions do you use? Give the full names of abbreviations (e.g. POS means ‘possessive’, PV means ‘preverb’). Describe grammatical terms that you use (like ‘aorist’ or ‘preverb’) and what they mean for the language you are describing. (You don’t have to write a grammar -- a sentence or two describing each term is sufficient.) Also note if you are using standard terminology sets, like the Leipzig Glossing Rules or GOLD terminology
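The "initial record with field names" advice looks like this in practice. This is a minimal sketch using Python's standard `csv` module; the field names `form` and `gloss` are assumptions chosen for illustration, and the single row reuses the word and gloss from the earlier XML slide.

```python
import csv
import io

FIELD_NAMES = ["form", "gloss"]  # hypothetical field names
ROWS = [{"form": "mnomen", "gloss": "rice"}]


def export_tab_delimited(rows, field_names):
    """Write rows as tab-separated text whose first record names the fields."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=field_names, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()  # the initial record: field names travel with the data
    writer.writerows(rows)
    return buffer.getvalue()


def import_tab_delimited(text):
    """Read the text back; the header record supplies the field names."""
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter="\t"))


if __name__ == "__main__":
    exported = export_tab_delimited(ROWS, FIELD_NAMES)
    print(exported)
    print(import_tab_delimited(exported))
```

Without the header record, a future reader would see only two unlabeled columns and would have to guess which is the form and which is the gloss.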
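One way to keep an abbreviation list honest is to check glossed lines against it. The sketch below is only an illustration: it assumes glossing abbreviations are written in capitals, treats `+`, `-`, `=`, `.` and whitespace as separators, and uses an invented gloss line rather than real language data. The two documented abbreviations are the ones given on the slide.

```python
import re

# Abbreviations documented on the slide.
DOCUMENTED = {"POS": "possessive", "PV": "preverb"}

# Assumed separators: morpheme boundary markers, the period, and whitespace.
SEPARATORS = re.compile(r"[\s+\-=.]+")


def undocumented_abbreviations(gloss_line, documented):
    """Return capitalized gloss labels that are missing from the documented list."""
    tokens = SEPARATORS.split(gloss_line)
    return sorted(
        {t for t in tokens if len(t) > 1 and t.isupper() and t not in documented}
    )


if __name__ == "__main__":
    # An invented gloss line, for illustration only.
    print(undocumented_abbreviations("PV-go-3SG POS=house", DOCUMENTED))
```

For this invented line the function reports `3SG`, a label that appears in the glosses but has no entry in the abbreviation list yet.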
Document the Encoding Identify the character set you are using Document any non-standard characters Best practice is to use Unicode
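If your text is already in Unicode, part of this documentation can be generated rather than written by hand. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `character_inventory` and the sample string are illustrative assumptions.

```python
import unicodedata


def character_inventory(text):
    """List each non-ASCII character in `text` with its code point and Unicode name."""
    inventory = []
    for char in sorted(set(text)):
        if ord(char) > 127:
            # Private-use and unassigned characters have no name; flag them instead.
            name = unicodedata.name(char, "UNNAMED OR PRIVATE-USE CHARACTER")
            inventory.append((char, f"U+{ord(char):04X}", name))
    return inventory


if __name__ == "__main__":
    # A made-up string containing two IPA letters.
    for char, code_point, name in character_inventory("\u014ba \u025b"):
        print(char, code_point, name)
```

Any character that comes back flagged as unnamed or private-use is a candidate for the "non-standard characters" section of your documentation.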
Create Metadata You will need to create some additional information about your resources Metadata usually includes information about: –The setting (time, date, participants, location) –The language (ISO 639-3) –Linguistic type (text, grammar, lexicon) and subject –Access restrictions There are metadata standards for language resources: OLAC and IMDI
OLAC Metadata Elements –Contributor (content) –Coverage (e.g. location) –Creator (content) –Date –Description –Format –Encoding Format (character set) –Markup Format (XML schema) –Identifier (file name, URL) –Language (audience) –Publisher –Relation (to another resource) –Rights (controlled vocab.) –Source (say, for re-elicited data) –Subject (controlled vocab.) –Subject Language (ISO 639-3 code) –Title –Linguistic Type http://www.language-archives.org/OLAC/olacms.html
Create Metadata Keep a metadata record for each of your resources. The records should themselves be in an archival format. This could be: –A text file (good) –Delimited text, exported from a simple database file (good) –An XML file (better) –An OLAC or IMDI formatted XML file (best) Your archivist may have a preference about metadata formats, and prefer something relatively simple (like a paper form) if the archive will be manually entering the metadata. Archive this file along with the rest of your resources.
Make your resources accessible Archive, archive, archive! (Not just on your own or your departmental server. Archives are committed to the long-term preservation and availability of your resources.) Before you leave to do fieldwork, or when you are writing your grant, establish contact with the archive where you intend to deposit your resources Archivists will –give you guidelines for creating archival files –help you select the best metadata set –give you information about setting access levels When you return, the first thing to do is send your files, along with the metadata and markup descriptions, to the archive Most archives will then give you an ID number for your resources that you can then cite in your publications
A Community Responsibility Best practice involves what individual field linguists do, but also how we collectively use and care for these resources This broader community involves –Other researchers like yourself who create resources –A growing set of interconnected digital language archives that care for, protect, and disseminate your resources –People who develop tools and services to make your resources locatable, searchable, and reusable –Others: linguistics organizations, organizations like OLAC and DELAMAN, funding agencies who promote the work of this community
Unicode Debbie Anderson “A field linguists’ guide to Unicode” Michael Appleby “How to use Unicode on your computer”
Field Case Studies: Texts and Databases Jessica Boynton –“Transcription, Time-Alignment and Annotation” Naomi Fox –“Using Filemaker Pro to produce archivable language documentation” Connie Dickinson –“The Tsafiki Text Factory”
Panel Session Talks are 25 minutes, consecutive. Please remember or write down your questions! We will field them in a panel session after the talks.