The Rosetta Project Digital Language Archive Laura Buszard-Welcher The Long Now Foundation / University of California, Berkeley.


1 The Rosetta Project Digital Language Archive Laura Buszard-Welcher The Long Now Foundation / University of California, Berkeley

2 The Rosetta Project Archive A public, Web-based, digital archive of language documentation Part of the National Science Digital Library (NSF program for dissemination of educational STEM resources) Over 95,000 pages of resources on over 2,300 languages Over 3000 wordlists (Swadesh lists, 500-1500 term lists) New! Audio files

3 Project Goals: Resources We are a digital language archive with comprehensive, global scope: we can and do accept digital resources on any language, dialect, family, or subgroup. Promotes linguistic diversity by broadly disseminating resources on languages with small numbers of speakers--contributes to the effort to document and disseminate resources on endangered languages. Comprehensive scope both requires and builds communities: global networks of linguists, speakers, educators

4 Project Goals: Interoperability and Resource Discovery Supporting metadata standardization and interoperability (OLAC participating archive and individuals, E-MELD, GOLD, LSA Conversation on Endangered Language Archiving) Promoting resource discovery through open archive search: we serve oai_dc, nsdl_dc, olac_dc metadata
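
As an illustration of open archive search, the following minimal sketch builds an OAI-PMH `ListRecords` request for each metadata format named on the slide. The base URL is a placeholder (the slides do not give the archive's OAI endpoint), and the prefix strings are copied from the slide as written; a live server's `ListMetadataFormats` response would be the authoritative list.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not state the archive's OAI-PMH base URL.
BASE_URL = "http://example.org/oai"

# Metadata formats as named on the slide.
PREFIXES = ("oai_dc", "nsdl_dc", "olac_dc")


def list_records_url(prefix, base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    if prefix not in PREFIXES:
        raise ValueError(f"unsupported metadataPrefix: {prefix}")
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return f"{base_url}?{query}"


for prefix in PREFIXES:
    print(list_records_url(prefix))
```

A harvester would fetch each URL and follow any resumption tokens in the response; that part is omitted here.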

5 Project Goals: Developing tools for collaborative linguistic research Endangered Language Query Room DOCS (Digital Online Curation Services) LangGator Wordlist tool (collaboration with MPI-EVA) New Rosetta V2.0 Website

6 Site Infrastructure Plone 2.1 content management system, running in the Zope Application Server Open source, leverages worldwide developer communities Lots of “plug in” modules for functionality expansion –CMF Bibliography AT, Plone Board, etc. Heavily modified infrastructure (language node design) and user interface

7 Nodal Architecture Languages, language families, family subgroups, dialects all represented by nodes. A node is a content aggregation page. Nodes and parent-child relationships each have unique IDs. The system currently represents Ethnologue language relationships, but has the flexibility to be agnostic about them and to represent relationships from various theoretical perspectives.
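
One way to picture this design is sketched below. It is not the Plone/Zope implementation, only a minimal model under stated assumptions: the class names, the shared ID counter, and the `source` label on each relationship are all hypothetical. Tagging each parent-child link with the classification that asserts it is one simple way to stay agnostic about competing classifications.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    node_id: int
    name: str
    kind: str  # "language", "family", "subgroup", or "dialect"


@dataclass
class Relation:
    relation_id: int
    parent_id: int
    child_id: int
    source: str  # which classification asserts this link (assumed field)


class NodeGraph:
    def __init__(self):
        self._ids = count(1)  # one counter, so every ID is unique
        self.nodes = {}
        self.relations = []

    def add_node(self, name, kind):
        node = Node(next(self._ids), name, kind)
        self.nodes[node.node_id] = node
        return node

    def link(self, parent, child, source):
        rel = Relation(next(self._ids), parent.node_id, child.node_id, source)
        self.relations.append(rel)
        return rel

    def children(self, parent, source):
        """Children of a node according to one classification only."""
        return [self.nodes[r.child_id] for r in self.relations
                if r.parent_id == parent.node_id and r.source == source]


graph = NodeGraph()
family = graph.add_node("Family A", "family")      # placeholder names
language = graph.add_node("Language X", "language")
graph.link(family, language, source="ethnologue")
print([n.name for n in graph.children(family, "ethnologue")])
```

Because relationships are separate records, a second classification can link the same nodes differently without touching the nodes themselves.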

8 Node Pages Accessible from a variety of browse and search pages –Browse by language name, family, country, data type –Quick search, advanced search Node page organization –Node metadata –Descriptive Resources –Navigation: classification tree –Links to people functions, LINGUIST List people search –External links: searches

9 Content In-house collection, vetting –Primary focus of collection –Rosetta descriptive categories Special collections –Endangered Language Fund (ELF) Digital Archives –Alan Lomax Audio Collection –Future collections that come in through DOCS Future development –Uploaded, peer-reviewed resources –Collaborative content areas (bulletin boards, wiki)

10 Scanning Historically, the primary focus of in-house collection Rosetta serves over 95,000 images from a variety of published resources Excerpts in data categories (see following slides) Public domain resources can be scanned in their entirety

11 Categories of Collection (1)
–Ethnologue metadata: General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, alternate language names.
–General description: General description of the language. Origin and current distribution of language, number of speakers, family, typology, history, etc.
–Maps: Maps of the geographic distribution of a language and its relationship to other languages in the region.
–Orthography: Writing system(s) of the language with any accompanying guide to pronunciation, use, etc.
–Phonology: A description of the basic sound units in a language (phonemes) and how they combine to form utterances.

12 Categories of Collection (2)
–Grammar: How a language combines the smallest units of meaning (morphemes) to create words and words to create sentences.
–Core Word Lists: A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes.
–Numbers: A description of the numbering system(s) in a language with a list of basic terms.
–Parallel Texts: A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Declaration of Human Rights.
–Glossed Texts: Transcribed indigenous texts with word glosses, free translations and grammatical markup.

13 Resource Pages Accessed from node pages Bibliographic metadata Links to other resources Resource bundles Associated resource files –Scanned images –OCR’ed live text files –Annotated text files –Audio/video files –User comments

14 Community Functions Goal: build a network of linguists, speakers, educators People: –Member pages –Regional and language curators Collaborative content: –Discussions (nodes, resources) –Resource upload –Vetting by volunteer language/family experts –In the future? Wiki documents (unvetted, but resources produced may go through higher vetting levels)

15 Member Gallery Central access to member search and browse Central access to language forums Highlighted members

16 Member Profile Page User-defined content area List of recent uploads Lists of recent forum postings

17 Audio Digitization Alan Lomax language audio collection (mostly reel-to-reel, some cassette) Edirol external digitizer (96 kHz sample rate, 24 bit depth) Sound Forge 7.0, uncompressed .wav Now accepting audio deposits (on a limited basis) We archive and serve digital resources, not physical media
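
To give a sense of what these settings mean for storage, here is a rough arithmetic sketch of the uncompressed PCM data rate. The sample rate and bit depth come from the slide; the channel count is an assumption (the slide does not say whether recordings are mono or stereo), and the small .wav header is ignored.

```python
def wav_bytes_per_second(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM data rate: samples/s * bytes per sample * channels."""
    return sample_rate_hz * (bit_depth // 8) * channels


# 96 kHz and 24 bit are from the slide; stereo (2 channels) is assumed.
rate = wav_bytes_per_second(96_000, 24, channels=2)
print(rate)          # 576000 bytes per second
print(rate * 3600)   # 2073600000 bytes, roughly 2.07 GB per hour
```

A mono transfer at the same settings would need half as much space.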

18 Rosetta Depositor Consent Form Prompted by special collections (ELF, Alan Lomax Audio) Intended to work on paper, or in digital form Inspired by AILLA’s graded access system Encourages depositors to see archiving as a kind of publication: assumes dissemination of some or all of resources –“In general, we encourage all depositors to make their resources freely available, and to consider archiving with us as a form of publication. If you feel the need to place an extreme form of restriction on the resource, then our project may not be the most suitable place to archive your resource. We reserve the right to archive only those resources that we deem appropriate to our project, with respect to both content and access.”

19 Level 1: Open access to recordings Users have full access to recordings after agreeing to our Terms and Conditions. For this level, we assume that depositors have already gained permission for public access from the speakers or authors of the resource. Level 1 access may be applied to the entire deposit, or to parts of the deposit. If portions of the deposit are to be restricted, attach a detailed description that clearly identifies them, and designate one of the following access levels (2-5) for each restricted portion.

20 Level 2: Access limited by password Users may access recordings only if they know a password that you create. This type of access allows you to keep resources private, or provide access to others by sharing the password with them. Access limited by passwords must be renegotiated with The Rosetta Project every five years, at which time depositors may continue use of a previous password, choose a new password, or select another access level (Rosetta will contact the depositor at the appropriate time). If not renegotiated, access to the resource changes to open access (Level 1).

21 Level 3: Access protected by a time limit Users may not access the resources until after a specified date. Although we encourage all depositors to make their resources freely available, we understand that some depositors may want to restrict access to resources for a few years (normally five or less) while preparing a publication, such as a dissertation. After the date you specify, access to the resource changes to open access (Level 1).

22 Levels 4 and 5: Designated Controllers Level 4. The depositor controls access to the resource. The Rosetta Project will provide contact information, and the user will have to contact the depositor directly for permission, and the depositor then will write to The Rosetta Project. If permission is granted, The Rosetta Project will give the user access to the resource. Level 5. The depositor designates another person or organization to control the resource. The Rosetta Project will contact the controller on the user’s behalf. If permission is granted, The Rosetta Project will give the user access to the resource (please attach controller’s contact information).
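
The time-based rules in Levels 2 and 3 can be summarized in a short sketch. This is only an illustration of the policy as described on the slides, not Rosetta's software: the names are hypothetical, and the exact boundary dates (for example, whether a password lapses on or after the fifth anniversary) are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

OPEN, PASSWORD, TIME_LIMIT, DEPOSITOR, CONTROLLER = 1, 2, 3, 4, 5


@dataclass
class AccessTerms:
    level: int
    agreed_on: date                    # when terms were set or last renegotiated
    open_after: Optional[date] = None  # Level 3 only: depositor's chosen date


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def effective_level(terms, today):
    """Access level in force today, applying the lapse-to-open rules."""
    # Level 2: passwords not renegotiated within five years lapse to Level 1.
    if terms.level == PASSWORD and today >= add_years(terms.agreed_on, 5):
        return OPEN
    # Level 3: the restriction ends after the specified date.
    if (terms.level == TIME_LIMIT and terms.open_after is not None
            and today > terms.open_after):
        return OPEN
    # Levels 1, 4, and 5 do not change with time in this sketch.
    return terms.level


terms = AccessTerms(PASSWORD, agreed_on=date(2006, 1, 1))
print(effective_level(terms, date(2011, 1, 1)))
```

Renegotiating a Level 2 password would be modeled by resetting `agreed_on`.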

23 Depositor/Controller Responsibilities Note: for Levels 2, 3, 4, and 5, the depositor must ensure that the appropriate contact information is up to date. If contact information is not up to date, or documented good faith attempts made by the Rosetta archive or its users to obtain access are not answered, then determination of permission to access and use the resource reverts to the curator of the archive.

24 The Archivist in the Driver’s Seat Archiving and serving digital resources is a valuable (and expensive) service Some archives also provide digitization services For these reasons, archives can be expected to set conditions on what they will archive Rosetta’s consent forms are intended to ensure that: –The majority of our resources are publicly accessible on the Web (all are available for listening in person) –Archivist is never at the mercy of extreme access restrictions –All access conditions work toward open access (Level 1)

25 URLs
–Electronic Metastructure for Endangered Language Data (E-MELD): http://www.emeld.org (School of Best Practice, FIELD Tool)
–Endangered Language Query Rooms: http://rosettaproject.org:8080/emeldbase/
–The Ethnologue: http://www.ethnologue.com
–General Ontology for Linguistic Description (GOLD): http://www.linguistics-ontology.org or http://emeld.org/school/workroom/terminology/
–LINGUIST List: http://www.linguistlist.org
–National Science Digital Library (NSDL): http://nsdl.org
–ODIN: www.csufresno.edu/odin
–Open Language Archives Community (OLAC): http://www.language-archives.org
–The Rosetta Project: http://www.rosettaproject.org/live. A preview of the new Web site is available at http://preview.rosettaproject.org

