OLAC: The Open Language Archives Community. Gary F. Simons, SIL International and Graduate Institute of Applied Linguistics. DRIVER Summit, Goettingen, Jan 2008
2 What is OLAC? OLAC is an international partnership of institutions and individuals who are creating a world-wide virtual library of language resources by: (1) developing consensus on best current practice for the digital archiving of language resources; and (2) developing a network of interoperating repositories and services for housing and accessing such resources. Founded in December 2000. Now has 34 participating archives, including 12 European participants (bolded on next slide).
3 Who's involved? Aboriginal Studies Electronic Data Archive; Academia Sinica; Alaska Native Language Center; Archive of Indigenous Languages of Latin America; ATILF Resources; Berkeley Language Center; Centre de Ressources pour la Description de l'Oral; CHILDES Data Repository; Comparative Corpus of Spoken Portuguese; Cornell Language Acquisition Laboratory; Dictionnaire Universel Boiste 1812; DOBES catalogue (MPI, Nijmegen); Ethnologue: Languages of the World; European Language Resources Association; Laboratoire Parole et Langage; Linguistic Data Consortium Corpus Catalog; LINGUIST List Language Resources; Natural Language Software Registry; Online Database of Interlinear Text (ODIN); Oxford Text Archive; PARADISEC; Perseus Digital Library; Research Papers in Computational Linguistics; Rosetta Project 1000 Language Archive; SIL Language and Culture Archives; Surrey Morphology Group Databases; Survey for California and Other Indian Languages; TalkBank; Tibetan and Himalayan Digital Library; TRACTOR; Typological Database Project; University of Bielefeld Language Archive; University of Queensland Flint Archive; Virtual Kayardild Archive (Melbourne).
4 How does it work? Based on the OAI Protocol for Metadata Harvesting (OAI-PMH). Adds a community-specific archive description to the Identify response. Defines a new olac metadata format. We operate a static repository gateway for participants with small collections (needs olac format only). We operate an aggregator that harvests all participants and crosswalks their records to oai_dc format.
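As a rough illustration of the harvesting model on this slide, the sketch below builds the OAI-PMH request URLs a harvester would typically issue: an Identify request, and ListRecords requests for the olac and oai_dc metadata formats. The base URL and the helper name oai_request_url are hypothetical; verb and metadataPrefix are standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # Skip any argument that was passed as None.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# Identify returns the repository description (where OLAC adds its
# community-specific archive description).
identify_url = oai_request_url(BASE_URL, "Identify")

# ListRecords in the community format and in plain Dublin Core.
olac_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="olac")
dc_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(olac_url)
print(dc_url)
```

A participant that only exposes the olac format would answer the second request; the aggregator described above is what makes the third request possible for every participant.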
5 OLAC metadata format. Based on the Dublin Core metadata set. Record format follows the DC guidelines for implementing Qualified DC in XML. Adds community-specific controlled vocabularies: Linguistic Data Type to qualify Type; Linguistic Field to qualify Subject; Participant Role to qualify Creator and Contributor; ISO 639-3 to qualify Language and Subject.
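To make the idea of qualifying a Dublin Core element with a controlled vocabulary concrete, here is a minimal sketch that builds a record with a language-qualified Language and Subject element, then flattens it to unqualified Dublin Core. Several details are assumptions rather than statements from the slides: the OLAC namespace URI and the olac:language type name should be checked against the OLAC schema in use, the title is invented, and the flattening rule (copy the code into the element text) is only one plausible crosswalk, not necessarily what the OLAC aggregator does.

```python
import xml.etree.ElementTree as ET

# DC, XSI and OAI_DC are standard namespace URIs. The OLAC URI is an
# assumption and should be checked against the schema version in use.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"

for prefix, uri in (("dc", DC), ("xsi", XSI), ("oai_dc", OAI_DC), ("olac", OLAC)):
    ET.register_namespace(prefix, uri)


def qualified(parent, dc_name, vocabulary, code, text=None):
    """Add a DC element qualified by a vocabulary name and a code."""
    elem = ET.SubElement(parent, f"{{{DC}}}{dc_name}")
    elem.set(f"{{{XSI}}}type", f"olac:{vocabulary}")
    elem.set(f"{{{OLAC}}}code", code)
    elem.text = text
    return elem


def crosswalk_to_simple_dc(record):
    """Flatten to unqualified DC (assumed rule: code becomes the text)."""
    plain = ET.Element(f"{{{OAI_DC}}}dc")
    for elem in record:
        out = ET.SubElement(plain, elem.tag)
        out.text = elem.text or elem.get(f"{{{OLAC}}}code")
    return plain


record = ET.Element(f"{{{OLAC}}}olac")
title = ET.SubElement(record, f"{{{DC}}}title")
title.text = "A grammar of Lushootseed"  # invented title
qualified(record, "language", "language", "eng")  # language the resource is in
qualified(record, "subject", "language", "lut")   # language the resource is about

plain = crosswalk_to_simple_dc(record)
print(ET.tostring(record, encoding="unicode"))
print(ET.tostring(plain, encoding="unicode"))
```

The other vocabularies on this slide (data type, linguistic field, participant role) would follow the same pattern with a different vocabulary name and code list.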
6 Who's involved?
7 Controlled vocabularies for language identification. Situation: 6,912 living languages are used throughout the world (source: Ethnologue, 15th edition). Problem: the standard used in the library community (MARC language codes, or ISO 639-2) has codes for fewer than 400 languages and uses 66 collective codes to handle the other 6,500; e.g. South American Indian (Other) [sai] covers 421 languages, and Bantu (Other) [bnt] covers 612 languages.
8 ISO 639-3. In 2002, ISO TC37 invited SIL to propose a comprehensive standard compatible with ISO 639-2. Result: ISO 639-3, Alpha-3 code for comprehensive coverage of languages (published), with codes for ~6,900 living languages and codes for ~600 extinct, historical, ancient, and constructed languages. RA site: [URL not preserved]. OLAC uses this controlled vocabulary for identifying the languages a resource is in or about.
9 What is the current coverage of OLAC?
                                 All archives   Excluding Ethnologue
Items in catalog                 30,591         23,292
ISO languages included           7,299          3,134
Items with online open access    16,018         8,719
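A short worked calculation on the coverage figures from this slide, assuming the "Excluding Ethnologue" column is a strict subset of the "All archives" column. Subtracting one column from the other gives Ethnologue's own contribution, and dividing open-access items by catalog items gives the open-access share in each case. The variable names are illustrative.

```python
# Figures copied from the slide: (all archives, excluding Ethnologue).
coverage = {
    "items": (30_591, 23_292),
    "open_access_items": (16_018, 8_719),
}

# Ethnologue's contribution is the difference between the two columns.
ethnologue_items = coverage["items"][0] - coverage["items"][1]
ethnologue_open = (
    coverage["open_access_items"][0] - coverage["open_access_items"][1]
)

# Share of catalog items that are openly accessible online.
share_all = coverage["open_access_items"][0] / coverage["items"][0]
share_excl = coverage["open_access_items"][1] / coverage["items"][1]

print(ethnologue_items, ethnologue_open)
print(f"{share_all:.1%} with Ethnologue, {share_excl:.1%} without")
```

Both differences come to 7,299, the same as the language count in the "All archives" column. That would be consistent with Ethnologue contributing one openly accessible item per language, though the slide does not say so explicitly.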
10 Current developments. In the first year of a 3-year NSF-sponsored grant to increase use and coverage by an order of magnitude: 1. Develop guidelines and services that encourage best common practices among language archives that will facilitate language resource discovery with precision through OLAC (and attract more archives to join). 2. Develop services to bridge the resource catalogs of the repository, library, and web domains (e.g. OAI, MARC, Google) to facilitate language resource discovery with precision through OLAC. E.g., a user searches the OLAC aggregator for a specific language code and finds hits in external aggregators.
11 External interoperation. Strategy 1: Use existing cataloging information to identify languages with precision: codes for individual languages; language names in LC subject headings; call numbers. Strategy 2: Promote use of ISO 639-3 in cataloging: ISO639-3 is now an encoding scheme in DC Terms; iso639-3 has been added to the MARC standard as a recognized identifier for a source in Field 041. E.g., these are valid 041 fields for a grammar in English of Lushootseed [lut] of the Salishan [sal] family: 041 1_$asal$aeng (using 639-2 by default); $alut$aeng$2iso639-3 (using 639-3).
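The two 041 examples on this slide can be read mechanically: subfield $a carries a language code, and subfield $2, when present, names the code source. The sketch below parses only the subfield portion of such a field (tag and indicators are left out) and reports which scheme the codes belong to. The function names are illustrative, and the default label is simply the MARC language code list mentioned earlier in the talk.

```python
import re


def parse_subfields(field_body):
    """Split a body like '$alut$aeng$2iso639-3' into (code, value) pairs."""
    return re.findall(r"\$(.)([^$]*)", field_body)


def language_codes(field_body, default_source="MARC language codes"):
    """Return (source, codes): the scheme named in $2, else the default."""
    subfields = parse_subfields(field_body)
    source = next((v for c, v in subfields if c == "2"), default_source)
    codes = [v for c, v in subfields if c == "a"]
    return source, codes


by_default = language_codes("$asal$aeng")
with_source = language_codes("$alut$aeng$2iso639-3")

print(by_default)
print(with_source)
```

With the default scheme the grammar can only be tagged at the family level [sal]; once $2 names iso639-3, the same field can identify Lushootseed [lut] itself.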
12 Conclusion. OLAC would like to establish interoperation with the DRIVER infrastructure. We could: implement a driver set on the OLAC aggregator; harvest language resources from the DRIVER aggregator. OLAC is pleased that DRIVER already recommends ISO 639-3 as best practice with the Language element. We are available to advise institutions who need help implementing this. We are looking for partners who will help advocate adoption of ISO 639-3 in other guidelines and standards so as to broaden the base for language-related interoperation.