The Rosetta Project ALL Language Archive A Project of the Long Now Foundation & A National Science Digital Library www.rosettaproject.org Presented by:

1 The Rosetta Project ALL Language Archive A Project of the Long Now Foundation & A National Science Digital Library www.rosettaproject.org Presented by: Laura Buszard-Welcher The Rosetta Project / University of California, Berkeley

2 Primary Goals Support the documentation of the world’s nearly 7000 languages through building –A digital archive of language documentation –A linguistically sophisticated site that is also useful and interesting for the general public –Networks of speakers, educators, linguists Contributes to the effort to document endangered languages Promotes linguistic diversity by educating the public about languages with small numbers of speakers.

3 Secondary Goals Support metadata standardization and interoperability –OLAC –EMELD Develop tools for collaborative linguistic research –Endangered Language Query Room –Wordlist Tool –Collaborative document editing/creation (new site)

4 Roles The Long Now Foundation –Parent organization of The Rosetta Project –Projects, seminars on topics that foster long term thinking The National Science Digital Library –U.S. National Science Foundation Program –Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education –Sponsor of Rosetta Project (NSF 333727) Stanford University –Online and offline storage of Rosetta materials

5 The Long Now Foundation

6 The National Science Digital Library

7 Stanford University Libraries

8 Project History: The 1000 Language Archive Initiated by The Long Now Foundation Wanted to experiment with new microetching technology, looking for suitable content Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages

9 Why language information? Most natural human languages are products of millennia of human history (therefore a good long term thinking project) Repositories of cultural information Languages showcase –Human intellectual sophistication –Cultural diversity To draw attention to the critical issue of language endangerment

10 The Rosetta Disk Next generation microfiche Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk Developed by Los Alamos Laboratories and Norsam Technologies Reading the disk requires a microscope, either optical or electron, depending on the density of encoding

11 The Rosetta Stone Not us! (196 BC) Parallel text written in three scripts: –Hieroglyphic –Demotic (script form) –Greek The key to deciphering Egyptian Hieroglyphs

12 Rosetta Stone Language Learning Software (Also not us!)

13 Design of the Disk Original design has human-eye readable text (Genesis text) and micro-etched text inside an index New design has human-eye readable text (instructions) on one side and micro-etched images on the reverse

14 In-House Scanning HP CapShare Scanners Scan printed page in multiple passes, any direction Page is ‘assembled’ into one image Stores about 50 pages at a time (300 dpi bitmap.tif) Uploads numerically sequenced images to computer by infrared port

15 In-House Scanning Minolta PS 7000 Overhead Bitmap and grayscale scans up to 600 dpi Multiple sizes, orientations Single page / double page spread (good for text collections with verso annotations) Best for fragile books, manuscripts that would be damaged by hand scanning

16 Categories of Collection (1) Ethnologue description General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, alternate language names. General description General description of the language. Origin and current distribution of language, number of speakers, family, typology, history, etc. Maps Maps of the geographic distribution of a language and its relationship to other languages in the region. Orthography Writing system(s) of the language with any accompanying guide to pronunciation, use, etc. Phonology A description of the basic sound units in a language (phonemes) and how they combine to form utterances.

17 Categories of Collection (2) Grammar How a language combines the smallest units of meaning (morphemes) to create words and words to create sentences. Core Word Lists A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes. Numbers A description of the numbering system(s) in a language with a list of basic terms. Parallel Texts A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Declaration of Human Rights. Glossed Texts Transcribed indigenous texts with word glosses, free translations and grammatical markup.

18 Language Curation
Ethnologue Description
General Description (1651)
Maps (376)
Orthography (1052)
Phonology (1731)
Grammar (1167)
Core Word Lists (3098)
Numbering Systems (215)
Main Parallel Texts (1109)
Glossed Vernacular Texts (869)

19 Rosetta Project Web Site Welcome Search for a language Language overview page Browse (by name, family, country) Wordlist tool

20 Welcome

21 Search

22 Language Overview

23 Browse

24 Projects Endangered Language Query Rooms Digital Online Curation Services for Endangered Language Archives (DOCS) Wordlist Tool LangGator

25 Endangered Language Query Rooms http://emeld.rosettaproject.org/

26 Query Room Virtual Keyboard

27 Potawatomi Query Room Re: Bozho by Donald Perrot (host) on July 9 2004, 8:53 PM Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin. I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI. Re: Bozho by Justin Neely on September 7 2004, 1:16 PM Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas [Hello Neaseno and Lameen my name is Zagnenibi. I’m Native and Potawatomi. I belong to the Citizen Band. I’m Crane Clan. I’m from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.]

28 Taking Conversational Risks by [TL] on July 17 2004, 10:30 AM mbesuk onago ngi zhyamen. nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk. ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego [I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too, maybe. So long for now, Mtego.] Re: onago egi zhejkeyak by [JN] on July 17 2004, 8:12 PM mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se [I should go to the lake today. The water is cold here. I wish the water were warm. I’ll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]

29 Factors in query room success
Factor: Nias / Potawatomi
Speech community: 500,000 / ~25 native
Robust use: On Nias / Nowhere
Diaspora: Indonesia, West / US, Ontario
Internet access: Only in diaspora / US-normal
Online community: Preexisting
Rooms requested: By speaker

30 DOCS Project Digital Online Curation Services for Endangered Language Archives Many small language archives are beginning to digitize their materials Lack technical infrastructure to bring resources online Goal is to provide access through Rosetta

31 DOCS Project Archives Endangered Language Fund (ELF) Survey for California and Other American Indian Languages (SCOIL) The Alaska Native Languages Center (ANLC) Max-Planck Institute for Evolutionary Anthropology (Leipzig)

32 Wordlist Tool Swadesh lists (100, 200, 207 terms) from: –Tryon's Comparative Austronesian Dictionary (rekeyed) –Tim Usher's Indo-Pacific database (2002 version) –Paul Whitehouse's Australian and New Guinea database (2002 version) –George Starostin's Dravidian database –Ilya Peiros' Mon Khmer database Total of 1,384 languages, 3,090 lists online Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.

33 LangGator A linguistic “Wayback Machine” Language resource location and aggregation –Use alternate language names, spellings: Deutsch, Hochdeutsch, High German, Allemande; Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca –Character identification (inventory, distribution): Dera (Chadic, Nigeria); Dera (Trans-New Guinea, Indonesia) –Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms Archiving through Internet Archive Serve results through the Rosetta site
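The two LangGator ideas on this slide, matching resources by alternate names and seeding a crawler with wordlist terms weighted towards longer ones, can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the grouping of the slide's names under a single key, and the sample wordlist (invented placeholder strings, not real language data) are all assumptions.

```python
import random
from collections import defaultdict

# Names taken from the slide. Filing them under one key per language
# is an assumption made for this sketch.
ALTERNATE_NAMES = {
    "German": ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
    "Fadicca": ["Fadicha", "Fedija", "Fadija", "Fiadidja", "Fiyadikkya", "Fedicca"],
    "Dera (Chadic, Nigeria)": ["Dera"],
    "Dera (Trans-New Guinea, Indonesia)": ["Dera"],
}


def build_name_index(alternates):
    """Map every case-folded name or spelling to the set of languages it may denote."""
    index = defaultdict(set)
    for language, names in alternates.items():
        for name in [language, *names]:
            index[name.casefold()].add(language)
    return index


def pick_seed_terms(wordlist, k, rng=None):
    """Draw k seed terms (with replacement), weighting each term by its length.

    Longer terms are less likely to collide with unrelated languages,
    so they are sampled proportionally more often.
    """
    rng = rng or random.Random()
    weights = [len(term) for term in wordlist]
    return rng.choices(wordlist, weights=weights, k=k)


index = build_name_index(ALTERNATE_NAMES)
placeholder_wordlist = ["wa", "tanimo", "kelasupati"]  # invented strings
seeds = pick_seed_terms(placeholder_wordlist, 5, random.Random(0))
```

Looking up `index["dera"]` returns two candidate languages, which is the kind of ambiguity that a name match alone cannot resolve and that a further signal, such as the character inventory mentioned on the slide, would have to settle.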

34 Collaborations Electronic Metastructure for Endangered Languages Data (E-MELD) General Ontology for Linguistic Description (GOLD) Open Language Archives Community (OLAC)

35 E-MELD Electronic Metastructure for Endangered Language Data School of Best Practice http://emeld.org/school/index.html –Guidelines and examples for putting linguistic data into best practice digital formats –XML with XML Schema or DTD –Mapping terminology to ontology (GOLD) FIELD lexical database tool http://emeld.org/tools/field/beta/ –Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)

36 GOLD General Ontology for Linguistic Description Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology) Best practice resources should include a mapping between the researcher’s terms, and a standard set, known as the ‘profile’ –‘independent’ (mine) = ‘main clause’ (GOLD) –‘obviative’ (mine) = ‘fourth person’ (GOLD) The standard terminology set can then allow sophisticated searches across disparate resources.
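The profile idea above can be sketched in a few lines: each resource keeps its own terminology plus a mapping to the standard set, and a search phrased in standard terms is translated per resource. The resource names, the annotation lists, and the second profile are hypothetical; the standard-side strings are copied from the slide and are not checked against actual GOLD identifiers.

```python
# Profile: researcher's own term -> standard term.
# "resource_a" uses the two mappings shown on the slide;
# "resource_b" and its mapping are invented for illustration.
PROFILES = {
    "resource_a": {"independent": "main clause", "obviative": "fourth person"},
    "resource_b": {"matrix clause": "main clause"},
}

# Hypothetical annotations, written in each researcher's own terminology.
ANNOTATIONS = {
    "resource_a": ["independent", "obviative", "independent"],
    "resource_b": ["matrix clause"],
}


def search(standard_term):
    """Return, per resource, the local tags whose profile maps to standard_term."""
    hits = {}
    for resource, tags in ANNOTATIONS.items():
        profile = PROFILES[resource]
        matched = [tag for tag in tags if profile.get(tag) == standard_term]
        if matched:
            hits[resource] = matched
    return hits


main_clause_hits = search("main clause")
```

A single query for "main clause" finds material tagged "independent" in one resource and "matrix clause" in the other, without either researcher having to rename anything in their own data.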

37 GOLD Community Model

38 OLAC Open Language Archives Community Set of 23 metadata elements and controlled vocabularies (based on Dublin Core) –Subject.language (language described, rather than audience language) uses SIL language codes –Type.linguistic (grammar, lexicon, text) –IMDI (ISLE Metadata Initiative) has 135 elements Recommended extensions (Discourse Types, Linguistic Field, Participant roles) Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
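To illustrate why shared metadata elements enable cross-archive search, here is a minimal sketch in which records from different archives carry the two elements named above and are filtered with one query. It does not reproduce the OLAC XML format or its harvesting protocol; the `Record` class, the archive names, the titles, and the language codes ("XXA", "XXB") are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    archive: str           # which archive holds the item (hypothetical names)
    title: str
    subject_language: str  # stands in for Subject.language (placeholder codes)
    linguistic_type: str   # stands in for Type.linguistic: grammar, lexicon, text


RECORDS = [
    Record("archive_one", "Sketch grammar", "XXA", "grammar"),
    Record("archive_one", "Field wordlist", "XXB", "lexicon"),
    Record("archive_two", "Collected stories", "XXA", "text"),
    Record("archive_two", "Dictionary draft", "XXA", "lexicon"),
]


def search(records, language=None, linguistic_type=None):
    """Filter records from any archive by the shared metadata elements."""
    return [
        r for r in records
        if (language is None or r.subject_language == language)
        and (linguistic_type is None or r.linguistic_type == linguistic_type)
    ]


about_xxa = search(RECORDS, language="XXA")
xxa_lexicons = search(RECORDS, language="XXA", linguistic_type="lexicon")
```

Because every archive describes the language and the resource type with the same elements and controlled values, one filter works across all of them; without that agreement each archive would need its own query logic.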

39 URLs
Electronic Metastructure for Endangered Language Data (E-MELD): http://www.emeld.org (School of Best Practice, FIELD Tool)
Endangered Language Query Rooms: http://rosettaproject.org:8080/emeldbase/
The Ethnologue: http://www.ethnologue.com
General Ontology for Linguistic Description (GOLD): http://www.linguistics-ontology.org
ISLE MetaData Initiative (IMDI): http://www.mpi.nl/IMDI/
National Science Digital Library (NSDL): http://nsdl.org
Open Language Archives Community (OLAC): http://www.language-archives.org
The Rosetta Project: http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org

40 Credits This project is funded by the US National Science Digital Library (NSF 333727)

