DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE JUSTUS C ROUX IMS STUTTGART 13.07.2015.

1 DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE JUSTUS C ROUX IMS STUTTGART 13.07.2015

2 OUTLINE
Concept: resource scarce languages
Overview of the language situation in South Africa
Lack of language resources and high level support for development of resources
Co-ordination of activities in resource development and management
The demand for localised language services over digital devices and related opportunities

3 Resource scarce languages
“Under-resourced languages are generally described as languages that suffer from a chronic lack of available resources, from human, financial, and time resources to linguistic ones (language data and language technology), and often also experience the fragmentation of efforts in resource development.” (Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

4 Resource scarce languages (2)
"This situation is exacerbated by the realization that as technology progresses and the demand for localised languages services over digital devices increases, the divide between adequately- and under-resourced languages keeps widening." (Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

5 Issues are
A chronic lack of available resources, from human, financial, and time resources to linguistic ones
Fragmentation of efforts in resource development
As technology progresses, the demand for localised language services over digital devices increases
But first, consider the language situation in South Africa

6 Language Situation in South Africa
Home language (n = 52 million speakers)
11 official languages

7 The official African languages grouped
Nguni group: isiZulu, isiXhosa, Siswati, isiNdebele
Sotho group: Northern Sotho / Sepedi, Southern Sotho / Sesotho, Western Sotho / Setswana
Tshivenda / Xitsonga group: Tshivenda, Xitsonga
Shares shown on the slide (in slide order): 45%, 24%, 4%
Cross border languages: Mozambique, Zimbabwe, Swaziland, Lesotho, Botswana

8 Similarities at different levels within groups
Sotho group - disjunctive spelling – lexical items
Ke tla bolela Sepedi. ("I will speak Sepedi.")
Ke tla bua Setswana. ("I will speak Setswana.")
Ke tla bua Sesotho. ("I will speak Sesotho.")
Nguni group - conjunctive spelling – lexical items
Ngizokhuluma isiZulu. ("I will speak isiZulu.")
Ndizothetha isiXhosa. ("I will speak isiXhosa.")
Implications for NLP
Grammatical structures across language groups the same
Regular spelling: Grapheme to phoneme conversion – direct
Tone languages – specific implications and challenges for TTS systems
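
The contrast between disjunctive and conjunctive spelling can be made concrete with a minimal sketch that counts whitespace-delimited words in the example sentences above. The sentences are taken from the slide; the helper name `orthographic_words` is hypothetical and chosen only for illustration, and whitespace splitting is an assumed stand-in for a real tokenizer.

```python
# Example sentences from the slide; all mean "I will speak <language>."
SENTENCES = {
    "Sepedi": "Ke tla bolela Sepedi.",
    "Setswana": "Ke tla bua Setswana.",
    "Sesotho": "Ke tla bua Sesotho.",
    "isiZulu": "Ngizokhuluma isiZulu.",
    "isiXhosa": "Ndizothetha isiXhosa.",
}


def orthographic_words(sentence):
    """Split on whitespace and strip sentence punctuation (illustrative helper)."""
    return [token.strip(".,!?") for token in sentence.split()]


for language, sentence in SENTENCES.items():
    words = orthographic_words(sentence)
    print(f"{language}: {len(words)} orthographic words -> {words}")
```

The Sotho-group sentences split into four orthographic words while the Nguni-group sentences split into two, although all five express the same content. A tokenizer or morphological analyser therefore cannot treat a whitespace-delimited word as the same unit across the two groups.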

9 Afrikaans and its Germanic roots
English: My hand is in warm water.
Afrikaans: My hand is in warm water.
Dutch: Mijn hand is in warm water.
German: Meine Hand ist in warmem Wasser.
Danish: Min hånd er i varmt vand.
Norwegian: Min hånd er i varmt vann.
Swedish: Min hand är i varmt vatten.
Implications
Bootstrapping Afrikaans systems from e.g. Dutch.
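
As a rough illustration of why Dutch is a natural starting point for bootstrapping, the sketch below counts how many word forms each sentence shares with the Afrikaans one. The sentences follow the slide (German in its standard form, "in warmem Wasser"); the function names are hypothetical, and exact word-form overlap on a single sentence is only an assumed, very crude proxy for language similarity.

```python
import string

# Example sentences following the slide.
SENTENCES = {
    "English": "My hand is in warm water.",
    "Afrikaans": "My hand is in warm water.",
    "Dutch": "Mijn hand is in warm water.",
    "German": "Meine Hand ist in warmem Wasser.",
    "Danish": "Min hånd er i varmt vand.",
    "Norwegian": "Min hånd er i varmt vann.",
    "Swedish": "Min hand är i varmt vatten.",
}


def word_forms(sentence):
    """Lowercased set of words with surrounding punctuation removed."""
    return {word.strip(string.punctuation).lower() for word in sentence.split()}


def shared_with_afrikaans(language):
    """Number of word forms the given sentence shares with the Afrikaans one."""
    return len(word_forms(SENTENCES["Afrikaans"]) & word_forms(SENTENCES[language]))


for language in SENTENCES:
    if language != "Afrikaans":
        print(f"{language}: {shared_with_afrikaans(language)} of 6 word forms shared")
```

On this one sentence, Dutch shares five of the six Afrikaans word forms, more than any other non-English language in the list.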

10 ISSUE #1 Chronic lack of available (digital) resources, from human, financial, and time resources to linguistic ones
Digital resources for previously marginalised languages extremely limited: newspapers, periodicals, relatively low presence on the Web
Lack of language expertise – no tradition of Computational Linguistics – limited number of students in local languages – only North-West University with degree courses in Language technologies ("Linguists are still needed" – Ed Grefenstette)
Growing expertise in Computer Science and Signal processing with focus on natural languages in most of the larger universities
Financial support mainly ad hoc from private sources

11 Various initiatives for text and speech data collections over a number of decades – mainly for linguistic / phonetic research at academic institutions – difficult to share resources
Continued academic pressure (on grounds of the constitution) on government for support of research and development of Language Technologies – not to marginalise the indigenous languages again
Large data acquisition projects sponsored by national government since 1999 – part of National Language Plan (RSA and India are the only countries with official policy regarding LT development).

12 Ministerial Panel: HLT Strategy for South Africa (2002)
Focus on digital resources: text & speech (SA official languages)
2008: Human Language Technology Expert Panel (HLTEP) established
commissions HLT application projects annually with governmental funds
these projects invariably create digital resources
obvious that it was necessary to create a central depository for all newly created language resources
Ongoing major projects since 2000 in text and speech domains
Refer to RMA resources to be discussed

13 ISSUE #2 Fragmentation of efforts in resource development
Various language projects across the country generating text and speech resources for different purposes – availability of the data (?)
Resources from projects commissioned by the HLTEP (i.e. funded by taxpayers' money) needed to be deposited in a central place
2012: The National Department of Arts and Culture (DAC) established the Resource Management Agency (RMA) at the North-West University (Potchefstroom) under the auspices of the Centre for Text Technology (CTexT) as a 3 year project (www.rma.nwu.ac.za).

14 http://www.rma.nwu.ac.za

15 NEWSLETTER

16 Contents of the RMA
LANGUAGE: Afrikaans (31); English (30); isiNdebele (20); isiXhosa (23); isiZulu (27); Sesotho sa Leboa (Sepedi) (22); Setswana (20); Sesotho (Southern Sotho) (22); Siswati (20); Tshivenda (20); Xitsonga (24); Dutch (4); Yoruba (3)
PROJECT: Autshumato (18); Lwazi (36); NCHLT Text (43); NCHLT Speech (13); African Speech Technology (15)
DATABASE TYPE: Monolingual Speech Corpora: Annotated (22); Multilingual Text Corpora: Aligned (3); Monolingual Text Corpora: Annotated (1)
RESOURCE TYPES: Data; Modules; Applications; Tools/Platforms

17 FROM RMA TO NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES (NCDLR)
RMA: status 3-4 year project (2012 – 2015) (Dept of Arts & Culture)
Untenable as development of resources is ongoing (living archive)
National Department of Science and Technology (DST) (2014): International panel to determine a new South African Research Infrastructure Roadmap (SARIR)
Presentations made to include language (Humanities) and technology in a Roadmap dominated by natural science, medicine, engineering, earth sciences etc.
June 2015: The National Centre for Digital Language Resources approved – long term funding (Press statement of DST to follow soon)

18 National Centre for Digital Language Resources
University of Pretoria, Department of African Languages
CSIR MERAKA INSTITUTE (Human Language Technologies Research Group)
North-West University, Centre for Text Technology (CTexT)
University of South Africa, Department of African Languages
ICELDA PARTNERSHIP

19 NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES
Functions
Single point of entry for information on SA language resources (portal)
Free open access for academic research
Licensed access for commercial applications
Includes RMA resources
Systematic digitisation of scientifically valuable language resources – historical nature (Scientific committee)

20 Systematic digitisation of different registers/modes of language resources by the Centre, as well as by academics/public as open call funded projects
Combine these projects with MA / PhD studies with data to be deposited at Centre
Resource centre for studies in the domain of Digital Humanities

21 ISSUE #3 Demand for localised language services over digital devices increases
Available
At text level
Spelling checkers for all SA languages – CTexT (Microsoft) http://www.nwu.CTexT.ac.za
Machine translation – government documents – CTexT (Autshumato IMT) http://www.autshumato.sourceforge.net
On-line translations: e.g. www.Translate.org, www.Freelang.net and various other software programs ranging from word lists to communication phrases
At speech/text level (interactive telephone based systems) (Major projects)
African Speech Technology: Hotel reservation system in 5 languages (prototype) www.lrec-conf.org/proceedings/lrec2004/summaries/445.htm
LWAZI I and II: Various community based applications www.meraka.org.za/lwazi/

22 Why do we need to speed up localised language services?
There is a demand for a wide array of language based communication systems:
Interactive multilingual voice systems as information systems
Interactive text-to-speech systems
Literacy training in different languages
Language specific reading support for the blind
Machine translation systems for public use
Speech-to-speech communication systems with various language pairs
Etc.
There are specific research and business opportunities – consider the following

23 Mobile telephone penetration, selected countries
http://www.itu.int/ITU-D/ict/statistics/explorer/index.html
Mobile cellular subscriptions (million)
Japan: 149
Nigeria: 127
Germany: 100
South Africa: 76
Korea (Rep): 55
France: 36
Mobile cellular subscriptions per 100 inhabitants
South Africa: 146
Germany: 121
Japan: 117
Korea (Rep): 111
France: 98
Nigeria: 73
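
The two tables are linked by simple arithmetic: dividing total subscriptions by the per-inhabitant rate gives the population each pair of figures implies. The sketch below uses only the numbers shown on the slide; the function name `implied_population_million` is hypothetical and used here purely for illustration.

```python
# Figures as shown on the slide.
SUBSCRIPTIONS_MILLION = {
    "Japan": 149, "Nigeria": 127, "Germany": 100,
    "South Africa": 76, "Korea (Rep)": 55, "France": 36,
}
SUBSCRIPTIONS_PER_100 = {
    "South Africa": 146, "Germany": 121, "Japan": 117,
    "Korea (Rep)": 111, "France": 98, "Nigeria": 73,
}


def implied_population_million(subscriptions_million, per_100_inhabitants):
    """Population in millions implied by total subscriptions and the per-100 rate."""
    return subscriptions_million / (per_100_inhabitants / 100)


for country, subscriptions in SUBSCRIPTIONS_MILLION.items():
    population = implied_population_million(
        subscriptions, SUBSCRIPTIONS_PER_100[country]
    )
    print(f"{country}: about {population:.0f} million inhabitants implied")
```

For South Africa, 76 million subscriptions at 146 per 100 inhabitants implies roughly 52 million people, in line with the 52 million home-language speakers quoted on slide 6. A rate above 100 means there are more subscriptions than inhabitants.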

25 Conclusion
Challenges for the development and management of different types of language resources and applicable tools:
Academic considerations: insights into language structures and use
Commercial considerations: providing multilingual applications for a growing market, specifically in the African context
In order to meet these challenges it is necessary to develop and update language resources not only on a case-by-case basis, but also systematically, in a coordinated manner, over as long a period as possible. This is what we are attempting to do in the South African context.

26 Thank you for listening.

