Published by Justina Garrison. Modified over 9 years ago.

Language Technologies for African Languages – AfLaT 2009

The SAWA Corpus: A Parallel Corpus English - Swahili

Guy De Pauw (guy.depauw@aflat.org)
Peter Waiganjo Wagacha (waiganjo@aflat.org)
Gilles-Maurice de Schryver (gillesmaurice.deschryver@aflat.org)

Resource-scarceness

Language technology vs. the digital divide
Digital data are increasingly important for African languages (web, mobile phone, …)
But: most research on African languages is rooted in the knowledge-based paradigm (↔ LT for Indo-European languages):
- Hand-crafted expert systems
- Typically high accuracy for the domain
- Limited portability to other languages and subdomains
- Costly development phase
- Limited resources (linguistic, expertise, financial, …)
Need for a cheaper and faster (language-independent) alternative for developing African language technology

Data-driven approaches

For Indo-European and Asian languages, the data-driven, corpus-based approach has been the dominant paradigm since the 1990s.
Basic methodology: automatically extract linguistic knowledge from annotated text material (a corpus) and bootstrap the development of language technology components.
Advantages:
- Language independence: portability (!)
- Knowledge-acquisition bottleneck → data-acquisition bottleneck
- Robustness
AfLaT team: explore the application of the data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)

Machine Translation

3 paradigms:
- Rule-based MT
- Statistical MT (data-driven)
- Example-based MT (data-driven)
Data-driven MT learns translation from examples: this requires a parallel corpus!

Parallel Corpus

Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level
SAWA Corpus: parallel corpus English - Swahili
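
As a minimal sketch of what "aligned on sentence level" means in practice, the snippet below pairs two lists of sentences position by position. The `SentencePair` type and the `pair_sentences` helper are illustrative names, not part of any SAWA tooling; the sample pair is the subtitle example used in the word-alignment slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SentencePair:
    """One aligned unit of a parallel corpus (hypothetical structure)."""
    english: str
    swahili: str


def pair_sentences(english: List[str], swahili: List[str]) -> List[SentencePair]:
    """Pair sentences by position; both sides must already be sentence aligned."""
    if len(english) != len(swahili):
        raise ValueError(
            f"not sentence aligned: {len(english)} English vs {len(swahili)} Swahili"
        )
    return [SentencePair(e, s) for e, s in zip(english, swahili)]


corpus = pair_sentences(
    ["You caught me skiving, I'm afraid."],
    ["Samahani, umenidaka nikihepa."],
)
print(corpus[0])
```

Refusing to pair lists of unequal length is a deliberate choice here: a silent `zip` would truncate the longer side and hide exactly the alignment problems that data constitution has to solve.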

Example

English:
Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

Kiswahili:
Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
UMOJA WA MATAIFA
OFISI YA IDARA YA HABARI
TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
UTANGULIZI
Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,

3 phases

- Data collection: finding parallel texts
- Data constitution: aligning the parallel texts on word level
- Data exploitation:
  - Statistical Machine Translation
  - Bootstrapping linguistic annotation

Data Collection

Limited availability of parallel texts English – Kiswahili:
- Smaller documents: investment reports, political texts, e.g. the Universal Declaration of Human Rights
"There is no data like more data":
- Bible, Quran, secular literature
- New translations

Data Collection

Even if the source data is digitally available beforehand, we often face tough alignment problems during data constitution, e.g. paragraph alignment.

Example: paragraph alignment in the Universal Declaration of Human Rights

The Kiswahili version opens with an introductory note and a letterhead (UMOJA WA MATAIFA, OFISI YA IDARA YA HABARI) before UTANGULIZI, the preamble. These paragraphs have no counterpart in the English text shown earlier, so the paragraphs of the two versions do not line up one-to-one.

e.g. sentence alignment

English:
Article 12
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks.

Kiswahili:
Kifungu cha 12
Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa yake, ya nyumbani mwake au ya barua zake. Wala asivunjiwe heshima na sifa yake. Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au mambo kama hayo.
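
The Article 12 example can be checked mechanically. The sketch below uses a deliberately naive sentence splitter (an assumption for illustration, not the procedure used for the corpus, which was aligned by hand) to show that the two versions differ in sentence count, so a position-by-position pairing would fail.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


english = (
    "No one shall be subjected to arbitrary interference with his privacy, "
    "family, home or correspondence, nor to attacks upon his honour and "
    "reputation. Everyone has the right to the protection of the law against "
    "such interference or attacks."
)
swahili = (
    "Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, ya jamaa "
    "yake, ya nyumbani mwake au ya barua zake. Wala asivunjiwe heshima na sifa "
    "yake. Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi au "
    "mambo kama hayo."
)

en_sentences = split_sentences(english)
sw_sentences = split_sentences(swahili)
print(len(en_sentences), len(sw_sentences))  # 2 3
```

The first English sentence corresponds to the first two Kiswahili sentences, a 1-to-2 alignment that has to be recorded explicitly rather than assumed away.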

Available data in the SAWA Corpus

                     English     Kiswahili   English   Kiswahili
                     Sentences   Sentences   Words     Words
New Testament        16.4k       16.3k       189.2k    151.1k
Quran                14.3k       14.5k       165.5k    124.3k
Declaration of HR    0.2k        0.2k        1.8k      1.8k
Kamusi.org           5.6k        5.6k        35.5k     26.7k
Movie Subtitles      9.0k        9.0k        72.2k     58.4k
Investment Reports   3.2k        3.1k        52.9k     54.9k
Local Translator     1.5k        1.6k        25.0k     25.7k
Total                50.2k       50.3k       542.1k    442.9k

All manually sentence aligned!
Thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)
Thanks to Dr. James Omboga Zaja, University of Nairobi
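
The totals row of the table can be checked with a few lines of arithmetic. The sketch below assumes that rows showing a single sentence or word figure (Declaration of HR, Kamusi.org, Movie Subtitles) carry that figure for both languages; names such as `ROWS` are illustrative.

```python
# Columns: English sentences, Kiswahili sentences, English words, Kiswahili words
# (all in thousands). Single-figure cells are assumed to apply to both languages.
ROWS = {
    "New Testament": (16.4, 16.3, 189.2, 151.1),
    "Quran": (14.3, 14.5, 165.5, 124.3),
    "Declaration of HR": (0.2, 0.2, 1.8, 1.8),
    "Kamusi.org": (5.6, 5.6, 35.5, 26.7),
    "Movie Subtitles": (9.0, 9.0, 72.2, 58.4),
    "Investment Reports": (3.2, 3.1, 52.9, 54.9),
    "Local Translator": (1.5, 1.6, 25.0, 25.7),
}
REPORTED_TOTAL = (50.2, 50.3, 542.1, 442.9)

# Sum each column and round to one decimal, the precision used on the slide.
totals = tuple(round(sum(column), 1) for column in zip(*ROWS.values()))
print(totals == REPORTED_TOTAL)  # True

# English needs more word tokens than Kiswahili for the same content.
words_ratio = totals[2] / totals[3]
print(round(words_ratio, 2))  # 1.22
```

Under that assumption the column sums reproduce the reported totals, and the corpus contains roughly 1.22 English words per Kiswahili word.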

Word alignment

Most difficult task: relate words between languages
[Figure: word alignment between the English tokens "No she 's uh , , up north" and the Kiswahili tokens "La , , , yuko , aa juu kaskazini"]

Word alignment

[Figure: word alignment between "You caught me skiving, I'm afraid." and "Samahani, umenidaka nikihepa."]

Word alignment

- Can be done automatically using established tools (GIZA++)
- Provide a manual reference to evaluate automatic word alignment tools (5000 words)
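
GIZA++ implements the IBM alignment models, among others. To give a feel for how such a tool learns alignments from sentence pairs alone, here is a minimal sketch of the simplest of them, IBM Model 1, trained with expectation-maximization. It is not GIZA++ code and does not use its interface; it omits the NULL word, and the three toy sentence pairs are invented for illustration rather than taken from the SAWA Corpus.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def train_ibm1(pairs: List[Pair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(target word | source word) with EM (IBM Model 1, no NULL word)."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in pairs:
            for w in tgt:
                # E-step: share one unit of mass for w among the source words.
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts into probabilities.
        for (w, s), c in count.items():
            t[(w, s)] = c / total[s]
    return dict(t)


def align(src: List[str], tgt: List[str],
          t: Dict[Tuple[str, str], float]) -> List[Tuple[int, int]]:
    """Link each target token to its most probable source token."""
    links = []
    for j, w in enumerate(tgt):
        best_i = max(range(len(src)), key=lambda i: t.get((w, src[i]), 0.0))
        links.append((best_i, j))
    return links


# Toy data (illustrative only): English source, Kiswahili target.
toy_pairs = [
    (["big", "house"], ["nyumba", "kubwa"]),
    (["small", "house"], ["nyumba", "ndogo"]),
    (["big", "car"], ["gari", "kubwa"]),
]
model = train_ibm1(toy_pairs)
print(align(["small", "house"], ["nyumba", "ndogo"], model))  # [(1, 0), (0, 1)]
```

Each link is an (English index, Kiswahili index) pair. Because "house" co-occurs with "nyumba" in two pairs, the model explains "nyumba" with "house" and is then free to attach "ndogo" to "small", even though word order differs between the two languages.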

Current results

Precision   Recall   F (β=1)
39.4%       44.5%    41.79%

Still a lot of room for improvement
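
For reference, these scores follow the usual definitions over alignment links: precision is the share of predicted links that appear in the manual reference, recall is the share of reference links that were predicted, and F (β=1) is their harmonic mean. The sketch below uses those standard definitions; whether the evaluation behind the slide distinguishes, for instance, sure from possible links is not stated, and the tiny link sets are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (English token index, Kiswahili token index)


def precision_recall(predicted: Set[Link], reference: Set[Link]) -> Tuple[float, float]:
    """Precision and recall of predicted links against a manual reference."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta=1 weighs them equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: three predicted links, two of them in the reference.
p, r = precision_recall({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1)})

# Precision and recall as reported on the slide, in percent.
print(round(f_measure(39.4, 44.5), 2))  # 41.79
```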

Word alignment

Some alignment patterns are easy
[Figure: word alignment between the English tokens "No she 's uh , , up north" and the Kiswahili tokens "La , , , yuko , aa juu kaskazini"]

Alignment problems

[Figure: the single Kiswahili word "nimemkatalia" aligned to the whole English sentence "I have turned him down"]

Morphological decomposition

[Figure: "I have turned him down" aligned to the morphemes "ni+ me+ m+ katalia"]
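
To make the decomposition step concrete, here is a toy sketch that strips one prefix per verbal slot, left to right. The prefix lists are hand-written and cover only the slide's example; the slides do not say how the segmentation was actually produced, and a greedy approach like this would mis-segment many real words.

```python
from typing import List

# Hypothetical, hand-written prefix slots covering only this example.
SUBJECT_PREFIXES = ["ni"]  # 1st person singular subject ("I")
TENSE_PREFIXES = ["me"]    # perfect ("have ...")
OBJECT_PREFIXES = ["m"]    # 3rd person singular object ("him/her")


def decompose(verb: str) -> List[str]:
    """Strip at most one prefix per slot, in order; the remainder is the stem."""
    morphemes = []
    rest = verb
    for slot in (SUBJECT_PREFIXES, TENSE_PREFIXES, OBJECT_PREFIXES):
        for prefix in slot:
            # Never consume the whole word: a stem must remain.
            if rest.startswith(prefix) and len(rest) > len(prefix):
                morphemes.append(prefix + "+")
                rest = rest[len(prefix):]
                break
    morphemes.append(rest)
    return morphemes


print(decompose("nimemkatalia"))  # ['ni+', 'me+', 'm+', 'katalia']
```

After decomposition, "I", "have" and "him" each have a morpheme of their own to align with, instead of all five English tokens competing for the single word "nimemkatalia".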

Current results: morpheme/word alignment

Precision   Recall   F (β=1)
50.2%       64.5%    55.8%

Better alignment, but more complicated decoding

Future work

- Projection of annotation
- Refine GIZA++ alignment
- Part-of-speech tagger
- No data like more data: web-mining & comparable corpora
- Example-based MT (OmegaT)
- Statistical MT (Moses)

Conclusion

- Modest, but workable parallel corpus English – Swahili
- Bi-directional machine translation is now in the cards
- Modest, but encouraging word alignment scores
- Data-driven approach is viable for African languages