Lacuna Fund invests $1m in Datasets for Low Resource African Languages
Lacuna Fund invests $1M in Datasets for Low Resource Languages in Sub-Saharan Africa
Lacuna Fund — the world’s first collaborative effort to provide data scientists, researchers and social enterprises in low- and middle-income contexts globally with the resources they need to produce training datasets that address urgent problems in their communities — has invested $1-million in its next cohort of 10 projects which are creating openly accessible text and speech datasets that will fuel natural language processing (NLP) technologies in 29 languages across Africa.
The fund pointed out in a statement in late April that the supported projects will produce text and speech datasets for NLP technologies that will have significant downstream impacts on education, financial inclusion, healthcare, agriculture, communication, and disaster response in Sub-Saharan Africa.
Lacuna Fund explained that the funding recipients will produce training datasets in Eastern, Western, and Southern Africa that will support a range of needs for low resource languages, including machine translation, speech recognition, named entity recognition and part of speech tagging, sentiment analysis, and multimodal datasets. All datasets produced will be locally developed and owned, and will be openly accessible to the international data community.
“With over 50 impressive applications from, or in partnership with, organisations across Africa, there are many more initiatives poised for impact. This movement towards locally developed and owned datasets has only just begun, and with the right support and funding these initiatives will unlock the power of AI to deliver new social sector solutions and increase the presence of African countries on the international data map,” Lacuna Fund stated.
Also commenting in the same statement, ABSA Chair of Data Science at University of Pretoria Vukosi Marivate drew attention to how the South African government has been using chatbots to provide daily updates on COVID. “Right now, translating those updates to Latin languages is really easy, but the datasets necessary to translate those updates to a range of African languages don’t exist, which means that the government isn’t currently able to communicate with many of its people in their native languages. That is one of the many examples of why we need this work now,” explained Marivate.
Meet the recipients
Building an Annotated Spoken Corpus for Igbo NLP Tasks:
This project addresses the gap in the availability of an Igbo spoken corpus for NLP tasks. Existing corpora—such as the Igbo web Corpus (IgWaC) and literary, religious and grammar texts—are either unannotated or not archived for research and NLP tasks. This study will create an annotated 1000-sentence corpus and 25 hours of unannotated audio data to launch an open access spoken corpus that would be available for research and NLP tasks.
Data will be gathered from oral narratives and live Igbo news. Ethnographic interviews will be used to collect data covering several domains of Igbo life, such as marriage, religion, language, burial, education, security, and trade. To ensure adequate representation, balance, and homogeneity, data collection will take place in the five south-eastern states where Igbo is predominantly spoken, and the team will recruit 50 different language speakers across the states to provide audio data. Igbo news recordings will be acquired from the Federal Radio Corporation of Nigeria across the five states.
Igbo NLP Tasks Project Team member Gerald Nweya from the University of Ibadan said the team is excited to embark on this project because of the impact it will have on the NLP community, particularly as it concerns the Igbo language.
“The need to build an annotated corpus of contemporary Igbo is one that is long overdue. It could be very interesting to study the language from naturally occurring contexts such as narratives, stories and conversations. Therefore, we are both overjoyed and grateful to the Meridian Institute for giving us this unique opportunity through Lacuna Fund. We are very hopeful that this will serve as a springboard for the use of Igbo for NLP tasks and other applied linguistic research,” added Nweya.
Masakhane MT: Decolonising Scientific Writing for Africa:
The ability for science to be discussed in local indigenous languages can not only help expand knowledge to those who do not speak English or French as a first language, but also can integrate the facts and methods of science into cultures that have been denied it in the past. Thus, the team will build a multilingual parallel corpus of African research, by translating African preprint research papers released on AfricArxiv into six diverse African languages.
Masakhane MT: Decolonising Scientific Writing for Africa Project Team member Jade Abbott pointed out that when it comes to communication and education, language matters. “The ability of science to be discussed in local indigenous languages not only can reach more people, but can open up African methodologies and research to the world. We’re exceptionally excited to bring African science to the global community and continue the journey of decolonisation of scientific discourse,” added Abbott.
Multimodal Datasets for Bemba:
This project will create the first multimodal dataset for Bemba—the most populous language in Zambia, but one that lacks significant resources. The team will collect visually-grounded dialogues between native Bemba speakers, which will be diarised and transcribed in full. A sample of the data will also be translated into English.
The dataset will enable the development of speech recognition and speech and text translation applications, as well as facilitate research in language grounding and multimodal model development.
Multimodal Datasets for Bemba team member Clayton Sikasote explained that this will be the first multimodal speech dataset created for any Zambian language. “We are excited about this project because the dataset will enable the development of speech recognition and speech to text translation applications, as well as facilitate research in language grounding and multimodal model development,” added Sikasote.
Named Entity Recognition and Parts of Speech Datasets for African Languages:
Currently, the majority of existing Named Entity Recognition (NER) datasets for African languages are automatically annotated and noisy, since the text quality for African languages is not verified—only a few African languages have human-annotated NER datasets. Likewise, the only open-source POS datasets that exist are for a small subset of languages in South Africa, and Yoruba, Naija, Wolof, and Bambara.
This project will develop a Part-of-Speech (POS) and Named Entity Recognition (NER) corpus for 20 African languages based on news data. NER is a core NLP task in information extraction, and NER systems are a requirement for numerous products from spell-checkers to localisation of voice and dialogue systems, conversational agents, and information retrieval necessary to identify African names, places, and people.
“This project will lead to a better understanding of the linguistic structures of 20 African languages from four language families (Afro-Asiatic, English Creole, Niger-Congo, and Nilo-Saharan) and regions of Africa. It will also encourage benchmarking of African language datasets in natural language processing (NLP) research. We look forward to how this initiative will spur NLP research in African universities,” stated Masakhane.
Building NLP Text and Speech Datasets for Low Resourced Languages in East Africa:
The project will deliver open, accessible, and high-quality text and speech datasets for low-resource East African languages from Uganda, Tanzania, and Kenya.
Taking advantage of the advances in NLP and voice technology requires large corpora of high-quality text and speech data. This project aims to provide this data for five languages: Luganda, Runyankore-Rukiga, Acholi, Swahili, and Lumasaaba.
The speech data for Luganda and Swahili will be geared towards training a speech-to-text engine for an SDG-relevant use case and general-purpose ASR models that could be used in tasks such as driving aids for people with disabilities and the development of AI tutors to support early education.
Monolingual and parallel text corpora will be used in several NLP applications that need NLP models, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.
Open Source Datasets for Local Ghanaian Languages: A Case for Twi and Ga:
This project will develop a new speech dataset that makes it possible for Twi (Asante, Akuapim, Fante dialects) and Ga speakers in Ghana with low English literacy to access digital financial services in their native language.
Access to digital financial services will serve as the immediate use case—however, the bulk of the collected data will be additionally useful for other purposes. The team will build a phonetically balanced speech corpus (with transcriptions and rough English translations) that is focused on the financial domain, and since the speech corpus will be phonetically balanced, it should be useful in acoustic modelling for use cases beyond accessing digital financial services.
English illiteracy and low literacy, the Ashesi University and Nokwary team explained, are barriers that keep many Ghanaians from accessing the full benefits of the digital age and, in particular, digital financial services.
“Advancement in speech and language technology can break this illiteracy barrier but it is impossible to apply these advances to our native languages without datasets in these languages. The funding from Lacuna Fund will enable us to build a dataset in Twi and Ga that we believe will spur AI innovations that will help bring the full benefits of the digital age to all Ghanaians regardless of socio-economic status,” they added.
KenCorpus: Kenyan Languages Corpus:
This project recognises the central role that language plays in preserving identity and culture and in equalising access to information. The team will build the KenCorpus (Kenyan Languages Corpus) with the goal of providing rich textual and speech data resources for selected languages spoken in Kenya.
The KenCorpus will be collected from Kiswahili, Luhyia, and Dholuo languages and is a deliberate effort to provide equal opportunities, inclusivity, participation in decision-making, and accessibility to information by providing a base dataset for building NLP tools (e.g., POS taggers, Machine Translation systems, Automatic Speech Recognition, Text to Speech, Question Answering and Conversational agents in African languages).
Lacuna said the project will have a great impact on the methodologies used in the rapid assembly of corpora for under-resourced languages, shed light on how to prepare and annotate speech and texts for use in multilingual communities, and inspire the growth of human language technology firms across Africa.
The KenCorpus Project team commented, “Every language and culture has a story to tell, and one’s native language speaks to one’s soul. Nelson Mandela once said: ‘If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.’ The official and national languages of Kenya are English and Kiswahili. Kenya is a multilingual nation with approximately 68 native languages. ‘KenCorpus’, a Native Languages of Kenya (NaLaKe) Dataset for NLP and ML, seeks to bring Kenyan languages into the NLP space. Collection of quality linguistics datasets is a first step to achieving our long-term goal of availing life changing NLP tools for African languages as instruments of culture. Our ability to communicate new ideas and discoveries in Native Languages is crucial to scientific linguistic advancement. In this project, we will get the chance to work with selected Native speakers across Kenya, involve students in data collection, annotation, and mentor them into building NLP tools for African Languages.”
Development of Corpus, Sentiment, and Hate Speech Lexicon for Major Nigerian Languages:
Sentiment analysis is a novel field of research in Natural Language Processing that deals with the identification and classification of people’s opinions and sentiments about products and services contained in a piece of text, usually in web data. While various resources and datasets have been proposed in the research community, most of them are for English, Chinese, and European languages. Several under-resourced languages are used in Nigeria. For example, Hausa, Yoruba, and Igbo are the most widely spoken languages in Nigeria, with over 150 million speakers in Nigeria alone, and are widely used in other African countries, yet there are few resources for sentiment analysis in these languages.
The sentiment lexicon is one of the most crucial resources for most sentiment analysis tasks, and the huge amount of data generated in these languages through social media remains untapped. Thus, the team will seek to develop a corpus, sentiment lexicon, and hate speech lexicon for the Hausa, Yoruba and Igbo languages.
The project team commented, “Languages spoken in Africa are low-resource; they have insufficient resources like datasets for machine learning and other important AI tasks. In this project, we aim to produce the first large-scale high-quality datasets for machine learning from social media contents written in three major Nigerian languages (Hausa, Igbo, and Yoruba). These datasets will be useful in Natural Language Processing tasks such as sentiment analysis, emotion analysis, hate speech detection, and fake news detection. We are a team of researchers from HausaNLP research group of Faculty of Computer Science and Information Technology, Bayero University Kano-Nigeria. We have international collaborations with Masakhane, Sentiment Analysis Lab of Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, and INESC TEC’s Artificial Intelligence and Decision Support Laboratory (LIAAD).”
Lacuna Fund began as a funder collaborative between The Rockefeller Foundation, Google.org, and Canada’s International Development Research Centre, with support from the German development agency GIZ on behalf of the Federal Ministry for Economic Cooperation and Development (BMZ) on this call for proposals.
Earlier this year Lacuna Fund announced that six African projects building agricultural datasets for AI would receive a total of $1.1-million in its first round of funding. (Read more about this cohort in this Synapse Q1 2021 article.) Launched in July 2020 with pooled funding of $4-million, Lacuna Fund supports the creation, expansion, and maintenance of datasets used for training or evaluation of machine learning models, initially in three key sectors: agriculture, health, and languages. The initiative has since evolved into a multi-stakeholder engagement composed of technical experts, thought leaders, local beneficiaries, and end users.
* The article was updated on 14 June 2021 to include the KenCorpus: Kenyan Languages Corpus project, as well as the Development of Corpus, Sentiment, and Hate Speech Lexicon for Major Nigerian Languages project.