INVESTMENT
LACUNA FUND INVESTS $1M
in Datasets for Low-Resource Languages in Sub-Saharan Africa

Lacuna Fund has invested $1 million in six projects that are creating openly accessible text and speech datasets to fuel natural language processing (NLP) technologies in 29 languages across Africa. The fund is the world’s first collaborative effort to provide data scientists, researchers and social enterprises in low- and middle-income contexts globally with the resources they need to produce training datasets that address urgent problems in their communities.
The fund pointed out in a statement in late April that the supported projects will produce text and speech datasets for NLP technologies that will have significant downstream impacts on education, financial inclusion, healthcare, agriculture, communication, and disaster response in Sub-Saharan Africa.

Lacuna Fund explained that the funding recipients will produce training datasets in Eastern, Western, and Southern Africa that will support a range of needs for low-resource languages, including machine translation, speech recognition, named entity recognition and part-of-speech tagging, sentiment analysis, and multimodal datasets. All datasets produced will be locally developed and owned, and will be openly accessible to the international data community.

“With over 50 impressive applications from, or in partnership with, organisations across Africa, there are many more initiatives poised for impact. This movement towards locally developed and owned datasets has only just begun, and with the right support and funding these initiatives will unlock the power of AI to deliver new social sector solutions and increase the presence of African countries on the international data map,” Lacuna Fund stated.

Also commenting in the same statement, Vukosi Marivate, ABSA Chair of Data Science at the University of Pretoria, drew attention to how the South African government has been using chatbots to provide daily updates on COVID-19. “Right now, translating those updates to Latin languages is really easy, but the datasets necessary to translate those updates to a range of African languages don’t exist, which means that the government isn’t currently able to communicate with many of its people in their native languages. That is one of the many examples of why we need this work now,” explained Marivate.

SYNAPSE | 2ND QUARTER 2021
Meet the recipients

Building an Annotated Spoken Corpus for Igbo NLP Tasks: This project addresses the lack of an Igbo spoken corpus for NLP tasks. Existing corpora (such as the Igbo web Corpus (IgWaC) and literary, religious and grammar texts) are either unannotated or not archived for research and NLP tasks. This study will create an annotated 1,000-sentence corpus and 25 hours of unannotated audio data to launch an open access spoken corpus that will be available for research and NLP tasks. Data will be gathered from oral narratives and live Igbo news. Ethnographic interviews will be used to collect data that covers several domains of Igbo life, such as marriage, religion, language, burial, education, security, and trade. To ensure adequate representation, balance, and homogeneity, data collection will take place in the five south-eastern states where Igbo is predominantly spoken, and the team will recruit 50 different language speakers across the states to provide audio data. Igbo news recordings will be acquired from the Federal Radio Corporation of Nigeria across the five states.

Igbo NLP Tasks Project Team member Gerald Nweya from the University of Ibadan said the team is excited to embark on this project because of the impact it will have on the NLP community, particularly as it concerns the Igbo language. “The need to build an annotated corpus of contemporary Igbo is one that is long overdue. It could be very interesting to study the language