Having Trouble Using Ethiopian Languages with AI? This Open-Source Dataset Might Have a Fix

By Ayu Beteseb

iCog has developed a comprehensive dataset covering five Ethiopian languages, intended for training artificial intelligence systems such as Large Language Models (LLMs). The company, formerly known as iCog Anyone Can Code, has debuted Leyu Ai, an open-source voice dataset that includes the Amharic, Afaan Oromo, Tigrinya, Af-Somali, and Sidama languages. Leyu, which means ‘to identify’ in Amharic, incorporates refined dialects in its dataset to accommodate linguistic nuances.

“This approach incorporates local linguistic nuances into AI and Natural Language Processing (NLP) applications, helping businesses and organizations develop more inclusive and effective digital solutions for Ethiopia and beyond,” says Betelhem Dessie, CEO of iCog.

The data collected through crowdsourcing goes through a comprehensive validation process, in which experts review it for accuracy, contextual relevance, and cultural propriety. The collected data is checked against established linguistic standards to account for variations in language use and context sensitivity. Data quality is further assessed for coherence, consistency, and overall suitability for the intended use.

“The multi-layered review process is essential to guarantee the integrity and reliability of the dataset, positioning it as a valuable resource in comparison to other benchmark datasets,” Betelhem told Shega.
