Google released WAXAL, an open speech dataset for African languages

AI-processed from MarkTechPost; edited by Hamidun News

Google has opened WAXAL, a large speech corpus for African languages built as a foundation for speech recognition and synthesis systems. The project targets a market where voice technologies are developing noticeably more slowly because of a chronic lack of high-quality open data.

Why This Matters

The main bottleneck in speech AI has long been the uneven distribution of data rather than the models themselves. For English, Spanish, or Chinese, there are huge open and commercial corpora, so speech recognition and voice synthesis systems progress rapidly. For many African languages the situation is the opposite: little annotated speech, few high-quality recordings, and few openly licensed datasets. As a result, people who speak languages with millions of speakers still get the worst quality in dictation, automatic subtitles, voice assistants, and interface voiceovers. WAXAL is an attempt to close exactly this infrastructure gap.

Notably, the project already looks like a living effort rather than a static archive. The technical description mentions 24 languages and a starter set for speech recognition and synthesis tasks. Google's release blog post from March 6, 2026 already describes an expanded initial release: 27 languages, more than 1,846 hours of data for speech recognition, and over 565 hours for synthesis. In other words, Google did not just publish a one-off dataset but appears to be building a long-term open foundation for languages that typically fall outside major AI platforms.

How WAXAL Works

WAXAL is split into two independent parts because speech recognition and voice synthesis have different data requirements. Recognition needs diverse speakers, natural environments, and spontaneous speech so that models hold up in real conditions. Synthesis needs cleaner audio, phonetically balanced texts, and controlled recording; otherwise it is hard to get natural, stable voice output. In this sense, WAXAL looks less like a universal "audio folder" and more like a dataset deliberately designed for two different classes of tasks.

  • In the speech recognition part, participants were asked to describe images in their native language rather than read prepared scripts.
  • Google notes that such prompts covered more than 50 topics and better elicited natural speech, including tonal nuances and code-switching.
  • In the synthesis part, phonetically balanced texts and more controlled recording conditions were used.
  • The dataset was released under the open CC-BY-4.0 license so it can be used in both research and applied products.
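
One way to picture the two-part design is as a manifest in which every recording is tagged with the task it was collected for. The sketch below is purely illustrative: the `Recording` fields, the "asr" and "tts" labels, and the sample rows are hypothetical and do not reflect WAXAL's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    # All field names are hypothetical, not WAXAL's published schema.
    language: str
    task: str     # "asr" (spontaneous image descriptions) or "tts" (controlled read speech)
    hours: float


def hours_by_task(recordings):
    """Sum recorded hours separately for each task label."""
    totals = {}
    for rec in recordings:
        totals[rec.task] = totals.get(rec.task, 0.0) + rec.hours
    return totals


# Made-up rows for illustration only; they are not real WAXAL figures.
sample = [
    Recording("lang_a", "asr", 2.5),
    Recording("lang_a", "tts", 1.0),
    Recording("lang_b", "asr", 4.0),
]
print(hours_by_task(sample))
```

Keeping the task label on each record would let a recognition pipeline and a synthesis pipeline draw from the same manifest without mixing spontaneous speech with studio recordings.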

Who Collected the Data

A key part of the project is not just the volume of data but how it was collected. Google did not work alone: it partnered with African universities and local organizations, including Makerere University, University of Ghana, Digital Umuganda, African Institute for Mathematical Sciences Senegal, Media Trust, and Loud and Clear Communications. This matters because local teams better understand speech habits, code-mixing, regional pronunciation variants, and the contexts in which people actually speak, as opposed to reading text in laboratory silence.

"The corpus was created by the community and for the community that needs it."

The production details are also interesting. For the TTS part, participants prepared texts of 10,000 to 20,000 words and worked in pairs: one person read while the other recorded and checked quality. To get cleaner audio, some teams even built their own studio boxes. Google specifically emphasizes that WAXAL should serve not only academic benchmarks but also real scenarios: local voice interfaces, machine dictation, automatic transcription, service voiceovers, and conversational systems that must understand natural speech, not just perfectly read text.

At the same time, an applied and research ecosystem is already growing around the corpus. Google mentions work on data collection for people with speech disabilities, a separate large corpus for five Ghanaian languages, and benchmarks for models like Whisper, XLS-R, MMS, and W2v-BERT on African languages. This is a good signal: WAXAL is useful not only as an archive but also as a common reference point for comparing models, finding weak spots, and bringing voice products to working quality faster.
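
Comparing recognition models on a shared corpus usually comes down to a common error metric. The article does not say which metric the benchmarks use; the sketch below assumes word error rate (WER), the conventional choice for speech recognition, and the `word_error_rate` helper and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: one substitution and one extra word over four reference words.
print(word_error_rate("the market opens early", "the market open early today"))  # 0.5
```

Real evaluations typically normalize casing and punctuation before scoring, and languages with rich morphology or tone marking may call for character-level or other complementary metrics.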

What This Means

WAXAL lowers the entry barrier for startups, researchers, and local teams who want to build voice AI for more than just the major global languages. If open corpora like this continue to grow and receive regular updates, African languages will have a chance to catch up faster with the rest of the market in recognition quality, synthesis quality, and the accessibility of digital services.
