
Māori build their own speech synthesizer and shield it from Big Tech scraping

Professor Te Taka Keegan and his team have developed a speech synthesizer for the Waikato-Maniapoto dialect of Māori. The model was trained on 7 hours 45 minutes of recordings...

AI-processed from IEEE Spectrum AI; edited by Hamidun News
Source: IEEE Spectrum AI. Collage: Hamidun News.

Māori communities in New Zealand have developed their own text-to-speech synthesizer that they fully control. This is a first step toward digital sovereignty, where language remains the property of the people who speak it.

Scraping without permission

ChatGPT, Claude, and Perplexity speak Māori excellently. They can do this because they were trained on data from Māori communities — texts and audio that were scraped without permission. Professor Te Taka Keegan from the University of Waikato sees this as the main problem: "These companies have the resources to create good models, but they scraped all the data without our participation, and we don't own the result. Our language is the main way we transmit our knowledge, and technology developed outside Aotearoa increasingly controls this transmission."

Why Māori is harder than English

Māori differs from English in ways that create problems for speech models. Several linguistic features make it particularly difficult to automate:

  • Vowel length changes the meaning of a word: keke — "cake", kēkē — "armpit", kekē — "creak"
  • Digraphs are not pronounced as in English: "wh" sounds like "f"
  • It is a low-resource language with few available texts and recordings in digital form

To address the shortage of data, Keegan brought in Ngaringi Katipa, a translator and teacher of the Māori language. They first recorded 4.5 hours of her reading, then expanded the dataset with the help of linguist Peter Keegan (Te Taka's brother) to a final 7 hours 45 minutes.

Phonemes instead of letters

Keegan and his graduate student Kingsley Eng chose a phonemic approach — the model is trained not on letters, but on sounds. This gave the model a "head start in learning": it immediately understands how groups of letters sound. They tested three open-source architectures (Matcha-TTS, Tacotron2, Piper) and chose Piper because it works offline on a local computer.
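The idea of training on sounds rather than letters can be illustrated with a minimal grapheme-to-phoneme sketch. The symbol choices below (for example "a:" for a long vowel and "ŋ" for "ng") and the function name `to_phonemes` are illustrative assumptions, not the inventory or code the team actually used.

```python
import unicodedata

# Illustrative mapping only (an assumption): the article does not list
# the phoneme symbols the team used.
DIGRAPHS = {"wh": "f", "ng": "ŋ"}
LONG_VOWELS = {"ā": "a:", "ē": "e:", "ī": "i:", "ō": "o:", "ū": "u:"}


def to_phonemes(word: str) -> list[str]:
    """Convert a Māori word into a list of phoneme symbols."""
    # NFC merges "a" + combining macron into the single character "ā",
    # so both spellings of a long vowel are treated the same way.
    word = unicodedata.normalize("NFC", word.lower())
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            # A digraph is two letters but one sound.
            phonemes.append(DIGRAPHS[pair])
            i += 2
        else:
            # Long vowels get their own symbol; other letters pass through.
            phonemes.append(LONG_VOWELS.get(word[i], word[i]))
            i += 1
    return phonemes


if __name__ == "__main__":
    for w in ["keke", "kēkē", "kekē", "whānau"]:
        print(w, to_phonemes(w))
```

With this mapping, keke, kēkē, and kekē produce three different phoneme sequences, so the model sees the vowel-length contrast explicitly instead of having to infer it from spelling.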

The results exceeded expectations. With less than 8 hours of recordings, the model achieved an error rate of 6.78% — considered a "good" result in the industry, where hundreds of hours are usually required.
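The article does not say which error rate is meant. If it is a word error rate, which is one common way to score synthesized speech after it has been transcribed, the calculation could look roughly like the sketch below. That reading is an assumption, and `word_error_rate` is a hypothetical helper written for this illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the transcript.
    print(word_error_rate("kia ora e hoa", "kia ora hoa"))
```

Under this definition, a score of 6.78% would correspond to roughly 7 word-level errors per 100 reference words.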

When 68 native speakers of Māori listened to synthetic and human voices and tried to tell them apart, they identified them correctly only 65% of the time (guessing between two options would give about 50%). Keegan explains: "We are pleased because some were relatives of the original voice, know it well, but even they were wrong."

From university to community

Rather than releasing the model openly, Keegan is negotiating with three iwi (Māori tribes) to whom Katipa is related: Waikato, Maniapoto, and Raukawa. "Stewardship of this should be with them, not the university," the professor says. He sees this as an embodiment of the principle Māori call "kaitiakitanga": protecting knowledge for future generations.

This is part of a broader trend. The Māori organization Te Hiku Media developed a speech recognition system with 92% accuracy for the Māori language and 82% for bilingual speech, and released it under the Kaitiakitanga license, which prohibits use of the data without benefit to the Māori people.

What this means

Keegan plans not one "Māori LLM," but separate models for each dialect: Maniapoto LLM, Tūhoe LLM, and so on, each owned by its own people and trained on their voices. This creates a template for other small languages in the world: synthesize, own, protect. Not to be the object of scraping, but to be the master of your own technology.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
