How Amazon taught Alexa to speak Irish

Like Henry Higgins, the vocalist from George Bernard Shaw’s play “Pygmalion,” Marius Kotescu and Georgy Tenchev recently showed how their student was trying to overcome his articulation difficulties.

The two data scientists, who both work for Amazon in Europe, were teaching Alexa, the company’s digital assistant. Their mission: to help Alexa master English with an Irish accent with the help of artificial intelligence and recordings from native speakers.

During the demonstration, Alexa talked about a memorable night. “The party last night was so crazy,” Alexa said at length, using the Irish word for fun. “We got ice cream on the way home, and we were glad to get out.”

Mr. Tenchev shook his head. Alexa dropped the “r” in the word “Party,” making the word sound flat, like pah-tee. He concluded that he was very British.

The technologists are part of a team at Amazon that works in a challenging area of data science known as audio decoding. It’s a challenging problem that has taken on new significance amid a wave of AI developments, as researchers believe the puzzle of speech and technology can help make AI-powered devices, bots, and speech synthesizers more conversational—that is, able to appeal to many regional players. accents.

Dealing with phonemic detangling involves more than just grasping vocabulary and grammar. The speaker’s pitch, timbre, and accent often give exact meaning to words and emotional weight. Linguists call this language feature “display,” and it’s something machines have had a hard time mastering.

Only in recent years, thanks to advances in artificial intelligence, computer chips and other devices, have researchers made strides in solving the problem of audio decoding, turning computer-generated speech into something more pleasing to the ear.

Such work may eventually converge with an explosion of “generative AI,” the researchers said, which is technology that enables chatbots to generate their own responses. Chatbots like ChatGPT and Bard may one day operate entirely on users’ voice commands and respond verbally. At the same time, voice assistants like Alexa and Apple’s Siri will become more conversational, which could revive consumer interest in a technology sector that appears to have stalled, analysts said.

Getting voice assistants like Alexa, Siri, and Google Assistant to speak multiple languages has been an expensive and time-consuming process. Technology companies have hired voice actors to record hundreds of hours of speech, which has helped create artificial voices for digital assistants. Advanced artificial intelligence systems known as “text-to-speech models”—because they convert text into natural-sounding synthetic speech— I’m just starting to simplify this process.

The technology is “now able to create a human voice and a synthetic voice based on text input in different languages, dialects and dialects,” said Marion Laborie, chief strategist at Deutsche Bank Research.

Amazon has been under pressure to catch up with competitors like Microsoft and Google in the artificial intelligence race. In April, Andy Jassy, CEO of Amazon, said, for Wall Street analysts that the company planned to make Alexa “more active and talking” with the help of cutting-edge generative AI Rohit Prasad, Amazon’s chief scientist for Alexa, said. he told CNBC In May he saw the voice assistant as a voice-enabled “instantly available personal AI”.

Irish Alexa made its commercial debut in November, after nine months of training to understand and then speak an Irish accent.

“Accent is different from language,” Mr. Prasad said in an interview. AI techniques must learn to extract accent from other parts of speech, such as intonation and frequency, before they can replicate the characteristics of local dialects—for example, maybe the “a” is flatter and the “t’s” are pronounced more forcefully.

These systems have to detect these patterns, he said, “so they can create an entirely new accent.” “this is difficult.”

Harder still is trying to get the technology to learn a new accent pretty much on its own, from a different-sounding speech form. That’s what Mr. Cotescu’s team attempted to build the Irish Alexa. They relied heavily on the existing speech model of mainly English British accents—with a much smaller selection of American, Canadian, and Australian accents—to train them to speak Irish English.

The team faced various language challenges of the English-Irish language. The Irish tend to drop the “h” in the “th,” pronouncing the letters as “t” or “d,” for example, making “bath” sound like “bat” or even “bad.” Irish English is also rhotic, which means the letter “r” is pronounced overly. This means that the “r” in “party” will be more pronounced than what you might hear from a Londoner’s mouth. Alexa had to learn and master these features of speech.

Irish English is “difficult,” said Mr Kotescu, who is Romanian and was the principal investigator for Alexa’s Irish team.

Speech models that support Alexa’s verbal skills have evolved more advanced in recent years. In 2020, Amazon researchers taught Alexa He speaks Spanish fluently From an English speaking model.

Mr. Cotescu and the team saw dialects as the next frontier for Alexa’s speech capabilities. They designed Irish Alexa to rely more on AI than on actors to build her speech model. As a result, the Irish Alexa was trained on a relatively small group — about 24 hours of recordings by voice actors who recited 2,000 speeches in Irish English.

At first, when Amazon researchers presented the Irish recordings to the still-learning Irish Alexa, some strange things happened.

Sometimes, letters and syllables leaked out of the response. Sometimes the “S” are stuck together. One or two words, sometimes decisive, were inexplicably mumbled and unintelligible. In at least one instance, Alexa’s female voice dropped a few octaves, sounding more masculine. Worse, the masculine voice sounded distinctly British, the kind of goof that might raise eyebrows in some Irish homes.

“They’re big black boxes,” Tenchev, a Bulgarian and Amazon’s chief scientist on the project, said of the speech models. “You must have a lot of experiences to tune in to.”

This is what the techs did to correct Alexa’s “partisan” slip-up. They untangled speech, word by word, sound (the smallest audible piece of a word) by voice to pinpoint and fine-tune where Alexa slips. They then fed Alexa’s Irish speech model more recorded audio data to correct the verbal error.

Result: “r” is returned in “party”. But then the “p” disappeared.

So data scientists did the same process again. They finally focus on the sound containing the missing “p”. Then they fine-tuned the model further so that the “p” sound returned and the “r” did not go away. Alexa finally learned to speak like a Dublin.

Since then, two Irish linguists — Eileen Vaughan, who teaches at the University of Limerick, and Kate Tallon, a doctoral student working in the Phonetics and Speech Lab at Trinity College Dublin — have given Alexa high marks on the Irish accent. They said the way Irish Alexa stressed the “r’s” and softened the “t” stopped, and Amazon got the accent just right.

“It seems real to me,” said Ms. Tallon.

Amazon researchers said they were pleased with the largely positive feedback. Their speech models untangled the Irish accent so quickly, giving them hope that the accents could be replicated elsewhere.

And they wrote in the language of A January research paper About the Irish Alexa Project.