How Correctly Do Speech Synthesizers Speak?

Technology bugs sometimes lead to high-tech embarrassments. Text-to-Speech (TtS) synthesizers are widely used to let people with visual impairments or reading disabilities listen to written works on a computer. These days, TtS techniques are also used in entertainment products such as games.

A few days ago, Amazon’s recently launched Kindle made news when it could not pronounce the name of the US president correctly. With the Kindle being positioned as a replacement for the daily newspaper, and coverage of the US president an almost daily affair, it was indeed an awkward situation.

When the Kindle’s computerized text-to-speech system read “Barack Obama,” it sounded something like “Brack Alabama.”

Although Jeff Bezos, CEO of Amazon, laughed about it and called the faux pas “unfortunate,” Amazon nonetheless moved quickly to fix the problem with its text-to-speech partner, Nuance.

This brings us to the question: how correctly do speech synthesizers speak?

TtS systems have moved from robotic-sounding to human-sounding over the last 20 years. But how well are the words understood? Several challenges remain: pronouncing every word correctly, normalizing text (expanding numbers, dates, and abbreviations into spoken words), producing natural speech melody and rhythm, and handling grammar and semantics.

For example, the same spelling can be pronounced in two different ways depending on how the word is used:

  • Lives: “it lives” vs. “nine lives”

  • Bow: “bow down” vs. “bow and arrow”
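To make the problem concrete, here is a minimal sketch of how a synthesizer might choose between two pronunciations by looking at the neighboring words. Everything in it is an illustrative assumption: the `HETERONYMS` table, the cue words, and the `pick_pronunciation` function are hypothetical, and real systems rely on much richer grammatical analysis than a handful of cue words.

```python
# Hypothetical table: spelling -> usage -> (IPA pronunciation, neighboring cue words).
# The cue words are hand-picked for this illustration only.
HETERONYMS = {
    "lives": {
        "verb": ("lɪvz", {"it", "he", "she", "who"}),
        "noun": ("laɪvz", {"nine", "many", "their", "our"}),
    },
    "bow": {
        "verb": ("baʊ", {"down", "to", "before"}),
        "noun": ("boʊ", {"and", "arrow", "tie"}),
    },
}


def pick_pronunciation(tokens, i):
    """Guess the pronunciation of tokens[i] from the words directly around it.

    Returns None when the word is not in the table.
    """
    usages = HETERONYMS.get(tokens[i].lower())
    if usages is None:
        return None

    # Collect the previous and next word, if they exist.
    neighbors = set()
    if i > 0:
        neighbors.add(tokens[i - 1].lower())
    if i + 1 < len(tokens):
        neighbors.add(tokens[i + 1].lower())

    # Pick the usage whose cue words overlap most with the neighbors.
    # On a tie, the first usage listed in the table wins.
    ipa, _cues = max(usages.values(), key=lambda u: len(u[1] & neighbors))
    return ipa


print(pick_pronunciation("it lives".split(), 1))
print(pick_pronunciation("nine lives".split(), 1))
print(pick_pronunciation("bow down".split(), 0))
print(pick_pronunciation("bow and arrow".split(), 0))
```

A table like this breaks down quickly: “a bow” could be a weapon or a gesture, and each meaning is pronounced differently. That is exactly why this part of speech synthesis remains hard.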

Pronunciation is even harder for proper names, as the Kindle example shows.

Speech melody, rhythm, and pauses matter in any text. Human speakers break up a sentence to group words into meaningful chunks, to breathe, and to stress words according to the importance of the information they carry.

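For example, stressing a different word (shown in capitals) makes the same sentence answer a different question:

“She gave the money to LISA.” (Who did she give the money to?)

“She gave the MONEY to Lisa.” (What did she give Lisa?)

“SHE gave the money to Lisa.” (Who gave the money to Lisa?)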
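The three readings differ only in which word is stressed. One common way to tell a synthesizer where the stress belongs is SSML, the W3C Speech Synthesis Markup Language, which has an `<emphasis>` element. The sketch below builds such markup for one stressed word. The `emphasize` helper is a hypothetical name made up for this illustration, and whether a particular speech engine accepts SSML or honors the emphasis is an assumption that has to be checked against that engine.

```python
import re
from xml.sax.saxutils import escape


def emphasize(sentence, target):
    """Wrap every occurrence of the word `target` in an SSML emphasis tag."""
    # Splitting on a captured group keeps both the words and the
    # punctuation/spaces between them, so the sentence can be rebuilt exactly.
    parts = re.split(r"(\w+)", sentence)
    out = []
    for part in parts:
        if part.lower() == target.lower():
            out.append('<emphasis level="strong">' + escape(part) + "</emphasis>")
        else:
            # Escape &, <, > so the result stays well-formed XML.
            out.append(escape(part))
    return "<speak>" + "".join(out) + "</speak>"


sentence = "She gave the money to Lisa."
for word in ("Lisa", "money", "She"):
    print(emphasize(sentence, word))
```

The markup only says where the stress goes. Deciding which word deserves it still requires understanding what question the sentence is answering, and that part is the hard one.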

To match human delivery, a system must know both what to say and how to say it. We will have to wait and see how the technology improves over time and gives us better text-to-speech services.