World Telecommunication Day 1999

IHT October 11, 1999


Computers Sign Up for Voice and Diction 101


Talking computers have been science fiction staples at least since the HAL 9000 in ''2001: A Space Odyssey.'' In 1984, Apple Computer introduced a text-to-speech program, MacinTalk, for the Macintosh computer. It was innovative, but once the ''Wow!'' factor wore off, it was not good enough to be useful.

Computers convert text into speech using speech synthesis. The disembodied voices in rental cars and elevators are prerecorded digital samples, not synthesized. Phone directory services use sampled voice fragments pieced together on the fly.

These systems do not need speech synthesis because they have small, fixed vocabularies. General-purpose text-to-speech requires speech synthesis because the possible vocabulary in a text is so large.

High-quality speech synthesis requires processing power that has only recently become available at mass-market costs. The software first analyzes text, breaking it into groups of words, syllables and the basic elements of speech - phonemes - which it knows how to synthesize as complex sound waveforms.

Based on the grouping of syllables and words, punctuation and language rules, it adds pitch inflection, pauses and other nuances that give natural speech its melodic and rhythmic flow.

Both deconstructing text into phonemes and adding natural inflection pose significant computational problems, even though we handle both tasks without a thought when reading a text aloud.

All languages have irregularities and context-dependent rules. We pronounce words correctly by rules and memorized irregularities. Consider G.B. Shaw's joke that, in English, ''ghoti'' should be pronounced the same as ''fish'' (''gh'' as in enough, ''o'' as in women, and ''ti'' as in nation).
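
Text-to-speech software typically copes with such irregularities by consulting an exception dictionary before applying general letter-to-sound rules. The short Python sketch below illustrates the idea only; its phoneme symbols, exception entries and rules are invented for this example and are far cruder than anything in a real synthesizer.

# Illustrative sketch of the text-analysis step: convert a word to phonemes
# by checking a small exception dictionary first, then falling back on crude
# letter-to-sound rules. The phoneme symbols and entries are invented.

EXCEPTIONS = {
    "women": ["W", "IH", "M", "IH", "N"],   # irregular: "o" sounds like a short "i"
    "enough": ["IH", "N", "AH", "F"],       # irregular: "gh" sounds like "f"
}

LETTER_RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"],
    "a": ["AE"], "e": ["EH"], "i": ["IH"], "o": ["AA"], "u": ["AH"],
    "b": ["B"], "c": ["K"], "d": ["D"], "f": ["F"], "g": ["G"],
    "h": ["HH"], "j": ["JH"], "k": ["K"], "l": ["L"], "m": ["M"],
    "n": ["N"], "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"],
    "t": ["T"], "v": ["V"], "w": ["W"], "x": ["K", "S"],
    "y": ["Y"], "z": ["Z"],
}

def word_to_phonemes(word):
    """Look the word up in the exception list; fall back to letter rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phonemes, i = [], 0
    while i < len(word):
        # Try two-letter combinations ("sh", "ch", "th") before single letters.
        pair = word[i:i + 2]
        if pair in LETTER_RULES:
            phonemes += LETTER_RULES[pair]
            i += 2
        elif word[i] in LETTER_RULES:
            phonemes += LETTER_RULES[word[i]]
            i += 1
        else:
            i += 1  # skip characters with no rule (apostrophes, digits)
    return phonemes

print(word_to_phonemes("women"))   # exception: ['W', 'IH', 'M', 'IH', 'N']
print(word_to_phonemes("fish"))    # rules:     ['F', 'IH', 'SH']

Run on ''women,'' it returns the memorized pronunciation; run on ''fish,'' it falls back on the letter rules.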

Consider the difference between ''You're going to marry me?'' and ''You're going to marry me!'' Punctuation and context must be taken into account to convey the message behind the words.
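
One simple way to act on that punctuation is to tag each sentence with a pitch contour before synthesis. The Python sketch below shows the idea; the contour labels and the rule of thumb behind them are illustrative, not drawn from any particular system.

# Illustrative sketch of a prosody step: assign a pitch contour to each
# sentence based on its final punctuation mark. Real systems model pitch
# far more finely than this.
import re

def assign_contours(text):
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    annotated = []
    for sentence in sentences:
        mark = sentence.strip()[-1]
        if mark == "?":
            contour = "rising"        # yes/no questions usually end on a rise
        elif mark == "!":
            contour = "high falling"  # exclamations get a wide, falling sweep
        else:
            contour = "falling"       # plain statements settle downward
        annotated.append((sentence.strip(), contour))
    return annotated

for sentence, contour in assign_contours(
        "You're going to marry me? You're going to marry me!"):
    print(f"{contour:>13}: {sentence}")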

Much research has been done on speech synthesis in recent years, and the quality is improving. Modern systems allow enhanced programmatic control of pitch modulation (range), volume and prosody (rhythm and emphasis) and let users choose among voices.
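
The shape that control takes varies from product to product, but it generally resembles the hypothetical Python interface sketched below; the Synthesizer class and its parameters are invented for illustration and do not correspond to any actual product's programming interface.

# Hypothetical sketch of the kind of programmatic control described above.
# The class and its parameters are invented for illustration only.

class Synthesizer:
    def __init__(self, voice="default"):
        self.voice = voice          # which recorded or modeled voice to use
        self.pitch_range = 1.0      # 1.0 = normal; higher = more melodic
        self.volume = 1.0           # linear gain
        self.rate = 1.0             # speaking-rate multiplier

    def speak(self, text):
        # A real system would run text analysis, phoneme conversion and
        # waveform generation here; this stub just reports its settings.
        print(f"[{self.voice}] pitch={self.pitch_range} "
              f"volume={self.volume} rate={self.rate}: {text}")

engine = Synthesizer(voice="british-female")
engine.pitch_range = 1.3   # livelier intonation
engine.rate = 0.9          # slightly slower delivery
engine.speak("Good evening. This is your computer speaking.")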

Support for languages other than English is also growing.

As processing power continues to increase rapidly, expect to see speech synthesis move into the mainstream.

Charles Tobermann