Issue 133 August 2001
Listening Chips


by Tom Cantrell


IN THE REALM OF THE SENSORY

Although it hasn’t reached household name status, in the relatively new field of voice recognition, Sensory can be considered one of the pioneers. They’ve been around for years, slowly but surely percolating their technology into emerging applications one by one.

I’ve kept in touch with Sensory and monitored their progress, but held back on writing an article. The fact is, with ASIC- and ROM-based custom silicon underpinning a focus-accounts marketing strategy, what they had to offer was only suitable for a few big outfits like Sony, VTech, and Uniden. But now, after successfully establishing their place, Sensory is moving to expand the market with low-cost standard chips suitable for a broad range of applications from customers big and small.

Enter the Voice Extreme Toolkit (see Photo 1) which, at only $129, is not only ideal for prototyping and demos, but is also suitable for moderate volume applications.

Photo 1—When it comes to voice recognition, the Voice Extreme Toolkit represents a new high in ease of use and, at only $129, a new low in price.

The kit is wrapped around a special version of Sensory’s RSC-364 speech-recognition chip. The ROM on the chip is factory-programmed with a C-like language interpreter and memory manager designed to work with a commodity external flash memory chip. Note that a ROM-less version, RSC-360, is available (see Table 1).

Table 1—The RSC-364, with 64-KB on-chip ROM, is a single-chip voice recognition solution. Taking advantage of the features that require lots of storage, such as voice recording, requires adding external memory.

Table notes: Accuracy above 95% must be maintained. The RSC-364 assumes the use of on-chip ROM/RAM only and external serial EEPROM memory. It depends on the choice of musical instrument and requires external storage for recordings.


The external flash memory is used to store an application’s particular vocabulary, specifically the templates and weights that lie at the heart of Sensory’s recognition technology. There are two sources for the vocabulary, and the choice is determined by the specifics of the application.

For speaker-independent applications, Sensory can draw from a library of common words in the major languages or provide a service to generate a custom (i.e., atypical language) vocabulary. By contrast, speaker-dependent apps rely on training (i.e., writing flash memory) by the end user in the field.

An interesting tweak of speaker-dependent recognition is known as speaker verification. The latter is kind of the inverse of the former. Instead of recognizing a word from a predefined vocabulary spoken by a known person, verification recognizes which speaker from a predefined group is saying a known word.
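The inverse relationship can be made concrete with a toy nearest-template matcher. This is only an illustrative sketch: the two-element feature vectors, the `distance` function, the speaker names, and the `templates` dictionary are all invented for the example, and nothing here is meant to describe how Sensory's templates and weights actually work.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical trained templates: templates[speaker][word] -> feature vector.
templates = {
    "alice": {"open": [1.0, 0.0], "close": [0.0, 1.0]},
    "bob":   {"open": [3.0, 0.5], "close": [2.0, 3.0]},
}

def recognize_word(speaker, utterance):
    """Speaker-dependent recognition: the speaker is known, find the word."""
    words = templates[speaker]
    return min(words, key=lambda w: distance(words[w], utterance))

def verify_speaker(word, utterance):
    """Speaker verification: the word is known, find the enrolled speaker."""
    return min(templates, key=lambda s: distance(templates[s][word], utterance))
```

Both functions search the same table; recognition fixes the speaker and varies the word, while verification fixes the word and varies the speaker.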

A specific application might use a combination of recognition modes. For instance, a security system could recognize a particular user’s voice (speaker verification) and then, knowing his identity, determine his specific password (speaker-dependent) before accepting generic commands (speaker-independent).
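As a rough sketch, the three-stage security flow might be organized like this. The function name `security_session` and its three callable parameters are hypothetical stand-ins for whatever recognizer calls a real application would make; they are not part of any Sensory API.

```python
def security_session(identify_speaker, recognize_password,
                     recognize_command, passwords):
    """Chain the three recognition modes described in the text.

    Each callable is a hypothetical recognizer stub that returns a string,
    or None when nothing was recognized. `passwords` maps each enrolled
    user to that user's trained password.
    """
    # Stage 1: speaker verification decides who is talking.
    user = identify_speaker()
    if user is None or user not in passwords:
        return None

    # Stage 2: speaker-dependent recognition against that user's templates.
    spoken = recognize_password(user)
    if spoken != passwords[user]:
        return None

    # Stage 3: speaker-independent recognition of a generic command.
    return recognize_command()


# Usage with canned recognizer results:
granted = security_session(lambda: "tom", lambda user: "sesame",
                           lambda: "open door", {"tom": "sesame"})
denied = security_session(lambda: "tom", lambda user: "wrong word",
                          lambda: "open door", {"tom": "sesame"})
```

Each stage gates the next, so a command is only accepted after both the speaker and the password have checked out.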

Other Sensory variations on the recognition theme include word spotting and continuous listening. Word spotting finds trigger words in continuous speech, so "Please open the door" could be recognized as "open door." To reduce false triggering complications, use words with more syllables or include more than one word, like a brief phrase.

Because there is a slight delay between recognition of the first word in a multi-word trigger and listening for the following word, I recommend that you try establishing a scheme that uses trigger words that are naturally separated by other speech or otherwise won’t easily run together. Note that word spotting only works with speaker-dependent recognition.
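The matching logic behind a multi-word trigger can be sketched as an in-order search through the heard words. This example works on already-recognized word strings purely for illustration (the real chip works on audio), and the function name `spot_triggers` is an assumption, not a Sensory call.

```python
def spot_triggers(heard_words, triggers):
    """Return True if every trigger word occurs, in order, in heard_words.

    Other words may appear before, between, or after the triggers.
    """
    remaining = iter(heard_words)
    # `trigger in remaining` advances the iterator past the match, so each
    # later trigger is only searched for in the words that follow it.
    return all(trigger in remaining for trigger in triggers)


heard = "please open the door".split()
found = spot_triggers(heard, ["open", "door"])
```

Triggers that arrive in the wrong order, or with one of them missing, do not match.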

Continuous listening is similar, except that it waits for a specific isolated phrase (i.e., only "open door" would be recognized), with pauses delineating each word. Although not as powerful as word spotting, continuous listening does have the advantage of working with both speaker-dependent and speaker-independent recognition modes.
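A minimal sketch of the stricter matching rule, along with the mode restrictions stated in the text, might look like the following. As before, it operates on word strings rather than audio, and the names `SUPPORTED_MODES` and `continuous_listen` are made up for the example.

```python
# Which recognition modes each feature works with, per the text above.
SUPPORTED_MODES = {
    "word spotting": {"speaker-dependent"},
    "continuous listening": {"speaker-dependent", "speaker-independent"},
}

def continuous_listen(isolated_words, phrase):
    """Match only when the pause-delimited words are exactly the phrase."""
    return list(isolated_words) == list(phrase)


exact = continuous_listen(["open", "door"], ["open", "door"])
embedded = continuous_listen("please open the door".split(), ["open", "door"])
```

Here `exact` matches, while `embedded` does not because the phrase is buried in other speech.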


© Circuit Cellar, The Magazine for Computer Applications. Reprinted with permission. For subscription information call (860) 875-2199, email subscribe@circuitcellar.com or on our web site at www.circuitcellar.com.