circuitcellar.com

Issue 110, September 1999
Talking Back: Adding Speech to Embedded Applications


by Rodger Richey

Training embedded apps to process speech may be as easy as finding the right 8-bit micro. Don't let what Rodger has to say about using an ADPCM algorithm and PWM output to generate speech go in one ear and out the other.


BIT RATE vs. QUALITY

When choosing a speech processor, you must first determine the desired quality of the speech reproduction. A speech-processing system attempts to balance the quality of the reconstructed speech with the bit rate of the encoding/decoding. In most cases, speech quality degrades as the bit rate drops.

The search for a happy medium between bit rate and quality has filled volumes. A high bit rate, high-quality speech processor implies a sophisticated algorithm that is computationally intensive with long encoding/decoding delays (i.e., requires the use of a DSP or special audio processor device).

This would also imply that an 8-bit microcontroller is not a solution for all applications but can provide reasonably good quality at medium-to-low bit rates. These tradeoffs between bit rate, quality, and the complexity of the system can be summarized by the following questions:

 

• What level of speech degradation can be tolerated?

• What is the highest bit rate a system can tolerate (in terms of bandwidth)?

• What are the limitations on operating frequency, printed circuit board area, and power consumption?

• How much can you afford to spend on the speech subsystem?

Unfortunately, one answer can’t satisfy all these questions. However, cost seems to drive most decisions.

Cost is the main factor behind bit rate. Lower bit rates are desirable because they reduce the required channel bandwidth. They also mean that less memory is needed to store a fixed amount of speech, which lowers cost. Figure 1 shows a graph of speech quality versus bit rate.

Figure 1—A designer must make tradeoffs between bit rate and quality of reconstructed speech. After defining these two parameters, the selection of a speech coding algorithm can be made.

A typical system might sample speech with a 12-bit ADC at a rate of 8 kHz, which is more than sufficient to preserve signal quality. At this rate (i.e., 96 kbps), 1 min. of speech requires 720 KB of storage.
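
The arithmetic behind those figures can be checked with a short sketch. The function names are hypothetical, and I am assuming 1 KB means 1000 bytes, which is what the 720-KB figure implies. The 4-bit case is only an illustration of what a 32-kbps coder, like the ADPCM entries in Table 1, would save.

```python
def raw_bit_rate(bits_per_sample, sample_rate_hz):
    """Bit rate in bits per second for uncompressed samples."""
    return bits_per_sample * sample_rate_hz


def storage_bytes(bit_rate_bps, seconds):
    """Bytes needed to store `seconds` of speech at the given bit rate."""
    return bit_rate_bps * seconds // 8


# 12-bit ADC sampled at 8 kHz, as in the text.
pcm_rate = raw_bit_rate(12, 8000)
print(pcm_rate / 1000, "kbps,", storage_bytes(pcm_rate, 60) / 1000, "KB per minute")

# Assumption for comparison: a coder that emits 4 bits per sample.
coded_rate = raw_bit_rate(4, 8000)
print(coded_rate / 1000, "kbps,", storage_bytes(coded_rate, 60) / 1000, "KB per minute")
```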

To transmit the information over a communications channel requires something higher than 96 kbps to permit supplemental information (e.g., start-of-frame indicators, channel number). These requirements are beyond the scope of most applications and can be reduced by using speech coding.

Speech-coding techniques for reducing the bit rate fall into two categories. The first method is called waveform coding.

There is a higher probability of a speech signal taking a small value rather than a large value. So, a speech processor can reduce the bit rate by quantizing the smaller samples with finer step sizes and the large samples with coarse step sizes.
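
One way to get fine steps for small samples and coarse steps for large ones is to compress the signal logarithmically before a uniform quantizer. The sketch below uses the continuous mu-law curve with mu = 255. That choice is mine for illustration; the text does not name a companding law, and the function names are hypothetical. Samples are assumed to be normalized to the range -1 to 1.

```python
import math

MU = 255.0  # assumed companding constant, chosen for illustration


def compress(x):
    """Map x in [-1, 1] onto [-1, 1] with a logarithmic curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(x, bits=8):
    """Uniformly quantize the compressed value to a signed integer code."""
    levels = 2 ** (bits - 1) - 1
    return round(compress(x) * levels)


def dequantize(code, bits=8):
    """Recover an approximate sample from its code."""
    levels = 2 ** (bits - 1) - 1
    return expand(code / levels)


def step_size_near(x, bits=8):
    """Distance between the reconstruction levels adjacent to x."""
    code = quantize(x, bits)
    return abs(dequantize(code + 1, bits) - dequantize(code, bits))


# The effective step is much finer near zero than near full scale.
print(step_size_near(0.01), step_size_near(0.9))
```

With this arrangement, 8 bits cover a range that would otherwise need more bits from a uniform quantizer at the same small-signal resolution.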

The bit rate can be reduced further by using an inherent characteristic of speech—there is a high correlation between consecutive speech samples. Rather than encode the speech signal itself, the difference between consecutive samples can be encoded. This relatively simple method is repeated on each sample with little overhead from one sample to the next. An example of a waveform algorithm is ADPCM.
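
The idea of encoding differences can be sketched as follows. This is plain differential coding with a fixed step size, not ADPCM itself, which also adapts the step size from sample to sample. The step size, code width, and function names are assumptions made for the example.

```python
def dpcm_encode(samples, step=16, bits=4):
    """Encode each sample as a small signed code for its change from the prediction."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    predicted = 0
    codes = []
    for s in samples:
        diff = s - predicted
        # Quantize the difference and clamp it to the available code range.
        code = max(lo, min(hi, round(diff / step)))
        codes.append(code)
        # Track what the decoder will reconstruct so errors do not accumulate.
        predicted += code * step
    return codes


def dpcm_decode(codes, step=16):
    """Rebuild the waveform by accumulating the decoded differences."""
    value = 0
    out = []
    for c in codes:
        value += c * step
        out.append(value)
    return out


samples = [0, 10, 40, 90, 120, 100, 60, 20]
codes = dpcm_encode(samples)
print(codes)
print(dpcm_decode(codes))
```

Because the encoder predicts from the reconstructed value rather than the original sample, the error stays within half a step as long as the difference fits the code range. When the signal changes faster than the largest code allows, the output lags behind, which is the problem an adaptive step size addresses.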

The other way to reduce bit rate is to analyze the speech signal according to a model of the vocal tract. The speech remains relatively constant over short intervals and a set of parameters (e.g., pitch and amplitude) can define that interval of speech. These parameters are then stored or transferred over the communication channel.

This technique requires significant processing on the incoming signal as well as memory to store and analyze the speech interval. Examples of this type of processor (called a vocoder or hybrid coder) are linear predictive coding (LPC) and code-excited linear predictive coding (CELP).

Quality is difficult to define or even measure. The goal of a measurement is to completely describe the quality of a speech processor in a single number. This measurement should be reliable across all measurement platforms as well as speech algorithms.

Unfortunately, no such single number exists. Instead, measurements fall into two groups: subjective and objective. Subjective tests measure how a listener perceives the speech. Objective tests compare the original speech against the reconstructed output and make measurements based on signal-to-noise ratio (SNR).

The goal of a subjective test is to represent the personal opinions of a listener about the reconstructed speech in a single number. The listener evaluates speech segments based on intelligibility or signal degradations (e.g., nasal, muffled, hissing, buzzing, and so forth). Several subjective tests exist, such as the diagnostic rhyme test (DRT), mean opinion score (MOS), and diagnostic acceptability measure (DAM). Table 1 shows the MOS and bit rate for some common speech processors.

Coder name  Algorithm type  Bit rate (kbps)  MOS
G.711       log PCM         64               4.3
G.721       ADPCM           32               4.1
G.723       CELP            5.6 & 6.4        3.9
G.726       ADPCM           16, 24, 32, 40   –, 3.7, 3.9, 3.9
G.727       ADPCM           16, 24, 32, 40   –, 3.7, 3.9, 3.9
G.728       Low delay CELP  16               4.0
FS 1015     LPC-10          2.4              2.3
FS 1016     CELP/MELP       4.8/3.2          2.4/3.5
GSM         RPE-LTP         13               3.5
MBE                         4.8              3.7

Table 1—To simplify the decision-making process, designers should rely on speech coder test results such as MOS, DAM, or SNR. Typically, the lower bit rate algorithms are significantly more complex than the higher bit rate ones.

As I said, objective testing usually involves SNR measurements. SNR is a measurement of how closely the reconstructed speech follows the original signal. The speech signal is broken up into smaller segments, and the SNR is measured. All the SNR measurements are averaged together to get an overall SNR measurement for the speech signal.
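
The segment-and-average procedure can be sketched like this. The 128-sample segment length (16 ms at 8 kHz) and the decision to skip segments where the ratio is undefined are my assumptions, as are the function names.

```python
import math


def snr_db(original, reconstructed):
    """SNR in dB for one segment, or None if it is undefined."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if signal == 0 or noise == 0:
        return None  # silent segment or perfect reconstruction
    return 10.0 * math.log10(signal / noise)


def segmental_snr_db(original, reconstructed, segment_len=128):
    """Average the per-segment SNR values over the whole signal."""
    values = []
    for start in range(0, len(original), segment_len):
        end = start + segment_len
        value = snr_db(original[start:end], reconstructed[start:end])
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None


# Two segments: one reconstructed with 10% error, one with 1% error.
original = [100] * 256
reconstructed = [90] * 128 + [99] * 128
print(segmental_snr_db(original, reconstructed))
```

Averaging per segment keeps a few loud passages from dominating the result, so quiet segments with poor reconstruction still pull the figure down.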

This measurement is sensitive to variations in gain and delay, and it cannot account for the properties of the human ear. The input to the speech processor is usually a sine wave or narrow-band noise waveform to maintain a repeatable test for all systems.

Because determining the quality of the speech processor is not as easy as picking the best number, both kinds of tests should be used to identify the best processor for your application. The best method may be to sit and listen to the outputs of the speech processor and simply select the one that you like the best. After all, quality is not a measured parameter but rather a listener-perceived parameter.