Issue 110, September 1999
Talking Back: Adding Speech to Embedded Applications
by Rodger Richey
Training embedded apps to process speech may be as easy as
finding the right 8-bit micro. Don't let what Rodger
has to say about using an ADPCM algorithm and PWM
output to generate speech go in one ear and out
the other.
BIT RATE vs. QUALITY
When
choosing a speech processor, you must first determine
the desired quality of the speech reproduction. A speech-processing
system attempts to balance the quality of the reconstructed
speech with the bit rate of the encoding/decoding. In
most cases, speech quality degrades as the bit rate
drops.
The
search for a happy medium between bit rate and quality
has filled volumes. A high bit rate, high-quality speech
processor implies a sophisticated algorithm that is
computationally intensive with long encoding/decoding
delays (i.e., requires the use of a DSP or special audio
processor device).
This
would also imply that an 8-bit microcontroller is not
a solution for all applications but can provide reasonably
good quality at medium-to-low bit rates. These tradeoffs
between bit rate, quality, and the complexity of the
system can be summarized by the following questions:
What level of speech degradation can be tolerated?
What is the highest bit rate a system can tolerate
(in terms of bandwidth)?
What are the limitations on operating frequency, printed
circuit board area, and power consumption?
How much can you afford to spend on the speech subsystem?
Unfortunately, one answer can't satisfy all these questions.
However, cost seems to drive most decisions.
Cost is the main factor behind bit rate. Lower bit rates
are desirable because they lower operating bandwidth
as well as memory storage requirements. A lower bit rate
also means less memory is needed to store a fixed amount
of speech, which lowers cost. Figure 1 shows a graph of
speech quality versus bit rate.
Figure 1: A designer must make tradeoffs
between bit rate and quality of reconstructed
speech. After defining these two parameters, the
selection of a speech coding algorithm can be
made.
A typical system might sample speech with a 12-bit ADC
at a rate of 8 kHz, which is more than sufficient to
preserve signal quality. At this rate (i.e., 96 kbps),
1 min. of speech requires 720 KB of storage.
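The 96-kbps and 720-KB figures follow directly from the sample width and sample rate. The short sketch below reproduces the arithmetic; the function name is hypothetical, and it assumes samples are packed tightly at 12 bits each and that 1 KB means 1000 bytes.

```python
def raw_pcm_requirements(bits_per_sample, sample_rate_hz, seconds):
    """Return (bit rate in bits/s, storage in bytes) for uncompressed samples.

    Assumes samples are packed with no padding bits between them.
    """
    bit_rate = bits_per_sample * sample_rate_hz
    storage_bytes = bit_rate * seconds // 8
    return bit_rate, storage_bytes


bit_rate, storage = raw_pcm_requirements(12, 8000, 60)
print(bit_rate / 1000, "kbps")  # 96.0 kbps
print(storage / 1000, "KB")     # 720.0 KB
```

If each 12-bit sample were instead stored in a 16-bit word, which is often the simpler layout on a small micro, the same minute of speech would take 960 KB.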
To
transmit the information over a communications channel
requires something higher than 96 kbps to permit supplemental
information (e.g., start-of-frame indicators, channel
number). These requirements are beyond the scope of
most applications and can be reduced by using speech
coding.
Speech-coding
techniques for reducing the bit rate fall into two categories.
The first method is called waveform coding.
There
is a higher probability of a speech signal taking a
small value rather than a large value. So, a speech
processor can reduce the bit rate by quantizing the
smaller samples with finer step sizes and the large
samples with coarse step sizes.
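One common way to get finer steps for small samples is logarithmic companding, the idea behind log PCM. The sketch below uses the continuous mu-law curve with mu = 255 and then quantizes uniformly in the compressed domain. It is an illustration only: the function names are hypothetical, samples are assumed to be scaled to the range -1 to 1, and real G.711 coders use a piecewise-linear approximation of this curve rather than the formula itself.

```python
import math

MU = 255.0  # companding constant used by mu-law log PCM


def compress(x):
    """Map x in [-1, 1] onto [-1, 1] along a logarithmic curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def quantize(x, bits=8):
    """Uniformly quantize the compressed value to a signed integer code."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits
    return round(compress(x) * levels)


def dequantize(code, bits=8):
    """Reconstruct a sample value from its code."""
    levels = 2 ** (bits - 1) - 1
    return expand(code / levels)


def step_near(x, bits=8):
    """Distance between the reconstruction levels on either side of x."""
    code = quantize(x, bits)
    return abs(dequantize(code + 1, bits) - dequantize(code, bits))


# The effective step size grows with the sample magnitude.
print(step_near(0.01), step_near(0.9))
```

Because the codes are evenly spaced only after compression, small samples see a much finer step than large ones, which is the behavior described above.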
The bit rate can be reduced further by using an inherent
characteristic of speech: there is a high correlation
between consecutive speech samples. Rather than encode
the speech signal itself, the difference between consecutive
samples can be encoded. This relatively simple method
is repeated on each sample with little overhead from
one sample to the next. An example of a waveform-coding
algorithm is ADPCM.
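The difference-coding idea can be sketched with plain DPCM using a fixed step size. This is not ADPCM itself, which also adapts the step size from sample to sample; the function names, the 4-bit code width, and the step of 16 are assumptions chosen for illustration.

```python
def dpcm_encode(samples, step=16, bits=4):
    """Encode each sample as a small signed code for its difference
    from the previous reconstructed value."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for 4 bits
    predicted = 0
    codes = []
    for s in samples:
        code = round((s - predicted) / step)
        code = max(lo, min(hi, code))  # clip to the available code range
        codes.append(code)
        # Track what the decoder will reconstruct so errors do not accumulate.
        predicted += code * step
    return codes


def dpcm_decode(codes, step=16):
    """Rebuild the samples by accumulating the decoded differences."""
    predicted = 0
    out = []
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out


samples = [0, 20, 45, 60, 70, 64, 40, 10]
codes = dpcm_encode(samples)
print(codes)
print(dpcm_decode(codes))
```

Each sample costs 4 bits here instead of 12. As long as the differences stay within the code range, the reconstruction error is at most half a step; a jump larger than the range is clipped and takes several samples to catch up, which is the problem an adaptive step size addresses.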
The
other way to reduce bit rate is to analyze the speech
signal according to a model of the vocal tract. The
speech remains relatively constant over short intervals
and a set of parameters (e.g., pitch and amplitude)
can define that interval of speech. These parameters
are then stored or transferred over the communication
channel.
This technique requires significant processing on the incoming
signal as well as memory to store and analyze the speech
interval. Examples of this type of processor (called
a vocoder or hybrid coder) are linear predictive coding
(LPC) and code-excited linear predictive coding (CELP).
Quality
is difficult to define or even measure. The goal of
a measurement is to completely describe the quality
of a speech processor in a single number. This measurement
should be reliable across all measurement platforms
as well as speech algorithms.
Unfortunately, no single measurement does this. Measurements
fall into two groups: subjective and objective. Subjective
tests measure how a listener perceives the speech. Objective
tests compare the original speech against the reconstructed
output and make measurements based on signal-to-noise
ratio (SNR).
The goal of a subjective test is to represent the personal
opinions of a listener about the reconstructed speech
in a single number. The listener evaluates speech segments
based on intelligibility or signal degradations
(e.g., nasal, muffled, hissing, or buzzing sounds).
Several subjective tests exist, such as the diagnostic rhyme
test (DRT), mean opinion score (MOS), and diagnostic
acceptability measure (DAM). Table 1 shows the MOS and
bit rate for some common speech processors.
| Coder name | Algorithm type | Bit rate (kbps) | MOS |
| G.711 | log PCM | 64 | 4.3 |
| G.721 | ADPCM | 32 | 4.1 |
| G.723 | CELP | 5.6 & 6.4 | 3.9 |
| G.726 | ADPCM | 16, 24, 32, 40 | , 3.7, 3.9, 3.9 |
| G.727 | ADPCM | 16, 24, 32, 40 | , 3.7, 3.9, 3.9 |
| G.728 | Low delay CELP | 16 | 4.0 |
| FS 1015 | LPC-10 | 2.4 | 2.3 |
| FS 1016 | CELP/MELP | 4.8/3.2 | 2.4/3.5 |
| GSM | RPE-LTP | 13 | 3.5 |
| | MBE | 4.8 | 3.7 |
Table 1: To help simplify the decision-making process, designers
should rely on speech coder test results such as MOS,
DAM, or SNR. Typically, the lower bit rate algorithms
are significantly more complex than the higher bit rate
ones.
As
I said, objective testing usually involves SNR measurements.
SNR is a measurement of how closely the reconstructed
speech follows the original signal. The speech signal
is broken up into smaller segments, and the SNR is measured.
All the SNR measurements are averaged together to get
an overall SNR measurement for the speech signal.
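The segment-and-average procedure can be sketched as follows. The function name and the 160-sample segment length (20 ms at an 8-kHz sample rate) are assumptions, as is the choice to skip segments where the SNR is undefined; a real test setup may handle those cases differently.

```python
import math


def segmental_snr_db(original, reconstructed, segment_len=160):
    """Average the per-segment SNR (in dB) over all complete segments."""
    snrs = []
    for start in range(0, len(original) - segment_len + 1, segment_len):
        ref = original[start:start + segment_len]
        rec = reconstructed[start:start + segment_len]
        signal = sum(x * x for x in ref)
        noise = sum((x - y) ** 2 for x, y in zip(ref, rec))
        if signal == 0 or noise == 0:
            continue  # SNR is undefined for silent or perfectly coded segments
        snrs.append(10.0 * math.log10(signal / noise))
    if not snrs:
        raise ValueError("no segment with a defined SNR")
    return sum(snrs) / len(snrs)


# A 200-Hz tone sampled at 8 kHz, reconstructed with a 10% amplitude error.
tone = [100.0 * math.sin(2 * math.pi * i / 40) for i in range(320)]
coded = [0.9 * x for x in tone]
print(segmental_snr_db(tone, coded))  # about 20 dB
```

Averaging per-segment values in decibels keeps quiet passages from being swamped by loud ones, which a single SNR over the whole recording would not do.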
This measurement is sensitive to variations in gain
and delay, and it cannot account for the properties of the
human ear. The input to the speech processor is usually
a sine wave or narrow-band noise waveform so that the
test is repeatable for all systems.
Because
determining the quality of the speech processor is not
as easy as picking the best number, both kinds of tests
should be used to identify the best processor for your
application. The best method may be to sit and listen
to the outputs of the speech processor and simply select
the one that you like the best. After all, quality is
not a measured parameter but rather a listener-perceived
parameter.