February 1998, Issue 91
Low-Cost Voice Recognition
THEORY OF OPERATION
The 68HC05 processor is very simple. It has no ADC, so you need another way to convert the time-domain speech signal into a form the microcontroller can read.
The small amount of memory also forces many approximations and simplifications when reducing the speech to a small set of features.
To work within these limitations, I use a simplified formant tracker. The microphone input is high-pass filtered and then infinitely clipped using two operational amplifiers. The resulting square wave is connected to an MCU input.
By sorting and tallying the long and short pulse widths of the square wave, you get a crude but effective two-channel frequency analyzer. One channel covers frequencies below 1500 Hz, and the other covers 1500 Hz to 5 kHz.
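The sorting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the 68HC05 firmware: the function name, the microsecond units, and the assumption that a "pulse width" is the time between two successive edges (half a period, so 1500 Hz corresponds to about 333 µs and 5 kHz to 100 µs) are all mine.

```python
def tally_pulse_widths(widths_us, split_hz=1500.0, max_hz=5000.0):
    """Sort pulse widths (microseconds between edges) into two channels.

    Assumption: each width is half a period of the clipped signal, so a
    frequency f corresponds to a width of 1e6 / (2 * f) microseconds.
    Returns (low_count, high_count).
    """
    split_us = 1e6 / (2.0 * split_hz)  # about 333 us: boundary between channels
    min_us = 1e6 / (2.0 * max_hz)      # 100 us: anything shorter is above 5 kHz

    low = 0   # long pulses: below split_hz (roughly the F1 region)
    high = 0  # short pulses: split_hz up to max_hz (roughly the F2 region)
    for width in widths_us:
        if width < min_us:
            continue  # too short: outside the upper channel, ignored
        if width > split_us:
            low += 1
        else:
            high += 1
    return low, high


if __name__ == "__main__":
    # Hypothetical pulse widths, in microseconds, for one analysis frame.
    print(tally_pulse_widths([1000, 500, 200, 150, 50]))
```

On a real MCU the widths would more likely come from a hardware timer than from a list, and the two thresholds would be precomputed timer counts.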
These two frequency areas roughly correspond to F1 and F2, the two formant regions of speech. It’s a well-known principle that, for a given speaker and a given set of vowels, F1 and F2 remain the same.
Recognition based on F1 and F2 was first tried in 1952 by Bell Labs, using vacuum tubes and capacitors for memory. Crude as it sounds, that system achieved 97% recognition accuracy!
The input signal is high-pass filtered (i.e., pre-emphasized) to accentuate the F2 frequencies. Figure 1 illustrates why this is necessary.
Figure 1: (a) A waveform of the voiced sound "ee" as in "speech." The arrow points to high-frequency wiggles corresponding to the second formant (F2). Note that these wiggles do not cross the zero axis. (b) After pre-emphasis (high-pass filtering), the F2 components of the same waveform now cross the zero axis. (c) After being infinitely clipped, the waveform of Figure 1b is a square wave showing both F1 and F2 components. This signal is applied to the microprocessor via a digital input pin.
Figure 1a is a sample of the voiced vowel sound "ee" as in "speech." Note the F2 component shown by the arrow. Also note that these high-frequency wiggles do not cross the zero axis. Thus, if this waveform were infinitely amplified and clipped, the resulting square wave would not reveal the F2 component.
However, Figure 1b shows what happens after pre-emphasis. The F2 wiggles cross the zero axis, and the resultant infinitely clipped square wave now contains both F1 and F2 (see Figure 1c).
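The effect in Figure 1 can be reproduced numerically with a toy signal. The sketch below is a software stand-in for the analog circuit, under several assumptions of mine: the "vowel" is just a strong 300 Hz sine (F1) plus a weak 2400 Hz ripple (F2), the pre-emphasis is a digital first difference rather than the op-amp high-pass filter, and the sample rate and function names are chosen for illustration only.

```python
import math


def synth_vowel(n_samples, fs=48000.0, f1=300.0, f2=2400.0, f2_amp=0.1):
    """Toy 'vowel': a strong F1 sine plus a weak F2 ripple (not real speech)."""
    samples = []
    for n in range(n_samples):
        # Half-sample offset keeps samples from landing exactly on zero.
        t = (n + 0.5) / fs
        samples.append(math.sin(2 * math.pi * f1 * t)
                       + f2_amp * math.sin(2 * math.pi * f2 * t))
    return samples


def pre_emphasize(x, alpha=1.0):
    """Digital stand-in for the high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    y = [x[0]]
    for prev, cur in zip(x, x[1:]):
        y.append(cur - alpha * prev)
    return y


def infinite_clip(x):
    """Keep only the sign of each sample, as a 1/0 square wave."""
    return [1 if v >= 0 else 0 for v in x]


def count_transitions(bits):
    """Number of edges in the square wave (zero crossings of the input)."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


if __name__ == "__main__":
    raw = synth_vowel(4800)  # 0.1 s at 48 kHz
    before = count_transitions(infinite_clip(raw))
    after = count_transitions(infinite_clip(pre_emphasize(raw)))
    print("edges without pre-emphasis:", before)
    print("edges with pre-emphasis:   ", after)
```

Without pre-emphasis, the clipped signal has edges only at the F1 zero crossings, so the ripple is invisible to the MCU. With pre-emphasis, the edge count rises because the F2 ripple now crosses zero, which is the behavior the figure describes.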