HUBBA-HUBBA
At
this point, you’re probably thinking: “Looks
pretty simple. What’s the big deal?”
The
big deal is found in Photo 2 (p. 80). The Propeller
packs eight cogs' worth of multicore machismo
underneath its otherwise mild-mannered MCU exterior.
Time to cue The Twilight Zone music because
now it starts to get a little spooky.
Photo
2—Like the Cobra racecar of yore, Propeller
crams a high-output eight-cylinder engine
(i.e., eight cogs) in a small chassis to
blow the doors off conventional MCUs.
In
techspeak, Propeller is a symmetric multiprocessor,
or SMP (i.e., the cogs are all the same), using
“shared memory” as the communication medium.
Said shared memory, comprising 32 KB of RAM
and 32 KB of ROM, is found in the hub.
The
mechanism by which the memory is actually shared
is invariably one of the messier aspects of
multiprocessor design. A traditional approach
has individual processors contending for access
to the memory whenever they want it, with an
arbitration mechanism imposed to resolve conflicts.
Now,
I suppose arbitration is better than a jury
trial (“Ladies and gentlemen, my client was
unfairly denied access.”), but it’s still messy.
First, the arbitration logic resides in the
critical path between processor(s) and memory,
thus slowing everything down. Second, it introduces
timing uncertainty for all processors depending
on the arbitration outcome (i.e., whether a
processor is immediately granted access or it
has to wait for another processor to finish).
Finally, although not a requirement, arbitration
schemes often lead down the primrose path of
architectural embellishments such as priority
(some processors have more access rights than
others), which themselves lead to potential
problems (e.g., priority inversion), calling
for more hack workarounds (e.g., dynamically
programmable priority). It’s a death spiral
of complexity, delay, and uncertainty.
By
contrast, the Propeller sharing scheme is brutally
simple. Like the distributor in an old V8, the
hub simply goes round and round granting access
to each of the cogs in turn (see Figure 2).
The obvious downside is that cogs get access
even if they don’t need it, blocking others
that possibly do. However, the distributor approach
minimizes the jitter (i.e., lack of determinism)
that plagues traditional arbitration schemes.
Figure
2—Multicores are fine when each core is
doing its own thing. The challenge arises
when they contend for access to shared resources.
Propeller uses a round-robin hub that grants
each core deterministic access to shared
RAM and ROM.
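To make the trade-off concrete, here is a minimal Python sketch of one trip around the distributor. It is a toy model, not the chip's actual logic: the function names are made up for illustration, and a "slot" is treated as an abstract turn with no particular length in clocks.

```python
NUM_COGS = 8  # eight cogs, per the article


def one_revolution(wanting):
    """Walk the hub once around; return (cog, used) for each slot in order.

    `wanting` is the set of cog numbers with a hub request pending. Every
    cog is offered its slot whether or not it has a request (the downside
    described above).
    """
    return [(cog, cog in wanting) for cog in range(NUM_COGS)]


def idle_slots(wanting):
    """Count slots offered to cogs that had nothing to do with them."""
    return sum(1 for _, used in one_revolution(wanting) if not used)


if __name__ == "__main__":
    # Only cogs 0 and 3 want the hub; the other six slots go unused.
    print(one_revolution({0, 3}))
    print(idle_slots({0, 3}))
```

Note that the order of the slots never depends on `wanting`. That is the whole point: bandwidth is wasted on idle cogs, but no cog's turn is ever moved by what its neighbors are doing.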
From
a cog’s perspective, the only uncertainty involves
waiting for the first access to shared memory
as the distributor spins around. After that
first access is obtained, cogs can schedule
their subsequent activity knowing precisely
when subsequent accesses will be granted—no
ifs, ands, or buts.
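The timing argument can be sketched in a few lines of Python. The numbers here are assumptions for the sake of the example, not figures from this article: each cog's window is taken to be two system clocks long (so a full revolution is 16 clocks), and a request is assumed to be served only at the instant its window opens. The function names are likewise hypothetical.

```python
NUM_COGS = 8          # eight cogs, per the article
CLOCKS_PER_SLOT = 2   # ASSUMED window length per cog, in system clocks
HUB_PERIOD = NUM_COGS * CLOCKS_PER_SLOT  # clocks per full revolution


def clocks_until_window(cog, request_clock):
    """Clocks a cog waits from request_clock until its next window opens."""
    window_start = cog * CLOCKS_PER_SLOT
    return (window_start - request_clock) % HUB_PERIOD


def access_schedule(cog, request_clock, n):
    """Clocks of the first n accesses for a cog that asks at request_clock."""
    first = request_clock + clocks_until_window(cog, request_clock)
    return [first + k * HUB_PERIOD for k in range(n)]


if __name__ == "__main__":
    # Cog 3 just misses its window: the first wait is long...
    print(clocks_until_window(3, 7))
    # ...but every later access falls exactly one revolution after the last.
    print(access_schedule(3, 7, 3))
```

Under these assumptions the first wait is anywhere from 0 to `HUB_PERIOD - 1` clocks, and every access after that is exactly `HUB_PERIOD` clocks apart, which is what lets a cog plan its other work around the hub.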
The
hub also includes eight semaphores. However,
these have nothing to do with the basic sharing
mechanism. The distributor itself guarantees
there can be no sharing conflicts for a single
(byte, word, or long word) access. Indeed, that
guarantee is exploited to implement the semaphores
themselves. Instead, the semaphores give
applications a way to adjudicate shared access
to higher-level structures (arrays and I/O)
if necessary.
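As a rough picture of how such a semaphore might be used, the Python sketch below models the eight bits as a test-and-set primitive. The class and method names are hypothetical and are not the Propeller's instruction names. The model also simply assumes that each call completes within one hub access, which is what makes the read-then-set step indivisible.

```python
NUM_LOCKS = 8  # eight semaphores, per the article


class HubLocks:
    """Toy model of the hub's semaphore bits (all names are hypothetical)."""

    def __init__(self):
        self.bits = [False] * NUM_LOCKS

    def try_acquire(self, lock_id):
        """Set the bit and report whether the caller now owns the lock.

        Assumed to happen in a single hub access, so no other cog can slip
        in between reading the old value and writing the new one.
        """
        was_set = self.bits[lock_id]
        self.bits[lock_id] = True
        return not was_set

    def release(self, lock_id):
        """Clear the bit so another cog can take the lock."""
        self.bits[lock_id] = False


if __name__ == "__main__":
    locks = HubLocks()
    print(locks.try_acquire(0))  # first cog gets the lock
    print(locks.try_acquire(0))  # second cog is refused
    locks.release(0)
    print(locks.try_acquire(0))  # lock is available again
```

A cog guarding, say, a shared array would loop on `try_acquire` until it succeeds, touch the array, then call `release`.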
The
hub is also where the clocks for the entire
chip (cogs and hub) are derived and distributed.
As I mentioned earlier, one option is an on-chip
RC oscillator that offers nominal 12 MHz and
20 kHz selections. The 20 kHz option is useful
as a sleepy mode because cogs only consume about
3 µA at that clock rate. The other option is
an external crystal or oscillator, which feeds
a programmable PLL, boosting the rate by up
to a factor of 16.
Now
is a good time to talk megahertz and MIPS. The
first chips run at up to 80 MHz (e.g., 5-MHz
crystal with 16× PLL clock multiplier). Virtually
all cog instructions execute in four clocks.
The exception is conditional branches, which
take four clocks when the branch is taken and
eight when it is not. At 80 MHz, that works
out to 20 MIPS per cog, so the performance for
the entire Propeller chip (eight cogs) approaches
160 MIPS, or roughly 16 MIPS per buck. Not bad
at all.
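The arithmetic is easy to check with a short Python sketch. The function names are invented for the example, and the calculation treats every instruction as a four-clock instruction, so the result is a peak figure rather than a measured one.

```python
NUM_COGS = 8                # eight cogs, per the article
CLOCKS_PER_INSTRUCTION = 4  # typical case cited in the article
MAX_PLL = 16                # the article says "up to a factor of 16"


def system_clock_hz(source_hz, pll_multiplier):
    """System clock from an external crystal or oscillator and a PLL setting."""
    if not 1 <= pll_multiplier <= MAX_PLL:
        raise ValueError("PLL multiplier must be between 1 and 16")
    return source_hz * pll_multiplier


def peak_mips(clock_hz, cogs=NUM_COGS):
    """Peak MIPS, assuming every instruction takes four clocks."""
    return cogs * clock_hz / CLOCKS_PER_INSTRUCTION / 1e6


if __name__ == "__main__":
    clock = system_clock_hz(5_000_000, 16)  # 5-MHz crystal, 16x PLL
    print(clock)                      # system clock in Hz
    print(peak_mips(clock, cogs=1))   # one cog
    print(peak_mips(clock))           # all eight cogs
```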