December 1998, Issue 101
Hot Chips
IN MEMORY OF CRAY
Another
example of effective recycling of yesterday’s know-how
is seen in the widespread adoption of SIMD techniques
(i.e., applying a single instruction to multiple data
items in parallel). In Cray’s day, this technique was
known as vector processing.
The
appeal lies in the fact that it’s relatively easy to
find and exploit parallelism in scientific and signal-processing
algorithms that rely on vector operations.
Almost
all hot chips support vector ops these days, the
best-known example being Intel's MMX. At their simplest,
such schemes carve a full-size register into parallel
subparts that can be operated on. For example, a conventional
32-bit ADD is extended to perform two 16-bit ADDs or
four 8-bit ADDs at once.
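To make the register-carving idea concrete, here is a minimal software sketch of four 8-bit ADDs packed into one 32-bit word. The function name packed_add8 and the masking trick are illustrative assumptions; real hardware such as MMX simply breaks the carry chain at the lane boundaries rather than masking.

```python
def packed_add8(a: int, b: int) -> int:
    """Add four 8-bit lanes packed in 32-bit words, wrapping within each lane."""
    LOW7 = 0x7F7F7F7F   # low 7 bits of every lane
    HIGH = 0x80808080   # top bit of every lane
    # Add the low 7 bits of each lane; the result fits in 8 bits,
    # so no carry can cross into the neighboring lane.
    partial = (a & LOW7) + (b & LOW7)
    # Fold the top bits back in with XOR, which adds without carrying out.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


a, b = 0x01FF10F0, 0x01010101
packed = packed_add8(a, b)          # each byte added independently
plain = (a + b) & 0xFFFFFFFF        # ordinary 32-bit ADD for comparison
```

With these inputs, packed is 0x020011F1 while plain is 0x030011F1: in the ordinary ADD, the carry out of the 0xFF lane spills into its neighbor, which is exactly what the packed version prevents.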
The
latest generation of pseudo-SIMDs pushes the concept
further with wider words, more operands, and extra instructions.
Consider Motorola’s AltiVec upgrade of the PowerPC architecture.
The upgrade adds a complete vector unit featuring 128-bit
registers that can be interpreted as 16 × 8-bit, 8 ×
16-bit, or 4 × 32-bit data.
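The three interpretations of one 128-bit register can be mimicked with Python's struct module. This is only a sketch: the 16 sample bytes are made up, and big-endian byte order is assumed here for illustration.

```python
import struct

# 16 arbitrary bytes standing in for the contents of one 128-bit register.
reg = bytes(range(16))

as_bytes = struct.unpack(">16B", reg)   # 16 x 8-bit elements
as_halves = struct.unpack(">8H", reg)   # 8 x 16-bit elements
as_words = struct.unpack(">4I", reg)    # 4 x 32-bit elements
```

The same 128 bits yield 16, 8, or 4 elements; only the instruction decides which view applies.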
There
are 162 new instructions, including both the typical
intra-element and the newly introduced inter-element
operations. Figure 4 shows how the two make short work
of the inner loops at the heart of scientific and DSP
code.
Figure
4—These days, all hot chips employ SIMD techniques.
Motorola’s AltiVec scheme goes beyond the usual
intra-element operations (e.g., vmsum instruction)
and adds inter-element operations (e.g., vsum instruction).
The result—an inner loop that requires 36 instructions
and 18 cycles for a regular PowerPC is cut to two
instructions and two cycles.
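As a rough illustration of how the two kinds of operation combine, the sketch below assumes the inner loop is a dot product over eight 16-bit elements (the article does not say which loop Figure 4 uses). The helpers multiply_sum and sum_across are hypothetical stand-ins loosely modeled on the vmsum and vsum instructions; overflow and saturation behavior are ignored.

```python
def multiply_sum(a, b, acc):
    """Multiply 8 element pairs and fold adjacent products into 4 running sums."""
    return [a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1] + acc[i]
            for i in range(4)]


def sum_across(lanes):
    """Add the partial sums held in separate lanes into one scalar."""
    return sum(lanes)


def dot8(a, b):
    """Dot product of two 8-element vectors in two vector-style steps."""
    partial = multiply_sum(a, b, [0, 0, 0, 0])
    return sum_across(partial)


result = dot8([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
```

The first step works lane by lane; the second has to combine values from different lanes, which is the part a purely lane-wise instruction set cannot do in one operation.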
Although
his life was cut short by a car accident, the spirit
of Seymour Cray lives on in the SV1 from Silicon Graphics.
The SV1 not only incorporates SIMD techniques, but because
SGI purchased Cray’s company, it is also upwardly compatible
with his Y-MP.
As
a classic vector processor, the SV1 faces a different
set of challenges. For instance, there’s little concern
with conventional benchmarks like SPEC. The only goal
is crunching through vectors at blazing speed, and we’re
talking billions of operations per second.
One
source of head scratching comes when vector ops and
cache get in each other’s way. Vector data may not be
reused, and worse, arrays (i.e., vectors of vectors)
introduce the issue of stride.
For
instance, a column operation on a 256 × 1024 array stored row by row
calls for accessing every 1024th element, which is contrary
to the concept of locality (i.e., the next access is
near the previous one) on which the cache concept is
based.
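The stride falls straight out of the address arithmetic. The sketch below assumes the array is stored row by row (row-major); flat_index and the choice of column 5 are illustrative.

```python
ROWS, COLS = 256, 1024


def flat_index(row, col):
    """Position of element (row, col) in a row-major 256 x 1024 array."""
    return row * COLS + col


column = 5
# Walk down one column and record where each element sits in memory.
indices = [flat_index(r, column) for r in range(ROWS)]
# Distance between consecutive accesses.
strides = {later - earlier for earlier, later in zip(indices, indices[1:])}
```

Every step down the column jumps 1024 elements, so strides contains the single value 1024; a walk along a row would give a stride of 1.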
In
fact, it’s amusing to construct mental cache-buster
exercises. Choose the worst-case combination of algorithm,
data layout, cache size, and organization—and the grandest
chip is reduced to a quivering sliver of silicon.
Considering
locality and the desire to exploit the burst characteristics
of DRAMs, most caches use long lines (dozens or hundreds of
words). When a miss occurs, the controller
loads a complete line, presuming that the penalty for
extra transfers is offset by the likelihood of subsequent
accesses within the same line.
But
an ugly mismatch of algorithm, stride, and cache may
result in a complete line refill for each array element
access. You’d be better off chucking the cache altogether!
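A toy simulation shows the mismatch in action. Everything here is an assumption chosen for illustration: a direct-mapped cache of 256 lines, 32 words per line, and the 256 × 1024 row-major array from the earlier example.

```python
def count_refills(addresses, line_words, num_lines):
    """Count line refills in a direct-mapped cache for a list of word addresses."""
    tags = [None] * num_lines        # which memory line each cache slot holds
    refills = 0
    for addr in addresses:
        line = addr // line_words    # memory line containing this word
        slot = line % num_lines      # direct-mapped: one possible slot per line
        if tags[slot] != line:       # miss: load the whole line
            tags[slot] = line
            refills += 1
    return refills


ROWS, COLS = 256, 1024
LINE_WORDS, NUM_LINES = 32, 256      # hypothetical 8K-word cache

row_walk = list(range(COLS))                    # one row, stride 1
col_walk = [r * COLS for r in range(ROWS)]      # one column, stride 1024

row_refills = count_refills(row_walk, LINE_WORDS, NUM_LINES)
col_refills = count_refills(col_walk, LINE_WORDS, NUM_LINES)
```

The row walk touches 1024 words with 32 refills. The column walk touches only 256 words but needs 256 refills, one full 32-word line per element, so the cache moves 32 times more data than the algorithm uses.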
The
SV1 addresses the situation with a 128-KB streaming-cache
design that has short lines (only 8 bytes), is highly
nonblocking (up to 192 pending references), and delivers
more than 4 GBps.
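A back-of-the-envelope comparison suggests why short lines help strided access. The sketch assumes 8-byte array elements, a worst case where every access misses, and a hypothetical 256-byte line for the conventional cache; none of these figures except the 8-byte line come from the SV1 itself.

```python
ELEMENT_BYTES = 8     # assumed: one 64-bit word per array element
ACCESSES = 256        # one column of a 256 x 1024 array


def bytes_moved(line_bytes, accesses=ACCESSES):
    """Bytes fetched when every access misses and refills one full line."""
    return line_bytes * accesses


useful = ELEMENT_BYTES * ACCESSES          # bytes the algorithm actually needs
short_line_traffic = bytes_moved(8)        # 8-byte lines, as in the SV1
long_line_traffic = bytes_moved(256)       # hypothetical 256-byte lines
```

With 8-byte lines the traffic equals the useful data (2048 bytes); with 256-byte lines the same column walk moves 65,536 bytes, 32 times as much.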