circuitcellar.com

December 1998, Issue 101

Hot Chips


by Tom Cantrell

IN MEMORY OF CRAY

Another example of effective recycling of yesterday’s know-how is seen in the widespread adoption of SIMD techniques (i.e., applying a single instruction to multiple data items in parallel). In Cray’s day, this technique was known as vector processing.

The appeal lies in the fact that it’s relatively easy to find and exploit parallelism in scientific and signal-processing algorithms that rely on vector operations.

Almost all hot chips support vector ops these days, the best-known example being Intel’s MMX. At their simplest, such schemes carve a full-size register into subparts that are operated on in parallel. For example, a conventional 32-bit ADD is extended to perform two 16-bit ADDs or four 8-bit ADDs at once.
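To make the register-carving idea concrete, here is a minimal Python sketch of four 8-bit ADDs performed inside one 32-bit word. It is not how MMX or any particular chip does it (hardware can simply break the carry chain at lane boundaries); the function name `packed_add_u8x4`, the helper `lanes`, and the masks are assumptions made up for illustration.

```python
LOW7 = 0x7F7F7F7F  # low seven bits of each 8-bit lane
HIGH = 0x80808080  # top bit of each 8-bit lane


def packed_add_u8x4(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    # Adding only the low 7 bits of each lane can never carry into a neighbor.
    partial = (a & LOW7) + (b & LOW7)
    # XOR folds the top bits back in without letting a carry leave the lane.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


def lanes(word: int) -> list:
    """Split a 32-bit word into its four 8-bit lanes, most significant first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


a = 0x01FF7F10
b = 0x0101017F
packed = packed_add_u8x4(a, b)       # 0x0200808F: each lane wraps on its own
ordinary = (a + b) & 0xFFFFFFFF      # 0x0300808F: a carry leaks into the top lane
```

The second lane (0xFF + 0x01) is the telling case: the packed version wraps it to 0x00 and leaves its neighbor alone, while the ordinary 32-bit ADD lets the carry spill over.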

The latest generation of pseudo-SIMDs pushes the concept further with wider words, more operands, and extra instructions. Consider Motorola’s AltiVec upgrade of the PowerPC architecture. The upgrade adds a complete vector unit featuring 128-bit registers that can be interpreted as 16 × 8-bit, 8 × 16-bit, or 4 × 32-bit data.
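The three interpretations are just different ways of slicing the same 128 bits. As a rough illustration (not AltiVec code), Python's `struct` module can reinterpret one 16-byte value all three ways. The sample contents and the big-endian byte order are assumptions chosen for the example.

```python
import struct

# One 128-bit "register" holding the bytes 0x00, 0x01, ..., 0x0F (sample data).
register = bytes(range(16))

# Same 16 bytes, three views. ">" selects big-endian byte order (an assumption).
as_u8 = struct.unpack(">16B", register)   # 16 x 8-bit elements
as_u16 = struct.unpack(">8H", register)   # 8 x 16-bit elements
as_u32 = struct.unpack(">4I", register)   # 4 x 32-bit elements

print(len(as_u8), len(as_u16), len(as_u32))  # 16 8 4
print(hex(as_u16[0]), hex(as_u32[0]))        # 0x1 0x10203
```

No data moves between views; only the element boundaries the instruction assumes are different.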

There are 162 new instructions, including both the typical intra-element and the newly introduced inter-element operations. Figure 4 shows how the two make short work of the inner loops at the heart of scientific and DSP code.

Figure 4—These days, all hot chips employ SIMD techniques. Motorola’s AltiVec scheme goes beyond the usual intra-element operations (e.g., vmsum instruction) and adds inter-element operations (e.g., vsum instruction). The result—an inner loop that requires 36 instructions and 18 cycles for a regular PowerPC is cut to two instructions and two cycles.
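For a feel of how a multiply-sum step and a sum-across step combine in an inner loop, here is a loose Python model of a dot product. It only mimics the general shape of such instructions: the functions `multiply_sum` and `sum_across` are hypothetical stand-ins, the lane counts are assumed (eight 16-bit inputs feeding four accumulator lanes), and overflow and saturation are ignored entirely.

```python
def multiply_sum(a, b, acc):
    """Model of a multiply-sum step: 8 element pairs in, 4 partial sums out.

    Each of the four accumulator lanes absorbs two adjacent products.
    """
    assert len(a) == len(b) == 8 and len(acc) == 4
    return [acc[i] + a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]
            for i in range(4)]


def sum_across(vec):
    """Model of a sum-across step: collapse the lanes of one vector to a scalar."""
    return sum(vec)


def dot(x, y):
    """Dot product of two equal-length lists whose length is a multiple of 8."""
    acc = [0, 0, 0, 0]
    for i in range(0, len(x), 8):       # one multiply-sum per 8 elements
        acc = multiply_sum(x[i:i + 8], y[i:i + 8], acc)
    return sum_across(acc)              # one final reduction


x = list(range(1, 17))   # 1, 2, ..., 16
y = [2] * 16
print(dot(x, y))         # 272
```

In this model, sixteen multiply-accumulates take two multiply-sum steps plus a single sum-across.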

Although his life was cut short by a car accident, the spirit of Seymour Cray lives on in the SV1 from Silicon Graphics. The SV1 not only incorporates SIMD techniques but, because SGI purchased Cray’s company, is also upwardly compatible with his Y-MP.

As a classic vector processor, the SV1 faces a different set of challenges. For instance, there’s little concern with conventional benchmarks like SPEC. The only goal is crunching through vectors at blazing speed, and we’re talking billions of operations per second.

One source of head scratching comes when vector ops and cache get in each other’s way. Vector data may not be reused, and worse, arrays (i.e., vectors of vectors) introduce the issue of stride.

For instance, a column operation on a 256 × 1024 array calls for accessing every 1024th element, which is contrary to the concept of locality (i.e., the next access is near the previous one) on which the cache concept is based.
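The stride falls straight out of the address arithmetic. The sketch below assumes the array is stored row by row (row-major), with 256 rows of 1024 elements each. The name `flat_index` is hypothetical.

```python
ROWS, COLS = 256, 1024   # array shape from the example; row-major layout assumed


def flat_index(row: int, col: int) -> int:
    """Position of element (row, col) in memory, counted in elements."""
    return row * COLS + col


# Walking along a row touches neighboring elements (stride 1).
row_walk = [flat_index(0, c) for c in range(4)]
# Walking down a column jumps a full row each time (stride 1024).
col_walk = [flat_index(r, 0) for r in range(4)]

print(row_walk)  # [0, 1, 2, 3]
print(col_walk)  # [0, 1024, 2048, 3072]
```

With a column-major layout the roles swap: the column walk becomes the friendly one and the row walk strides by 256.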

In fact, it’s amusing to construct mental cache-buster exercises. Choose the worst-case combination of algorithm, data layout, cache size, and organization—and the grandest chip is reduced to a quivering sliver of silicon.

Considering locality and the desire to exploit the burst characteristics of DRAMs, most caches use long lines (dozens or hundreds of words). When a miss occurs, the controller loads a complete line, presuming that the penalty for extra transfers is offset by the likelihood of subsequent accesses within the same line.

But an ugly mismatch of algorithm, stride, and cache may result in a complete line refill for each array element access. You’d be better off chucking the cache altogether!
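Here is one such cache buster as a toy simulation. It models a direct-mapped cache and counts misses while walking a 256 × 32 slice of the 256 × 1024 array, first column by column and then row by row. The cache size (4096 words), line length (32 words), and direct-mapped organization are assumptions picked to provoke the mismatch. They do not describe any particular chip.

```python
def count_misses(addresses, cache_words, line_words):
    """Count misses for a direct-mapped cache; addresses are in words."""
    num_slots = cache_words // line_words
    resident = [None] * num_slots        # which line each slot currently holds
    misses = 0
    for addr in addresses:
        line = addr // line_words
        slot = line % num_slots
        if resident[slot] != line:       # miss: refill the whole line
            resident[slot] = line
            misses += 1
    return misses


ROWS, COLS = 256, 1024                   # row-major array, as in the example
CACHE_WORDS, LINE_WORDS = 4096, 32       # hypothetical cache parameters

# The same 8192 elements (first 32 columns of every row), in two visiting orders.
by_column = [r * COLS + c for c in range(32) for r in range(ROWS)]
by_row = [r * COLS + c for r in range(ROWS) for c in range(32)]

col_misses = count_misses(by_column, CACHE_WORDS, LINE_WORDS)   # 8192
row_misses = count_misses(by_row, CACHE_WORDS, LINE_WORDS)      # 256

print(col_misses, col_misses * LINE_WORDS)  # 8192 misses, 262144 words fetched
print(row_misses, row_misses * LINE_WORDS)  # 256 misses, 8192 words fetched
```

In column order every single access misses, and each miss drags in 32 words to use one. The same data visited in row order misses only once per line.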

The SV1 addresses the situation with a 128-KB streaming-cache design that has short lines (only 8 bytes), is highly nonblocking (up to 192 pending references), and delivers more than 4 GBps.
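A little arithmetic shows why short lines help with strided data. The sketch below takes the 128-KB size and 8-byte line from the figures above. The 8-byte operand size and the 128-byte "long line" used for comparison are assumptions, and `useful_fraction` is a hypothetical helper.

```python
CACHE_BYTES = 128 * 1024   # cache size quoted above
LINE_BYTES = 8             # line length quoted above
WORD_BYTES = 8             # assumed operand size (one 64-bit word)

num_lines = CACHE_BYTES // LINE_BYTES
print(num_lines)           # 16384 independently replaceable lines


def useful_fraction(line_bytes: int, word_bytes: int = WORD_BYTES) -> float:
    """Worst case: every access misses and uses one word of the refilled line."""
    return word_bytes / line_bytes


print(useful_fraction(LINE_BYTES))  # 1.0: nothing fetched is wasted
print(useful_fraction(128))         # 0.0625: 15/16 of the traffic is wasted
```

With one word per line, even a worst-case stride wastes no memory bandwidth on unused neighbors.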