circuitcellar.com

December 1998, Issue 101

Hot Chips


by Tom Cantrell

ARCHITECTURE WARS

Keeping pace with Moore’s Law is no trivial task. For some time it has seemed that computer architects have had trouble coming up with any breakthroughs or radical changes in computer organization that have really paid off.

Instead, the continuing trend is an attempt to wring the last bit of performance out of the traditional solutions by brute force (i.e., throwing transistors at the problem). It would be easy to contend that architecture is dead were it not for the fact that the name of the game is performance at any price—no matter how little the gain, no matter how high the price.

The bag of tricks now includes big caches, superscalar (multi-instruction), speculative and out-of-order execution, branch prediction, SIMD (vector) ops, and so on. The art of computer architecture involves choosing the right combination and finessing the details.

Cache-wise, bigger is always better. For instance, the latest version of HP’s Precision Architecture (PA), the PA-8500 in Figure 1, includes a whopping 1.5 MB of cache (0.5 MB of instruction, 1 MB of data). Given HP’s long-time position in favor of off- versus on-chip cache, such a development is even more notable. Fact is, with tens of millions of transistors to find homes for, a big cache is the easiest way out.

Figure 1—With plenty of function units, out-of-order execution, high clock rate, and huge (0.5-MB instruction, 1-MB data) caches, the HP PA-8500 is a good example of the latest trend for performance-at-any-price chips.

Besides making cache bigger, the goal is to build and use it smarter. Even if half a dozen instructions can be found to keep all those execution units fed, the cache can become a bottleneck.

Thus, the trend towards nonblocking designs escalates (when a cache miss happens, don’t just sit there twiddling your thumbs; try to execute another instruction). The latest designs allow dozens or even hundreds of cache accesses to be pending, without stalling the processor.
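
To put rough numbers on the thumb-twiddling, here is a toy timing model in Python. It is only a sketch under loud assumptions: one access issues per cycle, every access is independent of the others, and the names `blocking_cycles`, `nonblocking_cycles`, and `max_outstanding` are invented for illustration and do not describe any particular chip.

```python
import heapq


def blocking_cycles(accesses, miss_latency):
    """Cycles for a blocking cache: every miss stalls for the full latency.

    accesses is a list of booleans; True means the access misses the cache.
    """
    return sum(1 + (miss_latency if miss else 0) for miss in accesses)


def nonblocking_cycles(accesses, miss_latency, max_outstanding):
    """Cycles for a nonblocking cache with a limit on in-flight misses.

    Assumes (hypothetically) that all accesses are independent, so the
    processor only stalls when every miss slot is already occupied.
    """
    now = 0
    pending = []  # min-heap of completion times for misses in flight
    for miss in accesses:
        # Free the slots of misses whose data has already arrived.
        while pending and pending[0] <= now:
            heapq.heappop(pending)
        if miss:
            if len(pending) == max_outstanding:
                # All slots busy: stall until the oldest miss completes.
                now = max(now, heapq.heappop(pending))
            heapq.heappush(pending, now + 1 + miss_latency)
        now += 1
    # The run is finished only when the last outstanding miss returns.
    if pending:
        now = max(now, max(pending))
    return now


# One miss followed by three hits, repeated four times.
PATTERN = [True, False, False, False] * 4
```

With a 20-cycle miss latency, `blocking_cycles(PATTERN, 20)` comes to 96 cycles in this model, while `nonblocking_cycles(PATTERN, 20, 4)` finishes in 33 because the misses overlap. Real code has dependencies between instructions, so actual gains are smaller.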

As for using cache more intelligently, the earlier trend towards software-directed prefetching, illustrated in Figure 2, has become de rigueur. The idea is to give the cache a head start, with the goal, in a perfect world, being the elimination of the dreaded miss.

Figure 2a—To ease the pain of a cache miss, the HP PA-8500 and other high-end chips employ both hardware and software techniques. One hardware approach is a nonblocking cache that allows multiple outstanding references (b), while software solutions include compiler-inserted prefetch to initiate cache access prior to anticipated use (c).
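
The head-start idea can be sketched with another toy model in Python. The assumptions are mine, not HP's: a loop walks cache lines strictly in order, does a fixed amount of work per line, the cache never evicts anything, and any number of fetches may be in flight. The function name `stream_stall_cycles` and its parameters are hypothetical.

```python
def stream_stall_cycles(num_lines, work_per_line, miss_latency, prefetch_distance):
    """Stall cycles for a loop that touches num_lines cache lines in order.

    prefetch_distance is how many lines ahead the (compiler-inserted)
    prefetch reaches; 0 means no prefetching at all.
    """
    ready_at = {}  # line number -> time its data arrives in the cache
    now = 0
    stalls = 0
    for line in range(num_lines):
        # Prefetch the line that will be needed prefetch_distance iterations on.
        if prefetch_distance > 0:
            target = line + prefetch_distance
            if target < num_lines and target not in ready_at:
                ready_at[target] = now + miss_latency
        # Demand access: a line nobody asked for yet is a plain miss.
        if line not in ready_at:
            ready_at[line] = now + miss_latency
        wait = max(0, ready_at[line] - now)
        stalls += wait
        now += wait + work_per_line
    return stalls
```

For 100 lines, 10 cycles of work per line, and a 40-cycle miss, the model stalls for 4000 cycles with no prefetch and 160 cycles with a prefetch distance of 4, where the only remaining stalls are the first four lines that nothing fetched ahead of time. A distance of 2 lands in between, since the prefetch is issued too late to hide the whole latency. Picking the distance is the compiler's problem: roughly miss latency divided by work per iteration.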

The conditional branch has become the bane of heavily pipelined, superscalar, and speculative superdupers. Mere mortal CPUs can only take five and wait for new marching orders (i.e., until the condition resolves).

The latest chips go to extraordinary lengths trying to predict the branch’s outcome. For instance, the DEC-now-Compaq Alpha 21264 happily wades 20 branches into the future, relying on a crystal ball that not only includes the usual branch history but also how the program arrived there (see Figure 3).

Figure 3—When it comes to branch prediction, the Alpha 21264 considers both the past behavior of the branch and the path taken to arrive at the branch.
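
To see why the path matters, consider a Python sketch that compares two simple predictors built from 2-bit saturating counters. This is a generic textbook-style illustration, not the 21264's actual tables or sizes. The names `CounterTable`, `accuracy`, and `correlated_trace` are hypothetical, and the trace is synthetic: branch 1 always goes the same way branch 0 just went, and branch 0 alternates.

```python
class CounterTable:
    """Table of 2-bit saturating counters; a value of 2 or 3 predicts taken."""

    def __init__(self, size):
        self.size = size
        self.counters = [1] * size  # start out weakly not-taken

    def predict(self, index):
        return self.counters[index % self.size] >= 2

    def update(self, index, taken):
        i = index % self.size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(trace, history_bits):
    """Fraction of branches in trace that are predicted correctly.

    trace is a list of (branch_address, taken) pairs. With history_bits=0
    the counter is chosen by branch address alone; otherwise the address is
    combined with the outcomes of the most recent branches (the "path").
    """
    table = CounterTable(1024)
    history = 0
    mask = (1 << history_bits) - 1
    correct = 0
    for address, taken in trace:
        index = (address << history_bits) | (history & mask)
        if table.predict(index) == taken:
            correct += 1
        table.update(index, taken)
        history = ((history << 1) | int(taken)) & 0xFFFF
    return correct / len(trace)


def correlated_trace(iterations):
    """Synthetic trace: branch 0 alternates, branch 1 copies branch 0."""
    trace = []
    for i in range(iterations):
        taken = (i % 2 == 0)
        trace.append((0, taken))
        trace.append((1, taken))
    return trace
```

On `correlated_trace(200)`, the address-only predictor (`history_bits=0`) gets every branch wrong, because an alternating branch keeps flipping its counter the wrong way. With two bits of path history the same counters miss only twice while warming up. Real programs are nowhere near this tidy, which is why high-end designs combine several predictors rather than betting on one.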