December 1998, Issue 101
Hot Chips
ARCHITECTURE WARS
Keeping
pace with Moore’s Law is no trivial task. For some time
it has seemed that computer architects have had trouble
coming up with any breakthroughs or radical changes
in computer organization that have really paid off.
Instead,
the continuing trend is an attempt to wring the last
bit of performance out of the traditional solutions
by brute force (i.e., throwing transistors at the problem).
It would be easy to contend that architecture is dead
were it not for the fact that the name of the game is
performance at any price—no matter how little the gain,
no matter how high the price.
The
bag of tricks now includes big caches, superscalar (multiple instructions per cycle),
speculative and out-of-order execution, branch prediction,
SIMD (vector) ops, and so on. The art of computer architecture
involves choosing the right combination and finessing
the details.
Cache-wise,
bigger is always better. For instance, the latest version
of HP’s Precision Architecture (PA), the PA-8500 in Figure
1, includes a whopping 1.5 MB of cache (0.5 MB of instruction,
1 MB of data). Given HP’s long-time position in favor
of off- versus on-chip cache, such a development is
even more notable. Fact is, with tens of millions of
transistors to find homes for, big cache is the easiest
way out.
Figure 1: With plenty of function units, out-of-order execution,
high clock rate, and huge (0.5-MB instruction, 1-MB
data) caches, the HP PA-8500 is a good example of
the latest trend for performance-at-any-price chips.
Besides
making cache bigger, the goal is to build and use it
smarter. Even if half a dozen instructions can be found
to keep all those execution units fed, the cache can
become a bottleneck.
Thus,
the trend towards nonblocking designs escalates (when
a cache miss happens, don’t just sit there twiddling
your thumbs; try to execute another instruction). The
latest designs allow dozens or even hundreds of cache
accesses to be pending, without stalling the processor.
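To see why letting misses overlap matters, here is a minimal Python sketch comparing a blocking cache with a nonblocking one. It is a toy model, not a description of any real chip. The function names, the one-cycle access cost, the 10-cycle miss latency, and the assumption that later instructions never need the missing data are all illustrative.

```python
def blocking_cycles(accesses, miss_latency):
    """Every access costs 1 cycle; a miss also stalls for miss_latency cycles."""
    return sum(1 + (miss_latency if is_miss else 0) for is_miss in accesses)


def nonblocking_cycles(accesses, miss_latency, max_outstanding):
    """Misses overlap with later accesses; stall only when all miss slots are busy."""
    cycle = 0
    in_flight = []  # completion times of misses still pending
    for is_miss in accesses:
        # Drop misses whose data has already come back.
        in_flight = [t for t in in_flight if t > cycle]
        if is_miss:
            if len(in_flight) >= max_outstanding:
                # No free slot: wait for the earliest pending miss to return.
                cycle = min(in_flight)
                in_flight = [t for t in in_flight if t > cycle]
            in_flight.append(cycle + 1 + miss_latency)
        cycle += 1  # the processor moves on to the next access
    # Done once the last access has issued and every pending miss has returned.
    return max([cycle] + in_flight)


# Hypothetical trace: True = cache miss, False = cache hit.
trace = [True, False, False, True, False, False, False, False]
blocking = blocking_cycles(trace, miss_latency=10)
overlapped = nonblocking_cycles(trace, miss_latency=10, max_outstanding=2)
print(blocking, overlapped)
```

On this trace the blocking model needs 28 cycles, while allowing two outstanding misses cuts that to 14. Real code rarely does this well, because later instructions often depend on the data that missed.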
As
for using cache more intelligently, the earlier trend
towards software-directed prefetching, illustrated in
Figure 2, has become de rigueur. The idea is to give
the cache a head start, with the goal, in a perfect
world, being the elimination of the dreaded miss.
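The head-start idea can be sketched with a small Python model of a loop that walks through memory one cache line at a time. This is only an illustration under stated assumptions: the function name, the cycle counts, a free prefetch instruction, and an unlimited number of outstanding prefetches are all hypothetical, and no particular compiler or chip is being modeled.

```python
def loop_cycles(n_lines, work, latency, distance=0):
    """Cycles to process n_lines cache lines, spending `work` cycles on each.

    distance=0 means no prefetching. Otherwise, iteration i issues a
    prefetch for line i + distance before touching line i.
    """
    cycle = 0
    ready_at = {}  # line index -> cycle at which its data arrives
    for i in range(n_lines):
        if distance > 0 and i + distance < n_lines:
            # Start fetching a future line now (assumed to cost no cycles).
            ready_at[i + distance] = cycle + latency
        # Demand access: a line that was never prefetched misses right now.
        arrival = ready_at.get(i, cycle + latency)
        cycle = max(cycle, arrival)  # stall until the data is present
        cycle += work                # then do the useful work
    return cycle


no_prefetch = loop_cycles(n_lines=100, work=10, latency=50)
far_enough = loop_cycles(n_lines=100, work=10, latency=50, distance=5)
too_close = loop_cycles(n_lines=100, work=10, latency=50, distance=2)
print(no_prefetch, far_enough, too_close)
```

With these numbers the loop takes 6000 cycles without prefetching and 1250 with a distance of five lines, since five iterations of work exactly cover the 50-cycle latency. A distance of two lands in between: the prefetch still helps, but the data arrives late and the loop stalls for part of each miss.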
Figure 2: To ease the pain of a cache miss (a), the HP PA-8500
and other high-end chips employ both hardware and
software techniques. One hardware approach is a
nonblocking cache that allows multiple outstanding
references (b), while software solutions include
compiler-inserted prefetch to initiate cache access
prior to anticipated use (c).
The
conditional branch has become the bane of heavily pipelined,
superscalar, and speculative superdupers. Mere mortal
CPUs can only take five and wait for new marching orders
(i.e., until the branch condition resolves).
The
latest chips go to extraordinary lengths trying to predict
the branch’s outcome. For instance, the DEC-now-Compaq
Alpha 21264 happily wades 20 branches into the future,
relying on a crystal ball that not only includes the
usual branch history but also how the program arrived
there (see Figure 3).
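A rough Python sketch shows why knowing how the program arrived at a branch can help. It compares per-branch 2-bit counters with counters that are also indexed by the outcomes of the last few branches. This is a simplified illustration, not the Alpha 21264's actual predictor: the class names, the branch addresses, the 4-bit history, and the dictionary-based tables are all assumptions made for the example.

```python
import random


class PerBranchCounter:
    """One 2-bit saturating counter per branch address."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def _key(self, pc, history):
        return pc  # ignores how the program got here

    def predict(self, pc, history):
        return self.counters.get(self._key(pc, history), 1) >= 2

    def update(self, pc, history, taken):
        key = self._key(pc, history)
        c = self.counters.get(key, 1)
        self.counters[key] = min(3, c + 1) if taken else max(0, c - 1)


class PathAwareCounter(PerBranchCounter):
    """Same counters, but indexed by branch address plus recent outcomes."""

    def _key(self, pc, history):
        return (pc, history)


def accuracy(predictor, trace, only_pc, history_bits=4):
    """Fraction of correct predictions for the branch at address only_pc."""
    history, correct, seen = 0, 0, 0
    mask = (1 << history_bits) - 1
    for pc, taken in trace:
        if pc == only_pc:
            seen += 1
            correct += predictor.predict(pc, history) == taken
        predictor.update(pc, history, taken)
        # Shift the newest outcome into the global history register.
        history = ((history << 1) | int(taken)) & mask
    return correct / seen


def correlated_trace(n, seed=0):
    """Branch A is a coin flip; branch B always goes the same way as A."""
    rng = random.Random(seed)
    trace = []
    for _ in range(n):
        a = rng.random() < 0.5
        trace.append((0x100, a))  # hypothetical address of branch A
        trace.append((0x200, a))  # hypothetical address of branch B
    return trace


trace = correlated_trace(2000)
print(accuracy(PerBranchCounter(), trace, only_pc=0x200))
print(accuracy(PathAwareCounter(), trace, only_pc=0x200))
```

Looked at in isolation, branch B appears random, so the per-branch counter should hover near 50 percent. Once the predictor also sees which way branch A just went, B becomes almost perfectly predictable after a brief warm-up.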
Figure 3: When it comes to branch prediction, the Alpha
21264 considers both the past behavior of the branch
and the path taken to arrive at the branch.