The
processor has a single memory port for reading instructions
and loading and storing data. Most memory accesses
are for fetching instructions. The processor is also
the DMA engine, and a video refresh DMA cycle occurs
once every eight clocks or so. Therefore, in any given
clock cycle, the processor executes either an instruction
fetch memory cycle, a DMA memory cycle, or a load/store
memory cycle.
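
The three-way choice described above can be sketched as a small C model. This is only an illustration: the names `mem_cycle_t` and `select_mem_cycle` are hypothetical, and the priority order shown (DMA first, then load/store, then instruction fetch) is an assumption, since the text says only that exactly one kind of cycle runs per clock.

```c
#include <stdbool.h>

/* Kinds of memory cycle the single memory port can run in one clock. */
typedef enum { CYCLE_IFETCH, CYCLE_DMA, CYCLE_LOADSTORE } mem_cycle_t;

/* Hypothetical arbiter. The priority order used here (DMA, then
 * load/store, then instruction fetch) is an assumption made for
 * illustration; it is not taken from the design itself. */
mem_cycle_t select_mem_cycle(bool dma_request, bool loadstore_pending)
{
    if (dma_request)
        return CYCLE_DMA;        /* e.g. a video refresh cycle */
    if (loadstore_pending)
        return CYCLE_LOADSTORE;  /* data access for a load or store */
    return CYCLE_IFETCH;         /* the common case */
}
```

Whatever the real priority order is, the point is that only one of the three requests can own the port in a given clock, so the other two must wait.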
Memory
transactions are pipelined. In each memory cycle,
the processor drives the next memory cycle's
address and control signals and awaits RDY, indicating
the access has been completed. So, what happens when
memory is not ready?
The
simplest thing to do is to stop the pipeline for that
cycle. CTRL deasserts all pipeline register clock
enables (PCE, ACE, and so forth). The pipeline registers
do not clock, and this extends all pipeline stages
by one cycle. In Table 2, memory is not ready during
the fetch of instruction I3 in t3, and so t4 repeats
t3. (Repeated pipe stages are italicized.)
| t1 | t2 | t3 | t4 | t5 |
|---|---|---|---|---|
| IF1 | DC1 | EX1 | *EX1* | |
| | IF2 | DC2 | *DC2* | EX2 |
| | | IF3 | *IF3* | DC3 |
| | | | | IF4 |

Table 2: During t3, the instruction fetch
memory access of I3 is not RDY, so the pipeline
registers do not clock, and the pipeline stalls
until RDY is asserted in t4. Repeated pipeline
stages are italicized.
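
The stall behavior in Table 2 can be modeled in a few lines of C. This is a minimal sketch, not the actual control logic: `pipeline_t` and `pipeline_clock` are hypothetical names, instructions are represented only by their numbers, and a single `clock_enable` flag stands in for the individual enables (PCE, ACE, and so forth).

```c
#include <stdbool.h>

/* Instruction number occupying each pipeline stage; 0 means empty. */
typedef struct {
    int if_stage;  /* instruction being fetched  */
    int dc_stage;  /* instruction being decoded  */
    int ex_stage;  /* instruction being executed */
} pipeline_t;

/* Model the clock edge that ends one cycle. When memory is not ready,
 * the clock enables are deasserted, so no pipeline register changes
 * and the next cycle repeats the current one. */
void pipeline_clock(pipeline_t *p, bool rdy)
{
    bool clock_enable = rdy;  /* stands in for PCE, ACE, and so forth */

    if (!clock_enable)
        return;               /* stall: every stage holds its value */

    p->ex_stage = p->dc_stage;
    p->dc_stage = p->if_stage;
    p->if_stage = p->if_stage + 1;  /* fetch the next instruction */
}
```

Starting from `{1, 0, 0}` (IF1 in t1) and clocking with RDY asserted, asserted, deasserted, asserted reproduces the columns of Table 2: the contents of t3 (IF3, DC2, EX1) are held through t4, and t5 shows IF4, DC3, EX2.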
IL
in Listing 1 is a load word instruction. Loads and
stores need a second memory access, causing pipeline
havoc (see Table 3). In t4 you must run a load data
access instead of an instruction fetch. You must stall
the pipeline to squeeze in this access.
if ((p->flags & 7) == 1)
    p->x = p->y;

IL: lw   r6,2(r10)   ;load r6 with p->flags
I2: andi r6,7        ;is (p->flags & 7)
I3: addi r0,r6,-1    ;==1?
IB: bne  T
I5: lw   r6,6(r10)   ;yes: load r6 with p->y
...

Listing 1: This C code produces assembly code
that includes a load IL and a branch IB. Each
causes pipeline headaches.
| t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
|---|---|---|---|---|---|---|---|---|
| IFL | DCL | EXL | EXL | | | | | |
| | IF2 | DC2 | DC2 | EX2 | | | | |
| | | IF3 | IF3 | DC3 | EX3 | | | |
| | | | | IFB | DCB | EXB | | |
| | | | | | IF5 | DC5 | ~~EX5~~ | |
| | | | | | | IF6 | DC6 | ~~EX6~~ |
| | | | | | | | IFT | DCT |

Table 3: Pipelined execution of the load
instruction IL, I2, I3, the branch IB, the annulled
I5 and I6, and the branch target IT. During t4
you stall the pipeline for the IL load/store memory
cycle. The branch IB executed in t7 causes I5
and I6 to be annulled in t8 and t9. Annulled instructions
are struck through.
Then,
although you fetched I3 in t3, you must not latch
it into the instruction register (IR) as t3 ends,
because neither EXL nor DC2 is finished at this point.
In particular, DC2 must await the load result in order
to forward it to A, because I2 uses r6, the result
of IL!
Finally,
if (in t3) you don't save the just-fetched I3
somewhere, you'll lose it, because in t4, the
memory port is busy with the load cycle. If you lose
it, you'll have to re-fetch it no sooner than
t5, with the result that even a no-wait load requires
three cycles, which is unacceptable.
To
fix this problem, the control unit has a 16-bit NEXTIR
register and an IR source multiplexer (IRMUX). In
t3, it captures I3 in NEXTIR, and then in t4, IR is
loaded from NEXTIR instead of from the memory port
(which is busy with the load). NEXTIR ensures a two-cycle
load or store, at a cost of eight CLBs.
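
The NEXTIR and IRMUX arrangement can be sketched as a small C model of the two registers. This is an illustration under stated assumptions rather than the actual control unit: the names `fetch_regs_t`, `ir_action_t`, and `ir_clock` are hypothetical, and the three-way action code is a simplification of whatever enable and select signals the real design uses.

```c
#include <stdint.h>

/* The two 16-bit registers involved in instruction capture. */
typedef struct {
    uint16_t ir;      /* instruction register (IR) */
    uint16_t nextir;  /* NEXTIR: parks a fetched instruction */
} fetch_regs_t;

/* What happens to IR and NEXTIR at the end of a cycle. These action
 * names are hypothetical, introduced only for this sketch. */
typedef enum {
    IR_FROM_PORT,     /* ordinary fetch: IR <- memory port */
    IR_HOLD_CAPTURE,  /* load/store cycle follows: NEXTIR <- port, IR held */
    IR_FROM_NEXTIR    /* load/store cycle ends: IR <- NEXTIR via IRMUX */
} ir_action_t;

/* Model the clock edge that ends one cycle. port_data is whatever the
 * memory port delivered in that cycle (an instruction or load data). */
void ir_clock(fetch_regs_t *r, ir_action_t action, uint16_t port_data)
{
    switch (action) {
    case IR_FROM_PORT:
        r->ir = port_data;
        break;
    case IR_HOLD_CAPTURE:
        r->nextir = port_data;  /* IR keeps the instruction still in DC */
        break;
    case IR_FROM_NEXTIR:
        r->ir = r->nextir;      /* port_data is load data, not an instruction */
        break;
    }
}
```

In terms of Table 3, the end of t2 is an `IR_FROM_PORT` step (IR gets I2), the end of t3 is `IR_HOLD_CAPTURE` (I3 goes to NEXTIR while IR still holds I2), and the end of t4 is `IR_FROM_NEXTIR`, so I3 reaches IR for decoding in t5 without a second fetch.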
As
with instruction fetch accesses, load/store memory
accesses may have to wait on slow memory. For example,
had RDY not been asserted during t4, the pipeline
would have stalled another cycle to wait for the EXL access
to complete.