circuitcellar.com
April 2000, Issue 117

Building a RISC System in an FPGA
Part 2: Pipeline and Control Unit Design by Jan Gray


The processor has a single memory port for reading instructions and loading and storing data. Most memory accesses are for fetching instructions. The processor is also the DMA engine, and a video refresh DMA cycle occurs once every eight clocks or so. Therefore, in any given clock cycle, the processor executes either an instruction fetch memory cycle, a DMA memory cycle, or a load/store memory cycle.

Memory transactions are pipelined. In each memory cycle, the processor drives the next memory cycle’s address and control signals and awaits RDY, indicating the access has been completed. So, what happens when memory is not ready?

The simplest thing to do is to stop the pipeline for that cycle. CTRL deasserts all of the pipeline register clock enables (PCE, ACE, and so forth). The pipeline registers do not clock, and this extends all pipeline stages by one cycle. In Table 2, memory is not ready during the fetch of instruction I3 in t3, and so t4 repeats t3. (Repeated pipe stages are italicized.)

t1   t2   t3   t4   t5
IF1  DC1  EX1  EX1
     IF2  DC2  DC2  EX2
          IF3  IF3  DC3
                    IF4
Table 2—During t3, the instruction fetch memory access of I3 is not RDY, so the pipeline registers do not clock, and the pipeline stalls until RDY is asserted in t4. Repeated pipeline stages are italicized.
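The stall rule can be sketched in a few lines of C. This is only a behavioral model, not the actual CTRL logic: the Pipe struct, clock_pipe, and run are hypothetical names, and the three fields simply record which instruction number sits in each stage (0 means empty). The single rdy flag stands in for all of the gated clock enables.

```c
/* Instruction number held in each pipeline stage; 0 means the stage is empty.
   (Hypothetical model, not the real pipeline registers.) */
typedef struct {
    int fetch;    /* IF */
    int decode;   /* DC */
    int execute;  /* EX */
} Pipe;

/* One clock edge. When memory is not ready, the clock enables are
   deasserted, so every register holds and the cycle repeats. */
Pipe clock_pipe(Pipe p, int rdy)
{
    Pipe next;
    if (!rdy)
        return p;
    next.fetch = p.fetch + 1;   /* fetch the following instruction */
    next.decode = p.fetch;
    next.execute = p.decode;
    return next;
}

/* Record the pipeline contents for cycles t1..tn; rdy[t] is RDY during
   cycle t (index 0 is t1). */
void run(const int *rdy, int ncycles, Pipe *out)
{
    Pipe p = { 1, 0, 0 };       /* t1: I1 is in IF, nothing else in flight */
    int t;
    for (t = 0; t < ncycles; t++) {
        out[t] = p;
        p = clock_pipe(p, rdy[t]);
    }
}
```

Running this with RDY deasserted only in t3 reproduces the columns of Table 2: t4 holds the same three stages as t3, and everything moves on in t5.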

IL in Listing 1 is a load word instruction. Loads and stores need a second memory access, causing pipeline havoc (see Table 3). In t4 you must run a load data access instead of an instruction fetch. You must stall the pipeline to squeeze in this access.

if ((p->flags & 7) == 1)
    p->x = p->y;

IL:	lw r6,2(r10)	;load r6 with p->flags
I2:	andi r6,7	;is (p->flags & 7)
I3:	addi r0,r6,-1	;==1?
IB:	bne T
I5:	lw r6,6(r10)	;yes: load r6 with p->y
	...
Listing 1—This C code produces assembly code that includes a load IL and a branch IB. Each causes pipeline headaches.

 

t1   t2   t3   t4   t5   t6   t7   t8   t9
IFL  DCL  EXL  EXL
     IF2  DC2  DC2  EX2
          IF3  IF3  DC3  EX3
                    IFB  DCB  EXB
                         IF5  DC5  EX5
                              IF6  DC6  EX6
                                   IFT  DCT
Table 3—Pipelined execution of the load instruction IL, I2, I3, the branch IB, the annulled I5 and I6, and the branch target IT. During t4 you stall the pipeline for the IL load/store memory cycle. The branch IB executed in t7 causes I5 and I6 to be annulled in t8 and t9. Annulled instructions are struck through.
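As a rough way to read Table 3, you can count how many EX-stage clocks each kind of instruction occupies. The sketch below assumes no memory wait states and covers only the three cases the table shows; the Op enum and both function names are made up for illustration.

```c
/* Instruction kinds that appear in Table 3 (hypothetical classification). */
typedef enum { OP_OTHER, OP_LOADSTORE, OP_TAKEN_BRANCH } Op;

/* EX-stage clocks consumed by one instruction, assuming memory is
   always ready. */
int ex_slots(Op op)
{
    switch (op) {
    case OP_LOADSTORE:
        return 2;   /* EX plus the data access cycle (t3 and t4) */
    case OP_TAKEN_BRANCH:
        return 3;   /* the branch plus two annulled successors (t7 to t9) */
    default:
        return 1;   /* andi and addi each take one EX cycle in Table 3 */
    }
}

/* Total EX-stage clocks for a straight-line sequence of n instructions. */
int total_ex_slots(const Op *prog, int n)
{
    int total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += ex_slots(prog[i]);
    return total;
}
```

For IL, I2, I3, and a taken IB this gives 2 + 1 + 1 + 3 = 7 clocks, which matches the span from t3 through t9 in Table 3.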

Then, although you fetched I3 in t3, you must not latch it into the instruction register (IR) as t3 ends, because neither EXL nor DC2 is finished at this point. In particular, DC2 must await the load result in order to forward it to A, because I2 uses r6, the result of IL!

Finally, if (in t3) you don’t save the just-fetched I3 somewhere, you’ll lose it, because in t4, the memory port is busy with the load cycle. If you lose it, you’ll have to re-fetch it no sooner than t5, with the result that even a no-wait load requires three cycles, which is unacceptable.

To fix this problem, the control unit has a 16-bit NEXTIR register and an IR source multiplexer (IRMUX). In t3, it captures I3 in NEXTIR, and then in t4, IR is loaded from NEXTIR instead of from the memory port (which is busy with the load). NEXTIR ensures a two-cycle load or store, at a cost of eight CLBs.
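Here is a minimal behavioral sketch of NEXTIR and IRMUX in C. It is an illustration of the idea rather than the actual control unit: the Ctrl struct, the Cycle enum, and ctrl_clock are hypothetical names, and the three cycle kinds are an assumed way of describing what CTRL decides in each clock.

```c
#include <stdint.h>

/* The two 16-bit registers involved (hypothetical model). */
typedef struct {
    uint16_t ir;      /* instruction register */
    uint16_t nextir;  /* parks a fetched instruction during a load/store */
} Ctrl;

/* What the memory port is doing in the current cycle (assumed encoding). */
typedef enum {
    CYC_FETCH,          /* ordinary fetch: IR loads from the memory port */
    CYC_FETCH_PARK,     /* fetch while a load/store is in EX: IR must hold */
    CYC_LOADSTORE_DATA  /* port carries load/store data, not an instruction */
} Cycle;

/* One clock edge of the IR path. mem_data is whatever the memory port
   returns in this cycle. */
void ctrl_clock(Ctrl *c, Cycle cyc, uint16_t mem_data)
{
    switch (cyc) {
    case CYC_FETCH:
        c->ir = mem_data;       /* IRMUX selects the memory port */
        break;
    case CYC_FETCH_PARK:
        c->nextir = mem_data;   /* capture in NEXTIR; IR keeps its value */
        break;
    case CYC_LOADSTORE_DATA:
        c->ir = c->nextir;      /* IRMUX selects NEXTIR; mem_data is data */
        break;
    }
}
```

Stepping through t1 to t4 of Table 3 with this model, IR holds I2 through t3 and t4 while NEXTIR holds I3, and IR picks up I3 at the end of t4 even though the memory port is returning load data in that cycle.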

As with instruction fetch accesses, load/store memory accesses may have to wait on slow memory. For example, had RDY not been asserted during t4, the pipeline would have stalled another cycle to wait for the EXL access to complete.
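The effect of wait states on a load or store can be sketched as a simple count. The function name and the rdy trace are hypothetical, and the model assumes the first EX cycle (t3 in Table 3) is not itself stalled.

```c
/* Clocks a load or store spends in EX: one cycle while the next
   instruction is being fetched (t3 in Table 3), then one data access
   cycle per RDY sample until RDY is asserted. rdy[0] is RDY during the
   first data access cycle (t4). Returns -1 if RDY is never asserted
   within the n samples given. */
int loadstore_ex_cycles(const int *rdy, int n)
{
    int cycles = 1;             /* the first EX cycle */
    int i;
    for (i = 0; i < n; i++) {
        cycles++;               /* one more data access cycle */
        if (rdy[i])
            return cycles;
    }
    return -1;
}
```

With RDY asserted in t4 this returns 2, the two-cycle load of Table 3. If RDY is deasserted in t4 and asserted in t5, it returns 3.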