17.11.2012 Views

Soft-Core Processor Design - CiteSeer

Soft-Core Processor Design - CiteSeer

Soft-Core Processor Design - CiteSeer

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

pipelined implementation. Control-flow instructions are executed in one cycle by flushing the<br />

prefetch unit and issuing a new instruction address on the negative clock edge, while the rest of<br />

the processor logic is synchronized with the positive clock edge. Therefore, the prefetch unit<br />

issues a new read in the cycle in which the control-flow instruction executes. If only on-chip<br />

memory is used, the new instruction is available with one cycle latency. Since most control-flow<br />

instructions have a delay slot behaviour, there is no branch penalty.<br />

The 2-stage pipelined implementation has a very low cycle count compared to the Altera Nios.<br />

However, the Fmax is also lower, resulting in lower overall performance. The Fmax for this<br />

implementation was 25 MHz, compared with over 100 MHz for the 16-bit Altera Nios in the<br />

same system configuration. Aside from the low performance, the main issue with this<br />

implementation was the register file implementation. Since all instructions execute in a single<br />

cycle, the register file has to be read asynchronously. As mentioned in the introduction to Chapter<br />

4, this is a problem because Stratix FPGAs contain only synchronous memory. Several solutions<br />

to the problem were tried, including implementing the register file in logic, controlling the<br />

synchronous memory with an inverted clock as suggested in [17], and implementing only 32<br />

registers in logic and swapping the values between a large register file in memory and the<br />

registers implemented in logic. Since none of these solutions provided a satisfactory performance,<br />

an additional pipeline stage was introduced, resulting in a 3-stage pipelined design.<br />

The 3-stage pipelined processor was similar to the UT Nios design presented in Chapter 4,<br />

with the operand and execute stages implemented as a single stage. The new pipeline stage was<br />

introduced in a way that allowed the register file to be implemented in the synchronous memory.<br />

The introduction of this stage caused the control-flow instructions to commit one stage later,<br />

causing one cycle of branch penalty. Thus, this implementation had a larger cycle count than the<br />

2-stage implementation. However, the 3-stage implementation achieved the Fmax of 61 MHz, and<br />

performance within 10% of the Altera Nios over a small set of programs used for testing.<br />

The 3-stage implementation was the basis for a 4-stage pipelined implementation. Timing<br />

analysis of the last stage of the pipeline showed that significant gains in Fmax could be achieved<br />

by introducing another pipeline stage, the operand stage, which would include all the logic<br />

feeding the inputs of the ALU. If the branch logic could be included in the operand pipeline stage,<br />

there would be no increase in the cycle count due to the additional branch penalties. However,<br />

after the operand pipeline stage was introduced, flushing the prefetch unit on the negative clock<br />

edge became a limiting factor for Fmax. Using the negative edge of the clock for one design<br />

module, and the positive edge for the rest of the design imposes severe limitations on the timing.<br />

In the worst case, if the module is the slowest part of the design, the Fmax is effectively halved.<br />

77

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!