1 Montgomery Modular Multiplication in Hardware
FEI KEMT<br />
units are available, the total execution time TMMM will increase. On the other<br />
hand, the area occupied by the coprocessor can be adjusted to the area<br />
constraints of the target device. Implementing n < nmax stages also means<br />
more operations for reading from and storing to the memory. Shifting the<br />
processed data between the stages is faster than storing the intermediate results in<br />
the memory block and reading them back repeatedly to finish the computation.<br />
Therefore, the best performance is achieved in a design with the maximal number of stages<br />
nmax (n = nmax).<br />
Parametrisation The MMM coprocessor has three variable parameters (w, e, and<br />
n) that can be chosen for any implementation. The number of pipelined stages and<br />
the word width (n, w) are chosen according to the area available for the<br />
coprocessor and the timing required for the MMM computation. The<br />
security level of the public-key algorithm defines the operand length of the multiplier<br />
(k = w · e). This approach makes the processor and coprocessor design highly<br />
flexible.<br />
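The relation k = w · e can be illustrated with a minimal Python sketch. It is not the coprocessor's implementation: the function names (`words_per_operand`, `split_operand`, `join_words`) and the example operand length k = 1024 are assumptions chosen for illustration, and the least-significant-word-first order is likewise an assumed convention.

```python
from typing import List


def words_per_operand(k: int, w: int) -> int:
    """Number of w-bit words e needed to hold a k-bit operand.

    When w divides k this is exactly e = k / w, i.e. k = w * e.
    """
    return -(-k // w)  # ceiling division


def split_operand(x: int, w: int, e: int) -> List[int]:
    """Split x into e words of w bits, least significant word first (assumed order)."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(e)]


def join_words(words: List[int], w: int) -> int:
    """Reassemble an operand from its w-bit words (inverse of split_operand)."""
    return sum(word << (w * i) for i, word in enumerate(words))


if __name__ == "__main__":
    k = 1024  # hypothetical operand length in bits
    for w in (8, 16, 32):
        print(f"k={k}, w={w}: e={words_per_operand(k, w)}")
```

For a fixed operand length k, doubling the word width w halves the number of words e that have to be processed per operand.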
In general, there are two ways to increase the speed of the<br />
MMM computation in the proposed designs (see Equation 2.4 for the<br />
relation between the design parameters and the computation time TMMM):<br />
1. To increase the word length w. This reduces the number of iterations given by<br />
e, which yields a shorter computation time. While older FPGAs<br />
provide dual-port memory blocks with configurable word<br />
lengths of only up to 16 bits (Altera Apex [8]), high-performance models<br />
support up to 32 bits for middle-sized blocks or 128 bits for large memory<br />
blocks (Altera Stratix II [20]). Since the capacity of a block is sufficient<br />
for typical RSA operands, it makes sense to use only one block per operand.<br />
With an older technology with smaller memory blocks and a larger<br />
word width (16 < w ≤ 32), two memory blocks per variable are required.<br />
Depending on the memory configuration, several variables may share one<br />
memory block. The mapping of operands to memory is especially important for<br />
constrained SoC designs with a limited number of memory blocks.<br />
2. To increase the number of pipelined stages n. The hardware structure of the<br />
PE for both solutions (CSA PE and CPA PE) is relatively simple and fast,<br />
and it is independent of the number of stages, which was a condition for a scalable<br />
design. Adding several pipelined stages can increase the overall speed,<br />