
Technical University of Košice

Faculty of Electrical Engineering and Informatics

Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

2010 Martin Šimka

Technical University of Košice

Faculty of Electrical Engineering and Informatics

Department of Electronics and Multimedia Communications

Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

Montgomery Modular Multiplier and True Random Number Generator

Doctoral Thesis

Discipline: 26-13-9 Electronics

Department: Department of Electronics and Multimedia Communications (FEI)

Supervisor: doc. Ing. Miloš Drutarovský, PhD.

Consultant: prof. Ing. Viktor Fischer, PhD.

Košice 2010 Martin Šimka

Metadata Sheet 

Author: Martin Šimka

Thesis title: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

Subtitle: Montgomery Modular Multiplier and True Random Number Generator

Language: English

Type of Thesis: Doctoral Thesis

Number of Pages: 126

Degree: PhD.

University: Technical University of Košice

Faculty: Faculty of Electrical Engineering and Informatics (FEI)

Department: Department of Electronics and Multimedia Communications (KEMT)

Discipline: 26-13-9 Electronics

Town: Košice, Slovakia

Supervisor: doc. Ing. Miloš Drutarovský, PhD.

Consultant(s): prof. Ing. Viktor Fischer, PhD.

Date of Submission: 2. 8. 2010

Date of Defence: 9. 2010

Keywords: modular multiplication, elliptic curve method, factorisation, random number generator

Category Conspectus: Technika, technológia, inžinierstvo; Elektronika

Thesis Citation: Martin Šimka: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs. Košice: Technical University of Košice, Faculty of Electrical Engineering and Informatics. 2010. 126 pages

Title SK: Analýza a implementácia vybraných blokov pre kryptografické systémy s verejným kľúčom

Subtitle SK: Montgomeryho modulárna násobička a generátor skutočne náhodných čísel

Keywords SK: modulárne násobenie, metóda eliptických kriviek, faktorizácia, generátor náhodných čísel

Abstract in English 

In this thesis we deal with two elementary blocks used in public-key cryptosystems: the first is a modular multiplier for very long operands, the second is a random number generator. Both blocks are designed for a programmable target platform (FPGA devices), which allows quick prototyping of the proposed systems.

Our main goal for the multiplier is a scalable and parametrised solution that is easily portable and adaptable to the final target platform and to the processed data. Because of the required flexibility, the achieved clock speed is lower than that of a dedicated design optimised for speed. On the other hand, our solution is well suited to prototyping and proof-of-concept designs. In the thesis we analyse algorithm improvements in relation to the technical features of the chosen FPGA families. The resulting universal arithmetic solution needs to be complemented by an equally universal interface for connecting a control unit. As a result we obtained a building block, the multiplier, for use in cryptographic and cryptanalytic systems. For the multiplier it is possible to choose the occupied physical area, the computational time and the size of operands.

The second area we deal with is the generation of random numbers in the digital environment of integrated circuits. A random number generator (RNG) is the only cryptographic element for which there are no generally applied algorithms. The main reason is that the harvesting mechanism of an RNG is tightly bound to the target platform. Physical sources of randomness are very limited in digital devices. In addition, we deal with the problematic issue of randomness testing. We analyse the chosen RNG design under changing chip temperature. Finally, the proposed stochastic model of the generator allows a better understanding of its principle.

Abstract in Slovak 

V dizertačnej práci sa zaoberáme dvoma elementárnymi blokmi používanými v kryptografických systémoch s verejným kľúčom – prvým je násobička pre operácie s veľkými číslami, druhým je generátor náhodných čísel. Oba bloky sú realizované v technológii hradlových polí (obvody typu FPGA), čo umožňuje vytvorenie prototypu vo veľmi krátkom čase.

Našim hlavným cieľom v prípade násobičky je realizácia ľahko parametrizovateľného a škálovateľného riešenia, ktoré umožňuje prispôsobenie architektúry podľa cieľovej platformy a vlastností spracúvaných dát. Treba poznamenať, že dôsledkom flexibility riešenia je nižšia dosahovaná rýchlosť výpočtov. Na druhej strane, takéto riešenie je ideálne v prípade realizácie prototypov a návrhov, ktoré majú potvrdiť navrhovaný koncept riešenia. V práci sa zaoberáme prispôsobením štruktúry násobičky k architektúre cieľovej platformy vybraných rodín hradlových polí. Získané univerzálne riešenie je potrebné vybaviť rovnako univerzálnym rozhraním, ktoré umožní prepojenie výpočtovej jednotky ku rôznorodým typom riadiacich jednotiek. Ako výsledok sme získali stavebný prvok kryptografických a kryptoanalytických systémov, pre ktorý je možné zvoliť veľkosť obsadenej plochy na cieľovej platforme, rýchlosť vykonávanej operácie násobenia a veľkosť akceptovaných parametrov.

Druhou oblasťou, ktorou sa v práci zaoberáme je oblasť generovania náhodných postupností v prostredí číslicových integrovaných obvodov. Generátor náhodných čísel (RNG) je jediným prvkom kryptografických systémov, ktorého princíp nie je daný medzinárodným štandardom. Hlavným dôvodom je to, že spôsob získavania náhodných hodnôt je striktne závislý od cieľovej platformy pre implementáciu generátora. Fyzické zdroje entrópie použiteľné v číslicových obvodoch majú obmedzené možnosti, k čomu sa ešte pripája problematika testovania náhodnosti výstupnej postupnosti. Vybraný generátor analyzujeme z hľadiska jeho správania v meniacich sa tepelných podmienkach súčiastky, v ktorej je umiestnený. Predstavený stochastický model generátora približuje podstatu princípu generovania náhodnej postupnosti.


Declaration 

I hereby declare that this thesis is my own work and effort. Where other sources 

of information have been used, they have been acknowledged. 

Košice 2. 8. 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . .

Signature

Acknowledgement 

Several people contributed to the research results published in this thesis and to the fact that I can submit it for defence.

I am very grateful to my advisor Miloš Drutarovský for guiding me throughout my research, for his effort and dedication, and for all the time he found for me. I want to thank my special advisor Prof. Viktor Fischer for his great advice, support and research ideas, and for making my stay in France possible. I would like to express my gratitude to Prof. Dušan Levický for his help in tough situations during my stay at the department he leads.

Big thanks go to Nathalie Bochard and Frédéric Celle for very good cooperation and help regarding FPGA design. I was glad to meet Corinne Fournier and Loïc Denis, who made my weekends very enjoyable. Thanks to all members of the Hubert Curien Laboratory in Saint-Etienne; I had a nice time with you.

I would like to thank all my colleagues from the COSY group, especially Jan Pelzl for very fruitful joint work on the hardware implementation of ECM. I am grateful to Prof. Christof Paar, who allowed me to work in his research group and gain such priceless experience. Thanks to Sandeep Kumar, Andy Rupp and Axel Poschmann for a great time in Bochum, spent on research, but not only. Special thanks go to Irmgard Kühn for making my contact with all the bureaucracy much easier.

From the COSIC group I would like to thank Prof. Ingrid Verbauwhede and Prof. Bart Preneel for making it possible to join their team in Leuven. Thanks to Lejla Batina and Elke De Mulder for involving me in side-channel attack research, and to all members of COSIC for creating a great atmosphere there.

I want to thank my family for their encouragement and support, and especially my sister Katka for all our inspiring discussions.

Most importantly, I thank my dear Kasia for her endless love and patience.

Thanks to all of you!

Martin

Preface 

Public-key cryptography systems are widely used to digitally sign or encrypt data. In this way we assure the integrity and confidentiality of the signed message and provide authentication and non-repudiation for the signer. The complexity of the computations affects the performance of the system, especially for long keys. The security of the operations is based on the secrecy of the private key, while its public part and the algorithm itself are publicly known.

In the first part of the thesis we analyse the computational part of these systems and focus on a flexible implementation of a modular multiplier. The output of this research was applied to estimate how much the performance of the Elliptic Curve Method (ECM) increases thanks to its hardware realisation. The scalable nature of the multiplier was carried through the whole design, and the proof-of-concept implementation was designed and tested in a very short time.

In the second part of the document we focus on the key-generating element, a Random Number Generator (RNG). An already known design was analysed from several aspects, and we provide results in the form of a stochastic model of the RNG and proposed testing methods suitable for this type of RNG.

The target platform for the selected building blocks of cryptosystems is the FPGA (Field Programmable Gate Array), which offers reduced development time, a wide range of devices and a high level of security. In the thesis we analyse particular device families from FPGA vendors that include dedicated electronic elements used in our designs. Parameters of the blocks and algorithm improvements may have a significant impact on the performance of the system.

The three topics of the thesis give a picture of the level of complexity in cryptology and underline the relevance of research in the area of cryptographic system implementation.

Contents 

Introduction

1 Montgomery Modular Multiplication in Hardware - preliminaries

1.1 Implementation Platforms

1.2 RSA Algorithm

1.2.1 Modular Exponentiation and Multiplication

1.2.2 Hardware Implementations of the MMM

1.3 EC in Cryptography

1.4 Conclusions

2 Montgomery Modular Multiplication in Hardware

2.1 Scalable MMM design

2.1.1 Scalable Multiple-Word Algorithms

2.1.2 Comparison of Implementation Approaches

2.2 Multiplier Architecture

2.2.1 Adder Concepts

2.2.2 Memory Block

2.2.3 Interface to Controller

2.3 Implementation of the MMM

2.3.1 Comparison of CSA and CPA PE

2.3.2 Montgomery Multiplication Coprocessor

2.3.3 Hardware-Software Co-design of MMM: a Case Study

2.3.4 Implementation Results

2.4 Conclusions and Future Work

3 Elliptic Curve Method in Hardware - preliminaries

3.1 Integer Factoring

3.1.1 Factoring Algorithms

3.1.2 Motivation for Hardware Implementation

3.2 Previous Implementations of ECM

3.3 Mathematical Background

3.3.1 Pollard’s (p − 1)-algorithm

3.3.2 ECM Algorithm

4 Elliptic Curve Method in Hardware

4.1 Parameterisation of the ECM Algorithm

4.1.1 Phase 1

4.1.2 Phase 2

4.2 Design of the ECM Unit

4.2.1 Control Logic

4.2.2 Memory Management

4.2.3 Choice of the Arithmetic Algorithms

4.2.4 Parallelization of the Algorithm

4.3 Implementation of the ECM Unit

4.3.1 Hardware Platform

4.3.2 Results

4.3.3 ECM-Based Acceleration of GNFS: a Case Study

4.4 Conclusions and Future Steps

5 True Random Number Generator - preliminaries

5.1 Randomness

5.1.1 Definitions of Randomness

5.1.2 Random Number Generator

5.1.3 Applications of Random Numbers

5.2 TRNG Implementations in Digital Systems

5.2.1 Sources of Randomness

5.2.2 Survey of Designs Based on Jitter

5.3 PLL-Based TRNG on FPGA

5.3.1 Randomness Extraction Method

5.3.2 Coherent Sampling

5.4 Testing of TRNGs

5.5 Attacks against TRNG

5.6 Conclusions

6 True Random Number Generator

6.1 Clock Synthesis in FPGAs

6.1.1 PLL as Source of Randomness

6.2 PLL-Based TRNG on FPGA

6.2.1 PLL Configurations

6.2.2 Analysis of TRNG in Altera Stratix FPGAs

6.2.3 Analysis of TRNG in Actel FPGAs

6.2.4 Stochastic Model of PLL-TRNG

6.3 Active Non-Invasive Attack on TRNG

6.3.1 Attack description

6.3.2 Measurements results

6.4 Conclusions and Further Research

7 Research Contribution

Bibliography


List of Figures 

1 – 1 Typical architecture of the smallest functional unit in an FPGA.

1 – 2 RSA encryption scheme when A sends an encrypted message to B. First A receives B’s public key upon request, then A encrypts a message X using B’s public key: Y = X^E mod M. Finally B decrypts the received message Y using its own private key: X = Y^D mod M.

2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU connected by a w-bit data-path

2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs

2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA

2 – 4 Block diagram of the CPA-based w-bit MWR2MM processing element (CPA PE) based on FA

2 – 5 Pipelined organisation of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory

2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits

2 – 7 Proposed universal interface for the MMM coprocessor

4 – 1 Architecture of the ECM unit

4 – 2 Organisation of the ECM unit’s memory registers for 21 variables with e words of width w

4 – 3 Scalable addition and subtraction unit for operands with word width w

5 – 1 Schematic diagram of a TRNG with designation of internal signals and interfaces

5 – 2 Illustration of stable states (0 and 1) and undefined metastable state

5 – 3 Timing jitter in clock signal

5 – 4 Ring oscillator structures proposed by Golić.

5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of the output sequence.

5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge of the CLK signal (illustrated for K_M = 5 and K_D = 7)

6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]

6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits

6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div)

6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs

6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000 periods T_Q

6 – 6 Block diagram of design for on-chip samples reordering

6 – 7 Reordered samples from generator measured by oscilloscope

6 – 8 Sampled waveform of a clock signal for TRNG for configuration A for temperatures in the range −40 °C to +30 °C.

6 – 9 Sampled waveform of a clock signal for TRNG for configuration B for temperatures in the range −40 °C to +32 °C.

6 – 10 Amount of sampled ones during 1000 sampling periods for TRNG with configuration A (detail of the rising edge).

6 – 11 Amount of sampled ones during 1000 sampling periods for TRNG with configuration B, with low-pass loop filter (detail of the rising edge).

6 – 12 Amount of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration A.

6 – 13 Amount of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration B.

6 – 14 Comparison of probability histograms for the jitter measured at a temperature of 20 °C in TRNG with configuration A and B. Data were measured around the rising edge of the sampled clock waveform.

6 – 15 Difference in number of sampled ones for critical samples at the boundary temperatures −40 °C and +30 °C in TRNG with configuration A and B around the rising edge of the sampled clock waveform.


List of Tables 

1 – 1 Comparison of the key length (in bits) for equivalent security level for public-key cryptosystems

2 – 1 Address of operands from host processor level (LSB right)

2 – 2 PE sizes and speeds for old style Altera FPGAs

2 – 3 PE sizes and speeds for new style Altera FPGAs

2 – 4 Area occupation in number of LEs and maximal clock frequency f_clkMMM (MHz) of the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm

2 – 5 Execution times of software implementation of MMM on Altera Nios development board (with APEX EP20K200 clocked at 50 MHz)

2 – 6 Execution times of mixed hardware-software implementation of MMM on Altera Nios development board (with APEX EP20K200) for the CSA PE

2 – 7 Execution times of mixed hardware-software implementation of the MMM on Altera Nios development board (with APEX EP20K200) for the CPA PE

4 – 1 Computational complexity and memory requirements for phase 2 depending on D

4 – 2 A command syntax for the ECM unit (LSB left)

4 – 3 Running Times of the ECM Implementation (198 bits modulus), p = 2, w = 32 (Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)

6 – 1 Parameters of PLL embedded in Altera FPGAs

6 – 2 Parameters of PLL embedded in Actel FPGAs

6 – 3 Parameters settings for different TRNG configurations

6 – 4 Configuration parameters of tested TRNG

6 – 5 Results of quality evaluation of tested TRNG configurations

6 – 6 Achievable sensitivity on jitter using two clock signals in Actel ProASICplus (F_CLI = 40 MHz)

6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICPlus

6 – 8 Mean values measured using the stochastic model E[p_i] and the output sequence of the TRNG m = E[x(nT_Q)]

6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples influenced by the jitter at different chip temperatures

List of Algorithms 

1 – 1 Montgomery exponentiation algorithm [86]; the definition of M′ requires that gcd(M, R) = 1; b denotes base or radix.

1 – 2 The Montgomery modular multiplication algorithm for k-bit operands X = (x_{k−1}, ..., x_1, x_0), Y, and M

1 – 3 The basic radix-2 Montgomery multiplication algorithm for k-bit operands X = (x_{k−1}, ..., x_1, x_0), Y, and M

1 – 4 Optimized radix-2 Montgomery multiplication algorithm

1 – 5 Key generation in ECC [78]

1 – 6 Message signing in ECC [78]

2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm

2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm

3 – 1 Elliptic Curve Method

3 – 2 Exponentiation for Curves in Montgomery Form

4 – 1 Modified MWR2MM algorithm

4 – 2 Modular addition

4 – 3 Modular subtraction

List of Symbols and Abbreviations 

A^(x) the x-th word of vector A

A_{x..y} particular range of bits in a vector A from position x to position y

A_x^(y) bit position of the y-th word of A

B bound of smoothness

D a parameter in improved standard continuation of ECM

D_CLJ dividing factor for CLJ clock signal

D_CLK dividing factor for CLK clock signal

F_CLJ frequency of CLJ clock signal

F_CLK frequency of CLK clock signal

K_D decimation factor of CLK clock signal

K_M decimation factor of CLJ clock signal

M modulus

M_CLJ multiplication factor for CLJ clock signal

M_CLK multiplication factor for CLK clock signal

S partial sum

T_Q time period of bit generation

T_CLJ time period of CLJ clock signal

T_CLK time period of CLK clock signal

X multiplier

Y multiplicand

φ canonical homomorphism

φ() Euler totient function

π(p) prime counting function, number of primes ≤ p

σ_jit standard deviation of jitter

xA the x-th part of vector A

b base or radix

e number of words

k length of operands

n positive integer to be factored

p prime factor

w word width

ALU Arithmetic Logic Unit 

ASIC Application-Specific Integrated Circuit

AT Area-Time 

CASR Cellular Automata Shift Register

CLB Configurable Logic Block 

CPA Carry Propagate Adder 

CPU Central Processing Unit 

CRT Chinese Remainder Theorem

CSA Carry Save Adder 

DJ Deterministic Jitter 

DLL Delay Locked Loop 

DSA Digital Signature Algorithm 

EC Elliptic Curves 

ECC Elliptic Curve Cryptography 


ECDLP Elliptic Curve Discrete Logarithm Problem 

ECDSA Elliptic Curve Digital Signature Algorithm 

ECM Elliptic Curve Method 

EPLL Enhanced PLL 

FA Full Adder 

FPGA Field Programmable Gate Array 

FPLL Fast PLL 

gcd Greatest Common Divisor 

GMP GNU Multiple Precision 

GNFS Generalised Number Field Sieve 

I/O Input/Output 

IP Intellectual Property 

ITU International Telecommunication Union

LAB Logic Array Block 

LE Logic Element 

LFSR Linear Feedback Shift Register 

LPM Library of Parameterized Modules 

LSB Least Significant Bit 

LUT Look-Up Table 

MM Modular Multiplication 

MMM Montgomery Modular Multiplication

MPQS Multiple Polynomial Quadratic Sieve 

MSB Most Significant Bit 


MWR2MM Multiple Word Radix-2 Montgomery Multiplication 

NA Not Available 

P&R Place and Route 

PCI Peripheral Component Interconnect 

PE Processing Element 

PLL Phase Locked Loop 

PRNG Pseudo-Random Number Generator 

RAM Random Access Memory 

RFID Radio Frequency Identification 

RISC Reduced Instruction Set Computer 

RJ Random Jitter 

RMS Root Mean Square 

RNG Random Number Generator 

RO Ring Oscillator 

ROM Read-Only Memory 

SIMD Single Instruction Multiple Data 

SOC System on a Chip 

SOS Separated Operand Scanning 

TRNG True-Random Number Generator 

UART Universal Asynchronous Receiver/Transmitter 

VCCIO Positive Supply Voltage for IO Pins 

VCO Voltage Controlled Oscillator 

VHDL VHSIC Hardware Description Language 

VHSIC Very High Speed Integrated Circuit 


Introduction 

In this thesis we analyse two elementary blocks of almost every public-key cryptosystem: a multiplier for operations on very long operands and a random number generator.

For the multiplier, our main goal is a scalable and parametrised design for fast prototyping in Field Programmable Gate Arrays (FPGAs). Flexibility of the design is traded off against computational latency, so this concept is suitable mostly for prototyping and proof-of-concept designs. As a secondary objective we want to utilise a selected FPGA family effectively and exploit its specific features. In this way we can analyse the suitability of a given algorithm for the selected FPGA platform. Such an approach is particularly appropriate when the final implementation platform is the same FPGA family.

A flexible and effective multiplier design could offer a universal solution for applications with different asymmetric algorithms or for similar systems based on the same algebraic operations. Our goal is to design and implement a multiplier block with a universal interface that can be included in a variety of cryptosystems and that allows its configuration parameters to be changed, e.g. the length of the input parameters, the computational time and the occupied area.

Another area of our focus is random numbers, namely their generation on digital platforms. The design of a Random Number Generator (RNG) depends significantly on the target implementation platform. Therefore we analyse the features of FPGA devices, change the working conditions to simulate attacker behaviour, and describe the relations between the parameters of the generator and the statistical parameters of the generated sequence.

The classification of generators depends heavily on the level of their description. According to the latest trends in this research area, designers of RNGs should provide, in addition to statistical test results, a detailed analysis and model of the generator. The generator’s behaviour needs to be explained in detail and supported by practical experiments. Special attention in RNG design should be paid to the testability of the RNG. The tests can be performed on the generated sequence. However, as we will show, more effective testing methods exist for the analysed generator. The proposed methods should take into account the fundamental principle of extracting the random values.

In Chapter 1 we introduce the mathematical background of the two best-known and most widely used algorithms for public-key cryptosystems, RSA and Elliptic Curve Cryptography (ECC). In these computationally intensive public-key algorithms we identify the most expensive and also most frequently used operation: modular multiplication. A comparison of operand lengths shows the range for which a universal architecture needs to be found.

Chapter 2 presents our design approach and implementation results for the Montgomery multiplier. We compare two designs that differ in how carry bits are handled in the adders inside the multiplier block. The analysis suggests which technique is suitable for a given platform architecture. We present a scalable architecture of an algebraic coprocessor that is suitable for the multiplier. A communication interface between the coprocessor and a control unit is also discussed. The final case study provides results of our hardware-software co-design of the multiplier in applications with a soft-core processor and a dedicated coprocessor.

In Chapter 3 we start with the mathematical background of integer factoring methods and provide details on the Elliptic Curve Method (ECM) algorithm, including its first and second phases. The motivation for a hardware implementation of the algorithm and previous implementation approaches are summarised.

Chapter 4 describes the first published hardware implementation of the ECM for factoring numbers up to 200 bits. An ECM unit design is introduced and we discuss how the implemented algorithms were chosen. In the final section we present the implementation results of the ECM unit and a case study of its application in a well-known factoring method.

Randomness is the main topic of Chapter 5. We discuss the features required of random sequences intended for cryptographic applications. We describe in detail the design of an RNG with a focus on digital devices and analyse the available sources of randomness. The last part of the chapter reviews recently published RNG concepts, with a focus on a solution based on a Phase-Locked Loop (PLL). The sections on tests and attacks summarise the available knowledge in these areas.

In Chapter 6 we present the results of our research on the PLL-based RNG. Starting with an analysis of PLL parameters in available FPGA devices, we describe the design process for two FPGA vendors. Thanks to observations of the RNG’s internal signals we were able to introduce a stochastic model of the generator and describe its behaviour under changing chip temperature. Based on empirical experiments we extend the design process with additional requirements in order to achieve a more robust solution.

The research contribution of the thesis is summarised in the final Chapter 7 

where we collect the results from all three topics discussed in the thesis. 


1 Montgomery Modular Multiplication in Hardware - preliminaries

Many popular public-key cryptographic algorithms and protocols, such as RSA, ElGamal, elliptic curve cryptography (ECC) and Diffie-Hellman [86], make extensive use of modular operations with large numbers. The typical operand size is 160-300 bits in ECC and 1000-2000 bits in RSA.

We start the chapter with a discussion of the optimal choice of the computation method and of its implementation with respect to the chosen implementation platform (Section 1.1). Section 1.2 summarises the RSA algorithm and briefly analyses the available algorithms for modular multiplication. We discuss aspects of hardware implementation and review the available papers in this area. Finally, the algorithm implemented later in the thesis and its modification are introduced. Section 1.3 starts with the definition of elliptic curves (EC) and continues with their application in cryptography. The last section summarises the most important features of the presented public-key algorithms and identifies the part of the system that matters most for an effective implementation.

1.1 Implementation Platforms 

Implementing all parts of a cryptosystem (encryption, authentication, key storage, random number generation, ...) on the same platform yields a highly compact and therefore potentially more secure implementation: the more signals an adversary can observe, the more information about the processed data can be obtained.

In the past, hardware and software platforms were developed separately, apart from the initial requirements and the definitions of data formats and interfaces. Nowadays, so-called hardware-software co-design tries to find an optimum in the effective utilisation of resources: some operations are implemented as hardware structures and the others as software functions. Reconfigurable devices with embedded soft-core processors are well suited to this approach. However, developing mixed systems is not a trivial task for designers, especially at the stage where the division of tasks is decided. Systems that allow the performance of a proposed software-hardware architecture to be simulated and evaluated before a real (and expensive) implementation are only at an early stage of development (see e.g. the GEZEL language and design environment [99]).


Hardware implementation platforms offer a higher level of security, thanks to the possibility of physically separating sensitive data, and, depending on the operations, also higher performance than comparable software implementations.

The following can be considered as hardware platforms:

• ASIC (Application-Specific Integrated Circuit), 

• FPGA (Field-Programmable Gate Array) or 

• RFID (Radio Frequency Identification) chip. 

There are different approaches to implementing cryptosystems. An implementation can provide supporting functions for a general-purpose processor, cover all crypto-related operations in a standard system, or even represent a complete system able to replace the original non-secured system.

Depending on the application, the implementation can take the form of a smart card, an IP (Intellectual Property) core, a co-processor, a PCI card, a router, etc. With growing chip area it is possible to implement a CPU, memory blocks, peripherals, interfaces and a co-processor on a single chip, providing a so-called system-on-a-chip (SOC). Cryptography in particular calls for implementing systems as an SOC, which hides the internal signals from possible abuse by an adversary. The SOC raises another requirement, namely to find a way to implement all of its parts on the same chip and the same platform, if possible sharing the same resources.

Applications have various requirements on area, speed, energy, or power consumption. For cryptosystems we additionally define a level of security that takes into account the vulnerability to eavesdropping and side-channel attacks, or the ability of the system to detect an attack and then delete the sensitive data in a way that makes it impossible for an adversary to restore them (tamper resistance). The conditions required to certify cryptographic implementations at a certain level of security, and the areas where such systems can be used, are defined in standards of well-known standardisation organisations [57].

Reconfigurable Devices A reconfigurable device is a hardware architecture in which both the functionality of the processing elements and the interconnection between them can be modified after fabrication. The best-known reconfigurable hardware components are FPGAs.


Cryptographic primitives belong to the group of systems suitable for reconfigurable devices thanks to the following features:

• standardized algorithms - most cryptographic algorithms, with the exception of random number generators, are approved by international standards organisations (e.g. [54–56,58,59]). The functionality described by mathematical algorithms and equations can thus be studied in depth and tailored to the hardware structure. The group of algorithms considered secure may change over time due to newly invented attacks. A reconfigurable platform makes it possible to remove obsolete algorithms from running systems and provide new ones, even without a hardware update or exchange.

• several supported functionality modes and operand lengths - while the number of the most popular algorithms is limited, each of them offers a set of selectable parameters, which results in a need to implement a group of algorithm combinations.

• sequential structure - depending on the running operation, only selected cryptographic blocks need to be programmed in the device, and when the operation changes, another configuration is loaded. An example is a scheme in which a secret key is distributed to the parties by an asymmetric algorithm at the beginning of the communication, and the asymmetric algorithm is later replaced by a faster symmetric encryption implemented on the same device.

FPGA Architecture The underlying FPGA architecture consists of an array of the smallest programmable units - logic elements (LE) or configurable logic blocks (CLB) - and programmable connection switches. A typical FPGA consists of a large number (hundreds to thousands) of LEs and routing channels of different length and speed. By LE we understand the smallest functional unit that is addressed by the mapping tools. Typically it consists of a look-up table (LUT) and a register (D flip-flop) (see Figure 1 – 1), which makes it possible to implement combinational as well as sequential logic, or a small memory block. Additionally, the FPGA architecture may include dedicated blocks for other functions, e.g. for storing data, computing multiplication and addition, or synthesising clock signals.

Modern FPGAs support the implementation of a wide range of algorithms from the areas of signal processing, communication or networking. Cryptographic algorithms and protocols can be represented as a sequence of algebraic functions in a chosen operational area, and the operations in cryptography are often similar to the ones used in the fields mentioned above. Therefore the optimised blocks in the FPGA structure provide means for an efficient realisation of cryptographic primitives, too.

[Figure 1 – 1: block diagram with the labels data inputs, clock, look-up table, carry input, carry chain, carry output, D flip-flop, data outputs]

Figure 1 – 1 Typical architecture of the smallest functional unit in an FPGA.

Security, the additional property of cryptosystems, is supported by FPGA vendors, who enhance their devices with hard-wired encryption cores and special-purpose memories. With the rising importance of cryptography, FPGA vendors will be pushed to provide more and more features supporting the security of FPGA-based cryptosystems, as proposed in [93]. More information on FPGA features and their relation to the implementation of cryptosystems, including an analysis of possible attacks, can be found in [122].

1.2 RSA Algorithm 

Currently the most popular asymmetric cryptosystem is RSA, which was published by Ronald Rivest, Adi Shamir and Leonard Adleman in 1978 [96].

A private key for the RSA algorithm consists of two large primes $p$ and $q$ of comparable size and a secret exponent $D$. A public key is represented by an exponent $E$ and a modulus $M$, where

\[ M = pq . \tag{1.1} \]

The Euler totient function $\varphi(M)$ is defined as the number of positive integers smaller than $M$ that are relatively prime to $M$, thus:

\[ \varphi(M) = (p - 1)(q - 1) . \tag{1.2} \]

The public exponent $E$ must therefore satisfy:

\[ \gcd(E, \varphi(M)) = 1 . \tag{1.3} \]

The private exponent $D$ is chosen such that:

\[ D = E^{-1} \bmod \varphi(M) . \tag{1.4} \]

While the public key is the pair (M, E), the private key can be kept in two possible forms: simply as the pair (M, D), or in an extended form that also includes the primes p and q. The latter form allows a faster decryption algorithm based on the Chinese Remainder Theorem (CRT).
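The key relations of Equations 1.1-1.4 and the CRT-based decryption can be illustrated with a minimal Python sketch. The primes below are toy values assumed only for illustration (real keys use primes that are hundreds of digits long), and the helper name `crt_decrypt` as well as the intermediate names `d_p`, `d_q` and `q_inv` are ours, not the thesis's. The recombination step follows Garner's formula, one common way of applying the CRT.

```python
from math import gcd

# Toy parameters (assumed for illustration only; far too small to be secure).
p, q = 61, 53
M = p * q                      # Equation 1.1
phi = (p - 1) * (q - 1)        # Equation 1.2
E = 17
assert gcd(E, phi) == 1        # Equation 1.3
D = pow(E, -1, phi)            # Equation 1.4 (modular inverse, Python 3.8+)


def crt_decrypt(Y, p, q, D):
    """Compute Y^D mod (p*q) from two half-size exponentiations."""
    d_p, d_q = D % (p - 1), D % (q - 1)   # reduced private exponents
    q_inv = pow(q, -1, p)                 # q^(-1) mod p
    x_p = pow(Y % p, d_p, p)              # result modulo p
    x_q = pow(Y % q, d_q, q)              # result modulo q
    h = (q_inv * (x_p - x_q)) % p         # Garner's recombination
    return x_q + h * q


X = 65
Y = pow(X, E, M)                          # ciphertext of X under (M, E)
assert crt_decrypt(Y, p, q, D) == pow(Y, D, M) == X
```

The two exponentiations work with moduli and exponents of half the length, which is where the speed-up of the extended private key form comes from.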

The basic mathematical operation used by RSA for cryptographic operations (encryption and digital signature) is modular exponentiation. To encrypt a message $X$ with a public key $(M, E)$ one applies the following equation [86]:

\[ Y = X^{E} \bmod M . \tag{1.5} \]

A received encrypted message $Y$ is decrypted with the private key pair $(M, D)$ by calculating:

\[ X = Y^{D} \bmod M . \tag{1.6} \]

Similarly to encryption, the RSA signature scheme employs modular exponentiation for the generation of a signature $I$ for a message text $X$

\[ I = X^{D} \bmod M \tag{1.7} \]

and for its verification

\[ X = I^{E} \bmod M . \tag{1.8} \]

Note that in the encryption scheme Alice, as the sending party, uses the public key of the receiver Bob to encrypt the message; in this case only Bob, who knows his private key, is able to decrypt it (see Figure 1 – 2). In the case of a message signature, Alice signs the message using her private key to prove its authenticity, and anybody who has Alice’s public key is then able to verify her signature.
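The four operations of Equations 1.5-1.8 all reduce to one modular exponentiation each, as the following Python sketch shows. The key (M, E, D) = (3233, 17, 2753) is a toy example derived from the assumed primes 61 and 53, and the function names are hypothetical. Real RSA additionally applies padding and hashing, which are outside the scope of this sketch.

```python
# Toy key built from the assumed primes p = 61, q = 53 (illustration only).
M, E, D = 3233, 17, 2753


def encrypt(X):
    """Equation 1.5: uses the receiver's public key (M, E)."""
    return pow(X, E, M)


def decrypt(Y):
    """Equation 1.6: uses the receiver's private key (M, D)."""
    return pow(Y, D, M)


def sign(X):
    """Equation 1.7: uses the signer's private key (M, D)."""
    return pow(X, D, M)


def verify(X, I):
    """Equation 1.8: anybody holding (M, E) can check the signature."""
    return pow(I, E, M) == X


X = 65
Y = encrypt(X)
assert decrypt(Y) == X
assert verify(X, sign(X))
```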

[Figure 1 – 2: diagram of the message exchange between A, who holds the message X, and B, who holds (M, E, D); A sends a request for B's public key, B returns the key (M, E), and A sends the encrypted message $Y = X^{E} \bmod M$]

Figure 1 – 2 RSA encryption scheme when A sends an encrypted message to B. First A receives B’s public key upon request; then A encrypts a message X using B’s public key, $Y = X^{E} \bmod M$. Finally B decrypts the received message Y using its own private key, $X = Y^{D} \bmod M$.

1.2.1 Modular Exponentiation and Multiplication 

The modular exponentiation used in the encryption and signature schemes of RSA (see Equations 1.5-1.8) and in other public-key cryptographic algorithms can be computed as a series of modular multiplications (MMs) in two ways:

• interleaved by a modular reduction, or

• with a final reduction step.

The best-known method from the first category, the Montgomery modular multiplication (MMM) invented by P. L. Montgomery [88], is discussed further in this work. For the second approach, a multiplication followed by a division, one can use the popular Karatsuba-Ofman multiplication [76] in combination with Barrett’s reduction [25].

The MM can be a very slow operation when performed on general-purpose computers. The currently suggested operand length (e.g. for RSA) of 1024 bits and more is far above the typical operand length of such processors (8-32 bits). Performing mathematical operations on the extra-long RSA variables can therefore be a limiting factor for units optimised for the 8-, 16- or 32-bit variables that are more typical, e.g., in signal processing. This motivates the design of special algebraic units that perform modular operations more efficiently. Better performance and efficiency of the implementation are achieved by adapting the algorithms and by exploiting platforms with a reconfigurable architecture.

The RSA modular exponentiation does not allow a straightforward implementation. It requires algorithms that, for example, divide long operands into shorter words, taking into account the physical limitations of the structures in the selected hardware platform. When the operand length may change, the optimal solution is a design in which the operand length determines only the computation time of an operation, while the performance of the unit itself remains constant for an arbitrary length.
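The idea of dividing long operands into shorter words can be sketched in Python as a word-serial addition with carry propagation. This is only an illustration under assumed parameters (word width `w`, word count `n`); the helper names `to_words`, `from_words` and `add_words` are hypothetical and do not describe the architecture developed in the thesis. The loop body is the same for every word, so a longer operand only means more iterations.

```python
def to_words(x, w, n):
    """Split the integer x into n little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(x >> (w * j)) & mask for j in range(n)]


def from_words(words, w):
    """Reassemble an integer from little-endian w-bit words."""
    return sum(word << (w * j) for j, word in enumerate(words))


def add_words(a, b, w):
    """Add two equally long word vectors; return (sum words, carry out)."""
    mask = (1 << w) - 1
    carry, out = 0, []
    for wa, wb in zip(a, b):
        t = wa + wb + carry        # fits into w + 1 bits
        out.append(t & mask)       # low w bits become the result word
        carry = t >> w             # the top bit is passed to the next word
    return out, carry


# A 1024-bit addition carried out on 32 words of 32 bits.
w, n = 32, 32
x, y = (1 << 1024) - 1, 12345
s_words, c = add_words(to_words(x, w, n), to_words(y, w, n), w)
assert from_words(s_words, w) + (c << (w * n)) == x + y
```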

Montgomery Methods The MMM provides a very efficient way of computing the modular exponentiation. For security reasons, the input operands of the baseline algebraic operations of the RSA algorithm described by Equations 1.5-1.8 are very long. The RSA key length is currently being raised from 1024 to 2048 bits, because the results of factorisation efforts are getting ever closer to the lower of the two standard values. With operands of doubled precision it is even more desirable to find algorithms that minimise the number of algebraic operations as well as their complexity.

The Montgomery reduction allows an efficient implementation of the MM without the classical modular reduction step, which is even more expensive than the multiplication. It therefore pays off to minimise the number of required reductions or to use algorithms that avoid the division.

In the Montgomery exponentiation algorithm (Algorithm 1 – 1 [86]) the modular exponentiation unrolls into a series of MMMs. Thanks to the transformation into the Montgomery domain and the application of the MMM, it is possible to avoid the unwanted modular reduction during the computations.

We continue with a description of the MMM and of the conversion operations applied in Algorithm 1 – 1.

Given two integers $X$ and $Y$ ($X, Y < M < R$) and a $k$-bit modulus $M$ with $\gcd(M, R) = 1$, the MMM algorithm computes

\[ S = \mathrm{MMM}(X, Y) = (X Y R^{-1}) \bmod M , \tag{1.9} \]

where $R^{-1}$ is the inverse of $R = b^{k}$ modulo $M$ and $b$ denotes a base or radix. The $M$-residue $\bar{X}$ of an integer $X < M$ is defined as [41]:

\[ \bar{X} = X R \bmod M . \tag{1.10} \]

For the conversion to the Montgomery domain we can use the MMM function as follows:

\[ \mathrm{MMM}(X, R^{2}) = X R^{2} R^{-1} \bmod M = X R \bmod M = \bar{X} . \tag{1.11} \]


Algorithm 1 – 1 Montgomery exponentiation algorithm [86]; the definition of $M'$ requires that $\gcd(M, R) = 1$, and $b$ denotes the base or radix.

Require: $M = (m_{k-1} \ldots m_0)_b$, $R = b^{k}$, $M' = -M^{-1} \bmod b$, $E = (e_t \ldots e_0)_2$ with $e_t = 1$, and an integer $X$, $1 \le X < M$. The values $R^{2} \bmod M$ and $R \bmod M$ may also be provided as precomputed inputs.

Ensure: $A = X^{E} \bmod M$.

1: $\bar{X} \Leftarrow \mathrm{MMM}(X, R^{2} \bmod M)$
2: $A \Leftarrow R \bmod M$
3: for $i = t$ down to $0$ do
4:     $A \Leftarrow \mathrm{MMM}(A, A)$
5:     if $e_i = 1$ then
6:         $A \Leftarrow \mathrm{MMM}(A, \bar{X})$
7:     end if
8: end for
9: $A \Leftarrow \mathrm{MMM}(A, 1)$
10: return $A$

The first operation in Algorithm 1 – 1 (Step 1) therefore maps the input value $X$ to its $M$-residue $\bar{X}$.

Now we show how the value $\bar{X}$ is mapped back to the ordinary integer $X$, which is done in the last operation of the exponentiation (Algorithm 1 – 1, Step 9). The Montgomery product of two $M$-residues $\bar{X}$, $\bar{Y}$ is itself the $M$-residue $\bar{S}$ of the product $S = XY \bmod M$:

\[ \begin{aligned}
\bar{S} &= \mathrm{MMM}(\bar{X}, \bar{Y}) \\
        &= \bar{X} \bar{Y} R^{-1} \bmod M \\
        &= X R \, Y R \, R^{-1} \bmod M \\
        &= X Y R \bmod M \\
        &= S R \bmod M ,
\end{aligned} \tag{1.12} \]

so the final operation required to convert the $M$-residue $\bar{S}$ back into $S$ is defined as:

\[ \begin{aligned}
S &= \bar{S} R^{-1} \bmod M \\
  &= 1 \cdot \bar{S} R^{-1} \bmod M \\
  &= \mathrm{MMM}(1, \bar{S}) .
\end{aligned} \tag{1.13} \]

The algorithm works for any modulus M provided that gcd(M, R) = 1. This is always the case in RSA: M = pq is a product of two large primes and therefore odd, while R is a power of 2 and therefore has no odd factor.
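A small worked example may make the conversions into and out of the Montgomery domain more concrete. The values M = 13, R = 2^4 = 16, X = 5 and Y = 7 are chosen here purely for illustration and do not come from the thesis; a bar denotes the M-residue. Since 16 ≡ 3 (mod 13) and 3 · 9 = 27 ≡ 1 (mod 13), the inverse is R^{-1} = 9.

```latex
\begin{align*}
\bar{X} &= 5 \cdot 16 \bmod 13 = 2, \qquad \bar{Y} = 7 \cdot 16 \bmod 13 = 8, \\
\bar{S} &= \mathrm{MMM}(\bar{X}, \bar{Y}) = 2 \cdot 8 \cdot 9 \bmod 13 = 144 \bmod 13 = 1, \\
S &= \mathrm{MMM}(1, \bar{S}) = 1 \cdot 1 \cdot 9 \bmod 13 = 9 = 5 \cdot 7 \bmod 13 .
\end{align*}
```

The intermediate value 1 is indeed the M-residue of the product, because 9 · 16 mod 13 = 144 mod 13 = 1.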

The MMM algorithm for k-bit operands X = (xk−1, . . . , x1, x0), Y , and M is 

given as Algorithm 1 – 2 [86]. 

Algorithm 1 – 2 The Montgomery modular multiplication algorithm for k-bit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, $Y$, and $M$

Require: $M = (m_{k-1} \ldots m_0)_b$, $X = (x_{k-1} \ldots x_0)_b$, $Y = (y_{k-1} \ldots y_0)_b$, with $0 \le X, Y < M$, $R = b^{k}$ with $\gcd(M, b) = 1$, and $M' = -M^{-1} \bmod b$.

Ensure: $S = X Y R^{-1} \bmod M$.

1: $S \Leftarrow 0$, $S = (s_{k-1} \ldots s_0)_b$
2: for $i = 0$ to $k - 1$ do
3:     $q_i \Leftarrow (s_0 + x_i y_0) M' \bmod b$
4:     $S \Leftarrow (S + x_i Y + q_i M)/b$
5: end for
6: if $S \ge M$ then
7:     $S \Leftarrow S - M$
8: end if
9: return $S$
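The steps of Algorithm 1 – 2 translate almost line by line into Python. The sketch below is a software model for checking the arithmetic, not a description of the hardware; the function name `mont_mul` and the test values (M = 3233, b = 16, k = 3) are assumptions made for the example. Python's arbitrary-precision integers stand in for the digit vectors.

```python
def mont_mul(X, Y, M, b, k):
    """Digit-serial Montgomery product X*Y*b^(-k) mod M (gcd(M, b) = 1)."""
    M_prime = (-pow(M, -1, b)) % b            # M' = -M^(-1) mod b
    S = 0                                     # step 1
    for i in range(k):                        # step 2
        x_i = (X // b**i) % b                 # i-th radix-b digit of X
        q_i = ((S % b) + x_i * (Y % b)) * M_prime % b   # step 3
        S = (S + x_i * Y + q_i * M) // b      # step 4, the division is exact
    if S >= M:                                # steps 6-8
        S -= M
    return S


# Check against the definition S = X*Y*R^(-1) mod M with R = b^k.
M, b, k = 3233, 16, 3                         # 16^3 = 4096 > M, M odd
R_inv = pow(b**k, -1, M)
assert mont_mul(65, 17, M, b, k) == 65 * 17 * R_inv % M
```

The choice of q_i makes the lowest digit of S + x_i·Y + q_i·M zero, which is why the division by b never leaves a remainder.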

Thanks to the pre-computation step of Algorithm 1 – 2, the expensive modular division is avoided during the computation. For a single multiplication, the classical modular multiplication algorithm would be faster than the MMM because of the rather expensive transformation into the Montgomery domain (M-residue) and back. It is therefore more effective to stay in that domain as long as possible and to transform the operands back to the ordinary representation only at the very end of the computation. This pays off for a long sequence of MMMs, as in the case of the modular exponentiation (Algorithm 1 – 1).


In Algorithm 1 – 1 the input operand $X$ is transformed into the Montgomery domain ($\bar{X}$) at the beginning (Step 1). A series of MMMs in the Montgomery domain follows. Finally, in the last step (Step 9), the result is transformed back to the normal domain. In this way the advantage of computing in the Montgomery domain is fully exploited. The MMM is considered the most effective method for the modular exponentiation applied, e.g., in the RSA cryptographic algorithm.
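The flow of Algorithm 1 – 1 (one conversion in, a chain of MMMs, one conversion out) can be modelled in a few lines of Python. To keep the sketch short, the MMM itself is modelled directly by its definition X·Y·R^{-1} mod M rather than by the digit-serial loop; the names `mmm` and `mont_exp` and the choice R = 2^k are assumptions made for this example.

```python
def mmm(a, b, M, R_inv):
    """Reference model of the Montgomery product a*b*R^(-1) mod M."""
    return a * b * R_inv % M


def mont_exp(X, E, M, k):
    """Left-to-right Montgomery exponentiation X^E mod M for odd M < 2^k."""
    R = 1 << k                                # radix b = 2, R = 2^k
    R_inv = pow(R, -1, M)                     # exists because M is odd
    X_bar = mmm(X, R * R % M, M, R_inv)       # step 1: into the domain
    A = R % M                                 # step 2: M-residue of 1
    for bit in bin(E)[2:]:                    # steps 3-8: bits e_t ... e_0
        A = mmm(A, A, M, R_inv)               # square
        if bit == "1":
            A = mmm(A, X_bar, M, R_inv)       # multiply
    return mmm(A, 1, M, R_inv)                # step 9: out of the domain


assert mont_exp(65, 17, 3233, 12) == pow(65, 17, 3233)
```

Only steps 1 and 9 touch the ordinary representation; every operation inside the loop stays in the Montgomery domain.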

1.2.2 Hardware Implementations of the MMM 

Since the MM is the most time-consuming operation in the RSA and ECC algorithms, a short MM computation time has a significant impact on the performance of the elementary cryptographic operations. An efficient implementation of the algorithm has therefore been an attractive field of research. Because of the long operands involved, a hardware platform seems to be a more natural choice than a software implementation. Since the operand size may change according to requirements and differs between RSA and ECC, a parameterized design in programmable logic would offer a universal design for fast prototyping.

The implementations adapt the general algorithms to the features of the hardware platform and prefer operations that are easily implementable in digital logic. In general, the designs either aim at a universal and flexible solution or give priority to the best use of resources and the shortest computation times.

One of the most cited hardware implementations of the MMM was introduced at CHES 1999 by Tenca and Koç [108]. A cheap and flexible modular exponentiation hardware accelerator can also be built using FPGAs. Results presented in the literature, e.g. [29, 41, 51], concentrate mainly on systolic-like implementations that provide a very fast but less flexible solution.

Pre-computing partial results as presented in [72] reduces the number of clock cycles required for a single MMM operation. This approach needs marginally more area than the original proposal [108], and its latency is comparable to that of the design presented in [85], which processes multi-precision operands in carry-save form. High-radix implementations [110] also reduce the number of computational steps, but the complexity of the logic increases substantially.

Current FPGAs provide an alternative hardware platform even for the system-level integration of cryptographic hardware. An SOC can typically include an embedded processor with a set of dedicated coprocessors. For such a system a highly flexible (although typically slower) scalable MMM coprocessor could be more attractive than a dedicated fixed-length one.

This is the direction chosen in our research. Our goal is to analyse and implement a solution that allows quick prototyping of special-purpose hardware designs and uses the features of the target platform to accelerate the execution of the MMM operation.

The radix-2 MMM algorithm (b = 2) is very suitable for hardware implementation because its operations are easy to implement: a word-by-bit multiplication, a bit shift (division by two) and an addition. Implementations with a higher radix have also been published [30, 110] and offer a viable alternative, but they require a more complex algebraic unit.

Radix-2 Montgomery Multiplication Algorithm The simplified version of the MMM algorithm (Algorithm 1 – 2) for the radix b = 2 and k-bit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, $Y$, and $M$ is given as Algorithm 1 – 3.

Algorithm 1 – 3 The basic radix-2 Montgomery multiplication algorithm for k-bit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, $Y$, and $M$

Require: $M = (m_{k-1} \ldots m_0)_2$, $X = (x_{k-1} \ldots x_0)_2$, $Y = (y_{k-1} \ldots y_0)_2$, with $0 \le X, Y < M$, $R = 2^{k}$, and $M' = -M^{-1} \bmod 2$.

Ensure: $S = X Y R^{-1} \bmod M$.

1: $S_0 \Leftarrow 0$
2: for $i = 0$ to $k - 1$ do
3:     $q_i \Leftarrow (S_i + x_i Y) \bmod 2$
4:     $S_{i+1} \Leftarrow (S_i + x_i Y + q_i M)/2$
5: end for
6: if $S_k \ge M$ then
7:     $S_k \Leftarrow S_k - M$
8: end if
9: $S \Leftarrow S_k$
10: return $S$

A comparison of Algorithms 1 – 2 and 1 – 3 shows how the choice b = 2 simplifies the operations inside the MMM. The modular reduction by the radix b becomes a check of the LSB, and the division in Step 4 is replaced by a simple right-shift operation.
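The two simplifications can be seen directly in a Python model of the radix-2 loop: the reduction modulo b becomes a bit mask, and the division becomes a shift. The function name `mont_mul_radix2` and the test modulus are assumptions made for this sketch, which models the arithmetic only and says nothing about the timing of a hardware implementation.

```python
def mont_mul_radix2(X, Y, M, k):
    """Radix-2 Montgomery product X*Y*2^(-k) mod M for odd M, X, Y < M < 2^k."""
    S = 0                                # step 1
    for i in range(k):                   # step 2
        x_i = (X >> i) & 1               # i-th bit of X
        T = S + x_i * Y                  # partial sum S_i + x_i*Y
        q_i = T & 1                      # step 3: LSB check instead of "mod 2"
        S = (T + q_i * M) >> 1           # step 4: right shift instead of "/2"
    if S >= M:                           # steps 6-8: final correction
        S -= M
    return S


M, k = 3233, 12                          # 2^11 < 3233 < 2^12, M odd
R_inv = pow(1 << k, -1, M)
assert mont_mul_radix2(1234, 2790, M, k) == 1234 * 2790 * R_inv % M
```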

The radix-2 formulation was used as the starting point for the derivation of the scalable design computing the MMM presented in [108,109]. We discuss the features of this scalable architecture later. Before that, we take a closer look at the operations of the algorithm and consider modifications that make them better suited to efficient execution on the chosen FPGA hardware platform.

The decision whether to add the modulus $M$ to the temporary sum is based on the value of the variable $q_i$, which is simple to implement. The test checks the LSB of the partial sum $S_i + x_i Y$ and stores it as the variable $q_i$ once the addition of $x_i Y$ is finished (see step 3 of Algorithm 1 – 3). The stored value then decides on the addition of $M$.

However, the second condition (see step 6 of Algorithm 1 – 3) is a problem for a pipelined execution of the computation. After the loop of additions, multiplications and shifts, the comparison and the subsequent conditional subtraction are required. Without this final reduction step, the outcome of the multiplication loop can be an improper input for the subsequent multiplication, namely when the final value of S is not smaller than M (S ≥ M). We intend to use the MMM in a series of multiplications, where the transformation into the Montgomery domain pays off compared with an expensive reduction, as shown in Algorithm 1 – 1. We therefore analyse how changes to Algorithm 1 – 3 can eliminate the final conditional step and thus make the use of pipelined multipliers possible.

Algorithm Modifications The MMM algorithm introduced earlier (Algorithm 1 – 2) is now extended. Two variants of the algorithm are discussed and implemented; both support a scalable multiple-word oriented implementation, but they handle carry processing in different ways.

In the modified Algorithm 1 – 4 we use the following input operands:

\[ X = \sum_{i=0}^{k} x_i 2^{i} = (0, 0, x_k, x_{k-1}, \ldots, x_1, x_0) < 2M , \tag{1.14} \]

\[ \tilde{Y} = \sum_{i=0}^{k} y_i 2^{i+1} = (y_k, \ldots, y_1, y_0, 0) < 4M , \tag{1.15} \]

where $R = 2^{k+3}$, $Y < 2M$, and $2^{k-1} < M < 2^{k}$ is a $k$-bit number (the same as in Algorithm 1 – 3). Note that $\tilde{Y}$ in Equation 1.15 is a left-shifted version of $Y$, with $\tilde{y}_0 = 0$, and that $X$ is extended by two zero bits at the MSB positions. This change simplifies the computation of $q_i$ compared with Algorithm 1 – 3. The value of $q_i$ needed for the computation of $S_{i+1}$ is given directly by the LSB of $S_i$ from the previous iteration (see step 4 of Algorithm 1 – 4). In this way the latency caused by the addition of $x_i Y$ is removed, and the logic implementation can be simplified, too.

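Algorithm 1 – 4 Optimized radix-2 Montgomery multiplication algorithm

Require: $X = \sum_{i=0}^{k} x_i 2^{i} = (0, 0, x_k, x_{k-1}, \ldots, x_1, x_0) < 2M$, $\tilde{Y} = \sum_{i=0}^{k} y_i 2^{i+1} = (y_k, \ldots, y_1, y_0, 0) < 4M$, $R = 2^{k+3}$, $Y < 2M$, and $2^{k-1} < M < 2^{k}$.

Ensure: $S = \mathrm{MMM}(X, Y)$ with $S < 2M$.

1: $S_0 \Leftarrow 0$
2: $\tilde{Y} \Leftarrow 2Y$
3: for $i = 0$ to $k + 2$ do
4:     $q_i \Leftarrow S_i \bmod 2$
5:     $S_{i+1} \Leftarrow (S_i + x_i \tilde{Y} + q_i M)/2$
6: end for
7: $S \Leftarrow S_{k+3}$
8: return $S$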

The loop of Algorithm 1 – 4 is executed with three more iterations than that of Algorithm 1 – 3. The higher number of iterations ensures that the inequalities $S_i < 3M$, $i = 0, 1, \ldots, k + 2$, and $S = S_{k+3} = \mathrm{MMM}(X, Y) < 2M$, with $S \equiv X Y R^{-1} \pmod{M}$, always hold. The result $S = \mathrm{MMM}(X, Y)$ can thus be reused as an input $X$ or $Y$ of a subsequent MMM. This modification avoids the originally proposed final correction step (the comparison and subtraction in step 6 of Algorithm 1 – 3) and makes a pipelined execution of the algorithm in separate multipliers possible.

In typical applications (e.g. RSA), the input operands $X$, $Y$ are pre-multiplied by a factor $2^{2k} \bmod M$ (Algorithm 1 – 3) or $2^{2k+6} \bmod M$ (Algorithm 1 – 4). The final MMM with the value 1 makes the final result smaller than $M$ (with probability $1 - 2^{-(k+2)}$, as shown in [29]) and provides the result $XY \bmod M$.
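The behaviour of the loop in Algorithm 1 – 4 can be checked with a short Python model. The function name `mont_mul_no_subtract` and the test modulus are assumptions made for this sketch. Two properties are tested: the output stays below 2M without any final subtraction, and the output is congruent to the product of the inputs as they enter the loop, that is, with the shifted operand 2Y of step 2, the loop as written yields S ≡ X·(2Y)·2^{-(k+3)} (mod M).

```python
def mont_mul_no_subtract(X, Y, M, k):
    """Loop of Algorithm 1-4: k + 3 iterations, no final comparison.

    Expects odd M with 2^(k-1) < M < 2^k and X, Y < 2M.
    """
    Y_tilde = 2 * Y                      # step 2: left-shifted Y
    S = 0                                # step 1
    for i in range(k + 3):               # step 3: i = 0 .. k+2
        x_i = (X >> i) & 1               # bits k+1 and k+2 of X are zero
        q_i = S & 1                      # step 4: LSB of the previous sum
        S = (S + x_i * Y_tilde + q_i * M) >> 1   # step 5
    return S


M, k = 3233, 12
X, Y = 5000, 6000                        # both below 2M = 6466
S = mont_mul_no_subtract(X, Y, M, k)
assert S < 2 * M
assert S % M == X * (2 * Y) * pow(2, -(k + 3), M) % M
# The output is a valid input for the next multiplication.
assert mont_mul_no_subtract(S, S, M, k) < 2 * M
```

Because q_i depends only on the sum from the previous iteration, it is available before the addition of x_i·2Y starts, which is the latency advantage described in the text.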

1.3 EC in Cryptography 

The application of ECs in public-key cryptography was proposed independently by Neal Koblitz and Victor S. Miller in 1985 [77, 87]. The advantage of using ECC instead of RSA or DSA [56] lies in the fact that the key can be much shorter. The best known algorithm for solving the elliptic curve discrete logarithm problem (ECDLP) takes fully exponential time, while the best algorithms for the integer factorization problem and the discrete logarithm problem take sub-exponential time. A comparison of key lengths for equivalent security levels is presented in Table 1 – 1 [91].

Table 1 – 1 Comparison of the key length (in bits) for equivalent security level for public-key cryptosystems

Security (bits)    DSA      RSA      ECC
80                 1024     1024     160-223
112                2048     2048     224-255
128                3072     3072     256-383
192                7680     7680     384-511
256                15360    15360    512+

The fundamental and most expensive operation underlying ECC is the point multiplication, which is built from field operations. For a point $P$ and a positive integer $k$, the point multiplication $kP$ is defined as the sum of $k$ copies of the point $P$:

\[ kP = \underbrace{P + \ldots + P}_{k} . \tag{1.16} \]

Various algorithms have been proposed for a more efficient computation of the point multiplication, for a fixed as well as for an unknown point $P$.

An EC over a field F, denoted E, is a curve given by an equation of the following form:

\[ E : y^{2} + a_1 x y + a_3 y = x^{3} + a_2 x^{2} + a_4 x + a_6 , \quad (a_i \in F) \tag{1.17} \]

where E must be smooth.

We let E(F) denote the set of points $(x, y) \in F^{2}$ that satisfy this equation, along with a point at infinity denoted O. If the characteristic of F is neither 2 nor 3, then Equation 1.17 can be simplified to the commonly used form (the so-called Weierstraß form):

\[ y^{2} = x^{3} + ax + b , \quad (a, b \in F) . \tag{1.18} \]

In this case the smoothness condition is equivalent to the requirement that the cubic on the right-hand side of Equation 1.18 has no multiple roots. This holds if and only if the discriminant of $x^{3} + ax + b$, which is $-(4a^{3} + 27b^{2})$, is nonzero.


The points of the EC form an Abelian group with the point O serving as its identity element. Next we define the rules for point addition and point doubling (addition of a point to itself).

Let $P = (x_P, y_P) \in E$; then $-P = (x_P, -y_P)$. If $Q = (x_Q, y_Q) \in E$ and $Q \ne -P$, then $P + Q = (x_{P+Q}, y_{P+Q})$. The formulas for point addition and doubling are given in Equations 1.19:

\[ \begin{aligned}
x_{P+Q} &= \lambda^{2} - x_P - x_Q , \\
y_{P+Q} &= \lambda (x_P - x_{P+Q}) - y_P , \\
\lambda &= \begin{cases} \dfrac{y_Q - y_P}{x_Q - x_P} & \text{if } P \ne Q , \\[2ex] \dfrac{3 x_P^{2} + a}{2 y_P} & \text{if } P = Q . \end{cases}
\end{aligned} \tag{1.19} \]

When $P \ne Q$ (addition), the formulas for computing $P + Q$ require 1 inversion, 2 multiplications, and 1 squaring. When $P = Q$ (doubling), the formulas for computing $2P$ require 1 inversion, 2 multiplications, and 2 squarings. Since field inversion is significantly more expensive than multiplication, it is advantageous to represent points in projective coordinates and to use formulas without inversion [35].
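The affine formulas of Equations 1.19 and the point multiplication of Equation 1.16 can be tried out on a very small curve. The sketch below assumes the toy curve y² = x³ + 2x + 2 over F_17 with the point P = (5, 1); these values, the function names and the use of the double-and-add method are choices made for this example and are not taken from the thesis. The inversion is done with Python's built-in modular inverse, so the sketch uses affine rather than projective coordinates.

```python
P_MOD, A, B = 17, 2, 2        # toy curve y^2 = x^3 + 2x + 2 over F_17 (assumed)
O = None                      # point at infinity, the group identity

# Smoothness condition: 4a^3 + 27b^2 must be nonzero in F.
assert (4 * A**3 + 27 * B**2) % P_MOD != 0


def ec_add(P, Q):
    """Add two points using the affine formulas of Equations 1.19."""
    if P is O:
        return Q
    if Q is O:
        return P
    (xp, yp), (xq, yq) = P, Q
    if xp == xq and (yp + yq) % P_MOD == 0:
        return O                                          # Q = -P
    if P == Q:                                            # doubling
        lam = (3 * xp * xp + A) * pow(2 * yp, -1, P_MOD) % P_MOD
    else:                                                 # addition
        lam = (yq - yp) * pow(xq - xp, -1, P_MOD) % P_MOD
    x = (lam * lam - xp - xq) % P_MOD
    y = (lam * (xp - x) - yp) % P_MOD
    return (x, y)


def scalar_mult(k, P):
    """Left-to-right double-and-add computation of kP (Equation 1.16)."""
    R = O
    for bit in bin(k)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, P)
    return R


P = (5, 1)
assert ec_add(P, P) == (6, 3)          # 2P, worked out by hand from Eq. 1.19
assert scalar_mult(19, P) is O         # P has order 19 on this curve
```

Double-and-add needs a number of group operations proportional to the bit length of k, instead of the k - 1 additions suggested by the definition.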

Before defining the ECDLP we introduce another EC parameter. The order of a point P on an EC is the smallest positive integer n such that nP = O, where nP is the point multiplication defined in Equation 1.16.

The ECDLP is defined as follows: given a curve E over F, a point P ∈ E of order n and a point Q ∈ E, find an integer l, 0 ≤ l ≤ n − 1, for which Q = lP, provided that such an integer exists.

As an example of cryptographic operations computed on an EC we mention the elliptic curve digital signature algorithm (ECDSA), the equivalent of the DSA in the EC domain. The key is generated by the steps described in Algorithm 1 – 5 [78].

The signature of a message m of arbitrary length is computed as described in Algorithm 1 – 6 [78].

From a practical point of view, the performance of ECC depends on an efficient implementation of the finite field operations and on a fast algorithm for the scalar multiplication.


Algorithm 1 – 5 Key generation in ECC [78] 

Require: E is an EC over F, P is a point of order n on curve E. 

Ensure: Pair of private key and public key. 

1: Choose a random integer d, 0 < d < n 

2: Q ⇐ dP 

3: return Q, the public key 

4: return d, the private key 

Algorithm 1 – 6 Message signing in ECC [78]
Require: Message m of arbitrary length, a hash value h(m) obtained from a one-way function.
Ensure: Signature of the message m.
1: Choose a random integer k, 0 < k < n
2: (x1, y1) ⇐ kP and r ⇐ x1 mod n (0 < x1 < q − 1)
3: if r = 0 then
4:   Go back to step 1.
5: end if
6: Compute k⁻¹ mod n
7: s ⇐ k⁻¹ {h(m) + dr} mod n
8: if s = 0 then
9:   Go back to step 1.
10: end if
11: return (r, s)
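To make Algorithm 1 – 5 and Algorithm 1 – 6 concrete, the following minimal sketch runs them on a toy curve. Everything specific in it is an assumption made for illustration only: the curve y² = x³ + 2x + 2 over F_17 with the base point (5, 1) of order 19, the representation of h(m) as a small integer, and the function names. The `verify` routine is the standard ECDSA check, which is not described in this section and is included only to confirm that the signature is consistent. The affine `add` also mirrors the operation counts quoted above (one inversion per addition or doubling).

```python
import random

P_MOD, A = 17, 2        # assumed toy curve: y^2 = x^3 + 2x + 2 over F_17
G, N = (5, 1), 19       # base point P and its order n; None stands for O

def add(p, q):
    """Affine point addition; exactly one field inversion per call."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = O
    if p == q:                                        # doubling
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:                                             # addition
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD

def mul(k, p):
    """Scalar multiplication kP by double-and-add."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc

def keygen(rng):
    """Algorithm 1 - 5: return (d, Q) with 0 < d < n and Q = dP."""
    d = rng.randrange(1, N)
    return d, mul(d, G)

def sign(h, d, rng):
    """Algorithm 1 - 6: h is the hash value h(m) given as an integer."""
    while True:
        k = rng.randrange(1, N)
        r = mul(k, G)[0] % N
        if r == 0:
            continue                                  # back to step 1
        s = pow(k, -1, N) * (h + d * r) % N
        if s != 0:
            return r, s

def verify(h, r, s, q):
    """Standard ECDSA verification (assumption, not part of this section)."""
    w = pow(s, -1, N)
    x = add(mul(h * w % N, G), mul(r * w % N, q))
    return x is not None and x[0] % N == r
```

With a real curve only the constants change; the retry loops in `sign` correspond to steps 3 to 5 and 8 to 10 of Algorithm 1 – 6.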

1.4 Conclusions 

In this section we have presented the two currently most important public-key cryptosystems, namely RSA and ECC.

While RSA has been widely deployed in industry for many years, ECC, as a relatively new family of cryptographic algorithms, is only beginning to establish itself as the better choice for implementing public-key algorithms, especially on energy- and space-constrained platforms. The possibility of using much shorter keys, and therefore less demanding arithmetic operations, makes ECC very suitable for hardware implementation.

The description of both algorithms given in the thesis focuses on their most intensively used and most demanding operation, the modular multiplication (MM). This makes the multiplication an important target for our research, since improvements in the implementation of the MM have a significant impact on the performance of the whole system based on modular operations, such as RSA or ECC.

The modular multiplier is the common part of both cryptosystems introduced here. After this theoretical introduction we continue with a description of multiplication algorithms adapted to the target hardware architecture and with the implementation itself.


2 Montgomery Modular Multiplication in Hardware

In this chapter we present the results of our research on the efficient implementation of the Montgomery modular multiplication (MMM) and its application in cryptographic systems. The resulting multiplier design can be included in cryptosystems or accelerators as a supporting unit for the computationally heavy operations of public-key algorithms such as RSA or ECC.

We focus on the design of the processing element (PE) that computes the MMM and of the coprocessor that includes, besides the PE(s), the memory registers and an interface to the control unit.

The results of the research were published in [46, 49, 50, 113, 117, 118]. The main achievements of our research fall into the following areas:

• Analysis of two PE concepts – algorithm improvement, effective implementation in chosen FPGA families, comparison of the concepts,

• MMM coprocessor design – software-hardware co-design, scalability and parametrisation, interface with a control unit.

Section 2.1 explains the concept of a scalable MMM design. In Section 2.2 we analyse the MMM algorithms and architectures with respect to their effective implementation in reconfigurable hardware structures. The results of the area occupation and timing analysis are summarised in Section 2.3 and provide information on the available choices of multiplier parameters. The chapter closes with Section 2.4, which summarises the discussed issues.

2.1 Scalable MMM design 

An arithmetic unit is called scalable if it can be reused or replicated in order to generate long-precision results independently of the data precision for which the unit was originally designed [108]. In cryptography, the length of the input operands and of the key may vary depending on the chosen cipher mode or when the algorithm is updated to a different security level. Scalability is therefore a desirable feature of a cryptographic arithmetic unit, and in such cases it pays off through reduced implementation costs. On the other hand, well-scalable designs can be slower than less universal ones optimised for selected parameters: the more universal a design is, the lower its speed in comparison to a system designed for fixed operand parameters.

A typical scalable coprocessor consists of two separate blocks, the memory registers and the arithmetic logic unit (ALU), connected by a w-bit data path as shown in Figure 2 – 1. The word width w determines the smallest data unit operated on, the word. It divides the operand length k into shorter pieces that suit the target hardware structure better and is usually a multiple of 8 bits.

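[Figure 2 – 1: block diagram with the blocks "data memory", "scalable ALU", and "control logic", a data input, a data output, and a w-bit data path between memory and ALU]

Figure 2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU connected by w-bit data-path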

Separation of the ALU and the memory is the first fundamental difference from FPGA designs whose MMM is optimised for fixed-length operands (e.g. [29, 41]). A scalable algorithm requires word-oriented processing that makes it possible to change the number of words, or even the word width w. Normally w is smaller than the operand length k, so the computation takes proportionally more clock cycles. Better performance can still be achieved, because a smaller ALU is faster and allows a higher clock frequency.

Let us consider w-bit words. For operands with k-bit precision, e1 = ⌈(k + 1)/w⌉ words are required for Algorithm 1 – 3. The extra bit in the calculation of e1 is required since Si (the internal variable of the radix-2 algorithm) lies in the range [0, 2M − 1] [108]. All the computations of Algorithm 1 – 3 must therefore be done with an extra bit of precision, and the input operands need an extra zero bit at the MSB position to extend their precision to the correct value.

Algorithm 1 – 4 requires e2 = ⌈(k + 3)/w⌉ words in order to support the extended range of the input variables X, Ŷ, and of the internal variable Si. Note that in many practical configurations e1 = e2 and no additional words are required for Algorithm 1 – 4. The operand X needs two extra zero bits at the MSB and the subsequent position in order to extend its precision to the k + 3 cycles required by Algorithm 1 – 4. In practical configurations k ≥ 1024, so the difference in the number of cycles is not significant. On the other hand, the possibility of removing the correction unit from the hardware design of Algorithm 1 – 4 is a valuable advantage.

In the rest of the thesis we write e1 or e2 for the number of words when the difference between the two algorithms needs to be emphasised, and e when we mean the number of words in general.
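The word counts can be checked with a few lines of code. The sketch below only evaluates the two ceiling expressions given above; the function name and the chosen (k, w) pairs are illustrative assumptions.

```python
def words_needed(k: int, w: int) -> tuple[int, int]:
    """Return (e1, e2) for k-bit operands split into w-bit words."""
    e1 = -(-(k + 1) // w)   # ceil((k + 1) / w), as for Algorithm 1 - 3
    e2 = -(-(k + 3) // w)   # ceil((k + 3) / w), as for Algorithm 1 - 4
    return e1, e2

if __name__ == "__main__":
    for k, w in [(1024, 16), (1024, 32), (1023, 16)]:
        print(f"k={k} w={w} -> e1, e2 = {words_needed(k, w)}")
```

For k = 1024 both algorithms need the same number of words with w = 16 or w = 32. The two counts differ only when k + 1 or k + 2 is a multiple of w, as in the artificial case k = 1023, w = 16.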

2.1.1 Scalable Multiple-Word Algorithms 

The operations in Algorithm 1 – 3 and Algorithm 1 – 4 are performed on full-precision operands and do not provide the scalability feature explained above. We analyse the relations between the parameters of the multipliers and the underlying FPGA structure and provide a solution suitable for devices with a fast carry architecture.

A scalable algorithm in which the operand Y (multiplicand) is scanned word by word and the operand X (multiplier) is scanned bit by bit was proposed in [108, 109]. The Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM) uses the following vectors:

\begin{align}
M &= (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}) \tag{2.1} \\
Y &= (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}) \notag \\
S &= (S^{(e-1)}, \ldots, S^{(1)}, S^{(0)}) \notag \\
X &= (x_{k-1}, \ldots, x_1, x_0) \notag
\end{align}

where the words are marked with superscripts and the bits are marked with subscripts. The concatenation of vectors a and b is denoted (a, b). A particular range of bits in a vector a from position i to position j, j > i, is written $a_{j..i}$. Bit i of the k-th word of a is denoted $a^{(k)}_i$.

The details of the MWR2MM algorithm (further referred to as MWR2MM CSA, where CSA stands for Carry-Save Adder) are given in [108]; in the thesis it is denoted as Algorithm 2 – 1. The optimised version of the MMM, Algorithm 1 – 4, can be transformed to a multiple-word form in a similar way (referred to as MWR2MM CPA, where CPA stands for Carry-Propagate Adder), as shown in Algorithm 2 – 2. The algorithms are named after the way they are implemented, which is explained in the following parts of the thesis.

The algorithms compute a partial sum S for each bit of X, scanning the words of Y and M. Once the precision is exhausted, another bit of X is taken and the scan is repeated. Thus, neither MWR2MM CSA nor MWR2MM CPA imposes any constraint on the precision of the operands. What varies is the number of iterations of the loop over i required to complete the MMM operation and the number of words of the input and internal operands, e1 and e2, respectively. The carry variable C must be from the set {0, 1, 2}, which follows from the addition of the three vectors S, M, and xiY or xiŶ, respectively [108].

Algorithm 2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm
1: $S \Leftarrow 0$
2: for $i = 0$ to $k - 1$ do
3:   $C \Leftarrow 0$
4:   $(C, S^{(0)}) \Leftarrow x_i Y^{(0)} + S^{(0)}$
5:   $q_i \Leftarrow S^{(0)}_0$
6:   if $q_i = 1$ then
7:     $(C, S^{(0)}) \Leftarrow C + S^{(0)} + M^{(0)}$
8:     for $j = 1$ to $e_1 - 1$ do
9:       $(C, S^{(j)}) \Leftarrow C + x_i Y^{(j)} + M^{(j)} + S^{(j)}$
10:      $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$
11:    end for
12:    $S^{(e_1-1)} \Leftarrow (C, S^{(e_1-1)}_{w-1..1})$
13:  else
14:    for $j = 1$ to $e_1 - 1$ do
15:      $(C, S^{(j)}) \Leftarrow C + x_i Y^{(j)} + S^{(j)}$
16:      $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$
17:    end for
18:    $S^{(e_1-1)} \Leftarrow (C, S^{(e_1-1)}_{w-1..1})$
19:  end if
20: end for
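The word-serial scan of Algorithm 2 – 1 can be modelled in software. The sketch below is a behavioural model under stated assumptions, not a description of the hardware: the two branches of the algorithm are merged by multiplying M by q_i, the addition of M^(0) in step 7 is read as acting on the whole pair (C, S^(0)), M is assumed odd with Y < M, and the function name and argument order are hypothetical.

```python
def mwr2mm(x: int, y: int, m: int, k: int, w: int) -> int:
    """Return S with S = x * y * 2^(-k) (mod m) and 0 <= S < 2m.

    x, y < m are k-bit operands, m is odd, w is the word width.
    """
    e = -(-(k + 1) // w)                      # e1 = ceil((k + 1) / w) words
    mask = (1 << w) - 1
    Y = [(y >> (w * j)) & mask for j in range(e)]
    M = [(m >> (w * j)) & mask for j in range(e)]
    S = [0] * e
    for i in range(k):                        # one pass per bit of X
        xi = (x >> i) & 1
        t = xi * Y[0] + S[0]                  # step 4
        q = t & 1                             # step 5: q_i is the LSB
        t += q * M[0]                         # step 7 (only when q_i = 1)
        C, S[0] = t >> w, t & mask
        for j in range(1, e):                 # scan the remaining words
            t = C + xi * Y[j] + q * M[j] + S[j]
            C, S[j] = t >> w, t & mask
            assert C <= 2                     # carry stays in {0, 1, 2}
            # shift right by one bit across the word boundary
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (C << (w - 1)) | (S[e - 1] >> 1)
    return sum(s << (w * j) for j, s in enumerate(S))
```

The returned value still lies in [0, 2M − 1], so a final comparison with M is needed to obtain the fully reduced result, which is the correction step discussed in the text.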

2.1.2 Comparison of Implementation Approaches 

Two algorithms have been chosen for hardware implementation: the MWR2MM CSA algorithm (Algorithm 2 – 1) and the MWR2MM CPA algorithm (Algorithm 2 – 2). Our first goal is to show the differences between them at the algorithmic level; the second is to compare the ways in which they can be implemented.

Algorithm 2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm
1: $S \Leftarrow 0$
2: $\hat{Y} \Leftarrow 2Y$
3: for $i = 0$ to $k + 3$ do
4:   $C \Leftarrow 0$
5:   $q_i \Leftarrow S^{(0)}_0$
6:   for $j = 1$ to $e_2 - 1$ do
7:     $(C, S^{(j)}) \Leftarrow C + x_i \hat{Y}^{(j)} + q_i M^{(j)} + S^{(j)}$
8:     $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$
9:   end for
10:  $S^{(e_2-1)} \Leftarrow (C, S^{(e_2-1)}_{w-1..1})$
11: end for

The difference between the algorithms was motivated by the possibility of omitting the comparison of the final sum S to M at the end of the loop in Algorithm 1 – 3, at the price of a few extra loop iterations in the MWR2MM CPA algorithm. Another difference lies in the computation of the variable qi, which decides whether M is added. In the MWR2MM CPA algorithm its value is given directly by the LSB of the zeroth word of the internal sum S computed in the previous iteration. In contrast, the MWR2MM CSA algorithm uses the value obtained after the addition of the term xiY, which increases the latency of computing qi.

The most important difference between MWR2MM CSA and MWR2MM CPA lies in the way the variable S is represented. In the carry-save redundant form applied in our implementation of the MWR2MM CSA algorithm, the sum S is represented as

$$S^{(j)} = {}^{1}S^{(j)} + r \cdot {}^{2}S^{(j)}, \tag{2.2}$$

where r is the radix (in our implementation r = 2) and 1S, 2S are two w-bit components of the sum S. The advantage of this representation is that no carry propagates inside the inner loop of the MMM algorithm. On the other hand, storing the partial sum S requires two w-bit registers instead of one. Only at the very end of the computation is the redundant form transformed to the normal representation by applying Equation 2.2. In this respect the CSA PE, which executes the MWR2MM CSA algorithm, is independent of the hardware platform and does not require any special features for the hardware implementation of the adders.
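The redundant form can be illustrated by a bitwise model of one row of full adders. In this sketch the names `carry_save_add` and `to_normal` are hypothetical, and Python integers stand in for the w-bit registers; each bit position is processed independently, which is why no carry propagates along the word.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Compress three operands into a (sum, carry) pair, bit by bit."""
    s1 = a ^ b ^ c                        # sum output of every full adder
    s2 = (a & b) | (a & c) | (b & c)      # carry output of every full adder
    return s1, s2

def to_normal(s1: int, s2: int, r: int = 2) -> int:
    """Equation 2.2: the only step that needs a carry-propagate addition."""
    return s1 + r * s2

if __name__ == "__main__":
    s1, s2 = carry_save_add(5, 6, 7)
    print(s1, s2, to_normal(s1, s2))
```

The pair (s1, s2) corresponds to the two components 1S and 2S, and the conversion in `to_normal` is deferred to the very end of the computation.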

In the implementation of the MWR2MM CPA algorithm all operands are processed and stored in a non-redundant form, each requiring a register of e2 w-bit words.

The different representation of the sum S in the implementations of the MWR2MM CPA and MWR2MM CSA algorithms has the following consequences:

1. The MWR2MM CPA algorithm uses less memory for the same operand sizes (only 80% of the memory used by MWR2MM CSA), since it stores four variables (X, Y, M, S) instead of five.

2. The MWR2MM CPA algorithm does not require any correction unit to transform the algorithm output in the final step, while the MWR2MM CSA algorithm requires at least a final conversion to a non-redundant form.

3. The MWR2MM CPA algorithm allows a simpler computation of the internal variable qi, which can simplify the architecture of the CPA PE.

4. The CSA PE is always faster than the CPA PE because no carry propagates in the inner loop of the algorithm. The CPA PE is slower but uses fewer logic resources. Within the same FPGA resources, more pipelined CPA PE stages can therefore potentially be used, which can speed up the solution and yield a better area-time (AT) product.

2.2 Multiplier Architecture 

In this section we present the architecture of the implemented units for computing the MMM. The units are designed as dedicated coprocessors with a standardised interface to an external control unit. This approach makes it possible to connect several units to one controller and to compute several MMMs in parallel. The peripheral multiplier can be mapped into the memory of the host processor, where the control operations are triggered by an interrupt or a control register.

Another approach would be to propose a set of instructions supporting fast modular operations on a general-purpose processor. In that case the optimisation takes into account, besides the resources of the target platform, the structure of the processor, which makes the design more specific to the chosen processor architecture.

In the architecture combining a processor with a dedicated coprocessor, no special requirements are placed on the control unit apart from the specification of the interface, since the main computational effort is carried by the coprocessor. In this way resources can be used significantly better when a large general-purpose processor is replaced by a small CPU with a coprocessor.


Besides the internal structure of the multipliers we also discuss the pipeline structure of the coprocessor and its interconnection to the host, which can be an embedded soft-core or a stand-alone processor. The scalable designs offer several parameters that are chosen after considering the required execution time and the available hardware resources.

2.2.1 Adder Concepts 

In our designs we apply two different ways of implementing the adders, which are described in this section. The architectures designed for the MWR2MM CSA and MWR2MM CPA algorithms differ in the implementation of the adders inside the multiplier units.

The scalable chain of CSAs does not include any connection between the adder units (see Figure 2 – 2(b)), which makes it independent of the platform technology and of the length of the operands to be added.

The propagation of the carry bit in the CPA requires the connections between the adders to be as short as possible. In an ASIC design this critical data path can be optimised to achieve the best possible performance. In FPGAs, on the other hand, the underlying architecture cannot be changed; only the logical behaviour and the interconnections provided by the device vendor can be reconfigured. The FPGA vendors provide a feature that can be exploited when a very fast connection between adjacent logic elements (LEs) is required, as is the case for the scalable chain of CPAs.

To accelerate the normally slow carry propagation in the CPA unit, the fast carry chain network included in modern FPGAs is used (see Figure 2 – 2(a)). The best performance of the carry chain is achieved inside one logic array block (LAB). The number of LEs in one LAB depends on the FPGA type; typical values are 16 or 32. If the adder width w is bigger than the number of LEs in the LAB, the carry chains of several LABs need to be interconnected. To keep the fast carry connection over such a longer chain, the connected LABs should be placed next to each other in one column. That is possible only when the place and route (P&R) tool is able to recognise the carry chain in the synthesised logic and exploits the hardware architecture of the target device to provide a fast interconnection.

We can conclude that the speed of the CPA PE depends significantly on the word length (the length of the carry chain). However, we can suppose that up to a certain word length, w ≤ wmax, the speed of the CPA PE is not critical, because the final speed is dominated by the access time of the embedded memory or by another critical path in the logic. The value of wmax may differ between technologies due to different routing and a different physical layout (number of LEs in a LAB). The question is whether wmax lies in the range of word widths allowed for the on-chip memory of available FPGAs. If so, the variables can be both stored and processed with the optimal word width, achieving the best area-time product.

[Figure 2 – 2: (a) carry-propagate adder, a row of FAs linked by the carry chain; (b) carry-save adder, a row of FAs without interconnection]

Figure 2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs

Carry-Save Adder Unit The whole computational complexity of both algorithms lies in two additions of three w-bit operands for computing Si+1. The propagation of the carry bits between the w adders is (in general) too slow. The implementation of the MWR2MM CSA in [108] uses a redundant representation of the intermediate sum S and carry-save adders [38]. The w-bit MWR2MM CSA PE architecture based on full adders (FAs) is depicted in Figure 2 – 3.

In order to reduce the storage size and the complexity of the arithmetic hardware, the variables X, Y, and M are available in a non-redundant form. The intermediate internal sum S is received and generated in the redundant form as 1S and 2S. The advantage of the redundant form is that the latency does not depend on the word length w, as there is no direct connection between the FAs. The outputs of the adders are valid right after the input signals appear, and the delay is given mainly by the internal combinational logic of the FA.

The processing delay may increase for larger w only as a result of the broadcast problem; it does not depend on the arithmetic operation itself. Conversion into the normal non-redundant representation is done only at the very end of the MMM computation. The intermediate result S may be passed on to another MMM unit as the operand X or Y of a new computation (e.g. the next iteration of a modular exponentiation). The redundant representation of variables, which requires twice as much memory as the non-redundant one, and the need for the transformation to and from the redundant form have been considered the main drawbacks of the MWR2MM CSA algorithm. A positive property of the implementation is its independence of the carry chain logic of the target platform.

[Figure 2 – 3: inputs x_i, q_i, Y^(j), M^(j), and the redundant sum 1S^(j), 2S^(j) for bit positions w−1 down to 0; two rows of FAs without carry interconnection; outputs 1S^(j−1) and 2S^(j−1)]

Figure 2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA

Carry-Propagate Adder Unit Recent FPGAs contain high-speed interconnect lines between adjacent logic blocks, designed to provide efficient carry propagation. The CPA PE architecture presented in this thesis suits the implementation of the MMM unit on any FPGA with dedicated carry logic (e.g. modern Altera and Xilinx FPGAs). The ALU is organised as two layers of conventional CPAs, as shown in Figure 2 – 4.

[Figure 2 – 4: inputs x_i, q_i, Y^(j), M^(j), S^(j) for bit positions w−1 down to 0; two rows of FAs linked by the carry signals C_a and C_b; output S^(j−1)]

Figure 2 – 4 Block diagram of CPA-based w-bit MWR2MM processing element (CPA PE) based on FA

Unlike the CSA PE, the CPA PE does not support an arbitrary word width w. The limit on the number of FAs in one row is given by the target technology. The more LEs are chained by fast (and short) interconnections, the larger the word width can be while achieving speed comparable to the CSA PE. The carry signal raised in the first FA from the left (for the LSB) is processed in the adjacent FA, which outputs another carry signal for the third adder in the row, and so on. In this way the carry signal is propagated to the rightmost FA (for the MSB). Once that FA receives a valid carry and computes its outputs, the complete w-bit result can be passed on to the next computation. From this description we can see that the delay caused by the carry propagation grows linearly with the number of connections, which is given by the word width w.
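The linear growth of the carry path can be seen in a simple software model of a ripple-carry row of FAs. The model below is only an illustration; the function names and the returned "longest carry run" metric are assumptions introduced here and say nothing about the real timing of a particular FPGA carry chain.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One FA: return (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a: int, b: int, w: int) -> tuple[int, int, int]:
    """Add two w-bit values; return (sum, carry_out, longest_carry_run).

    longest_carry_run counts the longest sequence of consecutive FAs
    that each pass a carry of 1 on to their neighbour.
    """
    carry, total, run, longest = 0, 0, 0, 0
    for bit in range(w):                      # LSB first
        s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        total |= s << bit
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return total, carry, longest

if __name__ == "__main__":
    for w in (8, 16, 32):
        print(w, ripple_carry_add((1 << w) - 1, 1, w))
```

In the worst case (all ones plus one) the carry visits every FA, so the run length equals w; this is the dependency on the word width described above.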

Pipeline Structure Both algorithms, MWR2MM CSA and MWR2MM CPA, share the same data dependencies. A detailed analysis of the potential inner parallelism and an investigation of a pipelined organisation suitable for implementing the MWR2MM CSA algorithm can be found in [108, 109]. That analysis can be applied directly to the MWR2MM CPA algorithm as well. Its most important result, the possibility of operating the multipliers in pipelined stages, is applied in the FPGA implementations presented in the thesis.

The main advantage of the scalable architecture for the MMM lies in the fact that the PEs can easily be replicated to increase the throughput of the coprocessor [108]. In the pipelined version several slightly modified PEs (some registers have to be added to allow temporary data storage) are connected in a cascade (see Figure 2 – 5).

[Figure 2 – 5: cascade of PE 1, PE 2, ..., PE n fed with the bits x_i, x_{i−1}, ..., x_{i−n+1}; the words Y^(j), M^(j), S^(j) pass from stage to stage and the word S^(j−n) returns to the data memory]

Figure 2 – 5 Pipelined organization of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory

The maximum degree of pipeline that can be obtained with this architecture is 

found as: 

$$n_{max} = \left\lceil \frac{e + 1}{2} \right\rceil \tag{2.3}$$

The number 2 in the denominator is the number of clock cycles after which the output of the MMM unit is valid. It also means that new values of the input variables of the PEs in the pipelined row are delivered every third clock cycle. Output data from one stage are kept in temporary registers between adjacent stages for one clock cycle and then delivered to the subsequent stage. Each stage includes a second register at its input, which gives the total delay of two clock cycles required by the computation process.

To keep the internal control logic simple, the number of stages n is restricted to values dividing the number of words e (n | e). Thanks to this simplification, at the moment the computation finishes the last word of the sum S is at the output of the last unit in the row and is shifted directly to the memory to be stored. For arbitrary n, a word shift between the stages at the end of the computation would have to be implemented. Adding this feature requires extra logic in the data path, which has a negative influence on the maximal clock frequency; it is therefore not supported in our designs.

The number of clock cycles needed for a single MMM operation in a design containing n ≤ nmax MMM units can be computed as

$$T_{MMM} = \frac{k^2}{wn} + 2n = \left\lceil \frac{ew}{n} \right\rceil e + 2n \tag{2.4}$$

From Equation 2.4 we can see that the number of stages n has a significant impact on the computation time: the dominant term is inversely proportional to n. When fewer than nmax MMM units are available, the total execution time TMMM increases. On the other hand, the area occupation of the coprocessor can be adjusted to the area constraints of the target device. Implementing n < nmax stages also means more operations for reading from and storing to the memory. Shifting the processed data between the stages is faster than storing the intermediate results in the memory block and reading them again to finish the computation. Therefore the best performance is achieved in a design with the maximal number of stages (n = nmax).
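The effect of the number of stages can be tabulated directly from Equations 2.3 and 2.4. The short sketch below evaluates both formulas; the function names and the example configuration (e = 64 words of w = 16 bits, i.e. k = we = 1024) are assumptions chosen for illustration, and only stage counts dividing e are listed, in line with the restriction n | e.

```python
def n_max(e: int) -> int:
    """Equation 2.3: maximum degree of pipelining, ceil((e + 1) / 2)."""
    return (e + 2) // 2

def t_mmm(e: int, w: int, n: int) -> int:
    """Equation 2.4: clock cycles of one MMM with n pipelined stages."""
    passes = -(-(e * w) // n)            # ceil(e * w / n)
    return passes * e + 2 * n

if __name__ == "__main__":
    e, w = 64, 16
    print("n_max =", n_max(e))
    for n in (1, 2, 4, 8, 16, 32):       # divisors of e not exceeding n_max
        print(f"n={n:2d}  T_MMM={t_mmm(e, w, n)} cycles")
```

Doubling n roughly halves the cycle count until the 2n latency term becomes noticeable.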

Parametrisation The MMM coprocessor has three variable parameters (w, e, and n) that can be chosen for any implementation. The number of pipelined stages and the word width (n, w) are chosen according to the required area of the implemented coprocessor and the required timing of the MMM computation. The security level of the public-key algorithm defines the length of the operands of the multiplier (k = we). This approach gives high flexibility to the design of the processor and coprocessor.

In general, there are two ways to increase the speed of the MMM computation in the proposed designs (see Equation 2.4 for the relations between the design parameters and the computation time TMMM):

1. Increase the word length w. This reduces the number of iterations given by e and yields a shorter computation time. While older FPGAs provide dual-port memory blocks with configurable word lengths of only up to 16 bits (Altera Apex [8]), in high-performance models the word length can be up to 32 bits for middle-sized blocks or 128 bits for large memory blocks (Altera Stratix II [20]). Since the capacity of a block is sufficient for typical RSA operands, it makes sense to use only one block per operand. With an older technology with smaller memory blocks and a larger chosen word width (16 < w ≤ 32), two memory blocks per variable are required. Depending on the memory configuration, several variables may share one memory block. The mapping of operands to the memory is especially important for constrained SoC designs with a limited number of memory blocks.

2. Increase the number of pipelined stages n. The hardware structure of the PE in both solutions (CSA PE and CPA PE) is relatively simple and fast and does not depend on the number of stages, which was a condition for a scalable design. Adding several pipelined stages can increase the overall speed, especially if the access to the embedded memory is a bottleneck (as is the case in FPGAs with limited routing resources for large w).

From the previous analysis we can conclude that the word width w is chosen according to the target platform architecture, the organisation of its memory blocks, and its support for fast carry operations. The number of pipelined stages n is adapted to the available chip area.

2.2.2 Memory Block 

The operands are stored in the memory block, which is part of the data path. Optimising the memory organisation and its connection to the ALU helps to achieve better performance. Due to the intensive exchange of data between the memory and the ALU, this connection is often part of the longest (critical) path of the logic and influences the maximal clock frequency of the circuit.

Depending on the number of pipelined stages (n) and the number of iterations given by the number of words (e), the operand data are read out of the memory, processed by the PEs, and stored back several times. The memory block may contain input data loaded by the control unit, intermediate results, and the final results ready to be sent back to the host processor after the computation has finished. Note that different words of an operand are loaded and stored at the same time. Therefore the memory has to support a dual-port configuration, which makes it possible to address separate memory locations for reading and writing. The organisation of the dual-port memory register for one of the variables inside the MMM coprocessor is depicted in Figure 2 – 6.

[Figure 2 – 6: memory unit of e × w bits holding words 0 to e − 1 of w bits each, accessed through port A (A data, A address) and port B (B data, B address)]

Figure 2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits


In the coprocessor we need to store four operands for the MMM computation: the three input operands X, Y, M and the result S. The storage of S requires one register in the non-redundant representation and two registers in the redundant one. The scalability feature applied to the ALU needs to be adopted by the memory block, too.

The requirements on a scalable design imply that the architecture is easily adaptable to operand lengths different from the one for which the system was originally designed. In the memory block the number of stored variables is constant (four or five, depending on the chosen implementation). What varies is the number of words and consequently the number of bits needed to address them.

We propose a model in which each word of every variable can be addressed from the coprocessor as well as from the host unit. We distinguish an internal address of a word, which specifies its location within a given coprocessor and register, a register address, which selects the register holding the required variable, and finally a coprocessor address, which distinguishes between several ALUs. With this memory management the control unit can address any word of a chosen coprocessor, store the input values for the computation there, and afterwards read the results for further processing. The number of address bits at each level can be adapted to the number of coprocessors, variables, and words. The address width is usually given by the word width of the interface between the processor and the coprocessor. For an address longer than the interface word width an appropriate address model needs to be chosen, either accepting several address signals in parallel or distinguishing the address type in another way.

Table 2 – 1 Address of operands from host processor level (LSB right)

coprocessor | register | internal
XX          | XXX      | XXXXXXX

The memory address bits are assigned as shown in Table 2 – 1 (LSB on the right). With the presented address format the CPU can handle up to 4 MMM coprocessors (two address bits), each with 8 operands (three address bits) composed of 128 words (seven address bits). Such a configuration is suitable for RSA computations with an operand length of k = 2048 bits and a word width of w = 16 bits, which gives e = 128 words.
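From the host side, the address format of Table 2 – 1 amounts to packing three bit fields into one 12-bit value. The following sketch shows one way to do this in software; the function and constant names are hypothetical, and only the field widths (2, 3, and 7 bits, LSB on the right) are taken from the table.

```python
COPROC_BITS, REG_BITS, WORD_BITS = 2, 3, 7   # field widths from Table 2 - 1

def pack_address(coproc: int, reg: int, word: int) -> int:
    """Build the host-level address: coprocessor | register | internal."""
    assert 0 <= coproc < (1 << COPROC_BITS)
    assert 0 <= reg < (1 << REG_BITS)
    assert 0 <= word < (1 << WORD_BITS)
    return (coproc << (REG_BITS + WORD_BITS)) | (reg << WORD_BITS) | word

def unpack_address(addr: int) -> tuple[int, int, int]:
    """Split a host-level address back into (coprocessor, register, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    reg = (addr >> WORD_BITS) & ((1 << REG_BITS) - 1)
    coproc = addr >> (REG_BITS + WORD_BITS)
    return coproc, reg, word

if __name__ == "__main__":
    addr = pack_address(coproc=1, reg=2, word=5)
    print(f"{addr:012b}", unpack_address(addr))
```

Changing the three width constants adapts the same scheme to a different number of coprocessors, variables, or words.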


2.2.3 Interface to Controller 

The way in which the MMM coprocessor is connected to the control unit (e.g. an embedded processor) is important for the control of the computation process and for the exchange of processed data.

Our first objective is to find a solution that allows a fast and flexible exchange of input and output data between the memory of the host processor and the internal memory block of the MMM coprocessor. The requirement for flexibility is related to the scalability of the coprocessor, which may include several MMM units. Moreover, the internal word widths of the control unit and the coprocessor may differ.

Another goal is to optimise the control of the coprocessor(s). Triggering the computation and then checking its status plays an important role especially in configurations with several coprocessors (not necessarily MMM coprocessors) operated by one control unit, where blocking the operations running on the host processor is undesirable.

Finally, the goal is also to design a universal interface that can be connected to different types of processor buses with a minimal amount of glue logic.

The interface that satisfies the requirements mentioned above is depicted in 

Figure 2 – 7. The functionality of the particular signals is explained in the next part 

of the section. 

[Figure: MMM coprocessor block with the signals clock, reset, chip select, write enable, irq, address bus and data bus]

Figure 2 – 7 Proposed universal interface for the MMM coprocessor


Status and Control Interface The operations inside the MMM coprocessor are controlled by a control register that is mapped into the memory of the control unit via the interface. In the presented solution there are two control bits:

bit 0 controls the multiplication/squaring process. Set to 1 to trigger the computation, 0 for idle.

bit 1 switches between multiplication and squaring. Set to 0 to compute the MMM of the input parameters X and Y, set to 1 to square (multiply the operand by itself) the value stored in memory register Y.

In the solution published in [117], a status register is used to check the current status of the coprocessor and of the computation. Its LSB is set during data storing and computation. After triggering the computation, the processor has to check the status register regularly. Once the multiplication or squaring has finished, the status bit changes to 0. The control unit is then expected to read the results from the MMM coprocessor and, if required, repeat the operation with new operands.
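The polled handshake can be sketched with a toy software model. Everything here is an assumption made for illustration: the class `MockMMM`, the constant names, the fixed latency and the register attributes are hypothetical, and the mock returns a plain modular product where the real unit would return a Montgomery product.

```python
CTRL_START = 0x1    # control bit 0: 1 triggers the computation
CTRL_SQUARE = 0x2   # control bit 1: 0 = multiply X by Y, 1 = square Y
STATUS_BUSY = 0x1   # status LSB: set while storing data or computing


class MockMMM:
    """Toy stand-in for the coprocessor; it does not model real timing."""

    def __init__(self, modulus, latency=3):
        self.modulus = modulus
        self.x = self.y = self.s = 0
        self._latency = latency
        self._remaining = 0
        self._square = False

    def write_control(self, word):
        if word & CTRL_START:
            self._square = bool(word & CTRL_SQUARE)
            self._remaining = self._latency

    def read_status(self):
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                a = self.y if self._square else self.x
                # Plain modular product; hardware would yield a Montgomery product.
                self.s = (a * self.y) % self.modulus
            return STATUS_BUSY
        return 0


def run_polled(dev, x, y, square=False):
    """Load operands, trigger the unit, then poll the status bit until idle."""
    dev.x, dev.y = x, y
    dev.write_control(CTRL_START | (CTRL_SQUARE if square else 0))
    polls = 0
    while dev.read_status() & STATUS_BUSY:
        polls += 1
    return dev.s, polls


if __name__ == "__main__":
    dev = MockMMM(modulus=97)
    print(run_polled(dev, 5, 7))               # multiplication
    print(run_polled(dev, 5, 7, square=True))  # squaring of Y
```

The polling loop keeps the host busy for the whole computation, which is the cost that an interrupt-driven variant avoids.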

The version described in [49] communicates via an interrupt (signal irq in Figure 2 – 7). This solution is more suitable for software control of coprocessors and for configurations with several MMM coprocessors. After the MMM computation has finished, the interrupt signal of the host processor is asserted. This state persists until the processor reads the results within the interrupt routine. Thereafter new operands can be loaded into the memory and the whole process started again.

Memory Operations The transfer of operands between the control unit and the coprocessor is carried out using a pair of control signals (chip select, denoting the particular coprocessor, and write enable, signalling a store operation) together with the address and data buses.

The format of the operand address has been explained in Table 2 – 1. The chip select signal of the corresponding coprocessor is asserted according to the address decoded by the interface. Since the input operands X, Y and M are only written by the control unit, whereas the operand S is used exclusively as the output register of the coprocessor, their addresses may be shared. The particular operand register is then selected by the write enable signal together with the address.

If the internal word widths of the processor and the coprocessor do not match, the interface must additionally perform the memory alignment and proper decoding of the memory address.


Clock Signal Distribution As there may be a need for faster (or, in general, different) clocking of the dedicated coprocessor, we analyse a solution with separate clock signals for the two parts of the system.

The clock signal of the control processor governs, through the bus and the interface, all transfers between the memories of the processor and the coprocessor. The operations inside the MMM coprocessor are clocked by an external (usually faster) clock signal.

Note that an additional clock signal also requires extra resources for its generation. This may cause problems in constrained embedded systems on low-end FPGAs with a low number of clock-generating circuits (e.g. PLLs). On the other hand, the performance improvement is significant: with this clock organisation, the MMM coprocessor in [49] achieved almost three times higher performance than the implementation using the same clock signal for both units [117].

2.3 Implementation of the MMM 

In this section we present the parameters of the MMM units implemented according to the theory presented in the previous parts of the thesis. The MWR2MM CSA algorithm and the MWR2MM CPA algorithm are compared by implementing the PEs on several families of Altera FPGAs. Further, we summarise the implementation results of the MMM coprocessor, discuss a hardware-software co-design approach and compare its results with a purely software implementation of the MMM. Finally, we provide a summary of the implementation results.

2.3.1 Comparison of CSA and CPA PE 

Tables 2 – 2 and 2 – 3 give the results of the MWR2MM CSA and MWR2MM CPA PE implementations (including the data storage registers necessary for the pipelined version) in different Altera FPGAs for various word lengths w.

Several interesting facts can be seen in these tables. With the exception of the CPA PE implemented in the ACEX family, the two solutions are technology independent (as far as area occupation is concerned). The size of the block (in LEs) depends almost linearly on the word length w. The CSA PE always occupies more resources than the CPA PE.


Table 2 – 2 PE sizes and speeds for old style Altera FPGAs

                       CPA PE                               CSA PE
Device        w (bits)  Size (LEs)  Speed (MHz)   w (bits)  Size (LEs)  Speed (MHz)
ACEX [7]          8         66         161            8         81         232
EP1K100-1        16        130         129           16        161         202
                 32        258          99           32        321         170
APEX [8]          8         59         161            8         81         232
EP20K160-1       16        115         129           16        161         202
                 32        227          99           32        321         170

Table 2 – 3 PE sizes and speeds for new style Altera FPGAs

                       CPA PE                               CSA PE
Device        w (bits)  Size (LEs)  Speed (MHz)   w (bits)  Size (LEs)  Speed (MHz)
CYCLONE [13]      8         59         277            8         81         304
EP1C20-6         16        115         235           16        161         304
                 32        227         221           32        321         304
STRATIX [18]      8         59         271            8         81         304
EP1S10-6         16        115         248           16        161         304
                 32        227         214           32        321         304

The most important fact concerns the speed of the PEs. As could be expected, the CSA PE is always faster, and its speed varies with the word length w either only slightly (for the old families) or almost not at all (for the recent families, probably due to enhanced routing possibilities). However, the speed of the CPA PE in the older families decreases significantly with the word length (by about 40% from 8 bits to 32 bits). Recent Altera devices use an enhanced carry chain: the so-called carry-select chain uses a redundant (hard-wired) carry calculation to increase the speed of carry functions. This feature makes the processing times of the CPA PE comparable to those of the CSA PE (though about 10 to 30% slower). Since the CPA PE is about 20% smaller, one can improve the final speed by increasing the number of pipelined stages. However, this approach does not seem to be adequate for word lengths w > 32 bits.
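The speed percentages quoted above can be re-derived from the figures in Tables 2 – 2 and 2 – 3. The following sketch only repeats that arithmetic; the dictionary layout and function names are illustrative assumptions.

```python
# (CPA speed, CSA speed) in MHz, copied from Tables 2-2 and 2-3.
SPEED_MHZ = {
    "ACEX":    {8: (161, 232), 16: (129, 202), 32: (99, 170)},
    "APEX":    {8: (161, 232), 16: (129, 202), 32: (99, 170)},
    "CYCLONE": {8: (277, 304), 16: (235, 304), 32: (221, 304)},
    "STRATIX": {8: (271, 304), 16: (248, 304), 32: (214, 304)},
}


def cpa_slowdown_8_to_32(family):
    """Relative CPA speed loss (in %) when w grows from 8 to 32 bits."""
    fast = SPEED_MHZ[family][8][0]
    slow = SPEED_MHZ[family][32][0]
    return 100.0 * (fast - slow) / fast


def cpa_vs_csa_gap(family, w):
    """How much slower (in %) the CPA PE is than the CSA PE."""
    cpa, csa = SPEED_MHZ[family][w]
    return 100.0 * (csa - cpa) / csa


if __name__ == "__main__":
    for family in ("ACEX", "APEX"):
        print(family, f"CPA loss 8->32 bits: {cpa_slowdown_8_to_32(family):.1f}%")
    for family in ("CYCLONE", "STRATIX"):
        for w in (8, 16, 32):
            print(family, w, f"CPA slower by {cpa_vs_csa_gap(family, w):.1f}%")
```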


2.3.2 Montgomery Multiplication Coprocessor 

With the optimised PE for the MMM computations available, our objective is to complete the MMM coprocessor with all necessary parts. The memory registers, the interface to the control unit and the clock distribution logic are integral parts of the MMM coprocessor. An IP block including all these design units is well suited for quick system development, providing the full functionality for operations that require the MMM and a universal interface for connection to the control processor.

The architecture of the coprocessor and all its parts has been discussed in Section 2.2. Table 2 – 4 gives the area occupation and the critical path, expressed as the maximal clock frequency, on the Altera APEX 20K200E FPGA. As a sample configuration we have chosen the MMM coprocessor with a multiplier unit based on the MWR2MM CSA algorithm, with operand word width w = 32 and precision k = 1024 and k = 2048 bits, respectively.

Table 2 – 4 Area occupation in number of LEs and maximal clock frequency fclkMMM (MHz) of the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm

              k = 1024                  k = 2048
          LEs    fclkMMM (MHz)      LEs    fclkMMM (MHz)
n = 1      542      107.22           551      105.83
n = 2     1100      110.43          1136      106.96
n = 3     1621      108.34          1644      104.39
n = 4     1943      106.67          1980      103.85

2.3.3 Hardware-Software Co-design of MMM: a Case Study 

A SOC architecture is typical for configurable platforms. This approach reduces production costs and, at the same time, provides a very suitable platform for cryptographic applications. A SOC minimises the number of external interfaces and thereby also decreases the amount of leaked information.

Another advantage of the SOC is that hardware and software solutions can be compared more directly, because in a SOC both occupy the same kind of resources. The choice of the optimal resource utilisation can therefore be based on a proper analysis.

A fully software solution usually needs relatively large logic resources and small memory resources to implement the processor, and sometimes a large memory to hold the program code. A fully hardware solution needs greater logic resources and possibly some data memory. In a mixed hardware-software design, parallel and time-critical operations can be done in hardware (dedicated coprocessors) and complex sequential and control operations in software (main processor). In our SOC design the speedup of the coprocessor-based solution relative to the entirely software-based solution can be measured quite easily: both implementations use the same embedded processor, the Altera Nios soft core described in the following paragraph.

Embedded Nios Processor The Nios CPU [10] is a pipelined general-purpose RISC processor that is generated by a proprietary Altera VHDL generator (SOPC Builder) and can be synthesised and embedded in all recent Altera FPGAs. The Nios supports both 32-bit and 16-bit architectural variants. Both variants use 16-bit instructions. The principal features of the Nios instruction set architecture are:

1. large, windowed register file, 

2. simple, complete instruction set, 

3. powerful addressing modes, 

4. extensibility. 

Existing Nios peripherals (e.g. UART, timer, ...) as well as new custom peripherals can be connected through the Avalon bus [9]. Avalon is a simple bus architecture designed for connecting on-chip processor(s) and peripherals together into a SOC.

Comparison of Implementations The Nios processor is used as a control unit in the mixed implementations and as the main processor in the software implementation. The 32-bit version of the Nios CPU can optionally be configured to include a hardware-supported integer multiplier. The additional logic is used by the MUL instruction to compute a 32-bit result in three clock cycles¹. This option is not supported in the 16-bit Nios instruction set. In order to obtain a realistic comparison, the 32-bit Nios CPU with the hardware-supported MUL instruction was used for the software implementation.

¹ When using the MUL option with Altera Stratix devices, the hardware multiplier is implemented using the Stratix DSP blocks.

To compare the approaches, we have implemented three different systems:

1. Fully software solution implemented on a 32-bit Nios processor. 

2. Mixed software-hardware design with 16-bit Nios processor and the pipelined 

coprocessor including the CSA PE. 

3. Mixed software-hardware design with 16-bit Nios processor and the pipelined 

coprocessor including the CPA PE. 

Further, we provide the details of each system design and comment on the obtained results.

1. The software implementation of the MMM algorithm has been written in the Nios assembly language using all known optimisation techniques for the target processor. The Separated Operand Scanning (SOS) MMM method [39] was used as the best method for the given Nios RISC architecture [66]. Table 2 – 5 shows the timings for the execution of the MMM in the fully software solution running on the processor clocked at 50 MHz. The 32-bit Nios processor occupies 2137 LEs without the logic for the integer multiplier (for the MUL instruction), which requires an additional 446 LEs.

In a software implementation it is effective to apply different algorithms for multiplication and squaring, which reduces the execution time of the squaring operation. However, because of the vulnerability to side-channel attacks, it is better to align the execution times of both operations.

Table 2 – 5 Execution times of software implementation of MMM on Altera Nios development board (with APEX EP20K200 clocked at 50 MHz)

Length (e × w)     Method        Multiplication (ms)   Squaring (ms)
1024               SOS32MEM      2.40                  1.87
2048               SOS32MEM      9.47                  7.24

2. In the mixed hardware-software design, multiplication and squaring are implemented completely in hardware. Both operations share the same arithmetic unit. Because the computational load moves from the main processor to the dedicated coprocessor, the 32-bit version of the Nios core is no longer needed. Instead of the 32-bit controller one can use the 16-bit Nios processor, which is powerful enough to control the process and reduces the resource usage to a reasonable 1275 LEs.

The MMM coprocessor is based on a 16-bit (w = 16) CSA PE with 6 (n = 6) pipelined stages and occupies 1290 LEs. The total area occupation of this second, mixed hardware-software solution is comparable to that of the purely software solution. The processor has been clocked at 50 MHz and the MMM coprocessor at 150 MHz. The times necessary for the MMM and squaring are presented in Table 2 – 6.

Table 2 – 6 Execution times of mixed hardware-software implementation of MMM on Altera Nios development board (with APEX EP20K200) for the CSA PE

Length (e × w)     Method        Multiplication (ms)   Squaring (ms)
1024 = 64 × 16     MWR2MM CSA    0.073                 0.073
2048 = 128 × 16    MWR2MM CSA    0.291                 0.291

3. The third design we analyse is based on the same system architecture as the one introduced in the second point. This time the MMM coprocessor includes the 16-bit (w = 16) CPA PE with 9 (n = 9) pipelined stages. The parameters were chosen so that the occupied area is comparable to that of the other two design variants. The processor has been clocked at 50 MHz and the MMM coprocessor at 100 MHz. The results obtained for this configuration are presented in Table 2 – 7.

Table 2 – 7 Execution times of mixed hardware-software implementation of the MMM on Altera Nios development board (with APEX EP20K200) for the CPA PE

Length (e × w)     Method        Multiplication (ms)   Squaring (ms)
1024 = 64 × 16     MWR2MM CPA    0.069                 0.069
2048 = 128 × 16    MWR2MM CPA    0.278                 0.278
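The timings of Tables 2 – 5 to 2 – 7 and the LE counts quoted for the three systems can be put side by side with a few lines of arithmetic. This is only a worked calculation on the published numbers; the variable and function names are illustrative, and the area of the CPA-based system is omitted because only its Nios part is quantified in the text.

```python
# Software timings in ms (multiplication, squaring), from Table 2-5.
SOFTWARE_MS = {1024: (2.40, 1.87), 2048: (9.47, 7.24)}
# Hardware timings in ms (same for multiplication and squaring).
HARDWARE_MS = {
    "CSA": {1024: 0.073, 2048: 0.291},   # Table 2-6
    "CPA": {1024: 0.069, 2048: 0.278},   # Table 2-7
}
# Logic elements quoted in the text for two of the three systems.
AREA_LES = {
    "software (32-bit Nios + MUL logic)": 2137 + 446,
    "mixed (16-bit Nios + CSA coprocessor)": 1275 + 1290,
}


def speedups():
    """Return (PE type, operand bits, multiplication speedup, squaring speedup)."""
    rows = []
    for pe, times in HARDWARE_MS.items():
        for bits, hw_ms in times.items():
            mul_ms, sqr_ms = SOFTWARE_MS[bits]
            rows.append((pe, bits, mul_ms / hw_ms, sqr_ms / hw_ms))
    return rows


if __name__ == "__main__":
    for pe, bits, mul, sqr in speedups():
        print(f"{pe} {bits} bits: multiplication x{mul:.1f}, squaring x{sqr:.1f}")
    for name, les in AREA_LES.items():
        print(f"{name}: {les} LEs")
```

The multiplication speedups land between roughly 32 and 35, and the squaring speedups are somewhat lower because the software squaring routine is already faster than the software multiplication.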


2.3.4 Implementation Results 

The presented results have been obtained after the P&R process in the Altera Quartus development system, version 2.2. The simulation and synthesis of the designs were done with the development tools from Mentor Graphics included in the FPGA Advantage package. The carry chains in the CPA PE have been implemented using the lpm_add_sub function from the Library of Parameterized Modules (LPM), a technology-independent library of logic functions that are parameterized to achieve scalability and adaptability.

All the logic has been described in VHDL, taking into account the scalability and the possible choice of the system parameters. Apart from the memory register block and the carry chain logic, the designs are fully portable to any FPGA platform.

In Subsection 2.3.1 we have summarised the differences between the two concepts chosen for the implementation of the PE for the MMM. The results of the MMM coprocessor implementation show the importance of the clock distribution unit, since the achieved maximal clock frequency of the coprocessor exceeds the typical working frequency of the control units (the Nios soft-core processor in our case). According to the previous analysis, the critical path of the coprocessor does not change with an increasing number of pipelined stages n, and the relation between the occupied area and the computation time of the MMM operation stays linear.

From the case study, whose objective was to find an optimal utilisation of the platform resources, we can draw the following conclusions. Of the three designs, whose parameters were chosen to achieve a comparable area occupation, the software solution is the slowest². The two designs including the optimised MMM units implemented in hardware provide computation times around 30 times shorter. In the comparison between the CSA and CPA concepts, the latter provides slightly better times.

² In fact, the instruction set of the Nios processor has been enhanced by the hardware-supported MUL instruction. The completely software solution gives results too poor to be considered in the comparison.

2.4 Conclusions and Future Work

The chapter covers topics related to the effective implementation of an algebraic coprocessor for the MMM operation. We compared two basic concepts of the multiplier architecture. The improvements of the algorithm are related to the reconfigurable platform chosen for the implementation. The pair of concepts was chosen to present the contribution of the dedicated carry chain logic in recent FPGA families and to compare it with the classical approach using the CSA.

The analysed multiplier PE is the core unit of the developed MMM coprocessor. We took care to preserve the scalability of the PE in the other parts of the system as well. The interface of the coprocessor provides a flexible and powerful connection that can be adapted to the way the processor handles its peripherals. The presented MMM coprocessor was successfully incorporated into SOCs with two types of control unit: in this chapter the Altera Nios soft-core processor was used, and in Chapter 4 we describe a system controlled by an ARM processor.

The obtained solution is very flexible: thanks to its scalability and the possibility of choosing between two types of PE, it can be adapted to a large range of target platforms and applications. The features of the MMM coprocessor were confirmed by two proof-of-concept implementations. In this chapter we consider the application of the coprocessor to an RSA-based public-key cryptosystem, in which the typical operand length exceeds 1000 bits. In Chapter 4 we present a design of the coprocessor dedicated to integer factoring based on elliptic curves. The IP block covering the MMM coprocessor with all its features supports fast development of embedded systems.

Among the possible improvements of the design we mention better memory management for variables smaller than the total capacity of the memory block. The RSA application can be enhanced by the CRT method, which requires shorter operands. Thanks to its scalability, the MMM coprocessor can readily meet this requirement in the future.


3 Elliptic Curve Method in Hardware - preliminaries

Hardware implementations of factoring algorithms require special-purpose devices suitable for the efficient execution of intensive computations. In this chapter we provide the preliminaries for the topic of ECM hardware implementation.

In Section 3.1 we start with an introduction to factoring in general and present the motivation for implementing the ECM in hardware. The chapter continues with a summary of previous work in the area of ECM implementation (Section 3.2). The mathematical background of the method and a closer look at both phases of the ECM are given in Section 3.3.

3.1 Integer Factoring 

In the previous parts of the thesis we have explained that the security of the RSA cryptosystem relies on the difficulty of factoring large integers. Hence, the development of a fast factorisation method could allow the cryptanalysis of messages encrypted or signed by RSA. So far, however, the problem of factorisation has remained hard.

In this section we start with basic facts on integer factoring and present the most important factoring methods. Further, the ECM is described as a promising method for hardware implementation.

3.1.1 Factoring Algorithms 

We provide definitions of terms related to factoring and introduction to the factoring 

methods that can be found also in [80]. 

Factoring a positive integer n means finding positive integers u and v such that 

the product of u and v equals n, and such that both u and v are greater than 1. 

Such u and v are called factors (or divisors) of n, and n = uv is called a factorisation 

of n. Positive integers that can be factored are called composites. Positive integers 

greater than 1 that cannot be factored are called primes. 

In some factorisation methods we use a feature of integers called smoothness. We 

say that a positive integer is B-smooth if all its prime factors are ≤ B. An integer 

is said to be smooth with respect to S, where S is some set of integers, if it can be 

completely factored using the elements of S. We often simply use the term smooth, 

in which case the bound B or the set S is clear from the context. 
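The two notions of smoothness can be made concrete with a short sketch. The function names are illustrative, the factorisation uses naive trial division, and `is_smooth_over` assumes that the set S consists of primes.

```python
def prime_factors(m):
    """Return the set of prime factors of m >= 1 (naive trial division)."""
    factors = set()
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def is_b_smooth(m, bound):
    """True if every prime factor of m is <= bound."""
    return all(p <= bound for p in prime_factors(m))


def is_smooth_over(m, base):
    """True if m factors completely over the primes listed in base."""
    for s in base:
        if s < 2:
            continue
        while m % s == 0:
            m //= s
    return m == 1


if __name__ == "__main__":
    print(is_b_smooth(720, 5))         # 720 = 2^4 * 3^2 * 5
    print(is_b_smooth(77, 7))          # 77 = 7 * 11
    print(is_smooth_over(45, [3, 5]))  # 45 = 3^2 * 5
```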


We start with the simplest method for integer factoring, namely trial division. The smallest prime factor p of n can be found by testing whether n is divisible by each prime in succession, until p is reached. If we assume that a table of all primes ≤ p is available, this process takes π(p) division attempts (called trial divisions), where π(p) is the number of primes ≤ p (the prime counting function), which can be approximated as π(p) ≈ p/ln(p).

Since a composite n has at least one factor ≤ √n, factoring n using trial division takes approximately √n operations in the worst case. For many composites trial division is therefore infeasible as a factoring method. For most numbers it is very effective, however, because most numbers have small factors: 88% of all positive integers have a factor < 100, and almost 92% have a factor < 1000.
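Trial division and the quoted densities can be reproduced with a few lines of code. This is a sketch with illustrative function names; the density is computed as one minus the product of (1 − 1/p) over all primes p below the bound, which is the fraction of integers divisible by at least one such prime.

```python
from math import isqrt, log


def primes_below(bound):
    """All primes p < bound (sieve of Eratosthenes)."""
    if bound < 3:
        return []
    sieve = bytearray([1]) * bound
    sieve[0] = sieve[1] = 0
    for i in range(2, isqrt(bound - 1) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, bound, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def smallest_prime_factor(n):
    """Trial division by primes up to sqrt(n); returns n if n is prime (n >= 2)."""
    for p in primes_below(isqrt(n) + 1):
        if n % p == 0:
            return p
    return n


def fraction_with_factor_below(bound):
    """Density of the integers that have at least one prime factor < bound."""
    survivors = 1.0
    for p in primes_below(bound):
        survivors *= 1.0 - 1.0 / p
    return 1.0 - survivors


if __name__ == "__main__":
    print(smallest_prime_factor(91))
    print(len(primes_below(1001)), 1000 / log(1000))  # pi(1000) and its estimate
    print(fraction_with_factor_below(100), fraction_with_factor_below(1000))
```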

Several more efficient algorithms for factoring integers have been proposed. Each 

algorithm is appropriate for a different situation. For instance, the ECM [82] allows 

the efficient factoring of numbers with relatively small factors. The generalised 

number field sieve (GNFS, see [81]) is the best algorithm for factoring numbers with 

large factors and, hence, can be used for attacking the RSA cryptosystem. 

In the GNFS, many mid-size integers arise that have to be checked for smoothness, i.e. whether they decompose completely into small prime factors. The sieving step of the GNFS finds some of these factors. After dividing them out, one obtains a co-factor that has to be checked for smoothness. Let us call this step the co-factorisation or smoothness test. An appropriate choice for this task is the multiple polynomial quadratic sieve (MPQS, see [104]) or the ECM.

3.1.2 Motivation for Hardware Implementation 

The current world record in factoring a random RSA modulus is 200 decimal digits and was achieved in 2005 with a complete software implementation of the GNFS [63], using MPQS for the factorisation of the cofactors. For larger moduli it becomes crucial to use special hardware for factoring. Recently, some new hardware architectures for the sieving step of the GNFS have been proposed (e.g. SHARK [64], TWIRL [103]). The efficiency of, e.g., SHARK (and possibly other innovative GNFS realisations) is directly related to efficient support units for smoothness testing within the architecture.

It appears that the ECM is a better choice than the MPQS for the smoothness test, since the MPQS requires a larger silicon area and irregular operations. The ECM, on the other hand, is an almost ideal algorithm for dramatically improving the area-time product through special-purpose hardware. We summarise the advantages of the ECM in the following points:

1. ECM performs a very high number of operations on a very small set of input data; hence, it is not very I/O intensive.

2. ECM requires relatively little memory compared to other methods.

3. The operands needed for supporting the GNFS are well beyond the width of current computer buses, arithmetic units and registers, so special-purpose hardware can provide much better efficiency in implementation and computation time.

4. The nature of the smoothness testing in the GNFS allows a very high degree of parallelisation.

The key to efficient ECM hardware with a parallel architecture lies in fast arithmetic units. Such units for modular addition and multiplication have been studied thoroughly in the last few years, e.g. for use in cryptographic devices including ECC (see e.g. [71, 92]). Therefore, we could exploit the well-developed area of ECC architectures for our ECM design.

3.2 Previous Implementations of ECM 

Prior to the work described here, the ECM had, to our knowledge, never been implemented in hardware. In the context of special-purpose hardware for the GNFS, [27] mentions that the construction of special ECM hardware might be promising for supporting the GNFS. So far, however, only two concepts for an ECM hardware implementation have been published. The first one, also presented in this work, is a proof-of-concept design proposed by Jan Pelzl, Martin Šimka et al. [65, 94, 120]. The second one, from Kris Gaj et al. [67], improves our proposal and provides the most recent reference for the ECM implementation.

The main differences between the two concepts are in the following areas:

• control logic - external vs. internal, i.e. the way the control over the computation is distributed between the ECM units and the central control logic,

• memory management - thanks to a better organisation of the memory registers and the use of single-port memory access, the design of Gaj et al. requires significantly fewer memory blocks than ours (with dual-port access and a separate memory block for each register),

• parallelisation - better computation times are achieved by parallel execution of arithmetic operations and the addition of a second multiplier,

• Montgomery multiplier - while in our concept the multiplier design is based on the proposal of Tenca and Koç [108], in the design of Gaj et al. the multiplier comes from McIvor and McLoone [85]. It provides a shorter computation time, but also a less flexible architecture, which can be a disadvantage when the ECM parameters change.

By selecting a faster multiplier and utilising the resources better than our proof-of-concept design, the authors have improved the AT product by a factor of 3.7 for Phase 1 and 6.4 for Phase 2, using the same hardware platform.

In the software domain, there have been several attempts to apply the ECM to factorisation.

A parallel software implementation of the ECM on several workstations (Pentium II @ 350 MHz, Linux OS) is reported in [123]. The implementation uses fast network switches and has been programmed using the Message-Passing Interface (MPI) standard.

Two massively parallel implementations of the ECM based on systolic versions of the MMM are described in [45]. The authors apply a single instruction, multiple data (SIMD) approach on a particular type of parallel computer.

A well-known free software implementation of the ECM for factoring integers is available from [128] (GMP-ECM). The implementation is based on the GNU multiple precision (GMP) arithmetic library. The original purpose of the project was to find a factor of 50 digits or more by ECM. The participation of several developers made GMP-ECM an excellent resource for a state-of-the-art ECM software implementation, including many useful tweaks.

3.3 Mathematical Background 

The principles of the ECM are based on Pollard’s (p − 1)-method [95]. Therefore we start with a short summary of Pollard’s method. Afterwards we describe H. W. Lenstra’s ECM [82].


3.3.1 Pollard’s (p − 1)-algorithm 

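Let $k, n \in \mathbb{N}$ with n being the composite to be factored. Furthermore, let $p \mid n$ with $p \in \mathbb{P}$. Let $a \in \mathbb{Z}$ and n be co-prime, i.e. $\gcd(a, n) = 1$. Let $e = k(p - 1)$.

1. By Fermat’s little theorem,

\[
\begin{aligned}
a^{p-1} \equiv 1 \bmod p \;&\Rightarrow\; a^{k(p-1)} \equiv 1 \bmod p \\
&\Leftrightarrow\; a^{e} \equiv 1 \bmod p \\
&\Leftrightarrow\; a^{e} - 1 \equiv 0 \bmod p \\
&\Leftrightarrow\; p \mid (a^{e} - 1).
\end{aligned}
\]

2. $p \mid n$ yields $\gcd(a^{e} - 1, n) > 1$.

3. If $a^{e} \not\equiv 1 \bmod n$, then $1 < \gcd(a^{e} - 1, n) < n$. In this case, we have found a non-trivial divisor of n.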

Obviously, we cannot compute e = k(p−1) without the knowledge of p. Instead, 

we assume that p − 1 can be decomposed into many small factors below a certain 

bound B1. In this case, p − 1 is called B1-smooth. 

Let $B_2$ denote the highest prime power dividing $p - 1$ and choose e such that

\[
e = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}. \tag{3.1}
\]

By computing $a^{e}$ and $d = \gcd(a^{e} - 1, n)$ we hope to find a non-trivial factor d of n.
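The (p − 1)-method with the exponent of Equation 3.1 can be sketched as follows. This is a minimal illustration, not an optimised implementation: the function names are hypothetical, the bound B2 defaults to B1, and the sketch assumes B2 ≥ B1 so that every prime ≤ B1 enters e at least once.

```python
from math import gcd


def primes_up_to(limit):
    """All primes p <= limit (simple sieve)."""
    flags = [True] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if flags[i]:
            primes.append(i)
            for j in range(i * i, limit + 1, i):
                flags[j] = False
    return primes


def pollard_exponent(b1, b2):
    """e = product over primes p <= b1 of p^r, r maximal with p^r <= b2 (Eq. 3.1)."""
    if b2 < b1:
        raise ValueError("this sketch assumes b2 >= b1")
    e = 1
    for p in primes_up_to(b1):
        power = p
        while power * p <= b2:
            power *= p
        e *= power
    return e


def pollard_p_minus_1(n, b1, b2=None, a=2):
    """Return a non-trivial factor of n, or None if this choice of bounds fails."""
    if b2 is None:
        b2 = b1
    g = gcd(a, n)
    if g > 1:                       # a is not co-prime to n: lucky factor
        return g if g < n else None
    e = pollard_exponent(b1, b2)
    d = gcd(pow(a, e, n) - 1, n)    # a^e is computed modulo n
    return d if 1 < d < n else None


if __name__ == "__main__":
    # 299 = 13 * 23, and 13 - 1 = 12 = 2^2 * 3 is 5-smooth.
    print(pollard_p_minus_1(299, 5))
```

In the example, 23 − 1 = 22 contains the prime 11 > B1, so a^e is congruent to 1 modulo 13 only and the gcd isolates the factor 13.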

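In general, Pollard’s method can be defined as follows:

Let $G_p = (\mathbb{Z}_p)^{\star}$ and $G_n = (\mathbb{Z}_n)^{\star}$ be multiplicative groups and let φ be the canonical homomorphism

\[
\varphi : G_n \to G_p \quad \text{(reduction modulo } p\text{)} \tag{3.2}
\]

A factor of n is found if simultaneously $a^{e} \not\equiv 1 \bmod n$ and $a^{e} \equiv 1 \bmod p$, i.e.

\[
\begin{aligned}
&\forall k_1 \in \mathbb{N} : e \ne k_1 \cdot \mathrm{ord}_{G_n}(a), \\
&\exists k_2 \in \mathbb{N} : e = k_2 \cdot \mathrm{ord}_{G_p}(\varphi(a)).
\end{aligned}
\]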


3.3.2 ECM Algorithm 

In 1987, H. Lenstra came up with the idea of translating Pollard’s method from the groups $G_p$ and $G_n$ to the groups of points on elliptic curves E modulo n and modulo q [82]. Indeed, a group operation in $E(\mathbb{Z}_n)$ can be defined by using the given addition formulae [32].

The homomorphism φ corresponding to the one defined in Equation 3.2 is:

\[
\varphi : E(\mathbb{Z}_n) \to E(\mathbb{Z}_q) \quad \text{(reduction of coordinates modulo } q\text{)} \tag{3.3}
\]

The exponentiation in Pollard’s (p − 1) method is replaced by a point multiplication.

Let n be an integer without small prime factors which is divisible by at least two different primes, one of them q. Such numbers appear after trial division and a quick prime power test. Let $E(\mathbb{Z}_n)$ be an elliptic curve with good reduction at all prime divisors of n (this can be checked by calculating the gcd of n and the discriminant of E, which very rarely yields a prime factor of n) and let $P \in E(\mathbb{Z}_n)$, $P \ne O$, be a point on it.

A factor of n is found if $k \cdot P$ is not equal to the identity element in $E(\mathbb{Z}_n)$ but $k \cdot \varphi(P)$ equals the identity element in $E(\mathbb{Z}_q)$, i.e.

\[
\begin{aligned}
&\forall k_1 \in \mathbb{N} : k \ne k_1 \cdot \mathrm{ord}_{E(\mathbb{Z}_n)}(P), \\
&\exists k_2 \in \mathbb{N} : k = k_2 \cdot \mathrm{ord}_{E(\mathbb{Z}_q)}(\varphi(P)).
\end{aligned}
\]

Let the elliptic curve E be defined by the homogeneous Weierstrass equation:

\[
y^{2} z = x^{3} + a x z^{2} + b z^{3} \tag{3.4}
\]

In this case, the above conditions yield two properties for the z-coordinate $z_Q$ of the resulting point $Q = k \cdot P$:

\[
\begin{aligned}
k \ne k_1 \cdot \mathrm{ord}_{E(\mathbb{Z}_n)}(P) \;&\Leftrightarrow\; n \nmid z_Q, \\
k = k_2 \cdot \mathrm{ord}_{E(\mathbb{Z}_q)}(\varphi(P)) \;&\Leftrightarrow\; q \mid z_Q.
\end{aligned}
\]

Under these conditions, a non-trivial factor d of n is obtained by $d = \gcd(z_Q, n)$.

With the assumption that the order of P is $B_1$-smooth and does not contain any prime power larger than $B_2$, the scalar k is computed in the same way as e in Equation 3.1 as

\[
k = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}. \tag{3.5}
\]


If the order of P ∈ E(Fq) satisfies certain smoothness conditions described below, 

we can discover the factor q of n as follows: 

In the first phase of ECM, we calculate Q = kP where k is a product of prime 

powers p e ≤ B1 with appropriately chosen smoothness bounds. The second phase of 

ECM checks for each prime B1 

in E(Fq). Algorithm 3 – 1 summarises all necessary steps for both phases of ECM. 

Phase 2 can be done efficiently, e.g., using the Weierstraß form and projective 

coordinates pQ = (xpQ : ypQ : zpQ) by testing whether gcd(zpQ, n) is bigger than 1. 

Note that we can avoid all gcd computations but one at the expense of one 

modular multiplication per gcd by accumulating the numbers to be checked in a 

product modulo n and performing one final gcd. 
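The accumulation trick described above can be illustrated as follows. This is a minimal sketch; the function name `batched_gcd` and the toy numbers are assumptions made for the example only.

```python
from math import gcd


def batched_gcd(z_values, n):
    """Multiply all candidate z-coordinates modulo n and take a single gcd.

    Costs one modular multiplication per candidate instead of one gcd per
    candidate. If the product happens to be 0 modulo n, the result is n itself,
    i.e. only the trivial factor is reported.
    """
    acc = 1
    for z in z_values:
        acc = acc * z % n
    return gcd(acc, n)


if __name__ == "__main__":
    # Toy composite n = 7 * 13; the value 14 shares the factor 7 with n.
    print(batched_gcd([3, 14, 5], 91))
```

The price of batching is visible in the last case of the docstring: when several different prime factors are hit within one batch, the single gcd may return n, and the curve has to be discarded or the batch re-examined.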

Algorithm 3 – 1 Elliptic Curve Method

Require: Composite n
Ensure: Factor d of n
1: Phase 1:
2: Choose arbitrary curve E(Zn) and random point P ∈ E(Zn), P ≠ O
3: Choose smoothness bounds B1, B2 ∈ N
4: Compute k ⇐ ∏_{p_i ≤ B1} p_i^{e_{p_i}}, e_{p_i} ⇐ max{r ∈ N : p_i^r ≤ B2}
5: Compute Q = kP ⇐ (xQ, yQ, zQ)
6: Compute d ⇐ gcd(zQ, n)
7: Phase 2:
8: Set Π := 1
9: for each prime p with B1 < p ≤ B2 do
10: Compute pQ ⇐ (xpQ : ypQ : zpQ)
11: Compute Π ⇐ Π · zpQ
12: end for
13: Compute d ⇐ gcd(Π, n)
14: if 1 < d < n then
15: A non-trivial factor d is found
16: return d
17: else
18: Restart from choosing another elliptic curve in phase 1 (Step 2).
19: end if


If using only one single curve, the properties of the ECM are related to those of 

the Pollard’s (p − 1)-method. The advantage of the ECM lies in the possibility of 

choosing a different curve after each unsuccessful trial to increase the probability of 

finding factors of n. 

All calculations are done modulo n. If the final gcd of the product Π and n 

satisfies 

1 < gcd(Π, n) < n , (3.6) 

a factor is found. The parameters B1 and B2 control the probability of finding a divisor q. More precisely, if the order of P factors into a product of co-prime prime powers (each ≤ B1) and at most one additional prime between B1 and B2, the prime factor q is discovered.

The procedure will be repeated for other elliptic curves. To generate them one 

commences with the starting point P and constructs an elliptic curve such that P 

lies on it. 

It is possible that more than one or even all prime divisors of n are discovered 

simultaneously. This happens rarely for reasonable parameter choices and can be 

ignored by proceeding to the next elliptic curve. 

The running time of the ECM is given by

\[ T(q) = e^{(\sqrt{2} + o(1)) \sqrt{\log q \, \log\log q}} \qquad (q \to \infty) \tag{3.7} \]

operations; thus, it depends mainly on the size of the factors to be found and not on the size of n [34]. Note, however, that the operations are computed modulo n, so the running time of each operation depends on n.

Montgomery-Form Curves Apart from the Weierstraß form there are various other forms for elliptic curves. We use Montgomery's form (described by Equation 3.8), which was suggested by Montgomery in [89], and compute in the set S = E(Z/nZ)/{±1} using only the x- and z-coordinates.

\[ B y^2 z = x^3 + A x^2 z + x z^2 \tag{3.8} \]

The curves of this form always have an order divisible by 4. In our case, the curves 

can be chosen in such a way that they have an order divisible by 12. The advantage 

of the use of Montgomery form curves in cryptography is the inherent resistance 

against side channel attacks due to almost indistinguishable group operations, i.e. 

the elementary operations for addition and doubling of points are quite similar. A 


handicap of the Montgomery form is the fact that not every curve can be transformed into this form. Hence, there is only limited interest in implementing ECC based on Montgomery form curves.

The residue class of P + Q in this set can be computed from P, Q and P − Q using 4 multiplications and 2 squarings (see Equation 3.9). A doubling, i.e. 2P, can be computed from P and the curve parameter A (see Equation 3.8) using 3 multiplications and 2 squarings (Equation 3.10). Since we are only interested in checking whether we obtain the point at infinity O for some prime divisor of n, computing in S is no restriction.

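Addition:

\[ \begin{aligned} x_{P+Q} &\equiv z_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) + (x_P + z_P)(x_Q - z_Q)]^2 \pmod{n} \\ z_{P+Q} &\equiv x_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) - (x_P + z_P)(x_Q - z_Q)]^2 \pmod{n} \end{aligned} \tag{3.9} \]

Doubling:

\[ \begin{aligned} 4 x_P z_P &\equiv (x_P + z_P)^2 - (x_P - z_P)^2 \pmod{n} \\ x_{2P} &\equiv (x_P + z_P)^2 (x_P - z_P)^2 \pmod{n} \\ z_{2P} &\equiv 4 x_P z_P\,[(x_P - z_P)^2 + 4 x_P z_P (A + 2)/4] \pmod{n} \end{aligned} \tag{3.10} \]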

Finding Suitable Curves in Montgomery Form Assume a curve of the form

\[ B y^2 = x^3 + A x^2 + x \quad \text{with} \quad \gcd((A^2 - 4)B, n) = 1. \tag{3.11} \]

Such curves have a group order divisible by 4. To obtain an order divisible by 12, choose A and B such that

\[ A = \frac{-3a^4 - 6a^2 + 1}{4a^3}, \qquad B = \frac{(a^2 - 1)^2}{4a^3}, \qquad \text{with } a = \frac{t^2 - 1}{t^2 + 3}. \tag{3.12} \]

The point

\[ (x_0, y_0) = \left( \frac{3a^2 + 1}{4a}, \; \frac{\sqrt{3a^2 + 1}}{4a} \right) \tag{3.13} \]

is on the curve if 3a² + 1 = 4(t⁴ + 3)/(t² + 3)² is a rational square, which can be obtained by t² = (u² − 12)/(4u) with u³ − 12u being a rational square.
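The parametrisation can be checked with exact rational arithmetic. The following sketch is an illustration only: the function names are hypothetical, and the square root in y0 is avoided by working with y0² = (3a² + 1)/(16a²), so the check is valid for any rational t with a ≠ 0, whether or not 3a² + 1 is a square.

```python
from fractions import Fraction


def montgomery_curve_from_t(t):
    """Return (A, B, x0, y0_squared) for the parametrisation of Equations 3.12 and 3.13."""
    t = Fraction(t)
    a = (t * t - 1) / (t * t + 3)
    A = (-3 * a**4 - 6 * a**2 + 1) / (4 * a**3)
    B = (a**2 - 1) ** 2 / (4 * a**3)
    x0 = (3 * a**2 + 1) / (4 * a)
    y0_squared = (3 * a**2 + 1) / (16 * a**2)
    return A, B, x0, y0_squared


def on_curve(A, B, x0, y0_squared):
    """Check B*y0^2 = x0^3 + A*x0^2 + x0 exactly."""
    return B * y0_squared == x0**3 + A * x0**2 + x0


if __name__ == "__main__":
    for t in (2, 3, Fraction(5, 7)):
        print(t, on_curve(*montgomery_curve_from_t(t)))
```

In an actual ECM run the same expressions would be evaluated modulo n, with divisions replaced by modular inverses; a failing inversion already reveals a factor of n.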

First Phase of the ECM If the triple (P, mP, (m + 1)P ) is given in the Mont- 

gomery form, we can compute (P, 2mP, (2m + 1)P ) or (P, (2m + 1)P, (2m + 2)P ) 

by performing one addition (following the Equations 3.9) and one doubling (follow- 

ing the Equations 3.10) in Montgomery’s form. Thus, Q = kP can be calculated 


using ⌊log₂ k⌋ additions and duplications according to Algorithm 3 – 2, amounting to 11⌊log₂ k⌋ multiplications. If zP = 1, the number can even be reduced to 10⌊log₂ k⌋ modular multiplications.

Algorithm 3 – 2 Exponentiation for Curves in Montgomery Form 

Require: Integer k > 1 with k = (k_t k_{t−1} … k_1 k_0)_2 and a point P on the curve E_M : By² = x³ + Ax² + x.

Ensure: Product Q = kP . 

1: Pm ⇐ P 

2: Pm+1 ⇐ 2P 

3: for i = t − 1 to 1 do 

4: if ki = 1 then 

5: Pm ⇐ Pm + Pm+1 

6: Pm+1 ⇐ 2Pm+1 

7: else 

8: Pm+1 ⇐ Pm + Pm+1 

9: Pm ⇐ 2Pm 

10: end if 

11: end for 

12: if k0 = 1 then 

13: Q ⇐ Pm + Pm+1 

14: else 

15: Q ⇐ 2Pm 

16: end if 

17: return Q 
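Algorithm 3 – 2 together with Equations 3.9 and 3.10 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function names `xadd`, `xdbl` and `ladder` are hypothetical, points are (x, z) pairs of integers modulo n, and `a24` denotes (A + 2)/4 mod n, assumed to be precomputed.

```python
def xadd(P, Q, diff, n):
    """x-only addition (Equation 3.9): return P + Q given P, Q and diff = P - Q."""
    (xp, zp), (xq, zq), (xd, zd) = P, Q, diff
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return (zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n)


def xdbl(P, a24, n):
    """x-only doubling (Equation 3.10) with a24 = (A + 2)/4 mod n."""
    x, z = P
    s = (x + z) ** 2 % n
    d = (x - z) ** 2 % n
    t = (s - d) % n                      # equals 4*x*z
    return (s * d % n, t * (d + a24 * t) % n)


def ladder(k, P, a24, n):
    """Compute kP for k > 1 following the steps of Algorithm 3-2."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    bits = bin(k)[2:]                    # k_t ... k_0, with k_t = 1
    pm, pm1 = P, xdbl(P, a24, n)         # (P, 2P); the difference is always P
    for bit in bits[1:-1]:               # bits k_{t-1} ... k_1
        if bit == "1":
            pm, pm1 = xadd(pm, pm1, P, n), xdbl(pm1, a24, n)
        else:
            pm, pm1 = xdbl(pm, a24, n), xadd(pm, pm1, P, n)
    if bits[-1] == "1":                  # last bit k_0: only one result is needed
        return xadd(pm, pm1, P, n)
    return xdbl(pm, a24, n)
```

The pair (pm, pm1) always holds (mP, (m + 1)P), so the difference needed by the addition formula is the fixed starting point P. When zP = 1, the multiplication by zd in `xadd` becomes trivial, which is the saving of one multiplication per step mentioned above.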

By handling each prime factor of k separately and by using optimal addition chains, the number of multiplications can be decreased further to roughly 9.3⌊log₂ k⌋ (see [89]). The addition chains can be precalculated.

Second Phase of the ECM The standard way to calculate the points pQ for all primes B1 < p ≤ B2 is to precompute a small table of multiples kQ, where k runs through the differences of consecutive primes in the interval [B1, B2]. Then, a single point multiple p0Q is computed, with p0 being the smallest prime in that interval, and the corresponding table entries are added successively to obtain pQ for the next prime p.


Two major improvements have been proposed for the ECM [33, 89]. Using Montgomery's form, the procedure is difficult to implement, but it can be improved as follows.

The following lemma allows us to reduce the complexity by repeatedly multiplying a difference of two products instead of computing complex point operations in each step of phase 2:

Lemma 1 Let q = a + b with a and b co-prime. Furthermore, let qQ = A + B with A = aQ and B = bQ. Then, for gcd(zQ, n) = 1, zqQ ≡ 0 mod t if and only if

\[ x_A \cdot z_B - z_A \cdot x_B \equiv 0 \bmod t. \]

Proof

1. Montgomery's point addition formula 3.9 yields

\[ t \mid z_{qQ} \;\Leftrightarrow\; t \mid x_{A-B}\,[x_A \cdot z_B - z_A \cdot x_B]^2 \;\Leftarrow\; t \mid (x_A \cdot z_B - z_A \cdot x_B). \]

2. If zqQ ≡ 0 mod t, qQ is the identity point on the elliptic curve over Ft. Hence, A = −B, i.e. A and B are zero or

\[ x_A / z_A \equiv x_B / z_B \bmod t. \]

A = B = 0 yields Q = 0, thus t | zQ, which is a contradiction to the assumption gcd(zQ, n) = 1. Then we have xA/zA ≡ xB/zB mod t and xA · zB ≡ zA · xB mod t, respectively.

The improved standard continuation uses a parameter 2 < D < B1. First, a table T of multiples kQ of Q for all 1 ≤ k < D/2 with gcd(k, D) = 1 is calculated. Each prime B1 < p ≤ B2 can be written as p = mD ± k with kQ ∈ T. By Lemma 1, gcd(zpQ, n) > 1 if and only if gcd(xmDQ zkQ − xkQ zmDQ, n) > 1. Thus, we calculate the sequence mDQ (which can easily be done in Montgomery's form) and accumulate the product of all xmDQ zkQ − xkQ zmDQ for which mD − k or mD + k is prime.

The memory requirements for the improved standard continuation are ϕ(D)/2 points for the table T and the points DQ, (m − 1)DQ and mDQ for computing the sequence, altogether ϕ(D) + 6 numbers. The computational costs consist of the generation of T and the calculation of mDQ, which amounts to at most D/4 + B2/D + 7 elliptic curve operations (mostly additions), and at most 3(π(B2) − π(B1)) modular multiplications, π(x) being the number of primes up to x. The last term can be lowered if D contains many small prime factors, since this will increase the number of pairs (m, k) for which both mD − k and mD + k are prime. Neglecting space considerations, a good choice for D is a number around √B2 which is divisible by many small primes.
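The bookkeeping behind the improved standard continuation, i.e. which multiples kQ go into the table T and how a prime p is written as mD ± k, can be sketched as follows. The function names `table_indices` and `split_prime` are hypothetical and only serve the illustration; the elliptic curve arithmetic itself is left out.

```python
from math import gcd


def table_indices(D):
    """All k with 1 <= k < D/2 and gcd(k, D) = 1, i.e. the multiples kQ stored in T."""
    return [k for k in range(1, D) if 2 * k < D and gcd(k, D) == 1]


def split_prime(p, D):
    """Return (m, k) with p = m*D + k or p = m*D - k and 0 <= k <= D/2."""
    m, r = divmod(p, D)
    if 2 * r > D:
        # The remainder is closer to the next multiple of D: use p = (m+1)*D - k.
        m, r = m + 1, D - r
    return m, r


if __name__ == "__main__":
    print(table_indices(30))       # phi(30)/2 = 4 entries
    print(split_prime(97, 30))     # 97 = 3*30 + 7
    print(split_prime(89, 30))     # 89 = 3*30 - 1
```

For every prime p > D the resulting k is co-prime to D, so it is always found in the table; a pair (m, k) for which both mD − k and mD + k are prime is covered by a single accumulated factor.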

4 Elliptic Curve Method in Hardware 

We present the first published hardware implementation of the ECM for integer factoring. The ECM implementation includes complete hardware logic that supports ECM factoring of numbers of up to approximately 200 bits. The proposed solution applies parameters best suited to finding factors of up to about 42 bits. The ECM design features supporting logic for the computation of modular operations such as addition, subtraction, multiplication and squaring. Multiplication and squaring are computed in the MMM unit analysed in Chapter 2. The circuit also scales well to larger and smaller bit lengths. For proof-of-concept purposes, the ECM architecture has been implemented as a software-hardware co-design on an FPGA and an embedded micro-controller in a SOC. Such a design fits the needs of recent proposals for hardware architectures for the GNFS (see, e.g. [64]) and can reduce the overall costs of a GNFS device considerably.

Parts of this section were published in papers [65,94,120]. The research achieve- 

ments described in this chapter include the following: 

• ECM algorithm for hardware – algorithm adaptation and parametrisation, 

• ECM implementation – unit design, parallelisation, case study for GNFS. 

The ECM implementation was done as joint work, mainly with Jan Pelzl from Ruhr University Bochum. Christine Priplata and Colin Stahlke (Edizone GmbH, Germany) and Jens Franke and Thorsten Kleinjung (University of Bonn, Germany) also cooperated within the SHARK project, which includes the ECM design.

Section 4.1 describes the selection of the parameters of the ECM. The architecture of the implementation and a discussion of the chosen algorithms for the modular operations are presented in Section 4.2. Implementation details


and a case study with GNFS based on ECM units are summarised in Section 4.3. Finally, we conclude the chapter with a discussion of the obtained results.

4.1 Parameterisation of the ECM Algorithm 

Our implementation focuses on the factorisation of numbers up to 200 bits with factors of up to around 42 bits. Thus, suitable parameters need to be found for the smoothness bounds B1 and B2 and for the parameter D used in the improved standard continuation (see the description of the ECM second phase in Section 3.3.2). We look for values that yield a high probability of success together with a relatively small running time and area consumption. Since the running time depends on the size of the (unknown) factors to be found, optimal parameters cannot be known beforehand. Hence, good parameters have to be found by experiments with different prime bounds.

4.1.1 Phase 1 

Deduced from software experiments, we choose B1 = 960 and B2 = 57 000 as prime bounds. The value of k has 1 375 bits; hence, assuming the binary method (Algorithm 3 – 2), 1 374 point additions and 1 374 point duplications are required for the execution of phase 1. Due to the use of Montgomery coordinates, the coordinate zP of the starting point P can be set to 1, so that the addition takes only 5 multiplications instead of 6. The improved phase 1 (with optimal addition chains) has to use the general case, where zP ≠ 1. For the sake of simplicity and a preferably simple control logic, we choose the binary method for the time being. For the chosen parameters, the computational complexity of phase 1 is 1 374 · (5 + 5) = 13 740 modular multiplications and squarings³. With optimised addition chains this number can be reduced to approximately 12 000 modular multiplications and squarings.

According to Equation 3.10, duplicating a point 2PA = PC involves the input 

values xA, zA, A24 and n, where A24 = (A + 2)/4 is computed from the curve pa- 

rameter A (see Equation 3.8) in advance and should be stored in a fixed register. 

A point addition PC = PA + PB handles the input values xA, zA, xB, zB, xA−B, zA−B 

and n (see Equation 3.9). 

Notice that the values n, A24, xA−B and zA−B do not change during phase 1. Furthermore, zA−B = z1 can be chosen to be 1. Thus, no register is required for zA−B. The output values xC and zC can be written to certain input registers to save memory. If we assume that the ECM unit does not execute addition and duplication in parallel, at most 7 registers for the values in Zn are required for phase 1. Additionally, we require 4 temporary registers for intermediate values. Thus, a total of 11 registers is required for phase 1.

³ Squarings and multiplications are considered to have identical complexity in our case, since the same hardware unit is used for both multiplication and squaring.

4.1.2 Phase 2 

For the prime bounds chosen, 5 621 primes p ∈ [B1, B2] have to be tested in phase 2. With the prime bounds fixed, the computational complexity depends on the size of D. Hence, D should consist of small primes in order to keep ϕ(D) as small as possible. We consider the cases D = 6, D = 30, D = 60 and D = 210. The initial values can be computed by first computing Q̂ = DQ and then (B1/D)·Q̂ with the binary method, which automatically yields (B1/D − 1)·Q̂. The total number of modular multiplications is determined by the number of point additions, point duplications and multiplications for the product Π.

Table 4 – 1 displays the computational complexity and the number of registers required additionally for phase 2. For the numbers in the table, we assume the use of Algorithm 3 – 2 for computing the initial values. E.g., in the case D = 30, the cost for the computation of DQ, (B1/D − 1)DQ and (B1/D)DQ is 8 point additions and 8 point duplications. For the same D, the computation of the table involves 5 point additions and 2 point duplications, and the product Π accounts for 13 590 modular multiplications.

Remark: for the case D = 210, we start with B1 = 1 050 in order to assure that 

D and B1 share the same prime factors. For phase 2 we choose D = 30 to obtain 

a minimal AT product of the design. Since ϕ(D) = 8 is small, only 8 additional 

registers are required to store all coordinates in a table. Unlike in phase 1, we have 

to consider the general case for point addition where zA−B ≠ 1. Hence, an additional register for this quantity is needed.

For the product Π of all xA · zB − zA · xB, one more register is necessary. The 

temporary registers from phase 1 suffice to store the intermediate results xA · zB, 

zA · xB and xA · zB − zA · xB. Hence, additional 10 registers for phase 2 yield a total 

of 21 required registers for both phases. The computational complexity of phase 2 is 

1 881 point additions and 10 point duplications. Together with the 13 590 modular 

multiplications for computing the product Π, 24 926 modular multiplications and 

squarings are required. 

Table 4 – 1 Computational complexity and memory requirements for phase 2 depending on D

       number of modular multiplications for                                       number
D      point additions                 point duplications   product Π   total      of regs.
6      (9 + 0 + 9 340) · 6 = 56 094    (9 + 0) · 5 = 45     14 625      70 764     4
30     (8 + 5 + 1 868) · 6 = 11 286    (8 + 2) · 5 = 50     13 590      24 926     10
60     (8 + 9 + 934) · 6 = 5 706       (8 + 2) · 5 = 50     13 629      19 385     18
210    (9 + 28 + 266) · 6 = 1 818      (9 + 5) · 5 = 70     13 038      14 926     50

For a high probability of success (p > 80%) of finding a single factor of 42 bits, software experiments suggest running ECM on approximately 20 different curves per candidate for the given parameters. For factors of 40 bits, only 10 curves are required on average for a similar probability of success.
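The totals of Table 4 – 1 can be re-derived with a short script. This is a sketch under assumptions: the per-row counts are copied from the table, the third addend of the point additions is assumed to equal ⌊(B2 − B1)/D⌋ (which reproduces the tabulated values), and the register count is assumed to be ϕ(D) + 2, i.e. the table entries plus one register each for zA−B and Π. All identifiers are hypothetical.

```python
from math import gcd

ADD_COST, DBL_COST = 6, 5      # modular multiplications per general addition / duplication
B2 = 57_000

# D -> (B1, additions for initial values, additions for table T,
#       duplications for initial values, duplications for table T,
#       multiplications for the product), copied from Table 4-1
ROWS = {
    6:   (960, 9, 0, 9, 0, 14_625),
    30:  (960, 8, 5, 8, 2, 13_590),
    60:  (960, 8, 9, 8, 2, 13_629),
    210: (1_050, 9, 28, 9, 5, 13_038),
}


def euler_phi(d):
    """Euler's totient function by direct counting (fine for small d)."""
    return sum(1 for k in range(1, d + 1) if gcd(k, d) == 1)


def phase2_cost(D):
    """Return (total modular multiplications, additional registers) for a given D."""
    b1, add_init, add_table, dbl_init, dbl_table, product = ROWS[D]
    add_sequence = (B2 - b1) // D      # assumed number of steps of the sequence mDQ
    additions = (add_init + add_table + add_sequence) * ADD_COST
    duplications = (dbl_init + dbl_table) * DBL_COST
    return additions + duplications + product, euler_phi(D) + 2


if __name__ == "__main__":
    for D in ROWS:
        print(D, phase2_cost(D))
```

The script makes the trade-off visible: larger D shortens the sequence mDQ sharply, while the table, and with it the register count, grows with ϕ(D).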

4.2 Design of the ECM Unit 

The ECM unit consists of three main parts: the Arithmetic Logic Unit (ALU), the 

memory part (registers) and an internal control logic (see Figure 4 – 1). Each unit 

has a very low communication overhead since all intermediate results during com- 

putation are stored inside the unit, in the registers. Before the actual computation 

starts, all required initial values (xP , n, A24) are assigned to memory registers of the 

unit. This is the only data input. 

The only output is the above mentioned product Π. The number Π is read from 

the unit’s memory only at the very end of the computation. The computation of 

gcd(Π, n) as well as the commands for the ECM units are handled outside the ECM 

units by the central control logic. 

[Block diagram: the central control logic is connected via the ctrl and data buses to the ECM unit, which comprises control logic, memory and ALU.]

Figure 4 – 1 Architecture of the ECM unit


4.2.1 Control Logic 

The central control logic is connected to each ECM unit via a control bus (ctrl). The 

logic coordinates the data exchange with the unit before and after computation and 

starts each computation in the unit by a special set of commands. The commands 

contain an instruction for the next computation to be performed (i.e. add, subtract, 

multiply, square), including the in- and output registers to be used. The start of an 

operation is invoked by setting the start-bit to the active level. 

The control bus has to offer the possibility to specify which input register(s) and 

which output register are connected to the ALU. Only certain combinations of in- 

and output registers occur, offering the possibility to reduce the complexity of the 

logic and the width of the control bus by compressing the necessary information. 

For simplicity and clarity, we skipped the further optimisation of the commands. Instead, we use a clearly understandable structure for the commands. A command consists of 16 bits, which are assigned as shown in Table 4 – 2 (LSB is left).

Table 4 – 2 A command syntax for the ECM unit (LSB left)

start   operation   input 1   input 2   output
X       XX          XXXX      XXXX      XXXXX
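The 16-bit command layout of Table 4 – 2 can be modelled as a packed bit field. The sketch below assumes that the leftmost field (start) occupies the least significant bit, as the caption suggests; the opcode values in `OPCODES` are purely hypothetical, since the text does not specify the encoding of the four operations.

```python
# Field names and widths in LSB-first order, following Table 4-2.
FIELDS = (("start", 1), ("operation", 2), ("input1", 4), ("input2", 4), ("output", 5))

# Hypothetical opcode assignment for illustration only.
OPCODES = {"add": 0, "sub": 1, "mul": 2, "sqr": 3}


def encode_command(**values):
    """Pack the named fields into one 16-bit command word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit into {width} bits")
        word |= value << shift
        shift += width
    return word


def decode_command(word):
    """Unpack a 16-bit command word into a dictionary of fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


if __name__ == "__main__":
    cmd = encode_command(start=1, operation=OPCODES["mul"], input1=3, input2=5, output=17)
    print(f"{cmd:016b}", decode_command(cmd))
```

The field widths sum to 16 bits; 4 input bits and 5 output bits are enough to address the 21 registers discussed in Section 4.1 only on the output side, which is consistent with the remark that only certain combinations of in- and output registers occur.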

If several ECM units work in parallel, only one central control logic is needed. 

All commands are sent in parallel to all units. Separate communication with each 

of all units, one by one, is expected only in the beginning and in the end of the 

computations. The unit’s memory cells have to be written and read out separately. 

Once the computations in all units are finished, the LSB of the central status register is set to its active value to indicate the units' availability for further commands.

Each ECM unit includes some internal control logic in order to coordinate the 

data and command flow inside the unit. Once a command with the corresponding 

start bit is set, the computation inside the unit is started. The ALU is fed by 

corresponding input registers and the results are stored again inside the unit in one 

of registers. Once the computation is finished, a status bit is set to indicate the 

unit’s availability for further commands. 

4.2.2 Memory Management 

The addresses specified above refer to relative addresses inside each unit since we 

want to address the same register in multiple ECM units in parallel. For reading 


from or writing to a single register in a specific ECM unit, the unit needs to be identified separately by a unique address prefix. In combination with an address for each unit, a register has a unique hardware address and can be addressed from outside the ECM unit. This is imperative since the central control logic writes data to these registers before phase 1 starts and reads data from one of the registers after phase 2 has finished.

Each register can contain n bits and is organised in e = ⌈(n + 1)/w⌉ words of size w (see Figure 4 – 2). Memory access is performed word-wise. Reasonable values for w are w = 4, 8, 16, 32, as dictated by the word widths required by the included multiplier.

[Diagram: registers P1 … P21, each consisting of e words (indexed 0 to e − 1) of w bits, i.e. e × w bits per register.]

Figure 4 – 2 Organisation of the ECM unit's memory registers for 21 variables with e words of width w

The ALU performs the arithmetic modulo 2n, i.e., modular multiplication, mod- 

ular squaring, modular addition and subtraction. 

4.2.3 Choice of the Arithmetic Algorithms 

The main goal in designing the ECM unit was to obtain an area-time efficient implementation. All algorithms are chosen to achieve a low area and a relatively high speed. Low area consumption can be achieved by structures that allow for a certain degree of pipelining and consequently do not require much memory. For the ECM, we have chosen a set of algorithms which seem to be very well suited for our purpose. The chosen algorithms are fully scalable and make it possible to analyse different unit parameters and their impact on the unit's performance.

In the following, we briefly describe the algorithms for modular addition, subtraction and multiplication to be implemented in the ALU. Squaring is done with the multiplication circuit, since a separate hardware circuit for squaring would increase the overall AT product. Similarly, subtraction can be computed with a slightly modified circuit for addition.

Modular Multiplication An efficient Montgomery multiplier, highly suitable for 

our design is described in [108]. While in [108] a structure with carry-save adders 

and redundant representation of operands has been implemented, we have chosen a 

configuration with carry-propagate adders and non-redundant representation that 

makes a more effective implementation possible especially when the target plat- 

form supports fast carry chain logic. A detailed analysis and comparison of both 

structures can be found in [46] and also in this thesis in chapter 2. 

The depicted hardware performs a slightly modified MWR2MM (Algorithm 2 – 1), but with a non-redundant carry-propagate architecture (earlier denoted as MWR2MM CPA). Therefore, our previous considerations and analysis of parameters for other variants of the MMM algorithm are also valid for this version. In the implemented algorithm (Algorithm 4 – 1), step (a) uses only bit operations instead of the more expensive word-wise addition originally proposed in [108].

The final reduction step of the originally proposed MMM (Algorithm 1 – 2) can be omitted when the following condition is fulfilled:

\[ 4M < 2^n. \tag{4.1} \]

With bounded input values X, Y < 2M, the output value is also bounded (S < 2M).
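The effect of condition 4.1 can be observed with a bit-serial software model of radix-2 Montgomery multiplication. The sketch below is not the word-wise MWR2MM circuit: it works on whole integers, and the name `mont_mul` is hypothetical. It only shows that, for an odd modulus M with 4M < 2^n and inputs below 2M, the result stays below 2M without a final subtraction.

```python
def mont_mul(x, y, M, n):
    """Radix-2 Montgomery product x*y*2^(-n) mod M without the final subtraction.

    Requires an odd modulus M with 4*M < 2**n and inputs 0 <= x, y < 2*M.
    The result S satisfies S < 2*M and S * 2**n == x * y (mod M).
    """
    if M % 2 == 0 or 4 * M >= (1 << n):
        raise ValueError("need odd M with 4*M < 2**n")
    if not (0 <= x < 2 * M and 0 <= y < 2 * M):
        raise ValueError("inputs must lie in [0, 2*M)")
    S = 0
    for i in range(n):
        xi = (x >> i) & 1
        # Parity of S + xi*y, obtained from single bits only (no word addition).
        q = (S ^ (xi & y)) & 1
        # Adding q*M makes the sum even, so the shift is an exact division by 2.
        S = (S + xi * y + q * M) >> 1
    return S


if __name__ == "__main__":
    M, n = 97, 9                      # 4*97 = 388 < 2**9 = 512
    r = mont_mul(150, 190, M, n)
    print(r, r < 2 * M, (r * 2**n - 150 * 190) % M == 0)
```

Because the output again lies below 2M, it can be fed directly into the next multiplication, which is what makes chains of point additions and duplications possible without intermediate reductions.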

A minimal AT product of the sole multiplier can be achieved with a word width 

of 8 bits and a pipeline depth of 1 (w = 8, p = 1, see [108]). However, for our 

ECM architecture, the AT product does not only depend on the AT product of the 

multiplier. In fact, the multiplier only takes a comparably small part of the overall 

area. On the other hand, the overall speed relies primarily on the speed of the 

multiplier. Thus, we choose a pipeline depth of p = 2 for word width w = 32 bits, 

in order to achieve a shorter computation time for multiplication. 

Algorithm 4 – 1 Modified MWR2MM algorithm

1: S ⇐ 0
2: for i = 0 to n − 1 do
3: q_i ⇐ x_i · Y^(0)_0 + S^(0)_0
4: if q_i = 1 then
5: for j = 0 to e do
6: (C_a, S^(j)) ⇐ C_a + x_i · Y^(j) + M^(j)
7: (C_b, S^(j)) ⇐ C_b + S^(j)
8: S^(j−1) ⇐ (S^(j)_0, S^(j−1)_{w−1..1})
9: end for
10: else
11: for j = 0 to e do
12: (C_a, S^(j)) ⇐ C_a + x_i · Y^(j)
13: (C_b, S^(j)) ⇐ C_b + S^(j)
14: S^(j−1) ⇐ (S^(j)_0, S^(j−1)_{w−1..1})
15: end for
16: end if
17: S^(e) ⇐ 0
18: end for

Modular Addition and Subtraction Addition and subtraction are implemented as one circuit. As with the multiplication circuit, the operations are done word-wise, and the word size and number of words can be chosen arbitrarily. Since the same memory is used for input and output operands, we choose the same word size as for the multiplier. The subtraction relies on the same hardware as the adder; only one input bit has to be changed (sub = 1) in order to compute a subtraction rather than an addition (see Figure 4 – 3). All operations are done modulo 2n. Algorithms 4 – 2 and 4 – 3 show the elementary steps of a modular addition and subtraction, respectively.

If x + y ≥ 2n a reduction can be applied by simple subtraction of 2n. A variable 

z contains the result and T is a (temporary) register. A comparison z < 2n takes 

the same amount of time as a subtraction T = z − 2n. Thus, we compute the 

subtraction in all cases and decide by the sign of the values, which one to take as 

the result (z or T ). If T is the correct result, the content of T has to be copied to 

the register z. 

For a modular addition, we need at most

\[ T_{add} = 3(e + 1) \tag{4.2} \]

clock cycles, where e is the number of words (for the implemented non-redundant form of operands, e = ⌈(N + 1)/w⌉). On average, a reduction is needed only every second time. However, since the control of phase 1 and phase 2 is parallelised for many units, we have to assume the worst-case running time, which is given by Equation 4.2.

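[Schematic: chain of full adders (FA). Recoverable signal labels: operand bits X_0 … X_{w−1}, Y_0 … Y_{w−1} and M_0 … M_{w−1}, carries C_a and C_b, control input sub, result bits S_0 … S_{w−1}.]

Figure 4 – 3 Scalable addition and subtraction unit for operands with word width w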

The subtraction x − y can be accomplished by the addition of x with the bitwise complement of y and 1. The addition of 1 is simply achieved by setting the first carry bit to one (cin = 1) (Step 1). Since the result can be negative, a final verification is required. If necessary, the modulus has to be added. Algorithm 4 – 3 describes the modular subtraction.

In step 1, both memory cells z and T obtain the same value, which can be done 

in hardware in parallel at the same time without any additional overhead. After the 

computation of the difference, one can check for the correctness of the result. 

Hence, subtraction can be performed more efficiently than addition and requires in the worst case

\[ T_{sub} = 2(e + 1) \tag{4.3} \]

clock cycles.

Algorithm 4 – 2 Modular addition 

Require: Two integers x, y < 2n 

Ensure: Sum z = x + y mod 2n 

1: z ⇐ x + y 

2: T ⇐ z − 2n 

3: if T ≥ 0 then 

4: z ⇐ T 

5: end if 

6: return z 

Algorithm 4 – 3 Modular subtraction 

Require: Two integers x, y < 2n 

Ensure: Difference z = x − y mod 2n 

1: T = z ⇐ x − y 

2: if z < 0 then 

3: z ⇐ T + 2n 

4: end if 

5: return z 
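Algorithms 4 – 2 and 4 – 3 translate almost literally into Python. The sketch below works on whole integers rather than on words; the function names `mod_add` and `mod_sub` are hypothetical, and n denotes the modulus, so that all values are kept in the range [0, 2n).

```python
def mod_add(x, y, n):
    """Algorithm 4-2: return (x + y) mod 2n for 0 <= x, y < 2n."""
    z = x + y
    t = z - 2 * n              # always computed, as in the hardware
    return t if t >= 0 else z  # the sign of t selects the result


def mod_sub(x, y, n):
    """Algorithm 4-3: return (x - y) mod 2n for 0 <= x, y < 2n."""
    z = x - y
    return z + 2 * n if z < 0 else z


if __name__ == "__main__":
    n = 97
    print(mod_add(150, 100, n))   # 250 - 194
    print(mod_sub(10, 150, n))    # -140 + 194
```

The results are congruent to x ± y modulo n and stay below 2n, which matches the input range expected by the Montgomery multiplier without final subtraction.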

4.2.4 Parallelization of the Algorithm 

ECM can be perfectly parallelized by using different curves in parallel since the 

computations of each unit are completely independent. For the control of more 

than one ECM unit, it is essential to know that both phases, phase 1 and phase 2, 

are controlled completely identically, independent of the composite to be factored. 

Solely the curve parameter and possibly the modulus of the units and, hence, the 

coordinates of the initial point differ. Thus, all units have to be initialized differently 

which is done by simply writing the values into the corresponding memory locations 

sequentially. 

During the execution of both phases, exactly the same commands can be sent to 

all units in parallel. Since the runtime of multiplication/squaring is constant (does 

not rely on input values) and for addition/subtraction differs at most in 2(e + 1) 

clock cycles, all units can execute the same command in approximately the same 

time. 

After phase 2, the results are read from the units one after another. The required 

time for this data I/O is negligible for one ECM unit since the computation time of 

both phases dominates. For several units in parallel, the computation time does not 


change, but the time for data I/O scales linearly with the number of units. Hence, 

not too many units should be controlled by one single logic. For massively parallel 

ECM in hardware, the ECM units can be segmented into clusters, each with its own 

control unit. 

4.3 Implementation of the ECM Unit 

This section presents the actual hardware implementation done on a SOC (FPGA 

and embedded microprocessor). This first hardware implementation of ECM is de- 

signed as a proof-of-concept. All timings are obtained by using real hardware, not 

only simulation. All results have been carefully checked by a reference implementa- 

tion in software. 

4.3.1 Hardware Platform 

The ECM implementation is realized as a hybrid design. It consists of an ECM 

unit implemented on an FPGA (Xilinx Virtex2000E-6) [124] and a control logic 

implemented in software on an embedded micro-controller (ARM7TDMI, 25MHz) 

[90]. The ECM unit is coded in VHDL and was simulated and synthesised for the 

FPGA by using FPGA Advantage tools, place & route was done in Xilinx ISE. For 

the actual VHDL implementation, memory cells have been realized with the FPGA's internal block RAM. For the word width w = 32 bits, 2 blocks with e = ⌈(N + 1)/w⌉ words are used for each register due to the dual-port access mode and the selected algorithm for multiplication.

The ECM unit, as implemented, expects the commands which are written to a 

control register accessible by the embedded ARM processor. Required point coordi- 

nates and curve parameters are loaded into the ECM unit before the first command 

is decoded. For this purpose, these memory cells of unit are accessible from the 

outside by a unique address. Internal registers, which are only used as temporary 

registers during the computation are not accessible from the outside, by the micro- 

controller. 

The control of the whole unit is done by the micro-controller present on the board. The processor controls the data transfer from and to the units, and issues the commands for all steps in phase 1 and phase 2 for the central control logic inside the FPGA. For code generation, debugging and compilation, the ARM Developer Suite 1.2 was used. For details on the ARM microprocessor, see [23]. At a later stage, a soft-core processor (in VHDL), e.g. Altera Nios [10], could be used instead of a hard-wired ARM microprocessor.

For a suitable implementation on a selected platform one can choose the word 

width w, number of words e (length of operands), level p of pipeline stages of the 

multiplier, and the number of ECM units. Although the presented implementation 

was realised on a Xilinx Virtex-E FPGA, the proposed algorithms and the design 

architecture can be implemented on any FPGA. Hence, a significant speed-up on 

state-of-the-art devices can be expected. Anyway, the platform at hand is sufficient 

for proof-of-concept purposes. Since the suggested clock rate of the synthesis tool 

was higher than the actual supported frequency of the hardware, no attempt to 

further accelerate the design has been made. Due to the lack of FPGA specific 

optimisations, the code can easily be used for different types of FPGAs that include 

dedicated memory blocks and fast carry-chain logic. 

The actual design was done for n = 198 bit composites. The parameters for the multiplier are p = 2 and w = 32. Scaling the design to bit lengths from 100 to 300 bits can be easily accomplished. In this case, the AT product will decrease or increase on the order of O(N²).

4.3.2 Results 

After synthesis and place and route, the binary image was loaded onto the FPGA and clocked at a frequency of 25 MHz. Hence, the cycle length of the ALU performing the modular arithmetic is 40 ns. Table 4-3 shows the timings of the relevant operations of the implementation.

The hardware factorization design includes full support for all operations needed during ECM phases 1 and 2. The timings for phases 1 and 2 were obtained from measurements on a test board. The time for initialization and for reading from the memories is not taken into account, since it only delays the computation at the very beginning and the very end.

Although a squaring is computed with the multiplication circuit, its overhead is slightly lower, which makes it a mere 0.3% faster. Point addition in phase 1 is more efficient since it makes use of the fact that the z coordinate of the difference of points can be chosen to be 1.

Table 4-3 Running Times of the ECM Implementation (198 bits modulus), p = 2, w = 32 (Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)

Operation                            Time
modular addition                     2.00 µs
modular subtraction                  1.68 µs
modular multiplication               64.5 µs
modular squaring                     64.5 µs
point addition (phase 1, zQ = 1)     333 µs
point addition (phase 2)             397 µs
point doubling                       330 µs
Phase 1                              912 ms
Phase 2                              1879 ms

The ECM unit with full support for phases 1 and 2 of the ECM, with word width w = 32 bits, number of words e = 7 and pipeline level p = 2, has the following area requirements: 1754 LUTs, 506 flip-flops and 44 Block RAMs. The minimum clock period achieved is 26.225 ns (maximum clock frequency: 38.132 MHz). Further improvements in data organisation inside the ECM unit should yield higher performance of the whole design. The critical path of the design includes the multiplexers of the input and output buses of the memory registers. The high number of supported combinations, a consequence of the universality of the proposed design, leads to complicated and hence slow logic. A more optimised data path with multiple multipliers in the ALU helps to decrease the number of supported register combinations, as shown in [67].
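The clock figures quoted in this section are related by simple reciprocals. The following minimal Python sketch (the function names are illustrative only and not part of the thesis) converts between frequency and period and expresses one measured multiplication time as a number of 40 ns ALU cycles.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1_000.0 / freq_mhz


def freq_mhz(period_ns_value: float) -> float:
    """Clock frequency in MHz for a period given in nanoseconds."""
    return 1_000.0 / period_ns_value


# Board clock used for the measurements (25 MHz).
board_period = period_ns(25.0)

# Minimum period reported after place and route (26.225 ns).
max_freq = freq_mhz(26.225)

# Measured modular multiplication time (64.5 us) expressed in ALU cycles.
mul_cycles = 64.5e3 / board_period

print(f"board period: {board_period:.1f} ns")
print(f"maximum frequency: {max_freq:.3f} MHz")
print(f"modular multiplication: {mul_cycles:.1f} cycles")
```

The ratio of the two frequencies indicates how much headroom the synthesised design has over the 25 MHz board clock.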

Due to the system’s latency for loading and storing values in the registers, not 

more than 100 ECM units (FPGA) should be controlled by one processor. With 

a much higher number of units the communication overhead would outweigh the 

computation time. However, the control logic of the data I/O has not been in the 

focus of our optimisation efforts yet and, thus, we assume that slight improvements 

of the speed of the data I/O are still feasible. Especially if targeting an ASIC 

implementation, such numbers are likely to change. 

4.3.3 ECM-Based Acceleration of GNFS: a Case Study 

Building efficient and cheap ECM hardware can influence the overall performance of the GNFS, since ECM is well suited to the smoothness testing step within the GNFS (see [64]). In this section, we briefly estimate the cost, space requirements and power consumption of special ECM hardware implemented as an ASIC. The motivation for such an analysis lies in the fact that an ASIC design can achieve roughly 10 times better performance than an FPGA design. Knowing the area requirements and timings of the ECM implementation makes it possible to compare our design fairly with other (future) solutions. In our estimate, we focus on the production cost, which we believe to be much higher than the development cost of such an ASIC. This special hardware could be produced as single ICs (such as common CPUs), ready for use in larger circuits. We choose a setting with a word width w = 8 and assume the use of carry-save adders.

Estimation of the Runtime. We can determine the running time of both phases on the basis of the underlying ring arithmetic. The upper bounds for the number of clock cycles of a modular addition and a modular subtraction are given in Equations 4.2 and 4.3, respectively. A setting with N = 199, w = 8, p = 8, and e = 25 yields $T_{add} = 3(e + 1) = 78$ and $T_{sub} = 2(e + 1) = 52$ cycles. According to Equation 2.4, the implemented multiplier requires $T_{mul} = 666$ cycles. For each operation we should include $T_{init} = 2$ cycles for the initialisation of the ALU at the beginning of each computation.

For the group operations for phase 1 we obtain

$$T_{Padd} = 5T_{mul} + 3T_{add} + 3T_{sub} + 11T_{init} = 3\,742$$ and
$$T_{Pdbl} = 5T_{mul} + 2T_{add} + 2T_{sub} + 9T_{init} = 3\,608$$

clock cycles. For phase 2, $T_{Padd}$ changes to $T'_{Padd} = 4\,410$ cycles since $z_{A-B} \neq 1$ in most cases; hence, we have to take the multiplication by $z_{A-B}$ into account.

The total cycle count for both phases is

$$T_{Phase\,1} = 1\,374 \cdot (T_{Padd} + T_{Pdbl}) = 10\,098\,900$$ and
$$T_{Phase\,2} = 1\,881 \cdot T'_{Padd} + 50 \cdot T_{Pdbl} + 13\,590 \cdot (T_{mul} + T_{init}) = 17\,553\,730$$

clock cycles, where each of the 13 590 additional multiplications in phase 2 is counted together with its $T_{init}$ initialisation cycles. Excluding the time for pre- and post-processing, a unit needs approximately $27.7 \cdot 10^6$ clock cycles for both phases on one curve. If we assume a frequency of 500 MHz (for an ASIC), such a complex computation can be performed in approximately 55 ms.
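The cycle counts above can be re-derived with a few lines of arithmetic. The following Python sketch is only a check of the quoted numbers, not part of the implementation. It rests on two assumptions that are not spelled out in the text: the phase 2 point addition costs one extra multiplication plus its initialisation compared with phase 1, and each of the 13 590 additional multiplications in phase 2 is charged $T_{init}$ cycles as well. Both assumptions reproduce the quoted totals exactly.

```python
E = 25                    # number of words (w = 8, N = 199)
T_MUL = 666               # cycles per modular multiplication (Equation 2.4)
T_INIT = 2                # ALU initialisation cycles per operation
T_ADD = 3 * (E + 1)       # upper bound for a modular addition
T_SUB = 2 * (E + 1)       # upper bound for a modular subtraction

# Group operations in phase 1.
t_padd = 5 * T_MUL + 3 * T_ADD + 3 * T_SUB + 11 * T_INIT
t_pdbl = 5 * T_MUL + 2 * T_ADD + 2 * T_SUB + 9 * T_INIT

# Assumption: phase 2 addition = phase 1 addition + one multiplication + init.
t_padd_phase2 = t_padd + T_MUL + T_INIT

t_phase1 = 1374 * (t_padd + t_pdbl)
# Assumption: every extra multiplication also pays T_INIT.
t_phase2 = 1881 * t_padd_phase2 + 50 * t_pdbl + 13590 * (T_MUL + T_INIT)

total_cycles = t_phase1 + t_phase2
runtime_ms = total_cycles / 500e6 * 1e3   # assumed ASIC clock of 500 MHz

print(t_padd, t_pdbl, t_padd_phase2)
print(t_phase1, t_phase2, total_cycles)
print(f"{runtime_ms:.1f} ms per curve")
```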

Estimation of Area Requirements. The estimate of the area requirements is based on the results published in [108]⁴: the multiplier with w = 8 and p = 8 requires 21 400 transistors in standard CMOS technology (assuming 4 transistors per NAND gate). We assume that the circuit for addition and subtraction can be realised with at most 1 000 transistors. For the memory, we assume (area-expensive) static RAM, which requires 25 200 transistors for 21 registers. For the unit’s internal control we assume an additional 6 000 transistors. The central control requires less than 2 000 000 transistors. Hence, one unit requires approximately 53 600 transistors. Assuming the CMOS technology of a standard Pentium 4 processor (0.13 µm, approx. 55 million transistors), we could fit 990 ECM units into the area of one standard processor. One ECM unit needs an area of approximately 0.1475 mm² and has a power dissipation of approximately 40 mW.

⁴ The numbers provided in that contribution refer to a multiplier built with CSAs. Since we implemented the architecture with CPAs, the given numbers are larger (by approximately 20%) than those which would be achieved with our design.
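The transistor budget can be tallied as follows. This Python sketch only re-adds the figures quoted above. The way the figure of 990 units per chip is obtained is not stated in the text, so the subtraction of the central control from the 55 million transistors is an assumption made here because it lands close to the quoted value.

```python
# Per-unit transistor estimates quoted in the text.
UNIT_BLOCKS = {
    "multiplier (w = 8, p = 8)": 21_400,
    "adder/subtractor": 1_000,
    "static RAM, 21 registers": 25_200,
    "internal control": 6_000,
}

unit_total = sum(UNIT_BLOCKS.values())

CENTRAL_CONTROL = 2_000_000        # upper bound quoted for the central control
PENTIUM4_TRANSISTORS = 55_000_000  # approximate budget of the reference chip

# Assumption: the central control is placed once per chip and the
# remaining budget is filled with ECM units.
units_per_chip = (PENTIUM4_TRANSISTORS - CENTRAL_CONTROL) // unit_total

print(f"transistors per ECM unit: {unit_total}")
print(f"ECM units per chip: {units_per_chip}")
```

Under this reading the budget allows 988 complete units, which the text rounds to 990.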

Application to the GNFS. Considering the architecture for special GNFS hardware of [64], we have to test approximately $1.7 \cdot 10^{14}$ co-factors of up to 125 bits for smoothness. Since both the running time and the area requirement scale linearly with the bit size, we can multiply the results from the subsections above by a factor of 125/199 ≈ 0.628. If we distribute the computation over a whole year, we have to check 5 390 665 co-factors per second⁵.

For a probability of success of p > 80%, we test 20 curves per co-factor; thus, we need approximately 3 850 000 ECM units, which would yield a total chip area of 625 000 mm² (= 4 300 ICs of the size of a Pentium 4) and a power consumption of approximately 175 kW. If we assume a cost of US$ 5 000 per 300 mm wafer, as done in [103], the ECM units would cost less than US$ 45 000 for the whole GNFS architecture, which is negligible in the context of the overall costs.

4.4 Conclusions and Future Steps 

In this chapter we presented the first published implementation of the ECM in real hardware for factoring numbers of up to 200 bits. To make the implementation possible, the algorithm was adapted to the conditions imposed by the hardware, e.g. limited memory space, bus width and communication load. The algorithms were parametrised specifically to fit the needs of a hardware environment, yielding high efficiency with regard to the area-time product.

⁵ Note that we only take the time for finding the first factor into account. Since this happens quite seldom, we neglect the factorization of the remainder in our estimate.

The sequential control part of the ECM is operated by software commands of the embedded ARM processor. For the computationally intensive operations, special-purpose hardware was implemented on a Xilinx FPGA. The ECM unit provides full support for all computations of phases 1 and 2 of the ECM. It is also possible to include more ECM units working in parallel in one FPGA chip. Our implementation shows that, due to very low area requirements and low data I/O, ECM is well suited for hardware implementation. A single unit for factoring composites of up to 198 bits requires 506 flip-flops, 1754 lookup tables and 44 Block RAMs (less than 6% of the logic and 27% of the memory resources of the Xilinx Virtex2000E device).

Thanks to the scalability of the design, it is possible to change the data width and adapt it to the target FPGA architecture. Another advantage lies in the modularity of the design, namely the blocks for the underlying modular operations: addition/subtraction and multiplication/squaring. At this stage we re-used an MMM very similar to the versions of the multiplier described in Chapter 2.

The known drawbacks of the design are the inefficient use of on-chip memory blocks and the low maximum clock frequency. Our proof-of-concept design does not dedicate registers to particular arithmetic operations or data-flow directions. Since the chosen algorithm for MMM requires simultaneous read and write access to the register holding the sum S, we selected dual-port memory mode for all registers. Similarly, the multiplexing of the registers holding input and output operands has been left universal and is therefore complicated and slow.

As demonstrated, ECM can be perfectly parallelised and, thus, an implementa- 

tion at a larger scale can be used to assist the GNFS factoring algorithm by carrying 

out all required smoothness tests. A low cost ASIC implementation of ECM can 

decrease the overall costs of the GNFS architecture SHARK, as shown in [64]. We 

believe that an extensive use of ECM for smoothness testing can further reduce the 

costs of such a GNFS machine. 

As future steps, variants of phase 2 can be examined in order to achieve the 

lowest possible AT product. To achieve a higher maximal clock frequency of the 

ECM unit, the control logic inside the unit might be optimised. 

Since most of the computation time is spent for modular multiplications, an im- 

provement of the implementation of the multiplication directly affects the overall 

performance. Hence, alternative architectures for the multiplication can be investi- 

gated. 


5 True Random Number Generator - Preliminaries

Random values play a crucial role in several areas of science. Depending on the field of application, the requirements on the parameters of the random sequence and on the generator itself may vary. Focusing on the origin of the sequence, we distinguish between truly random and pseudo-random sequences. The construction of a generator determines its suitability for commercial or research applications.

In this chapter we provide an introduction to the topic of randomness and random values (Section 5.1) while focusing on generators applicable in cryptography. In Section 5.2 we mention typical sources for the generation of random sequences in digital circuits. In Section 5.3 we summarise the design ideas of the PLL-based generator that we will analyse in the following chapter. In Section 5.4 we explain the testing techniques applied in order to evaluate generators, and in Section 5.5 we discuss issues related to attacks on RNGs. Finally, in Section 5.6 we summarise the chapter.

5.1 Randomness 

We start with the topic of randomness. The most natural questions that come to mind are: How can randomness be defined? Where does it come from? And how can we prove that a sequence is random?

The randomness of the world we live in has been a scientific and philosophical topic for a long time. A famous remark of Albert Einstein says that “God does not play dice with the universe”, which might convince us of the determinism of our environment. However, several physical phenomena have been shown to have a random nature, e.g. the probabilistic nature of quantum mechanics, thermal and shot noise in electronic components, or nuclear decay.

The fundamental problem of randomness is the fact that, even with an exact definition, it is very difficult to prove whether any finite numeric sequence is random or not. The randomness of a source is evaluated through the parameters of a sequence generated using that source. The way in which the values of the sequence are extracted from the source depends on the applied harvesting mechanism. Optimal harvesting does not disturb the random physical process and extracts as much entropy as possible.

The entropy H of a random variable X with n outcomes $\{x_i : i = 1, \dots, n\}$ is defined as the expected value of the negative logarithm of the outcome probabilities [68], which can be expressed as the following equation:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i) \qquad (5.1)$$

where $p(x_i)$ is the probability of the outcome $x_i$. Therefore, the higher the level of entropy, the less predictable the process. A completely random process with maximal entropy provides a uniformly distributed sequence. For natural sources of randomness it is usually more difficult to achieve good statistical properties of the sequences, since they tend to include a certain level of bias or another kind of deviation from the ideal equiprobable sequence. Post-processing sequence converters are able to improve the statistical distribution, but usually reduce the output bit rate of the sequence.
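As a small numerical illustration of Equation 5.1, the following Python sketch evaluates the entropy of a fair and of a biased binary source. The function name and the example probabilities are chosen here for illustration only.

```python
import math


def shannon_entropy(probabilities, base=2):
    """Entropy H(X) = -sum p(x_i) * log_b p(x_i) of a discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


fair = shannon_entropy([0.5, 0.5])      # unbiased bit source
biased = shannon_entropy([0.9, 0.1])    # source with a strong bias towards one value

print(f"fair source:   {fair:.3f} bit per output")
print(f"biased source: {biased:.3f} bit per output")
```

The uniform distribution reaches the maximum of one bit per binary output, while the biased source delivers less than half a bit, which is why such a source has to accumulate several raw bits per output bit.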

Achieving a constantly high level of entropy in an RNG assures the randomness of the produced bit sequence. When designing an RNG it is important to determine the level of entropy in the source, the relation between the generator’s parameters and the entropy level, and a mechanism for monitoring the entropy level.

5.1.1 Definitions of Randomness 

There are several partial definitions of random numbers that help us to gather the requirements placed on random sequences and on the devices generating them. Let us mention some of them.

The following definition tells us about the process by which random numbers should be generated: a truly random number is generated by a process whose outcome is unpredictable and which cannot be subsequently reliably reproduced. The unpredictability of the process means that each output state of the process is equally likely and may be guessed correctly with the same (negligible) probability (following the uniform distribution). The ability to reproduce the random process would require some sign of a periodic pattern in the behaviour of the process, which is undesirable in a random pattern.

Chaitin’s Theorem [40] says that it is formally impossible to verify whether a finite sequence is random or not. Since in practice we do not deal with infinite sequences, what we can do is check the practical randomness of a finite sequence. That means evaluating to what extent the sequence under review shares the statistical properties of an ideal random sequence, e.g. the equal probability of all possible outputs.
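One simple check of the equal probability of zeros and ones is the frequency (monobit) test known from the NIST SP 800-22 test suite. The Python sketch below is a minimal illustration of that test and is not taken from the thesis. It also shows why such a check is only a necessary condition.

```python
import math


def monobit_p_value(bits):
    """P-value of the frequency (monobit) test for a sequence of 0/1 values."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty sequence")
    # Map 0 -> -1 and 1 -> +1 and sum; a balanced sequence sums to about 0.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


alternating = [0, 1] * 50   # perfectly balanced, but obviously not random
all_ones = [1] * 100        # maximally biased

print(monobit_p_value(alternating))
print(monobit_p_value(all_ones))
```

The strictly alternating sequence passes with the best possible p-value although it is fully predictable, so a good statistical result on one property does not prove randomness.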

According to Knuth [76], a sequence of random numbers is a sequence of independent numbers with a specified distribution and a specified probability of falling in any given range of values. Another definition comes from Schneier [101], who says that a random sequence is one that has the same statistical properties as random bits, is unpredictable and cannot be reliably reproduced. Kolmogorov defines a string of bits as being random if and only if it is shorter than any computer program that can produce that string. From all three definitions we can extract a common requirement (necessary but not sufficient): the numbers in a random sequence must be uncorrelated⁶.

An unpredictable sequence is one for which the knowledge of all values generated in the past does not increase the probability of guessing the subsequent value; in other words, knowing one of the numbers in the sequence must not help in predicting the others. The same fact can be illustrated by another definition of unpredictability, which states that there is no polynomial-time algorithm by which, knowing l bits of the generated sequence S, one is able to predict the (l+1)-th bit with probability greater than 0.5 [86]. The absence of correlation also implies that the generated random sequence cannot be produced by any computer program other than one that prints the whole random sequence as it is.

By a truly random sequence of bits we understand an uncorrelated sequence that cannot be reproduced or predicted, has equal probability of all possible outputs (equiprobability), and whose generation is based on a random process.

A sequence that keeps the statistical properties of a random sequence, but whose members are correlated or which can be reproduced, is called pseudo-random. A pseudo-random sequence looks random, but its origin is not a random process, and its generation can be reproduced and described as an algorithm.

One of the issues discussed in the thesis is the ability to distinguish between a truly random and a pseudo-random sequence by exploring the generation process in the generator.

5.1.2 Random Number Generator 

An RNG is an electronic device or software routine designed to yield a sequence of random numbers.

A pseudo-random number generator (PRNG) is based on an algebraic function that expands an initial random value (a seed) into a random-looking sequence. A true random number generator (TRNG) includes a physical source of randomness and a harvesting mechanism which extracts the randomness and generates truly random values.

⁶ However, simple (linear) and known correlations between the members of a sequence do not exclude such a source. In these cases a corrector that removes the correlated samples may be applied. More dangerous are masked higher-order correlations, which are difficult to detect.

The security level of a PRNG depends on the complexity of the generating function, the period length of the generated sequence, and the amount of entropy in the seed. As a result, pseudo-random sequences may achieve a high level of unpredictability if the generating function is sufficiently complex. However, a pseudo-random sequence always has a finite period and remains reproducible as long as the initial conditions are preserved.

The PRNG is the only choice for software implementations, and thanks to its deterministic components it also attracts the designers of electronic digital systems. Note that a pseudo-random sequence can also be unpredictable when produced by a cryptographically secure PRNG, e.g. one based on hash (one-way) functions, stream ciphers or the Blum Blum Shub principle [28]. The PRNG requires a random seed (from a TRNG or another reliable source of entropy, if available) to obtain its starting level of entropy. As the system is deterministic, the PRNG generates identical pseudo-random output sequences for identical seeds. No more entropy is added while the seed is being exploited; therefore the seed’s entropy determines the unpredictability of the generated sequence.
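The deterministic nature of a PRNG can be demonstrated with a toy version of the Blum Blum Shub generator, which repeatedly squares its state modulo M = p·q and outputs the least significant bit. In the Python sketch below the primes p = 11 and q = 23 are deliberately tiny, hypothetical example values that offer no security at all; a real instance needs large secret primes.

```python
import math


def bbs_bits(seed, count, p=11, q=23):
    """Toy Blum Blum Shub generator: x_{k+1} = x_k^2 mod (p*q), output LSB."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 modulo 4")
    m = p * q
    if math.gcd(seed, m) != 1:
        raise ValueError("seed must be coprime to p*q")
    x = (seed * seed) % m          # initial state derived from the seed
    bits = []
    for _ in range(count):
        x = (x * x) % m            # squaring step
        bits.append(x & 1)         # least significant bit is the output
    return bits


first_run = bbs_bits(seed=3, count=16)
second_run = bbs_bits(seed=3, count=16)
other_seed = bbs_bits(seed=5, count=16)

print(first_run)
print(first_run == second_run)   # identical seeds give identical sequences
print(first_run == other_seed)   # a different seed gives a different sequence
```

All unpredictability of the output therefore comes from the secrecy and entropy of the seed, as stated above.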

The term generator is not completely correct in the case of a TRNG, as the randomness is not generated but rather extracted from a source of randomness (see Figure 5-1). In a TRNG the occurrence of random events is sampled by an extractor and transformed into a sequence of numerical values, usually expressed as a binary stream.

[Figure 5-1: block diagram divided into an analogue part and a digital part. Blocks: Source of randomness, A/D conversion, Postprocessing, Output buffer. Signals between the blocks: noise signal, digitised noise signal, internal random sequence, and the random number sequence at the external interface.]

Figure 5-1 Schematic diagram of a TRNG with designation of internal signals and interfaces

Figure 5-1 represents a typical design of a TRNG based on a physical phenomenon. Using a proper harvesting mechanism, the analogue signal is converted into its digitised form. Depending on the statistical properties of the signal, post-processing may be required in order to produce the internal random sequence. The generated sequence can be further accumulated in an output buffer before leaving the generator on an external request.
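A classic example of such post-processing is the von Neumann corrector, which removes the bias of independent raw bits at the price of a reduced output bit rate. It is shown here only as a well-known illustration and is not necessarily the post-processing used in the designs discussed in this thesis.

```python
def von_neumann_corrector(bits):
    """Remove bias from independent raw bits.

    The input is read in non-overlapping pairs: the pair (0, 1) yields 0,
    the pair (1, 0) yields 1, and equal pairs are discarded.
    """
    output = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            output.append(first)
    return output


raw = [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
corrected = von_neumann_corrector(raw)

print(corrected)
print(f"output rate: {len(corrected)} of {len(raw)} raw bits")
```

Even for an unbiased source only one output bit is produced per four raw bits on average, which illustrates the trade-off between statistical quality and bit rate.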

5.1.3 Applications of Random Numbers 

Random or pseudo-random values may be applied in a variety of areas, e.g. in simulation methods like Monte Carlo [84], in the generation of spreading sequences in spread spectrum communication systems [106], in the generation of primes, in several cryptographic algorithms, or in the gambling industry. Naturally, the requirements on the generators and on the generated random data differ according to the application.

In addition to proper statistical parameters, a random sequence generated for a sensitive cryptographic application has to be unpredictable and unrepeatable. Unrepeatability means that we expect a completely different random sequence for each use of the generator, even under identical starting conditions (like the seed of a PRNG). This is an inherent feature of TRNGs based on entropy extraction from natural physical phenomena. In such a case the entropy of the generator is increased by each generated value.

Application areas for RNGs can be found in a number of cryptographic algorithms. The dominant application of RNGs is the secure generation of encryption keys. The bit length of the key is chosen depending on the length of time for which the key is valid. In different cryptographic applications this time can vary from seconds for session keys to years for the encryption keys of archiving systems. Consequently, the RNG has to provide random values with a bit rate in the range from tens to thousands of bits per second. While it is not a problem for a PRNG to achieve high output bit rates, the situation is different for the TRNGs desired in high-security cryptosystems. A source of randomness in a TRNG may have a low level of entropy per bit, which also means a low output bit rate because of the required accumulation of entropy.

In cryptography, the values produced by randomness extractors or generators are used as cryptographic keys, initialization vectors, padding bits, blinding values, or masking values in countermeasures against side-channel attacks. Depending on the application, the random value needs to be kept secret, as in the case of (secret) encryption keys, or can be published, as a nonce or a part of a public key.

Nowadays the security of cryptographic systems is not based on the secrecy of the encryption methods, which are publicly known, but on the knowledge of a secret key. An adversary focuses all her attacks on revealing that secret information. Having control over the device that generates the keys allows the attacker to control all the systems whose security depends on them as well. These are the reasons for the emphasis on the randomness generation process in cryptography.

Requirements on TRNG for cryptography. We can conclude the previous paragraphs with a list of special requirements placed on the implementation of TRNGs when the produced sequences are applied in cryptography:

• Specific statistical properties – the generated sequence must have perfect statistical properties. A known bias in the probability of zeros and ones in the generated bit-stream could make cryptographic attacks easier, since a nonzero bias deforms the required uniform distribution. The expected parameters are usually achieved by post-processing of the random sequence.

• Unpredictability – knowledge of an arbitrarily long sequence from the generator, or of any other information about the internal state of the generator, should not enable anyone to predict preceding or subsequent generator outputs or to guess them with non-negligible probability. Such behaviour is natural for random physical phenomena. The requirement is satisfied by a proof showing the origin of the random-looking sequence.

• Security parameters – as an electronic device, the TRNG is also a target for adversarial attacks. Rather than a one-off disclosure of the secret key, an adversary is usually interested in influencing the key generation process permanently. As a means of improving resistance against this kind of attack, RNG designers should consider implementing on-line tests tailored to the harvesting mechanism of the TRNG.
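To give an idea of what an on-line test can look like, the following Python sketch monitors the number of ones in consecutive blocks of the output and raises an alarm when a block deviates too far from the expected half. It is a generic statistical monitor with hypothetical parameters (block size 256, threshold of four standard deviations) and is not tailored to a particular harvesting mechanism, as the requirement above would ask for.

```python
import math


def block_bias_alarms(bits, block_size=256, k=4.0):
    """Return one alarm flag per complete block of the bit stream.

    For unbiased independent bits the number of ones in a block has mean
    block_size / 2 and standard deviation sqrt(block_size) / 2. A block
    raises an alarm when it deviates by more than k standard deviations.
    """
    limit = k * math.sqrt(block_size) / 2
    alarms = []
    for start in range(0, len(bits) - block_size + 1, block_size):
        ones = sum(bits[start:start + block_size])
        alarms.append(abs(ones - block_size / 2) > limit)
    return alarms


healthy = [0, 1] * 128      # balanced block (stand-in for a healthy source)
stuck = [1] * 256           # output stuck at one, e.g. after a fault or attack

print(block_bias_alarms(healthy + stuck))
```

A stuck-at fault is detected within one block, whereas subtle manipulations of the source would need tests that exploit knowledge of the underlying physical process.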

5.2 TRNG Implementations in Digital Systems 

In the following part of the thesis we provide an overview of known TRNG implementations and design proposals. We focus mostly on designs targeted at digital circuits.

Nowadays, the common hardware platform for the implementation of cryptographic primitives is a digital device. The cryptographic functions are executed as software on embedded processors in DSPs, FPGAs, SoCs etc., or run on dedicated (co)processors with programmable (FPGA) or hard-wired (ASIC) logic cells. This fact motivates research into generators that can be integrated into completely digital circuits.


Digital circuits are naturally well suited to the implementation of a PRNG because of their deterministic nature. For the implementation of a physical TRNG it is necessary to look for a source of randomness inside the circuit. Typical digital circuits include only a limited range of sources of randomness, which we will investigate further.

As we have already explained, true randomness is achievable only in generators based on some physical phenomenon. However, one of the main objectives of digital system designers is to minimise the impact of spurious analogue effects and to achieve perfect stability of the system. Their goal is therefore the optimisation of the clock distribution network for a wide range of frequencies and a careful design of the PCB layout and the power supply network. The requirements on the system are thus contradictory: on one side we expect perfectly deterministic behaviour of the digital part of the system, and on the other side we look for a high-quality source of true randomness for a TRNG placed in the same system.

For the sake of security, completely embedded implementations of RNGs are preferred. In such a case the internal signals of the RNG are not exposed to potential attacks. However, due to the lack of suitable sources of randomness on a given platform, some designs propose the use of external discrete components as the source of randomness, while the processing part of the generator is implemented in the digital part of the system (e.g. [112]).

5.2.1 Sources of Randomness 

The following sources of randomness can be found in the digital devices: 

• metastability 

• various types of noise 

• clock jitter 

Although clock jitter is primarily caused by noise and could therefore be included under noise, we treat jitter as a separate category. TRNGs based on jitter use techniques different from those based on direct sampling of noise. In addition, generators sourced by jitter are among the most popular TRNG designs.

We note that although the sources of randomness are presented separately, it is generally difficult to separate them in technical designs, where all of them may be present and influence the entropy of the randomness source. As an example we can mention a generator kept in a metastable state whose stable output value will be influenced by the noise conditions inside the generator. In such a case, the primary source of randomness is the metastability and the secondary source is the noise.

Metastability. A fundamental building block of digital circuits, the flip-flop (FF), has two well-defined stable states, the high and low levels, usually denoted as 1 and 0 (see Figure 5-2). Under certain conditions the device may get into a state which cannot be described by either of the above states. This condition is called metastability.

[Figure 5-2: diagram with three labelled states: stable state 0, metastable state, stable state 1.]

Figure 5-2 Illustration of stable states (0 and 1) and undefined metastable state

The most common way to get a device into metastability is to violate the setup⁷ and hold⁸ times of the device. That can be achieved by choosing the frequencies of the clock and input signals of the FF in a ratio that results in changes of the input signal level that are too close to the edges of the clock signal. Another option is that the frequencies of the signals are the same, but the phases are aligned in a way that causes a violation of the FF’s setup and hold times.
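Both ways of provoking a violation can be illustrated with an idealised timing model. The Python sketch below counts how many transitions of a periodic data signal fall inside the setup/hold window around the sampling clock edges. All times are integers in picoseconds, and the chosen periods and window widths are hypothetical example values; jitter and the physical resolution process are ignored.

```python
def violation_fraction(clock_period, data_period, t_setup, t_hold,
                       n_edges=10_000, data_offset=0):
    """Fraction of data transitions that violate the setup or hold time.

    Clock edges occur at multiples of clock_period. A data transition at
    time t is a violation if it lies less than t_hold after the previous
    clock edge or less than t_setup before the next one.
    """
    violations = 0
    for k in range(n_edges):
        t = data_offset + k * data_period
        since_edge = t % clock_period
        if since_edge < t_hold or clock_period - since_edge < t_setup:
            violations += 1
    return violations / n_edges


# Same frequency, transitions aligned with the clock edge: always violated.
aligned = violation_fraction(10_000, 10_000, t_setup=100, t_hold=150)

# Same frequency, transitions in the middle of the period: never violated.
centred = violation_fraction(10_000, 10_000, t_setup=100, t_hold=150,
                             data_offset=5_000)

# Slightly different periods: the transition drifts through the window.
drifting = violation_fraction(10_000, 10_001, t_setup=100, t_hold=150)

print(aligned, centred, drifting)
```

With slightly different frequencies only a small fraction of the samples hits the critical window, whereas a fixed phase alignment hits it every time, which is why precise phase or frequency control is needed to keep a FF near metastability.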

Keeping the FF close to metastability and then allowing it to resolve produces a binary sequence that depends on the noise conditions inside the FF at the time of release. If the origin of the noise is thermal motion, then its random nature suggests that repeatedly clocking an FF forced into metastability will produce a succession of bits with little correlation between any pair in the sequence [75].

⁷ Setup time is defined as the minimum time before the sampling edge during which the sampled signal must be stable.

⁸ Hold time is defined as the minimum time after the sampling edge during which the sampled signal must be stable.


In case of generators based on metastability the main implementation issue is 

the phase or frequency control of the input signals that forces the metastability 

conditions. Complicated control system makes the implementation more vulnerable 

to attacks. In RNGs based on other randomness extraction techniques e.g. on 

free-running oscillators, the metastability may also occur and contribute to overall 

entropy of the randomness source [52]. 

Producers of FPGAs, and of digital circuits in general, constantly work on reducing the setup and hold times⁹, as metastability produces undesirable non-deterministic exceptions in the behaviour of the devices [6]. Therefore the published TRNG implementations [75, 83, 121] usually propose special circuits implemented e.g. in CMOS technology.

Because the metastable condition is difficult to maintain over the long term, we can conclude that metastability is a good (secondary) source of randomness when it is combined with other sources.

Noise. Despite their deterministic behaviour, digital devices are based on analogue elements that naturally produce a certain level of noise. There is always a source of noise (e.g. thermal noise of a resistance, or shot noise) present in an electronic device. In order to use the noise as a source of randomness it is necessary to amplify the noise itself or the effects caused by the noise.

Most of the true hardware RNGs depend primarily on a source of thermal noise, 

which is then post-processed to reduce the effects of deterministic internal and ex- 

ternal influences such as power supply variations, DC bias, and electromagnetic 

fields [73]. Direct amplification and sampling of a noisy signal is not possible in 

pure digital circuits. However, more complex devices are not exclusively digital and 

include embedded components for mixed signals (analogue-digital) processing like 

A/D and D/A converters, or clock circuitry for a signal skew compensation. 

A technique with clocked comparator fed by directly amplified noise is applicable 

only in case of well-shielded noise sources, what can be hardly achieved in case of 

integrated digital systems. Instead of direct amplification of the noise, it is techni- 

cally more feasible to amplify signals that include a randomly changing part, but 

has higher level of amplitude than the noise itself (see e.g. [24, 73]). 

Bagini and Bucci [24] provide one of the first designs that include an analytical model of the generator behaviour and a self-testing procedure. A comparator is applied as a converter of analogue noise to a binary signal. The balanced signal is then sampled by a delay FF. The number of internal transitions in the generated binary signal allows online checking of the generator behaviour.

9 For the LE FF of Altera Stratix II, speed grade -3, the setup time is tSU = 90 ps and the hold time is tH = 149 ps [21].

As an intrinsic and reliable source of randomness in electronic devices, noise is attractive for designers of TRNGs. We further elaborate on its influence on signals, e.g. in the form of jitter.

Jitter In this part we discuss various sources of jitter and the classification of the jitter components into deterministic and random ones. We start with some basic definitions of jitter, deterministic jitter and random jitter [102].

By convention, timing variations are split into two categories, called jitter and 

wander, based on a Fourier analysis of the variations vs. time. Timing variations 

that occur slowly are called wander. On the other hand the jitter describes tim- 

ing variations that occur more rapidly. The threshold between wander and jitter 

is defined to be 10 Hz according to the ITU, but also other definitions may be 

encountered. 

We continue with a more specific definition of the jitter and its two components. 

Jitter is a deviation from the ideal timing of an event (see Figure 5 – 3). The 

reference event is the differential zero crossing for electrical signals and the 

nominal receiver threshold power level for optical systems. Jitter is composed 

of both deterministic and Gaussian (random) content. 

Deterministic Jitter (DJ) is the jitter with non-Gaussian probability density 

function. Deterministic jitter is always bounded in amplitude and has specific 

causes. Four kinds of deterministic jitter are identified: duty cycle distor- 

tion, data dependent, sinusoidal or periodic, and uncorrelated (to the data) 

bounded jitter. The DJ is characterized by its bounded, peak-to-peak value. 

Random Jitter (RJ) is the jitter that is characterized by a Gaussian distribution. Its peak-to-peak value is defined to be 14 times the standard deviation (14σjit) of the Gaussian distribution.

Knowing the basic definition of the jitter, we can continue with definitions of three types of jitter that differ in the reference signal that is considered to be ideal, without any jitter, and in the time period of observation [102]. We also add our definition of the tracking jitter, which plays a crucial role in the randomness extraction method of the PLL-based TRNG.

Figure 5 – 3 Timing jitter in clock signal (diagram labels: reference edge, mean period, unit interval, jitter)

Cycle-to-cycle jitter is the difference in a clock’s period from one cycle to the next 

one. Cycle-to-cycle jitter is the most difficult to measure usually requiring a 

timing interval analyser. 

Half-period jitter is the measure of maximum change in a clock’s output transi- 

tion from its ideal position during one-half period. 

Period jitter is the change in a clock’s output transition, typically the rising edge, 

from its ideal position over consecutive clock edges. Period jitter is measured 

and expressed in time or frequency. Period jitter measurements are used to 

calculate timing margins in systems. 

Tracking jitter is defined as a variation in time relationship between the edges of 

the reference (input) clock and output clock of a clock circuitry. 

Deterministic periodic jitter is typically caused by external deterministic noise 

sources coupling into a system, such as switching power-supply noise or a strong 

local radio frequency carrier. It may also be caused by an unstable clock-recovery 

PLL. 

While a random process can, in theory, have any probability distribution, random jitter is assumed to have a Gaussian distribution for the purpose of the jitter model. One reason for this is that the primary source of random noise in many electrical circuits is thermal noise (also called Johnson noise), which is known to have a Gaussian distribution. Another, more fundamental reason is that the composite effect of many uncorrelated noise sources, no matter what the distributions of the individual sources, approaches a Gaussian distribution according to the central limit theorem [107].

For a random signal with a Gaussian distribution, there is theoretically no limit 

on the max and min values, so the observed peak-peak value will generally grow 


over time. For this reason, the peak-peak value should be used in conjunction with 

the population size and some knowledge of the type of distribution. 
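The relation between a peak-to-peak figure and the standard deviation can be illustrated numerically. The following is a minimal sketch, assuming an ideal zero-mean Gaussian jitter; the function name and the list of tail probabilities are illustrative choices. It computes how many standard deviations wide a symmetric interval must be so that a sample falls outside it, on one side, with a given probability. A tail probability of roughly 10^-12 gives a width close to 14 standard deviations, which is consistent with the 14σjit convention quoted in the definition of random jitter, although the text does not state which probability that convention is tied to.

```python
from statistics import NormalDist


def peak_to_peak_multiplier(tail_probability):
    """Width (in standard deviations) of the symmetric interval that a
    zero-mean Gaussian sample exceeds on one side with the given probability."""
    return -2 * NormalDist().inv_cdf(tail_probability)


# The smaller the tolerated tail probability (i.e. the larger the observed
# population), the wider the peak-to-peak value becomes.
for p in (1e-3, 1e-6, 1e-9, 1e-12):
    print(f"tail probability {p:g}: {peak_to_peak_multiplier(p):.2f} sigma")
```

The multiplier grows without bound as the tail probability shrinks, which is why a peak-to-peak value is only meaningful together with the population size.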

5.2.2 Survey of Designs Based on Jitter 

In this section we summarise the best-known concepts and designs of generators based on the extraction of randomness from clock jitter. The jitter appears in clock signals generated by free-running oscillators or PLL circuitry implemented inside a digital device.

The Tkacik TRNG Design The generator invented by Tkacik [111] combines two deterministic circuits: a linear feedback shift register (LFSR) and a cellular automata shift register (CASR). The registers are clocked by two independent ring oscillators whose frequency is influenced by external conditions and includes jitter. Selected outputs of the CASR and LFSR are XORed together to provide the final random signal. The harvesting technique of the generator is very complex and no verification of its effectiveness is provided.

The design was evaluated by Dichtl [43] who pointed out an issue with unclear 

source of randomness in the generator. Under certain conditions and with partial 

knowledge of some internal values an attacker is able to predict the generated value 

due to low level of entropy. 

The Fischer and Drutarovský Design In the design by Fischer and Drutarovský [60], the idea is to extract random values by sampling a clock signal influenced by the tracking jitter of the analogue PLL in Altera FPGAs. The jitter can be sampled only under a defined condition, when the frequencies of the sampled and sampling clock signals are in a certain ratio.

Sampling of the clock signal is executed periodically with a period given by the PLL dividers. Samples taken in transition zones have a nonzero probability of resulting in either logical one or zero and are called critical samples. The position of the critical samples remains stable during operation of the generator as long as its working conditions do not change.

More details on the TRNG implementation and features of the generator are described in the next section. This design provides us with a reference for the theoretical testing and theories which are presented in the thesis.


The Golić Design Golić’s goal is to provide a digital TRNG built from logic gates only. Such a design is cost effective and suitable for implementation on any digital chip. In [70], Golić proposes two new elements for TRNG design, shown in Figures 5 – 4(a) and 5 – 4(b): the Galois ring oscillator (GARO) and the Fibonacci ring oscillator (FIRO).

(a) Galois ring oscillator (b) Fibonacci ring oscillator 

Figure 5 – 4 Ring oscillator structures proposed by Golić. 

Adding a more complex feedback loop to the ring oscillator (RO) also makes its behaviour more complex and therefore more suitable for a TRNG, since the randomness coming from jitter spreads faster. In comparison to a classical RO, the use of GARO and FIRO yields a higher level of entropy and robustness of the generator. Additional entropy of the generator comes from frequent metastability effects in the sampling gate.

In [44] Golić and Dichtl show results of a practical implementation of a TRNG using the oscillators presented above. The authors demonstrate the randomness of the solution by analysing the generator output after repeated restarts of the circuit. The standard deviation of the output signal voltage rises quickly after the restart and stabilises at a significantly large level, which assures randomness of a sample taken in this time period.

The Kohlbrenner and Gaj Design A principle similar to the PLL-based generator [60] was proposed by Kohlbrenner and Gaj in [79]. Instead of PLL circuitry, which is not present in all FPGAs, the authors use a pair of ring oscillators implemented in the programmable logic area of the FPGA. Since the principle expects a tight pair of frequencies generated by the rings, the oscillators must be matched precisely. This also requires proper positioning of the rings inside the FPGA and manual corrections in placement and routing.

The authors also investigated the influence of temperature on the RO. The frequency of an RO tends to wander as the chip’s temperature varies. It is important to place the ROs of a pair close to each other, so that the difference between their frequencies is reduced thanks to a minimal difference in temperature.


The Bucci and Luzzi Testable TRNG Design Framework The authors of the testable TRNG design framework [36] come up with the idea of a stateless RNG which generates statistically independent random bits. If the post-processing unit is also memoryless, the internal random bits are independent too. The stateless condition of the generator can be achieved by resetting the generation and post-processing circuits before generating a random bit or word, respectively.

In the case of RO-based generators, the reset state is achieved by stopping the oscillators after each bit generation, so the phase shift between the oscillators is not accumulated. Another motivation is to avoid a complicated deterministic beating pattern between the fast and slow frequencies of the RNG. Should the generator include any control or compensation loops, the stateless condition is met only if the loops have reached their steady state.

The Sunar et al. TRNG Design A theoretical concept of a generator based on ring oscillators (ROs) of equal length was published by Sunar et al. in [105]. According to the concept, the outputs of several ROs are XORed together and sampled by a D flip-flop. The number of oscillators is chosen according to the jitter size and the internal frequency of the rings. The design goal of a properly working generator is a uniformly distributed region of unpredictable transitions. It is assumed that the phase drift caused by jitter appears in the internal signal of each ring and influences the movement of the edges in the signal.

Several assumptions made by the authors of this concept were questioned by the authors of [44]. The main problem lies in the expectation that the ROs are independent, which is usually difficult to achieve due to their strong tendency to couple with each other or to lock on a common frequency if there is a strong source of a periodic signal close to the ROs.

Notes on Other Published Designs Several published TRNG designs are based on the frequency instability of free-running oscillators, e.g. [53]. Free-running oscillators are also typically used in FPGA-based TRNGs [79, 112].

In recently published papers [31, 105] we can observe that successfully passed statistical tests of the proposed RNG are no longer sufficient. Much more attention is paid to the analysis and modelling of the randomness extraction process. Theoretical bounds for entropy and statistical estimations of the RNG behaviour are provided in order to prove the security of the generator. The requirement for continuous testing of the generated sequence was raised by Schindler in [100]. As a consequence, RNG designs should provide a testing method designed particularly for the given type of RNG (see e.g. [36]).

In [26] the authors improve the modelling of RO-based TRNGs, and instead of conventional time-based models they provide a phase-oriented presentation. The observation that ROs tend to couple with each other has been confirmed by experiments with global deterministic jitter. Rather than concluding that coupling reduces the randomness of a TRNG, the authors warn against overestimation of the jitter size. After removing the impact of global jitter, the accumulation of jitter is much slower, which implies a lower sampling frequency of the generator in order to obtain random sequences.

5.3 PLL-Based TRNG on FPGA 

In this section we introduce a TRNG implementation based on randomness extraction from the tracking jitter that is inherent in the clock signal produced by the analog PLL embedded in some FPGA families. The PLL circuitry, normally applied for synthesis of on-chip clock signals derived from an external quartz signal, is configured to provide a pair of signals with a certain fixed ratio of their frequencies. The ratio is selected for the purpose of jitter sampling and also sets other parameters of the generator, such as the speed of the output random sequence.

In the following pages we compile dependencies between the PLL and TRNG 

parameters and explain their meaning. We explain the fundamental method behind 

the PLL-based TRNG (PLL-TRNG) invented by Fischer and Drutarovský and pub- 

lished in [60]. 

5.3.1 Randomness Extraction Method 

The tracking jitter in the output signal of the on-chip analog PLL is detected by sampling the signal using another, rationally related clock signal. The fundamental condition for jitter sampling is that the sampled and sampling edges are set close enough to each other. When this condition is met, the unpredictable jitter decides the output values of the sampling gate. The simplified structure of the PLL-TRNG is depicted in Figure 5 – 5.

Let us have two clock signals CLJ and CLK with frequencies FCLJ and FCLK in the given ratio:

\[ \frac{F_{CLJ}}{F_{CLK}} = \frac{K_M}{K_D} = \frac{M_{CLJ} D_{CLK}}{M_{CLK} D_{CLJ}} , \tag{5.2} \]

where KM and KD are combinations of PLL dividers (DCLK, DCLJ) and multipliers (MCLK, MCLJ).

Figure 5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of the output sequence (diagram labels: CLI, PLL 1, PLL 2, CLJ, CLK, D flip-flop, q(nTCLK), decimator (N KD), x(nNTQ)).

As can be seen in Figure 5 – 6, the signal CLJ is sampled in KD discrete positions during the period TQ, which is given as

\[ T_Q = K_D T_{CLK} = K_M T_{CLJ} . \tag{5.3} \]

Figure 5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge of the CLK signal (illustrated for KM = 5 and KD = 7)

It has been shown in [60] that if KM and KD are relatively prime, the set of samples creates an equidistant set of values with a distance step

\[ d = \frac{T_{CLK}}{2K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{CLJ}}{2K_D}\,\mathrm{GCD}(2K_M, K_D) . \tag{5.4} \]

The method offers a possibility to choose the worst-case distance MAX(∆Tmin) = d/2 between two closest edges of the CLK and CLJ signals as [60]

\[ \mathrm{MAX}(\Delta T_{min}) = \frac{T_{CLK}}{4K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{CLJ}}{4K_D}\,\mathrm{GCD}(2K_M, K_D) \tag{5.5} \]

and thus to assure proper behavior of the generator.
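The distance step and the worst-case distance can be evaluated for a concrete pair of coefficients. The sketch below is only illustrative: the function names are hypothetical, the clock period of 10 ns is an assumed value, and KM = 5, KD = 7 are the values used for illustration in Figure 5 – 6. Exact rational arithmetic is used so that the two forms of Equation (5.4) can be compared without rounding.

```python
from fractions import Fraction
from math import gcd


def distance_step(t_clk, k_m, k_d):
    """Distance step d of Equation (5.4), in the same unit as t_clk."""
    return Fraction(t_clk) * gcd(2 * k_m, k_d) / (2 * k_m)


def worst_case_distance(t_clk, k_m, k_d):
    """MAX(dT_min) = d / 2 of Equation (5.5)."""
    return distance_step(t_clk, k_m, k_d) / 2


# Assumed example values: T_CLK = 10 ns, expressed in picoseconds.
T_CLK_PS = 10_000
K_M, K_D = 5, 7

d = distance_step(T_CLK_PS, K_M, K_D)
max_dt = worst_case_distance(T_CLK_PS, K_M, K_D)

# Same step expressed through T_CLJ = T_CLK * K_D / K_M, from Equation (5.3).
T_CLJ_PS = Fraction(T_CLK_PS * K_D, K_M)
d_from_clj = T_CLJ_PS * gcd(2 * K_M, K_D) / (2 * K_D)

print(f"d = {d} ps, MAX(dT_min) = {max_dt} ps, d via T_CLJ = {d_from_clj} ps")
```

For these assumed values the step is 1000 ps and the worst-case distance is 500 ps, and both forms of the step agree.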

If the parameters KM and KD are chosen so that 

MAX(∆Tmin) < σjit , (5.6) 

it is guaranteed that during the period TQ the sampling edge of CLK will fall at 

least once into the edge zone of CLJ (where the edge zone means the time interval 

around the edge with a width smaller than σjit, while σjit is a standard deviation of 

the jitter). The KD samples represented by the output signal q(nTCLK) are XOR- 

ed bit-wise in a corrector [60] to obtain one random bit during N periods TQ. The 

generator output bitrate R is thus decimated by factor N to R = 1/(NTQ). It can 

be seen that while the left side of (5.6) depends on the generator structure and PLL 

settings, its right side, the jitter, depends on the noise of the PLL circuitry, the 

working environment, and on the circuit board design. Therefore, the jitter must be 

known in advance or (even better) measured in real time. Measurement of the jitter 

requires special measuring equipment. Common methods of jitter measurement 

(e.g. those used in [61]) enable one to measure the absolute long-term jitter and 

not the relative tracking jitter employed in the proposed TRNG. Furthermore, the 

jitter is measured under laboratory conditions and not in a real (potentially hostile) 

environment. If the results of measurements are not available, parameters from the 

vendor’s documentation can be used for the design of the TRNG as in [60]. 

The decimated output signal of the TRNG 

x(nTQ) = q(nTQ) ⊕ q(nTQ − TCLK) ⊕ . . . ⊕ q(nTQ − (KD − 1)TCLK) , (5.7) 

which is generated at the output of an Exclusive-OR (XOR)-based decimator [42] 

as a bit-wise addition modulo 2 (⊕) of KD samples q(.) sampled with the frequency 

FCLK will be nondeterministic, too. Note that the delay line can still be a useful 

building block for σjit ≈ MAX(∆Tmin) or σjit < MAX(∆Tmin), as it was shown 

in [62]. 
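The XOR-based decimation of Equation (5.7) can be sketched in a few lines. The model below is a simplification under stated assumptions: CLJ is taken to be an ideal square wave with 50 % duty cycle and zero initial phase, no jitter is modelled, and the function names are hypothetical. Under these assumptions every block of KD samples has the same parity, so the decimated output is constant; flipping a single sample, as jitter could do to a critical sample, flips the corresponding output bit.

```python
from functools import reduce
from operator import xor


def ideal_samples(k_m, k_d, n):
    """n jitter-free samples of a 50 % duty-cycle CLJ taken on CLK edges.

    Sample i is taken at phase (i * K_M mod K_D) / K_D of the CLJ period;
    the signal is assumed high during the first half of its period."""
    return [1 if 2 * ((i * k_m) % k_d) < k_d else 0 for i in range(n)]


def xor_decimate(q, k_d):
    """XOR each complete block of k_d consecutive samples into one bit."""
    n_blocks = len(q) // k_d
    return [reduce(xor, q[b * k_d:(b + 1) * k_d]) for b in range(n_blocks)]


K_M, K_D = 5, 7
q = ideal_samples(K_M, K_D, 3 * K_D)
x = xor_decimate(q, K_D)
print(q[:K_D], x)          # identical parity in every block

# Emulate one critical sample being resolved differently in the first block.
q_flipped = q.copy()
q_flipped[2] ^= 1
print(xor_decimate(q_flipped, K_D))
```

In this idealisation the only way the output can vary from block to block is through samples whose value is not fixed by the deterministic phase relation.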

The sampler sensitivity on the jitter

\[ S = F_{CLI}\,\mathrm{MAX}(\Delta T_{min}) = \frac{1}{4 M_{CLK} M_{CLJ}} \tag{5.8} \]

is derived from Equation (5.5). Decreasing MAX(∆Tmin) for a fixed FCLI requires maximisation of multiplying coefficients (M).

For the output bitrate R = 1/TQ = FCLK/KD we get the condition

\[ R = \frac{F_{CLI}}{D_{CLK} D_{CLJ}} . \tag{5.9} \]
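Equations (5.8) and (5.9) can be checked against the general expressions for one concrete configuration. The sketch below uses purely hypothetical PLL settings (a 30 MHz input clock and small multipliers and dividers chosen so that KM and KD are relatively prime and GCD(2KM, KD) = 1); they are not taken from any device documentation. It assumes that both clocks are derived from the same input clock CLI, that is FCLK = FCLI·MCLK/DCLK and FCLJ = FCLI·MCLJ/DCLJ.

```python
from fractions import Fraction
from math import gcd

# Hypothetical PLL settings, chosen only for illustration.
F_CLI = 30_000_000            # input clock frequency in Hz
M_CLK, D_CLK = 7, 2           # multiplier / divider producing CLK
M_CLJ, D_CLJ = 5, 3           # multiplier / divider producing CLJ

F_CLK = Fraction(F_CLI * M_CLK, D_CLK)
F_CLJ = Fraction(F_CLI * M_CLJ, D_CLJ)
K_M = M_CLJ * D_CLK
K_D = M_CLK * D_CLJ

# Worst-case distance from Equation (5.5), in seconds.
max_dt = (1 / F_CLK) * gcd(2 * K_M, K_D) / (4 * K_M)

S = F_CLI * max_dt            # sensitivity, general form
R = F_CLK / K_D               # output bitrate 1 / T_Q, general form

print(f"K_M = {K_M}, K_D = {K_D}")
print(f"MAX(dT_min) = {float(max_dt) * 1e12:.1f} ps, S = {S}, R = {float(R):.0f} bit/s")
```

For these settings the general forms reduce to the closed forms: S equals 1/(4·7·5) and R equals the input frequency divided by 2·3.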

For R it holds that increasing R for a fixed FCLI requires minimisation of dividing coefficients (D). Of course, optimization cannot be done independently. There are system limits expressed by the condition

\[ \frac{R}{\mathrm{MAX}(\Delta T_{min})} = 4 F_{CLK} F_{CLJ} . \tag{5.10} \]

5.3.2 Coherent Sampling

The sampling technique applied for randomness extraction in PLL-TRNG is called 

a coherent sampling. 

The method expects that the samples are processed during the period TQ that 

is given by ratio of the clock frequencies. In case of ideal signals without a jitter 

the output signal is perfectly periodical. Let us provide some more details on the 

parameters of this signal. 

A similar technique is applied to measure high-frequency signals. The coherent principle is based on sampling the measured signal over several of its periods, instead of the usually expected single period. The sampling frequency fs is lower than the frequency f of the sampled signal. The ratio between the frequencies is expressed as

\[ f_s = \frac{N}{M} f , \qquad \mathrm{GCD}(M, N) = 1 . \tag{5.11} \]

During M periods of the sampled signal, N samples are obtained. Since M and N are relatively prime numbers, the N samples are distinct and evenly distributed in TQ, thus the effective sampling frequency is fseff = Nf. In order to obtain the original waveform of the sampled signal, a time shuffling of the samples may be needed. In case of M = N + 1, time shuffling can be avoided if 0 ≤ φ1 ≤ 2π/N.

This sampling theory may be applied to the referred generator. Since it is difficult to fulfil the condition for avoiding shuffling of the samples, a re-shuffling is required. Let us assume that during the period TQ we acquired KD samples of the CLJ signal with order i = 0, 1, . . . , KD − 1. Next, we need to rearrange the samples according to their timing position in the CLJ signal. The idea behind this reordering lies in the fact that KD samples of CLJ are taken during KM periods of the CLJ signal, therefore we can reconstruct one period of the signal CLJ from KD samples. Thus, we compute the order index j and sort the samples according to this index:

\[ j = i K_M \bmod K_D . \tag{5.12} \]
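The reordering can be illustrated for the values KM = 5 and KD = 7 used in Figure 5 – 6. The sketch below is a simplified model, assuming an ideal jitter-free CLJ with 50 % duty cycle and zero initial phase; the function names are hypothetical. It acquires KD samples in time order, places each sample at the index given by Equation (5.12), and prints the reconstructed single period of CLJ.

```python
def ideal_samples(k_m, k_d):
    """K_D jitter-free samples of a 50 % duty-cycle CLJ in acquisition order i.

    The signal is assumed high during the first half of its period."""
    return [1 if 2 * ((i * k_m) % k_d) < k_d else 0 for i in range(k_d)]


def reorder(samples, k_m):
    """Place sample i at index j = i * K_M mod K_D (Equation (5.12))."""
    k_d = len(samples)
    rebuilt = [None] * k_d
    for i, sample in enumerate(samples):
        rebuilt[(i * k_m) % k_d] = sample
    return rebuilt


K_M, K_D = 5, 7
order = [(i * K_M) % K_D for i in range(K_D)]
acquired = ideal_samples(K_M, K_D)
rebuilt = reorder(acquired, K_M)

print(order)     # [0, 5, 3, 1, 6, 4, 2]
print(acquired)  # [1, 0, 1, 1, 0, 0, 1]
print(rebuilt)   # [1, 1, 1, 1, 0, 0, 0]
```

Because KM and KD are relatively prime, the index j visits every position exactly once, so the reordered samples show one period of CLJ with a single falling edge.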


5.4 Testing of TRNGs 

Randomness of the generated numbers cannot be proven only by passing generally used statistical tests. Instead, each RNG implementation has to be evaluated individually as a unique system. Moreover, even if the prototype in the lab generates acceptable random numbers, this may not be true for every produced TRNG of the same type during its whole operation time, and therefore continual testing of the generated output is required.

It is well-known that most of the attacks are directed towards the implementa- 

tions of the cryptographic algorithms and not to the algorithms themselves. This 

means that special attention should be paid to avoid all weaknesses helping an at- 

tacker in breaking of a system. 

The topic of tests is highly relevant in the case of attacks. The generators, as sources of secrets on which the security of whole cryptosystems is based, are a popular target of attacks and of attempts to compromise the generated output. The topic of attacks is therefore also included in this chapter. Changing the working conditions may have a degrading influence on the parameters of the generated sequence.

In [74] an approach for the evaluation of physical random number generators 

is given which takes the construction of the TRNG into account. The document 

presents a theory how the TRNGs used in cryptographic systems should be evalu- 

ated. 

For the TRNGs testing we have to accept the following facts [100]: 

• A finite set of statistical tests may detect defects of a random source, but these tests cannot verify the randomness of the source.

• Good statistical properties of the random numbers are clearly not sufficient for sensitive cryptographic applications such as the generation of keys, signature key pairs or signature parameters.

• The key criterion is not the statistical behavior of the numbers but their en- 

tropy. 

• For good TRNG it has to be given that the increase of entropy per generated 

number is sufficiently large. 

In [74], a set of tests that should be passed is proposed, including Coron’s test of entropy increase. In addition to the proof that the generated numbers have the desired properties, it is necessary to provide an explanation of the randomness extraction. In other words, the principle of random number generation has to be described for better understanding and for better analysis of possible attacks on the TRNG.

Startup Test, Online Test, TOT Tests If an RNG prototype in a lab generates acceptable random numbers, this may not be true for each TRNG of the same type during the whole operation time. The reason for this could be found in tolerances of components of the noise source, ageing effects, or outside attacks. In the worst case the TRNG breaks down totally and the output numbers are constant from that moment on. Therefore, the developer of the TRNG should also implement tests that detect such cases of randomness degradation of the output bits. We distinguish between 3 types of tests [74]:

1. startup test is used to verify the principal functionality of the noise source when the TRNG has been started.

2. online test should detect if the quality of the random numbers is not sufficient 

for this particular TRNG or deteriorates in the course of the time. 

3. tot test (’tot’ stands for ’total failure of the noise source’) should detect a total 

breakdown of the noise source. 

Implementation of the tests For implementation of the tests one has to consider the limitations that are given by the platform on which the TRNG is implemented. The implementation targets are often smart cards or field programmable gate arrays (FPGAs) with limited memory space. Therefore the chosen tests should require only small additional logic resources. Moreover, the tests should be selected according to the features of the TRNG and the basic principle of the random source. It is also possible to create new tests that are more suitable for the particular TRNG and better detect the possible defects.

Due to the limited memory resources of target platforms it is impossible to test the statistical properties on very long sequences (up to Mbits of data) as some tests (e.g. [97]) require. The goal is to find tests that are able to evaluate the quality of the random source continually without the need to store the output bits. Requirements which appropriate online tests should fulfil are formulated in [100].

Two further requirements are placed on the tests. On one side we expect detection of even a small deviation from an ideal random source, but on the other side frequent false alarms are not acceptable (e.g. the tot test can block the smart card, so a revision by the producer is required before it can be reused). Therefore the ranges of deviations from ideal randomness have to be set very carefully so as not to decrease the security of the system, but also not to block the TRNG by false alarms. This task is even more difficult for short sequences of random bits tested inside the TRNG.
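The idea of testing without storing the output bits can be sketched with a few counters. The class below is a hypothetical illustration and not one of the tests specified in [74] or [100]: a run-length counter plays the role of a tot test (the output has become constant), and a count of ones over a fixed window plays the role of a simple online test. All three thresholds are arbitrary placeholders; in a real design they would have to be derived from the stochastic model of the generator and from the acceptable false-alarm rate.

```python
class StreamingHealthTests:
    """Streaming checks that keep only a few counters (no bit buffer)."""

    def __init__(self, tot_run_limit=64, window=1024, max_deviation=128):
        # Placeholder thresholds, chosen only for illustration.
        self.tot_run_limit = tot_run_limit
        self.window = window
        self.max_deviation = max_deviation
        self.last_bit = None
        self.run_length = 0
        self.ones = 0
        self.count = 0

    def feed(self, bit):
        """Process one output bit; return 'tot', 'online' or None."""
        # Run-length counter: a long run of identical bits suggests a
        # total failure of the noise source.
        if bit == self.last_bit:
            self.run_length += 1
        else:
            self.last_bit = bit
            self.run_length = 1

        # Count of ones in the current window: a large deviation from
        # window / 2 suggests a biased source.
        self.ones += bit
        self.count += 1
        online_alarm = False
        if self.count == self.window:
            online_alarm = abs(self.ones - self.window // 2) > self.max_deviation
            self.ones = 0
            self.count = 0

        if self.run_length >= self.tot_run_limit:
            return "tot"
        return "online" if online_alarm else None


# Usage: a stuck-at-zero source triggers the tot alarm on the 64th bit.
tests = StreamingHealthTests()
alarms = [tests.feed(0) for _ in range(64)]
print(alarms[-1])
```

Tightening `max_deviation` makes the online check more sensitive to bias but raises the probability of a false alarm on a healthy source, which is the trade-off described above.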

5.5 Attacks against TRNG 

The main goal of an attacker of a cryptographic algorithm or implementation is to reveal some part of, or even the whole, secret key and then easily decrypt any encrypted message. Attacking RNGs has a different motivation than finding the key. Inside cryptographic systems the RNG plays a crucial role in the generation of secret keys, session keys, etc. A random key is the outcome of the generation process. Therefore the target of the attack is not only the generated value of the secret key but also any information that makes it possible to predict the succeeding or preceding values of the keys.

After a successful attack, the generated values may no longer be random: they can be constant or strongly biased, or the attacker may know an algorithm that predicts them correctly with high probability. With this approach one tries to change the random behaviour of the TRNG to a deterministic one, or at least to change the probability distribution of the generated sequence.

In the case of a PRNG, knowledge of the seed or internal state can lead to breaking the generator because its structure is usually known and public. In the case of a well-designed TRNG, information about the current internal state does not provide any information about the previous or following one. Therefore the focus of the attack is the source of noise and the randomness extraction method rather than the internal state of the TRNG.

Attacks on cryptographic systems (including RNG) can be divided into algorith- 

mic and implementation attacks. 

Algorithmic attacks The first group of attacks, the algorithmic attacks, includes mathematical analysis of the mechanism for randomness extraction or of the structure of the PRNG and does not require any access to the attacked unit. The analysis can be used especially against PRNG designs with an improperly designed way of obtaining the seed value [69]. If the seed contains a low level of entropy, then the output of the generator has statistical properties not comparable to a random sequence and the effort needed to reproduce the output is lower. Mathematical analysis of TRNGs tries to find deterministic dependencies inside the extraction method causing pseudo-randomness.

As the parameters of TRNGs are highly dependent on the implementation, at- 

tacking directly the hardware realisation can be more powerful. 

Implementation attacks The second group, the implementation attacks, expects 

a direct physical access to an implementation and is based on weaknesses caused by 

implementation of the RNG. Implementation attacks are further divided to passive 

and active attacks. 

Passive attacks, usually called side-channel attacks, benefit from side-channel information gained from the physical implementation. The power consumption, execution time or electromagnetic emanations can provide additional useful information about the RNG internal state or processed data.

Active attacks require the attacker to intervene in the standard working conditions, operation flow or design of the original implementation of the RNG. Non-invasive active attacks apply non-permanent changes of external parameters of the RNG, e.g. supply voltage or temperature, with the aim of achieving a non-standard, biased RNG output. With more resources one can execute an invasive attack and change the physical structure of the implementation. The attacker tries to destroy the source of randomness and make the output of the RNG constant, or to obtain the output of the generator directly.

5.6 Conclusions 

In this chapter we have introduced the topic of random numbers. The extraction of random bits in a digital environment is a crucial topic in the area of system implementations with public-key cryptography. The randomness itself and three typical sources of randomness (noise, metastability and jitter) were described. In order to provide an overview of the current state of research, we have collected descriptions of recently published design proposals and implementations of TRNGs.

A typical design of a TRNG implemented in a digital device includes a source of randomness from which a digitised noise signal can be harvested by a proper mechanism. We have explained the importance of research in the areas of the harvesting mechanisms and postprocessing. Positive results of statistical tests do not assure that the generated sequence has a random origin. In addition, the working environment may also have a significant impact on the parameters of the output bits. Requirements on RNGs applied in cryptography cover security parameters of the design, unpredictability of the generated sequence and specific statistical properties of the output sequence.

The generator chosen for our research, the PLL-TRNG proposed in [60], will be further tested and analysed in order to provide better tools for choosing its parameters and to understand its behaviour in a changing environment. The described theoretical background on testing and attacks of RNGs has been applied and the results are given in the following chapter.


6 True Random Number Generator 

The chapter is dedicated to the analysis of a jitter-based random generator from various aspects. Our work is based on the TRNG design proposed by Viktor Fischer and Miloš Drutarovský, published in 2002 [60]. We extend the already published results summarised in the previous chapter. Our focus is on the analysis of the generator under changing working conditions and configuration settings.

Results of the research were published in the papers [47, 48, 61, 62, 114–116, 119]. The main achievements of our research are in the following areas:

• Analysis of PLL circuitry as a source of randomness – implementation issues, 

possible PLL configurations, verification of vendor parameters, 

• Analysis of TRNG implementation in different FPGAs – achievability study, 

design consideration, practical results, 

• Stochastic model of PLL-TRNG – proposal and practical verification, 

• Temperature influence on PLL-TRNG – practical attack on TRNG with results 

and suggestions for design. 

The chapter is structured as follows. In Section 6.1 we describe two ways of clock synthesis in modern FPGAs and summarise the parameters of the clock circuitry verified by practical measurements of the PLL parameters. Section 6.2 provides an analysis of PLL configurations, practical results from Altera and Actel FPGA implementations of the generator and a stochastic model of the generator. In Section 6.3 we describe a non-invasive attack on the generator together with practical outcomes. In the last part (Section 6.4) we discuss the obtained results and provide ideas for further research.

6.1 Clock Synthesis in FPGAs 

In present-day integrated digital systems, there is a need for numerous clock signals with various frequencies. The synthesis of the clocks in separate circuits is not efficient and the frequencies are too high to be generated by an external crystal. For this purpose FPGA vendors offer clock circuitry embedded on the FPGA chip. Besides synthesis of clock signals with the required frequencies, it provides additional functions that make it possible to process signals with very high frequencies.

94

FEI KEMT 

The clock conditioning circuits usually enable to perform following functions (in 

dependency on FPGA vendor and family): clock phase adjustment, clock delay 

minimisation, clock frequency synthesis, clock modulation spread-spectrum, static 

or dynamic configuration of circuits parameters, etc. 

We explain the two most widely applied principles of clock signal management in FPGAs, based on the PLL and the delay-locked loop (DLL). Both principles can be implemented as digital or analog circuits. While Xilinx has chosen a digital implementation of the DLL in most of its FPGAs, other vendors such as Altera and Actel have included clock circuitry based on an analog PLL in their devices.

Phase-Locked Loop Circuitry A typical analog PLL block in Altera and Actel devices (see Figure 6 – 1) can provide at least one synthesised clock signal with frequency FOUT:

$$F_{OUT} = \frac{F_{VCO}}{k} = F_{REF}\,\frac{m}{k} = F_{IN}\,\frac{m}{n \times k}, \qquad (6.1)$$

where FIN is the frequency of the input clock source, which can be an external crystal or another PLL in the case of a PLL cascade, FREF is the input reference frequency to which the feedback clock FFB is locked, and FVCO is the frequency of the clock signal produced by the voltage-controlled oscillator (VCO). The reference, feedback and post-divider values n, m, and k can range from one to several hundred in FPGAs [11, 14], or to several thousand in ASICs [22]; together with the VCO working limits, they set the range of input and output frequencies.
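The chain of equalities in Equation 6.1 can be checked numerically. The following Python lines are only a minimal sketch: the function name `pll_frequencies` and the example divider values are illustrative assumptions, not part of any vendor tool, and the VCO working limits are not checked.

```python
from fractions import Fraction


def pll_frequencies(f_in_hz, m, n, k):
    """Return (F_REF, F_VCO, F_OUT) in Hz following Eq. 6.1.

    n is the reference divider, m the feedback divider, k the post-divider.
    """
    if min(m, n, k) < 1:
        raise ValueError("divider values must be positive integers")
    f_ref = Fraction(f_in_hz) / n   # F_REF = F_IN / n
    f_vco = f_ref * m               # F_VCO = F_REF * m
    f_out = f_vco / k               # F_OUT = F_VCO / k
    return f_ref, f_vco, f_out


# Illustrative values only: 40 MHz input, m = 12, n = 7, k = 1.
f_ref, f_vco, f_out = pll_frequencies(40_000_000, m=12, n=7, k=1)
print(f"F_OUT = {float(f_out) / 1e6:.4f} MHz")
```

Exact rational arithmetic (`Fraction`) is used so that ratios such as 12/7 are not distorted by rounding before they are compared.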

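[Figure: PLL block diagram with the blocks reference divider :n, Phase Frequency Detector, Charge Pump, Loop Filter & VCO, feedback divider :m and post-dividers :k1 … :kc; signals FIN (clock input), FREF, FFB, FVCO and FOUT (clock outputs)]

Figure 6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]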

Delay-Locked Loop Circuitry In DLL circuits, the clock signal is synthesised by inserting a variable delay between the input and output clock signals (see Figure 6 – 2). Delay lines can be built using a voltage-controlled delay or as a series of discrete delay elements, as in the Xilinx DLL [125, 126].

[Figure: DLL block diagram with the blocks Phase Detector and Delay Line (+/- control); signals FIN (clock input), FFB and FOUT (clock output)]

Figure 6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits

The DLL achieves very good results in delay compensation and clock conditioning. However, the available range of clock dividers is much more limited than in the case of a PLL. An output clock can be derived from the input signal with its frequency doubled or divided by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 in the case of Spartan II FPGA devices [127].

6.1.1 PLL as Source of Randomness 

Due to its digital nature, the DLL in Xilinx devices is less sensitive to a noisy environment than an analog PLL with a VCO. The VCO tends to lock to the frequencies of disturbing external signals, and therefore separate power supply and ground networks, connected only to the clock circuitry, are required. On the other hand, the analog PLL allows a small-area implementation providing a wide range of clock frequencies. The DLL technology is limited in this respect and offers only certain ratios between input and output frequencies.

Changes in temperature or fluctuations of the supply voltage correlated with the switching activity of closely placed logic may cause a drift in the generated clock signal. To compensate, the loop adjusts the delay elements or the VCO frequency, which is observed as deterministic jitter added to the clock signal. Another source of noise influencing the PLL circuitry is the input clock signal. Therefore, there is a trade-off between compensating the internal and the external jitter.

All phase changes in the PLL or differences of delays in the DLL introduce jitter into the synthesised output signal. Filters inside the clock circuitry are matched to eliminate the non-linearity caused by the loop and external influences; however, the intrinsic random noise of the VCO is always present in the output clock signal and cannot be attenuated completely. Thanks to that, the PLL is a promising source of randomness suitable for the implementation of a TRNG. In addition, the frequency of the VCO is never constant, and even under stable working conditions it fluctuates around a mean value.

From this analysis we can conclude that PLL circuits are more suitable for a TRNG design based on jitter sampling, as they offer a wide frequency range for the generated signals. Moreover, the internal PLL circuitry provides a reliable source of jitter.

Analog PLL in Altera and Actel FPGAs The core of the clock circuitry embedded in Altera and Actel FPGAs is formed by an analog PLL circuit surrounded by several delay lines, clock multipliers/dividers, and circuits interconnecting the internal clock network and external pads. The number of PLLs and their features depend on the chosen FPGA type and vendor. Tables 6 – 1 and 6 – 2 present the basic parameters of PLLs and clock circuits for FPGA devices from Altera (APEX20K(E) [14], Cyclone [12, 17] and Stratix [15, 19]) and Actel (Axcelerator [2], ProASICplus [3], ProASIC3(E) [4]).

Table 6 – 1 Parameters of PLL embedded in Altera FPGAs

family       # of PLLs          dividers range                max. output period jitter
                                m        n        k
APEX20K      1                  –        –        –           200 ps
APEX20KE     2, 4               1-160    – *      –           0.35% RMS of output period
Cyclone      1, 2               2-32     1-32     1-32        ±300 ps for FOUT ≥ 100 MHz
                                                              60 mUI for FOUT < 100 MHz
Cyclone II   2, 4               1-32     1-4      1-32        NA **
Stratix      4, 8 × FPLL ***    1-32     1-32     1-32        ±100 ps for FOUT > 200 MHz
             2, 4 × EPLL        1-512    1-512    1-1024      ±20 mUI for FOUT < 200 MHz
Stratix II   4, 8 × FPLL        1-32     1-4      1-32        NA **
             2, 4 × EPLL        1-32     1-32     1-32

* m/(n × k) = 1-280.

** The jitter specification for the PLL output pins depends on the I/O pins in its VCCIO bank, how many of them are switching outputs, how much they toggle, and whether or not they use programmable current strength.

*** EPLL and FPLL stand for Enhanced and Fast PLL, respectively.

Table 6 – 2 Parameters of PLL embedded in Actel FPGAs

family        # of PLLs   dividers range   max. output period jitter
ProASIC3(E)   1 (6)       NA               180 ps for FOUT = 24 MHz
ProASICplus   2           m = 1-64         ±1% for FOUT < 10 MHz
                          n = 1-32         ±2% for 10 MHz < FOUT < 60 MHz
                          k = 1-4          ±1% for FOUT > 60 MHz
Axcelerator   8           m = 1-64         long-term: 1% of FOUT or 100 ps
                          n = 1-64         short-term: 50 ps + 1% of FOUT

There are two parameters of the PLL clock circuits that have a significant impact on the possibility of extracting randomness from the clock jitter: the output period jitter of the PLL and the range of the frequency dividers. FPGA vendors steadily decrease the level of timing jitter in the clock signals of the latest FPGA families, which was also confirmed by our experimental measurements (described later). On the other hand, the range of divisors in high-density devices has been enlarged to achieve a wider range of synthesised clock frequencies.

The jitter size is usually expressed as a peak-to-peak value (the difference between the smallest and the largest clock period) or as a 1-sigma value σjit (standard deviation). Typical values of the period jitter depend on the technology and configuration of the PLL and can range from 3.5 ps to 10 ps for ASICs [22], or up to 100 ps for FPGAs [11, 19]. Since the technology of the embedded PLL and the quality of the VCO are fixed by the FPGA vendor, a user can modify the output jitter through the configuration of the PLL divider values (m, n, k) and the loop filter bandwidth.
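The two jitter measures can be illustrated on a list of measured clock periods. This is a minimal sketch: the period values below are hypothetical and serve only as an example, and the population standard deviation is an assumed choice for the 1-sigma value.

```python
import statistics


def period_jitter(periods_ps):
    """Return (peak-to-peak, 1-sigma) period jitter in ps."""
    peak_to_peak = max(periods_ps) - min(periods_ps)  # largest minus smallest period
    sigma_jit = statistics.pstdev(periods_ps)         # population standard deviation
    return peak_to_peak, sigma_jit


# Hypothetical measured periods in ps (not taken from the thesis).
periods = [15000.0, 15012.0, 14990.0, 15004.0, 14994.0]
peak_to_peak, sigma_jit = period_jitter(periods)
print(f"peak-to-peak = {peak_to_peak:.1f} ps, sigma = {sigma_jit:.3f} ps")
```

The peak-to-peak value grows with the number of observed periods, whereas the 1-sigma value stabilises, which is one reason why σjit is the more convenient parameter when comparing configurations.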

Jitter Generated in Altera Stratix FPGA In analog PLLs, various noise sources cause the PLL's internal VCO to fluctuate in frequency. Under ideal conditions, the fluctuations visible as jitter are caused only by analog (non-deterministic) internal noise sources. In such a case the jitter is denoted as intrinsic jitter. Other frequency fluctuations are caused by variations of the supply voltage and temperature, by external interference through the power and ground connections, or by the noisy environment generated by internal FPGA circuits [125]. The PLL's control circuitry adjusts the VCO back to the specified frequency, and this change is seen as (deterministic) jitter.

We further analyse the parameters of the PLL circuits in the Stratix family of Altera FPGAs and their relation to the generated clock signal and its jitter. The Altera Stratix devices include two types of PLLs:

Fast PLL (FPLL): Stratix devices include up to 8 FPLLs. The FPLLs offer general-purpose clock management with multiplication and phase shifting. The multiplication is simplified in comparison to the EPLL and uses only m/k scaling factors with a range from 1 to 32 [15]. Depending on m, the input frequency (for speed grade -5) can vary from 15 to 717 MHz, the output frequency from 9.4 to 420 MHz, and the frequency of the VCO from 300 to 1000 MHz.

Enhanced PLL (EPLL): Compared to the FPLL, the EPLLs have additional configurable features such as external feedback, configurable bandwidth, run-time reconfiguration, etc., and an extended range of parameters. The input frequency (for a speed grade -5 device) can vary from 3 to 684 MHz, the output frequency from 9.4 to 420 MHz, and the frequency of the VCO from 300 to 800 MHz. The reference, feedback and post-divider values n, m and k can vary from 1 to 512 (1024 for k) with 50% duty cycle [15].

The size of the intrinsic jitter of the PLL depends on the quality factor Q of the VCO, on the bandwidth of the loop filter (see Figure 6 – 1), and on the so-called pattern jitter introduced by the phase frequency detector. The technology of the PLL and the quality of the VCO are given by the FPGA design. A designer can change the output jitter directly, by modifying the scaling factors (for FPLL and EPLL) and the filter bandwidth (only for EPLL), but also indirectly through the design of the board (separation of the analog and digital grounds, filtering of the analog power supply, etc.).

The PLL acts as a low-pass filter; therefore, a low bandwidth setting of the loop filter can be applied to filter out high-frequency jitter from the input clock. To track the input jitter, one can use a high bandwidth setting. As already mentioned, power supply noise can cause the VCO output frequency to fluctuate and thus cause jitter. In such cases a low bandwidth makes the feedback loop respond more slowly to the noise injected by the VCO, so the loop cannot counteract it. A high bandwidth allows the loop to respond quickly to the noise and compensate for it. Therefore, there is a trade-off between a low and a high bandwidth of the PLL loop filter: a low bandwidth filters the jitter of the input signal, while a high bandwidth suppresses the VCO noise.


Since the size of the jitter is very important for our method, we needed to measure it for various PLL configurations and verify the values provided by the chip vendors. For example, according to the vendor's measurements [125], the PLL jitter in an Apex FPGA has a 1-sigma value of σjit ≈ 15.9 ps for a synthesised clock signal with FOUT = 66.6 MHz and feedback divider m = 2. These results were acquired under “ideal conditions” with a minimal amount of FPGA resources occupied and minimal input/output activity. Our measurements showed that the clock jitter in the Apex FPGAs is significantly higher (about 140 ps) for higher divider factors and with internal FPGA flip-flops switching at different clock frequencies. Note that the jitter size depends on the PLL settings and the type of power supply filter included on the development board, but the measured jitter is never lower than the intrinsic jitter of the FPGA.

(a) FPLL with ratio 12/7, σjit ≈ 10 ps (b) EPLL with ratio 139/133, σjit ≈ 16 ps 

Figure 6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div) 

For jitter measurements on a Stratix family FPGA we selected the Altera DSP Development board with the Stratix EP1S25F780C5 device [16]. The jitter was measured similarly as in [62] using an Agilent Infiniium DCA 86100B wide-bandwidth oscilloscope. We found that the jitter is significantly smaller than on the Nios board with APEX [10] (used as a reference in [60]). For example, for the FPLL and the ratio 12/7 the jitter has a 1-sigma value of about 10 ps (see Figure 6 – 3(a)), and for the EPLL and the ratio 139/133 the 1-sigma value of the jitter is about 16 ps (see Figure 6 – 3(b)).


6.2 PLL-Based TRNG on FPGA 

After the general parameters of PLL circuitry in FPGAs, this section presents results on the practical implementation of the PLL-TRNG in FPGA families from two vendors, Altera and Actel. The presented stochastic model of the generator helps to understand the randomness extraction method.

6.2.1 PLL Configurations 

The design depicted in Figure 5 – 5 represents only one of the possible PLL configurations that we investigate further. In general, depending on the chosen FPGA, there are three options for configuring the PLLs in the TRNG: with one PLL, with two parallel PLLs, and with two (or more) cascaded PLLs (see Figure 6 – 4).

[Figure: three block diagrams a), b) and c); labels in each panel: input clock CLI, the PLL block(s) (PLL in a); PLL1 and PLL2 in b) and c)), the signals CLJ and CLK, and a D flip-flop]

Figure 6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs

In some cases, especially in low-cost FPGAs, only one PLL is available for the TRNG (see Figure 6 – 4a)) and the others (if available) are used for the rest of the system. If there are no restrictions, or only some acceptable restrictions (see footnote 10), on the input clock frequency of the logic outside the TRNG, then one or more PLLs can be shared by the TRNG and the user logic.

10 By acceptable we mean requirements on the clock frequency that lie in a range that is also suitable for the TRNG to achieve the working condition (5.6).

Table 6 – 3 Parameter settings for different TRNG configurations

configuration / parameter   one PLL              two parallel PLLs     two cascaded PLLs
FCLK                        FCLI                 (MCLK/DCLK) FCLI      FCLI
FCLJ                        (MCLJ/DCLJ) FCLI     (MCLJ/DCLJ) FCLI      (MCLJ1 MCLJ2)/(DCLJ1 DCLJ2) FCLI
KM                          MCLJ                 MCLJ DCLK             MCLJ1 MCLJ2
KD                          DCLJ                 DCLJ MCLK             DCLJ1 DCLJ2
S                           1/(4 MCLJ)           1/(4 MCLK MCLJ)       1/(4 MCLJ1 MCLJ2)
R                           FCLI/DCLJ            FCLI/(DCLK DCLJ)      FCLI/(DCLJ1 DCLJ2)

In most cases two PLLs are more than sufficient to fulfil condition (5.6). Usually, the option with two parallel PLLs is used (see Figure 6 – 4b)). When the range of PLL divisors is not satisfactory (again, this is the case for low-cost FPGAs), a cascade of two (or more, if available) PLLs can be applied (see Figure 6 – 4c)). Each configuration allows different characteristics (defined in [61]) to be achieved, depending on the parameters of the PLLs, namely the maximum input, output and VCO frequencies, the multiplication and division factors, etc., and in this way the needed frequency can be synthesised. The parameters of the three considered generator configurations are summarised in Table 6 – 3.

We can conclude that the use of two PLLs in either a parallel or a serial (cascaded) configuration can significantly increase the sensitivity to jitter and the output bit-rate of the generator, depending on the available range of multiplication factors, division factors, or both.

The equations in Table 6 – 3 show from which PLL coefficients (dividers) the factors KM and KD are composed. The factor KM has a direct influence on the value of MAX(∆Tmin) (see Eq. 5.5). While for the configurations with one PLL or several cascaded PLLs KM is composed only of multiplying coefficients, in the parallel configuration a dividing coefficient is included as well. This should be considered especially when the PLL coefficients do not all have an identical range.
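The composition of KM and KD for the three configurations can be sketched in a few lines of Python. Several points are assumptions rather than statements from this chapter: the function name `trng_parameters` is hypothetical, the input frequency of 80 MHz is inferred from the bitrates listed in Table 6 – 4, and MAX(∆Tmin) is taken as TCLK/(4 KM), which reproduces the values of Table 6 – 4 but may omit terms of Equation 5.5.

```python
from fractions import Fraction


def trng_parameters(f_cli_hz, clj_stages, clk_stage=None):
    """Return (K_M, K_D, R in bit/s, MAX(dT_min) in s).

    clj_stages: list of (M, D) pairs of the PLL(s) generating CLJ
                (one pair = single PLL, several pairs = cascade).
    clk_stage:  optional (M, D) pair of a parallel PLL generating CLK.
    """
    m_clj = d_clj = 1
    for m, d in clj_stages:
        m_clj *= m
        d_clj *= d
    m_clk, d_clk = clk_stage if clk_stage else (1, 1)
    k_m = m_clj * d_clk                      # K_M row of Table 6-3
    k_d = d_clj * m_clk                      # K_D row of Table 6-3
    f_clk = Fraction(f_cli_hz) * m_clk / d_clk
    rate = f_clk / k_d                       # R = F_CLK / K_D
    max_dt_min = 1 / (4 * k_m * f_clk)       # assumed form T_CLK / (4 K_M)
    return k_m, k_d, float(rate), float(max_dt_min)


F_CLI = 80_000_000  # assumption inferred from Table 6-4, not stated in the text
configs = {
    "A": ([(12, 7)], (25, 12)),
    "B": ([(43, 7)], (25, 12)),
    "C": ([(212, 207)], None),
    "D": ([(43, 7)], (31, 10)),
}
for name, (clj, clk) in configs.items():
    k_m, k_d, rate, dt = trng_parameters(F_CLI, clj, clk)
    print(f"{name}: KM/KD = {k_m}/{k_d}, R = {rate / 1e3:.1f} kb/s, "
          f"MAX(dT_min) = {dt * 1e12:.1f} ps")
```

Passing more than one pair in `clj_stages` models the cascaded configuration, where the multiplication and division factors of the stages are simply multiplied.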


6.2.2 Analysis of TRNG in Altera Stratix FPGAs 

Our implementation strategy was to obtain the fastest and best-quality generator using a minimum amount of resources (PLLs). Since the Stratix family contains two types of PLLs, several configurations are possible. The most economical solution would be based on one FPLL (since there are four FPLLs in the chosen device). However, the multiplication and division factors of a single FPLL cannot fulfil the implementation condition (5.6). Another option is to use an EPLL, whose extended range of parameters makes it possible to build a single-PLL TRNG. Consequently, the following four architectures of the TRNG implemented in Altera Stratix devices are possible:

1. Two FPLLs (referenced further as configuration A) 

2. One FPLL and one EPLL (configuration B) 

3. One EPLL (configuration C) 

4. Two EPLLs (configuration D) 

The relationship between the sensitivity to jitter S and the output bitrate R of the TRNG for the configuration with two parallel PLLs was described in Equations 5.8 and 5.9 (see Table 6 – 3 for other configurations and characteristic parameters).

Experimental Results The TRNG architectures were tested on the Altera DSP board with the Stratix EP1S25F780C5 [16]. They were described in VHDL and implemented using the Altera Quartus II development system, version 3.0 SP2. Acquired bits were transmitted to the PC through a parallel port. The complete TRNG design, including a 1024 x 8-bit FIFO and a parallel interface controller, needs up to 120 LEs out of about 25000 LEs available in the device. The signal CLK was used as the clock signal for the control logic and was therefore limited to about 250 MHz (although the output frequency of the PLL can be higher).

In order to test the basic quality of the different TRNG versions, we evaluated the following statistical parameters of the generated bit sequence b(n) (all of them computed for a record length of N = 1000000 bits):

1. Bias computed as

$$\mathrm{bias} = E[b(n)] - 0.5 = E[b] - 0.5 \cong \frac{N_1}{N} - 0.5 \qquad (6.2)$$

where N1 is the number of b(n) = 1 for n = 0, 1, . . . , N − 1. For a good TRNG, the bias should converge to 0 (with deviation ≈ ±3/√N).

2. Maximal autocorrelation coefficient computed as

$$\rho_{max} = \max\{|\mathrm{corr}(b_k)|,\ k = 1, 2, \ldots, 100\} \qquad (6.3)$$

where

$$\mathrm{corr}(b_k) = \mathrm{corr}\big(b(n), b(n-k)\big) = \frac{E\big[\big(b(n) - E[b(n)]\big)\big(b(n-k) - E[b(n-k)]\big)\big]}{\sqrt{\mathrm{var}(b(n))\,\mathrm{var}(b(n-k))}} \qquad (6.4)$$

$$\mathrm{var}(b(n)) = \mathrm{var}(b) = E\big[\{b - E[b]\}^2\big] = E[b]\{1 - E[b]\} \qquad (6.5)$$

Based on [42, 86] it can be shown that for a good TRNG (with bias → 0) and a finite record length N, the scaled coefficient √N · corr(bk) follows the standard normal distribution N(0, 1), and the following condition should be fulfilled (the value χ = 2.576 follows from P(X > χ) = α = 0.01/2, valid for the N(0, 1) distribution)

$$\rho_{max} \rightarrow \frac{2.576}{\sqrt{N}} = 0.002576 \qquad (6.6)$$

3. Standard FIPS 140-2 statistical tests [57], which analyse 20000-bit records and define thresholds to assess TRNG randomness. The FIPS 140-2 tests include the Monobit, Poker, Runs and Long runs tests. We analysed 100 sequences for each tested TRNG architecture and evaluated the relative number (tM, tP, tR, tL) of sequences that passed each test. A good TRNG should pass all FIPS tests, so that tFIPS = tM tP tR tL = 1.
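The first two statistics of the list above (Equations 6.2 to 6.6) can be estimated from a recorded bit sequence as sketched below. The function names are illustrative, the expectation is replaced by a sample mean over the overlapping part of the record, and the demonstration input comes from a software pseudo-random generator, not from the TRNG.

```python
import math
import random


def bias(bits):
    """Eq. 6.2: N1 / N - 0.5."""
    return sum(bits) / len(bits) - 0.5


def max_autocorrelation(bits, max_lag=100):
    """Eq. 6.3 to 6.5: largest |corr(b_k)| for k = 1 .. max_lag."""
    n = len(bits)
    mean = sum(bits) / n
    var = mean * (1.0 - mean)                 # Eq. 6.5 for a binary sequence
    if var == 0.0:
        raise ValueError("constant sequence, correlation undefined")
    rho_max = 0.0
    for k in range(1, max_lag + 1):
        cov = sum((bits[i] - mean) * (bits[i - k] - mean)
                  for i in range(k, n)) / (n - k)
        rho_max = max(rho_max, abs(cov / var))
    return rho_max


def rho_threshold(n, chi=2.576):
    """Eq. 6.6: chi / sqrt(N)."""
    return chi / math.sqrt(n)


# Demonstration on pseudo-random bits (a software PRNG, for illustration only).
rng = random.Random(2010)
demo = [rng.getrandbits(1) for _ in range(20000)]
print(bias(demo), max_autocorrelation(demo), rho_threshold(len(demo)))
```

In pure Python the full record length of N = 1000000 with 100 lags takes a while, so the demonstration uses a shorter record; the threshold of Equation 6.6 scales accordingly with 1/√N.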

Tables 6 – 4 and 6 – 5 include parameters and results for selected TRNG architectures. The best output bitrate and quality (expressed through the bias, ρmax and tFIPS) are obtained using the TRNG configuration with two EPLLs. The enhanced adjustable parameters of the EPLL make it possible to achieve the required level of sensitivity according to the jitter present in the device. The configurations with the FPLL are not suitable for jitter sampling due to the limited range of PLL dividers (see Table 6 – 1). In the case of low sensitivity S, the number of critical samples is very low, the configuration is unstable and the output sequence has a significant bias.


Table 6 – 4 Configuration parameters of tested TRNG

Conf.   PLL1              PLL2              Total      MAX(∆Tmin)   R        σjit
        Type    KM/KD     Type    KM/KD     KM/KD      [ps]         [kb/s]   [ps]
A       Fast    12/7      Fast    25/12     144/175    10.4         952.4    10
B       Enh.    43/7      Fast    25/12     516/175    2.9          952.4    23
C       Enh.    212/207   -       1         212/207    14.7         386.5    12
D       Enh.    43/7      Enh.    31/10     430/217    2.3          1142.9   13

Table 6 – 5 Results of quality evaluation of tested TRNG configurations

Configuration   bias     ρmax    tFIPS
A               -0.358   0.043   0
B               0.054    0.023   0
C               -0.003   0.012   0.96
D               0.002    0.003   1

The final speed of the generator in configuration D (more than 1 Mbit/s) is much higher than that presented in [60], while the quality confirmed by statistical tests remains comparable. Thanks to the analysis of the available PLL configurations and their parameters, we have presented a generator without the additional delaying logic applied in the original proposal [60]. The simpler sampling part of the generator is made possible by the wider divider range of the PLL circuits.

6.2.3 Analysis of TRNG in Actel FPGAs 

In this section we explain how the parameters of the clock circuitry influence the parameters of the discussed PLL-TRNG in the case of a low-cost FPGA. The analysis should answer the question of whether Actel FPGAs are suitable for a PLL-TRNG implementation and which TRNG parameters are achievable.

Clock generator circuitry in Actel FPGAs The ProASICplus was chosen as the target family for the TRNG implementation. This low-cost FPGA family based on flash technology offers two well-configurable PLLs on a chip. For experiments and measurements we selected an evaluation board [1] equipped with the ProASICplus APA300-PQFP208 device [3]. An on-board oscillator with a frequency of 40 MHz was used as the reference input clock source. The board has separate power supplies for the PLLs and for the rest of the chip, which makes it possible to analyse the impact of power supply disturbances (caused by off-chip manipulation, or by the activity of the on-chip logic when the power supplies are interconnected) on the generated sequences.

The on-chip PLL imposes the following limits on the frequencies of the signals connected to the PLL circuits: Fin = 1.5 − 240 MHz, Fout = 6 − 180 MHz and FVCO = 24 − 180 MHz. As already indicated in Table 6 – 2, the PLL output frequency Fout is derived from the input frequency Fin by applying the dividers:

$$F_{out} = F_{in}\,\frac{m}{n \times k} = \frac{F_{VCO}}{k} \qquad (6.7)$$

where m, n and k are the PLL frequency dividers and FVCO stands for the output frequency of the VCO.
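Equation 6.7 and the frequency limits quoted above can be combined into a simple validity check of a divider setting. This is a sketch under assumptions: the function name is hypothetical, the divider ranges (m = 1-64, n = 1-32, k = 1-4) are taken from the ProASICplus row of Table 6 – 2 as read here, and further device limits that are not listed in this section may apply.

```python
def proasicplus_pll(f_in_mhz, m, n, k):
    """Return (F_VCO, F_out) in MHz per Eq. 6.7, or None if a limit is violated."""
    # Divider ranges assumed from Table 6-2 (ProASICplus row).
    dividers_ok = 1 <= m <= 64 and 1 <= n <= 32 and 1 <= k <= 4
    if not dividers_ok or not 1.5 <= f_in_mhz <= 240:
        return None
    f_vco = f_in_mhz * m / n      # F_VCO = F_in * m / n
    f_out = f_vco / k             # F_out = F_VCO / k
    if not (24 <= f_vco <= 180 and 6 <= f_out <= 180):
        return None
    return f_vco, f_out


# 40 MHz on-board oscillator with m = 12, n = 7, k = 1.
print(proasicplus_pll(40, 12, 7, 1))
# A setting that drives the VCO out of its range is rejected.
print(proasicplus_pll(40, 64, 1, 1))
```

Enumerating all (m, n, k) triples with this check gives the set of clock frequencies that can actually be synthesised from a given input clock.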

To compare possible configurations and find the ranges of TRNG parameters, one can proceed as follows. The frequency ranges of the two rationally related clock signals are given by the ranges of the PLL dividers and the input frequency (using the equations from Table 6 – 3). From the ratio of the frequencies it is possible to set the parameters KM and KD and then to check the basic condition (expressed in Equation 5.6) that has to be fulfilled for the TRNG to function. The jitter deviation σjit can either be measured on the target device (if the required measurement equipment is available), or just estimated (considering the ranges given in the vendor's documentation) and then set empirically after experiments with the generator's settings. Knowing the frequencies of the clock signals and the parameters KM and KD, it is easy to find the period TQ (see Equation 5.3) and then the output bit-rate R = 1/TQ.
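The steps described above can be followed numerically. The sketch below assumes that KM/KD is the reduced ratio FCLJ/FCLK and that TQ = KD · TCLK; Equation 5.3 is not reproduced in this section, so this form is an assumption, although it agrees with the relations of Table 6 – 3. The function name is hypothetical.

```python
from fractions import Fraction


def sampling_parameters(f_clk_hz, f_clj_hz):
    """Return (K_M, K_D, T_Q in s, R in bit/s) for two rationally related clocks."""
    ratio = Fraction(f_clj_hz) / Fraction(f_clk_hz)   # F_CLJ / F_CLK = K_M / K_D
    k_m, k_d = ratio.numerator, ratio.denominator     # Fraction reduces the ratio
    t_q = Fraction(k_d) / Fraction(f_clk_hz)          # assumed T_Q = K_D * T_CLK
    bitrate = 1 / t_q                                 # R = 1 / T_Q
    return k_m, k_d, t_q, bitrate


# One PLL on a 40 MHz input with M_CLJ = 12 and D_CLJ = 7.
f_clk = 40_000_000
f_clj = Fraction(f_clk) * 12 / 7
k_m, k_d, t_q, bitrate = sampling_parameters(f_clk, f_clj)
print(k_m, k_d, float(t_q) * 1e9, float(bitrate) / 1e6)
```

The frequencies are passed as exact rationals; with floating-point inputs the reduction to KM/KD would return large, meaningless integers.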

To give an overview of the ranges of MAX(∆Tmin) achievable in different PLL configurations, we summarise them in Table 6 – 6. Note that the intervals are only theoretically achievable and could be slightly different in practical cases, since some limitations were not taken into account (e.g. the limited output and input frequency for the cascaded configuration, the limited number of combinations of dividers, etc.).

From Table 6 – 6 we can see that the smallest values of MAX(∆Tmin) can be reached with the cascaded configuration. While the frequency range is the same as for the other configurations, the number of combinations of frequency dividers is higher, which offers better possibilities for matching the FCLJ frequency to the fixed FCLK.

As expected, the lowest sensitivity is obtained by using only one PLL. On the other hand, if the jitter is large enough, this configuration is the most efficient in terms of area. In practical cases the configuration with one PLL is not usable, as the number of random samples and their entropy are low because of the low sensitivity S.

Table 6 – 6 Achievable sensitivity to jitter using two clock signals in Actel ProASICplus (FCLI = 40 MHz)

configuration        MAX(∆Tmin)
two PLLs             0.17 ps - 41 ns
one PLL              10.85 ps - 41 ns
two cascaded PLLs    0.084 ps - 41 ns

As a solution one can add a second PLL in a parallel or cascaded configuration. As already mentioned, the parallel configuration has the disadvantage that two clock signals have to be controlled instead of one, as in the cascaded configuration. On the other hand, a disadvantage of the cascaded configuration could be that the tracking jitter is composed of components produced in all the PLLs of the cascade.

The smallest achievable MAX(∆Tmin) is in the worst case comparable to, and in the other cases much smaller than, the size of the jitter (usually around 10-100 ps). We can therefore conclude that, with respect to the theoretical requirements, the proposed method is feasible and suitable for Actel FPGAs.

Experimental Results After the theoretical analysis we proceeded to a practical implementation. The generator was synthesised and programmed into the FPGA using the Actel design tools Libero IDE 7.1.

In the experiments we focused on the configuration with one PLL circuit, as a configuration typical for low-cost FPGAs. In order to increase the sensitivity of the sampler, we added delay elements in front of the bank of sampling gates (for details see [61]). In the Actel ProASICplus the shortest delay inside the chip, around 0.5 ns, is available between the input and output of a NAND gate [5]. The outputs of all delay paths are accumulated during a multiple of periods TQ; afterwards the bits of the accumulator are XORed together and provided as one output bit.

The configuration demonstrating that the TRNG can be implemented in an Actel ProASICplus FPGA using one PLL and a delay line of NAND gates has the following parameters:

• FCLK = FCLI = 40 MHz
• FCLJ = (MCLJ/DCLJ) FCLI = (12/7) · 40 MHz = 68.5714 MHz
• Number of delay elements (NAND gates): 8
• Accumulation period: 17 TQ = 119 periods of CLK

Table 6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICplus

Logic type    Number   Usage
Core Cells    396      4.8%
FIFO Cells    2        6.3%
PLLs          1        50%

The area requirements are summarised in Table 6 – 7. The design also includes the logic for reading the internal signals and the generated sequence by a computer, and can be reduced if required.

The NIST statistical tests were performed on continuous 1-gigabit TRNG output records and followed the testing strategy, general recommendations, and result interpretation described in [97]. We used a set of 1000 1-megabit sequences produced by the TRNG. Most of the tests were passed; however, some were not, e.g. the overlapping template test and some variants of the non-periodic template test. The fact that the generated sequence is in some parameters slightly distinguishable from a truly random stream may signal problems inside the TRNG implementation. On the other hand, the tested sequence is extremely long (a 1-gigabit continuous record), unlike the output streams required for practical applications.

The experimental tests of configurations with two PLLs connected in parallel or in cascade have shown that the condition expressed by Equation 5.6 is a necessary but not a sufficient condition for proper operation of the TRNG. The results confirm the theoretical analysis: the tracking jitter can be sampled and the generator output includes critical random samples. However, to reliably achieve an unbiased and random sequence, the number of critical samples and their probability distribution have to satisfy additional conditions that will be specified later in this chapter.


Using Actel FPGAs as an example, we explained how the basic parameters of the TRNG can be computed and how they relate to the parameters of the target device. Following the presented results, it is possible to implement a TRNG with the required parameters. We can conclude that Actel FPGAs are suitable for implementing a TRNG based on the discussed method, and the achieved parameters are comparable with those obtained on Altera FPGAs.

6.2.4 Stochastic Model of PLL-TRNG 

It is a common requirement that a good TRNG design should be supported by a 

mathematical (more precisely stochastic) model of the source of randomness. A 

reliable model is a necessary requirement for the security evaluation during the 

certification process [37]. On one hand, the model should be as simple as possible, 

but on the other hand, it should also reliably describe a basic behavior of the TRNG. 

In our case, the stochastic model should express the probability that the value on 

the generator output is equal to one as a function of the jitter variation and the 

phase of the CLK and CLJ signals. 

Reordering of the Samples If the sampled values of the signal CLJ are ordered in a proper way, they create an image of the original clock waveform. If we accumulate the ordered samples in KD accumulators during Q periods TQ, we obtain an image of the distribution of the probabilities that the i-th sample is equal to one.

Figure 6 – 5 presents an example of accumulated and reordered samples obtained during Q = 1000 periods TQ for these parameters:

• KM = 212, KD = 207, FCLJ = 81.93 MHz (Figure 6 – 5(a)) and

• KM = 516, KD = 175, FCLJ = 491.43 MHz (Figure 6 – 5(b)).

The variation of the jitter is proportional to the number of points (critical samples) in the rising (or falling) region of the waveforms (two and six in the presented example). Since in (b) FCLJ = 491.43 MHz and the period TCLJ is divided into KD = 175 sampling intervals, the distance between two subsequent samples is about 11.6 ps. The width of the region influenced by the jitter is thus about 69.6 ps. This value corresponds to approximately 3σjit, so σjit ∼ 23.2 ps. Using the same method, we obtain σjit ∼ 29.5 ps from Figure 6 – 5(a). The presented method of jitter measurement is simple enough to be implemented inside a device, so the jitter can be monitored continuously in real time.
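The estimate for case (b) can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only: the function name is hypothetical, and the factor of three between the width of the edge region and σjit is taken from the text as an approximation.

```python
def jitter_from_critical_samples(f_clj_hz, k_d, n_critical, width_in_sigma=3.0):
    """Return (sample spacing, edge-region width, sigma estimate), all in seconds."""
    step = 1.0 / (f_clj_hz * k_d)      # T_CLJ divided into K_D sampling intervals
    width = n_critical * step          # region occupied by the critical samples
    sigma_jit = width / width_in_sigma # width taken as roughly 3 sigma
    return step, width, sigma_jit


# Case (b): F_CLJ = 491.43 MHz, K_D = 175, six critical samples.
step, width, sigma_jit = jitter_from_critical_samples(491.43e6, 175, 6)
print(f"step = {step * 1e12:.1f} ps, width = {width * 1e12:.1f} ps, "
      f"sigma = {sigma_jit * 1e12:.1f} ps")
```

Small differences from the values quoted in the text come from rounding the sample spacing to 11.6 ps before multiplying.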


[Figure: two plots of the mean value of the ordered samples (vertical axis from 0 to 1) against the sample position (horizontal axis from 1 to 204 in (a) and from 1 to 175 in (b)); panels (a) KM/KD = 212/207 and (b) KM/KD = 516/175]

Figure 6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000 periods TQ

On-chip reordering To enable a better analysis of the samples processed in the TRNG, we implemented the following method for on-chip reordering of the samples.

The structure of the ordering logic is illustrated in Figure 6 – 6. Samples coming from the TRNG are continually written into a dual-port memory block organised as 512 1-bit wide words (we usually do not use a KD parameter, which determines the number of samples in one period, larger than 512). The writing address is initialised with each new period TQ, signalled by the signal next_tq. To read the samples in an order in which they recreate the CLJ clock waveform, we need to set a correct reading address. This is done by a LUT implemented as a ROM block. The input of the table is identical with the writing address, and the output of the LUT is used as the reading address for the sample memory. The content of the ROM (the LUT) can easily be generated using Equation 5.12.

The signal sample_ord was assigned to an output pin of the DSP Stratix board and measured by a scope (Tektronix TDS 3052), with the trigger signal next_tq. Figure 6 – 7 presents the measured waveform. The parameters of the TRNG are as follows: MCLK = 13, DCLK = 12, MCLJ = 14, DCLJ = 11, KM = 168, KD = 143, $K_M^{-1} = 103$.
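The content of the reordering table can be sketched as follows. Equation 5.12 is not restated in this section, so the mapping is an assumption derived from the phase arithmetic: sample i is taken at phase (i · KM) mod KD of the CLJ period, hence the sample belonging to waveform position j was written at address (j · KM⁻¹) mod KD. The function name `reorder_lut` is hypothetical.

```python
from math import gcd


def reorder_lut(k_m, k_d):
    """Return read addresses that put the samples in order of CLJ phase.

    Assumption: sample i is taken at phase (i * k_m) mod k_d, so position j
    of the reconstructed waveform is found at address (j * k_m^-1) mod k_d.
    """
    if gcd(k_m, k_d) != 1:
        raise ValueError("KM and KD must be coprime")
    k_m_inv = pow(k_m, -1, k_d)  # modular inverse, Python 3.8 or newer
    return [(j * k_m_inv) % k_d for j in range(k_d)]


# Parameters of the measured configuration: KM = 168, KD = 143
lut = reorder_lut(168, 143)
```

For these parameters the modular inverse evaluates to 103, the value quoted for the measured configuration, which supports the assumed mapping.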

In the region of the edge (in this particular case, the falling edge), the ordered samples do not create an ideal rectangular waveform; instead, several edges can be observed in one period. Samples placed around the position of an ideal edge are taken at different time instants (due to the required reordering of the samples in time). Hence, they may be influenced by a different amount of jitter; in other words, the jitter changes faster than the sampling frequency. This causes more than one change (edge) of the signal. To analyse this phenomenon better, we need to collect samples from the edge region for several hundred subsequent periods.

Figure 6 – 6 Block diagram of design for on-chip samples reordering (blocks: dual-port RAM 512 x 1b with writing and reading port, ROM KD x 9b; signals: next_tq, sample, sample_ord, edge)

Figure 6 – 7 Reordered samples from generator measured by oscilloscope

Stochastic Model The clock signal CLJ is sampled KD times by the other clock signal CLK during one period TQ. The output signal is quasi-periodic¹¹ with the period TQ as long as the condition

$$\mathrm{GCD}(K_M, K_D) = 1 \qquad (6.8)$$

is fulfilled. Samples that are taken in a “stable” part of the CLJ signal (i.e. samples that are not influenced by the jitter) always have a constant value (logical zero or one). They form the dominant part of the set of output samples.

The value of the i-th sample $q_i$ ($0 \le i \le K_D - 1$) can be viewed as a binary random variable $X_i \in \{0, 1\}$. Its mean value $E[X_i]$ is equal to the probability $p_i(X_i = 1)$, which is related to the mean value of the jitter at the corresponding sampling instant. It was shown in [60] that the decimated output signal $x(nT_Q)$ of the TRNG represents a bit-wise addition modulo 2 of KD binary samples $q(\cdot)$ (see also Figure 5 – 6), expressed as

$$x(nT_Q) = q(nT_Q) \oplus q(nT_Q - T_{CLK}) \oplus \ldots \oplus q(nT_Q - (K_D - 1)T_{CLK}) . \qquad (6.9)$$

We denote the number of critical samples $K_D^p$. The critical samples take the value 1 with the probability $p_i \in (0, 1)$, $i = 0, 1, \ldots, K_D^p - 1$. The rest of the KD samples are deterministic. They take the logical value zero or one, and their numbers are denoted $K_D^0$ and $K_D^1$, respectively. The total number of samples in the period TQ can be expressed as the sum of deterministic and critical samples:

$$K_D = K_D^p + K_D^1 + K_D^0 . \qquad (6.10)$$

The generator extracts randomness from $K_D^p$ binary values using a standard XOR corrector. It was assumed in [60] that these values are statistically independent. Using the mathematical background from [42], it is possible to show that the following relation holds between the set of probabilities $p_i$ of the $K_D^p$ independent samples and the mean value $E[p_i]$ at the output of the XOR corrector (the output of the TRNG):

$$E[p_i] = \frac{1}{2} + (-1)^{K_D^1} (-2)^{K_D^p - 1} \prod_{i=0}^{K_D^p - 1} \left( p_i - \frac{1}{2} \right) . \qquad (6.11)$$

11 If the signals are not influenced by the jitter, the output signal of the sampling gate is perfectly 

periodic. If some jitter is present, the subsequent periods are not identical, but differ only in few 

random samples while constant samples form a major part of the waveform. 


Equation 6.11 can be viewed as a stochastic model of the generator, since it allows the probability of the generator output value to be estimated as a function of the mean values of the critical samples (which depend on the jitter characteristics). However, the model is valid only if the critical samples are independent.
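The form of Equation 6.11 can be checked with a short derivation, sketched here under the same independence assumption. For two independent binary samples with $P(A = 1) = a$ and $P(B = 1) = b$:

$$P(A \oplus B = 1) = a(1 - b) + b(1 - a) = \frac{1}{2} - 2\left(a - \frac{1}{2}\right)\left(b - \frac{1}{2}\right) .$$

Writing $\delta = P - \frac{1}{2}$ for the bias, each XOR of two independent samples therefore gives $\delta_{A \oplus B} = -2\,\delta_A \delta_B$. Applying this $K_D^p - 1$ times to the critical samples yields

$$\delta = (-2)^{K_D^p - 1} \prod_{i=0}^{K_D^p - 1} \left( p_i - \frac{1}{2} \right) .$$

A XOR with a constant one maps $\delta$ to $-\delta$, and a XOR with a constant zero leaves it unchanged, so the $K_D^1$ deterministic ones contribute the factor $(-1)^{K_D^1}$.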

The proposed model shows that (as could be expected) the bias of the generator output decreases with an increasing number of critical samples (note that this number is related to the jitter variation). It can be seen that if the mean value of any of these samples is equal to 0.5, the bias of the generator output is equal to zero and does not depend on the remaining samples. Finally, the sign of the bias depends on the number of samples having a mean value equal to one ($K_D^1$).

An advantage of the proposed model is that it can also be used to test the mutual statistical independence of the critical samples. To evaluate the statistical independence, the output mean value and the mean values of the critical samples are measured and the validity of the model expressed in Equation 6.11 is verified. If the test fails, the random variables (critical samples) are mutually dependent.

Model Verification The validity of the model has been tested on real data in order to confirm the model empirically. We have tested the outputs of seven TRNG configurations implemented in Altera Stratix devices. Table 6 – 8 presents the chosen parameters of the tested configurations (KM, KD, FCLK and FCLJ) and the corresponding results: the mean value predicted by the model (E[pi]), the mean value of the generator output (m = E[x(nTQ)]), the number of samples equal to one ($K_D^1$) and the number of critical samples ($K_D^p$).

The mean value of the output bitstream m = E[x(nTQ)] is computed as the arithmetic mean of 512,000 successive bits at the output of the TRNG. The mean of the model E[pi] is calculated using Equation 6.11, employing the probabilities of the critical samples pi accumulated over Q = 1000 periods TQ.
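The evaluation of the model from accumulated counts can be sketched as below. The function names and the example counter values are hypothetical (they are not measured data); the brute-force function is included only to cross-check Equation 6.11 for independent samples on a small example.

```python
from itertools import product
from math import prod


def model_mean(counts, q):
    """Evaluate Equation 6.11 from per-position counts of sampled ones.

    counts -- number of ones accumulated at each of the KD sample positions
    q      -- number of accumulated periods TQ
    Returns (model mean, number of constant ones, number of critical samples).
    """
    p_crit = [c / q for c in counts if 0 < c < q]  # critical samples
    k_ones = sum(1 for c in counts if c == q)      # samples always equal to one
    k_crit = len(p_crit)
    mean = 0.5 + (-1) ** k_ones * (-2.0) ** (k_crit - 1) * prod(
        p - 0.5 for p in p_crit
    )
    return mean, k_ones, k_crit


def exact_mean(counts, q):
    """Brute-force P(XOR of all samples = 1), assuming independent samples."""
    probs = [c / q for c in counts]
    total = 0.0
    for bits in product((0, 1), repeat=len(probs)):
        weight = prod(p if b else 1.0 - p for p, b in zip(probs, bits))
        if sum(bits) % 2 == 1:
            total += weight
    return total


# Hypothetical accumulator contents for KD = 8 and Q = 1000
counts = [0, 0, 1000, 1000, 1000, 300, 800, 0]
mean, k_ones, k_crit = model_mean(counts, 1000)
```

The brute-force check enumerates all 2^KD sample combinations, so it is practical only for very small KD; it is not part of the measurement procedure.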

As can be seen, the model is very precise for a small number of critical samples, since both mean values are very similar. For a higher number of critical samples, the mean value tends to the ideal value 0.5. Note that the model provides correct information about the statistical deviation of the output bitstream in configurations 1, 2 and 5. The model gives acceptable results, corresponding closely to the mean value of the generated sequence, in configurations 6 and 7. It should be noted that in configurations 3 and 4 the model outputs do not agree with the generator outputs, most probably because of statistical dependence between critical samples.

Table 6 – 8 Mean values measured using the stochastic model E[pi] and the output sequence of the TRNG m = E[x(nTQ)]

#  KM   KD   FCLK (MHz)  FCLJ (MHz)  E[pi]  m      K_D^1  K_D^p
1  144  119  113.33      137.14      0.846  0.829  61     2
2  144  175  166.66      139.14      0.717  0.729  89     3
3  486  119   75.55      139.14      0.501  0.553  55     10
4  486  161  102.22      139.14      0.507  0.524  74     13
5  250  203  232         285.71      0.489  0.526  95     16
6  270  203  232         308.57      0.5    0.496  96     16
7  486  217  137.77      308.57      0.499  0.496  99     22

6.3 Active Non-Invasive Attack on TRNG 

To obtain results of a real-life attack, we have executed an active non-invasive attack on an FPGA implementation of the TRNG [60]. Namely, we have tried to force some bias onto the output of the generator by changing the working temperature of the FPGA chip. Our aim is to find out what kind of changes in the parameters of the generated sequence can be observed. Moreover, we record the internal signals of the generator and evaluate the influence of temperature on them.

Similar experiments have been described in [98], where the PLL-TRNG was evaluated as problematic, with varying quality of the generated bit sequence. Based on the results obtained from the attack, we provide additional requirements for the PLL-TRNG design and explain why the configuration chosen by Santoro et al. [98] had problems passing the statistical tests.

6.3.1 Attack description 

The temperature of the FPGA was decreased by application of a freezing spray. The lowest achieved temperature was −40 °C. As the FPGA chip produces some heat, it warmed itself up to +30 °C. During the measurements we tried to keep the temperature close to the selected value. The temperature of the chip was measured by a simple contact thermometer.

Two similar configurations of the TRNG were chosen as objects under attack. In both cases we used the Altera Stratix DSP board with an EP1S25 device [18]. The following parameters were chosen or given by the board: FCLI = 80 MHz, MCLK = 31, DCLK = 10, MCLJ = 36, DCLJ = 7. Then FCLK = 248 MHz, FCLJ = 411 MHz, and KM/KD = 360/217. To allow a comparison of the TRNG behaviour for two settings, we chose the following configurations, which differ in the bandwidth of the loop filter:

• Configuration A has the filter bandwidth set automatically by the synthesising tool (Altera Quartus).
• Configuration B has the filter bandwidth set to the preset value low.

The lower the bandwidth, the better the input jitter rejection that can be achieved, at the price of a longer locking time of the PLL. The synthesising tool chooses the optimal bandwidth for the selected signal frequencies, achieving an acceptable locking time and level of input jitter filtering. By setting the bandwidth to a low value, we ensure that jitter from sources outside the PLL is filtered out, so that we can observe the jitter originating inside the PLL.
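The frequencies and the ratio KM/KD quoted for both configurations can be retraced with a small sketch. It assumes that both clocks are derived from the same reference FCLI as FCLK = FCLI · MCLK/DCLK and FCLJ = FCLI · MCLJ/DCLJ, and that KM/KD is the ratio FCLJ/FCLK in lowest terms; the function name is hypothetical.

```python
from fractions import Fraction


def pll_trng_parameters(f_cli_mhz, m_clk, d_clk, m_clj, d_clj):
    """Return (FCLK, FCLJ, KM, KD) for the given PLL multipliers and dividers.

    Assumes FCLK = FCLI * MCLK / DCLK and FCLJ = FCLI * MCLJ / DCLJ.
    Fraction keeps the ratio exact and reduces it to lowest terms, so the
    returned KM and KD satisfy GCD(KM, KD) = 1.
    """
    f_clk = Fraction(f_cli_mhz) * m_clk / d_clk
    f_clj = Fraction(f_cli_mhz) * m_clj / d_clj
    ratio = f_clj / f_clk  # equals KM / KD
    return f_clk, f_clj, ratio.numerator, ratio.denominator


# Parameters of the attacked configurations
f_clk, f_clj, k_m, k_d = pll_trng_parameters(80, 31, 10, 36, 7)
print(float(f_clk), round(float(f_clj), 2), k_m, k_d)
```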

6.3.2 Measurements results 

To evaluate the TRNG behaviour under changing temperature, we collected for each temperature value the random bit sequence from the output of the generator, as well as the internal signal values, which provide information on the number of random samples influenced by jitter.

By reordering the samples it is possible to reconstruct the waveform of the sampled clock signal and track the changes of the sample probabilities. The waveforms sampled by the generator are depicted in Figures 6 – 8 and 6 – 9. For each sample, the number of ones is counted during one thousand TQ periods. The samples in stable regions end up with 0 or 1000 sampled ones. The samples in the edge areas (rising and falling edge), influenced by jitter, reach values between these boundaries.

In the ideal case we suppose that the position of the sampling edge is stable and that what changes is the position of the edge in the sampled clock signal. The logical value at the moment of sampling is influenced by an additive jitter. Analysing the sampled values allows us to describe the behaviour of the generator and the impact of temperature on the jitter parameters.

From the charts we can see that the position of the critical samples does not change across the range of temperatures for both configurations. Configuration A includes fewer critical samples than configuration B, which implies a lower σ² of the jitter.

Figure 6 – 8 Sampled waveform of a clock signal for TRNG for configuration A for temperatures in the range −40 °C to +30 °C.

The random sequences were tested by the simple statistical tests defined in the FIPS standard [57]. The test suite can reveal a bias or an unbalanced distribution of zeros and ones in the generated sequence by applying four basic tests (monobit test, poker test, runs test and long runs test). If at least one test from the set was not passed, the result is denoted as FAILED; otherwise it is marked OK.
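Three of the four tests can be sketched as follows for a block of 20,000 bits. The acceptance bounds are those of FIPS 140-2 and are an assumption here, since the revision of the standard behind [57] is not restated in this section; the runs test is omitted and the function names are illustrative.

```python
BLOCK_BITS = 20000  # block length used by the FIPS statistical tests


def _check_length(bits):
    if len(bits) != BLOCK_BITS:
        raise ValueError("the tests are defined for blocks of 20,000 bits")


def monobit_test(bits, low=9725, high=10275):
    """Pass if the number of ones lies strictly between the bounds."""
    _check_length(bits)
    return low < sum(bits) < high


def poker_test(bits, low=2.16, high=46.17):
    """Pass if the statistic over 5000 four-bit segments lies between the bounds."""
    _check_length(bits)
    freq = [0] * 16
    for i in range(0, BLOCK_BITS, 4):
        nibble = (bits[i] << 3) | (bits[i + 1] << 2) | (bits[i + 2] << 1) | bits[i + 3]
        freq[nibble] += 1
    x = 16.0 / 5000.0 * sum(f * f for f in freq) - 5000.0
    return low < x < high


def long_run_test(bits, limit=26):
    """Pass if no run of equal bits reaches `limit` (assumed FIPS 140-2 value)."""
    _check_length(bits)
    longest = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest < limit


# A strictly alternating sequence is balanced but far from random
alternating = [i % 2 for i in range(BLOCK_BITS)]
results = (monobit_test(alternating), poker_test(alternating), long_run_test(alternating))
```

The alternating sequence illustrates why several tests are combined: it is perfectly balanced, yet every four-bit segment is identical, which the poker test detects.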

Table 6 – 9 summarises the results of the statistical tests at different FPGA chip temperatures. It can be seen that while configuration A produced sequences at some temperatures that did not pass the statistical tests, configuration B is reliable in the whole range of temperatures.

The columns with the number of critical samples show the number of samples influenced by jitter. It can be observed that in the case of configuration B, where we set a low bandwidth of the loop filter, the number of influenced samples is significantly higher.

We further investigate the number and position of critical samples for both configurations as a function of the chip temperature. The samples with probability around 0.5 have a crucial impact on the statistical parameters of the generated sequence. If we eliminate the almost constant samples, with fewer than 100 values influenced by jitter during 1000 periods, there are 4-6 and 12-13 highly critical samples per edge for configurations A and B, respectively.

Figures 6 – 10 and 6 – 11 show in detail the area of the rising edge of the sampled waveform. We can observe how the number of sampled ones during the measuring period changes with the chip temperature. Configuration A typically shows a large spread of the counts for a fixed sample position. In configuration B the subsequent samples have very similar counts of sampled ones, and the overall waveform looks more stable.

Figure 6 – 9 Sampled waveform of a clock signal for TRNG for configuration B for temperatures in the range −40 °C to +32 °C.

To better visualise the changes in the sampled signals as a function of temperature, we provide Figures 6 – 12 and 6 – 13, which show in detail the dynamics of the counts of sampled ones for the most critical samples.

In configuration A we can observe a significant change in the number of sampled ones when the chip temperature changes. For example, at position number 84 the difference in the number of ones sampled at the minimal and maximal temperature is more than 500. This fact, as well as the low number of critical samples, causes instability of the generator in a changing working environment.

Although the jitter is present over the whole range of temperatures (the number of critical samples does not change), the bias of the samples changes visibly and influences the statistical parameters of the generated sequence. At a moment when all samples are strongly biased (the case of temperatures between 20 and 30 °C), the output sequence is also biased and does not pass the statistical test suite.

Configuration B is more stable under changing chip temperature, and the density of samples with equal probability of sampling zero and one is much higher than in case A. Thanks to that, the statistical parameters of the generated sequence stay acceptable and the sequence passes all required statistical tests. The bias of particular samples is compensated by other samples in the critical area, and the final sequence is kept unbiased.

Table 6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples influenced by the jitter at different chip temperatures

Temperature  Conf A                          Conf B
(°C)         FIPS tests  Critical samples #  FIPS tests  Critical samples #
-40          OK          26                  OK          64
-30          OK          26                  OK          66
-20          FAILED      25                  OK          64
-10          OK          24                  OK          62
0            OK          24                  OK          63
+10          OK          24                  OK          68
+20          FAILED      22                  OK          61
+30          FAILED      25                  OK          60

Observing Jitter From the observations described above we can conclude that the variance (σ²) of the jitter in the sampled signal does not change. The size of the variance can be observed as the number of critical samples, which remains almost constant in the whole range of tested temperatures. The presence of jitter is a fundamental condition for the proper function of the generator. Therefore, a well-suited startup test for this kind of generator should include a test of the presence of critical samples.

The on-chip implementation of this test needs to include a memory block and counters which sum up, for each edge position of the sampling signal, the number of sampled ones. The edge positions with a counter value different from 0 and from the number of TQ periods signal the presence of jitter. The number of critical samples must be higher than zero, but a low number of samples cannot be accepted either. From the empirical experiments described above we can conclude that configurations with more than 10 highly critical samples per edge behave reliably even in a changing environment.
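The decision logic of such a startup test can be sketched in software as follows. The function name, the `margin` of 100 counts (taken from the definition of almost constant samples over 1000 periods) and the simplification of comparing the total count against the per-edge threshold times two edges are assumptions of this sketch, not a description of an implemented circuit.

```python
def critical_sample_test(counts, q, margin=100, min_per_edge=10, edges=2):
    """Startup-test sketch based on per-position counters of sampled ones.

    counts       -- counter value for each sample position after q periods TQ
    margin       -- counters within `margin` of 0 or q count as almost constant
    min_per_edge -- required number of highly critical samples per edge
    edges        -- edges of the CLJ waveform per period (rising and falling)
    Returns (jitter_present, robust, critical, highly_critical).
    """
    critical = sum(1 for c in counts if 0 < c < q)
    highly_critical = sum(1 for c in counts if margin <= c <= q - margin)
    jitter_present = critical > 0
    # Simplification: both edges are assumed to hold a similar number of samples
    robust = highly_critical > min_per_edge * edges
    return jitter_present, robust, critical, highly_critical


# Hypothetical counters: two edges with 11 strongly jittered positions each
counts = [0] * 50 + [500] * 11 + [1000] * 50 + [500] * 11
status = critical_sample_test(counts, 1000)
```

A per-edge evaluation would additionally need to locate the rising and the falling region in the reordered waveform; the sketch leaves this out.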

Continuous monitoring of the number of critical samples makes it possible to implement an effective online test for the discussed category of PLL-based generators. Each significant change either in the position or in the probability value of the critical samples may have an impact on the parameters of the generated sequence and therefore should raise an alarm signal inside the TRNG.

Figure 6 – 10 Number of sampled ones during 1000 sampling periods for TRNG with configuration A (detail of the rising edge).

From the measured data it is possible to estimate the jitter parameters and draw the probability histograms. In Figure 6 – 14 we compare the histograms of the TRNG working in configurations A and B, which differ in the loop filter bandwidth. In both cases the jitter has a normal (Gaussian) distribution. As can be observed, configuration A exhibits jitter with a lower deviation, while the jitter in configuration B has an almost three times higher value.
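How the histograms were derived from the counters is not detailed here. One plausible approach, sketched below purely as an assumption, treats the probabilities of sampling one across the rising edge as points of the cumulative distribution of the jitter, so that differences between neighbouring positions approximate the probability mass per sampling interval. The function names and the synthetic edge are hypothetical.

```python
from math import erf, sqrt


def jitter_histogram(p_edge):
    """Probability mass per sampling interval across a rising edge.

    p_edge -- probabilities of sampling one at consecutive reordered positions,
              running from the stable-zero side to the stable-one side
    """
    return [b - a for a, b in zip(p_edge, p_edge[1:])]


def histogram_sigma(hist, step):
    """Standard deviation of the histogram; `step` is the sample spacing."""
    total = sum(hist)
    centres = [(i + 0.5) * step for i in range(len(hist))]
    mean = sum(c * h for c, h in zip(centres, hist)) / total
    var = sum(h * (c - mean) ** 2 for c, h in zip(centres, hist)) / total
    return sqrt(var)


# Synthetic edge (not measured data): Gaussian jitter, sigma = 3 sampling steps
p_edge = [0.5 * (1.0 + erf((i - 20) / (3.0 * sqrt(2.0)))) for i in range(41)]
hist = jitter_histogram(p_edge)
sigma = histogram_sigma(hist, step=1.0)
```

With measured counters the probabilities are not necessarily monotonic across the edge, so individual differences can be negative and would need smoothing or averaging over more periods.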

What we find crucial in our measurements is the observation of the jitter parameters with changing temperature. The jitter in the PLL circuitry changes when the chip is frozen, which can be observed as a change in the number of sampled ones at the positions of the critical samples. In Figure 6 – 15 we depict the difference in those numbers between the boundary temperatures −40 °C and +30 °C in both configurations. The difference has a normal (Gaussian) distribution, as does the previously discussed jitter at room temperature. The standard deviation of the additional jitter is identical to its value for measurements at stable temperature. As a result we can conclude that changing the chip temperature changes the amplitude of the jitter, too.

In the case of the PLL-TRNG, the bigger the changes of the jitter amplitude, the bigger the changes in the histogram of the jitter, and this has a direct impact on the statistical properties of the generated sequence. In configuration A the changes in the amplitude of the jitter are significant, as are the differences in probability values between particular samples. Configuration B is characterised by smaller changes of the jitter amplitude, which are in addition flatter. In such a case the probability changes uniformly for the most critical samples and does not have any unwanted impact on the generated random sequence. This higher level of robustness was observed in configuration B and confirmed by the positive outcome of all statistical tests.

Figure 6 – 11 Number of sampled ones during 1000 sampling periods for TRNG with configuration B, with low-pass loop filter (detail of the rising edge).

From the obtained results and suggestions for PLL-TRNG design we can conclude that the design tested in [98] with parameters KM/KD = 270/203 is not suitable for use under changing temperature. From Table 6 – 8 we get the number of critical samples, which is 22, i.e. 11 per edge. As we proposed above, more important is the number of highly critical samples, which should be more than 10. This condition is not met in this configuration, and the generator behaves similarly to configuration A in our experiments during the simulated attack on the PLL-TRNG.

6.4 Conclusions and Further Research 

This chapter provided an analysis of the PLL-based TRNG. We focused on implementation aspects, on the relations between the target FPGA platform and its PLL circuitry, and on the achievable technical parameters of the generator in devices from the vendors Actel and Altera. In the second part of the chapter we presented our proposal for a stochastic model of the TRNG and proposed additional steps in the PLL-TRNG design in order to achieve robustness of the generator in a changing environment.

By theoretical and practical analysis we concluded that PLL circuitry is more suitable for the discussed TRNG implementation than DLL circuitry. The parameters of the PLL circuitry available in FPGAs present on the market are satisfactory for a reliable implementation. We showed the steps for a theoretical analysis of the PLL parameters with an estimation of the jitter and TRNG parameters, which were later confirmed by empirical measurements.

Figure 6 – 12 Number of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration A.

Two practical implementations in Altera and Actel families of FPGAs showed which criteria are important in the design. The achieved final speed of the generator in the Altera Stratix device is more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. Thanks to the analysis of the available PLL configurations and their parameters, we presented a generator without the additional delaying logic applied in the original proposal [60]. The use of a simpler sampling part of the generator is possible thanks to the wider range of dividers of the PLL circuits in the Stratix FPGA family.

We presented the most compact solution, with one PLL circuit and a chain of delay elements, implemented in an Actel ProASICplus device. The results of statistical tests for a very long record of generated data confirm a high level of randomness, with few tests failed. We can conclude that it was confirmed both theoretically and practically that the PLL-TRNG is suitable for fully embedded implementation in low-cost FPGAs and provides a reliable source of truly random values even when only a small number of PLLs with a limited range of frequency dividers is available.

The proposed stochastic model of the generator makes it possible to test the mutual statistical independence of the critical samples. The model was confirmed empirically and is valid for a small number of critical samples; for a higher number the model is less precise. In order to achieve a better adjusted model, we propose for future research to monitor and analyse the bit sequence at the output of the sampling gate, before the XOR operation. This kind of measurement may uncover a possible dependency between the samples.

In the last part of the chapter we presented the results of experiments with changing chip temperature for the PLL-TRNG. As a result we propose additional requirements for the generator design that need to be met in order to achieve robustness of the design. We can conclude that configurations with more than 10 highly critical samples per edge behave reliably even in a changing environment. The bigger the changes of the jitter amplitude, the bigger the changes in the histogram of the jitter, and this has a direct negative impact on the statistical properties of the generated sequence.

Figure 6 – 13 Number of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration B.

Figure 6 – 14 Comparison of probability histograms of the jitter measured at a temperature of 20 °C in TRNG with configurations A and B. The data were measured around the rising edge of the sampled clock waveform.

Figure 6 – 15 Difference in the number of sampled ones for critical samples between the boundary temperatures −40 °C and +30 °C in TRNG with configurations A and B around the rising edge of the sampled clock waveform.

7 Research Contribution 

With this thesis we contributed to the field of hardware implementation of elements of public-key cryptographic systems. We discussed the aspects of algorithm adaptations and system architectures for a modular multiplier and cryptanalytic hardware. A randomness extraction method based on clock circuitry was evaluated and new findings were presented.

The research contributions were achieved in the following topics:

• Optimised Montgomery modular multiplier implementation in hardware. 

• The elliptic curve method implementation in hardware. 

• Evaluation of random number generator based on clock circuitry in FPGAs. 

Optimised Montgomery modular multiplier implementation in hardware The two most popular public-key cryptographic algorithms, RSA and ECC, make extensive use of modular operations with large numbers. The MM can be a very slow operation when performed on general-purpose computers and can therefore benefit from acceleration by an effective hardware implementation.

We analysed algorithms for Montgomery MM and architectures for their effective implementation suitable for reconfigurable hardware structures. We paid attention to keeping the scalability and parametrisation of the multiplier unit also in the other parts of the system and to finding an optimal model for the division of the computational load between the software and hardware parts of the system. The results of area occupation and timing analysis were presented after application of hardware-software co-design.

The elliptic curve method implementation in hardware The security of the most widely applied public-key cryptographic algorithm, RSA, depends on the hardness of factoring large numbers. In the currently best known method for factoring large integers, the GNFS, one important step is the factorisation of mid-sized integers, for which the ECM is an efficient algorithm.

The ECM algorithm is a classical example of an algorithm that can be significantly accelerated by special-purpose hardware. We provided a detailed description of an efficient ECM architecture, especially suited for hardware implementation. The modular multiplier obtained as a result of our research described in the previous point is a core element of the ECM unit and allows fast prototyping. For proof-of-concept purposes, we chose an architecture with an embedded controller and a dedicated coprocessor designed by software-hardware co-design on an FPGA. We presented the area requirements of the system and timings of the first published real hardware implementation.

Evaluation of random number generator based on clock circuitry Random values play a crucial role in several areas of science. Depending on the field of application, the requirements on the parameters of the random sequence and on the generator itself may vary.

We enhanced the already published results on the generator invented in [60]. We focused on the analysis of the generator under changing working conditions and configuration settings. We presented the most compact solution, with one PLL circuit and a chain of delay elements, implemented in a low-cost FPGA. In another design, we focused on achieving a high bitrate of the generated sequence; the achieved final speed of the generator was more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. We summarised the results of our experiments with changing chip temperature and proposed additional requirements for the generator design that need to be met in order to achieve robustness of the design.


Curriculum vitae 

Professional experience 

• self-employed, Electronic Documents Laboratory – Team Leader (August 2008 

– now). 

Projects related to PKI, biometrics and cryptography for Polish Security Print- 

ing Works (PWPW S.A.), Warsaw, Poland. System design and analysis, 

preparation of proof-of-concept systems. 

• Sentivision Polska, Warsaw, Poland, Senior Software Engineer (October 2006 

– July 2008). 

Expert on Digital Rights Management implementations in embedded platforms for IPTV and VoD systems, cryptography related applications and features. End-to-end implementation of the Marlin IPTV-ES DRM system in C, including server and client side. Technical project leader: close cooperation with the project manager, contact with customers, consulting and on-site support, maintenance and release of software.

Stages abroad 

• Three months research stage in COSIC group at Katholieke Universiteit Leu- 

ven, Belgium – Involved in the FP6 project “SCA Resistant Design”, analysis 

of side-channel attacks (2006) 

• Four months research stage at Laboratoire Traitement du Signal et Instrumen- 

tation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint- 

Etienne, France – Analysis of TRNG embedded in Altera FPGAs, stochastic 

model of the generator (2005) 

• Four months research stage at Communication Security group, Ruhr-Universität 

Bochum, Germany – Optimisation and implementation of ECM for factorisa- 

tion on Xilinx FPGA (2004) 

• Four months stage as an Erasmus student at the Higher Institute for Advanced 

Technologies of Saint-Etienne (ISTASE), Université Jean Monnet, Saint-Etienne, 

France – Implementation of scalable MM, work with Altera Nios processor 

(2002) 


References 

[1] Actel Corporation. ProASICplus Evaluation Board, User’s guide, 2002. 

[2] Actel Corporation. Axcelerator Family PLL and Clock Management, Ap- 

plication Note, June 2003. 

[3] Actel Corporation. Using ProASICplus Clock Conditioning Circuits, Ap- 

plication Note, Dec. 2004. 

[4] Actel Corporation. ProASIC3(E) Flash Family FPGAs, Datasheet, Jan. 

2005. 

[5] Actel Corporation. ProASICplus Flash Family FPGAs, ver. 5.3, May 

2006. 

[6] Altera Corporation. Metastability in Altera Devices ver.4.0, May 1999. 

[7] Altera Corporation. ACEX 1K Programmable Logic Device Family, Data 

Sheet, Sept. 2001. ver. 3.3. 

[8] Altera Corporation. APEX 20K Programmable Logic Device Family, 

Data Sheet, Feb. 2002. ver. 4.3. 

[9] Altera Corporation. Avalon Bus Specification, Reference Manual, Jan. 

2002. ver. 2.0. 

[10] Altera Corporation. Nios Embedded Processor Development Board 

ver.2.1, Apr. 2002. 

[11] Altera Corporation. Using PLLs in Stratix Devices, Feb. 2002. ver. 1.0. 

[12] Altera Corporation. Cyclone Device Handbook, Using PLLs in Cyclone 

Devices, Oct. 2003. ver. 1.2. 

[13] Altera Corporation. Cyclone Programmable Logic Device Family, Data 

Sheet, Mar. 2003. ver. 1.1. 

[14] Altera Corporation. Using the ClockLock & ClockBoost PLL Features in 

APEX Devices, Nov. 2003. Application Note 115, ver. 2.6. 

[15] Altera Corporation. Stratix Device Handbook, General-Purpose PLLs in 

Stratix & Stratix GX Devices, Sept. 2004. ver. 3.1. 


[16] Altera Corporation. Stratix EP1S25 DSP Development Board, Dec. 2004. 

ver. 1.6. 

[17] Altera Corporation. Cyclone II Device Handbook, PLLs in Cyclone II 

Devices, Feb. 2005. ver. 1.2. 

[18] Altera Corporation. Stratix Device Handbook, July 2005. ver. 3.4. 

[19] Altera Corporation. Stratix II Device Handbook, PLLs in Stratix II De- 

vices, Mar. 2005. ver. 2.2. 

[20] Altera Corporation. Stratix II Device Handbook, Volume 2, Chapter 2, 

TriMatrix Embedded Memory Blocks in Stratix II & Stratix II GX Devices, 

Apr. 2006. ver. 4.2. 

[21] Altera Corporation. Stratix II Device Handbook, Volume 1, Chapter 5, 

DC & Switching Characteristics, May 2007. ver. 4.3. 

[22] AMI Semiconductors Company. XpressArray High Density 0.18 um 

Structured ASIC. 

[23] ARM Limited. ARM7TDMI (Rev 3) — Technical Reference Manual. Avail- 

able at 

http://www.arm.com/pdfs/DDI0029G_7TDMI_R3_trm.pdf, 2001. 

[24] Bagini, V., and Bucci, M. A design of reliable true random number gener- 

ator for cryptographic applications. In Cryptographic Hardware and Embedded 

Systems – CHES’99 (Berlin, Germany, Aug. 1999), Ç. K. Koç and C. Paar, 

Eds., no. 1717 in Lecture Notes in Computer Science, Springer-Verlag, pp. 204– 

218. 

[25] Barrett, P. Implementing the Rivest, Shamir and Adleman public-key encryption

algorithm on a standard digital signal processor. In Proceedings of

CRYPTO’86 (1986), vol. 263 of Lecture Notes in Computer Science, pp. 311– 

323. 

[26] Baudet, M., Lubicz, D., Micolod, J., and Tassiaux, A. On the secu- 

rity of oscillator-based random number generators. Cryptology ePrint Archive, 

Report 2009/299, 2009. http://eprint.iacr.org/. 

[27] Bernstein, D. Circuits for Integer Factorization: A Proposal. Manuscript. 

Available at http://cr.yp.to/papers.html#nfscircuit, 2001. 

[28] Blum, L., Blum, M., and Shub, M. A simple unpredictable pseudo- 

random number generator. SIAM Journal on Computing 15 (1986), 364–383. 

[29] Blum, T., and Paar, C. Montgomery modular exponentiation on reconfig- 

urable hardware. In Proceedings of the 14th IEEE Symposium on Computer 

Arithmetic (Adelaide, Australia) (Los Alamitos, CA, April 1999), Koren and 

Kornerup, Eds., IEEE Computer Society Press, pp. 70–77. 

[30] Blum, T., and Paar, C. High radix Montgomery modular exponentiation

on reconfigurable hardware. IEEE Transactions on Computers 50, 7 (2001),

759–764. 

[31] Bock, H., Bucci, M., and Luzzi, R. An offset-compensated oscillator- 

based random bit source for security applications. In Cryptographic Hardware 

and Embedded Systems – CHES 2004 (Berlin, Germany, 2004), M. Joye and J.- 

J. Quisquater, Eds., no. 3156 in Lecture Notes in Computer Science, Springer- 

Verlag, pp. 268–281. 

[32] Bosma, W. Primality testing using elliptic curves. Tech. Rep. 85-12, Mathematisch

Instituut, Universiteit van Amsterdam, 1985.

[33] Brent, R. P. Some Integer Factorization Algorithms Using Elliptic Curves. 

In Australian Computer Science Communications 8 (1986), pp. 149–163. 

[34] Brent, R. P. Factorization of the tenth Fermat number. Mathematics of 

Computation 68, 225 (1999), 429–451. 

[35] Brown, M., Hankerson, D., López, J., and Menezes, A. Software 

Implementation of the NIST Elliptic Curves Over Prime Fields. In Top- 

ics in Cryptology — CT-RSA 2001 (Berlin, April 2001), D. Naccache, Ed., 

vol. LNCS 2020, Springer-Verlag, pp. 250–265. 

[36] Bucci, M., and Luzzi, R. Design of testable random bit generators. In 

Cryptographic Hardware and Embedded Systems – CHES 2005 (Berlin, Ger- 

many, 2005), J. Rao and B. Sunar, Eds., no. 3659 in Lecture Notes in Computer 

Science, Springer-Verlag, pp. 147–156. 

[37] Bundesamt für Sicherheit in der Informationstechnik – BSI. Application

Notes and Interpretation of the scheme (AIS), AIS 31, Functionality

Classes and Evaluation Methodology for Physical Random Number Generators,

Sept. 2001.

[38] Ç. K. Koç. RSA hardware implementation. Tech. rep., RSA Laboratories,

RSA Data Security, Inc., Aug. 1995. 

[39] Ç. K. Koç, Acar, T., and Kaliski, Jr., B. S. Analyzing and comparing 

Montgomery multiplication algorithms. IEEE Micro 16, 3 (June 1996), 26–33. 

[40] Chaitin, G. J. Algorithmic Information Theory. Cambridge University 

Press, 1987. 

[41] Daly, A., and Marnane, W. Efficient architectures for implementing Montgomery

modular multiplication and RSA modular exponentiation on reconfigurable

logic. In Proceedings of the 2002 ACM/SIGDA tenth international

symposium on Field-programmable gate arrays FPGA’02 (Monterey, Califor- 

nia, USA, Feb. 2002). 

[42] Davies, R. B. Exclusive OR (XOR) and hardware random number genera- 

tors. Tech. rep., 2002. 

[43] Dichtl, M. How to Predict the Output of a Hardware Random Number 

Generator. In Workshop on Cryptographic Hardware and Embedded Systems 

– CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. K. Koç, 

and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, Springer- 

Verlag, pp. 181–188. 

[44] Dichtl, M., and Golić, J. D. High-speed true random number genera- 

tion with logic gates only. In CHES ’07: Proceedings of the 9th international 

workshop on Cryptographic Hardware and Embedded Systems (Berlin, Heidel- 

berg, 2007), vol. 4727 of Lecture Notes in Computer Science, Springer-Verlag, 

pp. 45–62. 

[45] Dixon, B., and Lenstra, A. Massively parallel elliptic curve factoring. In 

Advances in Cryptology - Eurocrypt ’92 (1993), R. Rueppel, Ed., vol. 658 of 

LNCS, Springer, Berlin, pp. 183–193. 

[46] Drutarovský, M., Fischer, V., and Šimka, M. Comparison of Two

Implementations of Scalable Montgomery Coprocessor Embedded in Reconfig- 

urable Hardware. In Proceedings of the XIX Conference on Design of Circuits 

and Integrated Systems – DCIS 2004 (Bordeaux, France, Nov. 24–26, 2004), 

pp. 240–245. 

[47] Drutarovský, M., Fischer, V., Šimka, M., and Celle, F. A Simple

PLL-based True Random Number Generator for Embedded Digital Systems. 

Computing and Informatics 23, 5 (2004), 501–515. 

[48] Drutarovský, M., and Šimka, M. Cryptographic True Random Number

Generator for Embedded Nios Processor. In Proceedings of 13th International 

Czech-Slovak Scientific Conference Radioelektronika (Brno, Czech Republic, 

May 6–7, 2003), pp. 268–371. 

[49] Drutarovský, M., and Šimka, M. Custom FPGA Cryptographic Blocks

for Reconfigurable Embedded NIOS Processor. Acta Electrotechnica et Infor- 

matica 4, 2 (2004), 33–39. 

[50] Drutarovský, M., Šimka, M., and Fischer, V. Comparison of Scalable

Montgomery Modular Multiplications Embedded in Reconfigurable Hardware. 

Acta Electrotechnica et Informatica 6, 2 (2006), 37–45. 

[51] Eldridge, S. E., and Walter, C. D. Hardware implementation of Mont- 

gomery’s modular multiplication algorithm. IEEE Trans. Comput. 42, 6 

(1993), 693–699. 

[52] Epstein, M., Hars, L., Krasinski, R., Rosner, M., and Zheng, H. 

Design and implementation of a true random number generator based on dig- 

ital circuit artifacts. In Workshop on Cryptographic Hardware and Embedded 

Systems – CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. 

K. Koç, and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, 

Springer-Verlag, pp. 152–165. 

[53] Fairfield, R. C., Mortenson, R. L., and Coulthart, K. B. An LSI 

random number generator (RNG). In Proceedings of CRYPTO 84 on Advances 

in cryptology (1985), Springer-Verlag New York, Inc., pp. 203–230. 

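[54] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Data Encryption Standard, Jan. 1977. NIST FIPS PUB 46.

[55] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Data Encryption Standard, Oct. 1999. NIST FIPS PUB 46-3.

[56] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Specification for the Digital Signature Standard, Jan. 2000. NIST FIPS PUB

186-2.

[57] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Security Requirements for Cryptographic Modules, May 2001. NIST FIPS PUB

140-2.

[58] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Specification for the Advanced Encryption Standard (AES), 2001. NIST FIPS

PUB 197.

[59] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Specification for the Secure Hash Standard, Aug. 2002. NIST FIPS PUB 180-2

+ change notice to include SHA-224.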

[60] Fischer, V., and Drutarovský, M. True random number generator em- 

bedded in reconfigurable hardware. In Workshop on Cryptographic Hardware 

and Embedded Systems – CHES 2002 (Berlin, Germany, Aug.13–15, 2002), 

B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523 of Lecture Notes in 

Computer Science, Springer-Verlag, pp. 415–430. 

[61] Fischer, V., Drutarovský, M., Šimka, M., and Bochard, N. High

Performance True Random Number Generator in Altera Stratix FPLDs. In

Field-Programmable Logic and Applications – FPL 2004 (Leuven, Belgium,

Aug. 2004), J. Becker, M. Platzner, and S. Vernalde, Eds., vol. 3203 of Lecture 

Notes in Computer Science, Springer-Verlag, pp. 555–564. 

[62] Fischer, V., Drutarovský, M., Šimka, M., and Celle, F. Simple

PLL-based True Random Number Generator for Embedded Digital Systems. 

In Proceedings of IEEE Design and Diagnostics of Electronic Circuits and 

Systems Workshop – DDECS 2004 (Stará Lesná, Slovakia, Apr. 18–21, 2004), 

pp. 129–136. 

[63] Franke, J., and Kleinjung, T. E-mail announcement. 

http://www.crypto-world.com/announcements/rsa200.txt, May 2005. 

[64] Franke, J., Kleinjung, T., Paar, C., Pelzl, J., Priplata, C., and 

Stahlke, C. SHARK — A Realizable Special Hardware Sieving Device 

for Factoring 1024-bit Integers. In Workshop on Cryptographic Hardware and 

Embedded Systems — CHES 2005, Edinburgh (August 2005), LNCS, Springer. 

To appear. 

[65] Franke, J., Kleinjung, T., Paar, C., Pelzl, J., Priplata, C., Šimka,

M., and Stahlke, C. An efficient hardware architecture for factoring integers

with the Elliptic Curve Method. In 1st Workshop on Special-purpose Hardware 

for Attacking Cryptographic Systems – SHARCS 2005 (Paris, France, Feb. 24– 

25, 2005), pp. 51–62. 

[66] Frolek, V. Implementation of asymmetric encryption algorithms in reconfigurable

circuits. Master’s thesis, Technical University of Košice, Department

of Electronics and Multimedia Communications, Jan.-May 2002. 

[67] Gaj, K., Kwon, S., Baier, P., Kohlbrenner, P., Le, H., Khaleelud- 

din, M., and Bachimanchi, R. Implementing the elliptic curve method of 

factoring in reconfigurable hardware. In Workshop on Special-purpose Hard- 

ware for Attacking Cryptographic Systems – SHARCS 2006 (Cologne, Ger- 

many, Apr. 03–04, 2006). 

[68] Gennaro, R. Randomness in cryptography. IEEE Security and Privacy 4, 

2 (2006), 64–67. 

[69] Goldberg, I., and Wagner, D. Randomness and the Netscape browser. 

Dr. Dobb’s Journal (Jan. 1996), 66–70. 

[70] Golic, J. New methods for digital generation and postprocessing of random 

data. IEEE Transactions on Computers 55, 10 (2006), 1217–1229.

[71] Gura, N., Chang Shantz, S., Eberle, H., Gupta, S., Gupta, V., Finchelstein, D.,

Goupy, E., and Stebila, D. An End-to-End Systems Approach to Elliptic 

Curve Cryptography. In Cryptographic Hardware and Embedded Systems — 

CHES 2002 (2002), Ç. K. Koç and C. Paar, Eds., vol. LNCS 2523, Springer, 

pp. 349–365. 

[72] Huang, M., Gaj, K., Kwon, S., and El-Ghazawi, T. An optimized 

hardware architecture for the Montgomery Multiplication Algorithm. In PKC 

2008: 11th International Workshop on Practice and Theory in Public Key 

Cryptography, Barcelona, Spain (March 2008), pp. 214–228. 

[73] Jun, B., and Kocher, P. The Intel random number generator.

White paper prepared for Intel Corporation, Cryptography Research, Inc.,

http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf, Apr. 

1999. 

[74] Killmann, W., and Schindler, W. A proposal for: Functionality Classes

and Evaluation Methodology for True (Physical) Random Number Generators, 

Sept. 2001. 

[75] Kinniment, D., and Chester, E. Design of an on-chip random number 

generator using metastability. In Proceedings of the 28th European Solid-State 

Circuit Conference (Sept. 2002), Univ. Bologna, Italy, pp. 595–598. 

[76] Knuth, D. E. Seminumerical Algorithms, second ed., vol. 2 of The Art of 

Computer Programming. Addison-Wesley, Reading, Massachusetts, Jan. 10, 

1981. 

[77] Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 48, 

177 (Jan. 1987), 203–209. 

[78] Koblitz, N., Menezes, A., and Vanstone, S. The state of elliptic curve 

cryptography. Designs, Codes and Cryptography 19, 2-3 (Mar. 2000), 173–193. 

[79] Kohlbrenner, P., and Gaj, K. An embedded true random number gen- 

erator for FPGAs. In Proceeding of the 2004 ACM/SIGDA 12th international 

symposium on Field programmable gate arrays (2004), ACM Press, pp. 71–78. 

[80] Lenstra, A. K. Designs, Codes and Cryptography. Kluwer Academic Pub- 

lishers, Boston, 2000, ch. Integer Factoring. 

[81] Lenstra, A. K., and Lenstra, Jr., H. W., Eds. The Development of the

Number Field Sieve. Lecture Notes in Math. Volume 1554. Springer, 1993. 

[82] Lenstra, H. W. Factoring Integers with Elliptic Curves. Annals of Mathe- 

matics 126, 2 (1987), 649–673. 

[83] Lim, D., Ranasinghe, D. C., Devadas, S., Jamali, B., Abbott, D., 

and Cole, P. H. Exploiting metastability and thermal noise to build a

re-configurable hardware random number generator. In Noise in Devices and 

Circuits III; Proceedings of SPIE (Texas, USA, May 2005), vol. 5844, pp. 294– 

309. 

[84] MacKay, D. J. C. Introduction to Monte Carlo methods. In Learning in 

Graphical Models, M. I. Jordan, Ed., NATO Science Series. Kluwer Academic 

Press, 1998, pp. 175–204. 

[85] McIvor, C., McLoone, M., McCanny, J., Daly, A., and Marnane, 

W. Fast Montgomery modular multiplication and RSA cryptographic processor

architectures. In 37th IEEE Computer Society Asilomar Conference on

Signals, Systems and Computers (Monterey, USA, Nov. 2003), pp. 379–384. 

[86] Menezes, A. J., van Oorschot, P. C., and Vanstone, S. A. Handbook of

Applied Cryptography. CRC Press, New York, Oct. 1996. 

[87] Miller, V. S. Use of elliptic curves in cryptography. In Lecture notes 

in computer sciences; 218 on Advances in cryptology—CRYPTO 85 (1986), 

Springer-Verlag New York, Inc., pp. 417–426. 

[88] Montgomery, P. Modular Multiplication without Trial Division. Mathe- 

matics of Computation 44, 170 (April 1985), 519–521. 

[89] Montgomery, P. Speeding up the Pollard and elliptic curve methods of 

factorization. Mathematics of Computation 48 (1987), 243–264. 

[90] NEC Corporation. Preliminary User’s Manual System-on-Chip Lite, De- 

velopment Board, Hardware, Document No. A15650EE1V0UM00, July 2001. 

Available at http://www.ee.nec.de/_pdf/A15650EE1V0UM00.PDF. 

[91] Federal Information Processing Standards, National Institute

of Standards and Technology, U.S. Department of Commerce.

Recommendation for Key Management, Part 1 - General, Aug. 2005.

[92] Orlando, G., and Paar, C. A Scalable GF (p) Elliptic Curve Processor Ar- 

chitecture for Programmable Hardware. In Workshop on Cryptographic Hard- 

ware and Embedded Systems — CHES 2001 (May 14-16, 2001), Ç. K. Koç, 

D. Naccache, and C. Paar, Eds., vol. LNCS 2162, Springer, pp. 348–363. 

[93] Pavelka, P., Galajda, P., and Fischer, V. Crypto FPGA a step to- 

wards a new class of flexible security devices. In Radioelektronika 2005 : 15th 

international Czech-Slovak scientific conference (Brno, Czech Republic, May 

2005), University of Technology, pp. 397–400. 

[94] Pelzl, J., Šimka, M., Kleinjung, T., Franke, J., Priplata, C.,

Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Area–time 

efficient hardware architecture for factoring integers with the elliptic curve 

method. IEE Proceedings - Information Security 152, 1 (2005), 67–78. 

[95] Pollard, J. A Monte Carlo Method for Factorization. Nordisk Tidskrift for 

Informationsbehandlung (BIT) 15 (1975), 331–334. 

[96] Rivest, R. L., Shamir, A., and Adleman, L. A Method for Obtaining 

Digital Signatures and Public-Key Cryptosystems. Communications of the 

ACM 21, 2 (February 1978), 120–126. 

[97] Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, 

S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J., 

and Vo, S. A Statistical Test Suite for Random and Pseudorandom Number 

Generators for Cryptographic Applications. NIST Special Publication 800-22. 

(revised May 15, 2002). 

[98] Santoro, R., Sentieys, O., and Roy, S. On-line monitoring of random 

number generators for embedded security. In IEEE International Symposium 

on Circuit and Systems – ISCAS 2009 (2009), pp. 3050–3053. 

[99] Schaumont, P., and Ching, D. Gezel. Available at 

http://rijndael.ece.vt.edu/gezel2. 

[100] Schindler, W. Efficient Online Tests for True Random Number Gener- 

ators. In Workshop on Cryptographic Hardware and Embedded Systems – 

CHES 2001 (Berlin, Germany, May 13–16, 2001), Ç. K. Koç, D. Naccache, 

and C. Paar, Eds., vol. 2162 of Lecture Notes in Computer Science, Springer- 

Verlag, pp. 103–117. 

[101] Schneier, B. Applied Cryptography: Protocols, Algorithms, and Source Code 

in C, 2nd ed. John Wiley & Sons, Inc., New York, 1996. 

[102] Secretariat National Committee for Information Technology 

Standardization. Fibre Channel - Methodologies for Jitter Specification, 

T11.2 / Project 1230/ Rev 10, June 1999. 

[103] Shamir, A., and Tromer, E. Factoring Large Numbers with the TWIRL 

Device. In Advances in Cryptology — Crypto 2003 (2003), vol. 2729 of LNCS, 

Springer, pp. 1–26. 

[104] Silverman, R. D. The multiple polynomial quadratic sieve. Mathematics of 

Computation 48 (1987), 329–340. 

[105] Sunar, B., Martin, W. J., and Stinson, D. R. A provably secure true 

random number generator with built-in tolerance to active attacks. IEEE 

Transactions on Computers 56, 1 (2007), 109–119.

[106] Tang, K., Siegel, P. H., and Milstein, L. B. A comparison of long 

versus short spreading sequences in coded asynchronous DS-CDMA systems. 

IEEE Journal on Selected Areas in Communications 19, 8 (Aug. 2001), 1614– 

1624. 

[107] Tektronix. A Guide to Understanding and Characterizing Timing Jitter. 

[108] Tenca, A. F., and Ç. K. Koç. A scalable architecture for Montgomery 

multiplication. In Cryptographic Hardware and Embedded Systems (Berlin, 

Germany, 1999), Ç. K. Koç and C. Paar, Eds., no. 1717 in Lecture Notes in Computer Science,

Springer Verlag, pp. 94–108. 

[109] Tenca, A. F., and Ç. K. Koç. A scalable architecture for modular multipli- 

cation based on Montgomery’s algorithm. IEEE Transactions on Computers 

52, 9 (Sept. 2003), 1215–1221. 

[110] Tenca, A. F., Todorov, G., and Ç. K. Koç. High-radix design of a 

scalable modular multiplier. In Cryptographic Hardware and Embedded Sys- 

tems – CHES 2001 (Berlin, Germany, May 2001), Ç. K. Koç, D. Naccache, 

and C. Paar, Eds., no. 2162 in Lecture Notes in Computer Science, Springer- 

Verlag, pp. 189–205. 

[111] Tkacik, T. E. A hardware random number generator. In Workshop on Cryp- 

tographic Hardware and Embedded Systems – CHES 2002 (Berlin, Germany, 

Aug.13–15, 2002), B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523 

of Lecture Notes in Computer Science, Springer-Verlag, pp. 450–453. 

[112] Tsoi, K., Leung, K., and Leong, P. Compact FPGA-based true and 

pseudo random number generators. In Proceedings of the IEEE Symposium on 

Field-Programmable Custom Computing Machines (FCCM), California USA 

(2003), pp. 51–61. 

[113] Šimka, M., and Drutarovský, M. Montgomery Multiplication Coprocessor

on Reconfigurable Logic. In Proceedings of 13th International Czech-Slovak

Scientific Conference Radioelektronika (Brno, Czech Republic, May

6–7, 2003), pp. 95–98. 

[114] Šimka, M., Drutarovský, M., and Fischer, V. Embedded True Random

Number Generator in Actel FPGAs. In Workshop on Cryptographic

Advances in Secure Hardware – CRASH 2005 (Leuven, Belgium, Sept. 6–7, 

2005). 

[115] Šimka, M., Drutarovský, M., and Fischer, V. Randomness Extraction

Method Based on Rationally Related Clock Signals. In Proceedings of

the DSP-MCOM 2005, The 6th International Conference on Digital Signal

Processing and Multimedia Communications (Košice, Slovakia, Sept. 13–14,

2005), pp. 190–193. 

[116] Šimka, M., Drutarovský, M., and Fischer, V. Performance of PLL-based

True Random Number Generator in changing working conditions (Submitted).

Acta Electrotechnica et Informatica (2010).

[117] Šimka, M., and Fischer, V. Montgomery Multiplication Coprocessor for

Altera Nios Embedded Processor. In Proceedings of Electronic Computers and 

Informatics (Herl’any, Slovakia, Oct. 2002), pp. 206–211. 

[118] Šimka, M., Fischer, V., and Drutarovský, M. Hardware-Software

Codesign in Embedded Asymmetric Cryptography Application – a Case

Study. In Field-Programmable Logic and Applications – FPL 2003 (Lisbon,

Portugal, Sept. 2003), P. Y. Cheung, G. A. Constantinides, and J. T.

de Sousa, Eds., vol. 2778 of Lecture Notes in Computer Science, Springer-Verlag,

pp. 1075–1078.

[119] Šimka, M., Fischer, V., Drutarovský, M., and Fayolle, J. Model

of a true random number generator aimed at cryptographic applications. In

Proceedings of the International Symposium on Circuit and Systems – ISCAS 

2006 (Island of Kos, Greece, May 21–24, 2006), pp. 5619–5623. 

[120] Šimka, M., Pelzl, J., Kleinjung, T., Franke, J., Priplata, C.,

Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Hard- 

ware Factorization Based on Elliptic Curve Method. In FCCM – IEEE Sym- 

posium on Field-Programmable Custom Computing Machines (Napa Valley, 

California, Apr. 17–20, 2005). 

[121] Walker, S., and Foo, S. Evaluating metastability in electronic circuits 

for random number generation. In Proceedings of the IEEE Computer Society 

Workshop on VLSI 2001 (WVLSI ’01) (2001), IEEE Computer Society, p. 99. 

[122] Wollinger, T., and Paar, C. How secure are FPGAs in cryptographic 

applications? (long version). Cryptology ePrint Archive, Report 2003/119, 

2003. 

[123] Wolski, E., Filho, J. G. S., and Dantas, M. A. R. Parallel Implementa- 

tion of Elliptic Curve Method for Integer Factorization Using Message-Passing 

Interface (MPI). In SBAC-PAD 13th Symposium on Computer Architecture

and High-Performance, 2001, Pirenopolis (2001). 

[124] Xilinx Corporation. Virtex-E 1.8V Field Programmable Gate Arrays — 

Production Product Specification. 

[125] Xilinx Corporation. Superior Jitter Management with DLLs ver.1.2, 

Virtex Tech Topic VTT013 ed., Jan. 2003.

[126] Xilinx Corporation. Using the Virtex Delay-Locked Loop ver.2.8, Appli- 

cation Note 132: Virtex Series ed., Jan. 2006. 

[127] Xilinx Corporation. Using Delay-Locked Loops in Spartan-II/IIE FPGAs 

ver.1.2, Application Note 174 ed., June 2008. 

[128] Zimmermann, P. ECMNET page. Available at 

http://www.loria.fr/~zimmerma/records/ecmnet.html.
