1 Montgomery Modular Multiplication in Hardware
Technical University of Košice
Faculty of Electrical Engineering and Informatics
Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs
2010 Martin Šimka
Technical University of Košice
Faculty of Electrical Engineering and Informatics
Department of Electronics and Multimedia Communications
Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs
Montgomery Modular Multiplier and True Random Number Generator
Doctoral Thesis
Discipline: 26-13-9 Electronics
Department: Department of Electronics and Multimedia Communications (FEI)
Supervisor: doc. Ing. Miloš Drutarovský, PhD.
Consultant: prof. Ing. Viktor Fischer, PhD.
Košice 2010 Martin Šimka
Metadata Sheet
Author: Martin Šimka
Thesis title: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs
Subtitle: Montgomery Modular Multiplier and True Random Number Generator
Language: English
Type of Thesis: Doctoral Thesis
Number of Pages: 126
Degree: PhD.
University: Technical University of Košice
Faculty: Faculty of Electrical Engineering and Informatics (FEI)
Department: Department of Electronics and Multimedia Communications (KEMT)
Discipline: 26-13-9 Electronics
Town: Košice, Slovakia
Supervisor: doc. Ing. Miloš Drutarovský, PhD.
Consultant(s): prof. Ing. Viktor Fischer, PhD.
Date of Submission: 2. 8. 2010
Date of Defence: 9. 2010
Keywords: modular multiplication, elliptic curve method, factorisation, random number generator
Category Conspectus: Technika, technológia, inžinierstvo; Elektronika
Thesis Citation: Martin Šimka: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs. Košice: Technical University of Košice, Faculty of Electrical Engineering and Informatics. 2010. 126 pages
Title SK: Analýza a implementácia vybraných blokov pre kryptografické systémy s verejným kľúčom
Subtitle SK: Montgomeryho modulárna násobička a generátor skutočne náhodných čísel
Keywords SK: modulárne násobenie, metóda eliptických kriviek, faktorizácia, generátor náhodných čísel
Abstract in English
In this thesis we deal with two elementary blocks used in public-key cryptosystems: the first is a modular multiplier for very long operands, the second is a random number generator. Both blocks are designed for a programmable target platform (FPGA devices), which allows quick prototyping of the proposed systems.
Our main goal for the multiplier is a scalable and parametrised solution that is easily portable and adaptable to the final target platform and the processed data. Note that, because of the high flexibility required of the solution, the achieved clock speed is lower than that of a dedicated design focused on speed. On the other hand, our solution is well suited to prototyping and proof-of-concept designs. In the thesis we analyse algorithm improvements in relation to the technical features of the chosen FPGA families. The resulting universal arithmetic solution needs to be complemented with an equally universal interface so that it can be connected to a control unit. As a result we obtained a building block, the multiplier, for use in cryptographic and cryptanalytic systems. For the multiplier, the occupied physical area, the computational time and the operand size can each be chosen from a range of options.
The second area we deal with is the generation of random numbers in the digital environment of integrated circuits. A random number generator (RNG) is the only cryptographic element for which there are no generally applied algorithms. The main reason is that the harvesting mechanism of an RNG is tightly bound to the target platform. Physical sources of randomness are very limited in digital devices. In addition, we deal with the problematic issue of randomness testing. We analyse the chosen RNG design under changing chip temperature. Finally, the proposed stochastic model of the generator allows a better understanding of its principle.
Abstract in Slovak (English translation)
In this dissertation we deal with two elementary blocks used in public-key cryptographic systems: the first is a multiplier for operations on large numbers, the second is a random number generator. Both blocks are implemented in gate-array technology (FPGA devices), which makes it possible to create a prototype in a very short time.
Our main goal for the multiplier is an easily parametrisable and scalable solution that allows the architecture to be adapted to the target platform and to the properties of the processed data. It should be noted that the flexibility of the solution results in a lower achieved computation speed. On the other hand, such a solution is ideal for building prototypes and designs intended to confirm a proposed concept. In the thesis we deal with adapting the structure of the multiplier to the architecture of the target platform for selected FPGA families. The resulting universal solution must be equipped with an equally universal interface that allows the computational unit to be connected to diverse types of control units. As a result we obtained a building block for cryptographic and cryptanalytic systems, for which it is possible to choose the area occupied on the target platform, the speed of the multiplication operation and the size of the accepted parameters.
The second area addressed in the thesis is the generation of random sequences in digital integrated circuits. The random number generator (RNG) is the only element of cryptographic systems whose principle is not defined by an international standard. The main reason is that the method of obtaining random values depends strictly on the target platform on which the generator is implemented. The physical entropy sources usable in digital circuits are limited, and this is compounded by the problem of testing the randomness of the output sequence. We analyse the selected generator with respect to its behaviour under changing thermal conditions of the device in which it is placed. The presented stochastic model of the generator clarifies the principle by which the random sequence is generated.
Declaration
I hereby declare that this thesis is my own work and effort. Where other sources of information have been used, they have been acknowledged.
Košice 2. 8. 2010
Signature
Acknowledgement
Several people contributed to the research results published in this thesis and to the fact that I can submit it for defence.
I am very grateful to my advisor Miloš Drutarovský for guiding me throughout my research, for his effort and dedication, and for all the time he found for me. I want to thank my special advisor Prof. Viktor Fischer for his great advice, support and research ideas, and for making my research stay in France possible. I would like to express my gratitude to Prof. Dušan Levický for his help in tough situations during my stay at the department he leads.
Many thanks go to Nathalie Bochard and Frédéric Celle for very good cooperation and help with FPGA design. I was glad to meet Corinne Fournier and Loïc Denis, who made my weekends very enjoyable. Thanks to all members of the Hubert Curien Laboratory in Saint-Etienne; I had a nice time with you.
I would like to thank all my colleagues from the COSY group, especially Jan Pelzl for the very fruitful joint work on the hardware implementation of ECM. I am grateful to Prof. Christof Paar, who allowed me to work in his research group and gain such priceless experience. Thanks to Sandeep Kumar, Andy Rupp and Axel Poschmann for a great time in Bochum, spent on research but not only on that. Special thanks go to Irmgard Kühn for making my contact with all the bureaucracy much easier.
From the COSIC group I would like to thank Prof. Ingrid Verbauwhede and Prof. Bart Preneel for making it possible for me to join their team in Leuven. Thanks to Lejla Batina and Elke De Mulder for involving me in side-channel attack research, and to all members of COSIC for creating a great atmosphere there.
I want to thank my family for their encouragement and support, and especially my sister Katka for all our inspiring discussions.
Most importantly, I thank my dear Kasia for her endless love and patience.
Thanks to all of you!
Martin
Preface
Public-key cryptography systems are widely used to digitally sign or encrypt data. In this way we assure the integrity and confidentiality of the signed message and provide authentication and non-repudiation for the signer. The complexity of the computations affects the performance of the system, especially for long keys. The security of the operations rests on the secrecy of the private key, while its public part and the algorithm itself are publicly known.
In the first part of the thesis we analyse the computational part of these systems and focus on a flexible implementation of a modular multiplier. The output of this research was applied to estimate the performance gain of the Elliptic Curve Method (ECM) obtained through its hardware realisation. The scalable nature of the multiplier was carried over to the whole design, and the proof-of-concept implementation was designed and tested in a very short time.
In the second part of the document we focus on the key-generating element, a Random Number Generator (RNG). An already known design was analysed from several aspects, and we provide results in the form of a stochastic model of the RNG and proposed testing methods suitable for this type of RNG.
The target platform for the selected building blocks of cryptosystems is the FPGA (Field Programmable Gate Array), which offers a reduction of development time, a wide range of devices and a high level of security. In the thesis we analyse particular device families from FPGA vendors that include dedicated electronic elements used in our designs. The parameters of the blocks and algorithm improvements may have a significant impact on the performance of the system.
The three topics of the thesis give a picture of the level of complexity in cryptology and underline the relevance of research into the implementation of cryptographic systems.
Contents
Introduction
1 Montgomery Modular Multiplication in Hardware - preliminaries
1.1 Implementation Platforms
1.2 RSA Algorithm
1.2.1 Modular Exponentiation and Multiplication
1.2.2 Hardware Implementations of the MMM
1.3 EC in Cryptography
1.4 Conclusions
2 Montgomery Modular Multiplication in Hardware
2.1 Scalable MMM design
2.1.1 Scalable Multiple-Word Algorithms
2.1.2 Comparison of Implementation Approaches
2.2 Multiplier Architecture
2.2.1 Adder Concepts
2.2.2 Memory Block
2.2.3 Interface to Controller
2.3 Implementation of the MMM
2.3.1 Comparison of CSA and CPA PE
2.3.2 Montgomery Multiplication Coprocessor
2.3.3 Hardware-Software Co-design of MMM: a Case Study
2.3.4 Implementation Results
2.4 Conclusions and Future Work
3 Elliptic Curve Method in Hardware - preliminaries
3.1 Integer Factoring
3.1.1 Factoring Algorithms
3.1.2 Motivation for Hardware Implementation
3.2 Previous Implementations of ECM
3.3 Mathematical Background
3.3.1 Pollard's (p − 1)-algorithm
3.3.2 ECM Algorithm
4 Elliptic Curve Method in Hardware
4.1 Parameterisation of the ECM Algorithm
4.1.1 Phase 1
4.1.2 Phase 2
4.2 Design of the ECM Unit
4.2.1 Control Logic
4.2.2 Memory Management
4.2.3 Choice of the Arithmetic Algorithms
4.2.4 Parallelization of the Algorithm
4.3 Implementation of the ECM Unit
4.3.1 Hardware Platform
4.3.2 Results
4.3.3 ECM-Based Acceleration of GNFS: a Case Study
4.4 Conclusions and Future Steps
5 True Random Number Generator - preliminaries
5.1 Randomness
5.1.1 Definitions of Randomness
5.1.2 Random Number Generator
5.1.3 Applications of Random Numbers
5.2 TRNG Implementations in Digital Systems
5.2.1 Sources of Randomness
5.2.2 Survey of Designs Based on Jitter
5.3 PLL-Based TRNG on FPGA
5.3.1 Randomness Extraction Method
5.3.2 Coherent Sampling
5.4 Testing of TRNGs
5.5 Attacks against TRNG
5.6 Conclusions
6 True Random Number Generator
6.1 Clock Synthesis in FPGAs
6.1.1 PLL as Source of Randomness
6.2 PLL-Based TRNG on FPGA
6.2.1 PLL Configurations
6.2.2 Analysis of TRNG in Altera Stratix FPGAs
6.2.3 Analysis of TRNG in Actel FPGAs
6.2.4 Stochastic Model of PLL-TRNG
6.3 Active Non-Invasive Attack on TRNG
6.3.1 Attack description
6.3.2 Measurements results
6.4 Conclusions and Further Research
7 Research Contribution
Bibliography
List of Figures
1 – 1 Typical architecture of the smallest functional unit in an FPGA.
1 – 2 RSA encryption scheme when A sends an encrypted message to B. First, A receives B's public key upon request; A then encrypts a message X using B's public key, Y = X^E mod M. Finally, B decrypts the received message Y using its own private key, X = Y^D mod M.
2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU connected by a w-bit data-path
2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs
2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA
2 – 4 Block diagram of the CPA-based w-bit MWR2MM processing element (CPA PE) based on FA
2 – 5 Pipelined organization of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory
2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits
2 – 7 Proposed universal interface for the MMM coprocessor
4 – 1 Architecture of the ECM unit
4 – 2 Organisation of the ECM unit's memory registers for 21 variables with e words of width w
4 – 3 Scalable addition and subtraction unit for operands with word width w
5 – 1 Schematic diagram of a TRNG with designation of internal signals and interfaces
5 – 2 Illustration of stable states (0 and 1) and undefined metastable state
5 – 3 Timing jitter in clock signal
5 – 4 Ring oscillator structures proposed by Golić.
5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of the output sequence.
5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge of the CLK signal (illustrated for K_M = 5 and K_D = 7)
6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]
6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits
6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div)
6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs
6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000 periods T_Q
6 – 6 Block diagram of design for on-chip samples reordering
6 – 7 Reordered samples from generator measured by oscilloscope
6 – 8 Sampled waveform of a clock signal for TRNG with configuration A for temperatures in the range −40 °C to +30 °C.
6 – 9 Sampled waveform of a clock signal for TRNG with configuration B for temperatures in the range −40 °C to +32 °C.
6 – 10 Number of sampled ones during 1000 sampling periods for TRNG with configuration A (detail of the rising edge).
6 – 11 Number of sampled ones during 1000 sampling periods for TRNG with configuration B, with low-pass loop filter (detail of the rising edge).
6 – 12 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration A.
6 – 13 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration B.
6 – 14 Comparison of probability histograms for the jitter measured at a temperature of 20 °C in TRNG with configurations A and B. Data were measured around the rising edge of the sampled clock waveform.
6 – 15 Difference in the number of sampled ones for critical samples at the boundary temperatures −40 °C and +30 °C in TRNG with configurations A and B around the rising edge of the sampled clock waveform.
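The exchange described in the caption of Figure 1 – 2 can be illustrated with a toy computation. This is a minimal sketch under stated assumptions: the primes `p` and `q`, both exponents and the message value are small textbook numbers chosen only for illustration (none of them comes from the thesis), and no padding is applied, so it should not be read as a secure or complete RSA implementation.

```python
# Toy RSA round trip following the caption of Figure 1 - 2.
# All numeric values are illustrative assumptions, far too small for real use.
p, q = 61, 53                # hypothetical secret primes
M = p * q                    # modulus M = 3233
phi = (p - 1) * (q - 1)      # Euler totient of M = 3120
E = 17                       # B's public exponent, coprime to phi
D = 2753                     # B's private exponent, E * D = 1 (mod phi)
assert (E * D) % phi == 1


def encrypt(X, E, M):
    """A encrypts message X with B's public key: Y = X^E mod M."""
    return pow(X, E, M)


def decrypt(Y, D, M):
    """B decrypts ciphertext Y with its own private key: X = Y^D mod M."""
    return pow(Y, D, M)


X = 65                       # hypothetical message, must satisfy 0 <= X < M
Y = encrypt(X, E, M)
recovered = decrypt(Y, D, M)
```

Both steps are modular exponentiations, that is, long chains of modular multiplications, which is why a modular multiplier for long operands is a natural building block for such a system.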
List of Tables
1 – 1 Comparison of the key length (in bits) for equivalent security level for public-key cryptosystems
2 – 1 Address of operands from host processor level (LSB right)
2 – 2 PE sizes and speeds for old style Altera FPGAs
2 – 3 PE sizes and speeds for new style Altera FPGAs
2 – 4 Area occupation in number of LEs and maximal clock frequency f_clkMMM (MHz) of the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm
2 – 5 Execution times of software implementation of MMM on Altera Nios development board (with APEX EP20K200 clocked at 50 MHz)
2 – 6 Execution times of mixed hardware-software implementation of MMM on Altera Nios development board (with APEX EP20K200) for the CSA PE
2 – 7 Execution times of mixed hardware-software implementation of the MMM on Altera Nios development board (with APEX EP20K200) for the CPA PE
4 – 1 Computational complexity and memory requirements for phase 2 depending on D
4 – 2 A command syntax for the ECM unit (LSB left)
4 – 3 Running times of the ECM implementation (198-bit modulus), p = 2, w = 32 (Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)
6 – 1 Parameters of PLL embedded in Altera FPGAs
6 – 2 Parameters of PLL embedded in Actel FPGAs
6 – 3 Parameter settings for different TRNG configurations
6 – 4 Configuration parameters of tested TRNG
6 – 5 Results of quality evaluation of tested TRNG configurations
6 – 6 Achievable sensitivity on jitter using two clock signals in Actel ProASICplus (F_CLI = 40 MHz)
6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICplus
6 – 8 Mean values measured using the stochastic model E[p_i] and the output sequence of the TRNG m = E[x(nT_Q)]
6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples influenced by the jitter at different chip temperatures
List of Algorithms
1 – 1 Montgomery exponentiation algorithm [86]; the definition of M′ requires that gcd(M, R) = 1, b denotes base or radix.
1 – 2 The Montgomery modular multiplication algorithm for k-bit operands X = (x_{k−1}, ..., x_1, x_0), Y, and M
1 – 3 The basic radix-2 Montgomery multiplication algorithm for k-bit operands X = (x_{k−1}, ..., x_1, x_0), Y, and M
1 – 4 Optimized radix-2 Montgomery multiplication algorithm
1 – 5 Key generation in ECC [78]
1 – 6 Message signing in ECC [78]
2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm
2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm
3 – 1 Elliptic Curve Method
3 – 2 Exponentiation for Curves in Montgomery Form
4 – 1 Modified MWR2MM algorithm
4 – 2 Modular addition
4 – 3 Modular subtraction
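As orientation for the radix-2 entries in the list above, bit-serial Montgomery multiplication can be sketched as follows. This is the generic textbook form, written under the assumptions that M is odd, that X, Y < M < 2^k and that R = 2^k. The function name `mont_mul_radix2` and the final conditional subtraction are illustrative choices and do not reproduce Algorithms 1 – 3 or 1 – 4 of the thesis verbatim.

```python
def mont_mul_radix2(x, y, m, k):
    """Return x * y * 2^(-k) mod m for an odd modulus m with x, y < m < 2^k.

    The operand x is scanned bit by bit from the least significant bit,
    mirroring the notation X = (x_{k-1}, ..., x_1, x_0).
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd so that gcd(M, 2^k) = 1")
    s = 0                       # partial sum S
    for i in range(k):
        x_i = (x >> i) & 1      # current bit of the multiplier X
        s += x_i * y            # add the multiplicand Y if the bit is set
        if s & 1:               # make S even by adding the modulus M
            s += m
        s >>= 1                 # exact division by the radix 2
    if s >= m:                  # S < 2M holds here, one subtraction suffices
        s -= m
    return s


# Example with a hypothetical 4-bit modulus, so R = 2^4 = 16.
result = mont_mul_radix2(7, 5, 13, 4)
```

The loop uses only additions and one-bit shifts, with no trial division by M, which is the property that makes this family of algorithms attractive for hardware.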
List of Symbols and Abbreviations
A^(x) the x-th word of vector A
A_{x..y} particular range of bits in a vector A from position x to position y
A_x^(y) bit at position x of the y-th word of A
B bound of smoothness
D a parameter in improved standard continuation of ECM
D_CLJ dividing factor for CLJ clock signal
D_CLK dividing factor for CLK clock signal
F_CLJ frequency of CLJ clock signal
F_CLK frequency of CLK clock signal
K_D decimation factor of CLK clock signal
K_M decimation factor of CLJ clock signal
M modulus
M_CLJ multiplication factor for CLJ clock signal
M_CLK multiplication factor for CLK clock signal
S partial sum
T_Q time period of bit generation
T_CLJ time period of CLJ clock signal
T_CLK time period of CLK clock signal
X multiplier
Y multiplicand
φ canonical homomorphism
φ() Euler totient function
π(p) prime counting function, number of primes ≤ p
σ_jit standard deviation of jitter
xA the x-th part of vector A
b base or radix
e number of words
k length of operands
n positive integer to be factored
p prime factor
w word width
ALU Arithmetic Logic Unit
ASIC Application-Specific Integrated Circuit
AT Area-Time
CASR Cellular Automata Shift Register
CLB Configurable Logic Block
CPA Carry Propagate Adder
CPU Central Processing Unit
CRT Chinese Remainder Theorem
CSA Carry Save Adder
DJ Deterministic Jitter
DLL Delay Locked Loop
DSA Digital Signature Algorithm
EC Elliptic Curves
ECC Elliptic Curve Cryptography
ECDLP Elliptic Curve Discrete Logarithm Problem<br />
ECDSA Elliptic Curve Digital Signature Algorithm<br />
ECM Elliptic Curve Method<br />
EPLL Enhanced PLL<br />
FA Full Adder<br />
FPGA Field Programmable Gate Array<br />
FPLL Fast PLL<br />
gcd Greatest Common Divisor<br />
GMP GNU Multiple Precision<br />
GNFS Generalised Number Field Sieve<br />
I/O Input/Output<br />
IP Intellectual Property<br />
ITU International Telecommunication Union<br />
LAB Logic Array Block<br />
LE Logic Element<br />
LFSR Linear Feedback Shift Register<br />
LPM Library of Parameterized Modules<br />
LSB Least Significant Bit<br />
LUT Look-Up Table<br />
MM Modular Multiplication<br />
MMM Montgomery Modular Multiplication<br />
MPQS Multiple Polynomial Quadratic Sieve<br />
MSB Most Significant Bit<br />
MWR2MM Multiple Word Radix-2 Montgomery Multiplication<br />
NA Not Available<br />
P&R Place and Route<br />
PCI Peripheral Component Interconnect<br />
PE Processing Element<br />
PLL Phase Locked Loop<br />
PRNG Pseudo-Random Number Generator<br />
RAM Random Access Memory<br />
RFID Radio Frequency Identification<br />
RISC Reduced Instruction Set Computer<br />
RJ Random Jitter<br />
RMS Root Mean Square<br />
RNG Random Number Generator<br />
RO Ring Oscillator<br />
ROM Read-Only Memory<br />
SIMD Single Instruction Multiple Data<br />
SOC System on a Chip<br />
SOS Separated Operand Scanning<br />
TRNG True-Random Number Generator<br />
UART Universal Asynchronous Receiver/Transmitter<br />
VCCIO Positive Supply Voltage for IO Pins<br />
VCO Voltage Controlled Oscillator<br />
VHDL VHSIC Hardware Description Language<br />
VHSIC Very High Speed Integrated Circuit<br />
Introduction<br />
In this thesis we analyse two elementary blocks of almost every public-key cryptosystem:<br />
a multiplier for operations on very long operands and a random number generator.<br />
For the multiplier, our main goal is a scalable and parametrised design for fast<br />
prototyping in Field Programmable Gate Arrays (FPGAs). Design flexibility is traded off<br />
against computational latency, so this concept is suitable mostly for prototyping and<br />
proof-of-concept designs. As a secondary objective we want to utilise a selected FPGA<br />
family effectively and exploit its specific features. In this way we can analyse the<br />
suitability of a given algorithm for the selected FPGA platform. Such an approach is<br />
particularly appropriate when the final implementation platform is the same FPGA family.<br />
A flexible and efficient multiplier design could offer a universal solution for<br />
applications with different asymmetric algorithms or for similar systems based on the<br />
same algebraic operations. Our goal is to design and implement a multiplier block with a<br />
universal interface that can be included in a variety of cryptosystems and that allows<br />
its configuration parameters to be changed, e.g. the length of the input operands, the<br />
computational time and the occupied area.<br />
Our second area of focus is random numbers, namely their generation on digital<br />
platforms. The design of a Random Number Generator (RNG) depends significantly on the<br />
target implementation platform. Therefore we analyse the features of FPGA devices, change<br />
the working conditions to simulate attacker behaviour, and describe the relations between<br />
the parameters of the generator and the statistical parameters of the generated sequence.<br />
The classification of generators depends heavily on the level of their description.<br />
According to the latest trends in this research area, designers of RNGs should provide, in<br />
addition to statistical test results, a detailed analysis and model of the generator.<br />
The generator's behaviour needs to be explained in detail and supported by practical<br />
experiments. Special attention in RNG design should be paid to the testability of the<br />
RNG. The tests can be performed on the generated sequence. However, as we will show, more<br />
effective testing methods exist for the analysed generator. The proposed methods should<br />
take into account the fundamental principle by which the random values are extracted.<br />
In Chapter 1 we introduce the mathematical background of the two currently best known<br />
and most widely used public-key cryptographic algorithms, RSA and Elliptic Curve<br />
Cryptography (ECC). In these computationally intensive public-key algorithms we identify<br />
the most expensive and also most frequently used operation: modular multiplication. A<br />
comparison of the operand lengths shows the range for which a universal architecture<br />
needs to be found.<br />
Chapter 2 presents our design approach and implementation results for the Montgomery<br />
multiplier. We compare two designs which differ in the handling of carry bits in the<br />
adders inside the multiplier block. The analysis suggests which technique is suitable for<br />
a given platform architecture. We present a scalable architecture of an algebraic<br />
coprocessor that is suitable for the multiplier. A communication interface between the<br />
coprocessor and a control unit is also discussed. The final case study provides results<br />
of our hardware-software co-design of the multiplier in applications with a soft-core<br />
processor and a dedicated coprocessor.<br />
In Chapter 3 we start with the mathematical background of integer factoring methods and<br />
provide details on the Elliptic Curve Method (ECM), including the first and second phase<br />
of the algorithm. The motivation for a hardware implementation of the algorithm and<br />
previous implementation approaches are summarised.<br />
Chapter 4 describes the first published hardware implementation of the ECM for<br />
factoring numbers of up to 200 bits. An ECM unit design is introduced and we discuss how<br />
the implemented algorithms were chosen. In the final section we present the<br />
implementation results of the ECM units and a case study of the application of the ECM<br />
unit in a well-known factoring method.<br />
Randomness is the main topic of Chapter 5. We discuss the required features of random<br />
sequences intended for cryptographic applications. We describe RNG design in detail, with<br />
a focus on digital devices, and analyse the available sources of randomness. The last<br />
part of the chapter reviews recently published RNG concepts, with a focus on a solution<br />
based on a Phase Locked Loop (PLL). The sections on tests and attacks summarise the<br />
available knowledge in these areas.<br />
In Chapter 6 we present the results of our research on the PLL-based RNG. Starting with<br />
an analysis of PLL parameters in available FPGA devices, we describe the design process<br />
for two FPGA vendors. Thanks to observations of the RNG's internal signals we were able<br />
to introduce a stochastic model of the generator and describe its behaviour under<br />
changing chip temperature. Based on the experiments we extend the design process with<br />
additional requirements in order to achieve a more robust solution.<br />
The research contribution of the thesis is summarised in the final Chapter 7, where we<br />
collect the results from all three topics discussed in the thesis.<br />
1 Montgomery Modular Multiplication in Hardware - preliminaries<br />
Many popular public-key cryptographic algorithms and protocols, such as RSA, ElGamal,<br />
elliptic curve cryptography (ECC) and Diffie-Hellman [86], make extensive use of modular<br />
operations on large numbers. Typical operand sizes in ECC and RSA are 160-300 bits and<br />
1000-2000 bits, respectively.<br />
We start the chapter with a discussion of the optimal choice of computation method and<br />
of the way it is implemented on the chosen implementation platform (Section 1.1).<br />
Section 1.2 summarises the RSA algorithm and gives a short analysis of the available<br />
algorithms for modular multiplication. We mention the aspects of hardware implementation<br />
and review the available papers in this area. Finally, the algorithm implemented later in<br />
this work and its modification are introduced. Section 1.3 starts with the definition of<br />
elliptic curves (EC) and continues with their application in cryptography. The last<br />
section summarises the most important features of the presented public-key algorithms and<br />
identifies the part of the system that matters most for an effective implementation.<br />
1.1 Implementation Platforms<br />
Having all parts of a cryptosystem (encryption, authentication, key storage, generation<br />
of random numbers, ...) implemented on the same platform makes it possible to achieve a<br />
highly compact and therefore potentially secure implementation: the more signals are<br />
available to an adversary for observation, the more information about the processed data<br />
can be obtained.<br />
In the past, hardware and software platforms were developed separately, apart from the<br />
initial requirements and the definitions of data formats and interfaces. Nowadays,<br />
so-called hardware-software co-design tries to find an optimum in the effective<br />
utilisation of resources. Some operations are then implemented as hardware structures<br />
and the others as software functions. Reconfigurable devices and embedded soft-core<br />
processors are well suited to such an approach. However, the development of mixed systems<br />
is not a trivial task for designers, especially at the stage where the division of tasks<br />
is decided. Systems that make it possible to simulate and evaluate the performance of a<br />
proposed software-hardware architecture before a real (and expensive) implementation are<br />
only at an early stage of development (see e.g. the GEZEL language and design<br />
environment [99]).<br />
Hardware implementation platforms offer a higher level of security, thanks to the<br />
possibility of physically separating sensitive data, and, depending on the operations,<br />
also higher performance than comparable software implementations.<br />
The following can be considered hardware platforms:<br />
• ASIC (Application-Specific Integrated Circuit),<br />
• FPGA (Field-Programmable Gate Array) or<br />
• RFID (Radio Frequency Identification) chip.<br />
There are different approaches to the implementation of cryptosystems. An implementation<br />
can provide supporting functions for a general-purpose processor, cover all<br />
crypto-related operations in a standard system, or even represent a complete system able<br />
to substitute the original non-secured system.<br />
Depending on the application, the implementation can take the form of a smart card, an<br />
IP (Intellectual Property) core, a co-processor, a PCI card, a router etc. With growing<br />
chip area it is possible to implement a CPU, memory blocks, peripherals, interfaces and a<br />
co-processor on a single chip, providing a so-called system-on-a-chip (SOC). Especially<br />
in cryptography there is a requirement to implement systems as an SOC, which hides the<br />
internal signals from possible abuse by an adversary. The SOC raises another requirement,<br />
namely to find a way to implement all parts of the SOC on the same chip and the same<br />
platform, if possible sharing the same resources.<br />
Applications have various requirements for area, speed, energy, or power consumption.<br />
Additionally, for cryptosystems we also define the level of security, taking into account<br />
the vulnerability to eavesdropping and side-channel attacks, or the ability of the system<br />
to detect an attack and thereafter delete the sensitive data in a way that makes it<br />
impossible for an adversary to restore them (tamper resistance). The conditions required<br />
to certify cryptographic implementations at a certain level of security, and the areas<br />
where such systems can be used, are set in standards of well-known standardisation<br />
organisations [57].<br />
Reconfigurable Devices A reconfigurable device is a hardware architecture in which both<br />
the functionality of the processing elements and the interconnection between them can be<br />
modified after fabrication. The best known reconfigurable hardware components are<br />
FPGAs.<br />
Cryptographic primitives belong to the group of systems suitable for reconfigurable<br />
devices due to the following features:<br />
• standardized algorithms - most cryptographic algorithms, except random number<br />
generators, are approved by international standardisation organisations (e.g.<br />
[54–56,58,59]). Thus, the functionality described by mathematical algorithms and<br />
equations can be studied in depth and tailored to the hardware structure. The set of<br />
cryptographic algorithms considered secure may change over time due to newly invented<br />
attacks. A reconfigurable platform makes it possible to remove obsolete algorithms from<br />
running systems and provide new ones, even without a hardware update or exchange.<br />
• several supported functionality modes and lengths of operands - while the number of<br />
the most popular algorithms is limited, each of them provides a set of selectable<br />
parameters, which results in the need to implement a group of algorithm combinations.<br />
• sequential structure - depending on the running operation, only selected<br />
cryptographic blocks need to be programmed in the device, and when the operation changes<br />
another configuration is loaded. An example is a scheme in which, at the beginning of the<br />
communication, a secret key is distributed to the parties by an asymmetric algorithm,<br />
which is later replaced by a faster symmetric encryption implemented on the same device.<br />
FPGA Architecture The underlying FPGA architecture consists of an array of the smallest<br />
programmable units, logic elements (LE) or configurable logic blocks (CLB), and<br />
programmable connection switches. A typical FPGA architecture consists of a large number<br />
(hundreds to thousands) of LEs and routing channels of different length and speed. By an<br />
LE we understand the smallest functional unit that is addressed by the mapping tools.<br />
Typically it consists of a look-up table (LUT) and a register (D flip-flop) (see<br />
Figure 1 – 1), which makes it possible to implement combinatorial as well as sequential<br />
logic, or a small memory block. Additionally, the FPGA architecture may include special<br />
dedicated blocks for other functions, e.g. for storing data, computing multiplication and<br />
addition, or synthesising clock signals.<br />
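The LUT-plus-register structure described above can be illustrated with a small software<br />
model. This is only a behavioural sketch, not a description of any vendor's logic<br />
element: the class name `LogicElement`, its methods and the 4-input width are assumptions<br />
made for the example, and the carry chain is left out.<br />

```python
class LogicElement:
    """Toy model of an FPGA logic element: an n-input LUT feeding a D flip-flop."""

    def __init__(self, function, n_inputs=4):
        self.n_inputs = n_inputs
        # "Configure" the LUT: store one output bit per input combination.
        self.table = [function(*self._bits(i)) & 1 for i in range(2 ** n_inputs)]
        self.q = 0  # registered output of the D flip-flop

    def _bits(self, index):
        # Unpack a table index into a tuple of input bits (input 0 is the LSB).
        return tuple((index >> pos) & 1 for pos in range(self.n_inputs))

    def lut(self, *inputs):
        # Combinatorial path: the inputs select one stored bit.
        index = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
        return self.table[index]

    def clock(self, *inputs):
        # Sequential path: on a clock edge the flip-flop captures the LUT output.
        self.q = self.lut(*inputs)
        return self.q


# Configure the element as a 4-input parity (XOR) function.
parity = LogicElement(lambda a, b, c, d: a ^ b ^ c ^ d)
print(parity.lut(1, 0, 0, 0), parity.lut(1, 1, 0, 0))
```

The same element implements any 4-input Boolean function simply by storing a different<br />
table, which is the property that makes the structure reconfigurable.<br />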
Modern FPGAs support the implementation of a wide range of algorithms from the areas of<br />
signal processing, communication and networking. Cryptographic algorithms and protocols<br />
can be represented as a sequence of algebraic functions in a chosen operational domain.<br />
The operations in cryptography are often similar to those used in the fields mentioned<br />
above. Therefore the optimised blocks in the FPGA structure also provide means for an<br />
efficient realisation of cryptographic primitives.<br />
[Figure: block diagram with the labels data inputs, clock, look-up table, carry input,<br />
carry chain, carry output, D flip-flop, data outputs]<br />
Figure 1 – 1 Typical architecture of the smallest functional unit in an FPGA.<br />
The additional property of cryptosystems, security, is supported by FPGA vendors by<br />
enhancing the devices with hard-wired encryption cores and special-purpose memories. With<br />
the rising importance of cryptography, FPGA vendors will be pushed to provide more and<br />
more features supporting the security of FPGA-based cryptosystems, as proposed in [93].<br />
More information on FPGA features and their relation to the implementation of<br />
cryptosystems, including an analysis of possible attacks, can be found in [122].<br />
1.2 RSA Algorithm<br />
Nowadays the most popular asymmetric cryptosystem is RSA, which was published by Ronald<br />
Rivest, Adi Shamir and Leonard Adleman in 1978 [96].<br />
A private key for the RSA algorithm consists of two large primes p and q of comparable<br />
size and a secret exponent D. A public key is represented by an exponent E and a modulus<br />
M, where<br />
$$M = pq \qquad (1.1)$$<br />
The Euler totient function φ(M) is defined as the number of positive integers smaller<br />
than M that are relatively prime to M, thus:<br />
$$\varphi(M) = (p - 1)(q - 1) \qquad (1.2)$$<br />
The public exponent E must therefore satisfy:<br />
$$\gcd(E, \varphi(M)) = 1 \qquad (1.3)$$<br />
The private exponent D is chosen such that:<br />
$$D = E^{-1} \bmod \varphi(M) \qquad (1.4)$$<br />
While the public key consists of a tuple (M, E), the private key can be kept in two<br />
possible forms: simply as a tuple (M, D), or in an extended form that also includes the<br />
primes p and q. The latter form allows faster decryption using the Chinese Remainder<br />
Theorem (CRT).<br />
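The key relations (1.1) to (1.4) can be checked with a few lines of Python. This is a<br />
minimal sketch with toy primes chosen for the example (p = 61, q = 53, E = 17); the<br />
function name `make_toy_rsa_key` is hypothetical, and real keys use primes of hundreds of<br />
bits generated by a vetted library.<br />

```python
from math import gcd


def make_toy_rsa_key(p, q, e):
    """Derive (M, E, D) from primes p, q and a public exponent e."""
    m = p * q                    # modulus, Eq. (1.1)
    phi = (p - 1) * (q - 1)      # Euler totient of M, Eq. (1.2)
    if gcd(e, phi) != 1:         # condition on the public exponent, Eq. (1.3)
        raise ValueError("E must be relatively prime to phi(M)")
    d = pow(e, -1, phi)          # private exponent, Eq. (1.4); needs Python 3.8+
    return m, e, d


M, E, D = make_toy_rsa_key(61, 53, 17)
print(M, E, D)
```

The three-argument `pow` with exponent -1 computes the modular inverse directly, so no<br />
separate extended Euclidean routine is needed in the sketch.<br />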
The basic mathematical operation used by RSA for cryptographic operations (encryption and<br />
digital signature) is modular exponentiation. To encrypt a message X with a public key<br />
(M, E) one applies the following equation [86]:<br />
$$Y = X^{E} \bmod M \qquad (1.5)$$<br />
Decryption of the received encrypted message Y is done with the private key (M, D) by<br />
calculating:<br />
$$X = Y^{D} \bmod M \qquad (1.6)$$<br />
Similarly to encryption, the RSA signature scheme employs modular exponentiation for the<br />
generation of a signature I for a message X<br />
$$I = X^{D} \bmod M \qquad (1.7)$$<br />
and for its verification<br />
$$X = I^{E} \bmod M \qquad (1.8)$$<br />
Note that in the encryption scheme Alice, as the sending party, uses the public key of<br />
the receiving party, Bob, to encrypt the message, so only Bob is able to decrypt it with<br />
his private key (see Figure 1 – 2). In the signature scheme Alice signs the message with<br />
her private key to prove its authenticity, and anybody who has Alice's public key is then<br />
able to verify her signature.<br />
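Equations (1.5) to (1.8) and the CRT form of decryption can be exercised as follows. The<br />
sketch assumes the toy key p = 61, q = 53, E = 17 and omits message padding, so it<br />
illustrates only the arithmetic; the function names are hypothetical and the CRT<br />
recombination shown is the common Garner form, not necessarily the one used in the thesis.<br />

```python
# Toy key (illustrative only): p = 61, q = 53, E = 17.
P, Q = 61, 53
M = P * Q
E = 17
D = pow(E, -1, (P - 1) * (Q - 1))   # Eq. (1.4); needs Python 3.8+


def encrypt(x):
    return pow(x, E, M)             # Eq. (1.5)


def decrypt(y):
    return pow(y, D, M)             # Eq. (1.6)


def sign(x):
    return pow(x, D, M)             # Eq. (1.7)


def verify(x, signature):
    return pow(signature, E, M) == x    # Eq. (1.8)


def decrypt_crt(y):
    """Decrypt with the extended private key (p, q, D) using the CRT."""
    d_p, d_q = D % (P - 1), D % (Q - 1)   # reduced exponents
    q_inv = pow(Q, -1, P)
    x_p = pow(y, d_p, P)                  # two half-size exponentiations
    x_q = pow(y, d_q, Q)
    h = (q_inv * (x_p - x_q)) % P         # recombination step
    return x_q + h * Q


y = encrypt(65)
print(y, decrypt(y), decrypt_crt(y), verify(65, sign(65)))
```

The CRT variant works with exponents and moduli of roughly half the length, which is the<br />
reason the extended private key form allows faster decryption.<br />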
[Figure: diagram with the labels A (X), request for B's key, private key, key (M,E),<br />
encrypted Y = X^E mod M, B (M,E,D)]<br />
Figure 1 – 2 RSA encryption scheme in which A sends an encrypted message to B. First A<br />
receives B's public key upon request, then A encrypts a message X using B's public key,<br />
$Y = X^{E} \bmod M$. Finally B decrypts the received message Y using his own private key,<br />
$X = Y^{D} \bmod M$.<br />
1.2.1 Modular Exponentiation and Multiplication<br />
The modular exponentiation used for the encryption and signature schemes of RSA (see<br />
Equations 1.5-1.8) and other public-key cryptographic algorithms can be computed in two<br />
ways, as a series of modular multiplications (MMs):<br />
• interleaved by a modular reduction, or<br />
• with a final reduction step.<br />
The best known method of the first category, Montgomery modular multiplication (MMM),<br />
invented by P. L. Montgomery [88], is discussed further in this work. For multiplication<br />
followed by a separate division one can use the popular Karatsuba-Ofman multiplication<br />
[76] in combination with Barrett's reduction [25].<br />
The MM can be a very slow operation when performed on general-purpose computers. The<br />
currently suggested operand length (e.g. for RSA) of 1024 bits and more is far above the<br />
typical operand length of such machines (8-32 bits). This motivates the design of special<br />
algebraic units that perform modular operations more efficiently. Better performance and<br />
efficiency of the implementation are achieved by adapting the algorithms and exploiting<br />
platforms with a reconfigurable architecture. Performing mathematical operations on the<br />
extra-long RSA variables can be limiting for units optimised for 8-, 16- or 32-bit<br />
variables, which are more typical e.g. in signal processing.<br />
The RSA modular exponentiation does not allow a straightforward implementation and<br />
requires algorithms that, for example, divide long operands into shorter words, taking<br />
into account the physical limitations of the structures in the selected hardware<br />
platform. When the operand length may change, the optimal solution is a design in which<br />
the operand length determines only the computational time of an operation but not the<br />
overall performance of the unit, which remains constant for arbitrary length.<br />
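Dividing a long operand into shorter words can be sketched as follows. The helper names<br />
`to_words` and `from_words` are hypothetical; the symbols follow the list of symbols (w<br />
is the word width, e the number of words), and the least significant word is stored first<br />
as an assumption of this example.<br />

```python
def to_words(x, w, e):
    """Split a non-negative integer x into e words of w bits, LSB word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(e)]


def from_words(words, w):
    """Reassemble an integer from its w-bit words (LSB word first)."""
    return sum(word << (w * i) for i, word in enumerate(words))


k, w = 64, 16                      # operand length and word width in bits
e = -(-k // w)                     # number of words, ceil(k / w)
words = to_words(0x0123456789ABCDEF, w, e)
print([hex(word) for word in words])
```

A longer operand only increases e, and with it the number of word-level steps, while the<br />
w-bit datapath that processes each word stays unchanged.<br />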
Montgomery Methods The MMM provides a very efficient way of computing the modular<br />
exponentiation. The input operands of the basic algebraic operations of the RSA algorithm<br />
described by Equations 1.5-1.8 are very long for security reasons. Nowadays, the key<br />
length for RSA is being switched from 1024 to 2048 bits, as factorisation efforts bring<br />
better results, closer to the lower standard value. With the need to use operands of<br />
doubled precision it is even more desirable to find algorithms that minimise the number<br />
of algebraic operations together with their complexity.<br />
The Montgomery reduction allows an efficient implementation of the MM without the<br />
classical modular reduction step, which is an even more expensive operation than the<br />
multiplication. Therefore it pays off to minimise the number of required reductions or to<br />
use algorithms that avoid the division.<br />
In the Montgomery exponentiation algorithm (Algorithm 1 – 1 [86]) the modular<br />
exponentiation unrolls into a series of MMMs. Thanks to the transformation to the<br />
Montgomery domain and the application of the MMM, the unwanted modular reduction is<br />
avoided during the computation.<br />
We continue with a description of the MMM and of the conversion operations applied in<br />
Algorithm 1 – 1.<br />
Given two integers X and Y (X, Y < M < R) and a k-digit modulus M (in radix b) that is<br />
relatively prime to R, the MMM algorithm computes<br />
$$S = \mathrm{MMM}(X, Y) = (X Y R^{-1}) \bmod M \qquad (1.9)$$<br />
where $R^{-1}$ is the inverse of $R = b^{k}$ modulo M and b denotes the base or radix. The<br />
M-residue $\bar{X}$ of an integer X < M is defined as [41]:<br />
$$\bar{X} = X R \bmod M \qquad (1.10)$$<br />
For conversion to the Montgomery domain we can use the MMM function as follows:<br />
$$\mathrm{MMM}(X, R^{2}) = X R^{2} R^{-1} \bmod M = X R \bmod M = \bar{X} \qquad (1.11)$$<br />
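As a small worked example of the conversion to the Montgomery domain, assume the toy<br />
values M = 11, b = 2 and k = 4 (so R = 16) and X = 7; these numbers are chosen only for<br />
illustration and do not come from the thesis.<br />

```latex
\begin{aligned}
R^{-1} \bmod M &= 9            && \text{since } 16 \cdot 9 = 144 = 13 \cdot 11 + 1 \\
\bar{X} = X R \bmod M &= 7 \cdot 16 \bmod 11 = 2 \\
R^{2} \bmod M &= 256 \bmod 11 = 3 \\
\mathrm{MMM}(X, R^{2} \bmod M) &= 7 \cdot 3 \cdot 9 \bmod 11 = 189 \bmod 11 = 2 = \bar{X}
\end{aligned}
```

Both routes give the same M-residue, so a single MMM with the precomputed constant<br />
$R^{2} \bmod M$ is enough to enter the Montgomery domain.<br />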
Algorithm 1 – 1 Montgomery exponentiation algorithm [86]; the definition of M′<br />
requires that gcd(M, R) = 1, b denotes the base or radix.<br />
Require: $M = (m_{k-1} \ldots m_0)_b$, $R = b^{k}$, $M' = -M^{-1} \bmod b$, $E = (e_t \ldots e_0)_2$ with<br />
$e_t = 1$, and an integer X, 1 ≤ X < M. The values $R^{2} \bmod M$ and $R \bmod M$<br />
may also be provided as precomputed inputs.<br />
Ensure: $A = X^{E} \bmod M$.<br />
1: $\bar{X}$ ⇐ MMM(X, $R^{2} \bmod M$)<br />
2: A ⇐ R mod M<br />
3: for i = t down to 0 do<br />
4: A ⇐ MMM(A, A)<br />
5: if $e_i$ = 1 then<br />
6: A ⇐ MMM(A, $\bar{X}$)<br />
7: end if<br />
8: end for<br />
9: A ⇐ MMM(A, 1)<br />
10: return A<br />
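The steps of Algorithm 1 – 1 can be followed in a short Python sketch. It assumes radix<br />
b = 2 and takes R as the smallest power of two above M; the MMM itself is modelled<br />
directly from its definition $XYR^{-1} \bmod M$ using Python's modular inverse, so this<br />
shows only the control flow of the exponentiation, not a division-free multiplier. The<br />
function name `mont_exp` is hypothetical.<br />

```python
def mont_exp(x, e, m):
    """Compute x**e mod m following the steps of Algorithm 1-1 (m odd, e >= 1)."""
    r = 1 << m.bit_length()           # R = 2^k > M
    r_inv = pow(r, -1, m)             # exists because gcd(M, R) = 1; Python 3.8+

    def mmm(a, b):
        # Reference model of MMM(a, b) = a * b * R^-1 mod M.
        return (a * b * r_inv) % m

    x_bar = mmm(x, (r * r) % m)       # Step 1: X to its M-residue
    a = r % m                         # Step 2: M-residue of 1
    for i in range(e.bit_length() - 1, -1, -1):   # Step 3: i = t down to 0
        a = mmm(a, a)                 # Step 4: square
        if (e >> i) & 1:              # Step 5: exponent bit e_i
            a = mmm(a, x_bar)         # Step 6: multiply
    return mmm(a, 1)                  # Step 9: back to the ordinary domain


print(mont_exp(65, 17, 3233) == pow(65, 17, 3233))
```

All intermediate values stay in the Montgomery domain; only the first and the last MMM<br />
call deal with conversion.<br />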
Therefore the first operation in Algorithm 1 – 1 (Step 1) maps the input value X to its<br />
M-residue $\bar{X}$.<br />
Now we show how to map an M-residue back to its ordinary integer form, which is done in<br />
the last operation of the exponentiation (Algorithm 1 – 1, Step 9). The Montgomery<br />
product of two M-residues $\bar{X}$, $\bar{Y}$ is itself the M-residue $\bar{S}$ of S = XY mod M:<br />
$$\bar{S} = \mathrm{MMM}(\bar{X}, \bar{Y}) = \bar{X}\bar{Y}R^{-1} \bmod M = XR \cdot YR \cdot R^{-1} \bmod M = XYR \bmod M = SR \bmod M \qquad (1.12)$$<br />
so the final operation required to convert the M-residue $\bar{S}$ back into S is defined as:<br />
$$S = \bar{S}R^{-1} \bmod M = 1 \cdot \bar{S} \cdot R^{-1} \bmod M = \mathrm{MMM}(1, \bar{S}) \qquad (1.13)$$<br />
The algorithm works for any modulus M provided that gcd(M, R) = 1. This is always the<br />
case in RSA, since M = pq is a product of two odd primes and therefore odd, while R is a<br />
power of 2 and thus shares no common factor with an odd M.<br />
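Equations (1.12) and (1.13) can be verified numerically. The sketch below again models<br />
the MMM from its definition and assumes the toy modulus M = 3233 with R = 4096; the<br />
operand values are arbitrary and the helper name `mmm` is hypothetical.<br />

```python
from math import gcd


def mmm(a, b, m, r):
    """Reference model of the Montgomery product a * b * R^-1 mod M."""
    return (a * b * pow(r, -1, m)) % m   # pow(..., -1, m) needs Python 3.8+


M = 3233                       # odd toy modulus (61 * 53)
R = 1 << M.bit_length()        # R = 2^12 = 4096
assert gcd(M, R) == 1          # required for R^-1 mod M to exist

X, Y = 65, 1234
r2 = (R * R) % M               # precomputed constant R^2 mod M
x_bar = mmm(X, r2, M, R)       # to the Montgomery domain, Eq. (1.11)
y_bar = mmm(Y, r2, M, R)
s_bar = mmm(x_bar, y_bar, M, R)    # product stays an M-residue, Eq. (1.12)
s = mmm(s_bar, 1, M, R)            # back to the ordinary domain, Eq. (1.13)
print(s == (X * Y) % M)
```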
The MMM algorithm for k-digit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, Y, and M in radix b<br />
(k-bit operands for b = 2) is given as Algorithm 1 – 2 [86].<br />
Algorithm 1 – 2 The Montgomery modular multiplication algorithm for k-digit (radix b)<br />
operands $X = (x_{k-1}, \ldots, x_1, x_0)$, Y, and M<br />
Require: $M = (m_{k-1} \ldots m_0)_b$, $X = (x_{k-1} \ldots x_0)_b$, $Y = (y_{k-1} \ldots y_0)_b$, with<br />
$0 \le X, Y < M$, $R = b^{k}$ with gcd(M, b) = 1, and $M' = -M^{-1} \bmod b$.<br />
Ensure: $S = XYR^{-1} \bmod M$.<br />
1: S ⇐ 0, $S = (s_{k-1} \ldots s_0)_b$<br />
2: for i = 0 to k − 1 do<br />
3: $q_i$ ⇐ $(s_0 + x_i y_0) M' \bmod b$<br />
4: S ⇐ $(S + x_i Y + q_i M)/b$<br />
5: end for<br />
6: if S ≥ M then<br />
7: S ⇐ S − M<br />
8: end if<br />
9: return S<br />
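Algorithm 1 – 2 translates almost line by line into Python. The sketch assumes a radix<br />
b = 2^w, so the division by b in Step 4 becomes a right shift; the function name<br />
`mmm_digits` and the parameter w are choices made for this example.<br />

```python
def mmm_digits(x, y, m, w, k):
    """Return x * y * b^-k mod m with b = 2^w (m odd, 0 <= x, y < m < b^k)."""
    b = 1 << w
    mask = b - 1
    m_prime = (-pow(m, -1, b)) % b        # M' = -M^-1 mod b; needs Python 3.8+
    s = 0                                 # Step 1
    for i in range(k):                    # Step 2: one digit of X per iteration
        x_i = (x >> (w * i)) & mask
        # Step 3: q_i makes the lowest digit of S + x_i*Y + q_i*M equal to zero.
        q_i = (((s & mask) + x_i * (y & mask)) * m_prime) & mask
        # Step 4: the sum is divisible by b, so the shift is an exact division.
        s = (s + x_i * y + q_i * m) >> w
    if s >= m:                            # Steps 6-8: single final subtraction
        s -= m
    return s


M, w, k = 3233, 4, 3                      # b^k = 16^3 = 4096 > M
R = 1 << (w * k)
print(mmm_digits(65, 1234, M, w, k) == (65 * 1234 * pow(R, -1, M)) % M)
```

Inside the loop only multiplications by a single digit, additions and shifts occur; the<br />
modular inverse is needed once, for the precomputed constant M′.<br />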
Thanks to the reduction interleaved with the multiplication in Algorithm 1 – 2 and to<br />
the precomputed constant M′, the expensive modular division is avoided during the<br />
computation. For a single multiplication the classical algorithm for modular<br />
multiplication would be faster than the MMM. Because of the rather expensive<br />
transformation to the Montgomery domain (M-residue) and back, it is more effective to<br />
stay in that domain as long as possible and to transform the operands back to the<br />
ordinary domain only at the very end of the computation. This requires a long sequence of<br />
MMMs, as is the case in modular exponentiation (Algorithm 1 – 1).<br />
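How the conversion overhead is amortised can be seen by counting the MMM calls made by<br />
Algorithm 1 – 1. The sketch below is a simple count under the assumption that every step<br />
of the algorithm costs one MMM; the function name `mmm_calls` is hypothetical.<br />

```python
def mmm_calls(e):
    """Number of MMM calls in Algorithm 1-1 for exponent e >= 1."""
    squarings = e.bit_length()            # Step 4 runs once per exponent bit
    multiplications = bin(e).count("1")   # Step 6 runs once per set bit
    conversions = 2                       # Step 1 (in) and Step 9 (out)
    return squarings + multiplications + conversions


for exponent in (17, 65537, (1 << 1024) - 1):
    total = mmm_calls(exponent)
    print(exponent.bit_length(), total, 2 / total)
```

For a full-length 1024-bit exponent the two conversion calls are a fraction of a percent<br />
of all MMM calls, whereas for a single multiplication they would dominate.<br />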
In Algorithm 1 – 1 the input operand X is transformed to the Montgomery domain,<br />
$\bar{X}$, at the beginning (Step 1). A series of MMMs in the Montgomery domain follows.<br />
Finally, in the last step (Step 9), the result is transformed back to the ordinary<br />
domain. In this way the advantage of computing in the Montgomery domain is fully<br />
exploited. The MMM is considered the most effective method for modular exponentiation as<br />
applied e.g. in the RSA cryptographic algorithm.<br />
1.2.2 Hardware Implementations of the MMM<br />
Achieving a short computation time for the MM, the most time-consuming operation in the<br />
RSA and ECC algorithms, has a significant impact on the performance of the elementary<br />
cryptographic operations. Efficient implementation of the algorithm has therefore been an<br />
attractive field of research. Because the operations are performed on long operands, a<br />
hardware platform seems a more natural choice than a software implementation. Since the<br />
operand size may change according to requirements and differs between RSA and ECC, a<br />
parameterized design in programmable logic would offer a universal design for fast<br />
prototyping.<br />
The implementations adapt the general algorithms to the features of the hardware platform and prefer operations that are easily implemented in digital logic. In general, the designs either aim at a universal and flexible solution or give priority to the best use of resources and the shortest computation time.
One of the most cited hardware implementations of the MMM was introduced at CHES 1999 by Tenca and Koç [108]. A cheap and flexible hardware accelerator for modular exponentiation can also be built using FPGAs. Results presented in the literature, e.g. [29, 41, 51], concentrate mainly on systolic-like implementations, which provide a very fast but less flexible solution.
Pre-computing partial results, as presented in [72], reduces the number of clock cycles required for a single MMM operation. This approach needs marginally more area than the original proposal [108], and its latency is comparable to that of the design presented in [85], which processes multi-precision operands in carry-save form. High-radix implementations [110] also reduce the number of computational steps, but the complexity of the logic increases substantially.
Current FPGAs provide an alternative hardware platform even for system-level integration of cryptographic hardware. A system-on-chip (SoC) concept typically includes an embedded processor with a set of dedicated coprocessors. For such a system, a highly flexible (although typically slower) scalable MMM coprocessor can be more attractive than a dedicated fixed-length one.
This direction was chosen in our research. Our goal is to analyse and implement a solution that allows quick prototyping of special-purpose hardware designs and uses features of the target platform to accelerate the MMM operation.
The radix-2 MMM algorithm (b = 2) is well suited to hardware implementation because its operations are easy to implement: a word-by-bit multiplication, a bit shift (division by two), and an addition. Higher-radix implementations have also been published [30, 110] and offer a viable alternative, but they require a more complex arithmetic unit.
Radix-2 Montgomery Multiplication Algorithm The simplified version of the MMM algorithm (Algorithm 1 – 2) for radix b = 2 and k-bit operands X = (x_{k−1}, …, x_1, x_0), Y, and M is given as Algorithm 1 – 3.
Algorithm 1 – 3 The basic radix-2 Montgomery multiplication algorithm for k-bit operands X = (x_{k−1}, …, x_1, x_0), Y, and M
Require: M = (m_{k−1} … m_0)_2, X = (x_{k−1} … x_0)_2, Y = (y_{k−1} … y_0)_2, M′ = −M^{−1} mod 2, E = (e_t … e_0)_2 with e_t = 1, R = 2^k, and an integer X, 1 ≤ X < M. The values R^2 mod M and R mod M may also be provided as precomputed inputs.
Ensure: S = X Y R^{−1} mod M.
1: S_0 ⇐ 0
2: for i = 0 to k − 1 do
3: q_i ⇐ (S_i + x_i Y) mod 2
4: S_{i+1} ⇐ (S_i + x_i Y + q_i M)/2
5: end for
6: if S_k ≥ M then
7: S_k ⇐ S_k − M
8: end if
9: S ⇐ S_k
10: return S
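The steps of Algorithm 1 – 3 translate almost line by line into a small integer model. The sketch below is illustrative only; the function name `mmm_radix2` and the 8-bit modulus are assumptions, and the reference value is computed with Python's built-in modular inverse.

```python
def mmm_radix2(X, Y, M, k):
    """Radix-2 Montgomery multiplication: returns X*Y*2^(-k) mod M."""
    S = 0
    for i in range(k):
        x_i = (X >> i) & 1
        q_i = (S + x_i * Y) & 1            # step 3: LSB decides whether M is added
        S = (S + x_i * Y + q_i * M) >> 1   # step 4: the sum is even, the shift is exact
    if S >= M:                             # steps 6-8: final conditional subtraction
        S -= M
    return S

M, k = 239, 8                # hypothetical odd modulus with 2^7 < M < 2^8
X, Y = 100, 200
R_inv = pow(1 << k, -1, M)   # reference R^-1 mod M (Python 3.8+)
print(mmm_radix2(X, Y, M, k) == (X * Y * R_inv) % M)
```

Because M is odd, adding q_i·M always clears the least significant bit, which is why the division by two never loses information.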
A comparison of Algorithms 1 – 2 and 1 – 3 shows how the choice b = 2 simplifies the operations inside the MMM. The modular reduction by the radix b becomes a check of the LSB, and the division in Step 4 is replaced by a simple right shift.
The radix-2 formulation was used as the starting point for deriving the scalable MMM design presented in [108, 109]. The features of this scalable architecture are discussed later. First we take a closer look at the operations of the algorithm and consider modifications that make them better suited for efficient execution on the chosen FPGA hardware platform.
The decision whether to add the modulus M to the temporary sum S_{i+1} depends on the value of the variable q_i, which is simple to obtain. The test checks the LSB of the partial sum S_i + x_i Y and stores it as q_i once the addition of x_i Y is finished (see step 3 of Algorithm 1 – 3). The stored value then decides whether M is added in the subsequent addition (step 4).
However, the second condition (see step 6 of Algorithm 1 – 3) causes a problem for pipelined execution. After the loop of additions, multiplications, and shifts, the comparison and the subsequent conditional subtraction are required. Without this final reduction step, the outcome of the inner loop can be an improper input for the subsequent multiplication; this happens when the final value of S is not smaller than M (S ≥ M). We intend to use the MMM in a series of multiplications, where the transformation into the Montgomery domain pays off over an expensive reduction, as shown in Algorithm 1 – 1. We therefore analyse how the final conditional step can be omitted by changing Algorithm 1 – 3, which makes the use of pipelined multipliers possible.
Algorithm Modifications The MMM algorithm (Algorithm 1 – 2) introduced earlier is extended further. Two variants of the algorithm are discussed and implemented; both support a scalable multiple-word implementation but handle carry processing in different ways.
In the modified Algorithm 1 – 4 we use the following input operands:
\[ X = \sum_{i=0}^{k} x_i 2^i = (0, 0, x_k, x_{k-1}, \ldots, x_1, x_0) < 2M , \tag{1.14} \]
\[ \tilde{Y} = \sum_{i=0}^{k} \tilde{y}_i 2^{i+1} = (y_k, \ldots, y_1, y_0, 0) < 4M , \tag{1.15} \]
where R = 2^{k+3}, Y < 2M, and M is a k-bit number with 2^{k−1} < M < 2^k (the same as in Algorithm 1 – 3). Note that Ỹ in Equation 1.15 is a left-shifted version of Y, with ỹ_0 = 0, and that X is extended with two zero bits at the MSB positions. This change simplifies the computation of q_i compared to Algorithm 1 – 3. The value of q_i needed for computing S_{i+1} is given directly as the LSB of S_i from the previous iteration (see step 4 of Algorithm 1 – 4). In this way the latency caused by the addition of x_i Y is removed, and the logic implementation can be simplified, too.
Algorithm 1 – 4 Optimized radix-2 Montgomery multiplication algorithm
Require: X = Σ_{i=0}^{k} x_i 2^i = (0, 0, x_k, x_{k−1}, …, x_1, x_0) < 2M, Ỹ = Σ_{i=0}^{k} ỹ_i 2^{i+1} = (y_k, …, y_1, y_0, 0) < 4M, R = 2^{k+3}, Y < 2M, and 2^{k−1} < M < 2^k.
Ensure: S = X Y R^{−1} mod M.
1: S_0 ⇐ 0
2: Ỹ ⇐ 2Y
3: for i = 0 to k + 2 do
4: q_i ⇐ S_i mod 2
5: S_{i+1} ⇐ (S_i + x_i Ỹ + q_i M)/2
6: end for
7: S ⇐ S_{k+3}
8: return S
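The recurrence of Algorithm 1 – 4 can be tried out with a small integer model. The sketch below is not the thesis implementation; the function name `mmm_opt` and the 16-bit modulus are assumptions chosen for illustration. It checks two properties on random inputs below 2M: the output stays below 2M, so it can be fed back as an operand without a final subtraction, and S·2^{k+2} ≡ XY (mod M). The exponent k + 2 is what this particular model yields when the recurrence is unrolled (k + 3 halvings against the one doubling in Ỹ = 2Y); it is stated here as an observation about the sketch only.

```python
import random

def mmm_opt(X, Y, M, k):
    """Recurrence of Algorithm 1-4: k+3 iterations, no final subtraction."""
    Y2 = 2 * Y                    # step 2: left-shifted multiplicand
    S = 0
    for i in range(k + 3):        # i = 0 .. k+2; bits x_{k+1}, x_{k+2} are zero
        x_i = (X >> i) & 1
        q_i = S & 1               # step 4: q_i is the LSB of the previous sum
        S = (S + x_i * Y2 + q_i * M) >> 1
    return S

k = 16
M = 0xC001                        # hypothetical odd modulus, 2^15 < M < 2^16
random.seed(1)
for _ in range(1000):
    X, Y = random.randrange(2 * M), random.randrange(2 * M)
    S = mmm_opt(X, Y, M, k)
    assert S < 2 * M                              # reusable as the next input
    assert (S << (k + 2)) % M == (X * Y) % M      # congruence observed in this model
print("ok")
```

Since Ỹ is even, the parity of the sum in each iteration depends only on S_i, which is why q_i can be read before the addition of x_i·Ỹ has been carried out.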
The inner loop of Algorithm 1 – 4 is executed with three more iterations than that of Algorithm 1 – 3. The higher number of iterations ensures that the inequalities S_i < 3M, i = 0, 1, …, k + 2, and S = S_{k+3} = MMM(X, Y) < 2M always hold, where S ≡ X Y R^{−1} (mod M). The result S = MMM(X, Y) can thus be reused as an input X or Y of the subsequent MMM. This modification avoids the originally proposed final correction step (the comparison and subtraction in step 6 of Algorithm 1 – 3) and makes pipelined execution of the algorithm in separate multipliers possible.
In typical applications (e.g. RSA), the input operands X, Y are pre-multiplied by the factor 2^{2k} mod M (Algorithm 1 – 3) or 2^{2k+6} mod M (Algorithm 1 – 4). A final MMM with the value 1 makes the final result smaller than M (with probability 1 − 2^{−(k+2)}, as shown in [29]) and provides the result XY mod M.
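The role of the pre-multiplication and of the final MMM with 1 can be traced using only the relation MMM(A, B) ≡ A B R^{−1} (mod M) from the Ensure lines of the algorithms. The worked steps below are a sketch; the barred symbols are notation assumed here for values in the Montgomery domain.

```latex
\begin{aligned}
\bar{X} &= \mathrm{MMM}(X,\; R^2 \bmod M) \equiv X \cdot R^2 \cdot R^{-1} \equiv X R \pmod{M},\\
\bar{Y} &= \mathrm{MMM}(Y,\; R^2 \bmod M) \equiv Y R \pmod{M},\\
\bar{Z} &= \mathrm{MMM}(\bar{X},\, \bar{Y}) \equiv X R \cdot Y R \cdot R^{-1} \equiv X Y R \pmod{M},\\
Z &= \mathrm{MMM}(\bar{Z},\, 1) \equiv X Y R \cdot R^{-1} \equiv X Y \pmod{M}.
\end{aligned}
```

With R = 2^k the constant R^2 is 2^{2k}, and with R = 2^{k+3} it is 2^{2k+6}, which matches the two factors quoted in the text.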
1.3 EC in Cryptography
The application of EC in public-key cryptography was proposed independently by Neal Koblitz and Victor S. Miller in 1985 [77, 87]. The advantage of using ECC instead of RSA or DSA [56] is that the key can be much shorter. The best known algorithm for solving the elliptic curve discrete logarithm problem (ECDLP) takes fully exponential time, while the algorithms for the integer factorization problem and the discrete logarithm problem take sub-exponential time. Key lengths for equivalent security levels are compared in Table 1 – 1 [91].
Table 1 – 1 Comparison of the key length (in bits) for equivalent security level for public-key cryptosystems

Security (bits)   DSA     RSA     ECC
80                1024    1024    160-223
112               2048    2048    224-255
128               3072    3072    256-383
192               7680    7680    384-511
256               15360   15360   512+
The fundamental and most expensive operation underlying ECC is point multiplication, which is defined in terms of field operations. For a point P and a positive integer k, the point multiplication kP is defined as the sum of k copies of the point P:
\[ kP = \underbrace{P + \ldots + P}_{k} . \tag{1.16} \]
Various algorithms have been proposed for more efficient computation of the point multiplication, taking into account whether the point P is fixed or unknown.
The EC over F, denoted E, is a curve given by an equation of the following form:
\[ E : y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 , \quad (a_i \in \mathbb{F}) \tag{1.17} \]
where E must be smooth.
We let E(F) denote the set of points (x, y) ∈ F^2 that satisfy this equation, along with a point at infinity denoted O. If the characteristic of F is neither 2 nor 3, then Equation 1.17 can be simplified to the commonly used form (the so-called Weierstraß form):
\[ y^2 = x^3 + ax + b . \quad (a, b \in \mathbb{F}) \tag{1.18} \]
In this case the smoothness condition is equivalent to requiring that the cubic on the right-hand side of Equation 1.18 has no multiple roots. This holds if and only if the discriminant of x^3 + ax + b, which is −(4a^3 + 27b^2), is nonzero.
The points of the EC form an Abelian group with the point O serving as its identity element. We now define the rules for point addition and point doubling (addition of a point to itself).
Let P = (x_P, y_P) ∈ E; then −P = (x_P, −y_P). If Q = (x_Q, y_Q) ∈ E and Q ≠ −P, then P + Q = (x_{P+Q}, y_{P+Q}). The formulas for point addition and doubling are given in Equations 1.19:
\[ \begin{aligned} x_{P+Q} &= \lambda^2 - x_P - x_Q \\ y_{P+Q} &= \lambda (x_P - x_{P+Q}) - y_P \\ \lambda &= \begin{cases} \dfrac{y_Q - y_P}{x_Q - x_P} & \text{if } P \neq Q \\[1ex] \dfrac{3x_P^2 + a}{2y_P} & \text{if } P = Q \end{cases} \end{aligned} \tag{1.19} \]
When P ≠ Q (addition), the formulas for computing P + Q require 1 inversion, 2 multiplications, and 1 squaring. When P = Q (doubling), the formulas for computing 2P require 1 inversion, 2 multiplications, and 2 squarings. Since field inversion is significantly more expensive than multiplication, it is advantageous to represent points in projective coordinates and to use inversion-free formulas [35].
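The affine formulas of Equations 1.19 can be exercised on a toy curve. The sketch below is illustrative only: the curve y^2 = x^3 + 2x + 2 over F_17 with the point (5, 1) is an assumed textbook-style example, the function names `ec_add` and `ec_mul` are hypothetical, and double-and-add is just one of the many point multiplication methods mentioned above. Each call to `pow(..., -1, p)` is the single field inversion counted in the text.

```python
def ec_add(P, Q, a, p):
    """Add two affine points on y^2 = x^3 + a*x + b over F_p (None stands for O)."""
    if P is None:
        return Q
    if Q is None:
        return P
    (xp, yp), (xq, yq) = P, Q
    if xp == xq and (yp + yq) % p == 0:       # Q = -P, so the result is O
        return None
    if P == Q:                                # doubling
        lam = (3 * xp * xp + a) * pow(2 * yp, -1, p) % p
    else:                                     # addition
        lam = (yq - yp) * pow(xq - xp, -1, p) % p
    x = (lam * lam - xp - xq) % p
    y = (lam * (xp - x) - yp) % p
    return (x, y)

def ec_mul(k, P, a, p):
    """Point multiplication kP by double-and-add, scanning k from the LSB."""
    result, addend = None, P
    while k:
        if k & 1:
            result = ec_add(result, addend, a, p)
        addend = ec_add(addend, addend, a, p)
        k >>= 1
    return result

a, p = 2, 17                 # toy curve y^2 = x^3 + 2x + 2 over F_17
P = (5, 1)
print(ec_add(P, P, a, p))    # 2P -> (6, 3)
print(ec_mul(3, P, a, p))    # 3P -> (10, 6)
```

On this curve the point (5, 1) has order 19, so `ec_mul(19, P, a, p)` returns `None`, the point at infinity.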
Before defining the ECDLP we introduce one more parameter of the EC. The order of a point P on an EC is the smallest positive integer n such that nP = O, where nP is the point multiplication defined in Equation 1.16.
The ECDLP is defined as follows: given a curve E over F, a point P ∈ E of order n, and a point Q ∈ E, find an integer l, 0 ≤ l ≤ n − 1, for which Q = lP, provided that such an integer exists.
As an example of a cryptographic operation computed on the EC we mention the elliptic curve digital signature algorithm (ECDSA), the equivalent of the DSA in the EC domain. The key is generated by the steps described in Algorithm 1 – 5 [78].
The signature of a message m of arbitrary length is computed as described in Algorithm 1 – 6 [78].
From a practical point of view, the performance of ECC depends on an efficient implementation of the finite field operations and on a fast algorithm for the scalar multiplication.
Algorithm 1 – 5 Key generation in ECC [78]
Require: E is an EC over F, P is a point of order n on curve E.
Ensure: Pair of private key and public key.
1: Choose a random integer d, 0 < d < n
2: Q ⇐ dP
3: return Q, the public key
4: return d, the private key
Algorithm 1 – 6 Message signing in ECC [78]
Require: Message m of arbitrary length, a hash value h(m) obtained from a one-way function.
Ensure: Signature of the message m.
1: Choose a random integer k, 0 < k < n
2: (x_1, y_1) ⇐ kP and r ⇐ x_1 mod n (0 < x_1 < q − 1)
3: if r = 0 then
4: Go back to step 1.
5: end if
6: Compute k^{−1} mod n
7: s ⇐ k^{−1} {h(m) + dr} mod n
8: if s = 0 then
9: Go back to step 1.
10: end if
11: return (r, s)
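Algorithms 1 – 5 and 1 – 6 can be followed end to end on a toy curve. The sketch below is for illustration only and is far too small to be secure: the curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of order 19, the use of SHA-256 as h, and all function names are assumptions. The `verify` function is the standard ECDSA check and is added only to show that the produced pair (r, s) is consistent; it is not part of the algorithms above.

```python
import hashlib
import secrets

A, PRIME, N = 2, 17, 19      # toy curve parameters and point order (assumed)
G = (5, 1)                   # base point P of order N

def add(P, Q):
    """Affine point addition/doubling; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    if P[0] == Q[0] and (P[1] + Q[1]) % PRIME == 0:
        return None
    if P == Q:
        lam = (3 * P[0] ** 2 + A) * pow(2 * P[1], -1, PRIME)
    else:
        lam = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, PRIME)
    x = (lam * lam - P[0] - Q[0]) % PRIME
    return (x, (lam * (P[0] - x) - P[1]) % PRIME)

def mul(k, P):
    """Double-and-add point multiplication."""
    R = None
    while k:
        if k & 1:
            R = add(R, P)
        P, k = add(P, P), k >> 1
    return R

def h(m):
    return int.from_bytes(hashlib.sha256(m).digest(), "big")

def keygen():
    d = secrets.randbelow(N - 1) + 1            # Algorithm 1-5, step 1: 0 < d < n
    return d, mul(d, G)                         # private key d, public key Q = dP

def sign(m, d):
    while True:
        k = secrets.randbelow(N - 1) + 1        # step 1
        x1, _ = mul(k, G)                       # step 2
        r = x1 % N
        if r == 0:
            continue                            # steps 3-5
        s = pow(k, -1, N) * (h(m) + d * r) % N  # steps 6-7
        if s != 0:                              # steps 8-10
            return r, s

def verify(m, sig, Q):
    """Standard ECDSA verification (not taken from the text above)."""
    r, s = sig
    if not (0 < r < N and 0 < s < N):
        return False
    w = pow(s, -1, N)
    X = add(mul(h(m) * w % N, G), mul(r * w % N, Q))
    return X is not None and X[0] % N == r

d, Q = keygen()
print(verify(b"message", sign(b"message", d), Q))
```

The dominant cost in both key generation and signing is the point multiplication `mul`, which is why the underlying field multiplier is the focus of the following chapters.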
1.4 Conclusions
In this section we have presented the two currently most important public-key cryptosystems, RSA and ECC.
While RSA has been applied massively by industry for several years, ECC, as a relatively new family of cryptographic algorithms, is only starting to gain ground as the better choice for public-key implementations, especially on energy- and area-limited platforms. The possibility of using a much shorter key, and therefore less demanding arithmetic operations, makes ECC particularly suitable for hardware implementation.
The description of both algorithms given in the thesis focuses on their most intensively used and most demanding operation, the modular multiplication. This makes the multiplication an important target for our research, as improvements in the implementation of the MM have a significant impact on the performance of any system based on modular operations, such as RSA or ECC.
The common part of both cryptosystems is a modular multiplier. After this theoretical introduction we continue with the description of multiplication algorithms adapted to the target hardware architecture and of the implementation itself.
2 Montgomery Modular Multiplication in Hardware
In this chapter we present the results of our research in the area of efficient implementation of the Montgomery modular multiplication (MMM) and its application in cryptographic systems. The resulting multiplier design can be included in cryptosystems or accelerators as a supporting unit for the computationally heavy operations of public-key algorithms such as RSA or ECC.
We focus on the design of the processing element (PE) that computes the MMM and of the coprocessor, which includes, besides the PE(s), the memory registers and an interface to the control unit.
The results of the research were published in [46, 49, 50, 113, 117, 118]. The main achievements of our research lie in the following areas:
• Analysis of two PE concepts – algorithm improvement, effective implementation in chosen FPGA families, comparison of the concepts,
• MMM coprocessor design – software-hardware co-design, scalability and parametrisation, interface with a control unit.
Section 2.1 explains the concept of a scalable MMM design. In Section 2.2 we analyse the MMM algorithms and an architecture for their effective implementation in reconfigurable hardware structures. The results of the area occupation and timing analysis are summarised in Section 2.3 and provide information on the available choices of multiplier parameters. The chapter is closed by Section 2.4, which summarises the discussed issues.
2.1 Scalable MMM design
An arithmetic unit is called scalable if it can be reused or replicated in order to generate long-precision results independently of the data precision for which the unit was originally designed [108]. In cryptography, the length of the input operands and of the key may vary depending on the chosen cipher working mode or when the algorithm is updated to a different security level. Scalability is therefore a desirable feature of a cryptographic arithmetic unit, and in such cases it pays off through reduced implementation costs. On the other hand, well-scalable designs can be slower than less universal ones optimised for selected parameters: the more universal a design is, the lower its speed in comparison to a system designed for fixed operand parameters.
A typical scalable coprocessor consists of two separate blocks, memory registers and an arithmetic logic unit (ALU), connected by a w-bit data path as shown in Figure 2 – 1. The word width w determines the smallest data unit operated on, the word. It divides the operand length k into shorter pieces that suit the target hardware structure better; w is usually a multiple of 8 bits.
[Figure: block diagram showing data input, data memory, scalable ALU (w-bit data path), control logic, and data output]
Figure 2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU connected by w-bit data-path
The separation of the ALU and the memory is the first fundamental difference from FPGA designs that include an MMM optimized for fixed-length operands (e.g. [29, 41]). The scalable algorithm requires word-oriented processing, which makes it possible to change the number of words or even the word width w. Normally w is smaller than the operand length k, so the computation takes proportionally more clock cycles. Better performance can still be achieved by implementing a smaller but faster ALU that allows a higher clock frequency.
Let us consider w-bit words. For operands with k-bit precision, e_1 = ⌈(k + 1)/w⌉ words are required for Algorithm 1 – 3. The extra bit in the calculation of e_1 is required because S_i (the internal variable of the radix-2 algorithm) is in the range [0, 2M − 1] [108]. All computations of Algorithm 1 – 3 must therefore be done with an extra bit of precision, and the input operands need an extra zero bit at the MSB position to extend their precision accordingly.
Algorithm 1 – 4 requires e_2 = ⌈(k + 3)/w⌉ words in order to support the extended range of the input variables X, Ỹ and of the internal variable S_i. Note that in many practical configurations e_1 = e_2, so no additional words are required for Algorithm 1 – 4. The operand X needs two extra zero bits at the MSB and the adjacent position in order to extend its precision to the k + 3 cycles required by Algorithm 1 – 4. In practical configurations k ≥ 1024, so the difference in the number of cycles is not significant. On the other hand, the possibility of removing the correction unit from the hardware design of Algorithm 1 – 4 is a valuable advantage.
In the rest of the thesis the symbols e_1 and e_2 denote the number of words where the difference between the algorithms needs to be emphasised; the symbol e denotes a number of words in general.
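The two word counts are easy to tabulate. The short sketch below evaluates e_1 = ⌈(k + 1)/w⌉ and e_2 = ⌈(k + 3)/w⌉ for a few parameter pairs; the function name `words` and the chosen (k, w) pairs are illustrative assumptions, not configurations taken from the thesis.

```python
def words(k, w):
    """Return (e1, e2): word counts for Algorithm 1-3 and Algorithm 1-4."""
    e1 = -(-(k + 1) // w)   # ceil((k + 1) / w) using integer arithmetic
    e2 = -(-(k + 3) // w)   # ceil((k + 3) / w)
    return e1, e2

for k, w in [(1024, 8), (1024, 16), (1024, 32), (2048, 32), (30, 32)]:
    print(k, w, words(k, w))
```

For k = 1024 and w = 32 both counts are 33, so the two extra bits of Algorithm 1 – 4 cost no additional word; the counts differ only when k + 1 or k + 2 happens to be a multiple of w, as in the artificial case k = 30, w = 32.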
2.1.1 Scalable Multiple-Word Algorithms
The operations in Algorithm 1 – 3 and Algorithm 1 – 4 are performed on full-precision operands and do not provide the scalability explained above. We analyse the relations between the parameters of the multipliers and the underlying FPGA structure and provide a solution suitable for devices with a fast carry architecture.
A scalable algorithm in which the operand Y (multiplicand) is scanned word by word and the operand X (multiplier) is scanned bit by bit was proposed in [108, 109]. The Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM) uses the following vectors:
\[ \begin{aligned} M &= (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}) \\ Y &= (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}) \\ S &= (S^{(e-1)}, \ldots, S^{(1)}, S^{(0)}) \\ X &= (x_{k-1}, \ldots, x_1, x_0) \end{aligned} \tag{2.1} \]
where the words are marked with superscripts and the bits are marked with subscripts. The concatenation of vectors a and b is written as (a, b). A particular range of bits in a vector a from position i to position j, j > i, is expressed as a_{j..i}. Bit position i of the k-th word of a is represented by the symbol a^{(k)}_i.
The details of the MWR2MM algorithm (further referred to as MWR2MM CSA, where CSA stands for Carry-Save Adder) are given in [108]; in this thesis it is denoted Algorithm 2 – 1. The optimized MMM Algorithm 1 – 4 can be transformed to a multiple-word form (referred to as MWR2MM CPA, where CPA stands for Carry-Propagate Adder) in a similar way, as shown in Algorithm 2 – 2. The algorithms are named after the way they are implemented, which is explained in the following parts of the thesis.
Algorithm 2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm
1: S ⇐ 0
2: for i = 0 to k − 1 do
3: C ⇐ 0
4: (C, S^{(0)}) ⇐ x_i Y^{(0)} + S^{(0)}
5: q_i ⇐ S^{(0)}_0
6: if q_i = 1 then
7: (C, S^{(0)}) ⇐ C + S^{(0)} + M^{(0)}
8: for j = 1 to e_1 − 1 do
9: (C, S^{(j)}) ⇐ C + x_i Y^{(j)} + M^{(j)} + S^{(j)}
10: S^{(j−1)} ⇐ (S^{(j)}_0, S^{(j−1)}_{w−1..1})
11: end for
12: S^{(e_1−1)} ⇐ (C, S^{(e_1−1)}_{w−1..1})
13: else
14: for j = 1 to e_1 − 1 do
15: (C, S^{(j)}) ⇐ C + x_i Y^{(j)} + S^{(j)}
16: S^{(j−1)} ⇐ (S^{(j)}_0, S^{(j−1)}_{w−1..1})
17: end for
18: S^{(e_1−1)} ⇐ (C, S^{(e_1−1)}_{w−1..1})
19: end if
20: end for
The algorithms compute a partial sum S for each bit of X, scanning the words of Y and M. Once the precision is exhausted, another bit of X is taken, and the scan is repeated. Thus, the algorithms MWR2MM CSA and MWR2MM CPA impose no constraints on the precision of the operands. What varies is the number of loop iterations i required to accomplish the MMM operation and the number of words for the input and internal operands, e_1 and e_2, respectively. The carry variable C must be from the set {0, 1, 2}, which is imposed by the addition of the three vectors S, M, and x_i Y or x_i Ỹ, respectively [108].
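The range of the carry can be checked with a short bound. The following worked inequality is a sketch that assumes every word is smaller than 2^w and that the incoming carry already satisfies C_in ≤ 2 (which holds at the start, where C = 0); the names C_in and C_out are introduced here only for this argument.

```latex
\begin{aligned}
C_{\mathrm{in}} + x_i Y^{(j)} + M^{(j)} + S^{(j)}
  &\le 2 + 3\,(2^{w} - 1) = 3 \cdot 2^{w} - 1 < 3 \cdot 2^{w}, \\
C_{\mathrm{out}} = \left\lfloor \frac{C_{\mathrm{in}} + x_i Y^{(j)} + M^{(j)} + S^{(j)}}{2^{w}} \right\rfloor
  &\le 2 .
\end{aligned}
```

By induction over the words j, the carry therefore never leaves the set {0, 1, 2}, so two bits are enough to store it.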
2.1.2 Comparison of Implementation Approaches
Two algorithms have been chosen for hardware implementation: the MWR2MM CSA algorithm (Algorithm 2 – 1) and the MWR2MM CPA algorithm (Algorithm 2 – 2). Our first goal is to show the difference between the algorithms at the algorithmic level; the second is to compare the ways in which the algorithms can be implemented.
Algorithm 2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm
1: S ⇐ 0
2: Ỹ ⇐ 2Y
3: for i = 0 to k + 3 do
4: C ⇐ 0
5: q_i ⇐ S^{(0)}_0
6: for j = 1 to e_2 − 1 do
7: (C, S^{(j)}) ⇐ C + x_i Ỹ^{(j)} + q_i M^{(j)} + S^{(j)}
8: S^{(j−1)} ⇐ (S^{(j)}_0, S^{(j−1)}_{w−1..1})
9: end for
10: S^{(e_2−1)} ⇐ (C, S^{(e_2−1)}_{w−1..1})
11: end for
The difference between the algorithms was motivated by the possibility of omitting the comparison of the final sum S with M at the end of the loop in Algorithm 1 – 3, at the price of a few extra loop iterations in the MWR2MM CPA algorithm. Another difference is in the computation of the variable q_i, which decides on the addition of M. In the MWR2MM CPA algorithm its value is given directly as the LSB of the zeroth word of the internal sum S computed in the previous loop iteration. In contrast, the MWR2MM CSA algorithm uses a value obtained after the addition of the term x_i Y, which increases the latency of computing q_i.
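A word-serial model makes the scanning order of the MWR2MM CPA algorithm concrete. The sketch below is an interpretation, not the thesis code: it assumes that word 0 is processed by the same inner loop as the other words (the listing starts the inner loop at j = 1) and that the outer loop runs k + 3 times as in Algorithm 1 – 4. The names `mwr2mm_cpa` and `mmm_opt` and the test values are hypothetical. The result is compared with a full-precision reference of the same recurrence.

```python
def mmm_opt(X, Y, M, k):
    """Full-precision reference of the Algorithm 1-4 recurrence."""
    S, Y2 = 0, 2 * Y
    for i in range(k + 3):
        S = (S + ((X >> i) & 1) * Y2 + (S & 1) * M) >> 1
    return S

def mwr2mm_cpa(X, Y, M, k, w):
    """Word-serial model: 2Y and M are scanned in w-bit words, X bit by bit."""
    e = -(-(k + 3) // w)                       # e2 = ceil((k + 3) / w) words
    mask = (1 << w) - 1
    Yw = [((2 * Y) >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    S = [0] * e
    for i in range(k + 3):
        x_i = (X >> i) & 1
        q_i = S[0] & 1                         # LSB of word 0 from the previous pass
        C = 0
        for j in range(e):                     # assumption: word 0 is handled here too
            t = C + x_i * Yw[j] + q_i * Mw[j] + S[j]
            C, S[j] = t >> w, t & mask         # carry stays within {0, 1, 2}
            if j > 0:                          # one-bit right shift across the word boundary
                S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (C << (w - 1)) | (S[e - 1] >> 1)
    return sum(word << (w * j) for j, word in enumerate(S))

k, M = 16, 0xC001                              # hypothetical 16-bit odd modulus
print(mwr2mm_cpa(12345, 54321, M, k, 8) == mmm_opt(12345, 54321, M, k))
```

Changing w changes only the number of words e and the length of the inner loop; the outer loop and the value of the result stay the same, which is the scalability property discussed in Section 2.1.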
The most important difference between MWR2MM CSA and MWR2MM CPA lies in the way the variable S is represented. In the carry-save redundant form applied in our implementation of the MWR2MM CSA algorithm, the sum S is represented as
\[ S^{(j)} = {}^{1}S^{(j)} + r \cdot {}^{2}S^{(j)} , \tag{2.2} \]
where r is the radix (in our implementation r = 2) and ^1S, ^2S are the two w-bit components of the sum S. The advantage of this representation is that no carry propagates inside the inner loop of the MMM algorithm. On the other hand, storing the partial sum S requires two w-bit registers instead of one. Only at the very end of the computation is the redundant form transformed to the normal representation by applying Equation 2.2. In this respect the CSA PE, which executes the MWR2MM CSA algorithm, is independent of the hardware platform and does not require any special features for the hardware implementation of the adders.
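The carry-save idea behind Equation 2.2 can be shown with a generic 3:2 compressor. The sketch below is a textbook-style illustration, not the PE of the thesis; the function name `csa` and the sample operands are assumptions. Three words are reduced to a sum word and a carry word using only bitwise operations, so no carry travels along the word.

```python
def csa(a, b, c):
    """3:2 carry-save compressor: a + b + c == s1 + 2 * s2."""
    s1 = a ^ b ^ c                        # bitwise sum without carries
    s2 = (a & b) | (a & c) | (b & c)      # carry bits, kept in place (weight 2)
    return s1, s2

a, b, c = 0b1011, 0b0110, 0b1101
s1, s2 = csa(a, b, c)
print(s1 + 2 * s2 == a + b + c)   # resolving the redundant form with r = 2
```

The single carry-propagate addition `s1 + 2 * s2` corresponds to the conversion to the normal representation that is postponed to the very end of the computation.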
In the implementation of the MWR2MM CPA algorithm all operands are processed and stored in non-redundant form, each requiring e_2 words of w-bit registers.
The different representation of the sum S in the implementations of the MWR2MM CPA and MWR2MM CSA algorithms has the following consequences:
1. The MWR2MM CPA algorithm uses less memory (only 80% of that used by MWR2MM CSA) for the same operand sizes.
2. The MWR2MM CPA algorithm does not require any correction unit for transforming the algorithm output in the final step, while the MWR2MM CSA algorithm requires at least a final conversion to non-redundant form.
3. The MWR2MM CPA algorithm allows a simpler computation of the internal variable q_i, which can simplify the architecture of the CPA PE.
4. The CSA PE is always faster than the CPA PE because no carry propagates in the inner loop of the algorithm. The CPA PE is slower but uses fewer logic resources. Within the same FPGA resources, more pipelined CPA PE stages can therefore potentially be used, which can speed up the solution and yield a better area-time (AT) product.
2.2 Multiplier Architecture
In this section we present the architecture of the implemented units for computing the MMM. The units are designed as dedicated coprocessors with a standardised interface to an external control unit. This approach makes it possible to connect several units to one controller and to compute MMMs in parallel. The peripheral multiplier can be mapped into the memory of the host processor, where the control operations are triggered by an interrupt or a control register.
An alternative approach would be a set of instructions supporting fast modular operations on a general-purpose processor. In that case, besides the resources of the target platform, the optimisation has to take the processor structure into account, which makes the design more specific to the chosen processor architecture.
In the processor plus dedicated coprocessor architecture, no special requirements are placed on the control unit apart from the interface specification, since the main computational effort is carried out in the coprocessor. In this way resources can be used significantly better when a large general-purpose processor is replaced by a small CPU with a coprocessor.
Besides the internal structure of the multipliers, we also discuss the pipeline structure of the coprocessor and its interconnection to the host, which can be an embedded soft-core or a stand-alone processor. The scalable designs offer several parameters that are chosen according to the required execution time and the available hardware resources.
2.2.1 Adder Concepts
In our designs we apply two different adder implementations, which are described in this section. The architectures designed for the MWR2MM CSA and MWR2MM CPA algorithms differ in the implementation of the adders inside the multiplier units.
The scalable chain of CSAs does not include any connection between the adder units (see Figure 2 – 2(b)), which makes it independent of the platform technology and of the length of the operands to be added.
The propagation of the carry bit in the CPA requires the connection length between the adders to be minimised. In an ASIC design this critical datapath can be optimised to achieve the best possible performance. In FPGAs, on the other hand, the underlying architecture cannot be changed; only the logical behaviour and the interconnections provided by the device vendor can be reconfigured. FPGA vendors provide a feature that can be exploited when a very fast connection between adjacent logic elements (LEs) is required, as is the case for the scalable chain of CPAs.
To accelerate the normally slow carry propagation in the CPA unit, the fast carry chain network included in modern FPGAs is used (see Figure 2 – 2(a)). The best performance of the carry chain is achieved inside one logic array block (LAB). The number of LEs in one LAB depends on the FPGA type; typical values are 16, 32, and so on. If the adder width w is larger than the number of LEs in a LAB, the carry chains of several LABs need to be interconnected. To keep the fast carry connection over this longer carry chain, the connected LABs should be placed next to each other in one column. This is possible only when the place and route (P&R) tool is able to recognise the carry chain in the synthesised logic and exploits the hardware architecture of the target device to provide a fast interconnection.
Figure 2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs: (a) carry-propagate adder, FAs linked by the carry chain; (b) carry-save adder, independent FAs
We can conclude that the speed of the CPA PE depends significantly on the word length (the length of the carry chain). However, we can suppose that up to a certain word length, $w \le w_{max}$, the speed of the CPA PE is not critical, because the final speed is dominated by the embedded memory access time or by another critical path in the logic. The value $w_{max}$ may differ between technologies because of different routing and a distinct physical layout (number of LEs in a LAB). The question is whether $w_{max}$ lies in the range of on-chip memory widths allowed by the available FPGAs. If so, the variables could be stored and processed with the optimal word width, achieving the best area-time product.
Carry-Save Adder Unit
The whole computational complexity of both algorithms lies in two additions of three w-bit operands for computing $S_{i+1}$. The propagation of the carry bits between the w adders is (in general) too slow. The implementation of MWR2MM CSA in [108] therefore uses a redundant representation of the intermediate sum S and carry-save adders [38]. The w-bit MWR2MM CSA PE architecture based on full adders (FAs) is depicted in Figure 2 – 3.
In order to reduce the storage size and the arithmetic hardware complexity, the variables X, Y, and M are available in a non-redundant form. The intermediate internal sum S is received and generated in the redundant form as ${}^{1}S$ and ${}^{2}S$. The advantage of the redundant form is that the latency is independent of the word length w, as there is no direct connection between the FAs. The output of the adders is valid right after the input signals appear, and the delay is given mainly by the internal combinational logic of the FA.
Figure 2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA (bit slices w − 1 to 0 built from FAs, with signals $x_i$, $q_i$, $Y^{(j)}$, $M^{(j)}$ and the redundant components of $S^{(j)}$ and $S^{(j-1)}$)
The processing delay may increase for larger w only as a result of the broadcast problem; it does not depend on the arithmetic operation itself. Conversion into the normal non-redundant representation is done only at the very end of the MMM computation. The intermediate result S may be shifted further to another MMM unit as operand X or Y for a new computation (e.g. the next iteration of a modular exponentiation). The redundant representation of variables, which requires twice as much memory as a non-redundant one, and the need for transformation to and from the redundant form have been considered the main drawbacks of the MWR2MM CSA algorithm. A positive property of the implementation is its independence of carry chain logic on the target platform.
Carry-Propagate Adder Unit
Recent FPGAs contain high-speed interconnect lines between adjacent logic blocks, designed to provide efficient carry propagation. The CPA PE architecture presented in this thesis is optimal for implementing the MMM unit on any FPGA that has dedicated carry logic (e.g. modern Altera and Xilinx FPGAs). The basic organisation of the ALU consists of two layers of conventional CPAs, as shown in Figure 2 – 4.
Figure 2 – 4 Block diagram of CPA-based w-bit MWR2MM processing element (CPA PE) based on FA (bit slices w − 1 to 0 built from FAs, with signals $x_i$, $q_i$, $Y^{(j)}$, $M^{(j)}$, $S^{(j)}$, $S^{(j-1)}$ and the carries $C_a$, $C_b$)
Unlike the CSA PE, the CPA PE does not support an arbitrary word width w. The limit on the number of FAs in one row is given by the target technology. The more LEs are chained by fast (and short) interconnections, the larger the word width can be while still achieving speed comparable to the CSA PE. The carry signal raised in the first FA from the left (for the LSB) is processed in the adjacent FA, which outputs another carry signal for the third adder in the row, and so on. In this way the carry signal is propagated to the rightmost FA (for the MSB). Once that FA receives a valid carry value and computes its outputs, the complete w-bit result can be passed on to the next computation. From this description we can see that the delay caused by carry propagation grows linearly with the number of connections, which is given by the word width w.
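The linear dependence of the carry delay on w can be made visible with a bit-level model of a single carry-propagate adder. This is a minimal sketch with hypothetical function names; it models one w-bit CPA, not the two-layer ALU of the CPA PE, and it counts full-adder stages rather than real gate or routing delays.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum bit, carry out)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def ripple_carry_add(x, y, w):
    """Add two w-bit integers FA by FA.

    Returns (sum mod 2**w, carry out, longest run of consecutive FAs
    that passed a carry on to their neighbour).
    """
    total, carry, run, longest = 0, 0, 0, 0
    for i in range(w):                       # LSB first, the way the carry travels
        a, b = (x >> i) & 1, (y >> i) & 1
        s, carry = full_adder(a, b, carry)   # stage i must wait for stage i - 1
        total |= s << i
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return total, carry, longest
```

In the worst case, for example adding 1 to an operand of all ones, the carry passes through all w stages, whereas the carry-save form never needs this sequential pass inside the inner loop.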
Pipeline Structure
Both algorithms, MWR2MM CSA and MWR2MM CPA, share the same data dependencies. A detailed analysis of the potential inner parallelism and an investigation of a pipelined organisation suitable for implementing the MWR2MM CSA algorithm can be found in [108, 109]. That analysis can be applied directly to the MWR2MM CPA algorithm as well. Its most important result, the possibility of operating the multipliers in pipelined stages, is applied in the FPGA implementations presented in this thesis.
The main advantage of the scalable architecture for the MMM is that the PEs can easily be replicated to increase the throughput of the coprocessor [108]. In the pipelined version several slightly modified PEs (some registers have to be added for temporary data storage) are connected in a cascade (see Figure 2 – 5).
Figure 2 – 5 Pipelined organization of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory (cascade of PE 1, PE 2, ..., PE n with bit inputs $x_i$, $x_{i-1}$, ..., $x_{i-n+1}$, word signals Y, M and S passed between the stages, and a data memory block)
The maximum degree of pipelining that can be obtained with this architecture is
$$n_{max} = \left\lceil \frac{e + 1}{2} \right\rceil \tag{2.3}$$
The number 2 in the denominator expresses the number of clock cycles after which the output of the MMM unit is valid. It also means that new values for the input variables of the PEs in the pipelined row are delivered every third clock cycle. Output data from one stage are kept in temporary registers between adjacent stages for one clock cycle and then delivered to the subsequent stage. Each stage includes a second register at its input, which provides the total delay of two clock cycles required by the computation process.
To keep the internal control logic simple, the number of stages n is restricted to values that divide the number of words e (n | e). Thanks to this simplification, at the moment the computation finishes the last word of the sum S is at the output of the last unit in the row and is shifted directly to the memory for storage. For an arbitrary n, a word shift between the stages at the end of the computation would have to be implemented. Adding this feature requires extra logic in the data path, which has a negative influence on the maximal clock frequency; therefore it is not supported in our designs.
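The two restrictions on the number of stages (the limit of Equation 2.3 and the condition n | e) can be combined into a short helper. The sketch below assumes that the brackets in Equation 2.3 denote a ceiling; the function names are hypothetical.

```python
def n_max(e):
    """Maximum pipeline depth, ceil((e + 1) / 2), in integer arithmetic."""
    return (e + 2) // 2


def valid_stage_counts(e):
    """All stage counts n with n | e and n <= n_max(e)."""
    return [n for n in range(1, n_max(e) + 1) if e % n == 0]
```

For e = 64 words the limit is 33 stages, but only the divisors 1, 2, 4, 8, 16 and 32 satisfy n | e, so 32 is the largest usable pipeline depth under these assumptions.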
The number of clock cycles needed for a single MMM operation in a design containing $n \le n_{max}$ MMM units can be computed as
$$T_{MMM} = \frac{k^2}{wn} + 2n = \left\lceil \frac{ew}{n} \right\rceil e + 2n \tag{2.4}$$
From Equation 2.4 we can see that the number of stages n has a significant impact on the computation time: the dominant first term is inversely proportional to n. When fewer than $n_{max}$ MMM units are available, the total execution time $T_{MMM}$ increases. On the other hand, the area occupied by the coprocessor can be adjusted to the area constraints of the target device. Implementing $n < n_{max}$ stages also means more operations for reading from and storing to the memory. Shifting the processed data between the stages is faster than storing the intermediate results in the memory block and reading them again to finish the computation. Therefore the best performance is achieved in a design with the maximal number of stages ($n = n_{max}$).
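The effect of n on the cycle count can be checked numerically. The following sketch evaluates Equation 2.4 under the stated restrictions (k = we, n | e, n not larger than the limit of Equation 2.3, whose brackets are assumed to denote a ceiling); the function name and the error handling are hypothetical.

```python
def t_mmm(k, w, n):
    """Clock cycles of one MMM according to Equation 2.4."""
    if k % w:
        raise ValueError("operand length k must be a multiple of w")
    e = k // w                               # number of words, k = w * e
    if e % n:
        raise ValueError("number of stages n must divide e")
    if n > (e + 2) // 2:
        raise ValueError("n exceeds the limit of Equation 2.3")
    return (e * w // n) * e + 2 * n          # equals k**2 / (w * n) + 2 * n
```

For k = 1024 and w = 16 (e = 64), a single stage needs 65538 cycles, 16 stages need 4128 cycles and 32 stages need 2112 cycles, so each doubling of n roughly halves the computation time.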
Parametrisation
The MMM coprocessor has three variable parameters (w, e, and n) that can be chosen for any implementation. The number of pipelined stages n and the word width w are chosen according to the area available for the coprocessor and the required timing of the MMM computation. The security level of the public-key algorithm defines the operand length of the multiplier (k = we). This approach gives high flexibility to the processor and coprocessor design.
In general, there are two ways to increase the speed of the MMM computation in the proposed designs (see Equation 2.4 for the relations between the design parameters and the computation time $T_{MMM}$):
1. Increase the word length w. This reduces the number of iterations given by e and so yields a shorter computation time. While older FPGAs provide memory blocks with a dual-port feature and configurable word lengths only up to 16 bits (Altera Apex [8]), high-performance models offer up to 32 bits for middle-sized blocks or 128 bits for large memory blocks (Altera Stratix II [20]). Since the capacity of one block is sufficient for typical RSA operands, it makes sense to use only one block per operand. With an older technology with smaller memory blocks and a larger chosen word width (16 < w ≤ 32), two memory blocks per variable are required. Depending on the memory configuration, several variables may share one memory block. The mapping of operands to memory is especially important for constrained SoC designs with a limited number of memory blocks.
2. Increase the number of pipelined stages n. The hardware structure of the PE in both solutions (CSA PE and CPA PE) is relatively simple and fast and does not depend on the number of stages, which was a condition for a scalable design. Adding several pipelined stages can increase the overall speed, especially if access to the embedded memory is a bottleneck (as is the case for FPGAs with limited routing resources for large w).
From the previous analysis we can conclude that the word width w is chosen according to the architecture of the target platform, the organisation of its memory blocks, and its support for fast carry operations. The number of pipelined stages n is adapted to the available chip size.
2.2.2 Memory Block
The operands are stored in a memory block that is included in the data path. Optimising the memory organisation and its connection to the ALU helps to achieve better performance. Because of the intensive exchange of data between the memory and the ALU, this connection is often part of the longest (critical) path of the logic and influences the maximal clock frequency of the circuit.
Depending on the number of pipelined stages (n) and the number of iterations given by the number of words (e), the operand data are read out of the memory, processed by the PEs, and stored back several times. The memory block may contain input data loaded by a control unit, intermediate results, and final results ready to be sent back to the host processor after the computation has finished. Note that different words of an operand are loaded and stored at the same time. Therefore the memory has to support a dual-port configuration, which makes it possible to address reads and writes at separate places in the memory. The organisation of the dual-port memory register inside the MMM coprocessor for one of the variables is depicted in Figure 2 – 6.
Figure 2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits (memory unit of e x w bits, words 0 to e − 1; port A and port B each with their own data and address lines)
In the coprocessor we need to store four operands for the MMM computation: the three input operands X, Y, M and the result S. Storing S requires one register in the non-redundant and two registers in the redundant representation. The scalability of the ALU needs to be carried over to the memory block, too.
The requirements for a scalable design mean that the architecture is easily adaptable to operand lengths different from the one for which the system was originally designed. In the memory block the number of stored variables is constant (four or five, depending on the chosen implementation). What varies is the number of words and consequently the number of bits needed to address them.
We propose a model in which each word of every variable can be addressed from the coprocessor as well as from the host unit. We distinguish an internal address of a word, which specifies its location in the given coprocessor and register; a register address, which selects the register holding the required variable; and finally a coprocessor address, which distinguishes between several ALUs. With this memory management a control unit can address any word of a chosen coprocessor, store the input values for a computation there, and afterwards read the results for further processing. The number of address bits at each level can be adapted to the number of coprocessors, variables, and words. The address width is usually given by the word width of the interface between the processor and the coprocessor. For an address longer than the interface word width, an appropriate address model needs to be chosen, either accepting several address signals in parallel or differentiating the address type in another way.
Table 2 – 1 Address of operands from host processor level (LSB right)
coprocessor | register | internal
XX | XXX | XXXXXXX
The memory address bits are assigned as shown in Table 2 – 1 (LSB on the right). With the presented address format the CPU can handle up to 4 MMM coprocessors (two address bits), each with 8 operands (three address bits) composed of 128 words (seven address bits). Such a configuration is suitable for RSA computations with operand length k = 2048 bits and word width w = 16 bits, which gives e = 128 words.
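The address format of Table 2 – 1 can be expressed as simple bit-field packing. This is a minimal sketch with the field widths of the example (2, 3 and 7 bits); the function and constant names are hypothetical and the range checks are an added convenience, not part of the described interface.

```python
COPROC_BITS, REG_BITS, WORD_BITS = 2, 3, 7    # field widths from Table 2-1


def encode_address(coproc, reg, word):
    """Pack (coprocessor, register, internal word) into one 12-bit address."""
    fields = ((coproc, COPROC_BITS), (reg, REG_BITS), (word, WORD_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("address field out of range")
    return (coproc << (REG_BITS + WORD_BITS)) | (reg << WORD_BITS) | word


def decode_address(addr):
    """Split a 12-bit address back into (coprocessor, register, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    reg = (addr >> WORD_BITS) & ((1 << REG_BITS) - 1)
    coproc = (addr >> (REG_BITS + WORD_BITS)) & ((1 << COPROC_BITS) - 1)
    return coproc, reg, word
```

For a different number of coprocessors, variables or words only the three width constants change, which mirrors the adaptability of the address levels described above.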
2.2.3 Interface to Controller
The way in which the MMM coprocessor is connected to the control unit (e.g. an embedded processor) is important for controlling the computation and for exchanging the processed data.
Our first objective is to find a solution that allows a fast and flexible exchange of input and output data between the memory of the host processor and the internal memory block of the MMM coprocessor. The requirement for flexibility is related to the scalability of the coprocessor, which may include several MMM units. Moreover, the internal word widths of the control unit and the coprocessor may differ.
Another goal is to optimise the control of the coprocessor(s). Triggering the computations and then checking their status plays an important role, especially in configurations with several coprocessors (not necessarily MMM coprocessors) operated by one control unit, when it is undesirable to block the operations running on the host processor.
Finally, the goal is also to design an interface that is universal and can be connected to different types of processor buses with a minimal amount of glue logic.
The interface that satisfies the requirements mentioned above is depicted in Figure 2 – 7. The function of the individual signals is explained in the rest of this section.
Figure 2 – 7 Proposed universal interface for the MMM coprocessor (signals: clock, reset, chip select, write enable, irq, address bus, data bus)
Status and Control Interface
The operations inside the MMM coprocessor are controlled by a control register that is mapped into the memory of the control unit via the interface. In the presented solution there are two control bits:
bit 0 controls the multiplication/squaring process. Set it to 1 to trigger the computation, 0 for idle.
bit 1 switches between multiplication and squaring. Set it to 0 to compute the MMM on the input parameters X and Y; set it to 1 to square the value stored in memory register Y (multiply the operand by itself).
In the solution published in [117] a status register is used to check the current status of the coprocessor and of the computation. Its LSB is set during data storing and computation. After triggering the computation, the processor has to check the status register regularly. Once the multiplication or squaring has finished, the status bit changes to 0. The control unit is then expected to read the results from the MMM coprocessor and, if required, repeat the operation with new operands.
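The polling scheme can be sketched in software as follows. The bit positions follow the description above; everything else is an assumption made for illustration: `MockCoprocessor` is a hypothetical stand-in for the memory-mapped registers (it only simulates the busy bit and computes nothing), and writing 0 to the control register after completion is assumed from the remark that 0 means idle.

```python
CTRL_START = 1 << 0     # bit 0: 1 triggers the computation, 0 is idle
CTRL_SQUARE = 1 << 1    # bit 1: 0 multiplies X by Y, 1 squares Y
STATUS_BUSY = 1 << 0    # LSB of the status register, set while busy


class MockCoprocessor:
    """Hypothetical register model that stays busy for a fixed number of polls."""

    def __init__(self, busy_polls):
        self.control = 0
        self._busy_polls = busy_polls
        self._remaining = 0

    def write_control(self, value):
        self.control = value
        if value & CTRL_START:
            self._remaining = self._busy_polls

    def read_status(self):
        if self._remaining > 0:
            self._remaining -= 1
            return STATUS_BUSY
        return 0


def run_operation(dev, square=False, max_polls=1000):
    """Trigger one operation and poll the status register until it is idle.

    Returns the number of polls that reported a busy coprocessor.
    """
    dev.write_control(CTRL_START | (CTRL_SQUARE if square else 0))
    for polls in range(max_polls):
        if not dev.read_status() & STATUS_BUSY:
            dev.write_control(0)            # assumed: return the control bits to idle
            return polls
    raise TimeoutError("coprocessor did not finish")
```

The loop keeps the host processor occupied while it waits, which is the drawback that an interrupt-driven interface avoids.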
The version described in [49] communicates over an interrupt (signal irq in Figure 2 – 7). This solution is more suitable for software control of coprocessors and for configurations with several MMM coprocessors. After the MMM has been computed, the interrupt signal to the host processor is asserted. This state persists until the processor reads the results within the interrupt routine. Thereafter new operands can be loaded into the memory and the whole process started again.
Memory Operations
The transfer of operands between the control unit and the coprocessor is carried out by a pair of control signals (chip select, denoting the particular coprocessor, and write enable, signalling a store operation) and by the address and data buses.
The format of the operand address has been explained in Table 2 – 1. The chip select signal of the corresponding coprocessor is asserted according to the address decoded by the interface. Since the input operands X, Y and M are only written by the control unit, while the operand S is used exclusively as the output register of the coprocessor, their addresses may be shared. The particular operand register is then selected by the write enable signal together with the address.
When the internal word widths of the processor and the coprocessor do not match, additional functionality is required from the interface to perform the memory alignment and the proper decoding of the memory address.
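The kind of alignment needed for mismatched word widths can be illustrated by re-packing an operand from host words into coprocessor words. This is a minimal sketch under an assumed convention that the least significant word comes first; the text does not specify the word order, and the function name is hypothetical.

```python
def repack_words(words, src_bits, dst_bits):
    """Re-pack little-endian words of src_bits each into words of dst_bits each."""
    value = 0
    for i, word in enumerate(words):
        if not 0 <= word < (1 << src_bits):
            raise ValueError("word does not fit into src_bits")
        value |= word << (i * src_bits)
    total_bits = len(words) * src_bits
    count = -(-total_bits // dst_bits)        # ceiling division
    mask = (1 << dst_bits) - 1
    return [(value >> (i * dst_bits)) & mask for i in range(count)]
```

With a 32-bit host and w = 16, for example, every host word maps to two consecutive internal addresses of the coprocessor register, so the interface also has to scale the internal address accordingly.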
Clock Signal Distribution
As there may be a need for faster (in general, different) clocking of the dedicated coprocessor, we analyse a solution with separate clock signals for the two parts of the system.
The clock signal of the control processor controls, through the bus and the interface, all transfers between the memories of the processor and the coprocessor. The operations inside the MMM coprocessor are clocked by an external (usually faster) clock signal.
Note that an additional clock signal also requires extra resources for its generation. That may cause problems in constrained embedded systems on low-end FPGAs with a small number of clock-generating circuits (e.g. PLLs). On the other hand, the performance improvement is significant. Thanks to this organisation of clock signals, almost three times higher performance of the MMM coprocessor was obtained in [49] in comparison to the implementation using the same clock signal for both units [117].
2.3 Implementation of the MMM
In this section we present the parameters obtained for the MMM units implemented according to the theory presented in the previous parts of the thesis. The MWR2MM CSA and MWR2MM CPA algorithms are compared by implementing the PEs on several families of FPGAs produced by Altera. Further, we summarise the implementation results of the MMM coprocessor, discuss a software-hardware co-design approach, and compare its results with a software implementation of the MMM. Finally, we provide a summary of the implementation results.
2.3.1 Comparison of CSA and CPA PE
Tables 2 – 2 and 2 – 3 present the results of the MWR2MM CSA and MWR2MM CPA PE implementations (including the data storage registers necessary for the pipelined version) in different Altera FPGAs for various word lengths w.
Several interesting facts can be seen in these tables. With the exception of the CPA PE implemented in the ACEX family, the two solutions are technology independent as far as area occupation is concerned. The size of the block (in LEs) depends almost linearly on the word length w. The CSA PE always occupies more resources than the CPA PE.
FEI KEMT<br />
Table 2 – 2 PE sizes and speeds for old style Altera FPGAs<br />
Device | w (bits) | CPA PE size (LEs) | CPA PE speed (MHz) | CSA PE size (LEs) | CSA PE speed (MHz)<br />
ACEX EP1K100-1 [7] | 8 | 66 | 161 | 81 | 232<br />
ACEX EP1K100-1 [7] | 16 | 130 | 129 | 161 | 202<br />
ACEX EP1K100-1 [7] | 32 | 258 | 99 | 321 | 170<br />
APEX EP20K160-1 [8] | 8 | 59 | 161 | 81 | 232<br />
APEX EP20K160-1 [8] | 16 | 115 | 129 | 161 | 202<br />
APEX EP20K160-1 [8] | 32 | 227 | 99 | 321 | 170<br />
Table 2 – 3 PE sizes and speeds for new style Altera FPGAs<br />
Device | w (bits) | CPA PE size (LEs) | CPA PE speed (MHz) | CSA PE size (LEs) | CSA PE speed (MHz)<br />
CYCLONE EP1C20-6 [13] | 8 | 59 | 277 | 81 | 304<br />
CYCLONE EP1C20-6 [13] | 16 | 115 | 235 | 161 | 304<br />
CYCLONE EP1C20-6 [13] | 32 | 227 | 221 | 321 | 304<br />
STRATIX EP1S10-6 [18] | 8 | 59 | 271 | 81 | 304<br />
STRATIX EP1S10-6 [18] | 16 | 115 | 248 | 161 | 304<br />
STRATIX EP1S10-6 [18] | 32 | 227 | 214 | 321 | 304<br />
The most important fact concerns the speed of the PEs. As could be expected,<br />
the CSA PE is always faster, and its speed varies either only slightly (for old<br />
families) or almost not at all (for recent families, probably due to enhanced<br />
routing possibilities) with the word length w. However, the speed of the CPA PE in<br />
the older families decreases significantly with the word length (by about 40% from<br />
8 bits to 32 bits). Recent Altera devices use an enhanced carry chain. The so-called<br />
carry-select chain uses a redundant (hard-wired) carry calculation to increase the<br />
speed of carry functions. This feature makes the processing times of the CPA PE<br />
comparable to those of the CSA PE (slower by about 10 to 30%). Since the CPA PE is<br />
about 20% smaller, one can improve the final speed by increasing the number of<br />
pipelined stages. However, this approach does not seem to be adequate for word<br />
lengths w > 32 bits.<br />
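The difference between the two adder styles compared above can be illustrated in software.<br />
The following is a minimal Python sketch, not the thesis VHDL: the function names are<br />
hypothetical, and the bit-serial loop only models the dependency of each carry on the<br />
previous one, not the dedicated carry chains of the FPGA.<br />

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a sum word and a carry word.

    Every bit position is handled independently, so the delay does not
    grow with the word length (the property exploited by the CSA PE).
    """
    partial_sum = a ^ b ^ c                      # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carry, moved to its weight
    return partial_sum, carry


def carry_propagate_add(a, b, width):
    """Bit-serial addition: each carry depends on the previous bit position.

    This models why the delay of a plain CPA grows with the word length.
    """
    result, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return result | (carry << width)


s, c = carry_save_add(13, 7, 9)
total = carry_propagate_add(s, c, 8)   # one final propagation resolves the carries
```

In a CSA-based datapath the redundant (sum, carry) pair is kept between iterations and the<br />
carry is propagated only once, at the end of the computation.<br />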
2.3.2 Montgomery Multiplication Coprocessor<br />
With the optimised PE for the MMM computations available, our objective is to<br />
complete the MMM coprocessor with all necessary parts. The memory registers, the<br />
interface to the control unit and the clock distribution logic are integral parts<br />
of the MMM coprocessor. An IP block including all these design units is well suited<br />
to quick system development: it provides the full functionality for operations that<br />
require the MMM and a universal interface for connection to the control processor.<br />
The architecture of the coprocessor and all its parts has been discussed in<br />
Section 2.2. Table 2 – 4 gives the area occupation and the critical path, expressed<br />
as the maximal clock frequency, on the Altera APEX 20K200E FPGA. As the sample<br />
configuration we have chosen the MMM coprocessor with a multiplier unit based on the<br />
MWR2MM CSA algorithm with operand word width w = 32 and precision k = 1024 and<br />
k = 2048 bits, respectively.<br />
Table 2 – 4 Area occupation in number of LEs and maximal clock frequency fclkMMM (MHz) of<br />
the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm<br />
n | k = 1024: area (LEs) | k = 1024: fclkMMM (MHz) | k = 2048: area (LEs) | k = 2048: fclkMMM (MHz)<br />
1 | 542 | 107.22 | 551 | 105.83<br />
2 | 1100 | 110.43 | 1136 | 106.96<br />
3 | 1621 | 108.34 | 1644 | 104.39<br />
4 | 1943 | 106.67 | 1980 | 103.85<br />
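For orientation, the operation that the coprocessor computes for a given precision k can be<br />
sketched in software. The Python function below is a plain bit-serial radix-2 Montgomery<br />
multiplication and is only an illustration under stated assumptions: it is not the<br />
word-serial, pipelined MWR2MM algorithm of the thesis, and the function name and the test<br />
modulus are hypothetical.<br />

```python
def montgomery_multiply_radix2(x, y, m, k):
    """Return x * y * 2**(-k) mod m for odd m and 0 <= x, y < m < 2**k."""
    if m % 2 == 0:
        raise ValueError("the modulus must be odd")
    s = 0
    for i in range(k):
        if (x >> i) & 1:   # scan x bit by bit, least significant bit first
            s += y
        if s & 1:          # make s even by adding the odd modulus
            s += m
        s >>= 1            # exact division by 2
    if s >= m:             # s < 2m holds here, so one subtraction suffices
        s -= m
    return s


# Usage: convert to the Montgomery domain, multiply, convert back.
k = 61
m = 2**61 - 1                       # hypothetical odd modulus below 2**k
r2 = pow(2, 2 * k, m)               # R^2 mod m with R = 2**k
x, y = 123456789, 987654321
x_bar = montgomery_multiply_radix2(x, r2, m, k)       # x * R mod m
y_bar = montgomery_multiply_radix2(y, r2, m, k)       # y * R mod m
p_bar = montgomery_multiply_radix2(x_bar, y_bar, m, k)
product = montgomery_multiply_radix2(p_bar, 1, m, k)  # back to x * y mod m
```

The loop runs k times regardless of the operand values, which is also why multiplication and<br />
squaring take the same time when both share one such unit.<br />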
2.3.3 Hardware-Software Co-design of MMM: a Case Study<br />
A SOC architecture is typical for configurable platforms. Such an approach reduces<br />
the production costs and at the same time provides a very suitable platform for<br />
cryptographic applications. The SOC minimises the number of external interfaces<br />
and in this way also decreases the amount of leaked information.<br />
Another advantage of the SOC is that hardware and software solutions can be<br />
compared more directly, because in the SOC both solutions occupy the same<br />
resources. The choice of the optimal utilisation of resources can therefore be<br />
based on a proper analysis.<br />
The fully software solution usually needs relatively large logic resources and small<br />
memory resources to implement the processor, and sometimes a large memory to<br />
implement the program code. The fully hardware solution needs greater logic resources<br />
and possibly some data memory. In a mixed hardware-software design, parallel<br />
and time-critical operations can be done in hardware (dedicated coprocessors)<br />
and complex sequential and control operations in software (main processor). In<br />
our SOC design the speedup factor of the coprocessor relative to the entirely<br />
software-based solution can be measured quite easily: both implementations use the<br />
same embedded processor, the Altera Nios soft core described in the following<br />
paragraph.<br />
Embedded Nios Processor The Nios CPU [10] is a pipelined general-purpose<br />
RISC processor that is generated by a proprietary Altera VHDL generator (SOPC<br />
Builder) and can be synthesised and embedded in all recent Altera FPGAs. The<br />
Nios supports both 32-bit and 16-bit architectural variants. Both variants use 16-bit<br />
instructions. The principal features of the Nios instruction set architecture are:<br />
1. large, windowed register file,<br />
2. simple, complete instruction set,<br />
3. powerful addressing modes,<br />
4. extensibility.<br />
Existing Nios peripherals (e.g. UART, timer, ...) as well as new custom peripherals<br />
can be connected through an Avalon bus [9]. Avalon is a simple bus architecture<br />
designed for connecting on-chip processor(s) and peripherals together into a SOC.<br />
Comparison of Implementations The Nios processor is used as a control unit<br />
in the mixed implementations and as the main processor in the software<br />
implementation. The 32-bit version of the Nios CPU can optionally be configured to<br />
include a hardware-supported integer multiplier. The additional logic is used by the<br />
MUL instruction to compute a 32-bit result in three clock cycles (see footnote 1).<br />
This option is not supported in the 16-bit Nios instruction set. In order to obtain<br />
realistic comparisons, the 32-bit Nios CPU with the hardware-supported MUL<br />
instruction was used for the software implementation.<br />
Footnote 1: When using the MUL option with Altera Stratix devices, the hardware<br />
multiplier uses the Stratix DSP blocks for implementation.<br />
In order to compare the approaches, we have implemented three different systems:<br />
1. Fully software solution implemented on a 32-bit Nios processor.<br />
2. Mixed software-hardware design with a 16-bit Nios processor and the pipelined<br />
coprocessor including the CSA PE.<br />
3. Mixed software-hardware design with a 16-bit Nios processor and the pipelined<br />
coprocessor including the CPA PE.<br />
Further, we provide the details of each system design and comment on the obtained<br />
results.<br />
1. The software implementation of the MMM algorithm has been written in the<br />
Nios assembly language using all known optimization techniques for the<br />
target processor. The Separated Operand Scanning (SOS) MMM method [39]<br />
was used as the best method for the given Nios RISC architecture [66].<br />
Table 2 – 5 shows the timings for the execution of the MMM on the fully<br />
software solution running on the processor clocked at 50 MHz. The 32-bit<br />
Nios processor occupies 2137 LEs without the logic for the integer multiplier<br />
(for the MUL instruction), which requires an additional 446 LEs.<br />
In a software implementation it is effective to apply different algorithms<br />
for multiplication and squaring, which reduces the execution time of the<br />
squaring operation. However, because of the vulnerability to side-channel<br />
attacks it is better to align the execution times of both operations.<br />
Table 2 – 5 Execution times of the software implementation of the MMM on the Altera Nios<br />
development board (with APEX EP20K200 clocked at 50 MHz)<br />
Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 | SOS32MEM | 2.40 | 1.87<br />
2048 | SOS32MEM | 9.47 | 7.24<br />
2. In the mixed hardware-software design, multiplication and squaring are<br />
completely implemented in hardware. Both operations share the same arithmetic<br />
unit. Because the computational load moves from the main processor to the<br />
dedicated coprocessor, the 32-bit version of the Nios core is not needed.<br />
Instead of the 32-bit controller one can include the 16-bit Nios processor,<br />
which is powerful enough to control the process and reduces the resource<br />
usage to a reasonable 1275 LEs.<br />
The MMM coprocessor is based on a 16-bit (w = 16) CSA PE with 6 (n = 6)<br />
pipelined stages and occupies 1290 LEs. The total area occupation of the<br />
second, mixed hardware-software solution (1275 + 1290 = 2565 LEs) is<br />
comparable to that of the purely software solution (2137 + 446 = 2583 LEs).<br />
The processor has been clocked at 50 MHz and the MMM coprocessor at 150 MHz.<br />
The times necessary for MMM and squaring are presented in Table 2 – 6.<br />
Table 2 – 6 Execution times of the mixed hardware-software implementation of the MMM on the<br />
Altera Nios development board (with APEX EP20K200) for the CSA PE<br />
Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 = 64 × 16 | MWR2MM CSA | 0.073 | 0.073<br />
2048 = 128 × 16 | MWR2MM CSA | 0.291 | 0.291<br />
3. The third design we analyse is based on the same system architecture as the<br />
one introduced in the second point. This time the MMM coprocessor includes<br />
the 16-bit (w = 16) CPA PE with 9 (n = 9) pipelined stages. The parameters<br />
were chosen so that the occupied area is comparable to that of the other two<br />
design variations. The processor has been clocked at 50 MHz and the MMM<br />
coprocessor at 100 MHz. The results obtained for this configuration are<br />
presented in Table 2 – 7.<br />
Table 2 – 7 Execution times of the mixed hardware-software implementation of the MMM on the<br />
Altera Nios development board (with APEX EP20K200) for the CPA PE<br />
Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 = 64 × 16 | MWR2MM CPA | 0.069 | 0.069<br />
2048 = 128 × 16 | MWR2MM CPA | 0.278 | 0.278<br />
2.3.4 Implementation Results<br />
The presented results were obtained after the P&R process in the Altera Quartus<br />
development system, version 2.2. The simulation and synthesis of the designs were<br />
done in development tools from Mentor Graphics included in the FPGA Advantage<br />
package. The carry chains in the CPA PE have been implemented using<br />
the lpm_add_sub function from the Library of Parameterized Modules (LPM), a<br />
technology-independent library of logic functions that are parameterized to achieve<br />
scalability and adaptability.<br />
All the logic has been described in VHDL, taking into account the scalability<br />
and the possible choice of the system parameters. Apart from the memory registers<br />
block and the carry chain logic, the designs are fully portable to any FPGA platform.<br />
In Subsection 2.3.1 we have summarised the differences between the two<br />
concepts chosen for the implementation of the PE for the MMM. The results of the MMM<br />
coprocessor implementation show the importance of the clock distribution unit, since<br />
the achieved maximal clock frequency of the coprocessor exceeds the typical<br />
working frequency of the control unit (the Nios soft-core processor in our case).<br />
According to the previous analysis, the critical path of the coprocessor does not<br />
change with an increasing number of pipelined stages n, and the relation between the<br />
occupied area and the computational time for the MMM operation stays linear.<br />
From the case study, whose objective was to find an optimal utilisation of the<br />
platform resources, we can draw the following conclusions. Of the three designs,<br />
whose parameters were chosen to achieve a comparable area occupation, the slowest<br />
is the software solution (see footnote 2). The two designs including the optimised<br />
MMM units implemented in hardware provide computational times around 30 times<br />
shorter. Of the CSA and CPA concepts, the latter provides slightly better times.<br />
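The speedup figure quoted above can be checked directly from the multiplication times in<br />
Tables 2 – 5, 2 – 6 and 2 – 7. The short Python sketch below only repeats that arithmetic;<br />
the variable and function names are illustrative.<br />

```python
# Multiplication times in ms, keyed by operand length in bits (Tables 2-5 to 2-7).
software_ms = {1024: 2.40, 2048: 9.47}    # 32-bit Nios, SOS32MEM
csa_ms = {1024: 0.073, 2048: 0.291}       # 16-bit Nios + CSA coprocessor
cpa_ms = {1024: 0.069, 2048: 0.278}       # 16-bit Nios + CPA coprocessor


def speedup(baseline, accelerated):
    """Ratio of baseline time to accelerated time for every operand length."""
    return {bits: baseline[bits] / accelerated[bits] for bits in baseline}


csa_speedup = speedup(software_ms, csa_ms)
cpa_speedup = speedup(software_ms, cpa_ms)
for bits in software_ms:
    print(bits, round(csa_speedup[bits], 1), round(cpa_speedup[bits], 1))
```

All four ratios fall between 30 and 36, with the CPA variant slightly ahead of the CSA<br />
variant for both operand lengths.<br />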
2.4 Conclusions and Future Work<br />
The chapter covers topics related to the effective implementation of the algebraic<br />
coprocessor for the MMM operation. We compared two basic concepts of the multiplier<br />
architecture. The improvements of the algorithm are related to the reconfigurable<br />
platform chosen for the implementation. This pair of concepts was chosen to present<br />
the contribution of the dedicated carry chain logic in recent FPGA families and<br />
to compare it to the classical approach with the CSA.<br />
Footnote 2: In fact, the instruction set of the Nios processor has been enhanced by<br />
the hardware-supported MUL instruction. The completely software solution gives<br />
results too poor to be considered in the comparison.<br />
The analysed multiplier PE provides the core unit for the developed MMM coprocessor.<br />
We paid attention to keeping the scalability of the PE also in the other parts of<br />
the system. The interface of the coprocessor provides a flexible and powerful<br />
connection that follows the way the processor handles its peripherals.<br />
The presented MMM coprocessor was successfully incorporated into SOCs with two<br />
types of control unit: in this chapter the Altera Nios soft-core processor was<br />
applied; in Chapter 4 we describe a system controlled by an ARM processor.<br />
The obtained solution is very flexible and, thanks to its scalability and the<br />
possibility to choose between two types of PE, it can be adapted to a large range of<br />
target platforms and applications. The features of the MMM coprocessor were confirmed<br />
by two proof-of-concept implementations. In this chapter we consider the coprocessor<br />
application for an RSA-based public key cryptosystem, in which the typical operand<br />
length exceeds 1000 bits. In Chapter 4 we present a design of the coprocessor<br />
dedicated to integer factoring based on elliptic curves. The IP block covering the<br />
MMM coprocessor with all its features supports fast development of embedded systems.<br />
Among the areas in which we see possible improvements of the design, we mention<br />
a better memory management for variables smaller than the total capacity of the<br />
memory block. The RSA application can be enhanced by the CRT method, which<br />
requires shorter operands. Thanks to its scalability, the MMM coprocessor can meet<br />
such a requirement in the future.<br />
3 Elliptic Curve Method in Hardware - preliminaries<br />
Hardware implementations of factoring algorithms require special purpose devices<br />
suitable for effective execution of intensive computations. In this chapter we provide<br />
preliminaries for the topic of ECM hardware implementation.<br />
In Section 3.1 we start with an introduction to factoring in general and present<br />
the motivation for implementing the ECM in hardware. The chapter continues<br />
with a summary of previous work in the area of ECM implementation<br />
(Section 3.2). The mathematical background of the method and a closer look at both<br />
phases of the ECM are given in Section 3.3.<br />
3.1 Integer Factoring<br />
In the previous parts of the thesis we have explained that the security of the RSA<br />
cryptosystem relies on the difficulty of factoring large integers. Hence, the<br />
development of a fast factorisation method could allow the cryptanalysis of messages<br />
encrypted or signed by RSA. However, the problem of factorisation has so far<br />
remained hard.<br />
In this section we start with basic facts on integer factoring and present the most<br />
important factoring methods. Further, the ECM is described as a promising method<br />
for hardware implementation.<br />
3.1.1 Factoring Algorithms<br />
We provide definitions of terms related to factoring and an introduction to the<br />
factoring methods; both can also be found in [80].<br />
Factoring a positive integer n means finding positive integers u and v such that<br />
the product of u and v equals n, and such that both u and v are greater than 1.<br />
Such u and v are called factors (or divisors) of n, and n = uv is called a factorisation<br />
of n. Positive integers that can be factored are called composites. Positive integers<br />
greater than 1 that cannot be factored are called primes.<br />
Some factorisation methods use a property of integers called smoothness. We<br />
say that a positive integer is B-smooth if all its prime factors are ≤ B. An integer<br />
is said to be smooth with respect to S, where S is some set of integers, if it can be<br />
completely factored using the elements of S. We often simply use the term smooth,<br />
in which case the bound B or the set S is clear from the context.<br />
We start with the simplest method for integer factoring, namely trial division.<br />
The smallest prime factor p of n can be found by testing whether n is divisible by<br />
each prime in succession, until p is reached. If we assume that a table of all<br />
primes ≤ p is available, this process takes π(p) division attempts (called trial<br />
divisions), where π(p) is the number of primes ≤ p, i.e. the prime counting function,<br />
which can be approximated by π(p) ≈ p/ln(p).<br />
Since a composite n has at least one factor ≤ √n, factoring n using trial division<br />
takes approximately √n operations in the worst case. For many composites trial<br />
division is therefore infeasible as a factoring method. For most numbers it is very<br />
effective, however, because most numbers have small factors: 88% of all positive<br />
integers have a factor < 100, and almost 92% have a factor < 1000.<br />
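Trial division and the two percentages quoted above can be reproduced with a short Python<br />
sketch. The function names are hypothetical; the density is computed as one minus the<br />
product of (1 − 1/p) over all primes p below the bound, which is the fraction of integers<br />
divisible by at least one such prime.<br />

```python
def primes_up_to(bound):
    """Sieve of Eratosthenes: all primes <= bound (bound >= 2 assumed)."""
    sieve = [True] * (bound + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(bound ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def smallest_prime_factor(n, primes):
    """Trial division over a prime table; returns (factor, divisions used)."""
    for count, p in enumerate(primes, start=1):
        if n % p == 0:
            return p, count
    return None, len(primes)


def fraction_with_factor_below(bound):
    """Fraction of positive integers having a prime factor < bound."""
    fraction_without = 1.0
    for p in primes_up_to(bound - 1):
        fraction_without *= 1 - 1 / p
    return 1 - fraction_without


factor, divisions = smallest_prime_factor(1403, primes_up_to(100))
print(factor, divisions)                       # divisions equals pi(factor)
print(round(fraction_with_factor_below(100), 3))
print(round(fraction_with_factor_below(1000), 3))
```

For n = 1403 = 23 · 61 the smallest factor 23 is reached after π(23) = 9 trial divisions.<br />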
Several more efficient algorithms for factoring integers have been proposed. Each<br />
algorithm is appropriate for a different situation. For instance, the ECM [82] allows<br />
the efficient factoring of numbers with relatively small factors. The generalised<br />
number field sieve (GNFS, see [81]) is the best algorithm for factoring numbers with<br />
large factors and, hence, can be used for attacking the RSA cryptosystem.<br />
In the GNFS, many mid-size integers arise that have to be checked for smoothness,<br />
i.e. whether they decompose completely into small prime factors. The sieving step of<br />
the GNFS finds some of these factors. After dividing them out, one obtains a co-factor<br />
that has to be checked for smoothness. Let us call this step the co-factorisation<br />
or smoothness test. An appropriate choice for this task is the multiple polynomial<br />
quadratic sieve (MPQS, see [104]) or the ECM.<br />
3.1.2 Motivation for Hardware Implementation<br />
The current world record in factoring a random RSA modulus is 200 decimal digits; it<br />
was achieved in 2005 with a complete software implementation of the GNFS [63],<br />
using the MPQS for the factorisation of the cofactors. For larger moduli it becomes<br />
crucial to use special hardware for factoring. Recently, some new hardware<br />
architectures for the sieving step in the GNFS have been proposed (e.g., SHARK [64],<br />
TWIRL [103]). The efficiency of, e.g., SHARK (and possibly other innovative<br />
GNFS realizations) is directly related to efficient support units for smoothness<br />
testing within the architecture.<br />
It appears that the ECM rather than the MPQS is the better choice<br />
for the smoothness test, since the MPQS requires a larger silicon area and irregular<br />
operations. The ECM, on the other hand, is an almost ideal algorithm for dramatically<br />
improving the area-time product through special purpose hardware. We summarise<br />
the advantages of the ECM in the following points:<br />
1. The ECM performs a very high number of operations on a very small set of input<br />
data; hence, it is not very I/O intensive.<br />
2. The ECM requires relatively little memory compared to other methods.<br />
3. The operands needed for supporting the GNFS are well beyond the width of<br />
current computer buses, arithmetic units and registers, so special purpose<br />
hardware can provide much better efficiency in implementation and<br />
computational time.<br />
4. The nature of the smoothness testing in the GNFS allows a very high degree<br />
of parallelisation.<br />
The key to efficient ECM hardware with a parallel architecture lies in fast<br />
arithmetic units. Such units for modular addition and multiplication have been studied<br />
thoroughly in the last few years, e.g. for use in cryptographic devices including<br />
ECC (see e.g. [71,92]). Therefore, we could exploit the well-developed area of ECC<br />
architectures for our ECM design.<br />
3.2 Previous Implementations of ECM<br />
To our knowledge, the ECM had never been implemented in hardware before our work.<br />
In the context of special-purpose hardware for the GNFS, [27] mentions that the<br />
construction of special ECM hardware might be promising for supporting the GNFS.<br />
However, only two concepts for an ECM hardware implementation have been published<br />
so far. The first one, also presented in this work, is a proof-of-concept design<br />
proposed by Jan Pelzl, Martin Šimka et al. [65, 94, 120]. The second one, by Kris<br />
Gaj et al. [67], improves on our proposal and provides the most recent reference for<br />
ECM implementation.<br />
The main differences between the two concepts are in the following areas:<br />
• control logic - external vs. internal, i.e. the way control over the computation<br />
is distributed between the ECM units and the central control logic,<br />
• memory management - thanks to a better organisation of the memory registers and<br />
single-port memory access, the design of Gaj et al. requires significantly<br />
fewer memory blocks than ours (with dual-port access and a separate memory<br />
block for each register),<br />
• parallelisation - better computational times are achieved by parallel execution<br />
of arithmetic operations and the addition of a second multiplier,<br />
• Montgomery multiplier - while in our concept the multiplier design is based<br />
on the proposal of Tenca and Koç [108], in the design of Gaj et al. the multiplier<br />
comes from McIvor and McLoone [85]. It provides a shorter computation<br />
time, but also a less flexible architecture, which can be a disadvantage when<br />
the ECM parameters change.<br />
By selecting a faster multiplier and utilising the resources better than our<br />
proof-of-concept design, the authors improved the AT product by a<br />
factor of 3.7 for Phase 1 and 6.4 for Phase 2, using the same hardware<br />
platform.<br />
In the software domain, there have been several attempts to apply the ECM to<br />
factorisation.<br />
A parallel software implementation of the ECM on several workstations<br />
(PentiumII@350 MHz, Linux OS) is reported in [123]. The implementation uses fast<br />
network switches and has been programmed based on the Message-Passing Interface<br />
(MPI) standard.<br />
Two massively parallel implementations of the ECM based on systolic versions of the<br />
MMM are described in [45]. The authors apply a single instruction, multiple data<br />
(SIMD) approach on a particular type of parallel computer.<br />
A well-known free software implementation of the ECM for factoring integers is<br />
available from [128] (GMP-ECM). The implementation is based on the GNU multiple<br />
precision (GMP) arithmetic library. The original purpose of the project was<br />
to find a factor of 50 digits or more by the ECM. The participation of several<br />
developers made GMP-ECM an excellent resource for a state-of-the-art ECM software<br />
implementation, including many useful tweaks.<br />
3.3 Mathematical Background<br />
The principles of the ECM are based on Pollard’s (p − 1)-method [95]. Therefore we<br />
start with a short summary of Pollard’s method. Afterwards we describe<br />
H. W. Lenstra’s ECM [82].<br />
3.3.1 Pollard’s (p − 1)-algorithm<br />
Let $k, n \in \mathbb{N}$ with $n$ being the composite to be factored. Furthermore, let $p \mid n$ with<br />
$p \in \mathbb{P}$. Let $a \in \mathbb{Z}$ and $n$ be co-prime, i.e. $\gcd(a, n) = 1$. Let $e = k(p-1)$.<br />
1. By Fermat's little theorem,<br />
\[
\begin{aligned}
a^{p-1} \equiv 1 \bmod p \;&\Rightarrow\; a^{k(p-1)} \equiv 1 \bmod p \\
&\Leftrightarrow\; a^{e} \equiv 1 \bmod p \\
&\Leftrightarrow\; a^{e} - 1 \equiv 0 \bmod p \\
&\Leftrightarrow\; p \mid (a^{e} - 1).
\end{aligned}
\]<br />
2. $p \mid n$ yields $\gcd(a^{e} - 1, n) > 1$.<br />
3. If $a^{e} \not\equiv 1 \bmod n$, then $1 < \gcd(a^{e} - 1, n) < n$. In this case, we have found a<br />
non-trivial divisor of $n$.<br />
Obviously, we cannot compute $e = k(p-1)$ without the knowledge of $p$. Instead,<br />
we assume that $p - 1$ can be decomposed into many small factors below a certain<br />
bound $B_1$. In this case, $p - 1$ is called $B_1$-smooth.<br />
Let $B_2$ denote the highest prime power dividing $p - 1$ and choose $e$ such that<br />
\[
e = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}. \tag{3.1}
\]<br />
By computing $a^{e}$ and $d = \gcd(a^{e} - 1, n)$ we hope to find a non-trivial<br />
factor $d$ of $n$.<br />
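The procedure above can be sketched in Python as follows. This is an illustrative sketch<br />
under stated assumptions, not an implementation from the thesis: the function names and<br />
the example modulus n = 1403 = 23 · 61 are chosen here, the base a = 2 is a default, and<br />
$B_1 \le B_2$ is assumed so that every exponent in Equation 3.1 is at least 1.<br />

```python
from math import gcd


def primes_up_to(bound):
    """Sieve of Eratosthenes: all primes <= bound (bound >= 2 assumed)."""
    sieve = [True] * (bound + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(bound ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def pollard_exponent(b1, b2):
    """Exponent e of Equation 3.1: product of the largest powers p^r <= b2
    over all primes p <= b1 (assumes b1 <= b2)."""
    e = 1
    for p in primes_up_to(b1):
        power = p
        while power * p <= b2:
            power *= p
        e *= power
    return e


def pollard_p_minus_1(n, b1, b2, a=2):
    """Return a non-trivial factor of n, or None if these bounds fail."""
    g = gcd(a, n)
    if g != 1:                       # the base already shares a factor with n
        return g if g < n else None
    e = pollard_exponent(b1, b2)
    d = gcd(pow(a, e, n) - 1, n)     # a^e is computed modulo n
    return d if 1 < d < n else None


# 61 - 1 = 60 = 2^2 * 3 * 5 is 5-smooth, while 23 - 1 = 22 = 2 * 11 is not.
print(pollard_exponent(5, 5), pollard_p_minus_1(1403, 5, 5))
```

With $B_1 = B_2 = 5$ the exponent is $e = 2^2 \cdot 3 \cdot 5 = 60$, so $2^{e} \equiv 1 \bmod 61$ but not modulo 23,<br />
and the gcd exposes the factor 61.<br />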
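In general, Pollard’s method can be defined as follows:<br />
Let $G_p = (\mathbb{Z}_p)^{*}$ and $G_n = (\mathbb{Z}_n)^{*}$ be multiplicative groups and let $\phi$ be the<br />
canonical homomorphism<br />
\[
\phi : G_n \to G_p \quad \text{(reduction modulo } p\text{)} \tag{3.2}
\]<br />
A factor of $n$ is found if simultaneously $a^{e} \not\equiv 1 \bmod n$ and $a^{e} \equiv 1 \bmod p$, i.e.<br />
\[
\begin{aligned}
&\forall k_1 \in \mathbb{N} : e \neq k_1 \cdot \operatorname{ord}_{G_n}(a), \\
&\exists k_2 \in \mathbb{N} : e = k_2 \cdot \operatorname{ord}_{G_p}(\phi(a)).
\end{aligned}
\]<br />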
3.3.2 ECM Algorithm<br />
In 1987, H. Lenstra came up with the idea of translating Pollard’s method from<br />
the groups $G_p$ and $G_n$ to the groups of points on elliptic curves $E$ modulo $n$ and<br />
modulo $q$ [82]. Indeed, a group operation in $E(\mathbb{Z}_n)$ can be defined by using the<br />
given addition formulae [32].<br />
The homomorphism $\phi$ corresponding to the one defined in Equation 3.2 is:<br />
\[
\phi : E(\mathbb{Z}_n) \to E(\mathbb{Z}_q) \quad \text{(reduction of coordinates modulo } q\text{)} \tag{3.3}
\]<br />
The exponentiation in Pollard’s $(p-1)$ method is replaced by a point multiplication.<br />
Let $n$ be an integer without small prime factors which is divisible by at least two<br />
different primes, one of them $q$. Such numbers appear after trial division and a quick<br />
prime power test. Let $E(\mathbb{Z}_n)$ be an elliptic curve with good reduction at all prime<br />
divisors of $n$ (this can be checked by calculating the gcd of $n$ and the discriminant<br />
of $E$, which very rarely yields a prime factor of $n$) and let $P \in E(\mathbb{Z}_n)$ be a point<br />
with $P \neq \mathcal{O}$.<br />
A factor of $n$ is found if $k \cdot P$ is not equal to the identity element in $E(\mathbb{Z}_n)$ but<br />
$k \cdot \phi(P)$ equals the identity element in $E(\mathbb{Z}_q)$, i.e.<br />
\[
\begin{aligned}
&\forall k_1 \in \mathbb{N} : k \neq k_1 \cdot \operatorname{ord}_{E(\mathbb{Z}_n)}(P), \\
&\exists k_2 \in \mathbb{N} : k = k_2 \cdot \operatorname{ord}_{E(\mathbb{Z}_q)}(\phi(P)).
\end{aligned}
\]<br />
Let the elliptic curve $E$ be defined by the homogeneous Weierstrass equation:
\[
y^{2} z = x^{3} + a x z^{2} + b z^{3} \tag{3.4}
\]
In this case, the above conditions yield two properties for the z-coordinate $z_Q$ of the resulting point $Q = k \cdot P$:
\[
k \ne k_1 \cdot \mathrm{ord}_{E(\mathbb{Z}_n)}(P) \;\Leftrightarrow\; n \nmid z_Q, \qquad
k = k_2 \cdot \mathrm{ord}_{E(\mathbb{Z}_q)}(\varphi(P)) \;\Leftrightarrow\; q \mid z_Q.
\]
Under these conditions, a non-trivial factor $d$ of $n$ is obtained by $d = \gcd(z_Q, n)$.
With the assumption that the order of $P$ is $B_1$-smooth and does not contain any prime power larger than $B_2$, the scalar $k$ is computed in the same way as $e$ in Equation 3.1:
\[
k = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{r \in \mathbb{N} : p_i^{r} \le B_2\}. \tag{3.5}
\]
If the order of $P \in E(\mathbb{F}_q)$ satisfies certain smoothness conditions described below, we can discover the factor $q$ of $n$ as follows:
In the first phase of ECM, we calculate $Q = kP$, where $k$ is a product of prime powers $p^{e} \le B_1$ with appropriately chosen smoothness bounds. The second phase of ECM checks for each prime $B_1 < p \le B_2$ whether $pQ$ reduces to the neutral element in $E(\mathbb{F}_q)$. Algorithm 3 – 1 summarises all necessary steps for both phases of ECM.
Phase 2 can be done efficiently, e.g. using the Weierstraß form and projective coordinates $pQ = (x_{pQ} : y_{pQ} : z_{pQ})$, by testing whether $\gcd(z_{pQ}, n)$ is greater than 1.
Note that we can avoid all gcd computations but one, at the expense of one modular multiplication per gcd, by accumulating the numbers to be checked in a product modulo $n$ and performing one final gcd.
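The trade described in the last paragraph can be illustrated as follows. This is a minimal sketch; the function name `single_gcd_check` and the toy numbers are assumptions for illustration only.

```python
from math import gcd


def single_gcd_check(values, n):
    """Accumulate all candidates in one product modulo n, then take one gcd."""
    acc = 1
    for v in values:
        acc = (acc * v) % n    # one modular multiplication per candidate
    return gcd(acc, n)         # a single gcd replaces one gcd per candidate
```

With n = 91 = 7 · 13, the candidates 3, 5, 14, 11 give the product 35 modulo 91, and the single gcd reveals the factor 7. If the accumulated product happens to be 0 modulo n, the gcd equals n and no factor is obtained from this check.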
Algorithm 3 – 1 Elliptic Curve Method
Require: Composite $n$
Ensure: Factor $d$ of $n$
1: Phase 1:
2: Choose arbitrary curve $E(\mathbb{Z}_n)$ and random point $P \in E(\mathbb{Z}_n)$, $P \ne O$
3: Choose smoothness bounds $B_1, B_2 \in \mathbb{N}$
4: Compute $k \Leftarrow \prod_{p_i \in \mathbb{P},\, p_i \le B_1} p_i^{e_{p_i}}$, $e_{p_i} \Leftarrow \max\{r \in \mathbb{N} : p_i^{r} \le B_2\}$
5: Compute $Q = kP \Leftarrow (x_Q, y_Q, z_Q)$
6: Compute $d \Leftarrow \gcd(z_Q, n)$
7: Phase 2:
8: Set $\Pi := 1$
9: for each prime $p$ with $B_1 < p \le B_2$ do
10: Compute $pQ \Leftarrow (x_{pQ} : y_{pQ} : z_{pQ})$
11: Compute $\Pi \Leftarrow \Pi \cdot z_{pQ}$
12: end for
13: Compute $d \Leftarrow \gcd(\Pi, n)$
14: if $1 < d < n$ then
15: A non-trivial factor $d$ is found
16: return $d$
17: else
18: Restart from choosing another elliptic curve in phase 1 (Step 2).
19: end if
If only a single curve is used, the properties of the ECM are related to those of Pollard’s (p − 1)-method. The advantage of the ECM lies in the possibility of choosing a different curve after each unsuccessful trial to increase the probability of finding factors of $n$.
All calculations are done modulo $n$. If the final gcd of the product $\Pi$ and $n$ satisfies
\[
1 < \gcd(\Pi, n) < n, \tag{3.6}
\]
a factor is found. The parameters $B_1$ and $B_2$ control the probability of finding a divisor $q$. More precisely, if the order of $P$ factors into a product of co-prime prime powers (each $\le B_1$) and at most one additional prime between $B_1$ and $B_2$, the prime factor $q$ is discovered.
The procedure is repeated for other elliptic curves. To generate them, one commences with the starting point $P$ and constructs an elliptic curve such that $P$ lies on it.
It is possible that more than one or even all prime divisors of $n$ are discovered simultaneously. This happens rarely for reasonable parameter choices and can be ignored by proceeding to the next elliptic curve.
The running time of the ECM is given by
\[
T(q) \overset{q \to \infty}{=} e^{(\sqrt{2} + o(1)) \sqrt{\log q \, \log\log q}} \tag{3.7}
\]
operations; thus, it mainly depends on the size of the factors to be found and not on the size of $n$ [34]. Note, however, that the operations are computed modulo $n$; hence, the running time of each operation depends on $n$.
Montgomery-Form Curves. Apart from the Weierstraß form, there are various other forms for elliptic curves. We use Montgomery’s form (described by Equation 3.8), which was suggested by Montgomery in [89], and compute in the set $S = E(\mathbb{Z}/n\mathbb{Z})/\{\pm 1\}$ using only the x- and z-coordinates.
\[
B y^{2} z = x^{3} + A x^{2} z + x z^{2} \tag{3.8}
\]
The curves of this form always have an order divisible by 4. In our case, the curves can be chosen in such a way that they have an order divisible by 12. The advantage of Montgomery form curves in cryptography is the inherent resistance against side channel attacks due to almost indistinguishable group operations, i.e. the elementary operations for addition and doubling of points are quite similar. A handicap of the Montgomery form is the fact that not every curve can be transformed into this form. Hence, there is little interest in implementing ECC based on Montgomery form curves.
The residue class of $P + Q$ in this set can be computed from $P$, $Q$ and $P - Q$ using 4 multiplications and 2 squarings (see Equation 3.9). A doubling, i.e. $2P$, can be computed from $P$ and the curve parameter $A$ (see Equation 3.8) using 3 multiplications and 2 squarings (Equation 3.10). Since we are only interested in checking whether we obtain the point at infinity $O$ for some prime divisor of $n$, computing in $S$ is no restriction.
Addition:
\[
\begin{aligned}
x_{P+Q} &\equiv z_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) + (x_P + z_P)(x_Q - z_Q)]^{2} \pmod{n} \\
z_{P+Q} &\equiv x_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) - (x_P + z_P)(x_Q - z_Q)]^{2} \pmod{n}
\end{aligned}
\tag{3.9}
\]
Doubling:
\[
\begin{aligned}
4 x_P z_P &\equiv (x_P + z_P)^{2} - (x_P - z_P)^{2} \pmod{n} \\
x_{2P} &\equiv (x_P + z_P)^{2} (x_P - z_P)^{2} \pmod{n} \\
z_{2P} &\equiv 4 x_P z_P\,[(x_P - z_P)^{2} + 4 x_P z_P (A + 2)/4] \pmod{n}
\end{aligned}
\tag{3.10}
\]
Finding Suitable Curves in Montgomery Form. Assume a curve of the form
\[
B y^{2} = x^{3} + A x^{2} + x \quad \text{with} \quad \gcd((A^{2} - 4)B, n) = 1 \tag{3.11}
\]
Such curves have a group order divisible by 4. To obtain an order divisible by 12, choose $A$ and $B$ such that
\[
A = \frac{-3a^{4} - 6a^{2} + 1}{4a^{3}}, \qquad B = \frac{(a^{2} - 1)^{2}}{4a^{3}}, \qquad \text{with } a = \frac{t^{2} - 1}{t^{2} + 3}. \tag{3.12}
\]
The point
\[
(x_0, y_0) = \left( \frac{3a^{2} + 1}{4a},\; \frac{\sqrt{3a^{2} + 1}}{4a} \right) \tag{3.13}
\]
is on the curve if $3a^{2} + 1 = 4(t^{4} + 3)/(t^{2} + 3)^{2}$ is a rational square, which can be obtained by $t^{2} = (u^{2} - 12)/4u$ with $u^{3} - 12u$ being a rational square.
First Phase of the ECM. If the triple $(P, mP, (m + 1)P)$ is given in Montgomery form, we can compute $(P, 2mP, (2m + 1)P)$ or $(P, (2m + 1)P, (2m + 2)P)$ by performing one addition (following Equation 3.9) and one doubling (following Equation 3.10) in Montgomery’s form. Thus, $Q = kP$ can be calculated using $[\log_2 k]$ additions and duplications according to Algorithm 3 – 2, amounting to $11[\log_2 k]$ multiplications. If $z_P = 1$, we can even reduce the number to $10[\log_2 k]$ modular multiplications.
Algorithm 3 – 2 Exponentiation for Curves in Montgomery Form
Require: Integer $k > 1$ with $k = (k_t k_{t-1} \ldots k_1 k_0)_2$ and a point $P$ on the curve $E_M : B y^{2} = x^{3} + A x^{2} + x$.
Ensure: Product $Q = kP$.
  $P_m \Leftarrow P$
  $P_{m+1} \Leftarrow 2P$
  for $i = t - 1$ downto 1 do
    if $k_i = 1$ then
      $P_m \Leftarrow P_m + P_{m+1}$
      $P_{m+1} \Leftarrow 2P_{m+1}$
    else
      $P_{m+1} \Leftarrow P_m + P_{m+1}$
      $P_m \Leftarrow 2P_m$
    end if
  end for
  if $k_0 = 1$ then
    $Q \Leftarrow P_m + P_{m+1}$
  else
    $Q \Leftarrow 2P_m$
  end if
  return $Q$
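To make the interplay of Equations 3.9 and 3.10 with Algorithm 3 – 2 concrete, the following Python sketch implements the x/z-only addition, the doubling and the binary ladder. It is an illustration under assumptions, not the hardware design: the function names (`xdbl`, `xadd`, `ladder`, `same_point`) are hypothetical, points are plain `(x, z)` tuples, and the constant `a24 = (A + 2)/4 mod n` is assumed to be supplied by the caller.

```python
def xdbl(P, a24, n):
    """Doubling (Equation 3.10): 2 squarings and 3 multiplications."""
    x, z = P
    s = (x + z) * (x + z) % n
    d = (x - z) * (x - z) % n
    t = (s - d) % n                      # equals 4*x*z
    return (s * d % n, t * (d + a24 * t) % n)


def xadd(P, Q, D, n):
    """Addition (Equation 3.9); D must be the difference P - Q."""
    (xp, zp), (xq, zq), (xd, zd) = P, Q, D
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return (zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n)


def ladder(k, P, a24, n):
    """Binary method of Algorithm 3-2: returns kP as (x, z) for k > 1."""
    if k < 2:
        raise ValueError("k must be greater than 1")
    bits = bin(k)[2:]                    # k_t ... k_0 with k_t = 1
    pm, pm1 = P, xdbl(P, a24, n)         # (P_m, P_{m+1}) = (P, 2P)
    for b in bits[1:-1]:                 # k_{t-1} ... k_1
        if b == "1":
            pm, pm1 = xadd(pm, pm1, P, n), xdbl(pm1, a24, n)
        else:
            pm, pm1 = xdbl(pm, a24, n), xadd(pm, pm1, P, n)
    if bits[-1] == "1":                  # last bit: only Q is needed
        return xadd(pm, pm1, P, n)
    return xdbl(pm, a24, n)


def same_point(P, Q, n):
    """Projective equality test: x_P * z_Q == x_Q * z_P (mod n)."""
    return (P[0] * Q[1] - Q[0] * P[1]) % n == 0
```

The difference of the two running points is always P, which is why `xadd` can be called with `P` as its third argument throughout the loop. As a small consistency check on a toy modulus (n = 101, A = 6, hence a24 = 2, P = (5 : 1)), computing 4P by doubling twice and by adding 3P and P with difference 2P gives the same projective point.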
By handling each prime factor of $k$ separately and by using optimal addition chains, the number of multiplications can be decreased further to roughly $9.3[\log_2 k]$ (see [89]). The addition chains can be precalculated.
Second Phase of the ECM. The standard way to calculate the points $pQ$ for all primes $B_1 < p \le B_2$ is to precompute a (small) table of multiples $kQ$, where $k$ runs through the differences of consecutive primes in the interval $[B_1, B_2]$. Then, a single point multiple $p_0 Q$ is computed, with $p_0$ being the smallest prime in that interval, and the corresponding table entries are added successively to obtain $pQ$ for the next prime $p$.
Two major improvements have been proposed for the ECM [33, 89]. Using Montgomery’s form, the procedure is difficult to implement but can be improved as follows.
The following lemma allows us to reduce the complexity by repeatedly multiplying a difference of two products instead of computing complex point operations in each step of phase 2:
Lemma 1 Let $q = a + b$ with $a$ and $b$ co-prime. Furthermore, let $qQ = A + B$ with $A = aQ$ and $B = bQ$. Then $z_{qQ} \equiv 0 \bmod t$ for $\gcd(z_Q, n) = 1$ if and only if
\[
x_A \cdot z_B - z_A \cdot x_B \equiv 0 \bmod t.
\]
Proof
1. Montgomery’s point addition formula 3.9 yields
\[
t \mid z_{qQ} \;\Leftrightarrow\; t \mid x_{A-B}\,[x_A \cdot z_B - z_A \cdot x_B]^{2} \;\Leftarrow\; t \mid (x_A \cdot z_B - z_A \cdot x_B).
\]
2. If $z_{qQ} \equiv 0 \bmod t$, $qQ$ is the identity point on the elliptic curve over $\mathbb{F}_t$. Hence, $A = -B$, i.e. $A$ and $B$ are zero or
\[
x_A / z_A \equiv x_B / z_B \bmod t.
\]
$A = B = 0$ yields $Q = 0$, thus $t \mid z_Q$, which contradicts the assumption $\gcd(z_Q, n) = 1$. Then we have $x_A / z_A \equiv x_B / z_B \bmod t$ and $x_A \cdot z_B \equiv z_A \cdot x_B \bmod t$, respectively.
The improved standard continuation uses a parameter $2 < D < B_1$. First, a table $T$ of multiples $kQ$ of $Q$ for all $1 \le k < \frac{D}{2}$ with $\gcd(k, D) = 1$ is calculated. Each prime $B_1 < p \le B_2$ can be written as $mD \pm k$ with $kQ \in T$. Now, with Lemma 1, $\gcd(z_{pQ}, n) > 1$ if and only if $\gcd(x_{mDQ} z_{kQ} - x_{kQ} z_{mDQ}, n) > 1$. Thus, we calculate the sequence $mDQ$ (which can easily be done in Montgomery’s form) and accumulate the product of all $x_{mDQ} z_{kQ} - x_{kQ} z_{mDQ}$ for which $mD - k$ or $mD + k$ is prime.
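The bookkeeping behind this continuation, i.e. which table entries exist and how a prime is mapped to a pair (m, k), can be sketched as follows. The helper names `table_indices` and `decompose` are hypothetical and the sketch covers only the index arithmetic, not the point operations.

```python
from math import gcd


def table_indices(D):
    """Indices k of the table T: 1 <= k < D/2 with gcd(k, D) = 1."""
    return [k for k in range(1, (D + 1) // 2) if gcd(k, D) == 1]


def decompose(p, D):
    """Write p as m*D + sign*k with 0 <= k <= D/2; returns (m, sign, k)."""
    m, k = divmod(p, D)
    if 2 * k > D:              # closer to the next multiple of D
        return m + 1, -1, D - k
    return m, 1, k
```

For D = 30 the table holds the four multiples kQ with k ∈ {1, 7, 11, 13}, and the primes 967 = 32 · 30 + 7 and 977 = 33 · 30 − 13 both map to entries of this table.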
The memory requirements for the improved standard continuation are $\frac{\phi(D)}{2}$ points for the table $T$ and the points $DQ$, $(m - 1)DQ$ and $mDQ$ for computing the sequence, altogether $\phi(D) + 6$ numbers. The computational costs consist of the generation of $T$ and the calculation of $mDQ$, which amounts to at most
\[
\frac{D}{4} + \frac{B_2}{D} + 7
\]
elliptic curve operations (mostly additions), and at most $3(\pi(B_2) - \pi(B_1))$ modular multiplications, $\pi(x)$ being the number of primes up to $x$. The last term can be lowered if $D$ contains many small prime factors, since this will increase the number of pairs $(m, k)$ for which both $mD - k$ and $mD + k$ are prime. Neglecting space considerations, a good choice for $D$ is a number around $\sqrt{B_2}$ which is divisible by many small primes.
4 Elliptic Curve Method in Hardware
We present the first published hardware implementation of the ECM for integer factoring. The ECM implementation includes complete hardware logic that supports the ECM factoring of numbers of up to approximately 200 bits. The proposed solution applies parameters best suited to finding factors of up to about 42 bits. The ECM design features supporting logic for the computation of the modular operations addition, subtraction, multiplication and squaring. Multiplication and squaring are computed in the MMM unit analysed in Chapter 2. The circuit also scales well to larger and smaller bit lengths. As a proof of concept, the ECM architecture has been implemented as a software-hardware co-design on an FPGA and an embedded micro-controller in a SOC. Such a design perfectly fits the needs of recent proposals for hardware architectures for the GNFS (see, e.g. [64]) and can reduce the overall costs of a GNFS device considerably.
Parts of this section were published in papers [65, 94, 120]. The research achievements described in this chapter include the following:
• ECM algorithm for hardware – algorithm adaptation and parametrisation,
• ECM implementation – unit design, parallelisation, case study for GNFS.
The ECM implementation was done as joint work, mainly with Jan Pelzl from Ruhr University Bochum. Christine Priplata and Colin Stahlke (Edizone GmbH, Germany) and Jens Franke and Thorsten Kleinjung (University of Bonn, Germany) also cooperated in the SHARK project, which includes the ECM design.
Section 4.1 describes the selection of the parameters in the ECM. The architecture of the implementation and a discussion of the algorithms chosen for the modular operations are presented in Section 4.2. Implementation details and a case study with GNFS based on ECM units are summarised in Section 4.3. Finally, we conclude the chapter with a discussion of the obtained results.
4.1 Parameterisation of the ECM Algorithm
Our implementation focuses on the factorisation of numbers of up to 200 bits with factors of up to around 42 bits. Thus, suitable values need to be found for the smoothness bounds $B_1$ and $B_2$ and for the parameter $D$ used in the improved standard continuation (see the description of the ECM second phase in Section 3.3.2). We look for values that yield a high probability of success together with a relatively small running time and area consumption. Since the running time depends on the size of the (unknown) factors to be found, optimal parameters cannot be known beforehand. Hence, good parameters are found by experiments with different prime bounds.
4.1.1 Phase 1
Based on software experiments, we choose $B_1 = 960$ and $B_2 = 57\,000$ as prime bounds. The value of $k$ has 1 375 bits; hence, assuming the binary method (Algorithm 3 – 2), 1 374 point additions and 1 374 point duplications are required for the execution of phase 1. Due to the use of Montgomery coordinates, the coordinate $z_P$ of the starting point $P$ can be set to 1; the addition then takes only 5 multiplications instead of 6. The improved phase 1 (with optimal addition chains) has to use the general case, where $z_P \ne 1$. For the sake of simplicity and a simple control logic, we choose the binary method for the time being. For the chosen parameters, the computational complexity of phase 1 is 13 740 modular multiplications and squarings³. With optimised addition chains, this number can be reduced to approximately 12 000 modular multiplications and squarings.
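As a quick check of the phase 1 figure, and assuming the per-operation costs used elsewhere in the text (5 multiplications per addition when $z_P = 1$, and 5 multiplications and squarings per duplication), the count works out as follows:

```latex
\[
\underbrace{1374 \cdot 5}_{\text{point additions}} + \underbrace{1374 \cdot 5}_{\text{point duplications}}
= 1374 \cdot 10 = 13\,740 .
\]
```

This matches the bound of $10[\log_2 k]$ modular multiplications given for the case $z_P = 1$, with $[\log_2 k] = 1374$.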
According to Equation 3.10, duplicating a point $2P_A = P_C$ involves the input values $x_A$, $z_A$, $A_{24}$ and $n$, where $A_{24} = (A + 2)/4$ is computed in advance from the curve parameter $A$ (see Equation 3.8) and should be stored in a fixed register. A point addition $P_C = P_A + P_B$ handles the input values $x_A$, $z_A$, $x_B$, $z_B$, $x_{A-B}$, $z_{A-B}$ and $n$ (see Equation 3.9).
Notice that the values $n$, $A_{24}$, $x_{A-B}$ and $z_{A-B}$ do not change during phase 1. Furthermore, $z_{A-B} = z_1$ can be chosen to be 1. Thus, no register is required for $z_{A-B}$. The output values $x_C$ and $z_C$ can be written to certain input registers to save memory. If we assume that the ECM unit does not execute addition and duplication in parallel, at most 7 registers for the values in $\mathbb{Z}_n$ are required for phase 1. Additionally, we require 4 temporary registers for intermediate values. Thus, a total of 11 registers is required for phase 1.
³ Squarings and multiplications are considered to have identical complexity in our case, since the same hardware unit is used for both multiplication and squaring.
4.1.2 Phase 2
For the prime bounds chosen, 5 621 primes $p \in [B_1, B_2]$ have to be tested in phase 2. With the prime bounds fixed, the computational complexity depends on the size of $D$. Hence, $D$ should consist of small primes in order to keep $\phi(D)$ as small as possible. We consider the cases $D = 6$, $D = 30$, $D = 60$ and $D = 210$. The initial values can be computed by first computing $\hat{Q} = DQ$ and then $\frac{B_1}{D}\hat{Q}$ with the binary method, which automatically yields $(\frac{B_1}{D} - 1)\hat{Q}$. The total number of modular multiplications is determined by the number of point additions, point duplications and multiplications for the product $\Pi$.
Table 4 – 1 displays the computational complexity and the number of registers required additionally for phase 2. For the numbers in the table, we assume the use of Algorithm 3 – 2 for computing the initial values. E.g., in the case $D = 30$, the cost of computing $DQ$, $(\frac{B_1}{D} - 1)DQ$ and $\frac{B_1}{D}DQ$ is 8 point additions and 8 point duplications. For the same $D$, the computation of the table involves 5 point additions and 2 point duplications, yielding a total of 13 590 modular multiplications.
Remark: for the case $D = 210$, we start with $B_1 = 1\,050$ in order to ensure that $D$ and $B_1$ share the same prime factors. For phase 2, we choose $D = 30$ to obtain a minimal AT product of the design. Since $\phi(D) = 8$ is small, only 8 additional registers are required to store all coordinates in a table. Unlike in phase 1, we have to consider the general case for point addition, where $z_{A-B} \ne 1$. Hence, an additional register for this quantity is needed.
For the product $\Pi$ of all $x_A \cdot z_B - z_A \cdot x_B$, one more register is necessary. The temporary registers from phase 1 suffice to store the intermediate results $x_A \cdot z_B$, $z_A \cdot x_B$ and $x_A \cdot z_B - z_A \cdot x_B$. Hence, 10 additional registers for phase 2 yield a total of 21 registers required for both phases. The computational complexity of phase 2 is 1 881 point additions and 10 point duplications. Together with the 13 590 modular multiplications for computing the product $\Pi$, 24 926 modular multiplications and squarings are required.
For a high probability of success (p > 80%) of finding a single factor of 42 bits, software experiments suggest running the ECM on approximately 20 different curves per candidate for the given parameters. For factors of 40 bits, only 10 curves are required on average for a similar probability of success.
Table 4 – 1 Computational complexity and memory requirements for phase 2 depending on D (columns 2 to 5: number of modular multiplications; last column: number of additional registers)
| D | point additions | point duplications | product Π | total | registers |
| 6 | (9 + 0 + 9 340) · 6 = 56 094 | (9 + 0) · 5 = 45 | 14 625 | 70 764 | 4 |
| 30 | (8 + 5 + 1 868) · 6 = 11 286 | (8 + 2) · 5 = 50 | 13 590 | 24 926 | 10 |
| 60 | (8 + 9 + 934) · 6 = 5 706 | (8 + 2) · 5 = 50 | 13 629 | 19 385 | 18 |
| 210 | (9 + 28 + 266) · 6 = 1 818 | (9 + 5) · 5 = 70 | 13 038 | 14 926 | 50 |
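The totals of Table 4 – 1 can be reproduced with a few lines of Python. The sketch below simply re-adds the figures given in the table, assuming 6 modular multiplications per general point addition and 5 per duplication as used there; the names `ROWS` and `phase2_total` are illustrative only.

```python
ADD_COST = 6   # modular multiplications per general point addition
DBL_COST = 5   # modular multiplications per point duplication

# D: (point additions, point duplications, multiplications for the product),
# with the summands copied from Table 4-1
ROWS = {
    6:   ((9, 0, 9340), (9, 0), 14625),
    30:  ((8, 5, 1868), (8, 2), 13590),
    60:  ((8, 9, 934),  (8, 2), 13629),
    210: ((9, 28, 266), (9, 5), 13038),
}


def phase2_total(D):
    """Total number of modular multiplications of phase 2 for a given D."""
    adds, dbls, product = ROWS[D]
    return sum(adds) * ADD_COST + sum(dbls) * DBL_COST + product


for D in ROWS:
    print(D, phase2_total(D))
```

The comparison makes the trade-off visible: larger values of D reduce the multiplication count but increase the number of registers for the table.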
4.2 Design of the ECM Unit
The ECM unit consists of three main parts: the Arithmetic Logic Unit (ALU), the memory part (registers) and an internal control logic (see Figure 4 – 1). Each unit has a very low communication overhead, since all intermediate results during computation are stored inside the unit, in the registers. Before the actual computation starts, all required initial values ($x_P$, $n$, $A_{24}$) are assigned to memory registers of the unit. This is the only data input.
The only output is the above-mentioned product $\Pi$. The number $\Pi$ is read from the unit’s memory only at the very end of the computation. The computation of $\gcd(\Pi, n)$ as well as the commands for the ECM units are handled outside the ECM units by the central control logic.
Figure 4 – 1 Architecture of the ECM unit (block diagram: central control logic, linked via ctrl and data lines to the ECM unit with its control logic, memory and ALU)
4.2.1 Control Logic
The central control logic is connected to each ECM unit via a control bus (ctrl). It coordinates the data exchange with the unit before and after computation and starts each computation in the unit by a special set of commands. The commands contain an instruction for the next computation to be performed (i.e. add, subtract, multiply, square), including the input and output registers to be used. An operation is started by setting the start bit to the active level.
The control bus has to make it possible to specify which input register(s) and which output register are connected to the ALU. Only certain combinations of input and output registers occur, which offers the possibility of reducing the complexity of the logic and the width of the control bus by compressing the necessary information. For simplicity and clarity, we skipped further optimisation of the commands. Instead, we use a clearly understandable structure for the commands. A command consists of 16 bits, which are assigned as shown in Table 4 – 2 (LSB is left).
Table 4 – 2 A command syntax for the ECM unit (LSB left)
| start | operation | input 1 | input 2 | output |
| X | XX | XXXX | XXXX | XXXXX |
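The field layout of Table 4 – 2 (1 + 2 + 4 + 4 + 5 = 16 bits) can be illustrated with a small packing routine. This is a sketch under assumptions: the start bit is taken to occupy the least significant bit, as suggested by "LSB left", and the numeric opcode and register values in the note below are hypothetical, since the text does not specify them.

```python
# (field name, width in bits), listed from the LSB upwards as in Table 4-2
FIELDS = (("start", 1), ("operation", 2), ("input1", 4),
          ("input2", 4), ("output", 5))


def encode(**values):
    """Pack the five fields into one 16-bit command word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit into {width} bits")
        word |= v << shift
        shift += width
    return word


def decode(word):
    """Unpack a 16-bit command word into a dictionary of fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

For example, `encode(start=1, operation=2, input1=3, input2=4, output=5)` yields the word 10781, and `decode(10781)` returns the same five field values.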
If several ECM units work in parallel, only one central control logic is needed. All commands are sent in parallel to all units. Separate communication with each unit, one by one, is expected only at the beginning and at the end of the computations. The units’ memory cells have to be written and read out separately. Once the computations in all units are finished, the LSB of the central status register is set to the active value to indicate the units’ availability for further commands.
Each ECM unit includes some internal control logic in order to coordinate the data and command flow inside the unit. Once a command with the start bit set is received, the computation inside the unit is started. The ALU is fed from the corresponding input registers and the result is stored inside the unit in one of the registers. Once the computation is finished, a status bit is set to indicate the unit’s availability for further commands.
4.2.2 Memory Management
The addresses specified above refer to relative addresses inside each unit, since we want to address the same register in multiple ECM units in parallel. For reading from or writing to a single register in a specific ECM unit, the unit needs to be identified separately by a unique address prefix. In combination with an address for each unit, a register has a unique hardware address and can be addressed from outside the ECM unit. This is imperative, since the central control logic writes data to these registers before phase 1 starts and reads data from one of the registers after phase 2 has finished.
Each register can contain $n$ bits and is organised in $e = \lceil \frac{n+1}{w} \rceil$ words of size $w$ (see Figure 4 – 2). Memory access is performed word-wise. Reasonable values for $w$ are $w = 4, 8, 16, 32$, which is given by the included multiplier requiring those word widths.
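The register organisation can be sketched as follows. The helper names are hypothetical, and the word order (word 0 holding the least significant bits) is an assumption, since the text does not state it.

```python
def words_per_register(n_bits, w):
    """e = ceil((n + 1) / w) words of w bits each."""
    return -(-(n_bits + 1) // w)


def to_words(value, n_bits, w):
    """Split a register value into e words, word 0 = least significant (assumed)."""
    e = words_per_register(n_bits, w)
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(e)]


def from_words(words, w):
    """Reassemble the register value from its words."""
    return sum(word << (i * w) for i, word in enumerate(words))
```

For operands of n = 200 bits, a word width of w = 32 gives e = 7 words per register, while w = 8 gives e = 26 words.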
Figure 4 – 2 Organisation of the ECM unit’s memory registers for 21 variables with e words of width w (registers P1 to P21, each holding e × w bits in words 0 to e − 1)
The ALU performs the arithmetic modulo 2n, i.e. modular multiplication, modular squaring, modular addition and subtraction.
4.2.3 Choice of the Arithmetic Algorithms<br />
The ma<strong>in</strong> purpose when we were design<strong>in</strong>g the ECM was to synthesise an area-time<br />
efficient implementation. All algorithms are chosen to allow achievement of a low<br />
area and relatively high speed. Low area consumption can be achieved by structures,<br />
which allow for a certa<strong>in</strong> degree of pipel<strong>in</strong>e and consequently do not require much<br />
memory. For the ECM, we have chosen a set of algorithms which seem to be very well<br />
suited for our purpose. The chosen algorithms are fully scalable and make possible<br />
to analyse different unit parameters and their impact on units performance.<br />
In the follow<strong>in</strong>g, we briefly describe the algorithms for modular addition, subtrac-<br />
tion, and multiplication to be implemented for the ALU. Squar<strong>in</strong>g is done with the<br />
multiplication circuit s<strong>in</strong>ce a separate hard<strong>ware</strong> circuit for squar<strong>in</strong>g would <strong>in</strong>crease<br />
60
FEI KEMT<br />
the overall AT product. Similarly, subtraction can be computed with a slightly<br />
modified circuit for addition.<br />
<strong>Modular</strong> <strong>Multiplication</strong> An efficient <strong>Montgomery</strong> multiplier, highly suitable for<br />
our design is described <strong>in</strong> [108]. While <strong>in</strong> [108] a structure with carry-save adders<br />
and redundant representation of operands has been implemented, we have chosen a<br />
configuration with carry-propagate adders and non-redundant representation that<br />
makes a more effective implementation possible especially when the target plat-<br />
form supports fast carry cha<strong>in</strong> logic. A detailed analysis and comparison of both<br />
structures can be found <strong>in</strong> [46] and also <strong>in</strong> this thesis <strong>in</strong> chapter 2.<br />
The depicted hardware performs a slightly modified MWR2MM (Algorithm 2 –<br />
1), but with a non-redundant carry-propagate architecture (earlier denoted as<br />
MWR2MM CPA). Therefore, our earlier analysis of the parameters of the other<br />
variants of the MMM algorithm is also valid for this version. In the implemented<br />
algorithm (Algorithm 4 – 1) we have used in the step (a) only bit operations<br />
instead of the more expensive word-wise addition originally proposed in [108].<br />
The final reduction step of the originally proposed MMM (Algorithm 1 – 2) can<br />
be omitted when the following condition is fulfilled:<br />
$4M < 2^n$. (4.1)<br />
With bounded input values X, Y < 2M, the output value is also bounded (S < 2M).<br />
A m<strong>in</strong>imal AT product of the sole multiplier can be achieved with a word width<br />
of 8 bits and a pipel<strong>in</strong>e depth of 1 (w = 8, p = 1, see [108]). However, for our<br />
ECM architecture, the AT product does not only depend on the AT product of the<br />
multiplier. In fact, the multiplier only takes a comparably small part of the overall<br />
area. On the other hand, the overall speed relies primarily on the speed of the<br />
multiplier. Thus, we choose a pipel<strong>in</strong>e depth of p = 2 for word width w = 32 bits,<br />
<strong>in</strong> order to achieve a shorter computation time for multiplication.<br />
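As an illustration of condition (4.1), the following Python sketch models the<br />
multiplication at bit level rather than word by word. It is not the implemented<br />
word-serial datapath of Algorithm 4 – 1; the function name `mont_mul_no_final_sub`<br />
and the toy modulus are assumptions made for this example only. The quotient bit is<br />
derived with bit operations alone (an AND and an XOR), and no final subtraction<br />
is performed.<br />

```python
def mont_mul_no_final_sub(x, y, m, n):
    """Bit-level model: returns S with S * 2**n == x * y (mod m).

    Assumes m is odd, 4*m < 2**n and 0 <= x, y < 2*m.
    """
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1
        # q_i = x_i*Y_0 XOR S_0, using bit operations only
        q_i = (x_i & y & 1) ^ (s & 1)
        # the sum is even by construction, so the shift is exact
        s = (s + x_i * y + q_i * m) >> 1
    return s


if __name__ == "__main__":
    m, n = 197, 10          # hypothetical toy modulus, 4*197 < 2**10
    for x in range(0, 2 * m, 7):
        for y in range(0, 2 * m, 5):
            s = mont_mul_no_final_sub(x, y, m, n)
            assert s < 2 * m                      # output bound S < 2M
            assert (s << n) % m == (x * y) % m    # Montgomery product
    print("all products stay below 2M")
```

Because the result stays below 2M, it can be fed back as an operand of the next<br />
multiplication without any reduction in between.<br />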
Algorithm 4 – 1 Modified MWR2MM algorithm<br />
1: $S \Leftarrow 0$<br />
2: for $i = 0$ to $n - 1$ do<br />
3: &nbsp;&nbsp;$q_i \Leftarrow x_i Y^{(0)}_0 + S^{(0)}_0$<br />
4: &nbsp;&nbsp;if $q_i = 1$ then<br />
5: &nbsp;&nbsp;&nbsp;&nbsp;for $j = 0$ to $e$ do<br />
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$(C_a, S^{(j)}) \Leftarrow C_a + x_i Y^{(j)} + M^{(j)}$<br />
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$(C_b, S^{(j)}) \Leftarrow C_b + S^{(j)}$<br />
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$<br />
9: &nbsp;&nbsp;&nbsp;&nbsp;end for<br />
10: &nbsp;&nbsp;else<br />
11: &nbsp;&nbsp;&nbsp;&nbsp;for $j = 0$ to $e$ do<br />
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$(C_a, S^{(j)}) \Leftarrow C_a + x_i Y^{(j)}$<br />
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$(C_b, S^{(j)}) \Leftarrow C_b + S^{(j)}$<br />
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$<br />
15: &nbsp;&nbsp;&nbsp;&nbsp;end for<br />
16: &nbsp;&nbsp;end if<br />
17: &nbsp;&nbsp;$S^{(e)} \Leftarrow 0$<br />
18: end for<br />
Modular Addition and Subtraction Addition and subtraction are implemented<br />
as one circuit. As with the multiplication circuit, the operations are done word-wise,<br />
and the word size and the number of words can be chosen arbitrarily. Since the<br />
same memory is used for input and output operands, we choose the same word size<br />
as for the multiplier. The subtraction relies on the same hardware as the adder;<br />
only one input bit has to be changed (sub = 1) in order to compute a subtraction<br />
rather than an addition (see Figure 4 – 3). All operations are done modulo 2n.<br />
Algorithms 4 – 2 and 4 – 3 show the elementary steps of a modular addition and<br />
subtraction, respectively.<br />
If x + y ≥ 2n a reduction can be applied by simple subtraction of 2n. A variable<br />
z conta<strong>in</strong>s the result and T is a (temporary) register. A comparison z < 2n takes<br />
the same amount of time as a subtraction T = z − 2n. Thus, we compute the<br />
subtraction <strong>in</strong> all cases and decide by the sign of the values, which one to take as<br />
the result (z or T ). If T is the correct result, the content of T has to be copied to<br />
the register z.<br />
For a modular addition, we need at most<br />
$T_{add} = 3(e + 1)$ (4.2)<br />
clock cycles, where e is the number of words (for the implemented non-redundant form<br />
of operands, $e = \lceil (N+1)/w \rceil$). On average, we would only have to reduce every second<br />
time. However, since the control of phase 1 and phase 2 is parallelised for many<br />
units, we have to assume the worst case running time, which is given by Equation 4.2.<br />
Figure 4 – 3 Scalable addition and subtraction unit for operands with word width w<br />
The subtraction x − y can be accomplished by adding to x the bitwise complement<br />
of y and 1. The addition of 1 is simply achieved by setting the first carry bit to one<br />
(c_in = 1) (Step 1). Since the result can be negative, a final verification is required.<br />
If necessary, the modulus has to be added. Algorithm 4 – 3 describes the modular<br />
subtraction.<br />
In step 1, both memory cells z and T obtain the same value, which can be done<br />
in hardware in parallel without any additional overhead. After the computation of<br />
the difference, one can check for the correctness of the result.<br />
Hence, subtraction can be performed more efficiently than addition and requires<br />
in the worst case<br />
$T_{sub} = 2(e + 1)$ (4.3)<br />
clock cycles.<br />
Algorithm 4 – 2 <strong>Modular</strong> addition<br />
Require: Two <strong>in</strong>tegers x, y < 2n<br />
Ensure: Sum z = x + y mod 2n<br />
1: z ⇐ x + y<br />
2: T ⇐ z − 2n<br />
3: if T ≥ 0 then<br />
4: z ⇐ T<br />
5: end if<br />
6: return z<br />
Algorithm 4 – 3 <strong>Modular</strong> subtraction<br />
Require: Two <strong>in</strong>tegers x, y < 2n<br />
Ensure: Difference z = x − y mod 2n<br />
1: T = z ⇐ x − y<br />
2: if z < 0 then<br />
3: z ⇐ T + 2n<br />
4: end if<br />
5: return z<br />
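Algorithms 4 – 2 and 4 – 3 can be sketched in Python as follows. The sketch works on<br />
whole integers instead of e words of width w, and the function names and the<br />
parameter `bound` (standing for the value 2n of the algorithms) are assumptions made<br />
for this illustration. As in the hardware, the addition always computes the trial<br />
subtraction T and then selects z or T by the sign.<br />

```python
def mod_add(x, y, bound):
    """Algorithm 4-2: returns x + y reduced into [0, bound)."""
    z = x + y
    t = z - bound          # trial subtraction, always computed
    return t if t >= 0 else z


def mod_sub(x, y, bound):
    """Algorithm 4-3: returns x - y corrected into [0, bound)."""
    z = x - y              # hardware: x plus complement of y, carry-in 1
    return z + bound if z < 0 else z


if __name__ == "__main__":
    bound = 2 * 197        # hypothetical toy value of 2n
    print(mod_add(300, 200, bound))   # 106
    print(mod_sub(100, 250, bound))   # 244
```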
4.2.4 Parallelization of the Algorithm<br />
ECM can be perfectly parallelized by us<strong>in</strong>g different curves <strong>in</strong> parallel s<strong>in</strong>ce the<br />
computations of each unit are completely <strong>in</strong>dependent. For the control of more<br />
than one ECM unit, it is essential to know that both phases, phase 1 and phase 2,<br />
are controlled completely identically, <strong>in</strong>dependent of the composite to be factored.<br />
Solely the curve parameter and possibly the modulus of the units and, hence, the<br />
coord<strong>in</strong>ates of the <strong>in</strong>itial po<strong>in</strong>t differ. Thus, all units have to be <strong>in</strong>itialized differently<br />
which is done by simply writ<strong>in</strong>g the values <strong>in</strong>to the correspond<strong>in</strong>g memory locations<br />
sequentially.<br />
During the execution of both phases, exactly the same commands can be sent to<br />
all units in parallel. Since the runtime of multiplication/squaring is constant (it does<br />
not depend on the input values) and the runtime of addition/subtraction differs by at<br />
most 2(e + 1) clock cycles, all units can execute the same command in approximately<br />
the same time.<br />
After phase 2, the results are read from the units one after another. The required<br />
time for this data I/O is negligible for one ECM unit since the computation time of<br />
both phases dominates. For several units in parallel, the computation time does not<br />
change, but the time for data I/O scales linearly with the number of units. Hence,<br />
not too many units should be controlled by a single control logic. For massively parallel<br />
ECM in hardware, the ECM units can be segmented into clusters, each with its own<br />
control unit.<br />
4.3 Implementation of the ECM Unit<br />
This section presents the actual hard<strong>ware</strong> implementation done on a SOC (FPGA<br />
and embedded microprocessor). This first hard<strong>ware</strong> implementation of ECM is de-<br />
signed as a proof-of-concept. All tim<strong>in</strong>gs are obta<strong>in</strong>ed by us<strong>in</strong>g real hard<strong>ware</strong>, not<br />
only simulation. All results have been carefully checked by a reference implementa-<br />
tion <strong>in</strong> soft<strong>ware</strong>.<br />
4.3.1 <strong>Hard</strong><strong>ware</strong> Platform<br />
The ECM implementation is realized as a hybrid design. It consists of an ECM<br />
unit implemented on an FPGA (Xil<strong>in</strong>x Virtex2000E-6) [124] and a control logic<br />
implemented <strong>in</strong> soft<strong>ware</strong> on an embedded micro-controller (ARM7TDMI, 25MHz)<br />
[90]. The ECM unit is coded in VHDL and was simulated and synthesised for the<br />
FPGA using FPGA Advantage tools; place & route was done in Xilinx ISE. For<br />
the actual VHDL implementation, memory cells have been realized with the FPGA’s<br />
internal block RAM. For the word width w = 32 bits, 2 blocks with<br />
$e = \lceil (N+1)/w \rceil$ words are used for each register due to the dual-port access<br />
mode and the selected algorithm for multiplication.<br />
The ECM unit, as implemented, expects commands that are written to a<br />
control register accessible by the embedded ARM processor. The required point<br />
coordinates and curve parameters are loaded into the ECM unit before the first<br />
command is decoded. For this purpose, these memory cells of the unit are accessible<br />
from the outside by a unique address. Internal registers, which are only used as<br />
temporary registers during the computation, are not accessible from the outside by<br />
the micro-controller.<br />
The control of the whole unit is done by the micro-controller present on the<br />
board. The processor controls the data transfer from and to the units, and issues<br />
the commands for all steps in phase 1 and phase 2 for the central control logic inside<br />
the FPGA. For code generation, debugging and compilation, the ARM Developer Suite<br />
1.2 was used. For details on the ARM microprocessor, see [23]. At a later stage,<br />
a soft-core processor (in VHDL), e.g. Altera Nios [10], could be used instead of a<br />
hard-wired ARM microprocessor.<br />
For a suitable implementation on a selected platform one can choose the word<br />
width w, number of words e (length of operands), level p of pipel<strong>in</strong>e stages of the<br />
multiplier, and the number of ECM units. Although the presented implementation<br />
was realised on a Xil<strong>in</strong>x Virtex-E FPGA, the proposed algorithms and the design<br />
architecture can be implemented on any FPGA. Hence, a significant speed-up on<br />
state-of-the-art devices can be expected. Anyway, the platform at hand is sufficient<br />
for proof-of-concept purposes. S<strong>in</strong>ce the suggested clock rate of the synthesis tool<br />
was higher than the actual supported frequency of the hard<strong>ware</strong>, no attempt to<br />
further accelerate the design has been made. Due to the lack of FPGA specific<br />
optimisations, the code can easily be used for different types of FPGAs that <strong>in</strong>clude<br />
dedicated memory blocks and fast carry-cha<strong>in</strong> logic.<br />
The actual design was done for n = 198 bit composites. The parameters for the<br />
multiplier are p = 2 and w = 32. Scaling the design to bit lengths from 100 to<br />
300 bits can be easily accomplished. In this case, the AT product will decrease or<br />
increase according to $O(N^2)$.<br />
4.3.2 Results<br />
After the synthesis and place and route, the b<strong>in</strong>ary image was loaded onto the<br />
FPGA and clocked with a frequency of 25MHz. Hence, the cycle length of the ALU<br />
perform<strong>in</strong>g the modular arithmetic is 40ns. Table 4 – 3 shows the tim<strong>in</strong>gs of relevant<br />
operations of the implementation.<br />
<strong>Hard</strong><strong>ware</strong> factorization design <strong>in</strong>cludes full support for all operations needed<br />
dur<strong>in</strong>g the ECM phases 1 and 2. The tim<strong>in</strong>gs for phase 1 and 2 are obta<strong>in</strong>ed after<br />
tim<strong>in</strong>g measurements on a test<strong>in</strong>g board. The time for the <strong>in</strong>itialization and read<strong>in</strong>g<br />
from the memories is not taken <strong>in</strong>to account, s<strong>in</strong>ce it only delays the computation<br />
at the very beg<strong>in</strong>n<strong>in</strong>g and the very end.<br />
Although a squar<strong>in</strong>g is computed with the multiplication circuit, the overhead<br />
is slightly lower yield<strong>in</strong>g a mere 0.3% faster execution. Po<strong>in</strong>t addition <strong>in</strong> phase 1 is<br />
more efficient s<strong>in</strong>ce it makes use of the fact that the z coord<strong>in</strong>ate of the difference<br />
of po<strong>in</strong>ts can be chosen to be 1.<br />
Table 4 – 3 Running Times of the ECM Implementation (198 bits modulus), p = 2, w = 32<br />
(Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)<br />
Operation Time<br />
modular addition 2.00 µs<br />
modular subtraction 1.68 µs<br />
modular multiplication 64.5 µs<br />
modular squaring 64.5 µs<br />
point addition (phase 1, z_Q = 1) 333 µs<br />
point addition (phase 2) 397 µs<br />
point doubling 330 µs<br />
Phase 1 912 ms<br />
Phase 2 1879 ms<br />
The ECM unit including the full support for phases 1 and 2 of the ECM,<br />
with the word width w = 32 bits, number of words e = 7 and level of pipeline p = 2,<br />
has the following area requirements: 1754 LUTs, 506 flip-flops and 44 block RAMs.<br />
The minimum clock period achieved is 26.225 ns (maximum clock frequency:<br />
38.132 MHz). Further improvements in data organisation inside the ECM unit should<br />
yield higher performance of the whole design. The critical path of the design includes<br />
the multiplexers of the input and output buses of the memory registers. The high<br />
number of supported combinations, a consequence of the universality of the proposed<br />
design, results in complicated and hence slow logic. A more optimised data-path with<br />
multiple multipliers in the ALU helps to decrease the number of supported<br />
combinations of registers, as shown in [67].<br />
Due to the system’s latency for load<strong>in</strong>g and stor<strong>in</strong>g values <strong>in</strong> the registers, not<br />
more than 100 ECM units (FPGA) should be controlled by one processor. With<br />
a much higher number of units the communication overhead would outweigh the<br />
computation time. However, the control logic of the data I/O has not been <strong>in</strong> the<br />
focus of our optimisation efforts yet and, thus, we assume that slight improvements<br />
of the speed of the data I/O are still feasible. Especially if target<strong>in</strong>g an ASIC<br />
implementation, such numbers are likely to change.<br />
4.3.3 ECM-Based Acceleration of GNFS: a Case Study<br />
Build<strong>in</strong>g an efficient and cheap ECM hard<strong>ware</strong> can <strong>in</strong>fluence the overall performance<br />
of the GNFS s<strong>in</strong>ce ECM can be perfectly used for smoothness test<strong>in</strong>g step with<strong>in</strong><br />
the GNFS (see [64]). In this section, we briefly estimate the costs, space require-<br />
ments and power consumption of a special ECM hard<strong>ware</strong> implemented as ASIC.<br />
The motivation for such an analysis lies in the fact that an ASIC design can achieve<br />
roughly 10 times better performance than an FPGA design. Knowing the area<br />
requirements and timings of the ECM implementation makes it possible to compare<br />
our design fairly with other (future) solutions. In our estimate, we focus on the<br />
production cost which we<br />
believe to be much higher than the development cost of such an ASIC. This special<br />
hard<strong>ware</strong> could be produced as s<strong>in</strong>gle ICs (such as common CPUs), ready for the<br />
use <strong>in</strong> larger circuits. We choose a sett<strong>in</strong>g with a word width w = 8 and assume the<br />
use of carry save adders.<br />
Estimation of the Runtime We can determine the running time of both phases<br />
on the basis of the underlying ring arithmetic. The upper bounds for the number of clock<br />
cycles of a modular addition and a modular subtraction are given in Equations 4.2<br />
and 4.3, respectively. A setting with N = 199, w = 8, p = 8, and e = 25 yields<br />
$T_{add} = 3(e + 1) = 78$ and $T_{sub} = 2(e + 1) = 52$ cycles. According to Equation 2.4,<br />
the implemented multiplier requires $T_{mul} = 666$ cycles. For each operation we<br />
should include $T_{init} = 2$ cycles for initialisation of the ALU at the beginning of each<br />
computation.<br />
For the group operations for phase 1 we obtain<br />
$T_{Padd} = 5T_{mul} + 3T_{add} + 3T_{sub} + 11T_{init} = 3\,742$ and<br />
$T_{Pdbl} = 5T_{mul} + 2T_{add} + 2T_{sub} + 9T_{init} = 3\,608$<br />
clock cycles. For phase 2, $T_{Padd}$ changes to $T'_{Padd} = 4\,410$ cycles since $z_{A-B} \neq 1$ in<br />
most cases; hence, we have to take the multiplication by $z_{A-B}$ into account<br />
(an additional $T_{mul} + T_{init} = 668$ cycles).<br />
The total cycle count for both phases is<br />
$T_{Phase\,1} = 1\,374(T_{Padd} + T_{Pdbl}) = 10\,098\,900$ and<br />
$T_{Phase\,2} = 1\,881T'_{Padd} + 50T_{Pdbl} + 13\,590T_{mul} = 17\,553\,730$<br />
clock cycles, where each of the 13 590 multiplications again includes $T_{init}$.<br />
Excluding the time for pre- and post-processing, a unit needs approximately<br />
$27.7 \cdot 10^6$ clock cycles for both phases on one curve. If we assume a<br />
frequency of 500 MHz (for an ASIC), such a complex computation can be performed<br />
in approximately 55 ms.<br />
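The cycle counts above can be re-derived with a few lines of Python. This is only a<br />
bookkeeping sketch; the variable names are chosen for this example, and it assumes<br />
that each of the 13 590 standalone multiplications of phase 2 is charged<br />
T_mul + T_init, which is the reading under which the stated phase 2 total is<br />
reproduced.<br />

```python
E = 25                      # number of words for N = 199, w = 8
T_ADD = 3 * (E + 1)         # Equation 4.2 -> 78
T_SUB = 2 * (E + 1)         # Equation 4.3 -> 52
T_MUL = 666                 # taken from the text (Equation 2.4)
T_INIT = 2                  # ALU initialisation per operation

t_padd = 5 * T_MUL + 3 * T_ADD + 3 * T_SUB + 11 * T_INIT
t_pdbl = 5 * T_MUL + 2 * T_ADD + 2 * T_SUB + 9 * T_INIT
t_padd2 = t_padd + T_MUL + T_INIT       # extra multiplication by z_(A-B)

t_phase1 = 1374 * (t_padd + t_pdbl)
# assumption: every standalone multiplication also pays T_INIT
t_phase2 = 1881 * t_padd2 + 50 * t_pdbl + 13590 * (T_MUL + T_INIT)

total = t_phase1 + t_phase2
print(t_padd, t_pdbl, t_padd2)          # 3742 3608 4410
print(t_phase1, t_phase2, total)        # 10098900 17553730 27652630
print(round(total / 500e6 * 1e3, 1), "ms at 500 MHz")   # 55.3 ms at 500 MHz
```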
Estimation of Area Requirements The estimation of area requirements has<br />
been based on results published in [108]⁴: the multiplier with w = 8 and p = 8<br />
requires 21 400 transistors in standard CMOS technology (assuming 4 transistors<br />
per NAND gate). We assume that the circuit for addition and subtraction can be<br />
achieved with at most 1 000 transistors. For the memory, we assume (area expensive)<br />
static RAM which requires 25 200 transistors for 21 registers. For the unit’s<br />
internal control we assume an additional 6 000 transistors. The central control requires<br />
less than 2 000 000 transistors. Hence, one unit requires approximately 53 600<br />
transistors. Assuming the CMOS technology of a standard Pentium 4 processor (0.13<br />
µm, approx. 55 million transistors), we could fit 990 ECM units into the area of<br />
one standard processor. One ECM unit needs an area of approximately 0.1475 mm²<br />
and has a power dissipation of approximately 40 mW.<br />
⁴ The numbers provided in that contribution refer to a multiplier built with CSAs. Since we<br />
implemented the architecture with CPAs, the given numbers are larger (approximately 20%) than<br />
those which would be achieved with our design.<br />
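The transistor budget can be checked with a short calculation. The sketch below only<br />
re-adds the figures quoted above; the variable names are chosen for this example, and<br />
subtracting the central control before dividing is an assumption about how the figure<br />
of 990 units was obtained.<br />

```python
multiplier = 21_400          # w = 8, p = 8, from [108]
adder_subtractor = 1_000
memory = 25_200              # static RAM for 21 registers
unit_control = 6_000
central_control = 2_000_000  # shared by all units (upper bound)
pentium4 = 55_000_000        # approximate transistor count, 0.13 um

per_unit = multiplier + adder_subtractor + memory + unit_control
units = (pentium4 - central_control) // per_unit

print(per_unit)              # 53600
print(units)                 # 988, i.e. roughly 990 units
```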
Application to the GNFS Considering the architecture for a special GNFS<br />
hardware of [64], we have to test approximately $1.7 \cdot 10^{14}$ co-factors up to 125 bits<br />
for smoothness. Since both the running time and the area requirement scale<br />
linearly with the bit size, we can multiply the results from the subsections above<br />
by a factor of 125/198 ≈ 0.628. If we distribute the computation over a whole<br />
year, we have to check 5 390 665 co-factors per second⁵.<br />
For a probability of success of p > 80%, we test 20 curves per co-factor; thus,<br />
we need approximately 3 850 000 ECM units, which would yield a total chip area<br />
of 625 000 mm² (= 4 300 ICs of the size of a Pentium 4) and a power consumption<br />
of approximately 175 kW. If we assume a cost of US$ 5 000 per 300 mm wafer, as<br />
done in [103], the ECM units would cost less than US$ 45 000 for the whole GNFS<br />
architecture, which is negligible in the context of the overall costs.<br />
⁵ Note that we only take the time for finding the first factor into account. Since this happens<br />
quite seldom, we neglect the factorization of the remainder for our estimate.<br />
4.4 Conclusions and Future Steps<br />
In this chapter we presented the first published implementation of the ECM in<br />
real hardware for factoring numbers up to 200 bits. To make the implementation<br />
possible, the algorithm was adapted to the conditions given by the hardware, e.g. limited<br />
memory space, bus width and communication load. The parametrisation of the<br />
algorithms was done to fit the needs of a hardware environment in particular, yielding<br />
a high efficiency regarding the area-time product.<br />
The sequential control part of the ECM is operated by software commands of the<br />
embedded ARM processor. For computationally intensive operations, special-purpose<br />
hardware was implemented on a Xilinx FPGA. The ECM unit provides full support for<br />
all computations of phases 1 and 2 of the ECM. It is also possible to include several<br />
ECM units working in parallel in one FPGA chip. Our implementation impressively<br />
shows that, due to very low area requirements and low data I/O, ECM is predestined<br />
for use in hardware. A single unit for factoring composites of up to 198 bits<br />
requires 506 flip-flops, 1754 lookup-tables and 44 block RAMs (less than 6% of the logic<br />
and 27% of the memory resources of the Xilinx Virtex2000E device).<br />
Thanks to scalability of the design, it is possible to change the data width and<br />
adapt it to target FPGA architecture. Another advantage lies <strong>in</strong> modularity of the<br />
design, namely the blocks for underly<strong>in</strong>g modular operations: addition/subtraction<br />
and multiplication/squar<strong>in</strong>g. At this stage we re-used the MMM very similar to the<br />
versions of the multiplier described <strong>in</strong> the chapter 2.<br />
The known drawbacks of the design are the inefficient use of on-chip memory<br />
blocks and the low maximum clock frequency. Our proof-of-concept design does not<br />
dedicate registers to particular arithmetic operations or data-flow<br />
directions. Since the chosen algorithm for MMM requires simultaneous access<br />
for writing to and reading from the register holding the sum S, we have selected<br />
dual-port memory mode for all registers. Similarly, the multiplexing of the registers<br />
with input and output operands has been left universal and is therefore complicated and slow.<br />
As demonstrated, ECM can be perfectly parallelised and, thus, an implementa-<br />
tion at a larger scale can be used to assist the GNFS factor<strong>in</strong>g algorithm by carry<strong>in</strong>g<br />
out all required smoothness tests. A low cost ASIC implementation of ECM can<br />
decrease the overall costs of the GNFS architecture SHARK, as shown <strong>in</strong> [64]. We<br />
believe that an extensive use of ECM for smoothness test<strong>in</strong>g can further reduce the<br />
costs of such a GNFS mach<strong>in</strong>e.<br />
As future steps, variants of phase 2 can be exam<strong>in</strong>ed <strong>in</strong> order to achieve the<br />
lowest possible AT product. To achieve a higher maximal clock frequency of the<br />
ECM unit, the control logic <strong>in</strong>side the unit might be optimised.<br />
S<strong>in</strong>ce most of the computation time is spent for modular multiplications, an im-<br />
provement of the implementation of the multiplication directly affects the overall<br />
performance. Hence, alternative architectures for the multiplication can be <strong>in</strong>vesti-<br />
gated.<br />
5 True Random Number Generator - preliminaries<br />
Random values play a crucial role in several areas of science. Depending on the field<br />
of application, the requirements on the parameters of a random sequence and on the<br />
generator of the sequence itself may vary. Focusing on the origin of the sequence, we<br />
distinguish between truly random and pseudo-random sequences. The construction of<br />
generators decides on their suitability for commercial or research applications.<br />
In the follow<strong>in</strong>g chapter we provide an <strong>in</strong>troduction to the topic of randomness<br />
and random values (Section 5.1) while focus<strong>in</strong>g on generators applicable <strong>in</strong> cryptog-<br />
raphy. In Section 5.2 we mention typical sources for generation of random sequences<br />
<strong>in</strong> digital circuits. In Section 5.3 we summarise design ideas of the PLL-based gen-<br />
erator we will analyse <strong>in</strong> the follow<strong>in</strong>g chapter. In Section 5.4 we expla<strong>in</strong> test<strong>in</strong>g<br />
techniques applied <strong>in</strong> order to evaluate generators and <strong>in</strong> Section 5.5 we discuss is-<br />
sues related to attacks on RNGs. F<strong>in</strong>ally, <strong>in</strong> Section 5.6 we summarise the chapter.<br />
5.1 Randomness<br />
We start with the topic of randomness. The most natural questions that come<br />
to mind are: How can randomness be defined? Where does it come from?<br />
And how can we prove that a sequence is random?<br />
The randomness of the world we live in has long been a scientific and philosophical<br />
topic. A famous remark of Albert Einstein says that “God does not<br />
play dice with the universe”, which might convince us of the determinism of our<br />
environment. However, several phenomena of the physical world have been shown<br />
to have a random nature, e.g. the probabilistic nature of quantum mechanics,<br />
thermal and shot noise in electronic components, or nuclear decay.<br />
The fundamental problem of randomness is <strong>in</strong> fact that even with exact def<strong>in</strong>ition<br />
it is very difficult to prove whether any f<strong>in</strong>ite numeric sequence is random or not. The<br />
randomness of a source is evaluated through the parameters of sequence generated<br />
us<strong>in</strong>g that source. The way how the values of sequence are extracted from the source<br />
depends on applied harvest<strong>in</strong>g mechanism. The optimal harvest<strong>in</strong>g does not disturb<br />
the random physical process and extracts as much entropy as possible.<br />
The entropy H of a random variable X with n outcomes $\{x_i : i = 1, \ldots, n\}$ is<br />
defined as the expected value of the negative logarithm of the outcome probabilities<br />
[68], which can be expressed by the following equation:<br />
$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$ (5.1)<br />
where $p(x_i)$ is the probability of the outcome $x_i$. Therefore, the higher<br />
the level of entropy, the less predictable the process. A completely random process<br />
with maximal entropy provides a uniformly distributed sequence. For natural<br />
sources of randomness it is usually more difficult to achieve good statistical properties<br />
of the sequences since they tend to include a certain level of bias or another kind of<br />
deviation from an ideally equiprobable sequence. Post-processing sequence converters<br />
are able to improve the statistical distribution, but usually reduce the output bit rate<br />
of the sequence.<br />
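Equation 5.1 can be evaluated directly for a known probability distribution. The<br />
following Python sketch uses b = 2, so that the result is in bits; the function name<br />
`shannon_entropy` and the example distributions are assumptions made for this<br />
illustration. It shows that a biased binary source yields less than the maximal<br />
1 bit per output.<br />

```python
import math


def shannon_entropy(probabilities, base=2):
    """Evaluate Equation 5.1; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


if __name__ == "__main__":
    print(shannon_entropy([0.5, 0.5]))              # 1.0
    print(round(shannon_entropy([0.9, 0.1]), 3))    # 0.469
```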
Achieving a constantly high level of entropy in an RNG assures randomness of the<br />
produced bit sequence. When designing an RNG it is important to find the level of<br />
entropy in the source, the relation between the generator’s parameters and the entropy<br />
level, and a monitoring mechanism for the entropy level.<br />
5.1.1 Def<strong>in</strong>itions of Randomness<br />
There are several partial def<strong>in</strong>itions of random numbers that help us to gather<br />
the requirements given on random sequences and devices generat<strong>in</strong>g them. Let us<br />
mention some of the def<strong>in</strong>itions.<br />
The follow<strong>in</strong>g def<strong>in</strong>itions provide us <strong>in</strong>formation about the process by which the<br />
random numbers should be generated - a truly random number is generated by<br />
a process, whose outcome is unpredictable, and which cannot be subsequentially<br />
reliably reproduced. The unpredictability of the process means that each output<br />
state of the process is equally possible and may be guessed correctly with the same<br />
(negligible) probability (follow<strong>in</strong>g the uniform distribution). The ability to repro-<br />
duce the random process would require some sign of periodic pattern <strong>in</strong> the process<br />
behaviour, what is undesirable <strong>in</strong> case of a random pattern.<br />
Chaitin’s Theorem [40] says that it is formally impossible to verify whether a<br />
finite sequence is random or not. Since in practice we never deal with infinite<br />
sequences, what we can do is check the practical randomness of a finite sequence, that<br />
is, evaluate to what extent the sequence under review shares the statistical properties<br />
of an ideal random sequence, e.g. the equal probability of all possible outputs.<br />
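As an illustration of checking one statistical property of a finite sequence, the following minimal sketch computes the p-value of a frequency (monobit) test in the style of common statistical test suites. The function name and the choice of this particular test are assumptions made for illustration; the thesis does not prescribe them here.

```python
import math


def monobit_p_value(bits):
    """Two-sided p-value of the frequency (monobit) test for a sequence of 0/1 values."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty sequence")
    # Map 0 -> -1 and 1 -> +1 and sum; an unbiased source gives a sum close to 0.
    s = sum(1 if b else -1 for b in bits)
    # For independent unbiased bits, s / sqrt(n) is approximately N(0, 1).
    return math.erfc(abs(s) / math.sqrt(2 * n))


if __name__ == "__main__":
    print(monobit_p_value([0, 1] * 50))  # balanced but fully predictable sequence
    print(monobit_p_value([1] * 100))    # heavily biased sequence
```

Note that the alternating sequence passes this test although it is perfectly predictable, which shows that equal probability of outputs is a necessary but not sufficient property.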
According to Knuth [76], a sequence of random numbers is a sequence of independent<br />
numbers with a specified distribution and a specified probability of falling<br />
in any given range of values. Another definition comes from Schneier [101], who says<br />
that a sequence is random if it has the same statistical properties as random bits,<br />
is unpredictable and cannot be reliably reproduced. Kolmogorov defines a string of<br />
bits as random if and only if it is shorter than any computer program that<br />
can produce that string. From all three definitions we can extract a common requirement<br />
(necessary but not sufficient): the numbers in a random sequence must be<br />
uncorrelated 6 .<br />
An unpredictable sequence is one for which knowledge of all values generated<br />
in the past does not increase the probability of guessing the subsequent value; in other<br />
words, knowing some of the numbers in the sequence must not help in predicting the<br />
others. The same fact can be illustrated by another definition of unpredictability,<br />
which requires that there be no polynomial-time algorithm that, knowing the first<br />
l bits of the generated sequence S, is able to predict the (l+1)-th bit with probability<br />
greater than 0.5 [86]. The absence of correlation also implies that the generated random<br />
sequence cannot be produced by any computer program other than one that prints the whole<br />
random sequence as it is.<br />
By a truly random sequence of bits we understand an uncorrelated sequence<br />
that cannot be reproduced or predicted, has equal probability of all possible outputs<br />
(equiprobability), and whose generation is based on a random process.<br />
A sequence that keeps the statistical properties of a random sequence, but whose<br />
members are correlated or which can be reproduced, is called pseudo-random.<br />
A pseudo-random sequence looks random, but it does not originate in a random process,<br />
and its generation can be reproduced and described as an algorithm.<br />
One of the issues discussed in the thesis is the ability to distinguish between<br />
truly random and pseudo-random sequences by examining the generation process<br />
in the generator.<br />
5.1.2 Random Number Generator<br />
An RNG is an electronic device or software routine designed to yield a sequence of<br />
random numbers.<br />
A pseudo-random number generator (PRNG) is based on an algebraic function<br />
that expands an initial random value (a seed) into a random-looking sequence.<br />
A true random number generator (TRNG) includes a physical source of randomness<br />
and a harvesting mechanism which extracts the randomness and generates truly<br />
random values.<br />
6 However, simple (linear) and known correlations between the members of a sequence do<br />
not rule out such a source. In these cases a corrector that removes the correlated samples may be<br />
applied. More dangerous are masked higher-order correlations, which are difficult to detect.<br />
The security level of a PRNG depends on the complexity of the generating function, the<br />
period length of the generated sequence, and the amount of entropy in the seed. As<br />
a result, pseudo-random sequences may achieve a high level of unpredictability<br />
if the generating function is sufficiently complex. However, a pseudo-random<br />
sequence always has a finite period and remains reproducible as long as the<br />
initial conditions are preserved.<br />
A PRNG is the only choice for purely software implementations, and thanks to its<br />
deterministic components it also attracts designers of digital electronic systems.<br />
Note that a pseudo-random sequence can also be unpredictable when produced by a<br />
cryptographically secure PRNG, e.g. one based on hash (one-way) functions, stream<br />
ciphers or the Blum Blum Shub principle [28]. A PRNG requires a random seed (from<br />
a TRNG or another reliable source of entropy, if available) to obtain its starting level<br />
of entropy. As the system is deterministic, for identical seeds the PRNG generates<br />
identical output sequences. No further entropy is added while the<br />
seed is being expanded; therefore the seed’s entropy bounds the unpredictability<br />
of the generated sequence.<br />
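The reproducibility of a PRNG can be demonstrated with a short sketch. It uses Python's standard `random.Random` class (a Mersenne Twister, which is not cryptographically secure) purely as a stand-in for a generic PRNG; the helper name `prng_stream` and the seed values are illustrative assumptions.

```python
import random


def prng_stream(seed, count):
    """Expand a seed into `count` pseudo-random bytes with a deterministic PRNG."""
    rng = random.Random(seed)  # all internal state is derived from the seed
    return [rng.getrandbits(8) for _ in range(count)]


if __name__ == "__main__":
    first = prng_stream(2010, 8)
    second = prng_stream(2010, 8)
    print(first == second)  # identical seeds reproduce the identical sequence
```

Anyone who learns the seed can regenerate the whole sequence, which is why the entropy of the seed limits the unpredictability of the output.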
The term generator is not completely accurate in the case of a TRNG, as the randomness<br />
is not generated but rather extracted from a source of randomness (see Figure 5 –<br />
1). In a TRNG the occurrence of random events is sampled by an extractor and<br />
transformed into a sequence of numerical values, usually expressed as a binary stream.<br />
[Figure: block diagram. Source of randomness (noise signal) → A/D conversion (digitised noise signal) → Postprocessing (internal random sequence) → Output buffer (random number sequence) → external interface. The A/D conversion separates the analogue part from the digital part.]<br />
Figure 5 – 1 Schematic diagram of a TRNG with designation of internal signals and interfaces<br />
Figure 5 – 1 represents a typical design of a TRNG based on a physical phenomenon.<br />
Using a suitable harvesting mechanism, the analogue signal is converted into<br />
its digitised form. Depending on the statistical properties of the signal, post-processing<br />
may be required in order to produce the internal random sequence. The<br />
generated sequence can be further accumulated in an output buffer before leaving the<br />
generator on an external request.<br />
5.1.3 Applications of Random Numbers<br />
Random or pseudo-random values are applied in a variety of areas, e.g.<br />
in simulation methods such as Monte Carlo [84], in the generation of spreading sequences in<br />
spread spectrum communication systems [106], in the generation of primes, in several<br />
cryptographic algorithms, or in the gambling industry. Naturally, the requirements on<br />
generators and on the generated random data differ according to the application.<br />
In addition to proper statistical parameters, a random sequence generated for a<br />
sensitive cryptographic application has to be unpredictable and unrepeatable. By<br />
unrepeatability we mean that each use of the generator yields a completely different<br />
random sequence, even under identical starting conditions (such as the seed of a PRNG). This<br />
is an inherent feature of TRNGs based on entropy extraction from natural physical<br />
phenomena. In such a case the entropy of the generator is increased by each generated<br />
value.<br />
Application areas for RNGs can be found in a number of cryptographic<br />
algorithms. The dominant application of an RNG is the secure generation of<br />
encryption keys. The bit length of the key is chosen depending on the length of<br />
time for which the key is valid. In different cryptographic applications this time can<br />
vary from seconds for session keys to years for the encryption keys of archiving<br />
systems. Consequently, the RNG has to provide random values at a bit rate in the<br />
range from tens to thousands of bits per second. While for a PRNG it is not a<br />
problem to achieve high output bit rates, for a TRNG, as desired in high-security<br />
cryptosystems, the situation is different. A source of randomness in a TRNG may<br />
have a low level of entropy per bit, which also means a low output bit rate because the<br />
entropy has to be accumulated.<br />
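The link between entropy per bit and achievable output bit rate can be sketched as follows. The sketch assumes, as a simplification not stated in the thesis, that the raw bits are independent and only biased, so the Shannon entropy of a single bit applies; the function names and the example figures are hypothetical.

```python
import math


def binary_entropy(p_one):
    """Shannon entropy (in bits) of one raw bit that equals 1 with probability p_one."""
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_one in (0.0, 1.0):
        return 0.0  # a constant bit carries no entropy
    q = 1.0 - p_one
    return -(p_one * math.log2(p_one) + q * math.log2(q))


def max_output_bit_rate(raw_bit_rate, p_one):
    """Upper bound on the rate of full-entropy output bits (independent raw bits assumed)."""
    return raw_bit_rate * binary_entropy(p_one)


if __name__ == "__main__":
    # A raw source at 1000 bit/s with 90 % ones carries roughly 0.47 bit of entropy per bit.
    print(binary_entropy(0.9), max_output_bit_rate(1000, 0.9))
```

Under these assumptions a source with a strong bias has to accumulate more than two raw bits for every output bit, which directly lowers the output bit rate.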
In cryptography, the values produced by randomness extractors or generators are<br />
used as cryptographic keys, initialization vectors, padding bits, blinding values, or<br />
masking values in countermeasures against side-channel attacks. Depending<br />
on the application, the random value either needs to be kept secret, as in the case of<br />
encryption (secret) keys, or can be published, as a nonce or a part of a public key.<br />
Nowadays the security of cryptographic systems is not based on the secrecy of<br />
encryption methods, which are publicly known, but on the knowledge of a secret key.<br />
An adversary focuses all her attacks on revealing that secret information. Having<br />
control over the device that generates the keys allows the attacker to<br />
control all the systems whose security depends on them as well. These are the reasons<br />
why the randomness generation process is emphasised in cryptography.<br />
Requirements on TRNG for cryptography We can conclude the previous<br />
paragraphs with a list of special requirements placed on the implementation of TRNGs<br />
whose output sequences are applied in cryptography:<br />
• Specific statistical properties – the generated sequence must have perfect statistical<br />
properties. Any known bias in the probability of zeros and ones in the generated<br />
bit-stream could make cryptographic attacks easier, since a nonzero<br />
bias distorts the required uniform distribution. The expected parameters<br />
are usually achieved by post-processing of the random sequence.<br />
• Unpredictability – knowledge of an arbitrarily long sequence from the generator,<br />
or any other information about the internal state of the generator, should<br />
not enable anyone to predict preceding or subsequent generator outputs or to<br />
guess them with some non-negligible probability. Such behaviour is natural for<br />
random physical phenomena. The requirement is satisfied by a proof showing<br />
the origin of the random-looking sequence.<br />
• Security parameters – as an electronic device, the TRNG is itself a target of<br />
adversarial attacks. Rather than a one-off disclosure of the secret key, an adversary<br />
is usually interested in influencing the key generation process permanently. As<br />
a means of improving resistance against this kind of attack, RNG<br />
designers should consider implementing on-line tests tailored to the harvesting<br />
mechanism of the TRNG.<br />
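The first requirement mentions post-processing as the usual way of removing bias. One classical example is the von Neumann corrector, sketched below; it is chosen here only for illustration and is not claimed to be the post-processing used in the thesis. The helper `biased_bits` simulates a biased source of independent bits and is a hypothetical stand-in for a digitised noise signal.

```python
import random


def von_neumann_corrector(bits):
    """Map non-overlapping pairs 01 -> 0 and 10 -> 1; drop the pairs 00 and 11."""
    out = []
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)
    return out


def biased_bits(count, p_one, seed):
    """Simulated raw source: independent bits equal to 1 with probability p_one."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(count)]


if __name__ == "__main__":
    raw = biased_bits(20000, 0.8, seed=1)
    corrected = von_neumann_corrector(raw)
    print(sum(raw) / len(raw), sum(corrected) / len(corrected), len(corrected))
```

The corrected stream is balanced only if the raw bits are independent, and it is much shorter than the raw stream, which illustrates the trade-off between statistical quality and output bit rate.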
5.2 TRNG Implementations in Digital Systems<br />
In the following part of the thesis we provide an overview of known TRNG implementations<br />
and design proposals. We focus mostly on designs intended for<br />
digital circuits.<br />
Nowadays, the common hardware platform for the implementation of cryptographic<br />
primitives is a digital device. Cryptographic functions are performed as software<br />
code on embedded processors in DSPs, FPGAs, SoCs etc., or run on dedicated<br />
(co)processors built from programmable (FPGA) or hard-wired (ASIC) logic cells. This<br />
fact motivates research into generators that can be integrated into circuits that are<br />
completely digital.<br />
Digital circuits are naturally well suited to the implementation of a PRNG because<br />
of their deterministic nature. To implement a physical TRNG, it is necessary<br />
to look for a source of randomness inside the circuit. Typical digital circuits include<br />
only a limited range of sources of randomness, which we investigate further below.<br />
As we have already explained, true randomness is achievable only in generators based<br />
on some physical phenomenon. However, one of the main objectives of digital system<br />
designers is to minimise the impact of spurious analogue effects and to achieve<br />
perfect stability of the system. They therefore aim to optimise the clock distribution<br />
network for a wide range of frequencies and to design the PCB layout and<br />
power supply network carefully. These requirements on the system are contradictory:<br />
on one side we expect perfectly deterministic behaviour of the digital part of the<br />
system, and on the other side we look for a high-quality source of true randomness for<br />
a TRNG placed in the same system.<br />
For the sake of security, completely embedded implementations of an<br />
RNG are preferred. In such a case the internal signals of the RNG are not exposed to<br />
potential attacks. However, owing to the lack of suitable sources of randomness on a given<br />
platform, some designs propose the use of external discrete components as the source of<br />
randomness, while the processing part of the generator is implemented in the digital part of<br />
the system (e.g. [112]).<br />
5.2.1 Sources of Randomness<br />
The following sources of randomness can be found in digital devices:<br />
• metastability<br />
• various types of noise<br />
• clock jitter<br />
Although clock jitter is primarily caused by noise and could therefore be<br />
included under noise, we treat jitter as a separate category. TRNGs<br />
based on jitter use techniques different from those based on direct sampling of<br />
noise. In addition, generators sourced by jitter are among the most popular<br />
TRNG designs.<br />
We note that although the sources of randomness are presented separately, it is<br />
generally difficult to separate them in practical designs, where all of them<br />
may be present and influence the entropy of the randomness source. As an example<br />
we can mention a generator kept in a metastable state, whose resolved output value will<br />
be influenced by the noise conditions inside the generator. In such a case, the primary<br />
source of randomness is the metastability and the secondary source is the noise.<br />
Metastability A fundamental building block of digital circuits, the flip-flop (FF),<br />
has two well-defined stable states, the high and low levels, usually denoted as 1 and 0<br />
(see Figure 5 – 2). Under certain conditions the device may get into a state which<br />
is neither of these stable states. This condition is called<br />
metastability.<br />
[Figure: stable state 0, metastable state, stable state 1.]<br />
Figure 5 – 2 Illustration of stable states (0 and 1) and undefined metastable state<br />
The most common way to bring a device into metastability is to violate its<br />
setup 7 and hold 8 times. This can be achieved by choosing the frequencies<br />
of the clock and input signals of the FF in a ratio that causes<br />
the input signal level to change too close to the edges of the clock signal. Another option<br />
is that the frequencies of the signals are the same, but the phases are aligned in a<br />
way that violates the FF’s setup and hold times.<br />
Keeping the FF close to metastability and then allowing it to resolve produces a<br />
binary sequence that depends on the noise conditions inside the FF at the time of release.<br />
If the noise originates in thermal motion, then its random nature suggests that<br />
repeatedly clocking an FF forced into metastability will produce a succession of<br />
bits with little correlation between any pair in the sequence [75].<br />
7 Setup time is defined as the minimum time before the sampling edge during which the sampled<br />
signal must be stable.<br />
8 Hold time is defined as the minimum time after the sampling edge during which the sampled<br />
signal must be stable.<br />
For generators based on metastability the main implementation issue is<br />
the phase or frequency control of the input signals that enforces the metastability<br />
conditions. A complicated control system makes the implementation more vulnerable<br />
to attacks. In RNGs based on other randomness extraction techniques, e.g. on<br />
free-running oscillators, metastability may also occur and contribute to the overall<br />
entropy of the randomness source [52].<br />
Producers of FPGAs, and of digital circuits in general, constantly work on reducing<br />
the setup and hold times 9 , as metastability produces undesirable non-deterministic<br />
exceptions in the behaviour of the devices [6]. Therefore the published implementations<br />
of such TRNGs [75, 83, 121] usually propose special circuits implemented e.g. in<br />
CMOS technology.<br />
Because the metastable condition is difficult to maintain over a long period, we<br />
can conclude that metastability is a good (secondary) source of randomness<br />
when it is combined with other sources.<br />
Noise Despite their deterministic behaviour, digital devices are built from analogue<br />
elements that naturally produce a certain level of noise. There is always a source of<br />
noise (e.g. thermal noise of resistances, or shot noise) present in an electronic device. In<br />
order to use the noise as a source of randomness it is necessary to amplify the noise<br />
itself or the effects caused by the noise.<br />
Most true hardware RNGs depend primarily on a source of thermal noise,<br />
which is then post-processed to reduce the effects of deterministic internal and<br />
external influences such as power supply variations, DC bias, and electromagnetic<br />
fields [73]. Direct amplification and sampling of a noisy signal is not possible in<br />
purely digital circuits. However, more complex devices are not exclusively digital and<br />
include embedded components for mixed-signal (analogue-digital) processing, such as<br />
A/D and D/A converters, or clock circuitry for signal skew compensation.<br />
A technique using a clocked comparator fed by directly amplified noise is applicable<br />
only to well-shielded noise sources, which can hardly be achieved in<br />
integrated digital systems. Instead of amplifying the noise directly, it is technically<br />
more feasible to amplify signals that include a randomly changing component but<br />
have a higher amplitude than the noise itself (see e.g. [24, 73]).<br />
9 For the LE FF of Altera Stratix II, speed grade -3, the setup time is tSU = 90 ps and the hold<br />
time is tH = 149 ps [21].<br />
Bagini and Bucci [24] provide one of the first designs that include an analytical model<br />
of the generator behaviour and a self-testing procedure. A comparator is used to convert<br />
the analogue noise into a binary signal. The balanced signal is then sampled by<br />
a D-type (delay) FF. The number of internal transitions in the generated binary signal allows<br />
online checking of the generator behaviour.<br />
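A loose sketch of an online check based on counting transitions is given below. It is not the procedure of Bagini and Bucci; it only illustrates the idea on a window of sampled bits. The expected count of about half of the adjacent pairs (valid for independent unbiased bits) and the `tolerance` threshold are assumptions chosen for illustration.

```python
def count_transitions(bits):
    """Number of positions where two adjacent bits differ."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def transitions_in_range(bits, tolerance=0.25):
    """Return True if the transition count is close to the value expected for random bits.

    For independent unbiased bits about half of the adjacent pairs differ.
    `tolerance` is a hypothetical threshold expressed as a fraction of the pair count.
    """
    pairs = len(bits) - 1
    if pairs < 1:
        raise ValueError("need at least two bits")
    expected = pairs / 2
    return abs(count_transitions(bits) - expected) <= tolerance * pairs


if __name__ == "__main__":
    print(transitions_in_range([0] * 64))     # stuck-at output: too few transitions
    print(transitions_in_range([0, 1] * 32))  # pure oscillation: too many transitions
```

Both a stuck output and a purely periodic output are flagged, which are two typical failure modes that such a counter can detect cheaply in hardware.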
Noise, as an intrinsic and reliable source of randomness in electronic devices, is attractive<br />
for designers of TRNGs. We further elaborate on its influence on signals, e.g. in the form of<br />
jitter.<br />
Jitter In this part we discuss various sources of jitter and the classification of<br />
jitter components into deterministic and random ones. We start with some basic<br />
definitions of jitter, deterministic jitter and random jitter [102].<br />
By convention, timing variations are split into two categories, called jitter and<br />
wander, based on a Fourier analysis of the variations vs. time. Timing variations<br />
that occur slowly are called wander, whereas jitter describes timing<br />
variations that occur more rapidly. The threshold between wander and jitter<br />
is defined to be 10 Hz according to the ITU, but other definitions may also be<br />
encountered.<br />
We continue with a more specific definition of jitter and its two components.<br />
Jitter is a deviation from the ideal timing of an event (see Figure 5 – 3). The<br />
reference event is the differential zero crossing for electrical signals and the<br />
nominal receiver threshold power level for optical systems. Jitter is composed<br />
of both deterministic and Gaussian (random) content.<br />
Deterministic Jitter (DJ) is jitter with a non-Gaussian probability density<br />
function. Deterministic jitter is always bounded in amplitude and has specific<br />
causes. Four kinds of deterministic jitter are identified: duty cycle distortion,<br />
data dependent, sinusoidal or periodic, and uncorrelated (to the data)<br />
bounded jitter. DJ is characterized by its bounded, peak-to-peak value.<br />
Random Jitter (RJ) is jitter that is characterized by a Gaussian distribution.<br />
The random jitter value is defined as a peak-to-peak value, which is taken to be 14<br />
times the standard deviation (14σjit) of the Gaussian distribution.<br />
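The factor 14 in the RJ definition can be related to a tail probability of the Gaussian distribution. The sketch below assumes the common convention, which is not stated in the text, that the peak-to-peak interval is chosen so that each tail is exceeded with a probability equal to a target bit error rate; a rate of 10^-12 then gives a factor close to 14. The function name is illustrative.

```python
from statistics import NormalDist


def rj_peak_to_peak_factor(tail_probability):
    """Width of a symmetric interval, in units of sigma, whose single-sided
    tail probability equals `tail_probability` for a Gaussian distribution."""
    q = -NormalDist().inv_cdf(tail_probability)  # one-sided quantile (Q factor)
    return 2.0 * q


if __name__ == "__main__":
    print(rj_peak_to_peak_factor(1e-12))  # close to 14
    print(rj_peak_to_peak_factor(1e-9))   # a looser target gives a smaller factor
```

Under this assumption the peak-to-peak RJ figure is simply a scaled standard deviation, so it carries no more information than σjit together with the chosen tail probability.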
Having defined jitter, we continue with definitions of three<br />
types of jitter that differ in the reference signal that is considered to be ideal (free of<br />
jitter) and in the time period of observation [102]. We also add our own definition of<br />
tracking jitter, which plays a crucial role in the randomness extraction method of the<br />
PLL-based TRNG.<br />
[Figure: clock waveform with the labels reference edge, mean period, unit interval, and jitter.]<br />
Figure 5 – 3 Timing jitter in clock signal<br />
Cycle-to-cycle jitter is the difference in a clock’s period from one cycle to the next<br />
one. Cycle-to-cycle jitter is the most difficult to measure, usually requiring a<br />
timing interval analyser.<br />
Half-period jitter is the measure of the maximum change in a clock’s output transition<br />
from its ideal position during one-half period.<br />
Period jitter is the change in a clock’s output transition, typically the rising edge,<br />
from its ideal position over consecutive clock edges. Period jitter is measured<br />
and expressed in time or frequency. Period jitter measurements are used to<br />
calculate timing margins in systems.<br />
Tracking jitter is defined as a variation in the time relationship between the edges of<br />
the reference (input) clock and the output clock of a clock circuitry.<br />
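The definitions above can be made concrete with a small sketch that derives the jitter quantities from a list of rising-edge timestamps. The function names, the use of integer picoseconds, and the interpretation of tracking jitter as the deviation of the edge offset from its mean are assumptions made for illustration only.

```python
def periods(edge_times):
    """Clock periods between consecutive rising edges."""
    return [t1 - t0 for t0, t1 in zip(edge_times, edge_times[1:])]


def period_jitter(edge_times, ideal_period):
    """Deviation of each measured period from the ideal period."""
    return [p - ideal_period for p in periods(edge_times)]


def cycle_to_cycle_jitter(edge_times):
    """Difference between each period and the period of the preceding cycle."""
    p = periods(edge_times)
    return [b - a for a, b in zip(p, p[1:])]


def tracking_jitter(reference_edges, output_edges):
    """Variation of the output-to-reference edge offset around its mean value."""
    offsets = [o - r for r, o in zip(reference_edges, output_edges)]
    mean_offset = sum(offsets) / len(offsets)
    return [x - mean_offset for x in offsets]


if __name__ == "__main__":
    reference = [0, 1000, 2000, 3000, 4000]  # ideal edges in ps (hypothetical data)
    output = [0, 1000, 2010, 3005, 4005]     # measured edges in ps (hypothetical data)
    print(period_jitter(output, 1000))
    print(cycle_to_cycle_jitter(output))
    print(tracking_jitter(reference, output))
```

The same edge list yields different numbers for each jitter type, which reflects the fact that the types differ only in the reference they are measured against.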
Deterministic periodic jitter is typically caused by external deterministic noise<br />
sources coupling into a system, such as switching power-supply noise or a strong<br />
local radio frequency carrier. It may also be caused by an unstable clock-recovery<br />
PLL.<br />
While a random process can, in theory, have any probability distribution, random<br />
jitter is assumed to have a Gaussian distribution for the purpose of the jitter<br />
model. One reason for this is that the primary source of random noise in many<br />
electrical circuits is thermal noise (also called Johnson noise), which<br />
is known to have a Gaussian distribution. Another, more fundamental reason is<br />
that the composite effect of many uncorrelated noise sources, no matter what the<br />
distributions of the individual sources, approaches a Gaussian distribution according<br />
to the central limit theorem [107].<br />
For a random signal with a Gaussian distribution, there is theoretically no limit<br />
on the maximum and minimum values, so the observed peak-to-peak value will generally grow<br />
over time. For this reason, the peak-to-peak value should be used in conjunction with<br />
the population size and some knowledge of the type of distribution.<br />
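The growth of the observed peak-to-peak value with the population size can be checked with a short simulation. The sketch draws Gaussian samples from Python's standard PRNG and evaluates the peak-to-peak value on growing prefixes of the same record; the seed, sizes and function name are arbitrary illustrative choices.

```python
import random


def peak_to_peak_by_population(sizes, sigma=1.0, seed=2010):
    """Peak-to-peak value of the first n Gaussian samples for each n in `sizes`."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, sigma) for _ in range(max(sizes))]
    # Each prefix contains the previous one, so the value can only stay or grow.
    return {n: max(samples[:n]) - min(samples[:n]) for n in sizes}


if __name__ == "__main__":
    for n, value in peak_to_peak_by_population([100, 10_000, 100_000]).items():
        print(n, round(value, 2))
```

Because a longer record contains the shorter one, its peak-to-peak value can never be smaller, which is why a peak-to-peak figure is meaningful only together with the number of observed samples.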
5.2.2 Survey of Designs Based on Jitter<br />
In this section we summarise the currently best-known concepts and designs of generators<br />
based on the extraction of randomness from clock jitter. The jitter appears in clock<br />
signals generated by free-running oscillators or PLL circuitry implemented inside a<br />
digital device.<br />
The Tkacik TRNG Design The generator invented by Tkacik [111] combines<br />
two deterministic circuits: a linear feedback shift register (LFSR)<br />
and a cellular automata shift register (CASR). The registers are clocked by two independent<br />
ring oscillators whose clock frequencies are influenced by external effects and include<br />
jitter. Selected outputs of the CASR and LFSR are XORed together<br />
to provide the final random signal. The harvesting technique of the generator is very<br />
complex and no verification of its effectiveness is provided.<br />
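The deterministic part of such a structure can be illustrated with a toy model. The sketch below is not Tkacik's circuit: the 16-bit register widths, the LFSR taps (a well-known maximal-length example), the rule 90/150 cell assignment and the lockstep clocking are all assumptions chosen for brevity. In the real design the two registers are driven by separate jittery ring oscillators, which is where any true randomness would have to come from.

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def casr_step(cells, rule150_mask, width):
    """One step of a hybrid rule 90/150 cellular automaton with null boundaries.

    Every cell becomes the XOR of its two neighbours (rule 90); cells selected
    by `rule150_mask` additionally XOR in their own value (rule 150).
    """
    mask = (1 << width) - 1
    left = (cells << 1) & mask
    right = cells >> 1
    return (left ^ right ^ (cells & rule150_mask)) & mask


def combined_outputs(lfsr_seed, casr_seed, count, width=16, rule150_mask=0x0001):
    """XOR of both register states when the registers are stepped in lockstep."""
    lfsr, casr = lfsr_seed, casr_seed
    out = []
    for _ in range(count):
        lfsr = lfsr16_step(lfsr)
        casr = casr_step(casr, rule150_mask, width)
        out.append(lfsr ^ casr)
    return out


if __name__ == "__main__":
    print([hex(v) for v in combined_outputs(0xACE1, 0x1234, 4)])
```

With ideal clocks the output is fully reproducible from the two seeds, which is consistent with the concern that the amount of genuine entropy in such a generator depends entirely on the clock jitter and is hard to quantify.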
The design was evaluated by Dichtl [43], who pointed out that the<br />
source of randomness in the generator is unclear. Under certain conditions and with partial<br />
knowledge of some internal values, an attacker is able to predict the generated value<br />
owing to its low level of entropy.<br />
The Fischer and Drutarovský Design In the design of Fischer and Drutarovský<br />
[60] the idea is to extract random values by sampling a clock signal influenced by<br />
the tracking jitter of an analogue PLL in Altera FPGAs. The jitter can be<br />
sampled only under a defined condition, when the frequencies of the sampled and sampling<br />
clock signals are in a certain ratio.<br />
Sampling of the clock signal is executed periodically with a period given by the PLL<br />
dividers. Samples taken in the transition zones have a nonzero probability of resulting in<br />
either a logical one or a zero, and are called critical samples. The position of the critical<br />
samples remains stable during operation of the generator as long as the working conditions<br />
of the generator do not change.<br />
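The notion of critical samples can be illustrated by a simplified simulation. The model below rests on assumptions that are not taken from the text: the sampled clock runs at K_M/K_D times the sampling frequency (hypothetical names for the PLL multiplication and division factors, assumed coprime), the jitter is Gaussian and independent for every sample, and the sampled clock has a 50 % duty cycle.

```python
import random
from math import gcd


def sample_jittery_clock(k_m, k_d, sigma, periods, seed=1):
    """For each of the k_d sampling positions, collect the set of observed values.

    `sigma` is the jitter standard deviation in units of the sampled clock period.
    """
    if gcd(k_m, k_d) != 1:
        raise ValueError("k_m and k_d are assumed to be coprime")
    rng = random.Random(seed)
    seen = [set() for _ in range(k_d)]
    for _ in range(periods):
        for i in range(k_d):
            # Ideal phase of the sampled clock at the i-th sampling edge.
            phase = (i * k_m % k_d) / k_d
            phase = (phase + rng.gauss(0.0, sigma)) % 1.0
            seen[i].add(1 if phase < 0.5 else 0)  # high during the first half period
    return seen


def critical_positions(seen):
    """Sampling positions at which both logical values were observed."""
    return [i for i, values in enumerate(seen) if len(values) == 2]


if __name__ == "__main__":
    print(critical_positions(sample_jittery_clock(5, 7, 0.0, 200)))
    print(critical_positions(sample_jittery_clock(5, 7, 0.01, 200)))
```

In this model only the sampling positions that fall close to an edge of the sampled clock become critical, and their indices stay the same from one pattern period to the next as long as the ratio and the jitter level are unchanged.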
More details on the TRNG implementation and the features of the generator are<br />
given in the next section. This design provides us with a reference for the theoretical<br />
analysis and testing presented in the thesis.<br />
The Golić Design Golić’s goal is to provide a digital TRNG built from logic gates<br />
only. Such a design is cost-effective and suitable for implementation on any digital<br />
chip. In [70] Golić proposes two new elements for use in the<br />
design of TRNGs, shown in Figures 5 – 4(a) and 5 – 4(b): the Galois ring oscillator<br />
(GARO) and the Fibonacci ring oscillator (FIRO).<br />
[Figure: (a) Galois ring oscillator, (b) Fibonacci ring oscillator.]<br />
Figure 5 – 4 Ring oscillator structures proposed by Golić.<br />
Adding a more complex feedback loop to the ring oscillator (RO) also makes its<br />
behaviour more complex and therefore more suitable for a TRNG, since the randomness<br />
coming from jitter spreads faster. In comparison with a classical RO, the use of<br />
GARO and FIRO yields a higher level of entropy and robustness of the generator.<br />
Additional entropy comes from frequent metastability effects in the<br />
sampling gate.<br />
In [44] Golić and Dichtl show results of a practical implementation of a TRNG using<br />
the oscillators presented above. The authors demonstrate the randomness of the solution by<br />
analysing the generator output after repeated restarts of the circuit. The standard<br />
deviation of the output signal voltage rises quickly after the restart and stabilises<br />
at a significantly high level, which assures the randomness of a sample taken in this time<br />
period.<br />
The Kohlbrenner and Gaj Design. A principle similar to that of the PLL-based generator
[60] was proposed by Kohlbrenner and Gaj in [79]. Instead of PLL circuitry, which
is not present in all FPGAs, the authors use a pair of ring oscillators implemented
in the programmable logic area of the FPGA. Since the principle relies on a closely spaced pair of
frequencies generated by the rings, the oscillators must be matched precisely. This
also requires careful positioning of the rings inside the FPGA and manual corrections
of placement and routing.
The authors also investigated the influence of temperature on the RO. The frequency
of an RO tends to wander as the chip's temperature varies. It is important to place
the two ROs of a pair close to each other, so that the difference between their frequencies is
reduced thanks to the minimal difference in temperature.
The Bucci and Luzzi Testable TRNG Design Framework. The authors of the
testable TRNG design framework [36] introduce the idea of a stateless RNG which
generates statistically independent random bits. If the post-processing unit
is also memoryless, the internal random bits are independent too. The stateless
condition of the generator can be achieved by resetting the generation and
post-processing circuits before generating each random bit or word, respectively.
In RO-based generators the reset state is achieved by stopping the oscillators
after each bit generation, so that the phase shift between the oscillators does not
accumulate. Another motivation is to avoid a complicated deterministic beating
pattern between the fast and slow frequencies of the RNG. Should the generator include
any control or compensation loops, the stateless condition is met only once the
loops have reached their steady state.
The Sunar et al. TRNG Design. A theoretical concept of a generator based on
ring oscillators (ROs) of equal length was published by Sunar et al. in [105].
According to the concept, the outputs of several ROs are XORed together and sampled
by a D flip-flop. The number of oscillators is chosen according to the jitter size and the
internal frequency of the rings. The design goal of a properly working generator is
a uniformly distributed region of unpredictable transitions. It is assumed that the
phase drift caused by jitter appears in the internal signal of each ring and influences
the movement of the edges in the signal.
Several assumptions made by the authors of this concept were questioned by the
authors of [44]. The main problem lies in the expectation that the ROs are independent,
which is usually difficult to achieve due to their strong tendency to couple with each
other or to lock to a common frequency if there is a strong source of a periodic signal
close to the ROs.
Notes on Other Published Designs. Several published TRNG designs
are based on the frequency instability of free-running oscillators, e.g. [53]. Free-running
oscillators are typically also used in FPGA-based TRNGs [79, 112].
In recently published papers [31, 105] we can observe that successfully
passing statistical tests is no longer sufficient for a proposed RNG. Much more
attention is paid to the analysis and modelling of the randomness extraction process.
Theoretical bounds for entropy and statistical estimates of the RNG behaviour
are provided in order to prove the security of the generator. The requirement for
continuous testing of the generated sequence was raised by Schindler in [100]. As a
consequence, RNG designs should provide a testing method designed specifically
for the given type of RNG (see e.g. [36]).
In [26] the authors improve the modelling of RO TRNGs, and instead of conventional
time-based models they provide a phase-oriented representation. The observation
that ROs tend to couple with each other has been confirmed by
experiments with global deterministic jitter. Rather than concluding that coupling
reduces the randomness of a TRNG, the authors warn against overestimating the
jitter size. After removing the impact of global jitter, the accumulation of jitter is
much slower, which implies a lower sampling frequency of the generator in order to
obtain random sequences.
5.3 PLL-Based TRNG on FPGA<br />
In this section we introduce a TRNG implementation based on randomness extraction
from the tracking jitter that is inherent in the clock signal produced by the analog PLL
embedded in some FPGA families. The PLL circuitry, normally applied for the synthesis
of on-chip clock signals derived from an external quartz signal, is configured to provide a
pair of signals with a certain fixed ratio of their frequencies. The ratio is selected
for the purpose of jitter sampling and also determines other parameters of the generator, such as
the speed of the output random sequence.
In the following pages we compile the dependencies between the PLL and TRNG
parameters and explain their meaning. We explain the fundamental method behind
the PLL-based TRNG (PLL-TRNG) invented by Fischer and Drutarovský and
published in [60].
5.3.1 Randomness Extraction Method<br />
The tracking jitter in the output signal of the on-chip analog PLL is detected by
sampling the signal using another, rationally related clock signal. The fundamental
requirement for jitter sampling is that the sampled and sampling edges are set
close enough to each other. When this condition is met, the unpredictable jitter
decides the output values of the sampling gate. The simplified structure of the
PLL-TRNG is depicted in Figure 5 – 5.
Let us have two clock signals CLJ and CLK with frequencies $F_{CLJ}$ and $F_{CLK}$
in the given ratio:
$$\frac{F_{CLJ}}{F_{CLK}} = \frac{K_M}{K_D} = \frac{M_{CLJ} D_{CLK}}{M_{CLK} D_{CLJ}} , \tag{5.2}$$
Figure 5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of
the output sequence. (Blocks shown: input clock CLI, PLL 1 and PLL 2 producing CLJ and CLK, D flip-flop with output $q(nT_{CLK})$, decimator ($NK_D$) with output $x(nNT_Q)$.)
where $K_M$ and $K_D$ are combinations of the PLL dividers ($D_{CLK}$, $D_{CLJ}$) and
multipliers ($M_{CLK}$, $M_{CLJ}$). As can be seen in Figure 5 – 6, the signal CLJ is sampled
in $K_D$ discrete positions during the period $T_Q$, which is given as
$$T_Q = K_D T_{CLK} = K_M T_{CLJ} . \tag{5.3}$$
Figure 5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge
of the CLK signal (illustrated for $K_M = 5$ and $K_D = 7$). (Waveforms shown: CLJ, CLK and OUT over two periods $T_Q$, with the critical samples marked.)
It has been shown in [60] that if $K_M$ and $K_D$ are relatively prime, the set of
samples forms an equidistant set of values with a distance step
$$d = \frac{T_{CLK}}{2K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{CLJ}}{2K_D}\,\mathrm{GCD}(2K_M, K_D) . \tag{5.4}$$
The method offers the possibility to choose the worst-case distance
$\mathrm{MAX}(\Delta T_{min}) = d/2$ between the two closest edges of the CLK and CLJ signals as [60]
$$\mathrm{MAX}(\Delta T_{min}) = \frac{T_{CLK}}{4K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{CLJ}}{4K_D}\,\mathrm{GCD}(2K_M, K_D) \tag{5.5}$$
and thus to assure proper behaviour of the generator.
If the parameters $K_M$ and $K_D$ are chosen so that
$$\mathrm{MAX}(\Delta T_{min}) < \sigma_{jit} , \tag{5.6}$$
it is guaranteed that during the period $T_Q$ the sampling edge of CLK will fall at
least once into the edge zone of CLJ (where the edge zone means the time interval
around the edge with a width smaller than $\sigma_{jit}$, and $\sigma_{jit}$ is the standard deviation of
the jitter). The $K_D$ samples represented by the output signal $q(nT_{CLK})$ are XORed
bit-wise in a corrector [60] to obtain one random bit during $N$ periods $T_Q$. The
generator output bitrate $R$ is thus decimated by a factor $N$ to $R = 1/(NT_Q)$. It can
be seen that while the left side of (5.6) depends on the generator structure and PLL
settings, its right side, the jitter, depends on the noise of the PLL circuitry, the
working environment, and the circuit board design. Therefore, the jitter must be
known in advance or (even better) measured in real time. Measurement of the jitter
requires special measuring equipment. Common methods of jitter measurement
(e.g. those used in [61]) enable one to measure the absolute long-term jitter and
not the relative tracking jitter employed in the proposed TRNG. Furthermore, the
jitter is measured under laboratory conditions and not in a real (potentially hostile)
environment. If the results of measurements are not available, parameters from the
vendor's documentation can be used for the design of the TRNG, as in [60].
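As an illustration, relations (5.3) to (5.6) can be evaluated numerically with a short Python sketch. The function names and the example values (a normalised clock period and a hypothetical jitter value) are illustrative assumptions and are not taken from [60]; only the values $K_M = 5$ and $K_D = 7$ follow Figure 5 – 6.

```python
from fractions import Fraction
from math import gcd


def pll_trng_timing(k_m, k_d, t_clk):
    """Return (T_Q, d, MAX(dT_min)) for the given K_M, K_D and CLK period."""
    g = gcd(2 * k_m, k_d)
    t_q = k_d * t_clk              # Equation (5.3): T_Q = K_D * T_CLK
    d = t_clk * g / (2 * k_m)      # Equation (5.4): distance step of the samples
    max_dt_min = d / 2             # Equation (5.5): worst-case edge distance
    return t_q, d, max_dt_min


def jitter_condition_met(k_m, k_d, t_clk, sigma_jit):
    """Check condition (5.6): MAX(dT_min) < sigma_jit."""
    _, _, max_dt_min = pll_trng_timing(k_m, k_d, t_clk)
    return max_dt_min < sigma_jit


if __name__ == "__main__":
    # Hypothetical example: T_CLK normalised to 1, sigma_jit = 1/10 of T_CLK.
    t_q, d, max_dt_min = pll_trng_timing(5, 7, Fraction(1))
    print("T_Q =", t_q, " d =", d, " MAX(dT_min) =", max_dt_min)
    print("condition (5.6) met:", jitter_condition_met(5, 7, Fraction(1), Fraction(1, 10)))
```

Exact fractions are used only to keep the example free of rounding; with real devices the period and jitter would be given in seconds as floating-point values.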
The decimated output signal of the TRNG
$$x(nT_Q) = q(nT_Q) \oplus q(nT_Q - T_{CLK}) \oplus \ldots \oplus q(nT_Q - (K_D - 1)T_{CLK}) , \tag{5.7}$$
which is generated at the output of an Exclusive-OR (XOR)-based decimator [42]
as a bit-wise addition modulo 2 ($\oplus$) of $K_D$ samples $q(\cdot)$ sampled with the frequency
$F_{CLK}$, will be nondeterministic, too. Note that the delay line can still be a useful
building block for $\sigma_{jit} \approx \mathrm{MAX}(\Delta T_{min})$ or $\sigma_{jit} < \mathrm{MAX}(\Delta T_{min})$, as was shown
in [62].
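The decimation in (5.7) can be modelled in software as a block-wise XOR. The following Python sketch is only a behavioural model under the assumption that the sampled bits are available as a list; the function name `xor_decimate` is hypothetical and the hardware corrector of [60] is not reproduced here.

```python
from functools import reduce
from operator import xor


def xor_decimate(q, k_d):
    """XOR each consecutive group of k_d sampled bits into one output bit.

    q is a sequence of 0/1 samples taken at the CLK rate. A trailing
    incomplete group is dropped.
    """
    n_groups = len(q) // k_d
    return [reduce(xor, q[i * k_d:(i + 1) * k_d]) for i in range(n_groups)]


if __name__ == "__main__":
    # Three periods T_Q with K_D = 7 samples each (made-up sample values).
    samples = [1, 0, 0, 0, 0, 0, 0,
               1, 1, 0, 0, 0, 0, 0,
               1, 1, 1, 0, 0, 0, 0]
    print(xor_decimate(samples, 7))
```

Decimation over $N$ periods $T_Q$ corresponds to calling the same function with a group size of $N K_D$.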
The sampler sensitivity to the jitter
$$S = F_{CLI}\,\mathrm{MAX}(\Delta T_{min}) = \frac{1}{4 M_{CLK} M_{CLJ}} \tag{5.8}$$
is derived from Equation (5.5). Decreasing $\mathrm{MAX}(\Delta T_{min})$ for a fixed $F_{CLI}$ requires
maximisation of the multiplying coefficients ($M$).
For the output bitrate $R = 1/T_Q = F_{CLK}/K_D$ we get the condition
$$R = \frac{F_{CLI}}{D_{CLK} D_{CLJ}} . \tag{5.9}$$
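Equations (5.8) and (5.9) can be evaluated with a small Python sketch. The function names and the numeric values are hypothetical and serve only as an illustration. Note that (5.8) follows from (5.5) under the assumption $\mathrm{GCD}(2K_M, K_D) = 1$.

```python
from fractions import Fraction


def sampler_sensitivity(m_clk, m_clj):
    """S = F_CLI * MAX(dT_min) = 1 / (4 * M_CLK * M_CLJ), Equation (5.8).

    Assumes GCD(2*K_M, K_D) = 1, so that the GCD factor in (5.5) is 1.
    """
    return Fraction(1, 4 * m_clk * m_clj)


def output_bitrate(f_cli, d_clk, d_clj):
    """R = F_CLI / (D_CLK * D_CLJ), Equation (5.9); f_cli in Hz."""
    return Fraction(f_cli) / (d_clk * d_clj)


if __name__ == "__main__":
    # Hypothetical configuration: 50 MHz input clock, made-up PLL coefficients.
    print("S =", sampler_sensitivity(5, 7))
    print("R =", output_bitrate(50_000_000, 5, 4), "bit/s")
```

The two functions make the trade-off visible: the sensitivity depends only on the multipliers, while the bitrate depends only on the dividers and the input frequency.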
For $R$ it holds that increasing $R$ for a fixed $F_{CLI}$ requires minimisation of the dividing
coefficients ($D$). Of course, the optimisation cannot be done independently. There are
system limits expressed by the condition
$$\frac{R}{\mathrm{MAX}(\Delta T_{min})} = 4 F_{CLK} F_{CLJ} . \tag{5.10}$$
5.3.2 Coherent Sampling
The sampling technique applied for randomness extraction in the PLL-TRNG is called
coherent sampling.
The method expects the samples to be processed during the period $T_Q$, which
is given by the ratio of the clock frequencies. For ideal signals without jitter
the output signal is perfectly periodic. Let us provide some more details on the
parameters of this signal.
A similar technique is applied to measure high-frequency signals. The coherent
principle is based on sampling the measured signal over several periods of the
sampled signal, instead of the usually expected one period. The sampling frequency $f_s$ is
lower than the frequency $f$ of the sampled signal. The ratio between the frequencies is
expressed as
$$f_s = \frac{N}{M} f , \quad \mathrm{GCD}(M, N) = 1 . \tag{5.11}$$
During $M$ periods of the sampled signal, $N$ samples are obtained. Since $M$ and $N$ are
relatively prime, the $N$ samples are distinct and evenly distributed in $T_Q$,
thus the effective sampling frequency is $f_{seff} = Nf$. In order to obtain the original
waveform of the sampled signal, a time shuffling of the samples may be needed. In
the case of $M = N + 1$, time shuffling can be avoided if $0 \le \varphi_1 \le 2\pi/N$.
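The even distribution of the samples can be checked with a short Python sketch. It assumes an ideal, jitter-free signal and a first sample taken at phase 0; the function name is hypothetical.

```python
from fractions import Fraction
from math import gcd


def coherent_sample_phases(m, n):
    """Phases (as fractions of one signal period) of N samples taken over M periods.

    Assumes f_s = (N / M) * f with GCD(M, N) = 1, as in (5.11).
    """
    if gcd(m, n) != 1:
        raise ValueError("M and N must be relatively prime")
    # Consecutive samples are M/N signal periods apart; for a periodic signal
    # only the fractional part of the elapsed time matters.
    return [Fraction(i * m, n) % 1 for i in range(n)]


if __name__ == "__main__":
    print(coherent_sample_phases(7, 5))   # samples arrive out of phase order
    print(coherent_sample_phases(6, 5))   # M = N + 1: already in phase order
```

In both cases the sorted phases are the multiples of $1/N$, which corresponds to the effective sampling frequency $Nf$. Only for $M = N + 1$ do the samples arrive already ordered by phase.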
This sampling theory may be applied to the referred generator. Since it is difficult
to fulfil the condition for avoiding shuffling of the samples, a re-shuffling is required.
Let us assume that during the period $T_Q$ we acquired $K_D$ samples of the CLJ signal
with order $i = 0, 1, \ldots, K_D - 1$. Next, we need to rearrange the samples according
to their timing position in the CLJ signal. The idea behind this reordering lies
in the fact that the $K_D$ samples of CLJ are taken during $K_M$ periods of the CLJ signal,
therefore we can reconstruct one period of the signal CLJ from the $K_D$ samples. Thus,
we compute the order index $j$ and sort the samples according to this index:
$$j = i K_M \bmod K_D . \tag{5.12}$$
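The reordering according to (5.12) can be sketched in Python as follows. The function name and the sample values are hypothetical; the example assumes an ideal CLJ signal with 50 % duty cycle that is high in the first half of its period, sampled with $K_M = 5$ and $K_D = 7$.

```python
from math import gcd


def reorder_samples(samples, k_m):
    """Place sample i at index j = (i * K_M) mod K_D, Equation (5.12).

    samples holds the K_D values acquired during one period T_Q.
    """
    k_d = len(samples)
    if gcd(k_m, k_d) != 1:
        raise ValueError("K_M and K_D must be relatively prime")
    reordered = [None] * k_d
    for i, value in enumerate(samples):
        reordered[(i * k_m) % k_d] = value
    return reordered


if __name__ == "__main__":
    # Samples in acquisition order for the ideal signal described above.
    acquired = [1, 0, 1, 1, 0, 0, 1]
    print(reorder_samples(acquired, 5))
```

After reordering, the samples follow their position within one CLJ period, so the reconstructed waveform shows a single high phase followed by a single low phase.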
5.4 Testing of TRNGs
The randomness of generated numbers cannot be proven merely by passing generally
used statistical tests. Instead, each RNG implementation has to be evaluated
individually as a unique system. Moreover, even if a prototype in the lab generates
acceptable random numbers, this may not be true for every TRNG unit of the
same type during its whole operation time, and therefore continual testing of the
generated output is required.
It is well known that most attacks are directed at the implementations
of cryptographic algorithms and not at the algorithms themselves. This
means that special attention should be paid to avoiding all weaknesses that could help an
attacker break the system.
The topic of tests is highly relevant in the context of attacks. Generators, as sources
of the secrets on which the security of whole cryptosystems is based, are a popular
target of attacks and of attempts to manipulate the generated output. The topic of
attacks is therefore also included in this chapter. Changing the working conditions may have a
degrading influence on the parameters of the generated sequence.
In [74] an approach for the evaluation of physical random number generators
is given which takes the construction of the TRNG into account. The document
presents a theory of how TRNGs used in cryptographic systems should be evaluated.
For TRNG testing we have to accept the following facts [100]:
• A finite set of statistical tests may detect defects of a random source, but these
tests cannot verify the randomness of the source.
• Good statistical properties of the random numbers are clearly not sufficient for
sensitive cryptographic applications such as the generation of keys, signature
key pairs or signature parameters.
• The key criterion is not the statistical behaviour of the numbers but their entropy.
• For a good TRNG it has to be ensured that the increase of entropy per generated
number is sufficiently large.
In [74] a set of tests that should be passed is proposed, including
Coron's test of entropy increase. In addition to the proof that the generated numbers
have the desired properties, an explanation of the randomness
extraction has to be provided. In other words, the principle of random number generation has to be
described for better understanding and for better analysis of possible attacks on the
TRNG.
Startup Test, Online Test, TOT Tests. If an RNG prototype in a lab generates
acceptable random numbers, this may not be true for each TRNG of the same type
during its whole operation time. The reasons could be tolerances
of components of the noise source, ageing effects, or outside attacks. In the worst
case the TRNG breaks down totally and the output numbers are constant from that
moment on. Therefore, the developer of the TRNG should also implement tests
that detect such cases of randomness degradation of the output bits. We
distinguish between three types of tests [74]:
1. The startup test is used to verify the principal functionality of the noise source
when the TRNG has been started.
2. The online test should detect if the quality of the random numbers is not sufficient
for this particular TRNG or deteriorates in the course of time.
3. The tot test ('tot' stands for 'total failure of the noise source') should detect a total
breakdown of the noise source.
Implementation of the tests. For the implementation of the tests one has to consider
the limitations given by the platform on which the TRNG is implemented.
Quite often the implementation targets are smart cards or field programmable gate
arrays (FPGAs) with limited memory space. Therefore the chosen tests should
require only small additional logic resources. Moreover, the tests should be selected
according to the features of the TRNG and the basic principle of the random source.
It is also possible to create new tests that are more suitable for the particular TRNG
and better detect its possible defects.
Due to the limited memory resources of target platforms it is impossible to test
the statistical properties on very long sequences (up to Mbits of data), as some tests
(e.g. [97]) require. The goal is to find tests that are able to continually evaluate the
quality of the random source without the need to store the output bits. Requirements
which appropriate online tests should fulfil are formulated in [100].
Two further requirements are placed on the tests. On the one hand we expect detection
of even a small deviation from an ideal random source, but on the other hand frequent false
alarms are not acceptable (e.g. the tot test can block a smart card, so that a revision by
the producer is required before it can be reused). Therefore the ranges of allowed deviations
from ideal randomness have to be set very carefully, so as not to decrease the security of the
system, but also not to block the TRNG by false alarms. This task is even more
difficult for the short sequences of random bits tested inside the TRNG.
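The idea of tests that need only a few counters and no storage of the output bits can be illustrated with the following Python sketch. Both classes are hypothetical illustrations and are not the tests defined in [74] or [100]. The thresholds (`max_run`, `min_ones`, `max_ones`) are assumed parameters that would have to be derived for the particular TRNG so that the false alarm rate stays acceptable.

```python
class TotTest:
    """Flag a suspected total failure when the output stays constant too long."""

    def __init__(self, max_run):
        self.max_run = max_run      # hypothetical threshold, design specific
        self.last_bit = None
        self.run_length = 0

    def update(self, bit):
        """Feed one bit; return True if a total failure is suspected."""
        if bit == self.last_bit:
            self.run_length += 1
        else:
            self.last_bit = bit
            self.run_length = 1
        return self.run_length >= self.max_run


class MonobitOnlineTest:
    """Count ones in fixed-size blocks and flag blocks outside an allowed range."""

    def __init__(self, block_size, min_ones, max_ones):
        self.block_size = block_size
        self.min_ones = min_ones    # hypothetical lower bound on ones per block
        self.max_ones = max_ones    # hypothetical upper bound on ones per block
        self.ones = 0
        self.count = 0

    def update(self, bit):
        """Feed one bit; return True/False at a block boundary, None otherwise."""
        self.ones += bit
        self.count += 1
        if self.count < self.block_size:
            return None
        passed = self.min_ones <= self.ones <= self.max_ones
        self.ones = 0
        self.count = 0
        return passed


if __name__ == "__main__":
    tot = TotTest(max_run=4)
    print([tot.update(b) for b in [1, 0, 0, 0, 0, 1]])
    online = MonobitOnlineTest(block_size=4, min_ones=1, max_ones=3)
    print([online.update(b) for b in [1, 0, 1, 0, 1, 1, 1, 1]])
```

Each test keeps only two or three small integers as state, which maps naturally to counters in an FPGA or smart card. Widening the allowed range lowers the false alarm rate at the cost of detecting only larger deviations.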
5.5 Attacks against TRNG
The main goal of an attacker of a cryptographic algorithm or implementation is to reveal
some part or even the whole of the secret key and then easily decrypt any encrypted
message. Attacking RNGs has a different motivation than finding the key directly. Inside
cryptographic systems the RNG plays a crucial role in the generation of secret keys, session
keys, etc. A random key is the outcome of the generation process. Therefore the
target of the attack is not only the generated value of the secret key but also any
information making it possible to predict the succeeding or preceding values of the
keys.
After a successful attack, the generated values may no longer be random:
they can be constant or strongly biased, or the attacker may know an algorithm that
predicts them correctly with high probability. With this approach one tries to change the
random behaviour of the TRNG into a deterministic one, or at least to change the
probability distribution of the generated sequence.
In the case of a PRNG, knowledge of the seed or internal state can lead to breaking
the generator because its structure is usually known and public. In the case of a
well-designed TRNG, information about the current internal state does not provide any
information about the previous or following one. Therefore the focus of the attack is the
source of noise and the randomness extraction method rather than the internal state
of the TRNG.
Attacks on cryptographic systems (including RNGs) can be divided into algorithmic
and implementation attacks.
Algorithmic attacks. The first group of attacks, the algorithmic attacks, includes
mathematical analysis of the mechanism for randomness extraction or of the structure
of the PRNG and does not require any access to the attacked unit. The analysis
can be used especially against PRNG designs with an improperly designed way of
obtaining the seed value [69]. If the seed contains a low level of entropy, then the output
of the generator has statistical properties not comparable to those of a random sequence
and the effort needed to reproduce the output is lower. Mathematical analysis of
TRNGs tries to find deterministic dependencies inside the extraction method causing
pseudo-randomness.
As the parameters of TRNGs are highly dependent on the implementation,
attacking the hardware realisation directly can be more powerful.
Implementation attacks. The second group, the implementation attacks, expects
direct physical access to an implementation and is based on weaknesses caused by
the implementation of the RNG. Implementation attacks are further divided into passive
and active attacks.
Passive attacks, usually called side-channel attacks, benefit from side-channel
information gained from the physical implementation. The power consumption,
execution time or electromagnetic emanations can provide additional useful
information about the RNG internal state or processed data.
Active attacks require the attacker to intervene in the standard
working conditions, operation flow or design of the original implementation of
the RNG. Non-invasive active attacks apply non-permanent changes to external
parameters of the RNG, e.g. supply voltage or temperature, with the aim
of achieving a non-standard, biased RNG output. With more resources one can
execute an invasive attack and change the physical structure of the implementation.
The attacker tries to destroy the source of randomness and make the
output of the RNG constant, or to obtain the output of the generator directly.
5.6 Conclusions<br />
In this chapter we have introduced the topic of random numbers. The extraction
of random bits in a digital environment is a crucial topic in the area of system
implementations with public-key cryptography. Randomness itself and its three typical
sources (noise, metastability and jitter) were described. In order to
provide an overview of the current state of research we have collected descriptions
of recently published design proposals and implementations of TRNGs.
A typical design of a TRNG implemented in a digital device includes a source
of randomness from which a digitised noise signal can be harvested by a proper
mechanism. We have explained the importance of research in the areas of
harvesting mechanisms and postprocessing. Positive results of statistical tests
do not guarantee a random basis of the generated sequence. In addition, the working
environment may also have a significant impact on the parameters of the output bits.
Requirements on RNGs applied in cryptography cover security parameters of the
design, unpredictability of the generated sequence and specific statistical properties
of the output sequence.
The generator chosen for our research, the PLL-TRNG proposed in [60], will be
further tested and analysed in order to provide better tools for choosing its
parameters and to understand its behaviour in a changing environment. The described theoretical
background on testing of and attacks on RNGs has been applied, and the results are
given in the following chapter.
6 True Random Number Generator<br />
This chapter is dedicated to the analysis of the jitter-based random generator from various
aspects. Our work is based on the TRNG design proposed by Viktor Fischer and
Miloš Drutarovský, published in 2002 [60]. We extend the already published results
summarised in the previous chapter. Our focus is on the analysis of the generator
under changing working conditions and configuration settings.
The results of the research were published in the papers [47, 48, 61,
62, 114–116, 119]. The main achievements of our research lie in the following
areas:
• Analysis of PLL circuitry as a source of randomness – implementation issues,
possible PLL configurations, verification of vendor parameters,
• Analysis of TRNG implementation in different FPGAs – feasibility study,
design considerations, practical results,
• Stochastic model of the PLL-TRNG – proposal and practical verification,
• Temperature influence on the PLL-TRNG – practical attack on the TRNG with results
and suggestions for design.
The chapter is structured as follows. In Section 6.1 we describe two ways
of clock synthesis in modern FPGAs and summarise the parameters of the clock
circuitry verified by practical measurements of the PLL parameters. Section 6.2
provides an analysis of PLL configurations, practical results from Altera and Actel
FPGA implementations of the generator and a stochastic model of the generator.
In Section 6.3 we describe a non-invasive attack on the generator together with
practical outcomes. In the last part (Section 6.4) we discuss the obtained results
and provide ideas for further research.
6.1 Clock Synthesis in FPGAs
In present-day integrated digital systems, there is a need for numerous clock
signals with various frequencies. The synthesis of the clocks in separate circuits is
not efficient, and the frequencies are too high to be generated by an external
crystal. For this purpose FPGA vendors offer clock circuitry embedded on the FPGA
chip. Besides the synthesis of clock signals with the required frequencies, it provides
additional functions that make it possible to process signals with very high frequencies.
The clock conditioning circuits usually perform the following functions (depending
on the FPGA vendor and family): clock phase adjustment, clock delay
minimisation, clock frequency synthesis, spread-spectrum clock modulation, static
or dynamic configuration of circuit parameters, etc.
We explain the two most widely applied principles of clock signal management in FPGAs,
based on the PLL and the delay-locked loop (DLL). Both principles can be implemented
as digital or analog circuits. While the FPGA vendor Xilinx has chosen a digital
implementation of the DLL in most of its FPGAs, other vendors like Altera and
Actel have included in their devices clock circuitry based on an analog PLL.
Phase-Locked Loop Circuitry A typical analog PLL block in Altera and Actel devices (see Figure 6 – 1) can provide at least one synthesised clock signal with frequency $F_{OUT}$:
$$F_{OUT} = \frac{F_{VCO}}{k} = F_{REF}\,\frac{m}{k} = F_{IN}\,\frac{m}{n \times k}, \qquad (6.1)$$
where $F_{IN}$ is the frequency of the input clock source, which can be an external crystal or another PLL in the case of a PLL cascade, $F_{REF}$ is the input reference frequency to which the feedback clock $F_{FB}$ is locked, and $F_{VCO}$ is the frequency of the clock signal produced by the voltage-controlled oscillator (VCO). The reference, feedback and post-divider values n, m and k can range from one to several hundred in FPGAs [11, 14], or to several thousand in ASICs [22]. Together with the VCO working limits, they set the range of input and output frequencies.
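Equation (6.1) can be evaluated step by step. The following minimal Python sketch does so; the function name `pll_frequencies` and the divider values in the usage example are illustrative assumptions, not settings taken from a vendor datasheet.

```python
from fractions import Fraction


def pll_frequencies(f_in_hz, n, m, k):
    """Return (F_REF, F_VCO, F_OUT) in Hz according to Eq. (6.1)."""
    if min(n, m, k) < 1:
        raise ValueError("n, m and k must be positive integers")
    f_ref = Fraction(f_in_hz) / n   # reference divider :n
    f_vco = f_ref * m               # lock condition: F_FB = F_VCO / m = F_REF
    f_out = f_vco / k               # post-divider :k
    return f_ref, f_vco, f_out


if __name__ == "__main__":
    # Hypothetical setting: 40 MHz input, n = 2, m = 8, k = 4.
    f_ref, f_vco, f_out = pll_frequencies(40_000_000, n=2, m=8, k=4)
    for name, value in (("F_REF", f_ref), ("F_VCO", f_vco), ("F_OUT", f_out)):
        print(f"{name} = {float(value) / 1e6:.3f} MHz")
```

Exact rational arithmetic (`Fraction`) is used here so that ratios such as 12/7 are not rounded; whether a given triple is usable additionally depends on the VCO working limits mentioned above, which this sketch does not check.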
[Figure 6 – 1, block diagram: clock input $F_{IN}$; reference divider :n producing $F_{REF}$; phase frequency detector with inputs $F_{REF}$ and $F_{FB}$; feedback divider :m; charge pump; loop filter and VCO producing $F_{VCO}$; post-dividers :k_1 ... :k_c; clock output(s) $F_{OUT}$.]
Figure 6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]
Delay-Locked Loop Circuitry In DLL circuits, the clock signal is synthesised by inserting a variable delay between the input and output clock signals (see Figure 6 – 2). Delay lines can be built using a voltage-controlled delay or as a series of discrete delay elements, as in the Xilinx DLL [125, 126].
[Figure 6 – 2, block diagram: clock input $F_{IN}$; phase detector with inputs $F_{IN}$ and $F_{FB}$; delay line with +/- control; clock output $F_{OUT}$.]
Figure 6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits
The DLL achieves very good results in delay compensation and clock conditioning. However, the available range of clock dividers is much more limited than in the case of a PLL. An output pin can carry a clock signal derived from the input signal whose frequency is doubled or divided by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 in the case of Spartan II FPGA devices [127].
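To make the contrast with Eq. (6.1) concrete, the short sketch below lists every output frequency that the Spartan II options quoted above allow for a given input clock. The function name and the 48 MHz example input are hypothetical.

```python
from fractions import Fraction

# Division factors quoted in the text for Spartan II devices [127].
SPARTAN2_DIVISORS = [Fraction(3, 2), 2, Fraction(5, 2), 3, 4, 5, 8, 16]


def dll_output_frequencies(f_in_hz):
    """Map an option label to the output frequency (Hz) derived from f_in_hz."""
    outputs = {"x2": Fraction(f_in_hz) * 2}          # frequency doubling
    for divisor in SPARTAN2_DIVISORS:
        outputs[f"/{float(divisor):g}"] = Fraction(f_in_hz) / divisor
    return outputs


if __name__ == "__main__":
    for label, freq in dll_output_frequencies(48_000_000).items():
        print(f"{label:>5}: {float(freq) / 1e6:.3f} MHz")
```

Only nine ratios are available, whereas a PLL with three programmable dividers offers thousands of combinations.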
6.1.1 PLL as Source of Randomness
Due to its digital nature, the DLL in Xilinx devices is less sensitive to a noisy environment than an analog PLL with a VCO. The VCO tends to lock to the frequencies of disturbing external signals; separate power supply and ground networks, connected only to the clock circuitry, are therefore required. On the other hand, the analog PLL allows a small-area implementation that provides a wide range of clock frequencies. The DLL technology is limited in this respect and offers only certain combinations of ratios between input and output frequencies.
Changes in temperature, or fluctuations of the supply voltage correlated with the switching activity of closely placed logic, may cause a drift in the generated clock signal. To compensate, the loop adjusts the delay elements or the VCO frequency, which is observed as deterministic jitter added to the clock signal. Another source of noise influencing the PLL circuitry is the input clock signal. There is therefore a tradeoff between compensating the internal and the external jitter.
All phase changes in the PLL or differences of delays in the DLL introduce jitter in the synthesised output signal. Filters inside the clock circuitry are matched to eliminate the non-linearity caused by the loop and by external influences; however, the intrinsic random noise of the VCO is always present in the output clock signal and cannot be attenuated completely. Thanks to that, the PLL is a promising source of randomness suitable for the implementation of a TRNG. In addition, the frequency of the VCO is never constant: even under stable working conditions it fluctuates around a mean value.
From this analysis we can conclude that PLL circuits are more suitable for a TRNG design based on jitter sampling, as they offer a wide frequency range for the generated signals. Moreover, the internal PLL circuitry provides a reliable source of jitter.
Analog PLL in Altera and Actel FPGAs The core of the clock circuitry embedded in Altera and Actel FPGAs is formed by an analog PLL circuit surrounded by several delay lines, clock multipliers/dividers, and circuits for interconnection between the internal clock network and external pads. The number of PLLs and their features depend on the chosen FPGA type and vendor. Tables 6 – 1 and 6 – 2 present the basic parameters of the PLLs and clock circuits for FPGA devices from Altera (APEX20K(E) [14], Cyclone [12, 17] and Stratix [15, 19]) and Actel (Axcelerator [2], ProASICplus [3], ProASIC3(E) [4]).
Table 6 – 1 Parameters of PLL embedded in Altera FPGAs

| family | # of PLLs | m | n | k | max. output period jitter |
|---|---|---|---|---|---|
| APEX20K | 1 | – | – | – | 200 ps |
| APEX20KE | 2, 4 | 1-160 | – * | – | 0.35% RMS of output period |
| Cyclone | 1, 2 | 2-32 | 1-32 | 1-32 | ±300 ps for $F_{OUT}$ ≥ 100 MHz; 60 mUI for $F_{OUT}$ < 100 MHz |
| Cyclone II | 2, 4 | 1-32 | 1-4 | 1-32 | NA ** |
| Stratix | 4, 8 × FPLL *** | 1-32 | 1-32 | 1-32 | ±100 ps for $F_{OUT}$ > 200 MHz; ±20 mUI for $F_{OUT}$ < 200 MHz |
| | 2, 4 × EPLL | 1-512 | 1-512 | 1-1024 | |
| Stratix II | 4, 8 × FPLL | 1-32 | 1-4 | 1-32 | NA ** |
| | 2, 4 × EPLL | 1-32 | 1-32 | 1-32 | |

(m, n, k: dividers range)
* m/(n × k) = 1-280.
** The jitter specification for the PLL output pins depends on the I/O pins in its VCCIO bank: how many of them are switching outputs, how much they toggle, and whether or not they use programmable current strength.
*** EPLL and FPLL stand for Enhanced and Fast PLL, respectively.
Table 6 – 2 Parameters of PLL embedded in Actel FPGAs

| family | # of PLLs | dividers range | max. output period jitter |
|---|---|---|---|
| ProASIC3(E) | 1 (6) | NA | 180 ps for $F_{OUT}$ = 24 MHz; 90 ps for $F_{OUT}$ = 100 MHz; 70 ps for $F_{OUT}$ = 350 MHz |
| ProASICplus | 2 | m = 1-64, n = 1-32, k = 1-4 | ±1% for $F_{OUT}$ < 10 MHz; ±2% for 10 MHz < $F_{OUT}$ < 60 MHz; ±1% for $F_{OUT}$ > 60 MHz |
| Axcelerator | 8 | m = 1-64, n = 1-64 | long-term: 1% of $F_{OUT}$ or 100 ps; short-term: 50 ps + 1% of $F_{OUT}$ |
Two parameters of the PLL clock circuits have a significant impact on the possibility of extracting randomness from the clock jitter: the output period jitter of the PLL and the range of the frequency dividers. FPGA vendors steadily reduce the level of timing jitter in the clock signals of their latest FPGA families, which was also confirmed by our experimental measurements (described later). On the other hand, the range of divisors in high-density devices has been enlarged, so that a wider range of clock frequencies can be synthesised.
The jitter size is usually expressed as a peak-to-peak value (the difference between the smallest and the largest clock period) or as a 1-sigma value $\sigma_{jit}$ (the standard deviation). Typical values of the period jitter depend on the technology and configuration of the PLL and can range from 3.5 ps to 10 ps for ASICs [22], or up to 100 ps for FPGAs [11, 19]. Since the technology of the embedded PLL and the quality of the VCO are set by the FPGA vendor, a user can modify the output jitter only by configuring the PLL divider values (m, n, k) and the loop filter bandwidth.
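The two jitter measures can be illustrated on a list of measured clock periods. The sketch below is a minimal example; the function name and the sample values are made up for illustration, and the population standard deviation is used as one possible convention for the 1-sigma value.

```python
import statistics


def period_jitter(periods_ps):
    """Return (peak_to_peak, sigma) in ps for a list of measured clock periods."""
    if len(periods_ps) < 2:
        raise ValueError("need at least two period measurements")
    peak_to_peak = max(periods_ps) - min(periods_ps)   # largest minus smallest period
    sigma = statistics.pstdev(periods_ps)              # 1-sigma value (population std. dev.)
    return peak_to_peak, sigma


if __name__ == "__main__":
    # Synthetic periods around 15 ns, in picoseconds (not measured data).
    samples = [15000, 15010, 14990, 15005, 14995]
    p2p, sigma = period_jitter(samples)
    print(f"peak-to-peak = {p2p} ps, sigma_jit = {sigma:.2f} ps")
```

For a long record the peak-to-peak value keeps growing with the number of samples, while the 1-sigma value stabilises, which is why $\sigma_{jit}$ is the quantity used in the rest of this chapter.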
Jitter Generated in Altera Stratix FPGA In analog PLLs, various noise sources cause the PLL’s internal VCO to fluctuate in frequency. Under ideal conditions, the fluctuations visible as jitter are caused only by analog (non-deterministic) internal noise sources; this component is called intrinsic jitter. Other frequency fluctuations are caused by variations of supply voltage and temperature, by external interference through the power and ground connections, or by the noisy environment generated by internal FPGA circuits [125]. The PLL’s control circuitry adjusts the VCO back to the specified frequency, and this change is seen as (deterministic) jitter.
We now analyse the parameters of the PLL circuits in the Stratix family of Altera FPGAs and their relation to the generated clock signal and the jitter it contains. The Altera Stratix devices include two types of PLLs:
Fast PLL (FPLL): Stratix devices include up to 8 FPLLs. The FPLLs offer general-purpose clock management with multiplication and phase shifting. The multiplication is simplified in comparison with the EPLL and uses only m/k scaling factors with a range from 1 to 32 [15]. Depending on m, the input frequency can vary (for speed grade -5) from 15 to 717 MHz, the output frequency from 9.4 to 420 MHz, and the frequency of the VCO from 300 to 1000 MHz.
Enhanced PLL (EPLL): Compared with the FPLL, the EPLLs have additional configurable features such as external feedback, configurable bandwidth and run-time reconfiguration, and an enhanced range of parameters. The input frequency can vary (for a speed grade -5 device) from 3 to 684 MHz, the output frequency from 9.4 to 420 MHz, and the frequency of the VCO from 300 to 800 MHz. The reference, feedback and post-divider values n, m and k can vary from 1 to 512 (1024 for k) with 50% duty cycle [15].
The size of the intrinsic jitter of the PLL depends on the quality factor Q of the VCO, on the bandwidth of the loop filter (see Figure 6 – 1), and on the so-called pattern jitter introduced by the phase frequency detector. The technology of the PLL and the quality of the VCO are fixed by the FPGA design. A designer can change the output jitter directly, by modifying the scaling factors (for FPLL and EPLL) and the filter bandwidth (only for EPLL), but also indirectly, through the design of the board (separation of the analog and digital grounds, filtering of the analog power supply, etc.).
The PLL acts as a low-pass filter; a low bandwidth setting of the loop filter can therefore be applied to filter out high-frequency jitter from the input clock. To track the input jitter, one can use a high bandwidth setting. As already mentioned, power supply noise can cause the VCO output frequency to fluctuate and thus cause jitter. In such cases a low bandwidth makes the feedback loop respond more slowly to the noise being injected by the VCO, so the loop cannot adjust for this noise and counteract it. A high bandwidth allows the loop to respond quickly to the noise and compensate for it. There is therefore a tradeoff between a low and a high bandwidth of the PLL loop filter: the former filters the jitter of the input signal, the latter suppresses the VCO noise.
Since the size of the jitter is very important for our method, we needed to measure it for various PLL configurations and to verify the values provided by the chip vendors. For example, according to the vendor’s measurements [125], the PLL jitter in an Apex FPGA has a 1-sigma value of $\sigma_{jit} \approx 15.9$ ps for a synthesised clock signal with $F_{OUT} = 66.6$ MHz and feedback divider m = 2. These results were acquired under “ideal conditions”, with a minimal amount of FPGA resources occupied and minimal input/output activity. Our measurements showed that the clock jitter in the Apex FPGAs is significantly higher (about 140 ps) for higher divider factors and with internal FPGA flip-flops switching at different clock frequencies. Note that the jitter size depends on the PLL settings and on the type of power supply filter included on the development board, but the measured value of the jitter is never lower than the intrinsic jitter of the FPGA.
(a) FPLL with ratio 12/7, $\sigma_{jit} \approx 10$ ps (b) EPLL with ratio 139/133, $\sigma_{jit} \approx 16$ ps
Figure 6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div)
For the jitter measurement on a Stratix family FPGA we selected the Altera DSP Development board with the Stratix EP1S25F780C5 device [16]. The jitter was measured similarly as in [62], using an Agilent Infiniium DCA 86100B wide-bandwidth oscilloscope. We found that the jitter is significantly smaller than on the Nios board with APEX [10] (used as a reference in [60]). For example, for the FPLL and the ratio 12/7 the jitter reaches a 1-sigma value of about 10 ps (see Figure 6 – 3(a)), and for the EPLL and the ratio 139/133 the 1-sigma value of the jitter is about 16 ps (see Figure 6 – 3(b)).
6.2 PLL-Based TRNG on FPGA
Having covered the general parameters of PLL circuitry in FPGAs, we continue with a section that presents results on the practical implementation of the PLL-TRNG in FPGA families of two vendors, Altera and Actel. The presented stochastic model of the generator helps to understand the randomness extraction method.
6.2.1 PLL Configurations
The design depicted in Figure 5 – 5 represents only one of the possible PLL configurations that we investigate further. In general, depending on the chosen FPGA, there are three options for configuring the PLLs in the TRNG: with one PLL, with two parallel PLLs, and with two (or more) cascaded PLLs (see Figure 6 – 4).
[Figure 6 – 4, three block diagrams a), b), c): each has an input clock CLI and a D flip-flop with signals CLJ and CLK and output OUT; a) contains one PLL, b) contains PLL1 and PLL2 in parallel, c) contains PLL1 and PLL2 in cascade.]
Figure 6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs
In some cases, especially in low-cost FPGAs, only one PLL is available for the TRNG (see Figure 6 – 4a) and the others (if available) are used for the rest of the system. If there are no restrictions, or only some acceptable restrictions¹⁰, on the input clock frequency of the logic outside the TRNG, then one or more PLLs can be shared by the TRNG and the user logic.
¹⁰ By acceptable we mean requirements on the clocking frequency that lie in a range which also allows the TRNG to meet the working condition (5.6).

Table 6 – 3 Parameter settings for different TRNG configurations

| configuration / parameter | one PLL | two parallel PLLs | two cascaded PLLs |
|---|---|---|---|
| $F_{CLK}$ | $F_{CLI}$ | $\frac{M_{CLK}}{D_{CLK}} F_{CLI}$ | $F_{CLI}$ |
| $F_{CLJ}$ | $\frac{M_{CLJ}}{D_{CLJ}} F_{CLI}$ | $\frac{M_{CLJ}}{D_{CLJ}} F_{CLI}$ | $\frac{M_{CLJ1} M_{CLJ2}}{D_{CLJ1} D_{CLJ2}} F_{CLI}$ |
| $K_M$ | $M_{CLJ}$ | $M_{CLJ} D_{CLK}$ | $M_{CLJ1} M_{CLJ2}$ |
| $K_D$ | $D_{CLJ}$ | $D_{CLJ} M_{CLK}$ | $D_{CLJ1} D_{CLJ2}$ |
| $S$ | $\frac{1}{4 M_{CLJ}}$ | $\frac{1}{4 M_{CLK} M_{CLJ}}$ | $\frac{1}{4 M_{CLJ1} M_{CLJ2}}$ |
| $R$ | $\frac{F_{CLI}}{D_{CLJ}}$ | $\frac{F_{CLI}}{D_{CLK} D_{CLJ}}$ | $\frac{F_{CLI}}{D_{CLJ1} D_{CLJ2}}$ |
In most cases the use of two PLLs is largely sufficient to fulfil condition (5.6). Usually, the option with two parallel PLLs is used (see Figure 6 – 4b). When the range of PLL divisors is not satisfactory (again, this is the case for low-cost FPGAs), a cascade of two (or more, if available) PLLs can be applied (see Figure 6 – 4c). Each configuration achieves different characteristics (defined in [61]) depending on the parameters of the PLLs, namely the maximum input, output and VCO frequencies, the multiplication and division factors, etc., and in this way the needed frequency can be synthesised. The parameters of the three considered generator configurations are summarised in Table 6 – 3.
We can conclude that using two PLLs in either a parallel or a serial (cascaded) configuration can significantly increase the sensitivity to jitter and the output bit-rate of the generator, depending on the available range of multiplication or division factors, or both.
The equations in Table 6 – 3 show from which PLL coefficients (dividers) the factors $K_M$ and $K_D$ are composed. The factor $K_M$ has a direct influence on the value of MAX($\Delta T_{min}$) (see Eq. 5.5). While for the configurations with one PLL or several cascaded PLLs $K_M$ is composed only of multiplying coefficients, in the parallel configuration a dividing coefficient is included as well. This should be considered especially when not all PLL coefficients have an identical range.
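The composition of $K_M$ and $K_D$ in Table 6 – 3 can be written down directly. The following Python sketch is one possible transcription; the function names are hypothetical, frequencies are expressed in units of $F_{CLI}$, and the consistency check $F_{CLJ}/F_{CLK} = K_M/K_D$ is a relation that follows from the table entries rather than a formula quoted from this section.

```python
from fractions import Fraction


def one_pll(m_clj, d_clj):
    """One PLL generates CLJ; CLK is the input clock itself."""
    return {"F_CLK": Fraction(1), "F_CLJ": Fraction(m_clj, d_clj),
            "K_M": m_clj, "K_D": d_clj}


def two_parallel_plls(m_clk, d_clk, m_clj, d_clj):
    """Two PLLs driven by the same input clock generate CLK and CLJ."""
    return {"F_CLK": Fraction(m_clk, d_clk), "F_CLJ": Fraction(m_clj, d_clj),
            "K_M": m_clj * d_clk, "K_D": d_clj * m_clk}


def two_cascaded_plls(m1, d1, m2, d2):
    """PLL2 is fed by PLL1 and generates CLJ; CLK is the input clock."""
    return {"F_CLK": Fraction(1), "F_CLJ": Fraction(m1 * m2, d1 * d2),
            "K_M": m1 * m2, "K_D": d1 * d2}


def ratio_is_consistent(cfg):
    """Check that F_CLJ / F_CLK equals K_M / K_D."""
    return cfg["F_CLJ"] / cfg["F_CLK"] == Fraction(cfg["K_M"], cfg["K_D"])


if __name__ == "__main__":
    # Illustrative factors: CLJ from a 12/7 PLL, CLK from a 25/12 PLL.
    cfg = two_parallel_plls(m_clk=25, d_clk=12, m_clj=12, d_clj=7)
    print(cfg["K_M"], cfg["K_D"], ratio_is_consistent(cfg))
```

In the parallel case the divider $D_{CLK}$ enters $K_M$, so a PLL with a narrow divider range limits $K_M$ even when its multiplier range is wide.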
6.2.2 Analysis of TRNG in Altera Stratix FPGAs
Our implementation strategy for the described case was to obtain the fastest and best-quality generator using a minimum amount of resources (PLLs). Since the Stratix family contains two types of PLLs, several configurations are possible.
The most economical solution would be based on one FPLL (since there are four FPLLs in the chosen device). However, the multiplication and division factors of a single FPLL cannot fulfil the implementation condition (5.6). Another option is to use an EPLL, whose extended range of parameters makes it possible to build a single-PLL TRNG. Consequently, the following four architectures of the TRNG implemented in Altera Stratix devices are possible:
1. Two FPLLs (referred to below as configuration A)
2. One FPLL and one EPLL (configuration B)
3. One EPLL (configuration C)
4. Two EPLLs (configuration D)
The relationship between the sensitivity to jitter S and the output bit-rate R of the TRNG for the configuration with two parallel PLLs was described in Equations 5.8 and 5.9 (see Table 6 – 3 for the other configurations and characteristic parameters).
Experimental Results The TRNG architectures were tested on the Altera DSP board with the Stratix EP1S25F780C5 device [16]. They were described in VHDL and implemented using the Altera Quartus II development system, version 3.0 SP2. The acquired bits were transmitted to the PC through a parallel port. The complete TRNG design, including a 1024 x 8-bit FIFO and a parallel interface controller, needs up to 120 LEs of the approximately 25000 LEs available in the device. The signal CLK was used as the clock signal for the control logic and was therefore limited to about 250 MHz (although the output frequency of the PLL can be higher).
In order to test the basic quality of the different TRNG versions, we evaluated the following statistical parameters of the generated bit sequence $b(n)$ (all of them computed for a record length of $N = 1000000$ bits):
1. Bias, computed as
$$\mathrm{bias} = E[b(n)] - 0.5 = E[b] - 0.5 \cong \frac{N_1}{N} - 0.5, \qquad (6.2)$$
where $N_1$ is the number of $b(n) = 1$ for $n = 0, 1, \ldots, N-1$. For a good TRNG, the bias should converge to 0 (with deviation $\approx \pm 3/\sqrt{N}$).
2. Maximal autocorrelation coefficient, computed as
$$\rho_{max} = \max\{|\mathrm{corr}(b_k)|,\ k = 1, 2, \ldots, 100\}, \qquad (6.3)$$
where
$$\mathrm{corr}(b_k) = \mathrm{corr}\big(b(n), b(n-k)\big) = \frac{E\big[\big(b(n) - E[b(n)]\big)\big(b(n-k) - E[b(n-k)]\big)\big]}{\sqrt{\mathrm{var}(b(n))\,\mathrm{var}(b(n-k))}}, \qquad (6.4)$$
$$\mathrm{var}(b(n)) = \mathrm{var}(b) = E\big[\{b - E[b]\}^2\big] = E[b]\{1 - E[b]\}. \qquad (6.5)$$
Based on [42, 86] it can be shown that for a good TRNG (with bias → 0) and a finite record length $N$, the coefficient $\mathrm{corr}(b_k)$ scaled by $\sqrt{N}$ follows the standard normal distribution $N(0, 1)$, and the following condition should be fulfilled (the value $\chi = 2.576$ results from $P(X > \chi) = \alpha = 0.01/2$, valid for the $N(0, 1)$ distribution):
$$\rho_{max} \to \frac{2.576}{\sqrt{N}} = 0.002576. \qquad (6.6)$$
3. Standard FIPS140-2 statistical tests [57], which analyse 20000-bit records and define thresholds to assess TRNG randomness. The FIPS140-2 tests include the Monobit, Poker, Runs and Long runs tests. We analysed 100 sequences for each tested TRNG architecture and evaluated the relative number $(t_M, t_P, t_R, t_L)$ of sequences that passed each test. A good TRNG should pass all FIPS tests, so that $t_{FIPS} = t_M t_P t_R t_L = 1$.
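The first two parameters and the simplest of the FIPS tests can be sketched in a few lines of Python. This is an illustration of Equations (6.2) to (6.6), not the evaluation software used for the measurements; the function names are assumptions, and the monobit bounds (9725 < number of ones < 10275) are those of the FIPS 140-2 monobit test for a 20000-bit record.

```python
import math


def bias(bits):
    """Eq. (6.2): relative frequency of ones minus 0.5."""
    return sum(bits) / len(bits) - 0.5


def corr(bits, k):
    """Eq. (6.4): sample autocorrelation coefficient of b(n) and b(n-k), k >= 1."""
    x, y = bits[k:], bits[:-k]                     # aligned pairs (b(n), b(n-k))
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = mean_x * (1 - mean_x)                  # Eq. (6.5), valid for binary data
    var_y = mean_y * (1 - mean_y)
    return cov / math.sqrt(var_x * var_y)          # undefined for constant sequences


def rho_max(bits, max_lag=100):
    """Eq. (6.3): largest absolute autocorrelation over lags 1..max_lag."""
    return max(abs(corr(bits, k)) for k in range(1, max_lag + 1))


def rho_threshold(n_bits, chi=2.576):
    """Eq. (6.6): reference value chi / sqrt(N)."""
    return chi / math.sqrt(n_bits)


def fips_monobit_passes(bits):
    """FIPS 140-2 monobit test on one 20000-bit record."""
    if len(bits) != 20000:
        raise ValueError("the monobit test is defined for 20000-bit records")
    return 9725 < sum(bits) < 10275


if __name__ == "__main__":
    alternating = [0, 1] * 10000                   # unbiased but fully correlated
    print(bias(alternating), rho_max(alternating, max_lag=4),
          fips_monobit_passes(alternating))
```

The alternating sequence in the usage example has zero bias and passes the monobit test, yet its autocorrelation is maximal, which shows why the bias alone is not a sufficient quality measure.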
Tables 6 – 4 and 6 – 5 give the parameters and results for the selected TRNG architectures. The best output bit-rate and quality (expressed through the bias, $\rho_{max}$ and $t_{FIPS}$) are obtained with the TRNG configuration with two EPLLs. The enhanced adjustable parameters of the EPLL make it possible to reach the level of sensitivity required by the jitter present in the device. The configurations with the FPLL are not suitable for jitter sampling due to the limited range of PLL dividers (see Table 6 – 1). With a low sensitivity S the number of critical samples is very low, the configuration is unstable, and the output sequence has a significant bias.
Table 6 – 4 Configuration parameters of tested TRNG

| Conf. | PLL1 type | PLL1 $K_M/K_D$ | PLL2 type | PLL2 $K_M/K_D$ | Total $K_M/K_D$ | MAX($\Delta T_{min}$) [ps] | R [kb/s] | $\sigma_{jit}$ [ps] |
|---|---|---|---|---|---|---|---|---|
| A | Fast | 12/7 | Fast | 25/12 | 144/175 | 10.4 | 952.4 | 10 |
| B | Enh. | 43/7 | Fast | 25/12 | 516/175 | 2.9 | 952.4 | 23 |
| C | Enh. | 212/207 | - | 1 | 212/207 | 14.7 | 386.5 | 12 |
| D | Enh. | 43/7 | Enh. | 31/10 | 430/217 | 2.3 | 1142.9 | 13 |
Table 6 – 5 Results of quality evaluation of tested TRNG configurations

| Configuration | bias | $\rho_{max}$ | $t_{FIPS}$ |
|---|---|---|---|
| A | -0.358 | 0.043 | 0 |
| B | 0.054 | 0.023 | 0 |
| C | -0.003 | 0.012 | 0.96 |
| D | 0.002 | 0.003 | 1 |
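The MAX($\Delta T_{min}$) and R columns of Table 6 – 4 can be cross-checked numerically. The sketch below rests on two assumptions that are not stated in this section: an input clock of $F_{CLI} = 80$ MHz, and Eq. 5.5 having the form MAX($\Delta T_{min}$) $= T_{CLK}/(4 K_M)$. Both were chosen because together they reproduce the tabulated values to the printed precision; the bit-rate uses the R row of Table 6 – 3.

```python
F_CLI_HZ = 80e6  # assumption: inferred from the table values, not quoted in the text

# name: ((M_CLJ, D_CLJ), (M_CLK, D_CLK)); None means CLK is the input clock itself
CONFIGS = {
    "A": ((12, 7), (25, 12)),
    "B": ((43, 7), (25, 12)),
    "C": ((212, 207), None),
    "D": ((43, 7), (31, 10)),
}


def characterise(clj, clk, f_cli=F_CLI_HZ):
    """Return (K_M, K_D, MAX(dT_min) in ps, R in kb/s) for one configuration."""
    m_clj, d_clj = clj
    m_clk, d_clk = clk if clk else (1, 1)
    k_m = m_clj * d_clk                          # Table 6-3, parallel column
    k_d = d_clj * m_clk
    f_clk = f_cli * m_clk / d_clk
    max_dt_min_ps = 1e12 / (4 * k_m * f_clk)     # assumed form of Eq. 5.5
    rate_kbps = f_cli / (d_clk * d_clj) / 1e3    # R row of Table 6-3
    return k_m, k_d, max_dt_min_ps, rate_kbps


if __name__ == "__main__":
    for name, (clj, clk) in CONFIGS.items():
        k_m, k_d, dt, rate = characterise(clj, clk)
        print(f"{name}: K_M/K_D = {k_m}/{k_d}, "
              f"MAX(dT_min) = {dt:.1f} ps, R = {rate:.1f} kb/s")
```

If the actual board clock or the exact form of Eq. 5.5 differs, only the two marked lines need to change.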
The final speed of the generator in configuration D (more than 1 Mbit/s) is much higher than that presented in [60], while the quality confirmed by the statistical tests remains comparable. Thanks to the analysis of the available PLL configurations and their parameters, we have presented a generator without the additional delaying logic applied in the original proposal [60]. The simpler sampling part of the generator is made possible by the wider divider range of the PLL circuits.
6.2.3 Analysis of TRNG in Actel FPGAs
In this section we explain how the parameters of the clock circuitry influence the parameters of the discussed PLL-TRNG in the case of a low-cost FPGA. The analysis should answer the question whether Actel FPGAs are suitable for a PLL-TRNG implementation and which TRNG parameters are achievable.
Clock generator circuitry in Actel FPGAs The ProASICplus family was chosen as the target for the TRNG implementation. This low-cost FPGA family based on flash technology offers two well-configurable PLLs on a chip. For experiments and measurements we selected an evaluation board [1] with the ProASICplus APA300-PQFP208 device [3]. An on-board oscillator with a frequency of 40 MHz was used as the reference input clock source. The board has separate power supplies for the PLLs and for the rest of the chip, which makes it possible to analyse the impact of power supply disturbances (from off-chip manipulation, or from the activity of the on-chip logic when the power supplies are interconnected) on the generated sequences.
The on-chip PLL imposes the following limits on the frequencies of the signals connected to it: $F_{in}$ = 1.5 to 240 MHz, $F_{out}$ = 6 to 180 MHz and $F_{VCO}$ = 24 to 180 MHz. With the dividers already listed in Table 6 – 2, the PLL output frequency $F_{out}$ is derived from the input frequency $F_{in}$ as
$$F_{out} = F_{in}\,\frac{m}{n \times k} = \frac{F_{VCO}}{k}, \qquad (6.7)$$
where m, n and k are the PLL frequency dividers and $F_{VCO}$ stands for the output frequency of the VCO.
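The frequency limits and Eq. (6.7) together determine which divider settings are usable. The sketch below enumerates them for the 40 MHz on-board oscillator, using the ProASICplus divider ranges from Table 6 – 2 (m = 1-64, n = 1-32, k = 1-4). The function name is hypothetical, and any further device constraints not mentioned in the text (for example on the reference frequency) are ignored.

```python
from fractions import Fraction

F_IN = Fraction(40_000_000)               # on-board oscillator, Hz
VCO_RANGE = (24_000_000, 180_000_000)     # F_VCO limits, Hz
OUT_RANGE = (6_000_000, 180_000_000)      # F_out limits, Hz


def valid_settings(f_in=F_IN, m_max=64, n_max=32, k_max=4):
    """Yield (m, n, k, F_out) for every divider triple that respects the limits."""
    for n in range(1, n_max + 1):
        for m in range(1, m_max + 1):
            f_vco = f_in * m / n                      # Eq. (6.7): F_VCO = F_in * m / n
            if not VCO_RANGE[0] <= f_vco <= VCO_RANGE[1]:
                continue
            for k in range(1, k_max + 1):
                f_out = f_vco / k
                if OUT_RANGE[0] <= f_out <= OUT_RANGE[1]:
                    yield m, n, k, f_out


if __name__ == "__main__":
    settings = list(valid_settings())
    distinct = {f_out for _, _, _, f_out in settings}
    print(f"{len(settings)} usable settings, {len(distinct)} distinct output frequencies")
```

Such an enumeration is a convenient starting point for searching divider pairs that give suitable $K_M$ and $K_D$ values.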
To compare the possible configurations and find the ranges of the TRNG parameters, one can proceed as follows. The frequency ranges of the two rationally related clock signals are given by the ranges of the PLL dividers and by the input frequency (using the equations from Table 6 – 3). From the ratio of the frequencies it is possible to set the parameters $K_M$ and $K_D$ and then to check the basic condition (Equation 5.6) that has to be fulfilled for the TRNG to function. The size of the jitter deviation $\sigma_{jit}$ can either be measured on the target device (if the required measurement equipment is available) or just estimated (considering the ranges given in the vendor’s documentation) and then set empirically after experiments with the generator’s settings. Knowing the frequencies of the clock signals and the parameters $K_M$ and $K_D$, it is easy to find the period $T_Q$ (see Equation 5.3) and then the output bit-rate $R = 1/T_Q$.
To give an overview of the ranges of MAX($\Delta T_{min}$) achievable in the different PLL configurations, we summarise them in Table 6 – 6. Note that the intervals are only theoretically achievable and could be slightly different in practice, since some limitations were not taken into account (e.g. the limited output and input frequency for the cascaded configuration, the limited number of combinations of dividers, etc.).
From Table 6 – 6 we can see that the smallest values of MAX($\Delta T_{min}$) can be reached with the cascaded configuration. While the frequency range is the same as for the other configurations, the number of combinations of frequency dividers is higher, which offers better possibilities for matching the $F_{CLJ}$ frequency to the fixed $F_{CLK}$.
Table 6 – 6 Achievable sensitivity on jitter using two clock signals in Actel ProASICplus ($F_{CLI}$ = 40 MHz)

| configuration | MAX($\Delta T_{min}$) |
|---|---|
| two PLLs | 0.17 ps - 41 ns |
| one PLL | 10.85 ps - 41 ns |
| two cascaded PLLs | 0.084 ps - 41 ns |

As expected, the lowest sensitivity is obtained when only one PLL is used. On the other hand, if the jitter is large enough, this configuration is the most efficient in terms of area. In practical cases the configuration with one PLL is not usable, as the number of random samples and their entropy are low because of the low sensitivity S.
As a solution, one can add a second PLL in a parallel or cascaded configuration. As already mentioned, the parallel configuration has the disadvantage that two clock signals have to be controlled instead of one, as in the cascaded configuration. On the other hand, a disadvantage of the cascaded configuration could be that the tracking jitter is composed of components produced in all the PLLs of the cascade.
In the worst case the achievable sensitivity is comparable with the size of the jitter (usually around 10-100 ps), and in the other cases it is much higher. We can therefore conclude that, as far as the theoretical requirements are concerned, the proposed method is feasible and suitable for Actel FPGAs.
Experimental Results After the theoretical analysis we proceeded to a practical implementation. The generator was synthesised and programmed into the FPGA using the Actel design tools Libero IDE 7.1.
In the experiments we focused on the configuration with one PLL circuit, as a specific configuration typical for low-cost FPGAs. In order to increase the sensitivity of the sampler, we added delay elements in front of the bank of sampling gates (for details see [61]). In the Actel ProASICplus the shortest delay available inside the chip, around 0.5 ns, is that between the input and output of a NAND gate [5]. The outputs of all delaying paths are accumulated during a multiple of periods $T_Q$; afterwards the bits of the accumulator are XORed together and provided as one output bit.
The configuration demonstrating that the TRNG can be implemented in an Actel
ProASICplus FPGA using one PLL and a delay line built from NAND gates has the
following parameters:
• FCLK = FCLI = 40 MHz
• $F_{CLJ} = \frac{M_{CLJ}}{D_{CLJ}} F_{CLI} = \frac{12}{7} \cdot 40\ \mathrm{MHz} = 68.5714\ \mathrm{MHz}$
• Number of delay elements (NAND gates): 8
• Accumulation period: 17TQ = 119 periods of FCLK

Table 6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICPlus
Logic type    Number    Usage
Core Cells    396       4.8%
FIFO Cells    2         6.3%
PLLs          1         50%
The area requirements are summarised in Table 6 – 7. The design also includes
the logic that allows a computer to read the internal signals and the generated
sequence, so the design can be reduced if required.
The NIST statistical tests were performed on continuous 1-Gigabit TRNG output
records and followed the testing strategy, general recommendations, and result
interpretation described in [97]. We used a set of 1000 1-Megabit sequences
produced by the TRNG. Most of the tests were passed; however, some were not,
e.g. the overlapping template test and some variants of the non-periodic
template test. The fact that the generated sequence is in some parameters
slightly distinguishable from a truly random stream may indicate some problems
inside the TRNG implementation. On the other hand, the tested sequence is
extremely long (a 1-gigabit continuous record), unlike the output streams required for
practical applications.
The experimental tests of configurations with two PLLs connected in parallel or in
cascade have shown that the condition expressed by Equation 5.6 is a necessary but
not a sufficient condition for proper operation of the TRNG. The results confirm
the theoretical analysis: the tracking jitter can be sampled
and the generator output includes critical random samples. To reliably achieve an
unbiased and random sequence, however, the number of critical samples and their probability
distribution have to satisfy some additional conditions, which are specified later in
this chapter.
Using Actel FPGAs as an example, we explained how the basic parameters of the
TRNG can be computed and how they relate to the parameters of the target
device. Following the presented results, it is possible to implement the TRNG
with the required parameters. We can conclude that Actel FPGAs are suitable for
implementing the TRNG based on the discussed method, and that the achieved parameters
are comparable with those obtained with Altera FPGAs.
6.2.4 Stochastic Model of PLL-TRNG<br />
It is a common requirement that a good TRNG design be supported by a
mathematical (more precisely, stochastic) model of the source of randomness. A
reliable model is a necessary requirement for the security evaluation during the
certification process [37]. On the one hand, the model should be as simple as possible;
on the other hand, it should reliably describe the basic behaviour of the TRNG.
In our case, the stochastic model should express the probability that the value at
the generator output is equal to one as a function of the jitter variation and the
phase of the CLK and CLJ signals.
Reordering of the Samples If the sampled values of the signal CLJ are ordered in
a proper way, they create an image of the original clock waveform. If we accumulate
the ordered samples in KD accumulators during Q periods TQ, we obtain an image
of the distribution of the probabilities that the i-th sample is equal to one.
Figure 6 – 5 presents an example of accumulated and reordered samples
obtained during Q = 1000 periods TQ for these parameters:
• KM = 212, KD = 207, FCLJ = 81.93 MHz, presented in Figure 6 – 5(a), and
• KM = 516, KD = 175, FCLJ = 491.43 MHz, presented in Figure 6 – 5(b).
The variation of the jitter is proportional to the number of points (critical
samples) in the rising (or falling) region of the waveforms (two and six in the
presented example). Since in (b) FCLJ = 491.43 MHz and the period TCLJ is divided into
KD = 175 sampling intervals, the distance between two subsequent samples is
about 11.6 ps. The width of the region influenced by the jitter is thus about
69.6 ps. This value is equal to approximately 3σjit, so σjit ≈ 23.2 ps. Using the
same method, we get σjit ≈ 29.5 ps from Figure 6 – 5(a). The presented method
of jitter measurement is simple enough to be implemented
inside a device, so the jitter can be monitored continuously in real time.
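The arithmetic of this estimate for case (b) can be retraced with a short sketch. The function name and the `width_in_sigmas` parameter are illustrative only, and the assumption that the jitter-influenced region spans about 3σjit is taken directly from the text above.

```python
def sigma_from_critical_samples(f_clj_mhz, k_d, n_critical, width_in_sigmas=3.0):
    """Estimate the jitter sigma (in ps) from the number of critical samples.

    Assumes, as in the text, that the jitter-influenced region of the
    reordered waveform is about `width_in_sigmas` standard deviations wide.
    """
    t_clj_ps = 1e6 / f_clj_mhz        # period of CLJ in picoseconds
    step_ps = t_clj_ps / k_d          # distance between two reordered samples
    width_ps = n_critical * step_ps   # width of the region influenced by jitter
    return step_ps, width_ps, width_ps / width_in_sigmas


# Case (b): FCLJ = 491.43 MHz, KD = 175, six critical samples
step_ps, width_ps, sigma_ps = sigma_from_critical_samples(491.43, 175, 6)
print(f"step = {step_ps:.1f} ps, width = {width_ps:.1f} ps, sigma = {sigma_ps:.1f} ps")
```

Without intermediate rounding the sketch gives slightly larger values than the text (which rounds the sample distance to 11.6 ps first); the difference is well below the resolution of the method.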
Figure 6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000
periods TQ: (a) KM/KD = 212/207, (b) KM/KD = 516/175
On-chip reordering To allow a better analysis of the samples processed in the
TRNG, we implemented the following method for on-chip reordering of the
samples.
The structure of the ordering logic is illustrated in Figure 6 – 6. Samples coming
from the TRNG are continually written into a dual-port memory block organised as
512 1-bit-wide words (we do not usually use a value of the KD parameter, which
determines the number of samples in one period, larger than 512). The writing address is
re-initialised at each new period TQ, signalled by the signal next_tq. To read the samples in
an order in which they form the CLJ clock waveform, we need to set the correct reading address.
This is done by a LUT implemented as a ROM block. The input of the table
is identical to the writing address, and the output of the LUT is used as the reading address
for the sample memory. The content of the ROM (the LUT) can easily be generated
using Equation 5.12.
The signal sample_ord was assigned to an output pin of the DSP Stratix board and
measured with an oscilloscope (Tektronix TDS 3052), triggered by the signal next_tq.
Figure 6 – 7 presents the measured waveform. The parameters of the TRNG are as follows:
MCLK = 13, DCLK = 12, MCLJ = 14, DCLJ = 11, KM = 168, KD = 143, $K_M^{-1} = 103$.
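The quoted value of the inverse of KM can be checked, and a possible LUT content generated, with the following sketch. Equation 5.12 is not reproduced in this section, so the mapping used here (reading address = i · KM⁻¹ mod KD, where KM⁻¹ is the multiplicative inverse of KM modulo KD) is an assumption that is merely consistent with the parameters above; the function name is hypothetical.

```python
from math import gcd


def reorder_lut(k_m, k_d):
    """Return (K_M^-1 mod K_D, LUT) for reordering the samples.

    Assumed mapping: the sample written at address j lies at phase position
    (j * K_M) mod K_D of the CLJ period, so position i is read from address
    (i * K_M^-1) mod K_D.
    """
    if gcd(k_m, k_d) != 1:
        raise ValueError("K_M and K_D must be coprime")
    k_m_inv = pow(k_m, -1, k_d)  # modular inverse (Python 3.8+)
    lut = [(i * k_m_inv) % k_d for i in range(k_d)]
    return k_m_inv, lut


k_m_inv, lut = reorder_lut(168, 143)
print(k_m_inv)   # 103
print(lut[:5])   # first five reading addresses
```

Because KM and KD are coprime, the LUT is a permutation of the addresses 0 to KD − 1, so every written sample is read exactly once per period TQ.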
In the region of the edge (in this particular case, the falling edge), the ordered samples
do not form an ideal rectangular waveform; instead, several edges can be observed
in one period. Samples placed around the position of an ideal edge are taken at
different time instants (because of the required reordering of the samples in time). Hence,
they may be influenced by different amounts of jitter; in other words, the jitter
changes faster than the sampling frequency. This causes more than one change
(edge) of the signal. To analyse this phenomenon better, we need
to collect samples from the edge region for several hundred subsequent periods.
Figure 6 – 6 Block diagram of design for on-chip samples reordering (signals next_tq, sample,
sample_ord and edge; RAM 512 x 1b with writing and reading ports; ROM KD x 9b)
Figure 6 – 7 Reordered samples from generator measured by oscilloscope
Stochastic Model The clock signal CLJ is sampled KD times by the other clock
signal, CLK, during one period TQ. The output signal is quasi-periodic (see footnote 11) with the
period TQ as long as the condition
GCD(KM, KD) = 1 (6.8)
is fulfilled. Samples taken in a “stable” part of the CLJ signal (i.e.
samples that are not influenced by the jitter) always have a constant value (logical
zero or one). They form the dominant part of the set of output samples.
The value of the i-th sample qi (0 ≤ i ≤ KD − 1) can be viewed as a binary
random variable Xi ∈ {0, 1}. Its mean value E[Xi] is equal to the probability
pi = P(Xi = 1), which is related to the mean value of the jitter at the corresponding
sampling instant. It was shown in [60] that the decimated output signal x(nTQ) of
the TRNG is the bit-wise addition modulo 2 of KD binary samples q(·) (see
also Figure 5 – 6), expressed as
$$x(nT_Q) = q(nT_Q) \oplus q(nT_Q - T_{CLK}) \oplus \dots \oplus q(nT_Q - (K_D - 1)T_{CLK}). \qquad (6.9)$$
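The decimation expressed by Equation 6.9 can be illustrated in software. This is only a behavioural sketch on a list of already captured bits, not the hardware implementation; the function name is hypothetical.

```python
from functools import reduce
from operator import xor


def decimate_xor(samples, k_d):
    """XOR together each block of K_D consecutive samples (one period TQ).

    `samples` is a flat list of 0/1 values taken at the rate of CLK;
    an incomplete trailing block is ignored.
    """
    n_periods = len(samples) // k_d
    return [reduce(xor, samples[n * k_d:(n + 1) * k_d]) for n in range(n_periods)]


# Three periods with K_D = 3 samples each
out = decimate_xor([1, 0, 1, 1, 1, 1, 0, 0, 0], 3)
print(out)  # [0, 1, 0]
```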
We denote the number of critical samples $K_D^p$. The critical samples take the value
1 with the probability $p_i \in (0, 1)$, $i = 0, 1, \dots, K_D^p - 1$. The rest of the KD samples
are deterministic. They take the logical value zero or one, and their numbers
are denoted $K_D^0$ and $K_D^1$, respectively. The total number of samples in the period
TQ can be expressed as the sum of the critical and deterministic samples:
$$K_D = K_D^p + K_D^1 + K_D^0. \qquad (6.10)$$
The generator extracts randomness from the $K_D^p$ binary values using a standard XOR
corrector. It was assumed in [60] that these values are statistically independent.
Using the mathematical background from [42], it is possible to show that the following
relation holds between the set of probabilities $p_i$ of the $K_D^p$ independent samples and the
mean value $E[p_i]$ at the output of the XOR corrector (the output of the TRNG):
$$E[p_i] = \frac{1}{2} + (-1)^{K_D^1} (-2)^{K_D^p - 1} \prod_{i=0}^{K_D^p - 1} \left( p_i - \frac{1}{2} \right). \qquad (6.11)$$
11 If the signals are not influenced by the jitter, the output signal of the sampling gate is perfectly
periodic. If some jitter is present, subsequent periods are not identical, but differ only in a few
random samples, while constant samples form the major part of the waveform.
Equation 6.11 can be viewed as a stochastic model of the generator, since it makes it
possible to estimate the probability of the generator output value as a function of the mean
values of the critical samples (which depend on the jitter characteristics). However, the
model is valid if and only if the critical samples are independent.
The proposed model shows that (as could be expected) the bias of the generator
output decreases with an increasing number of critical samples (note that this
number is related to the jitter variation). It can be seen that if the mean value of
any of these samples is equal to 0.5, the bias at the generator output is equal to
zero and does not depend on the remaining samples. Finally, the sign of the bias
depends on the number of samples having a mean value equal to one ($K_D^1$).
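The properties listed above can be checked numerically. The sketch below evaluates the model of Equation 6.11, with the mean at the XOR output equal to 1/2 + (−1)^K1 · (−2)^(Kp − 1) · Π(pi − 1/2), and compares it with a brute-force enumeration of all outcomes of independent critical samples. The probabilities used are invented for illustration and are not measured values; the function names are hypothetical.

```python
from itertools import product
from math import prod


def model_mean(p_critical, k1):
    """Mean of the XOR output according to Equation 6.11.

    p_critical: probabilities p_i of the critical samples being one
    k1:         number of deterministic samples equal to one
    """
    n = len(p_critical)
    return 0.5 + (-1) ** k1 * (-2) ** (n - 1) * prod(p - 0.5 for p in p_critical)


def brute_force_mean(p_critical, k1):
    """Same quantity obtained by enumerating all outcomes (independent samples)."""
    total = 0.0
    for bits in product((0, 1), repeat=len(p_critical)):
        weight = prod(p if b else 1 - p for p, b in zip(p_critical, bits))
        if (sum(bits) + k1) % 2 == 1:   # XOR of all samples equals one
            total += weight
    return total


p = [0.7, 0.6, 0.9]                      # illustrative probabilities only
print(model_mean(p, k1=61), brute_force_mean(p, k1=61))
print(model_mean([0.7, 0.5, 0.9], k1=61))                     # one unbiased sample removes the bias
print(model_mean(p, k1=60) - 0.5, model_mean(p, k1=61) - 0.5)  # sign follows the parity of K1
```

Adding further critical samples with the same probability shrinks the bias: for pi = 0.6 the bias is 0.02 with two samples and 0.0008 with four.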
An advantage of the proposed model is that it can also be
used to test the mutual statistical independence of the critical samples. To evaluate
the statistical independence, the output mean value and the mean values of the critical
samples are measured and the validity of the model expressed in Equation 6.11
is verified. If the test fails, the random variables (critical samples) are mutually
dependent.
Model Verification The validity of the model was tested on real data in
order to confirm the model empirically. We tested the outputs of seven TRNG
configurations implemented in Altera Stratix devices. Table 6 – 8 presents the
chosen parameters of the tested configurations (KM, KD, FCLK and FCLJ) and the
corresponding results: the mean value computed by the model from the critical samples ($E[p_i]$),
the mean value of the generator output (m = E[x(nTQ)]), the number of samples equal
to one ($K_D^1$) and the number of critical samples ($K_D^p$).
The mean value of the output bitstream m = E[x(nTQ)] is computed as the
arithmetic mean of 512,000 successive bits at the output of the TRNG. The mean of
the model E[pi] is calculated using Equation 6.11, employing the probabilities
of the critical samples pi accumulated over Q = 1000 periods TQ.
As can be seen, the model is very precise for a small number of critical samples,
since both mean values are very similar. For a higher number of critical samples,
the mean value tends to the ideal value 0.5. Note that the model provides correct
information about the statistical deviation of the output bitstream in configurations
1, 2 and 5. The model gives acceptable results, corresponding closely to the mean
value of the generated sequence, in configurations 6 and 7. It should be noted that in
configurations 3 and 4 the model outputs do not agree with the generator outputs, most
probably because of statistical dependence between the critical samples.
Table 6 – 8 Mean values measured using the stochastic model E[pi] and the output sequence of
the TRNG m = E[x(nTQ)]
#   KM    KD    FCLK (MHz)   FCLJ (MHz)   E[pi]   m       $K_D^1$   $K_D^p$
1   144   119   113.33       137.14       0.846   0.829   61        2
2   144   175   166.66       139.14       0.717   0.729   89        3
3   486   119   75.55        139.14       0.501   0.553   55        10
4   486   161   102.22       139.14       0.507   0.524   74        13
5   250   203   232          285.71       0.489   0.526   95        16
6   270   203   232          308.57       0.5     0.496   96        16
7   486   217   137.77       308.57       0.499   0.496   99        22
6.3 Active Non-Invasive Attack on TRNG<br />
To obtain results from a real-life attack, we carried out an active non-invasive attack
on an FPGA implementation of the TRNG [60]. Specifically, we tried to force a bias
onto the generator output by changing the working temperature of the FPGA chip.
Our aim was to find out what kinds of changes in the parameters of the generated sequence
can be observed. Moreover, we recorded the internal signals of the generator and
evaluated the influence of temperature on them.
Similar experiments have been described in [98], where the PLL-TRNG was
evaluated as problematic, with varying quality of the generated bit sequence. Based
on the results obtained from the attack, we provide additional requirements
for the PLL-TRNG design and explain why the configuration chosen by
Santoro et al. [98] had problems passing the statistical tests.
6.3.1 Attack description<br />
The temperature of the FPGA was decreased by applying a freezing spray. The
lowest temperature achieved was −40 °C. Because the FPGA chip produces some heat, it
warmed itself up to +30 °C. During the measurements we tried
to keep the temperature close to the selected value. The temperature of the
chip was measured with a simple contact thermometer.
Two similar configurations of the TRNG were chosen as objects under attack.
In both cases we used the Altera Stratix DSP board with an EP1S25 device [18].
The following parameters were chosen or given by the board: FCLI = 80
MHz, MCLK = 31, DCLK = 10, MCLJ = 36, DCLJ = 7. Then FCLK = 248 MHz,
FCLJ = 411 MHz, and KM/KD = 360/217. To allow a comparison
of the TRNG behaviour for two settings, we chose the following configurations,
which differ in the bandwidth of the loop filter:
• Configuration A has the filter bandwidth set automatically by the synthesis
tool (Altera Quartus).
• Configuration B has the filter bandwidth set to the preset value low.
The lower the bandwidth, the better the input jitter rejection that can be achieved, at
the price of a longer locking time of the PLL. The synthesis tool chooses the optimal
bandwidth for the selected signal frequencies, achieving an acceptable locking time and
level of input jitter filtering. By setting the bandwidth to a low value, we ensure
that jitter from sources outside the PLL is filtered out, so that we can observe the
jitter originating inside the PLL.
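The frequencies and the KM/KD ratio quoted above for the attacked configuration can be re-derived from the multiplication and division factors. The sketch assumes that both clocks are synthesised from the same input frequency FCLI and that KM/KD is the ratio FCLJ/FCLK reduced to lowest terms; this assumption reproduces the values given in the text, and the function name is illustrative.

```python
from fractions import Fraction
from math import gcd


def pll_trng_parameters(f_cli_mhz, m_clk, d_clk, m_clj, d_clj):
    """Return (FCLK, FCLJ, KM, KD) for the given PLL factors.

    Assumes FCLK = FCLI * MCLK / DCLK, FCLJ = FCLI * MCLJ / DCLJ and
    KM / KD = FCLJ / FCLK in lowest terms.
    """
    f_clk = Fraction(f_cli_mhz) * m_clk / d_clk
    f_clj = Fraction(f_cli_mhz) * m_clj / d_clj
    ratio = f_clj / f_clk            # Fraction reduces to lowest terms
    return f_clk, f_clj, ratio.numerator, ratio.denominator


f_clk, f_clj, k_m, k_d = pll_trng_parameters(80, 31, 10, 36, 7)
print(float(f_clk), round(float(f_clj), 2), k_m, k_d)
print(gcd(k_m, k_d))  # the condition GCD(KM, KD) = 1 of Equation 6.8
```

The same function applied to MCLK = 13, DCLK = 12, MCLJ = 14, DCLJ = 11 gives KM/KD = 168/143, the configuration used for the oscilloscope measurement.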
6.3.2 Measurement results
To evaluate the behaviour of the TRNG under changing temperature, for each
temperature value we collected the random bit sequence from the generator output, as well as
the internal signal values, which provide information on the number of influenced random
samples.
By reordering the samples it is possible to reconstruct the waveform of the sampled
clock signal and to track the changes of the sample probabilities. The waveforms sampled by
the generator are depicted in Figures 6 – 8 and 6 – 9. For each sample the number of
ones is counted during one thousand TQ periods. The samples in stable regions
end up with a count of either 0 or 1000 sampled ones. The samples in the edge areas (rising
and falling edge), which are influenced by jitter, reach values between these boundaries.
In the ideal case we suppose that the position of the sampling edge is stable and that what
changes is the position of the edge in the sampled clock signal. The logical value at the
moment of sampling is influenced by an additive jitter. Analysing the sampled values
allows us to describe the behaviour of the generator and the impact of temperature on
the jitter parameters.
From the charts we can see that the position of the critical samples does not change
across the range of temperatures for either configuration. Configuration A
includes fewer critical samples than configuration B, which implies a lower σ² of the
jitter.
Figure 6 – 8 Sampled waveform of a clock signal for TRNG for Configuration A for temperatures
in the range −40 °C to +30 °C.
The random sequences were tested with the simple statistical tests defined in the FIPS
standard [57]. The test suite can reveal a bias or an unbalanced distribution of zeros
and ones in the generated sequence by applying four basic tests (the monobit test, the poker
test, and the runs and long runs tests). If at least one test from the set was not passed, the
result is denoted as FAILED; otherwise it is marked OK.
In Table 6 – 9 we summarise the results of the statistical tests at different FPGA chip
temperatures. It can be seen that while configuration A produced, at some
temperatures, sequences that did not pass the statistical tests, configuration
B is reliable over the whole range of temperatures.
The columns with the number of critical samples show the number of samples influenced
by jitter. It can be observed that in configuration B, for which we set
a low bandwidth of the loop filter, the number of influenced samples is significantly
higher.
We further investigated the number and position of critical samples for both
configurations as a function of the chip temperature. The samples with probability around 0.5
have a crucial impact on the statistical parameters of the generated sequence.
If we eliminate the almost constant samples, i.e. those with fewer than 100 jitter-influenced
values during 1000 periods, there are 4-6 and 12-13 highly critical samples
per edge for configurations A and B, respectively.
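The classification of samples described above can be sketched as follows. The reading of "fewer than 100 jitter-influenced values" as min(count, Q − count) < 100 is an interpretation, and the counter values in the example are invented for illustration; they are not measured data.

```python
def classify_counts(counts, q=1000, threshold=100):
    """Split sample positions into stable, critical and highly critical.

    counts[i] is the number of ones sampled at position i during q periods TQ.
    A position is stable if its count is 0 or q, critical otherwise, and
    highly critical if at least `threshold` of its values deviate from the
    dominant value (assumed reading of the criterion in the text).
    """
    stable, critical, highly_critical = [], [], []
    for pos, count in enumerate(counts):
        influenced = min(count, q - count)   # values changed by the jitter
        if influenced == 0:
            stable.append(pos)
        else:
            critical.append(pos)
            if influenced >= threshold:
                highly_critical.append(pos)
    return stable, critical, highly_critical


# Invented counters around one rising edge
stable, critical, highly = classify_counts([0, 0, 12, 350, 640, 990, 1000, 1000])
print(stable, critical, highly)
```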
Figures 6 – 10 and 6 – 11 show in detail the area of the rising edge of the sampled
waveform. We can observe how the number of ones sampled during the measuring period
changes with the chip temperature. Typical for configuration A is
a large spread of the counts for a fixed sample position. In configuration B the
subsequent samples have very similar counts of sampled ones, and the overall
waveform looks more stable.
Figure 6 – 9 Sampled waveform of a clock signal for TRNG for configuration B for temperatures
in the range −40 °C to +32 °C.
To better visualise how the sampled signals change with
temperature, we provide Figures 6 – 12 and 6 – 13, which show in detail the dynamics of
the counts of sampled ones for the most critical samples.
In configuration A we can observe a significant change in the number of sampled ones when
the chip temperature changes. For example, at position number 84 the difference in the
number of ones sampled at the minimal and maximal temperature is more than 500.
This fact, together with the low number of critical samples, causes instability of the
generator in a changing working environment.
Although the jitter is present over the whole range of temperatures (the
number of critical samples does not change), the bias of the samples changes visibly
and influences the statistical parameters of the generated sequence. When
all samples are strongly biased (the case of temperatures between 20 and 30 °C),
the output sequence is also biased and does not pass the statistical test suite.
Configuration B is more stable under changing chip temperature, and the density
of samples with an equal probability of sampling zero and one is much higher
than in case A. Thanks to that, the statistical parameters of the generated
sequence stay acceptable and the sequence passes all required statistical tests. The bias of particular
samples is compensated by other samples in the critical area, and the final sequence is
kept unbiased.

Table 6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples
influenced by the jitter at different chip temperatures
Temperature   Conf A       Conf A               Conf B       Conf B
in °C         FIPS tests   critical samples #   FIPS tests   critical samples #
-40           OK           26                   OK           64
-30           OK           26                   OK           66
-20           FAILED       25                   OK           64
-10           OK           24                   OK           62
0             OK           24                   OK           63
+10           OK           24                   OK           68
+20           FAILED       22                   OK           61
+30           FAILED       25                   OK           60
Observing Jitter From the observations described above we can conclude that
the variance (σ²) of the jitter in the sampled signal does not change.
The size of the deviation can be observed as the number of critical samples, which remains
almost constant over the whole range of tested temperatures. The presence of jitter
is a fundamental condition for the proper function of the generator. Therefore, a
well-suited startup test for this kind of generator should include a test for the presence
of critical samples.
The on-chip implementation of this test needs a memory block and
counters which, for each edge position of the sampling signal, sum up the number
of sampled ones. Edge positions whose counter value is neither 0 nor
equal to the number of TQ periods signal the presence of jitter. The number
of critical samples must be higher than zero, but a low number of samples cannot be
accepted either. From the empirical experiments described above we can conclude that
configurations with more than 10 highly critical samples per edge behave reliably
even in a changing environment.
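A software sketch of such a startup test is given below. It is not the on-chip implementation: the assumption of two edges per period, the 10% threshold for a highly critical sample and all function names are illustrative choices made here, and the test data are synthetic.

```python
def accumulate_counters(periods):
    """Sum the sampled ones for each sample position over all TQ periods."""
    counters = [0] * len(periods[0])
    for period in periods:
        for pos, bit in enumerate(period):
            counters[pos] += bit
    return counters


def startup_test(periods, edges=2, min_per_edge=10, highly_fraction=0.1):
    """Return (jitter_present, robust) for reordered sample periods.

    jitter_present: at least one counter differs from 0 and from Q
    robust:         more than `min_per_edge` highly critical samples per edge
                    (averaged over an assumed number of edges per period)
    """
    q = len(periods)
    counters = accumulate_counters(periods)
    critical = [c for c in counters if 0 < c < q]
    highly = [c for c in critical if min(c, q - c) >= highly_fraction * q]
    return len(critical) > 0, len(highly) / edges > min_per_edge


def make_periods(counts, q):
    """Synthetic data: position i is one in exactly counts[i] of q periods."""
    return [[1 if t < c else 0 for c in counts] for t in range(q)]


# Two edges with 11 highly critical positions each (synthetic example)
counts = [0] * 20 + [5] * 11 + [10] * 20 + [5] * 11
print(startup_test(make_periods(counts, q=10)))  # (True, True)
```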
Continuous monitoring of the number of critical samples makes it possible to implement an
effective online test for the discussed category of PLL-based generators. Each
significant change, either in the position or in the probability value of the critical samples, may have
an impact on the parameters of the generated sequence and should therefore raise
an alarm signal inside the TRNG.
Figure 6 – 10 Amount of sampled ones during 1000 sampling periods for TRNG with configuration
A (detail of the rising edge).
From the measured data it is possible to estimate the jitter parameters and draw the
probability histograms. In Figure 6 – 14 we compare the histograms of the TRNG
working in configurations A and B, which differ in loop filter bandwidth. In both cases
the jitter has a normal (Gaussian) distribution. As can be observed, configuration
A includes jitter with a lower deviation, while the jitter in configuration B has an almost
three times higher value.
What we find crucial in our measurements is the observation of the jitter parameters
under changing temperature. The jitter in the PLL circuitry changes when
the chip is frozen, which can be observed as a change in the number of sampled ones
at the positions of the critical samples. In Figure 6 – 15 we depict the difference in those
numbers between the boundary temperatures −40 °C and +30 °C
in both configurations. The difference has a normal (Gaussian) distribution, as does
the previously discussed jitter at room temperature. The standard
deviation of the additional jitter is identical to its values for measurements at stable
temperature. As a result we can conclude that by changing the chip temperature
the amplitude of the jitter changes, too.
In the PLL-TRNG, the bigger the changes of the jitter amplitude, the bigger
the changes in the histogram of the jitter, and this has a direct impact on the statistical
properties of the generated sequence. In configuration A the changes in the
amplitude of the jitter are significant, as are the differences in probability values
between particular samples. Configuration B is characterised by smaller changes
of the jitter amplitude, which are in addition flatter. In such a case the probability
changes uniformly for the most critical samples and does not have any unwanted
impact on the generated random sequence. The described higher level of robustness was
observed in configuration B and confirmed by the positive outcome of all statistical tests.
Figure 6 – 11 Amount of sampled ones during 1000 sampling periods for TRNG with configuration
B, with low-pass loop filter (detail of the rising edge).
From the obtained results and the suggestions for PLL-TRNG design we can
conclude that the design tested in [98] with parameters KM/KD = 270/203 is not
suitable for use under changing temperature. From Table 6 – 8 we get the number
of critical samples, which is 22, i.e. 11 per edge. As proposed in the suggestions above,
what matters more is the number of highly critical samples, which should be more than 10.
This condition is not met in this configuration, and the generator behaves similarly
to Configuration A in our experiments during the simulated attack on the PLL-TRNG.
6.4 Conclusions and Further Research<br />
This chapter provided an analysis of the PLL-based TRNG. We focused on implementation
aspects, on the relations between the target FPGA platform and its PLL circuitry,
and on the achievable technical parameters of the generator in devices from the vendors Actel
and Altera. In the second part of the chapter we presented our proposal for a stochastic
model of the TRNG and proposed additional steps in PLL-TRNG design in order
to make the generator robust in a changing environment.
Figure 6 – 12 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration A.<br />
By theoretical and practical analysis we concluded that PLL circuitry is more suitable for the discussed TRNG implementation than DLL circuitry. The parameters of the PLL circuitry available in FPGAs currently on the market are satisfactory for a reliable implementation. We showed the steps of a theoretical analysis of the PLL parameters, with an estimation of the jitter and TRNG parameters that were later confirmed by empirical measurements.<br />
Two practical implementations in the Altera and Actel families of FPGAs showed which criteria are important in the design. The final speed achieved by the generator in the Altera Stratix device is more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. Thanks to the analysis of the available PLL configurations and their parameters, we have presented a generator without the additional delaying logic applied in the original proposal [60]. The simpler sampling part of the generator is made possible by the wider range of dividers of the PLL circuits in the Stratix FPGA family.<br />
We presented the most compact solution, with one PLL circuit and a chain of delay elements, implemented in an Actel ProASICplus device. The results of statistical tests on a very long record of generated data confirm a high level of randomness, with only a few tests failed. We can conclude that it was confirmed both theoretically and practically that the PLL-TRNG is suitable for fully embedded implementation in low-cost FPGAs and provides a reliable source of truly random values even when only a small number of PLLs with a limited range of frequency dividers is available.<br />
Figure 6 – 13 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration B.<br />
The proposed stochastic model of the generator makes it possible to prove the mutual statistical independence of the critical samples. The model was confirmed empirically and is valid for a small number of critical samples; for a higher number, however, the model is less precise. To obtain a better-adjusted model, we propose for future research to monitor and analyse the bit sequence at the output of the sampling gate, before the XOR operation. Such measurements may uncover a possible dependency between the samples.<br />
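Why independence between the samples matters can be shown with a short worked formula. If the bits entering the XOR operation are independent and bit i equals one with probability p_i, then the XOR output equals one with probability 1/2 − 1/2 · ∏(1 − 2·p_i) (the piling-up lemma). The Python sketch below evaluates this expression; the function name and the probability values in the example are hypothetical and are not taken from the measurements. A measured output bias that deviates from this prediction would be one possible hint at a dependency between the samples.

```python
def xor_prob_one(probs):
    """P(XOR of independent bits = 1), where probs[i] = P(bit i = 1).

    Piling-up lemma: P = 1/2 - 1/2 * prod(1 - 2 * p_i).
    Valid only under the assumption that the bits are mutually independent.
    """
    prod = 1.0
    for p in probs:
        prod *= 1.0 - 2.0 * p
    return 0.5 - 0.5 * prod


if __name__ == "__main__":
    # Hypothetical probabilities for a few samples near a clock edge.
    print(xor_prob_one([0.3, 0.6]))
    # A single perfectly balanced sample makes the XOR output unbiased.
    print(xor_prob_one([0.5, 0.9]))
```

Under independence, every additional sample with probability close to 1/2 shrinks the product and hence the bias of the output bit, which is why the number of such samples per edge is a useful design criterion.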
In the last part of the chapter we presented the results of experiments with a varying temperature of the chip containing the PLL-TRNG. As a result, we propose additional requirements on the generator design that need to be met in order to achieve robustness of the design. We can conclude that configurations with more than 10 highly critical samples per edge behave reliably even in a changing environment. The larger the changes in jitter amplitude, the larger the changes in the jitter histogram, and this has a direct negative impact on the statistical properties of the generated sequence.<br />
Figure 6 – 14 Comparison of probability histograms for the jitter measured at a temperature of 20 °C in TRNG with configurations A and B. The data were measured around the rising edge of the sampled clock waveform.<br />
Figure 6 – 15 Difference in the number of sampled ones for critical samples at the boundary temperatures −40 °C and +30 °C in TRNG with configurations A and B, around the rising edge of the sampled clock waveform.<br />
7 Research Contribution<br />
With this thesis we contributed to the field of hardware implementation of elements of public-key cryptographic systems. We discussed aspects of algorithm adaptation and system architectures for a modular multiplier and for cryptanalytic hardware. A randomness extraction method based on clock circuitry was evaluated and new findings were presented.<br />
The research contributions were achieved in the following topics:<br />
• Optimised Montgomery modular multiplier implementation in hardware.<br />
• The elliptic curve method implementation in hardware.<br />
• Evaluation of a random number generator based on clock circuitry in FPGAs.<br />
Optimised Montgomery modular multiplier implementation in hardware<br />
The two most popular public-key cryptographic algorithms, RSA and ECC, make extensive use of modular operations on large numbers. Modular multiplication (MM) can be a very slow operation when performed on general-purpose computers and can therefore benefit from acceleration by an effective hardware implementation.<br />
We analysed algorithms for Montgomery MM and architectures for their effective implementation suitable for reconfigurable hardware structures. We paid attention to preserving the scalability and parametrisation of the multiplier unit in the other parts of the system as well, and to finding an optimal division of the computational load between the software and hardware parts of the system. The results of the area occupation and timing analysis were presented after the application of hardware-software co-design.<br />
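For orientation, the principle of Montgomery MM can be sketched in a few lines of Python. This is the textbook bit-serial radix-2 form, not the scalable multiplier architecture analysed in the thesis; the function names and the choice R = 2^n with n equal to the bit length of the modulus are assumptions made for the example. The point of the method is visible in the loop: the reduction uses only additions and right shifts, with no trial division by the modulus.

```python
def mont_mul(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2^(-n) mod m for odd m < 2^n and 0 <= x, y < m."""
    s = 0
    for i in range(n):
        if (x >> i) & 1:      # add y when bit i of x is set
            s += y
        if s & 1:             # make s even by adding the odd modulus
            s += m
        s >>= 1               # exact division by 2
    if s >= m:                # single final subtraction, since s < 2m
        s -= m
    return s


def mod_mul(x: int, y: int, m: int) -> int:
    """Compute x * y mod m via the Montgomery domain (m odd, 0 <= x, y < m)."""
    n = m.bit_length()
    r2 = pow(2, 2 * n, m)           # R^2 mod m, precomputed once per modulus
    xm = mont_mul(x, r2, m, n)      # x * R mod m
    ym = mont_mul(y, r2, m, n)      # y * R mod m
    zm = mont_mul(xm, ym, m, n)     # x * y * R mod m
    return mont_mul(zm, 1, m, n)    # leave the Montgomery domain


if __name__ == "__main__":
    print(mod_mul(123456789, 987654321, 1000000007))
```

The conversions into and out of the Montgomery domain cost extra multiplications, so the method pays off in long chains of multiplications with the same modulus, as in modular exponentiation or elliptic curve arithmetic.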
The elliptic curve method implementation in hardware<br />
The security of the most widely applied public-key cryptographic algorithm, RSA, depends on the hardness of factoring large numbers. In the currently best known method for factoring large integers, the GNFS, one important step is the factorisation of mid-sized integers, for which the ECM is an efficient algorithm.<br />
The ECM algorithm is a classical example of an algorithm that can be significantly accelerated by special-purpose hardware. We provided a detailed description of an efficient ECM architecture especially suited for hardware implementation. The modular multiplier obtained as a result of the research described in the previous point forms the core element of the ECM unit and allows fast prototyping. For proof-of-concept purposes, we chose an architecture with an embedded controller and a dedicated coprocessor, designed by software-hardware co-design on an FPGA. We presented the area requirements of the system and the timings of the first published real hardware implementation.<br />
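As a rough illustration of how the ECM finds a factor, the following Python sketch implements stage 1 of a textbook variant on short Weierstrass curves in affine coordinates. It is not the hardware-oriented architecture described in the thesis: the curve family, the fixed base point (1, 1), the bounds and all function names are assumptions chosen for brevity. A factor of n is revealed whenever a modular inversion needed by the point arithmetic fails. The sketch requires Python 3.8 or later for the modular inverse via `pow`.

```python
from math import gcd


class FactorFound(Exception):
    """Raised when an inversion modulo n fails and exposes a divisor of n."""

    def __init__(self, factor):
        super().__init__(factor)
        self.factor = factor


def inv_mod(d, n):
    d %= n
    g = gcd(d, n)
    if g != 1:
        raise FactorFound(g)      # g divides n (g may equal n itself)
    return pow(d, -1, n)


def ec_add(p, q, a, n):
    """Affine addition on y^2 = x^3 + a*x + b over Z/nZ; None is infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % n == 0:
        return None
    if p == q:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, n) % n
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, n) % n
    x3 = (lam * lam - x1 - x2) % n
    y3 = (lam * (x1 - x3) - y1) % n
    return (x3, y3)


def ec_mul(k, p, a, n):
    """Double-and-add scalar multiplication k * p."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, p, a, n)
        k >>= 1
        if k:
            p = ec_add(p, p, a, n)
    return result


def ecm(n, bound=30, curves=50):
    """Try `curves` curves through (1, 1); return a nontrivial factor or None."""
    for a in range(1, curves + 1):
        b = -a                    # makes (1, 1) satisfy y^2 = x^3 + a*x + b
        g = gcd(4 * a**3 + 27 * b * b, n)
        if g == n:
            continue              # singular curve modulo every prime factor
        if g > 1:
            return g
        point = (1, 1)
        try:
            for k in range(2, bound + 1):   # overall multiplier is bound!
                point = ec_mul(k, point, a, n)
                if point is None:
                    break
        except FactorFound as hit:
            if hit.factor < n:
                return hit.factor
    return None


if __name__ == "__main__":
    print(ecm(455839))
```

Almost all of the running time is spent in multiplications modulo n, which is why a fast modular multiplier is the natural core of an ECM unit; practical implementations usually avoid the inversions shown here by working in projective coordinates.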
Evaluation of random number generator based on clock circuitry<br />
Random values play a crucial role in several areas of science. Depending on the field of application, the requirements on the parameters of the random sequence and on the generator itself may vary.<br />
We extended the already published results on the generator introduced in [60]. Our focus was on the analysis of the generator under changing working conditions and configuration settings. We presented the most compact solution, with one PLL circuit and a chain of delay elements, implemented in a low-cost FPGA. In another design we focused on achieving a high bit rate of the generated sequence; the final speed achieved by the generator was more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. We summarised the results of our experiments with a varying chip temperature and proposed additional requirements on the generator design that need to be met in order to achieve robustness of the design.<br />
Curriculum vitae<br />
Professional experience<br />
• Self-employed, Electronic Documents Laboratory – Team Leader (August 2008 – now).<br />
Projects related to PKI, biometrics and cryptography for Polish Security Printing Works (PWPW S.A.), Warsaw, Poland. System design and analysis, preparation of proof-of-concept systems.<br />
• Sentivision Polska, Warsaw, Poland, Senior Software Engineer (October 2006 – July 2008).<br />
Expert on Digital Rights Management implementations in embedded platforms for IPTV and VoD systems, cryptography-related applications and features. End-to-end implementation of the Marlin IPTV-ES DRM system in C, including server and client side. Technical project leader: close cooperation with the project manager, contact with customers, consulting and on-site support, maintenance and release of software.<br />
Stays abroad<br />
• Three-month research stay in the COSIC group at Katholieke Universiteit Leuven, Belgium – involved in the FP6 project “SCA Resistant Design”, analysis of side-channel attacks (2006)<br />
• Four-month research stay at Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France – analysis of TRNG embedded in Altera FPGAs, stochastic model of the generator (2005)<br />
• Four-month research stay at the Communication Security group, Ruhr-Universität Bochum, Germany – optimisation and implementation of ECM for factorisation on Xilinx FPGA (2004)<br />
• Four-month stay as an Erasmus student at the Higher Institute for Advanced Technologies of Saint-Etienne (ISTASE), Université Jean Monnet, Saint-Etienne, France – implementation of scalable MM, work with Altera Nios processor (2002)<br />
References<br />
[1] Actel Corporation. ProASICplus Evaluation Board, User’s guide, 2002.<br />
[2] Actel Corporation. Axcelerator Family PLL and Clock Management, Application Note, June 2003.<br />
[3] Actel Corporation. Using ProASICplus Clock Conditioning Circuits, Application Note, Dec. 2004.<br />
[4] Actel Corporation. ProASIC3(E) Flash Family FPGAs, Datasheet, Jan. 2005.<br />
[5] Actel Corporation. ProASICplus Flash Family FPGAs, ver. 5.3, May 2006.<br />
[6] Altera Corporation. Metastability in Altera Devices, ver. 4.0, May 1999.<br />
[7] Altera Corporation. ACEX 1K Programmable Logic Device Family, Data Sheet, Sept. 2001. ver. 3.3.<br />
[8] Altera Corporation. APEX 20K Programmable Logic Device Family, Data Sheet, Feb. 2002. ver. 4.3.<br />
[9] Altera Corporation. Avalon Bus Specification, Reference Manual, Jan. 2002. ver. 2.0.<br />
[10] Altera Corporation. Nios Embedded Processor Development Board, ver. 2.1, Apr. 2002.<br />
[11] Altera Corporation. Using PLLs in Stratix Devices, Feb. 2002. ver. 1.0.<br />
[12] Altera Corporation. Cyclone Device Handbook, Using PLLs in Cyclone Devices, Oct. 2003. ver. 1.2.<br />
[13] Altera Corporation. Cyclone Programmable Logic Device Family, Data Sheet, Mar. 2003. ver. 1.1.<br />
[14] Altera Corporation. Using the ClockLock & ClockBoost PLL Features in APEX Devices, Nov. 2003. Application Note 115, ver. 2.6.<br />
[15] Altera Corporation. Stratix Device Handbook, General-Purpose PLLs in Stratix & Stratix GX Devices, Sept. 2004. ver. 3.1.<br />
[16] Altera Corporation. Stratix EP1S25 DSP Development Board, Dec. 2004. ver. 1.6.<br />
[17] Altera Corporation. Cyclone II Device Handbook, PLLs in Cyclone II Devices, Feb. 2005. ver. 1.2.<br />
[18] Altera Corporation. Stratix Device Handbook, July 2005. ver. 3.4.<br />
[19] Altera Corporation. Stratix II Device Handbook, PLLs in Stratix II Devices, Mar. 2005. ver. 2.2.<br />
[20] Altera Corporation. Stratix II Device Handbook, Volume 2, Chapter 2, TriMatrix Embedded Memory Blocks in Stratix II & Stratix II GX Devices, Apr. 2006. ver. 4.2.<br />
[21] Altera Corporation. Stratix II Device Handbook, Volume 1, Chapter 5, DC & Switching Characteristics, May 2007. ver. 4.3.<br />
[22] AMI Semiconductors Company. XpressArray High Density 0.18 µm Structured ASIC.<br />
[23] ARM Limited. ARM7TDMI (Rev 3) – Technical Reference Manual. Available at http://www.arm.com/pdfs/DDI0029G_7TDMI_R3_trm.pdf, 2001.<br />
[24] Bagini, V., and Bucci, M. A design of reliable true random number generator for cryptographic applications. In Cryptographic Hardware and Embedded Systems – CHES’99 (Berlin, Germany, Aug. 1999), Ç. K. Koç and C. Paar, Eds., no. 1717 in Lecture Notes in Computer Science, Springer-Verlag, pp. 204–218.<br />
[25] Barrett, P. Implementing the Rivest, Shamir and Adleman public-key encryption algorithm on a standard digital signal processor. In Proceedings of CRYPTO’86 (1986), vol. 263 of Lecture Notes in Computer Science, pp. 311–323.<br />
[26] Baudet, M., Lubicz, D., Micolod, J., and Tassiaux, A. On the security of oscillator-based random number generators. Cryptology ePrint Archive, Report 2009/299, 2009. http://eprint.iacr.org/.<br />
[27] Bernstein, D. Circuits for Integer Factorization: A Proposal. Manuscript. Available at http://cr.yp.to/papers.html#nfscircuit, 2001.<br />
[28] Blum, L., Blum, M., and Shub, M. A simple unpredictable pseudo-random number generator. SIAM Journal on Computing 15 (1986), 364–383.<br />
[29] Blum, T., and Paar, C. Montgomery modular exponentiation on reconfigurable hardware. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic (Adelaide, Australia) (Los Alamitos, CA, April 1999), Koren and Kornerup, Eds., IEEE Computer Society Press, pp. 70–77.<br />
[30] Blum, T., and Paar, C. High radix Montgomery modular exponentiation on reconfigurable hardware. IEEE Transactions on Computers 50, 7 (2001), 759–764.<br />
[31] Bock, H., Bucci, M., and Luzzi, R. An offset-compensated oscillator-based random bit source for security applications. In Cryptographic Hardware and Embedded Systems – CHES 2004 (Berlin, Germany, 2004), M. Joye and J.-J. Quisquater, Eds., no. 3156 in Lecture Notes in Computer Science, Springer-Verlag, pp. 268–281.<br />
[32] Bosma, W. Primality testing using elliptic curves. Tech. Rep. 85-12, Mathematisch Instituut, Universiteit van Amsterdam, 1985.<br />
[33] Brent, R. P. Some Integer Factorization Algorithms Using Elliptic Curves. In Australian Computer Science Communications 8 (1986), pp. 149–163.<br />
[34] Brent, R. P. Factorization of the tenth Fermat number. Mathematics of Computation 68, 225 (1999), 429–451.<br />
[35] Brown, M., Hankerson, D., López, J., and Menezes, A. Software Implementation of the NIST Elliptic Curves Over Prime Fields. In Topics in Cryptology – CT-RSA 2001 (Berlin, April 2001), D. Naccache, Ed., vol. LNCS 2020, Springer-Verlag, pp. 250–265.<br />
[36] Bucci, M., and Luzzi, R. Design of testable random bit generators. In Cryptographic Hardware and Embedded Systems – CHES 2005 (Berlin, Germany, 2005), J. Rao and B. Sunar, Eds., no. 3659 in Lecture Notes in Computer Science, Springer-Verlag, pp. 147–156.<br />
[37] Bundesamt für Sicherheit in der Informationstechnik – BSI. Application Notes and Interpretation of the Scheme (AIS), AIS 31, Functionality Classes and Evaluation Methodology for Physical Random Number Generators, Sept. 2001.<br />
[38] Ç. K. Koç. RSA hardware implementation. Tech. rep., RSA Laboratories, RSA Data Security, Inc., Aug. 1995.<br />
[39] Ç. K. Koç, Acar, T., and Kaliski, Jr., B. S. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 16, 3 (June 1996), 26–33.<br />
[40] Chaitin, G. J. Algorithmic Information Theory. Cambridge University Press, 1987.<br />
[41] Daly, A., and Marnane, W. Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic. In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays FPGA’02 (Monterey, California, USA, Feb. 2002).<br />
[42] Davies, R. B. Exclusive OR (XOR) and hardware random number generators. Tech. rep., 2002.<br />
[43] Dichtl, M. How to Predict the Output of a Hardware Random Number Generator. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. K. Koç, and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, Springer-Verlag, pp. 181–188.<br />
[44] Dichtl, M., and Golić, J. D. High-speed true random number generation with logic gates only. In CHES ’07: Proceedings of the 9th international workshop on Cryptographic Hardware and Embedded Systems (Berlin, Heidelberg, 2007), vol. 4727 of Lecture Notes in Computer Science, Springer-Verlag, pp. 45–62.<br />
[45] Dixon, B., and Lenstra, A. Massively parallel elliptic curve factoring. In Advances in Cryptology - Eurocrypt ’92 (1993), R. Rueppel, Ed., vol. 658 of LNCS, Springer, Berlin, pp. 183–193.<br />
[46] Drutarovský, M., Fischer, V., and Šimka, M. Comparison of Two Implementations of Scalable Montgomery Coprocessor Embedded in Reconfigurable Hardware. In Proceedings of the XIX Conference on Design of Circuits and Integrated Systems – DCIS 2004 (Bordeaux, France, Nov. 24–26, 2004), pp. 240–245.<br />
[47] Drutarovský, M., Fischer, V., Šimka, M., and Celle, F. A Simple PLL-based True Random Number Generator for Embedded Digital Systems. Computing and Informatics 23, 5 (2004), 501–515.<br />
[48] Drutarovský, M., and Šimka, M. Cryptographic True Random Number Generator for Embedded Nios Processor. In Proceedings of 13th International Czech-Slovak Scientific Conference Radioelektronika (Brno, Czech Republic, May 6–7, 2003), pp. 268–371.<br />
[49] Drutarovský, M., and Šimka, M. Custom FPGA Cryptographic Blocks for Reconfigurable Embedded NIOS Processor. Acta Electrotechnica et Informatica 4, 2 (2004), 33–39.<br />
[50] Drutarovský, M., Šimka, M., and Fischer, V. Comparison of Scalable Montgomery Modular Multiplications Embedded in Reconfigurable Hardware. Acta Electrotechnica et Informatica 6, 2 (2006), 37–45.<br />
[51] Eldridge, S. E., and Walter, C. D. Hardware implementation of Montgomery’s modular multiplication algorithm. IEEE Trans. Comput. 42, 6 (1993), 693–699.<br />
[52] Epstein, M., Hars, L., Krasinski, R., Rosner, M., and Zheng, H. Design and implementation of a true random number generator based on digital circuit artifacts. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. K. Koç, and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, Springer-Verlag, pp. 152–165.<br />
[53] Fairfield, R. C., Mortenson, R. L., and Coulthart, K. B. An LSI random number generator (RNG). In Proceedings of CRYPTO 84 on Advances in cryptology (1985), Springer-Verlag New York, Inc., pp. 203–230.<br />
[54] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Data Encryption Standard, Jan. 1977. NIST FIPS PUB 46.<br />
[55] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Data Encryption Standard, Oct. 1999. NIST FIPS PUB 46-3.<br />
[56] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Digital Signature Standard, Jan. 2000. NIST FIPS PUB 186-2.<br />
[57] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Security Requirements for Cryptographic Modules, May 2001. NIST FIPS PUB 140-2.<br />
[58] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Advanced Encryption Standard (AES), 2001. NIST FIPS PUB 197.<br />
[59] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Secure Hash Standard, Aug. 2002. NIST FIPS PUB 180-2 + change notice to include SHA-224.<br />
[60] Fischer, V., and Drutarovský, M. True random number generator embedded in reconfigurable hardware. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2002 (Berlin, Germany, Aug. 13–15, 2002), B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523 of Lecture Notes in Computer Science, Springer-Verlag, pp. 415–430.<br />
[61] Fischer, V., Drutarovský, M., Šimka, M., and Bochard, N. High Performance True Random Number Generator in Altera Stratix FPLDs. In Field-Programmable Logic and Applications – FPL 2004 (Leuven, Belgium, Aug. 2004), J. Becker, M. Platzner, and S. Vernalde, Eds., vol. 3203 of Lecture Notes in Computer Science, Springer-Verlag, pp. 555–564.<br />
[62] Fischer, V., Drutarovský, M., Šimka, M., and Celle, F. Simple PLL-based True Random Number Generator for Embedded Digital Systems. In Proceedings of IEEE Design and Diagnostics of Electronic Circuits and Systems Workshop – DDECS 2004 (Stará Lesná, Slovakia, Apr. 18–21, 2004), pp. 129–136.<br />
[63] Franke, J., and Kleinjung, T. E-mail announcement. http://www.crypto-world.com/announcements/rsa200.txt, May 2005.<br />
[64] Franke, J., Kleinjung, T., Paar, C., Pelzl, J., Priplata, C., and Stahlke, C. SHARK – A Realizable Special Hardware Sieving Device for Factoring 1024-bit Integers. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2005, Edinburgh (August 2005), LNCS, Springer. To appear.<br />
[65] Franke, J., Kleinjung, T., Paar, C., Pelzl, J., Priplata, C., Šimka, M., and Stahlke, C. An efficient hardware architecture for factoring integers with the Elliptic Curve Method. In 1st Workshop on Special-purpose Hardware for Attacking Cryptographic Systems – SHARCS 2005 (Paris, France, Feb. 24–25, 2005), pp. 51–62.<br />
[66] Frolek, V. Implementation of asymmetric encryption algorithms in reconfigurable circuits. Master’s thesis, Technical University of Košice, Department of Electronics and Multimedia Communications, Jan.-May 2002.<br />
[67] Gaj, K., Kwon, S., Baier, P., Kohlbrenner, P., Le, H., Khaleeluddin, M., and Bachimanchi, R. Implementing the elliptic curve method of factoring in reconfigurable hardware. In Workshop on Special-purpose Hardware for Attacking Cryptographic Systems – SHARCS 2006 (Cologne, Germany, Apr. 03–04, 2006).<br />
[68] Gennaro, R. Randomness in cryptography. IEEE Security and Privacy 4, 2 (2006), 64–67.<br />
[69] Goldberg, I., and Wagner, D. Randomness and the Netscape browser. Dr. Dobb’s Journal (Jan. 1996), 66–70.<br />
[70] Golic, J. New methods for digital generation and postprocessing of random data. IEEE Transactions on Computers 55, 10 (2006), 1217–1229.<br />
[71] Gura, N., Chang, S., Eberle, H., Gupta, S., Gupta, V., Finchelstein, D., Goupy, E., and Stebila, D. An End-to-End Systems Approach to Elliptic Curve Cryptography. In Cryptographic Hardware and Embedded Systems – CHES 2002 (2002), Ç. K. Koç and C. Paar, Eds., vol. LNCS 2523, Springer, pp. 349–365.<br />
[72] Huang, M., Gaj, K., Kwon, S., and El-Ghazawi, T. An optimized hardware architecture for the Montgomery Multiplication Algorithm. In PKC 2008: 11th International Workshop on Practice and Theory in Public Key Cryptography, Barcelona, Spain (March 2008), pp. 214–228.<br />
[73] Jun, B., and Kocher, P. The Intel random number generator. White paper prepared for Intel Corporation, Cryptography Research, Inc., http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf, Apr. 1999.<br />
[74] Killmann, W., and Schindler, W. A proposal for: Functionality Classes and Evaluation Methodology for True (Physical) Random Number Generators, Sept. 2001.<br />
[75] Kinniment, D., and Chester, E. Design of an on-chip random number generator using metastability. In Proceedings of the 28th European Solid-State Circuit Conference (Sept. 2002), Univ. Bologna, Italy, pp. 595–598.<br />
[76] Knuth, D. E. Seminumerical Algorithms, second ed., vol. 2 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, Jan. 10, 1981.<br />
[77] Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 48, 177 (Jan. 1987), 203–209.<br />
[78] Koblitz, N., Menezes, A., and Vanstone, S. The state of elliptic curve cryptography. Designs, Codes and Cryptography 19, 2-3 (Mar. 2000), 173–193.<br />
[79] Kohlbrenner, P., and Gaj, K. An embedded true random number generator for FPGAs. In Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays (2004), ACM Press, pp. 71–78.<br />
[80] Lenstra, A. K. Designs, Codes and Cryptography. Kluwer Academic Publishers, Boston, 2000, ch. Integer Factoring.<br />
[81] Lenstra, A. K., and H. W. Lenstra, J., Eds. The Development of the<br />
Number Field Sieve. Lecture Notes <strong>in</strong> Math. Volume 1554. Spr<strong>in</strong>ger, 1993.<br />
[82] Lenstra, H. W. Factor<strong>in</strong>g Integers with Elliptic Curves. Annals of Mathe-<br />
matics 126, 2 (1987), 649–673.<br />
[83] Lim, D., Ranas<strong>in</strong>ghe, D. C., Devadas, S., Jamali, B., Abbott, D.,<br />
and Coleb, P. H. Exploit<strong>in</strong>g metastability and thermal noise to build a<br />
re-configurable hard<strong>ware</strong> random number generator. In Noise <strong>in</strong> Devices and<br />
Circuits III; Proceed<strong>in</strong>gs of SPIE (Texas, USA, May 2005), vol. 5844, pp. 294–<br />
309.<br />
[84] MacKay, D. J. C. Introduction to Monte Carlo methods. In Learn<strong>in</strong>g <strong>in</strong><br />
Graphical Models, M. I. Jordan, Ed., NATO Science Series. Kluwer Academic<br />
Press, 1998, pp. 175–204.<br />
[85] McIvor, C., McLoone, M., McCanny, J., Daly, A., and Marnane,<br />
W. Fast montgomery modular multiplication and rsa cryptographic proces-<br />
sor architectures. In 37th IEEE Computer Society Asilomar Conference on<br />
Signals, Systems and Computers (Monterey, USA, Nov. 2003), pp. 379–384.<br />
[86] Menezes, J. A., Oorschot, P. C., and Vanstone, S. A. Handbook of<br />
Applied Cryptography. CRC Press, New York, Oct. 1996.<br />
[87] Miller, V. S. Use of elliptic curves in cryptography. In Lecture notes<br />
in computer sciences; 218 on Advances in cryptology – CRYPTO 85 (1986),<br />
Springer-Verlag New York, Inc., pp. 417–426.<br />
[88] Montgomery, P. Modular Multiplication without Trial Division. Mathematics<br />
of Computation 44, 170 (April 1985), 519–521.<br />
[89] Montgomery, P. Speeding up the Pollard and elliptic curve methods of<br />
factorization. Mathematics of Computation 48 (1987), 243–264.<br />
[90] NEC Corporation. Preliminary User’s Manual System-on-Chip Lite, Development<br />
Board, Hardware, Document No. A15650EE1V0UM00, July 2001.<br />
Available at http://www.ee.nec.de/_pdf/A15650EE1V0UM00.PDF.<br />
[91] Federal Information Processing Standards, National Institute of<br />
Standards and Technology, U.S. Department of Commerce. Recommendation<br />
for Key Management, Part 1 - General, Aug. 2005.<br />
[92] Orlando, G., and Paar, C. A Scalable GF(p) Elliptic Curve Processor Architecture<br />
for Programmable Hardware. In Workshop on Cryptographic Hardware<br />
and Embedded Systems – CHES 2001 (May 14-16, 2001), Ç. K. Koç,<br />
D. Naccache, and C. Paar, Eds., vol. LNCS 2162, Springer, pp. 348–363.<br />
[93] Pavelka, P., Galajda, P., and Fischer, V. Crypto FPGA a step towards<br />
a new class of flexible security devices. In Radioelektronika 2005: 15th<br />
international Czech-Slovak scientific conference (Brno, Czech Republic, May<br />
2005), University of Technology, pp. 397–400.<br />
[94] Pelzl, J., Šimka, M., Kleinjung, T., Franke, J., Priplata, C.,<br />
Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Area–time<br />
efficient hardware architecture for factoring integers with the elliptic curve<br />
method. IEE Proceedings - Information Security 152, 1 (2005), 67–78.<br />
[95] Pollard, J. A Monte Carlo Method for Factorization. Nordisk Tidskrift for<br />
Informationsbehandlung (BIT) 15 (1975), 331–334.<br />
[96] Rivest, R. L., Shamir, A., and Adleman, L. A Method for Obtaining<br />
Digital Signatures and Public-Key Cryptosystems. Communications of the<br />
ACM 21, 2 (February 1978), 120–126.<br />
[97] Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh,<br />
S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J.,<br />
and Vo, S. A Statistical Test Suite for Random and Pseudorandom Number<br />
Generators for Cryptographic Applications. NIST Special Publication 800-22.<br />
(revised May 15, 2002).<br />
[98] Santoro, R., Sentieys, O., and Roy, S. On-line monitoring of random<br />
number generators for embedded security. In IEEE International Symposium<br />
on Circuits and Systems – ISCAS 2009 (2009), pp. 3050–3053.<br />
[99] Schaumont, P., and Ching, D. Gezel. Available at<br />
http://rijndael.ece.vt.edu/gezel2.<br />
[100] Schindler, W. Efficient Online Tests for True Random Number Generators.<br />
In Workshop on Cryptographic Hardware and Embedded Systems –<br />
CHES 2001 (Berlin, Germany, May 13–16, 2001), Ç. K. Koç, D. Naccache,<br />
and C. Paar, Eds., vol. 2162 of Lecture Notes in Computer Science, Springer-Verlag,<br />
pp. 103–117.<br />
[101] Schneier, B. Applied Cryptography: Protocols, Algorithms, and Source Code<br />
in C, 2nd ed. John Wiley & Sons, Inc., New York, 1996.<br />
[102] Secretariat National Committee for Information Technology<br />
Standardization. Fibre Channel - Methodologies for Jitter Specification,<br />
T11.2 / Project 1230/ Rev 10, June 1999.<br />
[103] Shamir, A., and Tromer, E. Factoring Large Numbers with the TWIRL<br />
Device. In Advances in Cryptology – Crypto 2003 (2003), vol. 2729 of LNCS,<br />
Springer, pp. 1–26.<br />
[104] Silverman, R. D. The multiple polynomial quadratic sieve. Mathematics of<br />
Computation 48 (1987), 329–340.<br />
[105] Sunar, B., Martin, W. J., and Stinson, D. R. A provably secure true<br />
random number generator with built-in tolerance to active attacks. IEEE<br />
Transactions on Computers 56, 1 (2007), 109–119.<br />
[106] Tang, K., Siegel, P. H., and Milstein, L. B. A comparison of long<br />
versus short spreading sequences in coded asynchronous DS-CDMA systems.<br />
IEEE Journal on Selected Areas in Communications 19, 8 (Aug. 2001),<br />
1614–1624.<br />
[107] Tektronix. A Guide to Understanding and Characterizing Timing Jitter.<br />
[108] Tenca, A. F., and Koç, Ç. K. A scalable architecture for Montgomery<br />
multiplication. In Cryptographic Hardware and Embedded Systems (Berlin,<br />
Germany, 1999), Ç. K. Koç and C. Paar, Eds., no. 1717 in Lecture Notes in<br />
Computer Science, Springer-Verlag, pp. 94–108.<br />
[109] Tenca, A. F., and Koç, Ç. K. A scalable architecture for modular multiplication<br />
based on Montgomery’s algorithm. IEEE Transactions on Computers<br />
52, 9 (Sept. 2003), 1215–1221.<br />
[110] Tenca, A. F., Todorov, G., and Koç, Ç. K. High-radix design of a<br />
scalable modular multiplier. In Cryptographic Hardware and Embedded Systems<br />
– CHES 2001 (Berlin, Germany, May 2001), Ç. K. Koç, D. Naccache,<br />
and C. Paar, Eds., no. 2162 in Lecture Notes in Computer Science, Springer-Verlag,<br />
pp. 189–205.<br />
[111] Tkacik, T. E. A hardware random number generator. In Workshop on Cryptographic<br />
Hardware and Embedded Systems – CHES 2002 (Berlin, Germany,<br />
Aug. 13–15, 2002), B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523<br />
of Lecture Notes in Computer Science, Springer-Verlag, pp. 450–453.<br />
[112] Tsoi, K., Leung, K., and Leong, P. Compact FPGA-based true and<br />
pseudo random number generators. In Proceedings of the IEEE Symposium on<br />
Field-Programmable Custom Computing Machines (FCCM), California USA<br />
(2003), pp. 51–61.<br />
[113] Šimka, M., and Drutarovský, M. Montgomery Multiplication Coprocessor<br />
on Reconfigurable Logic. In Proceedings of 13th International Czech-Slovak<br />
Scientific Conference Radioelektronika (Brno, Czech Republic, May<br />
6–7, 2003), pp. 95–98.<br />
[114] Šimka, M., Drutarovský, M., and Fischer, V. Embedded True Random<br />
Number Generator in Actel FPGAs. In Workshop on Cryptographic<br />
Advances in Secure Hardware – CRASH 2005 (Leuven, Belgium, Sept. 6–7,<br />
2005).<br />
[115] Šimka, M., Drutarovský, M., and Fischer, V. Randomness Extraction<br />
Method Based on Rationally Related Clock Signals. In Proceedings of<br />
the DSP-MCOM 2005, The 6th International Conference on Digital Signal<br />
Processing and Multimedia Communications (Košice, Slovakia, Sept. 13–14,<br />
2005), pp. 190–193.<br />
[116] Šimka, M., Drutarovský, M., and Fischer, V. Performance of PLL-based<br />
True Random Number Generator in changing working conditions (Submitted).<br />
Acta Electrotechnica et Informatica (2010).<br />
[117] Šimka, M., and Fischer, V. Montgomery Multiplication Coprocessor for<br />
Altera Nios Embedded Processor. In Proceedings of Electronic Computers and<br />
Informatics (Herľany, Slovakia, Oct. 2002), pp. 206–211.<br />
[118] Šimka, M., Fischer, V., and Drutarovský, M. Hardware-Software<br />
Codesign in Embedded Asymmetric Cryptography Application – a Case<br />
Study. In Field-Programmable Logic and Applications – FPL 2003 (Lisbon,<br />
Portugal, Sept. 2003), P. Y. Cheung, G. A. Constantinides, and J. T.<br />
de Sousa, Eds., vol. 2778 of Lecture Notes in Computer Science, Springer-Verlag,<br />
pp. 1075–1078.<br />
[119] Šimka, M., Fischer, V., Drutarovský, M., and Fayolle, J. Model<br />
of a true random number generator aimed at cryptographic applications. In<br />
Proceedings of the International Symposium on Circuits and Systems – ISCAS<br />
2006 (Island of Kos, Greece, May 21–24, 2006), pp. 5619–5623.<br />
[120] Šimka, M., Pelzl, J., Kleinjung, T., Franke, J., Priplata, C.,<br />
Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Hardware<br />
Factorization Based on Elliptic Curve Method. In FCCM – IEEE Symposium<br />
on Field-Programmable Custom Computing Machines (Napa Valley,<br />
California, Apr. 17–20, 2005).<br />
[121] Walker, S., and Foo, S. Evaluating metastability in electronic circuits<br />
for random number generation. In Proceedings of the IEEE Computer Society<br />
Workshop on VLSI 2001 (WVLSI ’01) (2001), IEEE Computer Society, p. 99.<br />
[122] Wollinger, T., and Paar, C. How secure are FPGAs in cryptographic<br />
applications? (long version). Cryptology ePrint Archive, Report 2003/119,<br />
2003.<br />
[123] Wolski, E., Filho, J. G. S., and Dantas, M. A. R. Parallel Implementation<br />
of Elliptic Curve Method for Integer Factorization Using Message-Passing<br />
Interface (MPI). In SBAC-PAD 13th Symposium on Computer Architecture<br />
and High-Performance, 2001, Pirenopolis (2001).<br />
[124] Xilinx Corporation. Virtex-E 1.8V Field Programmable Gate Arrays –<br />
Production Product Specification.<br />
[125] Xilinx Corporation. Superior Jitter Management with DLLs ver.1.2,<br />
Virtex Tech Topic VTT013 ed., Jan. 2003.<br />
[126] Xilinx Corporation. Using the Virtex Delay-Locked Loop ver.2.8, Application<br />
Note 132: Virtex Series ed., Jan. 2006.<br />
[127] Xilinx Corporation. Using Delay-Locked Loops in Spartan-II/IIE FPGAs<br />
ver.1.2, Application Note 174 ed., June 2008.<br />
[128] Zimmermann, P. ECMNET page. Available at<br />
http://www.loria.fr/~zimmerma/records/ecmnet.html.<br />