1 Montgomery Modular Multiplication in Hardware


Technical University of Košice

Faculty of Electrical Engineering and Informatics

Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

2010 Martin Šimka


Technical University of Košice

Faculty of Electrical Engineering and Informatics

Department of Electronics and Multimedia Communications

Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

Montgomery Modular Multiplier and True Random Number Generator

Doctoral Thesis

Discipline: 26-13-9 Electronics

Department: Department of Electronics and Multimedia Communications (FEI)

Supervisor: doc. Ing. Miloš Drutarovský, PhD.

Consultant: prof. Ing. Viktor Fischer, PhD.

Košice 2010 Martin Šimka


Metadata Sheet

Author: Martin Šimka

Thesis title: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs

Subtitle: Montgomery Modular Multiplier and True Random Number Generator

Language: English

Type of Thesis: Doctoral Thesis

Number of Pages: 126

Degree: PhD.

University: Technical University of Košice

Faculty: Faculty of Electrical Engineering and Informatics (FEI)

Department: Department of Electronics and Multimedia Communications (KEMT)

Discipline: 26-13-9 Electronics

Town: Košice, Slovakia

Supervisor: doc. Ing. Miloš Drutarovský, PhD.

Consultant(s): prof. Ing. Viktor Fischer, PhD.

Date of Submission: 2. 8. 2010

Date of Defence: 9. 2010

Keywords: modular multiplication, elliptic curve method, factorisation, random number generator

Category Conspectus: Technika, technológia, inžinierstvo; Elektronika

Thesis Citation: Martin Šimka: Analysis and Implementation of Selected Blocks for Public-Key Cryptosystems in FPGAs. Košice: Technical University of Košice, Faculty of Electrical Engineering and Informatics. 2010. 126 pages

Title SK: Analýza a implementácia vybraných blokov pre kryptografické systémy s verejným kľúčom

Subtitle SK: Montgomeryho modulárna násobička a generátor skutočne náhodných čísel

Keywords SK: modulárne násobenie, metóda eliptických kriviek, faktorizácia, generátor náhodných čísel


Abstract in English

In this thesis we deal with two elementary blocks used in public-key cryptosystems: the first is a modular multiplier for very long operands, the second is a random number generator. Both blocks are designed for a programmable target platform (FPGA devices), which allows quick prototyping of the proposed systems.

Our main goal in the case of the multiplier is to achieve a scalable and parametrised solution that is easily portable and adaptable to the final target platform and to the processed data. Note that, because of the required high flexibility of the solution, the achieved clock speed is lower than that of a dedicated design focused on speed. On the other hand, our solution is well suited to prototyping and proof-of-concept designs. In the thesis we analyse algorithm improvements in relation to the technical features of the chosen FPGA families. The resulting universal arithmetic solution needs to be complemented with an equally universal interface in order to connect a control unit. As a result we obtained a building block, the multiplier, for application in cryptographic and cryptanalytic systems. For the multiplier it is possible to choose the occupied physical area, the computational time and the size of operands.
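The multiplier summarised above is a Montgomery modular multiplier, as the thesis subtitle states. As a rough illustration only, the following Python sketch shows the textbook bit-serial (radix-2) form of Montgomery multiplication, in which the operand length k is a free parameter. It is a software model under our own assumptions and not the hardware design of the thesis: the function names `mont_mul_radix2`, `to_mont` and `from_mont` are hypothetical, and the word-by-word processing that makes a hardware multiplier scalable is not modelled.

```python
def mont_mul_radix2(x, y, m, k):
    """Return x * y * 2^(-k) mod m for an odd modulus m < 2^k and 0 <= x, y < m.

    Textbook radix-2 Montgomery multiplication: one bit of x is consumed
    per iteration, so k iterations are needed for k-bit operands.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    if not (0 <= x < m and 0 <= y < m and m < (1 << k)):
        raise ValueError("operands must satisfy 0 <= x, y < m < 2^k")
    s = 0  # partial sum
    for i in range(k):
        x_i = (x >> i) & 1      # i-th bit of the multiplier
        s += x_i * y            # add the multiplicand if the bit is set
        if s & 1:               # make the partial sum even ...
            s += m              # ... by adding the (odd) modulus
        s >>= 1                 # exact division by 2
    if s >= m:                  # partial sum stays below 2m, one subtraction suffices
        s -= m
    return s


def to_mont(a, m, k):
    """Map a into the Montgomery domain: a * 2^k mod m."""
    return (a << k) % m


def from_mont(a, m, k):
    """Map a back from the Montgomery domain: a * 2^(-k) mod m."""
    return mont_mul_radix2(a, 1, m, k)


if __name__ == "__main__":
    m, k = 97, 7                # small illustrative modulus, 97 < 2^7
    a, b = 5, 7
    p = mont_mul_radix2(to_mont(a, m, k), to_mont(b, m, k), m, k)
    print(from_mont(p, m, k), (a * b) % m)  # both values should agree
```

In this form the only parameter is the operand length k; trading area against computation time, as described in the abstract, would additionally require splitting the operands into words, which this sketch leaves out.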

The second area we deal with is the generation of random numbers in the digital environment of integrated circuits. A random number generator (RNG) is the only cryptographic element for which there are no generally applied algorithms. The main reason is that the harvesting mechanism of an RNG is tightly bound to the target platform. Physical sources of randomness are very limited in digital devices. In addition, we deal with the problematic issue of randomness testing. We analyse the chosen RNG design under changing chip temperature. Finally, the proposed stochastic model of the generator allows a better understanding of its principle.

Abstract in Slovak

V dizertačnej práci sa zaoberáme dvoma elementárnymi blokmi používanými v kryptografických systémoch s verejným kľúčom – prvým je násobička pre operácie s veľkými číslami, druhým je generátor náhodných čísel. Oba bloky sú realizované v technológii hradlových polí (obvody typu FPGA), čo umožňuje vytvorenie prototypu vo veľmi krátkom čase.

Našim hlavným cieľom v prípade násobičky je realizácia ľahko parametrizovateľného a škálovateľného riešenia, ktoré umožňuje prispôsobenie architektúry podľa cieľovej platformy a vlastností spracúvaných dát. Treba poznamenať, že dôsledkom flexibility riešenia je nižšia dosahovaná rýchlosť výpočtov. Na druhej strane, takéto riešenie je ideálne v prípade realizácie prototypov a návrhov, ktoré majú potvrdiť navrhovaný koncept riešenia. V práci sa zaoberáme prispôsobením štruktúry násobičky k architektúre cieľovej platformy vybraných rodín hradlových polí. Získané univerzálne riešenie je potrebné vybaviť rovnako univerzálnym rozhraním, ktoré umožní prepojenie výpočtovej jednotky ku rôznorodým typom riadiacich jednotiek. Ako výsledok sme získali stavebný prvok kryptografických a kryptoanalytických systémov, pre ktorý je možné zvoliť veľkosť obsadenej plochy na cieľovej platforme, rýchlosť vykonávanej operácie násobenia a veľkosť akceptovaných parametrov.

Druhou oblasťou, ktorou sa v práci zaoberáme je oblasť generovania náhodných postupností v prostredí číslicových integrovaných obvodov. Generátor náhodných čísel (RNG) je jediným prvkom kryptografických systémov, ktorého princíp nie je daný medzinárodným štandardom. Hlavným dôvodom je to, že spôsob získavania náhodných hodnôt je striktne závislý od cieľovej platformy pre implementáciu generátora. Fyzické zdroje entrópie použiteľné v číslicových obvodoch majú obmedzené možnosti, k čomu sa ešte pripája problematika testovania náhodnosti výstupnej postupnosti. Vybraný generátor analyzujeme z hľadiska jeho správania v meniacich sa tepelných podmienkach súčiastky, v ktorej je umiestnený. Predstavený stochastický model generátora približuje podstatu princípu generovania náhodnej postupnosti.


Declaration

I hereby declare that this thesis is my own work and effort. Where other sources of information have been used, they have been acknowledged.

Košice 2. 8. 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . .

Signature


Acknowledgement

There are several persons who contributed to the research results published in the thesis and to the fact that I can submit the thesis for defence.

I am very grateful to my advisor Miloš Drutarovský for guiding me all along my research, for his effort and dedication, and for all the time he found for me. I want to thank my special advisor Prof. Viktor Fischer for his great advice, support and ideas for research, and for making my research stay in France possible. I would like to express my gratitude to Prof. Dušan Levický for help in tough situations during my stay at the department he leads.

Big thanks go to Nathalie Bochard and Frédéric Celle for very good cooperation and help regarding FPGA design. I was glad to meet Corinne Fournier and Loïc Denis, who made my weekends very enjoyable. Thanks to all members of the Hubert Curien Laboratory in Saint-Etienne; I had a nice time with you.

I would like to thank all my colleagues from the COSY group, especially Jan Pelzl for very fruitful joint work on the hardware implementation of ECM. I am grateful to Prof. Christof Paar, who allowed me to work in his research group and gain such priceless experience. Thanks to Sandeep Kumar, Andy Rupp and Axel Poschmann for a great time in Bochum, spent on research, but not only. Special thanks go to Irmgard Kühn for making my contact with all the bureaucracy much easier.

From the COSIC group I would like to thank Prof. Ingrid Verbauwhede and Prof. Bart Preneel for making it possible to join their team in Leuven. Thanks to Lejla Batina and Elke De Mulder for incorporating me in side-channel attack research, and to all members of COSIC for creating a great atmosphere there.

I want to thank my family for their encouragement and support, and especially my sister Katka for all our inspiring discussions.

Most importantly, I thank my dear Kasia for her endless love and patience.

Thanks to all of you!

Martin


Preface

Systems for public-key cryptography are intensively applied in order to digitally sign or encrypt data. In this way we assure the integrity and confidentiality of the signed message and provide authentication and non-repudiation features for the signer. The complexity of the computations has an impact on the performance of the system, especially in the case of long keys. The security of the operations is based on the secrecy of the private key, while its public part and the algorithm itself are publicly known.

In the first part of the thesis we analyse the computational part of the systems and focus on a flexible implementation of a modular multiplier. The output of this research was applied to estimate how much the performance of the Elliptic Curve Method (ECM) increases thanks to its hardware realisation. The scalable nature of the multiplier was carried through the whole design, and the proof-of-concept implementation was designed and tested in a very short time.

In the second part of the document we focus on the key-generating element, a Random Number Generator (RNG). An already known design was analysed from several aspects, and we provide results in the form of a stochastic model of the RNG and proposed testing methods suitable for this type of RNG.

The target platform for the selected building blocks of cryptosystems is the FPGA (Field Programmable Gate Array), which offers a reduction of development time, a wide range of devices and a high level of security. In the thesis we analyse particular families of devices from FPGA vendors which include dedicated electronic elements used in our designs. Parameters of the blocks and algorithm improvements may have a significant impact on the performance of the system.

The three topics of the thesis give a picture of the level of complexity in cryptology and underline the relevance of research in the area of cryptographic system implementation.


Contents

Introduction

1 Montgomery Modular Multiplication in Hardware - preliminaries

1.1 Implementation Platforms

1.2 RSA Algorithm

1.2.1 Modular Exponentiation and Multiplication

1.2.2 Hardware Implementations of the MMM

1.3 EC in Cryptography

1.4 Conclusions

2 Montgomery Modular Multiplication in Hardware

2.1 Scalable MMM design

2.1.1 Scalable Multiple-Word Algorithms

2.1.2 Comparison of Implementation Approaches

2.2 Multiplier Architecture

2.2.1 Adder Concepts

2.2.2 Memory Block

2.2.3 Interface to Controller

2.3 Implementation of the MMM

2.3.1 Comparison of CSA and CPA PE

2.3.2 Montgomery Multiplication Coprocessor

2.3.3 Hardware-Software Co-design of MMM: a Case Study

2.3.4 Implementation Results

2.4 Conclusions and Future Work

3 Elliptic Curve Method in Hardware - preliminaries

3.1 Integer Factoring

3.1.1 Factoring Algorithms

3.1.2 Motivation for Hardware Implementation

3.2 Previous Implementations of ECM

3.3 Mathematical Background

3.3.1 Pollard's (p − 1)-algorithm

3.3.2 ECM Algorithm

4 Elliptic Curve Method in Hardware

4.1 Parameterisation of the ECM Algorithm

4.1.1 Phase 1

4.1.2 Phase 2

4.2 Design of the ECM Unit

4.2.1 Control Logic

4.2.2 Memory Management

4.2.3 Choice of the Arithmetic Algorithms

4.2.4 Parallelization of the Algorithm

4.3 Implementation of the ECM Unit

4.3.1 Hardware Platform

4.3.2 Results

4.3.3 ECM-Based Acceleration of GNFS: a Case Study

4.4 Conclusions and Future Steps

5 True Random Number Generator - preliminaries

5.1 Randomness

5.1.1 Definitions of Randomness

5.1.2 Random Number Generator

5.1.3 Applications of Random Numbers

5.2 TRNG Implementations in Digital Systems

5.2.1 Sources of Randomness

5.2.2 Survey of Designs Based on Jitter

5.3 PLL-Based TRNG on FPGA

5.3.1 Randomness Extraction Method

5.3.2 Coherent Sampling

5.4 Testing of TRNGs

5.5 Attacks against TRNG

5.6 Conclusions

6 True Random Number Generator

6.1 Clock Synthesis in FPGAs

6.1.1 PLL as Source of Randomness

6.2 PLL-Based TRNG on FPGA

6.2.1 PLL Configurations

6.2.2 Analysis of TRNG in Altera Stratix FPGAs

6.2.3 Analysis of TRNG in Actel FPGAs

6.2.4 Stochastic Model of PLL-TRNG

6.3 Active Non-Invasive Attack on TRNG

6.3.1 Attack description

6.3.2 Measurements results

6.4 Conclusions and Further Research

7 Research Contribution

Bibliography


List of Figures

1 – 1 Typical architecture of the smallest functional unit in an FPGA.

1 – 2 RSA encryption scheme when A sends an encrypted message to B. First A receives B's public key upon request; afterwards A encrypts a message X using B's public key, $Y = X^E \bmod M$. Finally B decrypts the received message Y using its own private key, $X = Y^D \bmod M$.

2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU connected by a w-bit data-path

2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs

2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA

2 – 4 Block diagram of the CPA-based w-bit MWR2MM processing element (CPA PE) based on FA

2 – 5 Pipelined organization of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory

2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits

2 – 7 Proposed universal interface for the MMM coprocessor

4 – 1 Architecture of the ECM unit

4 – 2 Organisation of the ECM unit's memory registers for 21 variables with e words of width w

4 – 3 Scalable addition and subtraction unit for operands with word width w

5 – 1 Schematic diagram of a TRNG with designation of internal signals and interfaces

5 – 2 Illustration of stable states (0 and 1) and undefined metastable state

5 – 3 Timing jitter in clock signal

5 – 4 Ring oscillator structures proposed by Golić.

5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of the output sequence.

5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge of the CLK signal (illustrated for $K_M = 5$ and $K_D = 7$)

6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]

6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits

6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div)

6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs

6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000 periods $T_Q$

6 – 6 Block diagram of design for on-chip samples reordering

6 – 7 Reordered samples from generator measured by oscilloscope

6 – 8 Sampled waveform of a clock signal for TRNG for configuration A for temperatures in the range −40 °C to +30 °C.

6 – 9 Sampled waveform of a clock signal for TRNG for configuration B for temperatures in the range −40 °C to +32 °C.

6 – 10 Amount of sampled ones during 1000 sampling periods for TRNG with configuration A (detail of the rising edge).

6 – 11 Amount of sampled ones during 1000 sampling periods for TRNG with configuration B, with low-pass loop filter (detail of the rising edge).

6 – 12 Amount of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration A.

6 – 13 Amount of sampled ones during 1000 sampling periods according to temperature for chosen sample positions in TRNG with configuration B.

6 – 14 Comparison of probability histograms for the jitter measured at temperature 20 °C in TRNG with configuration A and B. Data were measured around the rising edge of the sampled clock waveform.

6 – 15 Difference in number of sampled ones for critical samples at boundary temperatures −40 °C and +30 °C in TRNG with configuration A and B around the rising edge of the sampled clock waveform.


List of Tables

1 – 1 Comparison of the key length (in bits) for equivalent security level for public-key cryptosystems

2 – 1 Address of operands from host processor level (LSB right)

2 – 2 PE sizes and speeds for old style Altera FPGAs

2 – 3 PE sizes and speeds for new style Altera FPGAs

2 – 4 Area occupation in number of LEs and maximal clock frequency ($f_{clkMMM}$) (MHz) of the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm

2 – 5 Execution times of software implementation of MMM on Altera Nios development board (with APEX EP20K200 clocked at 50 MHz)

2 – 6 Execution times of mixed hardware-software implementation of MMM on Altera Nios development board (with APEX EP20K200) for the CSA PE

2 – 7 Execution times of mixed hardware-software implementation of the MMM on Altera Nios development board (with APEX EP20K200) for the CPA PE

4 – 1 Computational complexity and memory requirements for phase 2 depending on D

4 – 2 A command syntax for the ECM unit (LSB left)

4 – 3 Running Times of the ECM Implementation (198-bit modulus), p = 2, w = 32 (Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)

6 – 1 Parameters of PLL embedded in Altera FPGAs

6 – 2 Parameters of PLL embedded in Actel FPGAs

6 – 3 Parameters settings for different TRNG configurations

6 – 4 Configuration parameters of tested TRNG

6 – 5 Results of quality evaluation of tested TRNG configurations

6 – 6 Achievable sensitivity on jitter using two clock signals in Actel ProASICplus ($F_{CLI}$ = 40 MHz)

6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICPlus

6 – 8 Mean values measured using the stochastic model $E[p_i]$ and the output sequence of the TRNG $m = E[x(nT_Q)]$

6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples influenced by the jitter at different chip temperatures


List of Algorithms

1 – 1 Montgomery exponentiation algorithm [86]; the definition of $M'$ requires that gcd(M, R) = 1; b denotes the base or radix.

1 – 2 The Montgomery modular multiplication algorithm for k-bit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, Y, and M

1 – 3 The basic radix-2 Montgomery multiplication algorithm for k-bit operands $X = (x_{k-1}, \ldots, x_1, x_0)$, Y, and M

1 – 4 Optimized radix-2 Montgomery multiplication algorithm

1 – 5 Key generation in ECC [78]

1 – 6 Message signing in ECC [78]

2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm

2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm

3 – 1 Elliptic Curve Method

3 – 2 Exponentiation for Curves in Montgomery Form

4 – 1 Modified MWR2MM algorithm

4 – 2 Modular addition

4 – 3 Modular subtraction


List of Symbols and Abbreviations<br />

A (x) the x th word of vector A<br />

Ax..y particular range of bits <strong>in</strong> a vector A from position x to position y<br />

A (y)<br />

x<br />

bit position of the y th word of A<br />

B bound of smoothness<br />

D a parameter <strong>in</strong> improved standard cont<strong>in</strong>uation of ECM<br />

DCLJ divid<strong>in</strong>g factor for CLJ clock signal<br />

DCLK divid<strong>in</strong>g factor for CLK clock signal<br />

FCLJ frequency of CLJ clock signal<br />

FCLK frequency of CLK clock signal<br />

KD decimation factor of CLK clock signal<br />

KM decimation factor of CLJ clock signal<br />

M modulus<br />

MCLJ multiplication factor for CLJ clock signal<br />

MCLK multiplication factor for CLK clock signal<br />

S partial sum<br />

TQ<br />

time period of bit generation<br />

TCLJ time period of CLJ clock signal<br />

TCLK time period of CLK clock signal<br />

X nultiplier<br />

Y multiplicand<br />

φ canonical homomorphism<br />

φ() Euler tontien function


FEI KEMT<br />

π(p) prime counting function, number of primes ≤ p

$\sigma_{jit}$ standard deviation of jitter

xA the x-th part of vector A

b base or radix

e number of words

k length of operands

n positive integer to be factored

p prime factor

w word width

ALU Arithmetic Logic Unit

ASIC Application-Specific Integrated Circuit

AT Area-Time

CASR Cellular Automata Shift Register

CLB Configurable Logic Block

CPA Carry Propagate Adder

CPU Central Processing Unit

CRT Chinese Remainder Theorem

CSA Carry Save Adder

DJ Deterministic Jitter

DLL Delay Locked Loop

DSA Digital Signature Algorithm

EC Elliptic Curves

ECC Elliptic Curve Cryptography


ECDLP Elliptic Curve Discrete Logarithm Problem

ECDSA Elliptic Curve Digital Signature Algorithm

ECM Elliptic Curve Method

EPLL Enhanced PLL

FA Full Adder

FPGA Field Programmable Gate Array

FPLL Fast PLL

gcd Greatest Common Divisor

GMP GNU Multiple Precision

GNFS Generalised Number Field Sieve

I/O Input/Output

IP Intellectual Property

ITU International Telecommunication Union

LAB Logic Array Block

LE Logic Element

LFSR Linear Feedback Shift Register

LPM Library of Parameterized Modules

LSB Least Significant Bit

LUT Look-Up Table

MM Modular Multiplication

MMM Montgomery Modular Multiplication

MPQS Multiple Polynomial Quadratic Sieve

MSB Most Significant Bit


MWR2MM Multiple Word Radix-2 Montgomery Multiplication

NA Not Available

P&R Place and Route

PCI Peripheral Component Interconnect

PE Processing Element

PLL Phase Locked Loop

PRNG Pseudo-Random Number Generator

RAM Random Access Memory

RFID Radio Frequency Identification

RISC Reduced Instruction Set Computer

RJ Random Jitter

RMS Root Mean Square

RNG Random Number Generator

RO Ring Oscillator

ROM Read-Only Memory

SIMD Single Instruction Multiple Data

SOC System on a Chip

SOS Separated Operand Scanning

TRNG True-Random Number Generator

UART Universal Asynchronous Receiver/Transmitter

VCCIO Positive Supply Voltage for IO Pins

VCO Voltage Controlled Oscillator

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit


Introduction<br />

In this thesis we analyse two elementary blocks of almost every public-key cryptosystem: a multiplier for operations on very long operands and a random number generator.

In the case of the multiplier, our main goal is to achieve a scalable and parametrised design for fast prototyping in Field Programmable Gate Arrays (FPGAs). Flexibility of the design trades off against computational latency, so this concept is suitable mostly for prototyping and proof-of-concept designs. As a secondary objective, we want to utilise a selected family of FPGAs effectively and to apply its specific features. In this way we can analyse the suitability of a certain algorithm for the selected FPGA platform. Such an approach is particularly appropriate when the final implementation platform will be the same FPGA family.

A flexible and effective multiplier design could offer a universal solution in applications with different asymmetric algorithms or in similar systems based on the same algebraic operations. Our goal is to design and implement a multiplier block with a universal interface that could be included in a variety of cryptosystems and that allows its configuration parameters to be changed, e.g. the length of the input parameters, the computational time and the occupied area.

Another area of our focus is random numbers, namely their generation on digital platforms. The Random Number Generator (RNG) design depends significantly on the target implementation platform. Therefore we analyse the features of FPGA devices, change the working conditions in a way that simulates attacker behaviour, and describe the relations between the parameters of the generator and the statistical parameters of the generated sequence.

Classification of the generators depends heavily on the level of their description. According to the latest trends in this research area, designers of RNGs should provide, in addition to the results of statistical tests, a detailed analysis and model of the generator. The generator's behaviour needs to be explained in detail and supported by practical experiments. Special attention in the RNG design should be paid to the testability of the RNG. The tests can be done on the generated sequence. However, as we will show, for the analysed generator there exist more effective testing methods. The proposed methods should take into account the fundamental principle of extracting the random values.

In Chapter 1 we introduce the mathematical background of the two currently best known and most used cryptographic algorithms for public-key cryptosystems, RSA and Elliptic Curve Cryptography (ECC). In these computationally highly intensive public-key algorithms we identify the most expensive and also the most used operation: modular multiplication. The comparison of the operand lengths shows the range for which a universal architecture needs to be found.

Chapter 2 presents our design approach and implementation results for the Montgomery multiplier. We compare two designs which differ in the handling of carry bits in the adders inside the multiplier block. The analysis suggests which technique is suitable for a certain platform architecture. We present a scalable architecture of an algebraic coprocessor that is suitable for the multiplier. A communication interface between the coprocessor and a control unit is also discussed. The final case study provides results of our hardware-software co-design of the multiplier in applications with a soft-core processor and a dedicated coprocessor.

In Chapter 3 we start with the mathematical background of integer factoring methods and provide details on the Elliptic Curve Method (ECM) algorithm, including its first and second phase. The motivation for a hardware implementation of the algorithm and previous implementation approaches are summarised.

Chapter 4 describes the first published hardware implementation of the ECM for factoring numbers up to 200 bits. An ECM unit design is introduced and we discuss how the implemented algorithms were chosen. In the final section we present the implementation results of the ECM units and a case study of the application of the ECM unit in a well-known factoring method.

Randomness is the main topic of Chapter 5. We discuss the required features of random sequences intended for cryptographic applications. We describe in detail the design of an RNG with a focus on digital devices and analyse the available sources of randomness. In the last part of the chapter a review of recently published RNG concepts is provided, with our focus on a solution based on a Phase Locked Loop (PLL). The sections on tests and attacks summarise the available knowledge from these areas.

In Chapter 6 we present our results in the research of the PLL-based RNG. Starting with an analysis of PLL parameters in available FPGA devices, we describe the design process for two FPGA vendors. Thanks to observations of the RNG's internal signals we were able to introduce a stochastic model of the generator and to describe its behaviour under changing chip temperature. Based on the empirical experiments we enhance the design process with additional requirements in order to achieve a more robust solution.

The research contribution of the thesis is summarised in the final Chapter 7, where we collect the results from all three topics discussed in the thesis.

1 Montgomery Modular Multiplication in Hardware - preliminaries

Many popular public-key cryptographic algorithms and protocols, such as RSA, ElGamal, elliptic curve cryptography (ECC) and Diffie-Hellman [86], make extensive use of modular operations on large numbers. The typical operand size is 160-300 bits in ECC and 1000-2000 bits in RSA.

We start the chapter with a discussion of the optimal choice of the computation method and of the way it is implemented on the chosen implementation platform (Section 1.1). In Section 1.2 we summarise the RSA algorithm together with a short analysis of the available algorithms for modular multiplication. We mention the aspects of hardware implementation and review the available papers in this area. Finally, the algorithm implemented later in the thesis and its modification are introduced. Section 1.3 starts with the definition of elliptic curves (EC) and continues with their application in cryptography. The last section summarises the most important features of the presented public-key algorithms and identifies the part of the system that matters most for an effective implementation.

1.1 Implementation Platforms<br />

By having all parts of a cryptosystem (encryption, authentication, key storage, generation of random numbers, ...) implemented on the same platform, one is able to achieve a highly compact and therefore potentially more secure implementation. The more signals are available to an adversary for observation, the more information about the processed data can be obtained.

In the past, the development of hardware and software platforms was done separately, apart from the initial requirements and the definitions of data formats and interfaces. Nowadays, with so-called hardware-software co-design, one tries to find an optimum in the effective utilisation of resources. In such a case some operations are implemented as a hardware structure and the others as software functions. Reconfigurable devices and embedded soft-core processors make this approach particularly attractive. However, the development of mixed systems is not a trivial task for designers, especially at the stage where the division of tasks is decided. Systems that make it possible to simulate and evaluate the performance of a proposed software-hardware architecture before a real (and expensive) implementation are only at an early stage of development (see e.g. the GEZEL language and design environment [99]).


Hardware implementation platforms offer a higher level of security thanks to the possibility of physically separating sensitive data and, depending on the operations, also higher performance than comparable software implementations.

The following can be considered as a hardware platform:

• ASIC (Application-Specific Integrated Circuit),

• FPGA (Field-Programmable Gate Array) or

• RFID (Radio Frequency Identification) chip.

There are different approaches to the implementation of cryptosystems. An implementation can provide some supporting functions for a general-purpose processor, cover all crypto-related operations in a standard system, or even represent a complete system able to substitute the original non-secured system.

Depending on the application, the implementation can take the form of a smart card, an IP (Intellectual Property) core, a co-processor, a PCI card, a router etc. With the growing area of chips it is possible to implement a CPU, memory blocks, peripherals, interfaces and a co-processor on a single chip, providing a so-called system-on-a-chip (SOC). Especially in cryptography there is a requirement to implement systems as an SOC, which hides the internal signals from possible abuse by the adversary. The SOC raises another requirement, namely to find a way to implement all parts of the SOC on the same chip and the same platform, if possible sharing the same resources.

Applications have various requirements for area, speed, energy, or power consumption. Additionally, in the case of cryptosystems we also define the level of security, taking into account the vulnerability to eavesdropping and side-channel attacks, or the ability of the system to detect an attack and thereafter delete the sensitive data in a way that makes it impossible for an adversary to restore them (tamper resistance). The conditions required to certify cryptographic implementations at a certain level of security, and the areas where such systems can be used, are defined in standards of well-known standardisation organisations [57].

Reconfigurable Devices A reconfigurable device is a hardware architecture in which both the functionality of the processing elements and the interconnection between them can be modified after fabrication. The best known reconfigurable hardware components are FPGAs.


Cryptographic primitives belong to the group of systems suitable for reconfigurable devices due to the following features:

• standardized algorithms - most cryptographic algorithms, except random number generators, are approved by international standards organisations (e.g. [54–56,58,59]). Thus, the functionality described by mathematical algorithms and equations can be deeply studied and tailored to the hardware structure. The group of secure cryptographic algorithms may change over time due to newly invented attacks. A reconfigurable platform makes it possible to remove obsolete algorithms from running systems and to provide new ones, even without a hardware update or exchange.

• several supported functionality modes and lengths of operands - while the number of the most popular algorithms is limited, each of them provides a group of selectable parameters, which results in the need to implement a group of algorithm combinations.

• sequential structure - depending on the running operation, only selected cryptographic blocks need to be programmed in the device, and when the operation changes another configuration is loaded. An example is a scheme in which a secret key is distributed to the parties by an asymmetric algorithm at the beginning of the communication, which is later replaced by a faster symmetric encryption implemented on the same device.

FPGA Architecture The underlying FPGA architecture consists of an array of the smallest programmable units, logic elements (LE) or configurable logic blocks (CLB), and programmable connection switches. A typical FPGA architecture consists of a high number (hundreds to thousands) of LEs and routing channels of different length and speed. By the LE we understand the smallest functional unit that is addressed by the mapping tools. Typically it consists of a look-up table (LUT) and a register (D flip-flop) (see Figure 1 – 1), which makes it possible to implement combinatorial as well as sequential logic, or a small memory block. Additionally, the FPGA architecture may include special dedicated blocks for other functions, e.g. for storing data, computing multiplication and addition, or synthesising clock signals.
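The LUT-plus-register structure of a logic element can be illustrated with a small software model. The sketch below is only a toy abstraction, not a model of any particular FPGA family: the class `LogicElement`, the truth-table layout and the helper `ripple_add` are hypothetical names introduced for this illustration. It shows how a truth table stored in a LUT realises combinatorial logic (here the sum and carry functions of a full adder) and how the flip-flop captures the LUT output on a clock edge.

```python
class LogicElement:
    """Toy model of one logic element: a k-input LUT followed by a D flip-flop."""

    def __init__(self, truth_table):
        # truth_table[i] is the LUT output for the input combination whose
        # binary value is i (the first input is the least significant bit).
        self.truth_table = list(truth_table)
        self.q = 0  # registered output (state of the D flip-flop)

    def lut(self, *inputs):
        """Combinatorial path: look the inputs up in the truth table."""
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return self.truth_table[index]

    def clock(self, *inputs):
        """Sequential path: a clock edge stores the LUT output in the flip-flop."""
        self.q = self.lut(*inputs)
        return self.q


# Truth tables of a full adder with inputs (a, b, carry_in); index = a + 2b + 4c.
SUM_TABLE = [a ^ b ^ c for c in (0, 1) for b in (0, 1) for a in (0, 1)]
CARRY_TABLE = [(a & b) | (a & c) | (b & c)
               for c in (0, 1) for b in (0, 1) for a in (0, 1)]


def ripple_add(x, y, width):
    """Add two width-bit numbers bit by bit, using one sum LUT and one carry LUT per bit."""
    sum_le = LogicElement(SUM_TABLE)
    carry_le = LogicElement(CARRY_TABLE)
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= sum_le.lut(a, b, carry) << i
        carry = carry_le.lut(a, b, carry)
    return result, carry
```

In a real device the carry would typically travel over a dedicated carry chain rather than through a second LUT; the model ignores that distinction as well as all timing.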

Figure 1 – 1 Typical architecture of the smallest functional unit in an FPGA (blocks: look-up table, carry chain, D flip-flop; signals: data inputs, carry input, clock, carry output, data outputs).

Modern FPGAs provide support for the implementation of a wide range of algorithms from the areas of signal processing, communication and networking. Cryptographic algorithms and protocols can be represented as a sequence of algebraic functions in a chosen operational area. The operations in cryptography are often similar to the ones used in the fields mentioned above. Therefore the optimised blocks in the structure of FPGAs provide means for an efficient realisation of cryptographic primitives, too.

The additional property of cryptosystems, security, is supported by FPGA vendors by enhancing the devices with hard-wired encryption cores and special-purpose memories. With the rising importance of cryptography, FPGA vendors will be pushed to provide more and more features supporting the security of FPGA-based cryptosystems, as was proposed in [93]. More information on FPGA features and their relation to the implementation of cryptosystems, including an analysis of possible attacks, can be found in [122].

1.2 RSA Algorithm<br />

Nowadays the most popular asymmetric cryptosystem is RSA, which was developed by Ronald Rivest, Adi Shamir and Leonard Adleman in 1978 [96].

A private key for the RSA algorithm consists of two large primes p and q of comparable size and a secret exponent D. A public key is represented by an exponent E and a modulus M, where

$$M = pq \tag{1.1}$$

The Euler totient function $\varphi(M)$ is defined as the number of positive integers smaller than M which are relatively prime to M, thus:

$$\varphi(M) = (p - 1)(q - 1). \tag{1.2}$$

Therefore we can write an equation for the public exponent E:

$$\gcd(E, \varphi(M)) = 1. \tag{1.3}$$

The private exponent D is chosen such that:

$$D = E^{-1} \bmod \varphi(M). \tag{1.4}$$
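Equations (1.1) to (1.4) can be followed step by step in a short sketch. The function name `rsa_keygen` and the toy primes 61 and 53 are assumptions made for illustration only; real keys use primes of hundreds of digits produced by a secure random generator. The modular inverse uses the three-argument `pow` with exponent -1, available in Python 3.8 and later.

```python
from math import gcd


def rsa_keygen(p, q, e):
    """Build a toy RSA key pair from primes p, q and a candidate public exponent e."""
    m = p * q                    # modulus M, Eq. (1.1)
    phi = (p - 1) * (q - 1)      # Euler totient of M, Eq. (1.2)
    if gcd(e, phi) != 1:         # condition on the public exponent, Eq. (1.3)
        raise ValueError("e must be relatively prime to phi(M)")
    d = pow(e, -1, phi)          # private exponent D = E^-1 mod phi(M), Eq. (1.4)
    return (m, e), (m, d)


# Toy parameters, far too small for any real use.
public_key, private_key = rsa_keygen(61, 53, 17)
```

With these toy values the modulus is 3233 and the private exponent is 2753, since 17 · 2753 = 46801 = 15 · 3120 + 1.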

While the public key consists of a tuple (M, E), the private key can be kept in two possible forms: simply as a tuple (M, D), or in an extended form that also includes the primes p and q. The latter form allows a faster decryption algorithm using the Chinese Remainder Theorem (CRT).
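Why the extended private key helps can be sketched as follows: with p and q known, one exponentiation modulo M is replaced by two exponentiations modulo the half-length primes, whose results are then recombined. The sketch uses Garner's recombination formula, which is one common way of applying the CRT and is assumed here rather than taken from the thesis; the function name `rsa_decrypt_crt` and the toy key are illustrative.

```python
def rsa_decrypt_crt(y, p, q, d):
    """Decrypt y using the extended private key (p, q, d) and the CRT."""
    dp = d % (p - 1)             # reduced exponent for the computation mod p
    dq = d % (q - 1)             # reduced exponent for the computation mod q
    q_inv = pow(q, -1, p)        # q^-1 mod p, can be precomputed once per key
    xp = pow(y, dp, p)           # half-length exponentiation mod p
    xq = pow(y, dq, q)           # half-length exponentiation mod q
    h = (q_inv * (xp - xq)) % p  # Garner's recombination step
    return xq + h * q


# Toy key: p = 61, q = 53, M = 3233, E = 17, D = 2753.
ciphertext = pow(65, 17, 3233)
plaintext = rsa_decrypt_crt(ciphertext, 61, 53, 2753)
```

The result agrees with the direct computation `pow(ciphertext, 2753, 3233)`, while each of the two exponentiations works with operands of roughly half the length.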

The basic mathematical operation used by RSA for cryptographic operations (encryption and digital signature) is modular exponentiation. To encrypt a message X with a public key (M, E) one applies the following equation [86]:

$$Y = X^E \bmod M. \tag{1.5}$$

Decryption of a received encrypted message Y is done using a private key couple (M, D) by calculating:

$$X = Y^D \bmod M. \tag{1.6}$$

Similarly to encryption, the RSA signature scheme employs modular exponentiation for the generation of a signature I for a message text X

$$I = X^D \bmod M \tag{1.7}$$

and its verification

$$X = I^E \bmod M. \tag{1.8}$$

Note that in the encryption scheme Alice, as the sending party, uses the public key of the receiving Bob to encrypt the message, and in this case only Bob is able to decrypt it, knowing his private key (see Figure 1 – 2). In the case of a message signature, Alice signs the message using her private key to prove its authenticity, and thereafter anybody who has Alice's public key is able to verify her signature.
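The four operations of Equations (1.5) to (1.8) differ only in which exponent is applied, as the following sketch shows. It is textbook RSA on a toy key, without the padding and hashing that any practical scheme needs; the function names and the key values (M = 3233, E = 17, D = 2753) are assumptions chosen for illustration.

```python
M, E, D = 3233, 17, 2753         # toy modulus, public and private exponent


def encrypt(x, m=M, e=E):
    return pow(x, e, m)          # Y = X^E mod M, Eq. (1.5)


def decrypt(y, m=M, d=D):
    return pow(y, d, m)          # X = Y^D mod M, Eq. (1.6)


def sign(x, m=M, d=D):
    return pow(x, d, m)          # I = X^D mod M, Eq. (1.7)


def verify(x, signature, m=M, e=E):
    return pow(signature, e, m) == x   # check X = I^E mod M, Eq. (1.8)


message = 65
recovered = decrypt(encrypt(message))  # encryption uses the receiver's key pair
signature = sign(message)              # signing uses the sender's private key
```

Every call reduces to one modular exponentiation, which is why the efficiency of this single operation dominates the performance of the whole scheme.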

Figure 1 – 2 RSA encryption scheme when A sends an encrypted message to B. First A receives B's public key (M, E) upon a request, afterwards A encrypts a message X using B's public key, $Y = X^E \bmod M$. Finally B decrypts the received message Y using its own private key, $X = Y^D \bmod M$.

1.2.1 Modular Exponentiation and Multiplication

The modular exponentiation used in the encryption and signature schemes of RSA (see Equations 1.5-1.8) and in other public-key cryptographic algorithms can be computed in two ways, as a series of modular multiplications (MMs):

• interleaved by a modular reduction, or

• with a final reduction step.

The best known method from the first category, the Montgomery modular multiplication (MMM) invented by P. L. Montgomery [88], will be further discussed in this work. For the multiplication and subsequent division one can use the popular Karatsuba-Ofman multiplication [76] in combination with Barrett's reduction [25].

The MM can be a very slow operation when performed on general-purpose computers. The currently suggested operand length (e.g. for RSA) of 1024 bits and more is far above the typical operand length of such computers (8-32 bits). Therefore there is a motivation to design special algebraic units that perform modular operations more efficiently. Better performance and effectiveness of the implementation are achieved by adapting the algorithms and by exploiting platforms with a reconfigurable architecture. Performing mathematical operations on the extra-long RSA variables can be limiting for units optimised for 8-, 16- or 32-bit variables, which are more typical e.g. in signal processing.
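The gap between a 1024-bit operand and a 32-bit datapath is usually bridged by treating the operand as e words of w bits (the symbols e and w follow the List of Symbols) and processing one word per step. The sketch below illustrates this with a word-serial addition; the helper names `to_words`, `from_words` and `add_words` are hypothetical, and the example is meant only to show the principle, not the multiplier architecture developed in the thesis.

```python
def to_words(value, w, e):
    """Split a non-negative integer into e words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(e)]


def from_words(words, w):
    """Reassemble an integer from w-bit words (least significant word first)."""
    return sum(word << (w * i) for i, word in enumerate(words))


def add_words(a_words, b_words, w):
    """Word-serial addition: one w-bit addition per step, the carry goes to the next word."""
    mask = (1 << w) - 1
    carry = 0
    result = []
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        result.append(total & mask)   # low w bits stay in this word
        carry = total >> w            # overflow is passed on
    return result, carry


# Two 1024-bit operands handled as e = 32 words of w = 32 bits.
W, E_WORDS = 32, 32
x = (1 << 1023) + 12345
y = (1 << 1023) + 67890
s_words, carry_out = add_words(to_words(x, W, E_WORDS), to_words(y, W, E_WORDS), W)
```

A longer operand only increases the number of loop iterations; the width of the adder, and hence the hardware cost of the datapath, stays the same.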

The RSA modular exponentiation does not allow a straightforward implementation and requires algorithms that e.g. divide long operands into shorter words, taking into account the physical limitations of the structures in the selected hardware platform. When the operand length may change, the optimal solution would be a design in which the operand length determines only the computational time of an operation, while the overall performance of the unit remains constant for arbitrary length.

Montgomery Methods The MMM provides a very efficient way of computing the modular exponentiation. The input operands of the baseline algebraic operations of the RSA algorithm described by Equations 1.5-1.8 are very long for security reasons. Nowadays, the key length for RSA is being switched from 1024 to 2048 bits, as factorisation efforts achieve results ever closer to the lowest standard key length. With operands of doubled length it is even more desirable to find algorithms that minimise the number of algebraic operations together with their complexity.

The Montgomery reduction allows an efficient implementation of the MM without the classical modular reduction step, which is an even more expensive operation than the multiplication. Therefore it pays off to minimise the number of required reductions or to use algorithms that avoid the division.

In the Montgomery exponentiation algorithm (Algorithm 1 – 1 [86]) the modular exponentiation unrolls into a series of MMMs. Thanks to the transformation to the Montgomery domain and the application of the MMM, it is possible to avoid the unwanted modular reduction during the computations.

We continue with a description of the MMM and of the conversion operations applied in Algorithm 1 – 1.

Given two integers X and Y (X, Y < M < R) and a k-bit modulus M relatively prime to R, the MMM algorithm computes

$$S = \mathrm{MMM}(X, Y) = (X Y R^{-1}) \bmod M, \tag{1.9}$$

where $R^{-1}$ is the inverse of $R = b^k$ modulo M and b denotes a base or radix. The M-residue $\bar{X}$ of an integer X < M is defined as [41]:

$$\bar{X} = X R \bmod M. \tag{1.10}$$

For conversion to the Montgomery domain we can use the MMM function as follows:

$$\begin{aligned}
\mathrm{MMM}(X, R^2) &= X R^2 R^{-1} \bmod M \\
&= X R \bmod M \\
&= \bar{X}
\end{aligned} \tag{1.11}$$


Algorithm 1 – 1 Montgomery exponentiation algorithm [86]. The definition of $M'$ requires that $\gcd(M, R) = 1$; b denotes the base or radix.

Require: $M = (m_{k-1} \dots m_0)_b$, $R = b^k$, $M' = -M^{-1} \bmod b$, $E = (e_t \dots e_0)_2$ with $e_t = 1$, and an integer X, $1 \le X < M$. The values $R^2 \bmod M$ and $R \bmod M$ may also be provided as precomputed inputs.

Ensure: $A = X^E \bmod M$.

1: $\bar{X} \Leftarrow \mathrm{MMM}(X, R^2 \bmod M)$

2: $A \Leftarrow R \bmod M$

3: for i = t down to 0 do

4: $A \Leftarrow \mathrm{MMM}(A, A)$

5: if $e_i = 1$ then

6: $A \Leftarrow \mathrm{MMM}(A, \bar{X})$

7: end if

8: end for

9: $A \Leftarrow \mathrm{MMM}(A, 1)$

10: return A
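The control flow of Algorithm 1 – 1 can be traced in a few lines of code. In the sketch below the Montgomery product is deliberately computed by its definition, $XYR^{-1} \bmod M$, using a modular inverse; this is only a functional reference that hides how a real multiplier avoids the reduction. The names `mmm` and `mont_exp`, the choice b = 2 and the choice of R as the smallest power of 2 above M are assumptions of this illustration.

```python
def mmm(x, y, m, r_inv):
    """Reference Montgomery product X*Y*R^-1 mod M, computed by definition."""
    return (x * y * r_inv) % m


def mont_exp(x, e, m):
    """Compute x^e mod m for an odd modulus m, following the steps of Algorithm 1-1."""
    k = m.bit_length()
    r = 1 << k                               # R = 2^k > M, radix b = 2
    r_inv = pow(r, -1, m)                    # exists because m is odd
    x_bar = mmm(x, (r * r) % m, m, r_inv)    # Step 1: X -> M-residue
    a = r % m                                # Step 2: M-residue of 1
    for i in reversed(range(e.bit_length())):    # Steps 3-8: bits e_t ... e_0
        a = mmm(a, a, m, r_inv)              # Step 4: square
        if (e >> i) & 1:                     # Step 5
            a = mmm(a, x_bar, m, r_inv)      # Step 6: multiply
    return mmm(a, 1, m, r_inv)               # Step 9: leave the Montgomery domain


result = mont_exp(65, 17, 3233)
```

For a t+1 bit exponent the loop performs t+1 squarings and one extra multiplication per non-zero exponent bit, so the cost of the two conversion steps is small compared with the body of the loop.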

Therefore the first operation in Algorithm 1 – 1 (Step 1) maps the input value X to its M-residue $\bar{X}$.

Now we show how to map the value $\bar{X}$ back to its ordinary integer form X, which is done in the last operation of the exponentiation (Algorithm 1 – 1, Step 9). It can be seen that the Montgomery product of two M-residues $\bar{X}$, $\bar{Y}$ is itself the M-residue $\bar{S}$ of the product $S = XY \bmod M$:

$$\begin{aligned}
\bar{S} &= \mathrm{MMM}(\bar{X}, \bar{Y}) \\
&= \bar{X} \bar{Y} R^{-1} \bmod M \\
&= XR \cdot YR \cdot R^{-1} \bmod M \\
&= XYR \bmod M \\
&= SR \bmod M
\end{aligned} \tag{1.12}$$

so the final operation required to convert the M-residue $\bar{S}$ back into S is defined as:

$$\begin{aligned}
S &= \bar{S} R^{-1} \bmod M \\
&= 1 \cdot \bar{S} R^{-1} \bmod M \\
&= \mathrm{MMM}(1, \bar{S})
\end{aligned} \tag{1.13}$$
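The round trip through the Montgomery domain described by Equations (1.10) to (1.13) can be checked numerically. The sketch below again uses a reference product computed by definition; the function names and the toy values M = 3233 and R = 2^12 are assumptions made for illustration.

```python
def mmm(x, y, m, r):
    """Reference Montgomery product X*Y*R^-1 mod M (Eq. 1.9), computed by definition."""
    return (x * y * pow(r, -1, m)) % m


def to_montgomery(x, m, r):
    return mmm(x, (r * r) % m, m, r)   # Eq. (1.11): MMM(X, R^2) = XR mod M


def from_montgomery(x_bar, m, r):
    return mmm(1, x_bar, m, r)         # Eq. (1.13): MMM(1, S_bar) = S


M, R = 3233, 1 << 12                   # odd toy modulus, R = 2^12 > M
X, Y = 65, 1234
X_bar = to_montgomery(X, M, R)
Y_bar = to_montgomery(Y, M, R)
S_bar = mmm(X_bar, Y_bar, M, R)        # Eq. (1.12): stays in the Montgomery domain
S = from_montgomery(S_bar, M, R)
```

The value `S` equals the ordinary product X·Y mod M, while `S_bar` equals X·Y·R mod M, which confirms that a chain of MMMs can be evaluated without leaving the domain.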

The algorithm works for any modulus M provided that gcd(M, R) = 1. This is always the case in RSA: M = pq is a product of two odd primes and is therefore odd, while R is a power of 2 and is therefore relatively prime to any odd M.

The MMM algorithm for k-bit operands $X = (x_{k-1}, \dots, x_1, x_0)$, Y, and M is given as Algorithm 1 – 2 [86].

Algorithm 1 – 2 The Montgomery modular multiplication algorithm for k-bit operands $X = (x_{k-1}, \dots, x_1, x_0)$, Y, and M

Require: $M = (m_{k-1} \dots m_0)_b$, $X = (x_{k-1} \dots x_0)_b$, $Y = (y_{k-1} \dots y_0)_b$, with $0 \le X, Y < M$, $R = b^k$ with $\gcd(M, b) = 1$, and $M' = -M^{-1} \bmod b$.

Ensure: $S = XYR^{-1} \bmod M$.

1: $S \Leftarrow 0$, $S = (s_{k-1} \dots s_0)_b$

2: for i = 0 to k − 1 do

3: $q_i \Leftarrow (s_0 + x_i y_0) M' \bmod b$

4: $S \Leftarrow (S + x_i Y + q_i M)/b$

5: end for

6: if $S \ge M$ then

7: $S \Leftarrow S - M$

8: end if

9: return S
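A direct transcription of Algorithm 1 – 2 shows that the only division left is the exact division by the radix b. The sketch below keeps the step numbers of the algorithm in its comments; the function name `mmm_radix` and the test values are assumptions, and Python's arbitrary-precision integers stand in for the multi-digit registers of a hardware implementation.

```python
def mmm_radix(x, y, m, k, b=2):
    """Digit-serial Montgomery product x*y*b^-k mod m for k-digit operands in radix b."""
    m_prime = (-pow(m, -1, b)) % b           # M' = -M^-1 mod b (needs gcd(m, b) = 1)
    y0 = y % b                               # least significant digit of Y
    s = 0                                    # Step 1
    for i in range(k):                       # Step 2
        x_i = (x // b**i) % b                # i-th digit of X
        q_i = ((s % b) + x_i * y0) * m_prime % b   # Step 3
        s = (s + x_i * y + q_i * m) // b     # Step 4: sum is a multiple of b by construction
    if s >= m:                               # Steps 6-8: single final correction
        s -= m
    return s                                 # Step 9


# Toy check against the definition X*Y*R^-1 mod M with R = b^k.
M, X, Y = 3233, 65, 1234
expected = (X * Y * pow(2**12, -1, M)) % M
product_radix2 = mmm_radix(X, Y, M, k=12, b=2)    # 12 binary digits, R = 2^12
product_radix16 = mmm_radix(X, Y, M, k=3, b=16)   # 3 hexadecimal digits, R = 16^3
```

Both calls use the same R = 4096 and therefore return the same value. For b = 2 the digit $q_i$ is a single bit and the division in Step 4 is a one-bit right shift, which is what makes the radix-2 form attractive for hardware.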

Thanks to the pre-computed value $M'$ used in Algorithm 1 – 2, the expensive division by the modulus is avoided during the computations; only divisions by the radix b remain. For a single multiplication the classical algorithm for modular multiplication would be faster than the MMM. Because of the rather expensive transformation to the Montgomery domain (M-residue) and back, it is more effective to stay in that domain as long as possible and to transform the operands back to the ordinary domain only at the very end of the computations. This pays off for a long sequence of MMMs, as in the case of the modular exponentiation (Algorithm 1 – 1).


In Algorithm 1 – 1 the input operand X is transformed to the Montgomery domain<br />

at the beginning (Step 1). A series of MMMs in the Montgomery domain follows.<br />

Finally, in the last step (Step 9), the result is transformed back to the normal domain.<br />

In this way the advantage of computing in the Montgomery domain is fully exploited.<br />

The MMM is considered the most effective method for modular exponentiation,<br />

which is used e.g. in the RSA cryptographic algorithm.<br />
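The idea of staying in the Montgomery domain can be illustrated with a short Python sketch. Algorithm 1 – 1 is not reproduced here, so the left-to-right square-and-multiply structure below is an assumption; `mmm` is only a functional model of the MMM (it multiplies by a precomputed inverse of R), and all names are hypothetical.

```python
def mmm(a: int, b: int, m: int, r_inv: int) -> int:
    """Functional model of the MMM: a*b*R^{-1} mod m."""
    return (a * b * r_inv) % m

def mont_exp(x: int, e: int, m: int) -> int:
    """Compute x**e mod m (m odd) using only MMM operations after set-up."""
    k = m.bit_length()
    r = 1 << k                                  # R = 2^k > m
    r_inv = pow(r, -1, m)                       # used only by the model above
    x_bar = mmm(x, (r * r) % m, m, r_inv)       # X*R mod M: enter the domain
    s = r % m                                   # the value 1 in the domain
    for bit in bin(e)[2:]:                      # exponent bits, MSB first
        s = mmm(s, s, m, r_inv)                 # square
        if bit == "1":
            s = mmm(s, x_bar, m, r_inv)         # multiply
    return mmm(1, s, m, r_inv)                  # leave the domain
```

Only the first and the last call convert between domains; every other call is a plain MMM, which is where the method pays off.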

1.2.2 <strong>Hard</strong><strong>ware</strong> Implementations of the MMM<br />

Achieving a short computation time of the MM, the most time-consuming operation<br />

in the RSA and ECC algorithms, has a significant impact on the performance of<br />

the elementary cryptographic operations. Efficient implementation of the<br />

algorithm has therefore been an attractive field of research. Because the operations are<br />

performed on long operands, a hardware platform seems to be a more natural choice<br />

than a software implementation. Since the size of the operands may change according<br />

to requirements and differs between RSA and ECC, a parameterized design in<br />

programmable logic would offer a universal solution for fast prototyping.<br />

The implementations realise general algorithms that are specifically adjusted to<br />

take into account the features of the hardware platform and that prefer operations easily<br />

implementable in digital logic. The designs in general either aim at<br />

a universal and flexible solution or give priority to the best use of resources and<br />

the shortest computation times.<br />

One of the most cited hardware implementations of the MMM was introduced at<br />

CHES 1999 by Tenca and Koç [108]. A cheap and flexible modular exponentiation<br />

hardware accelerator can also be achieved using FPGAs. Results presented in the<br />

literature, e.g. [29, 41, 51], concentrate mainly on systolic-like implementations that<br />

provide a very fast but less flexible solution.<br />

Pre-computing partial results as presented in [72] reduces the number<br />

of clock cycles required for a single MMM operation. Such an approach<br />

needs marginally more area than the original proposal [108], and its<br />

latency is comparable to that of the design presented in [85], which is based on<br />

processing multi-precision operands in carry-save form. High-radix implementations<br />

[110] also reduce the number of computational steps, but the complexity of the logic<br />

increases substantially.<br />

Current FPGAs provide an alternative hardware platform even for system-level<br />

integration of cryptographic hardware. A SoC concept can typically include an<br />

embedded processor with a set of dedicated coprocessors. For such a system a<br />

highly flexible (although typically slower) scalable MMM coprocessor could be more<br />

attractive than a fixed-length dedicated one.<br />

This direction was chosen in our research: our goal is to analyse and<br />

implement a solution that allows quick prototyping of special-purpose hardware<br />

designs and uses features of the target platform to accelerate the execution of the<br />

MMM operation.<br />

The radix-2 MMM algorithm (b = 2) is very suitable for hardware implementation<br />

because its operations are easy to implement: a word-by-bit multiplication, a<br />

bit shift (division by two) and an addition. Implementations with a higher radix were<br />

also published [30, 110] and offer a viable alternative, but they need a more complex<br />

arithmetic unit.<br />

Radix-2 <strong>Montgomery</strong> <strong>Multiplication</strong> Algorithm The simplified version of<br />

the MMM algorithm (Algorithm 1 – 2) when the radix b is equal to 2 (b = 2) for<br />

k-bit operands X = (xk−1, . . . , x1, x0), Y , and M is given as Algorithm 1 – 3.<br />

Algorithm 1 – 3 The basic radix-2 <strong>Montgomery</strong> multiplication algorithm for k-bit<br />

operands X = (xk−1, . . . , x1, x0), Y , and M<br />

Require: $M = (m_{k-1} \dots m_0)_2$, $X = (x_{k-1} \dots x_0)_2$, $Y = (y_{k-1} \dots y_0)_2$,<br />

$M' = -M^{-1} \bmod 2$, $E = (e_t \dots e_0)_2$ with $e_t = 1$, $R = 2^k$, and an integer $X$, $1 \le X < M$.<br />

The values $R^2 \bmod M$ and $R \bmod M$ may also be provided as precomputed inputs.<br />

Ensure: $S = XYR^{-1} \bmod M$.<br />

1: S0 ⇐ 0<br />

2: for i = 0 to k − 1 do<br />

3: qi ⇐ (Si + xiY ) mod 2<br />

4: Si+1 ⇐ (Si + xiY + qiM)/2<br />

5: end for<br />

6: if Sk ≥ M then<br />

7: Sk ⇐ Sk − M<br />

8: end if<br />

9: S ← Sk<br />

10: return S<br />

From a comparison of Algorithms 1 – 2 and 1 – 3 one can see how the choice of<br />

b = 2 simplifies the operations inside the MMM. The modular reduction<br />

by the radix b changes to a check of the LSB. In Step 4 the division is replaced<br />

by a simple right-shift operation.<br />

The formulation that describes the radix-2 algorithm was used as the starting<br />

point for the derivation of the scalable MMM design presented in [108, 109].<br />

The features of such a scalable architecture are discussed later. Before that, we take<br />

a closer look at the operations of the algorithm and consider modifications that make<br />

them better suited for efficient execution on the chosen FPGA hardware platform.<br />

The decision whether to add the modulus M to the temporary<br />

sum is based on the value of the variable $q_i$, which is simple to implement.<br />

The test checks the LSB of the partial sum $S_i + x_iY$ and stores it as the variable<br />

$q_i$ once the addition of $x_iY$ is finished (see step 3 of Algorithm 1 – 3). The stored<br />

value decides on the addition of M in the following iteration of the loop.<br />

However, the second condition (see step 6 of Algorithm 1 – 3) causes a problem<br />

for a possible pipelined execution of the computations. After the loop of additions,<br />

multiplications and shifts, the mentioned comparison and the subsequent conditional<br />

subtraction are required. Without the final reduction step, the outcome of the inner<br />

loop can be an improper input for the subsequent multiplication.<br />

This happens when the final value of S is not smaller than M (S ≥ M).<br />

We intend to use the MMM in a series of multiplications,<br />

where the transformation into the Montgomery domain pays off against an<br />

expensive reduction, as was shown in Algorithm 1 – 1. Therefore we analyse<br />

possibilities for omitting the final conditional step by changing Algorithm 1 – 3,<br />

which makes the use of pipelined multipliers possible.<br />

Algorithm Modifications The MMM algorithm (Algorithm 1 – 2) introduced<br />

earlier is further extended. Two variants of the algorithm are discussed and<br />

implemented. Both support a scalable multiple-word oriented implementation, but<br />

they handle carry processing in different ways.<br />

In the modified Algorithm 1 – 4 we use the follow<strong>in</strong>g <strong>in</strong>put operands:<br />

$X = \sum_{i=0}^{k} x_i 2^i = (0, 0, x_k, x_{k-1}, \dots, x_1, x_0) < 2M$, (1.14)<br />

$\tilde{Y} = \sum_{i=0}^{k} \tilde{y}_i 2^{i+1} = (y_k, \dots, y_1, y_0, 0) < 4M$, (1.15)<br />

where $R = 2^{k+3}$, $Y < 2M$, and $2^{k-1} < M < 2^k$ is a k-bit number (the same as<br />

in Algorithm 1 – 3). Note that $\tilde{Y}$ in Equation 1.15 is a left-shifted version of<br />

Y, with $\tilde{y}_0 = 0$, and X is extended with two zero bits at the MSB positions. This<br />

change simplifies the computation of $q_i$ compared to Algorithm 1 – 3. The value of<br />

$q_i$ needed for the computation of $S_{i+1}$ is given directly as the LSB of $S_i$ from the previous<br />

iteration (see step 4 of Algorithm 1 – 4). In this way the latency caused by the<br />

addition of $x_iY$ is removed, and the logic implementation can be simplified,<br />

too.<br />

Algorithm 1 – 4 Optimized radix-2 <strong>Montgomery</strong> multiplication algorithm<br />

Require: $X = \sum_{i=0}^{k} x_i 2^i = (0, 0, x_k, x_{k-1}, \dots, x_1, x_0) < 2M$,<br />

$\tilde{Y} = \sum_{i=0}^{k} \tilde{y}_i 2^{i+1} = (y_k, \dots, y_1, y_0, 0) < 4M$, $R = 2^{k+3}$, $Y < 2M$, and $2^{k-1} < M < 2^k$.<br />

Ensure: $S = XYR^{-1} \bmod M$.<br />

1: $S_0 \Leftarrow 0$<br />

2: $\tilde{Y} \Leftarrow 2Y$<br />

3: for i = 0 to k + 2 do<br />

4: $q_i \Leftarrow S_i \bmod 2$<br />

5: $S_{i+1} \Leftarrow (S_i + x_i\tilde{Y} + q_iM)/2$<br />

6: end for<br />

7: $S \Leftarrow S_{k+3}$<br />

8: return S<br />
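The steps of Algorithm 1 – 4 can be modelled with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, k is taken as the bit length of M, and the loop is transcribed literally (k + 3 iterations with Ỹ = 2Y). For this literal transcription the returned value is congruent to X·Ỹ·2^{−(k+3)} modulo M and stays below 2M, which is the property that allows the result to be fed into the next multiplication without a comparison.

```python
def mmm_no_final_subtraction(x: int, y: int, m: int) -> int:
    """Radix-2 Montgomery loop without the final comparison and subtraction.

    Expects an odd modulus m and operands 0 <= x, y < 2*m.
    """
    k = m.bit_length()              # 2^(k-1) < m < 2^k
    y_shifted = 2 * y               # Y~ = 2Y, so its least significant bit is 0
    s = 0
    for i in range(k + 3):          # i = 0 .. k+2
        q = s & 1                   # q_i is simply the LSB of S_i
        x_i = (x >> i) & 1          # bits above position k are zero
        s = (s + x_i * y_shifted + q * m) >> 1
    return s
```

Because Ỹ is even, adding x_i·Ỹ never changes the parity of the sum, which is why q_i can be read before the addition is carried out.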

The loop of Algorithm 1 – 4 is executed with three additional iterations<br />

in comparison to Algorithm 1 – 3. The higher number of iterations ensures that<br />

the inequalities $S_i < 3M$, $i = 0, 1, \dots, k + 2$, and $S = S_{k+3} = \mathrm{MMM}(X, Y) =$<br />

$(XYR^{-1}) \bmod M < 2M$ always hold. The result S = MMM(X, Y) can thus<br />

be reused as an input X or Y of the subsequent MMM. This modification avoids<br />

the originally proposed final correction step (comparison and subtraction in step 6<br />

of Algorithm 1 – 3) and makes pipelined execution of the algorithm in<br />

separate multipliers possible.<br />

In typical applications (e.g. RSA), the input operands X, Y are pre-multiplied<br />

by a factor $2^{2k} \bmod M$ (Algorithm 1 – 3) or $2^{2k+6} \bmod M$ (Algorithm 1 – 4). The<br />

final MMM with the value 1 makes the final result smaller than M (with probability<br />

$1 - 2^{-(k+2)}$, as shown in [29]) and provides the result XY mod M.<br />
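The complete round trip can be sketched for the basic radix-2 algorithm (Algorithm 1 – 3), where the pre-multiplication factor is 2^{2k} mod M. The helper names below are hypothetical and the code is a functional model only; it keeps the final conditional subtraction of Algorithm 1 – 3, so every intermediate value is fully reduced.

```python
def mmm_radix2(x: int, y: int, m: int, k: int) -> int:
    """Algorithm 1-3: returns x*y*2^(-k) mod m for 0 <= x, y < m, m odd."""
    s = 0
    for i in range(k):
        x_i = (x >> i) & 1
        q_i = (s + x_i * y) & 1            # LSB check replaces "mod b"
        s = (s + x_i * y + q_i * m) >> 1   # right shift replaces "/ b"
    return s - m if s >= m else s

def mod_mul_via_montgomery(x: int, y: int, m: int) -> int:
    """Compute x*y mod m using only MMM operations and the constant 2^(2k) mod m."""
    k = m.bit_length()
    r2 = pow(2, 2 * k, m)                  # precomputed 2^(2k) mod M
    x_bar = mmm_radix2(x, r2, m, k)        # X * 2^k mod M
    y_bar = mmm_radix2(y, r2, m, k)        # Y * 2^k mod M
    p_bar = mmm_radix2(x_bar, y_bar, m, k) # X*Y * 2^k mod M
    return mmm_radix2(p_bar, 1, m, k)      # final MMM with 1 leaves the domain
```

For a single product this costs four MMMs, which shows why the conversion only pays off when many multiplications are chained.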

1.3 EC <strong>in</strong> Cryptography<br />

The application of the EC in public-key cryptography was proposed independently<br />

by Neal Koblitz and Victor S. Miller in 1985 [77, 87]. The advantage of using<br />

ECC instead of RSA or DSA [56] lies in the fact that the key<br />

can be much shorter. The best known algorithm for solving the elliptic curve<br />

discrete logarithm problem (ECDLP) takes fully exponential time, while the algorithms<br />

for the integer factorization problem and the discrete logarithm problem take<br />

sub-exponential time. The comparison of key lengths for equivalent security levels is<br />

presented in Table 1 – 1 [91].<br />

Table 1 – 1 Comparison of the key length (<strong>in</strong> bits) for equivalent security level for public-key<br />

cryptosystems<br />

Security (bits) DSA RSA ECC<br />

80 1024 1024 160-223<br />

112 2048 2048 224-255<br />

128 3072 3072 256-383<br />

192 7680 7680 384-511<br />

256 15360 15360 512+<br />

The fundamental and most expensive operation underlying ECC is the point<br />

multiplication, which is built from field operations. For a point P and a positive integer<br />

k, the point multiplication kP is defined as the sum of k copies of the point P:<br />

$kP = \underbrace{P + \dots + P}_{k}$. (1.16)<br />

Various algorithms have been proposed for a more efficient computation of the point<br />

multiplication, taking into account whether the point P is fixed or unknown.<br />

The EC over F, denoted as E, is a curve that is given by an equation of the<br />

following form:<br />

$E : y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6$, $(a_i \in F)$ (1.17)<br />

where E must be smooth.<br />

We let E(F) denote the set of points $(x, y) \in F^2$ that satisfy this equation, along<br />

with a point at infinity denoted O. If the characteristic of F is neither 2 nor 3, then<br />

Equation 1.17 can be simplified to the commonly used form (the so-called Weierstraß<br />

form):<br />

$y^2 = x^3 + ax + b$, $(a, b \in F)$. (1.18)<br />

The condition for smoothness of the curve is, in this case, equivalent to the requirement<br />

that the cubic on the right-hand side of Equation 1.18 has no multiple roots. This holds if and<br />

only if the discriminant of $x^3 + ax + b$, which is $-(4a^3 + 27b^2)$, is nonzero.<br />


The points of the EC form an Abelian group with the point O serving as its identity element.<br />

Next we define the rules for point addition and point doubling (addition of a point<br />

to itself).<br />

Let $P = (x_P, y_P) \in E$; then $-P = (x_P, -y_P)$. If $Q = (x_Q, y_Q) \in E$ and<br />

$Q \ne -P$, then $P + Q = (x_{P+Q}, y_{P+Q})$. The formulas for point addition and doubling<br />

are given in Equations 1.19:<br />

$x_{P+Q} = \lambda^2 - x_P - x_Q$, (1.19)<br />

$y_{P+Q} = \lambda(x_P - x_{P+Q}) - y_P$,<br />

$\lambda = \dfrac{y_Q - y_P}{x_Q - x_P}$ if $P \ne Q$,<br />

$\lambda = \dfrac{3x_P^2 + a}{2y_P}$ if $P = Q$.<br />

When P ≠ Q (addition), the formulas for computing P + Q require 1 inversion, 2<br />

multiplications, and 1 squaring. When P = Q (doubling), the formulas for computing<br />

2P require 1 inversion, 2 multiplications, and 2 squarings. Since field inversion<br />

is significantly more expensive than multiplication, it is advantageous to represent<br />

points using projective coordinates and then use formulas without inversion [35].<br />
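The affine formulas of Equations 1.19 and the point multiplication of Equation 1.16 can be tried out on a toy curve. The sketch below assumes a prime field F_p and uses the small textbook curve y² = x³ + 2x + 2 over F_17 in the usage example; the function names and the use of `None` for the point O are illustrative choices, and the double-and-add loop is one common way to compute kP, not necessarily the one used in the thesis.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]      # None models the point at infinity O

def ec_add(P: Point, Q: Point, a: int, p: int) -> Point:
    """Affine addition/doubling on y^2 = x^3 + a*x + b over F_p (b is not needed)."""
    if P is None:
        return Q
    if Q is None:
        return P
    xP, yP = P
    xQ, yQ = Q
    if xP == xQ and (yP + yQ) % p == 0:
        return None                                    # Q = -P, so P + Q = O
    if P == Q:
        lam = (3 * xP * xP + a) * pow((2 * yP) % p, -1, p) % p    # doubling
    else:
        lam = (yQ - yP) * pow((xQ - xP) % p, -1, p) % p           # addition
    x = (lam * lam - xP - xQ) % p
    y = (lam * (xP - x) - yP) % p
    return (x, y)

def ec_mul(k: int, P: Point, a: int, p: int) -> Point:
    """Left-to-right double-and-add computation of kP."""
    R: Point = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R, a, p)
        if bit == "1":
            R = ec_add(R, P, a, p)
    return R

G = (5, 1)                             # a point on y^2 = x^3 + 2x + 2 over F_17
print(ec_add(G, G, 2, 17))             # the point 2G
print(ec_mul(19, G, 2, 17))            # None, i.e. O, if G has order 19
```

Each call to `pow(..., -1, p)` is the field inversion mentioned above; projective coordinates would replace it by a handful of extra multiplications.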

Before defining the ECDLP we define another parameter of the EC. The order<br />

of a point P on an EC is the smallest positive integer n such that nP = O, where<br />

nP is the point multiplication defined in Equation 1.16.<br />

The ECDLP is defined as follows: given a curve E over F, a point P ∈ E<br />

of order n and a point Q ∈ E, find an integer l, 0 ≤ l ≤ n − 1,<br />

for which Q = lP, provided that such an integer exists.<br />

As an example of cryptographic operations computed on the EC we mention<br />

the elliptic curve digital signature algorithm (ECDSA), the equivalent of the DSA<br />

in the EC domain. The key is generated by the steps described in<br />

Algorithm 1 – 5 [78].<br />

The signature of a message m of arbitrary length is computed as described<br />

in Algorithm 1 – 6 [78].<br />

From a practical point of view, the performance of ECC depends on an efficient<br />

implementation of the finite field operations and a fast algorithm for the scalar<br />

multiplication.<br />

Algorithm 1 – 5 Key generation <strong>in</strong> ECC [78]<br />

Require: E is an EC over F, P is a po<strong>in</strong>t of order n on curve E.<br />

Ensure: Pair of private key and public key.<br />

1: Choose a random <strong>in</strong>teger d, 0 < d < n<br />

2: Q ⇐ dP<br />

3: return Q, the public key<br />

4: return d, the private key<br />

Algorithm 1 – 6 Message sign<strong>in</strong>g <strong>in</strong> ECC [78]<br />

Require: Message m with an arbitrary length, a hash value h(m) obta<strong>in</strong>ed from a<br />

one-way function.<br />

Ensure: Signature of the message m.<br />

1: Choose random <strong>in</strong>teger k, 0 < k < n<br />

2: $(x_1, y_1) \Leftarrow kP$ and $r \Leftarrow x_1 \bmod n$ $(0 < x_1 < q - 1)$<br />

3: if r = 0 then<br />

4: Go back to the step 1.<br />

5: end if<br />

6: Compute $k^{-1} \bmod n$<br />

7: $s \Leftarrow k^{-1}\{h(m) + dr\} \bmod n$<br />

8: if s = 0 then<br />

9: Go back to the step 1.<br />

10: end if<br />

11: return (r, s)<br />
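Algorithms 1 – 5 and 1 – 6 can be sketched together in Python. Everything concrete here is an assumption made for illustration: the toy curve y² = x³ + 2x + 2 over F_17 with a base point of order 19 (far too small to be secure), SHA-256 as the one-way function h, the reduction of the hash modulo n, and all function names.

```python
import hashlib
import secrets

# Toy parameters (illustrative only): curve coefficient a, field prime,
# base point G and its order n.
A, P_MOD, G, N = 2, 17, (5, 1), 19

def ec_add(p1, p2):
    """Affine point addition; None models the point at infinity O."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, pt):
    """Double-and-add point multiplication k*pt."""
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)
        if bit == "1":
            acc = ec_add(acc, pt)
    return acc

def hash_to_int(msg: bytes) -> int:
    """h(m) reduced modulo n (a simplification chosen for this sketch)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N

def generate_keypair():
    """Algorithm 1-5: private key d with 0 < d < n, public key Q = dG."""
    d = secrets.randbelow(N - 1) + 1
    return d, ec_mul(d, G)

def sign(msg: bytes, d: int):
    """Algorithm 1-6: returns the signature (r, s) of msg under private key d."""
    while True:
        k = secrets.randbelow(N - 1) + 1          # step 1: 0 < k < n
        x1, _ = ec_mul(k, G)                      # step 2: (x1, y1) = kG
        r = x1 % N
        if r == 0:
            continue                              # steps 3-5: retry
        s = pow(k, -1, N) * (hash_to_int(msg) + d * r) % N   # steps 6-7
        if s == 0:
            continue                              # steps 8-10: retry
        return r, s
```

On a real curve the structure stays the same; only the parameters and the field arithmetic, which is where the modular multiplier is used, grow to hundreds of bits.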

1.4 Conclusions<br />

In this section we have presented the two public-key cryptosystems that are most<br />

important nowadays, namely RSA and ECC.<br />

While RSA has been massively applied by industry for several years, ECC, as a<br />

relatively new family of cryptographic algorithms, is only starting to establish itself as<br />

the better choice for implementing public-key algorithms, especially on energy- and<br />

space-limited platforms. The possibility of using a much shorter key, and therefore lighter<br />

arithmetic operations, makes ECC very suitable for hardware implementation.<br />

The description of both algorithms given in the thesis focuses on their most<br />

intensively used and heaviest operation, the modular multiplication. This makes<br />

the multiplication an important target for our research, because improvements<br />

in the implementation of the MM have a significant impact on the performance of any<br />

system based on modular operations, such as RSA or ECC.<br />

The common part of both introduced cryptosystems is a modular multiplier. After<br />

this theoretical introduction we continue with a description of multiplication algorithms<br />

adapted to the target hardware architecture and of the implementation itself.<br />


2 Montgomery Modular Multiplication in Hardware<br />

In this chapter we present the results of our research in the area of efficient<br />

implementation of the Montgomery modular multiplication (MMM) and its application in<br />

cryptographic systems. The resulting multiplier design can be included in cryptosystems<br />

or accelerators as a supporting unit for computationally heavy operations in<br />

public-key algorithms such as RSA or ECC.<br />

We focus on the design of the processing element (PE) that computes the MMM<br />

and of the coprocessor that includes, besides the PE(s), also the memory registers and<br />

an interface to the control unit.<br />

Results of the research were published in the articles [46, 49, 50,<br />

113, 117, 118]. The main achievements of our research lie in the following<br />

areas:<br />

• Analysis of two PE concepts – algorithm improvement, effective implementation<br />

in chosen FPGA families, comparison of the concepts,<br />

• MMM coprocessor design – software-hardware co-design, scalability and<br />

parametrisation, interface with a control unit.<br />

Section 2.1 explains the concept of a scalable MMM design. In Section 2.2<br />

we analyse the MMM algorithms and architectures with regard to their effective<br />

implementation in reconfigurable hardware structures. The results of the area occupation and<br />

timing analysis are summarised in Section 2.3 and provide information on the available<br />

choices of multiplier parameters. The chapter is closed by Section 2.4, which<br />

summarises the discussed issues.<br />

2.1 Scalable MMM design<br />

An arithmetic unit is called scalable if it can be reused or replicated in order to<br />

generate long-precision results independently of the data precision for which the<br />

unit was originally designed [108]. In cryptography, the length of the input operands<br />

and the key may vary depending on the chosen cipher working mode or when<br />

the algorithm is updated to a different security level. Hence, scalability is a desirable<br />

feature of a cryptographic arithmetic unit; in such cases it pays<br />

off through reduced implementation costs. On the other hand, well-scalable<br />

designs can be slower than less universal ones optimised for selected parameters:<br />

the more universal a design is, the lower is its speed in comparison to a system<br />

designed for fixed operand parameters.<br />

A typical scalable coprocessor consists of two separate blocks, memory registers<br />

and an arithmetic logic unit (ALU), connected by a w-bit data path as shown in Figure 2 –<br />

1. The word width w determines the smallest data unit operated on, the<br />

word: the operand length k is divided into shorter lengths that are more suitable for<br />

the target hardware structure, usually multiples of 8 bits.<br />

[Figure: block diagram with data input, data memory, scalable ALU, data output and control logic; memory and ALU are connected by a w-bit data path]<br />

Figure 2 – 1 Architecture of a general scalable coprocessor based on separate memory and ALU<br />

connected by w-bit data-path<br />

The separation of the ALU and the memory is the first fundamental difference from<br />

FPGA designs of the MMM optimized for fixed-length operands (e.g. [29,<br />

41]). The scalable algorithm requires word-oriented processing that makes it<br />

possible to change the number of words, or even the word width w. Normally w is<br />

smaller than the operand length k, therefore the computation time is proportionally<br />

longer. Better performance can still be achieved by implementing a smaller but<br />

faster ALU that allows a higher clock frequency.<br />

Let us consider w-bit words. For operands with k-bit precision, $e_1 = \lceil (k + 1)/w \rceil$<br />

words are required for Algorithm 1 – 3. The extra bit used in the calculation of $e_1$ is<br />

required since $S_i$ (the internal variable of the radix-2 algorithm) is in the range [0, 2M − 1]<br />

[108]. All the computations of Algorithm 1 – 3 must therefore be done with an extra<br />

bit of precision. The input operands need an extra zero bit at the MSB<br />

position in order to have the precision extended to the correct value.<br />

Algorithm 1 – 4 requires $e_2 = \lceil (k + 3)/w \rceil$ words in order to support the extended<br />

range of the input variables X, $\tilde{Y}$ and of the internal variable $S_i$. Note that in many practical<br />

configurations $e_1 = e_2$ and no additional words are required for Algorithm 1 – 4. The<br />

operand X needs two extra zero bits at the MSB and the subsequent position in<br />

order to have the precision extended to the k + 3 cycles required by Algorithm 1 – 4.<br />

In practical configurations k ≥ 1024, therefore the difference in the number of cycles is<br />

not significant. On the other hand, the possibility to remove the correction unit from the<br />

hardware design of Algorithm 1 – 4 brings a valuable advantage.<br />
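The two word counts are easy to compare numerically. The helper below is a small illustrative sketch (its name is hypothetical) that evaluates both ceilings with integer arithmetic.

```python
def words_needed(k: int, w: int) -> tuple:
    """Return (e1, e2): word counts for Algorithm 1-3 and Algorithm 1-4."""
    e1 = -(-(k + 1) // w)      # ceil((k + 1) / w)
    e2 = -(-(k + 3) // w)      # ceil((k + 3) / w)
    return e1, e2

for k, w in ((1024, 32), (1024, 16), (30, 32)):
    print(k, w, words_needed(k, w))
```

For k = 1024 both word widths give e1 = e2, whereas the artificial case k = 30, w = 32 shows that the two extra bits can cost one additional word when k + 1 is close to a multiple of w.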

In the rest of the thesis the symbols $e_1$ and $e_2$ denote the number of<br />

words where we need to emphasise the difference between the<br />

algorithms; we use the symbol e when we mean the number of words in general.<br />

2.1.1 Scalable Multiple-Word Algorithms<br />

The operations in Algorithm 1 – 3 and Algorithm 1 – 4 are performed on full-precision<br />

operands and do not provide the scalability feature explained above. We analyse the<br />

relations between the parameters of the multipliers and the underlying FPGA structure and<br />

provide a solution suitable for devices with a fast carry architecture.<br />

A scalable algorithm in which the operand Y (multiplicand) is scanned<br />

word-by-word and the operand X (multiplier) is scanned bit-by-bit was proposed in<br />

[108, 109]. The Multiple Word Radix-2 Montgomery Multiplication algorithm<br />

(MWR2MM) uses the following vectors:<br />

$M = (M^{(e-1)}, \dots, M^{(1)}, M^{(0)})$, (2.1)<br />

$Y = (Y^{(e-1)}, \dots, Y^{(1)}, Y^{(0)})$,<br />

$S = (S^{(e-1)}, \dots, S^{(1)}, S^{(0)})$,<br />

$X = (x_{k-1}, \dots, x_1, x_0)$,<br />

where the words are marked with superscripts and the bits are marked with<br />

subscripts. The concatenation of vectors a and b is written as (a, b). A particular range<br />

of bits in a vector a from position i to position j, j > i, is expressed as $a_{j..i}$.<br />

Bit position i of the k-th word of a is represented by the symbol $a_i^{(k)}$.<br />

The details of the MWR2MM algorithm (further referred to as MWR2MM CSA,<br />

where CSA stands for Carry-Save Adder) are given in [108]; in the thesis it is<br />

denoted as Algorithm 2 – 1. The optimized version of the MMM, Algorithm 1 – 4, can be<br />

transformed to a multiple-word form (referred to as MWR2MM CPA, where CPA<br />

stands for Carry-Propagate Adder) in a similar way, as shown in Algorithm 2 – 2. The<br />

names of the algorithms reflect the way they are implemented,<br />

which is explained in the following parts of the thesis.<br />

The algorithms compute a partial sum S for each bit of X, scann<strong>in</strong>g the words<br />

of Y and M. Once the precision is exhausted, another bit of X is taken, and the<br />

scan is repeated. Thus, the algorithms MWR2MM CSA as well as MWR2MM CPA<br />


Algorithm 2 – 1 The multiple word radix-2 Montgomery multiplication MWR2MM CSA algorithm<br />

1: $S \Leftarrow 0$<br />

2: for i = 0 to k − 1 do<br />

3: $C \Leftarrow 0$<br />

4: $(C, S^{(0)}) \Leftarrow x_iY^{(0)} + S^{(0)}$<br />

5: $q_i \Leftarrow S^{(0)}_0$<br />

6: if $q_i = 1$ then<br />

7: $(C, S^{(0)}) \Leftarrow C + S^{(0)} + M^{(0)}$<br />

8: for j = 1 to $e_1 - 1$ do<br />

9: $(C, S^{(j)}) \Leftarrow C + x_iY^{(j)} + M^{(j)} + S^{(j)}$<br />

10: $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$<br />

11: end for<br />

12: $S^{(e_1-1)} \Leftarrow (C, S^{(e_1-1)}_{w-1..1})$<br />

13: else<br />

14: for j = 1 to $e_1 - 1$ do<br />

15: $(C, S^{(j)}) \Leftarrow C + x_iY^{(j)} + S^{(j)}$<br />

16: $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$<br />

17: end for<br />

18: $S^{(e_1-1)} \Leftarrow (C, S^{(e_1-1)}_{w-1..1})$<br />

19: end if<br />

20: end for<br />

impose no constraints on the precision of the operands. What varies is the number of<br />

loop iterations i required to accomplish the MMM operation and the number of<br />

words for the input and internal operands, $e_1$ and $e_2$, respectively. The carry variable<br />

C must be from the set {0, 1, 2}, which follows from the addition of the three vectors<br />

S, M, and $x_iY$ or $x_i\tilde{Y}$, respectively [108].<br />
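The word-serial scan of Algorithm 2 – 1 can be modelled in Python with lists of w-bit words. This is a software sketch with hypothetical names, not a description of the PE; the two branches of the `if` are merged by multiplying M with q_i, and no final subtraction is performed, so the result lies in the range [0, 2M − 1].

```python
def mwr2mm(x: int, y: int, m: int, k: int, w: int) -> int:
    """Multiple-word radix-2 Montgomery multiplication (model of Algorithm 2-1).

    Expects an odd k-bit modulus m and 0 <= x, y < m.
    Returns s with s = x*y*2^(-k) (mod m) and 0 <= s < 2*m.
    """
    e = -(-(k + 1) // w)                              # e1 = ceil((k + 1) / w)
    mask = (1 << w) - 1
    Y = [(y >> (w * j)) & mask for j in range(e)]     # word j of Y
    M = [(m >> (w * j)) & mask for j in range(e)]     # word j of M
    S = [0] * e
    for i in range(k):
        x_i = (x >> i) & 1
        t = x_i * Y[0] + S[0]
        q = t & 1                                     # q_i: LSB after adding x_i*Y(0)
        t += q * M[0]
        c, S[0] = t >> w, t & mask                    # carry c is 0, 1 or 2
        for j in range(1, e):
            t = c + x_i * Y[j] + q * M[j] + S[j]
            c, S[j] = t >> w, t & mask
            # shift right by one bit: the LSB of word j becomes the MSB of word j-1
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (c << (w - 1)) | (S[e - 1] >> 1)
    return sum(word << (w * j) for j, word in enumerate(S))
```

The same function works for any word width w; only the number of inner-loop passes changes, which is the scalability property discussed above.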

2.1.2 Comparison of Implementation Approaches<br />

Two algorithms have been chosen for the hardware implementation: the<br />

MWR2MM CSA algorithm (Algorithm 2 – 1) and the MWR2MM CPA algorithm<br />

(Algorithm 2 – 2). Our first goal is to show the difference between the algorithms on the<br />

algorithmic level; the second goal is to compare the ways in which the algorithms can be<br />

implemented.<br />

The difference between the algorithms was motivated by the possibility to omit the comparison<br />

Algorithm 2 – 2 The multiple word radix-2 Montgomery multiplication MWR2MM CPA algorithm<br />

1: $S \Leftarrow 0$<br />

2: $\tilde{Y} \Leftarrow 2Y$<br />

3: for i = 0 to k + 3 do<br />

4: $C \Leftarrow 0$<br />

5: $q_i \Leftarrow S^{(0)}_0$<br />

6: for j = 1 to $e_2 - 1$ do<br />

7: $(C, S^{(j)}) \Leftarrow C + x_i\tilde{Y}^{(j)} + q_iM^{(j)} + S^{(j)}$<br />

8: $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$<br />

9: end for<br />

10: $S^{(e_2-1)} \Leftarrow (C, S^{(e_2-1)}_{w-1..1})$<br />

11: end for<br />

of the final sum S with M at the end of the loop in Algorithm 1 – 3, at the price<br />

of some extra loop iterations in the MWR2MM CPA algorithm. Another difference is in<br />

the computation of the variable $q_i$ that decides on the addition of M. In the<br />

MWR2MM CPA algorithm its value is given directly as the LSB of the zeroth word of the internal<br />

sum S computed in the previous iteration. In contrast, the MWR2MM CSA algorithm<br />

uses a value obtained after the addition of the term $x_iY$,<br />

which increases the latency of computing $q_i$.<br />

The most important difference between MWR2MM CSA and MWR2MM CPA<br />

lies in the way the variable S is represented. In the carry-save redundant<br />

form applied in our implementation of the MWR2MM CSA algorithm, the sum S<br />

is represented as<br />

$S^{(j)} = {}^{1}S^{(j)} + r\,{}^{2}S^{(j)}$, (2.2)<br />

where r is the radix (in our implementation r = 2) and ${}^{1}S$, ${}^{2}S$ are two w-bit<br />

components of the sum S. The advantage of this representation is that no carry propagates<br />

inside the inner loop of the MMM algorithm. On the other hand, storing the<br />

partial sum S requires two w-bit registers instead of one. Only at<br />

the very end of the computation is the redundant form transformed to the normal<br />

representation by applying Equation 2.2. In this respect the CSA PE, which executes the<br />

MWR2MM CSA algorithm, is independent of the hardware platform and<br />

does not require any special features for the hardware implementation of the adders.<br />

In the implementation of the MWR2MM CPA algorithm all operands are op-<br />

24


FEI KEMT<br />

erated and stored <strong>in</strong> a non-redundant form, each requir<strong>in</strong>g w-bit register with e2<br />

words.<br />
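As a hedged illustration of Equation 2.2 with r = 2, the following sketch adds three integers with a row of independent full adders and then converts the redundant pair back to an ordinary integer. The function names are hypothetical, and Python integers stand in for registers of unlimited width.

```python
from typing import Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Add three integers bit by bit without propagating carries.

    Returns (s1, s2) with a + b + c == s1 + 2 * s2.
    """
    s1 = a ^ b ^ c                       # sum output of every full adder
    s2 = (a & b) | (a & c) | (b & c)     # carry output of every full adder
    return s1, s2


def to_non_redundant(s1: int, s2: int, r: int = 2) -> int:
    """Apply S = s1 + r * s2 to leave the redundant form."""
    return s1 + r * s2


if __name__ == "__main__":
    s1, s2 = carry_save_add(0b1011, 0b0110, 0b1101)
    print(to_non_redundant(s1, s2) == 0b1011 + 0b0110 + 0b1101)
```

Only `to_non_redundant` contains an addition with carry propagation, which mirrors the statement that the conversion is needed once, at the very end.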

The different representation of the sum S in the implementations of the MWR2MM CPA and MWR2MM CSA algorithms has the following consequences:

1. The MWR2MM CPA algorithm uses less memory (only 80% of that needed by MWR2MM CSA) for the same operand sizes.

2. The MWR2MM CPA algorithm does not require any correction unit to transform the algorithm output in the final step, while the MWR2MM CSA algorithm requires at least a final conversion to a non-redundant form.

3. The MWR2MM CPA algorithm allows a simpler computation of the internal variable $q_i$, which can simplify the architecture of the CPA PE.

4. The CSA PE is always faster than the CPA PE because it does not propagate a carry in the inner loop of the algorithm. The CPA PE is slower but uses fewer logic resources. Therefore, within the same FPGA resources more pipelined CPA PE stages can potentially be used, which can speed up the solution and yield a better area-time (AT) product.

2.2 Multiplier Architecture

In this section we present the architecture of the units implemented for computing the MMM. The units are designed as dedicated coprocessors with a standardised interface to an external control unit. This approach makes it possible to connect several units to one controller and to compute MMMs in parallel. The peripheral multiplier can be mapped into the memory of the host processor, where the control operations are triggered by an interrupt or a control register.

Another approach would be to propose a set of instructions supporting fast modular operations on a general-purpose processor. In this case, besides the resources of the target platform, the optimisation has to take into account the processor structure, which makes the design more specific to the chosen processor architecture.

In the architecture consisting of a processor and a dedicated coprocessor, no special requirements are placed on the control unit apart from the specification of the interface, since the main computational effort is done in the coprocessor. In this way significantly better use of resources can be achieved when a large general-purpose processor is replaced by a small CPU with a coprocessor.


Besides the internal structure of the multipliers, we also discuss the pipeline structure of the coprocessor and its interconnection to the host, which can be an embedded soft-core or a stand-alone processor. The scalable designs offer several parameters to be chosen after considering the required execution time and the available hardware resources.

2.2.1 Adder Concepts

In our designs we apply two different ways of implementing the adders, which are described in this section. The architectures designed for the MWR2MM CSA and MWR2MM CPA algorithms differ in the implementation of the adders inside the multiplier units.

The scalable chain of CSAs does not include any connection between the adder units (see Figure 2 – 2(b)), which makes it independent of the platform technology and of the length of the operands to be added.

The propagation of the carry bit in the CPA requires the connection length between the adders to be minimised. In an ASIC design this critical datapath can be optimised to achieve the best possible performance. In FPGAs, on the other hand, the underlying architecture cannot be changed; only the logical behaviour and the interconnections provided by the device vendor can be reconfigured. FPGA vendors provide a feature that can be exploited when a very fast connection between adjacent LEs is required, as is the case for the scalable chain of CPAs.

To accelerate the normally slow carry propagation in the CPA unit, the fast carry chain network included in modern FPGAs is used (see Figure 2 – 2(a)). The best performance of the carry chain is achieved inside one logic array block (LAB). The number of LEs in one LAB depends on the FPGA type; typical values are 16, 32, and so on. If the adder width (w) is larger than the number of LEs in a LAB, the carry chains of several LABs need to be interconnected. To preserve the fast carry connection over such a longer carry chain, the connected LABs should be placed next to each other in one column. That is possible only when the place and route (P&R) tool is able to recognise the carry chain in the synthesised logic and exploits the hardware architecture of the target device to provide a fast interconnection.

We can conclude that the speed of the CPA PE depends significantly on the word length (the length of the carry chain). However, we can suppose that up to a certain word length, $w \le w_{max}$, the speed of the CPA PE is not critical, because the final speed is dominated by the embedded memory access time or by another critical path in the logic. The value $w_{max}$ may differ between technologies due to different routing and a distinct physical layout (number of LEs in a LAB). The question is whether $w_{max}$ lies in the range of allowed values for the on-chip memory width of the available FPGAs. In that case we could store and process the variables with the optimal word width and achieve the best area-time product.

Figure 2 – 2 One level of the w-bit adder implemented as CPA and CSA with FAs: (a) carry-propagate adder, with the FAs linked by a carry chain; (b) carry-save adder, with independent FAs

Carry-Save Adder Unit The whole computational complexity of both algorithms lies in two additions of three w-bit operands for computing $S_{i+1}$. The propagation of the carry bits between the w adders is (in general) too slow. The implementation of the MWR2MM CSA in [108] uses a redundant representation of the intermediate sum S and carry-save adders [38]. The MWR2MM CSA w-bit PE architecture based on full adders (FAs) is depicted in Figure 2 – 3.

In order to reduce the storage size and the complexity of the arithmetic hardware, the variables X, Y, and M are available in a non-redundant form. The intermediate internal sum S is received and generated in the redundant form as ${}_{1}S$ and ${}_{2}S$. The advantage of the redundant form lies in the independence of the latency from the word length w, as there is no direct connection between the FAs. The output of the adders is valid right after the input signals appear, and the delay is given mainly by the internal combinational logic of the FA.
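The accumulation performed by one CSA PE can be sketched at the integer level with two layers of full adders. This is an assumption-laden illustration, not the wiring of Figure 2 – 3: it assumes that the first layer adds $x_iY$, that $q_i$ is read from the LSB after that layer, and that the second layer adds $q_iM$. The names `csa` and `csa_pe_step` are hypothetical, and word slicing and the final shift are left out.

```python
from typing import Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """One layer of full adders: a + b + c == s + 2 * carry."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def csa_pe_step(s1: int, s2: int, x_i: int, y: int, m: int) -> Tuple[int, int, int]:
    """Add x_i*Y and q_i*M to S = s1 + 2*s2 without carry propagation.

    Returns (q_i, v1, v2) with v1 + 2*v2 == s1 + 2*s2 + x_i*y + q_i*m.
    For odd m the result is even, so it can be halved afterwards.
    """
    u1, u2 = csa(s1, 2 * s2, x_i * y)    # first layer: add x_i * Y
    q_i = u1 & 1                         # LSB of S + x_i*Y, since 2*u2 is even
    v1, v2 = csa(u1, 2 * u2, q_i * m)    # second layer: add q_i * M
    return q_i, v1, v2


if __name__ == "__main__":
    q_i, v1, v2 = csa_pe_step(0b1011, 0b0101, 1, 0b1101, 0b1111)
    print(q_i, (v1 + 2 * v2) % 2 == 0)
```

Each layer consists of bitwise operations only, so its delay in this model does not depend on the operand width, which is the property the paragraph above attributes to the redundant form.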

The processing delay may increase for larger w only as a result of the broadcast problem; it does not depend on the arithmetic operation itself. Conversion into the normal non-redundant representation is done only at the very end of the MMM computation. The intermediate result of the sum S may be shifted further to another MMM unit as the operand X or Y of a new computation (e.g. the next iteration of a modular exponentiation). The redundant representation of variables, which requires twice as much memory as a non-redundant representation, and the need for transformation to and from the redundant form have been considered the main drawbacks of the MWR2MM CSA algorithm. A positive property of the implementation is its independence of the carry chain logic on the target platform.

Figure 2 – 3 Block diagram of the CSA-based w-bit MWR2MM processing element (CSA PE) based on FA

Carry-Propagate Adder Unit Recent FPGAs contain high-speed interconnect lines between adjacent logic blocks, designed to provide efficient carry propagation. The CPA PE architecture presented in this thesis is optimal for implementing the MMM unit on any FPGA that has dedicated carry logic (e.g. modern Altera and Xilinx FPGAs). The basic organization of the ALU consists of two layers of conventional CPAs, as shown in Figure 2 – 4.

Unlike the CSA PE, the CPA PE does not support an arbitrary word width w. The limit on the number of FAs in one row is given by the target technology. The more LEs are chained by a fast (and short) interconnection, the larger the word width can be while achieving speed comparable to the CSA PE. The carry signal raised in the first FA (for the LSB) is processed in the adjacent FA, which outputs another carry signal for the third adder in the row, and so on. In this way the carry signal is propagated up to the last FA (for the MSB). Once that FA receives a valid carry value and computes its outputs, the complete w-bit result can be passed on to the next computation. From this description we can see that the delay caused by the carry propagation grows linearly with the number of connections, which is given by the word width w.

Figure 2 – 4 Block diagram of CPA-based w-bit MWR2MM processing element (CPA PE) based on FA
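The sequential dependency behind this linear growth can be made visible with a small software model of a w-bit carry-propagate adder. It is only a sketch with hypothetical function names: bit position i cannot be evaluated before the carry from position i − 1 is known, so the loop has exactly w dependent steps.

```python
from typing import Tuple


def full_adder(a: int, b: int, c_in: int) -> Tuple[int, int]:
    """Single-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def ripple_carry_add(a: int, b: int, w: int) -> Tuple[int, int]:
    """w-bit carry-propagate addition; returns (sum modulo 2**w, carry_out)."""
    carry = 0
    total = 0
    for bit in range(w):                 # bit 0 is the LSB
        s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        total |= s << bit
    return total, carry


if __name__ == "__main__":
    total, carry = ripple_carry_add(0b11111111, 0b00000001, 8)
    print(total, carry)
```

In the usage example the carry generated at the LSB travels through all eight positions, which is the worst case for the chain.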

Pipeline Structure Both algorithms, MWR2MM CSA and MWR2MM CPA, share the same data dependencies. A detailed analysis of the potential inner parallelism and an investigation of a pipelined organisation suitable for an implementation of the MWR2MM CSA algorithm can be found in [108, 109]. The analysis presented there applies directly to the MWR2MM CPA algorithm as well. Its most important result, the possibility to operate the multipliers in pipelined stages, is applied in the FPGA implementations presented in this thesis.

The main advantage of the scalable architecture for the MMM lies in the fact that the PEs can easily be replicated to increase the throughput of the coprocessor [108]. In the pipelined version several slightly modified PEs (some registers have to be added for temporary data storage) are connected in a cascade (see Figure 2 – 5).

Figure 2 – 5 Pipelined organization of the MMM coprocessor based on n-stage PEs connection and separated embedded data memory

The maximum degree of pipelining that can be obtained with this architecture is:

$$n_{max} = \left\lceil \frac{e + 1}{2} \right\rceil \qquad (2.3)$$

The number 2 in the denominator expresses the number of clock cycles after which the output of the MMM unit is valid. It also means that new values of the input variables of the PEs in the pipelined row are delivered every third clock cycle. Output data from one stage are kept between the adjacent stages in temporary registers for one clock cycle and afterwards delivered to the subsequent stage. The stages include a second register at their input, which provides the total delay of two clock cycles required by the computation process.

To keep the internal control logic simple, the number of stages n is restricted to values dividing the number of words e (n|e). Thanks to this simplification, at the moment the computation finishes the last word of the sum S is at the output of the last unit in the row and is shifted directly to the memory to be stored there. For arbitrary n, a word shift between the stages at the end of the computation would have to be implemented. Adding this feature requires some extra logic in the data path, which has a negative influence on the maximal clock frequency; therefore it is not supported in our designs.
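The two constraints on n, the upper bound of Equation 2.3 and the divisibility condition n|e, can be combined in a short sketch. It assumes that the brackets in Equation 2.3 denote the ceiling function; the function names are hypothetical.

```python
from typing import List


def max_pipeline_stages(e: int) -> int:
    """n_max = ceil((e + 1) / 2), computed with integer arithmetic."""
    return (e + 2) // 2


def allowed_stage_counts(e: int) -> List[int]:
    """Stage counts n that divide e and do not exceed n_max."""
    n_max = max_pipeline_stages(e)
    return [n for n in range(1, n_max + 1) if e % n == 0]


if __name__ == "__main__":
    print(max_pipeline_stages(128))      # 65
    print(allowed_stage_counts(128))     # [1, 2, 4, 8, 16, 32, 64]
```

For e = 128 the bound is 65, but the divisibility condition leaves 64 as the largest usable number of stages.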

The number of clock cycles needed for a single MMM operation in a design containing $n \le n_{max}$ MMM units can be computed as:

$$T_{MMM} = \frac{k^2}{wn} + 2n = \left\lceil \frac{ew}{n} \right\rceil e + 2n \qquad (2.4)$$

From Equation 2.4 we can see that the number of stages n has a significant impact on the computation time: the dominant first term is inversely proportional to n. When fewer than $n_{max}$ MMM units are available, the total execution time $T_{MMM}$ increases. On the other hand, the area occupied by the coprocessor can be adapted to the area constraints of the target device. Implementing $n < n_{max}$ stages also means that more operations are needed for reading from and storing to the memory. Shifting the processed data between the stages is faster than storing the intermediate results in the memory block and reading them again to finish the computation. Therefore the best performance is achieved in a design with the maximal number of stages ($n = n_{max}$).
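As a worked illustration of Equation 2.4, the following sketch evaluates the cycle count for a few stage counts. It assumes that the brackets in the equation denote the ceiling function, that w divides k, and that k = we; the function name is hypothetical.

```python
def mmm_clock_cycles(k: int, w: int, n: int) -> int:
    """T_MMM = ceil(e*w / n) * e + 2*n with e = k / w (assumes w divides k)."""
    if k % w != 0:
        raise ValueError("this sketch assumes that w divides k")
    e = k // w
    passes = -(-(e * w) // n)            # ceiling division of e*w by n
    return passes * e + 2 * n


if __name__ == "__main__":
    for n in (1, 32, 64):
        print(n, mmm_clock_cycles(2048, 16, n))
    # prints: 1 262146, 32 8256, 64 4224
```

For k = 2048 and w = 16, doubling the number of stages from 32 to 64 roughly halves the cycle count, because the term 2n is small compared with the first term.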

Parametrisation The MMM coprocessor has three variable parameters (w, e, and n) that can be chosen for any implementation. The number of pipelined stages and the word width (n, w) can be chosen according to the required area of the implemented coprocessor and the required timing of the MMM computation. The security level of the public-key algorithm defines the length of the operands for the multiplier (k = we). This approach gives high flexibility to the design of the processor and coprocessor.

In general, there are two possible ways to increase the speed of the MMM computation in the proposed designs (see Equation 2.4 for the relations between the design parameters and the computation time $T_{MMM}$):

1. Increase the word length w. This reduces the number of iterations given by e, which yields a shorter computation time. While older FPGAs provide memory blocks with the dual-port feature and configurable word lengths only up to 16 bits (Altera Apex [8]), in high-performance models the word length can be up to 32 bits for middle-sized blocks or 128 bits for large memory blocks (Altera Stratix II [20]). Since the capacity of a block is sufficient for typical RSA operands, it makes sense to use only one block per operand. With an older technology with smaller memory blocks and a larger chosen word width (16 < w ≤ 32), two memory blocks per variable are required. Depending on the memory configuration, several variables may share one memory block. The mapping of operands to the memory is especially important for constrained SOC designs with a limited number of memory blocks.

2. Increase the number of pipelined stages n. The hardware structure of the PE in both solutions (CSA PE and CPA PE) is relatively simple and fast, and it is independent of the number of stages, which was a condition for a scalable design. Adding several pipelined stages can increase the overall speed, especially if access to the embedded memory is a bottleneck (as is the case for FPGAs with limited routing resources for large w).

From the previous analysis we can conclude that the word width w is chosen according to the target platform architecture, the organisation of its memory blocks, and its support for fast carry operations. The number of pipelined stages n is adapted to the available chip size.

2.2.2 Memory Block

The operands are stored in a memory block that is included in the data path. Optimising the memory organisation and its connection to the ALU helps to achieve better performance. Due to the intensive exchange of data between the memory and the ALU, this connection is often part of the longest (critical) path of the logic and influences the maximal clock frequency of the circuit.

Depending on the number of pipelined stages (n) and the number of iterations given by the number of words (e), the operand data are read out of the memory, processed by the PEs, and stored back several times. The memory block may contain input data loaded by a control unit, the intermediate results, and the final results ready to be sent back to a host processor after the computation has finished. Note that different words of an operand are loaded and stored at the same time. Therefore the memory has to support a dual-port configuration, which makes it possible to address reading and writing from/to separate places of the memory. The schematic organisation of the dual-port memory register inside the MMM coprocessor for one of the variables is depicted in Figure 2 – 6.

Figure 2 – 6 Organisation of the dual-port memory register inside the MMM coprocessor for one variable with e words of width w bits (memory unit of e x w bits at addresses 0 to e-1, accessed through port A and port B, each with its own data and address lines)

In the coprocessor we need to store four operands for the MMM computation: three input operands X, Y, M and the result S. The storage of S requires one register in the non-redundant representation or two registers in the redundant representation. The scalability applied to the ALU needs to be carried over to the memory block, too.

The requirements for a scalable design mean that the architecture can easily be adapted to operand lengths different from the one for which the system was originally designed. In the memory block the number of stored variables is constant (four or five, depending on the chosen implementation). What varies is the number of words and consequently the number of bits needed to address them.

We propose a model in which each word of every variable can be addressed from the coprocessor as well as from the host unit. We distinguish an internal address of a word, which specifies its location within a given coprocessor and register; a register address, which selects the register holding the required variable; and finally a coprocessor address, which distinguishes between several ALUs. With this memory management a control unit can address any word of a chosen coprocessor, store the input values for a computation there, and afterwards read the results for further processing. The number of address bits at each level can be adapted to the number of coprocessors, variables, and words. The address width is usually given by the word width of the interface between the processor and the coprocessor. For an address longer than the interface word width an appropriate address model needs to be chosen, for example accepting several address signals in parallel or distinguishing the address type in another way.

Table 2 – 1 Address of operands from host processor level (LSB right)

coprocessor | register | internal
XX          | XXX      | XXXXXXX

The memory address bits are assigned as shown in Table 2 – 1 (LSB on the right). In the presented example of the address format the CPU can handle up to 4 MMM coprocessors (two address bits), each with 8 operands (three address bits) composed of 128 words (seven address bits). Such a configuration is suitable for RSA computations with an operand length k = 2048 bits and a word width w = 16 bits, which gives e = 128 words.
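The address format of Table 2 – 1 can be sketched as a pair of packing and unpacking helpers. The field widths (2, 3 and 7 bits) come from the table; the constant and function names are hypothetical and only illustrate how a host could compose such an address.

```python
from typing import Tuple

COPROC_BITS = 2    # up to 4 coprocessors
REG_BITS = 3       # up to 8 operand registers per coprocessor
WORD_BITS = 7      # up to 128 words per register


def pack_address(coproc: int, reg: int, word: int) -> int:
    """Compose coprocessor | register | internal, with the LSB on the right."""
    if not (0 <= coproc < 1 << COPROC_BITS):
        raise ValueError("coprocessor index out of range")
    if not (0 <= reg < 1 << REG_BITS):
        raise ValueError("register index out of range")
    if not (0 <= word < 1 << WORD_BITS):
        raise ValueError("word index out of range")
    return (coproc << (REG_BITS + WORD_BITS)) | (reg << WORD_BITS) | word


def unpack_address(addr: int) -> Tuple[int, int, int]:
    """Split an address back into (coprocessor, register, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    reg = (addr >> WORD_BITS) & ((1 << REG_BITS) - 1)
    coproc = (addr >> (REG_BITS + WORD_BITS)) & ((1 << COPROC_BITS) - 1)
    return coproc, reg, word


if __name__ == "__main__":
    addr = pack_address(3, 5, 127)
    print(bin(addr), unpack_address(addr))
```

Changing the three width constants adapts the sketch to a different number of coprocessors, variables or words, as the text describes.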


2.2.3 Interface to Controller

The way in which the MMM coprocessor is connected to the control unit (e.g. an embedded processor) is important for the control of the computation process and for the exchange of the processed data.

Our first objective is to find a solution that allows a fast and flexible exchange of input and output data between the memory of the host processor and the internal memory block of the MMM coprocessor. The requirement for flexibility is related to the scalability of the coprocessor, which may include several MMM units. Moreover, the internal word widths of the control unit and the coprocessor may differ.

Another goal is to optimise the control of the coprocessor(s). Triggering the computations and then checking their status plays an important role, especially in configurations with several coprocessors (not necessarily MMM coprocessors) operated by one control unit, when it is undesirable to block the operations running on the host processor.

Finally, the goal is also to design an interface that is universal and can be connected to different types of processor buses with a minimal amount of glue logic.

The interface that satisfies the requirements mentioned above is depicted in Figure 2 – 7. The function of the particular signals is explained in the remainder of this section.

Figure 2 – 7 Proposed universal interface for the MMM coprocessor (signals: clock, reset, chip select, write enable, irq, address bus, data bus)

Status and Control Interface The operations inside the MMM coprocessor are controlled by a control register that is mapped into the control unit's memory via the interface. In the presented solution there are two control bits:

bit 0 controls the multiplication/squaring process. Set to 1 to trigger the computation, 0 for idle.

bit 1 switches between multiplication and squaring. Set to 0 to compute the MMM of the input parameters X and Y, set to 1 to square the value stored in memory register Y (multiply the operand by itself).
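The two control bits can be expressed as bit masks, as in the following sketch. The bit positions follow the description above; the constant names and the helper `control_word` are hypothetical and not taken from the thesis.

```python
CTRL_START = 1 << 0     # bit 0: 1 triggers the computation, 0 means idle
CTRL_SQUARE = 1 << 1    # bit 1: 0 multiplies X by Y, 1 squares Y


def control_word(start: bool, square: bool) -> int:
    """Build the value to be written to the control register."""
    value = 0
    if start:
        value |= CTRL_START
    if square:
        value |= CTRL_SQUARE
    return value


if __name__ == "__main__":
    print(bin(control_word(start=True, square=False)))   # 0b1
    print(bin(control_word(start=True, square=True)))    # 0b11
```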

In the solution published in [117], a status register is used to check the current status of the coprocessor and of the computational process. Its LSB is set during data storing and computation. After triggering the computation, the processor has to check the status register regularly. Once the multiplication or squaring has finished, the status bit changes to 0. The control unit is then expected to read the results from the MMM coprocessor and, if required, repeat the operation with new operands.
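The polling scheme can be sketched against a software stand-in for the coprocessor registers. Everything in this block is hypothetical: `MockCoprocessor` merely imitates a status bit that stays set for a fixed number of reads, and it does not model the timing or register map of the real design.

```python
class MockCoprocessor:
    """Software stand-in for the control and status registers (not real hardware)."""

    def __init__(self, busy_polls: int) -> None:
        self._busy_polls = busy_polls
        self._remaining = 0
        self.control = 0

    def write_control(self, value: int) -> None:
        self.control = value
        if value & 1:                    # bit 0 set: computation triggered
            self._remaining = self._busy_polls

    def read_status(self) -> int:
        if self._remaining > 0:
            self._remaining -= 1
            return 1                     # LSB set: still busy
        return 0                         # LSB cleared: result is ready


def run_and_wait(dev: MockCoprocessor, control_value: int, max_polls: int = 1000) -> int:
    """Trigger a computation and poll until the status LSB is 0.

    Returns the number of status reads that were needed.
    """
    dev.write_control(control_value)
    for polls in range(1, max_polls + 1):
        if (dev.read_status() & 1) == 0:
            return polls
    raise TimeoutError("coprocessor did not finish in time")


if __name__ == "__main__":
    print(run_and_wait(MockCoprocessor(busy_polls=3), 0b01))   # 4
```

The loop keeps the processor occupied while it waits, which is the cost that the interrupt-based variant avoids.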

The version described in [49] communicates over an interrupt (signal irq in Figure 2 – 7). This solution is more suitable for software control of coprocessors and for configurations with several MMM coprocessors. After the MMM has been computed, the interrupt signal of the host processor is asserted. This state persists until the processor reads the results within the interrupt routine. Thereafter new operands can be loaded into the memory and the whole process started again.

Memory Operations The transfer of operands between the control unit and the coprocessor is handled by a pair of control signals (chip select, denoting the particular coprocessor, and write enable, signalling a store operation) and by the address and data buses.

The format of the operand address has been explained in Table 2 – 1. The chip select signal of the corresponding coprocessor is asserted according to the address decoded by the interface. Since the input operands X, Y and M are accessed only to be stored, while the operand S is used exclusively as the output register of the coprocessor, their addresses may be shared. The particular operand register is then selected by the write enable signal and the address.

When the internal word widths of the processor and the coprocessor do not match, the interface has to provide additional functionality for memory alignment and proper decoding of the memory address.
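One possible form of such alignment is splitting each host word into several coprocessor words. The sketch below assumes that the host word width is a multiple of the coprocessor word width and that the least significant part is stored first; both are assumptions made for illustration, and the function name is hypothetical.

```python
from typing import List


def split_host_word(value: int, host_bits: int, coproc_bits: int) -> List[int]:
    """Split one host word into coprocessor words, least significant first."""
    if host_bits % coproc_bits != 0:
        raise ValueError("host width must be a multiple of the coprocessor width")
    mask = (1 << coproc_bits) - 1
    count = host_bits // coproc_bits
    return [(value >> (i * coproc_bits)) & mask for i in range(count)]


if __name__ == "__main__":
    print([hex(part) for part in split_host_word(0x12345678, 32, 16)])
    # ['0x5678', '0x1234']
```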


Clock Signal Distribution As there may be a need for faster (or, in general, different) clocking of the dedicated coprocessor, we analyse a solution with separate clock signals for the two parts of the system.

The clock signal from the control processor controls, through the bus and the interface, all transfers between the processor's and the coprocessor's memory. The operations inside the MMM coprocessor are clocked by the external (usually faster) clock signal.

Note that an additional clock signal also requires some extra resources for its generation. That may cause problems in constrained embedded systems on low-end FPGAs with a small number of clock generating circuits (e.g. PLLs). On the other hand, the performance improvement is significant. Thanks to this organisation of the clock signals, almost three times higher performance of the MMM coprocessor was obtained in [49] in comparison with the implementation using the same clock signal for both units [117].

2.3 Implementation of the MMM<br />

In this section we present the parameters obtained for the MMM units implemented<br />

according to the theory presented in the previous parts of the thesis. The<br />

MWR2MM CSA algorithm and the MWR2MM CPA algorithm are compared by<br />

implementing the PEs on several families of FPGAs produced by Altera. Further, we<br />

summarise the implementation results of the MMM coprocessor, discuss a<br />

software-hardware co-design approach and compare its results with a purely<br />

software implementation of the MMM. Finally, we provide a summary of the<br />

implementation results.<br />

2.3.1 Comparison of CSA and CPA PE<br />

Tables 2 – 2 and 2 – 3 show the results of the MWR2MM CSA and MWR2MM CPA PE<br />

implementations (including the data storage registers necessary for the pipelined version)<br />

in different Altera FPGAs for various word lengths w.<br />

Several interesting facts can be seen in these tables. With the<br />

exception of the CPA PE implemented in the ACEX family, the two solutions are<br />

technology-independent (as far as the area occupation is concerned). The size (in<br />

LEs) of the block depends almost linearly on the word length w. The CSA PE<br />

always occupies more resources than the CPA PE.<br />


Table 2 – 2 PE sizes and speeds for old style Altera FPGAs<br />

Device | w (bits) | CPA PE size (LEs) | CPA PE speed (MHz) | CSA PE size (LEs) | CSA PE speed (MHz)<br />
ACEX [7] EP1K100-1 | 8 | 66 | 161 | 81 | 232<br />
ACEX [7] EP1K100-1 | 16 | 130 | 129 | 161 | 202<br />
ACEX [7] EP1K100-1 | 32 | 258 | 99 | 321 | 170<br />
APEX [8] EP20K160-1 | 8 | 59 | 161 | 81 | 232<br />
APEX [8] EP20K160-1 | 16 | 115 | 129 | 161 | 202<br />
APEX [8] EP20K160-1 | 32 | 227 | 99 | 321 | 170<br />

Table 2 – 3 PE sizes and speeds for new style Altera FPGAs<br />

Device | w (bits) | CPA PE size (LEs) | CPA PE speed (MHz) | CSA PE size (LEs) | CSA PE speed (MHz)<br />
CYCLONE [13] EP1C20-6 | 8 | 59 | 277 | 81 | 304<br />
CYCLONE [13] EP1C20-6 | 16 | 115 | 235 | 161 | 304<br />
CYCLONE [13] EP1C20-6 | 32 | 227 | 221 | 321 | 304<br />
STRATIX [18] EP1S10-6 | 8 | 59 | 271 | 81 | 304<br />
STRATIX [18] EP1S10-6 | 16 | 115 | 248 | 161 | 304<br />
STRATIX [18] EP1S10-6 | 32 | 227 | 214 | 321 | 304<br />

The most important fact concerns the speed of the PEs. As could be expected,<br />

the CSA PE is always faster, and its speed varies with the word length w either only<br />

slightly (for old families) or almost not at all (for recent families, probably due to<br />

enhanced routing possibilities). However, the speed of the CPA PE in the older<br />

families decreases significantly with the word length (by about 40% from 8 bits to 32<br />

bits). Recent Altera devices use an enhanced carry chain. The so-called carry-select chain<br />

uses a redundant (hard-wired) carry calculation to increase the speed of carry<br />

functions. This feature makes the processing times of the CPA PE comparable to<br />

those of the CSA PE (though still about 10 to 30% slower). Since the CPA PE is about<br />

20% smaller, one can improve the final speed by increasing the number of pipelined<br />

stages. However, this approach does not seem to be adequate for word lengths w > 32 bits.<br />
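
The percentages quoted above can be rechecked directly from Tables 2 – 2 and 2 – 3.<br />
The following Python sketch performs only this arithmetic. The figures are copied from<br />
the two tables; the names PE_DATA, percent_less, compare and cpa_speed_drop are<br />
illustrative assumptions and do not come from the thesis.<br />

```python
# Figures copied from Tables 2-2 and 2-3.
# Layout: family -> word length w -> (CPA LEs, CPA MHz, CSA LEs, CSA MHz)
PE_DATA = {
    "ACEX": {8: (66, 161, 81, 232), 16: (130, 129, 161, 202), 32: (258, 99, 321, 170)},
    "APEX": {8: (59, 161, 81, 232), 16: (115, 129, 161, 202), 32: (227, 99, 321, 170)},
    "CYCLONE": {8: (59, 277, 81, 304), 16: (115, 235, 161, 304), 32: (227, 221, 321, 304)},
    "STRATIX": {8: (59, 271, 81, 304), 16: (115, 248, 161, 304), 32: (227, 214, 321, 304)},
}


def percent_less(value, reference):
    """Return by how many percent `value` lies below `reference`."""
    return 100.0 * (reference - value) / reference


def compare(family, w):
    """Relative speed and size of the CPA PE with respect to the CSA PE."""
    cpa_les, cpa_mhz, csa_les, csa_mhz = PE_DATA[family][w]
    return {
        "cpa_slower_pct": percent_less(cpa_mhz, csa_mhz),
        "cpa_smaller_pct": percent_less(cpa_les, csa_les),
    }


def cpa_speed_drop(family):
    """Percent drop of the CPA PE clock frequency from w = 8 to w = 32."""
    return percent_less(PE_DATA[family][32][1], PE_DATA[family][8][1])


if __name__ == "__main__":
    for family, rows in PE_DATA.items():
        print(f"{family}: CPA speed drop 8 -> 32 bits = {cpa_speed_drop(family):.1f}%")
        for w in rows:
            r = compare(family, w)
            print(f"  w={w:2d}: CPA slower by {r['cpa_slower_pct']:.1f}%, "
                  f"smaller by {r['cpa_smaller_pct']:.1f}%")
```

Running the script prints one line per device family and word length, which makes it<br />
easy to see how the relative figures change between the old and the recent families.<br />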


2.3.2 Montgomery Multiplication Coprocessor<br />

Having the optimised PE for the MMM computations, our objective is to complete<br />

the MMM coprocessor with all necessary parts. The memory registers, the interface<br />

to the control unit and the clock distribution logic are integral parts of the MMM<br />

coprocessor. An IP block including all the mentioned design units is very suitable for<br />

quick system development, as it provides the full functionality for operations demanding<br />

the MMM and a universal interface for connection to the control processor.<br />

The architecture of the coprocessor and all its parts has been discussed in<br />

Section 2.2. In Table 2 – 4 we provide the results for the area occupation and<br />

the critical path, expressed as the maximal clock frequency, on the Altera APEX<br />

20K200E FPGA. As the sample configuration we have chosen the MMM coprocessor<br />

with a multiplier unit based on the MWR2MM CSA algorithm with operand<br />

word width w = 32 and precision k = 1024 and k = 2048 bits, respectively.<br />

Table 2 – 4 Area occupation in number of LEs and maximal clock frequency fclkMMM (MHz) of<br />

the MMM coprocessor (w = 32, n = 1..4) with MWR2MM CSA algorithm<br />

n | LEs (k = 1024) | fclkMMM in MHz (k = 1024) | LEs (k = 2048) | fclkMMM in MHz (k = 2048)<br />
1 | 542 | 107.22 | 551 | 105.83<br />
2 | 1100 | 110.43 | 1136 | 106.96<br />
3 | 1621 | 108.34 | 1644 | 104.39<br />
4 | 1943 | 106.67 | 1980 | 103.85<br />

2.3.3 Hardware-Software Co-design of MMM: a Case Study<br />

A SOC architecture is typical for configurable platforms. Such an approach reduces<br />

the production costs and, at the same time, provides a very suitable platform for<br />

cryptographic applications. The SOC minimises the number of external interfaces<br />

and in this way also decreases the amount of leaked information.<br />

Another advantage of using a SOC is that hardware and software solutions can<br />

be compared more fairly, since in the SOC both software and hardware solutions<br />

occupy the same resources. The choice of the optimal resource utilisation can<br />

therefore be based on a proper analysis.<br />

The fully software solution usually needs relatively large logic resources and small<br />

memory resources to implement the processor, and sometimes a large memory to<br />

implement the program code. The fully hardware solution needs greater logic resources<br />

and possibly some data memory. In a mixed hardware-software design, parallel<br />

and time-critical operations can be done in hardware (dedicated coprocessors)<br />

and complex sequential and control operations in software (main processor). In<br />

our SOC design the speedup factor of the coprocessor application relative to<br />

the entirely software-based solution can be measured quite easily: both<br />

implementations use the same embedded processor, the Altera Nios soft core described in<br />

the following paragraph.<br />

Embedded Nios Processor The Nios CPU [10] is a pipelined general-purpose<br />

RISC processor that is generated by the proprietary Altera VHDL generator (SOPC<br />

Builder) and can be synthesised and embedded in all recent Altera FPGAs. The<br />

Nios supports both 32-bit and 16-bit architectural variants. Both variants use 16-bit<br />

instructions. The principal features of the Nios instruction set architecture are:<br />

1. large, windowed register file,<br />

2. simple, complete instruction set,<br />

3. powerful addressing modes,<br />

4. extensibility.<br />

Existing Nios peripherals (e.g. UART, timer, ...) as well as new custom peripherals<br />

can be connected through an Avalon bus [9]. Avalon is a simple bus architecture<br />

designed for connecting on-chip processor(s) and peripherals together into a SOC.<br />

Comparison of Implementations The Nios processor is used as a control unit<br />

in the mixed implementations and as the main processor for the software<br />

implementation. The 32-bit version of the Nios CPU can optionally be configured to include<br />

a hardware-supported integer multiplier. The additional logic is used by the MUL<br />

instruction to compute a 32-bit result in three clock cycles¹. This option is not<br />

supported in the 16-bit Nios instruction set. In order to obtain realistic comparisons,<br />

the 32-bit Nios CPU with the hardware-supported MUL instruction was used for the<br />

software implementation.<br />

¹ When using the MUL option with Altera Stratix devices, the hardware multiplier uses the<br />

Stratix DSP blocks for implementation.<br />

In order to compare the approaches, we have implemented three different systems:<br />

1. Fully software solution implemented on a 32-bit Nios processor.<br />

2. Mixed software-hardware design with a 16-bit Nios processor and the pipelined<br />

coprocessor including the CSA PE.<br />

3. Mixed software-hardware design with a 16-bit Nios processor and the pipelined<br />

coprocessor including the CPA PE.<br />

Further, we provide the details of each system design and comment on the obtained<br />

results.<br />

1. The software implementation of the MMM algorithm has been written in the<br />

Nios assembly language using all known optimisation techniques for the<br />

target processor. The Separated Operand Scanning (SOS) MMM method [39]<br />

was used as the best method for the given Nios RISC architecture [66].<br />

Table 2 – 5 shows the timings for the execution of the MMM on the fully<br />

software solution running on the processor clocked at 50 MHz. The 32-bit<br />

Nios processor occupies 2137 LEs without the logic for the integer multiplier<br />

(for the MUL instruction), which requires an additional 446 LEs.<br />

In the case of the software implementation it is effective to apply different<br />

algorithms for multiplication and squaring, which reduces the execution time of<br />

the squaring operation. However, because of the vulnerability to side-channel<br />

attacks it is better to align the execution times of both operations.<br />

Table 2 – 5 Execution times of software implementation of MMM on Altera Nios development<br />

board (with APEX EP20K200 clocked at 50 MHz)<br />

Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 | SOS32MEM | 2.40 | 1.87<br />
2048 | SOS32MEM | 9.47 | 7.24<br />

2. In the mixed hardware-software design the multiplication and squaring are<br />

completely implemented in hardware. Both operations share the same<br />

arithmetic unit. Because the computational load moves from the main<br />

processor to the dedicated coprocessor, one does not need to use the 32-bit version<br />

of the Nios core. Instead of the 32-bit controller one can include the 16-bit<br />

Nios processor, which is powerful enough to control the process and reduces the<br />

resource usage to a reasonable 1275 LEs.<br />

The MMM coprocessor is based on a 16-bit (w = 16) CSA PE with 6 (n = 6)<br />

pipelined stages and occupies 1290 LEs. The total area occupation of the<br />

second, mixed hardware-software solution is comparable to that of the purely<br />

software solution. The processor has been clocked at 50 MHz and the MMM coprocessor<br />

at 150 MHz. The times necessary for multiplication and squaring are presented in<br />

Table 2 – 6.<br />

Table 2 – 6 Execution times of mixed hardware-software implementation of MMM on Altera Nios<br />

development board (with APEX EP20K200) for the CSA PE<br />

Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 = 64 × 16 | MWR2MM CSA | 0.073 | 0.073<br />
2048 = 128 × 16 | MWR2MM CSA | 0.291 | 0.291<br />

3. The third design we analyse is based on the same system architecture as the<br />

one introduced in the second point. This time the MMM coprocessor includes<br />

a 16-bit (w = 16) CPA PE with 9 (n = 9) pipelined stages. The parameters<br />

were chosen so that the occupied area is comparable to that of the<br />

other two design variations. The processor has been clocked at 50 MHz and<br />

the MMM coprocessor at 100 MHz. The results obtained for this configuration<br />

are presented in Table 2 – 7.<br />

Table 2 – 7 Execution times of mixed hardware-software implementation of the MMM on Altera<br />

Nios development board (with APEX EP20K200) for the CPA PE<br />

Length (e × w) | Method | Multiplication (ms) | Squaring (ms)<br />
1024 = 64 × 16 | MWR2MM CPA | 0.069 | 0.069<br />
2048 = 128 × 16 | MWR2MM CPA | 0.278 | 0.278<br />
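
For readers unfamiliar with the SOS method used in the software variant, the following<br />
Python sketch shows a word-level Separated Operand Scanning Montgomery multiplication<br />
as it is commonly described in the literature. It is not the Nios assembly code of the<br />
thesis and says nothing about its optimisations; the function names (to_words, from_words,<br />
mont_mul_sos) and the little-endian word layout are assumptions made for this illustration.<br />
The modulus n must be odd and the operands must satisfy 0 ≤ a, b < n < R with R = 2^(w·s).<br />

```python
def to_words(x, w, s):
    """Split the integer x into s little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w):
    """Inverse of to_words: assemble little-endian w-bit words into an integer."""
    return sum(word << (w * i) for i, word in enumerate(words))


def mont_mul_sos(a, b, n, w, s):
    """Return a * b * R^-1 mod n for R = 2^(w*s), using SOS word scanning."""
    mask = (1 << w) - 1
    # n0' = -n^-1 mod 2^w; the modular inverse via pow() needs Python 3.8+.
    n0_prime = (-pow(n, -1, 1 << w)) & mask
    A, B, N = to_words(a, w, s), to_words(b, w, s), to_words(n, w, s)
    t = [0] * (2 * s + 1)

    # Step 1: full double-length product t = a * b.
    for i in range(s):
        carry = 0
        for j in range(s):
            acc = t[i + j] + A[j] * B[i] + carry
            t[i + j] = acc & mask
            carry = acc >> w
        t[i + s] = carry

    # Step 2: reduction, clearing one low word of t per iteration.
    for i in range(s):
        m = (t[i] * n0_prime) & mask
        carry = 0
        for j in range(s):
            acc = t[i + j] + m * N[j] + carry
            t[i + j] = acc & mask
            carry = acc >> w
        k = i + s
        while carry:  # propagate the remaining carry into the upper words
            acc = t[k] + carry
            t[k] = acc & mask
            carry = acc >> w
            k += 1

    # Step 3: divide by R (drop the s low words) and subtract n once if needed.
    u = from_words(t[s:], w)
    return u - n if u >= n else u
```

The multiplication and the reduction are kept in two separate passes, which is what<br />
distinguishes SOS from the interleaved scanning variants. For a 1024-bit operand and<br />
w = 32 the call would use s = 32 words.<br />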


2.3.4 Implementation Results<br />

The presented results have been obtained after the P&R process in the Altera Quartus<br />

development system, version 2.2. The simulation and synthesis of the designs were<br />

done in the development tools from Mentor Graphics included in the FPGA Advantage<br />

package. The carry chains in the CPA PE have been implemented using<br />

the lpm_add_sub function from the Library of Parameterized Modules (LPM), a<br />

technology-independent library of logic functions that are parameterized to achieve<br />

scalability and adaptability.<br />

All the logic has been described in VHDL, taking into account the scalability<br />

and the possible choice of the system parameters. Apart from the memory registers block<br />

and the carry chain logic, the designs are fully portable to any FPGA platform.<br />

In subsection 2.3.1 we have summarised the differences between the two<br />

chosen concepts for the implementation of the PE for the MMM. The result of the MMM<br />

coprocessor implementation shows the importance of the clock distribution unit, since<br />

the achieved maximal clock frequency of the coprocessor exceeds the typical<br />

working frequency of the control units (the Nios soft-core processor in our case).<br />

According to the previous analysis the critical path of the coprocessor does not<br />

change with an increasing number of pipelined stages n, and the relation between the<br />

occupied area and the computational time of the MMM operation stays linear.<br />

From the case study, whose objective was to find an optimal utilisation of the<br />

platform resources, we can draw the following conclusions. Of the three designs, whose<br />

parameters were chosen in order to achieve a comparable area occupation, the<br />

slowest is the software solution². The two designs including the optimised MMM units<br />

implemented in hardware provide computational times around 30 times shorter.<br />

In the comparison between the CSA and CPA concepts, the latter provides<br />

slightly better times.<br />
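
The speedup and area figures behind these conclusions follow from Tables 2 – 5 to 2 – 7<br />
and from the LE counts quoted in the case study. The sketch below only repeats that<br />
arithmetic in Python; all identifiers are illustrative assumptions. The area of the<br />
CPA-based system is not stated numerically in the text, so it is left out. The cycle<br />
count is simply the measured time expressed in coprocessor clock periods and may include<br />
overhead that is not part of the multiplier itself.<br />

```python
# Measured execution times in milliseconds (Tables 2-5, 2-6 and 2-7).
SOFTWARE_MS = {1024: {"mul": 2.40, "sqr": 1.87}, 2048: {"mul": 9.47, "sqr": 7.24}}
MIXED_MS = {"CSA": {1024: 0.073, 2048: 0.291}, "CPA": {1024: 0.069, 2048: 0.278}}

# Coprocessor clock frequencies in MHz quoted for the two mixed designs.
COPROCESSOR_MHZ = {"CSA": 150, "CPA": 100}

# Area in LEs: 32-bit Nios plus MUL logic versus 16-bit Nios plus CSA coprocessor.
AREA_LES = {"software": 2137 + 446, "mixed_csa": 1275 + 1290}


def speedup(length, pe, op="mul"):
    """Software time divided by mixed-design time for the given operand length."""
    return SOFTWARE_MS[length][op] / MIXED_MS[pe][length]


def coprocessor_cycles(length, pe):
    """Measured time of one operation expressed in coprocessor clock periods."""
    return round(MIXED_MS[pe][length] * 1e-3 * COPROCESSOR_MHZ[pe] * 1e6)


if __name__ == "__main__":
    print("area (LEs):", AREA_LES)
    for pe in ("CSA", "CPA"):
        for length in (1024, 2048):
            print(f"{pe} {length} bits: speedup {speedup(length, pe):.1f}x, "
                  f"about {coprocessor_cycles(length, pe)} coprocessor cycles")
```

Against the software squaring times the ratio is somewhat lower than against the<br />
multiplication times, because the software squaring routine is faster while the<br />
hardware uses the same unit for both operations.<br />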

² In fact the instruction set of the Nios processor has been enhanced by the hardware-supported<br />

MUL instruction. The completely software solution gives results too poor to be considered in the<br />

comparison.<br />

2.4 Conclusions and Future Work<br />

The chapter covers the topics related to the effective implementation of an algebraic<br />

coprocessor for the MMM operation. We compared two basic concepts of the multiplier<br />

architecture. The improvements of the algorithm are related to the reconfigurable<br />

platform chosen for the implementation. The pair of concepts was chosen to present<br />

the contribution of the dedicated carry chain logic in recent FPGA families and to<br />

compare it with the classical approach based on the CSA.<br />

The analysed multiplier PE provides the core unit of the developed MMM coprocessor.<br />

Attention was paid to keeping the scalability of the PE also in<br />

the other parts of the system. The interface of the coprocessor provides a flexible<br />

and powerful connection that follows the processor’s way of handling peripherals.<br />

The presented MMM coprocessor was successfully incorporated into SOCs with two<br />

types of control unit: in this chapter the Altera Nios soft-core processor was<br />

applied, and in Chapter 4 we describe a system controlled by an ARM processor.<br />

The obtained solution is very flexible and, thanks to its scalability and the possibility to<br />

choose between two types of PE, it can be adapted to a wide range of target<br />

platforms and applications. The features of the MMM coprocessor were confirmed<br />

by two proof-of-concept implementations. In this chapter we consider the coprocessor<br />

application for an RSA-based public-key cryptosystem, in which the typical operand<br />

length exceeds 1000 bits. In Chapter 4 we present a design of the coprocessor dedicated<br />

to integer factoring based on elliptic curves. The IP block covering the MMM<br />

coprocessor with all its features supports fast development of embedded systems.<br />

Among the areas in which we see possible improvements of the design we mention<br />

better memory management for variables smaller than the total capacity of the<br />

memory block. The RSA application can be enhanced by the CRT method, which<br />

requires shorter operands. Thanks to its scalability, the MMM coprocessor can<br />

readily meet such a requirement in the future.<br />


3 Elliptic Curve Method in Hardware - preliminaries<br />

Hardware implementations of factoring algorithms require special-purpose devices<br />

suitable for the effective execution of intensive computations. In this chapter we provide<br />

preliminaries for the topic of ECM hardware implementation.<br />

In Section 3.1 we start with an introduction to factoring in general and present<br />

the motivation for the implementation of the ECM in hardware. The chapter continues<br />

with a summary of previous work done in the area of ECM implementation<br />

(Section 3.2). The mathematical background of the method and a closer look at both<br />

phases of the ECM are given in Section 3.3.<br />

3.1 Integer Factoring<br />

In the previous parts of the thesis we have explained that the security of the RSA<br />

cryptosystem relies on the difficulty of factoring large integers. Hence, the<br />

development of a fast factorisation method could allow the cryptanalysis of messages<br />

encrypted or signed by RSA. However, so far the problem of factorisation has<br />

remained hard.<br />

In this section we start with basic facts on integer factoring and present the most<br />

important factoring methods. Further, the ECM is described as a promising method<br />

for hardware implementation.<br />

3.1.1 Factoring Algorithms<br />

We provide definitions of terms related to factoring and an introduction to the factoring<br />

methods; both can also be found in [80].<br />

Factoring a positive integer n means finding positive integers u and v such that<br />

the product of u and v equals n, and such that both u and v are greater than 1.<br />

Such u and v are called factors (or divisors) of n, and n = uv is called a factorisation<br />

of n. Positive integers that can be factored are called composites. Positive integers<br />

greater than 1 that cannot be factored are called primes.<br />

In some factorisation methods we use a feature of integers called smoothness. We<br />

say that a positive integer is B-smooth if all its prime factors are ≤ B. An integer<br />

is said to be smooth with respect to S, where S is some set of integers, if it can be<br />

completely factored using the elements of S. We often simply use the term smooth,<br />

in which case the bound B or the set S is clear from the context.<br />


We start with the simplest method for integer factoring, namely trial division.<br />

The smallest prime factor p of n can be found by testing whether n is divisible by each<br />

prime in succession, until p is reached. If we assume that a table of all primes ≤ p is<br />

available, this process takes π(p) division attempts (called trial divisions), where π(p)<br />

is the number of primes ≤ p, known as the prime counting function; its value can be<br />

approximated as π(p) ≈ p / ln p.<br />

Since a composite n has at least one factor ≤ √n, factoring n using trial division takes<br />

approximately √n operations in the worst case. For many composites trial division<br />

is therefore infeasible as a factoring method. For most numbers it is very effective,<br />

however, because most numbers have small factors: 88% of all positive integers have<br />

a factor < 100, and almost 92% have a factor < 1000.<br />
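
Trial division and the quoted shares can be illustrated with a few lines of Python.<br />
The sketch below is a plain textbook version, not code from the thesis; the function<br />
names are illustrative assumptions. The share of integers with a prime factor below a<br />
bound is computed as one minus the product of (1 − 1/p) over all primes p below that<br />
bound, which is the asymptotic density of such integers.<br />

```python
import math


def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes <= limit."""
    if limit < 2:
        return []
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(limit) + 1):
        if sieve[i]:
            # Clear every multiple of i starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def smallest_prime_factor(n):
    """Return (p, attempts): smallest prime factor of n >= 2 and trial divisions used."""
    attempts = 0
    for p in primes_up_to(math.isqrt(n)):
        attempts += 1
        if n % p == 0:
            return p, attempts
    return n, attempts  # no divisor up to sqrt(n), so n itself is prime


def prime_count_estimate(x):
    """Approximation pi(x) ~ x / ln x of the prime counting function."""
    return x / math.log(x)


def share_with_factor_below(bound):
    """Asymptotic share of positive integers that have a prime factor < bound."""
    no_small_factor = 1.0
    for p in primes_up_to(bound - 1):
        no_small_factor *= 1.0 - 1.0 / p
    return 1.0 - no_small_factor


if __name__ == "__main__":
    print(smallest_prime_factor(91))
    print(f"{share_with_factor_below(100):.3f} {share_with_factor_below(1000):.3f}")
    print(len(primes_up_to(1000)), f"{prime_count_estimate(1000):.1f}")
```

The last line prints the exact prime count next to the estimate, which shows how rough<br />
the approximation still is for small arguments.<br />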

Several more efficient algorithms for factoring integers have been proposed. Each<br />

algorithm is appropriate for a different situation. For instance, the ECM [82] allows<br />

the efficient factoring of numbers with relatively small factors. The generalised<br />

number field sieve (GNFS, see [81]) is the best algorithm for factoring numbers with<br />

large factors and, hence, can be used for attacking the RSA cryptosystem.<br />

In the GNFS many mid-size integers arise that have to be checked for smoothness,<br />

i.e. whether they decompose completely into small prime factors. The sieving step of<br />

the GNFS finds some of these factors. After dividing them out, one obtains a co-factor<br />

that has to be checked for smoothness. Let us call this step the co-factorisation<br />

or smoothness test. An appropriate choice for this task is the multiple polynomial<br />

quadratic sieve (MPQS, see [104]) or the ECM.<br />
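
To make the notion of the smoothness test concrete, the following Python sketch divides<br />
out factors that are assumed to be known already (as the sieving step would provide) and<br />
then tests the remaining co-factor for B-smoothness. Plain trial division is used here<br />
only as a stand-in that keeps the example short; it is not the MPQS or ECM based test<br />
discussed in the text, and the function names are illustrative assumptions.<br />

```python
def strip_known_factors(m, known_primes):
    """Divide out all powers of the already known primes; return the co-factor."""
    for p in known_primes:
        while m % p == 0:
            m //= p
    return m


def is_b_smooth(m, bound):
    """True if every prime factor of the positive integer m is <= bound."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    d = 2
    # Composite d never divides m here, because its prime factors were removed first.
    while d <= bound and m > 1:
        while m % d == 0:
            m //= d
        d += 1
    return m == 1


if __name__ == "__main__":
    value = 2**3 * 3 * 101 * 103
    cofactor = strip_known_factors(value, [2, 3])
    print(cofactor, is_b_smooth(cofactor, 200), is_b_smooth(cofactor, 102))
```

The cost of this naive test grows with the bound, which is one reason why dedicated<br />
methods are preferred for the co-factorisation step.<br />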

3.1.2 Motivation for Hardware Implementation<br />

The current world record in factoring a random RSA modulus is 200 decimal digits; it<br />

was achieved with a complete software implementation of the GNFS in 2005 [63],<br />

using the MPQS for the factorisation of the cofactors. For larger moduli it becomes<br />

crucial to use special hardware for factoring. Recently, some new hardware<br />

architectures for the sieving step in the GNFS have been proposed (e.g., SHARK [64],<br />

TWIRL [103]). The efficiency of, e.g., SHARK (and possibly other innovative<br />

GNFS realisations) is directly related to efficient support units for smoothness<br />

testing within the architecture.<br />

It appears that the ECM is a better choice than the MPQS<br />

for the smoothness test, since the MPQS requires a larger silicon area and irregular<br />

operations. The ECM, on the other hand, is an almost ideal algorithm for dramatically<br />

improving the area-time product through special-purpose hardware. We summarise<br />

the advantages of the ECM in the following points:<br />

1. ECM performs a very high number of operations on a very small set of input<br />

data, hence, it is not very I/O intensive.<br />

2. ECM requires relatively little memory compared to other methods.<br />

3. The operands needed for supporting the GNFS are well beyond the width of current<br />

computer buses, arithmetic units, and registers, so special-purpose<br />

hardware can provide much better efficiency in implementation and<br />

computational time.<br />

4. The nature of the smoothness testing in the GNFS allows a very high degree<br />

of parallelisation.<br />

The key to efficient ECM hardware with a parallel architecture lies in fast<br />

arithmetic units. Such units for modular addition and multiplication have been studied<br />

thoroughly in the last few years, e.g. for use in cryptographic devices including<br />

ECC (see e.g. [71,92]). Therefore, we could exploit the well-developed area of ECC<br />

architectures for our ECM design.<br />

3.2 Previous Implementations of ECM<br />

To our knowledge, the ECM has never been implemented in hardware before. In the<br />

context of special-purpose hardware for the GNFS, [27] mentions that the construction<br />

of special ECM hardware might be promising for supporting the GNFS. However,<br />

so far only two concepts for an ECM hardware implementation have been published.<br />

The first one, also presented in this work, is a proof-of-concept design<br />

proposed by Jan Pelzl, Martin Šimka et al. [65, 94, 120]. The second one, by Kris<br />

Gaj et al. [67], improves on our proposal and provides the most recent reference for the<br />

ECM implementation.<br />

The main differences between the two concepts are in the following areas:<br />

• control logic - external vs. internal, i.e. the way in which the control<br />

over the computation is distributed between the ECM units and the central control<br />

logic,<br />

• memory management - thanks to a better organisation of the memory registers and<br />

the use of single-port memory access, the design of Gaj et al. requires significantly<br />

fewer memory blocks than ours (with dual-port access and a separate memory<br />

block for each register),<br />

• parallelisation - better computational times are achieved by the parallel execution<br />

of arithmetic operations and the addition of a second multiplier,<br />

• Montgomery multiplier - while in our concept the multiplier design is based<br />

on the proposal of Tenca and Koç [108], in Gaj’s design the multiplier<br />

comes from McIvor and McLoone [85]. It provides a shorter computation<br />

time, but also a less flexible architecture, which can be a disadvantage when<br />

the ECM parameters change.<br />

By selecting a faster multiplier and utilising the resources better than our<br />

proof-of-concept design, the authors have improved the AT product by a<br />

factor of 3.7 for Phase 1 and 6.4 for Phase 2, using the same hardware<br />

platform.<br />

In the software domain, there have been several attempts to apply the ECM to<br />

factorisation.<br />

A parallel software implementation of the ECM on several workstations (Pentium II<br />

at 350 MHz, Linux OS) is reported in [123]. The implementation uses fast network<br />

switches and has been programmed based on the Message-Passing Interface<br />

(MPI) standard.<br />

Two massively parallel implementations of the ECM based on systolic versions of the<br />

MMM are described in [45]. The authors apply a single instruction, multiple data<br />

(SIMD) approach on a particular type of parallel computer.<br />

A well-known free software implementation of the ECM to factor integers is<br />

available from [128] (GMP-ECM). The implementation is based on the GNU multiple<br />

precision (GMP) arithmetic library. The original purpose of the project was<br />

to find a factor of 50 digits or more by ECM. The participation of several<br />

developers made GMP-ECM an excellent resource for a state-of-the-art ECM software<br />

implementation, including many useful tweaks.<br />

3.3 Mathematical Background<br />

The principles of ECM are based on Pollard's (p − 1)-method [95]. Therefore, we start with a short summary of Pollard's method and afterwards describe H. W. Lenstra's ECM [82].


3.3.1 Pollard’s (p − 1)-algorithm<br />

Let $k, n \in \mathbb{N}$, with $n$ being the composite to be factored. Furthermore, let $p \mid n$ with $p \in \mathbb{P}$. Let $a \in \mathbb{Z}$ and $n$ be co-prime, i.e. $\gcd(a, n) = 1$. Let $e = k(p - 1)$.

1. By Fermat's little theorem,

\[
a^{p-1} \equiv 1 \bmod p \;\Rightarrow\; a^{k(p-1)} \equiv 1 \bmod p \;\Leftrightarrow\; a^{e} \equiv 1 \bmod p \;\Leftrightarrow\; a^{e} - 1 \equiv 0 \bmod p \;\Leftrightarrow\; p \mid (a^{e} - 1).
\]

2. Together with $p \mid n$, this yields $\gcd(a^{e} - 1, n) > 1$.

3. If $a^{e} \not\equiv 1 \bmod n$, then $1 < \gcd(a^{e} - 1, n) < n$. In this case, we have found a non-trivial divisor of $n$.

Obviously, we cannot compute $e = k(p - 1)$ without the knowledge of $p$. Instead, we assume that $p - 1$ can be decomposed into many small factors below a certain bound $B_1$. In this case, $p - 1$ is called $B_1$-smooth.

Let $B_2$ denote the highest prime power dividing $p - 1$ and choose $e$ such that

\[
e = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}. \tag{3.1}
\]

By computing $a^{e}$ and $d = \gcd(a^{e} - 1, n)$, we hope to find a non-trivial factor $d$ of $n$.
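The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation discussed in this thesis: the function names `primes_up_to` and `pollard_p_minus_1`, the base a = 2 and the toy composite in the usage note are chosen for the example, while the exponent e follows Equation 3.1.

```python
from math import gcd


def primes_up_to(limit):
    """Return all primes <= limit (simple sieve, assumes limit >= 2)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # strike out all multiples of i starting at i*i
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def pollard_p_minus_1(n, b1, b2, a=2):
    """Try to split n with exponent e built as in Equation 3.1.

    Returns a non-trivial divisor of n, or None if this attempt fails.
    Assumes gcd(a, n) = 1 and b1 <= b2.
    """
    e = 1
    for p in primes_up_to(b1):
        power = p
        while power * p <= b2:      # largest power of p not exceeding b2
            power *= p
        e *= power
    d = gcd(pow(a, e, n) - 1, n)
    return d if 1 < d < n else None
```

For n = 1403 = 23 · 61 with B1 = 5 and B2 = 8, the exponent is e = 2³ · 3 · 5 = 120. It is a multiple of 61 − 1 = 60 but not of the order of 2 modulo 23, so `pollard_p_minus_1(1403, 5, 8)` returns 61.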

In general, Pollard's method can be defined as follows:

Let $G_p = (\mathbb{Z}_p)^{*}$ and $G_n = (\mathbb{Z}_n)^{*}$ be multiplicative groups and let $\phi$ be the canonical homomorphism

\[
\phi : G_n \to G_p \quad \text{(reduction modulo } p\text{)} \tag{3.2}
\]

A factor of $n$ is found if simultaneously $a^{e} \not\equiv 1 \bmod n$ and $a^{e} \equiv 1 \bmod p$, i.e.

\[
\forall k_1 \in \mathbb{N} : e \ne k_1 \cdot \operatorname{ord}_{G_n}(a), \qquad
\exists k_2 \in \mathbb{N} : e = k_2 \cdot \operatorname{ord}_{G_p}(\phi(a)).
\]


3.3.2 ECM Algorithm<br />

In 1987, H. Lenstra came up with the idea of translating Pollard's method from the groups $G_p$ and $G_n$ to the groups of points on elliptic curves $E$ modulo $n$ and modulo $q$ [82]. Indeed, a group operation in $E(\mathbb{Z}_n)$ can be defined by using the given addition formulae [32].

The homomorphism $\phi$ corresponding to the one defined in Equation 3.2 is:

\[
\phi : E(\mathbb{Z}_n) \to E(\mathbb{Z}_q) \quad \text{(reduction of coordinates modulo } q\text{)} \tag{3.3}
\]

The exponentiation in Pollard's (p − 1) method is replaced by a point multiplication.

Let $n$ be an integer without small prime factors which is divisible by at least two different primes, one of them $q$. Such numbers appear after trial division and a quick prime power test. Let $E(\mathbb{Z}_n)$ be an elliptic curve with good reduction at all prime divisors of $n$ (this can be checked by calculating the gcd of $n$ and the discriminant of $E$, which very rarely yields a prime factor of $n$) and let $P \in E(\mathbb{Z}_n)$, $P \ne O$, be a point.

A factor of $n$ is found if $k \cdot P$ is not equal to the identity element in $E(\mathbb{Z}_n)$ but $k \cdot \phi(P)$ equals the identity element in $E(\mathbb{Z}_q)$, i.e.

\[
\forall k_1 \in \mathbb{N} : k \ne k_1 \cdot \operatorname{ord}_{E(\mathbb{Z}_n)}(P), \qquad
\exists k_2 \in \mathbb{N} : k = k_2 \cdot \operatorname{ord}_{E(\mathbb{Z}_q)}(\phi(P)).
\]

Let the elliptic curve $E$ be defined by the homogeneous Weierstrass equation:

\[
y^2 z = x^3 + a x z^2 + b z^3 \tag{3.4}
\]

In this case, the above conditions yield two properties for the z-coordinate $z_Q$ of the resulting point $Q = k \cdot P$:

\[
k \ne k_1 \cdot \operatorname{ord}_{E(\mathbb{Z}_n)}(P) \;\Leftarrow\; n \nmid z_Q, \qquad
k = k_2 \cdot \operatorname{ord}_{E(\mathbb{Z}_q)}(\phi(P)) \;\Leftarrow\; q \mid z_Q.
\]

Under these conditions, a non-trivial factor $d$ of $n$ is obtained by $d = \gcd(z_Q, n)$.

With the assumption that the order of $P$ is $B_1$-smooth and does not contain any prime power larger than $B_2$, the scalar $k$ is computed in the same way as $e$ in Equation 3.1 as

\[
k = \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}, \qquad e_{p_i} = \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}. \tag{3.5}
\]


If the order of $P \in E(\mathbb{F}_q)$ satisfies certain smoothness conditions described below, we can discover the factor $q$ of $n$ as follows:

In the first phase of ECM, we calculate $Q = kP$, where $k$ is a product of prime powers $p^{e} \le B_1$ with appropriately chosen smoothness bounds. The second phase of ECM checks for each prime $B_1 < p \le B_2$ whether $pQ$ reduces to the neutral element in $E(\mathbb{F}_q)$. Algorithm 3 – 1 summarises all necessary steps for both phases of ECM. Phase 2 can be done efficiently, e.g., using the Weierstraß form and projective coordinates $pQ = (x_{pQ} : y_{pQ} : z_{pQ})$ by testing whether $\gcd(z_{pQ}, n)$ is bigger than 1.

Note that we can avoid all gcd computations but one, at the expense of one modular multiplication per gcd, by accumulating the numbers to be checked in a product modulo $n$ and performing one final gcd.
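The accumulation trick from the last paragraph can be illustrated with a short Python sketch. The function name `accumulated_gcd` and the toy numbers are assumptions made for this example only; the values stand in for the z-coordinates that would be tested individually otherwise.

```python
from math import gcd


def accumulated_gcd(values, n):
    """Multiply all candidates modulo n and perform a single final gcd.

    Each candidate costs one modular multiplication instead of one gcd.
    If the product happens to be 0 modulo n, the result is n itself and
    the candidates would have to be checked individually.
    """
    product = 1
    for value in values:
        product = product * value % n
    return gcd(product, n)
```

With n = 1403 = 23 · 61, the candidates 5, 122 and 7 accumulate to 61 modulo n, so the single gcd reveals the factor 61, whereas the candidates 5, 7 and 9 yield a gcd of 1.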

Algorithm 3 – 1 Elliptic Curve Method
Require: Composite n
Ensure: Factor d of n
1: Phase 1:
2: Choose arbitrary curve E(Zn) and random point P ∈ E(Zn), P ≠ O
3: Choose smoothness bounds B1, B2 ∈ N
4: Compute $k \Leftarrow \prod_{p_i \in \mathbb{P},\; p_i \le B_1} p_i^{e_{p_i}}$, $e_{p_i} \Leftarrow \max\{ r \in \mathbb{N} : p_i^{r} \le B_2 \}$
5: Compute Q = kP ⇐ (xQ, yQ, zQ)
6: Compute d ⇐ gcd(zQ, n)
7: Phase 2:
8: Set Π := 1
9: for each prime p with B1 < p ≤ B2 do
10: Compute pQ ⇐ (xpQ : ypQ : zpQ)
11: Compute Π ⇐ Π · zpQ
12: end for
13: Compute d ⇐ gcd(Π, n)
14: if 1 < d < n then
15: A non-trivial factor d is found
16: return d
17: else
18: Restart from choosing another elliptic curve in phase 1 (Step 2).
19: end if


If only a single curve is used, the properties of the ECM are related to those of Pollard's (p − 1)-method. The advantage of the ECM lies in the possibility of choosing a different curve after each unsuccessful trial, which increases the probability of finding factors of $n$.

All calculations are done modulo $n$. If the final gcd of the product $\Pi$ and $n$ satisfies

\[
1 < \gcd(\Pi, n) < n, \tag{3.6}
\]

a factor is found. The parameters $B_1$ and $B_2$ control the probability of finding a divisor $q$. More precisely, if the order of $P$ factors into a product of co-prime prime powers (each $\le B_1$) and at most one additional prime between $B_1$ and $B_2$, the prime factor $q$ is discovered.

The procedure is repeated for other elliptic curves. To generate them, one commences with the starting point $P$ and constructs an elliptic curve such that $P$ lies on it.

It is possible that more than one or even all prime divisors of $n$ are discovered simultaneously. This happens rarely for reasonable parameter choices and can be ignored by proceeding to the next elliptic curve.

The running time of the ECM is given by

\[
T(q) \underset{q \to \infty}{=} e^{(\sqrt{2} + o(1)) \sqrt{\log q \, \log \log q}} \tag{3.7}
\]

operations; thus, it mainly depends on the size of the factors to be found and not on the size of $n$ [34]. Note, however, that the operations are computed modulo $n$; hence, the running time of each operation depends on $n$.

Montgomery-Form Curves. Apart from the Weierstraß form, there are various other forms for elliptic curves. We use Montgomery's form (Equation 3.8), which was suggested by Montgomery in [89], and compute in the set $S = E(\mathbb{Z}/n\mathbb{Z})/\{\pm 1\}$ using only the x- and z-coordinates.

\[
B y^2 z = x^3 + A x^2 z + x z^2 \tag{3.8}
\]

Curves of this form always have an order divisible by 4. In our case, the curves can be chosen in such a way that their order is divisible by 12. The advantage of Montgomery form curves in cryptography is their inherent resistance against side channel attacks due to almost indistinguishable group operations, i.e. the elementary operations for addition and doubling of points are quite similar. A handicap of the Montgomery form is that not every curve can be transformed into this form. Hence, there is only limited interest in implementing ECC based on Montgomery form curves.

The residue class of $P + Q$ in this set can be computed from $P$, $Q$ and $P - Q$ using 4 multiplications and 2 squarings (see Equation 3.9). A doubling, i.e. $2P$, can be computed from $P$ and the curve parameter $A$ (see Equation 3.8) using 3 multiplications and 2 squarings (Equation 3.10). Since we are only interested in checking whether we obtain the point at infinity $O$ for some prime divisor of $n$, computing in $S$ is no restriction.

Addition:

\[
\begin{aligned}
x_{P+Q} &\equiv z_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) + (x_P + z_P)(x_Q - z_Q)]^2 \pmod{n} \\
z_{P+Q} &\equiv x_{P-Q}\,[(x_P - z_P)(x_Q + z_Q) - (x_P + z_P)(x_Q - z_Q)]^2 \pmod{n}
\end{aligned}
\tag{3.9}
\]

Doubling:

\[
\begin{aligned}
4 x_P z_P &\equiv (x_P + z_P)^2 - (x_P - z_P)^2 \pmod{n} \\
x_{2P} &\equiv (x_P + z_P)^2 (x_P - z_P)^2 \pmod{n} \\
z_{2P} &\equiv 4 x_P z_P\,[(x_P - z_P)^2 + 4 x_P z_P (A + 2)/4] \pmod{n}
\end{aligned}
\tag{3.10}
\]
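The two formulae translate almost literally into code. The following Python sketch is an illustration only: the names `xadd`, `xdbl` and `same_point`, the representation of a point as a pair (x, z), and the toy parameters in the usage note are assumptions of this example, and `a24` stands for the precomputed constant (A + 2)/4 mod n.

```python
def xadd(P, Q, diff, n):
    """(x : z) of P + Q from P, Q and diff = P - Q, following Equation 3.9."""
    (xp, zp), (xq, zq), (xd, zd) = P, Q, diff
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return (zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n)


def xdbl(P, a24, n):
    """(x : z) of 2P, following Equation 3.10, with a24 = (A + 2)/4 mod n."""
    xp, zp = P
    s = (xp + zp) ** 2 % n
    d = (xp - zp) ** 2 % n
    t = (s - d) % n                  # equals 4 * xp * zp
    return (s * d % n, t * (d + a24 * t) % n)


def same_point(P, Q, n):
    """Compare two projective (x : z) pairs by cross-multiplication."""
    return (P[0] * Q[1] - Q[0] * P[1]) % n == 0
```

As a small consistency check, take the toy modulus n = 101, A = 6 (so a24 = 2) and P = (2 : 1). Then `xdbl(P, 2, 101)` gives 2P = (9 : 35), `xadd` of 2P and P with difference P gives 3P = (45 : 74), and 4P computed either as the double of 2P or as 3P + P (with difference 2P) represents the same projective point.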

Finding Suitable Curves in Montgomery Form. Assume a curve of the form

\[
B y^2 = x^3 + A x^2 + x \quad \text{with} \quad \gcd((A^2 - 4)B, n) = 1. \tag{3.11}
\]

Such curves have a group order divisible by 4. To obtain an order divisible by 12, choose $A$ and $B$ such that

\[
A = \frac{-3a^4 - 6a^2 + 1}{4a^3}, \qquad B = \frac{(a^2 - 1)^2}{4a^3}, \qquad \text{with } a = \frac{t^2 - 1}{t^2 + 3}. \tag{3.12}
\]

The point

\[
(x_0, y_0) = \left( \frac{3a^2 + 1}{4a}, \; \frac{\sqrt{3a^2 + 1}}{4a} \right) \tag{3.13}
\]

is on the curve if $3a^2 + 1 = 4(t^4 + 3)/(t^2 + 3)^2$ is a rational square, which can be obtained by $t^2 = (u^2 - 12)/4u$ with $u^3 - 12u$ being a rational square.

First Phase of the ECM. If the triple $(P, mP, (m + 1)P)$ is given in Montgomery form, we can compute $(P, 2mP, (2m + 1)P)$ or $(P, (2m + 1)P, (2m + 2)P)$ by performing one addition (following Equation 3.9) and one doubling (following Equation 3.10) in Montgomery's form. Thus, $Q = kP$ can be calculated using $\lfloor \log_2 k \rfloor$ additions and duplications according to Algorithm 3 – 2, amounting to $11 \lfloor \log_2 k \rfloor$ multiplications. If $z_P = 1$, we can even reduce the number to $10 \lfloor \log_2 k \rfloor$ modular multiplications.

Algorithm 3 – 2 Exponentiation for Curves in Montgomery Form

Require: Integer k > 1 with $k = (k_t k_{t-1} \ldots k_1 k_0)_2$ and a point P on the curve $E_M : B y^2 = x^3 + A x^2 + x$.

Ensure: Product Q = kP.

1: Pm ⇐ P<br />

2: Pm+1 ⇐ 2P<br />

3: for i = t − 1 to 1 do<br />

4: if ki = 1 then<br />

5: Pm ⇐ Pm + Pm+1<br />

6: Pm+1 ⇐ 2Pm+1<br />

7: else<br />

8: Pm+1 ⇐ Pm + Pm+1<br />

9: Pm ⇐ 2Pm<br />

10: end if<br />

11: end for<br />

12: if k0 = 1 then<br />

13: Q ⇐ Pm + Pm+1<br />

14: else<br />

15: Q ⇐ 2Pm<br />

16: end if<br />

17: return Q<br />
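Algorithm 3 – 2 can be sketched in Python as follows. The code is a hedged illustration rather than the hardware control flow: the names `xadd`, `xdbl` and `ladder`, the pair representation (x, z) and the toy parameters in the usage note are assumptions of this example. The two helper functions restate Equations 3.9 and 3.10 so that the block is self-contained.

```python
def xadd(P, Q, diff, n):
    """(x : z) of P + Q from P, Q and diff = P - Q (Equation 3.9)."""
    (xp, zp), (xq, zq), (xd, zd) = P, Q, diff
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return (zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n)


def xdbl(P, a24, n):
    """(x : z) of 2P with a24 = (A + 2)/4 mod n (Equation 3.10)."""
    s = (P[0] + P[1]) ** 2 % n
    d = (P[0] - P[1]) ** 2 % n
    t = (s - d) % n
    return (s * d % n, t * (d + a24 * t) % n)


def ladder(k, P, a24, n):
    """Return (x : z) of kP for an integer k > 1, as in Algorithm 3 - 2."""
    if k < 2:
        raise ValueError("k must be greater than 1")
    t = k.bit_length() - 1              # index of the leading bit k_t = 1
    pm, pm1 = P, xdbl(P, a24, n)        # (mP, (m+1)P); the difference stays P
    for i in range(t - 1, 0, -1):       # bits k_{t-1} ... k_1
        if (k >> i) & 1:
            pm, pm1 = xadd(pm, pm1, P, n), xdbl(pm1, a24, n)
        else:
            pm, pm1 = xdbl(pm, a24, n), xadd(pm, pm1, P, n)
    if k & 1:                           # last bit only needs one result
        return xadd(pm, pm1, P, n)
    return xdbl(pm, a24, n)
```

With the toy values n = 101, A = 6 (a24 = 2) and P = (2 : 1), `ladder(3, P, 2, 101)` returns (45 : 74) and `ladder(4, P, 2, 101)` returns (79 : 90). In this particular example the point happens to have order 5, so `ladder(5, P, 2, 101)` returns a z-coordinate of 0, which is the situation the ECM looks for modulo a prime divisor of n.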

By handling each prime factor of $k$ separately and by using optimal addition chains, the number of multiplications can be decreased further to roughly $9.3 \lfloor \log_2 k \rfloor$ (see [89]). The addition chains can be precalculated.

Second Phase of the ECM. The standard way to calculate the points $pQ$ for all primes $B_1 < p \le B_2$ is to precompute a (small) table of multiples $kQ$, where $k$ runs through the differences of consecutive primes in the interval $[B_1, B_2]$. Then, a single point multiple $p_0 Q$ is computed, with $p_0$ being the smallest prime in that interval, and the corresponding table entries are added successively to obtain $pQ$ for the next prime $p$.


Two major improvements have been proposed for the ECM [33, 89]. Using Montgomery's form, the procedure is difficult to implement but can be improved as follows.

The following lemma allows us to reduce the complexity by repeatedly multiplying a difference of two products instead of computing complex point operations in each step of phase 2:

Lemma 1 Let $q = a + b$ with $a$ and $b$ co-prime. Furthermore, let $qQ = A + B$ with $A = aQ$ and $B = bQ$. Then $z_{qQ} \equiv 0 \bmod t$ for $\gcd(z_Q, n) = 1$ if and only if

\[
x_A \cdot z_B - z_A \cdot x_B \equiv 0 \bmod t.
\]

Proof

1. Montgomery's point addition formula 3.9 yields

\[
t \mid z_{qQ} \;\Leftrightarrow\; t \mid x_{A-B}\,[x_A \cdot z_B - z_A \cdot x_B]^2 \;\Leftarrow\; t \mid (x_A \cdot z_B - z_A \cdot x_B).
\]

2. If $z_{qQ} \equiv 0 \bmod t$, $qQ$ is the identity point on the elliptic curve over $\mathbb{F}_t$. Hence, $A = -B$, i.e. $A$ and $B$ are zero or

\[
x_A / z_A \equiv x_B / z_B \bmod t.
\]

$A = B = 0$ yields $Q = 0$, thus $t \mid z_Q$, which contradicts the assumption $\gcd(z_Q, n) = 1$. Then we have $x_A / z_A \equiv x_B / z_B \bmod t$ and $x_A \cdot z_B \equiv z_A \cdot x_B \bmod t$, respectively.

The improved standard continuation uses a parameter $2 < D < B_1$. First, a table $T$ of multiples $kQ$ of $Q$ for all $1 \le k < \frac{D}{2}$ with $\gcd(k, D) = 1$ is calculated. Each prime $B_1 < p \le B_2$ can be written as $mD \pm k$ with $kQ \in T$. Now, with Lemma 1, $\gcd(z_{pQ}, n) > 1$ if and only if $\gcd(x_{mDQ} z_{kQ} - x_{kQ} z_{mDQ}, n) > 1$. Thus, we calculate the sequence $mDQ$ (which can easily be done in Montgomery's form) and accumulate the product of all $x_{mDQ} z_{kQ} - x_{kQ} z_{mDQ}$ for which $mD - k$ or $mD + k$ is prime.
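The bookkeeping behind the improved standard continuation can be made concrete with a small Python sketch. The function names `table_indices` and `split_prime` and the choice D = 30 in the usage note are assumptions of this example; the code only handles the index arithmetic, not the point operations.

```python
from math import gcd


def table_indices(D):
    """Indices k of the table T: 1 <= k < D/2 with gcd(k, D) = 1."""
    return [k for k in range(1, (D + 1) // 2) if gcd(k, D) == 1]


def split_prime(p, D):
    """Write p as m*D + sign*k with sign in {+1, -1} and 0 <= k <= D/2."""
    m, r = divmod(p, D)
    if 2 * r <= D:
        return m, r, +1
    return m + 1, D - r, -1
```

For D = 30, the table holds the four multiples kQ with k ∈ {1, 7, 11, 13}, i.e. ϕ(30)/2 points. The prime 967 is written as 32 · 30 + 7 and the prime 977 as 33 · 30 − 13, so both are covered by table entries. For any prime p that does not divide D, the resulting k is co-prime to D and therefore appears in the table.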

The memory requirements for the improved standard continuation are $\frac{\varphi(D)}{2}$ points for the table $T$ and the points $DQ$, $(m - 1)DQ$, and $mDQ$ for computing the sequence, altogether $\varphi(D) + 6$ numbers. The computational costs consist of the generation of $T$ and the calculation of $mDQ$, which amounts to at most $\frac{D}{4} + \frac{B_2}{D} + 7$ elliptic curve operations (mostly additions), and at most $3(\pi(B_2) - \pi(B_1))$ modular multiplications, $\pi(x)$ being the number of primes up to $x$. The last term can be lowered if $D$ contains many small prime factors, since this will increase the number of pairs $(m, k)$ for which both $mD - k$ and $mD + k$ are prime. Neglecting space considerations, a good choice for $D$ is a number around $\sqrt{B_2}$ which is divisible by many small primes.

4 Elliptic Curve Method in Hardware

We present the first published hardware implementation of the ECM for integer factoring. The ECM implementation includes complete hardware logic that supports the ECM factoring of numbers up to approximately 200 bits. The proposed solution applies parameters best suited to finding factors of up to about 42 bits. The ECM design features supporting logic for the modular operations addition, subtraction, multiplication and squaring. Multiplication and squaring are computed in the MMM unit analysed in Chapter 2. The circuit also scales well to larger and smaller bit lengths. As a proof of concept, the ECM architecture has been implemented as a software-hardware co-design on an FPGA and an embedded micro-controller in a SOC. Such a design perfectly fits the needs of recent proposals for hardware architectures for the GNFS (see, e.g. [64]) and can reduce the overall costs of a GNFS device considerably.

Parts of this section were published in papers [65, 94, 120]. The research achievements described in this chapter include the following:

• ECM algorithm for hardware: algorithm adaptation and parametrisation,

• ECM implementation: unit design, parallelisation, case study for GNFS.

The ECM implementation was done as joint work, mainly with Jan Pelzl from Ruhr University Bochum. Christine Priplata and Colin Stahlke (Edizone GmbH, Germany), and Jens Franke and Thorsten Kleinjung (University of Bonn, Germany) also cooperated in the SHARK project, which includes the ECM design.

Section 4.1 describes the selection of the parameters in the ECM. The architecture of the implementation and a discussion of the chosen algorithms for the modular operations are presented in Section 4.2. Implementation details and a case study with GNFS based on ECM units are summarised in Section 4.3. Finally, we conclude the chapter with a discussion of the obtained results.

4.1 Parameterisation of the ECM Algorithm<br />

Our implementation focuses on the factorisation of numbers up to 200 bits with factors of up to around 42 bits. Thus, suitable parameters need to be found for the smoothness bounds $B_1$ and $B_2$ and for the parameter $D$ used in the improved standard continuation (see the description of the second phase of the ECM in Section 3.3.2). We look for values that yield a high probability of success and a relatively small running time and area consumption. Since the running time depends on the size of the (unknown) factors to be found, optimal parameters cannot be known beforehand. Hence, good parameters have to be found by experiments with different prime bounds.

4.1.1 Phase 1<br />

Based on software experiments, we choose $B_1 = 960$ and $B_2 = 57\,000$ as prime bounds. The value of $k$ has 1 375 bits; hence, assuming the binary method (Algorithm 3 – 2), 1 374 point additions and 1 374 point duplications are required for the execution of phase 1. Due to the use of Montgomery coordinates, the coordinate $z_P$ of the starting point $P$ can be set to 1, so that the addition takes only 5 multiplications instead of 6. The improved phase 1 (with optimal addition chains) has to use the general case, where $z_P \ne 1$. For the sake of simplicity and a preferably simple control logic, we choose the binary method for the time being. For the chosen parameters, the computational complexity of phase 1 is 13 740 modular multiplications and squarings³. With optimised addition chains, this number can be reduced to approximately 12 000 modular multiplications and squarings.
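The quoted operation count can be reproduced from the figures given above, taking 5 operations per point addition (with $z_P = 1$) and 5 per point duplication. For comparison, the second line shows the count that the general addition with 6 operations would give; it is a plain evaluation of the $11 \lfloor \log_2 k \rfloor$ estimate and is not a figure stated in the text.

```latex
\begin{align*}
1\,374 \cdot 5 + 1\,374 \cdot 5 &= 1\,374 \cdot 10 = 13\,740 && (z_P = 1) \\
1\,374 \cdot 6 + 1\,374 \cdot 5 &= 1\,374 \cdot 11 = 15\,114 && (\text{general case})
\end{align*}
```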

According to Equation 3.10, duplicating a point $2P_A = P_C$ involves the input values $x_A$, $z_A$, $A_{24}$ and $n$, where $A_{24} = (A + 2)/4$ is computed from the curve parameter $A$ (see Equation 3.8) in advance and should be stored in a fixed register. A point addition $P_C = P_A + P_B$ handles the input values $x_A$, $z_A$, $x_B$, $z_B$, $x_{A-B}$, $z_{A-B}$ and $n$ (see Equation 3.9).

Notice that the values $n$, $A_{24}$, $x_{A-B}$ and $z_{A-B}$ do not change during phase 1. Furthermore, $z_{A-B} = z_1$ can be chosen to be 1. Thus, no register is required for $z_{A-B}$. The output values $x_C$ and $z_C$ can be written to certain input registers to save memory. If we assume that the ECM unit does not execute addition and duplication in parallel, at most 7 registers for the values in $\mathbb{Z}_n$ are required for phase 1. Additionally, we require 4 temporary registers for intermediate values. Thus, a total of 11 registers is required for phase 1.

³ Squarings and multiplications are considered to have identical complexity in our case, since the same hardware unit is used for both multiplication and squaring.

4.1.2 Phase 2<br />

For the prime bounds chosen, 5 621 primes $p \in [B_1, B_2]$ have to be tested in phase 2. With the prime bounds fixed, the computational complexity depends on the size of $D$. Hence, $D$ should consist of small primes in order to keep $\varphi(D)$ as small as possible. We consider the cases $D = 6$, $D = 30$, $D = 60$ and $D = 210$. The initial values can be computed by first computing $\hat{Q} = DQ$ and then $\frac{B_1}{D} \hat{Q}$ with the binary method, which automatically yields $(\frac{B_1}{D} - 1) \hat{Q}$. The total number of modular multiplications is determined by the number of point additions, point duplications and multiplications for the product $\Pi$.

Table 4 – 1 displays the computational complexity and the number of registers required additionally for phase 2. For the numbers in the table, we assume the use of Algorithm 3 – 2 for computing the initial values. E.g., in the case $D = 30$, the computation of $DQ$, $(\frac{B_1}{D} - 1)DQ$ and $\frac{B_1}{D} DQ$ costs 8 point additions and 8 point duplications. For the same $D$, the computation of the table involves 5 point additions and 2 point duplications, yielding a total of 13 590 modular multiplications.

Remark: for the case $D = 210$, we start with $B_1 = 1\,050$ in order to ensure that $D$ and $B_1$ share the same prime factors. For phase 2, we choose $D = 30$ to obtain a minimal AT product of the design. Since $\varphi(D) = 8$ is small, only 8 additional registers are required to store all coordinates in a table. Unlike in phase 1, we have to consider the general case for point addition where $z_{A-B} \ne 1$. Hence, an additional register for this quantity is needed.

For the product $\Pi$ of all $x_A \cdot z_B - z_A \cdot x_B$, one more register is necessary. The temporary registers from phase 1 suffice to store the intermediate results $x_A \cdot z_B$, $z_A \cdot x_B$ and $x_A \cdot z_B - z_A \cdot x_B$. Hence, 10 additional registers for phase 2 yield a total of 21 required registers for both phases. The computational complexity of phase 2 is 1 881 point additions and 10 point duplications. Together with the 13 590 modular multiplications for computing the product $\Pi$, 24 926 modular multiplications and squarings are required.

Table 4 – 1 Computational complexity and memory requirements for phase 2 depending on D

        number of modular multiplications for                                       number
D       point additions                  point duplications    product Π    total     of regs.
6       (9 + 0 + 9 340) · 6 = 56 094     (9 + 0) · 5 = 45      14 625       70 764    4
30      (8 + 5 + 1 868) · 6 = 11 286     (8 + 2) · 5 = 50      13 590       24 926    10
60      (8 + 9 + 934) · 6 = 5 706        (8 + 2) · 5 = 50      13 629       19 385    18
210     (9 + 28 + 266) · 6 = 1 818       (9 + 5) · 5 = 70      13 038       14 926    50

For a high probability of success (p > 80%) of finding a single factor of 42 bits, software experiments suggest running ECM on approximately 20 different curves per candidate for the given parameters. For factors of 40 bits, only 10 curves are required on average for a similar probability of success.
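The totals in Table 4 – 1 follow directly from the per-operation costs of 6 modular multiplications per point addition and 5 per point duplication. The short Python sketch below recomputes them from the table entries; the variable and function names are chosen for this example only.

```python
ADD_COST, DUP_COST = 6, 5   # modular multiplications per addition / duplication

# D: (point additions, point duplications, multiplications for the product)
rows = {
    6: (9 + 0 + 9340, 9 + 0, 14625),
    30: (8 + 5 + 1868, 8 + 2, 13590),
    60: (8 + 9 + 934, 8 + 2, 13629),
    210: (9 + 28 + 266, 9 + 5, 13038),
}


def total_multiplications(additions, duplications, product):
    """Total number of modular multiplications for one choice of D."""
    return additions * ADD_COST + duplications * DUP_COST + product


totals = {D: total_multiplications(*row) for D, row in rows.items()}
```

The resulting dictionary reproduces the "total" column of the table, e.g. 24 926 for D = 30.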

4.2 Design of the ECM Unit<br />

The ECM unit consists of three main parts: the Arithmetic Logic Unit (ALU), the memory part (registers) and an internal control logic (see Figure 4 – 1). Each unit has a very low communication overhead since all intermediate results during computation are stored inside the unit, in the registers. Before the actual computation starts, all required initial values ($x_P$, $n$, $A_{24}$) are assigned to memory registers of the unit. This is the only data input.

The only output is the above-mentioned product $\Pi$. The number $\Pi$ is read from the unit's memory only at the very end of the computation. The computation of $\gcd(\Pi, n)$ as well as the commands for the ECM units are handled outside the ECM units by the central control logic.

[Figure: block diagram with the labels central control logic, ctrl, data, control logic, memory, ALU and ECM unit]

Figure 4 – 1 Architecture of the ECM unit


4.2.1 Control Logic<br />

The central control logic is connected to each ECM unit via a control bus (ctrl). The logic coordinates the data exchange with the unit before and after computation and starts each computation in the unit by a special set of commands. The commands contain an instruction for the next computation to be performed (i.e. add, subtract, multiply, square), including the input and output registers to be used. An operation is started by setting the start bit to the active level.

The control bus has to make it possible to specify which input register(s) and which output register are connected to the ALU. Only certain combinations of input and output registers occur, which offers the possibility of reducing the complexity of the logic and the width of the control bus by compressing the necessary information. For simplicity and clarity, we skipped this further optimisation of the commands. Instead, we use a clearly understandable structure for the commands. A command consists of 16 bits, which are assigned as shown in Table 4 – 2 (LSB is left).

Table 4 – 2 A command syntax for the ECM unit (LSB left)

start    operation    input 1    input 2    output
X        XX           XXXX       XXXX       XXXXX
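The field layout of Table 4 – 2 can be illustrated by packing and unpacking a command word in Python. This is a sketch under assumptions: the numeric operation codes in `OPCODES` and the names `FIELDS`, `encode` and `decode` are hypothetical and not given in the text; only the field order and widths (1 + 2 + 4 + 4 + 5 = 16 bits, start bit at the LSB) are taken from the table.

```python
# Hypothetical operation codes; the text does not specify the encoding.
OPCODES = {"add": 0, "sub": 1, "mul": 2, "sqr": 3}

# (field name, width in bits), listed from the LSB upwards as in Table 4 - 2
FIELDS = [("start", 1), ("operation", 2), ("input1", 4), ("input2", 4), ("output", 5)]


def encode(**values):
    """Pack the named fields into one 16-bit command word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit into {width} bits")
        word |= value << shift
        shift += width
    return word


def decode(word):
    """Unpack a command word into a dictionary of its fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

For example, a started multiplication of registers 3 and 4 with the result written to register 5 would be built with `encode(start=1, operation=OPCODES["mul"], input1=3, input2=4, output=5)`, and `decode` recovers the individual fields from the resulting word.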

If several ECM units work in parallel, only one central control logic is needed. All commands are sent in parallel to all units. Separate communication with each unit, one by one, is expected only at the beginning and at the end of the computations. The units' memory cells have to be written and read out separately. Once the computations in all units are finished, the LSB of the central status register is set to the active value to indicate the units' availability for further commands.

Each ECM unit includes some internal control logic in order to coordinate the data and command flow inside the unit. Once a command with the corresponding start bit is set, the computation inside the unit is started. The ALU is fed from the corresponding input registers and the results are stored again inside the unit in one of the registers. Once the computation is finished, a status bit is set to indicate the unit's availability for further commands.

4.2.2 Memory Management

The addresses specified above are relative addresses inside each unit, since we want to address the same register in multiple ECM units in parallel. For reading from or writing to a single register in a specific ECM unit, the unit has to be selected separately by a unique address prefix. In combination with the address of its unit, a register has a unique hardware address and can be addressed from outside the ECM unit. This is imperative since the central control logic writes data to these registers before phase 1 starts and reads data from one of the registers after phase 2 has finished.

Each register can contain n bits and is organised in $e = \lceil (n+1)/w \rceil$ words of size w (see Figure 4 – 2). Memory access is performed word-wise. Reasonable values for w are w = 4, 8, 16, 32, as dictated by the included multiplier, which requires those word widths.
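The word-wise organisation can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the helper names (`words_per_register`, `to_words`, `from_words`) are hypothetical, and storing the least significant word at index 0 is an assumption, since the text does not specify the word order.

```python
def words_per_register(n_bits, w):
    """Number of w-bit words, e = ceil((n + 1) / w), using integer arithmetic."""
    return -(-(n_bits + 1) // w)


def to_words(value, e, w):
    """Split value into e words of w bits (assumed order: index 0 is least significant)."""
    if value < 0 or value >> (e * w):
        raise ValueError("value does not fit into e words of w bits")
    mask = (1 << w) - 1
    return [(value >> (j * w)) & mask for j in range(e)]


def from_words(words, w):
    """Reassemble the integer from its word-wise representation."""
    return sum(word << (j * w) for j, word in enumerate(words))


if __name__ == "__main__":
    # Parameters quoted in the text: n = 198 with w = 32, and N = 199 with w = 8.
    print(words_per_register(198, 32), words_per_register(199, 8))
```

For the parameters quoted later in this chapter, the formula gives e = 7 for n = 198, w = 32 and e = 25 for N = 199, w = 8, which matches the values stated in the text.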

Figure 4 – 2 Organisation of the ECM unit's memory registers for 21 variables with e words of width w (registers P1 to P21, each holding words 0 to e − 1 of w bits, i.e. e × w bits per register)

The ALU performs the arithmetic modulo 2n, i.e., modular multiplication, modular squaring, modular addition and subtraction.

4.2.3 Choice of the Arithmetic Algorithms

Our main goal in designing the ECM unit was to synthesise an area-time efficient implementation. All algorithms were chosen to achieve a low area and a relatively high speed. Low area consumption can be achieved with structures that allow a certain degree of pipelining and consequently do not require much memory. For the ECM, we have chosen a set of algorithms which seem to be very well suited to this purpose. The chosen algorithms are fully scalable, which makes it possible to analyse different unit parameters and their impact on the unit's performance.

In the following, we briefly describe the algorithms for modular addition, subtraction, and multiplication to be implemented in the ALU. Squaring is done with the multiplication circuit, since a separate hardware circuit for squaring would increase the overall AT product. Similarly, subtraction can be computed with a slightly modified circuit for addition.

Modular Multiplication. An efficient Montgomery multiplier, highly suitable for our design, is described in [108]. While [108] implements a structure with carry-save adders and a redundant representation of operands, we have chosen a configuration with carry-propagate adders and a non-redundant representation, which allows a more efficient implementation, especially when the target platform supports fast carry-chain logic. A detailed analysis and comparison of both structures can be found in [46] and in Chapter 2 of this thesis.

The depicted hardware performs a slightly modified MWR2MM (Algorithm 2 – 1), but with a non-redundant carry-propagate architecture (earlier denoted as MWR2MM CPA). Therefore, our earlier remarks and analysis of parameters for the other variants of the MMM algorithm are also valid for this version. In the implemented algorithm (Algorithm 4 – 1), step (a) uses only bit operations instead of the more expensive word-wise addition originally proposed in [108].

The final reduction step of the originally proposed MMM (Algorithm 1 – 2) can be omitted when the following condition is fulfilled:

$$4M < 2^{n}. \tag{4.1}$$

With bounded input values X, Y < 2M, the output value is also bounded (S < 2M).
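The bound can be checked numerically with a minimal Python model. The sketch below is a plain bit-serial (radix-2) Montgomery multiplication on whole integers, not the word-serial Algorithm 4 – 1 and not the hardware data path; the function name and the small test modulus are chosen for illustration only.

```python
def montgomery_no_final_sub(x, y, m, n):
    """Return S = x * y * 2^(-n) (mod m) without a final subtraction.

    Preconditions: m odd, 4 * m < 2**n and 0 <= x, y < 2 * m.
    Under these conditions the result satisfies S < 2 * m.
    """
    assert m % 2 == 1 and 4 * m < (1 << n)
    assert 0 <= x < 2 * m and 0 <= y < 2 * m
    s = 0
    for i in range(n):
        s += ((x >> i) & 1) * y   # add x_i * Y
        if s & 1:                 # q_i = 1: adding the odd modulus makes S even
            s += m
        s >>= 1                   # exact division by 2
    return s


if __name__ == "__main__":
    m, n = 197, 10                # illustrative values with 4 * 197 = 788 < 1024
    ok = all(
        montgomery_no_final_sub(x, y, m, n) < 2 * m
        for x in range(2 * m)
        for y in range(0, 2 * m, 5)
    )
    print("all outputs below 2M:", ok)
```

The reason the bound holds is that after n iterations S = (XY + QM) / 2^n with Q < 2^n, so S < 4M^2 / 2^n + M < 2M whenever 4M < 2^n. The output can therefore be fed back as an input of the next multiplication without reduction.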

A minimal AT product of the multiplier alone can be achieved with a word width of 8 bits and a pipeline depth of 1 (w = 8, p = 1, see [108]). However, for our ECM architecture, the overall AT product does not depend only on the AT product of the multiplier. In fact, the multiplier takes only a comparatively small part of the overall area. On the other hand, the overall speed relies primarily on the speed of the multiplier. Thus, we choose a pipeline depth of p = 2 with a word width of w = 32 bits in order to achieve a shorter computation time for multiplication.

Algorithm 4 – 1 Modified MWR2MM algorithm
1: $S \Leftarrow 0$
2: for $i = 0$ to $n - 1$ do
3:   $q_i \Leftarrow x_i Y^{(0)}_0 + S^{(0)}_0$
4:   if $q_i = 1$ then
5:     for $j = 0$ to $e$ do
6:       $(C_a, S^{(j)}) \Leftarrow C_a + x_i Y^{(j)} + M^{(j)}$
7:       $(C_b, S^{(j)}) \Leftarrow C_b + S^{(j)}$
8:       $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$
9:     end for
10:   else
11:     for $j = 0$ to $e$ do
12:       $(C_a, S^{(j)}) \Leftarrow C_a + x_i Y^{(j)}$
13:       $(C_b, S^{(j)}) \Leftarrow C_b + S^{(j)}$
14:       $S^{(j-1)} \Leftarrow (S^{(j)}_0, S^{(j-1)}_{w-1..1})$
15:     end for
16:   end if
17:   $S^{(e)} \Leftarrow 0$
18: end for

Modular Addition and Subtraction. Addition and subtraction are implemented as one circuit. As with the multiplication circuit, the operations are done word-wise, and the word size and the number of words can be chosen arbitrarily. Since the same memory is used for input and output operands, we choose the same word size as for the multiplier. The subtraction relies on the same hardware as the adder; only one input bit has to be changed (sub = 1) in order to compute a subtraction rather than an addition (see Figure 4 – 3). All operations are done modulo 2n. Algorithms 4 – 2 and 4 – 3 show the elementary steps of a modular addition and subtraction, respectively.

If x + y ≥ 2n, a reduction can be applied by simply subtracting 2n. The variable z contains the result and T is a temporary register. A comparison z < 2n takes the same amount of time as a subtraction T = z − 2n. Thus, we compute the subtraction in all cases and decide by the sign of T which value to take as the result (z or T). If T is the correct result, the content of T has to be copied to the register z.

For a modular addition, we need at most

$$T_{add} = 3(e + 1) \tag{4.2}$$

clock cycles, where e is the number of words (for the implemented non-redundant form of operands, $e = \lceil (N+1)/w \rceil$). On average, we would only have to reduce every second time. However, since the control of phase 1 and phase 2 is parallelised for many units, we have to assume the worst-case running time, which is given by Equation 4.2.

Figure 4 – 3 Scalable addition and subtraction unit for operands with word width w (full adders FA operating bit-wise on the inputs X, Y and M from bit w − 1 down to bit 0, with carries $C_a$ and $C_b$, control input sub, and output S)

The subtraction x − y can be accomplished by adding to x the bitwise complement of y and 1. The addition of 1 is simply achieved by setting the first carry bit to one ($c_{in} = 1$) (Step 1). Since the result can be negative, a final verification is required. If necessary, the modulus has to be added. Algorithm 4 – 3 describes the modular subtraction.

In step 1, both memory cells z and T obtain the same value, which can be done in hardware in parallel without any additional overhead. After the computation of the difference, one can check the correctness of the result. Hence, subtraction can be performed more efficiently than addition and requires in the worst case

$$T_{sub} = 2(e + 1) \tag{4.3}$$

clock cycles.

Algorithm 4 – 2 Modular addition
Require: Two integers $x, y < 2n$
Ensure: Sum $z = x + y \bmod 2n$
1: $z \Leftarrow x + y$
2: $T \Leftarrow z - 2n$
3: if $T \geq 0$ then
4:   $z \Leftarrow T$
5: end if
6: return $z$

Algorithm 4 – 3 Modular subtraction
Require: Two integers $x, y < 2n$
Ensure: Difference $z = x - y \bmod 2n$
1: $T = z \Leftarrow x - y$
2: if $z < 0$ then
3:   $z \Leftarrow T + 2n$
4: end if
5: return $z$
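Algorithms 4 – 2 and 4 – 3 translate almost line by line into software. The Python sketch below is a behavioural model only; the function names are hypothetical, the parameter `two_n` stands for the quantity written 2n in the algorithms, and `sub_words` is an assumed word-wise illustration of the "add the complement with carry-in 1" trick rather than a description of the actual circuit.

```python
def mod_add(x, y, two_n):
    """Algorithm 4-2: z = x + y mod 2n for 0 <= x, y < 2n."""
    z = x + y
    t = z - two_n            # always computed; its sign selects the result
    if t >= 0:
        z = t
    return z


def mod_sub(x, y, two_n):
    """Algorithm 4-3: z = x - y mod 2n for 0 <= x, y < 2n."""
    t = z = x - y            # step 1: z and T receive the same value
    if z < 0:
        z = t + two_n
    return z


def sub_words(x_words, y_words, w):
    """Word-wise x - y computed as x + (~y) + 1 (least significant word first).

    Returns (difference words modulo 2^(e*w), is_negative).
    """
    mask = (1 << w) - 1
    carry = 1                # c_in = 1 supplies the "+1" of the two's complement
    out = []
    for xj, yj in zip(x_words, y_words):
        total = xj + (yj ^ mask) + carry
        out.append(total & mask)
        carry = total >> w
    return out, carry == 0   # no carry out of the top word means x < y


if __name__ == "__main__":
    print(mod_add(300, 200, 394), mod_sub(100, 200, 394))
    print(sub_words([3], [5], 8))
```

Because both inputs are below 2n, a single conditional correction is sufficient in either direction, which is why the worst-case cycle counts are small multiples of e + 1.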

4.2.4 Parallelization of the Algorithm

ECM can be perfectly parallelized by using different curves in parallel, since the computations of each unit are completely independent. For the control of more than one ECM unit, it is essential that both phases, phase 1 and phase 2, are controlled completely identically, independent of the composite to be factored. Only the curve parameter and possibly the modulus of the units and, hence, the coordinates of the initial point differ. Thus, all units have to be initialized individually, which is done by simply writing the values into the corresponding memory locations sequentially.

During the execution of both phases, exactly the same commands can be sent to all units in parallel. Since the runtime of multiplication/squaring is constant (it does not depend on the input values) and that of addition/subtraction differs by at most 2(e + 1) clock cycles, all units can execute the same command in approximately the same time.

After phase 2, the results are read from the units one after another. The time required for this data I/O is negligible for one ECM unit, since the computation time of both phases dominates. For several units in parallel, the computation time does not change, but the time for data I/O scales linearly with the number of units. Hence, not too many units should be controlled by a single control logic. For massively parallel ECM in hardware, the ECM units can be segmented into clusters, each with its own control unit.

4.3 Implementation of the ECM Unit

This section presents the actual hardware implementation done on an SoC (FPGA and embedded microprocessor). This first hardware implementation of ECM is designed as a proof of concept. All timings were obtained on real hardware, not only in simulation. All results have been carefully checked against a reference implementation in software.

4.3.1 Hardware Platform

The ECM implementation is realized as a hybrid design. It consists of an ECM unit implemented on an FPGA (Xilinx Virtex2000E-6) [124] and control logic implemented in software on an embedded microcontroller (ARM7TDMI, 25 MHz) [90]. The ECM unit is coded in VHDL and was simulated and synthesised for the FPGA using the FPGA Advantage tools; place and route was done in Xilinx ISE. For the actual VHDL implementation, the memory cells have been realized with the FPGA's internal block RAM. For the word width w = 32 bits, 2 blocks with $e = \lceil (N+1)/w \rceil$ words are used for each register due to the dual-port access mode and the selected multiplication algorithm.

The ECM unit, as implemented, expects commands to be written to a control register accessible by the embedded ARM processor. The required point coordinates and curve parameters are loaded into the ECM unit before the first command is decoded. For this purpose, these memory cells of the unit are accessible from the outside via a unique address. Internal registers, which are only used as temporary registers during the computation, are not accessible from the outside by the microcontroller.

The control of the whole unit is done by the microcontroller present on the board. The processor controls the data transfer from and to the units and issues the commands for all steps in phase 1 and phase 2 to the central control logic inside the FPGA. For code generation, debugging and compilation, the ARM Developer Suite 1.2 was used. For details on the ARM microprocessor, see [23]. At a later stage, a soft-core processor (in VHDL), e.g. Altera Nios [10], could be used instead of a hard-wired ARM microprocessor.

For a suitable implementation on a selected platform, one can choose the word width w, the number of words e (length of operands), the number p of pipeline stages of the multiplier, and the number of ECM units. Although the presented implementation was realised on a Xilinx Virtex-E FPGA, the proposed algorithms and the design architecture can be implemented on any FPGA. Hence, a significant speed-up on state-of-the-art devices can be expected. In any case, the platform at hand is sufficient for proof-of-concept purposes. Since the clock rate suggested by the synthesis tool was higher than the frequency actually supported by the hardware, no attempt to further accelerate the design has been made. Since no FPGA-specific optimisations were applied, the code can easily be used for different types of FPGAs that include dedicated memory blocks and fast carry-chain logic.

The actual design was done for n = 198 bit composites. The parameters for the multiplier are p = 2 and w = 32. Scaling the design to bit lengths from 100 to 300 bits can be easily accomplished. In this case, the AT product will decrease or increase on the order of $O(N^2)$.

4.3.2 Results

After synthesis and place and route, the binary image was loaded onto the FPGA and clocked at a frequency of 25 MHz. Hence, the cycle length of the ALU performing the modular arithmetic is 40 ns. Table 4 – 3 shows the timings of the relevant operations of the implementation.

The hardware factorization design includes full support for all operations needed during ECM phases 1 and 2. The timings for phase 1 and 2 were obtained from timing measurements on a test board. The time for initialization and for reading from the memories is not taken into account, since it only delays the computation at the very beginning and the very end.

Although a squaring is computed with the multiplication circuit, its overhead is slightly lower, yielding a mere 0.3% faster execution. Point addition in phase 1 is more efficient since it makes use of the fact that the z coordinate of the difference of points can be chosen to be 1.

The ECM unit including full support for phases 1 and 2 of the ECM, with word width w = 32 bits, number of words e = 7 and pipeline depth p = 2, has the following area requirements: 1754 LUTs, 506 flip-flops and 44 block RAMs. The minimum clock period achieved is 26.225 ns (maximum clock frequency: 38.132 MHz). Further improvements in data organisation inside the ECM unit should yield higher performance of the whole design. The critical path of the design includes the multiplexers of the input and output buses of the memory registers. The high number of supported combinations, a consequence of the universality of the proposed design, results in complicated and hence slow logic. A more optimised data path with multiple multipliers in the ALU helps to decrease the number of supported register combinations, as shown in [67].

Table 4 – 3 Running Times of the ECM Implementation (198 bits modulus), p = 2, w = 32 (Xilinx Virtex2000E-6 and ARM7TDMI, 25 MHz)

Operation                           Time
modular addition                    2.00 µs
modular subtraction                 1.68 µs
modular multiplication              64.5 µs
modular squaring                    64.5 µs
point addition (phase 1, zQ = 1)    333 µs
point addition (phase 2)            397 µs
point doubling                      330 µs
Phase 1                             912 ms
Phase 2                             1879 ms

Due to the system's latency for loading and storing values in the registers, no more than 100 ECM units (FPGA) should be controlled by one processor. With a much higher number of units, the communication overhead would outweigh the computation time. However, the control logic of the data I/O has not yet been the focus of our optimisation efforts, and we therefore assume that slight improvements in the speed of the data I/O are still feasible. Especially when targeting an ASIC implementation, such numbers are likely to change.

4.3.3 ECM-Based Acceleration of GNFS: a Case Study

Building efficient and cheap ECM hardware can influence the overall performance of the GNFS, since ECM is well suited to the smoothness-testing step within the GNFS (see [64]). In this section, we briefly estimate the cost, space requirements and power consumption of special ECM hardware implemented as an ASIC. The motivation for this analysis is that an ASIC design can achieve roughly 10 times better performance than an FPGA design. Knowing the area requirements and timings of the ECM implementation makes it possible to compare our design fairly with other (future) solutions. In our estimate, we focus on the production cost, which we believe to be much higher than the development cost of such an ASIC. This special hardware could be produced as single ICs (such as common CPUs), ready for use in larger circuits. We choose a setting with a word width of w = 8 and assume the use of carry-save adders.

Estimation of the Runtime. We can determine the running time of both phases on the basis of the underlying ring arithmetic. The upper bounds for the number of clock cycles of a modular addition and a modular subtraction are given in Equations 4.2 and 4.3, respectively. A setting with N = 199, w = 8, p = 8, and e = 25 yields $T_{add} = 3(e + 1) = 78$ and $T_{sub} = 2(e + 1) = 52$ cycles. According to Equation 2.4, the implemented multiplier requires $T_{mul} = 666$ cycles. For each operation we should include $T_{init} = 2$ cycles for the initialisation of the ALU at the beginning of each computation.

For the group operations of phase 1 we obtain

$$T_{Padd} = 5T_{mul} + 3T_{add} + 3T_{sub} + 11T_{init} = 3\,742 \quad \text{and}$$
$$T_{Pdbl} = 5T_{mul} + 2T_{add} + 2T_{sub} + 9T_{init} = 3\,608$$

clock cycles. For phase 2, $T_{Padd}$ changes to $T'_{Padd} = 4\,410$ cycles since $z_{A-B} \neq 1$ in most cases; hence, we have to take the multiplication by $z_{A-B}$ into account.

The total cycle count for both phases is

$$T_{Phase\,1} = 1\,374\,(T_{Padd} + T_{Pdbl}) = 10\,098\,900 \quad \text{and}$$
$$T_{Phase\,2} = 1\,881\,T'_{Padd} + 50\,T_{Pdbl} + 13\,590\,(T_{mul} + T_{init}) = 17\,553\,730$$

clock cycles, where each of the 13 590 additional multiplications is counted together with its initialisation cycles. Excluding the time for pre- and post-processing, a unit needs approximately $27.7 \cdot 10^{6}$ clock cycles for both phases on one curve. If we assume a frequency of 500 MHz (for an ASIC), such a complex computation can be performed in approximately 55 ms.
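The cycle counts above can be re-derived with a few lines of Python. This is a sanity check of the arithmetic only: $T_{mul} = 666$ is taken from the text rather than recomputed, and the variable names are hypothetical. Two readings are inferred from the numbers rather than stated explicitly: $T'_{Padd} = 4\,410$ is reproduced by adding one multiplication plus its initialisation to $T_{Padd}$, and the Phase 2 total of 17 553 730 is reproduced when each of the 13 590 additional multiplications is counted with its $T_{init}$ cycles.

```python
N, w = 199, 8
e = -(-(N + 1) // w)            # ceil((N + 1) / w) = 25 words

T_add = 3 * (e + 1)             # Equation 4.2
T_sub = 2 * (e + 1)             # Equation 4.3
T_mul = 666                     # value quoted in the text (Equation 2.4)
T_init = 2                      # ALU initialisation per operation

# Group operations of phase 1
T_padd = 5 * T_mul + 3 * T_add + 3 * T_sub + 11 * T_init
T_pdbl = 5 * T_mul + 2 * T_add + 2 * T_sub + 9 * T_init

# Phase 2 point addition: one extra multiplication by z_{A-B} (inferred reading)
T_padd_phase2 = T_padd + T_mul + T_init

T_phase1 = 1374 * (T_padd + T_pdbl)
T_phase2 = 1881 * T_padd_phase2 + 50 * T_pdbl + 13590 * (T_mul + T_init)

total_cycles = T_phase1 + T_phase2
time_ms = total_cycles / 500e6 * 1e3   # assumed 500 MHz ASIC clock

if __name__ == "__main__":
    print(T_padd, T_pdbl, T_padd_phase2)
    print(T_phase1, T_phase2, total_cycles, round(time_ms, 1))
```

With these readings the script reproduces every figure quoted in the paragraph, including the total of roughly $27.7 \cdot 10^{6}$ cycles and about 55 ms per curve.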

Estimation of Area Requirements. The estimation of the area requirements is based on the results published in [108]⁴: the multiplier with w = 8 and p = 8 requires 21 400 transistors in standard CMOS technology (assuming 4 transistors per NAND gate). We assume that the circuit for addition and subtraction can be realised with at most 1 000 transistors. For the memory, we assume (area-expensive) static RAM, which requires 25 200 transistors for 21 registers. For the unit's internal control we assume an additional 6 000 transistors. Hence, one unit requires approximately 53 600 transistors. The central control requires less than 2 000 000 transistors. Assuming the CMOS technology of a standard Pentium 4 processor (0.13 µm, approx. 55 million transistors), we could fit 990 ECM units into the area of one standard processor. One ECM unit needs an area of approximately 0.1475 mm² and has a power dissipation of approximately 40 mW.

⁴ The numbers provided in that contribution refer to a multiplier built with CSAs. Since we implemented the architecture with CPAs, the given numbers are approximately 20% larger than those which would be achieved with our design.
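The transistor budget can be tallied in a few lines of Python. All figures are the ones quoted above; the variable names are hypothetical, and treating the 2 000 000 transistors of the central control as a fixed deduction from the 55 million transistor budget is an assumption about how the estimate was formed.

```python
multiplier = 21_400          # MMM with w = 8, p = 8
adder_subtractor = 1_000     # upper bound assumed in the text
memory = 25_200              # static RAM for 21 registers
unit_control = 6_000         # internal control of one unit

unit_total = multiplier + adder_subtractor + memory + unit_control

die_budget = 55_000_000      # approx. transistor count of the reference processor
central_control = 2_000_000  # upper bound quoted in the text

units_per_die = (die_budget - central_control) // unit_total

# Observation (our reading, not stated in the text): the memory figure equals
# 21 registers * (e * w = 200 bits) * 6 transistors per static RAM bit.
memory_check = 21 * (25 * 8) * 6

if __name__ == "__main__":
    print(unit_total, units_per_die, memory_check)
```

Under these assumptions the sum per unit is exactly 53 600 transistors and slightly fewer than 990 units fit into the budget, which agrees with the rounded figure in the text.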

Application to the GNFS. Considering the architecture for the special GNFS hardware of [64], we have to test approximately $1.7 \cdot 10^{14}$ co-factors of up to 125 bits for smoothness. Since both the running time and the area requirement scale linearly with the bit size, we can multiply the results from the subsections above by a factor of 125/199 ≈ 0.628. If we distribute the computation over a whole year, we have to check 5 390 665 co-factors per second⁵.
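The two numbers in this paragraph follow from simple arithmetic, sketched below in Python with hypothetical variable names. A year is assumed to have 365 days, and the scaling factor is computed relative to the N = 199 setting of the runtime estimate, since 125/199 is the ratio that rounds to the quoted 0.628.

```python
cofactors_per_year = 1.7e14
seconds_per_year = 365 * 24 * 3600          # assumed 365-day year

# Required throughput if the work is spread evenly over one year
rate_per_second = cofactors_per_year / seconds_per_year

# Linear scaling from the 199-bit estimate down to 125-bit co-factors
scale = 125 / 199

if __name__ == "__main__":
    print(round(rate_per_second), round(scale, 3))
```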

For a probability of success of p > 80%, we test 20 curves per co-factor; thus, we need approximately 3 850 000 ECM units, which would yield a total chip area of 625 000 mm² (= 4 300 ICs of the size of a Pentium 4) and a power consumption of approximately 175 kW. If we assume a cost of US$ 5 000 per 300 mm wafer, as done in [103], the ECM units would cost less than US$ 45 000 for the whole GNFS architecture, which is negligible in the context of the overall costs.

4.4 Conclusions and Future Steps

In this chapter we presented the first published implementation of the ECM in real hardware for factoring numbers of up to 200 bits. To make the implementation possible, the algorithm was adapted to the conditions imposed by the hardware, e.g. limited memory space, bus width and communication load. The parametrisation of the algorithms was chosen specifically to fit the needs of a hardware environment, yielding high efficiency with regard to the area-time product.

The sequential control part of the ECM is operated by software commands of the embedded ARM processor. For the computationally intensive operations, special-purpose hardware was implemented on a Xilinx FPGA. The ECM unit provides full support for all computations of phases 1 and 2 of the ECM. It is also possible to include more ECM units working in parallel in one FPGA chip. Our implementation shows that, due to its very low area requirements and low data I/O, ECM is well suited to hardware implementation. A single unit for factoring composites of up to 198 bits requires 506 flip-flops, 1754 lookup tables and 44 block RAMs (less than 6% of the logic and 27% of the memory resources of the Xilinx Virtex2000E device).

⁵ Note that we only take the time for finding the first factor into account. Since this happens quite seldom, we neglect the factorization of the remainder in our estimate.

Thanks to the scalability of the design, it is possible to change the data width and adapt it to the target FPGA architecture. Another advantage lies in the modularity of the design, namely the blocks for the underlying modular operations: addition/subtraction and multiplication/squaring. At this stage we re-used an MMM very similar to the versions of the multiplier described in Chapter 2.

The known drawbacks of the design are the inefficient usage of the on-chip memory blocks and the low maximum clock frequency. Our proof-of-concept design does not dedicate registers to particular arithmetic operations or data-flow directions. Since the chosen MMM algorithm requires simultaneous read and write access to the register holding the sum S, we have selected the dual-port memory mode for all registers. Similarly, the multiplexing of the registers holding input and output operands has been left universal and is therefore complicated and slow.

As demonstrated, ECM can be perfectly parallelised and, thus, an implementation at a larger scale can be used to assist the GNFS factoring algorithm by carrying out all required smoothness tests. A low-cost ASIC implementation of ECM can decrease the overall costs of the GNFS architecture SHARK, as shown in [64]. We believe that an extensive use of ECM for smoothness testing can further reduce the costs of such a GNFS machine.

As future steps, variants of phase 2 can be examined in order to achieve the lowest possible AT product. To achieve a higher maximum clock frequency of the ECM unit, the control logic inside the unit might be optimised.

Since most of the computation time is spent on modular multiplications, an improvement in the implementation of the multiplication directly affects the overall performance. Hence, alternative architectures for the multiplication can be investigated.

5 True Random Number Generator - preliminaries

Random values play a crucial role in several areas of science. Depending on the field of application, the requirements on the parameters of a random sequence and on the generator itself may vary. With respect to the origin of the sequence, we distinguish between truly random and pseudo-random sequences. The construction of a generator determines its suitability for commercial or research applications.

In this chapter we provide an introduction to the topic of randomness and random values (Section 5.1), focusing on generators applicable in cryptography. In Section 5.2 we mention typical sources for the generation of random sequences in digital circuits. In Section 5.3 we summarise the design ideas of the PLL-based generator that we will analyse in the following chapter. In Section 5.4 we explain the testing techniques applied to evaluate generators, and in Section 5.5 we discuss issues related to attacks on RNGs. Finally, in Section 5.6 we summarise the chapter.

5.1 Randomness

We start with the topic of randomness. The most natural questions that come to mind are: How can randomness be defined? Where does it come from? How can we prove that a sequence is random?

The randomness of the world we live in has long been a scientific and philosophical topic. A famous remark of Albert Einstein says that “God does not play dice with the universe”, which might convince us of the determinism of our environment. However, several phenomena in the physical world have been shown to be random in nature, e.g. the probabilistic nature of quantum mechanics, thermal and shot noise in electronic components, or nuclear decay.

The fundamental problem of randomness is that, even with an exact definition, it is very difficult to prove whether a finite numeric sequence is random or not. The randomness of a source is evaluated through the parameters of a sequence generated using that source. The way the values of the sequence are extracted from the source depends on the applied harvesting mechanism. Optimal harvesting does not disturb the random physical process and extracts as much entropy as possible.

The entropy H of a random variable X with n outcomes {x_i : i = 1, ..., n} is defined as the expected value of the negative logarithm of the outcome probabilities [68]. (The negative logarithm of the probability of the most likely output alone gives the min-entropy, which is a lower bound on H.) The entropy can be expressed by the following equation:

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i) \tag{5.1} \]

where p(x_i) is the probability of the outcome x_i and b is the base of the logarithm (b = 2 gives the entropy in bits). The higher the entropy, the less predictable the process. A completely random process with maximal entropy provides a uniformly distributed sequence. For natural sources of randomness it is usually more difficult to achieve good statistical properties, since the sequences tend to include a certain level of bias or another kind of deviation from the ideal equiprobable sequence. Post-processing sequence converters are able to improve the statistical distribution, but they usually reduce the output bit rate of the sequence.
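As a minimal illustration of Equation (5.1), the following Python sketch evaluates the entropy of a fair and of a biased binary source, together with the min-entropy for comparison. The function names and the example probabilities are our own assumptions chosen for illustration; they are not taken from the thesis.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Entropy H(X) = -sum p(x_i) * log_b p(x_i) of a discrete distribution.

    Outcomes with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)


def min_entropy(probs, base=2.0):
    """Negative logarithm of the probability of the most likely outcome."""
    return -math.log(max(probs), base)


# Hypothetical example sources: an ideal bit source and a biased one.
fair = [0.5, 0.5]
biased = [0.6, 0.4]

for name, dist in (("fair", fair), ("biased", biased)):
    print(name, shannon_entropy(dist), min_entropy(dist))
```

For the fair source both measures equal one bit per output bit; for the biased source both drop below one bit, and the min-entropy is the smaller of the two.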

Achieving a constantly high level of entropy in an RNG assures the randomness of the produced bit sequence. When designing an RNG it is important to determine the level of entropy in the source, the relation between the generator's parameters and the entropy level, and a mechanism for monitoring the entropy level.

5.1.1 Definitions of Randomness

There are several partial definitions of random numbers that help us to gather the requirements placed on random sequences and on the devices generating them. Let us mention some of them.

The following definition tells us about the process by which random numbers should be generated: a truly random number is generated by a process whose outcome is unpredictable and which cannot subsequently be reliably reproduced. Unpredictability of the process means that each output state of the process is equally probable and may be guessed correctly only with the same (negligible) probability, following the uniform distribution. The ability to reproduce the random process would require some sign of a periodic pattern in the behaviour of the process, which is undesirable in a random pattern.

Chaitin's Theorem [40] says that it is formally impossible to verify whether a finite sequence is random or not. Since in practice we do not handle infinite sequences, what we can do is check the practical randomness of a finite sequence. That means evaluating to what extent the sequence under review shares the statistical properties of an ideal random sequence, e.g. the equal probability of all possible outputs.
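One way to check the equal probability of zeros and ones in a finite bit sequence can be sketched as follows. The sketch is modelled on the frequency (monobit) test of NIST SP 800-22; the function name and the example sequences are our own assumptions, and the tests actually applied in the thesis are those described in Section 5.4.

```python
import math


def monobit_p_value(bits):
    """P-value of the frequency (monobit) test for a sequence of 0/1 values.

    Each bit is mapped to +1 or -1; for an ideal source the normalised sum
    is approximately standard normal, so a small p-value indicates bias.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("the sequence must not be empty")
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2.0))


# Hypothetical example sequences.
all_ones = [1] * 100         # maximally biased
alternating = [0, 1] * 50    # perfectly balanced but fully predictable

print(monobit_p_value(all_ones))
print(monobit_p_value(alternating))
```

The alternating sequence obtains the best possible p-value although it is trivially predictable, which illustrates that passing a statistical test is a necessary but not a sufficient sign of randomness.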

According to Knuth [76], a sequence of random numbers is a sequence of independent numbers with a specified distribution and a specified probability of falling in any given range of values. Another definition comes from Schneier [101], who says that a random sequence is one that has the same statistical properties as random bits, is unpredictable and cannot be reliably reproduced. Kolmogorov defines a string of bits as random if and only if it is shorter than any computer program that can produce that string. From all three definitions we can extract a common requirement (necessary but insufficient): the numbers in a random sequence must be uncorrelated (see footnote 6).

An unpredictable sequence is one for which knowledge of all values generated in the past does not increase the probability of guessing the subsequent value; in other words, knowing one of the numbers in the sequence must not help in predicting the others. The same fact can be illustrated by another definition of unpredictability: there is no polynomial-time algorithm which, knowing l bits of the generated sequence S, is able to predict the (l+1)-th bit with probability greater than 0.5 [86]. The absence of correlation also implies that the generated random sequence cannot be produced by any computer program other than one that prints the whole random sequence as it is.
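The simplest (linear) form of correlation between the members of a bit sequence can be estimated as sketched below. The estimator assumes an unbiased sequence (no mean correction is applied), and the function name and example data are hypothetical. A value close to zero at every lag is necessary for unpredictability, but it does not prove it.

```python
def autocorrelation(bits, lag):
    """Estimate the correlation between bits that are `lag` positions apart.

    Bits are mapped to +1/-1. The result lies in [-1, 1]; for an ideal
    random sequence it is close to 0 for every lag >= 1.
    """
    x = [1 if b else -1 for b in bits]
    n = len(x) - lag
    if lag < 1 or n < 1:
        raise ValueError("lag must be at least 1 and shorter than the sequence")
    return sum(x[i] * x[i + lag] for i in range(n)) / n


# A balanced but fully predictable sequence (hypothetical example).
alternating = [0, 1] * 50
print(autocorrelation(alternating, 1))
print(autocorrelation(alternating, 2))
```

For the alternating sequence each bit determines its neighbour completely, which the estimator reports as correlation of magnitude one.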

By a truly random sequence of bits we understand an uncorrelated sequence that cannot be reproduced or predicted, has equal probability of all possible outputs (equiprobability), and whose generation is based on a random process.

A sequence that keeps the statistical properties of a random sequence, but whose members are correlated or which can be reproduced, is called pseudo-random. A pseudo-random sequence looks random, but it does not originate in a random process, and its generation can be reproduced and described as an algorithm.

One of the issues discussed in the thesis is the ability to distinguish between truly random and pseudo-random sequences by examining the generation process in the generator.

5.1.2 Random Number Generator

An RNG is an electronic device or software routine designed to yield a sequence of random numbers.

A pseudo-random number generator (PRNG) is based on an algebraic function that expands an initial random value (a seed) into a random-looking sequence. A true random number generator (TRNG) includes a physical source of randomness and a harvesting mechanism, which extracts the randomness and generates truly random values.

6 However, simple (linear) and known correlations between the members of a sequence do not exclude such a source. In these cases a corrector that removes the correlated samples may be applied. More dangerous are masked correlations of higher order, which are difficult to detect.

The security level of a PRNG depends on the complexity of the generating function, the period length of the generated sequence, and the amount of entropy in the seed. As a result, pseudo-random sequences may achieve a high level of unpredictability if the generating function is sufficiently complex. However, a pseudo-random sequence always has a finite period and remains reproducible as long as the initial conditions are the same.

A PRNG is the only choice for purely software implementations and, thanks to its deterministic components, it also attracts designers of electronic digital systems. Note that a pseudo-random sequence can also be unpredictable when it is produced by a cryptographically secure PRNG, e.g. one based on hash (one-way) functions, stream ciphers or the Blum Blum Shub principle [28]. A PRNG requires a random seed (from a TRNG or another reliable source of entropy, if available) to obtain its starting level of entropy. As the system is deterministic, for identical seeds the PRNG generates identical output pseudo-random sequences. No more entropy is added while the seed is being exploited, so the entropy of the seed determines the unpredictability of the generated sequence.
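The deterministic expansion of a seed can be illustrated by a minimal hash-based sketch. The construction below (SHA-256 applied to the seed concatenated with a counter) is our own simplified assumption chosen for illustration; it is not a vetted generator and it is not a construction proposed in the thesis.

```python
import hashlib


def hash_prng(seed: bytes, n_blocks: int) -> bytes:
    """Expand `seed` into n_blocks * 32 pseudo-random bytes.

    Each output block is SHA-256(seed || counter). The output depends on
    nothing but the seed, so it carries at most the entropy of the seed.
    """
    out = bytearray()
    for counter in range(n_blocks):
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
    return bytes(out)


# Hypothetical seeds; in practice the seed would come from a TRNG.
first = hash_prng(b"seed from a TRNG", 2)
second = hash_prng(b"seed from a TRNG", 2)
other = hash_prng(b"another seed", 2)

print(first == second)  # identical seeds reproduce the sequence
print(first == other)   # a different seed yields a different sequence
```

Anyone who learns or guesses the seed can regenerate the whole sequence, which is why the entropy of the seed bounds the unpredictability of the output.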

The term generator is not completely correct in the case of a TRNG, as the randomness is not generated but rather extracted from a source of randomness (see Figure 5 – 1). In a TRNG the occurrence of random events is sampled by an extractor and transformed into a sequence of numerical values, usually expressed as a binary stream.

[Figure 5 – 1: block diagram divided into an analogue part and a digital part. Blocks: source of randomness, A/D conversion, post-processing, output buffer. Signals: noise signal, digitised noise signal, internal random sequence, and the random number sequence on the external interface.]

Figure 5 – 1 Schematic diagram of a TRNG with designation of internal signals and interfaces

Figure 5 – 1 represents a typical design of a TRNG based on a physical phenomenon. Using a proper harvesting mechanism, the analogue signal is converted into its digitised form. Depending on the statistical properties of the signal, post-processing may be required in order to produce the internal random sequence. The generated sequence can be further accumulated in an output buffer before leaving the generator on an external request.
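A classical example of a post-processing step is the von Neumann corrector, sketched below. It is shown only as an illustration of how post-processing trades bit rate for better statistics; it is not necessarily the post-processing used in the generators discussed in the thesis, and it removes bias only if the raw bits are independent. The biased source is emulated here with Python's seeded PRNG, which is purely an assumption of the sketch.

```python
import random


def von_neumann(bits):
    """Von Neumann corrector: examine non-overlapping pairs of bits.

    The pairs (0, 1) and (1, 0) output their first bit; the pairs (0, 0)
    and (1, 1) are discarded. For independent bits with a constant bias
    the output is unbiased, at the cost of a reduced bit rate.
    """
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out


# Emulated raw source with P(1) = 0.7 (hypothetical value).
rng = random.Random(1)
raw = [1 if rng.random() < 0.7 else 0 for _ in range(100_000)]
corrected = von_neumann(raw)

print("raw ratio of ones:      ", sum(raw) / len(raw))
print("corrected ratio of ones:", sum(corrected) / len(corrected))
print("output/input bit rate:  ", len(corrected) / len(raw))
```

For a source with P(1) = p the expected output rate is p(1 - p) output bits per input bit, so the stronger the bias, the lower the bit rate of the corrected sequence.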

5.1.3 Applications of Random Numbers

Random or pseudo-random values are applied in a variety of areas, e.g. in simulation methods such as Monte Carlo [84], in the generation of spreading sequences in spread spectrum communication systems [106], in the generation of primes, in several cryptographic algorithms, or in the gambling industry. Naturally, the requirements on generators and on the generated random data differ according to the application.

In addition to proper statistical parameters, a random sequence generated for a sensitive cryptographic application has to be unpredictable and unrepeatable. Unrepeatability means that we expect a completely different random sequence for each use of the generator, even under identical starting conditions (such as the seed of a PRNG). This is an inherent feature of TRNGs based on entropy extraction from natural physical phenomena. In such a case each generated value adds entropy to the generator.

Application areas for RNGs can be found in a number of cryptographic algorithms. The dominant application of an RNG is the secure generation of encryption keys. The bit length of a key is chosen according to the length of time for which the key is valid. In different cryptographic applications this time can vary from seconds for session keys to years for the encryption keys of archiving systems. Consequently, an RNG has to provide random values at a bit rate ranging from tens to thousands of bits per second. While achieving a high output bit rate is not a problem for a PRNG, the situation is different for a TRNG, which is desired in high-security cryptosystems. A source of randomness in a TRNG may have a low level of entropy per bit, which also means a low output bit rate because the entropy has to be accumulated.
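The cost of accumulating entropy can be made concrete with a small sketch. It assumes, purely for illustration, that raw bits are independent, that each has the same bias eps = |P(1) - 0.5|, and that n raw bits are accumulated by XORing them into one output bit; the residual bias then follows from the piling-up lemma as 0.5 * (2 * eps)^n. The function names and the numbers are hypothetical.

```python
def xor_bias(eps, n):
    """Bias magnitude of the XOR of n independent bits, each with bias eps."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must satisfy 0 <= eps < 0.5")
    return 0.5 * (2.0 * eps) ** n


def bits_needed(eps, target):
    """Smallest number of raw bits whose XOR has a bias of at most `target`."""
    if target <= 0.0:
        raise ValueError("target must be positive")
    n = 1
    while xor_bias(eps, n) > target:
        n += 1
    return n


# Hypothetical source: P(1) = 0.6, i.e. eps = 0.1; required bias <= 0.001.
n = bits_needed(0.1, 0.001)
print("raw bits per output bit:", n)
print("output bit rate as a fraction of the raw rate:", 1.0 / n)
```

Under these assumptions the output bit rate is the raw bit rate divided by n, which shows how a weakly random source leads directly to a slow generator.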

In cryptography, the values produced by randomness extractors or generators are used as cryptographic keys, initialization vectors, padding bits, blinding values, or masking values in countermeasures against side-channel attacks. Depending on the application, the random value either needs to be kept secret, as in the case of (secret) encryption keys, or can be published, as in the case of a nonce or a part of a public key.

Nowadays the security of cryptographic systems is not based on the secrecy of the encryption methods, which are publicly known, but on the knowledge of a secret key. An adversary focuses all her attacks on revealing that secret information. Having control over the device that generates the keys allows the attacker to control all the systems whose security depends on them as well. These are the reasons why the randomness generation process is emphasised in cryptography.

Requirements on TRNG for cryptography. We can conclude the previous paragraphs with a list of special requirements for the implementation of TRNGs whose output sequences are applied in cryptography:

• Specific statistical properties – the generated sequence must have perfect statistical properties. A known bias in the probability of zeros and ones in the generated bit stream could make cryptographic attacks easier, since a nonzero bias distorts the required uniform distribution. The expected parameters are usually achieved by post-processing of the random sequence.

• Unpredictability – knowledge of an arbitrarily long sequence from the generator, or of any other information about the internal state of the generator, should not enable anyone to predict preceding or subsequent generator outputs or to guess them with non-negligible probability. Such behaviour is natural for random physical phenomena. The requirement is satisfied by a proof showing the origin of the random-looking sequence.

• Security parameters – as an electronic device, the TRNG is itself a target for adversarial attacks. Rather than a one-off disclosure of the secret key, an adversary is usually interested in influencing the key generation process permanently. To improve resistance against this kind of attack, RNG designers should consider implementing on-line tests tailored to the harvesting mechanism of the TRNG.

5.2 TRNG Implementations in Digital Systems

In the following part of the thesis we provide an overview of known TRNG implementations and design proposals. We focus mostly on designs targeted at digital circuits.

Nowadays, a common hardware platform for the implementation of cryptographic primitives is a digital device. The cryptographic functions are performed as software code on embedded processors (in DSPs, FPGAs, SoCs etc.) or run on dedicated (co)processors with programmable (FPGA) or hard-wired (ASIC) logic cells. This fact motivates research into generators that can be integrated into completely digital circuits.


Digital circuits are naturally well suited to the implementation of a PRNG because of their deterministic nature. To implement a physical TRNG, a source of randomness has to be found inside the circuit. Typical digital circuits include only a limited range of sources of randomness, which we investigate further below.

As already explained, true randomness is achievable only in generators based on some physical phenomenon. However, one of the main objectives of digital system designers is to minimise the impact of spurious analogue effects and to achieve perfect stability of the system. Their goal is therefore to optimise the clock distribution network for a wide range of frequencies and to design the PCB layout and the power supply network carefully. The requirements on the system are thus contradictory: on the one hand we expect perfectly deterministic behaviour of the digital part of the system, and on the other hand we look for a high-quality source of true randomness for a TRNG placed in the same system.

For the sake of security, completely embedded implementations of RNGs are preferred, because the internal signals of the RNG are then not exposed to potential attacks. However, owing to the lack of suitable sources of randomness on a given platform, some designs propose the use of external discrete components as the source of randomness, while the processing part of the generator is implemented in the digital part of the system (e.g. [112]).

5.2.1 Sources of Randomness

The following sources of randomness can be found in digital devices:

• metastability

• various types of noise

• clock jitter

Although clock jitter is primarily caused by noise and could therefore be included under noise, we treat jitter as a separate category. TRNGs based on jitter use techniques different from those based on direct sampling of noise. In addition, jitter-based generators are among the most popular TRNG designs.

We note that although the sources of randomness are presented separately, it is generally difficult to separate them in practical designs, where all of them may be present and may influence the entropy of the randomness source. As an example we can mention a generator kept in a metastable state, whose final stable output value is influenced by the noise conditions inside the generator. In such a case, the primary source of randomness is the metastability and the secondary source is the noise.

Metastability. A fundamental building block of digital circuits, the flip-flop (FF), has two well-defined stable states, a high and a low level, usually denoted as 1 and 0 (see Figure 5 – 2). Under certain conditions the device may get into a state which cannot be described by either of these states. This condition is called metastability.

[Figure 5 – 2: diagram with the labels stable state 0, metastable state and stable state 1.]

Figure 5 – 2 Illustration of stable states (0 and 1) and undefined metastable state

The most common way to bring a device into metastability is to violate its setup (footnote 7) and hold (footnote 8) times. This can be achieved by choosing the frequencies of the clock and input signals of the FF in a ratio that causes the input signal level to change too close to the edges of the clock signal. Another option is for the frequencies of the signals to be the same, but with the phases aligned in a way that violates the setup and hold times of the FF.
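Both ways of violating the timing window can be checked with a simple timing model, sketched below. The model assumes ideal (jitter-free) signals, input transitions occurring once every `t_data`, and integer times in picoseconds; all parameter names and values are hypothetical and serve only as an illustration.

```python
def count_violations(t_clk, t_data, t_su, t_h, n_edges, phase=0):
    """Count clock edges at which the FF input changes inside the
    setup/hold window.

    Clock edges occur at m * t_clk and input transitions at
    phase + k * t_data (all times are integers in picoseconds).
    """
    violations = 0
    for m in range(n_edges):
        # Time elapsed since the most recent input transition.
        since_last = (m * t_clk - phase) % t_data
        # Time remaining until the next input transition.
        until_next = t_data - since_last
        if since_last < t_su or until_next < t_h:
            violations += 1
    return violations


# Slightly different periods: the edges drift through the whole window.
print(count_violations(10_000, 10_007, t_su=90, t_h=149, n_edges=10_007))

# Equal periods: everything depends on the phase alignment.
print(count_violations(10_000, 10_000, 90, 149, n_edges=100, phase=5_000))
print(count_violations(10_000, 10_000, 90, 149, n_edges=100, phase=9_950))
```

With slightly different periods only a small fraction of the clock edges falls into the critical window, whereas with equal periods the window is hit either at every edge or never, depending on the phase. This is consistent with the need for a precise control of frequency or phase in generators of this kind.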

Keeping the FF close to metastability and then allowing it to resolve produces a binary sequence that depends on the noise conditions inside the FF at the time of release. If the origin of the noise is thermal motion, its random nature suggests that repeatedly clocking an FF forced into metastability will produce a succession of bits with little correlation between any pair in the sequence [75].

7 Setup time is defined as the minimum time before the sampling edge during which the sampled signal must be stable.

8 Hold time is defined as the minimum time after the sampling edge during which the sampled signal must be stable.

In generators based on metastability, the main implementation issue is the phase or frequency control of the input signals that forces the metastable conditions. A complicated control system makes the implementation more vulnerable to attacks. In RNGs based on other randomness extraction techniques, e.g. on free-running oscillators, metastability may also occur and contribute to the overall entropy of the randomness source [52].

Producers of FPGAs, and of digital circuits in general, constantly work on reducing the setup and hold times (footnote 9), because metastability produces undesirable non-deterministic exceptions in the behaviour of the devices [6]. Therefore the published implementations of TRNGs [75, 83, 121] usually propose special circuits implemented e.g. in CMOS technology.

Owing to the difficulty of maintaining the metastable condition over the long term, we can conclude that metastability is a good (secondary) source of randomness when it is combined with other sources.

Noise. Despite their deterministic behaviour, digital devices are built from analogue elements that naturally produce a certain level of noise. There is always a source of noise (e.g. thermal noise in resistances, or shot noise) present in an electronic device. In order to use the noise as a source of randomness, it is necessary to amplify the noise itself or the effects caused by the noise.

Most true hardware RNGs depend primarily on a source of thermal noise, which is then post-processed to reduce the effects of deterministic internal and external influences such as power supply variations, DC bias, and electromagnetic fields [73]. Direct amplification and sampling of a noisy signal is not possible in purely digital circuits. However, more complex devices are not exclusively digital and include embedded components for mixed-signal (analogue-digital) processing such as A/D and D/A converters, or clock circuitry for signal skew compensation.

A technique using a clocked comparator fed by directly amplified noise is applicable only with well-shielded noise sources, which can hardly be achieved in integrated digital systems. Instead of amplifying the noise directly, it is technically more feasible to amplify signals that include a randomly changing part but have a higher amplitude than the noise itself (see e.g. [24, 73]).

Bagini and Bucci [24] provide one of the first designs that include an analytical model of the generator behaviour and a self-testing procedure. A comparator is used to convert the analogue noise into a binary signal. The balanced signal is then sampled by a delay (D) FF. The number of internal transitions in the generated binary signal allows on-line checking of the generator behaviour.

9 For the LE FF of an Altera Stratix II device of speed grade -3, the setup time is t_SU = 90 ps and the hold time is t_H = 149 ps [21].

Noise, as an intrinsic and reliable source of randomness in electronic devices, is attractive for designers of TRNGs. We further elaborate on its influence on signals, e.g. in the form of jitter.

Jitter. In this part we discuss various sources of jitter and the classification of jitter components into deterministic and random ones. We start with some basic definitions of jitter, deterministic jitter and random jitter [102].

By convention, timing variations are split into two categories, called jitter and wander, based on a Fourier analysis of the variations versus time. Timing variations that occur slowly are called wander, whereas jitter describes timing variations that occur more rapidly. The threshold between wander and jitter is defined by the ITU to be 10 Hz, but other definitions may also be encountered.

We continue with a more specific definition of jitter and its two components.

Jitter is a deviation from the ideal timing of an event (see Figure 5 – 3). The reference event is the differential zero crossing for electrical signals and the nominal receiver threshold power level for optical systems. Jitter is composed of both deterministic and Gaussian (random) content.

Deterministic Jitter (DJ) is jitter with a non-Gaussian probability density function. Deterministic jitter is always bounded in amplitude and has specific causes. Four kinds of deterministic jitter are identified: duty cycle distortion, data dependent, sinusoidal or periodic, and uncorrelated (to the data) bounded jitter. DJ is characterized by its bounded peak-to-peak value.

Random Jitter (RJ) is jitter that is characterized by a Gaussian distribution. Since a Gaussian distribution is unbounded, the peak-to-peak value of random jitter is defined by convention as 14 times the standard deviation (14σ_jit) of the Gaussian distribution.
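The meaning of a peak-to-peak value of 14 standard deviations can be illustrated by computing how often a Gaussian jitter sample is expected to fall outside such a range. The sketch below assumes a zero-mean Gaussian distribution and a range centred on the mean; the function name is our own.

```python
import math


def probability_outside(peak_to_peak_in_sigmas):
    """Probability that a zero-mean Gaussian sample lies outside a centred
    range whose total width is given in units of the standard deviation."""
    half_width = peak_to_peak_in_sigmas / 2.0
    return math.erfc(half_width / math.sqrt(2.0))


for width in (6.0, 10.0, 14.0):
    print(width, probability_outside(width))
```

For a width of 14 standard deviations the probability is of the order of 10^-12, so the conventional peak-to-peak value is exceeded only extremely rarely, although it is never a strict bound.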

With the basic definition of jitter in hand, we continue with definitions of three types of jitter that differ in the reference signal considered to be ideal (without any jitter) and in the time period of observation [102]. We also add our own definition of tracking jitter, which plays a crucial role in the randomness extraction method of the PLL-based TRNG.

[Figure 5 – 3: clock waveform annotated with the reference edge, the mean period (unit interval) and the jitter.]

Figure 5 – 3 Timing jitter in clock signal

Cycle-to-cycle jitter is the difference in a clock's period from one cycle to the next. Cycle-to-cycle jitter is the most difficult to measure, usually requiring a timing interval analyser.

Half-period jitter is the measure of the maximum change in a clock's output transition from its ideal position during one half-period.

Period jitter is the change in a clock's output transition, typically the rising edge, from its ideal position over consecutive clock edges. Period jitter is measured and expressed in time or frequency. Period jitter measurements are used to calculate timing margins in systems.

Tracking jitter is defined as the variation in the time relationship between the edges of the reference (input) clock and the output clock of a clock circuitry.
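The definitions above can be turned into simple computations on measured edge timestamps, as sketched below. The sketch takes one common operational reading of each definition (the deviation of each period from the mean period for period jitter, and the deviation of the input-to-output edge offset from its mean for tracking jitter); these readings, the function names and the example timestamps are our own assumptions.

```python
def periods(edges):
    """Durations between consecutive rising edges."""
    return [b - a for a, b in zip(edges, edges[1:])]


def period_jitter(edges):
    """Deviation of each period from the mean period."""
    p = periods(edges)
    mean = sum(p) / len(p)
    return [x - mean for x in p]


def cycle_to_cycle_jitter(edges):
    """Difference between each period and the period preceding it."""
    p = periods(edges)
    return [b - a for a, b in zip(p, p[1:])]


def tracking_jitter(ref_edges, out_edges):
    """Variation of the offset between corresponding edges of the
    reference clock and of the output clock around its mean value."""
    offsets = [o - r for r, o in zip(ref_edges, out_edges)]
    mean = sum(offsets) / len(offsets)
    return [x - mean for x in offsets]


# Hypothetical rising-edge timestamps in picoseconds.
edges = [0, 100, 201, 299, 400]
print(periods(edges))
print(period_jitter(edges))
print(cycle_to_cycle_jitter(edges))
print(tracking_jitter([0, 100, 200, 300], [5, 106, 204, 305]))
```

Note that a constant offset between the reference and the output clock contributes nothing to the tracking jitter; only the variation of that offset does.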

Deterministic periodic jitter is typically caused by external deterministic noise sources coupling into a system, such as switching power-supply noise or a strong local radio-frequency carrier. It may also be caused by an unstable clock-recovery PLL.

While a random process can, in theory, have any probability distribution, random jitter is assumed to have a Gaussian distribution for the purpose of the jitter model. One reason for this is that the primary source of random noise in many electrical circuits is thermal noise (also called Johnson noise), which is known to have a Gaussian distribution. Another, more fundamental reason is that the composite effect of many uncorrelated noise sources, no matter what the distributions of the individual sources, approaches a Gaussian distribution according to the central limit theorem [107].

For a random signal with a Gaussian distribution, there is theoretically no limit on the maximum and minimum values, so the observed peak-to-peak value will generally grow over time. For this reason, the peak-to-peak value should be used in conjunction with the population size and some knowledge of the type of distribution.
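The growth of the observed peak-to-peak value with the population size can be demonstrated by a short simulation. The Gaussian jitter samples are emulated here with Python's seeded PRNG and a unit standard deviation; the seed, the sample counts and the function name are arbitrary assumptions of the sketch.

```python
import random


def running_peak_to_peak(samples, checkpoints):
    """Peak-to-peak value observed after the first n samples, for each n
    listed in `checkpoints`."""
    marks = set(checkpoints)
    lo = hi = samples[0]
    result = {}
    for count, value in enumerate(samples, start=1):
        lo = min(lo, value)
        hi = max(hi, value)
        if count in marks:
            result[count] = hi - lo
    return result


rng = random.Random(2010)
sigma = 1.0  # standard deviation of the emulated jitter (arbitrary unit)
samples = [rng.gauss(0.0, sigma) for _ in range(100_000)]

pp = running_peak_to_peak(samples, [100, 1_000, 10_000, 100_000])
for n in sorted(pp):
    print(n, pp[n])
```

The observed peak-to-peak value can only grow as more samples are collected, so a measured value is meaningful only together with the number of samples from which it was obtained.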

5.2.2 Survey of Designs Based on Jitter

In this section we summarise the best-known current concepts and designs of generators based on the extraction of randomness from clock jitter. The jitter appears in clock signals generated by free-running oscillators or by PLL circuitry implemented inside a digital device.

The Tkacik TRNG Design. The generator invented by Tkacik [111] combines two deterministic circuits: a linear feedback shift register (LFSR) and a cellular automaton shift register (CASR). The registers are clocked by two independent ring oscillators whose clock frequency is subject to external influences and includes jitter. Selected outputs of the CASR and LFSR are XORed together, providing the final random signal. The harvesting technique of the generator is very complex and no verification of its effectiveness is provided.

The design was evaluated by Dichtl [43], who pointed out that the source of randomness in the generator is unclear. Under certain conditions and with partial knowledge of some internal values, an attacker is able to predict the generated value due to the low level of entropy.

The Fischer and Drutarovský Design. In the design by Fischer and Drutarovský [60], the idea is to extract random values by sampling a clock signal influenced by the tracking jitter of an analogue PLL in Altera FPGAs. The jitter can be sampled only under the defined condition that the frequencies of the sampled and sampling clock signals are in a certain ratio.

The clock signal is sampled periodically, with a period given by the PLL dividers. Samples taken in transition zones have a nonzero probability of resulting in either a logical one or a logical zero and are called critical samples. The position of the critical samples remains stable during operation of the generator as long as its working conditions do not change.

More details on the TRNG implementation and features of the generator are described in the next section. This design serves as our reference for the tests and theory presented in the thesis.


The Golić Design. Golić's goal is to provide a digital TRNG built from logic gates only. Such a design is cost-effective and suitable for implementation on any digital chip. In [70], Golić proposes two new elements for TRNG design, shown in Figures 5 – 4(a) and 5 – 4(b): the Galois ring oscillator (GARO) and the Fibonacci ring oscillator (FIRO).

(a) Galois ring oscillator (b) Fibonacci ring oscillator

Figure 5 – 4 Ring oscillator structures proposed by Golić.

Adding a more complex feedback loop to the ring oscillator (RO) also makes its behaviour more complex and therefore more suitable for a TRNG, because the randomness coming from jitter spreads faster. Compared to a classical RO, the use of GARO and FIRO yields a higher level of entropy and robustness of the generator. Additional entropy of the generator comes from frequent metastability effects in the sampling gate.

In [44] Golić and Dichtl show results of a practical implementation of a TRNG using the oscillators presented above. The authors demonstrate the randomness of the solution by analysing the generator output after repeated restarts of the circuit. The standard deviation of the output signal voltage rises quickly after the restart and stabilises at a sufficiently high level, which assures the randomness of a sample taken in this time period.

The Kohlbrenner and Gaj Design. A principle similar to the PLL-based generator [60] was proposed by Kohlbrenner and Gaj in [79]. Instead of PLL circuitry, which is not present in all FPGAs, the authors use a pair of ring oscillators implemented in the programmable logic area of the FPGA. Since the principle expects a closely matched pair of frequencies generated by the rings, the oscillators must be matched precisely. This also requires proper positioning of the rings inside the FPGA and manual corrections of placement and routing.

The authors also investigated the influence of temperature on the RO. The frequency of an RO tends to wander as the chip's temperature varies. It is important to place the ROs of a pair close to each other, so that the difference between their frequencies is reduced thanks to the minimal difference in temperature.


The Bucci and Luzzi Testable TRNG Design Framework. The authors of the testable TRNG design framework [36] come up with the idea of a stateless RNG which generates statistically independent random bits. If the post-processing unit is also memoryless, the internal random bits are independent too. The stateless condition of the generator can be achieved by resetting the generation and post-processing circuits before generating a random bit or word, respectively.

In the case of RO-based generators, the reset state is achieved by stopping the oscillators after each bit generation, so that the phase shift between the oscillators is not accumulated. Another motivation is to avoid a complicated deterministic beating pattern between the fast and slow frequencies of the RNG. Should the generator include any control or compensation loops, the stateless condition is met only if the loops have reached their steady state.

The Sunar et al. TRNG Design. A theoretical concept of a generator based on ring oscillators (ROs) of equal length was published by Sunar et al. in [105]. According to the concept, the outputs of several ROs are XORed together and sampled by a D flip-flop. The number of oscillators is chosen according to the jitter size and the internal frequency of the rings. The design goal of a properly working generator is a uniformly distributed region of unpredictable transitions. It is assumed that the phase drift caused by jitter appears in the internal signal of each ring and influences the movement of the edges in the signal.

Several assumptions made by the authors of this concept were questioned by the authors of [44]. The main problem lies in the expectation that the ROs are independent, which is usually difficult to achieve due to their strong tendency to couple with each other or to lock onto a common frequency if there is a strong source of a periodic signal close to the ROs.

Notes on Other Published Designs. Several published TRNG designs are based on the frequency instability of free-running oscillators, e.g. [53]. Free-running oscillators are also typically used in FPGA-based TRNGs [79, 112].

In recently published papers [31, 105] we can observe that successfully passing statistical tests is no longer considered sufficient for a proposed RNG. Much more attention is paid to the analysis and modelling of the randomness extraction process. Theoretical bounds for entropy and statistical estimations of the RNG behaviour are provided in order to prove the security of the generator. The requirement for continuous testing of the generated sequence was raised by Schindler in [100]. As a consequence, RNG designs should provide a testing method designed particularly for the given type of RNG (see e.g. [36]).

In [26] the authors improve the modelling of RO-based TRNGs, and instead of conventional time-based models they provide a phase-oriented presentation. The observation that ROs tend to couple with each other has been confirmed by experiments with global deterministic jitter. Rather than concluding that coupling reduces the randomness of a TRNG, the authors warn against overestimating the jitter size. After the impact of global jitter is removed, the accumulation of jitter is much slower, which implies a lower sampling frequency of the generator in order to obtain random sequences.

5.3 PLL-Based TRNG on FPGA

In this section we introduce a TRNG implementation based on randomness extraction from the tracking jitter that is inherent in the clock signal produced by the analog PLL embedded in some FPGA families. The PLL circuitry, normally applied for the synthesis of on-chip clock signals derived from an external quartz signal, is configured to provide a pair of signals with a certain fixed ratio of their frequencies. The ratio is selected for the purpose of jitter sampling and also determines other parameters of the generator, such as the speed of the output random sequence.

In the following pages we compile the dependencies between the PLL and TRNG parameters and explain their meaning. We explain the fundamental method behind the PLL-based TRNG (PLL-TRNG) invented by Fischer and Drutarovský and published in [60].

5.3.1 Randomness Extraction Method

The tracking jitter in the output signal of the on-chip analog PLL is detected by sampling the signal using another, rationally related clock signal. The fundamental requirement for jitter sampling is that the sampled and sampling edges are set close enough to each other. When this condition is met, the unpredictable jitter decides the output values of the sampling gate. The simplified structure of the PLL-TRNG is depicted in Figure 5 – 5.

Let us have two clock signals CLJ and CLK with frequencies $F_{\mathrm{CLJ}}$ and $F_{\mathrm{CLK}}$ in the given ratio:

\[ \frac{F_{\mathrm{CLJ}}}{F_{\mathrm{CLK}}} = \frac{K_M}{K_D} = \frac{M_{\mathrm{CLJ}}\, D_{\mathrm{CLK}}}{M_{\mathrm{CLK}}\, D_{\mathrm{CLJ}}} , \tag{5.2} \]


[Figure: block diagram with the input clock CLI, the blocks PLL 1 and PLL 2 providing CLJ and CLK, a D flip-flop with output q(nT_CLK), and a decimator (N K_D) with output x(nNT_Q).]

Figure 5 – 5 Block structure of the PLL-TRNG with two PLLs, sampling gate and corrector of the output sequence.

where $K_M$ and $K_D$ are combinations of the PLL dividers ($D_{\mathrm{CLK}}$, $D_{\mathrm{CLJ}}$) and multipliers ($M_{\mathrm{CLK}}$, $M_{\mathrm{CLJ}}$). As can be seen in Figure 5 – 6, the signal CLJ is sampled in $K_D$ discrete positions during the period $T_Q$, which is given as

\[ T_Q = K_D T_{\mathrm{CLK}} = K_M T_{\mathrm{CLJ}} . \tag{5.3} \]

[Figure: timing diagram of the signals CLJ, CLK and OUT over two periods T_Q, with the labels "critical samples", "ΔT", "K_M" and "K_D samples".]

Figure 5 – 6 Sampling of the CLJ clock signal including the tracking jitter on the rising edge of the CLK signal (illustrated for $K_M = 5$ and $K_D = 7$)

It has been shown in [60] that if $K_M$ and $K_D$ are relatively prime, the set of samples creates an equidistant set of values with a distance step

\[ d = \frac{T_{\mathrm{CLK}}}{2K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{\mathrm{CLJ}}}{2K_D}\,\mathrm{GCD}(2K_M, K_D) . \tag{5.4} \]

The method offers a possibility to choose the worst-case distance $\mathrm{MAX}(\Delta T_{\min}) = d/2$ between the two closest edges of the CLK and CLJ signals as [60]

\[ \mathrm{MAX}(\Delta T_{\min}) = \frac{T_{\mathrm{CLK}}}{4K_M}\,\mathrm{GCD}(2K_M, K_D) = \frac{T_{\mathrm{CLJ}}}{4K_D}\,\mathrm{GCD}(2K_M, K_D) \tag{5.5} \]

and thus to assure proper behavior of the generator.
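Equations (5.4) and (5.5) can be checked numerically. The sketch below is an illustration under stated assumptions and not the implementation from [60]: the function names are ours, exact rational arithmetic is used, CLJ is assumed to have a 50 % duty cycle, and the initial phase between CLK and CLJ is taken as zero.

```python
from fractions import Fraction
from math import gcd


def distance_step(k_m, k_d, t_clk):
    """Distance step d of Eq. (5.4), in the same unit as t_clk."""
    return Fraction(t_clk) * gcd(2 * k_m, k_d) / (2 * k_m)


def worst_case_distance(k_m, k_d, t_clk):
    """Worst-case edge distance MAX(dT_min) = d / 2 of Eq. (5.5)."""
    return distance_step(k_m, k_d, t_clk) / 2


def sample_offsets(k_m, k_d, t_clk):
    """Brute-force offsets of the K_D sampling instants from the preceding
    CLJ edge (rising or falling), sorted and without duplicates."""
    t_clk = Fraction(t_clk)
    half_t_clj = t_clk * k_d / k_m / 2  # T_CLJ / 2, using Eq. (5.3)
    return sorted({(i * t_clk) % half_t_clj for i in range(k_d)})


# Values used in Figure 5 - 6, with T_CLK normalised to 1.
print(distance_step(5, 7, 1))        # 1/10
print(worst_case_distance(5, 7, 1))  # 1/20
print(sample_offsets(5, 7, 1))
```

For $K_M = 5$ and $K_D = 7$ the brute-force offsets are spaced by exactly $d = T_{\mathrm{CLK}}/10$, in agreement with (5.4). A rational `t_clk` (an integer or a `Fraction`) keeps the comparison exact; a floating-point period would only be converted to its nearest binary fraction.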


If the parameters $K_M$ and $K_D$ are chosen so that

\[ \mathrm{MAX}(\Delta T_{\min}) < \sigma_{\mathrm{jit}} , \tag{5.6} \]

it is guaranteed that during the period $T_Q$ the sampling edge of CLK will fall at least once into the edge zone of CLJ (where the edge zone means the time interval around the edge with a width smaller than $\sigma_{\mathrm{jit}}$, and $\sigma_{\mathrm{jit}}$ is the standard deviation of the jitter). The $K_D$ samples represented by the output signal $q(nT_{\mathrm{CLK}})$ are XORed bit-wise in a corrector [60] to obtain one random bit during $N$ periods $T_Q$. The generator output bit rate $R$ is thus decimated by the factor $N$ to $R = 1/(NT_Q)$. It can be seen that while the left side of (5.6) depends on the generator structure and PLL settings, its right side, the jitter, depends on the noise of the PLL circuitry, the working environment, and the circuit board design. Therefore, the jitter must be known in advance or (even better) measured in real time. Measurement of the jitter requires special measuring equipment. Common methods of jitter measurement (e.g. those used in [61]) enable one to measure the absolute long-term jitter and not the relative tracking jitter employed in the proposed TRNG. Furthermore, the jitter is measured under laboratory conditions and not in a real (potentially hostile) environment. If the results of measurements are not available, parameters from the vendor's documentation can be used for the design of the TRNG, as in [60].

The decimated output signal of the TRNG

\[ x(nT_Q) = q(nT_Q) \oplus q(nT_Q - T_{\mathrm{CLK}}) \oplus \dots \oplus q(nT_Q - (K_D - 1)T_{\mathrm{CLK}}) , \tag{5.7} \]

which is generated at the output of an Exclusive-OR (XOR)-based decimator [42] as a bit-wise addition modulo 2 ($\oplus$) of $K_D$ samples $q(\cdot)$ sampled with the frequency $F_{\mathrm{CLK}}$, will be nondeterministic, too. Note that the delay line can still be a useful building block for $\sigma_{\mathrm{jit}} \approx \mathrm{MAX}(\Delta T_{\min})$ or $\sigma_{\mathrm{jit}} < \mathrm{MAX}(\Delta T_{\min})$, as was shown in [62].
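The XOR decimation of (5.7) can be sketched in software as follows. This is a behavioural illustration only, not the hardware corrector of [60]: the function name `xor_decimate` is ours, the input is assumed to be a list of sampled bits q(nT_CLK), and the block length N·K_D follows the decimator label in Figure 5 – 5 (with N = 1 corresponding to (5.7)).

```python
from functools import reduce
from operator import xor


def xor_decimate(q, k_d, n=1):
    """XOR every block of n * k_d consecutive samples into one output bit.

    q   -- list of sampled bits (0 or 1), one per CLK period
    k_d -- number of samples per period T_Q
    n   -- number of periods T_Q combined into one output bit
    """
    block = n * k_d
    usable = len(q) - len(q) % block  # ignore an incomplete trailing block
    return [reduce(xor, q[i:i + block]) for i in range(0, usable, block)]


# Two full periods T_Q for K_D = 7 plus one left-over sample.
q = [0, 0, 1, 0, 0, 0, 0,
     1, 1, 0, 0, 0, 0, 0,
     1]
print(xor_decimate(q, k_d=7))        # [1, 0]
print(xor_decimate(q, k_d=7, n=2))   # [1]
```

Each output bit is the parity of its block, so a single critical sample whose value is decided by jitter is enough to make the output bit unpredictable, while the output bit rate drops to R = 1/(N T_Q).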

The sampler sensitivity to the jitter

\[ S = F_{\mathrm{CLI}}\,\mathrm{MAX}(\Delta T_{\min}) = \frac{1}{4 M_{\mathrm{CLK}} M_{\mathrm{CLJ}}} \tag{5.8} \]

is derived from Equation (5.5). Decreasing $\mathrm{MAX}(\Delta T_{\min})$ for a fixed $F_{\mathrm{CLI}}$ requires maximisation of the multiplying coefficients (M).

For the output bit rate $R = 1/T_Q = F_{\mathrm{CLK}}/K_D$ we get the condition

\[ R = \frac{F_{\mathrm{CLI}}}{D_{\mathrm{CLK}} D_{\mathrm{CLJ}}} . \tag{5.9} \]
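The relations (5.8) and (5.9) can be evaluated for a candidate PLL configuration with a few lines of code. The sketch below rests on assumptions that we state explicitly: each PLL is modelled as $F_{\mathrm{out}} = F_{\mathrm{CLI}} M / D$, the coefficients $K_M = M_{\mathrm{CLJ}} D_{\mathrm{CLK}}$ and $K_D = M_{\mathrm{CLK}} D_{\mathrm{CLJ}}$ are taken unreduced from (5.2), and the closed forms are evaluated only when GCD(2K_M, K_D) = 1. The function name and the numeric values are hypothetical.

```python
from fractions import Fraction
from math import gcd


def pll_trng_parameters(f_cli, m_clk, d_clk, m_clj, d_clj):
    """Return derived PLL-TRNG parameters for one PLL configuration."""
    f_cli = Fraction(f_cli)
    f_clk = f_cli * m_clk / d_clk          # assumed PLL model: F = F_CLI * M / D
    f_clj = f_cli * m_clj / d_clj
    k_m, k_d = m_clj * d_clk, m_clk * d_clj  # Eq. (5.2), not reduced
    if gcd(2 * k_m, k_d) != 1:
        raise ValueError("this sketch assumes GCD(2*K_M, K_D) = 1")
    max_dt = 1 / f_clk / (4 * k_m)         # Eq. (5.5) with GCD = 1
    return {
        "F_CLK": f_clk,
        "F_CLJ": f_clj,
        "K_M": k_m,
        "K_D": k_d,
        "MAX_dT": max_dt,
        "S": f_cli * max_dt,               # sensitivity, Eq. (5.8)
        "R": f_clk / k_d,                  # bit rate R = 1 / T_Q, Eq. (5.9)
    }


# Hypothetical configuration: F_CLI = 12 (arbitrary frequency unit).
p = pll_trng_parameters(12, m_clk=3, d_clk=2, m_clj=5, d_clj=3)
print(p["S"], p["R"])  # 1/60 2
```

In this example S equals 1/(4·3·5) and R equals 12/(2·3), so the sensitivity depends only on the multipliers and the bit rate only on the dividers, as the two closed forms state.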



For $R$ it holds that increasing $R$ for a fixed $F_{\mathrm{CLI}}$ requires minimisation of the dividing coefficients (D). Of course, the optimization cannot be done independently. There are system limits expressed by the condition

\[ \frac{R}{\mathrm{MAX}(\Delta T_{\min})} = 4 F_{\mathrm{CLK}} F_{\mathrm{CLJ}} . \tag{5.10} \]

5.3.2 Coherent Sampling

The sampling technique applied for randomness extraction in the PLL-TRNG is called coherent sampling.

The method expects that the samples are processed during the period $T_Q$, which is given by the ratio of the clock frequencies. In the case of ideal signals without jitter, the output signal is perfectly periodic. Let us provide some more details on the parameters of this signal.

A similar technique is applied to measure high-frequency signals. The coherent principle is based on sampling the measured signal during several of its periods, instead of the usually expected single period. The sampling frequency $f_s$ is lower than the frequency $f$ of the sampled signal. The ratio between the frequencies is expressed as

\[ f_s = \frac{N}{M}\, f , \quad \mathrm{GCD}(M, N) = 1 . \tag{5.11} \]

During $M$ periods of the sampled signal, $N$ samples are obtained. Since $M$ and $N$ are relatively prime, the $N$ samples are distinct and evenly distributed in $T_Q$; thus the effective sampling frequency is $f_{s,\mathrm{eff}} = Nf$. In order to obtain the original waveform of the sampled signal, a time shuffling of the samples may be needed. In the case of $M = N + 1$, time shuffling can be avoided if $0 \le \varphi_1 \le 2\pi/N$.
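The statement that the N samples are distinct and evenly spread can be verified directly. In the following sketch (the helper name `sample_phases` and the zero initial phase are our assumptions), the k-th sample is taken at time k/f_s = kM/(Nf), so its position inside one period of the sampled signal is (kM/N) mod 1.

```python
from fractions import Fraction
from math import gcd


def sample_phases(m, n):
    """Phases (as fractions of one period of the sampled signal) of n
    consecutive samples taken at f_s = (n / m) * f, in order of acquisition."""
    if gcd(m, n) != 1:
        raise ValueError("M and N must be relatively prime, see Eq. (5.11)")
    return [Fraction(k * m, n) % 1 for k in range(n)]


print(sample_phases(7, 5))  # phases 0, 2/5, 4/5, 1/5, 3/5: shuffling needed
print(sample_phases(6, 5))  # phases 0, 1/5, 2/5, 3/5, 4/5: already ordered
```

For M = 7 and N = 5 all five phases k/5 appear exactly once but out of order, whereas for M = N + 1 = 6 they are acquired in increasing order, which is the case in which time shuffling can be avoided.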

This sampling theory may be applied to the generator in question. Since it is difficult to fulfil the condition for avoiding shuffling of the samples, a re-shuffling is required. Let us assume that during the period $T_Q$ we acquired $K_D$ samples of the CLJ signal with order $i = 0, 1, \dots, K_D - 1$. Next, we need to rearrange the samples according to their timing position in the CLJ signal. The idea behind this reordering lies in the fact that $K_D$ samples of CLJ are taken during $K_M$ periods of the CLJ signal; therefore we can reconstruct one period of the signal CLJ from the $K_D$ samples. Thus, we compute the order index $j$ and sort the samples according to this index:

\[ j = i K_M \bmod K_D \tag{5.12} \]
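The reordering of (5.12) can be sketched as follows. The function name `reorder_samples` is ours, and the example input is an ideal, jitter-free CLJ square wave with 50 % duty cycle sampled for K_M = 5 and K_D = 7 (both assumptions made only for illustration).

```python
from math import gcd


def reorder_samples(samples, k_m):
    """Place sample i at position j = i * K_M mod K_D, see Eq. (5.12).

    len(samples) is taken as K_D; K_M and K_D must be relatively prime,
    otherwise the mapping i -> j would not be a permutation.
    """
    k_d = len(samples)
    if gcd(k_m, k_d) != 1:
        raise ValueError("K_M and K_D must be relatively prime")
    reordered = [None] * k_d
    for i, s in enumerate(samples):
        reordered[(i * k_m) % k_d] = s
    return reordered


# Samples of an ideal CLJ signal in order of acquisition (i = 0 .. 6).
q = [1, 0, 1, 1, 0, 0, 1]
print(reorder_samples(q, k_m=5))  # [1, 1, 1, 1, 0, 0, 0]
```

After reordering, the seven samples form one reconstructed period of CLJ (high part followed by low part). In a real generator the critical samples would be those next to the two transitions of this reconstructed period.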



5.4 Testing of TRNGs

The randomness of the generated numbers cannot be proven merely by passing commonly used statistical tests. Instead, each RNG implementation has to be evaluated individually as a unique system. Moreover, even if the prototype in the lab generates acceptable random numbers, this may not be true for each unit of the same TRNG type during its whole operation time; therefore continual testing of the generated output is required.

It is well known that most attacks are directed at the implementations of cryptographic algorithms and not at the algorithms themselves. This means that special attention should be paid to avoiding all weaknesses that could help an attacker break the system.

The topic of tests is highly relevant with respect to attacks. The generators, as sources of the secrets on which the security of the whole cryptosystem is based, are a popular target of attacks and of attempts to manipulate the generated output. The topic of attacks is also included in this chapter. Changing the working conditions may have a degrading influence on the parameters of the generated sequence.

In [74] an approach to the evaluation of physical random number generators is given which takes the construction of the TRNG into account. The document presents a theory of how TRNGs used in cryptographic systems should be evaluated.

For TRNG testing we have to accept the following facts [100]:

• A finite set of statistical tests may detect defects of a random source, but these tests cannot verify the randomness of the source.

• Good statistical properties of the random numbers are clearly not sufficient for sensitive cryptographic applications such as the generation of keys, signature key pairs or signature parameters.

• The key criterion is not the statistical behavior of the numbers but their entropy.

• For a good TRNG, the increase of entropy per generated number has to be sufficiently large.

In [74], a set of tests that should be passed is proposed, including Coron's test of entropy increase. In addition to the proof that the generated numbers have the desired properties, an explanation of the randomness extraction must be provided. In other words, the principle of random number generation has to be described for better understanding and for better analysis of possible attacks on the TRNG.

Startup Test, Online Test, TOT Tests. If an RNG prototype in a lab generates acceptable random numbers, this may not be true for each TRNG of the same type during its whole operation time. The reasons may be tolerances of the components of the noise source, ageing effects, or outside attacks. In the worst case the TRNG breaks down totally and the output numbers are constant from that moment on. Therefore, the developer of the TRNG should also implement tests that detect such cases of randomness degradation of the output bits. We distinguish between three types of tests [74]:

1. The startup test is used to verify the principal functionality of the noise source when the TRNG has been started.

2. The online test should detect if the quality of the random numbers is not sufficient for this particular TRNG or deteriorates in the course of time.

3. The tot test ('tot' stands for 'total failure of the noise source') should detect a total breakdown of the noise source.
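As an illustration of the third test type, a total breakdown in which the output becomes constant can be detected by counting the length of the current run of identical bits. The sketch below is our own minimal example and not a test prescribed by [74]; the class name `TotTest` and the threshold `max_run = 64` are hypothetical and would have to be chosen for the particular TRNG.

```python
class TotTest:
    """Signal a suspected total failure when the output stays constant."""

    def __init__(self, max_run=64):  # hypothetical threshold
        self.max_run = max_run
        self.last = None   # value of the previous bit
        self.run = 0       # length of the current run of identical bits

    def update(self, bit):
        """Feed one output bit; return True if an alarm should be raised."""
        if bit == self.last:
            self.run += 1
        else:
            self.last, self.run = bit, 1
        return self.run >= self.max_run


tot = TotTest(max_run=4)
print([tot.update(b) for b in [0, 1, 1, 1, 1, 0]])
# [False, False, False, False, True, False]
```

The test needs only one stored bit and one counter, which suits platforms with limited resources. For an ideal source, the probability that a given bit is followed by L − 1 identical bits is 2^−(L−1), so the threshold directly sets the false-alarm rate.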

Implementation of the tests. For the implementation of the tests one has to consider the limitations given by the platform on which the TRNG is implemented. The implementation targets are often smart cards or field programmable gate arrays (FPGAs) with limited memory space. Therefore, the chosen tests should require only small additional logic resources. Moreover, the tests should be selected according to the features of the TRNG and the basic principle of the random source. It is also possible to create new tests that are more suitable for the particular TRNG and better detect the possible defects.

Due to the limited memory resources of the target platforms, it is impossible to test the statistical properties on very long sequences (up to Mbits of data), as some tests (e.g. [97]) require. The goal is to find tests that are able to continually evaluate the quality of the random source without the need to store the output bits. Requirements which appropriate online tests should fulfil are formulated in [100].
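A test that evaluates the source without storing the output bits can be built from counters only. The following sketch shows one possible form, a windowed bit-balance check; it is our own illustration and not one of the tests required by [100]. The class name, the window length and the permitted deviation are hypothetical values.

```python
class StreamingBalanceTest:
    """Windowed bit-balance check that keeps two counters and no bit history."""

    def __init__(self, window=1024, max_deviation=100):  # hypothetical values
        self.window = window
        self.max_deviation = max_deviation
        self.count = 0   # bits seen in the current window
        self.ones = 0    # ones seen in the current window

    def update(self, bit):
        """Feed one bit. Return None inside a window; at the end of a window
        return True (passed) or False (alarm) and start a new window."""
        self.ones += bit
        self.count += 1
        if self.count < self.window:
            return None
        # |ones - zeros| = |2 * ones - window|
        passed = abs(2 * self.ones - self.window) <= self.max_deviation
        self.count = self.ones = 0
        return passed


test = StreamingBalanceTest(window=8, max_deviation=2)
print([test.update(b) for b in [1, 0, 1, 0, 1, 0, 1, 0]][-1])  # True
print([test.update(b) for b in [1] * 8][-1])                   # False
```

In hardware, this corresponds to two small counters and one comparator. The example checks only the balance of ones and zeros; other defects, such as correlation between bits, would need additional counters of the same kind.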

Two further requirements are placed on the tests. On the one hand, we expect detection of even a small deviation from the ideal random source; on the other hand, frequent false alarms are not acceptable (e.g. the tot test can block a smart card, so that a revision by the producer is required before it can be reused). Therefore, the ranges of permitted deviations from ideal randomness have to be set very carefully, so as not to decrease the security of the system but also not to block the TRNG by false alarms. This task is even more difficult for short sequences of random bits tested inside the TRNG.
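The trade-off between sensitivity and false alarms can be quantified for a simple case. The sketch below assumes a hypothetical test that raises an alarm when the numbers of ones and zeros in a window of ideal (unbiased, independent) bits differ by more than a threshold, and computes the exact false-alarm probability from the binomial distribution. The function name and parameters are ours.

```python
from math import comb


def false_alarm_probability(window, max_deviation):
    """P(|ones - zeros| > max_deviation) for `window` ideal random bits."""
    passing = sum(
        comb(window, k)
        for k in range(window + 1)
        if abs(2 * k - window) <= max_deviation
    )
    return 1 - passing / 2 ** window


print(false_alarm_probability(4, 2))    # 0.125
print(false_alarm_probability(64, 16))
print(false_alarm_probability(1024, 100))
```

For a window of only 4 bits and a permitted difference of 2, one window in eight of a perfect source already triggers an alarm. This illustrates why thresholds for short sequences are hard to set: tightening them catches smaller defects, but blocks a healthy generator more often.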

5.5 Attacks against TRNG

The main goal of an attacker of a cryptographic algorithm or implementation is to reveal some part of, or even the whole, secret key and then easily decrypt any encrypted message. Attacking RNGs has a different motivation than finding the key. Inside cryptographic systems the RNG plays a crucial role in the generation of secret keys, session keys, etc. A random key is the outcome of the generation process. Therefore, the target of the attack is not only the generated value of the secret key but also any information making it possible to predict the succeeding or preceding values of the keys.

In the case of a successful attack, the generated values may no longer be random: they can be constant or strongly biased, or the attacker knows an algorithm that predicts them correctly with high probability. With this approach one tries to change the random behaviour of the TRNG into a deterministic one, or at least to change the probability distribution of the generated sequence.

In the case of a PRNG, knowledge of the seed or internal state can lead to breaking the generator, because its structure is usually known and public. In the case of a well-designed TRNG, information about the current internal state does not provide any information about the previous or following one. Therefore, the focus of the attack is the source of noise and the randomness extraction method rather than the internal state of the TRNG.

Attacks on cryptographic systems (including RNGs) can be divided into algorithmic and implementation attacks.

Algorithmic attacks. The first group of attacks, the algorithmic attacks, includes mathematical analysis of the mechanism for randomness extraction or of the structure of the PRNG and does not require any access to the attacked unit. The analysis can be used especially against PRNG designs with an improperly designed way of obtaining the seed value [69]. If the seed contains a low level of entropy, the output of the generator has statistical properties not comparable to those of a random sequence and the effort needed to reproduce the output is lower. Mathematical analysis of TRNGs tries to find deterministic dependencies inside the extraction method that cause pseudo-randomness.

As the parameters of TRNGs are highly dependent on the implementation, attacking the hardware realisation directly can be more powerful.

Implementation attacks. The second group, the implementation attacks, expects direct physical access to an implementation and is based on weaknesses caused by the implementation of the RNG. Implementation attacks are further divided into passive and active attacks.

Passive attacks, usually called side-channel attacks, benefit from side-channel information gained from the physical implementation. The power consumption, execution time or electromagnetic emanations can provide additional useful information about the RNG internal state or processed data.

Active attacks require the involvement of the attacker in changing the standard working conditions, operation flow or design of the original implementation of the RNG. Non-invasive active attacks apply non-permanent changes to external parameters of the RNG, e.g. supply voltage or temperature, with the aim of achieving a non-standard, biased RNG output. With more resources one can execute an invasive attack and change the physical structure of the implementation. The attacker tries to destroy the source of randomness and make the output of the RNG constant, or to obtain the output of the generator directly.

5.6 Conclusions

In this chapter we have introduced the topic of random numbers. The extraction of random bits in a digital environment is a crucial topic in the area of system implementations with public-key cryptography. Randomness itself and three typical sources of randomness (noise, metastability and jitter) were described. In order to provide an overview of the current state of research, we have collected descriptions of recently published design proposals and implementations of TRNGs.

A typical design of a TRNG implemented in a digital device includes a source of randomness from which a digitised noise signal can be harvested by a proper mechanism. We have explained the importance of research in the areas of harvesting mechanisms and post-processing. Positive results of statistical tests do not assure a truly random basis of the generated sequence. In addition, the working environment may also have a significant impact on the parameters of the output bits. Requirements on RNGs applied in cryptography cover security parameters of the design, unpredictability of the generated sequence and specific statistical properties of the output sequence.

The generator chosen for our research, the PLL-TRNG proposed in [60], will be further tested and analysed in order to provide better tools for choosing its parameters and to understand its behaviour in a changing environment. The described theoretical background on the testing of and attacks on RNGs has been applied, and the results are given in the following chapter.


6 True Random Number Generator

This chapter is dedicated to the analysis of a jitter-based random number generator from several points of view. Our work is based on the TRNG design proposed by Viktor Fischer and Miloš Drutarovský, published in 2002 [60]. We extend the previously published results summarised in the previous chapter. Our focus is on the analysis of the generator under changing working conditions and configuration settings.

The results of the research were published in [47, 48, 61, 62, 114–116, 119]. The main achievements of our research are in the following areas:

• Analysis of PLL circuitry as a source of randomness – implementation issues, possible PLL configurations, verification of vendor parameters,

• Analysis of TRNG implementation in different FPGAs – feasibility study, design considerations, practical results,

• Stochastic model of the PLL-TRNG – proposal and practical verification,

• Temperature influence on the PLL-TRNG – practical attack on the TRNG with results and suggestions for design.

The chapter is structured as follows. In Section 6.1 we describe two approaches to clock synthesis in modern FPGAs and summarise the parameters of the clock circuitry, verified by practical measurements of the PLL parameters. Section 6.2 provides an analysis of PLL configurations, practical results from Altera and Actel FPGA implementations of the generator, and a stochastic model of the generator. In Section 6.3 we describe a non-invasive attack on the generator together with its practical outcomes. In the last part (Section 6.4) we discuss the obtained results and provide ideas for further research.

6.1 Clock Synthesis in FPGAs

Present-day integrated digital systems need numerous clock signals with various frequencies. Synthesising the clocks in separate circuits is not efficient, and the frequencies are too high to be generated by an external crystal. For this purpose FPGA vendors offer clock circuitry embedded on the FPGA chip. Besides synthesising clock signals with the required frequencies, it provides additional functions that make it possible to process signals with very high frequencies. Depending on the FPGA vendor and family, the clock conditioning circuits usually support the following functions: clock phase adjustment, clock delay minimisation, clock frequency synthesis, spread-spectrum clock modulation, static or dynamic configuration of circuit parameters, etc.

We explain the two most widely applied principles of clock signal management in FPGAs, based on the PLL and the delay-locked loop (DLL). Both principles can be implemented as digital or analog circuits. While Xilinx has chosen a digital implementation of the DLL in most of its FPGAs, other vendors such as Altera and Actel include clock circuitry based on an analog PLL in their devices.

Phase-Locked Loop Circuitry A typical analog PLL block in Altera and Actel devices (see Figure 6 – 1) can provide at least one synthesised clock signal with frequency $F_{OUT}$:

\[ F_{OUT} = \frac{F_{VCO}}{k} = F_{REF}\,\frac{m}{k} = F_{IN}\,\frac{m}{n \times k}, \tag{6.1} \]

where $F_{IN}$ is the frequency of the input clock source, which can be an external crystal or another PLL in the case of a PLL cascade, $F_{REF}$ is the input reference frequency that is used to lock the feedback clock $F_{FB}$, and the voltage-controlled oscillator (VCO) produces a clock signal with output frequency $F_{VCO}$. The reference, feedback and post-divider values n, m and k can range from one to several hundred in FPGAs [11, 14], or to several thousand in ASICs [22]; together with the VCO working limits they set the range of input and output frequencies.
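As a rough illustration of Equation 6.1, the following Python sketch computes the reference, VCO and output frequencies for a given set of dividers. The function name and the optional VCO limits are illustrative assumptions and do not correspond to any vendor tool; real devices impose further limits (for example on $F_{REF}$) that are not modelled here.

```python
from fractions import Fraction


def pll_frequencies(f_in_hz, m, n, k, vco_range_hz=None):
    """Return (F_REF, F_VCO, F_OUT) in Hz following Eq. 6.1.

    n, m and k are the reference, feedback and post-divider values.
    vco_range_hz is an optional (min, max) tuple; it is a hypothetical
    parameter used only to illustrate the VCO working limits.
    """
    if min(m, n, k) < 1:
        raise ValueError("divider values must be positive integers")
    f_ref = Fraction(f_in_hz) / n      # F_REF = F_IN / n
    f_vco = f_ref * m                  # F_VCO = F_REF * m
    f_out = f_vco / k                  # F_OUT = F_VCO / k
    if vco_range_hz is not None:
        lo, hi = vco_range_hz
        if not lo <= f_vco <= hi:
            raise ValueError("VCO frequency outside the allowed range")
    return float(f_ref), float(f_vco), float(f_out)


# Example with arbitrary values: 40 MHz input, m = 12, n = 4, k = 2
f_ref, f_vco, f_out = pll_frequencies(40e6, m=12, n=4, k=2)
print(f_ref, f_vco, f_out)
```

Exact rational arithmetic is used internally so that ratios such as 12/7 do not accumulate rounding error before the final conversion to floating point.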

[Figure 6 – 1: blocks shown are the clock input ($F_{IN}$), reference divider (:n, output $F_{REF}$), phase frequency detector, charge pump, loop filter and VCO (output $F_{VCO}$), feedback divider (:m, output $F_{FB}$), and post-dividers (:k) driving the clock output(s) ($F_{OUT}$).]

Figure 6 – 1 Block diagram of analog PLL circuitry for clock signal synthesis in Altera FPGA [11]

Delay-Locked Loop Circuitry In DLL circuits the clock signal is synthesised by inserting a variable delay between the input and output clock signals (see Figure 6 – 2). Delay lines can be built using a voltage-controlled delay or as a series of discrete delay elements, as in the Xilinx DLL [125, 126].


[Figure 6 – 2: blocks shown are the clock input ($F_{IN}$), phase detector with feedback input ($F_{FB}$), +/- control of the delay line, and the clock output ($F_{OUT}$).]

Figure 6 – 2 Block diagram of digital DLL unit typical for Xilinx FPGA clock management circuits

The DLL achieves very good results in delay compensation and clock conditioning. However, the available range of clock dividers is much more limited than for a PLL. In Spartan II FPGA devices, for example, an output clock can be derived from the input signal with its frequency doubled or divided by 1.5, 2, 2.5, 3, 4, 5, 8 or 16 [127].
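To make the contrast with the PLL concrete, the short sketch below lists the output frequencies reachable from one input clock with the Spartan II options quoted above. The function and constant names are illustrative assumptions; whether further DLL outputs exist on a particular device is not covered here.

```python
# Division factors quoted above for Spartan II DLLs [127]
DLL_DIVISORS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)


def dll_output_frequencies(f_in_hz):
    """Return the sorted list of output frequencies reachable from f_in_hz:
    the doubled clock plus each divided clock."""
    options = {2 * f_in_hz}
    options.update(f_in_hz / d for d in DLL_DIVISORS)
    return sorted(options)


# Example with an arbitrary 48 MHz input clock
freqs = dll_output_frequencies(48e6)
print(len(freqs), [f / 1e6 for f in freqs])
```

Only nine ratios are available in this sketch, whereas the divider ranges of an analog PLL allow thousands of distinct ratios.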

6.1.1 PLL as Source of Randomness

Due to its digital nature, the DLL in Xilinx devices is less sensitive to a noisy environment than an analog PLL with a VCO. The VCO tends to lock to the frequencies of disturbing external signals, and therefore separate power supply and ground networks connected only to the clock circuitry are required. On the other hand, the analog PLL allows a small-area implementation that provides a wide range of clock frequencies. The DLL technology is limited in this respect and offers only certain ratios between input and output frequencies.

Changes in temperature, or fluctuations of the supply voltage correlated with the switching activity of closely placed logic, may cause a drift in the generated clock signal. To compensate, the loop adjusts the delay elements or the VCO frequency, which appears as deterministic jitter added to the clock signal. Another source of noise influencing the PLL circuitry is the input clock signal. There is therefore a trade-off between compensating the internal and the external jitter.

All phase changes in the PLL, or differences of delays in the DLL, introduce jitter into the synthesised output signal. Filters inside the clock circuitry are matched to eliminate the non-linearity caused by the loop and by external influences; however, the intrinsic random noise of the VCO is always present in the output clock signal and cannot be attenuated completely. For this reason the PLL is a promising source of randomness suitable for implementing a TRNG. In addition, the frequency of the VCO is never constant; even under stable working conditions it fluctuates around a mean value.

From this analysis we can conclude that PLL circuits are more suitable for a TRNG design based on jitter sampling, as they offer a wide frequency range for the generated signals. Moreover, the internal PLL circuitry provides a reliable source of jitter.

Analog PLL in Altera and Actel FPGAs The core of the clock circuitry embedded in Altera and Actel FPGAs is formed by an analog PLL circuit surrounded by several delay lines, clock multipliers/dividers, and circuits for interconnection between the internal clock network and external pads. The number of PLLs and their features depend on the chosen FPGA type and vendor. Tables 6 – 1 and 6 – 2 present the basic parameters of PLLs and clock circuits for FPGA devices from Altera (APEX20K(E) [14], Cyclone [12, 17] and Stratix [15, 19]) and Actel (Axcelerator [2], ProASICplus [3], ProASIC3(E) [4]).

Table 6 – 1 Parameters of PLL embedded in Altera FPGAs

family | # of PLLs | m range | n range | k range | max. output period jitter
APEX20K | 1 | – | – | – | 200 ps
APEX20KE | 2, 4 | 1-160 | – * | – | 0.35% RMS of output period
Cyclone | 1, 2 | 2-32 | 1-32 | 1-32 | ±300 ps for $F_{OUT}$ ≥ 100 MHz; 60 mUI for $F_{OUT}$ < 100 MHz
Cyclone II | 2, 4 | 1-32 | 1-4 | 1-32 | NA **
Stratix | 4, 8 × FPLL *** | 1-32 | 1-32 | 1-32 | ±100 ps for $F_{OUT}$ > 200 MHz; ±20 mUI for $F_{OUT}$ < 200 MHz
Stratix | 2, 4 × EPLL | 1-512 | 1-512 | 1-1024 | (as for FPLL)
Stratix II | 4, 8 × FPLL | 1-32 | 1-4 | 1-32 | NA **
Stratix II | 2, 4 × EPLL | 1-32 | 1-32 | 1-32 | NA **

* m/(n × k) = 1-280.

** The jitter specification for the PLL output pins depends on the I/O pins in its VCCIO bank: how many of them are switching outputs, how much they toggle, and whether or not they use programmable current strength.

*** EPLL and FPLL stand for Enhanced and Fast PLL, respectively.


Table 6 – 2 Parameters of PLL embedded in Actel FPGAs

family | # of PLLs | dividers range | max. output period jitter
ProASIC3(E) | 1 (6) | NA | 180 ps for $F_{OUT}$ = 24 MHz; 90 ps for $F_{OUT}$ = 100 MHz; 70 ps for $F_{OUT}$ = 350 MHz
ProASICplus | 2 | m = 1-64, n = 1-32, k = 1-4 | ±1% for $F_{OUT}$ < 10 MHz; ±2% for 10 MHz < $F_{OUT}$ < 60 MHz; ±1% for $F_{OUT}$ > 60 MHz
Axcelerator | 8 | m = 1-64, n = 1-64 | long-term: 1% of $F_{OUT}$ or 100 ps; short-term: 50 ps + 1% of $F_{OUT}$
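The jitter limits in Tables 6 – 1 and 6 – 2 are quoted in different units (ps, mUI and percent), which makes them hard to compare directly. The sketch below converts the relative units to picoseconds. It assumes that one unit interval (UI) equals one output clock period and that percentages refer to the output period; the second assumption is stated explicitly only for the APEX20KE entry, so for the other entries it should be treated as a guess. All function names are illustrative.

```python
def period_ps(f_out_hz):
    """Clock period in picoseconds for an output frequency in Hz."""
    return 1e12 / f_out_hz


def mui_to_ps(jitter_mui, f_out_hz):
    """Convert jitter in milli unit intervals (1 UI = one clock period) to ps."""
    return jitter_mui * 1e-3 * period_ps(f_out_hz)


def percent_to_ps(jitter_percent, f_out_hz):
    """Convert jitter given as a percentage of the output period to ps."""
    return jitter_percent * 1e-2 * period_ps(f_out_hz)


# 60 mUI at 50 MHz, and 0.35 % of the period at 100 MHz
print(round(mui_to_ps(60, 50e6), 1), round(percent_to_ps(0.35, 100e6), 1))
```

With these assumptions, 60 mUI at 50 MHz corresponds to about 1200 ps and 0.35% at 100 MHz to about 35 ps.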

Two parameters of the PLL clock circuits have a significant impact on the possibility of extracting randomness from the clock jitter: the output period jitter of the PLL and the range of the frequency dividers. FPGA vendors keep decreasing the level of timing jitter in clock signals in the latest FPGA families, which was also confirmed by our experimental measurements (described later). On the other hand, the range of divisors in high-density devices has been enlarged, so that a wider range of synthesised clock frequencies can be achieved.

The jitter size is usually expressed as a peak-to-peak value (the difference between the smallest and the largest clock period) or as a 1-sigma value $\sigma_{jit}$ (the standard deviation). Typical values of the period jitter depend on the technology and configuration of the PLL and range from 3.5 ps to 10 ps for ASICs [22], or up to 100 ps for FPGAs [11, 19]. Since the technology of the embedded PLL and the quality of the VCO are usually fixed by the FPGA vendor, a user can modify the output jitter by configuring the PLL divider values (m, n, k) and the loop filter bandwidth.
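The two jitter measures defined above can be computed from a series of measured clock periods as sketched below. The sample values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption; for long records the two are practically identical.

```python
import statistics


def period_jitter(periods_ps):
    """Return (peak_to_peak, one_sigma) of a list of clock periods in ps."""
    if len(periods_ps) < 2:
        raise ValueError("need at least two measured periods")
    peak_to_peak = max(periods_ps) - min(periods_ps)
    one_sigma = statistics.pstdev(periods_ps)   # population standard deviation
    return peak_to_peak, one_sigma


# Hypothetical measurements around a nominal 10 ns (10000 ps) period
samples = [9990.0, 10000.0, 10010.0, 10000.0]
pp, sigma = period_jitter(samples)
print(pp, round(sigma, 2))
```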

Jitter Generated in Altera Stratix FPGA In analog PLLs, various noise sources cause the PLL's internal VCO to fluctuate in frequency. Under ideal conditions, the fluctuations visible as jitter are caused only by analog (non-deterministic) internal noise sources. In that case the jitter is called intrinsic jitter. Other frequency fluctuations are caused by variations of supply voltage and temperature, by external interference through the power and ground connections, or by the noisy environment generated by internal FPGA circuits [125]. The PLL's control circuitry adjusts the VCO back to the specified frequency, and this change is seen as (deterministic) jitter.

We further analyse the parameters of the PLL circuits in the Altera Stratix family and their relation to the generated clock signal and the jitter it contains. The Altera Stratix devices include two types of PLLs:

Fast PLL (FPLL): Stratix devices include up to 8 FPLLs. The FPLLs offer general-purpose clock management with multiplication and phase shifting. The multiplication is simplified in comparison to the EPLL and uses only m/k scaling factors with a range from 1 to 32 [15]. Depending on m (for speed grade -5), the input frequency can vary from 15 to 717 MHz, the output frequency from 9.4 to 420 MHz, and the VCO frequency from 300 to 1000 MHz.

Enhanced PLL (EPLL): Compared to the FPLL, the EPLLs have additional configurable features such as external feedback, configurable bandwidth and run-time reconfiguration, and an extended range of parameters. The input frequency can vary (for a speed grade -5 device) from 3 to 684 MHz, the output frequency from 9.4 to 420 MHz and the VCO frequency from 300 to 800 MHz. The reference, feedback and post-divider values n, m and k can vary from 1 to 512 (1024 for k) with 50% duty cycle [15].

The size of the intrinsic jitter of the PLL depends on the quality factor Q of the VCO, on the bandwidth of the loop filter (see Figure 6 – 1), and on the so-called pattern jitter introduced by the phase frequency detector. The technology of the PLL and the quality of the VCO are given by the FPGA design. A designer can change the output jitter directly, by modifying the scaling factors (for FPLL and EPLL) and the filter bandwidth (only for EPLL), but also indirectly through the design of the board (separation of the analog and digital ground, filtering of the analog power supply, etc.).

The PLL acts as a low-pass filter for jitter on its input, so a low bandwidth setting of the loop filter can be applied to filter out high-frequency jitter from the input clock. To track the input jitter, one can use a high bandwidth setting. As already mentioned, power supply noise can cause the VCO output frequency to fluctuate and thus cause jitter. With a low bandwidth the feedback loop responds more slowly to the noise injected at the VCO, so it cannot adjust for this noise and counteract it. A high bandwidth allows the loop to respond quickly to the noise and compensate for it. The choice of the loop filter bandwidth is therefore a trade-off between filtering the jitter of the input signal (low bandwidth) and suppressing the VCO noise (high bandwidth).


Since the size of the jitter is very important for our method, we needed to measure it for various PLL configurations and to verify the values provided by the chip vendors. For example, according to the vendor's measurements [125], the PLL jitter in an Apex FPGA has a 1-sigma value of $\sigma_{jit}$ ≈ 15.9 ps for a synthesised clock signal with $F_{OUT}$ = 66.6 MHz and feedback divider m = 2. These results were acquired under “ideal conditions” with a minimal amount of FPGA resources occupied and minimal input/output activity. Our measurements showed that the clock jitter in the Apex FPGAs is significantly higher (about 140 ps) for higher divider factors and with internal FPGA flip-flops switching at different clock frequencies. Note that the jitter size depends on the PLL settings and on the type of power supply filter included on the development board, but the measured jitter is never lower than the intrinsic jitter of the FPGA.

[Figure 6 – 3: (a) FPLL with ratio 12/7, $\sigma_{jit}$ ≈ 10 ps; (b) EPLL with ratio 139/133, $\sigma_{jit}$ ≈ 16 ps.]

Figure 6 – 3 Jitter of the clock signal in Altera Stratix design (horizontal scale: 200 ps/div)

For the jitter measurement on a Stratix family FPGA we selected the Altera DSP Development board with a Stratix EP1S25F780C5 device [16]. The jitter was measured in the same way as in [62], using an Agilent Infiniium DCA 86100B wide-bandwidth oscilloscope. We found that the jitter is significantly smaller than on the Nios board with APEX [10] (used as the reference in [60]). For example, for the FPLL and the ratio 12/7 the 1-sigma value of the jitter is about 10 ps (see Figure 6 – 3(a)), and for the EPLL and the ratio 139/133 it is about 16 ps (see Figure 6 – 3(b)).


6.2 PLL-Based TRNG on FPGA

Having covered the general parameters of PLL circuitry in FPGAs, we continue with results on the practical implementation of the PLL-TRNG in FPGA families from two vendors, Altera and Actel. The presented stochastic model of the generator helps to understand the randomness extraction method.

6.2.1 PLL Configurations

The design depicted in Figure 5 – 5 represents only one of the possible PLL configurations that we investigate further. In general, depending on the chosen FPGA, the PLLs in the TRNG can be configured in three ways: with one PLL, with two parallel PLLs, or with two (or more) cascaded PLLs (see Figure 6 – 4).

[Figure 6 – 4: three block diagrams, each with clock input CLI and clock signals CLJ and CLK feeding a D flip-flop: a) a single PLL, b) PLL1 and PLL2 in parallel, c) PLL1 and PLL2 in cascade.]

Figure 6 – 4 Configurations of TRNG with: a) one PLL, b) two parallel PLLs and c) two cascaded PLLs

In some cases, especially in low-cost FPGAs, only one PLL is available for the TRNG (see Figure 6 – 4a) and the others (if available) are used for the rest of the system. If there are no restrictions, or only acceptable ones¹⁰, on the input clock frequency of the logic outside the TRNG, then one or more PLLs can be shared by the TRNG and the user logic.

¹⁰ By acceptable we mean requirements on the clocking frequency that lie in a range that also allows the TRNG to meet the working condition (5.6).

Table 6 – 3 Parameters settings for different TRNG configurations

configuration / parameter | one PLL | two parallel PLLs | two cascaded PLLs
$F_{CLK}$ | $F_{CLI}$ | $\frac{M_{CLK}}{D_{CLK}} F_{CLI}$ | $F_{CLI}$
$F_{CLJ}$ | $\frac{M_{CLJ}}{D_{CLJ}} F_{CLI}$ | $\frac{M_{CLJ}}{D_{CLJ}} F_{CLI}$ | $\frac{M_{CLJ_1} M_{CLJ_2}}{D_{CLJ_1} D_{CLJ_2}} F_{CLI}$
$K_M$ | $M_{CLJ}$ | $M_{CLJ} D_{CLK}$ | $M_{CLJ_1} M_{CLJ_2}$
$K_D$ | $D_{CLJ}$ | $D_{CLJ} M_{CLK}$ | $D_{CLJ_1} D_{CLJ_2}$
$S$ | $\frac{1}{4 M_{CLJ}}$ | $\frac{1}{4 M_{CLK} M_{CLJ}}$ | $\frac{1}{4 M_{CLJ_1} M_{CLJ_2}}$
$R$ | $\frac{F_{CLI}}{D_{CLJ}}$ | $\frac{F_{CLI}}{D_{CLK} D_{CLJ}}$ | $\frac{F_{CLI}}{D_{CLJ_1} D_{CLJ_2}}$

In most cases the use of two PLLs is more than sufficient to fulfil condition (5.6). Usually the option with two parallel PLLs is used (see Figure 6 – 4b). When the range of PLL divisors is not sufficient (again, this is the case for low-cost FPGAs), a cascade of two (or more, if available) PLLs can be applied (see Figure 6 – 4c). Each configuration achieves different characteristics (defined in [61]) depending on the parameters of the PLLs, namely the maximum input, output and VCO frequencies, the multiplication and division factors, etc., and in this way the needed frequency can be synthesised. The parameters of the three generator configurations considered are summarised in Table 6 – 3.

We can conclude that the use of two PLLs in either a parallel or a serial (cascaded) configuration can significantly increase the sensitivity to jitter and the output bit-rate of the generator, depending on the available range of multiplication factors, division factors, or both.

The equations in Table 6 – 3 show from which PLL coefficients (dividers) the factors $K_M$ and $K_D$ are composed. The factor $K_M$ has a direct influence on the value of $\mathrm{MAX}(\Delta T_{min})$ (see Eq. 5.5). While for the configurations with one PLL or several cascaded PLLs $K_M$ is composed only of multiplying coefficients, in the parallel configuration a dividing coefficient is included. This should be considered especially when not all the PLL coefficients have the same range.
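The composition of $K_M$ and $K_D$ in Table 6 – 3 can be sketched in Python as follows. The function names, the dictionary layout and the 80 MHz input clock in the example are illustrative assumptions; the factors are returned exactly as composed in the table, without reduction to lowest terms.

```python
from fractions import Fraction


def one_pll(f_cli, m_clj, d_clj):
    """Single PLL: CLK is the input clock, CLJ is synthesised."""
    return {"F_CLK": Fraction(f_cli), "F_CLJ": Fraction(f_cli) * m_clj / d_clj,
            "K_M": m_clj, "K_D": d_clj}


def two_parallel_plls(f_cli, m_clj, d_clj, m_clk, d_clk):
    """Two parallel PLLs: both CLK and CLJ are synthesised from CLI."""
    return {"F_CLK": Fraction(f_cli) * m_clk / d_clk,
            "F_CLJ": Fraction(f_cli) * m_clj / d_clj,
            "K_M": m_clj * d_clk, "K_D": d_clj * m_clk}


def two_cascaded_plls(f_cli, m_clj1, d_clj1, m_clj2, d_clj2):
    """Two cascaded PLLs: CLJ passes through both PLLs, CLK is the input."""
    return {"F_CLK": Fraction(f_cli),
            "F_CLJ": Fraction(f_cli) * m_clj1 * m_clj2 / (d_clj1 * d_clj2),
            "K_M": m_clj1 * m_clj2, "K_D": d_clj1 * d_clj2}


def ratio_is_consistent(cfg):
    """Check that F_CLJ / F_CLK equals K_M / K_D, as Table 6-3 implies."""
    return cfg["F_CLJ"] / cfg["F_CLK"] == Fraction(cfg["K_M"], cfg["K_D"])


# Example values: CLJ uses 12/7, CLK uses 25/12, hypothetical 80 MHz input
cfg = two_parallel_plls(80_000_000, m_clj=12, d_clj=7, m_clk=25, d_clk=12)
print(cfg["K_M"], cfg["K_D"], ratio_is_consistent(cfg))
```

In the parallel case the divider of the CLK branch ends up in $K_M$, which is why a narrow divider range on one PLL can limit the achievable $K_M$.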


6.2.2 Analysis of TRNG in Altera Stratix FPGAs

Our implementation strategy for this case was to obtain the fastest and best-quality generator using a minimum of resources (PLLs). Since the Stratix family contains two types of PLLs, several configurations are possible.

The most economical solution would be based on one FPLL (since there are four FPLLs in the chosen device). However, the multiplication and division factors of a single FPLL cannot fulfil the implementation condition (5.6). Another option is to use an EPLL, whose extended range of parameters makes a single-PLL TRNG possible. For this reason, the following four architectures of the TRNG implemented in Altera Stratix devices are possible:

1. Two FPLLs (referred to below as configuration A)

2. One FPLL and one EPLL (configuration B)

3. One EPLL (configuration C)

4. Two EPLLs (configuration D)

The relationship between the sensitivity to jitter S and the output bit-rate R of the TRNG for the configuration with two parallel PLLs was described in Equations 5.8 and 5.9 (see Table 6 – 3 for the other configurations and characteristic parameters).

Experimental Results The TRNG architectures were tested on the Altera DSP board with a Stratix EP1S25F780C5 [16]. They were described in VHDL and implemented using the Altera Quartus II development system, version 3.0 SP2. The acquired bits were transmitted to the PC through a parallel port. The complete TRNG design, including a 1024 x 8-bit FIFO and a parallel interface controller, needs up to 120 LEs of the about 25000 LEs available in the device. The signal CLK was used as the clock signal for the control logic and was therefore limited to about 250 MHz (although the output frequency of the PLL can be higher).

To test the basic quality of the different versions of the TRNG, we evaluated the following statistical parameters of the generated bit sequence b(n) (all of them computed for a record length of N = 1000000 bits):

1. Bias computed as

\[ \mathrm{bias} = E[b(n)] - 0.5 = E[b] - 0.5 \cong \frac{N_1}{N} - 0.5 \tag{6.2} \]

where $N_1$ is the number of $b(n) = 1$ for $n = 0, 1, \ldots, N-1$. For a good TRNG, the bias should converge to 0 (with deviation $\approx \pm 3/\sqrt{N}$).

2. Maximal autocorrelation coefficient computed as

\[ \rho_{max} = \max\{|\mathrm{corr}(b_k)|,\ k = 1, 2, \ldots, 100\} \tag{6.3} \]

where

\[ \mathrm{corr}(b_k) = \mathrm{corr}\bigl(b(n), b(n-k)\bigr) = \frac{E\bigl[\bigl(b(n) - E[b(n)]\bigr)\bigl(b(n-k) - E[b(n-k)]\bigr)\bigr]}{\sqrt{\mathrm{var}(b(n))\,\mathrm{var}(b(n-k))}} \tag{6.4} \]

\[ \mathrm{var}(b(n)) = \mathrm{var}(b) = E\bigl[\{b - E[b]\}^2\bigr] = E[b]\{1 - E[b]\} \tag{6.5} \]

Based on [42, 86] it can be shown that for a good TRNG (with bias → 0) and a finite record length N, the scaled coefficient $\sqrt{N}\,\mathrm{corr}(b_k)$ approximately follows the standard normal distribution N(0, 1), and the following condition should be fulfilled (the value χ = 2.576 follows from P(X > χ) = α = 0.01/2 for the N(0, 1) distribution):

\[ \rho_{max} \rightarrow \frac{2.576}{\sqrt{N}} = 0.002576 \tag{6.6} \]

3. Standard FIPS 140-2 statistical tests [57], which analyse 20000-bit records and define thresholds for assessing TRNG randomness. The FIPS 140-2 tests comprise the Monobit, Poker, Runs and Long runs tests. We analysed 100 sequences for each tested TRNG architecture and evaluated the relative numbers ($t_M$, $t_P$, $t_R$, $t_L$) of sequences that passed each test. A good TRNG should pass all FIPS tests, so that $t_{FIPS} = t_M t_P t_R t_L = 1$.
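The first two measures in the list above can be estimated from a recorded bit sequence as sketched below. Expectations are replaced by sample averages, and the sequence is assumed to be stationary, so that a single mean and variance are used for b(n) and b(n − k). The function names are illustrative, and the software pseudo-random generator only stands in for real TRNG output. The FIPS 140-2 tests are not reproduced here.

```python
import math
import random


def bias(bits):
    """Eq. 6.2: relative frequency of ones minus 0.5."""
    return sum(bits) / len(bits) - 0.5


def corr(bits, k):
    """Eq. 6.4 for lag k, using the global mean and var(b) = E[b](1 - E[b])."""
    n = len(bits)
    p = sum(bits) / n
    var = p * (1.0 - p)
    if var == 0.0:
        raise ValueError("constant sequence: correlation is undefined")
    acc = sum((bits[i] - p) * (bits[i - k] - p) for i in range(k, n))
    return acc / ((n - k) * var)


def rho_max(bits, max_lag=100):
    """Eq. 6.3: largest absolute autocorrelation over lags 1..max_lag."""
    return max(abs(corr(bits, k)) for k in range(1, max_lag + 1))


def rho_threshold(n, chi=2.576):
    """Eq. 6.6: chi / sqrt(N)."""
    return chi / math.sqrt(n)


# Software PRNG used only as a stand-in for TRNG output
rng = random.Random(1)
record = [rng.getrandbits(1) for _ in range(20000)]
print(bias(record), rho_max(record, max_lag=20), rho_threshold(len(record)))
```

A pure-Python loop like this is slow for the full N = 1000000 bits and 100 lags; a vectorised implementation would be preferable for records of that size.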

Tables 6 – 4 and 6 – 5 give the parameters and results for the selected TRNG architectures. The best output bit-rate and quality (expressed through the bias, $\rho_{max}$ and $t_{FIPS}$) are obtained with the TRNG configuration with two EPLLs. The enhanced adjustable parameters of the EPLL make it possible to reach the level of sensitivity required by the jitter present in the device. The configurations with the FPLL are not suitable for jitter sampling because of the limited range of the PLL dividers (see Table 6 – 1). With low sensitivity S the number of critical samples is very low, the configuration is unstable and the output sequence has a significant bias.


Table 6 – 4 Configuration parameters of tested TRNG

Conf. | PLL1 type | PLL1 $K_M/K_D$ | PLL2 type | PLL2 $K_M/K_D$ | Total $K_M/K_D$ | $\mathrm{MAX}(\Delta T_{min})$ [ps] | R [kb/s] | $\sigma_{jit}$ [ps]
A | Fast | 12/7 | Fast | 25/12 | 144/175 | 10.4 | 952.4 | 10
B | Enh. | 43/7 | Fast | 25/12 | 516/175 | 2.9 | 952.4 | 23
C | Enh. | 212/207 | - | 1 | 212/207 | 14.7 | 386.5 | 12
D | Enh. | 43/7 | Enh. | 31/10 | 430/217 | 2.3 | 1142.9 | 13

Table 6 – 5 Results of quality evaluation of tested TRNG configurations

Configuration | bias | $\rho_{max}$ | $t_{FIPS}$
A | -0.358 | 0.043 | 0
B | 0.054 | 0.023 | 0
C | -0.003 | 0.012 | 0.96
D | 0.002 | 0.003 | 1

The final speed of the generator in configuration D (more than 1 Mbit/s) is much higher than that presented in [60], while the quality confirmed by statistical tests remains comparable. Thanks to the analysis of the available PLL configurations and their parameters, we have presented a generator without the additional delay logic used in the original proposal [60]. The simpler sampling part of the generator is made possible by the wider divider range of the PLL circuits.
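The $\mathrm{MAX}(\Delta T_{min})$ and R columns of Table 6 – 4 can be cross-checked with the sketch below. It rests on three assumptions that are not stated in this section: an input clock $F_{CLI}$ of 80 MHz, the form $\mathrm{MAX}(\Delta T_{min}) = T_{CLK}/(4 K_M)$ for coprime $K_M$ and $K_D$ (Eq. 5.5 lies outside this chapter), and that PLL1 generates CLJ while PLL2 generates CLK. Under these assumptions the tabulated values are reproduced after rounding, which supports them but does not prove them.

```python
from fractions import Fraction

F_CLI = 80_000_000  # Hz; assumed, inferred from Table 6-4, not stated in the text

# (M_CLJ, D_CLJ, M_CLK, D_CLK); configuration C has no second PLL, so CLK = CLI
CONFIGS = {
    "A": (12, 7, 25, 12),
    "B": (43, 7, 25, 12),
    "C": (212, 207, 1, 1),
    "D": (43, 7, 31, 10),
}


def table_row(m_clj, d_clj, m_clk, d_clk, f_cli=F_CLI):
    """Return (K_M, K_D, MAX(dT_min) in ps, R in kb/s) for one configuration."""
    k_m, k_d = m_clj * d_clk, d_clj * m_clk        # Table 6-3, parallel PLLs
    f_clk = Fraction(f_cli) * m_clk / d_clk
    dt_min_ps = 1e12 / float(4 * k_m * f_clk)      # assumed: T_CLK / (4 K_M)
    r_kbps = float(f_clk / k_d) / 1e3              # F_CLK / K_D, the R row of Table 6-3
    return k_m, k_d, round(dt_min_ps, 1), round(r_kbps, 1)


for name, dividers in CONFIGS.items():
    print(name, table_row(*dividers))
```

With these inputs, configuration D gives a CLK frequency of 248 MHz, which agrees with the stated limit of about 250 MHz for the control logic.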

6.2.3 Analysis of TRNG in Actel FPGAs

In this section we explain how the parameters of the clock circuitry influence the parameters of the discussed PLL-TRNG in the case of a low-cost FPGA. The analysis should answer the question of whether Actel FPGAs are suitable for a PLL-TRNG implementation and which TRNG parameters are achievable.

Clock generator circuitry in Actel FPGAs The ProASICplus family was chosen as the target for the TRNG implementation. This low-cost FPGA family based on flash technology offers two well-configurable PLLs on a chip. For experiments and measurements we selected an evaluation board [1] with a ProASICplus APA300-PQFP208 device [3]. An on-board 40 MHz oscillator was used as the reference input clock source. The board has separate power supplies for the PLLs and for the rest of the chip, which makes it possible to analyse the impact of power supply disturbances (from off-chip manipulation, or from the activity of the on-chip logic when the power supplies are interconnected) on the generated sequences.

The following limits apply to the frequencies of the signals connected to the on-chip PLL circuits: $F_{in}$ = 1.5 − 240 MHz, $F_{out}$ = 6 − 180 MHz and $F_{VCO}$ = 24 − 180 MHz. Using the dividers listed in Table 6 – 2, the PLL output frequency $F_{out}$ is derived from the input frequency $F_{in}$ as

\[ F_{out} = F_{in}\,\frac{m}{n \times k} = \frac{F_{VCO}}{k}, \tag{6.7} \]

where m, n and k are the PLL frequency dividers and $F_{VCO}$ denotes the output frequency of the VCO.

In order to compare possible configurations and find the ranges of the TRNG<br />

parameters one can go through the following steps. The frequency ranges of the<br />

two rationally related clock signals are given by the ranges of the PLL<br />

dividers and the input frequency (using equations from Table 6 – 3). From the ratio<br />

of the frequencies it is possible to set the parameters KM and KD and then also<br />

check the basic condition (expressed in Equation 5.6) that has to be fulfilled for<br />

the TRNG to function. The size of the jitter deviation σjit can either be<br />

measured on the target device (if the required measurement equipment is available),<br />

or just estimated (considering the ranges given in the vendor's documentation) and then<br />

set empirically after experiments with the generator's settings. Knowing the frequencies<br />

of the clock signals and the parameters KM and KD it is easy to find the period TQ<br />

(see Equation 5.3) and then the output bit-rate R = 1/TQ.<br />
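The steps above can be sketched in Python as follows. Equations 5.3 and 5.6 are not reproduced in this section, so the sketch rests on stated assumptions: FCLK and FCLJ are obtained from FCLI by a multiplier/divider pair each, KM/KD is the ratio FCLJ/FCLK reduced to lowest terms, and TQ = KD × TCLK (consistent with the accumulation period of 17 TQ = 119 periods of FCLK quoted for KD = 7). The function name and the returned dictionary are hypothetical.<br />

```python
from fractions import Fraction


def trng_parameters(f_cli_hz, m_clk, d_clk, m_clj, d_clj):
    """Derive K_M, K_D, T_Q and R = 1/T_Q from the PLL multipliers and dividers.

    Assumptions: F_CLK = F_CLI * m_clk / d_clk, F_CLJ = F_CLI * m_clj / d_clj,
    K_M / K_D = F_CLJ / F_CLK in lowest terms, and T_Q = K_D * T_CLK.
    """
    f_clk = Fraction(f_cli_hz) * Fraction(m_clk, d_clk)
    f_clj = Fraction(f_cli_hz) * Fraction(m_clj, d_clj)
    ratio = f_clj / f_clk                 # Fraction keeps the ratio in lowest terms
    k_m, k_d = ratio.numerator, ratio.denominator
    t_q = k_d / f_clk                     # seconds
    return {
        "f_clk": float(f_clk),
        "f_clj": float(f_clj),
        "k_m": k_m,
        "k_d": k_d,
        "t_q": float(t_q),
        "bit_rate": float(1 / t_q),
    }


# One-PLL case with F_CLK = F_CLI = 40 MHz and F_CLJ = (12 / 7) * 40 MHz.
params = trng_parameters(40_000_000, 1, 1, 12, 7)
print(params["k_m"], params["k_d"], params["t_q"])
```

Because the ratio is reduced to lowest terms, the resulting KM and KD are always coprime.<br />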

To give an overview of the ranges of MAX(∆Tmin) achievable in different<br />

PLL configurations we summarise them in Table 6 – 6. One should note that the<br />

intervals are only theoretically achievable and could be slightly different in practical<br />

cases, since some limitations were not taken into account (e.g. the limited output<br />

and input frequency for the cascaded configuration, the limited number of combinations of<br />

dividers, etc.).<br />

From Table 6 – 6 we can see that the smallest values of MAX(∆Tmin) can be<br />

reached with the cascaded configuration. While the frequency range is the same<br />

as for the other configurations, the number of combinations of frequency dividers is<br />

higher, which offers better possibilities for matching the FCLJ frequency to the fixed<br />

FCLK.<br />

As expected, the lowest sensitivity is achievable by using only one PLL. On the<br />

other hand, if the size of the jitter is large enough, this configuration is the most<br />

effective in area consumption. In practical cases the configuration with one PLL is<br />

not usable, as the number of random samples and their entropy is low because of<br />

the low sensitivity S.<br />

Table 6 – 6 Achievable sensitivity to jitter using two clock signals in Actel ProASICplus (FCLI =<br />

40 MHz)<br />

configuration MAX(∆Tmin)<br />

two PLLs 0.17 ps - 41 ns<br />

one PLL 10.85 ps - 41 ns<br />

two cascaded PLLs 0.084 ps - 41 ns<br />

As a solution one can add a second PLL in a parallel or cascaded configuration.<br />

It was already mentioned that the parallel configuration has the disadvantage of<br />

controlling two clock signals instead of one, as is the case in the cascaded configuration.<br />

On the other hand, a disadvantage of the cascaded configuration could be the fact<br />

that the tracking jitter is composed of components produced in all the PLLs in the<br />

cascade.<br />

The achievable sensitivity is in the worst case comparable to the size of the jitter<br />

(usually around 10-100 ps) and in other cases much higher; therefore we can conclude that,<br />

taking into account the theoretical requirements, the proposed method is feasible to<br />

implement and is suitable for Actel FPGAs.<br />

Experimental Results After the theoretical analysis we proceeded to a<br />

practical implementation. The generator was synthesised and programmed into<br />

the FPGA using the Actel design tools Libero IDE 7.1.<br />

In the experiments we focused on the configuration with one PLL circuit, as a<br />

specific configuration typical for low-cost FPGAs. In order to increase the sensitivity<br />

of the sampler we added some delay elements in front of the bank of sampling<br />

gates (for details see [61]). In the case of Actel ProASICplus the shortest delay<br />

inside the chip, around 0.5 ns, is available between the input and output of a NAND<br />

gate [5]. The outputs of all delay paths are accumulated over a multiple of the period<br />

TQ; afterwards the bits of the accumulator are XORed together and provided as one<br />

output bit.<br />

The configuration proving the possibility to implement the TRNG in an Actel ProASICplus<br />

FPGA using one PLL and a delay line made of NAND gates has the following<br />

parameters:<br />

• FCLK = FCLI = 40 MHz<br />

• FCLJ = (MCLJ / DCLJ) FCLI = (12 / 7) × 40 MHz = 68.5714 MHz<br />

• Number of delay elements (NAND gates): 8<br />

• Accumulation period: 17 TQ = 119 periods of FCLK<br />

Table 6 – 7 Area occupation of one PLL TRNG with delay line in FPGA Actel ProASICPlus<br />

Logic type Number Usage<br />

Core Cells 396 4.8%<br />

FIFO Cells 2 6.3%<br />

PLLs 1 50%<br />

The area requirements are summarised in Table 6 – 7. The<br />

design also includes the logic for reading the internal signals and the generated sequence<br />

by a computer, and can be reduced if required.<br />

The NIST statistical tests were performed on continuous 1-Gigabit TRNG output<br />

records and followed the testing strategy, general recommendations, and result<br />

interpretation described in [97]. We used a set of 1000 1-Megabit sequences<br />

produced by the TRNG. Most of the tests were passed; however, some<br />

were not, e.g. the overlapping template test or some variants of the non-periodic<br />

template test. The fact that the generated sequence is in some parameters<br />

slightly distinguishable from a truly random stream may signal some problems<br />

inside the TRNG implementation. On the other hand, the tested sequence is<br />

extremely long (a continuous 1-gigabit record), unlike the output streams required for<br />

practical applications.<br />

The experimental tests of configurations with two PLLs connected in parallel or<br />

in cascade have shown that the condition expressed by Equation 5.6 is a necessary but<br />

not a sufficient condition for proper running of the TRNG. The results confirm<br />

the theoretical analysis: the tracking jitter can be sampled<br />

and the generator output includes critical random samples. But to reliably achieve an<br />

unbiased and random sequence, the number of critical samples and their probability<br />

distribution have to satisfy some additional conditions that will be specified later in<br />

this chapter.<br />


Using the example of Actel FPGAs we explained how the basic parameters of the<br />

TRNG can be computed and how they relate to the parameters of the target<br />

device. Following the presented results it is possible to implement the TRNG<br />

with the required parameters. We can conclude that Actel FPGAs are suitable for<br />

the implementation of a TRNG based on the discussed method, and the achieved parameters<br />

are comparable with those of Altera FPGAs.<br />

6.2.4 Stochastic Model of PLL-TRNG<br />

It is a common requirement that a good TRNG design should be supported by a<br />

mathematical (more precisely stochastic) model of the source of randomness. A<br />

reliable model is a necessary requirement for the security evaluation during the<br />

certification process [37]. On one hand, the model should be as simple as possible,<br />

but on the other hand, it should also reliably describe the basic behaviour of the TRNG.<br />

In our case, the stochastic model should express the probability that the value on<br />

the generator output is equal to one as a function of the jitter variation and the<br />

phase of the CLK and CLJ signals.<br />

Reordering of the Samples If the sampled values of the signal CLJ are ordered in<br />

a proper way, they create an image of the original clock waveform. If we accumulate<br />

the ordered samples in KD accumulators during Q periods TQ, we obtain an image<br />

of the distribution of the probabilities that the i-th sample is equal to one.<br />

Figure 6 – 5 presents an example of accumulated and reordered samples<br />

obtained during Q = 1000 periods TQ for these parameters:<br />

• KM = 212, KD = 207, FCLJ = 81.93 MHz (presented in Figure 6 – 5(a)) and<br />

• KM = 516, KD = 175, FCLJ = 491.43 MHz (in Figure 6 – 5(b)).<br />

The variation of the jitter is proportional to the number of points (critical<br />

samples) in the rising (or falling) region of the waveforms (two and six in the<br />

presented example). Since in (b) FCLJ = 491.43 MHz and the period TCLJ is divided into<br />

KD = 175 sampling intervals, the distance between two subsequent samples is equal<br />

to about 11.6 ps. The width of the region influenced by the jitter is thus about<br />

69.6 ps. This value is equal to approximately 3σjit, so σjit ∼ 23.2 ps. Using the<br />

same method, we can get σjit ∼ 29.5 ps from Figure 6 – 5(a). It is clear that the<br />

presented method of jitter measurement is sufficiently simple to be implemented<br />

inside a device and the jitter can thus be monitored continuously in real time.<br />
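The estimate for case (b) can be reproduced with a short calculation. The sketch below is only an illustration of the reasoning in the text: the function name is hypothetical, and the factor of 3 between the width of the jitter-affected region and σjit is taken from the paragraph above as a rule of thumb.<br />

```python
def estimate_sigma_jit(f_clj_hz, k_d, n_critical, width_in_sigmas=3.0):
    """Estimate sigma_jit (seconds) from the number of critical samples.

    The sample spacing is T_CLJ / K_D; the jitter-affected region spans
    n_critical such steps and is taken to be about width_in_sigmas * sigma.
    Returns (sigma_jit, sample_spacing).
    """
    step = 1.0 / (f_clj_hz * k_d)
    return n_critical * step / width_in_sigmas, step


# Case (b): F_CLJ = 491.43 MHz, K_D = 175, six critical samples.
sigma, step = estimate_sigma_jit(491.43e6, 175, 6)
print(f"step = {step * 1e12:.1f} ps, sigma_jit = {sigma * 1e12:.1f} ps")
```

The result agrees with the values quoted in the text up to rounding of the sample spacing.<br />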


Figure 6 – 5 Distribution of mean values of ordered CLJ signal samples obtained during Q = 1000<br />

periods TQ; panels: (a) KM /KD = 212/207, (b) KM /KD = 516/175<br />

On-chip reordering To enable a better analysis of the samples<br />

processed in the TRNG we implemented the following method for on-chip reordering of the<br />

samples.<br />

The structure of the ordering logic is illustrated in Figure 6 – 6. Samples coming<br />

from the TRNG are continually written into a dual-port memory block organised as<br />

512 1-bit wide words (we usually do not use a KD parameter, which determines the<br />

number of samples in one period, larger than 512). The writing address is reinitialised<br />

at each new period TQ, signalled by the signal next_tq. In order to read the samples in<br />

such a way that they recreate the CLJ clock waveform we need to set the correct reading address.<br />

This operation is done by a LUT implemented as a ROM block. The input of the table<br />

is identical to the writing address, and the output of the LUT is used as the reading address<br />

for the samples memory. The content of the ROM (the LUT) can be easily generated<br />

using Equation 5.12.<br />
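Equation 5.12 is not reproduced in this section, so the following Python sketch is an assumption about how such a table could be generated rather than the exact rule used in the design. It assumes that the sample with writing address j is taken at phase (j × KM) mod KD of the CLJ period, so that the sample belonging to phase position i is found at address i × KM⁻¹ mod KD. The function names are hypothetical.<br />

```python
def reorder_lut(k_m, k_d):
    """Read addresses that put the K_D samples of one period T_Q in phase order.

    Sample j (write address j) is assumed to be taken at phase (j * k_m) mod k_d
    of the CLJ period, so phase position i is found at address i * k_m^-1 mod k_d.
    """
    k_m_inv = pow(k_m, -1, k_d)  # modular inverse; needs GCD(k_m, k_d) = 1, Python 3.8+
    return [(i * k_m_inv) % k_d for i in range(k_d)]


def ideal_samples(k_m, k_d):
    """Jitter-free samples of a 50 % duty-cycle CLJ in the order they are taken."""
    return [1 if 2 * ((j * k_m) % k_d) < k_d else 0 for j in range(k_d)]


k_m, k_d = 168, 143
lut = reorder_lut(k_m, k_d)
samples = ideal_samples(k_m, k_d)
ordered = [samples[addr] for addr in lut]
print(pow(k_m, -1, k_d), ordered[:4], ordered[-4:])  # prints: 103 [1, 1, 1, 1] [0, 0, 0, 0]
```

For KM = 168 and KD = 143 the modular inverse is 103, and reading the jitter-free samples through the table yields a single block of ones followed by a single block of zeros, i.e. one period of a rectangular waveform.<br />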

The signal sample_ord was assigned to an output pin of the DSP Stratix board and<br />

measured by a scope (Tektronix TDS 3052), with the trigger signal next_tq. In Figure 6 –<br />

7 we present the measured waveform. The parameters of the TRNG are the following:<br />

MCLK = 13, DCLK = 12, MCLJ = 14, DCLJ = 11, KM = 168, KD = 143, $K_M^{-1} = 103$.<br />

In the region of the edge (in this particular case, the falling edge), the ordered samples<br />

do not create an ideal rectangular waveform; instead, several edges can be observed<br />

in one period. Samples placed around the position of an ideal edge are sampled at<br />

different time instants (due to the required reordering of the samples in time). Hence,<br />

they may be influenced by a different amount of jitter or, in other words, the jitter<br />

changes faster than the sampling frequency. This fact causes more than one change<br />

(edge) of the signal. In order to make a better analysis of this phenomenon we need<br />

to collect samples from the edge region for several hundreds of subsequent periods.<br />


Figure 6 – 6 Block diagram of design for on-chip samples reordering (blocks: 512 x 1b RAM with writing and reading ports, KD x 9b ROM; signals: next_tq, sample, sample_ord, edge)<br />

Figure 6 – 7 Reordered samples from generator measured by oscilloscope<br />


Stochastic Model The clock signal CLJ is sampled KD times by the other clock<br />

signal CLK during one period TQ. The output signal is quasi-periodic (see footnote 11) with the<br />

period TQ as long as the condition<br />

GCD(KM, KD) = 1 (6.8)<br />

is fulfilled. Samples which are taken in a “stable” part of the CLJ signal (i.e.<br />

samples which are not influenced by the jitter) always have a constant value (logical<br />

zero or one). They form a dominant part of the set of output samples.<br />

The value of the i-th sample qi (0 ≤ i ≤ KD − 1) can be viewed as a binary<br />

random variable Xi ∈ {0, 1}. Its mean value E[Xi] is equal to the probability<br />

pi(Xi = 1), which is related to the mean value of the jitter at the corresponding<br />

sampling instant. It was shown in [60] that the decimated output signal x(nTQ) of<br />

the TRNG represents a bit-wise addition modulo 2 of KD binary samples q() (see<br />

also Figure 5 – 6) expressed as<br />

\[ x(nT_Q) = q(nT_Q) \oplus q(nT_Q - T_{CLK}) \oplus \ldots \oplus q(nT_Q - (K_D - 1)T_{CLK}) . \qquad (6.9) \]<br />
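At the bit level, Equation 6.9 amounts to XORing each block of KD consecutive samples into one output bit. The following Python sketch illustrates this on a toy stream; the function name and the example data are hypothetical and only serve to show the operation.<br />

```python
from functools import reduce
from operator import xor


def decimate_xor(samples, k_d):
    """XOR each complete block of k_d consecutive samples into one output bit."""
    n_blocks = len(samples) // k_d
    return [reduce(xor, samples[b * k_d:(b + 1) * k_d]) for b in range(n_blocks)]


# Toy stream with k_d = 7: two periods that differ in one (critical) sample.
stream = [1, 1, 1, 0, 0, 0, 0,
          1, 1, 1, 1, 0, 0, 0]
print(decimate_xor(stream, 7))  # prints: [1, 0]
```

A change in a single sample within a period flips the corresponding output bit, which is why a few jitter-influenced samples are sufficient to randomise the output.<br />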

We denote the number of critical samples $K_D^p$. The critical samples get the value<br />

of 1 with the probability $p_i \in (0, 1)$, $i = 0, 1, \ldots, K_D^p - 1$. The rest of the KD samples<br />

is deterministic. They can obtain logical values of zero and one and their number<br />

is denoted as $K_D^0$ and $K_D^1$, respectively. The total number of samples in the period<br />

TQ can be expressed as a sum of deterministic and critical samples:<br />

\[ K_D = K_D^p + K_D^1 + K_D^0 . \qquad (6.10) \]<br />

The generator extracts randomness from $K_D^p$ binary values using a standard XOR<br />

corrector. It was assumed in [60] that these values are statistically independent.<br />

Using mathematical background from [42], it is possible to show that the following<br />

relation holds between the set of probabilities pi of the $K_D^p$ independent samples and the<br />

mean value E[pi] at the output of the XOR corrector (the output of the TRNG):<br />

\[ E[p_i] = \frac{1}{2} + (-1)^{K_D^1} (-2)^{K_D^p - 1} \prod_{i=0}^{K_D^p - 1} \left( p_i - \frac{1}{2} \right) . \qquad (6.11) \]<br />

11 If the signals are not influenced by the jitter, the output signal of the sampling gate is perfectly<br />

periodic. If some jitter is present, the subsequent periods are not identical, but differ only in a few<br />

random samples, while constant samples form the major part of the waveform.<br />


Equation 6.11 can be viewed as a stochastic model of the generator, since it permits<br />

estimating the probability of the generator output value as a function of the mean<br />

values of the critical samples (which depend on the jitter characteristics). However, the<br />

model is valid if and only if the critical samples are independent.<br />
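The model of Equation 6.11 is easy to evaluate numerically. The sketch below (function names are hypothetical) computes the model value and, as a sanity check, compares it with a brute-force enumeration over all outcomes of the critical samples under the same independence assumption; the probabilities in the example are invented for illustration only.<br />

```python
from itertools import product
from math import prod


def model_mean(p_critical, k1):
    """Equation 6.11: probability of a one at the XOR corrector output."""
    n = len(p_critical)
    return 0.5 + (-1) ** k1 * (-2) ** (n - 1) * prod(p - 0.5 for p in p_critical)


def brute_force_mean(p_critical, k1):
    """Same probability by enumerating all outcomes of independent critical samples."""
    total = 0.0
    for bits in product((0, 1), repeat=len(p_critical)):
        weight = prod(p if b else 1.0 - p for p, b in zip(p_critical, bits))
        if (sum(bits) + k1) % 2 == 1:  # parity of critical ones plus deterministic ones
            total += weight
    return total


p = [0.9, 0.2, 0.6]   # invented probabilities of three critical samples
k1 = 61               # number of deterministic samples equal to one
print(round(model_mean(p, k1), 3), round(brute_force_mean(p, k1), 3))  # prints: 0.548 0.548
```

Setting any single probability to 0.5 makes the product vanish, so the model returns exactly 0.5 regardless of the remaining samples.<br />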

The proposed model shows that (as could be expected) the bias of the generator<br />

output decreases with an increasing number of critical samples (note that this<br />

number is related to the jitter variation). It can be seen that if the mean value of<br />

any of these samples is equal to 0.5, the bias of the generator output is equal to<br />

zero and does not depend on the remaining samples. Finally, the sign of the bias<br />

depends on the number of samples having a mean value equal to one ($K_D^1$).<br />

The advantage of the proposed model lies in the fact that it can also be<br />

used to check the mutual statistical independence of the critical samples. To evaluate<br />

the statistical independence, the output mean value and the mean values of the critical<br />

samples are measured and the validity of the model expressed in Equation 6.11<br />

is verified. If the test fails, the random variables (critical samples) are mutually<br />

dependent.<br />

Model Verification The validity of the model has been tested on real data in<br />

order to confirm the model empirically. We tested the outputs of seven TRNG<br />

configurations implemented in Altera Stratix devices. Table 6 – 8 presents the<br />

chosen parameters of the tested configurations (KM, KD, FCLK and FCLJ) and the<br />

corresponding results: the mean value given by the model (E[pi]), the mean value of the<br />

generator output (m = E [x(nTQ)]), the number of samples equal to one ($K_D^1$) and<br />

the number of critical samples ($K_D^p$).<br />

The mean value of the output bitstream m = E [x(nTQ)] is computed as the<br />

arithmetic mean of 512,000 successive bits at the output of the TRNG. The mean of<br />

the model E[pi] is calculated using Equation 6.11, employing the probabilities<br />

of the critical samples pi accumulated over Q = 1000 periods TQ.<br />

As can be seen, the model is very precise for a small number of critical samples,<br />

since both mean values are very similar. For a higher number of critical samples,<br />

the mean value tends to the ideal value 0.5. Note that the model provides correct<br />

information about the statistical deviation of the output bitstream in configurations<br />

1, 2 and 5. The model gives acceptable results corresponding closely to the mean<br />

value of the generated sequence in configurations 6 and 7. It should be noted that in<br />

configurations 3 and 4 the model outputs do not agree with the generator outputs, most<br />

probably because of statistical dependence between critical samples.<br />


Table 6 – 8 Mean values measured using the stochastic model E[pi] and the output sequence of<br />

the TRNG m = E [x(nTQ)]<br />

# KM KD FCLK (MHz) FCLJ (MHz) E[pi] m $K_D^1$ $K_D^p$<br />

1 144 119 113.33 137.14 0.846 0.829 61 2<br />

2 144 175 166.66 139.14 0.717 0.729 89 3<br />

3 486 119 75.55 139.14 0.501 0.553 55 10<br />

4 486 161 102.22 139.14 0.507 0.524 74 13<br />

5 250 203 232 285.71 0.489 0.526 95 16<br />

6 270 203 232 308.57 0.5 0.496 96 16<br />

7 486 217 137.77 308.57 0.499 0.496 99 22<br />

6.3 Active Non-Invasive Attack on TRNG<br />

To obtain results of a real-life attack we executed an active non-invasive attack<br />

on the FPGA implementation of the TRNG [60]. Namely, we tried to force some bias<br />

onto the output of the generator by changing the working temperature of the FPGA chip.<br />

Our aim is to find out what kind of changes in the parameters of the generated sequence<br />

can be observed. Moreover, we record the internal signals of the generator and<br />

evaluate the influence of temperature on them.<br />

Similar experiments have been described in [98], where the PLL-TRNG was<br />

evaluated as problematic, with varying quality of the generated bit sequence. Based<br />

on the results obtained from the attack we will provide additional requirements<br />

for the PLL-TRNG design and explain why the configuration chosen by Santoro<br />

et al. [98] had problems passing the statistical tests.<br />

6.3.1 Attack description<br />

The temperature of the FPGA was decreased by applying a freezing spray. The<br />

lowest achieved temperature was −40 °C. As the FPGA chip produces some heat, it<br />

warmed itself up to +30 °C. During the measurements we tried<br />

to keep the temperature close to the selected value. The temperature of the<br />

chip was measured by a simple contact thermometer.<br />

Two similar configurations of the TRNG were chosen as objects under attack.<br />

In both cases we used the Altera Stratix DSP board with an EP1S25 device [18].<br />

The following parameters were chosen or given by the board: FCLI = 80<br />

MHz, MCLK = 31, DCLK = 10, MCLJ = 36, DCLJ = 7. Then FCLK = 248 MHz,<br />

FCLJ = 411 MHz, and KM/KD = 360/217. To enable a comparison<br />

of the TRNG behaviour for two settings we chose the following configurations,<br />

which differ in the bandwidth of the loop filter:<br />

• Configuration A has the filter bandwidth set automatically by the synthesis<br />

tool (Altera Quartus).<br />

• Configuration B has the filter bandwidth set to the preset value low.<br />

The lower the bandwidth, the better the input jitter rejection that can be achieved, at<br />

the price of a longer locking time of the PLL. The synthesis tool chooses the optimal<br />

bandwidth for the selected signal frequencies, achieving an acceptable locking time and<br />

level of input jitter filtering. By setting the bandwidth to a low value, we ensure<br />

that jitter from sources outside the PLL is filtered out and we can observe the<br />

jitter originating inside the PLL.<br />

6.3.2 Measurements results<br />

To evaluate the TRNG behaviour under changing temperature we collected,<br />

for each temperature value, the random bit sequence from the output of the generator as well as<br />

the internal signal values, which provide information on the number of influenced random<br />

samples.<br />

By reordering the samples it is possible to reconstruct the waveforms of the sampled<br />

clock signal and track the changes of the sample probabilities. The waveforms sampled by<br />

the generator are depicted in Figures 6 – 8 and 6 – 9. For each sample position the number of<br />

ones is counted during one thousand TQ periods. The samples in stable regions<br />

end up with either 0 or 1000 sampled ones. The samples in the edge areas (rising<br />

and falling edge), influenced by jitter, reach values between these boundaries.<br />

In the ideal case we suppose that the position of the sampling edge is stable and what<br />

changes is the position of the edge in the sampled clock signal. The logical value at the<br />

moment of sampling is influenced by an additive jitter. Analysing the sampled values<br />

allows us to describe the behaviour of the generator and the impact of temperature on<br />

the jitter parameters.<br />

From the charts we can see that the position of the critical samples does not change<br />

across the range of temperatures for both configurations. Configuration A<br />

includes fewer critical samples than configuration B, which implies a lower σ² of the<br />

jitter.<br />


Figure 6 – 8 Sampled waveform of a clock signal for TRNG for configuration A for temperatures<br />

in the range −40 °C to +30 °C.<br />

The random sequences were tested by the simple statistical tests defined in the FIPS<br />

standard [57]. The test suite can reveal a bias or an unbalanced distribution of zeros<br />

and ones in the generated sequence by applying 4 basic tests (monobit test, poker<br />

test, runs test and long runs test). If at least one test from the set was not passed, the<br />

result is denoted as FAILED; otherwise we mark it OK.<br />

In Table 6 – 9 we summarise the results of the statistical tests at different FPGA chip<br />

temperatures. It can be seen that while configuration A produced, at some<br />

temperatures, sequences that did not pass the statistical tests, configuration<br />

B is reliable in the whole range of temperatures.<br />

The columns with the number of critical samples show the number of samples influenced<br />

by jitter. It can be observed that in the case of configuration B, where we set<br />

a low bandwidth of the loop filter, the number of influenced samples is significantly<br />

higher.<br />

We further investigate the number and position of critical samples for both<br />

configurations depending on the chip temperature. The samples with probability around 0.5<br />

have a crucial impact on the statistical parameters of the generated sequence.<br />

If we eliminate the almost constant samples, with fewer than 100 jitter-influenced<br />

values during 1000 periods, there are 4-6 and 12-13 highly critical samples<br />

per edge for configurations A and B, respectively.<br />

Figures 6 – 10 and 6 – 11 show in detail the area of the rising edge of the sampled<br />

waveform. We can observe how the number of ones sampled during the measuring period<br />

changes with the chip temperature. For configuration A, a large spread<br />

of the amounts for a fixed sample position is typical. In configuration B the<br />

subsequent samples have very similar amounts of sampled ones, and the overall<br />

waveform looks more stable.<br />

Figure 6 – 9 Sampled waveform of a clock signal for TRNG for configuration B for temperatures<br />

in the range −40 °C to +32 °C.<br />

In order to better visualise the changes in the sampled signals depending on<br />

temperature we provide Figures 6 – 12 and 6 – 13, which show in detail the dynamics of the<br />

amounts of sampled ones for the most critical samples.<br />

In configuration A we can observe a significant change in the number of sampled ones when<br />

the chip temperature changes. For example, at position number 84 the difference between the<br />

amounts of ones sampled at the minimal and maximal temperature is more than 500.<br />

This fact, as well as the low number of critical samples, causes instability of the<br />

generator in a changing working environment.<br />

Although the jitter is present over the whole range of temperatures (the<br />

number of critical samples does not change), the bias of the samples changes visibly<br />

and influences the statistical parameters of the generated sequence. At a moment<br />

when all samples are strongly biased (the case of temperatures between 20 and 30 °C)<br />

the output sequence is also biased and does not pass the statistical test suite.<br />

Configuration B is more stable under changing chip temperature and the density<br />

of samples with equal probability of sampling zero and one is much higher<br />

compared to case A. Thanks to that, the statistical parameters of the generated<br />

sequence stay acceptable and pass all required statistical tests. The bias of particular<br />

samples is compensated by other samples in the critical area, and the final sequence is<br />

kept unbiased.<br />

Table 6 – 9 Results of statistical tests (FIPS) of TRNG output and number of random samples<br />

influenced by the jitter at different chip temperatures<br />

temperature (°C) Conf A FIPS tests Conf A critical samples # Conf B FIPS tests Conf B critical samples #<br />

-40 OK 26 OK 64<br />

-30 OK 26 OK 66<br />

-20 FAILED 25 OK 64<br />

-10 OK 24 OK 62<br />

0 OK 24 OK 63<br />

+10 OK 24 OK 68<br />

+20 FAILED 22 OK 61<br />

+30 FAILED 25 OK 60<br />

Observing Jitter From the observations described above we can conclude that<br />

the variance (σ²) of the jitter in the sampled signal does not change.<br />

The size of the deviation can be observed as the number of critical samples, which remains<br />

almost constant in the whole range of tested temperatures. The presence of jitter<br />

represents a fundamental condition for the proper function of the generator. Therefore, a well<br />

suited startup test for this kind of generator should include a test of the presence of<br />

critical samples.<br />

The on-chip implementation of this test needs to include a memory block and counters which, for each edge position of the sampling signal, sum up the number of sampled ones. The edge positions whose counter value is neither 0 nor equal to the number of TQ periods signal the presence of jitter. The number of critical samples must be higher than zero, but a low number of samples cannot be accepted either. From the empirical experiments described above we can conclude that configurations with more than 10 highly critical samples per edge behave reliably even in a changing environment.
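The counting scheme described above can be sketched in software. This is only an illustration under stated assumptions: the input is a list of TQ periods, each a list of sampled bits for the positions around one edge; the names `count_critical_positions` and `startup_test` and the `min_critical` parameter are hypothetical; and every position with a non-constant value is counted, since the definition of a highly critical sample is not restated here.

```python
from typing import List, Sequence


def count_critical_positions(periods: Sequence[Sequence[int]]) -> List[int]:
    """Return the sample positions whose value is not constant over all periods.

    periods[p][i] is the bit sampled at position i during period p.
    """
    n_periods = len(periods)
    n_positions = len(periods[0])
    ones = [0] * n_positions            # one counter per sample position
    for period in periods:
        for pos, bit in enumerate(period):
            ones[pos] += bit
    # A position is critical if it was sampled neither always 0 nor always 1.
    return [pos for pos, c in enumerate(ones) if 0 < c < n_periods]


def startup_test(periods: Sequence[Sequence[int]], min_critical: int = 10) -> bool:
    """Pass only if more than `min_critical` positions are influenced by jitter."""
    return len(count_critical_positions(periods)) > min_critical
```

In hardware the list `ones` corresponds to the memory block with counters; the threshold of 10 follows the empirical rule given in the text and applies per edge.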

Figure 6 – 10 Number of sampled ones during 1000 sampling periods for TRNG with configuration A (detail of the rising edge).

Continuous monitoring of the number of critical samples makes it possible to implement an effective online test for the discussed category of PLL-based generators. Each significant change either in the position or in the probability value of critical samples may have an impact on the parameters of the generated sequence and therefore should raise an alarm signal inside the TRNG.
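A possible form of such an online test is sketched below. It is an assumption-laden illustration rather than the tested implementation: the counters of sampled ones from a reference measurement are compared with the current counters, and the function name `online_alarm` and the tolerance `max_shift` are hypothetical.

```python
from typing import Sequence, Set


def online_alarm(reference: Sequence[int], current: Sequence[int],
                 n_periods: int, max_shift: float = 0.1) -> bool:
    """Return True if the critical samples moved or their probabilities shifted.

    reference, current: number of sampled ones per position over n_periods.
    max_shift: largest tolerated change of the probability of sampling one.
    """
    def critical(counts: Sequence[int]) -> Set[int]:
        return {pos for pos, c in enumerate(counts) if 0 < c < n_periods}

    ref_positions = critical(reference)
    if ref_positions != critical(current):
        return True                      # the position of critical samples changed
    # Same positions: check the change of probability at each of them.
    return any(abs(reference[p] - current[p]) / n_periods > max_shift
               for p in ref_positions)
```

The choice of `max_shift` would have to come from measurements such as those in Table 6 – 9; the value 0.1 is only a placeholder.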

From the measured data it is possible to estimate the jitter parameters and draw the probability histograms. In Figure 6 – 14 we compare the histograms of the TRNG working in configurations A and B, which differ in loop filter bandwidth. In both cases the jitter has a normal (Gaussian) distribution. As can be observed, the jitter in configuration A has a lower deviation, while in configuration B the deviation is almost three times higher.
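One way to estimate the jitter deviation from the counters of sampled ones is sketched below. It assumes that the probability of sampling one at a position follows a Gaussian cumulative distribution across the edge, so that the probit of the measured probability is linear in the position with slope 1/σ. The function name `estimate_jitter_sigma` and the parameter `step` (distance between neighbouring sample positions) are hypothetical, and this is not necessarily the estimation procedure used for Figure 6 – 14.

```python
from statistics import NormalDist
from typing import Sequence


def estimate_jitter_sigma(ones: Sequence[int], n_periods: int,
                          step: float = 1.0) -> float:
    """Estimate the jitter standard deviation in the units of `step`.

    ones[i]: number of sampled ones at position i over n_periods.
    """
    std_normal = NormalDist()
    xs, zs = [], []
    for i, count in enumerate(ones):
        if 0 < count < n_periods:        # only critical samples carry information
            xs.append(i * step)
            zs.append(std_normal.inv_cdf(count / n_periods))
    if len(xs) < 2:
        raise ValueError("need at least two critical samples")
    mean_x = sum(xs) / len(xs)
    mean_z = sum(zs) / len(zs)
    # Least-squares slope of probit(p) versus position equals 1/sigma.
    slope = (sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
             / sum((x - mean_x) ** 2 for x in xs))
    return 1.0 / abs(slope)
```

With `step` set to the time distance between neighbouring sample positions, the result is a time; with the default it is expressed in sample positions.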

What we find crucial in our measurements is the observation of jitter parameters under changing temperature. The jitter in the PLL circuitry changes when the chip is cooled, which can be observed as a change in the number of sampled ones at the critical sample positions. In Figure 6 – 15 we depict the difference in those numbers between the boundary temperatures −40 °C and +30 °C in both configurations. The difference has a Gaussian (normal) distribution, as did the previously discussed jitter at room temperature. The standard deviation of the additional jitter is identical to its value for measurements at stable temperature. As a result we can conclude that changing the chip temperature changes the amplitude of the jitter, too.

Figure 6 – 11 Number of sampled ones during 1000 sampling periods for TRNG with configuration B, with low-pass loop filter (detail of the rising edge).

In the case of the PLL-TRNG, the bigger the changes of jitter amplitude, the bigger the changes in the jitter histogram, which has a direct impact on the statistical properties of the generated sequence. In configuration A the changes in the amplitude of the jitter are significant, as are the differences in probability values between particular samples. Configuration B is characterised by smaller changes of the jitter amplitude, which are in addition flatter. In such a case the probability changes uniformly for the most critical samples and does not have any unwanted impact on the generated random sequence. The described higher level of robustness was observed in configuration B and confirmed by the positive outcome of all statistical tests.

From the obtained results and suggestions for PLL-TRNG design we can conclude that the design tested in [98] with parameters KM/KD = 270/203 is not suitable for use under changing temperature. From Table 6 – 8 we get the number of critical samples, which is 22, i.e. 11 per edge. As we proposed in the suggestions above, more important is the number of highly critical samples, which should be more than 10. This condition is not met in this configuration, and the generator behaves similarly to configuration A in our experiments during a simulated attack on the PLL-TRNG.
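The role of the ratio KM/KD can be illustrated numerically. The sketch below assumes the usual PLL-TRNG timing relation TQ = KD·T_CLK = KM·T_CLJ, which this section does not restate; under it, sample j falls at phase (j·KM mod KD)/KD of the sampled clock period. The function names are illustrative only.

```python
from math import gcd
from typing import List


def sample_phases(km: int, kd: int) -> List[int]:
    """Phases of the KD samples of one TQ period, in units of T_CLJ / KD."""
    return sorted((j * km) % kd for j in range(kd))


def distinct_phases(km: int, kd: int) -> int:
    """Number of different phases at which the jittered clock is sampled."""
    return kd // gcd(km, kd)
```

For KM/KD = 270/203 the two factors are coprime, so under this assumption all 203 samples fall at different phases, spaced by 1/203 of the sampled clock period. How many of them are critical then depends on the size of the jitter relative to this spacing, which is why the count of critical samples, rather than the ratio itself, is used as the design criterion.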

6.4 Conclusions and Further Research

The chapter provided an analysis of the PLL-based TRNG. We focused on implementation aspects, on the relations between the target FPGA platform and its PLL circuitry, and on the achievable technical parameters of the generator in devices from the vendors Actel and Altera. In the second part of the chapter we presented our proposal for a stochastic model of the TRNG and proposed additional steps in PLL-TRNG design in order to achieve robustness of the generator in a changing environment.


Figure 6 – 12 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration A.

By theoretical and practical analysis we concluded that PLL circuitry is more suitable for the discussed TRNG implementation than DLL circuitry. The parameters of the PLL circuitry available in FPGAs on the market are satisfactory for a reliable implementation. We showed the steps of a theoretical analysis of the PLL parameters, with an estimation of the jitter and TRNG parameters that were later confirmed by empirical measurements.

Two practical implementations in Altera and Actel families of FPGAs showed which criteria are important in the design. The achieved final speed of the generator in the Altera Stratix device is more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. Thanks to the analysis of the available PLL configurations and their parameters we have presented a generator without the additional delaying logic applied in the original proposal [60]. The application of a simpler sampling part of the generator is possible thanks to the wider divider range of the PLL circuits in the Stratix FPGA family.

We presented the most compact solution with one PLL circuit and a chain of delay elements implemented in an Actel ProASICplus device. The results of statistical tests for a very long record of generated data confirm a high level of randomness, with few tests failed. We can conclude that it was confirmed both theoretically and practically that the PLL-TRNG is suitable for fully embedded implementation in low-cost FPGAs and provides a reliable source of truly random values even when only a small number of PLLs with a limited range of frequency dividers is available.

Figure 6 – 13 Number of sampled ones during 1000 sampling periods as a function of temperature for chosen sample positions in TRNG with configuration B.

The proposed stochastic model of the generator makes it possible to prove the mutual statistical independence of the critical samples. The model was confirmed empirically and is valid for a small number of critical samples; for a higher number, however, the model is less precise. In order to achieve a better adjusted model, we propose for future research to monitor and analyse the bit sequence at the output of the sampling gate, before the XOR operation. This kind of measurement may uncover a possible dependency between the samples.

In the last part of the chapter we presented the results of experiments with changing chip temperature of the PLL-TRNG. As a result we propose additional requirements for the generator design that need to be met in order to achieve robustness of the design. We can conclude that configurations with more than 10 highly critical samples per edge behave reliably even in a changing environment. The bigger the changes of jitter amplitude, the bigger the changes in the jitter histogram, which has a direct negative impact on the statistical properties of the generated sequence.


Figure 6 – 14 Comparison of probability histograms for the jitter measured at a temperature of 20 °C in TRNG with configurations A and B. Data were measured around the rising edge of the sampled clock waveform.

Figure 6 – 15 Difference in the number of sampled ones for critical samples between the boundary temperatures −40 °C and +30 °C in TRNG with configurations A and B, around the rising edge of the sampled clock waveform.


7 Research Contribution

With this thesis we contributed to the field of hardware implementation of public-key cryptographic system elements. We discussed aspects of algorithm adaptations and system architectures for a modular multiplier and cryptanalytic hardware. A randomness extraction method based on clock circuitry was evaluated and new findings were presented.

The research contributions were achieved in the following topics:

• Optimised Montgomery modular multiplier implementation in hardware.

• The elliptic curve method implementation in hardware.

• Evaluation of random number generator based on clock circuitry in FPGAs.

Optimised Montgomery modular multiplier implementation in hardware. The two most popular public-key cryptographic algorithms, RSA and ECC, make extensive use of modular operations with large numbers. Modular multiplication (MM) can be a very slow operation when performed on general-purpose computers and can therefore be accelerated by an effective hardware implementation.

We analysed algorithms for Montgomery MM and architectures for their effective implementation suitable for reconfigurable hardware structures. We paid attention to keeping the scalability and parametrisation of the multiplier unit in the other parts of the system as well, and to finding an optimal model for dividing the computational load between the software and hardware parts of the system. The results of area occupation and timing analysis were presented after application of hardware-software co-design.
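For orientation, the operation accelerated by the multiplier can be written down in a few lines. The following is a textbook bit-serial (radix-2) Montgomery product, not the scalable architecture developed in the thesis; the function names and the parameter `k` (operand length in bits, with R = 2^k) are chosen here for illustration only.

```python
def montgomery_product(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2**(-k) mod n for odd n < 2**k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b if bit i of a is set
        if s & 1:                 # make s even by adding the odd modulus
            s += n
        s >>= 1                   # exact division by 2
    return s - n if s >= n else s  # single final subtraction


def mod_mul(a: int, b: int, n: int, k: int) -> int:
    """Plain a * b mod n computed through the Montgomery domain."""
    r2 = pow(2, 2 * k, n)                      # R^2 mod n, precomputed once
    a_m = montgomery_product(a, r2, n, k)      # a * R mod n
    b_m = montgomery_product(b, r2, n, k)      # b * R mod n
    p_m = montgomery_product(a_m, b_m, n, k)   # a * b * R mod n
    return montgomery_product(p_m, 1, n, k)    # leave the Montgomery domain
```

The loop uses only additions, a parity test and a one-bit shift, which is what makes the method attractive for hardware; the conversions into and out of the Montgomery domain pay off only when many multiplications share one modulus, as in modular exponentiation.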

The elliptic curve method implementation in hardware. The security of the most widely applied public-key cryptographic algorithm, RSA, depends on the hardness of factoring large numbers. In the best currently known method for factoring large integers, the GNFS, one important step is the factorisation of mid-sized integers, for which the ECM is an efficient algorithm.

The ECM algorithm is a classical example of an algorithm that can be significantly accelerated by special-purpose hardware. We provided a detailed description of an efficient ECM architecture especially suited for hardware implementation. The modular multiplier obtained as a result of our research described in the previous point is a core element of the ECM unit and allows fast prototyping. For proof-of-concept purposes, we chose an architecture with an embedded controller and a dedicated coprocessor, designed by software-hardware co-design on an FPGA. We presented the area requirements of the system and the timings of the first published real hardware implementation.

Evaluation of random number generator based on clock circuitry. Random values play a crucial role in several areas of science. Depending on the field of application, the requirements on the parameters of the random sequence and of the sequence generator itself may vary.

We extended the already published results on the generator introduced in [60]. Our focus was on the analysis of the generator under changing working conditions and configuration settings. We presented the most compact solution, with one PLL circuit and a chain of delay elements implemented in a low-cost FPGA. In another design, we focused on achieving a high bit rate of the generated sequence; the achieved final speed of the generator was more than 1 Mbit/s, with the quality of the output confirmed by statistical tests. We summarised the results of our experiments with changing chip temperature and proposed additional requirements for the generator design that need to be met in order to achieve robustness of the design.


Curriculum vitae

Professional experience

• Self-employed, Electronic Documents Laboratory – Team Leader (August 2008 – now).
Projects related to PKI, biometrics and cryptography for Polish Security Printing Works (PWPW S.A.), Warsaw, Poland. System design and analysis, preparation of proof-of-concept systems.

• Sentivision Polska, Warsaw, Poland, Senior Software Engineer (October 2006 – July 2008).
Expert on Digital Rights Management implementations in embedded platforms for IPTV and VoD systems, cryptography-related applications and features. End-to-end implementation of the Marlin IPTV-ES DRM system in C, including server and client side. Technical project leader: close cooperation with the project manager, contact with customers, consulting and on-site support, maintenance and release of software.

Stays abroad

• Three-month research stay in the COSIC group at Katholieke Universiteit Leuven, Belgium – involved in the FP6 project “SCA Resistant Design”, analysis of side-channel attacks (2006)

• Four-month research stay at Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France – analysis of TRNG embedded in Altera FPGAs, stochastic model of the generator (2005)

• Four-month research stay at the Communication Security group, Ruhr-Universität Bochum, Germany – optimisation and implementation of ECM for factorisation on Xilinx FPGA (2004)

• Four-month stay as an Erasmus student at the Higher Institute for Advanced Technologies of Saint-Etienne (ISTASE), Université Jean Monnet, Saint-Etienne, France – implementation of scalable MM, work with Altera Nios processor (2002)


References

[1] Actel Corporation. ProASICplus Evaluation Board, User’s guide, 2002.

[2] Actel Corporation. Axcelerator Family PLL and Clock Management, Application Note, June 2003.

[3] Actel Corporation. Using ProASICplus Clock Conditioning Circuits, Application Note, Dec. 2004.

[4] Actel Corporation. ProASIC3(E) Flash Family FPGAs, Datasheet, Jan. 2005.

[5] Actel Corporation. ProASICplus Flash Family FPGAs, ver. 5.3, May 2006.

[6] Altera Corporation. Metastability in Altera Devices, ver. 4.0, May 1999.

[7] Altera Corporation. ACEX 1K Programmable Logic Device Family, Data Sheet, Sept. 2001. ver. 3.3.

[8] Altera Corporation. APEX 20K Programmable Logic Device Family, Data Sheet, Feb. 2002. ver. 4.3.

[9] Altera Corporation. Avalon Bus Specification, Reference Manual, Jan. 2002. ver. 2.0.

[10] Altera Corporation. Nios Embedded Processor Development Board, ver. 2.1, Apr. 2002.

[11] Altera Corporation. Using PLLs in Stratix Devices, Feb. 2002. ver. 1.0.

[12] Altera Corporation. Cyclone Device Handbook, Using PLLs in Cyclone Devices, Oct. 2003. ver. 1.2.

[13] Altera Corporation. Cyclone Programmable Logic Device Family, Data Sheet, Mar. 2003. ver. 1.1.

[14] Altera Corporation. Using the ClockLock & ClockBoost PLL Features in APEX Devices, Nov. 2003. Application Note 115, ver. 2.6.

[15] Altera Corporation. Stratix Device Handbook, General-Purpose PLLs in Stratix & Stratix GX Devices, Sept. 2004. ver. 3.1.


[16] Altera Corporation. Stratix EP1S25 DSP Development Board, Dec. 2004. ver. 1.6.

[17] Altera Corporation. Cyclone II Device Handbook, PLLs in Cyclone II Devices, Feb. 2005. ver. 1.2.

[18] Altera Corporation. Stratix Device Handbook, July 2005. ver. 3.4.

[19] Altera Corporation. Stratix II Device Handbook, PLLs in Stratix II Devices, Mar. 2005. ver. 2.2.

[20] Altera Corporation. Stratix II Device Handbook, Volume 2, Chapter 2, TriMatrix Embedded Memory Blocks in Stratix II & Stratix II GX Devices, Apr. 2006. ver. 4.2.

[21] Altera Corporation. Stratix II Device Handbook, Volume 1, Chapter 5, DC & Switching Characteristics, May 2007. ver. 4.3.

[22] AMI Semiconductors Company. XpressArray High Density 0.18 um Structured ASIC.

[23] ARM Limited. ARM7TDMI (Rev 3) – Technical Reference Manual. Available at http://www.arm.com/pdfs/DDI0029G_7TDMI_R3_trm.pdf, 2001.

[24] Bagini, V., and Bucci, M. A design of reliable true random number generator for cryptographic applications. In Cryptographic Hardware and Embedded Systems – CHES’99 (Berlin, Germany, Aug. 1999), Ç. K. Koç and C. Paar, Eds., no. 1717 in Lecture Notes in Computer Science, Springer-Verlag, pp. 204–218.

[25] Barrett, P. Implementing the Rivest, Shamir and Adleman public-key encryption algorithm on a standard digital signal processor. In Proceedings of CRYPTO’86 (1986), vol. 263 of Lecture Notes in Computer Science, pp. 311–323.

[26] Baudet, M., Lubicz, D., Micolod, J., and Tassiaux, A. On the security of oscillator-based random number generators. Cryptology ePrint Archive, Report 2009/299, 2009. http://eprint.iacr.org/.


[27] Bernstein, D. Circuits for Integer Factorization: A Proposal. Manuscript. Available at http://cr.yp.to/papers.html#nfscircuit, 2001.

[28] Blum, L., Blum, M., and Shub, M. A simple unpredictable pseudo-random number generator. SIAM Journal on Computing 15 (1986), 364–383.

[29] Blum, T., and Paar, C. Montgomery modular exponentiation on reconfigurable hardware. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic (Adelaide, Australia) (Los Alamitos, CA, April 1999), Koren and Kornerup, Eds., IEEE Computer Society Press, pp. 70–77.

[30] Blum, T., and Paar, C. High radix Montgomery modular exponentiation on reconfigurable hardware. IEEE Transactions on Computers 50, 7 (2001), 759–764.

[31] Bock, H., Bucci, M., and Luzzi, R. An offset-compensated oscillator-based random bit source for security applications. In Cryptographic Hardware and Embedded Systems – CHES 2004 (Berlin, Germany, 2004), M. Joye and J.-J. Quisquater, Eds., no. 3156 in Lecture Notes in Computer Science, Springer-Verlag, pp. 268–281.

[32] Bosma, W. Primality testing using elliptic curves. Tech. Rep. 85-12, Mathematical Institut, Universiteit van Amsterdam, 1985.

[33] Brent, R. P. Some Integer Factorization Algorithms Using Elliptic Curves. In Australian Computer Science Communications 8 (1986), pp. 149–163.

[34] Brent, R. P. Factorization of the tenth Fermat number. Mathematics of Computation 68, 225 (1999), 429–451.

[35] Brown, M., Hankerson, D., López, J., and Menezes, A. Software Implementation of the NIST Elliptic Curves Over Prime Fields. In Topics in Cryptology – CT-RSA 2001 (Berlin, April 2001), D. Naccache, Ed., vol. LNCS 2020, Springer-Verlag, pp. 250–265.

[36] Bucci, M., and Luzzi, R. Design of testable random bit generators. In Cryptographic Hardware and Embedded Systems – CHES 2005 (Berlin, Germany, 2005), J. Rao and B. Sunar, Eds., no. 3659 in Lecture Notes in Computer Science, Springer-Verlag, pp. 147–156.


[37] Bundesamt für Sicherheit in der Informationstechnik – BSI. Application Notes and Interpretation of the Scheme (AIS), AIS 31, Functionality Classes and Evaluation Methodology for Physical Random Number Generators, Sept. 2001.

[38] Ç. K. Koç. RSA hardware implementation. Tech. rep., RSA Laboratories, RSA Data Security, Inc., Aug. 1995.

[39] Ç. K. Koç, Acar, T., and Kaliski, Jr., B. S. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 16, 3 (June 1996), 26–33.

[40] Chaitin, G. J. Algorithmic Information Theory. Cambridge University Press, 1987.

[41] Daly, A., and Marnane, W. Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic. In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays FPGA’02 (Monterey, California, USA, Feb. 2002).

[42] Davies, R. B. Exclusive OR (XOR) and hardware random number generators. Tech. rep., 2002.

[43] Dichtl, M. How to Predict the Output of a Hardware Random Number Generator. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. K. Koç, and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, Springer-Verlag, pp. 181–188.

[44] Dichtl, M., and Golić, J. D. High-speed true random number generation with logic gates only. In CHES ’07: Proceedings of the 9th international workshop on Cryptographic Hardware and Embedded Systems (Berlin, Heidelberg, 2007), vol. 4727 of Lecture Notes in Computer Science, Springer-Verlag, pp. 45–62.

[45] Dixon, B., and Lenstra, A. Massively parallel elliptic curve factoring. In Advances in Cryptology - Eurocrypt ’92 (1993), R. Rueppel, Ed., vol. 658 of LNCS, Springer, Berlin, pp. 183–193.


[46] Drutarovský, M., Fischer, V., and Šimka, M. Comparison of Two Implementations of Scalable Montgomery Coprocessor Embedded in Reconfigurable Hardware. In Proceedings of the XIX Conference on Design of Circuits and Integrated Systems – DCIS 2004 (Bordeaux, France, Nov. 24–26, 2004), pp. 240–245.

[47] Drutarovský, M., Fischer, V., Šimka, M., and Celle, F. A Simple PLL-based True Random Number Generator for Embedded Digital Systems. Computing and Informatics 23, 5 (2004), 501–515.

[48] Drutarovský, M., and Šimka, M. Cryptographic True Random Number Generator for Embedded Nios Processor. In Proceedings of 13th International Czech-Slovak Scientific Conference Radioelektronika (Brno, Czech Republic, May 6–7, 2003), pp. 268–371.

[49] Drutarovský, M., and Šimka, M. Custom FPGA Cryptographic Blocks for Reconfigurable Embedded NIOS Processor. Acta Electrotechnica et Informatica 4, 2 (2004), 33–39.

[50] Drutarovský, M., Šimka, M., and Fischer, V. Comparison of Scalable Montgomery Modular Multiplications Embedded in Reconfigurable Hardware. Acta Electrotechnica et Informatica 6, 2 (2006), 37–45.

[51] Eldridge, S. E., and Walter, C. D. Hardware implementation of Montgomery’s modular multiplication algorithm. IEEE Trans. Comput. 42, 6 (1993), 693–699.

[52] Epstein, M., Hars, L., Krasinski, R., Rosner, M., and Zheng, H. Design and implementation of a true random number generator based on digital circuit artifacts. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2003 (Berlin, Germany, Sept. 8–10, 2003), C. D. Walter, Ç. K. Koç, and C. Paar, Eds., vol. 2779 of Lecture Notes in Computer Science, Springer-Verlag, pp. 152–165.

[53] Fairfield, R. C., Mortenson, R. L., and Coulthart, K. B. An LSI random number generator (RNG). In Proceedings of CRYPTO 84 on Advances in cryptology (1985), Springer-Verlag New York, Inc., pp. 203–230.


[54] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Data Encryption Standard, Jan. 1977. NIST FIPS PUB 46.

[55] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Data Encryption Standard, Oct. 1999. NIST FIPS PUB 46-3.

[56] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Digital Signature Standard, Jan. 2000. NIST FIPS PUB 186-2.

[57] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Security Requirements for Cryptographic Modules, May 2001. NIST FIPS PUB 140-2.

[58] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Advanced Encryption Standard (AES), 2001. NIST FIPS PUB 197.

[59] Federal Information Processing Standards, National Institute of Standards and Technology, U.S. Department of Commerce. Specification for the Secure Hash Standard, Aug. 2002. NIST FIPS PUB 180-2 + change notice to include SHA-224.

[60] Fischer, V., and Drutarovský, M. True random number generator embedded in reconfigurable hardware. In Workshop on Cryptographic Hardware and Embedded Systems – CHES 2002 (Berlin, Germany, Aug. 13–15, 2002), B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523 of Lecture Notes in Computer Science, Springer-Verlag, pp. 415–430.

[61] Fischer, V., Drutarovský, M., Šimka, M., and Bochard, N. High Performance True Random Number Generator in Altera Stratix FPLDs. In Field-Programmable Logic and Applications – FPL 2004 (Leuven, Belgium, Aug. 2004), J. Becker, M. Platzner, and S. Vernalde, Eds., vol. 3203 of Lecture Notes in Computer Science, Springer-Verlag, pp. 555–564.


[62] Fischer, V., Drutarovsk´y, M., ˇ Simka, M., and Celle, F. Simple<br />

PLL-based True Random Number Generator for Embedded Digital Systems.<br />

In Proceed<strong>in</strong>gs of IEEE Design and Diagnostics of Electronic Circuits and<br />

Systems Workshop – DDECS 2004 (Stará Lesná, Slovakia, Apr. 18–21, 2004),<br />

pp. 129–136.<br />

[63] Franke, J., and Kle<strong>in</strong>jung, T. E-mail announcement.<br />

http://www.crypto-world.com/announcements/rsa200.txt, May 2005.<br />

[64] Franke, J., Kle<strong>in</strong>jung, T., Paar, C., Pelzl, J., Priplata, C., and<br />

Stahlke, C. SHARK — A Realizable Special <strong>Hard</strong><strong>ware</strong> Siev<strong>in</strong>g Device<br />

for Factor<strong>in</strong>g 1024-bit Integers. In Workshop on Cryptographic <strong>Hard</strong><strong>ware</strong> and<br />

Embedded Systems — CHES 2005, Ed<strong>in</strong>burgh (August 2005), LNCS, Spr<strong>in</strong>ger.<br />

To appear.<br />

[65] Franke, J., Kleinjung, T., Paar, C., Pelzl, J., Priplata, C., Šimka,<br />

M., and Stahlke, C. An efficient hardware architecture for factoring integers<br />

with the Elliptic Curve Method. In 1st Workshop on Special-purpose Hardware<br />

for Attacking Cryptographic Systems – SHARCS 2005 (Paris, France, Feb. 24–<br />

25, 2005), pp. 51–62.<br />

[66] Frolek, V. Implementation of asymmetric encryption algorithms in reconfigurable<br />

circuits. Master’s thesis, Technical University of Košice, Department<br />

of Electronics and Multimedia Communications, Jan.-May 2002.<br />

[67] Gaj, K., Kwon, S., Baier, P., Kohlbrenner, P., Le, H., Khaleeluddin, M.,<br />

and Bachimanchi, R. Implementing the elliptic curve method of<br />

factoring in reconfigurable hardware. In Workshop on Special-purpose Hardware<br />

for Attacking Cryptographic Systems – SHARCS 2006 (Cologne, Germany,<br />

Apr. 03–04, 2006).<br />

[68] Gennaro, R. Randomness in cryptography. IEEE Security and Privacy 4,<br />

2 (2006), 64–67.<br />

[69] Goldberg, I., and Wagner, D. Randomness and the Netscape browser.<br />

Dr. Dobb’s Journal (Jan. 1996), 66–70.<br />

[70] Golic, J. New methods for digital generation and postprocessing of random<br />

data. IEEE Transactions on Computers 55, 10 (2006), 1217–1229.<br />

[71] Gura, N., Chang, S., Eberle, H., Sumit, G., Gupta, V., Finchelstein, D.,<br />

Goupy, E., and Stebila, D. An End-to-End Systems Approach to Elliptic<br />

Curve Cryptography. In Cryptographic Hardware and Embedded Systems —<br />

CHES 2002 (2002), Ç. K. Koç and C. Paar, Eds., vol. LNCS 2523, Springer,<br />

pp. 349–365.<br />

[72] Huang, M., Gaj, K., Kwon, S., and El-Ghazawi, T. An optimized<br />

hardware architecture for the Montgomery Multiplication Algorithm. In PKC<br />

2008: 11th International Workshop on Practice and Theory in Public Key<br />

Cryptography, Barcelona, Spain (March 2008), pp. 214–228.<br />

[73] Jun, B., and Kocher, P. The Intel random number generator.<br />

White paper prepared for Intel Corporation, Cryptography Research, Inc.,<br />

http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf, Apr.<br />

1999.<br />

[74] Killmann, W., and Schindler, W. A proposal for: Functionality Classes<br />

and Evaluation Methodology for True (Physical) Random Number Generators,<br />

Sept. 2001.<br />

[75] Kinniment, D., and Chester, E. Design of an on-chip random number<br />

generator using metastability. In Proceedings of the 28th European Solid-State<br />

Circuit Conference (Sept. 2002), Univ. Bologna, Italy, pp. 595–598.<br />

[76] Knuth, D. E. Seminumerical Algorithms, second ed., vol. 2 of The Art of<br />

Computer Programming. Addison-Wesley, Reading, Massachusetts, Jan. 10,<br />

1981.<br />

[77] Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 48,<br />

177 (Jan. 1987), 203–209.<br />

[78] Koblitz, N., Menezes, A., and Vanstone, S. The state of elliptic curve<br />

cryptography. Designs, Codes and Cryptography 19, 2-3 (Mar. 2000), 173–193.<br />

[79] Kohlbrenner, P., and Gaj, K. An embedded true random number generator<br />

for FPGAs. In Proceedings of the 2004 ACM/SIGDA 12th international<br />

symposium on Field programmable gate arrays (2004), ACM Press, pp. 71–78.<br />

[80] Lenstra, A. K. Designs, Codes and Cryptography. Kluwer Academic Publishers,<br />

Boston, 2000, ch. Integer Factoring.<br />

[81] Lenstra, A. K., and Lenstra, H. W., Jr., Eds. The Development of the<br />

Number Field Sieve. Lecture Notes in Mathematics, vol. 1554. Springer, 1993.<br />

[82] Lenstra, H. W. Factoring Integers with Elliptic Curves. Annals of<br />

Mathematics 126, 2 (1987), 649–673.<br />

[83] Lim, D., Ranasinghe, D. C., Devadas, S., Jamali, B., Abbott, D.,<br />

and Cole, P. H. Exploiting metastability and thermal noise to build a<br />

re-configurable hardware random number generator. In Noise in Devices and<br />

Circuits III; Proceedings of SPIE (Texas, USA, May 2005), vol. 5844, pp. 294–<br />

309.<br />

[84] MacKay, D. J. C. Introduction to Monte Carlo methods. In Learning in<br />

Graphical Models, M. I. Jordan, Ed., NATO Science Series. Kluwer Academic<br />

Press, 1998, pp. 175–204.<br />

[85] McIvor, C., McLoone, M., McCanny, J., Daly, A., and Marnane,<br />

W. Fast Montgomery modular multiplication and RSA cryptographic processor<br />

architectures. In 37th IEEE Computer Society Asilomar Conference on<br />

Signals, Systems and Computers (Monterey, USA, Nov. 2003), pp. 379–384.<br />

[86] Menezes, A. J., van Oorschot, P. C., and Vanstone, S. A. Handbook of<br />

Applied Cryptography. CRC Press, New York, Oct. 1996.<br />

[87] Miller, V. S. Use of elliptic curves in cryptography. In Lecture notes<br />

in computer sciences; 218 on Advances in cryptology—CRYPTO 85 (1986),<br />

Springer-Verlag New York, Inc., pp. 417–426.<br />

[88] Montgomery, P. Modular Multiplication without Trial Division.<br />

Mathematics of Computation 44, 170 (April 1985), 519–521.<br />

[89] Montgomery, P. Speeding up the Pollard and elliptic curve methods of<br />

factorization. Mathematics of Computation 48 (1987), 243–264.<br />

[90] NEC Corporation. Preliminary User’s Manual System-on-Chip Lite,<br />

Development Board, Hardware, Document No. A15650EE1V0UM00, July 2001.<br />

Available at http://www.ee.nec.de/_pdf/A15650EE1V0UM00.PDF.<br />

[91] Federal Information Processing Standards, National Institute of<br />

Standards and Technology, U.S. Department of Commerce. Recommendation<br />

for Key Management, Part 1 - General, Aug. 2005.<br />

[92] Orlando, G., and Paar, C. A Scalable GF(p) Elliptic Curve Processor<br />

Architecture for Programmable Hardware. In Workshop on Cryptographic<br />

Hardware and Embedded Systems — CHES 2001 (May 14–16, 2001), Ç. K. Koç,<br />

D. Naccache, and C. Paar, Eds., vol. LNCS 2162, Springer, pp. 348–363.<br />

[93] Pavelka, P., Galajda, P., and Fischer, V. Crypto FPGA a step towards<br />

a new class of flexible security devices. In Radioelektronika 2005: 15th<br />

international Czech-Slovak scientific conference (Brno, Czech Republic, May<br />

2005), University of Technology, pp. 397–400.<br />

[94] Pelzl, J., Šimka, M., Kleinjung, T., Franke, J., Priplata, C.,<br />

Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Area–time<br />

efficient hardware architecture for factoring integers with the elliptic curve<br />

method. IEE Proceed<strong>in</strong>gs - Information Security 152, 1 (2005), 67–78.<br />

[95] Pollard, J. A Monte Carlo Method for Factorization. Nordisk Tidskrift for<br />

Informationsbehandlung (BIT) 15 (1975), 331–334.<br />

[96] Rivest, R. L., Shamir, A., and Adleman, L. A Method for Obtaining<br />

Digital Signatures and Public-Key Cryptosystems. Communications of the<br />

ACM 21, 2 (February 1978), 120–126.<br />

[97] Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh,<br />

S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J.,<br />

and Vo, S. A Statistical Test Suite for Random and Pseudorandom Number<br />

Generators for Cryptographic Applications. NIST Special Publication 800-22.<br />

(revised May 15, 2002).<br />

[98] Santoro, R., Sentieys, O., and Roy, S. On-line monitoring of random<br />

number generators for embedded security. In IEEE International Symposium<br />

on Circuits and Systems – ISCAS 2009 (2009), pp. 3050–3053.<br />

[99] Schaumont, P., and Ching, D. Gezel. Available at<br />

http://rijndael.ece.vt.edu/gezel2.<br />

[100] Schindler, W. Efficient Online Tests for True Random Number Generators.<br />

In Workshop on Cryptographic Hardware and Embedded Systems –<br />

CHES 2001 (Berlin, Germany, May 13–16, 2001), Ç. K. Koç, D. Naccache,<br />

and C. Paar, Eds., vol. 2162 of Lecture Notes in Computer Science,<br />

Springer-Verlag, pp. 103–117.<br />

[101] Schneier, B. Applied Cryptography: Protocols, Algorithms, and Source Code<br />

in C, 2nd ed. John Wiley & Sons, Inc., New York, 1996.<br />

[102] Secretariat National Committee for Information Technology<br />

Standardization. Fibre Channel - Methodologies for Jitter Specification,<br />

T11.2 / Project 1230/ Rev 10, June 1999.<br />

[103] Shamir, A., and Tromer, E. Factoring Large Numbers with the TWIRL<br />

Device. In Advances in Cryptology — Crypto 2003 (2003), vol. 2729 of LNCS,<br />

Springer, pp. 1–26.<br />

[104] Silverman, R. D. The multiple polynomial quadratic sieve. Mathematics of<br />

Computation 48 (1987), 329–340.<br />

[105] Sunar, B., Martin, W. J., and Stinson, D. R. A provably secure true<br />

random number generator with built-in tolerance to active attacks. IEEE<br />

Transactions on Computers 56, 1 (2007), 109–119.<br />

[106] Tang, K., Siegel, P. H., and Milstein, L. B. A comparison of long<br />

versus short spreading sequences in coded asynchronous DS-CDMA systems.<br />

IEEE Journal on Selected Areas in Communications 19, 8 (Aug. 2001), 1614–<br />

1624.<br />

[107] Tektronix. A Guide to Understanding and Characterizing Timing Jitter.<br />

[108] Tenca, A. F., and Koç, Ç. K. A scalable architecture for Montgomery<br />

multiplication. In Cryptographic Hardware and Embedded Systems (Berlin,<br />

Germany, 1999), Ç. K. Koç and C. Paar, Eds., no. 1717 in Lecture Notes in Computer Science,<br />

Springer-Verlag, pp. 94–108.<br />

[109] Tenca, A. F., and Koç, Ç. K. A scalable architecture for modular multiplication<br />

based on Montgomery’s algorithm. IEEE Transactions on Computers<br />

52, 9 (Sept. 2003), 1215–1221.<br />

[110] Tenca, A. F., Todorov, G., and Koç, Ç. K. High-radix design of a<br />

scalable modular multiplier. In Cryptographic Hardware and Embedded Systems<br />

– CHES 2001 (Berlin, Germany, May 2001), Ç. K. Koç, D. Naccache,<br />

and C. Paar, Eds., no. 2162 in Lecture Notes in Computer Science,<br />

Springer-Verlag, pp. 189–205.<br />

[111] Tkacik, T. E. A hardware random number generator. In Workshop on<br />

Cryptographic Hardware and Embedded Systems – CHES 2002 (Berlin, Germany,<br />

Aug. 13–15, 2002), B. S. Kaliski, Jr., Ç. K. Koç, and C. Paar, Eds., vol. 2523<br />

of Lecture Notes in Computer Science, Springer-Verlag, pp. 450–453.<br />

[112] Tsoi, K., Leung, K., and Leong, P. Compact FPGA-based true and<br />

pseudo random number generators. In Proceedings of the IEEE Symposium on<br />

Field-Programmable Custom Computing Machines (FCCM), California USA<br />

(2003), pp. 51–61.<br />

[113] Šimka, M., and Drutarovský, M. Montgomery Multiplication Coprocessor<br />

on Reconfigurable Logic. In Proceedings of 13th International Czech-<br />

Slovak Scientific Conference Radioelektronika (Brno, Czech Republic, May<br />

6–7, 2003), pp. 95–98.<br />

[114] Šimka, M., Drutarovský, M., and Fischer, V. Embedded True Random<br />

Number Generator in Actel FPGAs. In Workshop on Cryptographic<br />

Advances in Secure Hardware – CRASH 2005 (Leuven, Belgium, Sept. 6–7,<br />

2005).<br />

[115] Šimka, M., Drutarovský, M., and Fischer, V. Randomness Extraction<br />

Method Based on Rationally Related Clock Signals. In Proceedings of<br />

the DSP-MCOM 2005, The 6th International Conference on Digital Signal<br />

Processing and Multimedia Communications (Košice, Slovakia, Sept. 13–14,<br />

2005), pp. 190–193.<br />

[116] Šimka, M., Drutarovský, M., and Fischer, V. Performance of PLL-based<br />

True Random Number Generator in changing working conditions<br />

(Submitted). Acta Electrotechnica et Informatica (2010).<br />

[117] Šimka, M., and Fischer, V. Montgomery Multiplication Coprocessor for<br />

Altera Nios Embedded Processor. In Proceedings of Electronic Computers and<br />

Informatics (Herľany, Slovakia, Oct. 2002), pp. 206–211.<br />

[118] Šimka, M., Fischer, V., and Drutarovský, M. Hardware-Software<br />

Codesign in Embedded Asymmetric Cryptography Application – a Case<br />

Study. In Field-Programmable Logic and Applications – FPL 2003 (Lisbon,<br />

Portugal, Sept. 2003), P. Y. Cheung, G. A. Constantinides, and J. T.<br />

de Sousa, Eds., vol. 2778 of Lecture Notes in Computer Science,<br />

Springer-Verlag, pp. 1075–1078.<br />

[119] Šimka, M., Fischer, V., Drutarovský, M., and Fayolle, J. Model<br />

of a true random number generator aimed at cryptographic applications. In<br />

Proceedings of the International Symposium on Circuits and Systems – ISCAS<br />

2006 (Island of Kos, Greece, May 21–24, 2006), pp. 5619–5623.<br />

[120] Šimka, M., Pelzl, J., Kleinjung, T., Franke, J., Priplata, C.,<br />

Stahlke, C., Drutarovský, M., Fischer, V., and Paar, C. Hardware<br />

Factorization Based on Elliptic Curve Method. In FCCM – IEEE Symposium<br />

on Field-Programmable Custom Computing Machines (Napa Valley,<br />

California, Apr. 17–20, 2005).<br />

[121] Walker, S., and Foo, S. Evaluating metastability in electronic circuits<br />

for random number generation. In Proceedings of the IEEE Computer Society<br />

Workshop on VLSI 2001 (WVLSI ’01) (2001), IEEE Computer Society, p. 99.<br />

[122] Wollinger, T., and Paar, C. How secure are FPGAs in cryptographic<br />

applications? (long version). Cryptology ePrint Archive, Report 2003/119,<br />

2003.<br />

[123] Wolski, E., Filho, J. G. S., and Dantas, M. A. R. Parallel Implementation<br />

of Elliptic Curve Method for Integer Factorization Using Message-Passing<br />

Interface (MPI). In SBAC-PAD 13th Symposium on Computer Architecture<br />

and High-Performance, 2001, Pirenopolis (2001).<br />

[124] Xilinx Corporation. Virtex-E 1.8V Field Programmable Gate Arrays —<br />

Production Product Specification.<br />

[125] Xilinx Corporation. Superior Jitter Management with DLLs ver.1.2,<br />

Virtex Tech Topic VTT013 ed., Jan. 2003.<br />

[126] Xilinx Corporation. Using the Virtex Delay-Locked Loop ver.2.8,<br />

Application Note 132: Virtex Series ed., Jan. 2006.<br />

[127] Xilinx Corporation. Using Delay-Locked Loops in Spartan-II/IIE FPGAs<br />

ver.1.2, Application Note 174 ed., June 2008.<br />

[128] Zimmermann, P. ECMNET page. Available at<br />

http://www.loria.fr/~zimmerma/records/ecmnet.html.<br />
