08.03.2014 Views

FPGA based Hardware Accleration for Elliptic Curve Cryptography ...

FPGA based Hardware Accleration for Elliptic Curve Cryptography ...

FPGA based Hardware Accleration for Elliptic Curve Cryptography ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Fachbereich In<strong>for</strong>matik<br />

Integrierte Schaltungen und Systeme<br />

Prof. Dr.-Ing. Sorin Huss<br />

Studienarbeit<br />

<strong>FPGA</strong> <strong>based</strong> <strong>Hardware</strong> Acceleration <strong>for</strong><br />

<strong>Elliptic</strong> <strong>Curve</strong> <strong>Cryptography</strong> <strong>based</strong> on ¢¡¤£¦¥¨§©<br />

Felix Madlener<br />

madlener@iss.tu-darmstadt.de<br />

Matr.-Nr.: 948463<br />

Betreuer : Dipl.-In<strong>for</strong>m. Markus Ernst<br />

Ausgabe : 01.02.2002<br />

Abgabe : 30.08.2002


Zusicherung<br />

Zur Erstellung der vorliegenden Studienarbeit wurden nur die in der Arbeit angegebenen Hilfsmittel verwendet.<br />

Felix Madlener


Contents<br />

List of Figures<br />

iv<br />

1 Introduction 1<br />

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1<br />

1.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2<br />

1.3 Goals of this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3<br />

1.4 Content of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3<br />

(<br />

Field<br />

over<br />

Fields<br />

in<br />

in<br />

in<br />

in<br />

2 Mathematical Background 4<br />

2.1 <strong>Elliptic</strong> <strong>Curve</strong> Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />

2.1.1 Affine Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />

2.1.2 Projective Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />

2.1.3 EC point multiplication ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />

2.2 Finite Field Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8<br />

2.2.1 The Finite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />

2.2.2 Polynomial Rings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />

2.2.3 Finite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />

2.2.4 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />

2.2.5 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />

2.2.6 Squaring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />

2.2.7 Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />

2.2.8 Polynomial Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11<br />

2.3 Sequential Multiplication Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13<br />

2.3.1 Schoolbook Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13<br />

2.3.2 Polynomial Karatsuba Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />

2.3.3 Multi-Segment Karatsuba Multiplication . . . . . . . . . . . . . . . . . . . . . . . 16<br />

3 <strong>Hardware</strong> Architecture 19<br />

3.1 PCI Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

3.2 Register File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

3.3 EC Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

3.4 Finite Field Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

3.4.1 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20<br />

3.4.2 Square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

3.4.3 Input Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


CONTENTS<br />

iii<br />

3.4.4 Combinational Multiplier (CKM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />

3.4.5 MSK Pattern Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23<br />

3.4.6 Interleaved Polynomial Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />

3.5 VHDL-Code Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25<br />

3.6 Evaluation Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26<br />

4 Implementation Results 27<br />

4.1 Xilinx XC4085XLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />

4.2 Xilinx XCV405E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />

4.3 Atmel AT94K40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28<br />

4.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />

5 Conclusions and Outlook 30<br />

5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />

5.2 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />

Bibliography 31<br />

Annex 34<br />

3-Segment Karatsuba Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />

2P Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


List of Figures<br />

c)! #"%$ d)& #"'$<br />

2.1 Example of an EC visualizing the point addition . . . . . . . . . . . . . . . . . . . . . . . . 5<br />

2.2 EC arithmetic hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8<br />

2.3 Structure of the polynomial reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />

2.4 Sequential Multiplication Schemes: a) Schoolbook method; b) unrolled Karatsuba <strong>for</strong> 2<br />

recursion steps; be<strong>for</strong>e reordering of the subterms; after reordering of<br />

the subterms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14<br />

2.5 Polynomial Karatsuba multiplication scheme . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />

of(*),+<br />

3.1 Generic Datapath of the EC coprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />

3.2 Generic Datapath of the Finite Field Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 21<br />

3.3 Recursive construction process <strong>for</strong> polynomial Karatsuba multipliers . . . . . . . . . . . . . 22<br />

3.4 Combinational Karatsuba Multiplier gate count . . . . . . . . . . . . . . . . . . . . . . . . 23<br />

3.5 Structure of the polynomial reduction bit . . . . . . . . . . . . . . . . . . . . . . 24<br />

4.1 microEnable PCI card . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />

4.2 Atmel AT94K40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


Chapter 1<br />

Introduction<br />

1.1 Motivation<br />

Today there is a wide range of distributed systems, which use communication resources that can not be<br />

safeguarded against eavesdropping or unauthorized data alteration. Thus cryptographic protocols are applied<br />

to these systems in order to prevent in<strong>for</strong>mation extraction or to detect data manipulation by unauthorized<br />

parties.<br />

In general, cryptographic methods can be subdivided into two categories: symmetric- and asymmetric<br />

cryptographic algorithms.<br />

In the case of symmetric cryptographic algorithms like DES [1] or AES [2] both communication partners<br />

use the same secret key to encrypt and decrypt messages. Compared to asymmetric cryptography these<br />

algorithms are considered to be faster and more efficient. However, the general problem of symmetric<br />

methods is the distribution of the secret key. Sender and recipient both have to possess the same secret key<br />

to process the message but no one else may have the key, because otherwise one would be able to decrypt<br />

or alter the message just like the original author and recipient. So secure channels have to be established in<br />

order to exchange the keys.<br />

With regard of this problem, asymmetric algorithms have been developed. These algorithms, which are<br />

also called public key algorithms, differ in the utilized set of keys consisting of a public- and a private key.<br />

This key pair can only be computed by the original creator of the keys. For all others both keys are virtually<br />

independent. The general principle of all public key schemes is then, that a message that is encrypted with<br />

one of the keys can only be decrypted with the other one.<br />

After publication of the public key, everyone can use this key, e.g. to encrypt a message. Afterwards,<br />

this message can only be decrypted with the corresponding private key which is exclusively known by the<br />

authorized recipient of the message.<br />

Alternatively, public key algorithms can be used to compute and verify digital signatures. The author of<br />

a message uses his secret, private key to compute a signature. By looking up the authors public key any<br />

recipient of the message-signature-pair is able to verify this signature subsequently.<br />

Public key algorithms provide much flexibility and a very high level of security. On the other hand,<br />

in comparison to symmetric methods, they are <strong>based</strong> on much more complex and expensive arithmetic<br />

operations. In practice a combination of both methods is frequently used. E.g. SSL [3]: public key methods<br />

are used <strong>for</strong> key exchange and authentication while symmetric algorithms are applied <strong>for</strong> the encryption of<br />

the data stream.<br />

The most prominent public key method is the widely-used RSA algorithm [4]. It is <strong>based</strong> on the problem


1.2. PREVIOUS WORK 2<br />

of dividing a large number into it’s prime factors, a problem that is considered to be hard, meaning it can<br />

not be calculated in polynomial time.<br />

Un<strong>for</strong>tunately it is not known if this problem is really hard (though it’s quite probable). If someone would<br />

develop an algorithm to calculate the prime factors efficiently, the RSA scheme would become insecure<br />

immediately. This problematic leaded to the need of alternatives in public key cryptography. The main<br />

requirement <strong>for</strong> such alternatives is the use of different underlying mathematical problems, which should<br />

optimally be as well researched as the problem of dividing large numbers into prime factors.<br />

One of the most important alternative public-key schemes is <strong>based</strong> on the discrete logarithm problem<br />

(DLP) on elliptic curves (EC). In 1985 elliptic curve cryptography (ECC) has been first proposed by<br />

V. Miller [5] and N. Koblitz [6] independently. In the following a lot of research has been done and nowadays<br />

ECC is widely known and accepted. Because EC methods in general are believed to give a higher<br />

security per key bit in comparison to RSA (1024 RSA-bits are equivalent to 160 EC-bits), one can work<br />

with shorter keys in order to achieve the same level of security [7]. The smaller key size permits more<br />

cost-efficient implementations, which is of special interest <strong>for</strong> hardware implementations and systems with<br />

low computing power.<br />

ECC is <strong>based</strong> on a set of points on elliptic curves and their arithmetic operations EC-Add and EC-Double.<br />

These EC operations are in turn composed of arithmetic operations in the underlying finite field (FF). Here,<br />

the most expensive and complex operation is the field multiplication FF-Mult. The finite -. field ,<br />

which is characteristic of and degree( extension is treated throughout this work.<br />

Each application has different demands on the utilized cryptosystem (e.g., in terms of required bandwidth,<br />

level of security, incurred cost per node and number of communicating partners). Corresponding to the<br />

growing number of ECC clients, there is also a need <strong>for</strong> high per<strong>for</strong>mance server implementations. Such an<br />

implementation should be capable of processing different cryptographic parameter sets (in special different<br />

bit widths) at high speed.<br />

Depending on the application, the per<strong>for</strong>mance of genuine SW implementations of EC cryptosystems may<br />

not be sufficient. In this work a generic and scalable architecture of an ECC coprocessor has been developed.<br />

The presented prototype implementations are <strong>based</strong> on different reconfigurable Field Programmable Gate<br />

Array (<strong>FPGA</strong>) devices from Xilinx [8] and Atmel [9].<br />

The main focus of this work is the acceleration of the field multiplication FF-Mult. This is realized<br />

by combining a fast and resource efficient combinational multiplier with a novel scheme <strong>for</strong> sequential<br />

multiplication called multi-segment Karatsuba multiplication (MSK). A clever structuring of the datapath<br />

together with well fitting EC level algorithms leads to highly efficient implementations of EC cryptosystems.<br />

1.2 Previous Work<br />

Concerning EC coprocessor designs, there is some previous work at the institute, on which this work is<br />

relying on. In contrast to the polynomial base representation treated here, this previous hardware implementation<br />

(documented in [10]) is <strong>based</strong> on an Optimal Normal Basis (ONB) representation of the field<br />

elements.<br />

The manner how basic arithmetic operations in have to be per<strong>for</strong>med depends on the utilized<br />

field representation. Especially the multiplication in is completely different when comparing ONB<br />

against polynomial base arithmetic. However, the top-level EC algorithms do not depend on field representation<br />

and could there<strong>for</strong>e be partially reused within this work.<br />

The microEnable <strong>FPGA</strong> card, which is one of the utilized hardware plat<strong>for</strong>ms in Chap. 4, has been already<br />

used <strong>for</strong> the implementation of the ONB <strong>based</strong> design. The idea of using a generator approach to derive


1.3. GOALS OF THIS STUDY 3<br />

specific and synthesizable VHDL descriptions from a superior, generic coprocessor-model has also been<br />

adopted from this previous design.<br />

1.3 Goals of this Study<br />

The main goal of this work is the design and the implementation of a arithmetic processor kernel<br />

which can be embedded into the previously described EC coprocessor design. After some research on<br />

existing algorithms and implementations it was decided that the main work should concentrate on the finite<br />

field arithmetic while the EC level algorithms should be adopted from the literature. The main reason <strong>for</strong> this<br />

decision was the fact, that an efficient hardware architecture has a much greater influence on the efficiency<br />

of the data flow dominated finite field algorithms than on the control flow oriented EC level algorithms.<br />

The final design should be widely scalable in terms of different types of resource usage. To provide this<br />

scalability and flexibility a generator program should be used to produce the VHDL hardware models, out<br />

of which the <strong>FPGA</strong> programming bitstream is synthesized subsequently. While it was clear that the most<br />

important parameter <strong>for</strong> scalability will be the bitwidth of the design, the complete parameter set and the<br />

resulting degree of flexibility caused by the generator approach was specified during the implementation<br />

progress.<br />

A minor goal has been the compatibility to existing modules. The interface of the new developed design<br />

should correspond to that of the existing ONB implementation. Using this goal it was possible to reuse the<br />

already optimized EC Controller from the ONB implementation without modifications.<br />

A new family of attacks against cryptographic hardware implementations is currently gaining much importance.<br />

These so called Side-Channel-Attacks use additional in<strong>for</strong>mation the hardware provides beside the<br />

cryptographic functions to extract knowledge of the secret key. Examples <strong>for</strong> this additional in<strong>for</strong>mation are<br />

the runtime of an operation that might depend on the secret key or the power consumption of a chip during<br />

the computation. Though it was not a main goal of this work to provide resistance against such attacks, the<br />

problem should be reminded during implementation. Where possible, simple countermeasures should be<br />

implemented.<br />

To evaluate the functionality of the hardware implementation, the results should be compared against<br />

results of a pure software implementation. To provide a framework <strong>for</strong> this evaluation process, the existing<br />

C software implementation has been extended to support- elements in polynomial representation.<br />

1.4 Content of this work<br />

The mathematical background of elliptic curves and finite fields is introduced in the following chapter.<br />

Furthermore, the multi-segment Karatsuba multiplication scheme is described in detail. Chap. 3 focuses on<br />

the architecture and the implementation of the proposed ECC coprocessor. Special attention is given to the<br />

arithmetic processor kernel. Implementation results and some per<strong>for</strong>mance numbers are given in<br />

Chap. 4. Finally, Chap. 5 summarizes the conclusions and gives an outlook on work that might follow.


Chapter 2<br />

Mathematical Background<br />

There are several cryptographic schemes <strong>based</strong> on elliptic curves. These schemes work on a subgroup of<br />

points of an EC over a finite field. Arbitrary finite fields are approved to be suitable <strong>for</strong> ECC. This work<br />

concentrates on elliptic curves over the finite field- and their arithmetics only. For further in<strong>for</strong>mation<br />

see [11] and [12].<br />

There are several bases known <strong>for</strong> . The most common bases, which are also proposed by the<br />

leading standards concerning ECC (IEEE 1363 [13] and ANSI X9.62 [14]) are polynomial bases and normal<br />

bases. Please remark, that the design detailed in the following is exclusively treating with polynomial basis<br />

representation.<br />

Sec. 2.1 introduces some basic facts and algorithms of elliptic curves. In Sec. 2.2 a short review on the finite<br />

field <strong>based</strong> on polynomial basis representation is given. Sec. 2.3 presents several multiplication<br />

schemes in- and leads to the multi-segment Karatsuba multiplication algorithm which is one main<br />

contribution of this work.<br />

2.1 <strong>Elliptic</strong> <strong>Curve</strong> Arithmetic<br />

2.1.1 Affine Coordinates<br />

An elliptic curve over is defined as the cubic equation<br />

),5 176 598:) (2.1)<br />

/¢02143<br />

1FE and>HG 6!I<br />

. The set of solutions J=K5D@ 1 ML 1 3 )?5 1N6 5 8 )O;P5 3 )Q>SR is called the<br />

points of the elliptic curve/<br />

. By defining an appropriate addition operation and an extra pointT , called the<br />

with;9@A>B@C5D@<br />

point at infinity, these points become an additive, abelian withT group the neutral element.<br />

Fig. 2.1 depicts an example of an elliptic curve over the reals. Here, a geometric interpretation of the<br />

addition can be given: Find the third intersection (-U point ) of a straight line through V and with the<br />

elliptic curve. resultU<br />

6 Q)WV The is found by -U mirroring at the x-axis.


V (EC-Add), ^ 6 1 3_ 1 Y 6<br />

then<br />

3 ^ ^<br />

) 6<br />

8<br />

1<br />

6 V 6 K5D@ 1 (EC-Double), then ^ 6 57)<br />

If<br />

1<br />

8<br />

1<br />

5<br />

^<br />

2.1. ELLIPTIC CURVE ARITHMETIC 5<br />

Figure 2.1: Example of an EC visualizing the point addition<br />

For an curve/<br />

elliptic over- defined the basic operation of points¨@XV<br />

EQ/<br />

adding with<br />

6<br />

K5DYZ@ 1 Y[ andV<br />

6 K5 3 @ 1 3 <br />

,<br />

is as follows:<br />

U 6 Q)WV 6 K5 8 @ 1<br />

8 9\<br />

If]G<br />

5 3_ 5`Y<br />

),5`Ya)


Ymonpn _ #qkr9;=stPuiY[ v<br />

l<br />

_ wr9xKyhzf 3 @ l Y[ v monMn _ v-{P{ v @gfDYh<br />

monMn<br />

monMn _ wr9xKyhKe 3 @XijY[ |<br />

monMn _ v-{P{ | @Ce'Yh }<br />

|<br />

_ 2rx~yui Y @ | € m nMn _ wr9xKyhl Y @g;<br />

monpn<br />

m nMn _ v-{P{ € @ } €<br />

Y monpn _ #qkr9;=stP| €<br />

l<br />

nMn _ wr9xKyhl Y @ € m<br />

monpn _ 2rxKyhv @ } /<br />

l Ymonpn _ #qkr9;=stPv <br />

monpn _ 2rx~yzf 3 @Xi 8 „<br />

„<br />

_ vƒ{={ „ @Ce 8 monpn<br />

Ymonpn _ 2rxKyhui 8 @ „ l<br />

2.1. ELLIPTIC CURVE ARITHMETIC 6<br />

Algorithm 1 EC-Add<br />

Input:<br />

6 KeNYS@gfDYh@XijY[X@XV 6 Ke 3 @gf 3 @ZcS 1E :k<br />

Output:U<br />

6 Q)WV 6 Ke 8 @gf 8 @Xi 8 E -k.<br />

i 8 monpn _ #qkr9;=st=} <br />

e 8 m‚npn _ v-{P{ € @ / <br />

e 8 m‚npn _ v-{P{ Ke 8 @ l Y[<br />

n!monpn _ 2rx~yKe 3 @Xi 8 <br />

n!monpn _ vƒ{={ zn@Ce 8 <br />

f 8 monpn _ 2rxKyh/ @gn…<br />

f 8 monpn _ vƒ{={ zf 8 @ l Y[<br />

returnKe<br />

8 @gf 8 @Xi 8 <br />

2.1.2 Projective Coordinates<br />

Computing inverses in is relatively expensive in comparison to multiplication. One may switch to<br />

projective coordinates in order to avoid computing inverses. The hardware implementation presented in this<br />

work is <strong>based</strong> on the projective representation detailed in [15].<br />

Replacing5<br />

6 e%†i and176 f†i 3<br />

in Eqn. 2.1 leads to the EC equation<br />

3 )‡ê f‰i 6 e 8 i‡)i $ \ (2.2)<br />

f<br />

An point<br />

6 K5b@ 1 affine is converted into its projective representation settinge<br />

6 5 ,f 6Š1<br />

by<br />

i 6 c<br />

and<br />

. The conversion from projective to affine is done as stated be<strong>for</strong>e by 5 6 e%†i computing<br />

1*6 f†i 3<br />

and<br />

.<br />

Applying these projective coordinates an EC-Add operation can be per<strong>for</strong>med with Alg. 1 and the corresponding<br />

EC-Double algorithm is given by Alg. 2. Thus, computing (,)‹V EC-Add ) requires 10 multipli-<br />

1 We can fixŒ9`Ž because the base point‘’bŽ”“–•’A—˜h’g will always be added during the computation of the point multiplicationš›‘<br />

.


Ymonpn _ #qkr9;=stPKe'Y[ l<br />

l<br />

monpn _ #qkr9;=stPuiY[ 3<br />

3 monpn _ #qkr9;=stPl 3 l<br />

l<br />

monpn _ 2rxKyhl 3 @A> Y<br />

3 monpn _ #qkr9;=stPzf Y l<br />

l<br />

monpn _ vƒ{={ l 3 @ l Y l Ymonpn _ 2rxKyhz;9@Xi<br />

3<br />

l Ymonpn _ vƒ{={ l YS@ l 3 8<br />

l Ymonpn _ 2rxKyhl YZ@Ce 8 <br />

2.1. ELLIPTIC CURVE ARITHMETIC 7<br />

Algorithm 2 EC-Double<br />

Input:<br />

6 KeNYS@gfDYh@XijY[ E -k<br />

Output:Q)<<br />

6 ¦ E ¦.<br />

8 monpn _ 2rx~yl YZ@ l 3 <br />

e<br />

i<br />

m‚npn _ œqBr;Pst=l YX 8<br />

e 8 m‚npn _ v-{P{ Ke 8 @ l Y <br />

f 8 monpn _ 2rxKyhl Y @Xi 8 <br />

8 monpn _ vƒ{={ zf 8 @ l Y[<br />

returnKe<br />

f<br />

@gf 8 @Xi 8 8<br />

cations, 8 additions and 4 square operations. The computation of (4 EC-Double ) requires multiplications,<br />

4 additions and 5 squares. All these operations have to be done in the underlying finite field.<br />

2.1.3 EC point multiplication (žbŸ¡ )<br />

Since the points on an elliptic curve/<br />

<strong>for</strong>m an additive group, there is no inner group operation like the<br />

multiplication. Even so repeated point additions such as<br />

§Z 6 U 6<br />

Ë /<br />

and<br />

E'©ª<br />

, are usually considered as the operation called EC point multiplication.<br />

Based on this operation, a discrete logarithm problem <strong>for</strong> elliptic curves can be <strong>for</strong>mulated. A problem,<br />

that is considered to be a secure cryptographic function. A secure cryptographic function in this terms<br />

with@gU<br />

means, the ofU calculation of and out can be per<strong>for</strong>med quite efficient while it is hardly possible<br />

compute<br />

to<br />

only andU if known. are is called the discrete ofU logarithm to base the .<br />

The level of security, ECC provides directly follows from the bitwidth the numbers in the underlying<br />

finite field. Currently bitwidths ranging from 113 bit (<strong>for</strong> low security applications) up to about 409 bit (<strong>for</strong><br />

very high security applications) are utilized.<br />

The hierarchy of arithmetics <strong>for</strong> an EC point multiplication is depicted in Fig. 2.2. The level top algorithm<br />

is per<strong>for</strong>med by repeated EC-Add and EC-Double operations. The EC operations in turn are composed<br />

of basic operations in the underlying field. The proposed finite field arithmetic is capable to compute the<br />

FF-Add and FF-Square operations within one clock cycle. The operation FF-Mult is more costly. The<br />

number of clock cycles <strong>for</strong> its computation depends on the number of segments used in the FF multiplier<br />

(see Sec. 2.3.3 <strong>for</strong> details).<br />

¦<br />

times<br />

¢ £h¤ ¥ d),d)o\\\)


2.2. FINITE FIELD ARITHMETIC 8<br />

k P<br />

EC-Double<br />

EC-Add<br />

FF-Mult<br />

FF-Add<br />

FF-Square<br />

Figure 2.2: EC arithmetic hierarchy<br />

By exploiting the previously detailed projective coordinates during the computation of a operation all<br />

but one field inversion can be circumvented. This inversion, that takes place at the end of a operation,<br />

converts the result that is given in projective coordinates back to the affine representation. Compared to the<br />

number of cycles a complete operation takes, the time <strong>for</strong> this single inversion is negligible. There<strong>for</strong>e<br />

it can be computed using a simple algorithm <strong>based</strong> on the existing field operations FF-Square, FF-Mult and<br />

FF-Add (see Sec. 2.2.6).<br />

Depending on different constraints one can chose from a lot of different algorithms on all level of arithmetic.<br />

For the operations, this work applies the Double-And-Add algorithm given in Alg. 3. The algorithm<br />

simply scans bit-by-bit. If the current bit is set ( 6 c ), the intermediate result is doubled and the base<br />

point is added one time. If the bit is not set ( 6«I<br />

), only the EC-Double operation is per<strong>for</strong>med.<br />

Using precomputated values would allow to scan multiple bits at one time and by that would lead to an<br />

improved per<strong>for</strong>mance. Due to space limitations the additional registers that would be needed to store these<br />

precomputated values were the main reason not to implement such an algorithm.<br />

Using Alg. 3 each multiplication requires( EC-Double and¬ EC-Add operations. As EC-Double is<br />

cheaper in terms of FF multiplication as EC-Add, the per<strong>for</strong>mance of the algorithm benefits from a key<br />

with low Hamming weight¬ .<br />

2.2 Finite Field Arithmetic<br />

The finite field- is the underlying field on which elliptic curves are <strong>based</strong> throughout this work. It<br />

can be viewed as a vector space of dimension( over the field . In hardware field elements can be<br />

easily implemented as a bit vector, which makes this kind of finite fields especially interesting <strong>for</strong> hardware<br />

implementations. As already mentioned, the representation treated in this paper is a polynomial basis only.


2.2.2 Polynomial Rings over·7¸À¹4»<br />

2.2. FINITE FIELD ARITHMETIC 9<br />

Algorithm 3 Double-And-Add<br />

Input:<br />

6 =­ Y@\\\S@AYZ@AP®S3 and<br />

Ë /<br />

m ¯ _ c ¯<br />

end while<br />

I<br />

then<br />

Vwm <br />

if¯:´<br />

m ¯ _ c ¯<br />

6 ¯ m°( _ c<br />

Output:V<br />

6²I and¯³ I<br />

do<br />

while±<br />

while¯³ I<br />

do<br />

/ } _ €Hµ r>[xtPuV…<br />

if ± 6 c then<br />

Vwm<br />

/ } _ v-{P{ uV§@g…<br />

end if<br />

Vwm<br />

m ¯ _ c ¯<br />

end while<br />

else<br />

end if<br />

VwmoT<br />

2.2.1 The Field·§¸º¹»<br />

Finite<br />

The smallest imaginable finite is- 6½¼ †¦ field , which has two elements only: The additive and the<br />

multiplicative elementsI<br />

andc neutral respectively. Its addition and multiplication tables resemble the truth<br />

tables of the binary (¾ XOR ) and the binary (¿ AND ) operation respectively. The elements can directly<br />

represented by a single bit.<br />

returnV<br />

sethÁ5à6 J¦ÄÆÅ ±ÈÇÉ® ;P±K5 ± L`;P± E -AR The of polynomials with in: coefficient together with<br />

the additive elementI<br />

5 ®<br />

neutral , the multiplicative elementch5<br />

®<br />

neutral , and polynomial addition as well as<br />

multiplication operations constitutes a over ring . Since the degree of a coefficient is given by it’s bit<br />

position, an ofhÁ5. element can effectively be represented by it’s coefficients stored a bit vector.<br />

2.2.3 Fields-·‰¸º¹ »<br />

Finite<br />

Given an irreducible E hÁ5. polynomial of ( degree , finite fields of extension ( degree are constructed<br />

by modular arithmetic out of the previously defined polynomial rings as follows:<br />

6 hÁ5.ÂK†k\ (2.3)<br />

The set, which is underlying the Galois field, is thus the finite set of residue classes of polynomials modulo<br />

the prime polynomial . The canonical representative of a polynomialv<br />

’s residue class is the remainder of


2.2.4 Addition in-·‰¸º¹ »<br />

2.2.7 Inversion in·7¸À¹ »<br />

Ò<br />

Ì<br />

2.2. FINITE FIELD ARITHMETIC 10<br />

the polynomial divisionv<br />

†k : It is a polynomial of degree less than( . The computation of the canonical<br />

representative is called polynomial reduction.<br />

This leads to the following definitions of the basic arithmetic operations that are similar to the operations<br />

defined in-hÁ5. except that an additional reduction is necessary whenever the degree of the resulting<br />

polynomial is³<br />

( .<br />

Given polynomialsv<br />

@ |ÊE withv 6 Ä =­ Y ±ÈÇÉ® ; ±5 ± and|Ë6 Ä =­ Y ±ÈÇÉ® > ±5 ±<br />

two , the addition operation<br />

is defined as<br />

¾ |!6 =­ Y v z; ± ¾W> ±º5 ±Î͉Ï4Ð \ (2.4)<br />

±ÈÇÉ®<br />

From Eqn. 2.4 thatv<br />

¾ v 6¢I<br />

follows allv E <strong>for</strong> . The additive inverse is there<strong>for</strong>e the identity<br />

function, i.e., addition and subtraction are identical Sincev<br />

¾ |<br />

operations. will be of a maximum<br />

of( _ c <strong>for</strong>v<br />

@ |½E -¦<br />

degree<br />

, no reduction step has to be per<strong>for</strong>med in the case of addition.<br />

2.2.5 in·7¸À¹ »<br />

Multiplication<br />

The multiplication of polynomialsv<br />

@ |¤E - two is given by<br />

denoting<br />

|Ñ6 3 =­ 3 Ì v ±5 ±ÓÍ7Ï4Ð (2.5)<br />

±–ÇÉ®%Ò<br />

6 ¦ Ô<br />

±ÈÇÉ® ;=±¿W> ¦ ­ ± <strong>for</strong><br />

I7Õ Õ k( _ 4@<br />

¦<br />

with P as the corresponding prime polynomial ;.± 6ÖI<br />

and >X± 6×I<br />

,<br />

¯%³ ( <strong>for</strong> . Ä 3 =­ 3<br />

Since<br />

maximum ofk( _ degree the of( _ c reduction bits has to be per<strong>for</strong>med.<br />

±ÈÇÉ® Ò ±z5 ±<br />

has a<br />

2.2.6 in·7¸À¹ »<br />

Squaring<br />

Squaring is a special case of multiplication. By inserting Eqn. 2.4 into Eqn. 2.5 it can be simplified to<br />

3 6 =­ Y Ì v ;=±~5 3 ± ͉Ï4Ð (2.6)<br />

±ÈÇÉ®<br />

Like in the case of multiplication, a of( _ c maximum bits have to be reduced while per<strong>for</strong>ming a square<br />

operation.<br />

As stated in Sec. 2.1 the inversion is a complex operation that is computed only once a in<br />

62I<br />

Operation.<br />

, Fermat’s Little Theorem can be<br />

To compute the multiplicative inverse <strong>for</strong> elementv E ,v G<br />

an<br />

applied:


v E k<br />

Input:<br />

v ­ Y<br />

Ø moxµBÙ 3 ( _ c<br />

Output:<br />

2.2. FINITE FIELD ARITHMETIC 11<br />

Algorithm 4 Finite Field Inversion<br />

Ø rx~yÚm v<br />

whileØ ³ I<br />

do<br />

st<br />

ṕ´ Ø<br />

// right shift byØ<br />

bits spmo(<br />

st Ø r9xKy<br />

<strong>for</strong>¯<br />

fromc toKs ṕ´ cS do<br />

qm<br />

_ œqBr;Pst=zq // per<strong>for</strong>m Û3 square operations<br />

end <strong>for</strong><br />

qpmonpn<br />

_ wr9xKyKst Ø rxKyX@gq<br />

ifs is odd then<br />

yœm‚npn<br />

npn _ #qkr;PstPKyg yœm<br />

Ø r9xKyœm npn _ wr9xKyKyX@ v <br />

else<br />

st<br />

Ø r9xKyœmoy<br />

end if<br />

st<br />

m Ø _ c Ø<br />

end while<br />

Ø rx~yÚmonpn _ #qkr9;=stPKst Ø r9xKyg<br />

returnst<br />

Ø rx~y<br />

st<br />

v 3CÝ ­ Y ͉Ï4Ð v ­ Y cƒÜ v 3CÝ ­ 3 ͉Ï4Ð (2.7)<br />

Ü<br />

Inversion can there<strong>for</strong>e be simply computed by repeated FF-Square and FF-Mult operations like it is shown<br />

in Alg. 4. The algorithm in particular benefits from the fact in that squaring is much cheaper than<br />

multiplication. The total number multiplications¢K(b of required <strong>for</strong> one FF inversion is given by<br />

6]ÞKßÏà 3 K( _ cSºá)?¬bK( _ cS _ c#\<br />

½K(b<br />

2.2.8 Polynomial Reduction<br />

As mentioned above, the basic arithmetic operations take place in-hÁ5ÃÂ . In case of multiplication and<br />

squaring the resulting polynomial has to be reduced. According to Eqn. 2.5 the maximum degree of the<br />

multiplication result} 6 v |<br />

withv<br />

@ |¤E isk( _ . The subsequent polynomial reduction of}<br />

modulo is <strong>based</strong> on the equivalence<br />

Ü P­ Y Ì<br />

±–Çɮ⠱~5 ± ͉Ï4Ð \ (2.8)<br />

5


Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

Ì<br />

2.2. FINITE FIELD ARITHMETIC 12<br />

<strong>Hardware</strong> implementations of the polynomial reduction can especially benefit from hard-coded prime<br />

polynomials with low Hamming weight such as trinomials or pentanomials. Such polynomials are typical<br />

<strong>for</strong> cryptographic and exist <strong>for</strong> all interesting EC parameter sets.<br />

Given a prime trinomial<br />

6 5 )‡59ãä)Oc the reduction process can be per<strong>for</strong>med efficiently by using the<br />

identities:<br />

Ü 5 ã )²c ͉Ï4Ð 5<br />

ª Y Ü 5 㪠Y ),5 Í7Ï4Ð 5<br />

.<br />

This leads to<br />

5 3 Ü 5 㪠) binary XOR operations <strong>for</strong> one polynomial reduction. Reduction of<br />

pentanomials can be per<strong>for</strong>med similar leading to some additional XOR operations. The particular terms<br />

(1...5) of the final equation are structured according to Fig. 2.3 in order to per<strong>for</strong>m the reduction. With<br />

respect to the implementation a single( -bit register is sufficient to store the resulting bit string.<br />

åYºæ<br />

å3 æ<br />

å8 æ<br />

å$Aæ<br />

åèç æ<br />

¢ £h¤ ¥<br />

¢ £¤ ¥<br />

¢ £¤ ¥<br />

¢ £h¤ ¥<br />

¢ £¤ ¥<br />

) =­ Y ­ ã<br />

) ã­ Y<br />

) ã­ Y<br />

) =­ Y<br />

±ÈÇÉ® Ò ±K5 ±<br />

±ÈÇÉ® Ò ±ª 5 㪠±<br />

±–ÇÉ® Ò 3 =­ 㪠±5 㪠±<br />

±ÈÇÉ® Ò 3 P­ 㪠±5 ±<br />

±ÈÇÉ® Ò ª ±K5 ±<br />

n−1 0<br />

(1)<br />

2n−1<br />

2n−b−1 n 2n−1 2n−b<br />

(2) (4)<br />

2n−1<br />

(5)<br />

2n−b<br />

(3)<br />

Result Register (n bit)<br />

Figure 2.3: Structure of the polynomial reduction<br />

n


2.3. SEQUENTIAL MULTIPLICATION SCHEMES 13<br />

Due to the complexity of a reduction step, in the following this work diverts the arithmetic operations in<br />

into two successive parts. The arithmetical operation:hÁ5. corresponding to that in- and<br />

with a degreeÕ<br />

( and the subsequent reduction step.<br />

2.3 Sequential Multiplication Schemes<br />

In order to achieve a reasonable level of security <strong>for</strong> an EC cryptosystem, the extension degree ( of the<br />

underlying finite field has to be in the hundreds. Due to chip area limitations an application of<br />

combinational multipliers of full bit width ( is usually not feasible. Instead, a sequential multiplication<br />

scheme, which is <strong>based</strong> on a reasonable sized purely combinational multiplier unit, has to be utilized. In<br />

Sec. 3.4.4 the architecture of such a combinationalhÁ5. multiplier is presented, which is scalable and<br />

highly efficient in terms of required logic resources.<br />

In the remainder of this section it is assumed that a reasonable sized combinational multiplier, which<br />

computes the unreduced product of two degree +êé ( polynomials, is part of the design. Some wellknown<br />

methods <strong>for</strong> sequential multiplication are introduced first. Then, in Sec. 2.3.3, the Multi-Segment<br />

Karatsuba multiplication scheme is detailed and compared to classical approaches.<br />

2.3.1 Schoolbook Multiplication<br />

Given two polynomialsv<br />

@ |ëE -hÁ5ÃÂ of degree( and a combinational multiplier of size+<br />

6ëì(b†¦kí ,<br />

the product} 6 v |<br />

can be computed as follows: Firstv<br />

and|<br />

are split each into two segments of equal<br />

size.<br />

Then the product can be computed as<br />

6 v YC5 î 3 ¾ v ® v<br />

6 | YC5 î 3 ¾ | ® |<br />

6 v | }<br />

v Yg5 î 3 ¾ v ®ZaP| Yg5 î 3 ¾ | ® 6<br />

v YÚ | YC5 ¾Æv YÚ | ®Ú¾ v ®j | Y[º5 î 3 ¾ v ®j | ®k\ (2.10)<br />

6<br />

Please note that in the context of hardware implementations the5<br />

±<br />

factors correspond to position offsets,<br />

which can simply be implemented by appropriate wiring.<br />

Generally, the polynomials can be split into an arbitrary number of segments<br />

E”© ª<br />

. It is selected such<br />

that the resulting segments are small enough to be multiplied on the combinational multiplier (+ ³ (b†¦ ).<br />

The number of necessary multiplications is given by<br />

3<br />

. Since the additions can be computed combinationally<br />

in the same cycle, the cycle count <strong>for</strong> a complete multiplication is also given by<br />

3<br />

.<br />

A variation of the schoolbook method splits each of the two polynomialsv<br />

and|<br />

into different numbers<br />

of segments (=ïÚ@A=ð .) Of course, in this case an appropriate asymmetric combinational multiplier is necessary<br />

2 . The number of required multiplications is given here by.ïˆBPð . In the extreme case of4ï 6 ( and<br />

=ð 6 c this scheme is called bit serial multiplication.<br />

2 This topic is extensively treated in [16].


2.3. SEQUENTIAL MULTIPLICATION SCHEMES 14<br />

x^ 8<br />

x^ 7<br />

x^ 6<br />

x^ 5<br />

x^ 4<br />

x^ 3<br />

x^ 2<br />

x^ 1<br />

x^<br />

a)<br />

0<br />

x^<br />

b)<br />

8<br />

x^<br />

7<br />

x^<br />

6<br />

x^<br />

5<br />

x^<br />

4<br />

x^<br />

3<br />

x^<br />

2<br />

x^<br />

1<br />

x^<br />

0<br />

3<br />

23<br />

A 3*B3<br />

A 3*B2<br />

A 2*B3<br />

A 3*B1<br />

A 2*B2<br />

A 1*B3<br />

A 3*B0<br />

A 2*B1<br />

A 1*B2<br />

A 0*B3<br />

A 2*B0<br />

A 1*B1<br />

A 0*B2<br />

A 1*B0<br />

A 0*B1<br />

A 0*B0<br />

13<br />

0123<br />

2<br />

02<br />

1<br />

01<br />

0<br />

A*B<br />

A*B<br />

c)<br />

x^ 8<br />

x^ 7<br />

x^ 6<br />

x^ 5<br />

x^ 4<br />

x^ 3<br />

x^ 2<br />

x^ 1<br />

x^ 0<br />

x^ 8<br />

x^ 7<br />

x^ 6<br />

x^ 5<br />

x^ 4<br />

x^ 3<br />

x^ 2<br />

x^ 1<br />

x^<br />

d)<br />

3<br />

0<br />

23<br />

3<br />

123<br />

23<br />

12<br />

23<br />

2<br />

3<br />

0123<br />

123<br />

012<br />

1<br />

2<br />

012<br />

01<br />

12<br />

01<br />

1<br />

0<br />

0<br />

123<br />

0123<br />

2<br />

12<br />

012<br />

1<br />

01<br />

0<br />

A*B<br />

A*B<br />

Figure 2.4: Sequential Multiplication Schemes: a) Schoolbook method; b) unrolled Karatsuba <strong>for</strong> 2 recursion<br />

steps; c)! #"%$ be<strong>for</strong>e reordering of the subterms; d)& œ"H$ after reordering of the subterms


6 v | }<br />

v Yg5 ¦î 3 ¾ v ®SaP| Yg5 î 3 ¾ | ®S 6<br />

v Y | Yg5 ¾ v Y | ®[5 î 3 ¾ v ® | Yg5 ¦î 3 ¾ v ® | ®<br />

6<br />

6 v Y | Y<br />

ú<br />

ü v Y | Y<br />

ü v ® | ®<br />

Y 0ó6 v Y | Y l<br />

3 0ó6 v Ya¾ v ®Sb=| Ya¾ | ®S l<br />

v Y | ®Ú¾ v ® | YD¾ v Y | Yb¾ v ® | ® 6<br />

l<br />

8<br />

8<br />

8<br />

2.3. SEQUENTIAL MULTIPLICATION SCHEMES 15<br />

Fig. 2.4a illustrates the schoolbook multiplication <strong>for</strong> 6°ñ<br />

. The gray boxes represent the results of<br />

the degree+ respective polynomial multiplications, which are denoted next to the boxes. The horizontal<br />

position of a box indicates its ò 5 0ó6 59ô offset . The ordering of the partial products by<br />

¯<br />

decreasing<br />

allows <strong>for</strong> the accumulation of the final result in a shift register and the application of an interleaved<br />

reduction scheme as detailed in Sec. 3.4.6.<br />

5 ±<br />

with ò<br />

2.3.2 Polynomial Karatsuba Multiplication<br />

In 1963 A. Karatsuba and Y. Ofman developed an algorithm of complexity õ7K(Úöø÷Cù 8 that computes the<br />

product of two( -bit integers [17].<br />

Like the Schoolbook Multiplication, this algorithm divides the operands into two equal parts. Adopting<br />

the arithmetical operations tohÁ5. leads to<br />

¢ £h¤ ¥<br />

¢ £¤ ¥<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

úû 5 ¾ÆÁ–v Ya¾ v ®Sh| Ya¾ | ®S<br />

úBý ÂÈ5 î 3 ¾ v ® | ®<br />

úû<br />

úBý<br />

6 l Y 5 ¾²Ál 3 ü l Y ü l<br />

8 ÂÈ5 ¦î 3 ¾ l<br />

8<br />

(2.11)<br />

6 l Y 5 ¾²Ál Y ¾ l 3 ¾ l<br />

8<br />

given by<br />

8 ÂÈ5 ¦î 3 ¾ l<br />

withl<br />

YZ@ l 3 andl<br />

v ® | ®k\ 0ó6<br />

Thus, the final product can be computed by 3 multiplications and 2 additions degree(b†¦ of polynomials<br />

and 4 additions degree( of polynomials as illustrated in Fig. 2.5. Again, since the addition can be computed<br />

combinationally in the same cycle as a partial multiplication, the complete multiplication takes 3 cycles<br />

only. By splitting the into<br />

6 ±<br />

factors segments any¯ EN©<br />

(<strong>for</strong> ), the product can be withþ=öø÷Cù ¦<br />

computed<br />

multiplications by a recursive application of this scheme.<br />

Due to the need to store intermediate results and to maintain a stack, recursive algorithms are not appropriate<br />

<strong>for</strong> hardware implementations. The recursion has thus to be unrolled. Fig. 2.4b shows the resulting<br />

degree+<br />

multiplication scheme <strong>for</strong> an unrolled recursion of the Karatsuba with<br />

6Šñ<br />

multiplication . Each pattern<br />

in Fig. 2.4b, which is additionally surrounded by a gray box, can be composed from one partial multiplication.<br />

The labels at the right side of the boxes determine the indices of the segments, whose sums have been<br />

multiplied. E.g., the label "13" denotes the termv<br />

Ya¾ v<br />

8 .<br />

6 v Y | ®Ú¾ v ® | YD¾ l Ya¾ l<br />

8 b=| Yb¾ |


6 v |!6 v 3 5 3 î 8 ¾ v Yg5 î 8 ¾ v ®S#P| 3 5 3 ¦î 8 ¾ | YC5 ¦î 8 ¾ | ®Z<br />

}<br />

v 3 | 3 5 $ î 8 ¾Æv 3 | YD¾ v Y | 3 º5 ¾²v 3 | ®:¾ v Y | Yb¾ v ® | 3 º5 3 î 8<br />

6<br />

Y | ®:¾ v ® | Y[º5 î 8 ¾Æv ® | ®Z ¾7v<br />

6 v 3 | 3<br />

ú<br />

ü v Y | Y<br />

2.3. SEQUENTIAL MULTIPLICATION SCHEMES 16<br />

m/2<br />

m/2−1 1 m/2−1 m/2<br />

T 1<br />

T 1<br />

T 2<br />

T 3<br />

m/2<br />

A=A 1x + A 0<br />

m/2<br />

B=B 1x + B0<br />

T 1=A 1B1<br />

T =(A +A )(B +B )<br />

2 1 0 1 0<br />

T T =A B<br />

3 3 0 0<br />

. A B<br />

2m−1<br />

Figure 2.5: Polynomial Karatsuba multiplication scheme<br />

2.3.3 Multi-Segment Karatsuba Multiplication<br />

The basic Karatsuba multiplication <strong>for</strong> polynomials in-hÁ5. is <strong>based</strong> on the idea of divide and conquer,<br />

since the operands are divided into two segments each.<br />

One may attempt to generalize this idea by subdividing the operands into more than two segments. [18]<br />

reports on such an implementation with a fixed number of three segments denoted as Karatsuba-variant<br />

multiplication.<br />

The proof <strong>for</strong> that multiplication scheme follows directly out of the classical Karatsuba algorithm by<br />

dividing the operands into three parts:<br />

¢ £¤ ¥<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

ü v 3 | 3<br />

úû<br />

úû 5 $ î 8 ¾ÆÁ–v 3 ¾ v Y[h| 3 ¾ | YX<br />

úBý ÂÈ5 <br />

ü l 3 ü l<br />

(2.12)<br />

3 ¾ v Ya¾ v ®Sh| 3 ¾ | Ya¾ | ®Z ¢ £¤ ¥ ¾7Á–v<br />

8 ü l ç ü l<br />

8 ÂÈ5 3 î 8<br />

úBÿ<br />

Ya¾ v ®Zh| Y#¾ | ®Z ¢ £h¤ ¥ ¾7Á–v<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

¢ £h¤ ¥<br />

Comparing Eqn. 2.11 and Eqn. 2.12 one might assume that a generalized Karatsuba scheme is possible by<br />

following the same scheme again. This assumption has been verified <strong>for</strong>ñ<br />

to ¤ segments manually. This<br />

had lead to a generalized scheme that will be called Multi-Segment Karatsuba (MSK). Disregarding some<br />

slight arithmetic variations, the Karatsuba-variant multiplication is a special case of the MSK approach. The<br />

MSK multiplication scheme, which is proposed in this work, is more general because an arbitrary number<br />

of segments is supported.<br />

Two polynomials of degree ( over -hÁ5ÃÂ are multiplied by a -segment Karatsuba multiplication<br />

ú ¡<br />

ü v Y | Y<br />

ü v ® | ®<br />

úBý<br />

ÂÈ5 î 8:¾ v ® | ®<br />

ú£¢ ú£¢


Ô<br />

Ô<br />

Ô<br />

2.3. SEQUENTIAL MULTIPLICATION SCHEMES 17<br />

#" ¦ ) 3 in the following way: It is assumed that ( ͉Ï4Ð 6°I<br />

; if not, the polynomials are padded<br />

with the necessary number of zero coefficients. A polynomial<br />

v E hÁ5. is divided into segments<br />

(!<br />

­ Y ±ÈÇÉ® v ±ä`ò 5 ±<br />

, with ò 5 0ó6 5 ¦î ¦<br />

. With Eqn. 2.13} 6 v |]6 ! #" ¦ v @ | holds <strong>for</strong> any<br />

¦<br />

such thatv 6¦¥<br />

degree( polynomialsv<br />

@ |ËE -hÁ5. :<br />

5 ±­ Yª ¦ <br />

(2.13)<br />

whereas<br />

& œ" ¦ v @ | <br />

6¨§<br />

¦<br />

5 ±­ Y<br />

¾<br />

§<br />

¦ ­ Y<br />

±ÈÇbY ±©®v @ | #ò<br />

±ÈÇbY<br />

¦ ­ ±©± v @ | aò<br />

© v @ | 6 § ô ­ Y Ô<br />

±ÈÇbY<br />

±© v @ | <br />

<br />

ô<br />

§ ô ­ Y Ô<br />

±–ÇbY<br />

±© ª ô ­ ± v @ | <br />

<br />

¾? ô © v @ | 9@ (2.14)<br />

¾<br />

| ±<br />

<br />

\ ±ÈÇ ±ÈÇ<br />

The annex of this paper presents an example application of the <strong>for</strong>& #" 8<br />

above equations .<br />

According to Eqn. 2.13 product} 6 v | 6 ! #" ¦ v @ | the entire is composed of the partial sums<br />

v @ | . Each partial sum consists of partial products ô © v @ | according to Eqn. 2.14. The total<br />

©<br />

,. number of (b†¦ required -bit multiplications in order to per<strong>for</strong>m ( one -bit multiplication using<br />

ô<br />

#" ¦<br />

the<br />

scheme results from !<br />

Y© v @ | 6 Y© v @ | and ô © v @ | 6 §<br />

ª ô ­ Y<br />

ô ­ Y Ô ª<br />

v ±<br />

<br />

<br />

§<br />

6 Ì ¯ ¦ 6 M)²cS#B<br />

\ (2.15)<br />


2.3. SEQUENTIAL MULTIPLICATION SCHEMES 18<br />

to the rectangles determine the indices of the segments, whose sums have been multiplied. E.g., the label<br />

”123” represents termv<br />

YP¾ v 3 ¾ v<br />

8 =A| Y4¾ | 3 ¾ |<br />

8 the , which denoted<br />

8 ©øYSv @ | is in Eqn. 2.14. The<br />

horizontal position of a rectangle represents exponent¯<br />

the of the associated ò 5 ±<br />

factor . E.g., the rectangle<br />

in the lower left edge labeled ”3” together with its position denotes the v<br />

8 ¾ |<br />

8 Pò 5 term . The<br />

} 6 v |<br />

result<br />

is computed by summing up (XORing) all the terms according to their horizontal position. This<br />

final is product segments wide, as one would expect. The partial products can be reordered as shown in<br />

Fig. 2.4d. This order was achieved from a consideration of three optimization criteria.<br />

First, most partial products are added two times to compute the final result. They can be grouped together<br />

and placed in one of three patterns, which are indicated in Fig. 2.4d. This is true <strong>for</strong> all instances of the<br />

MSK algorithm (again this has been evaluated semi-manually by a C program <strong>for</strong> Õ c II<br />

any ). In the<br />

architecture detailed in Sec. 3, these patterns are computed by some additional combinational logic, which<br />

is connected to the output of the combinational multiplier.<br />

Second, the resulting patterns are ordered descending¯<br />

by of their ò factor<br />

±<br />

. In this way, the product can<br />

be accumulated easily in a shift register.<br />

5<br />

As the third optimization criterion the remaining degree of freedom is taken advantage of in the following<br />

way: The patterns are once more reordered, such that when iterating over them from top to bottom, one of<br />

two conditions holds: Either the current pattern is constructed from a single segment (e.g.v<br />

®j | ® product ,<br />

but v ®j¾ v YXœ| ®j¾ | Y not ) or the set of indices of the pattern segments differs only at one index from<br />

its predecessor (as in the productsv<br />

® | ® andv<br />

®¾ v Y[#=| ®:¾ | YX partial ). Since this criterion can not<br />

always be met <strong>for</strong> all segments some accumulation steps take one additional cycle. However it can be shown<br />

that it is always possible to reorder the segments in a way that either the sum of up to two single segments or<br />

at most two additional segments need to be accumulated. A fact that already has been proven and has been<br />

evaluated <strong>for</strong> interesting all , too.<br />

By applying the third optimization criterion to the pattern sequence, the partial product computations<br />

can be per<strong>for</strong>med as follows: By + placing -bit accumulator registers at the inputs of the combinational<br />

multiplier, from which each can add up one segment to the current value or load one new segment in a<br />

single clock cycle, terms<br />

ô © v @ | the can be computed iteratively in a pipelined fashion (see Fig. 3.2).<br />

This results in a two stage pipelined design <strong>for</strong> the complete datapath and yields a total cZ of clock cycles<br />

to per<strong>for</strong>m one multiplication the! #"'$ using .<br />

The MSK scheme has a slight per<strong>for</strong>mance disadvantage in terms of + required -bit multiplications in<br />

comparison to the classical Karatsuba algorithm (11% ! #"¨$ <strong>for</strong> and 33% ! #" <strong>for</strong> ), but there are<br />

considerable benefits:<br />

First, the number segments of that the polynomials are divided into is not limited to be a power of two,<br />

but can be any natural number when the MSK scheme is applied. With respect to a HW implementation<br />

this provides more flexibility concerning the selection of system parameters. Like stated be<strong>for</strong>e, segment<br />

counts in range<br />

E JBþ4@\–\–@PR the provide the best results; a fact that can be uniquely exploited by the MSK<br />

approach.<br />

Second, each time an additional level of recursion unrolling is applied to the classical Karatsuba algorithm,<br />

two new patterns occur in the multiplication scheme, whose size is growing exponentially by a factor of 2<br />

(compare Fig. 2.5 to Fig. 2.4b.) In contrast, <strong>for</strong> any of value the number of different patterns will<br />

exceedþ<br />

never<br />

in case of the MSK scheme. This fact allows the efficient multiplication of polynomials of different<br />

degrees on the same datapath: If, e.g., the underlying supports datapath any& œ" segments, scheme<br />

x Õ <br />

<strong>for</strong><br />

can be per<strong>for</strong>med just by modification of the controller which is running the MSK algorithm.


Chapter 3<br />

<strong>Hardware</strong> Architecture<br />

The architecture of the EC coprocessor (depicted in Fig. 3.1) mainly consists of four modules denoted as<br />

part (a) to (d). It has been implemented on several <strong>FPGA</strong> plat<strong>for</strong>ms (see Chap. 4 <strong>for</strong> implementation results)<br />

which allowed fast and easy practical evaluation of the design.<br />

Since the implementation of the modules (a) to (c) is basically straight <strong>for</strong>ward and not expensive in<br />

terms of logic resources Sec. 3.1 to Sec. 3.3 give a brief overview on these components. In Sec. 3.4 the<br />

more complex and resource intensive finite field arithmetic is described in detail. Finally the software parts<br />

(VHDL generator and evaluation software) are illustrated in Sec. 3.5 and Sec. 3.6.<br />

a)<br />

b)<br />

Address<br />

16 * n bit<br />

Register File<br />

DataFromPci<br />

DataToPci<br />

PCI<br />

Interface<br />

c)<br />

d)<br />

Interrupt<br />

EC Arithmetic<br />

FF Arithmetic<br />

Figure 3.1: Generic Datapath of the EC coprocessor<br />

3.1 PCI Interface<br />

The interface component denoted as part (a) in Fig. 3.1 provides the external 32-bit wide PCI interface and an<br />

internal( -bit wide interface to the Register File. Data <strong>for</strong>mats are converted across the interface by the use<br />

of appropriate shift registers. Most of this module could be reused from the previous ONB implementation.


3.2. REGISTER FILE 20<br />

3.2 Register File<br />

Part (b) of Fig. 3.1 covers the Register File. It provides c registers of( -bit width. These registers are<br />

implemented by using <strong>FPGA</strong> internal Lookup Tables (LUT) and provide a dual ported interface allowing<br />

concurrent read and write access to the data. Currently ¤ registers are used as internal temporary registers<br />

<strong>for</strong> the EC algorithms and can there<strong>for</strong>e not be accessed by the user. The remaining ¤ register are left <strong>for</strong><br />

field parameters, input operands and results. By modifying the EC level algorithms the requirements <strong>for</strong><br />

internal temporary registers might change.<br />

3.3 EC Arithmetic<br />

The EC Arithmetic, highlighted as part (c) in Fig. 3.1, implements the algorithm and the underlying<br />

EC operations EC-Add and EC-Double. It also implements the FF-Inversion that is intrinsically not an<br />

EC level operation. But since it is realized by the control flow oriented algorithm and composed of basic<br />

arithmetic operations in just like the EC level operations, FF-Inversion has been implemented here<br />

(see Alg. 4).<br />

Beside of the controller the module contains a single shift register of( -bit size. This register is needed to<br />

scan bit-wise as required by Eqn. 3.<br />

The controller is implemented by several Finite State Machines (FSM) in a hierarchical order. Each FSM<br />

controls one arithmetical operation, i.e., the algorithm, the EC-Add and the EC-Double operation.<br />

The hierarchical ordering of FSMs is considered to be the best tradeoff between speed and flexibility.<br />

Leaving the controller logic inside the software would provide more flexibility <strong>for</strong> changing EC level algorithms<br />

at the cost that pipelining delays and parallel operation of different modules becomes much more<br />

complicated (or might even be impossible). Furthermore the overhead and the delay <strong>for</strong> communication over<br />

the PCI interface makes this solution impracticable. The other way would be a single controller to prevent<br />

possible delays emerging from communication between the different FSMs. The great disadvantage of this<br />

approach is, that due to the unmanageable side-effects inside such an FSM, modifications of one algorithm<br />

(e.g. the EC algorithm) would lead to a rewrite of the whole controller and not just the EC controller.<br />

3.4 Finite Field Arithmetic

The finite field arithmetic that is denoted as part (d) in Fig. 3.1 is the most expensive part of the EC coprocessor. Due to its complexity it is split into separate modules that provide the functionality of the operations described in Sec. 2.2. A general overview of the datapath is given in Fig. 3.2. The particular parts depicted in this picture are described subsequently and the corresponding gate counts are summarized in Tab. 3.2.

3.4.1 Addition

According to Eqn. 2.4 the addition in GF(2^n) is just an n-bit wide XOR with no need for a subsequent reduction step. A complete addition can be performed in one cycle. So part (a) of Fig. 3.2 takes n XOR gates of logic resources for its implementation. Please note that the input registers are part of the Register File and the result register is shared by all finite field operations, which is why its flip-flops are counted only once in part (f).
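A software model of this adder is a single pass of word-wise XOR operations, as sketched below; the word-array representation of the n-bit operands is an assumption made for illustration.

#include <stdint.h>
#include <stddef.h>

/* Addition in GF(2^n): a plain bitwise XOR of the two operands, with no
 * carries and no reduction step (one XOR gate per bit in hardware).
 */
static void gf2n_add(uint32_t *c, const uint32_t *a, const uint32_t *b,
                     size_t words)
{
    for (size_t i = 0; i < words; i++)
        c[i] = a[i] ^ b[i];
}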



Figure 3.2: Generic Datapath of the Finite Field Arithmetic ((a) n-bit adder, (b) square & reduce unit, (c) input stage with (k+1):1 and 2:1 multiplexers feeding two m-bit registers, (d) m-bit CKM with 2m-1 bit wide output, (e) MSK pattern generation with shift-left logic and 3:1 multiplexer, (f) reduce unit with output multiplexers and n-bit result register)

3.4.2 Square

Following from Eqn. 2.6, squaring in GF(2^n) is done by inserting a constant "0" at every second bit position. In hardware this can be done at nearly no cost by appropriate wiring of the signals. The subsequent reduction can be done according to Sec. 2.2.8. However, the 2n-2 binary XOR gates that are mentioned there are only an upper bound for the resource requirement of the square module. Since half of the bits in the unreduced intermediate result are constant "0", some XOR gates are dispensable. Because of that, the exact number of XOR gates is remarkably smaller than 2n-2 and depends on the particular prime polynomial. In the following, 2n-2 binary XOR gates will be used as a suitable approximation for the resource usage of the square module. As in the case of addition, the square module that is depicted as part (b) of Fig. 3.2 takes only a single clock cycle.
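The bit spreading performed by the squaring unit can be modelled in C as follows; the sketch works on a 16-bit operand for brevity and omits the subsequent reduction.

#include <stdint.h>

/* Squaring a polynomial over GF(2): coefficient i of the operand moves to
 * position 2*i of the result, i.e. a constant '0' is inserted after every bit.
 */
static uint32_t gf2_square_16(uint16_t a)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        r |= (uint32_t)((a >> i) & 1u) << (2 * i);   /* bit i -> bit 2i */
    return r;
}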



3.4.3 Input Stage

The input stage provides two accumulation registers, i.e., one for each operand of the combinational multiplier. Besides parallel load functionality these registers must be capable of adding one new segment to their current value, as stated in Sec. 2.3.3. According to part (c) of Fig. 3.2, the input stage for an MSK_k based on an m-bit combinational multiplier consists of 2m flip-flops for the two m-bit accumulation registers, 2m XOR2 gates for the segment accumulation and the (k+1):1 and 2:1 multiplexers for operand selection. Please note that multiplexer components with constant zero inputs have been optimized to AND2 gates.
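The behaviour of one of these accumulation registers can be summarized by the following C fragment, a minimal sketch in which the names and the register width of at most 32 bit are illustrative assumptions.

#include <stdint.h>

/* One CKM input register of part (c): it is either loaded with a new segment
 * in parallel or accumulates (XORs) a further segment onto its current value.
 */
static uint32_t input_reg;    /* m <= 32 assumed for this sketch */

static void input_reg_step(uint32_t segment, int load)
{
    input_reg = load ? segment : (input_reg ^ segment);
}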

3.4.4 Combinational Multiplier (CKM)

As stated before in Sec. 2.2.1 and shown in Fig. 3.3a, the product of two one-bit polynomials is computed by a single AND operation.

Figure 3.3: Recursive construction process for polynomial Karatsuba multipliers ((a) 1-bit CKM consisting of a single AND gate, (b) 2-bit CKM built from three 1-bit CKMs, (c) 4-bit CKM built from three 2-bit CKMs)

Using Karatsuba's divide and conquer multiplication algorithm, a multiplication of two n-bit polynomials can be computed with three n/2-bit multiplications and some additions (which are XORs in our case) to determine interim results and accumulate the final result. This leads immediately to a recursive construction process, which builds combinational Karatsuba multipliers (CKM) of width m = 2^i for arbitrary i.
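This recursive construction can be mirrored by a small software model of the CKM, shown below for operands of at most 32 bit held in machine words; the function name and the word-based representation are assumptions made for illustration.

#include <stdint.h>

/* Recursive software model of an m-bit CKM (m a power of two, m <= 32):
 * one m-bit multiplication over GF(2) is computed from three m/2-bit
 * multiplications, all additions being XORs.
 */
static uint64_t ckm(uint32_t a, uint32_t b, int m)
{
    if (m == 1)
        return (uint64_t)(a & b & 1u);               /* 1-bit CKM: one AND */

    int h = m / 2;
    uint32_t mask = (1u << h) - 1u;                  /* h <= 16 here */
    uint32_t al = a & mask, ah = (a >> h) & mask;    /* split the operands */
    uint32_t bl = b & mask, bh = (b >> h) & mask;

    uint64_t lo  = ckm(al, bl, h);                   /* a_L * b_L */
    uint64_t hi  = ckm(ah, bh, h);                   /* a_H * b_H */
    uint64_t mid = ckm(al ^ ah, bl ^ bh, h);         /* (a_L+a_H)*(b_L+b_H) */

    /* c = hi*x^m + (lo + hi + mid)*x^(m/2) + lo; every '+' is an XOR */
    return (hi << m) ^ ((lo ^ hi ^ mid) << h) ^ lo;
}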



Thus, we can calculate the number of gates of an m-bit CKM with the following recurrences:

AND2(m) = 1 if m = 1, and AND2(m) = 3*AND2(m/2) if m > 1
XOR2(m) = 0 if m = 1, and XOR2(m) = 3*XOR2(m/2) + m if m > 1
XOR3(m) = 0 if m = 1, and XOR3(m) = 3*XOR3(m/2) + m - 2 if m > 1
XOR4(m) = 0 if m = 1, and XOR4(m) = 3*XOR4(m/2) + 1 if m > 1

With the master method [19] it can easily be shown that all of these recurrences belong to the complexity class O(m^(log2 3)). The number of AND2 gates is exactly m^(log2 3). By substituting a 3-input XOR by two XOR2 gates and a 4-input XOR by three XOR2 gates, an upper bound on the required XOR2 count is given by 6*m^(log2 3). Some gate counts for multipliers of various operand bit widths are summarized in Tab. 3.1 and illustrated in Fig. 3.4.

Table 3.1: Gate counts of combinational Karatsuba multipliers

Bit Width    1    2    4    8    16    32     64
AND2         1    3    9   27    81   243    729
XOR2         0    2   10   38   130   422   1330
XOR3         0    0    2   12    50   180    602
XOR4         0    1    4   13    40   121    364
SUM          1    6   25   90   301   966   3025

Figure 3.4: Combinational Karatsuba Multiplier gate count (gate count over operand bit width, broken down by gate type: AND2, XOR2, XOR3, XOR4 and SUM)

3.4.5 MSK Pattern Generation

As detailed in Sec. 2.3.3 there are three different MSK patterns which are built based on the output of the CKM. A simplified illustration of the corresponding architecture is shown in part (e) of Fig. 3.2. In practice, the pattern creation is implemented by various multiplexers. Moreover, in case of the MSK3 scheme, the patterns will exceed the bit width n of the affiliated 3:1 multiplexer and therefore have to be reduced before they can be added to the intermediate result. This leads to approximately 2m additional XOR2 gates for an MSK3 design. In general, an optimized multiplexer structure leads to a resource requirement of a few multiplexer, AND2 and XOR2 cells per bit of the CKM output (cf. Tab. 3.2).

3.4.6 Interleaved Polynomial Reduction

A first naive design approach may perform the polynomial reduction after the calculation of the complete multiplication. According to Eqn. 2.9, such an architecture would require 2n-2 XOR2 gates. Furthermore, this would lead to datapaths and multiplexers of size 2n-1. To keep the datapaths at a maximum size of n, a method of interleaved reduction has been developed.

When utilizing the MSK multiplication scheme based on an m-bit CKM, the maximum degree of each intermediate result is n+m-1 with m < n. Therefore, only m bit values have to be reduced in each iteration. Regarding this fact, Eqn. 2.9 reads as

c = sum_{i=0}^{n+m-1} c_i * x^i  =  sum_{i=0}^{n-1} c_i * x^i  +  sum_{i=n}^{n+m-1} c_i * (x^(i-n) + x^(i-n+t))   mod P(x)     (3.1)

where P(x) = x^n + x^t + 1 is the prime trinomial. This results in a total of only 2m XOR2 gates for the polynomial reduction. The particular terms (1...3) of Eqn. 3.1 are structured according to Fig. 3.5 in order to calculate the reduction. Within the MSK scheme, this kind of interleaved reduction of degree n+m-1 polynomials is performed each time the intermediate result is shifted left by m bit.
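A bit-level C model of this interleaved reduction is given below for a prime trinomial x^n + x^t + 1; to keep the sketch short, the unreduced value is assumed to fit into a single 64-bit word.

#include <stdint.h>

/* Fold the m bits above position n-1 back into the n-bit result, as in
 * Eqn. 3.1: a set bit at position i >= n is replaced by bits at positions
 * i-n and i-n+t (requires n + m <= 64 for this word-sized model).
 */
static uint64_t reduce_interleaved(uint64_t c, int n, int t, int m)
{
    for (int i = n + m - 1; i >= n; i--) {
        if ((c >> i) & 1ull) {
            c ^= 1ull << (i - n);        /* fold-back term x^(i-n)   */
            c ^= 1ull << (i - n + t);    /* fold-back term x^(i-n+t) */
            c ^= 1ull << i;              /* clear the reduced bit    */
        }
    }
    return c;
}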

Figure 3.5: Structure of the polynomial reduction of n+m bit (the three terms of Eqn. 3.1, i.e. the low part sum_{i=0}^{n-1} c_i x^i and the two fold-back sums sum_{i=0}^{m-1} c_{n+i} x^i and sum_{i=0}^{m-1} c_{n+i} x^(i+t), are accumulated into the n-bit Result Register)

The gatecount calculation for part (f) of Fig. 3.2 is as follows: As already mentioned, the reduction of degree n+m-1 polynomials takes 2m XOR2 gates. Additionally, n XOR2 gates are required to perform the accumulation of the actual MSK pattern onto the current result value. Finally, an n-bit wide multiplexer stage is needed to choose whether the result of an addition, square or multiply operation is stored in the result register. Summing up, part (f) therefore requires n flip-flops for the result register, roughly n + 2m XOR2 gates and the n-bit wide output multiplexers.


Table 3.2: Datapath gatecount (summary of the multiplexer, AND2, XOR2 and flip-flop counts for parts (a) to (f) of the datapath in Fig. 3.2, as derived in the preceding subsections)

3.5 VHDL-Code Generator

In this section the VHDL-Code Generator is presented. Though VHDL offers several possibilities to generalize code statements, these possibilities are not sufficient to implement a design as scalable as required in this application. E.g., it would not be possible to generalize the reduction logic due to the variable prime polynomials. To provide the wanted scalability, a generator based approach proposed in [10] has been adopted, where the VHDL code is generated by a C program. The generator consists of one top-level file calculating some internal variables out of the user input parameters, opening the files into which the VHDL code is written and calling the appropriate subroutines. Each subroutine is implemented in a separate file and generates the VHDL description (entity and architecture) for one particular hardware module.
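The described organization can be sketched in C as follows. This is an illustrative sketch only, not the original generator source: all function names, the output file name and the exact command-line layout are assumptions.

#include <stdio.h>
#include <stdlib.h>

static void gen_ckm(FILE *f, int m)            /* subroutine for the m-bit CKM */
{
    fprintf(f, "-- combinational Karatsuba multiplier, %d bit\n", m);
    /* ... entity and architecture of the CKM are written here ... */
}

static void gen_reduce(FILE *f, int n, int t)  /* reduction for x^n + x^t + 1 */
{
    fprintf(f, "-- interleaved reduction for n=%d, t=%d\n", n, t);
    /* ... entity and architecture of the reduction logic ... */
}

int main(int argc, char **argv)
{
    if (argc != 6) {
        fprintf(stderr,
                "usage: %s <keysize> <ckmsize> <midbit> <segments> <cardtype>\n",
                argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);                     /* key size (field degree)       */
    int m = atoi(argv[2]);                     /* CKM size                      */
    int t = atoi(argv[3]);                     /* middle bit of the trinomial   */
    int k = atoi(argv[4]);                     /* number of MSK segments        */
                                               /* argv[5]: card type (not used) */
    if (n >= m * k) {                          /* constraint named below        */
        fprintf(stderr, "key size must be smaller than CKM size * segments\n");
        return 1;
    }

    FILE *f = fopen("ff_arith.vhd", "w");      /* one of the generated files    */
    if (f == NULL)
        return 1;
    gen_ckm(f, m);
    gen_reduce(f, n, t);
    fclose(f);
    return 0;
}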

The generator takes several parameters that are implemented as command-line parameters. Currently these parameters are:

- Key Size [Bits]: The Key Size specifies the overall bit width of the coprocessor. A given constraint is that this number must be smaller than the CKM size multiplied by the number of segments.

- Combinational Karatsuba Multiplier (CKM) size [Bits]: The CKM size specifies the width of the combinational part of the multiplier. Since the size of the CKM grows with m^(log2 3) in its bit width, this value essentially determines the resource consumption of the complete ECC coprocessor.

- Bit Position of the middle bit in the prime trinomial [Int]: The bit position of the middle bit in the prime trinomial is needed for the reduction component. The position of the highest bit is already specified by the key size and the lowest bit is fixed to x^0. The current generator is only capable of handling trinomials. Supporting pentanomials would be possible by slight modifications of the generator.

- Number of Segments [Int]: The type of MSK is specified by the number of segments. Currently the generator is limited to three to seven segments, which are regarded as the most interesting ones. Additional support of new MSK sizes can be added by adding a new subroutine to the 'generate_ff_controller' routine in the generator program. Outside the controller the number of segments does not have any influence on the design.

- Card Type [1|2]: The card type specifies which hardware dependent top level module should be instantiated. '1' is used for the microEnable platform board, '2' is used for the ADM-XRC-II platform. For details on these platforms see Chap. 4. This option has no effect on the functional part of the design.

3.6 Evaluation Software

To evaluate the presented hardware design it is necessary to provide software that can communicate with the FPGA on the one hand, and that can compare the hardware results with software results that are considered correct on the other hand. To provide this functionality the existing evaluation software (ECCLib) has been enhanced. The previous software provided EC level arithmetic, finite field level arithmetic for ONB representation and an interface to the microEnable PCI card.

As part of this work a second finite field level arithmetic for polynomial representation has been implemented based on [20]. Besides some simple modifications the EC level algorithms could be reused, and a new runtime switch has been added to the software to determine which representation should be used. Second, an interface to the new Alpha Data ADM-XRC-II FPGA platform has been added, which is also selected by a runtime option in the software. This second hardware platform allows the evaluation of a greater range of parameter sets and therefore illustrates the flexibility of the presented design.

The hardware has successfully been tested by performing operations with randomly generated parameters. Currently the implementation uses memory mapped I/O for communication between software and FPGA board, which seems to allow a sufficient data transfer rate. The termination of a computation is signaled to the software via interrupts.



Chapter 4

Implementation Results

Various instances of the presented architecture have been implemented and evaluated within several FPGA devices. These widely used devices are typical representatives of different complexity classes, ranging from a low-cost device that might be interesting for client-side applications up to a high-end chip that is of special interest for high-performance server applications.

In the following, the different platforms are briefly introduced and in Sec. 4.4 the implementation results are summarized.

4.1 Xilinx XC4085XLA

One of the outlined implementations is based on the microEnable PCI card (illustrated in Fig. 4.1) from Silicon Software GmbH [21]. This card is equipped with an FPGA from Xilinx, Inc. [8], in which the coprocessor's functionality is implemented. The card is available with FPGAs of different complexities. In our case the XC4085XLA, a medium-sized FPGA with a complexity of max. 180K system gates, is used. Furthermore the card comes with a programmable clock generator, static RAM, and external interfaces. The integration into a target system is accomplished via the PCI interface.

Figure 4.1: microEnable PCI card (FPGA, several RAM banks and the PCI interface)



The XC4085XLA FPGA allows the implementation of a CKM with a maximum width of 64 bit within the datapath. The exact maximum varies with other generator parameters such as the number of segments.

4.2 Xilinx XCV405E

The XCV405E FPGA device from Xilinx, Inc. is a high-end FPGA providing 400K system gates. This device allows the implementation of a CKM with up to 85 bits. Due to the newer technology and powerful routing resources, higher frequencies can be achieved compared to the XC4085XLA device. Similar to the XC4085XLA, the XCV405E is applied on a PCI interface card. This ADM-XRC-II platform from Alpha Data, Inc. [22] provides a 64-bit wide PCI interface and 6 MBytes of SRAM. It supports up to two different Xilinx VirtexE or Virtex2 FPGA devices on one PCI interface card.

4.3 Atmel AT94K40

The AT94K40 from Atmel, Inc. [9] is a System-on-Chip device. As illustrated in Fig. 4.2 it provides an 8 bit AVR microcontroller core, FPGA resources, some peripherals and up to 36K Byte of SRAM on a single chip. With only 40K system gates the FPGA resources are limited and allow the implementation of a maximum width of 25 bit for the CKM. Compared to the Xilinx based implementations, which use the PCI bus for the communication with the software running on the host system, the AT94K40 provides a low latency interface between the FPGA hardware part and the microcontroller core that is capable of 8 bit transfers in each clock cycle.

Figure 4.2: Atmel AT94K40

Due to these features it is reasonable to run the EC level algorithms in software on the microcontroller core. By this, the replacement of the EC algorithms proposed in Chap. 2 with the better performing 2P algorithm documented in [23] could be realized easily.
documented in [23] could realized easily.



The 2P algorithm, which is furthermore considered to be more resistant against Side-Channel-Attacks, consists of three different EC level algorithms and a modified k*P algorithm. Looking at the EC level algorithms, Mdouble performs a variation of the classical EC-Double operation. The EC-Add operation is exchanged by the Madd operation. The Mxy operation is used to transfer the result from the internal projective coordinate representation back to the affine representation. All these algorithms are given in the Annex of this work.

The entire 2P algorithm performs the following number of arithmetic operations in GF(2^n):

#MULT = 6 * floor(log2(k)) + 10,  #SQR = 5 * floor(log2(k)) + 3,  #ADD = 3 * floor(log2(k)) + 7,  #INV = 1

4.4 Comparison

Tab. 4.1 gives a summary of some implementations that are considered to have the most interesting parameter sets for the proposed hardware platforms. Since the VHDL generator currently only supports the Double-And-Add algorithm for a genuine hardware implementation, the values for the 2P algorithm on the Xilinx platforms are estimated. Please note that, due to the application of two pipeline stages within the datapath, the number of clock cycles for a complete FF-Mult operation is increased by two compared to Eqn. 2.15. Because of resource problems the AT94K40 implementation has only one pipeline stage applied, which results in one additional clock cycle. In the case of the Double-And-Add algorithm the hamming weight of k is approximated with n/2. A comparison of the AT94K40 implementation to other state-of-the-art implementations has been recently published in [24].

Table 4.1: Implementation results

Target platform              Atmel          Xilinx          Xilinx          Xilinx
                             AT94K40        XC4085XLA       XC4085XLA       XCV405E
finite field degree n        113            191             239             409
CKM size m [bit]             23             64              60              82
MSK segments                 5              3               4               5
clock cycles per FF-Mult     17             8               12              18
Device utilization           96 %           76 %            81 %            97 %
Operating frequency          12 MHz         33 MHz          31 MHz          60 MHz
Double-And-Add               n/a            742.9 µs        1.4 ms          1.7 ms
2P algorithm                 1.4 ms         395.9 µs        759.9 µs        926.8 µs

The Xilinx based designs have been synthesized with FPGA Compiler II v3.7 from Synopsys, Inc. The FPGA mapping has been done with ISE v4.2 from Xilinx, Inc. The implementations for the Atmel device have been synthesized using Leonardo v2000.1b from Mentor, Inc. and have been mapped to the chip utilizing Figaro IDS v7.5 from Atmel, Inc.
utilizing Figaro IDS v7.5 from Atmel, Inc.


Chapter 5

Conclusions and Outlook

5.1 Summary

With the MSK, this work presented a new algorithm for multiplication in GF(2^n) that is considered to be more efficient in terms of time and logic resource requirements than any other approach known to the author. A corresponding hardware architecture has been developed and integrated into an elliptic curve coprocessor. The entire EC coprocessor has been implemented on different FPGA devices. The functionality of these implementations has been evaluated by comparing the hardware results with those of a corresponding software solution.

Due to the application of a VHDL generator approach, the presented design is widely scalable by modification of the number of segments and/or the size of the combinational multiplier. By utilizing that generator it is possible to create new design variants with different parameters at minimal effort.

Because a finite field multiplication takes a constant number of clock cycles and processes many bits in parallel, the design is resistant against currently known Side-Channel-Attacks based on the measurement of computing time or power consumption.

5.2 Further Work

Several topics arose during the treatment of this work which could not be finally addressed.

In the present design, the input stage in Fig. 3.2c allows the accumulation of only one single segment to the current value of the CKM input register. This leads to CKM idle cycles if the number of segments becomes greater than 4. Regarding the fact that it is always possible to reorder the segments in a way that adding or loading of at most two segments will be sufficient to do an MSK-based multiplication without idle cycles of the CKM, a modified input stage would be reasonable. The implementation of such a modified input stage would take only a little more hardware resources but would gain a further performance enhancement.

The effects of applying the MSK scheme inside the combinational multiplier should be investigated. Especially on hardware platforms where for some reason a combinational multiplier of a given bit width can be utilized, the modification of the presented CKM component might be interesting.

As detailed in Chap. 4, the 2P algorithm, which is currently only used in the AT94K40 based implementation, would give a significant performance enhancement compared to the currently used Double-And-Add algorithm when implemented entirely in hardware on the Xilinx based platforms.

Currently the generator provides a set of hard-coded FSMs for the most interesting numbers of segments. It should be possible to apply the formula of the MSK scheme in the generator, in order to derive FSM modules for arbitrary segment numbers at generator runtime.

Though the MSK algorithm has been evaluated manually for the interesting range of segments, a general mathematical proof is still of interest. There are people working on the proof and it is probable that the MSK scheme will soon be proven.

Acknowledgment

The author would like to thank Markus Ernst and Michael Jung for their great help and support.


Bibliography

[1] National Institute of Standards and Technology, "Data Encryption Standard," Federal Information Processing Standard (FIPS) Publication 46-2 (supersedes FIPS-46-1), http://www.itl.nist.gov/div897/pubs/fip46-2.htm, December 1993.

[2] National Institute of Standards and Technology, "Specification for the Advanced Encryption Standard (AES)," Federal Information Processing Standard (FIPS) Publication 197, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf, November 26, 2001.

[3] Internet Engineering Task Force, "The TLS Protocol," RFC 2246, http://www.ietf.org/rfc/rfc2246.txt, January 1999.

[4] R. L. Rivest, A. Shamir and L. M. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Communications of the ACM, Feb 1978.

[5] V. Miller, "Use of elliptic curves in cryptography," Advances in Cryptology, Proc. CRYPTO '85, LNCS 218, H. C. Williams, Ed., Springer-Verlag, pp. 417-426, 1986.

[6] N. Koblitz, "Elliptic Curve Cryptosystems," Mathematics of Computation, vol. 48, pp. 203-209, 1987.

[7] A. Lenstra and E. Verheul, "Selecting Cryptographic Key Sizes," Proc. Workshop on Practice and Theory in Public Key Cryptography, Springer-Verlag, ISBN 3540669671, pp. 446-465, 2000.

[8] Xilinx, Inc., "Programmable Logic Data Book," 2001.

[9] Atmel, Inc., "Configurable Logic Data Book," 2001.

[10] M. Ernst, S. Klupsch, O. Hauck and S. A. Huss, "Rapid Prototyping for Hardware Accelerated Elliptic Curve Public-Key Cryptosystems," Proc. 12th IEEE Workshop on Rapid System Prototyping (RSP01), Monterey, CA, June 2001.

[11] A. J. Menezes, "Elliptic Curve Public Key Cryptosystems," Kluwer Academic Publishers, 1993.

[12] J. H. Silverman, "The Arithmetic of Elliptic Curves," Graduate Texts in Mathematics, Springer-Verlag, 1986.

[13] IEEE 1363, "Standard Specifications For Public Key Cryptography," http://grouper.ieee.org/groups/1363/, 2000.

[14] ANSI X9.62, "Public key cryptography for the financial services industry: The Elliptic Curve Digital Signature Algorithm (ECDSA)," (available from the ANSI X9 catalog), 1999.

[15] J. Lopez and R. Dahab, "Improved algorithms for elliptic curve arithmetic in GF(2^n)," Selected Areas in Cryptography (SAC '98), LNCS 1556, Springer-Verlag, pp. 201-212, 1998.

[16] S. Okada, N. Torii, K. Itoh and M. Takenaka, "Implementation of Elliptic Curve Cryptographic Coprocessor over GF(2^m) on an FPGA," Workshop on Cryptographic Hardware and Embedded Systems (CHES 2000), LNCS 1965, C. K. Koc and C. Paar, Eds., Springer-Verlag, pp. 25-40, 2000.

[17] A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata," Sov. Phys.-Dokl. (Engl. transl.), vol. 7, no. 7, pp. 595-596, 1963.

[18] D. V. Bailey and C. Paar, "Efficient Arithmetic in Finite Field Extensions with Application in Elliptic Curve Cryptography," Journal of Cryptology, vol. 14, no. 3, pp. 153-176, 2001.

[19] J. L. Bentley, D. Haken and J. B. Saxe, "A general method for solving divide-and-conquer recurrences," SIGACT News, vol. 12(3), pp. 36-44, 1980.

[20] M. Rosing, "Implementing Elliptic Curve Cryptography," Manning Publications Co., ISBN 1-884777-69-4, Greenwich, 1999.

[21] Silicon Software, "microEnable Users Guide," 1999.

[22] Alpha Data Parallel Systems Ltd., "ADC-PMC-64 User Manual," Ver. 1.1, 2002.

[23] J. Lopez and R. Dahab, "Fast multiplication on elliptic curves over GF(2^m) without precomputation," Workshop on Cryptographic Hardware and Embedded Systems (CHES 99), LNCS 1717, C. K. Koc and C. Paar, Eds., Springer-Verlag, pp. 316-327, 1999.

[24] M. Ernst, M. Jung, F. Madlener, S. Huss and R. Bluemel, "A Reconfigurable System on Chip Implementation for Elliptic Curve Cryptography over GF(2^n)," Workshop on Cryptographic Hardware and Embedded Systems (CHES 2002), Springer-Verlag, 2002.


Annex A: 3-Segment Karatsuba Multiplication

For any polynomials a and b over GF(2), the product c = a * b = MSK3(a, b) using the 3-segment Karatsuba multiplication according to Eqn. 2.13 is given by an expansion in which the partial products S_{i,j}(a, b) of the MSK scheme (cf. Sec. 2.3.3) are XORed together and aligned at the five segment positions x^0, x^1, x^2, x^3 and x^4 (in units of the segment width). For each of these five positions, the expanded formula lists the sum of partial products that contributes to the corresponding result segment.


Annex B: 2P Algorithm

Algorithm 5 2P Algorithm (Montgomery Scalar Multiplication)

Input: an integer k >= 1 and a point P = (x, y) on E.
Output: the x-coordinate of Q = k*P.

1. If k = 0 or x = 0 then output (0, 0) and stop.
2. Set k = (k_{t-1} ... k_1 k_0)_2.
3. Set X1 <- x, Z1 <- 1, X2 <- x^4 + b, Z2 <- x^2.
4. For i from t-2 downto 0 do:
       if k_i = 1 then
           Madd(X1, Z1, X2, Z2), Mdouble(X2, Z2)
       else
           Madd(X2, Z2, X1, Z1), Mdouble(X1, Z1)
5. Return Q = Mxy(X1, Z1, X2, Z2).

Algorithm 6 Mdouble

Input: the field GF(2^n); the field element c = b^(2^(n-1)) (i.e. c^2 = b) derived from the curve parameter b of E; the x-coordinate X/Z in projective representation for a point P.
Output: the x-coordinate X/Z for the point 2P.

X <- X^2
Z <- Z^2
T1 <- Z * c
Z <- Z * X
T1 <- T1^2
X <- X^2
X <- X + T1

Algorithm 7 Madd

Input: the field GF(2^n); the field elements a and b defining a curve E over GF(2^n); the x-coordinate x of the point P; the x-coordinates X1/Z1 and X2/Z2 for the points P1 and P2 on E.
Output: the x-coordinate X1/Z1 for the point P1 + P2.

T1 <- x
X1 <- X1 * Z2
Z1 <- Z1 * X2
T2 <- X1 * Z1
Z1 <- Z1 + X1
Z1 <- Z1^2
X1 <- Z1 * T1
X1 <- X1 + T2
