Ultra-Low-Power Digital Circuit Design - Microelectronic Systems ...

Contents

1 Introduction
  1.1 Low-power circuits for RFID
  1.2 Low-Power Logic Families
    1.2.1 CMOS adapted for low power consumption
    1.2.2 Current-mode logic
  1.3 Report organization
2 Subthreshold Source-Coupled Logic
  2.1 Overview
  2.2 Description of STSCL Circuits
    2.2.1 NMOS Tail Transistor
    2.2.2 NMOS Switching Network
    2.2.3 PMOS Load Devices
    2.2.4 Replica Bias Circuit
  2.3 Use of STSCL for cryptographic hardware
  2.4 Performance analysis
    2.4.1 Gate Delays
    2.4.2 Power consumption
    2.4.3 Effects of Process Variations and Mismatch
    2.4.4 Noise margin analysis
  2.5 Design Flow
    2.5.1 Synthesis
    2.5.2 Placement and Routing
3 Elliptic curve cryptographic processor
  3.1 Introduction
  3.2 Elliptic curve cryptography
    3.2.1 Montgomery algorithm for scalar multiplication
  3.3 Specifications of the cryptographic processor
  3.4 Architecture
    3.4.1 ALU
    3.4.2 Finite State Machine
    3.4.3 Registers
  3.5 Standard CMOS implementation
  3.6 STSCL implementation
    3.6.1 Modifications of the design flow and library
    3.6.2 Library Specifications
4 Results
  4.1 Design and simulation flow
  4.2 Performance comparison
  4.3 Interpretation of results
    4.3.1 Advantages and disadvantages STSCL
5 Outlook
  5.1 Possible improvements in the STSCL library
    5.1.1 Device sizing optimization
    5.1.2 Shallow pipelining
6 Conclusion
Bibliography
Appendix

CHAPTER 1. INTRODUCTION

$$I_{DS} \approx I_0 \cdot e^{\frac{V_{GS} - V_T}{n U_T}}$$

where $V_T$ is the threshold voltage, $n$ the subthreshold slope factor and $U_T$ the thermal voltage.

In the subthreshold region, both the speed and the power dissipation of CMOS circuits depend strongly on the supply voltage. For this reason, subthreshold CMOS designs require very precise control over the supply voltage.

1.2.2 Current-mode logic

Current-mode logic (CML) is a group of logic styles whose operation is based on controlled switching (steering) of a current from one branch of the circuit to another. CML has a long history: emitter-coupled logic (ECL) using bipolar transistors and resistors was patented in 1956. ECL was used for high-performance bipolar circuits because of its low voltage swing and the fact that the transistors do not enter saturation, which leads to very short gate delays.

More recently, the logically equivalent source-coupled logic in CMOS technology has been adopted for its low power supply noise injection as well as its high speed.

Subthreshold source-coupled logic [1, 2, 3] has been proposed as an alternative to CMOS in ultra-low-power applications, due to the precise control it offers over speed and power consumption.

1.3 Report organization

Chapter 2 first gives an overview of STSCL circuits and then continues with an analysis of their performance in the presence of device variations. In Chapter 3, an elliptic curve cryptography processor is introduced, which serves as an example for the top-down design flow using the STSCL library. The results of this implementation are summarized and discussed in Chapter 4. Chapter 5 gives an outlook on possible improvements of the STSCL standard library and design flow. The final Chapter 6 is a short conclusion.

2 Subthreshold Source-Coupled Logic

2.1 Overview

Source-coupled logic, also known under the more general term current-mode logic (CML), is a group of logic families that use transistor differential pairs to switch a constant bias current towards one of two branches representing the two terminals of a differential output signal. In a MOS implementation, an NMOS tail transistor biased with a constant gate voltage acts as a current source that draws a constant current $I_{SS}$ from the supply. Logic operation takes place by steering the tail current to one of the two load devices. This is achieved by a network of differential pairs controlled by the (differential) gate input voltages.

The output signal is created by two load `resistors' that convert the difference in current in their respective branches into a differential output voltage. The value of these resistors is chosen such that $I_{SS}$ creates a voltage drop of $V_{SW} = I_{SS} R_L$ when the tail current passes through them. One of the output terminals will therefore be at a voltage of $V_L = V_{DD} - V_{SW}$, the other at $V_H = V_{DD}$.

Subthreshold source-coupled logic (STSCL, [1, 2, 3]) is a variant of current-mode logic in which all transistors operate in the subthreshold region. Since operation in STSCL is based on switching a subthreshold current between the two branches of the logic gate, the problem of leakage power consumption is virtually nonexistent (see footnote 1).

2.2 Description of STSCL Circuits

Figure 2.1 shows the STSCL logic style proposed in [3]. It uses a replica bias circuit which allows the tail current to be adjusted over a wide range, enabling circuits that can dynamically adapt to available power and required speed. Operation of the circuits is insensitive to variations in supply voltage, as long as the replica bias circuit is able to generate a large enough bias voltage for the desired tail current. As shown in Figure 2.2, supply voltages as low as 0.3 V are possible for a tail current of 100 pA. Gates with higher driving strength require a slightly higher $V_{DD}$ if device sizes are kept the same, because the gate-to-source voltage of the active differential pair transistor is higher.

One particularity of current-mode logic styles is that they draw a constant current, even when no switching takes place. For good energy-per-operation efficiency, it is therefore important to design circuits with a high activity rate.

Footnote 1: As long as the currents in the p-n junction formed by the source and bulk of the PMOS load devices are small compared to the bias current.
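
As a small worked example of the relations in Section 2.1, the following Python sketch derives the load resistance and the two output levels from $V_{SW} = I_{SS} R_L$, $V_L = V_{DD} - V_{SW}$ and $V_H = V_{DD}$. The class name and the operating point (0.4 V supply, 0.2 V swing, 100 pA tail current) are illustrative assumptions chosen from values that appear elsewhere in this report; they are not a characterized design point.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StsclOperatingPoint:
    """Hypothetical helper holding the DC operating point of one STSCL gate."""
    v_dd: float  # supply voltage [V]
    v_sw: float  # output voltage swing [V]
    i_ss: float  # tail (bias) current [A]

    @property
    def load_resistance(self) -> float:
        # V_SW = I_SS * R_L  =>  R_L = V_SW / I_SS
        return self.v_sw / self.i_ss

    @property
    def v_high(self) -> float:
        # The branch carrying no current sits at the supply rail.
        return self.v_dd

    @property
    def v_low(self) -> float:
        # The branch carrying the tail current drops by V_SW.
        return self.v_dd - self.v_sw


# Illustrative values only (assumption, not taken from a library corner).
op = StsclOperatingPoint(v_dd=0.4, v_sw=0.2, i_ss=100e-12)
print(f"R_L = {op.load_resistance:.3e} ohm")
print(f"V_H = {op.v_high:.2f} V, V_L = {op.v_low:.2f} V")
```

For these numbers the load has to present a resistance of about 2 GΩ, which illustrates why a subthreshold PMOS device rather than a passive resistor is used as the load.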

Figure 2.1: Replica bias circuit and STSCL gate (inverter)

Figure 2.2: Transfer function of a NAND gate at various supply voltages ($I_{BIAS}$ = 100 pA, both inputs set to $V_{IN}$)

2.2.1 NMOS Tail Transistor

A single reference current source $I_{BIAS}$ is required to bias all logic gates of a given driving strength. If $I_{BIAS}$ is implemented using a programmable current mirror, the tail current and therefore the speed of the logic gates can be adjusted dynamically [4].

In the current implementation of the STSCL library in 90 nm technology, the different driving strengths of each logic gate are implemented through the use of different bias voltages. Twelve different bias voltage signals have to be routed on the chip to bias the NMOS tail transistors and PMOS loads for six different driving strengths (x1 to x32).

2.2.2 NMOS Switching Network

A network of combined NMOS differential pairs controlled by the differential input voltages steers the current towards one of the two load devices. The differential input voltage needs to satisfy $V_{SW} = \Delta V_{in} > 4 \cdot n \cdot U_T$ ($n$ is the subthreshold slope factor and $U_T$ the thermal voltage) in order to completely switch the current. In a technology with $n = 1.5$, and with $U_T \approx 25$ mV at room temperature, the voltage swing has to be at least 150 mV.

For more complex gates, a systematic approach is needed to identify all the logic functions that can be implemented with a given number of logic stages. A binary decision diagram can be used to systematically generate all possible gate topologies for a given number of inputs [5].

2.2.3 PMOS Load Devices

STSCL gates require a pair of load `resistors' with a high resistance that can be precisely controlled and that is relatively insensitive to process variations. This can be achieved by using a pair of PMOS transistors biased by a gate voltage $V_P$, with their bulk terminal (the n-well tap) tied to the drain. The resulting load device has been shown to have a reasonably linear resistance and low sensitivity to process variations [3].

2.2.4 Replica Bias Circuit

In order to maintain the desired circuit performance in the presence of PVT (process, voltage, temperature) variations, a feedback loop containing a replica circuit is used to set the gate voltage of the PMOS load devices. This replica circuit consists of a tail transistor with a bulk-drain connected load transistor, both using the same dimensions as their counterparts inside the logic gates. The output voltage of this replica stage is equal to $V_{DD} - I_{SS} R_L$, the low voltage in a differential output pair. The desired value of $V_{SW} = I_{SS} R_L$ is fed as an input to a negative-feedback loop which controls the gate voltage $V_P$ of the load device and therefore its resistance $R_L$.

2.3 Use of STSCL for cryptographic hardware

Figure 4.4 shows the power supply current waveform for the STSCL implementation of the ECC core presented in Chapter 3. In contrast to CMOS, STSCL exhibits a very flat power profile, with supply current fluctuations of less than 5%. The (partially) symmetric nature of STSCL gates means that the transient current waveform is also much less data dependent. This reduces the risk of exposing secrets (e.g. the private key) to a side-channel attacker.

2.4 Performance analysis

2.4.1 Gate Delays

The bulk-drain connected load device acts like a resistor with a large-signal resistance $R_L = V_{SW} / I_{SS}$ and together with the load capacitance $C_L$ creates an RC network at the output node. The differential output of an STSCL gate switches with a time constant given by:

$$\tau_{STSCL} = R_L \cdot C_L \approx \frac{V_{SW}}{I_{SS}} \cdot C_L$$

It can be shown that the propagation delay is given by [6]:

$$t_{d,STSCL} = \ln(2) \cdot \tau_{STSCL} = \ln(2) \cdot \frac{V_{SW}}{I_{SS}} \cdot C_L$$

With $P = V_{DD} I_{SS}$, the power-delay product (PDP) for each gate can thus be written as

$$PDP_{STSCL} = \ln(2) \cdot V_{DD} \cdot V_{SW} \cdot C_L$$

It must be noted that, for a given choice of $V_{DD}$ and $V_{SW}$, the PDP is proportional to the output capacitance, but independent of $I_{SS}$. In other words, by scaling $I_{SS}$, the speed of the circuit can be adjusted while keeping the PDP constant.

2.4.2 Power consumption

In a system with a target clock frequency $f = 1/T$ and an average logic depth (number of gates in a register-to-register path) $d$, the power consumption can be estimated as follows. For the sake of simplicity, all gates are assumed to have identical load capacitances $C_L$. The delay of each gate (assuming equal delays) has to be less than or equal to $T/d$. In the limiting case:

$$t_d = \ln(2) \cdot \frac{V_{SW}}{I_{SS}} \cdot C_L = \frac{T}{d}$$

The required gate bias current is

$$I_{SS} = \ln(2) \cdot \frac{d}{T} \cdot V_{SW} \cdot C_L = \ln(2) \cdot d \cdot f \cdot V_{SW} \cdot C_L \qquad (2.1)$$

In a system with $N$ gates, the minimum total supply current is

$$I_{total} = \ln(2) \cdot C_{total} \cdot d \cdot f \cdot V_{SW} \qquad (2.2)$$

where $C_{total}$ is the total single-ended load capacitance in the system. Because (2.1) is linear in $C_L$, this expression is valid even if $C_{total}$ is not equally distributed among the gates, provided the current in each gate can be adjusted such that $t_d$ is the same for all gates.

Comparison to CMOS

Figure 2.3 shows a comparison of the power dissipation of an STSCL and a CMOS implementation of the finite field multiplier design (see footnote 2; Appendix, pg. 34). The STSCL core was run with a clock period of 6 µs (red dot); the dashed red line shows the theoretical (linear) power-frequency characteristic of STSCL. In reality, there is a maximum $I_{SS}$ that depends on $V_{DD}$ and which sets an upper limit on the operating frequency.

Footnote 2: The design used for this analysis is a modified version of the ALU contained in the ECC processor of Chapter 3, consisting of the ALU and 3 registers with 163 bits each and the control signals needed for the shift-and-add multiplication algorithm.
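
The delay and current relations of Sections 2.4.1 and 2.4.2 can be checked numerically. The sketch below is a minimal Python rendering of the gate delay, the PDP and equations (2.1) and (2.2); the function names and the example numbers (10 fF load, logic depth 10, 100 kHz clock) are assumptions made for illustration and do not come from the library characterization.

```python
import math

LN2 = math.log(2)


def gate_delay(v_sw: float, i_ss: float, c_load: float) -> float:
    """Propagation delay t_d = ln(2) * (V_SW / I_SS) * C_L [s]."""
    return LN2 * v_sw / i_ss * c_load


def required_tail_current(depth: float, f_clk: float, v_sw: float, c_load: float) -> float:
    """Equation (2.1): I_SS = ln(2) * d * f * V_SW * C_L [A]."""
    return LN2 * depth * f_clk * v_sw * c_load


def total_supply_current(c_total: float, depth: float, f_clk: float, v_sw: float) -> float:
    """Equation (2.2): I_total = ln(2) * C_total * d * f * V_SW [A]."""
    return LN2 * c_total * depth * f_clk * v_sw


def power_delay_product(v_dd: float, v_sw: float, c_load: float) -> float:
    """PDP = ln(2) * V_DD * V_SW * C_L [J], independent of I_SS."""
    return LN2 * v_dd * v_sw * c_load


# Illustrative operating point (assumed values).
v_dd, v_sw = 0.4, 0.2          # [V]
c_load = 10e-15                # [F] per gate
depth, f_clk = 10, 100e3       # gates per path, clock frequency [Hz]

i_ss = required_tail_current(depth, f_clk, v_sw, c_load)
t_d = gate_delay(v_sw, i_ss, c_load)
pdp = power_delay_product(v_dd, v_sw, c_load)

print(f"I_SS per gate : {i_ss:.3e} A")
print(f"gate delay    : {t_d:.3e} s (d * t_d = {depth * t_d:.3e} s)")
print(f"PDP           : {pdp:.3e} J")
```

With these numbers the required tail current is roughly 1.4 nA per gate. Scaling `i_ss` changes `t_d` but leaves `v_dd * i_ss * t_d` equal to `pdp`, which is the constant-PDP property stated above.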

Figure 2.3: Power dissipation of CMOS and STSCL as a function of frequency (power [µW] versus frequency [Hz]; curves: STSCL 0.4 V, CMOS 1 V, CMOS 0.75 V, CMOS 0.5 V, CMOS 0.4 V)

It can be seen from the figure that the power dissipation of CMOS circuits is dominated by leakage currents up to frequencies in the order of 1 MHz. At higher speeds, power dissipation increases linearly with the frequency, up to the maximum possible operating frequency (indicated by the sudden drop in power).

For this design, STSCL uses less power than CMOS with a supply voltage of 0.4 V for frequencies of about 100 kHz and lower.

2.4.3 Effects of Process Variations and Mismatch

For a basic analysis of the nominal STSCL performance as well as variability, a buffer (or inverter) is used, since it represents the simplest possible logic gate and the effects of variations are the same for all other gates. Three components can be identified that influence the performance of this gate:

1. The NMOS tail transistor, which provides the constant current required for CML operation. Mismatch between the tail transistor in the replica circuit and the one in the actual logic gate can lead to a lower current value in the gate, and therefore a lower output voltage swing. Since the replica bias circuit will be located far away from some of the logic gates, the amount of mismatch can be considerable, but it is hard to estimate during the design phase.

2. The NMOS differential pair (several pairs for more complex gates). The exponential subthreshold conduction law dictates the minimally required input voltage difference for complete current switching. In addition, any threshold voltage mismatch between the two differential pair devices will appear as an input offset; it therefore has to be added to the voltage swing as a margin.

3. The PMOS load transistors. Their large-signal resistance is set by the replica circuit to result in a voltage drop equal to $V_{SW}$. Again, the distance to the replica bias circuit will create significant variation in $V_{SW}$. Mismatch between the two devices translates into an input-referred offset at the gate input.

In order to guarantee correct operation, the gates are required to have a positive noise margin (NM) in the presence of variations.

It can be shown that the NM for an STSCL gate with ideal resistor loads is given by [6]:

$$\frac{NM}{V_{SW}} = \sqrt{1 - \frac{1}{A_V}} - \frac{1}{A_V} \cdot \tanh^{-1}\left(\sqrt{1 - \frac{1}{A_V}}\right) \qquad (2.3)$$

where $A_V$ is the DC voltage gain. As long as the load devices are close to ideal resistors, $A_V$ (and therefore $NM / V_{SW}$) is determined by the subthreshold slope factors for the given technology. Considering only nominal performance, the library designer is left with choosing $V_{SW}$ to achieve the desired NM. Device variability, on the other hand, has an important consequence: if all gates are to work under worst-case mismatch conditions, the output voltage swing has to be overdesigned, which requires a higher bias current. The amount of mismatch can be reduced by using transistors with large gate areas. Standard cell design will therefore be dominated by the trade-off between circuit area and power dissipation.

2.4.4 Noise margin analysis

In the presence of device variations, the noise margin can be estimated using [6]:

$$NM \approx NM_0 - \left(\frac{\partial NM}{\partial V_{SW}}\right) \cdot \Delta V_{SW} - V_{OS}$$

where $NM_0$ is the nominal noise margin, $\Delta V_{SW}$ is the variation of the output low voltage and $V_{OS}$ is the input-referred offset of the gate.
This expression shows device variations affecting the noise margin on both the input side (offset voltage) and the output side (reduced output differential voltage).

The sensitivity to output swing variations can be estimated using (2.3):

$$K_{NM} = \frac{\partial NM}{\partial V_{SW}} \approx \sqrt{1 - \frac{1}{A_V}}$$

The noise margin variance becomes:

$$\sigma_{NM}^2 \approx K_{NM}^2 \, \sigma_{SW}^2 + \sigma_{OS}^2$$

The variances of the output voltage swing ($\sigma_{SW}^2$) and of the input-referred offset voltage ($\sigma_{OS}^2$) both depend on device dimensions. Assuming that the main source of variability is threshold voltage variation, it follows that $\sigma_{SW}^2$ depends mainly on the gate area of the tail and load transistors, whereas $\sigma_{OS}^2$ depends on $V_{TH}$ mismatch between the differential pair devices as well as between the load devices. The total NM variation can therefore be approximated as a function of device sizes for the three independently sized transistors:

$$\sigma_{NM}^2 \approx \frac{C_N}{S_N} + \frac{C_B}{S_B} + \frac{C_P}{S_P}$$

where $S_N$, $S_B$, $S_P$ are the per-device transistor gate areas ($W \times L$) for the NMOS differential pair, NMOS current source, and PMOS bulk-drain shorted load devices.

Figure 2.4: Dependency of NM variance on tail transistor gate area

coefficient    value [V^2 · µm^2]
C_N            7.4 × 10^-6
C_B            83.5 × 10^-6
C_P            47.5 × 10^-6

Table 2.1: Relative contributions to NM variance

Monte Carlo analysis

In order to obtain numerical values for the coefficients $C_N$, $C_B$ and $C_P$, Monte Carlo simulations were performed on an inverter gate over a range of different sizes for the three types of transistors.

A MATLAB program (Appendix, pg. 38) was written to calculate the mean and standard deviation of the noise margin (cf. [7]) for a series of voltage sweeps. Figure 15 shows the output of this program. The values of $\sigma_{NM}^2$ were calculated for a range of different gate areas for each of the three transistor types. The coefficients $C_N$, $C_B$ and $C_P$ were then found by linear regression.

As an example, Figure 2.4 shows how the area of the NMOS tail transistor affects NM variability. Table 2.1 shows the coefficients for the three types of transistors.

These simulation results are unable to take into account the degree of matching among pairs of transistors. Inside the cell layouts, transistors are placed close together and in a similar environment. It can therefore be expected that the actual matching is better than the simulations would suggest.

Figure 2.5: Noise margin histogram for an STSCL inverter

When designing a standard cell library in STSCL, the total budget for noise margin variations should be distributed among these three types of devices while keeping the total area at a minimum. It should be noted, however, that the gate area of the differential pair transistors and, to a lesser extent, the n-well of the load devices' bulk-drain terminal constitute parasitic capacitances for the gate. Therefore, the differential pair and load transistors should be made somewhat smaller for fast operation, and the tail transistor larger to keep $\sigma_{NM}^2$ at an acceptable level.

2.5 Design Flow

The novel characteristics of current-mode logic families require modified tools for the top-down design of digital circuits. A major difficulty with existing design software is its inability to route differential signals. (Cadence Encounter has support for routing differential pairs, but this only applies to pre-defined pairs of nets, so no optimization is possible.) [8] introduced a series of custom programs that use standard synthesis and place-and-route tools for an MCML design flow. Their particularity lies in the use of two versions of the standard cell library, one differential and one single-ended. The differential library represents the physical circuits; each input or output consists of a pair of signals. In the single-ended library (also called the `fat' library), logic gates are represented as if they were single-ended, meaning a single pin is used to represent one signal.

In this work, a design flow based on the scripts presented in [4] is used. The scripts had to be adapted for compatibility with more recent EDA tools, and functionality for routing of the bias voltages as well as clock tree synthesis has been added.

In a first step, the design is synthesized, placed, and routed using the single-ended library.
This allows the designer to make full use of existing methods for power and timing optimization, clock tree synthesis, and so on. During this step, wide wires are used to route signals.

As a second step, a custom software tool processes the design, replacing the single-ended gates with their physical (differential) counterparts, and splitting signal wires into differential wire pairs.

2.5.1 Synthesis

Synthesis of the RTL code is straightforward using an existing logic synthesis tool (Synopsys Design Compiler). It must be noted, however, that one main difference between CMOS and STSCL may cause the tool to produce a non-optimum netlist: in CMOS, the driving strength of a gate is roughly proportional to its input capacitance (for a given logic function), whereas in the STSCL library, different driving strengths use the same device sizes with different bias voltages.

2.5.2 Placement and Routing

For the first P&R iteration, the design is loaded using a single-ended setup. The structural Verilog netlist generated by Design Compiler is used together with the `fat' libraries, that is, the LEF files for the standard library and routing wires. The design is then placed and routed using a standard flow.

The routing information for the V_P and V_N bias voltages is then temporarily saved to a DEF file. This allows the bias routing to be done at an earlier stage, keeping the amount of routing required after wire splitting at a minimum.

Wire splitting is performed by exporting the cell placement and signal routing to a DEF file, which can then be processed by the custom-made [8, 5] split_wires tool. This tool replaces all wire shapes (rectangles) by two narrower rectangles and updates all cell instances to their respective differential versions. split_wires again outputs a DEF file, which can then be read back into Encounter to continue with the P&R flow.

Mechanically replacing all wires by differential pairs obviously creates numerous design rule violations. Therefore, the routing command has to be called once again to fix these violations and to complete the routing in places where shapes had previously been deleted.

In order to keep the energy consumption low, a special option (leakage power optimization) was used in place and route. Using this option enables a final pass where gates are replaced by alternative implementations that have lower leakage current, which in STSCL translates to lower overall power.
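
The geometric core of the wire-splitting step in Section 2.5.2 (one wide rectangle replaced by two narrower ones) can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the `Rect` type, the `split_wire` function and the `spacing` parameter are hypothetical names introduced here, coordinates are taken to be integer database units, and nothing below reflects the actual implementation or file handling of the split_wires tool.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned wire shape in integer database units (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int


def split_wire(rect: Rect, spacing: int) -> Tuple[Rect, Rect]:
    """Replace one wide wire by two narrower wires separated by `spacing`.

    The wire is assumed to run along its longer dimension, so the split is
    made across the shorter one. The two halves hug the outer edges of the
    original shape.
    """
    len_x = rect.x2 - rect.x1
    len_y = rect.y2 - rect.y1
    narrow = min(len_x, len_y)
    if narrow - spacing < 2:
        raise ValueError("wire too narrow to split with the requested spacing")

    half = (narrow - spacing) // 2
    if len_x >= len_y:
        # Horizontal wire: split across y.
        return (Rect(rect.x1, rect.y1, rect.x2, rect.y1 + half),
                Rect(rect.x1, rect.y2 - half, rect.x2, rect.y2))
    # Vertical wire: split across x.
    return (Rect(rect.x1, rect.y1, rect.x1 + half, rect.y2),
            Rect(rect.x2 - half, rect.y1, rect.x2, rect.y2))


wide = Rect(0, 0, 1000, 300)
pos_wire, neg_wire = split_wire(wide, spacing=100)
print(pos_wire, neg_wire)
```

A purely mechanical split like this ignores vias, corners and neighbouring shapes, which is consistent with the observation above that the router has to be run again afterwards to repair design rule violations.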

3 Elliptic curve cryptographic processor

3.1 Introduction

For demonstration purposes, an existing design for an elliptic curve cryptographic core (provided by [9]) was implemented in both a standard CMOS library and STSCL. Both were synthesized starting from the same VHDL register-transfer-level (RTL) code. The two implementations were then compared in terms of area, power consumption and supply current profile.

3.2 Elliptic curve cryptography

Public-key cryptographic systems rely on computationally `hard' mathematical problems. Traditionally, public-key systems used the fact that it is computationally expensive to factor large integers. Elliptic curve cryptography instead relies on the discrete logarithm problem on elliptic curves.

Recently, elliptic curve cryptography has been proposed as a viable encryption/authentication technology for RFID applications, because it can be implemented with a comparatively low hardware cost [10].

In order to understand the hardware required for elliptic curve cryptography, the following concepts need to be defined [11]:

Finite Fields

A finite field is a system consisting of a finite set $F$ (the numbers) together with operations + and × (addition and multiplication) that satisfy a number of properties:

• Closure: for all $a, b \in F$, we have $a + b \in F$ and $a \times b \in F$.
• Associativity: for all $a, b, c \in F$, $(a \times b) \times c = a \times (b \times c)$.
• Existence of an identity element
• Existence of an inverse
• Abelian property (commutativity)
• Distributivity

For any prime number $p$, the prime (finite) field $F_p$ is defined to be the set of integers from 0 to $p - 1$, together with the operations defined as follows:

Addition: $a + b = r$, where $r$ is the remainder of the division of $a + b$ by $p$.

Multiplication: $a \times b = s$, where $s$ is the remainder of the division of $a \times b$ by $p$.

The finite field $F_{2^m}$

The binary finite field $F_{2^m}$ is a vector space of dimension $m$ over the prime field $F_2$. A polynomial basis of $F_{2^m}$ can be introduced as follows: let $f(x)$ be an irreducible polynomial of degree $m$ over $F_2$, called the reduction polynomial. Each element $a$ of $F_{2^m}$ can now be written as a binary polynomial of degree $m - 1$ or less:

$$a = a_{m-1} x^{m-1} + \dots + a_1 x + a_0$$

When a polynomial basis is specified, an element of $F_{2^m}$ can therefore be written as a bit vector of length $m$.

Finite field operations

Using a polynomial basis for $F_{2^m}$, the following field operations can be defined:

• Addition: $a + b = c = (c_{m-1} \dots c_1 c_0)$, where $c_i = (a_i + b_i) \bmod 2$. Field addition is the bitwise XOR of the bit vectors representing elements of $F_{2^m}$.
• Multiplication: $a \cdot b = c = (c_{m-1} \dots c_1 c_0)$, where $c(x) = \sum_{i=0}^{m-1} c_i x^i$ is the remainder of the division of the polynomial $\left(\sum_{i=0}^{m-1} a_i x^i\right)\left(\sum_{i=0}^{m-1} b_i x^i\right)$ by $f(x)$. Field multiplication can be performed by the shift-and-add method, where one bit of $b$ is considered at a time, starting at the MSB. If the bit is equal to one, $a$ is added (using XOR) to a running sum $c$. After each step except the last, $c$ is left-shifted by one bit and reduced modulo $f(x)$.

Elliptic curves over $F_{2^m}$

The elliptic curve $E(F_{2^m})$ over $F_{2^m}$ for the parameters $a, b \in F_{2^m}$, $b \neq 0$, is defined to be the set of points $P = (x, y)$ with $x, y \in F_{2^m}$ that are solutions to the equation

$$y^2 + xy = x^3 + ax^2 + b$$

together with the special point $O$ called the point at infinity.

Addition on elliptic curves

Addition of two points $P = (x_1, y_1)$, $Q = (x_2, y_2) \in E(F_{2^m})$, $P \neq \pm Q$, results in a new point $P + Q = (x_3, y_3) \in E(F_{2^m})$, where

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda (x_1 + x_3) + x_3 + y_1, \qquad \lambda = \frac{y_2 + y_1}{x_2 + x_1}.$$

A similar expression can be found for the double of a point $P$.

Elliptic scalar multiplication

Scalar multiplication of a point $P$ on an elliptic curve by the integer $k$ is defined to be the sum of $k$ copies of $P$. This operation is the underlying principle of all elliptic curve cryptographic schemes. There are efficient algorithms for calculating $kP$. On the other hand, it is very hard to find $k$ if only $P$ and $kP$ are known. This is the so-called discrete logarithm problem.
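
The field operations listed under "Finite field operations" above map directly onto integer bit operations. The following Python sketch implements field addition as XOR and the MSB-first shift-and-add multiplication; the function names are hypothetical. The small test field $F_{2^8}$ with $f(x) = x^8 + x^4 + x^3 + x + 1$ is used only because a published worked example exists for it. The 163-bit reduction polynomial shown is the NIST pentanomial $x^{163} + x^7 + x^6 + x^3 + 1$, which is an assumption: this report does not state which polynomial the ECC core uses.

```python
def gf2m_add(a: int, b: int) -> int:
    """Field addition in F_2^m: bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf2m_mul(a: int, b: int, m: int, f: int) -> int:
    """Shift-and-add multiplication in F_2^m (polynomial basis).

    a, b : field elements as integers below 2**m (bit i = coefficient of x^i)
    m    : degree of the field
    f    : reduction polynomial as an integer, including the x^m term
    """
    c = 0
    for i in range(m - 1, -1, -1):   # scan b from MSB to LSB
        c <<= 1                      # multiply the running sum by x
        if (c >> m) & 1:             # degree reached m: reduce modulo f(x)
            c ^= f
        if (b >> i) & 1:             # current bit of b set: add a (XOR)
            c ^= a
    return c


# Small test field F_2^8 with f(x) = x^8 + x^4 + x^3 + x + 1.
AES_POLY = 0x11B
print(hex(gf2m_mul(0x57, 0x83, 8, AES_POLY)))

# Assumed 163-bit reduction polynomial (NIST): x^163 + x^7 + x^6 + x^3 + 1.
F163_NIST = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
print(hex(gf2m_mul(1 << 162, 0b10, 163, F163_NIST)))
```

The loop shifts before it adds, so no shift follows the last bit, and each iteration corresponds to one clock cycle of a bit-serial multiplier: one conditional XOR with the operand and one conditional XOR with the reduction polynomial.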

3.2.1 Montgomery algorithm for scalar multiplication

The binary method for scalar multiplication on elliptic curves can be implemented as follows:

    set k ← (k_{l−1} ... k_1 k_0)_2
    set P_1 ← P, P_2 ← 2P
    for i = l − 2 to 0 do
        if k_i = 1 then
            set P_1 ← P_1 + P_2, P_2 ← 2 · P_2
        else
            set P_2 ← P_1 + P_2, P_1 ← 2 · P_1
        end if
    end for

One fast algorithm for scalar multiplication on elliptic curves is the ladder algorithm first proposed by Montgomery [12]. It is based on the observation that in the binary method, the difference between $P_1$ and $P_2$ is always equal to $P$. This makes it possible to implement scalar multiplication with fewer registers.

3.3 Specifications of the cryptographic processor

The ECC core that was chosen as an example implementation for this project is designed for elliptic curve calculations over the field $F_{2^{163}}$. It provides a hardware implementation of the add-and-double operation that is at the heart of Montgomery's ladder multiplication algorithm.

The ALU and the register file are controlled by a Finite State Machine (FSM), which is also part of the hardware implementation. An external controller implements the physical interface for encryption or authentication and repeatedly calls the ECC core to perform the operations on two curve points.

The main specifications are given in Table 3.1. The core is designed to operate at a very low clock frequency of 100 kHz.

technology                UMC 90 nm
target clock frequency    100 kHz
power dissipation         min.

Table 3.1: Specifications of the ECC core

3.4 Architecture

The architecture of the ECC core, shown in Figure 3.1, consists of a core containing the Arithmetic and Logic Unit (ALU) on the left and a 6-word register file on the right.
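
Returning briefly to the ladder of Section 3.2.1, its control flow can be sketched in Python independently of the curve arithmetic. In the sketch below the group operations are passed in as functions; the function name `montgomery_ladder` and the stand-in group (integers modulo an arbitrary prime, where $kP$ is simply $k \cdot P \bmod N$) are assumptions made so that the result can be checked easily. A real implementation would supply the elliptic curve point addition and doubling instead.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def montgomery_ladder(k: int, p: T,
                      add: Callable[[T, T], T],
                      double: Callable[[T], T]) -> T:
    """Compute k*P with the ladder of Section 3.2.1 for any additive group.

    Throughout the loop P2 - P1 == P holds, so each iteration performs
    exactly one addition and one doubling regardless of the key bit.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    p1, p2 = p, double(p)
    for i in range(k.bit_length() - 2, -1, -1):   # bits l-2 down to 0
        if (k >> i) & 1:
            p1, p2 = add(p1, p2), double(p2)
        else:
            # Tuple assignment evaluates both right-hand sides with the old P1.
            p1, p2 = double(p1), add(p1, p2)
    return p1


# Stand-in group for demonstration only: integers modulo a prime.
N = 1009


def add_mod(a: int, b: int) -> int:
    return (a + b) % N


def double_mod(a: int) -> int:
    return (2 * a) % N


print(montgomery_ladder(13, 5, add_mod, double_mod))
```

Because both branches perform the same pair of operations, the sequence of hardware operations does not depend on the key bits, which fits the add-and-double unit described in Section 3.3.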

CHAPTER 3.ELLIPTIC CURVE CRYPTOGRAPHIC PROCESSORFigure 3.1: Basic architecture of the ECC coreThe processor's register le consists of six registers with 163 bits each, named with theletters A through F. Each register has an enable signal which has to be asserted by thestate machine if a dierent value is to be loaded into the register. A global reset signal(one of the inputs to the processor) can be used to initially set all registers to zero. Aglobal clock signal clocks the ip-ops in the register le and in the counter.To start a new operation, the system is reset and then the start signal is asserted. Thestart signal causes the FSM to leave the default state and progress through a number ofstates, rst loading the input data, then calculating the sum of the two input points, andthen the double of one of them.The rst two registers are equipped with multiplexers that are controlled from the FSM.Register A has the most diverse functionality. It can load the output of the ALU, thedata_in input, the output of register B, or one of three predened values.Register B can take either the output of register A or a left-shifted copy of its own value.This allows cyclic shifting of values in register B, used for feeding the operand to the ALUin a bit-serial fashion.The remaining registers C to F serve to store intermediate results. They each take theoutput of the previous register.3.4.1 ALUThe nite eld multiplier is implemented as a bit-serial unit. One of the operands is storedin register B and its MSB is an input to the ALU. During multiplication, register B isleft-shifted at each clock cycle; thus the operand is entered into the ALU bit-serially.The value of the current bit position of B is multiplied with the other operand (registerC) by an array of 163 AND gates. In each cycle, this partial product is added to therunning sum (register A) using an array of 162 XOR gates. If the result has a `1' at the20

CHAPTER 3.ELLIPTIC CURVE CRYPTOGRAPHIC PROCESSORbit-position m (the degree of the sum is equal to the degree of the reduction polynomial),the bits at positions corresponding to the reduction polynomial are ipped, again usingXOR gates. This step corresponds to the modulo operation which guarantees that thedegree of the running sum is less than m after each clock cycle.3.4.2 Finite State MachineThe Finite State Machine that is part of this hardware ECC implementation loads thecoordinates of the two points P 1 and P 2 into registers and then cycles through a numberof states in order to compute the sum P 1 + P 2 and the value 2 · P 1 . The result is thenoutput in projective coordinates at the ports x1_out, x2_out and z_out and the readysignal is asserted to signal completion of the calculations. The entire operation requiresroughly 1800 clock cycles.3.4.3 RegistersThe ECC core uses a register le with six 163-bit registers called regA, regB, ... regFto store intermediate results. Registers regA and regB have an input multiplexer whichselects the value to store according to the select bits generated in the FSM. The otherregisters form a circular shift memory where each register takes the output of the previousregister as its input.3.5 Standard CMOS implementationThe ECC core was implemented in the UMC 90nm logic/mixed-mode CMOS process usingthe Faraday standard cell library. The RTL code was synthesized using Synopsys DesignCompiler and then placed and routed using Cadence Encounter. Synthesis constraintswere chosen to favor low power consumption over high performance. No special poweroptimization techniques were used and the library was not re-characterized for lower supplyvoltages. 
Therefore, power consumption measurements can be expected to give pessimistic results.

3.6 STSCL implementation

3.6.1 Modifications of the design flow and library

The STSCL implementation used the existing library described in [4] with some small modifications.

An analysis of the routing process with the existing library showed that the design was difficult to route because of the many D flip-flops in the design. The existing layout of these flip-flops had been assembled from two copies of the existing layout for the 2-input multiplexer. This led to a poor cell layout with a long wire in the metal 3 layer, creating obstructions during the routing process. The D flip-flop layout was therefore optimized to eliminate most of the routing in metal 3.
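The bit-serial multiplication described in Section 3.4.1 can be illustrated in software. The following Python sketch mirrors the shift, AND, XOR and reduce steps on integers used as bit vectors. It is only a behavioral model, not the RTL of the core: the function names are hypothetical, and the reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (the NIST polynomial for this field size) is an assumption, since the text does not state which polynomial the design uses.

```python
M = 163  # field degree, matching the 163-bit registers of the core
# Assumed reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (not stated in the text).
POLY_LOW = (1 << 7) | (1 << 6) | (1 << 3) | 1  # terms of the polynomial below x^M
MASK = (1 << M) - 1


def gf2m_mul_bit_serial(b: int, c: int) -> int:
    """Multiply b and c in GF(2^M), consuming b MSB first, one bit per 'clock cycle'."""
    acc = 0  # running sum (register A in the text)
    for i in range(M - 1, -1, -1):
        acc <<= 1                          # multiply the running sum by x
        if (acc >> M) & 1:                 # degree reached M: apply the reduction
            acc = (acc & MASK) ^ POLY_LOW
        if (b >> i) & 1:                   # AND array: partial product is c or 0
            acc ^= c                       # XOR array: add the partial product
    return acc


def gf2m_mul_reference(b: int, c: int) -> int:
    """Reference result: full carry-less product followed by polynomial reduction."""
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= c << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= ((1 << M) | POLY_LOW) << (deg - M)
    return prod
```

For example, gf2m_mul_bit_serial(2, 1 << 162) multiplies x by x^162 and returns 0xC9, the low-order terms of the assumed polynomial, which shows the reduction step at work. Because the reduction is applied in every cycle, the running sum never needs more than M bits, which is why the hardware can hold it in a single 163-bit register.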

Figure 3.2: Modified D flip-flop layout

3.6.2 Library Specifications

The STSCL implementation of the ECC core uses the library with the characterization parameters listed in Table 3.2.

V_DD     V_SW     I_SS,x1
1.2 V    0.2 V    1 nA

Table 3.2: Library corner

In order to find the best configuration, the library was characterized for different values of the bias current (200 pA, 1 nA, 5 nA). The 1 nA variant was found to be the best choice for the present design, because the range of available gates (driving strengths from x1 to x32) covers the requirements for synthesis. The final P&R results show that most of the gates have an intermediate driving strength, and only relatively few x1 and x32 gates are present. This suggests that the chosen bias current is appropriate for this design.

4 Results

4.1 Design and simulation flow

The two implementations were synthesized, placed and routed as described in the previous chapter. The CMOS design was then imported into Cadence Virtuoso, and a device and capacitance extraction from the layout was performed. The extracted netlist was then simulated using Cadence Spectre, with a VCD (Verilog value change dump) file providing the input stimuli as well as the correct output values for verification.

The STSCL design could not be simulated correctly using this approach. Instead, it was simulated in Nanosim using the Verilog netlist and the extracted capacitance file (SPEF) generated in Cadence Encounter. A supply voltage of 1 V was used.

Figure 4.1: Output waveform of one of the regB flip-flops with the corresponding clock signal

A comparison of the output signals with the results from VHDL simulation shows that the design operates correctly at a frequency of 100 kHz. Figure 4.1 shows an example of a clock and signal waveform.

4.2 Performance comparison

Table 4.1 gives a summary of the results obtained for the implementations of the ECC core using the Faraday CMOS library as well as the STSCL library. Figure 4.2 shows the final layouts for both implementations.

Figure 4.2: Layout of the CMOS (left) and STSCL implementation (to scale)

Implementation                          CMOS             STSCL
Area (including power rings, no pads)   36000 µm²        308000 µm²
Number of standard cells                2568             4541
Number of D flip-flops                  993              993
Power consumption (average)             9 µW @ 0.45 V    56 µW @ 1 V
% of area in DFFs                       54%              27%

Table 4.1: Comparison of CMOS and STSCL implementations of the ECC core

It can be seen from Figure 4.2 that the area of the STSCL implementation is roughly 8.5 times larger. This is mainly due to the large size of cells in the STSCL library. For example, an AND gate with minimum driving strength has an area of about 3.9 µm² in the CMOS library, and 42 µm² in the STSCL library. The layout area of STSCL gates is significantly larger because of the large transistor gate areas. Furthermore, the two load devices need to be placed in separate n-wells because of their drain-bulk connection. This adds considerable area due to the spacing rules that have to be met: in the end, the gates of the two load devices must be more than 2 µm apart.

The different number of standard library cells in the two implementations can be explained by the fact that the commercial CMOS library offers a large selection of gates to the synthesis tools. For example, the CMOS version of the design contains 6-input AOI gates, whereas the STSCL library only has gates with up to 3 inputs.

Running the STSCL core at the same supply voltage V_DD = 0.45 V can be expected to reduce the power consumption to 25 µW, since the supply current of about 56 µA is set by the tail currents rather than by the supply voltage.

4.3 Interpretation of results

In the present design, the critical register-to-register path contains 15 logic gates, even though the design had been synthesized with tight timing constraints. For STSCL, a high logic depth means that each gate is doing a useful operation only during a small fraction of the clock cycle. Even if every node were to switch once per clock period, the gates are 'wasting' current for more than 90% of the clock period.

Since STSCL allows reduction of the power consumption at the cost of speed down to tail currents of a few pA, the STSCL design could offer better performance at lower frequencies. However, this is not feasible due to the speed constraint imposed on the ECC core.

Interconnect capacitance. Another factor is the large area of the STSCL block, which leads to wires being roughly three times longer on average. Post-layout capacitance extraction shows a total routing capacitance of 89 pF in STSCL (counting both wires of each differential pair), whereas the CMOS design has only 13 pF of capacitance.

Equation 2.2 can be used to calculate an estimate of the theoretical lower limit on power dissipation. The average logic depth in each path was estimated to be 10, and the single-ended gate capacitance 2 fF for each gate input. For a conservative estimate, it can be assumed that all 4500 gates have only two inputs, resulting in a total capacitance of

\[ C_{total} \approx \frac{89\,\mathrm{pF}}{2} + 4500 \cdot 2 \cdot 2\,\mathrm{fF} = 62.5\,\mathrm{pF} \]

\[ I_{total} \geq \ln(2) \cdot C_{total} \cdot d \cdot f \cdot V_{SW} \approx 0.7 \cdot 62.5\,\mathrm{pF} \cdot 10 \cdot 10^{5}\,\mathrm{s}^{-1} \cdot 0.2\,\mathrm{V} = 8.8\,\mu\mathrm{A} \]

This value is significantly lower than the simulated result of 56 µA. Several reasons for this difference can be identified. First, the estimate is overly conservative because in reality, many gates have more than two inputs, and D flip-flops use twice the current of a normal gate. Second, in the final design, some paths are significantly faster than they need to be (Figure 4.3). The EDA tools did not correctly replace gates with low-power equivalents in those paths.
Moreover, the delays in a path may not be equal at all; for instance, if one stage has to drive a very large fan-out, the delay of that stage will be greater, requiring higher current consumption in the other stages to compensate for this delay.

While STSCL was expected to be very efficient due to the low voltage swing, it has to be noted that the differential nature of the signals entails a switching current that is at least twice as high as in the single-ended case. In fact, the routing method used leads to a large coupling capacitance between the two wires of the differential signal. If the leakage current in CMOS can be kept at acceptable values, it can thus be expected that the power dissipation of STSCL circuits with a differential voltage swing of ±0.2 V will not be much better than that of a CMOS circuit operating at V_DD = 0.4 V.

Supply current. Figure 4.4 shows the current flowing into the V_DD node for CMOS and STSCL.¹ Whereas in CMOS the current is concentrated in peaks (when the flip-flops are switching), STSCL, as expected, draws a nearly constant current with only minor transients due to switching.

¹ For CMOS, a higher frequency (10 MHz) was used to generate this waveform; otherwise the switching currents would be very narrow peaks.
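The lower-limit estimate of Section 4.3 can be reproduced with a few lines of arithmetic. The Python sketch below plugs the numbers quoted in the text into the bound I_total ≥ ln(2) · C_total · d · f · V_SW and also scales the simulated supply current to a 0.45 V supply. The function and variable names are illustrative, and halving the 89 pF routing capacitance to obtain a single-ended value is an assumption based on the remark that this figure counts both wires of each differential pair.

```python
import math


def stscl_current_lower_bound(c_total, logic_depth, freq, v_swing):
    """Lower bound on the total tail current: ln(2) * C * d * f * V_SW, in amperes."""
    return math.log(2) * c_total * logic_depth * freq * v_swing


# Values quoted in the text.
c_routing_diff = 89e-12   # routing capacitance, both wires of each pair (F)
n_gates = 4500
inputs_per_gate = 2       # conservative assumption made in the text
c_gate_input = 2e-15      # single-ended gate capacitance per input (F)

# Assumption: the single-ended routing capacitance is half the differential total.
c_total = c_routing_diff / 2 + n_gates * inputs_per_gate * c_gate_input

i_bound = stscl_current_lower_bound(c_total, logic_depth=10, freq=100e3, v_swing=0.2)

i_simulated = 56e-6               # simulated supply current (A), i.e. 56 uW at 1 V
p_at_0v45 = i_simulated * 0.45    # constant supply current: power scales with V_DD

print(f"C_total   = {c_total * 1e12:.1f} pF")
print(f"I_total  >= {i_bound * 1e6:.1f} uA")
print(f"P(0.45 V) = {p_at_0v45 * 1e6:.1f} uW")
```

With the exact value of ln(2) the bound comes out at about 8.7 µA; the text rounds ln(2) to 0.7 and obtains 8.8 µA. Either way the simulated 56 µA is roughly six times the bound, which is the gap the surrounding discussion sets out to explain.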

Figure 4.3: Register-to-register path slack distribution in the STSCL core

4.3.1 Advantages and disadvantages of STSCL

With regard to the present implementation of the ECC core in STSCL, the following conclusions can be drawn:

Advantages

• Operation at very low speed: There is almost no lower limit to the power consumption of STSCL gates. For clock frequencies in the kilohertz range and below, STSCL is an ideal choice. In CMOS, leakage power is determined by the supply voltage, which has to meet a minimum noise margin requirement in the presence of PVT variations. In STSCL, current consumption and noise margin are controlled separately, so the power dissipation can be reduced to a very low value while keeping a reasonable voltage.

• Shallow logic depth: Similarly, circuits with very shallow pipelining can be implemented more efficiently in STSCL, since low logic depth means each gate is switching during a significant fraction of the clock period. That way, less current is 'wasted' in inactive gates.

• Tunability over a wide range of frequencies: The use of a single replica bias circuit makes it possible to use the same circuit over a range of frequencies spanning several orders of magnitude. Using a single constant current source and a programmable current mirror, STSCL circuits can be used in applications where the operating speed, and therefore the power consumption, has to be adjusted dynamically to meet performance demands.
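The 'shallow logic depth' argument can be made concrete with a rough model. Assuming, purely for illustration, that all gates on a path have equal delay and that the path delay fills the whole clock period, a gate on a path of logic depth d is actively switching for only 1/d of the cycle while drawing its tail current all the time. The Python sketch below tabulates the resulting idle fraction; the function name and the equal-delay assumption are not taken from the design itself.

```python
def idle_fraction(logic_depth: int) -> float:
    """Fraction of the clock period in which a gate draws current without switching.

    Assumes equal gate delays and a path delay equal to one clock period.
    """
    if logic_depth < 1:
        raise ValueError("logic depth must be at least 1")
    return 1.0 - 1.0 / logic_depth


for depth in (1, 2, 5, 10, 15):
    print(f"depth {depth:2d}: idle {idle_fraction(depth) * 100:5.1f} % of the period")
```

For the critical-path depth of 15 reported in Section 4.3 this gives about 93 %, in line with the 'more than 90 %' quoted there, whereas a single-stage pipeline (depth 1) leaves no idle time in this simplified model.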

Figure 4.4: Supply current waveform for CMOS (top) and STSCL (bottom). Both panels plot the supply current Iss against time; the CMOS panel is in mA against ns.

Disadvantages

• Large area: The large area due to matching requirements and the overhead for the load devices present in each gate are a significant problem.

• Power consumption: In their current version, larger STSCL designs implemented using a top-down design flow are not competitive due to the issues with large cell area and interconnect capacitance.

5 Outlook

5.1 Possible improvements in the STSCL library

5.1.1 Device sizing optimization

Figure 5.1: Layout of a 2-input NAND gate in STSCL

As discussed in Section 2.4.4, optimal sizing of standard cell devices takes into account the relative weight with which the different transistors contribute to noise margin variability.

In order for STSCL to be more area-efficient, logic synthesis should favor cells with many inputs. Figure 5.1 shows the relative sizes of the differential pairs, load devices, and tail current source. Even though the differential pair devices implement the actual 'logic' of the gate, they only occupy a comparatively small area. For this reason, gates with a small number of inputs are very inefficient in terms of area: a 3-input XOR gate has an area of 50.5 µm², whereas a simple buffer has an area of 39 µm².

In CMOS, using low fan-in gates is justified by the higher driving strength that these gates offer for a given input capacitance. In STSCL, however, the driving strength depends only on the tail current. For these reasons, it is more efficient to use more complex gates in STSCL, both in terms of area (due to the overhead for tail and load devices) and power consumption. An STSCL library should therefore contain a large selection of high fan-in gates, possibly custom-made for a specific design.
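The area argument can be checked with the two cell areas quoted above. The short Python sketch below computes the area per logic input for both cells; the dictionary layout and names are illustrative only, and no cell areas other than the two given in the text are assumed.

```python
# Cell areas quoted in the text (square micrometres) and their number of logic inputs.
cells = {
    "buffer": {"area_um2": 39.0, "inputs": 1},
    "xor3": {"area_um2": 50.5, "inputs": 3},
}


def area_per_input(cell: dict) -> float:
    """Layout area divided by the number of logic inputs of the cell."""
    return cell["area_um2"] / cell["inputs"]


for name, cell in cells.items():
    print(f"{name}: {area_per_input(cell):.1f} um^2 per input")

# Relative area increase when going from the buffer to the 3-input XOR.
extra_area = cells["xor3"]["area_um2"] / cells["buffer"]["area_um2"] - 1.0
print(f"extra area of xor3 over buffer: {extra_area * 100:.0f} %")
```

The 3-input XOR needs about 17 µm² per input against 39 µm² for the buffer, and tripling the number of inputs costs only about 29 % more area, which is consistent with the claim that the tail and load devices dominate the cell area.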

5.1.2 Shallow pipelining

It has been suggested in [3] that single-stage pipelining be used in order to increase the power efficiency of STSCL. In this scenario, the system is clocked with two clock phases which alternately latch the outputs of two consecutive gates by switching the tail current to a lower value. An output stage consisting of a pair of cross-coupled NMOS transistors biased with a small current can be used as a keeper to ensure that the output state does not degrade during the hold phase.

While this shallow-pipelining method is very promising for manually designed circuit blocks like multipliers, it would be difficult to integrate into a top-down design flow.

6 Conclusion

This work successfully demonstrated the design of an elliptic curve cryptographic core in subthreshold source-coupled logic using a top-down design flow. The ECC core runs correctly at the specified frequency of 100 kHz.

The comparison with a standard CMOS implementation of the same core shows, however, that in its current state the STSCL library is not competitive in terms of power dissipation. The area required by the PMOS load devices and the sizing constraints imposed by device variations make the STSCL standard cells considerably larger than their CMOS counterparts. At the system level, this leads to an excessive amount of device and interconnect capacitance.

Advantages of STSCL over CMOS have been identified. The flat power profile of STSCL circuits makes it difficult to extract information on the data being processed by studying the supply current. This is an important advantage for cryptographic applications.

Bibliography

[1] A. Tajalli, E. Vittoz, Y. Leblebici, and E. J. Brauer, "Ultra-low power subthreshold current-mode logic utilising PMOS load device," IEE Electronics Letters, vol. 43, no. 17, pp. 911–913, 16 Aug. 2007.
[2] A. Tajalli, Y. Leblebici, E. Vittoz, and E. J. Brauer, "Ultra Low Power Subthreshold MOS Current Mode Logic Circuits Using a Novel Load Device Concept," in 33rd European Solid-State Circuits Conference (ESSCIRC), 2007.
[3] A. Tajalli, E. J. Brauer, Y. Leblebici, and E. Vittoz, "Subthreshold Source-Coupled Logic Circuits for Ultra-Low-Power Applications," IEEE Journal of Solid-State Circuits, vol. 43, no. 7, pp. 1699–1710, Jul. 2008.
[4] M. Beikahmadi, "Developing a Standard Cell Library for Subthreshold Source-Coupled Logic," Master's thesis, Ecole Polytechnique Fédérale de Lausanne, Jan. 2009.
[5] P. Vietti, "Design of MCML standard-cell library and differential routing methodology," Master's thesis, Ecole Polytechnique Fédérale de Lausanne, Aug. 2007.
[6] A. Tajalli, "Power-Performance Scalable Integrated Circuit Design Using Subthreshold MOS," Ph.D. dissertation, Ecole Polytechnique Fédérale de Lausanne, 17 Aug. 2010.
[7] E. Seevinck, F. J. List, and J. Lohstroh, "Static-Noise Margin Analysis of MOS SRAM Cells," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 748–754, Oct. 1987.
[8] S. Badel, "MOS Current-Mode Logic Standard Cells for High-Speed Low-Noise Applications," Ph.D. dissertation, Ecole Polytechnique Fédérale de Lausanne, Jul. 2008.
[9] K. Padarnitsas, "ecc_add_doubler," VHDL code, private communication, 2010.
[10] D. Hein, J. Wolkerstorfer, and N. Felber, "ECC Is Ready for RFID - A Proof in Silicon," in Selected Areas in Cryptography. Springer, 2009.
[11] J. López and R. Dahab, "An Overview of Elliptic Curve Cryptography," May 2000.
[12] P. L. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of Factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, Jan. 1987.
[13] J. López and R. Dahab, "Fast Multiplication on Elliptic Curves over GF(2^m) without Precomputation," in Cryptographic Hardware and Embedded Systems. Springer, 1999.

Appendix

VHDL code for the finite field multiplier

Listing 6.1: multiplier.vhd

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity multiplier is
    generic ( nbits : natural := 163 );
    Port ( clk    : in  std_logic;
           start  : in  std_logic;
           A, B   : in  std_logic_vector(nbits-1 downto 0);
           output : out std_logic_vector(nbits-1 downto 0);
           done   : out std_logic
         );
end multiplier;

architecture struct of multiplier is
    signal xorshift_out     : std_logic_vector(nbits-1 downto 0);
    signal sum_reg          : std_logic_vector(nbits-1 downto 0);
    signal reduce_out       : std_logic_vector(nbits-1 downto 0);
    signal shift_reg, B_reg : std_logic_vector(nbits-1 downto 0);

    component counter is
        port ( clk   : in  std_logic;
               start : in  std_logic;
               done  : out std_logic
             );
    end component;

    component reduce is
        generic ( nbits : natural );
        port ( enable : in  std_logic;
               input  : in  std_logic_vector(nbits-1 downto 0);
               output : out std_logic_vector(nbits-1 downto 0)
             );
    end component;

    component xorshift is
        generic ( nbits : natural );
        port ( shifted_bit  : in  std_logic;
               multiplicand : in  std_logic_vector(nbits-1 downto 0);
               sum_in       : in  std_logic_vector(nbits-2 downto 0);
               sum_out      : out std_logic_vector(nbits-1 downto 0)
             );
    end component;

begin

    xs : xorshift
        generic map ( nbits => nbits )
        port map ( shifted_bit  => shift_reg(nbits-1),
                   multiplicand => B_reg,
                   sum_in       => sum_reg(nbits-2 downto 0),
                   sum_out      => xorshift_out );
    red : reduce
        generic map ( nbits => nbits )
        port map ( enable => sum_reg(nbits-1),
                   input  => xorshift_out,
                   output => reduce_out );

    cnt : counter
        port map ( clk   => clk,
                   start => start,
                   done  => done );

    process (clk)
    begin
        if rising_edge(clk) then
            if start = '1' then
                shift_reg

    process (clk)
    begin
        if rising_edge(clk) then
            if start = '1' then
                tmp '0');
                output(nbits-1 downto 0)

entity xorshift is
    generic ( nbits : natural := 8 );
    port ( shifted_bit  : in  std_logic;
           multiplicand : in  std_logic_vector(nbits-1 downto 0);
           sum_in       : in  std_logic_vector(nbits-2 downto 0);
           sum_out      : out std_logic_vector(nbits-1 downto 0)
         );
end xorshift;


architecture arch of xorshift is

    component and_2
        port (
            a : in  std_logic;
            b : in  std_logic;
            c : out std_logic
        );
    end component and_2;

    component xor_2
        port (
            a : in  std_logic;
            b : in  std_logic;
            c : out std_logic
        );
    end component xor_2;


    signal and_temp : std_logic_vector(nbits-1 downto 0);

begin

    and_gate : for i in 0 to nbits-1 generate
        Comp : and_2 port map (
            a => shifted_bit,
            b => multiplicand(i),
            c => and_temp(i)
        );
    end generate;

    xor_gate : for i in 1 to nbits-1 generate
        Comp : xor_2 port map (
            a => and_temp(i),
            b => sum_in(i-1),
            c => sum_out(i)
        );
    end generate;

    sum_out(0)

use IEEE.STD_LOGIC_UNSIGNED.ALL;


entity xor_2 is
    Port ( a : in  std_logic;
           b : in  std_logic;
           c : out std_logic );
end xor_2;

architecture Behavioral of xor_2 is

begin

    c

xpoints = AF1.data(:, 1);
vout1 = AF1.data(:, 2:2:Nsamples*2);
vin2 = AF2.data(:, 1);
vout2 = AF2.data(:, 2:2:Nsamples*2);

F1 = vout1;

F2 = [];
for i = 1:Nsamples
    F2 = [F2 interp1(vout2(:, i), vin2, xpoints)];
end

plot(xpoints, F1)
hold on
plot(xpoints, F2)
%axis([0, Vdd, 0, Vdd]);
axis equal;

% rotated system of coordinates
v1 = (xpoints*ones(1, Nsamples) + vout1)/sqrt(2);
u1 = (xpoints*ones(1, Nsamples) - vout1)/sqrt(2);
v2 = (xpoints*ones(1, Nsamples) + vout2)/sqrt(2);
u2 = -(xpoints*ones(1, Nsamples) - vout2)/sqrt(2);

% points on the new 'x'-axis
% less than vdd/sqrt2 to truncate the extrapolated 'tails'
upoints = linspace(-0.9*Vdd/sqrt(2), 0.9*Vdd/sqrt(2), interp)';

v1resamp = [];
v2resamp = [];

% interpolate the curve at the new points
for i = 1:Nsamples
    v1resamp = [v1resamp interp1(u1(:, i), v1(:, i), upoints)];
    v2resamp = [v2resamp interp1(u2(:, i), v2(:, i), upoints)];
end

%v1resamp(isnan(v1resamp)) = 0;
%v2resamp(isnan(v2resamp)) = 0;

%plot(upoints, v1resamp, upoints, v2resamp);
axis equal;

snr = [];

for i = 1:Nsamples
    diff = (v1resamp(:, i)*ones(1, Nsamples) - v2resamp);
    snr = [snr; min(max(diff(1:round(interp/2), :), [], 1), ...
                    max(-diff(round(interp/2):interp, :), [], 1))'/sqrt(2)];
end

figure()
histfit(snr, 20);

[mu, sig] = normfit(snr)
xlabel('Static Noise Margin [V]');

v = axis();

text(v(1) + 0.7*(v(2)-v(1)), v(3) + 0.7*(v(4)-v(3)), ...
     strvcat(['\mu = ' num2str(mu)], ['\sigma = ' num2str(sig)], ...
             ['N = ' num2str(Nsamples) 'x' num2str(Nsamples)]));

minSNR = min(snr)
