
Self-Timed SRAM for Energy Harvesting Systems

Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

Microelectronic System Design Group, School of EECE, Newcastle University
Newcastle upon Tyne, NE1 7RU, England, United Kingdom
{Abdullah.baz,delong.shang,fei.xia,alex.yakovlev}@ncl.ac.uk

Abstract. Portable digital systems need to be not just low power but power efficient, as they are powered by small batteries or energy harvesters. Energy harvesting systems tend to provide non-deterministic, rather than stable, power over time. Existing memory systems use delay elements to cope with the resulting problems under different Vdds. However, this introduces large performance penalties, as the delay elements must follow the worst case timing assumption under the worst environment. In this paper, the latency mismatch between memory cells and the corresponding controller using typical delay elements is investigated and found to be highly variable for different Vdd values. A Speed Independent (SI) SRAM memory is then developed which helps avoid such mismatch problems. It can also be used to replace typical delay lines in bundled-data memory banks. A 1Kb SI memory bank is implemented based on this method and analysed in terms of latency and power consumption.

1 Introduction

With the wide advancement of such remote and mobile fields as wireless sensor based applications, microelectronic system design is becoming more energy conscious. This is mainly because of limited energy supply (scavenged energy or small batteries) and excessive heat with its associated thermal stress and device wear-out. At the same time, the high density of devices per die and the ability to operate with a high degree of parallelism, coupled with environmental variations, create almost permanent instability in the supply voltage (cf. Vdd droop), making systems highly power variant. Until recently, low power design was targeted merely at the reduction of capacitance, Vdd and switching activity, whilst maintaining the required system performance. In many current applications, the design objective is shifting to maximizing performance within the dynamic power constraints imposed by the energy supply and consumption regimes. Such systems can no longer be regarded simply as low power systems, but rather as power adaptive or power resilient systems.

Normally, this kind of system has the following properties: 1) it is power efficient, not just low power; 2) its supply voltage is non-deterministic (probably within a known range, which tends to be low) and variable over time. Recently a possible solution has been proposed for this kind of system: a power elastic system, which treats power and energy as dynamic resources [13]. For example, when power is scarce, some of the subsystems can either be powered off or executed under lower supply voltages (Vdds); when power is plentiful, the system can provide high performance. This means that all tasks in a system are managed based on the power resources, performance requirements, and thermal constraints.

When systems are subjected to varying environmental conditions, with voltage and thermal fluctuations, timing tends to be the first issue affected. Most systems are still designed with global clocking, and the design is often made overly pessimistic to avoid failures due to Vdd (timing) variations.

With the advent of nanometre CMOS technology, the continuation of the scaling process is vital to the future development of the digital industries. The International Technology Roadmap for Semiconductors (ITRS) [1] predicts poorer scaling for wires than for transistors in future technology nodes. This makes the worst case timing assumption even more pessimistic, especially under power supply voltage droop [17]. Asynchronous techniques may provide solutions to these problems. Unlike synchronous systems, asynchronous designs can completely remove global clocking and may therefore be more tolerant to timing variations.

The ITRS also predicts that asynchrony will increase with the complexity of on-chip systems. The power, design effort, and reliability cost of global clocks will also make increased asynchrony more attractive. Increasingly complex asynchronous systems or subsystems will thus become more prevalent in future VLSI systems.

In order to fully realize the potential of asynchrony in an environment of variable supply voltage and latencies, system memories may need to be asynchronous together with the computation parts. In this paper, we concentrate on asynchronous SRAM. Our main contributions include: analysing the latency behaviour of SRAM memory systems under different Vdds, developing an asynchronous SRAM memory, and proposing a new method to build delay elements for bundled SRAM memory. We develop a fully Speed Independent (SI) [16] SRAM cell and a bundled SRAM bank technique that uses such SI SRAM cells as delay elements.

The remainder of the paper is organized as follows. Section 2 introduces existing asynchronous SRAM memory structures. Section 3 analyses the effects of different Vdds on the latency of the SRAM memory and its controller. Section 4 gives our asynchronous SRAM solutions and implementations, and proposes a new method to build SI delay elements for SRAM memory. Section 5 demonstrates a memory bank and its measured latency and power consumption. Section 6 gives the conclusions and future work.

2 Existing asynchronous SRAM memory

Several asynchronous SRAM methods have been reported [5,6,7,8,9].

The work in [5] mainly developed a methodology for designing and verifying low power asynchronous SRAM. An SI SRAM cell was alluded to in [5]. This memory cell is different from the conventional six transistor cell [15] and provides the possibility of checking that the data has been stored in memory. The paper, however, does not explain how the cell needs to be controlled, nor does it include a controller design.

[6,7,8,9] focus on asynchronous SRAM memory designs. [6] presents a four-phase handshake asynchronous SRAM design for self-timed systems. It proposes an SI circuit to realize completion detection of reading operations. However, the paper claims that completion detection is not suitable for writing operations: because the critical circuit is the memory cell, it is said to be impractical to add a monitoring sensor to each memory cell to generate completion detection signals. Instead the paper proposes a delay based solution, which uses several delay lines for different delay regions to account for variation. The other works [7,8,9] abandon SI altogether and adopt bundled-data methods based on delays. Noting that the delay of the inverter chains commonly used in conventional SRAM to generate the required precharge and data access timings hardly matches the timing variations of the bit line activities across a wide range of supply voltages [11,12], the authors of [9] used a duplicated column of memory cells instead of inverter chains to serve as delay elements. Although in theory this offers potentially correct delay matching for memory under variable Vdd, as long as process variation [3] is kept under control, the method requires voltage references for precharging and sensing data. The voltage reference is assumed to be adjustable to accommodate the process, voltage, and temperature conditions.

In summary, most existing solutions work under worst case timing assumptions, and some of them also require adjustable and known reference voltages. However, in an energy harvesting environment there may not be any stable reference voltage in the system at all, so anything based on comparators will not work. All voltages in the system may be non-deterministic, and therefore all delays may be non-deterministic.

3 Investigation of SRAM cells in terms of latency

SRAM memory is constructed from SRAM cells, address decoders, a precharge driver, a write driver, a read driver, and a controller. Although different SRAM cell structures exist, here we focus only on the simplest 6T cell [15], which offers the best prospect for use in energy harvesting systems.

Normally memory works based on timing assumptions. However, energy harvesting systems work under a wide range of non-deterministic power, so it is necessary to know how these timing assumptions are affected under different Vdds.

Here we investigate the difference between the latency of the bit line drive and that of the typical inverter-chain delay elements used in controllers, under different Vdds. This potential mismatch has already been pointed out in [11,12]. [11] concludes that the latency of inverter chains degrades as Vdd is reduced. [12] concludes that the bit line drive time becomes a larger and larger fraction of the total access time as Vdd is reduced. But do both types of delay increase at the same rate under the same Vdd reduction?

To emphasize the mismatch, we directly show the difference between the reading/writing times and the latency of the delay elements at various Vdds in the right hand side of Figure 1.

Figure 1 Investigation of delay elements at various Vdds: block diagram (left) and results (right).

The experiment bundles an SRAM cell with an inverter chain, with both operating under the same variable Vdd, as shown in the left hand side of Figure 1. A start signal triggers a reading/writing operation of the cell. This start signal is also connected to the inverter chain as its input. We measure the number of inverters the start signal has passed through when the reading/writing operation finishes. In reading, under the lowest Vdd the memory is about 3 times slower than under the normal Vdd in terms of the number of inverters. In writing, under the lowest Vdd the memory is about 2 times slower than under the normal Vdd in terms of the number of inverters. In other words, both reading from and writing to memory become slower at a much higher rate than inverter chains when Vdd is reduced, and inverter-chain delays do not track memory operation delays when both are under the same variable Vdd. This demonstrates that using standard inverter chains for memory delay bundling would require precise design-time delay characterization and conservative worst-case provisions, which could be 2-3 times more wasteful in some cases.
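To make the argument concrete, the following Python sketch (an illustrative behavioural model, not the paper's simulation setup; the alpha-power-law constants and the extra low-Vdd degradation factors are assumptions) shows how the "number of inverters passed" metric diverges as Vdd drops when cell delays and inverter delays scale differently.

```python
# Illustrative-only model: a simple alpha-power-law gate delay and an assumed
# faster degradation of the bit-line (cell) path at low Vdd. Valid above Vth only.

def gate_delay(vdd, vth=0.35, alpha=1.3, k=1.0):
    """Generic alpha-power-law delay model: delay ~ Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth) ** alpha

def bitline_delay(vdd, kind="read"):
    # Assumption: the cell/bit-line path degrades faster at low Vdd than one
    # inverter; the exponents 1.2 (read) and 0.8 (write) are purely illustrative.
    exponent = 1.2 if kind == "read" else 0.8
    return 8.0 * gate_delay(vdd) * (1.0 / vdd) ** exponent

def inverters_passed(vdd, kind="read"):
    """How many inverter delays fit into one memory operation at this Vdd."""
    return bitline_delay(vdd, kind) / gate_delay(vdd)

for vdd in (1.0, 0.7, 0.5, 0.4):
    print(f"Vdd={vdd:.1f} V: read spans {inverters_passed(vdd, 'read'):.1f} inverters, "
          f"write spans {inverters_passed(vdd, 'write'):.1f} inverters")
```

Under these assumed constants the read operation spans roughly 3 times as many inverter delays at the lowest Vdd as at 1V, and the write roughly 2 times, mirroring the qualitative trend observed above: a fixed-length inverter chain sized at one Vdd does not bound the cell delay at another.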

4 Asynchronous SRAM solutions

The characteristics of energy harvesting systems lead to non-deterministic Vdd and delays across the entire system. To deal with this it is possible to employ asynchrony in the form of memory bundling or completion detection.

For bundling, the above discussion has established that normal delay elements built from inverter chains are unsuitable for memory. A natural extension, using dummy SRAM cells as delay elements, exists [9], but that method carries too many assumptions and requirements, such as known and adjustable reference voltages, which may not be available in energy harvesting systems.

In this section, a fully Speed Independent (SI) SRAM memory is proposed. SI circuits are not affected by gate delays, but wire delays are assumed to be zero or negligible. This is generally not a problem for circuits of small size, such as an individual 6T SRAM cell. However, fully SI solutions for memory banks can be expensive in terms of power and circuit size, and they also reduce performance [16]. A new method in which an asynchronous SRAM memory is bundled with SI SRAM cells serving as delay elements is therefore proposed as a compromise.

4.1 Speed Independent SRAM

Figure 2 Proposed SRAM cell for the SI solution (a), the write driver (b), and the standard 6T cell (c).

As discussed in [6], reading completion detection can be built by monitoring the bit lines. For a 6T cell (Figure 2 (c)), in a read the precharge pulls the two bit lines high, and the read then sets WL high to open the two pass transistors, after which one bit line is discharged to low. This indicates that the data is ready for reading. A write, however, must write each bit of data into its corresponding cell, and it is impractical to monitor all cells. Instead, we still monitor the bit lines. Figure 2 (a) shows our proposed SI SRAM cell.
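As a rough illustration of the read completion idea, the following behavioural sketch (our own abstraction under stated assumptions, not the transistor-level circuit) treats the read as complete as soon as the bit-line pair leaves the precharged (1,1) state.

```python
# Minimal behavioural sketch of read completion detection on a 6T cell:
# both bit lines are precharged high, the word line is raised, and completion
# is the moment one bit line falls, i.e. (BL, BLb) != (1, 1).

class Cell6T:
    def __init__(self, q: int = 0):
        self.q = q                      # stored value; Qb is its complement

    def read(self):
        bl, blb = 1, 1                  # precharge: both bit lines high
        if self.q == 0:                 # WL high: the side storing 0 discharges
            bl = 0
        else:
            blb = 0
        done = (bl, blb) != (1, 1)      # completion: one bit line has dropped
        return self.q, (bl, blb), done

value, bitlines, done = Cell6T(q=1).read()
print(f"read {value}, bit lines {bitlines}, completion detected: {done}")
```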

The cell is based on the normal 6T cell. The new cell duplicates the bit lines and uses six extra transistors to control the two discharge channels. The cell works as follows. The reading operation is the same as in the normal 6T cell. The writing operation is arranged as: 1) precharging the four bit lines to high; 2) enabling the write data on BL and BLb; 3) setting WL high to write the data into the cell; 4) monitoring CD and CDb; 5) when one of them changes to low, the writing is done. The write driver is shown in Figure 2 (b).

After the write driver is enabled, one of BL and BLb is low and the other is floating. If the new data is the same as the data stored in the cell, for example D=1, CD will be discharged (Qb goes to CD). If the new data and the stored data are not the same, for example Q=1 and D=0, BL is low and CDb waits for Qb to go high before discharging. In this situation the low BL is written to Q, but the discharge path opens only after the new value of Q has propagated to Qb.

In effect, this method introduces a read into the writing operation, with the execution order "precharging, writing, reading". Unlike the normal reading operation, however, it uses the duplicated bit lines as a read port to guarantee that the write data has been stored in the cell. The two discharge paths can be regarded as two AND gates implemented in transmission gate logic.
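The following behavioural sketch (our own abstraction of the duplicated-bit-line mechanism; signal handling is illustrative rather than transistor-level) captures the essential property: CD or CDb falls only after the written value has settled on Q/Qb, so its falling edge can serve as the write-done event.

```python
# Behavioural sketch of write completion detection in the cell of Figure 2(a):
# the duplicated lines CD/CDb are precharged high, and one of them discharges
# only once Qb reflects the newly written data.

class SICell:
    def __init__(self, q: int = 0):
        self.q, self.qb = q, 1 - q

    def write(self, d: int):
        cd, cdb = 1, 1                  # step 1: precharge duplicated bit lines high
        # step 2: write driver drives one bit line low (BLb for d=1, BL for d=0)
        # step 3: WL high -> the cell flips (or keeps) its state
        self.q, self.qb = d, 1 - d      # internal feedback settles to the new value
        # steps 4-5: a discharge path opens only once Qb reflects the new data
        if d == 1 and self.qb == 0:
            cd = 0                      # CD discharges -> writing 1 is done
        elif d == 0 and self.qb == 1:
            cdb = 0                      # CDb discharges -> writing 0 is done
        assert (cd, cdb) != (1, 1), "no completion: data not yet stored"
        return cd, cdb

cell = SICell(q=1)
print("write 0 completion (CD, CDb):", cell.write(0))
print("write 1 completion (CD, CDb):", cell.write(1))
```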

We optimize this method based on ideas borrowed from [14]. By changing the execution order to "precharging, reading, writing", the duplicated bit lines in Figure 2 (a) can be removed and the normal 6T SRAM cell in Figure 2 (c) can be used instead, with considerable savings.

SRAM cells depend on control signals. In existing asynchronous SRAMs the control signals PreCharge, WL, and WE are issued based on timing assumptions. Here, an intelligent controller is designed to manage these control signals according to the new execution order. To completely remove timing assumptions, Delay Insensitive (DI) circuits would be the best choice; however, DI circuits are limited in practice [2], and SI circuits suffice here. The block diagram of the controller is shown in Figure 3.

Figure 3 Block diagram of the controller.

There are two handshake protocols ((Wr,Wa) and (Rr,Ra)) connecting the controller with the processing unit and three protocols ((Pre,Dn), (WL,Dn), and (WE,Dn)) with the memory system. The (Wr,Wa) pair is the writing request and its finish signal, and the (Rr,Ra) pair is the reading request and its finish signal. The (Pre,Dn) handshake is the precharge request and its done signal.

The STG specifications of the reading and writing operations are shown in Figure 4. Writing and reading are specified separately. The bit lines are monitored to form a "Dn" signal; for example, after precharging is triggered, the "Dn" signal is generated when (BL,BLb) equals (1,1).

Figure 4 STG specifications of the reading and writing operations (signal transitions on Rr/Ra, Wr/Wa, Pre, WL and WE, synchronised with the monitored bit-line states (BL,BLb) = (1,1) after precharge and (1,0) or (0,1) after the data access).

We combine the two STG specifications. After feeding the combined specification to the Petrify toolkit and optimizing the obtained result manually, the controller shown in Figure 5 is obtained.

Initially, Wr, Rr, x2, and x3 are 0, 0, 1, 0; consequently Wa, Ra, PreCharge, WL, WE, x1, x5, and x6 are 0, 0, 1, 0, 0, 0, 1, 0. The signal x4 has a "don't care" value initially.

Figure 5 Possible implementation of the controller.

We use the writing operation as an example to show how the controller works. After the address and data are ready, the Wr signal is issued. Wr goes through gate 7 and then through to gate 10. As x2 is 1, x1 becomes 1, which drives PreCharge to 0. The low PreCharge signal opens the P-type transistors in the precharge drivers. PreCharge also goes to the SR latch formed by gates 6 and 8, resetting the latch while PreCharge is low. After the bit lines are 1 and the SR latch is reset, x1 changes to 0 and PreCharge is removed. After PreCharge is removed, WL is generated, which opens the pass transistors in the 6T cell, and the data stored in the cell is sent to the bit lines. This makes x4 equal to 1. As the SR latch has been reset, x6 becomes 1 and then WE becomes 1, which opens the write driver. If the new data is the same as the data stored in the cell, either (D,BL)=(1,1) or (Db,BLb)=(1,1), and Wa is generated to notify the data processing unit that the data has been written into the cell. If, for example, the new data is 1 and the stored data is 0, then after the write driver is opened BLb is low, Qb is discharged to 0 and Q is charged to 1; that 1 transfers to BL, after which the writing is finished. After Wa is generated, Wr is removed, and only after the controller has returned to its initial state is Wa withdrawn, ready for new reading/writing operations. Here data is assumed to be withdrawn only after Wa is removed. Clearly there is no need for duplicated bit lines in the memory cell in this method.
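The event order described above can be summarised by the following high-level sketch (a hypothetical trace model derived from the textual description, not a gate-level rendering of Figure 5); each phase advances only when the monitored bit lines confirm the previous one has completed.

```python
# High-level event-order sketch of one write cycle in the "precharge, read, write"
# scheme; control advances on observed bit-line states rather than fixed delays.

def write_cycle(cell_q: int, data: int):
    events = ["Wr+"]                          # write request from the processing unit
    events.append("PreCharge- (active low)")  # open precharge P-type transistors
    bl, blb = 1, 1                            # bit lines observed high -> precharge done
    events.append("PreCharge+ (removed)")
    events.append("WL+")                      # pass transistors open: cell reads onto bit lines
    bl, blb = cell_q, 1 - cell_q
    events.append(f"stored data on bit lines: (BL,BLb)=({bl},{blb})")
    events.append("WE+")                      # write driver enabled
    cell_q = data                             # cell takes the new value
    bl, blb = data, 1 - data                  # bit lines follow the written value
    # completion: driven data and bit line agree, i.e. (D,BL)=(1,1) or (Db,BLb)=(1,1)
    assert (data, bl) == (1, 1) or (1 - data, blb) == (1, 1)
    events.append("Wa+")                      # acknowledge: data is stored in the cell
    events.append("Wr- ; controller returns to initial state ; Wa-")
    return cell_q, events

q, trace = write_cycle(cell_q=0, data=1)
print("\n".join(trace))
```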

For memory banks, gate 1 is duplicated; the number of duplicated gates equals the number of bits in the memory word. The inputs of each gate are the pair of bit lines corresponding to one bit of the memory word. All outputs of the duplicated gates are collected by a C-element, whose output replaces x4. Gate 5 is also duplicated; all outputs of those duplicated gates are collected by another C-element, whose output is the new Wa signal.
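For illustration, a behavioural model of this merging step might look as follows (assuming the conventional Muller C-element behaviour; the names and word width are illustrative).

```python
# Per-bit completion signals are merged by a C-element: its output changes only
# when all inputs agree, otherwise it holds its previous value.

class CElement:
    def __init__(self, out: int = 0):
        self.out = out

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out                  # otherwise: hold the previous value

# One "done" detector (duplicated gate) per bit of the word; the bank-level
# done signal rises only when every bit has completed.
word_done = CElement()
per_bit_done = [1, 1, 1, 0]              # last bit has not completed yet
print(word_done.update(per_bit_done))    # -> 0: still waiting
per_bit_done[3] = 1
print(word_done.update(per_bit_done))    # -> 1: all bits complete
```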

Here an SI SRAM cell is investigated under variable Vdd. In this experiment we use a sinusoidal Vdd starting at a low level as an example. The lowest Vdd level is 300mV, the highest is 1V, and the sinusoid's frequency is 700KHz. Figure 6 shows the obtained waveforms.

Figure 6 Waveforms under variable Vdd.

This experiment consists of a writing 0 operation followed by a reading operation, and then a writing 1 operation followed by a reading operation. As Vdd is variable, each operation takes a different amount of time. For example, the first writing works under a lower Vdd; it takes a long time to precharge, write the data and then generate the Wa (WAck) signal. The second writing works under the highest Vdd, so it completes and generates the WAck signal very quickly. This experiment also demonstrates that the SI SRAM structure works under continuously variable Vdd as expected.

4.2 New bundled SRAM based on SI delay elements

A fully SI solution for large memory banks carries penalties in performance, area and power, because the completion detection logic consumes too much area, time and power. Here a new bundled method is proposed to overcome this problem, as sketched after this section's description. We choose a worst-case column in a memory bank and fill it with SI SRAM cells. Normally the far-end column is the worst one in a memory bank, and we monitor only the bit lines of this column. This means that gate 1 and gate 5 in the SI controller are connected to the bit lines of this column. The memory cells of the other columns use the same control signals generated by the controller but do not provide feedback information. The far-end column is thus used as the delay element, and the other columns are bundled with it.

Compared to the existing method, which duplicates a column of SRAM cells, the new method employs neither duplicated cells nor reference voltages. The delay elements, being SI SRAM cells of the same kind as the cells used elsewhere in the bank, should provide correct delay tracking over a wide Vdd range.
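A structural sketch of this bundling arrangement follows (class and signal names are our own illustrative assumptions, not the implemented netlist): only the far-end column feeds completion detection, while the other columns share the control signals and return no feedback.

```python
# Structural sketch of the bundled bank: one SI (monitored) column, the rest bundled.

class Column:
    def __init__(self, monitored: bool):
        self.monitored = monitored

    def write(self, bit: int) -> bool:
        # Perform the write; only the monitored (SI) column reports completion.
        return self.monitored            # True -> contributes a "done" indication

class BundledBank:
    def __init__(self, word_bits: int = 16):
        # Far-end column (highest index) is the SI "delay element" column.
        self.columns = [Column(monitored=(i == word_bits - 1)) for i in range(word_bits)]

    def write_word(self, word):
        done = False
        for col, bit in zip(self.columns, word):
            done |= col.write(bit)        # same control signals go to every column
        return done                       # completion comes from the far-end column only

bank = BundledBank()
print("write completed (signalled by far-end SI column):", bank.write_word([0, 1] * 8))
```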

5 1Kb memory bank design and measurements

Using the proposed circuit, a 1Kb (64x16) SI SRAM is implemented with the Cadence toolkit in the UMC 90nm CMOS technology. The design is verified with analogue simulations using SPECTRE, provided in the toolkit. The chip is fully functional from as low as 190mV up to 1V. The SRAM was simulated by writing 16 bits to the chip, then reading them back and latching the data into SI latches.

Figure 7 Energy consumption of the SRAM.

Figure 8 Access time of the SRAM.

The energy consumption and the worst case latency under different Vdds from 190mV to 1V are also measured.

Figure 7 shows the energy consumption of the chip during reading and writing when the data is 1 and 0. The four curves show that the minimum energy point of the chip is at 400mV-500mV. The SRAM consumes 5.8pJ at 1V when writing a 16-bit word to the memory and 1.9pJ at 400mV.

Figure 8 shows the access time of the SRAM. The access time is the latency from the reading/writing request to the done signal. For example, under 1V the worst access times for writing and reading are 5.4ns and 3.0ns; under 190mV they are 1.6µs and 4.0µs respectively.

6 Conclusions and future work

In this paper we focus on SRAM memory design for energy harvesting systems. Normally this kind of system works under a variable power supply and must be power efficient, not just low power. Under such a non-deterministic power supply, existing asynchronous SRAM based on bundled delays suffers large penalties and is impractical because of its need for voltage references.

The latency difference between SRAM memory and its controller under different Vdds is investigated. With reducing Vdd, the latency mismatch grows if traditional inverter-chain delays are used. Under 190mV, the mismatch is more than twice as large as under the normal 1V Vdd in 90nm technology.

An SI SRAM is proposed and designed. The SRAM has a simple interface, similar to that of a normal SRAM, including data, address, reading request, reading acknowledgement, writing request, and writing acknowledgement. The internal memory control signals are fully triggered by the corresponding events of the memory system, which works by monitoring the bit lines of the memory.

A new method is proposed to implement an SI writing operation based on ideas from [14]. This solves the problem of completion detection for writing operations, previously considered impractical or impossible.

A 1Kb (64x16) SI SRAM is implemented using the Cadence toolkit. The simulation results show the SRAM working as expected from 190mV to 1V. The energy consumption and the worst case performance are also measured, and the measurements show that the SRAM has acceptable characteristics.

However, as the completion detection logic in the SI SRAM is expensive in terms of area, performance, and power, a compromise SRAM is designed as well, based on the modified SI SRAM. This new SRAM is based on the bundled delay principle. Unlike existing asynchronous SRAM solutions, however, a column (the worst column, if it can be identified) of SI SRAM cells doubles as the delay elements. This column should be slower than the other columns anyway, because the completion detection elements take extra time. The other columns of memory cells are bundled with this column.

So far we have only investigated basic asynchronous SRAM design. Other issues, such as static noise margin, readability, stability, etc., need further study. These are the targets of our future research. We will also investigate multi-port asynchronous SRAM in the context of variable and non-deterministic Vdd.

Acknowledgement

This work is supported by the EPSRC project Holistic (EP/G066728/1) at Newcastle University. During this work we had very helpful discussions with our colleagues, Dr Alex Bystrov and other members of the MSD research group. The authors would like to express their thanks to them.


References

[1] International Technology Roadmap for Semiconductors: http://public.itrs.net/.
[2] Alain J. Martin, "The limitations to delay-insensitivity in asynchronous circuits", in William J. Dally, ed., Advanced Research in VLSI, pp. 263-278, MIT Press, 1990.
[3] D. Sylvester, K. Agarwal, S. Shah, "Variability in nanometer CMOS: Impact, analysis, and minimization", Integration, the VLSI Journal, No. 41, pp. 319-339, 2008.
[4] H. Saito, A. Kondratyev, J. Cortadella, L. Lavagno and A. Yakovlev, "What is the cost of delay insensitivity?", Proc. ICCAD'99, San Jose, CA, pp. 316-323, Nov. 1999.
[5] L. S. Nielsen and J. Staunstrup, "Design and verification of a self-timed RAM", Proc. of the IFIP International Conference on VLSI, 1995.
[6] Vincent Wing-Yun Sit, et al., "A four-phase handshaking asynchronous static RAM design for self-timed systems", IEEE Journal of Solid-State Circuits, pp. 90-96, Vol. 34, No. 1, January 1999.
[7] Tan Soon-Hwei, et al., "A 160MHz 45mW asynchronous dual-port 1Mb CMOS SRAM", Proc. of the IEEE Conference on Electron Devices and Solid-State Circuits, 2005.
[8] J. Dama and A. Lines, "GHz asynchronous SRAM in 65nm", Proc. of the 15th IEEE Symposium on Asynchronous Circuits and Systems, 2009.
[9] M. F. Chang, S. M. Yang, and K. T. Chen, "Wide Vdd embedded asynchronous SRAM with dual-mode self-timed technique for dynamic voltage systems", IEEE Trans. on Circuits and Systems I, pp. 1657-1667, Vol. 56, No. 8, August 2009.
[10] A. Wang and A. Chandrakasan, "A 180mV subthreshold FFT processor using a minimum energy design methodology", IEEE Journal of Solid-State Circuits, pp. 310-319, Vol. 40, No. 1, January 2005.
[11] A. Sekiyama, et al., "A 1-V operating 256Kb full CMOS SRAM", IEEE Journal of Solid-State Circuits, pp. 776-782, Vol. 27, No. 5, May 1992.
[12] B. S. Amrutur and M. A. Horowitz, "A replica technique for wordline and sense control in low-power SRAM's", IEEE Journal of Solid-State Circuits, pp. 1208-1219, Vol. 33, No. 8, August 1998.
[13] Andrey Mokhov, et al., "Power elastic systems: Discrete event control, concurrency reduction and hardware implementation", Tech. Report NCL-EECE-MSD-TR-2009-151, School of EECE, Newcastle University.
[14] V. Varshavsky, et al., "A self-timed random access memory", USSR Patent, 1988.
[15] Bo Zhai, et al., "A sub-200mV 6T SRAM in 0.13um CMOS", Proc. of ISSCC, 2007.
[16] Jens Sparsø and Steve Furber, "Principles of Asynchronous Circuit Design: A Systems Perspective", Kluwer Academic Publishers, Boston, 2001.
[17] V. Reddi, M. Gupta, G. Holloway, et al., "Voltage emergency prediction: A signature-based approach to reducing voltage emergencies", in Proc. of the International Symposium on High-Performance Computer Architecture (HPCA-15), 2009.
