
Self-Timed SRAM for Energy Harvesting Systems

Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

Microelectronic System Design Group, School of EECE, Newcastle University
Newcastle upon Tyne, NE1 7RU, England, United Kingdom
{Abdullah.baz,delong.shang,fei.xia,alex.yakovlev}@ncl.ac.uk

Abstract. Portable digital systems need to be not just low power but power efficient, as they are powered by small batteries or energy harvesters. Energy harvesting systems tend to provide non-deterministic, rather than stable, power over time. Existing memory systems use delay elements to cope with the resulting problems under different Vdds. However, this introduces large performance penalties, as the delay elements must follow the worst case timing assumption under the worst environment. In this paper, the latency mismatch between memory cells and the corresponding controller using typical delay elements is investigated and found to be highly variable for different Vdd values. A Speed Independent (SI) SRAM memory is then developed which helps avoid such mismatch problems. It can also be used to replace typical delay lines in bundled-data memory banks. A 1Kb SI memory bank is implemented based on this method and analysed in terms of latency and power consumption.

1 Introduction

With the wide advancement of such remote and mobile fields as wireless sensor based applications, microelectronic system design is becoming more energy conscious. This is mainly because of limited energy supply (scavenged energy or small batteries) and excessive heat with its associated thermal stress and device wear-out. At the same time, the high density of devices per die and the ability to operate with a high degree of parallelism, coupled with environmental variations, create almost permanent instability in the supply voltage (cf. Vdd droop), making systems highly power variant. Until recently, low power design was targeted merely at the reduction of capacitance, Vdd and switching activity, whilst maintaining the required system performance. In many current applications, the design objective is shifting to maximizing performance within the dynamic power constraints imposed by the energy supply and consumption regimes. Such systems can no longer be regarded simply as low power systems, but rather as power adaptive or power resilient systems.

Normally, this kind of system has the following properties: 1) it is power efficient, not just low power; 2) its supply voltage is non-deterministic (probably within a known range, which tends to be low) and variable over time. Recently a possible solution has been proposed for this kind of system: a power elastic system, which treats power and energy as dynamic resources [13]. For example, when power is scarce, some of the subsystems can either be powered off or executed under lower supply voltages (Vdds); when power is plentiful, the system can provide high performance. This means that all tasks in a system are managed based on the power resources, performance requirements, and thermal constraints.

When systems are subjected to varying environmental conditions, with voltage and thermal fluctuations, timing tends to be the first issue affected. Most systems are still designed with global clocking, and the design is often made overly pessimistic to avoid failures due to Vdd (timing) variations.

With the advent of nanometre CMOS technology, the continuation of the scaling process is vital to the future development of the digital industries. The International Technology Roadmap for Semiconductors (ITRS) [1] predicts poorer scaling for wires than for transistors in future technology nodes. This makes the worst case timing assumption even more pessimistic, especially under power supply voltage droop [17]. Asynchronous techniques may provide solutions to these problems. Unlike synchronous systems, asynchronous designs can completely remove global clocking and may therefore be more tolerant to timing variations.

The ITRS also predicts that asynchrony will increase with the complexity of on-chip systems. The power, design effort, and reliability cost of global clocks will also make increased asynchrony more attractive. Increasingly complex asynchronous systems or subsystems will thus become more prevalent in future VLSI systems.

In order to fully realize the potential of asynchrony in an environment of variable supply voltage and latencies, system memories may need to be asynchronous together with the computation parts. In this paper, we concentrate on asynchronous SRAM. Our main contributions include: analysing the latency behaviour of SRAM memory systems under different Vdds, developing an asynchronous SRAM memory, and proposing a new method to build delay elements for bundled SRAM memory. We develop a fully Speed Independent (SI) [16] SRAM cell and a bundled SRAM bank technique that uses such SI SRAM cells as delay elements.

The remainder of the paper is organized as follows. Section 2 introduces existing asynchronous SRAM memory structures. Section 3 analyses the effects of different Vdds on the latency of the SRAM memory and its controller. Section 4 gives our asynchronous SRAM solutions and implementations, and proposes a new method to build SI delay elements for SRAM memory. Section 5 demonstrates a memory bank and its measured latency and power consumption. Section 6 gives the conclusions and future work.

2 Existing asynchronous SRAM memory

Several asynchronous SRAM methods have been reported [5,6,7,8,9].

The work in [5] mainly developed a methodology for designing and verifying low power asynchronous SRAM. An SI SRAM cell was alluded to in [5]. This memory cell is different from the conventional six transistor cell [15] and provides the possibility of checking that the data has been stored in memory. The paper, however, does not explain how the cell needs to be controlled, nor does it include a controller design.

[6,7,8,9] focus on asynchronous SRAM memory designs. [6] presents a four-phase handshake asynchronous SRAM design for self-timed systems. It proposes an SI circuit to realize completion detection of reading operations. However, the paper claims that completion detection is not suitable for writing operations: because the critical circuit is the memory cell, it is said to be impractical to add a monitoring sensor to each memory cell to generate completion detection signals. Instead the paper proposes a delay based solution, which uses several delay lines for different delay regions to account for variation. The other works [7,8,9] abandon SI altogether and adopt bundled-data methods based on delays. Noting that the delay of the inverter chains commonly used in conventional SRAM to generate the required precharge and data access timings hardly matches the timing variations of the bit line activities across a wide range of supply voltages [11,12], the authors of [9] used a duplicated column of memory cells instead of inverter chains to serve as delay elements. Although in theory this offers potentially correct delay matching for memory under variable Vdd, as long as process variation [3] is kept under control, the method requires voltage references for precharging and sensing data. The voltage reference is assumed to be adjustable to accommodate the process, voltage, and temperature conditions.

In summary, most existing solutions work under worst case timing assumptions, and some of them also require adjustable and known reference voltages. However, in an energy harvesting environment there may not be any stable reference voltage in the system at all, so anything based on comparators will not work. All voltages in the system may be non-deterministic, and therefore all delays may be non-deterministic.

3 Investigation of SRAM cells in terms of latency

SRAM memory is constructed from SRAM cells, address decoders, a precharge driver, a write driver, a read driver, and a controller. Although different SRAM cell structures exist, here we focus only on the simplest 6T cell [15], which offers the best prospect for use in energy harvesting systems.

Normally memory works based on timing assumptions. However, energy harvesting systems work under a wide range of non-deterministic power, so it is necessary to know how these timing assumptions are affected under different Vdds.

Here we investigate the difference between the latency of the bit line drive and that of the typical inverter-chain delay elements used in controllers, under different Vdds. This potential mismatch has already been pointed out in [11,12]. [11] concludes that the latency of inverter chains degrades as Vdd is reduced. [12] concludes that the bit line drive time becomes a larger and larger fraction of the total access time as Vdd is reduced. But do both types of delay increase at the same rate under the same Vdd reduction?

To emphasize the mismatch, we directly show the difference between the reading/writing times and the latency of the delay elements at various Vdds in the right hand side of Figure 1.

Figure 1 Investigation of delay elements at various Vdds: block diagram (left) and results (right).

The experiment bundles an SRAM cell with an inverter chain, with both operating under the same variable Vdd, as shown in the left hand side of Figure 1. A start signal triggers a reading/writing operation of the cell. This start signal is also connected to the inverter chain as its input. We measure the number of inverters the start signal has passed through when the reading/writing operation finishes. In reading, under the lowest Vdd the memory is about 3 times slower than under the normal Vdd in terms of the number of inverters. In writing, under the lowest Vdd the memory is about 2 times slower than under the normal Vdd in terms of the number of inverters. In other words, both reading from and writing to memory become slower at a much higher rate than inverter chains when Vdd is reduced, and inverter-chain delays do not track memory operation delays when both are under the same variable Vdd. This demonstrates that using standard inverter chains for memory delay bundling would require precise design-time delay characterization and conservative worst-case provisions, which could be 2-3 times more wasteful in some cases.
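To make the argument concrete, the following Python sketch (an illustrative behavioural model, not the paper's simulation setup; the alpha-power-law constants and the extra low-Vdd degradation factors are assumptions) shows how the "number of inverters passed" metric diverges as Vdd drops when cell delays and inverter delays scale differently.

```python
# Illustrative-only model: a simple alpha-power-law gate delay and an assumed
# faster degradation of the bit-line (cell) path at low Vdd. Valid above Vth only.

def gate_delay(vdd, vth=0.35, alpha=1.3, k=1.0):
    """Generic alpha-power-law delay model: delay ~ Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth) ** alpha

def bitline_delay(vdd, kind="read"):
    # Assumption: the cell/bit-line path degrades faster at low Vdd than one
    # inverter; the exponents 1.2 (read) and 0.8 (write) are purely illustrative.
    exponent = 1.2 if kind == "read" else 0.8
    return 8.0 * gate_delay(vdd) * (1.0 / vdd) ** exponent

def inverters_passed(vdd, kind="read"):
    """How many inverter delays fit into one memory operation at this Vdd."""
    return bitline_delay(vdd, kind) / gate_delay(vdd)

for vdd in (1.0, 0.7, 0.5, 0.4):
    print(f"Vdd={vdd:.1f} V: read spans {inverters_passed(vdd, 'read'):.1f} inverters, "
          f"write spans {inverters_passed(vdd, 'write'):.1f} inverters")
```

Under these assumed constants the read operation spans roughly 3 times as many inverter delays at the lowest Vdd as at 1V, and the write roughly 2 times, mirroring the qualitative trend observed above: a fixed-length inverter chain sized at one Vdd does not bound the cell delay at another.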

4 Asynchronous SRAM solutions

The characteristics of energy harvesting systems lead to non-deterministic Vdd and delays across the entire system. To deal with this it is possible to employ asynchrony in the form of memory bundling or completion detection.

For bundling, the above discussion has established that normal delay elements built from inverter chains are unsuitable for memory. A natural extension, using dummy SRAM cells as delay elements, exists [9], but that method carries too many assumptions and requirements, such as known and adjustable reference voltages, which may not be available in energy harvesting systems.

In this section, a fully Speed Independent (SI) SRAM memory is proposed. SI circuits are not affected by gate delays, but wire delays are assumed to be zero or negligible. This is generally not a problem for circuits of small size, such as an individual 6T SRAM cell. However, fully SI solutions for memory banks can be expensive in terms of power and circuit size, and they also reduce performance [16]. A new method in which an asynchronous SRAM memory is bundled with SI SRAM cells serving as delay elements is therefore proposed as a compromise.

4.1 Speed Independent SRAM

Figure 2 Proposed SRAM cell for the SI solution (a), the write driver (b), and the standard 6T cell (c).

As discussed in [6], reading completion detection can be built by monitoring the bit lines. For a 6T cell (Figure 2 (c)), in a read the precharge pulls the two bit lines high, and the read then sets WL high to open the two pass transistors, after which one bit line is discharged to low. This indicates that the data is ready for reading. A write, however, must write each bit of data into its corresponding cell, and it is impractical to monitor all cells. Instead, we still monitor the bit lines. Figure 2 (a) shows our proposed SI SRAM cell.
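As a rough illustration of the read completion idea, the following behavioural sketch (our own abstraction under stated assumptions, not the transistor-level circuit) treats the read as complete as soon as the bit-line pair leaves the precharged (1,1) state.

```python
# Minimal behavioural sketch of read completion detection on a 6T cell:
# both bit lines are precharged high, the word line is raised, and completion
# is the moment one bit line falls, i.e. (BL, BLb) != (1, 1).

class Cell6T:
    def __init__(self, q: int = 0):
        self.q = q                      # stored value; Qb is its complement

    def read(self):
        bl, blb = 1, 1                  # precharge: both bit lines high
        if self.q == 0:                 # WL high: the side storing 0 discharges
            bl = 0
        else:
            blb = 0
        done = (bl, blb) != (1, 1)      # completion: one bit line has dropped
        return self.q, (bl, blb), done

value, bitlines, done = Cell6T(q=1).read()
print(f"read {value}, bit lines {bitlines}, completion detected: {done}")
```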

The cell is based on the normal 6T cell. The new cell duplicates the bit lines and uses six extra transistors to control the two discharge channels. The cell works as follows. The reading operation is the same as in the normal 6T cell. The writing operation is arranged as: 1) precharging the four bit lines to high; 2) enabling the write data on BL and BLb; 3) setting WL high to write the data into the cell; 4) monitoring CD and CDb; 5) when one of them changes to low, the writing is done. The write driver is shown in Figure 2 (b).

After the write driver is enabled, one of BL and BLb is low and the other is floating. If the new data is the same as the data stored in the cell, for example D=1, CD will be discharged (Qb goes to CD). If the new data and the stored data are not the same, for example Q=1 and D=0, BL is low and CDb waits for Qb to go high before discharging. In this situation the low BL is written to Q, but the discharge path opens only after the new value of Q has propagated to Qb.

In effect, this method introduces a read into the writing operation, with the execution order "precharging, writing, reading". Unlike the normal reading operation, however, it uses the duplicated bit lines as a read port to guarantee that the write data has been stored in the cell. The two discharge paths can be regarded as two AND gates implemented in transmission gate logic.
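The following behavioural sketch (our own abstraction of the duplicated-bit-line mechanism; signal handling is illustrative rather than transistor-level) captures the essential property: CD or CDb falls only after the written value has settled on Q/Qb, so its falling edge can serve as the write-done event.

```python
# Behavioural sketch of write completion detection in the cell of Figure 2(a):
# the duplicated lines CD/CDb are precharged high, and one of them discharges
# only once Qb reflects the newly written data.

class SICell:
    def __init__(self, q: int = 0):
        self.q, self.qb = q, 1 - q

    def write(self, d: int):
        cd, cdb = 1, 1                  # step 1: precharge duplicated bit lines high
        # step 2: write driver drives one bit line low (BLb for d=1, BL for d=0)
        # step 3: WL high -> the cell flips (or keeps) its state
        self.q, self.qb = d, 1 - d      # internal feedback settles to the new value
        # steps 4-5: a discharge path opens only once Qb reflects the new data
        if d == 1 and self.qb == 0:
            cd = 0                      # CD discharges -> writing 1 is done
        elif d == 0 and self.qb == 1:
            cdb = 0                      # CDb discharges -> writing 0 is done
        assert (cd, cdb) != (1, 1), "no completion: data not yet stored"
        return cd, cdb

cell = SICell(q=1)
print("write 0 completion (CD, CDb):", cell.write(0))
print("write 1 completion (CD, CDb):", cell.write(1))
```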

We optimize this method based on ideas borrowed from [14]. By changing the execution order to "precharging, reading, writing", the duplicated bit lines in Figure 2 (a) can be removed and the normal 6T SRAM cell in Figure 2 (c) can be used instead, with considerable savings.

SRAM cells depend on control signals. In existing asynchronous SRAMs the control signals PreCharge, WL, and WE are issued based on timing assumptions. Here, an intelligent controller is designed to manage these control signals according to the new execution order. To completely remove timing assumptions, Delay Insensitive (DI) circuits would be the best choice; however, DI circuits are limited in practice [2], and SI circuits suffice here. The block diagram of the controller is shown in Figure 3.

Figure 3 Block diagram of the controller.

There are two handshake protocols ((Wr,Wa) and (Rr,Ra)) connecting the controller with the processing unit and three protocols ((Pre,Dn), (WL,Dn), and (WE,Dn)) with the memory system. The (Wr,Wa) pair is the writing request and its finish signal, and the (Rr,Ra) pair is the reading request and its finish signal. The (Pre,Dn) handshake is the precharge request and its done signal.

The STG specifications of the reading and writing operations are shown in Figure 4. Writing and reading are specified separately. The bit lines are monitored to form a "Dn" signal; for example, after precharging is triggered, the "Dn" signal is generated when (BL,BLb) equals (1,1).

Figure 4 STG specifications of the reading and writing operations (signal transitions on Rr/Ra, Wr/Wa, Pre, WL and WE, synchronised with the monitored bit-line states (BL,BLb) = (1,1) after precharge and (1,0) or (0,1) after the data access).

We combine the two STG specifications. After feeding the combined specification to the Petrify toolkit and optimizing the obtained result manually, the controller shown in Figure 5 is obtained.

Initially, Wr, Rr, x2, and x3 are 0, 0, 1, 0; consequently Wa, Ra, PreCharge, WL, WE, x1, x5, and x6 are 0, 0, 1, 0, 0, 0, 1, 0. The signal x4 has a "don't care" value initially.

Figure 5 Possible implementation of the controller.

We use the writing operation as an example to show how the controller works. After the address and data are ready, the Wr signal is issued. Wr goes through gate 7 and then through to gate 10. As x2 is 1, x1 becomes 1, which drives PreCharge to 0. The low PreCharge signal opens the P-type transistors in the precharge drivers. PreCharge also goes to the SR latch formed by gates 6 and 8, resetting the latch while PreCharge is low. After the bit lines are 1 and the SR latch is reset, x1 changes to 0 and PreCharge is removed. After PreCharge is removed, WL is generated, which opens the pass transistors in the 6T cell, and the data stored in the cell is sent to the bit lines. This makes x4 equal to 1. As the SR latch has been reset, x6 becomes 1 and then WE becomes 1, which opens the write driver. If the new data is the same as the data stored in the cell, either (D,BL)=(1,1) or (Db,BLb)=(1,1), and Wa is generated to notify the data processing unit that the data has been written into the cell. If, for example, the new data is 1 and the stored data is 0, then after the write driver is opened BLb is low, Qb is discharged to 0 and Q is charged to 1; that 1 transfers to BL, after which the writing is finished. After Wa is generated, Wr is removed, and only after the controller has returned to its initial state is Wa withdrawn, ready for new reading/writing operations. Here data is assumed to be withdrawn only after Wa is removed. Clearly there is no need for duplicated bit lines in the memory cell in this method.
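The event order described above can be summarised by the following high-level sketch (a hypothetical trace model derived from the textual description, not a gate-level rendering of Figure 5); each phase advances only when the monitored bit lines confirm the previous one has completed.

```python
# High-level event-order sketch of one write cycle in the "precharge, read, write"
# scheme; control advances on observed bit-line states rather than fixed delays.

def write_cycle(cell_q: int, data: int):
    events = ["Wr+"]                          # write request from the processing unit
    events.append("PreCharge- (active low)")  # open precharge P-type transistors
    bl, blb = 1, 1                            # bit lines observed high -> precharge done
    events.append("PreCharge+ (removed)")
    events.append("WL+")                      # pass transistors open: cell reads onto bit lines
    bl, blb = cell_q, 1 - cell_q
    events.append(f"stored data on bit lines: (BL,BLb)=({bl},{blb})")
    events.append("WE+")                      # write driver enabled
    cell_q = data                             # cell takes the new value
    bl, blb = data, 1 - data                  # bit lines follow the written value
    # completion: driven data and bit line agree, i.e. (D,BL)=(1,1) or (Db,BLb)=(1,1)
    assert (data, bl) == (1, 1) or (1 - data, blb) == (1, 1)
    events.append("Wa+")                      # acknowledge: data is stored in the cell
    events.append("Wr- ; controller returns to initial state ; Wa-")
    return cell_q, events

q, trace = write_cycle(cell_q=0, data=1)
print("\n".join(trace))
```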

For memory banks, gate 1 is duplicated; the number of duplicated gates equals the number of bits in the memory word. The inputs of each gate are the pair of bit lines corresponding to one bit of the memory word. All outputs of the duplicated gates are collected by a C-element, whose output replaces x4. Gate 5 is also duplicated; all outputs of those duplicated gates are collected by another C-element, whose output is the new Wa signal.
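For illustration, a behavioural model of this merging step might look as follows (assuming the conventional Muller C-element behaviour; the names and word width are illustrative).

```python
# Per-bit completion signals are merged by a C-element: its output changes only
# when all inputs agree, otherwise it holds its previous value.

class CElement:
    def __init__(self, out: int = 0):
        self.out = out

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out                  # otherwise: hold the previous value

# One "done" detector (duplicated gate) per bit of the word; the bank-level
# done signal rises only when every bit has completed.
word_done = CElement()
per_bit_done = [1, 1, 1, 0]              # last bit has not completed yet
print(word_done.update(per_bit_done))    # -> 0: still waiting
per_bit_done[3] = 1
print(word_done.update(per_bit_done))    # -> 1: all bits complete
```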

Here an SI SRAM cell is investigated under variable Vdd. In this experiment we use a sinusoidal Vdd starting at a low level as an example. The lowest Vdd level is 300mV, the highest is 1V, and the sinusoid's frequency is 700KHz. Figure 6 shows the obtained waveforms.

Figure 6 Waveforms under variable Vdd.

This experiment consists of a writing 0 operation followed by a reading operation, and then a writing 1 operation followed by a reading operation. As Vdd is variable, each operation takes a different amount of time. For example, the first writing works under a lower Vdd; it takes a long time to precharge, write the data and then generate the Wa (WAck) signal. The second writing works under the highest Vdd, so it completes and generates the WAck signal very quickly. This experiment also demonstrates that the SI SRAM structure works under continuously variable Vdd as expected.

4.2 New bundled SRAM based on SI delay elements

A fully SI solution for large memory banks carries penalties in performance, area and power, because the completion detection logic consumes too much area, time and power. Here a new bundled method is proposed to overcome this problem, as sketched after this section's description. We choose a worst-case column in a memory bank and fill it with SI SRAM cells. Normally the far-end column is the worst one in a memory bank, and we monitor only the bit lines of this column. This means that gate 1 and gate 5 in the SI controller are connected to the bit lines of this column. The memory cells of the other columns use the same control signals generated by the controller but do not provide feedback information. The far-end column is thus used as the delay element, and the other columns are bundled with it.

Compared to the existing method, which duplicates a column of SRAM cells, the new method employs neither duplicated cells nor reference voltages. The delay elements, being SI SRAM cells of the same kind as the cells used elsewhere in the bank, should provide correct delay tracking over a wide Vdd range.
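A structural sketch of this bundling arrangement follows (class and signal names are our own illustrative assumptions, not the implemented netlist): only the far-end column feeds completion detection, while the other columns share the control signals and return no feedback.

```python
# Structural sketch of the bundled bank: one SI (monitored) column, the rest bundled.

class Column:
    def __init__(self, monitored: bool):
        self.monitored = monitored

    def write(self, bit: int) -> bool:
        # Perform the write; only the monitored (SI) column reports completion.
        return self.monitored            # True -> contributes a "done" indication

class BundledBank:
    def __init__(self, word_bits: int = 16):
        # Far-end column (highest index) is the SI "delay element" column.
        self.columns = [Column(monitored=(i == word_bits - 1)) for i in range(word_bits)]

    def write_word(self, word):
        done = False
        for col, bit in zip(self.columns, word):
            done |= col.write(bit)        # same control signals go to every column
        return done                       # completion comes from the far-end column only

bank = BundledBank()
print("write completed (signalled by far-end SI column):", bank.write_word([0, 1] * 8))
```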

5 1Kb memory bank design and measurements

Using the proposed circuit, a 1Kb (64x16) SI SRAM is implemented with the Cadence toolkit in the UMC 90nm CMOS technology. The design is verified with analogue simulations using SPECTRE, provided in the toolkit. The chip is fully functional from as low as 190mV up to 1V. The SRAM was simulated by writing 16 bits to the chip, then reading them back and latching the data into SI latches.

Figure 7 Energy consumption of the SRAM.

Figure 8 Access time of the SRAM.

The energy consumption and the worst case latency under different Vdds from 190mV to 1V are also measured.

Figure 7 shows the energy consumption of the chip during reading and writing when the data is 1 and 0. The four curves show that the minimum energy point of the chip is at 400mV-500mV. The SRAM consumes 5.8pJ at 1V when writing a 16-bit word to the memory and 1.9pJ at 400mV.

Figure 8 shows the access time of the SRAM. The access time is the latency from the reading/writing request to the done signal. For example, under 1V the worst access times for writing and reading are 5.4ns and 3.0ns; under 190mV they are 1.6µs and 4.0µs respectively.

6 Conclusions and future work

In this paper we focus on SRAM memory design for energy harvesting systems. Normally this kind of system works under a variable power supply and must be power efficient, not just low power. Under such a non-deterministic power supply, existing asynchronous SRAM based on bundled delays suffers large penalties and is impractical because of its need for voltage references.

The latency difference between SRAM memory and its controller under different Vdds is investigated. With reducing Vdd, the latency mismatch grows if traditional inverter-chain delays are used. Under 190mV, the mismatch is more than twice as large as under the normal 1V Vdd in 90nm technology.

An SI SRAM is proposed and designed. The SRAM has a simple interface, similar to that of a normal SRAM, including data, address, reading request, reading acknowledgement, writing request, and writing acknowledgement. The internal memory control signals are fully triggered by the corresponding events of the memory system, which works by monitoring the bit lines of the memory.

A new method is proposed to implement an SI writing operation based on ideas from [14]. This solves the problem of completion detection for writing operations, previously considered impractical or impossible.

A 1Kb (64x16) SI SRAM is implemented using the Cadence toolkit. The simulation results show the SRAM working as expected from 190mV to 1V. The energy consumption and the worst case performance are also measured, and the measurements show that the SRAM has acceptable characteristics.

However, as the completion detection logic in the SI SRAM is expensive in terms of area, performance, and power, a compromise SRAM is designed as well, based on the modified SI SRAM. This new SRAM is based on the bundled delay principle. Unlike existing asynchronous SRAM solutions, however, a column (the worst column, if it can be identified) of SI SRAM cells doubles as the delay elements. This column should be slower than the other columns anyway, because the completion detection elements take extra time. The other columns of memory cells are bundled with this column.

So far we have only investigated basic asynchronous SRAM design. Other issues, such as static noise margin, readability, stability, etc., need further study. These are the targets of our future research. We will also investigate multi-port asynchronous SRAM in the context of variable and non-deterministic Vdd.

Acknowledgement

This work is supported by the EPSRC project Holistic (EP/G066728/1) at Newcastle University. During this work we had very helpful discussions with our colleagues, Dr Alex Bystrov and other members of the MSD research group. The authors would like to express their thanks to them.


References

[1] International Technology Roadmap for Semiconductors: http://public.itrs.net/.
[2] Alain J. Martin, "The limitations to delay-insensitivity in asynchronous circuits", in William J. Dally, ed., Advanced Research in VLSI, pp. 263-278, MIT Press, 1990.
[3] D. Sylvester, K. Agarwal, S. Shah, "Variability in nanometer CMOS: Impact, analysis, and minimization", Integration, the VLSI Journal, No. 41, pp. 319-339, 2008.
[4] H. Saito, A. Kondratyev, J. Cortadella, L. Lavagno and A. Yakovlev, "What is the cost of delay insensitivity?", Proc. ICCAD'99, San Jose, CA, pp. 316-323, Nov. 1999.
[5] L. S. Nielsen and J. Staunstrup, "Design and verification of a self-timed RAM", Proc. of the IFIP International Conference on VLSI, 1995.
[6] Vincent Wing-Yun Sit, et al., "A four-phase handshaking asynchronous static RAM design for self-timed systems", IEEE Journal of Solid-State Circuits, pp. 90-96, Vol. 34, No. 1, January 1999.
[7] Tan Soon-Hwei, et al., "A 160MHz 45mW asynchronous dual-port 1Mb CMOS SRAM", Proc. of the IEEE Conference on Electron Devices and Solid-State Circuits, 2005.
[8] J. Dama and A. Lines, "GHz asynchronous SRAM in 65nm", Proc. of the 15th IEEE Symposium on Asynchronous Circuits and Systems, 2009.
[9] M. F. Chang, S. M. Yang, and K. T. Chen, "Wide Vdd embedded asynchronous SRAM with dual-mode self-timed technique for dynamic voltage systems", IEEE Trans. on Circuits and Systems I, pp. 1657-1667, Vol. 56, No. 8, August 2009.
[10] A. Wang and A. Chandrakasan, "A 180mV subthreshold FFT processor using a minimum energy design methodology", IEEE Journal of Solid-State Circuits, pp. 310-319, Vol. 40, No. 1, January 2005.
[11] A. Sekiyama, et al., "A 1-V operating 256Kb full CMOS SRAM", IEEE Journal of Solid-State Circuits, pp. 776-782, Vol. 27, No. 5, May 1992.
[12] B. S. Amrutur and M. A. Horowitz, "A replica technique for wordline and sense control in low-power SRAM's", IEEE Journal of Solid-State Circuits, pp. 1208-1219, Vol. 33, No. 8, August 1998.
[13] Andrey Mokhov, et al., "Power elastic systems: Discrete event control, concurrency reduction and hardware implementation", Tech. Report NCL-EECE-MSD-TR-2009-151, School of EECE, Newcastle University.
[14] V. Varshavsky, et al., "A self-timed random access memory", USSR Patent, 1988.
[15] Bo Zhai, et al., "A sub-200mV 6T SRAM in 0.13um CMOS", Proc. of ISSCC, 2007.
[16] Jens Sparsø and Steve Furber, "Principles of Asynchronous Circuit Design: A Systems Perspective", Kluwer Academic Publishers, Boston, 2001.
[17] V. Reddi, M. Gupta, G. Holloway, et al., "Voltage emergency prediction: A signature-based approach to reducing voltage emergencies", in Proc. of the International Symposium on High-Performance Computer Architecture (HPCA-15), 2009.
