
Designing Packet Buffers with Statistical Guarantees

Gireesh Shrimali, Isaac Keslassy, Nick McKeown
Computer Systems Laboratory, Stanford University, Stanford, CA 94305-9030
{gireesh, keslassy, nickm}@stanford.edu

Abstract -- Packet buffers are an essential part of routers. In high-end routers these buffers need to store a large amount of data at very high speeds. To satisfy these requirements, we need a memory with the speed of SRAM and the density of DRAM. A typical solution is to use hybrid packet buffers built from a combination of SRAM and DRAM, where the SRAM holds the heads and tails of per-flow packet FIFOs and the DRAM is used for bulk storage. The main challenge then is to minimize the size of the SRAM while providing reasonable performance guarantees. In this paper, we analyze a commonly used hybrid architecture from a statistical perspective, and ask the following question: if the packet buffer designer is willing to tolerate a certain drop probability, then how small can the SRAM get? To do so, we introduce an analytical model for representing the SRAM buffer occupancy, and derive drop probabilities as a function of SRAM size under a wide range of statistical traffic patterns. As a consequence of our analysis we show that, for low drop probability, the required SRAM size is proportional to the number of flows.

I. INTRODUCTION

Packet buffers in high-performance routers are challenging to design because of two factors: memory speed and memory size. Packets belonging to different flows (for example, these flows may correspond to different IP source-destination pairs) arrive and depart at line rate, and are typically stored in per-flow queues. Consecutive packets may belong to different flows in an unpredictable manner. This requires that the buffer be able to store as well as retrieve packets at line rate in an unpredictable memory access order. Thus, the buffer has to match a raw bandwidth (in bits/s) as well as a memory random access speed (in packets/s) of at least twice the line rate.

In addition, a rule of thumb indicates that, for TCP to work well, the buffer should be able to store an amount of data equal to the product of the line rate and the average round-trip time [1]. While this rule of thumb has recently been challenged [2], it is still widely used. Therefore, both the speed and the size of the memory grow linearly with the line rate.

As an example, consider a 40 Gbit/s linecard. This requires the buffer to match a raw bandwidth of 80 Gbit/s. In addition, assuming a constant stream of 40-byte packets, which corresponds to minimum-size IP packets containing TCP ACKs, the buffer must read and write a packet every 8 ns. This translates to one memory operation every 4 ns, or a random access speed of 250 Mpackets/s. Finally, assuming an average round-trip time of 0.25 s [3], the buffer must hold 10 Gbits.
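These requirements follow directly from the line rate; the short sketch below reproduces the arithmetic (a minimal back-of-the-envelope check in Python, using only the numbers quoted above):

```python
# Back-of-the-envelope buffer requirements for a 40 Gbit/s linecard,
# following the numbers quoted in the text.
line_rate_bps = 40e9          # 40 Gbit/s line rate
packet_bits   = 40 * 8        # minimum-size 40-byte packets (TCP ACKs)
rtt_s         = 0.25          # assumed average round-trip time

raw_bandwidth_bps = 2 * line_rate_bps                    # one read + one write per packet
packet_time_ns    = packet_bits / line_rate_bps * 1e9    # 8 ns per packet
access_time_ns    = packet_time_ns / 2                   # 4 ns per memory operation
access_rate_pps   = 1e9 / access_time_ns                 # 250 Mpackets/s
buffer_bits       = line_rate_bps * rtt_s                # 10 Gbit

print(f"raw bandwidth : {raw_bandwidth_bps / 1e9:.0f} Gbit/s")
print(f"packet time   : {packet_time_ns:.0f} ns")
print(f"random access : {access_time_ns:.0f} ns ({access_rate_pps / 1e6:.0f} Mpackets/s)")
print(f"buffer size   : {buffer_bits / 1e9:.0f} Gbit")
```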

We now investigate the properties of two popular commercially available memories, SRAM and DRAM, to see whether they match these requirements.

Figure 1: Hybrid SRAM-DRAM memory architecture.

We note that state-of-the-art SRAMs [4] meet the raw bandwidth requirement of 80 Gbit/s as well as the random access time requirement of 4 ns. However, these SRAMs can only hold a maximum of 32 Mbits per device. Thus, an SRAM-only solution would require over 300 SRAM devices, and would therefore be very costly in terms of board real estate. In addition, these SRAMs consume approximately 1.6 W per device. This means a total power consumption of 480 W, more than the power budget typically allocated to the whole linecard [6].

On the other hand, state-of-the-art DRAMs [5] can hold up to 1 Gbit per device while consuming 2 W per device. A DRAM-only solution would therefore require only 10 DRAM devices and consume 20 W, easily meeting the real-estate and power requirements. However, DRAM access times have not kept up with line rates: with today's DRAM technology, random access times are in the range of 20-40 ns, and barely meet the requirements of even a 10 Gbit/s linecard. This shortfall is not going to be solved anytime soon, since DRAMs are optimized for size rather than for random access time, and random access times improve by only 10% every 18 months [7]. The line rate, on the other hand, doubles in the same time period [8]. Thus, this problem will get worse rather than better over time.

Thus, neither an SRAM-only nor a DRAM-only solution can meet both the speed and the size requirements simultaneously. Since, overall, we would like to have a fast and large memory with the speed of SRAM and the density of DRAM, a natural solution is to use both, in a manner very similar to computer systems, where fast SRAMs are used as caches while dense DRAMs hold the bulk of the data.



A common approach is to use a hybrid SRAM-DRAM architecture [9][10][11], as shown in Figure 1. We encourage the reader to consult [9] for a detailed background and motivation for this architecture. Under this architecture, one can envision the memory hierarchy as a large DRAM containing a set of cell FIFOs, with the heads and tails of these FIFOs held in a dynamically shared SRAM.

The SRAM behaves like a cache, holding packets temporarily when they first arrive and just prior to their departure. Variable-size packets arrive at the SRAM at rate $R$. They are segmented into fixed-size cells and stored into one of the $Q$ tail FIFOs depending on their flow ID. Later, a Memory Management Algorithm (MMA) writes cells into the DRAM in blocks of $b$ cells. Similarly, the MMA transfers blocks of $b$ cells from the DRAM FIFOs to the corresponding head FIFOs in the SRAM. Finally, cells from the head FIFOs depart when requested by an external arbiter. Note that the MMA always transfers $b$ cells at a time, never less, between an SRAM FIFO and the corresponding DRAM FIFO. By transferring $b$ cells every $b$ time-slots, it ensures that the DRAM meets the memory speed requirement in the long run.¹
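To make the tail-side data path concrete, the following sketch models per-flow tail FIFOs in SRAM with block transfers of $b$ cells to DRAM (an illustrative Python rendering under our own simplifying assumptions; the class name, the 64-byte cell size, and the flat-list DRAM model are not from the paper):

```python
from collections import deque

CELL_BYTES = 64  # assumed fixed cell size (illustrative)

class TailCache:
    """Per-flow tail FIFOs in SRAM; full blocks of b cells move to DRAM."""
    def __init__(self, num_flows, b):
        self.b = b
        self.sram_fifos = [deque() for _ in range(num_flows)]  # tail FIFOs in SRAM
        self.dram_fifos = [[] for _ in range(num_flows)]       # bulk storage in DRAM

    def enqueue_packet(self, flow, length_bytes):
        """Segment a variable-size packet into fixed-size cells."""
        ncells = -(-length_bytes // CELL_BYTES)                # ceiling division
        self.sram_fifos[flow].extend(("cell", flow) for _ in range(ncells))

    def mma_write_block(self):
        """One DRAM write opportunity: move b cells of some eligible flow."""
        for flow, fifo in enumerate(self.sram_fifos):
            if len(fifo) >= self.b:                            # flow has a full block
                block = [fifo.popleft() for _ in range(self.b)]
                self.dram_fifos[flow].append(block)
                return flow
        return None                                            # no full block: stay idle

    def sram_occupancy(self):
        return sum(len(f) for f in self.sram_fifos)
```

The first-eligible selection in mma_write_block is only a placeholder; Section II analyzes the b-work-conserving policy, which may pick, for example, the longest eligible queue.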

In this paper, we look at the problem of independently sizing the tail and head SRAMs from a statistical perspective, under a wide range of arrival traffic patterns. We provide models of the SRAM buffers, present MMAs that minimize the SRAM sizes for a given drop probability, and derive formulas relating the SRAM sizes to the drop probability. As a consequence of our analysis we show that, in order to provide a low drop probability, the required SRAM sizes scale linearly with $Q$. When $Q$ is large, this linear dependence makes such a buffer hard to implement.² We therefore exhibit an inherent limitation of this architecture.

The rest of the paper is organized as follows. Section II presents a model of the tail SRAM followed by a detailed analysis. Similarly, Section III presents a model of the head SRAM along with analytical results. Section IV provides simulation results and Section V concludes the paper.

II. ANALYSIS OF THE TAIL SRAM

A. Tail SRAM Model

In this section, we introduce a queueing model for the tail SRAM, together with some simplifying assumptions.

As illustrated in Figure 2, we model the tail SRAM as a single queue served by a deterministic server of rate 1. For simplicity, we assume time to be a continuous variable; a similar analysis could be carried out in the discrete-time domain as well. We also assume that the SRAM is dynamically shared among $Q$ queues corresponding to the $Q$ flows.

We denote by $A(t)$ the cumulative number of cells arriving at the SRAM in $[0, t]$. $A(t)$ is assumed to be the sum of $Q$ arrival processes $A_i(t)$ corresponding to the $Q$ flows; each $A_i(t)$ has rate $\lambda_i$, such that $\lambda \equiv \sum_i \lambda_i < 1$.³ We also assume that the $A_i(t)$ are independent and identically distributed (IID) stationary and ergodic processes.⁴ We similarly denote by $D(t)$ the cumulative number of departures in $[0, t]$. We then denote by $L(i, t)$ the number of cells present in queue $i$, and by $L(t) = \sum_i L(i, t)$ the total SRAM occupancy at time $t$.

1. With line rate $R$, DRAM random access time $T$, and cell size $c$, we define $b = 2RT/c$. Thus, $T = b$ time-slots.
2. For example, in edge routers $Q$ can be as large as a million.
3. In this paper, all finite sums are over the index $i$, ranging from 1 to $Q$.

Figure 2: The tail SRAM model.

To find the drop probability, we start by assuming that the SRAM is infinite. We then obtain the steady-state probability that the sum of the queue sizes exceeds $S$ (i.e., $P(L > S)$) as a surrogate for the steady-state overflow probability of an SRAM of size $S$.

We assume that the MMA works as follows [9]. Whenever the DRAM is free, it serves an arbitrary queue from the set of all queues $i$ satisfying $L(i, t) \ge b$. For example, it could service the longest queue with at least $b$ cells, or the oldest queue with at least $b$ cells, and so on.

We refer to this MMA as a b-Work Conserving MMA (BWC-MMA), since it is work conserving as soon as at least one queue has an occupancy of at least $b$. Because of this work conservation, BWC-MMA minimizes the tail SRAM occupancy among all possible MMAs in this architecture. Therefore, given a tail SRAM of size $S$ in the hybrid architecture, BWC-MMA ensures that the drop probability $P$ is minimized. Equivalently, it minimizes $S$ given a fixed $P$.

Under BWC-MMA, service to an individual queue depends on the occupancy of all the queues in the system. Therefore, the queues are not independent. For example, if one queue is being served, no other queue can be served at the same time. This makes it hard to analyze the queues in isolation, and we need to analyze the queue occupancies together. This analysis is extremely challenging due to the interactions between the queue occupancies.
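Although the joint occupancy process is hard to analyze exactly, it is straightforward to explore by simulation; the sketch below estimates the total tail-SRAM occupancy $L(t)$ under a BWC-MMA that serves the longest eligible queue, with Bernoulli per-flow arrivals (an illustrative Monte-Carlo of ours, not the simulator used in Section IV):

```python
import random

def simulate_bwc(Q=64, b=4, load=0.5, slots=50_000, seed=1):
    """Discrete-time BWC-MMA: each slot, each flow receives a cell with
    probability load/Q; every b slots the DRAM accepts one block of b cells
    from some flow holding at least b cells (here, the longest such flow)."""
    random.seed(seed)
    queues = [0] * Q
    p = load / Q
    occupancy = []
    for t in range(slots):
        for i in range(Q):
            if random.random() < p:
                queues[i] += 1
        if t % b == 0:                                   # one block every b time-slots
            eligible = [i for i in range(Q) if queues[i] >= b]
            if eligible:
                j = max(eligible, key=lambda i: queues[i])
                queues[j] -= b
        occupancy.append(sum(queues))
    return occupancy

occ = simulate_bwc()
print("mean total occupancy:", sum(occ) / len(occ))
```

Running this at a few loads reproduces the qualitative behaviour analyzed next: at low load the occupancy concentrates near $Q(b-1)/2$.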

B. The Fixed-Batch Decomposition

In this section, we simplify the analysis by decomposing the sum of occupancies into elements that can be analyzed independently.

We start by noting that each $A_i(t)$ can be written as
$A_i(t) = b \times MA_i(t) + R_i(t)$,  (1)
where $MA_i(t)$ and $R_i(t)$ are the quotient and the remainder after $A_i(t)$ is divided by $b$. Now the arrival process $A(t)$ can be written as
$A(t) = \sum_i A_i(t) = \sum_i [\, b \times MA_i(t) + R_i(t) \,]$.  (2)
We can also write
$A(t) = b \times MA(t) + R(t)$,  (3)
where $MA(t) = \sum_i MA_i(t)$ and $R(t) = \sum_i R_i(t)$ are referred to as the batch arrival process and remainder workload, respectively.
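The decomposition is just integer division applied flow by flow; a tiny numerical check (with arbitrary illustrative counts):

```python
# Fixed-batch decomposition of cumulative arrival counts, Equations (1)-(3).
b = 4
A_i = [13, 7, 22, 4]                 # arbitrary cumulative per-flow arrivals A_i(t)

MA_i = [a // b for a in A_i]         # per-flow quotients (batch counts)
R_i  = [a %  b for a in A_i]         # per-flow remainders, each in {0, ..., b-1}

A, MA, R = sum(A_i), sum(MA_i), sum(R_i)
assert A == b * MA + R               # Equation (3): A(t) = b*MA(t) + R(t)
print(A, MA, R)                      # -> 46 10 6
```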

Having defined the arrival process, we now examine the departure process. Since cells are serviced in fixed batches of $b$,
$D(t) = b \times MD(t)$,  (4)
where $MD(t)$ represents the cumulative number of batch departures in $[0, t]$.

4. We believe this independence assumption is reasonable, since the traffic on a high-speed WAN link usually comprises traffic generated by thousands of independent sources. In addition, in [12] we show that the drop probability in the IID case upper-bounds the corresponding probability in the non-IID (independent but not identically distributed) case. Thus, the analysis of the IID case suffices for the scope of this paper.

Finally, having considered arrivals and departures, we look at the sum of occupancies $L(t)$, which can simply be written as
$L(t) = A(t) - D(t)$.  (5)
Substituting for $A(t)$ and $D(t)$ from Equation (3) and Equation (4), we get
$L(t) = [\, b \times MA(t) + R(t) \,] - b \times MD(t)$  (6)
or
$L(t) = b \times ML(t) + R(t)$,  (7)
where $ML(t) = MA(t) - MD(t)$ is referred to as the batch workload.

Equation (7) indicates that the system workload $L(t)$ can be decomposed into two terms. The first term, $b \times ML(t)$, is the product of the batch size and the batch workload. The second term, $R(t)$, is simply the remainder workload.

We now show that the two terms in this decomposition are independent. This greatly simplifies the analysis, since it is now possible to study them separately. The independence between the batch workload and the remainder workload is provided by the following theorem.

Theorem 1: The remainder workload is independent of the batch workload.
Proof: See Appendix.

Theorem 1 provides what we call the fixed-batch decomposition. It indicates that the steady-state distribution of the system workload can be derived simply by convolving the steady-state distributions of the remainder and batch workloads, i.e.,
$f_L(x) = f_R(x) \otimes f_{ML}(x/b)$,  (8)
where $f_L(x)$, $f_R(x)$, and $f_{ML}(x)$ are the steady-state probability density functions (PDFs) of the system, remainder, and batch workloads, respectively. Thus, Theorem 1 allows us to translate the general queue analysis into two more tractable problems.
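Once the two component distributions are known, the overflow probability $P(L > S)$ of Equation (8) can be evaluated numerically; a sketch with NumPy (the geometric PMF used for the batch workload is a placeholder of ours, standing in for the M/D/1 result derived below):

```python
import numpy as np

Q, b, S = 1024, 4, 1700

# Remainder workload: approximately Gaussian (Theorem 2 below), discretized here.
mean, var = Q * (b - 1) / 2, Q * (b * b - 1) / 12
x = np.arange(2 * int(mean) + 1)
f_R = np.exp(-(x - mean) ** 2 / (2 * var))
f_R /= f_R.sum()

# Batch workload ML: placeholder geometric PMF; b*ML lives on multiples of b cells.
f_ML = 0.5 ** np.arange(200)
f_ML /= f_ML.sum()
f_bML = np.zeros(b * len(f_ML))
f_bML[::b] = f_ML

# System workload PDF is the convolution of the two components (Equation (8)).
f_L = np.convolve(f_R, f_bML)
print("P(L > S) ~", f_L[S + 1:].sum())
```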

C. Steady State Distributions

We first derive the steady-state distributions of the remainder and batch workloads. The fixed-batch decomposition can then be used to obtain the steady-state distribution of the queue occupancy.

1) Steady State Distribution of the Remainder Workload

The steady-state distribution of the remainder workload is given by the following theorem.

Theorem 2: As the number of flows increases (i.e., $Q \to \infty$), the steady-state distribution of the remainder workload tends towards a Gaussian distribution with mean $Q(b-1)/2$ and variance $Q(b^2-1)/12$.
Proof: See Appendix.

2) Steady State Distribution of the Batch Workload

Having derived the steady-state distribution of the remainder workload, we now derive the steady-state distribution of the batch workload. We use the batch queue model shown in Figure 3. In this model, we analyze arrivals and departures of batches of cells instead of individual cells. The batch arrival process is the superposition of $Q$ IID processes $MA_i(t)$, with total rate $\lambda/b$. Under the BWC-MMA discipline, the batch queue is serviced by a work-conserving server of rate $1/b$.

Figure 3: The batch queue model.

Using Lindley's recursion [13], the batch workload can be written as
$ML(t) = \max_{0 \le s \le t} \big( (MA(t) - MA(s)) - (t - s)/b \big)$.  (9)
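Equation (9) is easy to evaluate by simulation; the sketch below runs a discrete-time version of Lindley's recursion with Poisson batch arrivals, which is the limiting regime identified in Theorem 3 below (an illustrative Monte-Carlo, not a closed-form result):

```python
import numpy as np

def batch_workload_samples(lam=0.9, b=4, slots=200_000, seed=0):
    """Discrete-time Lindley recursion for the batch queue:
    ML <- max(0, ML + batch arrivals in the slot - 1/b service per slot)."""
    rng = np.random.default_rng(seed)
    arrivals = rng.poisson(lam / b, size=slots)      # Poisson batch arrivals, rate lam/b
    ml, out = 0.0, np.empty(slots)
    for t, a in enumerate(arrivals):
        ml = max(0.0, ml + a - 1.0 / b)              # work-conserving server of rate 1/b
        out[t] = ml
    return out

ml = batch_workload_samples()
for x in (2, 5, 10):
    print(f"P(ML > {x}) ~ {(ml > x).mean():.4f}")
```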

Equation (9) indicates that, given the steady-state distribution of the batch arrivals, it is theoretically possible to derive the steady-state distribution of the batch workload.

For general arrival patterns it is often not easy to find a closed-form solution. However, for a broad range of arrival traffic patterns, the superposition of an increasing number of flows can be shown to result in a steady-state workload distribution that converges towards the steady-state workload distribution of an M/D/1 queue.

To do so, we first make the following additional assumptions on the arrival processes $A_i(t)$. We assume that each $A_i(t)$ is a simple point process satisfying the following properties [14]. First, the expected value of $A_i(t)$ is $\lambda_i t$. Second, a source cannot send more than one cell at a time. And third, the probability of many cells arriving in an arbitrarily small interval $[0, t]$ decays fast as $t \to 0$.

These assumptions are fairly general and apply to a variety of traffic sources, including Poisson, Gamma, Weibull, Inverse Weibull, ExpOn-ExpOff, and ParetoOn-ExpOff. These point processes model a wide range of observed traffic [14], including wide-area network traffic.

Given these assumptions, we can state the following theorem.

Theorem 3: As the number of flows increases (i.e., $Q \to \infty$), the steady-state exceedance probability $P(ML > x)$ of the batch workload approaches the corresponding exceedance probability with a Poisson source of the same load.
Proof: See Appendix.

Theorem 3 shows that as the number of multiplexed IID sources increases, the exceedance probability approaches the corresponding probability assuming Poisson sources, which is known explicitly through the analysis of the resulting M/D/1 system [15].

3) Steady State Distribution of the System Workload

Using the fixed-batch decomposition, $f_L(x)$ can easily be derived as the convolution of $f_R(x)$ and $f_{ML}(x)$, which were obtained above.

We now provide an intuitive analysis of $f_L(x)$ for a large number of flows.⁵ For low values of $\lambda$, $f_{ML}(x)$ behaves like an impulse close to zero. Since a convolution with an impulse at zero produces an output equal to the input, $f_L(x)$ would very much resemble $f_R(x)$ for low values of $\lambda$, and therefore be close to a Gaussian.

5. In practice, we start observing the convergence indicated in Theorem 2 and Theorem 3 when the number of flows exceeds 100.


Figure 4: Theoretical PDF of L(t) for various loads for Q=1024 and b=4 (the mean of the Gaussian is 1536).

Figure 4 plots $f_R(x)$ and $f_L(x)$ as predicted by the theoretical model, and shows that this is indeed the case. At $\lambda = 0.1$, the plots corresponding to $f_R(x)$ and $f_L(x)$ are indistinguishable. At $\lambda = 0.5$, the plots are still very close. The plots then separate as the load increases.

The main consequence of this observation is that, even at low loads, $f_L(x)$ would result in more than 50% drops for an SRAM size smaller than $Q(b-1)/2$, the mean of the Gaussian. However, $f_L(x)$ falls off very quickly due to the low variance of the Gaussian, and low drop probabilities can be obtained close to $Q(b-1)/2$. Therefore, to give reasonable performance guarantees, the SRAM has to be slightly larger than $Q(b-1)/2$, i.e., $\Theta(Qb)$.
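Under the Gaussian approximation, the extra margin above the mean follows directly from the Gaussian tail; a small sketch (standard library only, with an arbitrary 10^-6 drop-probability target of our choosing):

```python
from math import sqrt, erfc

def sram_size_gaussian(Q, b, target):
    """Smallest S with P(L > S) <= target under the Gaussian approximation
    of the remainder workload (Theorem 2)."""
    mean = Q * (b - 1) / 2
    std = sqrt(Q * (b * b - 1) / 12)
    S = int(mean)
    while 0.5 * erfc((S - mean) / (std * sqrt(2))) > target:
        S += 1
    return S

S = sram_size_gaussian(Q=1024, b=4, target=1e-6)
print(S, "cells, vs. Q(b-1)/2 =", 1024 * (4 - 1) // 2)   # only slightly above 1536
```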

III. ANALYSIS OF THE HEAD SRAM

A. Head SRAM Model

As was the case for the tail SRAM, we assume the head SRAM to be dynamically shared among $Q$ flow queues. However, the similarities with the tail SRAM end here. For dynamic sharing to be useful for the head SRAM, we need to be able to predict the (external) arbiter request pattern in some way: if we cannot predict the request pattern to the SRAM, we have to buffer cells for each queue statically, and the advantage of dynamic sharing is lost. Thus, we assume the presence of a fixed-length lookahead buffer, which we use to predict the request pattern to the SRAM.⁶ Note that this assumes that the arbiter is willing to tolerate a fixed delay $LA$ between the arrival of requests and the delivery of cells from the SRAM.

We use the scheme shown in Figure 5. Incoming requests ($D(t)$) from an external arbiter enter the lookahead buffer on the right and exit on the left after a fixed delay $LA$. Based on the requests in the lookahead buffer and the cells in the SRAM, read requests ($\widehat{MD}(t)$) are made to the DRAM so as to ensure that the cells ($A(t)$) are written to the SRAM before the requests exit the lookahead buffer and the cells are read from the SRAM.

Similar to Section II.A, we define $L(i, t)$ as the number of cells present in queue $i$, and $L(t) = \sum_i L(i, t)$ as the total SRAM occupancy at time $t$. We then define $LR(i, t)$ to be the number of requests for queue $i$ in the lookahead buffer, and $DEF(i, t)$ to be the deficit (the number of arbiter requests for which a DRAM request has not been made) for queue $i$ at time $t$. A queue is defined to be critical at time $t$ if $DEF(i, t) < 0$.

6. This is inspired by the lookahead scheme in [9].

We assume that the MMA works as follows [9]. Starting from an empty SRAM, whenever the DRAM is free, it serves the earliest critical queue from the set of critical queues, provided there is space in the SRAM. If there are no critical queues, it does nothing.

In reference to Figure 5, the MMA operation can be described as follows. When an arbiter request arrives, the deficit of the corresponding queue is updated. If the queue has gone critical, a read request is queued at the fetch FIFO, which is served in batches of $b$ cells at rate $1/b$. We define $FD(t)$ to be the delay through the fetch FIFO.
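The request-side bookkeeping can be sketched as follows (our own rendering of the deficit counting, with the sign convention chosen so that a flow goes critical exactly when a new b-cell fetch is needed; the precise algorithm is defined in [9]):

```python
from collections import deque

class HeadCacheMMA:
    """Sketch of deficit-driven fetch triggering for the head cache.
    deficit[i] tracks cells available or in flight for flow i minus cells
    owed to the arbiter; a flow is critical when it goes negative."""
    def __init__(self, num_flows, b):
        self.b = b
        self.deficit = [0] * num_flows
        self.fetch_fifo = deque()              # pending b-cell DRAM reads, served at rate 1/b

    def on_arbiter_request(self, flow, t):
        self.deficit[flow] -= 1                # one more cell owed to the arbiter
        if self.deficit[flow] < 0:             # flow has gone critical
            self.fetch_fifo.append((t, flow))  # earliest critical queue first (FIFO order)
            self.deficit[flow] += self.b       # a block of b cells is now on its way

mma = HeadCacheMMA(num_flows=3, b=4)
for t, flow in enumerate([0, 0, 1, 0, 0, 0, 2, 1]):
    mma.on_arbiter_request(flow, t)
print(len(mma.fetch_fifo), "fetches queued")   # flow 0 triggers at its 1st and 5th requests
```

With this convention, fetches for a flow are triggered by its requests numbered 1, b+1, 2b+1, ..., matching the description of the fetch FIFO arrivals given later in this section.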

We refer to this MMA as a Deficit Work Conserving MMA (DWC-MMA), since it is work conserving as soon as at least one queue has gone critical. Because of this work conservation, DWC-MMA minimizes the head SRAM occupancy among all possible MMAs in this architecture. Therefore, given a head SRAM of size $S$ in the hybrid architecture, DWC-MMA ensures that the drop probability $P$ is minimized. Equivalently, it minimizes $S$ given a fixed $P$.

In what follows we relate the (request) drop probability from the head SRAM to the size of the lookahead buffer ($LA$) as well as to the size of the SRAM ($S$). This analysis can be extremely challenging due to the interactions between the lookahead buffer and the SRAM.

B. Analysis of the Head SRAM

The analysis of this complex system can be simplified as follows. Observe that a request may not be served (i.e., may be dropped) for two reasons. First, the lookahead buffer may not be deep enough to bring the required cells into the SRAM by the time the request traverses to the end of the lookahead buffer. Second, even if the lookahead buffer is deep enough, the SRAM may not be big enough to store the incoming cells from the DRAM.

We refer to the overflow of the lookahead buffer and of the head SRAM as events $E1$ and $E2$, respectively. Now, the (request) drop probability from a head SRAM of size $S$, using a lookahead buffer of size $LA$, can be given by
$P(S, LA) = P(E1 \cup E2)$,  (10)
or
$P(S, LA) \le P(E1) + P(E2)$.  (11)

Figure 5: The head SRAM buffer model.


We again assume infinite buffers and use $P(FD > LA)$ and $P(L > S)$ as surrogates for $P(E1)$ and $P(E2)$, respectively. This lets us rewrite Equation (11) as
$P(S, LA) \le P(FD > LA) + P(L > S)$.  (12)
Thus, we can analyze $P(FD > LA)$ and $P(L > S)$ separately to get an upper bound on the drop probability of the head SRAM. It turns out that both of these can be analyzed in a way very similar to the tail SRAM.

C. A Model of the Queue

Similar to Section II, we can develop a model for the head SRAM that is easy to analyze. We denote the request arrival process to the SRAM by $D(t)$, where $D(t)$ is the cumulative number of requests arriving in $[0, t]$. $D(t)$ is further assumed to be generated by the superposition of $Q$ arrival processes $D_i(t)$, where $D_i(t)$ is the cumulative number of requests sent for flow $i$ in $[0, t]$. Each $D_i(t)$ is assumed to have rate $\lambda_i$, with $\lambda = \sum_i \lambda_i < 1$. In addition, we assume that the request arrival processes $D_i(t)$ satisfy the assumptions stated for the cell arrival processes $A_i(t)$ in Section II.

We start by noting that each $D_i(t)$ can be written as
$D_i(t) = b \times MD_i(t) + R_i(t)$,  (13)
where $R_i(t)$ and $MD_i(t)$ are the remainder and quotient after $D_i(t)$ is divided by $b$. Now, similar to Equation (3), the multiplexed process $D(t)$ can be written as
$D(t) = b \times MD(t) + R(t)$,  (14)
where $MD(t) = \sum_i MD_i(t)$ and $R(t) = \sum_i R_i(t)$.

We now define a derivative process that is more useful in analyzing the DRAM traffic under DWC-MMA. The derivative process is defined as
$\hat D_i(t) = D_i(t) + (b - 1)$,  (15)
which gives
$\hat D(t) = D(t) + Q(b - 1)$.  (16)
Now, breaking $\hat D_i(t)$ down in the same way as $D_i(t)$, we get
$\hat D_i(t) = b \times \widehat{MD}_i(t) + \hat R_i(t)$,  (17)
where $\hat R_i(t)$ and $\widehat{MD}_i(t)$ are the remainder and quotient after $\hat D_i(t)$ is divided by $b$. Thus, in a way similar to Equation (14), we get
$\hat D(t) = b \times \widehat{MD}(t) + \hat R(t)$,  (18)
where $\widehat{MD}(t) = \sum_i \widehat{MD}_i(t)$ and $\hat R(t) = \sum_i \hat R_i(t)$.

$\widehat{MD}(t)$ is precisely the arrival process to the fetch FIFO. This is due to the fact that, starting from an empty SRAM buffer, the arrivals numbered $1, b+1, 2b+1, \dots$ to an SRAM queue result in fetches from the DRAM.

Now, as mentioned earlier, the fetch FIFO is served at rate $1/b$, resulting in the departure process $\widehat{MA}(t)$, where $\widehat{MA}(t)$ is related to the arrival process to the SRAM buffer (i.e., $A(t)$) in the following way:
$A(t) = b \times MA(t) = b \times \widehat{MA}(t - b)$.  (19)
The first equality in Equation (19) represents the fact that arrivals to the SRAM always occur in batches of $b$. The second equality reflects the fact that there is a fixed delay of $b$ in reading from the DRAM. Now, noting that the departure process from the SRAM buffer is given by $D(t - LA)$, the SRAM buffer occupancy $L(t)$ can be given by
$L(t) = A(t) - D(t - LA)$.  (20)
Using Equation (16) and Equation (19), we can write this as
$L(t) = b \times \widehat{MA}(t - b) - (\hat D(t - LA) - Q(b - 1))$.  (21)

Substituting for $\hat D(t)$ from Equation (18), we get
$L(t) = b \times (\widehat{MA}(t - b) - \widehat{MD}(t - LA)) + (Q(b - 1) - \hat R(t - LA))$  (22)
or
$L(t) = b \times ML(t) + RL(t)$,  (23)
where
$ML(t) = \widehat{MA}(t - b) - \widehat{MD}(t - LA)$  (24)
and
$RL(t) = Q(b - 1) - \hat R(t - LA)$.  (25)

Note that Equation (23) looks strikingly similar to Equation (7); in what follows, we show that it can be analyzed in a similar way. However, for the sake of brevity, we intentionally refrain from proving theorems that closely resemble the ones proved in Section II, and instead state the results in an intuitive way.

We start by looking at the steady-state distribution of $RL(t)$ (i.e., $f_{RL}(x)$). We note that $f_{RL}(x)$ can be obtained from the distribution of $\hat R(t)$ (i.e., $f_{\hat R}(x)$) by first flipping $f_{\hat R}(x)$ around the origin and then shifting it to the right by $Q(b - 1)$.⁷ In addition, it can be proven, in a way very similar to Theorem 2, that $f_{\hat R}(x)$ is Gaussian with mean $Q(b - 1)/2$ and variance $Q(b^2 - 1)/12$.⁸ So, for large $Q$, after going through the linear transformations mentioned above, $f_{RL}(x)$ is essentially the same Gaussian as $f_{\hat R}(x)$.

Now we are ready to look at the distribution of $ML(t)$. To do so, we use the following inequalities:
$\widehat{MD}(t) - LA/b \le \widehat{MD}(t - LA) \le \widehat{MD}(t)$  (26)
and
$\widehat{MA}(t) - 1 \le \widehat{MA}(t - b) \le \widehat{MA}(t)$.  (27)
In Equation (26), the right-hand inequality is straightforward. The left-hand inequality comes from the fact that in time $LA$ there can be at most $LA/b$ fetch requests. Equation (27) can be explained in a similar way.

Using Equation (26) and Equation (27) in Equation (24), we get
$ML(t) \le \widehat{MA}(t) - (\widehat{MD}(t) - LA/b)$  (28)
or
$ML(t) \le LA/b - (\widehat{MD}(t) - \widehat{MA}(t))$  (29)
or
$ML(t) \le LA/b - \widehat{ML}(t)$,  (30)
where $\widehat{ML}(t) = \widehat{MD}(t) - \widehat{MA}(t)$ is the occupancy of the fetch FIFO. Equation (30) shows that the steady-state distribution of $ML(t)$ (i.e., $f_{ML}(x)$) can be obtained from the distribution of $\widehat{ML}(t)$ (i.e., $f_{\widehat{ML}}(x)$) by first flipping $f_{\widehat{ML}}(x)$ around the origin and then shifting it to the right by $LA/b$. Note that, for IID $D_i(t)$, Theorem 3 applies, and $f_{\widehat{ML}}(x)$ can be derived as the steady-state distribution of an M/D/1 queue.

At this point, we have the distributions of both components of Equation (23). It can be shown, in a way very similar to Theorem 1, that $\hat R(t)$ and $\widehat{ML}(t)$ are independent of each other, given the assumptions on $D_i(t)$. Since $RL(t)$ and $ML(t)$ are linear transformations of $\hat R(t)$ and $\widehat{ML}(t)$, they are also independent of each other, and the PDF of $L(t)$ (i.e., $f_L(x)$) can be obtained by convolving $f_{RL}(x)$ and $f_{ML}(x)$, i.e.,
$f_L(x) = f_{RL}(x) \otimes f_{ML}(x/b)$.  (31)

7. The steady-state distribution of $\hat R(t - LA)$ is the same as the steady-state distribution of $\hat R(t)$ as $t \to \infty$.
8. While keeping similar assumptions in mind.


Figure 6: Theoretical PDF of L(t) for various loads for Q=1024 and b=4, with T=100*load.

By analyzing the distribution of $L(t)$ we can get one of the quantities (i.e., $P(L > S)$) required by Equation (12). To derive the other quantity (i.e., $P(FD > LA)$), we simply note that the delay through the constant-service-rate fetch FIFO is directly related to the occupancy of the FIFO, i.e.,
$P(FD > LA) = P(\widehat{ML} > LA/b)$.  (32)
Thus, by solving for the distributions of $\hat R(t)$ and $\widehat{ML}(t)$, we get both quantities required by Equation (12).

D. Putting it Together

We have treated the sizes of the lookahead and SRAM buffers as independent parameters. However, to get a reasonable value for the upper bound indicated by Equation (12), we first have to choose a value of $LA$ such that $P(FD > LA)$ is fairly low. Once we have picked $LA$, we can then obtain $P(L > S)$ using $LA$ as a constant. Note that this value of $LA$ depends only on the load $\lambda$. However, in the M/D/1 context, the value of $LA$ needed to achieve low values of $P(FD > LA)$ may blow up in the limit $\lambda \to 1$. So how do we choose a value of $LA$ that works for all traffic loads? The solution is to limit the maximum load and assume that the design has a small speedup; for example, a maximum $\lambda$ of 0.9 would require a speedup of $1/0.9 \approx 1.1$. $LA$ can then be chosen for this maximum $\lambda$. In fact, for large $Q$, the value of $LA$ required to achieve low values of $P(FD > LA)$ turns out to be much smaller than $Q(b - 1)/2$.

Figure 6 plots $f_L(x)$, as predicted by the theoretical model, for various values of $\lambda$ when $Q = 1024$, $b = 4$, and $LA = c\lambda$. We picked $c = 100$ by noting that, as long as $\lambda \le 0.9$, the steady-state distribution of the M/D/1 queue dies out by the time the queue occupancy reaches $100\lambda$.
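This two-step procedure is easy to mechanize; the sketch below first estimates the fetch-FIFO occupancy tail by simulation to pick $LA$, and then adds a Gaussian-tail margin to pick $S$ (an illustrative Monte-Carlo with exceedance targets of our own choosing, not the paper's numerical procedure):

```python
import numpy as np
from math import sqrt, erfc

Q, b, lam, target = 1024, 4, 0.9, 1e-4
rng = np.random.default_rng(0)

# Step 1: tail of the fetch-FIFO occupancy (batch queue, Lindley recursion),
# used to choose LA so that P(FD > LA) is small (Equation (32)).
samples = np.empty(200_000)
ml = 0.0
for t, a in enumerate(rng.poisson(lam / b, size=200_000)):
    ml = max(0.0, ml + a - 1.0 / b)
    samples[t] = ml
LA = b * (int(np.quantile(samples, 1 - target)) + 1)

# Step 2: Gaussian remainder tail around Q(b-1)/2 + LA, used to choose S given LA.
mean, std = Q * (b - 1) / 2 + LA, sqrt(Q * (b * b - 1) / 12)
S = int(mean)
while 0.5 * erfc((S - mean) / (std * sqrt(2))) > target:
    S += 1

print("LA =", LA, " S =", S, " vs. Q(b-1)/2 =", Q * (b - 1) // 2)
```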

These plots are very similar to the ones in Figure 4, except for a right shift by $LA = c\lambda$. For low values of $\lambda$, $f_{ML}(x)$ behaves like an impulse close to $LA/b$. Thus, we expect $f_L(x)$ to closely resemble a Gaussian centered about $Q(b - 1)/2 + LA$. This of course changes as $\lambda \to 1$, with the Gaussian shifted to the right and spread out a little. Again, note that in order to provide reasonable performance guarantees, the head SRAM is required to be $\Theta(Qb)$.

Figure 7: Complementary CDF for b=4 and Q=1024 ((a) load=0.5; (b) load=0.9).

IV. SIMULATIONS

In this section we present simulation results for the tail SRAM and compare them to the predictions of Section II. The simulation results for the head SRAM are similar and are not presented here.

Figure 7 plots the drop probability as a function of the tail SRAM size for $b = 4$ and $Q = 1024$. These zoomed-out plots correspond to the prediction from theory and to simulation results for two traffic types: Bernoulli IID Uniform and Bursty Uniform (the burst lengths are geometrically distributed with an average burst length of 12).
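For reference, the two traffic types can be generated along the following lines (our own sketch; the paper does not describe the generator, so the scaling of the bursty source to the target load is approximate and illustrative):

```python
import random

def bernoulli_iid(Q, load, slots, seed=0):
    """Each slot, each flow independently emits one cell with probability load/Q."""
    random.seed(seed)
    p = load / Q
    return [[i for i in range(Q) if random.random() < p] for _ in range(slots)]

def bursty(Q, load, slots, mean_burst=12, seed=0):
    """On/off per-flow arrivals: burst lengths are geometric with mean 12 cells;
    the start probability is scaled so the long-run load is approximately `load`."""
    random.seed(seed)
    p_start = load / (Q * mean_burst)   # probability an idle flow starts a burst
    p_stop = 1.0 / mean_burst           # geometric burst termination
    active = [False] * Q
    trace = []
    for _ in range(slots):
        cells = []
        for i in range(Q):
            if not active[i] and random.random() < p_start:
                active[i] = True
            if active[i]:
                cells.append(i)         # an active flow sends one cell per slot
                if random.random() < p_stop:
                    active[i] = False
        trace.append(cells)
    return trace

trace = bursty(Q=128, load=0.5, slots=20_000)
print("empirical load:", sum(len(c) for c in trace) / len(trace))
```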

We observe that the plots for $\lambda = 0.5$ are practically indistinguishable. In addition, we observe that the plots are very close to the Gaussian predicted by Theorem 2. Similarly, the plots for $\lambda = 0.9$ are very close to each other, with the plot from theory upper-bounding the other two for large values of queue occupancy. This shows the power of Theorem 3 and indicates that the Poisson limit has already been reached at $Q = 1024$.

V. CONCLUSIONS

In this paper we presented a model for providing statistical guarantees for a hybrid SRAM-DRAM packet-buffer architecture. We used this model to establish exact bounds relating the drop probability to the SRAM size. This model may be useful beyond the scope of this paper, because it may apply to many queueing systems with fixed-batch service. Such systems are increasingly common due to growing line rates and the resulting use of parallelism and load balancing.

Compared to the deterministic, worst-case analysis in [9], which established $Q(b - 1)$ as the lower bound on the SRAM size, our results provide an improvement of at most a factor of two. However, similar to [9], our bounds have a linear dependence on $Qb$. This linear dependence on $Q$ can be undesirable in cases where $Q$ is large.

Thus, our results can also be interpreted as a negative result for this architecture. It can be stated as follows: under the hybrid SRAM-DRAM architecture, $\Theta(Qb)$ is a hard lower bound on the size of the SRAM, which cannot be improved upon under any realistic traffic pattern.

We believe this to be a characteristic of the architecture: since we always transfer blocks of $b$ cells, we end up storing $\Theta(b)$ cells for $\Theta(Q)$ flows, resulting in a total storage of $\Theta(Qb)$. This indicates that we need to look at alternative architectures if design choices dictate SRAM sizes orders of magnitude smaller than $\Theta(Qb)$.

REFERENCES

[1] C. Villamizar and C. Song, "High Performance TCP in ANSNET", Computer Communication Review, Vol. 24, No. 5, pp. 45-60, October 1994.
[2] G. Appenzeller, I. Keslassy, and N. McKeown, "Sizing Router Buffers", accepted at ACM SIGCOMM'04. Available at http://www.stanford.edu/~nickm.
[3] CAIDA, "Round-Trip Time Measurements from CAIDA's Macroscopic Internet Topology Monitor", available at http://www.caida.org/analysis/performance/rtt/walrus0202/.
[4] Samsung, "Samsung K7N323645M NtSRAM". Available at http://www.samsung.com/Products/Semiconductor/SRAM/index.htm.
[5] Samsung, "Samsung K4S1G0632M SDRAM". Available at http://www.samsung.com/Products/Semiconductor/DRAM/index.htm.
[6] Cisco, "Cisco 12000 Series Two-Port OC-192c/STM-64c POS Line Card". Available at http://www.cisco.com/en/US/products/hw/routers/ps167/products_data_sheets_list.html.
[7] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, Section 8.4, pp. 425-432, Morgan Kaufmann, 1996.
[8] K. G. Coffman and A. M. Odlyzko, "Is there a 'Moore's Law' for data traffic?", in Handbook of Massive Data Sets, Kluwer, 2002, pp. 47-93.
[9] S. Iyer, R. R. Kompella, and N. McKeown, "Designing Buffers for Router Line Cards", Stanford University HPNG Technical Report TR02-HPNG-031001, Stanford, CA, 2002.
[10] S. Iyer, R. R. Kompella, and N. McKeown, "Analysis of a Memory Architecture for Fast Packet Buffers", IEEE HPSR'01, Dallas, Texas, May 2001.
[11] J. García, J. Corbal, L. Cerdà, and M. Valero, "Design and Implementation of High-Performance Memory Systems for Future Packet Buffers", Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, p. 373, 2003.
[12] G. Shrimali and N. McKeown, "Statistical Guarantees for Packet Buffers: The Monolithic DRAM Case", Stanford University HPNG Technical Report TR04-HPNG-020603, Stanford, CA, 2004.
[13] C.-S. Chang, Performance Guarantees in Communication Networks, Springer-Verlag, London, 2000.
[14] J. Cao and K. Ramanan, "A Poisson Limit for Buffer Overflow Probabilities", Proceedings of IEEE INFOCOM'02, pp. 994-1003, 2002.
[15] U. Mocci, J. Roberts, and J. Virtamo, Broadband Network Teletraffic: Final Report of Action COST 242, Springer, Berlin, 1996.

APPENDIX

Proof: (Theorem 1) The proof is in two steps. The first part involves proving the independence of $R(t)$ and $MA(t)$. The second part then proves the independence of $R(t)$ and $ML(t)$.

For the first part, we start by proving that, for all $i, j$, $R_i(t)$ is independent of $MA_j(t)$. For $i \ne j$, $R_i(t)$ is a function of $A_i(t)$, $MA_j(t)$ is a function of $A_j(t)$, and $A_i(t)$ is independent of $A_j(t)$. Therefore $R_i(t)$ is independent of $MA_j(t)$. In addition, $R_i(t)$ and $MA_j(t)$ are independent of each other for $i = j$, since $A_i(t)$ is stationary and ergodic. Therefore, due to component-wise independence, the derived processes $MA(t) = \sum_i MA_i(t)$ and $R(t) = \sum_i R_i(t)$ are independent of each other. This finishes the first part of the proof.

Now, to prove the second part, we note that $D(t)$ and $MD(t)$ are functions of the process $MA(t)$, and are therefore independent of $R(t)$. Thus, $ML(t) = MA(t) - MD(t)$ is also independent of $R(t)$.

Proof: (Theorem 2) The proof is in two steps. The first part involves proving that the workload remainders $R_i(t)$ are IID. The second part involves the application of the Central Limit Theorem.

For the first part, we note that $R_i(t)$ depends only on $A_i(t)$. Since the parent processes $A_i(t)$ are independent of each other, the derived processes $R_i(t)$ are independent of each other. Also, since the interarrival times of the arrival process $A_i(t)$ are stationary and ergodic, $R_i(t)$ stays in each of its $b$ states with equal probability, giving $P(R_i = x) = 1/b$ for all $x$. This is precisely the discrete uniform distribution, with mean $(b - 1)/2$ and variance $(b^2 - 1)/12$.

Now, $R(t)$ is a sum of $Q$ IID uniformly distributed random variables, each with mean $(b - 1)/2$ and variance $(b^2 - 1)/12$. By the Central Limit Theorem, as $Q \to \infty$, $R(t)$ tends towards a Gaussian random variable with mean $Q(b - 1)/2$ and variance $Q(b^2 - 1)/12$.

Proof: (Theorem 3) We first show that if the $A_i(t)$ are simple and stationary point processes, then the $MA_i(t)$ are simple and stationary point processes. Remember that each $MA_i(t)$ is generated by taking every $b$-th sample of the corresponding parent process $A_i(t)$. Therefore, in any time interval, $MA_i(t)$ will have fewer arrivals than $A_i(t)$. So if $A_i(t)$ satisfies the properties of a simple point process (Section II.C.2), then $MA_i(t)$ will too.⁹ Thus $MA_i(t)$ is a simple point process.

The stationarity of $MA_i(t)$ follows from the fact that $MA_i(t)$ is generated from $A_i(t)$ using a fixed sampling rule. Thus, if the characteristics of the parent process $A_i(t)$ are stationary (i.e., independent of time), then the characteristics of $MA_i(t)$ are stationary.

Thus, all the assumptions stated for $A_i(t)$ are also true for $MA_i(t)$. This allows us to use Theorem 1 in [14] to obtain the required result.

9. Note that $E[MA_i(t)] = E[A_i(t)/b] = \lambda_i t / b$.
