SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-based Storage

Biplob Debnath ∗,1    Sudipta Sengupta ‡    Jin Li ‡
‡ Microsoft Research, Redmond, WA, USA
∗ EMC Corporation, Santa Clara, CA, USA

ABSTRACT

We present SkimpyStash, a RAM space skimpy key-value store on flash-based storage, designed for high throughput, low latency server applications. The distinguishing feature of SkimpyStash is the design goal of extremely low RAM footprint at about 1 (±0.5) byte per key-value pair, which is more aggressive than earlier designs. SkimpyStash uses a hash table directory in RAM to index key-value pairs stored in a log-structured manner on flash. To break the barrier of a flash pointer (say, 4 bytes) worth of RAM overhead per key, it "moves" most of the pointers that locate each key-value pair from RAM to flash itself. This is realized by (i) resolving hash table collisions using linear chaining, where multiple keys that resolve (collide) to the same hash table bucket are chained in a linked list, and (ii) storing the linked lists on flash itself with a pointer in each hash table bucket in RAM pointing to the beginning record of the chain on flash, hence incurring multiple flash reads per lookup. Two further techniques are used to improve performance: (iii) two-choice based load balancing to reduce wide variation in bucket sizes (hence, chain lengths and associated lookup times), and a bloom filter in each hash table directory slot in RAM to disambiguate the choice during lookup, and (iv) a compaction procedure to pack bucket chain records contiguously onto flash pages so as to reduce flash reads during lookup. The average bucket size is the critical design parameter that serves as a powerful knob for making a continuum of tradeoffs between low RAM usage and low lookup latencies.
Our evaluations on commodity server platforms with real-world data center applications show that SkimpyStash provides throughputs from few 10,000s to upwards of 100,000 get-set operations/sec.

Categories and Subject Descriptors
H.3 Information Storage and Retrieval [H.3.1 Content Analysis and Indexing]: Indexing Methods

General Terms
Algorithms, Design, Experimentation, Performance.

1 Work done while Biplob Debnath was at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'11, June 12-16, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.

Keywords
Key-value store, Flash memory, Indexing, RAM space efficient index, Log-structured index.

1. INTRODUCTION

A broad range of server-side applications need an underlying, often persistent, key-value store to function. Examples include state maintenance in Internet applications like online multi-player gaming and inline storage deduplication (as described in Section 3). A high throughput persistent key-value store can help to improve the performance of such applications. Flash memory is a natural choice for such a store, providing persistency and 100-1000 times lower access times than hard disk. Compared to DRAM, flash access times are about 100 times higher. Flash also stands between DRAM and disk in terms of cost – it is 10x cheaper than DRAM, while 20x more expensive than disk – thus making it an ideal gap filler between DRAM and disk.

It is only recently that flash memory, in the form of Solid State Drives (SSDs), is seeing widespread adoption in desktop and server applications. For example, MySpace.com recently switched from using hard disk drives in its servers to using PCI Express (PCIe) cards loaded with solid state flash chips as primary storage for its data center operations [5]. Also recently, Facebook released Flashcache, a simple write-back persistent block cache designed to accelerate reads and writes from slower rotational media (hard disks) by caching data in SSDs [6].
These applications have different storage access patterns than typical consumer devices and pose new challenges to flash media to deliver sustained and high throughput (and low latency).

These challenges arising from new applications of flash are being addressed at different layers of the storage stack by flash device vendors and system builders, with the former focusing on techniques at the device driver software level and inside the device, and the latter driving innovation at the operating system and application layers. The work in this paper falls in the latter category. To get the maximum performance per dollar out of SSDs, it is necessary to use flash aware data structures and algorithms that work around constraints of flash media (e.g., avoid or reduce small random writes that not only have a higher latency but also reduce flash device lifetimes through increased page wearing). In the rest of the paper, we use NAND flash based SSDs as the architectural choice and simply refer to them as flash memory. We describe the internal architecture of SSDs in Section 2.

Recently, there have been several interesting proposals to design key-value stores using flash memory [10, 11, 15, 16]. These designs use a combination of RAM and flash memory – they store the full key-value pairs on flash memory and use a small amount of metadata per key-value pair in RAM to support faster insert and lookup operations. For example, FAWN [11], FlashStore [16], and ChunkStash [15] each require about six bytes of RAM space per key-value pair stored on flash memory. Thus, the amount of available RAM space limits the total number of key-value pairs that can be indexed on flash memory. As flash capacities are about an order of magnitude bigger than RAM and getting bigger, RAM size could well become the bottleneck for supporting large flash-based key-value stores. By reducing the amount of RAM needed per key-value pair stored on flash down to the extreme low of about one byte, SkimpyStash can help to scale key-value stores on flash on a lean RAM budget when existing designs run out of space.

Our Contributions

In this paper, we present the design and evaluation of SkimpyStash, a RAM space skimpy key-value store on flash-based storage, designed for high throughput server applications. The distinguishing feature of SkimpyStash is the design goal of extremely low RAM footprint at about 1 byte per key-value pair, which is more aggressive than earlier designs like FAWN [11], BufferHash [10], ChunkStash [15], and FlashStore [16]. Our base design uses less than 1 byte in RAM per key-value pair and our enhanced design takes slightly more than 1 byte per key-value pair. By being RAM space frugal, SkimpyStash can accommodate larger flash drive capacities for storing and indexing key-value pairs.

Design Innovations: SkimpyStash uses a hash table directory in RAM to index key-value pairs stored in a log-structure on flash. To break the barrier of a flash pointer (say, 4 bytes) worth of RAM overhead per key, it "moves" most of the pointers that locate each key-value pair from RAM to flash itself.
This is realized by

(i) resolving hash table collisions using linear chaining, where multiple keys that resolve (collide) to the same hash table bucket are chained in a linked list, and

(ii) storing the linked lists on flash itself with a pointer in each hash table bucket in RAM pointing to the beginning record of the chain on flash, hence incurring multiple flash reads per lookup.

Two further techniques are used to improve performance:

(iii) two-choice based load balancing [12] to reduce wide variation in bucket sizes (hence, chain lengths and associated lookup times), and a bloom filter [13] in each hash table directory slot in RAM for summarizing the records in that bucket so that at most one bucket chain on flash needs to be searched during a lookup, and

(iv) a compaction procedure to pack bucket chain records contiguously onto flash pages so as to reduce flash reads during lookup.

The average bucket size is the critical design parameter that serves as a powerful knob for making a continuum of tradeoffs between low RAM usage and low lookup latencies.

Evaluation on data center server applications: SkimpyStash can be used as a high throughput persistent key-value storage layer for a broad range of server class applications. We use real-world data traces from two data center applications, namely, the Xbox LIVE Primetime online multi-player game and inline storage deduplication, to drive and evaluate the design of SkimpyStash on commodity server platforms. SkimpyStash provides throughputs from few 10,000s to upwards of 100,000 get-set operations/sec on the evaluated applications.

Figure 1: Internal architecture of a Solid State Drive (SSD)

The rest of the paper is organized as follows. We provide an overview of flash memory in Section 2. In Section 3, we describe two motivating real-world data center applications that can benefit from a high throughput key-value store and are used to evaluate SkimpyStash. We develop the design of SkimpyStash in Section 4. We evaluate SkimpyStash in Section 5. We review related work in Section 6. Finally, we conclude in Section 7.

2. FLASH MEMORY OVERVIEW

Figure 1 gives a block diagram of a NAND flash based SSD. In flash memory, data is stored in an array of flash blocks. Each block spans 32-64 pages, where a page is the smallest unit of read and write operations.
A distinguishing feature of flash memory is that read operations are very fast compared to a magnetic disk drive. Moreover, unlike disks, random read operations are as fast as sequential read operations as there is no mechanical head movement. A major drawback of flash memory is that it does not allow in-place updates (i.e., overwrites). Page write operations in a flash memory must be preceded by an erase operation, and within a block, pages need to be written sequentially. The in-place update problem becomes complicated as write operations are performed at page granularity, while erase operations are performed at block granularity. The typical access latencies for read, write, and erase operations are 25 microseconds, 200 microseconds, and 1500 microseconds, respectively [9].

The Flash Translation Layer (FTL) is an intermediate software layer inside the SSD which makes the linear flash memory device act like a virtual disk. The FTL receives logical read and write commands from the applications and converts them to internal flash memory commands. To emulate a disk-like in-place update operation for a logical page (Lp), the FTL writes the data into a new physical page (Pp), maintains a mapping between logical pages and physical pages, and marks the previous physical location of Lp as invalid for future garbage collection. Although the FTL allows current disk based applications to use SSDs without any modifications, it needs to internally deal with the flash physical constraint of erasing a block before overwriting a page in that block. Besides the in-place update problem, flash memory exhibits another limitation – a flash block can only be erased a limited number of times (e.g., 10K-100K) [9]. The FTL uses various wear leveling techniques to even out the erase counts of different blocks in the flash memory to increase its overall longevity [17]. Recent studies show that current FTL schemes are very effective for workloads with sequential write access patterns. However, for workloads with random access patterns, these schemes show very poor performance [18, 20, 22]. One of the design goals of SkimpyStash is to use flash memory in an FTL-friendly manner.
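To make the out-of-place update behavior described above concrete, the following sketch shows how an FTL might remap a logical page write. It is a minimal illustration based on the description in this section, not actual device firmware logic; the class and method names (SimpleFtl, WritePage, ProgramPhysicalPage) are hypothetical.

    using System;
    using System.Collections.Generic;

    // Minimal sketch of FTL-style out-of-place page updates (hypothetical names).
    class SimpleFtl
    {
        private readonly Dictionary<int, int> _logicalToPhysical = new Dictionary<int, int>();
        private readonly HashSet<int> _invalidPhysicalPages = new HashSet<int>();
        private int _nextFreePhysicalPage = 0;

        // Emulates an in-place update of a logical page: the data goes to a fresh
        // physical page, and the old physical page is only marked invalid so that a
        // later block erase (garbage collection) can reclaim it.
        public void WritePage(int logicalPage, byte[] data)
        {
            if (_logicalToPhysical.TryGetValue(logicalPage, out int oldPhysical))
            {
                _invalidPhysicalPages.Add(oldPhysical); // obsolete version, reclaimed later
            }
            int newPhysical = _nextFreePhysicalPage++;  // always program a clean page
            ProgramPhysicalPage(newPhysical, data);
            _logicalToPhysical[logicalPage] = newPhysical;
        }

        private void ProgramPhysicalPage(int physicalPage, byte[] data)
        {
            // Stand-in for the actual flash program operation.
        }
    }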


3. KEY-VALUE STORE APPLICATIONS

We describe two real-world applications that can use SkimpyStash as an underlying persistent key-value store. Data traces obtained from real-world instances of these applications are used to drive and evaluate the design of SkimpyStash.

3.1 Online Multi-player Gaming

Online multi-player gaming technology allows people from geographically diverse regions around the globe to participate in the same game. The number of concurrent players in such a game could range from tens to hundreds of thousands, and the number of concurrent game instances offered by a single online service could range from tens to hundreds. An important challenge in online multi-player gaming is the requirement to scale the number of users per game and the number of simultaneous game instances. At the core of this is the need to maintain server-side state so as to track player actions on each client machine and update global game states to make them visible to other players as quickly as possible. These functionalities map to set and get key operations performed by clients on server-side state. The real-time responsiveness of the game is, thus, critically dependent on the response time and throughput of these operations.

There is also the requirement to store server-side game state in a persistent manner for (at least) the following reasons: (i) resuming a game from its interrupted state if and when crashes occur, (ii) offline analysis of game popularity, progression, and dynamics with the objective of improving the game, and (iii) verification of player actions for fairness when outcomes are associated with monetary rewards. We designed SkimpyStash to meet the high throughput and low latency requirements of such get-set key operations in online multi-player gaming.

3.2 Storage Deduplication

Deduplication is a recent trend in storage backup systems that eliminates redundancy of data across full and incremental backup data sets [28]. It works by splitting files into multiple chunks using a content-aware chunking algorithm like Rabin fingerprinting and using SHA-1 hash [24] signatures for each chunk to determine whether two chunks contain identical data [28]. In inline storage deduplication systems, the chunks (or their hashes) arrive one at a time at the deduplication server from client systems. The server needs to look up each chunk hash in an index it maintains for all chunk hashes seen so far for that backup location instance – if there is a match, the incoming chunk contains redundant data and can be deduplicated; if not, the (new) chunk hash needs to be inserted into the index.
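The inline deduplication index lookup just described can be sketched against a generic key-value store interface. This is an illustrative fragment under assumed interface and type names (IKeyValueStore, InlineDedup), not the interface of an actual deduplication product.

    using System.Security.Cryptography;

    // Hypothetical key-value store interface used only for illustration.
    interface IKeyValueStore
    {
        byte[] Get(byte[] key);
        void Set(byte[] key, byte[] value);
    }

    static class InlineDedup
    {
        // For each incoming chunk: hash it, probe the chunk-hash index, and either
        // deduplicate (hash already present) or insert the new chunk's metadata.
        public static bool ProcessChunk(IKeyValueStore chunkIndex, byte[] chunkData, byte[] chunkMetadata)
        {
            byte[] chunkHash;
            using (var sha1 = SHA1.Create())
            {
                chunkHash = sha1.ComputeHash(chunkData);  // 20-byte SHA-1 key
            }
            if (chunkIndex.Get(chunkHash) != null)
            {
                return true;                              // duplicate: chunk need not be stored again
            }
            chunkIndex.Set(chunkHash, chunkMetadata);     // new chunk: index its metadata
            return false;
        }
    }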
Because storage systems currently need to scale to tens of terabytes to petabytes of data volume, the chunk hash index is too big to fit in RAM, hence it is stored on hard disk. Index operations are thus throughput limited by expensive disk seek operations. Since backups need to be completed over windows of half a day or so (e.g., nights and weekends), it is desirable to obtain high throughput in inline storage deduplication systems. RAM prefetching and bloom-filter based techniques used by Zhu et al. [28] can avoid disk I/Os on close to 99% of the index lookups. Even at this reduced rate, an index lookup going to disk contributes about 0.1 msec to the average lookup time – this is about 10^3 times slower than a lookup hitting in RAM. SkimpyStash can be used as the chunk hash index for inline deduplication systems. By reducing the penalty of index lookup misses in RAM by orders of magnitude by serving such lookups from flash memory, SkimpyStash can help to increase deduplication throughput.

Figure 2: IOPS for sequential/random reads and writes using 4KB I/O request size on a 160GB fusionIO drive (seq-reads 99019, rand-reads 94500, seq-writes 16064, rand-writes 5948).

4. SkimpyStash DESIGN

We present the system architecture of SkimpyStash and the rationale behind some design choices in this section.

4.1 Coping with Flash Constraints

The design is driven by the need to work around two types of operations that are not efficient on flash media, namely:

1. Random writes: Small random writes effectively need to update data portions within pages. Since a (physical) flash page cannot be updated in place, a new (physical) page will need to be allocated and the unmodified portion of the data on the page needs to be relocated to the new page.

2. Writes less than flash page size: Since a page is the smallest unit of write on flash, writing an amount less than a page renders the rest of the (physical) page wasted – any subsequent append to that partially written (logical) page will need copying of existing data and writing to a new (physical) page.

To validate the performance gap between sequential and random writes on flash, we used Iometer [3], a widely used performance evaluation tool in the storage community, on a 160GB fusionIO SSD [2] attached over a PCIe bus to an Intel Core 2 Duo E6850 3GHz CPU. The number of worker threads was fixed at 8 and the number of outstanding I/Os for the drive at 64. The results for IOPS (I/O operations per sec) on 4KB I/O request sizes are summarized in Figure 2. Each test was run for 1 hour. The IOPS performance of sequential writes is about 3x that of random writes and worsens when the tests are run for longer durations (due to accumulating device garbage collection overheads). We also observe that the IOPS performance of (random/sequential) reads is about 6x that of sequential writes.
(The slight gap between the IOPS performance of sequential and random reads is possibly due to prefetching inside the device.)

Given the above, the most efficient way to write flash is to simply use it as an append log, where an append operation involves a flash page worth of data, typically 2KB or 4KB. This is the main constraint that drives the rest of our key-value store design. Flash has been used in a log-structured manner and its benefits reported in earlier works [11, 14, 15, 16, 19, 23, 27].

4.2 Design Goals

The design of SkimpyStash is driven by the following guiding principles:

• Support low-latency, high throughput operations. This requirement is extracted from the needs of many server class applications that need an underlying key-value store to function. Two motivating applications that are used for evaluating SkimpyStash are described in Section 3.

• Use flash aware data structures and algorithms. This principle accommodates the constraints of the flash device so as to extract maximum performance out of it. Random writes and in-place updates are expensive on flash memory, hence must be reduced or avoided. Sequential writes should be used to the extent possible and the fast nature of random/sequential reads should be exploited.

• Low RAM footprint per key, independent of key-value size. The goal here is to index all key-value pairs on flash in a RAM space efficient manner and make them accessible using a small number of flash reads per lookup. By being RAM space frugal, one can accommodate larger flash drive capacities and a correspondingly larger number of key-value pairs stored on them. Key-value pairs can be arbitrarily large, but the RAM footprint per key should be independent of it and small. We target a skimpy RAM usage of about 1 byte per key-value pair, a design point that is more aggressive than earlier designs like FAWN [11], BufferHash [10], ChunkStash [15], and FlashStore [16].

4.3 Architectural Components

SkimpyStash has the following main components. A base version of the design is shown in Figure 3 and an enhanced version in Figure 5. We will get to the details shortly.

RAM Write Buffer: This is a fixed-size data structure maintained in RAM that buffers key-value writes so that a write to flash happens only after there is enough data to fill a flash page (which is typically 2KB or 4KB in size). To provide strict durability guarantees, writes can also happen to flash when a configurable timeout interval (e.g., 1 msec) has expired (during which period multiple key-value pairs are collected in the buffer). The client call returns only after the write buffer is flushed to flash. The RAM write buffer is sized to 2-3 times the flash page size so that key-value writes can still go through when part of the buffer is being written to flash.
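As a rough illustration of the write buffering just described, the sketch below batches incoming key-value pairs and hands a full flash page worth of data to an append routine. The type and member names (RamWriteBuffer, the appendPage callback) are invented for this example, and the timeout-driven flush path is omitted.

    using System;
    using System.Collections.Generic;

    // Sketch of the RAM write buffer: accumulate key-value pairs and flush a full
    // flash page worth of data to the tail of the on-flash log.
    class RamWriteBuffer
    {
        private const int FlashPageSize = 4096;  // typical flash page size (bytes)
        private readonly List<KeyValuePair<byte[], byte[]>> _pending = new List<KeyValuePair<byte[], byte[]>>();
        private int _pendingBytes;
        private readonly Action<List<KeyValuePair<byte[], byte[]>>> _appendPage;

        public RamWriteBuffer(Action<List<KeyValuePair<byte[], byte[]>>> appendPage)
        {
            _appendPage = appendPage;            // writes one page to the log tail
        }

        public void Set(byte[] key, byte[] value)
        {
            _pending.Add(new KeyValuePair<byte[], byte[]>(key, value));
            _pendingBytes += key.Length + value.Length;
            if (_pendingBytes >= FlashPageSize)  // enough data to fill a flash page
            {
                _appendPage(new List<KeyValuePair<byte[], byte[]>>(_pending));
                _pending.Clear();
                _pendingBytes = 0;
            }
        }
    }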
RAM Hash Table (HT) Directory: The directory structure for key-value pairs stored on flash is maintained in RAM and is organized as a hash table, with each slot containing a pointer to a chain of records on flash. Each key-value pair record on flash contains, in addition to the key and value fields, a pointer to the next record (in the order in its respective chain) on flash. The chain of records on flash pointed to by each slot comprises the bucket of records corresponding to this slot in the HT directory. This is illustrated in Figure 3. The average number of records in a bucket, k, is a configurable parameter. In summary, we resolve hash table directory collisions by linear chaining and store the chains on flash.

In an enhancement of the design, we use two-choice based load balancing to reduce wide variation in bucket sizes (hence, chain lengths and associated lookup times), and introduce a bloom filter in each hash table directory slot in RAM for summarizing the records in that bucket so that at most one bucket chain on flash needs to be searched during a lookup. These enhancements form the core of our design and are discussed in detail in Section 4.5.

Figure 3: SkimpyStash architecture showing the sequential log organization of key-value pair records on flash and the base design for the hash table directory in RAM. (RAM write buffer is not shown.)

Flash store: The flash store provides persistent storage for the key-value pairs and is organized as a circular append log. Key-value pairs are written to flash in units of a page size to the tail of the log. When the log accumulates garbage (consisting of deleted or older values of updated records) beyond a configurable threshold, the pages on flash from the head of the log are recycled – valid entries from the head of the log are written back to the end of the log. This also helps to place the records in a given bucket contiguously on flash and improve read performance, as we elaborate shortly. Each key-value pair record on flash contains, in addition to the key and value fields, a pointer to the next record (in the order in its HT bucket chain) on flash.

4.4 Overview of Key Lookup and Insert Operations

To understand the relationship of the different storage areas in our design, it is helpful to follow the sequence of accesses in key insert and lookup operations performed by the client application.

A key lookup operation (get) first looks up the RAM write buffer. Upon a miss there, it looks up the HT directory in RAM and searches the chained key-value pair records on flash in the respective bucket.
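To make the base design concrete, here is a minimal sketch of the on-flash record layout and the get path that walks a bucket chain. The record fields follow the description above (key, value, next pointer, with offset 0 as the null pointer), but the helper names (FlashRecord, BaseDirectory, readRecordAt) are invented for this illustration, and real records would be serialized onto flash pages rather than held as objects.

    using System;
    using System.Linq;

    // On-flash record in the base design: key, value, and a 4-byte pointer to the
    // next record of the same bucket chain on flash (0 = end of chain).
    class FlashRecord
    {
        public byte[] Key;
        public byte[] Value;
        public uint NextOffset;   // flash location of the next record in the chain
    }

    class BaseDirectory
    {
        private readonly uint[] _bucketHead;                     // one 4-byte pointer per bucket in RAM
        private readonly Func<byte[], int> _hash;
        private readonly Func<uint, FlashRecord> _readRecordAt;  // stand-in for a flash page read

        public BaseDirectory(int numBuckets, Func<byte[], int> hash, Func<uint, FlashRecord> readRecordAt)
        {
            _bucketHead = new uint[numBuckets];                  // 0 means empty bucket
            _hash = hash;
            _readRecordAt = readRecordAt;
        }

        // get: hash the key to a bucket and follow the chain on flash, newest first,
        // so the first match found is the latest value written for that key.
        public byte[] Get(byte[] key)
        {
            int bucket = (int)((uint)_hash(key) % (uint)_bucketHead.Length);
            uint offset = _bucketHead[bucket];
            while (offset != 0)
            {
                FlashRecord rec = _readRecordAt(offset);         // one flash read per chain hop
                if (rec.Key.SequenceEqual(key))
                    return rec.Value;
                offset = rec.NextOffset;
            }
            return null;                                         // not found in this bucket
        }
    }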
A key insert (or, update) operation (set) writes the key-value pair into the RAM write buffer. When there are enough key-value pairs in the RAM write buffer to fill a flash page (or, when a configurable timeout interval since the client call has expired, say 1 msec), these entries are written to flash and inserted into the RAM HT directory.

A delete operation on a key is supported through insertion of a null value for that key. Eventually the null entry and earlier inserted values of the key on flash will be garbage collected.

When flash usage and the fraction of garbage records in the flash log exceed a certain threshold, a garbage collection (and compaction) operation is initiated to reclaim storage on flash in a manner similar to that in log-structured file systems [26]. This garbage collection operation starts scanning key-value pairs from the (current) head of the log – it discards garbage (invalid or orphaned, as defined later) key-value pair records and moves valid key-value pair records from the head to the tail of the log. It stops when floor thresholds are reached for flash usage or the fraction of garbage records remaining in the flash log.

The functionalities of (i) client key lookup/insert operations, (ii)

Figure 5: SkimpyStash architecture showing the sequential log organization of key-value pair records on flash and the enhanced design for the hash table directory in RAM. (RAM write buffer is not shown.)

might be necessary to enforce fairly equal load balancing of keys across HT directory buckets in order to keep each bucket chain of about the same size. One simple way to achieve this is to use the power of two choices idea from [12] that has been applied to balance a distribution of balls thrown into bins. With a load balanced design for the HT directory, each key would be hashed to two candidate HT directory buckets, using two hash functions h1 and h2, and actually inserted into the one that currently has fewer elements. We investigate the impact of this design decision on balancing bucket sizes on our evaluation workloads in Section 5. To implement this load balancing idea, we add one byte of storage to each HT directory slot in RAM that holds the current number of keys in that bucket – this space allocation accommodates up to a maximum of 2^8 - 1 = 255 keys per bucket.

Bloom Filter per Bucket

This design modification, in its current form, will lead to an increase in the number of flash reads during lookup. Since each key will need to be looked up in both of its candidate buckets, the worst case number of flash reads (hence lookup times) would double. To remove this latency impact on the lookup pathway, we add a bloom filter [13] per HT directory slot that summarizes the keys that have been inserted in the respective bucket. Note that this bloom filter in each HT directory slot can be sized to contain about k keys, since load balancing ensures that when the hash table reaches its budgeted full capacity, each bucket will contain not many more than k keys with very high probability. A standard rule of thumb for dimensioning a bloom filter is to use one byte per key (which gives a false positive probability of 2%), hence the bloom filter in each HT directory slot can be of size k bytes.

The introduction of bloom filters in each HT directory slot has another desirable side effect – lookups on non-existent keys will almost always not involve any flash reads, since the bloom filters in both candidate slots of the key will indicate that the key is not present (modulo false positive probabilities).
(Note that in the base design, lookups on non-existent keys also lead to flash reads and involve traversing the entire chain in the respective bucket.)

Figure 6: RAM hash table directory slot and sizes of component fields in the enhanced design of SkimpyStash: bloom filter (k bytes), bucket size (1 byte), and pointer to the key-value pair chain on flash (4 bytes). The parameter k is the average number of keys in a bucket.

In an interesting reciprocity of benefits, the bloom filters in each bucket not only help in reducing lookup times when two-choice load balancing is used but also benefit from load balancing. Load balancing aims to keep the number of keys in each bucket upper bounded (roughly) by the parameter k. This helps to keep bloom filter false positive probabilities in that bucket bounded as per the dimensioned capacity of k keys. Without load balancing, many more than k keys could be inserted into a given bucket and this would increase the false positive rate of the respective bloom filter well beyond what it was dimensioned for.

The additional fields added to each HT directory slot in RAM in the enhanced design are shown in Figure 6.

During a lookup operation, the key is hashed to its two candidate HT directory buckets and the chain on flash is searched only if the respective bloom filter indicates that the key may be there in that bucket. Thus, accounting for bloom filter false positives, the chain on flash will be searched with no success in less than 2% of the lookups.

When an insert operation corresponds to an update of an earlier inserted key, the record is always inserted in the same bucket as the earlier one, even if the choice determined by load balancing (out of two candidate buckets) is the other bucket. If we followed the choice given by load balancing, the key might be inserted in the bloom filters of both candidate slots – this would not preserve the design goal of traversing at most one bucket chain (with high probability) on flash during lookups. Moreover, the same problem would arise with version resolution during lookups if different versions of a key were allowed to be inserted in both candidate buckets. This rule also leads to efficiencies during garbage collection operations since all the obsolete values of a key appear in the same bucket chain on flash. Note that this complication involving overriding of the load balancing based choice of insertion bucket can be avoided when the application does not perform updates to earlier inserted keys – one example of such an application is storage deduplication as described in Section 3.

In summary, in this enhancement of the base design, a two-choice based load balancing strategy is used to reduce variations in the number of keys assigned to each bucket (hence, chain lengths and associated lookup times). Each HT directory slot in RAM also contains a bloom filter summarizing the keys in the bucket and a size (count) field storing the current number of keys in that bucket.
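A rough sketch of the enhanced directory slot and the two-choice insertion rule is shown below. It only mirrors the structure just summarized (k-byte bloom filter, 1-byte count, 4-byte chain pointer); the class names, the toy bloom filter with two probe positions, and the way hash values are passed in are assumptions of this illustration rather than the paper's exact implementation, and the same-bucket rule for updates is not shown.

    using System;

    // Enhanced HT directory slot: k-byte bloom filter, 1-byte bucket size, 4-byte
    // pointer to the head of the bucket chain on flash (0 = empty bucket).
    class DirectorySlot
    {
        public byte[] BloomFilter;  // ~1 byte per expected key, ~2% false positives
        public byte Count;          // number of keys currently in this bucket
        public uint ChainHead;      // flash offset of the most recently inserted record

        public DirectorySlot(int k) { BloomFilter = new byte[k]; }

        // Toy bloom filter using two derived probe positions per key (illustrative only).
        public void BloomAdd(int p1, int p2) { SetBit(p1); SetBit(p2); }
        public bool BloomMightContain(int p1, int p2) { return GetBit(p1) && GetBit(p2); }

        private void SetBit(int p) { int b = Mod(p); BloomFilter[b / 8] |= (byte)(1 << (b % 8)); }
        private bool GetBit(int p) { int b = Mod(p); return (BloomFilter[b / 8] & (1 << (b % 8))) != 0; }
        private int Mod(int p) { return (int)((uint)p % (uint)(BloomFilter.Length * 8)); }
    }

    class EnhancedDirectory
    {
        private readonly DirectorySlot[] _slots;

        public EnhancedDirectory(int numBuckets, int k)
        {
            _slots = new DirectorySlot[numBuckets];
            for (int i = 0; i < numBuckets; i++) _slots[i] = new DirectorySlot(k);
        }

        // Two-choice insertion: hash the key to two candidate buckets and place it in
        // the one that currently holds fewer keys, then record it in that bucket's
        // bloom filter and point the bucket head at the newly written flash record.
        public void Insert(int h1, int h2, int probe1, int probe2, uint newRecordOffset)
        {
            int chosen = _slots[Mod(h1)].Count <= _slots[Mod(h2)].Count ? Mod(h1) : Mod(h2);
            DirectorySlot slot = _slots[chosen];
            slot.BloomAdd(probe1, probe2);
            slot.Count++;
            // The record at newRecordOffset is assumed to have been written with its
            // next pointer set to the old chain head, so the chain stays intact.
            slot.ChainHead = newRecordOffset;
        }

        private int Mod(int h) { return (int)((uint)h % (uint)_slots.Length); }
    }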
RAM Space Overhead for Enhanced Design: With this design modification, the RAM space overhead per bucket now has three components, namely,

• Pointer to chain on flash (4 bytes),
• Bucket size (1 byte), and
• Bloom filter (k bytes).

This space overhead per HT directory slot is amortized over an average of k keys (in that bucket), hence the RAM space overhead per entry can be computed as (k + 1 + 4)/k = 1 + 5/k, which is about 1.5 bytes for k = 10. The average number of flash reads per lookup is still k/2 = 5 (with high probability); with current SSDs achieving flash read times in the range of 10 µsec, this corresponds to a lookup latency of about 50 µsec. Moreover, the variation across lookup latencies for different keys is better controlled in this design (compared to the base design) as bucket chains are about the same size due to two-choice based load balancing of keys across buckets. The RAM space usage per key-value pair for the enhanced design as a function of k is shown in Figure 4.

Storing key-value pairs to flash: Key-value pairs are organized on flash in a log-structure in the order of the respective write operations coming into the system. Each slot in the HT directory contains a pointer to the beginning of the chain on flash that represents the keys in that bucket. Each key-value pair record on flash contains, in addition to the key and value fields, a pointer to the next record (in the order in its respective chain) on flash. We use a 4-byte pointer, which is a combination of a page pointer and a page offset. The all-zero pointer is reserved for the null pointer – in an HT directory slot, this represents an empty bucket, while on flash it indicates that the respective record has no successor in the chain.

RAM and Flash Capacity Considerations: We designed our RAM indexing scheme to use 1 byte in RAM per key-value pair so as to maximize the amount of indexable storage on flash for a given RAM usage size. Whether RAM or flash capacity becomes the bottleneck for storing key-value pairs on flash depends further on the key-value pair size. With 64-byte key-value pair records, 1GB of RAM can index about 1 billion key-value pairs on flash, which occupy 64GB on flash. This flash memory capacity is well within the capacity range of SSDs shipping in the market today (from 64GB to 640GB). On the other hand, with 1024-byte key-value pair records, the same 1GB of RAM can index 1 billion key-value pairs which now occupy 1TB on flash – at currently available SSD capacities, this will require multiple flash drives to store the dataset.

4.6 Flash Storage Management

Key-value pairs are organized on flash in a log-structure in the order of the respective write operations coming into the system. When there are enough key-value pairs in the RAM write buffer to fill a flash page (or, when a pre-specified coalesce time interval is reached), they are written to flash.
The pages on flash are maintained implicitly as a circular log. Since the Flash Translation Layer (FTL) translates logical page numbers to physical ones, this circular log can be easily implemented as a contiguous block of logical page addresses with wraparound, realized by two page number variables, one for the first valid page (oldest written) and the other for the last valid page (most recently written).

We next describe two maintenance operations on flash in SkimpyStash, namely, compaction and garbage collection. Compaction is helpful in improving lookup latencies by reducing the number of flash reads when searching a bucket. Garbage collection is necessary to reclaim storage on flash and is a consequence of flash being used in a log-structured manner.

Compaction to Reduce Flash Reads during Lookups

In SkimpyStash, a lookup in an HT directory bucket involves following the chain of key-value records on flash. For a chain length of c records in a bucket, this involves an average of c/2 flash reads. Over time, as keys are inserted into a bucket and earlier inserted keys are updated, the chain length on flash for this bucket keeps increasing, degrading lookup times. We address this by periodically compacting the chain on flash in a bucket, placing the valid keys in that chain contiguously on one or more flash pages that are appended to the tail of the log. Thus, if m key-value pairs can be packed onto a single flash page (on average), the number of flash reads required to search for a key in a bucket of k records is k/(2m) on average and at most ⌈k/m⌉ in the worst case.

Figure 7: Diagram illustrating the effect of the compaction procedure on the organization of a bucket chain on flash in SkimpyStash.

The compaction operations proceed over time in a bucket as follows. Initially, as key-value pairs are added at different times to a bucket, they appear on different flash pages and are chained together individually on flash. When enough valid records accumulate in a bucket to fill a flash page (say, m of them), they are compacted and appended on a new flash page at the tail of the log – the chain is now of a single flash page size and requires one flash read (instead of m) to search fully. Thereafter, as further keys are appended to a bucket, they will be chained together individually and appear before the compacted group of keys in the chain. Over time, enough new records may accumulate in the bucket to allow them to be compacted to a second flash page, and so the process repeats. At any given time, the chain on flash for each bucket begins with a chained sequence of individual records followed by groups of compacted records (each group appearing on the same flash page). This organization of a bucket chain as a result of compaction is illustrated in Figure 7.
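The per-bucket compaction step can be pictured with the short sketch below: it gathers the leading individually chained records of a bucket, rewrites them contiguously at the log tail, and returns the new chain head. It reuses the FlashRecord type from the earlier lookup sketch; the helper delegates (readRecordAt, appendCompactedPage) and the in-memory record objects are assumptions of this illustration, and the real system works on serialized flash pages and leaves the old copies behind as garbage for later collection.

    using System;
    using System.Collections.Generic;

    // Sketch of compacting one bucket chain: copy up to m chained records onto a
    // fresh flash page at the tail of the log so they can be read with one flash read.
    static class BucketCompaction
    {
        public static uint CompactBucket(
            uint chainHead,                                    // current head of the bucket chain on flash
            int m,                                             // records that fit on one flash page
            Func<uint, FlashRecord> readRecordAt,              // read one chained record from flash
            Func<List<FlashRecord>, uint> appendCompactedPage) // write records contiguously, return new head offset
        {
            var records = new List<FlashRecord>();
            uint offset = chainHead;
            while (offset != 0 && records.Count < m)           // collect the leading individual records
            {
                records.Add(readRecordAt(offset));
                offset = records[records.Count - 1].NextOffset;
            }
            if (records.Count == 0) return chainHead;

            // The rewritten copies land contiguously at the log tail; the last copied
            // record keeps pointing at the untouched remainder of the chain (offset).
            records[records.Count - 1].NextOffset = offset;
            uint newHead = appendCompactedPage(records);

            // The old copies are now orphaned (garbage) in the log and will be
            // reclaimed by the garbage collection pass described next.
            return newHead;                                    // stored into the HT directory slot
        }
    }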
When a key-value pair size is relatively small, say 64 bytes as in the storage deduplication application, there may not be enough records in a bucket to fill a flash page, since this number is (roughly) upper bounded by the parameter k. In this case, we may reap the same benefits of compaction by applying the procedure to groups of chains in multiple buckets at a time.

The compaction operation, as described above, will lead to orphaned (or, garbage) records in the flash log. Moreover, garbage records also accumulate in the log as a result of key update and delete operations. These need to be garbage collected on flash as we describe next.

Garbage Collection

Garbage records (holes) accumulate in the log as a result of compaction and key update/delete operations. These operations create garbage records corresponding to all previous versions of the respective keys.

When a certain configurable fraction of garbage accumulates in the log (in terms of space occupied), a cleaning operation is performed to clean and compact the log. The cleaning operation considers currently used flash pages in oldest-first order and deallocates them in a way similar to garbage collection in log-structured file systems [26]. On each page, the sequence of key-value pairs is scanned to determine whether they are valid or not. The classification of a key-value pair record on flash follows from doing a lookup on the respective key starting from the HT directory – if this record is the same as that returned by the lookup, then it is valid; if it appears later in the chain than a valid record for that key, then this record is invalid and corresponds to an obsolete version of the key; otherwise, the record is orphaned and cannot be reached by following pointers from the HT directory (this may happen because of the compaction procedure, for example). When an orphaned record is encountered at the head of the log, it is skipped and the head position of the log is advanced to the next record.
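The valid/invalid/orphaned classification can be expressed as a small helper like the one below, which compares a scanned record's location against what a directory lookup returns for the same key. The enum and delegate names are invented for this sketch, and the actual chain walk and page handling are left out.

    using System;

    enum RecordState { Valid, Invalid, Orphaned }

    static class GarbageClassifier
    {
        // Classify a record found at 'recordOffset' while scanning the log head.
        // lookupOffsets(key) is assumed to return, in chain order (newest first), the
        // flash offsets of all records for this key reachable from the HT directory.
        public static RecordState Classify(
            byte[] key,
            uint recordOffset,
            Func<byte[], uint[]> lookupOffsets)
        {
            uint[] reachable = lookupOffsets(key);
            if (reachable.Length > 0 && reachable[0] == recordOffset)
                return RecordState.Valid;        // same record the directory lookup returns
            foreach (uint off in reachable)
                if (off == recordOffset)
                    return RecordState.Invalid;  // reachable, but an obsolete older version
            return RecordState.Orphaned;         // unreachable from the HT directory (e.g., after compaction)
        }
    }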
From the description of the key insertion procedure in Section 4.5, it follows that the first record in each bucket chain (the one pointed to from the HT directory slot) is the most recently inserted record, while the last record in the chain is the earliest inserted record in that bucket. Hence, the last record in a bucket chain will be encountered first during the garbage collection process and it may be a valid or invalid (obsolete version of the respective key) record. A valid record needs to be reinserted at the tail of the log while an invalid record can be skipped. In either case, the next pointer in its predecessor record in the chain would need to be updated. Since we want to avoid in-place updates (random writes) on flash, this requires relocating the predecessor record and so forth all the way to the first record in the chain. This effectively leads to the design decision of garbage collecting entire bucket chains on flash at a time.

In summary, when the last record in a bucket chain is encountered in the log during garbage collection, all valid records in that chain are compacted and relocated to the tail of the log. This garbage collection strategy has two benefits.

• First, the writing of an entire chain of records in a bucket to the tail of the log also allows them to be compacted and placed contiguously on one or more flash pages and helps to speed up the lookup operations on those keys, as explained above in the context of compaction operations, and

• Second, since garbage (orphaned) records are created further down the log between the (current) head and tail (corresponding to the locations of all records in the chain before relocation), this helps to speed up the garbage collection process for the respective pages when they are encountered later (since orphaned records can be simply discarded).

We investigate the impact of compaction and garbage collection on system throughput in Section 5.

4.7 Crash Recovery

SkimpyStash's persistency guarantee enables it to recover from system crashes due to power failure or other reasons. Because the system logs all key-value write operations to flash, it is straightforward to rebuild the HT directory in RAM by scanning all valid flash pages on flash. Recovery using this method can take some time, however, depending on the total size of valid flash pages that need to be scanned and the read throughput of flash memory.

If crash recovery needs to be executed faster so as to support "near" real-time recovery, then it is necessary to checkpoint the RAM HT directory periodically into flash (in a separate area from the key-value pair logs). Then, recovery involves reading the last written HT directory checkpoint from flash, scanning key-value pair logged flash pages with timestamps after that, and inserting them into the restored HT directory. During the operation of checkpointing the HT directory, all insert operations into it will need to be suspended (but the read operations can continue). We use a temporary, small in-RAM hash table to index the interim items and log them to flash. After the checkpointing operation completes, key-value pairs from the flash pages written in the interim are inserted into the HT directory. Key lookup operations, upon missing in the HT directory, will need to check these flash pages (via the small additional hash table) until the later insertions into the HT directory are complete. The flash garbage collection thread is suspended during the HT directory checkpointing operation, since the HT directory entries cannot be modified during this time.
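A simplified view of the slower recovery path (rebuilding the directory by replaying the log) might look like the following. It reuses the FlashRecord type from the earlier sketch; the RebuildDirectory name, the scanLogPages enumerator, and the insertIntoDirectory callback are assumptions made for this illustration, and checkpoint-based recovery is not shown.

    using System;
    using System.Collections.Generic;

    static class CrashRecovery
    {
        // Rebuild the RAM HT directory by scanning the flash log from oldest to
        // newest page and re-inserting each logged key-value record; later records
        // for the same key simply supersede earlier ones.
        public static void RebuildDirectory(
            IEnumerable<FlashRecord> scanLogPages,     // records in log (write) order
            Func<FlashRecord, uint> offsetOf,          // flash offset of a scanned record
            Action<byte[], uint> insertIntoDirectory)  // key -> flash offset of record
        {
            foreach (FlashRecord rec in scanLogPages)
            {
                // Re-inserting in write order restores each bucket chain with the most
                // recently written record at the head, matching normal insertion.
                insertIntoDirectory(rec.Key, offsetOf(rec));
            }
        }
    }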
5. EVALUATION

We evaluate SkimpyStash on real-world traces obtained from the two applications described in Section 3.

5.1 C# Implementation

We have prototyped SkimpyStash in approximately 3000 lines of C# code. MurmurHash [4] is used to realize the hash functions used in our implementation to compute hash table directory indices and bloom filter lookup positions; different seeds are used to generate different hash functions in this family. The metadata store on flash is maintained as a file in the file system and is created/opened in non-buffered mode so that there are no buffering/caching/prefetching effects in RAM from within the operating system. The ReaderWriterLockSlim and Monitor classes in the .NET 3.0 framework [1] are used to implement the concurrency control solution for multi-threading.

5.2 Evaluation Platform and Datasets

We use a standard server configuration to evaluate the performance of SkimpyStash. The server runs Windows Server 2008 R2 and uses an Intel Core2 Duo E6850 3GHz CPU, 4GB RAM, and a fusionIO 160GB flash drive [2] attached over a PCIe interface. We compare the throughput (get-set operations per sec) on the two traces described in Table 1.

Table 1: Properties of the two traces used in the performance evaluation of SkimpyStash.

  Trace   Total get-set ops   get:set ratio   Avg. key size (bytes)   Avg. value size (bytes)
  Xbox    5.5 million         7.5:1           92                      1200
  Dedup   40 million          2.2:1           20                      44

We described two real-world applications in Section 3 that can use SkimpyStash as an underlying persistent key-value store. Data traces obtained from real-world instances of these applications are used to drive and evaluate the design of SkimpyStash.

Xbox LIVE Primetime trace

We evaluate the performance of SkimpyStash on a large trace of get-set key operations obtained from real-world instances of the Microsoft Xbox LIVE Primetime online multiplayer game [7]. In this application, the key is a dot-separated sequence of strings with total length averaging 94 characters and the value averages around 1200 bytes. The ratio of get operations to set operations is about 7.5:1.


5.2 Evaluation Platform and Datasets

We use a standard server configuration to evaluate the performance of SkimpyStash. The server runs Windows Server 2008 R2 and uses an Intel Core2 Duo E6850 3GHz CPU, 4GB RAM, and a fusionIO 160GB flash drive [2] attached over the PCIe interface. We compare the throughput (get-set operations per sec) on the two traces described in Table 1.

We described two real-world applications in Section 3 that can use SkimpyStash as an underlying persistent key-value store. Data traces obtained from real-world instances of these applications are used to drive and evaluate the design of SkimpyStash.

Xbox LIVE Primetime trace: We evaluate the performance of SkimpyStash on a large trace of get-set key operations obtained from real-world instances of the Microsoft Xbox LIVE Primetime online multi-player game [7]. In this application, the key is a dot-separated sequence of strings with total length averaging 94 characters, and the value averages around 1200 bytes. The ratio of get operations to set operations is about 7.5:1.

Storage Deduplication trace: We have built a storage deduplication analysis tool that can crawl a root directory, generate the sequence of chunk hashes for a given average chunk size, and compute the number of deduplicated chunks and storage bytes. The enterprise data backup trace we use for evaluations in this paper was obtained by our storage deduplication analysis tool using 4KB chunk sizes. The trace contains 27,748,824 total chunks and 12,082,492 unique chunks. From this, we obtain the ratio of get operations to set operations in the trace as 2.2:1. In this application, the key is a 20-byte SHA-1 hash of the corresponding chunk and the value is the metadata for the chunk, with the total key-value pair size upper bounded by 64 bytes.

The properties of the two traces are summarized in Table 1. Both traces include get operations on keys that have not been set earlier in the trace. (Such get operations return null.) This is an inherent property of the applications, hence we play the traces "as-is" to evaluate throughput in operations per second.

Trace   Total get-set ops   get:set ratio   Avg. key size (bytes)   Avg. value size (bytes)
Xbox    5.5 million         7.5:1           92                      1200
Dedup   40 million          2.2:1           20                      44
Table 1: Properties of the two traces used in the performance evaluation of SkimpyStash.

Figure 8: Xbox trace: Relative variation in bucket sizes (standard deviation/mean) for different values of average bucket size (k) for the base and enhanced designs. (Behavior on the dedup trace is similar.)
Figure 9: Xbox trace: Average lookup (get) time for different values of average bucket size (k) for the base and enhanced designs. The enhanced design reduces lookup times by factors of 6x-24x as k increases.
Figure 10: Dedup trace: Average lookup (get) time for different values of average bucket size (k) for the base and enhanced designs. The enhanced design reduces lookup times by factors of 1.5x-2.5x as k increases.

5.3 Performance Evaluation

We evaluate the performance impact of our design decisions and obtain ballpark ranges on lookup times and throughput (get-set operations/sec) for SkimpyStash on the Xbox LIVE Primetime online multi-player game and storage deduplication application traces (from Table 1). We disable the log compaction procedure for all but the last set of experiments. In the last set of experiments, we investigate the impact of garbage collection (which also performs compaction) on system throughput.

Impact of load balancing on bucket sizes: In the base design, hashing keys to single buckets can lead to wide variations in the chain length on flash in each bucket, and this translates to wide variations in lookup times.
We investigate how two-choice based load balancing of keys to buckets can help reduce the variance among bucket sizes. In Figure 8, we plot the relative variation in bucket sizes (standard deviation/mean of bucket sizes) for different values of k = 1, 2, 4, 8, 16, 32 on the Xbox trace (the behavior on the dedup trace is similar). This metric is about 1.4x-6x better for the enhanced design than for the base design on both traces. The gap starts at about 1.4x for k = 1 and increases to about 6x for k = 32. This provides conclusive evidence that two-choice based load balancing is an effective strategy for reducing variation in bucket sizes.

Key lookup latencies: The two ideas of load balancing across buckets and using a bloom filter per bucket significantly reduce average key lookup times in the enhanced design, with the gains increasing as the average bucket size parameter k increases. At k = 8, the average lookup time in the enhanced design is 20 µsec for the Xbox trace and 12 µsec for the dedup trace. In Figures 9 and 10, we plot the average lookup time for different values of k = 1, 2, 4, 8, 16, 32 on the two traces, respectively. As k increases, the gains of the enhanced design grow from 6x to 24x on the Xbox trace and from 1.5x to 2.5x on the dedup trace. The gains are larger on the Xbox trace because that trace has many updates to earlier inserted keys (while the dedup trace has none); hence chains accumulate garbage records and get longer over time, and the bloom filters help even more to speed up lookups of non-existing keys (by avoiding a search of the entire chain on flash).

Throughput (get-set operations/sec): The enhanced design achieves throughputs in the range of 10,000-69,000 ops/sec on the Xbox trace and 34,000-165,000 ops/sec on the dedup trace, with throughput decreasing (as expected) for increasing values of k. This is shown in Figures 11 and 12. The throughput gains of the enhanced design over the base design are in the range of 3.5x-20x on the Xbox trace and 1.3x-2.5x on the dedup trace, with the gains increasing as the parameter k increases. These trends are in line with those of the average lookup times.
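To make the enhanced design's insert and lookup paths concrete, the following C# sketch combines two-choice bucket selection with a small per-bucket bloom filter. The bucket layout, the 64-bit bloom filter, the four bloom hash functions, and the seeded hash delegate standing in for MurmurHash are illustrative assumptions, not the prototype's parameters.

using System;
using System.Collections;
using System.Collections.Generic;

// Enhanced-design sketch: a key may live in either of two candidate buckets;
// inserts go to the currently less-loaded bucket, and each bucket's bloom
// filter tells lookups which chain (if any) needs to be walked on flash.
class Bucket
{
    public int Count;                          // number of keys chained in this bucket
    public long ChainHead = -1;                // flash offset of the most recent record (-1 = empty)
    public BitArray Bloom = new BitArray(64);  // per-bucket bloom filter
}

class TwoChoiceSketch
{
    readonly Bucket[] buckets;
    readonly Func<byte[], uint, uint> hash;    // (key, seed) -> hash value, e.g. seeded MurmurHash
    const uint BloomHashes = 4;

    public TwoChoiceSketch(int numBuckets, Func<byte[], uint, uint> hash)
    {
        buckets = new Bucket[numBuckets];
        for (int i = 0; i < numBuckets; i++) buckets[i] = new Bucket();
        this.hash = hash;
    }

    (Bucket, Bucket) Candidates(byte[] key)
    {
        uint b1 = hash(key, 0) % (uint)buckets.Length;
        uint b2 = hash(key, 1) % (uint)buckets.Length;
        return (buckets[b1], buckets[b2]);
    }

    void AddToBloom(Bucket b, byte[] key)
    {
        for (uint seed = 2; seed < 2 + BloomHashes; seed++)
            b.Bloom[(int)(hash(key, seed) % (uint)b.Bloom.Length)] = true;
    }

    bool MaybeInBloom(Bucket b, byte[] key)
    {
        for (uint seed = 2; seed < 2 + BloomHashes; seed++)
            if (!b.Bloom[(int)(hash(key, seed) % (uint)b.Bloom.Length)]) return false;
        return true;
    }

    // Insert: append the record to the tail of the flash log with a back
    // pointer to the chosen bucket's current chain head, then update the
    // chain head and bloom filter in RAM. The append is passed in as a
    // callback that returns the new record's flash offset.
    public void Set(byte[] key, Func<long, long> appendRecordWithBackPointer)
    {
        var (a, b) = Candidates(key);
        Bucket target = a.Count <= b.Count ? a : b;   // two-choice load balancing
        target.ChainHead = appendRecordWithBackPointer(target.ChainHead);
        target.Count++;
        AddToBloom(target, key);
    }

    // Lookup: the bloom filters disambiguate the two choices, so usually at
    // most one bucket chain has to be searched on flash; both are returned
    // only on a bloom filter false positive.
    public IEnumerable<long> ChainsToSearch(byte[] key)
    {
        var (a, b) = Candidates(key);
        if (MaybeInBloom(a, key) && a.ChainHead != -1) yield return a.ChainHead;
        if (MaybeInBloom(b, key) && b.ChainHead != -1) yield return b.ChainHead;
    }
}

In particular, a lookup of a non-existing key usually fails both bloom filter tests and so needs only the two in-RAM probes before returning null, with no flash read at all.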


Figure 11: Xbox trace: Throughput (get-set ops/sec) for different values of average bucket size (k) for the base and enhanced designs. The enhanced design improves throughput by factors of 3.5x-20x as k increases.
Figure 12: Dedup trace: Throughput (get-set ops/sec) for different values of average bucket size (k) for the base and enhanced designs. The enhanced design improves throughput by factors of 1.3x-2.5x as k increases.
Figure 13: Xbox trace: Throughput (get-set ops/sec) for different values of the garbage collection parameter g. (The average bucket size parameter is fixed at k = 8.)
Figure 14: Xbox trace: Average lookup (get) time for different values of the garbage collection parameter g. (The average bucket size parameter is fixed at k = 8.)

The higher throughput of SkimpyStash on the dedup trace can be explained as follows. The write mix per unit operation (get or set) in the dedup trace is about 2.65 times that of the Xbox trace. However, since the key-value pair size is about 20 times smaller for the dedup trace, the number of syncs to stable storage per write operation is about 20 times less. Overall, the number of syncs to stable storage per unit operation is therefore about 20/2.65 ≈ 7.6 times less for the dedup trace. Moreover, bucket chains get longer in the Xbox trace due to invalid (garbage) records accumulating from key updates. For these reasons, SkimpyStash obtains higher throughputs on the dedup trace. In addition, since the dedup trace has no key update operations, the lookup operation can be skipped during key inserts for the dedup trace; this also contributes to its higher throughput relative to the Xbox trace. Garbage collection can help to improve performance on the Xbox trace, as we discuss next.

Impact of garbage collection activity: We study the impact of garbage collection (GC) activity on system throughput (ops/sec) and lookup times in SkimpyStash. The storage deduplication application does not involve updates to chunk metadata, hence we evaluate the impact of garbage collection on the Xbox trace. We fix an average bucket size of k = 8 for this set of experiments.

The aggressiveness of the garbage collection activity is controlled by a parameter g, which is the interval, in number of key update operations, after which the garbage collector is invoked. Note that update operations are only those insert operations that update an earlier value of an already inserted key. Hence, the garbage collector is invoked after every g key-value pair records worth of garbage accumulate in the log. When the garbage collector is invoked, it starts scanning from the (current) head of the log and skips over orphaned records until it encounters the first valid or invalid record (that is, one that is part of some bucket chain). Then, it garbage collects that entire bucket chain, compacts and writes the valid records in the chain to the tail of the log, and returns.
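The role of the parameter g can be summarized with a small C# sketch of the trigger logic. The collectOneChain callback stands in for the head-of-log scan just described; the names are hypothetical and the counting policy is only a sketch of the behavior described in the text.

using System;

// Sketch of the garbage collection trigger: only set operations that overwrite
// an existing key create garbage records, so the collector is invoked after
// every g such updates.
class GcTriggerSketch
{
    readonly int g;                    // garbage collection parameter
    readonly Action collectOneChain;   // scan from log head, skip orphans, relocate one bucket chain to the tail
    int updatesSinceLastGc;

    public GcTriggerSketch(int g, Action collectOneChain)
    {
        this.g = g;
        this.collectOneChain = collectOneChain;
    }

    // Called on every set operation; keyAlreadyExisted is true only for updates.
    public void OnSet(bool keyAlreadyExisted)
    {
        if (!keyAlreadyExisted) return;    // fresh inserts create no garbage
        if (++updatesSinceLastGc >= g)
        {
            updatesSinceLastGc = 0;
            collectOneChain();
        }
    }
}

Smaller values of g keep chains shorter at the cost of more frequent relocation work; the experiments reported next identify g = 8 as the sweet spot on the Xbox trace.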
In Figure 13, we plot the throughput (ops/sec) obtained on the Xbox trace as the garbage collection parameter g is varied. (The average bucket size parameter is fixed at k = 8.) We see that as g is increased from g = 2 onwards, the throughput first increases, peaks at g = 8, and then drops off all the way to g = 128, at which point the throughput is close to that without garbage collection. This trend can be explained as follows. For small values of g, the high frequency of garbage collection is an overhead on system performance. As g is increased, the frequency of garbage collection activity decreases, while lookup times continue to benefit (but less and less) from the compaction procedure involved in garbage collection (as explained in Section 4.6). Thus, the parameter g pulls system throughput in two different directions due to these two effects, with the sweet spot for optimal performance occurring at g_opt = 8. We also reran the experiments for an average bucket size parameter of k = 4 and found the optimal value of g to be g_opt = 8 in that case as well. It appears that the value of g_opt is a property of the trace and depends on the rate at which the application generates garbage records.

Finally, we study the impact of the compaction procedure (as part of garbage collection activity) on average lookup times. Because compaction removes invalid records from bucket chains, reducing their lengths, and places the remaining records contiguously on one or more flash pages, it reduces search times within a bucket, as explained in Section 4.6. The effect can be seen in Figure 14, where lookup times steadily decrease as the aggressiveness of the garbage collection activity is increased (by decreasing the parameter g).


Comparison with an earlier key-value store design: We compare SkimpyStash with FlashStore [16], a flash-based key-value store which uses a variant of a cuckoo hashing-based collision resolution policy [25] and compact key signatures to store per-key-value-pair metadata in RAM. We use n = 16 hash functions to implement cuckoo hashing, as suggested in [16]. Since FlashStore keeps metadata for only one key-value pair per hash table bucket, we compare it with the SkimpyStash design with the average bucket size parameter (k) set to 1, so as to keep the comparison fair. Table 2 summarizes the throughput performance on the Xbox and Dedup traces. SkimpyStash shows superior performance due to the smaller number of RAM accesses when non-existing keys are looked up: in such cases, SkimpyStash needs only two memory accesses, while FlashStore needs as many as 16. In addition, due to the nature of cuckoo hashing, FlashStore also needs to look up an auxiliary hash table when items are not found in the main (cuckoo) hash table, which further reduces its throughput. Since the ratio of insert (set) to lookup (get) operations on the Dedup trace is higher than on the Xbox trace (as explained in Section 5.2), SkimpyStash shows relatively better performance on the Dedup trace, as it has less overhead during the insertion of non-existing keys.

Trace   FlashStore (ops/sec)   SkimpyStash, k = 1 (ops/sec)
Xbox    58,600                 69,000
Dedup   75,100                 165,000
Table 2: FlashStore [16] vs. SkimpyStash (k = 1).

6. RELATED WORK

Flash memory has received a lot of recent interest as a stable storage medium that can overcome the access bottlenecks of hard disks. Researchers have considered modifying existing applications to improve performance on flash, as well as providing operating system support for inserting flash as another layer in the storage hierarchy. In this section, we briefly review work that is related to SkimpyStash and point out its differentiating aspects.

MicroHash [27] designs a memory-constrained index structure for flash-based sensor devices with the goal of optimizing energy usage and minimizing memory footprint. Keys are assigned to buckets in a RAM directory based on range and flushed to flash when the RAM buffer page is full.
The directory needs to be repartitioned periodically based on bucket utilizations. SkimpyStash addresses this problem by hashing keys to hash table directory buckets and using two-choice load balancing across them, but gives up the ability to serve range queries. It also uses a bloom filter in each directory slot in RAM to disambiguate between the two choices during lookups. Moreover, MicroHash needs to write partially filled (index) pages to flash corresponding to keys in the same bucket, whereas writes to flash in SkimpyStash always write full pages of key-value pair records.

FlashDB [23] is a self-tuning B+-tree based index that dynamically adapts to the mix of reads and writes in the workload. Like MicroHash, it also targets memory- and energy-constrained sensor network devices. Because a B+-tree needs to maintain partially filled leaf-level buckets on flash, appending new keys to these buckets involves random writes, which are not efficient flash operations. Hence, an adaptive mechanism is provided to switch between disk and log-based modes. The system leverages the facts that key values in sensor applications have a small range and that, at any given time, only a small number of these leaf-level buckets are active. Minimizing latency is not an explicit design goal.

Tree-based index structures optimized for flash memory are proposed in [8, 21]. These works do not focus on RAM space reduction. In contrast, we focus here mainly on a hash-based index structure with a low RAM footprint.

The benefits of using flash in a log-like manner have been exploited in FlashLogging [14] for synchronous logging. This system uses multiple inexpensive USB drives and achieves performance comparable to flash SSDs, but at a much lower price. FlashLogging assumes sequential workloads. In contrast, SkimpyStash works with arbitrary key access workloads.

FAWN [11] uses an array of embedded processors equipped with small amounts of flash to build a power-efficient cluster architecture for data-intensive computing. The differentiating aspect of SkimpyStash is its aggressive design goal of extremely low RAM usage: SkimpyStash incurs about 1 byte of RAM overhead per key-value pair stored on flash, while FAWN uses 6 bytes. This requires SkimpyStash to use techniques that are different from FAWN's.
Moreover, we investigate the use of SkimpyStash in data center server-class applications that need to achieve high throughput on mixed read-write workloads.

BufferHash [10] builds a content addressable memory (CAM) system using flash storage for networking applications like WAN optimizers. It buffers key-value pairs in RAM, organized as a hash table, and flushes the hash table to flash when the buffer is full. Past copies of hash tables on flash are searched using a time series of bloom filters maintained in RAM, and searching for a key in a given copy involves multiple flash reads. Thus, a key uses as many bytes in RAM (across all bloom filters) as the number of times it is updated. Moreover, storing key-value pairs in hash tables on flash wastes space on flash, since hash table load factors need to be well below 100% to keep lookup times bounded.

FlashStore [16] is a high-throughput persistent key-value store that uses flash memory as a non-volatile cache between RAM and hard disk. It is designed to store the working set of key-value pairs on flash and to use one flash read per key lookup. As the working set changes over time, space is made for the current working set by destaging recently unused key-value pairs to hard disk and recycling pages in the flash store. FlashStore organizes key-value pairs in a log structure on flash to exploit faster sequential write performance. It uses an in-memory hash table to index them, with hash collisions resolved by a variant of cuckoo hashing [25]. The in-memory hash table stores compact key signatures instead of full keys so as to trade off RAM usage against false flash read operations. The RAM usage is 6 bytes per key-value pair stored on flash.

ChunkStash [15] is a key-value store on flash that is designed to speed up inline storage deduplication. It indexes data chunks (with the SHA-1 hash as key) for identifying duplicate data. ChunkStash uses one flash read per chunk lookup and organizes chunk metadata in a log structure on flash to exploit fast sequential writes. It builds an in-memory hash table to index them, using techniques similar to FlashStore [16] to reduce RAM usage.

7. CONCLUSION

We designed SkimpyStash to be used as a high throughput persistent key-value storage layer for a broad range of server-class applications. The distinguishing feature of SkimpyStash is its design goal of an extremely low RAM footprint, at about 1 byte per key-value pair, which is more aggressive than earlier designs.
We used real-world data traces from two data center applications, namely the Xbox LIVE Primetime online multi-player game and inline storage deduplication, to drive and evaluate the design of SkimpyStash on commodity server platforms. SkimpyStash provides throughputs from a few 10,000s to upwards of 100,000 get-set operations/sec on the evaluated applications.


8. REFERENCES
[1] C# System.Threading. http://msdn.microsoft.com/en-us/library/system.threading.aspx.
[2] Fusion-IO Drive Datasheet. http://www.fusionio.com/PDFs/Data_Sheet_ioDrive_2.pdf.
[3] Iometer. http://www.iometer.org/.
[4] MurmurHash Function. http://en.wikipedia.org/wiki/MurmurHash.
[5] MySpace Uses Fusion Powered I/O to Drive Greener and Better Data Centers. http://www.fusionio.com/case-studies/myspace-case-study.pdf.
[6] Releasing Flashcache. http://www.facebook.com/note.php?note_id=388112370932.
[7] Xbox LIVE Primetime game. http://www.xboxprimetime.com/.
[8] AGRAWAL, D., GANESAN, D., SITARAMAN, R., DIAO, Y., AND SINGH, S. Lazy-Adaptive Tree: An Optimized Index Structure for Flash Devices. In VLDB (2009).
[9] AGRAWAL, N., PRABHAKARAN, V., WOBBER, T., DAVIS, J., MANASSE, M., AND PANIGRAHY, R. Design Tradeoffs for SSD Performance. In USENIX (2008).
[10] ANAND, A., MUTHUKRISHNAN, C., KAPPES, S., AKELLA, A., AND NATH, S. Cheap and Large CAMs for High Performance Data-Intensive Networked Systems. In NSDI (2010).
[11] ANDERSEN, D., FRANKLIN, J., KAMINSKY, M., PHANISHAYEE, A., TAN, L., AND VASUDEVAN, V. FAWN: A Fast Array of Wimpy Nodes. In SOSP (2009).
[12] AZAR, Y., BRODER, A., KARLIN, A., AND UPFAL, E. Balanced Allocations. In SIAM Journal on Computing (1994).
[13] BRODER, A., AND MITZENMACHER, M. Network Applications of Bloom Filters: A Survey. In Internet Mathematics (2002).
[14] CHEN, S. FlashLogging: Exploiting Flash Devices for Synchronous Logging Performance. In SIGMOD (2009).
[15] DEBNATH, B., SENGUPTA, S., AND LI, J. ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory. In USENIX (2010).
[16] DEBNATH, B., SENGUPTA, S., AND LI, J. FlashStore: High Throughput Persistent Key-Value Store. In VLDB (2010).
[17] GAL, E., AND TOLEDO, S. Algorithms and Data Structures for Flash Memories. In ACM Computing Surveys (2005), vol. 37.
[18] GUPTA, A., KIM, Y., AND URGAONKAR, B. DFTL: A Flash Translation Layer Employing Demand-Based Selective Caching of Page-Level Address Mappings. In ASPLOS (2009).
[19] KAWAGUCHI, A., NISHIOKA, S., AND MOTODA, H. A Flash-Memory Based File System. In USENIX (1995).
[20] KOLTSIDAS, I., AND VIGLAS, S. Flashing Up the Storage Layer. In VLDB (2008).
[21] LI, Y., HE, B., YANG, J., LUO, Q., AND YI, K. Tree Indexing on Solid State Drives. PVLDB 3, 1 (2010).
[22] NATH, S., AND GIBBONS, P. Online Maintenance of Very Large Random Samples on Flash Storage. In VLDB (2008).
[23] NATH, S., AND KANSAL, A. FlashDB: Dynamic Self-tuning Database for NAND Flash. In IPSN (2007).
[24] NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY, FIPS 180-1. Secure Hash Standard. U.S. Department of Commerce, 1995.
[25] PAGH, R., AND RODLER, F. F. Cuckoo Hashing. Journal of Algorithms 51, 2 (May 2004).
[26] ROSENBLUM, M., AND OUSTERHOUT, J. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems 10 (1991).
[27] ZEINALIPOUR-YAZTI, D., LIN, S., KALOGERAKI, V., GUNOPULOS, D., AND NAJJAR, W. A. MicroHash: An Efficient Index Structure for Flash-based Sensor Devices. In FAST (2005).
[28] ZHU, B., LI, K., AND PATTERSON, H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In FAST (2008).
