
Workload Analysis of a Large-Scale Key-Value Store

Berk Atikoglu, Stanford and Facebook (atikoglu@stanford.edu)
Yuehai Xu, Wayne State and Facebook (yhxu@wayne.edu)
Eitan Frachtenberg*, Facebook (etc@fb.com)
Song Jiang, Wayne State (sjiang@wayne.edu)
Mike Paleczny, Facebook (mpal@fb.com)

ABSTRACT

Key-value stores are a vital component in many scale-out enterprises, including social networks, online retail, and risk analysis. Accordingly, they are receiving increased attention from the research community in an effort to improve their performance, scalability, reliability, cost, and power consumption. To be effective, such efforts require a detailed understanding of realistic key-value workloads. And yet little is known about these workloads outside of the companies that operate them. This paper aims to address this gap. To this end, we have collected detailed traces from Facebook’s Memcached deployment, arguably the world’s largest. The traces capture over 284 billion requests from five different Memcached use cases over several days. We analyze the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases. We also propose a simple model of the most representative trace to enable the generation of more realistic synthetic workloads by the community.

Our analysis details many characteristics of the caching workload. It also reveals a number of surprises: a GET/SET ratio of 30:1 that is higher than assumed in the literature; some applications of Memcached behave more like persistent storage than a cache; strong locality metrics, such as keys accessed many millions of times a day, do not always suffice for a high hit rate; and there is still room for efficiency and hit rate improvements in Memcached’s implementation. Toward the last point, we make several suggestions that address the exposed deficiencies.
Categories and Subject Descriptors

C.2.4 [Distributed Systems]: Distributed Databases; D.4.8 [Performance]: Modeling and Prediction; D.4.2 [Storage Management]: Distributed Memories

Keywords

Workload Analysis, Workload modeling, Key-Value Store

*Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMETRICS’12, June 11–15, 2012, London, England, UK.
Copyright 2012 ACM 978-1-4503-1097-0/12/06 ...$10.00.

1. INTRODUCTION

Key-value (KV) stores play an important role in many large websites. Examples include: Dynamo at Amazon [15]; Redis at GitHub, Digg, and Blizzard Interactive [27]; Memcached at Facebook, Zynga and Twitter [18, 26]; and Voldemort at Linkedin [1]. All these systems store ordered (key, value) pairs and are, in essence, a distributed hash table.

A common use case for these systems is as a layer in the data-retrieval hierarchy: a cache for expensive-to-obtain values, indexed by unique keys. These values can represent any data that is cheaper or faster to cache than re-obtain, such as commonly accessed results of database queries or the results of complex computations that require temporary storage and distribution.

Because of their key role in large website performance, KV stores are carefully tuned for low response times and high hit rates. But like all caching heuristics, a KV-store’s performance is highly dependent on its workload. It is therefore imperative to understand the workload’s characteristics.
Additionally, analyzing and understanding large-scale cache workloads can also: provide insights into topics such as the role and effectiveness of memory-based caching in distributed website infrastructure; expose the underlying patterns of user behavior; and provide difficult-to-obtain data and statistical distributions for future studies.

In this paper, we analyze five workloads from Facebook’s Memcached deployment. Aside from the sheer scale of the site and data (over 284 billion requests over a period of 58 sample days), this case study also introduces to the community several different usage scenarios for KV stores. This variability serves to explore the relationship between the cache and various data domains: where overall site patterns are adequately handled by a generalized caching infrastructure, and where specialization would help. In addition, this paper offers the following key contributions and findings:

1. A workload decomposition of the traces that shows how different applications of Memcached can have extreme variations in terms of read/write mix, request sizes and rates, and usage patterns (Sec. 3).

2. An analysis of the caching characteristics of the traces and the factors that determine hit rates. We found that different Memcached pools can vary significantly in their locality metrics, but surprisingly, the best predictor of hit rates is actually the pool’s size (Sec. 6).

3. An examination of various performance metrics over time, showing diurnal and weekly patterns (Sec. 3.3, 4.2.2, 6).

4. An analytical model that can be used to generate more realistic synthetic workloads. We found that the salient size characteristics follow power-law distributions, similar to other storage and Web-serving systems (Sec. 5).

5. An exposition of a Memcached deployment that can shed light on real-world, large-scale production usage of KV-stores (Sec. 2.2, 8).

The rest of this paper is organized as follows. We begin by describing the architecture of Memcached, its deployment at Facebook, and how we analyzed its workload. Sec. 3 presents the observed experimental properties of the trace data (from the request point of view), while Sec. 4 describes the observed cache metrics (from the server point of view). Sec. 5 presents a simple analytical model of the most representative workload. The next section brings the data together in a discussion of our results, followed by a section surveying previous efforts on analyzing cache behavior and workload analysis.

2. MEMCACHED DESCRIPTION

2.1 Architecture

Memcached1 is a simple, open-source software package that exposes data in RAM to clients over the network. As data size grows in the application, more RAM can be added to a server, or more servers can be added to the network. Additional servers generally only communicate with clients. Clients use consistent hashing [9] to select a unique server per key, requiring only the knowledge of the total number of servers and their IP addresses. This technique presents the entire aggregate data in the servers as a unified distributed hash table, keeps servers completely independent, and facilitates scaling as data size grows.

Memcached’s interface provides the basic primitives that hash tables provide—insertion, deletion, and retrieval—as well as more complex operations built atop them. Data are stored as individual items, each including a key, a value, and metadata.
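The client-side consistent-hashing scheme described above can be sketched as a simple hash ring. This is an illustrative sketch only: the server addresses, the use of MD5, and the number of points per server are assumptions, not details of Facebook’s client implementation (production clients use more elaborate variants).

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each server owns several points on a
    circle; a key maps to the first server point at or after its hash."""

    def __init__(self, servers, points_per_server=100):
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            for i in range(points_per_server):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(s):
        # MD5 chosen only for illustration; any uniform hash works
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def server_for(self, key):
        """Walk clockwise from the key's hash to the next server point."""
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

# Hypothetical server addresses, for illustration only
ring = HashRing(["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"])
assert ring.server_for("user:42") == ring.server_for("user:42")  # deterministic
```

A property worth noting, and the reason clients favor this scheme: adding or removing one server moves only the keys adjacent to that server’s points, rather than rehashing the entire key space.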
Item size can vary from a few bytes to over 100 KB, heavily skewed toward smaller items (Sec. 3). Consequently, a naïve memory allocation scheme could result in significant memory fragmentation. To address this issue, Memcached adopts a slab allocation technique, in which memory is divided into slabs of different sizes. The slabs in a class store items whose sizes are within the slab’s specific range. A newly inserted item obtains its memory space by first searching the slab class corresponding to its size. If this search fails, a new slab of the class is allocated from the heap. Symmetrically, when an item is deleted from the cache, its space is returned to the appropriate slab, rather than the heap. Memory is allocated to slab classes based on the initial workload and its item sizes, until the heap is exhausted. Consequently, if the workload characteristics change significantly after this initial phase, we may find that the slab allocation is inappropriate for the workload, resulting in memory underutilization.

1 http://memcached.org/

Table 1: Memcached pools sampled (in one cluster). These pools do not match their UNIX namesakes, but are used for illustrative purposes here instead of their internal names.

  Pool  Size      Description
  USR   few       user-account status information
  APP   dozens    object metadata of one application
  ETC   hundreds  nonspecific, general-purpose
  VAR   dozens    server-side browser information
  SYS   few       system data on service location

A new item arriving after the heap is exhausted requires the eviction of an older item in the appropriate slab. Memcached uses the Least-Recently-Used (LRU) algorithm to select the items for eviction. To this end, each slab class has an LRU queue maintaining access history on its items.
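A minimal sketch of the slab-class scheme with per-class LRU queues just described. The base chunk size, growth factor, and tiny per-class capacities here are illustrative assumptions, not Memcached’s actual tuning:

```python
from collections import OrderedDict

class SlabClass:
    """One slab class: a fixed chunk size plus its own LRU queue.
    Capacity is measured in chunks and kept tiny here for illustration."""

    def __init__(self, chunk_size, capacity):
        self.chunk_size = chunk_size
        self.capacity = capacity
        self.lru = OrderedDict()  # key -> value, least recently used first

    def set(self, key, value):
        self.lru.pop(key, None)
        if len(self.lru) >= self.capacity:
            self.lru.popitem(last=False)  # evict the LRU item of this class
        self.lru[key] = value             # most recent goes to the back

# Exponentially growing chunk sizes (base and factor are assumptions)
classes = [SlabClass(int(64 * 1.25 ** i), capacity=2) for i in range(10)]

def class_for(item_size):
    """Route an item to the smallest class whose chunks fit it."""
    for c in classes:
        if item_size <= c.chunk_size:
            return c
    return None  # larger than the biggest slab class

assert class_for(70).chunk_size == 80  # 70 B items land in the 80 B class
```

Note how eviction is scoped to a single class: a burst of large items cannot evict small items, which is one consequence of the per-class LRU design described above.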
Although LRU decrees that any accessed item be moved to the top of the queue, this version of Memcached coalesces repeated accesses of the same item within a short period (one minute by default) and only moves this item to the top the first time, to reduce overhead.

2.2 Deployment

Facebook relies on Memcached for fast access to frequently-accessed values. Web servers typically try to read persistent values from Memcached before trying the slower backend databases. In many cases, the caches are demand-filled, meaning that generally, data is added to the cache after a client has requested it and failed.

Modifications to persistent data in the database often propagate as deletions (invalidations) to the Memcached tier. Some cached data, however, is transient and not backed by persistent storage, requiring no invalidations.

Physically, Facebook deploys front-end servers in multiple datacenters, each containing one or more clusters of varying sizes. Front-end clusters consist of both Web servers, running primarily HipHop [31], and caching servers, running primarily Memcached. These servers are further subdivided based on the concept of pools. A pool is a partition of the entire key space, defined by a prefix of the key, and typically represents a separate application or data domain. The main reason for separate domains (as opposed to one all-encompassing cache) is to ensure adequate quality of service for each domain. For example, one application with high turnover rate could evict keys of another application that shares the same server, even if the latter has high temporal locality but lower access rates. Another reason to separate domains is to facilitate application-specific capacity planning and performance analysis.

In this paper, we describe traces from five separate pools—one trace from each pool (traces from separate machines in the same pool exhibit similar characteristics).
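The demand-filled read path and delete-based invalidation described in this subsection follow a common look-aside pattern, sketched below. The dict-based `cache` and stub `database` are illustrative stand-ins, not Facebook’s actual client code:

```python
cache = {}                        # stands in for a Memcached pool
database = {"user:1": "alice"}    # stands in for the backing database

def get(key):
    """Demand-filled read: on a cache miss, fetch from the database
    and SET the value so later reads hit the cache."""
    value = cache.get(key)
    if value is None:
        value = database.get(key)  # miss: fall through to the backend
        if value is not None:
            cache[key] = value     # fill on demand
    return value

def update(key, value):
    """Writes go to the database; the cached copy is invalidated
    (a DELETE), not rewritten, so the next GET refills it on demand."""
    database[key] = value
    cache.pop(key, None)

assert get("user:1") == "alice"   # first read misses, then fills
assert "user:1" in cache          # now cached
update("user:1", "bob")
assert "user:1" not in cache      # invalidated, not overwritten
```

This invalidate-on-write choice matches the paper’s observation that database modifications propagate to the cache tier as DELETEs rather than SETs.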
These pools represent a varied spectrum of application domains and cache usage characteristics (Table 1). One pool in particular, ETC, represents general cache usage of multiple applications, and is also the largest of the pools; the data collected from this trace may be the most applicable to general-purpose KV-stores. The focus of this paper is on workload characteristics, patterns, and relationships to social networking, so the exact details of server count and components have little relevance

here. It is important to note, however, that all Memcached instances in this study ran on identical hardware.

2.3 Tracing Methodology

Our analysis called for complete traces of traffic passing through Memcached servers for at least a week. This task is particularly challenging because it requires nonintrusive instrumentation of high-traffic volume production servers. Standard packet sniffers such as tcpdump2 have too much overhead to run under heavy load. We therefore implemented an efficient packet sniffer called mcap. Implemented as a Linux kernel module, mcap has several advantages over standard packet sniffers: it accesses packet data in kernel space directly and avoids additional memory copying; it introduces only 3% performance overhead (as opposed to tcpdump’s 30%); and unlike standard sniffers, it handles out-of-order packets correctly by capturing incoming traffic after all TCP processing is done. Consequently, mcap has a complete view of what the Memcached server sees, which eliminates the need for further processing of out-of-order packets. On the other hand, its packet parsing is optimized for Memcached packets, and would require adaptations for other applications.

The captured traces vary in size from 3 TB to 7 TB each. This data is too large to store locally on disk, adding another challenge: how to offload this much data (at an average rate of more than 80,000 samples per second) without interfering with production traffic. We addressed this challenge by combining local disk buffering and dynamic offload throttling to take advantage of low-activity periods in the servers.

Finally, another challenge is this: how to effectively process these large data sets? We used Apache HIVE3 to analyze Memcached traces. HIVE is part of the Hadoop framework that translates SQL-like queries into MapReduce jobs.
We also used the Memcached “stats” command, as well as Facebook’s production logs, to verify that the statistics we computed, such as hit rates, are consistent with the aggregated operational metrics collected by these tools.

2 http://www.tcpdump.org/
3 http://hive.apache.org/

3. WORKLOAD CHARACTERISTICS

This section describes the observed properties of each trace in terms of the requests that comprise it, their sizes, and their frequencies.

3.1 Request Composition

Figure 1: Distribution of request types per pool, over exactly 7 days. UPDATE commands aggregate all non-DELETE writing operations, such as SET, REPLACE, etc.

We begin by looking at the basic data that comprises the workload: the total number of requests in each server, broken down by request types (Fig. 1). Several observations delineate the different usage of each pool:

USR handles significantly more GET requests than any of the other pools. GET operations comprise over 99.8% of this pool’s workload. One reason for this is that the pool is sized large enough to maximize hit rates, so refreshing values is rarely necessary. These values are also updated at a slower rate than some of the other pools. The overall effect is that USR is used more like RAM-based persistent storage than a cache.

APP has high GET rates too—owing to the popularity of this application—but also a large number of DELETE operations. DELETE operations occur when a cached database entry is modified (but not required to be set again in the cache). SET operations occur when the Web servers add a value to the cache. The relatively high number of DELETE operations shows that this pool represents database-backed values that are affected by frequent user modifications.

ETC has similar characteristics to APP, but with an even higher rate of DELETE requests (of which some may not be currently cached).
ETC is the largest and least specific of the pools, so its workloads might be the most representative to emulate. Because it is such a large and heterogeneous workload, we pay special attention to this workload throughout the paper.

VAR is the only pool sampled that is write-dominated. It stores short-term values such as browser-window size for opportunistic latency reduction. As such, these values are not backed by a database (hence, no invalidating DELETEs are required). But they change frequently, accounting for the high number of UPDATEs.

SYS is used to locate servers and services, not user data. As such, the number of requests scales with the number of servers, not the number of user requests, which is much larger. This explains why the total number of SYS requests is much smaller than the other pools’.

It is interesting to note that the ratio of GETs to UPDATEs in ETC (approximately 30:1) is significantly higher than most synthetic workloads typically assume (Sec. 7). For demand-filled caches like USR, where each miss is followed by an UPDATE, the ratios of GET to UPDATE operations mentioned above are related to hit rate in general and the sizing of the cache to the data in particular. So in theory, one could justify any synthetic GET to UPDATE mix by controlling the cache size. But in practice, not all caches or keys are demand-filled, and these caches are already sized to fit a real-world workload in a way that successfully trades off hit rates to cost.
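The demand-fill arithmetic above can be made concrete: if every GET miss is followed by exactly one UPDATE, the GET/UPDATE ratio is the reciprocal of the miss rate. A small sketch under that (admittedly idealized) assumption:

```python
def implied_hit_rate(get_to_update_ratio):
    """Under a pure demand-fill assumption, every GET miss triggers
    exactly one UPDATE, so GET/UPDATE = 1 / miss_rate and therefore
    hit_rate = 1 - 1 / (GET/UPDATE ratio)."""
    return 1.0 - 1.0 / get_to_update_ratio

# ETC's observed ~30:1 mix would correspond to roughly a 96.7% hit
# rate if it were purely demand-filled (which, as noted above, it is
# not: some keys are never demand-filled at all).
assert round(implied_hit_rate(30), 3) == 0.967
```

This also shows why cache sizing alone could, in principle, produce any GET/UPDATE mix in a synthetic workload: shrinking the cache lowers the hit rate and drags the ratio down with it.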

Figure 2: Key and value size distributions for all traces. The leftmost CDF shows the sizes of keys, up to Memcached’s limit of 250 B (not shown). The center plot similarly shows how value sizes distribute. The rightmost CDF aggregates value sizes by the total amount of data they use in the cache; for example, values under 320 B or so in SYS use virtually no space in the cache, 320 B values weigh around 8% of the data, and values close to 500 B take up nearly 80% of the entire cache’s allocation for values.

3.2 Request Sizes

Next, we look at the sizes of keys and values in each pool (Fig. 2), based on SET requests. All distributions show strong modalities. For example, over 90% of APP’s keys are 31 bytes long, and value sizes around 270 B show up in more than 30% of SET requests. USR is the most extreme: it has only two key sizes (16 B and 21 B) and virtually just one value size (2 B). Even in ETC, the most heterogeneous of the pools, requests with 2-, 3-, or 11-byte values add up to 40% of the total requests. On the other hand, it also has a few very large values (around 1 MB) that skew the weight distribution (rightmost plot in Fig. 2), leaving less caching space for smaller values.

Small values dominate all workloads, not just in count, but especially in overall weight. Except for ETC, 90% of all cache space is allocated to values of less than 500 B. The implications for caching and system optimizations are significant. For example, network overhead in the processing of multiple small packets can be substantial, which explains why Facebook coalesces as many requests as possible in as few packets as possible [9].
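As a rough illustration of such coalescing, the sketch below batches pending keys into multi-key GET command lines (Memcached's text protocol accepts several keys per `get` command); the batch size of 50 is an arbitrary choice of ours, not a value from the paper.

```python
# Illustrative sketch of request coalescing: many small GETs are batched
# into a few multi-key "get" commands, amortizing per-packet overhead.
# The max_batch value is an arbitrary illustrative choice.

def coalesce_gets(keys, max_batch=50):
    """Group pending keys into multi-key GET command lines."""
    return ["get " + " ".join(keys[i:i + max_batch])
            for i in range(0, len(keys), max_batch)]

cmds = coalesce_gets(["user:%d" % i for i in range(120)])
print(len(cmds))  # -> 3 request lines instead of 120
```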
Another example is memory fragmentation. The strong modality of each workload implies that different Memcached pools can optimize memory allocation by modifying the slab size constants to fit each distribution. In practice, this is an unmanageable and unscalable solution, so instead Memcached uses many (44) slab classes with exponentially growing sizes, in the hope of reducing allocation waste, especially for small sizes.

3.3 Temporal Patterns

To understand how production Memcached load varies over time, we look at each trace’s transient request rate over its entire collection period (Fig. 3). All traces clearly show the expected diurnal pattern, but with different values and amplitudes. If we increase our zoom factor further (as in the last plot), we notice that traffic in ETC bottoms out around 08:00 and has two peaks around 17:00 and 03:00. Not surprisingly, the hours immediately preceding 08:00 UTC (midnight in Pacific Time) represent night time in the Western Hemisphere. The first peak, on the other hand, occurs as North America starts its day, while it is evening in Europe, and continues until the later peak time for North America. Although different traces (and sometimes even different days in the same trace) differ in which of the two peaks is higher, the entire period between them, representing the Western Hemisphere day, sees the highest traffic volume. In terms of weekly patterns, we observe a small traffic drop on most Fridays and Saturdays, with traffic picking up again on Sundays and Mondays.

The diurnal cycle represents load variation on the order of 2×. We also observe the presence of traffic spikes. Typically, these can represent a swift surge in user interest in one topic, such as occurs with major news or media events. Less frequently, these spikes stem from programmatic or operational causes.
Either way, the implication for Memcached development and deployment is that one must budget individual node capacity to allow for these spikes, which can easily double or even triple the normal peak request rate. Although such budgeting underutilizes resources during normal traffic, it is nevertheless imperative: otherwise, the many Web servers that take on this sudden traffic and fail to get a prompt response from Memcached would all query the same database nodes. This scenario could be debilitating, so it must remain hypothetical.

4. CACHE BEHAVIOR

The main metric used in evaluating cache efficacy is hit rate: the percentage of GET requests that return a value. The overall hit rate of each server, as derived from the traces and verified with Memcached’s own statistics, is shown in Table 2. This section takes a deeper look at the factors that influence these hit rates and how they relate to cache locality, user behavior, temporal patterns, and Memcached’s design.

Table 2: Mean cache hit rate over entire trace.
  Pool      APP    VAR    SYS    USR    ETC
  Hit rate  92.9%  93.7%  98.7%  98.2%  81.4%

4.1 Hit Rates over Time

When looking at how hit rates vary over time (Fig. 4), almost all traces show diurnal variance, within a small band of a few percentage points. USR’s plot is curious: it appears to be monotonically increasing (with diurnal undulation). This behavior stems from the usage model for USR. Recall

Figure 3: Request rates at different dates and times of day, Coordinated Universal Time (UTC). Each data point counts the total number of requests in the preceding second. Except for USR and VAR, different traces were collected at different times. The last plot zooms in on a 24-hour period from the ETC trace for greater detail. (Panels: APP, VAR, SYS, USR, ETC, and a 24-hour ETC zoom; y-axis: requests/sec.)

Figure 4: GET hit rates over time for all pools (days start at midnight UTC). (Panels: USR, APP, ETC, VAR, SYS; y-axis: hit rate in percent.)

Figure 5: CDFs of key appearances, depicting how many keys account for how many requests, in relative terms. Keys are ranked from least popular to most popular.

from Sec. 2.2 that USR is sized large enough to minimize the number of misses; in other words, to contain almost all possible keys. When a USR Memcached server starts, it contains no data and misses on all requests. But over time, as clients add values to it while the pressure to evict is nonexistent, hit rates climb upwards. Thus, USR’s transient hit rate is correlated not only with time of day, but primarily with the server’s uptime, reaching 99.8% after several weeks. Like USR, SYS has a relatively bounded data domain, so it can easily be sized to keep hit rates high and stable. But unlike the other four workloads, SYS does not react directly to user load, so its performance is less cyclical and regular.

4.2 Locality Metrics

This section looks at three ways to measure locality in GET requests: (1) how often and how much some keys repeat in requests; (2) the number of unique keys and how it varies over time; and (3) the reuse period, as a measure of temporal locality. These metrics, unlike hit rates, are an inherent property of the request stream of each pool; changing the server’s hardware or server count will not affect them. Consequently, this data could provide insights into the workload, in isolation from implementation choices.
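The first of these metrics, the key-appearance CDF of Fig. 5, can be computed from a request log roughly as follows (a minimal sketch of ours, not the paper's analysis pipeline): rank keys from least to most popular, then accumulate their share of total requests.

```python
# Illustrative sketch of the Fig. 5 metric: a CDF of key appearances,
# with keys ranked from least popular to most popular.
from collections import Counter

def key_appearance_cdf(keys):
    """Returns a list of (cumulative key ratio, cumulative request ratio)."""
    counts = sorted(Counter(keys).values())  # least popular first
    total_reqs, total_keys = sum(counts), len(counts)
    cum, points = 0, []
    for i, c in enumerate(counts, 1):
        cum += c
        points.append((i / total_keys, cum / total_reqs))
    return points

# Toy log: one hot key, ten cold ones.
log = ["hot"] * 90 + list("abcdefghij")
cdf = key_appearance_cdf(log)
print(cdf[-2])  # ~91% of the keys account for only 10% of the requests
```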
4.2.1 Repeating Keys

We start by looking at the distribution of key repeats (Fig. 5). All workloads exhibit the expected long-tail distributions, with a small percentage of keys appearing in most of the requests, and most keys repeating only a handful of times. So, for example, 50% of ETC’s keys occur in only 1% of all requests, meaning they do not repeat many times, while a few popular keys repeat in millions of requests per day. This high concentration of repeating keys provides the justification for caching them in the first place.

All curves are remarkably similar, except for SYS’s, which has two distinct sections. The first, up to about 65% of the keys, represents keys that are repeated infrequently, conceivably those that are retrieved when one or more clients start up and fill their local cache. The second part, representing the last 25% of keys and more than 90% of the requests, may account for the normal SYS scenario, when a value is added or updated in the cache and all the clients retrieve it.

4.2.2 Locality over Time

It is also interesting to examine how key uniqueness varies over time by counting how many keys do not repeat in close time proximity (Fig. 6). To interpret this data, note that a lower percentage indicates that fewer keys are unique, and therefore suggests a higher hit rate. Indeed, note that the diurnal dips correspond to increases in hit rates in Fig. 4.

An immediately apparent property is that in any given pool, this percentage remains relatively constant over time, especially with hour-long bins, with only small diurnal variations and few spikes. Data for 5-minute bins are naturally noisier, but even here most samples remain confined to a narrow range. This suggests that different pools have not only different traffic patterns, but also different caching properties that can benefit from different tuning, justifying the choice to segregate workloads into pools. Each pool exhibits its characteristic locality band and average.
SYS’s low average rate of 3.3% unique keys per hour, for example, suggests that different clients request roughly the same service information. In contrast, USR’s much higher average rate of 34.6% unique keys per hour suggests that the per-user data it represents spans a much more disparate range. Generally, we would assume that lower bands translate to higher overall hit rates, all other things being equal. This turns out not to be the case. In fact, the Pearson correlation coefficient between the average unique-key ratios in 60-minute bins (taken from Fig. 6) and the average hit rates (Table 2) is negative, as expected, but small: −0.097. Indeed, not all other things are equal, as discussed in Sec. 4.1.

Comparing 5-minute bins to hour-long bins reveals that unique keys in the former appear in significantly higher concentrations than in the latter. This implies a rapid rate of decay in interest in most keys. But does this rate continue to drop very fast over a longer time window? The next section sets out to answer this question.
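Fig. 6's unique-key ratio can be approximated with a sketch like the following. The paper does not spell out the exact binning rule, so the criterion below (a key is "unique" in a bin if it appears only once within that bin) is our assumption, for illustration only.

```python
from collections import Counter

# Illustrative sketch of a per-bin unique-key ratio. The "appears only
# once in the bin" criterion is our assumption, not the paper's stated
# definition.

def unique_key_ratio(requests, bin_seconds=300):
    """requests: iterable of (unix_timestamp, key) pairs.
    Returns {bin_index: percentage of distinct keys appearing once}."""
    bins = {}
    for ts, key in requests:
        bins.setdefault(int(ts // bin_seconds), Counter())[key] += 1
    return {b: 100.0 * sum(1 for c in cnt.values() if c == 1) / len(cnt)
            for b, cnt in sorted(bins.items())}

reqs = [(0, "a"), (1, "a"), (2, "b"), (400, "c")]
print(unique_key_ratio(reqs))  # -> {0: 50.0, 1: 100.0}
```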

Figure 6: Ratio of unique keys over time. Each data point on the top (bottom) plot shows how many unique keys were requested in the preceding 5 (60) minutes, as a percentage of all keys. The label for each pool, at the top right corner of the data, also includes the average ratio throughout the entire pool’s trace. (Averages for 5-minute bins: USR 74.7%, ETC 44.5%, APP 43.0%, VAR 33.4%, SYS 18.4%; for 60-minute bins: USR 34.6%, APP 22.4%, ETC 20.7%, VAR 15.3%, SYS 3.3%.)

4.2.3 Temporal Locality: Reuse Period

Temporal locality refers to how often a key is re-accessed. One metric to quantify the temporal locality of any given key is the reuse period: the time between consecutive accesses to the key. Fig. 7 counts all key accesses in the five traces, and bins them according to the time elapsed since the same key’s previous access. Unique keys (those that do not repeat at all within the trace period) are excluded from this count.

The answer to the question from the previous section is therefore positive: the count of accesses in each reuse period continues to decay quickly after the first hour. For the ETC trace, for example, 88.5% of the keys are reused within an hour, but only 4% more within two, and within six hours, 96.4% of all nonunique keys have already repeated. It continues to decay at a slower rate. This access behavior suggests a pattern for Facebook’s users as well, with some users visiting the site more frequently than others and reusing the keys associated with their accounts. Another interesting sub-pattern occurs every day.
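The reuse-period binning behind these numbers can be sketched as follows (an illustrative sketch of ours, assuming each key's requests arrive in time order):

```python
from collections import Counter

# Sketch of the Fig. 7 computation: for each repeat access of a key,
# record the whole hours elapsed since that key's previous access.

def reuse_period_histogram(requests):
    """requests: iterable of (unix_timestamp, key) pairs.
    Returns Counter mapping hour-long bin -> number of repeat accesses.
    Keys seen only once (unique keys) contribute nothing."""
    last_seen, hist = {}, Counter()
    for ts, key in requests:
        if key in last_seen:
            hist[int((ts - last_seen[key]) // 3600)] += 1
        last_seen[key] = ts
    return hist

reqs = [(0, "k"), (1800, "k"), (90000, "k"), (500, "solo")]
print(dict(reuse_period_histogram(reqs)))  # -> {0: 1, 24: 1}
```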
Note the periodic peaks on even 24 hours in four of the five traces, especially in the VAR pool that is associated with browser usage. These peaks suggest that a noteworthy number of users log in to the site at approximately the same time of day each time. Once more, these increased-locality indications also correspond to increased hit rates in Fig. 4. As in the previous section, the SYS pool stands out. It does not show the same 24-hour periodicity, because its keys

Figure 7: Reuse period histogram per pool. Each hour-long bin n counts keys that were first requested n hours after their latest appearance. The inset zooms in on the five hours after the first. (y-axis: keys requested, log scale; x-axis: time in hours.)

relate to servers and services, not users. It also decays precipitously compared to the others. As in Sec. 4.2.1, we find that, since its data are cached locally by clients, most of SYS’s GET requests likely represent data that are newly available, updated, or expired from the client cache; these are then requested by many clients concurrently. This would explain why 99.9% of GET requests are repeated within an hour of the first key access. Later, such keys would be cached locally and accessed rarely, perhaps when a newly added client needs to fill its own cache.

Nevertheless, there is still value in the reuse period as a predictor of hit rates. Since all pools have sufficient memory for over an hour of fresh data, the percentage of keys reused within an hour correlates positively with the overall hit rates in Table 2 (with a Pearson coefficient of 0.17). The correlation is stronger, with a coefficient of 0.44, if we omit USR and SYS, which have atypical cache behavior (minimal evictions in the former and local caching in the latter).

4.3 Case Study: ETC Hit Rates

We turn our attention to ETC’s hit/miss rates, because frequent misses can noticeably hurt user experience. At this point, one might expect ETC’s hit rate to exceed the 96% 6-hour key-reuse rate, since it is provisioned with more than enough RAM to store the fresh data of the preceding 6 hours. Unfortunately, this is not the case, and the observed hit rate is significantly lower, at 81%. To understand why, we analyzed all the misses in the last 24 hours of the trace (Table 3). The largest number of misses in ETC comes from keys that are accessed for the first time (at least in a 10-day period). This is the long tail of the locality metrics we analyzed before. Sec. 4.2.1 showed that ~50% of ETC’s keys are accessed in only 1% of requests, and therefore benefit little or not at all from a demand-filled cache. The many deletions in the cache also hinder the cache’s efficacy. ETC is a very diverse pool with many applications, some with limited reusability. But the other half of the keys, which show up in 99% of the requests, are so popular (some repeating millions of times) that Memcached can satisfy over 4 in 5 requests to the ETC pool.

Table 3: Miss categories in last 24 hours of the ETC trace. Compulsory misses count GETs with no matching SET in the preceding 10 days (meaning, for all practical purposes, new keys to the cache). Invalidation misses count GETs preceded by a matching DELETE request. Eviction misses count all other missing GETs.
  Miss category    Compulsory  Invalidation  Eviction
  Ratio of misses  70%         8%            22%

5. STATISTICAL MODELING

This section describes the salient workload characteristics of the ETC trace using simple distribution models. The ETC trace was selected because it is both the most representative of large-scale, general-purpose KV stores, and the easiest to model, since it is not distorted by the idiosyncratic aberrations of application-specific pools. We also think that its mixed workload is easier to generalize to other general-purpose caches with a heterogeneous mix of requests and sizes. The more Facebook-specific workloads, such as USR or SYS, may be interesting as edge cases, but probably not so much as models for synthetic workloads.

The functional models presented here prioritize parsimonious characterization over fidelity. As such, they obviously do not capture all the nuances of the trace, such as its bursty nature or the inclusion of one-off events. But barring access to the actual trace, they can serve the community as a better basis for synthetic workload generation than assumptions based on guesswork or small-scale logs.

Methodology

We modeled independently the three main performance properties that would enable simple emulation of this trace: key sizes, value sizes, and inter-arrival rates. The rate of, and ratio between, GET/SET/DELETE requests can be derived from Sec. 3.1. For cache analysis, additional properties can be gleaned from Sec. 4.

To justify the assumption that the three properties are independent, we picked a sample of 1,000,000 consecutive requests and measured the Pearson coefficient between each pair of variables. The pairwise correlations, as shown in Table 4, are indeed very low.

Table 4: Pearson correlation coefficient between each pair of modeled variables.
  Variable pair                   Correlation
  Inter-arrival gap / Key size      0.0111
  Inter-arrival gap / Value size    0.0065
  Key size / Value size            -0.0286

We created functional models by fitting various distributions (such as Weibull, Gamma, Extreme Value, Normal, etc.) to each data set and choosing the distribution that minimizes the Kolmogorov-Smirnov distance. All our data resemble power-law distributions, and fit the selected models quite well, with the exception of a handful of points (see Fig. 8). To deal with these outliers and improve the fit, we removed these points as necessary and modeled the remaining samples. The few removed points are tabulated separately as a histogram, so a more accurate synthetic workload of the entire trace should combine the functional model with the short list of value-frequency outliers.

Key-Size Distribution

We found the model that best fits key sizes in bytes (with a Kolmogorov-Smirnov distance of 10.5) to be the Generalized Extreme Value distribution with parameters μ = 30.7984, σ = 8.20449, and k = 0.078688. We have verified that these parameters remain fairly constant, regardless of time of day.

Value-Size Distribution

We found the model that best fits value sizes in bytes (with a Kolmogorov-Smirnov distance of 10.5), starting from 15 bytes, to be the Generalized Pareto distribution with parameters θ = 0, σ = 214.476, and k = 0.348238 (this distribution is also independent of the time of day). The first 15 value lengths and their probabilities can be modeled separately as a discrete probability distribution whose values are given in Table 5.
relate to servers and services, not users. It also decays precipitously compared to the others. As in Sec. 4.2.1, we find that since its data are cached locally by clients, it is likely that most of SYS's GET requests represent data that are newly available, updated, or expired from the client cache; these are then requested by many clients concurrently. This would explain why 99.9% of GET requests are repeated within an hour of the first key access. Later, such keys would be cached locally and accessed rarely, perhaps when a newly added client needs to fill its own cache.

Nevertheless, there is still value in using the reuse period to predict hit rates. Since all pools have sufficient memory for over an hour of fresh data, the percentage of keys reused within an hour correlates positively with the overall hit rates in Table 2 (with a Pearson coefficient of 0.17). The correlation is stronger, with a coefficient of 0.44, if we omit USR and SYS, which have atypical cache behavior (minimal evictions in the former and local caching in the latter).

4.3 Case Study: ETC Hit Rates

We turn our attention to ETC's hit/miss rates, because frequent misses can noticeably hurt user experience. At this point, one might expect ETC's hit rate to exceed the 96% 6-hour key-reuse rate, since it is provisioned with more than enough RAM to store the fresh data of the preceding 6 hours. Unfortunately, this is not the case, and the observed hit rate is significantly lower, at 81%. To understand why, we analyzed all the misses in the last 24 hours of the trace (Table 3). The largest number of misses in ETC comes from keys that are accessed for the first time (at least in a 10-day period). This is the long tail of the locality metrics we analyzed before. Sec. 4.2.1 showed that ≈50% of ETC's keys are accessed in only 1% of requests, and therefore benefit little or not at all from a demand-filled cache. The many deletions in the cache also hinder the cache's efficacy.
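This miss taxonomy (compulsory, invalidation, and eviction misses, as defined in Table 3) can be sketched in code. The trace record format and the `classify_misses` helper below are hypothetical illustrations, not the paper's actual tooling:

```python
from collections import defaultdict

# Hypothetical sketch of the Table 3 miss classification.
# Trace records are assumed to be (timestamp, op, key) tuples in time
# order; the (timestamp, key) pairs of GETs known to have missed are
# supplied separately. Both formats are assumptions for illustration.
def classify_misses(trace, missed_gets, window):
    last_set = {}     # key -> timestamp of the most recent SET
    last_delete = {}  # key -> timestamp of the most recent DELETE
    missed = set(missed_gets)
    counts = defaultdict(int)
    for ts, op, key in trace:
        if op == "SET":
            last_set[key] = ts
        elif op == "DELETE":
            last_delete[key] = ts
        elif op == "GET" and (ts, key) in missed:
            s = last_set.get(key)
            if s is None or ts - s > window:
                counts["compulsory"] += 1    # no SET in the window: a new key
            elif key in last_delete and last_delete[key] > s:
                counts["invalidation"] += 1  # most recent write was a DELETE
            else:
                counts["eviction"] += 1      # value was written, then evicted
    return dict(counts)
```

For the paper's definitions, `window` would correspond to 10 days in the trace's time units.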
ETC is a very diverse pool with many applications, some with limited reusability. But the other half of the keys, which show up in 99% of the requests, are so popular (some repeating millions of times) that Memcached can satisfy over 4 in 5 requests to the ETC pool.

Table 3: Miss categories in the last 24 hours of the ETC trace. Compulsory misses count GETs with no matching SET in the preceding 10 days (meaning, for all practical purposes, new keys to the cache). Invalidation misses count GETs preceded by a matching DELETE request. Eviction misses count all other missing GETs.

    Miss category     Compulsory   Invalidation   Eviction
    Ratio of misses   70%          8%             22%

5. STATISTICAL MODELING

This section describes the salient workload characteristics of the ETC trace using simple distribution models. The ETC trace was selected because it is both the most representative of large-scale, general-purpose KV stores and the easiest to model, since it is not distorted by the idiosyncratic aberrations of application-specific pools. We also think that its mixed workload is easier to generalize to other general-purpose caches with a heterogeneous mix of requests and sizes. The more Facebook-specific workloads, such as USR or SYS, may be interesting as edge cases, but probably not so much as models for synthetic workloads.

The functional models presented here prioritize parsimonious characterization over fidelity. As such, they obviously do not capture all the nuances of the trace, such as its bursty nature or the inclusion of one-off events. But barring access to the actual trace, they can serve the community as a better basis for synthetic workload generation than assumptions based on guesswork or small-scale logs.

Methodology

We modeled independently the three main performance properties that would enable simple emulation of this trace: key sizes, value sizes, and inter-arrival rates. The rate of, and ratio between, GET/SET/DELETE requests can be derived from Sec. 3.1.
For cache analysis, additional properties can be gleaned from Sec. 4.

To justify the assumption that the three properties are independent, we picked a sample of 1,000,000 consecutive requests and measured the Pearson coefficient between each pair of variables. The pairwise correlations, as shown in Table 4, are indeed very low.

Table 4: Pearson correlation coefficient between each pair of modeled variables.

    Variable pair                     Correlation
    Inter-arrival gap ↔ Key size      -0.0111
    Inter-arrival gap ↔ Value size     0.0065
    Key size ↔ Value size             -0.0286

We created functional models by fitting various distributions (such as Weibull, Gamma, Extreme Value, Normal, etc.) to each data set and choosing the distribution that minimizes the Kolmogorov-Smirnov distance. All our data resemble power-law distributions and fit the selected models quite well, with the exception of a handful of points (see Fig. 8). To deal with these outliers and improve the fit, we removed these points as necessary and modeled the remaining samples. The few removed points are tabulated separately as a histogram, so a more accurate synthetic workload of the entire trace should combine the functional model with the short list of value-frequency outliers.

Key-Size Distribution

We found the model that best fits key sizes in bytes (with a Kolmogorov-Smirnov distance of 10.5) to be the Generalized Extreme Value distribution with parameters μ = 30.7984, σ = 8.20449, k = 0.078688. We have verified that these parameters remain fairly constant, regardless of time of day.

Value-Size Distribution

We found the model that best fits value sizes in bytes (with a Kolmogorov-Smirnov distance of 10.5), starting from 15 bytes, to be Generalized Pareto with parameters θ = 0, σ = 214.476, k = 0.348238 (this distribution is also independent of the time of day). The first 15 value lengths and their probabilities can be modeled separately as a discrete probability distribution whose values are given in Table 5.
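The two fitted size distributions can be sampled directly via their quantile (inverse-CDF) functions. The sketch below uses the parameters reported above; the helper names are ours, and the sign convention for the shape parameter k follows the text (positive k = heavy tail), while some libraries negate the GEV shape:

```python
import math
import random

# Sketch: inverse-CDF samplers for the fitted size distributions,
# using the parameters reported in the text.

def sample_key_size(rng, mu=30.7984, sigma=8.20449, k=0.078688):
    # GEV quantile function: mu + sigma * ((-ln U)^(-k) - 1) / k
    u = rng.random()
    return mu + sigma * ((-math.log(u)) ** (-k) - 1.0) / k

def sample_value_size(rng, sigma=214.476, k=0.348238, start=15):
    # Generalized Pareto quantile (theta = 0), shifted to start at 15
    # bytes as in the text; values below 15 bytes would instead be
    # drawn from the discrete distribution of Table 5 (not modeled here).
    u = rng.random()
    return start + sigma * ((1.0 - u) ** (-k) - 1.0) / k

rng = random.Random(1)
keys = [round(sample_key_size(rng)) for _ in range(100000)]
```

With these parameters the key-size median comes out near 34 bytes and the value-size median (above the 15-byte cutoff) near 183 bytes, consistent with the small-key, small-value character of the workload.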

Figure 8: PDF (left), CDF (middle), and CDF residuals (right) plots for the distribution of ETC's key size (top), value size (center), and inter-arrival gap (bottom). Note that some axes are logarithmic, and that PDF plots limit the X-axis to the area of interest for greater detail.

Table 5: Probability distribution for the first few value lengths, in bytes.

    Value size   Probability
    0            0.00536
    1            0.00047
    2            0.17820
    3            0.09239
    4            0.00018
    5            0.02740
    6            0.00065
    7            0.00606
    8            0.00023
    9            0.00837
    10           0.00837
    11           0.08989
    12           0.00092
    13           0.00326
    14           0.01980

Inter-arrival Rate Distribution

We found the model that best describes the time gap in microseconds between consecutive received requests (with a Kolmogorov-Smirnov distance of 2.0) to be Generalized Pareto with parameters θ = 0, σ = 16.0292, k = 0.154971, starting from the second value.
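A gap generator for this fit might look like the following sketch. It combines the Generalized Pareto tail with the separately modeled point mass at 0 μs (whose empirical probability in the trace is 0.1159); the names are illustrative:

```python
import random

# Sketch: sampling inter-arrival gaps (in microseconds) from the
# fitted Generalized Pareto tail, plus the excluded zero-gap point
# mass (empirical probability 0.1159 in the trace).
P_ZERO = 0.1159
SIGMA, K = 16.0292, 0.154971

def sample_gap_us(rng):
    if rng.random() < P_ZERO:
        return 0.0  # multiple requests in the same microsecond slot
    u = rng.random()
    # Generalized Pareto quantile with theta = 0
    return SIGMA * ((1.0 - u) ** (-K) - 1.0) / K

rng = random.Random(7)
gaps = [sample_gap_us(rng) for _ in range(100000)]
```

A diurnal variant could swap in per-hour (σ, k) pairs, since the fit parameters vary with time of day.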
One excluded point from this model is the first value, representing a gap of 0 μsec (in other words, multiple requests in the same microsecond time slot), with a probability of 0.1159. This is likely an artifact of our measurement granularity and aggregation by the network stack, and not of concurrent requests, since they are all serialized by the single networking interface.

In addition, the model is most accurate up to a gap of about 1000 μsec. But the total number of sampled points not covered by this model (i.e., those requests that arrive more than 1 msec after the previous request) represents less than 0.002% of the total samples, and their residual error is negligible.

Unlike the previous two distributions, the inter-arrival rate (the reciprocal function of offered load) is highly dependent on time of day, as evident in Fig. 3. For those wishing to capture this diurnal variation, this complete-trace model may be too coarse. To refine this distribution, we divided the

raw data into 24 hourly bins and modeled each separately. Fortunately, they all fit a Generalized Pareto distribution with θ = 0 rather well. The distribution of the remaining two parameters over time is given in Table 6.

Table 6: Hourly distributions for the inter-arrival gap. The columns represent (in order): start time of each hourly bin (in UTC), the two Generalized Pareto parameters (with θ = 0), the fraction of samples under a 1 μs gap, and the Kolmogorov-Smirnov distance of the fit.

    Time    σ        k         < 1 μs   KS
    0:00    16.2868  0.155280  0.1158   2.18
    1:00    15.8937  0.141368  0.1170   2.14
    2:00    15.6345  0.137579  0.1174   2.09
    3:00    15.7003  0.142382  0.1174   2.16
    4:00    16.3231  0.160706  0.1176   2.32
    5:00    17.5157  0.181278  0.1162   2.52
    6:00    18.6748  0.196885  0.1146   2.64
    7:00    19.5114  0.202396  0.1144   2.64
    8:00    20.2050  0.201637  0.1123   2.58
    9:00    20.2915  0.193764  0.1116   2.46
    10:00   19.5577  0.178386  0.1122   2.35
    11:00   18.2294  0.161636  0.1130   2.17
    12:00   17.1879  0.140461  0.1138   2.00
    13:00   16.2159  0.119242  0.1146   1.88
    14:00   15.6716  0.104535  0.1152   1.76
    15:00   15.2904  0.094286  0.1144   1.72
    16:00   15.2033  0.096963  0.1136   1.72
    17:00   14.9533  0.098510  0.1140   1.74
    18:00   15.1381  0.096155  0.1128   1.67
    19:00   15.3210  0.094156  0.1129   1.65
    20:00   15.3848  0.100365  0.1128   1.68
    21:00   15.7502  0.111921  0.1127   1.80
    22:00   16.0205  0.131946  0.1129   1.96
    23:00   16.3238  0.147258  0.1148   2.14

6. DISCUSSION

One pertinent question is: what are the factors that affect and predict hit rates? Since all hosts have the same amount of RAM, we should be able to easily explain the relative differences between the traces using the data we gathered so far on locality. But as Sec. 4 discusses, hit rates do not actually correlate very well with most locality metrics; rather, they correlate inversely with the size of the pool (compare Tables 1 and 2). Does correlation imply causation in this case?

Probably not. A more likely explanation invokes a third, related parameter: the size of the application domain.
Both in USR's and SYS's cases, these sizes are more or less capped, and the bound is small enough that a limited number of servers can cover virtually the entire domain, so locality no longer plays a factor. On the other extreme, ETC has a varying and growing number of applications using it, some with unbounded data. If any single application grows enough in importance to require a certain quality of service, and has the size limitations to enable this quality given enough servers, then it is separated out of ETC into its own pool. So the applications that end up using ETC are precisely those that cannot or need not benefit from hit-rate guarantees.

Nevertheless, improving hit rates is important for these applications, or we would not need a cache in the first place. One way to improve ETC's hit rates, at least in theory, is to increase the total amount of RAM (or servers) in the pool so that we can keep a longer history. But in practice, beyond a couple of days' worth of history, the number of keys that would benefit from the longer memory is vanishingly small, as Fig. 7 shows. And of course, adding hardware adds cost.

A more fruitful direction may be to focus on the cache replacement policy. Several past studies demonstrated replacement policies with reduced eviction misses compared to LRU, such as LIRS [19]. Table 3 puts an upper limit on the number of eviction misses that can be eliminated, at around 22%, meaning that eviction policy changes could improve hit rates by at most another 0.22 × (1 − 0.814) ≈ 4.1%. This may sound modest, but it represents over 120 million GET requests per day per server, with a noticeable impact on service latency. Moreover, the current cache replacement scheme and its implementation are suboptimal when it comes to multithreaded performance [9], with a global lock protecting both the hash table and the slab LRUs. We therefore perceive great potential in alternative replacement policies, not only for better hit rates but also for better performance.
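The bound above follows directly from Table 3's eviction share and the observed hit rate:

```python
# Upper bound on hit-rate improvement from a perfect eviction policy:
# only eviction misses (22% of misses, per Table 3) can be eliminated,
# and misses make up 1 - 0.814 of all requests.
hit_rate = 0.814          # observed ETC hit rate
eviction_share = 0.22     # fraction of misses caused by evictions
bound = eviction_share * (1.0 - hit_rate)
print(f"{bound:.1%}")  # 4.1%
```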
Another interesting question is whether we should optimize Memcached for hit rates or for byte hit rates. To answer it, we estimated the penalty of each miss in the ETC workload by measuring the duration between the missing GET and the subsequent SET that reinstates the value (presumably, this duration represents the time cost to recalculate the value, and is already highly optimized, emphasizing the importance of improved hit rates). We found that it is roughly proportional to the value size, so whether we fill the cache with a few large items or many small items of the same aggregate size should not affect recalculation time much. On the other hand, frequent misses do noticeably increase the load on the back-end servers and hurt the user experience, which explains why this cache does not prioritize byte hit rate. Memcached optimizes for small values, because they are by far the most common. It may even be worthwhile to investigate not caching large objects at all, to increase overall hit rates.

7. RELATED WORK

To the best of our knowledge, this is the first detailed description of a large-scale KV-store workload. Nevertheless, there are a number of related studies on other caching systems that can shed light on the relevance of this work and its methodology.

The design and implementation of any storage or caching system must be optimized for its workload to be effective. Accordingly, there is a large body of work on the collection, analysis, and characterization of workloads on storage systems, including enterprise computing environments [2, 20, 21] and high-performance computing environments [11, 22, 30]. These observations can be of great importance to system design, engineering, and tuning. For example, in a study of file system workloads for large-scale scientific computing applications, Wang et al. collected and analyzed file accesses on an 800-node cluster running the Lustre file system at Lawrence Livermore National Laboratory [30].
One of their findings is that in some workloads, small requests account for more than 90% of all requests, but almost all data are accessed by large requests. In a study on file sys-