We modified H-Store to use range partitioning. We discuss how to support alternative partitioning schemes in Appendix C.

Fig. 2 shows a simplified TPC-C database partitioned by the plan in Fig. 5a. The WAREHOUSE table is partitioned by its id column (W_ID). Since there is a foreign-key relationship between the WAREHOUSE and CUSTOMER tables, the CUSTOMER table is also partitioned by its W_ID attribute. Hence, all data related to a given W_ID (i.e., both WAREHOUSE and CUSTOMER tuples) is collocated on a single partition. Any stored procedure that reads or modifies either table uses W_ID as its routing parameter. Because of this foreign-key relationship, the CUSTOMER table does not need an explicit mapping in the partition plan. We will use this simplified example throughout the paper for exposition.

Figure 2: Simple TPC-C data, showing WAREHOUSE and CUSTOMER partitioned by warehouse IDs.
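To make the routing logic concrete, the following Java sketch shows one way a range partition plan over W_ID and the routing decision it implies could be expressed. It is only an illustration: the class and method names are ours, the range boundaries merely extend the warehouses visible in Fig. 2, and this is not H-Store's actual partition-plan API.

    import java.util.TreeMap;

    // Illustrative sketch of a range partition plan over W_ID and the routing it implies.
    public class RangePlanSketch {
        // Each entry maps the lower bound of a W_ID range to a partition id:
        // [1,3) -> partition 1, [3,5) -> partition 2, [5,10) -> partition 3, [10,+inf) -> partition 4.
        static final TreeMap<Integer, Integer> warehousePlan = new TreeMap<>();
        static {
            warehousePlan.put(1, 1);
            warehousePlan.put(3, 2);
            warehousePlan.put(5, 3);
            warehousePlan.put(10, 4);
        }

        // Route any WAREHOUSE or CUSTOMER access by its W_ID routing parameter.
        // CUSTOMER needs no mapping of its own: it inherits this one through its W_ID foreign key.
        static int route(int wId) {
            return warehousePlan.floorEntry(wId).getValue();
        }

        public static void main(String[] args) {
            System.out.println(route(2)); // warehouse 2 and all of its customers -> partition 1
            System.out.println(route(7)); // warehouse 7 and all of its customers -> partition 3
        }
    }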
2.3 The Need for Live Reconfiguration

Although partitioned DBMSs like H-Store execute single-partition transactions more efficiently than systems that use a heavyweight concurrency control scheme, they are still susceptible to performance degradation due to changes in workload access patterns [32]. Such changes can cause a larger percentage of transactions to access multiple partitions, or cause partitions to grow beyond the amount of memory available on their node. As with any distributed system, DBMSs need to react to these situations to avoid becoming overloaded; failing to do so in a timely manner can impact both performance and availability in distributed DBMSs [20].

To demonstrate the detrimental effect of overloaded partitions on performance, we ran a micro-benchmark using H-Store on a three-node cluster. For this experiment, we used the TPC-C benchmark with a 100-warehouse database evenly distributed across 18 partitions. We modified the TPC-C workload generator to create a hotspot on one of the partitions by having a certain percentage of transactions access one of three hot warehouses instead of a uniformly random warehouse. Transaction requests are submitted from up to 150 clients running on a separate node in the same cluster. We postpone the details of these workloads and the execution environment until Section 7.

As shown in Fig. 3, as the warehouse selection moves from a uniform to a highly skewed distribution, the throughput of the system degrades by ∼60%. This shows that overloaded partitions have a significant impact on the throughput of a distributed DBMS like H-Store. The solution is for the DBMS to respond to these adverse conditions by migrating data to either re-balance existing partitions or to offload data to new partitions.

Figure 3: As workload skew increases, a growing share of new order transactions access three warehouses in TPC-C, and the collocated warehouses experience reduced throughput due to contention. (x-axis: percent of new orders for warehouses 1-3; y-axis: TPS.)

Some of the authors designed E-Store [38] to automatically identify when a reconfiguration is needed and to create a new partition plan that shuffles data items between partitions. E-Store uses system-level statistics (e.g., sustained high CPU usage) to identify the need for reconfiguration, and then uses tuple-level statistics (e.g., tuple access frequency) to determine the placement of data to balance load across partitions. E-Store relies on Squall to execute the reconfiguration. Each component treats the other as a black box; E-Store only provides Squall with an updated partition plan and a designated leader node for a reconfiguration. Squall makes no assumptions about the plans generated by a system controller other than that all tuples must be accounted for.

In building E-Store, we evaluated a Zephyr-like migration for load balancing (cf. Section 7 for a detailed description) in which destination partitions reactively migrate tuples as needed and periodically pull large blocks of tuples. As shown in Fig. 4, however, this approach results in downtime for the system and is therefore not an option for modern OLTP systems. This disruption is due to migration requests blocking transaction execution and a lack of coordination between the partitions involved in the migration.

Figure 4: A Zephyr-like migration of two TPC-C warehouses to alleviate a hot spot effectively causes downtime in a partitioned main-memory DBMS. (x-axis: elapsed time in seconds; y-axis: TPS.)

These issues highlight the need for a live reconfiguration system that seeks to minimize performance impact and eliminate downtime. We define a live reconfiguration as a change in the assignment of data to partitions in which data is migrated without any part of the system going off-line. A reconfiguration can cause the number of partitions in the cluster to increase (i.e., data from existing partitions is sent to a new, empty partition), decrease (i.e., data from a partition being removed is sent to other existing partitions), or stay the same (i.e., data from one partition is sent to another partition).
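As a concrete illustration of the first case, the sketch below diffs a hypothetical old plan against a new plan that offloads the three hot warehouses to a new, empty partition. The warehouse-to-partition assignments extend the example above and the partition numbers are assumed; the point is only that every key whose owner changes between the two plans is data that must migrate, while everything else stays in place.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative sketch: a reconfiguration that grows the cluster, expressed as a plan diff.
    public class PlanDiffSketch {
        public static void main(String[] args) {
            Map<Integer, Integer> oldPlan = new TreeMap<>();  // W_ID -> partition id
            int[][] oldRanges = {{1, 2, 1}, {3, 4, 2}, {5, 9, 3}, {10, 10, 4}}; // {lo, hi, partition}
            for (int[] r : oldRanges)
                for (int w = r[0]; w <= r[1]; w++)
                    oldPlan.put(w, r[2]);

            // New plan: identical, except the three hot warehouses move to a new, empty partition 5.
            Map<Integer, Integer> newPlan = new TreeMap<>(oldPlan);
            for (int w = 1; w <= 3; w++)
                newPlan.put(w, 5);

            // Every warehouse whose owner changes must migrate; all other data stays put.
            for (int w : oldPlan.keySet())
                if (!oldPlan.get(w).equals(newPlan.get(w)))
                    System.out.println("W_ID " + w + ": partition " + oldPlan.get(w)
                            + " -> partition " + newPlan.get(w));
        }
    }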
3. OVERVIEW OF SQUALL

Squall is a technique for efficiently migrating fine-grained data in a strongly consistent, distributed OLTP DBMS. The key advantage of Squall over previous approaches is that it does not require the DBMS to halt execution while all data migrates between partitions, thereby minimizing the impact of reconfiguration. The control of data movement during a reconfiguration is completely decentralized and fault-tolerant. Squall is focused on the problem of how to perform this reconfiguration safely. In particular, Squall ensures that during the reconfiguration process the DBMS has no false negatives (i.e., the system assumes that a tuple does not exist at a partition when it actually does) or false positives (i.e., the system assumes that a tuple exists at a partition when it actually does not).
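The following sketch illustrates, on the destination side, the kind of ownership check this guarantee implies. It is our own simplification, not Squall's implementation; the names, ranges, and pull mechanism are assumed. A lookup for a key that the new plan assigns to this partition but that has not yet arrived must trigger an on-demand pull rather than return "not found", which would be a false negative.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: a destination partition checks ownership during reconfiguration
    // so that it never reports a not-yet-arrived tuple as missing (a false negative).
    public class OwnershipCheckSketch {
        // The new plan assigns W_IDs [1,3] to this partition; only W_ID 1 has arrived so far.
        static final int IN_LO = 1, IN_HI = 3;
        static final Set<Integer> arrived = new HashSet<>(List.of(1));
        static final Map<Integer, String> local = new HashMap<>(Map.of(1, "row for W_ID 1"));

        static String lookupWarehouse(int wId) {
            if (wId >= IN_LO && wId <= IN_HI && !arrived.contains(wId)) {
                // Assigned here by the new plan but not migrated yet: pull it on demand
                // instead of (incorrectly) concluding that the tuple does not exist.
                pullFromSource(wId);
            }
            return local.get(wId); // now safe: the key is either local or was just pulled
        }

        static void pullFromSource(int wId) {
            // Stand-in for a blocking migration request to the source partition.
            local.put(wId, "row for W_ID " + wId + " (pulled from source)");
            arrived.add(wId);
        }

        public static void main(String[] args) {
            System.out.println(lookupWarehouse(1)); // already arrived -> served locally
            System.out.println(lookupWarehouse(2)); // not yet arrived -> pulled, then served
        }
    }

Symmetrically, a source partition must stop treating keys it has already shipped out as local; answering for them would be a false positive.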