Lock    | CID_N    | Record_N              | CID_{N-1} | Record_{N-1}         | CID_{N-2} | Record_{N-2}
(1 Bit) | (63 Bit) | (m Bits)              | (64 Bit)  | (m Bits)             | (64 Bit)  | (m Bits)
0       | 20003    | ("Name1", "Address1") |           |                      |           |
0       | 23401    | ("Name2", "Address2") | 22112     | ("Name2", "OldAddr") |           |
1       | 24401    | ("Name3", "Address3") | 22112     | ("Old3", "Old3")     |           |

Table 1: Potential Data Structure for RDMA-enabled Snapshot Isolation

get m = m_r + m_s = 5 + 8 · n messages, which is significantly more than in the single-machine or centralized case. Let us assume that a transaction always has to access all n servers. If we assume that every server has c cores (each of which is able to execute cycles_c cycles per second) and that handling a message costs cycles_m cycles, then a very optimistic upper bound on the number of transactions per second is

    trx_u = (c · cycles_c · (n + 1)) / ((5 + 8 · n) · cycles_m).

On a modern 3-node cluster with 2.2GHz 8-core CPUs and assuming 3,750 cycles per message (see Figure 3), this leads to ≈ 647,000 trx/second. More interestingly, though, if we increase the cluster to 4 nodes with the same hardware configuration, the maximum throughput goes down to ≈ 634,000. Of course, these are only back-of-the-envelope calculations, but they show that the message overhead essentially consumes all the added CPU power, making the system inherently unscalable (if the workload cannot be partitioned).
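To make the back-of-the-envelope numbers concrete, the following standalone sketch (our own illustration, not code from the system) evaluates this bound for the 3-node and 4-node configurations above, using the constants from the text (8 cores at 2.2GHz and 3,750 cycles per message):

```cpp
#include <cstdio>

// Optimistic upper bound on distributed transactions per second:
//   trx_u = (c * cycles_c * (n + 1)) / ((5 + 8 * n) * cycles_m)
// where n = #servers, c = cores per server, cycles_c = cycles per core
// per second, and cycles_m = CPU cycles spent per message.
static double upper_bound_trx(int n, int cores, double cycles_c, double cycles_m) {
    double messages   = 5.0 + 8.0 * n;                // m = m_r + m_s = 5 + 8n messages per transaction
    double cpu_budget = cores * cycles_c * (n + 1);   // total cycles/sec available to the cluster
    return cpu_budget / (messages * cycles_m);
}

int main() {
    const int    cores    = 8;       // 8-core CPUs
    const double cycles_c = 2.2e9;   // 2.2 GHz
    const double cycles_m = 3750.0;  // cycles per message (cf. Figure 3)

    std::printf("3 nodes: %.0f trx/sec\n", upper_bound_trx(3, cores, cycles_c, cycles_m)); // ~647,000
    std::printf("4 nodes: %.0f trx/sec\n", upper_bound_trx(4, cores, cycles_c, cycles_m)); // ~634,000
    return 0;
}
```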
Message batching can help, but with increased batch sizes, the processing time per message also increases. Furthermore, without redesigning the protocol and data structures, the CPU overhead will remain one of the most dominant bottlenecks. For instance, as Figure 2 and Figure 3 show, the IPoIB implementation over our FDR network helps increase the bandwidth and reduce the latency compared to IPoEth, but it does not reduce the CPU overhead, and in some cases it may even exacerbate the situation.

4.1.4 Discussion

The traditional wisdom that distributed transactions, especially 2PC, do not scale is true. First, distributed transactions increase the contention rate, but not significantly. Second, the protocol itself (not considering the message overhead) is rather simple and has no significant impact on performance (2PC simply checks if a message arrived and what it contained). What remains the dominating factor is the increased CPU load for handling the messages, and eventually the network bandwidth. Assuming a 10Gb Ethernet with 3 servers, an average record size of 1KB, and that a transaction updates on average three records, at least 3KB have to be read and written per transaction. This limits the total throughput to ≈ 218,500 transactions per second.

As a result, complicated partitioning schemes have been proposed to avoid distributed transactions as much as possible [19, 57, 63]. While partitioning is a solution, it imposes a new set of challenges on the developer, and some workloads are notoriously hard to partition (e.g., social-graph data).

4.2 RSI: An RDMA-based SI Protocol

Fast high-bandwidth networks such as InfiniBand are able to lift the two most important limiting factors: CPU overhead and network bandwidth. However, as our experiments show, scalability remains severely limited without changing the techniques themselves. Therefore, we need to redesign distributed DBMSs for RDMA-based architectures.

In this section, we present a novel RDMA-based SI protocol, called RSI, that is designed for the network-attached-memory (NAM) architecture. We have also implemented the traditional SI protocol discussed before using two-sided RDMA verbs instead of TCP/IP sockets, as a simplified shared-memory architecture. Both implementations are included in our experimental evaluation in Section 4.3. In the following, we only focus on our novel RSI protocol.

At its core, RSI moves the transaction processing logic to the clients (i.e., the compute nodes) and makes the servers (i.e., the storage nodes) "dumb", as their main purpose is to share their main memory with the clients. Moreover, clients implement the transaction processing logic through one-sided RDMA operations (i.e., the client is the transaction manager), allowing any compute node to act as a client that can access data on any storage node (i.e., a server). This design is similar to [15], but optimized for direct memory access rather than cloud services. Moving the logic to the clients has several advantages. Most importantly, scale-out becomes much easier since all CPU-intensive operations are done by the clients, which are easy to add. The throughput of the system is only limited by the number of RDMA requests that the servers' RNICs (and InfiniBand switches) can handle. Since several RNICs can be added to one machine, the architecture is highly scalable (see also Section 4.3). In addition, (1) load balancing is easier since transactions can be executed on any node independent of data locality, and (2) latencies are reduced as clients can fetch data directly from the servers without involving the TM.

As before, we assume that the reads have already happened and that the transaction has an assigned read timestamp, RID. First, the client (acting as the TM) contacts the timestamp service to receive a new commit timestamp CID. In our implementation, we pre-assign timestamps to clients using a bit vector with 60k bits. The first bit in the vector belongs to client 1 and represents timestamp 1, up to client n representing timestamp n. Afterwards, position n + 1 again belongs to client 1, and so on. Whenever a timestamp is used by a client, it "switches" the bit from 0 to 1. With this scheme, the highest committed timestamp can be determined by finding the highest consecutive set bit in the vector. If all of a client's bits are set, we allow the client to "wrap" and start from the beginning. Note that wrapping requires some additional bookkeeping to ensure that bits are not overwritten. This simple scheme allows the clients to use timestamps without a synchronization bottleneck, but it implicitly assumes that all clients make progress at roughly the same rate. If this assumption does not hold (e.g., because of stragglers, long-running transactions, etc.), additional techniques are required to skip bits, which go beyond the scope of this paper. Also note that, to ensure a fair comparison, we use the same technique for the traditional protocol implementation, even though we know from our experiments that it does not provide any benefits in that case.
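The timestamp-vector logic can be illustrated with a minimal, single-threaded sketch; the class and method names are ours, and the vector is kept in local memory purely for illustration, whereas in RSI it would reside in a server's memory and be updated via RDMA:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the pre-assigned timestamp scheme: with k clients, bit i of the
// vector represents timestamp i + 1 and belongs to client (i % k).
struct TimestampVector {
    std::vector<uint8_t> bits;   // 0 = unused, 1 = committed
    std::size_t num_clients;

    TimestampVector(std::size_t size, std::size_t clients)
        : bits(size, 0), num_clients(clients) {}

    // Client `client_id` (0-based) consumes its next free slot and
    // "switches" the bit from 0 to 1; returns the timestamp it used.
    std::size_t commit_next(std::size_t client_id) {
        for (std::size_t i = client_id; i < bits.size(); i += num_clients) {
            if (bits[i] == 0) { bits[i] = 1; return i + 1; }
        }
        return 0;  // vector exhausted: a real implementation would "wrap"
    }

    // Highest committed timestamp = longest prefix of consecutively set bits.
    std::size_t highest_committed() const {
        std::size_t ts = 0;
        while (ts < bits.size() && bits[ts] == 1) ++ts;
        return ts;
    }
};
```

For instance, with three clients, timestamps 1, 2, 3 belong to clients 1, 2, 3 and timestamp 4 again to client 1; a straggler that never sets its bit caps highest_committed() until it catches up, which is exactly why the scheme assumes that all clients progress at roughly the same rate.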
Next, the client has to execute the first phase of 2PC and check that the version has not changed since it was read (i.e., the validation phase of 2PC). As before, this operation requires a lock on the record to prevent other transactions from changing the value after the validation and before the transaction is fully committed. In a traditional design, the server would be responsible for locking and validating the version. In order to make this operation more efficient and "CPU-less", we propose a new storage layout that allows the client to validate and lock a record directly via RDMA (see Table 1).
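One possible encoding that is consistent with Table 1 (a sketch using our own names; anything beyond the field widths the table specifies is an assumption) packs the lock bit and the current CID into a single 64-bit word, so that validating the version and acquiring the lock can be expressed as one compare-and-swap on that word:

```cpp
#include <cstdint>

// Header word of a record slot: 1 lock bit + 63-bit CID of the current
// version, as in Table 1. Because both live in one 64-bit word, a single
// compare-and-swap against the expected <unlocked, CID> value both
// validates that the version is unchanged and sets the lock bit.
struct RecordHeader {
    uint64_t lock_and_cid;  // bit 63 = lock, bits 0..62 = CID_N

    static constexpr uint64_t kLockBit = 1ULL << 63;

    static uint64_t pack(bool locked, uint64_t cid) {
        return (locked ? kLockBit : 0ULL) | (cid & ~kLockBit);
    }
    bool locked() const { return (lock_and_cid & kLockBit) != 0; }
    uint64_t cid() const { return lock_and_cid & ~kLockBit; }
};

// Record slot sketch: current version plus two older versions, each tagged
// with its 64-bit commit timestamp, mirroring the columns of Table 1.
template <typename Payload>
struct RecordSlot {
    RecordHeader header;   // lock bit + CID_N
    Payload current;       // Record_N
    uint64_t cid_prev1;    // CID_{N-1}
    Payload prev1;         // Record_{N-1}
    uint64_t cid_prev2;    // CID_{N-2}
    Payload prev2;         // Record_{N-2}
};

// Local stand-in for what would be a one-sided RDMA compare-and-swap on the
// header: succeeds only if the record is unlocked and still at the CID read.
inline bool try_validate_and_lock(RecordHeader& h, uint64_t read_cid) {
    uint64_t expected = RecordHeader::pack(false, read_cid);
    uint64_t desired  = RecordHeader::pack(true, read_cid);
    if (h.lock_and_cid != expected) return false;  // version changed or already locked
    h.lock_and_cid = desired;
    return true;
}
```

Whether such a compare-and-swap is issued as a one-sided RDMA atomic or emulated differently is an implementation choice; the point of the layout is that the server's CPU does not have to be involved in validation or locking.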