Figure 6: RSI vs 2PC (throughput in transactions per second vs. number of clients, for RSI (RDMA), Trad-SI (IPoIB), and Trad-SI (IPoEth))

Our storage layout enables direct validation and locking with a single RDMA operation, as shown in Table 1. The key idea is to store up to n versions of a fixed-size record of m bits in a fixed-size slotted memory area, called a "record block", and to use a global dictionary (e.g., implemented as a DHT) to determine the exact memory location of any record within the cluster. We explain the global dictionary and how we handle inserts in the next subsections and assume for the moment that, after the read phase, all memory locations are already known. How many slots (i.e., versions) a record block should hold depends on the update and read patterns, as it can heavily influence performance. For the moment, assume that every record has n = max(16KB / record-size, 2) slots for different record versions and that every read retrieves all n slots. From Figure 2(b) we know that transferring anywhere from 1KB to roughly 16KB makes no difference in latency; therefore, making n any smaller has essentially no benefit. Still, for simplicity, our current implementation uses n = 1 and aborts all transactions which require an older snapshot.

The structure of a slot in memory is organized as follows: the first bit is used as a lock (0 = no-lock, 1 = locked), while the next 63 bits contain the latest commit-id (CID) of the most recently committed version, followed by the payload of the record, followed by the second-latest CID and payload, and so on, up to n versions. Using this data structure, the TM (i.e., the client) can directly validate and lock a record for a write using a compare-and-swap operation on the first 64 bits [round-trip message 2]. For example, assume that the client has read the record at memory address 1F (e.g., the first row in Table 1), which has CID 20003, and wants to install a new version with CID 30000. A simple RDMA compare-and-swap on the first 64 bits of the record at address 1F with test value 20003, setting it to (1 << 63) | 20003, acquires the lock only if the record has not changed since it was read by the transaction, and fails otherwise. Thus, the operation validates and prepares the record for the new update in a single round-trip.
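To make this layout concrete, the following C++ sketch models a single-slot record block (the n = 1 case used in our implementation) and the validate-and-lock step. The names and types are illustrative assumptions, and the local std::atomic compare-and-swap stands in for the one-sided RDMA compare-and-swap (e.g., an IBV_WR_ATOMIC_CMP_AND_SWP work request) that the TM would actually post.

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr uint64_t LOCK_BIT     = 1ull << 63;  // bit 63: 0 = unlocked, 1 = locked
constexpr size_t   PAYLOAD_SIZE = 1024;        // fixed record payload (roughly 1KB)

struct RecordBlock {                  // single-slot block (n = 1)
    std::atomic<uint64_t> header;     // [lock bit | 63-bit CID of the latest version]
    uint8_t payload[PAYLOAD_SIZE];    // payload of the latest committed version
};

// Validate-and-lock: succeeds only if the CID observed during the read phase
// is still the latest one; on success, the lock bit is set in the same
// atomic 8-byte operation.
bool validate_and_lock(RecordBlock& rec, uint64_t read_cid) {
    uint64_t expected = read_cid;              // e.g., 20003 from the read phase
    uint64_t locked   = LOCK_BIT | read_cid;   // set the lock, keep the old CID
    return rec.header.compare_exchange_strong(expected, locked);
}

A successful compare-and-swap therefore re-validates the snapshot read and leaves the record locked for the subsequent write in one round-trip; a failed compare-and-swap aborts the transaction.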
The TM uses the same technique to prepare all involved records (with SI, inserts always succeed). If the compare-and-swap succeeds for all intended updates of the transaction, the transaction is guaranteed to be successful and the TM can install the new versions. The TM therefore checks if the record block has a free slot and, if so, inserts the new version at the head of the block and shifts the other versions to the left. Afterwards, the TM writes the entire record block with a signaled WRITE to the memory location of the server [message 3]. Finally, when all the writes have been successful, the TM informs the timestamp service about the outcome [message 4], as in the traditional protocol. This message can be sent unsignaled. Overall, our RDMA-enabled SI protocol and storage layout requires 3 round-trip messages and one unsignaled message, and does not involve the storage servers' CPUs in the normal operational case. As our experiments in the next section will show, this design enables new dimensions of scalability.
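As a companion to the sketch above, the following hypothetical fragment illustrates this install phase for a block with n > 1 version slots: the previous head version is shifted down, the new version is placed at the head with the lock bit cleared, and the whole block is written back with one signaled WRITE before the timestamp service is notified with an unsignaled message. The two network functions are placeholder stubs standing in for the underlying RDMA verbs, not calls into a real API.

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr int      N_SLOTS      = 2;           // number of version slots (n)
constexpr size_t   PAYLOAD_SIZE = 1024;
constexpr uint64_t LOCK_BIT     = 1ull << 63;

struct OldVersion { uint64_t cid; uint8_t payload[PAYLOAD_SIZE]; };

struct VersionedBlock {
    uint64_t   header;                 // [lock bit | CID of the newest version]
    uint8_t    payload[PAYLOAD_SIZE];  // newest version's payload (the head slot)
    OldVersion older[N_SLOTS - 1];     // second-newest, third-newest, ... versions
};

// Placeholder stubs for the network operations used below (assumptions only):
// a signaled one-sided RDMA WRITE of the whole block and an unsignaled
// notification to the timestamp service.
void rdma_write_signaled(const void* /*local*/, uint64_t /*remote_addr*/, size_t /*len*/) {}
void notify_timestamp_service_unsignaled(uint64_t /*cid*/, bool /*committed*/) {}

// Install phase: runs only after every compare-and-swap of the transaction
// succeeded, i.e., all records are validated and locked.
void install_version(VersionedBlock& blk, uint64_t remote_addr,
                     uint64_t new_cid, const uint8_t* new_payload) {
    // Shift the older versions by one slot so the previous head can move down.
    for (int i = N_SLOTS - 2; i > 0; --i)
        blk.older[i] = blk.older[i - 1];
    blk.older[0].cid = blk.header & ~LOCK_BIT;              // strip the lock bit
    std::memcpy(blk.older[0].payload, blk.payload, PAYLOAD_SIZE);
    // Place the new version at the head; writing the new CID clears the lock.
    std::memcpy(blk.payload, new_payload, PAYLOAD_SIZE);
    blk.header = new_cid;
    rdma_write_signaled(&blk, remote_addr, sizeof(blk));    // [message 3]
    notify_timestamp_service_unsignaled(new_cid, true);     // unsignaled outcome
}

Writing the full record block in a single WRITE keeps the storage server's CPU out of the commit path, which is what allows the NAM servers to remain passive in the normal case.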
4.3 Experimental Evaluation

To evaluate the algorithms, we implemented the traditional SI protocol (Figure 5(a)) on the shared-nothing architecture with IPoEth (Figure 4(a)) and IPoIB (Figure 4(b)). We also implemented a simplified variant of the shared-memory architecture (Figure 4(c)) by replacing the TCP/IP sockets with two-sided RDMA verbs (which required significantly modifying the memory management). We slightly adjusted the traditional SI implementation by using a local timestamp server instead of a remote service (i.e., we gave the traditional implementation an advantage). Finally, our RSI protocol implements the NAM architecture (Figure 4(d)) and uses an external timestamp service as described earlier.

We evaluated all protocols on an 8-node cluster using the same configuration as in Section 2.2. We used four machines to execute the clients, three as the NAM storage servers, and one as the timestamp server (or as the transaction manager in the traditional case). We measured all protocols with a simple but extremely write-heavy workload, similar to the checkout transaction of the TPC-W benchmark. Every transaction reads 3 products, creates 1 order and 3 orderline records, and updates the stock of the products. As base data, we created 1 million products (every record is roughly 1KB) to avoid contention, and all data was evenly distributed across the machines. Clients wait until a transaction has finished before issuing the next one.

Figure 6 shows the scalability of the traditional SI protocol and our new RSI protocol with the number of client threads varied from 1 to 70. The traditional SI protocol over IPoIB has the worst scalability, with ≈ 22,000 transactions per second, whereas over IPoEth it achieves ≈ 32,000 transactions per second. The IPoIB implementation performs worse because of the less efficient TCP/IP implementation for IPoIB, which plays an important role for small messages. In contrast, our RSI protocol achieved a stunning ≈ 1.8 million distributed transactions per second. The shared-memory architecture using two-sided RDMA verbs achieved a throughput of 1.1 million transactions per second, or only 66% of our RSI protocol (its line is omitted from Figure 6 because it overlaps with RSI on the log scale). However, we also noticed that the two-sided RDMA verb implementation not only stops scaling after 40 clients, but that its throughput also decreases to only ≈ 320,000 transactions per second with 70 clients, whereas our RSI implementation scales almost linearly up to 60 clients.

One reason for this decrease in performance is that the transaction managers become one of the major bottlenecks. Our RSI implementation, in turn, no longer scaled linearly beyond 60 clients, since we only had one dual-port FDR 4x RNIC per machine, with a total bandwidth of 13.8GB/s. With the three 1KB records per transaction, we can achieve a theoretical maximum throughput of ≈ 2.4M transactions per second (every transaction reads/writes at least 3KB). Beyond 60 clients, the network is saturated. We therefore speculate that distributed transactions no longer have to be a scalability limit when the network bandwidth matches the memory bandwidth, and that complex partitioning schemes might become obsolete in many scenarios (note that they can still be used to reduce latency and/or to manage hot items).

5. THE CASE FOR OLAP