(2) the latency between machines to access a single (random) byte is still higher than with today's NUMA systems; and (3) hardware-embedded coherence mechanisms ensure data consistency in a NUMA architecture, which is not supported with RDMA.

Clusters with RDMA-capable networks are most similar to a hybrid shared-memory and message-passing system: it is neither a shared-memory system (several address spaces exist) nor a pure message-passing system (data can be directly accessed via RDMA). Consequently, we believe there is a need to critically rethink the entire distributed DBMS architecture to take full advantage of the next generation of network technology. For example, given the fast network, it is no longer obvious that avoiding distributed transactions is always beneficial. Similarly, distributed storage managers and distributed execution algorithms (e.g., joins) should no longer be designed to avoid communication at all costs [48], but should instead consider the multi-core architecture and caching effects more carefully, even in the distributed environment.

While this is not the first attempt to leverage RDMA for databases [60, 55, 39], existing work does not fully recognize that next generation networks create an architectural inflection point. This paper makes the following contributions:

• We present microbenchmarks to assess the performance characteristics of one of the latest InfiniBand standards, FDR 4x (Section 2).

• We present alternative architectures for a distributed in-memory DBMS over fast networks and introduce a novel Network-Attached Memory (NAM) architecture (Section 3).

• We show why the common wisdom that "2-phase commit does not scale" no longer holds true for RDMA-enabled networks and outline how OLTP workloads can take advantage of the network by using the NAM architecture (Section 4).

• We analyze the performance of distributed OLAP operations (joins and aggregations) and propose new algorithms for the NAM architecture (Section 5).

2. BACKGROUND

Before making a detailed case for why distributed DBMS architectures need to fundamentally change to take advantage of the next generation of network technology, we provide some background information and micro-benchmarks that showcase the characteristics of InfiniBand and RDMA.
2.1 InfiniBand and RDMA

In the past, InfiniBand was a very expensive, high-bandwidth, low-latency network commonly found in large high-performance computing environments. However, InfiniBand has recently become cost-competitive with Ethernet and thus a viable alternative for enterprise clusters.

Communication Stacks: InfiniBand offers two network communication stacks: IP over InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA). IPoIB implements a classic TCP/IP stack over InfiniBand, allowing existing socket-based applications to run without modification. As with Ethernet-based networks, data is copied by the application into OS buffers, and the kernel processes the buffers by transmitting packets over the network. While providing an easy migration path from Ethernet to InfiniBand, our experiments show that IPoIB cannot fully leverage the network. On the other hand, RDMA provides a verbs API, which enables data transfers using the processing capabilities of an RDMA NIC (RNIC). With verbs, most of the processing is executed by the RNIC without OS involvement, which is essential for achieving low latencies.

RDMA provides two verb communication models: one-sided and two-sided. One-sided RDMA verbs (write, read, and atomic operations) are executed without involving the CPU of the remote machine. RDMA WRITE and READ operations allow a machine to write (read) data into (from) the remote memory of another machine. Atomic operations (fetch-and-add, compare-and-swap) allow remote memory to be modified atomically. Two-sided verbs (SEND and RECEIVE) enable applications to implement an RPC-based communication pattern that resembles the socket API. Unlike the first category, two-sided operations involve the CPU of the remote machine as well.

RDMA Details: RDMA connections are implemented using queue pairs (i.e., send/receive queues). The application creates the queue pairs on the client and the server, and the RNICs handle the state of the queue pairs. To communicate, a client creates a Work Queue Element (WQE) by specifying the verb and its parameters (e.g., a remote memory location). The client puts the WQE into a send queue and informs the local RNIC via Programmed IO (PIO) to process the WQE. WQEs can be sent either signaled or unsignaled. Signaled means that the local RNIC pushes a completion event into the client's completion queue (CQ) via a DMA write once the WQE has been processed by the remote side. For one-sided verbs, a WQE is handled by the RNIC of the remote side (called the server) using a DMA operation, without interrupting the remote CPU. However, one-sided operations come with a caveat: a memory region must be registered with the local and remote RNIC to be accessible by DMA operations (i.e., the RNIC stores the virtual-to-physical page mappings of the registered region). For two-sided verbs, the server does not need to register a memory region, but it must put a RECEIVE request into its receive queue to handle a SEND request from the client.

Since queue pairs process their WQEs in FIFO order, a typical pattern to reduce the overhead on the client side and to hide latency is to use selective signaling. That is, for send/receive queues of length n, the client can send n − 1 WQEs unsignaled and the n-th WQE signaled. Once the completion event (i.e., the acknowledgment message of the server) for the n-th WQE arrives, the client implicitly knows that the previous n − 1 WQEs have also been successfully processed. That way, computation and communication on the client can be efficiently overlapped without expensive synchronization mechanisms.
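To make these mechanics concrete, the following C sketch (our illustration, not code from the paper) uses the libibverbs API to post a batch of n one-sided RDMA WRITE work requests with selective signaling. It assumes a connected queue pair created with sq_sig_all = 0 (so unsignaled WQEs are allowed), a batch size n no larger than the send queue depth, a client-side memory region mr that was registered with ibv_reg_mr, and the server's remote address and rkey exchanged out of band; connection setup is omitted and all names are hypothetical.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Write n chunks of size `chunk` from local_buf to the server's memory.
 * Only the n-th WQE is signaled; because a queue pair processes its
 * WQEs in FIFO order, polling that single completion implies all
 * n - 1 earlier WQEs have been processed as well. */
int write_n_selectively_signaled(struct ibv_qp *qp, struct ibv_cq *cq,
                                 struct ibv_mr *mr, char *local_buf,
                                 uint64_t remote_addr, uint32_t rkey,
                                 uint32_t chunk, int n)
{
    for (int i = 0; i < n; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(local_buf + (size_t)i * chunk),
            .length = chunk,
            .lkey   = mr->lkey,          /* buffer must be registered */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uint64_t)i;
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided verb */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr + (uint64_t)i * chunk;
        wr.wr.rdma.rkey        = rkey;
        /* Selective signaling: only the last WQE of the batch. */
        wr.send_flags          = (i == n - 1) ? IBV_SEND_SIGNALED : 0;
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;
    }

    /* Poll the completion queue for the single signaled WQE. */
    struct ibv_wc wc;
    int ne;
    while ((ne = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;  /* a real client would do useful work here instead */
    return (ne < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```

The busy-wait at the end is exactly where the overlap described above happens: instead of spinning, the client performs computation and only checks the CQ once it actually needs the batch to be completed on the remote side.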
Another interesting aspect is how the RDMA operations of an RNIC interfere with operations of the CPU when data is accessed concurrently: (1) Recent Intel CPUs (Intel Sandy Bridge and later) provide a feature called Data Direct I/O (DDIO) [6]. With DDIO, the DMA executed by the RNIC to read (write) data from (to) remote memory places the data directly in the CPU's L3 cache if the memory address is resident in the cache, in order to guarantee coherence. (2) On other systems, the cache is flushed/invalidated by the DMA operation to guarantee coherence. (3) Finally, non-coherent systems leave the coherency problem to the software. These effects must be considered when designing distributed RDMA-based algorithms. Also note that this only concerns coherence between the cache and memory, not coherence between copies of the data, which is always left to the software.
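Because coherence between copies of the data is always left to software, RDMA-based systems typically add their own consistency checks on top of one-sided accesses. The following C sketch shows one common technique (our illustration, not a mechanism prescribed by the paper): framing each record with a version number at its head and tail, so that a reader which fetched the record with a one-sided RDMA READ can detect a torn copy and retry. The record layout and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 64  /* hypothetical fixed record size */

struct versioned_record {
    volatile uint64_t head_version;   /* bumped before an update starts */
    char payload[PAYLOAD_SIZE];
    volatile uint64_t tail_version;   /* bumped after the update ends   */
};

/* Reader side: validate a local copy fetched via a one-sided RDMA READ.
 * If the versions differ, the DMA overlapped an in-flight update and
 * the reader must re-issue the READ. */
bool record_is_consistent(const struct versioned_record *copy)
{
    return copy->head_version == copy->tail_version;
}

/* Writer side: bracket the in-place update so that any concurrently
 * DMA-ed copy containing a half-applied payload is detectable. */
void record_update(struct versioned_record *rec, const char *new_payload)
{
    rec->head_version++;              /* versions now differ       */
    __sync_synchronize();             /* order the stores          */
    memcpy(rec->payload, new_payload, PAYLOAD_SIZE);
    __sync_synchronize();
    rec->tail_version++;              /* versions match again      */
}
```

This sketch only detects torn reads; real systems additionally rely on cache-line-aligned layouts and the RNIC's DMA ordering guarantees, since CPU memory barriers alone do not control what a concurrent DMA read observes. The point here is simply that such checks live entirely in software, as noted above.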