(2) the latency between machines is still higher to access a single (random) byte than with today's NUMA systems; and (3) hardware-embedded coherence mechanisms ensure data consistency in a NUMA architecture, which is not supported with RDMA. Clusters with RDMA-capable networks are most similar to a hybrid shared-memory and message-passing system: it is neither a shared-memory system (several address spaces exist) nor a pure message-passing system (data can be directly accessed via RDMA).

Consequently, we believe there is a need to critically rethink the entire distributed DBMS architecture to take full advantage of the next generation of network technology. For example, given the fast network, it is no longer obvious that avoiding distributed transactions is always beneficial. Similarly, distributed storage managers and distributed execution algorithms (e.g., joins) should no longer be designed to avoid communication at all costs [48], but instead should consider the multi-core architecture and caching effects more carefully even in the distributed environment. While this is not the first attempt to leverage RDMA for databases [60, 55, 39], existing work does not fully recognize that next-generation networks create an architectural inflection point.

This paper makes the following contributions:

• We present microbenchmarks to assess performance characteristics of one of the latest InfiniBand standards, FDR 4x (Section 2).

• We present alternative architectures for a distributed in-memory DBMS over fast networks and introduce a novel Network-Attached Memory (NAM) architecture (Section 3).

• We show why the common wisdom that says "2-phase commit does not scale" no longer holds true for RDMA-enabled networks and outline how OLTP workloads can take advantage of the network by using the NAM architecture (Section 4).

• We analyze the performance of distributed OLAP operations (joins and aggregations) and propose new algorithms for the NAM architecture (Section 5).

2. BACKGROUND

Before making a detailed case for why distributed DBMS architectures need to fundamentally change to take advantage of the next generation of network technology, we provide some background information and micro-benchmarks that showcase the characteristics of InfiniBand and RDMA.
2.1 InfiniBand and RDMA

In the past, InfiniBand was a very expensive, high-bandwidth, low-latency network commonly found in large high-performance computing environments. However, InfiniBand has recently become cost-competitive with Ethernet and thus a viable alternative for enterprise clusters.

Communication Stacks: InfiniBand offers two network communication stacks: IP over InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA). IPoIB implements a classic TCP/IP stack over InfiniBand, allowing existing socket-based applications to run without modification. As with Ethernet-based networks, data is copied by the application into OS buffers, and the kernel processes the buffers by transmitting packets over the network. While providing an easy migration path from Ethernet to InfiniBand, our experiments show that IPoIB cannot fully leverage the network. On the other hand, RDMA provides a verbs API, which enables data transfers using the processing capabilities of an RDMA NIC (RNIC). With verbs, most of the processing is executed by the RNIC without OS involvement, which is essential for achieving low latencies.

RDMA provides two verb communication models: one-sided and two-sided. One-sided RDMA verbs (write, read, and atomic operations) are executed without involving the CPU of the remote machine. RDMA WRITE and READ operations allow a machine to write (read) data into (from) the remote memory of another machine. Atomic operations (fetch-and-add, compare-and-swap) allow remote memory to be modified atomically. Two-sided verbs (SEND and RECEIVE) enable applications to implement an RPC-based communication pattern that resembles the socket API. Unlike the first category, two-sided operations involve the CPU of the remote machine as well.

RDMA Details: RDMA connections are implemented using queue pairs (i.e., send/receive queues). The application creates the queue pairs on the client and the server, and the RNICs handle the state of the queue pairs. To communicate, a client creates a Work Queue Element (WQE) by specifying the verb and its parameters (e.g., a remote memory location). The client puts the WQE into a send queue and informs the local RNIC via Programmed IO (PIO) to process the WQE. WQEs can be sent either signaled or unsignaled. Signaled means that the local RNIC pushes a completion event into the client's completion queue (CQ) via a DMA write once the WQE has been processed by the remote side. For one-sided verbs, the WQEs are handled by the remote RNIC with a DMA operation on the remote side (called the server), without interrupting the remote CPU. However, as a caveat, when using one-sided operations a memory region must be registered with the local and remote RNIC to be accessible by DMA operations (i.e., the RNIC stores the virtual-to-physical page mappings of the registered region). For two-sided verbs, the server does not need to register a memory region, but it must put a RECEIVE request into its receive queue to handle a SEND request from the client.

Since queue pairs process their WQEs in FIFO order, a typical pattern to reduce the overhead on the client side and to hide latency is to use selective signaling. That is, for send/receive queues of length n, the client can send n − 1 WQEs unsignaled and the n-th WQE signaled. Once the completion event (i.e., the acknowledgment message of the server) for the n-th WQE arrives, the client implicitly knows that the previous n − 1 WQEs have also been successfully processed. That way, computation and communication on the client can be efficiently overlapped without expensive synchronization mechanisms.
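To make this workflow concrete, the following C sketch shows how a client might post n one-sided RDMA WRITE WQEs with selective signaling through the libibverbs API. This is a minimal sketch, not code from the paper: it assumes a reliable-connected queue pair qp that has already been set up (e.g., via librdmacm) with sq_sig_all = 0 (so unsignaled WQEs are possible) and a send queue of at least n entries, a local memory region mr registered with ibv_reg_mr(), and the server's remote address and rkey exchanged out of band; the function name and parameters are illustrative, and error handling is abbreviated.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Post n WRITE WQEs; only the last one is signaled, so a single completion
 * confirms that all n writes were processed (WQEs complete in FIFO order). */
int post_writes_selective_signal(struct ibv_qp *qp, struct ibv_cq *cq,
                                 struct ibv_mr *mr, uint64_t remote_addr,
                                 uint32_t rkey, size_t len, int n)
{
    for (int i = 0; i < n; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr + i * len, /* local source buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = i;
        wr.opcode  = IBV_WR_RDMA_WRITE;              /* one-sided verb */
        wr.sg_list = &sge;
        wr.num_sge = 1;
        /* Selective signaling: only the n-th WQE generates a completion. */
        wr.send_flags = (i == n - 1) ? IBV_SEND_SIGNALED : 0;
        wr.wr.rdma.remote_addr = remote_addr + i * len;
        wr.wr.rdma.rkey        = rkey;

        if (ibv_post_send(qp, &wr, &bad_wr))         /* hand WQE to the RNIC */
            return -1;
    }

    /* Poll for the single signaled completion; its arrival implies the
     * previous n-1 unsignaled WQEs have also been processed. */
    struct ibv_wc wc;
    int ret;
    while ((ret = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                            /* busy-wait */
    if (ret < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "WRITE failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}

Because the first n − 1 WQEs omit IBV_SEND_SIGNALED, the RNIC generates no completion events for them; the single poll at the end acknowledges the entire FIFO-ordered batch, which is exactly the overlap of computation and communication described above.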
Another interesting aspect is how the RDMA operations of an RNIC interfere with operations of the CPU if data is concurrently accessed: (1) Recent Intel CPUs (Intel Sandy Bridge and later) provide a feature called Data Direct I/O (DDIO) [6]. With DDIO, the DMA executed by the RNIC to read (write) data from (to) remote memory places the data directly in the CPU L3 cache if the memory address is resident in the cache, to guarantee coherence. (2) On other systems, the cache is flushed/invalidated by the DMA operation to guarantee coherence. (3) Finally, non-coherent systems leave the coherency problem to the software. These effects must be considered when designing distributed RDMA-based algorithms. Also note that this only concerns coherence between the cache and memory, not the coherence between data once copied, which is always left to the software.
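As an illustration of that last point, RDMA-based systems typically manage the visibility of copied data entirely in software. The C sketch below (our illustration, not from the paper) shows a hypothetical mailbox layout in which the client writes the payload and a trailing ready flag with a single RDMA WRITE while the server polls the flag; the names rdma_msg_t and wait_for_message are invented, and the in-order placement of a WRITE into memory is an implementation-specific assumption about the RNIC rather than a guarantee of the InfiniBand standard.

#include <stdint.h>

typedef struct {
    uint8_t payload[4096];        /* assumed to be placed first by the DMA */
    volatile uint8_t ready;       /* written last; 0 = empty, 1 = full */
} rdma_msg_t;

/* Server side: poll until the client's RDMA WRITE has fully landed. On a
 * DDIO system the DMA updates the L3 cache directly; on a non-coherent
 * system an explicit cache invalidation would be needed inside this loop. */
static const uint8_t *wait_for_message(rdma_msg_t *slot)
{
    while (slot->ready == 0)
        ;                          /* busy-wait on the flag byte */
    return slot->payload;
}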