[Figure 4: In-Memory Distributed Architectures. Panels: (a) SN (IPoEth), (b) SN (IPoIB), (c) SM (RDMA), (d) NAM (RDMA)]

transactions, causing high communication costs [48]. Furthermore, workloads change over time, which makes it even harder to find a good static partitioning scheme [22], and dynamic strategies might require moving huge amounts of data, further restricting the bandwidth for the actual work. As a result, the network not only limits the throughput of the system, but also its scalability: the more machines are added, the more of a bottleneck the network becomes.

3.1.2 The Shared-Nothing Architecture for IPoIB

An easy way to migrate from the traditional shared-nothing architecture to fast networks, such as InfiniBand, is to simply use IPoIB as shown in Figure 4(b). The advantage of this architecture is that it requires almost no change to the database system itself while still benefiting from the extra bandwidth. In particular, data-flow operations that send large messages (e.g., data re-partitioning) will benefit tremendously from this change. However, as shown in Section 2, IPoIB cannot fully leverage the network. Perhaps surprisingly, for some types of operations, upgrading the network with IPoIB can actually decrease performance. This is particularly true for control-flow operations, which require sending many small messages. Figure 3 shows that the CPU overhead of IPoIB is above the overhead of IPoEth for small messages. In fact, as we will show in Section 4, these small differences can have a negative impact on the overall performance of distributed transaction processing.
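As a rough illustration of why this migration path is so cheap, consider the minimal sketch below (not code from the system described here): because IPoIB exposes a standard IP interface, an unmodified TCP send path works over InfiniBand as-is, and the only thing that changes is that the peer address now belongs to the IPoIB interface. The function name, address, and port are placeholders.

```c
/*
 * Minimal sketch: the socket code of a shared-nothing engine is unchanged
 * when running over IPoIB.  The peer address ("192.168.0.12", port 5000)
 * is a placeholder for the IPoIB interface address of another node.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_over_ipoib(const void *msg, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* same TCP socket as with IPoEth */
    if (fd < 0)
        return -1;

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.168.0.12", &peer.sin_addr); /* IPoIB address of the peer */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0 ||
        send(fd, msg, len, 0) < 0) {            /* partial sends ignored for brevity */
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```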
3.1.3 The Distributed Shared-Memory Architecture

Obviously, to better leverage the network we have to take advantage of RDMA. RDMA not only allows the system to fully utilize the bandwidth (see Figure 2(a)), but also reduces network latency and CPU overhead (see Figures 2(b) and 3). Unfortunately, changing an application from a socket-based message-passing interface to RDMA verbs is not trivial. One possibility is to treat the cluster as a shared-memory system (shown in Figure 4(c)) with two types of communication patterns: message passing using RDMA-based SEND/RECEIVE verbs and remote direct memory access through one-sided RDMA READ/WRITE verbs.

However, as stated before, there is no cache-coherence protocol. Moreover, machines need to carefully declare the sharable memory regions a priori and connect them via queue pairs. The latter, if not used carefully, can also have a negative effect on performance [30]. In addition, a memory access via RDMA is very different from one in a shared-memory system. While a local memory access keeps only one copy of the data around (i.e., conceptually it moves the data from main memory to the cache of a CPU), a remote memory access creates a fully independent copy. This has a range of implications, from garbage collection, over cache/buffer management, up to consistency protocols.

Thus, in order to achieve the appearance of a shared-memory system, the software stack has to hide these differences and provide a real distributed shared-memory space. There have been recent attempts to create a distributed shared-memory architecture over RDMA [21]. However, we believe that a single abstraction for local and remote memory is the wrong approach. Databases usually want full control over memory management, and because virtual memory management can get in the way of any database system, we believe the same is true for shared memory over RDMA. While we had ambitions to validate this assumption in our experiments, we found only one commercial offering, for IBM mainframes [5]. Instead, for our OLTP comparison, we implemented a simplified version of this architecture by essentially using an SN architecture and replacing socket communication with two-sided RDMA verbs (send and receive). We omit this architecture entirely from our OLAP comparison since two-sided RDMA verbs would have added additional synchronization overhead to our system (i.e., an RDMA RECEIVE must be issued strictly before the RDMA SEND arrives at the RNIC), which would have simply slowed down the execution of our OLAP algorithms when compared to their NAM alternatives.
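To make the difference to socket programming more concrete, the following sketch shows what a single one-sided memory access looks like with the libibverbs API. It is a minimal, hypothetical example rather than code from any of the systems discussed here: it assumes a reliable queue pair qp and completion queue cq have already been connected, that the local buffer lies inside a registered memory region, and that the peer has advertised its remote address and rkey out of band (all of that setup is omitted). Note that the READ copies the remote bytes into a fully independent local buffer and does not involve the remote CPU at all, which is exactly the behavior discussed above.

```c
/*
 * Hypothetical sketch: issue a one-sided RDMA READ that copies `len` bytes
 * from a remote server's registered memory into a local buffer.  Connection
 * setup (exchanging QP numbers, LIDs/GIDs, remote address and rkey) is
 * assumed to have happened already and is not shown.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *local_mr, void *local_buf,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)local_buf, /* where the copy lands     */
        .length = len,
        .lkey   = local_mr->lkey,                 /* key of the local MR      */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;    /* one-sided read           */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion     */
    wr.wr.rdma.remote_addr = remote_addr;         /* advertised by the peer   */
    wr.wr.rdma.rkey        = rkey;                /* remote MR's access key   */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue until the read has finished. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                         /* spin                     */
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;                         /* local_buf now holds an independent copy */
}
```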
3.1.4 The Network-Attached-Memory Architecture

Based on the previous considerations, we envision a new type of architecture, referred to as network-attached memory (or NAM for short), shown in Figure 4(d). In a NAM architecture, compute and storage are logically decoupled. The storage servers provide a shared distributed memory pool, which can be accessed from any compute node. However, the storage nodes are not aware of any database-specific operations (e.g., joins or consistency protocols); these are implemented by the compute nodes.

This logical separation helps to control the complexity and makes the system aware of the different types of main memory. Moreover, the storage nodes can take care of issues like garbage collection, data re-organization, or metadata management to find the appropriate remote-memory address of a data page. Note that it is still possible to physically co-locate storage nodes and compute nodes on the same machine to further improve performance. However, in contrast to the previous architecture, the system gains more control over what data is copied and how copies are synchronized.

The NAM architecture also has several other advantages. Most importantly, storage nodes can be scaled independently of compute nodes. Furthermore, the NAM architecture can efficiently handle data imbalance, since any node can access any remote partition without the need to re-distribute the data beforehand. It should be noted that this separation of compute and storage is not new. However, similar existing systems all use an extended key/value-like interface for the storage nodes [15, 36, 39] or are focused on the cloud [16, 2], instead of being built from scratch to leverage high-performance networks like InfiniBand. Instead, we argue that the storage servers in the NAM architecture should expose an in-
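To make the compute/storage split concrete, the following sketch shows how a compute node might fetch a record from a storage server's memory pool purely with one-sided RDMA, reusing the rdma_read_sync helper from the earlier sketch. The fixed-size record layout, the remote_partition descriptor, and nam_fetch_record are illustrative assumptions, not the interface proposed here; the point is only that the storage server's CPU never participates in the access.

```c
/*
 * Hypothetical sketch of the NAM access pattern: a compute node reads a
 * fixed-size record from a storage server's registered memory pool using
 * only one-sided RDMA.  All names and layouts are illustrative.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

#define RECORD_SIZE 64                      /* assumed fixed-size records        */

struct remote_partition {                   /* advertised once by a storage node */
    struct ibv_qp *qp;                      /* connected queue pair              */
    struct ibv_cq *cq;                      /* its completion queue              */
    uint64_t       base_addr;               /* start of the registered pool      */
    uint32_t       rkey;                    /* remote access key                 */
};

/* From the previous sketch: synchronous one-sided RDMA READ. */
int rdma_read_sync(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                   void *buf, uint64_t remote_addr, uint32_t rkey, uint32_t len);

/* Read record `record_id` of a partition into `out` (a buffer inside `local_mr`). */
int nam_fetch_record(const struct remote_partition *p, struct ibv_mr *local_mr,
                     void *out, uint64_t record_id)
{
    uint64_t remote_addr = p->base_addr + record_id * RECORD_SIZE;
    return rdma_read_sync(p->qp, p->cq, local_mr, out,
                          remote_addr, p->rkey, RECORD_SIZE);
}
```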