[Figure 4: In-Memory Distributed Architectures. Panels: (a) SN (IPoEth), (b) SN (IPoIB), (c) SM (RDMA), (d) NAM (RDMA)]

transactions, causing high communication costs [48]. Furthermore, workloads change over time, which makes it even harder to find a good static partitioning scheme [22], and dynamic strategies might require moving huge amounts of data, further restricting the bandwidth for the actual work. As a result, the network not only limits the throughput of the system, but also its scalability: the more machines are added, the more of a bottleneck the network becomes.

3.1.2 The Shared-Nothing Architecture for IPoIB

An easy way to migrate from the traditional shared-nothing architecture to fast networks, such as InfiniBand, is to simply use IPoIB as shown in Figure 4(b). The advantage of this architecture is that it requires almost no change to the database system itself while still benefiting from the extra bandwidth. In particular, data-flow operations that send large messages (e.g., data re-partitioning) will benefit tremendously from this change. However, as shown in Section 2, IPoIB cannot fully leverage the network. Perhaps surprisingly, for some types of operations, upgrading the network with IPoIB can actually decrease performance. This is particularly true for control-flow operations, which require sending many small messages. Figure 3 shows that the CPU overhead of IPoIB is above the overhead of IPoEth for small messages. In fact, as we will show in Section 4, these small differences can have a negative impact on the overall performance of distributed transaction processing.
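As a rough illustration of why this migration path is so cheap, consider the minimal sketch below (not code from the system described here): because IPoIB exposes a standard IP interface, an unmodified TCP send path works over InfiniBand as-is, and the only thing that changes is that the peer address now belongs to the IPoIB interface. The function name, address, and port are placeholders.

```c
/*
 * Minimal sketch: the socket code of a shared-nothing engine is unchanged
 * when running over IPoIB.  The peer address ("192.168.0.12", port 5000)
 * is a placeholder for the IPoIB interface address of another node.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_over_ipoib(const void *msg, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* same TCP socket as with IPoEth */
    if (fd < 0)
        return -1;

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.168.0.12", &peer.sin_addr); /* IPoIB address of the peer */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0 ||
        send(fd, msg, len, 0) < 0) {            /* partial sends ignored for brevity */
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```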
3.1.3 The Distributed Shared-Memory Architecture

Obviously, to better leverage the network we have to take advantage of RDMA. RDMA not only allows the system to fully utilize the bandwidth (see Figure 2(a)), but also reduces network latency and CPU overhead (see Figures 2(b) and 3). Unfortunately, changing an application from a socket-based message-passing interface to RDMA verbs is not trivial. One possibility is to treat the cluster as a shared-memory system (shown in Figure 4(c)) with two types of communication patterns: message passing using RDMA-based SEND/RECEIVE verbs and remote direct memory access through one-sided RDMA READ/WRITE verbs.

However, as stated before, there is no cache-coherence protocol. Moreover, machines need to carefully declare the sharable memory regions a priori and connect them via queue pairs. The latter, if not used carefully, can also have a negative effect on performance [30]. In addition, a memory access via RDMA is very different from one in a shared-memory system. While a local memory access keeps only one copy of the data around (i.e., conceptually it moves the data from main memory to the cache of a CPU), a remote memory access creates a fully independent copy. This has a range of implications, from garbage collection, over cache/buffer management, up to consistency protocols.

Thus, in order to achieve the appearance of a shared-memory system, the software stack has to hide these differences and provide a real distributed shared-memory space. There have been recent attempts to create a distributed shared-memory architecture over RDMA [21]. However, we believe that a single abstraction for local and remote memory is the wrong approach. Databases usually want full control over memory management, and because virtual memory management can get in the way of any database system, we believe the same is true for shared memory over RDMA. While we had ambitions to validate this assumption in our experiments, we found only one commercial offering, for IBM mainframes [5]. Instead, for our OLTP comparison, we implemented a simplified version of this architecture by essentially using an SN architecture and replacing socket communication with two-sided RDMA verbs (send and receive). We omit this architecture entirely from our OLAP comparison since two-sided RDMA verbs would have added additional synchronization overhead to our system (i.e., an RDMA RECEIVE must be issued strictly before the RDMA SEND arrives at the RNIC), which would have simply slowed down the execution of our OLAP algorithms when compared to their NAM alternatives.
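To make the difference to socket programming more concrete, the following sketch shows what a single one-sided memory access looks like with the libibverbs API. It is a minimal, hypothetical example rather than code from any of the systems discussed here: it assumes a reliable queue pair qp and completion queue cq have already been connected, that the local buffer lies inside a registered memory region, and that the peer has advertised its remote address and rkey out of band (all of that setup is omitted). Note that the READ copies the remote bytes into a fully independent local buffer and does not involve the remote CPU at all, which is exactly the behavior discussed above.

```c
/*
 * Hypothetical sketch: issue a one-sided RDMA READ that copies `len` bytes
 * from a remote server's registered memory into a local buffer.  Connection
 * setup (exchanging QP numbers, LIDs/GIDs, remote address and rkey) is
 * assumed to have happened already and is not shown.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *local_mr, void *local_buf,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)local_buf, /* where the copy lands     */
        .length = len,
        .lkey   = local_mr->lkey,                 /* key of the local MR      */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;    /* one-sided read           */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion     */
    wr.wr.rdma.remote_addr = remote_addr;         /* advertised by the peer   */
    wr.wr.rdma.rkey        = rkey;                /* remote MR's access key   */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue until the read has finished. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                         /* spin                     */
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;                         /* local_buf now holds an independent copy */
}
```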
3.1.4 The Network-Attached-Memory Architecture

Based on the previous considerations, we envision a new type of architecture, referred to as network-attached memory (or NAM for short), shown in Figure 4(d). In a NAM architecture, compute and storage are logically decoupled. The storage servers provide a shared distributed memory pool, which can be accessed from any compute node. However, the storage nodes are not aware of any database-specific operations (e.g., joins or consistency protocols); these are implemented by the compute nodes.

This logical separation helps to control the complexity and makes the system aware of the different types of main memory. Moreover, the storage nodes can take care of issues like garbage collection, data re-organization, or metadata management to find the appropriate remote-memory address of a data page. Note that it is still possible to physically co-locate storage nodes and compute nodes on the same machine to further improve performance. However, in contrast to the previous architecture, the system gains more control over what data is copied and how copies are synchronized.

The NAM architecture also has several other advantages. Most importantly, storage nodes can be scaled independently of compute nodes. Furthermore, the NAM architecture can efficiently handle data imbalance, since any node can access any remote partition without the need to re-distribute the data beforehand. It should be noted that this separation of compute and storage is not new. However, similar existing systems all use an extended key/value-like interface for the storage nodes [15, 36, 39] or are focused on the cloud [16, 2], instead of being built from scratch to leverage high-performance networks like InfiniBand. Instead, we argue that the storage servers in the NAM architecture should expose an in-
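To make the compute/storage split concrete, the following sketch shows how a compute node might fetch a record from a storage server's memory pool purely with one-sided RDMA, reusing the rdma_read_sync helper from the earlier sketch. The fixed-size record layout, the remote_partition descriptor, and nam_fetch_record are illustrative assumptions, not the interface proposed here; the point is only that the storage server's CPU never participates in the access.

```c
/*
 * Hypothetical sketch of the NAM access pattern: a compute node reads a
 * fixed-size record from a storage server's registered memory pool using
 * only one-sided RDMA.  All names and layouts are illustrative.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

#define RECORD_SIZE 64                      /* assumed fixed-size records        */

struct remote_partition {                   /* advertised once by a storage node */
    struct ibv_qp *qp;                      /* connected queue pair              */
    struct ibv_cq *cq;                      /* its completion queue              */
    uint64_t       base_addr;               /* start of the registered pool      */
    uint32_t       rkey;                    /* remote access key                 */
};

/* From the previous sketch: synchronous one-sided RDMA READ. */
int rdma_read_sync(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                   void *buf, uint64_t remote_addr, uint32_t rkey, uint32_t len);

/* Read record `record_id` of a partition into `out` (a buffer inside `local_mr`). */
int nam_fetch_record(const struct remote_partition *p, struct ibv_mr *local_mr,
                     void *out, uint64_t record_id)
{
    uint64_t remote_addr = p->base_addr + record_id * RECORD_SIZE;
    return rdma_read_sync(p->qp, p->cq, local_mr, out,
                          remote_addr, p->rkey, RECORD_SIZE);
}
```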