terface that supports fine-grained byte-level memory access that preserves some of the underlying hardware features. For example, in Section 4 we show how the fine addressability allows us to efficiently implement concurrency control. In the future, we plan to take advantage of the fact that messages arrive in order between queue pairs.

3.2 Challenges and Opportunities

Unfortunately, moving from a shared-nothing or shared-memory system to a NAM architecture requires a redesign of the entire distributed database architecture, from storage management and query processing to transaction management, up to query compilation and metadata management.

Transactions & Query Processing: Distributed query processing is typically implemented using a data-parallel execution scheme that leverages re-partitioning operators to shuffle data over the network. However, re-partitioning operators typically do not pay much attention to efficiently leveraging the CPU caches of the individual machines in the cluster. Thus, we believe there is a need for parallel cache-aware algorithms for query operators over RDMA. Similarly, we require new query optimization techniques for distributed in-memory database systems with high-bandwidth networks. As previously mentioned, existing distributed database systems assume that the network is the dominant bottleneck. Therefore, existing cost models for distributed query optimization often consider the network cost as the only cost factor [44]. With fast networks and thus a more balanced system, the optimizer needs to consider more factors, since bottlenecks can shift from one component (e.g., CPU) to another (e.g., memory bandwidth) [18]. Additionally, we believe that a NAM architecture requires new load-balancing schemes that implement ideas suggested for work-stealing on single-node machines [35]. For example, query operations could access, via one-sided RDMA verbs, a central data structure (i.e., a work queue) that contains pointers to small portions of data to be processed by a given query. When a node is idle, it pulls data from the work queue, as sketched below. That way, distributed load-balancing schemes can be efficiently implemented in a decentralized manner. Compared to existing distributed load-balancing schemes, this avoids single bottlenecks and would allow greater scalability while also avoiding stragglers.
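To make the work-queue idea concrete, the following minimal C++ sketch models the pull-based scheme on a single machine: the fetch_add on the queue head stands in for the one-sided RDMA FETCH_AND_ADD verb that a worker node would issue against remote memory on a storage node. All names and the chunking granularity are illustrative assumptions, not the paper's implementation.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct WorkQueue {
    std::atomic<size_t> head{0};   // next unclaimed chunk (remote memory in a NAM setup)
    size_t num_chunks = 0;         // total chunks registered for the query

    // Claim the next chunk; returns false once the queue is drained.
    // In a NAM deployment this single fetch_add would be a one-sided RDMA
    // FETCH_AND_ADD, so idle nodes steal work without a central coordinator.
    bool pull(size_t& chunk) {
        size_t c = head.fetch_add(1, std::memory_order_relaxed);
        if (c >= num_chunks) return false;
        chunk = c;
        return true;
    }
};

int main() {
    WorkQueue q;
    q.num_chunks = 16;  // e.g., 16 small portions of a table partition
    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w) {
        workers.emplace_back([&q, w] {
            size_t chunk;
            while (q.pull(chunk))  // an idle worker pulls the next data portion
                std::printf("worker %d processes chunk %zu\n", w, chunk);
        });
    }
    for (auto& t : workers) t.join();
}
```

Because each pull is a single atomic operation on a shared counter, no worker ever blocks on another, which is what makes the decentralized scheme free of a single bottleneck.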
Storage Management: Since the latency of one-sided RDMA verbs (i.e., read and write) to access remote database partitions is still much higher than that of local memory accesses, we need to optimize the storage layer of a distributed database to minimize this latency. One idea in this direction is to develop complex storage access operations that combine different storage primitives in order to effectively minimize network round-trips between compute and storage nodes. This is in contrast to existing storage managers, which offer only simple read/write operations. For example, in Section 4 we present a complex storage operation for a distributed SI protocol that combines the locking and validation of the 2PC commit phase using a single RDMA atomic operation. However, for such complex operations, the memory layout must be carefully developed. Our current prototype therefore combines the lock information and the value in a single memory location, as sketched below. Modern RNICs, such as the ConnectX-4 Pro, provide a programmable device (e.g., an FPGA) on the RNIC. Thus, another idea to reduce storage access latencies is to implement, in hardware, complex storage operations that cannot easily be mapped to existing RDMA verbs. For example, writing data directly into a remote hash table of a storage node could be implemented completely on the RNICs in a single round-trip without involving the CPUs of the storage nodes, hence allowing for new distributed join operations. Finally, we believe that novel techniques must be developed that allow efficient pre-fetching using RDMA. The idea is that the storage manager issues RDMA requests (e.g., RDMA READs) for memory regions that are likely to be accessed next, and the RNIC processes them asynchronously in the background. When a request for a remote memory address is to be executed, the storage manager first polls the completion queue to check whether the remote memory has already been prefetched. While this is straightforward for sequentially scanning table partitions, index structures, which often rely on random access, require a more careful design.
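The following minimal C++ sketch illustrates the combined memory layout under stated assumptions: the lock flag and the commit timestamp share one 64-bit record header, so a single compare-and-swap (standing in for the one-sided RDMA COMPARE_AND_SWAP a compute node would issue against a storage node) locks the record and validates it in one operation. The bit layout and all names are hypothetical, chosen only to show the principle.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr uint64_t LOCK_BIT = 1ULL << 63;  // high bit: lock flag; low 63 bits: CID

// Try to lock the record *and* validate that it has not changed since the
// transaction read it: the CAS succeeds only if the header is unlocked and
// still carries the commit timestamp (expected_cid) the client saw.
bool lock_and_validate(std::atomic<uint64_t>& header, uint64_t expected_cid) {
    uint64_t expected = expected_cid;  // unlocked, version we read
    return header.compare_exchange_strong(expected, expected_cid | LOCK_BIT);
}

int main() {
    std::atomic<uint64_t> header{42};  // record last committed at CID 42
    // Commit phase of a transaction that read version 42: one atomic step
    // replaces the separate lock and validate round-trips of classical 2PC.
    if (lock_and_validate(header, 42)) {
        std::puts("locked and validated in one atomic operation");
        header.store(43);  // install the new CID, which also releases the lock
    } else {
        std::puts("conflict detected: abort");
    }
}
```

Packing both fields into one word is what lets the protocol spend a single network round-trip where a conventional storage manager would need at least two.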
Metadata Management and Query Compilation: Typically, a distributed DBMS architecture has one central node that is responsible for metadata management and query compilation. In a classical architecture, this central node can either fail or become a bottleneck under heavy load. In a NAM architecture, where all nodes can access central data structures using remote memory accesses, any node can read and update the metadata. Therefore, any node can compile queries and coordinate their execution. Thus, query compilation and metadata management are neither a bottleneck nor a single point of failure.

4. THE CASE FOR OLTP

The traditional wisdom is that distributed transactions, particularly when using 2-phase commit (2PC), do not scale [59, 31, 57, 19, 45, 53]. In this section, we show that this is the case on a shared-nothing architecture over slow networks, and then present a novel protocol for the NAM architecture that can take full advantage of the network and, theoretically, removes the scalability limit.

4.1 Why 2PC does not scale

In this section, we discuss factors that hinder the scalability of distributed transaction processing over slow networks. Modern DBMSs employ Snapshot Isolation (SI) to implement concurrency control and isolation because it promises superior query performance compared to lock-based alternatives. The discussion in this section is based on a 2PC protocol for generalized SI [37, 23]. However, the findings can also be generalized to more traditional 2PC protocols [42].

4.1.1 Dissecting 2PC

Figure 5(a) shows a traditional (simplified) protocol using 2-phase commit with generalized SI guarantees [37, 23], assuming that the data is partitioned across the nodes (i.e., a shared-nothing architecture) and without considering the read phase (see also [15, 17, 49]). That is, we assume that the client (e.g., an application server) has read all records necessary to issue the full transaction using a (potentially older) read-timestamp (RID), which guarantees a consistent view of the data. After the client finishes reading the records, it sends the commit request to the transaction manager (TM) [one-way message 1]. While Figure 5(a) only shows one TM, there can be more, evenly distributing the load across nodes. As a next step, the TM requests a commit timestamp (CID) [round-trip message 2]. In this paper, we assume
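As background for the read phase just described, the following minimal C++ sketch shows the generalized SI read rule implied by RID and CID: a client reading with read-timestamp RID sees, for each record, the newest version whose commit timestamp is not larger than RID. The data structures and names are illustrative assumptions, not the paper's implementation.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Version {
    uint64_t cid;    // commit timestamp assigned when the writer committed
    int      value;
};

// Version chains are kept newest-first; return the first version visible
// under the snapshot defined by rid, or nullptr if none existed yet.
const Version* read_snapshot(const std::vector<Version>& chain, uint64_t rid) {
    for (const Version& v : chain)
        if (v.cid <= rid) return &v;  // committed at or before our snapshot
    return nullptr;
}

int main() {
    std::vector<Version> chain = {{50, 7}, {42, 3}};  // newest first
    uint64_t rid = 45;  // a (potentially older) read-timestamp
    if (const Version* v = read_snapshot(chain, rid))
        std::printf("RID %llu sees value %d (CID %llu)\n",
                    (unsigned long long)rid, v->value,
                    (unsigned long long)v->cid);
}
```

Here the reader with RID 45 sees the version committed at CID 42 and ignores the newer version at CID 50, which is exactly what gives the transaction a consistent, if slightly stale, view of the data.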