RDMA GHJ and the RRJ, is shown in Figure 7(b) and demonstrates that the popular semi-join reduction for distributed joins only pays off in corner cases (i.e., for very low join selectivities).

5.3 RDMA-based Aggregation Algorithms

The primary concern for distributed aggregation in a shared-nothing architecture over slow networks is to avoid network communication [44]. Traditional distributed aggregation operators therefore use a hierarchical scheme. In a first phase, all nodes individually execute an aggregation over their local data partition. In a second phase, the intermediate aggregates are merged using a global union and a post-aggregation is executed over that union. However, this scheme suffers from two problems: (1) Data skew can cause individual nodes in the first phase to take much longer than other nodes to finish. (2) A high number of distinct group-by keys leads to high execution costs for the global union and post-aggregation.

In order to tackle these issues, we present a novel RDMA-optimized aggregation operator, which implements a distributed version of a modern in-memory aggregation operator [51, 35] for our NAM architecture. In a first phase, this operator uses cache-sized hash tables to pre-aggregate data that is local to a core (thread). If a hash table becomes full, it is flushed to overflow partitions. In our RDMA variant of this operator, we directly copy the data in the background to remote partitions while the pre-aggregation is still active. In a second phase, the individual partitions are post-aggregated in parallel to compute the final aggregate. Since this operator uses fine-grained parallelism in the first phase and there are more partitions than worker threads in the second phase, it is more robust towards data skew and a varying number of distinct group-by keys.
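To make the two phases more concrete, the following is a minimal, single-process sketch of this scheme; it is not our actual operator implementation. All names and constants (CACHE_BUDGET, NUM_PARTITIONS, the flush helper, the tiny synthetic workload) are illustrative, and the asynchronous one-sided RDMA WRITE that would copy an overflow partition to a remote memory server in the NAM architecture is stubbed out as a locked local copy.

// Minimal single-process sketch of the two-phase scheme described above.
// All names and constants (CACHE_BUDGET, NUM_PARTITIONS, ...) are illustrative;
// the asynchronous one-sided RDMA WRITE that would ship an overflow partition
// to a remote memory server is stubbed out as a locked local copy.
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t value; };            // 4B group-by key + 4B aggregate

constexpr size_t NUM_PARTITIONS = 64;                      // more partitions than worker threads
constexpr size_t CACHE_BUDGET   = 1 << 12;                 // "cache-sized" hash table (entries)

std::vector<std::vector<Tuple>> overflow(NUM_PARTITIONS);  // stand-in for the remote partitions
std::mutex overflow_locks[NUM_PARTITIONS];

// Phase 1: per-thread pre-aggregation; flush whenever the table outgrows the cache budget.
void pre_aggregate(const std::vector<Tuple>& local_chunk) {
    std::unordered_map<uint32_t, uint64_t> ht;
    auto flush = [&] {
        for (const auto& [k, v] : ht) {
            size_t p = k % NUM_PARTITIONS;                 // partition by group-by key
            std::lock_guard<std::mutex> guard(overflow_locks[p]);
            overflow[p].push_back({k, static_cast<uint32_t>(v)});
            // In the RDMA variant this copy happens in the background to a
            // remote partition while the pre-aggregation is still running.
        }
        ht.clear();
    };
    for (const Tuple& t : local_chunk) {
        ht[t.key] += t.value;
        if (ht.size() >= CACHE_BUDGET) flush();
    }
    flush();
}

// Phase 2: each partition is post-aggregated independently, hence in parallel.
size_t post_aggregate(size_t p) {
    std::unordered_map<uint32_t, uint64_t> ht;
    for (const Tuple& t : overflow[p]) ht[t.key] += t.value;
    return ht.size();                                      // number of groups in this partition
}

int main() {
    // Tiny synthetic workload: 4 worker threads, 1M tuples each, 100k distinct keys.
    std::vector<std::vector<Tuple>> chunks(4);
    for (auto& chunk : chunks)
        for (uint32_t i = 0; i < 1'000'000; ++i)
            chunk.push_back({i % 100'000, 1u});

    std::vector<std::thread> workers;
    for (const auto& chunk : chunks) workers.emplace_back(pre_aggregate, std::cref(chunk));
    for (auto& w : workers) w.join();

    size_t groups = 0;
    for (size_t p = 0; p < NUM_PARTITIONS; ++p) groups += post_aggregate(p);
    std::cout << "total groups: " << groups << "\n";       // expect 100000
}

As in the operator described above, the sketch uses more partitions (64) than worker threads (4), so the second phase can proceed in parallel and remains less sensitive to skew in the group-by keys.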
5.4 Experimental Evaluation

We implemented all the distributed join and aggregation variants discussed before and executed them using 4 servers (10 threads per node). Each node in our experiment hosted a compute and a storage node with the same configuration described in Section 2.2. For the join workload, we used a variant of [11] adapted for the distributed setting: for each node we generated a partition of size |R| = |S| = 128 million with a tuple width wr = ws = 8B. We generated different data sets such that the selectivity of the Bloom filter covers 0.25, 0.5, 0.75, and 1.0 to show the effect of reduced network costs.

Figure 8(a) shows the total runtime of the GHJ and GHJ+Red over IP over Ethernet (IPoEth) and IP over InfiniBand (IPoIB), as well as our two RDMA variants, RDMA GHJ and RRJ, over InfiniBand (RDMA) when using 8 threads per node. As shown, the new RRJ algorithm significantly outperforms the other state-of-the-art join algorithms for different semi-join selectivities. These results are in line with our cost analysis, though the results vary slightly as caching and CPU effects play a more crucial role for the RDMA variants.

In a second experiment, we analyze the performance of our RDMA Aggregation (RDMA AGG) and compare it to a classical hierarchical distributed aggregation (Dist. AGG). For the classical aggregation, we used the algorithm described in [51, 35] for the local aggregation operations. For the workload, we used one table of size |R| = 128 million per partition. Each tuple of R has two attributes (one group-by key and one aggregation attribute) of 4B each, resulting in a tuple width of wr = 8B. Moreover, we generated data sets with a different number of distinct values for the group-by keys, ranging from 1 to 64M, using a uniform distribution.

Figure 8(b) shows the results. For the classical hierarchical aggregation (Dist. AGG), the runtime increases with the number of distinct group-by keys due to the cost of the global union and the post-aggregation (i.e., the post-aggregation has to be executed over a union which produces an output with a size of #nodes · #groupkeys). While showing a similar performance for a small number of distinct group-by keys (i.e., 0.17ms), our RDMA Aggregation (RDMA AGG) is more robust for a high number of distinct group-by keys and shows major performance gains in that case.
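For intuition on why the unioned intermediate dominates the classical scheme, the following is a minimal, single-process sketch of the hierarchical baseline (local aggregation, global union, post-aggregation). The constants NODES and DISTINCT and the synthetic data are illustrative rather than the benchmark configuration above, and the network transfer of the partial aggregates is omitted; the sketch only illustrates that the post-aggregation input grows as #nodes · #groupkeys, whereas the RDMA variant sketched in Section 5.3 splits this work across many partitions that are post-aggregated in parallel.

// Minimal sketch of the classical hierarchical baseline (Dist. AGG).
// NODES, DISTINCT, and the synthetic data are illustrative only.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <utility>
#include <vector>

using Partials = std::unordered_map<uint32_t, uint64_t>;  // group-by key -> partial aggregate

// Phase 1: each node fully aggregates its local partition.
Partials local_aggregate(const std::vector<std::pair<uint32_t, uint32_t>>& partition) {
    Partials agg;
    for (const auto& [k, v] : partition) agg[k] += v;
    return agg;
}

int main() {
    constexpr size_t NODES = 4;                  // mirrors the 4-server setup above
    constexpr uint32_t DISTINCT = 1'000'000;     // distinct group-by keys (illustrative)

    // Synthetic per-node partitions in which every key occurs on every node.
    std::vector<Partials> locals;
    for (size_t n = 0; n < NODES; ++n) {
        std::vector<std::pair<uint32_t, uint32_t>> partition;
        partition.reserve(DISTINCT);
        for (uint32_t k = 0; k < DISTINCT; ++k) partition.emplace_back(k, 1u);
        locals.push_back(local_aggregate(partition));
    }

    // Phase 2: global union + post-aggregation.  The union contains
    // NODES * DISTINCT partial aggregates, i.e., the input to the
    // post-aggregation grows linearly with the number of distinct keys.
    size_t union_size = 0;
    Partials final_agg;
    for (const auto& l : locals) {
        union_size += l.size();
        for (const auto& [k, v] : l) final_agg[k] += v;
    }
    std::cout << "union size: " << union_size              // 4,000,000
              << ", final groups: " << final_agg.size()    // 1,000,000
              << "\n";
}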
Both our experiments in Figure 8(a) and Figure 8(b) show that a redesign of DBMS operators for the NAM architecture results in major benefits, not only regarding sheer performance but also regarding other aspects such as robustness. Different from distributed operators for the shared-nothing and shared-memory architectures, our operators are optimized for the NAM architecture, thus enabling an efficient scale-out by adding additional compute servers. Moreover, the NAM architecture also enables more efficient schemes to handle data skew using fine-grained parallelism and work-stealing algorithms. All these challenges need to be addressed and analyzed in detail when redesigning distributed analytical DBMSs for the NAM architecture.

6. RELATED WORK

A major focus in the High-Performance Computing community has been the development of techniques that take advantage of modern hardware, particularly fast networks like InfiniBand [40, 27, 29]. While the vast majority of this work is limited to specific applications, the results and gained experiences are highly relevant for developing the next generation of DBMSs for fast networks. In this paper, we made the case that RDMA-enabled networks should directly influence distributed DBMS architecture and algorithms.

Many projects in both academia and industry have attempted to add RDMA as an afterthought to an existing DBMS [55, 4]. For example, Oracle RAC [4] has RDMA support, including the use of RDMA atomic primitives, but was not designed from scratch to harness the full power of the network. However, RAC does not directly take advantage of the network for transaction processing and is essentially a workaround for a legacy system. Recent work has investigated building RDMA-aware DBMSs [61, 60] on top of RDMA-enabled key/value stores [30, 43], but transactions and query processing are an afterthought instead of first-class design considerations. Other systems that separate storage from compute nodes [15, 36, 39, 16, 2] also treat RDMA as an afterthought. IBM pureScale [12] directly leverages RDMA to provide active-active scale-out for DB2 but relies on a centralized manager to coordinate distributed transactions. On the other hand, our NAM architecture natively incorporates RDMA primitives in order to build a shared distributed memory pool with no centralized coordinator.

The proposed ideas for RDMA build upon the huge amount of work on distributed transaction protocols (e.g., [62, 23, 14, 56, 38]) and distributed join processing (see [32] for an overview). While we are not aware of any other RDMA-