RDMA GHJ and the RRJ, is shown in Figure 7(b) and demonstrates that the popular semi-join reduction for distributed joins only pays off in corner cases (i.e., for very, very low join selectivities).

5.3 RDMA-based Aggregation Algorithms

The primary concern for distributed aggregation in a shared-nothing architecture over slow networks is to avoid network communication [44]. Traditional distributed aggregation operators therefore use a hierarchical scheme: in a first phase, all nodes individually execute an aggregation over their local data partition; in a second phase, the intermediate aggregates are merged using a global union and a post-aggregation is executed over that union. However, this scheme suffers from two problems: (1) data skew can cause individual nodes in the first phase to take much longer than other nodes to finish, and (2) a high number of distinct group-by keys leads to high execution costs for the global union and the post-aggregation.

In order to tackle these issues, we present a novel RDMA-optimized aggregation operator, which implements a distributed version of a modern in-memory aggregation operator [51, 35] for our NAM architecture. In a first phase, this operator uses cache-sized hash tables to pre-aggregate data that is local to a core (thread); if a hash table runs full, it is flushed to overflow partitions. In our RDMA variant of this operator, the flushed data is copied in the background to remote partitions while the pre-aggregation is still active. In a second phase, the individual partitions are post-aggregated in parallel to compute the final aggregate. Since this operator uses fine-grained parallelism in the first phase and there are more partitions than worker threads in the second phase, it is more robust towards data skew and a varying number of distinct group-by keys.
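To make the two-phase scheme concrete, the following is a minimal, single-process C++ sketch of the operator described above. Each worker pre-aggregates into a small, fixed-capacity hash table and flushes full tables into hash-partitioned overflow buffers, which are then post-aggregated in parallel. All names, capacities, the number of partitions, and the SUM aggregate are illustrative only, and the background one-sided RDMA writes to remote partitions are replaced here by locally buffered partitions.

// Minimal sketch of the two-phase aggregation from Section 5.3 (illustrative
// names and sizes). Phase 1: each worker pre-aggregates into a cache-sized
// hash table and flushes it to hash-partitioned overflow buffers when full;
// the real operator copies these buffers in the background to remote
// partitions via one-sided RDMA writes, which is omitted here. Phase 2: the
// partitions (more partitions than workers) are post-aggregated in parallel.
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { uint64_t key; int64_t value; };     // group-by key, aggregation input

constexpr size_t kPreAggCapacity = 1024;           // "cache-sized" pre-aggregation table
constexpr size_t kNumPartitions  = 16;             // more partitions than workers
constexpr size_t kNumWorkers     = 4;

std::vector<Tuple> g_partitions[kNumPartitions];   // stand-in for remote partitions
std::mutex         g_locks[kNumPartitions];

// Flush a full pre-aggregation table into the hash-partitioned buffers.
void flush(std::unordered_map<uint64_t, int64_t>& table) {
  for (const auto& [key, sum] : table) {
    size_t p = std::hash<uint64_t>{}(key) % kNumPartitions;
    std::lock_guard<std::mutex> guard(g_locks[p]);
    g_partitions[p].push_back({key, sum});         // real operator: background RDMA write
  }
  table.clear();
}

// Phase 1: thread-local pre-aggregation (here: SUM per group-by key).
void preAggregate(const std::vector<Tuple>& input) {
  std::unordered_map<uint64_t, int64_t> table;
  for (const Tuple& t : input) {
    table[t.key] += t.value;
    if (table.size() >= kPreAggCapacity) flush(table);
  }
  flush(table);                                    // flush the remainder
}

int main() {
  // Illustrative input: each worker gets 100k tuples over 5,000 group-by keys.
  std::vector<std::vector<Tuple>> inputs(kNumWorkers);
  for (size_t w = 0; w < kNumWorkers; ++w)
    for (uint64_t i = 0; i < 100000; ++i)
      inputs[w].push_back({i % 5000, 1});

  // Phase 1: pre-aggregate in parallel, one thread per local input chunk.
  std::vector<std::thread> workers;
  for (size_t w = 0; w < kNumWorkers; ++w)
    workers.emplace_back(preAggregate, std::cref(inputs[w]));
  for (auto& t : workers) t.join();
  workers.clear();

  // Phase 2: post-aggregate all partitions in parallel. Since partitioning is
  // done by key hash, no two partitions share a group-by key.
  std::vector<std::unordered_map<uint64_t, int64_t>> results(kNumPartitions);
  for (size_t w = 0; w < kNumWorkers; ++w)
    workers.emplace_back([w, &results] {
      for (size_t p = w; p < kNumPartitions; p += kNumWorkers)
        for (const Tuple& t : g_partitions[p]) results[p][t.key] += t.value;
    });
  for (auto& t : workers) t.join();

  size_t groups = 0;
  for (const auto& r : results) groups += r.size();
  std::cout << "distinct groups: " << groups << std::endl;  // expected: 5000
  return 0;
}

The robustness property claimed above follows from having more partitions than workers: a skewed partition only delays one of many post-aggregation tasks rather than an entire node.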
5.4 Experimental Evaluation

We implemented all the distributed join and aggregation variants discussed before and executed them using 4 servers (10 threads per node). Each server in our experiment hosted a compute and a storage node with the same configuration as described in Section 2.2.

For the join workload, we used a variant of [11] adapted for the distributed setting: for each node we generated a partition of size |R| = |S| = 128 million with a tuple width of w_r = w_s = 8B. We generated different data sets such that the selectivity of the Bloom filter covers 0.25, 0.5, 0.75, and 1.0 to show the effect of reduced network costs. Figure 8(a) shows the total runtime of the GHJ and GHJ+Red over Ethernet (IPoEth) and IP over InfiniBand (IPoIB), as well as our two RDMA variants, RDMA GHJ and RRJ, over InfiniBand (RDMA) when using 8 threads per node. As shown, the new RRJ algorithm significantly outperforms the other state-of-the-art join algorithms for different semi-join selectivities. These results are in line with our cost analysis, though the measured runtimes vary slightly since caching and CPU effects play a more crucial role for the RDMA variants.
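As a concrete illustration of the semi-join reduction applied by GHJ+Red before shuffling, the sketch below builds a Bloom filter over the join keys of R and uses it to prune S tuples that cannot have a join partner; only the surviving tuples would be sent over the network. The filter size, hash scheme, and data sizes are illustrative and not those used in our experiments, and the exchange of the filter between nodes is omitted.

// Minimal sketch of a Bloom-filter-based semi-join reduction (illustrative):
// build a Bloom filter over the join keys of R, distribute it (omitted here),
// and drop S tuples without a potential join partner before the shuffle.
#include <cstdint>
#include <iostream>
#include <vector>

class BloomFilter {
 public:
  BloomFilter(size_t bits, int hashes)
      : bits_(bits), hashes_(hashes), bitmap_(bits, false) {}

  void insert(uint64_t key) {
    for (int i = 0; i < hashes_; ++i) bitmap_[slot(key, i)] = true;
  }
  bool mayContain(uint64_t key) const {
    for (int i = 0; i < hashes_; ++i)
      if (!bitmap_[slot(key, i)]) return false;
    return true;
  }

 private:
  // Double hashing: two base hashes combined into 'hashes_' probe positions.
  size_t slot(uint64_t key, int i) const {
    uint64_t h1 = key * 0x9E3779B97F4A7C15ULL;
    uint64_t h2 = ((key ^ (key >> 33)) * 0xC2B2AE3D27D4EB4FULL) | 1;
    return (h1 + i * h2) % bits_;
  }
  size_t bits_;
  int hashes_;
  std::vector<bool> bitmap_;
};

int main() {
  // Illustrative data: every 4th S tuple has a join partner in R, so the
  // reduction should prune roughly 75% of S (plus a small false-positive rate).
  std::vector<uint64_t> r_keys, s_keys;
  for (uint64_t i = 0; i < 1'000'000; ++i) r_keys.push_back(i);
  for (uint64_t i = 0; i < 1'000'000; ++i) s_keys.push_back(i * 4);

  BloomFilter filter(8 * r_keys.size(), 3);   // ~8 bits per key, 3 hash functions
  for (uint64_t k : r_keys) filter.insert(k);

  // Probe S: only tuples passing the filter would be shipped over the network.
  size_t shipped = 0;
  for (uint64_t k : s_keys)
    if (filter.mayContain(k)) ++shipped;

  std::cout << "S tuples shipped after reduction: " << shipped
            << " of " << s_keys.size() << std::endl;
  return 0;
}

With a Bloom-filter selectivity of 1.0 nothing is pruned, which is why the reduction only pays off when the selectivity, and hence the shipped fraction of S, is low enough to amortize the cost of building and probing the filter.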
In a second experiment, we analyze the performance of our RDMA Aggregation (RDMA AGG) and compare it to a classical hierarchical distributed aggregation (Dist. AGG). For the classical aggregation, we used the algorithm described in [51, 35] for the local aggregation operations. For the workload, we used one table of size |R| = 128 million per partition. Each tuple of R has two attributes (one group-by key and one aggregation attribute) of 4B each, resulting in a tuple width of w_r = 8B. Moreover, we generated data sets with a different number of distinct values for the group-by keys, ranging from 1 to 64M, using a uniform distribution.

Figure 8(b) shows the results. For the classical hierarchical aggregation (Dist. AGG), the runtime increases with the number of distinct group-by keys due to the cost of the global union and the post-aggregation (i.e., the post-aggregation has to be executed over a union which produces an output with a size of #nodes · #groupkeys). While showing a similar performance for a small number of distinct group-by keys (i.e., 0.17ms), our RDMA Aggregation (RDMA AGG) is more robust for a high number of distinct group-by keys and shows major performance gains in that case.

Both our experiments in Figure 8(a) and Figure 8(b) show that a redesign of DBMS operators for the NAM architecture results in major benefits, not only regarding sheer performance but also regarding other aspects such as robustness. Different from distributed operators for the shared-nothing and shared-memory architectures, our operators are optimized for the NAM architecture and thus enable an efficient scale-out by adding additional compute servers. Moreover, the NAM architecture also enables more efficient schemes to handle data skew using fine-grained parallelism and work-stealing algorithms. All these challenges need to be addressed and analyzed in detail when redesigning distributed analytical DBMSs for the NAM architecture.

6. RELATED WORK

A major focus in the High-Performance Computing community has been the development of techniques that take advantage of modern hardware, particularly fast networks like InfiniBand [40, 27, 29]. While the vast majority of this work is limited to specific applications, the results and gained experiences are highly relevant for developing the next generation of DBMSs for fast networks.

In this paper, we made the case that RDMA-enabled networks should directly influence distributed DBMS architecture and algorithms. Many projects in both academia and industry have attempted to add RDMA as an afterthought to an existing DBMS [55, 4]. For example, Oracle RAC [4] has RDMA support, including the use of RDMA atomic primitives, but it was not designed from scratch to harness the full power of the network: RAC does not directly take advantage of the network for transaction processing and is essentially a workaround for a legacy system. Recent work has investigated building RDMA-aware DBMSs [61, 60] on top of RDMA-enabled key/value stores [30, 43], but transactions and query processing are an afterthought instead of first-class design considerations. Other systems that separate storage from compute nodes [15, 36, 39, 16, 2] also treat RDMA as an afterthought. IBM pureScale [12] directly leverages RDMA to provide active-active scale-out for DB2, but relies on a centralized manager to coordinate distributed transactions. On the other hand, our NAM architecture natively incorporates RDMA primitives in order to build a shared distributed memory pool without a centralized coordinator.

The ideas proposed in this paper build upon the huge amount of work on distributed transaction protocols (e.g., [62, 23, 14, 56, 38]) and distributed join processing (see [32] for an overview). While we are not aware of any other RDMA-