In order to motivate the redesign of distributed DBMSs for OLAP workloads, we first discuss why existing distributed algorithms, which were designed for a shared-nothing architecture over slow networks, are not optimal for fast RDMA-capable networks. Then, we present novel RDMA-optimized operators for the NAM architecture, which require fundamental redesigns of central components (e.g., memory management, optimization), as discussed in Section 3.2. This paper focuses on distributed joins and aggregates, which are the predominant operators in almost any OLAP workload.

5.1 Existing Distributed OLAP Operators

The most network-intensive operation in OLAP workloads is the distributed join [54]. Most distributed join algorithms have three components: (1) a local join algorithm, (2) a partitioning scheme, and (3) an optional reduction technique. All three components can be combined in different ways. For example, either a hash or sort-merge join could be used as the local join algorithm, whereas partitioning schemes range from static to dynamic hash partitioning [20]. Similarly, several techniques to reduce the partitioning cost have been proposed, the most prominent being a semi-join reduction using a Bloom filter [52].

The following section explains the most common partitioning technique for distributed join algorithms over shared-nothing architectures, the grace hash join (GHJ), in more detail. Later, we expand the distributed join algorithm with an additional semi-join reduction using Bloom filters to further reduce communication. For both, we develop a simple cost model and argue why these algorithms are (in most cases) not optimal for in-memory databases over fast RDMA-capable networks. Throughout the rest of this section, we assume that there is no skew in the data (i.e., before and after partitioning all nodes hold the same data).
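To make the partitioning component concrete, the following is a minimal sketch of hash partitioning a relation on its join key so that matching tuples of both relations land on the same node; the tuple layout, node count, and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of hash partitioning on the join key (illustrative only).
# Tuples whose keys hash to the same bucket are assigned to the same node,
# so matching tuples of R and S are co-located and can be joined locally.

def hash_partition(relation, num_nodes, key=lambda t: t[0]):
    """Split `relation` (an iterable of tuples) into `num_nodes` partitions."""
    partitions = [[] for _ in range(num_nodes)]
    for tup in relation:
        node_id = hash(key(tup)) % num_nodes   # static hash partitioning
        partitions[node_id].append(tup)
    return partitions

# Example: partition R and S on their join key across 4 nodes.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (4, "z")]
R_parts = hash_partition(R, 4)
S_parts = hash_partition(S, 4)
# R_parts[i] and S_parts[i] can now be joined locally on node i.
```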
5.1.1 An Optimized Grace Hash Join

The GHJ executes a distributed join in two phases. In the first phase (partitioning phase), the GHJ scans the input relations and hash-partitions them on their join key such that the resulting sub-relations can be joined locally per node in the second phase. The cost of the GHJ TGHJ is therefore given by the sum of the runtime of the partitioning phase Tpart and the local join phase Tjoin. We do not consider any static pre-partitioning, so the cost for repartitioning can be split into the cost of partitioning the two join relations R and S. The cost of repartitioning R can further be split into the cost of (1) reading the data on the sender, (2) transferring the data over the network, and (3) materializing the data on the receiver. Assuming that the cost of sending R over the network is Tnet(R) = wr · |R| · cnet and that scanning R in memory is Tmem(R) = wr · |R| · cmem, with |R| being the number of tuples, wr the width of a tuple r ∈ R in bytes, and cnet (cmem) the cost of accessing a byte over the network (in memory), the repartitioning cost of R can be expressed as:

Tpart(R) = Tmem(R) [reading (sender)] + Tnet(R) [shuffling (network)] + Tmem(R) [writing (receiver)]
         = wr · |R| · cmem + wr · |R| · cnet + wr · |R| · cmem
         = wr · |R| · (2 · cmem + cnet)

The partitioning cost for S is analogous. Note that we ignore any CPU cost, as we assume that the limiting factor is the memory and network access (not the CPU), which is reasonable for a simple hash-based partitioning scheme.

For the local join algorithm of the GHJ, we use the fastest local in-memory join algorithm, the (parallel) radix join [11]. The radix join proceeds in two phases. In the first phase, it scans each input relation and partitions it locally into cache-sized blocks, potentially using multiple passes over the data; as shown in [11], with software-managed buffers most relations can be partitioned efficiently in a single pass. After partitioning the data, the radix join scans the relations again to join the cache-sized blocks. Existing work ([47] and [11]) has shown that both phases of the radix join are memory-bandwidth bound. Thus, we can estimate the total cost of the local radix join as:

Tjoin(R, S) = (Tmem(R) + Tmem(S)) [radix phase 1] + (Tmem(R) + Tmem(S)) [radix phase 2]
            = 2 · cmem · (wr · |R| + ws · |S|)

The total runtime of the GHJ is therefore:

TGHJ = Tpart(R) + Tpart(S) + Tjoin(R, S)
     = (wr · |R| + ws · |S|) · (4 · cmem + cnet)
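As a worked example of the cost model above, the following sketch computes Tpart, Tjoin, and TGHJ; the constants cmem and cnet and the relation sizes are hypothetical values chosen only for illustration.

```python
# Sketch of the GHJ cost model from this section (hypothetical constants).

def t_mem(num_tuples, width, c_mem):
    return width * num_tuples * c_mem          # Tmem(R) = wr * |R| * cmem

def t_net(num_tuples, width, c_net):
    return width * num_tuples * c_net          # Tnet(R) = wr * |R| * cnet

def t_part(num_tuples, width, c_mem, c_net):
    # read on the sender + shuffle over the network + write on the receiver
    return 2 * t_mem(num_tuples, width, c_mem) + t_net(num_tuples, width, c_net)

def t_join(r, s, c_mem):
    # both radix-join phases are memory-bandwidth bound
    return 2 * c_mem * (r["width"] * r["tuples"] + s["width"] * s["tuples"])

def t_ghj(r, s, c_mem, c_net):
    return (t_part(r["tuples"], r["width"], c_mem, c_net)
            + t_part(s["tuples"], s["width"], c_mem, c_net)
            + t_join(r, s, c_mem))

# Hypothetical parameters: 10^8 / 10^7 tuples, 8-byte tuples, and a network
# that is 10x more expensive per byte than memory.
R = {"tuples": 100_000_000, "width": 8}
S = {"tuples": 10_000_000, "width": 8}
c_mem, c_net = 1.0, 10.0
print(t_ghj(R, S, c_mem, c_net))   # equals (wr*|R| + ws*|S|) * (4*cmem + cnet)
```

Under these assumed constants, the single cnet term already outweighs the four cmem terms, which is why the semi-join reduction discussed next focuses on shrinking the shuffled data.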
5.1.2 Adding Semi-Reduction using Bloom Filters

As shown in the final cost equation of the previous section, the GHJ requires roughly four times more memory accesses than network transfers. However, in distributed in-memory DBMSs the network cost typically dominates, accounting for up to 90% of the runtime of a join [54]. Thus, state-of-the-art join algorithms (e.g., track join [48], Neo-Join [54]) try to reduce network traffic through cost-intensive computations (e.g., Neo-Join uses a linear solver) or multiple communication round-trips to partition the data.

Here, we focus on the most traditional approach: a semi-join reduction using a Bloom filter. The core idea of the semi-join reduction is to send only those tuples of the input relations R and S that have a join partner in the other relation. Therefore, the algorithm first creates Bloom filters bR and bS over the join keys of R and S, respectively. Then, bR and bS are copied to all nodes that hold a partition of S and R, respectively, and each node uses its Bloom filter to filter out the tuples that are guaranteed to have no join partner (i.e., if the Bloom filter matches a join key, the tuple must be sent). The cost of creating bR includes both a scan over the data Tmem(R) and the transmission of the filter over the network Tnet(bR):

Tbloom(R) = Tmem(R) [create reducer] + Tnet(bR) [ship reducer]

However, the size of the Bloom filter bR is normally very small, so Tbloom(R) can typically be disregarded. Assuming that selS(bR) is the selectivity of the Bloom filter bR over relation S (including the error rate of the Bloom filter), and selR(bS) the selectivity of bS over R, the total cost for a GHJ with a semi-join reduction using Bloom filters is:

TGHJ+bloom = Tbloom(R) + Tbloom(S) [create Bloom filters]
           + Tpart(selR(bS) · R) + Tpart(selS(bR) · S) [reduced partitioning cost]
           + Tjoin(selR(bS) · R, selS(bR) · S) [reduced join cost]

This equation models the cost of creating the Bloom filters plus the reduced partitioning and join costs.
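To illustrate the semi-join reduction mechanically, the following is a minimal sketch of building a Bloom filter over R's join keys and using it to prune S before shuffling; the filter size, hash construction, and helper names are assumptions made for illustration, not the paper's implementation.

```python
# Minimal Bloom-filter semi-join reduction sketch (illustrative only).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False positives are possible (the filter's error rate);
        # false negatives are not: a "no" is a guaranteed non-match.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build bR over the join keys of R; only this small filter is shipped.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (4, "z")]
bR = BloomFilter()
for key, _ in R:
    bR.add(key)

# Each node holding a partition of S drops tuples that cannot have a join
# partner in R before repartitioning, shrinking the shuffled data.
S_reduced = [t for t in S if bR.might_contain(t[0])]
```

In terms of the cost model, the ratio len(S_reduced) / len(S) plays the role of selS(bR), including the filter's false-positive rate.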