U. Combining this fact with Lemma 8.1_中国高校课件下载中心

正在加载图片...

W.Combining this fact with Lemma 8.1,we have the Naturally,the same bound holds for randomized following proposition. ranged hash functions against a worst-case S.Specif- ically,for any family of ranged hash functions with PROPOSITION 8.1.If dimKr(u)-K.for any b dimgR(u)=K,Er-dimKn(n,m',n,m)=(2-2K. 4 and any A C T,l△A(b)l≤t22 k where t= 是lhm,whereK≤7log2(共lhm. maxiEAT(b). In [12],Karger and Ruhl show that a uniformly selected subset of w has essentially the same KR- This implies that if a neighborhood AA(b)is con- dimension with high probability,3 i.e.K-dimkR is prob- tained in the first t columns of x,the size of the neigh- borhood is at most t22K no matter how large A is. abilistically W-hereditary.It is easy to see that K- All the above facts intuitively suggest that for dimKR is also Z-hereditary,since the KR-dimension of W is irrelevant of Z.By Lemma 5.1,for all ranged hash ranged hash functions arising from a metric,the KR- functions with the same KR-dimension ofu it holds that dimension of u strongly affects the structure of the preference matrices,and thus controls load balancing. EK-dimgR(N,M,n,m)2 EK-dimKR(n,m',n,m).This gives us the following theorem. This intuition is justified by the following lemma. THEOREM 8.1.For any distribution of ranged hash LEMMA 8.2.Let n be a monotone ranged hash func- functions with I=N,lM川=M,and dimKR(亿)=K, tion with I=n,u=m',and dimKR(U)=K,for any sufficiently large n and m with N>2n,M> where m' max(n,2m),n ml+o(1),and K< 2m,n≥m,n=ml+o(,and}log2(alnm）≥K, l0g2(n Inm).Then for any constant 0<a<, it holds that for any data set D∈（日，there erists a Pz[s≥a2-2k.”lnm=1-1 stateS∈(%)，such that唱=2(2-2k.朵lnm)uith S∈()L m high probability.Equivalently,under these conditions, k-dimgn(N,M,n,m)=(2-2K.Inm). Proof.Let L a2-2K.m Inm.With the given This justifies our previous observation that dimen- constraints on the parameters,L=(vinm)nm(1). sionality helps load balance even though it hurts search- As in the proof of Theorem 6.1,we consider two cases ing. depending on the number of duplicate entries in each In particular,if u is growth-restricted, column of Recall that for each column j and bucket O(1)-dimkR(N,M,n,m）=2(是lnm）forN>2m, b,Aj(b)represents the number of copies of b in j. M>2m,n m,and n ml+o(1).This bound is tight Case1:If for each j≤L and beu,,入；(b)<L4/a, because it is achieved by standard consistent hashing, applying the same argument as in Lemma 6.2 gives that which (as discussed in Section 4)can be modeled there exists a family of disjoint L-wide neighborhoods using a discrete growth-restricted metric in place of (AA (b)}beU such that (a)U=(n)and (b)for each the usual continuous [0,1)metric.This lower bound bEU,the whole neighborhood AA (b)is contained in explains where the (Inm)bound for consistent the first L columns ofπ. hashing comes from:it is the best we can get from From Proposition8.1,|△A(b)jl≤L·22k.Again any growth-restricted metric.It also explains why no repeating the analysis in Lemma 6.2,we have that more sophisticated growth-restricted metric has arisen Pr[ls <L]=O(n-1+20)=o(1)with a<. to displace consistent hashing. Case 2:There is some j<L and some b,such that Unlike the case for general monotone ranged hash Aj(b)>L4/o.Without loss of generality,we assume functions,in a growth restricted metric,the price of that mij =b for i =1,2,...,t,where t =L/a.We only churn dominates the balls-into-bins skew,and unlike consider the first L2 columns in these t rows.Because the balls-into-bins bound,the price of churn does not they share the same entry b,according to Proposition approach O(n)as n grows to (mlogm).The price 8.1,l{π|1≤i≤t,1≤j≤L2=L222k,i.e,there are L222K distinct entries in the first L2 columns of of churn in this setting is also robust for all values of n and m satisfying the scalability condition n=m1+o(1). in these t rows.Effectively,assigns t=L/a items to only L222K buckets.The maximum load in these 3In fact,they show that a set S of uniformly random m points buckets is thus at least L22-2K2L,as long as none from with dimKR(亿)=K has dimKR(S)≤K+1with of these rows are totally absent from S.The probability probability 1-exp(-(t))if we only care about balls larger that there is at least one such bad row (containing than a threshold t.Because all balls in our proofs have size no element of S in its first L2 entries)is at most L (VIn m),we can easily make the probability (1-o(1)) t.PrL∈可≤Ll-nr)2=o.Therefore without affecting our argument.We also ignore the +1 on the KR-dimension.because it can only affect the maximum load by with high probability,the maximum load es >L. a constant factor.U. Combining this fact with Lemma 8.1, we have the following proposition. Proposition 8.1. If dimKR(U) = K, for any b ∈ U and any A ⊂ I, |∆A(b)| ≤ t2 2K where t = maxi∈A π −1 i (b). This implies that if a neighborhood ∆A(b) is contained in the first t columns of π, the size of the neighborhood is at most t2 2K no matter how large A is. All the above facts intuitively suggest that for ranged hash functions arising from a metric, the KRdimension of U strongly affects the structure of the preference matrices, and thus controls load balancing. This intuition is justified by the following lemma. Lemma 8.2. Let π be a monotone ranged hash function with |I| = n, |U| = m0 , and dimKR(U) = K, where m0 = max(n, 2m), n = m1+o(1), and K ≤ 1 4 log2 ( n m ln m). Then for any constant 0 < α < 1 2 , Pr S∈( U m) h `S ≥ α2 −2K · n m ln m i = 1 − o(1). Proof. Let L = α2 −2K · n m ln m. With the given constraints on the parameters, L = Ω(√ ln m) ∩ mo(1) . As in the proof of Theorem 6.1, we consider two cases depending on the number of duplicate entries in each column of π. Recall that for each column j and bucket b, λj (b) represents the number of copies of b in j. Case 1: If for each j ≤ L and b ∈ U, λj (b) < L4/α, applying the same argument as in Lemma 6.2 gives that there exists a family of disjoint L-wide neighborhoods {∆Ab (b)}b∈U such that (a) |U| = Ω( ˜ n) and (b) for each b ∈ U, the whole neighborhood ∆Ab (b) is contained in the first L columns of π. From Proposition 8.1, |∆Ab (b)| ≤ L · 2 2K. Again repeating the analysis in Lemma 6.2, we have that Pr[`S < L] = O˜(n −1+2α) = o(1) with α < 1 2 . Case 2: There is some j ≤ L and some b, such that λj (b) ≥ L 4/α. Without loss of generality, we assume that πij = b for i = 1, 2, . . . , t, where t = L 4/α. We only consider the first L 2 columns in these t rows. Because they share the same entry b, according to Proposition 8.1, |{πij | 1 ≤ i ≤ t, 1 ≤ j ≤ L 2}| = L 22 2K, i.e., there are L 22 2K distinct entries in the first L 2 columns of π in these t rows. Effectively, π assigns t = L 4/α items to only L 22 2K buckets. The maximum load in these buckets is thus at least 1 α L 22 −2K ≥ L, as long as none of these rows are totally absent from S. The probability that there is at least one such bad row (containing no element of S in its first L 2 entries) is at most t · Pr[[L 2 ] ⊆ S] ≤ 1 α L 4 (1 − m m0−L2 ) L 2 = o(1). Therefore with high probability, the maximum load `S ≥ L. Naturally, the same bound holds for randomized ranged hash functions against a worst-case S. Specifically, for any family of ranged hash functions with dimKR(U) = K, `K-dimKR (n, m0 , n, m) = Ω(2−2K · n m ln m), where K ≤ 1 4 log2 ( n m ln m). In [12], Karger and Ruhl show that a uniformly selected subset of U has essentially the same KRdimension with high probability,3 i.e. K-dimKR is probabilistically U-hereditary. It is easy to see that KdimKR is also I-hereditary, since the KR-dimension of U is irrelevant of I. By Lemma 5.1, for all ranged hash functions with the same KR-dimension of U it holds that `K-dimKR (N, M, n, m) ≥ `K-dimKR (n, m0 , n, m). This gives us the following theorem. Theorem 8.1. For any distribution of ranged hash functions with |I| = N, |U| = M, and dimKR(U) = K, for any sufficiently large n and m with N > 2n, M > 2m, n ≥ m, n = m1+o(1), and 1 4 log2 ( n m ln m) ≥ K, it holds that for any data set D ∈ I n , there exists a state S ∈ U m , such that ` D S = Ω(2−2K · n m ln m) with high probability. Equivalently, under these conditions, `K-dimKR (N, M, n, m) = Ω(2−2K · n m ln m). This justifies our previous observation that dimensionality helps load balance even though it hurts searching. In particular, if U is growth-restricted, `O(1)-dimKR (N, M, n, m) = Ω( n m ln m) for N > 2n, M > 2m, n ≥ m, and n = m1+o(1). This bound is tight because it is achieved by standard consistent hashing, which (as discussed in Section 4) can be modeled using a discrete growth-restricted metric in place of the usual continuous [0, 1) metric. This lower bound explains where the Θ( n m ln m) bound for consistent hashing comes from: it is the best we can get from any growth-restricted metric. It also explains why no more sophisticated growth-restricted metric has arisen to displace consistent hashing. Unlike the case for general monotone ranged hash functions, in a growth restricted metric, the price of churn dominates the balls-into-bins skew, and unlike the balls-into-bins bound, the price of churn does not approach O( n m ) as n grows to Ω(m log m). The price of churn in this setting is also robust for all values of n and m satisfying the scalability condition n = m1+o(1) . 3 In fact, they show that a set S of uniformly random m points from U with dimKR(U) = K has dimKR(S) ≤ K + 1 with probability 1 − exp(−Ω(t)) if we only care about balls larger than a threshold t. Because all balls in our proofs have size L = Ω(√ ln m), we can easily make the probability (1 − o(1)) without affecting our argument. We also ignore the +1 on the KR-dimension, because it can only affect the maximum load by a constant factor

<<向上翻页向下翻页>>