to denote the variance of ni.We rely on the following theorem C.Compute the Optimal Granularity for Ensemble Sampling to illustrate the accuracy of the estimator SE. Theorem I:Let o;represent the variance of the estimator SE As indicated in the above analysis,during a query cycle mi,the load factor p=ๅญ๏ผthen, of the ensemble sampling,in order to achieve the accuracy requirement for all categories,the essential scanning time d=%.eP+n-1 mainly depends on the category with the smallest tag size. n'co+n-1(6+n2)-n. (7) as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Proof:See Appendix A. Therefore,we use the notion group to define a set of categories 2)Reducing the Variance through Repeated Tests:As the involved in a query cycle of the ensemble sampling.Hence, frame size for each query cycle has a maximum value,by each cycle of ensemble sampling should be applied over an estimating from the ensemble sampling within only one query appropriate group,such that the variance of the tag sizes for cycle,the estimated tag size may not be accurate enough for the involved categories cannot be too large.In this way,all the accuracy constraint.In this situation,multiple query cycles categories in the same group achieve the accuracy requirement are essential to reduce the variance through repeated tests. with very close finishing time.In addition,according to Eq.(7), Suppose the reader issues l query cycles over the same set of as the number of categories increases in the ensemble sampling categories,in regard to a specified category Ci,by utilizing the the load factor p is increased,then the achieved accuracy for weighted statistical averaging method,the averaged tag size each involved category is reduced.Therefore,it is essential to ๅ
=โk=1uknk:here wk=ใ ๅทฅ๏ผn,k and 6.k 1 compute an optimal granularity for the group in regard to the respectively denote the estimated tag size and variance for each reading performance.Suppose there exists m categories in total, cycle k.Then,the variance ofๅ
iso=ziๆฌ the objective is to divide them into d(1 sd m)groups for ensemble sampling,such that the overall scanning time can be Therefore,according to the accuracy constraint in the prob- minimized while achieving the accuracy requirement. lem formulation,we rely on the following theorem to express For a specified group,in order for all involved categories to this constraint in the form of the variance. satisfy the accuracy requirement,it is essential to compute the Theorem 2:Suppose the variance of the averaged tag size required frame size for the category with the smallest tag size, nis o.The accuracy constraint is satisfied for a specified category C,as long asๅญโค(zaaP.n,Zi-aa is the say nLet(then,according to Theorem 2. we can compute the essential frame size f such that Ai(f)<ti. 1-percentile for the standard normal distribution. Assume that the inter-cycle overhead is Te,and the average Proof:See Appendix B. โ time interval per slot is Ts.Therefore,if f fmaz,then the According to Theorem 2,we can verify if the accuracy con- total scanning time T f.Ts+Te.Otherwise,if the final straint is satisfied for each category through directly checking estimate is the average of r independent experiments each with the variance against the threshold If 1-5%. an estimator variance of Ai(fmaz),then the variance of the then Z1-B/2=1.96. average is).Hence,if we want the final variance to 3)Property of the Ensemble Sampling:According to The- bet thenr should be,the total scanning time is orem 1,the normalized variance of the SE estimator A;=5 n T=(fmax·Ts+Tc)·r. is equivalent to=(().Let n(eP+n-1)1 We propose a dynamic programming-based algorithm to ep+n-1 a ๅ.b=eๅทThen,the normalized compute the optimal granularity for ensemble sampling.As- ep-n-I variance i=a.+b.Since the SE estimator can utilize sume that currently there are m categories,ranked in non- any estimator like [4][12][15]to estimate the overall tag size, increasing order,according to the estimated tag size,e.g., then,without loss of generality,if we use the estimator in [4], C1,C2,....Cm.We need to cut the ranked categories into one or more continuous groups for ensemble sampling.In regard to a we can prove that a <0 for any value of n>0,f >0.This property applies to any estimator with variance smaller than 8o single group consisting of categories from Ci to Ci,we define t(i,j)as the essential scanning time for ensemble sampling, in ZE,which simply estimates the overall tag size based on the which is computed in the same way as aforementioned T.Fur- observed number of empty slots. According to Theorem 2,in order to satisfy the accuracy thermore,we define T(i,j)as the minimum overall scanning constraint,we should ensure that( )2.ni.As a<0 time over the categories from Ci to Ci among various grouping strategies.Then,the recursive expression of T(i,j)is shown for all values of f,it infers that the larger the value ni is, in Eq.(8) the faster it will be for the specified category to satisfy the accuracy constraint.On the contrary,the smaller the value n is,the slower it will be for the specified category to satisfy the T(i,)= minisksjt(i,k)+T(k+1,j)),i<j; t(i,i), (8) i=j. accuracy constraint.This occurs during the ensemble sampling, when the major categories occupy most of the singleton slots,In Eq.(8),the value of T(i,j)is obtained by enumerating each while those minor categories cannot obtain enough samplings possible combination of t(i,k)and T(+1,j),and then getting in the singleton slots for accurate estimation of the tag size. the minimum value of t(i,)+T(k+1,j).By solving theto denote the variance of ๐ห. We rely on the following theorem to illustrate the accuracy of the estimator SE. Theorem 1: Let ๐ฟ๐ represent the variance of the estimator SE ๐ห๐, the load factor ๐ = ๐ ๐ , then, ๐ฟ๐ = ๐๐ ๐ โ
๐๐ + ๐๐ โ 1 ๐๐ + ๐ โ 1 โ
(๐ฟ + ๐2) โ ๐2 ๐ . (7) Proof: See Appendix A. 2) Reducing the Variance through Repeated Tests: As the frame size for each query cycle has a maximum value, by estimating from the ensemble sampling within only one query cycle, the estimated tag size may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues ๐ query cycles over the same set of categories, in regard to a specified category ๐ถ๐, by utilizing the weighted statistical averaging method, the averaged tag size ๐ห๐ = โ๐ ๐=1 ๐๐ โ
๐ห๐,๐; here ๐๐ = 1 ๐ฟ โ ๐,๐ ๐ ๐=1 1 ๐ฟ๐,๐ , ๐ห๐,๐ and ๐ฟ๐,๐ respectively denote the estimated tag size and variance for each cycle ๐. Then, the variance of ๐ห๐ is ๐2 ๐ = โ 1 ๐ ๐=1 1 ๐ฟ๐,๐ . Therefore, according to the accuracy constraint in the prob๏ฟพlem formulation, we rely on the following theorem to express this constraint in the form of the variance. Theorem 2: Suppose the variance of the averaged tag size ๐ห๐ is ๐2 ๐ . The accuracy constraint is satisfied for a specified category ๐ถ๐, as long as ๐2 ๐ โค ( ๐ ๐1โ๐ฝ/2 )2 โ
๐2 ๐ , ๐1โ๐ฝ/2 is the 1 โ ๐ฝ 2 percentile for the standard normal distribution. Proof: See Appendix B. According to Theorem 2, we can verify if the accuracy con๏ฟพstraint is satisfied for each category through directly checking the variance against the threshold ( ๐ ๐1โ๐ฝ/2 )2โ
๐2 ๐ . If 1โ๐ฝ = 95%, then ๐1โ๐ฝ/2 = 1.96. 3) Property of the Ensemble Sampling: According to The๏ฟพorem 1, the normalized variance of the SE estimator ๐๐ = ๐ฟ๐ ๐๐ is equivalent to ๐๐ = ๐ฟโ๐โ
๐๐+๐ ๐๐+๐โ1 โ
๐๐ ๐ + (๐ฟ+๐2)(๐๐โ1) ๐โ
(๐๐+๐โ1) . Let ๐ = ๐ฟโ๐โ
๐๐+๐ ๐๐+๐โ1 , ๐ = (๐ฟ+๐2)(๐๐โ1) ๐โ
(๐๐+๐โ1) . Then, the normalized variance ๐๐ = ๐ โ
๐๐ ๐ + ๐. Since the SE estimator can utilize any estimator like [4][12][15] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [4], we can prove that ๐ < 0 for any value of ๐ > 0,๐ > 0. This property applies to any estimator with variance smaller than ๐ฟ0 in ZE, which simply estimates the overall tag size based on the observed number of empty slots. According to Theorem 2, in order to satisfy the accuracy constraint, we should ensure that ๐๐ โค ( ๐ ๐1โ๐ฝ/2 )2 โ
๐๐. As ๐ < 0 for all values of ๐, it infers that the larger the value ๐๐ is, the faster it will be for the specified category to satisfy the accuracy constraint. On the contrary, the smaller the value ๐๐ is, the slower it will be for the specified category to satisfy the accuracy constraint. This occurs during the ensemble sampling, when the major categories occupy most of the singleton slots, while those minor categories cannot obtain enough samplings in the singleton slots for accurate estimation of the tag size. C. Compute the Optimal Granularity for Ensemble Sampling As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion group to define a set of categories involved in a query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes for the involved categories cannot be too large. In this way, all categories in the same group achieve the accuracy requirement with very close finishing time. In addition, according to Eq. (7), as the number of categories increases in the ensemble sampling, the load factor ๐ is increased, then the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group in regard to the reading performance. Suppose there exists ๐ categories in total, the objective is to divide them into ๐(1 โค ๐ โค ๐) groups for ensemble sampling, such that the overall scanning time can be minimized while achieving the accuracy requirement. For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say ๐๐. Let ๐ก๐ = ( ๐ ๐1โ๐ฝ/2 )2 โ
๐๐, then, according to Theorem 2, we can compute the essential frame size ๐ such that ๐๐(๐) โค ๐ก๐. Assume that the inter-cycle overhead is ๐๐, and the average time interval per slot is ๐๐ . Therefore, if ๐ โค ๐๐๐๐ฅ, then the total scanning time ๐ = ๐ โ
๐๐ + ๐๐. Otherwise, if the final estimate is the average of ๐ independent experiments each with an estimator variance of ๐๐(๐๐๐๐ฅ), then the variance of the average is ๐๐(๐๐๐๐ฅ) ๐ . Hence, if we want the final variance to be ๐ก๐, then ๐ should be ๐๐(๐๐๐๐ฅ) ๐ก๐ , the total scanning time is ๐ = (๐๐๐๐ฅ โ
๐๐ + ๐๐) โ
๐. We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. As๏ฟพsume that currently there are ๐ categories, ranked in non๏ฟพincreasing order, according to the estimated tag size, e.g., ๐ถ1, ๐ถ2, ..., ๐ถ๐. We need to cut the ranked categories into one or more continuous groups for ensemble sampling. In regard to a single group consisting of categories from ๐ถ๐ to ๐ถ๐ , we define ๐ก(๐, ๐) as the essential scanning time for ensemble sampling, which is computed in the same way as aforementioned ๐. Fur๏ฟพthermore, we define ๐(๐, ๐) as the minimum overall scanning time over the categories from ๐ถ๐ to ๐ถ๐ among various grouping strategies. Then, the recursive expression of ๐(๐, ๐) is shown in Eq.(8). ๐(๐, ๐) = { min๐โค๐โค๐{๐ก(๐, ๐) + ๐(๐ + 1, ๐)}, ๐<๐; ๐ก(๐, ๐), ๐ = ๐. (8) In Eq. (8), the value of ๐(๐, ๐) is obtained by enumerating each possible combination of ๐ก(๐, ๐) and ๐(๐+1, ๐), and then getting the minimum value of ๐ก(๐, ๐) + ๐(๐ + 1, ๐). By solving the