ๆญฃๅœจๅŠ ่ฝฝๅ›พ็‰‡...
to denote the variance of ni.We rely on the following theorem C.Compute the Optimal Granularity for Ensemble Sampling to illustrate the accuracy of the estimator SE. Theorem I:Let o;represent the variance of the estimator SE As indicated in the above analysis,during a query cycle mi,the load factor p=ๅญ๏ผŒthen, of the ensemble sampling,in order to achieve the accuracy requirement for all categories,the essential scanning time d=%.eP+n-1 mainly depends on the category with the smallest tag size. n'co+n-1(6+n2)-n. (7) as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Proof:See Appendix A. Therefore,we use the notion group to define a set of categories 2)Reducing the Variance through Repeated Tests:As the involved in a query cycle of the ensemble sampling.Hence, frame size for each query cycle has a maximum value,by each cycle of ensemble sampling should be applied over an estimating from the ensemble sampling within only one query appropriate group,such that the variance of the tag sizes for cycle,the estimated tag size may not be accurate enough for the involved categories cannot be too large.In this way,all the accuracy constraint.In this situation,multiple query cycles categories in the same group achieve the accuracy requirement are essential to reduce the variance through repeated tests. with very close finishing time.In addition,according to Eq.(7), Suppose the reader issues l query cycles over the same set of as the number of categories increases in the ensemble sampling categories,in regard to a specified category Ci,by utilizing the the load factor p is increased,then the achieved accuracy for weighted statistical averaging method,the averaged tag size each involved category is reduced.Therefore,it is essential to ๅ…ƒ=โˆ‘k=1uknk:here wk=ใ€‘ ๅทฅ๏ผŒn,k and 6.k 1 compute an optimal granularity for the group in regard to the respectively denote the estimated tag size and variance for each reading performance.Suppose there exists m categories in total, cycle k.Then,the variance ofๅ…ƒiso=ziๆœฌ the objective is to divide them into d(1 sd m)groups for ensemble sampling,such that the overall scanning time can be Therefore,according to the accuracy constraint in the prob- minimized while achieving the accuracy requirement. lem formulation,we rely on the following theorem to express For a specified group,in order for all involved categories to this constraint in the form of the variance. satisfy the accuracy requirement,it is essential to compute the Theorem 2:Suppose the variance of the averaged tag size required frame size for the category with the smallest tag size, nis o.The accuracy constraint is satisfied for a specified category C,as long asๅญโ‰ค(zaaP.n,Zi-aa is the say nLet(then,according to Theorem 2. we can compute the essential frame size f such that Ai(f)<ti. 1-percentile for the standard normal distribution. Assume that the inter-cycle overhead is Te,and the average Proof:See Appendix B. โ–  time interval per slot is Ts.Therefore,if f fmaz,then the According to Theorem 2,we can verify if the accuracy con- total scanning time T f.Ts+Te.Otherwise,if the final straint is satisfied for each category through directly checking estimate is the average of r independent experiments each with the variance against the threshold If 1-5%. an estimator variance of Ai(fmaz),then the variance of the then Z1-B/2=1.96. average is).Hence,if we want the final variance to 3)Property of the Ensemble Sampling:According to The- bet thenr should be,the total scanning time is orem 1,the normalized variance of the SE estimator A;=5 n T=(fmax·Ts+Tc)·r. is equivalent to=(().Let n(eP+n-1)1 We propose a dynamic programming-based algorithm to ep+n-1 a ๅ‘Š.b=eๅทThen,the normalized compute the optimal granularity for ensemble sampling.As- ep-n-I variance i=a.+b.Since the SE estimator can utilize sume that currently there are m categories,ranked in non- any estimator like [4][12][15]to estimate the overall tag size, increasing order,according to the estimated tag size,e.g., then,without loss of generality,if we use the estimator in [4], C1,C2,....Cm.We need to cut the ranked categories into one or more continuous groups for ensemble sampling.In regard to a we can prove that a <0 for any value of n>0,f >0.This property applies to any estimator with variance smaller than 8o single group consisting of categories from Ci to Ci,we define t(i,j)as the essential scanning time for ensemble sampling, in ZE,which simply estimates the overall tag size based on the which is computed in the same way as aforementioned T.Fur- observed number of empty slots. According to Theorem 2,in order to satisfy the accuracy thermore,we define T(i,j)as the minimum overall scanning constraint,we should ensure that( )2.ni.As a<0 time over the categories from Ci to Ci among various grouping strategies.Then,the recursive expression of T(i,j)is shown for all values of f,it infers that the larger the value ni is, in Eq.(8) the faster it will be for the specified category to satisfy the accuracy constraint.On the contrary,the smaller the value n is,the slower it will be for the specified category to satisfy the T(i,)= minisksjt(i,k)+T(k+1,j)),i<j; t(i,i), (8) i=j. accuracy constraint.This occurs during the ensemble sampling, when the major categories occupy most of the singleton slots,In Eq.(8),the value of T(i,j)is obtained by enumerating each while those minor categories cannot obtain enough samplings possible combination of t(i,k)and T(+1,j),and then getting in the singleton slots for accurate estimation of the tag size. the minimum value of t(i,)+T(k+1,j).By solving theto denote the variance of ๐‘›ห†. We rely on the following theorem to illustrate the accuracy of the estimator SE. Theorem 1: Let ๐›ฟ๐‘– represent the variance of the estimator SE ๐‘›ห†๐‘–, the load factor ๐œŒ = ๐‘› ๐‘“ , then, ๐›ฟ๐‘– = ๐‘›๐‘– ๐‘› โ‹… ๐‘’๐œŒ + ๐‘›๐‘– โˆ’ 1 ๐‘’๐œŒ + ๐‘› โˆ’ 1 โ‹… (๐›ฟ + ๐‘›2) โˆ’ ๐‘›2 ๐‘– . (7) Proof: See Appendix A. 2) Reducing the Variance through Repeated Tests: As the frame size for each query cycle has a maximum value, by estimating from the ensemble sampling within only one query cycle, the estimated tag size may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues ๐‘™ query cycles over the same set of categories, in regard to a specified category ๐ถ๐‘–, by utilizing the weighted statistical averaging method, the averaged tag size ๐‘›ห†๐‘– = โˆ‘๐‘™ ๐‘˜=1 ๐œ”๐‘˜ โ‹… ๐‘›ห†๐‘–,๐‘˜; here ๐œ”๐‘˜ = 1 ๐›ฟ โˆ‘ ๐‘–,๐‘˜ ๐‘™ ๐‘˜=1 1 ๐›ฟ๐‘–,๐‘˜ , ๐‘›ห†๐‘–,๐‘˜ and ๐›ฟ๐‘–,๐‘˜ respectively denote the estimated tag size and variance for each cycle ๐‘˜. Then, the variance of ๐‘›ห†๐‘– is ๐œŽ2 ๐‘– = โˆ‘ 1 ๐‘™ ๐‘˜=1 1 ๐›ฟ๐‘–,๐‘˜ . Therefore, according to the accuracy constraint in the prob๏ฟพlem formulation, we rely on the following theorem to express this constraint in the form of the variance. Theorem 2: Suppose the variance of the averaged tag size ๐‘›ห†๐‘– is ๐œŽ2 ๐‘– . The accuracy constraint is satisfied for a specified category ๐ถ๐‘–, as long as ๐œŽ2 ๐‘– โ‰ค ( ๐œ– ๐‘1โˆ’๐›ฝ/2 )2 โ‹… ๐‘›2 ๐‘– , ๐‘1โˆ’๐›ฝ/2 is the 1 โˆ’ ๐›ฝ 2 percentile for the standard normal distribution. Proof: See Appendix B. According to Theorem 2, we can verify if the accuracy con๏ฟพstraint is satisfied for each category through directly checking the variance against the threshold ( ๐œ– ๐‘1โˆ’๐›ฝ/2 )2โ‹…๐‘›2 ๐‘– . If 1โˆ’๐›ฝ = 95%, then ๐‘1โˆ’๐›ฝ/2 = 1.96. 3) Property of the Ensemble Sampling: According to The๏ฟพorem 1, the normalized variance of the SE estimator ๐œ†๐‘– = ๐›ฟ๐‘– ๐‘›๐‘– is equivalent to ๐œ†๐‘– = ๐›ฟโˆ’๐‘›โ‹…๐‘’๐œŒ+๐‘› ๐‘’๐œŒ+๐‘›โˆ’1 โ‹… ๐‘›๐‘– ๐‘› + (๐›ฟ+๐‘›2)(๐‘’๐œŒโˆ’1) ๐‘›โ‹…(๐‘’๐œŒ+๐‘›โˆ’1) . Let ๐‘Ž = ๐›ฟโˆ’๐‘›โ‹…๐‘’๐œŒ+๐‘› ๐‘’๐œŒ+๐‘›โˆ’1 , ๐‘ = (๐›ฟ+๐‘›2)(๐‘’๐œŒโˆ’1) ๐‘›โ‹…(๐‘’๐œŒ+๐‘›โˆ’1) . Then, the normalized variance ๐œ†๐‘– = ๐‘Ž โ‹… ๐‘›๐‘– ๐‘› + ๐‘. Since the SE estimator can utilize any estimator like [4][12][15] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [4], we can prove that ๐‘Ž < 0 for any value of ๐‘› > 0,๐‘“ > 0. This property applies to any estimator with variance smaller than ๐›ฟ0 in ZE, which simply estimates the overall tag size based on the observed number of empty slots. According to Theorem 2, in order to satisfy the accuracy constraint, we should ensure that ๐œ†๐‘– โ‰ค ( ๐œ– ๐‘1โˆ’๐›ฝ/2 )2 โ‹…๐‘›๐‘–. As ๐‘Ž < 0 for all values of ๐‘“, it infers that the larger the value ๐‘›๐‘– is, the faster it will be for the specified category to satisfy the accuracy constraint. On the contrary, the smaller the value ๐‘›๐‘– is, the slower it will be for the specified category to satisfy the accuracy constraint. This occurs during the ensemble sampling, when the major categories occupy most of the singleton slots, while those minor categories cannot obtain enough samplings in the singleton slots for accurate estimation of the tag size. C. Compute the Optimal Granularity for Ensemble Sampling As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion group to define a set of categories involved in a query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes for the involved categories cannot be too large. In this way, all categories in the same group achieve the accuracy requirement with very close finishing time. In addition, according to Eq. (7), as the number of categories increases in the ensemble sampling, the load factor ๐œŒ is increased, then the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group in regard to the reading performance. Suppose there exists ๐‘š categories in total, the objective is to divide them into ๐‘‘(1 โ‰ค ๐‘‘ โ‰ค ๐‘š) groups for ensemble sampling, such that the overall scanning time can be minimized while achieving the accuracy requirement. For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say ๐‘›๐‘–. Let ๐‘ก๐‘– = ( ๐œ– ๐‘1โˆ’๐›ฝ/2 )2 โ‹… ๐‘›๐‘–, then, according to Theorem 2, we can compute the essential frame size ๐‘“ such that ๐œ†๐‘–(๐‘“) โ‰ค ๐‘ก๐‘–. Assume that the inter-cycle overhead is ๐œ๐‘, and the average time interval per slot is ๐œ๐‘ . Therefore, if ๐‘“ โ‰ค ๐‘“๐‘š๐‘Ž๐‘ฅ, then the total scanning time ๐‘‡ = ๐‘“ โ‹… ๐œ๐‘  + ๐œ๐‘. Otherwise, if the final estimate is the average of ๐‘Ÿ independent experiments each with an estimator variance of ๐œ†๐‘–(๐‘“๐‘š๐‘Ž๐‘ฅ), then the variance of the average is ๐œ†๐‘–(๐‘“๐‘š๐‘Ž๐‘ฅ) ๐‘Ÿ . Hence, if we want the final variance to be ๐‘ก๐‘–, then ๐‘Ÿ should be ๐œ†๐‘–(๐‘“๐‘š๐‘Ž๐‘ฅ) ๐‘ก๐‘– , the total scanning time is ๐‘‡ = (๐‘“๐‘š๐‘Ž๐‘ฅ โ‹… ๐œ๐‘  + ๐œ๐‘) โ‹… ๐‘Ÿ. We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. As๏ฟพsume that currently there are ๐‘š categories, ranked in non๏ฟพincreasing order, according to the estimated tag size, e.g., ๐ถ1, ๐ถ2, ..., ๐ถ๐‘š. We need to cut the ranked categories into one or more continuous groups for ensemble sampling. In regard to a single group consisting of categories from ๐ถ๐‘– to ๐ถ๐‘— , we define ๐‘ก(๐‘–, ๐‘—) as the essential scanning time for ensemble sampling, which is computed in the same way as aforementioned ๐‘‡. Fur๏ฟพthermore, we define ๐‘‡(๐‘–, ๐‘—) as the minimum overall scanning time over the categories from ๐ถ๐‘– to ๐ถ๐‘— among various grouping strategies. Then, the recursive expression of ๐‘‡(๐‘–, ๐‘—) is shown in Eq.(8). ๐‘‡(๐‘–, ๐‘—) = { min๐‘–โ‰ค๐‘˜โ‰ค๐‘—{๐‘ก(๐‘–, ๐‘˜) + ๐‘‡(๐‘˜ + 1, ๐‘—)}, ๐‘–<๐‘—; ๐‘ก(๐‘–, ๐‘–), ๐‘– = ๐‘—. (8) In Eq. (8), the value of ๐‘‡(๐‘–, ๐‘—) is obtained by enumerating each possible combination of ๐‘ก(๐‘–, ๐‘˜) and ๐‘‡(๐‘˜+1, ๐‘—), and then getting the minimum value of ๐‘ก(๐‘–, ๐‘˜) + ๐‘‡(๐‘˜ + 1, ๐‘—). By solving the
<<ๅ‘ไธŠ็ฟป้กตๅ‘ไธ‹็ฟป้กต>>
©2008-็Žฐๅœจ cucdc.com ้ซ˜็ญ‰ๆ•™่‚ฒ่ต„่ฎฏ็ฝ‘ ็‰ˆๆƒๆ‰€ๆœ‰