constraint. This occurs during the ensemble sampling, when the major categories occupy most of the singleton slots, while those minor categories cannot obtain enough samplings in the singleton slots for an accurate estimation of the tag size.

5.3 Compute the Optimal Granularity for Ensemble Sampling

As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion group to define the set of categories involved in one query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes of the involved categories is not too large; in this way, all categories in the same group achieve the accuracy requirement with very close finishing times. In addition, according to Eq. (7), as the number of categories in the ensemble sampling increases, the load factor $\rho$ increases and the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group with regard to the reading performance. Suppose there exist $m$ categories in total; the objective is to divide them into $d$ ($1 \le d \le m$) groups for ensemble sampling, such that the overall scanning time is minimized while the accuracy requirement is still achieved.

For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say $n_i$. Let $\xi_i = (\epsilon \cdot n_i / Z_{1-\beta/2})^2$; then, according to Theorem 2, we can compute the essential frame size $f$ such that $\lambda_i(f) \le \xi_i$. Assume that the inter-cycle overhead is $t_c$ and the average time interval per slot is $t_s$. If $f \le f_{max}$, the total scanning time is $T = f \cdot t_s + t_c$. Otherwise, the final estimate is taken as the average of $r$ independent experiments, each with an estimator variance of $\lambda_i(f_{max})$, so the variance of the average is $\lambda_i(f_{max})/r$. Hence, if we want the final variance to be at most $\xi_i$, then $r$ should be $\lambda_i(f_{max})/\xi_i$, and the total scanning time is $T = (f_{max} \cdot t_s + t_c) \cdot r$.
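The per-group scanning time used by the grouping step below can be computed directly from this rule. The following is a minimal sketch, not the paper's implementation: `variance`, `eps`, `z`, `n_min`, `f_max`, `t_s`, and `t_c` are hypothetical parameter names standing for $\lambda_i(\cdot)$, $\epsilon$, $Z_{1-\beta/2}$, $n_i$, $f_{max}$, $t_s$, and $t_c$.

```python
import math

def essential_scanning_time(variance, eps, z, n_min, f_max, t_s, t_c):
    """Scanning time t(i, j) for one ensemble-sampling group (a sketch).

    `variance(f)` is a hypothetical stand-in for lambda_i(f) of Theorem 2,
    i.e. the estimator variance of the group's smallest category (size n_min)
    under frame size f; `z` denotes Z_{1-beta/2}.
    """
    xi = (eps * n_min / z) ** 2          # variance bound implied by the accuracy constraint

    # Smallest frame size f with variance(f) <= xi (a linear search keeps the sketch simple).
    f = 1
    while f <= f_max and variance(f) > xi:
        f += 1

    if f <= f_max:                       # a single query cycle suffices
        return f * t_s + t_c

    # Otherwise repeat the maximum frame r times and average the estimates,
    # so that variance(f_max) / r <= xi.
    r = math.ceil(variance(f_max) / xi)
    return (f_max * t_s + t_c) * r
```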
We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. Assume that currently there are $m$ categories ranked in non-increasing order of the estimated tag size, e.g., $C_1, C_2, \ldots, C_m$. We need to cut the ranked categories into one or more contiguous groups for ensemble sampling. For a single group consisting of the categories from $C_i$ to $C_j$, we define $t(i,j)$ as the essential scanning time of the ensemble sampling over this group, computed in the same way as the aforementioned $T$. Furthermore, we define $T(i,j)$ as the minimum overall scanning time over the categories from $C_i$ to $C_j$ among the various grouping strategies. The recursive expression of $T(i,j)$ is shown in Eq. (8):

$$
T(i,j) =
\begin{cases}
\min_{i \le k \le j} \{\, t(i,k) + T(k+1,j) \,\}, & i < j, \\
t(i,i), & i = j,
\end{cases}
\tag{8}
$$

where $T(j+1,j)$ is taken to be 0, so that $k = j$ corresponds to treating $C_i, \ldots, C_j$ as a single group. In Eq. (8), the value of $T(i,j)$ is obtained by enumerating each possible combination of $t(i,k)$ and $T(k+1,j)$ and taking the minimum of $t(i,k) + T(k+1,j)$. By solving the overlapping subproblems in $T(i,j)$, the optimization problem is reduced to computing the value of $T(1,m)$.
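A minimal sketch of this recurrence, with memoization over the overlapping subproblems; `group_cost(i, j)` is a hypothetical callback supplying $t(i,j)$, for instance a wrapper around `essential_scanning_time` above.

```python
from functools import lru_cache

def optimal_grouping(m, group_cost):
    """Dynamic programming over Eq. (8): split categories C_1..C_m (already
    ranked in non-increasing order of estimated size) into contiguous groups
    so that the total scanning time is minimized.

    `group_cost(i, j)` returns t(i, j), the essential scanning time of one
    ensemble-sampling cycle over categories i..j (1-based, inclusive).
    """
    @lru_cache(maxsize=None)
    def T(i, j):
        # Minimum total scanning time for categories i..j, together with the
        # index k at which the first group C_i..C_k ends (T(j+1, j) = 0).
        if i == j:
            return group_cost(i, i), i
        return min(
            (group_cost(i, k) + (T(k + 1, j)[0] if k < j else 0), k)
            for k in range(i, j + 1)
        )

    # Recover the grouping by following the recorded cut points.
    groups, i = [], 1
    while i <= m:
        _, k = T(i, m)
        groups.append((i, k))
        i = k + 1
    return T(1, m)[0], groups
```

Under the cost model of this section, this procedure would be expected to reproduce the grouping reported in the ten-category example that follows.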
For example, suppose there is a set of tags with 10 categories, ranked in non-increasing order of the estimated tag size, say, $\{100, 80, 75, 41, 35, 30, 20, 15, 12, 8\}$; according to the dynamic programming, they are finally divided into three groups for ensemble sampling, i.e., $\{100, 80, 75\}$, $\{41, 35, 30\}$, and $\{20, 15, 12, 8\}$. In this way, the tag sizes of the categories inside one group are close to each other, so during the ensemble sampling all categories in the same group can achieve the accuracy requirement with very close finishing times, and very few slots are wasted waiting for the comparatively minor categories. On the other hand, these categories are put together with an appropriate granularity for ensemble sampling, so that the fixed time cost of each query cycle is sufficiently amortized.

5.4 The Ensemble Sampling-Based Algorithm

In Algorithm 1, we propose an ensemble sampling-based algorithm for the basic histogram collection. In the beginning, as the overall number of tags $n$ cannot be predicted, in order to accommodate a large operating range up to $n$, we set the initial frame size $f$ by solving $f \cdot e^{-n/f} = 5$, as suggested in [9]. Then, during each cycle of ensemble sampling, we find the category with the largest population $v$ in the singleton slots, and keep only those categories whose number of singleton slots satisfies $n_{s,i} > v \cdot \theta$ ($0 < \theta < 1$), thus filtering out the minor categories which occasionally occupy a small number of singleton slots. For example, suppose it is observed from the singleton slots that the numbers of slots occupied by the various categories are $\{35, 25, 10, 5, 3, 1\}$; if $\theta$ is set to 0.1, then the categories occupying 3 and 1 singleton slots are filtered out from the next ensemble sampling. Therefore, during the ensemble sampling we avoid estimating the tag sizes of those minor categories, which would otherwise suffer a rather large variance. The involved categories are then further divided into smaller groups based on the dynamic programming. As the major categories are estimated and wiped out from the set $R$ phase by phase, all categories, including the relatively minor ones, can eventually be estimated accurately in terms of tag size. Query cycles continue to be issued until no singleton slots or collision slots exist.
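The following sketch reproduces only the control flow described above (initial frame sizing, $\theta$-filtering on the singleton slots, DP-based grouping, and phase-by-phase removal); it is not the paper's Algorithm 1, and `run_cycle`, `estimate_group`, and `group_by_dp` are hypothetical placeholders for the reader-side primitives.

```python
import math

def initial_frame_size(n_upper):
    """Smallest frame size f with f * exp(-n/f) >= 5, following the sizing
    rule f * e^{-n/f} = 5 suggested in [9]; a linear search keeps it simple."""
    f = 1
    while f * math.exp(-n_upper / f) < 5:
        f += 1
    return f

def theta_filter(singleton_counts, theta=0.1):
    """Keep only categories whose singleton-slot count exceeds theta * v,
    where v is the largest singleton-slot count observed in this cycle."""
    v = max(singleton_counts.values())
    return {c: k for c, k in singleton_counts.items() if k > theta * v}

def collect_histogram(run_cycle, estimate_group, group_by_dp, n_upper, theta=0.1):
    """Phase-by-phase basic histogram collection (control-flow sketch only).

    run_cycle(f, categories) -> {category: singleton-slot count} for one query
    cycle (empty when no singleton or collision slots remain);
    group_by_dp(ranked_categories) -> list of groups obtained via Eq. (8);
    estimate_group(group) -> {category: estimated tag size} once the group
    has met the accuracy requirement.
    """
    f = initial_frame_size(n_upper)
    remaining, histogram = None, {}
    while True:
        counts = run_cycle(f, remaining)
        if not counts:                              # termination condition
            break
        active = theta_filter(counts, theta)        # minor categories wait for a later phase
        ranked = sorted(active, key=active.get, reverse=True)
        for group in group_by_dp(ranked):           # ensemble sampling per group
            histogram.update(estimate_group(group))
        remaining = [c for c in counts if c not in histogram]
    return histogram
```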
6 ENSEMBLE SAMPLING FOR THE ICEBERG QUERY

6.1 Motivation

In some applications, the users only pay attention to the major categories whose tag sizes are above a certain threshold $t$, while the minor categories do not need to be addressed. The iceberg query [33] is then utilized to filter out the categories whose tag sizes fall below the threshold $t$. In this situation, the separate counting scheme is especially unsuitable, since most of the categories are not within the scope of concern and can be wiped out together immediately. According to the definition in the problem formulation, three constraints must be satisfied for the iceberg query:

$$
\begin{aligned}
&\Pr[\,|\hat{n}_i - n_i| \le \epsilon \cdot n_i\,] \ge 1 - \beta && \text{accuracy constraint}, \\
&\Pr[\,\hat{n}_i < t \mid n_i \ge t\,] < \beta && \text{population constraint}, \\
&\Pr[\,\hat{n}_i \ge t \mid n_i < t\,] < \beta && \text{population constraint}.
\end{aligned}
$$