正在加载图片...
During the ensemble sampling,if there exist some categories say g,can be quickly reduced.As the approximated threshold with tag sizes very close to the threshold t,then the required t is selected within the confidence interval.after a number of number of slots to verify the population constraint can be query cycles of ensemble sampling,when the width g is below rather large.Thus,we compute the essential frame size fi for the threshold ve2.t2.B,the estimated value f can be close each category C;and compare it with the expected number enough to the exact threshold t.Then,we can apply the iceberg of slots ne in basic tag identification.If fi>ne,then query with threshold t over the undetermined categories R and the category is moved from the set S to V.We heuristically the qualified categories O.Due to lack of space,we omit the set the frame size f to the mid-value among the series of fi, detailed algorithm for PT-Topk Query Problem.The detailed such that after a query cycle,about half of the categories can be algorithm is given in [17]. determined as qualified/unqualifed,and thus wiped out quickly. VIII.PERFORMANCE ANALYSIS IN TIME-EFFICIENCY Therefore,after the while loop,for each category CiE V,basic As mentioned in the problem formulation,the most critical identification is used to obtain the exact tag size ni.If ni>t, factor for the histogram collection problem is time efficiency. C;is illustrated in the histogram.For each category CiE Q, In regard to basic histogram collection,the time delay is mainly the reader verifies if it has satisfied the accuracy requirement; impacted by two factors:1)the number of categories m,2)the if so,Ci is illustrated in the histogram and wiped out from Q. category with the smallest tag size,say ni,inside the group Then,ensemble sampling is further applied over the categories for ensemble sampling.Generally,as the number of categories in Q to satisfy the accuracy requirement by using the optimized m increases,the number of groups and the essential number grouping method. of slots for each ensemble sampling is increasing,causing the VII.ENSEMBLE SAMPLING FOR THE TOP-k QUERY time delay to increase.Besides,the category with the smallest In some applications,when the number of categories is fairly tag size ni directly decides the essential frame size inside the large,the users only focus on the major categories in the top-k group,the larger the gap among the tag sizes of each category list in regard to the tag size.Then the top-k query is utilized to in the same group,the lower time efficiency that is achieved. filter out those categories out of the top-k list.In this situation, In regard to the iceberg query and top-k query,the time delay the separate counting scheme is especially unsuitable.If the mainly depends on the number of categories with tag size close specified category is not in the top-k list,it is unnecessary to the threshold t.Due to the variance in tag size estimation. to address it for accurate tag size estimation.However,since relatively large numbers of slots are required to verify whether the threshold t for the top-k list cannot be known in advance, the specified categories have tag sizes over the threshold t.For the separate counting scheme cannot quickly decide which the top-k query,additional time delay is required to estimate categories can be wiped out immediately. the threshold t corresponding to the top-k query. Hence,according to the definition of PT-Topk query,as IX.PERFORMANCE EVALUATION the exact value of tag size ni is unknown,in order to define We have conducted simulations in Matlab,the scenario is PrICi E top-k list],i.e.,the probability that category Ci is as follows:there exist m categories in total,and we randomly within the top-k list in terms of tag size,it is essential to de- generate the tag size for each category according to the normal termine a threshold t so that Pr[Ci E top-k list]Prni>t]. distribution N(u,)We set the default values for the follow- Ideally,t should be the tag size of the kth largest category; ing parameters:in regard to the accuracy constraint and the however,it is rather difficult to compute an exact value of t population constraint,we set 1-B =95%,and e =0.2 in the estimation scheme.due to the randomness in the slotted The average time interval for each slot is Ts=Ims,the inter- ALOHA protocol. cycle overhead is Te =43ms.We compare our solutions with Therefore,if the threshold t can be accurately estimated, two basic strategies:the basic tag identification (BD)and the then the top-k query problem is reduced to the iceberg query separate counting (SC)(explained in Section V).All results problem.Then it is essential to quickly determine the value are the averaged results of 500 independent trials of the threshold t while satisfying the constraint Pr[t-t< e.t]1-B.We rely on the following theorem to express the A.The Performance in Basic Histogram Collection above constraint in the form of the variance. We compare the ensemble sampling with one group (ES) Theorem 5:The constraint Pr[lt-t<e.t]>1-B is and the ensemble sampling with optimized grouping (ES-g) satisfied as long as Var(t-t)<e2.t2.B. with the basic strategies.In Fig.2(a),we compare the overall Due to lack of space,we omit the proof of Theorem 5.The scanning time under three various scenarios.In scenario 1 we detailed proof is given in [17]. set the number of categories m =50,the average tag size According to Theorem 5.we utilize the ensemble sampling u=50 and its standard deviation o=30.We observe that to quickly estimate the threshold t.The intuition is as follows: the ES strategy has the longest scanning time,while the others after the first query cycle of ensemble sampling,we can have fairly small values in scanning time.This is because the estimate a confidence interval [to,tup]of the threshold t variance o is relatively large as compared to the tag size. according to the sampled distribution.Then,by wiping out The minor categories become the bottleneck in regard to the those categories which are obviously qualified or unqualified estimation performance,thus greatly increasing the scanning to be in the top-k list,the width of the confidence interval, time.In scenario 2 we set m 100,u=50 and o =30.AsDuring the ensemble sampling, if there exist some categories with tag sizes very close to the threshold 𝑡, then the required number of slots to verify the population constraint can be rather large. Thus, we compute the essential frame size 𝑓𝑖 for each category 𝐶𝑖 and compare it with the expected number of slots 𝑛ˆ𝑖 ⋅ 𝑒 in basic tag identification. If 𝑓𝑖 > 𝑛ˆ𝑖 ⋅ 𝑒, then the category is moved from the set 𝑆 to 𝑉 . We heuristically set the frame size 𝑓 to the mid-value among the series of 𝑓𝑖, such that after a query cycle, about half of the categories can be determined as qualified/unqualifed, and thus wiped out quickly. Therefore, after the while loop, for each category 𝐶𝑖 ∈ 𝑉 , basic identification is used to obtain the exact tag size 𝑛𝑖. If 𝑛𝑖 ≥ 𝑡, 𝐶𝑖 is illustrated in the histogram. For each category 𝐶𝑖 ∈ 𝑄, the reader verifies if it has satisfied the accuracy requirement; if so, 𝐶𝑖 is illustrated in the histogram and wiped out from 𝑄. Then, ensemble sampling is further applied over the categories in 𝑄 to satisfy the accuracy requirement by using the optimized grouping method. VII. ENSEMBLE SAMPLING FOR THE TOP-𝑘 QUERY In some applications, when the number of categories is fairly large, the users only focus on the major categories in the top-𝑘 list in regard to the tag size. Then the top-𝑘 query is utilized to filter out those categories out of the top-𝑘 list. In this situation, the separate counting scheme is especially unsuitable. If the specified category is not in the top-𝑘 list, it is unnecessary to address it for accurate tag size estimation. However, since the threshold 𝑡 for the top-𝑘 list cannot be known in advance, the separate counting scheme cannot quickly decide which categories can be wiped out immediately. Hence, according to the definition of PT-Top𝑘 query, as the exact value of tag size 𝑛𝑖 is unknown, in order to define 𝑃 𝑟[𝐶𝑖 ∈ top-𝑘 list], i.e., the probability that category 𝐶𝑖 is within the top-𝑘 list in terms of tag size, it is essential to de￾termine a threshold 𝑡 so that 𝑃 𝑟[𝐶𝑖 ∈ top-𝑘 list] = 𝑃 𝑟[𝑛𝑖 ≥ 𝑡]. Ideally, 𝑡 should be the tag size of the 𝑘th largest category; however, it is rather difficult to compute an exact value of 𝑡 in the estimation scheme, due to the randomness in the slotted ALOHA protocol. Therefore, if the threshold ˆ𝑡 can be accurately estimated, then the top-𝑘 query problem is reduced to the iceberg query problem. Then it is essential to quickly determine the value of the threshold ˆ𝑡 while satisfying the constraint 𝑃 𝑟[∣ˆ𝑡 − 𝑡∣ ≤ 𝜖 ⋅𝑡] ≥ 1 − 𝛽. We rely on the following theorem to express the above constraint in the form of the variance. Theorem 5: The constraint 𝑃 𝑟[∣ˆ𝑡 − 𝑡∣ ≤ 𝜖 ⋅ 𝑡] ≥ 1 − 𝛽 is satisfied as long as 𝑉 𝑎𝑟(ˆ𝑡 − 𝑡) ≤ 𝜖2 ⋅ 𝑡 2 ⋅ 𝛽. Due to lack of space, we omit the proof of Theorem 5. The detailed proof is given in [17]. According to Theorem 5, we utilize the ensemble sampling to quickly estimate the threshold ˆ𝑡. The intuition is as follows: after the first query cycle of ensemble sampling, we can estimate a confidence interval [𝑡𝑙𝑜𝑤, 𝑡𝑢𝑝] of the threshold 𝑡 according to the sampled distribution. Then, by wiping out those categories which are obviously qualified or unqualified to be in the top-𝑘 list, the width of the confidence interval, say 𝑔, can be quickly reduced. As the approximated threshold ˆ𝑡 is selected within the confidence interval, after a number of query cycles of ensemble sampling, when the width 𝑔 is below the threshold √𝜖2 ⋅ 𝑡2 ⋅ 𝛽, the estimated value ˆ𝑡 can be close enough to the exact threshold 𝑡. Then, we can apply the iceberg query with threshold ˆ𝑡 over the undetermined categories 𝑅 and the qualified categories 𝑄. Due to lack of space, we omit the detailed algorithm for PT-Top𝑘 Query Problem. The detailed algorithm is given in [17]. VIII. PERFORMANCE ANALYSIS IN TIME-EFFICIENCY As mentioned in the problem formulation, the most critical factor for the histogram collection problem is time efficiency. In regard to basic histogram collection, the time delay is mainly impacted by two factors: 1) the number of categories 𝑚, 2) the category with the smallest tag size, say 𝑛𝑖, inside the group for ensemble sampling. Generally, as the number of categories 𝑚 increases, the number of groups and the essential number of slots for each ensemble sampling is increasing, causing the time delay to increase. Besides, the category with the smallest tag size 𝑛𝑖 directly decides the essential frame size inside the group, the larger the gap among the tag sizes of each category in the same group, the lower time efficiency that is achieved. In regard to the iceberg query and top-𝑘 query, the time delay mainly depends on the number of categories with tag size close to the threshold 𝑡. Due to the variance in tag size estimation, relatively large numbers of slots are required to verify whether the specified categories have tag sizes over the threshold 𝑡. For the top-𝑘 query, additional time delay is required to estimate the threshold 𝑡 corresponding to the top-𝑘 query. IX. PERFORMANCE EVALUATION We have conducted simulations in Matlab, the scenario is as follows: there exist 𝑚 categories in total, and we randomly generate the tag size for each category according to the normal distribution 𝑁(𝜇, 𝜎). We set the default values for the follow￾ing parameters: in regard to the accuracy constraint and the population constraint, we set 1 − 𝛽 = 95%, and 𝜖 = 0.2. The average time interval for each slot is 𝜏𝑠 = 1ms, the inter￾cycle overhead is 𝜏𝑐 = 43ms. We compare our solutions with two basic strategies: the basic tag identification (BI) and the separate counting (SC) (explained in Section V). All results are the averaged results of 500 independent trials. A. The Performance in Basic Histogram Collection We compare the ensemble sampling with one group (ES) and the ensemble sampling with optimized grouping (ES-g) with the basic strategies. In Fig.2(a), we compare the overall scanning time under three various scenarios. In scenario 1 we set the number of categories 𝑚 = 50, the average tag size 𝜇 = 50 and its standard deviation 𝜎 = 30. We observe that the ES strategy has the longest scanning time, while the others have fairly small values in scanning time. This is because the variance 𝜎 is relatively large as compared to the tag size. The minor categories become the bottleneck in regard to the estimation performance, thus greatly increasing the scanning time. In scenario 2 we set 𝑚 = 100, 𝜇 = 50 and 𝜎 = 30. As
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有