of the tag size ni is above/below the threshold t,the specified B.Algorithm for the Iceberg Query Problem category can be respectively identified as qualified/unqualifed, as both the false positive and false negative probability are less Algorithm 2 Algorithm for Iceberg Query than B;otherwise,the specified category is still undetermined. 1:Initialize R to all categories,set Q,U,V to Set I=1. According to the weighted statistical averaging method,as the 2:while Rโ ado number of repeated tests increases,the averaged variance o;for 3 If I=1,set the initial frame size f. each category decreases,thus the confidence interval for each 4 Issue a query cycle over the tags,add those relatively category is shrinking.Therefore,after a certain number of query major categories into the set S.Set S=S. cycles,all categories can be determined as qualified/unqualifed 5 while Sโ ado for the population constraint. 6: Compute the frame size fi for each category CiES suchthat the variance,=ไบงๅฏ๏ผf็>ๅ
e, Qualified Category then remove Ci from S to V.If fi>fmar,set fi= fmaz.Obtain the frame size f as the mid-value among Unqualified Category the series of fi. Threshold 7: Select all tags in S,issue a query cycle with the frame size f,compute the estimated tag size n and the averaged standard deviation o;for each category ๅ
็ฑ็ฑไธ C S.Detect the qualified category set Q and CI C2 C3 C4 C5 C6 C7 C8 C9 CI0CII CI2 unqualified category set U.Set S=S-Q-U. 8 ifUโ then Fig.1.Histogram with confidence interval annotated 9: Wipe out all categories unexplored in the singleton slots from S Note that when the estimated valueๅ
๏ผ>t orใt, 10: end if the required variance in the population constraint is much 11: end while larger than the specifications of the accuracy constraint.In 12: n=ๅ
-โc,esๅ
.R=R-S,1=1+1 this situation,these categories can be quickly identified as 13:end while qualified/unqualified,and can be wiped out immediately from 14:Further verify the categories in V and Q for the accuracy the ensemble sampling for verifying the population constraint. constraint Thus,those undetermined categories can be further involved in the ensemble sampling with a much smaller tag size,verifying We propose the algorithm for the iceberg query problem in the population constraint in a faster approach. Algorithm 2.Steps 1-4 are quite similar to steps 1-4 in Algo- Sometimes the tag sizes of various categories are subject rithm 1,due to lack of space we omit the detailed statements to some skew distributions with a "long tail".The long tail for these steps.Assume that the current set of categories is represents those categories each of which occupies a rather R,during the query cycles of ensemble sampling,the reader small percentage among the total categories,but all together continuously updates the statistical value of n as well as the they occupy a substantial proportion of the overall tag sizes.In standard deviation oi for each category CiE R.After each regard to the iceberg query,conventionally,the categories in the query cycle,the categories in R can be further divided into the long tail are unqualified for the population constraint.However, following categories according to the population constraint: due to the small tag size,most of them may not have the Qualified categories O:They refer to the categories whose opportunity to occupy even one singleton slot when contending tag sizes are determined to be over the specified threshold with those major categories during the ensemble sampling. t.fๅ
ไนt and o:โคๅทฒ๏ผthen category C is They remain undetermined without being immediately wiped identified as qualified for the population constraint. out,leading to inefficiency in scanning the other categories.We Unqualified categories U:They refer to the categories rely on the following theorem to quickly wipe out the categories whose tag sizes are determined to be below the specified in the long tail. t-ni threshold t..Ifm<t andoiโคo--ๅฏ๏ผthen category C Theorem 4:For any two categories Ci and Ci that ns.i< is identified as unqualified for the population constraint. ns.j satisfies for each query cycle of ensemble sampling,if Ci Undetermined categories R:The remaining categories to is determined to be unqualified for the population constraint, be verified are undetermined categories. then C;is also unqualified. Therefore,after each query cycle of ensemble sampling, Due to lack of space,we omit the proof of Theorem 4.The those ungualified categories and qualified categories can be detailed proof is given in [17]. immediately wiped out from the ensemble sampling.When According to Theorem 4,after a number of query cycles of at least one category is determined as unqualified,all of the ensemble sampling,if a category Ci is determined unqualified categories in the current group which have not been explored for the population constraint,then for any category Ci which in the singleton slots are wiped out immediately.The query has not appeared once in the singleton slots,ns.j>ns.i=0,cycles are then continuously issued over those undetermined it can be wiped out immediately as an unqualified category. categories in R until R=0.of the tag size ๐ห๐ is above/below the threshold ๐ก, the specified category can be respectively identified as qualified/unqualifed, as both the false positive and false negative probability are less than ๐ฝ; otherwise, the specified category is still undetermined. According to the weighted statistical averaging method, as the number of repeated tests increases, the averaged variance ๐๐ for each category decreases, thus the confidence interval for each category is shrinking. Therefore, after a certain number of query cycles, all categories can be determined as qualified/unqualifed for the population constraint. Tag size for each category C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 Qualified Category Unqualified Category Undetermined Category Threshold t Fig. 1. Histogram with confidence interval annotated Note that when the estimated value ๐ห๐ โซ ๐ก or ๐ห๐ โช ๐ก, the required variance in the population constraint is much larger than the specifications of the accuracy constraint. In this situation, these categories can be quickly identified as qualified/unqualified, and can be wiped out immediately from the ensemble sampling for verifying the population constraint. Thus, those undetermined categories can be further involved in the ensemble sampling with a much smaller tag size, verifying the population constraint in a faster approach. Sometimes the tag sizes of various categories are subject to some skew distributions with a โlong tailโ. The long tail represents those categories each of which occupies a rather small percentage among the total categories, but all together they occupy a substantial proportion of the overall tag sizes. In regard to the iceberg query, conventionally, the categories in the long tail are unqualified for the population constraint. However, due to the small tag size, most of them may not have the opportunity to occupy even one singleton slot when contending with those major categories during the ensemble sampling. They remain undetermined without being immediately wiped out, leading to inefficiency in scanning the other categories. We rely on the following theorem to quickly wipe out the categories in the long tail. Theorem 4: For any two categories ๐ถ๐ and ๐ถ๐ that ๐๐ ,๐ < ๐๐ ,๐ satisfies for each query cycle of ensemble sampling, if ๐ถ๐ is determined to be unqualified for the population constraint, then ๐ถ๐ is also unqualified. Due to lack of space, we omit the proof of Theorem 4. The detailed proof is given in [17]. According to Theorem 4, after a number of query cycles of ensemble sampling, if a category ๐ถ๐ is determined unqualified for the population constraint, then for any category ๐ถ๐ which has not appeared once in the singleton slots, ๐๐ ,๐ > ๐๐ ,๐ = 0, it can be wiped out immediately as an unqualified category. B. Algorithm for the Iceberg Query Problem Algorithm 2 Algorithm for Iceberg Query 1: Initialize ๐
to all categories, set ๐, ๐, ๐ to โ
. Set ๐ = 1. 2: while ๐
โ= โ
do 3: If ๐ = 1, set the initial frame size ๐. 4: Issue a query cycle over the tags, add those relatively major categories into the set ๐. Set ๐โฒ = ๐. 5: while ๐ โ= โ
do 6: Compute the frame size ๐๐ for each category ๐ถ๐ โ ๐ such that the variance ๐๐ = โฃ๐กโ๐ห๐โฃ ฮฆโ1(1โ๐ฝ) . If ๐๐ > ๐ห๐ โ
๐, then remove ๐ถ๐ from ๐ to ๐ . If ๐๐ > ๐๐๐๐ฅ, set ๐๐ = ๐๐๐๐ฅ. Obtain the frame size ๐ as the mid-value among the series of ๐๐. 7: Select all tags in ๐, issue a query cycle with the frame size ๐, compute the estimated tag size ๐ห๐ and the averaged standard deviation ๐๐ for each category ๐ถ๐ โ ๐. Detect the qualified category set ๐ and unqualified category set ๐. Set ๐ = ๐ โ ๐ โ ๐. 8: if ๐ โ= โ
then 9: Wipe out all categories unexplored in the singleton slots from ๐. 10: end if 11: end while 12: ๐ห = ๐ห โ โ ๐ถ๐โ๐โฒ ๐ห๐. ๐
= ๐
โ ๐โฒ , ๐ = ๐ + 1. 13: end while 14: Further verify the categories in ๐ and ๐ for the accuracy constraint. We propose the algorithm for the iceberg query problem in Algorithm 2. Steps 1-4 are quite similar to steps 1-4 in Algo๏ฟพrithm 1, due to lack of space we omit the detailed statements for these steps. Assume that the current set of categories is ๐
, during the query cycles of ensemble sampling, the reader continuously updates the statistical value of ๐ห๐ as well as the standard deviation ๐๐ for each category ๐ถ๐ โ ๐
. After each query cycle, the categories in ๐
can be further divided into the following categories according to the population constraint: โ Qualified categories ๐: They refer to the categories whose tag sizes are determined to be over the specified threshold ๐ก. If ๐ห๐ โฅ ๐ก and ๐๐ โค ๐ห๐โ๐ก ฮฆโ1(1โ๐ฝ) , then category ๐ถ๐ is identified as qualified for the population constraint. โ Unqualified categories ๐: They refer to the categories whose tag sizes are determined to be below the specified threshold ๐ก. If ๐ห๐ < ๐ก and ๐๐ โค ๐กโ๐ห๐ ฮฆโ1(1โ๐ฝ) , then category ๐ถ๐ is identified as unqualified for the population constraint. โ Undetermined categories ๐
: The remaining categories to be verified are undetermined categories. Therefore, after each query cycle of ensemble sampling, those unqualified categories and qualified categories can be immediately wiped out from the ensemble sampling. When at least one category is determined as unqualified, all of the categories in the current group which have not been explored in the singleton slots are wiped out immediately. The query cycles are then continuously issued over those undetermined categories in ๐
until ๐
= โ