XIE ET AL.:EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2427 Algorithm 1.Algorithm for Histogram Collection Qualified Category 1: INPUT:1.Upper bound n on the number of tags n 2 Undetermined Category 2.Confidence interval width e 3: 3.Error probability B Unqualified Category 4: Initialize the set R to all tags.Set I=1. 5: while n.≠0Ane≠0do Threshold t 6: If I=1,compute the initial frame size f by solving fe-//=5.Otherwise,compute the frame size f=n. If f>fmar,set f=fmar. r1 7: Set S to Select the tags in R and issue a query cycle with the frame size f,get no,ne,ns.Find the category 由由 with the largest population v in the singleton slots.For C1 C2 C3 C4 C5 C6 C7 C8 C9 C10C11C12 each category which appears in the singleton slot with Fig.3.Histogram with confidence interval annotated population ni>v.(is constant,0<<1),add it to the set S.Estimate the tag size ni for each category the 1-B confidence interval of the tag size n is above/ CiES using the SE estimator.Compute the variances below the threshold t,the specified category can be respec- for each category CiE S according to Eq.(7). tively identified as qualified/unqualified,as both the false 8: Rank the categories in S in non-increasing order of the positive and false negative probabilities are less than B;oth- tag size.Divide the set S into groups S1,S2....,Sa erwise,the specified category is still undetermined.Accord- according to the dynamic programming-based method. ing to the weighted statistical averaging method,as the 9 for each S∈S(1≤j≤d)do number of repeated tests increases,the averaged variance oi 10: For each category Ci Sj,compute the frame size fi fom成by solving5≤(-时2,3 for each category decreases,thus the confidence interval for each category is shrinking.Therefore,after a certain number 11: Obtain the maximum frame size f=maxc,es,fi.If of query cycles,all categories can be determined as quali- f<fmar,select all categories in Sj,and issue a query fied/unqualified for the population constraint. cycle with frame size f.Otherwise,select all catego- Note that when the estimated valuet or nt,the ries in Si,and issue r query cycles with the frame required variance in the population constraint is much larger size fmar.Wipe out the categories with satisfied than the specifications of the accuracy constraint.In this situ- accuracy after each query cycle. ation,these categories can be quickly identified as qualified/ 12: Estimate the tag size m for each category CiESj, unqualified,and can be wiped out immediately from the illustrate them in the histogram. 13: end for ensemble sampling for verifying the population constraint. 14: 元=i-∑cesm.R=R-S.S=0.1=l+1. Thus,those undetermined categories can be further involved 15: end while in the ensemble sampling with a much smaller tag size,veri- fying the population constraint in a faster approach. Sometimes the tag sizes of various categories are subject The first constraint is the accuracy constraint,while the to some skew distributions with a "long tail".The long tail second and third constraints are the population constraints. represents those categories each of which occupies a rather In regard to the accuracy constraint,we have demonstrated small percentage among the total categories,but all together in Theorem 2 that it can be expressed in the form of the vari- they occupy a substantial proportion of the overall tag sizes. ance constraint.In regard to the population constraint,the In regard to the iceberg query,conventionally the categories second constraint infers that,in the results of the iceberg in the long tail are unqualified for the population constraint query,the false negative probability should be no more However,due to the small tag size,most of them may not than B,while the third constraint infers that the false posi- have the opportunity to occupy even one singleton slot tive probability should be no more than B.We rely on the when contending with those major categories during the following theorem to express the population constraint in ensemble sampling.They remain undetermined without another equivalent form. being immediately wiped out,leading to inefficiency in Theorem 4.The two population constraints,Prn<tni >t< scanning the other categories.We rely on the following the- B and Prniz tlni<t<B,are satisfied as long as the stan- orem to quickly wipe out the categories in the long tail. dard devation ofthe s,≤n)is Theorem 5.For any two categories Ci and Cj that ns.i<nsj sat- the cumulative distribution function of the standard normal isfies for each query cycle of ensemble sampling,if Cj is deter- distribution. mined to be unqualified for the population constraint,then C is also unqualified. Proof.See Appendix E,available in the online supplemental material. ◇ Proof.See Appendix F,available in the online supplemental material. 0 In order to better illustrate the inherent principle,Fig.3 shows an example of the histogram with the 1-B confi- According to Theorem 5,after a number of query cycles of dence interval annotated,the y-axis denotes the estimated ensemble sampling,if a category C;is determined unqualified tag size for each category.In order to accurately verify the for the population constraint,then for any category Ciwhich population constraint,it is required that the variance of the has not appeared once in the singleton slots,n>n=0,it estimated tag size should be small enough.Note that when can be wiped out immediately as an unqualified category.Algorithm 1. Algorithm for Histogram Collection 1: INPUT: 1. Upper bound n on the number of tags n 2: 2. Confidence interval width 3: 3. Error probability b 4: Initialize the set R to all tags. Set l ¼ 1. 5: while ns 6¼ 0 ^ nc 6¼ 0 do 6: If l ¼ 1, compute the initial frame size f by solving fen=f ¼ 5. Otherwise, compute the frame size f ¼ nb. If f>fmax, set f ¼ fmax. 7: Set S to ?. Select the tags in R and issue a query cycle with the frame size f, get n0; nc; ns. Find the category with the largest population y in the singleton slots. For each category which appears in the singleton slot with population ns;i > y uðu is constant, 0 < u < 1Þ, add it to the set S. Estimate the tag size ni for each category Ci 2 S using the SE estimator. Compute the variances d0 i for each category Ci 2 S according to Eq. (7). 8: Rank the categories in S in non-increasing order of the tag size. Divide the set S into groups S1; S2; ... ; Sd according to the dynamic programming-based method. 9: for each Sj 2 Sð1 j dÞ do 10: For each category Ci 2 Sj, compute the frame size fi from di by solving 1 1=d0 i þ1=di ð Z1b=2 Þ 2 nbi 2. 11: Obtain the maximum frame size f ¼ maxCi2Sjfi. If f<fmax, select all categories in Sj, and issue a query cycle with frame size f. Otherwise, select all categories in Sj, and issue r query cycles with the frame size fmax. Wipe out the categories with satisfied accuracy after each query cycle. 12: Estimate the tag size nbi for each category Ci 2 Sj, illustrate them in the histogram. 13: end for 14: nb ¼ nb P Ci2S nbi. R ¼ R S. S ¼ ?. l ¼ l þ 1. 15: end while The first constraint is the accuracy constraint, while the second and third constraints are the population constraints. In regard to the accuracy constraint, we have demonstrated in Theorem 2 that it can be expressed in the form of the variance constraint. In regard to the population constraint, the second constraint infers that, in the results of the iceberg query, the false negative probability should be no more than b, while the third constraint infers that the false positive probability should be no more than b. We rely on the following theorem to express the population constraint in another equivalent form. Theorem 4. The two population constraints, Pr½nbi < tjni t < b and Pr½nbi tjni < t < b, are satisfied as long as the standard deviation of the averaged tag size si jnitj F1ð1bÞ , FðxÞ is the cumulative distribution function of the standard normal distribution. Proof. See Appendix E, available in the online supplemental material. tu In order to better illustrate the inherent principle, Fig. 3 shows an example of the histogram with the 1 b confi- dence interval annotated, the y-axis denotes the estimated tag size for each category. In order to accurately verify the population constraint, it is required that the variance of the estimated tag size should be small enough. Note that when the 1 b confidence interval of the tag size nbi is above/ below the threshold t, the specified category can be respectively identified as qualified/unqualified, as both the false positive and false negative probabilities are less than b; otherwise, the specified category is still undetermined. According to the weighted statistical averaging method, as the number of repeated tests increases, the averaged variance si for each category decreases, thus the confidence interval for each category is shrinking. Therefore, after a certain number of query cycles, all categories can be determined as quali- fied/unqualified for the population constraint. Note that when the estimated value nbi t or nbi t, the required variance in the population constraint is much larger than the specifications of the accuracy constraint. In this situation, these categories can be quickly identified as qualified/ unqualified, and can be wiped out immediately from the ensemble sampling for verifying the population constraint. Thus, those undetermined categories can be further involved in the ensemble sampling with a much smaller tag size, verifying the population constraint in a faster approach. Sometimes the tag sizes of various categories are subject to some skew distributions with a “long tail”. The long tail represents those categories each of which occupies a rather small percentage among the total categories, but all together they occupy a substantial proportion of the overall tag sizes. In regard to the iceberg query, conventionally the categories in the long tail are unqualified for the population constraint. However, due to the small tag size, most of them may not have the opportunity to occupy even one singleton slot when contending with those major categories during the ensemble sampling. They remain undetermined without being immediately wiped out, leading to inefficiency in scanning the other categories. We rely on the following theorem to quickly wipe out the categories in the long tail. Theorem 5. For any two categories Ci and Cj that ns;i < ns;j satisfies for each query cycle of ensemble sampling, if Cj is determined to be unqualified for the population constraint, then Ci is also unqualified. Proof. See Appendix F, available in the online supplemental material. tu According to Theorem 5, after a number of query cycles of ensemble sampling, if a category Cj is determined unqualified for the population constraint, then for any category Ci which has not appeared once in the singleton slots, ns;j > ns;i ¼ 0, it can be wiped out immediately as an unqualified category. Fig. 3. Histogram with confidence interval annotated. XIE ET AL.: EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2427