XIE ET AL.:EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2429 the kth largest category;however,it is rather difficult to variance of f-t is at most g2.In order to guarantee that compute an exact value of t in the estimation scheme due to Var(t-t)<e2.t2.B,it is essential to ensure g2<e2.t2.B. the randomness in the slotted ALOHA protocol.Therefore, As the ensemble sampling is continuously issued over the according to the problem formulation in Section 4,we categories in R,the standard deviation o:for each category attempt to obtain an estimated value t such that the follow- CiE R is continuously decreasing.Furthermore,as the ing constraints are satisfied: qualified/unqualified categories are continuously wiped Prlm-nil≤e:ni≥1-B accuracy constraint, out,the upper bound tup is continuously decreasing while Pr[lt-t≤e·t≥1-B accuracy constraint of t, the lower bound tlowe is continuously increasing.The width (9) of the range [tow,tup]is continuously decreasing.The while Pi<tn:≥司<B population constraint, loop continues until g2<e2.t2.B.Then,after the estimated Prm≥tn:<司<B threshold is computed,the iceberg query is further applied population constraint. (10) over those categories with the threshold t. Therefore,if the threshold i can be accurately estimated, then the top-k query problem is reduced to the iceberg Algorithm 3.Algorithm for PT-Topk Query Problem query problem.The population constraints(9)and(10)are 1:INPUT:1.Upper bound n on the number of tags n respectively equivalent to the population constraints(4)and 2: 2.Confidence interval width e (5).Then it is essential to quickly determine the value of the 3: 3.The value of k threshold t while satisfying the constraint Pr[lt-t<e.t]> 4 4.Error probability B 5:Initialize R to all categories,set I=1,n=(1-). 1-B.We rely on the following theorem to express the 6:while true do above constraint in the form of the variance. 7 Issue a query cycle to apply ensemble sampling over all Theorem6.The constraint Prt-t≤et刂≥1-B is satisfied categories in R.Compute the statistical average value and standard deviations as n and o;. as long as Var(t-t)≤2.tP·B. 6 Rank the categories in R according to the value of Proof.See Appendix G,available in the online supplemen- +no;for each identified category Ci.Find the k-th largest category Ci,settup=ni+ni.Detect the quali- tal material. fied categories Q with threshold tp Rank the categories in R according to the value of 7.2 Algorithm m-noi for each identified category Ci.Find the k-th According to Theorem 6,we utilize the ensemble sampling to quickly estimate the threshold t.The intuition is as fol- largest category Ci,set tioue=n-nai.Detect the unqualified categories U with threshold t. lows:after the first query cycle of ensemble sampling,we 10: Wipe out the qualified/unqualified categories from R. can estimate a confidence interval [tow,tup of the threshold t R=R-Q-U.Suppose the number of qualified cate- according to the sampled distribution.Then,by wiping out gories in current cycle is q,set k =k-g. those categories which are obviously qualified or unquali- 11: Rank the categories in R according to the value of for fied to be in the top-k list,the width of the confidence inter- each identified category C.Find the k-th largest category val can be quickly reduced.As the approximated threshold Ci,set t=fi.Set g tup tiou.I=1+1. f is selected within the confidence interval,after a number 12: ifg2≤2.B.2then of query cycles of ensemble sampling,when the width is 13: Break the while loop. 14: end if below a certain threshold,the estimated value t can be close 15:end while enough to the exact threshold t. 16:Apply iceberg query with threshold t over the undetermined Based on the above analysis,we propose an algorithm for categories R and the qualified categories Q. the top-k query problem in Algorithm 3.In the beginning,a For example,suppose the value of k is 5,after a query while loop is utilized to quickly identify an approximate cycle of ensemble sampling,the estimated number of tags value t for the threshold t.Suppose that the averaged esti- for various categories is ranked in decreasing order as fol- mated tag size and standard deviation for each category Ci 1ows:{C:120,C2:85,C367,C450,C548,C645,C720,Cs15} are respectively m and oi,if we use p to denote a small con- the threshold tup and tiou are respectively set to 68 and 28 stant value between 0 and 1,let n=-(1-g).Then,given according to the fifth largest category,then the categories a fixed value of p,the 1-p confidence interval for ni is with tag size 120 and 85 can be determined as qualified cate- [ni-noi,+ni.For each iteration,we respectively gories since their tag sizes are above the threshold tup the cat- determine an upper bound tup and a lower bound ttote for egories with tag size 20 and 15 can be also determined as the threshold t,according to the kth largest category in the unqualified categories since their tag sizes are below the current ranking.Then,we respectively wipe out those quali- threshold tou.Therefore,the remaining categories are as fol- fied and unqualified categories according to the upper lows:C3,C4,Cs and Co,we hence need another cycle of bound tup and a lower bound tu.The value of k is then ensemble sampling to further verify the threshold according decreased by the number of qualified categories.In this to the third largest category. way,the threshold t is guaranteed to be within the range [tloue,tup]with a probability of at least 1-p.When p-0, 8 DISCUSSION ON PRACTICAL ISSUES then te [tlou,tup]with the probability close to 100 percent.8.1 Time-Efficiency Moreover,an estimated threshold t is also selected within As mentioned in the problem formulation,the most critical this range.Therefore,let the width g=tup-tiow,then the factor for the histogram collection problem is the timethe kth largest category; however, it is rather difficult to compute an exact value of t in the estimation scheme due to the randomness in the slotted ALOHA protocol. Therefore, according to the problem formulation in Section 4, we attempt to obtain an estimated value bt such that the following constraints are satisfied: Pr½jnbi nij ni 1 b accuracy constraint; Pr½jbt tj t 1 b accuracy constraint of bt; Pr½nbi < btjni bt < b population constraint; (9) Pr½nbi btjni < bt < b population constraint: (10) Therefore, if the threshold bt can be accurately estimated, then the top-k query problem is reduced to the iceberg query problem. The population constraints (9) and (10) are respectively equivalent to the population constraints (4) and (5). Then it is essential to quickly determine the value of the threshold bt while satisfying the constraint Pr½jbt tj t 1 b. We rely on the following theorem to express the above constraint in the form of the variance. Theorem 6. The constraint Pr½jbt tj t 1 b is satisfied as long as Varðbt tÞ 2 t 2 b. Proof. See Appendix G, available in the online supplemental material. tu 7.2 Algorithm According to Theorem 6, we utilize the ensemble sampling to quickly estimate the threshold bt. The intuition is as follows: after the first query cycle of ensemble sampling, we can estimate a confidence interval ½tlow; tup of the threshold t according to the sampled distribution. Then, by wiping out those categories which are obviously qualified or unquali- fied to be in the top-k list, the width of the confidence interval can be quickly reduced. As the approximated threshold bt is selected within the confidence interval, after a number of query cycles of ensemble sampling, when the width is below a certain threshold, the estimated value bt can be close enough to the exact threshold t. Based on the above analysis, we propose an algorithm for the top-k query problem in Algorithm 3. In the beginning, a while loop is utilized to quickly identify an approximate value bt for the threshold t. Suppose that the averaged estimated tag size and standard deviation for each category Ci are respectively nbi and si, if we use p to denote a small constant value between 0 and 1, let h ¼ F1 ð1 p 2Þ. Then, given a fixed value of p, the 1 p confidence interval for ni is ½nbi h si; nbi þ h si. For each iteration, we respectively determine an upper bound tup and a lower bound tlow for the threshold t, according to the kth largest category in the current ranking. Then, we respectively wipe out those quali- fied and unqualified categories according to the upper bound tup and a lower bound tlow. The value of k is then decreased by the number of qualified categories. In this way, the threshold t is guaranteed to be within the range ½tlow; tup with a probability of at least 1 p. When p ! 0, then t 2 ½tlow; tup with the probability close to 100 percent. Moreover, an estimated threshold bt is also selected within this range. Therefore, let the width g ¼ tup tlow, then the variance of bt t is at most g2. In order to guarantee that Varðbt tÞ 2 t 2 b, it is essential to ensure g2 2 t 2 b. As the ensemble sampling is continuously issued over the categories in R, the standard deviation si for each category Ci 2 R is continuously decreasing. Furthermore, as the qualified/unqualified categories are continuously wiped out, the upper bound tup is continuously decreasing while the lower bound tlow is continuously increasing. The width of the range ½tlow; tup is continuously decreasing. The while loop continues until g2 2 t 2 b. Then, after the estimated threshold bt is computed, the iceberg query is further applied over those categories with the threshold bt. Algorithm 3. Algorithm for PT-Topk Query Problem 1: INPUT: 1. Upper bound n on the number of tags n 2: 2. Confidence interval width 3: 3. The value of k 4: 4. Error probability b 5: Initialize R to all categories, set l ¼ 1, h ¼ F1 ð1 p 2Þ. 6: while true do 7: Issue a query cycle to apply ensemble sampling over all categories in R. Compute the statistical average value and standard deviations as nbi and si. 8: Rank the categories in R according to the value of nbi þ h si for each identified category Ci. Find the k-th largest category Ci, set tup ¼ nbi þ h si. Detect the quali- fied categories Q with threshold tup. 9: Rank the categories in R according to the value of nbi h si for each identified category Ci. Find the k-th largest category Ci, set tlow ¼ nbi h si. Detect the unqualified categories U with threshold tlow. 10: Wipe out the qualified/unqualified categories from R. R ¼ R Q U. Suppose the number of qualified categories in current cycle is q, set k ¼ k q. 11: Rank the categories in R according to the value of nbi for each identified category Ci. Find the k-th largest category Ci, set bt ¼ nbi. Set g ¼ tup tlow. l ¼ l þ 1. 12: if g2 2 b bt2 then 13: Break the while loop. 14: end if 15: end while 16: Apply iceberg query with threshold bt over the undetermined categories R and the qualified categories Q. For example, suppose the value of k is 5, after a query cycle of ensemble sampling, the estimated number of tags for various categories is ranked in decreasing order as follows: {C1:120, C2:85, C3:67, C4:50, C5:48, C6:45, C7:20, C8:15 }, the threshold tup and tlow are respectively set to 68 and 28 according to the fifth largest category, then the categories with tag size 120 and 85 can be determined as qualified categories since their tag sizes are above the threshold tup, the categories with tag size 20 and 15 can be also determined as unqualified categories since their tag sizes are below the threshold tlow. Therefore, the remaining categories are as follows: C3; C4; C5 and C6, we hence need another cycle of ensemble sampling to further verify the threshold according to the third largest category. 8 DISCUSSION ON PRACTICAL ISSUES 8.1 Time-Efficiency As mentioned in the problem formulation, the most critical factor for the histogram collection problem is the time XIE ET AL.: EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2429