2424 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL.26.NO.9.SEPTEMBER 2015 to EPCglobal C1G2 standards,i.e.,the slotted ALOHA- algorithm to work,we only require the tags to comply based anti-collision scheme [4],[6]is used in the system with the current C1G2 standards:each tag has a unique model.The objective is to collect the histogram over RFID 96-bit ID in its EPC memory,where the first s binary bits tags according to some categorized metric,e.g,the type of are regarded as the category ID(1<s 96).According to merchandise,while the present set of category IDs cannot the C1G2 standard,the reader is able to select the tags in be predicted in advance.As we aim at a dynamic environ-a specified category by sending a Select command with an ment where the tags may frequently enter and leave the s-bit mask in the category ID field.If multiple categories scanning area,a time-efficient strategy must be proposed.need to be selected,the reader can provide multiple bit Therefore,the specified accuracy can be relaxed in order to masks in the Select command. quickly collect the histogram.Assume that the overall tag size is n,there exist m categories C=[C1,C2....,Cm},and 5 USE ENSEMBLE SAMPLING TO the actual tag size for each category is n,n2,...,nm. COLLECT HISTOGRAMS In the Basic Histogram Collection,the RFID system needs to collect the histogram for all categories.Due to the inherent When collecting the histograms over a large number of cate- inaccurate property for RFID systems,users can specify the gories,the objective is to minimize the overall scanning accuracy requirement for the histogram collection.Suppose time while the corresponding accuracy/population con- the estimated tag size for category Ci(1<i<m)is ni,then straints are satisfied.Two straightforward solutions are the following accuracy constraint should be satisfied: summarized as follows:(1)Basic Tag Identification:The histo- gram is collected by uniquely identifying each tag from the rm-nl≤e·n≥1-B accuracy constraint. (1) massive tag set and putting it into the corresponding cate- gories,thus the accuracy is 100 percent,and (2)Separate The accuracy constraint illustrates that,given the exact tag Counting:As the category IDs cannot be predicted in size ni for a specified category,the estimated tag sizen advance,the tree traversal method [31]is used to obtain the should be in an confidence interval of width 2e.ni,i.e., with probability greater than 1-B.For category IDs.Then,the reader sends a Select command to the tags,and it activates the tags in the specified category by example,if e=0.1,B=0.05,then in regard to a category providing a bit mask over the category ID in the command. with tag size ni=100,the estimated tag size ni should be According to the replies from the specified tags,the estima- within the range [90,110]with probability greater than tors such as [9],[24],[32]can be used to estimate the tag size 95 percent. for each category.As the rough tag size for each category In the Iceberg Query Problem,only those categories with a cannot be predicted in advance,a fixed initial frame size is tag size over a specified threshold t are essential to be illus- used for each category. trated in the histogram,while the accuracy requirement is Both the above two solutions are not time-efficient.In satisfied.As the exact tag size n;for category Ci is unknown, regard to the basic tag identification,uniquely identifying then,given the estimated value of tag size m,it is possible to each tag in the massive set is not scalable,for as the tag size have false negative error and false positive error in verifying grows into a huge number,the scanning time can be an the population constraint.Therefore,it is essential to guar- unacceptable value.In regard to the separated counting,the antee that the false negative/positive rate is below B,that is: reader needs to scan each category with at least one query cycle,even if the category is a minor category,which is not Pr[i<tn:≥t<B, (2) necessarily addressed in the iceberg query and the top-k query.As the number of categories m can be fairly large, Pr≥tn贴<<B. (3) e.g.,in the order of hundreds,the Select command and the fixed initial frame size for each category,as well as the In the Top-k Query Problem,we use the definition of the inter-cycle overhead among a large number of query cycles, probabilistic threshold top-k query (PT-Topk query),i.e.,in make the overall scanning time rather large. regard to the tag size,only the set of categories where each Therefore,we consider an ensemble sampling-based esti- takes a probability of at least 1-B to be in the top-k list are mation scheme as follows:select a certain number of catego- illustrated in the histogram,while the accuracy requirement ries and issue a query cycle,obtain the empty/singleton/ is satisfied.Much like the iceberg query problem,as the collision slots,and then estimate the tag size for each of the exact tag size ni for category Ci is unknown,then,given the categories according to the sampling in the singleton slots. estimated value of tag size ni,it is possible to have false neg- In this way,the ensemble sampling is more preferred than ative error and false positive error in verifying the popula- the separate counting in terms of reading performance. tion constraint,the following constraint must be satisfied: Since more tags are involved in one query cycle,more slots Pr[C;is regarded out of top-k list C;top-k list]<B,(4)amortize the cost of inter-cycle overhead,the Select com- mand,as well as the fixed initial frame size.Thus,the over- PrC,is regarded in top-k listC top-k list]<B.(5)all scanning time can be greatly reduced. In this paper,we aim to propose a series of novel solu-5.1 The Estimator ES tions to tackle the above problems while satisfying the In the slotted ALOHA-based protocol,besides the empty slots following properties:(1)Time-efficient.(2)Simple for the and the collision slots,the singleton slots can be obtained.In tag side in the protocol.(3)Complies with the EPCglobal the ensemble sampling-based estimation,according to the C1G2 standards.Therefore,in order for the proposed observed statistics of the empty/singleton/collision slots,weto EPCglobal C1G2 standards, i.e., the slotted ALOHAbased anti-collision scheme [4], [6] is used in the system model. The objective is to collect the histogram over RFID tags according to some categorized metric, e.g, the type of merchandise, while the present set of category IDs cannot be predicted in advance. As we aim at a dynamic environment where the tags may frequently enter and leave the scanning area, a time-efficient strategy must be proposed. Therefore, the specified accuracy can be relaxed in order to quickly collect the histogram. Assume that the overall tag size is n, there exist m categories C ¼ fC1; C2; ... ; Cmg, and the actual tag size for each category is n1; n2; ... ; nm. In the Basic Histogram Collection, the RFID system needs to collect the histogram for all categories. Due to the inherent inaccurate property for RFID systems, users can specify the accuracy requirement for the histogram collection. Suppose the estimated tag size for category Cið1 i mÞ is nbi, then the following accuracy constraint should be satisfied: Pr½jnbi nij ni 1 b accuracy constraint: (1) The accuracy constraint illustrates that, given the exact tag size ni for a specified category, the estimated tag size nbi should be in an confidence interval of width 2 ni, i.e., nbi ni 2 ½1 ; 1 þ with probability greater than 1 b. For example, if ¼ 0:1; b ¼ 0:05, then in regard to a category with tag size ni ¼ 100, the estimated tag size nbi should be within the range ½90;110 with probability greater than 95 percent. In the Iceberg Query Problem, only those categories with a tag size over a specified threshold t are essential to be illustrated in the histogram, while the accuracy requirement is satisfied. As the exact tag size ni for category Ci is unknown, then, given the estimated value of tag size nbi, it is possible to have false negative error and false positive error in verifying the population constraint. Therefore, it is essential to guarantee that the false negative/positive rate is below b, that is: Pr½nbi < tjni t < b; (2) Pr½nbi tjni < t < b: (3) In the Top-k Query Problem, we use the definition of the probabilistic threshold top-k query (PT-Topk query), i.e., in regard to the tag size, only the set of categories where each takes a probability of at least 1 b to be in the top-k list are illustrated in the histogram, while the accuracy requirement is satisfied. Much like the iceberg query problem, as the exact tag size ni for category Ci is unknown, then, given the estimated value of tag size nbi, it is possible to have false negative error and false positive error in verifying the population constraint, the following constraint must be satisfied: Pr½Ci is regarded out of top-k listj Ci 2 top-k list < b; (4) Pr½Ci is regarded in top-k listj Ci 2= top-k list < b: (5) In this paper, we aim to propose a series of novel solutions to tackle the above problems while satisfying the following properties: (1) Time-efficient. (2) Simple for the tag side in the protocol. (3) Complies with the EPCglobal C1G2 standards. Therefore, in order for the proposed algorithm to work, we only require the tags to comply with the current C1G2 standards: each tag has a unique 96-bit ID in its EPC memory, where the first s binary bits are regarded as the category ID (1 <s< 96). According to the C1G2 standard, the reader is able to select the tags in a specified category by sending a Select command with an s-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command. 5 USE ENSEMBLE SAMPLING TO COLLECT HISTOGRAMS When collecting the histograms over a large number of categories, the objective is to minimize the overall scanning time while the corresponding accuracy/population constraints are satisfied. Two straightforward solutions are summarized as follows: (1) Basic Tag Identification: The histogram is collected by uniquely identifying each tag from the massive tag set and putting it into the corresponding categories, thus the accuracy is 100 percent, and (2) Separate Counting: As the category IDs cannot be predicted in advance, the tree traversal method [31] is used to obtain the category IDs. Then, the reader sends a Select command to the tags, and it activates the tags in the specified category by providing a bit mask over the category ID in the command. According to the replies from the specified tags, the estimators such as [9], [24], [32] can be used to estimate the tag size for each category. As the rough tag size for each category cannot be predicted in advance, a fixed initial frame size is used for each category. Both the above two solutions are not time-efficient. In regard to the basic tag identification, uniquely identifying each tag in the massive set is not scalable, for as the tag size grows into a huge number, the scanning time can be an unacceptable value. In regard to the separated counting, the reader needs to scan each category with at least one query cycle, even if the category is a minor category, which is not necessarily addressed in the iceberg query and the top-k query. As the number of categories m can be fairly large, e.g., in the order of hundreds, the Select command and the fixed initial frame size for each category, as well as the inter-cycle overhead among a large number of query cycles, make the overall scanning time rather large. Therefore, we consider an ensemble sampling-based estimation scheme as follows: select a certain number of categories and issue a query cycle, obtain the empty/singleton/ collision slots, and then estimate the tag size for each of the categories according to the sampling in the singleton slots. In this way, the ensemble sampling is more preferred than the separate counting in terms of reading performance. Since more tags are involved in one query cycle, more slots amortize the cost of inter-cycle overhead, the Select command, as well as the fixed initial frame size. Thus, the overall scanning time can be greatly reduced. 5.1 The Estimator ES In the slotted ALOHA-based protocol, besides the empty slots and the collision slots, the singleton slots can be obtained. In the ensemble sampling-based estimation, according to the observed statistics of the empty/singleton/collision slots, we 2424 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015