Efficiently Collecting Histograms Over RFID Tags

Lei Xie†, Hao Han‡, Qun Li‡, Jie Wu§, and Sanglu Lu†
†State Key Laboratory for Novel Software Technology, Nanjing University, China
‡Department of Computer Science, College of William and Mary, USA
§Department of Computer Information and Sciences, Temple University, USA
Email: †lxie@nju.edu.cn, ‡{hhan,liqun}@cs.wm.edu, §jiewu@temple.edu, †sanglu@nju.edu.cn

Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications. In this paper we consider the efficient collection of histograms from a massive number of RFID tags, without the need to read all tag data. We first consider the problem of basic histogram collection and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-$k$ query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. Experimental results indicate that our ensemble sampling-based solutions achieve a much better performance than the basic estimation/identification schemes.

Index Terms—Algorithms; RFID; Time efficiency; Histogram

I. INTRODUCTION

With the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags and densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10 m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1-3]. Many identification protocols have been proposed to uniquely identify the tags through anti-collision schemes. However, in a number of applications, only some useful statistical information needs to be collected, such as the overall tag size [2, 4], the popular categories [5], and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion.

In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting histograms can be used to illustrate the tag population belonging to each category, and to determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries, as well as for preprocessing and mining association rules in data mining [6]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications.

When dealing with a large-scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags and can be on the order of several minutes. Moreover, in most applications the tags are frequently moving into and out of the effective scanning area.
In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window on the order of several seconds. Therefore, we seek to propose an estimation scheme that quickly counts the tag size of each category while achieving the accuracy requirement.

In most cases, the tag sizes of the various categories are subject to some skewed distribution with a "long tail", such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies a rather small percentage among the total categories. When handling massive tag sets on the order of several thousands, the overall number of categories in the long tail can be in the hundreds. Therefore, if the tag size of each category is estimated separately, a large number of query cycles and slots are required. Besides, in applications like the iceberg query and the top-$k$ query, only the major categories need to be addressed; in this situation, the separate estimation approach wastes a lot of scanning time over the minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. We make the following contributions.

1) To the best of our knowledge, we are the first to consider the problem of collecting histograms over RFID tags, which is a fundamental premise for queries and analysis in RFID-based applications. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag sizes of a number of categories. This framework is very flexible and compatible with current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category.

2) In order to tackle histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraints, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution
to tackle the top-$k$ query problem. We use ensemble sampling to quickly estimate the threshold corresponding to the $k$-th largest category, and reduce the problem to the iceberg query problem.

3) While achieving time efficiency, our solutions are completely compatible with current industry standards, i.e., the EPCglobal C1G2 standards, and do not require any modification to the tags.

II. RELATED WORK

In RFID systems, a reader needs to receive data from multiple tags, while the tags are unable to self-regulate their radio transmissions to avoid collisions; hence, a series of slotted ALOHA-based anti-collision protocols have been designed to efficiently resolve collisions in RFID systems. In order to deal with the collision problems in multi-reader RFID systems, scheduling protocols for reader activation are explored in [7, 8]. Recently, a number of polling protocols [9, 10] have been proposed, aiming to collect information from battery-powered active tags in an energy-efficient manner.

Recent research has focused on the collection of statistical information over RFID tags [2, 4, 5, 11-13]. These works mainly consider the problem of estimating the number of tags without collecting the tag IDs. In [4], Kodialam et al. provide very fast and reliable estimation mechanisms for the tag quantity in a practical manner. In [2], Shahzad et al. propose a new scheme for estimating the tag population size, called Average Run based Tag estimation. In [5], Sheng et al. consider the problem of identifying the popular categories of RFID tags out of a large collection of tags, while the set of category IDs is supposed to be known. In this paper, our goal is to collect the histograms for all categories over RFID tags in a time-efficient manner, without any a priori knowledge of the categories, including the number of categories and the category IDs.

III. PRELIMINARY

A. The framed slotted ALOHA protocol

In the Class 1 Gen 2 standard, the RFID system leverages the framed slotted ALOHA protocol to resolve collisions during tag identification. When a reader wishes to read a set of tags, it first powers up and transmits a continuous wave to energize the tags. It then initiates a series of frames, varying the number of slots in each frame to best accommodate the number of tags. Each frame has a number of slots, and each active tag replies in a randomly selected slot per frame. After all tags are read, the reader powers down. We refer to the series of frames between power-down periods as a Query Cycle. Note that, within each frame, tags may choose the same slot, which causes multiple tags to reply during a slot. Therefore, within each frame there exist three kinds of slots: (1) the empty slot, where no tag replies; (2) the singleton slot, where exactly one tag replies; and (3) the collision slot, where multiple tags reply.

In regard to the tag ID, each tag has a unique 96-bit ID in its EPC memory, where the first $s$ binary bits can be regarded as the category ID ($1 < s < 96$). According to the C1G2 standard, for each Query Cycle, the reader is able to select the tags in a specified category by sending a Select command with an $s$-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command.
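To make this selection mechanism concrete, the sketch below (our illustration, not part of the original paper; the helper names are hypothetical, and the C1G2 Select semantics are reduced to an equality match on the category-ID field) treats the first $s$ bits of a 96-bit EPC as the category ID:

```python
def category_id(epc: int, s: int) -> int:
    # The first (most significant) s bits of the 96-bit EPC
    # serve as the category ID, per the convention above.
    return epc >> (96 - s)

def select(tags, masks, s: int):
    # Simplified model of a C1G2 Select round: a tag participates in the
    # coming query cycle iff its category ID matches one of the bit masks.
    wanted = set(masks)
    return [t for t in tags if category_id(t, s) in wanted]

# Two tags of category 0x1A2B and one of category 0x1A2C (s = 16).
tags = [(0x1A2B << 80) | 1, (0x1A2B << 80) | 2, (0x1A2C << 80) | 7]
assert len(select(tags, [0x1A2B], s=16)) == 2
```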
B. The Impact of the Inter-cycle Overhead

The MAC protocol for the C1G2 system is based on slotted ALOHA. In order to accurately estimate the size of a specified set of tags, conventionally, the reader should issue multiple query cycles over the same set of tags and take the average of the estimates. The inter-cycle overhead consists of the time between cycles, when the reader is powered down, and the carrier time used to power the tags before beginning communication. According to the experimental results in [14], which were conducted in realistic settings, these times are 40 ms and 3 ms, respectively, while the average time interval per slot is 1-2 ms. We have further measured the time interval for various slots with the USRP N210 platform; we found that the average interval for empty slots is 1.6 ms per slot, and the average interval for singleton/collision slots is 5.1 ms per slot. Note that if the powered-down interval is not long enough, it is possible that some surrounding tags will maintain the former state of the inventoried flag with their local residual power, which causes them to keep silent in the upcoming query cycle. Therefore, the inter-cycle duration must be taken into account when considering the overall reading performance. It is obvious that reading a larger number of tags per cycle amortizes the cost of the inter-cycle overhead, resulting in a lower per-tag reading time, while for small tag sets the inter-cycle overhead is significant.
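To see why this amortization matters, the following back-of-the-envelope sketch plugs the timings quoted above (40 ms + 3 ms of per-cycle overhead, 1.6 ms per empty slot, 5.1 ms per singleton/collision slot) into a simplified single-cycle model; the uniform slot-choice model and the example frame sizes are our assumptions for illustration:

```python
TAU_CYCLE = 40e-3 + 3e-3              # powered-down gap + carrier time, per cycle
TAU_EMPTY, TAU_BUSY = 1.6e-3, 5.1e-3  # per-slot intervals measured above

def per_tag_time(n_tags: int, frame: int) -> float:
    # Expected per-tag reading time of one query cycle, assuming every tag
    # picks one of `frame` slots uniformly at random (simplified model).
    p_empty = (1.0 - 1.0 / frame) ** n_tags
    slot_time = frame * (p_empty * TAU_EMPTY + (1.0 - p_empty) * TAU_BUSY)
    return (TAU_CYCLE + slot_time) / n_tags

print(per_tag_time(10, 16))      # ~9.5 ms per tag: the 43 ms overhead dominates
print(per_tag_time(1000, 1024))  # ~3.9 ms per tag: the overhead is amortized
```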
IV. PROBLEM FORMULATION

Suppose there are a large number of tags in the effective scanning area of the RFID reader, and the RFID system conforms to the EPCglobal C1G2 standards, i.e., the slotted ALOHA-based anti-collision scheme is used in the system model. The objective is to collect the histogram over the RFID tags according to some categorization metric, e.g., the type of merchandise, while the present set of category IDs cannot be predicted in advance. As we aim at a dynamic environment where the tags may frequently enter and leave the scanning area, a time-efficient strategy must be proposed; therefore, the specified accuracy can be relaxed in order to quickly collect the histogram. Assume that the overall tag size is $n$, there exist $m$ categories $C = \{C_1, C_2, ..., C_m\}$, and the actual tag size for each category is $n_1, n_2, ..., n_m$.

In the Basic Histogram Collection, the RFID system needs to collect the histogram for all categories. Due to the inherently inaccurate nature of RFID systems, users can specify an accuracy requirement for the histogram collection. Suppose the estimated tag size for category $C_i$ ($1 \le i \le m$) is $\hat{n}_i$; then the following accuracy constraint should be satisfied:

  $\Pr[\,|\hat{n}_i - n_i| \le \epsilon \cdot n_i\,] \ge 1 - \beta$  (accuracy constraint).  (1)

The accuracy constraint states that, given the exact tag size $n_i$ for a specified category, the estimated tag size $\hat{n}_i$ should be in a confidence interval of width $2\epsilon \cdot n_i$, i.e., $\hat{n}_i / n_i \in [1-\epsilon, 1+\epsilon]$, with a probability greater than $1 - \beta$. For example, if $\epsilon = 0.1$ and $\beta = 0.05$, then for a category with tag size $n_i = 100$, the estimated tag size $\hat{n}_i$ should be within the range $[90, 110]$ with a probability greater than 95%.

In the Iceberg Query Problem, only those categories with a tag size over a specified threshold $t$ need to be illustrated in the histogram, while the accuracy requirement is satisfied. As the exact tag size $n_i$ for category $C_i$ is unknown, then, given the estimated tag size $\hat{n}_i$, it is possible to make false negative and false positive errors when verifying the population constraint. Therefore, it is essential to guarantee that the false negative/positive rates are below $\beta$, that is:

  $\Pr[\hat{n}_i < t \mid n_i \ge t] < \beta$,  (2)
  $\Pr[\hat{n}_i \ge t \mid n_i < t] < \beta$.  (3)

In the Top-$k$ Query Problem, we use the definition of the probabilistic threshold top-$k$ query (PT-Topk query), i.e., in regard to the tag size, only the set of categories each of which has a probability of at least $1 - \beta$ of being in the top-$k$ list is illustrated in the histogram, while the accuracy requirement is satisfied. Much like the iceberg query problem, as the exact tag size $n_i$ for category $C_i$ is unknown, then, given the estimated tag size $\hat{n}_i$, it is possible to make false negative and false positive errors when verifying the population constraint; the following constraints must be satisfied:

  $\Pr[C_i \text{ is regarded as out of the top-}k\text{ list} \mid C_i \in \text{top-}k\text{ list}] < \beta$,  (4)
  $\Pr[C_i \text{ is regarded as in the top-}k\text{ list} \mid C_i \notin \text{top-}k\text{ list}] < \beta$.  (5)

In this paper, we aim to propose a series of novel solutions to tackle the above problems while satisfying the following properties: (1) time-efficient; (2) simple on the tag side of the protocol; and (3) compliant with the EPCglobal C1G2 standards.

V. USE ENSEMBLE SAMPLING TO COLLECT HISTOGRAMS

When collecting the histograms over a large number of categories, the objective is to minimize the overall scanning time while the corresponding accuracy/population constraints are satisfied. Two straightforward solutions are summarized as follows. (1) Basic Tag Identification: the histogram is collected by uniquely identifying each tag from the massive tag set and putting it into the corresponding category; thus the accuracy is 100%. (2) Separate Counting: the histogram is collected by estimating the number of tags in each category one by one. Specifically, for each category, the reader broadcasts a Select command to activate the tags in the specified category. According to the replies from the specified tags, estimators like [4][12][15] can be used to estimate the tag size for each category. As the rough tag size for each category cannot be predicted, a fixed initial frame size is used for each category.

Neither of the above two solutions is time-efficient. In regard to basic tag identification, uniquely identifying each tag in the massive set is not scalable: as the tag size grows into a huge number, the scanning time becomes unacceptable. In regard to separate counting, the reader needs to scan each category with at least one query cycle, even if the category is a minor category, which is not necessarily addressed in the iceberg query or the top-$k$ query. As the number of categories $m$ can be fairly large, e.g., on the order of hundreds, the Select command and the fixed initial frame size for each category, as well as the inter-cycle overhead among a large number of query cycles, make the overall scanning time rather large.
Therefore, we consider an ensemble sampling-based estimation scheme as follows: select a certain number of categories and issue a query cycle, obtain the empty/singleton/collision slots, and then estimate the tag size for each of the categories according to the sampling in the singleton slots. In this way, ensemble sampling is preferable to separate counting in terms of reading performance: since more tags are involved in one query cycle, more slots amortize the cost of the inter-cycle overhead, the Select command, and the fixed initial frame size. Thus, the overall scanning time can be greatly reduced.

A. The Estimator SE

In the slotted ALOHA-based protocol, besides the empty slots and the collision slots, the singleton slots can be obtained. In the ensemble sampling-based estimation, according to the observed statistics of the empty/singleton/collision slots, we can use the estimators in [4][12][15], etc., to estimate the overall tag size. Then, according to the response in each singleton slot, the category ID is obtained from the first $s$ bits of the tag ID. Based on the sampling from the singleton slots, the tag size for each category can be estimated. The reason is as follows.

Assume that there exist $m$ categories $C_1, C_2, ..., C_m$, the overall tag size is $n$, and the tag size for each category is $n_1, n_2, ..., n_m$. We define an indicator variable $X_{i,j}$ to denote whether exactly one tag of category $C_i$ selects slot $j$ inside the frame of size $f$: we set $X_{i,j} = 1$ if exactly one tag of category $C_i$ selects slot $j$, and $X_{i,j} = 0$ otherwise. Moreover, we use $\Pr[X_{i,j} = 1]$ to denote the probability that exactly one tag of category $C_i$ selects slot $j$; then,

  $\Pr[X_{i,j} = 1] = \frac{1}{f} \cdot (1 - \frac{1}{f})^{n-1} \cdot n_i$.

If we use $n_{s,i}$ to denote the number of singleton slots selected by tags of category $C_i$, so that $n_{s,i} = \sum_{j=1}^{f} X_{i,j}$, then the expected value is

  $E(n_{s,i}) = \sum_{j=1}^{f} \Pr[X_{i,j} = 1] = (1 - \frac{1}{f})^{n-1} \cdot n_i$.

Furthermore, let $n_s$ denote the number of singleton slots, with expected value $E(n_s) = (1 - \frac{1}{f})^{n-1} \cdot n$. Then, $\frac{E(n_{s,i})}{E(n_s)} = \frac{n_i}{n}$. Thus we can approximate the tag size of category $C_i$ as follows:

  $\hat{n}_i = \frac{n_{s,i}}{n_s} \cdot \hat{n}$.  (6)

Here, $\hat{n}$ is the estimated value of the overall tag size. Let $\hat{\alpha}_i = \frac{n_{s,i}}{n_s}$; then $\hat{n}_i = \hat{\alpha}_i \cdot \hat{n}$.
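Equation (6) translates directly into a few lines of code. In the sketch below (an illustration; the function name is ours), the overall estimate $\hat{n}$ is produced by any total-size estimator such as those in [4][12][15] and is simply passed in:

```python
from collections import Counter

def se_estimate(singleton_categories, n_hat):
    # Estimator SE, Eq. (6): n̂_i = (n_{s,i} / n_s) · n̂, where n_{s,i} is the
    # number of singleton slots carrying category C_i and n_s = Σ_i n_{s,i}.
    n_s = len(singleton_categories)
    counts = Counter(singleton_categories)  # n_{s,i} for each observed category
    return {c: n_si / n_s * n_hat for c, n_si in counts.items()}

# 8 singleton slots whose decoded category IDs are listed below, n̂ = 200:
print(se_estimate(["A", "A", "B", "A", "B", "A", "A", "B"], 200))
# -> {'A': 125.0, 'B': 75.0}
```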
B. Accuracy Analysis

1) Accuracy of the SE Estimator: In the ensemble sampling-based estimation, since estimators such as [4][12][15] can be utilized for estimating the overall number of tags, we use $\delta$ to denote the variance of $\hat{n}$. We rely on the following theorem to illustrate the accuracy of the estimator SE.

Theorem 1: Let $\delta_i$ represent the variance of the estimator SE $\hat{n}_i$, and let the load factor be $\rho = \frac{n}{f}$. Then,

  $\delta_i = \frac{n_i}{n} \cdot \frac{e^{\rho} + n_i - 1}{e^{\rho} + n - 1} \cdot (\delta + n^2) - n_i^2$.  (7)

Proof: See Appendix A.

2) Reducing the Variance through Repeated Tests: As the frame size for each query cycle has a maximum value, the tag size estimated from the ensemble sampling of only one query cycle may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues $l$ query cycles over the same set of categories. In regard to a specified category $C_i$, by utilizing the weighted statistical averaging method, the averaged tag size is $\hat{n}_i = \sum_{k=1}^{l} \omega_k \cdot \hat{n}_{i,k}$, where $\omega_k = \frac{1/\delta_{i,k}}{\sum_{k=1}^{l} 1/\delta_{i,k}}$, and $\hat{n}_{i,k}$ and $\delta_{i,k}$ respectively denote the estimated tag size and the variance for each cycle $k$. Then, the variance of $\hat{n}_i$ is $\sigma_i^2 = \frac{1}{\sum_{k=1}^{l} 1/\delta_{i,k}}$.

Therefore, according to the accuracy constraint in the problem formulation, we rely on the following theorem to express this constraint in the form of the variance.

Theorem 2: Suppose the variance of the averaged tag size $\hat{n}_i$ is $\sigma_i^2$. The accuracy constraint is satisfied for a specified category $C_i$ as long as $\sigma_i^2 \le (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i^2$, where $Z_{1-\beta/2}$ is the $1 - \frac{\beta}{2}$ percentile of the standard normal distribution.

Proof: See Appendix B.

According to Theorem 2, we can verify whether the accuracy constraint is satisfied for each category by directly checking the variance against the threshold $(\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i^2$. If $1 - \beta = 95\%$, then $Z_{1-\beta/2} = 1.96$.

3) Property of the Ensemble Sampling: According to Theorem 1, the normalized variance of the SE estimator, $\lambda_i = \frac{\delta_i}{n_i}$, is equivalent to $\lambda_i = \frac{\delta - n \cdot e^{\rho} + n}{e^{\rho} + n - 1} \cdot \frac{n_i}{n} + \frac{(\delta + n^2)(e^{\rho} - 1)}{n \cdot (e^{\rho} + n - 1)}$. Let $a = \frac{\delta - n \cdot e^{\rho} + n}{e^{\rho} + n - 1}$ and $b = \frac{(\delta + n^2)(e^{\rho} - 1)}{n \cdot (e^{\rho} + n - 1)}$. Then, the normalized variance is $\lambda_i = a \cdot \frac{n_i}{n} + b$. Since the estimator SE can utilize any estimator like [4][12][15] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [4], we can prove that $a < 0$ for any $n > 0$ and $f > 0$. This property applies to any estimator with variance smaller than $\delta_0$, the variance of the estimator ZE, which simply estimates the overall tag size based on the observed number of empty slots.

According to Theorem 2, in order to satisfy the accuracy constraint, we should ensure that $\lambda_i \le (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i$. As $a < 0$ for all values of $f$, it follows that the larger the value of $n_i$, the faster the specified category satisfies the accuracy constraint; conversely, the smaller the value of $n_i$, the slower the specified category satisfies the accuracy constraint. This occurs during the ensemble sampling when the major categories occupy most of the singleton slots, while the minor categories cannot obtain enough samplings in the singleton slots for an accurate estimation of their tag sizes.
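The repeated-test machinery above reduces to a short computation. The sketch below (illustrative; it approximates the unknown $n_i$ in Theorem 2 by the current estimate $\hat{n}_i$) combines the per-cycle estimates $\hat{n}_{i,k}$ with weights $\omega_k \propto 1/\delta_{i,k}$ and checks the variance form of the accuracy constraint:

```python
def combine_cycles(estimates, variances):
    # Weighted statistical averaging over l query cycles:
    #   n̂_i = Σ_k ω_k · n̂_{i,k},  ω_k = (1/δ_{i,k}) / Σ_k (1/δ_{i,k}),
    # with combined variance σ_i² = 1 / Σ_k (1/δ_{i,k}).
    inv = [1.0 / d for d in variances]
    var = 1.0 / sum(inv)
    n_hat = var * sum(w * e for w, e in zip(inv, estimates))
    return n_hat, var

def accuracy_satisfied(n_hat, var, eps=0.1, z=1.96):
    # Theorem 2 check: σ_i² ≤ (ε / Z_{1-β/2})² · n_i², with n_i ≈ n̂_i;
    # z = 1.96 corresponds to 1 - β = 95%.
    return var <= (eps / z) ** 2 * n_hat ** 2

n_hat, var = combine_cycles([96.0, 108.0, 101.0], [400.0, 250.0, 320.0])
print(n_hat, var, accuracy_satisfied(n_hat, var))  # not yet: issue more cycles
```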
C. Compute the Optimal Granularity for Ensemble Sampling

As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion group to define the set of categories involved in one query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes of the involved categories is not too large. In this way, all categories in the same group achieve the accuracy requirement with very close finishing times. In addition, according to Eq. (7), as the number of categories in the ensemble sampling increases, the load factor $\rho$ increases and the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group in regard to the reading performance. Suppose there exist $m$ categories in total; the objective is to divide them into $d$ ($1 \le d \le m$) groups for ensemble sampling, such that the overall scanning time is minimized while the accuracy requirement is achieved.

For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say $n_i$. Let $t_i = (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i$; then, according to Theorem 2, we can compute the essential frame size $f$ such that $\lambda_i(f) \le t_i$. Assume that the inter-cycle overhead is $\tau_c$ and the average time interval per slot is $\tau_s$. Therefore, if $f \le f_{max}$, the total scanning time is $T = f \cdot \tau_s + \tau_c$. Otherwise, if the final estimate is the average of $r$ independent experiments, each with an estimator variance of $\lambda_i(f_{max})$, then the variance of the average is $\frac{\lambda_i(f_{max})}{r}$. Hence, if we want the final variance to be $t_i$, then $r$ should be $\frac{\lambda_i(f_{max})}{t_i}$, and the total scanning time is $T = (f_{max} \cdot \tau_s + \tau_c) \cdot r$.

We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. Assume that there are currently $m$ categories, ranked in non-increasing order of the estimated tag size, e.g., $C_1, C_2, ..., C_m$. We need to cut the ranked categories into one or more continuous groups for ensemble sampling. In regard to a single group consisting of the categories from $C_i$ to $C_j$, we define $t(i, j)$ as the essential scanning time for ensemble sampling, which is computed in the same way as the aforementioned $T$. Furthermore, we define $T(i, j)$ as the minimum overall scanning time over the categories from $C_i$ to $C_j$ among the various grouping strategies. Then, the recursive expression of $T(i, j)$ is shown in Eq. (8):

  $T(i, j) = \min_{i \le k \le j} \{\, t(i, k) + T(k+1, j) \,\}$, if $i < j$;
  $T(i, j) = t(i, i)$, if $i = j$.  (8)

In Eq. (8), the value of $T(i, j)$ is obtained by enumerating each possible combination of $t(i, k)$ and $T(k+1, j)$, and then taking the minimum value of $t(i, k) + T(k+1, j)$. By solving the overlapping subproblems in $T(i, j)$, the optimization problem is then reduced to computing the value of $T(1, m)$.
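The recursion in Eq. (8) is a standard interval dynamic program over the ranked categories; a memoized sketch follows. The per-group cost $t(i, k)$ is passed in as a function, and the toy cost below (a fixed per-cycle overhead plus a penalty growing with the group's size spread, applied to the category sizes of the example that follows) is only our stand-in for the frame-size computation described above:

```python
from functools import lru_cache

def optimal_grouping(m, t):
    # Eq. (8): T(i,j) = min_{i<=k<=j} { t(i,k) + T(k+1,j) }, T(i,i) = t(i,i).
    # Categories 0..m-1 are ranked in non-increasing order of estimated size.
    @lru_cache(maxsize=None)
    def T(i, j):
        if i > j:
            return 0.0  # empty suffix costs nothing (covers the k = j case)
        return min(t(i, k) + T(k + 1, j) for k in range(i, j + 1))

    best, groups, i = T(0, m - 1), [], 0  # replay the argmin choices
    while i < m:
        k = min(range(i, m), key=lambda k: t(i, k) + T(k + 1, m - 1))
        groups.append((i, k))
        i = k + 1
    return best, groups

sizes = [100, 80, 75, 41, 35, 30, 20, 15, 12, 8]
toy_cost = lambda i, k: 43.0 + 1.6 * (k - i + 1) * sizes[i] / sizes[k]
print(optimal_grouping(len(sizes), toy_cost))
```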
For example, suppose there is a set of tags with 10 categories, ranked in non-increasing order of the estimated tag size, say $\{100, 80, 75, 41, 35, 30, 20, 15, 12, 8\}$; they are finally divided into 3 groups for ensemble sampling according to the dynamic programming, i.e., $\{100, 80, 75\}$, $\{41, 35, 30\}$, and $\{20, 15, 12, 8\}$. In this way, the tag size of each category inside one group is close to the others. During the ensemble sampling, all categories in the same group can achieve the accuracy requirement with very close finishing times; very few slots are wasted waiting for the, comparatively speaking, minor categories. On the other hand, these categories are put together with an appropriate granularity for ensemble sampling, so as to sufficiently amortize the fixed time cost of each query cycle.

D. The Ensemble Sampling-based Algorithm

In Algorithm 1, we propose an ensemble sampling-based algorithm for the basic histogram collection. In the beginning, as the overall number of tags $n$ cannot be predicted, in order to accommodate a large operating range up to $\bar{n}$, we need to set the initial frame size $f$ by solving $f \cdot e^{-\bar{n}/f} = 5$, as suggested in [4]. Then, during each cycle of ensemble sampling, we find the category with the largest population $\upsilon$ in the singleton slots and set a threshold $\upsilon \cdot \theta$ ($\theta$ is a constant, $0 < \theta < 1$): only those categories that appear in the singleton slots with population $n_{s,i} > \upsilon \cdot \theta$ are involved in the current round of estimation.

Algorithm 1 Algorithm for histogram collection
1: Initialize the set $R$ to all tags. Set $l = 1$.
2: while $n_s \ne 0 \wedge n_c \ne 0$ do
3:   If $l = 1$, compute the initial frame size $f$ by solving $f \cdot e^{-\bar{n}/f} = 5$. Otherwise, compute the frame size $f = \hat{n}$. If $f > f_{max}$, set $f = f_{max}$.
4:   Set $S$ to $\emptyset$. Select the tags in $R$ and issue a query cycle with the frame size $f$; obtain $n_0, n_c, n_s$. Find the category with the largest population $\upsilon$ in the singleton slots. For each category which appears in the singleton slots with population $n_{s,i} > \upsilon \cdot \theta$ ($\theta$ is a constant, $0 < \theta < 1$), add it to the set $S$.
5:   Estimate the tag size $\hat{n}_i$ for each category $C_i \in S$ using the SE estimator. Compute the variance $\delta'_i$ for each category $C_i \in S$ according to Eq. (7).
6:   Rank the categories in $S$ in non-increasing order of the tag size. Divide the set $S$ into groups $S_1, S_2, ..., S_d$ according to the dynamic programming-based method.
7:   for each $S_j \in S$ ($1 \le j \le d$) do
8:     For each category $C_i \in S_j$, compute the frame size $f_i$ from $\delta_i$ by solving $\frac{1}{1/\delta'_i + 1/\delta_i} \le (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot \hat{n}_i^2$.
9:     Obtain the maximum frame size $f = \max_{C_i \in S_j} f_i$. If $f < f_{max}$, select all categories in $S_j$ and issue a query cycle with frame size $f$. Otherwise, select all categories in $S_j$ and issue $r$ query cycles with the frame size $f_{max}$. Wipe out the categories with satisfied accuracy after each query cycle.
10:    Estimate the tag size $\hat{n}_i$ for each category $C_i \in S_j$; illustrate them in the histogram.
11:  end for
12:  $\hat{n} = \hat{n} - \sum_{C_i \in S} \hat{n}_i$. $R = R - S$. $S = \emptyset$. $l = l + 1$.
13: end while
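The initial frame size in step 3 can be obtained numerically. Reading the condition as in [4] (our interpretation), $f \cdot e^{-\bar{n}/f}$ approximates the expected number of empty slots in a frame of size $f$ holding $\bar{n}$ tags and increases monotonically in $f$, so a simple bisection suffices:

```python
import math

def initial_frame_size(n_bar: float, target: float = 5.0) -> int:
    # Solve f · exp(-n_bar / f) = target for f; the left side approximates
    # the expected number of empty slots and is increasing in f, so bisect.
    lo, hi = 1.0, 10.0 * n_bar + 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mid * math.exp(-n_bar / mid) < target:
            lo = mid
        else:
            hi = mid
    return round(hi)

print(initial_frame_size(10000))  # ~1700 slots for an operating range of 10^4 tags
```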
VI. USE ENSEMBLE SAMPLING FOR ICEBERG QUERY

In some applications, the users only pay attention to the major categories with tag sizes above a certain threshold $t$, while the minor categories are not necessarily addressed. Then, the iceberg query [16] is utilized to filter out those categories below the threshold $t$ in terms of the tag size. In this situation, the separate counting scheme is especially unsuitable, since most of the categories are not within the scope of concern and can be wiped out together immediately.

According to the definition in the problem formulation, three constraints for the iceberg query must be satisfied: the first constraint is the accuracy constraint, while the second and third constraints are the population constraints. In regard to the accuracy constraint, we have demonstrated in Theorem 2 that it can be expressed in the form of a variance constraint. In regard to the population constraint, the second constraint infers that, in the results of the iceberg query, the false negative probability should be no more than $\beta$, while the third constraint infers that the false positive probability should be no more than $\beta$. We rely on the following theorem to express the population constraint in another equivalent form.

Theorem 3: The two population constraints, $\Pr[\hat{n}_i < t \mid n_i \ge t] < \beta$ and $\Pr[\hat{n}_i \ge t \mid n_i < t] < \beta$, are satisfied as long as the standard deviation of the averaged tag size satisfies $\sigma_i \le \frac{|n_i - t|}{\Phi^{-1}(1-\beta)}$, where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

Due to lack of space, we omit the proof of Theorem 3; the detailed proof is given in [17].

In order to better illustrate the inherent principle, Fig. 1 shows an example of the histogram with the $1-\beta$ confidence intervals annotated; the $y$-axis denotes the estimated tag size for each category. In order to accurately verify the population constraint, it is required that the variance of the estimated tag size be small enough.
Note that when the $1-\beta$ confidence interval of the tag size $\hat{n}_i$ is entirely above/below the threshold $t$, the specified category can be identified as qualified/unqualified, respectively, as both the false positive and false negative probabilities are less than $\beta$; otherwise, the specified category is still undetermined. According to the weighted statistical averaging method, as the number of repeated tests increases, the averaged variance $\sigma_i$ for each category decreases, and thus the confidence interval of each category shrinks. Therefore, after a certain number of query cycles, all categories can be determined as qualified/unqualified for the population constraint.

[Fig. 1. Histogram with confidence interval annotated: the estimated tag sizes of categories C1-C12 are shown with their confidence intervals against the threshold $t$, marking each category as qualified, unqualified, or undetermined.]

Note that when the estimated value $\hat{n}_i \gg t$ or $\hat{n}_i \ll t$, the variance required by the population constraint is much larger than the specification of the accuracy constraint. In this situation, these categories can be quickly identified as qualified/unqualified and can be wiped out immediately from the ensemble sampling for verifying the population constraint. Thus, the undetermined categories can be further involved in the ensemble sampling with a much smaller tag size, verifying the population constraint in a faster manner.

Sometimes the tag sizes of the various categories are subject to some skewed distribution with a "long tail". The long tail represents those categories each of which occupies a rather small percentage among the total categories, but which all together occupy a substantial proportion of the overall tag size. In regard to the iceberg query, conventionally, the categories in the long tail are unqualified for the population constraint. However, due to their small tag sizes, most of them may not have the opportunity to occupy even one singleton slot when contending with the major categories during the ensemble sampling. They remain undetermined without being immediately wiped out, leading to inefficiency in scanning the other categories. We rely on the following theorem to quickly wipe out the categories in the long tail.

Theorem 4: For any two categories $C_i$ and $C_j$ with $n_{s,i} < n_{s,j}$, if $C_j$ is identified as unqualified, then $C_i$ must also be unqualified; in particular, once some category is identified as unqualified, any category $C_i$ with $n_{s,i} = 0$ can be wiped out immediately as an unqualified category.

B. Algorithm for the Iceberg Query Problem

We propose the algorithm for the iceberg query problem in Algorithm 2.

Algorithm 2 Algorithm for Iceberg Query
1: Initialize $R$ to all categories; set $Q$, $U$, $V$ to $\emptyset$. Set $l = 1$.
2: while $R \ne \emptyset$ do
3:   If $l = 1$, set the initial frame size $f$.
4:   Issue a query cycle over the tags; add the relatively major categories into the set $S$. Set $S' = S$.
5:   while $S \ne \emptyset$ do
6:     Compute the frame size $f_i$ for each category $C_i \in S$ such that the standard deviation reaches $\sigma_i = \frac{|t - \hat{n}_i|}{\Phi^{-1}(1-\beta)}$. If $f_i > \hat{n}_i \cdot e$, then remove $C_i$ from $S$ to $V$. If $f_i > f_{max}$, set $f_i = f_{max}$. Obtain the frame size $f$ as the mid-value among the series of $f_i$.
7:     Select all tags in $S$ and issue a query cycle with the frame size $f$; compute the estimated tag size $\hat{n}_i$ and the averaged standard deviation $\sigma_i$ for each category $C_i \in S$. Detect the qualified category set $Q$ and the unqualified category set $U$. Set $S = S - Q - U$.
8:     if $U \ne \emptyset$ then
9:       Wipe out from $S$ all categories unexplored in the singleton slots.
10:    end if
11:  end while
12:  $\hat{n} = \hat{n} - \sum_{C_i \in S'} \hat{n}_i$. $R = R - S'$. $l = l + 1$.
13: end while
14: Further verify the categories in $V$ and $Q$ for the accuracy constraint.
Steps 1-4 are quite similar to steps 1-4 in Algorithm 1; due to lack of space we omit the detailed statements for these steps. Assume that the current set of categories is $R$. During the query cycles of ensemble sampling, the reader continuously updates the statistical value of $\hat{n}_i$ as well as the standard deviation $\sigma_i$ for each category $C_i \in R$. After each query cycle, the categories in $R$ can be further divided into the following categories according to the population constraint:

∙ Qualified categories $Q$: the categories whose tag sizes are determined to be over the specified threshold $t$. If $\hat{n}_i \ge t$ and $\sigma_i \le \frac{\hat{n}_i - t}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as qualified for the population constraint.
∙ Unqualified categories $U$: the categories whose tag sizes are determined to be below the specified threshold $t$. If $\hat{n}_i < t$ and $\sigma_i \le \frac{t - \hat{n}_i}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as unqualified for the population constraint.
∙ Undetermined categories $R$: the remaining categories to be verified.

Therefore, after each query cycle of ensemble sampling, the unqualified and qualified categories can be immediately wiped out from the ensemble sampling. When at least one category is determined as unqualified, all of the categories in the current group which have not been explored in the singleton slots are wiped out immediately. The query cycles are then continuously issued over those undetermined categories in $R$ until $R = \emptyset$. The tail-bound reasoning behind these conditions is made explicit below.
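The thresholds on $\sigma_i$ above are one-sided Gaussian tail bounds made explicit. Treating $\hat{n}_i$ as approximately $N(n_i, \sigma_i^2)$ (a sketch of the reasoning, under the normal approximation used throughout):

$\Pr[n_i < t] \approx \Phi\!\left(\frac{t - \hat{n}_i}{\sigma_i}\right) \le \beta \iff \frac{\hat{n}_i - t}{\sigma_i} \ge \Phi^{-1}(1 - \beta) \iff \sigma_i \le \frac{\hat{n}_i - t}{\Phi^{-1}(1 - \beta)},$

which is the qualified-category condition; the symmetric argument on the other tail gives the unqualified-category condition, so both the false positive and the false negative probability stay below $\beta$.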
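In code, the same decision rule is only a few lines; a minimal sketch where n_hat and sigma are assumed to be outputs of the ensemble-sampling estimator:

from statistics import NormalDist

def classify(n_hat, sigma, t, beta):
    # Return 'qualified', 'unqualified', or 'undetermined' for one category.
    z = NormalDist().inv_cdf(1 - beta)        # Phi^{-1}(1 - beta)
    if n_hat >= t and sigma <= (n_hat - t) / z:
        return "qualified"
    if n_hat < t and sigma <= (t - n_hat) / z:
        return "unqualified"
    return "undetermined"

# With t = 100 and beta = 0.05 (z ~ 1.645): an estimate of 130 with
# sigma = 15 passes, since (130 - 100) / 1.645 ~ 18.2 >= 15.
print(classify(130, 15, 100, 0.05))           # -> qualified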
During the ensemble sampling, if there exist some categories with tag sizes very close to the threshold $t$, then the required number of slots to verify the population constraint can be rather large. Thus, we compute the essential frame size $f_i$ for each category $C_i$ and compare it with the expected number of slots $\hat{n}_i \cdot e$ in basic tag identification. If $f_i > \hat{n}_i \cdot e$, then the category is moved from the set $S$ to $V$. We heuristically set the frame size $f$ to the mid-value among the series of $f_i$, such that after a query cycle, about half of the categories can be determined as qualified/unqualified, and thus wiped out quickly. Therefore, after the while loop, for each category $C_i \in V$, basic identification is used to obtain the exact tag size $n_i$. If $n_i \ge t$, $C_i$ is illustrated in the histogram. For each category $C_i \in Q$, the reader verifies whether it has satisfied the accuracy requirement; if so, $C_i$ is illustrated in the histogram and wiped out from $Q$. Then, ensemble sampling is further applied over the categories in $Q$ to satisfy the accuracy requirement by using the optimized grouping method.

VII. ENSEMBLE SAMPLING FOR THE TOP-$k$ QUERY

In some applications, when the number of categories is fairly large, the users only focus on the major categories, i.e., those in the top-$k$ list in regard to the tag size. The top-$k$ query is then utilized to filter out those categories outside the top-$k$ list. In this situation, the separate counting scheme is especially unsuitable: if a specified category is not in the top-$k$ list, it is unnecessary to address it for accurate tag size estimation. However, since the threshold $t$ for the top-$k$ list cannot be known in advance, the separate counting scheme cannot quickly decide which categories can be wiped out immediately.

Hence, according to the definition of the PT-Top$k$ query, as the exact value of the tag size $n_i$ is unknown, in order to define $\Pr[C_i \in \text{top-}k\text{ list}]$, i.e., the probability that category $C_i$ is within the top-$k$ list in terms of tag size, it is essential to determine a threshold $t$ so that $\Pr[C_i \in \text{top-}k\text{ list}] = \Pr[n_i \ge t]$. Ideally, $t$ should be the tag size of the $k$th largest category; however, it is rather difficult to compute an exact value of $t$ in the estimation scheme, due to the randomness in the slotted ALOHA protocol.

Therefore, if the threshold $\hat{t}$ can be accurately estimated, then the top-$k$ query problem is reduced to the iceberg query problem. It is then essential to quickly determine the value of the threshold $\hat{t}$ while satisfying the constraint $\Pr[|\hat{t} - t| \le \epsilon \cdot t] \ge 1 - \beta$. We rely on the following theorem to express the above constraint in the form of the variance.

Theorem 5: The constraint $\Pr[|\hat{t} - t| \le \epsilon \cdot t] \ge 1 - \beta$ is satisfied as long as $Var(\hat{t} - t) \le \epsilon^2 \cdot t^2 \cdot \beta$.

Due to lack of space, we omit the proof of Theorem 5. The detailed proof is given in [17].

According to Theorem 5, we utilize the ensemble sampling to quickly estimate the threshold $\hat{t}$. The intuition is as follows: after the first query cycle of ensemble sampling, we can estimate a confidence interval $[t_{low}, t_{up}]$ of the threshold $t$ according to the sampled distribution. Then, by wiping out those categories which are obviously qualified or unqualified to be in the top-$k$ list, the width of the confidence interval, say $g$, can be quickly reduced. As the approximated threshold $\hat{t}$ is selected within the confidence interval, after a number of query cycles of ensemble sampling, when the width $g$ is below the threshold $\sqrt{\epsilon^2 \cdot t^2 \cdot \beta}$, the estimated value $\hat{t}$ can be close enough to the exact threshold $t$. Then, we can apply the iceberg query with the threshold $\hat{t}$ over the undetermined categories $R$ and the qualified categories $Q$. Due to lack of space, we omit the detailed algorithm for the PT-Top$k$ query problem; the detailed algorithm is given in [17]. A sketch of this threshold-estimation loop is given below.
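The following sketch is one simplified reading of the intuition above, not the detailed algorithm of [17]. It assumes a hook sampler.cycle() that runs one ensemble-sampling query cycle and returns per-category confidence intervals {category: (lo, up)} on the tag sizes of the categories still contending, and sampler.wipe_out(cs) that removes categories from further cycles; since the exact t is unknown, the Theorem 5 stopping width is evaluated at the current t_hat.

import math

def estimate_topk_threshold(sampler, k, eps, beta, max_cycles=100):
    # Iteratively narrow the confidence interval [t_low, t_up] of the
    # top-k threshold t, wiping out obviously-in / obviously-out categories.
    t_hat = None
    for _ in range(max_cycles):
        if k <= 0:
            break
        intervals = sampler.cycle()
        los = sorted((lo for lo, _ in intervals.values()), reverse=True)
        ups = sorted((up for _, up in intervals.values()), reverse=True)
        t_low, t_up = los[k - 1], ups[k - 1]     # bracket the k-th largest size
        surely_in = [c for c, (lo, up) in intervals.items() if lo > t_up]
        surely_out = [c for c, (lo, up) in intervals.items() if up < t_low]
        sampler.wipe_out(surely_in + surely_out)
        k -= len(surely_in)                      # fewer slots left in the top-k
        g = t_up - t_low                         # width of the confidence interval
        t_hat = (t_low + t_up) / 2               # pick t_hat inside the interval
        if g <= math.sqrt(eps ** 2 * t_hat ** 2 * beta):   # Theorem 5
            break
    return t_hat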
VIII. PERFORMANCE ANALYSIS IN TIME-EFFICIENCY

As mentioned in the problem formulation, the most critical factor for the histogram collection problem is time efficiency. In regard to basic histogram collection, the time delay is mainly impacted by two factors: 1) the number of categories $m$, and 2) the category with the smallest tag size, say $n_i$, inside the group for ensemble sampling. Generally, as the number of categories $m$ increases, the number of groups and the essential number of slots for each ensemble sampling increase, causing the time delay to grow. Besides, the category with the smallest tag size $n_i$ directly decides the essential frame size inside the group: the larger the gap among the tag sizes of the categories in the same group, the lower the achieved time efficiency. In regard to the iceberg query and the top-$k$ query, the time delay mainly depends on the number of categories with tag sizes close to the threshold $t$. Due to the variance in tag size estimation, relatively large numbers of slots are required to verify whether the specified categories have tag sizes over the threshold $t$. For the top-$k$ query, additional time delay is required to estimate the threshold $t$ corresponding to the top-$k$ query.

IX. PERFORMANCE EVALUATION

We have conducted simulations in MATLAB. The scenario is as follows: there exist $m$ categories in total, and we randomly generate the tag size for each category according to the normal distribution $N(\mu, \sigma)$. We set the default values for the following parameters: in regard to the accuracy constraint and the population constraint, we set $1 - \beta = 95\%$ and $\epsilon = 0.2$. The average time interval for each slot is $\tau_s = 1$ ms, and the inter-cycle overhead is $\tau_c = 43$ ms. We compare our solutions with two basic strategies: the basic tag identification (BI) and the separate counting (SC) (explained in Section V). All results are averaged over 500 independent trials. A sketch of this simulation setup is given below.
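For reference, here is a minimal sketch of the scenario generator and trial-averaging harness (in Python rather than MATLAB). The run_trial argument is a placeholder for any of the compared strategies, which are not reimplemented here; the baseline in the example is only a toy stand-in for BI (roughly e slots per tag plus one cycle overhead per category), not the paper's implementation.

import math
import random

BETA, EPS = 0.05, 0.2          # 1 - beta = 95%, accuracy epsilon
TAU_S, TAU_C = 1.0, 43.0       # slot duration and inter-cycle overhead (ms)

def make_scenario(m, mu, sigma, rng):
    # Draw one tag-size vector: m categories with sizes ~ N(mu, sigma),
    # truncated below at 1 so every category has at least one tag.
    return [max(1, round(rng.gauss(mu, sigma))) for _ in range(m)]

def average_scanning_time(run_trial, m, mu, sigma, trials=500, seed=1):
    # Average the total scanning time of a strategy over independent trials.
    # run_trial(sizes, tau_s, tau_c) must return the time in ms.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sizes = make_scenario(m, mu, sigma, rng)
        total += run_trial(sizes, TAU_S, TAU_C)
    return total / trials

toy_bi = lambda sizes, ts, tc: sum(math.e * n * ts + tc for n in sizes)
print(average_scanning_time(toy_bi, m=50, mu=50, sigma=30))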
[Fig. 2. Simulation results in basic histogram collection: (a) the overall scanning time in various scenarios; (b) the overall scanning time with various $\epsilon$; (c) the overall scanning time with various $\tau_c/\tau_s$; (d) the scanning time with various values of $m$. Curves: BI, SC, ES, ES-g.]

[Fig. 3. Simulation results in advanced histogram collection: (a) the scanning time with various thresholds $\theta$; (b) the scanning time with various variances $\sigma$; (c) the scanning time with various values of $k$; (d) the variation of $g$ with the scanning time.]

A. The Performance in Basic Histogram Collection

We compare the ensemble sampling with one group (ES) and the ensemble sampling with optimized grouping (ES-g) against the basic strategies. In Fig. 2(a), we compare the overall scanning time under three scenarios. In scenario 1 we set the number of categories $m = 50$, the average tag size $\mu = 50$, and its standard deviation $\sigma = 30$. We observe that the ES strategy has the longest scanning time, while the others have fairly small scanning times. This is because the variance $\sigma$ is relatively large as compared to the tag size; the minor categories become the bottleneck of the estimation performance, thus greatly increasing the scanning time. In scenario 2 we set $m = 100$, $\mu = 50$ and $\sigma = 30$. As the number of categories increases, the scanning time of the separate counting (SC) increases noticeably, due to the large inter-cycle overhead and the constant initial frame size for each category. Still, the ES strategy has the longest scanning time. In scenario 3, we set $m = 100$, $\mu = 500$ and $\sigma = 100$. We observe that the basic identification (BI) has the longest scanning time, as the current overall tag size is rather large. The ES strategy now requires a fairly short scanning time, as the variance $\sigma$ is relatively small as compared to $\mu$. Note that in all cases, our optimized solution ES-g always achieves the best performance in terms of scanning time.

In Fig. 2(b), we compare the scanning time with various values of $\epsilon$ in the accuracy constraint. We set $m = 100$, $\mu = 500$, $\sigma = 100$. As the value of $\epsilon$ increases, the scanning time of all solutions, except the BI strategy, decreases. Among the four strategies, the ES-g solution always achieves the best performance in scanning time.

In Fig. 2(c), we evaluate the impact of the inter-cycle overhead on the strategies. We set $m = 150$, $\mu = 50$, $\sigma = 10$. It is known that, by reducing the transmitted bits in singleton slots, the average slot duration can be further reduced, while the inter-cycle overhead is not easy to reduce greatly, due to the necessity of calming down the activated tags. So we test the overall scanning time with various ratios of $\tau_c/\tau_s$. We observe that the BI strategy and the ES strategy have fairly stable scanning times, as their numbers of query cycles are relatively small. The separate counting (SC) has a relatively short scanning time when $\tau_c/\tau_s$ is less than 50. As the value of $\tau_c/\tau_s$ increases, its scanning time linearly increases and surpasses the other strategies. The ES-g strategy always has the shortest scanning time; a simple cost model for this tradeoff is sketched below.

In Fig. 2(d), we evaluate the scalability of the proposed algorithms while varying the overall number of categories. We set $\mu = 100$, $\sigma = 20$. Note that as the number of categories increases, the scanning time of each solution grows roughly linearly. Still, the ES-g solution always achieves the minimum scanning time.
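The Fig. 2(c) trend can be read off a simple cost model: the total scanning time is roughly the number of query cycles times $\tau_c$ plus the number of slots times $\tau_s$, so cycle-heavy strategies suffer as $\tau_c/\tau_s$ grows. A toy illustration follows; the cycle and slot counts are made-up placeholders, not measured values.

def scanning_time(num_cycles, num_slots, tau_s=1.0, tau_c=43.0):
    # Total time (ms) = inter-cycle overhead + slot time.
    return num_cycles * tau_c + num_slots * tau_s

# Separate counting issues at least one query cycle per category, so its
# cost scales with m * tau_c; ensemble sampling shares cycles across
# categories. Hypothetical counts for m = 150 categories:
for ratio in (10, 50, 200):
    sc = scanning_time(150, 9000, tau_c=ratio)   # one cycle per category
    es = scanning_time(12, 15000, tau_c=ratio)   # a dozen shared cycles
    print(f"tau_c/tau_s={ratio}: SC={sc:.0f} ms, ES={es:.0f} ms")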
We use ES to denote our optimized solution based on ensemble sampling. In Fig.3(a) we compare the scanning time with various values of threshold ratio 𝜃. We set 𝑚 = 200, 𝜇 = 200, 𝜎 = 100, the exact threshold is set to 𝑡 = 𝜃 ⋅ 𝜇. We observe that, as the threshold increases, the scanning time of the SC strategy and the ES strategy is continuously decreasing, while the scanning time for the BI strategy is not affected. In Fig.3(b) we compare the scanning time with various standard deviation 𝜎. We set 𝑚 = 200, 𝜇 = 200 and the threshold ratio 𝜃 = 1.5. We observe that as the value of 𝜎 increases, the scanning time of the SC strategy and the ES strategy grows slowly. The reason is as follows: as the standard deviation 𝜎 increases, the number of qualified categories is increasing, thus more slots are essential to verify the categories for accuracy; besides, fewer categories have tag sizes close to the threshold, thus fewer slots are required to verify the population constraint. In all, the overall scanning time is increasing rather slowly. We evaluate the performance of our PT-Top𝑘 algorithm. In Fig.3(c), we compare the scanning time with various values of 𝑘. We observe that as 𝑘 increases from 20 to 120, the scanning time of the ES strategy increases from 1.5 × 105 to 2.5 × 105, and then decreases to 2 × 105. The reason is that, as the value of 𝑘 increases, the exact threshold is reduced, and more categories are identified as qualified, thus more slots are
essential to verify the categories for accuracy; then, as the value of $k$ further increases, more qualified categories with large tag sizes can be quickly wiped out in the threshold estimation, thus fewer slots are required in the threshold estimation, and the overall scanning time decreases. In Fig. 3(d), we evaluate the convergence of estimating the threshold $t$. We set $m = 200$, $\mu = 500$, $\sigma = 200$, $k = 20$. We observe that the width of the range $[t_{low}, t_{up}]$, i.e., $g$, continuously decreases as the scanning time increases. When the scanning time reaches $1.8 \times 10^5$, the value of $g$ drops below the required threshold shown as the dashed line, and the iteration ends.

X. CONCLUSION

We believe this is the first paper considering the problem of collecting histograms over RFID tags. Based on the ensemble sampling method, we respectively propose effective solutions for the basic histogram collection, the iceberg query problem, and the top-$k$ query problem. Simulation results show that our solutions achieve much better performance than the other schemes.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China under Grants No. 61100196, 61073028, 61321491, 91218302, and the Jiangsu Natural Science Foundation under Grant No. BK2011559. Qun Li is supported in part by US NSF grants CNS-1117412, CNS-1320453, and CAREER Award CNS-0747108. Jie Wu is supported in part by US NSF grants ECCS 1231461, ECCS 1128209, CNS 1138963, CNS 1065444, and CCF 1028167.

REFERENCES

[1] C. Qian, Y. Liu, H.-L. Ngan, and L. M. Ni, "ASAP: Scalable identification and counting for contactless RFID systems," in Proc. of IEEE ICDCS, 2010.
[2] M. Shahzad and A. X. Liu, "Every bit counts - fast and scalable RFID estimation," in Proc. of MobiCom, 2012.
[3] Y. Zheng and M. Li, "Fast tag searching protocol for large-scale RFID systems," in Proc. of ICNP, 2011.
[4] M. Kodialam and T. Nandagopal, "Fast and reliable estimation schemes in RFID systems," in Proc. of MobiCom, 2006.
[5] B. Sheng, C. C. Tan, Q. Li, and W. Mao, "Finding popular categories for RFID tags," in Proc. of ACM MobiHoc, 2008.
[6] Y. E. Ioannidis and V. Poosala, "Histogram-based approximation of set-valued query answers," in Proc. of the 25th VLDB Conference, 1999.
[7] S. Tang, J. Yuan, X.-Y. Li, G. Chen, Y. Liu, and J. Zhao, "Raspberry: A stable reader activation scheduling protocol in multi-reader RFID systems," in Proc. of ICNP, 2009.
[8] L. Yang, J. Han, Y. Qi, C. Wang, T. Gu, and Y. Liu, "Season: Shelving interference and joint identification in large-scale RFID systems," in Proc. of INFOCOM, 2011.
[9] S. Chen, M. Zhang, and B. Xiao, "Efficient information collection protocols for sensor-augmented RFID networks," in Proc. of INFOCOM, 2011.
[10] Y. Qiao, S. Chen, T. Li, and S. Chen, "Energy-efficient polling protocols in RFID systems," in Proc. of MobiHoc, 2011.
[11] T. Li, S. Wu, S. Chen, and M. Yang, "Energy efficient algorithms for the RFID estimation problem," in Proc. of INFOCOM, 2010.
[12] W. Chen, "An accurate tag estimate method for improving the performance of an RFID anticollision algorithm based on dynamic frame length ALOHA," IEEE Transactions on Automation Science and Engineering, vol. 6, no. 1, pp. 9-15, 2009.
[13] W. Luo, Y. Qiao, and S. Chen, "An efficient protocol for RFID multigroup threshold-based classification," in Proc. of INFOCOM, 2013.
[14] M. Buettner and D. Wetherall, "An empirical study of UHF RFID performance," in Proc. of MobiCom, 2008.
[15] H. Han, B. Sheng, C. C. Tan, Q. Li, W. Mao, and S. Lu, "Counting RFID tags efficiently and anonymously," in Proc. of INFOCOM, 2010.
[16] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, "Computing iceberg queries efficiently," in Proc. of the 24th VLDB Conference, 1998.
[17] L. Xie, H. Han, Q. Li, J. Wu, and S. Lu, "Efficiently collecting histograms over RFID tags," Nanjing University, Tech. Rep., 2013, http://cs.nju.edu.cn/lxie/publication/histogram.pdf.

APPENDIX

A: Proof of Theorem 1

Proof: According to the definition $\hat{\alpha}_i = \frac{n_{s,i}}{n_s}$, we have $E(\hat{\alpha}_i) = E(\frac{n_{s,i}}{n_s})$ and $E(\hat{\alpha}_i^2) = E(\frac{n_{s,i}^2}{n_s^2})$. Although $n_s = \sum_{i=1}^{m} n_{s,i}$, the relationship between $n_s$ and $n_{s,i}$ is rather weak as long as $m$ is fairly large. Hence we can assume that $n_{s,i}$ and $n_s$ are independent of each other. Then, according to the property of independence, $E(\hat{\alpha}_i) = \frac{E(n_{s,i})}{E(n_s)} = \frac{n_i}{n}$ and $E(\hat{\alpha}_i^2) = \frac{E(n_{s,i}^2)}{E(n_s^2)}$. Then, as it can be proved that

$E(n_s^2) = \left(1 - \frac{1}{f}\right)^{n-1} \cdot n + \frac{f-1}{f} \left(1 - \frac{2}{f}\right)^{n-2} (n^2 - n),$

$E(n_{s,i}^2) = \left(1 - \frac{1}{f}\right)^{n-1} \cdot n_i + \frac{f-1}{f} \left(1 - \frac{2}{f}\right)^{n-2} (n_i^2 - n_i),$

we have

$E(\hat{\alpha}_i^2) = \frac{n_i}{n} \cdot \frac{1 + \left(1 - \frac{1}{f-1}\right)^{n-2} \cdot (n_i - 1)}{1 + \left(1 - \frac{1}{f-1}\right)^{n-2} \cdot (n - 1)} \approx \frac{n_i}{n} \cdot \frac{1 + e^{-\rho} \cdot (n_i - 1)}{1 + e^{-\rho} \cdot (n - 1)}.$

Thus, $\delta_i = E(\hat{n}_i^2) - E^2(\hat{n}_i) = E(\hat{\alpha}_i^2 \cdot \hat{n}^2) - n_i^2$. Note that $\hat{n}$ is obtained from the estimator which relies on the number of empty/singleton/collision slots as well as the frame size $f$, while $\hat{\alpha}_i$ is equal to $\frac{n_{s,i}}{n_s}$, i.e., the proportion that category $C_i$ occupies in the singleton slots. Therefore, we can assume the independence between $\hat{\alpha}_i$ and $\hat{n}$; then $\delta_i = E(\hat{\alpha}_i^2) \cdot E(\hat{n}^2) - n_i^2$. As the variance $\delta = E(\hat{n}^2) - E(\hat{n})^2$ and $E(\hat{n}) = n$, we have $E(\hat{n}^2) = \delta + n^2$. Therefore,

$\delta_i = \frac{n_i}{n} \cdot \frac{e^{\rho} + n_i - 1}{e^{\rho} + n - 1} \cdot (\delta + n^2) - n_i^2. \qquad \blacksquare$

B: Proof of Theorem 2

Proof: Suppose identical frame sizes are used for each query cycle in the repeated tests. Then, in regard to a specified category $C_i$, the series of estimated tag sizes $\{\hat{n}_{i,k}\}$ over $l$ query cycles are independent and identically distributed. Since the variance $\delta_{i,k}$ is equal for each query cycle, according to the weighted statistical averaging method, the estimated tag size is $\hat{n}_i = \frac{1}{l} \sum_{k=1}^{l} \hat{n}_{i,k}$. As the expected value and standard deviation of the averaged value $\hat{n}_i$ are respectively $n_i$ and $\sigma_i$, then according to the Central Limit Theorem, $X = \frac{\hat{n}_i - n_i}{\sigma_i}$ is asymptotically normal with mean 0 and variance 1; that is, $X$ satisfies the standard normal distribution.

Therefore, the definition of the accuracy constraint is equivalent to: $\Pr\left[\frac{-\epsilon \cdot n_i}{\sigma_i} \le \frac{\hat{n}_i - n_i}{\sigma_i} \le \frac{\epsilon \cdot n_i}{\sigma_i}\right] \ge 1 - \beta$. We use $Z_{1-\beta/2}$ to denote the $1 - \frac{\beta}{2}$ percentile of the standard normal distribution. Hence, in order to satisfy the accuracy constraint, it is essential to ensure $\frac{\epsilon \cdot n_i}{\sigma_i} \ge Z_{1-\beta/2}$, that is, $\sigma_i^2 \le \left(\frac{\epsilon}{Z_{1-\beta/2}}\right)^2 \cdot n_i^2$. $\blacksquare$
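The independence assumption used in the proof of Theorem 1 can be spot-checked numerically. Below is a small Monte Carlo sketch of one framed-slotted-ALOHA frame, comparing the empirical $E[\hat{\alpha}_i] = E[n_{s,i}/n_s]$ against $n_i/n$; the category sizes and frame size are arbitrary illustrative values, not parameters from the paper.

import random
from collections import Counter

def one_frame(sizes, f, rng):
    # Simulate one framed-slotted-ALOHA frame: every tag picks one of f
    # slots uniformly; return the per-category singleton counts n_{s,i}.
    slots = [[] for _ in range(f)]
    for cat, n_i in enumerate(sizes):
        for _ in range(n_i):
            slots[rng.randrange(f)].append(cat)
    singles = [s[0] for s in slots if len(s) == 1]
    return Counter(singles)

def check_alpha(sizes, f, trials=5000, seed=7):
    # Empirical E[alpha_hat_i] versus n_i/n (cf. Theorem 1). Frames with no
    # singleton slot are skipped, which introduces a small bias.
    rng, n = random.Random(seed), sum(sizes)
    acc = [0.0] * len(sizes)
    for _ in range(trials):
        cnt = one_frame(sizes, f, rng)
        ns = sum(cnt.values())
        if ns:
            for cat, c in cnt.items():
                acc[cat] += c / ns
    for cat, n_i in enumerate(sizes):
        print(f"C{cat}: empirical {acc[cat] / trials:.4f}  vs  n_i/n {n_i / n:.4f}")

check_alpha(sizes=[40, 25, 10, 5], f=128)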