Efficient Protocols for Collecting Histograms in Large-Scale RFID Systems

Lei Xie, Member, IEEE, Hao Han, Member, IEEE, Qun Li, Member, IEEE, Jie Wu, Fellow, IEEE, and Sanglu Lu, Member, IEEE

Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications. In this paper we consider the efficient collection of histograms from a massive number of RFID tags, without the need to read all tag data. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag size for a number of categories. We first consider the problem of basic histogram collection, and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-k query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. This ensemble sampling-based framework is very flexible and compatible with current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes.

Index Terms—Algorithms, RFID, time efficiency, histogram

L. Xie and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. E-mail: {lxie, sanglu}@nju.edu.cn.
H. Han and Q. Li are with the Department of Computer Science, College of William and Mary, Williamsburg, VA. E-mail: {hhan, liqun}@cs.wm.edu.
J. Wu is with the Department of Computer Information and Sciences, Temple University. E-mail: jiewu@temple.edu.

Manuscript received 10 Mar. 2014; revised 12 Aug. 2014; accepted 4 Sept. 2014. Date of publication 10 Sept. 2014; date of current version 7 Aug. 2015. Recommended for acceptance by S. Guo. Digital Object Identifier no. 10.1109/TPDS.2014.2357021

1 INTRODUCTION

WITH the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags and are densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10 m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1], [2], [3]. Many tag identification protocols [4], [5], [6], [7], [8] are proposed to uniquely identify the tags one by one through anti-collision schemes. However, in a number of applications, only some useful statistical information needs to be collected, such as the overall tag size [2], [9], [10], popular categories [11], and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion. In some applications, such as a grocery store or a shipping portal, items are categorized according to some specified metrics, such as types of merchandise, manufacturers, etc. A histogram is used to illustrate the number of items in each category.

In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting a histogram can be used to illustrate the tag population belonging to each category, and to determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries [12], [13], as well as preprocessing and mining association rules in data mining [14]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications.
Fig. 1 shows an example of collecting a histogram over the RFID tags deployed in such application scenarios.

Fig. 1. An example of collecting a histogram over RFID tags.

While dealing with a large-scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags, which can be on the order of several minutes. As the overall tag size grows, reading each tag one by one can be rather time-consuming, which is not scalable at all. In most applications, the tags are frequently moving into and out of the effective scanning area. In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window, on the order of several seconds. Therefore, we seek to propose an estimation scheme that quickly counts the tag sizes of each category while achieving the accuracy requirement.

In most cases, the tag sizes of various categories are subject to some skewed distribution with a "long tail", such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies a rather small percentage among the total categories.
While handling a massive number of tags, on the order of several thousands, the overall number of categories in the long tail could be in the hundreds. Therefore, by separately estimating the tag sizes over each category, a large number of query cycles and slots are required. Besides, in applications like the iceberg query and the top-k query, only the major categories are essential to be addressed. In this situation, the separate estimation approach will waste a lot of scanning time over those minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. The main contributions of this paper are listed as follows (a preliminary version of this work appeared in [15]):

1) To the best of our knowledge, we are the first to consider the problem of collecting histograms and its applications (i.e., iceberg query and top-k query) over RFID tags, which is a fundamental premise for answering aggregate queries and data mining over RFID-based applications.

2) In order to achieve time efficiency, we propose a novel, ensemble sampling (ES)-based method to simultaneously estimate the tag size for a number of categories. This framework is very flexible and compatible with current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. While achieving time efficiency, our solutions are completely compatible with current industry standards, i.e., the EPCglobal C1G2 standards, and do not require any tag modifications.

3) In order to tackle histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraints, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution to tackle the top-k query problem. We use ensemble sampling to quickly estimate the threshold corresponding to the kth largest category, and reduce it to the iceberg query problem.

The remainder of the paper is as follows. Sections 2 and 3 present the related work and RFID preliminaries, respectively. We formulate our problem in Section 4, and present our ensemble sampling-based method for the basic histogram collection in Section 5. We further present our solutions for the iceberg query and the top-k query, respectively, in Sections 6 and 7. In Section 8, we provide a performance analysis of time efficiency. The performance evaluation is in Section 9, and we conclude in Section 10.

2 RELATED WORK

In RFID systems, a reader needs to receive data from multiple tags, while the tags are unable to self-regulate their radio transmissions to avoid collisions; thus, a series of slotted ALOHA-based anti-collision protocols [1], [4], [5], [6], [7], [8], [16], [17] are designed to efficiently identify tags in RFID systems. In order to deal with the collision problems in multi-reader RFID systems, scheduling protocols for reader activation are explored in the literature [18], [19]. Recently, a number of polling protocols [20], [21], [22] have been proposed, aiming to collect information from battery-powered active tags in an energy-efficient approach.

Recent research has focused on the collection of statistical information over RFID tags [2], [9], [10], [11], [23], [24], [25], [26], [27]. The authors mainly consider the problem of estimating the number of tags without collecting the tag IDs.
Murali et al. provide very fast and reliable estimation mechanisms for tag quantity in a more practical approach [9]. Li et al. study the RFID estimation problem from the energy angle [23]; their goal is to reduce the amount of energy that is consumed by the tags during the estimation procedure. Shahzad et al. propose a new scheme for estimating tag population size called average run based tag estimation (ART) [2]. Chen et al. aim to gain deeper and fundamental insights into RFID counting protocols [27]; they manage to design near-optimal protocols that are more efficient than existing ones and simultaneously simpler than most of them. Liu et al. investigate efficient distributed query processing in large RFID-enabled supply chains [28]. Liu et al. propose a novel solution to quickly count the key tags in anonymous RFID systems [29]. Luo et al. tackle an interesting problem, called multigroup threshold based classification [25], which is to determine whether the number of objects in each group is above or below a prescribed threshold value. Sheng et al. consider the problem of identifying popular categories of RFID tags out of a large collection of tags [11], while the set of category IDs is supposed to be known. Different from the previous work, in this paper our goal is to collect the histograms for all categories over RFID tags in a time-efficient approach, without any a priori knowledge of the categories. Specifically, we respectively consider the basic histogram collection problem, the iceberg query problem, and the top-k query problem in regard to collecting histograms in large-scale RFID systems. We aim to propose a flexible and compatible framework for current tag-counting estimators based on the slotted ALOHA protocol, which can be efficiently leveraged to estimate the tag size for each category.

3 PRELIMINARY

3.1 The Framed Slotted ALOHA Protocol

In the Class 1 Gen 2 standard, the RFID system leverages the framed slotted ALOHA protocol to resolve the collisions for tag identification. When a reader wishes to read a set of tags, it first powers up and transmits a continuous wave to energize the tags. It then initiates a series of frames, varying the number of slots in each frame to best accommodate the number of tags. Each frame has a number of slots, and each active tag replies in a randomly selected slot per frame. After all tags are read, the reader powers down. We refer to the series of frames between power-down periods as a Query Cycle.
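As a concrete illustration of the frame behavior described above, the following sketch (in Python, like all illustrative snippets added here) simulates a single frame: every active tag replies in a uniformly chosen slot, and the resulting occupancy determines how many slots receive zero, one, or multiple replies. The function name and parameters are ours, not part of the C1G2 protocol; this is a minimal sketch rather than a faithful reader implementation.

```python
import random
from collections import Counter

def simulate_frame(num_tags: int, frame_size: int, seed: int = 1):
    """Simulate one frame of framed slotted ALOHA: every active tag
    replies in one uniformly chosen slot of the frame."""
    rng = random.Random(seed)
    choices = [rng.randrange(frame_size) for _ in range(num_tags)]
    occupancy = Counter(choices)                 # slot index -> number of replies
    empty = frame_size - len(occupancy)          # slots no tag picked
    singleton = sum(1 for c in occupancy.values() if c == 1)
    collision = sum(1 for c in occupancy.values() if c > 1)
    return empty, singleton, collision

# With the frame size close to the tag population, roughly 1/e of the
# slots are expected to hold exactly one reply (cf. Section 3.2).
print(simulate_frame(num_tags=1000, frame_size=1000))
```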
Note that, within each frame, tags may choose the same slot, which causes multiple tags to reply during a slot. Therefore, within each frame there exist three kinds of slots: (1) the empty slot, where no tag replies; (2) the singleton slot, where only one tag replies; and (3) the collision slot, where multiple tags reply.

In regard to the tag ID, each tag has a unique 96-bit ID in its EPC memory, where the first s binary bits can be regarded as the category ID (1 < s < 96). According to the C1G2 standard, for each Query Cycle, the reader is able to select the tags in a specified category by sending a Select command with an s-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command.

3.2 Basic Tag Identification versus the Estimation Scheme

Assume that there are n tags in total, and that it takes $s_i$ slots to uniquely identify the n tags. It is known that for each query round, when the frame size f is equal to the remaining number of tags, the proportion of singleton slots inside the frame is maximized; then, the efficiency is $\frac{n_s}{f} = \frac{1}{e}$. Hence, the essential number of slots is $s_i = \sum_{i=0}^{+\infty}(1-\frac{1}{e})^i \cdot n = n \cdot e$.

Therefore, assume that it takes $s_e$ slots to estimate the tag size for each category with a certain accuracy. If we want the estimation scheme to achieve a better reading performance than the basic tag identification method, then we need $s_e \cdot l_e \le s_i \cdot l_i$, where $l_e$ and $l_i$ are the sizes of the bit strings transmitted during the estimation and identification phases, respectively.

3.3 The Impact of the Inter-Cycle Overhead

The MAC protocol for the C1G2 system is based on slotted ALOHA. In order to accurately estimate the size of a specified set of tags, conventionally, the reader should issue multiple query cycles over the same set of tags and take the average of the estimates. The inter-cycle overhead consists of the time between cycles when the reader is powered down, and the carrier time used to power the tags before beginning communication. According to the experiment results in [30], which are conducted in realistic settings, these times are 40 ms and 3 ms respectively, while the average time interval per slot is 1-2 ms.

We have further measured the time interval for various slots and the inter-cycle duration with the USRP N210 platform. In our experiments, we use the Alien-9900 reader and an Alien-9611 linear antenna with a directional gain of 6 dB. The RFID tags used are Alien 9640 general-purpose tags, which support the EPC C1G2 standards. We use the Alien reader to continuously read 13 tags for 100 query cycles, and use the USRP N210 as a sniffer to capture the physical signals. Fig. 2 shows an example of the captured raw signal data of the interrogation between the reader and the tag.

Fig. 2. The captured raw signal data of the interrogation between the reader and the tag: (a) the raw signal data for 5 continuous cycles; (b) the raw signal data for one frame.

According to the realistic experiment results in this setting, the average intervals for various slots are summarized in Table 1. It is found that, in most cases, a slot is started with a QueryRep command; then the average interval for empty slots is 0.9 ms per slot, the average interval for singleton slots is 4.1 ms per slot, and the average interval for collision slots is 1.3 ms per slot. When a slot happens to be the first slot of a frame, the slot is started with a Query command; then the average interval for empty slots is 1.7 ms per slot, the average interval for singleton slots is 5.1 ms per slot, and the average interval for collision slots is 2.2 ms per slot.

TABLE 1
The Average Time Interval for Various Slots

                  after QueryRep command    after Query command
empty slot        0.9 ms                    1.7 ms
singleton slot    4.1 ms                    5.1 ms
collision slot    1.3 ms                    2.2 ms
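To connect the measured intervals in Table 1 with the frame-level cost, the sketch below adds up the duration of one frame from its ordered slot outcomes, charging the first slot at the Query-command rate and the remaining slots at the QueryRep rate. The function and the toy frame are illustrative; only the millisecond values come from Table 1.

```python
# Measured average slot durations in milliseconds (Table 1).
SLOT_MS_AFTER_QUERYREP = {"empty": 0.9, "singleton": 4.1, "collision": 1.3}
SLOT_MS_AFTER_QUERY    = {"empty": 1.7, "singleton": 5.1, "collision": 2.2}

def frame_duration_ms(slot_outcomes):
    """Estimate the duration of one frame from the ordered list of slot
    outcomes, e.g. ["empty", "singleton", ...]. The first slot of a frame
    follows a Query command, the remaining ones follow QueryRep."""
    total = 0.0
    for idx, kind in enumerate(slot_outcomes):
        table = SLOT_MS_AFTER_QUERY if idx == 0 else SLOT_MS_AFTER_QUERYREP
        total += table[kind]
    return total

# A toy frame with 1 collision, 3 singleton, and 2 empty slots.
print(frame_duration_ms(["collision", "singleton", "singleton",
                         "empty", "singleton", "empty"]))
```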
By measuring the time intervals between two adjacent query cycles, it is found that the average inter-cycle duration is 28.3 ms. Note that if the powered-down interval is not long enough, it is possible that some surrounding tags will maintain the former state of the inventoried flag with their local residual power, which causes them to keep silent in the upcoming query cycle.

Therefore, since the average inter-cycle duration (28.3 ms) is much larger than the average time interval of conventional slots (empty slot: 0.9 ms, singleton slot: 4.1 ms, collision slot: 1.3 ms), the inter-cycle duration must be taken into account when considering the overall reading performance. It is obvious that reading a large number of tags per cycle amortizes the cost of the inter-cycle overhead, resulting in a lower per-tag reading time, while for small tag sets the inter-cycle overhead is significant. It is essential to sufficiently reduce the inter-cycle overhead when we design a solution and set the corresponding parameters for RFID systems.
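The amortization argument can be made concrete with a rough per-tag cost model. The 28.3 ms inter-cycle duration is the measurement reported above; the average per-slot interval and the slots-per-tag factor (roughly e, from Section 3.2) are assumptions of this sketch, not measured constants.

```python
INTER_CYCLE_MS = 28.3   # measured average inter-cycle duration
AVG_SLOT_MS = 1.5       # assumed average per-slot interval (within the 0.9-4.1 ms range)
SLOTS_PER_TAG = 2.72    # assumed ~e slots per identified tag (Section 3.2)

def per_tag_time_ms(tags_per_cycle: int) -> float:
    """Rough per-tag reading time for one query cycle: slot time plus the
    inter-cycle overhead shared by every tag read in that cycle."""
    slot_time = tags_per_cycle * SLOTS_PER_TAG * AVG_SLOT_MS
    return (slot_time + INTER_CYCLE_MS) / tags_per_cycle

for n in (5, 50, 500):
    print(n, "tags per cycle:", round(per_tag_time_ms(n), 2), "ms per tag")
```

The per-tag cost shrinks quickly as more tags are read per cycle, which is exactly why the later sections try to pack many categories into each query cycle.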
4 PROBLEM FORMULATION

Suppose there are a large number of tags in the effective scanning area of the RFID reader, and the RFID system conforms to the EPCglobal C1G2 standards, i.e., the slotted ALOHA-based anti-collision scheme [4], [6] is used in the system model. The objective is to collect the histogram over RFID tags according to some categorization metric, e.g., the type of merchandise, while the present set of category IDs cannot be predicted in advance. As we aim at a dynamic environment where the tags may frequently enter and leave the scanning area, a time-efficient strategy must be proposed. Therefore, the specified accuracy can be relaxed in order to quickly collect the histogram. Assume that the overall tag size is n, there exist m categories $C = \{C_1, C_2, \ldots, C_m\}$, and the actual tag size for each category is $n_1, n_2, \ldots, n_m$.

In the Basic Histogram Collection, the RFID system needs to collect the histogram for all categories. Due to the inherently inaccurate property of RFID systems, users can specify the accuracy requirement for the histogram collection. Suppose the estimated tag size for category $C_i$ ($1 \le i \le m$) is $\hat{n}_i$; then the following accuracy constraint should be satisfied:

$$\Pr[|\hat{n}_i - n_i| \le \epsilon \cdot n_i] \ge 1 - \beta \quad \text{(accuracy constraint)}. \qquad (1)$$

The accuracy constraint illustrates that, given the exact tag size $n_i$ for a specified category, the estimated tag size $\hat{n}_i$ should be in a confidence interval of width $2\epsilon \cdot n_i$, i.e., $\frac{\hat{n}_i}{n_i} \in [1-\epsilon, 1+\epsilon]$, with probability greater than $1-\beta$. For example, if $\epsilon = 0.1$ and $\beta = 0.05$, then in regard to a category with tag size $n_i = 100$, the estimated tag size $\hat{n}_i$ should be within the range [90, 110] with probability greater than 95 percent.

In the Iceberg Query Problem, only those categories with a tag size over a specified threshold t are essential to be illustrated in the histogram, while the accuracy requirement is satisfied. As the exact tag size $n_i$ for category $C_i$ is unknown, then, given the estimated value of the tag size $\hat{n}_i$, it is possible to have false negative errors and false positive errors in verifying the population constraint. Therefore, it is essential to guarantee that the false negative/positive rate is below $\beta$, that is:

$$\Pr[\hat{n}_i < t \mid n_i \ge t] < \beta \quad \text{(population constraint)}, \qquad (2)$$
$$\Pr[\hat{n}_i \ge t \mid n_i < t] < \beta \quad \text{(population constraint)}. \qquad (3)$$

In the Top-k Query Problem, we use the definition of the probabilistic threshold top-k query (PT-Topk query), i.e., in regard to the tag size, only the set of categories where each takes a probability of at least $1-\beta$ to be in the top-k list are illustrated in the histogram, while the accuracy requirement is satisfied. Much like the iceberg query problem, as the exact tag size $n_i$ for category $C_i$ is unknown, then, given the estimated value of the tag size $\hat{n}_i$, it is possible to have false negative errors and false positive errors in verifying the population constraint; the following constraints must be satisfied:

$$\Pr[C_i \text{ is regarded out of the top-}k \text{ list} \mid C_i \in \text{top-}k \text{ list}] < \beta, \qquad (4)$$
$$\Pr[C_i \text{ is regarded in the top-}k \text{ list} \mid C_i \notin \text{top-}k \text{ list}] < \beta. \qquad (5)$$

In this paper, we aim to propose a series of novel solutions to tackle the above problems while satisfying the following properties: (1) time-efficient; (2) simple for the tag side in the protocol; (3) compliant with the EPCglobal C1G2 standards. Therefore, in order for the proposed algorithms to work, we only require the tags to comply with the current C1G2 standards: each tag has a unique 96-bit ID in its EPC memory, where the first s binary bits are regarded as the category ID (1 < s < 96). According to the C1G2 standard, the reader is able to select the tags in a specified category by sending a Select command with an s-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command.
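The category-ID convention above is easy to state in code. The sketch below treats the first s bits of the 96-bit EPC as the category ID and models the Select command, in a deliberately simplified way, as membership of that prefix in a set of s-bit masks; the EPC value and mask values are hypothetical.

```python
def category_id(epc: int, s: int) -> int:
    """Return the first s bits of a 96-bit EPC, interpreted as the category ID."""
    return epc >> (96 - s)

def selected(epc: int, masks, s: int) -> bool:
    """Simplified view of Select: a tag joins the query cycle if its
    category ID matches one of the s-bit masks carried by the command."""
    return category_id(epc, s) in masks

tag_epc = 0x35A2 << 80            # hypothetical EPC whose top 16 bits encode the category
masks = {0x35A2, 0x35A7}          # two selected categories (multiple masks in one command)
print(selected(tag_epc, masks, s=16))   # True
```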
5 USE ENSEMBLE SAMPLING TO COLLECT HISTOGRAMS

When collecting the histograms over a large number of categories, the objective is to minimize the overall scanning time while the corresponding accuracy/population constraints are satisfied. Two straightforward solutions are summarized as follows: (1) Basic Tag Identification: the histogram is collected by uniquely identifying each tag from the massive tag set and putting it into the corresponding category, thus the accuracy is 100 percent; and (2) Separate Counting: as the category IDs cannot be predicted in advance, the tree traversal method [31] is used to obtain the category IDs. Then, the reader sends a Select command to the tags, and it activates the tags in the specified category by providing a bit mask over the category ID in the command. According to the replies from the specified tags, estimators such as [9], [24], [32] can be used to estimate the tag size for each category. As the rough tag size for each category cannot be predicted in advance, a fixed initial frame size is used for each category.

Neither of the above two solutions is time-efficient. In regard to the basic tag identification, uniquely identifying each tag in the massive set is not scalable, for as the tag size grows into a huge number, the scanning time can reach an unacceptable value. In regard to the separate counting, the reader needs to scan each category with at least one query cycle, even if the category is a minor category, which is not necessarily addressed in the iceberg query and the top-k query. As the number of categories m can be fairly large, e.g., on the order of hundreds, the Select command and the fixed initial frame size for each category, as well as the inter-cycle overhead among a large number of query cycles, make the overall scanning time rather large.

Therefore, we consider an ensemble sampling-based estimation scheme as follows: select a certain number of categories and issue a query cycle, obtain the empty/singleton/collision slots, and then estimate the tag size for each of the categories according to the sampling in the singleton slots. In this way, the ensemble sampling is preferable to the separate counting in terms of reading performance. Since more tags are involved in one query cycle, more slots amortize the cost of the inter-cycle overhead, the Select command, as well as the fixed initial frame size. Thus, the overall scanning time can be greatly reduced.
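A back-of-the-envelope comparison shows where the savings come from. Both functions below use the per-cycle cost model T = f * t_s + t_c introduced later in Section 5.3; the initial frame size, the per-slot interval, and the example group sizes are assumed numbers for illustration only.

```python
T_SLOT_MS = 1.5        # assumed average per-slot interval
T_CYCLE_MS = 28.3      # measured inter-cycle overhead
F_INIT = 128           # assumed fixed initial frame size used by separate counting

def separate_counting_ms(num_categories: int, cycles_per_category: int = 1) -> float:
    """Every category gets its own query cycle(s) with a fixed initial frame."""
    per_cycle = F_INIT * T_SLOT_MS + T_CYCLE_MS
    return num_categories * cycles_per_category * per_cycle

def ensemble_sampling_ms(group_frame_sizes) -> float:
    """One query cycle per group; all categories in the group share the frame,
    so the Select/inter-cycle/initial-frame costs are amortized."""
    return sum(f * T_SLOT_MS + T_CYCLE_MS for f in group_frame_sizes)

# 200 categories counted one by one vs. grouped into four ensemble frames.
print(separate_counting_ms(200))                      # ~44060 ms
print(ensemble_sampling_ms([2048, 1024, 512, 512]))   # ~6257 ms
```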
5.1 The Estimator ES

In the slotted ALOHA-based protocol, besides the empty slots and the collision slots, the singleton slots can be obtained. In the ensemble sampling-based estimation, according to the observed statistics of the empty/singleton/collision slots, we can use estimators such as [9], [24], [32] to estimate the overall tag size. Then, according to the response in each singleton slot, the category ID is obtained from the first s bits of the tag ID. Based on the sampling from the singleton slots, the tag size for each category can be estimated. The reason is as follows:

Assume that there exist m categories $C_1, C_2, \ldots, C_m$, the overall tag size is n, and the tag size for each category is $n_1, n_2, \ldots, n_m$. We define an indicator variable $X_{i,j}$ to denote whether exactly one tag of category $C_i$ selects slot j inside the frame of size f. We set $X_{i,j} = 1$ if only one tag of category $C_i$ selects slot j, and $X_{i,j} = 0$ otherwise. Moreover, we use $\Pr[X_{i,j} = 1]$ to denote the probability that only one tag of category $C_i$ selects slot j; then,

$$\Pr[X_{i,j} = 1] = \frac{1}{f}\Big(1 - \frac{1}{f}\Big)^{n-1} \cdot n_i.$$

If we use $n_{s,i}$ to denote the number of singleton slots selected by tags of category $C_i$, thus $n_{s,i} = \sum_{j=1}^{f} X_{i,j}$, then the expected value is

$$E(n_{s,i}) = \sum_{j=1}^{f} \Pr[X_{i,j} = 1] = \Big(1 - \frac{1}{f}\Big)^{n-1} \cdot n_i.$$

Furthermore, let $n_s$ denote the number of singleton slots; the expected value is $E(n_s) = (1 - \frac{1}{f})^{n-1} \cdot n$. Then, $\frac{E(n_{s,i})}{E(n_s)} = \frac{n_i}{n}$. Thus we can approximate the tag size of category $C_i$ as follows:

$$\hat{n}_i = \frac{n_{s,i}}{n_s} \cdot \hat{n}. \qquad (6)$$

Here, $\hat{n}$ is the estimated value of the overall tag size. Let $\hat{\alpha}_i = \frac{n_{s,i}}{n_s}$; then $\hat{n}_i = \hat{\alpha}_i \cdot \hat{n}$.

5.2 Accuracy Analysis

5.2.1 Accuracy of the ES Estimator

In the ensemble sampling-based estimation, since estimators such as [9], [24], [32] can be utilized for estimating the overall number of tags, we use $\delta$ to denote the variance of $\hat{n}$. We have the property in Lemma 1.

Lemma 1. The number of singleton slots $n_s$ and the number of singleton slots $n_{s,i}$ selected by the tags of category $C_i$, respectively, have the following expectations:

$$E(n_s^2) = \Big(1 - \frac{1}{f}\Big)^{n-1} \cdot n + \frac{f-1}{f}\Big(1 - \frac{2}{f}\Big)^{n-2} \cdot (n^2 - n),$$
$$E(n_{s,i}^2) = \Big(1 - \frac{1}{f}\Big)^{n-1} \cdot n_i + \frac{f-1}{f}\Big(1 - \frac{2}{f}\Big)^{n-2} \cdot (n_i^2 - n_i).$$

Proof. See Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2014.2357021.

We rely on the following theorem to illustrate the accuracy of the estimator ES.

Theorem 1. Let $\delta_i$ represent the variance of the ES estimator $\hat{n}_i$, and let the load factor be $\rho = \frac{n}{f}$; then,

$$\delta_i = \frac{n_i}{n} \cdot \frac{e^{\rho} + n_i - 1}{e^{\rho} + n - 1} \cdot (\delta + n^2) - n_i^2. \qquad (7)$$

Proof. See Appendix B, available in the online supplemental material.

5.2.2 Reducing the Variance through Repeated Tests

As the frame size for each query cycle has a maximum value, by estimating from the ensemble sampling within only one query cycle, the estimated tag size may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues l query cycles over the same set of categories. In regard to a specified category $C_i$, by utilizing the weighted statistical averaging method, the averaged tag size is $\hat{n}_i = \sum_{k=1}^{l} \omega_k \cdot \hat{n}_{i,k}$, where $\omega_k = \frac{1/\delta_{i,k}}{\sum_{k=1}^{l} 1/\delta_{i,k}}$; here $\hat{n}_{i,k}$ and $\delta_{i,k}$ respectively denote the estimated tag size and the variance for each cycle k. Then, the variance of $\hat{n}_i$ is $\sigma_i^2 = \frac{1}{\sum_{k=1}^{l} 1/\delta_{i,k}}$.

Therefore, according to the accuracy constraint in the problem formulation, we rely on the following theorem to express this constraint in the form of the variance.

Theorem 2. Suppose the variance of the averaged tag size $\hat{n}_i$ is $\sigma_i^2$. The accuracy constraint is satisfied for a specified category $C_i$ as long as $\sigma_i^2 \le (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i^2$, where $Z_{1-\beta/2}$ is the $1-\frac{\beta}{2}$ percentile of the standard normal distribution.

Proof. See Appendix C, available in the online supplemental material.
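A compact sketch of the per-cycle estimate in Eq. (6), the inverse-variance weighted averaging of Section 5.2.2, and the Theorem 2 check is given below. The per-cycle variances are taken as inputs (they come from Eq. (7) together with whichever overall-size estimator is plugged in), the numbers in the example are illustrative, and the accuracy check uses the estimate in place of the unknown true size.

```python
def es_estimate(singletons_by_cat: dict, singletons_total: int, n_hat: float) -> dict:
    """Per-cycle ES estimate, Eq. (6): n_i_hat = (n_s,i / n_s) * n_hat."""
    return {cat: (cnt / singletons_total) * n_hat
            for cat, cnt in singletons_by_cat.items()}

def combine_cycles(estimates, variances):
    """Inverse-variance weighted average over l query cycles (Section 5.2.2).
    Returns the combined estimate n_i_hat and its variance sigma_i^2."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total

def accuracy_met(sigma_sq: float, n_i_hat: float, eps: float = 0.1, z: float = 1.96) -> bool:
    """Theorem 2 check, with the estimate standing in for the unknown n_i."""
    return sigma_sq <= (eps / z) ** 2 * n_i_hat ** 2

# A single cycle: 97 singleton slots in total, 23 of them from category "C1",
# with the overall tag size estimated at 412 by the plugged-in estimator.
print(es_estimate({"C1": 23, "C2": 51, "C3": 23}, 97, 412.0))

# One category observed over three cycles (illustrative estimates/variances).
n1_hat, var1 = combine_cycles([96.0, 104.0, 99.0], [40.0, 55.0, 47.0])
print(round(n1_hat, 1), round(var1, 1), accuracy_met(var1, n1_hat))
```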
According to Theorem 2, we can verify whether the accuracy constraint is satisfied for each category by directly checking the variance against the threshold $(\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i^2$. If $1 - \beta = 95\%$, then $Z_{1-\beta/2} = 1.96$.

5.2.3 Property of the Ensemble Sampling

According to Theorem 1, the normalized variance of the ES estimator, $\lambda_i = \frac{\delta_i}{n_i}$, is equivalent to

$$\lambda_i = \frac{\delta - n e^{\rho} + n}{e^{\rho} + n - 1} \cdot \frac{n_i}{n} + \frac{(\delta + n^2)(e^{\rho} - 1)}{n(e^{\rho} + n - 1)}.$$

Let $a = \frac{\delta - n e^{\rho} + n}{e^{\rho} + n - 1}$ and $b = \frac{(\delta + n^2)(e^{\rho} - 1)}{n(e^{\rho} + n - 1)}$. Then, the normalized variance is $\lambda_i = a \cdot \frac{n_i}{n} + b$. Since the ES estimator can utilize any estimator like [9], [24], [32] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [9], we can prove that $a < 0$ for all $f > 0$. The following theorem shows this property of the normalized variance.

Theorem 3. $\frac{\delta - n e^{\rho} + n}{e^{\rho} + n - 1} < 0$ for all $f > 0$.

Proof. See Appendix D, available in the online supplemental material.

This property applies to any estimator with a variance smaller than $\delta_0$ of ZE, which simply estimates the overall tag size based on the observed number of empty slots. According to Theorem 3, in order to satisfy the accuracy constraint, we should ensure that $\lambda_i \le (\frac{\epsilon}{Z_{1-\beta/2}})^2 \cdot n_i$. As $a < 0$ for all values of f, it follows that the larger the value $n_i$ is, the faster it will be for the specified category to satisfy the accuracy constraint. On the contrary, the smaller the value $n_i$ is, the slower it will be for the specified category to satisfy the accuracy constraint.
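The effect of Theorem 3 can be seen numerically. The sketch below evaluates the normalized variance $\lambda_i = a \cdot n_i/n + b$ derived from Eq. (7); $\delta$, the variance of the overall-size estimate, is just a placeholder number here, since its actual value depends on the estimator that is plugged in.

```python
from math import exp

def normalized_variance(n_i: float, n: float, f: float, delta: float) -> float:
    """lambda_i = a * n_i / n + b, with a and b as defined below Eq. (7)."""
    rho = n / f                                  # load factor
    denom = exp(rho) + n - 1.0
    a = (delta - n * exp(rho) + n) / denom       # negative for any f > 0 (Theorem 3)
    b = (delta + n * n) * (exp(rho) - 1.0) / (n * denom)
    return a * n_i / n + b

# Placeholder numbers: 2000 tags, a 2000-slot frame, delta assumed on the order of n.
n, f, delta = 2000.0, 2000.0, 2000.0
for n_i in (20.0, 200.0, 800.0):
    print(int(n_i), round(normalized_variance(n_i, n, f, delta), 3))
```

Because a < 0, the normalized variance decreases as the category size grows, which is the formal reason why large categories meet the accuracy constraint sooner than the categories in the long tail.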
5.3 Compute the Optimal Granularity for Ensemble Sampling
As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion of a group to define a set of categories involved in a query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes of the involved categories is not too large. In this way, all categories in the same group achieve the accuracy requirement with very close finishing times. In addition, according to Eq. (7), as the number of categories in the ensemble sampling increases, the load factor $\rho$ increases and the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group in regard to the reading performance. Suppose there exist $m$ categories in total; the objective is to divide them into $d$ $(1 \le d \le m)$ groups for ensemble sampling, such that the overall scanning time is minimized while the accuracy requirement is achieved.

For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say $n_i$. Let $\tau_i = (\epsilon / Z_{1-\beta/2})^2 \, n_i$; then, according to Theorem 2, we can compute the essential frame size $f$ such that $\lambda_i(f) \le \tau_i$. Assume that the inter-cycle overhead is $t_c$ and the average time interval per slot is $t_s$. Therefore, if $f \le f_{\max}$, the total scanning time is $T = f \cdot t_s + t_c$. Otherwise, if the final estimate is the average of $r$ independent experiments, each with a normalized variance of $\lambda_i(f_{\max})$, then the variance of the average is $\lambda_i(f_{\max})/r$. Hence, if we want the final variance to be at most $\tau_i$, then $r$ should be at least $\lambda_i(f_{\max})/\tau_i$, and the total scanning time is $T = (f_{\max} \cdot t_s + t_c) \cdot r$.

We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. Assume that currently there are $m$ categories ranked in non-increasing order of the estimated tag size, e.g., $C_1, C_2, \ldots, C_m$. We need to cut the ranked categories into one or more continuous groups for ensemble sampling. In regard to a single group consisting of the categories from $C_i$ to $C_j$, we define $t(i,j)$ as the essential scanning time for ensemble sampling, which is computed in the same way as the aforementioned $T$. Furthermore, we define $T(i,j)$ as the minimum overall scanning time over the categories from $C_i$ to $C_j$ among the various grouping strategies. Then, the recursive expression of $T(i,j)$ is shown in Eq. (8):
$$T(i,j) = \begin{cases} \min_{i \le k \le j} \{ t(i,k) + T(k+1,j) \}, & i < j, \\ t(i,j), & i = j. \end{cases} \qquad (8)$$
In Eq. (8), the value of $T(i,j)$ is obtained by enumerating each possible combination of $t(i,k)$ and $T(k+1,j)$, and then taking the minimum value of $t(i,k) + T(k+1,j)$. By solving the overlapping subproblems in $T(i,j)$, the optimization problem is then reduced to computing the value of $T(1,m)$.

For example, suppose there is a set of tags with 10 categories, ranked in non-increasing order of the estimated tag size, say, $\{100, 80, 75, 41, 35, 30, 20, 15, 12, 8\}$. They are finally divided into three groups for ensemble sampling according to the dynamic programming, i.e., $\{100, 80, 75\}$, $\{41, 35, 30\}$, and $\{20, 15, 12, 8\}$. In this way, the tag sizes of the categories inside one group are close to each other, so during the ensemble sampling all categories in the same group can achieve the accuracy requirement with very close finishing times, and very few slots are wasted in waiting for the, comparatively speaking, minor categories. On the other hand, these categories are put together with an appropriate granularity for ensemble sampling, which sufficiently amortizes the fixed time cost of each query cycle.
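The recursion in Eq. (8) can be implemented directly with memoization. The sketch below (Python) returns both $T(1,m)$ and the corresponding partition; the cost function `group_cost` is a caller-supplied stand-in for $t(i,j)$, since its exact value depends on the variance model above, and the toy cost in the usage line is an assumption for illustration only.

```python
# Sketch of the dynamic program in Eq. (8) for choosing the grouping granularity.
# group_cost(i, j) plays the role of t(i, j) and must be supplied by the caller;
# the paper computes it from the required frame size of the smallest category.
from functools import lru_cache

def optimal_grouping(m, group_cost):
    """Return (T(1, m), list of (i, j) groups) over categories 1..m
    ranked in non-increasing order of estimated tag size."""

    @lru_cache(maxsize=None)
    def T(i, j):
        if i == j:                      # single category: cost of its own group
            return group_cost(i, j), ((i, j),)
        best_cost, best_split = None, None
        for k in range(i, j + 1):       # enumerate the first group C_i..C_k
            tail_cost, tail_groups = ((0, ()) if k == j else T(k + 1, j))
            cost = group_cost(i, k) + tail_cost
            if best_cost is None or cost < best_cost:
                best_cost, best_split = cost, ((i, k),) + tail_groups
        return best_cost, best_split

    return T(1, m)

# Usage with a toy cost (an assumption for illustration only): the widest gap
# of tag sizes inside a group plus a fixed per-cycle overhead.
sizes = {i + 1: s for i, s in enumerate([100, 80, 75, 41, 35, 30, 20, 15, 12, 8])}
toy_cost = lambda i, j: (sizes[i] - sizes[j]) + 43
print(optimal_grouping(10, toy_cost))
```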
5.4 The Ensemble Sampling-Based Algorithm
In Algorithm 1, we propose an ensemble sampling-based algorithm for the basic histogram collection. In the beginning, as the overall number of tags $n$ cannot be predicted, in order to accommodate a large operating range up to the upper bound $\bar{n}$, we need to set the initial frame size $f$ by solving $f \cdot e^{-\bar{n}/f} = 5$, as suggested in [9]. Then, during each cycle of ensemble sampling, we find the category with the largest population $v$ in the singleton slots, and set a threshold $n_{s,i} > v \cdot \theta$ $(0 < \theta < 1)$ to filter out those minor categories which occasionally occupy a small number of singleton slots. For example, suppose it is observed from the singleton slots that the numbers of slots occupied by the various categories are $\{35, 25, 10, 5, 3, 1\}$; if $\theta$ is set to 0.1, then the categories occupying 3 and 1 slots are filtered out from the next ensemble sampling.

Therefore, during the ensemble sampling, we can avoid estimating the tag sizes of those minor categories, which would have a rather large variance. The involved categories are then further divided into smaller groups based on the dynamic programming. As the major categories are estimated and wiped out from the set $R$ phase by phase, all categories, including the relatively minor ones, can be accurately estimated in terms of tag size. The query cycles continue to be issued until no singleton slots or collision slots exist.

Algorithm 1. Algorithm for Histogram Collection
1: INPUT: 1. Upper bound $\bar{n}$ on the number of tags $n$
2:        2. Confidence interval width $\epsilon$
3:        3. Error probability $\beta$
4: Initialize the set $R$ to all tags. Set $l = 1$.
5: while $n_s \ne 0 \wedge n_c \ne 0$ do
6:   If $l = 1$, compute the initial frame size $f$ by solving $f \cdot e^{-\bar{n}/f} = 5$. Otherwise, compute the frame size $f = \hat{n}$. If $f > f_{\max}$, set $f = f_{\max}$.
7:   Set $S$ to $\emptyset$. Select the tags in $R$ and issue a query cycle with the frame size $f$; get $n_0$, $n_c$, $n_s$. Find the category with the largest population $v$ in the singleton slots. For each category which appears in the singleton slots with population $n_{s,i} > v \cdot \theta$ ($\theta$ is constant, $0 < \theta < 1$), add the category to $S$.
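To make the flow of the collection procedure concrete, the sketch below (Python) mirrors the loop described in Section 5.4. The reader-side primitives (`reader.issue_query_cycle`, `reader.estimate_category_sizes`, `reader.accuracy_reached`) are hypothetical stand-ins for the RFID air protocol and the underlying tag-size estimator, which the paper leaves pluggable; only the control flow follows the description above.

```python
# Sketch of the ensemble sampling-based histogram collection loop (Section 5.4).
# All reader.* calls are hypothetical placeholders for the frame-slotted ALOHA
# interface and the chosen tag-size estimator.
import math

def collect_histogram(reader, n_upper, f_max, theta=0.1):
    histogram = {}                    # category id -> estimated tag size
    active = reader.all_categories()  # the set R of categories still being scanned
    n_hat = None                      # running estimate of the remaining tag population

    while active:
        # Frame size: solve f * exp(-n_upper / f) = 5 before any estimate exists,
        # afterwards reuse the current overall estimate (capped by f_max).
        f = min(int(n_hat), f_max) if n_hat else solve_initial_frame(n_upper, f_max)

        slots = reader.issue_query_cycle(active, frame_size=f)
        if slots.num_singleton == 0:
            if slots.num_collision == 0:
                break                 # no singleton and no collision slots: done
            continue                  # nothing sampled this cycle; issue another one

        # Keep only categories whose singleton count exceeds theta * largest count.
        v = max(slots.singleton_counts.values())
        sampled = {c: k for c, k in slots.singleton_counts.items() if k > theta * v}

        estimates = reader.estimate_category_sizes(sampled, slots)   # per-category n_i
        n_hat = sum(estimates.values())
        for category, size in estimates.items():
            if reader.accuracy_reached(category):                    # Theorem 2 check
                histogram[category] = size
                active.discard(category)
    return histogram

def solve_initial_frame(n_upper, f_max):
    # Smallest f with f * exp(-n_upper / f) >= 5 (linear scan; a root finder would also do).
    for f in range(1, f_max + 1):
        if f * math.exp(-n_upper / f) >= 5:
            return f
    return f_max
```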
6 ENSEMBLE SAMPLING FOR THE ICEBERG QUERY

6.1 Motivation
In some applications, the users only pay attention to the major categories with tag sizes above a certain threshold $t$, while the minor categories are not necessarily addressed. The iceberg query [33] is then utilized to filter out those categories below the threshold $t$ in terms of the tag size. In this situation, the separate counting scheme is especially not suitable, since most of the categories are not within the scope of the concern and can be wiped out together immediately.

According to the definition in the problem formulation, three constraints for the iceberg query must be satisfied:
$$\Pr[|\hat{n}_i - n_i| \le \epsilon \cdot n_i] \ge 1 - \beta \quad \text{accuracy constraint},$$
$$\Pr[\hat{n}_i < t \mid n_i \ge t] < \beta \quad \text{population constraint},$$
$$\Pr[\hat{n}_i \ge t \mid n_i < t] < \beta \quad \text{population constraint}.$$
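Under a normal approximation of the estimator, $\hat{n}_i \sim N(n_i, \sigma_i^2)$ (an assumption we make here for illustration; the paper's own proofs are in the online appendices), these three constraints can be evaluated in closed form. The short Python sketch below checks them for given values of $n_i$, $\sigma_i$, $t$, $\epsilon$, and $\beta$.

```python
# Evaluate the three iceberg-query constraints for one category, assuming the
# estimate n_hat_i is approximately N(n_i, sigma_i^2). Illustrative only.
from scipy.stats import norm

def iceberg_constraints(n_i, sigma_i, t, epsilon=0.2, beta=0.05):
    # Accuracy: Pr[|n_hat - n_i| <= epsilon * n_i] >= 1 - beta
    accuracy = norm.cdf(epsilon * n_i / sigma_i) - norm.cdf(-epsilon * n_i / sigma_i)
    # Population: Pr[n_hat < t] when n_i >= t, and Pr[n_hat >= t] when n_i < t
    miss = norm.cdf((t - n_i) / sigma_i)          # probability the estimate falls below t
    false_alarm = 1 - norm.cdf((t - n_i) / sigma_i)
    return {
        "accuracy_ok": accuracy >= 1 - beta,
        "population_ok": (miss < beta) if n_i >= t else (false_alarm < beta),
    }

# A category of 120 tags, threshold 30, standard deviation 10:
print(iceberg_constraints(n_i=120, sigma_i=10, t=30))
```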
In particular, to declare a category unqualified, its estimated tag size should be small enough; note that when $n_{s,i} = 0$, the category can be wiped out immediately as an unqualified category.

Fig. 3. Histogram with confidence interval annotated (each category $C_1, \ldots, C_{12}$ is marked as qualified, undetermined, or unqualified with respect to the threshold $t$).
6.2 Algorithm for the Iceberg Query Problem
We propose the algorithm for the iceberg query problem in Algorithm 2. Assume that the current set of categories is $R$; during the query cycles of ensemble sampling, the reader continuously updates the statistical value of $\hat{n}_i$ as well as the standard deviation $\sigma_i$ for each category $C_i \in R$. After each query cycle, the categories in $R$ can be further divided as follows according to the population constraint:

Qualified categories $Q$: If $\hat{n}_i \ge t$ and $\sigma_i \le \frac{\hat{n}_i - t}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as qualified for the population constraint.

Unqualified categories $U$: If $\hat{n}_i < t$ and $\sigma_i \le \frac{t - \hat{n}_i}{\Phi^{-1}(1-\beta)}$, then category $C_i$ is identified as unqualified for the population constraint.

The remaining categories are left undetermined with regard to the population constraint.

Algorithm 2. Algorithm for the Iceberg Query Problem
1: INPUT: 1. Upper bound $\bar{n}$ on the number of tags $n$
2:        2. Confidence interval width $\epsilon$
3:        3. Threshold $t$
4:        4. Error probability $\beta$
5: Initialize $R$ to all categories; set $Q$, $U$, $V$ to $\emptyset$. Set $l = 1$.
6: while $R \ne \emptyset$ do
7:   If $l = 1$, compute the initial frame size $f$ by solving $f \cdot e^{-\bar{n}/f} = 5$. Otherwise, compute the frame size $f = \hat{n}$. If $f > f_{\max}$, set $f = f_{\max}$.
8:   Set $S$ to $\emptyset$. Select the tags in $R$ and issue a query cycle with frame size $f$; get $n_0$, $n_c$, $n_s$. Find the category with the largest population $v$ in the singleton slots. For each category which appears in the singleton slots with population $n_{s,i} > v \cdot \theta$ ($\theta$ is constant, $0 < \theta < 1$), add the category to $S$.
9:   while $S \ne \emptyset$ do
10:    Compute the essential frame size $f_i$ for each category $C_i \in S$; if $f_i > \hat{n}_i \cdot e$, then remove $C_i$ from $S$ to $V$. If $f_i > f_{\max}$, set $f_i = f_{\max}$. Obtain the frame size $f$ as the mid-value among the series of $f_i$.
11:    Select all tags in $S$, issue a query cycle with the frame size $f$, and compute the estimated tag size $\hat{n}_i$ and the averaged standard deviation $\sigma_i$ for each category $C_i \in S$. Detect the qualified category set $Q$ and the unqualified category set $U$. Set $S = S - Q - U$.
12:    if $U \ne \emptyset$ then
13:      Wipe out all categories unexplored in the singleton slots from $S$.
14:    end if
15:  end while
16:  $\hat{n} = \hat{n} - \sum_{C_i \in S'} \hat{n}_i$. $R = R - S'$, $l = l + 1$.
17: end while
18: Further verify the categories in $V$ and $Q$ for the accuracy constraint.
Therefore, after each query cycle of ensemble sampling, the unqualified and qualified categories can be immediately wiped out from the ensemble sampling. When at least one category is determined as unqualified, all of the categories in the current group which have not been explored in the singleton slots are wiped out immediately. The query cycles are then continuously issued over the undetermined categories in $R$ until $R = \emptyset$.

For example, suppose the threshold is set to 30, and after a query cycle of ensemble sampling the estimated numbers of tags for the categories are $\{120, 80, 65, 35, 28, 10, 8\}$. According to the standard deviations of the estimates for the various categories, the categories with estimated tag sizes of 120, 80, and 65 can be immediately determined as qualified, and the categories with estimated tag sizes of 10 and 8 can be immediately determined as unqualified. For the categories with estimated tag sizes 35 and 28, due to the current estimation error, we cannot yet determine whether they are qualified or unqualified; thus, another cycle of ensemble sampling is required for further verification.

During the ensemble sampling, if there exist some categories with tag sizes very close to the threshold $t$, then the required number of slots to verify the population constraint can be rather large. Thus, we compute the essential frame size $f_i$ for each category $C_i$ and compare it with the expected number of slots $\hat{n}_i \cdot e$ in basic tag identification. If $f_i > \hat{n}_i \cdot e$, then the category is removed from the set $S$ to $V$. We heuristically set the frame size $f$ to the mid-value among the series of $f_i$, such that after a query cycle about half of the categories can be determined as qualified/unqualified and thus wiped out quickly. Therefore, after the while loop, for each category $C_i \in V$, basic identification is used to obtain the exact tag size $n_i$; if $n_i \ge t$, $C_i$ is illustrated in the histogram. For each category $C_i \in Q$, the reader verifies whether it has satisfied the accuracy requirement; if so, $C_i$ is illustrated in the histogram and wiped out from $Q$. Then, ensemble sampling is further applied over the categories in $Q$ to satisfy the accuracy requirement, using the optimized grouping method.
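The classification rule above is easy to state in code. The sketch below (Python, scipy assumed; the $\sigma$ value of 5 and the labels $C_1, \ldots, C_7$ are made up for the demonstration, since the paper's example does not list them) splits a set of per-category estimates into qualified, unqualified, and undetermined sets, reproducing the outcome of the threshold-30 example under that assumption.

```python
# Classify categories against threshold t using the rules of Section 6.2:
# qualified   if n_hat >= t and sigma <= (n_hat - t) / Phi^{-1}(1 - beta)
# unqualified if n_hat <  t and sigma <= (t - n_hat) / Phi^{-1}(1 - beta)
# otherwise the category stays undetermined and is sampled again.
from scipy.stats import norm

def classify(estimates, t, beta=0.05):
    """estimates: dict mapping category -> (n_hat, sigma)."""
    q = norm.ppf(1 - beta)                       # Phi^{-1}(1 - beta)
    qualified, unqualified, undetermined = [], [], []
    for cat, (n_hat, sigma) in estimates.items():
        if n_hat >= t and sigma <= (n_hat - t) / q:
            qualified.append(cat)
        elif n_hat < t and sigma <= (t - n_hat) / q:
            unqualified.append(cat)
        else:
            undetermined.append(cat)
    return qualified, unqualified, undetermined

# Estimated sizes from the example, with an assumed sigma of 5 for every category.
example = {c: (n, 5.0) for c, n in
           {"C1": 120, "C2": 80, "C3": 65, "C4": 35, "C5": 28, "C6": 10, "C7": 8}.items()}
print(classify(example, t=30))
# -> C1, C2, C3 qualified; C6, C7 unqualified; C4, C5 undetermined
```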
7 ENSEMBLE SAMPLING FOR THE TOP-k QUERY

7.1 Motivation
In some applications, when the number of categories is fairly large, the users only focus on the major categories in the top-$k$ list in regard to the tag size. The top-$k$ query is then utilized to filter out those categories outside the top-$k$ list. In this situation, the separate counting scheme is especially not suitable: if a specified category is not in the top-$k$ list, it is unnecessary to address it for accurate tag size estimation. However, since the threshold $t$ for the top-$k$ list cannot be known in advance, the separate counting scheme cannot quickly decide which categories can be wiped out immediately.

Moreover, when the distribution around the $k$th ranking is fairly even, i.e., the sizes of the categories there are very close, it is rather difficult to determine which categories belong to the top-$k$ set. Based on this understanding, we utilize the probabilistic threshold top-$k$ query (PT-Topk query) to return a set of categories $Q$, each of which has a probability of at least $1-\beta$ $(0 < \beta \le 1)$ of being in the top-$k$ list. Therefore, the size of $Q$ is not necessarily exactly $k$. Hence, as the exact value of the tag size $n_i$ is unknown, in order to define $\Pr[C_i \in \text{top-}k \text{ list}]$, i.e., the probability that category $C_i$ is within the top-$k$ list in terms of tag size, it is essential to determine a threshold $t$ such that $\Pr[C_i \in \text{top-}k \text{ list}] = \Pr[n_i \ge t]$.
Ideally, $t$ should be the tag size of the $k$th largest category; however, it is rather difficult to compute an exact value of $t$ in the estimation scheme due to the randomness in the slotted ALOHA protocol. Therefore, according to the problem formulation in Section 4, we attempt to obtain an estimated value $\hat{t}$ such that the following constraints are satisfied:
$$\Pr[|\hat{n}_i - n_i| \le \epsilon \cdot n_i] \ge 1 - \beta \quad \text{accuracy constraint},$$
$$\Pr[|\hat{t} - t| \le \epsilon \cdot t] \ge 1 - \beta \quad \text{accuracy constraint of } \hat{t},$$
$$\Pr[\hat{n}_i < \hat{t} \mid n_i \ge \hat{t}] < \beta \quad \text{population constraint}, \qquad (9)$$
$$\Pr[\hat{n}_i \ge \hat{t} \mid n_i < \hat{t}] < \beta \quad \text{population constraint}. \qquad (10)$$
Therefore, if the threshold $\hat{t}$ can be accurately estimated, the top-$k$ query problem is reduced to the iceberg query problem. The population constraints (9) and (10) are respectively equivalent to the population constraints (4) and (5). It is then essential to quickly determine the value of the threshold $\hat{t}$ while satisfying the constraint $\Pr[|\hat{t} - t| \le \epsilon \cdot t] \ge 1 - \beta$. We rely on the following theorem to express the above constraint in the form of the variance.

Theorem 6. The constraint $\Pr[|\hat{t} - t| \le \epsilon \cdot t] \ge 1 - \beta$ is satisfied as long as $\mathrm{Var}(\hat{t} - t) \le \epsilon^2 \cdot t^2 \cdot \beta$.

Proof. See Appendix G, available in the online supplemental material. □
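The full proof of Theorem 6 is given in the online appendix; for intuition, if one additionally assumes that $\hat{t}$ is an (approximately) unbiased estimate of $t$, the bound follows from a standard Chebyshev argument:
$$\Pr\big[|\hat{t} - t| > \epsilon t\big] \;\le\; \frac{\mathrm{Var}(\hat{t} - t)}{(\epsilon t)^2} \;\le\; \frac{\epsilon^2 t^2 \beta}{\epsilon^2 t^2} \;=\; \beta, \qquad \text{hence } \Pr\big[|\hat{t} - t| \le \epsilon t\big] \ge 1 - \beta.$$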
7.2 Algorithm
According to Theorem 6, we utilize the ensemble sampling to quickly estimate the threshold $\hat{t}$. The intuition is as follows: after the first query cycle of ensemble sampling, we can estimate a confidence interval $[t_{low}, t_{up}]$ for the threshold $t$ according to the sampled distribution. Then, by wiping out those categories which are obviously qualified or unqualified to be in the top-$k$ list, the width of the confidence interval can be quickly reduced. As the approximated threshold $\hat{t}$ is selected within the confidence interval, after a number of query cycles of ensemble sampling, when the width is below a certain threshold, the estimated value $\hat{t}$ can be close enough to the exact threshold $t$.

Based on the above analysis, we propose an algorithm for the top-$k$ query problem in Algorithm 3. In the beginning, a while loop is utilized to quickly identify an approximate value $\hat{t}$ for the threshold $t$. Suppose that the averaged estimated tag size and standard deviation for each category $C_i$ are respectively $\hat{n}_i$ and $\sigma_i$; if we use $p$ to denote a small constant value between 0 and 1, let $\eta = \Phi^{-1}(1 - \frac{p}{2})$. Then, given a fixed value of $p$, the $1-p$ confidence interval for $n_i$ is $[\hat{n}_i - \eta \cdot \sigma_i, \hat{n}_i + \eta \cdot \sigma_i]$. In each iteration, we respectively determine an upper bound $t_{up}$ and a lower bound $t_{low}$ for the threshold $t$, according to the $k$th largest category in the current ranking. Then, we respectively wipe out the qualified and unqualified categories according to the upper bound $t_{up}$ and the lower bound $t_{low}$. The value of $k$ is then decreased by the number of qualified categories. In this way, the threshold $t$ is guaranteed to be within the range $[t_{low}, t_{up}]$ with a probability of at least $1-p$. When $p \to 0$, then $t \in [t_{low}, t_{up}]$ with probability close to 100 percent. Moreover, an estimated threshold $\hat{t}$ is also selected within this range. Therefore, let the width $g = t_{up} - t_{low}$; then the variance of $\hat{t} - t$ is at most $g^2$. In order to guarantee that $\mathrm{Var}(\hat{t} - t) \le \epsilon^2 t^2 \beta$, it is essential to ensure $g^2 \le \epsilon^2 t^2 \beta$. As the ensemble sampling is continuously issued over the categories in $R$, the standard deviation $\sigma_i$ for each category $C_i \in R$ continuously decreases. Furthermore, as the qualified/unqualified categories are continuously wiped out, the upper bound $t_{up}$ continuously decreases while the lower bound $t_{low}$ continuously increases. The width of the range $[t_{low}, t_{up}]$ is thus continuously decreasing. The while loop continues until $g^2 \le \epsilon^2 \hat{t}^2 \beta$. Then, after the estimated threshold $\hat{t}$ is computed, the iceberg query is further applied over those categories with the threshold $\hat{t}$.

Algorithm 3. Algorithm for the PT-Topk Query Problem
1: INPUT: 1. Upper bound $\bar{n}$ on the number of tags $n$
2:        2. Confidence interval width $\epsilon$
3:        3. The value of $k$
4:        4. Error probability $\beta$
5: Initialize $R$ to all categories, set $l = 1$, $\eta = \Phi^{-1}(1 - \frac{p}{2})$.
6: while true do
7:   Issue a query cycle to apply ensemble sampling over all categories in $R$. Compute the statistical average values and standard deviations as $\hat{n}_i$ and $\sigma_i$.
8:   Rank the categories in $R$ according to the value of $\hat{n}_i + \eta \cdot \sigma_i$ for each identified category $C_i$. Find the $k$th largest category $C_i$ and set $t_{up} = \hat{n}_i + \eta \cdot \sigma_i$. Detect the qualified categories $Q$ with threshold $t_{up}$.
9:   Rank the categories in $R$ according to the value of $\hat{n}_i - \eta \cdot \sigma_i$ for each identified category $C_i$. Find the $k$th largest category $C_i$ and set $t_{low} = \hat{n}_i - \eta \cdot \sigma_i$. Detect the unqualified categories $U$ with threshold $t_{low}$.
10:  Wipe out the qualified/unqualified categories from $R$: $R = R - Q - U$. Suppose the number of qualified categories in the current cycle is $q$; set $k = k - q$.
11:  Rank the categories in $R$ according to the value of $\hat{n}_i$ for each identified category $C_i$. Find the $k$th largest category $C_i$ and set $\hat{t} = \hat{n}_i$. Set $g = t_{up} - t_{low}$, $l = l + 1$.
12:  if $g^2 \le \epsilon^2 \cdot \beta \cdot \hat{t}^2$ then
13:    Break the while loop.
14:  end if
15: end while
16: Apply the iceberg query with threshold $\hat{t}$ over the undetermined categories $R$ and the qualified categories $Q$.

For example, suppose the value of $k$ is 5, and after a query cycle of ensemble sampling the estimated numbers of tags for the various categories, ranked in decreasing order, are $\{C_1{:}120, C_2{:}85, C_3{:}67, C_4{:}50, C_5{:}48, C_6{:}45, C_7{:}20, C_8{:}15\}$. The thresholds $t_{up}$ and $t_{low}$ are respectively set to 68 and 28 according to the fifth largest category. The categories with tag sizes 120 and 85 can be determined as qualified, since their tag sizes are above the threshold $t_{up}$; the categories with tag sizes 20 and 15 can be determined as unqualified, since their tag sizes are below the threshold $t_{low}$. The remaining categories are $C_3$, $C_4$, $C_5$, and $C_6$; we hence need another cycle of ensemble sampling to further verify the threshold according to the third largest category.
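A compact sketch of this threshold-narrowing loop is given below (Python, scipy assumed). The per-cycle estimation is abstracted behind a hypothetical `sample_estimates` callback, since it depends on the ensemble-sampling machinery of the earlier sections, and the pruning tests based on the confidence bounds are conservative stand-ins for "detect the qualified/unqualified categories"; only the ranking, pruning, and stopping rule of Algorithm 3 are illustrated.

```python
# Sketch of Algorithm 3's loop for narrowing down the top-k threshold t_hat.
# sample_estimates(R) is a hypothetical callback returning {category: (n_hat, sigma)},
# refined by one more cycle of ensemble sampling over the categories in R.
from scipy.stats import norm

def estimate_topk_threshold(categories, k, sample_estimates,
                            epsilon=0.2, beta=0.05, p=0.01):
    eta = norm.ppf(1 - p / 2)                      # eta = Phi^{-1}(1 - p/2)
    R, Q = set(categories), set()

    while True:
        est = sample_estimates(R)                  # one more ensemble-sampling cycle
        upper = {c: est[c][0] + eta * est[c][1] for c in R}
        lower = {c: est[c][0] - eta * est[c][1] for c in R}

        t_up = sorted(upper.values(), reverse=True)[k - 1]   # k-th largest upper bound
        t_low = sorted(lower.values(), reverse=True)[k - 1]  # k-th largest lower bound

        # Conservative pruning: surely-above and surely-below categories are removed.
        qualified = {c for c in R if lower[c] >= t_up}
        unqualified = {c for c in R if upper[c] <= t_low}
        R -= qualified | unqualified
        Q |= qualified
        k -= len(qualified)
        if k <= 0 or k > len(R):
            return t_up, Q, R                      # top-k resolved by pruning alone

        t_hat = sorted((est[c][0] for c in R), reverse=True)[k - 1]
        g = t_up - t_low
        if g * g <= epsilon ** 2 * beta * t_hat ** 2:
            return t_hat, Q, R                     # threshold, qualified, undetermined
```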
8 DISCUSSION ON PRACTICAL ISSUES

8.1 Time-Efficiency
As mentioned in the problem formulation, the most critical factor for the histogram collection problem is the time efficiency. In regard to the basic histogram collection, the time delay is mainly impacted by two factors: 1) the number of categories $m$, and 2) the category with the smallest tag size, say $n_i$, inside the group for ensemble sampling. Generally, as the number of categories $m$ increases, the number of groups and the essential number of slots for each ensemble sampling increase, causing the time delay to grow. Besides, the category with the smallest tag size $n_i$ directly decides the essential frame size inside the group: the larger the gap among the tag sizes of the categories in the same group, the lower the achieved time efficiency.

In regard to the iceberg query and the top-$k$ query, the time delay mainly depends on the number of categories with tag sizes close to the threshold $t$. Due to the variance in tag size estimation, a relatively large number of slots is required to verify whether the specified categories have tag sizes over the threshold $t$. For the top-$k$ query, additional time delay is required to estimate the threshold $t$ corresponding to the top-$k$ query.

8.2 Interference Factors in Realistic Settings
In realistic settings of various applications, there may exist several interference factors which hinder the actual performance of histogram collection. These practical issues mainly include path loss, the multi-path effect, and mutual interference. In the following, we elaborate on the detailed techniques to effectively tackle these problems.

Path loss. Path loss is common in RFID-based applications and may lead to probabilistic backscattering [7] in RFID systems, even if the tags are placed in the reader's effective scanning range. In such a scenario, the tags may reply in each query cycle with a certain probability instead of 100 percent. Therefore, in regard to the tag-counting protocols in our solutions, we need to estimate this probability via statistical tests in the particular application scenario. In this way, we can accurately estimate the number of tags according to the probability obtained in advance.

Multi-path effect. The multi-path effect is especially common in indoor applications. Due to the multi-path effect, some tags cannot be effectively activated, as the forwarding waves may offset each other even within the effective scanning range of the RFID system. To mitigate the multi-path effect, we can use a mobile reader to continuously interrogate the surrounding tags, so that the multi-path profile keeps changing. In this way, the tags are expected to have more chances to be activated at least once during the continuous scanning [8].

Mutual interference. If the tags are placed too close to each other, they may reach a critical state of mutual interference [34] such that neither of the tags can be effectively activated. This is mainly caused by the coupling effect when the reader's power is adjusted to a certain value. Hence, in order to mitigate the mutual interference among RFID tags, we should carefully tune the transmission power of the reader so as to avoid the critical state among tags. A suitable power stepping method should be leveraged to sufficiently reduce the mutual interference among all tags.
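For the path-loss issue discussed above, the simplest correction consistent with the description is to scale the estimate by the measured reply probability. The snippet below is our own illustration (the reply probability `p_reply` is assumed to have been measured offline via the statistical tests mentioned above), not a procedure specified by the paper.

```python
# Illustrative correction for probabilistic backscattering: if each tag replies in a
# query cycle with probability p_reply (measured in advance), the raw estimate counts
# only the replying tags on average, so we rescale it.
def correct_for_path_loss(raw_estimate, p_reply):
    if not 0 < p_reply <= 1:
        raise ValueError("p_reply must be in (0, 1]")
    return raw_estimate / p_reply

print(correct_for_path_loss(raw_estimate=830, p_reply=0.83))  # ~1000 tags
```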
8.3 Overhead from Tag Identification
In our ensemble sampling-based solution, we conduct efficient sampling over the singleton slots to estimate the number of tags for the various categories. However, since the proposed scheme needs to identify the tag in a singleton slot and read the 96-bit EPC from the tag, it may incur a high communication overhead for ensemble sampling. We thus conducted real experiments with the USRP N210 platform to evaluate the ratio of tags that are identified during the whole process of collecting histograms. We respectively test the slot ratio (the ratio of the number of singleton slots to the total number of slots) and the time ratio (the ratio of the overall time interval of the singleton slots to the total time duration). In the experiment, we use an Alien reader to interrogate 50 tags and use the USRP N210 as a sniffer to capture the detailed information in the physical layer; we average the experiment results over 50 repeated tests. According to the real experiment results, we find that the average slot ratio is 33 percent, which is lower than the 36.8 percent of the ideal case when the frame size is set to an optimal value. We further find that the average time ratio is 62 percent, which implies that the singleton slots occupy a considerable proportion of the overall scanning time.

In order to sufficiently reduce the identification overhead in singleton slots, we can make a slight modification to the C1G2 protocol as follows: each tag embeds its category ID into the RN16 response. In this way, during the process of collecting histograms, each tag only needs to reply with the RN16 random number in the selected slot instead of the exact EPC ID, and the high overhead of identification can be effectively avoided. We further evaluate the average time ratio for this new method and find that it can be reduced from 62 to 44 percent, which is much closer to the slot ratio.

9 PERFORMANCE EVALUATION
We conducted simulations in Matlab with the following scenario: there exist $m$ categories in total, and we randomly generate the tag size for each category according to the normal distribution $N(\mu, \sigma)$. We set the default values for the following parameters: in regard to the accuracy constraint and the population constraint, we set $1 - \beta = 95\%$ and $\epsilon = 0.2$. The average time interval for each slot is $t_s = 1\,\mathrm{ms}$, and the inter-cycle overhead is $t_c = 43\,\mathrm{ms}$. We compare our solutions with two basic strategies: the basic tag identification (BI) and the separate counting (SC) (explained in Section 5). All results are averaged over 500 independent trials.

9.1 Evaluate the Actual Variance in Ensemble Sampling
In order to verify the correctness of the derivation of the variance of the SE estimator, i.e., $\delta_i$ in Eq. (7), we conduct simulations and evaluate the actual variances in ensemble sampling, thus quantifying the tightness between the derived value of $\delta_i$ and the measured value in the simulation studies. We conduct ensemble sampling on 5,500 tags for 200 cycles. For each query cycle, the frame size $f$ is set to 5,500. We look into a category $C_i$ with tag size $n_i = 100$. In Fig. 4a, we plot the estimated value of $n_i$ in each cycle, while the expected values of $n_i - \sigma_i$ and $n_i + \sigma_i$ are respectively illustrated by the red line and the green line. We observe that the estimated value $\hat{n}_i$ mostly fluctuates within the interval $(n_i - \sigma_i, n_i + \sigma_i)$. In Fig. 4b, we further compare the