2422 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL.26.NO.9.SEPTEMBER 2015 Histogram RFID-based Application Scenario 2 RELATED WORK 10 In RFID systems,a reader needs to receive data from multi- ple tags,while the tags are unable to self-regulate their radio 10 threshold transmissions to avoid collisions;then,a series of slotted =50 ALOHA-based anti-collision protocols [1],[4],[5],[6],[7], [8],[16],[17]are designed to efficiently identify tags in RFID systems.In order to deal with the collision problems in CI C2 C3 C4 C5 C6 C7 C8 C9 C10 multi-reader RFID systems,scheduling protocols for reader Category ID activation are explored in the literature [18],[19].Recently, Fig.1.An example of collecting histogram over RFID tags a number of polling protocols [201,[211,[22]are proposed, aiming to collect information from battery-powered active hundreds.Therefore,by separately estimating the tag sizes tags in an energy efficient approach. Recent research is focused on the collection of statistical over each category,a large number of query cycles and slots information over the RFID tags [2],[9],[10],[11],[23],[24], are required.Besides,in applications like the iceberg query [25],[26],[27].The authors mainly consider the problem of and the top-k query,only those major categories are essen- tial to be addressed.In this situation,the separate estimate estimating the number of tags without collecting the tag IDs. Murali et al.provide very fast and reliable estimation mech- approach will waste a lot of scanning time over those minor categories in the long tail.Therefore,a novel scheme is anisms for tag quantity in a more practical approach [9].Li essential to quickly collect the histograms over the massive et al.study the RFID estimation problem from the energy angle [23].Their goal is to reduce the amount of energy that RFID tags.In this paper,we propose a series of protocols to tackle the problem of efficient histogram collection.The is consumed by the tags during the estimation procedure. main contributions of this paper are listed as follows(a pre- Shahzad et al.propose a new scheme for estimating tag pop- liminary version of this work appeared in [151): ulation size called average run based tag estimation(ART) [2].Chen et al.aim to gain deeper and fundamental insights 1) To the best of our knowledge,we are the first to con- in RFID counting protocols [271,they manage to design sider the problem of collecting histograms and its near-optimal protocols that are more efficient than existing applications (i.e.,iceberg query and top-k query) ones and simultaneously simpler than most of them.Liu over RFID tags,which is a fundamental premise for et al.investigate efficient distributed query processing in answering aggregate queries and data mining over large RFID-enabled supply chains [28].Liu et al.propose a RFID-based applications. novel solution to fast count the key tags in anonymous RFID 2) In order to achieve time efficiency,we propose a systems [29].Luo et al.tackle an interesting problem,called novel,ensemble sampling (ES)-based method to multigroup threshold based classification [25],which is to simultaneously estimate the tag size for a number of determine whether the number of objects in each group is categories.This framework is very flexible and com- above or below a prescribed threshold value.Sheng et al. patible to current tag-counting estimators,which can consider the problem of identifying popular categories of be efficiently leveraged to estimate the tag size for RFID tags out of a large collection of tags [11],while the set each category.While achieving time-efficiency,our of category IDs are supposed to be known.Different from solutions are completely compatible with current the previous work,in this paper,our goal is to collect the his- industry standards,i.e.,the EPCglobal C1G2 stand- tograms for all categories over RFID tags in a time-efficient ards,and do not require any tag modifications. approach,without any priori knowledge of the categories. 3) In order to tackle the histogram collection with a Specifically,we respectively consider the basic histogram filter condition,we propose an effective solution collection problem,the iceberg query problem,and the top-k for the iceberg query problem.By considering the query problem in regard to collecting histograms in large- population and accuracy constraint,we propose an scale RFID systems.We aim to propose a flexible and com- efficient algorithm to wipe out the unqualified cat- patible framework for current tag-counting estimators based egories in time,especially those categories in the on slotted ALOHA protocol,which can be efficiently lever- long tail.We further present an effective solution aged to estimate the tag size for each category. to tackle the top-k query problem.We use ensemble sampling to quickly estimate the threshold corre- 3 PRELIMINARY sponding to the kth largest category,and reduce it 3.1 The Framed Slotted ALOHA Protocol to the iceberg query problem. In the Class 1 Gen 2 standard,the RFID system leverages the The remainder of the paper is as follows.Sections 2 framed slotted ALOHA protocol to resolve the collisions for tag and 3 present the related work and RFID preliminary, identification.When a reader wishes to read a set of tags,it respectively.We formulate our problem in Section 4,and first powers up and transmits a continuous wave to ener- present our ensemble sampling-based method for the gize the tags.It then initiates a series of frames,varying the basic histogram collection in Section 5.We further present number of slots in each frame to best accommodate the our solutions for the iceberg query and the top-k query,number of tags.Each frame has a number of slots and each respectively,in Sections 6 and 7.In Section 8,we provide active tag will reply in a randomly selected slot per frame. performance analysis in time-efficiency.The performance After all tags are read,the reader powers down.We refer to evaluation is in Section 9,and we conclude in Section 10. the series of frames between power down periods as ahundreds. Therefore, by separately estimating the tag sizes over each category, a large number of query cycles and slots are required. Besides, in applications like the iceberg query and the top-k query, only those major categories are essential to be addressed. In this situation, the separate estimate approach will waste a lot of scanning time over those minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. The main contributions of this paper are listed as follows (a preliminary version of this work appeared in [15]): 1) To the best of our knowledge, we are the first to consider the problem of collecting histograms and its applications (i.e., iceberg query and top-k query) over RFID tags, which is a fundamental premise for answering aggregate queries and data mining over RFID-based applications. 2) In order to achieve time efficiency, we propose a novel, ensemble sampling (ES)-based method to simultaneously estimate the tag size for a number of categories. This framework is very flexible and compatible to current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. While achieving time-efficiency, our solutions are completely compatible with current industry standards, i.e., the EPCglobal C1G2 standards, and do not require any tag modifications. 3) In order to tackle the histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraint, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution to tackle the top-k query problem. We use ensemble sampling to quickly estimate the threshold corresponding to the kth largest category, and reduce it to the iceberg query problem. The remainder of the paper is as follows. Sections 2 and 3 present the related work and RFID preliminary, respectively. We formulate our problem in Section 4, and present our ensemble sampling-based method for the basic histogram collection in Section 5. We further present our solutions for the iceberg query and the top-k query, respectively, in Sections 6 and 7. In Section 8, we provide performance analysis in time-efficiency. The performance evaluation is in Section 9, and we conclude in Section 10. 2 RELATED WORK In RFID systems, a reader needs to receive data from multiple tags, while the tags are unable to self-regulate their radio transmissions to avoid collisions; then, a series of slotted ALOHA-based anti-collision protocols [1], [4], [5], [6], [7], [8], [16], [17] are designed to efficiently identify tags in RFID systems. In order to deal with the collision problems in multi-reader RFID systems, scheduling protocols for reader activation are explored in the literature [18], [19]. Recently, a number of polling protocols [20], [21], [22] are proposed, aiming to collect information from battery-powered active tags in an energy efficient approach. Recent research is focused on the collection of statistical information over the RFID tags [2], [9], [10], [11], [23], [24], [25], [26], [27]. The authors mainly consider the problem of estimating the number of tags without collecting the tag IDs. Murali et al. provide very fast and reliable estimation mechanisms for tag quantity in a more practical approach [9]. Li et al. study the RFID estimation problem from the energy angle [23]. Their goal is to reduce the amount of energy that is consumed by the tags during the estimation procedure. Shahzad et al. propose a new scheme for estimating tag population size called average run based tag estimation (ART) [2]. Chen et al. aim to gain deeper and fundamental insights in RFID counting protocols [27], they manage to design near-optimal protocols that are more efficient than existing ones and simultaneously simpler than most of them. Liu et al. investigate efficient distributed query processing in large RFID-enabled supply chains [28]. Liu et al. propose a novel solution to fast count the key tags in anonymous RFID systems [29]. Luo et al. tackle an interesting problem, called multigroup threshold based classification [25], which is to determine whether the number of objects in each group is above or below a prescribed threshold value. Sheng et al. consider the problem of identifying popular categories of RFID tags out of a large collection of tags [11], while the set of category IDs are supposed to be known. Different from the previous work, in this paper, our goal is to collect the histograms for all categories over RFID tags in a time-efficient approach, without any priori knowledge of the categories. Specifically, we respectively consider the basic histogram collection problem, the iceberg query problem, and the top-k query problem in regard to collecting histograms in largescale RFID systems. We aim to propose a flexible and compatible framework for current tag-counting estimators based on slotted ALOHA protocol, which can be efficiently leveraged to estimate the tag size for each category. 3 PRELIMINARY 3.1 The Framed Slotted ALOHA Protocol In the Class 1 Gen 2 standard, the RFID system leverages the framed slotted ALOHA protocol to resolve the collisions for tag identification. When a reader wishes to read a set of tags, it first powers up and transmits a continuous wave to energize the tags. It then initiates a series of frames, varying the number of slots in each frame to best accommodate the number of tags. Each frame has a number of slots and each active tag will reply in a randomly selected slot per frame. After all tags are read, the reader powers down. We refer to the series of frames between power down periods as a Fig. 1. An example of collecting histogram over RFID tags 2422 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015