IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL 26,NO.9,SEPTEMBER 2015 2421 Efficient Protocols for Collecting Histograms in Large-Scale RFID Systems Lei Xie,Member,IEEE,Hao Han,Member,IEEE,Qun Li,Member,IEEE, Jie Wu,Fellow,IEEE,and Sanglu Lu,Member,IEEE Abstract-Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications.In this paper we consider an efficient collection of histograms from the massive number of RFID tags, without the need to read all tag data.In order to achieve time efficiency,we propose a novel,ensemble sampling-based method to simultaneously estimate the tag size for a number of categories.We first consider the problem of basic histogram collection,and propose an efficient algorithm based on the idea of ensemble sampling.We further consider the problems of advanced histogram collection,respectively,with an iceberg query and a top-k:query.Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified.This ensemble sampling-based framework is very flexible and compatible to current tag-counting estimators,which can be efficiently leveraged to estimate the tag size for each category. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes. Index Terms-Algorithms,RFID,time efficiency,histogram INTRODUCTION Win have lep me prv in a book store,etc.Collecting histogram can be used to illustrate the tag population belonging to each category,and spaces in increasingly large numbers.In applications like determine whether the number of tags in a category is warehouse monitoring,the items are attached with RFID above or below any desired threshold.By setting this tags,and are densely packed into boxes.As the maximum threshold,it is easy to find popular merchandise and control scanning range of a UHF RFID reader is usually 6-10 m,the stock,e.g.,automatically signaling when more products overall number of tags within this three-dimensional space need to be put on the shelf.Furthermore,the histogram can can be up to tens of thousands in a dense deployment sce- be used for approximate answering of aggregate queries nario,as envisioned in [1],[2],[3].Many tag identification [121,[131,as well as preprocessing and mining association protocols [41,[5],[61,[7],[8]are proposed to uniquely iden- rules in data mining [14].Therefore,collecting histograms tify the tags one by one through anti-collision schemes. over RFID tags is an essential premise for effective queries However,in a number of applications,only some useful sta-and analysis in conventional RFID-based applications. tistical information is essential to be collected,such as the Fig.1 shows an example for collecting histogram over the overall tag size [2],[91,[10],popular categories [11]and the RFID tags deployed in the application scenarios. histogram.In particular,histograms capture distribution While dealing with a large scale deployment with thou- statistics in a space-efficient fashion.In some applications, sands of tags,the traditional tag identification scheme is not such as a grocery store or a shipping portal,items are cate- suitable for histogram collection,since the scanning time is gorized according to some specified metrics,such as types proportional to the number of tags,which can be in the of merchandize,manufacturers,etc.A histogram is used to order of several minutes.As the overall tag size grows, illustrate the number of items in each category. reading each tag one by one can be rather time-consuming, In practice,tags are typically attached to objects belong- which is not scalable at all.As in most applications,the tags ing to different categories,e.g.,different brands and models are frequently moving into and out of the effective scanning of clothes in a large clothing store,different titles of books area.In order to capture the distribution statistics in time,it is essential to sacrifice some accuracy so that the main distri- L.Xie and S.Lu are with the State Key Laboratory for Novel Software bution can be obtained within a short time window-in the Technology,Nanjing University,China. order of several seconds.Therefore,we seek to propose an E-mail:(Ixie,sangluj@nju.edu.cn. estimation scheme to quickly count the tag sizes of each cat- H.Han and Q.Li are with the Department of Computer Science,College of William and Mary,Williamsburg,VA.E-mail:(hhan,liqun@cs.wm.edu. egory while achieving the accuracy requirement. ● I.Wu is with the Department of Computer Information and Sciences, In most cases,the tag sizes of various categories are sub- Temple University.E-mail:jiewu@temple.edu. ject to some skewed distribution with a "long tail",such as Manuscript received 10 Mar.2014;revised 12 Aug.2014;accepted 4 Sept. the Gaussian distribution.The long tail represents a large 2014.Date of publication 10 Sept.2014;date of current version 7 Aug.2015. number of categories,each of which occupies a rather small Recommended for acceptance by S.Guo. For information on obtaining reprints of this article,please send e-mail to: percentage among the total categories.While handling the reprints@ieee.org,and reference the Digital Object Identifier below. massive number of tags,in the order of several thousands, Digital Object Identifier no.10.1109/TPDS.2014.2357021 the overall number of categories in the long tail could be in 1045-2192014EEE Personal mission See http://www.ieee.org/publiEfficient Protocols for Collecting Histograms in Large-Scale RFID Systems Lei Xie, Member, IEEE, Hao Han, Member, IEEE, Qun Li, Member, IEEE, Jie Wu, Fellow, IEEE, and Sanglu Lu, Member, IEEE Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications. In this paper we consider an efficient collection of histograms from the massive number of RFID tags, without the need to read all tag data. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag size for a number of categories. We first consider the problem of basic histogram collection, and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-k query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. This ensemble sampling-based framework is very flexible and compatible to current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes. Index Terms—Algorithms, RFID, time efficiency, histogram Ç 1 INTRODUCTION WITH the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags, and are densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10 m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1], [2], [3]. Many tag identification protocols [4], [5], [6], [7], [8] are proposed to uniquely identify the tags one by one through anti-collision schemes. However, in a number of applications, only some useful statistical information is essential to be collected, such as the overall tag size [2], [9], [10], popular categories [11] and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion. In some applications, such as a grocery store or a shipping portal, items are categorized according to some specified metrics, such as types of merchandize, manufacturers, etc. A histogram is used to illustrate the number of items in each category. In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting histogram can be used to illustrate the tag population belonging to each category, and determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries [12], [13], as well as preprocessing and mining association rules in data mining [14]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications. Fig. 1 shows an example for collecting histogram over the RFID tags deployed in the application scenarios. While dealing with a large scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags, which can be in the order of several minutes. As the overall tag size grows, reading each tag one by one can be rather time-consuming, which is not scalable at all. As in most applications, the tags are frequently moving into and out of the effective scanning area. In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window–in the order of several seconds. Therefore, we seek to propose an estimation scheme to quickly count the tag sizes of each category while achieving the accuracy requirement. In most cases, the tag sizes of various categories are subject to some skewed distribution with a “long tail”, such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies a rather small percentage among the total categories. While handling the massive number of tags, in the order of several thousands, the overall number of categories in the long tail could be in L. Xie and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. E-mail: {lxie, sanglu}@nju.edu.cn. H. Han and Q. Li are with the Department of Computer Science, College of William and Mary, Williamsburg, VA. E-mail: {hhan, liqun}@cs.wm.edu. J. Wu is with the Department of Computer Information and Sciences, Temple University. E-mail: jiewu@temple.edu. Manuscript received 10 Mar. 2014; revised 12 Aug. 2014; accepted 4 Sept. 2014. Date of publication 10 Sept. 2014; date of current version 7 Aug. 2015. Recommended for acceptance by S. Guo. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2014.2357021 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015 2421 1045-9219 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information