Efficiently Collecting Histograms Over RFID Tags Lei Xief,Hao Hant,Qun Lit,Jie Wus,and Sanglu Lut fState Key Laboratory for Novel Software Technology,Nanjing University,China Department of Computer Science,College of William and Mary,USA SDepartment of Computer Information and Sciences,Temple University,USA Email:TIxie@nju.edu.cn,fhhan,liqun@cs.wm.edu,3jiewu@temple.edu,sanglu@nju.edu.cn Abstract-Collecting histograms over RFID tags is an essential While dealing with a large scale deployment with thousands premise for effective aggregate queries and analysis in large- of tags,the traditional tag identification scheme is not suitable scale RFID-based applications.In this paper we consider efficient for histogram collection,since the scanning time is proportional collection of histograms from the massive number of RFID tags without the need to read all tag data.We first consider the problem to the number of tags,which can be in the order of several of basic histogram collection and propose an efficient algorithm minutes.As in most applications,the tags are frequently based on the idea of ensemble sampling.We further consider the moving into and out of the effective scanning area.In order problems of advanced histogram collection,respectively,with an to capture the distribution statistics in time,it is essential to iceberg query and a top-k query.Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified sacrifice some accuracy so that the main distribution can be categories can be quickly identified.Experiment results indi- obtained within a short time window in the order of several cate that our ensemble sampling-based solutions can achieve a seconds.Therefore,we seek to propose an estimation scheme much better performance than the basic estimation/identification to quickly count the tag sizes of each category,while achieving schemes. the accuracy requirement. Index Terms-Algorithms;RFID;Time efficiency;Histogram In most cases,the tag sizes of various categories are subject to some skew distribution with a "long tail".such as the I.INTRODUCTION Gaussian distribution.The long tail represents a large number With the rapid proliferation of RFID-based applications, of categories,each of which occupies a rather small percentage RFID tags have been deployed into pervasive spaces in increas- among the total categories.While handling the massive tags in ingly large numbers.In applications like warehouse monitoring, the order of several thousands,the overall number of categories the items are attached with RFID tags and densely packed in the long tail could be in hundreds.Therefore,by separately into boxes.As the maximum scanning range of a UHF RFID estimating the tag sizes over each category,a large number of reader is usually 6-10m,the overall number of tags within query cycles and slots are required.Besides,in applications this three-dimensional space can be up to tens of thousands like the iceberg query and the top-k query,only those major in a dense deployment scenario,as envisioned in [1-3].Many categories are essential to be addressed.In this situation,the identification protocols are proposed to uniquely identify the separate estimate approach will waste a lot of scanning time tags through anti-collision schemes.However,in a number of over those minor categories in the long tail.Therefore,a novel applications,only some useful statistical information is essen- scheme is essential to quickly collect the histograms over the tial to be collected,such as the overall tag size [2,4],popular massive RFID tags.In this paper,we propose a series of categories [5],and the histogram.In particular,histograms protocols to tackle the problem of efficient histogram collection. capture distribution statistics in a space-efficient fashion. We make the following contributions. In practice,tags are typically attached to objects belonging to 1)To the best of our knowledge,we are the first to consider different categories,e.g.,different brands and models of clothes the problem of collecting histograms over RFID tags,which is in a large clothing store,different titles of books in a book a fundamental premise for queries and analysis in RFID-based store,etc.Collecting histograms can be used to illustrate the tag applications.In order to achieve time efficiency,we propose a population belonging to each category,and determine whether novel,ensemble sampling-based method to simultaneously esti- the number of tags in a category is above or below any desired mate the tag size for a number of categories.This framework is threshold.By setting this threshold,it is easy to find popular very flexible and compatible to current tag-counting estimators, merchandise and control stock,e.g.,automatically signaling which can be efficiently leveraged to estimate the tag size for when more products need to be put on the shelf.Furthermore, each category.2)In order to tackle the histogram collection the histogram can be used for approximate answering of aggre- with a filter condition,we propose an effective solution for gate queries,as well as preprocessing and mining association the iceberg query problem.By considering the population and rules in data mining [6].Therefore,collecting histograms over accuracy constraint,we propose an efficient algorithm to wipe RFID tags is an essential premise for effective queries and out the unqualified categories in time,especially those cate- analysis in conventional RFID-based applications. gories in the long tail.We further present an effective solution 978-1-4799-3360-0/14/$31.00⊙2014IEEEEfficiently Collecting Histograms Over RFID Tags Lei Xie†, Hao Han‡, Qun Li‡, Jie Wu§, and Sanglu Lu† †State Key Laboratory for Novel Software Technology, Nanjing University, China ‡Department of Computer Science, College of William and Mary, USA §Department of Computer Information and Sciences, Temple University, USA Email: †lxie@nju.edu.cn, ‡{hhan,liqun}@cs.wm.edu, §jiewu@temple.edu, †sanglu@nju.edu.cn Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in largescale RFID-based applications. In this paper we consider efficient collection of histograms from the massive number of RFID tags without the need to read all tag data. We first consider the problem of basic histogram collection and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-𝑘 query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes. Index Terms—Algorithms; RFID; Time efficiency; Histogram I. INTRODUCTION With the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags and densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1–3]. Many identification protocols are proposed to uniquely identify the tags through anti-collision schemes. However, in a number of applications, only some useful statistical information is essential to be collected, such as the overall tag size [2, 4], popular categories [5], and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion. In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting histograms can be used to illustrate the tag population belonging to each category, and determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries, as well as preprocessing and mining association rules in data mining [6]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications. While dealing with a large scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags, which can be in the order of several minutes. As in most applications, the tags are frequently moving into and out of the effective scanning area. In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window in the order of several seconds. Therefore, we seek to propose an estimation scheme to quickly count the tag sizes of each category, while achieving the accuracy requirement. In most cases, the tag sizes of various categories are subject to some skew distribution with a “long tail”, such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies a rather small percentage among the total categories. While handling the massive tags in the order of several thousands, the overall number of categories in the long tail could be in hundreds. Therefore, by separately estimating the tag sizes over each category, a large number of query cycles and slots are required. Besides, in applications like the iceberg query and the top-𝑘 query, only those major categories are essential to be addressed. In this situation, the separate estimate approach will waste a lot of scanning time over those minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. We make the following contributions. 1) To the best of our knowledge, we are the first to consider the problem of collecting histograms over RFID tags, which is a fundamental premise for queries and analysis in RFID-based applications. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag size for a number of categories. This framework is very flexible and compatible to current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. 2) In order to tackle the histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraint, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution 978-1-4799-3360-0/14/$31.00 ⃝c 2014 IEEE