2009 15th International Conference on Parallel and Distributed Systems A Decentralized Storage Scheme for Multi-Dimensional Range Queries over Sensor Networks Lei Xie,Lijun Chen,Daoxu Chen,Li Xie State Key Laboratory of Novel Software Technology Department of Computer Science,Nanjing University,Nanjing,China Email:xielei@dislab.nju.edu.cn,[chenlj.cdx,xieli @nju.edu.cn Abstract-This paper presents the design of a decentralized efficiently handle the event insertions and range query pro- storage scheme to support multi-dimensional range queries over cessing in an optimized approach.Contributions of this paper sensor networks.We build a distributed k-d tree based index are listed as follows: structure over sensor network,so as to efficiently map high dimensional event data to a two-dimensional space of sensors 1.We construct the k-d tree based index structure according while preserving the proximity of events.We propose a dynamic to event and range query distributions at sensor data level programming based methodology to control the granularity of instead of the two-dimensional sensor locations,thus we the index tree in an optimized approach,and an optimized can well utilize data level information to facilitate the event routing scheme for range query processing to achieve best energy efficiency.The simulation results demonstrate the efficiency of the insertion and query processing. design. 2.We reduce the problem of optimal routing for sub-queries to the Euclidean Steiner tree problem,which is NP-hard,and I.INTRODUCTION propose corresponding approximation algorithm to solve it. The rest of the paper is organized as follows.Section II In most sensor network applications,data or events are reviews the related work.Section IlI provides preliminaries continuously collected,forwarded and stored in sink nodes upon which the design theory can be built.The details of for further usage.Recently a Data Centric Storage (DCS) the proposed design methodology are given in Section IV. scheme is proposed to store event data in specific sensor nodes In Section V,we present the multi-dimensional event storage according to the "event type",leveraging the"in-network stor- and query processing mechanism.Section VI presents the age"capability of sensor networks.The "event type"refers to simulation results.Finally,we summarize the paper in Section certain pre-defined constellations of attribute values,however, VII. earlier DCS schemes like GHT [1]conventionally deal with event types with unique attribute and only support "exact II.RELATED WORKS match"point queries.As sensor applications are evolving. Data centric storage schemes [2]mainly utilize a mapping it is mandatory for these event data to be represented by function such as geographic hash function to map event in- several different attributes,such as temperature,humidity, formation to specified geographic points,and leverage routing light,and barometric pressure,which may be referred to as algorithms like GPSR [1]to route messages to corresponding multi-dimensional events.One natural way to query for events storage nodes,these research works include GHT [1].DIMEN- of interest is to use multi-dimensional range queries over SIONs [3]and DIFS [4].GHT is the first work which utilizes these attributes.For example,users may impose queries like a geographic hash table as the mapping function.Based on "List all events that have temperatures between 50.F and 60.F, the primitives in GHT,DIMENSIONs is proposed to support and light levels between 10 and 20".However conventional multi-resolution storage and query processing,which allows DCS schemes are inadequate for managing and processing drill-down search for features within sensor network,while these multi-dimensional events,because processing a query DIFS allows range queries on a single key in addition to over multi-dimensional events is much more complex than other operations.And these above storage schemes are mainly processing a query on single-dimensional events.We need to suitable for processing one-dimensional events. figure out how to efficiently map these high dimensional events The basic problem that this paper addresses-multidimen- to a two-dimensional space of sensors while preserving the sional range queries-is typically solved in database systems proximity of events. using indexing techniques.Among these techniques k-d tree[5] Therefore an efficient,decentralized storage framework is as one of the well-known multi-dimensional binary search essential to facilitate multi-dimensional queries over sensor trees,is proposed to support associative searching works over networks.In this paper we propose a framework which can multi-dimensional data attributes.DIM6]draws its inspiration IEEE 1521-9097/09$26.0002009IEEE 166 Φcomputer DOI10.1109/ICPADS.2009.24 society Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
A Decentralized Storage Scheme for Multi-Dimensional Range Queries over Sensor Networks Lei Xie, Lijun Chen, Daoxu Chen, Li Xie State Key Laboratory of Novel Software Technology Department of Computer Science, Nanjing University, Nanjing, China Email: xielei@dislab.nju.edu.cn, {chenlj,cdx,xieli}@nju.edu.cn Abstract—This paper presents the design of a decentralized storage scheme to support multi-dimensional range queries over sensor networks. We build a distributed k-d tree based index structure over sensor network, so as to efficiently map high dimensional event data to a two-dimensional space of sensors while preserving the proximity of events. We propose a dynamic programming based methodology to control the granularity of the index tree in an optimized approach, and an optimized routing scheme for range query processing to achieve best energy efficiency. The simulation results demonstrate the efficiency of the design. I. INTRODUCTION In most sensor network applications, data or events are continuously collected, forwarded and stored in sink nodes for further usage. Recently a Data Centric Storage (DCS) scheme is proposed to store event data in specific sensor nodes according to the “event type”, leveraging the “in-network storage” capability of sensor networks. The “event type” refers to certain pre-defined constellations of attribute values, however, earlier DCS schemes like GHT [1] conventionally deal with event types with unique attribute and only support “exact match” point queries. As sensor applications are evolving, it is mandatory for these event data to be represented by several different attributes, such as temperature, humidity, light, and barometric pressure, which may be referred to as multi-dimensional events. One natural way to query for events of interest is to use multi-dimensional range queries over these attributes. For example, users may impose queries like “List all events that have temperatures between 50.F and 60.F, and light levels between 10 and 20”. However conventional DCS schemes are inadequate for managing and processing these multi-dimensional events, because processing a query over multi-dimensional events is much more complex than processing a query on single-dimensional events. We need to figure out how to efficiently map these high dimensional events to a two-dimensional space of sensors while preserving the proximity of events. Therefore an efficient, decentralized storage framework is essential to facilitate multi-dimensional queries over sensor networks. In this paper we propose a framework which can efficiently handle the event insertions and range query processing in an optimized approach. Contributions of this paper are listed as follows: 1. We construct the k-d tree based index structure according to event and range query distributions at sensor data level instead of the two-dimensional sensor locations, thus we can well utilize data level information to facilitate the event insertion and query processing. 2. We reduce the problem of optimal routing for sub-queries to the Euclidean Steiner tree problem, which is NP-hard, and propose corresponding approximation algorithm to solve it. The rest of the paper is organized as follows. Section II reviews the related work. Section III provides preliminaries upon which the design theory can be built. The details of the proposed design methodology are given in Section IV. In Section V, we present the multi-dimensional event storage and query processing mechanism. Section VI presents the simulation results. Finally, we summarize the paper in Section VII. II. RELATED WORKS Data centric storage schemes [2] mainly utilize a mapping function such as geographic hash function to map event information to specified geographic points, and leverage routing algorithms like GPSR [1] to route messages to corresponding storage nodes, these research works include GHT [1], DIMENSIONs [3] and DIFS [4]. GHT is the first work which utilizes a geographic hash table as the mapping function. Based on the primitives in GHT, DIMENSIONs is proposed to support multi-resolution storage and query processing, which allows drill-down search for features within sensor network, while DIFS allows range queries on a single key in addition to other operations. And these above storage schemes are mainly suitable for processing one-dimensional events. The basic problem that this paper addresses - multidimensional range queries - is typically solved in database systems using indexing techniques. Among these techniques k-d tree[5] as one of the well-known multi-dimensional binary search trees, is proposed to support associative searching works over multi-dimensional data attributes. DIM[6] draws its inspiration 2009 15th International Conference on Parallel and Distributed Systems 1521-9097/09 $26.00 © 2009 IEEE DOI 10.1109/ICPADS.2009.24 166 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
from the k-d tree index structure,and embeds the k-d tree at 0.25 and 0.75 which divide the ranges to [0,0.25),[0.25 into sensor networks in a decentralized approach.Furthermore, 0.5),[0.5,0.75),and [0.75,1).They repeat this procedure KDDCS [7]presents a load-balanced storage scheme over until it reaches the corresponding leaf node.As an example. sensor networks based on k-d tree for multi-dimensional range consider event E =(0.3,0.8),for this event,the zone code queries.POOL[8]addresses some weakness of DIM such over the aforementioned example k-d tree is code(ZE)=011. as hot spot and scalability problems,and further proposes When answering a range query,DIM finds all zones whose a storage scheme to group index nodes together as a data value ranges overlap with the query range and sends the pool to preserve the neighborhood property of nearby multi- query to index nodes of those zones.For example,consider dimensional data,however,their group mechanism are only a range query(0.3,0.7]0.5,0.d,[0.4,0.8,0.7,0.9]),the based on the greatest and the second greatest attribute values, storage regions of zone code 010,011,110,1111 will be which has not been proved to be efficient enough for high involved. dimensional data processing. Building upon this,DIM achieves energy efficiency by forcing events whose attribute values are "close"to be stored III.PRELIMINARIES at the same or nearby nodes.However,some performance A.Motivation issues still remain to be challenged for the traditional DIM: First,DIM constructs the k-d tree structure only based on the two-dimensional sensor locations,without leveraging the estimated distributions of various data attributes and range queries,thus it may cause inefficiency for non-uniform event 110 010 01 insertions and corresponding range queries,which are actually 品 conventional for sensor network applications.Second,DIM assigns a unique zone for each sensor node,hence as the network size increases,the zones have to be further split into smaller zones,more zones of sensors are inevitably involved 100 101 in the query processing,which cannot be energy efficient (a)DIM's zone tree (b)The zone node and boundaries of DIM for range query processing.Therefore how to control the granularity of the k-d tree based index structure is crucial Fig.1.Example of DIM(Distributed Index for Multi-dimensional data for energy efficiency issue.Third,DIM has not considered an optimized approach to route those split sub-queries to specified storage areas,since the query diffusion cost dominates the It is essential to construct an index structure to map overall energy consumption for multi-dimensional range query the multi-dimensional event data to two-dimensional sensors. processing,it is of utmost importance to devise an optimized Obviously a centralized index for multi-dimensional range routing scheme for query forwarding. queries is not feasible for energy-efficiency reasons of sen- Therefore in this paper we try to address the problems raised sor networks.DIM [6]is the first work to support multi- above and we propose a framework which can efficiently dimensional range queries in sensor networks,it utilizes a handle the event insertions and range query processing in an classical database index,the k-d tree [5],and embeds such optimized approach.Our storage framework is also based on an index in a sensor network.DIM divides sensor nodes into the k-d tree and we mainly focus on achieving the optimized separate zones and each zone contains one exact sensor node granularity of k-d tree structure over sensor networks for best as the index node.Fig.1(a)depicts an example of DIM's k-d energy efficiency,and we further investigate into the optimized tree for zones and Fig.I(b)shows the corresponding zones scheme for range query forwarding which dominates the over the sensor network.Each intermediate node over this overall energy consumption. k-d tree based index structure denotes the split parameters for further branching,which include the specified attribute B.Modelling Event Data Distribution Ai(1 i<m)from the attribute list (A1,...,Am),and Consider there exist k attributes A1,A2,...,Ak for sensor associative splitting ranges for Ai.Each leaf node represents data,for each attribute Ai,we have minimum value min(Vi) the event data within specified multi-dimensional ranges. and maximum value max(Vi),we normalize attribute A;'s We introduce the principle of k-d tree to process multi- value into the region [0,1]by calculating: dimensional attribute data with the following example:For = Vi-min(Vi) attributes (A1,...,Am),assume that the maximum depth of the max(V))-min(V (1) k-d tree is k,k is a multiple of m,and this value of k is known to every node.DIM assigns a k bit zone code to an event as And we represent the data distribution over attribute A;as follows.For i between I and m,if Ai<0.5,the i-th bit of Pi(r),which is exactly a probability density function (pdf) the zone code is assigned 0,else 1.For i between m+1 and over the region [0,1].thus we have 2m,if Ai-m <0.25 or Ai-m E 0.5,0.75),the i-th bit of the zone is assigned 0,else 1,because the next level divisions are Pi(x)dx =1 (2) 167 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
from the k-d tree index structure, and embeds the k-d tree into sensor networks in a decentralized approach. Furthermore, KDDCS [7] presents a load-balanced storage scheme over sensor networks based on k-d tree for multi-dimensional range queries. POOL[8] addresses some weakness of DIM such as hot spot and scalability problems, and further proposes a storage scheme to group index nodes together as a data pool to preserve the neighborhood property of nearby multidimensional data, however, their group mechanism are only based on the greatest and the second greatest attribute values, which has not been proved to be efficient enough for high dimensional data processing. III. PRELIMINARIES A. Motivation Fig. 1. Example of DIM (Distributed Index for Multi-dimensional data It is essential to construct an index structure to map the multi-dimensional event data to two-dimensional sensors. Obviously a centralized index for multi-dimensional range queries is not feasible for energy-efficiency reasons of sensor networks. DIM [6] is the first work to support multidimensional range queries in sensor networks, it utilizes a classical database index, the k-d tree [5], and embeds such an index in a sensor network. DIM divides sensor nodes into separate zones and each zone contains one exact sensor node as the index node. Fig.1(a) depicts an example of DIM’s k-d tree for zones and Fig.1(b) shows the corresponding zones over the sensor network. Each intermediate node over this k-d tree based index structure denotes the split parameters for further branching, which include the specified attribute Ai(1 ≤ i ≤ m) from the attribute list hA1, ..., Ami, and associative splitting ranges for Ai . Each leaf node represents the event data within specified multi-dimensional ranges. We introduce the principle of k-d tree to process multidimensional attribute data with the following example: For attributes hA1, ..., Ami, assume that the maximum depth of the k-d tree is k, k is a multiple of m, and this value of k is known to every node. DIM assigns a k bit zone code to an event as follows. For i between 1 and m, if Ai < 0.5, the i-th bit of the zone code is assigned 0, else 1. For i between m + 1 and 2m, if Ai−m < 0.25 or Ai−m ∈ [0.5, 0.75), the i-th bit of the zone is assigned 0, else 1, because the next level divisions are at 0.25 and 0.75 which divide the ranges to [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1). They repeat this procedure until it reaches the corresponding leaf node. As an example, consider event E = (0.3, 0.8), for this event, the zone code over the aforementioned example k-d tree is code (ZE) = 011. When answering a range query, DIM finds all zones whose value ranges overlap with the query range and sends the query to index nodes of those zones. For example, consider a range query h[0.3, 0.7], [0.5, 0.6], [0.4, 0.8], [0.7, 0.9]i, the storage regions of zone code 010, 011, 110, 1111 will be involved. Building upon this, DIM achieves energy efficiency by forcing events whose attribute values are “close” to be stored at the same or nearby nodes. However, some performance issues still remain to be challenged for the traditional DIM: First, DIM constructs the k-d tree structure only based on the two-dimensional sensor locations, without leveraging the estimated distributions of various data attributes and range queries, thus it may cause inefficiency for non-uniform event insertions and corresponding range queries, which are actually conventional for sensor network applications. Second, DIM assigns a unique zone for each sensor node, hence as the network size increases, the zones have to be further split into smaller zones, more zones of sensors are inevitably involved in the query processing, which cannot be energy efficient for range query processing. Therefore how to control the granularity of the k-d tree based index structure is crucial for energy efficiency issue. Third, DIM has not considered an optimized approach to route those split sub-queries to specified storage areas, since the query diffusion cost dominates the overall energy consumption for multi-dimensional range query processing, it is of utmost importance to devise an optimized routing scheme for query forwarding. Therefore in this paper we try to address the problems raised above and we propose a framework which can efficiently handle the event insertions and range query processing in an optimized approach. Our storage framework is also based on the k-d tree and we mainly focus on achieving the optimized granularity of k-d tree structure over sensor networks for best energy efficiency, and we further investigate into the optimized scheme for range query forwarding which dominates the overall energy consumption. B. Modelling Event Data Distribution Consider there exist k attributes A1, A2, ..., Ak for sensor data, for each attribute Ai , we have minimum value min(Vi) and maximum value max(Vi), we normalize attribute Ai’s value into the region [0, 1] by calculating: V ′ i = Vi − min(Vi) max(Vi) − min(Vi) (1) And we represent the data distribution over attribute Ai as pi(x), which is exactly a probability density function (pdf) over the region x ∈ [0, 1], thus we have Z 1 0 pi(x)dx = 1 (2) 167 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
we can also define a discrete form of pi(r)as follows, IV.DESIGN OF DECENTRALIZED STORAGE SCHEME FOR we equally divide the region 0,1 into m sub-regions,and MULTI-DIMENSIONAL RANGE OUERIES leverage the histogram method to calculate the probability pi.j,(j=1,...,m)and we have: 01D0 ∑P=1 (3) j=1 Hence consider an index node over the k-d tree with multi- dimensional range[(a1,31),(a2,2),·,(ak,3k小,where a and B;respectively denote the lower bound and upper bound of specified range of attribute Ai,we have 0<ai<Bi<1, assume the above multi-dimensional range is associated with a two-dimensional spacial region Sj,therefore we can calculate (a)The K-d Tree based index structure (b)The corresponding zone node and the probability that a random event data falls into this region boundanes Si as: Fig.2.Decentralized Storage Scheme for Multi-Dimensional Range Queries ProbE(Sj)= Pi(x)dx (4) We visualize the deployment field of a sensor network as Assume the overall event frequency is fe,thus the frequency a number of n x n equal sized grid cells.As DIM mainly for which the specific storage region Sj is accessed is fe. applies for sparse sensor network where some zones may not Probe(Si). have any sensor node,in this paper we consider the situation C.Modelling Range Query Distribution where the node density is high enough for each cell to have at Unlike event distributions over attributes,which are mu- least one storage node candidates,these cells are nominated tually exclusive over each attribute Ai,the range query dis- as the storage units of the DCS scheme,at the center of each tributions over attributes are overlapped because each query cell one exact sensor node is selected as the storage node. is corresponding to a range over the attribute dimensions. A.Indexing Storages over Cells Users can specify an k dimensional range query in the form like ([L,U],[L2,U2],....[L,U])over k attributes,where We construct the index structure based on k-d tree over the 0≤L:≤U≤l,(l≤i≤k),for partial range query which cell-based storage deployment area,on one hand,each index does not include attribute A,,we can set Li=0,U;=1.Thus node over the index structure represents a multi-dimensional we utilize another method to depict the distributions:Given a range for the sensor data,on the other hand,it also associates training set of queries Q,we have Q=(Q1,Q2,...,QN}, with a specified spacial region over the two-dimensional where query Qi has a query range over each attribute Ai, deployment area.The definition of the index structure is con- which is (Li,Ui),hence we define an indicator variable Bi sistent with DIM,except that we utilize the cell as the storage for a specified range (i,Bi)of attribute A:as follows: unit instead of sensor node.each index node corresponds to a spacial region which covers one or more cells.Fig.2 shows an 了1if(a,3)n(Li,U)≠p B;0 otherwise (5) example of our decentralized storage scheme to support multi- dimensional range queries.Fig.2(a)depicts the k-d tree based And we further calculate the query frequency fi for the index structure and Fig.2(b)represents the corresponding zone specified range (i,Bi)over attribute Ai: code and boundaries.The black nodes denote the storage fi=∑B regions specified by the k-d tree and those cell(s)contained (6) in the corresponding regions become(s)the replicate storage HQ,∈Q cells,which means each cell will store the same copy of the Thus the specified attribute range will be queried with proba- multi-dimensional ranged data specified by the index node bility pi=fi/N. over the k-d tree. Assume query distributions over various attributes are mutu- ally independent,given a multi-dimensional data region which B.Controlling Granularity for k-d Tree based Index Structure is associated with a two-dimensional spacial region Sj,we can Now we investigate into the k-d tree based index structure, finally get the probability for the specified storage region S; as already mentioned in Section I,we note that controlling to be queried as: the granularity of the index nodes is very crucial for energy efficiency of multi-dimensional event insertion and range Proba(S)=Πn (7) query diffusion.If we split these sensor data into very refined =1 zones according to multi-dimensional data attributes,thus to Assume the overall query frequency is fo,thus the frequency resolve a conventional range query,we have to split it into for which the specified storage region Sj is queried is fa many sub-range queries for various zones,apparently it cannot Probo(Sj). be energy efficient due to the large number of routing cost for 168 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
we can also define a discrete form of pi(x) as follows, we equally divide the region [0, 1] into m sub-regions, and leverage the histogram method to calculate the probability pi,j ,(j = 1, ..., m) and we have: Xm j=1 pi,j = 1 (3) Hence consider an index node over the k-d tree with multidimensional range [(α1, β1),(α2, β2), · · · ,(αk, βk)], where αi and βi respectively denote the lower bound and upper bound of specified range of attribute Ai , we have 0 ≤ αi < βi ≤ 1, assume the above multi-dimensional range is associated with a two-dimensional spacial region Sj , therefore we can calculate the probability that a random event data falls into this region Sj as: P robE (Sj ) = Y k i=1 Z βi αi pi(x)dx (4) Assume the overall event frequency is fe, thus the frequency for which the specific storage region Sj is accessed is fe · P robE (Sj ). C. Modelling Range Query Distribution Unlike event distributions over attributes, which are mutually exclusive over each attribute Ai , the range query distributions over attributes are overlapped because each query is corresponding to a range over the attribute dimensions. Users can specify an k dimensional range query in the form like h[L1, U1], [L2, U2], ..., [Lk, Uk]i over k attributes, where 0 ≤ Li ≤ Ui ≤ 1,(1 ≤ i ≤ k), for partial range query which does not include attribute Ai , we can set Li = 0, Ui = 1. Thus we utilize another method to depict the distributions: Given a training set of queries Q, we have Q = {Q1, Q2, ..., QN }, where query Qj has a query range over each attribute Ai , which is (Li , Ui), hence we define an indicator variable Bi for a specified range (αi , βi) of attribute Ai as follows: Bi = 1 if (αi , βi) ∩ (Li , Ui) 6= φ 0 otherwise (5) And we further calculate the query frequency fi for the specified range (αi , βi) over attribute Ai : fi = X ∀Qj∈Q Bi (6) Thus the specified attribute range will be queried with probability pi = fi/N. Assume query distributions over various attributes are mutually independent, given a multi-dimensional data region which is associated with a two-dimensional spacial region Sj , we can finally get the probability for the specified storage region Sj to be queried as: P robQ(Sj ) = Y k i=1 pi (7) Assume the overall query frequency is fq, thus the frequency for which the specified storage region Sj is queried is fq · P robQ(Sj ). IV. DESIGN OF DECENTRALIZED STORAGE SCHEME FOR MULTI-DIMENSIONAL RANGE QUERIES Fig. 2. Decentralized Storage Scheme for Multi-Dimensional Range Queries We visualize the deployment field of a sensor network as a number of n × n equal sized grid cells. As DIM mainly applies for sparse sensor network where some zones may not have any sensor node, in this paper we consider the situation where the node density is high enough for each cell to have at least one storage node candidates, these cells are nominated as the storage units of the DCS scheme, at the center of each cell one exact sensor node is selected as the storage node. A. Indexing Storages over Cells We construct the index structure based on k-d tree over the cell-based storage deployment area, on one hand, each index node over the index structure represents a multi-dimensional range for the sensor data, on the other hand, it also associates with a specified spacial region over the two-dimensional deployment area. The definition of the index structure is consistent with DIM, except that we utilize the cell as the storage unit instead of sensor node, each index node corresponds to a spacial region which covers one or more cells. Fig.2 shows an example of our decentralized storage scheme to support multidimensional range queries. Fig.2(a) depicts the k-d tree based index structure and Fig.2(b) represents the corresponding zone code and boundaries. The black nodes denote the storage regions specified by the k-d tree and those cell(s) contained in the corresponding regions become(s) the replicate storage cells, which means each cell will store the same copy of the multi-dimensional ranged data specified by the index node over the k-d tree. B. Controlling Granularity for k-d Tree based Index Structure Now we investigate into the k-d tree based index structure, as already mentioned in Section I, we note that controlling the granularity of the index nodes is very crucial for energy efficiency of multi-dimensional event insertion and range query diffusion. If we split these sensor data into very refined zones according to multi-dimensional data attributes, thus to resolve a conventional range query, we have to split it into many sub-range queries for various zones, apparently it cannot be energy efficient due to the large number of routing cost for 168 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
query forwarding from query node to these various storage where Dauer(S)denotes the expected distance from an nodes.Based on the above observation,keeping these ranges arbitrary query node to the nearest storage node in the storage of data to be hosted in a relatively small number of zones region.And we can simply deduce that Dquery(Si)equals to may be more efficient,but obviously the extreme situation to Devent(Si). keep all ranges of data in one unique zone (with replicated storage scheme)is also not energy efficient,because although the query forwarding cost can be effectively reduced,large amount of energy will be drained for routing the same copy (X, (X,Y) of sensor data to all replicate storages.Based on the above analysis,we must find an optimized storage strategy to control ● the granularity for the index structure as a tradeoff between the two extremum situations to achieve best energy efficiency. Therefore to construct an optimized k-d tree based index Y (X..Y) structure over the sensor network,if we build the k-d tree ● from the root,then at each splitting step of the k-d tree,we have two optional strategies:one strategy is to conduct further splitting for the specified zones,another strategy is to stop Fig.3.Estimate the value for Dquery(Si)or Devent(i) further splitting and thus apply the replicate storage scheme to all cells covered by the specified zones.The most crucial We estimate the expectations of Dquery(Si)or Devent(Si) thing to construct the index structure is how to decide the as follows:for any event/query imposed inside the replicate storage region S;,as it only needs to find a nearest storage optimized strategy at each splitting step along the k-d tree,so as to achieve the best energy efficiency. node inside Si,the expected distance is less than half the cell size,which is d/2;for any event/query imposed outside To solve the above optimization problem,we start by the replicate storage region Si,as Fig.3 shows,we calculate first analyzing the energy cost for event insertion and query the distance according to its quadrant relative to Si,by processing over a specified storage region.In this paper we utilize the Euclidean Distance as the metric of energy summarizing the average distance from arbitrary points to consumption,which means the routing cost between two nodes the storage region,thus we can estimate the expectations of is represented by the Euclidean distance between them.For in- Dquery(Si)or Devent(Si). sertion of event,we leverage the double ruling scheme[9][10] As mentioned before,if we construct a k-d tree based index to distribute the event information to all relevant replicate structure by starting from the root,for each index node at the storage nodes in the storage region.For one specified storage splitting step,we have two optional strategies to choose:to region,the event node first routes event message to the nearest apply the replicate storage scheme over Si or conduct further storage node inside the region,and further forwards it both splitting?We know that the energy cost for replicate storage horizontally and vertically to all storage nodes covered by scheme over Si is: the storage region.Thus the expected energy cost for event Erep(Sa)=fe·ProbE(Si)·Eevent((Si) distribution over the specified storage region Si is: +fq·Probo(S)·Equery(S) (11) Eevent(Si)=Devent(Si)+d.(N(Si)-1) (8) The energy cost Erep(Si)includes the overall event inser- where Devent(Si)denotes the expected distance from an tion cost,as we know the event access frequency for Si arbitrary event node to the nearest storage node in storage is fe.Probe(Si)and the energy cost for each insertion is region Si,and d is the cell size,N(Si)denotes the number Eevent(Si),and it also includes the overall query diffusion of storage nodes inside Si,as each storage node inside Si cost with query access frequency faProb(S)and energy receives exactly one message except the first accessed storage cost for each query forwarding Eqer(S). node,thus the diffusion cost inside S;is d.(N(S;)-1). And consider the k-d tree based index structure,let us Assume the root of the k-d tree is at level 0,and the k-d tree denote Tichild(Si)as the zone associated with left child of indexes the cell unit at level h,(apparently h should be an Si,and Trchid(Si)as the zone associated with its right child, even number)),thus at level i(0≤i≤h),we have: and we utilize E(Si)to represent the optimized energy cost N(Si)=2h-i (9) for region Si,thus the energy cost for further splitting can be represented as: And for query diffusion,we only need to reach one of those specified replicate storage nodes,an intuitive strategy is for Esplit(Si)=E(Tichila(Si))+E(Trchita(Si))(12) the query node to find the nearest storage node in the storage region Si,later we will introduce an optimized routing scheme Therefore the optimized energy cost for region Si is of the for query diffusion,thus the expected energy cost for query following form: diffusion over the specified storage region Si is: E(Si)=min{Erep(Si),E(Tichila(Si))+E(Trchild(Si))} Equery(Si)=Dquery(Si) (10) (13) 169 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
query forwarding from query node to these various storage nodes. Based on the above observation, keeping these ranges of data to be hosted in a relatively small number of zones may be more efficient, but obviously the extreme situation to keep all ranges of data in one unique zone (with replicated storage scheme) is also not energy efficient, because although the query forwarding cost can be effectively reduced, large amount of energy will be drained for routing the same copy of sensor data to all replicate storages. Based on the above analysis, we must find an optimized storage strategy to control the granularity for the index structure as a tradeoff between the two extremum situations to achieve best energy efficiency. Therefore to construct an optimized k-d tree based index structure over the sensor network, if we build the k-d tree from the root, then at each splitting step of the k-d tree, we have two optional strategies: one strategy is to conduct further splitting for the specified zones, another strategy is to stop further splitting and thus apply the replicate storage scheme to all cells covered by the specified zones. The most crucial thing to construct the index structure is how to decide the optimized strategy at each splitting step along the k-d tree, so as to achieve the best energy efficiency. To solve the above optimization problem, we start by first analyzing the energy cost for event insertion and query processing over a specified storage region. In this paper we utilize the Euclidean Distance as the metric of energy consumption, which means the routing cost between two nodes is represented by the Euclidean distance between them. For insertion of event, we leverage the double ruling scheme[9][10] to distribute the event information to all relevant replicate storage nodes in the storage region. For one specified storage region, the event node first routes event message to the nearest storage node inside the region, and further forwards it both horizontally and vertically to all storage nodes covered by the storage region. Thus the expected energy cost for event distribution over the specified storage region Si is: Eevent(Si) = Devent(Si) + d · (N(Si) − 1) (8) where Devent(Si) denotes the expected distance from an arbitrary event node to the nearest storage node in storage region Si , and d is the cell size, N(Si) denotes the number of storage nodes inside Si , as each storage node inside Si receives exactly one message except the first accessed storage node, thus the diffusion cost inside Si is d · (N(Si) − 1). Assume the root of the k-d tree is at level 0, and the k-d tree indexes the cell unit at level h, (apparently h should be an even number), thus at level i (0 ≤ i ≤ h), we have: N(Si) = 2h−i (9) And for query diffusion, we only need to reach one of those specified replicate storage nodes, an intuitive strategy is for the query node to find the nearest storage node in the storage region Si , later we will introduce an optimized routing scheme for query diffusion, thus the expected energy cost for query diffusion over the specified storage region Si is: Equery(Si) = Dquery(Si) (10) where Dquery(Si) denotes the expected distance from an arbitrary query node to the nearest storage node in the storage region. And we can simply deduce that Dquery(Si) equals to Devent(Si). Fig. 3. Estimate the value for Dquery(Si) or Devent(Si) We estimate the expectations of Dquery(Si) or Devent(Si) as follows: for any event/query imposed inside the replicate storage region Si , as it only needs to find a nearest storage node inside Si , the expected distance is less than half the cell size, which is d/2; for any event/query imposed outside the replicate storage region Si , as Fig.3 shows, we calculate the distance according to its quadrant relative to Si , by summarizing the average distance from arbitrary points to the storage region, thus we can estimate the expectations of Dquery(Si) or Devent(Si). As mentioned before, if we construct a k-d tree based index structure by starting from the root, for each index node at the splitting step, we have two optional strategies to choose: to apply the replicate storage scheme over Si or conduct further splitting? We know that the energy cost for replicate storage scheme over Si is: Erep(Si) = fe · P robE (Si) · Eevent(Si) +fq · P robQ(Si) · Equery(Si) (11) The energy cost Erep(Si) includes the overall event insertion cost, as we know the event access frequency for Si is fe · P robE(Si) and the energy cost for each insertion is Eevent(Si), and it also includes the overall query diffusion cost with query access frequency fq · P robQ(Si) and energy cost for each query forwarding Equery(Si). And consider the k-d tree based index structure, let us denote Tlchild(Si) as the zone associated with left child of Si , and Trchild(Si) as the zone associated with its right child, and we utilize E(Si) to represent the optimized energy cost for region Si , thus the energy cost for further splitting can be represented as: Esplit(Si) = E(Tlchild(Si)) + E(Trchild(Si)) (12) Therefore the optimized energy cost for region Si is of the following form: E(Si) = min{Erep(Si), E(Tlchild(Si)) + E(Trchild(Si))} (13) 169 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
Based on the above analysis we note that the optimized k-d Algorithm 1 Constructing k-d Tree based Index Structure with tree structure satisfies the following property:for an optimum Optimized Granularity k-d tree structure,its sub-tree structures are also optimum 1:Initially construct a k-d tree as a full binary tree with Therefore this optimization problem can be solved by dynamic maximum height of h,each leaf node denotes a unit cell. programming that works from the bottom to the top along the 2:for each leaf node associated with region Si do tree. 3: E(S)=fe·ProbE(S)Eevent(Si)+fg·Probo(Sa) C.Constructing Optimized Index Structure based on Dynamic Equery(Si) Programming 4:end for We construct the k-d tree from bottom to top,which is in 5:for all remaining nodes Sij with corresponding children post-order.Assume the maximum height of the k-d tree is nodes S;and Sj,in post-order do h,initially consider a full k-d tree with all leaf nodes at the 6. costl is associated with the replicate storage scheme hth level,each leaf node correspond to a unit cell,thus the for region Si.j} number of nodes over the k-d tree is n=2h+1-1.For each cost1=fe·ProbE(S,)·Eevent(Sii)+fg· two nodes i and j within one sub-tree,we denote Si and S;as ProbQ(S,)·Equery(S,i) cost2 is associated with the split storage scheme for their respective regions,and we denote Si.;as the super-region region Si.j} composed of the two regions,which is corresponding to their parent node.We will try to merge two split nodes along the cost2=E(S)+E(S)】 10: tree if the energy cost for replicate storage scheme is lower E(Si.j)=min(cost1,cost2) 11: if cost1<cost2 then than the sum of each split nodes. 12 Algorithm I depicts the pseudo code of this dynamic Apply the replicate storage scheme to region Sij, programming based algorithm.Assume that the n nodes in merge two split nodes Si and Sj into one node Si.j the k-d tree are labelled using the post-order.A table E[l...n] for the k-d tree 13: end if is used to hold the minimum energy cost of each node which 14:end for corresponds to a specific storage region Si,here for ease of presentation we denote it as E(Si).In the algorithm,line 3 computes the energy cost E(Si)for all leave nodes and lines 6-10 compute the minimum energy cost E(Si.j)for As described in Section IV.when the event message reaches the remaining internal nodes in post-order.As there're O(n) the zone,we utilize the Double Ruling scheme [9][10]to entries of energy cost to compute according to all the nodes distribute the event information to all relevant replicate storage over the initial full k-d tree,and at each step one exact entry is nodes. computed,the time complexity of Algorithm I is O(n),where B.Resolving and Routing Range Queries n is the number of nodes over the initial full k-d tree. For implementation of optimized index structure construc- As a range query corresponds to a region over the tion,first the sink calculates the event data distribution and multi-dimensional data space,thus a conventional range range query distribution according to a training set by lever- query may cover several zones over the k-d tree based aging the methodology proposed in Section III,and further index structure.For example,consider a range query calculates optimized parameters for the optimized k-d tree (0.3,0.7,[0.5,0.6,0.4,0.8,0.7,0.9j)over the example k- d tree depicted in Fig.2(a),the storage regions of zone code based on dynamic programming,then it broadcasts the index structure information to the sensor network,after receiving 110.111,010.0111 will be involved.Therefore how to route this information,those selected storage nodes of each cell these sub-range queries to those specified storage nodes in will confirm their storage roles which includes corresponding an energy efficient approach is the problem we are trying to storage schemes and the multi-dimensional ranges of data they address in this section.As the corresponding query replies can will store,and each sensor node will know where to store/find be sent back to query node along the query routing paths,thus here we only consider the query diffusion cost.for the overall the relevant ranges of data according to the locally stored index cost is actually twice the query diffusion cost. structure. Basically there are three strategies to route these sub- V.EVENT STORAGE AND QUERY PROCESSING queries.The first strategy is to route sub-queries separately MECHANISMS to corresponding zones by selecting nearest storage nodes for A.Inserting Events to Cells the query node.As Fig.4(a)shows,for each relevant storage region Si associated with sub-query gi,the query node selects As an event data is a point value over the multi-dimensional a nearest storage node from those replicate storages,and data space,it corresponds to a unique zone over the de- respectively sends a sub-query message to the corresponding ployment area.When a sensor node generates an event data it will check the locally stored index structure to find the storage node.In this scheme,no routing paths are shared. The second strategy leverages Double Ruling based routing corresponding zone,then it leverages the GPSR[1]routing approach.As Fig.4(b)shows,the query node first imposes scheme to forward the event message to that specified region. a routing path horizontally,and at specific positions of the 170 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
Based on the above analysis we note that the optimized k-d tree structure satisfies the following property: for an optimum k-d tree structure, its sub-tree structures are also optimum. Therefore this optimization problem can be solved by dynamic programming that works from the bottom to the top along the tree. C. Constructing Optimized Index Structure based on Dynamic Programming We construct the k-d tree from bottom to top, which is in post-order. Assume the maximum height of the k-d tree is h, initially consider a full k-d tree with all leaf nodes at the hth level, each leaf node correspond to a unit cell, thus the number of nodes over the k-d tree is n = 2h+1 − 1. For each two nodes i and j within one sub-tree, we denote Si and Sj as their respective regions, and we denote Si,j as the super-region composed of the two regions, which is corresponding to their parent node. We will try to merge two split nodes along the tree if the energy cost for replicate storage scheme is lower than the sum of each split nodes. Algorithm 1 depicts the pseudo code of this dynamic programming based algorithm. Assume that the n nodes in the k-d tree are labelled using the post-order. A table E[1...n] is used to hold the minimum energy cost of each node which corresponds to a specific storage region Si , here for ease of presentation we denote it as E(Si). In the algorithm, line 3 computes the energy cost E(Si) for all leave nodes and lines 6-10 compute the minimum energy cost E(Si,j ) for the remaining internal nodes in post-order. As there’re O(n) entries of energy cost to compute according to all the nodes over the initial full k-d tree, and at each step one exact entry is computed, the time complexity of Algorithm 1 is O(n), where n is the number of nodes over the initial full k-d tree. For implementation of optimized index structure construction, first the sink calculates the event data distribution and range query distribution according to a training set by leveraging the methodology proposed in Section III, and further calculates optimized parameters for the optimized k-d tree based on dynamic programming, then it broadcasts the index structure information to the sensor network, after receiving this information, those selected storage nodes of each cell will confirm their storage roles which includes corresponding storage schemes and the multi-dimensional ranges of data they will store, and each sensor node will know where to store/find the relevant ranges of data according to the locally stored index structure. V. EVENT STORAGE AND QUERY PROCESSING MECHANISMS A. Inserting Events to Cells As an event data is a point value over the multi-dimensional data space, it corresponds to a unique zone over the deployment area. When a sensor node generates an event data, it will check the locally stored index structure to find the corresponding zone, then it leverages the GPSR[1] routing scheme to forward the event message to that specified region. Algorithm 1 Constructing k-d Tree based Index Structure with Optimized Granularity 1: Initially construct a k-d tree as a full binary tree with maximum height of h, each leaf node denotes a unit cell. 2: for each leaf node associated with region Si do 3: E(Si) = fe ·P robE(Si)·Eevent(Si)+fq ·P robQ(Si)· Equery(Si) 4: end for 5: for all remaining nodes Si,j with corresponding children nodes Si and Sj , in post-order do 6: {cost1 is associated with the replicate storage scheme for region Si,j} 7: cost1 = fe · P robE(Si,j ) · Eevent(Si,j ) + fq · P robQ(Si,j ) · Equery(Si,j ) 8: {cost2 is associated with the split storage scheme for region Si,j} 9: cost2 = E(Si) + E(Sj ) 10: E(Si,j ) = min(cost1, cost2) 11: if cost1 < cost2 then 12: Apply the replicate storage scheme to region Si,j , merge two split nodes Si and Sj into one node Si,j for the k-d tree. 13: end if 14: end for As described in Section IV, when the event message reaches the zone, we utilize the Double Ruling scheme [9][10] to distribute the event information to all relevant replicate storage nodes. B. Resolving and Routing Range Queries As a range query corresponds to a region over the multi-dimensional data space, thus a conventional range query may cover several zones over the k-d tree based index structure. For example, consider a range query h[0.3, 0.7], [0.5, 0.6], [0.4, 0.8], [0.7, 0.9]i over the example kd tree depicted in Fig.2(a), the storage regions of zone code 110,111, 010, 0111 will be involved. Therefore how to route these sub-range queries to those specified storage nodes in an energy efficient approach is the problem we are trying to address in this section. As the corresponding query replies can be sent back to query node along the query routing paths, thus here we only consider the query diffusion cost, for the overall cost is actually twice the query diffusion cost. Basically there are three strategies to route these subqueries. The first strategy is to route sub-queries separately to corresponding zones by selecting nearest storage nodes for the query node. As Fig.4(a) shows, for each relevant storage region Si associated with sub-query qi , the query node selects a nearest storage node from those replicate storages, and respectively sends a sub-query message to the corresponding storage node. In this scheme, no routing paths are shared. The second strategy leverages Double Ruling based routing approach. As Fig.4(b) shows, the query node first imposes a routing path horizontally, and at specific positions of the 170 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
(a)Separate routing strategy (b)Double ruling based routing strategy (c)Steiner tree based routing strategy Fig.4.Routing strategies for range query forwarding horizontal path it further imposes routing paths vertically Assume for each specified zone Si,it has a set of replicate to reach those corresponding storage regions.This routing storage node (si,1,si.2,...,si.m},here m is the number of scheme uses the horizontal routing path as the backbone path, redundancy of Si,which we can also denote as Sil,we thus it brings opportunities to sufficiently share the horizontal propose Algorithm 3 to compute the optimized routing path routing path,however,as it only utilizes the rectangular paths, for query diffusion by leveraging the approximation algorithms which may not be efficient enough when compared with the for Euclidean Minimum Steiner tree problem. shortest path. Hence according to the analysis over the pros.and cons. Algorithm 2 Computing the optimized routing path for query of the above two strategies,as our ultimate goal is to find diffusion the connecting paths with minimum routing cost,given the Input:Query node sg storage zone S1,$2,...,Sn specified query nodes and specified storage nodes,we can reduce the by range query O above problem into the Euclidean Minimum Steiner Tree[11] Procedure: problem.Based on the theory of Euclidean Steiner tree,we set Let opt =oo,steinset=o the query node and those specified storage nodes as terminal for each storage zone S:(1≤i≤n)do points,we then construct a Steiner tree according to these Enumerate each storage node si.j from Si,(1 <j<Sil) terminal points by selecting specific sensor nodes as the Si=Si,j Steiner points,thus to sufficiently share the routing path and (cost,S]=ComptuteSteinerTree(sg s1,...,si,..,sn) achieve the minimum routing cost for best energy efficiency. if (cost opt)then Fig.4(c)shows the Steiner tree based routing paths. opt cost steinset =S end if C.Optimized Routing Scheme for Query Forwarding end for We have reduced the optimized query forwarding problem Output:opt,steinset into the Euclidean minimum Steiner tree problem,actually as the problem is NP hard,so polynomial-time heuristics As each storage zone covers several storage nodes as are desired.There're a lot of research works which presents the candidate terminal points for the optimized solution,the heuristics algorithms to approximate the optimized Steiner above algorithm enumerates all permutations of storage nodes tree[11][12][13],and most of them are mainly solved by first (s1,s2....,sn)from those specified zones S1,S2,...,Sn.and constructing a Minimum Spanning Tree,for example,Dreyer leverage the function ComptuteSteinerTree to approximate the et.al.proposed two heuristics for the Euclidean Steiner tree optimized Steiner tree according to each permutation,among problem[12],which includes the Steiner Insertion Algorithm these candidate Steiner tree solutions we finally choose the and the Incremental Optimization Algorithm.The Steiner In- one which gives the minimum cost as the optimized Steiner sertion Algorithm systematically inserts Steiner points between tree solution.As here we leverage the Steiner Insertion Al- edges of the minimal spanning tree meeting at angles less gorithm as the approximation method,which has the time than 120 degrees,performing a local optimization at the end. complexity of O(n3),consider the number of permutations The Incremental Optimization Algorithm begins by finding M=I<inSil.we have the computing complexity as the Steiner tree for three of the fixed points.Then,at each O(M.n3).And the routing cost for query diffusion is equal iteration,it introduces a new fixed point to the tree,connecting to the overall lengths of the optimized Steiner tree it to each possible edge by inserting a Steiner point,and mini- Consider the distributed implementation of this optimized mizes over all connections,performing a local optimization for routing scheme,as each sensor node has the global information each.In this paper,we leverage Steiner Insertion Algorithm as of the index tree,for the query node,it first calculates the approximation method due to its simplicity and low computing optimized Steiner tree according to the specified range query, complexity.Readers can refer to [12]for detail algorithms. then it route the messages of sub-queries along the Steiner tree 171 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
(a) Separate routing strategy (b) Double ruling based routing strategy (c) Steiner tree based routing strategy Fig. 4. Routing strategies for range query forwarding horizontal path it further imposes routing paths vertically to reach those corresponding storage regions. This routing scheme uses the horizontal routing path as the backbone path, thus it brings opportunities to sufficiently share the horizontal routing path, however, as it only utilizes the rectangular paths, which may not be efficient enough when compared with the shortest path. Hence according to the analysis over the pros. and cons. of the above two strategies, as our ultimate goal is to find the connecting paths with minimum routing cost, given the query nodes and specified storage nodes, we can reduce the above problem into the Euclidean Minimum Steiner Tree[11] problem. Based on the theory of Euclidean Steiner tree, we set the query node and those specified storage nodes as terminal points, we then construct a Steiner tree according to these terminal points by selecting specific sensor nodes as the Steiner points, thus to sufficiently share the routing path and achieve the minimum routing cost for best energy efficiency. Fig. 4(c) shows the Steiner tree based routing paths. C. Optimized Routing Scheme for Query Forwarding We have reduced the optimized query forwarding problem into the Euclidean minimum Steiner tree problem, actually as the problem is NP hard, so polynomial-time heuristics are desired. There’re a lot of research works which presents heuristics algorithms to approximate the optimized Steiner tree[11][12][13] , and most of them are mainly solved by first constructing a Minimum Spanning Tree, for example, Dreyer et.al. proposed two heuristics for the Euclidean Steiner tree problem[12], which includes the Steiner Insertion Algorithm and the Incremental Optimization Algorithm. The Steiner Insertion Algorithm systematically inserts Steiner points between edges of the minimal spanning tree meeting at angles less than 120 degrees, performing a local optimization at the end. The Incremental Optimization Algorithm begins by finding the Steiner tree for three of the fixed points. Then, at each iteration, it introduces a new fixed point to the tree, connecting it to each possible edge by inserting a Steiner point, and minimizes over all connections, performing a local optimization for each. In this paper, we leverage Steiner Insertion Algorithm as approximation method due to its simplicity and low computing complexity. Readers can refer to [12] for detail algorithms. Assume for each specified zone Si , it has a set of replicate storage node {si,1, si,2, ..., si,m}, here m is the number of redundancy of Si , which we can also denote as |Si |, we propose Algorithm 3 to compute the optimized routing path for query diffusion by leveraging the approximation algorithms for Euclidean Minimum Steiner tree problem. Algorithm 2 Computing the optimized routing path for query diffusion Input: Query node sq, storage zone S1, S2, ..., Sn specified by range query Q Procedure: Let opt = ∞, steinset = φ for each storage zone Si(1 ≤ i ≤ n) do Enumerate each storage node si,j from Si , (1 ≤ j ≤ |Si |) si = si,j [cost, S]=ComptuteSteinerTree(sq, s1, ..., si , ..., sn ) if (cost < opt) then opt = cost steinset = S end if end for Output: opt, steinset As each storage zone covers several storage nodes as the candidate terminal points for the optimized solution, the above algorithm enumerates all permutations of storage nodes (s1, s2, ..., sn) from those specified zones S1, S2, ..., Sn, and leverage the function ComptuteSteinerTree to approximate the optimized Steiner tree according to each permutation, among these candidate Steiner tree solutions we finally choose the one which gives the minimum cost as the optimized Steiner tree solution. As here we leverage the Steiner Insertion Algorithm as the approximation method, which has the time complexity of O(n 3 ), consider the number of permutations M = Q 1≤i≤n |Si |, we have the computing complexity as O(M · n 3 ). And the routing cost for query diffusion is equal to the overall lengths of the optimized Steiner tree. Consider the distributed implementation of this optimized routing scheme, as each sensor node has the global information of the index tree, for the query node, it first calculates the optimized Steiner tree according to the specified range query, then it route the messages of sub-queries along the Steiner tree 171 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
by leveraging the GPSR routing scheme,thus to reach those corresponding storage regions,then the query results will be sent back along the original routing paths. VI.SIMULATION RESULTS In this section,we use simulation to compare the perfor- mance of our optimized storage scheme against conventional storage schemes for multi-dimensional range queries.We con- duct the simulation based on the grid based system,we divide the sensor network into n x n cells (we set 8 as the default The Optimized K-d Tree Index The corresponding storage regions over the cell based structure value for n),and each cell represents a unit storage area,we Befwork normalize the cell size to 1.For performance comparison we Fig.6.The optimized k-d tree and corresponding storage zones over the separately build DIM's zone tree,the fully replicate storage cell based network. scheme,and our optimized k-d tree over the network.Here we assign a unique zone for each of the cell units in DIM's zone tree,which infers to be a fully split approach over the fully replicate storage scheme always achieves the most energy cell based network,and we construct the fully replicate storage consumption as it requires the event message to update all scheme as another extreme situation,which stores all sensor n x n replicate storage cells.And DIM always achieves the data into each of the cell units,thus all the cells are replicate least energy consumption as it requires the event message storage units inside the unique zone to update one unique corresponding storage cell.For our And we utilize a training set of events and range queries to optimized k-d tree storage scheme,as the zone specified by derive the event and query distributions,thus to construct our the event may include one or more replicate storage cells,thus optimized k-d tree based index structure,as Fig.5 depicts,we its energy consumption is between the above two extreme use the Gaussian-like distribution to generate 5-dimensional situations,and unlike the fully replicate storage scheme,it data and queries for the training set and we normalize all the grows slowly as the network size increases. attribute values into the range [0,1],for range queries,we set Fig.7(b)shows the average query diffusion cost for the the query range for each dimension within range [0,1].And above three storage schemes.Note that DIM always achieves we also generate a similar set of events and range queries to the most energy consumption and the fully replicate storage evaluate the performance of various multi-dimensional storage scheme always achieves the least energy consumption,because schemes,we set the default number of event frequency fe and for one specific range query,DIM may split the query into query frequency fa to 1000. sub-queries and map them to quite many storage zones while the fully replicate storage scheme simply maps the query to one unique zone at the top level with replicate storages,thus its energy cost is nearly negligible.And similar to the former ass case,our optimized k-d tree storage scheme gains a tradeoff between the two extremes,and unlike DIM,it grows slowly as the network size increases. As Fig.7(c)shows,we further compare the overall energy cost for the above three schemes.For the situation fe 1000,fa=2000.DIM achieves the most energy cost as DIM's query diffusion cost is the largest among the three Fig.5.The Gaussian-like distribution of multi-dimensional data schemes,although its event insertion cost is rather small.As the fe/f ratio increases,the fully replicate storage scheme achieves the most energy cost due to its large event insertion A.Comparing performance for various multi-dimensional cost.For all the above three situations,our optimized k-d tree storage schemes storage scheme always achieves the least overall energy cost, In this section,we evaluate the energy efficiency for various as it gains an optimized tradeoff to sufficiently control both multi-dimensional storage schemes.Fig.6 depicts the opti- the event insertion cost and the query delivery cost.For the mized k-d tree and corresponding storage zones over the cell situation fe=1000,f=1000,our optimized k-d tree storage based network,this optimized index structure is constructed scheme reduces 39%energy cost than DIM and further reduces according to the event and query training set generated in terms 58%energy cost than the fully replicate storage scheme. of the Gaussian-like distributions. We now compare the average event insertion cost and B.Comparing performance for various query diffusion ap- average query delivery cost.Fig.7(a)shows the average event proaches insertion cost for DIM,the fully replicate storage scheme, In this section we compare the energy cost for various query and optimized K-d tree.As the network size increases,the diffusion approaches:the normal separate routing approach, 172 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
by leveraging the GPSR routing scheme, thus to reach those corresponding storage regions, then the query results will be sent back along the original routing paths. VI. SIMULATION RESULTS In this section, we use simulation to compare the performance of our optimized storage scheme against conventional storage schemes for multi-dimensional range queries. We conduct the simulation based on the grid based system, we divide the sensor network into n × n cells (we set 8 as the default value for n), and each cell represents a unit storage area, we normalize the cell size to 1. For performance comparison we separately build DIM’s zone tree, the fully replicate storage scheme, and our optimized k-d tree over the network. Here we assign a unique zone for each of the cell units in DIM’s zone tree, which infers to be a fully split approach over the cell based network, and we construct the fully replicate storage scheme as another extreme situation, which stores all sensor data into each of the cell units, thus all the cells are replicate storage units inside the unique zone. And we utilize a training set of events and range queries to derive the event and query distributions, thus to construct our optimized k-d tree based index structure, as Fig.5 depicts, we use the Gaussian-like distribution to generate 5-dimensional data and queries for the training set and we normalize all the attribute values into the range [0,1], for range queries, we set the query range for each dimension within range [0,1]. And we also generate a similar set of events and range queries to evaluate the performance of various multi-dimensional storage schemes, we set the default number of event frequency fe and query frequency fq to 1000. -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Probability Normalized Value Fig. 5. The Gaussian-like distribution of multi-dimensional data A. Comparing performance for various multi-dimensional storage schemes In this section, we evaluate the energy efficiency for various multi-dimensional storage schemes. Fig.6 depicts the optimized k-d tree and corresponding storage zones over the cell based network, this optimized index structure is constructed according to the event and query training set generated in terms of the Gaussian-like distributions. We now compare the average event insertion cost and average query delivery cost. Fig.7(a) shows the average event insertion cost for DIM, the fully replicate storage scheme, and optimized K-d tree. As the network size increases, the Fig. 6. The optimized k-d tree and corresponding storage zones over the cell based network. fully replicate storage scheme always achieves the most energy consumption as it requires the event message to update all n × n replicate storage cells. And DIM always achieves the least energy consumption as it requires the event message to update one unique corresponding storage cell. For our optimized k-d tree storage scheme, as the zone specified by the event may include one or more replicate storage cells, thus its energy consumption is between the above two extreme situations, and unlike the fully replicate storage scheme, it grows slowly as the network size increases. Fig.7(b) shows the average query diffusion cost for the above three storage schemes. Note that DIM always achieves the most energy consumption and the fully replicate storage scheme always achieves the least energy consumption, because for one specific range query, DIM may split the query into sub-queries and map them to quite many storage zones while the fully replicate storage scheme simply maps the query to one unique zone at the top level with replicate storages, thus its energy cost is nearly negligible. And similar to the former case, our optimized k-d tree storage scheme gains a tradeoff between the two extremes, and unlike DIM, it grows slowly as the network size increases. As Fig.7(c) shows, we further compare the overall energy cost for the above three schemes. For the situation fe = 1000, fq = 2000, DIM achieves the most energy cost as DIM’s query diffusion cost is the largest among the three schemes, although its event insertion cost is rather small. As the fe/fq ratio increases, the fully replicate storage scheme achieves the most energy cost due to its large event insertion cost. For all the above three situations, our optimized k-d tree storage scheme always achieves the least overall energy cost, as it gains an optimized tradeoff to sufficiently control both the event insertion cost and the query delivery cost. For the situation fe = 1000, fq = 1000, our optimized k-d tree storage scheme reduces 39% energy cost than DIM and further reduces 58% energy cost than the fully replicate storage scheme. B. Comparing performance for various query diffusion approaches In this section we compare the energy cost for various query diffusion approaches: the normal separate routing approach, 172 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
Kd Te 44 88 =100 agha向 (a)Average event insertion cost comparison. (b)Average query diffusion cost comparison. (c)Overall energy cost comparison. Fig.7.Comparing performance for various multi-dimensional storage schemes:DIM,Full Replicate Storage Scheme,and Optimized K-d Tree the double ruling based routing approach and the Steiner routing scheme for range query processing to achieve best tree based routing approach.As the query replies are sent energy efficiency. back along the query paths,here we only consider the query VIII.ACKNOWLEDGEMENT diffusion cost,the overall query processing cost is thus double the cost of query diffusion.We leverage Prim's algorithm[14] This work is supported by the China NSF grant to calculate the minimum spanning tree(MST)so as to approx- (60573132,60573106,60873026),the China973 project imate the Steiner tree.Fig.8 depicts the query diffusion cost for (2009CB320705,2006CB303000).We would like to express various schemes.Note that for all three storage schemes the our thanks to Yingpei Zeng and Jue Hong,for their helpful straight separate routing approach achieves the most energy comments and for assistance with literature. cost and the Steiner tree based routing approach achieves the REFERENCES best energy efficiency.For DIM the Steiner tree based routing approach reduces 61%energy cost than the straight separate [1]R.Sylvia,K.Brad,S.Scott,E.Deborah.G.Ramesh,Y.Li,and Y.Fang. "Data-centric storage in sensomets with ght,a geographic hash table, routing approach and reduces 26%energy cost than the double Mobile Net-works and Applications.vol.8.no.4.pp.427-442.2003. ruling based routing approach,for the optimized k-d tree [2]S.Scott,R.Sylvia,K.Brad,G.Ramesh,and E.Deborah,"Data-centric storage scheme the numbers are respectively 50%and 10%,for storage in sensornets,"ACM SIGCOMM Computer Communication Review,vol.33,no.1,pp.137-142,2003. the fully replicate storage scheme,as the query message only [3]G.Deepak.E.Deborah,and H.John,"Dimensions:Why do we need a has to be forwarded to the nearest storage node,the energy new data handling architecture for sensor networks,"ACM SIGCOMM costs for the three approaches are equally negligible. Computer Communication Review,vol.33.no.1,pp.143-148.2003. [4]B.Greenstein,S.Ratnasamy,S.Shenker.R.Govindan,and D.Estrin, "Difs:a distributed index for features in sensor networks,"Ad Hoc Networks,.vol.1,no.2-3,Pp.333-349,2003. [5]J.L.Bentley.,"Multidimensional binary search trees used for associative searching."Communicaions of the ACM,vol.18.no.9.p.475C484 1975. [6]X.Li,Y.J.Kim,R.Govindan,and W.Hong,"Multi-dimensional range queries in sensor networks,"in Sensys,2003. [7]M.Aly.K.Pruhs,and P.K.Chrysanthis."Kddcs:a load-balanced in- network data-centric storage scheme for sensor networks,"in CIKM. 2006. [8]Y.C.Chung.I.F.Su,and C.Lee,"Supporting multi-dimensional range query for sensor networks."in ICDCS.2007. D [9]X.Liu,Q.Huang.and Y.Zhang."Combs,needles,haystacks:balancing push and pull for discovery in large-scale sensor networks,"in Sensys. 2004. Fig.8.Query diffusion cost for various schemes. [10]R.Sarkar,X.J.Zhu,and J.Gao,"Double rulings for information brokerage in sensor networks,"in MOB/COM,2006. [11]F.Hwang.D.Richards,and P.Winter."The steiner tree problem,"G Jahrestagung.vol.1.no.2.pp.370-374,2003. VII.CONCLUSION [12]D.R.Dreyer and M.L.Overton,"Two heuristics for the euclidean steiner tree problem,"Annals of Discrete Mathematics,vol.53,1992. This paper presents the design of a decentralized stor- [13]M.Diane and J.Plesnik,"Three new heuristics for the steiner problem age scheme for multi-dimensional range queries over sensor in graphs,"Acta Math.Univ:Comenianae,vol.60,pp.105-121,1991. [14]T.H.Cormen,C.E.Leiserson,R.L.Rivest,and C. Stein,Introduction networks.We build a distributed index structure,which is to Algorithms,Second Edition. based on k-d tree,so as to efficiently map high dimensional event data to a two-dimensional space of sensors while still preserving the proximity of events.We propose a dynamic programming based methodology to control the granularity of the index structure in an optimized approach,and an optimized 173 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
0 50 100 150 200 250 4*4 8*8 16*16 Average Energy Cost Network Size DIM FullReplicateScheme Optimized K-d Tree (a) Average event insertion cost comparison. 0 20 40 60 80 100 120 140 160 4*4 8*8 16*16 Average Energy Cost Network Size DIM FullReplicateScheme Optimized K-d Tree (b) Average query diffusion cost comparison. 0 20000 40000 60000 80000 DIM FullReplicateScheme Optimized K-d Tree fe=1000,fq=2000 fe=1000,fq=1000 fe=1000,fq=500 Average Energy Cost Varing the fe/fq ratio (c) Overall energy cost comparison. Fig. 7. Comparing performance for various multi-dimensional storage schemes: DIM, Full Replicate Storage Scheme, and Optimized K-d Tree the double ruling based routing approach and the Steiner tree based routing approach. As the query replies are sent back along the query paths, here we only consider the query diffusion cost, the overall query processing cost is thus double the cost of query diffusion. We leverage Prim’s algorithm[14] to calculate the minimum spanning tree (MST) so as to approximate the Steiner tree. Fig.8 depicts the query diffusion cost for various schemes. Note that for all three storage schemes the straight separate routing approach achieves the most energy cost and the Steiner tree based routing approach achieves the best energy efficiency. For DIM the Steiner tree based routing approach reduces 61% energy cost than the straight separate routing approach and reduces 26% energy cost than the double ruling based routing approach, for the optimized k-d tree storage scheme the numbers are respectively 50% and 10%, for the fully replicate storage scheme, as the query message only has to be forwarded to the nearest storage node, the energy costs for the three approaches are equally negligible. 0 5 10 15 20 25 30 35 40 DIM FullReplicateScheme Optimized K-d Tree Normal Double Ruling Steiner Tree Average Energy Cost Fig. 8. Query diffusion cost for various schemes. VII. CONCLUSION This paper presents the design of a decentralized storage scheme for multi-dimensional range queries over sensor networks. We build a distributed index structure, which is based on k-d tree, so as to efficiently map high dimensional event data to a two-dimensional space of sensors while still preserving the proximity of events. We propose a dynamic programming based methodology to control the granularity of the index structure in an optimized approach, and an optimized routing scheme for range query processing to achieve best energy efficiency. VIII. ACKNOWLEDGEMENT This work is supported by the China NSF grant (60573132, 60573106, 60873026), the China 973 project (2009CB320705,2006CB303000). We would like to express our thanks to Yingpei Zeng and Jue Hong, for their helpful comments and for assistance with literature. REFERENCES [1] R. Sylvia, K. Brad, S. Scott, E. Deborah, G. Ramesh, Y. Li, and Y. Fang, “Data-centric storage in sensornets with ght, a geographic hash table,” Mobile Net-works and Applications, vol. 8, no. 4, pp. 427–442, 2003. [2] S. Scott, R. Sylvia, K. Brad, G. Ramesh, and E. Deborah, “Data-centric storage in sensornets,” ACM SIGCOMM Computer Communication Review, vol. 33, no. 1, pp. 137–142, 2003. [3] G. Deepak, E. Deborah, and H. John, “Dimensions: Why do we need a new data handling architecture for sensor networks,” ACM SIGCOMM Computer Communication Review, vol. 33, no. 1, pp. 143–148, 2003. [4] B. Greenstein, S. Ratnasamy, S. Shenker, R. Govindan, and D. Estrin, “Difs: a distributed index for features in sensor networks,” Ad Hoc Networks, vol. 1, no. 2-3, pp. 333–349, 2003. [5] J. L. Bentley., “Multidimensional binary search trees used for associative searching,” Communicaions of the ACM, vol. 18, no. 9, p. 475C484, 1975. [6] X. Li, Y. J. Kim, R. Govindan, and W. Hong, “Multi-dimensional range queries in sensor networks,” in Sensys, 2003. [7] M. Aly, K. Pruhs, and P. K. Chrysanthis, “Kddcs: a load-balanced innetwork data-centric storage scheme for sensor networks,” in CIKM, 2006. [8] Y. C. Chung, I. F. Su, and C. Lee, “Supporting multi-dimensional range query for sensor networks,” in ICDCS, 2007. [9] X. Liu, Q. Huang, and Y. Zhang, “Combs, needles,haystacks: balancing push and pull for discovery in large-scale sensor networks,” in Sensys, 2004. [10] R. Sarkar, X. J. Zhu, and J. Gao, “Double rulings for information brokerage in sensor networks,” in MOBICOM, 2006. [11] F. Hwang, D. Richards, and P. Winter, “The steiner tree problem,” GI Jahrestagung, vol. 1, no. 2, pp. 370–374, 2003. [12] D. R. Dreyer and M. L. Overton, “Two heuristics for the euclidean steiner tree problem,” Annals of Discrete Mathematics, vol. 53, 1992. [13] M. Diane and J. Plesnik, “Three new heuristics for the steiner problem in graphs,” Acta Math. Univ. Comenianae, vol. 60, pp. 105–121, 1991. [14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Second Edition. 173 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply