Occupancy data analytics and prediction: A case study

Building and Environment 102 (2016) 179-192. http://dx.doi.org/10.1016/j.buildenv.2016.03.027

Xin Liang a, b, Tianzhen Hong b, *, Geoffrey Qiping Shen a
a Department of Building and Real Estate, Hong Kong Polytechnic University, Hong Kong, China
b Building Technology and Urban Systems Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
* Corresponding author. E-mail addresses: xin.c.liang@connect.polyu.hk (X. Liang), thong@lbl.gov (T. Hong).

Article history: Received 6 January 2016; Received in revised form 12 March 2016; Accepted 25 March 2016; Available online 28 March 2016.

Keywords: Occupancy prediction; Occupant presence; Data mining; Machine learning

Abstract: Occupants are a critical factor in building energy consumption. Numerous previous studies have emphasized the role of occupants and investigated the interactions between occupants and buildings. However, a fundamental problem, how to learn occupancy patterns and predict occupancy schedules, has not been well addressed, owing to the highly stochastic activities of occupants and insufficient data. This study proposes a data-mining-based approach for occupancy schedule learning and prediction in office buildings. The proposed approach first recognizes the patterns of occupant presence by cluster analysis, then learns the schedule rules by decision tree, and finally predicts the occupancy schedules based on the induced rules. A case study was conducted in an office building in Philadelphia, U.S. Based on one year of observed data, the validation results indicate that the proposed approach significantly improves the accuracy of occupancy schedule prediction. The approach only requires simple input data (i.e., the time series of the number of occupants entering and exiting a building), which is available in most office buildings. It is therefore practical for supporting occupancy schedule prediction, building energy simulation and facility operation.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Buildings are responsible for the majority of energy consumption and greenhouse gas (GHG) emissions around the world. In the United States (U.S.), buildings consume approximately 40% of the total primary energy [1], and in Europe the share is also about 40% [2]. In the last few decades, building energy consumption has continued to increase, especially in developing countries; in China, building energy consumption increased by more than 10% annually [3]. Large-scale commercial buildings have high energy use intensity, which can reach 300 kWh/m2, 5-15 times that of residential buildings [4]. Office buildings accounted for approximately 17% of the energy use in the U.S. commercial building sector [5]. Therefore, office buildings play an important role in total energy consumption around the world.
Occupant behavior is considered a critical factor in the energy consumption of office buildings. Numerous previous studies emphasize the role that occupants play in influencing the energy consumption of buildings and the energy savings that can be expected if occupant behavior is changed [6-8]. Masoso and Grobler [7] indicated that more energy is used during non-working hours (56%) than during working hours (44%), mainly because occupants leave lights and equipment on at the end of the day. Further studies showed that differences in occupant behavior can affect more than 40% of the energy consumption in office buildings [9,10]. Azar and Menassa [6] argued that energy conservation events, which improve energy-saving behaviors, can save 16% of the electricity used in a building.
Occupant behavior is likewise a critical factor in energy simulation and prediction for office buildings. Numerous simulation models and platforms have been developed and are widely used to predict building energy consumption during the design, operation and retrofit phases. However, the differences between real energy consumption and estimated values are typically more than 30% [11], and in some extreme cases the difference can reach 100% [12]. The International Energy Agency's Energy in the Buildings and Communities Program (EBC) Annex 53, "Total Energy Use in Buildings: Analysis & Evaluation Methods", identified six driving factors of energy use in buildings: (1) climate, (2) building envelope, (3) building energy and services systems, (4) indoor design criteria, (5) building operation and maintenance, and (6) occupant behavior. While the first five factors have been well addressed, the uncertainty of occupant presence and the variation of occupant behavior are considered the main reasons for prediction deviations [12,13].
Owing to their significant impacts on energy consumption and its prediction in buildings, a number of studies have focused on occupants' energy use characteristics, defined as the presence of occupants in the building and their actions that do (or do not) influence the energy consumption [14]. D'Oca and Hong [15] observed and identified the patterns of window opening and closing behavior in an office building. Zhou et al. [16] analyzed lighting behavior in large office buildings based on a stochastic model. Zhang et al. [17] simulated occupant movement, light and equipment use behavior synthetically with agent-based models. Sun et al. [18] investigated the impact of overtime working on energy consumption in an office building. Azar and Menassa [6] showed the education and learning effects of energy-saving behavior, and proposed the impacts of energy conservation promotion on energy saving.
Before modelling occupants' energy use characteristics, there is a more essential research question: how to identify the pattern of occupant presence and predict the occupancy schedule? Without an answer to this question, models of occupants' energy use characteristics cannot be put into practice. However, due to highly stochastic activities and insufficient data, it is difficult to observe and predict occupant presence. Previous studies did not pay enough attention to the occupancy schedule, and this question has not been well addressed.
In general, three typical methods were applied to model occupant presence in previous studies. The first method uses fixed schedules: occupants are categorized into several groups (e.g., early bird, timetable complier and flexible worker), then each group is assigned a specific schedule [17]; combining the schedules of each group proportionally generates the schedule of the whole building. The second method assumes that occupant presence follows a certain probability distribution, which can be a Poisson distribution [16], binomial distribution [18], uniform distribution or triangle distribution [19]; the occupancy schedule is then obtained by generating virtual occupants following that distribution. The third method analyzes practical observation data: D'Oca and Hong [8] observed 16 private offices with single or dual occupancy, and Wang et al. [20] observed 35 offices with single occupancy.
Although these methods have advantages and improved occupancy schedule modeling, there are still some limitations: (1) the assumptions are not solid; occupancy schedules are highly stochastic, so it is inappropriate to simply assume that occupants belong to a certain group or follow a certain distribution; (2) previous research emphasized summarizing rules of occupant presence, but less attention has been paid to predicting future schedules, and results that cannot guide future work are not practical; (3) the resulting schedules lack validation with real data; (4) the observed data mainly came from single or multiple offices, so the data are limited and the results may be biased if applied to a whole building.
To bridge the aforementioned research gaps, this study proposes a data mining based approach to learning and predicting the occupancy schedule for a whole building. Data mining can be defined as "the analysis of large observation data sets to find unsuspected relationships and to summarize the data in novel ways so that owners can fully understand and make use of the data" [21].
Data mining methods have significant advantages in revealing underlying patterns of data and have been widely used in various research and industry fields, such as marketing, biology, engineering and social science [22]. However, the applications of data mining to occupancy schedules and building energy consumption are still underdeveloped. Some previous studies applied data mining methods to discover patterns of occupant behavior [15,23,24], and others focused on the interactions between occupants and energy consumption [8,25,26]. These studies demonstrated the strong power of data mining methods in recognizing patterns of occupant behavior and energy consumption, but the research area of occupancy schedule learning and prediction still needs exploration.
The aim of this study is to present a new approach for occupancy schedule learning and prediction in office buildings using data mining based methods. The process of this study includes recognizing the patterns of occupant presence, summarizing the rules of the recognized patterns and finally predicting the occupancy schedules. This study hypothesizes that the patterns and rules identified by the proposed data mining approach are correct, namely that they represent the true characteristics of the occupancy data. This hypothesis is validated by comparing the accuracy of prediction between the proposed method and traditional methods: if the accuracy of the prediction results is improved, the hypothesis holds.
This model only needs a few types of inputs, typically the time series of the number of occupants entering and exiting a building. Another advantage of this model is that it involves relatively simple operations, excluding probability distribution fitting and other complex mathematical processing, which means the method can be readily adapted to practical projects. The results of this study provide insight into the patterns of occupant presence, facilitate energy simulation and prediction, and support energy-saving operation and retrofit.

2. Methodology

2.1. Framework of occupancy schedule learning and prediction

Traditional methods of transforming data into knowledge normally use statistical tests, regression and curve fitting with a certain probability distribution. These methods are effective when the data are small in volume, accurate and standardized. However, as the volume of data has grown exponentially in recent years, these methods have become slow and expensive. More seriously, when there are considerable missing data, deviating data or inconsistent data formats (e.g., different time steps, or a mix of numbers and words), these methods cannot be applied or cannot produce satisfactory results. Data mining is an emerging approach that can process big data and unstructured data effectively and robustly. Machine learning, as a main method of data mining, is specifically good at identifying patterns and inducing rules. Since this study involves a huge volume of data and aims to induce rules of occupancy schedules, data mining is selected as the research method.
Data mining, which is also named knowledge discovery in databases (KDD), is a relatively young and interdisciplinary field of computer science. It is the process of discovering new patterns from large data sets, involving methods at the intersection of pattern recognition, machine learning, artificial intelligence, cloud architecture, and data visualization [27].
Normally, the process of KDD involves six steps: (1) data selection; (2) data cleaning and preprocessing; (3) data transformation; (4) data mining; (5) data interpretation and evaluation; and (6) knowledge extraction [8].
This study proposes a data mining based approach to discover occupancy schedule patterns and extrapolate occupancy schedules from the observed big data streams of a building. The framework of the proposed method includes six steps, illustrated in Fig. 1.
Step 1: problem framing. The first step is to clarify the problem definition, boundary, assumptions and key metrics of success. The research problem is defined as how to predict the occupancy schedule from historical observed data. The scope of this study focuses on schedule prediction for weekdays in office buildings. The key metric of success is the similarity of the prediction results to the observed data.
Step 2: data acquisition and preparation. The second step is to
acquire, harmonize, rescale, clean and format the data. Owing to sensor failures and other interference factors, the raw data may contain missing, erroneous and unstructured records. Before data mining, the raw data are pre-processed to obtain valid data; in this study, missing records are removed from the data set. Statistical methods (i.e., box plots and mean values) are used to investigate the characteristics of the data before mining.
Step 3: methodology selection. Data mining involves various kinds of methods, and different methods target problems at different levels, so appropriate methods are selected according to the specific problem and data source. In this study, machine learning is adopted to discover patterns of occupant presence, while rule induction is used to summarize rules within the patterns. Software selection is also essential for analyzing the data: Matlab 2015 and RapidMiner 6.5, running on a standard PC with Windows 7, perform the data processing and data mining, respectively. RapidMiner is open-source software with a visualized interface and modularized operators for analytics and data mining; owing to its flexibility and accessibility, it has been widely used in industry and academia.
Step 4: learning. This step discovers the patterns of the occupancy schedule and abstracts the rules within the patterns. Clustering and decision tree are applied for pattern recognition and rule induction, respectively. The details of the processes and results of each step are illustrated in the learning phase in Fig. 2.
Step 5: prediction. The observed data are split into a training set and a test set. The training set is used to train the model and identify the rules, shown in the predicting phase in Fig. 2. Based on the identified patterns and rules of occupant presence, the occupancy schedule can be predicted.
Step 6: validation. This step compares the prediction results to the test data set, shown in the validating phase in Fig. 2. The more similar the two sets are, the better the method is. To quantitatively validate the proposed method, several metrics can be applied to measure the similarity between the prediction results and the observed data, including the mean, median, bias, RMSE (root mean squared error) and RTE (relative total error). The details of the metrics and validation are introduced in Section 3.5.

Fig. 1. Framework of the proposed method for occupancy schedule learning and prediction.
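To make Step 6 concrete, the short Python sketch below computes the similarity metrics named above for one predicted and one observed daily profile. It is illustrative only (the study itself used Matlab and RapidMiner), the hourly counts are invented, and, because the exact definition of RTE is given in Section 3.5 (not shown here), the formula used for it (total absolute error divided by total observed occupancy) is an assumption.

import numpy as np

def validation_metrics(observed, predicted):
    """Similarity metrics between an observed and a predicted occupancy profile."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - observed
    return {
        "mean_observed": observed.mean(),
        "mean_predicted": predicted.mean(),
        "median_error": np.median(error),
        "bias": error.mean(),
        "rmse": np.sqrt(np.mean(error ** 2)),
        # Assumed form of the relative total error (RTE); the paper defines it in Section 3.5.
        "rte": np.abs(error).sum() / observed.sum(),
    }

# Hypothetical hourly occupant counts for one weekday (24 values each).
observed = [0, 0, 0, 0, 0, 2, 10, 60, 140, 170, 175, 165, 120, 150, 165, 160, 130, 70, 25, 8, 2, 0, 0, 0]
predicted = [0, 0, 0, 0, 0, 5, 15, 70, 130, 165, 170, 160, 125, 145, 160, 155, 125, 65, 20, 5, 0, 0, 0, 0]
print(validation_metrics(observed, predicted))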
2.2. Machine learning

Machine learning is an important method of data mining [27], which allows computers to learn from and make predictions on data via observation, experience, analysis and self-training [27,28]. It operates by building a model to make data-driven predictions or decisions, rather than following strictly static program instructions [29].
There are two types of machine learning, namely supervised learning and unsupervised learning [30]. The former refers to traditional learning methods with training data, i.e., a known labeled data set of inputs and outputs. In a standard supervised learning problem, training samples $(X, Y) = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ are provided for an unknown function $Y = F(X)$. $X$ denotes the "input" variables, also called input features, and $Y$ denotes the "output" or target variables to be predicted. The $x_i$ values are typically vectors of the form $(x_{i1}, x_{i2}, \ldots, x_{in})$, whose components are the features of $x_i$, such as weight, color, shape and so on. The notation $x_{ij}$ refers to the j-th feature of $x_i$. The goal of supervised learning is to learn a general rule $F(X)$ that maps inputs $X$ to outputs $Y$, shown in Fig. 3(a). Typical algorithms of supervised learning include regression, Bayesian statistics, decision trees, etc.
Unsupervised learning refers to methods without given labels for the learning algorithm, leaving it on its own to find structure in its input. In unsupervised learning, there is no "output" $Y$ to train the function $F(X)$; the goal is to discover hidden patterns in the input data $X$ from its own features, shown in Fig. 3(b). In reality, many problems offer no prior information about the outputs, so unsupervised learning has recently been widely used to solve such problems.
This study uses both supervised and unsupervised learning, in two steps. At the beginning, the occupancy schedule data have no labels, so an unsupervised learning method (i.e., clustering) is applied to identify patterns of occupant presence from the features of the data. After that, the presence data carry labels, namely the identified patterns. Then, a supervised learning method (i.e., decision tree) is applied to induce rules based on the labeled data.

2.2.1. Cluster analysis

Cluster analysis is a typical unsupervised machine learning method, which aims to group data into a few cohesive clusters [31]. The criterion of clustering is the similarity among samples: samples should have high similarity within the same cluster but low similarity across different clusters. Similarity is normally measured by distance; the shorter the distance between samples, the more similar they are. There are various distance definitions, including the Euclidean distance, the Chebyshev distance, the Hamming distance, the dynamic time warp distance and the correlation distance [32]. An appropriate distance type should be selected according to the specific problem: for example, the Euclidean distance is commonly used for direct geometrical distance, the correlation distance is good at triangle similarity, and the dynamic time warp is commonly used for the similarity of time-shifted sequences. This study compares three kinds of distances, shown in Fig. 10, and selects the Euclidean distance due to its best performance.
Fig. 2. Processes of the proposed method and results.
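As an illustration of how the three candidate distance measures behave, the sketch below compares the Euclidean distance, the correlation distance and a minimal dynamic time warping (DTW) distance on two hypothetical daily occupancy profiles. The profile values and the DTW implementation are illustrative assumptions, not the study's code.

import numpy as np
from scipy.spatial.distance import euclidean, correlation

def dtw_distance(a, b):
    """Minimal dynamic-time-warping distance between two 1-D profiles."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two hypothetical daily profiles (hourly occupant counts); day_b is day_a shifted by one hour.
day_a = np.array([0, 0, 0, 0, 0, 0, 5, 40, 120, 160, 170, 165, 120, 150, 160,
                  150, 120, 60, 20, 5, 0, 0, 0, 0], dtype=float)
day_b = np.roll(day_a, 1)

print("Euclidean  :", euclidean(day_a, day_b))
print("Correlation:", correlation(day_a, day_b))  # 1 - Pearson correlation
print("DTW        :", dtw_distance(day_a, day_b))

A time shift inflates the Euclidean distance but leaves the DTW distance small, which is why the choice of metric matters for profile clustering.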
Fig. 3. Mechanism of machine learning.

There are various clustering models, and for each of these models different algorithms can be used [33]. Typical cluster models include connectivity-based models (e.g., hierarchical clustering), centroid-based models (e.g., k-means clustering), distribution-based models (e.g., Gaussian distribution fitting) and density-based models (e.g., density-based spatial clustering of applications with noise) [34]. Among the numerous clustering algorithms, k-means clustering is the most commonly used; it is defined as follows.

1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.
2. Repeat until convergence:
   For every $i$, set
   $c_i = \arg\min_j \lVert x_i - \mu_j \rVert$   (1)
   For every $j$, set
   $\mu_j = \frac{\sum_{i=1}^{m} \alpha_i x_i}{\sum_{i=1}^{m} \alpha_i}$, where $\alpha_i = 1$ if $c_i = j$ and $\alpha_i = 0$ if $c_i \neq j$   (2)

In the k-means algorithm, k (a parameter of the algorithm) is the preset number of clusters. The cluster centroids $\mu_j$ represent the positions of the centers of the clusters. Step 1 initializes the cluster centroids, randomly or by a specific method. Step 2 finds the optimal cluster centroids and the samples assigned to them; two operations are implemented iteratively until convergence. One operation assigns each training sample $x_i$ to the closest cluster centroid $\mu_j$, shown in Eq. (1). The other moves each cluster centroid $\mu_j$ to the mean of the points assigned to it, shown in Eq. (2).
The appropriate clustering algorithm for a particular problem needs to be chosen experimentally, since there is no universally "best" clustering algorithm [33]. The most appropriate algorithm for a given problem can be selected by its performance, which can be measured by how well the clusters are defined, namely the ratio of intra-cluster distance to inter-cluster distance. The Davies-Bouldin index (DBI) is used to evaluate the different methods in this study. This index is defined in Eq. (3):

$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$   (3)

where n is the number of clusters, $c_i$ is the centroid of cluster i, $\sigma_i$ is the average distance of all elements in cluster i to the centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. A lower DBI value means lower intra-cluster distances (higher intra-cluster similarity) and higher inter-cluster distances (lower inter-cluster similarity); therefore, the clustering algorithm with the smallest DBI is considered the best algorithm under this criterion.
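A minimal sketch of this selection procedure using scikit-learn (whose KMeans implementation is Euclidean-based, matching the metric selected in this study): cluster the daily profiles for k = 2 to 8 and keep the k with the smallest Davies-Bouldin index. The random matrix stands in for the real daily occupancy profiles and is purely illustrative, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Hypothetical data: 250 weekdays x 24 hourly occupancy fractions.
profiles = rng.random((250, 24))

best_k, best_dbi, best_labels = None, np.inf, None
for k in range(2, 9):  # k evaluated from 2 to 8, as in the study
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    dbi = davies_bouldin_score(profiles, km.labels_)  # lower is better
    if dbi < best_dbi:
        best_k, best_dbi, best_labels = k, dbi, km.labels_

print(f"best k = {best_k}, Davies-Bouldin index = {best_dbi:.3f}")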
2.2.2. Decision tree learning

This study uses a decision tree to induce the rules of occupant presence. Decision tree learning is a typical supervised machine learning algorithm in data mining [35]. It uses a tree-like structure to model the rules and their possible consequences. A main advantage of the decision tree method is that it can represent the rules visually and explicitly. Fig. 4 illustrates the structure of a decision tree model, which includes three types of nodes (i.e., root node, leaf node and terminal node) and branches between nodes. The leaf nodes denote attributes of the input, while branches denote the conditions on these attributes. Each terminal node is a subset of the target variables Y, which carries two kinds of information: (1) the classification of the target variables Y, and (2) the probability of each subset. Based on the classification and probability, the rules for prediction can be induced.
Most algorithms for generating decision trees are variations of a core algorithm that employs a top-down, greedy search through the entire space of possible decision trees. The ID3 algorithm [36] and its successor C4.5 [37] are the most used methods. The key of these algorithms is the choice of the best attribute at each node. To measure the classification effect of a given attribute, a metric called information gain is defined as follows [38]:

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$   (4)

where

$Entropy(S) = \sum_i -p_i \log_2 p_i$   (5)

Gain(S, A) represents the information gain of an attribute A relative to a collection of samples S. Values(A) is the set of all possible values of attribute A, and $S_v$ is the subset of S for which attribute A has value v, namely $S_v = \{s \in S \mid A(s) = v\}$. $p_i$ represents the proportion of S belonging to class i, and Entropy is a measure of the impurity of a collection of training samples. Given the definition of Entropy, Gain(S, A) in Eq. (4) is the reduction in entropy caused by knowledge of attribute A; namely, Gain(S, A) is the contribution of attribute A to the information about samples S. The highest value of information gain indicates the best attribute A at a specific node.
There are two steps in decision tree generation. The first step is learning rules from the training data based on the aforementioned C4.5 algorithm; the gain ratio method is employed to identify the best attribute at each node by minimizing the entropy, with the confidence set to 0.25 and the minimal gain set to 0.1. The second step is predicting based on the rules learned in the first step and validating the results with testing data. If the accuracy is satisfactory, the process is finished; otherwise, the two steps are repeated to update the decision tree until the result is satisfactory. The cross-validation method [8] is used to evaluate the performance of the decision tree in this study: the data set is divided into ten subsets, seven of which are used for training and the other three for testing, and the procedure is repeated by exchanging subsets. Cross-validation improves the accuracy and robustness of the decision tree model.
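The sketch below illustrates the same idea with scikit-learn. Note that scikit-learn implements CART rather than C4.5, so the entropy criterion only approximates the information-gain/gain-ratio splitting described above, and plain 10-fold cross-validation stands in for the paper's 7-training/3-testing subset scheme. The feature set (season, weekday, DST flag) anticipates Section 3.3, and the labels stand in for the cluster patterns; all data here are made up.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Hypothetical features for 250 weekdays: season (0-3), day of week (0-4), DST flag (0/1).
X = np.column_stack([rng.integers(0, 4, 250),
                     rng.integers(0, 5, 250),
                     rng.integers(0, 2, 250)])
y = rng.integers(0, 4, 250)  # pattern label assigned by the clustering step

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)  # 10-fold cross-validation
tree.fit(X, y)
print("mean CV accuracy:", scores.mean())
print(export_text(tree, feature_names=["season", "weekday", "dst"]))  # readable if-then rules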
2.3. Case study

A case study was conducted to demonstrate the proposed method. The office building of the case study is Building 101 in the Navy Yard, Philadelphia, U.S., shown in Fig. 5. The building is one of the nation's most highly instrumented commercial buildings; Building 101 in the Navy Yard is the temporary headquarters of the U.S. Department of Energy's Energy Efficient Building Hub (EEB Hub) [39]. Various sensors have been installed by the EEB Hub since 2012 to acquire building data on occupants, facilities, energy consumption and environment. The profile of Building 101 is shown in Table 1.
Four sensors are installed at the gates of the building to record the number of occupants entering and exiting. The sensors are located on the first floor of Building 101, shown in Fig. 6. The data format of the raw sensor records is shown in Table 2. The set $(N_{i1}, N_{i3}, N_{i5}, N_{i7})$ denotes the numbers of entering occupants, while the set $(N_{i2}, N_{i4}, N_{i6}, N_{i8})$ denotes the numbers of exiting occupants at the i-th time step. Therefore, the total number of occupants in the building at the i-th time step can be calculated by Eq. (6):

$N_{total} = \sum_{i} (N_{i1} - N_{i2} + N_{i3} - N_{i4} + N_{i5} - N_{i6} + N_{i7} - N_{i8})$   (6)

Fig. 4. Graphical structure of the decision tree model.
Fig. 5. Photo of Building 101.
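A small pandas sketch of Eq. (6), applied to a few hypothetical 5-min records laid out like Table 2 (the column names and numbers are invented): summing entries minus exits across all four sensors and accumulating over time yields the whole-building occupancy series.

import pandas as pd

# Hypothetical 5-min records mirroring Table 2: one In/Out column pair per door sensor.
idx = pd.date_range("2014-01-01 00:00", periods=6, freq="5min")
counts = pd.DataFrame({
    "s1_in": [0, 2, 5, 3, 1, 0], "s1_out": [0, 0, 1, 0, 2, 3],
    "s2_in": [0, 1, 2, 2, 0, 0], "s2_out": [0, 0, 0, 1, 1, 2],
    "s3_in": [0, 0, 1, 1, 0, 0], "s3_out": [0, 0, 0, 0, 1, 1],
    "s4_in": [0, 0, 0, 1, 0, 0], "s4_out": [0, 0, 0, 0, 0, 1],
}, index=idx)

# Eq. (6): running sum of (entries - exits) over all doors gives the occupants in the building.
in_cols = [c for c in counts.columns if c.endswith("_in")]
out_cols = [c for c in counts.columns if c.endswith("_out")]
net_flow = counts[in_cols].sum(axis=1) - counts[out_cols].sum(axis=1)
occupancy = net_flow.cumsum()
print(occupancy)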
3. Results

3.1. General characteristics of occupant presence

This study uses the data from the year 2014, with a time step of 5 min. Due to sensor failures and other reasons, there are some missing data, amounting to less than 1% of all samples. Based on the measured data of Building 101, general characteristics of occupant presence were analyzed and compared among different conditions using statistical methods.
The daily 24-h profile of occupant presence is the main target of this study. First, the hourly occupant presence on weekdays and weekends is shown in Fig. 7. The results show that the mean occupant number in the building is close to zero during weekends and holidays and the variance is also low; there are normally few occupants on weekends and holidays. Therefore, when analyzing the occupancy schedule, this study excludes the data from weekends and holidays. On weekdays, the mean occupant number changes significantly over time, and the variation range of the occupant number is very large from 7 am to 4 pm, exceeding 30% of the mean. This indicates the main characteristics of occupant presence: dynamic, stochastic and highly variable. These characteristics make occupant presence difficult to understand and predict with traditional statistical methods.
Statistical results of hourly occupant presence from Monday to Friday are compared in Fig. 8. They show that the features of each weekday are different. For example, the variance range at 11 am is much smaller on Tuesday and Thursday than on Monday and Wednesday, and the particular values (extremely high values) on Friday are significantly lower than those of the other four days.

Table 1. The profile of Building 101.
  Location: Philadelphia, U.S.
  Size: 6410 m2
  Floors: 3
  Year constructed: 1911
  Building usage: Office

Table 2. The data format of sensor records.
  Time step          | Sensor 1 (In, Out) | Sensor 2 (In, Out) | Sensor 3 (In, Out) | Sensor 4 (In, Out)
  1/1/2014 0:00      | N11, N12           | N13, N14           | N15, N16           | N17, N18
  1/1/2014 0:05      | ...                | ...                | ...                | ...
  ...                | ...                | ...                | ...                | ...
  12/31/2014 23:55   | Ni1, Ni2           | Ni3, Ni4           | Ni5, Ni6           | Ni7, Ni8

Fig. 6. Sensor locations in Building 101.
Fig. 7. Hourly occupant presence during weekdays and weekends.
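The weekday/weekend comparison in Fig. 7 can be reproduced in outline with a pandas group-by over a whole-building occupancy series such as the one computed above; the per-hour statistics are what the plots summarize. The series used here is hypothetical and the code is an illustrative sketch, not the study's analysis scripts.

import numpy as np
import pandas as pd

# Hypothetical 5-min whole-building occupancy series for the year 2014.
idx = pd.date_range("2014-01-01", "2014-12-31 23:55", freq="5min")
rng = np.random.default_rng(2)
occupancy = pd.Series(rng.integers(0, 200, len(idx)), index=idx)

frame = occupancy.to_frame("occupants")
frame["hour"] = frame.index.hour
frame["is_weekend"] = frame.index.dayofweek >= 5  # Saturday=5, Sunday=6

# Hourly mean and spread, separately for weekdays and weekends (cf. Fig. 7).
profile = frame.groupby(["is_weekend", "hour"])["occupants"].agg(["mean", "std"])
print(profile.head(24))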
Although the occupancy features differ on each weekday, the averages of hourly occupant presence are very similar on each weekday except Friday. This indicates that the traditional method, which uses only the mean value to describe occupant presence (Fig. 9), loses granularity of information.

Fig. 9 shows that occupant presence in Building 101 has a dual-peak feature (mainly due to occupants going out for lunch), which is similar to the occupant schedules used in ASHRAE Standard 90.1 [40]. This verifies that the occupancy data in this case are not abnormal and follow a generally applicable shape. However, the peak in the afternoon is slightly lower than that in the morning (the morning and afternoon peaks are identical in the ASHRAE standard). In addition, the drop at noon is not as sharp as that in ASHRAE Standard 90.1, and the slopes are likewise different. Therefore, the ASHRAE standard schedule cannot be applied unchanged to every building; it is necessary to adjust the occupancy factor according to the data of the particular building.

The occupant presence curve can be divided into six periods:

The night period (7 pm-6 am): Few occupants are in the building, typically none. The occupancy rate is normally less than 10% of the maximum value.
The going-to-work period (7 am-9 am): Occupants arrive successively in this period. The occupancy rate grows from 10% to 70%.
The morning period (10 am-12 pm): Occupants are working in the building and the occupancy rate stays around 80%.
The noon-break period (12 pm-1 pm): Some occupants go out for lunch and the occupancy rate drops slightly below 80%.
The afternoon period (2 pm-3 pm): Occupants are back at work in the building. The occupancy rate rises slightly above 80%, but remains lower than that in the morning period.
The going-home period (4 pm-6 pm): Occupants leave the office successively in this period. The occupancy rate decreases from 70% to 10%.

3.2. Patterns of occupant presence

This step discovers the patterns of occupant presence during weekdays. The data mining software RapidMiner 6 is applied to disaggregate the presence data into several clusters. In this study, BDI is used to find the optimal k value and distance metric for the k-means algorithm. The k values are evaluated from 2 to 8, and the distance metrics compared are Euclidean distance, correlation similarity and dynamic time warping. The results indicate that k = 4 with the Euclidean distance metric is the optimal setting of the k-means algorithm for this data set, as shown in Fig. 10.
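The clustering step was carried out in RapidMiner 6; a rough Python analogue is sketched below for readers who prefer code. It assumes each weekday is represented as a 24-dimensional vector of hourly occupant counts and that the BDI is a Davies-Bouldin-style cluster validity index (lower is better). scikit-learn's KMeans supports only Euclidean distance, so the correlation-similarity and dynamic-time-warping comparisons are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def cluster_daily_profiles(profiles: np.ndarray):
    """profiles: array of shape (n_days, 24), one hourly occupancy profile per weekday."""
    scores = {}
    for k in range(2, 9):                          # scan k = 2 .. 8 as in the study
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(profiles)
        scores[k] = davies_bouldin_score(profiles, labels)
    best_k = min(scores, key=scores.get)           # lowest index wins (k = 4 in the case study)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(profiles)
    return best_k, labels, scores
```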
The four clusters of occupant presence data are shown in Fig. 11. From the visualization of the clusters, four patterns of occupant presence are highlighted as follows, and their characteristics are summarized in Table 3:

Pattern 1 represents the lowest occupancy rate and shortest working time. The occupants go to work latest and go home earliest in this pattern. The occupancy rate rises to 50% only around 10 am. In addition, there is no obvious noon-break drop in the curve of this pattern, since the occupant number decreases continuously after 11 am.
Pattern 2 represents the highest occupancy rate and longest working time. The occupants go to work earliest and go home late in this pattern. The occupancy rate rises to 50% around 8 am and decreases to 50% around 5 pm. The noon break is around 12 pm.
Pattern 3 represents a medium occupancy rate and medium working time, with later going-to-work and going-home times. The occupancy rate rises to 50% around 9 am and decreases to 50% before 6 pm. The noon break is around 2 pm.
Pattern 4 is similar to Pattern 3 in that it likewise represents a medium occupancy rate and medium working time. The main difference is that the going-to-work and going-home times are about 1 h earlier than in Pattern 3. The occupancy rate rises to 50% around 8 am and decreases to 50% before 5 pm. The noon break is around 1 pm.

Fig. 8. Hourly occupant presence from Monday to Friday.

3.3. Rules of patterns

Based on the recognized patterns of occupant presence, the rules governing these patterns are induced in this step. According to the data analysis, three influencing factors are used in the decision tree generation; the patterns are related to (1) seasons (temperatures), (2) weekdays and (3) daylight saving time (DST).1

1 Daylight saving time in the USA starts on the second Sunday in March and ends on the first Sunday in November.
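For illustration, all three factors can be derived directly from the time-step column. The sketch below is an assumption-laden example (meteorological seasons by month, DST window per the footnote), not part of the proposed method.

```python
import pandas as pd

# Month-to-season mapping (an assumption; the paper does not state its definition).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def nth_sunday(year: int, month: int, n: int) -> pd.Timestamp:
    days = pd.date_range(f"{year}-{month:02d}-01", periods=31, freq="D")
    sundays = [d for d in days if d.month == month and d.dayofweek == 6]
    return sundays[n - 1]

def day_features(date: pd.Timestamp) -> dict:
    dst_start = nth_sunday(date.year, 3, 2)     # second Sunday in March
    dst_end = nth_sunday(date.year, 11, 1)      # first Sunday in November
    return {
        "season": SEASONS[date.month],
        "weekday": date.day_name(),             # 'Monday' ... 'Friday'
        "dst": dst_start <= date < dst_end,     # True during daylight saving time
    }
```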
Since the temperature information requires additional data input, while season information can be derived from the existing data set (the time step column), seasons are selected as an analysis factor to simplify the proposed method. As shown in Fig. 12, these three factors have strong relations with the patterns. For example, most occurrences of Pattern 1 fell on Friday, and Pattern 4 never occurred in winter. This suggests that it is possible to induce the underlying rules of the patterns from these factors.

Fig. 9. Mean of hourly occupant presence on weekdays.

Fig. 10. Performance of k values and distance metrics evaluated by BDI.

Fig. 11. Patterns of occupant presence.
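A minimal sketch of the rule-induction step is given below. It uses scikit-learn's DecisionTreeClassifier as a stand-in for the gain-ratio-based tree used in the study (scikit-learn offers entropy or Gini criteria, not the gain ratio), and the column names are hypothetical; each training sample is one weekday labeled with the pattern assigned by the clustering step.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def learn_pattern_rules(days: pd.DataFrame):
    """days: columns 'season', 'weekday', 'dst' (categorical) and 'pattern' (1-4)."""
    X = pd.get_dummies(days[["season", "weekday", "dst"]])   # one-hot encode the categories
    y = days["pattern"]
    # Note: this tree may differ slightly from the paper's, which uses the gain ratio.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)
    return tree, X.columns

# The fitted tree returns class probabilities at each leaf, i.e. the pattern
# probabilities P_pi used later in Eq. (9): probs = tree.predict_proba(X_new)
```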
Fig. 13 shows the decision tree for classification of the patterns by the attributes. Any sample can be classified into a pattern top-down along a path of the tree. The first decision level is season. If the season is winter, the branch leads directly to a terminal node; if not, the process proceeds to the second decision level, namely weekday. After splitting at the weekday nodes, the final decisions are generated. It should be noted that DST is not included in the decision tree, which means DST cannot contribute enough information to reach the threshold of the gain ratio; in other words, DST is not a key attribute in the classification of the patterns.

The decision tree provides not only the classification but also its probability. In Fig. 13, the lengths of the different colors represent the probabilities of the different patterns. For example, if the season is winter, the decision is Pattern 3. Behind this decision, there is further probability information: Pattern 3 has the highest probability, Patterns 1 and 2 have lower probabilities, and the probability of Pattern 4 is zero. Table 4 shows the rules of the patterns in detail. Based on these rules, 80% of all training samples are correctly classified. The decision tree model thus shows relatively good performance and is applied to prediction in the next step.

3.4. Prediction of occupancy schedule

Based on the rules deduced by the decision tree, the occupancy schedule can be predicted. Three prediction methods are compared in this study. The first is the mean-day method, in which the prediction depends only on the time of day. The method is presented by Eq. (7), where t denotes the time of day (e.g. 3 pm) and M_{day} denotes the mean value over all days. For example, the prediction for 3 pm is the average of all historical data for 3 pm. Therefore, there is no distinct profile for each day of the week, for different seasons or for other factors. This prediction method is simple and serves as a baseline for comparison.

Prediction(t) = M_{day}(t)    (7)

The second method is the mean-week method, presented by Eq. (8), where weekday denotes the day of the week of the sample and M_{weekday} denotes the mean value for that weekday. For example, the prediction for 3 pm on a Monday in spring is the average of all historical data for 3 pm on Mondays.

Prediction(weekday, t) = M_{weekday}(t)    (8)

The third method is the one proposed in this study, which is based on the pattern probabilities from the decision tree. The method is presented by Eq. (9), where M_{pi} (i = 1, 2, 3, 4) denotes the mean value of Pattern i and P_{pi} denotes the probability of Pattern i. For example, the prediction for 3 pm on a Monday in spring is the expectation of all historical data for 3 pm based on the probabilities of the patterns.

Prediction(day, t) = M_{p1}(t)·P_{p1} + M_{p2}(t)·P_{p2} + M_{p3}(t)·P_{p3} + M_{p4}(t)·P_{p4}    (9)

The visualized prediction of the occupancy schedule based on the third method is shown in Fig. 14. Since there are 16 terminal nodes in the decision tree (Fig. 13), there are 16 prediction conditions.

3.5. Validation

Several statistical performance metrics are used to evaluate the predictions. Their definitions are described below.

The root mean squared error (RMSE) quantifies the typical size of the prediction error in absolute units. The equation for RMSE is provided in Eq. (10), where E_i is the observed occupant number, \hat{E}_i is the predicted value, and n is the total number of predictions.
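The three prediction methods of Eqs. (7)-(9) and the RMSE of Eq. (10) are simple enough to express directly in code. The sketch below is an illustrative implementation with hypothetical column names, not the authors' code; df is assumed to hold historical weekday data with columns 'hour', 'weekday', 'pattern' and 'occupants', and pattern_probs maps each pattern id to its probability P_pi from the decision tree leaf for the day being predicted.

```python
import numpy as np
import pandas as pd

def mean_day(df: pd.DataFrame, t: int) -> float:
    # Eq. (7): average of all historical observations at hour t.
    return df.loc[df["hour"] == t, "occupants"].mean()

def mean_week(df: pd.DataFrame, weekday: str, t: int) -> float:
    # Eq. (8): average of historical observations at hour t on the given weekday.
    sel = (df["hour"] == t) & (df["weekday"] == weekday)
    return df.loc[sel, "occupants"].mean()

def pattern_based(df: pd.DataFrame, pattern_probs: dict, t: int) -> float:
    # Eq. (9): probability-weighted sum of the per-pattern mean profiles.
    return sum(
        p * df.loc[(df["pattern"] == i) & (df["hour"] == t), "occupants"].mean()
        for i, p in pattern_probs.items()
    )

def rmse(observed: np.ndarray, predicted: np.ndarray) -> float:
    # Eq. (10): root mean squared error between observations and predictions.
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))
```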
Table 3
Characteristics of occupant presence patterns.

Pattern     Occupancy rate   Working time   Going-to-work time   Going-home time   Noon-break time
Pattern 1   Lowest           Shortest       Latest               Earliest          NA
Pattern 2   Highest          Longest        Earliest             Later             12 pm
Pattern 3   Medium           Medium         Later                Latest            2 pm
Pattern 4   Medium           Medium         Earlier              Earlier           1 pm

Fig. 12. Relationship between occupancy patterns and weekdays, seasons and DST.