
the Viterbi algorithm is not able to evaluate all the possible paths and cannot return a sequence with certainty. To prevent this, we applied a smoothing technique that assigns a small probability to every transition and emission, even one that never occurred in the whole dataset. In this experiment we used the Pseudoemissions and Pseudotransitions parameters of MATLAB's HMM functions.
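A minimal sketch of this smoothing step, assuming the sensor readings have already been discretised into an integer symbol sequence seq aligned with an activity-label sequence states; the variable names, the all-ones pseudocount matrices and the test sequence testSeq are illustrative assumptions rather than the exact setup of the experiment:

    % Estimate HMM parameters with pseudocounts so that no transition or
    % emission is ever assigned a zero probability.
    pseudoTR = ones(numStates, numStates);   % pseudocount for every transition
    pseudoE  = ones(numStates, numSymbols);  % pseudocount for every emission
    [TRANS, EMIS] = hmmestimate(seq, states, ...
        'Pseudotransitions', pseudoTR, ...
        'Pseudoemissions', pseudoE);
    % Decode the most likely activity sequence for unseen data.
    likelyStates = hmmviterbi(testSeq, TRANS, EMIS);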
3.3 Data Pre-processing

Let λ be our model and let D = {(x_t, y_t)}, t = 1, 2, …, n, be the training data extracted from our datasets, where each y_t is a training label and each x_t is a sensor reading sample. If we consider that the first annotation occurred at time t = 1 and the last annotation in the dataset at time t = n, we need to specify the time between each time step t = 1, 2, 3, … This is what we call timeslices. As the dataset has a resolution of milliseconds, we could assign a timeslice as small as that value. However, if we opted to do that we would be creating 1000 ms × 60 s × 60 min × 24 h = 86.4 × 10^6 samples per day, which is computationally unmanageable. Timeslices of one second might be considered, but previous works as well as our own experiments suggest that this length is too costly in terms of computation time and does not give a real improvement in model accuracy.
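To illustrate the timeslicing described above, the following sketch bins millisecond-resolution sensor events into fixed-length timeslices; eventTimesMs, sensorIds and numSensors are assumed variables standing in for the raw dataset fields, and the 60-second length anticipates the choice made in Section 3.3.1:

    % Map raw event timestamps (in ms) onto 1-based timeslice indices.
    sliceLen = 60 * 1000;                       % timeslice length in ms
    tIdx = floor(eventTimesMs / sliceLen) + 1;  % timeslice index per event
    numSlices = max(tIdx);
    % One binary feature row per timeslice: which sensors fired in it.
    X = zeros(numSlices, numSensors);
    X(sub2ind(size(X), tIdx, sensorIds)) = 1;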
As discussed in previous sections, the D1 dataset is divided into three different scenarios, named House A, House B and House C, comprising 25, 14 and 19 days of data respectively. We evaluated the three models using different sample lengths, with timeslices ranging from 30 seconds to 10 minutes per sample. Each scenario was evaluated independently, performing a cross-validation that divides the data into days, testing on one day while leaving the rest for training (leave one out), and then averaging the results.
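A sketch of this leave-one-day-out procedure, reusing the smoothed HMM training call sketched earlier; dayOfSample, seq, states, numStates and numSymbols are assumed variables, and per-timeslice accuracy is an illustrative metric:

    % Leave-one-day-out cross-validation: hold out one day, train on the rest.
    days = unique(dayOfSample);
    acc = zeros(numel(days), 1);
    for i = 1:numel(days)
        testMask  = (dayOfSample == days(i));
        trainMask = ~testMask;
        % Concatenating the training days adds a few spurious transitions
        % at day boundaries, which this sketch ignores.
        [TRANS, EMIS] = hmmestimate(seq(trainMask), states(trainMask), ...
            'Pseudotransitions', ones(numStates), ...
            'Pseudoemissions', ones(numStates, numSymbols));
        pred = hmmviterbi(seq(testMask), TRANS, EMIS);
        acc(i) = mean(pred == states(testMask));
    end
    meanAccuracy = mean(acc);   % average over held-out days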
In the case of Dataset 2, two different setups were considered: the Timeslice Approach (TA) and the Chunk Data Approach (CDA).
3.3.1 Timeslice Approach (TA): As with D1, timeslices of 60 seconds were the length of choice for binning the data of the second dataset. Therefore, for 56 days with 1440 samples per day, a total of N = 80,640 samples was initially generated. However, the datasets we are using for our experiments are not fully labelled, meaning that not every x_n has a y_n label associated. This issue can be addressed in two different ways. The first solution would be to create an 'idle' activity and consider it as another class, assigning every empty y_n to that label. The other option would be simply to remove those samples from the training data. Both options are sketched below.
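The two options can be sketched as follows, assuming the timeslice labels are stored in a vector y in which 0 marks an unlabelled slice, and the features in a matrix X with one row per timeslice (illustrative encodings, not the dataset's actual ones):

    % Option 1: treat unlabelled timeslices as an extra 'idle' class.
    idleClass = max(y) + 1;       % allocate a new class id for 'idle'
    yIdle = y;
    yIdle(y == 0) = idleClass;    % every empty label becomes 'idle'

    % Option 2: discard timeslices with no activity annotation.
    labelled = (y ~= 0);          % keep only annotated timeslices
    Xclean = X(labelled, :);
    yclean = y(labelled);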
In a previous experiment using D1, the absence of activity associated with sensor firings was considered as an 'idle' activity, and the preliminary results were similar whether or not this data was included. We decided to keep these samples in order to maximize the use of the readings in D1. However, for D2 this approach proved inadvisable, based on the initial attempts to include 'idle' in this dataset. The amount of unlabelled data for D1 was just 12%, 7% and 19% for Houses A, B and C respectively. For D2, however, the amount of information from sensors not associated with any activity (the frequency of empty y_n's) accounted for more than 80% of the total samples. As a result, the classifiers trained with D2 data predicted only the class 'idle' for all the test points, owing to the massive imbalance introduced by this new class and to the fact that any sensor firing combination could carry an 'idle' label, since we do not know what activities were actually occurring during those blank timesteps. To solve this, all the information from sensor events that was not related to any label was simply removed from the feature array (option 2 in the sketch above), and thus the label 'idle' was not considered. Ultimately, the TA approach