According to Wikipedia, the rather unfortunate term “dogfooding” was coined in reference to the use of one’s own products. That is, “if it is good enough for others to use then it is good enough for me to use”. I prefer the term coined by one-time Microsoft CIO Tony Scott: “icecreaming”.
In this two-part post, I am going to “eat my own ice cream” and dive into my own smart meter electricity data made available by my electricity distributor through an online portal. I will endeavour to find out what drives electricity usage in my household, how to make the data as predictable as possible and what lessons can be learned so that the utilities sector can get better insight from smart meter data.
Whether it is for regulatory requirements or generally better business decision making, traditional forecasting practices are proving to be inadequate. The current uncertainty in electricity pricing is partly driven by inadequate peak load and energy forecasts. Until the mid-2000s energy forecasting was very straightforward, as electricity was a low-cost resource and depended on very mature technology. And then everything changed. We had a run of hot summers followed by a run of wet, mild ones. We had the rooftop solar revolution, helped in the early days by considerable government subsidy. We had changes in building energy efficiency standards, and lately we have also had a downturn in the domestic economy. And of course we have had price rises, which have revealed the demand elasticity of some consumers.
This array of influences can seem complex and overwhelming, but armed with some contemporary data mining techniques and plenty of data we can build forecasts which take this range of factors into account and, more importantly, dispel myths about what does and doesn’t affect consumption patterns. Furthermore, we can build algorithms that will detect when some new disruptor comes along and causes changes that we have not previously accounted for. This is very important in an age of digital disruption. Any organisation that is not master of its own data has the potential to face an existential crisis and all of the pain that comes with that.
In this analysis I am going to use techniques that I commonly use with my clients. In this case I am looking at a single meter (my own meter), but the principles are the same. When working with my clients my approach is to build the forecast at every single meter, because different factors will drive the forecast for different consumers (or at least different segments of consumers).
So that I don’t indulge in “analysis paralysis”, I will define some hypotheses that I want to test:
- What drives electricity usage in my household?
- How predictable is my electricity usage?
- Can I use my smart meter data to predict electricity usage?
I will use open source/freeware to conduct this analysis and its visualisations, to prove again that this type of analysis does not have to be costly in terms of software, but relies instead on “thoughtware”. As always, let’s start with a look at the data.
As you can see, I have a row for each day and 48 half-hourly readings, which is the standard format for meter data. To this I add day of week and a weekend flag calculated from the date. I also add temperature data from my nearest Bureau of Meteorology automated weather station, which happens to be only about 3 kilometres away and at a similar altitude. I also total the 48 readings so I have a daily kWh usage figure. In a future post I will look into which techniques we can apply to the half-hourly readings, but in this post I will concentrate on this total daily kWh figure.
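The derived fields described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name, the field names and the sample date and readings are all made up for the example, and the temperature join is left out because it depends on how the weather station data is supplied.

```python
from datetime import date


def add_daily_features(reading_date, half_hourly_kwh):
    """Return the derived fields for one day of half-hourly meter data."""
    if len(half_hourly_kwh) != 48:
        raise ValueError("expected 48 half-hourly readings")
    weekday_index = reading_date.weekday()  # Monday is 0, Sunday is 6
    return {
        "date": reading_date,
        "day": reading_date.strftime("%A"),   # day-of-week name
        "is_weekend": weekday_index >= 5,     # Saturday or Sunday
        "daily_kwh": sum(half_hourly_kwh),    # total of the 48 readings
    }


# Made-up example: a flat 0.5 kWh in every half hour on a Saturday.
example_row = add_daily_features(date(2014, 3, 1), [0.5] * 48)
print(example_row["day"], example_row["is_weekend"], example_row["daily_kwh"])
```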
This is the data with the added fields:
My tool of choice for this analysis is the Generalised Linear Model (GLM). As a general rule regression is usually a good choice for modelling a variable of continuous values. GLMs also allow tuning of the model to fit the distribution of the data.
Before deciding what type of GLM to use let’s look at the distribution of daily usage:
Not quite a normal distribution. The distribution is slightly skewed to the left and has high kurtosis, which looks a little like a gamma distribution. Next let’s look at a distribution of the log of daily kWh.
Here I can see a long tail to the left, but if I ignore that tail then I get quite a symmetric distribution. Let’s have a closer look at those outliers, this time by plotting temperature against daily kWh. They can be seen clearly in a cluster at the bottom of the graph below.
This cohort of low energy usage days represents times when our house has been vacant. In the last year these have mostly been one-off events with no data that I can use to predict their occurrence. They can all be defined as being below 5 kWh, so I’ll remove them from my modelling dataset. The next graph then shows we clearly have a better fit to a gamma distribution (blue line) rather than a normal distribution (red line).
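As a rough sketch of this step, the snippet below drops the sub-5 kWh days and compares how well a normal and a gamma distribution describe what is left, using log-likelihood (higher is better). The sample values are invented, the function names are hypothetical, and the gamma parameters are estimated by the method of moments rather than a full maximum likelihood fit, so treat it as an illustration of the idea rather than the fit shown in the graph.

```python
import math


def split_vacant_days(daily_kwh, threshold=5.0):
    """Separate occupied days from vacant days (usage below the threshold)."""
    occupied = [x for x in daily_kwh if x >= threshold]
    vacant = [x for x in daily_kwh if x < threshold]
    return occupied, vacant


def _mean_and_variance(xs):
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, variance


def normal_loglik(xs):
    """Log-likelihood under a normal distribution with the sample mean and variance."""
    mean, variance = _mean_and_variance(xs)
    return sum(
        -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)
        for x in xs
    )


def gamma_loglik(xs):
    """Log-likelihood under a gamma distribution fitted by the method of moments."""
    mean, variance = _mean_and_variance(xs)
    shape = mean * mean / variance
    scale = variance / mean
    return sum(
        (shape - 1) * math.log(x) - x / scale
        - math.lgamma(shape) - shape * math.log(scale)
        for x in xs
    )


# Invented daily totals in kWh; the first two stand in for vacant days.
sample_kwh = [1.2, 2.5, 12.0, 14.5, 15.0, 17.0, 18.5, 21.0, 26.0, 38.0]
occupied_days, vacant_days = split_vacant_days(sample_kwh)
print("normal:", normal_loglik(occupied_days))
print("gamma: ", gamma_loglik(occupied_days))
```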
We are now ready to model. This is what the first GLM looks like:
In assessing my GLM, I will use three measures:
- The “p(>|t|)” to estimate the goodness of fit of each predictor (the smaller the better which means the greater confidence we have in the coefficient estimate),
- R2 which represents how well the overall model fits (the higher the better; R2 can be thought of as the percentage of variance in the data explained by the model), and
- root mean squared error (RMSE) which tells me what the quantum of average difference is between my actual and predicted values (the lower the better; an RMSE of zero means that predicted values do not vary from actual values).
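The last two measures are simple to compute by hand. The sketch below uses the ordinary definitions on actual versus predicted values; there are several R2 variants for GLMs, so the definition used here is an assumption, and the numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Made-up daily kWh values, purely for illustration.
actual_kwh = [10.0, 20.0, 30.0]
predicted_kwh_values = [12.0, 18.0, 33.0]
print(r_squared(actual_kwh, predicted_kwh_values), rmse(actual_kwh, predicted_kwh_values))
```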
The model above is not very well fitted, as demonstrated by the p-values and by the fact that some coefficients did not produce an estimate. This model has an R-squared of 0.36 and an RMSE of 7, and these statistics are not very reliable given the p-values.
Also it seems odd that MinTemp is significant but MaxTemp is not. So I remove poor performing variables and add an interaction between MinTemp and MaxTemp as I expect to find a relationship between these two values and electricity usage.
This new model is better fitting, with an r-squared of 0.64 and an RMSE of 5.32. But the p-value for “Day==Tuesday” is still not low enough for my liking given the sample size of only a few hundred observations. At the risk of erring slightly on the side of underfit, I remove this term from the model. Taking a closer look at temperature, I plot average temperature (the midpoint between MinTemp and MaxTemp) against daily kWh and I find an interesting pattern:
We see crossover points in the direction of correlation in the same temperature band at different seasonal changeovers, like bookends to the winter peak in usage. I use this insight to create two new temperature variables using splines. A bit of experimentation leads me to conclude that the temperature changeover is at 18 degrees Celsius, which is also the temperature at the bottom of my U-curve scatterplot above. I create a variable called “spl1” which is zero for all values less than 18 degrees and then the average temperature minus 18 for all above. The second variable, “spl2”, is the opposite: zero for all temperatures above 18 degrees and 18 minus the average temperature for all below. Because I am using a log link function, these variables will describe a U-shape as in the scatterplot rather than a V-shape, which is what would happen were I using linear regression.
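To make the two spline variables concrete, here is a small Python sketch. The “spl1” and “spl2” definitions and the 18 degree changeover come from the description above; the function names and the coefficients are hypothetical and are not the fitted values from my model. The second function shows why a log link bends the two straight segments into a curve: the linear predictor is exponentiated, so usage grows multiplicatively as temperature moves away from 18 degrees.

```python
import math

KNOT_C = 18.0  # temperature changeover in degrees Celsius


def temperature_splines(min_temp, max_temp, knot=KNOT_C):
    """Return (spl1, spl2) for one day from its minimum and maximum temperature."""
    avg_temp = (min_temp + max_temp) / 2.0  # midpoint of MinTemp and MaxTemp
    spl1 = max(avg_temp - knot, 0.0)        # degrees above the knot, else zero
    spl2 = max(knot - avg_temp, 0.0)        # degrees below the knot, else zero
    return spl1, spl2


def predicted_kwh(spl1, spl2, intercept, b_spl1, b_spl2):
    """Prediction under a log link: exponentiate the linear predictor."""
    return math.exp(intercept + b_spl1 * spl1 + b_spl2 * spl2)


# Hypothetical coefficients chosen only to show the shape of the curve.
INTERCEPT, B_SPL1, B_SPL2 = math.log(15.0), 0.05, 0.08

for avg in (8.0, 13.0, 18.0, 23.0, 28.0):
    s1, s2 = temperature_splines(avg, avg)  # same min and max gives that average
    print(avg, round(predicted_kwh(s1, s2, INTERCEPT, B_SPL1, B_SPL2), 2))
```

Only one of the two variables is ever non-zero on a given day, which lets the model estimate a separate slope for the heating side and the cooling side of the curve.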
Let’s see how these variables work in my model:
Hey presto! We have a much stronger fitting model, with an r-squared of 0.71 and an RMSE of 4.86. This model is appealing in that it is highly parsimonious and readily explainable. When I visualise the model fit and produce a thirty-day moving average, r-squared increases to 0.88 and I have a model with a good fit.
I have pointed out three periods where the model departs from actual usage. The two low periods coincide with times when we were away and the high period coincides with a period when I was travelling. I have seen market research which suggests that absence of the bill payer leads to higher household electricity usage. I can add dummy variables into my model to describe these events and then use those in future forecast scenarios. The important thing here is that I am not using a trend, and given this fit I see no trend in my usage other than that created by climatic variability. Some consumers will have a trend in usage over time, driven by things like changes in productivity for businesses or the addition of solar for residential customers. But it is not good enough to just count on a continuing trend. It is important to get to the drivers of change and to find ways of capturing these drivers in granular data.
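The smoothing step can be sketched as follows. I am assuming a simple trailing window here; a centred window would work just as well, and the function name is made up. The same function would be applied to both the actual and the predicted daily series before recomputing the fit statistics on the smoothed values.

```python
def moving_average(values, window=30):
    """Trailing moving average; returns one value per complete window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the series length")
    running_total = sum(values[:window])
    averages = [running_total / window]
    for i in range(window, len(values)):
        # Slide the window forward: add the newest value, drop the oldest.
        running_total += values[i] - values[i - window]
        averages.append(running_total / window)
    return averages


# Tiny made-up series with a three-day window to show the mechanics.
print(moving_average([1, 2, 3, 4, 5], window=3))
```

Averaging over thirty days removes much of the day-to-day noise that the model has no way of explaining, which is why the fit looks considerably better on the smoothed series.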
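The dummy variables mentioned above could be built along these lines. The date ranges below are placeholders rather than my actual travel dates, and the function and variable names are hypothetical.

```python
from datetime import date


def event_dummy(day, periods):
    """Return 1 if the day falls inside any (start, end) period, else 0."""
    return 1 if any(start <= day <= end for start, end in periods) else 0


# Placeholder periods, inclusive at both ends.
house_vacant_periods = [(date(2014, 1, 6), date(2014, 1, 12))]
bill_payer_away_periods = [(date(2014, 5, 19), date(2014, 5, 30))]

day = date(2014, 1, 8)
features = {
    "house_vacant": event_dummy(day, house_vacant_periods),
    "bill_payer_away": event_dummy(day, bill_payer_away_periods),
}
print(features)
```

In a forecast scenario the same flags can be set for planned future absences, so the model adjusts its prediction for those days.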
In the next part of this post I’ll investigate how these meter-level insights can be used at the whole of network level, and some techniques which can be used to derive insight from individual meters to whole of network.