The Hitchhiker’s Guide to Big Data

Thanks to Douglas Adams

Data these days is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is… The best piece of advice I can give is: Don’t Panic.

Because I am an all-round hoopy frood, I will provide you, the reader, with a few simple concepts for navigating this vast topic. So let’s start with definitions. What is big data? The answer is: it depends who you talk to. Software vendors say it is something that you need to analyse (interestingly, usually with the same tools used to analyse “small” data). Hardware vendors will talk about how much storage or processing power is needed. Consultants focus on the commercial opportunities of big data, and the list goes on…

Big data is mostly just good ol’ fashioned data. The real revolution is not the size of the data but the cultural and social upheaval that is generating new and different types of data. Think social networking, mobile technology, smart grids, satellite data, the Internet of Things. The behavioural contrails we all leave behind us every minute of every day. The real revolution is not in the data itself but in the meaning contained within it.

The towel, according to the Hitchhiker’s Guide to the Galaxy, “is about the most massively useful thing an interstellar hitchhiker can have”. The most important tool the big data hitchhiker can have is a hypothesis, and for commercial users of data, a business case to accompany that hypothesis. To understand why this is so important let’s look at the three characteristics of data: volume, velocity and variety. In simplified terms, think of spreadsheets:

Volume

As Tom Davenport has reportedly said, ‘Big data usually involves small analytics’. Size does matter insomuch as it becomes really hard to process and manipulate large amounts of data. If your problem requires complex analysis then you are better off limiting how much data you analyse. There are many standard ways to do this, such as random sampling or aggregation, as sketched below. And because activities such as prediction or forecasting require generalisations to be drawn from the data, there comes a point where models do not become substantially more accurate with the addition of more data.
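For illustration, here is a minimal sketch of both approaches in Python/pandas, assuming a hypothetical file of half-hourly meter readings (the file and column names are made up for the example):

```python
import pandas as pd

# Hypothetical half-hourly interval data: columns 'meter_id', 'timestamp', 'kwh'
readings = pd.read_csv("readings.csv", parse_dates=["timestamp"])

# Option 1: random sampling - analyse a 10% sample when the modelling is complex
sample = readings.sample(frac=0.10, random_state=42)

# Option 2: aggregation - roll the half-hourly readings up to daily totals per meter
daily = (readings
         .set_index("timestamp")
         .groupby("meter_id")["kwh"]
         .resample("D")
         .sum()
         .reset_index())
```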

On the other hand, relatively crude algorithms can become quite accurate with large amounts of data to process and here I am thinking in particular of Bayesian learning algorithms.

The decision of how much data is the right amount depends on what you are trying to achieve. Do not assume that you need all of the data all of the time.

The question to ask when considering how much is: “How do I intend to analyse the data?”

Velocity

If there is any truth in the hype about big data it is the speed with which we can get access to it. In some ways I think it should be called “fast data” rather than “big data”. Again the speed with which data can be collected from devices can be astounding. Smart meters can effectively report in real time but does the data need to be processed in real time? Or can it be stored and analysed at a later time? And how granular does the chronology have to be?

As a rule I use the timeframes necessary for my operational outcome to dictate what chronological detail I need for my analysis. For example, real-time voltage correction might need one-minute data (or less), but monitoring the accuracy of a rolling annual forecast might only need monthly or weekly data.
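Continuing the hypothetical readings from the earlier sketch, matching the granularity to the operational need might look like this:

```python
# Resample half-hourly readings to the granularity the operational task needs
half_hourly = readings.set_index("timestamp")["kwh"]  # assume a single meter's readings here

daily = half_hourly.resample("D").sum()     # e.g. day-level usage analysis
weekly = half_hourly.resample("W").sum()    # e.g. tracking forecast accuracy
monthly = half_hourly.resample("MS").sum()  # e.g. reviewing a rolling annual forecast
```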

Operational considerations are always necessary when using analytics. This is where I distinguish between analytics and business intelligence. BI is about understanding; analytics is about doing.

The question to ask when considering how often is: “How do I intend to use the analysis?”

Variety

This is my favourite. There is a dazzling array of publicly available data as governments and other organisations begin to embrace open data. And there’s paid-for vendor data available as well. Then there is social media data, and data that organisations collect themselves through sensors in smart (and not so smart) devices, as well as traditional customer data and data collected from market research or other types of opt-in such as mobile phone apps. The list goes on.

This is where a good hypothesis is important. Data should be sourced, cleansed and transformed according to your intended outcome.

The other challenge with data variety is how to join it all together. In the old days we relied on exact matches via a unique id (such as an account number), but now we may need to be more inventive about how we compare data from different sources. There are many ways in which this can be done: it is one of the growing dark arts of big data. At the conceptual level, I like to work with a probability of matching. I start with my source data set, which will have a unique identifier matched against the data I am most certain of. I then join in new data and create new variables that describe the certainty of match for each different data source. I then have a credibility weighting for each new source of data that I introduce and use these weightings as part of the analysis. This allows all data to relate to each other; this is my Babel fish for big data. There are a few ways in which I actually apply this approach (and I am finding new ways all of the time).
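As a toy illustration of the idea (not my production approach), assume two hypothetical data sets with no shared key and use a crude string-similarity score as the match probability:

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical source data keyed on a unique identifier
customers = pd.DataFrame({"account_id": [1, 2],
                          "name": ["Arthur Dent", "Ford Prefect"]})

# New data source with no shared key - only a free-text name
social = pd.DataFrame({"handle": ["@adent", "@fprefect"],
                       "name": ["arthur dent", "ford prefect jr"]})

def match_probability(a: str, b: str) -> float:
    """Crude string-similarity score standing in for a proper matching model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Score every candidate pair, keep the best match per account, and carry the
# score forward as a credibility weight to be used later in the analysis
pairs = customers.merge(social, how="cross", suffixes=("_src", "_new"))
pairs["match_prob"] = [match_probability(a, b)
                       for a, b in zip(pairs["name_src"], pairs["name_new"])]
best = pairs.sort_values("match_prob", ascending=False).groupby("account_id").head(1)
```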

The question to ask when considering which data sources is: “What is my hypothesis?”

So there you have it: a quick guide to big data. Not too many rules, just enough so you don’t get lost. And one final quote from Douglas Adams’ Hitchhiker’s Guide to the Galaxy:

Protect me from knowing what I don’t need to know. Protect me from even knowing that there are things to know that I don’t know. Protect me from knowing that I decided not to know about the things that I decided not to know about. Amen.


Eating my own ice cream – Part 1

According to Wikipedia the rather unfortunate term “dogfooding” was coined in reference to the use of one’s own products. That is, “if it is good enough for others to use then it is good enough for me to use”. I prefer the term coined by one-time Microsoft CIO Tony Scott: “icecreaming”.

[Image: Sanctorius of Padua literally eating his own ice cream.]

In this two-part post, I am going to “eat my own ice cream” and dive into my own smart meter electricity data made available by my electricity distributor through an online portal. I will endeavour to find out what drives electricity usage in my household, how to make the data as predictable as possible and what lessons can be learned so that the utilities sector can get better insight from smart meter data.

Whether it is for regulatory requirements or generally better business decision making, traditional forecasting practices are proving to be inadequate. The current uncertainty in electricity pricing is partly driven by inadequate peak load and energy forecasts. Until the mid-2000s energy forecasting was very straightforward, as electricity was a low-cost resource based on very mature technology. And then everything changed. We had a run of hot summers followed by a run of wet, mild ones. We had the rooftop solar revolution, helped in the early days by considerable government subsidy. We had changes in building energy efficiency standards, and lately we have also had a downturn in the domestic economy. And of course we have had price rises, which have revealed the demand elasticity of some consumers.

This array of influences can seem complex and overwhelming, but armed with some contemporary data mining techniques and plenty of data we can build forecasts which take into account this range of factors and, more importantly, dispel myths about what does and doesn’t affect consumption patterns. Furthermore, we can build algorithms that will detect when some new disruptor comes along and causes changes that we have not previously accounted for. This is very important in an age of digital disruption. Any organisation that is not master of its own data has the potential to face an existential crisis and all of the pain that comes with that.

In this analysis I am going to use techniques that I commonly use with my clients. In this case I am looking at a single meter (my own meter), but the principles are the same. When working with my clients my approach is to build the forecast at every single meter, because different factors will drive the forecast for different consumers (or at least different segments of consumers).

So that I don’t indulge in “analysis paralysis”, I will define some hypotheses that I want to test:

  • What drives electricity usage in my household?
  • How predictable is my electricity usage?
  • Can I use my smart meter data to predict electricity usage?

I will use open source/freeware tools to conduct this analysis and the visualisations, to again prove that this type of analysis does not have to be costly in terms of software, but relies instead on “thoughtware”. As always, let’s start with a look at the data.

[Image: sample of the raw half-hourly meter data]

As you can see I have a row for each day and 48 half-hourly readings, which is the standard format for meter data. To this I add day of week and a weekend flag calculated from the date. I also add temperature data from my nearest Bureau of Meteorology automated weather station – which happens to be only about 3 kilometres away and at a similar altitude. I also total the 48 readings so I have a daily kWh usage figure. In a future post I will look into which techniques we can apply to the half-hourly readings, but in this post I will concentrate on this total daily kWh figure.
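A minimal sketch of that preparation step, with hypothetical column and file names standing in for the real portal export and BOM download:

```python
import pandas as pd

meter = pd.read_csv("meter_data.csv", parse_dates=["date"])      # one row per day, 48 half-hourly columns
weather = pd.read_csv("bom_station.csv", parse_dates=["date"])   # daily min_temp / max_temp from the BOM station

hh_cols = [f"hh_{i:02d}" for i in range(1, 49)]
meter["daily_kwh"] = meter[hh_cols].sum(axis=1)      # total the 48 readings into a daily kWh figure
meter["day_of_week"] = meter["date"].dt.day_name()   # day of week calculated from the date
meter["weekend"] = meter["date"].dt.dayofweek >= 5   # weekend flag

data = meter.merge(weather[["date", "min_temp", "max_temp"]], on="date", how="left")
```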

This is the data with the added fields:

[Image: the meter data with the added fields]

My tool of choice for this analysis is the Generalised Linear Model (GLM). As a general rule, regression is a good choice for modelling a continuous response variable, and GLMs also allow tuning of the model to fit the distribution of the data.

Before deciding what type of GLM to use let’s look at the distribution of daily usage:

[Image: distribution of daily kWh]

Not quite a normal distribution. The distribution is slightly skewed to the left and has high kurtosis, which looks a little like a gamma distribution. Next let’s look at the distribution of the log of daily kWh.

[Image: distribution of the log of daily kWh]

Here I can see a long tail to the left, but if I ignore that tail then I get quite a symmetric distribution. Let’s have a closer look at those outliers, this time by plotting temperature against daily kWh. They can be seen clearly in a cluster at the bottom of the graph below.

[Image: temperature plotted against daily kWh]

This cohort of low energy usage days represents times when our house has been vacant. In the last year these have mostly been one-off events with no data that I can use to predict their occurrence. They can all be defined as being below 5 kWh, so I’ll remove them from my modelling dataset. The next graph then shows we clearly have a better fit to a gamma distribution (blue line) than to a normal distribution (red line).
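In code, that filtering and distribution comparison might look something like this (continuing the hypothetical data frame from the sketch above, and using scipy rather than whatever the original analysis used):

```python
from scipy import stats

# Drop the vacant-house days (all below 5 kWh) from the modelling dataset
model_data = data[data["daily_kwh"] >= 5].copy()

# Fit gamma and normal distributions to daily usage and compare log-likelihoods
kwh = model_data["daily_kwh"]
gamma_params = stats.gamma.fit(kwh, floc=0)
norm_params = stats.norm.fit(kwh)

gamma_ll = stats.gamma.logpdf(kwh, *gamma_params).sum()
norm_ll = stats.norm.logpdf(kwh, *norm_params).sum()
print(f"gamma log-likelihood: {gamma_ll:.1f}, normal log-likelihood: {norm_ll:.1f}")
```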

[Image: daily kWh with fitted gamma (blue) and normal (red) distributions]

We are now ready to model. This is what the first GLM looks like:

[Image: summary output of the first GLM]

In assessing my GLM, I will use three measures:

  • The p-value (“Pr(>|t|)”) for each predictor, which estimates the goodness of fit of each term (the smaller the better, meaning the greater the confidence we have in the coefficient estimate),
  • R-squared, which represents how well the overall model fits (the higher the better; R-squared can be thought of as the percentage of variance in the data explained by the model), and
  • root mean squared error (RMSE), which tells me the average difference between my actual and predicted values (the lower the better; an RMSE of zero means that predicted values do not vary from actual values).

The model above is not very well fitted, as demonstrated by the p-values and the fact that some coefficients did not produce an estimate. This model has an R-squared of 0.36 and an RMSE of 7, and these statistics are not very reliable given the p-values.

Also it seems odd that MinTemp is significant but MaxTemp is not. So I remove poor-performing variables and add an interaction between MinTemp and MaxTemp, as I expect to find a relationship between these two values and electricity usage.
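A minimal sketch of fitting such a model, assuming the hypothetical columns used above; the original analysis may well have been done in R, so treat this statsmodels version as illustrative only:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Gamma GLM with a log link, including a MinTemp x MaxTemp interaction
glm = smf.glm("daily_kwh ~ min_temp * max_temp + weekend",
              data=model_data,
              family=sm.families.Gamma(link=sm.families.links.Log()))
fit = glm.fit()
print(fit.summary())

# RMSE on the training data
rmse = np.sqrt(np.mean((model_data["daily_kwh"] - fit.fittedvalues) ** 2))
print(f"RMSE: {rmse:.2f}")
```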

[Image: summary output of the refined GLM]

This new model is better fitting, with an R-squared of 0.64 and an RMSE of 5.32. But the p-value for “Day==Tuesday” is still not low enough for my liking given the sample size of only a few hundred observations. At the risk of erring slightly on the side of underfitting, I remove this term from the model. Taking a closer look at temperature, I plot average temperature (the midpoint between MinTemp and MaxTemp) against daily kWh and I find an interesting pattern:

[Image: average temperature plotted against daily kWh]

We see crossover points in the direction of correlation in the same temperature band at different seasonal changeovers, like bookends to the winter peak in usage. I use this insight to create two new temperature variables using splines. A bit of experimentation leads me to conclude that the temperature changeover is at 18 degrees Celsius, which is also the temperature at the bottom of my U-curve scatterplot above. I create a variable called “spl1” which is zero for all values less than 18 degrees and the average temperature minus 18 for all values above. The second variable, “spl2”, is the opposite: zero for all temperatures above 18 degrees and 18 minus the average temperature for all below. Because I am using a log link function, these variables will describe a U-shape as in the scatterplot rather than a V-shape, which is what would happen if I were using linear regression.
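In code, those two hinge terms might be created like this (continuing the sketch above, with the 18 degree knot hard-coded):

```python
# Average temperature and the two hinge ("spline") terms around the 18 degree knot
model_data["avg_temp"] = (model_data["min_temp"] + model_data["max_temp"]) / 2
model_data["spl1"] = (model_data["avg_temp"] - 18).clip(lower=0)  # warm side: cooling-driven usage
model_data["spl2"] = (18 - model_data["avg_temp"]).clip(lower=0)  # cool side: heating-driven usage
```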

Let’s see how these variables work in my model:

[Image: summary output of the GLM with spline terms]

Hey presto! We have a much stronger-fitting model with an R-squared of 0.71 and an RMSE of 4.86. This model is appealing in that it is highly parsimonious and readily explainable. When I visualise the model fit and produce a thirty-day moving average, R-squared increases to 0.88 and I have a model with a good fit.

[Image: actual versus predicted daily kWh, thirty-day moving average]

I have pointed out three periods where the model departs from actual usage. The two low periods coincide with times when we were away and the high period coincides with a period when I was travelling. I have seen market research which suggests that absence of the bill payer leads to higher household electricity usage. I can add dummy variables to my model to describe these events and then use those in future forecast scenarios. The important thing here is that I am not using a trend, and given this fit I see no trend in my usage other than that created by climatic variability. Some consumers will have a trend in usage based on changes over time, such as changes in productivity for businesses or the addition of solar for residential customers. But it is not good enough to just count on a continuing trend. It is important to get to the drivers of change and find ways of capturing these drivers in granular data.
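A minimal sketch of adding such event dummies, with entirely hypothetical date windows standing in for the real absence and travel periods:

```python
# Hypothetical event windows taken from a calendar of known absences and travel
away = model_data["date"].between("2013-04-01", "2013-04-10")
travelling = model_data["date"].between("2013-08-10", "2013-08-24")
model_data["household_away"] = away.astype(int)
model_data["bill_payer_travelling"] = travelling.astype(int)

event_glm = smf.glm("daily_kwh ~ spl1 + spl2 + weekend + household_away + bill_payer_travelling",
                    data=model_data,
                    family=sm.families.Gamma(link=sm.families.links.Log())).fit()
```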

In the next part of this post I’ll investigate how these meter-level insights can be used at the whole of network level, and some techniques which can be used to derive insight from individual meters to whole of network.

Retail Therapy

July 1, 2012 will probably be mostly remembered as the date Australia introduced a price on carbon. But another event took place which may be more significant in terms of how households and small businesses consume their electricity: the commencement of the National Energy Customer Framework (NECF). The NECF gives the Australian Energy Regulator (AER) the responsibility for (among other things) regulating retail electricity prices. Electricity retail prices continue to rise, driven mostly by increasing capital expenditure costs for networks. Electricity businesses, regulators and governments are increasingly turning their attention to Time of Use (TOU) pricing to help mitigate peak network demand and therefore reduce capital expenditure.

Change will be gradual to start with, however. A cynical observer may suggest that the NECF is no more than a website at present, but I believe that change is inevitable and it will be significant. Five states and the ACT have agreed to a phased introduction of the NECF following on from a 2006 COAG agreement, and the transition will be fraught with all of the complexities of introducing cross-jurisdictional regulatory reform.

There are basically two mechanisms that drive the cost of producing and delivering electricity. One is the weather (we use more in hot and cold weather) and the other is the cost of maintaining and upgrading the network that delivers the electricity. For the large retailers, the way to deal with the weather is to invest in both generation and retail, because one is a hedge for the other. These businesses are known as “gentailers”.

The network cost has traditionally been passed through as a regulated network tariff component of the retail price. The problem with this is that the network price structure often does not reflect actual network costs, which are driven by infrequent peak use, particularly for residential customers. Those who use a greater proportion of electricity during peak times add to the cost of maintaining capacity in the network to cope with the peak, but residential and other small consumers all pay the same rate. In effect, “peaky” consumers are subsidised by “non-peaky” customers.

It is not yet really clear how a price signal will be built into the retail tariff, but one policy option is for distributors to pass through costs that reflect an individual consumer’s load profile. The implications for government policy are interesting, but I’ll save those for another post. In this post, I’ll explore the implications from the retailer’s perspective in contestable markets.

I believe that this is potentially quite a serious threat to the business model of retailers for a number of reasons that I’ll get into shortly, but at the heart of the matter is data: lots of it, and what to do with it. Much of that data is flowing from smart meters in Victoria and NSW and will start to flow from meters in other states. A TOU pricing strategy not only requires data from smart meters but from many other sources as well.

Let’s have a quick recap on TOU. I have taken the following graph from a report we have prepared for the Victorian Department of Primary Industries which can be found here.

The idea of TOU is to define a peak time period where the daily usage peaks and charge more for electricity in this time period. A two part TOU will define other times as off peak and charge a much lower tariff. There may also be shoulder periods either side of the peak where a medium tariff is charged.

How each of these periods is defined and the tariff levels set will determine whether the system as a whole will collect the same revenue as when everyone is on a flat tariff.  This principle is called revenue neutrality. That is, the part of the electricity system that supplies households and small businesses will collect the same revenue under the new TOU tariffs as under the old flat tariff.
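As a toy illustration of a revenue-neutrality check, with entirely hypothetical tariffs and a made-up daily load profile:

```python
import numpy as np

# Hypothetical half-hourly system load (kWh) for the small-customer segment
hours = np.arange(48) / 2.0
load = 500 + 300 * np.exp(-((hours - 18.5) ** 2) / 4)   # evening peak around 6:30pm

flat_rate = 0.28                                          # $/kWh, hypothetical flat tariff
peak = (hours >= 15) & (hours < 21)                       # 3pm-9pm peak window
shoulder = ((hours >= 7) & (hours < 15)) | ((hours >= 21) & (hours < 23))
tou_rate = np.where(peak, 0.45, np.where(shoulder, 0.25, 0.15))

flat_revenue = (load * flat_rate).sum()
tou_revenue = (load * tou_rate).sum()
print(f"flat: ${flat_revenue:,.0f}  TOU: ${tou_revenue:,.0f}  "
      f"difference: {100 * (tou_revenue / flat_revenue - 1):+.1f}%")
```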

But this should by no means give comfort to retailers that they each will achieve revenue neutrality.

For example, we can see from the above graphs that even if revenue neutrality is achieved for all residential and SME customers combined, residential customers may be better off and SME customers worse off, or vice versa, but everything still totals to no change in revenue. If a retailer has a large share of customers in a “better off” category then that will translate to a fall in revenue if the retailer passes on the network tariff with their existing margin. In fact, we find that residential bills may be reduced by up to five per cent, depending on the design of the network tariff.

Of course this is just one segmentation of TOU; there could be many, many more sub-segments, all with different “better off” or “worse off” outcomes.

Revenue neutrality can also be affected by price elasticity (consumers reduce their peak consumption) or substitution (they move their peak usage to shoulder or off-peak times, thus reducing their overall electricity bill). This means that retailers not only have to understand what the impact would be under the current state of electricity usage but also how the tariff itself will affect consumer behaviour.

Data is at the very centre of competitive advantage as this disruptive event unfolds in the retail electricity market. Indeed the threat may not just be disruptive: for some retailers this may be an existential threat, especially as we see data-centric organisations such as telcos and ISPs entering the market. So far no large telcos have entered the market in Australia (as far as I know: please correct me if this has changed) but surely the elephants must be loitering outside the room if not already in it.

I think what is clear for incumbent electricity retailers is that “do nothing” is not an option. There must be a clear strategy around data and pricing, including technology, talent and process. Furthermore, the centrepiece must be time-of-use pricing excellence built on a deep capability with the data flowing from new-technology meters and networks.

So what exactly are the key issues? The following list is by no means exhaustive but certainly gives some idea of the extent of data and the quantum of skills required to handle such complex analysis and interpretation.

Opt In or Opt Out?

I believe that TOU tariffs for small consumers are inevitable, but how will it roll out and how fast will the rollout be? The key policy decision will be whether to allow customers to opt in to TOU tariffs or opt out of a scheme which will otherwise be rolled out by default (a third option is to mandate to all, but this is likely to be politically unpalatable). I think pressure on governments to act on electricity network costs means that the “opt in” option, if it is adopted by the AER, will by definition be a transitional process. But the imperative is to act quickly because there is a lag between reducing peak demand and the flow through to capital expenditure savings (this is another whole issue which I will discuss in a future post). This lag means that if take up of TOU is too slow then the effect to the bottom line will be lost in the general noise of electricity consumption cycles: a case of a discount delayed is a discount denied. Retailers will have the right to argue for a phased introduction but there will be pressure on governments and the AER to balance this against the public good.

Non-cyclical change in demand

In recent years we have seen a change in the way electricity is consumed. I won’t go into the details here because I have blogged on this before. Suffice it to say that it is one thing to understand from the data how a price may play out in the current market state, but it’s altogether another thing to forecast how this will affect earnings. This requires a good idea of where consumption is heading, and in turn this is affected by a range of recent disruptors including solar PV, changes in housing energy efficiency and changes in household appliance profiles. Any pricing scenario must also include a consumption forecast scenario. It would also be wise to have a way to monitor forecasts carefully for other black swans waiting to sweep in.

A whole of market view

The task of maintaining or increasing earnings from TOU pricing will be a zero sum game. That is, if one retailer gets an “unfair share” of the “worse off” segments, then another retailer will get more of the “better off” segments and it is likely that this will be a one-off re-adjustment of the market. There is a need for a sophisticated understanding of customer lifetime value and this will be underpinned by also having a good understanding of market share by profitability. The problem is that smart meters (and the subsequent data for modelling TOU) will roll out in stages (Victoria is ahead of the other states, but I think the rollout will be inevitable across the National Electricity Market). The true competitive advantage for a retailer comes from estimating the demand profiles of customers still on accumulation meters and those smart meter consumers who are with competitors. There are a range of data mining techniques to build a whole-of-market view but equally important is a sound go-to-market strategy built to take advantage of these insights.

There will be winners and losers in the transition to TOU. For consumers, it could be argued that the “losers” are currently “winners” because the cost of their electricity supply is being subsidised by less “peaky” customers. There will also be winners and losers among energy retailers. Some of the winners may not even be in the market yet. The question is who will the losers be?

Critical Peak Price or Critical Peak Rebate?

Australians pay billions of dollars every year for an event that usually doesn’t happen: a critical demand peak on the electricity network. Electricity networks are designed to ensure continuous supply of electricity regardless of the demand placed on them. Every few years we are likely to experience a heat wave or cold snap that drives up simultaneous demand for energy across the network. The infrastructure required to cope with this peak in demand is very expensive; infrastructure that is not used except during these relatively rare events.

Shaving even just a very small amount of demand off these peak days has the potential to save up to $1.2b each year nationally according to a recent report by Deloitte. The tricky part is to try and target the peaks rather than drive down energy consumption in non-peak times. It’s this non-peak consumption that pays the bills for infrastructure investment. If distributors get less revenue and their peak infrastructure costs stay the same then prices have to go up. This is one of the big reasons why electricity prices have risen so steeply in recent years.

One way to do this is to send a price signal or incentive for consumers to moderate their demand on peak days. Last year at the ANZ Smart Utilities Conference in Sydney, Daniel Collins from Ausgrid gave an interesting presentation comparing the benefits for distributors of offering a critical peak price versus a critical peak rebate. A critical peak price is where the network issues a very steep increase in electricity price on a handful of days each year. This price might be as much as ten times the usual electricity price. Under a critical peak rebate scheme consumers are charged the same amount on peak days but are given a rebate by the distributor if they keep their peak below a pre-defined threshold.

In electricity markets where distributors cannot own retailers (the most common type of market in Australia) it is very difficult for price signals set by distributors to reach end consumers. This is because the distributors charge retailers, and retailers then set the price and product options offered to consumers. Distributor price signals can get obscured in this process. In this type of market critical peak prices are unlikely to be mandated by government because mandating them goes against a policy of deregulation and is highly politically unpalatable in an environment of rapidly increasing electricity prices. The only option for distributors then is an opt-in price.

The effectiveness of such a price is then highly dependent on the opt-in rate, and given that the only consumers likely to opt in are those who do not stand to lose under such a price, the overall savings may be quite low.

A more interesting concept is critical peak rebate. For a start the rebate is given by the distributor directly which avoids the incentive being obscured by retail pricing. Such a scheme is also likely to attract a much greater uptake than opt-in peak pricing. The tricky part however is the design. How much rebate should be offered? Which consumers should be targeted and will they be interested? How do we set the upper demand limit?

It would be a mistake to offer the same deal to all consumers as it is very hard to offer a general incentive with significant return. A badly designed rebate could easily cost more to administer than it saves. There are four crucial elements that need to be considered in the design of a CPR.

How to measure the benefit?

This is quite tricky but by far the most important design element. There is a lag time between energy peaks on the network and infrastructure costs. This is because infrastructure spending is usually allocated on a five-year cycle based on forecasts developed from historical peak demand data. It is vital that a scheme is designed to capture the net savings in peak demand and that there is a process to feed this data into the forecasting process. Unfortunately, I have never seen a demand management team feed data to a forecasting team.

[Image: Critical Pricing Customer Engagement Strategy]

Who do we target?

The first issue is to work out which consumers have high peaking demand and are likely to take up the incentive. There should also be consideration of how data will be collected and analysed during the roll-out of the program, and how this data is used to continually drive better targeting of the program. The problem with a one-size-fits-all scheme is that there may be a number of different groups who have different motivations for curtailing their peak demand. For example, the rebate’s financial incentive may be set for the average consumer but may not be high enough to appeal to a wealthy consumer. But there may be other ways to appeal to these customers, such as offering a donation to a charity if the peak demand saving target is reached. It is therefore important to think about a segmentation approach: targeting the right customers with the right offer.

What price, demand threshold and event frequency do we set?

Pricing the incentive is a three-dimensional problem: target demand threshold, price and frequency of events. Each of these affects the total benefit of the scheme, and the consumer trade-offs need to be understood. The danger here again is relying on averages. Different cohorts of customers will have different trade-off thresholds, and an efficient design is vital to the effectiveness of the incentive. It is unlikely that there is room to vary the rebate amount based on customer attributes, but there is certainly room to design individualised demand thresholds and perhaps also the frequency with which events are called for different cohorts of customers.

How do we refine the program?

In the rush to get new programs to market, response data and customer intelligence feedback are often not well considered. It is important that there is a system for holding data and routines for measuring response against control groups for each treatment group in the program, so that incremental benefits can be measured and so that data can be fed back into improving the models which select customers for the program. Incremental benefits of the program should also feed back into refining the pricing of the rebate and the target demand thresholds. Understanding which customers respond, and the quantum of that response, provides valuable insights into customer behaviour which distributors usually do not have the ability to capture in the normal course of their business. These are all good reasons for running a well-designed CPR program.

Energy consumption, customer value and retail strategy

I am sometimes surprised at the amount of effort that goes into marketing electricity. I can’t help but feel that a lot of customer strategy is over-engineered. So here I present a fairly straightforward approach that acknowledges that energy is a highly commoditised product. This post departs a little from the big themes of this blog but is still relevant, because the data available from smart meters makes executing an energy retail strategy a much more interesting proposition (although still a challenging data problem).

To start with let’s look at the distribution of energy consumers by consumption. This should be a familiar distribution shape to those in the know:

[Image: Energy Consumption Distribution]

In effect what we have are two distributions overlaid: a normal distribution to the left overlapping with a Pareto distribution to the right. This first observation tells us that we have two discrete populations, each with their own rules governing the distribution of energy consumption. A normal distribution is a signature of human population characteristics and as such identifies what is commonly termed the electricity “mass market”, essentially dominated by domestic households. The Pareto distribution to the right is typical of an interdependent network such as a stock market, where a stock’s value, for example, is not independent of the value of other stocks. This is also similar to what we see when we look at the distribution of business sizes.
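As a toy illustration of that shape (hypothetical parameters only, not real market data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Mass market: roughly normal annual consumption (MWh), dominated by households
households = rng.normal(loc=6, scale=2, size=90_000).clip(min=0.5)

# Business/industrial tail: Pareto-like, with a few very large consumers
businesses = (rng.pareto(a=1.5, size=10_000) + 1) * 10

consumption = np.concatenate([households, businesses])
plt.hist(consumption, bins=200, range=(0, 150))
plt.xlabel("Annual consumption (MWh)")
plt.ylabel("Number of customers")
plt.show()
```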

A quick look at the distribution of electricity consumption allows us to define two broad groups and because consumption is effectively a proxy for revenue we have a valuable measure in understanding customer value.

In our Pareto distribution we have a long tail of an ever-decreasing number of customers with increasingly large consumption (and therefore contribution to revenue). To the left we have the largest number of customers but relatively low value (although mostly better than the customers at the top end of the normal distribution), and to the right a very few “mega-value” customers. We can therefore roughly define three “super-segments” as follows:

[Image: Energy Consumption Super Segments]

With the VLC segment on the right, revenue is king. Losing just a few of these customers will impact overall revenue, so the strategy here is to retain at all costs. At the extreme right, for example, individual relationship management is a good idea, as is bespoke product design and pricing. At the lower end of this segment a better option may be relationship managers with portfolios of customers. But the over-riding rule is 1:1 management where possible.

The middle segment is interesting in that both revenue and margin are important. Getting the balance right between these two measures is very important, and the strategy depends on whether your organisation is in a growth or retain phase. If I were a new market entrant, this is where I would be investing a lot of my energy. This is the segment of the market where some small wins could build a revenue base with good returns relatively quickly, assuming that the VLC market will be fairly stable, while avoiding the risks inherent in the mass market. On the flip side, if I were a mature player then I would be keeping a careful eye on retention rates and making sure I had the mechanisms to fine-tune the customer value proposition. An example might be offering “value-add” services which become possible with advanced metering infrastructure, such as online tools which allow business owners to track productivity via portal access to real-time energy data, or the ability to upload their own business data which can be merged and visualised with energy consumption data.

The mass market is really the focus of most retailers, because success metrics often focus too heavily on customer numbers rather than revenue and margin, probably because this is easier to measure. The trap is that these customers have highly variable profitability, as described by the four drivers of customer lifetime value:

[Image: Customer Lifetime Value Drivers]

Understanding these drivers and developing an understanding of customer lifetime value is critical to developing tailored engagement strategies in this segment. Because these customers are the easiest to acquire, a strategy based around margin means that less profitable customers will be left for competitors to acquire. If those competitors are still focussed on customer counts as their measure of success then they will happily acquire unprofitable customers, which in time will increase the pressure to acquire even more because of falling margins. Thus the virtuous circle above is replaced with a vicious cycle (thanks to David McCloskey for that epithet).

And so there we have the beginnings of a data-driven customer strategy. There is of course much more to segmentation than this, and there are now very advanced methodologies for producing granular segmentation to help execute on customer strategy and provide competitive advantage. I’ll touch on these in future posts. But this is a good start.

Text Mining Public Perception of Smart Meters in Victoria

Today the Herald Sun ran a story proclaiming that smart meters are here to stay and invited its readers to comment on whether the government should scrap the smart meter program. I am not going to comment here on the journalistic quality of the article but will concentrate on the comments section, which gives stakeholders some valuable insight into the zeitgeist of smart metering in the Garden State.

By applying an unstructured text mining application I have extracted the key themes from the comments on this story. When analysed in conjunction with the structure and content of the story, we get some interesting insights into public perception.

To start with I excluded the words “smart”, “meter” and “meters” in order not to be distracted by the subject under discussion. This is what I got.
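For anyone wanting to reproduce this kind of analysis, a minimal term-frequency sketch might look like this (the comments shown are invented placeholders; the original analysis used a dedicated text mining application):

```python
import re
from collections import Counter

comments = ["Why should we pay for smart meters the government forced on us?",
            "The power companies win again..."]        # hypothetical scraped comments

stop_words = {"smart", "meter", "meters", "the", "a", "and", "of", "to",
              "we", "on", "for", "is", "it", "in", "that", "this"}

tokens = []
for comment in comments:
    tokens += [w for w in re.findall(r"[a-z']+", comment.lower())
               if w not in stop_words and len(w) > 2]

print(Counter(tokens).most_common(20))   # the top themes that would feed a word cloud
```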

Word clouds often seem to point to a collective meaning that is independent of individual attitudes. If this is the case then the strong message here is what we could interpret as a collective rejection of what is seen as government control being favoured over the wishes of the “people”. This may be more of a reflection of the Herald Sun readership than of general community concern, however.

If I remove “government” and “power” we get a closer look at the next level of the word cloud.

An aside of note is that Herald Sun readers like to refer to the premier by his first name, which is perhaps a sign that he still has some popularity with this demographic.

One interesting observation to me is that despite its prominent mention in the article, the myth of radio frequency radiation from smart meters is not a major concern to the community, so we are unlikely to see a repeat of the tin foil hats fiasco in California.

Once we get into some of the word cloud detail, we see the common themes relating to “cost of living”, namely the additional costs to the electricity bill of the rollout and the potential costs associated with time of use pricing. The article does mention that time of use pricing is an opportunity for households to save money. Time of use pricing is also a fairer pricing regime than flat tariffs.

The other important theme that I see is that the smart meter rollout is linked to the other controversial big technology projects of the previous Victorian government – Myki and the Wonthaggi Desalination Plant. But the good news is that the new government still has some cachet with the public (even in criticism, readers often refer to the premier by his first name). The objective now should be to leverage this and start building smart meter initiatives which demonstrate the value of the technology directly to consumers. This in part requires unlocking the value of the data for consumers. I’ll speak more about this in future posts.

UPDATE: For interpretation of word clouds I suggest reading up on the concept of collective consciousness.

Appliance Penetration and the Wisdom of Crowds

Some of the burning questions for electricity utilities in Australia have to do with appliance take-up. I decided to see what the wisdom of crowds could tell us about the take-up of some key appliances which are affecting load profiles and consumption trends. My crowd-sourced data comes from Google Insights for Search. I have taken the weekly search volume indexes for three search terms: “air conditioner”, “pool pump” and “solar pv”. In addition, I also took the search volumes for “energy efficient” to see if there has been a fundamental change in the zeitgeist in terms of energy efficiency.

Firstly, let’s have a look at Google “air conditioner” search data.

The graph shows strong seasonality, with people searching more for air conditioners in summer, which makes sense. To see indications of how profound the growth of air conditioners has been in Australia (and South East Queensland in particular), I decided to compare growth in air conditioner searching by country and city. Since 2004, Australia ranks second behind the US for air conditioner searches. For cities, Brisbane and Sydney rank fourth and fifth in the world, but if we adjust for population they rank second and third respectively, behind Houston. This has been one of the causes behind the recent difficulties in forecasting demand. One of the big questions is: will air conditioning load continue to grow, or has air conditioner penetration reached saturation point? Read on for some insights that I think this data may have uncovered.

When we look at the data for the search term “energy efficient”, we get the opposite temperature effect, with dips in searches during summer and perhaps also in winter noticeable in this graph.

This tells us that people become less concerned with energy efficiency as comfort becomes more important, which has also been shown in other studies. But if we want to look for underlying changes in behaviour then we need to account for temperature sensitivity in this data, and the first thing we need to do is come up with a national temperature measure that we can compare with the Google data. To do this I take temperature data for Australia’s five largest cities from the Bureau of Meteorology and create a national daily maximum temperature series for 2004-2011, comprised of a population-weighted mean of the maximum temperatures of the five largest Australian cities. This accounts for about 70% of Australia’s population and an even greater proportion of regular internet users. Now we can quantify the relationship between our appliances, energy efficiency and temperature.
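A minimal sketch of that weighting step, with hypothetical population weights and file names:

```python
import pandas as pd

# Hypothetical daily maximum temperatures, one column per capital city
city_max = pd.read_csv("city_max_temps.csv", parse_dates=["date"], index_col="date")

# Approximate population weights for the five largest cities (hypothetical figures)
weights = pd.Series({"Sydney": 0.30, "Melbourne": 0.28, "Brisbane": 0.15,
                     "Perth": 0.14, "Adelaide": 0.13})

national_max = (city_max[weights.index] * weights).sum(axis=1)  # population-weighted mean

# Aggregate to weekly to line up with the weekly Google search index
weekly_max = national_max.resample("W").mean()
```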

Below are the scatter charts showing the R-squared correlations. “Solar PV” is uncorrelated with temperature, but all of the other search terms show quite good correlation. You may also notice that I have tried to account for the U-curve in the relationship between “energy efficient” and temperature by correlating with the absolute number of degrees from 21C. The main relationship is with hot weather; accounting for the U-curve only adds slightly to the R-squared. Interestingly, people don’t start searching for air conditioners until the temperature hits 25C, and then there is a slightly exponential shape to the increase in searches. For the purposes of this post I will stick to simple linear methods, but further analysis might consider a log-link GLM or Multivariate Adaptive Regression Splines (MARS) to help explain this shape in the data.

Now to the central question that this post is trying to answer: what are the underlying trends in these appliances, can we find them from Google and BOM data, and can we get some insight into how this might help uncover the underlying trends in consumption and load factor? To do this I create a dummy variable to represent time and regress it together with temperature to see to what extent each factor separately describes the number of Google searches. I build separate models for each year, which separates the trend over time in searches from the temperature-related effect.
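When I get to that step, a minimal sketch of the per-year regression might look like this (hypothetical column and file names, reusing the weighted temperature series from the sketch above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical weekly search index export: columns 'date' (index) and 'search_index'
searches = pd.read_csv("air_conditioner_searches.csv", parse_dates=["date"], index_col="date")

weekly = searches.join(weekly_max.rename("max_temp"), how="inner").reset_index()
weekly["week_index"] = range(len(weekly))     # dummy variable representing time
weekly["year"] = weekly["date"].dt.year

# Separate model per year: disentangles the time trend from the temperature effect
for year, grp in weekly.groupby("year"):
    fit = smf.ols("search_index ~ week_index + max_temp", data=grp).fit()
    lo, hi = fit.conf_int().loc["week_index"]
    print(f"{year}: weekly trend {fit.params['week_index']:+.3f} "
          f"(95% CI {lo:+.3f} to {hi:+.3f})")
```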

But before I do that, I can look directly at the annual trends in “solar PV” searches.

There was not enough search data to go all the way back to 2004 (which is of itself interesting), so we only go back to 2007. What we see is large growth in searches in 2008, a statistically insignificant trend in 2009 and 2010, and a distinct decline during 2011. It looks like the removal of incentives and changes to feed-in tariffs are having an effect. The error bars show the 95% confidence interval.

Now on to pool pumps. Here we see a steady rise in searching for pool pumps, which indicates that we can expect pool pump load to grow nationally as well. If anything it looks like the search rate is increasing and, apart perhaps from a dip in 2008, was not affected by the global downturn.

Once we account for temperature variability we see really no trend in terms of energy efficiency until 2010. This came after the collapse of Australia’s carbon trading legislation and the collapse of internal accord on climate change policy. It seems to me that this is also reflected in public concern with energy efficiency. It also seems to me that if there were widespread public concern about the contribution of electricity to the cost of living then it should be reflected here, but it isn’t. This suggests that for consumers the motivation towards energy efficiency is driven by a sense of social responsibility rather than being an economic decision.

Finally, air conditioning. What we see represented here is the rapid growth in air conditioning that happened in 2004-2005, with a slowing in growth from 2006-2008. It looks like the government rebates of 2009 may have been partially spent on air conditioning. But from 2010 onwards there has been no significant trend in search term growth. Does this suggest that we finally reached saturation some time during 2010?