The Hitchhiker’s Guide to Big Data

Thanks to Douglas Adams

Data these days is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is… The best piece of advice I can give is:  Don’t Panic.

Because I am an all-round hoopy frood, I will provide you, the reader, with a few simple concepts for navigating this vast topic. So let's start with definitions. What is big data? The answer is: it depends who you talk to. Software vendors say it is something that you need to analyse (interestingly, usually with the same tools used to analyse “small” data). Hardware vendors will talk about how much storage or processing power is needed. Consultants focus on the commercial opportunities of big data, and the list goes on…

Big data is mostly just good ol’ fashioned data. The real revolution is not the size of the data but the cultural and social upheaval that is generating new and different types of data. Think social networking, mobile technology, smart grids, satellite data, the Internet of Things (IoT). The behavioural contrails we all leave behind us every minute of every day. The real revolution is not the data itself but the meaning contained within it.

The towel, according to the Hitchhiker’s Guide to the Galaxy, “is about the most massively useful thing an interstellar hitchhiker can have”. The most important tool the big data hitchhiker can have is a hypothesis, and for commercial users of data, a business case to accompany it. To understand why this is so important, let’s look at the three characteristics of data: volume, velocity and variety. In simplified terms, think of spreadsheets:


Volume

As Tom Davenport has reportedly said, ‘Big data usually involves small analytics’. Size does matter insomuch as it becomes really hard to process and manipulate large amounts of data. If your problem requires complex analysis then you are better off limiting how much data you analyse; there are many standard ways to do this, such as random sampling or aggregation. Because activities such as prediction and forecasting require generalisations to be drawn from data, there often comes a point where models do not become substantially more accurate with the addition of more data.

On the other hand, relatively crude algorithms can become quite accurate when given large amounts of data to process, and here I am thinking in particular of Bayesian learning algorithms.
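To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier, the simplest of the Bayesian learning family: the algorithm is crude (it just counts words), yet its accuracy improves steadily as you feed it more examples. The training documents and class names below are purely illustrative.

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Count word occurrences per class; probabilities are derived at predict time."""
    word_counts = {c: Counter() for c in set(labels)}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    return word_counts, class_counts

def predict_nb(model, doc, vocab_size=1000):
    """Pick the class with the highest log posterior, with Laplace smoothing."""
    word_counts, class_counts = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        class_total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total)  # log prior
        for w in doc.split():
            # smoothed likelihood so unseen words don't zero out the class
            score += math.log((word_counts[c][w] + 1) / (class_total + vocab_size))
        if score > best_score:
            best, best_score = c, score
    return best
```

Even with a handful of training documents the counts do the work; with millions of documents the same trivial counting becomes surprisingly accurate.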

The decision of how much data is the right amount depends on what you are trying to achieve. Do not assume that you need all of the data all of the time.

The question to ask when considering how much is: “How do I intend to analyse the data?”


Velocity

If there is any truth in the hype about big data, it is the speed with which we can get access to it. In some ways I think it should be called “fast data” rather than “big data”. The speed with which data can be collected from devices can be astounding. Smart meters can effectively report in real time, but does the data need to be processed in real time? Or can it be stored and analysed later? And how granular does the chronology have to be?

As a rule I use the timeframes necessary for my operational outcome to dictate what chronological detail I need for my analysis. For example, real time voltage correction might need one minute data (or less), but monitoring the accuracy of a rolling annual forecast might only need monthly or weekly data.
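The rule above, letting the operational timeframe dictate the chronological detail, can be sketched as a simple roll-up step. This is an illustrative stand-alone sketch (not code from the post): the same half-hourly or minute-level readings are bucketed at whatever granularity the outcome actually needs.

```python
from collections import defaultdict
from datetime import datetime

def aggregate(readings, granularity="day"):
    """Sum (timestamp, kwh) readings into coarser buckets.

    granularity: "hour" keeps hourly detail for near-real-time uses;
    anything else keeps daily totals for slower-moving analysis.
    """
    buckets = defaultdict(float)
    for ts, kwh in readings:
        if granularity == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        else:
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        buckets[key] += kwh
    return dict(buckets)
```

A rolling annual forecast would call this with daily (or weekly) buckets; real-time voltage correction would skip the aggregation entirely.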

Operational considerations are always necessary when using analytics. This is where I distinguish between analytics and business intelligence. BI is about understanding; analytics is about doing.

The question to ask when considering how often is: “How do I intend to use the analysis?”


Variety

This is my favourite. There is a dazzling array of publicly available data as governments and other organisations begin to embrace open data, and there’s paid-for vendor data as well. Then there is social media data, and data that organisations collect themselves through sensors in smart (and not-so-smart) devices, as well as traditional customer data and data collected from market research or other types of opt-in such as mobile phone apps. The list goes on.

This is where a good hypothesis is important: data should be sourced, cleansed and transformed according to your intended outcome.

The other challenge with data variety is how to join it all together. In the old days we relied on exact matches via a unique ID (such as an account number), but now we may need to be more inventive about how we compare data from different sources. There are many ways in which this can be done: it is one of the growing dark arts of big data. At the conceptual level, I like to work with a probability of matching. I start with my source data set, which will have a unique identifier matched against the data I am most certain of. I then join in new data and create new variables that describe the certainty of match for each different data source. This gives me a credibility weighting for each new source of data that I introduce, and I use these weightings as part of the analysis. This allows all data to relate to each other; it is my Babel fish for big data. There are a few ways in which I actually apply this approach (and I am finding new ways all the time).
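One possible sketch of that certainty-of-match idea, under the assumption that we score each shared field with a string-similarity measure and blend the scores by how much we trust each field. The field names, weights and records below are all hypothetical; a production matcher would use tuned similarity functions per field type.

```python
from difflib import SequenceMatcher

def match_confidence(source_rec, candidate_rec, field_weights):
    """Blend per-field string similarity into one certainty-of-match score.

    field_weights expresses how much each field is trusted to identify a
    record; the returned value in [0, 1] becomes the credibility weighting
    carried into the analysis.
    """
    score = total = 0.0
    for field, weight in field_weights.items():
        a = str(source_rec.get(field, ""))
        b = str(candidate_rec.get(field, ""))
        score += weight * SequenceMatcher(None, a, b).ratio()
        total += weight
    return score / total if total else 0.0
```

An exact match on a unique ID is just the degenerate case where one field carries all the weight and similarity is 0 or 1.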

The question to ask when considering which data sources is: “What is my hypothesis?”

So there you have it: a quick guide to big data. Not too many rules, just enough so you don’t get lost. And one final quote from Douglas Adams’ Hitchhiker’s Guide to the Galaxy:

Protect me from knowing what I don’t need to know. Protect me from even knowing that there are things to know that I don’t know. Protect me from knowing that I decided not to know about the things that I decided not to know about. Amen.


Eating my own ice cream – Part 1

According to Wikipedia, the rather unfortunate term “dogfooding” was coined in reference to the use of one’s own products. That is, “if it is good enough for others to use then it is good enough for me to use”. I prefer the term coined by one-time Microsoft CIO Tony Scott: “icecreaming”.


Sanctorius of Padua literally eating his own ice cream.

In this two-part post, I am going to “eat my own ice cream” and dive into my own smart meter electricity data made available by my electricity distributor through an online portal. I will endeavour to find out what drives electricity usage in my household, how to make the data as predictable as possible and what lessons can be learned so that the utilities sector can get better insight from smart meter data.

Whether it is for regulatory requirements or generally better business decision making, traditional forecasting practices are proving to be inadequate. The current uncertainty in electricity pricing is partly driven by inadequate peak load and energy forecasts. Until the mid-2000s energy forecasting was very straightforward, as electricity was a low-cost resource based on very mature technology. And then everything changed. We had a run of hot summers followed by a run of wet, mild ones. We had the rooftop solar revolution, helped in the early days by considerable government subsidy. We had changes in building energy efficiency standards, and lately we have also had a downturn in the domestic economy. And of course we have had price rises, which have revealed the demand elasticity of some consumers.

This array of influences can seem complex and overwhelming, but armed with some contemporary data mining techniques and plenty of data we can build forecasts that take into account this range of factors and, more importantly, dispel myths about what does and doesn’t affect consumption patterns. Furthermore, we can build algorithms that detect when some new disruptor comes along and causes changes we have not previously accounted for. This is very important in an age of digital disruption. Any organisation that is not master of its own data has the potential to face an existential crisis, and all of the pain that comes with that.

In this analysis I am going to use techniques that I commonly use with my clients. In this case I am looking at a single meter (my own meter), but the principles are the same. When working with my clients my approach is to build the forecast at every single meter, because different factors will drive the forecast for different consumers (or at least different segments of consumers).

So that I don’t indulge in “analysis paralysis”, I will define some hypotheses that I want to test:

  • What drives electricity usage in my household?
  • How predictable is my electricity usage?
  • Can I use my smart meter data to predict electricity usage?

I will use open source/freeware to conduct this analysis and its visualisations, to prove again that this type of analysis does not have to be costly in terms of software; it relies instead on “thoughtware”. As always, let’s start with a look at the data.


As you can see, I have a row for each day with 48 half-hourly readings, which is the standard format for meter data. To this I add day of week and a weekend flag calculated from the date. I also add temperature data from my nearest Bureau of Meteorology automated weather station – which happens to be only about 3 kilometres away and at a similar altitude. Finally, I total the 48 readings to give a daily kWh usage figure. In a future post I will look into which techniques we can apply to the half-hourly readings, but in this post I will concentrate on the total daily kWh figure.
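The derived fields described above can be sketched in a few lines. This is an illustrative reconstruction, not the post's actual preparation script; the function and dictionary names are my own.

```python
from datetime import date

def enrich_day(day, half_hourly_kwh, max_temp_by_date):
    """Add the derived fields to one day's meter row.

    half_hourly_kwh: the 48 half-hourly readings for that date;
    max_temp_by_date: weather-station daily maxima keyed by date.
    """
    assert len(half_hourly_kwh) == 48, "meter data arrives as 48 half-hour reads"
    return {
        "date": day,
        "day_of_week": day.strftime("%A"),
        "is_weekend": day.weekday() >= 5,  # Saturday=5, Sunday=6
        "max_temp": max_temp_by_date.get(day),
        "daily_kwh": sum(half_hourly_kwh),
    }
```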

This is the data with the added fields:


My tool of choice for this analysis is the Generalised Linear Model (GLM). As a general rule, regression is a good choice for modelling a continuous variable, and GLMs also allow tuning of the model to fit the distribution of the data.

Before deciding what type of GLM to use let’s look at the distribution of daily usage:


Not quite a normal distribution. The distribution is slightly skewed to the left and has high kurtosis, which looks a little like a gamma distribution. Next, let’s look at the distribution of the log of daily kWh.


Here I can see a long tail to the left, but if I ignore that tail then I get a quite symmetric distribution. Let’s have a closer look at those outliers, this time by plotting temperature against daily kWh. They can be seen clearly in a cluster at the bottom of the graph below.


This cohort of low energy usage days represents times when our house has been vacant. In the last year these have mostly been one-off events with no data that I can use to predict their occurrence. They can all be defined as being below 5 kWh, so I’ll remove them from my modelling dataset. The next graph then shows we clearly have a better fit to a gamma distribution (blue line) than to a normal distribution (red line).
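The outlier filter is a one-liner, and a method-of-moments estimate gives the gamma parameters to overlay on the histogram. This is a sketch of the step, with hypothetical function names; a full analysis would fit the gamma by maximum likelihood.

```python
def remove_vacancy_days(daily_kwh, threshold=5.0):
    """Drop the vacant-house outliers (days under 5 kWh) before modelling."""
    return [k for k in daily_kwh if k >= threshold]

def gamma_moments(values):
    """Method-of-moments gamma shape and scale, for overlaying a fitted curve.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so the two moments pin down both parameters.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean ** 2 / variance, variance / mean  # (shape, scale)
```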


We are now ready to model. This is what the first GLM looks like:


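The output above comes from a full GLM package. As a rough stand-in sketch of the log-link idea only (not the gamma GLM itself), ordinary least squares on log(kWh) with a single predictor looks like this; the function names and data are illustrative.

```python
import math

def fit_log_linear(xs, ys):
    """OLS of log(y) on x: a crude stand-in for a log-link GLM with one predictor."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n
    # closed-form simple-regression coefficients on the log scale
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_ly) for x, l in zip(xs, log_ys))
    slope = sxy / sxx
    return mean_ly - slope * mean_x, slope  # (intercept, slope)

def predict_kwh(intercept, slope, x):
    """Predictions come back through the exp link, so effects are multiplicative."""
    return math.exp(intercept + slope * x)
```

The real model differs in two ways: a proper gamma error distribution rather than least squares on logs, and several predictors rather than one. But the multiplicative structure is the same.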
In assessing my GLM, I will use three measures:

  • the “Pr(>|t|)” value, which estimates the goodness of fit of each predictor (the smaller the better, meaning the greater the confidence we have in the coefficient estimate),
  • R-squared, which represents how well the overall model fits (the higher the better – R-squared can be thought of as the percentage of variance in the data explained by the model), and
  • root mean squared error (RMSE), which tells me the quantum of the average difference between my actual and predicted values (the lower the better; an RMSE of zero means that predicted values do not vary from actual values).
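The two overall-fit measures in that list can be computed directly from actuals and predictions; a minimal sketch:

```python
import math

def r_squared(actual, predicted):
    """Share of the variance in the actuals explained by the model."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Average size of the prediction error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)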

The model above is not well fitted, as demonstrated by the p-values and the fact that some coefficients did not produce an estimate. It has an R-squared of 0.36 and an RMSE of 7, and even these statistics are not very reliable given the p-values.

It also seems odd that MinTemp is significant but MaxTemp is not. So I remove the poor-performing variables and add an interaction between MinTemp and MaxTemp, as I expect to find a relationship between these two values and electricity usage.


This new model fits better, with an R-squared of 0.64 and an RMSE of 5.32. But the p-value for “Day==Tuesday” is still not low enough for my liking, given the sample size of only a few hundred observations. At the risk of erring slightly on the side of underfitting, I remove this term from the model. Taking a closer look at temperature, I plot average temperature (the midpoint between MinTemp and MaxTemp) against daily kWh and find an interesting pattern:


We see crossover points in the direction of correlation in the same temperature band at different seasonal changeovers, like bookends to the winter peak in usage. I use this insight to create two new temperature variables using splines. A bit of experimentation leads me to conclude that the temperature changeover is at 18 degrees Celsius, which is also the temperature at the bottom of my U-curve scatterplot above. I create a variable called “spl1” which is zero for all values less than 18 degrees and the average temperature minus 18 for all values above. The second variable, “spl2”, is the opposite: zero for all temperatures above 18 degrees and 18 minus the average temperature for all below. Because I am using a log link function, these variables will describe a U-shape as in the scatterplot, rather than the V-shape that would result from linear regression.
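The two spline (hinge) variables described above are simple to construct; a sketch, using the 18 degree knot found by experimentation:

```python
def temperature_splines(avg_temp, knot=18.0):
    """Hinge variables either side of the 18 C changeover.

    spl1 grows with warmth above the knot (cooling load); spl2 grows with
    cold below it (heating load). With a log link the pair traces a U-shape
    rather than the V-shape a purely linear model would produce.
    """
    spl1 = max(avg_temp - knot, 0.0)
    spl2 = max(knot - avg_temp, 0.0)
    return spl1, spl2
```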

Let’s see how these variables work in my model:


Hey presto! We have a much stronger fitting model, with an R-squared of 0.71 and an RMSE of 4.86. This model is appealing in that it is highly parsimonious and readily explainable. When I visualise the model fit and apply a thirty-day moving average, R-squared increases to 0.88 and I have a model with a good fit.


I have pointed out three periods where the model departs from actual usage. The two low periods coincide with times when we were away, and the high period coincides with a period when I was travelling. I have seen market research suggesting that the absence of the bill payer leads to higher household electricity usage. I can add dummy variables into my model to describe these events and then use them in future forecast scenarios. The important thing here is that I am not using a trend: given this fit, I see no trend in my usage other than that created by climatic variability. Some consumers will have a trend in usage driven by changes over time, such as changes in productivity for businesses or the addition of solar for residential customers. But it is not good enough to simply count on a continuing trend. It is important to get to the drivers of change and find ways of capturing those drivers in granular data.

In the next part of this post I’ll investigate how these meter-level insights can be used at the whole of network level, and some techniques which can be used to derive insight from individual meters to whole of network.

Productivity and Big Bang Theory

Productivity has been falling in Australia for some time. In the mining, utility and manufacturing sectors we have seen a remarkable fall in productivity over the last decade. Some of this has been caused by rising labour costs, but in mining and utilities in particular, capital expenditure on infrastructure has been a major contributor. So how will new technology and the era of “big data” transform the way these sectors derive return on capital investment?


According to the ABS, this may have been driven in part by the rapid development of otherwise unprofitable mines in an environment of once-in-a-lifetime high commodity prices. From a labour perspective, this has also driven up wages in the mining sector, which has knock-on effects for utilities.

Meanwhile for the last decade utilities have been dealing with a nexus of chronic under-investment in some networks, our insatiable appetite for air conditioning in hot summers and a period of growth in new housing with poor energy efficiency design in outlying urban areas which are subject to greater temperature extremes. The capital expenditure required to keep pace with this forecast peak demand growth has been a major negative in terms of productivity.

In this post I am going to consider how analytics can find increased productivity in the utilities sector (although there should be parallels for the mining sector) and specifically through optimisation of capital expenditure. I’ll discuss labour productivity in a  future post.

Deloitte has recently released its report into digital disruption: Short Fuse, Big Bang. In this report the utility sector is identified as one that will be transformed by technological change, albeit more slowly than other sectors. Having said that, electricity utilities and retailers will be the first to experience disruption to their business models, before water and gas. This is driven by the fact that electricity businesses are at the forefront of privatisation among utilities, and by the politicisation of electricity pricing. Internationally, energy security concerns (which have in turn seen the rise of renewables, energy conservation and electric vehicle development, for example) have also driven technological change faster for electricity utilities.

On face value, the concept of the smart grid looks like just the continuation of big-ticket capital investment and therefore declining productivity. Is there, however, a way to embrace the smart grid that actually increases productivity?

Using good design principles and data analytics, I believe the answer is yes. Here are three quick examples.

Demand Management

The obvious one is time of use pricing of electricity, which I have written about on this blog several times already. The problem with this from a savings point of view is that the payoff between reduced peak demand and savings in capital expenditure is quite lagged; without effective feedback between demand management and peak demand forecasting, it may simply result in overinvestment in network expansion and renewal. In fact I believe we have already seen this occur, as evidenced by AEMO’s revision of peak demand growth. When peak demand was growing most rapidly through the mid-1990s, demand management programs were proliferating, as were revisions to housing energy efficiency standards. It should have been no surprise that this would have an effect on energy usage, but quite clearly it has come as a surprise to some.

Interval meters (commonly referred to as “smart” meters) are required to deliver time of use pricing, and some parts of the NEM are further down the track than others in rolling these out, so this solution still requires further capital investment. In my recent experience it appears to be the most effective and fairest means of reducing peak demand. Meter costs can be contained, however, as “smart meter” costs continue to fall. A big cost in the Victorian rollout of smart meters has been not just the meters themselves but the communications and IT infrastructure to support the metering system. An opt-in rollout will lead to slower realisation of the benefits of time of use pricing in curbing peak demand, but will allow a deferral of the infrastructure capital costs.

Such an incremental rollout will also allow assessment of options such as communications-enabled “smart meters” versus manually read interval meters (MRIMs). MRIMs capture half-hour usage data but do not upload it via a communications network; they still require a meter reader to visit the meter and physically download the data. These meters are cheaper, but labour costs for meter reading need to be factored in. Communications-enabled meters have other advantages: data can be relayed in real time to the distributor, allowing savings spin-offs in network management, and consumers can monitor their own energy usage in real time, increasing the effectiveness of demand pricing through immediate feedback.

Power Management

From real time voltage management to reduce line loss, to neural net algorithms to improve whole-of-network load balancing, there are many exciting solutions that will reduce operating costs over time. Unfortunately, this will require continued capital investment in networks that do not have real time data-reporting capabilities, and there is little appetite for this at the moment. Where a smart grid has already been rolled out, these options need to be developed. Graeme McClure at SP Ausnet is doing some interesting work in this field.

Asset Lifetime

This idea revolves around a better understanding of the true value of each asset on the network. Even the most advanced asset management systems in Australian distributors at the moment tend to treat all assets of a particular type as being of equal value, rather than having a systematic way of quantifying their value based on where they sit within the network. Assets generally have some calculated lifetime and are replaced before it expires. But what if some assets could be allowed to run to failure with little or no impact on the network? It’s not that talented asset managers don’t already understand this; many do. But good data analytics can ensure that it happens consistently across the entire network. This is an idea I have blogged about before. It doesn’t require any extra investment in network infrastructure to realise benefits; it is more about conceptually smart use of data than smart devices.

The era of big data may also be the era of big productivity gains, and utilities still have time to get their houses in order in terms of developing analytics capability. But delaying this transition could easily see some utilities facing the challenges to the business model currently faced by some in the media and retail industries. The transition from service provider to data manufacturer is one that will in time transform the industry. Don’t leave it too late to get on board.

Have We Seen the End of Peak Demand?

There has been a lot of comment in the media lately about how dodgy forecasts have impacted retail electricity bills. Is this really the case? Has peak demand peaked? Have we over-invested in peaking capacity? I don’t propose to come up with a definitive answer here, but by exploring forecasting methodologies I hope to show why such predictions are so hard to make. In this post I am going to show that a pretty good model can be developed, using free software and a couple of sources of publicly available data (ABS, BOM), on a wet Melbourne Saturday afternoon. To cheer me up I am going to use Queensland electricity data from AEMO and concentrate on summer peak demand. I will then apply this technique to data only up to summer 2009 and compare the result to the recently downward-revised AEMO forecast.

But first let’s start with a common misconception. The mistake many commentators make is confusing the economics of electricity demand with the engineering of the network for peak capacity. Increasing consumption of electricity will impact wholesale prices of electricity. To a lesser extent it will also affect retail prices as retailers endeavour to pass on costs to consumers. The main driver of increased retail electricity prices, however, is network costs; specifically, the cost of maintaining enough network capacity for peak demand periods.

Let’s start by looking at some AEMO data. The following chart shows total electricity consumption by month for Queensland from 2000 to 2011.

Queensland Energy Consumption

We can see from this chart that total consumption has started to fall from around 2010. Interestingly, though, the “peakiness” has increased from about 2004, with summers showing much greater electricity usage than non-peak seasons.

If we overlay this with peak demand then we see some interesting trends.

Consumption versus Demand

What we see from 2006 onwards is an increasing separation between peak demand and total consumption. There are a couple of factors underlying this decoupling. One is the increased energy efficiency of homes, driven by energy-efficient building standards and other schemes such as the home insulation scheme. The other is the rapid uptake of solar power. Generous feed-in tariffs have encouraged widespread uptake of solar panels, which has decreased the amount of energy consumed from the grid except at peak times. A solar panel will reduce electricity consumption during the day, but during warm summer evenings, when the sun has disappeared, air conditioners will run heavily on network electricity. The implication of this decoupling is that we either pay more for our electricity to maintain the same standard of supply or accept lower reliability, especially at the times when we most need it – very hot and very cold days.

When we overlay temperature on peak demand we see generally summer peaking, which is typical for Queensland. We also see that maximum temperatures were higher earlier in the decade and then generally cooler in the last three years. It is important to remember that what we are seeing is a longer wave of variability, not a trend. This is often understood but not properly accounted for in forecasting temperature-variant behaviour.

Demand versus Temperature

The above chart does not use maximum monthly temperature but the average maximum of the hottest four days of each month. Those who have studied electricity usage behaviour know that the highest peak often occurs after a run of hot days. By averaging the hottest few days of each month we get a measure that captures both the peak temperature and the temperature run. It is not necessary for this purpose to explicitly model consecutive days, because temperature is not randomly distributed: hot days tend to cluster anyway. Another way to capture this is to count the number of days above a given temperature. Both types of variable can perform well in models such as these.
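Both temperature features described above are simple order-statistic calculations; a sketch (function names are mine):

```python
def hottest_days_average(daily_max_temps, n=4):
    """Average of the n hottest daily maxima in a month: captures both the
    peak temperature and the tendency of hot days to arrive in runs."""
    return sum(sorted(daily_max_temps, reverse=True)[:n]) / n

def days_above(daily_max_temps, threshold=35.0):
    """The alternative measure: count of days at or above a given temperature."""
    return sum(1 for t in daily_max_temps if t >= threshold)
```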

We can see from this chart that peak demand continues to rise despite the variability caused by temperature. The next step is to add variables that describe the increase in peak. In my experience population usually performs best, but in this case I’ll also test a couple of economic time series measures from the ABS National Accounts.

I also create a dummy variable to flag June, July and August as winter months. My final dataset looks like this:

Data snapshot

Preparation of data is the most important element of analytics. It is often difficult, messy and time-consuming work, but something many of those new to analytics skip over.

In this exercise I have created dummy variables and eventually discarded all except a flag indicating whether a particular month is a winter month, as per the data shown above. This allows the model to treat minimum temperature differently during cold months.

Another common mistake is to assume that extremes such as peak demand can only be modelled on the extreme observations. In this case I look at peak demand in all months in order to fit the summer peaks, rather than just modelling the peaks themselves, because there is important information in how consumer demand varies between peak and non-peak months. This way the model is not just a forecast but a high-level snapshot of population response to temperature stimulus. Extreme behaviour is defined by its variance from average behaviour.

My tool of choice is the GLM (Generalised Linear Model) which gives me a chance to experiment with both categorical variables (e.g. is it winter? Yes/No) and various distributions of peak demand (i.e. normal or gamma) and whether I want to fit a linear or logarithmic line to the data.

After a good deal of experimentation I end up with a very simple model which exhibits good fit, with each predictor variable significant at greater than the 95% level. For the stats-minded, here is the output:

GLM Output

You will notice that I have just four variables from two data sources left in my model. Economic measures did not make it to the final model. I suspect that population growth acts as a proxy for macroeconomic growth over time, both in terms of the number of consumers and the available labour supporting economic output.

Another approach borrowed from data mining, and not always used in forecasting, is to hold out a random test sample of data on which the model is not trained and against which goodness-of-fit statistics are validated. The following charts show the R-squared fit against both the data used to train the model and the held-out validation dataset.
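The holdout step itself is straightforward; a sketch of a reproducible random split (names and the 30% fraction are illustrative):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Hold out a random sample for validation, training on the rest."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]
```

Fit statistics computed on the held-out rows are an honest estimate of out-of-sample performance, because the model never saw them during training.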

Model Fit - Training Data

Model Fit - Test Data

We can be confident on this basis that our model explains about 80% of the variance in peak demand over the last decade (with, I suspect, the balance being explained by a combination of solar PV, household energy efficiency programs, industrial use and “stochastic systems” – complex interactive effects that cannot be modelled in this way).

Another way to look at this is to visually compare the predicted peak demand against actual peak demand as done in the following graph.

GLM Model - Predicted versus Actual

We can see from this chart that the model tends to overestimate demand in the earlier part of the period and underestimate it at the end. I am not too concerned about that, however, as I am trying to fit an average over the period so that I can extrapolate an extreme, and I will show that this has only a small impact on the short-term forecast. This time series does contain one particularly big disruption: the increased penetration of air conditioning. The earlier part of the period includes relatively low air conditioner penetration (and we have now most likely reached maximum penetration), while the later period includes households with greater energy efficiency. These effects counteract each other. As with weather, you can remove variability if you take a long enough view.

Let’s see what happens if we take temperature up to a 10 POE level and forecast out three years to November 2014. That is, what happens if we feed one-in-ten-year temperatures into the model? I emphasise that this is 10 POE temperature, not 10 POE demand.

GLM - 10 POE Temperature Prediction

We see from this chart that actual demand exceeded our theorised demand three times (2005, 2007 and 2010) in 12 years. Three years out of twelve can be considered 25 POE; in other words, the peak exceeds the theorised peak 25% of the time over a twelve-year period.
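Reading an empirical POE level off observed history amounts to picking the k-th highest value; a sketch of that bookkeeping (the function name is mine, and a formal POE study would use fitted distributions rather than raw order statistics):

```python
def empirical_poe_level(annual_peaks, poe_percent):
    """The demand level exceeded in roughly poe_percent of observed years.

    e.g. with 12 yearly peaks, 25 POE is the third-highest observed peak,
    matching the three-exceedances-in-twelve-years reading above.
    """
    ordered = sorted(annual_peaks, reverse=True)
    k = max(1, round(len(ordered) * poe_percent / 100))
    return ordered[k - 1]
```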

2010 appears to be an outlier, as overall the summer was quite mild. There was, however, a spike of very warm weather in South East Queensland in January which drove a peak not well predicted by my model. The month also recorded very cool temperatures, which caused my model to drag down peak demand. This is consistent with the concept of probability of exceedance: there will be observed occurrences that exceed the model.

The final test of my model is to compare it back to the AEMO model. My model predicts a 2013/14 summer peak of 9309 MW at 25 POE. The 50 POE summer peak forecast for 2013/14 under AEMO’s Medium scenario is 9262 MW, and 9568 MW at 10 POE. If we approximate a 25 POE for AEMO as the midpoint between the two then we get 9415 MW. That means we get pretty close using just population and temperature, some free data and software, and a little bit of knowledge (which we know is a dangerous thing).

GLM Fit to AEMO Model

This forecast is a significant downward revision on previous expectations, which has in part led to the accusations of dodgy forecasting and “gold plating” of the network. So what happens if I apply my technique again, but this time only on data up until February 2009? That was the last time we saw a really hot spell in South East Queensland. If new data has caused forecasts to be lowered, then going back this far should lead to a model that exceeds the current AEMO forecast. The purple line in the graph below is the result of this new model, compared to actuals, the first model and AEMO:

GLM Modelled Pre-2010

What we see here is a much better fit through the earlier period, some significant under-fitting of the hot summers of 2004 and 2005, but an almost identical result to the original GLM model in forecasting through 2012, 2013 and 2014, and still within the bounds of the AEMO 10 and 50 POE forecasts. Hindsight is always 20/20 vision, but there is at least prima facie evidence to say that the current AEMO forecast appears to be on the money and previous forecasts have been overcooked. It will be interesting to see what happens over the next few years. We should expect peak demand to exceed the 50 POE line once every two years and the 10 POE line once every ten years.

We have not seen the end of peak demand. The question is how far we are willing to trade off reliability in our electricity network to reduce the cost of accommodating peak demand. The other question is that all-of-system peak demand forecasting is well and good, but where will the demand happen on the network? Will it be concentrated in certain areas, and what are the risks to industry and consumers of lower reliability in those areas? I'll tackle this question in my next post.

Analytics: Insource or Outsource?

For someone who makes their living from consulting on analytics, my answer to this question may surprise some. In a world increasingly dominated by data, the ability to leverage data is not only a source of competitive advantage, it is now a required competency for most businesses.

External consulting can help accelerate the journey to fully insourced analytics capability. The trick is how to do this in the most cost effective way. I have dealt with a number of companies that have very different approaches to this question, and it is my observation that the wrong mix of insourcing and outsourcing can be very expensive, perhaps in ways that you may find surprising. The key is understanding that analytics is not primarily a technology function.

To illustrate my point I am going to describe the analytics journey of three hypothetical companies. Our three companies are all challenger brands, second or third in their respective markets. Their businesses have always been reliant on data and smart people, but new technology and competitive pressures mean that data is becoming more and more important to their business models. All recognise the need to invest, but which is the right strategy?

The CIO of Company A has launched a major project to implement a new ERP system which will transform the way they will manage and access data right across the organisation. He is also establishing an analytics team by hiring a handful of statistics PhDs to extract maximum value from the new data platform. He is investing significantly with a major ERP platform vendor and is using consultants to advise him on implementation and help manage the vendor. He sees no need to spend additional money on analytics consultants because he has already hired plenty of smart people who can help him in the short term. He does however see value in hiring consultants to help his organisation with the large IT transformation.

In Company B, the COO is driving the analytics strategy. Privately, he doesn't rate the CIO, whom he sees as coming from a bygone era in which IT is a support function to the core business and technical capability is delivered from a technical centre of excellence. The CIO has built a team of senior managers who firmly believe that, to maintain efficient use of resources, business users should only have access to data through IT-approved or IT-built applications. The company has a very large and well-organised data warehouse, but mostly it is accessed by other applications. There are very few human users of the data, and virtually none outside of IT, which mostly uses the data warehouse for building applications and relies on a detailed specification process from internal “customers” to understand the content of the data.

To drive his strategy of developing organisational analytics capability, the COO is forced either to wait for lengthy testing of new applications and system access granted on an exception basis, or to outsource his analytics to service providers who can offer him greater flexibility and responsiveness. He secures funding for an asset management project to optimise spending on maintaining ageing infrastructure and engages a data-hosting service. Separately, he hires consultants to build advanced asset-failure predictive models based on the large volumes of data in his externally hosted data mart.

Company C has hired a new CIO with a varied background across technology and business roles, most recently as CEO of a technology company, where she gained both technology and commercial experience. Her previous company frequently (but not always) used Agile development methodology. She too has been tasked with developing a data strategy in her new role. Company C is losing market share, and the executive team thinks this is because their two competitors have spent heavily on IT infrastructure renewal and have effectively bought market share by doing so. Company C is not using its data effectively to price its products and develop product features that drive greater customer value, but it is constrained in how much it can spend to renew its own data infrastructure: the parent company will not approve large IT expenditure while margins and market share are falling. The CIO resists pressure from the executive and external vendors to implement a new cut-price ERP system and instead focuses her team on building better relationships with business users, especially in the pricing and product teams. She develops a team of technology-savvy senior managers with functional expertise in pricing and product development, rather than IT managers. She delivers a strong and consistent message that the organisation's goal is to compete on data and analytics; every solution should be able to state how data and analytics are used.

As issues or manager-driven initiatives arise, she funds small project teams comprising IT, business and some involvement of external consultants. She insists that her managers hire consultants to work on site as part of virtual teams with company staff. Typically consultants are only engaged a few weeks at a time, but there may be a number of projects running simultaneously. Where infrastructure or organised data does not exist, teams are permitted to build their own “proof of concept” solutions, which are supported by the teams themselves rather than by IT. Because the ageing data warehouse struggles to cope with increased traffic, it is increasingly used as a data staging area, with teams running their own purpose-built databases.

So how might these strategies play out? Let’s look at our three companies 12 months later.

Company A has built a test environment for their ERP system fairly quickly. The consultants have worked well with the vendor to get a “vanilla” system up and running, but the project is now running into delays due to integration with legacy systems and problems handling the increasing size of the data. The CIO's consultants are warning of significant blowouts in time and cost, but they are so far down the path now that pulling out is not an option; the only option is to keep requesting more funds. The blame game is starting, with the vendor blaming the consultants and the consultants blaming IT. Meanwhile, the CIO's PhD-qualified analytics team have little work to do as they wait many months for their data requests to be filled, in part because the resources required to support the ERP project leave few staff available to support ad hoc requests. When the stats team does get data, they build interesting and robust statistical models but struggle to understand their relevance to the business. One senior analyst has already left and others will most likely follow. I have seen this happen more times than I care to remember. Sadly, Company A is a pretty typical example.

Company B has successfully built their asset management system, which is best in class thanks to the specialised skills provided by the data-hosting vendor and analytics consultants. It has not been cheap, but they will not spend as much as Company A eventually will to get their solution in place. The main issue is that no one in Company B really understands the solution, and more time and money will be required to bring it in house, with some expenditure still required from IT and the development of a support team. On the bright side, however, the CIO has been shown up as recalcitrant, and migrating the project in house will be a good first project for the incoming CIO when the current CIO retires in a few months. It will encourage IT to develop new IP and new ways of working with the business, including the sharing of data and system development environments.

Company C (as you may already have guessed) is the outstanding success. Within a few weeks they had their first analytics pricing solution in place. A few weeks after that, tests were showing both increased profitability and market share within the small test group of customers chosen to receive the new pricing. The business case for the second-stage rollout was a no-brainer, and funding will be used to move the required part of the data warehouse into the cloud.

After 12 months a few of the projects had not produced great results and these were quietly dropped. Because these were small projects, costs were contained and, importantly, the team became better at picking winners over time. Small incremental losses were seen as part of the development process. A strategy of running a large number of concurrent projects was a strain at first for an IT group more accustomed to “big bang” projects, but the payoff was that risks were spread: while some projects failed, others succeeded. Budgets were easier to manage because they were delegated to individual project teams, and the types of cost blowouts experienced by Company A were avoided.

The salient lesson here is to look first at how your organisation structures its approach to data and analytics projects. Only then should you consider how to use and manage outsourced talent. The overarching goal should be to bring analytics in house, because that's really where it belongs.

Retail Therapy

July 1, 2012 will probably be mostly remembered as the date Australia introduced a price on carbon. But another event took place that day which may prove more significant in terms of how households and small businesses consume their electricity: the commencement of the National Energy Customer Framework (NECF). The NECF gives the Australian Energy Regulator (AER) responsibility for (among other things) regulating retail electricity prices. Electricity retail prices continue to rise, driven mostly by increasing capital expenditure on networks. Electricity businesses, regulators and governments are increasingly turning their attention to Time of Use (TOU) pricing to help mitigate peak network demand and therefore reduce capital expenditure.

Change will be gradual to start with, however. A cynical observer might suggest that the NECF is no more than a website at present, but I believe that change is inevitable and that it will be significant. Five states and the ACT have agreed to a phased introduction of the NECF following on from a 2006 COAG agreement, and the transition will be fraught with all of the complexities of introducing cross-jurisdictional regulatory reform.

There are basically two mechanisms that drive the cost of producing and delivering electricity. One is the weather (we use more in hot and cold weather) and the other is the cost of maintaining and upgrading the network that delivers the electricity. For the large retailers, the way to deal with the weather is to invest in both generation and retail, because one is a hedge for the other; businesses that do this are known as “gentailers”.

The network cost has traditionally been passed through as a regulated network tariff component of the retail price. The problem with this is that the network price structure often does not reflect actual network costs, which are driven by infrequent peak use, particularly for residential customers. Those who use a greater proportion of electricity during peak times add to the cost of maintaining capacity in the network to cope with the peak, yet residential and other small consumers all pay the same rate. In effect, “peaky” consumers are subsidised by “non-peaky” customers.

It is not yet really clear how a price signal will be built into the retail tariff, but one policy option is for distributors to pass through costs that reflect an individual consumer's load profile. The implications for government policy are interesting, but I'll save those for another post. Here, I'll explore the implications from the retailer's perspective in contestable markets.

I believe that this is potentially quite a serious threat to the business model for retailers for a number of reasons that I’ll get into shortly, but at the heart of the matter is data: lots of it, and what to do with it. Much of that data is flowing from smart meters in Victoria and NSW and will start to flow from meters in other states. A TOU pricing strategy not only requires data from smart meters but also many other sources as well.

Let’s have a quick recap on TOU. I have taken the following graph from a report we have prepared for the Victorian Department of Primary Industries which can be found here.

The idea of TOU is to define a peak period around the time of day when usage peaks and to charge more for electricity during that period. A two-part TOU will define other times as off-peak and charge a much lower tariff. There may also be shoulder periods either side of the peak where a medium tariff is charged.

How each of these periods is defined, and the tariff levels set, will determine whether the system as a whole collects the same revenue as when everyone is on a flat tariff. This principle is called revenue neutrality: the part of the electricity system that supplies households and small businesses will collect the same revenue under the new TOU tariffs as under the old flat tariff.
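Revenue neutrality can be made concrete with a toy calculation. All figures here are invented for illustration (real tariff setting involves far more detail): given a system-wide energy split across peak, shoulder and off-peak periods and chosen peak and shoulder rates, solve for the off-peak rate that recovers the same revenue as the flat tariff.

```python
# Toy revenue-neutrality calculation; every number is invented.
flat_rate = 0.25  # $/kWh under the old flat tariff
load = {"peak": 1000.0, "shoulder": 2000.0, "off_peak": 3000.0}  # GWh
flat_revenue = flat_rate * sum(load.values())

tou_peak, tou_shoulder = 0.45, 0.25  # chosen TOU rates, $/kWh
# Solve for the off-peak rate that keeps total revenue unchanged.
tou_off = (flat_revenue
           - tou_peak * load["peak"]
           - tou_shoulder * load["shoulder"]) / load["off_peak"]

tou_revenue = (tou_peak * load["peak"]
               + tou_shoulder * load["shoulder"]
               + tou_off * load["off_peak"])
print(round(tou_off, 4))  # the revenue-neutral off-peak rate
```

Note that neutrality here holds only at the whole-of-system level and only for the usage pattern the rates were calibrated on.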

But this should by no means give comfort to retailers that they each will achieve revenue neutrality.

For example, we can see from the above graphs that even if revenue neutrality is achieved for all residential and SME customers combined, residential customers may be better off and SMEs worse off, or vice versa, with everything still totalling no change in revenue. If a retailer has a large share of customers in a “better off” category, that will translate to a fall in revenue if the retailer passes on the network tariff with their existing margin. In fact, we find that residential bills may be reduced by up to five per cent, depending on the design of the network tariff.

Of course, this is just one segmentation of TOU; there could be many, many more sub-segments, all with different “better off” or “worse off” outcomes.

Revenue neutrality can also be affected by price elasticity (consumers reduce their peak consumption) or substitution (they move their peak usage to shoulder or off-peak periods, thereby reducing their overall electricity bill). This means that retailers not only have to understand the impact under the current state of electricity usage, but also how the tariff itself will affect consumer behaviour.
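The substitution effect can be sketched with invented figures: a tariff calibrated to be revenue neutral on a customer's old usage pattern stops being neutral the moment the customer shifts load out of the peak period.

```python
# Invented figures for one customer over a billing period.
peak_kwh, off_kwh = 300.0, 700.0
flat_rate, tou_peak, tou_off = 0.25, 0.45, 0.18  # $/kWh

bill_flat = flat_rate * (peak_kwh + off_kwh)
# TOU bill assuming the customer's behaviour does not change...
bill_tou_static = tou_peak * peak_kwh + tou_off * off_kwh
# ...versus the bill after shifting 30% of peak usage to off-peak.
shifted = 0.3 * peak_kwh
bill_tou_shifted = (tou_peak * (peak_kwh - shifted)
                    + tou_off * (off_kwh + shifted))
print(bill_flat, bill_tou_static, round(bill_tou_shifted, 2))
```

The shifted bill comes in below the static TOU bill, so the retailer's revenue falls relative to the calibration assumption, which is exactly the behavioural response the tariff is designed to induce.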

Data is at the very centre of competitive advantage as this disruptive event unfolds in the retail electricity market. Indeed, the threat may not just be disruptive: for some retailers it may be existential, especially as data-centric organisations such as telcos and ISPs enter the market. So far no large telcos have entered the market in Australia (as far as I know: please correct me if this has changed), but surely the elephants must be loitering outside the room if not already in it.

I think what is clear for incumbent electricity retailers is that “do nothing” is not an option. There must be a clear strategy around data and pricing, covering technology, talent and process. Furthermore, the centrepiece must be excellence in time-of-use pricing, built on a deep capability with the data flowing from new technology meters and networks.

So what exactly are the key issues? The following list is by no means exhaustive but certainly gives some idea of the extent of data and the quantum of skills required to handle such complex analysis and interpretation.

Opt In or Opt Out?

I believe that TOU tariffs for small consumers are inevitable, but how will it roll out and how fast will the rollout be? The key policy decision will be whether to allow customers to opt in to TOU tariffs or opt out of a scheme which will otherwise be rolled out by default (a third option is to mandate to all, but this is likely to be politically unpalatable). I think pressure on governments to act on electricity network costs means that the “opt in” option, if it is adopted by the AER, will by definition be a transitional process. But the imperative is to act quickly because there is a lag between reducing peak demand and the flow through to capital expenditure savings (this is another whole issue which I will discuss in a future post). This lag means that if take up of TOU is too slow then the effect to the bottom line will be lost in the general noise of electricity consumption cycles: a case of a discount delayed is a discount denied. Retailers will have the right to argue for a phased introduction but there will be pressure on governments and the AER to balance this against the public good.

Non-cyclical change in demand

In recent years we have seen a change in the way electricity is consumed. I won't go into the details here because I have blogged on this before. Suffice to say that it is one thing to understand from the data how a price may play out in the current market state, but altogether another to forecast how this will affect earnings. This requires a good idea of where consumption is heading, and in turn this is affected by a range of recent disruptors including solar PV, changes in housing energy efficiency and changes in household appliance profiles. Any pricing scenario must also include a consumption forecast scenario. It would also be wise to monitor forecasts carefully for other black swans waiting to sweep in.

A whole of market view

The task of maintaining or increasing earnings from TOU pricing will be a zero-sum game. That is, if one retailer gets an “unfair share” of the “worse off” segments, then another retailer will get more of the “better off” segments, and this is likely to be a one-off readjustment of the market. There is a need for a sophisticated understanding of customer lifetime value, underpinned by a good understanding of market share by profitability. The problem is that smart meters (and the resulting data for modelling TOU) will roll out in stages (Victoria is ahead of the other states, but I think the rollout is inevitable across the National Electricity Market). The true competitive advantage for a retailer comes from estimating the demand profiles of customers still on accumulation meters and of those smart-meter consumers who are with competitors. There are a range of data mining techniques for building a whole-of-market view, but equally important is a sound go-to-market strategy built to take advantage of these insights.
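One of the simplest such techniques is profile imputation: learn typical load-profile segments from your smart-meter customers, then assign each accumulation-meter customer to the nearest segment using whatever attributes you do have. The sketch below is purely illustrative (segment names, figures and the single matching attribute are all invented); real implementations would use many more features and proper clustering.

```python
# Hypothetical smart-meter segments: name -> (avg quarterly kWh,
# share of daily usage falling in the peak period).
smart_segments = {
    "flat_profile": (1200.0, 0.20),
    "evening_peaky": (1800.0, 0.45),
    "daytime_solar": (900.0, 0.10),
}

def nearest_segment(quarterly_kwh: float) -> str:
    """Assign an accumulation-meter customer (known only by quarterly
    billed kWh) to the closest smart-meter segment."""
    return min(smart_segments,
               key=lambda s: abs(smart_segments[s][0] - quarterly_kwh))

# A customer billed 1700 kWh last quarter looks most like the peaky segment.
print(nearest_segment(1700.0))  # → evening_peaky
```

With each customer mapped to a segment, the retailer can estimate the whole-of-market mix of “better off” and “worse off” customers before the smart-meter data actually arrives.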

There will be winners and losers in the transition to TOU. For consumers, it could be argued that the “losers” are currently “winners” because the cost of their electricity supply is being subsidised by less “peaky” customers. There will also be winners and losers among energy retailers. Some of the winners may not even be in the market yet. The question is who will the losers be?

Is there really such a thing as a “data scientist”?

A curious job description has started to appear in recent times: that of “data scientist”. In this post I am going to see if such a creature really exists or whether it is a chimera designed to make the rather dense and impenetrable subject of analytics even more dense and impenetrable to the outside observer.

Firstly, we need to whip up a quick epistemology of analytics (apologies to T.S. Eliot):

Here we see all knowledge and wisdom has data at its core; not raw data but data that is transformed into information. Information is derived from data in many different ways: from the humble report produced from a desktop database or spreadsheet, to the most sublime and parsimonious predictive model lovingly crafted by the aforementioned alleged data scientist.

If we gather enough of this information and discover (or create) the interconnections between discrete pieces of information then we create knowledge. If we gather enough knowledge then sooner or later we may start to question why we know what we know: this arguably is wisdom. Hopefully, from a large enough amount of data we may in time extract a very small amount of wisdom.

The logic also flows the other way: wisdom tells us whether we are acquiring the right knowledge; knowledge gaps lead to the need for more information; and information needs drive further gathering and interrogation of data.

So where does science sit in all of this? I am not going to discuss wisdom in detail here; that belongs to philosophy and theology, although some physicists may disagree (you can download the podcast here). Science is dedicated to the creation of knowledge from information (knowledge that is derived through deduction or observation). The “data scientist”, on the other hand, specialises in deriving information from data, which I argue is not a science at all. It is certainly a critically important function, and one that is becoming central to all organisations in one form or another, but it is not a science.

Invention, as the saying goes, is 1% inspiration and 99% perspiration. Generating information from data is the 99% perspiration part. The most skilled statistician cannot create useful models without the right data in the right shape to answer the right questions. Understanding what shape the data ought to be in requires the transfer of knowledge through information to the data level.

To turn the question around, if the data scientist truly exists then what are scientists of other disciplines? Arguably, all scientists are data scientists as all hypothesis-driven science relies on creating the nexus between data, information and knowledge. The term “data scientist” is therefore a tautology.

So if the data “scientist” is not a scientist, then what is he or she? If it's big data we are dealing with, then the discipline is more akin to engineering. For smaller datasets it may be more of a design or architectural function. These are all critical functions in analytics, but they are not science.

More importantly, every knowledge worker is increasingly becoming their own data scientist. It is no longer acceptable for analytics to remain a function separate from the other critical functions of an organisation, because it is knowledge and experience that help us gain insight from data; knowledge does not sit a priori within data waiting to be discovered. The questions we ask of data are the most important part of transforming information into knowledge.