The Hitchhiker’s Guide to Big Data

Thanks to Douglas Adams

Data these days is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is… The best piece of advice I can give is: Don’t Panic.

Because I am an all-round hoopy frood, I will provide you, the reader, with a few simple concepts for navigating this vast topic. So let’s start with definitions. What is big data? The answer is: it depends who you talk to. Software vendors say it’s something that you need to analyse (interestingly, usually with the same tools used to analyse “small” data). Hardware vendors will talk about how much storage or processing power is needed. Consultants focus on the commercial opportunities of big data, and the list goes on…

Big data is mostly just good ol’ fashioned data. The real revolution is not the size of the data but the cultural and social upheaval that is generating new and different types of data. Think social networking, mobile technology, smart grids, satellite data, the Internet of Things. The behavioural contrails we all leave behind us every minute of every day. The real revolution is not the data itself but the meaning contained within the data.

The towel, according to the Hitchhiker’s Guide to the Galaxy “is about the most massively useful thing an interstellar hitchhiker can have”. The most important tool the big data hitchhiker can have is a hypothesis. And for commercial users of data a business case to accompany that hypothesis. To understand why this is so important let’s look at the three characteristics of data: volume, velocity and variety. In simplified terms, think of spreadsheets:

Volume

As Tom Davenport has reportedly said, ‘Big data usually involves small analytics’. Size does matter insomuch as it becomes really hard to process and manipulate large amounts of data. If your problem requires complex analysis then you are better off limiting how much data you analyse. There are many standard ways to do this, such as random selection or aggregation. Often, because activities such as prediction or forecasting require generalisations to be drawn from data, there comes a point where models do not become substantially more accurate with the addition of more data.
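One cheap way to limit how much data you analyse is random selection. Here is an illustrative sketch (names and sizes invented) using reservoir sampling, which draws a uniform random sample from a stream without ever holding the full dataset in memory:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniform random sample of k items from a stream of unknown length,
    without holding the full dataset in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # replace an existing item with decreasing probability k/(i+1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# analyse a 1,000-item sample instead of a million rows
subset = reservoir_sample(range(1_000_000), 1_000)
```

The same idea applies to aggregation: the point is to choose the reduction that suits the analysis, not to haul all of the data through every model.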

On the other hand, relatively crude algorithms can become quite accurate with large amounts of data to process and here I am thinking in particular of Bayesian learning algorithms.
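To make that concrete, here is a minimal hand-rolled multinomial naive Bayes classifier (purely illustrative; the training data is invented). The algorithm is crude, but its accuracy tends to improve simply as more labelled examples arrive:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Train multinomial naive Bayes on (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in class_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace smoothing so unseen words don't zero the probability
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["cheap", "pills"], "spam"), (["meeting", "agenda"], "ham"),
        (["cheap", "meds"], "spam"), (["project", "agenda"], "ham")]
model = train_nb(docs)
print(predict_nb(model, ["cheap", "meds", "pills"]))  # prints "spam"
```

With four documents this is a toy; with millions of behavioural contrails, the same naive machinery becomes surprisingly accurate.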

The decision of how much data is the right amount depends on what you are trying to achieve. Do not assume that you need all of the data all of the time.

The question to ask when considering how much is: “How do I intend to analyse the data?”

Velocity

If there is any truth in the hype about big data it is the speed with which we can get access to it. In some ways I think it should be called “fast data” rather than “big data”. Again, the speed with which data can be collected from devices can be astounding. Smart meters can effectively report in real time, but does the data need to be processed in real time? Or can it be stored and analysed at a later time? And how granular does the chronology have to be?

As a rule I use the timeframes necessary for my operational outcome to dictate what chronological detail I need for my analysis. For example, real time voltage correction might need one minute data (or less), but monitoring the accuracy of a rolling annual forecast might only need monthly or weekly data.
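As a sketch of that rule, here is illustrative Python that downsamples one-minute readings to whatever granularity the operational outcome actually requires (the data and bucket names are invented):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def downsample(readings, bucket="hour"):
    """Aggregate (timestamp, kW) readings to the coarsest granularity the
    operational outcome needs -- here, mean demand per bucket."""
    fmt = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d", "month": "%Y-%m"}[bucket]
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, kw in readings:
        key = ts.strftime(fmt)
        sums[key] += kw
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

# three hours of invented one-minute smart meter readings
start = datetime(2013, 1, 1)
minute_data = [(start + timedelta(minutes=i), 1.0 + (i % 60) / 60)
               for i in range(180)]
hourly = downsample(minute_data, "hour")  # 180 readings -> 3 buckets
```

Voltage correction would keep the minute data; a rolling annual forecast would happily run on the monthly bucket.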

Operational considerations are always necessary when using analytics. This is where I distinguish between analytics and business intelligence. BI is about understanding; analytics is about doing.

The question to ask when considering how often is: “How do I intend to use the analysis?”

Variety

This is my favourite. There is a dazzling array of publicly available data as governments and other organisations begin to embrace open data. And there’s paid-for vendor data available as well. Then there is social media data, data that organisations collect themselves through sensors in smart (and not-so-smart) devices, traditional customer data, and data collected from market research or other opt-in channels such as mobile phone apps. The list goes on.

This is where a good hypothesis is important. Data should be sourced, cleansed and transformed according to your intended outcome.

The other challenge with data variety is how to join it all together. In the old days we relied on exact matches via a unique id (such as an account number), but now we may need to be more inventive about how we compare data from different sources. There are many ways in which this can be done: this is one of the growing areas of dark arts in big data. On the conceptual level, I like to work with a probability of matching. I start with my source data set, which will have a unique identifier matched against the data I am most certain of. I then join in new data and create new variables that describe the certainty of match for each different data source. I then have a credibility weighting for each new source of data that I introduce and use these weightings as part of the analysis. This allows all data to relate to each other; this is my Babel fish for big data. There are a few ways in which I actually apply this approach (and I am finding new ways all the time).
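A minimal sketch of that probability-of-matching idea, with invented field names and weights (real record linkage uses far more sophisticated techniques, but the shape is the same):

```python
from difflib import SequenceMatcher

def match_probability(rec_a, rec_b, weights):
    """Combine per-field string similarities into a single 0-1 score that
    can be carried forward as a credibility weighting in the analysis."""
    score, total = 0.0, 0.0
    for field, w in weights.items():
        a = str(rec_a.get(field, "")).lower()
        b = str(rec_b.get(field, "")).lower()
        score += w * SequenceMatcher(None, a, b).ratio()
        total += w
    return score / total

# invented example: joining a new source onto the trusted master record
source = {"name": "J. Smith", "postcode": "4000"}
candidate = {"name": "John Smith", "postcode": "4000"}
weights = {"name": 0.6, "postcode": 0.4}
p = match_probability(source, candidate, weights)  # carried as a weight downstream
```

The score itself becomes a variable: rather than accepting or rejecting the join outright, each new source contributes to the analysis in proportion to how certain the match is.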

The question to ask when considering which data sources is: “What is my hypothesis?”

So there you have it: a quick guide to big data. Not too many rules, just enough so you don’t get lost. And one final quote from Douglas Adams’ Hitchhiker’s Guide to the Galaxy:

Protect me from knowing what I don’t need to know. Protect me from even knowing that there are things to know that I don’t know. Protect me from knowing that I decided not to know about the things that I decided not to know about. Amen.


Productivity and Big Bang Theory

Productivity has been falling in Australia for some time. In the mining, utility and manufacturing sectors we have seen a remarkable fall in productivity over the last decade. Some of this has been caused by rising labour costs, but in mining and utilities in particular, capital expenditure on infrastructure has been a major contributor. So how will new technology and the era of “big data” transform the way these sectors derive return on capital investment?


According to the ABS, this fall may have been driven in part by the rapid development of otherwise unprofitable mines to production in an environment of once-in-a-lifetime high commodity prices. From a labour perspective, this has also driven up wages in the mining sector, which has knock-on effects for utilities.

Meanwhile for the last decade utilities have been dealing with a nexus of chronic under-investment in some networks, our insatiable appetite for air conditioning in hot summers and a period of growth in new housing with poor energy efficiency design in outlying urban areas which are subject to greater temperature extremes. The capital expenditure required to keep pace with this forecast peak demand growth has been a major negative in terms of productivity.

In this post I am going to consider how analytics can find increased productivity in the utilities sector (although there should be parallels for the mining sector) and specifically through optimisation of capital expenditure. I’ll discuss labour productivity in a future post.

Deloitte has recently released its report into digital disruption: Short Fuse, Big Bang. In this report the utility sector is identified as one that is going to be transformed by technological change, albeit more slowly than other sectors. Having said that, electricity utilities and retailers are going to be the first to experience disruptions to their business models, before water and gas. This is being driven by the fact that electricity businesses are at the forefront of privatisation among utilities, and by the politicisation of electricity pricing. Internationally, energy security concerns (which have in turn seen the rise of renewables, energy conservation and electric vehicle development, for example) have also driven technological change faster for electricity utilities.

On face value the concept of the smart grid just looks like the continuation of big-ticket capital investment and therefore a decline in productivity. Is there, however, a way to embrace the smart grid which actually increases productivity?

Using good design principles and data analytics, I believe the answer is yes. Here are three quick examples.

Demand Management

The obvious one is time of use pricing of electricity, which I have written about on this blog several times already. The problem with this from a savings point of view is that the payoff between reduced peak demand and savings in capital expenditure is quite lagged, and without effective feedback between demand management and peak demand forecasting it may just result in overinvestment in network expansion and renewal. In fact I believe that we have already seen this occur, as evidenced by AEMO’s revision of peak demand growth. When peak demand was growing most rapidly through the mid-1990s, demand management programs were proliferating, as were revisions to housing energy efficiency standards. It should have been no surprise that this would have an effect on energy usage, but quite clearly it has come as a surprise to some.

Interval meters (also commonly referred to as “smart” meters) are required to deliver time of use pricing, and some parts of the NEM are further down the track than others in rolling these out, so this solution still requires further capital investment. In my recent experience this appears to be the most effective and fairest means of reducing peak demand. Meter costs can be contained, however, as “smart meter” costs continue to fall. A big cost in the Victorian rollout of smart meters has been not just the meters themselves but the communications and IT infrastructure to support the metering system. An opt-in rollout will lead to slower realisation of the benefits of time of use pricing in curbing peak demand but will allow a deferral of the infrastructure capital costs. Such an incremental rollout will also allow assessment of options such as communications-enabled “smart meters” versus manually read interval meters (MRIMs). These are meters which capture half-hour usage data but do not upload it via a communications network; they still require a meter reader to visit the meter and physically download the data. These meters are cheaper, but labour costs for meter reading need to be factored in. Communications-enabled meters have the further advantage that data can be relayed in real time to the distributor, allowing other savings spin-offs in network management. Real-time data also makes it possible for consumers to monitor their own energy usage and therefore increases the effectiveness of demand pricing through immediate feedback to the consumer.

Power Management

From real-time voltage management to reduce line loss, to neural net algorithms to improve whole-of-network load balancing, there are many exciting solutions that will reduce operating costs over time. Unfortunately, this will require continued capital investment in networks that do not have real-time data-reporting capabilities, and there is little appetite for this at the moment. Where a smart grid has already been rolled out, these options need to be developed. Graeme McClure at SP Ausnet is doing some interesting work in this field.

Asset Lifetime

This idea revolves around a better understanding of the true value of each asset on the network. Even the most advanced asset management systems in Australian distributors at the moment tend to treat all assets of a particular type as being of equal value, rather than having a systematic way of quantifying their value based on where they sit within the network. Assets generally have some type of calculated lifetime and get replaced before they expire. But what if some assets could be allowed to run to failure with little or no impact on the network? It’s not that many talented asset managers don’t already understand this. Many do. But good data analytics can ensure that this happens with consistency across the entire network. This is an idea that I have blogged about before. It doesn’t really require any extra investment in network infrastructure to realise benefits. This is about a conceptually smart use of data rather than smart devices.
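As an illustrative sketch only (every figure and name below is invented), the run-to-failure decision can be framed as expected failure consequence versus replacement cost:

```python
def asset_criticality(prob_failure, downstream_customers,
                      cost_per_customer, replace_cost):
    """Expected failure consequence relative to the cost of replacement.
    Assets scoring well below 1 may be candidates to run to failure;
    assets scoring well above 1 justify early replacement."""
    expected_loss = prob_failure * downstream_customers * cost_per_customer
    return expected_loss / replace_cost

# a clamp feeding a handful of customers vs. one at a critical junction
minor = asset_criticality(0.01, 50, 40.0, 200.0)
critical = asset_criticality(0.01, 20_000, 40.0, 200.0)
```

The point is not the formula, which any asset manager could refine, but that a score like this can be computed consistently for every one of millions of assets rather than held informally in a few experienced heads.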

The era of big data may also be the era of big productivity gains and utilities still have time to get their houses in order in terms of developing analytics capability. But delaying this transition could easily see some utilities facing the challenges to the business model currently being faced by some in the media and retail industries. The transition from service providers to data manufacturers is one that will in time transform the industry. Don’t leave it too late to get on board.

Have We Seen the End of Peak Demand?

There has been a lot of comment in the media lately about how dodgy forecasts have impacted retail electricity bills. Is this really the case? Has peak demand peaked? Have we over-invested in peaking capacity? I don’t propose to come up with a definitive answer here, but by exploring forecasting methodologies I hope to show why such predictions are so hard to do. In this post I am going to show that a pretty good model can be developed using free software and a couple of sources of publicly available data (ABS, BOM) on a wet Melbourne Saturday afternoon. To cheer me up I am going to use Queensland electricity data from AEMO and concentrate on summer peak demand. I am then going to apply this technique to data only up to summer 2009 and compare that to the recently downward-revised AEMO forecast.

But first let’s start with a common misconception. The mistake many commentators make is confusing the economics of electricity demand with the engineering of the network for peak capacity. Increasing consumption of electricity will impact wholesale prices of electricity. To a lesser extent it will also affect retail prices as retailers endeavour to pass on costs to consumers. The main driver of increased retail electricity prices, however, is network costs; specifically the cost of maintaining enough network capacity for peak demand periods.

Let’s start by looking at some AEMO data. The following chart shows total electricity consumption by month for Queensland from 2000 to 2011.

Queensland Energy Consumption

We can see from this chart that total consumption started to fall from around 2010. Interestingly, though, we have seen the peakiness increase from about 2004, with summers showing much greater electricity usage than non-peak seasons.

If we overlay this with peak demand then we see some interesting trends.

Consumption versus Demand

What we see from 2006 onwards is an increasing separation between peak demand and total consumption. There are a couple of factors underlying this decoupling. One is increased energy efficiency of homes, driven by energy-efficient building standards and other schemes such as the home insulation scheme. The other is the rapid uptake of solar power. Generous feed-in tariffs have encouraged a widespread uptake of solar panels, which has decreased the amount of energy consumed from the grid except at peak times. A solar panel will reduce electricity consumption during the day, but during warm summer evenings when the sun has disappeared air conditioners will run heavily on network electricity. The implication of the decoupling of peak demand from total consumption is that we either have to pay more for our electricity to maintain the same standard of supply, or accept lower reliability of supply, especially at the times when we most need it: very hot and very cold days.

When we overlay temperature on peak demand we see generally summer peaking, which is typical for Queensland. We also see that maximum temperatures were higher earlier in the decade and then generally cooler in the last three years. It is important to remember that what we are seeing is a longer wave of variability, not a trend. This is often understood but not properly accounted for in forecasting temperature-variant behaviour.

Demand versus Temperature

The above chart does not use maximum monthly temperature but the average maximum of the hottest four days of each month. Those who have studied electricity usage behaviour know that the highest peak often occurs after a run of hot days. By averaging the hottest few days of each month we get a measure that captures both the peak temperature and the temperature run. It is not necessary for this purpose to explicitly calculate consecutive days because temperature is not randomly distributed: hot days tend to cluster anyway. Another way to capture this is to count the number of days above a given temperature. Both types of variable can perform well in models such as these.
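Both variables are easy to construct. Here is an illustrative Python version with made-up daily maxima:

```python
def hottest_n_mean(daily_max_temps, n=4):
    """Average of the n hottest daily maxima in a month: captures both the
    peak temperature and the 'run of hot days' effect in one variable."""
    return sum(sorted(daily_max_temps, reverse=True)[:n]) / n

january = [31, 35, 38, 40, 29, 33, 41, 36, 30, 28]  # invented daily maxima
feature = hottest_n_mean(january)             # mean of 41, 40, 38, 36
days_above_35 = sum(t > 35 for t in january)  # the alternative variable
```

Either feature would then sit alongside the other predictors in the monthly model dataset.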

We can see from this chart that peak demand continues to rise despite variability caused by temperature. The next step then is to add variables that describe the increase in peak. In my experience population usually performs best, but in this case I’ll also test a couple of economic time series measures from the ABS National Accounts.

I also create a dummy variable to flag June, July and August as winter months. My final dataset looks like this:

Data snapshot

Preparation of data is the most important element of analytics. It is often difficult, messy and time-consuming work, but something that many of those new to analytics skip over.

In this exercise I created dummy variables and eventually discarded all except a flag indicating whether a particular month is a winter month, as per the data shown above. This will allow the model to treat minimum temperature differently during cold months.
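A minimal sketch of the winter dummy (assuming, for illustration, that the month is stored as a number in each row):

```python
def add_winter_flag(rows):
    """Add a 0/1 dummy marking June-August (southern hemisphere winter) so
    the model can treat minimum temperature differently in cold months."""
    for row in rows:
        row["is_winter"] = 1 if row["month"] in (6, 7, 8) else 0
    return rows

flagged = add_winter_flag([{"month": 1}, {"month": 7}, {"month": 12}])
```

Categorical flags like this are what let a linear model bend its behaviour across seasons without needing a separate model per season.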

Another common mistake is to assume that extremes such as peak demand can only be modelled on the extreme observations. In this case I look at peak demand in all months in order to fit the summer peaks, rather than just modelling the peaks themselves. This is because there is important information in how consumer demand varies between peak and non-peak months. This way the model is not just a forecast but a high-level snapshot of population response to temperature stimulus. Extreme behaviour is defined by the variance from average behaviour.

My tool of choice is the GLM (Generalised Linear Model), which gives me a chance to experiment with both categorical variables (e.g. is it winter? Yes/No) and various distributions of peak demand (e.g. normal or gamma), and with whether I want to fit a linear or logarithmic line to the data.
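As a toy illustration of the simplest case: a Gaussian GLM with an identity link reduces to ordinary least squares, which can be hand-rolled in a few lines. The numbers below are invented; real work would use R’s glm() or an equivalent statistics library with the full set of predictors:

```python
def fit_ols(xs, ys):
    """Least-squares fit of y = a + b*x; a Gaussian GLM with identity link
    is exactly this in the single-predictor case."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# invented: peak demand (MW) rising with the hottest-days temperature feature
temps = [30.0, 33.0, 36.0, 39.0, 42.0]
peaks = [7000.0, 7600.0, 8200.0, 8800.0, 9400.0]
a, b = fit_ols(temps, peaks)  # slope b in MW per degree
```

Swapping the distribution (gamma) or link (log) changes the fitting machinery but not the basic idea: each predictor gets a coefficient whose significance can be tested.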

After a good deal of experimentation I end up with a very simple model which exhibits good fit, with each of the predictor variables significant at greater than the 95% level. For the stats-minded, here is the output:

GLM Output

You will notice that I have just four variables from two data sources left in my model. Economic measures did not make it to the final model. I suspect that population growth acts as a proxy for macroeconomic growth over time both in terms of number of consumers and available labour supporting economic output.

Another approach borrowed from data mining that is not always used in forecasting is to hold out a random test sample of data which the model is not trained on, and to validate the model against it in terms of goodness-of-fit statistics. The following shows the R-squared fit against both the data used to train the model and the held-out validation dataset.

Model Fit - Training Data

Model Fit - Test Data

We can be confident on the basis of this that our model explains about 80% of the variance in peak demand over the last decade (with, I suspect, the balance being explained by a combination of solar PV, household energy efficiency programs, industrial use and “stochastic systems”: complex interactive effects that cannot be modelled in this way).
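The holdout approach itself is straightforward to sketch (illustrative Python on toy data; the real model was of course trained on the AEMO series, not on integers):

```python
import random

def train_test_split(rows, test_frac=0.25, seed=1):
    """Hold out a random sample the model never trains on, then judge
    goodness of fit on that holdout rather than on the training data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def r_squared(actual, predicted):
    """Share of variance in the actuals explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

train_rows, test_rows = train_test_split(list(range(100)))
```

A model whose R-squared holds up on the unseen sample is far less likely to be overfitted to the quirks of the decade it was trained on.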

Another way to look at this is to visually compare the predicted peak demand against actual peak demand as done in the following graph.

GLM Model - Predicted versus Actual

We can see from this chart that the model tends to overestimate demand in the earlier part of the period and underestimate it at the end. I am not too concerned about that, however, as I am trying to fit an average over the period so that I can extrapolate an extreme. I will show that this has only a small impact on the short-term forecast. This time series does have one particularly big disruption: the increased penetration of air conditioning. We know that the earlier part of the period includes relatively low air conditioner penetration (and we have now most likely reached maximum penetration). Counteracting this is the fact that the later period includes households with greater energy efficiency. These effects counteract each other. As with weather, you can remove variability if you take a long enough view.

Let’s see what happens if we take temperature up to a 10 POE level and forecast out three years to November 2014. That is, what happens if we feed 1-in-10-year temperatures into the model? I emphasise that this is 10 POE temperature, not 10 POE demand.

GLM - 10 POE Temperature Prediction

We see from this chart that actual demand exceeded our theorised demand three times (2005, 2007 and 2010) out of 12 years. Three years out of twelve can be considered as 25 POE; in other words, the peak exceeds the theorised peak 25% of the time over a twelve-year period.
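The empirical exceedance calculation is just a ratio; a sketch with invented demand figures:

```python
def empirical_poe(actual, theorised):
    """Share of periods (as a percentage) where actual demand exceeded the
    theorised level: 3 exceedances in 12 years -> 25 POE."""
    exceed = sum(a > t for a, t in zip(actual, theorised))
    return 100.0 * exceed / len(actual)

# invented summer peaks (MW) against a flat theorised level, 3 of 12 exceed
actual = [8000, 8600, 8300, 8900, 8500, 8800,
          9100, 8400, 8700, 9000, 8200, 8600]
theorised = [8850] * 12
poe = empirical_poe(actual, theorised)  # 25.0
```

Comparing this observed exceedance rate against the intended POE level of the input temperatures is a quick sanity check on the whole chain.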

2010 appears to be an outlier, as overall the summer was quite mild. There was, however, a spike of very warm weather in South East Queensland in January which drove a peak not well predicted by my model. The month also recorded very cool temperatures, which caused my model to drag down peak demand. This is consistent with the concept of probability of exceedance: there will be observed occurrences that exceed the model.

The final test of my model will be to compare back to the AEMO model. My model predicts a 2013/14 summer peak of 2309 MW at 25 POE. The 50 POE summer peak forecast for 2013/14 under the Medium scenario for AEMO is 9262 MW, and 9568 MW at 10 POE. If we approximate a 25 POE for AEMO as the midpoint between the two then we get 9415 MW. That means we get pretty close using just population and temperature, some free data and software, and a little bit of knowledge (which we know is a dangerous thing).

GLM Fit to AEMO Model
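The midpoint approximation used above is simple arithmetic on the quoted AEMO figures (a crude linear interpolation between the two published POE levels):

```python
poe50, poe10 = 9262, 9568           # AEMO Medium 2013/14 forecasts (MW)
poe25_approx = (poe50 + poe10) / 2  # crude midpoint interpolation
print(poe25_approx)                 # prints 9415.0
```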

This forecast is a significant downward revision on previous expectations, which has in part led to the accusations of dodgy forecasting and “gold plating” of the network. So what happens if I apply my technique again, but this time only on data up until February 2009? That was the last time we saw a really hot spell in South East Queensland. If new data has caused forecasts to be lowered, then going back this far should lead to a model that exceeds the current AEMO forecast. The purple line in the graph below is the result of this new model compared to the actuals, the first model and AEMO:

GLM Modelled Pre-2010

What we see here is much better fitting through the earlier period, some significant under-fitting of the hot summers of 2004 and 2005, but an almost identical result to the original GLM model in forecasting through 2012, 2013 and 2014. And it is still within the bounds of the AEMO 10 and 50 POE forecasts. Hindsight is always 20/20 vision, but there is at least prima facie evidence to say that the current AEMO forecast appears to be on the money and previous forecasts were overcooked. It will be interesting to see what happens over the next few years. We should expect peak demand to exceed the 50 POE line once every 2 years on average, and the 10 POE line once every 10 years.

We have not seen the end of peak demand. The question is how far we are willing to trade off reliability in our electricity network to reduce the cost of accommodating peak demand. The other question: all-of-system peak demand forecasting is all very well, but where will the demand happen on the network, will it be concentrated in certain areas, and what are the risks to industry and consumers of lower reliability in these areas? I’ll tackle this question in my next post.

Is there really such a thing as a “data scientist”?

A curious job description has started to appear in recent times: that of “data scientist”. In this post I am going to see if such a creature really exists or whether it is a chimera designed to make the rather dense and impenetrable subject of analytics even more dense and impenetrable to the outside observer.

Firstly, we need to whip up a quick epistemology of analytics (apologies to T.S. Eliot):

Here we see all knowledge and wisdom has data at its core; not raw data but data that is transformed into information. Information is derived from data in many different ways: from the humble report produced from a desktop database or spreadsheet, to the most sublime and parsimonious predictive model lovingly crafted by the aforementioned alleged data scientist.

If we gather enough of this information and discover (or create) the interconnections between discrete pieces of information then we create knowledge. If we gather enough knowledge then sooner or later we may start to question why we know what we know: this arguably is wisdom. Hopefully, from a large enough amount of data we may in time extract a very small amount of wisdom.

The logic also flows the other way: wisdom tells us if we are acquiring the right knowledge; knowledge gaps lead to the need for more information; and information needs drive further gathering and interrogation of data.

So where does science sit in all of this? I am not going to discuss wisdom in detail here – that belongs to philosophy and theology, although some physicists may disagree. Science is dedicated to the creation of knowledge from information (knowledge that is derived through deductions or observations). The “data scientist”, on the other hand, specialises in deriving information from data, which I argue is not a science at all. It is certainly a critically important function, and one that is becoming central to all organisations in one form or another, but it is not a science.

Invention, as the saying goes, is 1% inspiration and 99% perspiration. Generating information from data is the 99% perspiration part. The most skilled statistician cannot create useful models without the right data in the right shape to answer the right questions. Understanding what shape the data ought to be in requires the transfer of knowledge through information to the data level.

To turn the question around, if the data scientist truly exists then what are scientists of other disciplines? Arguably, all scientists are data scientists as all hypothesis-driven science relies on creating the nexus between data, information and knowledge. The term “data scientist” is therefore a tautology.

So if the data “scientist” is not a scientist, then what is he or she? If it’s big data we are dealing with, then the discipline is more akin to engineering. For smaller datasets it may be more of a design or architectural function. These are all critical functions in analytics, but they are not science.

More importantly, every knowledge worker is increasingly becoming their own data scientist. It is no longer acceptable that analytics remains a function separate from the other critical functions of an organisation, because it is knowledge and experience that help us gain insight from data; knowledge does not sit a priori within data waiting to be discovered. The questions we ask of data are the most important things in transforming information into knowledge.

From CRM to ARM: what utilities can learn from banks about maximising value

Last week in Brisbane a small metal clamp holding an overhead electric cable failed, causing a meltdown on the Queensland Rail network and leading to the government compensating commuters with a free day of travel. I expect that there are tens or hundreds of thousands of these clamps across the network, and in all likelihood they are all treated in more or less the same way and assigned the same value.

There are interesting parallels between the current transformation of utilities to the smart grid and what happened in banks with regard to customer analytics at the turn of the millennium. Can we use over a decade of banking industry experience in customer relationship management (CRM) to move towards a principle of asset “relationship” management (ARM)?

When I became involved in my first large CRM project over ten years ago, CRM was at that point only concerned with the “kit” – the software and hardware that made up the operational aspects of CRM – and not with the ecology of customer data where the real value of CRM lay. To give just one example: we built a system for delivering SMS reminders which was very popular with customers, but when we went to understand why it was so successful we realised that we had not recorded the contact in a way that was easy to retrieve and analyse. If we had designed CRM from the point of view of an ecology of customer data then we would have been able to leverage insight from the SMS reminder initiative faster and at lower cost.

Once we understood this design principle we were able to start delivering real return on investment in CRM, including developing a data construct of the customer which spanned the CRM touch points, point of sale, transactional data systems and data which resided outside the internal systems, including public data and data supplied by third-party providers. We also embarked on standardising processes for data capture, developing common logical data definitions across multiple systems, and then developing an analytical data environment. The real CRM came into being once we had developed this whole data ecology of the customer, which enabled a sophisticated understanding of customer lifetime value and the capacity to build a range of models that predict customer behaviour and provide platforms for executing on our customer strategy.

The term “relationship” has some anthropomorphic connotations, and it may seem crazy to apply this thinking to network assets. From a customer strategy perspective, however, it has a purely logical application: how can we capture customer interactions to maximise customer lifetime value, increase retention and reduce the costs of acquiring new customers?

If we look at customer value drivers we see some parallels with capital expenditure and asset management. Cost to acquire is roughly synonymous with asset purchase price. Lifetime value applies to both a customer and an asset. Cost to serve for a customer parallels the cost to maintain an asset. Customer retention is equivalent to asset reliability. The difference with advanced analytical CRM is that these drivers are calculated not as averages across customer classes but for every single customer.

The development of smart devices and the associated data environments necessary to support smart grid now enables utilities to look at a similar approach. Why can we not develop an analytical environment in which we capture attributes for, say, 30 million assets across a network so that we can identify risks to network operation before they happen?

If we could assign an expected life, and therefore a predicted probability of failure, to the metal clamp between Milton and Roma Street stations; assign a value-to-network based on the downstream consequences of failure; and balance this with a cost to maintain or replace, then we would be applying the same lessons that banks have learnt from CRM and customer lifetime value.