The Hitchhiker’s Guide to Big Data

Thanks to Douglas Adams

Data these days is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is… The best piece of advice I can give is:  Don’t Panic.

Because I am an all-round hoopy frood, I will provide you the reader with a few simple concepts for navigating this vast topic. So let’s start with definitions. What is big data? The answer is: it depends who you talk to. Software vendors say it is something that you need to analyse (interestingly, usually with the same tools used to analyse “small” data). Hardware vendors will talk about how much storage or processing power is needed. Consultants tend to focus on the commercial opportunities of big data, and the list goes on…

Big data is mostly just good ol’ fashioned data. The real revolution is not the size of the data but the cultural and social upheaval that is generating new and different types of data. Think social networking, mobile technology, smart grids, satellite data, TIOT (The Internet Of Things): the behavioural contrails we all leave behind us every minute of every day. The real revolution is not in the data itself but in the meaning contained within the data.

The towel, according to the Hitchhiker’s Guide to the Galaxy, “is about the most massively useful thing an interstellar hitchhiker can have”. The most important tool the big data hitchhiker can have is a hypothesis and, for commercial users of data, a business case to accompany that hypothesis. To understand why this is so important let’s look at the three characteristics of data: volume, velocity and variety. In simplified terms, think of spreadsheets:

Volume

As Tom Davenport has reportedly said, ‘Big data usually involves small analytics’. Size does matter insomuch as it becomes really hard to process and manipulate large amounts of data. If your problem requires complex analysis then you are better off limiting how much data you analyse. There are many standard ways to do this, such as random selection or aggregation. Because activities such as prediction or forecasting require generalisations to be drawn from data, there often comes a point where models do not become substantially more accurate with the addition of more data.
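The two standard approaches mentioned above, random selection and aggregation, can be sketched in a few lines. This is only an illustration using Python's standard library: the synthetic "meter readings", the 1% sample size and the block size are all assumptions made up for the example.

```python
import random
import statistics

# Hypothetical data: 200,000 synthetic "meter readings" (kWh), generated
# here only so the example is self-contained.
rng = random.Random(42)
readings = [rng.gauss(10.0, 2.0) for _ in range(200_000)]

# Random selection: analyse a 1% simple random sample instead of every row.
sample = rng.sample(readings, k=2_000)

full_mean = statistics.fmean(readings)
sample_mean = statistics.fmean(sample)

# Aggregation: collapse the raw rows into averages over blocks of 1,000 rows.
block_size = 1_000
block_means = [
    statistics.fmean(readings[i:i + block_size])
    for i in range(0, len(readings), block_size)
]

print(f"full mean   : {full_mean:.3f}")
print(f"sample mean : {sample_mean:.3f}")
print(f"rows after aggregation: {len(block_means)}")
```

On data like this the sample mean lands very close to the full mean, which is the point: for many questions a small, well-chosen subset carries most of the information.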

On the other hand, relatively crude algorithms can become quite accurate with large amounts of data to process and here I am thinking in particular of Bayesian learning algorithms.

The decision of how much data is the right amount depends on what you are trying to achieve. Do not assume that you need all of the data all of the time.

The question to ask when considering how much is: “How do I intend to analyse the data?”

Velocity

If there is any truth in the hype about big data it is the speed with which we can get access to it. In some ways I think it should be called “fast data” rather than “big data”. The speed with which data can be collected from devices can be astounding. Smart meters can effectively report in real time, but does the data need to be processed in real time? Or can it be stored and analysed at a later time? And how granular does the chronology have to be?

As a rule I use the timeframes necessary for my operational outcome to dictate what chronological detail I need for my analysis. For example, real time voltage correction might need one minute data (or less), but monitoring the accuracy of a rolling annual forecast might only need monthly or weekly data.
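As a rough sketch of matching chronological detail to the outcome, the snippet below keeps one-minute readings for operational use and averages the same readings into coarser buckets for monitoring. The voltage values are synthetic and the `downsample` helper is a hypothetical name; hourly and daily buckets are used to keep the example small, and the same pattern would extend to weekly or monthly buckets.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

# Hypothetical one-minute voltage readings for a single day (synthetic values).
start = datetime(2024, 1, 1)
minute_data = [
    (start + timedelta(minutes=m), 230.0 + (m % 60) * 0.1)
    for m in range(24 * 60)
]

def downsample(rows, bucket_of):
    """Average the readings that fall into the same time bucket."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        buckets[bucket_of(timestamp)].append(value)
    return {bucket: fmean(values) for bucket, values in sorted(buckets.items())}

# Operational use keeps the one-minute detail (1,440 rows per day).
# Monitoring use can often work from much coarser averages.
hourly = downsample(
    minute_data, lambda t: t.replace(minute=0, second=0, microsecond=0)
)
daily = downsample(minute_data, lambda t: t.date())

print(len(minute_data), len(hourly), len(daily))  # prints: 1440 24 1
```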

Operational considerations are always necessary when using analytics. This is where I distinguish between analytics and business intelligence. BI is about understanding; analytics is about doing.

The question to ask when considering how often is: “How do I intend to use the analysis?”

Variety

This is my favourite. There is a dazzling array of publicly available data as governments and other organisations begin to embrace open data. And there’s paid-for vendor data available as well. Then there is social media data, data that organisations collect themselves through sensors in smart (and not so smart) devices, traditional customer data, and data collected from market research or other types of opt-in such as mobile phone apps. The list goes on.

This is where a good hypothesis is important. Data should be sourced, cleansed and transformed according to your intended outcome.

The other challenge with data variety is how to join it all together. In the old days we relied on exact matches via a unique id (such as an account number), but now we may need to be more inventive about how we compare data from different sources. There are many ways in which this can be done: this is one of the growing areas of dark arts in big data. On the conceptual level, I like to work with a probability of matching. I start with my source data set, which will have a unique identifier matched against the data I am most certain of. I then join in new data and create new variables that describe the certainty of match for each different data source. I then have a credibility weighting for each new source of data that I introduce, and I use these weightings as part of the analysis. This allows all data to relate to each other; this is my Babel fish for big data. There are a few ways in which I actually apply this approach (and I am finding new ways all of the time).
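One possible way to picture the probability-of-matching idea is sketched below. It is not the specific method described above, just a minimal illustration: the records, the `match_certainty` scoring rule (name similarity from Python's `difflib`, halved when postcodes differ) and the `solar_interest` variable are all assumptions invented for the example.

```python
from difflib import SequenceMatcher

# Trusted source: records keyed by a unique identifier (hypothetical data).
customers = {
    "A001": {"name": "Arthur Dent", "postcode": "3000"},
    "A002": {"name": "Ford Prefect", "postcode": "3121"},
}

# New source with no shared identifier (also hypothetical).
survey_rows = [
    {"name": "Arthur Dent", "postcode": "3000", "solar_interest": 1.0},
    {"name": "F. Prefect", "postcode": "3121", "solar_interest": 0.0},
    {"name": "Zaphod Beeblebrox", "postcode": "2000", "solar_interest": 1.0},
]

def match_certainty(record, candidate):
    """Score in [0, 1]: name similarity, halved if the postcodes differ."""
    name_score = SequenceMatcher(
        None, record["name"].lower(), candidate["name"].lower()
    ).ratio()
    postcode_factor = 1.0 if record["postcode"] == candidate["postcode"] else 0.5
    return name_score * postcode_factor

def best_match(candidate, records):
    """Return (identifier, certainty) of the most similar trusted record."""
    scored = ((key, match_certainty(rec, candidate)) for key, rec in records.items())
    return max(scored, key=lambda pair: pair[1])

# Join the new source, keeping the certainty of each match as a new variable.
joined = []
for row in survey_rows:
    customer_id, certainty = best_match(row, customers)
    joined.append({"customer_id": customer_id, "match_certainty": certainty, **row})

# A simple credibility weighting for the whole source: its average certainty.
total_weight = sum(r["match_certainty"] for r in joined)
source_credibility = total_weight / len(joined)

# Use the certainties as weights in the analysis, here a weighted average.
weighted_interest = sum(
    r["match_certainty"] * r["solar_interest"] for r in joined
) / total_weight

for r in joined:
    print(r["name"], "->", r["customer_id"], round(r["match_certainty"], 2))
print("source credibility:", round(source_credibility, 2))
print("weighted solar interest:", round(weighted_interest, 2))
```

The poorly matched row still takes part in the analysis, but its low certainty means it contributes far less than the exact match.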

The question to ask when considering which data sources is: “What is my hypothesis?”

So there you have it: a quick guide to big data. Not too many rules, just enough so you don’t get lost. And one final quote from Douglas Adams’ Hitchhiker’s Guide to the Galaxy:

Protect me from knowing what I don’t need to know. Protect me from even knowing that there are things to know that I don’t know. Protect me from knowing that I decided not to know about the things that I decided not to know about. Amen.


Productivity and Big Bang Theory

Productivity has been falling in Australia for some time. In the mining, utility and manufacturing sectors we have seen a remarkable fall in productivity over the last decade. Some of this has been caused by rising labour costs, but in mining and utilities in particular, capital expenditure on infrastructure has been a major contributor. So how will new technology and the era of “big data” transform the way these sectors derive return on capital investment?


According to the ABS this may have been driven in part by rapid development of otherwise unprofitable mines to production in an environment of once-in-a-lifetime high commodity prices. From a labour perspective, this has also driven up wages in the mining sector, which has knock-on effects for utilities.

Meanwhile for the last decade utilities have been dealing with a nexus of chronic under-investment in some networks, our insatiable appetite for air conditioning in hot summers and a period of growth in new housing with poor energy efficiency design in outlying urban areas which are subject to greater temperature extremes. The capital expenditure required to keep pace with this forecast peak demand growth has been a major negative in terms of productivity.

In this post I am going to consider how analytics can find increased productivity in the utilities sector (although there should be parallels for the mining sector), specifically through optimisation of capital expenditure. I’ll discuss labour productivity in a future post.

Deloitte has recently released its report into digital disruption: Short Fuse, Big Bang. According to this report, the utility sector is one which is going to be transformed by technological change, albeit more slowly than other sectors. Having said that, electricity utilities and retailers are going to be the first to experience disruptions to their business models, before water and gas. This is being driven by the fact that electricity businesses are at the forefront of privatisation among utilities and by the politicisation of electricity pricing. Internationally, energy security concerns (which in turn have seen the rise of renewables, energy conservation and electric vehicle development, for example) have also driven technological change faster for electricity utilities.

At face value the concept of the smart grid just looks like the continuation of big ticket capital investment and therefore a decline in productivity. Is there, however, a way to embrace the smart grid which actually increases productivity?

Using good design principles and data analytics, I believe the answer is yes. Here are three quick examples.

Demand Management

The obvious one is time of use pricing of electricity, which I have written about on this blog several times already. The problem with this from a savings point of view is that the payoff between reduced peak demand and savings in capital expenditure is quite lagged, and without effective feedback between demand management and peak demand forecasting it may just result in overinvestment in network expansion and renewal. In fact I believe that we have already seen this occur, as evidenced by the AEMO’s revision of peak demand growth. When peak demand was growing most rapidly through the mid 1990s, demand management programs were proliferating, as were revisions to housing energy efficiency standards. It should have been no surprise that this would have an effect on energy usage, but quite clearly it has come as a surprise to some.

Interval meters (which are also commonly referred to as “smart” meters) are required to deliver time of use pricing, and some parts of the NEM are further down the track than others in rolling these out, so this solution still requires further capital investment. In my recent experience time of use pricing appears to be the most effective and fairest means of reducing peak demand. Meter costs can be contained, however, as “smart meter” costs continue to fall. A big cost in the Victorian rollout of smart meters has been not just the meters themselves but the communications and IT infrastructure to support the metering system. An opt-in rollout will lead to slower realisation of the benefits of time of use pricing in curbing peak demand but will allow a deferral of the infrastructure capital costs. Such an incremental rollout will allow assessment of options such as communication-enabled “smart meters” versus manually read interval meters (MRIMs). MRIMs are meters which capture half hour usage data but do not upload it via a communications network. They still require a meter reader to visit the meter and physically download the data. These meters are cheaper, but labour costs for meter reading need to be factored in. There are other advantages to communications-enabled meters in that data can be relayed in real time to the distributor to allow other savings spin-offs in network management. It also makes it possible for consumers to monitor their own energy usage in real time and therefore increases the effectiveness of demand pricing through immediate feedback to the consumer.
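To make the link between half hour interval data and time of use pricing concrete, here is a minimal sketch. The tariff rates, the 3pm to 9pm weekday peak window and the synthetic usage profile are all placeholder assumptions; real tariffs and peak windows vary by retailer and distributor.

```python
from datetime import datetime, timedelta

# Hypothetical tariff in dollars per kWh, and an assumed weekday peak window.
PEAK_RATE = 0.45
OFF_PEAK_RATE = 0.20
PEAK_START_HOUR, PEAK_END_HOUR = 15, 21  # 3pm to 9pm

def is_peak(timestamp):
    """True for weekday intervals that start inside the assumed peak window."""
    return (
        timestamp.weekday() < 5
        and PEAK_START_HOUR <= timestamp.hour < PEAK_END_HOUR
    )

def time_of_use_cost(interval_reads):
    """Sum the cost of (interval start, kWh) half hour reads."""
    return sum(
        kwh * (PEAK_RATE if is_peak(ts) else OFF_PEAK_RATE)
        for ts, kwh in interval_reads
    )

def synthetic_day(heavy_start):
    """48 half hour reads; six heavy intervals begin at index heavy_start."""
    day = datetime(2024, 1, 3)  # a Wednesday
    return [
        (day + timedelta(minutes=30 * i),
         1.0 if heavy_start <= i < heavy_start + 6 else 0.25)
        for i in range(48)
    ]

evening_peak = synthetic_day(34)  # heavy use from 5pm to 8pm
shifted = synthetic_day(42)       # same energy, moved to 9pm to midnight

print(f"evening peak: ${time_of_use_cost(evening_peak):.2f}")
print(f"shifted load: ${time_of_use_cost(shifted):.2f}")
```

Both days use the same total energy; only the timing differs. That difference in cost is the price signal a consumer would see, and it can only be calculated from interval data.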

Power Management

From real time voltage management to reduce line loss, to neural net algorithms to improve whole of network load balancing, there are many exciting solutions that will reduce operating costs over time. Unfortunately, this will require continued capital investment in networks that do not have real time data-reporting capabilities and there is little appetite for this at the moment. Where a smart grid has already been rolled out, these options need to be developed. Graeme McClure at SP Ausnet is doing some interesting work in this field.

Asset Lifetime

This idea revolves around a better understanding of the true value of each asset on the network. Even the most advanced asset management systems in Australian distributors at the moment tend to treat all assets of a particular type as being of equal value, rather than having a systematic way of quantifying their value based on where they are within the network. Assets generally have some type of calculated lifetime and get replaced before they expire. But what if some assets could be allowed to run to failure with little or no impact on the network? It’s not that talented asset managers don’t already understand this. Many do. But good data analytics can ensure that this happens with consistency across the entire network. This is an idea that I have blogged about before. It doesn’t really require any extra investment in network infrastructure to realise benefits. This is more about a conceptually smart use of data rather than smart devices.
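As a purely illustrative sketch of valuing assets by network position rather than by type, the snippet below scores three otherwise identical transformers. The `Asset` fields, the outage cost per customer and the 10% tolerance rule are all hypothetical; a real model would draw on network topology and reliability data.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    asset_id: str
    asset_type: str
    customers_downstream: int  # customers who lose supply if it fails
    has_backup_path: bool      # can load be switched around the asset?
    replacement_cost: float    # planned replacement cost, dollars

# Hypothetical cost of an unplanned outage per affected customer (dollars).
OUTAGE_COST_PER_CUSTOMER = 50.0

def failure_consequence(asset):
    """Rough cost of an in-service failure, driven by network position."""
    affected = 0 if asset.has_backup_path else asset.customers_downstream
    return affected * OUTAGE_COST_PER_CUSTOMER

def can_run_to_failure(asset, tolerance=0.1):
    """Flag assets whose failure costs little relative to replacing them."""
    return failure_consequence(asset) <= tolerance * asset.replacement_cost

# Three assets of the same type and cost, in different network positions.
assets = [
    Asset("TX-001", "transformer", 1200, False, 40_000.0),
    Asset("TX-002", "transformer", 1200, True, 40_000.0),
    Asset("TX-003", "transformer", 15, False, 40_000.0),
]

for asset in assets:
    print(asset.asset_id, failure_consequence(asset), can_run_to_failure(asset))
```

Under these made-up numbers only the first transformer justifies replacement ahead of failure; the other two are flagged as run-to-failure candidates even though all three are the same type of asset.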

The era of big data may also be the era of big productivity gains and utilities still have time to get their houses in order in terms of developing analytics capability. But delaying this transition could easily see some utilities facing the challenges to the business model currently being faced by some in the media and retail industries. The transition from service providers to data manufacturers is one that will in time transform the industry. Don’t leave it too late to get on board.

Analytics: Insource or Outsource?

For someone who makes their living from consulting on analytics, my answer to this question may surprise some. In a world increasingly dominated by data, the ability to leverage data is not only a source of competitive advantage, it is now a required competency for most businesses.

External consulting can help accelerate the journey to fully insourced analytics capability. The trick is how to do this in the most cost effective way. I have dealt with a number of companies that have very different approaches to this question, and it is my observation that the wrong mix of insourcing and outsourcing can be very expensive, perhaps in ways that you may find surprising. The key is understanding that analytics is not primarily a technology function.

To illustrate my point I am going to describe the analytics journey of three hypothetical companies. Our three companies are all challenger brands, second or third in their respective markets. Their businesses have always been reliant on data and smart people, but new technology and competitive pressures mean that data is becoming more and more important to their business models. All recognise the need to invest, but which is the right strategy?

The CIO of Company A has launched a major project to implement a new ERP system which will transform the way they will manage and access data right across the organisation. He is also establishing an analytics team by hiring a handful of statistics PhDs to extract maximum value from the new data platform. He is investing significantly with a major ERP platform vendor and is using consultants to advise him on implementation and help manage the vendor. He sees no need to spend additional money on analytics consultants because he has already hired plenty of smart people who can help him in the short term. He does however see value in hiring consultants to help his organisation with the large IT transformation.

In Company B, the COO is driving the analytics strategy. Privately, he doesn’t rate the CIO. He sees him as coming from a bygone era where IT is a support function to the core business and technical capability should be delivered from a technical centre of excellence. The CIO has built a team of senior managers who firmly believe that, to maintain efficient use of resources, business users should only have access to data through IT-approved or IT-built applications. The company has a very large and well organised data warehouse, but mostly it is accessed by other applications. There are very few human users of the data, and virtually none outside of IT. The IT staff mostly use the data warehouse for building applications and rely on a detailed specification process from internal “customers” to understand the content of the data.

To drive his strategy of developing organisational analytics capability, the COO is forced to either wait for lengthy testing of new applications and system access on an exception basis, or else outsource his analytics to service providers who can offer him greater flexibility and responsiveness. He secures funding for an asset management project to optimise spending on maintaining ageing infrastructure and secures the services of a data-hosting service. Separately, he hires consultants to build advanced asset failure predictive models based on the large volumes of data in his externally hosted data mart.

Company C has hired a new CIO who has a varied background in both technology and business-related positions. She has joined the company from a role as CEO of a technology company where she has had both technology and commercial experience. Her previous company frequently (but not always) used Agile development methodology. She too has been tasked with developing a data strategy in her new role. Company C is losing market share to competitors and the executive think this is because their two competitors have spent a large amount of money on IT infrastructure renewal and have effectively bought market share by doing so. Company C is not using their data effectively to price their products and develop product features to drive greater customer value, but they are constrained in the amount of money they can spend to renew their own data infrastructure. The parent company will not invest in large IT expenditure when margins and market share are falling. The CIO resists pressure from the executive and external vendors to implement a new cut price ERP system and instead focuses her team on building better relationships with business users, especially in the pricing and product teams. She develops a team of technology-savvy senior managers with functional expertise in pricing and product development, rather than IT managers. She delivers a strong and consistent message that their organisation’s goal is to compete on data and analytics. Every solution should be able to state how data and analytics are used.

As issues or manager-driven initiatives arise she funds small project teams comprising IT, business and some involvement of external consultants. She insists that her managers hire consultants to work on site as part of virtual teams with company staff. Typically consultants are only engaged for a few weeks at a time, but there may be a number of projects running simultaneously. Where infrastructure or organised data does not exist, teams are permitted to build their own “proof of concept” solutions which are supported by the teams themselves rather than IT. Because the ageing data warehouse struggles to cope with increased traffic, increasingly it is used as a data staging area with teams running their own purpose built databases.

So how might these strategies play out? Let’s look at our three companies 12 months later.

Company A has built a test environment for their ERP system fairly quickly. The consultants have worked well with the vendor to get a “vanilla” system up and running but the project is now running into delays due to integration with legacy systems and problems handling the increasing size of data. The CIO’s consultants are warning of significant blow outs in time and cost, but they are so far down the path now that pulling out is not an option. The only option is to keep requesting more funds. The blame game is starting, with the vendor blaming the consultants and the consultants blaming IT. Meanwhile the CIO’s PhD-qualified analytics team have little work to do as they wait many months for their data requests to be filled. The wait is due in part to the number of resources required to support the ERP project, which means that there are few staff available to support ad hoc requests. When the stats team gets data they build interesting and robust statistical models but struggle to understand their relevance to the business. One senior analyst has already left and others will most likely follow. I have seen this happen more times than I care to remember. Sadly, Company A is a pretty typical example.

Company B has successfully built their asset management system, which is best in class due to the specialised skills provided by the data hosting vendor and analytics consultants. It has not been cheap, but they will not spend as much as Company A eventually will to get their solution in place. The main issue is that no one in Company B really understands the solution, and more time and money will be required to bring the solution in house, with some expenditure still required by IT and the development of a support team. On the bright side, however, the CIO has been shown up as recalcitrant and the migration of the project in house will be a good first project for the incoming CIO when the current CIO retires in a few months. It will encourage IT to develop new IP and new ways of working with the business including sharing of data and system development environments.

Company C (as you may already have guessed) is the outstanding success. Within a few weeks they had their first analytics pricing solution in place. A few weeks after that, tests were showing both increased profitability and market share within the small test group of customers who were chosen to receive new pricing. The business case for second stage roll out was a no brainer and funding will be used to move the required part of the data warehouse into the cloud.

After 12 months a few of the projects did not produce great results and these were quietly dropped. Because these were small projects, costs were contained and, importantly, the team became better at picking winners over time. Small incremental losses were seen as part of the development process. A strategy of running a large number of concurrent projects was a strain at first for an IT group which was more accustomed to “big bang” projects, but the payoff was that risks were spread. While some projects failed, others succeeded. Budgets were easier to manage because this was delegated to individual project teams, and the types of cost blow outs experienced by Company A were avoided.

The salient lesson here is to look firstly at how your organisation structures its approach to data and analytics projects. Only then should you consider how to use and manage outsourced talent. The overarching goal should be to bring analytics in house because that’s really where it belongs.