The Hitchhiker’s Guide to Big Data

Thanks to Douglas Adams

Data these days is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is… The best piece of advice I can give is: Don’t Panic.

Because I am an all-round hoopy frood, I will provide you, the reader, with a few simple concepts for navigating this vast topic. So let’s start with definitions. What is big data? The answer is: it depends on who you talk to. Software vendors say it is something that you need to analyse (interestingly, usually with the same tools used to analyse “small” data). Hardware vendors will talk about how much storage or processing power is needed. Consultants focus on the commercial opportunities of big data, and the list goes on…

Big data is mostly just good ol’ fashioned data. The real revolution is not the size of the data but the cultural and social upheaval that is generating new and different types of data. Think social networking, mobile technology, smart grids, satellite data, the Internet of Things (IoT). The behavioural contrails we all leave behind us every minute of every day. The real revolution is not in the data itself but in the meaning contained within it.

The towel, according to the Hitchhiker’s Guide to the Galaxy, “is about the most massively useful thing an interstellar hitchhiker can have”. The most important tool the big data hitchhiker can have is a hypothesis, and for commercial users of data, a business case to accompany that hypothesis. To understand why this is so important, let’s look at the three characteristics of data: volume, velocity and variety. In simplified terms, think of spreadsheets:

Volume

As Tom Davenport has reportedly said, ‘Big data usually involves small analytics’. Size does matter insomuch as it becomes really hard to process and manipulate large amounts of data. If your problem requires complex analysis then you are better off limiting how much data you analyse. There are many standard ways to do this, such as random selection or aggregation. Because activities such as prediction or forecasting require generalisations to be drawn from the data, there often comes a point where models do not become substantially more accurate with the addition of more data.
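For instance, here is a minimal sketch of those two standard approaches, random selection and aggregation, using pandas; the file name and column names are hypothetical:

```python
import pandas as pd

# Hypothetical meter readings: one row per meter per half-hour interval
# (file name and column names are assumptions for illustration)
readings = pd.read_csv("meter_readings.csv")

# Option 1: random selection -- analyse a 5% sample rather than every row
sample = readings.sample(frac=0.05, random_state=42)

# Option 2: aggregation -- collapse the detail to one summary row per meter
per_meter = readings.groupby("meter_id")["kwh"].agg(["mean", "sum", "count"])
```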

On the other hand, relatively crude algorithms can become quite accurate with large amounts of data to process and here I am thinking in particular of Bayesian learning algorithms.
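As a rough illustration of that point (a sketch using scikit-learn on synthetic data, not any particular real data set), training the same simple naive Bayes classifier on progressively larger slices of the data usually shows accuracy climbing with volume:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a large behavioural data set
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the same crude Bayesian classifier on increasingly large slices of the data
for n in (100, 1_000, 10_000, len(X_train)):
    model = GaussianNB().fit(X_train[:n], y_train[:n])
    print(n, round(accuracy_score(y_test, model.predict(X_test)), 3))
```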

How much data is the right amount depends on what you are trying to achieve. Do not assume that you need all of the data all of the time.

The question to ask when considering how much is: “How do I intend to analyse the data?”

Velocity

If there is any truth in the hype about big data, it is the speed with which we can get access to it. In some ways I think it should be called “fast data” rather than “big data”. Again, the speed with which data can be collected from devices can be astounding. Smart meters can effectively report in real time, but does the data need to be processed in real time? Or can it be stored and analysed at a later time? And how granular does the chronology have to be?

As a rule, I use the timeframes necessary for my operational outcome to dictate what chronological detail I need for my analysis. For example, real-time voltage correction might need one-minute data (or less), but monitoring the accuracy of a rolling annual forecast might only need monthly or weekly data.
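A sketch of that rule in practice, assuming pandas and hypothetical file and column names: the same one-minute readings can be kept at full detail for real-time-style work, or resampled to the granularity the outcome actually needs.

```python
import pandas as pd

# Hypothetical one-minute voltage readings indexed by timestamp
# (file name and column names are assumptions for illustration)
volts = pd.read_csv("voltage.csv", parse_dates=["timestamp"], index_col="timestamp")

# Keep the full one-minute detail for real-time-style correction work
minute_level = volts["voltage"]

# Roll the same data up to weekly and monthly averages for forecast monitoring
weekly = volts["voltage"].resample("W").mean()
monthly = volts["voltage"].resample("MS").mean()
```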

Operational considerations are always necessary when using analytics. This is where I distinguish between analytics and business intelligence. BI is about understanding; analytics is about doing.

The question to ask when considering how often is: “How do I intend to use the analysis?”

Variety

This is my favourite. There is a dazzling array of publicly available data as governments and other organisations begin to embrace open data. And there’s paid-for vendor data available as well. Then there is social media data; data that organisations collect themselves through sensors in smart (and not so smart) devices; traditional customer data; and data collected from market research or other types of opt-in, such as mobile phone apps. The list goes on.

This is where a good hypothesis is important. Data should be sourced, cleansed and transformed according to your intended outcome.

The other challenge with data variety is how to join it all together. In the old days we relied on exact matches via a unique id (such as an account number), but now we may need to be more inventive about how we compare data from different sources. There are many ways in which this can be done: it is one of the growing areas of dark arts in big data. On a conceptual level, I like to work with a probability of matching. I start with my source data set, which will have a unique identifier matched against the data I am most certain of. I then join in new data and create new variables that describe the certainty of match for each different data source. That gives me a credibility weighting for each new source of data that I introduce, and I use these weightings as part of the analysis. This allows all data to relate to each other; this is my Babel fish for big data. There are a few ways in which I actually apply this approach (and I am finding new ways all of the time).
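As a rough sketch of the concept (not my production method; the data, column names and similarity measure are all hypothetical stand-ins), a simple string-similarity score can play the role of the matching probability, and the resulting certainty column becomes the credibility weighting:

```python
import pandas as pd
from difflib import SequenceMatcher

# Source data with a trusted unique identifier (hypothetical records)
customers = pd.DataFrame({
    "account_id": [1, 2],
    "name": ["Arthur Dent", "Ford Prefect"],
})

# A new source with no shared key -- match on name similarity instead
social = pd.DataFrame({
    "handle": ["@adent", "@fordp"],
    "display_name": ["Arthur Dent", "Ford  Prefect "],
})

def match_certainty(a: str, b: str) -> float:
    """Crude stand-in for a matching probability: string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Cross-join, score every candidate pair, keep the best match per account
pairs = customers.merge(social, how="cross")
pairs["match_certainty"] = [
    match_certainty(a, b) for a, b in zip(pairs["name"], pairs["display_name"])
]
best = pairs.sort_values("match_certainty", ascending=False).drop_duplicates("account_id")

# The certainty column then acts as a credibility weighting in downstream analysis
print(best[["account_id", "handle", "match_certainty"]])
```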

The question to ask when considering which data sources is: “What is my hypothesis?”

So there you have it: a quick guide to big data. Not too many rules, just enough so you don’t get lost. And one final quote from Douglas Adams’ Hitchhiker’s Guide to the Galaxy:

Protect me from knowing what I don’t need to know. Protect me from even knowing that there are things to know that I don’t know. Protect me from knowing that I decided not to know about the things that I decided not to know about. Amen.