- Analytics is not BI
Analytics is serviceably defined by Wikipedia, but the definition does not really do justice to the potential of a properly established analytics environment. To paraphrase Donald Rumsfeld: BI deals with “known knowns”, whereas analytics (at its most exciting) deals with “known unknowns” – that is, as data scientists, we know what we don’t know.
Let me illustrate.
It is important to measure new meter connections: where they are occurring and at what rate. This is a well-defined measure and can easily be translated into a regular report with well-defined metrics (e.g. how many, time from request to connection, breakdown by geography, meter type, etc.). This is BI.
If, however, we don’t know why connections are growing; or if they appear flat but we suspect that some classes are growing while others are shrinking, so that the net effect is flat growth; or if consumption is changing (as it is in many parts of the country) and may somehow be linked to new connections – that is analytics.
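The “flat net growth” case can be made concrete with a toy decomposition. A minimal sketch in pandas, where the column names and numbers are purely illustrative:

```python
import pandas as pd

# Hypothetical new-connection counts by customer class (illustrative only).
connections = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "class":   ["residential", "commercial"] * 2,
    "new_connections": [100, 100, 140, 60],
})

# The BI view: totals per quarter look flat.
totals = connections.groupby("quarter")["new_connections"].sum()

# The analytics question: which classes are actually moving underneath?
by_class = connections.pivot(index="quarter", columns="class",
                             values="new_connections")
growth = by_class.loc["Q2"] - by_class.loc["Q1"]
```

Here `totals` is identical in both quarters, while `growth` reveals one class rising and another falling by the same amount.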
- Analytics is a business function; not an IT function
Well, not always, but usually. When I started in this field about ten years ago, nobody really knew what to do with our team. We were originally part of a project team implementing a large scale CRM system. We were deeply technical when it came to understanding customer data, but we weren’t part of IT. We were a “reporting” team, but we were also data cube, database and web developers. There was no process for us to have access to a development environment outside of the IT department, so we built our own (we bought our own server from Harvey Norman and, when we had to move offices, we wheeled it across the road on an office chair). We built our own statistical models to allocate sales because the system had not been built to recognise sales when they appeared in the product system. And eventually we started using the data to build predictive models and customer segmentation.
To begin with, I think IT saw us as a threat to the safe running of “the system”, but over time we were accepted as a special case. To this day, for an organisation to be truly competitive in analytics it must recognise that the analytics teams embedded within the business are deeply technical and need to be treated as a special class of super user.
- Analytics is Agile
This is not a new idea. The best analytics outcomes are delivered by small cross-functional groups that cover data manipulation, data mining, machine learning and subject matter expertise. The groups are usually small because analytics development is investigative and generally not hypothesis driven (or, if there are hypotheses, there may be many competing ones that need to be tested), and the outputs of very complex analysis can often be disarmingly simple algorithms. It is not unusual for months of development to produce fewer than 20 lines of code.
- Analytics needs lots of data
Analytics thrives on lots of available data, but not all of it is used all of the time. When we are asked what data we require, the trite answer is “give us everything”. The reason is that results can be biased if models are built on only part of the data record. A competent analyst will always know what has been excluded and how the data may have been sampled or summarised. For example, we spend a lot of time deciding whether to treat missing data as null, as zero, or to replace it in some way, and the answer is different for every project. In the age of smart grid data, where datasets are very large (I have recently seen a ten terabyte table), an analytics environment should regularly build data samples for analytical use (properly randomised, of course, in consultation with analytics users).
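The missing-data decision and the sampling step can both be sketched in a few lines of pandas. The column names and values below are hypothetical, and the three treatments shown are options to choose between per project, not a recommendation:

```python
import pandas as pd
import numpy as np

# Hypothetical meter readings; names and values are illustrative only.
readings = pd.DataFrame({
    "meter_id": [1, 1, 2, 2, 3, 3],
    "kwh": [0.52, np.nan, 0.0, 1.10, np.nan, 0.87],
})

# Three common treatments of missing consumption, chosen per project:
as_null = readings.copy()                    # keep NaN: "no reading received"
as_zero = readings.fillna({"kwh": 0.0})      # treat as zero consumption
imputed = readings.copy()
imputed["kwh"] = imputed.groupby("meter_id")["kwh"].transform(
    lambda s: s.fillna(s.mean())             # impute with the meter's own mean
)

# For very large tables, work from a reproducible random sample.
sample = readings.sample(frac=0.5, random_state=42)
```

Each choice changes what downstream models will learn, which is why the analyst needs to know which one was applied and why.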
- Analytics takes care of the “T” in ETL
Analytics teams like to take control of the “Transform” part of the ETL process, especially when the transform step involves summarisation or some other change to the data. Because data mining processes can pick up very subtle signals in the data, small changes to the data can lead to bad models, and sometimes what is considered bad data is itself an effect of interest to the data miner. For example, “bad” SCADA readings need to be removed from the dataset in order to develop accurate forecasts, but the same bad data may be of interest when building asset failure models.
- Nothing grows in a sterile garden; don’t over-cleanse your analytics datasets
All raw data is dirty: records go missing, data is entered poorly, mysterious analogue readings get digitised. But as in the example above, dirty data can be a signal worth investigating. Also, because models can be sensitive to outliers, the data miner likes to have control over the definition of an outlier, which is often a relative measure. If outliers have already been removed, valid records can be discarded by further outlier removal processes. Of course, this needs to be balanced against the possibility that dirty data may lead to false conclusions, especially for less experienced analytics users. The right balance needs to be found, and that does not always mean that cleaner is better.
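“A relative measure” can be made concrete with an interquartile-range rule. A minimal sketch, assuming hypothetical consumption values and one analyst’s choice of a 3×IQR cut-off (neither is a fixed standard):

```python
import numpy as np

# Hypothetical daily consumption values (kWh); illustrative only.
values = np.array([9.8, 10.1, 10.4, 9.9, 10.2, 55.0])

# A relative outlier definition: distance from the quartiles of THIS dataset.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 3 * iqr) | (values > q3 + 3 * iqr)

# Flag outliers instead of deleting them: the raw record survives for
# anyone who needs a different definition later.
clean_for_model = values[~is_outlier]
```

Because the fences are computed from the data itself, removing outliers upstream shrinks the IQR and can make perfectly valid records look extreme on the next pass, which is the over-cleansing trap described above.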
- Be aware of who is doing analytics in your organisation
Because analytics is a technical function embedded in business units, and because it is agile by nature, it can be a very rapid way to develop value metrics for large system changes. It is good to know who is doing what, as this may provide tangible evidence of business value. In most organisations that are doing some analytics, it tends to happen in pockets throughout the organisation, so it may not be immediately obvious who is doing what and how that might support IT business cases.