A curious job description has started to appear in recent times: that of “data scientist”. In this post I am going to see if such a creature really exists or whether it is a chimera designed to make the rather dense and impenetrable subject of analytics even more dense and impenetrable to the outside observer.
Firstly, we need to whip up a quick epistemology of analytics (apologies to T.S. Eliot):
Here we see all knowledge and wisdom has data at its core; not raw data but data that is transformed into information. Information is derived from data in many different ways: from the humble report produced from a desktop database or spreadsheet, to the most sublime and parsimonious predictive model lovingly crafted by the aforementioned alleged data scientist.
If we gather enough of this information and discover (or create) the interconnections between discrete pieces of information then we create knowledge. If we gather enough knowledge then sooner or later we may start to question why we know what we know: this arguably is wisdom. Hopefully, from a large enough amount of data we may in time extract a very small amount of wisdom.
The logic also flows the other way: wisdom tells us if we are acquiring the right knowledge; knowledge gaps lead to the need for more information, and information needs drive further gathering and interrogation of data.
So where does science sit in all of this? I am not going to discuss wisdom in detail here – that belongs to philosophy and theology, although some physicists may disagree (you can download the podcast here). Science is dedicated to the creation of knowledge from information (knowledge that is derived through deductions or observations). The “data scientist”, on the other hand, specialises in deriving information from data, which I argue is not a science at all. It is certainly a critically important function and one that is becoming central to all organisations in one form or another, but it is not a science.
Invention, as the saying goes, is 1% inspiration and 99% perspiration. Generating information from data is the 99% perspiration part. The most skilled statistician cannot create useful models without the right data in the right shape to answer the right questions. Understanding what shape the data ought to be in requires the transfer of knowledge through information to the data level.
To turn the question around, if the data scientist truly exists then what are scientists of other disciplines? Arguably, all scientists are data scientists as all hypothesis-driven science relies on creating the nexus between data, information and knowledge. The term “data scientist” is therefore a tautology.
So if the data “scientist” is not a scientist then what is he or she? If it’s big data we are dealing with, then the discipline is more akin to engineering. For smaller datasets it may be more of a design or architectural function. These are all critical functions in analytics but they are not science.
More importantly, every knowledge worker is increasingly becoming their own data scientist. It is no longer acceptable for analytics to remain a function outside the other critical functions of an organisation, because it is knowledge and experience that help us gain insight from data; knowledge does not sit a priori within data waiting to be discovered. The questions we ask of data are the most important things in transforming information into knowledge.