Is there really such a thing as a “data scientist”?

A curious job description has started to appear in recent times: that of “data scientist”. In this post I am going to see if such a creature really exists or whether it is a chimera designed to make the rather dense and impenetrable subject of analytics even more dense and impenetrable to the outside observer.

Firstly, we need to whip up a quick epistemology of analytics (apologies to T.S. Eliot):

Here we see all knowledge and wisdom has data at its core; not raw data but data that is transformed into information. Information is derived from data in many different ways: from the humble report produced from a desktop database or spreadsheet, to the most sublime and parsimonious predictive model lovingly crafted by the aforementioned alleged data scientist.

If we gather enough of this information and discover (or create) the interconnections between discrete pieces of information then we create knowledge. If we gather enough knowledge then sooner or later we may start to question why we know what we know: this arguably is wisdom. Hopefully, from a large enough amount of data we may in time extract a very small amount of wisdom.

The logic also flows the other way: wisdom tells us whether we are acquiring the right knowledge; knowledge gaps lead to the need for more information; and information needs drive further gathering and interrogation of data.

So where does science sit in all of this? I am not going to discuss wisdom in detail here – that belongs to philosophy and theology, although some physicists may disagree (you can download the podcast here). Science is dedicated to the creation of knowledge from information (knowledge that is derived through deductions or observations). The “data scientist” on the other hand specialises in deriving information from data which I argue is not a science at all. It is certainly a critically important function and one that is becoming central to all organisations in one form or another, but it is not a science.

Invention, as the saying goes, is 1% inspiration and 99% perspiration. Generating information from data is the 99% perspiration part. The most skilled statistician cannot create useful models without the right data in the right shape to answer the right questions. Understanding what shape the data ought to be in requires the transfer of knowledge through information to the data level.

To turn the question around, if the data scientist truly exists then what are scientists of other disciplines? Arguably, all scientists are data scientists as all hypothesis-driven science relies on creating the nexus between data, information and knowledge. The term “data scientist” is therefore a tautology.

So if the data “scientist” is not a scientist then what is he or she? If it is big data we are dealing with then the discipline is more akin to engineering. For smaller datasets it may be more of a design or architectural function. These are all critical functions in analytics but they are not science.

More importantly, every knowledge worker is increasingly becoming their own data scientist. It is no longer acceptable for analytics to remain a function separate from the other critical functions of an organisation. This is because it is knowledge and experience that help us gain insight from data; knowledge does not sit a priori within data waiting to be discovered. The questions we ask of data are the most important element in transforming information into knowledge.

The public perception of electricity prices in Queensland

Last week the Queensland government announced a three-person panel to investigate how electricity prices might be lowered. The Courier Mail story attracted 171 comments full of the usual colourful characters and partisan political commentary to be expected from such a forum. As I did in a past post for Victoria, I have decided to see what text analytics can tell us about the current zeitgeist around electricity prices in Queensland. This time, however, I use some more sophisticated analysis beyond word clouds.

The following discussion gets pretty technical, so firstly I’ll sum up the findings. Apart from the ubiquitous and tiresome slacktivism of partisan political commentary that accompanies online news stories, there are a few interesting insights. It seems that the message about why electricity prices are going up is getting through, at least to some sections of the general public (online news commenters are probably not representative of the broader community, as they have self-selected by choosing to comment). The other observation of note is the undeniable rise of solar power in the public imagination. It seems in recent times that it has drifted away from being a green consumer choice to a libertarian one: a way of sidestepping what these consumers see as the state’s interference in the rights of the individual. The comments confirm the growing public perception that electricity prices have a significant impact on household budgets, despite the fact that electricity is still a minor cost for most households. If, however, consumer discontent with network costs continues to rise, then we will see increasing numbers leaving the grid or at least reducing their reliance on it.

Now to get into the nitty gritty…

Text mining is a pretty good approach to unpicking meaning in newspaper comments, as a casual reader tends to get caught up in some of the loopier sentiments expressed, or is simply turned off by partisan commentary. A logical and objective analysis of the text allows us to uncover some insights without getting caught up in the general argy-bargy.

Before doing any analysis we want to remove the common words and phrases that form the grammatical and lexical glue of our language (e.g. “and”, “if”, “but”). In text mining these are called stopwords. Next we load the data into what’s known as a document term matrix: one row for each comment, and one column for each word, holding a count of the number of times that word is used in the comment. Like this:

Document Term Matrix

We then use this as the basis for our analysis. A word cloud is a way of arranging words so that colour indicates the frequency with which each word occurs, and size is the difference between a word’s maximum frequency within a single comment and its average frequency across all comments (i.e. the “lumpiness” of its use). The position in the cloud depends on the comments in which the maximum frequency occurs. This is our initial word cloud:

Initial Word Cloud
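The “lumpiness” measure described above can be sketched directly from the document-term matrix. This is a toy illustration with made-up counts, not the real comment data:

```python
import numpy as np

# Toy document-term matrix: rows are comments, columns are words
dtm = np.array([
    [3, 0, 1],   # comment 1
    [0, 2, 1],   # comment 2
    [1, 1, 1],   # comment 3
])
words = ["power", "solar", "price"]

# Size in the cloud: a word's maximum frequency in any single comment
# minus its average frequency across all comments
max_freq = dtm.max(axis=0)
mean_freq = dtm.mean(axis=0)
lumpiness = max_freq - mean_freq

for w, size in zip(words, lumpiness):
    print(w, round(float(size), 2))
```

A word used evenly across comments (like “price” here) scores zero; a word concentrated in a few comments scores high.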

We can see that the commentary is dominated by the words “power” and “electricity”. Later we’ll remove these to see what the cloud looks like without these terms.

But first we will remove all of the words with small counts (i.e. words that appear fewer than 10 times). By doing this we reduce the number of terms from about 1,800 down to 45.
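Dropping the rare terms amounts to a column filter on the document-term matrix. A small sketch with illustrative counts:

```python
import numpy as np

# Toy document-term matrix: rows are comments, columns are words
dtm = np.array([
    [5, 0, 1],
    [6, 1, 0],
    [4, 0, 1],
])
words = np.array(["power", "rebate", "cost"])

# Keep only words used at least 10 times across all comments
keep = dtm.sum(axis=0) >= 10
dtm_frequent = dtm[:, keep]
frequent_words = words[keep]
print(frequent_words)  # only "power" survives (15 mentions)
```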

We then see if we can cluster the remaining words using another method: k-means clustering. K-means clustering allows us to organise the comments into groups based on natural word groupings. But how many clusters do we create? In the following graphic each comment is plotted as a digit signifying the cluster to which it belongs:

After a bit of experimenting I settle on three clusters, as these separate nicely as shown above, with no clusters overlapping or too small and outlying. Each comment then gets a 1, 2 or 3 based on the cluster into which it is classified. I then use a tree algorithm to work out which words are driving the clustering process. This is the tree:

The tree above splits on the number of times each word is mentioned in a comment. For example, the far-left “node” of the tree (i.e. the red 4) is defined as comments which use the word “power” at least once (i.e. >= 0.5) and the word “solar” at least once.
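The cluster-then-explain pipeline described above can be sketched with scikit-learn. This is a minimal sketch on random toy counts, assuming a document-term matrix as input; the cluster count and word list mirror the discussion but the data is invented:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy document-term matrix: 30 comments x 4 words of interest
dtm = rng.integers(0, 3, size=(30, 4))
words = ["power", "electricity", "solar", "government"]

# Step 1: group the comments into three clusters by their word counts
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(dtm)

# Step 2: fit a decision tree to explain cluster membership in terms of
# word counts; splits like "power >= 0.5" mean the word appears at least once
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(dtm, labels)
print(tree.score(dtm, labels))
```

The tree is a readable surrogate for the clustering: its top splits name the words that most strongly separate the groups.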

There are a couple of observations that we can make from the two graphics above. Firstly, we see cluster 2 sitting around zero on each coordinate in the discriminant plot. This indicates that this is a cluster without significant patterns in word combinations, backed up by the fact that most cluster 2 comments in the decision tree are in nodes that mostly do not use the “terms of interest” decided by the model (i.e. “power”, “electricity”, “solar” and “government”). Cluster 1 is dominated by the terms “solar power”, “power” and “electricity”, while cluster 3 is dominated by the terms “electricity” and “government”.

We see that there is a large group of general comments but two distinct themes emerge: one where commenters discuss solar power and another where they discuss electricity and government.

The difficulty with interpreting this tree is that “power” is a synonym for “electricity” and a natural pair with “solar”. There is also an issue in that the analysis to date has been dominated by the words “electricity” and “power”, which add little insight because they are the very thing we are trying to analyse. So let’s run the same process again, this time removing the word “power” from the analysis. In the resulting word cloud we see the next level of significant words emerge. It confirms the significance of the discussion about solar power, and we also see “money” and “cost” emerge:

Next we again remove the low-frequency words and cluster the resultant document term matrix, this time discovering five clusters:

And a tree which splits nicely into five nodes, with a particular word representing each of four clusters, plus a general-comments cluster:

So what does this tell us? We add the word “pay” to our existing list of thematic words. More importantly, the key themes in the commentary are distinguished by their use of particular words in isolation from the other thematic words (apart from “solar power”, where two of our thematic words cluster together).

So how do we dig further into what this means? The answer is to look at which other commonly used words correlate with our thematic words. The following charts show these correlations (restricted to words used 10 times or more, as correlation is particularly sensitive to outliers, which can distort the interpretation). It is from these final graphs that I have drawn the conclusions at the start of this post.
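The correlation step can be sketched as a Pearson correlation between the count column of a thematic word and each other word’s column. The matrix and word list here are invented for illustration:

```python
import numpy as np

# Toy document-term matrix of counts: rows are comments, columns are words
dtm = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
])
words = ["solar", "panels", "government", "cost"]

# Correlate the thematic word "solar" with every other word
target = dtm[:, words.index("solar")]
corrs = {}
for i, w in enumerate(words):
    if w == "solar":
        continue
    corrs[w] = float(np.corrcoef(target, dtm[:, i])[0, 1])
    print(w, round(corrs[w], 2))
```

Words that tend to appear in the same comments as the thematic word score close to 1; words used in different comments score near zero or negative.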