Last week the Queensland government announced a three-person panel to investigate how electricity prices might be lowered. The Courier Mail story attracted 171 comments, full of the usual colourful characters and partisan political commentary to be expected from such a forum. As I did in a past post for Victoria, I have decided to see what text analytics can tell us about the current zeitgeist on electricity prices in Queensland. This time, however, I go beyond word clouds and use some more sophisticated analysis.
The following discussion gets pretty technical, so first I’ll sum up the findings. Apart from the ubiquitous and tiresome slacktivism of partisan political commentary that accompanies online news stories, there are a few interesting insights. It seems that the message about why electricity prices are going up is getting through, at least to some sections of the general public (online news commenters are probably not a good representation of the average community response, as they have selected themselves by choosing to comment). The other observation of note is the undeniable rise of solar power in the public imagination. In recent times it seems to have drifted away from being a green consumer choice to a libertarian one: a way of sidestepping what these consumers see as the state’s interference in the rights of the individual. The comments confirm the growing public perception that electricity prices have a significant impact on household budgets, despite the fact that electricity is still a minor cost for most households. If, however, consumer discontent with network costs continues to rise, then we will see increasing numbers leaving the grid or at least reducing their reliance on it.
Now to get into the nitty gritty…
Text mining is a pretty good approach to unpicking meaning in newspaper comments, as a casual reader tends to get caught up in some of the more loopy sentiments expressed or is generally turned off by partisan comment. A logical and objective analysis of the text allows us to try to uncover some insights without getting caught up in the general argy bargy.
Before doing any analysis we want to remove common words and phrases that form the grammatical and lexical glue of our language (e.g. and, if, but, etc.). In text mining these are called stopwords. Next we load the data into what’s known as a document term matrix. That is, one row for each comment and a long list of columns for each word with a count of the number of times that word is used in the comment. Like this:
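For readers who want to see the mechanics, here is a rough Python sketch of stopword removal and a document term matrix. It is not the code used for this post: the stopword list, the sample comments and the function names are all made up for illustration, and a real stopword list would be much longer.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"and", "if", "but", "the", "a", "is", "it", "to", "of", "are"}

def tokenize(comment):
    """Lowercase a comment, split it into words and drop the stopwords."""
    words = re.findall(r"[a-z']+", comment.lower())
    return [w for w in words if w not in STOPWORDS]

def document_term_matrix(comments):
    """Return (terms, rows): one row per comment, one column per word."""
    counts = [Counter(tokenize(c)) for c in comments]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

# Made-up comments, purely for illustration.
comments = [
    "Solar power is the answer and the government knows it",
    "Electricity prices are up but solar power is cheap",
    "The government sets electricity prices",
]
terms, dtm = document_term_matrix(comments)

# Print the matrix with one line per comment.
print(terms)
for row in dtm:
    print(row)
```

Each printed row is one comment and each position in the row is the count for the word in the same position of the term list.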
We then use this as the basis for our analysis. A word cloud is a way of arranging words so that their colour indicates the frequency with which each word occurs, and their size indicates the difference between the maximum frequency within a single comment and the average frequency across all comments (i.e. its “lumpiness” of use). The position in the cloud depends on the comment in which the maximum frequency occurs. This is our initial word cloud:
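The two quantities behind the cloud can be computed directly from a document term matrix. The sketch below is my reading of the description above rather than the code behind the figure, and the small matrix and the name cloud_metrics are invented for the example.

```python
def cloud_metrics(terms, dtm):
    """For each term return (total count, lumpiness).

    total count -> drives the colour
    lumpiness   -> drives the size: the largest count in any single
                   comment minus the average count per comment
    """
    n_comments = len(dtm)
    metrics = {}
    for j, term in enumerate(terms):
        column = [row[j] for row in dtm]
        total = sum(column)
        lumpiness = max(column) - total / n_comments
        metrics[term] = (total, lumpiness)
    return metrics

# Made-up counts: four comments (rows) by three words (columns).
terms = ["power", "solar", "money"]
dtm = [
    [2, 0, 0],
    [1, 3, 0],
    [1, 0, 1],
    [0, 0, 0],
]
metrics = cloud_metrics(terms, dtm)
for term, (total, lumpiness) in metrics.items():
    print(term, total, lumpiness)
```

In this toy data “power” is used more often overall, but “solar” is lumpier because all of its uses sit in one comment.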
We can see that the commentary is dominated by the words “power” and “electricity”. Later we’ll remove these to see what the cloud looks like without these terms.
But first we will remove all of the words with small counts (i.e. words that appear fewer than 10 times). By doing this we reduce the number of terms from about 1800 down to 45.
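Dropping the rare words amounts to summing each column of the document term matrix and keeping only the columns that reach the cut-off. A minimal sketch, with invented counts and a hypothetical helper name:

```python
def drop_rare_terms(terms, dtm, min_count=10):
    """Keep only the columns whose total count is at least min_count."""
    totals = [sum(row[j] for row in dtm) for j in range(len(terms))]
    keep = [j for j, total in enumerate(totals) if total >= min_count]
    kept_terms = [terms[j] for j in keep]
    kept_dtm = [[row[j] for j in keep] for row in dtm]
    return kept_terms, kept_dtm

# Made-up counts: "zeitgeist" appears once in total, so it is dropped.
terms = ["power", "solar", "zeitgeist"]
dtm = [
    [6, 4, 1],
    [5, 7, 0],
    [3, 0, 0],
]
kept_terms, kept_dtm = drop_rare_terms(terms, dtm, min_count=10)
print(kept_terms)
print(kept_dtm)
```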
We then see if we can cluster the remaining words using another method: k-means clustering. K-means clustering allows us to organise the comments into groups based on the natural word groupings. But how many clusters do we create? In the following graphic each digit represents a separate comment. They are represented by a number signifying which cluster they belong to:
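To make the idea concrete, here is a bare-bones k-means written in plain Python. It is a sketch only and not the tool used for this analysis: the starting centroids are picked with a simple deterministic “farthest row” rule (my assumption, chosen so the example is repeatable), and the word counts are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two rows of counts."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def initial_centroids(rows, k):
    """Start with the first row, then keep adding the row that is
    farthest from every centroid chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows, key=lambda r: min(squared_distance(r, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(rows, k, max_iter=100):
    """Alternate between assigning rows to the nearest centroid and
    moving each centroid to the mean of its rows, until nothing changes."""
    centroids = initial_centroids(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(r, centroids) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up counts of ["solar", "government"] in six comments.
rows = [[4, 0], [5, 0], [0, 4], [0, 5], [0, 0], [1, 1]]
labels, centroids = kmeans(rows, k=3)
print(labels)
print(centroids)
```

The number of clusters k still has to be chosen by the analyst, which is the question taken up next.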
After a bit of experimenting I settle on three clusters, as these separate nicely as shown above and none of the clusters overlap or are too small or outlying. Each comment then gets a 1, 2 or 3 based on which of these clusters it is classified into. I then use a tree algorithm to work out which words are driving the clustering process. This is the tree:
The tree above splits on the count of times the words are mentioned in a comment. For example the far left “node” of the tree (i.e. the red 4) is defined as comments which have used the word “power” at least once (i.e. >= 0.5) and the word “solar” at least once.
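A tree like this is built one split at a time. The sketch below finds a single best split using Gini impurity, which is the first step of a CART-style tree. I am assuming that criterion for illustration; the counts and cluster labels are invented and the code is not what produced the tree above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(terms, dtm, clusters):
    """Return (score, term, threshold) for the split with the lowest
    weighted Gini impurity across the two resulting groups."""
    best = None
    n = len(dtm)
    for j, term in enumerate(terms):
        values = sorted({row[j] for row in dtm})
        # Candidate thresholds sit halfway between observed counts,
        # e.g. 0.5 separates "never used" from "used at least once".
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [c for row, c in zip(dtm, clusters) if row[j] < threshold]
            right = [c for row, c in zip(dtm, clusters) if row[j] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, term, threshold)
    return best

# Made-up counts of three words in seven comments, with the cluster
# number that a clustering step assigned to each comment.
terms = ["power", "solar", "government"]
dtm = [
    [1, 1, 0],
    [2, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
]
clusters = [1, 1, 1, 3, 3, 2, 2]
score, term, threshold = best_split(terms, dtm, clusters)
print(term, threshold, round(score, 3))
```

A full tree repeats the same search inside each of the two groups until the groups are pure enough or too small to split further.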
There are a couple of observations that we can make from the two graphics above. Firstly, we see cluster 2 sitting around zero on each coordinate in the discriminant plot. This indicates that this is a cluster without significant patterns in word combinations, backed up by the fact that most cluster 2 comments in the decision tree are in nodes that mostly do not use the “terms of interest” decided by the model (i.e. “power”, “electricity”, “solar” and “government”). Cluster 1 is dominated by the terms “solar power”, “power” and “electricity”, while cluster 3 is dominated by the terms “electricity” and “government”.
We see that there is a large group of general comments but two distinct themes emerge: one where commenters discuss solar power and another where they discuss electricity and government.
The difficulty with interpreting this tree is that “power” is a synonym for “electricity” and a natural pair with “solar”. More generally, the analysis to date has been dominated by the words “electricity” and “power”, which add no insight to our discussion because they simply name the thing we are trying to analyse. So let’s run the same process again, this time removing the word “power” from the analysis. In the new word cloud we see the next level of significant words emerge. It confirms the significance of the discussion about solar power, and we also see “money” and “cost” emerge:
Next we again remove the low frequency words and cluster the resultant document term matrix, and this time we discover five clusters:
And a tree which splits nicely into five nodes with a particular word representing four clusters and a general comments cluster:
So what does this tell us? We add the word “pay” to our existing list of thematic words, but more importantly, the key themes in the commentary are distinguished in their use of particular words in isolation from the other thematic words (apart from the term “solar power” where two of our thematic words cluster together).
So how do we dig further into what this means? The answer is to look at which other commonly used words correlate with our thematic words. The following charts show these correlations (only with words that are used 10 times or more, as correlation is particularly sensitive to outliers, which can distort the interpretation). It is from these final graphs that I have drawn the conclusions at the start of this post.
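The correlations themselves are ordinary Pearson correlations between columns of the document term matrix. A minimal sketch follows, assuming Pearson correlation on per-comment counts; the data, the function names and the lower cut-off used in the toy example are all invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of counts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated here as no correlation
    return cov / math.sqrt(var_x * var_y)

def correlations_with(target, terms, dtm, min_count=10):
    """Correlate the target word with every other word that is used at
    least min_count times in total, strongest positive correlation first."""
    columns = {t: [row[j] for row in dtm] for j, t in enumerate(terms)}
    result = {
        t: pearson(columns[target], columns[t])
        for t in terms
        if t != target and sum(columns[t]) >= min_count
    }
    return sorted(result.items(), key=lambda item: item[1], reverse=True)

# Made-up counts; min_count is lowered to 2 only because the toy data is tiny.
terms = ["solar", "panels", "government", "rare"]
dtm = [
    [1, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 2, 1],
]
result = correlations_with("solar", terms, dtm, min_count=2)
for term, r in result:
    print(term, round(r, 3))
```

The word “rare” is skipped because it falls below the cut-off, which is the same protection against outliers described above.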