#TwitterMetrics: Daily Twitter Sentiment

The #TwitterMetrics project is about creating stories from everyday Twitter data. In this example I measure the sentiment of trending Twitter topics every 15 minutes using a Python script and plot the results using the d3.js library. You can follow the project on Twitter to get regular updates.

The Data

The data comes from the Twitter API, accessed via the python-twitter library. The pipeline is cobbled together from a few MySQL databases, with the results formatted into JSON files and displayed using the d3.js graphing library (which takes some time and skill to get the most out of, but is certainly worth it).
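As a rough sketch of the MySQL-to-JSON step (the function and field names here are my own illustrative choices, not the project's actual schema), database rows could be serialised into the list-of-objects shape d3.js typically consumes:

```python
import json
from datetime import datetime

def rows_to_d3_json(rows):
    """Convert (timestamp, sentiment) rows into a JSON array of
    objects, the shape d3.js commonly expects for time series."""
    return json.dumps(
        [{"date": ts.strftime("%Y-%m-%d %H:%M"), "sentiment": value}
         for ts, value in rows],
        indent=2,
    )

# Example usage with made-up values:
sample = [(datetime(2013, 5, 1, 9, 0), 12.4),
          (datetime(2013, 5, 1, 9, 15), -3.1)]
print(rows_to_d3_json(sample))
```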

The Python script I’ve written runs automatically via a cron job on my Raspberry Pi. The script scans Twitter every 15 minutes and the web data is updated once a day.

Sentiment Wordlists

The TwitterMetrics project uses a dictionary of positive and negative keywords developed by the American academics Tim Loughran and Bill McDonald, which is in turn an extension and refinement of the Harvard IV-4 Psychosocial Dictionary. The list is extensive but doesn’t include some terms I thought would be relevant (I don’t think they encourage the use of ‘lol’ or certain expletives in academic literature), so I added some ‘Twitter-specific’ terms of my own to the list.
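Merging a published wordlist with hand-picked extras can be done with plain sets. A minimal sketch (the Twitter-specific terms below are illustrative examples, not the project's actual additions):

```python
def load_wordlist(lines):
    """Normalise an iterable of words (e.g. the lines of a wordlist
    file): lower-case, strip whitespace, drop blanks."""
    return {line.strip().lower() for line in lines if line.strip()}

# Hypothetical Twitter-specific additions to the Loughran-McDonald lists.
TWITTER_POSITIVE = {"lol", "lmao", "ftw"}
TWITTER_NEGATIVE = {"fml", "smh", "ugh"}

# e.g. positive = load_wordlist(open("lm_positive.txt")) | TWITTER_POSITIVE
positive = load_wordlist(["Able", " Win ", ""]) | TWITTER_POSITIVE
```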

The approach isn’t completely bullet-proof: it doesn’t cope well with sarcasm or some slang (“That new NeYo song is the bomb, yo!”* would probably be misinterpreted, for example). It’s good enough to make a fun infographic, though.

Loughran & McDonald’s wordlist is available here.

*I’ve no idea whether or not this is something ‘the kids’ would actually say.

The Twitter Sentiment Index

To generate the Twitter Sentiment Index, the number of negative words is subtracted from the number of positive words. The difference is then divided by the total number of words in the returned tweets to measure their relative ‘positive-ness’.

SentimentIndex_{t} = \left( \frac{PosWords_{t} - NegWords_{t}}{TotalWords_{t}} - AvgSentimentIndex \right) \times 10000

The data is normalised by subtracting the average sentiment since data gathering began, then multiplied by 10,000 (for no reason other than to remove the decimal places and make the numbers more readable).
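The calculation above can be sketched in a few lines of Python (the tokenising regex and function signature are my own assumptions, not the project's exact code):

```python
import re

def sentiment_index(tweets, positive, negative, avg_sentiment=0.0):
    """(pos - neg) / total words, minus the long-run average
    sentiment, scaled by 10,000 as described above."""
    words = [w for tweet in tweets
             for w in re.findall(r"[a-z']+", tweet.lower())]
    if not words:
        return 0.0
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return ((pos - neg) / len(words) - avg_sentiment) * 10000
```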

The data is displayed with a daily and weekly simple moving average so that the change in sentiment can be visualised over time.
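A simple moving average is straightforward to compute; a minimal sketch (window sizes and the decision to emit only complete windows are my choices here):

```python
def simple_moving_average(values, window):
    """Mean of each full sliding window; shorter leading windows
    are skipped, so the output has len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

For 15-minute samples, a window of 96 gives the daily average and 672 the weekly one.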

The Code

You can download the Python code and an importable SQL file for the database from GitHub.
