3.65 billion people access the internet via a smartphone or tablet
2.1 billion people regularly use social media
1.3 billion people are registered on Twitter
88% of Twitter users are on mobile
500 million+ tweets per day
People openly express their opinions on social networks
Much of this data is public, and freely available
This makes social media data, such as tweets, ideal for gauging people's opinions
This data has the potential to totally revolutionise how information is gathered
There is no way of interpreting the key attitudes and opinions conveyed in social media data, other than reading through an unworkable mass of tweets, most of which may not even be relevant to the topic.
Further to this, it is not feasible for a human to identify trends between the overall sentiment and other factors such as location, time and key events
The aim of this project was to develop a platform that would calculate the sentiment towards a given topic, and display results in a series of interactive data visualisations that will allow trends to be found
More specifically, the final solution will stream real-time Twitter data, and allow the user to enter a specific topic or keyword to display live sentiment results in various forms.
This is quite a new area, and in order to complete the final solution I will bring together several new technologies in a way they have not been used before.
For that reason, I am going to develop the application following a modular approach. Each component will be fully tested, documented and then published to the open source community on both GitHub and NPM.
This will allow other developers to build on the existing code base and take it further.
There are many times when this would be extremely useful, or provide interesting insights. For example
Extensive research into the current progress of sentiment analysis, including a literature review, was carried out.
Twitter is a well-known micro-blogging website which allows millions of users to interact over different types of communities, topics, and tweeting trends. The big data being generated on Twitter daily, and its significant impact on social networking, has motivated the application of data mining (analysis) to extract useful information from tweets.
(Tariq Mahmood, Atika Mustafa, Tasmiyah Iqbal, Farnaz Amin and Wajeeta Lohaana, "Mining Twitter Big Data to Predict 2013 Pakistan Election Winner", IEEE INMIC, Lahore, Pakistan)
One key use for this insight into people's opinions would be to aid marketing campaigns, as companies will have a better understanding of which techniques were effective in successfully marketing a product or service.
(Wenbo Wang et al., 2012)
Tweet sentiment score precedes stock price movement starting from about 7 hours beforehand for the highest gainers in stock. As time goes on, the correlation between tweets score and stock prices increases linearly. This shows that to some degree, tweet sentiments precede stock prices.
(Meesad, 2014)
Research was conducted into the ways that sentiment can be calculated, and a few sample algorithms were developed in order to find the best option for the final solution
The lexicon-based approach involves calculating the orientation of a document from the semantic orientation of the words or phrases it contains. This is usually done with a predefined dataset of words annotated with their semantic values; a simple algorithm can then calculate an overall semantic score for a given string. Dictionaries for this approach can be created either manually or automatically, using seed words to expand the list of words.
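The lexicon-based idea can be sketched in a few lines of Python. This is only an illustration, not the algorithm published with this project: the word list, its scores and the function name `lexicon_score` are all hypothetical, and dividing by the token count is just one possible way to normalise the result.

```python
import re

# Hypothetical mini-lexicon: word -> semantic orientation.
# The values are illustrative only and not taken from a real dataset.
LEXICON = {
    "good": 2, "great": 3, "love": 3,
    "bad": -2, "terrible": -3, "hate": -3,
}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the orientation of every known word, divided by the token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    total = sum(lexicon.get(token, 0) for token in tokens)
    return total / len(tokens)
```

A score above zero suggests positive sentiment, below zero negative, and words missing from the lexicon contribute nothing.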
The natural language understanding (NLU), or text classification, approach involves building classifiers from labelled instances of texts or sentences, which is essentially a supervised classification task. There are various NLU algorithms; the two main branches are supervised and unsupervised machine learning. A supervised learning algorithm generally builds a classification model from a large annotated corpus. Its accuracy depends mainly on the quality of the annotation, and the training process usually takes a long time. The unsupervised approach uses a sentiment dictionary, rather like the lexicon-based approach, but additionally builds up a database of common phrases and their aggregated sentiment.
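To make the supervised branch concrete, here is a minimal sketch of a classifier trained on labelled texts. Naive Bayes with add-one smoothing is used purely as an assumption for illustration, since no particular model is named above; the function names `train` and `classify` and the toy training data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labelled_texts):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)  # prior for this label
        n_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            logp += math.log((word_counts[label][token] + 1) / (n_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy annotated corpus (hypothetical); a real one would be far larger.
model = train([
    ("I love this", "pos"), ("great product", "pos"),
    ("I hate this", "neg"), ("terrible product", "neg"),
])
```

With a corpus this small the result is only a demonstration; as noted above, accuracy in practice depends on the size and annotation quality of the training data.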
I conducted an experiment to find the optimum SA algorithm in terms of efficiency and accuracy. A lexicon approach, an NLU approach and a human (as a benchmark) each calculated sentiment over the same set of tweets, and the results were compared.
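A comparison of this kind could be scored roughly as follows. This is a sketch under assumptions rather than the procedure actually used: it treats the human labels as ground truth, measures accuracy as the fraction of matching labels and efficiency as wall-clock time, and the names `evaluate` and `algorithm` are hypothetical.

```python
import time

def evaluate(algorithm, tweets, human_labels):
    """Return (accuracy, elapsed_seconds) for one algorithm.

    `algorithm` is any callable mapping a tweet's text to a label such as
    "positive", "negative" or "neutral"; `human_labels` is the benchmark.
    """
    if len(tweets) != len(human_labels):
        raise ValueError("tweets and human_labels must be the same length")
    if not tweets:
        return 0.0, 0.0
    start = time.perf_counter()
    predictions = [algorithm(tweet) for tweet in tweets]
    elapsed = time.perf_counter() - start
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(tweets), elapsed
```

Running `evaluate` once per candidate algorithm over the same tweets gives directly comparable accuracy and timing figures.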
The application was developed following the agile methodology
The methodology used in this project followed the principles of agile, more specifically personal-SCRUM (one-man agile). This is an iterative approach, where the project is divided into a set of phases called sprints. Each sprint had a set of requirements presented in the form of user stories and acceptance criteria. A sprint was only marked as complete once each story had been developed, implemented and tested (or descoped). User stories were prioritised and given a complexity estimate before each sprint, to ensure the best use of time and resources.
Gulp was used to automate the project build process.
A script was developed and configured, to do the following:
All acceptance criteria must be met, checked and documented
100% pass rate after every commit
80% or greater
B grade/ Level 4 or higher. Ideally A grade/ Level 5 if
Mostly up-to-date dependencies except in
The majority of the charts and data visualisations were coded in D3.js (but written in CoffeeScript). D3.js is a JavaScript library for manipulating documents based on data. It allows web elements (such as SVG and HTML) to be bound to data.
(See more at https://d3js.org/).
Two Node packages were developed and published for getting tweets:
fetch-tweets: https://www.npmjs.com/package/fetch-tweets
A data cache was also developed, using MongoDB as the database to store significant tweets from the past 24 hours. This speeds up the initial page load.
A sentiment analysis algorithm was developed from scratch, fully unit tested, documented and published to NPM.
It is significantly faster than other SA algorithms and, with 85% accuracy, is well suited to calculating the sentiment of live Twitter data quickly.