A comparison of sentiment analysis techniques in a parallel and distributed NoSQL environment

Date
2020-04
Authors
Van der Linde, Ian Daniel
Publisher
University of the Free State
Abstract
Sentiment analysis has seen a revival due to the advent of social media platforms such as Facebook and Twitter. The data posted on these platforms can be mined for valuable insights into customer relations, political unrest, and product supply and demand. This information is embedded in typical Big Data: very large volumes, delivered at high velocity, consisting of a wide variety of content and sources, and usually unstructured in nature. The challenge of analysing such data for decision support can be addressed through the use of sentiment analysis techniques in distributed environments designed to process and store large amounts of data in a horizontally scalable fashion. The performance characteristics of these techniques have, however, hardly been studied in distributed environments, and the impact of cluster size on such environments is largely undocumented. The aim of this research was to investigate the accuracy and performance of four sentiment analysis approaches (a lexicon-based classifier, a Naïve Bayes classifier, a Neural Network classifier, and a Support Vector Machine classifier) in a distributed environment with a cluster size of three to eight machines, while making use of a distributed NoSQL database backend to retrieve and store the data. The key investigations were to determine the nature of performance bottlenecks for each classifier in a distributed environment, how well each classifier scaled as more machines were added, and whether a relationship could be found between classifier accuracy and performance. It was determined that the four classifiers differ statistically significantly in accuracy, both when compared pairwise and when compared collectively. It was also found that there is no clear relationship between accuracy and resource usage (i.e., a more performant technique does not necessarily have worse accuracy).
Description
Dissertation (M.Sc. (Computer Science and Informatics))--University of the Free State, 2020
Keywords
Sentiment analysis, NoSQL database, Document classification, Parallel computing, Distributed computing, Empirical analysis