Sentiment analysis is one of the first things data analysts try with unlabelled text data (text with no score or rating) in an attempt to extract some insights from it. It is also a promising research area for NLP (Natural Language Processing) enthusiasts.
For an analyst, though, sentiment analysis can be a pain in the neck, because most of the primitive packages and libraries that handle it perform a simple dictionary lookup and calculate a final composite score based on the number of occurrences of positive and negative words. That often ends up in a lot of false positives, a very obvious case being ‘happy’ vs ‘not happy’: negations or, more generally, valence shifters.
Consider this sentence: ‘I am not very happy’. Any primitive sentiment analysis algorithm would flag this sentence as positive simply because the word ‘happy’ appears in its positive dictionary. But reading the sentence, we know it is not a positive one.
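To make the problem concrete, here is a minimal base R sketch of such a naive dictionary lookup. The word lists and the `naive_score()` helper are hypothetical and exist only for illustration; they are not taken from any real package.

```r
# Toy dictionaries: hypothetical, for illustration only
positive_words <- c("happy", "good", "great")
negative_words <- c("sad", "bad", "terrible")

# Naive score: (# positive words) - (# negative words), ignoring context
naive_score <- function(text) {
  # Lowercase, strip punctuation, split on whitespace
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

naive_score("I am happy")           # 1
naive_score("I am not very happy")  # 1: same score, the negation is ignored
```

Both sentences get the same positive score, which is exactly the false positive described above.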
The R package sentimentr, which is designed to handle such cases, can be installed from CRAN, or the development version can be installed from GitHub.
install.packages('sentimentr')
# or, for the development version:
library(devtools)
install_github('trinker/sentimentr')
The author of the package himself explains what sentimentr does that other packages don’t, and why it matters:
“sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup. The next questions address why it matters.”
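The idea of an "augmented dictionary lookup" can be sketched in a few lines of base R. This is only a toy illustration, not sentimentr's actual algorithm: the lexicon, the two-word context window, the `boost` weight and the `shifted_score()` helper are all assumptions made up for this example, and it covers only negators and amplifiers.

```r
# Hypothetical toy lexicon and shifter lists, for illustration only
polarity   <- c(happy = 1, terrible = -1)
negators   <- c("not", "never", "no")
amplifiers <- c("very", "really")

shifted_score <- function(text, window = 2, boost = 0.8) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  total <- 0
  for (i in seq_along(words)) {
    base <- unname(polarity[words[i]])
    if (is.na(base)) next  # not a polarized word
    # Look at the few words just before the polarized word
    context <- tail(words[seq_len(i - 1)], window)
    # An amplifier strengthens the word's polarity
    if (any(context %in% amplifiers)) base <- base * (1 + boost)
    # An odd number of negators flips the sign
    if (sum(context %in% negators) %% 2 == 1) base <- -base
    total <- total + base
  }
  total
}

shifted_score("He is very happy")     # positive, boosted by "very"
shifted_score("I am not very happy")  # negative, flipped by "not"
```

Unlike a plain word count, this version at least separates the two sentences by sign. The real package handles the remaining shifter types, de-amplifiers and adversative conjunctions, as the quote above describes.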
sentimentr offers sentiment analysis with two functions: sentiment_by() and sentiment().
An aggregated (averaged) sentiment score for a given text comes from sentiment_by():
sentiment_by('I am not very happy', by = NULL)
   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204
But this might not help much when we have multiple sentences with different polarity. Sentence-level scoring with sentiment() helps here:
sentiment('I am not very happy. He is very happy')
   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204
2:          1           2          4  0.67500000
Both functions return a data frame with four columns:
element_id – ID / Serial Number of the given text
sentence_id – ID / Serial Number of the sentence and this is equal to element_id in case of
word_count – Number of words in the given sentence
sentiment – Sentiment Score of the given sentence
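As a rough sketch of how these IDs behave with more than one text, the snippet below scores a small character vector. It assumes sentimentr is installed. The `reviews` vector and the variable names are made up for this example, and exact scores and column sets may vary with the package version, so no output is shown.

```r
library(sentimentr)

# Two hypothetical pieces of text; each one becomes an "element"
reviews <- c(
  "I am not very happy. He is very happy",
  "My life has become terrible since I met you and lost money."
)

# Split into sentences once, then reuse the result for both functions
sentences <- get_sentences(reviews)

# One row per sentence: element_id identifies the text,
# sentence_id the sentence within that text
per_sentence <- sentiment(sentences)

# One row per element: the sentence scores are averaged per text
per_review <- sentiment_by(sentences, by = NULL)

# Keep only the sentences scored as negative
negative_sentences <- per_sentence[per_sentence$sentiment < 0, ]
```

Splitting with `get_sentences()` first avoids splitting the same text twice when both functions are called on it.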
The extract_sentiment_terms() function helps us extract the keywords, both positive and negative, that were part of the sentiment score calculation. sentimentr also supports the pipe operator %>%, which makes it easier to write multi-step code with fewer assignments and a cleaner look.
'My life has become terrible since I met you and lost money' %>% extract_sentiment_terms()
   element_id sentence_id      negative positive
1:          1           1 terrible,lost    money
And finally, the highlight() function, coupled with sentiment_by(), gives an HTML output with parts of sentences highlighted in green and red to show their polarity. Trust me, this might seem trivial, but it really helps when making presentations to share the results, discuss false positives, and identify room for improvement in accuracy.
'My life has become terrible since I met you and lost money. But I still have got a little hope left in me' %>%
  sentiment_by(by = NULL) %>%
  highlight()
Try sentimentr for your sentiment analysis and text analytics projects, and do share your feedback in the comments. The complete code used here is available on my GitHub.