NLP SENTIMENT CLASSIFICATION USING MACHINE LEARNING AND DEEP LEARNING

Md. Wazir Ali
Sep 18, 2021

TABLE OF CONTENTS:-

  1. INTRODUCTION
  2. DATA
  3. BUSINESS OBJECTIVE AND CONSTRAINTS
  4. MAPPING TO AN ML PROBLEM
  5. PERFORMANCE METRIC
  6. EXPLORATORY DATA ANALYSIS AND DATA PREPROCESSING
  7. FEATURE ENGINEERING
  8. VALIDITY OF THE ENGINEERED FEATURES
  9. TSNE PLOTS AND CDFS OF THE FEATURES
  10. CLASSICAL MACHINE LEARNING TECHNIQUES
  11. SUMMARY OF RESULTS FOR THE BEST ML MODELS
  12. LIMITATIONS AND SCOPE OF IMPROVEMENTS
  13. DEEP LEARNING MODELS
  14. FINAL RESULTS
  15. CODE AND DATASET
  16. LINKEDIN PROFILE
  17. REFERENCES

INTRODUCTION:-

Natural Language Processing (NLP) is an emerging field within Artificial Intelligence and Machine Learning. One of its basic tasks is sentiment classification: labelling a given text as carrying a positive, neutral or negative sentiment. The text could be a review of the food in a hotel or restaurant, or a speech conveying a positive or negative opinion. Today there are many classical Machine Learning and Deep Learning techniques that can solve this sentiment classification problem efficiently. In this blog post, we consider a dataset of texts labelled as negative or neutral, and we train several Machine Learning and Deep Learning models to classify an unseen text as negative or neutral.

DATA:-

We have a corpus of 18,999 texts, each classified as negative or neutral. The data is given in a data.csv file. The features of the data are:-

  1. TextID- A unique identifier for the given text.
  2. Text- The text denoting what a person said.
  3. Sentiment- The sentiment of the text which is either negative or neutral.

The data.csv file is in a zip format and is available as data.zip

BUSINESS OBJECTIVE AND CONSTRAINTS:-

BUSINESS OBJECTIVE:-

The objective is to build a model which would classify an unseen text into negative or neutral based on the mapping which it has learnt from the given text corpus.

CONSTRAINTS:-

  1. We need to make sure that we are not missing out on any negative sentiment.
  2. The prediction should not take too long.

MAPPING TO AN ML PROBLEM:-

This problem can be mapped to a supervised classification problem as the dataset has labels marked as negative or neutral which is a categorical variable with two levels.

POSITIVE CLASS OF INTEREST:-

The positive class of interest is the negative sentiment, since that is the class we care most about detecting.

PERFORMANCE METRIC:-

The metric used to judge or evaluate the effectiveness of the algorithms is recall because, as per the constraints, we should not miss out on any negative sentiment. So, we should correctly identify as many of the actual negative sentiments as possible in our predictions.

EXPLORATORY DATA ANALYSIS AND DATA PREPROCESSING:-

EXPLORATORY DATA ANALYSIS:-

The exploratory data analysis includes the following:-

a) Checking the shape of the data i.e. the number of rows and columns present in the data.

b) Checking the top 5 rows of the data.

c) Checking the distribution of the negative and neutral sentiment in the data.

d) Checking for empty text.

A simple function like the sketch below can be used to list the indices which have no text.
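A minimal sketch, assuming the data has been read into a pandas DataFrame df and that the text column is named text (the exact column name in data.csv may differ):

```python
import pandas as pd

def get_empty_text_indices(df, text_col="text"):
    """Return the indices of rows whose text is missing or only whitespace."""
    is_empty = df[text_col].isna() | (df[text_col].astype(str).str.strip() == "")
    return df.index[is_empty].tolist()

df = pd.read_csv("data.csv")
print(get_empty_text_indices(df))
```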

e) Checking for duplicate text.

The data frame duplicate_text holds the rows whose text and associated sentiment label are both duplicated.

We remove the duplicate text using the dataframe.drop command, as sketched below.
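A sketch of the duplicate check and removal, again assuming the column names text and sentiment:

```python
# Rows that duplicate an earlier row on both the text and its label.
duplicate_text = df[df.duplicated(subset=["text", "sentiment"], keep="first")]

# Drop the duplicates by index, keeping the first occurrence of each text.
df = df.drop(duplicate_text.index).reset_index(drop=True)
```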

DATA PREPROCESSING:-

The data preprocessing would include the following:-

a) Decontracting the text by removing the apostrophes and expanding the phrases.

A function along the lines of the sketch below was used to decontract the corpus of text.
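A sketch of a typical decontraction function; the actual set of substitutions used in the project may be larger:

```python
import re

def decontract(text):
    """Expand common English contractions into their full phrases."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text
```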

b) Stopwords and Punctuation Removal

For the stop word and punctuation removal, we considered our own set of custom stop words as we needed to retain the word ‘not’.
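A sketch of this step. NLTK's English stop word list is assumed as the starting point; only the retention of 'not' is taken from the description above:

```python
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Custom stop word set: NLTK's English list minus 'not', since negation
# is a strong signal for the negative sentiment class.
custom_stopwords = set(stopwords.words("english")) - {"not"}

def remove_stopwords_and_punct(text):
    # Strip punctuation first, then drop the stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in custom_stopwords)
```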

c) Lemmatizing the cleaned text.

We chose lemmatization over stemming as lemmatization considers the context while converting a word to its root form.

After performing the above steps, we keep the preprocessed text in the column preprocessed_text.
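A sketch of the lemmatization step and of assembling the preprocessed_text column, chaining the helpers sketched above:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Reduce each token to its dictionary root form using WordNet."""
    return " ".join(lemmatizer.lemmatize(w) for w in text.split())

df["preprocessed_text"] = (df["text"].apply(decontract)
                                     .apply(remove_stopwords_and_punct)
                                     .apply(lemmatize_text))
```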

d) Encoding the levels of sentiment
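A one-line sketch; the encoding (1 for negative, 0 for neutral) matches the class column used in the modelling sections below:

```python
# Negative sentiment is the positive class of interest, hence 1.
df["class"] = df["sentiment"].map({"negative": 1, "neutral": 0})
```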

e) Checking the top 20 most frequently occurring words for each sentiment.

Negative Sentiment or Positive Class of Interest:-

We store the unique words occurring in the negative-sentiment texts with their counts in descending order. Now, let's plot the same.
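A sketch using collections.Counter for the counts and matplotlib for the bar plot:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Word frequencies over all negative-sentiment texts, highest first.
neg_corpus = " ".join(df.loc[df["class"] == 1, "preprocessed_text"])
top_words = Counter(neg_corpus.split()).most_common(20)

words, counts = zip(*top_words)
plt.figure(figsize=(12, 4))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Top 20 words for the negative sentiment")
plt.show()
```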

Neutral Sentiment or Negative Class:-

The same two code snippets were used to get the bar plot showing the frequency of words for neutral sentiment.

f) Checking the top 5 words with the highest idf values for each sentiment.

Negative Sentiment or The Positive Class of Interest:-

The function calculate_idf calculates the idf of all the unique words in the corpus and returns the vocabulary along with a dictionary mapping the words to their idf values. After calling the function on the preprocessed_text, this dictionary is sorted in descending order of idf value.

Now, let’s check the top 5 words with the highest idf values.
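A sketch of calculate_idf, using the textbook definition idf(w) = log(N / df(w)); the exact formula in the project may differ (e.g. with smoothing):

```python
import math
from collections import Counter

def calculate_idf(texts):
    """Return the vocabulary and a dict mapping each word to its idf value."""
    n_docs = len(texts)
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.split()))  # count each word once per document
    idf = {w: math.log(n_docs / freq) for w, freq in doc_freq.items()}
    return list(idf.keys()), idf

neg_texts = df.loc[df["class"] == 1, "preprocessed_text"]
vocab, idf_values = calculate_idf(neg_texts)
top5 = sorted(idf_values.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)
```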

Neutral Sentiment or Negative Class:-

The same two code snippets work equally well for the Neutral Sentiment or Negative Class; only the input data changes.

g) Checking and removing the empty and the duplicate rows of the preprocessed text.

The same code as earlier was used to check for empty and duplicate rows of the preprocessed text. There were more than 100 empty rows and 397 duplicate rows, which were removed.

h) Generating the Word Cloud for each of the sentiments for the 100 most frequently occurring words.

The function for generating the word cloud, along with the snippet to visualize it, is sketched below:-
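A sketch using the wordcloud package:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_word_cloud(texts, max_words=100):
    """Word cloud of the most frequent words in the given texts."""
    return WordCloud(width=800, height=400, max_words=max_words,
                     background_color="white").generate(" ".join(texts))

wc = generate_word_cloud(df.loc[df["class"] == 1, "preprocessed_text"])
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```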

Negative sentiment:-

Neutral Sentiment:-

FEATURE ENGINEERING:-

Feature engineering was done on the preprocessed_text feature, producing two features:-

  1. Length of the preprocessed text
  2. Sentiment Score of the preprocessed text.

Length of the preprocessed text

The text was already preprocessed and cleaned using decontraction, stop word and punctuation removal, and was tokenized after that. A lambda function computing the length of the split text for each entry in preprocessed_text was used, and the result was stored in a column Length_of_Preprocessed_Text.
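A sketch of the lambda:

```python
# Number of tokens in each preprocessed text.
df["Length_of_Preprocessed_Text"] = df["preprocessed_text"].apply(
    lambda text: len(text.split()))
```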

Sentiment Score of the preprocessed text

The second feature of the preprocessed text is the sentiment score, obtained using VADER, a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. The SentimentIntensityAnalyzer class from nltk.sentiment.vader was used to produce the sentiment intensity scores neg, neu, pos and compound, i.e. the negative, neutral, positive and compound intensity scores for every text. The code snippet below should make this clear.
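A sketch of the VADER scoring step:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
scores = df["preprocessed_text"].apply(sia.polarity_scores)
for key in ("neg", "neu", "pos", "compound"):
    df[key] = scores.apply(lambda s: s[key])
```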

VALIDITY OF THE ENGINEERED FEATURES:-

The usefulness of the engineered features (the length of the preprocessed text and the sentiment scores) in predicting the sentiment of the text can be checked by testing whether the means of the engineered features differ across the levels of sentiment, which in this case are negative and neutral. This is done by running a one-way ANOVA on each of the engineered features.
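A sketch of the test for one feature, using scipy.stats.f_oneway (with only two groups, this is equivalent to a two-sample t-test):

```python
from scipy.stats import f_oneway

# Does the mean of the feature differ between negative and neutral texts?
negative = df.loc[df["class"] == 1, "Length_of_Preprocessed_Text"]
neutral = df.loc[df["class"] == 0, "Length_of_Preprocessed_Text"]
f_stat, p_value = f_oneway(negative, neutral)
print(f"F = {f_stat:.3f}, p = {p_value:.5f}")  # reject H0 if p < 0.05
```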

Anova Results for Length of the Preprocessed Text:-

The probability of seeing a value greater than the F-score is close to 0, which means that the p-value is less than the significance level of 0.05. We therefore reject the null hypothesis, which states that the means are the same across the levels of sentiment. Hence, we conclude that the feature Length_Of_Preprocessed_Text is useful for predicting the sentiment.

Anova Results for the neg feature of the sentiment score:-

The p-value is close to 0.45 which is greater than the alpha or the significance level 0.05 and so we fail to reject the null hypothesis and conclude that the feature neg is not useful in predicting the sentiment.

Anova Results for the neu feature of the sentiment score:-

As the p-value is close to 0.59, so we fail to reject the null hypothesis and conclude that the feature neu is not useful in predicting the sentiment.

Anova Results for the pos feature of the sentiment score:-

The results for the anova analysis on the pos feature show that the pos feature is not useful in predicting the sentiment either as the p-value is greater than the significance level.

Anova Results for the compound feature of the sentiment score:-

The results show that the p-value is greater than the significance level or alpha and hence, we fail to reject the null hypothesis and conclude that compound feature is also not useful in predicting the sentiment from the given data.

TSNE PLOTS AND CDFS OF THE FEATURES:-

TSNE PLOT FOR THE TF-IDF VECTORIZED PREPROCESSED FEATURES:-

The preprocessed text was featurized with the help of the TF-IDF vectorizer and the transformed feature matrix was visualized using t-SNE in 2 dimensions with various perplexities.
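A sketch of the featurization and embedding. The min_df value and the dense conversion are illustrative choices; for a large corpus, subsampling or a TruncatedSVD step keeps t-SNE tractable:

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

tfidf = TfidfVectorizer(min_df=5)
X = tfidf.fit_transform(df["preprocessed_text"])

for perplexity in (30, 50, 100):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X.toarray())
    plt.figure(figsize=(6, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=df["class"], s=3, cmap="coolwarm")
    plt.title(f"t-SNE of TF-IDF features, perplexity={perplexity}")
    plt.show()
```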

TSNE with perplexity as 30:-

TSNE Plot in 2 dimensions with perplexity as 30

The above plot clearly shows that the TF-IDF vectorizer is not able to separate the classes well; there are overlaps all around the 2-dimensional space.

TSNE with perplexity 50:-

TSNE Plot in 2 dimensions with perplexity as 50

The above t-SNE plot does not show any separation between the classes in two dimensions with the TF-IDF vectorization either.

TSNE with perplexity 100:-

TSNE Plot in 2 dimensions with perplexity as 100

The above t-SNE plot does not show any separation between the classes in two dimensions with the TF-IDF vectorization either.

TSNE PLOT FOR THE BOW VECTORIZED PREPROCESSED FEATURES:-

TSNE with perplexity 30:-

TSNE Plot in 2 dimensions with perplexity as 30

The above plot shows that the classes are not well separated in two dimensions with the Count Vectorizer as the feature space.

TSNE with perplexity 50:-

TSNE Plot in 2 dimensions with perplexity as 50

The above plot shows that the classes are not well separated in two dimensions with the Count Vectorizer as the feature space.

TSNE with perplexity 100:-

The above plot shows that the classes are not well separated in two dimensions with the Count Vectorizer as the feature space.

CDF OF THE FEATURE LENGTH OF THE PREPROCESSED TEXT:-

Code:-
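A minimal sketch of an empirical CDF plot for the two classes, reusing the dataframe df from the earlier steps:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(values, label):
    """Empirical CDF: sorted values against cumulative proportion."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    plt.plot(x, y, label=label)

plot_cdf(df.loc[df["class"] == 1, "Length_of_Preprocessed_Text"], "negative")
plot_cdf(df.loc[df["class"] == 0, "Length_of_Preprocessed_Text"], "neutral")
plt.xlabel("Length of preprocessed text")
plt.ylabel("CDF")
plt.legend()
plt.show()
```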

Result:-

The CDFs of the length of the preprocessed text for the two sentiment classes (neutral and negative) overlap. Hence, this particular feature on its own is not a good determinant of the sentiment, even though we have seen that the mean lengths of the preprocessed text differ between the two sentiments.

CLASSICAL MACHINE LEARNING TECHNIQUES:-

The textID, preprocessed_text, Length_of_Preprocessed_Text and class (1 - negative, 0 - neutral) columns are taken from the original data, saved as preprocessed_data.csv, and used for training the various Machine Learning models.

MACHINE LEARNING MODELS:-

The following Machine Learning Models were tried out on the data:-

a) Naive Bayes

b) Logistic Regression

c) Linear SVM

d) GBDT (Gradient Boosted Decision Trees)

e) Random Forests

On all of the above Machine Learning models, hyperparameter tuning was performed and the model with the best Area Under the Curve (AUC) was fitted on the data. The next step was to choose the threshold which maximizes True Positive Rate * (1 - False Positive Rate), i.e. Recall * Specificity.

TEXT FEATURIZATION:-

The preprocessed text was featurized using the following methods:-

a) Bag of Words (Unigrams, Unigrams and Bigrams and Unigrams, Bigrams and Trigrams)

b) Tf-idf (Unigrams, Unigrams and Bigrams and Unigrams, Bigrams and Trigrams)

c) Word2Vec (Average Word2Vec and TF-IDF weighted Word2Vec; a sketch of both follows below)
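A sketch of the two Word2Vec featurizations. It assumes glove is a dict mapping words to 300-dimensional GloVe vectors and idf is the word-to-idf dictionary computed earlier:

```python
import numpy as np

def avg_word2vec(text, glove, dim=300):
    """Average of the GloVe vectors of the words present in the text."""
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def tfidf_weighted_word2vec(text, glove, idf, dim=300):
    """GloVe vectors weighted by each word's idf, normalized by the weights."""
    num, den = np.zeros(dim), 0.0
    for w in text.split():
        if w in glove and w in idf:
            num += idf[w] * np.asarray(glove[w])
            den += idf[w]
    return num / den if den > 0 else num
```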

TRAIN-TEST SPLIT:-

The preprocessed_data.csv was imported and split in a 75%-25% ratio, maintaining the class ratio with the help of stratified sampling.
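A sketch of the split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("preprocessed_data.csv")
X, y = data["preprocessed_text"], data["class"]

# stratify=y keeps the negative/neutral ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```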

FUNCTION TO CHOOSE THE BEST THRESHOLD:-

The first function was used to find the best threshold, the one which maximizes the tpr*(1-fpr) values. Here, thresholds denotes the positive-class probabilities predicted by the best model, and fpr and tpr are the arrays of false positive rates and true positive rates at each of the thresholds.

The second function returns the predictions in terms of 0 and 1 in this case based on the best threshold.
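A sketch of the two functions:

```python
import numpy as np
from sklearn.metrics import roc_curve

def find_best_threshold(y_true, y_scores):
    """Threshold that maximizes tpr * (1 - fpr) along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    return thresholds[np.argmax(tpr * (1 - fpr))]

def predict_with_threshold(y_scores, threshold):
    """Map positive-class probabilities to 0/1 using the chosen threshold."""
    return (y_scores >= threshold).astype(int)
```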

NAIVE BAYES:-

The Naive Bayes algorithm was fitted on the training data by importing the MultinomialNB class of sklearn.naive_bayes.

Hyperparameter Tuning:-

The hyperparameter tuning was done on the parameter alpha, the smoothing term which prevents the posterior probability of a class given a text from going to zero for any unseen or rare word.

Range:-

i) Alpha-

[0.00001,0.0005,0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,5,10,50,100]

Algorithm for Hyperparameter Search:- Randomized Search

Vectorizer :-

a) Bag of Words (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

b) TF-IDF (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)
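Putting the pieces together, a sketch of the search for one of the featurizations (unigrams and bigrams):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

alphas = [0.00001, 0.0005, 0.0001, 0.005, 0.001, 0.05,
          0.01, 0.1, 0.5, 1, 5, 10, 50, 100]

bow = CountVectorizer(ngram_range=(1, 2))  # Bag of Words, unigrams + bigrams
X_tr = bow.fit_transform(X_train)

search = RandomizedSearchCV(MultinomialNB(), {"alpha": alphas},
                            scoring="roc_auc", cv=5, random_state=42)
search.fit(X_tr, y_train)
print(search.best_params_, search.best_score_)
```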

SUMMARY OF RESULTS:-

Naive Bayes Results

LOGISTIC REGRESSION:-

Logistic Regression was fitted on the training data by importing the SGDClassifier class of sklearn.linear_model (with logistic loss).

Hyperparameter Tuning:-

The hyperparameter tuning was done on alpha, which is the weight given to the regularization, and penalty, which is one of 'L1', 'L2' or 'ElasticNet' and denotes the type of regularization.

Range:-

i) Alpha-

[0.00001,0.0005,0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,5,10,50,100]

ii) Penalty- [L1,L2,ElasticNet]

Algorithm for Hyperparameter Search:- Randomized Search

Vectorizer :-

a) Bag of Words (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

b) TF-IDF (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

c) Average Word2Vec (300 Dimensional Vectors of words from glove model.)

d) TFIDF Weighted Word2Vec (300 Dimensional Vectors of words from glove model.)
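A sketch of the tuned estimator; loss="log" makes SGDClassifier a logistic regression (recent scikit-learn versions call this loss "log_loss"):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "alpha": [0.00001, 0.0005, 0.0001, 0.005, 0.001, 0.05,
              0.01, 0.1, 0.5, 1, 5, 10, 50, 100],
    "penalty": ["l1", "l2", "elasticnet"],
}

search = RandomizedSearchCV(SGDClassifier(loss="log", random_state=42),
                            param_dist, scoring="roc_auc", cv=5,
                            random_state=42)
search.fit(X_tr, y_train)  # X_tr from any of the featurizations above
```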

SUMMARY OF RESULTS:-

Logistic Regression Results

LINEAR SVM:-

The Linear SVM was fitted on the training data by importing the SGDClassifier class of sklearn.linear_model (with hinge loss).

Hyperparameter Tuning:-

The hyperparameter tuning was done on the parameter alpha which is the weight given to the regularization.

Alpha is inversely proportional to C, which measures the penalty imposed on training points that fall on the wrong side of the margin; a large C tightens the maximal-margin separating hyperplane so that those points get classified correctly. So, as alpha increases, C decreases and the margin of the maximal separating hyperplane widens, which means that a few training points may be misclassified. In other words, the regularization increases, resulting in a decrease in variance.

Range:-

i) Alpha-

[0.00001,0.0005,0.0001,0.005,0.001,0.05,0.01,0.1,0.5,1,5,10,50,100]

Algorithm for Hyperparameter Search:- Randomized Search

Vectorizer :-

a) Bag of Words (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

b) TF-IDF (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

c) Average Word2Vec (300 Dimensional Vectors of words from glove model.)

d) TFIDF Weighted Word2Vec (300 Dimensional Vectors of words from glove model.)
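One practical detail worth sketching: with loss="hinge", SGDClassifier has no predict_proba, so to obtain the probabilities needed for the threshold tuning described earlier, the fitted SVM can be wrapped in CalibratedClassifierCV. The alpha value below is only a placeholder:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

svm = SGDClassifier(loss="hinge", alpha=0.001, random_state=42)

# Calibration yields probability estimates for the hinge-loss model.
calibrated_svm = CalibratedClassifierCV(svm, cv=5).fit(X_tr, y_train)

X_test_tr = bow.transform(X_test)  # same vectorizer as used for training
probs = calibrated_svm.predict_proba(X_test_tr)[:, 1]
```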

SUMMARY OF RESULTS:-

Linear SVM Results

GBDT (Gradient Boosted Decision Trees):-

The Gradient Boosted Decision Trees were also fitted on the data by importing GradientBoostingClassifier from sklearn.ensemble.

Hyperparameter Tuning:-

The hyperparameter tuning was done on n_estimators, which is the number of base learners (decision trees), and learning_rate, which is essentially the rate at which the contribution of each base learner is shrunk.

Range:-

i) n_estimators- [5,10,50,75,100]

ii) learning_rate- [0.0001,0.001,0.01,0.1,0.2,0.3]

Algorithm for Hyperparameter Search:- Randomized Search

Vectorizer :-

a) Bag of Words (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

b) TF-IDF (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

c) Average Word2Vec (300 Dimensional Vectors of words from glove model.)

d) TFIDF Weighted Word2Vec (300 Dimensional Vectors of words from glove model.)

SUMMARY OF RESULTS:-

GBDT Results

RANDOM FORESTS:-

The Random Forest was fitted on the data by importing RandomForestClassifier from sklearn.ensemble.

Hyperparameter Tuning:-

The hyperparameter tuning was done on n_estimators, which is the number of base learners (decision trees), and max_depth, which is the maximum depth to which each of the decision trees is grown.

Range:-

i) n_estimators- [10,50,100,200]

ii) max_depth- [10,20,30,50,100]

Algorithm for Hyperparameter Search:- Randomized Search

Vectorizer :-

a) Bag of Words (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

b) TF-IDF (Unigrams, Unigrams and Bigrams & Unigrams, Bigrams and Trigrams.)

c) Average Word2Vec (300 Dimensional Vectors of words from glove model.)

d) TFIDF Weighted Word2Vec (300 Dimensional Vectors of words from glove model.)

SUMMARY OF RESULTS:-

Random Forest Results

SUMMARY OF RESULTS FOR THE BEST ML MODELS:-

From the above summary, we see that Random Forest performs best on the Bag of Words (unigrams and bigrams) featurization, with a recall of 0.748 on the negative sentiment.

Confusion Matrix for the Best Random Forest Model:-

LIMITATIONS AND SCOPE OF IMPROVEMENTS:-

LIMITATIONS:-

The limitations with the above approach of applying classical Machine Learning Models are :-

a) We have not utilized the sequence of words in the sentence or the context in which the words have appeared in the text resulting in negative or neutral sentiments.

b) There could be additional features, beyond the simple length of the text, that would be more useful for predicting the negative sentiment; these were not explored, partly due to a lack of knowledge about the topics the texts come from.

c) The most important part which should be explored in this case is the joint probability distribution of the words for the negative and neutral classes. We can try to check the posterior probabilities of the classes given the most frequently occurring unigrams, bigrams and trigrams.

SCOPE OF IMPROVEMENTS:-

a) The performance of the models could be significantly improved by using RNNs (Recurrent Neural Networks), which exploit the sequential nature of the data.

b) The embeddings, i.e. the vector representations of the words, could be learnt from the context of the given texts themselves, but that would require more data.

c) BERT/Transformers can be used to check the performance of this task.

DEEP LEARNING MODELS:-

State-of-the-art Deep Learning models have proved to be very effective on NLP problems. Some Deep Learning models have also been tried for this problem.

The basic architecture of models was based on the following:-

  1. LSTM based model taking the word embeddings.
  2. Self-Attention based model.
  3. BERT small model used for Text Classification.

LSTM Based Model

LSTM is a special kind of recurrent unit which handles the problem of long-term dependencies. In this model, a single-layer LSTM with 50 hidden units is fed the word embeddings of the text derived from the GloVe model. The model architecture is shown below:-

LSTM Based Model Architecture

The first input layer, denoted input_50, takes the padded sequences of the tokenized text. The sequences are derived from a word dictionary built from the training corpus, which maps each word to an integer index; each word is replaced by its index, and each sentence is padded with zeros up to the maximum sentence length in the training corpus.

The other input, denoted input_51, is the manually engineered length-of-text feature.

The Deep Learning model was compiled with the Adam optimizer and categorical cross-entropy loss, with recall as the metric to maximize. The model was fitted on the data with a batch size of 64.

Here is the code for tokenizing the text, generating the padded sequences, initializing the embedding matrix, and for model compilation, fitting and callbacks.
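A sketch of the whole pipeline, assuming glove is the loaded GloVe word-vector dictionary and len_train is the (n_samples, 1) array of the length feature for the training texts; the project's actual code may differ in details:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize and pad to the longest training sentence.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
seq_train = tokenizer.texts_to_sequences(X_train)
max_len = max(len(s) for s in seq_train)
pad_train = pad_sequences(seq_train, maxlen=max_len, padding="post")

# Embedding matrix initialized from the 300-d GloVe vectors.
vocab_size = len(tokenizer.word_index) + 1
emb_matrix = np.zeros((vocab_size, 300))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        emb_matrix[idx] = glove[word]

# Two inputs: the padded word-index sequences and the length feature.
text_in = tf.keras.Input(shape=(max_len,))
len_in = tf.keras.Input(shape=(1,))
emb = tf.keras.layers.Embedding(vocab_size, 300, weights=[emb_matrix],
                                trainable=False)(text_in)
lstm_out = tf.keras.layers.LSTM(50)(emb)
merged = tf.keras.layers.Concatenate()([lstm_out, len_in])
out = tf.keras.layers.Dense(2, activation="softmax")(merged)

model = tf.keras.Model([text_in, len_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.Recall(class_id=1, name="recall")])

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_recall",
                                       mode="max", save_best_only=True),
    tf.keras.callbacks.EarlyStopping(monitor="val_recall", mode="max",
                                     patience=3),
]
model.fit([pad_train, len_train], tf.keras.utils.to_categorical(y_train),
          validation_split=0.2, epochs=20, batch_size=64,
          callbacks=callbacks)
```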

The above model was run for 8 epochs after which training was stopped as the validation recall was not improving for the past 3 epochs. The callbacks used were Tensorboard, Checkpointing and Earlystopping.

The Tensor Board epoch loss and recall plot is as follows:-

Tensorboard Epoch Loss and Recall Plot

The best model gave a validation and test recall as 0.7215 and 0.7359 respectively.

Self- Attention Based Model:-

This model employed a self-attention mechanism on the LSTM outputs, i.e. it learns how much each word should attend to every other word when building the representation of the text.

The code can be found below:-
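A sketch reusing the tokenizer, max_len, vocab_size and emb_matrix from the LSTM sketch above; Keras' dot-product Attention layer stands in for whatever attention implementation the project actually used:

```python
import tensorflow as tf

text_in = tf.keras.Input(shape=(max_len,))
emb = tf.keras.layers.Embedding(vocab_size, 300, weights=[emb_matrix],
                                trainable=False)(text_in)
lstm_out = tf.keras.layers.LSTM(50, return_sequences=True)(emb)

# Self-attention: the LSTM outputs attend over themselves, so each position
# is re-weighted by its relevance to every other position in the sequence.
attn = tf.keras.layers.Attention()([lstm_out, lstm_out])
pooled = tf.keras.layers.GlobalAveragePooling1D()(attn)
out = tf.keras.layers.Dense(2, activation="softmax")(pooled)

model = tf.keras.Model(text_in, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=[tf.keras.metrics.Recall(class_id=1, name="recall")])
model.fit(pad_train, tf.keras.utils.to_categorical(y_train),
          validation_split=0.2, epochs=20, batch_size=128,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="logs_attn"),
                     tf.keras.callbacks.ModelCheckpoint(
                         "best_attn.h5", monitor="val_recall",
                         mode="max", save_best_only=True)])
```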

The model architecture is as follows:-

The model was compiled with Adam Optimizer with a learning rate of 0.0001, CategoricalCrossentropy loss and the recall score as the metrics to maximize.

The model was fitted on the training data with a batch size of 128.

The Tensorboard epoch loss and recall plot is as follows:-

The model was run for 20 epochs with tensorboard and checkpointing callback.

The validation and test recall for the best model were 0.8451 and 0.8360, respectively.

FINAL RESULTS:-

Summary of the Best Models

From the above table, we can clearly see that the self-attention model built on the GloVe word embeddings wins the game.

CODE AND DATASET:-

The code for the project along with the data can be accessed here:-

LINKEDIN PROFILE:-

My LinkedIn profile can be accessed here:-

REFERENCES:-
