QUORA QUESTION PAIRS SIMILARITY

Md. Wazir Ali
18 min read · Nov 21, 2021

TABLE OF CONTENTS:-

  1. INTRODUCTION TO QUORA
  2. BUSINESS PROBLEM/OBJECTIVE
  3. CONSTRAINTS
  4. DATASET SOURCES AND DESCRIPTION
  5. EXISTING APPROACHES TO THE PROBLEM
  6. MAPPING TO AN ML PROBLEM
  7. EXPLORATORY DATA ANALYSIS
  8. BASIC FEATURE EXTRACTION AND ITS IMPORTANCE
  9. TEXT PREPROCESSING
  10. ADVANCED FEATURE EXTRACTION AND ITS IMPORTANCE
  11. TF-IDF WEIGHTED MEAN VECTOR FEATURES
  12. FINAL FEATURES
  13. BUILDING A RANDOM MODEL
  14. FIRST CUT ML SOLUTIONS
  15. IMPROVEMENTS ON THE PROPOSED ML SOLUTIONS
  16. SUMMARY OF RESULTS
  17. CODE
  18. FUTURE WORK
  19. REFERENCES

INTRODUCTION TO QUORA:-

Quora is an online platform where one can gain and share knowledge by asking questions and getting answers from other people. Many people contribute to Quora by providing solutions/answers to the questions posted on the platform. It is a brilliant platform to ask questions and connect with people for their useful insights and quality answers. Quora is among the 100 most visited websites worldwide, ranking 81st on the list, and it gets over 620 million visits per month.

BUSINESS PROBLEM/OBJECTIVE:-

In this problem, Quora has asked for a solution to predict which questions posted on the website are duplicates of each other. This would help Quora avoid unnecessary searches for answers to duplicate questions and would make it possible to serve instant answers to questions that have already been answered. Our task, then, is to predict whether two questions are duplicates of each other or not.

DATASET SOURCES AND DESCRIPTION:-

This was a competition hosted by Quora on Kaggle in 2016 and closed in 2017. The dataset for this competition can be found here.

The dataset which was given had the following structure:-

  1. id- The id of a training-set question pair.
  2. qid1- The id of the first question of the pair.
  3. qid2- The id of the second question of the pair.
  4. question1- The full text of the first question of the pair.
  5. question2- The full text of the second question of the pair.
  6. is_duplicate- A flag indicating whether the two questions are duplicates of each other or not.

Here’s a snapshot of the first five rows of the data:-

First five question pairs of the training data

There are two datasets, train.csv and test.csv. The questions in train.csv come from Quora itself, whereas test.csv contains computer-generated question pairs. The ground truth is the set of labels supplied by human experts. These labels should be taken as informed but not 100% accurate, and they may include incorrect labelling, since human labelling is a noisy process. The labels, on the whole, represent a reasonable consensus, but this may often not hold on a case-by-case basis for individual items in the dataset.

EXISTING APPROACHES TO THE PROBLEM:-

These are the few existing approaches to the problem:-

  1. The classical machine learning approach, in which handcrafted features are engineered from the question pairs in the dataset, along with some text featurization techniques. Various ML models (linear and tree-based) are then fitted on the features generated from the training data to predict whether the questions are duplicates of each other, by outputting a probability for the duplicate and non-duplicate classes.
  2. The deep learning approach of using word embeddings for the tokens in the questions and then generating a vector representation for each question to compute the label prediction. The label prediction is done in different ways in each of the deep learning approaches; see the Future Work section for pointers to these approaches.

MAPPING TO A ML PROBLEM:-

This problem can be mapped to a supervised machine learning binary classification problem, where the target is a binary class indicating whether the two questions are duplicates of each other or not. Here, we are interested in the probability that the two questions are similar: instead of a hard prediction of 0 or 1, we want a number between 0 and 1 that tells us how similar the two questions are.

PERFORMANCE METRIC:-

1) The performance metric for this problem is Logarithmic Loss, which depends on the predicted probability of a data point belonging to a class. It can be defined as:-

Log Loss or Binary Cross-Entropy Loss:

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,\log p(y_i) + (1-y_i)\,\log\big(1-p(y_i)\big)\Big]$$

In the above formula, the summation goes over all N data points, the y_i's are the actual ground-truth labels, and p(y_i) is the predicted probability that the data point belongs to class 1. More about cross-entropy can be found here, in the Show me the math (really?!) section. This loss function has a single global minimum, i.e. it is a convex function.

2) We would also look at the confusion matrix to assess the performance of the ML model on the data. It gives us measures such as Precision, Recall, TPR, FPR, TNR and FNR.

EXPLORATORY DATA ANALYSIS:-

a) The total number of question pairs given is 404,290.

b) 63.08% of the question pairs (255,027 pairs) are not similar, whereas 36.92% (149,263 pairs) are similar, indicating an imbalanced dataset.

Distribution of target label

c) The total number of unique questions, considering both questions of each pair, is 537,933.

d) The number of unique questions that appear more than once is 111,780 (approximately 20.78%), whereas the number of unique questions that appear just once is 426,153 (approximately 79.22%).

Unique Questions V/S Repeated Questions

e) The maximum number of times a question is repeated is 157.

The x-axis of the plot above represents the number of occurrences (frequency) of a question, and the y-axis represents the number of questions with that frequency.

f) There are no duplicate question pairs (repeated rows) in the data.

g) There are two question pairs in which the second question is a null value, i.e. it is blank; the question id of that blank question is 174,364.

BASIC FEATURE EXTRACTION AND ITS IMPORTANCE:-

Basic Feature Extraction:-

Some basic features extracted from the given data, before cleaning or preprocessing the sentences, are:-

  1. Freq_qid1- Frequency or the number of occurrences of the first question of the pair in the total training data.
  2. Freq_qid2- Frequency or the number of occurrences of the second question of the pair in the total training data.
  3. q1len- The length of the string q1 for the first question description.
  4. q2len- The length of the string q2 for the second question description.
  5. q1_n_words- Number of words in Question 1.
  6. q2_n_words- Number of words in Question 2.
  7. word_Common- Number of unique common words in Question 1 and Question 2.
  8. word_Total- Total Number of unique words in Question 1 and Question 2.
  9. word_share- Ratio of the number of unique common words to the total number of unique words in Question 1 and Question 2 (word_Common / word_Total).
Basic Feature Extraction before preprocessing
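A minimal pandas sketch of how these basic features can be computed (the DataFrame name df is an assumption for illustration; the original code is embedded above as a snippet):-

```python
import pandas as pd

df = pd.read_csv("train.csv")
df["question1"] = df["question1"].fillna("")
df["question2"] = df["question2"].fillna("")

# Frequency of each question id across the whole training data
df["freq_qid1"] = df.groupby("qid1")["qid1"].transform("count")
df["freq_qid2"] = df.groupby("qid2")["qid2"].transform("count")

# Raw string lengths and word counts
df["q1len"] = df["question1"].str.len()
df["q2len"] = df["question2"].str.len()
df["q1_n_words"] = df["question1"].apply(lambda q: len(q.split()))
df["q2_n_words"] = df["question2"].apply(lambda q: len(q.split()))

def common_words(row):
    w1 = set(row["question1"].lower().split())
    w2 = set(row["question2"].lower().split())
    return len(w1 & w2)

def total_words(row):
    w1 = set(row["question1"].lower().split())
    w2 = set(row["question2"].lower().split())
    return len(w1) + len(w2)

df["word_Common"] = df.apply(common_words, axis=1)
df["word_Total"] = df.apply(total_words, axis=1)
# word_Total is > 0 for non-empty question pairs
df["word_share"] = df["word_Common"] / df["word_Total"]

df.to_csv("df_fe_without_preprocessing_train.csv", index=False)
```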

The extracted features are saved to disk in a csv file named df_fe_without_preprocessing_train.csv.

Few Statistics and Importance of the Basic Extracted Features:-

The minimum length of question 1 and question 2 in the training data is 1 word. The number of questions in question 1 with this minimum length is 67, whereas in question 2 it is 24.

Importance of the feature word_share:-

The violin plots and the distribution plot for the feature word_share for the duplicate and non-duplicate question pairs are as follows:-

Violin Plots and Distribution Plots

From the above plots, we can clearly see that the word_share for duplicate question pairs is higher than for non-duplicate pairs. This feature is therefore helpful in separating the duplicate questions from the non-duplicate ones; however, there is still some overlap of word_share between duplicate and non-duplicate questions.

Importance of the feature word_common:-

The violin plots and the distribution plot for the feature word_common for the duplicate and non-duplicate question pairs are as follows:-

From the above plots, we can see that the violin plots of word_common for the duplicate and non-duplicate question pairs overlap heavily. It is therefore not possible to conclude whether word_common is helpful in predicting whether two questions are duplicates of each other.

TEXT PREPROCESSING:-

Preprocessing of the text included the following steps:-

  1. Replacing runs of six zeros with the symbol “m”, indicating a million.
  2. Replacing runs of three zeros with the symbol “k”, indicating a thousand.
  3. Decontracting phrases, e.g. replacing can’t with can not, won’t with will not, what’s with what is, it’s with it is, ’ve with have, i’m with i am, he’s with he is, ’s with own, etc.
  4. Performing a stemming operation on the words to convert them into their root forms using PorterStemmer.
A Single Question Preprocessing Function
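A condensed sketch of such a preprocessing function, assuming nltk's PorterStemmer and a representative subset of the replacements listed above:-

```python
import re
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def preprocess(q):
    q = str(q).lower().strip()
    # Number normalization: six zeros -> 'm', three zeros -> 'k'
    q = re.sub(r"([0-9]+)000000", r"\1m", q)
    q = re.sub(r"([0-9]+)000", r"\1k", q)
    # Decontraction of common phrases
    q = q.replace("can't", "can not").replace("won't", "will not") \
         .replace("what's", "what is").replace("it's", "it is") \
         .replace("i'm", "i am").replace("he's", "he is") \
         .replace("'ve", " have").replace("'s", " own")
    # Keep only letters, digits and spaces
    q = re.sub(r"[^a-z0-9\s]", " ", q)
    # Stem each word to its root form
    return " ".join(porter.stem(w) for w in q.split())
```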

ADVANCED FEATURE EXTRACTION AND IT’S IMPORTANCE:-

We now extract some more advanced features on top of the basic ones, after preprocessing the text as explained above.

Advanced Feature Extraction:-

On top of the basic features extracted for the task, some other, more advanced features were also extracted. These are:-

  1. cwc_min- Ratio of common_word_count (count of common words) to the minimum of the word counts of Q1 and Q2.
  2. cwc_max- Ratio of common_word_count (count of common words) to the maximum of the word counts of Q1 and Q2.
  3. csc_min- Ratio of common_stop_count (count of common stop words) to the minimum of the stop-word counts of Q1 and Q2.
  4. csc_max- Ratio of common_stop_count (count of common stop words) to the maximum of the stop-word counts of Q1 and Q2.
  5. ctc_min- Ratio of common_token_count (count of common tokens) to the minimum of the token counts of Q1 and Q2.
  6. ctc_max- Ratio of common_token_count (count of common tokens) to the maximum of the token counts of Q1 and Q2.
  7. last_word_eq- A Boolean variable indicating whether the last words of both questions are equal.
  8. first_word_eq- A Boolean variable indicating whether the first words of both questions are equal.
  9. abs_len_diff- Absolute difference of the token lengths of Q1 and Q2.
  10. mean_len- Mean of the token lengths of Q1 and Q2.
  11. fuzz_ratio- A measure of the similarity between two strings based on their common characters. For example, New York Mets and New York Meats are 96% similar, differing by just the character ‘a’. It is part of the fuzzywuzzy library in Python.
  12. fuzz_partial_ratio- A measure of the similarity between two strings of different lengths m and n, giving the best fuzz.ratio over substrings of length min(m, n). The function is called fuzz.partial_ratio. For example, Yankees and New York Yankees refer to the same team, with Yankees being the shorter string; the best matching substring of New York Yankees is Yankees itself, so fuzz.partial_ratio is 100 for these two strings. It is also part of the fuzzywuzzy library.
  13. token_sort_ratio- A measure of the similarity between two strings that contain the same tokens but in a different order. The strings are tokenized, the tokens are sorted, and the similarity is computed on the sorted strings. For example, comparing new york mets vs atlanta braves with atlanta braves vs new york mets, the above two measures would give a poor similarity score even though the two strings are essentially the same; token_sort_ratio sorts the tokens of both strings first and would give a similarity of 100 in this case.
  14. token_set_ratio- A measure of the similarity between two strings s1 and s2 in which the strings are tokenized, and the similarity between the sorted intersection of the tokens, and the sorted intersection plus the rest of each string, is calculated; the best of these scores is returned as the result. It is defined as fuzz.token_set_ratio.
  15. longest_substr_ratio- The ratio of the length of the longest common substring to the minimum of the token counts of Q1 and Q2: longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens)). The longest common substring between two strings can be computed using the lcsubstrings method of the distance library. More about the distance library can be found here.

The fuzzywuzzy library is a very easy and convenient way to calculate a similarity score between 0 and 100 for two strings. Please check here, here and here for more on the fuzzywuzzy library. It uses Levenshtein distance to calculate the differences between sequences. Please check this to know more about Levenshtein distance, and this and this to know more about the string-matching ratios (the advanced features discussed above, from fuzz.ratio onwards).

Advanced Feature Extraction of String Similarity
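A minimal sketch of these string-similarity features using the fuzzywuzzy and distance libraries (the function name is an assumption for illustration):-

```python
from fuzzywuzzy import fuzz
import distance

def string_similarity_features(q1, q2):
    # Character/token-level similarity scores from fuzzywuzzy (0-100)
    features = {
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
    }
    # lcsubstrings returns the set of longest common substrings
    subs = list(distance.lcsubstrings(q1, q2))
    longest = len(subs[0]) if subs else 0
    denom = min(len(q1.split()), len(q2.split()))
    features["longest_substr_ratio"] = longest / denom if denom else 0.0
    return features

print(string_similarity_features("new york mets", "new york meats"))
```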

The advanced features are extracted and saved to disk in a csv file named nlp_features_train.csv.

Analysis of the Advanced Features and Their Importance:-

We perform some bivariate and univariate analysis to determine which features, and which pairs of features, are important for distinguishing between the target classes of is_duplicate.

Pair Plots of the features ctc_min, cwc_min, csc_min and token_sort_ratio:-

Pair Plots for cwc_min, csc_min, ctc_min and token_sort_ratio

The above plot reveals that:-

  1. token_sort_ratio along with csc_min, and token_sort_ratio along with cwc_min, separate the two target classes reasonably well, although there are overlaps to some extent.
  2. token_sort_ratio alone is capable of separating the duplicate and non-duplicate classes, although there is some overlap between the two.

PDFs and Violin Plots:-

Token_Sort_Ratio:-

Violin Plots and PDF for token_sort_ratio

The two plots reveal that token_sort_ratio separates the duplicate and non-duplicate classes pretty well. Both the PDF and the violin plot show that duplicate question pairs have a higher token_sort_ratio than non-duplicate pairs, but there is a significant amount of overlap, denoted by the violet colour.

Fuzz_Ratio:-

Violin Plots and PDF for fuzz_ratio

The two plots reveal that the violin plot and PDF for fuzz_ratio separate the duplicate and non-duplicate questions pretty well, with the fuzz_ratio of duplicate questions being higher than that of non-duplicate questions.

Word Clouds:-

Duplicate Question Pairs:-

Word Cloud for Duplicate Question Pairs

Non-Duplicate Question Pairs:-

Word Cloud for Non-Duplicate Question Pairs

Code Snippet for Word Cloud:-
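A minimal sketch using the wordcloud library, assuming df is the training DataFrame loaded earlier (shown for the duplicate class; the non-duplicate cloud is analogous):-

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Concatenate all questions of the duplicate pairs into one corpus
dup = df[df["is_duplicate"] == 1]
text = " ".join(dup["question1"].astype(str)) + " " + \
       " ".join(dup["question2"].astype(str))

wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=set(STOPWORDS)).generate(text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```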

T-SNE Visualization of all the 15 advanced features:-

All 15 advanced features are converted into two dimensions with the help of t-distributed stochastic neighbour embedding, popularly known as t-SNE, after scaling them with MinMaxScaler to bring all the features onto the same scale.
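A minimal sketch of this scaling + t-SNE step, assuming X_adv holds the 15 advanced features and y the is_duplicate labels:-

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Scale all features to [0, 1], then embed into 2D
X_scaled = MinMaxScaler().fit_transform(X_adv)
X_2d = TSNE(n_components=2, perplexity=30, n_iter=1000,
            random_state=42).fit_transform(X_scaled)

sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=y, s=10)
plt.title("t-SNE of the 15 advanced features")
plt.show()
```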

The results of t-SNE, after running for 1,000 iterations with perplexity 30, look like this:-

T-SNE Plot

Please read more about t-SNE here, here and here.

We can see overlaps between the two classes in the above plot, but we can still clearly see clusters of red and blue points. This signifies that the engineered features, when combined, are fairly helpful in predicting duplicate question pairs.

TF-IDF WEIGHTED MEAN VECTOR FEATURES:-

  1. The question pairs were merged into one list of unique questions to generate the vocabulary and the sparse matrix of tf-idf values of the unique words (tokens) in the data.
  2. The mapping of words to idf values is stored in a dictionary.
  3. The small spacy pipeline ‘en_core_web_sm’ (the packages ending in sm) is used to get context-sensitive vectors (instead of plain word vectors) for each token of both question1 and question2.
  4. The mean of the product of the token vectors and their idf values is taken, giving a single n-dimensional vector per question, where n is 384 in this case.

Code Snippet for Vectorizing Text:-
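A minimal sketch of this vectorization step with spacy and scikit-learn (df and the function name are assumptions; on older scikit-learn, use get_feature_names instead of get_feature_names_out):-

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

# Fit tf-idf on the combined list of questions and keep a word -> idf map
questions = list(df["question1"].astype(str)) + list(df["question2"].astype(str))
tfidf = TfidfVectorizer()
tfidf.fit(questions)
word2idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

dim = nlp("test")[0].vector.shape[0]  # vector size of the spacy model

def tfidf_weighted_mean_vector(question):
    doc = nlp(question)
    # Weight each token's vector by the idf of the word, then average
    vecs = [tok.vector * word2idf.get(str(tok), 0.0) for tok in doc]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```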

FINAL SET OF FEATURES:-

The basic features, advanced features and tf-idf weighted mean word vectors described above were merged into a final file known as final_features.csv and saved to disk.

DATA SPLIT:-

A subset of 100,000 data points is taken and split into training and test data in a 70:30 ratio, stratified on the is_duplicate class labels so that both splits have the same class proportions. This data doesn’t contain the question pairs themselves but the features derived from them, including all of the basic features, the advanced features and the tf-idf weighted mean word vectors (of both question 1 and question 2). The feature set consists of a total of 794 features.
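A minimal sketch of such a stratified split with scikit-learn, assuming X is the 794-column feature matrix and y the labels:-

```python
from sklearn.model_selection import train_test_split

# 70:30 split that preserves the is_duplicate class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```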

BUILDING A RANDOM MODEL:-

A random model in this case predicts the test labels randomly: for every test data point it assigns a random probability of belonging to the positive class of interest (duplicate) and to the negative class (non-duplicate).

To achieve this, we generate two random numbers from a uniform distribution between 0 and 1, and then convert them into probabilities by dividing each by the sum of the two.

Code snippet:-
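A minimal sketch of this random-probability generation, assuming y_test holds the test labels from the split above:-

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)

# Two uniform random numbers per test point, normalized to sum to 1
raw = rng.random((len(y_test), 2))
predicted_probs = raw / raw.sum(axis=1, keepdims=True)

print("Random model log loss:", log_loss(y_test, predicted_probs))
```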

Log-Loss and Confusion Matrix:-

Log Loss and Confusion Matrix for the Random Model

The log loss for the random model is about 0.8872, so this is the benchmark for any model we build: our models should at least have a log loss lower than 0.8872.

FIRST CUT ML SOLUTIONS:-

a) LOGISTIC REGRESSION WITH HYPERPARAMETER TUNING:-

The hyperparameter alpha, which is the weight given to regularization, was searched over the space 10^-5 to 10; the log loss for each value is as follows:-

Alpha Values and Test Log-Loss

The best value of alpha for which the test log loss is the minimum is 1.

Log-Loss and the Confusion Matrix:-

The train log loss for alpha = 1 is 0.5138 whereas the test log loss is 0.5200. The confusion matrix plot looks somewhat like this:-

Confusion Matrix

Concerns:-

One of the major concerns is that the recall for class 2 (the duplicate questions) is very low, close to 50%.

Note :- SGDClassifier with log loss was used for Logistic Regression from sklearn’s linear_model module.
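A minimal sketch of this tuning loop, assuming X_train/X_test and y_train/y_test from the earlier split (the probability calibration shown is a common companion to SGDClassifier and an assumption here):-

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# loss="log" is called "log_loss" in newer scikit-learn versions
for alpha in [10 ** i for i in range(-5, 2)]:
    clf = SGDClassifier(loss="log", alpha=alpha, penalty="l2", random_state=42)
    # Sigmoid calibration gives well-behaved probability estimates
    model = CalibratedClassifierCV(clf, method="sigmoid")
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    print(f"alpha={alpha}: test log loss = {log_loss(y_test, probs):.4f}")
```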

b) LINEAR SVM WITH HYPERPARAMETER TUNING:-

The hyperparameter alpha, which is the weight given to regularization, was searched over the space 10^-5 to 10; the log loss for each value is as follows:-

Alpha Values and Test Log-Loss

Log-Loss and the Confusion Matrix:-

The train log loss for alpha = 0.0001 is 0.4780 whereas the test log loss is 0.4896. The confusion matrix plot looks somewhat like this:-

Confusion Matrix

Concerns:-

One of the major concerns is that the recall for class 2 (the duplicate questions) is very low, close to 49%.

Note :- SGDClassifier with hinge loss was used for Linear SVM from sklearn’s linear_model module.
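A minimal sketch for the hinge-loss case; since hinge loss provides no predict_proba, a calibration wrapper (an assumption here) is the usual way to obtain the probabilities needed for log loss:-

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

# Wrap the linear SVM in a sigmoid (Platt) calibrator for probabilities
svm = SGDClassifier(loss="hinge", alpha=0.0001, penalty="l2", random_state=42)
model = CalibratedClassifierCV(svm, method="sigmoid")
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)
```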

c) XGBOOST WITH RANDOMLY CHOSEN HYPERPARAMETERS:-

XGBOOST was run on the train and tested on the validation set with:-

  1. Log loss as the evaluation metric,
  2. a learning rate of 0.02,
  3. a max depth of 4, and
  4. the objective set to binary:logistic.

The algorithm was run for 400 iterations, with early stopping set to 20 rounds: training stops if the validation log loss does not improve for 20 consecutive iterations. The validation metric was printed after every 10 iterations.
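A minimal sketch of this setup with the xgboost library (variable names assumed from the earlier split):-

```python
import xgboost as xgb

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "eta": 0.02, "max_depth": 4}

d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)

# Stop early if the validation log loss does not improve for 20 rounds;
# print the metric every 10 rounds
bst = xgb.train(params, d_train, num_boost_round=400,
                evals=[(d_train, "train"), (d_test, "valid")],
                early_stopping_rounds=20, verbose_eval=10)
```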

After 400 iterations, the log loss on the test set (which also served as the validation set) came out to be 0.3570.

Log-Loss and the Confusion Matrix:-

The train log loss after 400 iterations was 0.3455 whereas the test log loss is 0.3570. The confusion matrix plot looks somewhat like this:-

Confusion Matrix

Improvement:-

The recall for the duplicate pair class improved to almost 69%, from close to 50% with the above linear models.

IMPROVEMENTS ON THE PROPOSED ML SOLUTIONS:-

The first cut ML solutions were improved in the following ways:-

  1. The question pairs were vectorized using a TF-IDF vectorizer, and Logistic Regression and Linear SVM were fitted on these vectors along with the basic and advanced features.
  2. The XGBoost model was fitted on the earlier tf-idf weighted mean Word2Vec vectors of the question pairs along with the basic and advanced features, but this time hyperparameter tuning was done.

LOGISTIC REGRESSION AND ITS RESULTS ON THE TF-IDF VECTORIZER:-

Code for Preprocessing :-

Vectorizing the questions after Training, Validation and Test Split and Preprocessing:-

Preprocessed Questions

Basic and Advanced Feature Split into Train, Validation and Test:-

Final Feature Matrix:-
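A minimal sketch of assembling such a matrix with scipy, where the matrix and array names are assumptions for illustration:-

```python
from scipy.sparse import hstack

# TF-IDF matrices of question1/question2 plus the dense handcrafted
# (basic + advanced) features, stacked into one feature matrix
X_train_final = hstack((q1_tfidf_train, q2_tfidf_train, dense_train)).tocsr()
X_test_final = hstack((q1_tfidf_test, q2_tfidf_test, dense_test)).tocsr()
```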

Model Fitting and Hyperparameter Tuning:-

The Plot for Log Loss and the Results of Hyperparameter Tuning

Code for Confusion Matrix:-
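A minimal sketch of such a confusion-matrix plot with seaborn (the two-panel layout is an assumption):-

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    # Row-normalized matrix shows the recall of each class
    recall = cm / cm.sum(axis=1, keepdims=True)
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.heatmap(cm, annot=True, fmt="d", ax=axes[0]).set_title("Counts")
    sns.heatmap(recall, annot=True, fmt=".3f", ax=axes[1]).set_title("Recall")
    plt.show()
```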

Results:-

The hyperparameter alpha, which is the regularization strength, was searched in the space of 10^-5 to 10.

  1. The best hyperparameter alpha was found to be 10^-5.
  2. The train log loss was 0.3893 and the test log loss was 0.3949.
Confusion Matrix

Improvement:-

  1. The Recall for the duplicate question pair class has improved from close to 0.50 to 0.664.

LINEAR SVM RESULTS ON THE TF-IDF VECTORIZER:-

Code for Model Fitting and Hyperparameter Tuning:-

The Plot for Log Loss and the Results of Hyperparameter Tuning

Results:-

The hyperparameter alpha, which is the regularization strength, was searched in the space of 10^-5 to 10.

  1. The best hyperparameter alpha was found to be 10^-5.
  2. The train log loss was 0.3774 and the test log loss was 0.3837.
Confusion Matrix

Improvement:-

  1. The Recall for the duplicate question pair class has improved from close to 0.50 to 0.679.

XGBOOST RESULTS WITH HYPERPARAMETER TUNING ON MEAN TFIDF WEIGHTED W2V:-

Code for Hyperparameter Tuning and Model Fitting:-
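A minimal sketch of this grid search with scikit-learn's GridSearchCV, matching the search space described below (variable names assumed):-

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"n_estimators": [5, 10, 50, 75, 100],
              "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]}

# Negative log loss as the scoring function (higher is better)
search = GridSearchCV(XGBClassifier(objective="binary:logistic"),
                      param_grid, scoring="neg_log_loss", cv=3)
search.fit(X_train, y_train)
print(search.best_params_)  # reported best: n_estimators=100, learning_rate=0.2
```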

HeatMap for Negative Log Loss

The best hyperparameters came out to be :-

  1. n_estimators = 100
  2. learning_rate = 0.2
Best XGBoost Model

Results:-

The hyperparameter n_estimators was searched over 5, 10, 50, 75 and 100, whereas the hyperparameter learning_rate was searched over 0.0001, 0.001, 0.01, 0.1, 0.2 and 0.3.

  1. The best hyperparameter n_estimators was found to be 100 and the best hyperparameter learning_rate was found to be 0.2.
  2. The training log loss for the best XGBoost model was 0.2158 and the test log loss for the best XGBoost model was 0.3413.
Confusion Matrix

Improvement:-

The XGBoost model with tuned hyperparameters reduced the test log loss from 0.3570 to 0.3413 and also improved the recall of the duplicate question pair class from 0.69 to close to 0.74.

SUMMARY OF RESULTS:-

The summary of all the ML models fitted with different featurization techniques is as follows:-

Result Summary (compiled from the figures reported in the sections above):

Model (featurization)                                          Test Log Loss   Recall (duplicate)
Random model                                                   0.8872          n/a
Logistic Regression (basic + advanced + TF-IDF weighted W2V)   0.5200          ~0.50
Linear SVM (basic + advanced + TF-IDF weighted W2V)            0.4896          ~0.49
XGBoost (basic + advanced + TF-IDF weighted W2V)               0.3570          ~0.69
Logistic Regression (basic + advanced + TF-IDF vectors)        0.3949          0.664
Linear SVM (basic + advanced + TF-IDF vectors)                 0.3837          0.679
XGBoost, tuned (basic + advanced + TF-IDF weighted W2V)        0.3413          ~0.74
  1. The XGBoost model with hyperparameter tuning performed the best, achieving the minimum log loss on the basic + advanced features along with the TF-IDF weighted mean Word2Vec vectors.
  2. The linear models on the TF-IDF vectors of the question pairs, along with the handcrafted basic and advanced features, performed better than the linear models on the TF-IDF weighted mean Word2Vec vectors.

CODE:-

The code for the project can be found below:-

FUTURE WORK:-

  1. Future work in this context can be directed at a decomposable attention model for Natural Language Inference. This approach uses attention to decompose the problem into subproblems that can be solved separately, and is based on word alignments and the detection of contradictory sentences. Please check the second reference under the References section for the research paper.
  2. Work has also been done on the Stanford Natural Language Inference dataset, where sequential inference models based on chained LSTMs have outperformed all previous models. Please check the third reference under the References section for the research paper.

REFERENCES:-

  1. www.appliedaicourse.com
  2. [1606.01933] A Decomposable Attention Model for Natural Language Inference (arxiv.org)
  3. [1609.06038] Enhanced LSTM for Natural Language Inference (arxiv.org)
