MERCARI PRICE PREDICTION CHALLENGE
Today, automation and applications built on Machine Learning solutions have become the state of the art. This case study is based on a challenge launched by Japan’s online marketplace Mercari on Kaggle in 2018, in which they asked for a solution to predict the price of the products put up on their portal.
TABLE OF CONTENTS:-
1.INTRODUCTION
2. BUSINESS PROBLEM
3. ML/DL PROBLEM MAPPING
4. UNDERSTANDING THE DATA
5. EXISTING APPROACHES/SOLUTIONS
6. FIRST CUT SOLUTION
7. EDA
8. FEATURE ENGINEERING
9. MODELLING
10. RESULTS AND DEPLOYMENT
11. CODE REFERENCES
12. CONCLUSIONS AND FUTURE WORK
13. PROFILE
14. REFERENCES
1. INTRODUCTION:-
This section gives a brief introduction to the project. As the name suggests, the challenge is to predict the price of products, specifically the products put up on the online portal of a company named Mercari Inc. Before getting into more details, here is a brief description of the company.
1.1. ABOUT MERCARI:-
i) It is a Japanese e-commerce company founded in February 2013 by the Japanese serial entrepreneur Shintaro Yamada.
ii) When founded, it was named Kouzoh, Inc.
iii) It currently operates in Japan, the UK and the United States.
iv) The Mercari app was launched for iOS and Android devices in July 2013.
v) Mercari has since grown to become Japan’s largest community-powered marketplace, with over JPY 10 billion in transactions carried out on the platform each month.
vi) Mercari has the largest market share among the country’s many community marketplace apps, with 94% of Japanese users of such apps using Mercari.
vii) Features such as Mercari Channel (live streaming e-commerce) and the Mercari NOW service, which allows users to instantly receive cash for their items, have contributed to the app’s widespread success.
viii) Mercari expanded to the United States in 2014 and the United Kingdom in 2016.
ix) The Mercari app has been downloaded over 100 million times worldwide (as of 16 December 2017).
1.2. USAGE OF ML / DL :-
Considering the huge demand for the Mercari app, the number of downloads and the volume of transactions, there is a need to automate price prediction so that a suitable price is suggested within a short time of a product being put up on the portal. One recommended approach is to develop an application with a Machine Learning or Deep Learning model running in the backend, which captures the interactions between the various attributes or features of a product in order to predict its price.
2. BUSINESS PROBLEM :-
As far as the business problem is concerned, it is to predict the prices of the products which the sellers put up on the Mercari app.
2.1. CHALLENGES:-
It is quite tough to predict the prices of the products put up on the app, as the sellers are free to put up anything on Mercari’s online portal. If the predicted price is below the fair price for the product, it is a loss for the seller, and if the predicted price is above what the current condition of the product justifies, it is a loss for the buyer. Getting close to the actual price of a product, given its current condition, is very difficult. It is a well-known fact that both first-hand and second-hand products are put up on the portal in various conditions.
2.2. USE CASE:-
Mercari is an online marketplace where products are put up for sale. The right price would encourage products to be reused again and again for as long as they are in reasonably good condition. The description of the products put up matters a lot in this case.
3. ML/DL PROBLEM MAPPING:-
As far as mapping the business problem to an ML problem is concerned, it is a regression problem, since the quantity being predicted, the price of a product, is continuous in nature. The details of the dataset are given in the upcoming section; this historical data is used to predict the prices of products that will be put up on the portal in the near future.
3.1. BUSINESS CONSTRAINT:-
The business constraint in developing this application is the response time within which the price is predicted. The prediction should not take a minute or more; it should be restricted to a few seconds, 30 at most.
3.2. ERROR METRIC:-
The evaluation metric, or error metric, which determines the performance of the algorithm is the Root Mean Squared Logarithmic Error (RMSLE), mathematically given by:
RMSLE = sqrt( (1/n) * Σ_{i=1..n} ( log(p_i + 1) − log(a_i + 1) )² )
where n is the number of products, p_i is the predicted price and a_i is the actual price of the i-th product.
In this metric, I take the log of the predicted price after adding 1 to it, and similarly the log of the actual price after adding 1 to it, and subtract one term from the other. This difference can also be interpreted as the log of the ratio of (1 + predicted price) to (1 + actual price). I square this difference for every product, sum the squares over all products, divide by the number of products (i.e. average the squared errors) and finally take the square root.
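As a quick illustration, here is a minimal sketch of how this metric can be computed with NumPy (the array names are placeholders):
```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), matching the "+1 before the log" in the formula
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# example: actual vs. predicted prices for three products
print(rmsle([10.0, 35.0, 100.0], [12.0, 30.0, 150.0]))
```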
3.3. ADVANTAGES OF THE ERROR METRIC:-
1) This error metric makes the evaluation robust to outliers: since I take the log of the predictions and the actuals, large values are scaled down, so errors on the few very expensive items do not dominate the final metric. This is one advantage over RMSE (Root Mean Squared Error), where I take the square root of the average of the squared differences between the actual and predicted values.
2) The second advantage of this error metric over RMSE is that it penalizes underestimation of the actual value more than overestimation of the actual value.
4. UNDERSTANDING THE DATA:-
4.1. SOURCES OF DATA:-
The data for this project is available at https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
There are three datasets given for this competition. They are :-
i) train.tsv.7z — A compressed 7z tab-separated file of size 74.3 MB which, when uncompressed, is close to 329 MB.
ii) test.tsv.7z — A compressed 7z tab-separated file of size 33.97 MB which, when uncompressed, is close to 150 MB.
iii) test_stg2.tsv.zip — A compressed zip tab-separated file of size 294.37 MB which, when uncompressed, is close to 736 MB.
The first file is the training data: product information along with the price, which I use for analysis, exploration and building models so that I can predict the price of any future unseen product. The other 2 files contain just the product information without the price and are meant for generating predictions to check the performance of the model which is built.
4.2. DESCRIPTION OF THE DATA:-
The files have the following columns :-
i) product_id — The unique id of the product.
ii) name — The name of the product/item.
iii) item_condition_id — The condition of the product provided by the seller.
iv) category_name — the name of the category of the product, i.e. the industry to which the product belongs.
v) brand_name — the brand of the product.
vi) shipping — A variable denoting whether the shipping charge is to be paid by the seller or not.
vii) item_description — the full description of the product, giving details about its quality and condition.
In the test file, the product_id is replaced with test_id. Also, apart from the above seven features for a product, there is a feature called price in the train.tsv file which gives the price of the product.
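As a quick illustration, the files can be loaded with pandas once the archives are extracted (file names as given above):
```python
import pandas as pd

# the files are tab-separated, so sep="\t" is required
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

print(train.shape, test.shape)
print(train.columns.tolist())   # id, name, item_condition_id, category_name, brand_name, price, shipping, item_description
print(train.head())
```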
5. EXISTING APPROACHES/SOLUTIONS:-
Here is a brief overview of a good existing approach to building models for this case study.
5.1. ML SOLUTION:-
The modelling approaches mainly focused on vectorizing the product name with a count vectorizer and the pre-processed item description with a TF-IDF vectorizer, to get the tf-idf values of the n-grams in the pre-processed item description. The item description was pre-processed by removing punctuation, stopwords and numbers, and keeping only those words that are more than 3 characters in length.
The other categorical features such as brand, item condition, shipping and the three category names are either one hot encoded or label encoded.
The ML algorithms which are applied on the vectorized text data and the one hot encoded or label encoded categorical data are: -
i) LightGBM.
ii) Random Forest Regressor.
iii) Ridge Regressor.
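As a rough illustration, a compact sketch of this kind of pipeline, assuming the train dataframe loaded earlier and using a Ridge regressor as the estimator (the vectorizer parameters here are illustrative, not the exact ones from the referenced solutions), might look like this:
```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

# fill missing text/categorical values before vectorizing
for col in ["name", "item_description", "brand_name", "category_name"]:
    train[col] = train[col].fillna("missing")

X_name = CountVectorizer(min_df=10).fit_transform(train["name"])
X_desc = TfidfVectorizer(max_features=50000, ngram_range=(1, 2)).fit_transform(train["item_description"])
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(
    train[["brand_name", "category_name", "item_condition_id", "shipping"]])

X = hstack([X_name, X_desc, X_cat]).tocsr()
y = np.log1p(train["price"])          # the models are trained on the log price

# in practice the vectorizers and model are fitted on a training split only
model = Ridge(alpha=1.0).fit(X, y)
preds = np.expm1(model.predict(X))    # back to the original price scale
```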
5.2. DL SOLUTION:-
There was also a deep learning solution to the modelling, which focused on the interaction between various terms. They are :-
i) Concatenating the product name which is the name and the brand name after imputing the null values of the individual columns with spaces.
ii) Concatenating the item description, name and the category name after imputing the null values of the individual columns with spaces.
Vectorizing the concatenated features using TF-IDF for both, but with only unigrams for the first feature and both unigrams and bigrams for the second feature. Other than these features, the item condition and shipping variables are one hot encoded.
The above stacked features are passed to a simple MLP with the following architecture: -
i) A dense layer of 256 neurons with relu activation.
ii) A dense layer with 64 neurons with relu activation.
iii) A dense layer with 64 neurons with relu activation.
iv) A dense layer with 32 neurons with relu activation.
v) The output layer with 1 neuron for the price without any activation.
The above model was compiled with Adam optimizer with a learning rate of 3e-3 and loss as mean squared error.
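A minimal Keras sketch of this MLP, assuming the stacked features have already been built and input_dim is their dimensionality (a placeholder value below), could look like this:
```python
from tensorflow.keras import layers, models, optimizers

def build_mlp(input_dim):
    model = models.Sequential([
        layers.Dense(256, activation="relu", input_shape=(input_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1),    # linear output for the (log) price
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=3e-3),
                  loss="mean_squared_error")
    return model

mlp = build_mlp(input_dim=100000)   # placeholder feature dimension
mlp.summary()
```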
All the above models, be it ML or DL, focused on predicting the log price instead of the original price, which scales the target down; the models are then trained to minimize the mean squared error. After training, the predictions are brought back to the original scale by taking the exponential of the predictions, and then the RMSLE of the model is evaluated.
6. FIRST CUT SOLUTION:-
The first cut solution to the problem was to thoroughly explore the data, split it into train, test and cross validation sets, and then featurize the existing features: preprocessing the text, vectorizing it through TF-IDF while ensuring no data leakage, one hot encoding or label encoding the categorical data, and using the numerical features as they are. After featurization, the next step was hyperparameter tuning on the cross validation data for various ML algorithms such as the Ridge Regressor, Random Forest Regressor and XGBoost Regressor. The best models were then fitted on the featurized data and their performance was gauged using the RMSLE on the test data. These models were also evaluated on the unseen test data provided during the competition by submitting on the Kaggle platform and checking the results.
7. EDA: -
The EDA or the exploratory data analysis of the mercari price prediction challenge included the following points:-
i) Reading the data and getting its descriptive statistics along with the number of rows and the number and types of columns.
ii) Checking the number of unique categories and brands
iii) Checking the null values for each column
iv) Checking the distribution of the target variable price
Original Price:-
Log transformed Price:-
Observation:-
The distribution of the original price variable is heavily skewed; there are very few items in the training data with high-end prices, so errors on high-end prices contribute relatively little to the evaluation.
v) Checking the distribution of the price for the products with and without shipping.
Original Price:-
Log Transformed Price:-
Observation:-
The distribution suggests that, for higher-priced products, the shipping charge tends to be paid by the seller. This holds for both the original price and the log of the price, as expected, because log is a monotonically increasing function.
vi) Distribution plot for logarithm of price variable
vii) Plot for the top 20 categories with most number of products.
viii) Brand Name imputation
The rationale behind imputing brand names was to look at the product name and search within it for a brand name from the list of brand names given as part of the training data.
In the imputation process, a function was defined which searches the product name for a brand from the unique set of brands. Earlier exploration showed that the maximum length of a brand name is 8 words, so the search continues up to the 8th word of the product name.
After imputation, there was a significant reduction in the null values of the brand name: the number of null values reduced from 632,682 to 545,138.
After the imputation, the remaining brand names with null values were filled out with No Brand Name.
The function for missing brand name imputation can be found here:-
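A minimal sketch of such an imputation function, assuming a set all_brands built from the non-null brand names in the training data (the exact search logic in the original gist may differ), could look like this:
```python
import numpy as np
import pandas as pd

all_brands = set(train["brand_name"].dropna().unique())

def impute_brand(name, brand):
    """If the brand is missing, look for a known brand inside the product name."""
    if pd.notna(brand) and str(brand).strip():
        return brand
    words = str(name).split()
    # a brand name can be up to 8 words long, so try candidate spans up to that length
    for size in range(min(8, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in all_brands:
                return candidate
    return np.nan

train["brand_name"] = [impute_brand(n, b) for n, b in zip(train["name"], train["brand_name"])]
train["brand_name"] = train["brand_name"].fillna("No Brand Name")
```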
ix) Plot for the top 20 brands with most number of products
Observation:-
a) The number of products with No Brand Name is humongous, exceeding 0.55 million products.
b) The top 3 most popular brands are Nike, Pink and Victoria’s Secret.
x) Plot for the average price for the top 20 categories.
xi) Plot for the average price of top 20 brands.
Observation:-
a) The top 3 costliest brands if we consider the average price of the product are Proenza Schouler, Auto Meter and Oris.
xii) Separating the 3 different subcategories from the category name
The function category_split uses a try-except block to separate the sub-categories by splitting the category name on “/”; if it cannot, it returns the three strings No Label, No Label and No Label.
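A minimal sketch of such a function, assuming the pandas columns described earlier:
```python
def category_split(category_name):
    """Split e.g. 'Women/Tops & Blouses/T-Shirts' into (general_cat, subcat_1, subcat_2)."""
    try:
        general_cat, subcat_1, subcat_2 = category_name.split("/")[:3]
        return general_cat, subcat_1, subcat_2
    except Exception:
        # missing or malformed category names fall back to placeholder labels
        return "No Label", "No Label", "No Label"

train["general_cat"], train["subcat_1"], train["subcat_2"] = zip(
    *train["category_name"].apply(category_split))
```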
xiii) Plot for the number of products of the general category
Observation:-
a) The category women has the highest number of products.
b) After the Women category, the categories of Beauty, Kids and Electronics are the most frequently put up on the portal.
xiv) Plot for the top 20 subcategory1 products put up on the portal
Observation:-
a) The top 3 first sub-category products put up on the portal are Athletic Apparel, Makeup and Tops & Blouses.
xv) Plot for the top 20 second subcategory products put up on the portal.
Observation:-
a) The top 3 second subcategory products put up on the portal are Pants, Tights, Leggings, Other and Face.
xvi) The box plot of log price of various general categories.
Observation:-
a) The prices across various general categories overlap and are not well separated for the sample of data given.
xvii) I double checked the distinction in the log of price between the various levels of the general category and sub-category 1 with the help of a one-way ANOVA. I used the ols module from statsmodels.formula.api to define the model and statsmodels.api to generate the ANOVA results.
The code for performing one-way anova can be found here:-
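A minimal sketch of a one-way ANOVA with statsmodels, assuming the general_cat and price columns prepared above (the exact call in the original notebook may differ):
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

anova_df = train[["general_cat", "price"]].copy()
anova_df["log_price"] = np.log1p(anova_df["price"])

# one factor (the general category), log price as the response
model = ols("log_price ~ C(general_cat)", data=anova_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)   # F-statistic and p-value for the general category factor
```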
One- Way Anova Results :-
Observation:-
a) The F-value, which is the ratio of the between-group variance to the within-group variance, is much greater than 1; the between-group variance dominates the within-group variance.
b) So, based on the above observation, we conclude that there is at least one general category whose products’ mean price is different from the mean prices of the other 10 categories.
The code for one-way anova can be found here:-
One-Way Anova Results:-
Observation:-
a) The F-value, which is the ratio of the between-group variance to the within-group variance, is much greater than 1; the between-group variance dominates the within-group variance.
b) So, based on the above observation, we conclude that there is at least one sub-category 1 whose products’ mean price is different from the mean prices of the other 113 sub-category 1 levels.
Note:- The one-way ANOVA could not be completed for sub-category 2 and brands, as the RAM crashed due to the high number of unique levels.
xviii) The association of item description length engineered from item description with the price and log_price
Price: -
There is a typo on the y-axis of the above graph: it should read price instead of log_price. The scatter plot shows a trend that, with increasing length of the product description, the prices of the products decrease, irrespective of the categories or brands to which the products belong.
Log Price: -
xix) Checking the average price and log_price for the item description length
Price:-
Observation:-
a) The average price initially rises with the length of the item description and then starts declining; for very lengthy descriptions, the average price fluctuates a lot.
Log Price: -
Observation:-
a) The average log price initially rises with the length of the item description and then starts declining; for very lengthy descriptions, the average log price fluctuates a lot.
xx) Checking the average distance of products in the cheap category of Nike, Pink and Victoria’s Secret from other products in the cheap, affordable and expensive categories of the 3 brands, according to the TF-IDF vectorization.
For Nike Products:-
For Pink Products:-
For Victoria’s Secret Products:-
In each of the above box plots, each box represents the average distances from products in the cheap category to products in the cheap, affordable and expensive categories.
I saw that there is a pattern in price determined by these product categories and guided by the item descriptions of products for various brands, as the box plots show a clear separation of prices across these three categories, which are derived from the quantiles of the given prices.
Note :- For each of the brands, a sample of products is taken from the whole of the cheap, affordable and expensive categories.
xxi) Word Clouds for the top 4 general categories
Women:
Beauty:
Kids:
Electronics:
xxii) Checking the impact of the shipping variable on price for each of the brands Nike, Pink and Victoria’s Secret in each of the cheap, affordable and expensive categories. These were the top 3 brands common to the 3 product categories.
Cheap Products:-
Nike:-
Pink:-
Victoria’s Secret :-
Affordable Products:-
Nike :-
Pink :-
Victoria’s Secret:-
Expensive Products:-
Nike:-
Pink:-
Victoria’s Secret:-
Observation:-
a) The three categories of all three brands, Nike, Pink and Victoria’s Secret, behave similarly with respect to the shipping variable.
b) The cheap and expensive category products of the 3 brands show a pattern in which the shipping charge is borne by the seller, but the affordable category of the 3 brands shows no such pattern as far as prices are concerned.
xxiii) The association of Item condition id with the price for these categories of products of these brands.
Boxplot for Nike Cheap Products:-
Boxplot for Pink Cheap Products:-
Boxplot for Victoria’s Secret Cheap Products:-
Boxplot for Nike Affordable Products:-
Boxplot for Pink Affordable Products:-
Boxplot for Victoria’s Secret Affordable Products:-
Boxplot for Nike Expensive Products:-
Boxplot for Pink Expensive Products:-
Boxplot for Victoria’s Secret Expensive Products:-
Observation:-
I see that shipping and the item condition id are not the key differentiators of the cheap, affordable and expensive product categories for the three brands. However, the shipping variable behaves similarly in each of the product categories at a brand level.
xxiv) Checking the association between the numerical variables through the correlation matrix.
The above correlation matrix shows that the price is not linearly correlated with any of the numerical variables (item description length, pos, neg, neu and compound) or with the categorical variables shipping and item condition.
Conclusions :-
a) The EDA concludes that within each brand there are cheap, affordable and expensive categories of products, which are a key differentiator of prices, and the prices are somehow related to the item description.
b) The general category and sub-category 1 separate out the average prices, with at least one category or sub-category whose mean price is quite different from the rest.
c) The sentiment scores of the item description are only moderately correlated with the price.
d) There is an underlying relationship between the features of a product that needs to be approximated using either ML or DL models.
8. FEATURE ENGINEERING:-
The feature engineering part consists of coming up with the following features and testing their association with the target variable price.
i) Item Description Length :-
The length of the item description was calculated after removing punctuation, numbers and stopwords. The function count_of_words takes a text argument, pre-processes it, removes the stopwords and filters out words whose length is less than 3.
I apply the function to each element of the item_description feature and store the result in a separate column, item_description_Length.
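A minimal sketch of such a function, assuming NLTK’s English stopword list (the original gist may differ in detail):
```python
import re
from nltk.corpus import stopwords   # nltk.download("stopwords") may be needed once

STOPWORDS = set(stopwords.words("english"))

def count_of_words(text):
    """Count the useful words: no punctuation, numbers or stopwords, length >= 3."""
    if not isinstance(text, str):
        return 0
    text = re.sub(r"[^a-zA-Z ]", " ", text.lower())   # drop punctuation and numbers
    words = [w for w in text.split() if w not in STOPWORDS and len(w) >= 3]
    return len(words)

train["item_description_Length"] = train["item_description"].apply(count_of_words)
```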
ii) Pre-processed item description after removal of punctuations, numbers and stopwords.
a) There are two functions, decontracted and preprocess, which take as inputs a sentence and an array of texts respectively. These two functions clean the item description by removing stopwords, punctuation, numbers and words whose length is less than or equal to 3.
b) As the training data contains approximately 1.4 million rows, the pre-processing was done in chunks of 0.5 million, and each time the pre-processed text was saved to a csv file.
iii) Sentiment Score of the preprocessed item description:-
The sentiment score of the preprocessed item description gives the polarity of the description in terms of its positivity, negativity and neutrality, along with a combined compound score for each product. The SentimentIntensityAnalyzer and the vader_lexicon package were used, and the four scores were added to the data frame as four columns for each product.
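A minimal sketch using NLTK’s VADER analyzer, assuming the pre-processed descriptions are stored in a column named preprocessed_description (a placeholder name):
```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # nltk.download("vader_lexicon") may be needed once

sia = SentimentIntensityAnalyzer()
scores = train["preprocessed_description"].fillna("").apply(sia.polarity_scores)

# polarity_scores returns {"neg": ..., "neu": ..., "pos": ..., "compound": ...}
sentiment_df = pd.DataFrame(list(scores))
train = pd.concat([train.reset_index(drop=True), sentiment_df], axis=1)
```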
iv) Another feature which was created was the categories of products based on the quantile of prices of all the products in the data.
9. MODELLING:-
9.1. ML MODELS AND FEATURES:-
FEATURE MATRIX:-
HANDLING TEXT DATA:-
Handling Preprocessed Item Description:-
i) The pre-processed item description was vectorized using TFIDF vectorizer for unigrams.
a) The whole data was split into training and test data with a 75–25 split, and the training data was again split into train and cross validation data using the brand category feature, in the ratio 75–25.
b) The null values in the pre-processed item description are filled with no description.
c) Data leakage was prevented by fitting the vectorizer on the training data to get the vocabulary of unigrams and then transforming the training, cross validation and test data into sparse matrices using that same vocabulary.
d) The tf-idf vectorizer was fitted using the following parameters:-
1. min_df = 10 which means to consider those words that appear at least in 10 documents which in this case are pre-processed item description.
2. max_features = 50000 i.e. I considered a maximum of 50000 words.
3. ngram_range = (1,1) i.e. I considered only the unigrams.
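A minimal sketch of this step, assuming the splits are held in dataframes X_train, X_cv and X_test with the pre-processed description in a column named preprocessed_description (placeholder names):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=10, max_features=50000, ngram_range=(1, 1))

# fit only on the training descriptions to avoid data leakage,
# then transform train, cross validation and test with the same vocabulary
X_train_desc = tfidf.fit_transform(X_train["preprocessed_description"].fillna("no description"))
X_cv_desc = tfidf.transform(X_cv["preprocessed_description"].fillna("no description"))
X_test_desc = tfidf.transform(X_test["preprocessed_description"].fillna("no description"))
```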
HANDLING CATEGORICAL FEATURES:-
ii) The categorical variables brand_name, general_cat, subcat_1, subcat_2, item_condition_id and shipping were all label encoded.
Handling item condition id:-
a) An instance of label encoder was used to label encode the categorical features.
b) The instance was fitted on the training data and then was used to transform the training data.
c) The instance was used to transform the test and the cv data using the labels found in the training data.
d) The variable item condition id had only 5 different labels, so there were no labels in the CV or test data that were unseen in the training data.
Code:-
Handling brand_name:-
I used a custom label encoder to take care of the values which are not a part of the training brands. Here is the code for the custom class.
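A sketch of what such a custom encoder could look like; mapping unseen labels to a reserved id of 0 is an assumption, and the original class may handle them differently:
```python
import numpy as np

class CustomLabelEncoder:
    """Label encoder that maps labels unseen during fit to a reserved id (0)."""

    def fit(self, values):
        # known labels get ids starting from 1; 0 is kept for unseen labels
        self.mapping_ = {label: idx + 1 for idx, label in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return np.array([self.mapping_.get(v, 0) for v in values])

    def fit_transform(self, values):
        return self.fit(values).transform(values)

brand_encoder = CustomLabelEncoder()
train_brand = brand_encoder.fit_transform(X_train["brand_name"])
test_brand = brand_encoder.transform(X_test["brand_name"])
```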
Handling shipping:-
The feature shipping is label encoded by fitting the in-built label encoder on the values of shipping in training data and transforming both the training and the test data’s shipping variable.
Handling general_cat:-
The feature general_cat is label encoded by fitting the in-built label encoder on the values of general_cat in training data and transforming the training data and the test data’s general_cat.
Handling subcat_1:-
The feature subcat_1 is label encoded by fitting the in-built label encoder on the values of subcat_1 in training data and transforming the training data and the test data’s subcat_1.
Handling subcat_2:-
The feature subcat_2 is label encoded by an instance of the custom label encoder class defined above. The instance is fitted on the training data’s subcat_2 values, and its transform method then transforms the training and test data’s subcat_2 variable into a label encoded feature.
HANDLING NUMERICAL FEATURES:-
iii) The numerical features item description length, pos, neg, neu and compound are reshaped accordingly and are concatenated in the feature matrix.
ESTIMATING THE PRODUCT CATEGORIES:-
iv) To approximate the product categories of cheap, affordable and expensive, I employed a hack: the training brands were marked as cheap, affordable or expensive with respect to the average price of all products of the brand, using the quantile information of the average prices. After marking the brands, the feature was label encoded.
a) The average prices of the products of each brand were computed, and the brands in the training data were categorized as cheap, affordable or expensive.
b) These brand categories corresponded to average prices under the first quantile, between the first and third quantiles, and greater than the third quantile, respectively.
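A minimal sketch of this hack, assuming the training split still contains the price column; using the 25th and 75th percentiles as the “first” and “third” quantiles is an interpretation of the write-up:
```python
avg_price = X_train.groupby("brand_name")["price"].mean()
q1, q3 = avg_price.quantile([0.25, 0.75])

def brand_category(brand):
    """Map a brand to cheap / affordable / expensive using its average training price."""
    if brand not in avg_price.index:
        return "affordable"            # assumption: unseen brands fall in the middle bucket
    mean_price = avg_price[brand]
    if mean_price <= q1:
        return "cheap"
    if mean_price <= q3:
        return "affordable"
    return "expensive"

X_train["brand_cat"] = X_train["brand_name"].apply(brand_category)
X_test["brand_cat"] = X_test["brand_name"].apply(brand_category)
```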
FINAL FEATURE MATRIX:-
v) The final feature matrix was a sparse matrix built by horizontally stacking all the above features using the hstack function from scipy.sparse.
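A minimal sketch of the stacking step; the individual arrays (label encoded categoricals and the numerical block) are placeholder names from the steps above:
```python
from scipy.sparse import hstack, csr_matrix

X_train_final = hstack([
    X_train_desc,                                   # TF-IDF of the pre-processed description
    csr_matrix(train_brand.reshape(-1, 1)),         # label encoded brand
    csr_matrix(train_condition.reshape(-1, 1)),     # label encoded item condition
    csr_matrix(train_shipping.reshape(-1, 1)),      # label encoded shipping
    csr_matrix(train_categories),                   # label encoded general_cat, subcat_1, subcat_2
    csr_matrix(train_numeric),                      # description length and sentiment scores
]).tocsr()
```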
MODELS:-
i) Random Model :-
I evaluated a random model by computing the RMSLE on the training, cross validation and test data, where the predicted price for every product was taken to be the average price of the training data.
Results:-
ii) Random Forest Regressor:-
a) Hyperparameter Tuning:-
For the hyperparameter tuning, I used a subset of the cross validation data which was 2% of the total cross validation data. The hyperparameters which were tuned were n_estimators and max_depth.
A randomized search was used to select the best hyperparameters from an array of candidate values; a sketch of the search follows the list of best values below.
- n_estimators = 3000 which denotes the number of base learners or decision trees.
- max_depth = 100 which means the maximum depth to which each decision tree or base learner is grown.
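A sketch of such a search with scikit-learn, assuming the 2% subset is held in X_cv_small and y_cv_small (placeholder names) and an illustrative grid of candidate values:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 500, 1000, 2000, 3000],
    "max_depth": [10, 20, 50, 100],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions,
    n_iter=10,
    scoring="neg_mean_squared_error",   # on log prices this is close to optimising RMSLE
    cv=3,
    random_state=42,
)
search.fit(X_cv_small, y_cv_small)
print(search.best_params_)
```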
b) Evaluation of CV data and prediction on training, CV and test data:-
The number of base estimators and the max depth which gave the best results were 3000 and 100 respectively, but fitting the full CV data with these values took too long. Hence, I reduced the number of base estimators to 500 and the max depth to 20.
Results:-
iii) XGBoost Regressor:-
a) Hyperparameter Tuning:-
The hyperparameter tuning was done on the parameters learning_rate, n_estimators, max_depth, colsample_bytree and subsample using a randomized search on 2% of the cross validation data.
The best set of hyperparameters for the XGBoost Regressor are :-
- Subsample = 0.3 which is the sampling rate at which the data is sampled for each gradient boosted decision tree.
- n_estimators = 100 which are the number of base estimators or gradient boosted decision trees.
- learning_rate = 0.01 which is the boosting learning rate.
- max_depth = 5 which is the maximum depth to which the gradient boosted decision trees are fitted.
- colsample_bytree = 0.5 which means the sampling rate for the features at the level of every tree.
b) Evaluation on Cross Validation data and prediction on training, CV and test data:-
Results:-
iv) SGD Regressor with L2 Penalty or Ridge Regressor:-
a) Hyperparameter Tuning:-
The hyperparameters which were tuned on the 2% of the cross validation data for SGDRegressor with L2 penalty were loss, alpha or regularization strength and learning_rate using a randomized search.
The best set of hyperparameters for the SGDRegressor with L2 penalty were :-
- loss = epsilon_insensitive, which denotes the loss function used to minimize the deviation from the actual value.
- learning_rate = optimal, which denotes the learning rate schedule used while performing gradient descent; in this case it is a function of alpha and the iteration number.
- alpha = 0.001, which denotes the regularization strength applied to the L2 penalty.
b) Evaluation of CV data and prediction on training, CV and test data:-
Results:-
v) Decision Tree Regressor:-
a) Hyperparameter Tuning:-
The hyperparameters which were tuned on 2% of the CV data were the criterion of the split, max_depth, min_samples_split and max features to consider while constructing the tree using randomized search.
The best set of hyperparameters for the Decision Tree Regressor are:-
- min_samples_split = 3, which denotes the minimum number of data points required at a node for it to be split further.
- max_features = log2, which denotes the number of features considered when looking for the best split: log2 of the total number of features in the data.
- criterion = friedman_mse, which is the criterion used to measure the quality of a split. friedman_mse uses the mean squared error with Friedman’s improvement score for potential splits on the training data. Other criteria are mse (mean squared error) and mae (mean absolute error). This essentially decides the regions resulting from the split, and the overall loss is minimized using this criterion.
- max_depth = 30 which denotes the maximum depth to which the decision tree has to be grown while training.
b) Evaluating CV data and prediction on training, CV and test data:-
The results on the training, CV and test data using the best set of hyperparameters for the decision tree are :-
Results:-
vi) Adaboost Regressor:-
a) Hyperparameter Tuning:-
The hyperparameters which were tuned for the Adaboost Regressor on the reduced CV data using Randomized search were n_estimators, learning_rate and loss.
The best set of hyperparameters for the adaboost regressor were:-
- n_estimators = 50 which is the number of base estimators.
- loss = square, which denotes the loss function used to update the sample weights after each boosting iteration.
- learning_rate = 0.01 which is the boosting learning rate.
b) Evaluation of CV data and prediction on training, CV and test data:-
The results of the Adaboost Regressor on the full CV data and the predictions on the training and test data using the best set of hyperparameters were:-
Results:-
9.2. DL MODELS:-
INITIAL DL MODEL:-
FEATURIZATION OF TEXT DATA:-
i) The preprocessed item description was tokenized and converted into sequences, embedded using pretrained GloVe vectors, and fed into an LSTM layer with 50 units.
a) Tokenization and converting into sequences:-
The training data’s preprocessed item description was converted into tokens with filters of alphanumeric characters. Each of the item descriptions are converted into sequences with the token being represented by the unique id of the token learnt from the vocabulary of words from the entire training data’s item_description. The words not seen in training data are handled using the OOV token which is considered to be “OOV” in this case.
b) Embeddings for tokens:-
Pretrained GloVe vectors were used for the embeddings in the neural network. Each token was mapped to a predefined 300-dimensional vector in this step.
FEATURIZATION OF CATEGORICAL DATA:-
ii) The brand name, general_cat, subcat_1 and subcat_2 and the brand category features were tokenized and then the sequences were embedded.
a) Tokenization and converting into sequences:-
The training data’s categorical features were tokenized and then each of the values were converted into sequences after padding with zeros to the maximum length of these features found in the training data.
b) Embeddings for tokens:-
The embeddings for these features were learnt during the training of the neural network. The dimensionality of the embeddings was kept at the minimum of 50 and half the number of unique levels of each of these variables.
FEATURIZATION OF NUMERICAL DATA:-
iii) The numerical features were scaled using minmaxscaler which essentially scales every column of the data by subtracting the minimum value of the column and dividing by the range of the column from each value in the column.
ARCHITECTURE:-
i) The output of the LSTM layer was flattened, the embeddings of the other categorical variables were also flattened and the scaled numerical features were passed through a dense layer.
ii) The flattened vectors and the output of the dense layer were concatenated, and the result was passed through two pairs of dense and dropout layers before a final layer with relu activation which gave the predicted price.
Code:-
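A simplified sketch of the described architecture with the Keras functional API; the dense layer sizes, dropout rates and the cat_inputs dictionary (feature name to (number of levels, padded length)) are assumptions, and the GloVe embedding_matrix is assumed to have shape (vocab_size, 300):
```python
from tensorflow.keras import layers, Model

def build_initial_model(vocab_size, embedding_matrix, desc_len, cat_inputs, n_numeric):
    # text branch: GloVe-initialised embedding followed by an LSTM with 50 units
    desc_in = layers.Input(shape=(desc_len,), name="item_description")
    desc_emb = layers.Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)(desc_in)
    desc_vec = layers.Flatten()(layers.LSTM(50, return_sequences=True)(desc_emb))

    # categorical branches: one small learnable embedding per feature, flattened
    cat_ins, cat_vecs = [], []
    for name, (n_levels, seq_len) in cat_inputs.items():
        inp = layers.Input(shape=(seq_len,), name=name)
        emb = layers.Embedding(n_levels + 1, min(50, n_levels // 2 + 1))(inp)
        cat_ins.append(inp)
        cat_vecs.append(layers.Flatten()(emb))

    # numeric branch: scaled numerical features through a dense layer
    num_in = layers.Input(shape=(n_numeric,), name="numeric")
    num_vec = layers.Dense(64, activation="relu")(num_in)

    x = layers.concatenate([desc_vec] + cat_vecs + [num_vec])
    x = layers.Dropout(0.3)(layers.Dense(256, activation="relu")(x))
    x = layers.Dropout(0.3)(layers.Dense(64, activation="relu")(x))
    out = layers.Dense(1, activation="relu", name="price")(x)
    return Model(inputs=[desc_in] + cat_ins + [num_in], outputs=out)
```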
Diagrammatic Representation:-
MODEL TRAINING AND RESULTS:-
Callbacks used, Model Compilation and Model Fitting:-
1) The Checkpointing callback was used to save the model weights for those epochs where the validation RMSLE reduced from the previous epoch.
2) The ReduceLROnPlateau callback was used to reduce the learning rate by 10% after 2 epochs when the validation RMSLE didn’t improve.
3) The Earlystopping callback was used to stop training if the validation RMSLE didn’t improve in 5 epochs.
4) The model was compiled using the Adam optimizer, with RMSLE as the metric and MeanSquaredLogarithmicError as the loss.
The batch size was chosen to be 128.
5) The input data was a list of seven different arrays.
6) The model was fit on the training and validation data for 20 epochs, although the early stopping callback stopped the training after 9 epochs.
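A sketch of the training setup described above; the monitored metric name, file path and the input/target variable names are placeholder assumptions:
```python
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

def rmsle_metric(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(tf.math.log1p(y_pred) - tf.math.log1p(y_true))))

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_rmsle_metric", save_best_only=True),
    ReduceLROnPlateau(monitor="val_rmsle_metric", factor=0.1, patience=2),
    EarlyStopping(monitor="val_rmsle_metric", patience=5, restore_best_weights=True),
]

model.compile(optimizer="adam",
              loss=tf.keras.losses.MeanSquaredLogarithmicError(),
              metrics=[rmsle_metric])

# train_inputs / val_inputs are lists of seven arrays, as described above
model.fit(train_inputs, y_train, validation_data=(val_inputs, y_val),
          epochs=20, batch_size=128, callbacks=callbacks)
```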
Model Loss and Error:-
FINAL DL MODEL:-
FEATURIZATION OF TEXT DATA:-
Preprocessing:-
i) The preprocessed item description and the preprocessed product name were concatenated. For the preprocessing part, the text was decontracted and then cleaned by removing punctuation, stopwords, special characters and words shorter than 3 characters.
Preprocessing:-
Conversion into sequences and embedding:-
ii) The concatenated item descriptions and product names were converted into tokens and then into padded sequences. The embeddings of the sequences were taken from the 300-dimensional fastText vectors, which are built from character-level n-grams of words.
Tokenizing and Sequencing:-
Embeddings:-
FEATURIZATION OF CATEGORICAL DATA:-
Preprocessing:-
iii) The brand name, general_cat, subcat_1 and subcat_2 were pre-processed and concatenated. The preprocessing of the brand name was done in a manner similar to the text data. The preprocessing of the general_cat, subcat_1 and subcat_2 features was done by replacing spaces, ampersands and braces with underscores, and apostrophes with blanks.
Embedding:-
iv) The concatenated brand name and categories were embedded and the embeddings were learnt during training. The dimensionality of the embeddings was the minimum of 50 and half the number of unique levels of the concatenated feature.
FEATURIZATION OF NUMERICAL DATA:-
v) The numerical features including the shipping and item_condition_id were scaled using minmaxscaler and then were passed through a dense layer in the neural network architecture.
ARCHITECTURE:-
i) I employed the attention mechanism on the concatenated product name and preprocessed item description to predict the concatenated brand and categories.
ii) The context vector of the attention layer was passed through a global average pooling layer whose output was flattened and was concatenated with the output of the dense layer to which the scaled numerical features were passed.
iii) The concatenated output was passed through one Batch Normalization layer and two pairs of dense layers with parametric relu activation and dropouts, giving the final predicted price through a dense layer followed by relu activation. Relu activation was used at the final layer because the loss was the Root Mean Squared Logarithmic Error, which takes the log of the predictions; as the logarithm is only defined for positive values, negative predictions are excluded.
Code:-
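A simplified sketch of the described attention-based architecture; how the attention query is formed, as well as the layer sizes and dropout rates, are assumptions on top of the write-up:
```python
from tensorflow.keras import layers, Model

def build_final_model(text_vocab, text_len, text_emb_matrix, cat_vocab, cat_len, n_numeric):
    # text branch: concatenated name + description with 300-d fastText embeddings
    text_in = layers.Input(shape=(text_len,), name="name_and_description")
    text_emb = layers.Embedding(text_vocab, 300, weights=[text_emb_matrix], trainable=False)(text_in)

    # categorical branch: concatenated brand + categories, embeddings learnt during training
    cat_in = layers.Input(shape=(cat_len,), name="brand_and_categories")
    cat_emb = layers.Embedding(cat_vocab, min(50, max(2, cat_vocab // 2)))(cat_in)
    cat_query = layers.Dense(300)(cat_emb)        # project to the text embedding size for attention

    # attention: the brand/category sequence attends over the text sequence
    context = layers.Attention()([cat_query, text_emb])
    context = layers.Flatten()(layers.GlobalAveragePooling1D()(context))

    # numeric branch: scaled numerical features through a dense layer
    num_in = layers.Input(shape=(n_numeric,), name="numeric")
    num_vec = layers.Dense(64)(num_in)

    x = layers.BatchNormalization()(layers.concatenate([context, num_vec]))
    for units in (256, 64):                       # two dense + PReLU + dropout pairs
        x = layers.Dropout(0.3)(layers.PReLU()(layers.Dense(units)(x)))
    out = layers.Dense(1, activation="relu", name="price")(x)   # relu keeps predictions non-negative
    return Model(inputs=[text_in, cat_in, num_in], outputs=out)
```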
Diagrammatic Representation:-
MODEL TRAINING AND RESULTS:-
Callbacks used, Model Compilation and Model Fitting:-
i) The model was compiled using the Adam optimizer with the default learning rate, MeanSquaredLogarithmicError as the loss and RMSLE as the metric.
ii) The callbacks of earlystopping, ModelCheckpoint, tensorboard and ReduceLRonPlateau were used while training.
iii) The inputs to the model were the list of array of padded sequences and the scaled numerical features.
iv) The model was fitted for 20 epochs, but early stopping halted the training after 19 epochs.
Tensorboard Epoch Plot:-
9.3. KAGGLE SCREENSHOT:-
10. RESULTS AND DEPLOYMENT:-
RESULTS:-
The evaluation of the various ML and DL models is as follows:-
DEPLOYMENT:-
The video of the prediction of a single test instance can be found here:-
11. CODE REFERENCES:-
The relevant code files are available at:-
You can find all the .ipynb files and .py files along with the README.md which describes the project.
12. CONCLUSIONS AND FUTURE WORK:-
The price prediction on an online portal where there is no restriction on the products being uploaded is very tough. The underlying relationship between the features of a product is very complex, and the price could be the result of interactions between the variables. The best way to figure out those interactions is to try out the combinations in various models. The final model could be an ensemble of model predictions, where the weights of the individual models could also be learnt from the data itself. However, the main information related to a product is captured in its name and description, and this information could be specific to various brands or broad categories.
As far as future work on this dataset is concerned, the images of the uploaded products could be included, which could be used to detect defects in the product. The bottleneck features from these images could be concatenated with the above features to predict the price more accurately.
13. PROFILE:-
13.1. GITHUB PROFILE:-
GitHub: waziraligh (github.com)
13.2. LINKEDIN PROFILE:-
14. REFERENCES: -
· https://realpython.com/python-histograms/
· https://www.kaggle.com/maheshdadhich/i-will-sell-everything-for-free-0-55
· https://www.pythonfordatascience.org/anova-python/
· https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling
· https://www.geeksforgeeks.org/generating-word-cloud-python/
· https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/50431
· https://www.appliedaicourse.com