A DETAILED GUIDE TO ENSEMBLES — BAGGING

Md. Wazir Ali
Jul 22, 2021

Table of Contents:-

Ensembles — An Introduction

Bootstrapping

What is Bagging?

Mathematical Intuition of Bagging

Random Forest — A special Case of Bagging

Failure Cases of Random Forest

References

Ensembles — An Introduction

Starting from scratch, we must first know what the word Ensemble represents. The word Ensemble means a group of things, such as a panel of experts or a group of musicians. It also refers to a technique of combining the opinions of various experts to arrive at a single opinion. In the domain of Machine Learning/AI, Ensemble techniques give us a much more powerful model that predicts a data point by combining the predictions of many classifiers/regressors, which are known as base learners.

In the field of Machine Learning/AI, there are four main types of Ensembles:-

  1. Bagging (Also known as Bootstrapped Aggregation)
  2. Boosting
  3. Stacking
  4. Cascading

Using the above four techniques, very powerful and high-performing models can be built. The core idea behind combining the predictions of individual base learners, say m1, m2, m3, m4, ..., mk, is that they should be as different from each other as possible, so that combining them gives a better, more optimized result. The more different these models are from each other, the more independent their predictions (mappings from input to output) become, and hence the more robust and powerful the final prediction obtained by combining them.

Bootstrapping

Introduction

Bootstrapping refers to sampling, with replacement, from a given sample of points in order to simulate the distribution of the population from which that sample was drawn.

Essentially, bootstrapping generates samples of size B from the original data of size N. These samples are called bootstrap samples. There are mainly two assumptions which should hold while creating bootstrapped samples:-

  1. The data of size N should be sufficiently large to ensure that it is a representative sample of the underlying population.
  2. The size B of each bootstrap sample should be smaller than or equal to N, and the samples should be drawn independently of each other.

Code:-

Bootstrapping is illustrated in the following snippet of code:-
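A minimal sketch of such a bootstrapping helper, assuming NumPy arrays as inputs (the function name and the square-root rule for the number of sampled columns are illustrative choices, not the exact code from the accompanying repository):

```python
import numpy as np

def bootstrap_sample(x, y, col_samp=False):
    """Draw one bootstrap sample: rows sampled with replacement,
    columns optionally sub-sampled without replacement."""
    n_rows, n_cols = x.shape
    # Row indices sampled with replacement (the bootstrap itself)
    row_idx = np.random.choice(n_rows, size=n_rows, replace=True)
    if col_samp:
        # Keep a random subset of features (sqrt(d) is a common heuristic)
        n_keep = max(1, int(np.sqrt(n_cols)))
        col_idx = np.sort(np.random.choice(n_cols, size=n_keep, replace=False))
    else:
        col_idx = np.arange(n_cols)
    return x[np.ix_(row_idx, col_idx)], y[row_idx], row_idx, col_idx
```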

The above function takes as input a flag indicating whether column sampling needs to be done or not, the input x and the output y.

Brief Mathematical Intuition

Bootstrapping is a technique that can be called resampling. Suppose we have data x1, x2, x3, ..., xN with N data points, which are a sample from a population with a certain distribution P. The population from which this sample is drawn has some unknown parameter θ, which in simple terms could be the mean or the median of the distribution. Bootstrapping creates a sample from the given sample of this unknown population by sampling again from it (hence the name resampling). The unknown real-world distribution P of the population can be compared with the "pretend true" distribution P̂ of the bootstrap world. Similarly, the unknown population parameter θ, the aspect of P that has to be estimated, can be compared with the parameter θ̂ of the bootstrap sample, which estimates the corresponding aspect of P̂. The data points x1, x2, x3, ..., xN are assumed to be independently and identically distributed (iid), and the same holds for the bootstrap samples, which essentially means that the sampling has to be done with replacement so that the probability of selecting a point does not change over subsequent draws.

Bootstrapping can be of two types:-

  1. Non-Parametric Bootstrap
  2. Parametric Bootstrap

The main difference between the above two types of bootstrapping is that in the first one, we are using nonparametric estimates of distributions whereas in the second one, we are using parametric estimates of distributions.

What is Bagging?

Bagging, as the name suggests, comes from the word bag. It is also referred to as Bootstrapped Aggregation, which consists of two parts: Bootstrapping and Aggregation.

a) Bootstrapping mainly refers to sampling with replacement from the given sample of data points as explained above.

b) Aggregation refers to aggregating the predictions of the individual models and giving a final prediction for each data point. Aggregation is done through majority voting in the case of classification, and by taking the mean/median of the values predicted for the data point by all of the base learners in the case of regression.

Bagging is a statistical technique which combines the predictions of independent models that are generally high-variance, low-bias models, i.e. reasonably complex models, in such a way that the final model giving the aggregated output is a low-variance model, i.e. one which does not overfit. The models whose outputs are combined are called base learners, and they produce their results on the bootstrapped data independently of each other. This independence is achieved by making the base learners as different from each other as possible. One technique to achieve this independence is to randomly sample the columns of the bootstrapped data for each particular model.

Bagging is also termed an embarrassingly parallel algorithm, as bootstrapping creates n samples of the given data which can be fed to n base learners in parallel.

Bag of Classifiers

Need for a tree-based classifier/regressor:-

One of the most preferred candidate choices for the base learners in bagging is a fully grown decision tree. This is because a decision tree works on the principle of reducing entropy for classification and reducing the MSE (Mean Squared Error) for regression, and its structure depends on the feature set in the data. Different randomly sampled feature sets give different trees independence in their splits at various nodes, due to which, at the leaf nodes, each fully grown decision tree assigns the class of a data point independently of the other trees.

Mathematical Intuition of Bagging

Now that we have a basic theoretical understanding of the process of bagging, let us dive slightly deeper into its mathematical details so that we have a clear-cut understanding of how this process improves performance compared to a single model or learner.

As explained earlier, bagging intuitively reduces the variance of the overall model by aggregating the outputs of individual base learners which have high variance, i.e. a tendency to overfit. Next, we look at the bias-variance tradeoff, which plays an important role in determining the performance of a model on unseen test data.

Single Learner

For a single classifier/regressor, the test error or generalization error can be broken up into the following 3 components:-

a) Bias²

b) Variance

c) Irreducible error (sigma²)

Mathematically, Test (Generalization) Error = Bias² + Variance + sigma²

A model has high bias mainly due to oversimplified assumptions, for example assuming a linear relation between the response and the input variables when, in reality, the underlying true relation may not be linear. This causes the model to underperform and make errors on the training data even after it has been trained.

The bias of the model decreases as model complexity increases, but with an overly complex model the variance increases, i.e. the model underperforms or shows a drop in its performance metrics with a slight change in the data, which is evident from its performance on unseen test data.

A graphical representation of this tradeoff is shown below.

Bias-Variance tradeoff for a single classifier/regressor

From the above plot, we can see that the sweet spot for a model, indicated by the optimal test error, lies where the model has some bias and some variance. We cannot expect a model to have zero variance and zero bias at the same time. Therefore, for a single learner to achieve optimal performance on the test data, some bias has to be accepted in exchange for reaching the lowest possible test (generalization) error.

When the model makes errors on the training data, in other words when it has high bias due to overly simplistic assumptions, it is said to underfit. In contrast, when the model complexity increases so much that the model suffers from high variance, i.e. a slight change in the data causes the model to underperform, the model is said to overfit.

A Group of Learners

Regression Problem: If we train M learners on M bootstrapped samples of the data, with each sample being of size K, then the final prediction of the M learners in the case of a regression problem is given by:-

The final prediction function in the case of regression: f(x) = (1/M) * Σ_{m=1}^{M} f_m(x), i.e. the average of the base learners' outputs.

Here,

f_m(x) = The mth base learner’s approximation of the mapping from input to output.

Classification Problem: Similarly, in the case of classification, the final predicted class of a data point is obtained by taking a majority vote over the class labels predicted by the M learners. So, the final class label of the data point is the class label predicted by the majority of the base learners.

The final class for one data point, out of the c classes, is the one receiving the maximum votes from the M classifiers: y(x) = argmax_{j = 1, ..., c} Σ_{m=1}^{M} I(f_m(x) = j), where I(·) is the indicator function.
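Both aggregation rules take only a couple of lines of NumPy; a small sketch, assuming the base learners' predictions have already been collected into an array of shape (M, n_points):

```python
import numpy as np

def aggregate_regression(preds):
    # preds has shape (M, n_points): one row of predictions per base learner
    return preds.mean(axis=0)  # a median could be used just as well

def aggregate_classification(preds):
    # preds holds integer class labels, shape (M, n_points); majority vote per column
    vote = lambda col: np.bincount(col).argmax()
    return np.apply_along_axis(vote, 0, preds)
```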

Error Reduction: Error reduction in bagging, where we have a group of learners, works on the principle that the individual learners make their decisions independently of each other, so their errors are also made independently of each other. The final meta model, or aggregated model, only errs on those data points on which most of the individual base learners err in common, and this causes a large reduction in the errors made by the bagged model compared to the individual learners.

Error Reduction in Bagging
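As a rough numerical illustration of this principle (assuming, as an idealization, that the base learners' errors are completely independent), the probability that a majority of M classifiers, each with individual error rate p, are wrong at the same time can be computed from the binomial distribution:

```python
from math import comb

def majority_error(p, m):
    """P(majority of m independent classifiers, each with error rate p, is wrong)."""
    k_min = m // 2 + 1  # number of wrong votes needed to flip the majority
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(k_min, m + 1))

# 21 independent learners, each wrong 30% of the time:
print(round(majority_error(0.30, 21), 3))  # ~0.026, far below 0.30
```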

Variance Reduction: The variance reduction of a group of bagged learners can be intuitively explained in terms of the sampling-with-replacement technique used to create multiple samples from the given data. Since each base learner sees just a part of the data, a small change in the data does not change its results much, even though the individual learners are high-variance learners; and since the final result is an aggregation of the base learners' results, it too does not change much when the data changes.
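A standard result makes this intuition concrete (it is a general formula, not specific to this article): if each base learner's prediction has variance σ² and the average pairwise correlation between learners is ρ, then the variance of their average is

```latex
\operatorname{Var}\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}
```

The second term vanishes as M grows, and the first term shrinks as the learners are made more independent of each other (ρ closer to 0), which is exactly what bootstrapping and column sampling aim for.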

Random Forest — A special Case of Bagging

A Random Forest is a very popular and commonly used bagging technique in which a collection of fully grown decision trees is used to produce results for a classification or regression problem, and the results of all of these decision trees are aggregated through a simple mean or median in the case of regression and through a simple majority vote in the case of classification. The decision trees used in a Random Forest are fully grown trees with the high-variance property. Each tree is made independent of the others in terms of decision making by supplying it with data having a different feature set, which leads the trees to split at different nodes using different features.

A Forest showing a collection of trees

Random Forests are used for both classification and regression problems. They are a very popular technique for reducing variance through bagging with the help of randomization, which is added via column sampling for each of the base learners, i.e. the features of each base learner's bootstrap sample are randomly sampled.
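For reference, this is roughly how the same idea looks with scikit-learn's off-the-shelf implementation (the dataset and hyperparameter values below are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,     # number of base learners (trees are grown out fully by default)
    max_features="sqrt",  # column sampling: features considered at each split
    bootstrap=True,       # row sampling with replacement
    oob_score=True,       # evaluate on the out-of-bag points (discussed below)
    random_state=42,
)
rf.fit(X, y)
print(rf.oob_score_)      # OOB accuracy of the aggregated model
```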

Entropy

The entropy of a dataset measures its impurity with respect to the classes present in it. The more balanced the dataset is with respect to the classes, the higher the entropy. The more imbalanced the dataset is with respect to the classes, the lower the entropy, which indicates that the dataset is more pure. The maximum value of entropy is 1 (for a two-class problem, using log base 2) and the minimum value is 0. It is given by

Entropy: H = − Σ_i p_i * log2(p_i)

where p_i = fraction of the ith class in the dataset, i.e. the probability of the ith class in the dataset.

Graphically, Entropy is given by

Entropy V/S Class Probability

Gini Impurity

Gini Impurity is another measure of impurity which is mathematically given by

Gini Impurity: G = 1 − Σ_i p_i²

where, p_i = Fraction of data points of the ith class.

It is similar in concept to entropy: the value is higher if the classes in a dataset are balanced and lower if they are imbalanced.

The maximum value of Gini impurity is 0.5 (for a two-class problem) and the minimum value is 0.

Graphically, Gini Impurity is given by

Gini Index V/S Class Probability

Information Gain

The criterion for splitting a node on any of the features is maximum information gain. Information Gain is defined as the difference between the entropy of the parent dataset before splitting and the weighted sum of the entropies of the resulting child datasets after splitting on a categorical or continuous feature at specific split points, which are the levels in the case of a categorical feature and the optimal split point in the case of a continuous feature. Here, the weights are the fractions of the parent dataset that end up in each child after the split at a particular level of a categorical feature, or at a particular value in the case of a continuous variable.
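The three quantities above can be sketched in a few lines of NumPy (class labels are assumed to be non-negative integers, and a split is represented simply as a boolean mask over the parent node's rows):

```python
import numpy as np

def entropy(y):
    """H = -sum(p_i * log2(p_i)) over the classes present in y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                 # skip empty classes (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

def gini(y):
    """G = 1 - sum(p_i^2)."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, mask, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the two children
    produced by splitting the parent with the boolean `mask`."""
    y_left, y_right = y_parent[mask], y_parent[~mask]
    if len(y_left) == 0 or len(y_right) == 0:
        return 0.0
    w_left = len(y_left) / len(y_parent)
    w_right = 1.0 - w_left
    return impurity(y_parent) - (w_left * impurity(y_left) + w_right * impurity(y_right))

# Example: a perfectly separating split on a balanced binary node
y = np.array([0, 0, 0, 1, 1, 1])
mask = np.array([True, True, True, False, False, False])
print(information_gain(y, mask))  # 1.0, the maximum possible gain here
```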

Mean Squared Error

The Mean Squared Error at a particular node, when a decision tree predicts a continuous variable, is given by

Mean Squared Error: MSE = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)²

In the above expression, y_i is the actual response of the data point and ŷ_i is the predicted value of the response.

At a particular node of a decision tree, the predicted value for a data point is the average of the actual values of the data points lying in that node. In a regression problem, the decision tree is built in such a way that the MSE of each leaf node is minimized. The optimization objective is defined such that the feature and split point chosen are the ones that yield regions (child datasets) with the minimum weighted MSE after the split.
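A sketch of this split search for a single continuous feature (using midpoints between consecutive distinct values as candidate thresholds, which is one common choice; the function names are illustrative):

```python
import numpy as np

def node_mse(y):
    """MSE at a node when predicting the node mean for every point in it."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(feature, y):
    """Return the threshold on `feature` that minimizes the weighted child MSE."""
    order = np.argsort(feature)
    f_sorted, y_sorted = feature[order], y[order]
    best_t, best_score = None, np.inf
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for i in range(1, len(f_sorted)):
        if f_sorted[i] == f_sorted[i - 1]:
            continue
        t = (f_sorted[i] + f_sorted[i - 1]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```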

Feature Importance in Random Forest

The importance of the various features in a Random Forest is based on how much each feature reduces entropy or Gini impurity across all the trees used in the forest. The most important features in the data contribute the most to the reduction of entropy or Gini impurity across all the base learners, i.e. the individual decision trees.
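With scikit-learn, these impurity-based importances are exposed directly on a fitted model; a minimal sketch on synthetic data (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity reduction contributed by each feature across all trees,
# normalized to sum to 1; higher means more important.
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```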

Out of Bag Score (OOB Score) or Validation Score in Random Forest

The out-of-bag score, or validation score, in a Random Forest is a single quantity, similar to the mean squared error used in regression problems, that measures the performance of the aggregated Random Forest model on those data points which were not sampled into the bootstrap samples used to build the individual base learners (in this case, decision trees). For each such data point, we take the predictions of all the base learners that were not built using that data point and, in the regression case, use the median of those predictions as the final value. In this way, we get out-of-bag predictions for all the data points in the dataset, and then we calculate the Mean Squared Error over them and call it the OOB Score.

The calculation of the OOB Score in the case of a regression problem can be illustrated as follows:-
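A sketch consistent with this description, assuming each entry of trees is a fitted regressor with a scikit-learn-style predict method and indices is the list of (row indices, column indices) pairs recorded while building each bootstrap sample:

```python
import numpy as np

def oob_score_regression(trees, indices, x, y, col_samp=False):
    """OOB MSE: for each point, take the median prediction of the trees
    whose bootstrap sample did not contain that point."""
    oob_true, oob_pred = [], []
    for i in range(len(x)):
        preds = []
        for tree, (row_idx, col_idx) in zip(trees, indices):
            if i in row_idx:                     # point was in this tree's bootstrap sample
                continue
            cols = col_idx if col_samp else np.arange(x.shape[1])
            preds.append(tree.predict(x[i, cols].reshape(1, -1))[0])
        if preds:                                # point is out-of-bag for at least one tree
            oob_true.append(y[i])
            oob_pred.append(np.median(preds))
    oob_true, oob_pred = np.array(oob_true), np.array(oob_pred)
    return np.mean((oob_true - oob_pred) ** 2)   # MSE as the OOB score
```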

The calculation of the OOB Score in the case of a classification problem can be illustrated as follows:-
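The classification counterpart follows the same pattern, aggregating the out-of-bag predictions by majority vote and scoring them by accuracy (same assumptions about trees and indices as in the previous sketch):

```python
import numpy as np

def oob_score_classification(trees, indices, x, y, col_samp=False):
    """OOB accuracy: for each point, take the majority vote of the trees
    whose bootstrap sample did not contain that point."""
    correct, total = 0, 0
    for i in range(len(x)):
        votes = []
        for tree, (row_idx, col_idx) in zip(trees, indices):
            if i in row_idx:
                continue
            cols = col_idx if col_samp else np.arange(x.shape[1])
            votes.append(int(tree.predict(x[i, cols].reshape(1, -1))[0]))
        if votes:
            majority = np.bincount(votes).argmax()
            correct += int(majority == y[i])
            total += 1
    return correct / total                        # accuracy as the OOB score
```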

The above functions take as arguments the list of trained decision trees, a list of size n consisting of the selected row and column indices for the n base learners (decision trees), the input data x, the output y, and a flag col_samp indicating whether column sampling has been done or not. The first function outputs MSE as the OOB score, while the second function outputs accuracy as the OOB score.

Bias Variance Tradeoff

The bias-variance tradeoff in a Random Forest is handled by controlling the number of base learners, i.e. the fully grown decision trees used for training. The larger the number of base learners, the lower the tendency of the overall aggregated model to overfit the training data, as more trees produce more diverse results whose errors average out.

Training and Test Time Complexity

The training time complexity of a Random Forest is:-

O(n * log(n) * d * k)

where,

k = Number of base learners in case of training on a single box,

d = Dimensionality of the dataset,

n = Number of data points.

The test time complexity of a Random Forest is :-

O(k * depth)

where,

depth = Depth of a base learner in a random forest,

k = Number of base learners.

Test Space Complexity

The run time space complexity of a random forest is :-

O(k*nodes)

where,

nodes = Number of nodes of a base learner,

k = Number of base learners.

Randomization in Random Forest

Randomization in a random forest is introduced using row sampling (bootstrap sampling), column sampling (feature sampling), and also by randomly selecting the cutoffs of a numerical or continuous feature to check for splitting. The third technique is commonly used in Extremely Randomized Trees (Extra Trees).
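The third kind of randomization is what distinguishes extremely randomized trees from a plain random forest; a minimal comparison sketch using scikit-learn (the dataset and hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Random Forest: row sampling + column sampling, best cutoff among the sampled features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Extra Trees: adds the third source of randomness, drawing cutoffs at random
# instead of searching for the optimal one at each split
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print(rf.fit(X_tr, y_tr).score(X_te, y_te))
print(et.fit(X_tr, y_tr).score(X_te, y_te))
```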

Failure Cases of Random Forest

Random Forest does not work well in the following cases:-

a) There is a categorical feature with many levels or categories, and each level of the feature is represented by a one-hot encoded vector which marks that level with a 1 and the rest of the levels with 0. The dimensionality contributed by that feature becomes equal to the number of levels or categories present. The large dimension of the one-hot encoded categorical feature is a bottleneck, as each of the resulting features has to be checked for the optimal split, which affects the time complexity. Such a categorical feature is better converted into a numerical feature by calculating the probability of the positive class given the particular level of the categorical feature (a sketch of this encoding is shown after this list).

b) When the data is imbalanced, it has to be balanced using up-sampling, down-sampling, or class weights, as the imbalance otherwise impacts the entropy or Mean Squared Error calculation.

c) Another bottleneck arises when the input data is given in the form of a similarity matrix of the data points in the dataset. In this case, Random Forests cannot be applied, as they need actual features to calculate the Information Gain based on entropy or Gini impurity.

d) In the presence of outliers, as the depth of the decision trees in the random forest increases, the leaf nodes become susceptible to containing only two or three data points, which can be misleading when predicting the final class label or value of an unseen data point in a classification or regression problem. This leads to instability of the individual decision trees.
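For failure case (a), the encoding mentioned above, replacing each level of the categorical feature with the probability of the positive class given that level, can be sketched with pandas as follows (the column names below are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["delhi", "mumbai", "delhi", "pune", "mumbai", "delhi"],
    "label": [1, 0, 1, 0, 1, 0],   # binary target, 1 = positive class
})

# P(positive class | level): mean of the binary target within each level
level_prob = df.groupby("city")["label"].mean()

# Replace the high-cardinality categorical column by a single numerical column
df["city_encoded"] = df["city"].map(level_prob)
print(df)
```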

Code:-

A simple implementation of the Random Forest Regressor and Classifier can be found here:-

Please go through the README file describing the .py files, their functions, and their usage.

References:-

Random Forest Regressor Documentation

Random Forest Classifier Documentation
