IMAGE SEGMENTATION USING U-NETS

Md. Wazir Ali
9 min read · Jan 16, 2022

What is Image Segmentation?

In the field of digital image processing and computer vision, Image Segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions.

As we all know, an image can contain multiple objects, each represented by a set of pixels of varying intensities. Image segmentation locates every object of a particular class in the image by coloring the pixels belonging to that object with a distinct color. Each class of objects is denoted by a different color, which helps distinguish between the various classes.

The result of Image Segmentation is a set of segments that collectively cover the entire image, or a set of contours extracted from the image. Each pixel in a region is similar to the others with respect to some characteristic or computed property, such as color, intensity or texture.

Applications of Image Segmentation

Image Segmentation finds applications in biomedical images such as MRI scans of organs in the human body, as it helps detect regions of inflammation and cancerous growth by coloring those regions differently from the normal regions where there is no abnormality.

Let’s consider some images below

MRI Scan of a brain:-

As we can see clearly in the image above, the various sections of the brain scan are colored differently, with each abnormal region separated from the other regions by a distinct color, especially red.

Image of a left femur bone of a human being:-

The above image separates out the outer surface in red, the surface between the compact bone and the spongy bone in green, and the surface of the bone marrow in blue.

Image of a girl:-

The above image shows a girl wearing sunglasses. Various features of the girl, such as her eyebrows, lips, hair, nostrils and skin, are denoted by different colors. This is achieved by recognizing each region and coloring it with a distinct color that separates it from the others. The background, shown in light gray, is treated as a separate category altogether.

How is this desired output achieved?

This desired output is achieved on unseen MRI scans and other images by training a Convolutional Neural Network. The training data consists of the original images paired with images in which each category of object is assigned a distinct pixel intensity; these paired images are known as masks. The network learns the mapping from an original image containing various categories of objects to a mask in which each category is separated out with a different color.

The Convolutional Neural Network that learns this mapping from an image to its mask, where each category is colored with a scheme applied to every pixel of that category wherever it appears in the image, is known as U-Net.

Having learned this mapping in the training phase, the network outputs a similar kind of image, known as a mask, for any unseen test image. This is discussed in detail in the U-Net section below.

Other Approaches to Image Segmentation

There are other approaches to Image Segmentation, listed below:-

  1. Threshold Based Segmentation
  2. Edge Based Segmentation
  3. Region-Based Segmentation
  4. Clustering Based Segmentation
  5. Artificial Neural Network Based Segmentation

Threshold Based Segmentation

In this technique, a binary or multi-colored image is created by setting a threshold on the pixel intensities of the original image. The intensity histogram of all the pixels in the image is considered, and a threshold is then set to divide the image into sections. For example, for image pixels ranging from 0 to 255, we may set a threshold of 60: all the pixels with values less than or equal to 60 are assigned a value of 0 (black), and all the pixels with values greater than 60 are assigned a value of 255 (white).
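As a quick sketch of this rule in Python (using NumPy; the random array below is just a stand-in for a real grayscale image):

```python
import numpy as np

# Toy stand-in for a real grayscale image with values in [0, 255].
image = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)

# Global thresholding: pixels <= 60 become 0 (black),
# pixels > 60 become 255 (white).
threshold = 60
binary = np.where(image > threshold, 255, 0).astype(np.uint8)
```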

Various thresholding techniques are:-

a) Global Thresholding

b) Manual Thresholding

c) Adaptive Thresholding

d) Local Adaptive Thresholding

Edge Based Segmentation

Edge-based Segmentation relies on edges found in an image using various edge detection operators. These edges mark image locations of discontinuity in gray level, color, texture, etc. When we move from one region to another, the gray level may change; so if we can find that discontinuity, we can find the edge.
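As an illustrative sketch using OpenCV's Canny edge detector (the filename here is hypothetical):

```python
import cv2

# Hypothetical grayscale input; replace with a real image path.
image = cv2.imread("brain_mri.png", cv2.IMREAD_GRAYSCALE)

# Canny marks pixels where the gradient magnitude is high; the two
# thresholds control the hysteresis between weak and strong edges.
edges = cv2.Canny(image, 50, 150)
```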

Region Based Segmentation

A region can be defined as a group of connected pixels exhibiting similar properties, where the similarity between pixels can be in terms of intensity, color, etc. In this type of segmentation, a pixel must obey some predefined rules to be classified into a region of similar pixels. This method is preferred over Edge Based Segmentation in the case of a noisy image. These methods are mainly classified into two types based on the approach they follow (a minimal sketch of the first follows the list):-

a) Region growing method

b) Region splitting and merging method
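Here is a minimal sketch of the region growing method, assuming a grayscale NumPy image; the 4-connectivity and the tolerance value are illustrative choices:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance=10):
    """Grow a region from `seed` (row, col) by adding 4-connected
    neighbors whose intensity stays within `tolerance` of the seed."""
    h, w = image.shape
    seed_value = int(image[seed])
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < h and 0 <= c < w) or mask[r, c]:
            continue
        if abs(int(image[r, c]) - seed_value) > tolerance:
            continue  # pixel violates the similarity rule
        mask[r, c] = True
        queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return mask
```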

Clustering Based Segmentation

This type of segmentation technique uses clustering algorithms such as K-means to form segments in a colored image, and it is widely used for image segmentation.
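A minimal sketch with scikit-learn, clustering pixels by their RGB values (the random array stands in for a real image, and 4 clusters is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for a real RGB image of shape (H, W, 3).
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# One row per pixel, one column per color channel.
pixels = image.reshape(-1, 3).astype(np.float32)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster center to get the segments.
segmented = (kmeans.cluster_centers_[kmeans.labels_]
             .reshape(image.shape).astype(np.uint8))
```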

Artificial Neural Network Based Segmentation

These techniques are classified into the following two categories:-

a) Supervised Techniques

b) Unsupervised Techniques

Supervised methods require expert human input for segmentation. Usually this means that human experts carefully select the training data that is then used to segment the images.

On the other hand, unsupervised methods, or clustering processes, are semi- or fully automatic. User intervention might be necessary at some point to improve performance, but the results should be more or less human independent. An unsupervised segmentation method automatically partitions the images without operator intervention.


Various Performance Metrics for Image Segmentation

Dice Loss

In dice loss, the Dice coefficient is used to measure the similarity between two images. Let's first consider the case of two sets while analyzing dice loss. The Dice coefficient for two sets A and B is defined as:-

Dice Coefficient = 2|A ∩ B| / (|A|² + |B|²)

The above formula is for the Dice Coefficient, where the numerator is twice the intersection of the two sets A and B and the denominator is the sum of the squares of A and B; when A and B are matrices of pixel values, |A|² and |B|² are computed as the sums of the squared entries of the matrices.

Now coming to the specific problem of Image Segmentation, the sets A and B hold the actual and predicted pixel values of the image. The greater the Dice coefficient, i.e. the more elements the two sets have in common relative to the sum of the squares of the two images, the lower the Dice loss. Specifically,

Dice Loss = 1 − Dice Coefficient

The Dice loss ranges from 0 to 1. The lower the loss, the closer the original and the predicted images, better known as masks, are to each other.
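A small sketch of the Dice loss in NumPy, assuming `pred` and `target` are masks of the same shape (binary or soft probabilities):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    # Numerator: twice the element-wise intersection of the masks.
    intersection = np.sum(pred * target)
    # Denominator: sum of squared entries of both masks.
    denominator = np.sum(pred ** 2) + np.sum(target ** 2)
    dice_coefficient = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_coefficient  # 0 means identical masks
```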

IOU (Intersection Over Union)

Another metric to measure the effectiveness of the prediction of an input image is Intersection Over Union. This metric measures how similar the two images are by measuring the common area between two images out of the total area of the two images.

In this particular case of Image Segmentation, the ground-truth mask of the original image is a pixel-level coloring indicating the different categories, so IOU measures the agreement of pixel-level category labels between the original mask and the predicted mask. The maximum value of IOU is 1 and the minimum is 0.
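A sketch of pixel-level IOU for two binary masks in NumPy:

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    # Common pixels over total pixels covered by either mask.
    intersection = np.sum(np.logical_and(pred, target))
    union = np.sum(np.logical_or(pred, target))
    return (intersection + eps) / (union + eps)
```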

Binary Cross Entropy

Cross Entropy is defined as a measure of the difference between two probability distributions for a given random variable or set of events. It is widely used for classification problems, and since segmentation is classification at the pixel level, it works well here.

Binary Cross Entropy is defined as:-

BCE(y, ŷ) = −(y · log(ŷ) + (1 − y) · log(1 − ŷ))

Here, y and ŷ are the actual and predicted class labels for every pixel.
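A sketch of the formula above in NumPy, where `y` holds ground-truth labels in {0, 1} and `y_hat` holds predicted probabilities:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```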

Weighted Binary Cross Entropy

It is a variant of Binary Cross Entropy in which the positive examples are weighted by some coefficient β. It can be defined as:

WBCE(y, ŷ) = −(β · y · log(ŷ) + (1 − y) · log(1 − ŷ))

The β value can be used to tune false negatives and false positives. For example, to reduce the number of false negatives, set β > 1; to decrease the number of false positives, set β < 1.
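The weighted variant as a sketch (a β of 2.0 is an arbitrary example value):

```python
import numpy as np

def weighted_bce(y, y_hat, beta=2.0, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    # Only the positive (y = 1) term is scaled by beta.
    return -np.mean(beta * y * np.log(y_hat)
                    + (1 - y) * np.log(1 - y_hat))
```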

Balanced Cross Entropy

Balanced Cross Entropy is similar to Weighted Binary Cross Entropy. The only difference is that, apart from the positive examples, the negative examples are also weighted. Balanced Cross Entropy can be defined as follows:-

BalancedCE(y, ŷ) = −(β · y · log(ŷ) + (1 − β) · (1 − y) · log(1 − ŷ))
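And the balanced variant as a sketch, where a single β in (0, 1) trades off the two classes:

```python
import numpy as np

def balanced_bce(y, y_hat, beta=0.7, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    # Positives weighted by beta, negatives by (1 - beta).
    return -np.mean(beta * y * np.log(y_hat)
                    + (1 - beta) * (1 - y) * np.log(1 - y_hat))
```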

Focal Loss

It can be seen as a variation of Binary Cross Entropy. It down-weights the contribution of easy examples and enables the model to focus on learning hard examples. Let's look at how this Focal loss is designed and how it is derived from cross entropy. Cross Entropy is defined as:-

CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise

Focal Loss defines the estimated probability of the class as:

p_t = p if y = 1, and p_t = 1 − p otherwise

Now Cross Entropy can be written as:

CE(p, y) = CE(p_t) = −log(p_t)

Focal Loss proposes to down-weight easy examples and focus training on hard negatives using a modulating factor, (1 − p_t)^γ. This modulating factor is multiplied with the cross entropy term defined above, giving the Focal loss shown below:-

FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t)

where α_t = α when y = 1 and 1 − α otherwise.

Here γ ≥ 0, and when γ = 0 Focal Loss reduces to the Cross Entropy loss function. Similarly, α ranges over [0, 1]; it can be set to the inverse class frequency or treated as a hyperparameter.
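A sketch of binary focal loss following the formulas above (γ = 2 and α = 0.25 are the defaults suggested in the original focal loss paper):

```python
import numpy as np

def focal_loss(y, y_hat, gamma=2.0, alpha=0.25, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    # p_t is the predicted probability of the true class.
    p_t = np.where(y == 1, y_hat, 1 - y_hat)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights easy, well-classified pixels.
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```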

U-Net

In this blog post, I discuss a Convolutional Neural Network that, given an input image, outputs another image. In the training phase it learns to predict this output image, which contains a different color for the pixels of each category of object; the corresponding image extracted from the original data acts as the ground-truth label, or mask. For each batch, the network backpropagates the total loss between the output image produced for each input RGB image and its ground-truth mask. This loss is generally the dice loss.

U-Net Architecture
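To make the architecture concrete, here is a minimal U-Net-style sketch in PyTorch (my own simplification: two encoder/decoder stages instead of the paper's four, and padded convolutions so the output matches the input size; input height and width must be divisible by 4):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = double_conv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = double_conv(128, 64)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                 # kept for skip connection
        e2 = self.enc2(self.pool(e1))     # kept for skip connection
        b = self.bottleneck(self.pool(e2))
        # Upsample and concatenate the encoder features (the "skips").
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)              # per-pixel class scores

# Example: a batch of one 3-channel 64x64 image -> per-pixel logits.
logits = MiniUNet()(torch.randn(1, 3, 64, 64))  # shape (1, 2, 64, 64)
```

The skip connections are the defining feature: they carry fine spatial detail from the contracting path directly to the expanding path, which is what lets the network produce sharp, pixel-accurate masks.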
