
Guardrails for ML and DL Models: A Deep Dive into Regularization Techniques

This article will explain the concept of regularization in ML and DL. The article will also introduce readers to some of the most famous regularization techniques.



Table of Contents
  1. Introduction

    - What are Bias and Variance?

    - What is Overfitting, Underfitting, and the Right Fit?

  2. What is Regularization?

  3. Different Regularization Methods

    - Lasso or L1 Regularization

    - Ridge or L2 Regularization

    - Elastic Net or (L1 + L2) Regularization

    - K-fold Cross-Validation Regularization

    - Using an Ensemble of Algorithms

    - Dropout

    - Data augmentation

  4. Summary

  5. References


Introduction

Before diving straight into the topic, let’s take a detour and revise some of the basic concepts in machine learning.


What are Bias and Variance?

Let’s look at two kinds of errors that we need in order to understand underfitting and overfitting.


  1. Bias error: The bias error is the error or loss that we get when we use the trained model on the training data. In other words, we measure the error on the same data that was used to train the model. The error can be of any kind, such as mean squared error, mean absolute error, etc.

  2. Variance error: The variance error is the error or loss that we get when we use the trained model on the test data. Again, the error can be of any type; in practice, we use the same error metric that we used for the bias so that the bias and variance values can be compared directly.


Note that the ideal condition for a trained model is low bias and low variance.
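
To make these working definitions concrete, here is a minimal sketch (using scikit-learn on a synthetic dataset, purely for illustration) that computes both errors for a simple model:

# Working definitions from this article:
# "bias error"     = error measured on the training data
# "variance error" = error measured on the test data
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)

bias_error = mean_squared_error(y_train, model.predict(X_train))    # error on the training data
variance_error = mean_squared_error(y_test, model.predict(X_test))  # error on the test data

print(f"Bias error (train MSE): {bias_error:.2f}")
print(f"Variance error (test MSE): {variance_error:.2f}")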


What are Overfitting, Underfitting, and the Right Fit in Machine Learning?

A machine learning model is said to be overfitted when it performs very well on the training data but poorly on the test data (i.e., low bias and high variance).


On the other hand, when a machine learning model performs poorly on both training and testing data (i.e., high bias and high variance), it is said to be underfitted to the data.

The right fit machine learning model gives good performance on both training and test data (i.e., low bias and low variance).


I have written a whole article on the concepts of overfitting, underfitting, and right-fit models in machine learning. If you are interested, you can go through it; the link to that article is given below.



What is Regularization?

The gas stove we use for everyday cooking in the kitchen has a regulator attached. Such a regulator controls the flame produced by the gas stove.


Similarly, regularization, as the name suggests, is used to regulate or control an algorithm's power or learning ability.


In the context of machine learning and deep learning,


When we say ‘increase the regularization’, we mean ‘decrease the learning ability of the algorithm’ or ‘use a simpler algorithm’. This is done when the model is overfitting the data.


When we say ‘decrease the regularization’, we mean ‘increase the learning ability of the algorithm’ or ‘use a more complex algorithm’. This is done when the model is underfitting the data.


Regularization techniques help reduce the overfitting of a model on the training data. Decreasing the regularization, in turn, helps a model that is underfitting the training data.


The main purpose of regularization is to reduce the variance of the trained model while keeping its bias as low as possible.


Different Regularization Methods

Now, let’s look at some of the popular regularization methods.


Lasso or L1 Regularization


In L1 regularization, we add a term to the original cost function used to train the model. After this addition, the new cost function looks like this:

New cost function = Original loss + λ × (|w1| + |w2| + ... + |wn|)

This new term is called the penalty term. It is the sum of the absolute values of all the model weights, multiplied by a variable lambda (λ).


The addition of this term forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible.


The variable lambda used in the new term regulates how much regularization is applied. A low value of lambda means weak regularization, while a high value means strong regularization.


Lambda can take any value in [0, infinity). When lambda = 0, no regularization is applied to the model.


The regularization term should only be added to the cost function during training. Once the model is trained, you want to use the unregularized performance measure to evaluate the model’s performance.


This regularization technique also works as a feature selection technique: its important characteristic is that it tends to set the weights of the least important features exactly to zero, thereby removing redundant features from the data.


This regularization technique is relatively robust to the presence of outliers in the data. It is also often used with sparse datasets.
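
As a quick illustration, here is a minimal sketch using scikit-learn’s Lasso on a synthetic dataset (the alpha parameter plays the role of lambda; the data and values are chosen only for demonstration):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 20 features, but only 5 of them actually carry information
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha plays the role of lambda: higher alpha means stronger regularization
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

# evaluate with the plain (unregularized) error, as discussed above
print("Test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))

# L1 tends to drive the weights of the unimportant features to exactly zero
print("Number of zeroed-out weights:", sum(coef == 0.0 for coef in lasso.coef_))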


Ridge or L2 Regularization


The working of L2 regularization is very similar to that of L1 regularization, but a different term is added to the original cost function:

New cost function = Original loss + λ × (w1² + w2² + ... + wn²)

Here, lambda does the same job as it does in L1 regularization.


Since the new term uses the squares of the weights, large values get amplified and the loss can blow up when the data contains outliers. So this regularization is not ideal when the data contains outliers.
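
A minimal sketch using scikit-learn’s Ridge (again, alpha stands in for lambda, and the synthetic data is only for demonstration):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# higher alpha shrinks the weights more strongly towards zero (but not exactly to zero)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print("Test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))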


Elastic Net or (L1 + L2) Regularization


In this regularization technique, we add the penalty terms from both L1 and L2 regularization to the cost function. One common form looks like this:

New cost function = Original loss + r × λ × (|w1| + ... + |wn|) + (1 - r) × λ × (w1² + ... + wn²)

Here, we can control the proportion of the L1 and L2 penalty terms with the help of the variable r.


When r = 0, elastic net regularization reduces to L2 regularization; when r = 1, it reduces to L1 regularization.
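
A minimal sketch using scikit-learn’s ElasticNet, where the l1_ratio parameter corresponds to r above (synthetic data, values chosen only for demonstration):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# l1_ratio = 1.0 behaves like Lasso (L1), l1_ratio = 0.0 behaves like Ridge (L2)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)

print("Test R^2 score:", enet.score(X_test, y_test))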


K-fold Cross-Validation Regularization


Cross-validation doesn’t exactly change the learning ability of the algorithm, but it is an approach that helps the model achieve better generalization on the test data.


First, we split the original data into training data and testing data. Then, the model is trained on the training data as follows:


  1. The training data is split into n equal parts (folds). Let’s name them fold 1 to fold n.

  2. The model is trained on every possible group of (n-1) folds. There are n such groups.

  3. For every group of (n-1) folds, the trained model is evaluated on the one remaining fold.


Let’s take one example to understand this more clearly.



Let’s say we split the training data into 5 folds. We then form all possible groups of 4 folds; these groups are shown in green in the diagram above, and the remaining fold for each group is shown in blue.


In the first iteration, the model is trained on the group (Fold2, Fold3, Fold4, Fold5). After training, the model is evaluated on Fold1.


In the second iteration, the model is trained on the group (Fold1, Fold3, Fold4, Fold5). After training, the model is evaluated on Fold2.


In the third iteration, the model is trained on the group (Fold1, Fold2, Fold4, Fold5). After training, the model is evaluated on Fold3.


In the fourth iteration, the model is trained on the group (Fold1, Fold2, Fold3, Fold5). After training, the model is evaluated on Fold4.


In the fifth iteration, the model is trained on the group (Fold1, Fold2, Fold3, Fold4). After training, the model is evaluated on Fold5.


Finally, we average the accuracies from all the iterations to get the final accuracy. Since we get an accuracy for each fold, we can also compute the standard deviation of these accuracies.
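
A minimal sketch of 5-fold cross-validation using scikit-learn’s cross_val_score (the dataset and model here are only placeholders for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: train on 4 folds, evaluate on the held-out fold, repeated 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Per-fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f}, standard deviation: {scores.std():.3f}")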


I have written a whole article on cross-validation. If you are interested, you can read it using the link below.



Using an Ensemble of Algorithms


Ensemble methods combine the predictions of several weaker models to improve performance on the test data. In most cases, this reduces the variance of the model without affecting the bias. Using an ensemble is not exactly a regularization technique, but it serves the same purpose as regularization, i.e., improving the generalization of the model on test data.
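
As one possible illustration, here is a minimal sketch comparing a single decision tree with a bagged ensemble of trees using scikit-learn (synthetic data, purely for demonstration):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a single fully grown tree tends to overfit (high variance)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# averaging many trees trained on bootstrap samples usually reduces that variance
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree test accuracy:", tree.score(X_test, y_test))
print("Bagged ensemble test accuracy:", bag.score(X_test, y_test))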


Dropout


The dropout regularization technique is mostly used in artificial neural networks, i.e., in deep learning models.


At every training step, every neuron (including input neurons but excluding output neurons) has a probability p of being temporarily ‘dropped out’. This means that during this training step the neuron is ignored completely, but it might be active again during the next step. After training is completed, neurons are no longer dropped.



There is a small detail that we need to understand. If p = 50%, then during training only about half of the neurons are active at any step. This means that once training is completed, each neuron will be connected to roughly twice as many active input neurons as it was during training. To compensate for this, we multiply the input connection weights by two during training. If we don’t do this, the model won’t give proper results, since the scale of its inputs will differ during and after training.


In a nutshell, we need to divide the connection weights by the keep probability, i.e., (1 - p), during training.
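
As a minimal sketch (using TensorFlow/Keras, shown only as one possible implementation), dropout is usually added as a separate layer; Keras performs the inverted-dropout scaling described above automatically during training:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),               # drop 20% of the input activations
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),               # drop 50% of the hidden activations
    tf.keras.layers.Dense(10, activation="softmax"),
])

# the Dropout layers are only active during training; at inference time they do nothing
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])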


Here is an analogy that gives an idea of why such an approach works.

Consider an organization with employees who work both from home and the office on a random basis every day, as per the new policy. In such a scenario, the tasks that were previously assigned to a single person, such as planning events or meetings, now become the responsibility of the entire team. This decentralization of work makes the team more self-sufficient and less dependent on any one individual.


Similarly, in neural networks, the absence of certain neurons during training enables other neurons to become more independent, resulting in a more robust model.


Data augmentation


Data augmentation is a technique commonly used in computer vision applications where new images are created by applying various operations such as image flipping and cropping to existing images.
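
For instance, here is a minimal sketch (using Keras preprocessing layers, shown only as one possible option) that applies random flips, rotations, and zooms to the training images on the fly:

import tensorflow as tf

# these layers transform the images randomly during training and are inactive at inference time
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    data_augmentation,
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])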



Similarly, in machine learning applications where we have little data, we can create new samples using upsampling techniques. One popular technique, SMOTE, uses the k-nearest neighbors algorithm to create new data points from the existing ones.
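
A minimal sketch of this idea using the SMOTE implementation from the imbalanced-learn package (the dataset is synthetic and only for demonstration):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# an imbalanced dataset: roughly 90% of the samples belong to class 0
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Class counts before:", Counter(y))

# SMOTE interpolates between a minority-class point and its k nearest neighbours
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("Class counts after: ", Counter(y_resampled))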


Summary

The article delves into regularization techniques for ML and DL. It explores methods such as L1, L2, Elastic Net, K-fold cross-validation, ensembles, dropout, and data augmentation to control an algorithm's learning ability and reduce overfitting, aiming for balanced model complexity and improved generalization.


References






Data Augmentation Image Reference:

Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Data-augmentation-using-semantic-preserving-transformation-for-SBIR_fig2_319413978 [accessed 22 Jan, 2024]


Outro

Thanks for reading!


Have a great day!


