This article is Part 1 of the Data Preprocessing series. In this part, I explain the feature scaling step of preprocessing.
Sometimes, the data we receive for our machine learning tasks isn’t in a suitable format for coding with Scikit-Learn or other machine learning libraries. As a result, we have to process the data to transform it into the desired format.
Raw data can have various issues, and depending on the nature of the issue, we need to use an appropriate method to deal with it.
Let's see some of these methods and how to implement them in code.
Mean Removal and Variance Scaling (Standardization)
Scikit-Learn estimators (the Scikit-Learn classes used to train machine learning models) are tuned to work best with standard normally distributed data, i.e., a Gaussian distribution with zero mean and unit variance.
Raw data does not always follow a Gaussian distribution, so models trained on it may give sub-optimal results. Standardization can be the solution to this problem.
Standardization is performed using the following formula: z = (x − μ) / σ. First, the mean (μ) and standard deviation (σ) are calculated for a column. Then, the mean is subtracted from every data point in that column. Lastly, each difference is divided by the standard deviation.
Let's see how to implement this in code.
For the demonstration, let’s use the famous ‘Tips’ dataset. The ‘Tips’ dataset is used to predict the tips received by waiters based on different factors such as total bill, gender of the customer, day of the week, time of the day, etc.
First, we need to separate the dependent feature, which is 'tip', from the independent features. After that, we need to split the data into training and testing sets.
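A minimal sketch of this step. The small hand-made DataFrame below stands in for the full Tips dataset (in practice you would load the whole dataset, e.g. with seaborn's `load_dataset('tips')`); the exact rows and the 25% test split are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A small hand-made sample standing in for the full Tips dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "sex":        ["Female", "Male", "Male", "Male",
                   "Female", "Male", "Male", "Male"],
})

# Separate the dependent feature ('tip') from the independent features.
X = tips.drop(columns="tip")
y = tips["tip"]

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```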
Let us perform standard scaling on the ‘tip’ column.
The 'reshape' method is used to convert the 1D array into a 2D array, since the fit and transform methods require a 2D array as input.
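A sketch of the scaling step, using a few example 'tip' values as a stand-in for the training split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example 'tip' values; a stand-in for the training split of the column.
tips_train = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 4.71])

scaler = StandardScaler()
# reshape(-1, 1) turns the 1D array into a 2D column,
# as fit/transform expect.
tips_scaled = scaler.fit_transform(tips_train.reshape(-1, 1))

print(tips_scaled.mean())  # ~0
print(tips_scaled.std())   # ~1
```

After fitting, the scaled column has zero mean and unit variance, which is exactly what the formula above produces.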
Scaling features to a range (MinMaxScaler and MaxAbsScaler)
This is an alternative to standardization in which features are scaled to lie between a given minimum and maximum value, often zero and one, or so that the maximum absolute value of each feature is scaled to unit size.
If we want to scale our data to lie between 'min' and 'max' values using MinMaxScaler, the following formula is used: X_std = (X − X.min) / (X.max − X.min), then X_scaled = X_std × (max − min) + min.
MinMaxScaler:
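A minimal sketch with a toy column (the values are illustrative); the default `feature_range` is (0, 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1.0], [5.0], [3.0], [9.0]])

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(data)

# min=1 and max=9, so each value becomes (x - 1) / 8.
print(scaled.ravel())  # [0.   0.5  0.25 1.  ]
```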
MaxAbsScaler works in a similar way, but it scales the data such that each value lies in the range [-1, 1]. This is done by dividing each value by the maximum absolute value of its feature.
Centering sparse data destroys its inherent sparseness, so it is generally not an appropriate thing to do. But when the features are on different scales, it does make sense to scale sparse inputs. MaxAbsScaler was specifically designed for scaling sparse data and is the recommended way to do so.
MaxAbsScaler:
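A sketch with a toy column containing negative values (the values are illustrative), so the [-1, 1] output range is visible:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[-4.0], [2.0], [8.0], [-1.0]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)

# The maximum absolute value is 8, so each value is divided by 8.
print(scaled.ravel())  # [-0.5    0.25   1.    -0.125]
```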
Scaling data with outliers (RobustScaler)
If the data contains outliers, the mean and standard deviation can get skewed. In that case, they no longer represent the data's center or spread correctly, so scaling with the mean and standard deviation would not work well.
To circumvent this issue, we can use RobustScaler, which uses more robust estimates for the center and range of the data.
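A sketch on a toy column with one extreme outlier (the values are illustrative). RobustScaler centers on the median and scales by the interquartile range, so the outlier does not distort the rest of the column:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Data with an outlier (100.0) that would skew the mean and std.
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # (x - median) / IQR
scaled = scaler.fit_transform(data)

# median=3, IQR=2: the inliers stay in a small, sensible range.
print(scaled.ravel())  # [-1.   -0.5   0.    0.5  48.5]
```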
Mapping to a Uniform Distribution (QuantileTransformer)
QuantileTransformer can be used to map the data to a uniform distribution with values between 0 and 1.
Now, let’s visualize the ‘tip’ column before and after the transformation.
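A sketch of the transformation; here a skewed synthetic sample (exponentially distributed, an assumption) stands in for the 'tip' column. The before/after comparison can then be visualized with two histograms (e.g. `plt.hist`) of `tips` and `uniform_tips`:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Skewed synthetic values standing in for the 'tip' column.
tips = rng.exponential(scale=3.0, size=500).reshape(-1, 1)

qt = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
uniform_tips = qt.fit_transform(tips)

# The transformed values lie in [0, 1] and are roughly uniform.
print(uniform_tips.min(), uniform_tips.max())
```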
Mapping to a Gaussian Distribution (PowerTransformer)
We can use the PowerTransformer to map the data to a distribution that is as close as possible to a Gaussian distribution.
We can choose between two methods to make this transformation:
Box-Cox transform
Yeo-Johnson transform
Note that the Box-Cox transform can only be applied to strictly positive data.
Box-Cox transform:
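A sketch using a strictly positive synthetic sample (exponentially distributed, an assumption), since Box-Cox requires positive inputs. By default PowerTransformer also standardizes the output to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Strictly positive, right-skewed data (shifted away from zero).
data = (rng.exponential(scale=2.0, size=300) + 0.1).reshape(-1, 1)

pt = PowerTransformer(method="box-cox")  # standardize=True by default
gaussian_like = pt.fit_transform(data)

# The output is standardized: mean ~0, std ~1, and far less skewed.
print(gaussian_like.mean(), gaussian_like.std())
```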
Yeo-Johnson transform:
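A sketch of the same idea with Yeo-Johnson, which, unlike Box-Cox, also handles zero and negative values; the mixed-sign synthetic sample is an assumption:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Mixed positive and negative values: fine for Yeo-Johnson,
# but Box-Cox would raise an error here.
data = np.concatenate([
    rng.exponential(scale=2.0, size=150) - 1.0,
    -rng.exponential(scale=1.0, size=150),
]).reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
transformed = pt.fit_transform(data)

print(transformed.mean(), transformed.std())
```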
I hope you liked the article. If you have any thoughts on it, please let me know. Any constructive feedback is highly appreciated.
Have a great day!