
Using Scikit-Learn Pipelines to Automate the Machine Learning Model Training and Predictions

In this article, I will explain the theory and the use of Scikit-Learn’s Pipeline class, using coding examples of cross-validation and hyperparameter tuning.

Scikit-Learn pipelines are used to chain multiple operations of the machine learning lifecycle (mainly data preprocessing, model creation, and prediction on the test data). They also save us a lot of manual coding for cross-validation and hyperparameter tuning.


Before diving into the Scikit-Learn pipelines, let’s first understand the advantages of using these pipelines.


Convenience and encapsulation


After incorporating Scikit-Learn pipelines into your code, you only have to call the fit and predict methods on your data to run the whole sequence of preprocessing and model training operations. Pipelines also make it very easy to experiment with different machine learning algorithms.


Joint parameter selection


You can grid search over the parameters of all the estimators in the pipeline at once, using the step_name__parameter naming convention.


Safety


Scikit-Learn pipelines help avoid leaking statistics from the test data into the trained model during cross-validation. This is done by ensuring that the same samples are used to train the transformers and the predictors.

 

We will use the Kaggle Spaceship Titanic dataset to demonstrate the use of Scikit-Learn’s pipelines. In this article, I will start from the preprocessing step of the data science project lifecycle. If you want to see the exploratory data analysis of the data, check out my other article on that topic.


The data for this problem needs the following preprocessing:

  1. missing value imputation

  2. categorical data encoding

  3. scaling of numerical data

  4. outlier removal

  5. log-normal transformation (optional)

Scikit-Learn has built-in transformers for most of the basic preprocessing operations, such as categorical data encoding, missing value imputation, and scaling. But sometimes we need to perform an operation on the data for which there is no built-in Scikit-Learn transformer. In such cases, we create custom transformers according to our needs.


Scikit-Learn doesn’t have built-in transformers for outlier removal or the log-normal transformation, so we will create them ourselves. Since the scope of this article is how to use Scikit-Learn pipelines, I will not explain here how to create a custom transformer, but I will list some good resources on this at the end of the article.


Creating a custom transformer for outlier removal
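As a minimal sketch (the author’s exact implementation may differ), a custom transformer only needs to inherit from BaseEstimator and TransformerMixin and implement fit and transform. The class name OutlierRemover, the IQR-fence logic, and the factor parameter below are assumptions of this sketch; out-of-range values are marked as missing so that the imputer later in the pipeline can fill them in.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class OutlierRemover(BaseEstimator, TransformerMixin):
    """Marks values outside the IQR fences as NaN (a hypothetical sketch)."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        # learn the lower and upper fences from the training data only
        X = pd.DataFrame(X)
        q1 = X.quantile(0.25)
        q3 = X.quantile(0.75)
        iqr = q3 - q1
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        # replace out-of-range values with NaN; the SimpleImputer that
        # follows in the pipeline will fill them in
        return X.mask((X < self.lower_) | (X > self.upper_))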


Creating a custom transformer for the log-normal distribution
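Again as a hedged sketch (the class name LogTransformer is an assumption), the log-normal transformation can be wrapped the same way; np.log1p is used to handle zero values safely.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Applies log(1 + x) to reduce the skew of long-tailed features."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.log1p(X)

The same effect can also be achieved with Scikit-Learn’s built-in FunctionTransformer wrapping np.log1p.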


Creating a pipeline for the preprocessing of numerical features


Scikit-Learn provides the Pipeline class to create a pipeline.


Our numerical pipeline will contain four steps. They are:

  1. Outlier removal using the above-created custom transformer

  2. Transformer to impute missing values: here we will use Scikit-Learn’s SimpleImputer transformer.

  3. The optional log-normal transformation using the above-created custom transformer

  4. Transformer to scale the numerical features: here we will use Scikit-Learn’s StandardScaler transformer.

The Scikit-Learn Pipeline object takes in a list of tuples, one per step. Each tuple has two elements: the first is the name of the step and the second is the transformer or estimator object.


Pipeline([
    (process1_name, process1_object),
    (process2_name, process2_object),
    ...
])
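Using the step names referenced below, the numerical pipeline might look like the following sketch; the median imputation strategy and the custom transformer defaults are assumptions:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipe = Pipeline([
    ("outlier_removal", OutlierRemover()),
    ("replacing_num_missing_values", SimpleImputer(strategy="median")),
    ("log_transformation", LogTransformer()),
    ("scaling", StandardScaler()),
])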



As you can see in the code block above, outlier_removal, replacing_num_missing_values, log_transformation, and scaling are the names of the pipeline steps. Each name is paired with its respective transformer object.


Creating a pipeline for preprocessing of categorical features


Similar to the case of numerical features, we will also create one pipeline for the categorical features. Categorical features need the following preprocessing steps:

  1. Removing useless features

  2. Replacing missing values with the most_frequent value of the feature

  3. Encoding the feature values to integers

  4. Replacing missing values with the most_frequent value of the feature once again


Here, we perform the missing value imputation again after the feature value encoding. This is because our encoder is designed to encode any category it has not seen before as a null value (this can happen in the test data even if not in the training data).
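Below is a sketch of such a categorical pipeline. An OrdinalEncoder with handle_unknown="use_encoded_value" and unknown_value=np.nan gives the behaviour described above (unseen categories come out as null values); the step names are assumptions, and the removal of useless features is left to the ColumnTransformer in the next section:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

cat_pipe = Pipeline([
    ("replacing_cat_missing_values", SimpleImputer(strategy="most_frequent")),
    ("encoding", OrdinalEncoder(handle_unknown="use_encoded_value",
                                unknown_value=np.nan)),
    # unseen categories are encoded as NaN, so impute once more
    ("replacing_cat_missing_values_again", SimpleImputer(strategy="most_frequent")),
])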


Combining both numerical and categorical preprocessing pipelines


In the previous steps, we created pipelines for the numerical and categorical features of the data. Now we will combine those two pipelines into one whole pipeline capable of preprocessing all the features at once.


Scikit-Learn has a built-in class for this job: the ColumnTransformer class. ColumnTransformer takes in a list of tuples as its main argument. Each tuple holds three pieces of information: the first is the name of the process, the second is the object needed for that operation, and the third is the list of names of the features on which the process needs to be performed.


Here, cat_preprocessing and num_preprocessing are the names of two processes. cat_pipe and num_pipe are the objects that we created in the last two steps. cat_feat and num_feat are the lists of names of features on which cat_pipe and num_pipe will be applied.


ColumnTransformer([
    (process1_name, process1_object, feature_names1),
    (process2_name, process2_object, feature_names2),
    ...
])


Note that ColumnTransformer has a second argument named remainder. The ‘drop’ value of remainder removes all the features of the data that are not present in the num_feat or cat_feat lists. If we instead use the ‘passthrough’ value for remainder, all the features that are not in the num_feat or cat_feat lists are ignored in the preprocessing step but are not removed from the data.
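Putting it together, a sketch of the combined preprocessing pipeline might look like this. The feature lists are illustrative Spaceship Titanic columns; the exact lists would come from the exploratory data analysis:

from sklearn.compose import ColumnTransformer

num_feat = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_feat = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

preprocessing = ColumnTransformer([
    ("cat_preprocessing", cat_pipe, cat_feat),
    ("num_preprocessing", num_pipe, num_feat),
], remainder="drop")  # drop every column not listed above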


Now that we have created our whole preprocessing pipeline, let’s create some machine-learning models.


Creating some machine learning models
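As a sketch, let’s assume three common classifiers (the Spaceship Titanic target, Transported, is binary); the model choices and names here are assumptions, not necessarily the ones used in the original code:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}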


We have our models with us now. The next step is to combine the whole preprocessing pipeline with each of the above-created machine learning models. To combine the preprocessing pipeline and the models, we will make use of the Scikit-Learn Pipeline class again.
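A sketch of this combination step, assuming the preprocessing ColumnTransformer and the models dictionary from the sketches above:

from sklearn.pipeline import Pipeline

# one full pipeline per model: preprocessing first, then the estimator
pipelines = {
    name: Pipeline([
        ("preprocessing", preprocessing),
        ("model", model),
    ])
    for name, model in models.items()
}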



This concludes the creation of the pipelines. The pipelines we have created in the above code blocks are now capable of performing the preprocessing as well as the model building with a single call of the fit method.

 

Also, we can use these pipelines for cross-validation as well as hyperparameter tuning. Let’s see examples of that too.


Performing a cross-validation on the data using the pipelines
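Here is a sketch of the cross-validation loop, assuming X and y hold the raw training features and the target, already loaded from the training CSV:

from sklearn.model_selection import cross_val_score

# each pipeline handles its own preprocessing inside every CV fold
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")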




Notice that here we pass a whole pipeline as the estimator to the cross_val_score function. Since we have incorporated the preprocessing as well as the model into our final pipeline, we don’t have to perform any preprocessing ourselves.


Now, let’s see an example of hyperparameter tuning using Scikit-Learn pipelines for the algorithms that give the highest r2 score.


Performing hyperparameter tuning using Scikit-Learn pipelines


Let’s create a dictionary of hyperparameters for each of the machine-learning models created above.


Here, we will perform hyperparameter tuning only on the machine learning models that performed well in cross-validation.
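The grids below are illustrative values for the two hypothetical tree-based models from the earlier sketches; the model__ prefix targets the step named model inside each pipeline:

param_grids = {
    "random_forest": {
        "model__n_estimators": [100, 300],
        "model__max_depth": [None, 10, 20],
    },
    "gradient_boosting": {
        "model__n_estimators": [100, 300],
        "model__learning_rate": [0.05, 0.1],
    },
}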





Now, let’s perform hyperparameter tuning.
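A sketch of the search itself, assuming the param_grids dictionary above:

from sklearn.model_selection import GridSearchCV

best_params = {}
for name, grid in param_grids.items():
    # grid search over the full pipeline, so preprocessing is refit per fold
    search = GridSearchCV(pipelines[name], grid, cv=5)
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(name, search.best_score_, search.best_params_)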


We have obtained the parameters that give the best performance for each model. Let’s set those parameters on the models and train the pipelines.


Training and saving the pipelines


We will fit all the above-created pipelines on the training data and save each one of them. For training, we only need to provide the pipeline and the data; all the preprocessing and training steps included in the pipeline will be performed automatically, thanks to the pipelines.
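A sketch using joblib, the usual way to persist fitted Scikit-Learn estimators; the file names are hypothetical:

import joblib

for name, pipe in pipelines.items():
    pipe.set_params(**best_params.get(name, {}))  # apply tuned parameters, if any
    pipe.fit(X, y)
    joblib.dump(pipe, f"{name}_pipeline.joblib")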




Now the last step is to load the new data and make predictions using the best-performing trained pipeline.
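A final sketch, assuming the best performer was the hypothetical random forest pipeline saved above:

import joblib
import pandas as pd

pipe = joblib.load("random_forest_pipeline.joblib")
new_data = pd.read_csv("test.csv")  # hypothetical path to the new data
predictions = pipe.predict(new_data)  # preprocessing runs automatically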


Scikit-Learn pipelines are very useful for finding the best-performing model because they let us experiment with different kinds of machine learning algorithms at one time. Also, as you have seen above, it becomes very easy to perform cross-validation as well as hyperparameter tuning on many machine learning models at once. Basically, these pipelines make our life a lot more convenient by saving us a ton of the time required for experimenting with many algorithms one by one.

I hope you have got a basic idea of Scikit-Learn pipelines and their use. You can find the whole code used in this article here.


I hope you liked this article. Have a great day!
