
A Step-by-Step Guide to Building an End-to-End Machine Learning Project

This article will show you how to create an end-to-end machine learning project.


Table of Contents
  1. Introduction

  2. Create a Virtual Environment and Install all the Required Libraries

  3. Create the Project Structure

  4. Exploratory Data Analysis (EDA)

  5. Store all the Variables We Need in the params.yaml file

  6. Start Coding

  7. Creating Unit Tests for Our Project

  8. Model Deployment

  9. Making a Prediction


Introduction

In this article, I will show you how I created an end-to-end machine-learning project.

Please find the project code and the deployed web app below for reference while reading the article. I suggest going through the code as you read the article for a better understanding.


Project Code:



Deployed Project Link:



Step 1: Create a Virtual Environment and Install all the Required Libraries


You can create a conda environment using the following command in the bash terminal. Just make sure to replace environment_name with the name of your choice.


conda create -n environment_name python=3.10

Next, we will activate the environment and install all the Python libraries we require for our project in that environment.


conda activate environment_name
pip install numpy pandas matplotlib seaborn scikit-learn PyYAML streamlit

Step 2: Create the Project Structure


In this step, we will create the required files and folders. We will add more files and folders later in the project if required.

Some of the important file names are:



params.yaml → The file that will store all the variables that we will use in the project.


requirements.txt → This file will contain the names and versions of all the Python libraries that we will use for the coding.


Some of the important folders are:


src: This folder will contain all of our .py files.


Logs: This folder will contain all the logs of our activities.


Data: This folder will contain the data that is fetched from remote storage and the data that is preprocessed.


Metrics: This folder will contain the files that store the performance of our trained machine-learning model on test data.


Models: This folder will contain the saved pickle file of our trained machine-learning models.


Preprocess_pipelines: As the name suggests, this folder will store the pipelines that are created during the preprocessing of our training data.


Docs: This folder will contain all the project-related documents, such as wireframe documents, high-level and low-level design documents, etc.


Notebooks: This folder will contain the Jupyter notebooks that we will use for the exploratory data analysis of our training data.
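As a rough sketch, a small Python script along the following lines can create this skeleton. The folder and file names are simply the ones listed above, so adjust them to your own project.


import os

# Folder names taken from the list above; adjust them to your own project.
folders = [
    "src", "Logs", "Data", "Metrics", "Models",
    "Preprocess_pipelines", "Docs", "Notebooks",
]
files = ["params.yaml", "requirements.txt"]

for folder in folders:
    os.makedirs(folder, exist_ok=True)   # create the folder only if it is missing

for file_name in files:
    if not os.path.exists(file_name):    # create an empty placeholder file
        open(file_name, "a").close()
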


Step 3: Exploratory Data Analysis (EDA)


Create a Jupyter notebook in the Notebooks folder and perform the exploratory data analysis on the training data. During this process, try to find out as much as you can about the data and what preprocessing steps it needs.
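As a minimal sketch of how the first EDA cells might look (the file path and column contents are hypothetical):


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical path; replace it with the actual location of your training data.
df = pd.read_csv("Data/train.csv")

df.info()               # column names, dtypes, and non-null counts
print(df.describe())    # summary statistics of the numerical features
print(df.isna().sum())  # missing values per column

# Quick look at pairwise relationships between the numerical features.
sns.pairplot(df)
plt.show()
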


Additionally, in this step, we will experiment with different machine learning models and hyperparameters to find the one with the highest performance on the test data.
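For illustration, a simple cross-validation loop along these lines can be used to compare candidate models; the models, file path, and target column here are placeholders rather than the ones used in this project.


import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file and target column; replace them with your own.
df = pd.read_csv("Data/train.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Candidate models are illustrative; try whatever fits your problem.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV score = {scores.mean():.3f}")
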


Step 4: Store all the Variables We Need in the params.yaml file


By the time we finish exploratory data analysis, we know what parameters and variables we will need in our code. We will store all of these variables in one place: the params.yaml file. Keeping the variables in one place has many advantages; if you are interested in why it is useful and how to do it, check out my article linked below.
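As a quick illustration of the idea (the keys shown are hypothetical), params.yaml holds the variables as key-value pairs, and a small helper reads them with PyYAML:


import yaml

# Example keys only; your params.yaml will contain whatever your project needs, e.g.
#   data_path: Data/train.csv
#   test_size: 0.2
#   model:
#     n_estimators: 200

def read_params(config_path: str = "params.yaml") -> dict:
    """Load all the project variables from a single YAML file."""
    with open(config_path) as f:
        return yaml.safe_load(f)

params = read_params()
print(params["test_size"])  # hypothetical key
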



Step 5: Start Coding


We will start coding our project in a modular format. We will create .py files for each of the following steps of the machine learning lifecycle:


  1. creating utility functions and classes

  2. data loading and saving

  3. data validation

  4. data preprocessing and saving preprocess pipelines

  5. model training

  6. finding the performance of the trained model on test data and saving the performance and trained models

  7. creating a front-end interface for taking new inputs from the user


In the first step, we will create the utility functions required for later tasks, such as creating a folder if it doesn't exist, loading variables from the params.yaml file, etc.


In the second step, we will load the training data from the remote location where it is stored. If possible, we will also save a copy of the data to local storage.
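A minimal sketch of this step, assuming the raw data is a CSV file reachable over a URL (the URL and local path are placeholders):


import pandas as pd

def load_and_save_data(remote_url: str, local_path: str) -> pd.DataFrame:
    """Fetch the raw data from remote storage and keep a local copy."""
    df = pd.read_csv(remote_url)        # pandas can read a CSV directly from a URL
    df.to_csv(local_path, index=False)  # cache it locally for the next stages
    return df

# Placeholder values; in the project these would come from params.yaml.
df = load_and_save_data(
    remote_url="https://example.com/raw_data.csv",
    local_path="Data/raw_data.csv",
)
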


In the third step, we will validate the data. We will check whether the data matches our expectations with respect to the number of records, the number of features, the data type of each feature, etc.
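A minimal validation sketch, assuming the expected schema is known up front; the expected columns, dtypes, and record count here are placeholders:


import pandas as pd

# Placeholder schema; in practice it can live in params.yaml.
EXPECTED_COLUMNS = {"age": "int64", "income": "float64", "target": "int64"}
MIN_RECORDS = 1000

def validate_data(df: pd.DataFrame) -> None:
    """Raise an error if the data does not match the expected schema."""
    if len(df) < MIN_RECORDS:
        raise ValueError(f"Expected at least {MIN_RECORDS} records, got {len(df)}")
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for col, dtype in EXPECTED_COLUMNS.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"Column {col!r}: expected {dtype}, got {df[col].dtype}")
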


Once the validation is passed successfully, in the fourth step, we will create the pipelines for processing the raw data. We will already know what processing the data needs from the exploratory data analysis step. We will fit and transform the training data using these pipelines. Once we create and fit these pipelines on the training data, we will store them in a pickle file to use later on the test data and on the new data points provided by the user of our application.
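A sketch of such a pipeline using scikit-learn (the file, target, and column names are placeholders); the fitted pipeline is pickled so that exactly the same transformations can be replayed later:


import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Output of the earlier stages; the file, target, and column names are placeholders.
df = pd.read_csv("Data/raw_data.csv")
X_train = df.drop(columns=["target"])
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X_train_processed = preprocessor.fit_transform(X_train)  # fit only on the training data

# Save the fitted pipeline for later use on test data and new user inputs.
with open("Preprocess_pipelines/preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f)
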


In the fifth step, we will train the machine learning model that we selected during exploratory data analysis on the processed data.


After training the model, in the sixth step, the model will be tested on the test data, and its performance will be saved in a JSON file in the Metrics folder. We will also save the trained machine learning model to a pickle file.
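A combined sketch of the fifth and sixth steps, assuming a random forest was the model selected during EDA; the model choice, metrics, and file names are illustrative, and the inputs are the preprocessed train/test splits produced by the earlier stages.


import json
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def train_and_evaluate(X_train_processed, y_train, X_test_processed, y_test):
    """Train the chosen model, then save its test metrics and the model itself."""
    model = RandomForestClassifier(n_estimators=200)  # illustrative model choice
    model.fit(X_train_processed, y_train)

    y_pred = model.predict(X_test_processed)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average="weighted"),
    }

    with open("Metrics/metrics.json", "w") as f:
        json.dump(metrics, f, indent=4)   # performance report on the test data
    with open("Models/model.pkl", "wb") as f:
        pickle.dump(model, f)             # trained model saved as a pickle file
    return metrics
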


In the seventh step, we will create the front-end interface for our project using the Streamlit library.
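A bare-bones sketch of such a Streamlit app (the input widgets and feature names are placeholders); it loads the saved preprocessing pipeline and model and displays the prediction. The app can then be started locally with streamlit run app.py.


import pickle
import pandas as pd
import streamlit as st

st.title("Prediction App")

# Placeholder input fields; add one widget per feature your model expects.
age = st.number_input("Age", min_value=0, max_value=120, value=30)
income = st.number_input("Income", min_value=0.0, value=50000.0)
city = st.selectbox("City", ["Mumbai", "Pune", "Delhi"])

if st.button("Predict"):
    with open("Preprocess_pipelines/preprocessor.pkl", "rb") as f:
        preprocessor = pickle.load(f)
    with open("Models/model.pkl", "rb") as f:
        model = pickle.load(f)

    new_point = pd.DataFrame([{"age": age, "income": income, "city": city}])
    prediction = model.predict(preprocessor.transform(new_point))
    st.success(f"Prediction: {prediction[0]}")
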


Note that inside every Python file, we will use Python's logging module to record every operation of the code.
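One possible way to set this up (the log file name and format are just one option) is a small helper that every module can import:


import logging
import os

def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes to a file in the Logs folder."""
    os.makedirs("Logs", exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on re-import
        handler = logging.FileHandler(os.path.join("Logs", f"{name}.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

logger = get_logger("data_loading")
logger.info("Raw data fetched from remote storage")
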


Step 6: Creating Unit Tests for Our Project


After the model training has been completed, we will create unit tests for our application and the training procedures.
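A couple of illustrative tests written with pytest; the imports refer to the hypothetical helpers sketched in the earlier steps, so adapt them to your actual module names. Running pytest from the project root will execute them.


import pandas as pd
import pytest

# Hypothetical imports matching the helpers sketched in the earlier steps.
from src.utils import read_params
from src.data_validation import validate_data

def test_params_file_has_required_keys():
    params = read_params("params.yaml")
    assert "data_path" in params   # hypothetical key
    assert "test_size" in params   # hypothetical key

def test_validation_rejects_missing_columns():
    bad_df = pd.DataFrame({"age": [25, 40]})  # missing the other expected columns
    with pytest.raises(ValueError):
        validate_data(bad_df)
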


Step 7: Model Deployment


Once the coding has been completed, we will deploy our trained model as a web app using Streamlit.


Step 8: Making a Prediction


We can now use the web app interface we created to make predictions on new data.

Now our end-to-end machine learning project is ready!


 

Thanks for reading!


Connect with me on LinkedIn


Similarly, you can follow me on Medium


Have a great day!
