This article shows you how to create an end-to-end machine learning project.
Table of Contents
Introduction
Create a Virtual Environment and Install all the Required Libraries
Create the Project Structure
Exploratory Data Analysis (EDA)
Store all the Variables We Need in the params.yaml file
Start Coding
Creating Unit Tests for Our Project
Model Deployment
Making a Prediction
Introduction
In this article, I will show you how I created an end-to-end machine-learning project.
Please find the project code and the deployed web app below for reference. I suggest going through the code as you read the article for a better understanding.
Project Code:
Deployed Project Link:
Step 1: Create a Virtual Environment and Install all the Required Libraries
You can create a conda environment using the following command in a bash terminal. Just make sure to replace environment_name with whatever name you want to use.
conda create -n environment_name python=3.10
Next, we will activate the environment and install all the Python libraries we require for our project in that environment.
conda activate environment_name
pip install numpy pandas matplotlib seaborn scikit-learn PyYAML streamlit
Step 2: Create the Project Structure
In this step, we will create the required files and folders. We will add more files and folders later if required; a short sketch for creating this structure follows the folder list below.
Some of the important files are:
params.yaml → The file that will store all the variables that we will use in the project.
requirements.txt → This file will contain the names and versions of all the Python libraries that we will use for the coding.
Some of the important folders are:
src: This folder will contain all of our .py files.
Logs: This folder will contain all the logs of our activities.
Data: This folder will contain the data that is fetched from remote storage and the data that is preprocessed.
Metrics: This folder will contain the files that store the performance of our trained machine-learning model on test data.
Models: This folder will contain the saved pickle file of our trained machine-learning models.
Preprocess_pipelines: As the name suggests, this folder will store the pipelines that are created during the preprocessing of our training data.
Docs: This folder will contain all the project-related documents, such as wireframe documents, high-level and low-level design documents, etc.
Notebooks: This folder will contain the Jupyter notebooks that we will use for the exploratory data analysis of our training data.
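As a rough sketch, the whole structure can be created with a small Python script. The folder and file names below simply mirror the list above; adjust them to your own project.

from pathlib import Path

# Folders and files described above
folders = ["src", "Logs", "Data", "Metrics", "Models", "Preprocess_pipelines", "Docs", "Notebooks"]
files = ["params.yaml", "requirements.txt"]

for folder in folders:
    Path(folder).mkdir(exist_ok=True)   # create the folder if it does not exist

for file in files:
    Path(file).touch(exist_ok=True)     # create an empty placeholder file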
Step 3: Exploratory Data Analysis (EDA)
Create a Jupyter notebook in the Notebooks folder and perform exploratory data analysis on the training data. During the process, try to learn as much as you can about the data and what preprocessing steps it needs.
Additionally, in this step, we will experiment with different machine learning models and hyperparameters and find the model with the best performance on the test data.
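As a minimal sketch of how the EDA notebook might start (assuming the training data is a CSV file; the path here is hypothetical):

import pandas as pd

# Hypothetical path to the raw training data
df = pd.read_csv("Data/raw/train.csv")

df.info()               # number of records, features, and data types
print(df.describe())    # summary statistics of the numeric features
print(df.isna().sum())  # missing values per column, to plan preprocessing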
Step 4: Store all the Variables We Need in the params.yaml file
By the time we finish exploratory data analysis, we know what parameters and variables we will need in our code. We will store all of these variables in one place, the params.yaml file. Storing the variables in one place has many advantages; if you are interested in why it is useful and how to do it, check out my article given below.
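For illustration, a params.yaml file might hold keys such as data paths and model hyperparameters (the keys shown in the comments are assumptions, not the exact ones used in this project), and it can be read with PyYAML:

import yaml

# Load every variable defined in params.yaml into a dictionary
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Hypothetical keys, for example:
# data_path: Data/raw/train.csv
# test_size: 0.2
# n_estimators: 100
print(params)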
Step 5: Start Coding
We will start coding our project in a modular format. We will create .py files for each of the following steps of the machine learning lifecycle:
creating utility functions and classes
data loading and saving
data validation
data preprocessing and saving preprocess pipelines
model training
evaluating the trained model on test data and saving the performance metrics and the trained model
creating a front-end interface for taking new inputs from the user
In the first step, we will create the utility functions and classes required across the project, like creating a folder if it doesn’t exist, loading the variables from the params.yaml file, etc.
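A minimal sketch of such a utility module (the function names here are my own choice, not necessarily the ones used in the project):

import os
import yaml

def create_folder(path: str) -> None:
    """Create a folder if it does not already exist."""
    os.makedirs(path, exist_ok=True)

def load_params(path: str = "params.yaml") -> dict:
    """Load all project variables from the params.yaml file."""
    with open(path) as f:
        return yaml.safe_load(f)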
In the second step, we will load the training data from the remote location where it is stored. If possible, we will also save a copy of the data in local storage.
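A sketch of this step, assuming the remote data is a CSV file reachable over a URL (the URL and the local path are placeholders):

import pandas as pd

# Placeholder remote location of the raw training data
REMOTE_URL = "https://example.com/train.csv"

df = pd.read_csv(REMOTE_URL)                   # fetch the data from remote storage
df.to_csv("Data/raw/train.csv", index=False)   # keep a local copy for the later steps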
In the third step, we will validate the data. We will check whether the data is as expected with respect to the number of records, the number of features, the data type of each feature, etc.
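A simple validation sketch, assuming we noted down the expected schema during EDA (the column names, dtypes, and record count below are illustrative):

import pandas as pd

# Expected schema recorded during EDA (illustrative values)
EXPECTED_COLUMNS = {"age": "int64", "income": "float64", "city": "object"}
MIN_RECORDS = 1000

df = pd.read_csv("Data/raw/train.csv")

assert len(df) >= MIN_RECORDS, "Too few records"
assert list(df.columns) == list(EXPECTED_COLUMNS), "Unexpected set of features"
for col, dtype in EXPECTED_COLUMNS.items():
    assert str(df[col].dtype) == dtype, f"Unexpected data type for {col}"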
Once the validation passes successfully, in the fourth step, we will create the pipelines for processing the raw data. We already know what processing the data needs from the exploratory data analysis step. We will fit these pipelines on the training data and transform it, then store the fitted pipelines in a pickle file so we can reuse them later on the test data and on the new data points provided by the users of our application.
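As a sketch, assuming a mix of numeric and categorical features (the column names and the target column are placeholders), the pipelines could be built with scikit-learn and saved with pickle:

import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("Data/raw/train.csv")

numeric_features = ["age", "income"]   # placeholder column names
categorical_features = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# "target" is a placeholder label column
X_train = preprocess.fit_transform(df.drop(columns=["target"]))

# Persist the fitted pipeline for the test data and new user inputs
with open("Preprocess_pipelines/preprocess.pkl", "wb") as f:
    pickle.dump(preprocess, f)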
In the fifth step, we will train the machine-learning algorithm that we found during exploratory data analysis with the processed data.
After training the model, in the sixth step, we will test the model on the test data and save its performance in a JSON file in the Metrics folder. We will also save the trained machine-learning model as a pickle file.
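A self-contained sketch of these two steps; the file paths, the choice of RandomForestClassifier, and the use of accuracy as the metric are just illustrative assumptions:

import json
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder: preprocessed data saved by the previous step
data = pd.read_csv("Data/processed/train.csv")
X, y = data.drop(columns=["target"]), data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100)   # use the model found during EDA
model.fit(X_train, y_train)

# Save the performance on the test data to the Metrics folder
metrics = {"accuracy": accuracy_score(y_test, model.predict(X_test))}
with open("Metrics/metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)

# Save the trained model to the Models folder
with open("Models/model.pkl", "wb") as f:
    pickle.dump(model, f)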
In the seventh step, we will create the front-end interface for our project using the Streamlit library.
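A minimal Streamlit sketch of such an interface (the input fields, feature names, and file paths are placeholders for this example):

import pickle
import pandas as pd
import streamlit as st

st.title("Prediction App")

# Placeholder input fields; use the features of your own dataset
age = st.number_input("Age", min_value=0, max_value=120, value=30)
income = st.number_input("Income", value=50000.0)
city = st.text_input("City", value="London")

if st.button("Predict"):
    with open("Preprocess_pipelines/preprocess.pkl", "rb") as f:
        preprocess = pickle.load(f)
    with open("Models/model.pkl", "rb") as f:
        model = pickle.load(f)

    new_point = pd.DataFrame([{"age": age, "income": income, "city": city}])
    prediction = model.predict(preprocess.transform(new_point))[0]
    st.write(f"Prediction: {prediction}")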
Note that inside every Python file, we will use Python’s logging framework to record every operation of the code.
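For instance, each .py file might set up logging like this (the log file name and format are just one possible choice):

import logging

logging.basicConfig(
    filename="Logs/running_logs.log",   # all logs go into the Logs folder
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(module)s - %(message)s",
)

logging.info("Data loading started")    # example of recording an operation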
Step 6: Creating Unit Tests for Our Project
After the model training has been completed, we will create unit tests to test our application and the training procedures.
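A small pytest-style sketch of what such a test might look like; load_params and its import path are the hypothetical utility from the first coding step:

import pytest
from src.utils import load_params   # hypothetical import path

def test_params_file_is_a_dict():
    params = load_params("params.yaml")
    assert isinstance(params, dict)
    assert "data_path" in params     # illustrative key, adjust to your params.yaml

def test_missing_params_file_raises():
    with pytest.raises(FileNotFoundError):
        load_params("does_not_exist.yaml")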
Step 7: Model Deployment
Once the coding has been completed, we will use the Streamlit library to deploy our trained model.
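For example, you can first test the app locally by starting it with the Streamlit CLI (assuming the front-end script is saved as src/app.py):

streamlit run src/app.py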
Step 8: Making a Prediction
We can now use the web app interface we created to make predictions on new data.
Now our end-to-end machine learning project is ready!
Thanks for reading!
Connect with me on LinkedIn
Similarly, you can follow me on Medium
Have a great day!