This blog post will discuss the benefits of using a YAML file as a central repository for storing variables, parameters, and hyper-parameters in a data science project. It will explain how this method of storage can improve the efficiency and organization of the project by allowing for easy access and modification of these values. The post will also provide examples and a step-by-step guide for implementing this method in a data science project.
Introduction
Machine learning and deep learning projects revolve around experimenting with different parameters. Experimentation gets harder as the number of parameters grows, partly because of the manual effort needed to change parameter values for every iteration. Luckily, there is a way to make this easier. By combining YAML files with Python code, we can run different experiments with very little effort. This article demonstrates how to use a YAML file together with Python code for such experiments.
Prerequisites
Basic knowledge of the Python programming language
Basic knowledge of how the machine learning lifecycle works
Agenda
What is YAML?
Why not use the conventional way of storing the variables?
Advantages of storing the parameters centrally in the YAML file
Downloading PyYAML python library
Storing the variables in the YAML file
Storing lists and dictionaries in the YAML file
Loading the variables from the YAML file into the Python file
Conclusion
What is YAML?
Before diving into the topic, let’s cover some basics about YAML.
YAML stands for ‘YAML Ain’t Markup Language’. It is a data serialization language that stores data in a human-readable format, and many people find it easier to read than XML or JSON. A YAML file only describes data, so it contains no actions or logic. Its contents can also be loaded easily into programming languages such as Python.
Why not use the conventional way of storing the variables?
To explain these concepts, I will use an example data science project named ‘credit card fraud detection’. The aim of the project is simple: detect whether a transaction is fraudulent or not, using some information about that transaction. Some examples of information that could be used are:
The distance between the place where the transaction is made and the home address of the credit card owner
The distance from the location of the previous transaction
The ratio of the mean transaction price to the current transaction price
The IP address from which the transaction has been made
Whether the payment was made online or offline
The detection is done by a machine learning model trained on the user’s credit card transaction history.
A data science project based on machine learning has many stages, such as exploring the data, cleaning the data, finding a suitable machine learning model for the problem, tuning the model, and saving the model. Each of these steps creates a lot of variables, especially the steps where a suitable algorithm is selected and where that algorithm is tuned.
The conventional way of storing variables creates problems in such cases. Finding a suitable machine learning algorithm and getting the maximum accuracy out of it depends largely on experimenting with the algorithm’s hyper-parameters. With the conventional approach, we have to go through each file and change those parameters manually for every experiment. This is tedious and prone to errors. To avoid the unnecessary work and the silly mistakes, we use a different approach, which the rest of this article explains.
Advantages of storing the parameters centrally in the YAML file
Unlike the conventional way of storing parameters in the files that use them, this approach stores all the parameters in one file. Any script can then read the parameters it needs from that file. This approach is leaner and less prone to silly mistakes. A YAML file can even be used to store file paths.
One question might come to mind: why use a YAML file in particular? The answer lies in the extremely simple syntax of YAML. Other file types would work too, but to keep a simple matter simple, it is advisable to use a YAML file.
Now, let’s see how it’s done using some code.
Downloading PyYAML python library
PyYAML is one of the popular third-party Python libraries for working with YAML. It is actively maintained and is also listed on the official YAML website. To install the library, run the following command in the terminal.
python -m pip install pyyaml
Once the installation is complete, import the library in your Python file with the following statement.
import yaml
Note that even though the library you installed is named PyYAML, you import it in Python code under the name ‘yaml’.
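A quick way to confirm that the installation worked is to parse a tiny YAML document held in a string. This is only a sanity check, and the `verbose: 3` document is a made-up example.

```python
import yaml

# Parse a one-line YAML document from a string; no file is needed yet.
config = yaml.safe_load("verbose: 3")

print(config)        # a plain Python dict
print(type(config))
```

If the import fails with a `ModuleNotFoundError`, the library was most likely installed into a different Python environment than the one running the script.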
Storing the variables in the YAML file
A YAML file has a syntax somewhat similar to Python’s. In particular, indentation is used to express nesting, just like in Python. Let’s take a look at the YAML file to understand this.
Here we are storing variables in two groups named SimpleImputer and OrdinalEncoder. These variables are the parameters passed to Scikit-Learn’s simple imputer and ordinal encoder transformers in the preprocessing step.
Note that string values usually do not need quotation marks in a YAML file, and adding them makes no difference for ordinary strings. Quotes only matter when an unquoted value would be read as another type, for example a number, or a word such as yes or no, which PyYAML interprets as a boolean.
Storing file paths in the YAML file works the same way as storing any other value. The following are the paths used in the preprocessing of the data and the training of our credit card fraud detection model.
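Here is a minimal sketch of what such a file could look like and how the groups come out in Python. The YAML is kept in a string so the snippet runs on its own. The keys are real Scikit-Learn parameter names, but the values are only assumptions for illustration, not the project’s actual settings.

```python
import yaml

# Hypothetical file contents: two groups, each indented under its name.
PARAMETERS_YAML = """
SimpleImputer:
  strategy: most_frequent
  add_indicator: false

OrdinalEncoder:
  handle_unknown: "use_encoded_value"
  unknown_value: -1
"""

params = yaml.safe_load(PARAMETERS_YAML)

# Each group becomes a nested dict; quoted and unquoted strings load alike.
print(params["SimpleImputer"]["strategy"])
print(params["OrdinalEncoder"]["handle_unknown"])

# Quotes matter only when a bare value would be read as another type.
print(yaml.safe_load("flag: yes"))    # bare yes becomes a boolean in PyYAML
print(yaml.safe_load("flag: 'yes'"))  # quoted, it stays the string 'yes'
```

Because each group is a plain dict, it can be unpacked straight into the matching transformer, for example `SimpleImputer(**params["SimpleImputer"])`.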
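As a sketch, paths can sit under their own group and be wrapped in `pathlib.Path` objects after loading. The group name `paths`, its keys, and the file names below are all hypothetical, since the project’s real directory layout is not reproduced here.

```python
from pathlib import Path

import yaml

# Hypothetical paths; replace them with the ones your project uses.
PATHS_YAML = """
paths:
  raw_data: data/raw/card_transactions.csv
  processed_data: data/processed/card_transactions_clean.csv
  model: models/random_forest.pkl
"""

paths = yaml.safe_load(PATHS_YAML)["paths"]

# YAML gives us plain strings; Path objects are easier to join and inspect.
raw_data_path = Path(paths["raw_data"])

print(raw_data_path.name)
print(raw_data_path.suffix)
```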
Storing lists and dictionaries in the YAML file
There are two ways to store lists and dictionaries in a YAML file. The following are the hyper-parameters used for tuning the random forest classifier model used for fraud detection.
Approach-1:
In the first approach, we write lists and dictionaries inline, just as we would in Python: a list goes inside square brackets and a dictionary inside curly braces, with its entries written as key: value pairs.
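A small sketch of this inline style follows. The hyper-parameter names are real `RandomForestClassifier` arguments, while the candidate values are assumptions chosen only for illustration.

```python
import yaml

# Inline (flow) style: brackets for lists, braces for dictionaries.
FLOW_STYLE_YAML = """
n_estimators: [100, 200, 500]
max_depth: [5, 10, null]
class_weight: {0: 1, 1: 10}
"""

grid = yaml.safe_load(FLOW_STYLE_YAML)

print(grid["n_estimators"])   # a Python list of ints
print(grid["max_depth"])      # YAML null is loaded as Python None
print(grid["class_weight"])   # a Python dict with int keys
```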
Approach-2:
In the second approach, every list member goes on its own line, starting with the symbol ‘-’ at the same indentation level. Dictionaries are written as key: value pairs, one per line, indented under their parent key.
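The same kind of data written in this indented style could look like the sketch below. Again, the values are illustrative assumptions. The comparison at the end shows that the result is an ordinary Python structure, identical to what the inline style would produce.

```python
import yaml

# Block style: one "- item" per line for lists, indented pairs for dicts.
BLOCK_STYLE_YAML = """
n_estimators:
  - 100
  - 200
  - 500
max_depth:
  - 5
  - 10
  - null
class_weight:
  0: 1
  1: 10
"""

grid = yaml.safe_load(BLOCK_STYLE_YAML)

expected = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
    "class_weight": {0: 1, 1: 10},
}
print(grid == expected)
```

Which style to use is mostly a matter of taste: the inline form is compact for short lists, and the block form is easier to scan and to diff when the lists grow.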
Loading the variables from the YAML file into the Python file
Now let’s say we want to read the ‘verbose’ variable from our ‘parameters.yaml’ file in a Python file. We can do this in the following way.
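A minimal sketch of the loading step is shown here. So that it runs on its own, it first writes a stand-in ‘parameters.yaml’ into a temporary directory. Placing `verbose` at the top level of the file is an assumption; in a real project it might sit inside a group.

```python
import tempfile
from pathlib import Path

import yaml

# Create a stand-in parameters.yaml (assumed layout: verbose at top level).
params_path = Path(tempfile.mkdtemp()) / "parameters.yaml"
params_path.write_text("verbose: 3\n", encoding="utf-8")

# safe_load builds only plain Python objects (dict, list, str, int, ...).
with params_path.open(encoding="utf-8") as file:
    params = yaml.safe_load(file)

verbose = params["verbose"]
print(verbose)
```

`yaml.safe_load` is used rather than `yaml.load` because it refuses to construct arbitrary Python objects, which is the safer default for configuration files.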
You might wonder why we go through the trouble of loading the variable from the YAML file when we could simply set verbose to 3 directly in the Python file. There is a reason behind this practice.
Let’s say we use this variable in multiple files and later want to update it. With hard-coded values, we would have to go through all the files one by one and change each of them. If we instead store the variable in the YAML file and load it in every Python file, a single change in the YAML file is reflected everywhere the variable is used.
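This can be sketched with one small loader function shared by two consumers. The function names `load_params`, `train_step`, and `tune_step` are hypothetical and stand in for separate scripts in a real project.

```python
import tempfile
from pathlib import Path

import yaml

# Stand-in parameters.yaml in a temporary directory.
params_path = Path(tempfile.mkdtemp()) / "parameters.yaml"
params_path.write_text("verbose: 3\n", encoding="utf-8")


def load_params(path):
    """Read the YAML file and return its contents as a dict."""
    with open(path, encoding="utf-8") as file:
        return yaml.safe_load(file)


def train_step(path):
    # Stands in for one script, e.g. a training module.
    return load_params(path)["verbose"]


def tune_step(path):
    # Stands in for a second script, e.g. a tuning module.
    return load_params(path)["verbose"]


before = (train_step(params_path), tune_step(params_path))

# A single edit to the YAML file is picked up by both consumers.
params_path.write_text("verbose: 1\n", encoding="utf-8")
after = (train_step(params_path), tune_step(params_path))

print(before, after)
```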
Conclusion
In this article, we learned why we shouldn’t use the traditional approach of storing variables in Python files. We also learned the advantages of using a YAML file to store the variables and how it helps with experiments in a machine learning project. Check out the following link for the whole code for this article.
Outro
I hope you liked the article. If you have any thoughts on it, please let me know. Have a great day!