This article covers the different performance measures used in machine learning classification tasks and explains how to use each of them correctly.
Let’s start with the question: what is meant by a performance measure?
In the context of machine learning, a performance measure is a measuring tool that tells us how good our trained model is.
Usually, “accuracy” is considered the standard measure of performance. But accuracy has a serious drawback for classification problems with imbalanced classes. Let’s understand this using an example.
Suppose we have a validation dataset with 100 rows. The target column has only two unique values, “A” and “B” (a typical binary classification problem), with 80 A’s and 20 B’s. Now let’s use a trivial model that always outputs “A” regardless of the input features to predict the output for our validation dataset. Since this model is extremely simple, it will almost certainly underfit the data and will not generalize to new data. But we still get an accuracy of 80% on the validation data, which is very misleading.
This kind of dataset, where some classes are much more frequent than others, is called an imbalanced (or skewed) dataset. The accuracy measure gives misleading results on imbalanced datasets, which is why we need other ways of measuring the performance of our model.
It is not recommended to use accuracy as a performance measure on skewed datasets.
Note: It is okay to use the accuracy measure on a balanced dataset.
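As a quick illustration of the pitfall above, here is a minimal sketch in Python with scikit-learn. The labels and counts match the 80/20 example, and the “model” is simply a hard-coded list of predictions:

```python
from sklearn.metrics import accuracy_score

# Validation targets: 80 "A"s and 20 "B"s, as in the example above.
y_true = ["A"] * 80 + ["B"] * 20

# A trivial "model" that always predicts "A", regardless of the input.
y_pred = ["A"] * 100

# Accuracy looks impressive even though the model has learned nothing.
print(accuracy_score(y_true, y_pred))  # 0.8
```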
Let’s learn some other performance measures.
Confusion matrix
Precision
Recall
F1-score
Area under the receiver operating characteristic (ROC) curve
Confusion matrix
The general idea of the confusion matrix is to count the number of times instances of “A” are classified as “B” and vice versa.
To compute the confusion matrix, you first need to have predictions that can be compared with the actual target values.
Each row in the confusion matrix represents an actual class, while each column represents a predicted class.
Let’s take an example where the target column has two unique categories, “A” and “B”. Here we will treat “A” as the positive class and “B” as the negative class (non-A).
This is what a typical confusion matrix will look like.
TN, FP, FN, and TP stand for true negative, false positive, false negative, and true positive, respectively. Now let’s understand what these terms mean.
True negative (TN) is the count of negative instances (non-A) that were correctly predicted as negative.
False positive (FP) is the count of negative instances (non-A) that were wrongly predicted as positive.
False negative (FN) is the count of positive instances (A) that were wrongly predicted as negative.
True positive (TP) is the count of positive instances (A) that were correctly predicted as positive.
We can consider a classifier perfect when:
FP = FN = 0
that is, every prediction falls into TP or TN.
In other words, the confusion matrix of a perfect classifier has non-zero values only on its main diagonal.
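As a minimal sketch of how these counts could be obtained with scikit-learn (the toy labels below are made up purely for illustration; passing labels=["B", "A"] puts the negative class first so that ravel() returns the counts in (TN, FP, FN, TP) order):

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions; "A" is the positive class, "B" is non-A.
y_true = ["A", "A", "B", "B", "A", "B", "A", "A"]
y_pred = ["A", "B", "B", "A", "A", "B", "A", "B"]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["B", "A"])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```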
The confusion matrix gives you a lot of information about the performance of the model, but because it is a matrix, it is hard to grasp at a single glance. So we would like a more concise metric to measure the performance of our model.
Precision, Recall, and F1-Score
Precision and recall give us concise metrics for the performance measurement of a model.
Precision can be considered the accuracy of the positive predictions. It is easily read off the confusion matrix: precision = TP / (TP + FP).
Recall is the ratio of correctly predicted positive observations to the total number of actual positive observations: recall = TP / (TP + FN).
Recall also goes by other names, such as sensitivity or the true positive rate (TPR).
It is often convenient to combine precision and recall into a single metric called f1-score, especially if you want a simple way to compare two classifiers.
The f1-score is the harmonic mean of precision and recall: f1 = 2 × (precision × recall) / (precision + recall).
The harmonic mean gives more weight to the lower value, so a classifier only gets a high f1-score if both precision and recall are high.
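Here is a minimal sketch of computing all three metrics with scikit-learn, reusing the toy labels from the confusion matrix example (pos_label="A" tells scikit-learn which class to treat as positive):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["A", "A", "B", "B", "A", "B", "A", "A"]
y_pred = ["A", "B", "B", "A", "A", "B", "A", "B"]

# "A" is the positive class in this toy example.
precision = precision_score(y_true, y_pred, pos_label="A")  # TP / (TP + FP)
recall = recall_score(y_true, y_pred, pos_label="A")        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label="A")                # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```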
However, a high f1-score is not always what you want. In some cases you might need higher precision at the cost of lower recall, and in other cases higher recall at the cost of lower precision; it depends on the task at hand. Let’s take two examples to make this concrete.
Example 1:
Suppose you train a classifier to detect videos that are safe for kids. For this classifier, it is okay if some safe videos are predicted as not safe, but the number of times a not-safe video is predicted as safe should be as low as possible. This means we can tolerate a relatively high FN count but require a very low FP count.
High FN implies a low recall value and low FP implies a high precision value.
Example 2:
Suppose you train a classifier to detect shoplifters in surveillance camera footage. In this case, it is okay if an innocent person is occasionally flagged as a shoplifter (morally questionable, but consider only the machine learning context here), but the number of times a shoplifter is predicted to be innocent should be as low as possible. This means we require a very low FN count and can tolerate a relatively high FP count.
Low FN implies a high recall value and high FP implies low precision.
Now let’s go back to the case of wanting both high recall and high precision. Unfortunately, you cannot have it both ways: increasing precision tends to reduce recall, and vice versa. This is known as the precision/recall trade-off.
In this case, we look at the precision-recall curve and select the point on the graph where precision and recall are both acceptably high for the task at hand.
This is what a typical precision-recall curve looks like. For example, from the graph we can pick the point where recall = 0.65 and precision = 0.75, which gives fairly high values for both.
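If you want to plot such a curve yourself, here is a minimal sketch using scikit-learn’s precision_recall_curve. The synthetic dataset and logistic regression classifier are placeholders for illustration, not the model behind the figure above:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, slightly imbalanced binary classification data.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Precision and recall at every possible decision threshold.
precisions, recalls, thresholds = precision_recall_curve(y_val, scores)

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve")
plt.show()
```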
Receiver operating characteristic (ROC) curve
The ROC curve is another common metric for binary classification problems. It plots the true positive rate (i.e. recall) against the false positive rate at various classification thresholds.
This is what a typical ROC curve looks like.
The true positive rate (TPR) and false positive rate (FPR) are computed as TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
The dotted line represents the ROC curve of a purely random classification model; a good classifier always stays as far away from that line as possible (towards the top-left corner).
One way to quantify how far the ROC curve of our model is from the random-model line is to compute the area under the curve (AUC). A purely random classifier has an AUC of 0.5, while a perfect model has an AUC of 1, as far from the random line as possible.
The area under the ROC curve can therefore be used to compare two classifiers: the one with the higher AUC is the better model in terms of performance.
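Here is a minimal sketch of plotting the ROC curve and computing the AUC with scikit-learn, using the same kind of synthetic data and placeholder classifier as in the precision-recall sketch:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just for illustration.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# FPR and TPR at every threshold, plus the area under the curve.
fpr, tpr, thresholds = roc_curve(y_val, scores)
auc = roc_auc_score(y_val, scores)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```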
When to use a precision-recall curve and when to use a ROC curve?
As a rule of thumb, you should prefer a precision-recall curve whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, use the ROC curve.
For example, if we are building a classifier to detect a rare disease, the positive class will be rare, so the precision-recall curve is the better choice as a performance measure.
I hope you liked the article. Have a great day!