
From Raw to Refined: A Journey Through Data Preprocessing — Part 5: Outliers

A Simple Guide to Navigating Data Anomalies. Decode the mystery behind outliers in data science. From detection to resolution, empower your analysis with straightforward techniques. Master outliers, master your data.

Table of Contents
  1. What is an outlier?

  2. Why is learning about outliers important?

  3. Reasons for the presence of outliers in the data

  4. Outlier vs Noise

  5. Types of outliers

  6. How to detect outliers in the data?

  7. Methods to deal with outliers in the data

  8. Outro


 

What is an outlier?

An outlier is anything vastly different from the rest of its kind.


Observe the above picture of the chessboard. There is only one brown piece on the chessboard, while all the other pieces are yellow. So, we can consider the brown chess piece an outlier in the context of the chessboard.


In the context of data analysis, an outlier is a data point that lies very far from the other data points in the dataset.


Let’s take the example of the net worth, in US dollars, of people living on Earth. Very few people, like Elon Musk, Bill Gates, etc., have a net worth of over 100 billion, while all the remaining people have a net worth that is nowhere near 100 billion. So, in this case, we can consider people like Elon Musk or Bill Gates outliers as far as net worth is concerned. Similarly, most people on Earth have a decent net worth; however, some groups of people live in extreme poverty. If we consider the whole population of Earth, these people would also be considered outliers in the context of net worth.


Why is learning about outliers important?

In statistics, we learned about measures of central tendency such as the mean, median, and mode. The mean is a popular measure of central tendency, but it is highly sensitive to outliers, which can pull it towards one side of the data. This is particularly important because the mean is used in various operations such as missing value imputation. Additionally, some machine learning algorithms are sensitive to outliers, which is another reason to be cautious of them.


Let’s again revisit the example of net worth from earlier. If we calculated the mean of the net worth of all the people on Earth, we would get a very high value due to the presence of a few people with incredibly high net worth. This means that the mean value does not represent the entire population and is misleading.
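As a quick illustration of this sensitivity, here is a minimal sketch using made-up net worth values (in thousands of dollars); adding a single extreme value drags the mean far away from the bulk of the data, while the median barely moves.

import numpy as np

# Hypothetical net worth values (in thousands of dollars) for ten people
net_worth = np.array([35, 42, 50, 55, 60, 61, 70, 72, 80, 90])
print(np.mean(net_worth), np.median(net_worth))        # 61.5 and 60.5

# Append one extreme value and recompute
net_worth_with_outlier = np.append(net_worth, 1_000_000)
print(np.mean(net_worth_with_outlier))                 # jumps to roughly 90,965
print(np.median(net_worth_with_outlier))               # still 61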


Reasons for the presence of outliers in the data

Outliers may be present in the data due to the natural variability of the data, or due to experimental or survey errors made during the data collection phase.


Outlier vs Noise

Outliers are fundamentally different from noise. An outlier is a data point significantly different from the other data points in the dataset, whereas noise is random error present in the data. Outliers are part of the data and are sometimes even useful (for example, in credit card fraud detection). Noise, however, adversely affects the data and unnecessarily increases the space required to store it.


Types of outliers

Types based on the number of features used to find the outliers:


  1. Univariate Outliers:

An outlier that is present in a single feature of the data is known as a univariate outlier. For example, an outlier present in the salary feature of an employee dataset is a univariate outlier. Employees whose salary is very low or very high can be considered outliers. In the diagram below, you can see that some of the employees have very high salaries; those are considered outliers.



2. Multivariate Outliers:


An outlier that is detected using two or more features of the data together is known as a multivariate outlier. For example, an outlier found using both the salary and age features of an employee dataset is a multivariate outlier. Here, the outliers could be employees with a low age and a very high income, or employees with a high age and a very low income.




In the above diagram, you can observe that despite their low age, some employees have very high salaries. On the other hand, some employees with a high age have relatively low salaries. These two categories of employees can be considered outliers. Here, we are using both age and salary to find the outliers, hence they are multivariate outliers.


We can visualize outliers only when we use at most three features to detect them; beyond that, we cannot visualize the outliers.
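A two-feature plot like the one described above can be sketched with a simple scatter plot. The age and salary values below are made up purely for illustration; they are not the article's original dataset.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Bulk of the employees: salary loosely increases with age
age = rng.integers(22, 60, size=100)
salary = age * 1.5 + rng.normal(0, 5, size=100)

# A few multivariate outliers: young with very high pay, old with very low pay
outlier_age = np.array([23, 25, 58, 59])
outlier_salary = np.array([150, 160, 10, 12])

plt.scatter(age, salary, label="regular employees")
plt.scatter(outlier_age, outlier_salary, color="red", label="possible outliers")
plt.xlabel("age")
plt.ylabel("salary (arbitrary units)")
plt.legend()
plt.show()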


Types based on the location or distribution of outliers:


  1. Global outlier or point outlier:

An individual data point that is very far away from the rest of the data is considered a global or point outlier. In the multivariate outlier example above, the record where the salary is 1000 units can be treated as a global outlier.


2. Contextual outlier or conditional outlier:


Finding this kind of outlier depends on some context or additional information. If we consider the amount of rainfall as a feature, then to find the outliers we would need additional information, namely the geographical region for which the data was collected (i.e., the "context").


For example, if the data shows a high amount of rain in the desert region or a low amount of rain in a very rainy region then these cases would be considered outliers.


3. Collective outliers:


If a small group of data points is significantly different from the rest of the data, then the data points in this group are called collective outliers.



In the above diagram, you can observe that the group of data points located at the top right is vastly different in value from the rest of the data. These are collective outliers.


How to detect outliers in the data?


Detecting outliers using the boxplot or interquartile range:


The boxplot is a very intuitive way to visualize the outliers in our data.


import pandas as pd
import matplotlib.pyplot as plt

# df       --> DataFrame containing our data
# num_feat --> list containing the names of all numerical features
df[num_feat].plot(kind="box")
plt.show()


In the above plot, outliers are represented using hollow circles.


We can see some hollow circles for each of the numerical features, which means outliers are present in our data. However, the plot only tells us that outliers exist. If we want to get the actual outlier values, we can use the quartiles and the interquartile range to find them, using the following rules.


1. Interquartile range (IQR) = third quartile value (Q3) - first quartile value (Q1)


2. If data point value > Q3 + k * IQR ==> the data point is an outlier


3. If data point value < Q1 - k * IQR ==> the data point is an outlier


4. Otherwise, the data point is not an outlier.


Usually, the constant k is set to 1.5. The following code shows how one can find the outliers using the interquartile range and the quartiles.


import numpy as np

def detect_outliers_iqr(data):
    # Return the values in data that lie outside the IQR fences
    outliers = []
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lwr_bound = q1 - (1.5 * iqr)
    upr_bound = q3 + (1.5 * iqr)
    for value in data:
        if value < lwr_bound or value > upr_bound:
            outliers.append(value)
    return outliers

# Driver code (sample is a list or array of numerical values)
sample_outliers = detect_outliers_iqr(sample)
print("Outliers from IQR method: ", sample_outliers)

Detecting outliers using the Z-score:


The values present at the far left or far right of the data distribution can be considered outliers. Generally, by far left we mean a data point whose Z-score is less than -3, and by far right we mean a data point whose Z-score is greater than 3.


Z-score = (data point value - mean of the feature) / standard deviation of the feature


If (Z-score < -3) OR (Z-score > 3) ==> the data point is an outlier


Otherwise, the data point is not an outlier.


The following code shows how one can find the outliers using the Z-score.


import numpy as np

def detect_outliers_zscore(data, threshold=3):
    # Return the values in data whose absolute Z-score exceeds the threshold
    outliers = []
    mean = np.mean(data)
    std = np.std(data)
    for value in data:
        z_score = (value - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(value)
    return outliers

# Driver code (sample is a list or array of numerical values)
sample_outliers = detect_outliers_zscore(sample)
print("Outliers from Z-score method: ", sample_outliers)

Methods to deal with outliers in the data

Removing outliers from the data:


One approach to handling outliers is simply to remove them, but this is generally not recommended. Unless we can be certain that the outliers are the result of some error (such as an error during data collection), removing them will cause a significant loss of information.
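If removal is still the right call (for example, when the points are known data-entry errors), the IQR fences from the detection section can be reused to filter the rows. The following is only a minimal sketch, assuming df is a pandas DataFrame and "salary" is the column being cleaned; both names are placeholders.

import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    # Return a copy of df with rows outside the IQR fences of `column` dropped
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lwr_bound = q1 - k * iqr
    upr_bound = q3 + k * iqr
    return df[(df[column] >= lwr_bound) & (df[column] <= upr_bound)].copy()

# Example usage (placeholder column name)
# cleaned_df = remove_outliers_iqr(df, "salary")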


Replacing the outliers with the median of the data:


Since the mean of the data is highly sensitive to the presence of outliers, we can replace the outliers with the median of the data, which is a far more robust statistic.
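Here is a minimal sketch of this idea, again using the IQR fences to decide which values to replace; df and "salary" are placeholder names, not part of the article's original code.

import pandas as pd

def replace_outliers_with_median(df, column, k=1.5):
    # Replace values outside the IQR fences of `column` with the column median
    df = df.copy()
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lwr_bound = q1 - k * iqr
    upr_bound = q3 + k * iqr
    median = df[column].median()
    df.loc[(df[column] < lwr_bound) | (df[column] > upr_bound), column] = median
    return df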


Quantile-based flooring and capping:


In this method, we replace the outliers at the lower end of the data distribution with some low threshold percentile value (say the 10th percentile), and we replace the outliers at the higher end of the distribution with some high threshold percentile value (say the 90th percentile).


If an outlier value is below the 10th percentile of the data, replace it with the 10th percentile value.


If an outlier value is above the 90th percentile of the data, replace it with the 90th percentile value.


We can change the threshold percentiles used above as per the distribution of our data. We can always use the outlier detection methods we learned earlier to confirm if the outliers have been dealt with.
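A minimal sketch of flooring and capping with NumPy, assuming data is a one-dimensional array of a feature's values and using the 10th and 90th percentiles as the thresholds:

import numpy as np

def floor_and_cap(data, lower_pct=10, upper_pct=90):
    # Clip values below the lower percentile and above the upper percentile
    data = np.asarray(data, dtype=float)
    lower = np.percentile(data, lower_pct)
    upper = np.percentile(data, upper_pct)
    return np.clip(data, lower, upper)

# Example usage with made-up salary values
salaries = [20, 25, 27, 30, 32, 35, 40, 42, 45, 500]
print(floor_and_cap(salaries))
# The extreme value 500 is capped at the 90th percentile of the data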


 


Outro

Thanks for reading!


Have a great day!


