This article explains how to identify duplicate records in data and the different ways to deal with them.
Why is the presence of duplicate records in data a problem?
Many programmers ignore duplicate values in their data, but dealing with duplicate records is quite important.
Having duplicate records can lead to incorrect data analysis and decision-making.
For example, what happens when you replace missing values (imputation) with the mean in data that contains duplicate records?
In this scenario, an incorrect mean value may be used for imputation. Let’s take an example.
Consider the following data. It contains two columns, Name and Weight. Note that the record for ‘John’ is repeated, and the weight value for ‘Steve’ is missing.
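Here is a minimal sketch of such data in pandas; the names other than ‘John’ and ‘Steve’ are made up for illustration:

```python
import pandas as pd
import numpy as np

# Sample data: the record for John is repeated,
# and the weight value for Steve is missing.
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Sam', 'Rita', 'John', 'Steve'],
    'Weight': [98, 50, 67, 66, 50, np.nan]
})
print(df)
```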
If one wishes to impute Steve’s missing weight with the mean of all the weight values, the imputation would be done using an incorrect mean, i.e.,
(98 + 50 + 67 + 66 + 50)/5 = 66.2
But the actual mean of the data, ignoring the duplicate value, is
(98 + 50 + 67 + 66)/4 = 70.25
Therefore, the missing value will be imputed incorrectly unless we do something about the duplicate records.
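We can verify this with pandas, assuming the sample DataFrame df built above (drop_duplicates() is covered later in the article):

```python
# Mean over all rows, including the duplicate record
print(df['Weight'].mean())                    # 66.2

# Mean after dropping the duplicate record
print(df.drop_duplicates()['Weight'].mean())  # 70.25
```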
Moreover, duplicate values may even affect business decisions made using such faulty data.
In summary, duplicate records should be dealt with to keep the data free from such problems.
Now, let’s see different methods to deal with the duplicate records in the data.
Identifying the duplicate values in the data
We can use the pandas duplicated() method to identify the rows that are duplicated in the data.
Now, let’s understand the duplicate values using an example.
Identifying duplicate values:
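A minimal sketch, using the sample DataFrame df built above:

```python
# Flag rows that are exact duplicates of an earlier row
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# 5    False
# dtype: bool
```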
We get the value True where a duplicate record is present and False where a record is unique.
Note that by default, the duplicated() method uses all the columns to find duplicate records. But we can also use a subset of columns to find duplicates. To do this, the duplicated() method has a parameter named subset, which takes a list of the column names we want to use for finding duplicates.
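A sketch using only the Name column (the column choice here is just for illustration):

```python
# Find duplicates based only on the 'Name' column
print(df.duplicated(subset=['Name']))
# With the sample data, only index 4 is flagged,
# since 'John' already appears at index 1.
```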
Additionally, the duplicated() method has one more important parameter named keep. The value of keep decides whether the first or the last record among the duplicates is considered unique. There is also an option to consider all the duplicate records as non-unique. The three options, each illustrated below, are:
keep = ‘first’: The first record among the duplicates is considered unique.
keep = ‘last’: The last record among the duplicates is considered unique.
keep = False: All the duplicate records are considered non-unique.
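Let’s see each option in action with the sample data, starting with keep = ‘first’ (the default):

```python
# keep='first' (default): later occurrences are flagged as duplicates
print(df.duplicated(keep='first'))
# Only index 4 is True.
```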
Notice here that the first duplicate record (at index 1) is considered unique and the other occurrence (at index 4) is flagged as a duplicate.
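Next, keep = ‘last’:

```python
# keep='last': earlier occurrences are flagged as duplicates
print(df.duplicated(keep='last'))
# Only index 1 is True.
```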
Notice here that the last duplicate record (at index 4) is considered unique and the other occurrence (at index 1) is flagged as a duplicate.
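Finally, keep = False:

```python
# keep=False: every occurrence of a duplicated row is flagged
print(df.duplicated(keep=False))
# Indices 1 and 4 are both True.
```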
Notice here that all the duplicate records (at indices 1 and 4) are flagged.
How to deal with duplicate records in data
The next step after identifying the duplicate records is to deal with them.
There are two ways to deal with the duplicate records in the data.
Removing the duplicate records
Let’s start with the approach where we remove the duplicate records.
We can make use of the pandas drop_duplicates() method for this.
By default, the drop_duplicates() method keeps the first record from each set of duplicates and drops the rest from the data. Also by default, it uses all the columns to identify the duplicate records.
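A minimal sketch of the default behavior, again with the sample df:

```python
# Drop duplicate rows, keeping the first occurrence of each
df_unique = df.drop_duplicates()
print(df_unique)
# The duplicate John record at index 4 is removed.
```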
But this default behavior can be changed using two parameters of the drop_duplicates() method:
keep
subset
They work exactly like the keep and subset parameters of the duplicated() method.
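For instance, a sketch combining both parameters (the column choice is just for illustration):

```python
# Match duplicates on 'Name' only and keep the last occurrence
print(df.drop_duplicates(subset=['Name'], keep='last'))
# The John record at index 1 is dropped; the one at index 4 is kept.
```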
Updating the duplicate records
Sometimes we want to replace duplicate records with some other value. Say we found two duplicate records and then learned that the person who collected the data accidentally entered the wrong name in one of them. In such a case, we would want to put in the correct person’s name, which would solve the problem of duplicate values.
Here, we have duplicate records at indices 1 and 4. Now, if we change the value of the ‘Name’ column at index 1, we won’t have duplicate values anymore.
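A sketch of such a correction; the replacement name ‘James’ is made up for illustration:

```python
# Fix the wrongly entered name at index 1
df.loc[1, 'Name'] = 'James'  # hypothetical corrected name
print(df)
```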
We changed the ‘Name’ value for the first duplicate record. Now, let’s check again whether any duplicate records are present in the data:
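```python
# Re-check for duplicates after the correction
print(df.duplicated())
# All values are now False.
```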
Now, we don’t have any duplicate records.
I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.
Have a great day!