How to Find and Remove Duplicates in Pandas Dataframes
Duplicate records in datasets can cause various issues and inconsistencies in data analysis and processing. They can lead to incorrect results, biased analysis, and wasted computational resources. Therefore, it is essential to find and remove duplicates in Pandas dataframes to ensure data integrity and accurate analysis.
What are duplicates in a dataset?
Duplicates in a dataset refer to rows that have identical values across all or some of their columns. These duplicate records can occur due to various reasons, including data entry errors, data integration processes, data extraction and collection methods, and data replication and synchronization issues.
How to Detect Duplicates in a Pandas Dataframe
Finding duplicates in a table
"duplicated()" function identifies duplicate rows in a dataframe. By default, it considers all columns when identifying duplicates. The function returns a boolean series where
"True" indicates a duplicate row.
import pandas as pd ## Given the following table named pets
## To check duplicates pets.duplicated() ## Output 0 False 1 True 2 True
animal colour cat black dog white dog white
Find duplicates in a specific column
To find duplicates based on a particular field, the "duplicated()" function can be used with the "subset" parameter. You specify a column name or a list of column names, and the function will check for duplicates only within those columns. For example, in the following table about cities and countries, we detect duplicates in the "city" column:

```python
## The city table
##    city       country  population
## 0  Paris      France   8000000
## 1  Madrid     Spain    6000000
## 2  New Delhi  India    NaN
## 3  Ibiza      Spain    2000000
## 4  Paris      France   8000000

## To see duplicates in the city column
city[city.duplicated(subset='city')]

## Output
##    city   country  population
## 4  Paris  France   8000000
```
"keep" parameter of the
"duplicated()" function controls which duplicate rows to keep. By default, it keeps the first occurrence of each duplicated row and marks subsequent occurrences as duplicates. However, it can also be set to keep the last occurrence or to mark all duplicates as
True. It is very usuful when you need to see all the rows duplicated so you can compare them:
# We find and display all duplicated rows city[city.duplicated(keep=False)] ## Output
city country population Paris France 8000000 Paris France 8000000
How to Remove Duplicates in a Pandas Dataframe
To remove duplicates from a dataframe, use the "drop_duplicates()" function. By default, it keeps the first occurrence of each duplicated row and removes the subsequent duplicates.

If we want to keep the last occurrence of each duplicated row and remove the preceding duplicates, the "drop_duplicates()" function can be called with the "keep" parameter set to "last".

To remove all duplicate rows from a dataframe, irrespective of their occurrence, the "drop_duplicates()" function can be called with the "keep" parameter set to "False". Following the dataframe of the previous example:

```python
## We remove duplicates of the table
city.drop_duplicates()

## Output
##    city       country  population
## 0  Paris     France    8000000
## 1  Madrid    Spain     6000000
## 2  New Delhi  India    NaN
## 3  Ibiza     Spain     2000000
```
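The two "keep" settings described above can be sketched as follows, rebuilding the city table as a small hypothetical dataframe:

```python
import pandas as pd

## A small table with one repeated row (hypothetical data for illustration)
city = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi", "Ibiza", "Paris"],
    "country": ["France", "Spain", "India", "Spain", "France"],
    "population": [8000000, 6000000, None, 2000000, 8000000],
})

## Keep the last occurrence of each duplicated row (drops row 0)
last_kept = city.drop_duplicates(keep="last")

## Remove every row that has a duplicate, keeping none of them
## (drops both row 0 and row 4)
none_kept = city.drop_duplicates(keep=False)

print(last_kept)
print(none_kept)
```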
Causes of Duplicates in Datasets
Data entry errors: One of the primary causes of duplicates is human error during data entry. Typos, incorrect data formatting, and accidental duplication can lead to records with the same values being entered multiple times.
Data integration and merging: When integrating data from multiple sources or merging datasets, duplicates can arise. Inconsistencies in record matching or incomplete merging processes can result in duplicated records.
Data extraction and collection: During data extraction and collection processes, duplicates can occur due to various factors. For example, web scraping may retrieve duplicate records if not properly filtered or if the same data source is scraped multiple times.
Data replication and synchronization: In distributed systems or databases that replicate and synchronize data across multiple nodes or servers, duplicates can emerge due to replication delays, network issues, or conflicts during synchronization.
Handling Duplicates in Large Datasets
When dealing with large datasets, it may not be feasible to process the entire dataset at once. Sampling techniques, such as random sampling or stratified sampling, can be used to select a representative subset of the data for duplicate detection and removal.
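Both sampling strategies can be sketched in a few lines; the dataframe and its "group" column here are hypothetical stand-ins for a much larger dataset:

```python
import pandas as pd

## Hypothetical dataset; in practice this would be millions of rows
df = pd.DataFrame({"id": range(1000), "group": [i % 10 for i in range(1000)]})

## Random sampling: inspect a 10% subset for duplicates
sample = df.sample(frac=0.1, random_state=42)
dup_rate = sample.duplicated(subset="group").mean()

## Stratified sampling: draw the same fraction from every group,
## so rare groups are still represented in the subset
stratified = df.groupby("group").sample(frac=0.1, random_state=42)
```

Estimating the duplicate rate on a sample first can tell you whether a full (and expensive) deduplication pass is worth running.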
Parallel processing for efficient duplicate detection
For computationally intensive operations, parallel processing can significantly speed up duplicate detection. By splitting the data into multiple chunks and processing them simultaneously, the overall processing time can be reduced.
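One possible sketch of this chunk-and-hash pattern: rows are reduced to 64-bit hashes in parallel with pandas' "hash_pandas_object", then the hashes are compared globally. The chunk size and executor choice are illustrative; for heavy CPU-bound work a process pool is typically used instead of threads.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

## Hypothetical dataset with some repeated rows
df = pd.DataFrame({
    "a": [1, 2, 3, 1, 2, 3, 4, 5],
    "b": ["x", "y", "z", "x", "y", "q", "r", "s"],
})

def hash_chunk(chunk: pd.DataFrame) -> pd.Series:
    ## Reduce each row to a single 64-bit hash so chunks can be
    ## compared globally without keeping full rows around
    return pd.util.hash_pandas_object(chunk, index=False)

## Split the dataframe into chunks and hash them concurrently
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]
with ThreadPoolExecutor() as pool:
    hashes = pd.concat(pool.map(hash_chunk, chunks), ignore_index=True)

## Rows whose hash was already seen are flagged as duplicates.
## Hash collisions are possible, so critical pipelines may add an
## exact-match verification step on the flagged rows.
dup_mask = hashes.duplicated().to_numpy()
deduplicated = df[~dup_mask]
```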
Chunking data for memory optimization
If memory constraints are a concern, the dataset can be processed in smaller chunks instead of loading the entire dataset into memory. By processing data in chunks, memory utilization can be optimized, enabling the detection and removal of duplicates in a memory-efficient manner.
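A minimal sketch of chunked deduplication, assuming the data arrives as CSV and using the "chunksize" parameter of "read_csv"; the set of row hashes is an illustrative bookkeeping choice that keeps only one small value per row in memory:

```python
import pandas as pd
from io import StringIO

## Hypothetical CSV source; in practice this would be a large file on disk
csv_data = StringIO(
    "city,country\n"
    "Paris,France\n"
    "Madrid,Spain\n"
    "Paris,France\n"
    "Ibiza,Spain\n"
    "Madrid,Spain\n"
)

seen = set()          ## hashes of every row kept so far
unique_chunks = []

## Read and deduplicate two rows at a time instead of the whole file
for chunk in pd.read_csv(csv_data, chunksize=2):
    hashes = pd.util.hash_pandas_object(chunk, index=False)
    mask = [h not in seen for h in hashes]   ## keep rows not seen before
    seen.update(hashes)
    unique_chunks.append(chunk[mask])

result = pd.concat(unique_chunks, ignore_index=True)
```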
Best Practices to Avoid Duplicates
Standardized data entry processes
Implementing standardized data entry processes can minimize the occurrence of duplicates. This includes data validation checks, enforcing data integrity rules, and providing clear guidelines to data entry personnel.
Implementing data validation checks
Data validation checks, such as uniqueness constraints on key columns, can be enforced at the database level to prevent duplicates from being inserted in the first place. Additionally, pre-processing steps, such as data cleansing and normalization, can help identify and handle potential duplicates before analysis.
Regular data cleaning and maintenance
Performing regular data cleaning and maintenance tasks can help keep datasets free from duplicates. This includes periodically running duplicate detection processes, updating and merging datasets, and removing outdated or irrelevant data.
In this guide, we explored the causes of duplicates in datasets, the importance of treating them, and various techniques for detecting and removing duplicates in Pandas dataframes.
For a better understanding of how to cope with duplicates, have a look at our data cleaning projects to practice when and how to detect and remove duplicates in a dataset.