How to handle Nulls in Pandas
Missing values, often represented as null or NaN (Not a Number), are a common occurrence in datasets. Dealing with Nulls in Pandas is crucial for accurate data analysis and modeling. In this comprehensive guide, we will explore various techniques to handle missing values in a dataset using Pandas.
Understanding Missing Values
Types of Nulls
In Pandas, missing values show up under several names: np.nan, None, NaN, and the generic term "null". It is important to understand how they relate.
- np.nan: A floating-point value defined by NumPy (the IEEE 754 "not a number"). It is the standard marker for missing numerical values.
- None: A Python built-in constant, often used to represent missing values in non-numeric data types. In a numeric column, Pandas converts None to NaN.
- Null: Not a Python object. It is the generic term, borrowed from SQL and other languages, for any missing value; methods such as isnull() use it in that sense.
- NaN: How np.nan is displayed when Pandas prints a DataFrame or Series. It is the same value as np.nan, not a separate type.
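The relationship between these representations can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float; None is its own singleton type.
print(type(np.nan))
print(type(None))

# NaN never compares equal to anything, including itself,
# so test for it with pd.isna() rather than ==.
nan_equals_itself = np.nan == np.nan

# pd.isna() treats both np.nan and None as missing; 0 is a real value.
flags = [pd.isna(np.nan), pd.isna(None), pd.isna(0)]

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numbers = pd.Series([1, None, 3])

# In a Series of strings, None is also reported as missing.
labels = pd.Series(["a", None, "c"])

print(nan_equals_itself, flags)
print(numbers.dtype, numbers.isna().tolist(), labels.isna().tolist())
```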
The Problem with Nulls in a Table
Null values can cause problems in data analysis and modeling. They can lead to incorrect calculations, biased results, and errors in machine learning algorithms. Therefore, it is crucial to treat them appropriately to ensure the accuracy and reliability of the analysis.
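As a small illustration of how a missing value can change a result, the sketch below uses a hypothetical population Series with one NaN. By default Pandas skips NaN in aggregations, which may or may not be what the analysis needs.

```python
import numpy as np
import pandas as pd

# Hypothetical population figures with one missing entry.
population = pd.Series([8_000_000, 5_000_000, np.nan])

# By default NaN is skipped, so the mean covers only the two known values.
mean_known = population.mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_strict = population.mean(skipna=False)

# Element-wise arithmetic also propagates NaN.
doubled = population * 2

print(mean_known, mean_strict)
print(doubled)
```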
How to Find Null Values
To detect missing values in a dataset, Pandas provides several methods, such as "isnull()", "isna()", and "notnull()". Each returns a boolean mask with the same shape as the data: "isnull()" and its alias "isna()" mark missing elements as True, while "notnull()" marks present elements as True.
import pandas as pd
# Given the following table about cities
df = pd.read_csv('cities.csv', sep=';')
df
## Output
city country population
Paris France 8000000
Madrid NaN 5000000
New Delhi India NaN
# Check for null values using isnull()
null_mask = df.isnull()
print(null_mask)
## Output
city country population
False False False
False True False
False False True
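The three detection methods can be compared side by side. This sketch rebuilds the cities table in memory (an assumption, so that no CSV file is needed) and checks that isna() matches isnull() and that notnull() is the inverse mask.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the cities table.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8_000_000, 5_000_000, np.nan],
})

null_mask = df.isnull()
na_mask = df.isna()           # alias of isnull()
present_mask = df.notnull()   # True where a value is present

same_as_isnull = null_mask.equals(na_mask)
inverse_of_isnull = present_mask.equals(~null_mask)
print(same_as_isnull, inverse_of_isnull)
```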
To filter rows by the missing values of a certain field, you can also use the "isnull()" method. In our example, to find which cities have an unknown population, apply the method to that column and use the result as a row filter:
## Select cities where the population is unknown
df[df.population.isnull()]
city country population
New Delhi India NaN
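The same boolean-mask idea extends to other filters. The sketch below, again on a hypothetical in-memory copy of the table, selects rows that have a null in any column and rows where the population is known.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the cities table.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8_000_000, 5_000_000, np.nan],
})

# Rows with at least one missing value in any column.
rows_with_nulls = df[df.isna().any(axis=1)]

# Rows where the population is known (the inverse filter).
known_population = df[df["population"].notna()]

print(rows_with_nulls)
print(known_population)
```

Bracket access (df["population"]) is used here instead of attribute access (df.population) because it also works for column names that contain spaces or clash with DataFrame method names.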
How to Replace Null Values
It is often necessary to replace nulls with appropriate values such as 0 or an empty string (""). Pandas has two methods for replacing null values: fillna() and replace().
The fillna() Method
The "fillna()" method fills null values with a specified or calculated value. Here are some examples:
# Fill null values with "unknown" in the "population" column
df['population'] = df.population.fillna('unknown')
df
city country population
Paris France 8000000
Madrid NaN 5000000
New Delhi India unknown
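# Fill null values in numeric columns with the mean of each column
# (apply this to the original df, before "population" holds the string "unknown")
df_mean = df.fillna(df.mean(numeric_only=True))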
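fillna() also accepts a dictionary that maps column names to fill values, which allows a different strategy per column. This is a minimal sketch on a hypothetical in-memory copy of the table; the choice of "unknown" and the column mean is only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the cities table.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8_000_000, 5_000_000, np.nan],
})

# Use a text placeholder for the country and the column mean for the population.
filled = df.fillna({
    "country": "unknown",
    "population": df["population"].mean(),
})

print(filled)
```

fillna() returns a new DataFrame by default, so df itself still contains its nulls after this call.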
The replace() Method
The "replace()" method is used to replace values, including null values. For example:
# Replace all null values with -1
import numpy as np
df_replaced = df.replace(np.nan, -1)
print(df_replaced)
## Output
city country population
Paris France 8000000
Madrid -1 5000000
New Delhi India -1
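replace() also works in the opposite direction: turning placeholder values into proper nulls so that the other methods in this guide can find them. The sketch below assumes a hypothetical raw table in which "N/A" and -1 were used as placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where "N/A" and -1 stand in for missing values.
raw = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", "N/A", "India"],
    "population": [8_000_000, 5_000_000, -1],
})

# A nested dictionary limits each replacement to one column:
# {column: {old_value: new_value}}.
cleaned = raw.replace({
    "country": {"N/A": np.nan},
    "population": {-1: np.nan},
})

print(cleaned.isna().sum())
```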
How to Drop Columns or Rows with Null Values
Sometimes columns or rows that contain null values cannot be used in the analysis and are best removed. Pandas provides the "dropna()" method to drop columns or rows with null values.
# Drop rows with null values
df_dropped_rows = df.dropna()
print(df_dropped_rows)
## Output
city country population
Paris France 8000000
# Drop columns with null values
df_dropped_columns = df.dropna(axis=1)
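dropna() takes a few parameters that make it less drastic than removing every row or column with any null. The sketch below shows subset, how, and thresh on a hypothetical copy of the table with one extra, completely empty row added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the cities table plus one fully empty row.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi", np.nan],
    "country": ["France", np.nan, "India", np.nan],
    "population": [8_000_000, 5_000_000, np.nan, np.nan],
})

# Drop rows only when "population" is missing.
only_known_pop = df.dropna(subset=["population"])

# Drop rows only when every column is missing.
not_all_empty = df.dropna(how="all")

# Keep rows that have at least two non-null values.
at_least_two = df.dropna(thresh=2)

print(only_known_pop)
print(not_all_empty)
print(at_least_two)
```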
How to Count Null Values
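To get a count of the null values in each column of a dataset, you can use the "isnull()" method in combination with the "sum()" method. Because True counts as 1 and False as 0, summing the boolean mask gives the number of nulls per column.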
# Count null values in each column
null_counts = df.isnull().sum()
print(null_counts)
## Output
city 0
country 1
population 1
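The same mask can produce other summaries, such as the total number of nulls, the share of nulls per column, or the number of affected rows. This is a minimal sketch on a hypothetical in-memory copy of the table.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the cities table.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8_000_000, 5_000_000, np.nan],
})

null_counts = df.isna().sum()            # nulls per column
total_nulls = int(null_counts.sum())     # nulls in the whole table
null_share = df.isna().mean()            # fraction of nulls per column
rows_with_nulls = int(df.isna().any(axis=1).sum())  # rows with at least one null

print(null_counts)
print(total_nulls, rows_with_nulls)
print(null_share)
```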
We learned how to detect missing values using methods like isnull(), isna(), and notnull(), and how to replace null values using fillna() and replace(). We also discussed methods for dropping columns and rows with null values and counting null values in a dataset.
The ability to handle missing values is an important skill for any data analyst or data scientist, and with the help of Python Pandas, you can confidently work with datasets that contain missing values. To learn more about Nulls and NaNs, start practicing with our real projects and exercises about data cleaning and Pandas in general.