Missing Values in Pandas

How to handle Nulls in Pandas

Missing values, often represented as null or NaN (Not a Number), are a common occurrence in datasets. Dealing with Nulls in Pandas is crucial for accurate data analysis and modeling. In this comprehensive guide, we will explore various techniques to handle missing values in a dataset using Pandas.

Understanding Missing Values

Types of Nulls

When working with Pandas, you will meet several names for missing values: np.nan, None, Null, and NaN. They do not all refer to different things, so it is worth being clear about each one.

  • np.nan: A floating-point value defined by NumPy (the IEEE 754 "not a number"). It is the default marker for missing numerical values in Pandas.
  • None: Python's built-in null object. Pandas treats it as missing, and in a numeric column it is converted to NaN.
  • Null: Not a Python or Pandas object. It is the generic term for a missing value (familiar from SQL), which is why the detection methods are named isnull() and notnull().
  • NaN: The way np.nan is displayed. It is the same value, not a separate type.
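
The behavior of these markers can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# NaN is a float and never compares equal to itself,
# so use pd.isna() rather than == to test for it.
nan_equals_itself = np.nan == np.nan      # False
nan_is_float = isinstance(np.nan, float)  # True

# pd.isna() recognizes both np.nan and None as missing.
both_missing = pd.isna(np.nan) and pd.isna(None)

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numbers = pd.Series([1, None, 3])

# In a Series of strings, None also counts as missing.
names = pd.Series(["Paris", None, "Madrid"])

print(nan_equals_itself, nan_is_float, both_missing)
print(numbers.dtype)
print(names.isna().tolist())
```

Because NaN is not equal to itself, a comparison such as `value == np.nan` is never a reliable test for a missing value.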

The Problem with Nulls in a Table

Null values can cause problems in data analysis and modeling. They can lead to incorrect calculations, biased results, and errors in machine learning algorithms. Therefore, it is crucial to treat them appropriately to ensure the accuracy and reliability of the analysis.
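
As a small illustration of how a single missing value can change a calculation, the sketch below computes the same mean in three ways. The population figures are taken from the cities example used in this guide; the variable names are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

population = pd.Series([8000000, 5000000, np.nan])

# Pandas skips NaN by default, so the mean is taken over two values, not three.
mean_skipping_nan = population.mean()            # 6500000.0

# With skipna=False the missing value propagates to the result.
mean_with_nan = population.mean(skipna=False)    # nan

# NumPy's plain mean also propagates NaN.
numpy_mean = np.mean(population.to_numpy())      # nan

print(mean_skipping_nan, mean_with_nan, numpy_mean)
```

Neither behavior is wrong in itself, but silently skipping a value and silently returning NaN can both distort an analysis if you are not aware of them.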

How to Find Null Values

To detect missing values in a dataset, Pandas provides the methods "isnull()", "isna()", and "notnull()". "isnull()" and "isna()" are two names for the same method: they return a boolean mask that is True wherever a value is missing. "notnull()" returns the inverse mask.

import pandas as pd

# Given the following table about cities
df = pd.read_csv('cities.csv', sep=';')
df

city       country  population
Paris      France   8000000
Madrid     NaN      5000000
New Delhi  India    NaN

# Check for null values using isnull()
null_mask = df.isnull()
print(null_mask)

## Output
city   country  population
False  False    False
False  True     False
False  False    True
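
If you do not have the cities.csv file at hand, the example can be reproduced from an in-memory string. This is a sketch under assumptions: the CSV text below is a hypothetical stand-in for the file, built from the values in the cities table, and it also checks how the three detection methods relate to each other.

```python
import io

import pandas as pd

# Hypothetical stand-in for cities.csv, using the values from the cities table.
csv_text = (
    "city;country;population\n"
    "Paris;France;8000000\n"
    "Madrid;;5000000\n"
    "New Delhi;India;\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";")

null_mask = df.isnull()

# isna() is the same method under another name; notnull() gives the inverse mask.
same_as_isna = null_mask.equals(df.isna())
inverse_of_notnull = null_mask.equals(~df.notnull())

print(null_mask)
print(same_as_isna, inverse_of_notnull)
```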

The “isnull()” method also works on a single column, which lets you filter rows by the missing values of that field. In our example, to find which cities have an unknown population, apply the method to the population column and use the result as a filter:


# Select cities where the population is unknown
df[df.population.isnull()]

city       country  population
New Delhi  India    NaN
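
The same filtering idea extends to other questions. The sketch below rebuilds the cities table inline (an assumption, so that the block runs on its own) and selects the rows where the population is known, as well as the rows with a missing value in any column.

```python
import numpy as np
import pandas as pd

# The cities table, rebuilt inline so the example is self-contained.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8000000, 5000000, np.nan],
})

# Rows where the population is known (the inverse of the isnull() filter).
known_population = df[df.population.notnull()]

# Rows with a missing value in at least one column.
any_missing = df[df.isnull().any(axis=1)]

print(known_population)
print(any_missing)
```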

How to Replace Null Values

It is often necessary to replace nulls with an appropriate value, such as 0 or an empty string (“”). Pandas has two methods for replacing null values: fillna() and replace().

The fillna() Method

The "fillna()" method fills null values with a specified or calculated value. Here are some examples:

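# Fill null values with "unknown" in the "population" column
# (on a copy, so that df keeps its NaN values for the examples that follow)
df_filled = df.copy()
df_filled['population'] = df_filled.population.fillna('unknown')
df_filled

city       country  population
Paris      France   8000000
Madrid     NaN      5000000
New Delhi  India    unknown

# Fill null values in numeric columns with the mean of the column
# (numeric_only=True skips text columns such as "city" and "country")
df_mean = df.fillna(df.mean(numeric_only=True))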
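
"fillna()" also accepts a dictionary, which makes it possible to choose a different fill value for each column in one call. The following is a minimal sketch on the cities table, rebuilt inline as an assumption; the fill values themselves are only illustrative choices.

```python
import numpy as np
import pandas as pd

# The cities table, rebuilt inline so the example is self-contained.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8000000, 5000000, np.nan],
})

# One fill value per column: a label for text, the column mean for numbers.
df_filled = df.fillna({
    "country": "unknown",
    "population": df["population"].mean(),
})

# fillna() returns a new DataFrame; the original still contains its nulls.
remaining_nulls_in_original = int(df.isnull().sum().sum())

print(df_filled)
```

Filling the text column with a label and the numeric column with a number keeps each column to a single data type, which avoids mixing strings and numbers in "population".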

The replace() Method

The "replace()" method is used to replace values, including null values. For example:

import numpy as np

# Replace all null values with -1
df_replaced = df.replace(np.nan, -1)
print(df_replaced)

## Output
city       country  population
Paris      France   8000000
Madrid     -1       5000000
New Delhi  India    -1
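
"replace()" is also useful in the opposite direction: turning placeholder values back into proper nulls so that the Pandas null methods can see them. The sketch below assumes a hypothetical version of the cities table in which "unknown" and -1 were used as placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where placeholders were used instead of nulls.
raw = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", "unknown", "India"],
    "population": [8000000, 5000000, -1],
})

# A nested dictionary maps column -> {old value: new value}.
cleaned = raw.replace({
    "country": {"unknown": np.nan},
    "population": {-1: np.nan},
})

print(cleaned.isnull().sum())
```

Restricting each replacement to its own column avoids accidentally replacing a legitimate -1 or "unknown" elsewhere in the table.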

How to Drop Columns or Rows with Null Values

Sometimes rows or columns that contain null values cannot be used in the analysis and are better removed. Pandas provides the "dropna()" method to drop rows or columns with null values.

# Drop rows with null values
df_dropped_rows = df.dropna()
print(df_dropped_rows)

## Output
city   country  population
Paris  France   8000000

# Drop columns with null values
df_dropped_columns = df.dropna(axis=1)
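
Dropping every row that has any null can remove more data than intended. The "subset" and "how" parameters of "dropna()" give finer control, as in this sketch on the cities table (rebuilt inline as an assumption).

```python
import numpy as np
import pandas as pd

# The cities table, rebuilt inline so the example is self-contained.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8000000, 5000000, np.nan],
})

# Only drop rows where "population" is missing; nulls in other columns are kept.
rows_with_population = df.dropna(subset=["population"])

# Only drop rows where every column is missing (none in this table).
rows_not_all_null = df.dropna(how="all")

# Dropping columns with any null leaves only the complete "city" column.
complete_columns = df.dropna(axis=1)

print(rows_with_population)
print(len(rows_not_all_null))
print(complete_columns.columns.tolist())
```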

How to Count Null Values

To count the null values in each column of a dataset, you can use the "isnull()" method in combination with the "sum()" method. Each True in the mask counts as 1, so the sum of a column is its number of nulls.

# Count null values in each column
null_counts = df.isnull().sum()
print(null_counts)

## Output
city          0
country       1
population    1
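
The same mask can answer two related questions: how many nulls the whole table contains, and what share of each column is missing. A minimal sketch, again on the cities table rebuilt inline as an assumption:

```python
import numpy as np
import pandas as pd

# The cities table, rebuilt inline so the example is self-contained.
df = pd.DataFrame({
    "city": ["Paris", "Madrid", "New Delhi"],
    "country": ["France", np.nan, "India"],
    "population": [8000000, 5000000, np.nan],
})

# Nulls per column, then the total for the whole table.
null_counts = df.isnull().sum()
total_nulls = int(null_counts.sum())

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing values in each column.
null_share = df.isnull().mean()

print(total_nulls)
print(null_share)
```

The share of missing values per column is often more informative than the raw count when deciding whether to fill a column or drop it.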

We learned how to detect missing values using methods like isnull(), isna(), and notnull(), and how to replace null values using fillna() and replace(). We also discussed methods for dropping columns and rows with null values and counting null values in a dataset.
The ability to handle missing values is an important skill for any data analyst or data scientist, and with the help of Python Pandas, you can confidently work with datasets that contain missing values. To learn more about Nulls and NaNs, start practicing with our real projects and exercises about data cleaning and Pandas in general.
