How to Filter rows in Pandas Dataframe

Methods to Filter Rows in Pandas Dataframes

Data analysis involves working with vast amounts of data, often containing irrelevant or redundant information. It is usually necessary to focus on specific subsets of data, reducing complexity and enabling a more accurate analysis. There are several methods to filter rows in Pandas dataframes so that data analysts can extract specific partitions that are relevant to their analysis, so they can focus on the most important aspects.

How to filter rows by single condition

There are several methods available in Pandas to filter rows based on a single condition. One commonly used approach is using the “DataFrame.loc[]” attribute, which allows analysts to select rows based on specific conditions. Another method is using the “DataFrame.query()” function, which enables analysts to filter rows using expressions.

## Given the following dataset about sales
sales

IDproductcategorypricefinal_priceorigin_countrycurrency
A1TVelectronics500.0600GermanyUSD
A2phoneelectronics175.0210ChinaUSD
A3laptopelectronics450.0540TaiwanUSD
A4screenelectronicsNaNNaNunknownUSD
## We select screens sales[sales.product=='screen'] ## Output
IDproductcategorypricefinal_priceorigin_countrycurrency
A4screenelectronicsNaNNaNunknownUSD

Finally, another method to filter a dataframe is the ~ (NOT IN) operator (alt gr + 4) that checks whether the data is available or not. For example, in our sample table, we want to know the products that are not produced in China. Instead of entering all the possible countries, we just apply the NOT IN condition:

## Select all products not made in China
sales[~sales['origin_country']=='China]
## Output

IDproductcategorypricefinal_priceorigin_countrycurrency
A1TVelectronics500.0600GermanyUSD
A3laptopelectronics450.0540TaiwanUSD
A4screenelectronicsNaNNaNunknownUSD

How to filter rows by multiple conditions

In many cases, data analysts need to filter rows based on several conditions. Pandas provides various options to achieve this.
The most common way is combining conditions with logical operators like “&” and the pipeline “|”:

## We select products with sale price greater than 200 and sold in USD
sales[(sales['price']>200 )&(sales['currency']=='USD']
## Output

IDproductcategorypricefinal_priceorigin_countrycurrency
A1TVelectronics500.0600GermanyUSD
A3laptopelectronics450.0540TaiwanUSD

How to Filter Rows Based on List of Values

Another common scenario in data analysis is filtering rows based on a list of values. The “isin()” function selects rows based on whether the values in a column match any value in a given list. This method is particularly usuful when you need to filter the dataset including multiple values of the same field.

## We define the list of values
produit = ["phone", "laptop"]

## Select only rows where product is contained in the "produit" list
sales[sales.product.isin(produit)]
## Output

IDproductcategorypricefinal_priceorigin_countrycurrency
A2phoneelectronics175.0210ChinaUSD
A3laptopelectronics450.0540TaiwanUSD

In this example, the “isin()” function is used to filter rows where the ‘product’ column value matches any value in the ‘produit’ list.

How to Filter Rows with Null Values

Handling missing data is a critical aspect of data analysis. Functions like “isnull()” and “notnull()” filter rows based on the presence or absence of null values.

## We check if the price is always filled in
sales[sales.price.isnull()]
## Output

IDproductcategorypricefinal_priceorigin_countrycurrency
A4screenelectronicsNaNNaNunknownUSD

The “isnull()” function has selected the records where the ‘price’ column has null values.

How to Filter a dataframe with String Matches

Filtering rows based on string matches is often required in exploratory analysis. Functions like “str.contains()”, “str.startswith()”, and “str.endswith()” filter rows based on string patterns. Bear in mind it is case sensitive.

# We select rows where "currency" contains the letters "US"
sales[sales.currency.str.contains('US')]
## Output

IDproductcategorypricefinal_priceorigin_countrycurrency
A1TVelectronics500.0600GermanyUSD
A2phoneelectronics175.0210ChinaUSD
A3laptopelectronics450.0540TaiwanUSD
A4screenelectronicsNaNNaNunknownUSD

How to filter columns in dataframe

When you only need certain columns and not the entire dataset, you can also select a number of columns through double brackets.

## We select just three columns of our dataset
sales[['product','price','origin_country']]
## Output

productpriceorigin_country
TV500.0Germany
phone175.0China
laptop450.0Taiwan
screenNaNunknown

Python and Excel Projects for practice
Register New Account
Shopping cart