How to merge dataframes with Python

How to combine dataframes in Python

In the world of data analysis, combining datasets is a fundamental task that data analysts and data scientists perform on a regular basis. Whether it’s merging customer information with sales data or joining multiple datasets for analysis, knowing how to combine dataframes in Python is an essential skill.

1) Concatenate Dataframes

Concatenating dataframes allows you to combine two or more datasets along a particular axis. The Pandas function pd.concat() is used for this purpose. It takes a sequence of dataframes as input and concatenates them based on the specified axis.
For example, let’s say we have two dataframes, df1 and df2, with the same columns. We can concatenate them vertically using the following code:

import pandas as pd

df1 = pd.DataFrame({'product': ['TV', 'phone', 'laptop'], 'price': [200, 120, 400]}) 
df2 = pd.DataFrame({'product': ['screen', 'mouse', 'keyboard'], 'price': [100, 11, 12]}) 
outcome = pd.concat([df1, df2], axis=0

## Output
productprice
TV200
phone120
laptop400
screen100
mouse11
keyboard12

The resulting dataframe output will have all the rows from df1 followed by all the rows from df2. This is known as vertical concatenation or stacking.
Another way to concatenate dataframes is horizontally, where the columns of the two dataframes are aligned. This is achieved by specifying axis=1 in the pd.concat() function.

It’s important to note that concatenating dataframes can result in a cartesian product if the number of rows in the dataframes being concatenated is different. A cartesian product is the combination of all rows from one dataframe with all rows from the other dataframe. This can lead to a significant increase in the size of the resulting dataframe and should be used with caution.

2) Merge Dataframes

Merging dataframes involves combining datasets based on common columns between them. The Pandas function "pd.merge()" is used for this purpose. It takes two dataframes as input and merges them based on the specified columns.
To understand how the pd.merge() function works, let’s consider two dataframes, df1 and df2, with a common column key. We can merge them using the following code:

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'],
                    'value1': [1, 2, 3]})

df2 = pd.DataFrame({'key': ['B', 'C', 'D'],
                    'value2': [4, 5, 6]})

result2 = pd.merge(df1, df2, on='key')

## Output
keyvalue1value2
B24
C35

The resulting dataframe result2 will contain only the rows where the key column in df1 matches the key column in df2. This is known as an inner join.
There are different types of joins that can be performed using the "pd.merge()" function. The default is an inner join, which returns only the common values in both dataframes. Other types of joins include left join, right join, and outer join.

Left-Join

A left join (how='left') keeps all rows from the left dataframe and adds NaN values for non-matching rows in the right dataframe.

import pandas as pd

store = pd.DataFrame({'ID':['A1','A2','A3','A4'],'product': ['TV', 'phone', 'laptop','screen'],
'category':['electronics','electronics','electronics','electronics']}) 

## Store dataframe

IDproductcategory
A1TVelectronics
A2phoneelectronics
A3laptopelectronics
A4screenelectronics
## Create another dataframe with prices of the products contained in "store" table prices = pd.DataFrame({'price': [500, 175, 450],'currency': ['USD', 'EUR', 'USD'], 'ID':['A1','A2','A3']}) ### Prices table
pricecurrencyID
500USDA1
175EURA2
450USDA3
### Merge dataframes outcome = pd.merge(store, prices, how='left',on='ID') ### Outcome
IDproductcategorypricecurrency
A1TVelectronics500.0USD
A2phoneelectronics175.0EUR
A3laptopelectronics450.0USD
A4screenelectronicsNaNNaN

 

Right-join

A right join (how='right') keeps all rows from the right dataframe and adds NaN values for non-matching rows in the left dataframe.

## Add three extra rows to the prices dataframe
prices = pd.DataFrame({'price': [500, 175, 450,670,90,44],'currency': ['USD', 'EUR', 'USD','USD','USD','USD'],
'ID':['A1','A2','A3','A4','A5','A6']})
## Print the table

pricecurrencyID
500USDA1
175EURA2
450USDA3
670USDA4
90USDA5
44USDA6
## Merge the two table through a right-join right = pd.merge(store, prices, how='right',on='ID') ## Outcome
IDproductcategorypricecurrency
A1TVelectronics500USD
A2phoneelectronics175EUR
A3laptopelectronics450USD
A4screenelectronics670USD
A5NaNNaN90USD
A6NaNNaN44USD

Outer join

An outer join (how='outer') returns all rows from both dataframes and adds NaN values for non-matching rows.
For example:

### Create two dataframes
store = pd.DataFrame({'ID':['A1','A2','A3','A4'] ,'product': ['TV', 'phone', 'laptop','screen'],
'category':['electronics','electronics','electronics','electronics']})
prices = pd.DataFrame({'price': [500, 175, 450,670,90,44],'currency': ['USD', 'EUR', 'USD','USD','USD','USD'],
'ID':['A1','A2','A3','A4','A5','A6']})

## Merge
outcome = pd.merge(store, prices, how='outer',on='ID')

## Outcome
IDproductcategorypricecurrency
A1TVelectronics500USD
A2phoneelectronics175EUR
A3laptopelectronics450USD
A4screenelectronics670USD
A5NaNNaN90USD
A6NaNNaN44USD

Additionally, the pd.merge() function allows you to merge dataframes with different column names by specifying the left_on and right_on parameters. This is useful when the columns you want to merge on have different names in the two dataframes.

## Create two dataframes
cars = pd.DataFrame({'plate': ['x23', 'x34', 'x45'],
'brand': ['Toyota', 'Ford', 'Seat']})

tyres = pd.DataFrame({'type': ['x23', 'x34', 'x45'],
'manufacturer': ['pirelli', 'michelin', 'firestone']})
## Merge
mech= pd.merge(cars,tyres, left_on='plate',right_on='type',how='left')

## Outcome
platebrandtypemanufacturer
x23Toyotax23pirelli
x34Fordx34michelin

Apart from the "pd.merge()" function, Pandas also provides the "append()" and "join()" functions to combine dataframes. The append() function is used to vertically stack dataframes, similar to concatenation. The "join()" function is used to join dataframes based on their indices.

Conclusion

In this comprehensive guide, we have explored the various techniques and functions provided by the Pandas library in Python to merge and join dataframes. We have learned how to concatenate dataframes vertically and horizontally using the pd.concat() function. We have also seen how to merge dataframes based on common columns using the pd.merge() function, and the different types of joins that can be performed. Finally, we have briefly discussed the append() and join() functions for combining dataframes.

Combining dataframes is a crucial step in data analysis, as it allows us to bring together different datasets and perform more meaningful analysis. By mastering the techniques and functions discussed in this guide, you will be well-equipped to handle complex merging and joining tasks in your data analysis projects.

Python and Excel Projects for practice
Register New Account
Shopping cart