How to merge dataframes with Python
How to combine dataframes in Python
In the world of data analysis, combining datasets is a fundamental task that data analysts and data scientists perform on a regular basis. Whether it’s merging customer information with sales data or joining multiple datasets for analysis, knowing how to combine dataframes in Python is an essential skill.
1) Concatenate Dataframes
Concatenating dataframes allows you to combine two or more datasets along a particular axis. The Pandas function pd.concat()
is used for this purpose. It takes a sequence of dataframes as input and concatenates them based on the specified axis.
For example, let’s say we have two dataframes, df1
and df2
, with the same columns. We can concatenate them vertically using the following code:
import pandas as pd
df1 = pd.DataFrame({'product': ['TV', 'phone', 'laptop'], 'price': [200, 120, 400]})
df2 = pd.DataFrame({'product': ['screen', 'mouse', 'keyboard'], 'price': [100, 11, 12]})
outcome = pd.concat([df1, df2], axis=0
## Output
product | price |
---|---|
TV | 200 |
phone | 120 |
laptop | 400 |
screen | 100 |
mouse | 11 |
keyboard | 12 |
The resulting dataframe output will have all the rows from df1
followed by all the rows from df2
. This is known as vertical concatenation or stacking.
Another way to concatenate dataframes is horizontally, where the columns of the two dataframes are aligned. This is achieved by specifying axis=1
in the pd.concat()
function.
It’s important to note that concatenating dataframes can result in a cartesian product if the number of rows in the dataframes being concatenated is different. A cartesian product is the combination of all rows from one dataframe with all rows from the other dataframe. This can lead to a significant increase in the size of the resulting dataframe and should be used with caution.
2) Merge Dataframes
Merging dataframes involves combining datasets based on common columns between them. The Pandas function "pd.merge()"
is used for this purpose. It takes two dataframes as input and merges them based on the specified columns.
To understand how the pd.merge()
function works, let’s consider two dataframes, df1
and df2
, with a common column key
. We can merge them using the following code:
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'],
'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'],
'value2': [4, 5, 6]})
result2 = pd.merge(df1, df2, on='key')
## Output
key | value1 | value2 |
---|---|---|
B | 2 | 4 |
C | 3 | 5 |
The resulting dataframe result2
will contain only the rows where the key
column in df1
matches the key
column in df2
. This is known as an inner join.
There are different types of joins that can be performed using the "pd.merge()"
function. The default is an inner join, which returns only the common values in both dataframes. Other types of joins include left join, right join, and outer join.
Left-Join
A left join (how='left'
) keeps all rows from the left dataframe and adds NaN values for non-matching rows in the right dataframe.
import pandas as pd
store = pd.DataFrame({'ID':['A1','A2','A3','A4'],'product': ['TV', 'phone', 'laptop','screen'],
'category':['electronics','electronics','electronics','electronics']})
## Store dataframe
ID product category
A1 TV electronics
A2 phone electronics
A3 laptop electronics
A4 screen electronics
## Create another dataframe with prices of the products contained in "store" table
prices = pd.DataFrame({'price': [500, 175, 450],'currency': ['USD', 'EUR', 'USD'],
'ID':['A1','A2','A3']})
### Prices table
price currency ID
500 USD A1
175 EUR A2
450 USD A3
### Merge dataframes
outcome = pd.merge(store, prices, how='left',on='ID')
### Outcome
ID | product | category | price | currency |
---|---|---|---|---|
A1 | TV | electronics | 500.0 | USD |
A2 | phone | electronics | 175.0 | EUR |
A3 | laptop | electronics | 450.0 | USD |
A4 | screen | electronics | NaN | NaN |
Right-join
A right join (how='right'
) keeps all rows from the right dataframe and adds NaN values for non-matching rows in the left dataframe.
## Add three extra rows to the prices dataframe
prices = pd.DataFrame({'price': [500, 175, 450,670,90,44],'currency': ['USD', 'EUR', 'USD','USD','USD','USD'],
'ID':['A1','A2','A3','A4','A5','A6']})
## Print the table
price currency ID
500 USD A1
175 EUR A2
450 USD A3
670 USD A4
90 USD A5
44 USD A6
## Merge the two table through a right-join
right = pd.merge(store, prices, how='right',on='ID')
## Outcome
ID | product | category | price | currency |
---|---|---|---|---|
A1 | TV | electronics | 500 | USD |
A2 | phone | electronics | 175 | EUR |
A3 | laptop | electronics | 450 | USD |
A4 | screen | electronics | 670 | USD |
A5 | NaN | NaN | 90 | USD |
A6 | NaN | NaN | 44 | USD |
Outer join
An outer join (how='outer'
) returns all rows from both dataframes and adds NaN values for non-matching rows.
For example:
### Create two dataframes
store = pd.DataFrame({'ID':['A1','A2','A3','A4'] ,'product': ['TV', 'phone', 'laptop','screen'],
'category':['electronics','electronics','electronics','electronics']})
prices = pd.DataFrame({'price': [500, 175, 450,670,90,44],'currency': ['USD', 'EUR', 'USD','USD','USD','USD'],
'ID':['A1','A2','A3','A4','A5','A6']})
## Merge
outcome = pd.merge(store, prices, how='outer',on='ID')
## Outcome
ID | product | category | price | currency |
---|---|---|---|---|
A1 | TV | electronics | 500 | USD |
A2 | phone | electronics | 175 | EUR |
A3 | laptop | electronics | 450 | USD |
A4 | screen | electronics | 670 | USD |
A5 | NaN | NaN | 90 | USD |
A6 | NaN | NaN | 44 | USD |
Additionally, the pd.merge()
function allows you to merge dataframes with different column names by specifying the left_on
and right_on
parameters. This is useful when the columns you want to merge on have different names in the two dataframes.
## Create two dataframes
cars = pd.DataFrame({'plate': ['x23', 'x34', 'x45'],
'brand': ['Toyota', 'Ford', 'Seat']})
tyres = pd.DataFrame({'type': ['x23', 'x34', 'x45'],
'manufacturer': ['pirelli', 'michelin', 'firestone']})
## Merge
mech= pd.merge(cars,tyres, left_on='plate',right_on='type',how='left')
## Outcome
plate | brand | type | manufacturer |
---|---|---|---|
x23 | Toyota | x23 | pirelli |
x34 | Ford | x34 | michelin |
Apart from the "pd.merge()"
function, Pandas also provides the "append()"
and "join()"
functions to combine dataframes. The append()
function is used to vertically stack dataframes, similar to concatenation. The "join()"
function is used to join dataframes based on their indices.
Conclusion
In this comprehensive guide, we have explored the various techniques and functions provided by the Pandas library in Python to merge and join dataframes. We have learned how to concatenate dataframes vertically and horizontally using the pd.concat()
function. We have also seen how to merge dataframes based on common columns using the pd.merge()
function, and the different types of joins that can be performed. Finally, we have briefly discussed the append()
and join()
functions for combining dataframes.
Combining dataframes is a crucial step in data analysis, as it allows us to bring together different datasets and perform more meaningful analysis. By mastering the techniques and functions discussed in this guide, you will be well-equipped to handle complex merging and joining tasks in your data analysis projects.