How to Add New Columns to Dataframe
How to Add New Columns to a Pandas DataFrame
In data analysis and manipulation, it is often necessary to add new columns to a Pandas DataFrame. It helps expand the scope of the analysis, find patterns, and perform data transformations. In this comprehensive guide, we will explore different methods to add new columns to an existing DataFrame in Pandas.
Method 1: Adding an Empty Column or a Column with a Constant Value
The first method is for adding a new column with either an empty value or a constant value for all rows in the DataFrame. This can be useful when we want to initialize a column before populating it with actual values.
To add an empty column, we can simply assign an empty list or array to the desired column name. For example:
## Option 1
df['new_column'] = []
## Option 2
df['new_column'] = ''
Method 2: New Column Derived from an Existing One
Another common scenario is adding a new column that is derived from an existing field in the DataFrame. This can involve performing calculations, applying functions, or mapping values from one column to another.
To create a new column using calculations or functions, we can use brackets to apply the desired operation to the existing column. For example, given the following dataset about sales, let’s see how to add two extra fields:
## Our dataset is sales print(sales)
ID | product | category | price | currency |
---|---|---|---|---|
A1 | TV | electronics | 500.0 | USD |
A2 | phone | electronics | 175.0 | EUR |
A3 | laptop | electronics | 450.0 | USD |
A4 | screen | electronics | NaN | NaN |
Add an new column with the final price including a tax (VAT) of 20%
sales['final_price'] = sales['price'] * 1.2
## Output
ID product category price currency final_price
A1 TV electronics 500.0 USD 600
A2 phone electronics 175.0 EUR 210
A3 laptop electronics 450.0 USD 540
A4 screen electronics NaN NaN NaN
If we want to map values from one column to another based on a dictionary or a list, we can use the “map()"
function. For instance, we can add the origin country of the product through a dictionary:
## We define the dictionary
origin = {'TV':'Germany','phone':'China','laptop':'India','screen':'Taiwan'}
# We apply the map funtion() to fill in the new column. It's based on the existing column, "product".
## For values not included in the dictionary, we apply the "fillna()" function.
sales['origin_country'] = sales['product'].map(origin).fillna('unknown')
## Output
ID product category price final_price origin_country
A1 TV electronics 500.0 600 Germany
A2 phone electronics 175.0 210 China
A3 laptop electronics 450.0 540 Taiwan
A4 screen electronics NaN NaN unknown
Method 3: Adding a New Column with iloc[]
The "iloc[]"
function allows to access DataFrame elements by their index. We use this function to add a new column at a specific position in the DataFrame.
## We create a new column with the currency at the end of the table
sales.loc[:,'currency'] = 'USD'
## Output
ID product category price final_price origin_country currency
A1 TV electronics 500.0 600 Germany USD
A2 phone electronics 175.0 210 China USD
A3 laptop electronics 450.0 540 Taiwan USD
A4 screen electronics NaN NaN unknown USD
Method 4: New columns with insert()
To add a new column at a specific position, we can use the insert()
function:
## We add the VAT percentaje in the third position
sales.insert(2, 'vat', '0.2')
## Output
ID product vat category price currency final_price origin_country currency
A1 TV 0.2 electronics 500.0 USD 600 Germany USD
A2 phone 0.2 electronics 175.0 EUR 210 China USD
A3 laptop 0.2 electronics 450.0 USD 540 Taiwan USD
A4 screen 0.2 electronics NaN NaN NaN unknown USD
In the above code, 2
represents the position where the new column will be inserted (bear in mind it starts at 0).
“vat” is the name of the new column and “0.2” is the value to input.
Method 5: New Column Based on Conditions with np.where()
Sometimes, we need to add a new column based on a condition. The Numpy "np.where()"
function can create a new column based on a given condition.
import numpy as np
# We create a new filed based on the "final_price" field
sales['discount'] = np.where(sales['final_price'] > 250, 'yes', 'no')
## Output
ID product category price final_price origin_country currency discunt
A1 TV electronics 500.0 600 Germany USD yes
A2 phone electronics 175.0 210 China USD no
A3 laptop electronics 450.0 540 Taiwan USD yes
A4 screen electronics NaN NaN unknown USD no
The new field purpose is to inform if a discount applies based on the price. The condition is simple, if the price is greater than 250, a discount is applied, otherwise no. Therefore if the “final_price” value is greater than 250, the new column has the value ‘yes’, if it is equal or less than 250, ‘no’.
Method 5: Adding a New Column with apply() and a Python Function
In some cases, we may need to apply a custom function to each row of a DataFrame to calculate the values for a new column. We can use the “apply()"
function in combination with a Python function to achieve this.
This method is very useful when the values of the new field depend on multiple conditions and none of the former methods are valid.
To illustrate how the “apply()” function works, we use a dataset (df) with prices of six products sold worldwide. We need a new field to display the final price in the currency of the country is sold. To do so, we need to implement the following conditions:
1) If the country is “usa” and the currency is not USD convert it to dollars.
2) If the country is in the UE and the currency is not EUR convert it to euros.
3) For any other country, apply prices in USD.
The exchange rate of the EUR/USD is 1.058
## Our dataframe df:
product | price | currency | country |
---|---|---|---|
A | 500 | USD | usa |
B | 175 | USD | germany |
C | 450 | USD | india |
D | 50 | USD | uk |
E | 100 | EUR | usa |
F | 80 | EUR | spain |
def country_price(df):
if df['country'] = 'usa' and df['currency'] != 'USD':
return df['price'] * 1.058
elif df['country'] in eu and df['currency'] !='EUR':
return df['price'] /1.058
elif df['country'] in eu and df['currency']='EUR':
return df['price']
else:
return df['price']
## We run the apply() function to implement the conditions to the new column
df['sale_price'] = df.apply(country_price, axis=1)
## Output
product price currency country sale_price
A 500 USD usa 500
B 175 USD germany 165.4064
C 450 USD india 450
D 50 USD uk 50
E 100 EUR usa 105.8
F 80 EUR spain 80
Using df.apply()
with axis=1
, we apply the function to each row of the DataFrame.
Conclusion
In this comprehensive guide, we have explored different methods to add new columns to an existing Pandas DataFrame. We have learned how to add an empty column or a column with a constant value, derive a new column from an existing one, add a column at a specific position, add a column based on conditions, and add a column using "apply()"
and a Python function. These methods provide us with the flexibility to manipulate and analyze data effectively using Pandas.
To gain a better understanding of these techniques, experiment with these methods working on real projects of data analysis and data wrangling and adapt them to your specific use cases.