PANDAS LIBRARY – QUICK GUIDE TUTORIAL
What is Pandas?
The Pandas software library is a free-to-use (BSD-licensed), open-source add-on to the Python language. It makes it easy for the already famously useful language to retrieve information from datasets and present it in visually intuitive formats. The library is written in a combination of Python, C, and Cython and compares very favorably to other libraries on key performance metrics, making it a logical choice for data scientists in general, and particularly for those who work hands-on with Python.
Since its creation in 2008, this highly accessible tool has steadily become the library of choice for data scientists, data analysts, Python developers, and all those interested in machine learning. The library’s efficiency, high performance, and ease of use have made it extremely popular with anyone who needs to carry out advanced manipulations on large-scale data sets.
How to Install it
You can install pandas into your project just like other external packages by using the pip package manager:
pip install pandas
How to read different types of files: JSON, Excel, CSV
Big data sets are often stored or exchanged as JSON, CSV, Excel, and similar files.
JSON and CSV are plain-text formats with standard structures that are well known in the world of programming, including to Pandas; Excel files use a binary format, but Pandas reads them just as easily.
The read_json() function allows you to read JSON files:
import pandas as pd

df = pd.read_json('data.json')
print(df.to_string())
The read_csv() function allows you to read CSV files:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())
Pandas read_excel() relies on an external engine to parse Excel files: older versions of Pandas use the xlrd library, while newer versions use openpyxl for .xlsx files (xlrd now only handles the legacy .xls format).
If you call pandas.read_excel in an environment where the required engine is not installed, you will receive an error message similar to the following: “ImportError: Install xlrd >= 0.9.0 for Excel support”
The missing engine can be installed with pip:

pip install xlrd
pip install openpyxl
The read_excel() function allows you to read Excel files:
import pandas as pd

df = pd.read_excel('sample.xlsx')
print(df)
Pandas DataFrames
At the heart of the Pandas package is the DataFrame data structure. A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. You will commonly see the variable “df” used to hold a DataFrame. Example:
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Once data is inside a Pandas DataFrame, we can begin to explore ways to manipulate the data.
Basic Pandas functions that every beginner should know
-head()
head(n) is used to return the first n rows of a dataset. By default, df.head() returns the first 5 rows of the DataFrame. If you want more or fewer rows, pass n as an integer.
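A minimal sketch on a small throwaway DataFrame:

import pandas as pd

df = pd.DataFrame({"x": range(10)})
print(df.head())   # first 5 rows (the default)
print(df.head(3))  # first 3 rows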
-describe()
describe() is used to generate descriptive statistics of the data in a Pandas DataFrame or Series. It summarizes central tendency and dispersion of the dataset. describe() helps in getting a quick overview of the dataset.
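For example, on a tiny DataFrame of numbers:

import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390]})
print(df.describe())  # count, mean, std, min, quartiles and max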
-astype()
astype() is used to cast a pandas object to a particular data type. It can be a very helpful function in case your data is not stored in the correct format (data type). For instance, if floating point numbers have been read in as strings, you can convert them back to floating point numbers with astype(). Or if you want to convert an object dtype to “category”, you can use astype(). For example:
data_1['Gender'] = data_1.Gender.astype('category')
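And a minimal sketch of the string-to-float case mentioned above (the column name is made up for illustration):

import pandas as pd

df = pd.DataFrame({"price": ["1.5", "2.25", "3.0"]})  # hypothetical column: floats stored as strings
df["price"] = df["price"].astype(float)               # cast back to float64
print(df["price"].dtype)                              # float64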
-loc[]
The loc[] indexer helps to access a group of rows and columns in a dataset, a slice of the dataset, as per our requirement. Note that, unlike ordinary Python slicing, its row slice includes both endpoints. For instance, if we want the first five rows and three named columns of a dataset, we can access them with loc[]. Example:
data_1.loc[0:4, ['Name', 'Age', 'State']]
Output:
     Name  Age           State
0    Alam   29  Madhya Pradesh
1   Rohit   23           Delhi
2   Bimla   35         Haryana
3   Rahul   25     West Bengal
4  Chaman   32      Tamil Nadu
You can also filter rows by column value with conditional selection:
# selecting rows based on a condition
rslt_df = dataframe[dataframe['Percentage'] > 70]
print('\nResult dataframe :\n', rslt_df)
-Row selection: iloc[]
The iloc indexer for a Pandas DataFrame is used for integer-location based indexing / selection by position.
An index refers to a position within an ordered list, where the first item has index 0: the first item in a list is at index 0, the second at index 1, and so on. Some helpful iloc tricks:
data.iloc[0]      # first row of data frame (Aleshia Tomkiewicz) - note the Series output
data.iloc[1]      # second row of data frame (Evan Zigomalas)
data.iloc[-1]     # last row of data frame (Mi Richan)

# Columns:
data.iloc[:, 0]   # first column of data frame (first_name)
data.iloc[:, 1]   # second column of data frame (last_name)
data.iloc[:, -1]  # last column of data frame (id)

# Multiple row and column selections using iloc and DataFrame:
data.iloc[0:5]                       # first five rows of data frame
data.iloc[:, 0:2]                    # first two columns of data frame, all rows
data.iloc[[0, 3, 6, 24], [0, 5, 6]]  # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
data.iloc[0:5, 5:8]                  # first 5 rows and the columns at positions 5-7 (county -> phone1)
How to drop columns and rows
To drop a row or column in a dataframe, you need to use the drop() method available in the dataframe. Let us take an example data frame:
  product_name  Unit_Price No_Of_Units Available_Quantity Available_Since_Date
0     Keyboard     500.000           5                  5            11/5/2021
1        Mouse     200.000           5                  6            4/23/2021
2      Monitor    5000.235          10                 10           08/21/2021
3          CPU   10000.550          20      Not Available           09/18/2021
4          CPU   10000.550          20      Not Available            9/18/2021
5     Speakers     250.500           8                NaT           01/05/2021
6          NaT         NaN         NaT                NaT                  NaT
Now, let us drop rows 5 & 6:
df.drop([5,6], axis=0, inplace=True)
In this code:
- [5,6] is the index of the rows you want to delete
- axis=0 denotes that rows should be deleted from the dataframe
- inplace=True performs the drop operation in the same dataframe
You may also use indexes to drop a range of rows:
df.drop(df.index[2:4], inplace=True)
In this code:
- index[2:4] generates a range of rows from 2 to 4. The lower limit of the range is inclusive and the upper limit of the range is exclusive. This means that rows 2 and 3 will be deleted and row 4 will not be deleted.
To drop columns, you simply change the axis value to 1 and name the columns you wish to drop:
df = df.drop('column_name', axis=1)
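To drop several columns at once, pass a list of names (these column names are placeholders):

df = df.drop(['column_1', 'column_2'], axis=1)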
How to find and delete duplicates
The Pandas drop_duplicates() method helps in removing duplicates from a Pandas DataFrame:
import pandas as pd

data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True]
}

df = pd.DataFrame(data)
print(df.drop_duplicates())
Output:
       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False

To drop duplicates of a certain column only, use the subset argument (keep=False drops every row involved in a duplicate, rather than keeping the first occurrence):

import pandas as pd

data = pd.DataFrame({
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True]
})

# sorting by column "A"
data.sort_values("A", inplace=True)

# dropping ALL rows with a duplicate value in column "A"
data.drop_duplicates(subset="A", keep=False, inplace=True)

# displaying data
print(data)
Output:
       A   B      C
3  TeamC  30  False
How to find, select and delete Nulls (NaN)
While reading a CSV file, blank fields are imported as null values into the DataFrame, which can later create problems while manipulating the data. The Pandas .isnull() and .notnull() methods are used to check for and manage NULL values in a data frame.
In the example below, we create a DataFrame with a null value in column C (in Python, you can assign a null value using None). We then use the .isnull() function:
# importing pandas package
import pandas as pd

data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, None, False, True]
}
data = pd.DataFrame(data)

# creating bool series True for NaN values
bool_series = pd.isnull(data["C"])
print(data[bool_series])
Output:
       A   B     C
2  TeamB  40  None
You can see only the rows with null values in column C are printed. In the following example, we use the .notnull() function:
# importing pandas package
import pandas as pd

data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, None, False, True]
}
data = pd.DataFrame(data)

# creating bool series False for NaN values
bool_series = pd.notnull(data["C"])

# display only the rows where column C is not null
data[bool_series]
Output:
       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False
4  TeamA  50   True
We can see that all the rows that did NOT contain a null value in column C are printed.
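To actually delete (or fill) those nulls, the standard tools are dropna() and fillna(); a short sketch on the same data:

# drop every row that contains a null value
data = data.dropna()

# or, instead of dropping, fill nulls with a default value:
# data["C"] = data["C"].fillna(False)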
How to replace Values
Depending on your needs, you may use either of the following approaches to replace values in Pandas:
(1) Replace a single value with a new value for an individual DataFrame column:
df['column name'] = df['column name'].replace(['old value'],'new value')
(2) Replace multiple values with a new value for an individual DataFrame column:
df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],'new value')
(3) Replace multiple values with multiple new values for an individual DataFrame column:
df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],['1st new value','2nd new value',...])
(4) Replace a single value with a new value for an entire DataFrame:
df = df.replace(['old value'],'new value')
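A minimal worked example of approach (1), with made-up data:

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC']})
df['city'] = df['city'].replace(['NYC'], 'New York')
print(df)  # the 'city' column now reads: New York, LA, New York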
Groupby
Pandas groupby is used for grouping data into categories and applying a function to each group. It also helps to aggregate data efficiently. The Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. In the following example, we load a CSV containing NBA player information:
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# Print the dataframe
print(df)
Output:
            Name            Team  Number Position  Age Height  Weight        College   Salary
0  Avery Bradley  Boston Celtics     0.0       PG   25    6-2     180          Texas  7730337
1    Jae Crowder  Boston Celtics    99.0       SF   25    6-6     235      Marquette  6796117
2   John Holland  Boston Celtics    30.0       SG   27    6-5     205    Boston Univ      NaN
3    R.J. Hunter  Boston Celtics    28.0       SG   22    6-5     185  Georgia State  1148640
4  Jonas Jerebko  Boston Celtics     8.0       PF   29   6-10     231            NaN  5000000
Next, let’s implement the .groupby function to group by team value:
# applying groupby() function to
# group the data on Team value
gk = df.groupby('Team')

# Let's print the first entries
# in all the groups formed
gk.first()
Output:
                               Name  Number Position  Age Height  Weight         College    Salary
Team
Atlanta Hawks         Kent Bazemore      24       SF   26    6-5     201    Old Dominion   2000000
Boston Celtics        Avery Bradley       0       PG   25    6-2     180           Texas   7730337
Brooklyn Nets      Bojan Bogdanovic      44       SG   27    6-8     216  Oklahoma State   3425510
Charlotte Hornets     Nicolas Batum       5       SG   27    6-8     200   Virginia Comm  13125306
Chicago Bulls      Cameron Bairstow      41       PF   25    6-9     250      New Mexico    845059
Some of our displayed data doesn’t quite make sense at first glance: there is surely more than one player on each of these teams. This is because first() returns only the first entry in each group. To see all the rows of a single group, use get_group():
# Finding the values contained in the "Boston Celtics" group
gk.get_group('Boston Celtics')
Output:
           Name            Team  Number Position  Age Height  Weight         College    Salary
  Avery Bradley  Boston Celtics       0       PG   25    6-2     180           Texas   7730337
    Jae Crowder  Boston Celtics      99       SF   25    6-6     235       Marquette   6796117
   John Holland  Boston Celtics      30       SG   27    6-5     205  Boston Univers       NaN
    R.J. Hunter  Boston Celtics      28       SG   22    6-5     185   Georgia State   1148640
  Jonas Jerebko  Boston Celtics       8       PF   29   6-10     231             NaN   5000000
   Amir Johnson  Boston Celtics      90       PF   29    6-9     240             NaN  12000000
We can also use the groupby() function to form groups based on more than one category (i.e., use more than one column to perform the splitting):
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# First grouping based on "Team";
# within each team we are grouping based on "Position"
gkk = df.groupby(['Team', 'Position'])

# Print the first value in each group
print(gkk.first())
groupby() is a very powerful function with a lot of variations. It makes the task of splitting the dataframe over some criteria really easy and efficient.
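Since grouping is usually followed by an aggregation, here is a short sketch using the same NBA columns as above:

# average salary per team
print(df.groupby('Team')['Salary'].mean())

# several aggregates at once
print(df.groupby('Team').agg({'Salary': 'mean', 'Age': 'max'}))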
How to print unique values and value counts
You can get the unique values in a column (or in multiple columns) of a pandas DataFrame using the unique() function:
# Find the unique values of a column
print(df["Column Name"].unique())

# Using pandas.unique() to get unique values across multiple columns
# ('Courses' and 'Fee' are columns of the DataFrame in this example)
df2 = pd.unique(df[['Courses', 'Fee']].values.ravel())
The Pandas value_counts() function returns an object containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring one; NA values are excluded by default. Let’s first build a sample DataFrame:
df = pd.DataFrame({
    'Name': ['Shobhit', 'Vaibhav', 'Vimal', 'Sourabh', 'Rahul', 'Shobhit'],
    'Physics': [11, 12, 13, 14, None, 11],
    'Chemistry': [10, 14, None, 18, 20, 10],
    'Math': [13, 10, 15, None, None, 13]
})
print(df)
The related count() method returns the number of non-null values in each column:

df.count()
Output:
Name         6
Physics      5
Chemistry    5
Math         4
dtype: int64
The Pandas value_counts() method can be applied either to a DataFrame column or to an entire data set:
# Calculating a frequency table of a DataFrame column
print(df["Physics"].value_counts())
Pandas also makes it very easy to display these frequencies as percentages:
print(df["Physics"].value_counts(normalize=True))
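For the Physics column above (11.0 occurs twice among the five non-null values), the normalized output would look something like:

11.0    0.4
12.0    0.2
13.0    0.2
14.0    0.2
Name: Physics, dtype: float64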
How to export data to Excel or CSV
The Pandas DataFrame to_csv() function converts a DataFrame into CSV data. We can pass a file path or a file object to write the CSV data into a file:
df.to_csv('out.csv')
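Two commonly used options (both standard to_csv parameters):

df.to_csv('out.csv', index=False)  # omit the row index column
df.to_csv('out.tsv', sep='\t')     # use tabs instead of commas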
You can write any data (lists, strings, numbers, etc.) to Excel by first converting it into a Pandas DataFrame and then writing the DataFrame to Excel:
df.to_excel('pandas_to_excel.xlsx', sheet_name='new_sheet_name')
Basic calculations with Pandas: mean, max and min
Pandas dataframe.max() function returns the maximum of the values in the given object. If the input is a series, the method will return a scalar which will be the maximum of the values in the series:
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({
    "A": [12, 4, 5, 44, 1],
    "B": [5, 2, 54, 3, 2],
    "C": [20, 16, 7, 3, 8],
    "D": [14, 3, 17, 2, 6]
})
print(df.max(axis=0))
Output:
A    44
B    54
C    20
D    17
dtype: int64
Pandas dataframe.min() function returns the minimum of the values in the given object:
print(df.min(axis=0))
Output:
A    1
B    2
C    3
D    2
dtype: int64
The Pandas dataframe.mean() function returns the mean of the values for the requested axis:
print(df.mean(axis=0))
Output:
A    13.2
B    13.2
C    10.8
D     8.4
dtype: float64
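These methods also work across rows; with the same DataFrame, axis=1 returns one value per row:

print(df.mean(axis=1))
# expected: 12.75, 6.25, 20.75, 13.00, 4.25 (one mean per row)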
How to merge data frames with Pandas
The pandas module contains various features to perform operations on dataframes such as join, concatenate, delete, add, etc. In this section, we are going to discuss the types of join operations that can be performed on a pandas DataFrame. There are four main types of joins in Pandas:
– Inner Join
– Left Outer Join
– Right Outer Join
– Full Outer Join or simply Outer Join
Let’s make two data frames:
# importing pandas
import pandas as pd

# Creating dataframe a
d = {'id': [1, 2, 10, 12],
     'val1': ['a', 'b', 'c', 'd']}
a = pd.DataFrame(d)

# Creating dataframe b
d = {'id': [1, 2, 9, 8],
     'val1': ['p', 'q', 'r', 's']}
b = pd.DataFrame(d)
Inner Join: Inner join is the most common type of join you’ll be working with. It returns a dataframe with only those rows that have common characteristics. This is similar to the intersection of two sets:
df = pd.merge(a, b, on='id', how='inner')
print(df)
Output:
   id val1_x val1_y
0   1      a      p
1   2      b      q
Left Outer Join: With a left outer join, all the records from the first dataframe are kept, whether or not their keys can be found in the second dataframe. From the second dataframe, only the records whose keys appear in the first dataframe are included:
df = pd.merge(a, b, on='id', how='left')
print(df)
Output:
   id val1_x val1_y
0   1      a      p
1   2      b      q
2  10      c    NaN
3  12      d    NaN
Right Outer Join: With a right join, all the records from the second dataframe are kept, while from the first dataframe only the records whose keys appear in the second dataframe are included:
df = pd.merge(a, b, on='id', how='right')
print(df)
Output:
   id val1_x val1_y
0   1      a      p
1   2      b      q
2   9    NaN      r
3   8    NaN      s
Full Outer Join: A full outer join returns all the rows from the left dataframe and all the rows from the right dataframe, matching up rows where possible and filling in NaN elsewhere. If every key appears in both dataframes, the outer join produces the same output as an inner join:
df = pd.merge(a, b, on='id', how='outer')
print(df)
Output:
   id val1_x val1_y
0   1      a      p
1   2      b      q
2  10      c    NaN
3  12      d    NaN
4   9    NaN      r
5   8    NaN      s
How to work with dates columns
When a CSV file is imported and a DataFrame is created, the date-time fields in the file are read as string objects rather than datetime objects, and it is very hard to perform operations such as computing a time difference on strings. The Pandas to_datetime() method converts string dates and times into Python datetime objects:
# importing pandas package
import pandas as pd

# making a data frame from a csv file with a column
# containing a datetime field called "Date"
data = pd.read_csv("todatetime.csv")

# overwriting data after changing the format
data["Date"] = pd.to_datetime(data["Date"])
print(data)
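Once the column is a true datetime type, date arithmetic becomes easy; a small self-contained sketch (the dates are made up):

import pandas as pd

data = pd.DataFrame({"Date": ["2021-01-05", "2021-04-23"]})
data["Date"] = pd.to_datetime(data["Date"])

# e.g. the time difference between consecutive rows
print(data["Date"].diff())  # NaT, then 108 days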
Data engineering: Apply()
Lastly, the Pandas apply() function allows users to pass a function and apply it to every value of a Pandas Series (when called on a DataFrame, the function is applied to each column, or to each row with axis=1):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

def square(x):
    return x * x

df1 = df.apply(square)
print(df1)
Output:
   A    B
0  1  100
1  4  400
We can also apply a function along the axis:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

df1 = df.apply(np.sum, axis=0)
print(df1)

df1 = df.apply(np.sum, axis=1)
print(df1)
Output:
A     3
B    30
dtype: int64
0    11
1    22
dtype: int64
Let’s say we want to apply a function that accepts more than one parameter. In that case, we can pass the additional parameters using the ‘args’ argument:
import pandas as pd

# a function of three arguments (named sum_three so it
# doesn't shadow Python's built-in sum)
def sum_three(x, y, z):
    return x + y + z

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df1 = df.apply(sum_three, args=(1, 2))
print(df1)
Output:
   A   B
0  4  13
1  5  23