Pandas Library

 

 

PANDAS QUICK TUTORIAL

What is Pandas?

The Pandas software library is a free-to-use (BSD licensed), open-source add-on to the Python language. In this Pandas quick tutorial we will see how to easily retrieve information from datasets and present it in visually intuitive formats. The library was created using a combination of Python, C, and Cython and compares very favorably to other libraries in key performance metrics, making it a logical choice for data scientists in general, but particularly for those who work hands-on with the Python language.

Since its creation in 2008, this highly accessible tool has steadily become the library of choice for data scientists, data analysts, Python developers, and all those interested in machine learning. The library’s efficiency, high performance, and ease of use have made it extremely popular with anyone who needs to carry out advanced manipulations on large-scale data sets.

How to Install it

You can install pandas into your project just like other external packages by using the pip package manager:

pip install pandas

How to read different types of files: JSON, Excel, CSV

Big data sets are often stored, or extracted as JSON, Excel, CSV, etc.
These file types are plain text, but have standard formats well known in the world of programming, including Pandas.

The read_json() function allows you to read JSON files:

import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())

The read_csv() function allows you to read CSV files:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())

Pandas read_excel() uses a library called xlrd internally.  xlrd is a library for reading (input) Excel files (.xlsx, .xls) in Python.
If you call pandas.read_excel in an environment where xlrd is not installed, you will receive an error message similar to the following: “ImportError: Install xlrd >= 0.9.0 for Excel support”

xlrd can be installed with pip:

pip install xlrd

read_excel() function allows you to read Excel files:

import pandas as pd 
df = pd.read_excel('sample.xlsx') 
print(df)

Pandas Dataframes

At the heart of the Pandas package is the DataFrame data structure. A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns. You will commonly see the variable “df” set to the value of the DataFrame. Example:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

Once data is inside a Pandas DataFrame, we can begin to explore ways to manipulate the data.

 Basic Pandas Functions

 –head()
head(n) is used to return the first n rows of a dataset. By default, df.head() will return the first 5 rows of the DataFrame. If you want more/less number of rows, you can specify n as an integer.

-describe()
describe() is used to generate descriptive statistics of the data in a Pandas DataFrame or Series. It summarizes central tendency and dispersion of the dataset. describe() helps in getting a quick overview of the dataset.

-astype()
astype() casts a Python object to a particular data type. It can be a very helpful function in case your data is not stored in the correct format (data type). For instance, if floating point numbers are misinterpreted by Python as strings, you can convert them back to floating point numbers with astype(). Or if you want to convert an object datatype to “category”, you can use astype(). For example:

data_1['Gender'] = data_1.Gender.astype('category')

-loc()
loc() helps to access a group of rows and columns in a dataset, a slice of the dataset, as per our requirement. For instance, if we only want the last 2 rows and the first 3 columns of a dataset, we can access them with the help of loc[:]. Example:

data_1.loc[0:4, ['Name', 'Age', 'State']]

Output:

Name  Age   State      
Alam  29    Madhya Pradesh   
Rohit 23    Delhi 
Bimla 35    Haryana      
Rahul 25    West Bengal      
Chaman 32   Tamil Nadu

Filters by column value and conditional selection

# selecting rows based on condition 
rslt_df = dataframe[dataframe['Percentage'] > 70]
print('\nResult dataframe :\n', rslt_df)

-Row selection: iloc()
The iloc indexer for Pandas Dataframe is used for integer-location based indexing / selection by position.
An index refers to a position within an ordered list, where the first item has index = 0. Thus, item 1 in list will be at index 0, item 2 in list will be at index 1, and so on. Some helpful iloc() tricks:

data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas) 
data.iloc[-1] # last row of data frame (Mi Richan)
# Columns: 
data.iloc[:,0] # first column of data frame (first_name) 
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)
# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 5th, 6th, 7th columns of data frame (county -> phone1)

How to drop columns and rows

To drop a row or column in a dataframe, you need to use the drop() method available in the dataframe.Let us take an example data frame:

product_name    Unit_Price        No_Of_Units     Available_Quantity      Available_Since_Date
Keyboard         500.000               5                5                    11/5/2021
Mouse            200.000               5                6                    4/23/2021
Monitor          5000.235             10               10                   08/21/2021
CPU              10000.550            20             Not Available          09/18/2021
CPU              10000.550            20             Not Available           9/18/2021
Speakers           250.500             8                NaT                 01/05/2021
NaT                 NaN               NaT               NaT                     NaT

Now, let us drop rows 5 & 6:

df.drop([5,6], axis=0, inplace=True)

In this code:

  • [5,6] is the index of the rows you want to delete
  • axis=0 denotes that rows should be deleted from the dataframe
  • inplace=True performs the drop operation in the same dataframe

You may also use indexes to drop a range of rows:

df.drop(df.index[2:4], inplace=True)

In this code:

  • index[2:4] generates a range of rows from 2 to 4. The lower limit of the range is inclusive and the upper limit of the range is exclusive. Therefore, rows 2 and 3 will be deleted and row 4 will not be deleted.

To drop columns, you simply change the axis value to 1 and name the columns you wish to drop:

df = df.drop('column_name', axis=1)

Find and delete duplicates

Pandas drop_duplicates() method helps in removing duplicates from the Pandas Dataframe:

import pandas as pd 
data = { "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"], "B": [50, 40, 40, 30, 50], "C": [True, False, False, False, True] } 
df = pd.DataFrame(data) 
display(df.drop_duplicates())

Output:

        A      B      C
0    TeamA    50    True 
1    TeamB    40    False 
3    TeamC    30    False

Drop dups of a certain column:
data = { "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"], "B": [50, 40, 40, 30, 50], "C": [True, False, False, False, True]}

# sorting by first name
data.sort_values("A", inplace=True)

# dropping ALL duplicate values
data.drop_duplicates(subset="A",keep=False, inplace=True)

# displaying data 
data

Output:

     A     B    C
   TeamC  30  False

Find, select and delete Nulls (NaN)

While reading a csv file, many blank columns appear as null values into the Data Frame which later creates problems while manipulating data. Pandas .isnull() and .notnull() methods check and manage NULL values in a data frame.
In the example below, we create a DataFrame with a null value in column C (in python, you can assign a null value using “None”). We then use the .isnull function:

# importing pandas package 
import pandas as pd 
data = { "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"], "B": [50, 40, 40, 30, 50], "C": [True, False, None, False, True] } 
data = pd.DataFrame(data)
# creating bool series True for NaN values 
bool_series = pd.isnull(data["C"])
print(data[bool_series])

Output:

       A   B     C 
    TeamB  40  None

You can see only the rows with null values in column C. In the following example, we use the .notnull() function:

# importing pandas package
import pandas as pd 
data = { "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"], "B": [50, 40, 40, 30, 50], "C": [True, False, None, False, True] }
data = pd.DataFrame(data)
# creating bool series False for NaN values
bool_series = pd.notnull(data["C"])
# displayed data only with team = NaN 
data[bool_series]

Output:

   A   B      C  
TeamA  50   True   
TeamB  40  False   
TeamC  30  False   
TeamA  50   True

We can see all the rows that did NOT contain a null value in column C.

How to replace Values

Depending on your needs, you may use either of the following approaches to replace values in Pandas:

(1) Replace a single value with a new value for an individual DataFrame column:

df['column name'] = df['column name'].replace(['old value'],'new value')

(2) Replace multiple values with a new value for an individual DataFrame column:

df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],'new value')

(3) Replace multiple values with multiple new values for an individual DataFrame column:

df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],['1st new value','2nd new value',...])

(4) Replace a single value with a new value for an entire DataFrame

df = df.replace(['old value'],'new value')

Groupby

Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently. Pandas dataframe.groupby() function splits the data into groups based on some criteria. In the following example, we load a csv containing nba player information:

# importing pandas as pd 
import pandas as pd
# Creating the dataframe  
df = pd.read_csv("nba.csv")
# Print the dataframe 
print(df)

Output:

Name            Team           Number   Position  Age  Height   Weight   College   Salary Avery Bradley       Boston Celtics     0.0       PG      25    6-2      180     Texas       7730337
Jae Crowder   Boston Celtics    99.0       SF      25    6-6      235   Marquette     6796117
John Holland  Boston Celtics    30.0       SG      27    6-5      205   Boston Univ       NaN 
R.J. Hunter   Boston Celtics    28.0       SG      22    6-5      185   Georgia State 1148640 
Jonas Jerebko Boston Celtics     8.0       PF      29   6-10      231       NaN        5000000 

Next, let’s implement the .groupby function to group by team value:

# applying groupby() function to
# group the data on team value. 
gk = df.groupby('Team')
# Let's print the first entries# in all the groups formed. 
gk.first()

Output:

                  Name      Number Position   Age   Height  Weight        College      Salary 
Team
Atlanta Hawks    Kent Bazemore     24  SF          26     6-5   201    Old Dominion   2000000
Boston Celtics   Avery Bradley      0  PG          25     6-2   180         Texas     7730337 Brooklyn Nets    Bojan Bogdanovic  44  SG          27     6-8   216    Oklahoma State 3425510 Charlotte Hornets Nicolas Batum     5  SG          27     6-8   200    Virginia Comm  13125306
Chicago Bulls    Cameron Bairstow  41  PF          25     6-9   250    New Mexico     845059 

Some of our displayed data doesn’t quite make sense… there are surely more than 1 player on each of these teams. Let’s find these values:

# Finding the values contained in the "Boston Celtics" group
gk.get_group('Boston Celtics')

Output:

Name Team          Number Position   Age Height  Weight     College    Salary
Avery Bradley  Boston Celtics     0     PG  25    6-2   180        Texas   7730337
Jae Crowder  Boston Celtics    99     SF  25    6-6   235      Marquette   67961172
John Holland  Boston Celtics    30     SG  27   6-5   205  Boston Univers    NaN
R.J. Hunter  Boston Celtics    28    SG  22    6-5   185   Georgia State   1148640
Jonas Jerebko  Boston Celtics     8     PF  29   6-10  231            NaN   5000000
Amir Johnson  Boston Celtics    90     PF  29    6-9   240            NaN  12000000

We can also use groupby() function to form groups based on more than one category (i.e. Use more than one column to perform the splitting):

# importing pandas as pd 
import pandas as pd
# Creating the dataframe  
df = pd.read_csv("nba.csv")
# First grouping based on "Team"# Within each team we are grouping based on "Position" 
gkk = df.groupby(['Team', 'Position'])
# Print the first value in each group 
print(gkk.first())

groupby() is a very powerful function with a lot of variations. It makes the task of splitting the dataframe over some criteria really easy and efficient.

 Print unique values and value counts

You can get unique values in column (or multiple columns) from pandas DataFrame using unique() function:

# Find unique values of a column 
print(df[“Column Name”].unique())
# Using pandas.unique() to unique values in multiple columns 
df2 = pd.unique(df[['Courses', 'Fee']].values.ravel())

Pandas Index.value_counts() function returns an object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Moreover, it excludes NA values by default:

df = pd.DataFrame({'Name': ['Shobhit', 'Vaibhav', 'Vimal', 'Sourabh', 'Rahul', 'Shobhit'], 'Physics': [11, 12, 13, 14, None, 11], 
'Chemistry': [10, 14, None, 18, 20, 10], 
'Math': [13, 10, 15, None, None, 13]})
print(df)
df.count()

Output:

Name         6 
Physics      5 
Chemistry    5 
Math         4 

You can apply the Pandas value_counts() method to both a DataFrame column or to an entire data set:

 
# Calculating a Frequency Table of a DataFrame Column 
print(df[“Physics”].value_counts())

Pandas also makes it very easy to display these frequencies as percentages:

print(df["Physics"].value_counts(normalize=True))

Export data to excel or csv

Pandas DataFrame to_csv() function converts DataFrame into CSV data. We can pass a file object to write the CSV data into a file:

df.to_csv('out.csv')

You can write any data (lists, strings, numbers etc) to Excel, by first converting it into a Pandas DataFrame and then writing the DataFrame to Excel:

df.to_excel('pandas_to_excel.xlsx', sheet_name='new_sheet_name')

Basic calculations with Pandas: mean, max and mins

Pandas dataframe.max() function returns the maximum of the values in the given object. If the input is a series, the method will return a scalar which will be the maximum of the values in the series:

# importing pandas as pd import pandas as pd# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], 
"B":[5, 2, 54, 3, 2], 
"C":[20, 16, 7, 3, 8], 
"D":[14, 3, 17, 2, 6]})
print(df.max(axis = 0))

Output:

A    44 
B    54 
C    20 
D    17

Pandas dataframe.min() function returns the minimum of the values in the given object:

print(df.min(axis = 0))

Output:

A    1 
B    2 
C    3 
D    2

Pandas dataframe.mean() function return the mean of the values for the requested axis:

print(df.mean(axis = 0))

Output:

A    13.2 
B    13.2 
C    10.8 
D     8.4

Merge data frames with Pandas

The pandas module contains various features to perform various operations on dataframes like join, concatenate, delete, add, etc. There are mainly four types of Joins in Pandas:

– Inner Join
– Left Outer Join
– Right Outer Join
– Full Outer Join or simply Outer Join

Let’s make two data frames:

# importing pandas import pandas as pd 
# Creating dataframe a 
a = pd.DataFrame() 
# Creating Dictionary 
d = {'id': [1, 2, 10, 12], 'val1': ['a', 'b', 'c', 'd']} 
# Creating dataframe b 
b = pd.DataFrame() 
# Creating dictionary 
d = {'id': [1, 2, 9, 8], 'val1': ['p', 'q', 'r', 's']} 
b = pd.DataFrame(d)

Inner Join: Inner join is the most common type of join you’ll be working with. It returns a dataframe with only those rows that have common characteristics. This is similar to the intersection of two sets:

df = pd.merge(a, b, on='id', how='inner') 
print(df)

Output:

id   val1_x    val1_y   
1      a         1   
2      b         q

Left Outer Join: With a left outer join, all the records from the first dataframe will be displayed, irrespective of whether the keys in the first dataframe can be found in the second dataframe. Whereas, for the second dataframe, only the records with the keys in the second dataframe that can be found in the first dataframe will be displayed:

df = pd.merge(a, b, on='id', how='left') 
print(df)

Output:

id         val1_x          val1_y
1             a               p
2             b               q
10            c             NaN
12            d             NaN

Right Outer Join: For a right join, all the records from the second dataframe will be displayed. However, only the records with the keys in the first dataframe that can be found in the second dataframe will be displayed:

df = pd.merge(a, b, on='id', how='right') print(df)

Output:

id val1_x val1_y 
1      a      p 
2      b      q 
9    NaN      r 
8    NaN      s

Full Outer Join: A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere. But if the dataframe is complete, then we get the same output.

df = pd.merge(a, b, on='id', how='outer')

print(df)

Output:

id      val1_x   val1_y 0   
1         a         p  
2         b         q   
10        c        NaN   
12        d        NaN    
9        NaN        r    
8        NaN        s

How to work with dates columns

 When you read a csv file, the Date time objects in the file are read as a string object rather than a Date Time object and hence it’s very tough to perform operations like Time difference on a string rather than a Date Time object. Pandas to_datetime() method helps to convert string Date time into Python Date time object:

# importing pandas package
import pandas as pd

# making data frame from csv file with a column containing a datetime field called "Date"
data = pd.read_csv("todatetime.csv")

# overwriting data after changing format
data["Date"]= pd.to_datetime(data["Date"])

print(data)

Data engineering: Apply()

Lastly, Pandas apply() function allow the users to pass a function and apply it on every single value of the Pandas series:

import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
def square(x): return x * x
df1 = df.apply(square)
print(df1)

Output:

A    B 
1   100 
4   400

  We can also apply a function along the axis:

import pandas as pd 
import numpy as np
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})df1 = df.apply(np.sum, axis=0) print(df1) df1 = df.apply(np.sum, axis=1) 
print(df1)

Output:

A     3 
B    30 dtype: int64 
0    11 
1    22 dtype: int64

Finally, we want to apply a function that accepts more than one parameter. In that case, we can pass the additional parameters using the ‘args’ argument:

import pandas as pd 
def sum(x, y, z): return x + y + z 
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]}) 
df1 = df.apply(sum, args=(1, 2)) 
print(df1)

Output:

   A   B 
0  4  13 
1  5  23

 

 

Python and Excel Projects for practice
Register New Account
Shopping cart