Descriptive Statistics with Python

Python is a versatile programming language that is widely used in various fields, including data science and statistics. Its simplicity, readability, and vast ecosystem of libraries make it an excellent choice for performing descriptive statistics.

Python Libraries for Descriptive Statistics

Python has several libraries with powerful tools for descriptive statistics. Each library has its own advantages and disadvantages and may be suitable for different use cases. Some of the most popular are the following:

  1. Pandas: Pandas is the most popular library in the Python ecosystem for data analysis. It has built-in methods for the most common summary statistics, for handling missing data, and for data manipulation.
  2. SciPy: SciPy is a scientific computing library that includes various statistical functions and tests. It provides functions for optimization, differential equations, hypothesis testing, and much more.
  3. Statistics: The statistics module is part of the Python standard library and offers basic functions. It includes functions for calculating measures of central tendency and measures of dispersion.
  4. Scikit-learn: Scikit-learn is a machine learning library that also includes some statistical functionality. It provides tools for feature scaling and data preprocessing, some of which compute statistics such as the mean or median (for example, when imputing missing values).
  5. Statsmodels: Statsmodels is a library focused on statistical modeling and includes a wide range of statistical models and tests. It offers functions for calculating regression models, time series analysis, and much more.
  6. Researchpy: Researchpy is a Python package that aims to simplify statistical analysis and reporting. It offers a convenient interface for inferential statistics, hypothesis tests, and summary tables.
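
As a quick orientation, the sketch below uses only the standard-library statistics module from the list above, so it runs without installing anything. The sample values are the same ones used in the rest of this article; the variable names are illustrative.

```python
import statistics

# Same sample values used throughout this article
values = [3, 5, 7, 8, 8, 9, 10, 11]

mean = statistics.mean(values)      # arithmetic mean
median = statistics.median(values)  # middle value (average of the two middle ones here)
mode = statistics.mode(values)      # most frequent value
stdev = statistics.stdev(values)    # sample standard deviation (n - 1 denominator)

print(mean, median, mode)  # 7.625 8.0 8
print(round(stdev, 4))     # 2.6152
```

For small lists like this one the standard library is enough; the Pandas methods shown in the following sections become more convenient once the data lives in a DataFrame.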

Mean, Median, and Mode

Measures of central tendency are statistics that provide insights into the “middle” or central location of a dataset. The three common measures of central tendency are the mean, median, and mode.

Mean

The mean, also known as the average, is calculated by summing all the values in a dataset and dividing by the total number of observations. It is a very common measure of central tendency. In Python, calculating the mean is straightforward with the help of Pandas.

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the mean
mean = data['values'].mean()

# Outcome
print(mean)
7.625
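
To make the definition concrete, here is a minimal sketch that computes the same mean by hand and then shows how Pandas treats missing values. The second series (with_gap) is a made-up example, not data from this article; skipna is a parameter of Series.mean() that defaults to True.

```python
import pandas as pd

values = [3, 5, 7, 8, 8, 9, 10, 11]

# Mean by hand: sum of the values divided by the number of observations
manual_mean = sum(values) / len(values)
print(manual_mean)  # 7.625

# Hypothetical series with one missing value
with_gap = pd.Series([3, 5, None, 8])

# By default Pandas skips missing values: (3 + 5 + 8) / 3
mean_skipping_nan = with_gap.mean()

# With skipna=False the missing value propagates and the result is NaN
mean_with_nan = with_gap.mean(skipna=False)

print(mean_skipping_nan, mean_with_nan)
```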

Median

The median is the middle value in a dataset when the values are sorted in ascending order. If there is an odd number of observations, it is the single middle value; if there is an even number, it is the average of the two middle values. The Pandas method for calculating the median is as follows:

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the median
median = data['values'].median()

# Outcome
print(median)
8.0
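
The odd and even cases can be sketched without any library. The function name median_by_hand is hypothetical and only serves this illustration; in practice the Pandas method above does the same job.

```python
def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd count: take the single middle value
    return ordered[mid]


print(median_by_hand([3, 5, 7, 8, 8, 9, 10, 11]))  # 8.0
print(median_by_hand([3, 5, 7]))                   # 5
```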

Mode

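The mode is the value that appears most frequently in a dataset. Again, we can calculate it using Pandas. Because a dataset can have more than one mode, the mode() method returns a Series rather than a single number.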

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the mode
mode = data['values'].mode()
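
The following sketch shows what mode() returns for the data above and for a series with a tie. The tied series is a made-up example, not data from this article.

```python
import pandas as pd

data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# mode() returns a Series containing every most-frequent value
modes = data['values'].mode()
print(modes.tolist())  # [8]

# Hypothetical series where two values are equally frequent
tied = pd.Series([1, 1, 2, 2, 3])
tied_modes = tied.mode()
print(tied_modes.tolist())  # [1, 2]
```

To get a single number when only one mode is expected, take the first element, for example modes.iloc[0].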

Measures of Dispersion

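Measures of dispersion describe the spread or variability of a dataset. They attempt to describe how much the data deviates from the central tendency. Some common measures of dispersion are the variance, the standard deviation, and the minimum and maximum values (which together give the range). Kurtosis and skewness are often reported alongside them, although strictly they describe the shape of the distribution rather than its spread.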

Variance

Variance measures the average squared deviation of the data points from the mean. The sample variance is calculated by summing the squared differences between each value and the mean and dividing by the total number of observations minus one. Pandas offers a simple method for calculating the variance, and it uses this n - 1 denominator by default.

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the variance
variance = data['values'].var()

# Outcome
print(variance)
6.839285714285714
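
To check the formula against the Pandas result, the sketch below computes the variance by hand with both denominators. The ddof argument ("delta degrees of freedom") is a parameter of Series.var(); the other names are illustrative.

```python
import pandas as pd

values = [3, 5, 7, 8, 8, 9, 10, 11]
series = pd.Series(values)

# Squared differences between each value and the mean
mean = sum(values) / len(values)
squared_diffs = [(x - mean) ** 2 for x in values]

# Sample variance: divide by n - 1 (the Pandas default, ddof=1)
sample_var = sum(squared_diffs) / (len(values) - 1)

# Population variance: divide by n (ddof=0 in Pandas)
population_var = sum(squared_diffs) / len(values)

print(sample_var, series.var())
print(population_var, series.var(ddof=0))
```

Use the n - 1 version when the data is a sample from a larger population, and ddof=0 when the data is the whole population.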

Standard Deviation

The standard deviation is the square root of the variance. It expresses the typical deviation from the mean in the same units as the data. To calculate it with Pandas:

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the standard deviation
std_deviation = data['values'].std()

# Outcome
print(std_deviation)
2.615202805574687
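
The relationship to the variance can be verified with the standard library alone. This is a small sketch; stdev and pstdev are the sample and population versions in the statistics module.

```python
import math
import statistics

values = [3, 5, 7, 8, 8, 9, 10, 11]

sample_std = statistics.stdev(values)       # n - 1 denominator, like Series.std()
population_std = statistics.pstdev(values)  # n denominator

# The standard deviation is the square root of the matching variance
matches_variance = math.isclose(sample_std, math.sqrt(statistics.variance(values)))
print(matches_variance)  # True

print(round(sample_std, 4))  # 2.6152
```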

Minimum and Maximum Values

The minimum and maximum values represent the smallest and largest values in a dataset, respectively. Pandas methods for calculating these values are:

import pandas as pd

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the minimum and maximum values
minimum = data['values'].min()
maximum = data['values'].max()

# Outcome
print(minimum)
3

print(maximum)
11
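
The two values are often combined into the range, the distance between the largest and the smallest observation. A minimal sketch, using the agg() method of a Pandas Series to compute both statistics in one call:

```python
import pandas as pd

data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Range: distance between the largest and smallest value
value_range = data['values'].max() - data['values'].min()
print(value_range)  # 8

# agg() computes several summary statistics in one call
summary = data['values'].agg(['min', 'max'])
print(summary['min'], summary['max'])
```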

Kurtosis

Kurtosis measures the tailedness or heaviness of the tails of a dataset's distribution. It provides insights into the shape of the distribution. Pandas has a built-in method for it, Series.kurt(), which returns the bias-corrected excess kurtosis. Alternatively, you can use the scipy.stats module, whose kurtosis function returns the excess kurtosis without bias correction by default:

import pandas as pd
from scipy.stats import kurtosis

# Create a DataFrame
data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# Calculate the kurtosis
kurt = kurtosis(data['values'])
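
Libraries differ in their kurtosis conventions, so the same data can give different numbers. The sketch below compares scipy.stats.kurtosis with the Pandas method Series.kurt(); bias is a parameter of the SciPy function. It also computes the skewness with Series.skew(), since skewness is usually reported together with kurtosis.

```python
import math

import pandas as pd
from scipy.stats import kurtosis

data = pd.DataFrame([3, 5, 7, 8, 8, 9, 10, 11], columns=['values'])

# SciPy default: excess kurtosis (normal distribution = 0), no bias correction
kurt_scipy = kurtosis(data['values'])

# SciPy with bias correction
kurt_scipy_unbiased = kurtosis(data['values'], bias=False)

# Pandas: excess kurtosis with bias correction
kurt_pandas = data['values'].kurt()

print(kurt_scipy, kurt_scipy_unbiased, kurt_pandas)
print(math.isclose(kurt_scipy_unbiased, kurt_pandas))  # True

# Skewness: negative values indicate a longer left tail
skewness = data['values'].skew()
print(skewness)
```

When reporting kurtosis, it helps to state which convention was used (excess or not, bias-corrected or not), because the values are not directly comparable otherwise.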

Real-World Examples of Python for Statistics

Here are some real-world examples of how Python is applied in statistics:

  1. Market Research: Python is used to analyze market research data and extract insights. It helps in identifying trends, determining market segments, and making data-driven decisions.
  2. Healthcare: Python is used to analyze patient data, support clinical trials, and develop predictive models for disease diagnosis and treatment.
  3. Finance: The most common applications of Python in finance include risk analysis, portfolio optimization, and quantitative trading. It enables financial institutions to make informed decisions based on statistical models.
  4. Social Sciences: Python is employed in social sciences research for analyzing survey data, conducting experiments, and performing statistical modeling to understand human behavior and societal trends.
  5. Environmental Science: Python is used to analyze climate data, model environmental processes, and assess the impact of human activities on the environment.
  6. Quality Control: Python is employed in quality control processes to analyze production data, monitor product quality, and identify areas for improvement.

Conclusion

Python provides a vast array of libraries and tools for statistics. Libraries such as Pandas, SciPy, Statistics, Scikit-learn, and Statsmodels offer powerful capabilities for calculating measures of central tendency, measures of dispersion, and conducting statistical analysis. Python’s simplicity and readability, combined with its wide adoption in various industries, make it an excellent choice for statistical analysis.
