Python for Data Science

Descriptive statistics

Descriptive statistics summarize the data and are broken down into measures of central tendency (mean, median, and mode) and measures of variability (variance, standard deviation, minimum/maximum values, range, kurtosis, and skewness). The example data used on this page is [3, 5, 7, 8, 8, 9, 10, 11].

Measures of Central Tendency

    Mean The average value of the data. It is calculated by adding all the measurements of a variable together and dividing that sum by the number of observations. The formula is displayed below. $$ \bar{x} = \frac{\sum x}{n} \\ \\ \begin{align} \text{Where,} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$x$ represents the measurements, and} \\ \text{$n$ is the total number of observations} \end{align} $$ Calculating the mean using the example data: $$ \bar{x} = \frac{3 + 5 + 7 + 8 + 8 + 9 + 10 + 11}{8} = \frac{61}{8} \\ \\ \bar{x} = 7.625 $$
    Median The middle value when the measurements are placed in ascending order. If there is an even number of measurements, there is no single middle value, so the median is the average of the two middle values. In the example data the two middle values are both 8. $$ \text{median} = \frac{8 + 8}{2} \\ \text{median} = 8 $$
    Mode The value that occurs most often in the set of measurements. In the example data the mode is 8, which occurs twice.
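As a quick check of the definitions above, the three measures can be computed for the example data with Python's standard-library statistics module. This is only a minimal sketch; the variable names are introduced here for illustration.

```python
import statistics

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # averages the two middle values when n is even
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```

The results should match the hand calculations: a mean of 7.625, a median of 8, and a mode of 8.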

Measures of Variability

    Variance The sum of the squared deviations from the mean divided by the number of observations minus 1. Dividing by n - 1 rather than n gives an unbiased estimate of the population variance. Variance is expressed in squared units of the measurement, so it cannot be interpreted directly in the original units. $$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \\ \begin{align} \text{Where,} \\ \text{$x_i$ is the $i^{th}$ value of the measurement} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$n$ is the total number of observations} \end{align} $$
    Standard deviation The positive square root of the variance. Unlike the variance, the standard deviation is expressed in the same unit of measurement as the observations. $$ s = \sqrt{s^2} $$
    Minimum value The smallest value of the measurements.
    Maximum value The largest value of the measurements.
    Range The difference between the maximum and minimum values.
    Kurtosis A measure of the tailedness of a distribution.
    Skew A measure of the asymmetry of the distribution of the data.
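The first few measures of variability can be sketched for the example data in the same way. The snippet below assumes only the standard-library statistics module, whose variance() and stdev() functions use the n - 1 denominator shown in the formula above; the variable names are illustrative.

```python
import statistics

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # positive square root of the sample variance
minimum = min(data)
maximum = max(data)
value_range = maximum - minimum       # difference between largest and smallest value

print(round(variance, 2), round(std_dev, 2), minimum, maximum, value_range)
```

For this data the sum of squared deviations is 47.875, so the variance is 47.875 / 7, or about 6.84, and the standard deviation is about 2.62.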

Descriptive Statistics with Python

There are a few ways to get descriptive statistics using Python. This section shows how to get them using Pandas and Researchpy. First, let's import an example data set.

import pandas as pd
import researchpy as rp

# Load the example blood pressure data set
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 5 columns):
patient      120 non-null int64
sex          120 non-null object
agegrp       120 non-null object
bp_before    120 non-null int64
bp_after     120 non-null int64
dtypes: int64(3), object(2)
memory usage: 4.8+ KB

Pandas

Continuous variables

df['bp_before'].describe()
count    120.000000
mean     156.450000
std       11.389845
min      138.000000
25%      147.000000
50%      154.500000
75%      164.000000
max      185.000000
Name: bp_before, dtype: float64

This method returns many useful descriptive statistics, mixing measures of central tendency and measures of variability: the number of non-missing observations; the mean; the standard deviation; the minimum value; the 25th, 50th (a.k.a. the median), and 75th percentiles; and the maximum value. It does not report two pieces of information that are typically wanted alongside the mean: the standard error and the 95% confidence interval. Researchpy's summary_cont() method, shown later on this page, provides them.
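If only the standard error is needed, it can also be obtained from Pandas directly. The sketch below uses the small example data from the top of this page rather than the blood pressure file, and compares the Series sem() method with the manual calculation (sample standard deviation divided by the square root of n). The variable names are illustrative.

```python
import math

import pandas as pd

# Example data from the top of this page, used here instead of the CSV file
s = pd.Series([3, 5, 7, 8, 8, 9, 10, 11])

se_builtin = s.sem()                         # standard error of the mean (ddof=1 by default)
se_manual = s.std() / math.sqrt(s.count())   # sample SD divided by sqrt(n)

print(se_builtin, se_manual)
```

Both values should agree, since sem() and std() use the same n - 1 denominator by default.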

Categorical variables

df['sex'].describe()
count        120
unique         2
top       Female
freq          60
Name: sex, dtype: object
df['sex'].value_counts()
Female    60
Male      60
Name: sex, dtype: int64

Using the describe() and value_counts() methods together is useful since they complement each other. The describe() method reports "Female" as the top (most frequent) category, which suggests that it occurs more often than "Male", but value_counts() shows that is not the case: both occur 60 times.
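One way to spot this kind of tie without reading the counts by eye is the Series mode() method, which returns every value tied for most frequent. The sketch below uses a small hypothetical column (not the blood pressure data) and also shows value_counts(normalize=True) for percentages.

```python
import pandas as pd

# Hypothetical, evenly split column used only for illustration
sex = pd.Series(["Female", "Male", "Female", "Male"])

modes = sex.mode()                                  # all values tied for most frequent
percents = sex.value_counts(normalize=True) * 100   # share of each category, in percent

print(list(modes))
print(percents)
```

When mode() returns more than one value, the single "top" entry reported by describe() should not be read as the only most frequent category.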

For more information about these methods, please see their official documentation page for describe() and value_counts().

Distribution measures

df['bp_before'].kurtosis()
-0.4385909267217518
df['bp_before'].skew()
0.5542441047738688

For more information on these methods, please see their official documentation page for kurtosis() and skew().
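To make these two numbers less of a black box, here is a standard-library sketch of the bias-adjusted sample skewness and excess kurtosis. It assumes, as the Pandas documentation describes, that skew() and kurtosis() return unbiased estimates and that kurtosis uses Fisher's definition (a normal distribution scores 0); the function name sample_skew_kurtosis is hypothetical.

```python
import math


def sample_skew_kurtosis(values):
    """Return (skewness, excess kurtosis) with small-sample bias adjustment."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    # Central moments using the biased (divide by n) definition
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    g1 = m3 / m2 ** 1.5     # moment-based skewness
    g2 = m4 / m2 ** 2 - 3   # moment-based excess kurtosis
    # Small-sample adjustments
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return skew, kurt


# Example data from the top of this page
skew, kurt = sample_skew_kurtosis([3, 5, 7, 8, 8, 9, 10, 11])
print(skew, kurt)
```

A negative skew indicates a longer left tail and a positive skew a longer right tail, so the positive value reported for bp_before points to a longer right tail. The sketch does not handle constant data, where the second moment is zero.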

Researchpy

Continuous variables

rp.summary_cont(df['bp_before'])
    Variable      N    Mean         SD        SE   95% Conf.    Interval
0  bp_before  120.0  156.45  11.389845  1.039746  154.391199  158.508801

This method returns less information overall than the describe() method, but it returns more in-depth information about the mean: the non-missing count, mean, standard deviation (SD), standard error (SE), and the 95% confidence interval.
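The SE and interval columns can be checked by hand from the reported N, mean, and SD. The working below assumes the interval is built from a t critical value with N - 1 degrees of freedom (approximately 1.980 for 119 degrees of freedom). That assumption is not stated in the output itself, but it reproduces the reported bounds. $$ SE = \frac{s}{\sqrt{n}} = \frac{11.389845}{\sqrt{120}} \approx 1.0397 \\ \\ \bar{x} \pm t_{0.975,\,119} \times SE \approx 156.45 \pm 1.980 \times 1.0397 \approx (154.39,\ 158.51) $$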

Categorical variables

rp.summary_cat(df['sex'])
  Variable Outcome  Count  Percent
0      sex  Female     60     50.0
1             Male     60     50.0

The method returns the variable name, the non-missing count, and the percentage of each category of a variable. By default, the outcomes are sorted in descending order.

For more information about these methods, please see the official documentation for summary_cont() and summary_cat().

References

Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.