Descriptive statistics

Descriptive statistics summarize the data and are broken down into measures of central tendency (mean, median, and mode) and measures of variability (variance, standard deviation, minimum/maximum values, range, kurtosis, and skewness). The example data used on this page is [3, 5, 7, 8, 8, 9, 10, 11].

Measures of Central Tendency

Mean

The average value of the data. It is calculated by adding all the measurements of a variable together and dividing that sum by the number of observations. The formula is displayed below.

$$ \bar{x} = \frac{\sum x}{n} \\ \\ \begin{align} \text{Where,} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$x$ represents the measurements, and} \\ \text{$n$ is the total number of observations} \end{align} $$

Calculating the mean using the example data:

$$ \bar{x} = \frac{3 + 5 + 7 + 8 + 8 + 9 + 10 + 11}{8} \\ \\ \bar{x} = 7.625 $$
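The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for the example and do not come from the original page.

```python
from statistics import mean

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

# Mean by the formula: sum of the values divided by the number of observations
x_bar = sum(data) / len(data)

# The standard library function gives the same result
library_mean = mean(data)

print(x_bar)         # 7.625
print(library_mean)  # 7.625
```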

Median

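The middle value when the measurements are placed in ascending order. If the number of observations is even, there is no single middle value, and the median is calculated by adding the two middle values together and dividing by 2. The example data has 8 observations, so the two middle values are the 4th and 5th, which are both 8.

$$ \text{median} = \frac{8 + 8}{2} \\ \text{median} = 8 $$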

Mode

The value that occurs most often in the set of measurements. In the example data the mode is 8, the only value that occurs twice.
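As a quick check of the median and mode on the example data, the sketch below uses Python's standard library `statistics` module. The variable names are illustrative only.

```python
from statistics import median, mode

# Example data used on this page (already in ascending order)
data = [3, 5, 7, 8, 8, 9, 10, 11]

# With an even number of observations, median() averages the two middle values
middle = median(data)

# mode() returns the most frequently occurring value
most_common = mode(data)

print(middle, most_common)  # 8.0 8
```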

Measures of Variability

Variance

The sum of the squared deviations from the mean divided by the number of observations minus 1. Dividing by $n - 1$ rather than $n$ gives an unbiased estimate of the population variance. Variance is expressed in the squared unit of measurement of the observations, which makes it hard to interpret directly.

$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \\ \begin{align} \text{Where,} \\ \text{$x_i$ is the $i^{th}$ value of the measurement} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$n$ is the total number of observations} \end{align} $$
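To make the formula concrete, the sketch below applies it step by step to the example data and compares the result with `statistics.variance()`, which also divides by n - 1. The intermediate names (`ss`, `s2`) are chosen here for illustration.

```python
from statistics import variance

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]
n = len(data)
x_bar = sum(data) / n

# Sum of the squared deviations from the mean
ss = sum((x - x_bar) ** 2 for x in data)

# Sample variance: divide by n - 1
s2 = ss / (n - 1)

print(ss)                         # 47.875
print(round(s2, 4))               # 6.8393
print(round(variance(data), 4))   # 6.8393
```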

Standard deviation

The positive square root of the variance. Unlike the variance, the standard deviation is expressed in the same unit of measurement as the observations, so it can be interpreted on the original scale.

$$ s = \sqrt{s^2} $$
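A short sketch of this relationship on the example data, using only the standard library. `statistics.stdev()` is the sample standard deviation, so it should agree with the square root of the sample variance.

```python
from math import sqrt
from statistics import stdev, variance

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

# Standard deviation as the positive square root of the sample variance
s = sqrt(variance(data))

print(round(s, 4))            # 2.6152
print(round(stdev(data), 4))  # 2.6152
```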

Minimum value

The smallest value of the measurements.

Maximum value

The largest value of the measurements.

Range

The difference between the maximum and minimum values.
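For the example data, the minimum, maximum, and range can be found with Python built-ins. The variable names below are illustrative.

```python
# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

smallest = min(data)
largest = max(data)

# Range: difference between the maximum and minimum values
data_range = largest - smallest

print(smallest, largest, data_range)  # 3 11 8
```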

Kurtosis

A measure of the tailedness of a distribution.

Skew

A measure of the asymmetry of the distribution of the data. A value of 0 indicates a symmetric distribution.
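Neither measure comes with a formula on this page, so the following is only a sketch under a stated assumption: it uses the simple moment-based definitions (third and fourth central moments scaled by powers of the second), with 3 subtracted from kurtosis so that a normal distribution scores 0. Libraries such as Pandas apply additional sample-size corrections, so their results on the same data will differ somewhat. The function names are hypothetical helpers written for this example.

```python
# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

def central_moment(values, k):
    """Average of the k-th powers of the deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness (no sample-size correction); 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (no sample-size correction)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Negative skew here: the longer tail is on the low side of the mean
print(skewness(data))
print(excess_kurtosis(data))
```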

Descriptive Statistics with Python

There are a few ways to get descriptive statistics using Python. This section shows how to get them using Pandas and Researchpy. First, let's import an example data set.

import pandas as pd
import researchpy as rp

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv") 


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 5 columns):
patient      120 non-null int64
sex          120 non-null object
agegrp       120 non-null object
bp_before    120 non-null int64
bp_after     120 non-null int64
dtypes: int64(3), object(2)
memory usage: 4.8+ KB

Pandas

Continuous variables

df['bp_before'].describe()

count    120.000000
mean     156.450000
std       11.389845
min      138.000000
25%      147.000000
50%      154.500000
75%      164.000000
max      185.000000
Name: bp_before, dtype: float64

This method returns many useful descriptive statistics, mixing measures of central tendency and measures of variability: the number of non-missing observations; the mean; the standard deviation; the minimum value; the 25th, 50th (a.k.a. the median), and 75th percentiles; and the maximum value. It is missing some information that is typically desired alongside the mean, namely the standard error and the 95% confidence interval. Researchpy's summary_cont() method provides these; it is shown later on this page.
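To see how the describe() output can be used programmatically, here is a small sketch on the 8-value example data from the top of this page rather than the blood pressure data set, so it runs without a download. The Series name is an arbitrary choice for the example. It also shows Pandas' sem() method, which returns the standard error of the mean.

```python
import pandas as pd

# Example data from the top of this page, not the blood pressure data set
s = pd.Series([3, 5, 7, 8, 8, 9, 10, 11], name="example")

# describe() returns a Series indexed by statistic name
summary = s.describe()

print(summary["mean"])  # 7.625
print(summary["50%"])   # 8.0

# Standard error of the mean, which describe() does not report
print(s.sem())
```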

Categorical variables

df['sex'].describe()

count        120
unique         2
top       Female
freq          60
Name: sex, dtype: object

df['sex'].value_counts()

Female    60
Male      60
Name: sex, dtype: int64

Using both the describe() and value_counts() methods is useful since they complement each other. The describe() output lists "Female" as the top category, which could be read as "Female" occurring more often than "Male", but value_counts() shows that both occur equally often.
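The tie can be reproduced on a tiny made-up Series, which is an assumption for illustration and not the blood pressure data. describe() still reports a single "top" category, while value_counts() makes the equal frequencies visible.

```python
import pandas as pd

# Made-up data with two categories that occur equally often
sex = pd.Series(["Female", "Male", "Female", "Male"], name="sex")

desc = sex.describe()
counts = sex.value_counts()

# describe() names one category as "top" even though both are tied
print(desc["top"], desc["freq"])

# value_counts() shows the full picture
print(counts)
```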

For more information about these methods, please see their official documentation page for describe() and value_counts().

Distribution measures

df['bp_before'].kurtosis()

-0.4385909267217518

df['bp_before'].skew()

0.5542441047738688

For more information on these methods, please see their official documentation page for kurtosis() and skew().

Researchpy

Continuous variables

rp.summary_cont(df['bp_before'])

	Variable	N	Mean	SD	SE	95% Conf.	Interval
0	bp_before	120.0	156.45	11.389845	1.039746	154.391199	158.508801

This method returns less overall information than the describe() method, but it returns more in-depth information regarding the mean. It returns the non-missing count, mean, standard deviation (SD), standard error (SE), and the 95% confidence interval.
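For readers who want to see where the SE and interval columns could come from, the sketch below recomputes them from the count, mean, and standard deviation reported by describe(). It assumes the usual formulas: SE = SD / sqrt(N), and a t-based interval with N - 1 degrees of freedom. That Researchpy computes the interval this way is an inference from the numbers shown above, not something stated on this page. SciPy is an extra dependency used only for the t critical value.

```python
from math import sqrt
from scipy import stats

# Values taken from the describe() output for bp_before
n = 120
mean = 156.45
sd = 11.389845

# Standard error of the mean
se = sd / sqrt(n)

# Two-sided 95% critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se

print(round(se, 6))                      # 1.039746
print(round(lower, 3), round(upper, 3))  # close to the interval reported above
```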

Categorical variables

rp.summary_cat(df['sex'])

	Variable	Outcome	Count	Percent
0	sex	Female	60	50.0
1		Male	60	50.0

The method returns the variable name, each outcome (category) of the variable, and the non-missing count and percentage for each outcome. By default, the outcomes are sorted in descending order.
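A comparable count-and-percent table can be sketched with plain Pandas. The Series below is rebuilt from the counts shown above (60 "Female", 60 "Male") rather than loaded from the CSV, and the column names are chosen to mirror the Researchpy output. This is an approximation of the layout, not Researchpy's implementation.

```python
import pandas as pd

# Rebuilt from the counts shown above, not loaded from the CSV
sex = pd.Series(["Female"] * 60 + ["Male"] * 60, name="sex")

# Count and percentage for each category
table = pd.DataFrame({
    "Count": sex.value_counts(),
    "Percent": sex.value_counts(normalize=True) * 100,
})

print(table)
```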

For more information about these methods, please see the official documentation for summary_cont() and summary_cat().

References

Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.

Symbol	Meaning
$n$	Sample size
$N$	Population size
$s^2$	Sample variance
$\sigma^2$	Population variance
$s$	Sample standard deviation
$\sigma$	Population standard deviation
$\mu$	Mean
$\bar{x}$	Sample or group mean
symbol$_1$	Subscript represents a group, i.e. symbol$_1$ is group 1 while symbol$_2$ is group 2
$\alpha$	Alpha value, statistical significance threshold
