Descriptive statistics
Descriptive statistics summarize the data and are broken down into
measures of central tendency (mean, median, and mode) and measures of
variability (variance, standard deviation, minimum/maximum values, range, kurtosis,
and skewness). The example data used on this page is [3, 5, 7, 8, 8, 9, 10, 11].
Measures of Central Tendency
Mean
The average value of the data. It is calculated by adding all the measurements of a variable together and dividing that sum by the number of observations. The formula is displayed below.
$$ \bar{x} = \frac{\sum x}{n} \\ \\ \begin{align} \text{Where,} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$x$ represents the measurements, and} \\ \text{$n$ is the total number of observations} \end{align} $$
Calculating the mean using the example data:
$$ \bar{x} = \frac{3 + 5 + 7 + 8 + 8 + 9 + 10 + 11}{8} \\ \\ \bar{x} = \frac{61}{8} = 7.625 $$
Median
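The middle value when the measurements are placed in ascending order. If there is no true midpoint, which happens when the number of observations is even, the median is calculated by adding the two middle values together and dividing by 2. The example data has 8 observations, so the two middle values are the 4th and 5th, which are both 8.
$$ \text{median} = \frac{8 + 8}{2} \\ \text{median} = 8 $$
Mode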
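The number that occurs the most in the set of measurements. In the example data, the mode is 8 because it is the only value that occurs twice.
Measures of Variability
Variance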
The sum of the squared deviations from the mean divided by the number of observations minus 1. Dividing by $n - 1$ rather than $n$ gives an unbiased estimate of the population variance. Variance is expressed in the squared unit of measurement of the observations, so it cannot be interpreted directly on the original scale.
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \\ \begin{align} \text{Where,} \\ \text{$x_i$ is the $i^{th}$ value of the measurement} \\ \text{$\bar{x}$ is the estimated average} \\ \text{$\sum$ indicates to add all the values in the data} \\ \text{$n$ is the total number of observations} \end{align} $$
Standard deviation
The positive square root of the variance. Standard deviation is expressed in the same unit of measurement as the observations, which makes it easier to interpret.
$$ s = \sqrt{s^2} $$
Minimum value
The smallest value of the measurements.
Maximum value
The largest value of the measurements.
Range
The difference between the maximum and minimum values. For the example data, the range is 11 - 3 = 8.
Kurtosis
A measure of the tailedness of a distribution.
Skew
A measure of the symmetry of the distribution of the data.
Descriptive Statistics with Python
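Before turning to the libraries, the definitions above can be checked against the example data. The following is a minimal sketch using only Python's built-in `statistics` module; the variable names are illustrative and not part of the original page.

```python
import statistics

# Example data used throughout this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

# Measures of central tendency
mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # average of the two middle values when n is even
mode = statistics.mode(data)      # most frequently occurring value

# Measures of variability
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # positive square root of the sample variance
minimum, maximum = min(data), max(data)
value_range = maximum - minimum

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.4f}, sd={std_dev:.4f}")
print(f"min={minimum}, max={maximum}, range={value_range}")
```

The mean and median printed here should agree with the hand calculations above (7.625 and 8).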
There are a few ways to get descriptive statistics using Python. This section shows how to get them using Pandas and Researchpy. First, let's import an example data set.
import pandas as pd
import researchpy as rp
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")
df.info()
Pandas
Continuous variables
df['bp_before'].describe()
This method returns many useful descriptive statistics, with a mix of measures of central tendency and measures of variability: the number of non-missing observations; the mean; standard deviation; minimum value; 25th, 50th (a.k.a. the median), and 75th percentiles; and the maximum value. It is missing some information that is typically desired regarding the mean, namely the standard error and the 95% confidence interval. Researchpy's summary_cont() method provides this information and is shown later on this page.
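The standard error and 95% confidence interval can also be computed directly. The sketch below is one way to do it, assuming SciPy is installed and that an interval based on the t distribution is wanted. It uses the small example data from the top of this page as a stand-in for a column such as df['bp_before'], and the variable names are illustrative.

```python
import pandas as pd
from scipy import stats

# Example data from the top of this page (stand-in for a real column)
values = pd.Series([3, 5, 7, 8, 8, 9, 10, 11])

n = values.count()   # number of non-missing observations
mean = values.mean()
se = values.sem()    # standard error of the mean: sample SD / sqrt(n)

# Assumption: two-sided 95% interval from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_lower = mean - t_crit * se
ci_upper = mean + t_crit * se

print(f"n={n}, mean={mean:.3f}, se={se:.3f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
```

With only 8 observations the interval is wide; on a larger sample the same code gives a tighter interval around the mean.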
Categorical variables
df['sex'].describe()
df['sex'].value_counts()
Using both the describe() and value_counts() methods is useful since they
complement each other in the information returned. The describe() method
reports "Female" as the top category, which may suggest that it occurs more often
than "Male", but value_counts() shows that is not the case since both occur an
equal number of times.
For more information about these methods, please see their official documentation
page for describe()
and value_counts().
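To see why the two methods are worth reading together, here is a small sketch on a made-up Series with tied categories (the data is illustrative, not the blood pressure data set). describe() reports a single `top` category and its `freq`, while value_counts() lists every category; passing normalize=True returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical categorical data with a tie between the two categories
sex = pd.Series(["Female", "Male", "Female", "Male"])

summary = sex.describe()                   # count, unique, top, freq
counts = sex.value_counts()                # count for every category
shares = sex.value_counts(normalize=True)  # proportion for every category

print(summary)
print(counts)
print(shares)
```

Because the categories are tied, the `top` entry alone cannot show that they occur equally often; the counts and proportions make that explicit.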
Distribution measures
df['bp_before'].kurtosis()
df['bp_before'].skew()
For more information on these methods, please see their official documentation page for kurtosis() and skew().
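As a rough illustration of how to read these two numbers, the sketch below applies skew() and kurtosis() to the example data from the top of this page and to a small symmetric Series (the second Series is made up for comparison). Per the pandas documentation, both are bias-corrected sample estimates, and kurtosis() uses Fisher's definition, so a normal distribution scores 0.

```python
import pandas as pd

example = pd.Series([3, 5, 7, 8, 8, 9, 10, 11])  # example data from this page
symmetric = pd.Series([1, 2, 3, 4, 5])           # made-up symmetric data

example_skew = example.skew()          # negative: the longer tail is on the left
example_kurt = example.kurtosis()      # excess kurtosis relative to a normal distribution
symmetric_skew = symmetric.skew()      # 0 for perfectly symmetric data
symmetric_kurt = symmetric.kurtosis()  # negative: lighter tails than a normal distribution

print(example_skew, example_kurt)
print(symmetric_skew, symmetric_kurt)
```

A negative skew for the example data is consistent with its mean (7.625) sitting below its median (8).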
Researchpy
Continuous variables
rp.summary_cont(df['bp_before'])
| | Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|---|
| 0 | bp_before | 120.0 | 156.45 | 11.389845 | 1.039746 | 154.391199 | 158.508801 |
This method returns less overall information than the describe() method, but it returns more in-depth information regarding the mean: the non-missing count, mean, standard deviation (SD), standard error (SE), and the 95% confidence interval.
Categorical variables
rp.summary_cat(df['sex'])
| | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | sex | Female | 60 | 50.0 |
| 1 | | Male | 60 | 50.0 |
The method returns the variable name, the non-missing count, and the percentage of
each category of a variable. By default, the outcomes are sorted in
descending order.
For more information about these methods, please see the official documentation
for summary_cont() and
summary_cat().