# Descriptive statistics

Descriptive statistics summarize the data and are broken down into
measures of central tendency (mean, median, and mode) and measures of
variability (variance, standard deviation, minimum/maximum values, range,
kurtosis, and skewness). The example data used on this page are [3, 5, 7, 8, 8, 9, 10, 11].

**Measures of Central Tendency**

**Mean**

The average value of the data. It is calculated by adding all the measurements
of a variable together and dividing that sum by the number of
observations used. The formula is displayed below.
$$
\bar{x} = \frac{\sum x}{n} \\
\\
\begin{align}
\text{Where,} \\
\text{$\bar{x}$ is the estimated average} \\
\text{$\sum$ indicates to add all the values in the data} \\
\text{$x$ represents the measurements, and} \\
\text{$n$ is the total number of observations}
\end{align}
$$
Calculating the mean using the example data.
$$
\bar{x} = \frac{3 + 5 + 7 + 8 + 8 + 9 + 10 + 11}{8} \\
\\
\bar{x} = 7.625
$$
**Median**

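The middle value when the measurements are placed in ascending order. If
there is an even number of measurements, there is no single middle value, and
the median is calculated by adding the two middle values together and dividing
by 2. For the example data, the two middle values are both 8.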
$$
\text{median} = \frac{8 + 8}{2} \\
\text{median} = 8
$$
**Mode**

The value that occurs most often in the set of measurements. In the example
data, 8 occurs twice and every other value occurs once, so the mode is 8.
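
As a quick check, the three measures of central tendency can be computed for the example data with Python's standard-library `statistics` module. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

# Example data used on this page
data = [3, 5, 7, 8, 8, 9, 10, 11]

mean = statistics.mean(data)      # 61 / 8 = 7.625
median = statistics.median(data)  # average of the two middle values, 8 and 8
mode = statistics.mode(data)      # 8 is the only value that occurs twice

print(mean, median, mode)
```
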

**Measures of Variability**

**Variance**

The sum of the squared deviations from the mean divided by the number of
observations minus 1. Dividing by n - 1 rather than n gives an unbiased
estimate of the population variance. Variance is expressed in squared units of
the measurement, so it is not directly interpretable on the original scale.
$$
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}
\\
\begin{align}
\text{Where,} \\
\text{$x_i$ is the $i^{th}$ value of the measurement} \\
\text{$\bar{x}$ is the estimated average} \\
\text{$\sum$ indicates to add all the squared deviations} \\
\text{$n$ is the total number of observations}
\end{align}
$$
**Standard deviation**

The positive square root of the variance. Standard deviation can be
interpreted using the unit of measurement of the observations used.
$$
s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}
$$
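
The variance and standard deviation definitions above can be worked through for the example data in a few lines of Python. This is a sketch using only the standard library; the variable names are illustrative.

```python
import math
import statistics

data = [3, 5, 7, 8, 8, 9, 10, 11]
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean, divided by n - 1
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

print(sum_sq_dev)          # 47.875
print(round(variance, 4))  # 6.8393
print(round(std_dev, 4))   # 2.6152

# The statistics module uses the same n - 1 definition
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))
```
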
**Minimum value**

The smallest value of the measurements. For the example data, this is 3.

**Maximum value**

The largest value of the measurements. For the example data, this is 11.

**Range**

The difference between the maximum and minimum values. For the example data,
this is 11 - 3 = 8.

**Kurtosis**

A measure of the tailedness of a distribution.

**Skew**

A measure of the asymmetry of the distribution of the data.
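
The remaining measures can be sketched for the example data as follows. The helper functions here are hypothetical and use the simple moment-based definitions of skewness and excess kurtosis; libraries such as Pandas apply bias corrections, so their values will generally differ for small samples.

```python
data = [3, 5, 7, 8, 8, 9, 10, 11]

minimum = min(data)              # 3
maximum = max(data)              # 11
value_range = maximum - minimum  # 8


def central_moment(values, k):
    """Return the k-th central moment (divides by n, not n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def moment_skewness(values):
    """Third standardized moment; 0 for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def moment_excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


print(minimum, maximum, value_range)
print(moment_skewness(data))         # negative: the longer tail is on the left
print(moment_excess_kurtosis(data))
```
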
## Descriptive Statistics with Python

There are a few ways to get descriptive statistics using Python. The sections below show how to get them using Pandas and Researchpy. First, let's import an example data set.

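```
# Pandas handles the data; Researchpy provides the summary functions
import pandas as pd
import researchpy as rp

# Load the example blood pressure data set
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")
```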

`df.info()`

### Pandas

#### Continuous variables

`df['bp_before'].describe()`

This method returns many useful descriptive statistics with a mix of
measures of central tendency and measures of variability. This includes the
number of non-missing observations; the mean; the standard deviation; the
minimum value; the 25th, 50th (a.k.a. the median), and 75th percentiles;
as well as the maximum value. It is missing some information that is
typically desired regarding the mean, namely the standard error and the
95% confidence interval. No worries though, pairing this with Researchpy's
summary_cont() method provides that information; this method will be
shown later.
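
To make the percentile entries concrete, the quartiles of this page's example data can be sketched with the standard library. The `method="inclusive"` option is chosen here on the assumption that it mirrors the linear interpolation Pandas uses by default, so treat the values as illustrative rather than as Pandas output.

```python
import statistics

data = [3, 5, 7, 8, 8, 9, 10, 11]

# Cut points that split the sorted data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1)  # 6.5  (25th percentile)
print(q2)  # 8.0  (50th percentile, the median)
print(q3)  # 9.25 (75th percentile)
```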

#### Categorical variables

`df['sex'].describe()`

`df['sex'].value_counts()`

Using both the describe() and value_counts() methods is useful since they
complement each other in the information returned. The describe() method
reports "Female" as the top value, which suggests that "Female" occurs more
often than "Male", but value_counts() shows that this is not the case since
both occur an equal number of times.
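
The same pitfall can be reproduced on a small, hypothetical series in which two categories are perfectly tied. This is a sketch only; the toy data are made up for illustration and are not part of the blood pressure data set.

```python
import pandas as pd

# Hypothetical toy series with a perfect tie between categories
sex = pd.Series(["Female", "Male", "Female", "Male"])

summary = sex.describe()     # count, unique, top, freq
counts = sex.value_counts()  # one entry per category

print(summary["top"], summary["freq"])  # only a single "top" value is reported
print(counts.to_dict())                 # the tie is visible here
```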

For more information about these methods, please see their official documentation
pages for describe() and value_counts().

#### Distribution measures

`df['bp_before'].kurtosis()`

`df['bp_before'].skew()`

For more information on these methods, please see their official documentation page for kurtosis() and skew().

### Researchpy

#### Continuous variables

`rp.summary_cont(df['bp_before'])`

| | Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|---|
| 0 | bp_before | 120.0 | 156.45 | 11.389845 | 1.039746 | 154.391199 | 158.508801 |

This method returns less overall information than the describe() method, but it returns more in-depth information regarding the mean: the non-missing count, the mean, the standard deviation (SD), the standard error (SE), and the 95% confidence interval.
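
The standard error and interval reported above can be approximated by hand from the count, mean, and standard deviation. The sketch below assumes the interval is based on the t distribution, which is consistent with the reported numbers, and hard-codes an approximate critical value rather than looking it up.

```python
import math

# Summary values copied from the output above
n = 120
mean = 156.45
sd = 11.389845

# Standard error of the mean
se = sd / math.sqrt(n)

# Approximate two-sided 95% critical value of the t distribution
# with n - 1 = 119 degrees of freedom (hard-coded assumption)
t_crit = 1.9801

lower = mean - t_crit * se
upper = mean + t_crit * se

print(round(se, 6))
print(round(lower, 3), round(upper, 3))
```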

#### Categorical variables

`rp.summary_cat(df['sex'])`

| | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | sex | Female | 60 | 50.0 |
| 1 | | Male | 60 | 50.0 |

The method returns the variable name, the non-missing count, and the percentage of
each category of a variable. By default, the outcomes are sorted in
descending order.

For more information about these methods, please see the official documentation
for summary_cont() and summary_cat().

## References

Ott, R. L., and Longnecker, M. (2010). *An introduction to statistical methods and data analysis.* Belmont, CA: Brooks/Cole.