Definitions and Data
The difference between variance, covariance, and correlation is:
- Variance is a measure of variability from the mean
- Covariance is a measure of relationship between the variability of 2 variables - covariance is scale dependent because it is not standardized
- Correlation is a measure of the relationship between the variability of 2 variables - correlation is standardized, making it not scale dependent
A more in-depth look at each of these follows below. First, import the required packages and create some fake data.
import pandas as pd
import numpy as np
# Setting a seed so the example is reproducible
np.random.seed(4272018)
df = pd.DataFrame(np.random.randint(low=0, high=20, size=(5, 2)),
                  columns=['Commercials Watched', 'Product Purchases'])
df
| | Commercials Watched | Product Purchases |
|---|---|---|
| 0 | 10 | 13 |
| 1 | 15 | 0 |
| 2 | 7 | 7 |
| 3 | 2 | 4 |
| 4 | 16 | 11 |
df.agg(["mean", "std"])
| | Commercials Watched | Product Purchases |
|---|---|---|
| mean | 10.000000 | 7.000000 |
| std | 5.787918 | 5.244044 |
What is variance?
Variance is a measure of how much the data for a variable varies from its mean. This can be represented with the following equation: $$\text{Variance }(s^2) = \sum\frac{(x_i - \bar{x})^2}{N - 1}$$ Where,
- $x_i$ is the ith observation,
- $\bar{x}$ is the mean, and
- $N$ is the number of observations
Calculating this manually for commercials watched would produce the following results:
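With the five observations for commercials watched (10, 15, 7, 2, and 16) and their mean of 10, the arithmetic is:

$$s^2 = \frac{(10 - 10)^2 + (15 - 10)^2 + (7 - 10)^2 + (2 - 10)^2 + (16 - 10)^2}{5 - 1} = \frac{0 + 25 + 9 + 64 + 36}{4} = \frac{134}{4} = 33.5$$

The square root of this value, $\sqrt{33.5} \approx 5.788$, agrees with the standard deviation reported in the summary table above.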
This can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the variance. Using Pandas, one simply needs to enter the following:
df.var()
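As a rough check on the formula, the sketch below recomputes the sample variance by hand with NumPy and compares it with `df.var()`. It assumes the data frame is rebuilt from the values shown in the table earlier (hard-coded here rather than regenerated from the seed); the names `manual_var` and `pandas_var` are illustrative only.

```python
import numpy as np
import pandas as pd

# Values copied from the table above (an assumption: hard-coded, not drawn from the seed)
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"].to_numpy()

# Sum of squared deviations from the mean, divided by N - 1
manual_var = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

# pandas divides by N - 1 by default (ddof=1)
pandas_var = df.var()

print(manual_var)
print(pandas_var)
```

Note that NumPy's `np.var` divides by N unless `ddof=1` is passed, so the two libraries give different answers with their default settings.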
What is covariance?
Covariance is a scale-dependent measure of the relationship between 2 variables, i.e. how much one variable changes when another variable changes. This can be represented with the following equation: $$\text{Covariance }(x, y) = \sum\frac{(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$ Where,
- $x_i$ is the ith observation in variable x,
- $\bar{x}$ is the mean for variable x,
- $y_i$ is the ith observation in variable y,
- $\bar{y}$ is the mean for variable y, and
- $N$ is the number of observations
The formula is very similar to the formula used to calculate variance. The difference is that instead of squaring the difference between each data point and the mean for that variable, one multiplies that difference by the corresponding difference for the other variable.
The covariance between commercials watched and product purchases can be calculated manually and would produce the following results:
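Taking the deviations from the mean for commercials watched (0, 5, -3, -8, 6; mean 10) and for product purchases (6, -7, 0, -3, 4; mean 7), the arithmetic is:

$$\text{Covariance }(x, y) = \frac{(0)(6) + (5)(-7) + (-3)(0) + (-8)(-3) + (6)(4)}{5 - 1} = \frac{0 - 35 + 0 + 24 + 24}{4} = \frac{13}{4} = 3.25$$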
Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the covariance. Using Pandas, one simply needs to enter the following:
df.cov()
| | Commercials Watched | Product Purchases |
|---|---|---|
| Commercials Watched | 33.50 | 3.25 |
| Product Purchases | 3.25 | 27.50 |
Covariance is hard to interpret because its values are scale dependent and have no upper or lower bound. This is where correlation comes in.
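To illustrate the scale dependence, the sketch below multiplies one column by 100, as if it had been recorded in a different unit, and recomputes the covariance. The rescaling factor and the variable names are hypothetical, and the data frame is hard-coded from the table shown earlier.

```python
import pandas as pd

# Values copied from the table above (an assumption: hard-coded, not drawn from the seed)
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Covariance in the original units
cov_original = df["Commercials Watched"].cov(df["Product Purchases"])

# Hypothetical change of unit: the same information on a scale 100 times larger
scaled = df.copy()
scaled["Product Purchases"] = scaled["Product Purchases"] * 100
cov_scaled = scaled["Commercials Watched"].cov(scaled["Product Purchases"])

print(cov_original)  # approximately 3.25
print(cov_scaled)    # approximately 325, i.e. 100 times larger
```

The relationship between the two variables has not changed, yet the covariance is 100 times larger, which is why its magnitude alone says little about the strength of the relationship.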
What is correlation?
Correlation overcomes the scale dependency that is present in covariance by standardizing the values. This standardization converts the values to the same scale; the example below uses the Pearson Correlation Coefficient. The equation for converting data to Z-scores is: $$\text{Z-score } = \frac{x_i - \bar{x}}{s_x}$$ Where,
- $x_i$ is the ith value for the variable,
- $\bar{x}$ is the mean for the variable, and
- $s_x$ is the standard deviation for the variable
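A minimal sketch of the Z-score conversion in Pandas follows, assuming the data frame is rebuilt from the table shown earlier; the name `z` is illustrative. After standardizing, each column has a mean of 0 and a standard deviation of 1 (up to floating-point error), so both variables are on the same scale.

```python
import pandas as pd

# Values copied from the table above (an assumption: hard-coded, not drawn from the seed)
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Subtract each column's mean and divide by its sample standard deviation
z = (df - df.mean()) / df.std()

print(z.round(3))
print(z.agg(["mean", "std"]).round(3))
```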
There is no need to convert the values before using the Pearson Correlation equation since the standardization is a part of the formula: $$r = \sum\frac{(x_i - \bar{x})(y_i - \bar{y})}{(N - 1)(s_x)(s_y)}$$ Where,
- $x_i$ is the ith observation in variable x,
- $\bar{x}$ is the mean for variable x,
- $y_i$ is the ith observation in variable y,
- $\bar{y}$ is the mean for variable y,
- $N$ is the number of observations,
- $s_x$ is the standard deviation for variable x, and
- $s_y$ is the standard deviation for variable y
Working through the equation manually would produce the following result:
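The numerator is the same sum of cross-products used for the covariance (13), and the standard deviations are those reported in the summary table (5.787918 and 5.244044), so the arithmetic is:

$$r = \frac{13}{(5 - 1)(5.787918)(5.244044)} \approx \frac{13}{121.41} \approx 0.107$$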
Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the correlation. Using Pandas, one simply needs to enter the following:
df.corr()
| | Commercials Watched | Product Purchases |
|---|---|---|
| Commercials Watched | 1.000000 | 0.107077 |
| Product Purchases | 0.107077 | 1.000000 |
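The off-diagonal value can be reproduced by dividing the covariance by the product of the two standard deviations. The sketch below does this and also checks that rescaling one variable leaves the coefficient unchanged; the data frame is hard-coded from the table shown earlier, and the factor of 100 and the variable names are hypothetical.

```python
import pandas as pd

# Values copied from the table above (an assumption: hard-coded, not drawn from the seed)
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"]
y = df["Product Purchases"]

# Pearson's r as covariance divided by the product of the standard deviations
r_manual = x.cov(y) / (x.std() * y.std())

# The same coefficient computed directly by pandas (Pearson is the default method)
r_pandas = x.corr(y)

# Hypothetical change of unit: unlike covariance, r is unaffected
r_rescaled = x.corr(y * 100)

print(round(r_manual, 6), round(r_pandas, 6), round(r_rescaled, 6))
```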
The Pearson Correlation Coefficient always ranges from -1 to 1. The closer the correlation coefficient is to -1 or 1, the stronger the relationship; the closer the correlation coefficient is to 0, the weaker the relationship.
If the correlation coefficient is positive, this indicates that as one variable increases so does the other. However, if the correlation coefficient is negative, it indicates that as one variable increases the other decreases. An easy way to see this relationship is to plot it using a scatter plot. Currently there is no agreed-upon threshold for how to interpret the coefficients. Akoglu (2018) provides the following table with the three most commonly used suggestions for how to interpret the correlation coefficients - the fields vary a bit.
| Correlation Coefficient (positive) | Correlation Coefficient (negative) | Dancey & Reidy (Psychology) | Quinnipiac University (Politics) | Chan YH (Medicine) |
|---|---|---|---|---|
| +1 | −1 | Perfect | Perfect | Perfect |
| +0.9 | −0.9 | Strong | Very Strong | Very Strong |
| +0.8 | −0.8 | Strong | Very Strong | Very Strong |
| +0.7 | −0.7 | Strong | Very Strong | Moderate |
| +0.6 | −0.6 | Moderate | Strong | Moderate |
| +0.5 | −0.5 | Moderate | Strong | Fair |
| +0.4 | −0.4 | Moderate | Strong | Fair |
| +0.3 | −0.3 | Weak | Moderate | Fair |
| +0.2 | −0.2 | Weak | Weak | Poor |
| +0.1 | −0.1 | Weak | Negligible | Poor |
| 0 | 0 | Zero | None | None |
There are other measures of correlation, such as Spearman's rank correlation, Kendall's tau, biserial, and point-biserial correlations. Each correlation measure makes different assumptions about the data and tests a different null hypothesis. An in-depth look at these measures is out of scope for this page.
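As a brief pointer only, Pandas exposes some of these alternatives through the `method` argument of `df.corr()`. The sketch below requests Spearman's rank correlation and shows that it matches a Pearson correlation computed on the ranked data; the data frame is hard-coded from the table shown earlier and the variable names are illustrative.

```python
import pandas as pd

# Values copied from the table above (an assumption: hard-coded, not drawn from the seed)
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Spearman's rank correlation via the method argument
spearman = df.corr(method="spearman")

# Equivalent view: rank each column, then take the Pearson correlation of the ranks
ranked = df.rank()
spearman_manual = ranked["Commercials Watched"].corr(ranked["Product Purchases"])

print(spearman)
print(spearman_manual)
```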
References
Akoglu, H. (2018). User's guide to correlation coefficients. Turk J Emerg Med, 18(3), 91-93. doi: 10.1016/j.tjem.2018.08.001
Rosner, B. (2015). Fundamentals of Biostatistics (8th ed.). Boston, MA: Cengage Learning.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.