Python for Data Science

Definitions and Data

The difference between variance, covariance, and correlation is:

  • Variance is a measure of variability from the mean
  • Covariance is a measure of relationship between the variability of 2 variables - covariance is scale dependent because it is not standardized
  • Correlation is a measure of relationship between the variability of 2 variables - correlation is standardized, making it not scale dependent

Each of these is discussed in more depth below. First, import the required packages and create some fake data.

import pandas as pd
import numpy as np


# Setting a seed so the example is reproducible
np.random.seed(4272018)

df = pd.DataFrame(np.random.randint(low= 0, high= 20, size= (5, 2)),
                  columns= ['Commercials Watched', 'Product Purchases'])

df
   Commercials Watched  Product Purchases
0                   10                 13
1                   15                  0
2                    7                  7
3                    2                  4
4                   16                 11

df.agg(["mean", "std"])
      Commercials Watched  Product Purchases
mean            10.000000           7.000000
std              5.787918           5.244044

What is variance?

Variance is a measure of how much the data for a variable varies from its mean. This can be represented with the following equation:

$$\text{Variance }(s^2) = \sum\frac{(x_i - \bar{x})^2}{N - 1}$$

Where,

  • $x_i$ is the ith observation,
  • $\bar{x}$ is the mean, and
  • $N$ is the number of observations

Calculating this manually for commercials watched would produce the following results:

Variable: Commercials Watched

$$\bar{x} = \frac{10 + 15 + 7 + 2 + 16}{5} = 10.00$$

$$\text{Variance }(s^2) = \frac{(10 - 10)^2 + (15 - 10)^2 + (7 - 10)^2 + (2 - 10)^2 + (16 - 10)^2}{5 - 1} = \frac{134}{4} = 33.5$$

This can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only package that will calculate the variance. Using Pandas, one simply needs to enter the following:

df.var()
Commercials Watched    33.5
Product Purchases      27.5
dtype: float64
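The manual calculation above can also be mirrored step by step in code. The following is a minimal sketch that re-enters the values from the table by hand (rather than relying on the seed) and compares the result with Pandas and NumPy. The variable names are illustrative only. One point worth noting is that Pandas divides by N - 1 by default, whereas `np.var` divides by N unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Values copied by hand from the "Commercials Watched" column above
commercials = pd.Series([10, 15, 7, 2, 16], name="Commercials Watched")

n = len(commercials)
mean = commercials.sum() / n

# Squared differences between each observation and the mean
squared_deviations = (commercials - mean) ** 2

# Sample variance: divide by N - 1, as in the equation above
sample_variance = squared_deviations.sum() / (n - 1)
print(sample_variance)  # 33.5

# Pandas uses ddof=1 (N - 1) by default, so it matches the manual result
pandas_variance = commercials.var()

# NumPy uses ddof=0 (divide by N) by default, giving the population variance
population_variance = np.var(commercials.to_numpy())
numpy_sample_variance = np.var(commercials.to_numpy(), ddof=1)

print(pandas_variance, population_variance, numpy_sample_variance)
```

If the NumPy and Pandas results ever disagree for the same data, the `ddof` setting is usually the first thing to check.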

What is covariance?

Covariance is a scale-dependent measure of the relationship between 2 variables, i.e. how much one variable changes when another variable changes. This can be represented with the following equation:

$$\text{Covariance }(x, y) = \sum\frac{(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Where,

  • $x_i$ is the ith observation in variable x,
  • $\bar{x}$ is the mean for variable x,
  • $y_i$ is the ith observation in variable y,
  • $\bar{y}$ is the mean for variable y, and
  • $N$ is the number of observations

The formula is very similar to the formula used to calculate variance. The difference is that, instead of squaring the difference between each data point and the mean of that variable, one multiplies that difference by the corresponding difference for the other variable.

The covariance between commercials watched and product purchases can be calculated manually and would produce the following results:

Variables: Commercials Watched and Product Purchases

$$\text{Covariance }(x, y) = \frac{(10 - 10)(13 - 7) + (15 - 10)(0 - 7) + (7 - 10)(7 - 7) + (2 - 10)(4 - 7) + (16 - 10)(11 - 7)}{5 - 1} = \frac{13}{4} = 3.25$$

Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only package that will calculate the covariance. Using Pandas, one simply needs to enter the following:

df.cov()
                     Commercials Watched  Product Purchases
Commercials Watched                33.50               3.25
Product Purchases                   3.25              27.50
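As with variance, the manual covariance calculation can be reproduced in a few lines. This is a minimal sketch using the values from the table entered by hand. The variable names are illustrative, and `np.cov` is shown only as one alternative to Pandas.

```python
import numpy as np
import pandas as pd

# Values copied by hand from the DataFrame shown earlier
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"]
y = df["Product Purchases"]
n = len(df)

# Multiply each x deviation by the matching y deviation, then divide by N - 1
cross_products = (x - x.mean()) * (y - y.mean())
sample_covariance = cross_products.sum() / (n - 1)
print(sample_covariance)  # 3.25

# Pandas: covariance between two Series
pandas_covariance = x.cov(y)

# NumPy: returns the full 2 x 2 covariance matrix and divides by N - 1 by default
cov_matrix = np.cov(x.to_numpy(), y.to_numpy())
print(cov_matrix)
```

The diagonal of the matrix returned by `np.cov` holds the two variances, and the off-diagonal entries hold the covariance, which mirrors the layout of `df.cov()`.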

Covariance is hard to interpret because its value is scale dependent and has no upper or lower bound. This is where correlation comes in.
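To make the scale dependence concrete, the sketch below multiplies one column by an arbitrary factor of 100. The factor is a hypothetical change of units chosen purely for illustration, not something from the example data. The covariance grows by the same factor even though the relationship between the variables is unchanged, while the correlation (covered in the next section) stays the same.

```python
import pandas as pd

# Values copied by hand from the DataFrame shown earlier
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Hypothetical unit change: express commercials on a scale 100 times larger
rescaled = df.copy()
rescaled["Commercials Watched"] = rescaled["Commercials Watched"] * 100

cols = ("Commercials Watched", "Product Purchases")

cov_original = df.cov().loc[cols]
cov_rescaled = rescaled.cov().loc[cols]

corr_original = df.corr().loc[cols]
corr_rescaled = rescaled.corr().loc[cols]

# Covariance is multiplied by 100; correlation does not change
print(cov_original, cov_rescaled)
print(corr_original, corr_rescaled)
```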

What is correlation?

Correlation overcomes the scale dependency that is present in covariance by standardizing the values. This standardization converts the values to the same scale; the example below uses the Pearson Correlation Coefficient. The equation for converting data to Z-scores is:

$$\text{Z-score } = \frac{x_i - \bar{x}}{s_x}$$

Where,

  • $x_i$ is the ith value for the variable,
  • $\bar{x}$ is the mean for the variable, and
  • $s_x$ is the standard deviation for the variable
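As a quick illustration of the Z-score equation, the sketch below standardizes both columns of the example data (values entered by hand). After standardizing, each column should have a mean of approximately 0 and a standard deviation of approximately 1, which is what puts the two variables on the same scale.

```python
import pandas as pd

# Values copied by hand from the DataFrame shown earlier
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Z-score: subtract the column mean, divide by the column standard deviation.
# df.std() uses N - 1 in the denominator, matching the variance formula above.
z_scores = (df - df.mean()) / df.std()

print(z_scores.round(3))

# Each standardized column has mean ~0 and standard deviation ~1
z_means = z_scores.mean()
z_stds = z_scores.std()
print(z_means.round(10))
print(z_stds.round(10))
```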

There is no need to convert the values before using the Pearson Correlation equation since the standardization is a part of the formula:

$$r = \sum\frac{(x_i - \bar{x})(y_i - \bar{y})}{(N - 1)(s_x)(s_y)}$$

Where,

  • $x_i$ is the ith observation in variable x,
  • $\bar{x}$ is the mean for variable x,
  • $y_i$ is the ith observation in variable y,
  • $\bar{y}$ is the mean for variable y,
  • $N$ is the number of observations,
  • $s_x$ is the standard deviation for variable x, and
  • $s_y$ is the standard deviation for variable y

Working through the equation manually would produce the following result:

Variables: Commercials Watched and Product Purchases

$$r = \frac{(10 - 10)(13 - 7) + (15 - 10)(0 - 7) + (7 - 10)(7 - 7) + (2 - 10)(4 - 7) + (16 - 10)(11 - 7)}{(5 - 1)(5.787918)(5.244044)} = \frac{13}{(4)(5.787918)(5.244044)} \approx 0.11$$

Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only package that will calculate the correlation. Using Pandas, one simply needs to enter the following:

df.corr()
                     Commercials Watched  Product Purchases
Commercials Watched             1.000000           0.107077
Product Purchases               0.107077           1.000000
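The Pearson formula can be checked against the Pandas output in two equivalent ways: by dividing the covariance by the product of the two standard deviations, or by summing the products of the Z-scores and dividing by N - 1. The sketch below (values entered by hand, variable names illustrative) computes both and compares them with `np.corrcoef`.

```python
import numpy as np
import pandas as pd

# Values copied by hand from the DataFrame shown earlier
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"]
y = df["Product Purchases"]
n = len(df)

# Route 1: covariance divided by the product of the standard deviations
covariance = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
r_from_covariance = covariance / (x.std() * y.std())

# Route 2: sum of the products of the Z-scores, divided by N - 1
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()
r_from_z_scores = (z_x * z_y).sum() / (n - 1)

# NumPy returns the full 2 x 2 correlation matrix
r_numpy = np.corrcoef(x.to_numpy(), y.to_numpy())[0, 1]

print(round(r_from_covariance, 6))  # 0.107077
print(round(r_from_z_scores, 6))    # 0.107077
print(round(r_numpy, 6))            # 0.107077
```

Both routes agree because dividing each deviation by its standard deviation is exactly the standardization step built into the denominator of the Pearson formula.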

The Pearson Correlation Coefficient always ranges from -1 to 1. The closer the correlation coefficient is to -1 or 1, the stronger the relationship; the closer the correlation coefficient is to 0, the weaker the relationship.

If the correlation coefficient is positive, this indicates that as one variable increases so does the other. However, if the correlation coefficient is negative, it indicates that as one variable increases the other decreases. An easy way to see this relationship is to plot it using a scatter plot. Currently there is no agreed-on threshold for how to interpret the coefficients. Akoglu (2018) provides the following table with the three most commonly used suggestions for how to interpret the correlation coefficients - the fields vary a bit.

Interpretation of the Pearson's and Spearman's correlation coefficients.

Correlation Coefficient   Dancey & Reidy (Psychology)   Quinnipiac University (Politics)   Chan YH (Medicine)
+1     −1                 Perfect                       Perfect                            Perfect
+0.9   −0.9               Strong                        Very Strong                        Very Strong
+0.8   −0.8               Strong                        Very Strong                        Very Strong
+0.7   −0.7               Strong                        Very Strong                        Moderate
+0.6   −0.6               Moderate                      Strong                             Moderate
+0.5   −0.5               Moderate                      Strong                             Fair
+0.4   −0.4               Moderate                      Strong                             Fair
+0.3   −0.3               Weak                          Moderate                           Fair
+0.2   −0.2               Weak                          Weak                               Poor
+0.1   −0.1               Weak                          Negligible                         Poor
0      0                  Zero                          None                               None
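The table can be turned into a small lookup for convenience. The sketch below is only one possible reading of it: the function name `interpret_r` is hypothetical, and because the table lists coefficients in steps of 0.1, the code assumes that |r| is rounded to the nearest tenth and that "Perfect" is reserved for a coefficient of exactly 1 or -1. Neither assumption comes from Akoglu (2018).

```python
# Labels transcribed from the table above, indexed by |r| * 10 (0 through 10)
LABELS_BY_TENTH = {
    "Dancey & Reidy (Psychology)": [
        "Zero", "Weak", "Weak", "Weak", "Moderate", "Moderate",
        "Moderate", "Strong", "Strong", "Strong", "Perfect",
    ],
    "Quinnipiac University (Politics)": [
        "None", "Negligible", "Weak", "Moderate", "Strong", "Strong",
        "Strong", "Very Strong", "Very Strong", "Very Strong", "Perfect",
    ],
    "Chan YH (Medicine)": [
        "None", "Poor", "Poor", "Fair", "Fair", "Fair",
        "Moderate", "Moderate", "Very Strong", "Very Strong", "Perfect",
    ],
}


def interpret_r(r):
    """Return each source's label for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    magnitude = abs(r)
    # Assumption: round |r| to the nearest tenth to find the table row
    index = int(round(magnitude * 10))
    # Assumption: only a coefficient of exactly +1 or -1 counts as "Perfect"
    if index == 10 and magnitude < 1.0:
        index = 9
    return {source: labels[index] for source, labels in LABELS_BY_TENTH.items()}


# The coefficient from the example data
result = interpret_r(0.107077)
for source, label in result.items():
    print(f"{source}: {label}")
```

For the example data, all three sources place the coefficient in their lowest non-zero category, which is consistent with the weak relationship seen in the covariance.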

There are other measures of correlation, such as Spearman's rank correlation, Kendall's tau, biserial, and point-biserial correlations. Each correlation measure makes different assumptions about the data and tests a different null hypothesis. An in-depth look at these measures is out of scope for this page.
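Without going into depth, it may help to see that one of these alternatives is a small step from what has already been covered: Spearman's rank correlation is the Pearson correlation computed on the ranks of the data rather than on the raw values. The sketch below shows this for the example data (values entered by hand) and compares it with the `method="spearman"` option of `DataFrame.corr`. It is an illustration only and does not cover the assumptions or hypothesis tests behind the measure.

```python
import pandas as pd

# Values copied by hand from the DataFrame shown earlier
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

cols = ("Commercials Watched", "Product Purchases")

# Replace each value with its rank within its column (1 = smallest)
ranks = df.rank()

# Pearson correlation of the ranks
spearman_from_ranks = ranks.corr().loc[cols]

# Pandas computes the same quantity directly
spearman_pandas = df.corr(method="spearman").loc[cols]

print(round(spearman_from_ranks, 6), round(spearman_pandas, 6))
```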

References

Akoglu, H. (2018). User's guide to correlation coefficients. Turk J Emerg Med, 18(3), 91-93. doi: 10.1016/j.tjem.2018.08.001
Rosner, B. (2015). Fundamentals of Biostatistics (8th ed.). Boston, MA: Cengage Learning.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.