Installing and Loading the Reticulate Package

This package allows us to run Python 3 inside R Markdown chunks. You will need Python 3 installed; there are details on how to do that in the Resources section of the blog. I also needed to change reticulate’s settings to point it at Python 3 on my Mac. You can find those details here.

Of course, you do not need to run Python from RStudio or within the R framework. It will be much easier for you to work through this exercise in IPython or a Jupyter Notebook via Anaconda.

pacman::p_load(reticulate) # installs (if missing) and loads the reticulate package
use_python('/usr/local/Cellar/python/3.6.5/bin/python3') # points reticulate at Python 3  

We import the relevant Python packages for the ANOVA exercise. If you have installed Python 3 via Anaconda, all of these packages will already be installed. If you installed Python 3 using another method, e.g. Homebrew on macOS, you can install them with pip3, e.g. pip3 install pandas numpy matplotlib seaborn scipy statsmodels.

import pandas as pd # This is a common coding format in Python. The package is imported using a shorter name format so that function calls are faster to type. 
import numpy as np
import matplotlib as mpl
mpl.use('TkAgg') # I added this line as a fix for a Matplotlib backend error that prevented it from being imported in this environment
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy import stats
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
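As a quick sanity check (my own addition, not part of the original exercise), you can print each package's version to confirm the imports above resolved against the installations you expect:

for name, module in [("pandas", pd), ("numpy", np), ("matplotlib", mpl),
                     ("seaborn", sns), ("statsmodels", statsmodels)]:
    print(name, module.__version__) # confirms which version of each package was imported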

Data Acquisition and Transformation

This data comes from the University of Sheffield in the UK. It is listed as an “R Dataset” but it is a .csv text file, which Python pandas can import with no problems at all. When I began these notes I didn’t realize that the University also has an ANOVA lesson based on this data. You can access the very useful R-based notes here.

diet = "https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv" # reads in data
data = pd.read_csv(diet) # gives the data the name 'data'
data['Loss'] = data['pre.weight'] - data['weight6weeks'] #creates a target variable of weight loss. 
print(data.head()) # reviews the modified pandas data frame.
##    Person gender  Age  Height  pre.weight  Diet  weight6weeks  Loss
## 0      25          41     171          60     2          60.0   0.0
## 1      26          32     174         103     2         103.0   0.0
## 2       1      0   22     159          58     1          54.2   3.8
## 3       2      0   46     192          60     1          54.0   6.0
## 4       3      0   55     170          64     1          63.3   0.7
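Note that the gender values for the first two rows appear blank in the output above. Gender plays no part in the ANOVA below, but if you want to see how pandas actually read those cells (as proper missing values or as literal strings), a quick check like this, which is my addition rather than part of the original lesson, will tell you:

print(data['gender'].unique()) # shows exactly what values pandas read into the gender column
print(data.isnull().sum()) # counts the entries parsed as missing values, per column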

Data Description

“The data set Diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight so the independent variable (group) is diet.” - The University of Sheffield’s Mathematics and Statistics Help Department (MASH)

We check the data types and find that ‘Diet’ - our independent variable - is being treated as an integer rather than a categorical variable. In R we would convert it to a factor; in Python we convert it to the equivalent: a categorical variable.

print(data.dtypes) # list variable types 
## Person            int64
## gender           object
## Age               int64
## Height            int64
## pre.weight        int64
## Diet              int64
## weight6weeks    float64
## Loss            float64
## dtype: object

This chunk of code contains the pandas conversion:

data['Diet'] = pd.Categorical(data['Diet']) # Change Diet to a categorical variable

We check that the conversion was successful and see that Diet is now listed as ‘category’:

print(data.dtypes) # list variable types 
## Person             int64
## gender            object
## Age                int64
## Height             int64
## pre.weight         int64
## Diet            category
## weight6weeks     float64
## Loss             float64
## dtype: object
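As an aside, pandas also offers an equivalent idiom via the astype method; either form should leave Diet with the same ‘category’ dtype:

data['Diet'] = data['Diet'].astype('category') # equivalent alternative to pd.Categorical()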

Question for investigation: Are the three diets equally effective for weight loss?

ANOVA cannot prove causation. It can tell us whether the mean weight losses of the diet groups differ at a statistically significant level. In this case, our null hypothesis is that all of the group mean weight losses are the same.

We start by visualizing the impact of each diet on the weight loss variable with a box plot, using the Seaborn data visualization package. Figure 1 suggests that Diet 3 is the most effective: it has the highest median weight loss. The spread of losses looks similar across the three diets.

Analysis of Variance, or ANOVA, tells us whether the variation between the diet groups (the differences among the group means) is large relative to the variation within each group, and therefore whether the group means differ at a statistically significant level.

One of the core assumptions of ANOVA is that the dependent variable is normally distributed.

sns.set(font_scale=1.6)  #sets font size for chart scales
ax = sns.boxplot(x = "Diet", y = "Loss", data = data, # calls the boxplot
                 palette="Set1", # sets color scheme 
                 linewidth = 2) # sets chart line width
ax.set_title("Figure 1. Boxplot of Weight Loss by Diet Type\n", fontsize=16) # title 
ax.set_ylabel("Weight\nLoss", rotation = 0, fontsize=16, labelpad=50) # y-axis label
ax.set_xlabel("Diet", rotation = 0, fontsize=16) # x-axis label
plt.show()

plt.gcf().clear() # flushes the Matplotlib figure. You do not need this line if you are running pure Python rather than going through R/reticulate; without it, the box plot elements merged into the histogram below.
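To put numbers behind the reading of Figure 1, here is a small addition of my own that summarizes weight loss by diet group, using the DataFrame created earlier:

print(data.groupby('Diet')['Loss'].agg(['mean', 'median', 'std', 'count'])) # numeric companion to Figure 1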

Create a histogram of the weight loss dependent variable:

plt.hist(data['Loss'], bins='auto') # check for variable normality in weight loss dependent variable. 
#Data looks normally distributed.
plt.title("Figure 2. Distribution of Weight Loss Variable, lbs\n")
plt.xlabel("Weight Loss - lbs")
plt.ylabel("Count\nof Dieters", rotation = 0, labelpad = 40)
plt.show()

We now test our intuition from the visualization that the weight loss data is normally distributed. The p-value is higher than our alpha of 0.05, so we cannot reject the hypothesis of normality, which is consistent with what the histogram showed.

print(stats.normaltest(data['Loss'], axis=0))
# statistical normality test to check the visual test. 
# p > 0.05, therefore we cannot reject the hypothesis that the sample comes from a population with a normal distribution
## NormaltestResult(statistic=0.8256032683724894, pvalue=0.6617935470237468)
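As an optional cross-check (my addition), scipy also provides the Shapiro-Wilk normality test; the interpretation is the same, in that a p-value above 0.05 means we cannot reject normality:

print(stats.shapiro(data['Loss'])) # Shapiro-Wilk test statistic and p-value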

Now we create an Ordinary Least Squares (OLS) model as a precursor to the ANOVA. I won’t print the results as I’ll be writing about interpreting OLS results in a future note.

model = ols("Loss ~ Diet", data).fit()  # builds the OLS model, predicting weight loss with diet type.

Before running the ANOVA, we check that the OLS residuals are normally distributed. This is another important assumption. The residuals look approximately normal.

resids = model.resid # grabs the residual values from the fitted OLS model
hist2 = plt.hist(resids, bins = 'auto', color='dodgerblue') # histogram to check the residuals for normality
# The residuals look normally distributed.
plt.title("Figure 3. Distribution of OLS Model Residuals")
plt.xlabel("Residual Values")
plt.ylabel("Count", rotation = 0, labelpad = 40)
plt.show()
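A Q-Q plot is another standard way to inspect residual normality, and statsmodels has a helper for it. This is an optional extra rather than part of the original walkthrough; it reuses the resids object from the previous chunk:

sm.qqplot(resids, line='s') # plots residual quantiles against a fitted normal reference line
plt.title("Q-Q Plot of OLS Model Residuals")
plt.show()
plt.gcf().clear() # flush the figure again if running through reticulate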

We now call the ANOVA function, applying it to the linear model. We are looking for a p-value lower than 0.05, which would show a statistically significant difference between mean weight losses. It will not tell us where those differences exist, just whether they exist. The p-value of 0.003229 shows that they do.

table = sm.stats.anova_lm(model, typ=2) # Type 2 ANOVA DataFrame
print(table) 
# In a one-way ANOVA the null hypothesis is that the means of the Diet types are equal. We can reject that hypothesis
# because P < 0.05 
##               sum_sq    df         F    PR(>F)
## Diet       71.093689   2.0  6.197447  0.003229
## Residual  430.179259  75.0       NaN       NaN
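To connect this table back to the idea of between-group versus within-group variation described earlier, here is a sketch (my own addition) that rebuilds the F statistic by hand from the data; it should reproduce the value of roughly 6.2 reported above:

grand_mean = data['Loss'].mean() # overall mean weight loss
groups = data.groupby('Diet')['Loss']
ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum() # variation of the group means around the grand mean
ss_within = groups.apply(lambda g: ((g - g.mean()) ** 2).sum()).sum() # variation of individuals around their group means
df_between = data['Diet'].nunique() - 1 # 3 diets - 1 = 2
df_within = len(data) - data['Diet'].nunique() # 78 - 3 = 75
F = (ss_between / df_between) / (ss_within / df_within)
print(F) # should agree with the F value of about 6.2 in the ANOVA table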

But which diet is the best one for weight loss?

This question is outside the scope of the ANOVA itself, though our analysis is an important precursor to answering it. We can check with a post hoc Tukey HSD test.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(data['Loss'], data['Diet'])
tkresult = mc.tukeyhsd()
 
print(tkresult)
## Multiple Comparison of Means - Tukey HSD,FWER=0.05
## ============================================
## group1 group2 meandiff  lower  upper  reject
## --------------------------------------------
##   1      2    -0.2741  -1.8806 1.3325 False 
##   1      3     1.8481   0.2416 3.4547  True 
##   2      3     2.1222   0.5636 3.6808  True 
## --------------------------------------------

It looks like the Diet 3 group lost the most weight. The second row of results shows a mean difference of 1.85 lbs between Diets 1 and 3, and the third row shows a mean difference of 2.12 lbs between Diets 2 and 3; both comparisons are statistically significant (reject = True). The first row shows that the Diet 2 group lost, on average, 0.27 lbs less than the Diet 1 group, but that difference is not significant (reject = False), so we cannot separate Diets 1 and 2. Again, let me stress that this is not a definitive causal analysis. All we have established is that Diet 3’s average weight loss differs from that of the other two diets, and by what average margin. There may well be other confounding factors at play causing the weight loss.
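As a footnote, the pairwise_tukeyhsd function imported above gives a one-call shortcut to the same comparison. I used the MultiComparison route here, but this should produce an equivalent table:

print(pairwise_tukeyhsd(data['Loss'], data['Diet'], alpha=0.05)) # one-step equivalent of the MultiComparison approach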

References