Installing and Loading the Reticulate Package
This package allows us to run Python 3 inside R Markdown chunks. You will need Python 3 installed. There are details on how to do that in the Resources section of the blog. I also needed to change Reticulate’s settings to point it at Python3 on my Mac. You can find those details here.
Of course, you do not need to run Python from Rstudio, or within the R framework. It will be much easier for you to work through this exercise in IPython or a Jupyter Notebook via Anaconda.
pacman::p_load(reticulate) use_python('/usr/local/Cellar/python/3.6.5/bin/python3') # points reticulate at Python 3
We import the relevant Python packages for the ANOVA exercise. If you have installed Python 3 via Anaconda all of these packages will be installed already. If you installed Python3 using another method, e.g. Homebrew on MacOS, then you can run the
pip3 install -package_name- command to install them.
import pandas as pd # This is a common coding format in Python. The package is imported using a shorter name format so that function calls are faster to type. import numpy as np import matplotlib as mpl mpl.use('TkAgg') # I added this line as a fix to a Matplotlib error - it couldn't be imported to the python environment import matplotlib.pyplot as plt import seaborn as sns from scipy import stats import statsmodels import statsmodels.api as sm from statsmodels.formula.api import ols
Data Acquisition and Transformation
This data comes from the University of Sheffield in the UK. It is listed as an “R Dataset” but it is a .csv text file, which Python pandas can import with no problems at all. When I began on these notes I didn’t realize that the University also has an ANOVA lesson based on this data. You can access the very useful R-based notes here.
diet = "https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv" # reads in data data = pd.read_csv(diet) # gives the data the name 'data' data['Loss'] = data['pre.weight'] - data['weight6weeks'] #creates a target variable of weight loss. print(data.head()) # reviews the modified pandas data frame.
## Person gender Age Height pre.weight Diet weight6weeks Loss ## 0 25 41 171 60 2 60.0 0.0 ## 1 26 32 174 103 2 103.0 0.0 ## 2 1 0 22 159 58 1 54.2 3.8 ## 3 2 0 46 192 60 1 54.0 6.0 ## 4 3 0 55 170 64 1 63.3 0.7
“The data set Diet.csv contains information on 78 people who undertook one of three diets. There is background information such as age, gender (Female=0, Male=1) and height. The aim of the study was to see which diet was best for losing weight so the independent variable (group) is diet.” - The University of Sheffield’s Mathematics and Statistics Help Department (MASH)
We check the data types and find that ‘Diet’ - our independent variable - is being treated as an integer rather than a categorical variable. In R we would convert it to a factor; in Python we convert it to the equivalent: a categorical variable.
print(data.dtypes) # list variable types
## Person int64 ## gender object ## Age int64 ## Height int64 ## pre.weight int64 ## Diet int64 ## weight6weeks float64 ## Loss float64 ## dtype: object
This chunk of code contains the pandas conversion:
data['Diet'] = pd.Categorical(data['Diet']) # Change Diet to a categorical variable
We check that the conversion was successful and see that Diet is now listed as ‘category’:
print(data.dtypes) # list variable types
## Person int64 ## gender object ## Age int64 ## Height int64 ## pre.weight int64 ## Diet category ## weight6weeks float64 ## Loss float64 ## dtype: object
Question for investigation: Are the three diets equally effective for weight loss?
ANOVA cannot prove causation. It can tell us whether the mean weight loss of each diet group are different at a statistically significant level. In this case, our null hypothesis is that all group mean weight losses are the same.
We start by visualizing the impact of each diet on the weight loss variable with a box plot. We use the Seaborn data visualization package. Figure 1 tells us that Diet 3 seems to be the most effective; it has the highest mean weight loss. The standard deviations of each diet look similar.
Analysis of Variance or ANOVA tells us whether the differences in the variance between each diet group are statistically significant when compared to the differences within each group.
One of the core assumptions with ANOVA is that the dependent variable is normally distributed.
sns.set(font_scale=1.6) #sets font size for chart scales ax = sns.boxplot(x = "Diet", y = "Loss", data = data, # calls the boxplot palette="Set1", # sets color scheme linewidth = 2) # sets chart line width ax.set_title("Figure 1. Boxplot of Weight Loss by Diet Type\n", fontsize=16) # title ax.set_ylabel("Weight\nLoss", rotation = 0, fontsize=16, labelpad=50) # y-axis label ax.set_xlabel("Diet", rotation = 0, fontsize=16) # x-axis label plt.show()
plt.gcf().clear() # you do not need to use this line to flush the mpl cache if you're running this in pure Python (as opposed to through R). If I don't run this line, box chart elements merged with histogram below.
Create a histogram of the weight loss dependent variable:
plt.hist(data['Loss'], bins='auto') # check for variable normality in weight loss dependent variable. #Data looks normally distributed. plt.title("Figure 2. Distribution of Weight Loss Variable, lbs\n") plt.xlabel("Weight Loss - lbs") plt.ylabel("Count\nof Dieters", rotation = 0, labelpad = 40) plt.show()
We now confirm our intuition from the visualization that the weight loss data is normally distributed. The p-score is higher than our alpha of 0.05, showing our intuition to be correct.
print(stats.normaltest(data['Loss'], axis=0)) # statistical normality test to check the visual test. # P > 0.05 therefore We cannot reject the hypothesis that the sample comes from a population which has a normal distribution
## NormaltestResult(statistic=0.8256032683724894, pvalue=0.6617935470237468)
Now we create an Ordinary Least Squares (OLS) model as a precursor to the ANOVA. I won’t print the results as I’ll be writing about interpreting OLS results in a future note.
model = ols("Loss ~ Diet", data).fit() # builds the OLS model, predicting weight loss with diet type.
Before the ANOVA, we check the OLS residuals are normally distributed. This is another important prior assumption. The residuals look approximately normal.
resids = statsmodels.regression.linear_model.RegressionResults.resid(model) # grabs the residual values from the OLS model hist2 = plt.hist(resids, bins = 'auto', color='dodgerblue') # check for variable normality in weight loss dependent variable. #Data looks normally distributed. plt.title("Figure 3. Distribution of OLS Model Residuals") plt.xlabel("Residual Values") plt.ylabel("Count", rotation = 0, labelpad = 40) plt.show()
We now call the ANOVA function, applying it to the linear model. We are looking for a p-score lower than 0.05, which will show a statistically significant difference between mean weight losses. It will not tell us where those differences exist, just whether they exist. The p-score of 0.003229 shows that they do.
table = sm.stats.anova_lm(model, typ=2) # Type 2 ANOVA DataFrame print(table) # In a one-way ANOVA the null hypothesis is that the means of the Diet types are equal. We can reject that hypothesis # because P < 0.05
## sum_sq df F PR(>F) ## Diet 71.093689 2.0 6.197447 0.003229 ## Residual 430.179259 75.0 NaN NaN
But which diet is the best one for weight loss?
This question is out of the scope of the ANOVA, though our analysis is an important precursor to answering this more specific question. We can check with a post hoc Tukey test.
from statsmodels.stats.multicomp import pairwise_tukeyhsd from statsmodels.stats.multicomp import MultiComparison mc = MultiComparison(data['Loss'], data['Diet']) tkresult = mc.tukeyhsd() print(tkresult)
## Multiple Comparison of Means - Tukey HSD,FWER=0.05 ## ============================================ ## group1 group2 meandiff lower upper reject ## -------------------------------------------- ## 1 2 -0.2741 -1.8806 1.3325 False ## 1 3 1.8481 0.2416 3.4547 True ## 2 3 2.1222 0.5636 3.6808 True ## --------------------------------------------
It looks like the Diet 3 group lost the most weight. The second row of results shows a 1.85 mean difference in weight loss between Diets 2 and 3. The third row shows the mean difference for those on Diet 2 after we subtract the second row results from it:
2.12 - 1.85 = 0.27. So the Diet 2 group only lost an average of 0.27 lbs more than the Diet 1 group. Again, let me stress that this is not a definitive causal analysis. All we have established is that the three dieting groups differ in weight loss performance from one another, and by what average margin.There may well be other confounding factors at play causing the weight loss.
- ‘Using One-way ANOVA and Tukey’s test to compare data sets’, Clever Owl, July 1, 2015
- ‘ANOVA’, University of Sheffield, December 2015.
- D.L. Couturier, R. Nicholls, M. Fernandes, ‘ANOVA with R: analysis of the diet dataset’, Bioinformatics Training Materials, May 10, 2018.