options(scipen = 999) #the command that turns off scientific notation in R output
I often set this parameter in any R project in which I’ll be evaluating predictive model performance. But why would you want to turn off scientific notation? First, it’s a personal preference, but I strongly advocate for learning how to read scientific notation as it is a data science standard. Second, your audience may not be mathematically trained. In many cases, my business audience wants to see raw numbers to make some of their own conclusions.
Scientific notation is shorthand for large number strings. For example:
0.00000000002is written in R output in the scientific notation format as 2e-11
Basically, this approach takes the integer “2”, puts a decimal point in front of it (1 of the 11), and counts off the remainder of the e-numbers (10) as decimal places. So, zero-point-ten-zeros-two. The number 4e-16 would be zero-point-fifteen-zeros-four.
If you build a predictive model and then summarise its output, by default R will use scientific notation in the output. For a quick example of what I mean, run this code in R’s console, substituting the number for any large or small number you like.
x <- 0.00000000002 # assigns the decimal to 'x' format(x, scientific = TRUE) # reproduces 'x' in scientific notation
##  "2e-11"
The satisfying weight of zeros
Because I have humanities training and arrived a little later in my career to the data science party, I prefer to view and compare full number strings when I’m examining data mining results. There’s something satisfying to me about feeling the weight of all the zeros in the p-values of a model. In the early stages of learning data science principles, I also prefered to visually compare number strings.
For a demonstration of what I’m talking about let’s build a simple linear model using an accessible dataset and view the output using the broom package’s ‘tidy’ function.
First I use an R package called pacman to load in the libraries I need for the project. As package authors Tyler W. Rinker & Dason Kurkiewicz say in the package vignette documentation:
“The function p_load is particularly well suited for help forums and blog posts, as it will load and, if necessary, install missing packages.”
#install.packages("pacman") -this line is commented because I already have the pacman package installed. pacman::p_load(broom, ggplot2) # loading in broom and ggplot2 from the tidyverse package
Okay, now for the linear model. First, I visualize the two variables with ggplot2. The target variable - stopping distance for cars - is on my y-axis; the explanatory variable - car speed - is on my x-axis.
ggplot(data = cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Faster-moving cars have greater stopping distances", subtitle = "There is a positive linear relationship in the data between car speed and stopping distance.", x = "Speed of car (MPH)", y = "Stopping\ndistance (feet)") + theme_classic() + theme(axis.title.y = element_text(angle = 0, vjust = 1))
Then I build the linear regression using the lm() function in R. Stopping distance is the target variable we want to predict, using car speed at time of breaking as the descriptive variable we want to employ in the prediction.
Enter the code below into your R console. Putting a question mark before a package name or function results in help documentation in R. This in one reason I love the language.
Now, the code for the model:
#linear model car_lm <- lm(dist ~ speed, data = cars)
Let’s take a look at the results of the regression analysis. Becuase we set the ‘scipen’ option to a large number earlier (999), we can see the p-values in their full glory. In a future post, I will cover how to interpret R’s output for the lm() function and the simple linear regression.
## term estimate std.error statistic p.value ## 1 (Intercept) -17.579095 6.7584402 -2.601058 0.012318816153809090 ## 2 speed 3.932409 0.4155128 9.463990 0.000000000001489836