1 Chapter 11: Why model?

This is the code for Chapter 11. Throughout this chapter and the rest of the book, we’ll use the tidyverse metapackage to load the core tidyverse packages, as well as the broom package. In this chapter, the main packages we’ll be using are ggplot2 for data visualization, dplyr for data manipulation, and broom for tidying regression model results.

1.1 Program 11.1

To replicate Program 11.1, we first need to create the data set, binary_a_df.

library(tidyverse)
library(broom)

binary_a_df <- data.frame(
  a = c(rep(1, 8), rep(0, 8)),
  y = c(200, 150, 220, 110, 50, 180, 90, 170, 
        170, 30, 70, 110, 80, 50, 10, 20)
)

ggplot(binary_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00", shape = 21) +
  scale_x_continuous(breaks = c(0, 1), expand = expansion(mult = .5)) + 
  theme_minimal(base_size = 20)
FIGURE 11.1

To summarize the data, we’ll use dplyr to group the dataset by a, then get the sample size, mean, standard deviation, and range. knitr::kable() is used to print the results nicely to a table. (For more information on the %>% pipe operator, see R for Data Science.)
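
As a quick aside, the pipe passes the result of its left-hand side as the first argument of the function on its right-hand side, so these two expressions are equivalent:

mean(binary_a_df$y)
## [1] 106.875
binary_a_df$y %>% mean()
## [1] 106.875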

binary_a_df %>% 
  group_by(a) %>% 
  summarize(
    n = n(), 
    mean = mean(y), 
    sd = sd(y), 
    minimum = min(y), 
    maximum = max(y)
  ) %>% 
  knitr::kable(digits = 2)
  a   n    mean      sd   minimum   maximum
  0   8   67.50   53.12        10       170
  1   8  146.25   58.29        50       220

Similarly, we’ll plot and summarize categorical_a_df.

categorical_a_df <- data.frame(
  a = rep(1:4, each = 4),
  y = c(110, 80, 50, 40, 170, 30, 70, 50, 
        110, 50, 180, 130, 200, 150, 220, 210)
)

ggplot(categorical_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00",  shape = 21) +
  scale_x_continuous(breaks = 1:4, expand = expansion(mult = .25)) + 
  theme_minimal(base_size = 20)
FIGURE 11.2

categorical_a_df %>% 
  group_by(a) %>% 
  summarize(
    n = n(), 
    mean = mean(y), 
    sd = sd(y), 
    minimum = min(y), 
    maximum = max(y)
  ) %>% 
  knitr::kable(digits = 2)
  a   n    mean      sd   minimum   maximum
  1   4    70.0   31.62        40       110
  2   4    80.0   62.18        30       170
  3   4   117.5   53.77        50       180
  4   4   195.0   31.09       150       220
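
These group means can also be recovered with a regression model: regressing y on factor(a) treats each level of a as its own category, so the intercept is the mean when a = 1 and each remaining coefficient is that group’s difference from it. As a quick check (not part of the original program):

# a saturated model recovers the group means from the table above
lm(y ~ factor(a), data = categorical_a_df) %>% 
  tidy() %>% 
  select(term, estimate) %>% 
  knitr::kable(digits = 2)
term         estimate
(Intercept)      70.0
factor(a)2       10.0
factor(a)3       47.5
factor(a)4      125.0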

1.2 Program 11.2

Program 11.2 uses continuous exposure and outcome data, continuous_a_df. Plotting the points is similar to above.

continuous_a_df <- data.frame(
  a = c(3, 11, 17, 23, 29, 37, 41, 53, 
        67, 79, 83, 97, 60, 71, 15, 45),
  y = c(21, 54, 33, 101, 85, 65, 157, 120, 
        111, 200, 140, 220, 230, 217, 11, 190)
)

ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00",  shape = 21) +
  theme_minimal(base_size = 20)
FIGURE 11.3

We’ll also add a regression line using geom_smooth() with method = "lm".

ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "grey85", shape = 21) +
  geom_smooth(method = "lm", se = FALSE, col = "#E69F00", size = 1.2) +
  theme_minimal(base_size = 20) 
FIGURE 11.4

To fit an OLS regression model of y on a, we’ll use lm() and then tidy the results with tidy() from the broom package.

linear_regression <- lm(y ~ a, data = continuous_a_df)

linear_regression %>% 
  # get the confidence intervals using `conf.int = TRUE`
  tidy(conf.int = TRUE) %>% 
  # drop the test statistic and P-value
  select(-statistic, -p.value) %>% 
  knitr::kable(digits = 2)
term          estimate   std.error   conf.low   conf.high
(Intercept)      24.55       21.33     -21.20       70.29
a                 2.14        0.40       1.28        2.99

To predict the value of y when a = 90, we pass the linear regression model object linear_regression to the predict() function, along with a new data frame containing a single row where a = 90.

linear_regression %>% 
  predict(newdata = data.frame(a = 90))
##      1 
## 216.89
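
Under the hood, this is just the fitted line evaluated at a = 90. As a quick check (not part of the original program), computing it directly from the coefficients returns the same 216.89:

# by-hand prediction: intercept + slope * 90
sum(coef(linear_regression) * c(1, 90))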

Similarly, using binary_a_df, which has a treatment a that is binary:

lm(y ~ a, data = binary_a_df) %>% 
  tidy(conf.int = TRUE) %>%   
  select(-statistic, -p.value) %>% 
  knitr::kable(digits = 2)
term          estimate   std.error   conf.low   conf.high
(Intercept)      67.50       19.72      25.21      109.79
a                78.75       27.88      18.95      138.55
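
These estimates line up with the summary table from Program 11.1: the intercept (67.50) is the mean of y in the a = 0 group, and the coefficient on a (78.75) is the difference in means between the two groups. We can verify that directly (a quick check, not part of the original program):

# the coefficient on a equals the difference in group means
with(binary_a_df, mean(y[a == 1]) - mean(y[a == 0]))
## [1] 78.75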

1.3 Program 11.3

Fitting and tidying a model with a quadratic term for a is similar, but we use I() to include a^2. (Wrapping the term in I() is necessary because ^ has a special meaning inside R formulas; I() tells R to treat it as ordinary arithmetic instead.)

smoothed_regression <- lm(y ~ a + I(a^2), data = continuous_a_df) 

smoothed_regression %>% 
  tidy(conf.int = TRUE) %>% 
  select(-statistic, -p.value) %>% 
  #  remove `I()` from the term name
  mutate(term = ifelse(term == "I(a^2)", "a^2", term)) %>% 
  knitr::kable(digits = 2) 
term          estimate   std.error   conf.low   conf.high
(Intercept)      -7.41       31.75     -75.99       61.18
a                 4.11        1.53       0.80        7.41
a^2              -0.02        0.02      -0.05        0.01

To plot the quadratic function, we give geom_smooth() a formula argument: formula = y ~ x + I(x^2).

ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "grey85", shape = 21) +
  geom_smooth(method = "lm", se = FALSE, col = "#E69F00", formula = y ~ x + I(x^2), size = 1.2) + 
  theme_minimal(base_size = 20) 
FIGURE 11.5

But predicting is done the same way.

smoothed_regression %>% 
  predict(newdata = data.frame(a = 90))
##        1 
## 197.1269
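
As before, this is the fitted curve evaluated at a = 90. Computing it by hand from the coefficients (a quick check, not part of the original program) gives the same 197.1269:

# by-hand prediction: intercept + linear and quadratic terms at a = 90
sum(coef(smoothed_regression) * c(1, 90, 90^2))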