Chapter 5 Model Assessment and Selection

In the previous chapters, you fit various models to explain or predict an outcome variable of interest. But how do you know which model to choose? Model assessment measures allow you to quantify how well an explanatory model “fits” a set of data or how accurate a predictive model is. Based on these measures, you’ll learn criteria for determining which models are “best.”


Model selection and assessment VIDEO


5.1 Refresher: sum of squared residuals

Let’s review how to compute the sum of squared residuals. You’ll do this for two models.

  • Use the appropriate function to get a dataframe with the residuals for model_price_2.

  • Add a new column of squared residuals called sq_residuals.

  • Then summarize sq_residuals with their sum. Call this sum sum_sq_residuals.

# Model 2
model_price_2 <- lm(log10_price ~ log10_size + bedrooms, 
                    data = house_prices)

# Calculate squared residuals
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
# A tibble: 1 × 1
  sum_sq_residuals
             <dbl>
1             604.
  • Compute the sum of squared residuals for model_price_4 which uses the categorical variable waterfront instead of the numerical variable bedrooms.
# Model 4
model_price_4 <- lm(log10_price ~ log10_size + waterfront, 
                    data = house_prices)

# Calculate squared residuals
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
# A tibble: 1 × 1
  sum_sq_residuals
             <dbl>
1             599.
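
As a quick sanity check, you can reproduce both sums with base R’s residuals() extractor, assuming both fitted models are still in your workspace:

# Base R sanity check: same sums of squared residuals
sum(residuals(model_price_2)^2)  # ~604
sum(residuals(model_price_4)^2)  # ~599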

Let’s use these two model assessment values to choose between the two models, or in other words, to perform model selection!


5.2 Which model to select?

Based on these two values of the sum of squared residuals, which of these two models do you think is “better,” and hence which would you select?

  • model_price_2 that uses log10_size and bedrooms?

  • model_price_4 that uses log10_size and waterfront?

    • Since model_price_2’s value was 604, select this one.

    • **Since model_price_4’s value was 599, select this one.**

    • No information about which is better is provided whatsoever.


Assessing model fit with R-squared VIDEO


5.3 Computing the R-squared of a model

Let’s compute the \(R^2\) summary value for the model with two numerical explanatory/predictor variables that you fit in the previous chapter: price as a function of size and the number of bedrooms.

Recall that \(R^2\) can be calculated as: \[1 - \frac{\text{Var(residuals)}}{\text{Var(y)}}\]

  • Compute \(R^2\) by summarizing the residual and log10_price columns.
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms, data = house_prices)
                    
# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_2) %>%
  summarize(r_squared = 1 - var(residual) / var(log10_price))
# A tibble: 1 × 1
  r_squared
      <dbl>
1     0.466

You observed an R-squared value of 0.466, which means that 46.6% of the total variability of the outcome variable log base 10 price can be explained by this model.
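
If you’d like to double-check this value without moderndive, the same (unadjusted) \(R^2\) is also stored in the fitted model object itself:

# Base R cross-check of R^2
summary(model_price_2)$r.squared  # ~0.466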


5.4 Comparing the R-squared of two models

Let’s now compute \(R^2\) for the model with one numerical and one categorical explanatory/predictor variable that you fit in the previous chapter, price as a function of size and whether the house has a waterfront view, and compare its \(R^2\) with the one you just computed.

  • Compute \(R^2\) for model_price_4.
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)

# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_4) %>% 
  summarize(r_squared = 1 - var(residual)/var(log10_price))
# A tibble: 1 × 1
  r_squared
      <dbl>
1     0.470
  • Since model_price_2 had a lower \(R^2\) of 0.466, it “fit” the data better.

  • **Since model_price_4 had a higher \(R^2\) of 0.470, it “fit” the data better.**

  • \(R^2\) doesn’t tell us anything about quality of model “fit.”

Note: Since using waterfront explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using waterfront in our model is preferred.
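
To see both values side by side, one option is to collect them into a single tibble; a minimal sketch, assuming both fitted models are in your workspace:

# Compare the R^2 of both models in one table
tibble(
  model     = c("model_price_2", "model_price_4"),
  r_squared = c(summary(model_price_2)$r.squared,
                summary(model_price_4)$r.squared)
)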


Assessing predictions with RMSE VIDEO


5.5 Computing the MSE & RMSE of a model

Just as you did earlier with \(R^2\), which is a measure of model fit, let’s now compute the root mean squared error (RMSE) of our models, which is a commonly used measure of predictive error. Let’s use the model of price as a function of size and number of bedrooms.
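
In formula form, the RMSE is the square root of the mean squared residual: \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2}\]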

The model is available in your workspace as model_price_2.

  • Let’s start by computing the mean squared error (mse), which is the mean of the squared residuals.
# Get all residuals, square them, and take their mean
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals))
# A tibble: 1 × 1
     mse
   <dbl>
1 0.0279
  • Now that you’ve computed the mean squared error, let’s compute the root mean squared error.
# Get all residuals, square them, take the mean and square root
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals)) %>% 
  mutate(rmse = sqrt(mse))
# A tibble: 1 × 2
     mse  rmse
   <dbl> <dbl>
1 0.0279 0.167

The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
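
Keep in mind that this error is on the log base 10 scale. As a rough back-of-the-envelope translation, an RMSE of 0.167 in log10_price means a typical prediction is off by a factor of about \(10^{0.167} \approx 1.47\) on the original price scale:

# Rough translation of the RMSE back to the price scale
10^0.167  # ~1.47, i.e. predictions are typically off by roughly 47%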


5.6 Comparing the RMSE of two models

As you did using the sum of squared residuals and \(R^2\), let’s once again assess and compare the quality of your two models using the root mean squared error (RMSE). Note that RMSE is more typically used in prediction settings than explanatory settings.

model_price_2 and model_price_4 are available in your workspace.

  • Based on the code provided that computes MSE and RMSE for model_price_2, compute the MSE and RMSE for model_price_4.
# MSE and RMSE for model_price_2
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 2
     mse  rmse
   <dbl> <dbl>
1 0.0279 0.167
# MSE and RMSE for model_price_4
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 2
     mse  rmse
   <dbl> <dbl>
1 0.0277 0.166

Highlight the correct answer:

  • Since model_price_2 had a higher rmse of 0.167, this suggests that this model has better predictive power.

  • rmse doesn’t tell us anything about predictive power.

  • **Since model_price_4 had a lower rmse of 0.166, this suggests that this model has better predictive power.**


RMSE can be thought of as the “typical” error a predictive model will make.


Validation set prediction framework VIDEO


5.7 Fitting model to training data

It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of that model. Before making this split, however, we first sample 100% of the rows of house_prices without replacement and assign the result to house_prices_shuffled. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.

  • Use slice() to set train to the first 10,000 rows of house_prices_shuffled and test to the remainder of the 21,613 rows.
# Set random number generator seed value for reproducibility
set.seed(76)

# Randomly reorder the rows
house_prices_shuffled <- house_prices %>% 
  sample_frac(size = 1, replace = FALSE)

# Train/test split
train <- house_prices_shuffled %>%
  slice(1:10000)
test <- house_prices_shuffled %>%
  slice(10001:21613)
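
A quick check that the split behaved as intended, using the row counts stated above:

# Sanity check: training and test set sizes
nrow(train)  # 10000
nrow(test)   # 11613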
  • Now fit a linear regression to predict log10_price using log10_size and bedrooms, using only the training data.
# Fit model to training set
train_model_2 <- lm(log10_price ~ log10_size + bedrooms, data = train)
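
If you’d like to inspect the fitted coefficients of the trained model, moderndive’s get_regression_table() works on it like any other lm object (output not shown):

# Inspect the trained model's coefficients (output omitted)
get_regression_table(train_model_2)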

Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!


5.8 Predicting on test data

Now that you’ve trained the model on the training set, let’s apply it to the test data, make predictions, and evaluate those predictions. Recall that having a separate test set here simulates gathering a “new” independent dataset on which to test our model’s predictive performance.

The datasets train and test, as well as the trained model train_model_2, are available in your workspace.

  • Use the get_regression_points() function to apply train_model_2 to your new dataset: test.
# Make predictions on test set
get_regression_points(train_model_2, newdata = test)
# A tibble: 11,613 × 6
      ID log10_price log10_size bedrooms log10_price_hat residual
   <int>       <dbl>      <dbl>    <int>           <dbl>    <dbl>
 1     1        5.92       3.42        5            5.75    0.163
 2     2        5.45       3.25        3            5.65   -0.203
 3     3        5.69       3.38        3            5.77   -0.089
 4     4        5.61       2.94        2            5.39    0.225
 5     5        5.81       3.52        4            5.87   -0.061
 6     6        5.39       3.25        3            5.65   -0.267
 7     7        5.61       3.06        2            5.50    0.102
 8     8        5.38       3.12        2            5.56   -0.181
 9     9        5.77       3.53        4            5.88   -0.109
10    10        5.85       3.53        3            5.91   -0.058
# … with 11,603 more rows
  • Compute the root mean squared error using this output.
# Compute RMSE
get_regression_points(train_model_2, newdata = test) %>% 
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 1
   rmse
  <dbl>
1 0.167

Your test-set RMSE using log10_size and bedrooms as predictor variables is 0.167, essentially matching the RMSE you computed on the full dataset earlier. In other words, the model’s typical prediction error holds up on data it has never seen.
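
As a follow-up, you could repeat the same train/test workflow for the waterfront model and compare test-set RMSEs. A minimal sketch, where train_model_4 is a name introduced here for illustration (output not shown):

# Train the waterfront model on the same training set
train_model_4 <- lm(log10_price ~ log10_size + waterfront, data = train)

# Compute its RMSE on the test set
get_regression_points(train_model_4, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))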


Conclusion: Where to go from here? VIDEO