Chapter 5 Model Assessment and Selection
In the previous chapters, you fit various models to explain or predict an outcome variable of interest. But how do you know which model to choose? Model assessment measures allow you to quantify how well an explanatory model “fits” a set of data or how accurate a predictive model is. Based on these measures, you’ll learn about criteria for determining which models are “best.”
Model selection and assessment VIDEO
5.1 Refresher: sum of squared residuals
Let’s review how to compute the sum of squared residuals. You’ll do this for two models.

- Use the appropriate function to get a dataframe with the residuals for `model_price_2`. Add a new column of squared residuals called `sq_residuals`. Then summarize `sq_residuals` with their sum. Call this sum `sum_sq_residuals`.
# Model 2
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual^2) %>%
summarize(sum_sq_residuals = sum(sq_residuals))
# A tibble: 1 × 1
sum_sq_residuals
<dbl>
1 604.
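As a quick sanity check, you can compute the same quantity without the tidy pipeline by extracting the residuals directly from the fitted `lm` object with base R. A minimal sketch, assuming `model_price_2` has been fit as above:

# Base-R cross-check: sum the squared residuals stored in the lm object;
# this should match the pipeline's result up to rounding
sum(residuals(model_price_2)^2)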
- Compute the sum of squared residuals for `model_price_4`, which uses the categorical variable `waterfront` instead of the numerical variable `bedrooms`.
# Model 4
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_4) %>%
mutate(sq_residuals = residual^2) %>%
summarize(sum_sq_residuals = sum(sq_residuals))
# A tibble: 1 × 1
sum_sq_residuals
<dbl>
1 599.
Let’s use these two values of our model assessment measure to choose between these two models, or in other words, perform model selection!
5.2 Which model to select?
Based on these two values of the sum of squared residuals, which of these two models do you think is “better,” and hence which would you select: `model_price_2`, which uses `log10_size` and `bedrooms`, or `model_price_4`, which uses `log10_size` and `waterfront`?

- Since `model_price_2`’s value was 604, select this one.
- **Since `model_price_4`’s value was 599, select this one.**
- No information about which is better is provided whatsoever.
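To see both values side by side, here is a minimal sketch that loops over the two fitted models, assuming both are in your workspace:

# Compute the sum of squared residuals for each model in one pass
library(dplyr)
library(moderndive)
library(purrr)

list(model_price_2 = model_price_2, model_price_4 = model_price_4) %>%
  map_dbl(~ sum(get_regression_points(.x)$residual^2))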
Assessing model fit with R-squared VIDEO
5.3 Computing the R-squared of a model
Let’s compute the \(R^2\) summary value for the two numerical explanatory/predictor variable model you fit in the previous chapter, price as a function of size and the number of bedrooms.
Recall that \(R^2\) can be calculated as: \[1 - \frac{\text{Var(residuals)}}{\text{Var(y)}}\]
- Compute \(R^2\) by summarizing the `residual` and `log10_price` columns.
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms, data = house_prices)

# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_2) %>%
summarize(r_squared = 1 - var(residual) / var(log10_price))
# A tibble: 1 × 1
r_squared
<dbl>
1 0.466
You observed an R-squared value of 0.466, which means that 46.6% of the total variability of the outcome variable log base 10 price can be explained by this model.
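As a cross-check, base R’s `summary()` reports the same quantity as the “Multiple R-squared” of the fitted model; the two should agree up to the rounding that `get_regression_points()` applies to its output:

# Cross-check against base R's multiple R-squared
summary(model_price_2)$r.squared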
5.4 Comparing the R-squared of two models
Let’s now compute \(R^2\) for the one numerical and one categorical explanatory/predictor variable model you fit in the previous chapter, price as a function of size and whether the house had a view of the waterfront, and compare its \(R^2\) with the one you just computed.
- Compute \(R^2\) for `model_price_4`.
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)

# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_4) %>%
summarize(r_squared = 1 - var(residual)/var(log10_price))
# A tibble: 1 × 1
r_squared
<dbl>
1 0.470
- Since `model_price_2` had a lower \(R^2\) of 0.466, it “fit” the data better.
- **Since `model_price_4` had a higher \(R^2\) of 0.470, it “fit” the data better.**
- \(R^2\) doesn’t tell us anything about quality of model “fit.”

Note: Since using `waterfront` explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using `waterfront` in our model is preferred.
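Since you’ll be computing \(R^2\) this way repeatedly, you could wrap the pipeline in a small helper. Note that `get_r_squared()` below is our own convenience function, not part of moderndive:

# Hypothetical helper: R^2 for any model whose outcome is log10_price
get_r_squared <- function(model) {
  get_regression_points(model) %>%
    summarize(r_squared = 1 - var(residual) / var(log10_price)) %>%
    pull(r_squared)
}

get_r_squared(model_price_2)  # ~0.466
get_r_squared(model_price_4)  # ~0.470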
Assessing predictions with RMSE VIDEO
5.5 Computing the MSE & RMSE of a model
Just as you did earlier with \(R^2\), which is a measure of model fit, let’s now compute the root mean squared error (RMSE) of our models, which is a commonly used measure of predictive error. Let’s use the model of price as a function of size and number of bedrooms.
The model is available in your workspace as `model_price_2`.
- Let’s start by computing the mean squared error (`mse`), which is the `mean` of the squared `residual` values.
# Get all residuals, square them, and take the mean
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals))
# A tibble: 1 × 1
mse
<dbl>
1 0.0279
- Now that you’ve computed the mean squared error, let’s compute the root mean squared error.
# Get all residuals, square them, take the mean and square root
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals)) %>%
mutate(rmse = sqrt(mse))
# A tibble: 1 × 2
mse rmse
<dbl> <dbl>
1 0.0279 0.167
The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
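One caveat when interpreting this value: the outcome variable is on a log10 scale, so an RMSE of 0.167 translates to a multiplicative error on the original dollar scale. A quick back-of-the-envelope calculation:

# A typical error of 0.167 in log10_price means predictions are off by
# a factor of about 10^0.167, i.e. roughly 1.5x in dollar terms
10^0.167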
5.6 Comparing the RMSE of two models
As you did using the sum of squared residuals and \(R^2\), let’s once again assess and compare the quality of your two models using the root mean squared error (RMSE). Note that RMSE is more typically used in prediction settings than explanatory settings.
`model_price_2` and `model_price_4` are available in your workspace.
- Based on the code provided that computes MSE and RMSE for `model_price_2`, compute the MSE and RMSE for `model_price_4`.
# MSE and RMSE for model_price_2
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 2
mse rmse
<dbl> <dbl>
1 0.0279 0.167
# MSE and RMSE for model_price_4
get_regression_points(model_price_4) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 2
mse rmse
<dbl> <dbl>
1 0.0277 0.166
Highlight the correct answer:

- Since `model_price_2` had a higher RMSE of 0.167, this is suggestive that this model has better predictive power.
- RMSE doesn’t tell us anything about predictive power.
- **Since `model_price_4` had a lower RMSE of 0.166, this is suggestive that this model has better predictive power.**

Note: RMSE can be thought of as the “typical” error a predictive model will make.
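As with \(R^2\), you could wrap the RMSE pipeline in a small helper to avoid repetition. Note that `get_rmse()` below is our own convenience function, not part of moderndive, and it assumes `get_regression_points()` falls back to the model’s original data when `newdata` is `NULL`:

# Hypothetical helper: RMSE of a fitted model's predictions
get_rmse <- function(model, data = NULL) {
  get_regression_points(model, newdata = data) %>%
    mutate(sq_residuals = residual^2) %>%
    summarize(rmse = sqrt(mean(sq_residuals))) %>%
    pull(rmse)
}

get_rmse(model_price_2)  # 0.167
get_rmse(model_price_4)  # 0.166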
Validation set prediction framework VIDEO
5.7 Fitting model to training data
It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of the model. Before making this split however, we first sample 100% of the rows of house_prices
without replacement and assign this to house_prices_shuffled
. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.
- Use `slice()` to set `train` to the first 10,000 rows of `house_prices_shuffled` and `test` to the remainder of the 21,613 rows.
# Set random number generator seed value for reproducibility
set.seed(76)

# Randomly reorder the rows
house_prices_shuffled <- house_prices %>%
  sample_frac(size = 1, replace = FALSE)

# Train/test split
train <- house_prices_shuffled %>%
  slice(1:10000)
test <- house_prices_shuffled %>%
  slice(10001:21613)
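Before fitting anything, it’s worth a quick sanity check that the two pieces partition the shuffled data, with no rows lost or duplicated:

# Sanity check: train and test together should account for every row
nrow(train) + nrow(test) == nrow(house_prices)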
- Now fit a linear regression to predict `log10_price` using `log10_size` and `bedrooms`, using just the training data.
# Fit model to training set
train_model_2 <- lm(log10_price ~ log10_size + bedrooms, data = train)
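If you’d like to peek at the fitted coefficients before making predictions, moderndive’s `get_regression_table()` returns them as a tidy table:

# Tidy table of the training model's coefficients
get_regression_table(train_model_2)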
Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!
5.8 Predicting on test data
Now that you’ve trained the model on the `train` set, let’s apply the model to the `test` data, make predictions, and evaluate the predictions. Recall that having a separate `test` set here simulates the gathering of a “new” independent dataset to test our model’s predictive performance on.

The datasets `train` and `test`, and the trained model `train_model_2`, are available in your workspace.
- Use the `get_regression_points()` function to apply `train_model_2` on your new dataset: `test`.
# Make predictions on test set
get_regression_points(train_model_2, newdata = test)
# A tibble: 11,612 × 6
ID log10_price log10_size bedrooms log10_price_hat residual
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 5.92 3.42 5 5.75 0.163
2 2 5.45 3.25 3 5.65 -0.203
3 3 5.69 3.38 3 5.77 -0.089
4 4 5.61 2.94 2 5.39 0.225
5 5 5.81 3.52 4 5.87 -0.061
6 6 5.39 3.25 3 5.65 -0.267
7 7 5.61 3.06 2 5.50 0.102
8 8 5.38 3.12 2 5.56 -0.181
9 9 5.77 3.53 4 5.88 -0.109
10 10 5.85 3.53 3 5.91 -0.058
# … with 11,602 more rows
- Compute the root mean square error using this output.
# Compute RMSE
get_regression_points(train_model_2, newdata = test) %>%
mutate(sq_residuals = residual^2) %>%
summarize(rmse = sqrt(mean(sq_residuals)))
# A tibble: 1 × 1
rmse
<dbl>
1 0.167
Your RMSE on the test set using `log10_size` and `bedrooms` as predictor variables is 0.167, which closely matches the RMSE of 0.167 you computed earlier on the full dataset. This suggests the model’s predictive performance generalizes well to new data!
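A useful follow-up check is to compare this test-set RMSE with the RMSE on the training set itself; a training RMSE far below the test RMSE would be a warning sign of overfitting. A minimal sketch:

# RMSE on the training set, for comparison with the test-set RMSE above
get_regression_points(train_model_2, newdata = train) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))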
Conclusion: Where to go from here? VIDEO