#StackBounty: #regression #prediction #simulation Using test data sets in simulation study

Bounty: 50

I’d like to know the correct way to simulate test data for a simulation study. For simplicity, suppose that I want to test a linear regression model. Chapter 7 of ESL explains that the expected prediction error is the test error averaged over training sets.

$\text{Err}_{\mathcal{T}}$ is the test error for a specific training set $\mathcal{T}$. They have an example in which $\text{Err}_{\mathcal{T}}$ was calculated for 100 simulated training sets; $\sum_{\mathcal{T}=1}^{100}\text{Err}_{\mathcal{T}}/100$ is then an estimate of the expected prediction error $\text{Err} = \mathrm{E}(\text{Err}_{\mathcal{T}})$.
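For reference, paraphrasing the definitions from ESL Chapter 7 (with $L$ the loss function and $\hat{f}$ the model fit on $\mathcal{T}$):

$$\text{Err}_{\mathcal{T}} = \mathrm{E}\big[L(Y, \hat{f}(X)) \mid \mathcal{T}\big], \qquad \text{Err} = \mathrm{E}\big[\text{Err}_{\mathcal{T}}\big],$$

where the outer expectation averages over everything that is random, including the training set $\mathcal{T}$ itself.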

Page 220 of ESL:

[screenshot of ESL p. 220]

My question is: must 100 test sets be simulated for the calculation, or is each $\text{Err}_{\mathcal{T}}$ calculated using the same test set?

Here is some R code to demonstrate:

library(MASS)

n = 50             # training-set size
m = 200            # test-set size
b0 = 0.5           # intercept
b = c(1, 2, 0, 0)  # regression coefficients
rho = 0.5          # pairwise correlation between predictors
sigma = 1          # error standard deviation
iter = 100         # number of simulated training sets
p = length(b)
r = matrix(rho, p, p); diag(r) = 1  # exchangeable correlation matrix

# test set option A: one fixed test set, shared by all training sets
# x.new = mvrnorm(m, rep(0,p), r)
# y.new = b0 + x.new %*% b + rnorm(m, 0, sigma)

err.t = rep(0, iter)
for (i in 1:iter) {

  # training set
  x = mvrnorm(n, rep(0,p), r)
  y = b0 + x %*% b + rnorm(n, 0, sigma)

  # test set option B: a fresh test set drawn for each training set
  x.new = mvrnorm(m, rep(0,p), r)
  y.new = b0 + x.new %*% b + rnorm(m, 0, sigma)

  mod = lm(y ~ x)
  # predict() would need a data frame with names matching the formula,
  # so compute the predictions directly from the coefficients
  pred = cbind(1, x.new) %*% coef(mod)
  err.t[i] = mean((y.new - pred)^2)
}
err = mean(err.t)

Should the test set be generated outside the loop as in option A, or inside the loop as in option B?

The same question goes for a validation set. Suppose I were fitting a LASSO model and wanted to use a separate data set for model selection, i.e. to choose the best tuning parameter: should the validation set be generated outside or inside the loop?
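For concreteness, here is a minimal sketch of the inside-the-loop variant (the option B analogue) for the LASSO case, assuming the glmnet package: the validation set is used only to choose $\lambda$, and a separate test set is used only for the final error estimate.

library(MASS)
library(glmnet)

n = 50; m = 200; iter = 100
b0 = 0.5; b = c(1, 2, 0, 0); rho = 0.5; sigma = 1
p = length(b)
r = matrix(rho, p, p); diag(r) = 1

err.t = rep(0, iter)
for (i in 1:iter) {

  # training set
  x = mvrnorm(n, rep(0, p), r)
  y = drop(b0 + x %*% b + rnorm(n, 0, sigma))

  # validation set: used only to choose the tuning parameter
  x.val = mvrnorm(m, rep(0, p), r)
  y.val = drop(b0 + x.val %*% b + rnorm(m, 0, sigma))

  # test set: used only for the final error estimate
  x.new = mvrnorm(m, rep(0, p), r)
  y.new = drop(b0 + x.new %*% b + rnorm(m, 0, sigma))

  fit = glmnet(x, y, alpha = 1)          # LASSO path over a grid of lambdas
  val.pred = predict(fit, newx = x.val)  # one column of predictions per lambda
  val.mse = colMeans((y.val - val.pred)^2)
  lam = fit$lambda[which.min(val.mse)]   # lambda with the smallest validation MSE

  pred = predict(fit, newx = x.new, s = lam)
  err.t[i] = mean((y.new - pred)^2)
}
err = mean(err.t)

Moving the x.val/y.val lines above the loop would give the option A analogue, which is exactly the choice I am asking about.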

