*Bounty: 50*

*Bounty: 50*

I understand that the residuals from a regression model are not i.i.d. Hence, checking if they are normal (even when we know it is the case), should be a problem since they are dependent.

The residuals are not independent since they need to keep the two conditions that their sum is 0 and that they are orthogonal to x. Also, they do not have the same distribution since the error depends on the specific x.

That said, when I run shapiro test on residuals I see that the error is 5% for regular residuals and almost 10% for studentized residuals (while in both cases the i.i.d is not kept). So one could say that since the values are not i.i.d then shapiro test should reject in both cases, which means that the power for the first case is 5% and for the studentized case is 10%.

Any insight into why this is? (and also if there is a way to account for the dependency, other than a larger sample size)

Reproducible code (See comments for output):

```
get_lm_resid_normality_p_value <- function(fit, resid_function = MASS::stdres) {
# library(MASS)
library(dplyr)
library(broom)
# https://en.wikipedia.org/wiki/Studentized_residual
resids <- fit %>% resid_function(.)
if(length(resids) < 5000 & length(resids) > 3) {
resids %>% shapiro.test %>% glance %>% pull(p.value) %>% return
} else {
return(NA)
}
}
n <- 30
R <- 10000
x <- 1:n
# stdres
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
y <- x + rnorm(n)
fit <- lm(y ~ x)
pv[i] <- get_lm_resid_normality_p_value(fit)
}
mean(pv < 0.05)
# 0.0503 # this is o.k.
# resid
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
y <- x + rnorm(n)
fit <- lm(y ~ x)
pv[i] <- get_lm_resid_normality_p_value(fit, resid_function = resid)
}
mean(pv < 0.05)
# 0.05 # this is o.k.
# MASS::studres
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
y <- x + rnorm(n)
fit <- lm(y ~ x)
pv[i] <- get_lm_resid_normality_p_value(fit, resid_function = MASS::studres)
}
mean(pv < 0.05)
# 0.0992 # this is NOT o.k. (!)
# a way to fix the problem is to only use half the data for building the model. But is there no more efficient way to undo the dependency structure?
set.seed(1234)
pv <- numeric(R)
for(i in 1:R) {
y <- x + rnorm(n)
xx <- data.frame(x, y)
fit <- lm(y ~ x, data = xx[1:15,])
pv[i] <- shapiro.test(xx[-c(1:15),"y"] - predict(fit, newdata = xx[-c(1:15),]))$p.value
}
mean(pv < 0.05)
# 0.0487 # this works since the residuals of the predicted ys are i.i.d
```