*Bounty: 100*

*Bounty: 100*

In their preprint, Rocha et al. suggest a new type of residual for linear regression models with heteroscedasticity. They call their new residual *PCA residuals.* I have tried to replicate some of their examples but failed. I would be very grateful if someone could give me a hint where I’m wrong.

## Background

Briefly, they assume the usual linear model of the form in matrix notation $$Y=Xbeta + epsilon$$

where $Y = (Y_1ldots, Y_n)^T$, $X = (X_{ij})$ is an $ntimes p$ matrix, $beta = (beta_1,ldots, beta_p)^T$ and $epsilon = (epsilon_1, ldots, epsilon_n)^T$. Further, let $Omega = mathrm{Cov}(epsilon) = mathrm{diag}(sigma_1^2, ldots, sigma_n^2)$ (with $0<sigma_i^2<infty$ for $i=1,ldots, n$). They show that the residuals satisfy

$$

hat{epsilon} = (I_n – H)ysim N_n(0, (I_n – H)Omega)

$$

where $I_n$ is the identity matrix and $H$ the hat matrix with diagonal elements $h_i$. There exist a number of known heteroscedasticity-consistent estimators for $Omega$ but I will focus on $HC_3$ here:

$$

widehat{Omega}_{3} = mathrm{diag}(1/(1-h_i)^{2})widehat{Omega}

$$

where $widehat{Omega}=mathrm{diag}(hat{epsilon}^2_{1}, ldots, hat{epsilon}^2_{n})$.

## PCA residuals

To get the PCA residuals, perform a spectral decomposition (Eigendecomposition) of the form

$$

(I_n – H)widehat{Omega}_{3} = Qwidehat{Lambda}Q^{-1}

$$

where $Q$ is an orthogonal matrix of eigenvectors and $Lambda = mathrm{diag}(overbrace{lambda_1, ldots, lambda_{n-p}}^{n-p}, underbrace{0, ldots, 0}_{p})$. That means that the last $p$ eigenvalues are zero. Finally, the standardized PCA residuals are defined as

$$

R_{i} = frac{Qhat{epsilon}}{sqrt{lambda_{i}}},quad i = 1,ldots, n-p.

$$

**One problem in my calculations below (I think) is that this covariance matrix $(I_n – H)widehat{Omega}_{3}$ is not positive and symmetric.**

## An example

The authors use the `states.dta`

dataset to illustrate the use of their PCA residuals. I have tried to replicate their calculations but was not able to.

Consider the multiple linear regression model

$$

mathrm{csat}*i = beta_0 + beta_1mathrm{percent}_i + beta_2mathrm{high}_i + beta_3mathrm{percent}*{i}^2 + epsilon_i, quad i = 1,ldots, 51

$$

Here are my steps in R:

```
library(foreign)
dat <- read.dta("http://github.com/IQSS/workshops/blob/master/R/Rstatistics/dataSets/states.dta?raw=True")
mod <- lm(csat~percent + high + I(percent^2), data = dat)
```

Calculate the heteroscedasticity-consistent matrix $widehat{Omega}_{3}$:

```
res2 <- as.numeric(residuals(mod)^2)
omegahat_HC3 <- diag(res2/(1 - hatvalues(mod))^2)
```

Now perform the spectral decomposition (note that the decomposition results in some complex numbers but I don’t know if that’s a problem):

```
X <- model.matrix(mod)
Ident <- diag(nobs(mod))
Hatmatrix <- tcrossprod(svd(X)$u)
covmat <- (Ident - Hatmatrix)%*%omegahat_HC3
decomp <- eigen(covmat) # Spectral decomposition
Q <- decomp$vectors # Eigenvectors
Lambda <- decomp$values # Eigenvalues
```

Finally, calculate the PCA residuals (note that I’m multiplying the residuals with the transpose of $Q$, because the results where nonsensical otherwise. But I’m unsure why that’s the case):

```
res <- as.numeric(residuals(mod))
res_pca <- t(Q)%*%res # Unstandardized PCA residuals
n <- nobs(mod)
p <- length(coef(mod))
res_pca_std <- Re(res_pca[1:(n - p)]/sqrt(Lambda[1:(n - p)]))
```

The authors show the following index plot of the PCA residuals

Here is the index plot of my calculations:

My PCA residuals are obviously not quite the same but I don’t know here I was going wrong with my calculations.

Can anybody explain why my calculations are off and why?