## #StackBounty: #regression #chi-squared #least-squares #goodness-of-fit #weighted-regression Goodness of Fit of Least Squares with known…

### Bounty: 150

Given a linear model $$Y = X\beta + e$$ with $$e \sim N(0, \Omega^{-1}) = N(0, \sigma^2 W^{-1})$$, where $$W_{ii} = w_i = \frac{\sigma^2}{\sigma_i^2}$$ and $$\Omega_{ii} = \frac{1}{\sigma_i^2}$$.

Assume that, thanks to many repeated measurements, we know the underlying measurement uncertainty $$\sigma_i$$ of the response variable at the $$i$$-th measured point. We measure a total of $$n$$ points.

Given the heteroskedasticity, we use weighted least squares.

The residual analysis yields good results, showing that the residuals are independent and normally distributed, and that the weights enable studentized residuals with constant variance.

Now, what is the best way to assess the goodness of fit?

1.) Reduced chi-square: $$\chi_{red}^2$$ should be close to 1. (1) (2)
$$\chi_{red}^2 = \frac{\chi^2}{\nu} = \frac{r'\Omega r}{\nu} = \frac{1}{\nu} \cdot \sum_i^n \frac{r_i^2}{\sigma_i^2}$$

N.B.: This corresponds to a comparison of the unbiased estimate of the error variance $$\hat{\sigma}^2$$ with the known mean measurement uncertainty $$\sigma^2$$:
$$\frac{\hat{\sigma}^2}{\sigma^2} = \frac{r'Wr}{\nu} \cdot \frac{1}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot w_i}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot \frac{\sigma^2}{\sigma_i^2}}{\sigma^2} = \frac{1}{\nu} \cdot \sum_i^n \frac{r_i^2}{\sigma_i^2}$$
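As a numerical illustration of option 1 (my own sketch, not from the original question, using simulated data with assumed known per-point uncertainties):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with known per-point measurement uncertainties sigma_i
n = 200
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])           # design matrix
sigma_i = rng.uniform(0.5, 2.0, size=n)        # known uncertainties
y = 1.0 + 2.0 * x + rng.normal(0, sigma_i)     # true model: beta = (1, 2)

# Weighted least squares with Omega = diag(1/sigma_i^2)
w = 1.0 / sigma_i**2
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Reduced chi-square: chi^2 / nu with nu = n - p
r = y - X @ beta_hat
nu = n - X.shape[1]
chi2_red = np.sum(r**2 / sigma_i**2) / nu
print(chi2_red)  # close to 1 when the model and the sigma_i are correct
```

When the fitted model is adequate and the stated uncertainties are right, this statistic fluctuates around 1 with spread roughly $$\sqrt{2/\nu}$$.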

or

2.) Evaluation of the variance of the standardized/studentized residuals, which should be close to 1. Note that the value for $$\sigma$$ would be the one given by the prior repeated measurements, not the MSE, where:

Standardized residuals $$\sim \mathcal{N}(0,\,1)$$, so $$Var(r_{i,Stand}) \approx 1$$:
$$r_{i,Stand} = \frac{r_i}{\sigma}$$
Internally studentized residuals:
$$r_{i,IntStud} = \frac{r_i}{\sqrt{\widehat{Var}(r_i)}} = \frac{r_i}{\sqrt{\sigma^2 (\frac{1}{w_i} - H_{ii})}}$$
Externally studentized residuals $$\sim t(\nu)$$, so $$Var(r_{i,ExtStud}) \approx \frac{\nu}{\nu-2}$$:
$$r_{i,ExtStud} = \frac{r_i}{\sqrt{\sigma_{(i)}^2 (\frac{1}{w_i} - H_{ii})}}$$
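The studentized-residual check above can be sketched numerically (my own example, with assumed simulated data). Here the WLS problem is whitened by scaling rows with $$\sqrt{w_i}$$, so the ordinary OLS leverage of the whitened design gives the residual variance:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 150
x = np.linspace(0, 5, n)
X = np.column_stack([np.ones(n), x])
sigma_i = rng.uniform(0.5, 1.5, size=n)        # known uncertainties
y = 2.0 - 0.5 * x + rng.normal(0, sigma_i)

# Whiten: WLS on (X, y) with w = 1/sigma_i^2 equals OLS on (Xs, ys)
w = 1.0 / sigma_i**2
Xs = np.sqrt(w)[:, None] * X
ys = np.sqrt(w) * y

beta_hat = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)      # hat matrix of the whitened problem
h = np.diag(H)
rs = ys - Xs @ beta_hat                        # whitened residuals r_i / sigma_i

# Internally studentized residuals using the known sigma_i (not the MSE):
# Var(rs_i) = 1 - h_ii when the sigma_i are correct
r_int = rs / np.sqrt(1.0 - h)
print(np.var(r_int))  # should be close to 1
```

If the stated $$\sigma_i$$ are too small or too large, this sample variance drifts away from 1, which is exactly the diagnostic being proposed.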

Or is there another alternative?

Get this bounty!!!

## #StackBounty: #regression #variance #linear-algebra #polynomial #weighted-regression Prove variance of locally weighted regression incr…

### Bounty: 50

I am interested in proving the following fact for locally weighted polynomial regression from *The Elements of Statistical Learning* by Hastie et al.

It can be shown that $$\|l(x_0)\|$$ increases with $$d$$.

The same fact is also mentioned in *Local Regression and Likelihood* by Loader:

> … the influence function infl(x) increases as the degree of the local polynomial increases.

where $$d$$ is the degree of the local polynomial and

$$\|l(x_0)\|^2 = b(x_0)^T (B^T W B)^{-1} B^T W W^T B (B^T W B)^{-1} b(x_0)$$

where $$b(x)^T = (1, x, x^2, \ldots, x^{d-1})$$, $$B$$ is the $$N \times d$$ regression matrix with $$i$$-th row $$b(x_i)^T$$, and $$W(x_0)$$ is an $$N \times N$$ diagonal weight matrix.
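While this is no substitute for a proof, the claim is easy to check numerically. A sketch (my own, with an assumed Gaussian kernel and synthetic design points) that evaluates $$\|l(x_0)\|$$ for increasing $$d$$:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 60
xs = np.sort(rng.uniform(-1, 1, n))
x0 = 0.1
# Assumed Gaussian kernel with bandwidth 0.4 for the local weights
W = np.diag(np.exp(-0.5 * ((xs - x0) / 0.4) ** 2))

def l_norm(d):
    """||l(x0)|| for a local polynomial with d terms (powers 0..d-1)."""
    B = np.vander(xs, d, increasing=True)        # N x d regression matrix
    b0 = x0 ** np.arange(d)                      # b(x0)
    M = np.linalg.solve(B.T @ W @ B, B.T @ W)    # (B'WB)^{-1} B'W
    l = M.T @ b0                                 # equivalent-kernel vector l(x0)
    return np.linalg.norm(l)

norms = [l_norm(d) for d in range(1, 5)]
print(norms)  # typically increasing in d
```

Note that $$l(x_0) = WB(B^TWB)^{-1}b(x_0)$$ is the minimizer of $$l^T W^{-1} l$$ subject to $$B^T l = b(x_0)$$, which makes the $$W^{-1}$$-norm monotone in $$d$$; the open question is precisely whether this carries over to the Euclidean norm for general $$W$$.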

This question’s accepted solution solves a related variance problem in the special case $$W = I_{N \times N}$$; however, this doesn’t cover the more general case, as it uses the fact that $$I_{N \times N} \times I_{N \times N} = I_{N \times N}$$, which is not the case for $$W$$.


## #StackBounty: #r #heteroscedasticity #poisson-regression #weighted-regression Closest approximation of a Poisson GLM using weighted lea…

### Bounty: 100

I have an application where I would like to approximate a Poisson GLM with identity link, i.e. a `glm` of the form

```
fit = glm(y~x1+x2+x3+..., family=poisson(link=identity), data=data)
```

using a (non-iterative) weighted least squares fit instead, i.e. using

```
fit = lm(y~x1+x2+x3+..., weights=..., data=data)
```

I was wondering, though, what would be the correct/best weights to use to take the mean/variance relationship into account.

For a Poisson distribution the variance equals the mean, so would that imply that I should use weights = 1/variance, i.e. `weights = 1/(y+epsilon)` with a small `epsilon` (e.g. 1E-10)? I would like to use weighted OLS instead of a GLM mainly for computational reasons (an `lm.fit` is typically 5-10x faster than a `glm.fit`), and also because the identity-link Poisson GLM in R doesn’t seem to be implemented in a very robust way (it often complains about not finding valid starting values). A quick test shows that such a weighted LS fit is indeed very close to the corresponding Poisson GLM.

When working with a single covariate, `sqrt`-transforming `x` and `y` to stabilize the variance and doing a regular OLS fit on those also seems to give a nearly identical fit to the Poisson GLM. But this only works with a single covariate.

I was wondering if both of these procedures are generally accepted approximations of Poisson GLMs, how they would typically be referred to, and what a good citation would be for each recipe?

To enforce nonnegativity constraints on the fitted coefficients, I would then also like to use `nnls` or `glmnet` with `lower.limits=0` (glmnet does not support identity-link Poisson; h2o does, but does not support nonnegativity constraints). So would using `1/variance` weights within a regular (nonnegative) weighted least squares framework be an OK way to approximate a Poisson error structure?
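For what it’s worth, the usual trick is to fold the weights into the problem by scaling rows by $$\sqrt{w_i}$$, which turns weighted least squares into a plain NNLS problem. A sketch (my own, in Python, with `scipy.optimize.nnls` standing in for R’s `nnls` and made-up data/coefficients):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)

n = 300
X = np.abs(rng.normal(size=(n, 3)))      # nonnegative covariates
beta_true = np.array([1.0, 0.0, 2.0])    # one coefficient exactly on the boundary
mu = 10.0 + X @ beta_true                # identity-link Poisson mean
y = rng.poisson(mu)

D = np.column_stack([np.ones(n), X])     # design with intercept

# 1/variance weights as in the question; scaling the rows of the design and
# the response by sqrt(w) turns weighted LS into an ordinary NNLS problem
w = 1.0 / (y + 1e-10)
sw = np.sqrt(w)
coef, rnorm = nnls(sw[:, None] * D, sw * y)
print(coef)  # nonnegative estimates, close to the true coefficients
```

The same row-scaling works for any solver that only accepts unweighted least squares, since $$\sum_i w_i (y_i - d_i^T\beta)^2 = \|\sqrt{w}\,(y - D\beta)\|^2$$.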

Quick test:

```
x = 1:100
set.seed(1)
y = rpois(n=length(x), lambda=x)
epsilon = 1E-10
weights = 1/(y+epsilon)
weights = length(y)*weights/sum(weights)

data = data.frame(x=x, y=y, weights=weights)
fit = glm(y~x, family=poisson(link=identity), data=data) # Poisson GLM
lmfit = lm(sqrt(y)~sqrt(x), data=data) # OLS with x & y sqrt transformed to stabilize variance
wlmfit = lm(y~x, weights=weights, data=data) # weighted OLS with 1/variance weights
preds.glm = do.call(cbind, predict(fit, se.fit=TRUE)[1:2])
preds.glm = data.frame(ypred=preds.glm[,1],
                       y.lwr=preds.glm[,1]-1.96*preds.glm[,2],
                       y.upr=preds.glm[,1]+1.96*preds.glm[,2])
preds.lm = do.call(cbind, predict(lmfit, se.fit=TRUE)[1:2])
preds.lm = data.frame(ypred=preds.lm[,1]^2,
                      y.lwr=(preds.lm[,1]-1.96*preds.lm[,2])^2,
                      y.upr=(preds.lm[,1]+1.96*preds.lm[,2])^2)
preds.wlm = do.call(cbind, predict(wlmfit, se.fit=TRUE)[1:2])
preds.wlm = data.frame(ypred=preds.wlm[,1],
                       y.lwr=preds.wlm[,1]-1.96*preds.wlm[,2],
                       y.upr=preds.wlm[,1]+1.96*preds.wlm[,2])
plot(x, y, pch=16, cex=0.5,
     main="Red=Poisson GLM,\nBlue=Unweighted LS on sqrt() transformed data,\nGreen=Weighted LS with 1/variance weights")
lines(x, preds.glm$ypred, col="red", lty=1, lwd=6)
lines(x, preds.glm$y.lwr, col="red", lty=1, lwd=4)
lines(x, preds.glm$y.upr, col="red", lty=1, lwd=4)
lines(x, preds.lm$ypred, col="blue", lty=2, lwd=6)
lines(x, preds.lm$y.lwr, col="blue", lty=2, lwd=4)
lines(x, preds.lm$y.upr, col="blue", lty=2, lwd=4)
lines(x, preds.wlm$ypred, col="green4", lty=3, lwd=6)
lines(x, preds.wlm$y.lwr, col="green4", lty=3, lwd=4)
lines(x, preds.wlm$y.upr, col="green4", lty=3, lwd=4)
```

Also relevant perhaps is this link.


## #StackBounty: #regression #bias #measurement-error #weighted-regression Using regression weights when $Y$ might be measured with bias

### Bounty: 100

Suppose we observe data $$Y, X$$ and would like to fit a regression model for $$\mathbf{E}[Y \,|\, X]$$. Unfortunately, $$Y$$ is sometimes measured with a systematic bias (i.e. errors whose mean is nonzero).

Let $$Z \in \left\{\text{unbiased}, \text{biased}\right\}$$ indicate whether $$Y$$ is measured with bias or not. We would actually like to estimate $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$. Unfortunately, $$Z$$ is generally not observed, and $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}] \neq \mathbf{E}[Y \,|\, X]$$. If we fit a regression of $$Y$$ on $$X$$, we’ll get biased predictions.

Suppose we cannot generally observe $$Z$$, but have access to a model for $$Pr[Z \,|\, X, Y]$$ (because we manually labeled $$Z$$ on a small training set and fit a classification model with $$Z$$ as the target variable). Does fitting a regression of $$Y$$ on $$X$$ using $$Pr[Z = \text{unbiased} \,|\, X, Y]$$ as regression weights produce an unbiased estimate of $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$? If so, is this method used in practice, and does it have a name?

Small example in R with `df$y_is_unbiased` playing the role of $$Z$$ and `df$y_observed` playing the role of $$Y$$:

```
library(ggplot2)
library(randomForest)

get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))

    ## Value of Y if measured correctly
    df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon

    ## Value of Y if measured incorrectly
    df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)

    ## Y is equally likely to be measured correctly or incorrectly
    df$y_is_unbiased <- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
    df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)

    return(df)
}

## True coefficients
constant <- 5
beta <- c(1, 5)

df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))

ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))

df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)

## Pr[Y | correct] differs from Pr[Y | incorrect]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)

## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))

## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))

## Biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))

## Now imagine that we "rate" a subset of the data (manually check/research whether y was measured correctly)
n_rated <- 1000
df_rated <- df[1:n_rated, ]

## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)

model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)

## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)

## Use the model to get Pr[correct | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])

## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))
```

In this example, the model for $$Pr[Z = \text{unbiased} \,|\, X, Y]$$ is a random forest with `formula=y_is_unbiased ~ y_observed + x1 + x2`. In the limit, as this model becomes perfectly accurate (putting weights of 1.0 where $$Y$$ is unbiased and 0.0 where $$Y$$ is biased), the weighted regression will clearly be unbiased. What happens when the model for $$Pr[Z = \text{unbiased} \,|\, X, Y]$$ has test precision and recall that aren’t perfect (< 100% accuracy)?
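To see the mechanics in the near-perfect-classifier regime, here is a self-contained simulation (my own, in Python, mirroring the -10/+5 mismeasurement of the R example but with a single covariate). It uses the oracle posterior $$Pr[Z = \text{unbiased} \,|\, X, Y]$$ computed from the known mixture densities as a stand-in for the fitted random forest:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 20000
x = rng.normal(size=n)
y_clean = 5.0 + 1.0 * x + rng.normal(size=n)   # E[Y | X, unbiased]
shift = rng.choice([-10.0, 5.0], size=n)       # systematic error sizes
z = rng.random(n) < 0.5                        # True = measured without bias
y = np.where(z, y_clean, y_clean + shift)

def ols(X, y, w=None):
    """(Weighted) least squares via the normal equations."""
    if w is None:
        w = np.ones(len(y))
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

X = np.column_stack([np.ones(n), x])

# Unweighted regression of observed Y on X: the intercept is pulled toward
# the mean mismeasurement, 0.5 * (-10 + 5) / 2 = -1.25
b_unw = ols(X, y)

# Oracle posterior Pr[Z = unbiased | x, y] from the known mixture densities,
# standing in for the fitted classification model
def phi(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

mu = 5.0 + 1.0 * x
p_unb = 0.5 * phi(y - mu)
p_bia = 0.5 * (0.5 * phi(y - mu + 10.0) + 0.5 * phi(y - mu - 5.0))
w = p_unb / (p_unb + p_bia)

b_w = ols(X, y, w)
print(b_unw, b_w)  # weighted intercept lands much closer to the true 5
```

Because the two bias components sit many noise standard deviations away from zero here, the posterior weights are nearly 0/1 and the weighted fit is close to the unbiased-subset fit; with more overlap between the components, the weights depend on $$y$$ within the overlap region and the answer is less clear, which is exactly the question being asked.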
