#StackBounty: #regression #chi-squared #least-squares #goodness-of-fit #weighted-regression Goodness of Fit of Least Squares with known…

Bounty: 150

Given a linear model where $Y=X\beta+e$ and $e\sim N(0,\Omega^{-1})\sim N(0,\sigma^2W^{-1})$, where $W_{ii}=w_i=\frac{\sigma^2}{\sigma_i^2}$ and $\Omega_{ii}=\frac{1}{\sigma_i^2}$.

Assume that, thanks to a lot of repeated measurements, we know the underlying measurement uncertainty $\sigma_i$ of the response variable at the $i$-th measured point. We measure a total of $n$ points.

Given the heteroskedasticity, we use weighted least squares.

The residual analysis yields good results, showing that the residuals are independent and normally distributed, and that the weights enable studentized residuals with constant variance.

Now what is the best way to assess the goodness of fit ?

1.) Reduced chi-square: $\chi_{red}^2$ should be close to 1. (1) (2)
$$\chi_{red}^2 = \frac{\chi^2}{\nu} = \frac{r'\Omega r}{\nu} = \frac{1}{\nu} \cdot \sum_i^n \frac{r_i^2}{\sigma_i^2}$$

N.B.: This corresponds to a comparison of the unbiased estimate of the error variance $\hat{\sigma}^2$ with the known mean measurement uncertainty $\sigma^2$.
$$\frac{\hat{\sigma}^2}{\sigma^2} = \frac{r'Wr}{\nu} \cdot \frac{1}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot w_i}{\sigma^2} = \frac{\frac{1}{\nu} \sum_i^n r_i^2 \cdot \frac{\sigma^2}{\sigma_i^2}}{\sigma^2} = \frac{1}{\nu} \cdot \sum_i^n \frac{r_i^2}{\sigma_i^2}$$
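A quick numerical check of option 1 (sketched in Python/NumPy rather than R; the data, uncertainties, and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])
sigma_i = rng.uniform(0.5, 2.0, size=n)        # known per-point uncertainties
y = X @ np.array([1.0, 0.5]) + rng.normal(0, sigma_i)

# WLS via whitening: scale each row by 1/sigma_i, then ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X / sigma_i[:, None], y / sigma_i, rcond=None)

r = y - X @ beta_hat                           # raw residuals
nu = n - p                                     # degrees of freedom
chi2_red = np.sum((r / sigma_i) ** 2) / nu     # reduced chi-square
print(chi2_red)                                # close to 1 for a well-specified model
```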


2.) Evaluation of the variance of the standardized/studentized residuals, which should be close to 1. Note that the value for $\sigma$ would be the one given by the prior repeated measurements and not the MSE, where:

Standardized residuals $\sim \mathcal{N}(0,\,1)$, so $Var(r_{i,Stand}) \approx 1$
$$r_{i,Stand} = \frac{r_i}{\sigma}$$
Internally studentized residuals:
$$ r_{i,IntStud} = \frac{r_i}{\sqrt{Var(r_i)}} = \frac{r_i}{\sqrt{\sigma^2 \left(\frac{1}{w_{i}} - H_{ii}\right)}} $$

Externally studentized residuals $\sim t(\nu)$, so $Var(r_{i,ExtStud}) \approx \frac{\nu}{\nu-2}$
$$ r_{i,ExtStud} = \frac{r_i}{\sqrt{\sigma_{(i)}^2 \left(\frac{1}{w_{i}} - H_{ii}\right)}} $$
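Option 2 can be sketched the same way (Python/NumPy, simulated data; the check is done in the whitened space, where the studentized residuals should have variance close to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])
sigma_i = rng.uniform(0.5, 2.0, size=n)     # known measurement uncertainties
y = X @ np.array([1.0, 0.5]) + rng.normal(0, sigma_i)

# Whiten: rows scaled by 1/sigma_i, so the errors become unit-variance
Xw = X / sigma_i[:, None]
yw = y / sigma_i
H = Xw @ np.linalg.inv(Xw.T @ Xw) @ Xw.T    # hat matrix of the whitened fit
rw = yw - H @ yw                            # whitened residuals, Var(rw_i) = 1 - H_ii

# Studentize using the known sigma (no MSE involved)
r_stud = rw / np.sqrt(1.0 - np.diag(H))
print(r_stud.var())                         # should be close to 1
```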

or another alternative ?

Get this bounty!!!

#StackBounty: #regression #variance #linear-algebra #polynomial #weighted-regression Prove variance of locally weighted regression incr…

Bounty: 50

I am interested in proving the following fact for locally weighted polynomial regression from The Elements of Statistical Learning by Hastie et al.

It can be shown that $\|l(x_0)\|$ increases with $d$

The same fact is also mentioned in Local Regression and Likelihood by Loader:

… the influence function infl(x) increases as the degree of the local polynomial increases.

where $d$ is the degree of the local polynomial and

$$ \|l(x_0)\|^2 = b(x_0)^T(B^TWB)^{-1}B^TWW^TB(B^TWB)^{-1}b(x_0) $$
where $b(x)^T = (1, x, x^2, \dots, x^{d-1})$, $B$ is the $N \times d$ regression matrix with $i$-th row $b(x_i)^T$, and $W(x_0)$ is an $N \times N$ diagonal weight matrix.
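Before attempting a proof, the claim is easy to probe numerically. A sketch (Python/NumPy; the grid, evaluation point, and Gaussian weight function are arbitrary choices, and this is an illustration, not a proof):

```python
import numpy as np

x = np.linspace(-1, 1, 50)
x0 = 0.3
W = np.diag(np.exp(-(x - x0) ** 2 / 0.1))        # local (kernel) weights around x0

def l_norm(deg):
    # B has i-th row b(x_i)^T = (1, x_i, ..., x_i^deg)
    B = np.vander(x, N=deg + 1, increasing=True)
    b0 = x0 ** np.arange(deg + 1)
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    l = b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    return np.linalg.norm(l)

norms = [l_norm(deg) for deg in range(4)]
print(norms)   # the norm tends to grow with the local polynomial degree
```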

This question’s accepted solution solves a related variance problem in the special case where $W = I_{N \times N}$; however, this doesn’t cover the more general case, as it uses the fact that $I_{N \times N} \cdot I_{N \times N} = I_{N \times N}$, which is not the case for $W$.

Get this bounty!!!

#StackBounty: #r #heteroscedasticity #poisson-regression #weighted-regression Closest approximation of a Poisson GLM using weighted lea…

Bounty: 100

I have an application where I would like to approximate a Poisson GLM with identity link, i.e. a glm of the form

fit = glm(y~x1+x2+x3+..., family=poisson(link=identity), data=data)

using a (noniterative) weighted least squares model fit instead, ie using

fit = lm(y~x1+x2+x3+..., weights=..., data=data)

which comes down to using a weighted nonnegative least squares (NNLS) model where both y and the covariate matrix are multiplied by sqrt(weights).
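That reduction is easy to verify: weighted least squares gives exactly the same coefficients as unweighted least squares on data where y and each column of the design matrix have been multiplied by sqrt(weights). A sketch in Python/NumPy (synthetic data, names made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(1, 10, n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(0, 1, n)
w = rng.uniform(0.5, 2.0, n)                       # arbitrary positive weights

# Direct WLS via the normal equations: (X^T W X) beta = X^T W y
beta_wls = np.linalg.solve(X.T @ np.diag(w) @ X, X.T @ np.diag(w) @ y)

# Same fit: unweighted LS after scaling rows by sqrt(w)
sw = np.sqrt(w)
beta_scaled, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(np.allclose(beta_wls, beta_scaled))          # True
```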

I was wondering, though, what the correct/best weights would be to take into account the mean/variance relationship?

For Poisson, the variance equals the mean, so would that imply that I should use weights=1/variance, i.e. weights = 1/(y+epsilon) with a small epsilon (e.g. 1E-10)? I would like to use weighted OLS instead of a GLM mainly for computational reasons (an lm.fit typically being 5-10x faster than a glm.fit) and also because the identity-link Poisson GLM in R doesn’t seem to be implemented in a very robust way (it often complains about not finding valid starting values). A quick test shows that such a weighted LS fit indeed comes very close to the corresponding Poisson GLM.

When working with a single covariate, sqrt-transforming x and y to stabilize the variance and doing a regular OLS fit on the transformed data also seems to give a nearly identical fit to the Poisson GLM. This only works with a single covariate, though.
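The reason the sqrt transform stabilizes the variance: for $Y \sim \text{Poisson}(\lambda)$, the delta method gives $Var(\sqrt{Y}) \approx 1/4$ regardless of $\lambda$ (the approximation improves as $\lambda$ grows). A quick simulation check (Python/NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
for lam in (5.0, 20.0, 100.0):
    y = rng.poisson(lam, size=200_000)
    print(lam, np.sqrt(y).var())   # approaches 0.25 as lambda grows
```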

I was wondering if both of these procedures are generally accepted approximations of Poisson GLMs? And how would they typically be referred to, and what would a good citation be for each recipe?

To enforce nonnegativity constraints on the fitted coefficients, I would then also like to use nnls or glmnet with lower.limits=0 (as glmnet does not support identity-link Poisson; h2o does, but does not support nonnegativity constraints). So would using 1/variance weights within a regular (nonnegative) weighted least squares framework be an OK way to approximate a Poisson error structure?

Quick test:

x = seq(1, 20, length.out=500)
y = rpois(n=length(x), lambda=x)
epsilon = 1E-10
data = data.frame(x=x, y=y)
weights = 1/(y+epsilon)
weights = nrow(data)*weights/sum(weights)

fit = glm(y~x, family=poisson(link=identity), data=data) # Poisson GLM
lmfit = lm(sqrt(y)~sqrt(x), data=data) # OLS with x & y sqrt transformed to stabilize variance
wlmfit = lm(y~x, weights=weights, data=data) # weighted OLS with 1/variance weights
preds.glm = do.call(cbind,predict(fit,type="link",se.fit=TRUE)[1:2])
preds.glm = data.frame(ypred=preds.glm[,1], y.lwr=preds.glm[,1]-1.96*preds.glm[,2], y.upr=preds.glm[,1]+1.96*preds.glm[,2])
preds.lm = do.call(cbind,predict(lmfit,se.fit=TRUE)[1:2])
preds.lm = data.frame(ypred=preds.lm[,1]^2, y.lwr=(preds.lm[,1]-1.96*preds.lm[,2])^2, y.upr=(preds.lm[,1]+1.96*preds.lm[,2])^2)
preds.wlm = do.call(cbind,predict(wlmfit,se.fit=TRUE)[1:2])
preds.wlm = data.frame(ypred=preds.wlm[,1], y.lwr=preds.wlm[,1]-1.96*preds.wlm[,2], y.upr=preds.wlm[,1]+1.96*preds.wlm[,2])
plot(x,y,pch=16,main="Red=Poisson GLM,\nBlue=Unweighted LS on sqrt() transformed data,\nGreen=Weighted LS with 1/variance weights",cex=0.5)
lines(x,preds.glm$ypred, col="red",lty=1,lwd=6)
lines(x,preds.glm$y.lwr, col="red",lty=1,lwd=4)
lines(x,preds.glm$y.upr, col="red",lty=1,lwd=4)
lines(x,preds.lm$ypred, col="blue",lty=2,lwd=6)
lines(x,preds.lm$y.lwr, col="blue",lty=2,lwd=4)
lines(x,preds.lm$y.upr, col="blue",lty=2,lwd=4)
lines(x,preds.wlm$ypred, col="green4",lty=3,lwd=6)
lines(x,preds.wlm$y.lwr, col="green4",lty=3,lwd=4)
lines(x,preds.wlm$y.upr, col="green4",lty=3,lwd=4)

(Plot comparing the Poisson GLM, the unweighted LS fit on sqrt-transformed data, and the weighted LS fit.)

Also relevant perhaps is this link.

Get this bounty!!!

#StackBounty: #regression #bias #measurement-error #weighted-regression Using regression weights when $Y$ might be measured with bias

Bounty: 100

Suppose we observe data $Y, X$ and would like to fit a regression model for $\mathbf{E}[Y \mid X]$. Unfortunately, $Y$ is sometimes measured with a systematic bias (i.e. errors whose mean is nonzero).

Let $Z \in \left\{\text{unbiased}, \text{biased}\right\}$ indicate whether $Y$ is measured with bias or not. We would actually like to estimate $\mathbf{E}[Y \mid X, Z = \text{unbiased}]$. Unfortunately, $Z$ is generally not observed, and $\mathbf{E}[Y \mid X, Z = \text{unbiased}] \neq \mathbf{E}[Y \mid X]$. If we fit a regression of $Y$ on $X$, we’ll get biased predictions.

Suppose we cannot generally observe $Z$, but have access to a model for $\Pr[Z \mid X, Y]$ (because we manually labeled $Z$ on a small training set and fit a classification model with $Z$ as the target variable). Does fitting a regression of $Y$ on $X$ using $\Pr[Z = \text{unbiased} \mid X, Y]$ as regression weights produce an unbiased estimate of $\mathbf{E}[Y \mid X, Z = \text{unbiased}]$? If so, is this method used in practice, and does it have a name?

Small example in R with df$y_is_unbiased playing the role of $Z$ and df$y_observed playing the role of $Y$:


library(ggplot2)
library(randomForest)

get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))

    ## Value of Y if measured correctly
    df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon

    ## Value of Y if measured incorrectly
    df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)

    ## Y is equally likely to be measured correctly or incorrectly
    df$y_is_unbiased <- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
    df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)

    df
}

## True coefficients
constant <- 5
beta <- c(1, 5)

df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))

ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))

df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)

## Pr[Y | correct] differs from Pr[Y | incorrect]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)

## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))

## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))

## Biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))

## Now imagine that we "rate" a subset of the data (manually check/research whether y was measured correctly)
n_rated <- 1000
df_rated <- df[1:n_rated, ]

## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)

model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)

## Examine OOB confusion matrix (error rate < 5%)
model_pr_unbiased$confusion

## Use the model to get Pr[correct | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])

## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))

In this example, the model $\Pr[Z = \text{unbiased} \mid X, Y]$ is a random forest with formula=y_is_unbiased ~ y_observed + x1 + x2. In the limit, as this model becomes perfectly accurate (when it puts weights of 1.0 where $Y$ is unbiased, and 0.0 where $Y$ is biased), the weighted regression will clearly be unbiased. What happens when the model for $\Pr[Z = \text{unbiased} \mid X, Y]$ has test precision and recall that aren’t perfect (<100% accuracy)?

Get this bounty!!!