# #StackBounty: #regression #bias #measurement-error #weighted-regression Using regression weights when $Y$ might be measured with bias

### Bounty: 100

Suppose we observe data $$Y, X$$ and would like to fit a regression model for $$\mathbf{E}[Y \,|\, X]$$. Unfortunately, $$Y$$ is sometimes measured with a systematic bias (i.e., errors whose mean is nonzero).

Let $$Z \in \left\{\text{unbiased}, \text{biased}\right\}$$ indicate whether $$Y$$ is measured with bias or not. We would actually like to estimate $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$. Unfortunately, $$Z$$ is generally not observed, and $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}] \neq \mathbf{E}[Y \,|\, X]$$. If we fit a regression of $$Y$$ on $$X$$, we'll get biased predictions.
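To make the discrepancy explicit, the law of total expectation writes the naive regression target as a mixture of the two conditional means:

$$\mathbf{E}[Y \,|\, X] = \Pr[Z = \text{unbiased} \,|\, X]\, \mathbf{E}[Y \,|\, X, Z = \text{unbiased}] + \Pr[Z = \text{biased} \,|\, X]\, \mathbf{E}[Y \,|\, X, Z = \text{biased}]$$

so whenever the mismeasurement has a nonzero mean, the unweighted fit is pulled toward this mixture rather than toward $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$.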

Suppose we cannot generally observe $$Z$$, but have access to a model for $$\Pr[Z \,|\, X, Y]$$ (because we manually labeled $$Z$$ on a small training set and fit a classification model with $$Z$$ as the target variable). Does fitting a regression of $$Y$$ on $$X$$ using $$\Pr[Z = \text{unbiased} \,|\, X, Y]$$ as regression weights produce an unbiased estimate of $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$? If so, is this method used in practice, and does it have a name?
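For reference, the proposed weighted least-squares estimator minimizes

$$\hat{\beta} = \arg\min_{\beta} \sum_{i} \Pr[Z_i = \text{unbiased} \,|\, X_i, Y_i]\, \left(Y_i - X_i^{\top}\beta\right)^2,$$

i.e., each observation is down-weighted in proportion to its estimated probability of having been measured with bias.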

Small example in R with `df$y_is_unbiased` playing the role of $$Z$$ and `df$y_observed` playing the role of $$Y$$:

```
library(ggplot2)
library(randomForest)

get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))

    ## Value of Y if measured correctly
    df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon

    ## Value of Y if measured incorrectly
    df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)

    ## Y is equally likely to be measured correctly or incorrectly
    df$y_is_unbiased <- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
    df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)

    return(df)
}

## True coefficients
constant <- 5
beta <- c(1, 5)

df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))

ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))

df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)

## Pr[Y | correct] differs from Pr[Y | incorrect]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)

## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))

## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))

## Biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))

## Now imagine that we "rate" a subset of the data (manually check/research whether y was measured correctly)
n_rated <- 1000
df_rated <- df[1:n_rated, ]

## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)

model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)

## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)

## Use the model to get Pr[correct | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])

## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))
```

In this example, the model for $$\Pr[Z = \text{unbiased} \,|\, X, Y]$$ is a random forest with `formula=y_is_unbiased ~ y_observed + x1 + x2`. In the limit, as this model becomes perfectly accurate (putting weight 1.0 on observations where $$Y$$ is unbiased and 0.0 where $$Y$$ is biased), the weighted regression will clearly be unbiased. What happens when the model for $$\Pr[Z = \text{unbiased} \,|\, X, Y]$$ has imperfect test precision and recall (accuracy below 100%)?
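One way to probe that limiting case in the simulation above (a hypothetical extra check, not part of the original script) is to refit the weighted regression using the true indicator as an oracle 0/1 weight, for comparison against the model-based weights:

```
## Hypothetical oracle check: weight by the true (normally unobserved) indicator.
## With weights of exactly 1 for unbiased rows and 0 for biased rows, lm() fits
## only the correctly measured observations, so the estimates should recover
## constant and beta up to sampling noise.
summary(lm(y_observed ~ x1 + x2, data=df_unrated,
           weights=as.numeric(df_unrated$y_is_unbiased)))
```

The gap between this oracle fit and the `pr_unbiased`-weighted fit then isolates how much bias the classifier's imperfect weights introduce.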
