#StackBounty: #regression #bias #measurement-error #weighted-regression Using regression weights when $Y$ might be measured with bias

Bounty: 100

Suppose we observe data $Y, X$ and would like to fit a regression model for $\mathbf{E}[Y \,|\, X]$. Unfortunately, $Y$ is sometimes measured with a systematic bias (i.e., errors whose mean is nonzero).

Let $Z \in \left\{\text{unbiased}, \text{biased}\right\}$ indicate whether $Y$ is measured with bias or not. We would actually like to estimate $\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$. Unfortunately, $Z$ is generally not observed, and $\mathbf{E}[Y \,|\, X, Z = \text{unbiased}] \neq \mathbf{E}[Y \,|\, X]$. If we fit a regression of $Y$ on $X$, we'll get biased predictions.
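(To see why the two conditional expectations differ, note that by the law of total expectation the observed regression function is a mixture over $Z$:

$$\mathbf{E}[Y \,|\, X] = \mathbf{E}[Y \,|\, X, Z = \text{unbiased}] \, \Pr[Z = \text{unbiased} \,|\, X] + \mathbf{E}[Y \,|\, X, Z = \text{biased}] \, \Pr[Z = \text{biased} \,|\, X],$$

so the two differ whenever the measurement bias is nonzero and $\Pr[Z = \text{biased} \,|\, X] > 0$.)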

Suppose we cannot generally observe $Z$, but we do have access to a model for $\Pr[Z \,|\, X, Y]$ (because we manually labeled $Z$ on a small training set and fit a classification model with $Z$ as the target variable). Does fitting a regression of $Y$ on $X$ using $\Pr[Z = \text{unbiased} \,|\, X, Y]$ as regression weights produce an unbiased estimate of $\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$? If so, is this method used in practice, and does it have a name?


Here is a small example in R, with df$y_is_unbiased playing the role of $Z$ and df$y_observed playing the role of $Y$:

library(ggplot2)
library(randomForest)

get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))

    ## Value of Y if measured correctly
    df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon

    ## Value of Y if measured incorrectly
    df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)

    ## Y is equally likely to be measured correctly or incorrectly
    df$y_is_unbiased <- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
    df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)

    return(df)
}

## True coefficients
constant <- 5
beta <- c(1, 5)

df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))

ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))

df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)

## Pr[Y | correct] differs from Pr[Y | incorrect]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)

## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))

## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))

## Biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))

## Now imagine that we "rate" a subset of the data (manually check/research whether y was measured correctly)
n_rated <- 1000
df_rated <- df[1:n_rated, ]

## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)

model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)

## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)

## Use the model to get Pr[correct | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])

## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))

In this example, the model for $\Pr[Z = \text{unbiased} \,|\, X, Y]$ is a random forest with formula=y_is_unbiased ~ y_observed + x1 + x2. In the limit, as this model becomes perfectly accurate (when it puts weights of 1.0 where $Y$ is unbiased, and 0.0 where $Y$ is biased), the weighted regression will clearly be unbiased. What happens when the model for $\Pr[Z = \text{unbiased} \,|\, X, Y]$ has test precision and recall that aren't perfect (<100% accuracy)?
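As a sanity check of that limiting case, the claim can be verified directly by using the true indicator as an oracle 0/1 weight. This is a sketch, not part of the original simulation: it regenerates data with the same made-up parameters as get_df above, and lm() with zero weights effectively drops the biased rows, so the true coefficients should be recovered up to sampling noise.

```r
set.seed(1)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
y_unbiased <- 5 + 1 * x1 + 5 * x2 + rnorm(n)

## Half the observations get a mean-nonzero measurement error, as in get_df()
y_is_unbiased <- sample(c(TRUE, FALSE), size=n, replace=TRUE)
y_observed <- y_unbiased + ifelse(y_is_unbiased, 0, sample(c(-10, 5), size=n, replace=TRUE))

## Oracle weights: 1.0 where Y is unbiased, 0.0 where it is biased
fit_oracle <- lm(y_observed ~ x1 + x2, weights=as.numeric(y_is_unbiased))
coef(fit_oracle)  ## intercept and slopes should be near 5, 1, 5 (up to noise)
```

The open question is how quickly this guarantee degrades as the oracle weights are replaced by imperfect predicted probabilities.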

