## #StackBounty: #estimation #binomial #beta-distribution #measurement-error How to model errors around the estimation of proportions – wi…

### Bounty: 100

I have a situation I’m trying to model. I would appreciate any ideas on how to model this, or if there are known names for such a situation.

Background:

Let’s assume we have a large number of movies ($$M$$). For each movie, I’d like to know the proportion of people in the population who enjoy watching it. So for movie $$m_1$$ we’d say that a proportion $$p_1$$ of the population would answer "yes" to the question "did you enjoy watching this movie?". The same holds for movie $$m_j$$ with proportion $$p_j$$ (up to movie $$m_M$$).

We sample $$n$$ people and ask each of them whether they enjoyed watching each of the movies $$m_1, m_2, \ldots, m_M$$. We can now easily build estimates of $$p_1, \ldots, p_M$$ using standard point estimators, and build confidence intervals for these estimates using standard methods (ref).

But there is a problem.

Problem: measurement error

Some of the people in the sample do not bother to answer truthfully. They instead just answer yes/no to the question regardless of their true preference. Luckily, for some subset of the $$M$$ movies, we know the true proportion of people who like them. So let’s assume that $$M$$ is very large, but that for the first 100 movies (under some indexing) we know the real proportion.
So we know the real values of $$p_1, p_2, \ldots, p_{100}$$, and we have their estimates $$\hat p_1, \hat p_2, \ldots, \hat p_{100}$$. We still want confidence intervals that take this measurement error into account for $$p_{101}, p_{102}, \ldots, p_M$$, using our estimators $$\hat p_{101}, \hat p_{102}, \ldots, \hat p_M$$.

I could imagine some simple model such as:

$$\hat p_i \sim N(p_i, \epsilon^2 + \eta^2 )$$

Where $$\eta^2$$ is for the measurement error.

Questions:

1. Are there other reasonable models for this type of situation?
2. What are good ways to estimate $$\eta^2$$ (for the purpose of building confidence intervals)? For example, would using $$\hat \eta^2 = \frac{1}{100-1}\sum_{i=1}^{100} (p_i - \hat p_i)^2$$ make sense? Or would it make sense to first apply some transformation to the $$p_i$$ and $$\hat p_i$$ values (logit, probit, or some other transformation from the $$(0,1)$$ interval to the $$(-\infty, \infty)$$ scale)?
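
One way question 2 could be approached is the excess-variance idea on the logit scale: the squared logit residuals on the 100 calibration movies should equal the binomial sampling variance plus $$\eta^2$$, so subtracting the former isolates the latter. A Python sketch (all numbers, including $$n$$ and the simulated contamination, are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical calibration setup (all numbers are assumptions) ---
n = 500                                    # people sampled per movie
p_true = rng.uniform(0.1, 0.9, size=100)   # the 100 known true proportions

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(t):
    return 1 / (1 + np.exp(-t))

# Simulate estimates contaminated by extra (measurement) error on the logit
# scale, on top of ordinary binomial sampling error
eta_true = 0.15
p_contaminated = inv_logit(logit(p_true) + rng.normal(0, eta_true, size=100))
p_hat = rng.binomial(n, p_contaminated) / n

# Delta-method sampling variance of logit(p_hat): 1 / (n p (1 - p))
sampling_var = 1 / (n * p_hat * (1 - p_hat))

# Method-of-moments estimate of eta^2: excess of mean squared logit residuals
# over what sampling error alone would explain
resid = logit(p_hat) - logit(p_true)
eta2_hat = max(float(np.mean(resid**2 - sampling_var)), 0.0)

# Widened 95% interval for a new movie, transformed back to the (0, 1) scale
p_new = 0.42
se_total = np.sqrt(1 / (n * p_new * (1 - p_new)) + eta2_hat)
lo, hi = inv_logit(logit(p_new) + np.array([-1.96, 1.96]) * se_total)
print(eta2_hat, lo, hi)
```

The logit transform keeps the widened interval inside $$(0,1)$$, which a normal interval on the raw proportion scale would not guarantee.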

Get this bounty!!!

## #StackBounty: #least-squares #measurement-error Measurement error in one indep variable in OLS with multiple regression

### Bounty: 50

Suppose I regress (with OLS) $$y$$ on $$x_1$$ and $$x_2$$. Suppose I have an i.i.d. sample of size $$n$$, and that $$x_1$$ is observed with error but $$y$$ and $$x_2$$ are observed without error. What is the probability limit of the estimated coefficient on $$x_1$$?

Let us suppose for tractability that the measurement error of $$x_1$$ is "classical". That is, the measurement error is normally distributed with mean 0 and is uncorrelated with $$x_2$$ and with the regression error term.
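
This is the textbook attenuation-bias setting: with classical error, the probability limit of the coefficient on $$x_1$$ is $$\beta_1 \lambda$$, where $$\lambda = \mathrm{Var}(x_1^\perp)/(\mathrm{Var}(x_1^\perp)+\sigma_e^2)$$ and $$x_1^\perp$$ is the part of $$x_1$$ orthogonal to $$x_2$$. A quick Python simulation (all parameter values invented) checks this against the formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2 = 2.0, -1.0
sigma_e = 1.0   # sd of the classical measurement error on x1 (assumed)

# x1 and x2 correlated; y is generated from the TRUE x1
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)               # Var(x1 | x2) = 1
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
x1_obs = x1 + rng.normal(scale=sigma_e, size=n)  # observed with error

# OLS of y on (1, x1_obs, x2)
X = np.column_stack([np.ones(n), x1_obs, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Theoretical plim: beta1 * Var(x1|x2) / (Var(x1|x2) + sigma_e^2)
plim_b1 = beta1 * 1.0 / (1.0 + sigma_e**2)
print(round(b[1], 3), round(plim_b1, 3))  # both close to 1.0
```

Note that the attenuation factor uses the variance of $$x_1$$ *after* partialling out $$x_2$$, so the bias is worse when $$x_1$$ and $$x_2$$ are highly correlated.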

Get this bounty!!!


## #StackBounty: #confidence-interval #standard-error #fitting #measurement-error #error-propagation Correct way to combine 95% confidence…

### Bounty: 50

I am looking for someone to just confirm / double-check something for me with regard to errors on measurements.

Let’s say I am trying to determine the slope of a relationship by varying one quantity and measuring another; I then plot the graph and fit a least-squares straight line to the data (graph on the left). Then I repeat this procedure twice more, to get the middle and right-most graphs.

Each fit routine will typically give me back a slope and the corresponding 95% confidence interval, so that I obtain $$(m_1 \pm \Delta m_1)$$, $$(m_2 \pm \Delta m_2)$$ and $$(m_3 \pm \Delta m_3)$$. Now I know that the underlying quantity which determines $$m$$ in each case is the same, so I should be able to quote a best estimate for the slope as their mean

$$\bar{m} = \frac{m_1+m_2+m_3}{3}. \tag{1}$$

My question is about the appropriate way to quote the error. We know that for a function $$f(x,y)$$ with errors in $$x$$ and $$y$$ given by $$\Delta x$$ and $$\Delta y$$, respectively, the error on $$f$$ is given by

$$\Delta f = \sqrt{ (\Delta x)^2 \bigg(\frac{\partial f}{\partial x}\bigg)^2 + (\Delta y)^2 \bigg(\frac{\partial f}{\partial y}\bigg)^2 } \tag{2}$$

So I would think I can determine the error in $$\bar{m}$$ to be

\begin{align} \Delta \bar{m} &= \sqrt{ (\Delta m_1)^2 \bigg(\frac{\partial \bar{m}}{\partial m_1}\bigg)^2 + (\Delta m_2)^2 \bigg(\frac{\partial \bar{m}}{\partial m_2}\bigg)^2 + (\Delta m_3)^2 \bigg(\frac{\partial \bar{m}}{\partial m_3}\bigg)^2} \tag{3} \\ &= \frac{1}{3} \sqrt{ (\Delta m_1)^2 + (\Delta m_2)^2 + (\Delta m_3)^2 } \tag{4} \end{align}

First question, is this correct?

Second question: is it okay to propagate 95% confidence intervals in this way? Should I now simply quote the result as $$\bar{m} \pm \Delta \bar{m}$$ and explain that $$\Delta \bar{m}$$ is the combined 95% confidence interval, or should I convert the 95% numbers from the fits into standard errors (through the factor of 1.96)?

(I am for now assuming Gaussian errors everywhere.)

EDIT

It was suggested in the comments that I first implement weighting in the averaging step before worrying about the errors. This should help to give more weight to slopes which have tighter confidence intervals (and vice versa).

According to this link, the weighted version of the mean would be given by
$$\bar{m}_\textrm{w} = \frac{\sum_i w_i m_i}{\sum_i w_i}, \hspace{1cm} \textrm{where} \hspace{0.5cm} w_i = \frac{1}{\sigma_i^2}\tag{5}$$
and $$\sigma_i^2$$ is the variance of each slope estimate. Therefore, in my case with the three example slopes, it should be
$$\bar{m}_\textrm{w} = \frac{m_1/\sigma_1^2 + m_2/\sigma_2^2 + m_3/\sigma_3^2}{1/\sigma_1^2 + 1/\sigma_2^2 + 1/\sigma_3^2}. \tag{6}$$

The variance on the weighted mean slope is given at the above link again by
\begin{align} \textrm{Var}(\bar{m}_\textrm{w}) &= \frac{\sum_i w_i^2 \sigma_i^2}{\big( \sum_i w_i \big)^2}\tag{7}\\ &= \frac{1/\sigma_1^2 + 1/\sigma_2^2 + 1/\sigma_3^2}{\big(1/\sigma_1^2 + 1/\sigma_2^2 + 1/\sigma_3^2\big)^2}\tag{8}\\ &= \big(1/\sigma_1^2 + 1/\sigma_2^2 + 1/\sigma_3^2\big)^{-1}.\tag{9} \end{align}

So now my main question remains: these are variances, so should we somehow convert the 95% confidence intervals $$\Delta m_i$$ returned by a fitting algorithm into variances?

Maybe for a concrete example we could imagine the following values were returned from the fitting routine:
\begin{align} m_1 &= 5.5\; (4.9, 6.1)\rightarrow \Delta m_1 = 0.6\\ m_2 &= 5.5\; (5.3, 5.7)\rightarrow \Delta m_2 = 0.2\\ m_3 &= 5.2\; (4.5, 5.9)\rightarrow \Delta m_3 = 0.7 \end{align}
where the values in brackets represent the 95% confidence intervals. How should the estimate of the slope be reported, including errors?
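
One way to work this concrete example, assuming Gaussian errors as stated: divide each 95% half-width by 1.96 to get a standard error, apply equations (5)-(9), then multiply the combined standard error back by 1.96 to report a 95% interval. In Python:

```python
import numpy as np

# Values from the example: slopes and their 95% CI half-widths
m = np.array([5.5, 5.5, 5.2])
delta = np.array([0.6, 0.2, 0.7])

sigma = delta / 1.96             # convert 95% half-widths to standard errors
w = 1 / sigma**2                 # inverse-variance weights, eq. (5)

m_w = np.sum(w * m) / np.sum(w)  # weighted mean, eq. (6)
var_w = 1 / np.sum(w)            # variance of the weighted mean, eq. (9)
half_width_95 = 1.96 * np.sqrt(var_w)
print(f"{m_w:.2f} +/- {half_width_95:.2f}")   # prints "5.48 +/- 0.18"
```

Note how the tightly constrained $$m_2$$ dominates: the weighted mean sits close to 5.5, and the combined interval is narrower than even the tightest individual one.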

Get this bounty!!!

## #StackBounty: #sampling #measurement-error Balancing measurement error with number of samples

### Bounty: 50

Suppose I am doing a physical experiment and would like to measure the output (random variable). Inherently, I introduce measurement errors when sampling the random variable. There are also sampling errors, due to sampling only a finite number of realizations of my random variable. Is there literature relating to how to balance these two types of errors?

It is not hard to imagine a scenario where I can take more samples if I reduce the precision of the measurement device I use, so a natural question is how to decide what precision to use.

For instance, suppose that by rounding my measurements to the nearest centimeter rather than millimeter, I can increase the number of samples I can take by a factor of 5. Which precision should I use?

I am aware of Sheppard’s corrections, but I don’t think those are general enough for all cases; e.g. if my data is discrete. Moreover, even in the continuous case, Sheppard’s corrections say that the measurement errors do not affect the mean. This is reasonable if you’re using a relatively fine measurement system, but is clearly not true if your measurement precision is very low.

To clarify, I am considering the case where the rounding error is a deterministic function of my original random variable; i.e. assume that I sample in infinite precision, and then round to my measurement system (say the integers).
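
To make the trade-off concrete, a small simulation can compare the MSE of the sample mean under the two regimes (rounding to step $$h$$ inflates the variance by roughly $$h^2/12$$, per Sheppard). All numbers here are invented: a Gaussian quantity with sd 1 cm, mm vs cm rounding, and 5x more samples in the coarse case:

```python
import numpy as np

rng = np.random.default_rng(2)

def mse_of_mean(n, step, reps=2000, mu=10.0, sd=1.0):
    """Monte Carlo MSE of the sample mean of n observations,
    each rounded to the nearest multiple of `step`."""
    x = rng.normal(mu, sd, size=(reps, n))
    est = np.round(x / step) * step
    return float(np.mean((est.mean(axis=1) - mu) ** 2))

# The question's scenario: mm precision with n samples
# vs cm precision with 5n samples
mse_mm = mse_of_mean(n=100, step=0.1)   # fine rounding, fewer samples
mse_cm = mse_of_mean(n=500, step=1.0)   # coarse rounding, 5x samples
print(mse_mm, mse_cm)
```

In this particular setup the coarse-but-plentiful regime wins, because the rounding variance $$h^2/12$$ is small next to the underlying sd; the answer flips once the rounding step becomes comparable to the spread of the data, which is exactly the regime where Sheppard-style corrections break down.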

Get this bounty!!!

## #StackBounty: #measurement-error #metrology Instrument accuracy necessary to distinguish two values

### Bounty: 50

Let’s say I want to know the size of some shoes, but the size isn’t marked. The real size could be any of [35, 36, 37, …, 50] cm, and the shoes are made so well that their real size is exactly one of these integers when measured with a reference instrument. I measure the shoes with a lower-quality instrument whose measurements include normally-distributed error. I estimated this error previously by measuring objects of known size. What is the maximum acceptable error to be able to distinguish whether a shoe is 36 cm and not 35 or 37 cm? Is it just 0.49 cm? In other words, how should I specify the necessary performance of the measurement device?
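
One way to frame this: with normally distributed error no finite standard deviation makes misclassification impossible, so the spec is a tolerable misclassification rate rather than a hard cutoff. Rounding the reading to the nearest integer misclassifies exactly when the error exceeds 0.5 cm, so the requirement becomes $$P(|e| > 0.5) < \alpha$$ for some chosen $$\alpha$$. A sketch (the 1% target is an assumption, not from the question):

```python
from math import erf, sqrt

def p_misclassify(sigma):
    """P(|e| > 0.5) for e ~ N(0, sigma^2): the chance a normal error
    pushes the reading past the 0.5 cm rounding boundary."""
    phi = 0.5 * (1 + erf(0.5 / (sigma * sqrt(2))))  # Phi(0.5 / sigma)
    return 2 * (1 - phi)

# sigma = 0.49 cm misclassifies ~30% of the time; a 1% target needs
# roughly sigma <= 0.5 / 2.58 ~ 0.19 cm
for sigma in [0.49, 0.25, 0.19]:
    print(sigma, p_misclassify(sigma))
```

So 0.49 cm is far too loose as a standard deviation; it would only work if 0.49 cm were the *maximum possible* error, which a Gaussian model never provides.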

Get this bounty!!!

## Problem

I have a system that is measured at regular intervals. The state of the system at those times is given by the vector $$\vec x=(x_0, x_1, x_2,\cdots)$$. In between each measurement, a random variable $$z$$, drawn from a Gaussian distribution with zero mean and standard deviation $$\sigma_z$$, is added to the state, and there is some decay, i.e., $$x_i=(1-\gamma)x_{i-1}+z_i$$. (The underlying model is a damped harmonic oscillator in a thermal environment, probed at multiples of its period.)

My measurements are imperfect, i.e., I measure $$\vec y=(y_0,y_1,y_2,\cdots)$$, where $$y_i=x_i+d_i$$. Again, the $$d_i$$ are drawn from a Gaussian distribution of zero mean, with standard deviation $$\sigma_d$$.
Note that this setup is a close variant of an example on the Wikipedia page for the Kalman filter. But as far as I understand, the Kalman filter does not take the whole measurement record into account; it provides an update rule for the best guess. In particular, its estimate for $$x_0$$, which I am particularly interested in, is bad.

I would like to find $$\vec x$$ given some measurement record $$\vec y$$ (the parameters $$\sigma_d, \sigma_z, \gamma$$ are known). I believe I should use some maximum likelihood analysis, but I have trouble finding the right probability distributions $$P(\vec y|\vec x), P(\vec x), P(\vec y)$$ that I need to determine $$P(\vec x|\vec y)$$.

## Attempt

I tried guessing what the right probability distributions should be.
This one is fairly easy:
$$P(\vec y|\vec x) = \prod_{i=0}^{N-1}\frac{1}{\sqrt{2\pi\sigma_d^2}}\exp\left(-\frac{(y_i-x_i)^2}{2\sigma_d^2}\right).$$

Now for $$P(\vec x)$$, I’m already slightly unsure. I start at the equilibrium of the above update rule for $$x_i$$, i.e., when the variance $$\sigma_{x_i}^2\to(1-\gamma)^2\sigma_{x_{i-1}}^2+\sigma_z^2$$ has reached a steady state, i.e., $$\sigma_{x_0}^2=\sigma_z^2/\big(1-(1-\gamma)^2\big)$$. This is my expectation for $$x_0$$.
Thus I think I should write
$$P(\vec x) = \frac{1}{\sqrt{2\pi\sigma_{x_0}^2}}\exp\left(-\frac{x_0^2}{2\sigma_{x_0}^2}\right)\times\prod_{i=1}^{N-1}\frac{1}{\sqrt{2\pi\sigma_z^2}}\exp\left(-\frac{\big(x_i-(1-\gamma)x_{i-1}\big)^2}{2\sigma_z^2}\right).$$

However, the real problem is with $$P(\vec y)$$, and I think this is due to my limited experience with this method. Naively, without thinking about Bayesian updates, my guess for $$P(\vec x|\vec y)$$ would have looked like the product of $$P(\vec x)$$ and $$P(\vec y|\vec x)$$.
I think my expectation for $$\vec y$$ should be
$$P(\vec y) = \int d\vec x\, P(\vec x)P(\vec y|\vec x).$$
It is possible to do all the Gaussian integrals, but it doesn’t seem to give a simple result.

Is there some simpler assumption I can make? And how is it justified? Is there a mistake somewhere else?

## Attempt #2

I’m still not entirely sure about $$P(\vec y)$$ (I think it’s correct though), but one can get by without explicitly calculating it.

One mistake above is calling this approach “maximum likelihood”. It’s not. MLE would mean that I choose $$\vec x$$ such that $$P(\vec y|\vec x)$$ above becomes maximal for a given set of measurement results $$\vec y$$. This is a bad guess, as it neglects what we know about how the state evolves (note that $$\sigma_z$$ does not crop up in $$P(\vec y|\vec x)$$). Instead, what I am doing is Bayesian inference, i.e., finding the conditional probability $$P(\vec x|\vec y)$$.

To do this, we start with the joint distribution $$P(\vec x, \vec y)=P(\vec y|\vec x)P(\vec x)$$. This is a normal distribution
$$\begin{pmatrix}\vec x \\ \vec y\end{pmatrix} \sim N\left[\begin{pmatrix}\vec \mu_x \\ \vec \mu_y\end{pmatrix}, \begin{pmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy} & \Sigma_{yy}\end{pmatrix}\right],$$
with $$\vec\mu_x,\vec\mu_y=0$$, $$\Sigma_{yy} = \Sigma_{xy} = (1/2\sigma_d^2)\mathbb 1$$, and
$$[\Sigma_{xx}]_{ij} = \left( \frac{1}{2\sigma_d^2} + \frac{1}{\sigma_z^2} \right)\delta_{i,j} + \left( \frac{1}{2\sigma_{x_0}^2} - \frac{1}{2\sigma_z^2} \right)\delta_{i,0}\delta_{j,0}+\frac{1}{2\sigma_z^2}(\delta_{i,j+1}+\delta_{i,j-1}).$$
NB: $$\Sigma_{xx}$$ is tri-diagonal apart from the first element, which differs (indexing from 0).

It’s actually straightforward to derive the conditional distribution for $$\vec x$$ from the joint distribution [2]. It is given through
$$\vec x \sim N\big(\vec\mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(\vec y-\vec\mu_y),\; \Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{xy}\big).$$
However, in our case, this reduces to
$$\vec x\sim N\big( \vec y,\; \Sigma_{xx}-(1/2\sigma_d^2)\mathbb 1 \big),$$
which is not at all what I am expecting. All values of $$\vec y$$ should be used to determine a specific $$x_i$$. The problem lies in the fact that $$\Sigma_{yy}$$ is diagonal. What am I missing?
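
For reference, the standard tool for conditioning every $$x_i$$, including $$x_0$$, on the *entire* record $$\vec y$$ is the Kalman smoother (forward filter plus backward Rauch-Tung-Striebel pass), which for a linear-Gaussian model computes exactly the posterior $$P(\vec x|\vec y)$$ sought above. A minimal Python sketch for this scalar model (parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Model from the question: x_i = (1-gamma) x_{i-1} + z_i,  y_i = x_i + d_i
gamma, sigma_z, sigma_d, N = 0.1, 1.0, 0.5, 200
a = 1 - gamma
P0 = sigma_z**2 / (1 - a**2)      # stationary prior variance for x_0

# Simulate a trajectory and noisy measurements
x = np.empty(N); x[0] = rng.normal(0, np.sqrt(P0))
for i in range(1, N):
    x[i] = a * x[i-1] + rng.normal(0, sigma_z)
y = x + rng.normal(0, sigma_d, N)

# Forward Kalman filter: m_f[i], P_f[i] condition x_i on y_0..y_i
m_f = np.empty(N); P_f = np.empty(N)
m_pred, P_pred = 0.0, P0
for i in range(N):
    K = P_pred / (P_pred + sigma_d**2)
    m_f[i] = m_pred + K * (y[i] - m_pred)
    P_f[i] = (1 - K) * P_pred
    m_pred, P_pred = a * m_f[i], a**2 * P_f[i] + sigma_z**2

# Backward (RTS) pass: m_s[i], P_s[i] condition x_i on ALL of y
m_s = m_f.copy(); P_s = P_f.copy()
for i in range(N - 2, -1, -1):
    P_pred = a**2 * P_f[i] + sigma_z**2
    G = a * P_f[i] / P_pred
    m_s[i] = m_f[i] + G * (m_s[i+1] - a * m_f[i])
    P_s[i] = P_f[i] + G**2 * (P_s[i+1] - P_pred)

mse_raw = float(np.mean((y - x)**2))
mse_filt = float(np.mean((m_f - x)**2))
mse_smooth = float(np.mean((m_s - x)**2))
print(mse_raw, mse_filt, mse_smooth)
```

The smoothed estimate uses future as well as past measurements, which is precisely what the filter-only update rule lacks for $$x_0$$.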

Get this bounty!!!

## #StackBounty: #regression #bias #measurement-error #weighted-regression Using regression weights when $$Y$$ might be measured with bias

### Bounty: 100

Suppose we observe data $$Y, X$$ and would like to fit a regression model for $$\mathbf{E}[Y \,|\, X]$$. Unfortunately, $$Y$$ is sometimes measured with a systematic bias (i.e. errors whose mean is nonzero).

Let $$Z \in \left\{\text{unbiased}, \text{biased}\right\}$$ indicate whether $$Y$$ is measured with bias or not. We would actually like to estimate $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$. Unfortunately, $$Z$$ is generally not observed, and $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}] \neq \mathbf{E}[Y \,|\, X]$$. If we fit a regression of $$Y$$ on $$X$$, we’ll get biased predictions.

Suppose we cannot generally observe $$Z$$, but have access to a model for $$\Pr[Z \,|\, X,Y]$$ (because we manually labeled $$Z$$ on a small training set and fit a classification model with $$Z$$ as the target variable). Does fitting a regression of $$Y$$ on $$X$$ using $$\Pr[Z = \text{unbiased} \,|\, X,Y]$$ as regression weights produce an unbiased estimate of $$\mathbf{E}[Y \,|\, X, Z = \text{unbiased}]$$? If so, is this method used in practice, and does it have a name?

Small example in R with df$y_is_unbiased playing the role of $$Z$$ and df$y_observed playing the role of $$Y$$:

```r
library(ggplot2)
library(randomForest)

get_df <- function(n_obs, constant, beta, sd_epsilon, mismeasurement) {
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs, sd=sd_epsilon))

    ## Value of Y if measured correctly
    df$y_unbiased <- constant + as.matrix(df[c("x1", "x2")]) %*% beta + df$epsilon

    ## Value of Y if measured incorrectly
    df$y_biased <- df$y_unbiased + sample(mismeasurement, size=n_obs, replace=TRUE)

    ## Y is equally likely to be measured correctly or incorrectly
    df$y_is_unbiased <- sample(c(TRUE, FALSE), size=n_obs, replace=TRUE)
    df$y_observed <- ifelse(df$y_is_unbiased, df$y_unbiased, df$y_biased)

    return(df)
}

## True coefficients
constant <- 5
beta <- c(1, 5)

df <- get_df(n_obs=2000, constant=constant, beta=beta, sd_epsilon=1.0, mismeasurement=c(-10.0, 5.0))

ggplot(df, aes(x=x1, y=y_observed, color=y_is_unbiased)) + geom_point() + scale_color_manual(values=c("#ff7f00", "#377eb8"))

df$string_y_is_unbiased <- paste0("y_is_unbiased: ", df$y_is_unbiased)

## Pr[Y | correct] differs from Pr[Y | incorrect]
ggplot(df, aes(x=y_observed)) + geom_histogram(color="black", fill="grey", binwidth=0.5) + facet_wrap(~ string_y_is_unbiased, ncol=1)

## Recover true constant and beta (plus noise) when using y_unbiased
summary(lm(y_unbiased ~ x1 + x2, data=df))

## Biased estimates when using y_biased (constant is biased downward)
summary(lm(y_biased ~ x1 + x2, data=df))

## Biased estimates when using y_observed (constant is biased downward)
summary(lm(y_observed ~ x1 + x2, data=df))

## Now imagine that we "rate" a subset of the data (manually check/research whether y was measured correctly)
n_rated <- 1000
df_rated <- df[1:n_rated, ]

## Use a factor so that randomForest does classification instead of regression
df_rated$y_is_unbiased <- factor(df_rated$y_is_unbiased)

model_pr_unbiased <- randomForest(formula=y_is_unbiased ~ y_observed + x1 + x2, data=df_rated, mtry=2)

## Examine OOB confusion matrix (error rate < 5%)
print(model_pr_unbiased)

## Use the model to get Pr[correct | X, observed Y] on unrated data
df_unrated <- df[(n_rated+1):nrow(df), ]
df_unrated$pr_unbiased <- as.vector(predict(model_pr_unbiased, newdata=df_unrated, type="prob")[, "TRUE"])

## Train a model on unrated data, using pr_unbiased as regression weights -- is this unbiased?
summary(lm(y_observed ~ x1 + x2, data=df_unrated, weights=df_unrated$pr_unbiased))
```

In this example, the model $$\Pr[Z = \text{unbiased} \,|\, X,Y]$$ is a random forest with formula=y_is_unbiased ~ y_observed + x1 + x2. In the limit, as this model becomes perfectly accurate (when it puts weights of 1.0 where $$Y$$ is unbiased, and 0.0 where $$Y$$ is biased), the weighted regression will clearly be unbiased. What happens when the model for $$\Pr[Z = \text{unbiased} \,|\, X,Y]$$ has test precision and recall that aren’t perfect (<100% accuracy)?
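
As a sanity check of an intermediate case, here is a Python analogue of the R example using *oracle* probability weights: the exact posterior $$\Pr[Z = \text{unbiased} \,|\, X, Y]$$ computed by Bayes’ rule, rather than a fitted random forest. In this sketch (same invented coefficients and mismeasurement values as the R example) the weighted regression does appear to recover the unbiased coefficients:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
const, beta = 5.0, np.array([1.0, 5.0])
shifts = np.array([-10.0, 5.0])       # mismeasurement values, as in the R code

X = rng.normal(size=(n, 2))
mu = const + X @ beta
y_clean = mu + rng.normal(size=n)     # sd_epsilon = 1
is_unbiased = rng.random(n) < 0.5
y = np.where(is_unbiased, y_clean, y_clean + rng.choice(shifts, size=n))

# Oracle weights: true Pr[Z = unbiased | x, y] via Bayes' rule
norm_pdf = lambda r: np.exp(-0.5 * r**2) / np.sqrt(2 * np.pi)
lik_unbiased = norm_pdf(y - mu)
lik_biased = 0.5 * (norm_pdf(y - mu - shifts[0]) + norm_pdf(y - mu - shifts[1]))
w = lik_unbiased / (lik_unbiased + lik_biased)

# Weighted least squares with intercept: scale rows by sqrt(w)
A = np.column_stack([np.ones(n), X])
coef = np.linalg.lstsq(A * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]
print(coef.round(3))   # intercept close to 5, slopes close to (1, 5)
```

Intuitively this works because, conditional on $$X$$, weighting by the posterior probability of being unbiased reweights the mixture back toward the unbiased component; the open question in the post is how much of this survives when the weights come from an imperfect classifier instead of the oracle.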

Get this bounty!!!