#StackBounty: #maximum-likelihood #least-squares #covariance #uncertainty #hessian Parameter uncertainity in least squares optimization…

Bounty: 50

Given a least squares optimization problem of the form:

$$ C(lambda) = sum_i ||y_i – f(x_i, lambda)||^2$$

I have found in multiple questions/answers (e.g. here) that an estimate for the covariance of the parameters can be computed from the inverse rescaled Hessian at the minimum point:

$$ mathrm{cov}(hatlambda) = hat H^{-1} hatsigma_r^2 = hat H^{-1} frac{sum_i ||y_i – f(x_i, hatlambda)||^2}{N_{DOF}} $$

While I understand why the covariance is related to the inverse Hessian (Fischer information), I haven’t found anywhere a demonstration or explanation for the $hatsigma_r^2$ term, although it appears reasonable to me on intuitive grounds.

Could anybody explain the need for the rescaling by the residual variance and/or provide a reference?


Get this bounty!!!

#StackBounty: #modeling #error #uncertainty #epidemiology #covid-19 Error models for COVID-19 prevalence data

Bounty: 150

1) Some background
I’m currently learning about compartmental ODE modeling for epidemiology (inspired by the current pandemic!) and I’ve been exploring parameter estimation of SIR-like models using optimization tools in Julia. One of the many challenges of modeling COVID-19, I have learned, is that the available time series data sets are probably quite noisy. Infection prevalence and incidence data in particular is probably very noisy, and subject to both under- and over-counting. In my experimentation with parameter estimation in Julia, I’ve found that small to moderate changes in the data point values can sometimes lead to significant changes in the parameter estimates. Consequently, I’ve become interested in modeling the error structure of the observed data so that I can get a better sense of the uncertainty in the parameter values. This leads me to

2) My Question: How can I model reporting error/noise in COVID-19 prevalence data?

By "prevalence", I mean the following definition from Wikipedia:

"Prevalence is a measurement of all individuals affected by the disease at a particular time."

This differs from "incidence" data which, to quote Wikipedia again, is "a measurement of the number of new individuals who contract a disease during a particular period of time".

3) More detailed background
As a simple example, consider the basic SIR (Susceptible-Infected-Recovered) model:

$$frac{dS}{dt} = -beta frac{SI}{N}, qquad frac{dI}{dt} = beta frac{SI}{N} – gamma I, qquad frac{dR}{dt} = gamma I $$

where $N = S(t) + I(t) + R(t) = text{const}$. Let’s say that $t$ is in days. The prevalence on day $t$ would be $I(t)$; that is, $I(t)$ is the number of individuals that are actively infected on day $t$. The daily incidence for day $t$, which I’ll denote by $Delta C_t$, would be given by $Delta C_t = C(t) – C(t-1)$, where $C(t)$ denotes the number of cumulative infections that have occurred by day $t$ (starting from some day $t_0$). (Note that $C(t) = I(t) + R(t)$.) So in other words, $Delta C_t$ is the number of people that became infected in the one day period from $t-1$ to $t$.

When incidence data is available, there are some reasonable ways to model the error structure. For example, letting $Delta C_1^{text{obs}},ldots,Delta C_n^{text{obs}}$ be the observed incidence data and $Delta C_1^{text{true}},ldots,Delta C_n^{text{true}}$ be the "true" (but unobservable) incidence data, one reasonable (IMO) error model would be

$$hspace{2cm} frac{Delta C_t^{text{true}} – Delta C_t^{text{obs}} }{Delta C_t^{text{obs}}} = 1 + epsilon_t, quad text{where } epsilon_t overset{text{iid}}{sim} N(0,sigma^2) qquad (1)$$

for a chosen value of $sigma$. (Perhaps a truncated normal should actually be used–truncating $epsilon_t$ to $[0,infty)$ would ensure that $Delta C_t^{text{true}} geq 0$, which we obviously want.) I find that the above model makes intuitive sense: It says that the relative error in the number of new cases reported on day $t$ is normally distributed with mean $0$ and variance $sigma^2$. I’ve tested out the above model by simulating many sets ${Delta C_t^{text{true}} }_{t=0}^{n}$, fitting the SIR model to the simulated data sets, and then examining the distributions of the parameter estimates for $beta$ and $gamma$. The results I’m getting seem reasonable.

Now I’d like to repeat the procedure I just described, but using prevalence data. In the case of COVID-19, incidence data seems to be the most common type of data reported, but I have a data set I’m interested in that only contains prevalence data. (And I would just like to note: The relevance and importance of my question goes beyond my particular data set, and beyond just COVID. For example, the CDC collects and reports influenza prevalence data as part of its Epidemic Prediction Initiative.) When it comes to modeling error in prevalence data, things are trickier because the daily change in $I(t)$, call it $Delta I_t := I(t) – I(t-1)$, is given by
begin{align*}
Delta I_t &= Delta C_t – Delta R_t,
end{align*}

where $Delta R_t = R(t) – R(t-1)$. In other words:
$$Delta I_t = (# text{ of new infections on day } t) – (# text{ of new recoveries on day } t).$$

Thus, $Delta I_t$ depends on both the incidence of infection and the incidence of recovery. So my thinking is that an error model for $Delta I_t$ should account for over- and under-reporting of both $Delta C_t$ and $Delta R_t$. (So it therefore probably does not make sense to simply substitute $Delta C_t$ with $Delta I_t$ into (1)…or could such a model be justified?)

Herein lies my dilemma: $Delta I_t$ depends on $Delta R_t$, and often there is not available or reliable data for $Delta R_t$. By not "reliable" I mean that in the case of many COVID data sets, people who are 2 weeks post-infection are automatically classified as "recovered" (unless they’ve died, but for this toy model I’m ignoring deaths). Thus, I don’t know if I would be able to simulate "noisy" $Delta R_t$ data.

So to summarize… Is there a reasonable way to model the error in $Delta I_t$ when prevalence data is the only data we have? If so, what are some error models that I could try? (I would also be interested to hear feedback and/or thoughts on error models for incidence data as well.)


Get this bounty!!!

#StackBounty: #r #confidence-interval #uncertainty Uncertainty of a trend

Bounty: 50

I need to estimate the confidence interval of the solar radiation change (in %) over time. For this, I’m going to use R default dataset airquality. Let’s imagine that we know that the measurement error of Solar.R is 30% (UME). Hence, UME can be simulated as a random number from a normal distribution with a mean of 1 and a standard deviation of 0.30. Therefore I calculate 95% confidence intervals on the solar radiation as the difference between the 97.5% and 2.5% quantile of the 1000 simulated values.

library(tidyverse)

# Load dataset and keep only June
df <- airquality %>% 
  as_tibble() %>%
  filter(Month == 6) %>% 
  dplyr::select(day = Day,
                solar = Solar.R)

# Create a function to estimate uncertainties
unc <- function(x){
  
  ume <- rnorm(1000,
               mean = 1,
               sd = .3)
  
  sim <- x * ume
  
  quantile(sim, probs = c(0.025, 0.975)) %>% 
    as.data.frame() %>% 
    rownames_to_column() %>%
    rename(val = 2) %>% 
    pivot_wider(names_from = rowname,
                values_from = val) %>% 
    rename(lower = 1,
           upper = 2)
  
}

# Calculate CI attributed to measuring errors
set.seed(2021)
df_sim <- df %>% 
  group_by(day) %>% 
  nest() %>% 
  mutate(sim = map(data, ~unc(.x$solar))) %>%
  unnest(cols = c(data, sim)) %>% 
  ungroup()

# Plot
df_sim %>% 
  ggplot(aes(x = day,
             y = solar)) +
  geom_ribbon(aes(ymin = lower,
                  ymax = upper),
              alpha = .4) +
  geom_line() +
  geom_smooth(method = "lm",
              se = F)

enter image description here

So there is obviously a negative trend. The main question is, how can I calculate a slope and corresponding slope’s CI considering the measurement uncertainty of Solar.R? I end up with two various solutions. Help me to select the right, please. Or suggest another, more robust way.

Way 1

From googling (here and here), I found that I can get the difference between the lower value at the beginning and the upper in the end, and vice versa. E.g.

df_sim %>% 
  summarize(
    average = mean(solar),
    slope_lower = (first(lower) - last(upper))/30,
    slope_upper = (last(lower) - first(upper))/30
         ) %>% 
  mutate(change_lower = slope_lower / average * 100,
         change_upper = slope_upper / average * 100)
#> # A tibble: 1 x 5
#>   average slope_lower slope_upper change_lower change_upper
#>     <dbl>       <dbl>       <dbl>        <dbl>        <dbl>
#> 1    190.       -3.12       -13.3        -1.64        -6.98

So, from these calculations, I can see that my CI is [-1.64%; -6.98%].

Way 2

Calculate 3 slopes: for original Solar.R, for its upper and lower values. E.g.

df_sim %>% 
  gather(var, val, -day) %>% 
  group_by(var) %>% 
  nest() %>% 
  mutate(mod = map(data,
                   ~lm(val ~ day,
                       data = .x))) %>% 
  mutate(average = map_dbl(data, ~mean(.x$val))) %>% 
  mutate(slope = map_dbl(mod, ~(.x$coefficients[2]))) %>% 
  # Calculate change in %
  mutate(change = slope / 190 * 100) # 190 is an average
#> # A tibble: 3 x 6
#> # Groups:   var [3]
#>   var   data                  mod    average  slope change
#>   <chr> <list>                <list>   <dbl>  <dbl>  <dbl>
#> 1 solar <tibble[,2] [30 x 2]> <lm>     190.   -6.85  -3.61
#> 2 lower <tibble[,2] [30 x 2]> <lm>      78.5  -2.88  -1.52
#> 3 upper <tibble[,2] [30 x 2]> <lm>     303.  -11.1   -5.83

Using this approach, my CI is [-1.52%; -5.83%].


Get this bounty!!!

#StackBounty: #confidence-interval #maximum-likelihood #quantiles #uncertainty #error-propagation Uncertainty propagation for the solut…

Bounty: 50

I have a dataset and I use Maximum Likelihood Estimation to estimate the values of the parameters of a weibull distribution. The MLE theory provides with theoretical Confidence Intervals (asymptotical, or for $n$ samples).

Then, I use the fitted Weibull distribution in an expression which is currently optimised numerically :

$Y(t_0) = h(t_0) . int_{0}^{t_0} S(t) dt + S(t_0)$

Where $t_0$ is unknown and $h$ and $S$ are the hazard function and the survival function of the distribution, and therefore are functions of the parameters.

I would like to propagate uncertainty on the fitted weibull parameters to estimate confidence intervales or quantiles for Y(t_0), how could I do that (numerically or analytically) ?
Thanks !


Get this bounty!!!

#StackBounty: #confidence-interval #cross-validation #modeling #standard-error #uncertainty K-fold cross validation based standard erro…

Bounty: 50

I have an expensive model (or class of models). My baseline approach to quantify uncertainty re the model parameters are hessian based standard errors, and I use k-fold cross validation for model comparison / validation. While a full bootstrap would be pleasant as a more robust uncertainty quantification, this is quite expensive. I think I should also be able to develop expectations for the variance of the leave-k out estimates, to at least get a rough sense of where the hessian based standard error estimates are not performing well. I wonder if someone knows how to do this, or can point to work that does this? Something like an approximate jackknife?


Get this bounty!!!

#StackBounty: #bayesian #uncertainty #high-dimensional #variational-bayes Uncertainty estimation in high-dimensional inference problems…

Bounty: 100

I’m working on a high-dimensional inference problem (around 2000 model parameters) for which we are able to robustly perform MAP estimation by finding the global maximum of the log-posterior using a combination of gradient-based optimisation and a genetic algorithm.

I’d very much like to be able to make some estimate of the uncertainties on the model parameters in addition to finding the MAP estimate.

We are able to efficiently calculate the gradient of the log-posterior with respect to the parameters, so long-term we’re aiming to use Hamiltonian MCMC to do some sampling, but for now I’m interested in non-sampling based estimates.

The only approach I know of is to calculate the inverse of the Hessian at the mode to approximate the posterior as multivariate normal, but even this seems infeasible for such a large system, since even if we calculate the $sim 4times10^{6}$ elements of the Hessian I’m sure we couldn’t find its inverse.

Can anyone suggest what kind of approaches are typically used in cases like this?

Thanks!


Get this bounty!!!

#StackBounty: #regression #scikit-learn #error #gaussian-process #uncertainty Accounting for errors in independent variable through Gau…

Bounty: 50

In Gaussian process regression (GPR), one applies a kernel (i.e. covariance function) to describe the similarity between observed and predicted data in the domain. The diagonal of the covariance function accounts for noise in the dependent or response variable. But is there an analogous method to errors-in-variables models (e.g. total least squares) to account for error in the independent or explanatory variable(s) inherently in GPR?


Get this bounty!!!