#StackBounty: #modeling #error #uncertainty #epidemiology #covid-19 Error models for COVID-19 prevalence data

Bounty: 150

1) Some background
I’m currently learning about compartmental ODE modeling for epidemiology (inspired by the current pandemic!) and I’ve been exploring parameter estimation of SIR-like models using optimization tools in Julia. One of the many challenges of modeling COVID-19, I have learned, is that the available time series are probably quite noisy; infection prevalence and incidence data in particular are subject to both under- and over-counting. In my experimentation with parameter estimation in Julia, I’ve found that small to moderate changes in the data values can sometimes lead to significant changes in the parameter estimates. Consequently, I’ve become interested in modeling the error structure of the observed data so that I can get a better sense of the uncertainty in the parameter values. This leads me to

2) My Question: How can I model reporting error/noise in COVID-19 prevalence data?

By "prevalence", I mean the following definition from Wikipedia:

"Prevalence is a measurement of all individuals affected by the disease at a particular time."

This differs from "incidence" data which, to quote Wikipedia again, is "a measurement of the number of new individuals who contract a disease during a particular period of time".

3) More detailed background
As a simple example, consider the basic SIR (Susceptible-Infected-Recovered) model:

$$\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad \frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$

where $N = S(t) + I(t) + R(t) = \text{const}$. Let’s say that $t$ is in days. The prevalence on day $t$ would be $I(t)$; that is, $I(t)$ is the number of individuals that are actively infected on day $t$. The daily incidence for day $t$, which I’ll denote by $\Delta C_t$, would be given by $\Delta C_t = C(t) - C(t-1)$, where $C(t)$ denotes the number of cumulative infections that have occurred by day $t$ (starting from some day $t_0$). (Note that $C(t) = I(t) + R(t)$.) So in other words, $\Delta C_t$ is the number of people that became infected in the one-day period from $t-1$ to $t$.
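
To make these definitions concrete, here is a minimal Python sketch that integrates the SIR equations with a simple forward-Euler scheme and then derives prevalence, cumulative infections, and daily incidence from the result. The parameter values, population size, and the function name `simulate_sir` are all hypothetical choices for illustration, not estimates for any real outbreak.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=100):
    """Integrate the SIR equations with forward Euler; return daily S, I, R lists."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    dt = 1.0 / steps_per_day
    S, I, R = [s], [i], [r]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_inf = beta * s * i / n * dt   # flow S -> I during this sub-step
            new_rec = gamma * i * dt          # flow I -> R during this sub-step
            s -= new_inf
            i += new_inf - new_rec
            r += new_rec
        S.append(s)
        I.append(i)
        R.append(r)
    return S, I, R


# Hypothetical parameters, chosen only for illustration.
S, I, R = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=60)

prevalence = I                                   # I(t): actively infected on day t
cumulative = [i + r for i, r in zip(I, R)]       # C(t) = I(t) + R(t)
incidence = [cumulative[t] - cumulative[t - 1]   # Delta C_t = C(t) - C(t-1)
             for t in range(1, len(cumulative))]
```

Forward Euler is used only to keep the sketch dependency-free; a proper ODE solver would be the more usual choice for actual fitting.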

When incidence data is available, there are some reasonable ways to model the error structure. For example, letting $\Delta C_1^{\text{obs}},\ldots,\Delta C_n^{\text{obs}}$ be the observed incidence data and $\Delta C_1^{\text{true}},\ldots,\Delta C_n^{\text{true}}$ be the "true" (but unobservable) incidence data, one reasonable (IMO) error model would be

$$\hspace{2cm} \frac{\Delta C_t^{\text{true}} - \Delta C_t^{\text{obs}}}{\Delta C_t^{\text{obs}}} = 1 + \epsilon_t, \quad \text{where } \epsilon_t \overset{\text{iid}}{\sim} N(0,\sigma^2) \qquad (1)$$

for a chosen value of $\sigma$. (Perhaps a truncated normal should actually be used; truncating $\epsilon_t$ to $[0,\infty)$ would ensure that $\Delta C_t^{\text{true}} \geq 0$, which we obviously want.) I find that the above model makes intuitive sense: It says that the relative error in the number of new cases reported on day $t$ is normally distributed with mean $0$ and variance $\sigma^2$. I’ve tested out the above model by simulating many sets $\{\Delta C_t^{\text{true}}\}_{t=0}^{n}$, fitting the SIR model to the simulated data sets, and then examining the distributions of the parameter estimates for $\beta$ and $\gamma$. The results I’m getting seem reasonable.
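
The simulation step described above could look roughly like the following Python sketch. It follows the prose reading of the model, i.e. that the relative error itself is a mean-zero normal draw, so that the simulated "true" count is the observed count times $(1 + \epsilon_t)$; that reading is an assumption about the intent. The truncation bound is left as a parameter (`lower`), with $-1$ as the default because that is the smallest bound that keeps the simulated counts non-negative under this reading. The observed counts and the function names are made up for illustration.

```python
import random


def truncated_normal(rng, sigma, lower):
    """Draw from N(0, sigma^2) conditioned on the draw being >= lower (rejection sampling)."""
    while True:
        eps = rng.gauss(0.0, sigma)
        if eps >= lower:
            return eps


def simulate_true_incidence(observed, sigma, lower=-1.0, seed=None):
    """One simulated 'true' incidence series: obs_t * (1 + eps_t), eps_t truncated at `lower`."""
    rng = random.Random(seed)
    return [obs * (1.0 + truncated_normal(rng, sigma, lower)) for obs in observed]


# Made-up observed daily incidence, for illustration only.
observed = [12, 18, 25, 31, 40, 38, 52]

# 1000 simulated data sets; each would then be passed to the SIR fitting routine.
replicates = [simulate_true_incidence(observed, sigma=0.2, seed=k) for k in range(1000)]
```

Rejection sampling is simple but becomes slow if `lower` sits far into the upper tail; for bounds such as $-1$ or $0$ it is adequate.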

Now I’d like to repeat the procedure I just described, but using prevalence data. In the case of COVID-19, incidence data seems to be the most common type of data reported, but I have a data set I’m interested in that only contains prevalence data. (And I would just like to note: The relevance and importance of my question goes beyond my particular data set, and beyond just COVID. For example, the CDC collects and reports influenza prevalence data as part of its Epidemic Prediction Initiative.) When it comes to modeling error in prevalence data, things are trickier because the daily change in $I(t)$, call it $\Delta I_t := I(t) - I(t-1)$, is given by
$$\Delta I_t = \Delta C_t - \Delta R_t,$$

where $\Delta R_t = R(t) - R(t-1)$. In other words:
$$\Delta I_t = (\#\text{ of new infections on day } t) - (\#\text{ of new recoveries on day } t).$$

Thus, $\Delta I_t$ depends on both the incidence of infection and the incidence of recovery. So my thinking is that an error model for $\Delta I_t$ should account for over- and under-reporting of both $\Delta C_t$ and $\Delta R_t$. (So it probably does not make sense to simply substitute $\Delta I_t$ for $\Delta C_t$ in (1)… or could such a model be justified?)

Herein lies my dilemma: $\Delta I_t$ depends on $\Delta R_t$, and often there is no available or reliable data for $\Delta R_t$. By "not reliable" I mean that in the case of many COVID data sets, people who are 2 weeks post-infection are automatically classified as "recovered" (unless they’ve died, but for this toy model I’m ignoring deaths). Thus, I don’t know if I would be able to simulate "noisy" $\Delta R_t$ data.
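
To illustrate what the automatic two-week rule implies, here is a small Python sketch. It assumes that every reported case is reclassified as recovered exactly 14 days after being reported and that nobody was infected before the first day of the series (both are simplifying assumptions), and it uses made-up incidence numbers. Under that rule, reported recoveries are just lagged incidence, so reported prevalence reduces to a 14-day rolling sum of incidence.

```python
def prevalence_from_incidence(incidence, recovery_lag=14):
    """Reported recoveries, daily change in prevalence, and prevalence under a fixed-lag rule."""
    # Delta R_t: cases reported `recovery_lag` days ago are reclassified as recovered today.
    recoveries = [incidence[t - recovery_lag] if t >= recovery_lag else 0
                  for t in range(len(incidence))]
    # Delta I_t = Delta C_t - Delta R_t
    delta_i = [c - r for c, r in zip(incidence, recoveries)]
    # I(t): running sum of Delta I_t, starting from zero infected before day 0.
    prevalence = []
    running = 0
    for d in delta_i:
        running += d
        prevalence.append(running)
    return recoveries, delta_i, prevalence


# Made-up daily incidence, for illustration only.
incidence = [5, 8, 13, 21, 30, 42, 55, 60, 58, 50,
             41, 33, 26, 20, 15, 11, 8, 6, 4, 3]
recoveries, delta_i, prevalence = prevalence_from_incidence(incidence)
```

One consequence, if the rule really is applied this mechanically, is that a reporting error in $\Delta C_t$ enters $\Delta I_t$ on day $t$ and again with the opposite sign on day $t+14$, so the errors in $\Delta I_t$ would not be independent across days.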

So to summarize… Is there a reasonable way to model the error in $\Delta I_t$ when prevalence data is the only data we have? If so, what are some error models that I could try? (I would also be interested to hear feedback and/or thoughts on error models for incidence data as well.)


#StackBounty: #r #estimation #optimization #epidemiology #differential-equations SIR: parameter estimation and optimization here (R)

Bounty: 100

I extracted the COVID-19 data for Israel from https://ourworldindata.org/coronavirus/country/israel and, after some manipulation, obtained a plot of the daily new infections in Israel.
Suppose I want to create a SIR or SEIR model, e.g.

 dS <- -beta/N * I * S
 dI <- beta/N * I * S - gamma * I
 dR <- gamma * I

but let’s say that beta cannot be constant over the whole period and is instead time-dependent, i.e. $\beta = \beta(\tau)$. $\beta$ does not need to change every day; it would be better if it stayed constant over each week or two-week period (therefore I use $\tau$ instead of $t$). What would be efficient ways to estimate $\beta(\tau)$ from the number of daily new infections? (For the sake of simplicity, let gamma be constant over the whole time.)
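
One way to make the weekly-constant $\beta(\tau)$ idea concrete is sketched below in Python (the same logic should carry over to R). It assumes a daily-step discrete version of the SIR equations above, a known constant gamma, and noise-free counts. The population size, initial infections, and beta values are hypothetical, and `simulate_daily_new_cases` and `estimate_block_betas` are illustrative names rather than library functions. The estimator simply inverts the infection term: on each day, beta is approximately (new cases) × N / (S × I), and the daily values are averaged within each block.

```python
def simulate_daily_new_cases(betas, gamma, n, i0, block=7):
    """Daily-step discrete SIR with beta held constant over blocks of `block` days."""
    s, i = float(n - i0), float(i0)
    new_cases = []
    for day in range(len(betas) * block):
        beta = betas[day // block]      # piecewise-constant transmission rate
        inf = beta * s * i / n          # new infections on this day
        rec = gamma * i                 # new recoveries on this day
        s -= inf
        i += inf - rec
        new_cases.append(inf)
    return new_cases


def estimate_block_betas(new_cases, gamma, n, i0, block=7):
    """Recover one beta per block by inverting new_cases_t = beta_t * S_t * I_t / N."""
    s, i = float(n - i0), float(i0)
    daily = []
    for c in new_cases:
        daily.append(c * n / (s * i))   # beta implied by this day's count
        s -= c                          # rebuild S and I from the counts themselves
        i += c - gamma * i
    return [sum(daily[k:k + block]) / len(daily[k:k + block])
            for k in range(0, len(daily), block)]


# Hypothetical weekly transmission rates and population, for illustration only.
true_betas = [0.40, 0.25, 0.15, 0.30]
cases = simulate_daily_new_cases(true_betas, gamma=0.1, n=1_000_000, i0=100)
estimated = estimate_block_betas(cases, gamma=0.1, n=1_000_000, i0=100)
```

With real, noisy counts the daily ratios would jitter considerably, so smoothing the data first or fitting the block values by least squares is likely needed; this sketch only shows the inversion itself.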

I have thought of approximating the infection curve given below by polynomials or trigonometric functions, but I do not know what to do after that, because the fitted functions would not be transmission rates at a given time $\tau$. What do I have to do to obtain the transmission rates from the fitted piecewise functions, or should I use a different approach altogether?

I would like to solve it in R. Ideally I would also like to optimize the solution so that it fits the data as closely as possible.

Any suggestions and help are highly appreciated.

(the data can be found here https://covid.ourworldindata.org/data/owid-covid-data.csv)

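 # Load the OWID data set (URL given above) and keep the rows that mention Israel
 Covid <- read.csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
 covid <- Covid[apply(Covid,1,function(x) {any(c("Israel") %in% x)}),]
 # Drop two identifier columns (positions depend on the file's column layout)
 covid[,1] <- NULL
 covid[,2] <- NULL
 covid <- as.data.frame(covid)
 # First differences of column 4, used as the daily new cases
 newcases <- diff(covid[,4])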

[Figure: plot of the daily new infections in Israel]


#StackBounty: #dataset #epidemiology Fatality Rate for SARS-CoV-2

Bounty: 50

I am sure many people have been reviewing the data about the SARS-CoV-2 epidemic. One of my main sources is at Worldometer. My question is geared toward a specific statistic they provide, namely the recovered/discharged versus deaths for closed cases.

Although there is currently some variability in the estimate of the fatality rate of SARS-CoV-2 (i.e. ~3% to ~14%), these estimates are much lower than the percentage of deaths relative to recovered/discharged for closed cases (~21%).

My understanding is that the latter number should be considered the more accurate estimate of how deadly the contagion is since only the closed cases represent terminal states of an individual. If we introduce cases that are still ongoing or the population which has not contracted the disease then, to me, this seems misleading.
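
The two statistics being compared differ only in their denominator, which a small Python sketch can make explicit. The counts below are hypothetical round numbers, not Worldometer figures, and `fatality_ratios` is an illustrative name.

```python
def fatality_ratios(deaths, recovered, active):
    """Return (deaths / all confirmed cases, deaths / closed cases)."""
    total_cases = deaths + recovered + active   # every confirmed case, open or closed
    closed_cases = deaths + recovered           # only cases with a known outcome
    return deaths / total_cases, deaths / closed_cases


# Hypothetical counts, for illustration only.
per_case, per_closed = fatality_ratios(deaths=30, recovered=120, active=850)
# per_case is 30/1000 = 0.03 and per_closed is 30/150 = 0.20

# Once no active cases remain, the two ratios coincide.
final_case, final_closed = fatality_ratios(deaths=30, recovered=970, active=0)
```

The gap between the two numbers is driven entirely by the cases that are still open, which is why it is largest while an outbreak is still growing.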

However, the majority of individuals and reporting I have seen use the former statistics and so I feel as if I am missing a key insight. I am hoping that the community could help clarify this for me.

Also, it was challenging to determine which forum would be the best for this question. If there is a better one, please let me know and I will repost at that location.

Thank you very much for your time and stay safe!


#StackBounty: #machine-learning #survival #causality #difference-in-difference #epidemiology How can I update a disease prediction mode…

Bounty: 50


  • I have a prediction model which predicts the probability of getting a disease.
  • This prediction model was built on data from patients who did not get any form of treatment.
  • I use this model on new patients. Patients who have a probability higher than X of getting the disease will be treated. Patients with a probability lower than X will not be treated. The treatment lowers the probability by Y.


I want to update my prediction model with the data of the new patients. What is the best way to do that?

The problem

You cannot simply add the new patients’ data to the original data and then retrain the model, because it will change “the causality” of the model: the treatment will interfere with the rest of the variables of the prediction model. I hope the example below will clarify this statement:


Originally we created a logistic regression model to predict getting lung cancer (yes/no). We used the variables age, family history of lung cancer (yes/no), gender, currently smoking (yes/no), and smoking history (yes/no).

We used this model on new patients. One new patient is a smoker with a probability higher than X of getting lung cancer, so you give him treatment (“lung cancer chance reduction pills”), and the patient ends up not getting lung cancer. Now we would like to use the data of this patient to update the original model.

However, if we add the data of this patient (smoker) to the original model with the outcome ‘not getting lung cancer’, the model will be biased toward the idea that smoking → not getting lung cancer, which is incorrect.
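
The bias described in this example can be reproduced with a small simulation. The Python sketch below uses made-up risks (30% for smokers, 5% for non-smokers), a made-up threshold X = 0.2 and treatment effect Y = 0.2, and a single predictor to keep things short; the function names are illustrative. It compares the observed disease rate among smokers in the original untreated data with the rate after naively pooling in the new, partly treated patients.

```python
import random


def simulate_cohort(n, treat_threshold=None, effect=0.0, seed=0):
    """Simulate (smoker, treated, disease) rows; treat when the untreated risk exceeds the threshold."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.5
        risk = 0.30 if smoker else 0.05            # hypothetical untreated risks
        treated = treat_threshold is not None and risk > treat_threshold
        if treated:
            risk -= effect                         # treatment lowers the probability by Y
        disease = rng.random() < risk
        rows.append((smoker, treated, disease))
    return rows


def disease_rate(rows, smoker):
    """Observed disease rate among rows with the given smoking status."""
    outcomes = [d for s, _, d in rows if s == smoker]
    return sum(outcomes) / len(outcomes)


original = simulate_cohort(20000, seed=1)                                  # nobody treated
new = simulate_cohort(20000, treat_threshold=0.2, effect=0.2, seed=2)      # smokers get treated
pooled = original + new                                                    # naive update

rate_original = disease_rate(original, smoker=True)
rate_pooled = disease_rate(pooled, smoker=True)
```

In this setup the pooled rate for smokers falls well below the untreated rate, even though smoking is exactly as harmful as before; any model refitted on the pooled data without a treatment indicator would absorb that drop into the smoking effect.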

What is the best way to add new patient data to the model while keeping any ‘causal’ relationships?

EDIT: To show that updating prediction models with ‘treated’ patients is a more theoretical methodological problem I shall add an example which I encountered in a business setting:

Originally we created a logistic regression model to predict whether a customer would stop his mobile phone subscription (yes/no). We used the variables age, number of sent texts, number of calls, years having a subscription, and internet usage.

We used this model on new customers. One new customer has decreasing internet usage (a sign of stopping the subscription) and a probability higher than X of stopping the subscription. You ‘treat’ this customer (“call him/her and offer a discount”) and the customer ends up not stopping the subscription. Now we would like to use the data of this customer to update the original model.

However, if we add the data of this customer (low internet usage) to the original model with the outcome ‘not stopping his subscription’, the model will be biased toward the idea that low internet usage → not stopping his subscription, which is incorrect.
