*Bounty: 150*

**1) Some background**

I’m currently learning about compartmental ODE modeling for epidemiology (inspired by the current pandemic!) and I’ve been exploring parameter estimation of SIR-like models using optimization tools in Julia. One of the many challenges of modeling COVID-19, I have learned, is that the available time series data sets are probably quite noisy. Infection prevalence and incidence data in particular are probably very noisy, and subject to both under- and over-counting. In my experimentation with parameter estimation in Julia, I’ve found that small to moderate changes in the data values can sometimes lead to significant changes in the parameter estimates. Consequently, I’ve become interested in modeling the error structure of the observed data so that I can get a better sense of the uncertainty in the parameter values. This leads me to

**2) My Question:** How can I model reporting error/noise in COVID-19 prevalence data?

By "prevalence", I mean the following definition from Wikipedia:

"Prevalence is a measurement of all individuals affected by the disease at a particular time."

This differs from "incidence" data which, to quote Wikipedia again, is "a measurement of the number of new individuals who contract a disease during a particular period of time".

**3) More detailed background**

As a simple example, consider the basic SIR (Susceptible-Infected-Recovered) model:

$$\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad \frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I $$

where $N = S(t) + I(t) + R(t) = \text{const}$. Let’s say that $t$ is in days. The *prevalence* on day $t$ would be $I(t)$; that is, $I(t)$ is the number of individuals that are actively infected on day $t$. The daily *incidence* for day $t$, which I’ll denote by $\Delta C_t$, would be given by $\Delta C_t = C(t) - C(t-1)$, where $C(t)$ denotes the number of cumulative infections that have occurred by day $t$ (starting from some day $t_0$). (Note that $C(t) = I(t) + R(t)$.) So in other words, $\Delta C_t$ is the number of people that became infected in the one day period from $t-1$ to $t$.
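To make the definitions above concrete, here is a minimal sketch (in plain Python rather than Julia, with no external packages) that integrates the SIR equations with a hand-written classical Runge-Kutta step and then reads off the daily prevalence $I(t)$, the cumulative count $C(t) = I(t) + R(t)$, and the daily incidence $\Delta C_t$. The parameter values, initial conditions, and function names are made up for illustration only.

```python
def sir_rhs(state, beta, gamma):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    s, i, r = state
    n = s + i + r
    new_infections = beta * s * i / n
    new_recoveries = gamma * i
    return (-new_infections, new_infections - new_recoveries, new_recoveries)


def rk4_step(state, h, beta, gamma):
    """Advance the state by one classical Runge-Kutta step of size h (days)."""
    def shifted(k, a):
        return tuple(y + a * ki for y, ki in zip(state, k))

    k1 = sir_rhs(state, beta, gamma)
    k2 = sir_rhs(shifted(k1, h / 2), beta, gamma)
    k3 = sir_rhs(shifted(k2, h / 2), beta, gamma)
    k4 = sir_rhs(shifted(k3, h), beta, gamma)
    return tuple(
        y + h / 6 * (a + 2 * b + 2 * c + d)
        for y, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_daily(beta, gamma, s0, i0, r0, days, steps_per_day=50):
    """Return the list of (S, I, R) states at t = 0, 1, ..., days."""
    state = (float(s0), float(i0), float(r0))
    daily = [state]
    for _day in range(days):
        for _step in range(steps_per_day):
            state = rk4_step(state, 1.0 / steps_per_day, beta, gamma)
        daily.append(state)
    return daily


# Illustrative (made-up) parameters: beta = 0.4/day, gamma = 0.1/day, N = 10000.
traj = simulate_daily(beta=0.4, gamma=0.1, s0=9990, i0=10, r0=0, days=60)

prevalence = [i for _s, i, _r in traj]        # I(t)
cumulative = [i + r for _s, i, r in traj]     # C(t) = I(t) + R(t)
incidence = [cumulative[t] - cumulative[t - 1] for t in range(1, len(traj))]  # Delta C_t
```

Here `incidence[t - 1]` holds $\Delta C_t$ for $t = 1, \ldots, 60$, and `prevalence[t]` holds $I(t)$.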

When incidence data is available, there are some reasonable ways to model the error structure. For example, letting $\Delta C_1^{\text{obs}},\ldots,\Delta C_n^{\text{obs}}$ be the observed incidence data and $\Delta C_1^{\text{true}},\ldots,\Delta C_n^{\text{true}}$ be the "true" (but unobservable) incidence data, one reasonable (IMO) error model would be

$$\hspace{2cm} \frac{\Delta C_t^{\text{true}} - \Delta C_t^{\text{obs}} }{\Delta C_t^{\text{obs}}} = 1 + \epsilon_t, \quad \text{where } \epsilon_t \overset{\text{iid}}{\sim} N(0,\sigma^2) \qquad (1)$$

for a chosen value of $\sigma$. (Perhaps a truncated normal should actually be used: truncating $\epsilon_t$ to $[0,\infty)$ would ensure that $\Delta C_t^{\text{true}} \geq 0$, which we obviously want.) I find that the above model makes intuitive sense: It says that the relative error in the number of new cases reported on day $t$ is normally distributed with mean $0$ and variance $\sigma^2$. I’ve tested out the above model by simulating many sets $\{\Delta C_t^{\text{true}} \}_{t=0}^{n}$, fitting the SIR model to the simulated data sets, and then examining the distributions of the parameter estimates for $\beta$ and $\gamma$. The results I’m getting seem reasonable.
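The simulation step of that procedure can be sketched roughly as follows (in Python, standard library only). The sketch takes the reading of (1) given in the prose, namely that the relative error itself is $\epsilon_t$ with mean zero, so that $\Delta C_t^{\text{true}} = \Delta C_t^{\text{obs}}(1 + \epsilon_t)$; if the constant on the right-hand side of (1) is meant literally, the multiplicative factor changes accordingly. The observed series, the value of $\sigma$, the helper names, and the truncation bound are all illustrative assumptions. Under this reading, a lower bound of $-1$ is the weakest truncation that keeps the simulated counts nonnegative, and `lower=0.0` gives the truncation to $[0,\infty)$ mentioned above.

```python
import random


def sample_truncated_normal(sigma, lower, rng):
    """Draw from N(0, sigma^2) conditioned on the value being >= lower.

    Uses simple rejection sampling, which is fine when `lower` is not far
    into the upper tail of the distribution.
    """
    while True:
        eps = rng.gauss(0.0, sigma)
        if eps >= lower:
            return eps


def simulate_true_incidence(observed, sigma, rng, lower=-1.0):
    """One simulated 'true' series: true_t = obs_t * (1 + eps_t)."""
    return [
        obs * (1.0 + sample_truncated_normal(sigma, lower, rng))
        for obs in observed
    ]


rng = random.Random(0)                       # fixed seed for reproducibility
observed = [12, 18, 25, 40, 61, 90, 130]     # made-up daily incidence counts
replicates = [
    simulate_true_incidence(observed, sigma=0.2, rng=rng)
    for _ in range(1000)
]
```

Each element of `replicates` would then be passed to whatever fitting routine is in use (not shown here), and the resulting estimates of $\beta$ and $\gamma$ collected across replicates.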

Now I’d like to repeat the procedure I just described, but using prevalence data. In the case of COVID-19, incidence data seems to be the most common type of data reported, but I have a data set I’m interested in that only contains prevalence data. (And I would just like to note: The relevance and importance of my question goes beyond my particular data set, and beyond just COVID. For example, the CDC collects and reports influenza prevalence data as part of its Epidemic Prediction Initiative.) When it comes to modeling error in prevalence data, things are trickier because the daily change in $I(t)$, call it $\Delta I_t := I(t) - I(t-1)$, is given by

\begin{align*}
\Delta I_t &= \Delta C_t - \Delta R_t,
\end{align*}

where $\Delta R_t = R(t) - R(t-1)$. In other words:

$$\Delta I_t = (\# \text{ of new infections on day } t) - (\# \text{ of new recoveries on day } t).$$

Thus, $\Delta I_t$ depends on both the incidence of infection and the incidence of recovery. So my thinking is that an error model for $\Delta I_t$ should account for over- and under-reporting of both $\Delta C_t$ and $\Delta R_t$. (So it probably does not make sense to simply replace $\Delta C_t$ with $\Delta I_t$ in (1)…or could such a model be justified?)
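As a quick numerical illustration of the decomposition $\Delta I_t = \Delta C_t - \Delta R_t$, the Python sketch below steps the SIR model forward with a crude forward-Euler scheme (chosen only for brevity, not accuracy) and compares the three daily differences. The parameter values and names are made up. One thing it shows is that $\Delta C_t$ stays positive while $\Delta I_t$ changes sign around the epidemic peak, so a relative-error model with $\Delta I_t^{\text{obs}}$ in the denominator would be dividing by values near zero there.

```python
def daily_sir_euler(beta, gamma, n, i0, days, substeps=1000):
    """Crude forward-Euler SIR integration; returns (S, I, R) at t = 0..days."""
    s, i, r = float(n - i0), float(i0), 0.0
    h = 1.0 / substeps
    daily = [(s, i, r)]
    for _day in range(days):
        for _step in range(substeps):
            new_inf = beta * s * i / n * h   # infections in this substep
            new_rec = gamma * i * h          # recoveries in this substep
            s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        daily.append((s, i, r))
    return daily


# Illustrative (made-up) parameters.
states = daily_sir_euler(beta=0.4, gamma=0.1, n=10000, i0=10, days=60)

delta_c, delta_r, delta_i = [], [], []
for t in range(1, len(states)):
    _s_prev, i_prev, r_prev = states[t - 1]
    _s_now, i_now, r_now = states[t]
    delta_c.append((i_now + r_now) - (i_prev + r_prev))  # new infections
    delta_r.append(r_now - r_prev)                       # new recoveries
    delta_i.append(i_now - i_prev)                       # change in prevalence

# Largest discrepancy between Delta I_t and Delta C_t - Delta R_t (round-off only).
max_gap = max(abs(di - (dc - dr)) for di, dc, dr in zip(delta_i, delta_c, delta_r))

# Days on which prevalence is falling even though new infections still occur.
falling_days = [t + 1 for t, di in enumerate(delta_i) if di < 0]
```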

**Herein lies my dilemma**: $\Delta I_t$ depends on $\Delta R_t$, and often there is no available or reliable data for $\Delta R_t$. By "not reliable" I mean that in the case of many COVID data sets, people who are 2 weeks post-infection are automatically classified as "recovered" (unless they’ve died, but for this toy model I’m ignoring deaths). Thus, I don’t know if I would be able to simulate "noisy" $\Delta R_t$ data.
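To spell out what the automatic two-week rule implies, here is a small Python sketch under the strong assumption that the rule is applied exactly: every case reported on day $t$ is reclassified as recovered on day $t + 14$, and there are no cases before the start of the series. In that case the reported recoveries are just lagged incidence, $\Delta R_t = \Delta C_{t-14}$, and the reported prevalence is a 14-day rolling sum of incidence. The function names and the example counts are hypothetical.

```python
def implied_recoveries(incidence, recovery_days=14):
    """Reported recoveries if each case 'recovers' exactly recovery_days later."""
    return [
        incidence[t - recovery_days] if t >= recovery_days else 0
        for t in range(len(incidence))
    ]


def reported_prevalence(incidence, recovery_days=14):
    """Active cases under the same rule: rolling sum over the last recovery_days days."""
    prevalence = []
    for t in range(len(incidence)):
        window_start = max(0, t - recovery_days + 1)
        prevalence.append(sum(incidence[window_start:t + 1]))
    return prevalence


# Made-up daily incidence counts for 20 days.
incidence = [3, 5, 8, 12, 20, 31, 45, 60, 72, 80,
             85, 83, 78, 70, 61, 50, 41, 33, 26, 20]

recoveries = implied_recoveries(incidence)
prevalence = reported_prevalence(incidence)

# Daily change in reported prevalence: Delta I_t = Delta C_t - Delta R_t.
delta_i = [prevalence[t] - prevalence[t - 1] for t in range(1, len(prevalence))]
```

If a data set really follows such a rule, its $\Delta R_t$ column carries no information beyond the lagged incidence, which is one way of seeing why it is hard to treat as an independent noisy observation.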

*So to summarize…* Is there a reasonable way to model the error in $\Delta I_t$ when prevalence data is the only data we have? If so, what are some error models that I could try? (I would also be interested to hear feedback and/or thoughts on error models for incidence data as well.)