The underreported model

Underreporting in biomedical research has been widely studied, at least since the work of Bernard et al. in 2014 [2]. A slightly more recent reference is Fernández-Fontelo et al. in 2016 [4], where a very simple mechanism is proposed to model the phenomenon. Consider a hidden process $X_n$ (the real daily number of positive cases, which is not observable because the entire population of interest is not tested daily) with an integer-valued autoregressive structure of order 1 and Poisson innovations, Po-INAR(1):

$X_n=\alpha \circ X_{n-1}+W_n(\lambda)$

where $0<\alpha<1$ is a fixed parameter, the innovations $W_n \sim$ Poisson($\lambda$) are i.i.d. and independent of $X_{n-1}$, and $\circ$ is the binomial thinning operator:

$\alpha \circ X_{n-1}=\sum_{i=1}^{X_{n-1}}Z_i$

with $Z_i$ i.i.d. Bernoulli($\alpha$) random variables.
The INAR(1) process is a homogeneous Markov chain with transition probabilities

$P(X_n = i \mid X_{n-1} = j) = \sum_{k=0}^{i\wedge j} \binom{j}{k} \alpha^k (1-\alpha)^{j-k} P(W_n=i-k)$

The expectation and variance of the binomial thinning operator are

$\textrm{{\bf E}}\left(\alpha \circ X_{n-1}|X_{n-1}=x_{n-1}\right)=\alpha x_{n-1}$

$\textrm{{\bf Var}}\left(\alpha \circ X_{n-1}|X_{n-1}=x_{n-1}\right)=\alpha (1-\alpha)x_{n-1}$
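The recursion and the thinning operator can be illustrated with a short simulation. This is only a minimal sketch using the Python standard library: the function names (`poisson_draw`, `thin`, `simulate_inar1`) and the parameter values are assumptions chosen for illustration, and the chain is started from a Poisson($\lambda/(1-\alpha)$) draw, the stationary law of the Po-INAR(1) process.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin(alpha, x, rng):
    """Binomial thinning: sum of x i.i.d. Bernoulli(alpha) variables."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def simulate_inar1(alpha, lam, n, rng):
    """Simulate X_1..X_n with X_t = alpha o X_{t-1} + W_t, W_t ~ Poisson(lam)."""
    # Assumed starting point: a draw from Poisson(lam / (1 - alpha)).
    x = poisson_draw(lam / (1.0 - alpha), rng)
    path = [x]
    for _ in range(n - 1):
        x = thin(alpha, x, rng) + poisson_draw(lam, rng)
        path.append(x)
    return path


rng = random.Random(0)
xs = simulate_inar1(alpha=0.5, lam=2.0, n=20000, rng=rng)
sample_mean = sum(xs) / len(xs)
print(sample_mean)  # compare with lam / (1 - alpha) = 4
```

With these illustrative values the sample mean should lie close to $\lambda/(1-\alpha)=4$.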

A simple underreporting scheme

The underreporting phenomenon is modeled assuming that the observed counts have the following form:

(1) $\begin{align*} Y_n = \left\{ \begin{array}{ll} X_n & \text{with probability } 1-\omega\\ q \circ X_n & \text{with probability } \omega, \end{array} \right.\nonumber \end{align*}$

where $\omega$ and $q$ represent the frequency and intensity of the underreporting process. That is, for each $n$, $X_n$ is observed with probability $1-\omega$, and a $q$-thinning of $X_n$ is observed with probability $\omega$, independently of the past $\{X_j: j \leq n\}$. Therefore, what is observed (the reported counts) is

$Y_n = (1- \mathbf{1}_n)X_n + \mathbf{1}_n \sum_{j=1}^{X_n} \xi_j \quad\quad \mathbf{1}_n\sim\mbox{Bern}(\omega), \quad \xi_j\sim\mbox{Bern}(q)$

This model will most likely need to be modified so that both $q$ and $\lambda$ depend on time.
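For fixed $\omega$ and $q$, the observation scheme can be sketched in a few lines. The function name `underreport`, the constant hidden series and the parameter values below are hypothetical, chosen only so that the effect on the mean is easy to check by hand.

```python
import random


def underreport(xs, omega, q, rng):
    """Return observed counts Y_n for hidden counts X_n (a list of ints)."""
    ys = []
    for x in xs:
        if rng.random() < omega:
            # Underreported time point: keep each case with probability q.
            ys.append(sum(1 for _ in range(x) if rng.random() < q))
        else:
            # Fully reported time point.
            ys.append(x)
    return ys


rng = random.Random(1)
hidden = [10] * 50000  # hypothetical constant hidden series, for illustration only
observed = underreport(hidden, omega=0.4, q=0.25, rng=rng)
mean_y = sum(observed) / len(observed)
print(mean_y)  # compare with 10 * (1 - 0.4 * (1 - 0.25)) = 7
```

Since a hidden count $x$ is reported in full with probability $1-\omega$ and thinned to an average of $qx$ otherwise, the observed mean should be close to $x(1-\omega(1-q))$, here $7$.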

Model properties

The expectation and variance of the INAR(1) process $X_n$ with Poisson($\lambda$) innovations are $\mu_X=\sigma_X^2=\frac{\lambda}{1-\alpha}$. The autocovariances and autocorrelations are

$\gamma_X(k)=\alpha^{|k|}\mu_X=\alpha^{|k|}\frac{\lambda}{1-\alpha} \textrm{ and }\rho_{X}(k)=\alpha^{|k|}$

respectively. On the other hand, $\textrm{{\bf E}}(Y_n) =\mu_{Y}=\mu_X(1-\omega(1-q))$. The autocovariance of the observed process $Y_n$ is

$\gamma_Y(k)=(1-\omega(1-q))^2\alpha^{|k|}\mu_X$

Thus, the autocorrelation function of $Y_n$ is a multiple of $\rho_{X}(k)$:

$\rho_Y(k)=\frac{(1-\alpha)(1-\omega(1-q))^2}{(1-\alpha)(1-\omega(1-q))+\lambda(\omega(1-\omega)(1-q)^2)}\alpha^{|k|}=c(\alpha,\lambda,\omega,q)\alpha^{|k|}.$
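The constant $c(\alpha,\lambda,\omega,q)$ is easy to evaluate numerically. The following sketch transcribes the formulas for $\mu_Y$ and $\rho_Y(k)$; the function names and the example parameter values are assumptions made for illustration.

```python
def observed_mean(alpha, lam, omega, q):
    """mu_Y = mu_X * (1 - omega * (1 - q)), with mu_X = lam / (1 - alpha)."""
    return lam / (1.0 - alpha) * (1.0 - omega * (1.0 - q))


def attenuation(alpha, lam, omega, q):
    """The factor c(alpha, lam, omega, q) multiplying alpha^|k| in rho_Y(k)."""
    shrink = 1.0 - omega * (1.0 - q)
    numerator = (1.0 - alpha) * shrink ** 2
    denominator = ((1.0 - alpha) * shrink
                   + lam * omega * (1.0 - omega) * (1.0 - q) ** 2)
    return numerator / denominator


def rho_y(k, alpha, lam, omega, q):
    """Autocorrelation of the observed process at lag k."""
    return attenuation(alpha, lam, omega, q) * alpha ** abs(k)


# Illustrative parameter values (not taken from any data set).
c = attenuation(alpha=0.5, lam=2.0, omega=0.4, q=0.25)
print(c, rho_y(1, 0.5, 2.0, 0.4, 0.25))
```

When $\omega=0$ or $q=1$ there is no underreporting, the factor equals $1$, and $\rho_Y(k)=\rho_X(k)$.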

The form of the expected value of $Y_n$ also makes it possible to obtain predictions for $Y_{n+k}$ given the observations up to $Y_n$.

Parameter estimation

The marginal distribution of $Y_n$ is a mixture of two Poisson distributions.

(2) $\begin{align*} Y_n &\sim \begin{cases} \textrm{Poisson}\left(\frac{\lambda}{1-\alpha}\right) &\quad \mbox{with probability } 1-\omega,\\ \textrm{Poisson}\left(\frac{q\lambda}{1-\alpha}\right) &\quad \mbox{with probability } \omega.\nonumber \end{cases} \end{align*}$
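As a quick numerical check of this mixture, its probability mass function can be written directly. This is a sketch with assumed names (`poisson_pmf`, `marginal_pmf`) and illustrative parameter values; a Poisson component with mean $0$ is treated as a point mass at zero.

```python
import math


def poisson_pmf(y, mean):
    """Poisson pmf at y; a mean of 0 is treated as a point mass at 0."""
    if mean == 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(mean) - mean - math.lgamma(y + 1))


def marginal_pmf(y, alpha, lam, omega, q):
    """P(Y_n = y) as a two-component Poisson mixture."""
    mu_x = lam / (1.0 - alpha)
    return ((1.0 - omega) * poisson_pmf(y, mu_x)
            + omega * poisson_pmf(y, q * mu_x))


# Illustrative values: the pmf should sum to 1 and have mean mu_X * (1 - omega * (1 - q)).
total = sum(marginal_pmf(y, 0.5, 2.0, 0.4, 0.25) for y in range(200))
mean = sum(y * marginal_pmf(y, 0.5, 2.0, 0.4, 0.25) for y in range(200))
print(total, mean)
```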

When $q=0$, the distribution of $\{Y_n\}$ is a zero-inflated Poisson. From the mixture we can obtain preliminary estimates of the parameters, which are later used in computing the maximum likelihood estimators. The likelihood of $Y$ is very complicated,

$P(Y)=\sum_XP(X,Y)=\sum_xP(Y|X=x)P(X=x)$

and to compute it we use the forward algorithm [9], which is widely used in the context of hidden Markov models. The forward probabilities $\alpha_k$ (not to be confused with the thinning parameter $\alpha$) are

(3) $\begin{align*} \alpha_k(X_k)=P(Y_k|X_k)\sum_{X_{k-1}}P(X_k|X_{k-1})\alpha_{k-1}(X_{k-1}) \nonumber, \end{align*}$

with $\alpha_1(X_1)=P(X_1)P(Y_1|X_1)$. Therefore, the likelihood is

$P(Y)=\sum_{X_N}\alpha_N(X_N),$

where $N$ is the number of observations.

$P(Y_k|X_k)$ and $P(X_k|X_{k-1})$ are the so-called emission and transition probabilities, respectively. The transition probabilities are calculated as follows:

(4) $\begin{align*} P(X_n=x_n \mid X_{n-1}=x_{n-1})=e^{-\lambda}\sum_{j=0}^{x_n\wedge x_{n-1}}\binom{x_{n-1}}{j}\alpha^j(1-\alpha)^{x_{n-1}-j}\frac{\lambda^{x_n-j}}{(x_n-j)!} \nonumber \end{align*}$

and the emission probabilities are

(5) $\begin{align*} P(Y_i=j \mid X_i=k) & = \left\{ \begin{array}{lcc} 0 & \text{if} & k < j \\ (1-\omega) + \omega q^k& \text{if} & k=j\\ \omega \binom{k}{j} q^j (1-q)^{k-j}& \text{if} & k > j. \end{array} \right. \nonumber \end{align*}$
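The forward recursion, together with the transition and emission probabilities, can be sketched as follows. This is a minimal illustration and not the implementation used in the cited references. It assumes that the hidden state space is truncated to $0,\dots,x_{\max}$ (with `x_max` comfortably above the largest observation), that $X_1$ follows the stationary Poisson($\lambda/(1-\alpha)$) law, and that the forward probabilities are rescaled at each step to avoid numerical underflow. All function names are hypothetical.

```python
import math


def pois(k, mean):
    """Poisson pmf at k (mean > 0)."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def binom(j, n, p):
    """Binomial(n, p) pmf at j."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def transition(x_new, x_old, alpha, lam):
    """P(X_n = x_new | X_{n-1} = x_old): thinning plus Poisson innovation."""
    return sum(binom(j, x_old, alpha) * pois(x_new - j, lam)
               for j in range(min(x_new, x_old) + 1))


def emission(y, x, omega, q):
    """P(Y_n = y | X_n = x) under the underreporting scheme."""
    if y > x:
        return 0.0
    full = (1.0 - omega) if y == x else 0.0
    return full + omega * binom(y, x, q)


def log_likelihood(ys, alpha, lam, omega, q, x_max):
    """Scaled forward recursion over the truncated state space 0..x_max."""
    states = range(x_max + 1)
    trans = [[transition(b, a, alpha, lam) for b in states] for a in states]
    # Assumed initial law of X_1: Poisson(lam / (1 - alpha)).
    mu_x = lam / (1.0 - alpha)
    fwd = [pois(x, mu_x) * emission(ys[0], x, omega, q) for x in states]
    loglik = 0.0
    for y in ys[1:]:
        scale = sum(fwd)
        loglik += math.log(scale)
        fwd = [f / scale for f in fwd]
        fwd = [emission(y, b, omega, q)
               * sum(fwd[a] * trans[a][b] for a in states)
               for b in states]
    return loglik + math.log(sum(fwd))


# Illustrative call with made-up counts and parameter values.
print(log_likelihood([3, 1, 4, 0, 2], 0.5, 2.0, 0.4, 0.25, x_max=30))
```

In practice a function of this kind would be passed to a numerical optimizer to obtain the maximum likelihood estimates. The truncation level trades accuracy against the cubic cost of building the transition matrix.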

Rebuilding the hidden chain $X_n$

To reconstruct the hidden series $X_n$, the Viterbi algorithm [10] is used as described previously.

The main idea is to reconstruct the latent chain $X_1^*=x_1^*,\dots, X_N^*=x_N^*$ that maximizes the probability of the latent process given the observations $Y_n$, assuming that the parameters are known.
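A log-domain Viterbi decoder for this model can be sketched as below. As with any such sketch, several choices are assumptions rather than part of the original description: the state space is truncated to $0,\dots,x_{\max}$, the initial law of $X_1$ is taken to be Poisson($\lambda/(1-\alpha)$), and the function and variable names are hypothetical.

```python
import math


def pois(k, mean):
    """Poisson pmf at k (mean > 0)."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def binom(j, n, p):
    """Binomial(n, p) pmf at j."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def safe_log(p):
    """Logarithm that maps probability 0 to -infinity."""
    return math.log(p) if p > 0.0 else -math.inf


def viterbi(ys, alpha, lam, omega, q, x_max):
    """Most probable hidden path x_1..x_N given the observations ys."""
    states = range(x_max + 1)
    # Log transition probabilities on the truncated state space.
    log_trans = [[safe_log(sum(binom(j, a, alpha) * pois(b - j, lam)
                               for j in range(min(a, b) + 1)))
                  for b in states] for a in states]

    def log_emit(y, x):
        if y > x:
            return -math.inf
        full = (1.0 - omega) if y == x else 0.0
        return safe_log(full + omega * binom(y, x, q))

    mu_x = lam / (1.0 - alpha)  # assumed stationary initial law
    score = [safe_log(pois(x, mu_x)) + log_emit(ys[0], x) for x in states]
    back = []
    for y in ys[1:]:
        pointers, new_score = [], []
        for b in states:
            # Best predecessor of state b at the previous time point.
            best = max(states, key=lambda a: score[a] + log_trans[a][b])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][b] + log_emit(y, b))
        back.append(pointers)
        score = new_score
    # Backtrack from the best final state.
    x = max(states, key=lambda s: score[s])
    path = [x]
    for pointers in reversed(back):
        x = pointers[x]
        path.append(x)
    return path[::-1]


# Illustrative call with made-up counts and parameter values.
print(viterbi([2, 0, 5, 1], 0.5, 2.0, 0.3, 0.3, x_max=25))
```

Because an observed count can never exceed the hidden count under this scheme, every reconstructed value satisfies $x_n^* \geq y_n$.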

References

[2] Bernard H., Werber D., Hoehle M. Estimating the under-reporting of norovirus illness in Germany utilizing enhanced awareness of diarrhoea during a large outbreak of Shiga toxin-producing E. coli O104:H4 in 2011 – a time series analysis. BMC Infectious Diseases 2014; 14:116. DOI:10.1186/1471-2334-14-116.

[3] Fan, C. et al. Prediction of Epidemic Spread of the 2019 Novel Coronavirus Driven by Spring Festival Transportation in China: A Population-Based Study. Int. J. Environ. Res. Public Health 17, (2020).

[4] Fernández-Fontelo A., Cabaña A., Puig P., Moriña D. Under-reported data analysis with INAR-hidden Markov chains. Statistics in Medicine (2016), vol. 35, issue 26, 4875-4890.

[5] Fernández‐Fontelo, A., Cabaña, A., Joe, H., Puig, P. & Moriña, D. Untangling serially dependent underreported count data for gender‐based violence. Stat. Med. 38, 4404–4422 (2019).

[6] Moriña, D., Fernández-Fontelo, A., Cabaña, A. & Puig, P. New statistical model for misreported data with application to current public health challenges. Submitted to Statistical Methods in Medical Research, (2020).

[7] Moriña, D., Fernández-Fontelo, A., Cabaña, A., Puig, P., Monfil, L., Brotons, M. & Diaz, M. Quantifying the underreporting of genital warts cases. Submitted to the European Journal of Epidemiology (2020).

[8] Kermack W. O., McKendrick A. G. (1927). A Contribution to the Mathematical Theory of Epidemics. Proceedings of the Royal Society A. 115 (772): 700–721.

[9] T.C. Lystig, J.P. Hughes (2002), Exact computation of the observed information matrix for hidden Markov models, J. of Comp. and Graph. Stat.

[10] Viterbi, A.J. (1967), Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13, 260-269.
