
The underreported model

The phenomenon of underreporting in biomedical research has been widely studied, at least since the work of Bernard et al. (2014) [2]. A slightly more recent reference is Fernández-Fontelo et al. (2016) [4], where a very simple mechanism is proposed to model the phenomenon. Consider a hidden process X_n (the real daily number of positive cases, which is not observable because the entire population of interest is not tested daily) with an autoregressive structure of order 1 for integer-valued data and Poisson innovations, Po-INAR(1):

     $$X_n=\alpha \circ X_{n-1}+W_n(\lambda)$$

where  0<\alpha<1 is a fixed parameter, the innovations  W_n \sim Poisson( \lambda ) are i.i.d. and independent of  X_{n-1}, and  \circ is the binomial thinning operator:

     \begin{equation*}\alpha \circ X_{n-1}=\sum_{i=1}^{X_{n-1}}Z_i \end{equation*}

with  Z_i i.i.d. random variables with Bernoulli( \alpha) distribution.
The INAR(1) process is a homogeneous Markov chain with transition probabilities

     $$P(X_n = i \mid X_{n-1} = j) = \sum_{k=0}^{i\wedge j} \binom{j}{k} \alpha^k (1-\alpha)^{j-k} P(W_n=i-k)$$

The expectation and variance of the binomial thinning operator are

    \[\textrm{{\bf E}}\left(\alpha \circ X_{n-1}|X_{n-1}=x_{n-1}\right)=\alpha x_{n-1}\]


    \[\textrm{{\bf Var}}\left(\alpha \circ X_{n-1}|X_{n-1}=x_{n-1}\right)=\alpha (1-\alpha)x_{n-1}\]
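
The hidden process defined above can be simulated directly from its definition. The following is a minimal sketch, not the authors' implementation: the function names, the Knuth-style Poisson sampler, the parameter values, and the choice to start the chain from a Poisson(\lambda/(1-\alpha)) draw are all assumptions made for illustration.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def thin(alpha, x, rng):
    """Binomial thinning alpha∘x: sum of x i.i.d. Bernoulli(alpha) draws."""
    return sum(rng.random() < alpha for _ in range(x))

def simulate_inar1(alpha, lam, n, rng):
    """Simulate X_1..X_n with X_n = alpha∘X_{n-1} + W_n, W_n ~ Poisson(lam)."""
    x = poisson(lam / (1 - alpha), rng)  # assumed start: Poisson(lam / (1 - alpha))
    path = []
    for _ in range(n):
        path.append(x)
        x = thin(alpha, x, rng) + poisson(lam, rng)
    return path

rng = random.Random(0)  # fixed seed so the run is reproducible
xs = simulate_inar1(alpha=0.5, lam=2.0, n=20000, rng=rng)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # both should be close to lam / (1 - alpha) = 4
```

For a long simulated path, the sample mean and variance should both be close to \lambda/(1-\alpha), which gives a quick sanity check on the thinning step.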

A simple underreporting scheme



The underreporting phenomenon is modeled assuming that the observed counts have the following form:

(1)   \begin{align*} Y_n = \left\{ \begin{array}{ll} X_n & \text{with probability } 1-\omega\\ q \circ X_n & \text{with probability } \omega, \end{array} \right.\nonumber \end{align*}

where \omega and q represent the frequency and the intensity of the underreporting, respectively. That is, for each n, we observe X_n itself with probability 1-\omega, and a q-thinning of X_n with probability \omega, independently of the past \{X_j: j \leq n\}. Therefore, what is observed (the reported counts) is

    \[Y_n = (1- \mathbf{1}_n)X_n + \mathbf{1}_n \sum_{j=1}^{X_n} \xi_j, \quad\quad \mathbf{1}_n\sim\mbox{Bern}(\omega), \quad \xi_j\sim\mbox{Bern}(q)\]
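
The observation scheme can be sketched in a few lines. This is only an illustration under stated assumptions: `underreport` is a hypothetical helper name, the hidden series is a made-up constant sequence chosen so the effect of \omega and q is easy to see, and the parameter values are arbitrary.

```python
import random

def underreport(xs, omega, q, rng):
    """Apply the underreporting scheme to a hidden series xs.

    With probability 1 - omega the true count is reported unchanged;
    with probability omega only a q-thinning of it is reported.
    """
    ys = []
    for x in xs:
        if rng.random() < omega:
            # Underreported day: each of the x cases is recorded with probability q.
            ys.append(sum(rng.random() < q for _ in range(x)))
        else:
            ys.append(x)
    return ys

rng = random.Random(0)       # fixed seed for reproducibility
hidden = [10] * 20000        # illustrative hidden counts (assumption, not real data)
observed = underreport(hidden, omega=0.6, q=0.3, rng=rng)
mean_observed = sum(observed) / len(observed)
print(mean_observed)         # should be near 10 * (1 - omega * (1 - q)) = 5.8
```

Reported counts never exceed the hidden counts, and their average shrinks by the factor 1-\omega(1-q).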

This model will most likely have to be modified so that both q and \lambda depend on time.

Model properties

The expectation and variance of the INAR(1) process X_n with Poisson(\lambda) innovations are \mu_X=\sigma_X^2=\frac{\lambda}{1-\alpha}. The autocovariances and autocorrelations are

    \[\gamma_X(k)=\alpha^{|k|}\frac{\lambda}{1-\alpha} \textrm{ and }\rho_{X}(k)=\alpha^{|k|}\]

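respectively. On the other hand, E(Y_n) =\mu_{Y}=\mu_X(1-\omega(1-q)).  The autocovariance function of the observed process Y_n is, for k \neq 0,

    \[\gamma_Y(k)=(1-\omega(1-q))^2\,\gamma_X(k)\]

Thus, the autocorrelation function of Y_n is a multiple of \rho_{X}(k):

    \[\rho_Y(k)=\frac{(1-\alpha)(1-\omega(1-q))^2}{(1-\alpha)(1-\omega(1-q))+\lambda\,\omega(1-\omega)(1-q)^2}\,\alpha^{|k|}\]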
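
The moments of the observed process can be evaluated numerically. The sketch below assumes formulas obtained by conditioning on the hidden chain (they are stated in the code comments) and uses a hypothetical function name and arbitrary parameter values. It checks that the ratio \rho_Y(k)/\alpha^{k} does not depend on the lag k.

```python
def observed_moments(alpha, lam, omega, q, max_lag=5):
    """Mean, variance, autocovariances and autocorrelations of Y_n for lags 1..max_lag.

    Assumed formulas (derived by conditioning on the hidden chain X_n):
      E(Y)       = s * mu,  with  s = 1 - omega * (1 - q)  and  mu = lam / (1 - alpha)
      Var(Y)     = s * mu + omega * (1 - omega) * (1 - q)**2 * mu**2
      gamma_Y(k) = s**2 * alpha**k * mu   for k >= 1
    """
    mu = lam / (1 - alpha)
    s = 1 - omega * (1 - q)
    mean_y = s * mu
    var_y = s * mu + omega * (1 - omega) * (1 - q) ** 2 * mu ** 2
    autocov = [s ** 2 * alpha ** k * mu for k in range(1, max_lag + 1)]
    autocorr = [g / var_y for g in autocov]
    return mean_y, var_y, autocov, autocorr

alpha, lam, omega, q = 0.5, 2.0, 0.6, 0.3   # illustrative values only
mean_y, var_y, autocov, autocorr = observed_moments(alpha, lam, omega, q)

# rho_Y(k) / alpha^k should be the same constant for every lag k.
ratios = [rho / alpha ** k for k, rho in enumerate(autocorr, start=1)]
print(mean_y, var_y, ratios)
```

Because the ratio is constant, the observed series has the same geometric decay as the hidden one, only scaled down.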


The form of the expected value of Y_n also makes it possible to obtain predictions for Y_{n+k} given the observations up to Y_n.

Parameter estimation

The marginal distribution of Y_n is a mixture of two Poisson distributions:

(2)   \begin{align*} Y_n &\sim \begin{cases} \textrm{Poisson}\left(\frac{\lambda}{1-\alpha}\right) &\quad \mbox{with probability } 1-\omega,\\ \textrm{Poisson}\left(\frac{q\lambda}{1-\alpha}\right) &\quad \mbox{with probability } \omega. \end{cases} \end{align*}
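
The mixture in (2) gives the marginal probability mass function of Y_n in closed form. A minimal sketch, with hypothetical function names and arbitrary parameter values:

```python
import math

def poisson_pmf(k, mean):
    """P(K = k) for K ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def marginal_pmf(y, alpha, lam, omega, q):
    """Marginal P(Y_n = y): two-component Poisson mixture from equation (2)."""
    mu = lam / (1 - alpha)
    return (1 - omega) * poisson_pmf(y, mu) + omega * poisson_pmf(y, q * mu)

# Evaluate on a grid wide enough that the truncated tail is negligible here.
probs = [marginal_pmf(y, alpha=0.5, lam=2.0, omega=0.6, q=0.3) for y in range(60)]
total = sum(probs)
mean_y = sum(y * p for y, p in enumerate(probs))
print(total, mean_y)  # total should be ~1; mean_y ~ mu * (1 - omega * (1 - q))
```

Fitting this mixture to the observed counts is one way to get rough starting values before maximizing the full likelihood.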

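When q=0, the distribution of \{Y_n\} is a zero-inflated Poisson. The mixture provides preliminary estimates of the parameters, which are later used in the computation of the maximum likelihood estimators. The likelihood of Y is very complicated, since it involves a sum over all possible paths of the hidden chain,

    \[P(Y_1,\dots,Y_N)=\sum_{X_1,\dots,X_N}P(X_1)P(Y_1|X_1)\prod_{k=2}^{N}P(X_k|X_{k-1})P(Y_k|X_k),\]

and to compute it we use the forward algorithm [9], which is widely used in the context of hidden Markov models. The forward probabilities (written \alpha_k(\cdot), not to be confused with the thinning parameter \alpha) are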

(3)   \begin{align*} \alpha_k(X_k)=P(Y_k|X_k)\sum_{X_{k-1}}P(X_k|X_{k-1})\alpha_{k-1}(X_{k-1}) \nonumber, \end{align*}

with \alpha_1(X_1)=P(X_1)P(Y_1|X_1). Therefore the likelihood is

    \[P(Y_1,\dots,Y_N)=\sum_{X_N}\alpha_N(X_N).\]

Here P(Y_k|X_k) and P(X_k|X_{k-1}) are the so-called emission and transition probabilities, respectively. The transition probabilities are calculated as follows:

(4)   \begin{align*} P(X_n=x_n \mid X_{n-1}=x_{n-1})=e^{-\lambda}\sum_{j=0}^{x_n\wedge x_{n-1}}\binom{x_{n-1}}{j}\alpha^j(1-\alpha)^{x_{n-1}-j}\frac{\lambda^{x_n-j}}{(x_n-j)!} \nonumber \end{align*}
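
Equation (4) translates almost term by term into code. The sketch below uses a hypothetical function name and arbitrary parameter values; it checks that one row of the transition matrix sums to 1 over a grid wide enough for these values.

```python
import math

def transition_prob(x_new, x_old, alpha, lam):
    """P(X_n = x_new | X_{n-1} = x_old) for the Po-INAR(1) chain, as in equation (4)."""
    total = 0.0
    for j in range(min(x_new, x_old) + 1):
        # j survivors out of x_old under binomial thinning ...
        survivors = math.comb(x_old, j) * alpha ** j * (1 - alpha) ** (x_old - j)
        # ... plus x_new - j new arrivals from the Poisson(lam) innovation.
        arrivals = math.exp(-lam) * lam ** (x_new - j) / math.factorial(x_new - j)
        total += survivors * arrivals
    return total

row = [transition_prob(x_new, 3, alpha=0.5, lam=2.0) for x_new in range(60)]
row_sum = sum(row)
row_mean = sum(x * p for x, p in enumerate(row))
print(row_sum, row_mean)  # row_sum ~ 1; row_mean ~ alpha * 3 + lam
```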

and the emission probabilities as

(5)   \begin{align*} P(Y_i=j \mid X_i=k) & = \left\{ \begin{array}{lcc} 0 & \text{if} & k < j, \\ (1-\omega) + \omega q^k & \text{if} & k=j,\\ \omega \binom{k}{j} q^j (1-q)^{k-j} & \text{if} & k > j. \end{array} \right. \end{align*}
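
With the transition and emission probabilities available, the forward recursion (3) can be sketched as follows. This is an illustration under explicit assumptions, not the authors' code: the hidden state space is truncated at a hypothetical bound `x_max` (which must be at least the largest observation), the initial law P(X_1) is taken to be Poisson(\lambda/(1-\alpha)), and the forward probabilities are rescaled at each step to avoid numerical underflow. All function names and the example data are made up.

```python
import math

def transition_prob(x_new, x_old, alpha, lam):
    """Equation (4): P(X_n = x_new | X_{n-1} = x_old)."""
    return math.exp(-lam) * sum(
        math.comb(x_old, j) * alpha ** j * (1 - alpha) ** (x_old - j)
        * lam ** (x_new - j) / math.factorial(x_new - j)
        for j in range(min(x_new, x_old) + 1))

def emission_prob(y, x, omega, q):
    """Equation (5): P(Y_n = y | X_n = x)."""
    if x < y:
        return 0.0
    thinned = omega * math.comb(x, y) * q ** y * (1 - q) ** (x - y)
    return (1 - omega) + thinned if x == y else thinned

def log_likelihood(ys, alpha, lam, omega, q, x_max):
    """Log-likelihood of ys via the scaled forward recursion on states 0..x_max."""
    states = range(x_max + 1)
    mu = lam / (1 - alpha)
    trans = [[transition_prob(i, j, alpha, lam) for j in states] for i in states]
    # alpha_1(x) = P(X_1 = x) P(Y_1 | X_1 = x); Poisson(mu) initial law is an assumption.
    fwd = [math.exp(-mu) * mu ** x / math.factorial(x) * emission_prob(ys[0], x, omega, q)
           for x in states]
    loglik = 0.0
    for y in ys[1:]:
        scale = sum(fwd)            # accumulate the scaling factors in log space
        loglik += math.log(scale)
        fwd = [emission_prob(y, i, omega, q)
               * sum(trans[i][j] * fwd[j] / scale for j in states)
               for i in states]
    return loglik + math.log(sum(fwd))

ys = [3, 1, 4, 2, 0, 5]             # illustrative observed counts (not real data)
ll = log_likelihood(ys, alpha=0.5, lam=2.0, omega=0.6, q=0.3, x_max=40)
print(ll)
```

A function like this can be passed to a numerical optimizer to obtain maximum likelihood estimates. The truncation bound trades accuracy for speed, since each step costs on the order of `x_max` squared operations.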

Rebuilding the hidden chain X_n

To reconstruct the hidden series X_n, the Viterbi algorithm [10] is used.

The main idea is to find the latent sequence X_1^*=x_1^*,\dots, X_N^*=x_N^* that maximizes the probability of the latent process given the observations Y_1,\dots,Y_N, assuming that the parameters are known.
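
A sketch of this reconstruction step is given below. It is a generic log-domain Viterbi decoder applied to the transition and emission probabilities (4) and (5), not the authors' implementation. As assumptions, the hidden state space is truncated at a hypothetical bound `x_max`, P(X_1) is taken to be Poisson(\lambda/(1-\alpha)), and all names and example values are illustrative.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def transition_prob(x_new, x_old, alpha, lam):
    """Equation (4): P(X_n = x_new | X_{n-1} = x_old)."""
    return math.exp(-lam) * sum(
        math.comb(x_old, j) * alpha ** j * (1 - alpha) ** (x_old - j)
        * lam ** (x_new - j) / math.factorial(x_new - j)
        for j in range(min(x_new, x_old) + 1))

def emission_prob(y, x, omega, q):
    """Equation (5): P(Y_n = y | X_n = x)."""
    if x < y:
        return 0.0
    thinned = omega * math.comb(x, y) * q ** y * (1 - q) ** (x - y)
    return (1 - omega) + thinned if x == y else thinned

def viterbi(ys, alpha, lam, omega, q, x_max):
    """Most probable hidden path x_1..x_N given ys, over states 0..x_max."""
    states = range(x_max + 1)
    mu = lam / (1 - alpha)
    log_trans = [[safe_log(transition_prob(i, j, alpha, lam)) for j in states]
                 for i in states]
    # delta[x]: best log-probability of any hidden path that ends in state x.
    delta = [safe_log(math.exp(-mu) * mu ** x / math.factorial(x))
             + safe_log(emission_prob(ys[0], x, omega, q)) for x in states]
    back = []
    for y in ys[1:]:
        pointers, new_delta = [], []
        for i in states:
            best_j = max(states, key=lambda j: delta[j] + log_trans[i][j])
            pointers.append(best_j)
            new_delta.append(delta[best_j] + log_trans[i][best_j]
                             + safe_log(emission_prob(y, i, omega, q)))
        back.append(pointers)
        delta = new_delta
    # Backtrack from the best final state.
    x = max(states, key=lambda s: delta[s])
    path = [x]
    for pointers in reversed(back):
        x = pointers[x]
        path.append(x)
    return path[::-1]

ys = [3, 1, 4, 2, 0, 5]   # illustrative observed counts (not real data)
x_star = viterbi(ys, alpha=0.5, lam=2.0, omega=0.6, q=0.3, x_max=30)
print(x_star)
```

Since the emission probability is zero whenever the hidden count is smaller than the reported one, every reconstructed value is at least as large as the corresponding observation.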



[2] Bernard H, Werber D, Hoehle M. Estimating the under-reporting of norovirus illness in Germany utilizing enhanced awareness of diarrhoea during a large outbreak of Shiga toxin-producing E. coli O104:H4 in 2011 - a time series analysis. BMC Infectious Diseases 2014; 14:116. DOI:10.1186/1471-2334-14-116.

[3] Fan, C. et al. Prediction of Epidemic Spread of the 2019 Novel Coronavirus Driven by Spring Festival Transportation in China: A Population-Based Study. Int. J. Environ. Res. Public Health 17, (2020).

[4] Fernández-Fontelo A., Cabaña A., Puig P., Moriña D. Under-reported data analysis with INAR-hidden Markov chains. Statistics in Medicine (2016), vol. 35, Issue 26, 4875-4890.

[5] Fernández‐Fontelo, A., Cabaña, A., Joe, H., Puig, P. & Moriña, D. Untangling serially dependent underreported count data for gender‐based violence. Stat. Med. 38, 4404–4422 (2019).

[6] Moriña, D., Fernández-Fontelo, A., Cabaña, A.  & Puig, P. New statistical model for misreported data with application to current public health challenges. Submitted to Statistical Methods in Medical Research, (2020).

[7] Moriña, D., Fernández-Fontelo, A., Cabaña, A., Puig, P., Monfil, L., Brotons, M. & Diaz, M. Quantifying the underreporting of genital warts cases. Submitted to the European Journal of Epidemiology (2020).

[8] Kermack W. O., McKendrick A. G. (1927). A Contribution to the Mathematical Theory of Epidemics. Proceedings of the Royal Society A. 115 (772): 700–721.

[9] Lystig, T.C., Hughes, J.P. (2002). Exact computation of the observed information matrix for hidden Markov models. Journal of Computational and Graphical Statistics.

[10] Viterbi, A.J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13, 260–269.

