It is essential yet never easy to estimate the prevalence rate of Covid-19 cases among the Indian population, mostly because of the country’s large and diverse population and a high proportion of asymptomatic cases. Simple random sampling (SRS) or one of its variants, such as stratified sampling, has been suggested by many in this context.
Simple random sampling gives each individual in the population the same probability of inclusion in the sample. The technique has been adopted in some European countries, Sweden and Austria for example, to estimate the prevalence rate of Covid-19. According to one such survey conducted by the Public Health Authority of Sweden in Stockholm during March-April (https://www.folkhalsomyndigheten.se/nyheter-och-press/nyhetsarkiv/2020/april/resultat-fran-undersokning-av-forekomsten-av-covid-19-i-region-stockholm), about 2.5 per cent of Stockholmers had an ongoing Covid-19 infection. In another random sample-based study in Austria (https://www.sora.at/uploads/media/Austria_COVID-19_Prevalence_BMBWF_SORA_20200410_EN_Version), where testing was conducted between April 1 and 6, the proportion testing positive in the weighted sample was 0.33 per cent.
However, the population of Stockholm was only 974,000 and that of Austria 8.86 million in 2019. Also, the estimated prevalence of the disease is quite high in both places.
India, in contrast, might need to adopt a sampling scheme that best suits its own conditions. The country has an extremely large population and a high population density of 464 per square kilometre. Still, positive Covid-19 cases in India are fortunately fewer than in many European countries or the United States. With nearly 60,000 positive cases detected as of May 9, and assuming another 60,000-1,200,000 asymptomatic cases (up to 20 times the detected number) present in the country, the total should be between 120,000 and 1,260,000 at the moment. That is between 0.0089 per cent and 0.0933 per cent of the total population, a meagre proportion of the 1.35 billion people.
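The arithmetic behind that range can be checked in a few lines. This is only a sketch that re-uses the figures quoted above; the function name `prevalence_percent` is an illustrative choice, not part of any library.

```python
# Figures quoted in the text: detected cases as of May 9 and a population of 1.35 billion.
POPULATION = 1_350_000_000
DETECTED = 60_000

def prevalence_percent(total_cases, population=POPULATION):
    """Share of the population infected, expressed in per cent."""
    return 100 * total_cases / population

# Assumed asymptomatic cases: between 1x and 20x the detected number.
low = prevalence_percent(DETECTED + 60_000)
high = prevalence_percent(DETECTED + 1_200_000)
print(f"{low:.4f} per cent to {high:.4f} per cent")
```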
So far, we observe a similar feature in other South Asian countries, such as Pakistan, Afghanistan, Bangladesh, Nepal, Bhutan and Sri Lanka. European countries and the United States, on the other hand, have a high proportion of positive incidences: population size in those countries is relatively small, and the infection rate is high. Simple random sampling or its variants may be useful for such countries. India needs to adopt a completely different sampling scheme, one specifically designed to estimate “rare” events.
When the incidences are “rare” compared to the population size, SRS needs an extremely large sample to provide a reasonable estimate. In this case, an extremely large sample involves a huge cost in terms of travel and kits, and we know that the availability of kits is already a serious problem. Variants of SRS, such as stratification, clustering, systematic sampling and multistage sampling, will have the same problem of precision and cost.
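To give a rough sense of scale, the sketch below uses the standard normal-approximation formula for the SRS sample size needed to estimate a proportion p with a margin of error equal to a fixed fraction of p. The 20 per cent relative margin, the 95 per cent confidence level (z = 1.96) and the omission of any finite population correction are assumptions made purely for illustration; the function name `srs_sample_size` is hypothetical.

```python
import math

def srs_sample_size(p, rel_margin=0.2, z=1.96):
    """Approximate SRS size so that the margin of error equals rel_margin * p.

    Solves z * sqrt(p * (1 - p) / n) = rel_margin * p for n
    (normal approximation, no finite population correction).
    """
    return math.ceil(z**2 * (1 - p) / (rel_margin**2 * p))

# Prevalence figures quoted in the article, written as proportions.
scenarios = [
    ("India, lower end (0.0089 per cent)", 0.000089),
    ("India, upper end (0.0933 per cent)", 0.000933),
    ("Austria (0.33 per cent)", 0.0033),
    ("Stockholm (2.5 per cent)", 0.025),
]
for label, p in scenarios:
    print(f"{label}: about {srs_sample_size(p):,} people to test")
```

Under these assumptions the required sample runs from roughly a hundred thousand to over a million people across the Indian range, against a few thousand at Stockholm’s 2.5 per cent.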
In contrast, the Adaptive Cluster Sampling (ACS) scheme is designed to estimate “rare” events. It is well known that, for such events, its precision is much higher than that of SRS or its variants. It is also cost-effective. The idea of ACS was introduced by S K Thompson in a classic research article in 1990 (Thompson, S K, 1990, “Adaptive cluster sampling”, Journal of the American Statistical Association, volume 85, pp. 1050-1059). Several variants of ACS were proposed subsequently (see Borkowski and Turk (2014) “Adaptive Cluster Sampling: an Introduction”, on ResearchGate). ACS has been applied successfully to several problems in ecology, environment and epidemiology. It may also be applied successfully when the incidences are abundant.
In this context, it is worth mentioning that, in order to get an idea of the rate of positive incidences, some are advocating strategies similar to “snowball sampling” (Goodman, L.A., 1961, “Snowball sampling”, Annals of Mathematical Statistics, volume 32 (1), pp. 148-170). It seems quite pragmatic. However, it is a non-probability sampling method. Such a scheme does not provide an unbiased estimate; nor does it provide an estimate of the standard error, and hence it gives no margin of error.
ACS, on the other hand, is a probability sampling scheme related to snowball sampling, and it yields an unbiased estimate along with a standard error and hence a margin of error. ACS involves unequal probabilities of selection, and the probability of inclusion of each individual in the population can be defined rigorously.
When applying ACS to Covid-19, we may start with an initial sample of reasonable size, selected by some predefined mechanism. If a sampled unit tests positive, additional units in a defined neighbourhood are adaptively added to the sample. If any of these additional units is found to be Covid-19-positive, the units in its neighbourhood are added to the sample as well. This adaptive process continues until no additional Covid-19-positive units are encountered. It is a typical example of “inverse sampling”, where the total sample size is not fixed a priori; we can only provide an expected sample size. A variant of standard ACS tailored to the Indian context, one that uses information specific to the virus (for example, its incubation period of 2-14 days), might be more useful for designing the sampling scheme.
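The adaptive procedure described above can be sketched in code. This is a toy illustration under loudly stated assumptions, not a survey design: units sit on a hypothetical grid, the “neighbourhood” of a unit is its four adjacent cells (in practice it might be a household or a set of contacts), and the initial sample is a simple random sample drawn without replacement. The estimator is the Horvitz-Thompson-type estimator from Thompson (1990), which weights each network of positive units by the probability that the initial sample hits it. All function names are illustrative.

```python
import math
import random
from collections import deque

def neighbours(cell, rows, cols):
    """Assumed neighbourhood: the four adjacent grid cells."""
    r, c = cell
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)

def adaptive_cluster_sample(positive, rows, cols, initial):
    """Expand the initial sample: whenever a tested unit is positive,
    test its neighbours too, until no new positive unit turns up.
    Returns (all units tested, distinct networks of positive units hit)."""
    tested = set(initial)
    networks, assigned = [], set()
    for start in initial:
        if start not in positive or start in assigned:
            continue
        network, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for nb in neighbours(current, rows, cols):
                tested.add(nb)  # neighbours of a positive unit are always tested
                if nb in positive and nb not in network:
                    network.add(nb)
                    queue.append(nb)
        assigned |= network
        networks.append(network)
    return tested, networks

def ht_prevalence(networks, n_units, n_initial):
    """Horvitz-Thompson-type estimate of the proportion of positive units."""
    total = 0.0
    for network in networks:
        # Probability that an initial SRS of size n_initial hits this network.
        alpha = 1 - math.comb(n_units - len(network), n_initial) / math.comb(n_units, n_initial)
        total += len(network) / alpha
    return total / n_units

# Toy population: a 20 x 20 grid with two small clusters of positive units.
ROWS, COLS = 20, 20
positive = {(5, 5), (5, 6), (6, 5), (6, 6), (7, 6), (15, 3), (15, 4)}
cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
rng = random.Random(42)
initial = rng.sample(cells, 40)

tested, networks = adaptive_cluster_sample(positive, ROWS, COLS, initial)
estimate = ht_prevalence(networks, len(cells), len(initial))
print(f"units tested: {len(tested)}, estimated prevalence: {estimate:.4f}, "
      f"true prevalence: {len(positive) / len(cells):.4f}")
```

Any single run can land above or below the true value; unbiasedness is a statement about the average over all possible initial samples. The number of units finally tested is not known in advance, which is the sense in which only an expected sample size can be given.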
The writer is professor of statistics, Indian Statistical Institute, Kolkata