Data and context
Survey data were obtained from the National Institute of Statistics and Demography (INSD) in Ouagadougou. The annual agricultural dataset provides detailed information on crops, yields, inputs and pesticides, plot characteristics and labor type for the 2003−2017 period. Surveyed villages were distributed relatively uniformly across the country (Fig. 1a). The household survey data cover the 1996–2017 period and include detailed information on household characteristics and demographics. We first merge and synchronize both survey datasets, and geolocalize the villages using fuzzy string matching with administrative data. We include a comprehensive set of climatic remote-sensing data including precipitation, temperature, and vegetation indices such as temperature and rainfall data, land surface temperature for day and night, a normalized difference vegetation index and enhanced vegetation index. We extrapolate a variety of covariates from the raw rasters, such as mean temperatures in dry and rainy seasons, precipitations in all seasons and mean and maximum dry spells. The pixel resolution of the climate data allows us to merge the rasters with the survey datasets at the village level. Panel (a) in Fig. 1 shows the location of our study villages. We then merge the dataset with two high-resolution maps of schistosomiasis prevalence estimates in school-aged children at a pixel resolution of (5times 5) km. The first map applies up until 2010 (Fig. 1c), the second between 2011 and 2018 (Fig. 1d). The estimated prevalence is a joint measure of both uro-genital and intestinal schistosomiasis, and is obtained by means of Bayesian geostatistical analysis [14]. We choose the years 2009 and 2011 in order to maximize the number of repeated households (i.e. observed each year) and to provide the best fit and model diagnostics: in the Additional file 1: Appendix we show our results to be robust to the choice of different years around the 2010 cutoff. Furthermore, these years cover the sharp decrease in disease burden in school-aged children through Burkina Faso’s schistosomiasis control program observed between 2008 and 2013 [22]. Because of resolution of the disease maps, prevalence is constant within each village: households belonging to the same village will be assigned the same level of schistosomiasis prevalence. Given that health consequences of schistosomiasis are more directly linked to infection intensity (measured by egg-output) than to prevalence [23], we translate estimated prevalence into a joint measure of schistosomiasis infection intensity in terms of the average number schistosome egg-output per person (see Additional file 1: Appendix). Model-based estimates of spatial snail density were derived from malacological surveys collected in two field sites located along the South-North climatic gradient between the Sudanian and Sahelian regions [24]. Full information on the dataset, its sources, all summary statistics and details concerning the covariates used in the analysis can be found in the Additional file 1: Appendix.
Schistosomiasis and agriculture
In order to estimate the economic impact of schistosomiasis (relationship (1) in Fig. 1), we rely on a specification that fully exploits the wealth of information available in our dataset at both plot and household levels. We identify the burden of the disease as a shock to overall productivity in an agricultural production function, which does not directly affect the amount of labor or physical inputs needed for production but instead influences their effectiveness. We begin by specifying the production technology as
$$begin{aligned} Y_{ihjt} = A_{hj}(1+phi _{hjt})^{-theta } F(X_{ihjt}) e_{ihjt}, end{aligned}$$
(1)
where (Y_{ihjt}) is the yield (output per hectare) of plot i, farmed by household h in village j at time t, (phi _{hjt}), which is common to all plots cultivated by a given household, represents the direct effect of schistosomiasis on productivity, calibrated by a parameter (theta). We model the impact of the disease as ((1+phi )^{-theta }) in order to represent the fact that in a disease-free state the household’s production technology is unaffected, and decreases as the disease burden worsens. (A_{jh}) are household time-invariant productivity shifters, some of which may be unobservable, F(.) is the production function and (X_{ihjt}) is the matrix of overall inputs used for agricultural production, including plot, household, climate and land covariates. Lastly, the term (alpha ^{FE} = { alpha _t + alpha _h +alpha _c +alpha _{ct} + alpha _{ht} +alpha _{ch}}) allows to control for all unobservables at a household, crop and year level as well as their interactions (also known as fixed effects in the economics literature). Lastly, the term (epsilon _{ihjt}) as an idiosyncratic Gaussian disturbance. The parameter of interest is (theta): it modulates the extent to which the disease affects agricultural yield, and we expect its estimate to be negative. Since the “real” effect (phi) of the disease is unobservable, we proxy it with the measure of infection intensity in terms of mean egg-output per person discussed earlier, denoted by (I_{jt} = (1+Int_{jt})). Taking logarithms then results in the quasilinear model
$$begin{aligned} y_{ihjt} = {tilde{A}}_{jh} – theta {tilde{I}}_{jt} + {tilde{F}}(X_{ihjt}) + alpha ^{FE} + epsilon _{ihjt}, end{aligned}$$
(2)
where our goal is to identify the parameter (theta). We estimate this burden first by means of multiplicative (Cobb-Douglas) form for the production technology. We then refine this estimate by absorbing all the non-linear confounding effects stemming from the large matrix of inputs by means of various machine learning methods, compared with a mean squared error criterion. In order to check for nonlinearities in the burden, we fit an instrumented adaptive spline for the schistosomiasis intensity variable using a two-stage semiparametric method on the log-linear model. In the Additional file 1: Appendix we report all details concerning the estimation framework and show how this interpretation of schistosomiasis as a productivity shock is not rejected by the data, and is compatible with a household optimization model. Since the “real” effect of the disease on household members is unobservable, we approximate it by the measure of infection intensity in terms of mean egg-output per person, as discussed earlier. This implies that plot-level estimation is likely to be affected by biases stemming from endogeneity issues, and we address these with the use of an instrumental variables (IV) strategy. This requires the choice of a set of variables (instruments) that are strongly related to the presence of schistosomiasis, and affect agricultural yield only via the disease burden. A natural choice is constituted by the densities of the snails that act as intermediate hosts of the schistosomes. The presence of either kind of aquatic snail is directly linked to the prevalence of both forms of schistosomiasis, whilst not having any direct detrimental effect on agricultural yields. In order to control for any effects that might jointly determine the presence of snails and agricultural production, we include a plethora of covariates that account for climate, precipitations and land characteristics, and use the snail predictions lagged by one year in order to avoid simultaneous dependance on joint unobservables. Furthermore, snail control strategies are shown to have potential benefits for morbidity control [25, 26], validating the theoretical relevance of our IVs. Snail densities are thus likely to constitute ideal instruments, and allow us to estimate the causal effect of interest. This corresponds to relationship (2) in the upper panel of Fig. 1. In the Additional file 1: Appendix we report a detailed justification of the validity of this specification, as well as the complete description of the models, the covariates and the estimation methods as well as all additional tables, results and robustness checks.
Schistosomiasis and poverty
In Burkina Faso, agricultural households are largely engaged in subsistence farming: shocks to yield therefore affect simultanously both income and survival probability. Moreover, it is likely that the productivity decrease attributable to schistosomiasis is a function of various household characteristics. To account for this, we examine the additional burden of the disease for households farming plots that belong to different quantiles of the joint distribution of plot surface and crop weight. We interact schistosomiasis intensity with an indicator representing whether a given plot belongs to a specific quantile of the joint plot surface/crop weight distribution. The estimated equation is the following:
$$begin{aligned} y_{ihjt}= & {} {tilde{A}}_{jh} + theta {tilde{I}}_{jt} + theta _{pj} {tilde{I}}_{jt}times mathbbm {1}_{ ( w_j, s_j) le Q^j_k} + {tilde{F}}(X_{ihjt}) + alpha ^{FE} + epsilon _{ihjt} nonumber \ Q^j_k= & {} min {w_{ihjt}, text {s}_{ihjt} : k le Phi (w, text {surf}; t) }, end{aligned}$$
(3)
where (Phi) is the joint distribution of plot weight w and surface s for each year t. The coefficient of interest is (theta _{pj}), estimated varying the indicator k on a grid ranging from the 20th to the 70th percentile of the joint distribution.
In order to characterize this mechanism in terms of its link to household poverty, we then carry out a similar procedure where the indicator function is defined at the household rather than at the plot level. The cutoffs in this case correspond to the lower 5% and 10% tails of the joint distribution of total harvest weight and plot surface farmed by each household. The interest is in the estimation of (theta _{ph}) in
$$begin{aligned} y_{ihjt}= & {} {tilde{A}}_{jh} + theta {tilde{I}}_{jt} + theta _{ph} {tilde{I}}_{jt} times mathbbm {1}_{(w^h,s^h) le Q^h_k} + {tilde{F}}(X_{ihjt}) +alpha ^{FE} + epsilon _{ihjt} nonumber \ Q^h_k= & {} min left{ sum _{j=1}^{N^h_{t}} w_{ihjt}, sum _{j=1}^{N^h_{t} } text {s}_{ihjt} : k < Phi left( w^h, text {s}^h; t right) right} , end{aligned}$$
(4)
where (Phi left( w^h, text {s}^h right)) is the joint distribution of total crop weight and plot surface farmed by each household and (N^h_{t}) is the number of plots farmed by each household each year.
Schistosomiasis and water resources development
Given that schistosomiasis is water-based, we study the feedback effects between the disease and water resources development. We begin by concentrating our attention on Burkina Faso’s four main dams, shown in (Fig. 1b). We aggregate the data up to the village level, and we compute the geographical distance of each village from the closest dams and reservoirs. By interacting the presence of a dam with our measure of disease intensity, we disentangle the direct impact of dams, which should increase yields, from the deleterious indirect effects that they may produce by facilitating the diffusion of schistosomiasis. For this estimation we rely on an instrumented spatial autoregressive specification (SAR). The estimated equation is
$$begin{aligned} y_{jt} = (1_{2j} – rho W_j)^{-1}[ theta _1 I_{jt} + theta _{2} mathbbm {1}_{dam} + theta _{int} I_{jt} times mathbbm {1}_{dam}+ {tilde{X}}_{jt}beta + alpha _t + alpha _r + epsilon _{jt}], end{aligned}$$
(5)
where (alpha _{t,r}) accounts for time and region unobservables, (W_j) is a matrix of spatial weights given by the between-villages distance in coordinate degrees and (rho) is the spatial correlation parameter, fit by maximum likelihood.
We refine the previous results by accounting for each village’s distance in km from the nearest dam or water reservoir. The full network of dams and reservoirs is illustrated in panel (b) of Fig. 1. We interact the presence of a large dam with the distance from any water infrastructure as well as with the measure of intensity. The estimated equation is
$$begin{aligned} y_{jt},= & {} (1_{2j} – rho W_j)^{-1}[ theta _1 I_{jt} + theta _2 dist_{j} + theta _3 mathbbm {1}_{dam} + theta _{3int}I_{jt} times dist_j times mathbbm {1}_{dam} + nonumber \&+ theta _{2int1} I_{jt} times dist_j + theta _{2int2} I_{jt} times mathbbm {1}_{dam} + theta _{2int3} dist_{j} times mathbbm {1}_{dam}+ nonumber \&+{tilde{X}}_{jt}beta + alpha _t+ alpha _r + epsilon _{jt}]. end{aligned}$$
(6)
where the main coefficients of interest are (theta _{3int}) and (theta _{2int1}). All details and robustness checks are again reported in the Additional file 1: Appendix.
Data analysis
All estimations have been obtained using the software R 3.6.2 (https://cran.r-project.org/). The dataset as well as the main variables of interest are described in full in the Additional file 1: Appendix. Inference on the estimated coefficients has been done by means of cluster-bootstrapping standard errors at the village level in order to match the resolution of the disease and climate maps.