
Air quality assessment and pollution forecasting using artificial neural networks in Metropolitan Lima-Peru

In this work, we follow the Knowledge Discovery from Databases (KDD) methodology to obtain relevant information for air quality management decision-making. The main goal of KDD is to extract implicit, previously unknown, and potentially helpful information24 from raw data stored in databases. The resulting models can then predict air quality, e.g., one hour ahead, and support the city's management decision-making (see Fig. 1).

The KDD methodology has the following stages: (a) Phenomena Understanding; (b) Data Understanding; (c) Data Preparation; (d) Modeling; (e) Evaluation; and (f) Selection/Interpretation. In the following subsections, we explain each stage of the process.

Figure 1: Knowledge Discovery from Databases (KDD) methodology used for Air Quality Assessment and Pollution Forecasting.

Phenomena Understanding

In this first stage, we contextualize the pollution phenomenon with respect to the \(\hbox{PM}_{10}\) concentrations at the five Lima monitoring stations. The main focus is to predict air pollution to support decision-making related to establishing pollution mitigation policies. For this, we use both MLP and LSTM as computational statistical methods for \(\hbox{PM}_{10}\) prediction.

Lima is the capital of the Republic of Peru. It is located at the center of the western side of the South American continent, at \(77^{\circ}\) W and \(12^{\circ}\) S, and, together with its neighbor, the constitutional province of Callao, forms a populous and extensive metropolis with 10,628,470 inhabitants and an area of \(2819.3\,\hbox{km}^2\)25,26.

The average relative humidity (temperature) in the summer (December–March) ranges from 65–68% (24 °C–26 °C) in the mornings, while at night the values fluctuate between 87–90% (18 °C–20 °C). In the winter (June–September), the average daytime relative humidity (temperature) ranges between 85–87% (18 °C–19 °C) and at night it fluctuates between 90–92% (18 °C–19 °C). The average annual precipitation is 10 mm. On the other hand, the average altitudes reached by the thermal inversion in summer and winter are approximately 500 and 1500 m above sea level, respectively27,28.

Figure 2: Map with the study area and the locations of the Lima air quality monitoring stations: ATE, Campo de Marte (CDM), Carabayllo (CRB), Huachipa (HCH) and San Martin de Porres (SMP).

Table 1: Pollutant and weather variables used in this study, and their units of measurement.

Data understanding

Lima has ten air quality monitoring stations located in the constitutional province of Callao and in the north, south, east, and center of Lima. The data comprise hourly observations from January 1st, 2017, to December 31st, 2018, and include three meteorological variables and the concentration of particulate matter \(\hbox{PM}_{10}\), the latter being an agent that, when released into the environment, damages ecosystems and living beings29,30. For this study, we considered the hourly data recorded at five air quality monitoring stations (see Fig. 2) managed by the National Service of Meteorology and Hydrology of Peru (SENAMHI). Table 1 shows the considered variables and their units of measurement.

When considering environmental data, such as \(\hbox{PM}_{10}\) concentrations, from different locations, preliminary spatio-temporal visualization studies are of great use to better understand the behavior of the meteorological variables, the topography of the area, and the pollutants31.

Data preparation

This stage is particularly important because its output feeds directly into the modeling stage. Data preparation involved several steps. First, we addressed the problem of missing data using the MICE library, which performs multiple imputation via Fully Conditional Specification32 and requires the specification of a separate univariate imputation method for each incomplete variable. We used predictive mean matching, a versatile semiparametric method for continuous data that constrains each imputed value to match one of the observed values of the corresponding variable. Imputation was performed for each of the five stations, all of which had less than 25% missing data.
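As a minimal sketch of this step, the fragment below uses the MICE implementation in statsmodels, which also imputes by predictive mean matching; the file name and column names are assumptions for illustration, not the study's actual data layout:

```python
import pandas as pd
from statsmodels.imputation.mice import MICEData

# Hypothetical hourly records from one station; column names are assumed.
df = pd.read_csv("station_ate.csv", parse_dates=["datetime"], index_col="datetime")

imp = MICEData(df[["pm10", "temperature", "humidity", "wind_speed"]])
imp.update_all(n_iter=10)   # ten passes of chained-equation (PMM) updates
df_imputed = imp.data       # DataFrame with the gaps filled in
```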

The data from the monitoring stations consist of a sequence of observed values \(\{x_t\}\) recorded at specific times \(t\). In this case, the time series is collected at hourly intervals. After the data imputation, we proceed to normalize all the observations to the range [0, 1] as follows:

$$\begin{aligned} X_{t} = \frac{x_{t} - \min\{x_t\}}{\max\{x_t\} - \min\{x_t\}} \end{aligned}$$

(1)
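A direct NumPy implementation of Eq. (1) could look as follows (a sketch with our own naming, not the study's code):

```python
import numpy as np

def min_max_normalize(x):
    """Scale a series to the range [0, 1], as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```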

Moreover, the time series is decomposed into trend, seasonal, and irregular components following an additive model (the cyclic component, shown below for completeness, is omitted in this work):

$$\begin{aligned} X_t = Trend_t + Cyclic_t + Seasonal_t + Irregular_t \end{aligned}$$

(2)

The trend component \(Trend_t\) at time \(t\) reflects the long-term progression of the series, which may be linear or non-linear. The seasonal component \(Seasonal_t\) at time \(t\) reflects the seasonal variation. The irregular component \(Irregular_t\) (or "noise") at time \(t\) describes the random and irregular influences. In some cases, the time series also has a cyclic component \(Cyclic_t\) that reflects repeated but non-periodic fluctuations. The main idea of applying this decomposition is to separate the deterministic and the random components, where a forecasting model is obtained using the deterministic part33,34. In this article, we have used the method implemented in Statsmodels for Python35, where a centered moving average filter is applied to the time series.
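As an illustrative sketch, the decomposition can be obtained with Statsmodels' seasonal_decompose, which applies exactly this centered moving average; the 24-hour period is our assumption for hourly data, not a setting reported in the study:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition of the (already imputed) PM10 series.
result = seasonal_decompose(df_imputed["pm10"], model="additive", period=24)
trend, seasonal, irregular = result.trend, result.seasonal, result.resid
```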

Modeling using artificial neural networks

Artificial Neural Networks have received a great deal of attention in engineering and science. Inspired by the study of brain architecture, ANNs represent a class of non-linear models capable of learning from data36. The essential features of an ANN are the basic processing elements referred to as neurons or nodes, the network architecture describing the connections between nodes, and the training algorithm used to estimate values of the network parameters.

Researchers see ANNs as either highly parameterized models or semiparametric structures36. ANNs can be considered as hypotheses of the parametric form \(h(\cdot;\mathbf{w})\), where the hypothesis \(h\) is indexed by the vector of parameters \(\mathbf{w}\). The learning process consists of estimating the value of the vector of parameters \(\mathbf{w}\) to adapt the learner \(h\) to perform a particular task.

Machine learning and deep learning methods have been successfully applied to time series forecasting37,38,39,40,41,42. For instance, recurrent neural networks (RNNs) are dynamic models frequently used for processing sequences of real data step by step, predicting what comes next. They are applied in many domains, such as the prediction of pollutants43. It is known that RNNs are challenging to train when there are long-term dependencies in the data, which led to the development of models such as the LSTM, which has been successfully applied in time series forecasting44.

Figure 3: Schematic of the architecture of the Multilayer Perceptron. The figure shows three layers of neurons: input, hidden and output layers.

The Multilayer Perceptron model consists of a set of elementary processing elements called neurons36,45,46,47,48. These units are organized in an architecture with three layers: input, hidden, and output. The neurons of one layer are linked to the neurons of the subsequent layer. Figure 3 illustrates the architecture of this artificial neural network with one hidden layer. The non-linear function \(\mathbf{g}(\mathbf{x},\mathbf{w})\) represents the output of the model, where \(\mathbf{x}\) is the input signal and \(\mathbf{w}\) is its parameter vector. For a three-layer feedforward ANN (one hidden layer), the \(k\)-th output is given by the following equation

$$\begin{aligned} g_k(\mathbf{x},\mathbf{w}) = f_2\left( \sum_{j=1}^{\lambda} w_{kj}^{[2]} f_1\left( \sum_{i=1}^{d} w_{ji}^{[1]} x_i + w_{j0}^{[1]}\right) + w_{k0}^{[2]}\right) \end{aligned}$$

(3)

where \(\lambda\) is the number of hidden neurons and \(d\) is the number of inputs. An important factor in the specification of neural models is the choice of the activation functions. These can be any non-linear functions as long as they are continuous, bounded, and differentiable. The transfer function of the hidden neurons \(f_1(\cdot)\) should be non-linear, while for the output neurons the function \(f_2(\cdot)\) can be linear or non-linear. One of the most used functions is the sigmoid:

$$\begin{aligned} f(z) = \frac{1}{1+\exp(-z)} \end{aligned}$$

(4)

The MLP operates as follows. The input layer neurons receive the input signal and propagate it to the first hidden layer without performing any processing. The first hidden layer processes the signal and transfers it to the subsequent layer; the second hidden layer propagates the signal to the third, and so on. When the signal is received and processed by the output layer, it generates the response.
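As a minimal sketch (our own NumPy code, not the study's implementation), Eq. (3) amounts to the following forward pass with a sigmoid hidden activation \(f_1\) and a linear output \(f_2\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Eq. (3): hidden activation f1 = sigmoid, output f2 = identity."""
    hidden = sigmoid(W1 @ x + b1)   # inner sum over the d inputs
    return W2 @ hidden + b2         # outer sum over the lambda hidden units

# Example: d = 4 inputs, lambda = 8 hidden neurons, one output.
d, lam = 4, 8
W1, b1 = rng.normal(size=(lam, d)), np.zeros(lam)
W2, b2 = rng.normal(size=(1, lam)), np.zeros(1)
y = mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
```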

The Long Short-Term Memory (LSTM) networks model is a type of RNN whose primary strength is the ability to learn long-term dependencies, making it a solution for long time series intervals20,49. In such a model, memory blocks replace the neurons in the hidden layer of the standard RNN50. The memory block consists of three gates that control the system's state: the input, forget, and output gates. First, the input gate determines how much information will be added to the cell. Second, the forget gate controls the information lost in the cells. Lastly, the output gate determines the final output value based on the input and the memory of the cell51,52.

$$\begin{aligned} f_{t} = \sigma\left( W_{f} \cdot \left[ h_{t-1}, x_{t}\right] + b_{f}\right) \end{aligned}$$

(5)

$$\begin{aligned} i_{t} = \sigma\left( W_{i} \cdot \left[ h_{t-1}, x_{t}\right] + b_{i}\right) \end{aligned}$$

(6)

$$\begin{aligned} \tilde{C}_{t} = \tanh\left( W_{\tilde{C}} \cdot \left[ h_{t-1}, x_{t}\right] + b_{\tilde{C}}\right) \end{aligned}$$

(7)

$$\begin{aligned} C_{t} = \left( f_{t} \cdot C_{t-1}\right) + \left( i_{t} \cdot \tilde{C}_{t}\right) \end{aligned}$$

(8)

$$\begin{aligned} o_{t} = \sigma\left( W_{o} \cdot \left[ h_{t-1}, x_{t}\right] + b_{o}\right) \end{aligned}$$

(9)

$$\begin{aligned} h_{t} = o_{t} \cdot \tanh(C_{t}) \end{aligned}$$

(10)

Figure 4: Model of one block of the LSTM. The block is composed of the input gate, forget gate and output gate.

Figure 4 shows the LSTM block, which consists of the three gates together with the input and output blocks. At each step, an LSTM maintains a hidden vector \(h\) and a cell-state (memory) vector \(C\) that control the state updates and the outputs.

The first step is to decide what information will be discarded from the cell state. This decision is made by the forget gate, which uses a sigmoid activation function (GAF). Its output \(f_t\) is calculated using Eq. (5), which takes the concatenation of the vectors \(h_{t-1}\) and \(x_t\) and generates a number between 0 and 1 for each entry of the cell state \(C_{t-1}\); here \(W_f\) and \(b_f\) are the weight matrix and the bias vector, respectively, and both must be learned during training. If one of the values of \(f_t\) is equal to or close to zero, the LSTM eliminates that information; if it is equal to or close to one, the information is retained and reaches the cell state.

The next step is to decide what new information to store in the cell state. This is done by the input gate, which is linked to a sigmoid activation function (GAF); its output \(i_t\) is calculated by Eq. (6). In addition, the input block uses the hyperbolic tangent activation function (IAF). First, the vectors \(h_{t-1}\) and \(x_t\) are concatenated; \(W_i\) and \(b_i\), the weight matrix and the bias vector, respectively, must be learned during training, and the result is stored in the vector \(i_t\), which decides which values to update. Then a hyperbolic tangent function creates a vector of new candidate values \(\tilde{C}_{t}\) from the vectors \(h_{t-1}\) and \(x_t\) (Eq. 7). In the next step, these values are filtered by multiplying both vectors element-wise to create the cell-state update: the previous cell state \(C_{t-1}\) is updated to the new cell state \(C_t\) (Eq. 8).

Finally, the output gate, also linked to the GAF activation function, produces the output \(o_t\), calculated with Eq. (9). The new output of the model, \(h_t\), is then given by Eq. (10). The current cell state is represented by \(C_t\), while \(W\) denotes the weight matrices (parameters) of the model and \(b\) the biases.
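To make the gate algebra concrete, the following is a minimal NumPy sketch of one block update implementing Eqs. (5)-(10); the dictionary-based parameter layout is our own convention, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM block update following Eqs. (5)-(10)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (5)
    i = sigmoid(W["i"] @ z + b["i"])         # input gate, Eq. (6)
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate values, Eq. (7)
    C = f * C_prev + i * C_tilde             # cell-state update, Eq. (8)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate, Eq. (9)
    h = o * np.tanh(C)                       # new hidden state, Eq. (10)
    return h, C
```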

Model evaluation

To evaluate the forecast ability of the models, the performance metrics given below were used (see53,54). In what follows, we will consider: \(y_i\), \(i=1,\ldots,n\), are the target values; \(\hat{y}_i\), \(i=1,\ldots,n\), are the model's predictions; \(\bar{y}\) is the mean of the target values; and \(n\) is the number of samples. A compact implementation sketch of these metrics is given after the list.

  1.

    Mean Absolute Error: The average absolute difference between the target and the predicted values.

    $$\begin{aligned} \mathrm{MAE} = \frac{\sum_{i=1}^{n}\left| y_{i}-\widehat{y}_{i}\right|}{n} \end{aligned}$$

    (11)

  2.

    Root Mean Squared Error: The square root of the average of the squared errors.

    $$\begin{aligned} \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left( y_{i}-\widehat{y}_{i}\right)^{2}}{n}} \end{aligned}$$

    (12)

  3.

    Symmetric Mean Absolute Percentage Error: A measure of accuracy based on a percentage of relative errors.

    $$\begin{aligned} \mathrm{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| y_{i}-\widehat{y}_{i}\right|}{\left| \widehat{y}_{i}\right| + \left| y_{i}\right|} \end{aligned}$$

    (13)

  4.

    Spearman’s rank correlation coefficient: A nonparametric correlation measure between the target and the prediction. Spearman’s correlation assesses monotonic relationships by using the rank of the variables.

    $$\begin{aligned} S = 1 - \frac{6 \sum_{i=1}^{n} d_{i}^{2}}{n\left(n^{2}-1\right)} \end{aligned}$$

    (14)

    where \(d_{i} = \operatorname{rg}(y_i) - \operatorname{rg}(\hat{y}_i)\) is the difference between the ranks of the targets \(\operatorname{rg}(y_i)\) and the predictions \(\operatorname{rg}(\hat{y}_i)\).
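The sketch below computes the four metrics exactly as written in Eqs. (11)-(14), using SciPy's rank correlation for Spearman's \(S\); note that the sMAPE keeps the document's form, without a factor of 2 in the denominator:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(y, y_hat):
    """Compute MAE, RMSE, sMAPE, and Spearman's S, Eqs. (11)-(14)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mae = np.mean(np.abs(y - y_hat))                        # Eq. (11)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))               # Eq. (12)
    smape = 100.0 * np.mean(np.abs(y - y_hat)
                            / (np.abs(y_hat) + np.abs(y)))  # Eq. (13)
    s = spearmanr(y, y_hat).correlation                     # Eq. (14)
    return {"MAE": mae, "RMSE": rmse, "sMAPE": smape, "S": s}
```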

Model selection and interpretation

Model selection and interpretation is the final step in the KDD process; it requires that the knowledge extracted in the previous steps be applied to the specific domain of \(\hbox{PM}_{10}\) prediction in a visual format. At this stage, besides selecting the model with the best predictive accuracy, we also drive the decision-making process based on the air quality assessment in Lima.

We have used two schemes for validation: Hold-Out (HO) and Blocked Nested Cross-Validation (BNCV). On the one hand, HO uses the conventional separation of the dataset into training, validation, and testing subsets (see Fig. 5). On the other hand, BNCV slides a fixed-size window over the data, and the model is retrained with all the data up to the current day (see Fig. 6).

Figure 5: Hold-Out scheme used for the validation of the models. The dataset is split into three sets: training, validation, and testing. The training set is the basis for training the model, and the test set is used to see how well the model performs on unseen \(\hbox{PM}_{10}\) concentrations.

Figure 6: Blocked Nested Cross-Validation scheme used for the validation of the models. The dataset is separated into three sets using a time window of fixed size: training, validation, and testing. The last day is used for testing.
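As an illustration of the BNCV scheme, the generator below yields sliding fixed-size train/validation/test index blocks; the window sizes and step are placeholders for illustration, not the settings used in the study:

```python
import numpy as np

def blocked_nested_splits(n, train_size, val_size, test_size, step):
    """Yield (train, val, test) index blocks of a sliding fixed-size window."""
    start = 0
    while start + train_size + val_size + test_size <= n:
        train = np.arange(start, start + train_size)
        val = np.arange(train[-1] + 1, train[-1] + 1 + val_size)
        test = np.arange(val[-1] + 1, val[-1] + 1 + test_size)
        yield train, val, test
        start += step

# Example: two years of hourly data; one month to train, one week to
# validate, and the last day of each window to test, advancing one day.
for train, val, test in blocked_nested_splits(17520, 24 * 30, 24 * 7, 24, step=24):
    pass  # fit on train, tune on val, evaluate on the final day (test)
```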
