An Effective Approach for Air Quality Prediction in Bishkek Based on Machine Learning Techniques

Air pollution is the presence of any air pollutant that can affect human health, the environment, plants, and animal life. Urban areas are increasingly facing the detrimental consequences of air pollution, so pollution prediction for urban areas has become a necessary step toward reducing its harmful effects. In this paper, we implement various regression, classification and forecasting techniques, such as CatBoost Regression, LightGBM Regression, XGBoost Regression, Extra Trees Regression, Random Forest, Artificial Neural Networks, Support Vector Machines, Naïve Bayes and ARIMA, to assess and forecast the Air Quality Index and pollutants such as PM1, PM10 and PM2.5 in the city of Bishkek, Kyrgyzstan. The techniques are then evaluated using Mean Squared Error, Mean Absolute Error and R². The results show that the CatBoost and LightGBM algorithms are well suited for regression and classification, respectively.


INTRODUCTION
In the last few decades, accelerated urbanization has significantly affected air quality in urban areas. Concentrated economic activities consume huge amounts of energy and thus increase the environmental burden, including enormous anthropogenic emissions of pollutants into the atmosphere. Although there are various types of emissions, some of the main contributors to air pollution are motor vehicles and industrial operations [7,19]. According to the WHO, the six main air pollutants that affect human health are particle pollution, ground-level ozone, carbon monoxide, sulfur oxides, nitrogen oxides, and lead. Both long-term and short-term exposure to airborne toxicants cause a range of toxicological effects in humans, including respiratory and cardiovascular disorders, neuropsychiatric issues, eye irritation, skin diseases, and long-term chronic diseases such as cancer [5,17,20].
The harmful solid and liquid particles suspended in the air are collectively known as particulate matter (PM). In urban settings, the two main categories by mass and composition are coarse particles, with aerodynamic diameters between 2.5 µm and 10 µm, and fine particles, with aerodynamic diameters below 2.5 µm. Epidemiological studies have shown evidence of increased mortality and morbidity associated with particulate pollution even at moderate concentrations [8,9,15]. The most common sources of particulate matter in the air include emissions from power plants, industries, motor vehicles, coal and wood combustion, biomass burning, and soil and dust particles [21].
Traditionally, statistical models and numerical methods were used to assess and predict air quality. Numerical methods use physical principles to simulate atmospheric processes and predict air quality. The complexity of the governing processes and the strong coupling across scales make air quality prediction a challenging task [6]. Statistical models such as regression and time series models employ historical data to find patterns that can be used to forecast air pollution levels. Notable methods include multiple linear regression, exponential smoothing and moving average models [3,18]. The estimates from these methods are often inaccurate because they rely on many assumptions that must be satisfied and cannot account for the dynamic and complex behavior of meteorological parameters.
Unlike statistical methods, artificial intelligence-based machine learning algorithms consider multiple parameters for prediction. As a result of advances in technology and increased computational power, machine learning algorithms are now widely used to predict pollution levels. The most commonly used techniques are artificial neural networks (ANNs) [1,12]. Supervised machine learning algorithms such as linear regression, support vector machines, decision trees and random forests have also been employed to predict air pollution [11,22,24]. Ensemble machine learning methods have been used to identify pollution sources and predict urban air quality [27]. A modified Wavelet Decomposition technique and Back Propagation Neural Network (W-BPNN) model has been implemented to forecast the concentrations of three air pollutants, SO2, NO2 and PM10 [2]. In that work, the authors modify a back-propagation neural network using a wavelet-transform technique. Another study, in Quito, Ecuador, demonstrates that statistical models based on machine learning are relevant for predicting PM2.5 concentrations from meteorological data [16].
Air pollutant emissions vary among countries, with developing countries showing higher levels of pollution than developed countries. According to Our World in Data, air pollution contributed to 11.65% of deaths globally in 2019. Central Asia is one of the regions with the highest share of deaths attributed to pollution. Bishkek, the capital city of Kyrgyzstan, ranks among the top cities in the world for poor air quality. For example, on 15 January 2021, Bishkek was listed among the most air-polluted cities in the world by World Air Quality rankings. The Government of Kyrgyzstan is therefore giving more attention to this problem and has installed nearly 50 monitoring devices across Bishkek to measure air quality, especially the concentration of PM2.5. Moreover, the Kyrgyz government is planning to add more air quality monitoring devices to measure the concentrations of pollutants such as CO, NO2, NO and SO2 [13].
There is a need for studies that use meteorological data to assess the quality of air in Bishkek. One study performs a quantitative assessment and prediction of the degree of ambient air benzo(a)pyrene pollution [28]. To our knowledge, there are no other major studies that investigate air pollution in Bishkek, and therefore little progress has been made in understanding the city's air quality. It is imperative that government institutions and policy makers have enough information on air quality so that the necessary steps can be taken. Therefore, we have employed various regression, classification and forecasting techniques, such as CatBoost Regression, LightGBM Regression, XGBoost Regression, Extra Trees Regression, Random Forest, Artificial Neural Networks, Support Vector Machines, Naïve Bayes and ARIMA, to assess and forecast the Air Quality Index and pollutants such as PM1, PM10 and PM2.5 in the city of Bishkek, Kyrgyzstan.
The city of Bishkek (latitude 42°52'N, longitude 74°34'E) is located in the Chu River Valley, in the northern part of Kyrgyzstan. According to the 2022 census, Bishkek has a total population of 1.082 million and a total area of 127 square kilometers. The population and area of Bishkek constitute 15.23% and 0.064% of the total population and total area of Kyrgyzstan, respectively. At an altitude of 700-900 meters above sea level, Bishkek is situated at the base of the Kyrgyz Ala-Too mountains, one of the mountain groups of the inner Tien Shan. Few windy days occur in the city because it is situated in a basin, which tends to limit airflow throughout the Bishkek region. Another complicating factor is that the air temperature in and above the city is about 5 °C higher than in the surrounding atmosphere, creating a heat island effect. This effect blocks the light winds from outside the city that are common during the colder months. The city's climate, which is distinctly continental, is influenced by latitude, altitude, the significant distance from oceans, regional orographic influences, and atmospheric circulation [13].
The aim of this paper is to analyze particulate matter in Bishkek and to develop predictive models that not only estimate current particulate matter levels but also forecast future levels.
The structure of the paper is as follows. Section 2 presents the proposed model. Section 3 gives the implementation details and results, after which the paper is concluded. Notations with explanations are listed in Table 1.

PROPOSED METHODOLOGY
The proposed methodology consists of three modules, namely data collection, preprocessing, and prediction. The abstract model is shown in Figure 1 and the detailed model is illustrated in Figure 2. In the data collection module, we collected pollution-related data from different sources. In the preprocessing module, operations such as noise removal and handling of missing values were performed on the collected data to prepare it for further processing. The preprocessed data is the input to the prediction module. In the prediction module, different machine learning models, such as regression models, classification models, and time series forecasting models, were applied to the preprocessed data.

Data Collection
Data about pollution in Bishkek has been collected from various open sources, which include datasets collected by Kyrgyzhydromet and the daily air quality and meteorological measurements for major worldwide cities covering 2015 to 2021 [29]. The collected data had several problems, such as null values and outliers. To handle these problems, the data is fed to the preprocessing module.

Preprocessing
In the preprocessing stage, different operations have been performed on the data to remove null values, outliers, and other issues; these include appending data frames, indexing, normalization, and standardization. The first step in building a model is to process the data and make it suitable for training. The general preprocessing steps that we applied to the dataset before training the models are merging, filtering, handling null values and grouping. The datasets used in this project were organized by year, so we merged them based on the date index. Missing values in the AQI column were represented by -999; we converted them to null (NaN) values using NumPy. The preprocessed data is then fed to the prediction module to train the machine learning models.
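The merging and null-handling steps described above could look roughly like the following sketch. It assumes pandas is used alongside NumPy; the column names, the sample values, and the choice to drop rows with a missing AQI are illustrative assumptions, not details taken from the datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly extracts; column names and values are illustrative only.
df_2019 = pd.DataFrame(
    {"AQI": [152, -999, 87], "PM2.5": [58.0, 61.5, 29.1]},
    index=pd.to_datetime(["2019-12-30", "2019-12-31", "2019-12-29"]),
)
df_2020 = pd.DataFrame(
    {"AQI": [-999, 110], "PM2.5": [44.2, 39.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# Merge the yearly frames into one frame ordered by the date index.
merged = pd.concat([df_2019, df_2020]).sort_index()

# Replace the -999 sentinel in the AQI column with NaN so it counts as missing.
merged["AQI"] = merged["AQI"].astype(float).replace(-999.0, np.nan)

# Drop rows without an AQI value (one possible way of handling nulls).
clean = merged.dropna(subset=["AQI"])

# Group by calendar month and average, as an example of the grouping step.
monthly_mean = clean.groupby(clean.index.to_period("M")).mean()
```

Whether missing rows are dropped or imputed depends on the downstream model; dropping is used here only to keep the sketch short.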

Prediction
In the prediction module, we have used different machine learning algorithms for classification and regression. The first is the CatBoost algorithm, which is used for both classification and regression. CatBoost is a gradient boosting library developed by Yandex. In gradient boosting, a model is trained to make predictions based on the output of several simpler models, called weak learners. These weak learners are typically decision trees, which are trained to make predictions based on a series of splits in the data [10]. CatBoost handles categorical variables differently from other gradient boosting libraries. Rather than requiring the user to encode categorical variables as numerical values beforehand, CatBoost encodes them internally using ordered target statistics. This can help to improve the model's performance, especially when there are many categories or when the categories are not ordinal. CatBoost also includes several additional features, such as the ability to handle missing values without imputation, and tools for model visualization and interpretation.
Another algorithm that we have used for both regression and classification is LightGBM, a gradient boosting library developed by Microsoft. Like CatBoost, it builds an ensemble of decision trees as weak learners. LightGBM uses a technique called gradient-based one-side sampling (GOSS), which keeps the training instances with large gradients and samples from those with small gradients, and which can help to improve the training speed. It also includes several additional features, such as the ability to handle missing values without imputation and support for parallel training [26].
We have also used the XGBoost algorithm for classification and regression. XGBoost is a gradient boosting library initially developed by Tianqi Chen and developed further by a group of developers at DMLC. As with CatBoost and LightGBM, its weak learners are typically decision trees, each trained to correct the errors of the trees built before it. XGBoost includes several additional features, such as support for parallel training and the ability to handle missing values without imputation. It also includes several optimization techniques that can help to improve the training speed and the model's performance [14].
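The idea shared by the three gradient boosting libraries can be illustrated with a minimal sketch in plain Python. It assumes a squared-error loss, a single numeric feature, and one-split trees (stumps) as weak learners; the function names, the toy data and the hyperparameters are hypothetical, and none of the library-specific optimizations are reproduced.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Repeatedly fit a stump to the current residuals and add it to the model."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data (illustrative values, not taken from the Bishkek dataset).
hours = [0, 3, 6, 9, 12, 15, 18, 21]
pm25 = [80.0, 75.0, 60.0, 40.0, 30.0, 35.0, 55.0, 78.0]
model = fit_gradient_boosting(hours, pm25)
```

Each round shrinks the training error a little; the learning rate trades off the number of rounds needed against the risk of overfitting.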
A decision tree is a type of machine learning model that is used for both classification and regression tasks. It is a tree-like model in which each internal node represents a test on a feature and each leaf node represents a class label or a predicted value. To build a decision tree, the algorithm starts at the root node and splits the data into two or more subsets based on a feature value. This process is repeated for each child node until a stopping criterion is met and the node becomes a leaf. The final tree is constructed by connecting the sub-trees. To make a prediction, the algorithm traverses the tree from the root node to a leaf node, using the feature values of the input data to determine which path to take at each internal node.
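The prediction step can be sketched as a short traversal over a nested dictionary. The tree below is built by hand purely for illustration; its features, thresholds and leaf values are hypothetical and are not fitted to any data.

```python
def predict(node, sample):
    """Walk from the root to a leaf, following one feature test per internal node."""
    while "value" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


# Hand-built regression tree; features, thresholds and leaf values are hypothetical.
tree = {
    "feature": "wind_speed",
    "threshold": 2.0,
    "left": {
        "feature": "temperature",
        "threshold": 0.0,
        "left": {"value": 120.0},  # calm and cold
        "right": {"value": 60.0},  # calm and mild
    },
    "right": {"value": 25.0},  # windy
}

calm_cold = predict(tree, {"wind_speed": 1.0, "temperature": -5.0})
windy = predict(tree, {"wind_speed": 4.5, "temperature": -5.0})
```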
Random forest is an ensemble machine learning method that is used for both classification and regression tasks. It is a decision tree-based model, but rather than building a single decision tree, it constructs a forest of decision trees and combines their predictions to produce a final output. To build a random forest model, the algorithm draws a random sample of the training data and builds a decision tree on this sample, considering only a random subset of the features at each split. This process is repeated multiple times, resulting in a forest of decision trees. To make a prediction, the algorithm passes the input data through each decision tree in the forest and combines the predictions using a majority vote for classification or an average for regression. This can help to reduce overfitting and improve the model's generalization performance [4].
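The two ingredients that distinguish a forest from a single tree, resampling the training data and aggregating per-tree predictions, can be sketched as follows. The helper names and the per-tree outputs are hypothetical, and tree construction itself is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Combine per-tree class predictions for classification."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Combine per-tree numeric predictions for regression."""
    return sum(values) / len(values)


rng = random.Random(0)
rows = list(range(10))  # stand-in for the training rows
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical outputs of three trees for one input.
class_prediction = majority_vote(["Unhealthy", "Moderate", "Unhealthy"])
value_prediction = average([52.0, 47.5, 50.5])
```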
The extra trees (also known as extremely randomized trees) algorithm is an ensemble machine learning method that is used for both classification and regression tasks [25]. It is a variant of the random forest model, which builds a forest of decision trees and combines their predictions to make a final prediction. Extra trees differ from random forests in how the splits are chosen: in addition to considering a random subset of the features at each split, the algorithm draws the split thresholds at random rather than searching for the best threshold for each feature. Each tree is typically built on the whole training set rather than on a bootstrap sample. This process is repeated multiple times, resulting in a forest of decision trees. To make a prediction, the algorithm passes the input data through each decision tree in the forest and combines the predictions using a majority vote or an averaging method. This can help to reduce overfitting and improve the model's generalization performance.
A neural network is a type of machine learning model that is inspired by the structure and function of the brain. It consists of layers of interconnected "neurons," which process and transmit information. Neural networks can be used for a wide range of tasks, including classification, regression, and forecasting [23].
The k-nearest neighbors algorithm, also known as KNN, is a simple and widely used algorithm for classification tasks in machine learning. In KNN, the algorithm first takes a set of labeled data points as input and stores them. When given a new data point, the algorithm finds the k labeled data points closest to it, based on a distance metric such as Euclidean distance. The algorithm then assigns the majority class among those k nearest neighbors to the new data point; for regression, the average of the neighbors' values is used instead. In the proposed work, we have used the KNN algorithm for both classification and regression [17].
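A minimal sketch of KNN in plain Python, covering both the majority-vote and the averaging variant, is given below. It assumes Euclidean distance; the function names and the toy points are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Assign the majority class among the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Predict the mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)


# Toy labeled points (illustrative values only).
labeled = [
    ((1.0, 1.0), "Good"),
    ((1.5, 2.0), "Good"),
    ((2.0, 1.0), "Good"),
    ((8.0, 9.0), "Unhealthy"),
    ((9.0, 8.0), "Unhealthy"),
]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

label = knn_classify(labeled, (8.5, 8.5), k=3)
value = knn_regress(numeric, (2.1,), k=3)
```

Because distances depend on feature scale, features are usually normalized or standardized before KNN is applied.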

Implementation Step
The proposed work has been implemented in Python 3.10.6 on a Core i5 machine with 8 GB of RAM. Most of the coding was done in Jupyter Notebook on the Linux (Ubuntu 22.04.1 LTS) operating system. The datasets used in this work were taken from the HydroMet Kyrgyzstan website, the AirNow Department of State website and the Air Quality Open Data Platform. The historical data was arranged by year. The dataset from the AirNow website covers 2019 to 2022 and contains hourly PM2.5 and AQI values for Bishkek. The dataset taken from the Air Quality Open Data Platform contains information about different pollutants (PM1, PM2.5, PM10, SO2, NO, NO2) for different cities of the world, starting in 2015.
We have implemented regression models (CatBoost regressor, LightGBM regressor, XGBoost regressor, Random Forest regressor, Extra Trees regressor and Neural Networks) to predict the concentrations of PM1, PM2.5 and PM10. We used the R² score, mean absolute error and mean squared error to evaluate the performance of the models.
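For reference, the three regression metrics can be computed as in the sketch below, which follows their standard definitions. The observed and predicted values are made-up numbers for illustration and are not results from this work.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed and predicted concentrations (illustrative only).
observed = [30.0, 50.0, 70.0, 90.0]
predicted = [32.0, 48.0, 73.0, 87.0]

mse = mean_squared_error(observed, predicted)
mae = mean_absolute_error(observed, predicted)
r2 = r2_score(observed, predicted)
```

Lower MSE and MAE indicate a better fit, while an R² closer to 1 means the model explains more of the variance in the observed values.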
Moreover, we used classification models (CatBoost classifier, LightGBM classifier, KNN, Neural Networks and SVM classifier) to predict the quality of air. There are six possible output classes: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous. The training set has shape (23322, 6) and the testing set has shape (9996, 6), which corresponds to roughly a 70/30 split.
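The six class names match the category names of the U.S. EPA Air Quality Index. The sketch below maps an AQI value to its category under the assumption that the standard EPA ranges apply (0-50, 51-100, 101-150, 151-200, 201-300, and above 300); the labeling procedure actually used for the dataset is not described here, so this is only an illustration.

```python
from bisect import bisect_left

# Upper bounds of the first five AQI categories (assumed U.S. EPA ranges).
UPPER_BOUNDS = [50, 100, 150, 200, 300]
CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi):
    """Map a non-negative AQI value to its category name."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    return CATEGORIES[bisect_left(UPPER_BOUNDS, aqi)]


labels = [aqi_category(v) for v in (42, 100, 151, 320)]
```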
We have used accuracy, precision, recall, F1-score, and the confusion matrix as evaluation metrics to assess the performance of the classification models.
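These classification metrics can be computed from the true and predicted labels as in the following sketch, which uses the standard per-class definitions. The label lists are made up for illustration and are not the paper's predictions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1-score for the given label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels (illustrative only).
y_true = ["Good", "Good", "Moderate", "Moderate", "Unhealthy", "Unhealthy"]
y_pred = ["Good", "Moderate", "Moderate", "Moderate", "Unhealthy", "Good"]

matrix = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "Moderate")
```

With several classes, the per-class scores are usually averaged (for example, macro or weighted averaging) to obtain a single figure per model.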

Results for Regression Models
In our case, the Extra Trees regressor and CatBoost regressor models performed better than the other models, so we selected those models. Evaluation metrics for the regression models are listed in Table 2.

Results of Classification Models
The classification models were used to predict the quality of air. Based on the testing results, the LightGBM model gives the most accurate results, with an accuracy of 99.75% on the testing dataset.
Table 3 gives the accuracy, precision, recall and F1-score of the different models used for prediction by date (hour, day, month, year).

CONCLUSION
In this paper, a model based on regression and classification algorithms has been proposed to predict air pollution in the city of Bishkek. The proposed model consists of three main stages, namely data collection, preprocessing, and prediction. In the data collection stage, we collected data related to pollution in Bishkek. In the input data, we considered different parameters for air pollution. In the preprocessing stage, different techniques were applied to the collected data to remove noise, outliers and null values. In the prediction stage, we used different machine learning algorithms for regression and classification, such as XGBoost, CatBoost, neural networks, regression trees, random forest, extra trees and LightGBM. The purpose of using different machine learning algorithms was to select the best model among them for the given data. The results indicate that CatBoost performs better for regression and that LightGBM performs better than the other algorithms for classification.

Figure 1: Abstract model of the proposed methodology.

Figure 2: Detailed structure diagram of the proposed model.

Table 1: Notations and explanations.

Table 2: Evaluation metrics for regression.

Table 3: Evaluation of each of the classification models based on the testing dataset.