
Ingeniería y Desarrollo

Print version ISSN 0122-3461; On-line version ISSN 2145-9371

Ing. Desarro. vol.40 no.2 Barranquilla July/Dec. 2022  Epub Apr 10, 2023

https://doi.org/10.14482/inde.40.02.622.553 

Research article

Evaluation of Unsupervised Machine Learning Algorithms with Climate Data

Evaluación de algoritmos de Aprendizaje de Máquina no supervisados con datos climáticos

Juan Sebastián Ramírez* 
http://orcid.org/0000-0001-8876-5371

Néstor Duque-Méndez** 
http://orcid.org/0000-0002-4608-281X

*Universidad Nacional de Colombia, Department of Informatics and Computing. MSc in Computer Systems Administration. ORCID: https://orcid.org/0000-0001-8876-5371. jsramirezgo@unal.edu.co

**Universidad Nacional de Colombia, Department of Informatics and Computing. PhD in Engineering - Systems. ORCID: https://orcid.org/0000-0002-4608-281X. ndduqueme@unal.edu.co - Tel 3007876574


ABSTRACT

When using climate data, researchers have difficulty determining which clustering algorithm and which parameters perform best for a specific dataset. We evaluated the following unsupervised machine learning algorithms: K-means, K-medoids, and Linkage-complete, applied to three datasets with climatological variables (temperature, rainfall, relative humidity, and solar radiation) from three meteorological stations located in the department of Caldas, Colombia, at different heights above sea level. Five scenarios are defined for 2, 3, and 5 clusters for each of the two partitioned algorithms, and five scenarios for the hierarchical algorithm, at each of the meteorological stations. Different quantities and groupings of variables are applied across the scenarios, using Euclidean distance. The Davies-Bouldin index is used to evaluate cluster quality. The treatments include normalization with range transformation and Z-transformation, several numbers of algorithm iterations, and dimensionality reduction with PCA. In addition, the computational cost is evaluated. This study can guide researchers on certain decisions in cluster analysis of meteorological data and help identify the most important algorithm and parameters to consider for the best performance, according to particular conditions and requirements.

Keywords: Climate; clustering; machine learning; K-means; K-medoids

RESUMEN

Al usar datos climáticos, los investigadores tienen dificultades para determinar el algoritmo de agrupamiento y los parámetros de mejor rendimiento para procesar un conjunto de datos específico. Se realiza la evaluación de algoritmos de aprendizaje automático no supervisados K-means, K-medoids y Linkage-complete, aplicados a tres conjuntos de datos con variables climatológicas (temperatura, lluvia, humedad relativa y radiación solar), para tres estaciones meteorológicas ubicadas en el departamento de Caldas, Colombia, a diferentes alturas sobre el nivel del mar. Se definen 5 escenarios para 2, 3 y 5 clústeres para cada uno de los dos algoritmos particionados y 5 escenarios para el algoritmo jerárquico, para cada una de las estaciones meteorológicas, y aplicando una cantidad y agrupación diferente de variables para los diferentes escenarios y utilizando la distancia euclidiana, Davis-Bouldin como método de evaluación de calidad de clústeres, normalización con técnicas como transformación de rango y transformación Z, varias iteraciones del algoritmo y reducción de dimensionalidad con PCA. Además, se evalúa el costo computacional. Esta investigación puede guiar al investigador sobre ciertas decisiones en el análisis de conglomerados utilizados en datos meteorológicos, así como identificar el algoritmo y los parámetros más importantes a considerar para el mejor desempeño, de acuerdo con las condiciones y requisitos particulares.

Palabras claves: Agrupamiento; aprendizaje de máquina; clima; K-means; K-medoids

INTRODUCTION

Climate and atmospheric scenarios have been studied by a variety of researchers. Environmental, climatic, and meteorological information has been used to determine behaviors and patterns within a studied area [1]-[12], and air pollutant data have been used to understand the formation and impact of natural disasters and the greenhouse effect within a region [13]-[16]. The aims are to predict situations, relate causes and effects, and take measurements of the area in order to provide improvements, conclusions, and considerations in favor of the environment.

Currently, many algorithms are used to process records in the analysis of climate data, as shown in Table 1, which compiles more than 30 studies that used clustering algorithms for climate data, using various sources of information, records, timelines, and various objectives.

Table 1 Bibliographic review for clustering algorithms with climatic data. 

As shown in Table 1, K-means, K-medoids, and hierarchical grouping are the clustering algorithms most used by the authors.

Researchers have applied various clustering algorithms (agglomerative, K-means) [1] to climate data, using a series of metrics to measure performance, especially computational performance. Other studies used clustering tools to observe the behavior of the data according to the number of established clusters [22]. Some works recommended grouping models for specific environmental data, seeking the best projections of particulate matter in the studied region [24]. Suitable techniques for viewing annual temperature trends have also been demonstrated [34]. Other studies used proposed partitioned algorithms to obtain the best precipitation estimates in the studied regions [30], and clustering has been used to improve noise reduction in the analysis of solar radiation and temperature parameters [19]. Finally, other works compared unsupervised algorithms on climate data to find which performed best [32].

However, there is no clear guidance on which algorithm and parameters specifically serve to obtain the best results with the available data. Therefore, this research focused on working in that gray area to know and understand the behavior of some unsupervised machine learning algorithms applied to various scenarios with climate data.

In order to address this issue from experimentation, the guiding question of the paper is proposed as: How do clustering algorithms behave in different scenarios for climate data processing?

METHODOLOGY

The methodology comprises the following stages: selection of variables, definition of the climatic stations, obtaining the dataset, definition of the scenarios, creation of the scenarios, and choice of the tool for executing the algorithms. The results and their analysis are then presented.

Selection of Climatic Variables

To carry out the investigation, the following four meteorological variables were taken into account: temperature, precipitation, relative humidity, and solar radiation. These variables were selected for being the most reported in the state-of-the-art review in climate research with clustering algorithms [1], [2], [4], [7], [10], [20], [29], [31]-[33], [37]-[39].

Definition of the Climatic Stations

Data from three meteorological stations, Villamaria Hospital, Caldas Hospital, and Los Nevados National Natural Park (El Cisne), were used, comprising 430,635, 530,802, and 248,297 instances, respectively. Table 2 shows the information of the stations.

Table 2 Weather station information

Source: the Authors.

Source: the Authors.

Figure 1 Location of the three meteorological stations on a satellite map of Caldas Department, Colombia. 

Obtaining the Data Set

To obtain the records, the data warehouse of the Caldas Environmental Data and Indicators Center (CDIAC) was accessed. The data warehouse is a climate records storage system for the entire department of Caldas, administered and managed by the Adaptive Intelligent Environment Group (GAIA). It is also a project led by the IDEA (Environmental Studies Institute) of the National University of Colombia, Manizales branch. The data warehouse is a large storage structure implemented in PostgreSQL that houses more than 60 million environmental records collected from more than 100 stations, including meteorological stations located in different geographic sectors throughout the department. The information can be viewed through http://cdiac.manizales.unal.edu.co. SQL queries were executed to extract the required data from the data warehouse and form the datasets. The records run from April 12, 2012, to August 16, 2017 (the time range that contains all the required data), comprising 64 months (5.3 years) with one record every 5 minutes. For this period, data were extracted for the four climatic variables to be analyzed. Table 3 shows the retrieved datasets.

Table 3 Established datasets for the evaluation of unsupervised algorithms 

Source: the Authors.

Definition of Scenarios

The scenarios are the defined environments in which the clustering algorithms are applied, with a diversity of characteristics and parameters, in order to analyze and understand how the algorithms behave in each of them. The parameters for each scenario are number of variables, type of variable, number of records, missing data, and presence of outliers, which, combined with the characteristics of each station, represent an interesting spectrum for the evaluation of the algorithms.

Different algorithms and modifications in some execution parameters are applied on the data related to the defined scenarios and on the selected stations.

Clustering algorithms: The three clustering algorithms most used by researchers in the analysis of climate data were selected: agglomerative hierarchical grouping with Linkage-complete, and K-means and K-medoids for partitioned grouping.

Number of clusters (K): Three different groupings were generated, with K values of 2, 3, and 5. This selection was based on results from other authors [1] and was corroborated by applying the elbow method in one of the cases, which validated this range.

Normalization: For experimentation, normalization with Z-transformation and range-transformation was used as part of the process of reducing value scales in the variables.

Dimensionality reduction: Principal Component Analysis (PCA) was used. The used variables were the four ones: relative humidity, temperature, precipitation, and solar radiation.

Number of algorithm iterations: For experimentation, iteration values of 1, 10, 100 and 1000 were used, where each value is ten times greater than the previous one.

Distance measurement: Euclidean distance was used as the distance function. It is considered the most reliable [1] and is used in a wide variety of studies in the climate field [2], [7], [18], [19], [30], [33], [40], [41].

Cluster quality assessment: This consists of evaluating the result of the grouping to determine the quality of the clustering. For experimentation, the Davies-Bouldin index was used as the metric for evaluating cluster quality [1], [22], [42]-[45].
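As an illustration of the quality metric named above, the following is a minimal sketch of the standard Davies-Bouldin index computed with Euclidean distance. It is not the implementation used by the tool in this study; the function name `davies_bouldin` and the toy data are assumptions introduced here. In this standard form the index is non-negative and lower values indicate more compact, better separated clusters.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Standard Davies-Bouldin index with Euclidean distance (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Centroid of each cluster.
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    worst = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                separation = np.linalg.norm(centroids[i] - centroids[j])
                # Similarity of clusters i and j; keep the worst case for i.
                worst[i] = max(worst[i], (scatter[i] + scatter[j]) / separation)
    return worst.mean()

# Hypothetical toy data: two tight groups far apart.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = davies_bouldin(points, [0, 0, 1, 1])   # groups match the geometry
bad = davies_bouldin(points, [0, 1, 0, 1])    # groups cut across the geometry
print(good, bad)
```

The sketch assumes at least two clusters with distinct centroids; a production implementation would also guard against coincident centroids.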

Scenario Creation

Some work scenarios are defined for each one of the three algorithms. These scenarios are configurations to be taken into account in the executions to test each algorithm with different metrics.

Table 4 Work scenarios for the K-means algorithm. 

Source: the Authors [52]

Table 5 Work scenarios for the K-medoids algorithm. 

Source: the Authors [52]

Table 6 Work scenarios for the agglomerative algorithm with the Linkage-Complete method. 

Choice of Tool to Execute the Algorithms

To create the scenarios and run the algorithms, we chose RapidMiner (version 9.2), data mining software for analyzing a dataset using a variety of operators, tools, and functionalities. It has been used by the scientific community on environmental issues [25], [46]-[49], given its versatility and the options it offers, as well as the confidence generated by its proven effectiveness.

RESULTS

The algorithm and hardware performance results for K-means, K-medoids, and hierarchical grouping at each of the stations are presented in Tables 7 to 13. For each of the three weather stations of the region (Villamaría Hospital, Caldas Hospital, and Los Nevados National Natural Park), the results of the partitioned algorithms are reported for K = 2, K = 3, and K = 5.

Table 7 Algorithm and hardware performance results for K-means with K=2 for the Villamaria Hospital, Caldas Hospital and PNNN El Cisne stations 

Source: the Authors.

Table 8 Algorithm and hardware performance results for K-means with K = 3 for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Table 9 Algorithm and hardware performance results for K-means with K = 5 for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Source: the Authors, from previous work [39]

In a global view, for the Hospital de Villamaria station (low altitude), the evaluation indices are observed across all of its scenarios and treatments. For the Hospital de Caldas station (intermediate altitude), the evaluation indices are lower than those of the previous station (gaining quality), and they remain similar across its K values. However, for the El Cisne PNNN station (maximum height), the opposite happens: the Davies-Bouldin evaluation indices are higher (the clustering loses quality) [1] and remain the same across its K values.

Regarding K-means, each iteration forms clusters by assigning each point to its closest centroid and then recalculates the centroid of each cluster. This is a simple and very efficient process: it executes only two steps per iteration and can process very large numbers of instances quickly. In the experimentation with the Hospital de Caldas station, it processed more than 530,000 instances. This coincides with [22], which describes K-means as a simple and efficient algorithm.
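The two steps described above (assign each point to its closest centroid, then recalculate the centroids) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the random initialization from the data points, the `seed` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def _assign(X, centroids):
    """Step 1: index of the closest centroid (Euclidean) for every point."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means with an iteration cap (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        labels = _assign(X, centroids)
        # Step 2: recompute each centroid; keep the old one if a cluster is empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids stopped moving, further iterations change nothing
        centroids = updated
    return _assign(X, centroids), centroids, n_iter

# Hypothetical toy data: two well-separated groups of three points.
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids, n_iter = kmeans(data, k=2, max_iter=100)
print(labels, n_iter)
```

On data like this the loop stops after a handful of iterations, well before the cap, which is the behavior the iteration-count treatments in the scenarios are designed to probe.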

The execution times of the K-means algorithm increase as the value of K increases, because the algorithm must iterate more times to create more clusters. For stations such as Hospital de Caldas and Villamaria, the execution times are higher because their datasets have more than 430,000 instances. For the El Cisne PNNN station, the execution times are shorter because its dataset has fewer than 250,000 instances.

RAM consumption is very similar for all K values and for the three weather stations. Although it differs in certain work scenarios, the average consumption is 2.6 GB of RAM. This means that, regardless of the characteristics of the scenarios and datasets, the algorithm uses, on average, the same amount of memory, since it consumes the minimum it needs to run.

The CPU runtime increases as the value of K increases. This is due to the processing required for the number of clusters to generate. The same behavior is observed for the three climatic stations.

Table 10 Algorithm and hardware performance results for K-medoids with K=2 for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Source: the Authors.

Table 11 Algorithm and hardware performance results for K-medoids with K = 3 for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Source: the Authors.

Table 12 Algorithm and hardware performance results for K-medoids with K = 5 for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Source: the Authors, from previous work [39]

Regarding K-medoids, for the Hospital de Villamaria station (low altitude), the evaluation index becomes lower as the value of K (number of clusters) increases. For the Hospital de Caldas station (intermediate altitude), the evaluation indices are lower than those of the previous station, and they decrease further as the number of groups grows (gaining quality). However, for the El Cisne PNNN station (maximum height), the opposite is true: the evaluation indices are, once again, higher (losing quality).

For the two partitioned algorithms (K-medoids, K-means), normalization and the technique used greatly influence the evaluation of cluster quality. The Davies-Bouldin index, when evaluating the quality of the clusters, provides an approximation of the best grouping result, which can also be verified visually. Furthermore, the higher the K value, the greater the hardware and time requirements to process a dataset and execute the algorithm.
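To make the difference between the two partitioned algorithms concrete, the sketch below shows one common K-medoids variant (alternating assignment and medoid update), in which each cluster is represented by an actual data point rather than by a mean. It is an assumption that this variant resembles what the tool runs; the function name, the optional `init` parameter, and the toy data are hypothetical. Note that it builds a full pairwise distance matrix, which is one reason K-medoids tends to cost more than K-means on large datasets.

```python
import numpy as np

def kmedoids(X, k, max_iter=100, init=None, seed=0):
    """Alternating K-medoids (illustrative sketch, not the PAM swap search)."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix (n x n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if init is None:
        rng = np.random.default_rng(seed)
        medoids = rng.choice(len(X), size=k, replace=False)
    else:
        medoids = np.array(init)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)
        updated = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                updated[j] = members[within.argmin()]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return D[:, medoids].argmin(axis=1), medoids

# Hypothetical toy data; medoids start at one point of each group.
data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, medoids = kmedoids(data, k=2, init=[0, 3])
print(labels, medoids)
```

Like K-means, this procedure can settle on a poor local solution when the initial medoids are badly placed, so results may depend on initialization.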

Also, note that K-means and K-medoids cannot process empty fields. Some authors omit missing and corrupt data when using these algorithms [1]; here, the missing data were instead replaced with the average value of the attribute. This decision was supported by experts in the matter and was made to allow the algorithm to run. The average value is computed over the whole dataset. We did not replace missing data with the lowest or highest value because this would drag the clusters and alter the analysis of the results. The research shows a notable difference between clean datasets and datasets whose missing values are replaced by the average value of the attribute: the grouping evaluation index varies and is better when the dataset is clean. On the other hand, the datasets with missing data and atypical data (scenario 1) produced the lowest performance results. This is due in part to an imbalance in the formation of the clusters and to the evaluation indices of their treatments not being the best, which indicates that using a raw dataset is not recommended. Furthermore, the outliers did not affect the results, since the clustering evaluation indices given by scenario 4 are very similar to those of, for example, scenario 5, which uses a clean dataset. This could be because the number of outliers was small compared to the dataset size (where outliers exist within the dataset), or because normalization reduced these large distance margins and so provided better groupings.
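The replacement of missing values by the attribute average, as described above, can be sketched in a few lines. This is only an illustration: it assumes missing readings are encoded as NaN, and the function name and sample values are hypothetical.

```python
import numpy as np

def impute_with_mean(X):
    """Replace NaN entries with the mean of their column (attribute)."""
    X = np.array(X, dtype=float)          # copy so the input is not modified
    col_means = np.nanmean(X, axis=0)     # means computed over observed values only
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Hypothetical readings: columns could be temperature and relative humidity.
raw = [[1.0, np.nan],
       [3.0, 4.0],
       [np.nan, 8.0]]
print(impute_with_mean(raw))
```

Because every imputed record receives the same value for the attribute, many imputed records end up close together, which is consistent with the observation above that clean datasets score better than imputed ones.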

The results also corroborate the idea that applying dimensionality reduction with PCA, where three components are obtained, raises the level of abstraction of the results, since it does not allow direct visualization of the original attributes. As mentioned in [42], transforming data from an original space into a new one with a lower dimension, where the new features cannot be associated with the original characteristics, makes analysis of the new space very complicated, since the transformed characteristics have no physical meaning.

Therefore, a PCA with two components could show the behavior of the data in a two-dimensional plane and make its analysis easier. In turn, this reduces the four initial attributes to only two. In terms of clustering evaluation, PCA did not improve the Davies-Bouldin index.
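A reduction from four attributes to two components, as suggested above, can be sketched with a singular value decomposition. This is a generic PCA illustration, not the operator used in the study; the function name `pca` and the synthetic data are assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first principal components (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each attribute
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T             # coordinates in the new space
    explained = S ** 2 / np.sum(S ** 2)           # share of variance per component
    return scores, explained[:n_components]

# Synthetic stand-in for four climatic attributes (not real station data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
scores, explained = pca(X, n_components=2)
print(scores.shape, explained)
```

Each new coordinate is a weighted mix of all four original attributes, which is why the components have no direct physical meaning, as noted above. In practice the attributes would usually be normalized first so that no single scale dominates the components.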

On the other hand, a larger number of iterations forces the algorithm to form the clusters and recalculate the centroids more times. However, it reaches a point where the centroids stabilize and further iterations bring no improvement. As seen in the experimentation, 100 iterations was a balanced value for working with clustering, at which computational performance in terms of execution time is not affected. This saves an investigator from unnecessarily iterating thousands of times. It is verified that iterating with a larger number does not improve the evaluation index.

Table 13 Hardware performance results for Linkage-Complete for the Villamaria Hospital, Caldas Hospital, and PNNN El Cisne stations 

Source: the Authors.

For the agglomerative clustering algorithm, we first processed 20,000 instances to test the operation of the algorithm before creating the scenarios. The processing was found to be too slow, due in part to the great computational complexity of the algorithm. Once a distance measurement is chosen, a dissimilarity matrix is constructed. For a dataset of 20,000 instances, this is a 20,000 x 20,000 matrix, which requires considerable storage and processing resources. After this, the clusters are merged at each level and the dissimilarity matrix is updated. This has a great impact on computer processing: execution took more than 1 hour and 30 minutes for the dataset of 20,000 instances, that is, 72 times longer than the 5,000-instance scenarios. This result supports [1], where hierarchical grouping is not recommended for datasets of more than 10,000 instances. Therefore, it was decided to create scenarios with datasets not exceeding 5,000 instances.
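A back-of-the-envelope calculation helps explain these figures. The sketch below assumes each distance is stored as an 8-byte floating-point number, which is an assumption about storage and not a documented property of the tool; the function names are hypothetical.

```python
def full_matrix_bytes(n, bytes_per_value=8):
    """Memory for a dense n x n dissimilarity matrix (8-byte values assumed)."""
    return n * n * bytes_per_value

def unique_pairs(n):
    """Number of distinct pairs of instances, n(n-1)/2."""
    return n * (n - 1) // 2

for n in (5_000, 20_000):
    gigabytes = full_matrix_bytes(n) / 1e9
    print(f"n={n}: {gigabytes:.1f} GB dense matrix, {unique_pairs(n)} unique pairs")

size_ratio = 20_000 / 5_000
print("growth of the distance matrix:", size_ratio ** 2)
print("growth of a cubic-time merge loop:", size_ratio ** 3)
```

Under these assumptions the dense matrix grows from about 0.2 GB at 5,000 instances to about 3.2 GB at 20,000. A naive agglomerative procedure runs in time roughly cubic in the number of instances, which would predict a slowdown of about 64 times for a fourfold larger dataset, the same order of magnitude as the factor of 72 reported above. Whether the tool uses such a naive procedure is not stated in the source.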

Hierarchical grouping cannot process empty fields. Therefore, the missing data were replaced with the average value of the attribute.

In terms of attributes, precipitation makes the dendrogram more complex to analyze, not only because it creates an additional agglomeration in the lower levels, but also because it enlarges the dataset by thousands of records. As a result, the dendrogram agglomerates many instances and becomes too narrow for subsequent visualization and analysis. Because the initial dataset is large, it is recommended to use precipitation only with a dataset that has fewer instances than those used in this experimentation; that is, below 1,000 instances for the agglomerative algorithm.

Based on the above, a dendrogram of around 3,000 instances (leaves) allows an investigator to see easily how the instances merge from the intermediate level upward and to focus the observation on the higher levels, although the lower levels are impossible to read. To visualize them, a researcher must evaluate from level 0 of the tree. It is suggested to use datasets of fewer than 100 instances so that the dendrograms are more legible, allowing better analysis from the lowest levels. Hierarchical grouping is preferred for small datasets [1], [50].

On the other hand, normalization facilitated the construction of dendrograms by bringing the dissimilarity and similarity distances (Y axis) onto a scale between zero and one. This allowed the dendrogram to be viewed more simply. Dimensionality reduction had no significant effect on the results; therefore, it is concluded that it was not useful for the agglomerative algorithm.
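The two normalization techniques used in the scenarios can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation in the Z-transformation is an assumption; the tool may use a slightly different estimator.

```python
import numpy as np

def z_transform(X):
    """Z-transformation: each attribute gets zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0            # constant attributes are left at zero
    return (X - X.mean(axis=0)) / std

def range_transform(X, low=0.0, high=1.0):
    """Range transformation: each attribute is rescaled linearly to [low, high]."""
    X = np.asarray(X, dtype=float)
    minimum = X.min(axis=0)
    span = X.max(axis=0) - minimum
    span[span == 0] = 1.0          # avoid division by zero for constant attributes
    return low + (X - minimum) / span * (high - low)

# Hypothetical values on very different scales (e.g. degrees and W/m^2).
readings = np.array([[10.0, 0.0], [20.0, 50.0], [30.0, 100.0]])
print(z_transform(readings))
print(range_transform(readings))
```

Without such rescaling, the attribute with the largest numeric range would dominate the Euclidean distances, and the heights on the dendrogram's Y axis would mostly reflect that single attribute.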

In computational terms, the algorithm uses similar machine resources in all the scenarios, regardless of the preset characteristics. However, high execution and CPU times were found for scenario 1 (up to eight times greater than in the rest of the scenarios, with a difference of only 2,000 instances), confirming that using datasets with large instance volumes for agglomerative hierarchical grouping can lead to slow processing.

DISCUSSION

To determine, in a preliminary study, the behavior of clustering algorithms on climate data, we selected stations and datasets with different characteristics, defined scenarios to which variants of the learning algorithms were applied, and evaluated the behavior of the metrics.

The results, without being conclusive, can guide people who work with these data in the speedy selection of these elements, which we consider the contribution of this work.

For K-means, the Hospital de Caldas station has more clustering evaluations with better quality than the other two stations. This is determined by counting the indices that are equal to or below a reference value, in this case -0.700. This may be because a dataset whose attribute values do not contain extreme conditions (such as high or low temperatures) is associated with better clustering evaluation indices for this algorithm.

For K-means, the best clustering evaluation index for the Hospital de Villamaria station had a value of -1.004, while that for the Hospital de Caldas station had a value of -1.009. These best results were obtained for the climate datasets from a region that ranges between 1,790 masl and 2,183 masl (between warm and temperate climates), using K-means with K = 3, normalization with Z-transformation, and 10 iterations.

Regarding the El Cisne PNNN station, whose dataset comes from a high-altitude source at 4,812 meters above sea level, the best evaluation index was -1.051, with K = 2, normalization with Z-transformation, and 10 iterations of the algorithm.

On the other hand, for K-medoids, the El Cisne PNNN station has more clustering evaluations with better quality than the other two stations. This is again determined by counting the indices that are equal to or below the reference value of -0.700. This may be because a dataset whose attribute values contain extreme conditions (high temperatures or a relative humidity of 100%), such as that of the El Cisne PNNN station, tends toward better clustering evaluation indices in K-medoids.

For K-medoids, the best clustering evaluation index for the Villamaria Hospital station had a value of -1.405. This result was obtained for a climate dataset from a region at around 1,790 masl (warm climate), using K-medoids with K = 5 clusters, normalization with Z-transformation, and 10 iterations of the algorithm.

For the Hospital de Caldas station (altitude of 2,183 masl, temperate climate), the best index had a value of 14.231, using K = 3, without any other characteristic. Regarding the El Cisne PNNN station (4,812 meters, extremely cold weather), the best clustering evaluation had a value of -7.937, using K = 5, without any other characteristics.

Based on the above and the information in the Results section, the cluster evaluation indices for K-medoids have very low values compared to those obtained with K-means. Of the two partitioned algorithms used in the experimental framework, K-medoids presented the best performance and results.

For Linkage-Complete agglomerative clustering, datasets that contain the fewest instances and have been normalized with range transformation produce the best dendrograms, in graphic terms. Having fewer instances makes the dendrogram easier to visualize and analyze, and normalization shortens the similarity distances (Y axis). A performance evaluation index cannot be applied directly to this algorithm because it is hierarchical clustering: researchers must develop external software functionality to provide performance evaluations at a mathematical level [51] and must determine at what point to cut the tree to obtain a number of clusters (K) and, from there, analyze the results.
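One way to obtain K clusters from a complete-linkage hierarchy, so that an index such as Davies-Bouldin could then be applied to the resulting labels, is to stop merging once K clusters remain, which corresponds to cutting the dendrogram at that level. The sketch below is a naive illustration under that assumption, suitable only for small datasets; the function name and toy data are hypothetical and this is not the procedure of [51].

```python
import numpy as np

def complete_linkage_labels(X, k):
    """Naive complete-linkage clustering stopped at k clusters (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Dense n x n dissimilarity matrix with Euclidean distance.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # a cluster is never merged with itself
    clusters = {i: [i] for i in range(n)}
    active = list(range(n))
    while len(active) > k:
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        # Complete linkage: distance to the merged cluster is the larger
        # of the distances to its two parts.
        D[i, :] = np.maximum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        clusters[i].extend(clusters.pop(j))
        active.remove(j)
    labels = np.empty(n, dtype=int)
    for label, members in enumerate(clusters.values()):
        labels[members] = label
    return labels

# Hypothetical toy data: three pairs of nearby points.
data = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
print(complete_linkage_labels(data, k=3))
```

The repeated search over the full matrix makes the cost of this sketch grow roughly with the cube of the number of instances, which matches the recommendation above to keep hierarchical grouping to small datasets.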

The contribution sought with this work is to provide some basic guidelines, so as not to start from scratch, on certain decisions in the analysis of clusters with meteorological data, as well as to help identify the algorithm and the most important parameters to take into account for the best performance, in accordance with the particular conditions and requirements [52].

CONCLUSIONS AND FUTURE WORK

For future work, it is recommended to use other types of scenarios, treatments, algorithms, and other amounts of clusters to see performance evaluations. It would also be important to know how to evaluate hierarchical agglomerative algorithms to determine the quality of dendrograms to break the subjectivity of each researcher and to apply mathematical measurements.

Furthermore, carrying out scenarios with a K value greater than 5 would allow researchers to investigate what happens with clustering and performance for partitioned algorithms (K-medoids, K-means), both at the machine level and in their performance.

On the other hand, evaluating data on a time scale (per day, per week, etc.) using time series would reveal interesting clustering behaviors, as well as the quality of the clusters within a timeline for different seasons or times of the year (for example, how the performance varies for cold or summer seasons). It would also be interesting to process scenarios with a larger dataset (millions of instances) for K-means, in order to better observe the computational behavior on a larger scale. This would help determine how efficient the algorithm is for large datasets and better detect new patterns or relationships.

Based on the results, it is possible to suggest using other normalization methods, such as ratio and interquartile range transformation, to see how clustering behaves with them.

It is recommended to use techniques, such as Ordinary Kriging, to handle the large amounts of zeros that a variable contains within a dataset.

REFERENCES

[1] A. Arroyo, A. Herrero, V. Tricio, and E. Corchado, "Analysis of meteorological conditions in Spain by means of clustering techniques," in J. Appl. Log., vol. 24, 2017, pp. 76-89. Available: https://doi.org/10.1016/j.jal.2016.11.026Links ]

[2] M. A. Asadi Zarch, B. Sivakumar, and A. Sharma, "Assessment of global aridity change," J. Hydrol., Vol. 520, , 2015, pp. 300-313. Available: https://doi.org/10.1016/1.jhydrol.2014.11.033Links ]

[3] L. Carro-Calvo, C. Ordonez, R. Garcia-Herrera, and J. L. Schnell, "Spatial clustering and meteorological drivers of summer ozone in Europe," in Atmos. Environ., Vol. 167, 2017, pp. 496-510. Available: https://doi.org/10.1016/Latmosenv.2017.08.050Links ]

[4] M. J. Carvalho, P. Melo-Goncalves, J. C. Teixeira, and A. Rocha, "Regionalization of Europe based on a K-Means Cluster Analysis of the climate change of temperatures and precipitation," in Phys. Chem. Earth, Vol. 94, , 2016, pp. 22-28. Available: https://doi.org/10.1016/j.pce.2016.05.001Links ]

[5] J. Chen, M. Song, and L. Xu, "Evaluation of environmental efficiency in China using data envelopment analysis," in Ecol. Indic., Vol. 52, 2015, pp. 577-583. Available: https://doi.org/10.1016/Lecolind.2014.05.008Links ]

[6] L. Chen and G. Jia, "Environmental efficiency analysis of China's regional industry : a data envelopment analysis (DEA) based approach," in J. Clean. Prod., Vol. 142, 2017, pp. 846-853. Available: https://doi.org/10.1016/Liclepro.2016.01.045Links ]

[7] R. Falquina and C. Gallardo, "Development and application of a technique for projecting novel and disappearing climates using cluster analysis," in Atmos. Res., Vol. 197, No. July 2017, pp. 224-231. Available: https://doi.org/10.1016/Latmosres.2017.06.031Links ]

[8] A. M. Kalteh, P. Hjorth, and R. Berndtsson, "Review of the self-organizing map (SOM) approach in water resources: Analysis, modelling and application," in Environ. Model. Softw., Vol. 23, No.7, 2008, pp. 835-845. Available: https://doi.org/http://dx.doi.org/10.1016/j.envsoft.2007.10.001Links ]

[9] S. C. Sheridan and C. C. Lee, "The self-organizing map in synoptic climatological research," Prog. Phys. Geogr., Vol. 35, No. 1, 2011, pp. 109-119. Available: https://doi.org/10.1177/0309133310397582Links ]

[10] X. Wang et al., "A stepwise cluster analysis approach for downscaled climate projection - A Canadian case study," Environ. Model. Softw., Vol. 49, 2013, pp. 141-151. Available: https://doi.org/10.1016/j.envsoft.2013.08.006

[11] Y. Zheng et al., "Vegetation response to climate conditions based on NDVI simulations using stepwise cluster analysis for the Three-River Headwaters region of China," in Ecol. Indic., No. September 2016, pp. 0-1, 2017. Available: https://doi.org/10.1016/j.ecolind.2017.06.040

[12] X. Zuo, H. Hua, Z. Dong, and C. Hao, "Environmental Performance Index at the Provincial Level for China 2006-2011," in Ecol. Indic., Vol. 75, 2017, pp. 48-56. Available: https://doi.org/10.1016/j.ecolind.2016.12.016

[13] S. A. Cashman et al., "Mining Available Data from the United States Environmental Protection Agency to Support Rapid Life Cycle Inventory Modeling of Chemical Manufacturing," in Environ. Sci. Technol., Vol. 50, No. 17, 2016, pp. 9013-9025. Available: https://doi.org/10.1021/acs.est.6b02160

[14] C. Gallo, N. Faccilongo, and P. La Sala, "Clustering analysis of environmental emissions: A study on Kyoto Protocol's impact on member countries," J. Clean. Prod., 2017. Available: https://doi.org/10.1016/j.jclepro.2017.07.194

[15] J. Jiang, B. Ye, D. Xie, and J. Tang, "Provincial-level carbon emission drivers and emission reduction strategies in China: Combining multi-layer LMDI decomposition with hierarchical clustering," in J. Clean. Prod., Vol. 169, 2017, pp. 178-190. Available: https://doi.org/10.1016/j.jclepro.2017.03.189

[16] I. Meghea, M. Mihai, I. Lacatusu, and I. Iosub, "Evaluation of Monitoring of Lead Emissions in Bucharest by Statistical Processing," in J. Environ. Prot. Ecol., Vol. 13, No. 2, 2012, pp. 746-755. Available: http://www.scopus.com/inward/record.url?eid=2-s2.0-84864251930&partnerID=MN8TOARS

[17] N. Clay and B. King, "Smallholders' uneven capacities to adapt to climate change amid Africa's green revolution: Case study of Rwanda's crop intensification program," in World Dev., Vol. 116, 2019, pp. 1-14. Available: https://doi.org/S0305750X18304285

[18] N. D. Abdul Halim et al., "The long-term assessment of air quality on an island in Malaysia," in Heliyon, Vol. 4, No. 12, 2018. Available: https://doi.org/10.1016/j.heliyon.2018.e01054

[19] T. Conradt, C. Gornott, and F. Wechsung, "Extending and improving regionalized winter wheat and silage maize yield regression models for Germany: Enhancing the predictive skill by panel definition through cluster analysis," in Agric. For. Meteorol., Vol. 216, 2016, pp. 68-81. Available: https://doi.org/10.1016/j.agrformet.2015.10.003

[20] S. Farah, D. Whaley, W. Saman, and J. Boland, "Integrating Climate Change into Meteorological Weather Data for Building Energy Simulation," in Energy Build., Vol. 183, 2019, pp. 749-760. Available: https://doi.org/S0378778818323296

[21] T. Soubdhan, M. Abadi, and R. Emilion, "Time dependent classification of solar radiation sequences using best information criterion," in Energy Procedia, Vol. 57, 2014, pp. 1309-1316. Available: https://doi.org/10.1016/j.egypro.2014.10.121

[22] S. Khedairia and M. T. Khadir, "Impact of clustered meteorological parameters on air pollutants concentrations in the region of Annaba, Algeria," in Atmos. Res., Vol. 113, 2012, pp. 89-101. Available: https://doi.org/10.1016/j.atmosres.2012.05.002

[23] T. Schneider, H. Hampel, P. V. Mosquera, W. Tylmann, and M. Grosjean, "Paleo-ENSO revisited: Ecuadorian Lake Pallcacocha does not reveal a conclusive El Niño signal," in Glob. Planet. Change, Vol. 168, No. February, 2018, pp. 54-66. Available: https://doi.org/10.1016/j.gloplacha.2018.06.004

[24] F. Franceschi, M. Cobo, and M. Figueredo, "Discovering relationships and forecasting PM10 and PM2.5 concentrations in Bogotá Colombia, using Artificial Neural Networks, Principal Component Analysis, and k-means clustering," in Atmos. Pollut. Res., Vol. 9, No. 5, 2018, pp. 912-922. Available: https://doi.org/10.1016/j.apr.2018.02.006

[25] A. K. Yadav, H. Malik, and S. S. Chandel, "Application of rapid miner in ANN based prediction of solar radiation for assessment of solar energy resource potential of 76 sites in Northwestern India," in Renew. Sustain. Energy Rev., Vol. 52, 2015, pp. 1093-1106. Available: https://doi.org/10.1016/j.rser.2015.07.156

[26] Y. Hao, L. Dong, X. Liao, J. Liang, L. Wang, and B. Wang, "A novel clustering algorithm based on mathematical morphology for wind power generation prediction," in Renew. Energy, Vol. 136, 2019, pp. 572-585. Available: https://doi.org/10.1016/j.renene.2019.01.018

[27] S. Han et al., "Quantitative evaluation method for the complementarity of wind-solar-hydro power and optimization of wind-solar ratio," in Appl. Energy, Vol. 236, No. December 2018, pp. 973-984, 2019. Available: https://doi.org/10.1016/j.apenergy.2018.12.059

[28] M. André, R. Perez, T. Soubdhan, J. Schlemmer, R. Calif, and S. Monjoly, "Preliminary assessment of two spatio-temporal forecasting technics for hourly satellite-derived irradiance in a complex meteorological context," in Sol. Energy, Vol. 177, No. December 2018, pp. 703-712, 2019. Available: https://doi.org/10.1016/j.solener.2018.11.010

[29] P. Lin, Z. Peng, Y. Lai, S. Cheng, Z. Chen, and L. Wu, "Short-term power prediction for photovoltaic power plants using a hybrid improved Kmeans-GRA-Elman model based on multivariate meteorological factors and historical power datasets," in Energy Convers. Manag., Vol. 177, No. July, 2018, pp. 704-717. Available: https://doi.org/10.1016/j.enconman.2018.10.015

[30] F. Mokdad and B. Haddad, "Improved infrared precipitation estimation approaches based on k-means clustering: Application to north Algeria using MSG-SEVIRI satellite data," in Adv. Sp. Res., Vol. 59, No. 12, 2017, pp. 2880-2900. Available: https://doi.org/10.1016/j.asr.2017.03.027

[31] S. Li, H. Ma, and W. Li, "Typical solar radiation year construction using k-means clustering and discrete-time Markov chain," in Appl. Energy, Vol. 205, No. May, 2017, pp. 720-731. Available: https://doi.org/10.1016/j.apenergy.2017.08.067

[32] M. Ghayekhloo, M. Ghofrani, M. B. Menhaj, and R. Azimi, "A novel clustering approach for short-term solar radiation forecasting," in Sol. Energy, Vol. 122, 2015, pp. 1371-1383. Available: https://doi.org/10.1016/j.solener.2015.10.053

[33] M. Bador, P. Naveau, E. Gilleland, M. Castellá, and T. Arivelo, "Spatial clustering of summer temperature maxima from the CNRM-CM5 climate model ensembles & E-OBS over Europe," in Weather Clim. Extrem., Vol. 9, 2015, pp. 17-24. Available: https://doi.org/10.1016/j.wace.2015.05.003

[34] L. Pokorná, M. Kucerová, and R. Huth, "Annual cycle of temperature trends in Europe, 1961-2000," in Glob. Planet. Change, Vol. 170, No. August, 2018, pp. 146-162. Available: https://doi.org/10.1016/j.gloplacha.2018.08.015

[35] J. Parente, M. G. Pereira, and M. Tonini, "Space-time clustering analysis of wildfires: The influence of dataset characteristics, fire prevention policy decisions, weather and climate," in Sci. Total Environ., Vol. 559, 2016, pp. 151-165. Available: https://doi.org/10.1016/j.scitotenv.2016.03.129

[36] M. I. Chidean, J. Muñoz-Bulnes, J. Ramiro-Bargueño, A. J. Caamaño, and S. Salcedo-Sanz, "Spatio-temporal trend analysis of air temperature in Europe and Western Asia using data-coupled clustering," in Glob. Planet. Change, Vol. 129, 2015, pp. 45-55. Available: https://doi.org/10.1016/j.gloplacha.2015.03.006

[37] M. I. Chidean, A. J. Caamaño, J. Ramiro-Bargueño, C. Casanova-Mateo, and S. Salcedo-Sanz, "Spatio-temporal analysis of wind resource in the Iberian Peninsula with data-coupled clustering," in Renew. Sustain. Energy Rev., Vol. 81, No. June, 2018, pp. 2684-2694. Available: https://doi.org/10.1016/j.rser.2017.06.075

[38] Y. Zheng et al., "Assessment of global aridity change," Ecol. Indic., Vol. 75, No. September 2016, pp. 151-165, 2016. Available: https://doi.org/10.1016/j.scitotenv.2015.11.063

[39] J. S. Ramirez, N. D. Duque, and J. J. Velez, "Normalización en desempeño de k-means sobre datos climáticos," in Vínculos, Vol. 16, 2019, pp. 57-72. Available: https://doi.org/10.14483/2322939X.15550

[40] D. G. de B. Franco and M. T. A. Steiner, "Clustering of solar energy facilities using a hybrid fuzzy c-means algorithm initialized by metaheuristics," in J. Clean. Prod., Vol. 191, 2018, pp. 445-457. Available: https://doi.org/10.1016/j.jclepro.2018.04.207

[41] J. Hidalgo et al., "Comparison between local climate zones maps derived from administrative datasets and satellite observations," in Urban Clim., Vol. 27, No. November 2017, pp. 64-89, 2019. Available: https://doi.org/10.1016/j.uclim.2018.10.004

[42] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, 2013. Available: https://doi.org/10.1201/9781315373515

[43] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications. Philadelphia, Pennsylvania: SIAM - Society for Industrial and Applied Mathematics, 2007. Available: https://doi.org/10.1137/1.9780898718348

[44] T. T. Nguyen, A. Kawamura, T. N. Tong, N. Nakagawa, H. Amaguchi, and R. Gilbuena, "Clustering spatio-seasonal hydrogeochemical data using self-organizing maps for groundwater quality assessment in the Red River Delta, Vietnam," in J. Hydrol., Vol. 522, 2015, pp. 661-673. Available: https://doi.org/10.1016/j.jhydrol.2015.01.023

[45] H. Yahyaoui and H. S. Own, "Unsupervised clustering of service performance behaviors," in Inf. Sci. (Ny)., Vol. 422, 2018, pp. 558-571. Available: https://doi.org/10.1016/j.ins.2017.08.065

[46] A. Lausch, A. Schmidt, and L. Tischendorf, "Data mining and linked open data - New perspectives for data analysis in environmental research," in Ecol. Modell., Vol. 295, 2015, pp. 5-17. Available: https://doi.org/10.1016/j.ecolmodel.2014.09.018

[47] A. Naik and L. Samant, "Correlation Review of Classification Algorithm Using Data Mining Tool: WEKA, Rapidminer, Tanagra, Orange and Knime," in Procedia Comput. Sci., Vol. 85, No. Cms, 2016, pp. 662-668. Available: https://doi.org/10.1016/j.procs.2016.05.251

[48] V. Obradovic, D. Bjelica, D. Petrovic, M. Mihic, and M. Todorovic, "Whether We are Still Immature to Assess the Environmental KPIs!," in Procedia - Soc. Behav. Sci., Vol. 226, No. October 2015, pp. 132-139, 2016. Available: https://doi.org/10.1016/j.sbspro.2016.06.171

[49] K. Pitchayadejanant and P. Nakpathom, "Data mining approach for arranging and clustering the agro-tourism activities in orchard," in Kasetsart J. Soc. Sci., 2017. Available: https://doi.org/10.1016/j.kjss.2017.07.004

[50] S. S. Shaukat, T. A. Rao, and M. A. Khan, "Impact of sample size on principal component analysis ordination of an environmental data set: Effects on Eigenstructure," in Ekol. Bratislava, Vol. 35, No. 2, 2016, pp. 173-190. Available: https://doi.org/10.1515/eko-2016-0014

[51] N. Erman and J. Suklan, "Performance of selected agglomerative clustering methods," in Innov. Issues Approaches Soc. Sci., Vol. 8, No. January, 2015. Available: https://doi.org/10.12959/issn.1855-0541.IIASS-2015-no1-art11

[52] J. Ramírez, "Evaluación de algoritmos de aprendizaje de máquina no supervisados sobre datos climáticos". Universidad Nacional de Colombia, 2019. Available: https://repositorio.unal.edu.co/bitstream/handle/unal/75848/1053773873.2019.pdf?isAllowed=y&sequence=3

Received: March 25, 2022; Accepted: August 16, 2022

This is an open-access article distributed under the terms of the Creative Commons Attribution License.