Machine learning applied to the prediction of diabetes mellitus, using socioeconomic and environmental information from health system users

Mejía, Jessner Alexander; Oviedo-Benalcázar, Mario Andrés; Ordoñez, José Armando; Valencia-Murillo, José Fernando

doi:10.17533/udea.rfnsp.e351168


Revista Facultad Nacional de Salud Pública

Print version ISSN 0120-386X; On-line version ISSN 2256-3334

Abstract

MEJIA, Jessner Alexander; OVIEDO-BENALCAZAR, Mario Andrés; ORDONEZ, José Armando and VALENCIA-MURILLO, José Fernando. Machine learning applied to the prediction of diabetes mellitus, using socioeconomic and environmental information from health system users. Rev. Fac. Nac. Salud Pública [online]. 2023, vol.41, n.2, e06. Epub 15-Nov-2023. ISSN 0120-386X. https://doi.org/10.17533/udea.rfnsp.e351168.

Objective:

To apply machine learning models to support the early diagnosis of diabetes mellitus using environmental, social, economic, and health variables, without relying on the collection of clinical samples.

Methodology:

Data were used from 10,889 users affiliated with the subsidized health system in southwestern Colombia, all diagnosed with hypertension and grouped into those without (74.3%) and with (25.7%) diabetes mellitus. Supervised models were trained using k-nearest neighbors, decision trees, and random forests, as well as ensemble-based models, and were applied to the database before and after balancing the number of cases in each diagnostic group. Performance was evaluated by splitting the database into training (70%) and test (30%) sets, using accuracy, sensitivity, specificity, and area under the curve as metrics.
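The 70/30 split and the class balancing described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the split is stratified by diagnosis, balancing is done by random undersampling of the larger group, and balancing is applied to the training portion only. The function names `stratified_split` and `undersample` are hypothetical, and the labels are synthetic values that only mimic the reported 25.7% proportion.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def undersample(indices, labels, seed=0):
    """Randomly drop rows of larger classes until every class matches the smallest.

    Assumption: the abstract does not say which balancing technique was used;
    random undersampling is only one possible choice.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rng.sample(rows, n_min))
    return sorted(balanced)


# Synthetic labels (1 = diabetes, 0 = no diabetes); not the study data.
labels = [1] * 257 + [0] * 743
train_idx, test_idx = stratified_split(labels)
balanced_train = undersample(train_idx, labels)
```

Balancing only the training rows leaves the test set with its original class proportions, so the evaluation metrics would still reflect the population being screened.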

Results:

Sensitivity increased significantly with balanced data, rising from a maximum of 17.1% (unbalanced data) to as much as 57.4% (balanced data). The highest area under the curve (0.61) was obtained with the ensemble models, after balancing the number of cases in each group and encoding the categorical variables. The variables with the greatest weight were associated with hereditary aspects (24.65%) and ethnic group (5.59%), followed by visual difficulty, low water consumption, a diet low in fruits and vegetables, and the consumption of salt and sugar.
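For reference, the four metrics named in the methodology can be computed as in the following minimal sketch. It uses the standard definitions, with the area under the curve taken as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function names, the 0.5 decision threshold, and the toy values are assumptions for illustration and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of actual positives that the model detects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of actual negatives that the model correctly rejects."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores; not the study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```

With an imbalanced sample, accuracy and specificity can stay high while sensitivity stays low, which is why sensitivity is reported separately.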

Conclusions:

Although predictive models, using people's socioeconomic and environmental information, emerge as a tool for the early diagnosis of diabetes mellitus, their predictive capacity still needs to be improved.

Keywords: machine learning; diabetes mellitus; environmental factors; socioeconomic factors; predictive model.
