Introduction
Hypertension is a major modifiable risk factor for cardiovascular disease, yet achieving adequate blood pressure control in affected patients remains a significant public health challenge [1]. Global estimates indicate that approximately 42% of adults with hypertension are diagnosed and treated, but only 20% of them achieve controlled blood pressure levels [2]. This gap in hypertension management is more pronounced in low- and middle-income countries, including Mexico, where health system barriers and socioeconomic disparities create additional obstacles to optimal care [3].
The management of hypertension is complex due to its multifactorial nature, with control rates influenced by the interplay of biological, behavioral, and socioeconomic determinants [4,5]. Conventional approaches to hypertension classification and treatment, while useful for identifying high-risk clusters, may overlook important variations in treatment response and outcomes. Data-driven phenotyping approaches, which identify clinically and epidemiologically distinct clusters with shared characteristics, could provide critical insights for developing targeted interventions [6].
In this context, data-driven techniques are useful analytical tools for uncovering patterns in complex health data [7]. These methods have demonstrated promise in research on other chronic conditions [8-10]. For hypertension specifically, a data-driven approach implemented at Boston Medical Center improved blood pressure control compared to standard care [11]. To the authors' knowledge, such methods remain underutilized in hypertension research, particularly for Latin American populations. By extending beyond conventional risk factor analysis, these approaches can identify distinct hypertension control phenotypes that may benefit from tailored management strategies.
While multiple unsupervised learning techniques are available, Principal Component Analysis (PCA) captures continuous latent structure [12], and Gaussian Mixture Models (GMM) accommodate overlapping clusters through probabilistic boundaries [13]; both features are well suited to the complexity of hypertension control phenotypes. Compared to other clustering methods, GMM offers greater flexibility in modeling subpopulations with differing variances and covariances [14].
This study aimed to identify and characterize distinct blood pressure control phenotypes among Mexican adults with diagnosed hypertension using PCA and GMM as data-driven techniques. These findings could provide novel insights into the heterogeneity of hypertension control in Mexico and inform more targeted approaches to blood pressure management in similar settings.
Methods
Study population and data sources
Data from the 2022 National Health and Nutrition Survey (ENSANUT, by its Spanish acronym) were analyzed. This nationally representative cross-sectional survey assesses health and nutritional status across Mexico [15].
The study sample included adults aged 20 years and older with physician-diagnosed hypertension, identified through affirmative responses to the survey item: "Has a doctor ever told you that you have high blood pressure?" (response options: Yes, Yes [during pregnancy only], and No). Women who reported a hypertension diagnosis exclusively during pregnancy were excluded, as were participants with missing data for any study variable.
A total of 11,913 adults were interviewed. Among them, 2,290 reported a prior medical diagnosis of hypertension. Of these, 29 were excluded due to a pregnancy-related diagnosis. A total of 1,933 participants indicated that they were currently receiving pharmacological treatment for hypertension, and 1,542 had sequential blood pressure measurements available. Additionally, 234 individuals were excluded due to missing information relevant to the analysis, leaving 1,308 participants in the analytical sample. No imputation procedures were performed; observations with missing values for any study variable were removed by listwise (complete-case) deletion.
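The participant flow above can be checked with simple arithmetic. The following is a minimal sketch in Python (the study itself was analyzed in R); it uses only the counts reported in this section, assumes the exclusions were applied sequentially, and the variable names are illustrative.

```python
# Counts as reported in the text; variable names are illustrative only.
interviewed = 11_913
diagnosed = 2_290           # reported a prior medical diagnosis of hypertension
pregnancy_only = 29         # diagnosis exclusively during pregnancy (excluded)
treated = 1_933             # currently receiving pharmacological treatment
with_bp_readings = 1_542    # sequential blood pressure measurements available
missing_data = 234          # excluded for missing analysis variables

# Assumption: each exclusion is applied to the group named just before it.
eligible_diagnosed = diagnosed - pregnancy_only
analytical_sample = with_bp_readings - missing_data

print(f"Diagnosed, not pregnancy-related: {eligible_diagnosed}")
print(f"Analytical sample: {analytical_sample}")
```

The final figure agrees with the 1,308 patients reported in the Results.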
Outcome
The primary outcome was controlled hypertension, defined as a binary measure (no/yes) among adults with a prior physician diagnosis of arterial hypertension. Blood pressure assessment followed a standardized protocol: after ≥ 5 minutes of rest, trained personnel obtained three sequential measurements using calibrated digital sphygmomanometers, with 2-3-minute intervals between readings. The arithmetic mean of the second and third measurements was calculated, classifying participants as having controlled hypertension if their average systolic pressure was < 130 mmHg and/or diastolic pressure was < 80 mmHg.
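The outcome definition can be sketched as a small function. This is an illustration only, written in Python rather than the R used in the study. Because the text says "and/or", the hypothetical `require_both` switch exposes both readings of the rule; which one was applied in the analysis is not stated here. The example readings are made up.

```python
from statistics import mean

def is_controlled(readings, sbp_cutoff=130, dbp_cutoff=80, require_both=True):
    """Classify control from three sequential (systolic, diastolic) readings.

    The first reading is discarded and the second and third are averaged,
    as described in the text. `require_both` is a hypothetical switch that
    selects the "and" or the "or" interpretation of the threshold rule.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three sequential readings")
    sbp = mean(r[0] for r in readings[1:])
    dbp = mean(r[1] for r in readings[1:])
    sbp_ok = sbp < sbp_cutoff
    dbp_ok = dbp < dbp_cutoff
    return (sbp_ok and dbp_ok) if require_both else (sbp_ok or dbp_ok)

# Made-up readings in mmHg: mean of readings 2 and 3 is 127/81.
example = [(142, 88), (128, 82), (126, 80)]
strict = is_controlled(example, require_both=True)
lenient = is_controlled(example, require_both=False)
print(strict, lenient)
```

For this example the two interpretations disagree, which shows why the choice between them affects the estimated prevalence of control.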
Variable selection and preprocessing
Variables spanning key clinical and epidemiological domains were selected based on their reported association with the analysed outcome [16-18].
Demographic characteristics included sex, age, and socioeconomic status (categorized as low, middle, or high). Clinical variables comprised time since hypertension diagnosis (stratified as ≤ 5 years, 6-10 years, or > 10 years) and comorbid type 2 diabetes status. Behavioural measures assessed self-reported adherence to dietary modifications and physical activity levels. Healthcare access was characterized by the usual source of medical care, categorized across social security institutions, public sector providers, private services, and other sources.
Principal Component Analysis
PCA was performed to reduce the dimensionality of the dataset and identify the underlying patterns of variability. This analysis facilitated the identification of key variables contributing most to the variance in the data, thereby simplifying the subsequent clustering analysis. By transforming the data into a smaller number of uncorrelated components, PCA enhanced the interpretability of the GMM clustering.
Variable contributions
The contributions of the original variables to each principal component were extracted from the PCA results. For each variable, the mean contribution across all derived dimensions was computed. Variables were then ranked by their contribution to principal component 1 (PC1) to identify the most influential features in explaining data variance.
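The text does not state the formula used for variable contributions. A common definition, assumed here for illustration, expresses the contribution of a variable to a component as its squared loading in percent. The symbols below are not from the original text.

```latex
% Assumed definition; symbols are illustrative.
% v_{jk}: loading of input column j on component k (unit-length eigenvector),
% \lambda_k: eigenvalue of component k, p: number of input columns.
\[
\mathrm{contrib}_{jk}
  = 100 \times \frac{v_{jk}^{2}}{\sum_{j'=1}^{p} v_{j'k}^{2}}
  = 100 \times v_{jk}^{2},
\qquad
\text{variance explained by PC}_k
  = 100 \times \frac{\lambda_k}{\sum_{m=1}^{p} \lambda_m}.
\]
```

Under this definition, the contributions to any one component sum to 100% across all input columns.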
Gaussian Mixture Model
To identify natural clusters within the data, a GMM was employed, a probabilistic clustering approach that assumes the data is generated from a mixture of several multivariate Gaussian distributions. The optimal number of clusters was determined by comparing models with different numbers of components using the Bayesian Information Criterion (BIC), which balances model fit with complexity. To ensure robust parameter estimation, we ran the Expectation-Maximization (EM) algorithm with multiple random initializations, mitigating the risk of converging to local optima.
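The model and selection criterion described above can be written compactly. The notation is introduced here for illustration. The BIC is shown in the sign convention used by the mclust package, where larger values indicate a better trade-off between fit and complexity; other software often reports the negative of this quantity.

```latex
% Illustrative notation: K components, mixing weights \pi_k,
% component means \mu_k and covariances \Sigma_k.
\[
f(\mathbf{x}_i)
  = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}\!\left(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\]
% \hat{L}: maximized likelihood, m: number of free parameters, n: sample size.
\[
\mathrm{BIC} = 2 \log \hat{L} - m \log n .
\]
```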
For each observation, cluster assignments were made by selecting the component with the highest posterior probability. We evaluated cluster quality by examining silhouette width and stability.
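The two steps above, maximum-posterior assignment and silhouette width, can be sketched in plain Python. This is not the code used in the study, which was run in R. The posterior probabilities and points are toy values, the function name is hypothetical, and Euclidean distance is an assumption.

```python
from math import dist

def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for each point.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Requires at least two clusters; singleton clusters get width 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy posterior probabilities over three components (made-up numbers);
# each observation is assigned to its most probable component.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
assignments = [max(range(len(p)), key=p.__getitem__) for p in posteriors]

# Toy 2-D points forming two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
print(assignments, [round(w, 2) for w in widths])
```

Widths near 1 indicate points much closer to their own cluster than to any other; values near 0 or below indicate overlap with a neighboring cluster.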
To characterize the identified clusters, we computed the mean values of numerical features and the most frequent categories for categorical variables within each cluster. This allowed us to create distinct profiles that highlighted the key differences between clusters.
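A cluster profile of this kind (means for numerical features, modes for categorical ones) could be computed as in the sketch below. The records and field names are invented for illustration and do not come from ENSANUT; the study itself used R.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up records for illustration; field names are hypothetical.
records = [
    {"cluster": 1, "age": 64, "sex": "Female", "care": "Social security"},
    {"cluster": 1, "age": 70, "sex": "Female", "care": "Private"},
    {"cluster": 1, "age": 61, "sex": "Male", "care": "Social security"},
    {"cluster": 2, "age": 55, "sex": "Male", "care": "Public"},
    {"cluster": 2, "age": 59, "sex": "Male", "care": "Public"},
]

def profile_clusters(rows, numeric, categorical):
    """Summarize each cluster by size, numeric means, and categorical modes."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)
    profiles = {}
    for cluster, members in sorted(grouped.items()):
        summary = {"n": len(members)}
        for name in numeric:
            summary[f"mean_{name}"] = mean(m[name] for m in members)
        for name in categorical:
            counts = Counter(m[name] for m in members)
            summary[f"mode_{name}"] = counts.most_common(1)[0][0]
        profiles[cluster] = summary
    return profiles

profiles = profile_clusters(records, numeric=["age"], categorical=["sex", "care"])
for cluster, summary in profiles.items():
    print(cluster, summary)
```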
All analyses were implemented in R using the mclust package (v5.4.7), which provides flexible modeling of covariance structures (e.g., spherical, diagonal, or full). Cluster visualizations were generated by overlaying the GMM results onto principal component plots, facilitating intuitive interpretation of the clustering in reduced-dimensional space.
Results
Sample characteristics
Data from 1,308 adult patients were analyzed. The mean age (± standard deviation) was 61.1 ± 13.0 years, with an interquartile range of 52 to 70 years. Most participants were female (69.8%, 𝑛 = 913). A total of 912 patients had controlled blood pressure, corresponding to a prevalence of 69.7%. Other characteristics of the study sample are summarized in Table 1.
Table 1 Characteristics of study sample according to blood-pressure control, Mexico 2022
| Characteristic | Overall (𝑛 = 1,308) | Uncontrolled (𝑛 = 396) | Controlled (𝑛 = 912) | P |
|---|---|---|---|---|
| Sex | | | | |
| Female | 913 (69.8) | 263 (66.4) | 650 (71.3) | 0.079 |
| Male | 395 (30.2) | 133 (33.6) | 262 (28.7) | |
| Age, years | 61.1 ± 13.0 | 63.4 ± 12.3 | 60.1 ± 13.1 | < 0.001 |
| Socioeconomic status | | | | |
| Low | 389 (29.7) | 141 (35.6) | 248 (27.2) | 0.003 |
| Middle | 458 (35.1) | 138 (34.9) | 320 (35.1) | |
| High | 461 (35.2) | 117 (29.5) | 344 (37.7) | |
| Time since hypertension diagnosis, years | | | | |
| ≤ 5 | 561 (42.9) | 142 (35.9) | 419 (46.0) | 0.002 |
| 6 - 10 | 312 (23.9) | 101 (25.5) | 211 (23.1) | |
| > 10 | 435 (33.3) | 153 (38.6) | 282 (30.9) | |
| Comorbid type 2 diabetes mellitus, self-reported | | | | |
| No | 852 (65.1) | 248 (62.6) | 604 (66.2) | 0.209 |
| Yes | 456 (34.9) | 148 (37.4) | 308 (33.8) | |
| Usual source of medical care | | | | |
| Social security institutions | 666 (50.9) | 190 (48.0) | 476 (52.2) | 0.270 |
| Public sector | 257 (19.6) | 87 (22.0) | 170 (18.6) | |
| Private services | 358 (27.4) | 108 (27.2) | 250 (27.4) | |
| Other | 27 (2.1) | 11 (2.8) | 16 (1.8) | |
| Adherence to dietary modifications, self-reported | | | | |
| No | 916 (70.0) | 285 (72.0) | 631 (69.2) | 0.313 |
| Yes | 392 (30.0) | 111 (28.0) | 281 (30.8) | |
| Adherence to physical activity levels, self-reported | | | | |
| No | 1,088 (83.2) | 331 (83.6) | 757 (83.0) | 0.796 |
| Yes | 220 (16.8) | 65 (16.4) | 155 (17.0) | |
Notes: 1) Categorical variables are presented as counts and percentages; age is summarized as the arithmetic mean ± standard deviation; 2) P values were obtained from chi-squared tests or, for age, a t-test.
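As an illustration of the chi-squared comparisons in Table 1, the sketch below computes the Pearson statistic for sex by blood pressure control from the tabulated counts, using only the Python standard library. The function name is hypothetical, and the absence of a continuity correction is an assumption; the text does not state whether one was applied in the original analysis.

```python
from math import erfc, sqrt

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 df).

    No continuity correction is applied (an assumption of this sketch).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared upper tail probability
    # equals erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Rows: female, male; columns: uncontrolled, controlled (counts from Table 1).
sex_by_control = [[263, 650], [133, 262]]
stat, p_value = pearson_chi2_2x2(sex_by_control)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```

Under these assumptions the p-value should land close to the value tabulated for sex.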
Dimensionality reduction
PCA reduced the dimensionality of the clinical dataset, with the first five components collectively explaining 73.6% of the total variance. The first component (PC1) accounted for 45.9% of the variance, followed by PC2 (11.9%), PC3 (5.8%), PC4 (5.3%), and PC5 (4.7%), indicating that PC1 captured the dominant patterns in the data (Figure 1).
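The cumulative figure follows directly from the per-component percentages, as the short Python check below shows. It uses only the values reported in the paragraph above.

```python
from itertools import accumulate

# Percentage of total variance explained by PC1 to PC5, as reported in the text.
explained = [45.9, 11.9, 5.8, 5.3, 4.7]

# Running total, rounded to one decimal to match the reporting precision.
cumulative = [round(c, 1) for c in accumulate(explained)]
print(cumulative)
```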
Cluster identification
GMM applied to the PCA-transformed space identified eight distinct clusters (𝑘 = 8) among evaluated adults with previously diagnosed hypertension. Model selection via the BIC favoured an ellipsoidal, equal-shape covariance model (VEV) with BIC = -1,954.40 (Figure 2), demonstrating better fit compared to alternative cluster configurations (𝑘 = 1-8 tested). The integrated completed likelihood (ICL = -1,955.29) supported this solution. The VEV model outperformed simpler parameterizations, indicating that while patient clusters share similar geometric proportions (equal shape), they vary in size (unequal volume) and spatial orientation (varying orientation).

Abbreviations: EII, Spherical, equal volume; VII, Spherical, unequal volume; EEI, Diagonal, equal volume and shape; VEI, Diagonal, equal shape; EVI, Diagonal, equal volume; VVI, Diagonal, varying volume and shape; EEE, Ellipsoidal, equal volume/shape/orientation; EEV, Ellipsoidal, equal volume and shape; VEV, Ellipsoidal, equal shape; VVV, Ellipsoidal, varying volume/shape/orientation.
Figure 2 Bayesian Information Criterion (BIC) values for Gaussian Mixture Models with 1-8 clusters in hypertension phenotyping, Mexico 2022
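The covariance structure named by the VEV label can be written using the standard eigenvalue decomposition on which the mclust model names are based. The symbols are introduced here for illustration.

```latex
% \lambda_k: volume of cluster k (varies across clusters, "V"),
% \mathbf{A}: diagonal shape matrix with determinant 1, shared by all clusters ("E"),
% \mathbf{D}_k: orthogonal orientation matrix of cluster k (varies, "V").
\[
\boldsymbol{\Sigma}_k = \lambda_k \, \mathbf{D}_k \, \mathbf{A} \, \mathbf{D}_k^{\top},
\qquad k = 1, \dots, K .
\]
```

Each letter of the model name therefore states whether volume, shape, and orientation, in that order, are equal (E) or variable (V) across clusters.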
Cluster stability was high, with minimal off-diagonal overlap in the co-assignment matrix (Figure 3), supporting strong separation.

Note: The concentration of red exclusively along the diagonal suggests that observations within the same cluster are highly consistent (strong internal similarity).
Figure 3 Matrix of stability for the eight identified clusters, Mexico 2022
Cluster sizes (Figure 4) were heterogeneous, ranging from 22 patients (Cluster 8) to 325 patients (Cluster 3), with intermediate clusters of 88 (Cluster 1), 24 (Cluster 2), 312 (Cluster 4), 154 (Cluster 5), 241 (Cluster 6), and 142 (Cluster 7) individuals.
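A quick consistency check on the reported cluster sizes, sketched in Python with the counts given above, confirms that they add up to the analytical sample and shows each cluster's share of it.

```python
# Cluster sizes as reported in the text (cluster number -> n).
sizes = {1: 88, 2: 24, 3: 325, 4: 312, 5: 154, 6: 241, 7: 142, 8: 22}

total = sum(sizes.values())
# Share of the analytical sample in each cluster, in percent.
shares = {k: round(100 * n / total, 1) for k, n in sizes.items()}
smallest = min(sizes, key=sizes.get)
largest = max(sizes, key=sizes.get)

print(total, smallest, largest)
print(shares)
```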
Cluster quality assessment
The silhouette analysis revealed meaningful variation in cluster quality across the identified clusters (Figure 5). Two clusters showed particularly strong separation: Cluster 2 (average silhouette width = 0.72) and Cluster 8 (average silhouette width = 0.58), indicating that these represent well-defined, distinct clinical profiles. In contrast, Clusters 3, 4, and 6 showed negative or near-zero silhouette scores, suggesting overlap with neighboring clusters and less distinct phenotypic boundaries. The remaining clusters (1 and 7) exhibited marginal separation (silhouette widths 0.01-0.06), potentially representing transitional or heterogeneous patient clusters. These findings suggest that while the eight-cluster solution captures two well-differentiated phenotypes, the larger clusters may encompass patients with more heterogeneous characteristics or may benefit from alternative stratification approaches.

Abbreviations: T2DM, type 2 diabetes mellitus.
Notes: 1) Uncontrolled hypertension, T2DM diagnosis, adherence to physical activity, and adherence to dietary modifications are binary variables (0 = No, 1 = Yes); 2) Sex is a binary variable (0 = Female, 1 = Male); 3) For usual source of medical care, the code 1 denotes care provided by social‑security institutions; 4) Socioeconomic status was coded as 1 = Low, 2 = Middle, and 3 = High; and 5) Cluster sizes were as follows: 1 (𝑛 = 88), 2 (𝑛 = 24), 3 (𝑛 = 325), 4 (𝑛 = 312), 5 (𝑛 = 154), 6 (𝑛 = 241), 7 (𝑛 = 142), and 8 (𝑛 = 22).
Figure 5 Characteristics of identified clusters, Mexico 2022
Cluster characterization
The largest cluster (Cluster 3) predominantly included older women, with a mean age of 66.5 years, who had lived with hypertension for at least a decade and generally belonged to a high socioeconomic stratum. Cluster 8 had the youngest participants (mean age = 59.1 years) and was characterized by a mid-level socioeconomic profile. Cluster 7 had the greatest proportion of individuals with uncontrolled hypertension, suggesting that blood-pressure management challenges are concentrated in this cluster.
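Adherence to physical activity (9.1%), adherence to dietary modifications (7.2%), sex (7.1%), and T2DM (6.6%) jointly contributed approximately 30.1% of PC1’s variance (PC1 accounted for 45.9% of total variance). Age contributed 89.2% of PC2’s variance (PC2 accounted for 11.9% of total variance) but almost nothing to PC1, indicating that age varied largely independently of the other variables (Table 2).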
Table 2 Percentage of variance contributed by each variable to the first two PCA dimensions, Mexico 2022.
| Variable | Dimension 1 | Dimension 2 |
|---|---|---|
| Adherence to physical activity | 9.07 | 0.27 |
| Adherence to dietary modifications | 7.24 | 0.16 |
| Sex | 7.14 | 0.09 |
| T2DM diagnosis | 6.64 | 0.38 |
| Time since hypertension diagnosis | 2.82 | 2.28 |
| Socioeconomic status | 2.66 | 0.08 |
| Usual source of medical care | 2.28 | 0.07 |
| Age | 0.01 | 89.2 |
Note. Abbreviations: PCA, Principal Components Analysis; T2DM, type 2 diabetes mellitus.
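The ranking and the joint contribution of the leading variables can be reproduced from the Dimension 1 column of Table 2. The Python sketch below uses the tabulated values only; the variable names are illustrative.

```python
# Dimension 1 contributions (%) as listed in Table 2.
contrib_dim1 = {
    "Adherence to physical activity": 9.07,
    "Adherence to dietary modifications": 7.24,
    "Sex": 7.14,
    "T2DM diagnosis": 6.64,
    "Time since hypertension diagnosis": 2.82,
    "Socioeconomic status": 2.66,
    "Usual source of medical care": 2.28,
    "Age": 0.01,
}

# Rank variables from largest to smallest contribution to Dimension 1.
ranked = sorted(contrib_dim1.items(), key=lambda kv: kv[1], reverse=True)
top_four_total = round(sum(value for _, value in ranked[:4]), 2)

for name, value in ranked:
    print(f"{name}: {value:.2f}")
print(f"Top four combined: {top_four_total}")
```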
Discussion
The presented results suggest eight distinct hypertensive phenotypes with potential clinical implications. Two particularly well-defined clusters were observed (Cluster 2, 𝑛 = 24; and Cluster 8, 𝑛 = 22), representing patients with clear phenotypic patterns that may benefit from tailored management approaches. The robust separation of these clusters (silhouette widths 0.58-0.72) suggests they constitute clinically meaningful clusters. Caution is needed in generalizing these findings due to the relatively small size of these clusters. Sample size influences both the stability of unsupervised classifications and the external validity of phenotype-derived clinical implications [19]. Smaller clusters may reflect true but rare subpopulations, or even artifacts of algorithmic sensitivity, and should ideally be replicated in larger cohorts or prospective designs before clinical translation is attempted.
Three key findings deserve discussion. The uncontrolled hypertension cluster (Cluster 7, 𝑛 = 142) appeared to represent a high-priority population where current management strategies may be insufficient. This cluster's strong association with physical inactivity (9.1% variance contribution) and poor dietary adherence (7.2%) is consistent with prior evidence identifying these behaviors as significant risk factors for cardiovascular and cerebrovascular diseases in young and middle-aged adults [20,21]. Moreover, while Yu and Chen [21] observed that targeted lifestyle interventions can improve both blood pressure control and cognitive function in hypertensive individuals, our findings extend this literature by identifying a distinct subgroup with compounded behavioural risks. This suggests that precision-targeted interventions, guided by unsupervised classification, may enhance the effectiveness of lifestyle strategies in populations where conventional approaches have limited impact.
The elderly female cluster (Cluster 3, 𝑛 = 325) showed characteristics of long-standing disease. Given their advanced mean age (66.5 years) and high socioeconomic status, more aggressive monitoring for end-organ damage may be warranted. Cluster 8 may represent an important opportunity for early intervention. With a mean age of 59.1 years, this population could benefit from intensive risk factor modification to prevent the cardiovascular complications seen in older clusters [22,23].
The weak separation of several larger clusters (3, 4, and 6) likely reflects both the biological complexity of hypertension and limitations in current clinical characterization. While our model incorporated standard clinical variables, the inclusion of biomarkers might improve differentiation of these clusters in future studies [24]. These may include markers of vascular inflammation (e.g., high-sensitivity C-reactive protein, IL-6), cardiac stress (e.g., NT-proBNP, copeptin), and metabolic dysfunction, which have demonstrated utility in cardiovascular risk stratification and may help differentiate clusters with distinct pathophysiological profiles, especially when integrated with behavioral and sociodemographic data [25-27].
These findings support a shift toward phenotype-specific approaches to hypertension management. For instance, patients in Cluster 2, the most clearly separated cluster (silhouette width = 0.72), may respond well to standardized treatment protocols. In contrast, the persistent blood pressure control challenges observed in Cluster 7 may call for more intensive follow-up and coordinated multidisciplinary care. Meanwhile, the younger age profile of individuals in Cluster 8 suggests that early intervention could be particularly beneficial for this phenotype.
In our study, although 234 individuals were excluded due to missing data, and others due to specific conditions such as pregnancy-related hypertension, no significant differences were observed between included and excluded participants with regard to the key sociodemographic and clinical variables evaluated. Therefore, the risk of selection bias may be low. No imputation procedures were applied, further ensuring the integrity of the observed patterns within the analyzed sample.
While the stratification of hypertensive phenotypes offers promising avenues for personalized care, its translation into actionable interventions within the Mexican public health system warrants critical reflection. The system’s segmented structure, divided among social security institutions, public sector providers, and private services, creates substantial variability in diagnostic capacity, treatment availability, and continuity of care [28]. These disparities are further compounded by geographic inequities, with rural and marginalized populations facing longer wait times, limited access to specialists, and under-resourced facilities [29].
Ethically, the deployment of stratification models must avoid reinforcing existing inequities. For example, phenotypes identified as high-risk may not benefit from tailored interventions if the infrastructure to support such care is absent or inconsistently distributed. Organizationally, the feasibility of implementing phenotype-guided strategies depends on the system’s ability to integrate data-driven tools into routine practice, ensure equitable access to diagnostics, and align treatment protocols across fragmented care pathways.
Future research should validate these phenotypes against hard cardiovascular outcomes and test whether phenotype-guided therapy improves blood pressure control rates compared to current approaches.
This study has several other methodological limitations. First, correlation among the categorical predictors may affect the independence assumptions required for certain analyses. While PCA was employed to reduce dimensionality and capture shared variance, it does not fully account for structural dependencies across categorical variables. The presence of negative silhouette coefficients in some clusters likely reflects overlapping feature profiles among clusters rather than definitive misclassification. These may represent transitional phenotypes or latent gradients, rather than clearly bounded clusters. Although PCA helped address collinearity and improve signal extraction, the selected input variables may have lacked sufficient discriminative power to achieve tight separation between clusters.
Second, the operationalization of self-reported behavioural measures, particularly adherence to physical activity and dietary modifications, introduces susceptibility to recall bias and social desirability effects. While these variables offer insight into participants’ engagement in lifestyle interventions, potential misclassification may have influenced phenotype assignment.
Third, the broad inclusion criteria may introduce substantial clinical heterogeneity, particularly in terms of treatment intensity, therapeutic adherence, and disease progression. Although the decision to include all adults with physician-diagnosed hypertension was intended to reflect real-world patient diversity in Mexico, this inclusiveness may reduce the specificity of the identified phenotypes and constrain the external validity of findings when applied to more narrowly defined hypertensive subpopulations.
Fourth, the ENSANUT 2022 dataset lacked consistent information regarding treatment status, which limited analytical precision. It was not possible to stratify participants by antihypertensive regimen intensity or to distinguish between monotherapy and polytherapy. Hypertension severity could not be categorized beyond a binary classification based on blood pressure control. These constraints reduced the granularity of phenotype characterization and may have limited the clinical interpretability of the identified clusters.
Fifth, although clusters with poor blood pressure control were identified, the inability to evaluate treatment-related factors (such as medication type, dosing intensity, adherence, and duration) limits interpretation of the underlying drivers of this phenotype. As such, the observed classification may reflect uncontrolled hypertension without accounting for therapeutic variation.
Finally, while type 2 diabetes mellitus was included as a clinically relevant covariate given its epidemiologic burden in Mexico, the omission of other prevalent comorbidities, as well as the absence of polypharmacy indicators, may limit the clinical applicability of the phenotypic classifications.
Conclusions
The findings of this study show heterogeneous patterns in hypertension control, influenced by a combination of demographic, clinical, and behavioral factors. These variations underscore the importance of moving beyond a one-size-fits-all approach to hypertension management and instead adopting strategies tailored to specific patient profiles. By identifying and targeting high-risk phenotypes, healthcare providers and policymakers in Mexico can implement more effective, patient-centered strategies to reduce the burden of uncontrolled hypertension.
The practical implementation of classification systems warrants further exploration. Integrating clustering-derived phenotypes into existing care protocols, such as risk stratification tools used in primary care, could enhance early identification of patients requiring intensified follow-up or behavioural interventions. Embedding these models into electronic health records or decision-support systems may facilitate scalable, context-sensitive deployment, particularly in resource-constrained settings.