
Revista Colombiana de Estadística

Print version ISSN 0120-1751

Rev.Colomb.Estad. vol.43 no.2 Bogotá Jul./Dec. 2020  Epub Dec 05, 2020

https://doi.org/10.15446/rce.v43n2.81811 

Original articles of research

PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification Problem

Regresión lineal generalizada por MCP y algoritmo kernel multilogit para la clasificación de datos de microarreglos

Adolphus Wagala1  a 

Graciela González-Farías1  b 

Rogelio Ramos1  c 

Oscar Dalmau2  d 

1Department of Probability and Statistics, Centro de Investigación en Matemáticas A.C., Guanajuato, México

2Department of Computer Science, Centro de Investigación en Matemáticas A.C., Guanajuato, México


Abstract

This study implements extensions of partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression and linear discriminant analysis, yielding a partial least squares generalized linear regression-logistic regression model (PLSGLR-log) and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). The resulting classifiers are then compared with classical methodologies: k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines (SVM). Furthermore, a new methodology known as the kernel multilogit algorithm (KMA) is also implemented and its performance compared with that of the other classifiers. The KMA emerged as the best classifier, with the lowest classification error rates, on both types of data considered: unpreprocessed and preprocessed.

Key words: Generalized linear regression; Kernel multilogit algorithm; Partial least squares
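
The general idea behind the PLS-based classifiers compared in the abstract, reducing many gene-expression variables to a few components and then applying a discriminant rule, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses ordinary one-component PLS with a midpoint threshold on the scores, not the authors' PLSGLR-log, PLSGLRDA, or KMA. The function names (`fit_pls1_midpoint`, `predict`, `error_rate`) and the toy data are hypothetical and do not come from the article.

```python
# Illustrative sketch only: ordinary one-component PLS followed by a
# midpoint (one-dimensional, LDA-style) rule for two classes coded 0/1.
# This is NOT the paper's PLSGLR-log, PLSGLRDA, or KMA.
from math import sqrt


def column_means(X):
    """Mean of each column of a list-of-rows matrix."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def project(X, mu, w):
    """Scores: projection of each centered sample onto the weight vector."""
    return [sum((row[j] - mu[j]) * w[j] for j in range(len(mu))) for row in X]


def fit_pls1_midpoint(X, y):
    """Fit one PLS component and a midpoint threshold on its scores."""
    mu = column_means(X)
    y_bar = sum(y) / len(y)
    # The PLS weight of feature j is proportional to cov(x_j, y).
    w = [
        sum((row[j] - mu[j]) * (yi - y_bar) for row, yi in zip(X, y))
        for j in range(len(mu))
    ]
    norm = sqrt(sum(wj * wj for wj in w))
    if norm == 0:
        raise ValueError("class means coincide; no PLS direction exists")
    w = [wj / norm for wj in w]
    t = project(X, mu, w)
    t0 = [ti for ti, yi in zip(t, y) if yi == 0]
    t1 = [ti for ti, yi in zip(t, y) if yi == 1]
    # Midpoint of the class mean scores (assumes equal priors and variances).
    threshold = (sum(t0) / len(t0) + sum(t1) / len(t1)) / 2
    return {"mu": mu, "w": w, "threshold": threshold}


def predict(model, X):
    """Assign class 1 to samples whose score exceeds the threshold."""
    scores = project(X, model["mu"], model["w"])
    return [1 if s > model["threshold"] else 0 for s in scores]


def error_rate(y_true, y_pred):
    """Fraction of misclassified samples."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 6 samples, 3 "genes", two classes.
X_train = [
    [1.0, 2.0, 0.5], [1.2, 1.8, 0.4], [0.9, 2.2, 0.6],
    [3.0, 0.5, 0.5], [3.2, 0.4, 0.6], [2.8, 0.6, 0.4],
]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1.1, 2.1, 0.5], [3.1, 0.5, 0.5]]
y_test = [0, 1]

model = fit_pls1_midpoint(X_train, y_train)
print(predict(model, X_test), error_rate(y_test, predict(model, X_test)))
```

The classification error rate computed by `error_rate` is the same kind of criterion the abstract uses to rank the classifiers, although the article's actual evaluation protocol is not described on this page.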

Resumen

Este estudio combina el modelo de regresión lineal generalizada por mínimos cuadrados parciales (RLGMCP) con regresión logística y análisis discriminante lineal, para obtener los modelos de regresión logística generalizada por mínimos cuadrados parciales (RLGMCP) y regresión logística generalizada-discriminante por mínimos cuadrados parciales (RLGDMCP). Se realiza un estudio comparativo con clasificadores clásicos como k-vecinos más cercanos (KVC), análisis discriminante lineal (ADL), análisis discriminante por mínimos cuadrados parciales (ADMCP), regresión por mínimos cuadrados parciales (RMCP) y máquinas de vectores de soporte (MSV). Además, se implementa una nueva metodología conocida como algoritmo de kernel multilogit (AKM), cuyo desempeño se compara con el de los otros clasificadores. De acuerdo con las tasas de error de clasificación obtenidas a partir de los diferentes tipos de datos, el AKM es el de mejor resultado.

Palabras clave: Regresión lineal generalizada; Algoritmo de kernel multilogit; Mínimos cuadrados parciales

Full text available only in PDF format.

Acknowledgements

We acknowledge the partial support of Mexico's Consejo Nacional de Ciencia y Tecnología (CONACyT), project number 252996. Part of this work was done when A.W. was a PhD candidate at CIMAT, A.C., Guanajuato, Gto., México (Wagala 2018).

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999), 'Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays', Proceedings of the National Academy of Sciences of the United States of America 96(12), 6745-6750.

Alshamlan, H. M., Badr, G. & Alohali, Y. (2013), A study of cancer microarray gene expression profile: Objectives and approaches, in 'Proceedings of the World Congress on Engineering', Vol. II, London.

Awada, W., Khoshgoftaar, T. M., Dittman, D., Wald, R. & Napolitano, A. (2012), A review of the stability of feature selection techniques for bioinformatics data, in '2012 IEEE 13th International Conference on Information Reuse & Integration (IRI)', IEEE, pp. 356-363.

Bastien, P., Vinzi, E. V. & Tenenhaus, M. (2005), 'PLS generalised linear regression', Computational Statistics and Data Analysis 48, 17-46.

Boulesteix, A. L., Strobl, C., Augustin, T. & Daumer, M. (2008), 'Evaluating microarray-based classifiers: an overview', Cancer Informatics 6, 77-97.

Chun, H. & Keles, S. (2009), 'Sparse partial least squares regression for simultaneous dimension reduction and variable selection', Journal of the Royal Statistical Society. Series B, Statistical Methodology 72(1), 3-25. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2810828/

Chung, D. & Keles, S. (2010), 'Sparse partial least squares classification for high dimensional data', Statistical Applications in Genetics and Molecular Biology 9(1), 17.

Dalmau, O., Alarcón, T. E. & González, G. (2015), 'Kernel multilogit algorithm for multiclass classification', Computational Statistics and Data Analysis 82, 199-206.

Dong, K., Zhang, F., Zhu, Z., Wang, Z. & Wang, G. (2014), 'Partial least squares based gene expression analysis in posttraumatic stress disorder', European Review for Medical and Pharmacological Sciences 18, 2306-2310.

Dudoit, S., Fridlyand, J. & Speed, T. (2002), 'Comparison of discrimination methods for the classification of tumors using gene expression data', Journal of the American Statistical Association 97(457), 77-86.

Fort, G. & Lambert-Lacroix, S. (2005), 'Classification using partial least squares with penalized logistic regression', Bioinformatics 21(7), 1104-1111.

Gagnon-Bartsch, J. A. & Speed, T. P. (2011), 'Using control genes to correct for unwanted variation in microarray data', Biostatistics 13(3), 539-552. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3577104/

Gromski, S., Muhamadali, H., Ellis, D., Xu, Y., Correa, E., Turner, M. & Goodacre, R. (2015), 'A tutorial review: Metabolomics and partial least squares-discriminant analysis a marriage of convenience or a shotgun wedding', Analytica Chimica Acta 879, 10-23.

Gusnanto, A., Ploner, A., Shuweihdi, F. & Pawitan, Y. (2013), 'Partial least squares and logistic regression random-effects estimates for gene selection in supervised classification of gene expression data', Journal of Biomedical Informatics pp. 697-709.

Höskuldsson, A. (1988), 'PLS regression methods', Journal of Chemometrics 2, 211-228.

Huang, C. C., Tu, S. H., Huang, C. H., Lien, H. H., Lai, L. H. & Chuang, E. (2013), 'Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy', BioMed Research International pp. 1-9.

Lê Cao, K., Rossouw, D., Robert-Granié, C. & Besse, P. (2008), 'A sparse PLS for variable selection when integrating omics data', Statistical Applications in Genetics and Molecular Biology 7(1).

Lee, D., Lee, W., Lee, Y. & Pawitan, Y. (2011), 'Sparse partial least-squares regression and its applications to high-throughput data analysis', Chemometrics and Intelligent Laboratory Systems 109(1), 1-8.

Nguyen, D. V. & Rocke, D. M. (2002a), 'Multi-class cancer classification via partial least squares with gene expression profiles', Bioinformatics 18(9), 1216-1226.

Nguyen, D. V. & Rocke, D. M. (2002b), 'Tumor classification by partial least squares using microarray gene expression data', Bioinformatics 18(1), 39-50.

Telaar, A., Liland, K., Repsilber, D. & Nürnberg, G. (2013), 'An extension of PPLS-DA for classification and comparison to ordinary PLS-DA', PLoS ONE 8(2), e55267.

Wagala, A. (2018), Problems in Statistical Genetics: Classification and Testing for Network Changes, PhD thesis, Centro de Investigación en Matemáticas A. C., Department of Probability & Statistics. https://cimat.repositorioinstitucional.mx

Wang, A., An, N., Chen, G., Li, L. & Alterovitz, G. (2015), 'Improving PLSRFE based gene selection for microarray data classification', Computers in Biology and Medicine 62, 14-24.

Wold, S., Ruhe, A., Wold, H. & Dunn III, W. J. (1984), 'The collinearity problem in linear regression, the partial least squares approach to generalized inverses', SIAM Journal on Scientific and Statistical Computing 5(3), 735-743.

Wold, S., Sjöström, M. & Eriksson, L. (2001), 'PLS-regression: A basic tool of chemometrics', Chemometrics and Intelligent Laboratory Systems 58, 109-130.

Xi, B., Gu, H., Baniasadi, H. & Raftery, D. (2014), 'Statistical analysis and modeling of mass spectrometry-based metabolomics data', Methods in Molecular Biology 1198, 333-353.

Received: August 2019; Accepted: January 2020

aPhD. E-mail: adolphus.wagala@cimat.mx

bPhD. E-mail: farias@cimat.mx

cPhD. E-mail: rramosq@cimat.mx

dPhD. E-mail: dalmau@cimat.mx

This is an open-access article distributed under the terms of the Creative Commons Attribution License.