
Revista Colombiana de Estadística

Print version ISSN 0120-1751

Rev. Colomb. Estad. vol. 37, no. 1, Bogotá, Jan./June 2014

https://doi.org/10.15446/rce.v37n1.44359 

Three Similarity Measures between One-Dimensional Data Sets

Tres medidas de similitud entre conjuntos de datos unidimensionales

LUIS GONZALEZ-ABRIL1, JOSE M. GAVILAN2, FRANCISCO VELASCO MORENTE3

1Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: luisgon@us.es
2Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: gavi@us.es
3Universidad de Sevilla, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada I, Sevilla, Spain. Senior lecturer. Email: velasco@us.es


Abstract

Based on an interval distance, three functions are given to quantify similarities between one-dimensional data sets using first-order statistics. The Glass Identification Database is used to illustrate how to analyse a data set prior to its classification and/or to exclude dimensions. Furthermore, a non-parametric hypothesis test is designed to show how these similarity measures, based on random samples from two populations, can be used to decide whether the populations are identical. Two comparative analyses are also carried out, one against a parametric test and one against a non-parametric test. The new non-parametric test performs reasonably well in comparison with classical tests.

Key words: Data mining, Interval distance, Kernel methods, Non-parametric tests.


Resumen

Basadas en una distancia intervalar, se dan tres funciones para cuantificar similaridades entre conjuntos de datos unidimensionales mediante el uso de estadísticos de primer orden. Se usa la base de datos Glass Identification para ilustrar cómo esas medidas de similaridad se pueden usar para analizar un conjunto de datos antes de su clasificación y/o para excluir dimensiones. Además, se diseña un test de hipótesis no paramétrico para mostrar cómo esas medidas de similaridad, basadas en muestras aleatorias de dos poblaciones, se pueden usar para decidir si esas poblaciones son idénticas. También se realizan dos análisis comparativos con un test paramétrico y un test no paramétrico. Este nuevo test se comporta razonablemente bien en comparación con tests clásicos.

Palabras clave: distancia entre intervalos, métodos del núcleo, minería de datos, tests no paramétricos.


Full text available in PDF
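The exact interval distance and the three similarity functions are defined in the PDF full text. As a rough illustration of the idea sketched in the abstract, the following Python snippet assumes a simple interval summary [mean − sd, mean + sd] built from first-order statistics and a Euclidean distance between interval endpoints (both are assumptions for illustration, not the article's definitions), and wires the resulting similarity into a permutation test of the null hypothesis that two samples come from identical populations.

```python
# Hedged sketch: the article's actual interval distance is in the PDF;
# here a data set is summarized by the interval [mean - sd, mean + sd]
# and intervals are compared with a Euclidean distance on endpoints.
import random
import statistics


def interval(data):
    """Summarize a one-dimensional data set by [mean - sd, mean + sd]."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return (m - s, m + s)


def interval_distance(a, b):
    # Euclidean distance between interval endpoints (an assumption,
    # not necessarily the distance used in the article).
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def similarity(x, y):
    # Map the distance into (0, 1]; identical summaries give 1.
    return 1.0 / (1.0 + interval_distance(interval(x), interval(y)))


def permutation_test(x, y, n_perm=2000, seed=0):
    """Non-parametric test of H0: x and y come from identical populations.

    Repeatedly shuffles the pooled sample and counts how often the
    permuted similarity is at most the observed one; a small proportion
    means the observed samples are unusually dissimilar under H0.
    """
    rng = random.Random(seed)
    observed = similarity(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        if similarity(px, py) <= observed:
            count += 1
    return count / n_perm  # p-value-like proportion
```

Identical samples yield similarity 1, and the permutation proportion always lies in [0, 1]; the abstract's measures are more refined, but the overall workflow (summarize, compare, resample) follows the same pattern.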




[Received July 2013. Accepted January 2014]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv37n1a06,
    AUTHOR  = {Gonzalez-Abril, Luis and Gavilan, Jose M. and Velasco Morente, Francisco},
    TITLE   = {{Three Similarity Measures between One-Dimensional Data Sets}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2014},
    volume  = {37},
    number  = {1},
    pages   = {79-94}
}