Measuring Representativeness Using Covering Array Principles

Castro-Romero, Alexander; Cobos-Lozada, Carlos-Alberto

doi:10.19053/01211129.v32.n65.2023.15314


Revista Facultad de Ingeniería

Print version ISSN 0121-1129; online version ISSN 2357-5328

Abstract

CASTRO-ROMERO, Alexander and COBOS-LOZADA, Carlos-Alberto. Measuring Representativeness Using Covering Array Principles. Rev. Fac. ing. [online]. 2023, vol.32, n.65, e6. Epub 13-Jan-2024. ISSN 0121-1129. https://doi.org/10.19053/01211129.v32.n65.2023.15314.

Representativeness is an important data quality characteristic in data science processes; a data sample is said to be representative when it reflects a larger group as accurately as possible. Low representativeness in the data can lead to biased models. This study therefore presents the elements of a new model for measuring representativeness using the "P Matrix", a mathematical object that serves as a testing element of covering arrays. To test the model, an experiment was proposed in which a dataset is divided into training and test subsets using two sampling strategies, random and stratified, and the resulting representativeness values are compared. If the data division is adequate, the two sampling strategies should yield similar representativeness indices. The model was implemented in a software prototype using Python (for data processing) and Vue (for data visualization); this version of the model currently supports only binary datasets. The model was tested on the "Wines" dataset (UC Irvine Machine Learning Repository). The conclusion is that both sampling strategies produce similar representativeness results for this dataset. Although this result is predictable, it underlines that adequate representativeness of the data is important when generating the training and test subsets. As future work, we plan to extend the model to categorical data and explore more complex datasets.

Keywords: classification algorithms; coverage arrays; data quality; data sets; data representativeness.
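
The experiment described in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' P Matrix model, whose details are not given here. As a stand-in assumption, it uses a simple pairwise combination-coverage score inspired by covering arrays: the fraction of value pairs observed in the full dataset that also appear in a subset. The function names (`pair_coverage`, `random_split`, `stratified_split`) and the synthetic binary data are hypothetical and exist only for illustration.

```python
import random
from itertools import combinations


def pair_coverage(sample, reference):
    """Fraction of (column pair, value pair) combinations present in
    `reference` that also occur in `sample`. This is an assumed stand-in
    metric, not the P Matrix from the paper."""
    n_cols = len(reference[0])
    covered = total = 0
    for i, j in combinations(range(n_cols), 2):
        ref_pairs = {(row[i], row[j]) for row in reference}
        sample_pairs = {(row[i], row[j]) for row in sample}
        total += len(ref_pairs)
        covered += len(ref_pairs & sample_pairs)
    return covered / total


def random_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def stratified_split(rows, labels, test_fraction, rng):
    """Apply the random split inside each label group, so that class
    proportions are roughly preserved in both subsets."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    train, test = [], []
    for group in by_label.values():
        group_train, group_test = random_split(group, test_fraction, rng)
        train += group_train
        test += group_test
    return train, test


# Synthetic binary dataset (hypothetical; the paper uses the "Wines" dataset).
rng = random.Random(0)
rows = [tuple(rng.randint(0, 1) for _ in range(4)) for _ in range(200)]
labels = [row[0] ^ row[1] for row in rows]  # arbitrary binary class label

splits = {
    "random": random_split(rows, 0.3, rng),
    "stratified": stratified_split(rows, labels, 0.3, rng),
}
for name, (train, test) in splits.items():
    print(name,
          "train coverage:", pair_coverage(train, rows),
          "test coverage:", pair_coverage(test, rows))
```

Under this assumed metric, comparable scores for the two strategies would correspond to the abstract's expectation that an adequate split gives similar representativeness indices regardless of sampling strategy.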
