SciELO - Scientific Electronic Library Online

 



Revista científica

Print version ISSN 0124-2253
On-line version ISSN 2344-8350

Abstract

GARCIA-CHICANGANA, David-Santiago et al. Multi-Client Document Classification Service Based on Machine Learning Techniques and Elasticsearch. Rev. Cient. [online]. 2022, n.43, pp.64-79. Epub Feb 18, 2022. ISSN 0124-2253. https://doi.org/10.14483/23448350.18352.

This paper presents a document classification service that allows multi-client (multi-tenant) document management systems to provide greater confidence and credibility regarding the document types assigned to documents uploaded by users. The research followed the phases of CRISP-DM. Two document representation models were evaluated (bags of words with cumulative n-grams, and BERT, which was recently proposed by Google), together with five machine learning techniques (multilayer perceptron, random forests, k-nearest neighbors, decision trees, and naïve Bayes). The experiments were carried out with data from two organizations. The best results were obtained by multilayer perceptron, random forests, and k-nearest neighbors, which showed very similar overall accuracy and per-class recall. The results are not conclusive as to whether the service can be offered to multiple clients with a single model, since this also depends on each client's documents and document types. The service is therefore based on a microservices architecture that allows each organization to create its own model, monitor its performance in production, and update it when performance is not adequate.

Keywords: CRISP-DM; data analytics; document management system; k-nearest neighbors; multilayer perceptron; random forests; trigrams.
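
As an illustration of one combination named in the abstract (a bag of words with cumulative n-grams fed to a k-nearest neighbors classifier), the following is a minimal sketch and not the authors' implementation. It assumes that "cumulative n-grams" means counting every word n-gram of length 1 up to n (trigrams by default), and that neighbors are ranked by cosine similarity. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def cumulative_ngrams(text, max_n=3):
    """Bag of words holding every word n-gram of length 1..max_n.

    Assumption: "cumulative" means unigrams + bigrams + ... + max_n-grams.
    """
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority vote over the k training documents most similar to text."""
    query = cumulative_ngrams(text)
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus standing in for one organization's documents.
samples = [
    ("invoice total amount due payment", "invoice"),
    ("invoice number amount due date", "invoice"),
    ("employment contract between the parties", "contract"),
    ("contract terms agreed by the parties", "contract"),
]
train = [(cumulative_ngrams(text), label) for text, label in samples]

print(knn_predict(train, "amount due on this invoice"))
print(knn_predict(train, "contract signed by the parties"))
```

In a multi-tenant setting such as the one the abstract describes, each organization would presumably hold its own `train` set, so that document types from one client never vote on another client's documents.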
