Smart Product Backlog: Automatic Classification of User Stories Using Large Language Models (LLM)

Gaona-Cuevas, Mauricio; Bucheli-Guerrero, Víctor; Vera-Rivera, Fredy

doi:10.19503/01211129.v33.n69.2024.18076


Revista Facultad de Ingeniería

Print version ISSN 0121-1129; On-line version ISSN 2357-5328

Abstract

GAONA-CUEVAS, Mauricio; BUCHELI-GUERRERO, Víctor and VERA-RIVERA, Fredy. Smart Product Backlog: Automatic Classification of User Stories Using Large Language Models (LLM). Rev. Fac. ing. [online]. 2024, vol.33, n.69, e18076. Epub Aug 29, 2024. ISSN 0121-1129. https://doi.org/10.19503/01211129.v33.n69.2024.18076.

In agile software development processes, specifically within intelligent applications that leverage artificial intelligence (AI), the Smart Product Backlog (SPB) serves as an artifact that includes both AI-implementable functionalities and those that do not use AI. Significant work has been done in the development of Natural Language Processing (NLP) models, and Large Language Models (LLMs) have demonstrated exceptional performance. However, whether LLMs can be used in automatic classification tasks without prior annotation, thereby allowing direct extraction from the SPB, remains an unanswered question. In this study, we compared the effectiveness of fine-tuning techniques with "prompting" methods to determine the potential of models such as ChatGPT-4o, Gemini Pro 1.5, and ChatGPT-Mini. A dataset was constructed with user stories manually classified by a group of experts, which made it possible to set up the experiments and build the corresponding contingency tables. The classification performance metrics of each LLM were statistically evaluated; accuracy, sensitivity, and F1-Score were used to assess the effectiveness of each model. This comparative approach aimed to highlight the strengths and limitations of each LLM in efficiently and accurately assisting in the construction of the SPB. The analysis shows that ChatGPT-Mini has limitations in balancing precision and sensitivity. Although Gemini Pro 1.5 was superior in accuracy scores and ChatGPT performed well, neither is robust enough to build a fully automated tool for user story classification. Therefore, we identified the need to develop a specialized classifier that enables the construction of an automated tool to recommend viable user stories for AI development, thereby supporting decision-making in agile software projects.
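To illustrate how accuracy, sensitivity, and F1-Score can be derived from a contingency table of expert labels versus model predictions, the following is a minimal Python sketch. It is not the authors' code: the names `ContingencyTable`, `build_table`, and `classification_metrics` are hypothetical, the labels are assumed to be binary (True meaning "AI-implementable"), and the counts in the usage note are made up for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ContingencyTable:
    """Binary contingency table (hypothetical helper, not from the paper)."""
    tp: int  # expert: AI-implementable, model: AI-implementable
    fp: int  # expert: not AI,           model: AI-implementable
    fn: int  # expert: AI-implementable, model: not AI
    tn: int  # expert: not AI,           model: not AI


def build_table(expert: Sequence[bool], predicted: Sequence[bool]) -> ContingencyTable:
    """Count agreements and disagreements between expert labels and model output."""
    if len(expert) != len(predicted):
        raise ValueError("expert and predicted must have the same length")
    tp = sum(1 for e, p in zip(expert, predicted) if e and p)
    fp = sum(1 for e, p in zip(expert, predicted) if not e and p)
    fn = sum(1 for e, p in zip(expert, predicted) if e and not p)
    tn = sum(1 for e, p in zip(expert, predicted) if not e and not p)
    return ContingencyTable(tp=tp, fp=fp, fn=fn, tn=tn)


def classification_metrics(t: ContingencyTable) -> Dict[str, float]:
    """Standard definitions; a zero denominator yields 0.0 by convention here."""
    total = t.tp + t.fp + t.fn + t.tn
    accuracy = (t.tp + t.tn) / total if total else 0.0
    precision = t.tp / (t.tp + t.fp) if (t.tp + t.fp) else 0.0
    sensitivity = t.tp / (t.tp + t.fn) if (t.tp + t.fn) else 0.0  # also called recall
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }
```

For example, with illustrative counts `ContingencyTable(tp=40, fp=10, fn=5, tn=45)`, `classification_metrics` gives an accuracy of 0.85 and a precision of 0.80, with sensitivity 40/45 and F1 equal to 80/95. Reporting precision next to sensitivity makes the kind of imbalance the abstract attributes to ChatGPT-Mini visible in a single table.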

Keywords: artificial intelligence; large scale language models; smart product backlog; smart user story identifier; software requirements specification; user story classification.
