SciELO - Scientific Electronic Library Online

 
Ingeniería y competitividad

Print version ISSN 0123-3033

Abstract

RICO-SULAYES, Antonio; SALDIVAR-ARREOLA, Rafael and RABAGO-TANORI, Álvaro. Part-of-speech tagging with maximum entropy and distributional similarity features in a subregional corpus of Spanish. Ing. compet. [online]. 2017, vol.19, n.2, pp.55-67. ISSN 0123-3033. http://dx.doi.org/10.25100/iyc.v19i2.5293.

This study used two state-of-the-art Spanish taggers with the primary goal of automatically POS-tagging the subregional Mexican Corpus del Habla de Baja California (CHBC), a rigorously assembled collection of unstructured text intended to support a range of linguistic tasks. These taggers, one based on Maximum Entropy and another that adds distributional similarity features to this statistical model, were released recently but without a reported accuracy rate. The second goal of this article is therefore to evaluate the language models behind these taggers and to provide attested accuracy figures for them. To achieve both goals, the article proposes a novel, reduced tag set, which also proved useful for the purposes pursued here. On a sample of almost 11,000 words and more than 12,500 tags across two genres (written text and transcribed oral speech), the Maximum Entropy tagger and the tagger with Maximum Entropy plus distributional similarity features achieved 97.2% and 97.4%, respectively. Compared with a human ceiling or gold standard of 97.1%, also attested here, these results show that both taggers are competitive even when applied to an external data collection on which they had not previously been trained or tuned. This is particularly important because tagger performance has been shown to deteriorate under such experimental conditions.
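As an illustration of how accuracy figures of this kind are typically obtained, the following minimal sketch computes per-tag accuracy by comparing a tagger's output against a manually verified reference. It is not the authors' evaluation code: the function name `tagging_accuracy`, the tag labels, and the five-tag sample are all hypothetical and are not drawn from the article's reduced tag set or from the CHBC.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of predicted tags that match the reference tags.

    Both arguments are equal-length sequences of tag labels, aligned
    position by position. (Hypothetical helper, for illustration only.)
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    # Count positions where the tagger agrees with the reference.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented toy sample: the tagger gets 4 of 5 tags right.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred = ["DET", "NOUN", "VERB", "DET", "ADJ"]
print(f"{tagging_accuracy(gold, pred):.1%}")  # prints 80.0%
```

The same ratio, computed over the full evaluation sample, would yield a single percentage comparable in form to the 97.2% and 97.4% reported above.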

Keywords: Mexican Spanish; stochastic POS tagging; tagged corpus.
