Part-of-speech tagging with maximum entropy and distributional similarity features in a subregional corpus of Spanish

Rico-Sulayes, Antonio; Saldívar-Arreola, Rafael; Rábago-Tánori, Álvaro

doi:10.25100/iyc.v19i2.5293

Ingeniería y competitividad

Print version ISSN 0123-3033

Abstract

RICO-SULAYES, Antonio; SALDIVAR-ARREOLA, Rafael and RABAGO-TANORI, Álvaro. Part-of-speech tagging with maximum entropy and distributional similarity features in a subregional corpus of Spanish. Ing. compet. [online]. 2017, vol.19, n.2, pp.55-67. ISSN 0123-3033. https://doi.org/10.25100/iyc.v19i2.5293.

This study uses two state-of-the-art Spanish taggers with the primary goal of automatically assigning part-of-speech (POS) tags to the subregional Mexican Corpus del Habla de Baja California (CHBC), a rigorously assembled collection of unstructured text intended to support a number of linguistic tasks. The two taggers, one based on Maximum Entropy and another that adds distributional similarity features to this statistical model, were released recently but without reported accuracy rates. The second goal of this article is therefore to evaluate the language models behind these taggers and to provide attested accuracy figures for them. To achieve both goals, the article proposes a novel, reduced tag set, which also proved useful for the goals pursued here. On a sample of almost 11,000 words and more than 12,500 tags covering two genres (written text and transcribed oral speech), the Maximum Entropy tagger and the tagger with Maximum Entropy plus distributional similarity features achieved accuracy rates of 97.2% and 97.4%, respectively. Compared with a human ceiling, or gold standard, of 97.1%, also attested here, the results of both taggers are competitive even when applied to an external data collection on which they had not previously been trained or tuned. This is particularly important because tagger performance has been shown to deteriorate under these experimental conditions.

Keywords: Mexican Spanish; stochastic POS tagging; tagged corpus.
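
The evaluation summarized in the abstract compares tagger output with a human-annotated sample after mapping tags to a reduced tag set. The following is a minimal sketch of that kind of tag-level accuracy computation, not the authors' implementation. The mapping `REDUCED`, the tag names in it, the helper names, and the four-token example are all hypothetical and are not the tag set proposed in the article; the sketch also assumes exactly one tag per token, whereas the article's sample has more tags than words.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from fine-grained tags to a reduced tag set.
# Illustrative only: this is NOT the tag set proposed in the article.
REDUCED: Dict[str, str] = {
    "nc0s000": "NOUN",
    "nc0p000": "NOUN",
    "vmip000": "VERB",
    "da0000": "DET",
}

TaggedToken = Tuple[str, str]  # (word, tag)


def reduce_tag(tag: str) -> str:
    """Map a fine-grained tag to its reduced form; unknown tags pass through."""
    return REDUCED.get(tag, tag)


def tagging_accuracy(gold: List[TaggedToken], predicted: List[TaggedToken]) -> float:
    """Share of tokens whose reduced predicted tag equals the reduced gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(
        1
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted)
        if reduce_tag(gold_tag) == reduce_tag(pred_tag)
    )
    return correct / len(gold)


# Invented four-token example (assumption, not data from the CHBC).
gold = [("la", "da0000"), ("casa", "nc0s000"), ("brilla", "vmip000"), ("mucho", "rg")]
predicted = [("la", "da0000"), ("casa", "nc0p000"), ("brilla", "vmip000"), ("mucho", "aq0000")]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy on reduced tags: {accuracy:.2%}")
```

In this invented example the singular/plural noun mismatch disappears once both tags are reduced to the same class, which illustrates why scoring on a reduced tag set can differ from scoring on the fine-grained one.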
