SciELO - Scientific Electronic Library Online

 
vol.29 issue3Falling film reactor for methyl ester sulphonation with gaseous sulphur trioxideAnalysing and comparing a PI fuzzy controller and an optimal conventional PI controller for a buck converter author indexsubject indexarticles search
Home Pagealphabetic serial listing  

Services on Demand

Journal

Article

Indicators

Related links

  • On index processCited by Google
  • Have no similar articlesSimilars in SciELO
  • On index processSimilars in Google

Share


Ingeniería e Investigación

Print version ISSN 0120-5609

Abstract

CADAVID RENGIFO, Héctor Fabio  and  GOMEZ PERDOMO, Jonatan. web text corpus extraction system for linguistic tasks. Ing. Investig. [online]. 2009, vol.29, n.3, pp.54-60. ISSN 0120-5609.

Internet content, used as text corpus for natural language learning, offers important characteristics for such task, like its huge volume, being permanently uptodate with linguistic variants and having low time and resource costs regarding the traditional way that text is built for natural language machine learning tasks. This paper describes a system for the automatic extraction of large bodies of text from the Internet as a valuable tool for such learning tasks. A concurrent programmingbased, hardwareuse optimisation strategy significantly improving extraction performance is also presented. The strategies incorporated into the system for maximising hardware resource exploitation, thereby reducing extraction time are presented, as are extendibility (supporting digital-content formats) and adaptability (regarding how the system cleanses content for obtaining pure natural language samples). The experimental results obtained after processing one of the biggest Spanish domains on the internet, are presented (i.e. es.wikipedia.org). Such results are used for presenting initial conclusions about the validity and applicability of corpus directly extracted from Internet as morphological or syntactical learning input.

Keywords : web corpus; crawler; unsupervised language learning; concurrent programming.

        · abstract in Spanish     · text in Spanish     · Spanish ( pdf )

 

Creative Commons License All the contents of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License