SciELO - Scientific Electronic Library Online

 
vol.29 número3Falling film reactor for methyl ester sulphonation with gaseous sulphur trioxideAnalysing and comparing a PI fuzzy controller and an optimal conventional PI controller for a buck converter índice de autoresíndice de assuntospesquisa de artigos
Home Pagelista alfabética de periódicos  

Serviços Personalizados

Journal

Artigo

Indicadores

Links relacionados

  • Em processo de indexaçãoCitado por Google
  • Não possue artigos similaresSimilares em SciELO
  • Em processo de indexaçãoSimilares em Google

Compartilhar


Ingeniería e Investigación

versão impressa ISSN 0120-5609

Resumo

CADAVID RENGIFO, Héctor Fabio  e  GOMEZ PERDOMO, Jonatan. web text corpus extraction system for linguistic tasks. Ing. Investig. [online]. 2009, vol.29, n.3, pp.54-60. ISSN 0120-5609.

Internet content, used as text corpus for natural language learning, offers important characteristics for such task, like its huge volume, being permanently uptodate with linguistic variants and having low time and resource costs regarding the traditional way that text is built for natural language machine learning tasks. This paper describes a system for the automatic extraction of large bodies of text from the Internet as a valuable tool for such learning tasks. A concurrent programmingbased, hardwareuse optimisation strategy significantly improving extraction performance is also presented. The strategies incorporated into the system for maximising hardware resource exploitation, thereby reducing extraction time are presented, as are extendibility (supporting digital-content formats) and adaptability (regarding how the system cleanses content for obtaining pure natural language samples). The experimental results obtained after processing one of the biggest Spanish domains on the internet, are presented (i.e. es.wikipedia.org). Such results are used for presenting initial conclusions about the validity and applicability of corpus directly extracted from Internet as morphological or syntactical learning input.

Palavras-chave : web corpus; crawler; unsupervised language learning; concurrent programming.

        · resumo em Espanhol     · texto em Espanhol     · Espanhol ( pdf )

 

Creative Commons License Todo o conteúdo deste periódico, exceto onde está identificado, está licenciado sob uma Licença Creative Commons