Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure

Authors

  • Francisco Viveros-Jiménez CIC
  • Miguel A. Sánchez-Pérez Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)
  • Helena Gómez-Adorno Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)
  • Juan Pablo Posadas-Durán Instituto Politécnico Nacional (IPN), Escuela Superior de Ingeniería Mecánica y Eléctrica Unidad Zacatenco (ESIME-Zacatenco)
  • Grigori Sidorov Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)
  • Alexander Gelbukh Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)

DOI:

https://doi.org/10.13053/cys-22-2-2959

Keywords:

Boilerplate removal, news extraction, HTML tree structure, Boilerpipe

Abstract

It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between visits and hashing the content is not possible. Therefore, we need to extract the relevant text of documents. The automatic extraction of relevant text from online text (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular ones, and its performance is among the best. In this paper, we improve the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. We conduct our experiments on news articles. We evaluated our approach by extracting news from English and Spanish websites and compared the results with other approaches. Our approach achieved better results than state-of-the-art approaches. We also present an analysis of our dataset confirming that the amount of relevant text is less than 40%.
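
The hashing problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the method of the paper and does not use Boilerpipe: the `ParagraphText` class, the `page` helper, and the sample markup are hypothetical, and keeping only text inside `<p>` tags is a deliberately naive stand-in for a real extractor. The sketch only shows that two visits to the same article hash differently as raw HTML when the ad changes, while the hash of the extracted text stays stable.

```python
import hashlib
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects text found inside <p> tags only (naive stand-in for an extractor)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <p> elements are currently open
        self.chunks = []    # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the concatenated paragraph text of an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def page(ad):
    """Hypothetical article page whose ad block changes on every visit."""
    return (
        "<html><body>"
        f"<div class='ad'>{ad}</div>"
        "<p>Main story text.</p>"
        "</body></html>"
    )


first_visit = page("Buy shoes")
second_visit = page("Cheap flights")

# Raw HTML differs between visits, so its hash is useless for deduplication.
raw_hashes_match = sha256(first_visit) == sha256(second_visit)

# The extracted text is the same on both visits, so its hash is stable.
text_hashes_match = sha256(extract_text(first_visit)) == sha256(extract_text(second_visit))
```

On real news pages a tag filter this simple would miss or misclassify much of the content, which is why dedicated boilerplate removal algorithms such as Boilerpipe are needed.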

Published

2018-06-29