Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure
Abstract
It is well known that a lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant content such as advertising and related links. Moreover, some of these ads are loaded at random each time a page is requested, so the HTML document differs between requests and hashing its content is not possible. Therefore, the relevant text of each document must be extracted. The automatic extraction of relevant text from online documents (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular ones, and its performance is among the best. In this paper, we improve the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our experiments focus on news articles. We evaluated our approach by extracting news from English and Spanish websites and compared the results with other approaches. Our approach achieved better results than state-of-the-art approaches. We also present an analysis of our dataset confirming that the amount of relevant text is less than 40%.
Keywords
Boilerplate removal, news extraction, HTML tree structure, Boilerpipe