Group: de.l3s.boilerpipe - All Dependencies

boilerpipe 1.1.0

Boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages · The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library ships with specific strategies for common tasks (for example, news article extraction) and can be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. Its algorithms are based on, and extend, concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA.

Nov 03, 2010
33 usages

