Research RSS article extraction libraries

There are two main problems with importing an RSS feed and republishing it as a Briar blog:

The feed may not include the full article, but only a teaser.
How would the RSS feed of a traditional blog or news website fit with Briar's more Tumblr-like blogs?

This ticket is about solving the first problem. Once this is solved, we'll open a new ticket for the second one.

One solution could be to fetch and reformat the full article that is usually linked from the RSS feed. This is a difficult job that would require a lot of testing with real-world data. Fortunately, there are libraries out there that could solve this problem for us.

It is difficult to detect whether an RSS feed provides the full content or only a teaser, because the <description> tag is used in both cases. So maybe we could show users a preview before importing the feed and let them manually switch on article extraction when the feed only contains teasers.
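The manual switch described above could be modelled roughly as follows. This is only a sketch: `FeedEntryPreview`, `ContentSource`, `previewBody` and the `articleLoader` callback are hypothetical names for illustration and do not exist in Briar. The loader stands in for whatever fetches and extracts the linked article.

```java
import java.util.function.Function;

// Hypothetical sketch: choose what a preview shows for one feed entry.
final class FeedEntryPreview {

    // Per-feed setting the user can toggle in the preview (assumed name).
    enum ContentSource { FEED_DESCRIPTION, LINKED_ARTICLE }

    // Returns the text to show for an entry. If extraction is switched on
    // but yields nothing, fall back to the <description> content.
    static String previewBody(ContentSource source, String description,
            String link, Function<String, String> articleLoader) {
        if (source == ContentSource.LINKED_ARTICLE && link != null) {
            String article = articleLoader.apply(link);
            if (article != null && !article.isEmpty()) return article;
        }
        return description;
    }
}
```

Falling back to the description keeps the import usable when the linked page cannot be fetched or the extractor finds no article text.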

An alternative is not to support teaser-only feeds at all and to rely on users to provide full-text feeds. There is even a Free Software web service to do this.

This is a sub-ticket of #135 (closed).

Article Extraction Libraries

boilerpipe

seems to be the most popular library on the net, but the last release was 5 years ago and the last commit 2 years ago
not on jcenter, only private maven repo or jars
ArticleExtractor#getText() can take various arguments such as URL, String, Reader, etc., so we can fetch the document ourselves via Tor
The built-in HTMLFetcher is very simple and does not seem to support proxies
License: Apache License 2.0
Dependencies:
- nekohtml
- xerces

snacktory

used by the RSS reader Torsten is using and works well, but is also no longer actively developed
good detection for non-English sites (German, Japanese, ...); to support CJK languages, snacktory's text detection does not depend on the word count
not on jcenter, only private maven repo or jars (or one .java file)
ArticleTextExtractor#extractContent() can take various arguments such as JResult, String, Document, etc. so we can fetch the document ourselves via Tor
There is also a built-in HtmlFetcher that has a setProxy() method
License: Apache License 2.0
Dependencies:
- jsoup
- log4j
- slf4j-api
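
Since both boilerpipe and snacktory accept an already-fetched String, fetching the page ourselves through Tor's SOCKS proxy and handing the HTML to the extractor might look roughly like this. It is a sketch under assumptions: `TorArticleFetcher` and its `Extractor` interface are hypothetical, the proxy address is supplied by the caller (127.0.0.1:9050 is only a common Tor default, not necessarily what Briar uses), and the page is assumed to be UTF-8. The `Extractor` would wrap ArticleExtractor#getText() or ArticleTextExtractor#extractContent().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: fetch a page via a SOCKS proxy, then extract text.
final class TorArticleFetcher {

    // Hypothetical seam so that either library can be plugged in.
    interface Extractor {
        String extract(String html) throws Exception;
    }

    private final Proxy proxy;
    private final Extractor extractor;

    TorArticleFetcher(String socksHost, int socksPort, Extractor extractor) {
        this.proxy = new Proxy(Proxy.Type.SOCKS,
                new InetSocketAddress(socksHost, socksPort));
        this.extractor = extractor;
    }

    Proxy proxy() {
        return proxy;
    }

    // Reads the whole stream, assuming UTF-8 (real pages may differ).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Downloads the raw HTML through the proxy. Whether hostname lookups
    // also go through the proxy is not addressed by this sketch.
    String fetchHtml(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setConnectTimeout(30_000);
        conn.setReadTimeout(30_000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    // Fetches the page and passes the HTML to the configured extractor.
    String fetchArticleText(String url) throws Exception {
        return extractor.extract(fetchHtml(url));
    }
}
```

Keeping the fetch separate from the extraction means neither library's built-in fetcher is needed, so the proxy handling stays in one place regardless of which library is chosen.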

goose

written in Scala, which apparently can be used in Android projects
Last release in Nov 2015
License: Apache License 2.0

Edited Nov 21, 2020 by Cleopatra