Research RSS article extraction libraries
There are two main problems with doing an RSS import and republishing it as a Briar Blog:
- The feed may not include the full article, but only a teaser
- How would the RSS feed of a traditional blog or news website fit with Briar's more Tumblr-like blogs?
This ticket is about solving the first problem. Once this is solved, we'll open a new ticket for the second one.
One solution could be to fetch and reformat the full article that is usually linked from the RSS feed. This is a difficult job that would require a lot of testing with real-world data. Fortunately, there are libraries out there that could solve this problem for us.
It is difficult to detect whether an RSS feed provides the full content or not, because the `<description>` tag is used in both cases. So maybe we could show users a preview before importing the feed and let them manually switch on an article extraction mode when the feed only contains teasers.
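The manual switch described above could be sketched roughly as follows. This is only an illustration: `ImportMode`, `FeedEntry`, `EntryImporter` and the `articleFetcher` hook are hypothetical names, not existing Briar classes, and the hook stands in for whichever extraction library gets chosen.

```java
import java.util.function.Function;

// Hypothetical per-feed setting the user picks after seeing the preview.
enum ImportMode { FEED_CONTENT, EXTRACT_LINKED_ARTICLE }

// Hypothetical minimal view of one RSS item.
final class FeedEntry {
    final String link;         // URL from the item's <link>
    final String description;  // text from the item's <description>

    FeedEntry(String link, String description) {
        this.link = link;
        this.description = description;
    }
}

final class EntryImporter {
    // Returns the text to publish for one entry.
    // articleFetcher is an assumed hook that downloads the linked page
    // and returns the extracted article text (or null/empty on failure).
    static String bodyFor(FeedEntry entry, ImportMode mode,
            Function<String, String> articleFetcher) {
        if (mode == ImportMode.EXTRACT_LINKED_ARTICLE && entry.link != null) {
            String full = articleFetcher.apply(entry.link);
            if (full != null && !full.isEmpty()) return full;
        }
        // Default, and fallback when extraction yields nothing:
        // use whatever the feed itself provided.
        return entry.description;
    }
}
```

Falling back to the feed's own text keeps the import working even when extraction fails for a particular article.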
An alternative is not to support teaser-only feeds at all and rely on users to provide full-text feeds. There is even a Free Software web service to do this.
This is a sub-ticket of #135 (closed).
Article Extraction Libraries
boilerpipe
- seems to be the most popular library on the net, but the last release was 5 years ago and the last commit 2 years ago
- not on jcenter, only private maven repo or jars
- `ArticleExtractor#getText()` can take various arguments such as `URL`, `String`, `Reader`, etc., so we can fetch the document ourselves via Tor
- The built-in `HTMLFetcher` is very simple and does not seem to support proxies
- License: Apache License 2.0
- Dependencies:
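Fetching the document ourselves, as mentioned in the list above, might look roughly like this. It is a minimal sketch using only the JDK: `TorHtmlFetcher` is a hypothetical name, the SOCKS host and port are whatever the local Tor instance listens on (9050 is only the stock Tor default and is an assumption here), and the page is assumed to be UTF-8. Whether hostnames get resolved through the proxy or locally is not addressed by this sketch and would need checking before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class TorHtmlFetcher {
    // Builds a SOCKS proxy descriptor for a locally running Tor.
    static Proxy socksProxy(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
    }

    // Downloads an http(s) URL through the given proxy and returns the body.
    static String fetch(URL url, Proxy proxy) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        conn.setConnectTimeout(30_000); // Tor circuits can be slow
        conn.setReadTimeout(30_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            // Assumption: the page is UTF-8; a real importer should honour
            // the charset from the Content-Type header.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

The resulting HTML string could then be passed to the `String` variant of `ArticleExtractor#getText()`, so the library's own `HTMLFetcher` is never used.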
snacktory
- used by the RSS reader Torsten is using and works well, but also no longer actively developed
- good detection for non-English sites (German, Japanese, ...); snacktory does not depend on the word count in its text detection, so it can support CJK languages
- not on jcenter, only private maven repo or jars (or one `.java` file)
- `ArticleTextExtractor#extractContent()` can take various arguments such as `JResult`, `String`, `Document`, etc., so we can fetch the document ourselves via Tor
- There is also a built-in `HtmlFetcher` that has a `setProxy()` method
- License: Apache License 2.0
- Dependencies:
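To illustrate why the word-count point in the list above matters for CJK text, here is a small sketch. It is not snacktory's code; `WordCountDemo` and both methods are hypothetical and only show how a whitespace-based word count behaves compared to a length-based measure.

```java
final class WordCountDemo {
    // Naive word count: number of whitespace-separated tokens.
    static int whitespaceWordCount(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Length in Unicode code points, independent of whitespace.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }
}
```

A Japanese sentence such as これは日本語で書かれた記事です。 contains no spaces, so `whitespaceWordCount` returns 1 however long the text is, while an English sentence of similar length yields several words. A detector that discards blocks with few words would therefore tend to drop CJK articles, which a length-based measure avoids.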
goose
- written in Scala, which apparently can be used in Android projects
- Last release in Nov 2015
- License: Apache License 2.0