Created Jun 29, 2016 by Torsten Grote@groteOwner

Research RSS article extraction libraries

There are two main problems with importing an RSS feed and republishing it as a Briar blog:

  1. The feed may not include the full article, but only a teaser
  2. It is unclear how the RSS feed of a traditional blog or news website would fit with Briar's more Tumblr-like blogs

This ticket is about solving the first problem. Once this is solved, we'll open a new ticket for the second one.

One solution could be to fetch and reformat the full article that is usually linked from the RSS feed. This is a difficult job that would require a lot of testing with real-world data. Fortunately, there are libraries out there that could solve this problem for us.

It is difficult to detect whether an RSS feed provides the full content or only a teaser, because the <description> tag is used in both cases. So maybe we could show users a preview before importing the feed and let them manually switch on article extraction when the feed only contains teasers.
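To make the preview idea a bit more concrete, here is a minimal sketch of how the import dialog could pre-select a mode while still leaving the final choice to the user. Everything in it is hypothetical: the class and enum names, the 500-character threshold and the ellipsis check are assumptions for illustration, not something Briar or any feed library defines. It is only a hint, since reliable detection is not possible.

```java
import java.util.List;

class TeaserHint {
    /** Import mode the user can still override in the preview (hypothetical names). */
    enum Mode { USE_FEED_TEXT, EXTRACT_FULL_ARTICLE }

    /** Assumed threshold: shorter descriptions are treated as likely teasers. */
    static final int MIN_FULL_TEXT_LENGTH = 500;

    /** Rough guess only: short text or a trailing ellipsis suggests a teaser. */
    static boolean looksLikeTeaser(String description) {
        // Strip markup crudely so that tags do not count towards the length
        String text = description.replaceAll("<[^>]*>", "").trim();
        return text.length() < MIN_FULL_TEXT_LENGTH
                || text.endsWith("...")
                || text.endsWith("\u2026")
                || text.endsWith("[...]");
    }

    /** Suggests a default mode for the preview based on the majority of items. */
    static Mode suggestMode(List<String> descriptions) {
        if (descriptions.isEmpty()) return Mode.USE_FEED_TEXT;
        long teasers = descriptions.stream()
                .filter(TeaserHint::looksLikeTeaser)
                .count();
        return teasers * 2 > descriptions.size()
                ? Mode.EXTRACT_FULL_ARTICLE
                : Mode.USE_FEED_TEXT;
    }
}
```

The suggestion would only set the initial state of the switch in the preview; short full-text posts would be misclassified by this heuristic, which is why the manual toggle remains necessary.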

An alternative is not to support teaser-only feeds at all and rely on users to provide full text feeds. There is even a Free Software webservice to do this.

This is a sub-ticket of #135 (closed).

Article Extraction Libraries

boilerpipe

  • seems to be the most popular library on the net, but last release was 5 years ago and last commit 2 years ago
  • not on jcenter, only private maven repo or jars
  • ArticleExtractor#getText() can take various arguments such as URL, String, Reader, etc. so we can fetch the document ourselves via Tor
  • The built-in HTMLFetcher is very simple and does not seem to support proxies
  • License: Apache License 2.0
  • Dependencies:
    • nekohtml
    • xerces

snacktory

  • used by the RSS reader Torsten is using and works well, but also no longer actively developed
  • good detection for non-English sites (German, Japanese, ...); to support CJK languages, snacktory's text detection does not depend on word count
  • not on jcenter, only private maven repo or jars (or one .java file)
  • ArticleTextExtractor#extractContent() can take various arguments such as JResult, String, Document, etc. so we can fetch the document ourselves via Tor
  • There is also a built-in HtmlFetcher that has a setProxy() method
  • License: Apache License 2.0
  • Dependencies:
    • jsoup
    • log4j
    • slf4j-api
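
Since both boilerpipe and snacktory accept the document as a String, fetching it ourselves via Tor could look roughly like the sketch below. It uses only the Java standard library. The class name TorFetcher is hypothetical, and the proxy address (127.0.0.1:9050 is Tor's usual default SOCKS port), the timeouts and the UTF-8 charset are assumptions. Note that this sketch does not deal with DNS leaks: depending on the JVM, the target hostname may be resolved locally before the SOCKS connection is made, so this would need checking before real use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class TorFetcher {
    private final Proxy proxy;

    /** host/port of the local Tor SOCKS proxy, e.g. 127.0.0.1:9050 (assumed). */
    TorFetcher(String host, int port) {
        this.proxy = new Proxy(Proxy.Type.SOCKS,
                InetSocketAddress.createUnresolved(host, port));
    }

    Proxy proxy() {
        return proxy;
    }

    /** Downloads the page through the proxy and returns it as a String. */
    String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setConnectTimeout(30_000); // assumed timeouts, Tor can be slow
        conn.setReadTimeout(30_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumes UTF-8; a real implementation would honour the
            // charset declared in the Content-Type header
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned HTML string would then be handed to the String variant of ArticleExtractor#getText() or ArticleTextExtractor#extractContent() mentioned above, so that neither library's built-in fetcher ever touches the network.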

goose

  • written in Scala, which apparently can be used in Android projects
  • Last release in Nov 2015
  • License: Apache License 2.0
Edited Nov 21, 2020 by Cleopatra