pictuga/morss

pictuga a613acfe7d

Fix isort

2022-01-31 00:44:04 +01:00

morss

crawler: improve handling of non-ascii urls

2022-01-30 23:27:49 +01:00

tests

Fix isort

2022-01-31 00:44:04 +01:00

www

More ordering options

2022-01-23 12:27:07 +01:00

.drone.yml

pytest: first batch with test_feeds

2022-01-31 00:23:09 +01:00

.pylintrc

…

app.json

Make use of GUNICORN_CMD_ARGS

2021-12-24 11:44:24 +01:00

Dockerfile

Turns out exec array is not supported in HEALTHCHECK

2021-12-28 15:23:40 +01:00

heroku.yml

heroku: make env var customizable

2021-11-24 21:40:34 +01:00

LICENSE

…

main.py

Make helper & main.py executable

2021-12-29 15:47:05 +01:00

morss-helper

helper: fix reload code

2022-01-19 13:44:15 +01:00

README.md

Ability to pass custom data_files location

2022-01-25 22:36:34 +01:00

setup.py

pytest: first batch with test_feeds

2022-01-31 00:23:09 +01:00

README.md

Morss - Get full-text RSS feeds

Homepage • Upstream source code • Github mirror (for Issues & Pull requests)

This tool's goal is to get full-text RSS feeds out of stripped RSS feeds, which are common on the internet. Indeed, most newspapers only make a short description available in their RSS feeds, which makes those feeds rather useless. This tool intends to fix that problem.

This tool opens the links from the rss feed, then downloads the full article from the newspaper website and puts it back in the rss feed.

Morss also provides additional features, such as: .csv and json export, extended control over output. A strength of morss is its ability to deal with broken feeds, and to replace tracking links with direct links to the actual content.

Morss can also generate feeds from html and json files (see feeds.py), which for instance makes it possible to get feeds for Facebook or Twitter, using hand-written rules (i.e. there's no automatic detection of links to build feeds). Please keep in mind that feeds based on html files may stop working unexpectedly when the html structure of the target website changes.

Additionally morss can detect rss feeds in html pages' <meta>.
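
The README does not detail how this detection works. As a rough illustration only, the sketch below assumes the widespread feed autodiscovery convention, where a page advertises its feeds through link elements with rel="alternate" in its head. The names FeedLinkFinder and find_feeds are hypothetical and this is not morss' own code.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise feeds (assumption, not taken from morss)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate"> that points to a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        mime = (attrs.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attrs.get("href"):
            self.feeds.append(attrs["href"])


def find_feeds(html):
    """Return the feed URLs advertised in an html document."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


page = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/feed.xml"></head></html>')
print(find_feeds(page))
```

The returned href may be relative to the page, so a real implementation would still have to resolve it against the page URL.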

You can use this program online for free at morss.it.

Some features of morss:

Read RSS/Atom feeds
Create RSS feeds from json/html pages
Export feeds as RSS/JSON/CSV/HTML
Fetch full-text content of feed items
Follow 301/meta redirects
Recover xml feeds with corrupt encoding
Supports gzip-compressed http content
HTTP caching with different backends (in-memory/sqlite/mysql/redis/diskcache)
Works as server/cli tool
Deobfuscate various tracking links

Install

Python package

Simple install (without optional dependencies)

From pip

pip install morss

From git

pip install git+https://git.pictuga.com/pictuga/morss.git

Full installation (including optional dependencies)

From pip

pip install morss[full]

From git

pip install git+https://git.pictuga.com/pictuga/morss.git#egg=morss[full]

The full install includes all the cache backends. Otherwise, only in-memory and sqlite3 caches are available. The full install also includes gunicorn (for more efficient HTTP handling).

The lxml dependency takes fairly long to install (especially on Raspberry Pi, as C code needs to be compiled). If possible on your distribution, try installing it with the system package manager.

Docker

From docker hub

With cli

docker pull pictuga/morss

With docker-compose

services:
    app:
        image: pictuga/morss
        ports:
            - '8000:8000'

Build from source

With cli

docker build --tag morss https://git.pictuga.com/pictuga/morss.git --no-cache --pull

With docker-compose

services:
    app:
        build: https://git.pictuga.com/pictuga/morss.git
        image: morss
        ports:
            - '8000:8000'

Then execute

docker-compose build --no-cache --pull

Cloud providers

One-click deployment:

Providers supporting cloud-init (AWS, Oracle Cloud Infrastructure), based on Ubuntu:

#cloud-config

packages:
  - python3-pip
  - python3-wheel
  - python3-lxml
  - python3-setproctitle
  - ca-certificates

write_files:
  - path: /etc/environment
    append: true
    content: |
      DEBUG=1
      CACHE=diskcache
      CACHE_SIZE=1073741824 # 1GiB
  - path: /var/lib/cloud/scripts/per-boot/morss.sh
    permissions: '0744'
    content: |
      #!/bin/sh
      /usr/local/bin/morss-helper daemon

runcmd:
  - source /etc/environment
  - update-ca-certificates
  - iptables -I INPUT 6 -m state --state NEW -p tcp --dport ${PORT:-8000} -j ACCEPT
  - netfilter-persistent save
  - pip install morss[full]

Run

morss will auto-detect what "mode" to use.

Running on/as a server

Set up the server as indicated below, then visit:

http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL

For example: http://morss.example/:clip/https://twitter.com/pictuga

(Brackets indicate optional text)

The main.py part is only needed if your server doesn't support the Apache redirect rule set in the provided .htaccess.
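
As an illustration of the URL scheme above, here is a minimal sketch that assembles such a URL. The helper name build_morss_url and its parameters are hypothetical (they are not part of morss), and the feed URL is appended as-is, as in the example above.

```python
def build_morss_url(base, feed_url, flags=(), values=None, use_main_py=False):
    """Assemble a morss server URL from its optional parts.

    base        -- where morss is served, e.g. 'http://morss.example'
    flags       -- arguments without value, e.g. ['clip']
    values      -- arguments with a value, e.g. {'format': 'json'}
    use_main_py -- add the 'main.py' part (no redirect rule on the server)
    """
    parts = [base.rstrip("/")]
    if use_main_py:
        parts.append("main.py")
    args = "".join(":" + flag for flag in flags)
    args += "".join(":{}={}".format(key, value)
                    for key, value in (values or {}).items())
    if args:
        parts.append(args)
    parts.append(feed_url)
    return "/".join(parts)


print(build_morss_url("http://morss.example", "https://twitter.com/pictuga",
                      flags=["clip"]))
# http://morss.example/:clip/https://twitter.com/pictuga
```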

Works like a charm with Tiny Tiny RSS, and most probably other clients.

Using Docker

From docker hub

docker run -p 8000:8000 pictuga/morss

From source

docker run -p 8000:8000 morss

With docker-compose

docker-compose up

Using Gunicorn

gunicorn --preload morss

Using uWSGI

Running this command should do:

uwsgi --http :8000 --plugin python --wsgi-file main.py

Using morss' internal HTTP server

Morss can run its own, very basic, HTTP server, meant mostly for debugging. It starts on port 8000 when you run morss without any argument. For better performance, I'd highly recommend using gunicorn or something similar.

morss

You can change the port with an environment variable, for example: PORT=9000 morss.

Via mod_cgi/FastCGI with Apache/nginx

For this, you'll want to rearrange the files slightly, for example into something like this:

/
├── cgi
│   │
│   ├── main.py
│   ├── morss
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── morss.py
│   │   └── ...
│   │
│   ├── dateutil
│   └── ...
│
├── .htaccess
├── index.html
└── ...

For this, you need to make sure your host allows python script execution. This method uses HTTP calls to fetch the RSS feeds, which will be handled through mod_cgi, for example, on Apache servers.

Make sure main.py has the right permissions to be executable. See below for some tips on the .htaccess file.

Options -Indexes

ErrorDocument 404 /cgi/main.py

# Turn debug on for all requests
SetEnv DEBUG 1

# Turn debug on for requests with :debug in the url
SetEnvIf Request_URI :debug DEBUG=1

<Files ~ "\.(py|pyc|db|log)$">
	deny from all
</Files>

<Files main.py>
	allow from all
	AddHandler cgi-script .py
	Options +ExecCGI
</Files>

As a CLI application

Run:

morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL

For example: morss --clip http://feeds.bbci.co.uk/news/rss.xml

(Brackets indicate optional text)

If using Docker:

docker run morss --clip http://feeds.bbci.co.uk/news/rss.xml

As a newsreader hook

This requires the Liferea newsreader (or another newsreader offering a similar feature), which can run a custom command and use its output as the RSS feed.

To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:

morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL

For example: morss http://feeds.bbci.co.uk/news/rss.xml

(Brackets indicate optional text)

As a python library

Quickly get a full-text feed:

>>> import morss
>>> xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
>>> xml_string[:50]
"<?xml version='1.0' encoding='UTF-8'?>\n<?xml-style"

Using cache and passing arguments:

>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
>>> cache = '/tmp/morss-cache.db' # sqlite cache location
>>> options = {'csv':True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
'{"title": "BBC News - Home", "desc": "The latest s'

morss.process is actually a wrapper around simpler functions. It's still possible to call those simpler functions directly, to have more control over what's happening under the hood.

Doing it step-by-step:

import morss, morss.crawler

url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True) # arguments
morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location

url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up

output = morss.FeedFormat(rss, options, 'unicode') # formats final feed

Arguments and settings

Arguments

morss accepts some arguments that slightly alter its output. Some arguments need a value (usually a string or a number). How to pass those arguments to morss is explained in Run above.

The list of arguments can be obtained by running morss --help

usage: morss [-h] [--post STRING] [--xpath XPATH]
             [--format {rss,json,html,csv}] [--search STRING] [--clip]
             [--indent] [--cache] [--force] [--proxy]
             [--order {first,last,newest,oldest}] [--firstlink] [--resolve]
             [--items XPATH] [--item_link XPATH] [--item_title XPATH]
             [--item_content XPATH] [--item_time XPATH] [--nolink] [--noref]
             [--silent]
             url

Get full-text RSS feeds

positional arguments:
  url                   feed url

options:
  -h, --help            show this help message and exit
  --post STRING         POST request
  --xpath XPATH         xpath rule to manually detect the article

output:
  --format {rss,json,html,csv}
                        output format
  --search STRING       does a basic case-sensitive search in the feed
  --clip                stick the full article content under the original feed
                        content (useful for twitter)
  --indent              returns indented XML or JSON, takes more place, but
                        human-readable

action:
  --cache               only take articles from the cache (ie. don't grab new
                        articles' content), so as to save time
  --force               force refetch the rss feed and articles
  --proxy               doesn't fill the articles
  --order {first,last,newest,oldest}
                        order in which to process items (which are however NOT
                        sorted in the output)
  --firstlink           pull the first article mentioned in the description
                        instead of the default link
  --resolve             replace tracking links with direct links to articles
                        (not compatible with --proxy)

custom feeds:
  --items XPATH         (mandatory to activate the custom feeds function)
                        xpath rule to match all the RSS entries
  --item_link XPATH     xpath rule relative to items to point to the entry's
                        link
  --item_title XPATH    entry's title
  --item_content XPATH  entry's content
  --item_time XPATH     entry's date & time (accepts a wide range of time
                        formats)

misc:
  --nolink              drop links, but keeps links' inner text
  --noref               drop items' link
  --silent              don't output the final RSS (useless on its own, but
                        can be nice when debugging)

GNU AGPLv3 code

Further HTTP-only options:

callback=NAME: for JSONP calls
cors: allow Cross-origin resource sharing (allows XHR calls from other servers)
txt: changes the http content-type to txt (for faster "view-source:")

Environment variables

To pass environment variables:

Docker-cli: docker run -p 8000:8000 --env KEY=value morss
docker-compose: add an environment: section in the .yml file
Gunicorn/uWSGI/CLI: prepend KEY=value before the command
Apache: via the SetEnv instruction (see sample .htaccess provided)
cloud-init: in the /etc/environment file

Generic:

DEBUG=1: to have some feedback from the script execution. Useful for debugging.
IGNORE_SSL=1: to ignore SSL certs when fetching feeds and articles
DELAY (seconds) sets the browser cache delay, only for HTTP clients
TIMEOUT (seconds) sets the HTTP timeout when fetching rss feeds and articles
DATA_PATH: to set custom file location for the www folder

When parsing long feeds with many items (100+), morss might take a long time, or might even run out of memory on some shared hosting plans (limits around 10 MB), in which case you might want to adjust the settings below via environment variables.

Also, if the request takes too long to process, the http request might be discarded. See relevant config for gunicorn or nginx.

MAX_TIME (seconds) sets the maximum amount of time spent fetching articles, more time might be spent taking older articles from cache. -1 for unlimited.
MAX_ITEM sets the maximum number of articles to fetch. -1 for unlimited. More articles will be taken from cache following the next settings.
LIM_TIME (seconds) sets the maximum amount of time spent working on the feed (whether or not it's already cached). Articles beyond that limit will be dropped from the feed. -1 for unlimited.
LIM_ITEM sets the maximum number of articles checked, limiting both the number of articles fetched and taken from cache. Articles beyond that limit will be dropped from the feed, even if they're cached. -1 for unlimited.
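
One way to read how these four limits interact is sketched below. It is a simplified model built only from the descriptions above, not morss' actual code: the function name plan_item is hypothetical, and the exact boundary behaviour (whether a limit is reached at "greater than" or "greater than or equal") is an assumption.

```python
def plan_item(index, elapsed, max_item=-1, max_time=-1, lim_item=-1, lim_time=-1):
    """Decide what happens to one feed item.

    index   -- position of the item in processing order (0-based)
    elapsed -- seconds spent on the feed so far
    Returns 'fetch' (download the article), 'cache' (only use a cached
    copy) or 'drop' (remove the item from the feed). -1 means unlimited.
    """
    def reached(limit, value):
        return limit != -1 and value >= limit

    # LIM_* are hard limits: items beyond them are dropped, even if cached
    if reached(lim_item, index) or reached(lim_time, elapsed):
        return "drop"
    # MAX_* only stop new downloads: cached articles are still used
    if reached(max_item, index) or reached(max_time, elapsed):
        return "cache"
    return "fetch"


# Hypothetical settings: MAX_ITEM=2 and LIM_ITEM=4 on a 6-item feed
plan = [plan_item(i, elapsed=0, max_item=2, lim_item=4) for i in range(6)]
print(plan)
# ['fetch', 'fetch', 'cache', 'cache', 'drop', 'drop']
```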

morss uses caching to make loading faster. There are 5 possible cache backends:

(nothing/default): a simple python in-memory dict-like object.
CACHE=sqlite: sqlite3 cache. Default file location is in-memory (i.e. it will be cleared every time the program is run). Path can be defined with SQLITE_PATH.
CACHE=mysql: MySQL cache. Connection can be defined with the following environment variables: MYSQL_USER, MYSQL_PWD, MYSQL_DB, MYSQL_HOST
CACHE=redis: Redis cache. Connection can be defined with the following environment variables: REDIS_HOST, REDIS_PORT, REDIS_DB, REDIS_PWD
CACHE=diskcache: disk-based cache. Target directory can be defined with DISKCACHE_DIR.

To limit the size of the cache:

CACHE_SIZE sets the target number of items in the cache (further items will be deleted but the cache might be temporarily bigger than that). Defaults to 1k entries. NB. When using diskcache, this is the cache max size in Bytes.
CACHE_LIFESPAN (seconds) sets how often the cache must be trimmed (i.e. cut down to the number of items set in CACHE_SIZE). Defaults to 1min.
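
The interplay between CACHE_SIZE and CACHE_LIFESPAN can be pictured with the toy in-memory cache below. It is a sketch under assumptions, not morss' implementation: the class name TrimmedCache is hypothetical, and evicting the oldest entries first is an assumed policy.

```python
import time
from collections import OrderedDict


class TrimmedCache:
    """Dict-like cache cut down to `size` items at most every `lifespan` seconds."""

    def __init__(self, size=1000, lifespan=60):
        self.size = size          # plays the role of CACHE_SIZE
        self.lifespan = lifespan  # plays the role of CACHE_LIFESPAN
        self.items = OrderedDict()
        self.last_trim = time.monotonic()

    def __setitem__(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)  # most recently written item goes last
        if time.monotonic() - self.last_trim >= self.lifespan:
            self.trim()

    def __getitem__(self, key):
        return self.items[key]

    def __len__(self):
        return len(self.items)

    def trim(self):
        """Drop the oldest entries until the target size is met."""
        while len(self.items) > self.size:
            self.items.popitem(last=False)
        self.last_trim = time.monotonic()


cache = TrimmedCache(size=2, lifespan=3600)
for n in range(5):
    cache[n] = str(n)
print(len(cache))  # still above the target: no trim is due yet
cache.trim()
print(len(cache))  # back to the target size
```

This shows why the cache can temporarily hold more items than CACHE_SIZE: trimming only happens once the lifespan has elapsed.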

Gunicorn also accepts command line arguments via the GUNICORN_CMD_ARGS environment variable.

Content matching

The content of articles is grabbed with our own readability fork, which matches the right content most of the time. When it fails, some tweaking is required: usually this means adding some "rules" in readabilite.py (not in morss).

Most of the time, when hardly anything is matched, it means that the main content of the article is made of images, videos, etc., which readability doesn't detect. Also, readability has some trouble matching the content of very short articles.

morss will also try to figure out whether the full content is already in place (for those websites which understood the whole point of RSS feeds). However this detection is very simple, and only works if the actual content is put in the "content" section in the feed and not in the "summary" section.
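
To make the "content" versus "summary" distinction concrete, here is a small sketch for an Atom entry. It only illustrates the idea described above: the function name has_full_content is hypothetical, and morss' real check may rely on other criteria.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def has_full_content(entry):
    """True if the Atom entry carries a non-empty <content> element."""
    content = entry.find(ATOM + "content")
    return content is not None and bool("".join(content.itertext()).strip())


entry_full = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<title>A</title><content>Full article text</content></entry>')
entry_summary = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<title>B</title><summary>Teaser only</summary></entry>')

print(has_full_content(entry_full))     # True
print(has_full_content(entry_summary))  # False
```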

Languages

Python 90.5%

XSLT 7.4%

HTML 1%

Shell 0.7%

Dockerfile 0.4%