Convenient README update

Break lines, update info, say something about uwsgi
master
pictuga 2015-08-29 12:45:36 +02:00
parent 554bdb4650
commit 9b911213b6
1 changed files with 113 additions and 23 deletions

README.md

@ -1,17 +1,33 @@
#Morss - Get full-text RSS feeds
This tool's goal is to get full-text RSS feeds out of stripped RSS feeds, commonly available on the internet. Indeed most newspapers only make a small description available to users in their rss feeds, which makes the RSS feed rather useless. So this tool intends to fix that problem.
This tool opens the links from the rss feed, then downloads the full article from the newspaper website and puts it back in the rss feed.
This tool's goal is to get full-text RSS feeds out of stripped RSS feeds,
commonly available on the internet. Indeed most newspapers only make a small
description available to users in their rss feeds, which makes the RSS feed
rather useless. So this tool intends to fix that problem.
Morss also provides additional features, such as: .csv and json export, extended control over output. A strength of morss is its ability to deal with broken feeds, and to replace tracking links with direct links to the actual content.
Morss can also generate feeds from html and json files (see `feedify.py`), which for instance makes it possible to get feeds for Facebook or Twitter, using hand-written rules (ie. there's no automatic detection of links to build feeds). Please mind that feeds based on html files may stop working unexpectedly, due to html structure changes on the target website.
Additionally morss can grab the source xml feed of iTunes podcast, and detect rss feeds in html pages' `<meta>`.
This tool opens the links from the rss feed, then downloads the full article
from the newspaper website and puts it back in the rss feed.
You can use this program online for free at **[morss.it](http://morss.it/)** (there's also a [test](http://test.morss.it/) version).
Morss also provides additional features, such as: .csv and json export, extended
control over output. A strength of morss is its ability to deal with broken
feeds, and to replace tracking links with direct links to the actual content.
Morss can also generate feeds from html and json files (see `feedify.py`), which
for instance makes it possible to get feeds for Facebook or Twitter, using
hand-written rules (ie. there's no automatic detection of links to build feeds).
Please mind that feeds based on html files may stop working unexpectedly, due to
html structure changes on the target website.
Additionally morss can grab the source xml feed of iTunes podcast, and detect
rss feeds in html pages' `<meta>`.
You can use this program online for free at **[morss.it](http://morss.it/)**
(there's also a [test](http://test.morss.it/) version).
##Dependencies
You do need:
- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
- [lxml](http://lxml.de/) for xml parsing
- [this](https://github.com/bookieio/breadability) readability fork
@ -20,9 +36,11 @@ You do need:
- [OrderedDict](https://pypi.python.org/pypi/ordereddict) if using python < 2.7
Simplest way to get these:
`pip install -r requirements.txt`
pip install -r requirements.txt
You may also need:
- Apache, with python-cgi support, to run on a server
- a fast internet connection
@ -30,7 +48,9 @@ GPL3 code.
##Arguments
morss accepts some arguments that slightly alter its output. Arguments may need to have a value (usually a string or a number). The different "Use cases" below detail how to pass those arguments to morss.
morss accepts some arguments that slightly alter its output. Arguments may need
to have a value (usually a string or a number). The different "Use cases" below
detail how to pass those arguments to morss.
The arguments are:
@ -68,26 +88,72 @@ The arguments are:
morss will auto-detect what "mode" to use.
###Running on a server
####Via mod_cgi/FastCGI with Apache/nginx
To achieve this, you will have to move the files of `/morss/` and of `/www/` into one single folder.
For this, you'll want to rearrange the files a bit, for example into something
like this:
####With Apache, nginx, lighttpd and others
For this, you need to make sure your host allows python script execution. This method uses HTTP calls to fetch the RSS feeds, which will be handled through `mod_cgi` for example on Apache servers.
```
/
├── cgi
│   │
│   ├── main.py
│   ├── morss
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── crawler.py
│   │   ├── feedify.ini
│   │   ├── feedify.py
│   │   ├── feeds.py
│   │   ├── morss.ini
│   │   ├── morss.py
│   │   └── reader.html.template
│   │
│   ├── breadability
│   ├── dateutil
│   ├── html2text.py
│   ├── ordereddict.py
│   └── wheezy
│       └── template
├── .htaccess
├── facebook.php
└── index.html
```
For this, you need to make sure your host allows python script execution. This
method uses HTTP calls to fetch the RSS feeds, which will be handled through
`mod_cgi` for example on Apache servers.
Please pay attention to `main.py` permissions for it to be executable. Also
ensure that the provided `/www/.htaccess` works well with your server.
####Using uWSGI
Running this command should do:
uwsgi --http :9090 --plugin python --wsgi-file main.py
However, one problem might be how to serve the provided `index.html` file if it
isn't in the same directory. Therefore you can add this at the end of the
command to point to another directory `--pyargv '--root ../../www/'`.
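As a rough sketch, the full command with the `--pyargv` option appended can be assembled as follows. The `../../www/` path is only the example location used above and will differ on your setup; `shlex.join` requires Python 3.8+.

```python
import shlex

# Base command, exactly as given above
command = ["uwsgi", "--http", ":9090", "--plugin", "python", "--wsgi-file", "main.py"]

# Assumption: index.html lives in ../../www/ relative to main.py
command += ["--pyargv", "--root ../../www/"]

# shlex.join quotes the --pyargv value so the shell passes it as one argument
full_command = shlex.join(command)
print(full_command)
```

Copy the printed line into your shell, adjusting the `--root` path to wherever your `www/` folder actually is.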
Please pay attention to `/www/main.py` permissions for it to be executable. Also ensure that the provided `/www/.htaccess` works well with your server.
####Using morss' internal HTTP server
Morss can run its own HTTP server. The latter should start when you run morss without any argument, on port 8080. For now, you have to change the hardcoded port if you want to change it.
Morss can run its own HTTP server. The latter should start when you run morss
without any argument, on port 8080.
You can change the port and the location of the `www/` folder like this `python -m morss 9000 --root ../../www`.
####Passing arguments
Then visit: **`http://PATH/TO/MORSS/[morss.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL`**
Then visit: **`http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL`**
For example: `http://morss.example/:clip/https://twitter.com/pictuga`
*(Brackets indicate optional text)*
The `morss.py` part is only needed if your server doesn't support the Apache redirect rule set in the provided `.htaccess`.
The `main.py` part is only needed if your server doesn't support the Apache redirect rule set in the provided `.htaccess`.
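To illustrate the URL pattern above, here is a minimal sketch that assembles such a URL. `build_morss_url` and its parameters are hypothetical helpers written for this example, not part of morss, and the feed URL is appended unescaped, as in the example above.

```python
def build_morss_url(base, feed_url, flags=(), options=None, script=None):
    """Build a morss URL following the pattern described above.

    Hypothetical helper for illustration, not part of morss itself.
    """
    parts = [base.rstrip("/")]
    if script:
        # e.g. "main.py", only needed when the redirect rule is unavailable
        parts.append(script)
    # Every argument is prefixed with ":", valued ones use "name=value"
    args = "".join(":" + flag for flag in flags)
    args += "".join(":%s=%s" % (name, value)
                    for name, value in (options or {}).items())
    if args:
        parts.append(args)
    parts.append(feed_url)
    return "/".join(parts)


print(build_morss_url("http://morss.example",
                      "https://twitter.com/pictuga",
                      flags=["clip"]))
```

With `flags=["clip"]` this reproduces the example URL given above; pass `script="main.py"` if your server ignores the `.htaccess` redirect rule.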
Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rss/wiki), and most probably other clients.
@ -99,7 +165,10 @@ For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
###As a newsreader hook
To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required (unless other newsreaders provide the same kind of feature), since custom scripts can be run on top of the RSS feed, using its [output](http://lzone.de/liferea/scraping.htm) as an RSS feed.
To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required
(unless other newsreaders provide the same kind of feature), since custom
scripts can be run on top of the RSS feed, using its
[output](http://lzone.de/liferea/scraping.htm) as an RSS feed.
To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command: **`[python2.7] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`**
For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
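The command-line layout above (optional arguments first, feed URL last) can be pictured with this small sketch. `split_cli_arguments` is a hypothetical function written for illustration; morss' own parsing may well differ.

```python
def split_cli_arguments(argv):
    """Split ['flag', 'name=value', FEEDURL] into (options, feed_url).

    Hypothetical illustration only, not morss' actual parser.
    """
    args, feed_url = argv[:-1], argv[-1]  # the feed URL always comes last
    options = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        # Arguments without a value are treated as simple on/off flags
        options[name] = value if sep else True
    return options, feed_url


options, feed_url = split_cli_arguments(
    ["debug", "http://feeds.bbci.co.uk/news/rss.xml"])
print(options, feed_url)
```

Because the feed URL is taken from the last position, an `=` inside the URL itself is never mistaken for an argument value.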
@ -126,7 +195,9 @@ Using cache and passing arguments:
'{"title": "BBC News - Home", "desc": "The latest s'
```
`morss.process` is actually a wrapper around simpler functions. It's still possible to call the simpler functions, to have more control on what's happening under the hood.
`morss.process` is actually a wrapper around simpler functions. It's still
possible to call the simpler functions, to have more control on what's happening
under the hood.
Doing it step-by-step:
```python
@ -146,12 +217,20 @@ output = morss.Format(rss, options) # formats final feed
##Cache information
morss uses a small cache directory to make the loading faster. Given the way it's designed, the cache doesn't need to be purged every now and then, unless you stop following a large number of feeds. Only in the case of mass un-subscribing, you might want to delete the cache files corresponding to the bygone feeds. If morss is running as a server, the cache folder is at `MORSS_DIRECTORY/cache/`, and in `$HOME/.cache/morss` otherwise.
morss uses a small cache directory to make the loading faster. Given the way
it's designed, the cache doesn't need to be purged every now and then, unless
you stop following a large number of feeds. Only in the case of mass
un-subscribing, you might want to delete the cache files corresponding to the
bygone feeds. If morss is running as a server, the cache folder is at
`MORSS_DIRECTORY/cache/`, and in `$HOME/.cache/morss` otherwise.
##Configuration
###Length limitation
When parsing long feeds, with a lot of items (100+), morss might take a lot of time to parse it, or might even run into a memory overflow on some shared hosting plans (limits around 10Mb), in which case you might want to adjust the different values at the top of the script.
When parsing long feeds, with a lot of items (100+), morss might take a lot of
time to parse it, or might even run into a memory overflow on some shared
hosting plans (limits around 10Mb), in which case you might want to adjust the
different values at the top of the script.
- `MAX_TIME` sets the maximum amount of time spent *fetching* articles, more time might be spent taking older articles from cache. `-1` for unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited. More articles will be taken from cache following the next settings.
@ -166,17 +245,28 @@ When parsing long feeds, with a lot of items (100+), morss might take a lot of t
###Content matching
The content of articles is grabbed with a [**readability** fork](https://github.com/buriy/python-readability). This means that most of the time the right content is matched. However sometimes it fails, therefore some tweaking is required. Most of the time, what has to be done is to add some "rules" in the main script file in *readability* (not in morss).
The content of articles is grabbed with a
[**readability** fork](https://github.com/buriy/python-readability). This means
that most of the time the right content is matched. However sometimes it fails,
therefore some tweaking is required. Most of the time, what has to be done is to
add some "rules" in the main script file in *readability* (not in morss).
Most of the time when hardly anything is matched, it means that the main content of the article is made of images, videos, pictures, etc., which readability doesn't detect. Also, readability has some trouble matching the content of very small articles.
Most of the time when hardly anything is matched, it means that the main content
of the article is made of images, videos, pictures, etc., which readability
doesn't detect. Also, readability has some trouble matching the content of very
small articles.
morss will also try to figure out whether the full content is already in place (for those websites which understood the whole point of RSS feeds). However this detection is very simple, and only works if the actual content is put in the "content" section in the feed and not in the "summary" section.
morss will also try to figure out whether the full content is already in place
(for those websites which understood the whole point of RSS feeds). However this
detection is very simple, and only works if the actual content is put in the
"content" section in the feed and not in the "summary" section.
***
##Todo
You can contribute to this project. If you're not sure what to do, you can pick from this list:
You can contribute to this project. If you're not sure what to do, you can pick
from this list:
- Add ability to run morss.py as an update daemon
- Rewrite the readability fork, for better performance, and make it more "pythonic" (Firefox for Android may have its own implementation, most probably cleaner than `readability.js`)