#Morss - Get full-text RSS feeds

The goal of this tool is to get full-text RSS feeds out of the stripped RSS feeds commonly available on the Internet. Indeed, most newspapers only make a short description available in their RSS feeds, which makes the feeds rather useless. This tool intends to fix that problem.
This tool opens the links from the RSS feed, then downloads the full article from the newspaper website and puts it back in the RSS feed.

You can use this program online for free at **[morss.it](http://morss.it/)** (there's also a [test](http://test.morss.it/) version).

##Dependencies

You do need:
- [python](http://www.python.org/) >= 2.6 < 3
- [lxml](http://lxml.de/) for xml parsing
- this [readability](https://github.com/buriy/python-readability) fork
- [dateutil](http://labix.org/python-dateutil) to parse feed dates

You may also need:
- Apache, with python-cgi support, to run on a server
- a fast internet connection

GPL3 code.

##Arguments

morss accepts some arguments that slightly alter its output. Some arguments take a value (usually a string or a number). The "Use cases" section below details how to pass those arguments to morss.

The arguments are:

- Change what morss does
	- `proxy`: doesn't fill the articles
	- `clip`: stick the full article content under the original feed content (useful for twitter)
	- `keep`: by default, morss drops the feed description whenever the full content is found (so as not to mislead Firefox users: Firefox only shows the description in the feed preview, so they might believe morss doesn't work); with this argument, the description is kept
- Advanced
	- `cache`: only take articles from the cache (i.e. don't grab new articles' content), so as to save time
	- `debug`: to have some feedback from the script execution. Useful for debugging
	- `theforce`: force download the rss feed
- http server only
	- `html`: changes the http content-type to html, so that python cgi errors (written in html) are readable in a web browser
	- `force`: avoid using your browser cache (does not support 304 responses)

##Use cases

morss will auto-detect what "mode" to use.

###Running on a server

For this, you need to make sure your host allows Python script execution. This method uses HTTP calls to fetch the RSS feeds, which will be handled through `mod_cgi`, for example, on Apache servers.

Then visit: **`http://PATH/TO/MORSS/[morss.py]/[:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL`**  
For example: `http://morss.example/:clip/https://twitter.com/pictuga`  
*(Brackets indicate optional text)*

The `morss.py` part is only needed if your server doesn't support the Apache redirect rule set in the provided `.htaccess`.
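
As a rough illustration of the URL scheme above, the following sketch splits a request path into its options and the feed URL. It is not taken from morss itself: the function name `parse_path` and the shape of the returned dictionary are assumptions made for this example, and it assumes that argument values contain neither `:` nor `/`.

```python
def parse_path(path):
    """Split '/[morss.py/][:arg[:arg=value...]/]FEEDURL' into (options, feed_url).

    Hypothetical helper, written only to illustrate the URL scheme.
    """
    path = path.lstrip('/')

    # The script name is optional (it is hidden when the redirect rule applies).
    if path.startswith('morss.py/'):
        path = path[len('morss.py/'):]

    options = {}
    if path.startswith(':'):
        # Everything up to the first '/' is the argument block.
        arg_part, _, path = path[1:].partition('/')
        for arg in arg_part.split(':'):
            name, sep, value = arg.partition('=')
            # Arguments without a value are stored as plain flags.
            options[name] = value if sep else True

    return options, path
```

With the example above, `parse_path('/:clip/https://twitter.com/pictuga')` yields the `clip` flag and the feed URL `https://twitter.com/pictuga`.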

**NB.** Morss does NOT provide any HTTP server itself. This must be done with Apache or nginx with python support!

Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rss/wiki).

###As a CLI application

Run: **`[python2.7] morss.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`**  
For example: `python2.7 morss.py debug http://feeds.bbci.co.uk/news/rss.xml`  
*(Brackets indicate optional text)*
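
The command-line form can be read the same way: every word before the last one is an argument, and the last word is the feed URL. The sketch below only illustrates that convention; `parse_cli` is a hypothetical name and this is not the parsing code used by morss.

```python
def parse_cli(argv):
    """Turn ['debug', 'name=value', 'FEEDURL'] into (options, feed_url).

    Hypothetical helper; argv is expected without the script name.
    """
    if not argv:
        raise ValueError('FEEDURL is required')

    options = {}
    for arg in argv[:-1]:
        name, sep, value = arg.partition('=')
        # 'name=value' keeps its value, a bare 'name' becomes a flag.
        options[name] = value if sep else True

    return options, argv[-1]
```

It could be called as `parse_cli(sys.argv[1:])` from a script.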

###As a newsreader hook

This requires the newsreader [Liferea](http://lzone.de/liferea/) (unless other newsreaders provide the same kind of feature), since it can run custom scripts on top of the RSS feed and use their [output](http://lzone.de/liferea/scraping.htm) as an RSS feed.

To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command: **`[python2.7] PATH/TO/MORSS/morss.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`**  
For example: `python2.7 PATH/TO/MORSS/morss.py http://feeds.bbci.co.uk/news/rss.xml`

##Cache information

morss uses a small cache directory to make loading faster. Given the way it's designed, the cache doesn't need to be purged every now and then, unless you stop following a large number of feeds. Only in the case of mass unsubscribing might you want to delete the cache files corresponding to the bygone feeds. If morss is running as a server, the cache folder is at `MORSS_DIRECTORY/cache/`; otherwise it is at `$HOME/.cache/morss`.
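
The choice between the two cache locations can be pictured with the short sketch below. The function name `cache_dir` and its two parameters are assumptions made for illustration; how morss itself detects server mode is not shown here.

```python
import os


def cache_dir(server_mode, morss_directory):
    """Return the cache folder for the given mode (illustrative only)."""
    if server_mode:
        # Server: MORSS_DIRECTORY/cache/
        return os.path.join(morss_directory, 'cache')
    # CLI or newsreader hook: $HOME/.cache/morss
    return os.path.join(os.path.expanduser('~'), '.cache', 'morss')
```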

##Configuration
###Length limitation

When parsing long feeds with a lot of items (100+), morss might take a lot of time to parse them, or might even run out of memory on some shared hosting plans (limits around 10 MB), in which case you might want to adjust the values at the top of the script.

- `MAX_TIME` sets the maximum amount of time spent *fetching* articles; more time might be spent taking older articles from cache. `-1` for unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited. More articles will be taken from cache, following the next settings.
- `LIM_TIME` sets the maximum amount of time spent working on the feed (whether or not it's already cached). Articles beyond that limit will be dropped from the feed. `-1` for unlimited.
- `LIM_ITEM` sets the maximum number of articles checked, limiting both the number of articles fetched and taken from cache. Articles beyond that limit will be dropped from the feed, even if they're cached. `-1` for unlimited.
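
One way to read how the four settings interact is sketched below. This is an interpretation of the descriptions above, not the code used by morss: the function `decide`, its return values, and the numbers in the usage note are all made up for the example.

```python
def decide(index, elapsed, max_time=-1, max_item=-1, lim_time=-1, lim_item=-1):
    """Say what happens to the article at position `index` (0-based)
    once `elapsed` seconds have been spent on the feed.

    Returns 'fetch', 'cache-only' or 'drop'. A limit of -1 means unlimited.
    """
    def over(value, limit):
        return limit >= 0 and value >= limit

    # Hard limits: the article is dropped, even if it is cached.
    if over(index, lim_item) or over(elapsed, lim_time):
        return 'drop'
    # Soft limits: no new download, but a cached copy may still be used.
    if over(index, max_item) or over(elapsed, max_time):
        return 'cache-only'
    return 'fetch'
```

For instance, with `max_item=5` and `lim_item=10`, the first five articles would be fetched, the next five taken from cache only, and anything after that dropped.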

###Content matching

The content of articles is grabbed with a [**readability** fork](https://github.com/buriy/python-readability). This means that most of the time the right content is matched. However, it sometimes fails, in which case some tweaking is required. Most of the time, what has to be done is to add some "rules" to the main script file of *readability* (not of morss).

Most of the time, when hardly anything is matched, it means that the main content of the article is made of images, videos, etc., which readability doesn't detect. Also, readability has some trouble matching the content of very small articles.

morss will also try to figure out whether the full content is already in place (for those websites which understood the whole point of RSS feeds). However, this detection is very simple, and only works if the actual content is put in the "content" section of the feed and not in the "summary" section.
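
A check of this kind could look like the sketch below, which assumes an Atom feed and uses the standard library's `xml.etree.ElementTree` rather than the lxml-based code in morss. The function name `has_full_content` is hypothetical, and the test is deliberately as naive as the description: it only looks for a non-empty `content` element and ignores `summary`.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'


def has_full_content(entry):
    """Return True if the Atom entry carries a non-empty <content> element."""
    content = entry.find(ATOM + 'content')
    if content is None:
        return False
    # itertext() also covers content that is nested in child elements.
    return bool(''.join(content.itertext()).strip())


entry = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<summary>Short teaser</summary>'
    '<content type="html">Full article text</content>'
    '</entry>'
)
full = has_full_content(entry)
```

An entry that only provides a `summary` element would not be recognised, which matches the limitation described above.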

***

##Todo

You can contribute to this project. If you're not sure what to do, you can pick from this list:

- Add ability to run morss.py as an update daemon