# Morss - Get full-text RSS feeds

_GNU AGPLv3 code_

_Provided logo is CC BY-NC-SA 4.0_

Upstream source code: https://git.pictuga.com/pictuga/morss

Github mirror (for Issues & Pull requests): https://github.com/pictuga/morss

Homepage: https://morss.it/

This tool's goal is to get full-text RSS feeds out of stripped RSS feeds, as
commonly found on the internet. Most newspapers only put a short description of
each article in their RSS feeds, which makes the feeds rather useless. This
tool intends to fix that problem: it opens the links from the RSS feed,
downloads the full article from the newspaper's website and puts it back in the
feed.

Morss also provides additional features, such as .csv and JSON export and
extended control over the output. A strength of morss is its ability to deal
with broken feeds, and to replace tracking links with direct links to the
actual content.

Morss can also generate feeds from html and json files (see `feeds.py`), which
for instance makes it possible to get feeds for Facebook or Twitter, using
hand-written rules (i.e. there's no automatic detection of links to build
feeds). Please mind that feeds based on html files may stop working
unexpectedly, due to html structure changes on the target website.

Additionally morss can detect rss feeds in html pages' `<meta>`.

You can use this program online for free at **[morss.it](https://morss.it/)**.

Some features of morss:

- Read RSS/Atom feeds
- Create RSS feeds from json/html pages
- Export feeds as RSS/JSON/CSV/HTML
- Fetch full-text content of feed items
- Follow 301/meta redirects
- Recover xml feeds with corrupt encoding
- Supports gzip-compressed http content
- HTTP caching with 3 different backends (in-memory/sqlite/mysql)
- Works as server/cli tool
- Deobfuscate various tracking links

## Install

### Python package

```shell
pip install git+https://git.pictuga.com/pictuga/morss.git
```

The dependency `lxml` takes quite a while to install (especially on a Raspberry
Pi, as C code needs to be compiled). If your distribution provides it, try
installing it with the system package manager.
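
For instance, on a Debian-based system (such as Raspberry Pi OS) something like
the following should work; the package name is an assumption and may differ on
other distributions:

```shell
# install lxml from the distribution's repositories instead of compiling it
sudo apt-get install python3-lxml
```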

Dependencies:

- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
- [lxml](http://lxml.de/) for xml parsing
- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
- [dateutil](http://labix.org/python-dateutil) to parse feed dates
- [chardet](https://pypi.python.org/pypi/chardet)
- [six](https://pypi.python.org/pypi/six), a dependency of chardet
- pymysql

You may also need:

- Apache, with python-cgi support, to run on a server
- a fast internet connection

### Docker

Build & run

```shell
docker build --tag morss https://git.pictuga.com/pictuga/morss.git --no-cache --pull
docker run -p 8080:8080 morss
```

With docker-compose:

```yml
services:
  app:
    build: https://git.pictuga.com/pictuga/morss.git
    image: morss
    ports:
      - '8080:8080'
```

Then execute

```shell
docker-compose build --no-cache --pull
docker-compose up
```

## Run

morss will auto-detect what "mode" to use.

### Running on/as a server

Set up the server as indicated below, then visit:

```
http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL
```

For example: `http://morss.example/:clip/https://twitter.com/pictuga`

*(Brackets indicate optional text)*

The `main.py` part is only needed if your server doesn't support the Apache
redirect rule set in the provided `.htaccess`.

Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rss/wiki),
and most probably other clients.

#### Via Docker

See above (in Install)

#### Using Gunicorn

```shell
gunicorn --preload morss
```
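
By default gunicorn binds to `127.0.0.1:8000`. To expose morss on another
address and port, a sketch along these lines should work (the bind address,
port and worker count below are only illustrative values):

```shell
# listen on all interfaces, port 8080, with a few worker processes
gunicorn --bind 0.0.0.0:8080 --workers 4 --preload morss
```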

#### Using uWSGI

Running this command should do:

```shell
uwsgi --http :8080 --plugin python --wsgi-file main.py
```

#### Using morss' internal HTTP server

Morss can run its own, **very basic**, HTTP server, meant mostly for debugging.
It starts when you run morss without any argument, and listens on port 8080. I
highly recommend using gunicorn or something similar for better performance.

```shell
morss
```

You can change the port with an environment variable, e.g. `PORT=9000 morss`.

#### Via mod_cgi/FastCGI with Apache/nginx

For this, you'll want to rearrange the files a bit, for example into something
like this.

```
/
├── cgi
│   │
│   ├── main.py
│   ├── morss
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── morss.py
│   │   └── ...
│   │
│   ├── dateutil
│   └── ...
├── .htaccess
├── index.html
└── ...
```

You also need to make sure your host allows python script execution. This
method uses HTTP calls to fetch the RSS feeds, which will be handled through
`mod_cgi` for example on Apache servers.

Please pay attention to `main.py` permissions so that it is executable (see the
example below). Also ensure that the provided `/www/.htaccess` works well with
your server.
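
For example, assuming the file layout above, making the CGI entry point
executable could look like this:

```shell
# make the CGI entry point executable so the web server can run it
chmod +x cgi/main.py
```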

### As a CLI application

Run:

```
morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL
```

For example: `morss --clip http://feeds.bbci.co.uk/news/rss.xml`

*(Brackets indicate optional text)*
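
Arguments can be combined. For instance, the following sketch (the output file
name is arbitrary) exports a feed as indented JSON and saves it to a file:

```shell
# export the feed as human-readable JSON and write it to a file
morss --format=json --indent http://feeds.bbci.co.uk/news/rss.xml > feed.json
```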

### As a newsreader hook

To use morss this way, you need a newsreader that can run a custom script on
top of an RSS feed and use its [output](http://lzone.de/liferea/scraping.htm)
as the feed, such as [Liferea](http://lzone.de/liferea/).

In Liferea, enable "(Unix) command" in the feed settings, and use the command:

```
morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL
```

For example: `morss http://feeds.bbci.co.uk/news/rss.xml`

*(Brackets indicate optional text)*

### As a python library

Quickly get a full-text feed:

```python
>>> import morss
>>> xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
>>> xml_string[:50]
"<?xml version='1.0' encoding='UTF-8'?>\n<?xml-style"
```

Using cache and passing arguments:

```python
>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
>>> cache = '/tmp/morss-cache.db'  # sqlite cache location
>>> options = {'csv': True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
'{"title": "BBC News - Home", "desc": "The latest s'
```

`morss.process` is actually a wrapper around simpler functions. It's still
possible to call those functions directly, to get more control over what
happens under the hood.

Doing it step-by-step:

```python
import morss, morss.crawler

url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True)  # arguments
morss.crawler.sqlite_default = '/tmp/morss-cache.db'  # sqlite cache location

url, rss = morss.FeedFetch(url, options)  # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options)  # this fills the feed and cleans it up
output = morss.FeedFormat(rss, options, 'unicode')  # formats the final feed
```

## Arguments and settings

### Arguments

morss accepts arguments to alter its output. Some arguments require a value
(usually a string or a number). How to pass these arguments to morss is
explained in the "Run" section above.

The list of arguments can be obtained by running `morss --help`

```
usage: morss [-h] [--format {rss,json,html,csv}] [--search STRING] [--clip]
             [--indent] [--cache] [--force] [--proxy] [--newest] [--firstlink]
             [--resolve] [--items XPATH] [--item_link XPATH]
             [--item_title XPATH] [--item_content XPATH] [--item_time XPATH]
             [--nolink] [--noref] [--silent]
             url

Get full-text RSS feeds

positional arguments:
  url                   feed url

optional arguments:
  -h, --help            show this help message and exit

output:
  --format {rss,json,html,csv}
                        output format
  --search STRING       does a basic case-sensitive search in the feed
  --clip                stick the full article content under the original feed
                        content (useful for twitter)
  --indent              returns indented XML or JSON, takes more space, but
                        human-readable

action:
  --cache               only take articles from the cache (ie. don't grab new
                        articles' content), so as to save time
  --force               force refetch the rss feed and articles
  --proxy               doesn't fill the articles
  --newest              return the feed items in chronological order (morss
                        otherwise shows the items by appearing order)
  --firstlink           pull the first article mentioned in the description
                        instead of the default link
  --resolve             replace tracking links with direct links to articles
                        (not compatible with --proxy)

custom feeds:
  --items XPATH         (mandatory to activate the custom feeds function)
                        xpath rule to match all the RSS entries
  --item_link XPATH     xpath rule relative to items to point to the entry's
                        link
  --item_title XPATH    entry's title
  --item_content XPATH  entry's content
  --item_time XPATH     entry's date & time (accepts a wide range of time
                        formats)

misc:
  --nolink              drop links, but keep links' inner text
  --noref               drop items' link
  --silent              don't output the final RSS (useless on its own, but
                        can be nice when debugging)

GNU AGPLv3 code
```

Further HTTP-only options:

- `callback=NAME`: for JSONP calls
- `cors`: allow Cross-origin resource sharing (allows XHR calls from other
servers)
- `txt`: changes the http content-type to txt (for faster "`view-source:`")
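
These options are passed in the URL like the other arguments. For example,
assuming a morss server at the placeholder hostname `morss.example`, a request
using the `cors` and `txt` options might look like this:

```shell
# ':cors:txt' enables the HTTP-only options described above
curl 'http://morss.example/:cors:txt/https://www.example.com/feed.xml'
```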

### Environment variables

To pass environment variables (see the examples after this list):

- Docker-cli: `docker run -p 8080:8080 --env KEY=value morss`
- docker-compose: add an `environment:` section in the .yml file
- Gunicorn/uWSGI/CLI: prepend `KEY=value` before the command
- Apache: via the `SetEnv` instruction (see sample `.htaccess` provided)
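
As an illustration (the variable names are described in the following sections,
the values are arbitrary):

```shell
# CLI / Gunicorn / uWSGI: prepend the variables to the command
DEBUG=1 PORT=9000 morss

# Docker CLI: pass them with --env before the image name
docker run -p 8080:8080 --env DEBUG=1 morss
```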

Generic:

- `DEBUG=1`: to have some feedback from the script execution. Useful for
debugging.
- `IGNORE_SSL=1`: to ignore SSL certs when fetching feeds and articles
- `DELAY` (seconds) sets the browser cache delay, only for HTTP clients
- `TIMEOUT` (seconds) sets the HTTP timeout when fetching rss feeds and articles

When parsing long feeds with a lot of items (100+), morss might take a long
time to parse them, or might even run into a memory overflow on some shared
hosting plans (limits around 10Mb), in which case you might want to adjust the
settings below via environment variables (an example follows the list).

Also, if the request takes too long to process, the http request might be
discarded. See the relevant config for
[gunicorn](https://docs.gunicorn.org/en/stable/settings.html#timeout) or
[nginx](http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout).

- `MAX_TIME` (seconds) sets the maximum amount of time spent *fetching*
articles; more time might be spent taking older articles from cache. `-1` for
unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited.
More articles will be taken from cache according to the following settings.
- `LIM_TIME` (seconds) sets the maximum amount of time spent working on the feed
(whether or not it's already cached). Articles beyond that limit will be dropped
from the feed. `-1` for unlimited.
- `LIM_ITEM` sets the maximum number of articles checked, limiting both the
number of articles fetched and taken from cache. Articles beyond that limit will
be dropped from the feed, even if they're cached. `-1` for unlimited.
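
For example, on a constrained shared-hosting plan, one might cap the work done
per request with something like this (the numbers are purely illustrative):

```shell
# fetch at most 30 new articles, spend at most 5 s fetching them,
# and never keep more than 100 items per feed
MAX_ITEM=30 MAX_TIME=5 LIM_ITEM=100 gunicorn --preload morss
```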

morss uses caching to make loading faster. There are 3 possible cache backends:

- `(nothing/default)`: a simple python in-memory dict-like object.
- `CACHE=sqlite`: sqlite3 cache. Default file location is in-memory (i.e. it
will be cleared every time the program is run). The path can be defined with
`SQLITE_PATH`.
- `CACHE=mysql`: MySQL cache. The connection can be defined with the following
environment variables: `MYSQL_USER`, `MYSQL_PWD`, `MYSQL_DB`, `MYSQL_HOST`

To limit the size of the cache:

- `CACHE_SIZE` sets the target number of items in the cache (further items will
be deleted but the cache might be temporarily bigger than that). Defaults to 1k
entries.
- `CACHE_LIFESPAN` (seconds) sets how often the cache must be trimmed (i.e. cut
down to the number of items set in `CACHE_SIZE`). Defaults to 1min.
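
For instance, to use a persistent sqlite cache trimmed down to 500 entries (the
path and size are only examples):

```shell
# persistent sqlite cache; CACHE_SIZE overrides the default of 1k entries
CACHE=sqlite SQLITE_PATH=/var/cache/morss.db CACHE_SIZE=500 gunicorn --preload morss
```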
  284. ### Content matching
  285. The content of articles is grabbed with our own readability fork. This means
  286. that most of the time the right content is matched. However sometimes it fails,
  287. therefore some tweaking is required. Most of the time, what has to be done is to
  288. add some "rules" in the main script file in `readabilite.py` (not in morss).
  289. Most of the time when hardly nothing is matched, it means that the main content
  290. of the article is made of images, videos, pictures, etc., which readability
  291. doesn't detect. Also, readability has some trouble to match content of very
  292. small articles.
  293. morss will also try to figure out whether the full content is already in place
  294. (for those websites which understood the whole point of RSS feeds). However this
  295. detection is very simple, and only works if the actual content is put in the
  296. "content" section in the feed and not in the "summary" section.