Use readability to fetch article content.

Makes the whole "xpath rules" thing useless. Almost any feed is now supported. The Liferea CSS stylesheets are also unneeded now, since readability cleans up the HTML code more efficiently. README was updated.
2013-04-19 11:37:43 +02:00 · 2013-04-19 11:37:43 +02:00 · 4abf7b699c
parent 437b0da8a9
commit 4abf7b699c
4 changed files with 13 additions and 131 deletions
--- a/README.md
+++ b/README.md
@@ -5,27 +5,12 @@ This tool opens the links from the rss feed, then downloads the full article fro

 morss also has experimental support for Atom feeds.

-##(xpath) Rules
-
-To find the article content on the newspaper's website, morss need to know where to look at. The default target is the first `<h1>` element, since it's a common practice, or a `<article>` element, for HTML5 compliant websites.
-
-However in some cases, these global rules are not working. Therefore custom xpath rules are needed. The proper way to input them to morss is detailed in the different use cases.
-
 ##Use cases
 ###Running on a server

 For this, you need to make sure your host allows python script execution. This method uses HTTP calls to fetch the RSS feeds, such as `http://DOMAIN/MORSS/morss.py/feeds.bbci.co.uk/news/rss.xml`. Therefore the python script has to be accessible by the HTTP server. With the `.htaccess` file provided, it's also possible, on APACHE servers, to access the filled feed at `http://DOMAIN/MORSS/feeds.bbci.co.uk/news/rss.xml` (without the `morss.py`).
 This will require you to set `SERVER` to `True` at the top of the script.

-Here, xpath rules stored in the `rules` file. (The name of the file can be changed in the script, in `class Feed`→`self.rulePath`. The file structure can be seen in the provided file. More details:
-
-	Fancy name (description)(useless but not optional)
-	http://example.com/path/to/the/rss/feed.xml
-	http://example.co.uk/other/*/path/with/wildcard/*.xml
-	//super/accurate[@xpath='expression']/..
-
-As shown in the example, multiple urls can be specified for a single rule, so as to be able to match feeds from different locations of the website server (for example with or without "www."). Moreover feeds urls can be *NIX glob-style patterns, so as to match any feed from a website.
-
 Works like a charm with Tiny Tiny RSS (<http://tt-rss.org/redmine/projects/tt-rss/wiki>).

 ###As a newsreader hook
@@ -34,12 +19,6 @@ To use it, the newsreader *Liferea* is required (unless other newsreaders provid

 To use this script, you have to enable "postprocessing filter" in liferea feed settings, and to add `PATH/TO/MORSS/morss` as command to run.

-For custom xpath rules, you have to add them in the command this way:
-
-	PATH/TO/MORSS/morss "//custom[@xpath]/rule"
-
-Quotes around the xpath rule are mandatory.
-
 ##Cache information

 morss uses a small cache directory to make the loading faster. Given the way it's designed, the cache doesn't need to be purged each while and then, unless you stop following a big amount of feeds. Only in the case of mass un-subscribing, you might want to delete the cache files corresponding to the bygone feeds. If morss is running as a server, the cache folder is at `MORSS_DIRECTORY/cache/`, and in `$HOME/.cache/morss` otherwise.
@@ -47,11 +26,15 @@ morss uses a small cache directory to make the loading faster. Given the way it'
 ##Extra configuration
 ###Length limitation

-When parsing long feeds, with a lot of items (100+), morss might take a lot of time to parse it, or might even run into a memory overflow on some shared hosting plans (limits around 10Mb), in which case you might want to adjust the `self.max` value in `class Feed`. That value is the maximum number of items to parse. `0` means parse all items.
+When parsing long feeds with a lot of items (100+), morss might take a long time, or might even run out of memory on some shared hosting plans (limits around 10 MB), in which case you might want to adjust the `MAX` value at the top of the script. That value is the maximum number of items to parse. `0` means parse all items.

-###Remove useless HTML elements
+###Content matching

-Unwanted HTML elements are also stripped from the article. By default, elements such as `<script>` and `<object>` are removed. Other elements can be specified, by adding them in the `self.trash` array in `class Feed`.
+The content of articles is grabbed with a **readability** fork (see <https://github.com/buriy/python-readability>). This means that the right content is matched most of the time. When it fails, some tweaking is required, which usually means adding some "rules" to the main script file of *readability* (not of morss).
+
+When hardly anything is matched, it usually means that the main content of the article is made of images, videos, etc., which readability doesn't detect. Readability also has trouble matching the content of very short articles.
+
+morss will also try to figure out whether the full content is already in place (for those websites which understood the whole point of RSS feeds). However, this detection is very simple, and only works if the actual content is put in the "content" section of the feed and not in the "summary" section.

 ---
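
The "full content already in place" check described in the new README text is not part of this diff. A minimal sketch of that kind of check is given here for illustration. It assumes an Atom entry and uses a hypothetical helper named `has_full_content`; neither is taken from morss itself.

```python
import xml.etree.ElementTree as ET

# Atom namespace prefix in ElementTree's "{uri}tag" notation
ATOM = '{http://www.w3.org/2005/Atom}'


def has_full_content(entry_xml):
    """Return True if an Atom <entry> carries a non-empty <content> element.

    Hypothetical helper: an entry that only has a <summary> is treated as
    not containing the full article.
    """
    entry = ET.fromstring(entry_xml)
    content = entry.find(ATOM + 'content')
    if content is None:
        return False
    # itertext() also collects text nested inside inline XHTML children
    return bool(''.join(content.itertext()).strip())
```

An entry for which this returns `False` would still need its article downloaded, which matches the README's remark that feeds using only the "summary" section are not detected.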

--- a/liferea.css
+++ b/liferea.css
@@ -1,23 +0,0 @@
-img
-{
-	max-width: 80%;
-	height: auto;
-}
-
-noscript,
-.bbx_container, /*TT*/
-.share-help, /*BBC*/
-div.video_container iframe, /*LM*/
-.story-info, .story-share.bluelinks, .story-content img:last-child, .pager, /*CI*/
-.story_tools_social_links, .new_reactions.box, .breadcrumbs, .page-jump /*lesoir*/
-{
-	display: none;
-}
-
-h2.txt15_140, /*LM*/
-h2.chapo, /*FranceInfo*/
-div.article-content h3 /*lesoir*/
-{
-	font-size: 1em;
-	font-weight: normal;
-}
--- a/morss.py
+++ b/morss.py
@@ -1,9 +1,7 @@
 #!/usr/bin/env python
 import sys
 import os
-import copy
 from base64 import b64encode, b64decode
-from fnmatch import fnmatch
 import os.path
 import lxml.etree
 import lxml.objectify
@@ -16,11 +14,10 @@ import urllib2
 from cookielib import CookieJar
 import chardet

-# DISCLAIMER: feedparser is pure shit if you intend to *edit* the feed.
+from readability import readability

 SERVER = True
 MAX = 70
-TRASH = ['//h1', '//header']
 E = lxml.objectify.E

 ITEM_MAP = {
@@ -244,18 +241,9 @@ def EncDownload(url):
 			log('chardet')
 			enc = chardet.detect(data)['encoding']

-	return (data, enc)
+	return (data, enc, con.geturl())

-def parseRules(rulePath, url):
-	rules = open(rulePath, "r").read().strip().split("\n\n")
-	rules = [r.split('\n') for r in rules]
-	for rule in rules:
-		for domain in rule[1:-1]:
-			if fnmatch(url, domain):
-				return rule[-1]
-	return '//article|//h1/..'
-
-def Fill(rss, rule, cache):
+def Fill(rss, cache):
 	item = XMLMap(rss, ITEM_MAP, True)
 	log(item.link)

@@ -286,30 +274,11 @@ def Fill(rss, rule, cache):
 	if ddl is False:
 		return item

-	data, enc = ddl
+	data, enc, url = ddl
 	log(enc)

-	# parse
-	parser = lxml.html.HTMLParser(encoding=enc)
-	page = lxml.etree.fromstring(data, parser)
+	out = readability.Document(data.decode(enc, 'ignore'), url=url).summary(True)

-	# filter
-	match =	page.xpath(rule)
-	if len(match):
-		art = match[0]
-		log('ok txt')
-	else:
-		log('no match')
-		return item
-
-	# clean
-	for tag in TRASH:
-		for elem in art.xpath(tag):
-			elem.getparent().remove(elem)
-
-	art.tag = 'div' # solves crash in lxml.html.clean
-	art = lxml.html.clean.clean_html(art)
-	out = lxml.etree.tostring(art, pretty_print=True).decode(enc, 'ignore')
 	item.content = out
 	cache.save(item.link, out)

@@ -329,22 +298,12 @@ def Gather(data, cachePath):

 	cache = Cache(cachePath, unicode(root.title))

-	# rules
-	if data.startswith("http"):
-		rule = parseRules('rules', url)
-	else:
-		if len(sys.argv) > 1:
-			rule = sys.argv[1]
-		else:
-			rule = '//article|//h1/..'
-
 	# set
-	log(rule)
 	if MAX:
 		for item in root.item[MAX:]:
 			item.getparent().remove(item)
 	for item in root.item:
-		Fill(item, rule, cache)
+		Fill(item, cache)

 	return root.tostring(xml_declaration=True, encoding='UTF-8')
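
The `MAX` handling kept in `Gather` trims the feed before any article is fetched, and a value of `0` disables the limit because of the `if MAX:` guard. The same logic can be sketched on a plain list. The helper name `limit_items` is hypothetical, and the real code removes lxml elements from the tree rather than slicing a list.

```python
MAX = 70  # same default as at the top of the script; 0 disables the limit


def limit_items(items, max_items=MAX):
    """Keep at most max_items entries; a falsy limit (0) keeps everything."""
    if max_items:
        return items[:max_items]
    return items[:]
```

For example, `limit_items(list(range(100)))` keeps the first 70 entries, while `limit_items(list(range(100)), 0)` keeps all of them.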

--- a/37
+++ b/37
@@ -1,37 +0,0 @@
-TehranTimes
-http://www.tehrantimes.com/*
-http://tehrantimes.com/*
-//div[@class='article-indent']
-
-FranceInfo
-http://www.franceinfo.fr/rss*
-//h2[@class='chapo']/..
-
-Les Echos
-http://rss.feedsportal.com/c/499/f/413829/index.rss
-http://syndication.lesechos.fr/rss/*
-//h1/../..
-
-Spiegel
-http://www.spiegel.de/schlagzeilen/*
-//div[@id='spArticleSection']
-
-Le Soir
-http://www.lesoir.be/feed/*
-//div[@class='article-content']
-
-Stack Overflow
-http://stackoverflow.com/feeds/*
-//*[@id='question']
-
-Daily Telegraph
-http://www.telegraph.co.uk/*
-//*[@id='mainBodyArea']
-
-Cracked.com
-http://feeds.feedburner.com/CrackedRSS
-//div[@class='content']|//section[@class='body']
-
-TheOnion
-http://feeds.theonion.com/*
-//article
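
For reference, the lookup that the removed `parseRules` function performed on this file can be sketched as follows. This is a simplified port, not the original code: it takes the rules as a string instead of a file path, the name `parse_rules` is an assumption, and the `RULES` sample reuses two entries from the deleted file.

```python
from fnmatch import fnmatch

# Fallback XPath used by the removed code when no rule matched
DEFAULT_RULE = '//article|//h1/..'


def parse_rules(rules_text, url):
    """Return the XPath of the first rule whose URL pattern matches url."""
    for block in rules_text.strip().split('\n\n'):
        lines = block.split('\n')
        # lines[0] is the display name, lines[-1] the XPath expression,
        # everything in between is a glob-style URL pattern
        for pattern in lines[1:-1]:
            if fnmatch(url, pattern):
                return lines[-1]
    return DEFAULT_RULE


RULES = """Spiegel
http://www.spiegel.de/schlagzeilen/*
//div[@id='spArticleSection']

TehranTimes
http://www.tehrantimes.com/*
http://tehrantimes.com/*
//div[@class='article-indent']"""
```

Any feed URL that matches none of the patterns falls back to `DEFAULT_RULE`, which is the case this commit makes unnecessary by delegating content extraction to readability.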