Move to OOP.
This is a huge commit. The whole code is ported to Object-Oriented Programming. This makes the code cleaner, which became required to deal with all the different cases, for example with encoding detection. Encoding detection now works better and uses 3 different methods. HTML pages with an XML declaration are now supported. Feed URLs with parameters (e.g. "index.php?option=par") are also supported. The cache is now smarter: it no longer grows indefinitely, since only in-use pages are kept in it. Caching is now mandatory. urllib (as opposed to urllib2) is no longer needed. Solved a possible crash in the log function (when passing a list of str with a non-unicode encoding). README is also updated.
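For reference, the three detection methods are tried in this order: the charset parameter of the HTTP `Content-Type` header, then the `<meta>` declarations in the HTML head, then a `chardet` guess. A minimal standalone sketch of that order, mirroring the new `Info.chardet()` method in the diff below (the `detect_encoding` helper name is only illustrative):

    # Sketch only: same detection order as Info.chardet() below.
    from lxml import etree
    import chardet

    def detect_encoding(con, data):
        # 1. charset parameter of the HTTP Content-Type header
        if con.headers.getparam('charset'):
            return con.headers.getparam('charset')
        # 2. <meta http-equiv='Content-Type'> or HTML5 <meta charset> in the page head
        page = etree.HTML(data)
        header = page.xpath("//head/meta[@http-equiv='Content-Type']/@content")
        if len(header) and len(header[0].split('=')) > 1:
            return header[0].split('=')[1]
        header = page.xpath("//head/meta[@charset]/@charset")
        if len(header):
            return header[0]
        # 3. statistical guess with the chardet library
        return chardet.detect(data)['encoding']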
parent c21af6d9a8
commit 82084c2c75
README.md | 43

README.md
@@ -1,18 +1,49 @@
#Morss
This tool's goal is to get full-text rss feeds out of stripped rss feeds, commonly available on the internet. Indeed most newspapers only make a small description available to users in their rss feeds, which makes the rss feed rather useless. So this tool intends to fix that problem.

This tool's goal is to get full-text RSS feeds out of stripped RSS feeds, commonly available on the internet. Indeed most newspapers only make a small description available to users in their RSS feeds, which makes the RSS feed rather useless. So this tool intends to fix that problem.

This tool opens the links from the rss feed, then downloads the full article from the newspaper's website and puts it back in the rss feed.
To use it, the rss reader *Liferea* is required (unless other rss readers provide the same kind of feature), since custom scripts can be run on top of the rss feed, using its output as an rss feed. (more: <http://lzone.de/liferea/scraping.htm>)

##(xpath) Rules
To use this script, you have to enable "postprocessing filter" in liferea feed settings, and to add the following line as the command to run:

To find the article content on the newspaper's website, morss needs to know where to look. The default target is the first `<h1>` element, since it's a common practice, or an `<article>` element for HTML5-compliant websites.
    morss "RULE"

However in some cases, these global rules do not work. Therefore custom xpath rules are needed. The proper way to pass them to morss is detailed in the different use cases.
And you have to replace **RULE** with a proper rule, which has to be a valid xpath instruction matching the main content of the website. Some rule examples are given in the "rules" file. You have to keep the " " around the rule. If the parameter is omitted, `//h1/..` is used instead. This default rule works on a lot of websites, since it's a common practice for search engine optimization.
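For illustration, this is roughly what applying such a rule looks like with lxml, the library the script uses (the `article.html` file name is just a placeholder for a page fetched from an item's link):

    # Sketch only: apply the default rule to a downloaded page with lxml.
    from lxml import etree

    data = open('article.html').read()         # placeholder: page downloaded from the item's <link>
    page = etree.HTML(data)
    match = page.xpath('//article|//h1/..')    # default rule: HTML5 <article>, or the parent of the first <h1>
    if len(match):
        print etree.tostring(match[0])         # this markup ends up in the item's <description>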
##Use cases

###Running on a server
Using this, rss refresh tends to be a bit slower, but caching helps a lot for frequent updates.

For this, you need to make sure your host allows python script execution. This method uses HTTP calls to fetch the RSS feeds, such as `http://DOMAIN/MORSS/morss.py/feeds.bbci.co.uk/news/rss.xml`. Therefore the python script has to be accessible by the HTTP server.

This will require you to set `SERVER` to `True` at the top of the script.
Here, xpath rules are stored in the `rules` file. (The name of the file can be changed in the script, in `class Feed`→`self.rulePath`.) The file structure can be seen in the provided file. More details:

    Fancy name (description)(useless but not optional)
    http://example.com/path/to/the/rss/feed.xml
    //super/accurate[@xpath='expression']/..

Works like a charm with Tiny Tiny RSS (<http://tt-rss.org/redmine/projects/tt-rss/wiki>).
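Each entry in the `rules` file is a blank-line-separated block of three lines: fancy name, feed URL, xpath rule. A minimal sketch of how the matching rule is looked up, mirroring `Feed.parseRules()` in the diff below (the `find_rule` helper name is only illustrative):

    # Sketch only: return the xpath rule registered for a given feed URL.
    def find_rule(rule_path, feed_url, default='//article|//h1/..'):
        for block in open(rule_path).read().split('\n\n'):
            lines = block.split('\n')          # [fancy name, feed url, xpath rule]
            if len(lines) >= 3 and lines[1] == feed_url:
                return lines[2]
        return default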
###As a newsreader hook

To use it, the newsreader *Liferea* is required (unless other newsreaders provide the same kind of feature), since custom scripts can be run on top of the RSS feed, using its output as an RSS feed. (more: <http://lzone.de/liferea/scraping.htm>)

To use this script, you have to enable "postprocessing filter" in Liferea's feed settings, and to add `PATH/TO/MORSS/morss` as the command to run.
For custom xpath rules, you have to add them to the command this way:

    PATH/TO/MORSS/morss "//custom[@xpath]/rule"

Quotes around the xpath rule are mandatory.
##Cache information

morss uses a small cache directory to make loading faster. Given the way it's designed, the cache doesn't need to be purged every now and then, unless you stop following a large number of feeds. Only in the case of mass unsubscribing might you want to delete the cache files corresponding to the bygone feeds. If morss is running as a server, the cache folder is at `MORSS_DIRECTORY/cache/`, and in `$HOME/.cache/morss` otherwise.
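As a rough illustration of which cache file belongs to which feed: the key is derived from the feed's channel title, as in `Feed.setCache()` in the diff below (the title string here is made up):

    # Sketch only: where a feed's cache file ends up (non-server mode, per the README).
    import os
    from os.path import expanduser

    cache_dir = expanduser('~') + '/.cache/morss'
    key = str(hash(u'Some Feed Title'))        # made-up channel title
    cache_file = os.path.join(cache_dir, key)  # e.g. ~/.cache/morss/<hash>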
##Extra configuration

When parsing long feeds with a lot of items (100+), morss might take a long time to parse them, or might even run into a memory overflow on some shared hosting plans (limits around 10MB), in which case you might want to adjust the `self.max` value in `class Feed`. That value is the maximum number of items to parse; `0` means parse all items.
---

GPL3 licence.

Python **2.6** required (not 3).

@@ -1,69 +1,227 @@
#! /usr/bin/env python2.7
#!/usr/bin/python
import sys
import os
from os.path import expanduser
from lxml import etree
import string
import urllib2
import urllib
from cookielib import CookieJar
import chardet

SERVER = True

if SERVER:
    import httplib
    httplib.HTTPConnection.debuglevel = 1

    import cgitb
    cgitb.enable()

def log(txt):
    if os.getenv('DEBUG', False):
    if not SERVER and os.getenv('DEBUG', False):
        print txt
    if SERVER:
        with open('morss.log', 'a') as file:
            if isinstance(txt, str):
                file.write(txt.encode('utf-8') + "\n")
def xmlclean(xml):
    table = string.maketrans('', '')
    return xml.translate(table, table[:32])

class Info:
    def __init__(self, item, feed):
        self.item = item
        self.feed = feed

node = sys.argv[1] if len(sys.argv) > 1 else "//h1/.."
        self.data = False
        self.page = False
        self.html = False
        self.con = False
        self.opener = False
        self.enc = False

xml = xmlclean(sys.stdin.read())
rss = etree.XML(xml)
items = rss.xpath('//item')
        self.link = self.item.findtext('link')
        self.desc = self.item.xpath('description')[0]

cache = expanduser("~") + "/.cache/morss"
if not os.path.exists(cache):
    os.makedirs(cache)

    def fetch(self):
        log(self.link)
        if not self.findCache():
            self.download()
            self.chardet()
            self.fetchDesc()
            self.save()
        log(self.enc)
for item in items:
    link = item.findtext('link').encode('utf-8')
    desc = item.xpath('description')[0]

    log(link)

    cached = cache + "/" + str(hash(link))
    log(cached)

    if os.path.exists(cached):
        log("cached")
        desc.text = open(cached, 'r').read()
    else:
    def parseHTML(self):
        if self.enc is False:
            self.page = etree.HTML(self.data)
        else:
            try:
                self.page = etree.HTML(self.data.decode(self.enc, 'ignore'))
            except ValueError:
                self.page = etree.HTML(self.data)

    def save(self):
        self.feed.save()

    def findCache(self):
        if self.feed.cache is not False:
            xpath = "//link[text()='" + self.link + "']/../description/text()"
            match = self.feed.cache.xpath(xpath)
            if len(match):
                log('cached')
                self.desc.text = match[0]
                return True
        return False

    def fetchDesc(self):
        self.parseHTML()
        match = self.page.xpath(self.feed.rule)
        if len(match):
            self.html = match[0]
            self.deleteTags()
            self.desc.text = etree.tostring(self.html).decode(self.enc, 'ignore')
            log('ok txt')
        else:
            log('no match')
    def download(self):
        try:
            cj = CookieJar()
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            data = opener.open(link).read()
            html = etree.HTML(data)
            match = html.xpath(node)

            if len(match):
                try:
                    text = etree.tostring(match[0])
                    log("ok txt")
                except etree.SerialisationError:
                    log('serialisation')
                    continue
                try:
                    desc.text = text
                    open(cached, 'w').write(text)
                except ValueError:
                    log('xml error')
            else:
                log("no match")

            self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            self.con = self.opener.open(self.link.encode('utf-8'))
            self.data = self.con.read()
        except (urllib2.HTTPError, urllib2.URLError) as error:
            log(error)
            log("http error")
            log('http error')

if not os.getenv('DEBUG', False):
    print etree.tostring(rss)
    def chardet(self):
        if self.con.headers.getparam('charset'):
            log('header')
            self.enc = self.con.headers.getparam('charset')
            return

        page = etree.HTML(self.data)
        header = page.xpath("//head/meta[@http-equiv='Content-Type']/@content")
        if len(header) and len(header[0].split("=")):
            log('meta')
            self.enc = header[0].split("=")[1]
            return

        header = page.xpath("//head/meta[@charset]/@charset")
        if len(header):
            log('meta2')
            self.enc = header[0]
            return

        log('chardet')
        self.enc = chardet.detect(self.data)['encoding']

    def deleteTags(self):
        for tag in self.feed.trash:
            for elem in self.html.xpath(tag):
                elem.getparent().remove(elem)
class Feed:
    def __init__(self, impl, data, cachePath):
        self.rulePath = 'rules'
        self.rule = '//article|//h1/..'

        self.trash = ['//script', '//iframe', '//object', '//noscript', '//form', '//h1']
        self.max = 70

        self.cachePath = cachePath
        self.cacheFile = False
        self.cache = False
        self.impl = impl

        self.items = []
        self.rss = False
        self.out = False

        if self.impl == 'server':
            self.url = data
            self.xml = False
        else:
            self.url = False
            self.xml = data

    def save(self):
        self.out = etree.tostring(self.rss, xml_declaration=True, pretty_print=True)
        open(self.cacheFile, 'w').write(self.out)

    def getData(self):
        if self.impl == 'server':
            req = urllib2.Request(self.url)
            req.add_unredirected_header('User-Agent', '')
            self.xml = urllib2.urlopen(req).read()
        self.cleanXml()

    def setCache(self):
        if self.cache is not False:
            return

        self.parse()
        key = str(hash(self.rss.xpath('//channel/title/text()')[0]))
        self.cacheFile = self.cachePath + "/" + key
        log(self.cacheFile)
        if not os.path.exists(self.cachePath):
            os.makedirs(self.cachePath)

        if os.path.exists(self.cacheFile):
            self.cache = etree.XML(open(self.cacheFile, 'r').read())

    def parse(self):
        if self.rss is not False:
            return

        self.rss = etree.XML(self.xml)

    def setItems(self):
        self.items = [Info(e, self) for e in self.rss.xpath('//item')]
        if self.max:
            self.items = self.items[:self.max]

    def fill(self):
        self.parseRules()
        log(self.rule)
        for item in self.items:
            item.fetch()

    def cleanXml(self):
        table = string.maketrans('', '')
        self.xml = self.xml.translate(table, table[:32]).lstrip()

    def parseRules(self):
        if self.impl == 'server':
            rules = open(self.rulePath, "r").read().split("\n\n")
            rules = [r.split('\n') for r in rules]
            for rule in rules:
                if rule[1] == self.url:
                    self.rule = rule[2]
                    return
        else:
            if len(sys.argv) > 1:
                self.rule = sys.argv[1]
if __name__ == "__main__":
    if SERVER:
        print 'Content-Type: text/html\n'
        url = os.environ['REQUEST_URI'][len(os.environ['SCRIPT_NAME'])+1:]
        url = 'http://' + url.replace(' ', '%20')
        log(url)
        RSS = Feed('server', url, os.getcwd() + '/cache')
    else:
        xml = sys.stdin.read()
        cache = expanduser('~') + '/.cache/morss'
        RSS = Feed('liferea', xml, os.getcwd() + '/cache')

    RSS.getData()
    RSS.parse()
    RSS.setCache()
    RSS.setItems()
    RSS.fill()
    RSS.save()

    if SERVER or not os.getenv('DEBUG', False):
        print RSS.out
    else:
        print 'done'