Compare commits: 0365232a73 ... ef6efd981c

23 Commits

Author | SHA1 | Date
---|---|---
pictuga | ef6efd981c |
pictuga | 70d6ee02d3 |
pictuga | 3c551d77bb |
pictuga | c3052a0619 |
pictuga | 268d43914f |
pictuga | 22c6e7aed8 |
pictuga | 1543895281 |
pictuga | e7bee5b7e2 |
pictuga | 7766d06b50 |
pictuga | 000a5cda7a |
pictuga | f2efd56e8f |
pictuga | 52e73331b8 |
pictuga | 046f3f9f3d |
pictuga | db8e046eae |
pictuga | b4b1e93289 |
pictuga | 8a329fbb6d |
pictuga | 77159b99ca |
pictuga | c158e65192 |
pictuga | 0d64964a02 |
pictuga | e8271ae9a0 |
pictuga | 2abe061422 |
pictuga | 57bd94d42f |
pictuga | 69cdf05341 |
New file: Drone CI pipeline (.drone.yml)

@@ -0,0 +1,15 @@
+kind: pipeline
+name: default
+
+steps:
+- name: isort
+  image: python:alpine
+  commands:
+    - pip install isort
+    - isort --check-only --diff .
+- name: pylint
+  image: alpine
+  commands:
+    - apk add --no-cache python3 py3-lxml py3-pip py3-wheel py3-pylint py3-enchant hunspell-en
+    - pip3 install --no-cache-dir .[full]
+    - pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
New file: .pylintrc

@@ -0,0 +1,50 @@
+[MASTER]
+ignore=CVS
+suggestion-mode=yes
+extension-pkg-allow-list=lxml.etree
+
+[MESSAGES CONTROL]
+disable=missing-function-docstring,
+    missing-class-docstring,
+    missing-module-docstring,
+    wrong-spelling-in-comment,
+
+[REPORTS]
+reports=yes
+score=yes
+
+[SPELLING]
+spelling-dict=en_GB
+spelling-ignore-words=morss
+
+[STRING]
+check-quote-consistency=yes
+check-str-concat-over-line-jumps=yes
+
+[VARIABLES]
+allow-global-unused-variables=no
+init-import=no
+
+[FORMAT]
+expected-line-ending-format=LF
+indent-string='    '
+max-line-length=120
+max-module-lines=1000
+
+[BASIC]
+argument-naming-style=snake_case
+attr-naming-style=snake_case
+class-attribute-naming-style=snake_case
+class-const-naming-style=UPPER_CASE
+class-naming-style=PascalCase
+const-naming-style=UPPER_CASE
+function-naming-style=snake_case
+inlinevar-naming-style=snake_case
+method-naming-style=snake_case
+module-naming-style=snake_case
+variable-naming-style=snake_case
+
+include-naming-hint=yes
+
+bad-names=foo, bar
+good-names=i, j, k
Dockerfile

@@ -3,6 +3,6 @@ FROM alpine:latest
 RUN apk add --no-cache python3 py3-lxml py3-pip py3-wheel git

 ADD . /app
-RUN pip3 install --no-cache-dir /app gunicorn
+RUN pip3 install --no-cache-dir /app[full] gunicorn

 CMD gunicorn --bind 0.0.0.0:8080 -w 4 --preload --access-logfile - morss
README.md (41 changed lines)

@@ -1,5 +1,7 @@
 # Morss - Get full-text RSS feeds

+[![Build Status](https://ci.pictuga.com/api/badges/pictuga/morss/status.svg)](https://ci.pictuga.com/pictuga/morss)
+
 _GNU AGPLv3 code_
 _Provided logo is CC BY-NC-SA 4.0_

@@ -46,28 +48,25 @@ Some features of morss:

 ### Python package

+Simple install (without optional dependencies)
+
 ```shell
 pip install git+https://git.pictuga.com/pictuga/morss.git
 ```

+Full installation (including optional dependencies)
+
+```shell
+pip install git+https://git.pictuga.com/pictuga/morss.git#[full]
+```
+
+The full install includes mysql and redis (possible cache backends). Otherwise,
+only in-memory and sqlite3 caches are available.
+
 The dependency `lxml` is fairly long to install (especially on Raspberry Pi, as
 C code needs to be compiled). If possible on your distribution, try installing
 it with the system package manager.

-Dependencies:
-
-- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
-- [lxml](http://lxml.de/) for xml parsing
-- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
-- [dateutil](http://labix.org/python-dateutil) to parse feed dates
-- [chardet](https://pypi.python.org/pypi/chardet)
-- [six](https://pypi.python.org/pypi/six), a dependency of chardet
-- pymysql
-
-You may also need:
-- Apache, with python-cgi support, to run on a server
-- a fast internet connection
-
 ### Docker

 Build & run

@@ -263,11 +262,12 @@ arguments to morss is explained in Run above.
 The list of arguments can be obtained by running `morss --help`

 ```
-usage: morss [-h] [--post STRING] [--format {rss,json,html,csv}]
-             [--search STRING] [--clip] [--indent] [--cache] [--force]
-             [--proxy] [--newest] [--firstlink] [--resolve] [--items XPATH]
-             [--item_link XPATH] [--item_title XPATH] [--item_content XPATH]
-             [--item_time XPATH] [--nolink] [--noref] [--silent]
+usage: morss [-h] [--post STRING] [--xpath XPATH]
+             [--format {rss,json,html,csv}] [--search STRING] [--clip]
+             [--indent] [--cache] [--force] [--proxy] [--newest] [--firstlink]
+             [--resolve] [--items XPATH] [--item_link XPATH]
+             [--item_title XPATH] [--item_content XPATH] [--item_time XPATH]
+             [--nolink] [--noref] [--silent]
              url

 Get full-text RSS feeds

@@ -278,6 +278,7 @@ positional arguments:
 optional arguments:
   -h, --help            show this help message and exit
   --post STRING         POST request
+  --xpath XPATH         xpath rule to manually detect the article

 output:
   --format {rss,json,html,csv}

@@ -373,6 +374,8 @@ will be cleared every time the program is run). Path can be defined with
   `SQLITE_PATH`.
 - `CACHE=mysql`: MySQL cache. Connection can be defined with the following
   environment variables: `MYSQL_USER`, `MYSQL_PWD`, `MYSQL_DB`, `MYSQL_HOST`
+- `CACHE=redis`: Redis cache. Connection can be defined with the following
+  environment variables: `REDIS_HOST`, `REDIS_PORT`, `REDIS_DB`, `REDIS_PWD`

 To limit the size of the cache:
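The cache backend described in the README hunk above is chosen purely from environment variables, which the new `morss/caching.py` (added further down in this diff) reads once at import time. A minimal sketch of selecting the Redis backend from Python, assuming a local Redis server and placeholder connection values:

```python
# Minimal sketch (assumptions: a Redis server on localhost, default port).
# morss reads CACHE and the REDIS_* variables at import time, so they must
# be set before the package is imported.
import os

os.environ['CACHE'] = 'redis'
os.environ['REDIS_HOST'] = 'localhost'   # placeholder values
os.environ['REDIS_PORT'] = '6379'
os.environ['REDIS_DB'] = '0'

import morss  # morss.caching now builds a RedisCacheHandler as default_cache
```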
morss/__init__.py

@@ -16,5 +16,8 @@
 # with this program. If not, see <https://www.gnu.org/licenses/>.

 # ran on `import morss`
+
+# pylint: disable=unused-import,unused-variable
+
 from .morss import *
 from .wsgi import application
morss/__main__.py

@@ -20,9 +20,7 @@
 import os
 import sys

-from . import wsgi
-from . import cli
+from . import cli, wsgi

 from .morss import MorssException

New file: morss/caching.py

@@ -0,0 +1,188 @@
+# This file is part of morss
+#
+# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
+#
+# This program is free software: you can redistribute it and/or modify it under
+# the terms of the GNU Affero General Public License as published by the Free
+# Software Foundation, either version 3 of the License, or (at your option) any
+# later version.
+#
+# This program is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
+# details.
+#
+# You should have received a copy of the GNU Affero General Public License along
+# with this program. If not, see <https://www.gnu.org/licenses/>.
+
+import os
+import pickle
+import threading
+import time
+from collections import OrderedDict
+
+CACHE_SIZE = int(os.getenv('CACHE_SIZE', 1000)) # max number of items in cache (default: 1k items)
+CACHE_LIFESPAN = int(os.getenv('CACHE_LIFESPAN', 60)) # how often to auto-clear the cache (default: 1min)
+
+
+class BaseCache:
+    """ Subclasses must behave like a dict """
+
+    def trim(self):
+        pass
+
+    def autotrim(self, delay=CACHE_LIFESPAN):
+        # trim the cache every so often
+
+        self.trim()
+
+        t = threading.Timer(delay, self.autotrim)
+        t.daemon = True
+        t.start()
+
+    def __contains__(self, url):
+        try:
+            self[url]
+
+        except KeyError:
+            return False
+
+        else:
+            return True
+
+
+try:
+    import sqlite3 # isort:skip
+except ImportError:
+    pass
+
+
+class SQLiteCache(BaseCache):
+    def __init__(self, filename=':memory:'):
+        self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
+
+        with self.con:
+            self.con.execute('CREATE TABLE IF NOT EXISTS data (ky UNICODE PRIMARY KEY, data BLOB, timestamp INT)')
+            self.con.execute('pragma journal_mode=WAL')
+
+        self.trim()
+
+    def __del__(self):
+        self.con.close()
+
+    def trim(self):
+        with self.con:
+            self.con.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET ? ) foo )', (CACHE_SIZE,))
+
+    def __getitem__(self, key):
+        row = self.con.execute('SELECT * FROM data WHERE ky=?', (key,)).fetchone()
+
+        if not row:
+            raise KeyError
+
+        return row[1]
+
+    def __setitem__(self, key, data):
+        with self.con:
+            self.con.execute('INSERT INTO data VALUES (?,?,?) ON CONFLICT(ky) DO UPDATE SET data=?, timestamp=?', (key, data, time.time(), data, time.time()))
+
+
+try:
+    import pymysql.cursors # isort:skip
+except ImportError:
+    pass
+
+
+class MySQLCacheHandler(BaseCache):
+    def __init__(self, user, password, database, host='localhost'):
+        self.user = user
+        self.password = password
+        self.database = database
+        self.host = host
+
+        with self.cursor() as cursor:
+            cursor.execute('CREATE TABLE IF NOT EXISTS data (ky VARCHAR(255) NOT NULL PRIMARY KEY, data MEDIUMBLOB, timestamp INT)')
+
+        self.trim()
+
+    def cursor(self):
+        return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
+
+    def trim(self):
+        with self.cursor() as cursor:
+            cursor.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET %s ) foo )', (CACHE_SIZE,))
+
+    def __getitem__(self, key):
+        cursor = self.cursor()
+        cursor.execute('SELECT * FROM data WHERE ky=%s', (key,))
+        row = cursor.fetchone()
+
+        if not row:
+            raise KeyError
+
+        return row[1]
+
+    def __setitem__(self, key, data):
+        with self.cursor() as cursor:
+            cursor.execute('INSERT INTO data VALUES (%s,%s,%s) ON DUPLICATE KEY UPDATE data=%s, timestamp=%s',
+                (key, data, time.time(), data, time.time()))
+
+
+class CappedDict(OrderedDict, BaseCache):
+    def trim(self):
+        if CACHE_SIZE >= 0:
+            for i in range( max( len(self) - CACHE_SIZE , 0 )):
+                self.popitem(False)
+
+    def __setitem__(self, key, data):
+        # https://docs.python.org/2/library/collections.html#ordereddict-examples-and-recipes
+        if key in self:
+            del self[key]
+        OrderedDict.__setitem__(self, key, data)
+
+
+try:
+    import redis # isort:skip
+except ImportError:
+    pass
+
+
+class RedisCacheHandler(BaseCache):
+    def __init__(self, host='localhost', port=6379, db=0, password=None):
+        self.r = redis.Redis(host=host, port=port, db=db, password=password)
+
+    def __getitem__(self, key):
+        return self.r.get(key)
+
+    def __setitem__(self, key, data):
+        self.r.set(key, data)
+
+
+if 'CACHE' in os.environ:
+    if os.environ['CACHE'] == 'mysql':
+        default_cache = MySQLCacheHandler(
+            user = os.getenv('MYSQL_USER'),
+            password = os.getenv('MYSQL_PWD'),
+            database = os.getenv('MYSQL_DB'),
+            host = os.getenv('MYSQL_HOST', 'localhost')
+        )
+
+    elif os.environ['CACHE'] == 'sqlite':
+        if 'SQLITE_PATH' in os.environ:
+            path = os.getenv('SQLITE_PATH')
+
+        else:
+            path = ':memory:'
+
+        default_cache = SQLiteCache(path)
+
+    elif os.environ['CACHE'] == 'redis':
+        default_cache = RedisCacheHandler(
+            host = os.getenv('REDIS_HOST', 'localhost'),
+            port = int(os.getenv('REDIS_PORT', 6379)),
+            db = int(os.getenv('REDIS_DB', 0)),
+            password = os.getenv('REDIS_PWD', None)
+        )

+else:
+    default_cache = CappedDict()
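Every backend in this new module behaves like a dict, as the `BaseCache` docstring says: the subclasses provide `__getitem__`/`__setitem__`, and the base class adds `__contains__`, `trim()` and `autotrim()`, which is all that `CacheHandler` in `morss/crawler.py` relies on. A quick sketch of that contract (the file path is an arbitrary example, not something morss mandates):

```python
# Quick sketch of the dict-like cache contract (example path, example values).
from morss.caching import SQLiteCache

cache = SQLiteCache('/tmp/morss-cache.db')  # ':memory:' is the default

cache['http://example.com/feed'] = b'pickled response bytes'   # __setitem__
print('http://example.com/feed' in cache)                      # True, via __contains__
print(cache['http://example.com/feed'])                        # __getitem__, raises KeyError if absent

cache.trim()       # drop the oldest rows beyond CACHE_SIZE
cache.autotrim()   # same, repeated every CACHE_LIFESPAN seconds in a timer thread
```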
morss/cli.py

@@ -15,12 +15,11 @@
 # You should have received a copy of the GNU Affero General Public License along
 # with this program. If not, see <https://www.gnu.org/licenses/>.

-import sys
-import os.path
 import argparse
+import os.path
+import sys

-from .morss import FeedFetch, FeedGather, FeedFormat
-from .morss import Options
+from .morss import FeedFetch, FeedFormat, FeedGather, Options


 def cli_app():

@@ -33,6 +32,7 @@ def cli_app():
     parser.add_argument('url', help='feed url')

     parser.add_argument('--post', action='store', type=str, metavar='STRING', help='POST request')
+    parser.add_argument('--xpath', action='store', type=str, metavar='XPATH', help='xpath rule to manually detect the article')

     group = parser.add_argument_group('output')
     group.add_argument('--format', default='rss', choices=('rss', 'json', 'html', 'csv'), help='output format')
morss/crawler.py (250 changed lines)

@@ -16,30 +16,36 @@
 # with this program. If not, see <https://www.gnu.org/licenses/>.

 import os
-import sys
-import zlib
-from io import BytesIO, StringIO
-import re
-import chardet
-from cgi import parse_header
-import time
-import threading
+import pickle
 import random
+import re
+import sys
+import time
+import zlib
+from cgi import parse_header
 from collections import OrderedDict
+from io import BytesIO, StringIO
+
+import chardet
+
+from .caching import default_cache

 try:
     # python 2
-    from urllib2 import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
     from urllib import quote
-    from urlparse import urlparse, urlunparse
     import mimetools
+    from urllib2 import (BaseHandler, HTTPCookieProcessor, HTTPRedirectHandler,
+                         Request, addinfourl, build_opener, parse_http_list,
+                         parse_keqv_list)
+    from urlparse import urlparse, urlunparse
 except ImportError:
     # python 3
-    from urllib.request import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
-    from urllib.parse import quote
-    from urllib.parse import urlparse, urlunparse
     import email
+    from urllib.parse import quote, urlparse, urlunparse
+    from urllib.request import (BaseHandler, HTTPCookieProcessor,
+                                HTTPRedirectHandler, Request, addinfourl,
+                                build_opener, parse_http_list, parse_keqv_list)

 try:
     # python 2
@@ -49,10 +55,6 @@ except NameError:
     basestring = unicode = str


-CACHE_SIZE = int(os.getenv('CACHE_SIZE', 1000)) # max number of items in cache (default: 1k items)
-CACHE_LIFESPAN = int(os.getenv('CACHE_LIFESPAN', 60)) # how often to auto-clear the cache (default: 1min)
-
-
 MIMETYPE = {
     'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
     'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
@@ -131,6 +133,7 @@ def custom_opener(follow=None, delay=None):
     handlers.append(SizeLimitHandler(500*1024)) # 500KiB
     handlers.append(HTTPCookieProcessor())
     handlers.append(GZIPHandler())
+    handlers.append(HTTPAllRedirectHandler())
     handlers.append(HTTPEquivHandler())
     handlers.append(HTTPRefreshHandler())
     handlers.append(UAHandler(random.choice(DEFAULT_UAS)))
@@ -397,6 +400,11 @@ class HTTPEquivHandler(RespStrHandler):
             resp.headers[meta.get('http-equiv').lower()] = meta.get('content')


+class HTTPAllRedirectHandler(HTTPRedirectHandler):
+    def http_error_308(self, req, fp, code, msg, headers):
+        return self.http_error_301(req, fp, 301, msg, headers)
+
+
 class HTTPRefreshHandler(BaseHandler):
     handler_order = 700 # HTTPErrorProcessor has a handler_order of 1000

@@ -447,37 +455,46 @@ class CacheHandler(BaseHandler):

     def load(self, url):
         try:
-            out = list(self.cache[url])
+            data = pickle.loads(self.cache[url])

         except KeyError:
-            out = [None, None, unicode(), bytes(), 0]
+            data = None

-        if sys.version_info[0] >= 3:
-            out[2] = email.message_from_string(out[2] or unicode()) # headers
         else:
-            out[2] = mimetools.Message(StringIO(out[2] or unicode()))
+            if sys.version_info[0] >= 3:
+                data['headers'] = email.message_from_string(data['headers'] or unicode()) # headers
+            else:
+                data['headers'] = mimetools.Message(StringIO(data['headers'] or unicode()))

-        return out
+        return data

-    def save(self, url, code, msg, headers, data, timestamp):
-        self.cache[url] = (code, msg, unicode(headers), data, timestamp)
+    def save(self, key, data):
+        data['headers'] = unicode(data['headers'])
+        self.cache[key] = pickle.dumps(data, 0)

-    def is_cached(self, url):
-        return self.load(url)[0] is not None
+    def is_cached(self, key):
+        return self.load(key) is not None

     def cached_response(self, req):
         # this does NOT check whether it's already cached, use with care
-        (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
+        data = self.load(req.get_full_url())

         # return the cache as a response
-        resp = addinfourl(BytesIO(data), headers, req.get_full_url(), code)
-        resp.msg = msg
+        resp = addinfourl(BytesIO(data['data']), data['headers'], req.get_full_url(), data['code'])
+        resp.msg = data['msg']

         return resp

     def save_response(self, req, resp):
         data = resp.read()

-        self.save(req.get_full_url(), resp.code, resp.msg, resp.headers, data, time.time())
+        self.save(req.get_full_url(), {
+            'code': resp.code,
+            'msg': resp.msg,
+            'headers': resp.headers,
+            'data': data,
+            'timestamp': time.time()
+            })

         fp = BytesIO(data)
         old_resp = resp
@@ -487,13 +504,14 @@ class CacheHandler(BaseHandler):
         return resp

     def http_request(self, req):
-        (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
+        data = self.load(req.get_full_url())

-        if 'etag' in headers:
-            req.add_unredirected_header('If-None-Match', headers['etag'])
+        if data is not None:
+            if 'etag' in data['headers']:
+                req.add_unredirected_header('If-None-Match', data['headers']['etag'])

-        if 'last-modified' in headers:
-            req.add_unredirected_header('If-Modified-Since', headers.get('last-modified'))
+            if 'last-modified' in data['headers']:
+                req.add_unredirected_header('If-Modified-Since', data['headers']['last-modified'])

         return req

@@ -502,33 +520,33 @@ class CacheHandler(BaseHandler):
         # If 'None' is returned, try your chance with the next-available handler
         # If a 'resp' is returned, stop there, and proceed with 'http_response'

-        (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
+        data = self.load(req.get_full_url())
+
+        if data is None:
+            # cache empty, refresh
+            return None

         # some info needed to process everything
-        cache_control = parse_http_list(headers.get('cache-control', ()))
-        cache_control += parse_http_list(headers.get('pragma', ()))
+        cache_control = parse_http_list(data['headers'].get('cache-control', ()))
+        cache_control += parse_http_list(data['headers'].get('pragma', ()))

         cc_list = [x for x in cache_control if '=' not in x]
         cc_values = parse_keqv_list([x for x in cache_control if '=' in x])

-        cache_age = time.time() - timestamp
+        cache_age = time.time() - data['timestamp']

         # list in a simple way what to do when
         if self.force_min == -2:
-            if code is not None:
+            if data['code'] is not None:
                 # already in cache, perfect, use cache
                 return self.cached_response(req)

             else:
                 # raise an error, via urllib handlers
-                resp = addinfourl(BytesIO(), headers, req.get_full_url(), 409)
+                resp = addinfourl(BytesIO(), data['headers'], req.get_full_url(), 409)
                 resp.msg = 'Conflict'
                 return resp

-        elif code is None:
-            # cache empty, refresh
-            return None
-
         elif self.force_min == -1:
             # force use cache
             return self.cached_response(req)
@@ -537,7 +555,7 @@ class CacheHandler(BaseHandler):
             # force refresh
             return None

-        elif code == 301 and cache_age < 7*24*3600:
+        elif data['code'] == 301 and cache_age < 7*24*3600:
             # "301 Moved Permanently" has to be cached...as long as we want
             # (awesome HTTP specs), let's say a week (why not?). Use force_min=0
             # if you want to bypass this (needed for a proper refresh)
@@ -594,142 +612,6 @@ class CacheHandler(BaseHandler):
     https_response = http_response


-class BaseCache:
-    """ Subclasses must behave like a dict """
-
-    def trim(self):
-        pass
-
-    def autotrim(self, delay=CACHE_LIFESPAN):
-        # trim the cache every so often
-
-        self.trim()
-
-        t = threading.Timer(delay, self.autotrim)
-        t.daemon = True
-        t.start()
-
-    def __contains__(self, url):
-        try:
-            self[url]
-
-        except KeyError:
-            return False
-
-        else:
-            return True
-
-
-import sqlite3
-
-
-class SQLiteCache(BaseCache):
-    def __init__(self, filename=':memory:'):
-        self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
-
-        with self.con:
-            self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
-            self.con.execute('pragma journal_mode=WAL')
-
-        self.trim()
-
-    def __del__(self):
-        self.con.close()
-
-    def trim(self):
-        with self.con:
-            self.con.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET ? ) foo )', (CACHE_SIZE,))
-
-    def __getitem__(self, url):
-        row = self.con.execute('SELECT * FROM data WHERE url=?', (url,)).fetchone()
-
-        if not row:
-            raise KeyError
-
-        return row[1:]
-
-    def __setitem__(self, url, value): # value = (code, msg, headers, data, timestamp)
-        value = list(value)
-        value[3] = sqlite3.Binary(value[3]) # data
-        value = tuple(value)
-
-        with self.con:
-            self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?) ON CONFLICT(url) DO UPDATE SET code=?, msg=?, headers=?, data=?, timestamp=?', (url,) + value + value)
-
-
-import pymysql.cursors
-
-
-class MySQLCacheHandler(BaseCache):
-    def __init__(self, user, password, database, host='localhost'):
-        self.user = user
-        self.password = password
-        self.database = database
-        self.host = host
-
-        with self.cursor() as cursor:
-            cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
-
-        self.trim()
-
-    def cursor(self):
-        return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
-
-    def trim(self):
-        with self.cursor() as cursor:
-            cursor.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET %s ) foo )', (CACHE_SIZE,))
-
-    def __getitem__(self, url):
-        cursor = self.cursor()
-        cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
-        row = cursor.fetchone()
-
-        if not row:
-            raise KeyError
-
-        return row[1:]
-
-    def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
-        with self.cursor() as cursor:
-            cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE code=%s, msg=%s, headers=%s, data=%s, timestamp=%s',
-                (url,) + value + value)
-
-
-class CappedDict(OrderedDict, BaseCache):
-    def trim(self):
-        if CACHE_SIZE >= 0:
-            for i in range( max( len(self) - CACHE_SIZE , 0 )):
-                self.popitem(False)
-
-    def __setitem__(self, key, value):
-        # https://docs.python.org/2/library/collections.html#ordereddict-examples-and-recipes
-        if key in self:
-            del self[key]
-        OrderedDict.__setitem__(self, key, value)
-
-
-if 'CACHE' in os.environ:
-    if os.environ['CACHE'] == 'mysql':
-        default_cache = MySQLCacheHandler(
-            user = os.getenv('MYSQL_USER'),
-            password = os.getenv('MYSQL_PWD'),
-            database = os.getenv('MYSQL_DB'),
-            host = os.getenv('MYSQL_HOST', 'localhost')
-        )
-
-    elif os.environ['CACHE'] == 'sqlite':
-        if 'SQLITE_PATH' in os.environ:
-            path = os.getenv('SQLITE_PATH')
-
-        else:
-            path = ':memory:'
-
-        default_cache = SQLiteCache(path)
-
-else:
-    default_cache = CappedDict()
-
-
 if 'IGNORE_SSL' in os.environ:
     import ssl
     ssl._create_default_https_context = ssl._create_unverified_context
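Taken together, the crawler changes above replace the fixed `(code, msg, headers, data, timestamp)` tuple with a pickled dict, so the cached-entry format can grow without touching the storage backends (which now only see an opaque blob keyed by URL). A rough sketch of the round-trip that the new `save()` and `load()` perform, with placeholder values:

```python
# Rough sketch of the new cache entry format (placeholder values only).
import pickle
import time

entry = {
    'code': 200,
    'msg': 'OK',
    'headers': 'Content-Type: text/html\r\n',  # stored as a string, rebuilt into a message object on load
    'data': b'<html>...</html>',
    'timestamp': time.time(),
}

blob = pickle.dumps(entry, 0)     # what save() writes into default_cache[url]
restored = pickle.loads(blob)     # what load() gets back before rebuilding 'headers'
assert restored['code'] == 200
```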
morss/feeds.py

@@ -15,35 +15,35 @@
 # You should have received a copy of the GNU Affero General Public License along
 # with this program. If not, see <https://www.gnu.org/licenses/>.

-import sys
 import os.path
+import sys

-from datetime import datetime
+sys.path.append('/home/paul/Documents/Code/morss/lib')

-import re
-import json
 import csv
+import json
+import re
+from copy import deepcopy
+from datetime import datetime
 from fnmatch import fnmatch

-from lxml import etree
-from dateutil import tz
 import dateutil.parser
-from copy import deepcopy

 import lxml.html
+from dateutil import tz
+from lxml import etree

 from .readabilite import parse as html_parse

 json.encoder.c_make_encoder = None

 try:
     # python 2
-    from StringIO import StringIO
     from ConfigParser import RawConfigParser
+    from StringIO import StringIO
 except ImportError:
     # python 3
-    from io import StringIO
     from configparser import RawConfigParser
+    from io import StringIO

 try:
     # python 2
morss/morss.py

@@ -16,30 +16,25 @@
 # with this program. If not, see <https://www.gnu.org/licenses/>.

 import os
+import re
 import time
 from datetime import datetime
-from dateutil import tz

 from fnmatch import fnmatch
-import re

 import lxml.etree
 import lxml.html
+from dateutil import tz

-from . import feeds
-from . import crawler
-from . import readabilite
+from . import caching, crawler, feeds, readabilite


 try:
     # python 2
     from httplib import HTTPException
-    from urlparse import urlparse, urljoin, parse_qs
+    from urlparse import parse_qs, urljoin, urlparse
 except ImportError:
     # python 3
     from http.client import HTTPException
-    from urllib.parse import urlparse, urljoin, parse_qs
+    from urllib.parse import parse_qs, urljoin, urlparse


 MAX_ITEM = int(os.getenv('MAX_ITEM', 5)) # cache-only beyond
@@ -91,12 +86,12 @@ class Options:
         else:
             self.options = options or {}

-    def __getattr__(self, key):
+    def __getattr__(self, key, default=None):
         if key in self.options:
             return self.options[key]

         else:
-            return None
+            return default

     def __setitem__(self, key, value):
         self.options[key] = value
@@ -104,12 +99,7 @@ class Options:
     def __contains__(self, key):
         return key in self.options

-    def get(self, key, default=None):
-        if key in self.options:
-            return self.options[key]
-
-        else:
-            return default
+    get = __getitem__ = __getattr__


 def ItemFix(item, options, feedurl='/'):
@@ -228,7 +218,11 @@ def ItemFill(item, options, feedurl='/', fast=False):
         log('non-text page')
         return True

-    out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
+    if not req['data']:
+        log('empty page')
+        return True
+
+    out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode', xpath=options.xpath)

     if out is not None:
         item.content = out
@@ -417,7 +411,7 @@ def process(url, cache=None, options=None):
     options = Options(options)

     if cache:
-        crawler.default_cache = crawler.SQLiteCache(cache)
+        caching.default_cache = caching.SQLiteCache(cache)

     url, rss = FeedFetch(url, options)
     rss = FeedGather(rss, url, options)
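The two `Options` hunks above collapse `.get()`, item access and attribute access into a single `__getattr__` with an optional default. A small sketch of the resulting behaviour (the option values are made-up examples):

```python
# Small sketch of the unified Options accessors (example values only):
# attribute access, item access and .get() all funnel through __getattr__.
from morss.morss import Options

opts = Options({'xpath': '//article'})
assert opts.xpath == opts['xpath'] == opts.get('xpath') == '//article'
assert opts.get('missing', 'fallback') == 'fallback'  # default only via .get()
assert opts.missing is None                           # absent keys fall back to None
```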
|
@ -15,10 +15,11 @@
|
||||||
# You should have received a copy of the GNU Affero General Public License along
|
# You should have received a copy of the GNU Affero General Public License along
|
||||||
# with this program. If not, see <https://www.gnu.org/licenses/>.
|
# with this program. If not, see <https://www.gnu.org/licenses/>.
|
||||||
|
|
||||||
|
import re
|
||||||
|
|
||||||
import lxml.etree
|
import lxml.etree
|
||||||
import lxml.html
|
import lxml.html
|
||||||
from bs4 import BeautifulSoup
|
from bs4 import BeautifulSoup
|
||||||
import re
|
|
||||||
|
|
||||||
|
|
||||||
def parse(data, encoding=None):
|
def parse(data, encoding=None):
|
||||||
|
@ -210,7 +211,7 @@ def clean_node(node, keep_threshold=None):
|
||||||
return
|
return
|
||||||
|
|
||||||
# high score, so keep
|
# high score, so keep
|
||||||
if keep_threshold is not None and get_score(node) >= keep_threshold:
|
if keep_threshold is not None and keep_threshold > 0 and get_score(node) >= keep_threshold:
|
||||||
return
|
return
|
||||||
|
|
||||||
gdparent = parent.getparent()
|
gdparent = parent.getparent()
|
||||||
|
@ -293,28 +294,26 @@ def clean_node(node, keep_threshold=None):
|
||||||
gdparent.insert(gdparent.index(parent)+1, new_node)
|
gdparent.insert(gdparent.index(parent)+1, new_node)
|
||||||
|
|
||||||
|
|
||||||
def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
|
def lowest_common_ancestor(node_a, node_b, max_depth=None):
|
||||||
ancestorsA = list(nodeA.iterancestors())
|
ancestors_a = list(node_a.iterancestors())
|
||||||
ancestorsB = list(nodeB.iterancestors())
|
ancestors_b = list(node_b.iterancestors())
|
||||||
|
|
||||||
if max_depth is not None:
|
if max_depth is not None:
|
||||||
ancestorsA = ancestorsA[:max_depth]
|
ancestors_a = ancestors_a[:max_depth]
|
||||||
ancestorsB = ancestorsB[:max_depth]
|
ancestors_b = ancestors_b[:max_depth]
|
||||||
|
|
||||||
ancestorsA.insert(0, nodeA)
|
ancestors_a.insert(0, node_a)
|
||||||
ancestorsB.insert(0, nodeB)
|
ancestors_b.insert(0, node_b)
|
||||||
|
|
||||||
for ancestorA in ancestorsA:
|
for ancestor_a in ancestors_a:
|
||||||
if ancestorA in ancestorsB:
|
if ancestor_a in ancestors_b:
|
||||||
return ancestorA
|
return ancestor_a
|
||||||
|
|
||||||
return nodeA # should always find one tho, at least <html/>, but needed for max_depth
|
return node_a # should always find one tho, at least <html/>, but needed for max_depth
|
||||||
|
|
||||||
|
|
||||||
def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5):
|
def get_best_node(html, threshold=5):
|
||||||
" Input a raw html string, returns a raw html string of the article "
|
# score all nodes
|
||||||
|
|
||||||
html = parse(data, encoding_in)
|
|
||||||
score_all(html)
|
score_all(html)
|
||||||
|
|
||||||
# rank all nodes (largest to smallest)
|
# rank all nodes (largest to smallest)
|
||||||
|
@ -331,9 +330,29 @@ def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=
|
||||||
else:
|
else:
|
||||||
best = ranked_nodes[0]
|
best = ranked_nodes[0]
|
||||||
|
|
||||||
|
return best
|
||||||
|
|
||||||
|
|
||||||
|
def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5, xpath=None):
|
||||||
|
" Input a raw html string, returns a raw html string of the article "
|
||||||
|
|
||||||
|
html = parse(data, encoding_in)
|
||||||
|
|
||||||
|
if xpath is not None:
|
||||||
|
xpath_match = html.xpath(xpath)
|
||||||
|
|
||||||
|
if len(xpath_match):
|
||||||
|
best = xpath_match[0]
|
||||||
|
|
||||||
|
else:
|
||||||
|
best = get_best_node(html, threshold)
|
||||||
|
|
||||||
|
else:
|
||||||
|
best = get_best_node(html, threshold)
|
||||||
|
|
||||||
# clean up
|
# clean up
|
||||||
if not debug:
|
if not debug:
|
||||||
keep_threshold = get_score(ranked_nodes[0]) * 3/4
|
keep_threshold = get_score(best) * 3/4
|
||||||
clean_root(best, keep_threshold)
|
clean_root(best, keep_threshold)
|
||||||
|
|
||||||
# check for spammy content (links only)
|
# check for spammy content (links only)
|
||||||
|
@ -352,6 +371,7 @@ def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
from . import crawler
|
from . import crawler
|
||||||
|
|
||||||
req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
|
req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
|
||||||
|
|
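With the new `xpath` parameter, `get_article()` tries the caller-supplied rule first and only falls back to the `get_best_node()` scoring when the rule matches nothing. A short sketch of calling it directly (the URL and xpath rule are placeholder examples):

```python
# Short sketch: extract an article with a manual xpath rule, falling back to
# automatic scoring when the rule does not match (URL and rule are examples).
from morss import crawler, readabilite

req = crawler.adv_get('https://morss.it')  # any html page
article = readabilite.get_article(
    req['data'],
    url=req['url'],
    encoding_in=req['encoding'],
    encoding_out='unicode',
    xpath='//article',   # same rule the new --xpath CLI flag passes through
)
print(article)
```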
|
@ -15,16 +15,16 @@
|
||||||
# You should have received a copy of the GNU Affero General Public License along
|
# You should have received a copy of the GNU Affero General Public License along
|
||||||
# with this program. If not, see <https://www.gnu.org/licenses/>.
|
# with this program. If not, see <https://www.gnu.org/licenses/>.
|
||||||
|
|
||||||
import sys
|
import cgitb
|
||||||
|
import mimetypes
|
||||||
import os.path
|
import os.path
|
||||||
import re
|
import re
|
||||||
import lxml.etree
|
import sys
|
||||||
|
|
||||||
import cgitb
|
|
||||||
import wsgiref.util
|
|
||||||
import wsgiref.simple_server
|
|
||||||
import wsgiref.handlers
|
import wsgiref.handlers
|
||||||
import mimetypes
|
import wsgiref.simple_server
|
||||||
|
import wsgiref.util
|
||||||
|
|
||||||
|
import lxml.etree
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# python 2
|
# python 2
|
||||||
|
@ -33,11 +33,9 @@ except ImportError:
|
||||||
# python 3
|
# python 3
|
||||||
from urllib.parse import unquote
|
from urllib.parse import unquote
|
||||||
|
|
||||||
from . import crawler
|
from . import caching, crawler, readabilite
|
||||||
from . import readabilite
|
from .morss import (DELAY, TIMEOUT, FeedFetch, FeedFormat, FeedGather,
|
||||||
from .morss import FeedFetch, FeedGather, FeedFormat
|
MorssException, Options, log)
|
||||||
from .morss import Options, log, TIMEOUT, DELAY, MorssException
|
|
||||||
|
|
||||||
|
|
||||||
PORT = int(os.getenv('PORT', 8080))
|
PORT = int(os.getenv('PORT', 8080))
|
||||||
|
|
||||||
|
@ -201,7 +199,7 @@ def cgi_get(environ, start_response):
|
||||||
req = crawler.adv_get(url=url, timeout=TIMEOUT)
|
req = crawler.adv_get(url=url, timeout=TIMEOUT)
|
||||||
|
|
||||||
if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
|
if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
|
||||||
if options.get == 'page':
|
if options['get'] == 'page':
|
||||||
html = readabilite.parse(req['data'], encoding=req['encoding'])
|
html = readabilite.parse(req['data'], encoding=req['encoding'])
|
||||||
html.make_links_absolute(req['url'])
|
html.make_links_absolute(req['url'])
|
||||||
|
|
||||||
|
@ -213,7 +211,7 @@ def cgi_get(environ, start_response):
|
||||||
|
|
||||||
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
|
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
|
||||||
|
|
||||||
elif options.get == 'article':
|
elif options['get'] == 'article':
|
||||||
output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
|
output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
|
||||||
|
|
||||||
else:
|
else:
|
||||||
|
@ -289,7 +287,7 @@ class WSGIRequestHandlerRequestUri(wsgiref.simple_server.WSGIRequestHandler):
|
||||||
|
|
||||||
|
|
||||||
def cgi_start_server():
|
def cgi_start_server():
|
||||||
crawler.default_cache.autotrim()
|
caching.default_cache.autotrim()
|
||||||
|
|
||||||
print('Serving http://localhost:%s/' % PORT)
|
print('Serving http://localhost:%s/' % PORT)
|
||||||
httpd = wsgiref.simple_server.make_server('', PORT, application, handler_class=WSGIRequestHandlerRequestUri)
|
httpd = wsgiref.simple_server.make_server('', PORT, application, handler_class=WSGIRequestHandlerRequestUri)
|
||||||
|
@ -297,4 +295,4 @@ def cgi_start_server():
|
||||||
|
|
||||||
|
|
||||||
if 'gunicorn' in os.getenv('SERVER_SOFTWARE', ''):
|
if 'gunicorn' in os.getenv('SERVER_SOFTWARE', ''):
|
||||||
crawler.default_cache.autotrim()
|
caching.default_cache.autotrim()
|
||||||
|
|
setup.py (6 changed lines)

@@ -1,6 +1,7 @@
-from setuptools import setup
 from glob import glob

+from setuptools import setup
+
 package_name = 'morss'

 setup(
|
@ -12,7 +13,8 @@ setup(
|
||||||
download_url = 'https://git.pictuga.com/pictuga/morss',
|
download_url = 'https://git.pictuga.com/pictuga/morss',
|
||||||
license = 'AGPL v3',
|
license = 'AGPL v3',
|
||||||
packages = [package_name],
|
packages = [package_name],
|
||||||
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
|
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet'],
|
||||||
|
extras_require = {'full': ['pymysql', 'redis']},
|
||||||
package_data = {package_name: ['feedify.ini']},
|
package_data = {package_name: ['feedify.ini']},
|
||||||
data_files = [
|
data_files = [
|
||||||
('share/' + package_name, ['README.md', 'LICENSE']),
|
('share/' + package_name, ['README.md', 'LICENSE']),
|
||||||
|
|