pictuga
|
438c32a312
|
Remove sqlite & mysql cache backends
Obsoleted since the introduction of diskcache & redis
|
2 months ago |
pictuga
|
e1ed33f320
|
crawler: improve html iter code
Ignores tags without attributes. Avoids bug with unclosed tags.
|
12 months ago |
pictuga
|
b65272daab
|
crawler: accept more meta redirects
|
1 year ago |
pictuga
|
4d64afe9cb
|
crawler: fix regression from d6b90448f3
|
1 year ago |
pictuga
|
32645548c2
|
pytest: first batch with test_feeds
And multiple related fixes
|
1 year ago |
pictuga
|
d6b90448f3
|
crawler: improve handling of non-ascii urls
|
1 year ago |
pictuga
|
da81edc651
|
log to stderr
|
1 year ago |
pictuga
|
4f2895f931
|
cli: update `--help`
|
1 year ago |
pictuga
|
b2b04691d6
|
Ability to pass custom data_files location
|
1 year ago |
pictuga
|
bfaf7b0fac
|
feeds: clean up default `item_link`
To be supported by feeds' `_rule_parse`
|
1 year ago |
pictuga
|
32d9bc9d9d
|
feeds: proceed with conversion when rules do not match
|
1 year ago |
pictuga
|
b138f11771
|
util: support more `data_files` locations
|
1 year ago |
pictuga
|
a01258700d
|
More ordering options
|
1 year ago |
pictuga
|
4d6d3c9239
|
wsgi: limit supported mimetypes & return actual mimetype
|
1 year ago |
pictuga
|
e81f6b173f
|
readabilite: remove duplicate code
|
1 year ago |
pictuga
|
fe5dbf1ce0
|
wsgi: reuse mimetype table from crawler
|
1 year ago |
pictuga
|
d05706e056
|
crawler: fix typo
|
1 year ago |
pictuga
|
e88a823ada
|
feeds: better handle rulesets without a 'mode' specified
|
1 year ago |
pictuga
|
750850c162
|
crawler: avoid too many .append()
|
1 year ago |
pictuga
|
c8669002e4
|
feeds: exotic xpath in html as well
|
1 year ago |
pictuga
|
c524e54d2d
|
feeds: support some exotic xpath rules returning a single string
|
1 year ago |
pictuga
|
fb643f5ef1
|
readabilite: remove unneeded reference to `features` (overridden by `builder`)
|
1 year ago |
pictuga
|
dbdca910d8
|
readabilite: fix new parser code & drop PIs
|
1 year ago |
pictuga
|
9eb19fac04
|
readabilite: use custom html parser within bs4's lxml parser
Solves the following obscure error:
ValueError: Invalid PI name 'b'xml''
|
1 year ago |
pictuga
|
d424e394d1
|
readabilite: use lxml bs4 parser for speed
|
1 year ago |
pictuga
|
3f92787b38
|
readabilite: limit html comments related issues
|
1 year ago |
pictuga
|
afc31eb6e9
|
readabilite: avoid double parsing of html
|
1 year ago |
pictuga
|
87d2fe772d
|
wsgi: fix py2 compatibility
|
1 year ago |
pictuga
|
917aa0fbc5
|
crawler: do not re-save cached response
Otherwise cache never gets invalidated!
|
1 year ago |
pictuga
|
d17b9a2f27
|
Fix typo in DISKCACHE_DIR var name
|
1 year ago |
pictuga
|
368e4683d6
|
util: clean paths code
|
1 year ago |
pictuga
|
7cdcbd23e1
|
wsgi: fix another typo
|
1 year ago |
pictuga
|
25f283da1f
|
wsgi: fix bug following the removal of the loop
|
1 year ago |
pictuga
|
727d14e539
|
wsgi: use data_files helper
|
1 year ago |
pictuga
|
3392ae3973
|
util: try one more path for data_files
|
1 year ago |
pictuga
|
51f1d330a4
|
Fn to access data_files & pkg files
|
1 year ago |
pictuga
|
eb47aac6f1
|
morss: respect timeout settings in all cases
Special treatment of feed fetch not justified and not documented
|
1 year ago |
pictuga
|
eca546b890
|
Change HTTP error code to 404
To tell them apart from 'true' 500 errors
|
1 year ago |
pictuga
|
d8cc07223e
|
readabilite: fix bug when nothing above threshold
|
1 year ago |
pictuga
|
765e0ba728
|
Pass py error msg in http headers
|
1 year ago |
pictuga
|
6ec3fb47d1
|
readabilite: .strip() first to save time
|
1 year ago |
pictuga
|
1083f3ffbc
|
crawler: make sure to use HTTPMessage
|
1 year ago |
pictuga
|
7eeb1d696c
|
crawler: clean up code
|
1 year ago |
pictuga
|
e42df98f83
|
crawler: fix regression brought with 44a6b2591
|
1 year ago |
pictuga
|
cb21871c35
|
crawler: clean up caching code
|
1 year ago |
pictuga
|
c71cf5d5ce
|
caching: fix diskcache implementation
|
1 year ago |
pictuga
|
44a6b2591d
|
crawler: cleaner http header object import
|
1 year ago |
pictuga
|
a890536601
|
morss: comment code a bit
|
1 year ago |
pictuga
|
8de309f2d4
|
caching: add diskcache backend
|
1 year ago |
pictuga
|
cbf7b3f77b
|
caching: simplify sqlite code
|
1 year ago |