567 Commits

SHA1 Message Date
b138f11771 util: support more data_files location 2022-01-23 12:40:18 +01:00
a01258700d More ordering options 2022-01-23 12:27:07 +01:00
4d6d3c9239 wsgi: limit supported mimetypes & return actual mimetype 2022-01-23 11:44:07 +01:00
e81f6b173f readabilite: remove code duplicate 2022-01-23 11:41:32 +01:00
fe5dbf1ce0 wsgi: reuse mimetype table from crawler 2022-01-22 13:22:39 +01:00
d05706e056 crawler: fix typo 2022-01-19 13:41:12 +01:00
e88a823ada feeds: better handle rulesets without a 'mode' specified 2022-01-19 13:08:33 +01:00
750850c162 crawler: avoid too many .append() 2022-01-19 13:04:33 +01:00
c8669002e4 feeds: exotic xpath in html as well 2022-01-17 14:22:48 +00:00
c524e54d2d feeds: support some exotic xpath rules returning a single string 2022-01-17 13:59:58 +00:00
fb643f5ef1 readabilite: remove unneeded reference to features (overridden by builder) 2022-01-03 18:01:12 +00:00
dbdca910d8 readabilite: fix new parser code & drop PIs 2022-01-03 17:51:49 +00:00
9eb19fac04 readabilite: use custom html parser within bs4's lxml parser 2022-01-03 16:26:17 +00:00
    Solves the following obscure error:
    ValueError: Invalid PI name 'b'xml''
d424e394d1 readabilite: use lxml bs4 parser for speed 2022-01-01 14:52:48 +01:00
3f92787b38 readabilite: limit html comments related issues 2022-01-01 13:58:42 +01:00
afc31eb6e9 readabilite: avoid double parsing of html 2022-01-01 12:51:30 +01:00
87d2fe772d wsgi: fix py2 compatibility 2022-01-01 12:35:41 +01:00
917aa0fbc5 crawler: do not re-save cached response 2021-12-31 19:28:11 +01:00
    Otherwise cache never gets invalidated!
d17b9a2f27 Fix typo in DISKCACHE_DIR var name 2021-12-23 12:02:24 +01:00
368e4683d6 util: clean paths code 2021-12-16 08:53:18 +00:00
7cdcbd23e1 wsgi: fix another typo 2021-12-14 12:06:08 +00:00
25f283da1f wsgi: fix bug following the removal of the loop 2021-12-14 11:56:55 +00:00
727d14e539 wsgi: use data_files helper 2021-12-14 11:47:10 +00:00
3392ae3973 util: try one more path for data_files 2021-12-14 11:10:26 +00:00
51f1d330a4 Fn to access data_files & pkg files 2021-12-05 12:09:01 +01:00
eb47aac6f1 morss: respect timeout settings in all cases 2021-11-25 22:13:38 +01:00
    Special treatment of feed fetch not justified and not documented
eca546b890 Change HTTP error code to 404 2021-11-25 21:34:46 +01:00
    To tell them apart from 'true' 500 errors
d8cc07223e readabilite: fix bug when nothing above threshold 2021-11-23 20:53:00 +01:00
765e0ba728 Pass py error msg in http headers 2021-11-22 23:22:13 +01:00
6ec3fb47d1 readabilite: .strip() first to save time 2021-11-15 21:54:07 +01:00
1083f3ffbc crawler: make sure to use HTTPMessage 2021-11-11 10:21:48 +01:00
7eeb1d696c crawler: clean up code 2021-11-10 23:25:03 +01:00
e42df98f83 crawler: fix regression brought with 44a6b2591 2021-11-10 23:08:31 +01:00
cb21871c35 crawler: clean up caching code 2021-11-08 22:02:23 +01:00
c71cf5d5ce caching: fix diskcache implementation 2021-11-08 21:57:43 +01:00
44a6b2591d crawler: cleaner http header object import 2021-11-07 19:44:36 +01:00
a890536601 morss: comment code a bit 2021-11-07 18:26:07 +01:00
8de309f2d4 caching: add diskcache backend 2021-11-07 18:15:20 +01:00
cbf7b3f77b caching: simplify sqlite code 2021-11-07 18:14:18 +01:00
d023ec8d73 Change default port to 8000 2021-10-19 22:19:59 +02:00
5473b77416 Post-clean up isort 2021-09-21 08:11:04 +02:00
0365232a73 readabilite: custom xpath for article detection 2021-09-21 08:04:45 +02:00
a523518ae8 cache: avoid name collision 2021-09-21 08:04:45 +02:00
52c48b899f readability: better var names 2021-09-21 08:04:45 +02:00
9649cabb1b morss: do not crash on empty pages 2021-09-21 08:04:45 +02:00
10535a17c5 cache: fix isort 2021-09-21 08:04:45 +02:00
7d86972e58 Add Redis cache backend 2021-09-21 08:04:45 +02:00
5da7121a77 Fix Options class behaviour 2021-09-21 08:04:45 +02:00
bb82902ad1 Move cache code to its own file 2021-09-21 08:04:45 +02:00
04afa28fe7 crawler: cache pickle'd array 2021-09-21 08:04:45 +02:00