pictuga
|
d6b90448f3
|
crawler: improve handling of non-ascii urls
|
2022-01-30 23:27:49 +01:00 |
pictuga
|
4d6d3c9239
|
wsgi: limit supported mimetypes & return actual mimetype
continuous-integration/drone/push Build is passing
|
2022-01-23 11:44:07 +01:00 |
pictuga
|
d05706e056
|
crawler: fix typo
continuous-integration/drone/push Build was killed
|
2022-01-19 13:41:12 +01:00 |
pictuga
|
750850c162
|
crawler: avoid too many .append()
|
2022-01-19 13:04:33 +01:00 |
pictuga
|
917aa0fbc5
|
crawler: do not re-save cached response
continuous-integration/drone/push Build is passing
Otherwise cache never gets invalidated!
|
2021-12-31 19:28:11 +01:00 |
pictuga
|
1083f3ffbc
|
crawler: make sure to use HTTPMessage
continuous-integration/drone/push Build is passing
|
2021-11-11 10:21:48 +01:00 |
pictuga
|
7eeb1d696c
|
crawler: clean up code
continuous-integration/drone/push Build is passing
|
2021-11-10 23:25:03 +01:00 |
pictuga
|
e42df98f83
|
crawler: fix regression brought with 44a6b2591
continuous-integration/drone/push Build is passing
|
2021-11-10 23:08:31 +01:00 |
pictuga
|
cb21871c35
|
crawler: clean up caching code
continuous-integration/drone/push Build is passing
|
2021-11-08 22:02:23 +01:00 |
pictuga
|
44a6b2591d
|
crawler: cleaner http header object import
|
2021-11-07 19:44:36 +01:00 |
pictuga
|
a523518ae8
|
cache: avoid name collision
|
2021-09-21 08:04:45 +02:00 |
pictuga
|
bb82902ad1
|
Move cache code to its own file
|
2021-09-21 08:04:45 +02:00 |
pictuga
|
04afa28fe7
|
crawler: cache pickle'd array
|
2021-09-21 08:04:45 +02:00 |
pictuga
|
75bb69f0fd
|
Make mysql optdep
|
2021-09-21 08:04:45 +02:00 |
pictuga
|
97d9dda547
|
crawler: support 308 redirects
|
2021-09-21 08:04:45 +02:00 |
pictuga
|
0b3e6d7749
|
Apply isort
|
2021-09-21 08:04:23 +02:00 |
pictuga
|
06e0ada95b
|
Allow POST requests
|
2021-09-08 20:43:21 +02:00 |
pictuga
|
8f24214915
|
crawler: better name for custom fns
|
2021-08-29 00:22:40 +02:00 |
pictuga
|
5582fbef31
|
crawler: comment
|
2021-08-29 00:18:50 +02:00 |
pictuga
|
6880a443e0
|
crawler: improve CacheHandler code
|
2021-03-25 23:54:08 +01:00 |
pictuga
|
7342ab26d2
|
crawler: comment on how urllib works
|
2021-03-25 23:49:58 +01:00 |
pictuga
|
981da9e66a
|
crawler: SQLITE_PATH point to .db file instead of folder
|
2021-03-25 23:48:21 +01:00 |
pictuga
|
3e886caaab
|
crawler: drop encoding setting
|
2020-10-30 22:41:16 +01:00 |
pictuga
|
ad927e03a7
|
crawler: use regex instead of lxml
Less reliable but should be faster
|
2020-10-30 22:21:19 +01:00 |
pictuga
|
0efb096fa7
|
crawler: shift gzip & encoding-fix to intermediary handler
|
2020-10-30 22:16:51 +01:00 |
pictuga
|
9ab2e488ef
|
crawler: add intermediary handlers
|
2020-10-30 22:15:35 +01:00 |
pictuga
|
b525ab0d26
|
crawler: fix typo
|
2020-10-30 22:12:43 +01:00 |
pictuga
|
bd0bca69fc
|
crawler: ignore ssl via env var
|
2020-10-03 19:57:08 +02:00 |
pictuga
|
8abd951d40
|
More sensible default values for cache autotrim (1k entries, 1min)
|
2020-10-03 19:55:57 +02:00 |
pictuga
|
056a1b143f
|
crawler: autotrim: make ctrl+c work
|
2020-10-01 00:04:36 +02:00 |
pictuga
|
eed949736a
|
crawler: add ability to limit cache size
|
2020-09-30 23:59:55 +02:00 |
pictuga
|
d9f46b23a6
|
crawler: default value for MYSQL_HOST (localhost)
|
2020-09-30 13:17:02 +02:00 |
pictuga
|
bbada0436a
|
Quick guide to ignore SSL certs
|
2020-09-27 16:48:22 +02:00 |
pictuga
|
0f33db248a
|
Add license info in each file
|
2020-08-26 20:08:22 +02:00 |
pictuga
|
bd0efb1529
|
crawler: missing os import
|
2020-08-23 18:45:44 +02:00 |
pictuga
|
4dfebe78f7
|
Pick caching backend via env vars
|
2020-08-23 18:43:18 +02:00 |
pictuga
|
64af86c11e
|
crawler: catch html parsing errors
|
2020-07-06 12:25:38 +02:00 |
pictuga
|
ce4cf01aa6
|
crawler: clean up encoding detection code
|
2020-05-27 21:35:24 +02:00 |
pictuga
|
dcfdb75a15
|
crawler: fix Chinese encoding support
|
2020-05-27 21:34:43 +02:00 |
pictuga
|
4ccc0dafcd
|
Basic help for sub-lib interactive use
|
2020-05-26 19:34:20 +02:00 |
pictuga
|
5dac4c69a1
|
crawler: more code comments
|
2020-05-12 20:44:25 +02:00 |
pictuga
|
36e2a1c3fd
|
crawler: increase size limit from 100KiB to 500
I'm looking at you, worldbankgroup.csod.com/ats/careersite/search.aspx
|
2020-05-12 19:34:16 +02:00 |
pictuga
|
f685139137
|
crawler: use UPSERT statements
Avoid potential race conditions
|
2020-05-03 21:27:45 +02:00 |
pictuga
|
271ac8f80f
|
crawler: comment code a bit
|
2020-05-02 19:18:01 +02:00 |
pictuga
|
64e41b807d
|
crawler: handle http:/ (single slash)
Fixing one more corner case! malayalam.oneindia.com
|
2020-05-02 19:17:15 +02:00 |
pictuga
|
c27c38f7c7
|
crawler: return dict instead of tuple
|
2020-04-28 22:29:07 +02:00 |
pictuga
|
749acc87fc
|
Centralize url clean up in crawler.py
|
2020-04-28 22:03:49 +02:00 |
pictuga
|
cb69e3167f
|
crawler: accept non-ascii urls
Covering one more corner case!
|
2020-04-28 14:47:23 +02:00 |
pictuga
|
818cdaaa9b
|
Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
|
2020-04-27 18:00:14 +02:00 |
pictuga
|
2806c64326
|
Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
|
2020-04-27 17:19:31 +02:00 |
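
Commit f685139137 above switches the cache writes to UPSERT statements to avoid potential race conditions. The log does not show the actual change, so the following is only a minimal sketch of the general technique, assuming an SQLite backend (SQLite 3.24 or later). The table name `cache`, its columns, and the `save` helper are hypothetical and are not taken from the project.

```python
import sqlite3


def save(con, url, data):
    """Store a cached response (hypothetical helper, not the project's code).

    A single UPSERT inserts the row, or overwrites it when the url is
    already present. This avoids a separate "SELECT, then INSERT or UPDATE"
    sequence, during which another writer could slip in between the steps.
    """
    con.execute(
        "INSERT INTO cache (url, data) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET data = excluded.data",
        (url, data),
    )
    con.commit()


# Assumed schema: the url is the primary key, so conflicts are detected on it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cache (url TEXT PRIMARY KEY, data BLOB)")

save(con, "http://example.com/", b"first")
save(con, "http://example.com/", b"second")  # same url: updates in place

rows = con.execute("SELECT url, data FROM cache").fetchall()
```

After both calls the table holds one row for the url, carrying the most recent data. Other backends spell the same idea differently (MySQL, for instance, uses `INSERT ... ON DUPLICATE KEY UPDATE`).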